
Springer Actuarial

Arthur Charpentier

Insurance, Biases, Discrimination and Fairness
Springer Actuarial

Editors-in-Chief
Hansjoerg Albrecher, Department of Actuarial Science, University of Lausanne,
Lausanne, Switzerland
Michael Sherris, School of Risk & Actuarial, UNSW Australia, Sydney, NSW,
Australia

Series Editors
Daniel Bauer, Wisconsin School of Business, University of Wisconsin-Madison,
Madison, WI, USA
Stéphane Loisel, ISFA, Université Lyon 1, Lyon, France
Alexander J. McNeil, University of York, York, UK
Antoon Pelsser, Maastricht University, Maastricht, The Netherlands
Gordon Willmot, University of Waterloo, Waterloo, ON, Canada
Hailiang Yang, Department of Statistics & Actuarial Science, The University of
Hong Kong, Hong Kong, Hong Kong
This is a series on actuarial topics in a broad and interdisciplinary sense, aimed at
students, academics and practitioners in the fields of insurance and finance.
Springer Actuarial provides timely information on theoretical and practical aspects of
topics like risk management, internal models, solvency, asset-liability management,
market-consistent valuation, the actuarial control cycle, insurance and financial
mathematics, and other related interdisciplinary areas.
The series aims to serve as a primary scientific reference for education, research,
development and model validation.
The type of material considered for publication includes lecture notes, mono-
graphs and textbooks. All submissions will be peer-reviewed.
Arthur Charpentier

Insurance, Biases,
Discrimination and Fairness
Arthur Charpentier
Department of Mathematics
UQAM
Montreal, QC, Canada

ISSN 2523-3262 ISSN 2523-3270 (electronic)


Springer Actuarial
ISBN 978-3-031-49782-7 ISBN 978-3-031-49783-4 (eBook)
https://doi.org/10.1007/978-3-031-49783-4

This work was supported by Institut Louis Bachelier.

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Paper in this product is recyclable.


Preface

“14 litres d’encre de chine, 30 pinceaux, 62 crayons à mine grasse, 1 crayon à mine
dure, 27 gommes à effacer, 38 kilos de papier, 16 rubans de machine à écrire, 2
machines à écrire, 67 litres de bière ont été nécessaires à la réalisation de cette
aventure,” Goscinny and Uderzo (1965), Astérix et Cléopâtre.1
Discrimination is a complicated concept. The most neutral definition is, accord-
ing to Merriam-Webster (2022), simply “the act (or power) of distinguishing.”
Amnesty International (2023) adds that it is an “unjustified distinction.” And indeed,
most of the time, the word has a negative connotation, because discrimination
is associated with some prejudice. An alternative definition, still according to
Merriam-Webster (2022), is that discrimination is the “act of discriminating cate-
gorically rather than individually.” This corresponds to “statistical discrimination”
but also actuarial pricing. Because actuaries do discriminate. As Lippert-Rasmussen
(2017) clearly states, “insurance discrimination seems immune to some of the stan-
dard objections to discrimination.” Avraham (2017) goes further: “what is unique
about insurance is that even statistical discrimination which by definition is absent
of any malicious intentions, poses significant moral and legal challenges. Why?
Because on the one hand, policy makers would like insurers to treat their insureds
equally, without discriminating based on race, gender, age, or other characteristics,
even if it makes statistical sense to discriminate (...) On the other hand, at the core
of insurance business lies discrimination between risky and non-risky insureds. But
riskiness often statistically correlates with the same characteristics policy makers
would like to prohibit insurers from taking into account.” This is precisely the
purpose of this book, to dig further into those issues, to understand the seeming
oxymoron “fair discrimination” used in insurance, to weave together the multiple
perspectives that have been posed on discrimination in insurance, linking a legal
and a statistical view, an economic and an actuarial view, all in a context where
computer scientists have also recently brought an enlightened eye to the question

1 14 liters of ink, 30 brushes, 62 grease pencils, 1 hard pencil, 27 erasers, 38 kilos of paper, 16

typewriter ribbons, 2 typewriters, 67 liters of beer were necessary to realize this adventure.

of the fairness of predictive models. Dealing with discrimination in insurance is
probably an ill-defined, unsolvable problem, but it is important to understand why,
in the current context of “big data” (which yields proxy variables that can capture
information related to sensitive attributes) and “artificial intelligence” (or, more
precisely, “machine-learning” techniques, with opaque, less interpretable models).
This book attempts to address questions such as “is being color-blind or gender-
blind sufficient (or necessary) to ensure that a model is fair”? “how can we assess
if a model is fair if we cannot collect sensitive information”? “is it fair that part
of the health insurance premium paid by a man should be dedicated to covering
the risk of becoming pregnant”? “is it fair to ask a smoker to pay more for his or
her health insurance premium”? “is it fair to use a gender-neutral principle if we
end up asking a higher premium to women”? “is it fair to use a discrimination-free
model on biased data”? “is it fair to use in a pricing model a legitimate variable
if it correlates strongly with a sensitive one”? Those are obviously old questions,
raised when actuaries started to differentiate premiums. This book is aimed at being
systematic, to connect the dots between communities that are too often distinct,
bringing different perspectives on those important questions.
Before going further into the subject, a few words of thanks are in order. First,
I wanted to thank Jean-Michel Beacco and Ryadh Benlahrech, of the Institut Louis
Bachelier2 (ILB) in Paris: an earlier version of this book was published in the
Opinion and Debates series edited by the ILB. Second, I also wanted to thank
Hansjörg Albrecher, Stéphane Loisel, and Julien Trufin for their encouragement
in publishing this book, while we were all together enjoying a workshop in Luminy
(France) on machine learning for insurance in the summer of 2022. I want to thank
Caroline Hillairet and Christian-Yann Robert for giving me the opportunity to give
a doctoral course on topics presented in this book. I should thank Laurence Barry,
Michel Denuit, Jean-Michel Loubès, Marie-Pier Côté, as well as colleagues who
attended recent seminars, for all the stimulating discussions we had over the past 3
years on those topics. I am extremely grateful to Ewen Gallic, Agathe Fernandes-
Machado, Antoine Ly, Olivier Côté, Philipp Ratz, and François Hu who challenged
me on some parts of the original manuscript, and helped me to improve it (even
if, of course, the errors that remain are entirely my fault, and my responsibility).
I also wanted to thank the Chaire Thélem / ILB and the AXA Research Fund for
financially supporting some of my recent research in this area, and Philippe Trainar
and the SCOR Foundation for deciding that this was just the beginning, and that it
was important to support research on all these topics, over the next years.

2 The ILB (Institut Louis Bachelier) is a nonprofit organization created in 2008. Its activities are

aimed at engaging academic researchers, as well as public authorities and private companies in
research projects in economics and finance with a focus on four societal transitions: environmental,
digital, demographic, and financial. The ILB is, thus, fully involved in the design of research
programs and initiatives aimed at promoting sustainable development in economics and finance.

Finally, I wanted to apologize to my family, namely Hélène, Maël, Romane, and
Fleur, for all the time spent during evenings, weekends, and holidays working on
this book.

Montréal, QC, Canada Arthur Charpentier


July 2023
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 A Brief Overview on Discrimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Discrimination? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Legal Perspective on Discrimination . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Discrimination from a Philosophical Perspective . . . . . . . . . 5
1.1.4 From Discrimination to Fairness. . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.5 Economics Perspective on Efficient Discrimination . . . . . . 10
1.1.6 Algorithmic Injustice and Fairness
of Predictive Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1.7 Discrimination Mitigation and Affirmative Action . . . . . . . 15
1.2 From Words and Concepts to Mathematical Formalism . . . . . . . . . . . 15
1.2.1 Mathematical Formalism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.2 Legitimate Segmentation and Unfair Discrimination . . . . . 16
1.3 Structure of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.4 Datasets and Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

Part I Insurance and Predictive Modeling


2 Fundamentals of Actuarial Pricing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.1 Insurance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Premiums and Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3 Premium and Fair Technical Price. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.1 Case of a Homogeneous Population . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.2 The Fear of Moral Hazard and Adverse-Selection . . . . . . . . 32
2.3.3 Case of a Heterogeneous Population . . . . . . . . . . . . . . . . . . . . . . 34
2.4 Mortality Tables and Life Insurance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.1 Gender Heterogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.2 Health and Mortality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.3 Wealth and Mortality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.5 Modeling Uncertainty and Capturing Heterogeneity . . . . . . . . . . . . . . . 40
2.5.1 Groups of Predictive Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40


2.5.2 Probabilistic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41


2.5.3 Interpreting and Explaining Models . . . . . . . . . . . . . . . . . . . . . . . 47
2.6 From Technical to Commercial Premiums . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.6.1 Homogeneous Policyholders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.6.2 Heterogeneous Policyholders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.6.3 Price Optimization and Discrimination . . . . . . . . . . . . . . . . . . . . 52
2.7 Other Models in Insurance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.7.1 Claims Reserving and IBNR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.7.2 Fraud Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.7.3 Mortality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.7.4 Parametric Insurance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.7.5 Data and Models to Understand the Risks. . . . . . . . . . . . . . . . . 55
3 Models: Overview on Predictive Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.1 Predictive Model, Algorithms, and “Artificial Intelligence” . . . . . . . 59
3.1.1 Probabilities and Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.1.2 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.2 From Categorical to Continuous Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.2.1 Historical Perspective, Insurers as Clubs . . . . . . . . . . . . . . . . . . 65
3.2.2 “Modern Insurance” and Categorization . . . . . . . . . . . . . . . . . . 67
3.2.3 Mathematics of Rating Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.2.4 From Classes to Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.3 Supervised Models and “Individual” Pricing . . . . . . . . . . . . . . . . . . . . . . . 76
3.3.1 Machine-Learning Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.3.2 Generalized Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.3.3 Penalized Generalized Linear Models . . . . . . . . . . . . . . . . . . . . . 97
3.3.4 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.3.5 Trees and Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.3.6 Ensemble Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.3.7 Application on the toydata2 Dataset . . . . . . . . . . . . . . . . . . . 116
3.3.8 Application on the GermanCredit Dataset . . . . . . . . . . . . 116
3.4 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4 Models: Interpretability, Accuracy, and Calibration . . . . . . . . . . . . . . . . . . . 123
4.1 Interpretability and Explainability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.1.1 Variable Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.1.2 Ceteris Paribus Profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.1.3 Breakdowns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.1.4 Shapley Value and Shapley Contributions . . . . . . . . . . . . . . . . . 133
4.1.5 Partial Dependence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
4.1.6 Application on the GermanCredit Dataset . . . . . . . . . . . . 147
4.1.7 Application on the FrenchMotor Dataset . . . . . . . . . . . . . . 153
4.1.8 Counterfactual Explanation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
4.2 Accuracy of Actuarial Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
4.2.1 Accuracy and Scoring Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
4.3 Calibration of Predictive Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

4.3.1 From Accuracy to Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164


4.3.2 Lorenz and Concentration Curves . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.3.3 Calibration, Global, and Local Biases . . . . . . . . . . . . . . . . . . . . . 170

Part II Data
5 What Data?. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
5.1 Data (a Brief Introduction). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
5.2 Personal and Sensitive Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
5.2.1 Personal and Nonpersonal Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
5.2.2 Sensitive and Protected Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
5.2.3 Sensitive Inferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
5.2.4 Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
5.2.5 Right to be Forgotten . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
5.3 Internal and External Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
5.3.1 Internal Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
5.3.2 Connecting Internal and External Data . . . . . . . . . . . . . . . . . . . . 192
5.3.3 External Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
5.4 Typology of Ratemaking Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
5.4.1 Ratemaking Variables in Motor Insurance . . . . . . . . . . . . . . . . 199
5.4.2 Criteria for Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
5.4.3 An Actuarial Criterion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
5.4.4 An Operational Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
5.4.5 A Criterion of Social Acceptability . . . . . . . . . . . . . . . . . . . . . . . . 204
5.4.6 A Legal Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
5.5 Behaviors and Experience Rating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
5.6 Omitted Variable Bias and Simpson’s Paradox . . . . . . . . . . . . . . . . . . . . . 205
5.6.1 Omitted Variable in a Linear Model . . . . . . . . . . . . . . . . . . . . . . . 205
5.6.2 School Admission and Affirmative Action . . . . . . . . . . . . . . . . 207
5.6.3 Survival of the Sinking of the Titanic. . . . . . . . . . . . . . . . . . . . . . 208
5.6.4 Simpson’s Paradox in Insurance . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
5.6.5 Ecological Fallacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
5.7 Self-Selection, Feedback Bias, and Goodhart’s Law . . . . . . . . . . . . . . . 211
5.7.1 Goodhart’s Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
5.7.2 Other Biases and “Dark Data” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
6 Some Examples of Discrimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
6.1 Racial Discrimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
6.1.1 A Sensitive Variable Difficult to Define . . . . . . . . . . . . . . . . . . . 218
6.1.2 Race and Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
6.2 Sex and Gender Discrimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
6.2.1 Sex or Gender? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
6.2.2 Sex, Risk and Insurance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.2.3 The “Gender Directive” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
6.3 Age Discrimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
6.3.1 Young or Old? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230

6.4 Genetics versus Social Identity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234


6.4.1 Genetics-Related Discrimination . . . . . . . . . . . . . . . . . . . . . . . . . . 234
6.4.2 Social Identity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
6.4.3 “Lookism,” Obesity, and Discrimination . . . . . . . . . . . . . . . . . . 237
6.5 Statistical Discrimination by Proxy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
6.5.1 Stereotypes and Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
6.5.2 Generalization and Actuarial Science . . . . . . . . . . . . . . . . . . . . . 241
6.5.3 Massive Data and Proxy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
6.6 Names, Text, and Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
6.6.1 Last Name and Origin or Gender . . . . . . . . . . . . . . . . . . . . . . . . . . 246
6.6.2 First Name and Age or Gender . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
6.6.3 Text and Biases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
6.6.4 Language and Voice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
6.7 Pictures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
6.7.1 Pictures and Facial Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
6.7.2 Pictures of Houses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
6.8 Spatial Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
6.8.1 Redlining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
6.8.2 Geography and Wealth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
6.9 Credit Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
6.9.1 Credit Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
6.9.2 Discrimination Against the Poor . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
6.10 Networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
6.10.1 On the Use of Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
6.10.2 Mathematics of Networks, and Paradoxes. . . . . . . . . . . . . . . . . 270
7 Observations or Experiments: Data in Insurance . . . . . . . . . . . . . . . . . . . . . . 275
7.1 Correlation and Causation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
7.1.1 Correlation is (Probably) Not Causation . . . . . . . . . . . . . . . . . . 276
7.1.2 Causality in a Dynamic Context . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
7.2 Rung 1, Association (Seeing, “what if I see...”) . . . . . . . . . . . . . . . . . . . . 280
7.2.1 Independence and Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
7.2.2 Dependence with Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
7.3 Rung 2, Intervention (Doing, “what if I do...”) . . . . . . . . . . . . . . . . . . . . . 290
7.3.1 The do() Operator and Computing Causal Effects . . . . . . . . 290
7.3.2 Structural Causal Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
7.4 Rung 3, Counterfactuals (Imagining, “what if I had done...”) . . . . . 296
7.4.1 Counterfactuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
7.4.2 Weights and Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . 301
7.5 Causal Techniques in Insurance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304

Part III Fairness


8 Group Fairness. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
8.1 Fairness Through Unawareness. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
8.2 Independence and Demographic Parity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315

8.3 Separation and Equalized Odds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324


8.4 Sufficiency and Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
8.5 Comparisons and Impossibility Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . 339
8.6 Relaxation and Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
8.7 Using Decomposition and Regressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
8.8 Application on the GermanCredit Dataset . . . . . . . . . . . . . . . . . . . . . . 351
8.9 Application on the FrenchMotor Dataset . . . . . . . . . . . . . . . . . . . . . . . . 351
9 Individual Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
9.1 Similarity Between Individuals (and Lipschitz Property) . . . . . . . . . . 358
9.2 Fairness with Causal Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
9.3 Counterfactuals and Optimal Transport . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
9.3.1 Quantile-Based Transport . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
9.3.2 Optimal Transport (Discrete Setting) . . . . . . . . . . . . . . . . . . . . . . 363
9.3.3 Optimal Transport (General Setting) . . . . . . . . . . . . . . . . . . . . . . 366
9.3.4 Optimal Transport Between Gaussian Distributions . . . . . . 368
9.3.5 Transport and Causal Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
9.4 Mutatis Mutandis Counterfactual Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . 369
9.5 Application on the toydata2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
9.6 Application to the GermanCredit Dataset . . . . . . . . . . . . . . . . . . . . . . . 373

Part IV Mitigation
10 Pre-processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
10.1 Removing Sensitive Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
10.2 Orthogonalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
10.2.1 General Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
10.2.2 Binary Sensitive Attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
10.3 Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
10.4 Application to toydata2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
10.5 Application to the GermanCredit Dataset . . . . . . . . . . . . . . . . . . . . . . . 393
11 In-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
11.1 Adding a Group Discrimination Penalty . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
11.2 Adding an Individual Discrimination Penalty . . . . . . . . . . . . . . . . . . . . . . 400
11.3 Application on toydata2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
11.3.1 Demographic Parity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
11.3.2 Equalized Odds and Class Balance . . . . . . . . . . . . . . . . . . . . . . . . 403
11.4 Application to the GermanCredit Dataset . . . . . . . . . . . . . . . . . . . . . . . 407
11.4.1 Demographic Parity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
11.4.2 Equalized Odds and Class Balance . . . . . . . . . . . . . . . . . . . . . . . . 410
12 Post-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
12.1 Post-Processing for Binary Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
12.2 Weighted Averages of Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
12.3 Average and Barycenters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
12.4 Application to toydata1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423

12.5 Application on FrenchMotor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426


12.6 Penalized Bagging. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
Mathematical Notation

Acronyms

ACA Affordable Care Act


CJEU Court of Justice of the European Union
CNIL (French) National Commission on Informatics and Liberty
DAG directed acyclic graph
EIOPA European Insurance and Occupational Pensions Authority
GAM Generalized Additive Models
GDPR General Data Protection Regulation
GLM Generalized Linear Models
HOLC Home Owners’ Loan Corporation
INED (French) Institute for Demographic Studies
PAID Prohibit Auto Insurance Discrimination

Mathematical Notations

X ⊥ Y          non-correlated variables, X ⊥ Y : Cor[X, Y] = 0
⊥G             d-separability in causal inference
X ⊥⊥ Y         independence, P[{X ∈ A} ∩ {Y ∈ B}] = P[X ∈ A] · P[Y ∈ B], ∀A, B
X ⊥⊥ Y | Z     conditional independence
X =ᴸ Y         equal in distribution (probabilistic statement)
Xn →ᴸ Y        convergence in distribution (probabilistic statement)
Xn ∼ᴸ Y        almost equal in distribution (statistical statement)
Ā              complement of a set A in Ω, Ā = Ω \ A
ȳ              empirical average from a sample {y1, · · · , yn}
‖x1 − x2‖      norm (classically the ℓ2-norm)
x⊥             transformation of a vector, orthogonal to s
A⊤             transpose of a matrix A
∝              proportionality sign
1              indicator, 1A(x) = 1(x ∈ A) = 1 if x ∈ A, 0 otherwise; or vector of ones, 1 = (1, 1, · · · , 1) ∈ Rⁿ, in linear algebra
I              identity matrix, square matrix with 1 entries on the diagonal, 0 elsewhere
{A, B}         values taken by a binary sensitive attribute s
A              adjacency matrix
aj(xj∗)        accumulated local function, for variable j at location xj∗
ACC            accuracy, (TP+TN)/(P+N)
argmax         arguments (of the maxima)
ATE            average treatment effect
AUC            area under the (ROC) curve
B(n, p)        binomial distribution, or Bernoulli B(p) for B(1, p)
C              ROC curve, t ↦ TPR ∘ FPR⁻¹(t)
C̄              convex hull of the ROC curve
CATE           conditional average treatment effect
Cor[X, Y], r   Pearson's correlation, r = Cor[X, Y] = Cov[X, Y] / √(Var[X] · Var[Y])
Cov[X, Y]      covariance, Cov[X, Y] = E[(X − E[X])(Y − E[Y])]
d(x1, x2)      distance between two points in X
D(p1, p2)      divergence
d              vector of degrees, in a network
D              training dataset, or Dn
Δj|S(x∗)       contribution of the j-th variable, at x∗, conditional on a subset of variables, S ⊂ {1, · · · , k} \ {j}
do(X = x)      do operator (for an intervention in causal inference)
E[X]           expected value (under probability P)
f              density associated with cumulative distribution function F
F              cumulative distribution function, with respect to probability P
F⁻¹            (generalized) quantile, F⁻¹(p) = inf{x ∈ R : p ≤ F(x)}
FN             false negative (from confusion matrix)
FNR            false-negative rate, also called miss rate
FP             false positive (from confusion matrix)
FPR            false-positive rate, also called fall-out
FPR(t)         function [0, 1] → [0, 1], FPR(t) = P[m(X) > t | Y = 0]
γ              Gini mean difference, E|Y − Y′| where Y, Y′ ∼ F are independent copies
γj^bd(x∗)      breakdown contribution of the j-th variable, for individual x∗
γj^shap(x∗)    Shapley contribution of the j-th variable, for individual x∗
G              Gini index
G              some graph, with nodes and edges
GUI            group fairness index
i              index for individuals (usually, i ∈ {1, · · · , n})
j              index for variables (usually, j ∈ {0, 1, · · · , k})
k              number of features used in the model
L              Lorenz curve
L              likelihood, L(θ; y)
ℓ(y1, y2)      loss function, as in ℓ(y, ŷ)
ℓ1             absolute deviation loss function, ℓ1(y, ŷ) = |ŷ − y|
ℓ2             quadratic loss function, ℓ2(y, ŷ) = (y − ŷ)²
ℓj(xj∗)        local dependence plot, for variable j at location xj∗
λ              tuning parameter associated with a penalty in an optimization problem (Lagrangian)
log            natural logarithm (with log(eˣ) = x, ∀x ∈ R)
m(z)           predictive model, m : Z → Y, possibly a score in [0, 1]
m̂(z)           fitted predictive model from data D (collection of (yi, zi)'s)
mt(z)          classifier based on model m(·) and threshold t ∈ (0, 1), mt : Z → {0, 1}, mt = 1(m > t)
mx∗,j(z)       ceteris paribus profile
M              set of possible predictive models
μ(x)           regression function, E[Y | X = x]
n              number of observations in the training sample
nA, nB         number of observations in the training sample, respectively with s = A and s = B
N              set of natural numbers, or non-negative integers (0 ∈ N)
N              normal (Gaussian) distribution, N(μ, σ²) or N(μ, Σ)
P              true probability measure
p              probability, p ∈ [0, 1]
pj(xj∗)        partial dependence plot, for variable j at location xj∗
P(λ)           Poisson distribution, P(λ), with average λ
PDP            partial dependence plot
PPV            precision or positive predictive value (from confusion matrix), TP/(TP+FP)
ΠZ             orthogonal projection matrix, ΠZ = Z(Z⊤Z)⁻¹Z⊤
Π(p, q)        set of multivariate distributions with “margins” p and q
Q              some probability measure
r(X, Y)        Pearson's correlation
r∗(X, Y)       maximal correlation
R              set of real numbers
Rᵈ             standard vector space of dimension d
R^(n0×n1)      set of real-valued n0 × n1 matrices
R(m)           risk of a model m, associated with loss ℓ
R̂n(m)          empirical risk of a model m, for a sample of size n
S, s           sensitive attribute
s              collection of sensitive attributes, in {0, 1}ⁿ
s, S           scoring rule, and expected scoring rule (in Sect. 4.2)
S              usually {A, B} in the book, or {0, 1}, so that s ∈ S
Sd             standard probability simplex (Sd ⊂ Rᵈ)
T              treatment variable in causal inference
T              transport / coupling mapping, X → X or Y → Y
T#             push-forward operator, P1(A) = T#P0(A) = P0(T⁻¹(A))
t              threshold, cut-off for a classifier, ŷ = 1(m(x) > t)
TN             true negative (from confusion matrix)
TNR            true-negative rate, also called specificity or selectivity, TN/(TN+FP)
TP             true positive (from confusion matrix)
TPR            true-positive rate, also called sensitivity or recall, TP/(TP+FN)
TPR(t)         function [0, 1] → [0, 1], TPR(t) = P[m(X) > t | Y = 1]
Θ, θ           latent unobservable risk factor (θ for multivariate latent factors) or unknown parameter in a parametric model
u              utility function
U              uniform distribution
U(a0, a1)      set of matrices {M ∈ R₊^(n0×n1) : M 1n1 = a0 and M⊤ 1n0 = a1}
Un0,n1         set of matrices U(1n0/n0, 1n1/n1)
V              some value function on a subset of indices, in {1, · · · , k}
Var[X]         variance, Var[X] = E[(X − E[X])²]
Σ              covariance matrix, Var[X] = E[(X − E[X])(X − E[X])⊤]
W              Wasserstein distance (W2 if no index is specified)
w, ω           weight (ω ≥ 0) or wealth (in the economic model)
Ω              theoretical sample space associated with a probabilistic space
Ω              weight matrix
x, xi          collection of explanatory variables for a single individual, in X ⊂ Rᵏ
xj             collection of observations, for variable j
X              subset of Rᵏ, so that x ∈ X = X1 × · · · × Xk
Y, y           variable of interest
YT←t           potential outcome of y if treatment T had taken value t
y              collection of observations, in Yⁿ
ŷ              prediction of the variable of interest
Y              subset of R, so that y ∈ Y but also ŷ, m(x) ∈ Y
Z, z           information, z = (x, s), including legitimate and protected features
z              collection of observations z = (x, s), in Z
Z              set X × S

The following conventions are used in the textbook:

x              value taken by a random variable, or a single number (lower case, italics)
X              random variable (capital, italics)
x              vector, or collection of numerical values (lower case, italics, bold)
X              random vector or matrix (capital, italics, bold)
X              set of values taken by X (calligraphic)
xj             value taken by the random variable corresponding to the j-th variable (from x)
xi             vector, collection of numerical values for individual i in the dataset
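
As a purely illustrative complement to the confusion-matrix quantities defined above (TP, FP, TN, FN, TPR, FPR, ACC, the ROC curve t ↦ (FPR(t), TPR(t)), and the AUC), the short Python sketch below, which is not taken from the book and uses made-up labels and scores, shows how a classifier 1(m(x) > t) generates these quantities as the threshold t varies.

```python
import numpy as np

def confusion_rates(y, score, t):
    """TPR, FPR and accuracy of the classifier y_hat = 1(score > t)."""
    y_hat = (score > t).astype(int)
    tp = np.sum((y_hat == 1) & (y == 1))
    fp = np.sum((y_hat == 1) & (y == 0))
    tn = np.sum((y_hat == 0) & (y == 0))
    fn = np.sum((y_hat == 0) & (y == 1))
    tpr = tp / (tp + fn)       # sensitivity, recall: TP / (TP + FN)
    fpr = fp / (fp + tn)       # fall-out: FP / (FP + TN)
    acc = (tp + tn) / len(y)   # accuracy: (TP + TN) / (P + N)
    return tpr, fpr, acc

# Made-up labels and scores (illustration only, not a dataset from the book).
rng = np.random.default_rng(42)
y = rng.binomial(1, 0.3, size=1000)
score = np.clip(rng.normal(0.35 + 0.3 * y, 0.2), 0, 1)

# Trace the ROC curve by varying the threshold t.
thresholds = np.linspace(0, 1, 101)
rates = np.array([confusion_rates(y, score, t) for t in thresholds])
tpr, fpr = rates[:, 0], rates[:, 1]

# AUC by the trapezoidal rule, integrating TPR against FPR.
order = np.argsort(fpr)
auc = np.sum(np.diff(fpr[order]) * (tpr[order][:-1] + tpr[order][1:]) / 2)
print(round(float(auc), 3))
```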
Chapter 1
Introduction

Abstract Although machine-learning algorithms have brought issues of
discrimination and fairness back to the forefront, these topics have been
the subject of an extensive body of literature over the past decades. But dealing with
discrimination in insurance is fundamentally an ill-defined, unsolvable problem.
Nevertheless, we try to connect the dots, to explain different perspectives, going
back to the legal, philosophical, and economic approaches to discrimination, before
discussing the so-called concept of “actuarial fairness.” We offer some definitions,
an overview of the book, as well as the datasets used in the illustrative examples
throughout the chapters.

1.1 A Brief Overview on Discrimination

1.1.1 Discrimination?

Definition 1.1 (Discrimination (Merriam-Webster 2022)) Discrimination is the
act, practice, or an instance of separating or distinguishing categorically rather than
individually.
In this book, we use this neutral definition of “discrimination.” Nevertheless,
Kroll et al. (2017) reminds us that the word “discrimination” carries a very different
meaning in statistics and computer science than it does in public policy. “Among
computer scientists, the word is a value-neutral synonym for differentiation or clas-
sification: a computer scientist might ask, for example, how well a facial recognition
algorithm successfully discriminates between human faces and inanimate objects.
But, for policymakers, “discrimination” is most often a term of art for invidious,
unacceptable distinctions among people-distinctions that either are, or reasonably
might be, morally or legally prohibited.” The word discrimination can then be used
both in a purely descriptive sense (in the sense of making distinctions, as in this
book), or in a normative manner, which implies that the differential treatment of
certain groups is morally wrong, as shown by Alexander (1992), or more recently
Loi and Christen (2021). To emphasize the second meaning, we can prefer the word
“prejudice”, which refers to an “unjustifiable negative attitude” (Dambrum et al.
(2003) and Al Ramiah et al. (2010)) or an “irrational attitude of hostility” (Merriam-
Webster 2022) toward a group and its individual members.
Definition 1.2 (Prejudice (Merriam-Webster 2022)) Prejudice is (1) precon-
ceived judgment or opinion, or an adverse opinion or leaning formed without just
grounds or before sufficient knowledge; (2) an instance of such judgment or opinion;
(3) an irrational attitude of hostility directed against an individual, a group, a race,
or their supposed characteristics.
The definition of “discrimination” given in Correll et al. (2010) can be related to
the latter one: “behaviour directed towards category members that is consequential
for their outcomes and that is directed towards them not because of any particular
deservingness or reciprocity, but simply because they happen to be members of that
category.” Here, the idea of “unjustified” difference is mentioned. But what if the
difference can somehow be justified? The notion of “merit” is key to the expression
and experience of discrimination (we discuss this in relation to ethics later). It is
not an objectively defined criterion, but one rooted in historical and current societal
norms and inequalities.
Avraham (2017) explained in one short paragraph the dilemma of considering
the problem of discrimination in insurance. “What is unique about insurance is
that even statistical discrimination which by definition is absent of any malicious
intentions, poses significant moral and legal challenges. Why? Because on the one
hand, policy makers would like insurers to treat their insureds equally, without
discriminating based on race, gender, age, or other characteristics, even if it makes
statistical sense to discriminate (...) On the other hand, at the core of insurance
business lies discrimination between risky and non-risky insureds. But riskiness
often statistically correlates with the same characteristics policy makers would like
to prohibit insurers from taking into account.” To illustrate this problem, and why
writing about discrimination and insurance could be complicated, let us consider
the example of “redlining”. Redlining has been an important issue (that we discuss
further in Sect. 6.1.2), for the credit and the insurance industries, in the USA, that
started in the 1930s. In 1935, the Federal Home Loan Bank Board (FHLBB) looked
at more than 200 cities and created “residential security maps” to indicate the level
of security for real-estate investments in each surveyed city. On the maps (see
Fig. 1.1 with a collection of fictitious maps), the newest areas—those considered
desirable for lending purposes—were outlined in green and known as “Type A”.
“Type D” neighborhoods were outlined in red and were considered the most risky
for mortgage support (on the left of Fig. 1.1). Those areas were indeed those with a
high proportion of dilapidated (or dis-repaired) buildings (as we can observe on the
right of Fig. 1.1). This corresponds to the first definition of “redline.”
Definition 1.3 (Redline (Merriam-Webster 2022)) To redline is (1) to withhold
home-loan funds or insurance from neighborhoods considered poor economic risks;
(2) to discriminate against in housing or insurance.

Fig. 1.1 Map (freely) inspired by a Home Owners’ Loan Corporation map from 1937, where red
is used to identify neighborhoods where investment and lending were discouraged, on the left-
hand side (see Crossney 2016 and Rhynhart 2020). In the middle, some risk-related variable (a
fictitious “unsanitary index”) per neighborhood of the city is presented, and on the right-hand side,
a sensitive variable (the proportion of Black people in the neighborhood). Those maps are fictitious
(see Charpentier et al. 2023b)

In the 1970s, when looking at census data, sociologists noticed that red areas,
where insurers did not want to offer coverage, were also those with a high
proportion of Black people, and following the work of John McKnight and Andrew
Gordon, “redlining” received more interest. On the map in the middle, we can
observe information about the proportion of Black people. Thus, on the one hand,
it could be seen as “legitimate” to have a premium for households that could
reflect somehow the general conditions of houses. On the other hand, it would
be discriminatory to have a premium that is a function of the ethnic origin of the
policyholder. Here, the neighborhood, the “unsanitary index,” and the proportion
of Black people are strongly correlated variables. Of course, there could be non-
Black people living in dilapidated houses outside of the red area, Black people living
in wealthy houses inside the red area, etc. If we work using aggregated data, it is
difficult to disentangle information about sanitary conditions and racial information,
to distinguish “legitimate” and “nonlegitimate” discrimination, as discussed in
Hellman (2011). Note that, within the context of “redlining,” the use of census and
aggregated data may lead to an “ecological fallacy” (as discussed in King et al. (2004)
or Gelman (2009)). In the 2020s, we now have much more information (the so-called
“big data” era) and more complex models (from the machine-learning literature), and
we will see how to disentangle this complex problem, even if dealing with discrimination
in insurance is probably still an ill-defined, unsolvable problem, with strong
identification issues. Nevertheless, as we
will see, there are many ways of looking at this problem, and we try, here, to connect
the dots, to explain different perspectives.
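
The identification problem sketched here, a “legitimate” variable (the condition of the dwelling) entangled with a protected one through the neighborhood, can be illustrated with a few lines of simulation. The Python sketch below is not taken from the book, and every number in it (neighborhood sizes, group proportions, loss amounts) is invented for illustration: a tariff that only uses the neighborhood, and never sees the sensitive attribute, still charges the protected group more on average, which is exactly the proxy (indirect) discrimination discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Fictitious city with two neighborhoods: in the "red" one, 70% of residents
# belong to the protected group; in the "green" one, only 10% (made-up figures).
red = rng.random(n) < 0.4
protected = rng.random(n) < np.where(red, 0.70, 0.10)

# The true expected loss only depends on a (legitimate) "unsanitary index" of
# the dwelling, which is worse on average in the red neighborhood -- it does
# not depend on group membership itself.
unsanitary = rng.normal(np.where(red, 1.0, 0.0), 1.0)
expected_loss = 100 + 40 * unsanitary

# A "group-blind" tariff that only uses the neighborhood (the average loss
# observed in each area).
premium = np.where(red, expected_loss[red].mean(), expected_loss[~red].mean())

# Average premium per group: the protected group pays more, although the
# sensitive attribute never entered the tariff (neighborhood acts as a proxy).
print("protected group:", round(float(premium[protected].mean()), 2))
print("other group:   ", round(float(premium[~protected].mean()), 2))
```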

1.1.2 Legal Perspective on Discrimination

In Kansas, more than 100 years ago, a law was passed, allowing an insurance
commissioner to review rates to ensure that they were not “excessive, inadequate, or
unfairly discriminatory with regards to individuals,” as mentioned in Powell (2020).
Since then, the idea of “unfairly discriminatory” insurance rates has been discussed
in many States in the USA (see Box 1.1).

Box 1.1 “Unfairly discriminatory” insurance rates, according to US legislation
Arkansas law (23/3/67/2/23-67-208), 1987 “A rate is not unfairly discrim-
inatory in relation to another in the same class of business if it reflects
equitably the differences in expected losses and expenses. Rates are not
unfairly discriminatory because different premiums result for policyholders
with like loss exposures but different expense factors, or with like expense
factors but different loss exposures, if the rates reflect the differences with
reasonable accuracy (...) A rate shall be deemed unfairly discriminatory as
to a risk or group of risks if the application of premium discounts, credits,
or surcharges among the risks does not bear a reasonable relationship to the
expected loss and expense experience among the various risks.”
Maine Insurance Code (24-A, 2303), 1969 “Risks may be grouped by classi-
fications for the establishment of rates and minimum premiums. Classification
rates may be modified to produce rates for individual risks in accordance with
rating plans that establish standards for measuring variations in hazards or
expense provisions, or both. These standards may measure any differences
among risks that may have a probable effect upon losses or expenses.
No risk classification may be based upon race, creed, national origin or
the religion of the insured (...) Nothing in this section shall be taken to
prohibit as unreasonable or unfairly discriminatory the establishment of
classifications or modifications of classifications or risks based upon size,
expense, management, individual experience, purpose of insurance, location
or dispersion of hazard, or any other reasonable considerations, provided
such classifications and modifications apply to all risks under the same or
substantially similar circumstances or conditions.”

Unfortunately, as recalled in Vandenhole (2005), there is “no universally
accepted definition of discrimination,” and most legal documents usually provide
(non-exhaustive) lists of the grounds on which discrimination is to be prohibited.
For example, in the International Covenant on Civil and Political Rights, “the law
shall prohibit any discrimination and guarantee to all persons equal and effective
protection against discrimination on any ground such as race, color, sex, language,
religion, political or other opinion, national or social origin, property, birth or
other status” (see Joseph and Castan 2013). Such lists do not really address the
question of what discrimination is. But looking for common features among those
variables can be used to explain what discrimination is. For instance, discrimination
is necessarily oriented toward some people based on their membership of a
certain type of social group, with reference to a comparison group. Hence, our
discourse should not center around the absolute assessment of how effectively an
individual within a specific group is treated but rather on the comparison of the
treatment that an individual receives relative to someone who could be perceived
as “similar” within the reference group. Furthermore, the significance of this
reference group is paramount, as discrimination does not merely entail disparate
treatment, it necessitates the presence of a favored group and a disfavored group,
thus characterizing a fundamentally asymmetrical dynamic. As Altman (2011)
wrote, “as a reasonable first approximation, we can say that discrimination consists
of acts, practices, or policies that impose a relative disadvantage on persons based
on their membership in a salient social group.”

1.1.3 Discrimination from a Philosophical Perspective

As mentioned already, we should not expect to have universal rules about discrim-
ination. For instance, Supreme Court Justice Thurgood Marshall claimed once that
“a sign that says ‘men only’ looks very different on a bathroom door than on a
courthouse door,” as reported in Hellman (2011). Nevertheless, philosophers have
suggested definitions, starting with a distinction between “direct” and “indirect”
discrimination. As mentioned in Lippert-Rasmussen (2014), it would be too simple
to consider direct discrimination as intentional discrimination. A classic example
would be a paternalistic employer who intends to help women by hiring them
only for certain jobs, or for a promotion, as discussed in Jost et al. (2009). In that
case, acts of direct discrimination can be unconscious in the sense that agents are
unaware of the discriminatory motive behind decisions (related to the “implicit bias”
discussed in Brownstein and Saul (2016a,b)). Indirect discrimination corresponds to
decisions with disproportionate effects, that might be seen as discriminatory even if
that is not the objective of the decision process mechanism. A standard example
could be the one where the only way to enter a public building is by a set of
stairs, which could be seen as discrimination against people with disabilities who
use wheelchairs, as they would be unable to enter the building; or if there were a
minimum height requirement for a job where height is not relevant, which could
be seen as discrimination against women, as they are generally shorter than men.
On the one hand, for Young (1990), Cavanagh (2002), or Eidelson (2015), indirect
discrimination should not be considered discrimination, which should be strictly
limited to “intentional and explicitly formulated policies of exclusion or preference.”
For Cavanagh (2002), in many cases, “it is not discrimination they object to, but its
effects; and these effects can equally be brought about by other causes.” On the other
hand, Rawls (1971) considered structural indirect discrimination, that is, when the
rules and norms of society consistently produce disproportionately disadvantageous
outcomes for the members of a certain group, relative to the other groups in society.
Even if it is not intentional, it should be considered discriminatory.
Let us get back to the moral grounds, to examine why discrimination is
considered wrong. According to Kahlenberg Richard (1996), racial discrimination
should be considered “unfair” because it is associated with an immutable trait.
Unfortunately, Boxill (1992) recalls that with such a definition, it would also
be unfair to deny blind people a driver’s license. And religion challenges most
definitions, as it is neither an immutable trait nor a form of disability. Another
approach is to claim that discrimination is wrong because it treats persons on
the basis of inaccurate generalizations and stereotypes, as suggested by Schauer
(2006). For Kekes (1995), treating a person a certain way only because she is
a member of a certain social group is inherently unfair, as stereotyping treats
people unequally “without rational justification.” Thus, according to Flew (1993),
racism is unfair because it treats individuals on the basis of traits that “are strictly
superficial and properly irrelevant to all, or almost all, questions of social status
and employability.” In other words, discrimination is perceived as wrong because
it fails to treat individuals based on their merits. But in that case, as Cavanagh
(2002) observed, “hiring on merit has more to do with efficiency than fairness,”
which we will discuss further in the next section, on the economic foundations
of discrimination. Finally, Lippert-Rasmussen (2006) and Arneson (1999, 2013)
suggested looking at discrimination based on some consequentialist moral theory. In
this approach, discrimination is wrong because it violates a rule that would be part
of the social morality that maximizes overall moral value. Arneson (2013) writes
that this view “can possibly defend nondiscrimination and equal opportunity norms
as part of the best consequentialist public morality.”
A classical philosophical notion close to the idea of “nondiscrimination” is
the concept of “equality of opportunity” (EOP). For Roemer and Trannoy (2016)
“equality of opportunity” is a political ideal that is opposed to assigned-at-birth
(caste) hierarchy, but not to hierarchy itself. To illustrate this point, consider the
extreme case of caste hierarchy, where children acquire the social status of their
parents. In contrast, “equality of opportunity” demands that the social hierarchy
is determined by a form of equal competition among all members of the society.
Rawls (1971) uses “equality of opportunity” to address the discrimination problem:
everyone should be given a fair chance at success in a competition. This is also
called “substantive equality of opportunity,” and it is often implemented through
metrics such as statistical parity and equalized odds, which assume that talent and
motivation are equally distributed among sub-populations. This concept can be
distinguished from the luck-egalitarian (or “radical”) equality of opportunity defended in Segall
(2013), where a person’s outcome should be affected only by their choices, not their
circumstances.
1.1.4 From Discrimination to Fairness

Humans have an innate sense of fairness and justice, with studies showing that even
3-year-old children have demonstrated the ability to consider merit when sharing
rewards, as shown by Kanngiesser and Warneken (2012), as well as chimpanzees
and primates (Brosnan 2006), and many other animal species. And given that this
trait is largely innate, it is difficult to define what is “fair,” although many scientists
have attempted to define notions of “fair” sharing, as Brams et al. (1996) recalls.
In a first sense, “fair” refers to legality (and to human justice, translated into a set of laws and regulations); in a second sense, “fair” refers to an ethical or moral concept (and to an idea of natural justice). The second reading of the word
“fairness” is the most important here. According to one dictionary, fairness “consists
in attributing to each person what is due to him by reference to the principles of
natural justice.” And being “just” raises questions related to ethics and morality
(we do not differentiate here between ethics and morality).
This has to be related to a concept introduced in Feinberg (1970), called “desert
theory,” corresponding to the moral obligation that good actions must lead to better
results. A student deserves a good grade by virtue of having written a good paper; the victim of an industrial accident deserves substantial compensation owing to the negligence of his or her employer. For Leibniz or Kant, a person is supposed to deserve happiness by virtue of being morally good. In Feinberg
(1970)’s approach, “deserts” are often seen as positive, but they are also sometimes
negative, like fines, dishonor, sanctions, condemnations, etc. (see Feldman (1995),
Arneson (2007) or Haas (2013)). The concept of “desert” generally consists of a
relationship among three elements: an agent, a deserved treatment or good, and the
basis on which the agent is deserving.
We evoke in this book the “ethics of models,” or, as coined by Mittelstadt et al.
(2016) or Tsamados et al. (2021), the “ethics of algorithms.” A nuance exists with
respect to the “ethics of artificial intelligence,” which deals with our behavior or
choices (as human beings) in relation to autonomous cars, for example, and which
will attempt to answer questions such as “should a technology be adopted if it
is more efficient?” The ethics of algorithms questions the choices made “by the
machine” (even if they often reflect choices—or objectives—imposed by the person
who programmed the algorithm), or by humans, when choices can be guided by
some algorithm.
Programming an algorithm in an ethical way must be done according to a certain
number of standards. Two types of norms are generally considered by philosophers.
The first is related to conventions, i.e., the rules of the game (chess or Go), or the
rules of the road (for autonomous cars). The second is made up of moral norms,
which must be respected by everyone, and are aimed at the general interest. These
norms must be universal, and therefore not favor any individual, or any group of
individuals. This universality is fundamental for Singer (2011), who asks us not to judge a situation from our own perspective, or from that of a group to which we belong, but to take a “neutral” and “fair” point of view.
As discussed previously, the ethical analyses of discrimination are related to the
concept of “equality of opportunity,” which holds that the social status of individuals
depends solely on the service that they can provide to society. As the second
sentence of Article 1 of the 1789 Declaration of the Human Rights states, “les
distinctions sociales ne peuvent être fondées que sur l’utilité commune” (translated
as1 “social distinctions may be founded only upon the general good”) or as Rawls
(1971) points out, “offhand it is not clear what is meant, but we might say that those
with similar abilities and skills should have similar life chances. More specifically,
assuming that there is a distribution of natural assets, those who are at the same
level of talent and ability, and have the same willingness to use them, should
have the same prospects of success regardless of their initial place in the social
system, that is, irrespective of the income class into which they are born.” In the
deontological approach, inspired by Immanuel Kant, one forgets the utilities of each
person, and simply imposes norms and duties. Here, regardless of the consequences
(for the community as a whole), some things are not to be done. A distinction is
typically made between egalitarian and proportionalist approaches. To go further,
Roemer (1996, 1998) proposes a philosophical approach, whereas Fleurbaey and
Maniquet (1996) and Moulin (2004) consider an economic vision. And in a more
computational context, Leben (2020) goes back to normative principles to assess the
fairness of a model.
All ethics courses feature thought experiments, such as the popular “streetcar
dilemma.” In the original problem, stated in Foot (1967), a tram with no brakes is
about to run over five people, and one of them has the opportunity to flip a switch
that will cause the tram to swerve, but kill someone. What do we do? Or what
should we do? Thomson (1976) suggested a different version, with a footbridge,
where you can push a heavier person, who will crash into the track and die, but stop
the tram. The latter version is often more disturbing because the action is indirect,
and you start by murdering someone in order to save someone else. Some authors
have used this thought experiment to distinguish between explanation (on scientific
grounds, and based on causal arguments) and justification (based on moral precepts).
This tramway experiment has been taken up in a moral psychology experiment, called the Moral Machine project.2 In this “game,” one was virtually behind the wheel of a car, and choices were proposed: “Do you run over one person or five people?”, “Do you run over an elderly person or a child?”, “Do you run over a man or a woman?” Bonnefon (2019) revisits the experiment and the series of moral dilemmas, for which more than 40 million answers were obtained, from 130 countries. Naturally, the number of victims was an important feature (we prefer to kill fewer people), but age was also very important (priority given to young people), and legal arguments seemed to emerge (we prefer to kill pedestrians who cross

1 See https://avalon.law.yale.edu/18th_century/rightsof.asp.
2 See https://www.moralmachine.net/.
outside the dedicated crossings). These questions are important for self-driving cars,
as mentioned by Thornton et al. (2016).
For a philosopher, the question “How fair is this model to this group?” will
always be followed by “How fair by what normative principle?” Measuring the
overall effects on all those affected by the model (and not just the rights of a few) will
lead to incorporating measures of fairness into an overall calculation of social costs
and benefits. If we choose one approach, others will suffer. But this is the nature
of moral choices, and the only responsible way to mitigate negative headlines is to
develop a coherent response to these dilemmas, rather than ignore them. To speak
of the ethics of models poses philosophical questions from which we cannot free
ourselves, because, as we have said, a model is aimed at representing reality, “what
is.” To fight against discrimination, or to invoke notions of fairness, is to talk about
“what should be.” We are once again faced with the famous opposition of Hume
(1739). It is a well-known property of statistical models, as well as of machine-
learning ones. As Chollet (2021) wrote: “Keep in mind that machine learning can
only be used to memorize patterns that are present in your training data. You can
only recognize what you’ve seen before. Using machine learning trained on past
data to predict the future is making the assumption that the future will behave
like the past.” For when we speak of “norm,” it is important not to confuse the descriptive and the normative, or in other words, statistics (which tells us how things are) and ethics (which tells us how things should be). Statistical law is about “what is” because it has been observed to be so (e.g., humans are bigger than dogs). Human (divine, or judicial) law pertains to what is because it has been decreed, and therefore ought to be (e.g., humans are free and equal, or humans are
good). One can see the “norm” as a regularity of cases, observed with the help of
frequencies (or averages, as mentioned in the next chapter), for example, on the
height of individuals, the length of sleep, in other words, data that make up the
description of individuals. Therefore, anthropometric data have made it possible
to define, for example, an average height of individuals in a given population,
according to their age; in relation to this average height, a deviation of 20% more
or less determines gigantism or dwarfism. If we think of road accidents, it may be
considered “abnormal” to have a road accident in a given year, at an individual
(micro) level, because the majority of drivers do not have an accident. However,
from the insurer’s perspective (macro), the norm is that 10% of drivers have an
accident. It would therefore be abnormal for no one to have an accident. This is
the argument found in Durkheim (1897). Rather than considering the singular act of suicide from the point of view of the individual who commits it, Durkheim tries to see it as a social act, therefore falling within a real norm, within a given society.
From then on, suicide becomes, according to Durkheim, a “normal” phenomenon.
Statistics then make it possible to quantify the tendency to commit suicide in a
given society, as soon as one no longer observes the irregularity that appears in the
singularity of an individual history, but a “social normality” of suicide. Abnormality
is defined as “contrary to the usual order of things” (this might be considered an
empirical, statistical notion), or “contrary to the right order of things” (this notion of
right probably implies a normative definition), but also not conforming to the model.
Defining a norm is not straightforward if we are only interested in the descriptive,
empirical aspect, as actuaries do when they develop a model, but when a dimension
of justice and ethics is also added, the complexity is bound to increase. We shall return in Chap. 4 to the (mathematical) properties that a “fair” or “equitable” model should satisfy. Because if we ask a model to verify criteria that are not necessarily observed in the data, it is necessary to integrate a specific constraint into the model-learning algorithm, with a penalty related to a fairness measure (just as we use a “model complexity measure” to avoid overfitting).
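Schematically, such a constrained learning problem can be written as a penalized objective. The notation below is generic and introduced here only for illustration (the loss ℓ, the class of models M, the unfairness measure U, and the weight λ ≥ 0 are not specified at this stage; the precise fairness measures and mitigation strategies are discussed later in the book):

$$
\widehat{m} \in \underset{m \in \mathcal{M}}{\operatorname{argmin}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, m(\boldsymbol{x}_i)\big) + \lambda \, \mathcal{U}(m) \right\},
$$

where the first term measures the goodness of fit on the training data, and the second term penalizes models that violate the chosen fairness criterion, just as a complexity penalty is used against overfitting.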

1.1.5 Economics Perspective on Efficient Discrimination

Where jurists used the term “rational discrimination,” economists used the term “efficient” or “statistical discrimination,” as in Phelps (1972) or Arrow (1973), following early work by Edgeworth (1922). Following Becker (1957), economists
have tended to define discrimination as a situation where people who are “the
same” (with respect to legitimate covariates) are treated differently. Hence, a
“discrimination” corresponds here to some “disparity,” but we will frequently
use the term “discrimination.” More precisely, it is necessary to distinguish two standards. The first standard is “disparate treatment”: “any economic agent who applies different rules to people in protected groups is practicing discrimination,” as defined in Yinger (1998). The second discriminatory
standard corresponds to “disparate impact.” This corresponds to practices that seem
to be neutral, but have the effect of disadvantaging one group more than others.
Definition 1.4 (Disparate Treatment (Merriam-Webster 2022)) Disparate treat-
ment corresponds to the treatment of an individual (as an employee or prospective
juror) that is less favorable than treatment of others for discriminatory reasons (such
as race, religion, national origin, sex, or disability).
Definition 1.5 (Disparate Impact (Merriam-Webster 2022)) Disparate impact
corresponds to an unnecessary discriminatory effect on a protected class caused by
a practice or policy (as in employment or housing) that appears to be nondiscrimi-
natory.
In labor economics, wages should be a function of productivity, which is
unobservable when signing a contract, and therefore, as discussed in Riley (1975),
Kohlleppel (1983) or Quinzii and Rochet (1985), employers try to find signals.
As claimed in Lippert-Rasmussen (2013), statistical discrimination occurs when
“there is statistical evidence which suggests that a certain group of people differs
from other groups in a certain dimension, and its members are being treated
disadvantageously on the basis of this information.” Those signals are observable
variables that correlate with productivity.
In the most common version of the model, employers use observable group
membership as a proxy for unobservable skills, and rely on their beliefs about pro-
ductivity correlates, in particular their estimates of average productivity differences
between groups, as in Phelps (1972), Arrow (1973), or Bielby and Baron (1986). A
variant of this theory is when there are no group differences in average productivity,
but rather based on the belief that the variance in productivity is larger for some
groups than for others, as in Aigner and Cain (1977) or Cornell and Welch (1996). In
these cases, risk-averse employers facing imperfect information may discriminate
against groups with larger expected variances in productivity. According to England
(1994), “statistical discrimination” might explain why there is still discrimination in
a competitive market. For Bertrand and Duflo (2017) “statistical discrimination”
is a “more disciplined explanation” than the taste-based model initiated by Becker
(1957), because the former “does not involve an ad hoc (even if intuitive) addition
to the utility function (animus toward certain groups) to help rationalize a puzzling
behavior.”
Here, “statistical discrimination,” rather than simply providing an explanation,
can lead people to see social stereotypes as useful and acceptable, and therefore
help to rationalize and justify discriminatory decisions. As suggested by Tilcsik
(2021), economists have theorized labor market discrimination, have constructed
mathematical models that attribute discrimination to the deliberate actions of profit-
maximizing firms or utility-maximizing individuals (as discussed in Charles and
Guryan (2011) or Small and Pager (2020)). And this view of discrimination
has influenced social science debates, legal decisions, corporate practices, and
public policy discussions, as mentioned in Ashenfelter and Oaxaca (1987), Dobbin
(2001), Chassonnery-Zaïgouche (2020), or Rivera (2020). The most influential
economic model of discrimination is probably the “statistical discrimination theory,”
discussed in the 1970s, with Phelps (1972), Arrow (1973), and Aigner and Cain
(1977). Applied to labor markets, this theory claims that employers have imperfect
information about the future productivity of job applicants, which leads them to use
easily observable signals, such as race or gender, to infer the expected productivity
of applicants, as explained in Correll and Benard (2006). Employers who practice
“statistical discrimination” rely on their beliefs about group statistics to evaluate
individuals (corresponding to “discrimination” as defined in Definition 1.1). In
this model, discrimination does not arise from a feeling of antipathy toward
members of a group, it is seen as a rational solution to an information problem.
Profit-maximizing employers use all the information available to them and, as
individual-specific information is limited, they use group membership as a “proxy.”
Economists tend to view “statistical discrimination” as “the optimal solution to
an information extraction problem” and sometimes describe it as “efficient” or
“fair,” as in Autor (2003), Norman (2003) and Bertrand and Duflo (2017). It should
be stressed here that this approach, initiated in the 1970s in the context of labor
economics, is essentially the same as the one underlying the concept of “actuarial
fairness.” Observe finally that the word “statistical” used here reinforces the image
of discrimination as a rational, calculated decision, even though several models do
not assume that employers’ beliefs about group differences are based on statistical
data, or any other type of systematic evidence. Employers’ beliefs might be based
on partial or idiosyncratic observations. As mentioned in Bohren et al. (2019) it is
possible to have “statistical discrimination with bad statistics” here.

1.1.6 Algorithmic Injustice and Fairness of Predictive Models

Although economists published extensively on discrimination in the job market


in the 1970s, the subject has come back into the spotlight following a number of
publications linked to predictive algorithms. Correctional Offender Management Profiling for Alternative Sanctions, or compas, is a tool widely used as a decision aid in the US courts to assess a criminal’s chance of re-offending, based on risk scales for general and violent recidivism, and for pretrial misconduct. After several months of investigation, Angwin et al. (2016) reported on the output of compas in a series of articles called “Machine Bias” (subtitled “Investigating Algorithmic Injustice”).
As pointed out by Feller et al. (2016), if we look at data from the compas dataset (from the fairness R package), in Fig. 1.2, on the one hand (on the left of the figure):
• for Black people, among those who did not re-offend, 42% were classified as
high risk
• for white people, among those who did not re-offend, 22% were classified as
high risk
With standard terminology in classifiers and decision theory, the false-positive rate is about twice as high for Black people (42% against 22%). As Larson et al.
(2016) wrote: “Black defendants were often predicted to be at a higher risk of

Fig. 1.2 Two analyses of the same descriptive statistics of compas data, with the number of defendants as a function of (1) the race of the defendant (Black and white), (2) the risk category obtained from a classifier (binary: low and high), and (3) the indicator of whether the defendant re-offended or not. On the left-hand side, the analysis of Dieterich et al. (2016) and, on the right-hand side, that of Feller et al. (2016)
recidivism than they actually were.” On the other hand (on the right-hand side of
the figure), as Dieterich et al. (2016) observed:
• For Black people, among those who were classified as high risk, 35% did not
re-offend.
• For white people, among those who were classified as high risk, 40% did not
re-offend.
Therefore, as the rate of recidivism is approximately equal at each risk score level,
irrespective of race, it should not be claimed that the algorithm is racist. The initial
approach is called “false positive rate parity,” whereas the second one is called
“predictive parity.” Obviously, there are reasonable arguments in favor of both
contradictory positions. From this simple example, we see that having a valid and
common definition of “fairness” or “parity” will be complicated.
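To make these two competing readings concrete, the following R sketch defines the two group-wise quantities as plain functions of three vectors; the function names and arguments are illustrative only and are not taken from the fairness package (whose compas data has its own variable names):

# false positive rate by group: among those who did NOT re-offend (reoffend == 0),
# the share classified as high risk (the quantity behind "false positive rate parity")
fpr_by_group <- function(reoffend, high_risk, group) {
  tapply(high_risk[reoffend == 0], group[reoffend == 0], mean)
}
# positive predictive value by group: among those classified as high risk,
# the share who actually re-offended (the quantity behind "predictive parity")
ppv_by_group <- function(reoffend, high_risk, group) {
  tapply(reoffend[high_risk == 1], group[high_risk == 1], mean)
}

With 0/1 vectors for the actual outcome and the classifier output, the first function corresponds to the 42% versus 22% comparison above, and one minus the second corresponds to the 35% versus 40% comparison.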
Since then, many books and articles have addressed the issues highlighted in this
article, namely the increasing power of these predictive decision-making tools, their
ever-increasing opacity, the discrimination they replicate (or amplify), the ‘biased’
data used to train or calibrate these algorithms, and the sense of unfairness they
produce. For instance, Kirkpatrick (2017) pointed out that “the algorithm itself may
not be biased, but the data used by predictive policing algorithms is colored by years
of biased police practices.”
And justice is not the only area where such techniques are used. In the context
of predictive health systems, Obermeyer et al. (2019) observed that a widely used health risk prediction tool (predicting how sick individuals are likely to be, and the associated health care cost), applied to roughly 200 million individuals in the USA per year, exhibited significant racial bias. More precisely, 17.7% of the patients that the algorithm assigned to receive “extra care” were Black, and if the bias in the system were corrected for, as reported by Ledford (2019), the percentage would increase to 46.5%. Those “correction” techniques will be discussed in Part IV of this book,
when presenting “mitigation.”
Massive data, and machine-learning techniques, have provided an opportunity
to revisit a topic that has been explored by lawyers, economists, philosophers, and
statisticians for the past 50 years or longer. The aim here is to revisit these ideas, to
shed new light on them, with a focus on insurance, and explore possible solutions.
Lawyers, in particular, have discussed these predictive models, this “actuarial
justice,” as Thomas (2007), Harcourt (2011), Gautron and Dubourg (2015), or
Rothschild-Elyassi et al. (2018) coined it.
The idea of bias and algorithmic discrimination is not a new one, as shown for
instance by Pedreshi et al. (2008). However, over the past 20 years, the number of
examples has continued to increase, with more and more interest in the media. “AI
biases caused 80% of black mortgage applicants to be rejected” in Hale (2021),
or “How the use of AI risks recreating the inequity of the insurance industry of
the previous century” in Ito (2021). Pursuing David’s 2015 analysis, McKinsey
(2017) announced that artificial intelligence would disrupt the workplace (including
the insurance and banking sectors, Mundubeltz-Gendron (2019)) particularly to
replace lackluster repetitive (human) work.3 These replacements raise questions,
and compel the market and the regulator to be cautious. For Reijns et al. (2021), “the
Dutch insurance sector makes it a mandate,” in an article on “ethical artificial intel-
ligence,” and in France, Défenseur des droits (2020) recalls that “algorithmic biases
must be able to be identified and then corrected” because “non-discrimination is
not an option, but refers to a legal framework.” Bergstrom and West (2021) note
that there are people writing a bill of rights for robots, or devising ways to protect
humanity from super-intelligent, Terminator-like machines, but that getting into the
details of algorithmic auditing is often seen as boring, but necessary.
Living with blinders on, or closing our eyes, rarely solves problems, although
it has long been advocated as a solution to discrimination. As Budd et al. (2021) show, recalling an Amazon experiment, removing names from CVs to eliminate gender discrimination does not work: even with the candidate’s name hidden, the algorithm continued to preferentially choose men over women. Why did this
happen? Simply because Amazon trained the algorithm from its existing resumes,
with an over-representation of men, and there are elements of a resume (apart from
the name) that can reveal a person’s gender, such as a degree from a women’s
university, membership of a female professional organisation, or a hobby where the
sexes are disproportionately represented. Proxies that correlate more or less with the
“protected” variables may sustain a form of discrimination.
In this textbook, we address these issues, limiting ourselves to actuarial models
in an insurance context, and almost exclusively, the pricing of insurance contracts.
In Seligman (1983), the author asks the following basic question: “If young women
have fewer car accidents than young men—which is the case—why shouldn’t women
get a better rate? If industry experience shows—which it does—that women spend
more time in hospital, why shouldn’t women pay more?” This type of question will
be the starting point in our considerations in this textbook.
Paraphrasing Georges Clémenceau,4 who said (in 1887) that “war is too serious a thing to be left to the military,” Worham (1985) argued that insurance segmentation was too important a task to be left to actuaries. Forty years later, we might wonder whether it is not worse to leave it to algorithms, and we should clarify actuaries’ role in these
debates. In the remainder, we begin by reviewing insurance segmentation and the
foundations of actuarial pricing of insurance contracts. We then review the various
terms mentioned in the title, namely the notion of “bias,” “discrimination,” and
“fairness,” while proposing a typology of predictive models and data (in particular,
the so-called “sensitive” data, which may be linked to possible discrimination).

3 Even if it seems exaggerated, because on the contrary, it is often humans who perform the

repetitive tasks to help robots: “in most cases, the task is repetitive and mechanical. One worker
explained that he once had to listen to recordings to find those containing the name of singer Taylor
Swift in order to teach the algorithm that it is a person” as reported by Radio Canada in April 2019.
4 Member of the Chamber of Deputies from 1885 to 1893 and then Prime Minister of France

from 1906 to 1909 and again from 1917 until 1920.


1.1.7 Discrimination Mitigation and Affirmative Action

Mitigating discrimination is usually seen as paradoxical, because in order to avoid


discrimination, we must create another discrimination. More precisely, Supreme
Court Justice Harry Blackmun stated, in 1978, “in order to get beyond racism,
we must first take account of race. There is no other way. And in order to treat some persons equally, we must treat them differently” (cited in Knowlton (1978), as mentioned in Lippert-Rasmussen (2020)). More formally, an argument in favor of
affirmative action—called “the present-oriented anti-discrimination argument”—is
simply that justice requires that we eliminate or at least mitigate (present) discrim-
ination by the best morally permissible means of doing so, which corresponds to
affirmative action. Freeman (2007) suggested a “time-neutral anti-discrimination
argument,” in order to mitigate past, present, or future discrimination. But there
are also arguments against affirmative action, corresponding to “the reverse dis-
crimination objection,” as defined in Goldman (1979): some might consider that
there is an absolutely ethical constraint against unfair discrimination (including
affirmative action). To quote another Supreme Court Justice, John G. Roberts, in 2007: “The way to stop discrimination on the basis of race is to stop discriminating on the basis of race” (Turner (2015)
and Sabbagh (2007)). The arguments against affirmative action are usually based
on two theoretical moral claims, according to Pojman (1998). The first denies that
groups have moral status (or at least meaningful status). According to this view,
individuals are only responsible for the acts they perform as specific individuals
and, as a corollary, we should only compensate individuals for the harms they have
specifically suffered. The second asserts that a society should distribute its goods
according to merit.

1.2 From Words and Concepts to Mathematical Formalism

1.2.1 Mathematical Formalism

The starting point of any statistical or actuarial model is to suppose that observations
are realizations of random variables, in some probability space (Ω, F, P) (see Rol-
ski et al. (2009), for example, or any actuarial textbook). Therefore, let .P denote the
“true” probability measure, associated with random variables .(Z, Y ) = (S, X, Y ).
Here, features .Z can be split into a couple .(S, X), where .X is the nonsensitive
information whereas S is the sensitive attribute.5 Y is the outcome we want to model,
which would correspond to the annual loss of a given insurance policy (insurance
pricing), the indicator of a false claim (fraud detection), the number of visits to

5 For simplicity, in most of the book, we discuss the case where S is a single sensitive attribute.
the dentist (partial information for insurance pricing), the occurrence of a natural
catastrophe (claims management), the indicator that the policyholder will purchase
insurance to a competitor (churn model), etc. Thus, here, we have a triplet .(S, X, Y ),
defined on .S × X × Y, following some unknown distribution .P. And classically,
.Dn = {(zi , yi )} = {(si , x i , yi )}, where .i = 1, 2, · · · , n, will denote a dataset, and

.Pn will denote the empirical probabilities associated with sample .Dn .

It is always assumed in this book that .S is somehow fixed in advance, and


is not learnt: gender is considered as a binary categorical variable, sensitive and
protected. In most cases, s will be a categorical variable, and in order to avoid heavy
notations, we simply consider a binary sensitive attribute (denoted s ∈ {A, B} to remain quite general, avoiding {0, 1} so as not to get confused with the values taken by y in a classification problem). Recently, Hu et al. (2023b) discussed the case where s
is a vector of multivariate attributes (of course possibly correlated). .Y depends on
the model considered: in a classification problem, .Y usually corresponds to .{0, 1},
whereas in a regression problem, .Y corresponds to the real line .R. We can also
consider counts, when .y ∈ N (i.e., .{0, 1, 2, · · · }). We do not discuss here the case
where .y is a collection of multiple predictions (also coined “multiple tasks” in the
machine-learning literature, see for example Hu et al. (2023a) for applications in the
context of fairness).
Throughout the book, we consider models that are formally functions .m :
S × X → Y, that will be estimated from our training dataset .Dn . Considering
models .m : X → Y (sometimes coined “gender-blind” if s denotes the gender, or
“color-blind” if s denotes the race, etc.) is supposed to create a more “fair” model,
unfortunately, in a very weak sense (as many variables in .x might be strongly
correlated with s). After estimating a model, we can use it to obtain predictions, denoted ŷ, while m̂(x) (or m̂(x, s)) will be called the “score” when y is a binary variable taking values in {0, 1}.

1.2.2 Legitimate Segmentation and Unfair Discrimination

In the previous section, we have tried to explain that there could be “legitimate” and
“illegitimate” discrimination, “fair” and “unfair.” We consider here a first attempt
to illustrate that issue, with a very simple dataset (with simulated data). Consider
a risk, and let y denote the occurrence of that risk (hence, y is binary). As we
discuss in Chap. 2, it is legitimate to ask policyholders to pay a premium that is
proportional to .P[Y = 1], the probability that the risk occurs (which will be the
idea of “actuarial fairness”). Assume now that this occurrence is related to a single
feature x : the larger x, the more likely the risk will occur. A classic example could
be the occurrence of the death of a person, where x is the age of that person. Here,
the correlation between y and x is coming from a common (unobserved) factor, .x0 .
In a small dataset, toydata1 (divided into a training dataset, toydata1_train,
and a validation dataset, toydata1_validation), we have simulated values,
where the confounding variable .X0 (that will not be observed, and therefore cannot
be used in the modeling process) is a Gaussian variable, X0 ∼ N(0, 1), and then

$$
\begin{cases}
X = X_0 + \varepsilon, & \varepsilon \sim \mathcal{N}(0, 1/2^2),\\
S = \mathbf{1}(X_0 + \eta > 0), & \eta \sim \mathcal{N}(0, 1/2^2),\\
Y = \mathbf{1}(X_0 + \nu > 0), & \nu \sim \mathcal{N}(0, 1/2^2).
\end{cases}
$$

The sensitive attribute s, which takes values 0 (or A) and 1 (or B), does not
influence y, and therefore it might not be legitimate to use it (it could be seen as
an “illegitimate discrimination”). Note that .x0 influences all variables, x, s, and y
(with a probit model for the last two), and because of that unobserved confounding
variable x0, all variables are here (strongly) correlated. In Fig. 1.3, we can visualize the dependence between x and y (via boxplots of x given y) on the left-hand side, and between x and s (via boxplots of x given s) on the right-hand side.

Fig. 1.3 On top, boxplot of x conditional on y, with y ∈ {0, 1} on the left-hand side, and conditional on s, with s ∈ {A, B} on the right-hand side, from the toydata1 dataset. Below, the curve on the left-hand side is x → P[Y = 1|X = x] whereas the curve on the right-hand side is x → P[S = A|X = x]. Hence, when x = +1, P[Y = 1|X = x] ∼ 75%, and therefore P[Y = 0|X = x] ∼ 25% (on the left-hand side), whereas when x = +1, P[S = A|X = x] ∼ 95%, and therefore P[S = B|X = x] ∼ 5% (on the right-hand side)

For example, if x ∼ −1, then y takes values in {0, 1} respectively with 75% and 25% chance, whereas it is a 25% and 75% chance if x ∼ +1. Similarly, when x ∼ −1, s is four times more likely to be in group A than in group B.
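As an illustration, such data can be generated along the following lines in R (a sketch only: the seed and sample size are arbitrary, this is not the code used to build the packaged toydata1 dataset, and the A/B coding simply follows the description given above):

# sketch of the toydata1-type generating process: an unobserved confounder x0
# drives the observed feature x, the sensitive attribute s and the outcome y
set.seed(1)                                    # arbitrary seed
n   <- 600
x0  <- rnorm(n, mean = 0, sd = 1)              # unobserved confounder
x   <- x0 + rnorm(n, sd = 1/2)                 # observed, nonsensitive feature
s   <- factor(ifelse(x0 + rnorm(n, sd = 1/2) > 0, "B", "A"))   # sensitive attribute
y   <- factor(ifelse(x0 + rnorm(n, sd = 1/2) > 0, 1, 0))       # outcome (probit-type)
toy <- data.frame(x = x, s = s, y = y)

By construction, x, s, and y are all positively related to x0, and hence correlated with each other, even though s has no direct influence on y.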
When fitting a logistic regression to predict y based on both x and s, from
toydata1_train, observe that variable x is clearly significant, but not s (using
glm in R, see Sect. 3.3 for more details about standard classifiers, starting with the
logistic regression):
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.2983 0.2083 -1.432 0.152
x 1.0566 0.1564 6.756 1.41e-11 ***
s == A 0.2584 0.2804 0.922 0.357

Without the sensitive variable s, we obtain a logistic regression on x only, that


could be seen as “fair through unawareness.” The estimation yields
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.1390 0.1147 -1.212 0.226
x 1.1344 0.1333 8.507 <2e-16 ***

Here, m̂(x), that estimates E[Y | X = x], is equal to

$$
\widehat{m}(x) = \frac{\exp[-0.1390 + 1.1344\,x]}{1 + \exp[-0.1390 + 1.1344\,x]}.
$$
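The two fits above can be reproduced along the following lines (a sketch, assuming the InsurFair package and its toydata1_train data are installed as explained in Sect. 1.4; the indicator I(s == "A") mimics the s == A term of the first output):

library(InsurFair)
# logistic regression using both x and the sensitive attribute (indicator of group A)
fit_xs <- glm(y ~ x + I(s == "A"), family = binomial, data = toydata1_train)
# logistic regression on x only, i.e. "fairness through unawareness"
fit_x  <- glm(y ~ x, family = binomial, data = toydata1_train)
summary(fit_x)$coefficients
# estimated score m(x) at, e.g., x = 1, that is exp(eta)/(1 + exp(eta))
predict(fit_x, newdata = data.frame(x = 1), type = "response")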

But it does not mean that this model is perceived as “fair” by everyone. In Fig. 1.4, we can visualize the probability that scores m̂ exceed a given threshold t, here 50%. Even without using s as a feature in the model, P[m̂(X) > t | S = s] does depend on s, whatever the threshold t. And if E[m̂(X)] ∼ 50%, observe that E[m̂(X)|S = A] ∼ 65% while E[m̂(X)|S = B] ∼ 25%. With our premium interpretation, it means that, on average, people that belong in group A pay a premium at least twice that paid by people in group B. Of course, ceteris paribus, it is not the case, as individuals with the same x have the same prediction, whatever s, but overall, we observe a clear difference. One can easily transfer this simple example to many real-life applications.
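Those group-wise quantities can be computed directly from the fitted scores, as in the following sketch (continuing with the “unaware” logistic regression; the exact figures depend on the simulated data):

library(InsurFair)
fit_x <- glm(y ~ x, family = binomial, data = toydata1_train)   # model without s
score <- predict(fit_x, type = "response")                      # scores m(x) on the training data
# average score per group, i.e. E[m(X) | S = A] versus E[m(X) | S = B]
tapply(score, toydata1_train$s, mean)
# share of each group with a score above the cut-off t = 50%, i.e. P[m(X) > t | S = s]
tapply(score > 0.5, toydata1_train$s, mean)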
Throughout this book, we provide examples of such situations, then formalize
some measures of fairness, and finally discuss methods used to mitigate a possible
discrimination in a predictive model m̂, even if m̂ is not a function of the sensitive attribute (fairness through unawareness).
Fig. 1.4 Distribution of the score m̂(X, S), conditional on A and B, on the left-hand side, and distribution of the score m̂(X) without the sensitive variable, conditional on A and B, on the right-hand side (fictitious example). In both cases, logistic regressions are considered. From this score, we can get a classifier ŷ = 1(m̂(z) > t) (where z is either (x, s), on the left-hand side, or simply x, on the right-hand side). Here, we consider the cut-off t = 50%. Areas on the right of the vertical line (at t = 50%) correspond to the proportion of individuals classified as ŷ = 1, in both groups, A and B

1.3 Structure of the Book

In Part I we get back to insurance and predictive modeling. In Chap. 2, we present


applications of predictive modeling in insurance, emphasizing insurance ratemaking
and premium calculations, first in the context of homogeneous policyholders,
and then in that of heterogeneous policyholders. We will discuss “segmentation”
from a general perspective, the statistical approach being discussed in Chap. 3.
In that chapter, we present standard supervised models, with generalized linear models (GLMs), penalized versions, neural nets, trees, and ensemble approaches.
In Chap. 4, we then address the questions of interpretation and explanation of
predictive models, as well as accuracy and calibration.
In Part II, we discuss further segmentation and discrimination and sensitive
attributes in the context of insurance modeling. In Chap. 5, we provide a classifica-
tion and a typology of pricing variables. In Chap. 6, we discuss direct discrimination (with race, gender, age, and genetics), and indirect discrimination. We return to biases and data in Chap. 7, with a discussion about observations and experiments. We return
to how data are collected before getting back to the popular adage “correlation is
not causation,” and start to discuss causal inference and counterfactuals.
In Part III, we present various approaches to quantify fairness, with a focus, in Chap. 8, on “group discrimination” concepts, whereas “individual fairness” is presented in Chap. 9.
And finally, in Part IV, we discuss the mitigation of discrimination, using three
approaches: the pre-processing approach, in Chap. 10, the in-processing approach
in Chap. 11, and the post-processing approach in Chap. 12.
1.4 Datasets and Case Studies

In the following chapters, and more specifically in Parts III and IV, we use both
generated data and publicly available real datasets to illustrate various techniques,
either to quantify a potential discrimination (in Part III) or to mitigate it (in Part IV).
All the datasets are available from the GitHub repository,6 in R.
> library(devtools)
> devtools::install_github("freakonometrics/InsurFair")
> library(InsurFair)

The first toy dataset is the one discussed previously in Sect. 1.2.2, with
toydata1_train and toydata1_valid, with (only) three variables y
(binary outcome), s (binary sensitive attribute) and x (drawn from a Gaussian
variable).
> str(toydata1_train)
’data.frame’: 600 obs. of 3 variables:
$ x : num 0.7939 0.5735 0.9569 0.1299 -0.0606 ...
$ s : Factor w/ 2 levels "B","A": 1 1 2 1 2 2 2 1 1 1 ...
$ y : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 2 1 ...

As discussed, the three variables are correlated, as they are all based on the unobserved common variable x0 introduced in Sect. 1.2.2.
The toydata2 dataset consists of two generated samples: n = 5000 observations are used as a training sample, and n = 1000 are used for validation. The process used to generate the data is the following (a simulation sketch is given after this list):
• The binary sensitive attribute, s ∈ {A, B}, is drawn, with respectively 60% and 40% of individuals in each group
• (x1, x3) ∼ N(μs, Σs), with a correlation of 0.4 when s = A and 0.7 when s = B
• x2 ∼ U([0, 10]), independent of x1 and x3
• η = β0 + β1 x1 + β2 x2 + β3 x1² + β4 1B(s), which does not depend on x3
• y ∼ B(p), where p = exp(η)/[1 + exp(η)] = μ(x1, x2, s).
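As announced, here is a minimal R sketch of such a generating process; the group means, the correlation structure, and the β coefficients below are placeholders, not the values actually used to build toydata2:

library(MASS)                                     # for mvrnorm
set.seed(2)                                       # arbitrary seed
n <- 5000
s <- sample(c("A", "B"), n, replace = TRUE, prob = c(0.6, 0.4))
draw_x13 <- function(m, rho) MASS::mvrnorm(m, mu = c(0, 0),      # placeholder means
                                           Sigma = matrix(c(1, rho, rho, 1), 2))
X13 <- matrix(NA, n, 2)
X13[s == "A", ] <- draw_x13(sum(s == "A"), 0.4)
X13[s == "B", ] <- draw_x13(sum(s == "B"), 0.7)
x1 <- X13[, 1]; x3 <- X13[, 2]
x2 <- runif(n, 0, 10)
eta <- -2 + 0.6 * x1 + 0.3 * x2 + 0.2 * x1^2 + 0.5 * (s == "B")  # placeholder betas
y   <- rbinom(n, size = 1, prob = exp(eta) / (1 + exp(eta)))     # y does not use x3
toy2 <- data.frame(x1, x2, x3, s = factor(s), y = factor(y))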
In Fig. 1.5, we can visualize scatter plots with .x1 on the x-axis and .x2 on the y-axis,
with on the left-hand side, colors depending on y (.y ∈ {GOOD , BAD}, or .y ∈ {0 , 1})
and depending on s (.s ∈ {A , B}) on the right-hand side. In Fig. 1.6, we can visualize
level curves of .(x1 , x2 ) → μ(x1 , x2 , A) on the left and .(x1 , x2 ) → μ(x1 , x2 , B) on
the right-hand side, where .μ(x1 , x2 , s) are the true probabilities used to generate the
dataset. Colors reflect the value of the probability (on the right part) and are coherent
with .{GOOD , BAD}.

6 See Charpentier (2014) for a general overview on the use of R in actuarial science. Note that some

packages mentioned here also exist in Python, in scikit-learn, as well as packages dedicated to fairness (such as fairlearn, or aif360).
Fig. 1.5 Scatter plot on toydata2, with .x1 on the x-axis and .x2 on the y-axis, with on the left-
hand side, colors depending on the outcome y (.y ∈ {GOOD , BAD}, or .y ∈ {0 , 1}) and depending
on the sensitive attribute s (.s ∈ {A , B}) on the right-hand side

Fig. 1.6 Level curves of .(x1 , x2 ) → μ(x1 , x2 , A) on the left-hand side and .(x1 , x2 ) →
μ(x1 , x2 , B) on the right-hand side, the true probabilities used to generate the toydata2 dataset.
The blue area in the lower-left corner corresponds to . y close to .0 (blue) (or GOOD (blue) risk),
whereas the red area in the upper right corner corresponds to . y close to .1 (red) (or BAD (red) risk)

Then, there are the real datasets. The GermanCredit dataset, collected in
Hofmann (1990) and used in the CASdataset package, from Charpentier (2014),
contains 1000 observations and 23 attributes. The variable of interest y is a binary
variable indicating whether a person experienced a default of payment. There are
70% of 0’s (“good” risks), 30% of 1’s (“bad” risks). The sensitive attribute is the gender of the person (binary, with 69% women (B) and 31% men (A)), but we can also use the age, treated as categorical.
The FrenchMotor datasets, from Charpentier (2014), are in personal motor
insurance, with underwriting data, and information about claim occurrence (here
considered as binary). It is obtained as the aggregation of freMPL1, freMPL2,
freMPL3 and freMPL4 from the CASdataset R package, while keeping
only observations with exposure exceeding .90%. Here, the sensitive attribute
is .s = Gender, which is a binary feature, and the goal is to create a score that
reflects the probability of claiming a loss (during the year). The entire dataset
contains .n = 12,437 policyholders and 18 variables. A subset with 70% of the
observations is used for training, and 30% are used for validation. Note that
variable SocioCateg contains here nine categories (only the first digit in the
categories is considered). In numerical applications, two specific individuals (named
Andrew and Barbara) are considered, to illustrate various points.
The telematic dataset is an original dataset, containing 1177 insurance
contracts, observed over 2 years. We have claims data for 2019 (here claim is
binary, no or yes, 13% of the policyholders claimed a loss), the age (age) and the
gender (gender) of the driver, as well as some telematic data for 2018 (including
Total_Distance, Total_Time, as well as Drive_Score, Style_Score,
Corner_Score, Acceleration_Score or Braking_Score, in addition to
some binary scores related to “heavy” acceleration or braking).
Part I
Insurance and Predictive Modeling

Predictive modeling involves the use of data to forecast future events. It relies on capturing
relationships between explanatory variables and the predicted variables from past occur-
rences and exploiting these relationships to predict future outcomes. Forecasting future
financial events is a core actuarial skill—actuaries routinely apply predictive modeling
techniques in insurance and other risk management applications, Frees et al. (2014a).

The sciences do not try to explain, they hardly even try to interpret, they mainly make
models. By a model is meant a mathematical construct which, with the addition of
certain verbal interpretations, describes observed phenomena. The justification of such a
mathematical construct is solely and precisely that it is expected to work—that is, correctly
to describe phenomena from a reasonably wide area, Von Neumann (1955).

In economic theory, as in Harry Potter, the Emperor’s New Clothes or the tales of King
Solomon, we amuse ourselves in imaginary worlds. Economic theory spins tales and calls
them models. An economic model is also somewhere between fantasy and reality. Models
can be denounced for being simplistic and unrealistic, but modeling is essential because
it is the only method we have of clarifying concepts, evaluating assumptions, verifying
conclusions and acquiring insights that will serve us when we return from the model to
real life. In modern economics, the tales are expressed formally: words are represented by
letters. Economic concepts are housed within mathematical structures, Rubinstein (2012).
Chapter 2
Fundamentals of Actuarial Pricing

Abstract “Insurance is the contribution of the few to the misfortune of the many”
is a simple way to describe what insurance is. But it doesn’t say what the
“contribution” should be, to be fair. In this chapter, we return to the fundamentals of
pricing and risk sharing, and at the end we mention other models used in insurance
(to predict future payments to be provisioned, to create a fraud score, etc.).

Even though insurers will not be able to predict which of their clients will suffer a loss, they should be capable of estimating the probability that a policyholder claims a loss, and possibly the distribution of their aggregate losses, with an acceptable margin of error, and of budgeting accordingly. The role of actuaries is to run statistical analyses to measure individual risk and price it.

2.1 Insurance

The insurance business is characterized by an inverted production cycle. In return for a premium—the amount of which is known when the contract is taken out—the insurer undertakes to cover a risk, of unknown date and amount, according to the definition of “actuarial pricing.” In order to do this, the insurer pools the risks within
a mutuality. The universal secret of insurance is therefore the pooling of a large
number of insurance contracts within a mutuality, in order to allow compensation
to be made between the risks that have been damaged and those for which the
insurer has collected premiums without having had to pay out any benefits, as
Petauton (1998) argues. To use Chaufton (1886)’s formulation, insurance is the
“compensation of the effects of chance by mutuality organised according to the laws
of statistics.” The first important concept is “mutualization.”
Definition 2.1 (Mutuality (Wilkie 1997)) Mutuality is considered to be the
normal form of commercial private insurance, where participants contribute to the
risk pool through a premium that relates to their particular risk at the time of

the application, i.e., the higher the risk that they bring to the pool, the higher the
premium required.
Through effective underwriting, Wilkie (1997) claims that “the risk is evaluated
by the insurer as thoroughly as possible, based on all the facts that are relevant
and available.” Participation in mutual insurance schemes is voluntary and the
amount of cover that the individual purchases is discretionary. An essential feature
of mutual insurance is segmentation, or discrimination in underwriting, leading
to significant differences in premium rates for the same amount of life cover
for different participants. Viswanathan (2006) gives several examples. The second
concept is “solidarity.”
Definition 2.2 (Solidarity (Wilkie 1997)) Solidarity is the basis of most national
or social insurance schemes. Participation in such state-run schemes is generally
compulsory and individuals have no discretion over their level of cover. All
participants normally have the same level of cover. In solidarity schemes the
contributions are not based on the expected risk of each participant.
In those state-run schemes, contributions are often just equal for all, or they can be set according to individual ability to pay (such as a percentage of income). As
everybody pays the same contribution rate, the low-risk participants are effectively
subsidizing the high-risk participants. With an insurance economics perspective,
agents make decisions individually, forgetting that the decisions they make often go
beyond their narrow self-interest, reflecting instead broader community and social
interests, even in situations where they are not known to each other. This is not
altruism, per se, but rather a notion of strong reciprocity, the “predisposition to
cooperate even when there is no apparent benefit in doing so,” as formalized in
Gintis (2000) and Bowles and Gintis (2004).
Solidarity is important in insurance. In most countries, employer-based health
insurance includes maternity benefits for everyone. In the USA, a federal law says
it is discriminatory not to do so (the “Pregnancy Discrimination Act” (PDA) is an
amendment to the Civil Rights Act of 1964 that was enacted in 1978). “Yes, men
should pay for pregnancy coverage, and here’s why, said Hiltzik (2013), it takes two
to tango.” No man has ever given birth to a baby, but it’s also true that no baby has
ever been born without a man being involved somewhere along the line. “Society
has a vested interest in healthy babies and mothers” and “universal coverage is the
only way to make maternity coverage affordable”; therefore, solidarity is imposed,
and men should pay for pregnancy coverage.
One should probably stress here that insurance is not used to eliminate the
risk, but to transfer it, and this transfer is done according to a social philosophy
chosen by the insurer. With “public insurance,” as Ewald (1986) reminds us, the
goal is to transfer risk from individuals to a wider social group, by “socialising,” or
redistributing risk “more fairly within the population.” Thus, low-risk individuals
pay insurance premiums at a higher rate than their risk profile would suggest,
even if this seems “inefficient” from an economic point of view. Social insurance
is organized according to principles of solidarity, where access and coverage are
independent of risk status, and sometimes of ability to pay (as noted by Mittra
(2007)). Nevertheless, in many cases, the premium is proportional to the income of the policyholder, and such insurance is usually provided by public rather than private entities. For some
social goods, such as health care, long-term care, and perhaps even basic mortgage
life insurance, it may simply be inappropriate to provide such products through a
mutuality-based model that inevitably excludes some individuals, as “primary social
goods, because they are defined as something to which everyone has an inalienable
right, cannot be distributed through a system that excludes individuals based on
their risk status or ability to pay.” Mutual insurance companies are often seen as an
intermediary between such public insurance and for-profit insurance companies.
And as Lasry (2015) points out, “insurance has long been faced with a dilemma:
on the one hand, better knowledge of a risk allows for better pricing; better
knowledge of risk factors can also encourage prevention; on the other hand,
mutualization, which is the basis of insurance, can only subsist in most cases in a
situation of relative ignorance (or even a legal obligation of ignorance).” Actuaries
will then seek to classify or segment risks, all based on the idea of mutualization.
We shall return to the mathematical formalism of this dilemma. De Pril and
Dhaene (1996) point out that segmentation is a technique that the insurer uses to differentiate the premium, and possibly the cover, according to a certain number of specific characteristics of the policyholder’s risk (hereinafter referred to as segmentation criteria), with the aim of achieving a better match between the estimated cost of the claim (the burden that a given person places on the community of policyholders) and the premium that this person has to pay for the
cover offered. In Box 2.1, Rodolphe Bigot addresses general legal considerations
regarding risk selection in insurance (in France, but most principles can be observed
elsewhere). Underwriting is the term used to describe the decision-making process
by which insurers determine whether to offer, or refuse, an insurance policy to an
individual based on the available information (and the requested amount). Gandy
(2016) asserts that the “right to underwrite” is basically a right to discriminate. Hence,
“higher premium” corresponds to a rating decision, “exclusion waiver” is a coverage
decision whereas “denial” is an underwriting decision.

Box 2.1 Insurance & Underwriting (in French Law), by Rodolphe Bigot1
The insurance transaction and the underlying mutualization are based on so-
called risk selection. Apart from most group insurance policies, which consist
of a kind of mutualization within a mutualization, the insurer refuses or
accepts that each applicant for insurance enters the mutualization constituted


1 Lecturer in private law, UFR of Law, Le Mans University, member of Thémis-UM and Ceprisca.
by the group of policyholders. This selection of risks “confines the mutualiza-
tion to the policyholders accepted by the insurer, who is always considered,
in insurance contract law, as the one who accepts the contract proposed to
him by the applicant for insurance,” wrote Monnet (2017, p. 13ff). In this
respect, it should be recalled that the economics of the insurance transaction
requires that the insurance company be given a great deal of freedom to
accept or refuse the risk proposed to it. Proof of this freedom in the area of
personal insurance is in the provisions of Article 225-3 of the Criminal Code,
which exclude from the scope of application of the criminal repression of
discrimination in the supply of goods and services provided for in Article 225-
2 “discrimination based on the state of health, when it consists of operations
whose purpose is the prevention and coverage of risks of death, risks of harm
to the physical integrity of the person, or risks of incapacity for work or
disability.” However, not to admit a limit to this freedom of the insurer would lead to the evacuation of important social considerations, and to the exclusion of the most exposed persons not only from insurance but also from the goods and services linked to it (such as borrowing, and therefore access to property) (Bigot and Cayol 2020, p. 540). The question of the right to insurance arises
here (Pichard 2006). To this end, “having access to insurance means not only
the very possibility of taking out a contract for coverage, but perhaps also
at a reasonable economic cost, not prohibitive, not dissuasive. In societies
where the need for security, or even comfort, is a leitmotif, the question is
very relevant.” (Noguéro 2010, p. 633)

2.2 Premiums and Benefits

Comparing policyholders is always tricky, as not only do they potentially carry


different risks but they may also have different preferences (and therefore choose
different policies). First of all, it is important to distinguish between types of cover. In a
car insurance policy, the “third party liability” cover is the compulsory component,
covering exclusively the damage that the insured car might cause to a third party.
But some policyholders may want (or need) more extensive protection. Other
standard types of cover include “comprehensive” cover, which covers all damage
to the vehicle (regardless of the circumstances of the accident or the driver’s
responsibility), “collision” cover, which reimburses the owner for damage caused
to the vehicle in the event of a collision with a third party, and “fire and theft”
cover, which compensates the owner of the vehicle if it is damaged or destroyed
by fire, or if it is stolen. Some insurers also offer “mechanical breakdown” cover,
which allows the insurance to compensate for the cost of repairs related to a
Fig. 2.1 Coverage selected by auto insurance policyholders based on age, with the basic manda-
tory coverage, “third party insurance” and the broadest coverage known as “fully comprehensive.”
(Source: personal communication, real data from an insurance company in France)

breakdown, or “vehicle contents” cover, which offers compensation in the event


of damage to or disappearance of items inside the insured vehicle. There may
also be “assistance” cover, which provides services in the event of a breakdown,
such as breakdown assistance, towing, repatriation, etc. Another possible source
of difference is the indemnity, which may vary according to the choice of the
deductible level (Buchanan and Priest 2006). As a reminder, the deductible is the
amount that remains payable by the policyholder after the insurer has compensated
for a loss. The absolute (or fixed) excess is the most common in car insurance: in
a policy with an excess of €150, if the repair costs amount to €250, the insurance
company will pay €100 and the remaining €150 will be paid by the policyholder.
Many insurers now offer “mileage excesses,” defining a perimeter around the
vehicle’s usual parking place: within this perimeter, the assistance guarantee will
not work. However, if a breakdown occurs outside this perimeter, the assistance
guarantee can be called upon.
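To make the arithmetic of an absolute deductible explicit, here is a minimal sketch in R (the loss amounts other than the €250 example are hypothetical) of how a repair bill is split between insurer and policyholder.

```r
# Split a loss between insurer and policyholder under an absolute deductible of 150 euros
deductible <- 150
losses <- c(100, 250, 1000)                       # hypothetical repair costs
insurer_pays      <- pmax(losses - deductible, 0)
policyholder_pays <- pmin(losses, deductible)
data.frame(losses, insurer_pays, policyholder_pays)
# for a 250 euro repair: the insurer pays 100, the policyholder keeps the 150 deductible
```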
Also, it is difficult to compare the auto insurance premium paid by different
people. In Fig. 2.1, we can see that the choice of auto insurance coverage is strongly
dependent on age, with young drivers much more likely to opt for the compulsory
coverage only (one-third of drivers between 20 and 25 years of age choose it), and older drivers
taking out more “comprehensive” insurance (90% of drivers between 70 and 80
years of age choose it). Choosing different coverage inevitably translates into higher bills, as
older people may have a more expensive policy simply because they require more
coverage.
As mentioned earlier, a natural idea is that each policyholder should be offered
a premium that is proportional to the risk he or she represents, to avoid another
company enticing this customer with a more attractive contract. This principle of
“personalization” could be seen to have the virtues of fairness (as each individual
pays according to the risk he or she passes on to the community) and can even be
reconciled with the principle of mutualization: all that is needed (provided that the
market is large enough) is to group individuals into mutuals that are homogeneous
from the point of view of risk. This very general principle does not say anything

about how to build a fair tariff. A difficult task lies in the fact that insurers have
incomplete information about their customers. It is well known that the observable
characteristics of policyholders (which can be used in pricing) explain only a small
proportion of the risks they represent. The only remedy for this imperfection is to
let policyholders self-select by differentiating the cover offered to them, i.e., through a nonlinear
scale linking the premium to be paid to the amount of the deductible accepted. As
mentioned in the previous chapter, observe that there is a close analogy between
this concept of “fair tariff” and “actuarial fairness,” or that of “equilibrium with
signal” proposed by Spence (1974, 1976) to describe the functioning of certain
labor markets. Riley (1975) proposed a more general model that could be applied
to insurance markets, among others. Cresta and Laffont (1982) proved the existence
of fair insurance rates for a single risk. Although the structure of equilibrium with
signal is now well understood in the case of a one-dimensional parameter, the same
cannot be said for cases where several parameters are involved. Kohlleppel (1983)
gave an example of the non-existence of such an equilibrium in a model satisfying
the natural extension of Spence’s hypotheses. As insurance is generally a highly
competitive and regulated market, the insurer must use all the statistical tools and
data at its disposal to build the best possible rates. At the same time, its premiums
must be aligned with the company’s strategy and take into account competition.
Because of the important role played by insurance in society, premiums are also
scrutinized by regulators. They must be transparent, explainable, and ethical. Thus,
pricing is not only statistical, it also carries strategic and societal issues. These
different issues can push the insurer to offer fairer premiums in relation to a given
variable. For example, regulation requires insurers to offer premiums that are fair
with respect to the gender of the policyholder, and their own strategies may lead them to offer
premiums that are fair with respect to age. Regardless of the reason why an insurance player must
present fairer pricing in relation to a variable, it must be able to define, measure,
and then mitigate the ethical bias of its pricing while preserving its consistency and
performance.

2.3 Premium and Fair Technical Price

Definition 2.3 (Expected Value) Let Y be a discrete random variable taking values in \(\mathcal{Y}\), then

\[ E[Y] = \sum_{y \in \mathcal{Y}} y \, P[Y = y], \]

whereas if it is (absolutely) continuous with density f,

\[ E[Y] = \int_{\mathcal{Y}} y f(y) \, \mathrm{d}y. \]

See Feller (1957) or Denuit and Charpentier (2004) for more details about this
quantity, which exists only if the sum or the integral is finite. Risks with infinite
expected values exhibit unexpected properties. If this quantity exists, because of
the law of large numbers (Proposition 3.1), this corresponds to the probabilistic
counterpart of the average of n values y1 , · · · , yn obtained as independent draws of
random variable Y . Interestingly, as discussed in the next chapter, this quantity can
be obtained as the solution of an optimization problem. More precisely,
 

\[ \overline{y} = \underset{m \in \mathbb{R}}{\operatorname{argmin}} \left\{ \sum_{i=1}^{n} (y_i - m)^2 \right\} \quad \text{and} \quad E[Y] = \underset{m \in \mathbb{R}}{\operatorname{argmin}} \left\{ E\big[(Y - m)^2\big] \right\}. \]
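As a quick numerical check of this optimization view (a minimal sketch in R, with simulated loss values rather than real data), the sample mean can be recovered by minimizing the average squared deviation.

```r
set.seed(1)
y <- rexp(1000, rate = 1/100)        # simulated losses, with (theoretical) mean 100

mse <- function(m) mean((y - m)^2)   # average squared deviation from a candidate m
m_star <- optimize(mse, interval = range(y))$minimum

c(sample_mean = mean(y), argmin = m_star)   # the two values coincide (up to numerical error)
```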

2.3.1 Case of a Homogeneous Population

Before talking about segmentation, let us examine the case of a homogeneous


portfolio, where the policyholders are faced with the same probability of occurrence
of a claim, the same extent of damage, the same sum insured, etc., from ὁμογενής
(homogenes), “of the same race, family or kind” (from ὁμός, homos, “same,” and
γένος, genos2). In nonlife insurance, the insurer undertakes (in exchange for the
payment of a premium, the amount of which is decided when the contract is signed)
to cover the claims that will occur during the year. If y is the annual cost of a
randomly selected policyholder, then we define3 the “pure premium” as E[Y].
Definition 2.4 (Pure Premium (Homogeneous Risks)) Let Y be the non-negative
random variable corresponding to the total annual loss associated with a given
policy, then the pure premium is E[Y].
If we consider the risk of losing 100 with probability p (and nothing with
probability 1 − p), the pure premium for this risk (economists would call it a
lottery) is 100 p. The premium for an insurance contract is then proportional to
the probability of having a claim. In the following, our examples are often limited

2 γένος is the etymological source of “gender,” and not “gene,” based on γενεά from the aorist
infinitive of γίγνομαι—I come into being.
3 Denuit and Charpentier (2004) discuss the mathematical formalism that allows such a writing.

In particular, the expectation is calculated according to a probability P corresponding to the
“historical” probability (there is no law of one price in insurance, contrary to the typical approach
in market finance, in the sense of Froot et al. (1995). This will be discussed further in Sect. 3.1.1.).
Insurance, like econometrics, is formalized in a probabilistic world, perhaps unlike many machine
learning algorithms which are derived without a probabilistic model, as discussed in Charpentier
et al. (2018).

to the estimation of this probability. In a personal insurance contract, the period of


cover is longer, and it is necessary to discount future payments.4


\[ a = E\left[\frac{100}{(1+r)^{T}}\right] = \sum_{t=0}^{\infty} \frac{100}{(1+r)^{t}} \cdot P[T = t], \]

for some discount rate r. However, this assumption of homogeneous risks proves
to be overly simplistic in numerous insurance scenarios. Take death insurance, for
example, where the law of T , representing the occurrence of death, should ideally
be influenced by factors such as the policyholder’s age at the time of contract. This
specific aspect will be explored further in Sect. 2.3.3. But first, let us return to
classical concepts about economic decisions when facing uncertain events.

2.3.2 The Fear of Moral Hazard and Adverse-Selection

In the context of insurance, moral hazard refers to the impact of insurance on


incentives to reduce risks. An individual facing an accidental risk, such as the loss
of a home, a car, or the risk of medical expenses, can generally take actions
to reduce the risk. Without insurance, the costs and benefits of accident avoidance,
or precaution, are internal to the individual and the incentives for avoidance are
optimal. With insurance, some of the accident costs are borne by the insurer, as
recalled in Winter (2000).
Definition 2.5 (Adverse Selection (Laffont and Martimort 2002)) Adverse
selection is a market situation where buyers and sellers have different information.
“Adverse selection” characterizes principal-agent models in which an agent has
private information before a contract is written.
Definition 2.6 (Moral Hazard (Arrow 1963)) In economics, a moral hazard is a
situation where an economic actor has an incentive to increase its exposure to risk
because it does not bear the full costs of that risk.
There have been many publications about adverse selection and moral hazard in
life insurance, which creates a demand for insurance that correlates positively with
the insured person’s risk of loss, and could be seen as immoral, or unethical. In
Box 2.2, one of the oldest discussions about moral hazard, in Michelbacher (1926),
is reproduced.

4 Without discounting, as death is (at an infinite time horizon) certain, the pure premium would be

exactly the amount of the capital paid to the beneficiaries.



Box 2.2 Moral Hazard, Michelbacher (1926)


“Moral hazard is the Bogey Man who will catch the unwary insurance official
who does not watch out. When insurance is under consideration he is always
present in one guise or another, sometimes standing out in bold relief, but
more often lurking in the background where he employs every expedient
to avoid detection. In all the ramifications of insurance procedure, from the
binding of the risk until the last moment of policy coverage has expired, his
insidious influence may manifest itself, usually where it is least expected. In
the other case his ignorance, carelessness, inattention or recklessness may
involve the carrier in claims which the ordinarily prudent policyholder would
avoid. The unsafe automobile driver; the employer whose attitude toward
safety is not proper; the careless person who loves display and is notoriously
lax in the protection of his jewelry: these and many others are “bad risks” for
the insurance carrier because they prevent the proper functioning of the law
of averages and introduce the certainty of loss into the insurance transaction.
It will be noted that the term “moral hazard” as employed in this discussion is
used in a much broader sense than the following definition, which is typical of
common usage, would imply: “The hazard is the deflection or variation from
the accepted standard of what is right in one’s conduct. Moral Hazard is that
risk or chance due to the failure of the positive moral qualities of a person
whose interests are affected by a policy of insurance.”

All actuaries have been lulled by Akerlof’s 1970 fable of “lemons”. The insurance
market is characterized by information asymmetries. From the insurer’s point of
view, these asymmetries mainly concern the need to find adequate information
on the customer’s risk profile. A decisive factor in the success of an insurance
business model is the insurer’s ability to estimate the cost of risk as accurately
as possible. Although in the case of some simple product lines, such as motor
insurance, the estimation of the cost of risk can be largely or fully automated and
managed in-house; in areas with complex risks, the assistance of an expert third
party can mitigate this type of information asymmetry. With Akerlof’s terminology,
some insurance buyers are considered low-risk peaches, whereas others are high-
risk lemons. In some cases, insurance buyers know (to some extent) whether they
are lemons or peaches. If the insurance company could tell the difference between
lemons and peaches, it would have to charge peaches a premium related to the risk
of the peaches and lemons a premium related to the risk of the lemons, according
to a concept of actuarial fairness, as Baker (2011) reminds us. But if actuaries are
not able to differentiate between lemons and peaches, then they will have to charge
the same price for an insurance contract. The main difference between the market
described by Akerlof (1970) (in the original fable it was a market for used cars)
and an insurance market is that the information asymmetry was initially (in the car
example) in favor of the seller of an asset. In the field of insurance, the situation is

often more complex. In the field of car insurance, Dalziel and Job (1997) pointed
out the optimism bias of most drivers who all think they are “good risks.” The
same bias will be found in many other examples, as mentioned by Royal and Walls
(2019), but excluding health insurance, where the policyholder may indeed have
more information than the insurer.
To use the description given by Chassagnon (1996), let us suppose that an
insurer covers a large number of agents who are heterogeneous in their probability
of suffering a loss. The insurer proposes a single price that reflects the average
probability of loss of the agent representative of this economy, and it becomes
unattractive for agents whose probability of suffering an accident is low to insure
themselves. A phenomenon of selection by price therefore occurs and it is said
to be unfavorable because it is the bad agents who remain. To guard against
this phenomenon of anti-selection, risk selection and premium segmentation are
necessary. “Adverse-selection disappears when risk analysis becomes sufficiently
effective for markets to be segmented efficiently,” says Picard (2003); “doesn’t the
difficulty econometricians have in highlighting real anti-selection situations in
the car insurance market reflect the increasingly precise evaluation of the risks
underwritten by insurers?”

2.3.3 Case of a Heterogeneous Population

It is important to have models that can capture this heterogeneity (from ἑτερογενής,
heterogenes, “of different kinds,” from ἕτερος, heteros, “other, another, different,”
and γένος, genos, “kinds”). To get back to our introductory example, if T_x is the age
at (random) death of the policyholder of age x at the time the contract was taken out
(so that T_x − x is the residual life span), then the pure premium corresponds to the
expected present value of future flows, i.e.,


\[ a_x = E\left[\frac{100}{(1+r)^{T_x - x}}\right] = \sum_{t=0}^{\infty} \frac{100}{(1+r)^{t}} \cdot P[T_x = x + t], \]

for a discount rate r. Using a more statistical terminology, it can be rewritten as



\[ a_x = \sum_{t=0}^{\infty} \frac{100}{(1+r)^{t}} \cdot \frac{L_{x+t-1} - L_{x+t}}{L_{x}}, \]

where L_t is the number of people alive at the end of t years in a cohort that we
would follow, so that L_{x+t-1} − L_{x+t} is the number of people alive at the end of
x + t − 1 years but not x + t years (and therefore dead in their t-th year). It is De Witt

(1671) who first proposed this premium for a life insurance, where discriminating
according to age seems legitimate.
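As an illustration of this computation, here is a minimal sketch in R, using a purely hypothetical cohort vector L_x and an assumed discount rate (not an official regulatory table).

```r
# Expected present value of a benefit of 100 paid at the end of the year of death,
# computed from (hypothetical) cohort counts L_x -- not an official table
r   <- 0.02                                        # assumed discount rate
age <- 0:110
Lx  <- 100000 * exp(-cumsum(0.0001 * 1.09^age))    # illustrative Gompertz-type survival curve
Lx[length(Lx)] <- 0                                # nobody survives beyond the last age

ax <- function(x) {
  t      <- 1:(110 - x)                            # years elapsed since subscription at age x
  deaths <- Lx[x + t] - Lx[x + t + 1]              # L_{x+t-1} - L_{x+t}, with R's 1-based indexing
  sum(100 / (1 + r)^t * deaths / Lx[x + 1])        # divided by L_x, the number alive at age x
}

round(c(a_40 = ax(40), a_60 = ax(60), a_80 = ax(80)), 2)   # the premium increases with age at subscription
```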

But we can go further, because P[T_x = x + t], the probability that the policyholder
of age x at the time of subscription will die in t years, could also depend on his or
her gender, his or her health history, and probably on other variables that the insurer
might know. And in this case, it is appropriate to calculate conditional probabilities,
P[T_x = x + t | woman] or P[T_x = x + t | man, smoker].

2.4 Mortality Tables and Life Insurance

To illustrate heterogeneity, let us continue with mortality tables, as many tables


are public and openly available. The first modern mortality table was constructed
in 1662 by John Graunt in London, and the first scientific mortality table was
presented to the Royal Academy by Edmund Halley, in 1693, “Estimate of the
Degree of Mortality of Mankind, drawn from curious tables of the births and
funerals at the city of Breslau”. At first, tables were constructed on data obtained
from general population statistics, namely the Northampton (also called “Richard
Price” life table) and Carlisle tables, Milne (1815) or Gompertz (1825, 1833).
In order to compute adequate premium rates, insurance companies began to keep
accurate and reliable records of their own mortality experience. The first life table
constructed on the basis of insurance data was completed in 1834 by actuaries
of the Equitable Assurance of London, as discussed in Sutton (1874) and Nathan
(1925). Later, American life insurance companies had the benefit of the English
experience. As mentioned in Cassedy (2013), English mortality tables tended to
overestimate death rates (both in the USA and in England, contributing to the
prosperity of life insurance companies), and according to Zelizer (2018), The
Presbyterian and Episcopalian Funds relied on the Scottish mortality experience
(the first life table was constructed in 1775, as mentioned in Houston (1992)),
whereas the Pennsylvania Company and the Massachusetts Hospital Life Insurance
Company used the Northampton table. From the 1830s to the 1860s, American
companies based their premiums on the Carlisle table. In 1868, Sheppard Homans
(actuary of the Mutual Life Insurance Company) and George Phillips (Equitable’s
actuary) produced the first comprehensive table of American mortality in Homans
and Phillips (1868), named the “American Experience” table in Ransom and Sutch
(1987).

2.4.1 Gender Heterogeneity

As surprising as it may seem, Pradier (2011) noted that before the end of the
eighteenth century, in the UK and in France, the price of life annuities hardly
ever depended on the sex of the subscriber. However, the first separate mortality
tables, between men and women, constituted as early as 1740 by Nicolas Struyck
(published in the appendices of a geography article, Struyck (1740)) showed that

Table 2.1 Excerpt from the Men and Women life tables in 1720 (Source: Struyck (1912), page
231), for pseudo-cohorts of one thousand people (L_0 = 1000)

        Men                                      Women
  x    L_x    5p_x    x    L_x    5p_x      x    L_x    5p_x    x    L_x    5p_x
  0   1000   29.0%   45    371   16.6%      0   1000   28.9%   45    423   11.8%
  5    710    5.6%   50    313   19.2%      5    711    5.2%   50    373   14.7%
 10    670    4.2%   55    253   22.9%     10    674    3.3%   55    318   18.2%
 15    642    5.5%   60    195   27.2%     15    652    4.3%   60    260   21.2%
 20    607    6.6%   65    142   31.7%     20    624    5.8%   65    205   26.8%
 25    567    7.9%   70     97   37.1%     25    588    6.8%   70    150   33.3%
 30    522    9.2%   75     61   45.9%     30    548    7.3%   75    100   45.0%
 35    474   10.5%   80     33   51.5%     35    508    7.9%   80     55   56.4%
 40    424   12.5%   85     16             40    468    9.6%   85     24

women generally lived longer than men (Table 2.1). Struyck (1740) (translated in
Struyck (1912)) shows that at age 20, the (residual) life expectancy is 30 years and 3/4 for
men and 35 years and 1/2 for women. It also provides life annuity tables by gender.
For a 50-year-old woman, a life annuity was worth 969 florins, compared with 809
florins for a man of the same age. This substantial difference seemed to legitimize
a differentiation of premiums. Here, 424 men (L_x) and 468 women (out of one
thousand respective births) had reached 40 years of age (x = 40). And among those
who had reached 40 years of age, 12.5% of men and 9.6% of women would die
within 5 years (mathematically denoted 5p_x = P[T ≤ x + 5 | T > x]).
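These figures can be recovered directly from the L_x columns of Table 2.1 (a minimal sketch in R, using only the values quoted above):

```r
# Number alive at ages 40 and 45 in Struyck's pseudo-cohorts (Table 2.1)
L40 <- c(men = 424, women = 468)
L45 <- c(men = 371, women = 423)

p5_40 <- (L40 - L45) / L40        # probability of dying within 5 years, given alive at 40
round(100 * p5_40, 1)             # men: 12.5, women: 9.6, as quoted in the text
```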
According to Pradier (2012), it was not until the Duchy of Calenberg’s widows’
fund went bankrupt in 1779 that the age and sex of subscribers were used in
conjunction to calculate annuity prices. In France, in 1984, the regulatory authorities
of the insurance markets decided to use regulatory tables established for the general
population by INSEE, based on the population observed over 4 years, namely the
PM 73-77 table for men and the PF 73-77 table for women, renamed TD and TV
73-77 tables respectively (with an analytical extension beyond 99 years). Although
the primary factor in mortality is age, gender is also an important factor, as shown
in the TD-TV Table. For more than a century, the mortality rate for men has been
higher than that of women in France.
In practice, however, the actuarial pricing of life insurance policies has continued
to be established without taking into account the gender of the policyholder. In fact,
the reason why two tables were used was that the male table was the regulatory
table for coverage in case of death (PM became TD, for “table de décès,” or “death table”), and
the female table became the table for coverage in case of survival, such as annuities (PF became
TV, for “table de vie,” or “life table”). In 1993, the TD and TV 88-90 tables replaced the two previous
tables, with the same principle, i.e., the use of a table built on a male population for
coverage in case of death, and a table built on a female population for coverage in case of survival. From a
prudential point of view, the female table models a population that has, on average,
a lower mortality rate, and therefore lives longer.

In 2005, the TH and TF 00-02 tables were used as regulatory tables, still with
tables founded on different populations, namely men and women respectively. But
this time, the term men (H, for hommes) and women (F, for femmes) is maintained,
as regulations allowed for the possibility of different pricing for men and women.
A ruling by the Court of Justice of the European Union on 1 March 2011, however,
made gender-differentiated pricing impossible (as of 21 December 2012), on the
grounds that it would be discriminatory. In comparison, recent (French) INED tables
are also mentioned in Table 2.2, on the right-hand side.

2.4.2 Health and Mortality

Beyond gender, all sorts of “discriminating variables” have been studied, in order
to build, for example, mortality tables depending on whether the person is a smoker
or not, as in Benjamin and Michaelson (1988), in Table 2.3. Indeed, since Hoffman
(1931) or Johnston (1945), actuaries had observed that exposure to tobacco, and
smoking, had an important impact on the policyholder’s health. As Miller and
Gerstein (1983) wrote, “it is clear that smoking is an important cause of mortality.”
There are also mortality tables (or calculations of residual life expectancy)
by level of body mass index (BMI, introduced by Adolphe Quetelet in the mid-
nineteenth century), as calculated by Steensma et al. (2013) in Canada. A “normal”
index refers to people with an index between 18.5 and 25 kg/m²; “overweight”
refers to an index between 25 and 30 kg/m²; obesity level I refers to an index
between 30 and 35 kg/m²; and obesity level II refers to an index exceeding
35 kg/m². Table 2.4 shows some of these values. These orders of magnitude are

comparable with Fontaine et al. (2003) among the pioneering studies, Finkelstein
et al. (2010), or more recently Stenholm et al. (2017). Although Adolphe Quetelet
introduced the index, it only became popular in the 1970s, when “Dr. Keys was irritated
that life insurance companies [that] were estimating people’s body fat—and hence,
their risk of dying—by comparing their weights with the average weights of others of
the same height, age and gender,” as Callahan (2021) explains. In Keys et al. (1972),
with “more than 7000 healthy, mostly middle-aged men, Dr. Keys and his colleagues
showed that the body mass index was a more accurate—and far simpler—predictor
of body fat than the methods used by the insurance industry.” Nevertheless, this
measure is now known to have many flaws, as explained in Ahima and Lazar (2013).

2.4.3 Wealth and Mortality

Higher incomes are associated with longer life expectancy, as mentioned already
in Kitagawa and Hauser (1973) with probably the first documented analysis. But
despite the importance of socioeconomic status to mortality and survival, Yang

Table 2.2 Excerpt from French tables, with TD and TV 73-77 on the left-hand side, TD and TV 88-90 in the center, and INED 2017-2019 on the right-hand
side
Age    TD 73-77   TV 73-77   TD 88-90   TV 88-90   INED men   INED women
  0      100000     100000     100000     100000     100000       100000
 10       97961      98447      98835      99129      99486        99578
 20       97105      98055      98277      98869      99281        99471
 30       95559      97439      96759      98371      98656        99247
 40       93516      96419      94746      97534      97661        98810
 50       88380      94056      90778      95752      95497        97645
 60       77772      89106      81884      92050      90104        94777
 70       57981      78659      65649      84440      78947        89145
 80       28364      52974      39041      65043      59879        77161
 90        4986      14743       9389      24739      25123        44236
100         103        531        263       1479       1412         4874
110           0          0          0          2

Table 2.3 Residual life expectancy (in years) by age (25–65 years) for smokers and nonsmokers
(Source: Benjamin and Michaelson (1988), for 1970–1975 data in the USA)
        Men                       Women
Age   Nonsmoker   Smoker     Nonsmoker   Smoker
 25      48.4      42.8         52.8      49.8
 35      38.7      33.3         43.0      40.1
 45      29.2      24.2         33.5      31.0
 55      20.3      16.5         24.5      22.6
 65      12.8      10.4         16.2      15.1

Table 2.4 Residual life expectancy (in years), as a function of age (between 20 and 70 years) as
a function of BMI level (Source: Steensma et al. (2013))
        Men                                        Women
Age   Normal  Overweight  Obese I  Obese II     Normal  Overweight  Obese I  Obese II
 20     57.2     61.0       59.1     53.5         62.8     66.5       64.6     59.3
 30     47.6     51.4       49.4     44.1         53.0     56.7       54.8     49.5
 40     38.1     41.7       39.9     34.7         43.3     46.9       45.0     39.9
 50     28.9     32.4       30.6     25.8         33.8     37.3       35.5     30.6
 60     20.4     23.6       21.9     17.6         24.9     28.1       26.4     21.9
 70     13.2     15.8       14.4     10.9         16.8     19.7       18.2     14.3

Table 2.5 Excerpt of life tables per wealth quantile and gender in France (Source: Blanpain
(2018))
        Men                               Women
Age    0–5%    45–50%   95–100%      0–5%    45–50%   95–100%
  0   100000   100000   100000      100000   100000   100000
 10    99299    99566    99619       99385    99608    99623
 20    99024    99396    99469       99227    99506    99526
 30    97930    98878    99094       98814    99302    99340
 40    95595    98058    98627       97893    98960    99074
 50    90031    96172    97757       95021    97959    98472
 60    77943    91050    95649       88786    95543    97192
 70    59824    79805    90399       79037    90408    94146
 80    38548    59103    76115       63224    79117    85825
 90    13337    23526    38837       31190    45750    55918
100      530     1308     3231        2935     5433     8717

et al. (2012), Chetty et al. (2016), and Demakakos et al. (2016) stressed that wealth
has been under-investigated as a predictor of mortality. Duggan et al. (2008) and
Waldron (2013) used social security data in the USA. In France, disparities of life
expectancy by social categories are well known. Recently, Blanpain (2018) created
some life tables per wealth quantiles. An excerpt can be visualized on Table 2.5,
with men on the left-hand side, women on the right-hand side, and fictional cohorts,
with the poorest 5% of the population on the left-hand side (“0–5%”) and the richest
5% on the right-hand side (“95–100%”). Force of mortality, as a function of the age,
the gender, and the wealth quantile, can be visualized in Fig. 2.2.

Fig. 2.2 Force of mortality (log scale) for men on the left-hand side and women on the right-hand
side, for various income quantiles (bottom, medium, and upper 10%), in France. (Data source:
Blanpain (2018))

2.5 Modeling Uncertainty and Capturing Heterogeneity

2.5.1 Groups of Predictive Factors

A multitude of criteria can be used to create rate classes, as we have seen in the
context of mortality. To get a good predictive model, as in standard regression
models, we simply look for variables that correlate significantly with the variable
of interest, as mentioned by Wolthuis (2004). For instance, in the case of car
insurance, the following information was proposed in Bailey and Simon (1959):
use (leisure—pleasure—or professional—business), age (under 25 or not), gender
and marital status (married or not). Specifically, five risk classes are considered,
with rate surcharges relative to the first class (which is used here as a reference):
– “pleasure, no male operator under 25,” (reference),
– “pleasure, nonprincipal male operator under 25,” +65%,
– “business use,” +65%,
– “married owner or principal operator under 25,” +65%,
– “unmarried owner or principal operator under 25,” +140%.
In the 1960s, the rate classes resembled those that would be produced by
classification (or regression) trees such as those introduced by Breiman et al. (1984).
But even with more advanced algorithms, Davenport (2006) points out that when an
actuary creates risk classes and rate groups, in most cases these “groups” are
not self-aware, they are not conscious (at most, the actuary will try to describe

them by looking at the averages of the different variables). These groups, or risk
classes, are built on the basis of available data, and exist primarily as the product
of actuarial models. And as Gandy (2016) points out, there is no “physical basis”
for group members to identify other members of their group, in the sense that they
usually don’t share anything, except some common characteristics. As discussed in
Sect. 3.2, these risk groups, developed at a particular point in time, create a transient
collusion between policyholders, who are likely to change groups as they move,
change cars, or even simply grow older.

2.5.2 Probabilistic Models

Consider here a probabilistic space (Ω, F, P), where F is a set of “events” on Ω
(A ∈ F is an “event”). Recall briefly that P is a function F → [0, 1] satisfying
some properties, such as P(Ω) = 1; for disjoint events, an “additivity property,”
P(A ∪ B) = P(A) + P(B); a “subset property” (or “inclusion property”), if A ⊂ B,
then P(A) ≤ P(B), as in Cardano (1564) or Bernoulli (1713); or, for multiple (possibly
infinite) disjoint events as in Kolmogorov (1933), A_1, · · · , A_n, · · · ,

\[ P(A_1 \cup \cdots \cup A_n \cup \cdots) = P(A_1) + \cdots + P(A_n) + \cdots, \]

inspired by Lebesgue (1918), etc. In Sect. 3.1.1 we will return to probability


measures, as they are extremely important in assessing how well calibrated the
model is, as well as how fair it is. But in this section, we need to recall two important
properties that are crucial to model heterogeneity.
Proposition 2.1 (Total Probability) If (B_i)_{i∈I} is a partition of Ω (an exhaustive
(finite or countable) set of disjoint events),

\[ P(A) = \sum_{i \in I} P(A \cap B_i) = \sum_{i \in I} P(A \mid B_i) \cdot P(B_i), \]

where, by definition, P(A | B_i) denotes the conditional probability of the occurrence
of A, given that B_i occurred.
Proof See Feller (1957) or Ross (1972). □

An immediate consequence is the law of total expectations.
Proposition 2.2 (Total Expectations) For any measurable random variable Y
with finite expectation, if (B_i)_{i∈I} is a partition of Ω,

\[ E(Y) = \sum_{i \in I} E(Y \mid B_i) \cdot P(B_i). \]

Proof See Feller (1957) or Ross (1972). □



This formula can be written simply in the case where two sets, two subgroups,
are considered, for example, related to the gender of the individual,

\[ E(Y) = E(Y \mid \text{woman}) \cdot P(\text{woman}) + E(Y \mid \text{man}) \cdot P(\text{man}). \]

If Y denotes the life expectancy at the birth of an individual, the literal translation of
the previous expression is that the life expectancy at birth of a randomly selected
individual (on the left) is a weighted average of the life expectancies at birth
of females and males, the weights being the respective proportions of males and
females in the population. And as E(Y) is an average of the two,

\[ \min\{E(Y \mid \text{woman}), E(Y \mid \text{man})\} \leq E(Y) \leq \max\{E(Y \mid \text{woman}), E(Y \mid \text{man})\}; \]

in other words, treating the population as homogeneous, when it is not, means
that one group is subsidized by the other, which is called “actuarial unfairness,”
as discussed by Landes (2015), Frezal and Barry (2019), or Heras et al. (2020).
The greater the difference between the two conditional expectations, the greater
the unfairness. This “unfairness” is also called “cross-financing” as one group will
subsidize the other one.
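As a small numerical illustration (a minimal sketch in R, with made-up group means and shares rather than actual life-table figures), the overall mean is the proportion-weighted average of the group means, and a single flat premium implicitly makes one group subsidize the other.

```r
# Hypothetical expected annual losses for two subgroups, and their population shares
mu_woman <- 80;  mu_man <- 120        # E(Y | woman) and E(Y | man), made-up values
p_woman  <- 0.5; p_man  <- 0.5        # P(woman) and P(man)

mu <- mu_woman * p_woman + mu_man * p_man   # law of total expectations: E(Y)
mu                                          # 100, which lies between the two group means

# With a single flat premium mu, one group overpays and subsidizes the other
c(woman = mu - mu_woman, man = mu - mu_man)  # +20 and -20
```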
Definition 2.7 (Pure Premium (Heterogeneous Risks)) Let Y be the non-negative
random variable corresponding to the total annual loss associated with a given
policy, with covariates x, then the pure premium is μ(x) = E[Y | X = x].

We use notation μ(x), also named “regression function” (see Definition 3.1).
We also use notations E_Y[Y] (for E[Y]) and E_{Y|X}[Y | X = x] (for E[Y | X = x]) to
emphasize the measure used to compute the expected value (and to avoid confusion).
For example, we can write

\[ E_Y[Y] = \int_{\mathbb{R}} y f_Y(y) \, \mathrm{d}y \quad\text{and}\quad E_{Y|X}[Y \mid X = x] = \int_{\mathbb{R}} y f_{Y|X}(y \mid x) \, \mathrm{d}y = \int_{\mathbb{R}} y \, \frac{f_{Y,X}(y, x)}{f_X(x)} \, \mathrm{d}y. \]

The law of total expectations (Proposition 2.2) can be written, with that notation,

\[ E_Y[Y] = E_X\big[ E_{Y|X}[Y \mid X] \big]. \]

An alternative is to write, with synthetic notations, E[Y] = E[E[Y | X]], where the
same notation—E—is used indifferently to describe the same operator on different
probability measures.

The law of total expectations can be written

\[ E_Y[Y] = E_X\big[ E_{Y|X}[Y \mid X] \big] = E_X\big[ \mu(X) \big], \]

which is a desirable property we want to have on any pricing function m (also called
“globally unbiased,” see Definition 4.26).
Definition 2.8 (Balance Property) A pricing function m satisfies the balance
property if E_X[m(X)] = E_Y[Y].
The name “balance property” comes from accounting, as we want assets (what
comes in, or premiums, m(x)) to equal liabilities (what goes out, or losses y) on
average. This concept, as it appears in economics in Borch (1962), corresponds to
“actuarial fairness,” and is based on a match between the total value of collected
premiums and the total amount of legitimate claims made by the policyholder. As
it is impossible for the insurer to know what future claims will actually be like,
it is considered actuarially fair to set the level of premiums on the basis of the
historical claims record of people in the same (assumed) risk class. It is on this
basis that discrimination is considered “fair” in distributional terms, as explained
in Meyers and Van Hoyweghen (2018). Otherwise, the redistribution would be
considered “unfair,” with forced solidarity from the low-risk group to the high-
risk group. This “fairness” was undermined in the 1980s, when private insurers
limited access to insurance for people with AIDS, or at risk of developing it,
as Daniels (1990) recalls. Feiring (2009) goes further in the context of genetic
information, “since the individual has no choice in selecting his genotype or its
expression, it is unfair to hold him responsible for the consequences of the genes
he inherits—just as it is unfair to hold him responsible for the consequences of any
distribution of factors that are the result of a natural lottery.” In the late 1970s (see
Boonekamp and Donaldson (1979), Kimball (1979) or Maynard (1979)), the idea
that the proportionality between the premium and the risk incurred would guarantee
fairness between policyholders began to be translated into conditional expectation
(conditional on the risk factors retained).
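To see the balance property numerically, here is a minimal sketch in R on simulated data (the age effect and frequencies are hypothetical); with a canonical log link and an intercept, a Poisson GLM reproduces the observed average on its training data, so the average fitted premium matches the average observed loss.

```r
set.seed(42)
n      <- 10000
age    <- sample(18:80, n, replace = TRUE)
lambda <- exp(-1.5 - 0.02 * (age - 18))      # hypothetical true claim frequency, decreasing with age
y      <- rpois(n, lambda)                   # simulated annual claim counts

fit <- glm(y ~ age, family = poisson(link = "log"))
m   <- fitted(fit)                           # m(x), the estimated pure frequency per policy

c(mean_observed = mean(y), mean_fitted = mean(m))   # equal: the balance property holds in-sample
```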
As discussed in Meyers and Van Hoyweghen (2018), who trace the emergence of
actuarial fairness from its conceptual origins in the early 1960s to its position at the
heart of insurance thinking in the 1980s, the concept of “actuarial fairness” appeared
as more and more countries adopted anti-discrimination legislation. At that time,
insurers positioned “actuarial fairness” as a fundamental principle that would be
jeopardized if the industry did not benefit from exemptions to such legislation. For
instance, according to the Equality Act 2010 in the U.K. “it is not a contravention
(...) to do anything in connection with insurance business if (a) that thing is done
by reference to information that is both relevant to the assessment of the risk to be
insured and from a source on which it is reasonable to rely, and (b) it is reasonable
to do that thing,” as Thomas (2017) wrote.
In most applications, there is a strong heterogeneity within the population, with
respect to risk occurrence and risk costs. For example, when modeling mortality,
the probability of dying within a given year can be above 50% for very old and

sick people, and less than 0.001% for pre-teenagers. Formally, the heterogeneity
will be modeled by a latent factor Θ. If y designates the occurrence (or not) of
an accident, y is seen as the realization of a random variable Y, which follows a
Bernoulli distribution, B(Θ), where Θ is a non-observable latent variable (as in
Gourieroux (1999) or Gourieroux and Jasiak (2007)). If y denotes the number of
accidents occurring during the year, Y follows a Poisson distribution, P(Θ) (or a
negative binomial model, or a parametric model with inflation of zeros, etc., as in
Denuit et al. (2007)). If y denotes the total cost, Y follows a Tweedie distribution,
or more generally a compound Poisson distribution, which we denote by L(Θ, ϕ),
where L denotes a distribution with mean Θ, and where ϕ is a dispersion parameter
(see Definition 3.13 for more details). The goal of the segmentation is to constitute
ratemaking classes (denoted B_i previously) in an optimal way, i.e., by ensuring
that one class does not subsidize the other, from observable characteristics, noted
x = (x_1, x_2, · · · , x_k). Crocker and Snow (2013) speak of “categorization based on
immutable characteristics.” For Gouriéroux (1999), it is the “static partition” used
immutable characteristics.” For Gouriéroux (1999), it is the “static partition” used


to constitute sub-groups of homogeneous risks (“in a given class, the individual
risks are independent, with identical distributions”). This is what a classification
or regression tree does, the .Bi ’s being the leaves of the tree, with the previous
probabilistic notations. If y designates the occurrence of an accident, or the annual
(random) load, the actuary tries to approximate .E[Y |X], from training data. In an
econometric approach, if y designates the occurrence (or not) of an accident, and
if .x designates the set of observable characteristics of the policyholder, .Y |X = x
follows a Bernoulli distribution, .B(px ), for example,

exp(x ⊤ β)
px =
. or px = Ф(x ⊤ β),
1 + exp(x ⊤ β)

for a logistic or probit regression respectively.5 If Y designates the number of
accidents that occurred during the year, Y | X = x follows a Poisson distribution,
P(λ_x), with typically λ_x = exp(x^⊤ β). If Y denotes the annual cost, Y | X = x
follows a Tweedie distribution, or more generally a compound Poisson distribution,
L(μ_x, ϕ), where L denotes a distribution of mean μ, with μ_x = E[Y | X = x] (for
more details, Denuit and Charpentier (2004, 2005)).
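As an illustration of these specifications (a minimal sketch in R on simulated data, with hypothetical coefficients and covariates), the logistic and Poisson regressions can be fitted with glm, and p_x or λ_x predicted for a given profile.

```r
set.seed(123)
n     <- 5000
age   <- runif(n, 18, 80)
urban <- rbinom(n, 1, 0.5)                              # hypothetical binary covariate

occurrence <- rbinom(n, 1, plogis(-2 + 0.03 * (50 - age) + 0.4 * urban))   # simulated claim occurrence
counts     <- rpois(n, exp(-2 + 0.02 * (50 - age) + 0.3 * urban))          # simulated claim counts

fit_logit   <- glm(occurrence ~ age + urban, family = binomial(link = "logit"))
fit_poisson <- glm(counts ~ age + urban, family = poisson(link = "log"))

newx <- data.frame(age = 22, urban = 1)                 # a given policyholder profile
c(p_x      = unname(predict(fit_logit,   newdata = newx, type = "response")),
  lambda_x = unname(predict(fit_poisson, newdata = newx, type = "response")))
```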


To return to the analysis of De Wit and Van Eeghen (1984), detailed in Denuit and
Charpentier (2004), if we assume that the risks are homogeneous, the pure premium
will be E[Y], and we have the risk-sharing table, Table 2.6. Without purchasing
insurance, policyholders face a random loss Y. With insurance, policyholders face
a fixed loss E[Y]. The risk is transferred to the insurance company, which faces a
random loss Y − E[Y]. On average, the loss for the insurance company is null, and
all the risk is carried by the insurer.
At the other extreme, if the latent risk factor Θ were observable, the requested
pure premium would be E[Y | Θ], and we would have the split of Table 2.7.

5 Φ is here the distribution function of the centered and reduced normal distribution, N(0, 1).

Table 2.6 Individual loss, its expected value, and its variance, for the policyholder on the left-
hand side and the insurer on the right-hand side. E[Y] is the premium paid, and Y the total loss,
from De Wit and Van Eeghen (1984) and Denuit and Charpentier (2004)

                 Policyholder     Insurer
Loss             E[Y]             Y − E[Y]
Average loss     E[Y]             0
Variance         0                Var[Y]

Table 2.7 Individual loss, its expected value, and its variance, for the policyholder on the left-
hand side and the insurer on the right-hand side. E[Y | Θ] is the premium paid, and Y the total loss,
from De Wit and Van Eeghen (1984) and Denuit and Charpentier (2004)

                 Policyholder       Insurer
Loss             E[Y | Θ]           Y − E[Y | Θ]
Average loss     E[Y]               0
Variance         Var[E[Y | Θ]]      Var[Y − E[Y | Θ]]

Table 2.8 Individual loss, its expected value, and its variance, for the policyholder on the left-
hand side, and the insurer on the right-hand side. E[Y | X] is the premium paid, and Y the total loss,
from De Wit and Van Eeghen (1984) and Denuit and Charpentier (2004)

                 Policyholder       Insurer
Loss             E[Y | X]           Y − E[Y | X]
Average loss     E[Y]               0
Variance         Var[E[Y | X]]      E[Var[Y | X]]

Proposition 2.3 (Variance Decomposition (1)) For any measurable random
variable Y with finite variance,

\[ \mathrm{Var}[Y] = \underbrace{E[\mathrm{Var}[Y \mid \Theta]]}_{\to\ \text{insurer}} + \underbrace{\mathrm{Var}[E[Y \mid \Theta]]}_{\to\ \text{policyholder}}. \]

Proof See Denuit and Charpentier (2004). □



Finally, using only observable features, denoted x = (x_1, x_2, · · · , x_k), we would
have the decomposition of Table 2.8.

Proposition 2.4 (Variance Decomposition (2)) For any measurable random vari-
able Y with finite variance,

\[ \mathrm{Var}[Y] = \underbrace{E[\mathrm{Var}[Y \mid X]]}_{\to\ \text{insurer}} + \underbrace{\mathrm{Var}[E[Y \mid X]]}_{\to\ \text{policyholder}}, \]

where

\[ E[\mathrm{Var}[Y \mid X]] = E\big[E[\mathrm{Var}[Y \mid \Theta] \mid X]\big] + E\big[\mathrm{Var}[E[Y \mid \Theta] \mid X]\big] = \underbrace{E[\mathrm{Var}[Y \mid \Theta]]}_{\text{perfect ratemaking}} + \underbrace{E\big[\mathrm{Var}[E[Y \mid \Theta] \mid X]\big]}_{\text{misclassification}}. \]

Proof See Denuit and Charpentier (2004). □
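To see these terms numerically, here is a minimal sketch in R with a simulated latent risk factor and an imperfect observable proxy (the distributions and rating classes are made up); it checks the decomposition of Var[Y] empirically.

```r
set.seed(7)
n     <- 1e6
Theta <- rgamma(n, shape = 2, rate = 2)       # latent (unobservable) risk factor, mean 1
X     <- cut(Theta, c(0, 0.7, 1.5, Inf))      # observable rating class, a coarse proxy of Theta
Y     <- rpois(n, Theta)                      # annual claim count, Y | Theta ~ Poisson(Theta)

mu_X    <- ave(Y, X)                          # empirical E[Y | X] within each rating class
within  <- mean((Y - mu_X)^2)                 # ~ E[Var[Y | X]], the part kept by the insurer
between <- var(mu_X)                          # ~ Var[E[Y | X]], passed on through differentiated premiums
c(var_Y = var(Y), decomposition = within + between)   # the two quantities (approximately) coincide
```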



This “misclassification” term (on the right) is called “subsidierende solidariteit”
in De Pril and Dhaene (1996), or “subsidiary solidarity,” as opposed to the “kanssolidariteit”
or “random solidarity” term (on the left). As certainty replaces uncertainty,
it will never disappear, at least for as long as it is a matter of predicting what may
happen during a year at the time of subscription. For Corlier (1998), segmentation
“decreases the solidarity of risks belonging to different segments.” And Löffler et al.
(2016), citing a McKinsey report on the future of the insurance industry, mentioned
that massive data will inevitably lead to de-mutualization and an increased focus on
prediction. Nevertheless, Lemaire et al. (2016) suggested some practical limits, as
“this process of segmentation, the sub-division of a portfolio of drivers into a large
number of homogeneous rating cells, only ends when the cost of including more risk
factors exceeds the profit that the additional classification would create, or when
regulators rule out new variables.”
The link with solidarity is discussed in Gollier (2002), which reminds us
that solidarity is fundamentally about making transfers in favor of disadvantaged
people, compared with advantaged people, as discussed in the introduction to
this chapter. But a very limited version of solidarity is taken into account in
the context of insurance: “solidarity in insurance means deciding not to segment
the corresponding risk market on the basis of the observable characteristics of
individuals’ risks,” as in health insurance or unemployment insurance. It should
be noted that although historically, the x variables were discretized in order to make
“tariff classes,” it is now conventional to consider continuous variables as such, or
even to transform them, while maintaining a relative regularity. According to De Wit
and Van Eeghen (1984), in the past, it used to be very difficult to discover risk
factors both in a qualitative and in a quantitative sense: “solidarity was therefore,
unavoidably, considerable. But recent developments have changed this situation:
with the help of computers it has become possible to make thorough risk analyses,
and consequently to arrive at further premium differentiation.”
Again, the difficulty with pricing is that this underlying risk factor Θ is not
observable. Not capturing it would lead to unfairness, as it would unduly subsidize
the “riskier” (likely to have more expensive claims) individuals with the “less risky.”
Baker and Simon (2002) went further, arguing that the reason why some people
are classified as “low risk” and others as “high risk” is irrelevant. Speaking of
automating accountability, Baker and Simon (2002) argued that it was important
to make people accountable for the risk that they bring to mutuality, especially
the riskiest policyholders, in order for the least risky policyholder to “feel morally

comfortable” (as Stone (1993) put it). The danger is that, in this way, the allocation
of each person’s contributions to mutuality would be the result of an actuarial
calculation, as Stone (1993) put it. Porter (2020) said that this process was “a way
of making decisions without seeming to decide.” We review this point when we
discuss exclusions and the interpretability of models. The insurer then uses proxies
to capture this heterogeneity, as we have just seen. A proxy (one might call it a
“proxy variable”) is a variable that is not significant in its own right, but which
replaces a useful but unobservable, or unmeasurable, variable, according to Upton
and Cook (2014).
Most of our discussion focuses on tariff discrimination, and more precisely on
the “technical” tariff. As mentioned in the introduction, from the point of view
of the policyholder, this is not the most relevant variable. Indeed, in addition to
the actuarial premium (the pure premium mentioned earlier), there is a commercial
component, as an insurance agent may decide to offer a discount to one policyholder
or another, taking into account a different risk aversion or a greater or lesser
price elasticity (see Meilijson 2006). But an important underlying question is “is
the provided service the same?” Ingold and Soper (2016) review the example of
Amazon not offering the same services to all its customers, in particular same-
day-delivery offers, offered in certain neighborhoods, chosen by an algorithm that
ultimately reinforced racial bias (by never offering same-day delivery in neighbor-
hoods composed mainly of minority groups). A naive reading of prices on Amazon
would be biased because of this important bias in the data, which should be taken
into account. As Calders and Žliobaite (2013) remind us, “unbiased computational
processes can lead to discriminative decision procedures.” In insurance, one could
imagine that a claims manager does not offer the same compensation to people
with different profiles—some people being less likely to dispute than others. It is
important to better understand the relationship between the different concepts.

2.5.3 Interpreting and Explaining Models

A large part of the actuary’s job is to motivate, and explain, a segmentation. Some
authors, such as Pasquale (2015), Castelvecchi (2016), or Kitchin (2017), have
pointed out that machine-learning algorithms are characterized by their opacity and
their “incomprehensibility,” sometimes called “black box” (or opaque) properties.
And it is essential to explain them, to tell a story. For Rubinstein (2012), as
mentioned earlier, models are “fables”: “economic theory spins tales and calls them
models. An economic model is also somewhere between fantasy and reality (...) the
word model sounds more scientific than the word fable or tale, but I think we are
talking about the same thing.” In the same way, the actuary will have to tell the
story of his or her model, before convincing the underwriting and insurance agents
to adopt it. But this narrative is necessarily imprecise. As Saint Augustine said,
“What is time? If no one asks me, I know. But if someone asks me and I want to
explain it, then I don’t know anymore.”

Fig. 2.3 The evolution of auto insurance claim frequency as a function of primary driver age,
relative to overall annual frequency, with a Poisson regression in yellow, a smoothed regression in
red, a smoothed regression with a too small smoothing bandwidth in blue, and with a regression tree
in green. Dots on the left are the predictions for a 22-year-old driver. (Data source: CASdataset
R package, see Charpentier (2014))

One can hear that age must be involved in the prediction of the frequency of
claims in car insurance, and indeed, as we see in Fig. 2.3, the prediction will not
be the same at 18, 25, or 55 years of age. Quite naturally, a premium surcharge
for young drivers can be legitimized, because of their limited driving experience,
coupled with unlearned reflexes. But this story does not tell us at what order
of magnitude this surcharge would seem legitimate. Going further, the choice
of model is far from neutral on the prediction: for a 22-year-old policyholder,
relatively simple models propose an extra premium of +27%, +73%, +82%, or
+110% (compared with the average premium for the entire population). Although
age discrimination may seem logical, how much difference can be allowed here, and
would be perceived as “quantitatively legitimate”? In Sect. 4.1, we present standard
approaches used to interpret actuarial predictive models, and explain predicted
outcomes.
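To illustrate how much the model choice matters, here is a minimal sketch in R on simulated data (not on the CASdatasets portfolio behind Fig. 2.3; the age effect is a made-up assumption), comparing the relative surcharge predicted at age 22 by a simple Poisson GLM and by a regression tree.

```r
set.seed(2024)
n      <- 20000
age    <- round(runif(n, 18, 80))
lambda <- 0.05 + 0.15 * exp(-(age - 18) / 8)     # hypothetical claim frequency, higher for young drivers
d      <- data.frame(y = rpois(n, lambda), age = age)

fit_glm  <- glm(y ~ age, family = poisson, data = d)       # log-linear age effect
library(rpart)
fit_tree <- rpart(y ~ age, data = d, method = "poisson")   # piecewise-constant age effect

newx <- data.frame(age = 22)
c(glm  = unname(predict(fit_glm, newx, type = "response")) / mean(d$y),
  tree = unname(predict(fit_tree, newx)) / mean(d$y))
# the implied surcharge at age 22 (relative to the portfolio average) depends on the model chosen
```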

2.6 From Technical to Commercial Premiums

So far, we have discussed heterogeneity in technical pure premiums, but in real-


life applications, some additional heterogeneity can yield additional sources of
discrimination.

2.6.1 Homogeneous Policyholders

Technical or actuarial premiums are purely based on risk characteristics, whereas


commercial premiums are based on economic considerations. In classical textbooks
in economics of insurance (see Dionne and Harrington (1992); Dionne (2000, 2013)
or Eisen and Eckles (2011)), homogeneous agents are considered perfectly informed
(not in the sense that there is no randomness, but they perfectly know the odds of
unfortunate events). They have a utility function u (perfectly known also), a wealth
w, and they agree to transfer the risk (or parts of the risk) against the payment of a
premium π if it satisfies

\[ u(w - \pi) \geq E\big[u(w - Y)\big]. \]

The utility that they have when paying the premium (on the left-hand side) exceeds
the expected utility that they have when keeping the risk (on the right-hand side).
Thus, an insurer, also with perfect knowledge of the wealth and utility of the agent
(or his or her risk aversion), could ask the following premium, named “indifference
premium”.
Definition 2.9 (Indifference Utility Principle) Let Y be the non-negative random
variable corresponding to the total annual loss associated with a given policy; for a
policyholder with utility u and wealth w, the indifference premium is
 
\[ \pi = w - u^{-1}\big(E\big[u(w - Y)\big]\big). \]

If u is the identity function, π = E[Y], corresponding to the technical, actuarial,
pure premium. And if the agent is risk averse, u is concave and π ≥ E[Y].
Consider a simple scenario where a policyholder with a wealth of w, and with a
concave utility function u, faces a potential random loss occurring with a probability
p, resulting in a total loss of wealth. In Fig. 2.4, we can visualize both π, the
indifference premium (from Definition 2.9), and E(Y), the pure premium (from
Definition 2.4). In Fig. 2.4, we use p = 2/5 just to get a visual perspective. Here,
the random loss Y takes two values,

\[ Y = \begin{cases} y_2 = w & \text{with probability } p = 2/5 \\ y_1 = 0 & \text{with probability } 1 - p = 3/5, \end{cases} \]

or equivalently, the wealth is

\[ w - Y = \begin{cases} w - y_2 = 0 & \text{with probability } p = 2/5 \\ w - y_1 = w & \text{with probability } 1 - p = 3/5. \end{cases} \]

The technical pure premium is here π_0 = E(Y) = p y_2 + (1 − p) y_1 = pw, and
when paying that premium, the wealth would be w − π_0 = (1 − p)w.
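As a numerical sketch in R of this two-state setting (assuming a square-root utility, which is not specified in the text), the indifference premium and the commercial loading can be computed directly.

```r
# Two-state loss of Fig. 2.4: total loss of wealth w with probability p = 2/5
# (assuming a square-root utility, an illustrative choice not taken from the text)
w <- 100
p <- 2/5
u     <- function(z) sqrt(z)     # concave utility (risk-averse agent)
u_inv <- function(v) v^2         # its inverse

pi0 <- p * w                                  # pure premium E(Y)
EU  <- p * u(0) + (1 - p) * u(w)              # expected utility without insurance
pi  <- w - u_inv(EU)                          # indifference premium

c(pure = pi0, indifference = pi, loading = pi - pi0)   # 40, 64 and 24 with these values
```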

Fig. 2.4 Utility and (ex-post) wealth, with an increasing concave utility function u, whereas the
straight line corresponds to a linear utility u_0 (risk neutral). Starting with initial wealth ω, the
agent will have random wealth W after 1 year, with two possible states: either w_1 (complete loss,
on the left part of the x-axis) or w_2 = ω (no loss, on the right part of the x-axis). Complete loss
occurs with 40% chance (2/5). π_0 is the pure premium (corresponding to a linear utility) whereas
π is the commercial premium. All premiums in the colored area are high enough for the insurance
company, and low enough for the policyholder

If the agent is (strictly) risk averse, u(w − π_0) > E[u(w − Y)], in the sense that
the insurance company can ask for a higher premium than the pure premium

\[ \begin{cases} \pi_0 = E[Y] > 0 & \text{: actuarial (pure) premium} \\ \pi - \pi_0 = w - E[Y] - u^{-1}\big(E\big[u(w - Y)\big]\big) \geq 0 & \text{: commercial loading.} \end{cases} \]

We come back to the practice of price optimization in Sect. 2.6.3.

2.6.2 Heterogeneous Policyholders

“Actuarial fairness” refers to the notion of “legitimate” discrimination when it is


based on a risk factor. According to Thomas (2012), certain laws forbid discrimi-
nation based on age but often include provisions allowing exceptions for insurance
underwriting. However, these exceptions typically apply only to differences that
can be justified by variances in the underlying risk and the technical premium. In
the previous section, we discussed economic models, based on individual wealth
w and utility functions u, that can actually be heterogeneous. As mentioned in
Boonen and Liu (2022), with information about personal characteristics, the insurer
can customize insurance coverage and premium for each individual in order to
optimize his/her objective function. Commercial insurance premiums then depend
on observable variables that correlate with the individual’s risk-aversion parameter,
such as gender, age, or even race (as considered in Pope and Sydnor (2011)). Such
discrimination may violate insurance regulations with respect to discrimination.

Fig. 2.5 On the left, the same graph as in Fig. 2.4, with utility on the y-axis and (ex-post) wealth
on the x-axis, with an increasing concave utility function u, and an ex-post random wealth W
after 1 year, with two possible states: either w_1 (complete loss) or w_2 = ω (no loss). Complete
loss occurs with 40% chance (2/5) on the left-hand side, whereas complete loss occurs with 60%
chance (3/5) on the right-hand side. Agents have identical risk aversion and wealth, on both graphs.
Indifference premium is larger when the risk is more likely (with the additional black part on the
technical pure premium; here, the commercial loading is almost the same)

In the context of heterogeneity of the underlying risk only, consider the case in
which heterogeneity is captured through covariates X and where agents have the
same wealth w and the same utility u,

\[ \begin{cases} \pi_0(x) = E[Y \mid X = x] & \text{: actuarial premium} \\ \pi - \pi_0 = w - E[Y \mid X = x] - u^{-1}\big(E\big[u(w - Y) \mid X = x\big]\big) & \text{: commercial loading.} \end{cases} \]

For example, in Fig. 2.5, we have on the left, the same example as in Fig. 2.4,
corresponding to some “good” risk,

\[ Y = \begin{cases} y_2 = w & \text{with probability } p = 2/5 \\ y_1 = 0 & \text{with probability } 1 - p = 3/5. \end{cases} \]

On the right, we have some “bad risk,” where the value of the loss is unchanged, but
the probability of claiming a loss is higher (p′ > p). In Fig. 2.5,

\[ Y = \begin{cases} y_2 = w & \text{with probability } p' = 3/5 > 2/5 \\ y_1 = 0 & \text{with probability } 1 - p' = 2/5. \end{cases} \]

In that case, it could be seen as legitimate, and fair, to ask a higher technical
premium, and possibly to add the appropriate loading then.

If heterogeneity is no longer in the underlying risk, but in the risk aversion (or
possibly the wealth), and if u is now a function of some covariates, u_x, we should write

\[ \begin{cases} \pi_0(x) = \mu(x) = E[Y] > 0 & \text{: actuarial premium} \\ \pi - \pi_0 = w - E[Y] - u_x^{-1}\big(E\big[u_x(w - Y)\big]\big) \geq 0 & \text{: commercial loading.} \end{cases} \]

Here, we used the expected utility approach from Von Neumann and Morgenstern
(1953) to illustrate, but alternatives could be considered.
The approach described previously is also named “differential pricing,” where
customers with a similar risk are charged different premiums (for reasons other than
risk occurrence and magnitude). Along these lines, Central Bank of Ireland (2021)
considered “price walking” as discriminatory. “Price walking” corresponds to the
case where longstanding, loyal policyholders are charged higher prices for the same
services than customers who have just switched to that provider. This is a well-
documented practice in the telecommunications industry that can also be observed
in insurance (see Guelman et al. (2012) or He et al. (2020), who model attrition rate,
or “customer churn”). According to Central Bank of Ireland (2021), the practice of
price walking is “unfair” and could result in unfair outcomes for some groups of
consumers, both in the private motor insurance and household insurance markets.
For example, long-term customers (those who stayed with the same insurer for 9
years or more) pay, on average, 14% more for private car insurance and 32% more
for home insurance than the equivalent customer renewing for the first time.

2.6.3 Price Optimization and Discrimination

“We define price optimization in P&C insurance [property and casualty insurance,
or nonlife insurance6 ] as the supplementation of traditional supply-side actuarial
models with quantitative customer demand models,” explained Bugbee et al. (2014).
Duncan and McPhail (2013), Guven and McPhail (2013), and Spedicato et al. (2018)
mention that such practices are intensively discussed by practitioners, even if they
did not get much attention in the academic journals. Notable exceptions would
be Morel et al. (2003), who introduced myopic pricing, whereas more realistic
approaches, named “semi-myopic pricing strategies”, were discussed in Krikler
et al. (2004) or more recently Ban and Keskin (2021).

6 Formally, property covers a home (physical building) and the belongings in it from all losses such

as fire, theft, etc., or covers damage to a car when involved in an accident, including protection from
damage/loss caused by other factors such as fire, vandalism, etc. Casualty involves coverage if one
is being held responsible for someone injuring themselves on his or her property, or if one were to
cause any damage to someone else’s property, and coverage if one gets into an accident and causes
injuries to someone else or damage to their car.

Many regulators believe that price optimization is unfairly discriminatory (as shown in Box 2.3, with some regulations in some states, in the USA). Is it legitimate discrimination to have premiums depend on willingness or ability to pay, and on risk aversion? According to the Code of Professional Conduct7 (Precept 1, on
“Professional Integrity”), “an actuary shall act honestly (...) to fulfill the profession’s
responsibility to the public and to uphold the reputation of the actuarial profession.”

Box 2.3 Price Optimization in the USA


• Alaska, Wing-Heir (2015)
The practice of adjusting either the otherwise applicable manual rates or
premiums or the actuarially indicated rates or premiums based on any of the
following is considered inconsistent with the statutory requirement that “rates
shall not be (...) unfairly discriminatory,” whether or not such adjustment is
included within the insurer’s rating plan:
(a) Price elasticity of demand;
(b) Propensity to shop for insurance;
(c) Retention adjustment at an individual level; and
(d) A policyholder’s propensity to ask questions or file complaints.
• California, Volkmer (2015)
Price Optimization does not seek to arrive at an actuarially sound estimate
of the risk of loss and other future costs of a risk transfer. Therefore, any use
of Price Optimization in the ratemaking/pricing process or in a rating plan is
unfairly discriminatory in violation of Californian law.
• District of Columbia, Taylor (2015)
Price optimization refers to an insurer’s practice of charging the maximum
premium that it expects an individual or class of individuals to bear, based
upon factors that are neither risk of loss related nor estimated expense related.
For example, an insurer may charge a nonprice-sensitive individual a higher
premium than it would charge a price-sensitive individual; despite their risk
characteristics being equal. This practice is discriminatory and it violates the
District’s anti-discrimination insurance laws codified at D.C. Official Code
§31-2231.13(c), 31-2703(a) and 31-2703(b).


7 See https://www.soa.org/about/governance/about-code-of-professional-conduct/.



• Pennsylvania, Miller (2015b)
With the advent of sophisticated pricing tools, including computer software
and rating models referred to as price optimization, insurers, rating orga-
nizations, and advisory organizations are reminded that policyholders and
applicants with identical risk classification profiles—that is, risks of the same
class and essentially the same hazard—must be charged the same premium.
Rates that fail to reflect differences in expected losses and expenses with
reasonable accuracy are unfairly discriminatory under Commonwealth law
and will not be approved by the Department.

2.7 Other Models in Insurance

So far, we have discussed only premium principles, but predictive models are used
almost everywhere in insurance.

2.7.1 Claims Reserving and IBNR

Another interesting application is reserving (see Hesselager and Verrall (2006) or


Wüthrich and Merz (2008) for more details). Loss reserves are a major item in
the financial statement of an insurance company and in terms of how it is valued
from the perspective of possible investors. The development and the release of
reserves are furthermore important input variables to calculate the MCEV (market
consistent embedded value), which “provides a means of measuring the value of
such business at any point in time and of assessing the financial performance of
the business over time,” as explained in American Academy of Actuaries (2011).
Hence, the estimates of unpaid losses give management important input for their
strategy, pricing, and underwriting. A reliable estimate of the expected losses is
therefore crucial. Traditional models of reserving for future claims are mainly based on claims triangles (e.g., Chain Ladder or Bornhuetter-Ferguson as distribution-free methods, as described in De Alba (2004)) or on distribution-based (stochastic) models. Both work with data aggregated at the level of the gross insurance portfolio, or of a subportfolio, since the methodology requires portfolio-based inputs, e.g., reported or paid losses, and prior expected quantities such as losses or premiums. The reserving amount can be influenced by many factors, for example, the composition of the claim, medical advancement, life expectancy, legal changes, etc. The consequence is a loss of potentially valuable information at the level of the single contract, as the determining drivers are entirely disregarded.
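As a brief illustration of the triangle-based logic mentioned above, the following R sketch computes Chain-Ladder development factors on a small, entirely made-up cumulative triangle, and completes the lower-right part; it is only a sketch of the distribution-free mechanics, not of the stochastic or individual-level alternatives.

# A minimal Chain-Ladder sketch, on a small made-up cumulative claims triangle
# (rows = occurrence years, columns = development years, NA = not yet observed)
C <- matrix(c(100, 150, 175, 180,
              110, 168, 192,  NA,
              120, 185,  NA,  NA,
              130,  NA,  NA,  NA), nrow = 4, byrow = TRUE)
n <- ncol(C)
latest <- apply(C, 1, function(x) tail(na.omit(x), 1))   # last observed diagonal
f <- sapply(1:(n - 1), function(j) {                     # volume-weighted development factors
  obs <- which(!is.na(C[, j + 1]))
  sum(C[obs, j + 1]) / sum(C[obs, j])
})
for (j in 1:(n - 1)) {                                   # complete the lower-right triangle
  miss <- which(is.na(C[, j + 1]))
  C[miss, j + 1] <- C[miss, j] * f[j]
}
reserves <- C[, n] - latest                              # estimated outstanding amounts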

2.7.2 Fraud Detection

Fraud is not self-revealed and therefore must be investigated, as noted in Guillen (2006) and Guillen and Ayuso (2008). Tools for detecting fraud span all kinds of actions
undertaken by insurers. They may involve human resources, data mining, external
advisors, statistical analyses, and monitoring. The methods currently available for
detecting fraudulent or suspicious claims based on human resources rely on video
or audiotape surveillance, manual indicator cards, internal audits, and information
collected from agents or informants. Methods based on data analysis seek external
and internal data information. Automated methods use various machine-learning techniques, such as fuzzy set clustering in Derrig and Ostaszewski (1995), simple regression models in Derrig and Weisberg (1998), or GLMs, with a logistic regression in Artís et al. (1999, 2002) and a probit model in Belhadji
et al. (2000), or neural networks, in Brockett et al. (1998) or Viaene et al. (2005).

2.7.3 Mortality

The modeling and forecasting of mortality rates have been subject to extensive
research in the past. The most widely used approach is the “Lee-Carter Model,”
from Lee and Carter (1992) with its numerous extensions. More recent approaches
involve nonlinear regression and GLMs. But recently, many machine-learning
algorithms have been used to detect (unknown) patterns, such as Levantesi and
Pizzorusso (2019), with decision trees, random forests, and gradient boosting. Perla
et al. (2021) generalized the Lee-Carter Model with a simple convolutional network.
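To illustrate the structure of the Lee-Carter Model, log m_{x,t} = a_x + b_x k_t + ε_{x,t}, the following R sketch estimates the three components with the usual singular value decomposition, on a purely simulated matrix of central death rates (in practice, dedicated packages such as StMoMo or demography would rather be used, with rates obtained from observed deaths and exposures).

# A minimal Lee-Carter fit via singular value decomposition (illustrative sketch:
# m is a simulated matrix of central death rates, ages in rows, years in columns)
set.seed(1)
ages  <- 0:90
years <- 1950:2020
m <- outer(exp(-8 + 0.09 * ages), exp(-0.01 * (years - 1950))) *
     exp(matrix(rnorm(length(ages) * length(years), sd = 0.05),
                nrow = length(ages)))

logm <- log(m)
a_x  <- rowMeans(logm)                 # average log-mortality by age
Z    <- sweep(logm, 1, a_x)            # centered log-rates
dec  <- svd(Z)                         # rank-one approximation of Z
b_x  <- dec$u[, 1] / sum(dec$u[, 1])   # age pattern, normalized so that sum(b_x) = 1
k_t  <- dec$d[1] * dec$v[, 1] * sum(dec$u[, 1])   # period (calendar-year) index
fit  <- a_x + outer(b_x, k_t)          # fitted log m_{x,t} = a_x + b_x * k_t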

2.7.4 Parametric Insurance

Parametric insurance is also an area where predictive models are important. Here,
we consider guaranteed payment of a predetermined amount of an insurance claim
upon the occurrence of a stipulated triggering event, which must be some predefined
parameter or metric specifically related to the insured person’s particular risk
exposure, as explained in Hillier (2022) or Jerry (2023).

2.7.5 Data and Models to Understand the Risks

So far, we mentioned the use of data and models in the context of estimating a
“fair premium.” But it should also be highlighted that insurance companies have
helped to improve the quality of life in many countries, using data that they

collected. A classic example is the development of epidemiology. Indeed, as early


as the nineteenth century, insurance doctors initiated an approach that prefigures
systematic medical examinations, developing our contemporary medicine, based
on prevention, or increasingly oriented toward patients who do not look after
themselves. As early as 1905, John Welton Fischer, medical director of the
Northwestern Mutual Life Insurance Company and a member of the Association
of Life Insurance Medical Directors of America, became interested in the routine
measurement of blood pressure in the examination of life insurance applicants. He was the first to do so at a time when the newly invented blood pressure monitor (tensiometer) had not really proved its worth and was still confined to
experimental use. At the beginning of 1907, Fischer began to measure the systolic
blood pressure of applicants aged between 40 and 60 years. He then instructed his
company’s physicians to perform this measurement in cities with more than 100,000
inhabitants. By 1913, 85% of his company’s applicants had had their blood pressure
measured. Although Fischer’s conclusions are clear, he does not explain how he
foresaw the importance of this measurement as a risk factor. When Fischer proposed
the introduction of blood pressure measurement for the newly insured person, there
was no information on the prognosis associated with elevated blood pressure, nor
was there a clear definition of what “normal” pressure should be, as Kotchen (2011)
recalls. The relationship between blood pressure and cardiovascular morbidity was
still completely unknown, despite some work by clinicians. Insurance companies
produced the first prospective statistics for hypertension, a term that did not then
refer to any well-defined disease or concept. In 1911, Fischer wrote a letter to the
Medical Directors Association explaining to his peers that “the sphygmomanometer
is indispensable in life insurance examinations, and the time is not far distant when
all progressive life insurance companies will require its use in all examinations of
applicants for life insurance.”
In 1915, the Prudential Life Insurance Company had already measured the
blood pressure of 18,637 applicants, the New York Life Insurance Company that
of 62,000 applicants for insurance and, in 1922, the New York experiment of the
Metropolitan Life Insurance Company totaled 500,000 examinations in more than
8000 insured persons, recalls Dingman (1927). No private practitioner, no hospital
doctor, no organization, until then, had been able to compile such statistics. In a
series of reports that began with Dublin (1925), the Actuarial Society of America
described the distribution of blood pressure across the population, the age-related
increases in blood pressure, and the relationships of blood pressure to both body
size and mortality. This report studied a cohort of 20,000 insured persons, aged 38
to 42 years, with measurements of systolic and diastolic blood pressure. The report
showed an increase in systolic and diastolic blood pressures with age. At younger
ages, systolic and diastolic blood pressures were lower in women than in men.
Blood pressure also increased progressively with age in both men and women. The
report also showed that systolic and diastolic blood pressure increased with height
in men, defined in terms of “build groups” (average weight for each inch of height)
in different age groups of men. The report eventually noted that changes in diastolic blood

pressure were more important than changes in systolic blood pressure in predicting
mortality. For insurers, this information, although measured on an ad hoc basis, was
sufficient to exclude certain individuals or to increase their insurance premiums.
The designation of hypertension as a risk factor for reduced life expectancy was not
based on research into the risk factors for hypertension, but on a simple measure of
correlation and risk analysis. And the existence of a correlation did not necessarily
indicate a causal link, but this was not the concern of the many physicians working
for insurers. Medical research was then able to work on a better understanding of
these phenomena, observed by the insurers, who had access to these data (because
they had the good idea of collecting them).
Chapter 3
Models: Overview on Predictive Models

Abstract In this chapter, we give an overview on predictive modeling, used by


actuaries. Historically, we moved from relatively homogeneous portfolios to tariff
classes, and then to modern insurance, with the concept of “premium personal-
ization.” Modern modeling techniques are presented, starting with econometric
approaches, before presenting machine-learning techniques.

As we have seen in the previous chapter, insurance is deeply related to predictive


modeling. But contrary to popular opinion that models and algorithms are purely
objective, O’Neil (2016) explains in her book that “models are opinions embedded
in mathematics (...). A model's blind spots reflect the judgments and priorities of
its creators.” In this chapter (and the next one), we get back to general ideas about
actuarial modeling.

3.1 Predictive Model, Algorithms, and “Artificial


Intelligence”

3.1.1 Probabilities and Predictions

We will not start a philosophical discussion about risk and uncertainty here. How-
ever, in actuarial science, all stories begin with a probabilistic model. “Probability is
the most important concept in modern science, especially as nobody has the slightest
notion what it means” said Bertrand Russell in a conference, back in the early 1930s,
quoted in Bell (1945). Very often, the “physical” probabilities receive an objective
value, on the basis of the law of large numbers, as the empirical frequency converges
toward “the probability” (frequentist theory of probabilities).


Proposition 3.1 (Law of Large Numbers (1)) Consider an infinite collection of i.i.d. random variables $X, X_1, X_2, \cdots, X_n, \cdots$ in a probabilistic space $(\Omega, \mathcal{F}, \mathbb{P})$, then
$$
\underbrace{\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}(X_i \in A)}_{\text{(empirical) frequency}} \xrightarrow{\ \text{a.s.}\ } \underbrace{\mathbb{P}(\{X \in A\}) = \mathbb{P}[A]}_{\text{probability}}, \quad \text{as } n \to \infty.
$$

Proof Strong law of large numbers (also called Kolmogorov’s law), see Loève
(1977), or any probability textbook. □

This is a so-called “physical” probability, or “objective.” It means that if we throw
a die a lot of times (here n), the “probability” of obtaining a 6 with this die is the
empirical frequency of 6 we obtained. Of course, with a perfectly balanced die, there
is no need to repeat throws of the die to affirm that the probability of obtaining a 6 on a given throw is equal to 1/6 (by the symmetry of the cube). But if we repeat the experiment of throwing a die millions of times, 1/6 should be close to the frequency of appearance of 6, corresponding to the "frequentist" definition of the concept of "probability." Almost 200 years ago, Cournot (1843) already distinguished an "objective meaning" of the probability (as a measure of the physical possibility of realization of a random event) and a "subjective meaning" (the probability being a judgment made on an event, this judgment being linked to the ignorance of the conditions of the realization of the event).
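The frequentist reading of Proposition 3.1 is easy to reproduce by simulation; in the R sketch below (with an arbitrary seed), the empirical frequency of 6's among n simulated throws of a balanced die gets closer and closer to 1/6 as n grows.

# Empirical frequency of "6" converging to 1/6 (strong law of large numbers)
set.seed(123)
throws <- sample(1:6, 1e5, replace = TRUE)          # simulated throws of a balanced die
freq   <- cumsum(throws == 6) / seq_along(throws)   # running empirical frequency
freq[c(10, 100, 1000, 10000, 100000)]               # gets closer to 1/6, about 0.1667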
If we use that “frequentist” definition (also coined “long-run probability” in
Kaplan (2023) as Proposition 3.1 is an asymptotic result), we are unable to make
sense of the probability of a “single singular event,” as noted by von Mises (1928,
1939): “When we speak of the ‘probability of death’, the exact meaning of this
expression can be defined in the following way only. We must not think of an
individual, but of a certain class as a whole, e.g., ‘all insured men forty-one years
old living in a given country and not engaged in certain dangerous occupations’.
A probability of death is attached to the class of men or to another class that can
be defined in a similar way. We can say nothing about the probability of death of
an individual even if we know his condition of life and health in detail. The phrase
‘probability of death’, when it refers to a single person, has no meaning for us at
all.” And there are even deeper paradoxes, that can be related to latent risk factors
discussed in the previous chapter, and the “true underlying probability” (to claim
a loss, or to die). In a legal context, Fenton and Neil (2018) quoted a judge, who
was told that a person was less than .50% guilty: “look, the guy either did it or
he did not do it. If he did then he is 100% guilty and if he did not then he is
0% guilty; so giving the chances of guilt as a probability somewhere in between
makes no sense and has no place in the law.” The main difference with actuarial
pricing, is that we should estimate probabilities associated with future events. But
still, one can wonder if “the true probability” is a concept that makes sense when
signing a contract. Thus, the goal here will be to train a model that will compute

a score that might be interpreted as a "probability" (this will raise the question of "calibration" of a model, i.e., the connection between that score and the "observed frequencies" (interpreted as probabilities), as discussed in Sect. 4.3.3).
Given a probability measure .P, one can define “conditional probabilities,” the
standard notation being the vertical bar. .P[A|B] is the conditional probability that
event .A occurs given the information that event .B occurred. It is the ratio of the
probability that both .A and .B occurred (corresponding to .P[A ∩ B]) over the
probability that .B occurred. Based on that definition, we can derive Bayes formula.
Proposition 3.2 (Bayes Formula) Given two events A and B such that $\mathbb{P}[B] \neq 0$,
$$
\mathbb{P}[A\mid B] = \frac{\mathbb{P}[B\mid A]\cdot \mathbb{P}[A]}{\mathbb{P}[B]} \propto \mathbb{P}[B\mid A]\cdot \mathbb{P}[A].
$$

Proof Bayes (1763) and Laplace (1774). □



Besides the mathematical expression, that formula has two possible interpretations.
The first one corresponds to an “update of beliefs,” from a prior distribution .P[A] to
a posterior distribution .P[A|B], given some additional information B. The second
one is related to an “inverse problem,” where we try to determine the causes of a
phenomenon from the experimental observation of its effects. An example could be
the one where A is a disease and B is a symptom (or a set of symptoms), and with Bayes' rule (see Spiegelhalter et al. (1993) for more details, with multiple diseases and multiple symptoms),
$$
\mathbb{P}[\text{disease}\mid \text{symptom}] \propto \mathbb{P}[\text{symptom}\mid \text{disease}] \cdot \mathbb{P}[\text{disease}].
$$
Another close example would be one where B is the result of a test, and
$$
\mathbb{P}[\text{cancer}\mid \text{test positive}] \propto \mathbb{P}[\text{test positive}\mid \text{cancer}] \cdot \mathbb{P}[\text{cancer}].
$$
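As a small numerical sketch of Bayes' formula (with hypothetical values for the prevalence, the sensitivity, and the specificity of the test), the posterior probability of having cancer given a positive test is obtained directly:

# Bayes' formula with hypothetical values: prevalence 1%, sensitivity 90%,
# specificity 95%
prior <- 0.01                                       # P[cancer]
sens  <- 0.90                                       # P[test positive | cancer]
spec  <- 0.95                                       # P[test negative | no cancer]
p_pos <- sens * prior + (1 - spec) * (1 - prior)    # P[test positive]
sens * prior / p_pos                                # P[cancer | test positive], about 15.4%

Even with a rather accurate test, the posterior probability remains moderate here, because the prior prevalence is low.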

A convention in actuarial literature is to suppose that random variables live in a probabilistic space $(\Omega, \mathcal{F}, \mathbb{P})$, without much discussion about the probability measure $\mathbb{P}$. In most sections of this book, $\mathbb{P}$ is the (unobservable) probability measure associated with the portfolio of the insurer (or associated with the training dataset). And we will use $\mathbb{P}_n$, associated with sample $\mathcal{D}_n$, as explained in Sect. 1.2.1. For instance, using the validation dataset from GermanCredit, let X denote the age of the person with a loan, and $\widehat{Y}$ the score of a random person (obtained from a logistic regression), and the gender is the sensitive attribute $S \in \{A, B\}$, so that
$$
\begin{cases}
\mathbb{P}_n[X \in [18;25] \mid S = A] = 32\% &\text{and}\quad \mathbb{P}_n[\widehat{Y} > 50\% \mid S = A] = 25\%\\
\mathbb{P}_n[X \in [18;25] \mid S = B] = 10\% &\text{and}\quad \mathbb{P}_n[\widehat{Y} > 50\% \mid S = B] = 21\%.
\end{cases}
$$

But because there is competition in the market, $\mathbb{P}_n$ can be different from $\mathbb{P}$, the probability measure for the entire population,
$$
\begin{cases}
\mathbb{P}[X \in [18;25] \mid S = A] = 20\% &\text{and}\quad \mathbb{P}[\widehat{Y} > 50\% \mid S = A] = 20\%\\
\mathbb{P}[X \in [18;25] \mid S = B] = 15\% &\text{and}\quad \mathbb{P}[\widehat{Y} > 50\% \mid S = B] = 15\%.
\end{cases}
$$

There could also be some target probability measure $\mathbb{P}^\star$, as underwriters can be willing to target some specific segments of the population, as discussed in Chaps. 7 and 12 (on change of measures),
$$
\begin{cases}
\mathbb{P}^\star[X \in [18;25] \mid S = A] = 25\% &\text{and}\quad \mathbb{P}^\star[\widehat{Y} > 50\% \mid S = A] = 25\%\\
\mathbb{P}^\star[X \in [18;25] \mid S = B] = 20\% &\text{and}\quad \mathbb{P}^\star[\widehat{Y} > 50\% \mid S = B] = 20\%,
\end{cases}
$$

or possibly some "fair" measure $\mathbb{Q}$, as we discuss in Chaps. 8 (on quantifying group fairness) and 12 (when mitigating discrimination), that will satisfy some independence properties,
$$
\begin{cases}
\mathbb{Q}[X \in [18;25] \mid S = A] = 39\% &\text{and}\quad \mathbb{Q}[\widehat{Y} > 50\% \mid S = A] = 23\%\\
\mathbb{Q}[X \in [18;25] \mid S = B] = 15\% &\text{and}\quad \mathbb{Q}[\widehat{Y} > 50\% \mid S = B] = 23\%.
\end{cases}
$$

It is also possible to mention here the fact that the model is fitted on past data, associated with probability measure $\mathbb{P}_n$, but because of the competition on the market, or because of the general economic context, the structure of the portfolio might change. The probability measure for next year will then be $\widetilde{\mathbb{P}}_n$ (with
$$
\begin{cases}
\widetilde{\mathbb{P}}_n[X \in [18;25] \mid S = A] = 35\% &\text{and}\quad \widetilde{\mathbb{P}}_n[\widehat{Y} > 50\% \mid S = A] = 27\%\\
\widetilde{\mathbb{P}}_n[X \in [18;25] \mid S = B] = 20\% &\text{and}\quad \widetilde{\mathbb{P}}_n[\widehat{Y} > 50\% \mid S = B] = 27\%,
\end{cases}
$$
if our score, used to assess whether we give a loan to some clients, attracts more young (and more risky) people). We do not discuss this issue here, but the "generalization" property should be with respect to a new unobservable and hard-to-predict probability measure $\widetilde{\mathbb{P}}_n$ (and not $\mathbb{P}$ as usually considered in machine learning, as discussed in the next sections).
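Those empirical (conditional) probabilities are simply conditional frequencies computed on a given dataset. The short R sketch below illustrates the computation on a simulated validation set; the column names (age, score, gender) are placeholders, and do not correspond to the actual variables of the GermanCredit files.

# Empirical conditional probabilities P_n[ . | S = s], on a simulated dataset
# (column names and values are hypothetical placeholders)
set.seed(3)
valid <- data.frame(
  age    = sample(18:70, 200, replace = TRUE),
  score  = runif(200),
  gender = sample(c("A", "B"), 200, replace = TRUE)
)
sapply(split(valid, valid$gender), function(d) {
  c(young      = mean(d$age >= 18 & d$age <= 25),   # P_n[X in [18;25] | S = s]
    high_score = mean(d$score > 0.5))               # P_n[Y_hat > 50%  | S = s]
})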

3.1.2 Models

Predictive models are used to capture a relationship between a response variable


y (typically a claim occurrence, a claim frequency, a claim severity, or an annual

Fig. 3.1 A simple linear model, a piecewise constant (green) model, or a complex model (nonlinear but continuous), from ten observations $(x_i, y_i)$, where x is a temperature in degrees Fahrenheit and y is the temperature in degrees Celsius, at the same location i

cost) and a collection of predictors, denoted1 .x, as explained in Schmidt (2006).


If y is binary, a classical model will be the logistic regression, for example (see
Dierckx 2006), and more generally, actuaries have used intensively generalized
linear models (GLMs, see Frees (2006) or Denuit et al. (2019a)), where a prediction
of the outcome y is obtained by transforming a linear combination of predictors.
Econometric models (and GLMs) are popular as they rely strongly on probabilistic
models, and the insurance business is based on the randomness of events.
For Ekeland (1995), modeling is the (intellectual) construction of a mathematical
model, i.e., a network of equations supposed to describe reality. Very often, a model
is also, above all, a simplification of this reality. A model that is too complex
is not a good model. This is the idea of over-learning (or “overfitting”) that is
found in statistics, or the concept of parsimony, sometimes called “Ockham’s razor”
(as in Fig. 3.1), which is typical in econometrics and discussed by William of
Ockham (in the fourteenth century). As Milanković (1920) stated, “in order to
be able to translate the phenomena of nature into mathematical language, it is
always necessary to admit simplifications and to simplify certain influences and
irregularities.” The model is a simplification of the world, or, as Korzybski (1958)
said in a geography context, “a map is not the territory it represents, but, if correct,
it has a similar structure to the territory, which accounts for its usefulness.” The
map is not the territory: the map reflects our representation of the world, whereas the
territory is the world as it really is. We naturally think of Borges (1946) (or Umberto Eco's pastiche on the impossibility of constructing the 1:1 map of the Empire, in Eco (1992)),2 "en aquel Imperio, el Arte de la Cartografía logró tal Perfección que
el mapa de una sola Provincia ocupaba toda una Ciudad, y el mapa del Imperio,

1 As discussed previously, notation z is also used later on, and we distinguish admissible predictors x, and sensitive ones s. In this chapter, we mainly use x, as in most textbooks.
2 "In that Empire, the Art of Cartography achieved such Perfection that the map of a single Province occupied an entire City, and the map of the Empire, an entire Province. In time, these inordinate maps were not satisfactory and the Colleges of Cartographers drew up a map of the Empire, which was the size of the Empire and coincided exactly with it" [personal translation].

toda una Provincia. Con el tiempo, estos Mapas Desmesurados no satisficieron y


los Colegios de Cartógrafos levantaron un Mapa del Imperio, que tenía el tamaño
del Imperio y coincidía puntualmente con él.”
The notion of model seems to have been replaced by the term “algorithm”, or
even “artificial intelligence”, or for short A.I. (notably in the press, see Milmo
(2021), Swauger (2021) or Smith (2021) among many others). For Zafar et al.
(2019), “algorithm” means predictive models (decision rules) calibrated from his-
torical data through data-mining techniques. To understand the difference, Cardon
(2019) gives an example to explain what machine learning is. It is quite simple to
write a program that converts a temperature in degrees Fahrenheit to a temperature
given in degrees Celsius. To do this, there is a simple rule: subtract 32 from the
temperature in degrees Fahrenheit x and multiply the result by 5/9 (or divide by 1.8), i.e.,
$$
y \leftarrow \frac{5}{9}\,(x - 32).
$$
A machine learning (or artificial intelligence) approach offers a very different
solution. Instead of coding the rule into the machine (what computer scientists might
call “Good Old Fashioned Artificial Intelligence,” as Haugeland (1989)), we simply
give to the machine several examples of matches between temperatures in Celsius
and Fahrenheit .(xi , yi ). We enter the data into a training dataset, and the algorithm
will learn a conversion rule by itself, looking for the closest candidate function to
the data. We can then find an example like the one in Fig. 3.1, with some data and
different models (one simple (linear) and one more complex).
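Cardon's example can be reproduced in a few lines of R: instead of hard-coding the conversion rule, one fits a model on a handful of pairs (x_i, y_i), here simulated with a small amount of noise, and lets the data recover the "subtract 32 and multiply by 5/9" rule.

# Learning the Fahrenheit-to-Celsius rule from examples, instead of coding it
set.seed(42)
x <- runif(10, 10, 100)                     # temperatures in degrees Fahrenheit
y <- 5/9 * (x - 32) + rnorm(10, sd = 0.2)   # (noisy) temperatures in degrees Celsius

fit <- lm(y ~ x)                            # "learn" the conversion rule from the data
coef(fit)                                   # close to (-17.78, 0.556), i.e. y = 5/9 (x - 32)
predict(fit, newdata = data.frame(x = 68))  # about 20 degrees Celsius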
It is worth noting that the “complexity” of certain algorithms, or their “opacity”
(which leads to the term “black box”), has nothing to do with the optimization
algorithm used (in deep learning, back-propagation is simply an iterative mechanism
for optimizing a clearly described objective). It is mainly that the model obtained
may seem complex, impenetrable, to take into account the possible interactions
between the predictor variables, for example. For the sake of completeness, a
distinction should be made between classical supervised machine-learning algo-
rithms and reinforcement learning techniques. The latter case describes sequential
(or iterative) learning methods, where the algorithm learns by experimenting, as
described in Charpentier et al. (2021). We find these algorithms in automatic driving,
for example, or if we wanted to correctly model the links between the data, the
constructed model, the new connected data, the update of the model, etc. But we
will not insist more on this class of models here.
To conclude this first section, let us stress that in insurance models, the goal
is not to predict “who” will die, and get involved in a car accident. Actuaries
create scores that are interpreted as the probability of dying, or the probability of


getting a bodily injury claim, in order to compute “fair” premiums. To use a typical
statistical example, let y denote the face of a die, potentially loaded. If p is the
(true) probability of falling on 6 (say .14.5752%), we say at first that we are able
to acquire information about the way the die was made, about its roughness, its
imperfections, that will allow us to refine our knowledge on this probability, but
also that we have a model able to link the information in an adequate way. Knowing
better the probability of falling on 6 does not guarantee that the die will fall on
6, the random component does not disappear, and will never disappear. Translated
into the insurance problem, p might denote the “true” probability that a specific
policyholder will get involved in a car accident. Based on external information $x$, some model will predict that the probability of being in an accident is $\widehat{p}_x$ (say 11.1245%). As mentioned by French statistician Alfred Sauvy, "dans toute
statistique, l’inexactitude du nombre est compensée par la précision des décimales”
(or “in all statistics, the inaccuracy of the number is compensated for by the
precision of the decimals,” infinite precision we might add). The goal is not to find a
model that returns either .0% or .100% (this happens with several machine-learning
algorithms), simply to assess with confidence a valid probability, used to compute
for a “fair” premium. And an easy way is perhaps to use simple categories: the
probability of getting a 6 being less than 15% (less than for a fair die), between 15% and 18.5% (close to a fair die), and more than 18.5% (more than for a fair die).

3.2 From Categorical to Continuous Models

“Risk classification” is an old and natural way to get insurance premiums, as


explained in Sect. 2.3.3. Not only are higher-rated insured persons less likely to engage in the risky activity, but risk classification also provides incentives for risk reduction (merit rating in auto insurance encourages adherence to traffic regulations;
experience-rating in workers’ compensation encourages employers to eliminate
workplace hazards, etc.). And as suggested by Feldblum (2006), risk classification
also promotes social welfare by making insurance coverage more widely available.

3.2.1 Historical Perspective, Insurers as Clubs

In ancient Rome, a collegium (plural collegia) was an association. They functioned


as social clubs, or religious collectives, whose members worked toward their shared
interests, as explained in Verboven (2011). During Republican Rome (which began
with the overthrow of the Roman Kingdom, 509 before the common era, and ended
with the establishment of the Roman Empire, 27 before our era), military collegia
were created. As explained in Ginsburg (1940), upon the completion of his service
a veteran had the right to join one of the many collegia veteranorum in each
legion. The Government established special savings banks. Half of the cash bonuses,

donativa, which the emperors attributed to the soldiers on various occasions, was
not handed over to the beneficiaries in cash but was deposited to the account of each
soldier in his legion’s savings bank. This could be seen as an insurance scheme, and
risks against which a member was insured were diverse. In the case of retirement,
upon the completion of his term of service, the soldier would receive a lump sum that
helped him to somewhat arrange the rest of his life. The membership in a collegium
gave him a mutual insurance against “unforeseen risks.” These collegia, besides
being cooperative insurance companies, had other functions. And because of the
structure of those collegia based on corporatism, members were quite homogeneous.
Sometime in the early 1660s, the Pirate’s Code was supposedly written by
the Portuguese buccaneer Bartolomeu Português. And interestingly, a section is
explicitly dedicated to insurance and benefits: “a standard compensation is provided
for maimed and mutilated buccaneers. Thus they order for the loss of a right arm six
hundred pieces of eight, or six slaves; for the loss of a left arm five hundred pieces
of eight, or five slaves; for a right leg five hundred pieces of eight, or five slaves;
for the left leg four hundred pieces of eight, or four slaves; for an eye one hundred
pieces of eight, or one slave; for a finger of the hand the same reward as for the
eye,” see Barbour (1911) (or more recently Leeson (2009) and Fox (2013) about
this piratical scheme).
In the nineteenth century, in Europe, mutual aid societies involved a group of
individuals who made regular payments into a common fund in order to provide for
themselves in later, unforeseeable moments of financial hardship or of old age. As
mentioned by Garrioch (2011), in 1848, there were 280 mutual aid societies in Paris
with well over 20,000 members. For example, the Société des Arts Graphiques was
created in 1808. It admitted only men over 20 and under 50, and it charged much
higher admission and annual fees for those who joined at a more advanced age. In
return, they received benefits if they were unable to work, reducing over a period of
time, but in the case of serious illness the Society would pay the admission fee for a
hospice. In England, there were “friendly societies,” as described in Ismay (2018).
In France, after the 1848 revolution and Louis Napoléon Bonaparte’s coup d’état
in 1851, mutual funds were seen as a means of harmonizing social classes. The
money collected through contributions came to the rescue of unfortunate workers,
who would no longer have any reason to radicalize. It was proposed that insurance
should become compulsory (Bismarck proposed this in Germany in 1883), but the
idea was rejected in favor of giving workers the freedom to contribute, as the only
way to moralize the working classes, as Da Silva (2023) explains. In 1852, of the
236 mutual funds created, 21 were on a professional basis, whereas the other 215
were on a territorial basis. And from 1870 onward, mutual funds diversified the
professional profile of contributors beyond blue-collar workers, and expanded to
include employees, civil servants, the self-employed, and artists. But the amount of
the premium was not linked to the risk. As Da Silva (2023) puts it, "mutual insurers
see in the actuarial figure the programmed end of solidarity.” For mutual funds,
solidarity is essential, with everyone contributing according to their means and
receiving according to their needs. At around the same time, in France, the first
insurance companies appeared, based on risk selection, and the first mathematical

approaches to calculating premiums. Hubbard (1852) advocates the introduction of


an “English-style scientific organization” in their management. For its members,
they had to be able to know “the probable average of the claims” that they should
cover, like insurance companies. The development of tables should lead insurers
to adopt the principle of contributions varying according to the age of entry and
the specialization of contributions and funds (health/retirement). It is with this in
mind that they drew up tables. For Stone (1993) and Gowri (2014) the defining
feature of “modern insurance” is its reliance on segmenting the risk pool into
distinct categories, each receiving a price corresponding to the particular risk that
the individuals assigned to that category are expected to represent (as accurately as
can be estimated by actuaries).

3.2.2 “Modern Insurance” and Categorization

Once heterogeneity with respect to the risk was observed in portfolios, insurers have
operated by categorizing individuals into risk classes and assigning corresponding
tariffs. This ongoing process of categorization ensures that the sums collected,
on average, are sufficient to address the realized risks within specific groups.
The aim of risk classification, as explained in Wortham (1986), is to identify the
specific characteristics that are supposed to determine an individual’s propensity to
suffer an adverse event, forming groups within which the risk is (approximately)
equally shared. The problem, of course, is that the characteristics associated with
various types of risk are almost infinite; as they cannot all be identified and priced
in every risk classification system, there will necessarily be unpriced sources of
heterogeneity between individuals in a given risk class.
In 1915, as mentioned in Rothstein (2003), the president of the Association of
Life Insurance Medical Directors of America noted that the question asked almost
universally of the Medical Examiner was “What is your opinion of the risk? Good,
bad, first-class, second-class, or not acceptable?” Historically, insurance prices
were a (finite) collection of prices (maybe more than the two classes mentioned,
“first-class” and “second-class”). In Box 3.1, in the early 1920s, Albert Henry
Mowbray, who worked for the New York Life Insurance Company and later Liberty
Mutual (and was also an actuary for state-level insurance commissions in North Carolina and California, and the National Council on Workmen's Insurance) gives
his perspective on insurance rate making.

Box 3.1 Historical perspective, Albert Henry Mowbray (1921)


“Classification of risks in some manner forms the basis of rate making
in practically all branches of insurance. It would appear therefore that
there should be some fundamental principle to which a correct system of



classification in any branch of insurance should conform (...) As long ago as the days of ancient Greece and Rome the gradual transition of natural phenomena was observed and set down in the Latin maxim, 'natura non agit per saltum'. If each risk, therefore, is to be precisely rated, it would be necessary to recognize very minute differences and precisely measure them. (...) Since we are not capable of covering a large field fully and at
the same time recognizing small differences in all parts of the field, it is
natural that we resort to subdivision of the field by means of classification,
thereby concentrating our attention on a smaller interval which may again be
subdivided by further classification, and the system so carried on to the limit
to which we find it necessary or desirable to go. But however far we may go
in any system of classification, whether in the field of pure or applied science
including the business or insurance, we shall always find difficulties presented
by the borderline case, difficulties which arise from the continuous character
of natural phenomena which we are attempting to place in more or less
arbitrary divisions. While thus acknowledging that classification will never
completely solve the problem of recognizing differences between individuals,
nevertheless classification seems to be necessary at least as a preliminary step
toward such recognition in any field of study. The fact that a complete and
final solution cannot be made is, therefore, no justification for completely
discarding classification as a method of approach. Since it is insurance hazards
that we undertake to measure and classify, the preliminary step in studying
classification theory may well be to ask what is an insurance hazard and how
it may be determined. It must be evident to the members of this Society that
an insurance hazard is what is termed “a mathematical expectation,” that is
a product of a sum at risk and the probability of loss from the conditions
insured against, e.g., the destruction of a piece of property by fire, the death
of an individual, etc. If the net premiums collected are so determined on the
basis of the true natural probability and there is a sufficient spread then the
sums collected will just cover the losses and this is what should be.”
“1. The classification should bring together risks which have inherent in
their operation the same causes of loss. 2. The variation from risk to risk in the
strength of each cause or at least of the more important should not be greater
than can be handled by the formula by which the classification is subdivided,
i.e., the Schedule and / or Experience Rating Plan used. 3. The classification
should not cover risks which include, as important elements of their hazard,
causes which are not common to all. 4. The classification system and the
formula for its extension (Schedule and / or Experience Rating Plans) should
be harmonious. 5. The basis throughout should be the outward, recognizable
indicia of the presence and potency of the several inherent causes of loss
including extent as well as occurrence of loss.”

Several articles and textbooks in sociology tried to understand how classification


mechanisms establish symbolic boundaries that reinforce group identities, such as
Bourdieu (2018), Lamont and Molnár (2002), Massey (2007), Ridgeway (2011),
Fourcade and Healy (2013), or Brubaker (2015). But here, those “groups” or
“classes” do not share any identity, and Simon (1988) or Harcourt (2015b) use the
term “actuarial classification” (where “actuarial” designates any decision-making
technique that relies on predictive statistical methods, replacing more holistic or
subjective forms of judgment). In those class-based systems, based on insurance
rating tables (or grids), results are determined by assigning individuals to a group
in which each person is positioned as "average" or "typical". [Most] "actuaries cannot think of individuals except as members of groups," claimed Brilmayer et al. (1979). Each individual is allocated the same value as all other members of
the group to which it is assigned (as opposed to models discussed in Sect. 3.3,
where a model gives to each individual its own unique value or score, as close as
possible, as explained in Fourcade (2016)). Simon (1987, 1988), and then Feeley
and Simon (1992), defined “actuarialism,” that designate the use of statistics to
guide “class-based decision-making,” used to price pensions and insurance. As
explained in Harcourt (2015b), this “actuarial classification” is the constitution
of groups with no experienced social significance for the participants. A person
classified as a particular risk by an insurance company shares nothing with the other
people thus classified, apart from a series of formal characteristics (e.g., age, sex,
marital status). As we see in Sect. 4.1 on interpretability, actuaries try ex-post to
give social representations to those groups. For Austin (1983) and Simon (1988),
categories used by the insurance company when grouping risks are “singularly
sterile,” resulting in inert, immobile, and deactivated communities, corresponding to
“artificial” groups. These are not groups organized around a shared history, common
experiences, or active commitment, forming some “aggregates”—living only in the
imagination of the actuary who calculates and tabulates, not in any lived form of
human association. If Hacking (1990) observed that standard classes create coherent
group identities (causing possible stereotypes and discrimination, as we discuss
in Part III), Simon (1988), provocatively suggests that actuarial classifications
can in turn “undo people’s identity.” As mentioned in Abraham (1986), the goal
for actuaries is to create groups, or “classes” made up of individuals who share
a series of common characteristics and are therefore presumed to represent the
same risk. Following François (2022), we could claim that actuarial techniques
reduce individuals to a series of formal roles that have no “moral density” and
therefore do not grant an “identity” that organizes a coherent sense of self. And
the inclusion of nominally “demoralized categories,” such as gender, in class-based
rating systems makes their total demoralization difficult to achieve—and is in itself
an issue of struggle. Heimer (1985) used the term “community of fate.” These
“communities” created artificially by statisticians are, in that sense, very different
from the communities of workers, neighbors, and co-religionists that characterized
the traditional mutual organizations displaced by modern forms of insurance, as
explained in Gosden (1961), Clark and Clark (1999), Levy (2012), or Zelizer
(2017). Furthermore, Rouvroy et al. (2013) and Cheney-Lippold (2017) point out

that scoring technologies are continually swapping predictors, “shuffling the cards,”
so that there is no stable basis for constructing group memberships, or a coherent
sense.
Harry S. Havens in the late 1970s gave the description mentioned in Box 3.2.

Box 3.2 Historical perspective, Harry S. Havens (1979)


“The price which a person pays for automobile insurance depends on age, sex,
marital status, place of residence and other factors. This risk classification
system produces widely differing prices for the same coverage for different
people. Questions have been raised about the fairness of this system, and
especially about its reliability as a predictor of risk for a particular individual.
While we have not tried to judge the propriety of these groupings, and the
resulting price differences, we believe that the questions about them warrant
careful consideration by the State insurance departments. In most States the
authority to examine classification plans is based on the requirement that
insurance rates are neither inadequate, excessive, nor unfairly discriminatory.
The only criterion for approving classifications in most States is that the
classifications be statistically justified—that is, that they reasonably reflect
loss experience. Relative rates with respect to age, sex, and marital status are
based on the analysis of national data. A youthful male driver, for example,
is charged twice as much as an older driver all over the country (...) It
has also been claimed that insurance companies engage in redlining—the
arbitrary denial of insurance to everyone living in a particular neighborhood.
Community groups and others have complained that State regulators have
not been diligent in preventing redlining and other forms of improper
discrimination that make insurance unavailable in certain areas. In addition
to outright refusals to insure, geographic discrimination can include such
practices as: selective placement of agents to reduce business in some areas,
terminating agents and not renewing their book of business, pricing insurance
at unaffordable levels, and instructing agents to avoid certain areas. We
reviewed what the State insurance departments were doing in response to
these problems. To determine if redlining exists, it is necessary to collect
data on a geographic basis. Such data should include current insurance
policies, new policies being written, cancellations, and non-renewals. It is
also important to examine data on losses by neighborhoods within existing
rating territories because marked discrepancies within territories would cast
doubt on the validity of territorial boundaries. Yet, not even a fifth of the States
collect anything other than loss data, and that data is gathered on a territory-
wide basis.”

In Box 3.3, a paragraph from Casey et al. (1976) provides some historical
perspective, by Barbara Casey, Jacques Pezier and Carl Spetzler.

Box 3.3 Historical perspective, Casey et al. (1976)


“On the other hand, the opinion that distinctions based on sex, or any other
group variable, necessarily violate individual rights reflects ignorance of the
basic rules of logical inference in that it would arbitrarily forbid the use of
relevant information. It would be equally fallacious to reject a classification
system based on socially acceptable variables because the results appear
discriminatory. For example, a classification system may be built on the use
of a car, mileage, merit rating, and other variables, excluding sex. However,
when verifying the average rates according to sex one may discover significant
differences between males and females. Refusing to allow such differences
would be attempting to distort reality by choosing to be selectively blind. The
use of rating territories is a case in point. Geographical divisions, however
designed, often correlate with socio-demographic factors such as income level
and race because of natural aggregation or forced segregation according to
these factors. Again, we conclude that insurance companies should be free to
delineate territories and assess territorial differences as well as they can. At
the same time, insurance companies should recognize that it is in their best
interest to be objective and use clearly relevant factors to define territories lest
they be accused of invidious discrimination by the public. (...) One possible
standard does exist for exception to the counsel that particular rating variables
should not be proscribed. What we have called ‘equal treatment’ standard of
fairness may precipitate a societal decision that the process of differentiating
among individuals on the basis of certain variables is discriminatory and
intolerable. This type of decision should be made on a specific, statutory basis.
Once taken, it must be adhered to in private and public transactions alike and
enforced by the insurance regulator. This is, in effect, a standard for conduct
that by design transcends and preempts economic considerations. Because it
is not applied without economic cost, however, insurance regulators and the
industry should participate in and inform legislative deliberations that would
ban the use of particular rating variables as discriminatory.”

3.2.3 Mathematics of Rating Classes

As mentioned in Sect. 2.5, an important theorem when modeling heterogeneity is


the variance decomposition property, or "law of total variance" (corresponding to the Pythagorean Theorem, see Proposition 2.3),
$$
\operatorname{Var}[Y] = \underbrace{\mathbb{E}\big[\operatorname{Var}[Y\mid \Theta]\big]}_{\text{within variance}} + \underbrace{\operatorname{Var}\big[\mathbb{E}[Y\mid \Theta]\big]}_{\text{between variance}}.
$$

Here, the variance of the outcome Y is decomposed into two parts, one representing
the variance due to the variability of the underlying risk factor $\Theta$, and one reflecting the inherent variability of Y if $\Theta$ did not vary (the homogeneous case). One can
recognize that a similar idea is the basis for analysis of variance (ANOVA) models
(as formalized in Fisher (1921) and Fisher and Mackenzie (1923)) where the total
variability is split into the “within groups” and the “between groups.” The “one-way
ANOVA” is a technique that can be used to compare whether or not the means of
two (or more) samples are significantly different. If the outcome y is continuous
(extensions can be obtained for binary variables, or counts), suppose that
$$
y_{i,j} = \mu_j + \varepsilon_{i,j},
$$
where i is the index over individuals, and j the index over groups (with $j = 1, 2, \cdots, J$). $\mu_j$ is the mean of the observations for group j, and errors $\varepsilon_{i,j}$ are
supposed to be zero-mean (normally distributed as a classical assumption). One
could also write
$$
y_{i,j} = \mu + \alpha_j + \varepsilon_{i,j}, \quad\text{where } \alpha_1 + \alpha_2 + \cdots + \alpha_J = 0,
$$
where $\mu$ is the overall mean, whereas $\alpha_j$ is the deviation from the overall mean,
for group j. Of course, one can generalize that model to multiple factors. In the "two-way ANOVA," two types of groups are considered,
$$
y_{i,j,k} = \mu_{j,k} + \varepsilon_{i,j,k},
$$
where j is the index over groups according to the first factor, whereas k is the index over groups according to the second factor. $\mu_{j,k}$ is the mean of the observations for groups j and k, and errors $\varepsilon_{i,j,k}$ are supposed to be zero-mean. We can write the mean as a linear combination of factors, in the sense that
$$
y_{i,j,k} = \underbrace{\mu + \alpha_j + \beta_k + \gamma_{j,k}}_{=\mu_{j,k}} + \varepsilon_{i,j,k},
$$

where $\mu$ is still the overall mean, whereas $\alpha_j$ and $\beta_k$ correspond to the deviations from the overall mean, and $\gamma_{j,k}$ is the non-additive interaction effect. In order to have identifiability of the model, some "sum-to-zero" constraints are added, as previously,
$$
\sum_{j=1}^{J} \alpha_j = \sum_{j=1}^{J} \gamma_{j,k} = 0 \quad\text{and}\quad \sum_{k=1}^{K} \beta_k = \sum_{k=1}^{K} \gamma_{j,k} = 0.
$$

A more modern way to consider those models is to use linear models. For example, for the "one-way ANOVA," we can write $\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where
$$
\boldsymbol{y} = (y_{1,1}, \cdots, y_{n_1,1}, y_{1,2}, \cdots, y_{n_2,2}, \cdots, y_{1,J}, \cdots, y_{n_J,J})
$$
$$
\boldsymbol{\varepsilon} = (\varepsilon_{1,1}, \cdots, \varepsilon_{n_1,1}, \varepsilon_{1,2}, \cdots, \varepsilon_{n_2,2}, \cdots, \varepsilon_{1,J}, \cdots, \varepsilon_{n_J,J})
$$
$$
\boldsymbol{\beta} = (\beta_0, \beta_1, \cdots, \beta_J) \quad\text{and}
$$
$$
\boldsymbol{X} = [\boldsymbol{1}_n, \boldsymbol{A}] \ \text{ where }\
\boldsymbol{A} = \begin{pmatrix} \boldsymbol{1}_{n_1} & 0 & \cdots & 0\\ 0 & \boldsymbol{1}_{n_2} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \boldsymbol{1}_{n_J}\end{pmatrix}, \ \text{ and }\
\boldsymbol{X} = \begin{pmatrix} \boldsymbol{1}_{n_1} & \boldsymbol{1}_{n_1} & 0 & \cdots & 0\\ \boldsymbol{1}_{n_2} & 0 & \boldsymbol{1}_{n_2} & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \boldsymbol{1}_{n_J} & 0 & 0 & \cdots & \boldsymbol{1}_{n_J}\end{pmatrix}
$$
are respectively $n \times J$ and $n \times (J+1)$ matrices. In the first approach, $\boldsymbol{y} = \boldsymbol{A}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$,


and the ordinary least squares estimate is
$$
\widehat{\boldsymbol{\beta}} = (\boldsymbol{A}^\top\boldsymbol{A})^{-1}\boldsymbol{A}^\top\boldsymbol{y} = (\overline{y}_{\cdot 1}, \cdots, \overline{y}_{\cdot J}) \in \mathbb{R}^J, \quad\text{where } \overline{y}_{\cdot j} = \frac{1}{n_j}\sum_{i=1}^{n_j} y_{i,j},
$$
so that $\widehat{\mu}_j = \overline{y}_{\cdot j}$ is simply the average within group j. In the second case, if $y_{i,j} = \mu + \alpha_j + \varepsilon_{i,j}$, where $\alpha_1 + \alpha_2 + \cdots + \alpha_J = 0$, we can prove that
$$
\widehat{\mu} = \widetilde{y} = \frac{1}{J}\sum_{j=1}^{J} \overline{y}_{\cdot j} \quad\text{and}\quad \widehat{\alpha}_j = \overline{y}_{\cdot j} - \widetilde{y},
$$

where the estimator of $\mu$ is the average of the group averages. An alternative is to change the constraint slightly, so that $n_1\alpha_1 + n_2\alpha_2 + \cdots + n_J\alpha_J = 0$, and in that case
$$
\widehat{\mu} = \overline{y} = \frac{1}{n}\sum_{j=1}^{J} n_j\, \overline{y}_{\cdot j} \quad\text{and}\quad \widehat{\alpha}_j = \overline{y}_{\cdot j} - \overline{y}.
$$
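These identities are easy to check numerically. In the R sketch below (on simulated data), regressing y on the group factor without an intercept returns exactly the group averages, whereas the parametrization with an intercept and "sum-to-zero" contrasts returns the overall level and the deviations $\alpha_j$.

# One-way ANOVA as a linear model (simulated data)
set.seed(1)
J <- 3; n_j <- c(30, 50, 20)
group <- factor(rep(1:J, times = n_j))
y <- rnorm(sum(n_j), mean = c(1, 2, 4)[group])   # group-specific means

coef(lm(y ~ 0 + group))      # exactly the group averages (the mu_j's)
tapply(y, group, mean)       # same values

# intercept + deviations, with the "sum-to-zero" constraint (contr.sum):
# the intercept is the average of the group averages, and the remaining
# coefficients are the alpha_j's (the last one being minus their sum)
coef(lm(y ~ group, contrasts = list(group = "contr.sum")))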

Let $j \in \{1, 2, \cdots, J\}$ and $k \in \{1, 2, \cdots, K\}$, and let $n_{j,k}$ denote the number of observations in group j for the first factor, and k for the second. Define averages
$$
\overline{y}_{\cdot jk} = \frac{1}{n_{jk}}\sum_{i=1}^{n_{jk}} y_{ijk}, \quad
\overline{y}_{\cdot j\cdot} = \frac{1}{n_{j\cdot}}\sum_{k=1}^{K}\sum_{i=1}^{n_{jk}} y_{ijk}, \quad\text{and}\quad
\overline{y}_{\cdot\cdot k} = \frac{1}{n_{\cdot k}}\sum_{j=1}^{J}\sum_{i=1}^{n_{jk}} y_{ijk}.
$$

The model is here
$$
y_{ijk} = \mu + \alpha_j + \beta_k + \gamma_{jk} + \varepsilon_{ijk},
$$
which we can write, using $(J + K + JK)$ indicators (vectors in dimension n, with respectively, in each block, $n_{j\cdot}$, $n_{\cdot k}$ and $n_{jk}$ ones), as a classical regression problem. As previously, under identifiability assumptions, it is possible to have interpretable estimates of those quantities,
$$
\widehat{\mu} = \overline{y}, \quad
\widehat{\alpha}_j = \overline{y}_{\cdot j\cdot} - \overline{y}, \quad
\widehat{\beta}_k = \overline{y}_{\cdot\cdot k} - \overline{y}, \quad
\widehat{\gamma}_{jk} = \overline{y}_{\cdot jk} - \overline{y}_{\cdot j\cdot} - \overline{y}_{\cdot\cdot k} + \overline{y}.
$$
Without the non-additive interaction effect, the model becomes
$$
y_{ijk} = \mu + \alpha_j + \beta_k + \varepsilon_{ijk}.
$$

Such models were used historically in claims reserving (see Kremer (1982) for a
formal connection), and, of course, in ratemaking. As explained in Bennett (1978),
“in a rating structure used in motor insurance there may typically be about eight
factors, each having a number of different levels into which risks may be classified
and then be charged different rates of premium,” with either an “additive model” or
a "multiplicative model" for the premium $\mu$ (with notations of Bennett (1978)),
$$
\begin{cases}
\mu_{jk\cdots} = \widehat{\mu} + \widehat{\alpha}_j + \widehat{\beta}_k + \cdots & \text{additive,}\\[4pt]
\mu_{jk\cdots} = \widehat{\mu} \cdot \widehat{\alpha}_j \cdot \widehat{\beta}_k \cdots & \text{multiplicative,}
\end{cases}
$$
where $\alpha_j$ is a parameter value for the j-th level of the first risk factor, etc., and $\mu$ is a constant corresponding to some "overall average level."
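In more modern terms, Bennett's additive and multiplicative structures correspond to (generalized) linear models with identity and log links, respectively. The R sketch below fits both on simulated claim counts with two rating factors; the factor names and levels are of course hypothetical.

# Additive versus multiplicative rating structures, with two rating factors
set.seed(2)
n      <- 5000
agegrp <- factor(sample(c("18-25", "26-60", "60+"), n, replace = TRUE))
area   <- factor(sample(c("urban", "rural"), n, replace = TRUE))
lambda <- exp(-2 + 0.6 * (agegrp == "18-25") + 0.3 * (area == "urban"))
claims <- rpois(n, lambda)                  # simulated annual claim counts

fit_add <- lm(claims ~ agegrp + area)       # additive:       mu = m + a_j + b_k
fit_mul <- glm(claims ~ agegrp + area,      # multiplicative: mu = m . a_j . b_k
               family = poisson(link = "log"))
exp(coef(fit_mul))                          # multiplicative relativities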
Historically, classification relativities were determined one dimension at a time
(see Feldblum and Brosius (2003), and the appendices to McClenahan (2006) and
Finger (2006) for some illustration of the procedure). Then, Bailey and Simon
(1959, 1960) introduced the “minimum bias procedure.”
In Fig. 3.2, we can visualize a dozen classes associated with credit risk (on the
GermanCredit database), with on the x-axis, predictions given by two models,
and the empirical default probability on the y-axis (that will correspond to a discrete
version of the calibration plot described in Sect. 4.3.3).
As discussed in Agresti (2012, 2015), there are strong connections between
those approaches based on groups and linear models, and actuarial research started
to move toward “continuous” models. Nevertheless, the use of categories has
been popular in the industry for several decades. For example, Siddiqi (2012)
recommends cutting all continuous variables into bins, using a so-called “weight-
of-evidence binning” technique, usually seen as an “optimal binning” for numerical
and categorical variables using methods including tree-like segmentation, or Chi-
squared merge. In R, it can be performed using the woebin function of the
scorecard package. For example, on the GermanCredit dataset, three con-
tinuous variables are divided into bins, as in Fig. 3.3. For the duration (in months),
bins are A = [0, 8), B = [8, 16), C = [16, 34), D = [34, 44), and E = [44, 80); for

Fig. 3.2 Scatterplot with predictions $\widehat{y}$ on various groups, and average outcomes $\overline{y}$, on the database GermanCredit, with the logistic regression (GLM) on the left and the random forest (RF) on the right. Here, y corresponds to the indicator of having a bad risk. Sizes of circles are proportional to sizes of groups

Fig. 3.3 From continuous variables to categories (five categories .{A, B, C, D, E}), for three
continuous variables of the GermanCredit dataset: duration of the credit, amount of credit,
and age of the applicant. Bars in the background are the number of applicants in each bin (y-axis
on the left), and the line is the probability of having a default (y-axis on the right)

the credit amount, bins are A = .[0, 1400), B = .[1400, 1800), C = .[1800, 4000), D
= .[4000, 9200), B = .[9200, 20000); and for the age of the applicant, A = .[19, 26),
B = .[26, 28), C = .[28, 35), D = .[35, 37), and E = .[37, 80). The use of categorical
features, to create ratemaking classes is now obsolete, as more and more actuaries
consider “individual” pricing models.
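A minimal sketch of this weight-of-evidence binning, with the woebin function of the scorecard package, could look as follows; it assumes that the German credit data shipped with that package (and its column names, such as duration.in.month or creditability) correspond to the dataset discussed here.

library(scorecard)
data("germancredit")
# automatic binning (tree-like segmentation by default) of three continuous variables
bins <- woebin(germancredit, y = "creditability",
               x = c("duration.in.month", "credit.amount", "age.in.years"))
bins$duration.in.month    # breaks, counts, default rates and weight-of-evidence per bin
woebin_plot(bins)         # plots in the spirit of Fig. 3.3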
76 3 Models: Overview on Predictive Models

3.2.4 From Classes to Score

Instead of considering risk classes, the measurement of risk can take a very different
form, which we could call “individualization”, or “personalization”, as in Barry
and Charpentier (2020). In many respects, the latter is a kind of asymptotic limit
of the first one, as the number of classes increases.
By significantly reducing the population through the assignment of individuals to
exclusive categories, and ensuring that each category consists of a single individ-
ual, the processes of “categorization” and “individualization” begin to converge.
Besides computational aspects (discussed in the next section), this approach is
fundamentally altering the logical properties of reasoning, as discussed in François
(2022) and Krippner (2023). When individualized measures are employed, they are
situated on a continuous scale: individuals are assigned highly precise scores which,
although they may occasionally be shared with others, essentially enable the ranking
of individuals in relation to one another. These scores are no longer discrete,
discontinuous, qualitative categories, but numerical values that can, therefore, be
subjected to calculations (as explained in the previous chapter). Furthermore,
these numbers possess cardinal value in the sense that they not only facilitate the
ranking of risks in comparison with one another but also denote a quantity (of
risk) amenable to reasoning and notably computation. Last, probabilities can be
associated with these numbers, which are not the property of a group but that of
an individual: risk measurement is no longer intended to designate the probability
of an event occurring within a group once in a thousand trials; it is aimed at
providing a quantified assessment of the default risk associated with a specific
individual, in their idiosyncrasy and irreducible singularity. Risk measurement has
thus evolved into an individualized measure, as François (2022) claims. Thanks to those
scores, individual policyholders are directly comparable. As Friedler et al. (2016)
explained, “the notion of the group ceases to be a stable analytical category and
becomes a speculative ensemble assembled to inform a decision and enable a course
of action (…) Ordered for a different purpose, the groups scatter and reassemble
differently.” In the next section, we present techniques used by actuaries to model
risks, and compute premiums.

3.3 Supervised Models and “Individual” Pricing

If we are going to talk here mainly about insurance pricing models, i.e., supervised
models where the variable of interest y is the occurrence of a claim in the coming
year, the number of claims, or the total charge, it is worth keeping in mind that
the input data (.x) can be the predictions of a model. For example .x1 could be

an observed acceleration score from the previous year (computed by an external


provider who had access to the raw telematics data), .x2 could be the distance to
the nearest fire station (extrapolated from distance-to-address software), .x3 can be
the age of the policyholder, .x4 could be a prediction of the number of kilometers
driven, etc. (in Chap. 5, we discuss more predictive variables used by actuaries). In
the training dataset, the “past observations” .yi can also be predictions, especially if
we want to use recent claims, still pending, but where the claims manager can give
an estimate, based on human experts but also possibly on opaque algorithms. We
can think of those applications that give a cost estimate of a car damage claim based
on a photo of the vehicle sent by the policyholder, or the use of compensation scales
for claims not yet settled.
As we see in this section, a natural “model” or “predictor” for a variable y is
related to the conditional expected value. If y corresponds to the total annual loss
associated with a given policy, we have seen in the previous chapter that .E[Y ] was
the “homogeneous pure premium” (see Definition 2.4) and .E[Y |X] corresponds to
the “heterogeneous pure premium” (see Definition 2.7). In the classical collective
risk model, .Y = Z1 + · · · + ZN is a compound sum, a random sum of random
individual losses, and under standard assumptions (see Denuit and Charpentier
(2004) or Denuit et al. (2007)), .E[Y |X] = E[N|X] · E[Z|X], where the first
term .E[N|X] is the expected annual claim frequency for a policyholder with
characteristics .X, whereas .E[Z|X] is the average cost of a single claim. In this
chapter, quite generally, y is one of those variables of interest used to calculate a
premium.
If x and y are discrete variables, recall that

$$E[Y|X=x]=\sum_{y}y\,P[Y=y|X=x].$$

Quite naturally, in the absolutely continuous case, we would write

$$E[Y|X=x]=\int y\,f(y|x)\,dy=\int y\,\frac{f(x,y)}{f(x)}\,dy,$$

with standard notation. Those functions are interesting as we have the following
decomposition

$$Y=E[Y|X=x]+\underbrace{\big(Y-E[Y|X=x]\big)}_{\varepsilon},$$

where E[ε|X = x] = 0. It should be stressed that the extension to the case where X
is absolutely continuous is formally slightly complicated since .{X = x} is an event
with probability 0, and then .P[Y ∈ A|X = x] is not properly defined (in Bayes

formula, in Proposition 3.2). As stated in Kolmogorov (1933),3 “der Begriff der


bedingten Wahrscheinlichkeit in Bezug auf eine isoliert gegebene Hypothese, deren
Wahrscheinlichkeit gleich Null ist, unzulässig.” Billingsley (2008), Rosenthal (2006)
or Resnick (2019) provide theoretical foundations for the notation “E[Y |X = x].”
Definition 3.1 (Regression Function .μ) Let Y be the non-negative random vari-
able of interest, observed with covariates .X, the regression function is .μ(x) =
E[Y |X = x].
Without going into too much detail (based on measure theory), we will invoke
here the “law of the unconscious statistician” (as coined in Ross (1972) and Casella
and Berger (1990)), and write
 
 
$$E[\varphi(Y)]=\int_{\Omega}\varphi\big(Y(\omega)\big)\,P(d\omega)=\int_{\mathbb{R}}\varphi(y)\,P_Y(dy),$$

for some random variable Y : (Ω, F, P) → R, with law PY. And we will take even
more freedom when conditioning. As discussed in Proschan and Presnell (1998),
“statisticians make liberal use of conditioning arguments to shorten what would
otherwise be long proofs,” and we do the same here. Heuristically, (the proof can
be found in Pfanzagl (1979) and Proschan and Presnell (1998)), a version of .P(Y ∈
A|X = x) can be obtained as a limit of conditional probabilities given that X lies
in small neighborhoods of x, the limit being taken as the size of the neighborhood
tends to 0,

$$P\big[Y\in A\,\big|\,X=x\big]=\lim_{\epsilon\to 0}\frac{P(\{Y\in A\}\cap\{|X-x|\le\epsilon\})}{P(\{|X-x|\le\epsilon\})}
=\lim_{\epsilon\to 0}P\big[Y\in A\,\big|\,|X-x|\le\epsilon\big],$$

that can be extended into a higher dimension, using some distance between X and
x, and then use that approach to define4 “E[Y |X = x].” In Sect. 4.1, we have a brief
discussion about a related problem, which is the distinction between E[ϕ(x1 , X2 )]
and E[ϕ(X1 , X2 )|X1 = x1 ].

3.3.1 Machine-Learning Terminology

Suppose that random variables (X, Y ) are defined on a probability space
(Ω, F, P), and we observe a finite sample (x 1 , y1 ), · · · , (x n , yn ). Based on that
3 “the notion of conditional probability is inadmissible in relation to a hypothesis given in isolation

whose probability is zero” [personal translation].


4 Which corresponds to a very standard idea in non-parametric statistics, see Tukey (1961),

Nadaraya (1964) or Watson (1964).



sample, we want to estimate, or learn, a model m that is a good approximation of


the unobservable regression function .μ, where .μ(x) = E[Y |X = x].
In the specific case where y is a categorical variable, for example, a binary
variable (taking here values in .{0, 1}), there is strong interest in the machine-learning
literature not to estimate the regression function .μ, but to construct a “classifier”
that predicts the class. For example, in the logistic regression (see Sect. 3.3.2), we
suppose that .(Y |X = x) ∼ B(μ(x)), where .logit[μ(x)] = x β, and .μ(x) has
two interpretations, as .μ(x) = E[Y |X = x] and .μ(x) = P[Y = 1|X = x].
From this regression function, one can easily construct a “classifier” by considering
.mt (x) = 1(m(x) > t), taking values in .{0, 1} (like y), for some appropriate cutoff

threshold .t ∈ [0, 1].


Definition 3.2 (Loss ℓ) A loss function ℓ is a function defined on Y × Y such that
ℓ(y, ŷ) ≥ 0 and ℓ(y, y) = 0.
A loss is not necessarily a distance (between y and ŷ) as symmetry is not
required, nor is the triangle inequality. Some losses are simply a function (called
“cost”) of some distance between y and ŷ.
Definition 3.3 (Risk R) For a fitted model m̂, its risk is

$$\mathcal{R}(\widehat{m})=E\big[\ell(Y,\widehat{m}(X))\big].$$

For instance, in a regression problem, a quadratic loss function ℓ2 is used,

$$\ell_2(y,\widehat{y})=(y-\widehat{y})^2,$$

and the risk (named “quadratic risk”) is then

$$\mathcal{R}_2(\widehat{m})=E\big[(Y-\widehat{m}(X))^2\big],$$

where m̂(x) is some prediction. Observe that

$$E[Y]=\underset{m\in\mathbb{R}}{\text{argmin}}\big\{\mathcal{R}_2(m)\big\}=\underset{m\in\mathbb{R}}{\text{argmin}}\big\{E[\ell_2(Y,m)]\big\}.$$

The fact that the expected value minimizes the expected loss for some loss function
(here . 2 ) is named “elicitable” in Gneiting et al. (2007). From this property, we
can understand why the expected value is also called “best estimate” (see also the
connection to Bregman distance, in Definition 3.12). As discussed in Huttegger
(2013), the use of a quadratic loss function gives rise to a rich geometric structure,
for variables that are squared integrable, which is essentially very close to the
geometry of Euclidean spaces (.L2 being a Hilbert space, with an inner product, and
a projection operator; we come back to this point in Chap. 10, in “pre-processing”
approaches). Up to a monotonic transformation (the square root function), the
distance here is the expectation of the quadratic loss function. With the terminology

of Angrist and Pischke (2009), the regression function .μ is the function of .x that
serves as “the best predictor of y, in the mean-squared error sense.”
The quantile loss ℓq,α, for some α ∈ (0, 1), is

$$\ell_{q,\alpha}(y,\widehat{y})=\max\big\{\alpha(y-\widehat{y}),\,(1-\alpha)(\widehat{y}-y)\big\}=(y-\widehat{y})\big(\alpha-\mathbf{1}(y<\widehat{y})\big).$$

For example, Kudryavtsev (2009) used a quantile loss function in the context of
ratemaking. It is called the “quantile” loss as

$$Q(\alpha)=F^{-1}(\alpha)\in\underset{q\in\mathbb{R}}{\text{argmin}}\big\{\mathcal{R}_{q,\alpha}(q)\big\}=\underset{q\in\mathbb{R}}{\text{argmin}}\big\{E[\ell_{q,\alpha}(Y,q)]\big\}.$$

Indeed,

$$\underset{q}{\text{argmin}}\big\{\mathcal{R}_{q,\alpha}(q)\big\}=\underset{q}{\text{argmin}}\left\{(\alpha-1)\int_{-\infty}^{q}(y-q)\,dF_Y(y)+\alpha\int_{q}^{\infty}(y-q)\,dF_Y(y)\right\},$$

and by computing the derivative of the expected loss via an application of the
Leibniz integral rule,

$$0=(1-\alpha)\int_{-\infty}^{q^\star}dF_Y(y)-\alpha\int_{q^\star}^{\infty}dF_Y(y),$$

so that 0 = FY(q⋆) − α. Thus, quantiles are also “elicitable” functionals. When
α = 1/2 (the median), we recognize the least absolute deviation loss ℓ1, with
ℓ1(y, ŷ) = |y − ŷ|.
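A minimal numerical check of this elicitability property, on simulated data: minimizing the empirical quantile loss recovers the empirical quantile.

set.seed(1)
y <- rgamma(1000, shape = 2, rate = 1)
quantile_loss <- function(q, y, alpha) {
  mean(pmax(alpha * (y - q), (1 - alpha) * (q - y)))   # empirical E[l_{q,alpha}(Y, q)]
}
alpha <- 0.9
opt <- optimize(quantile_loss, interval = range(y), y = y, alpha = alpha)
c(minimizer = opt$minimum, empirical_quantile = as.numeric(quantile(y, alpha)))
# the two values are (almost) identical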
Definition 3.4 (Empirical Risk R̂n) Given a sample {(yi , x i ), i = 1, · · · , n},
define the empirical risk

$$\widehat{\mathcal{R}}_n(\widehat{m})=\frac{1}{n}\sum_{i=1}^{n}\ell\big(\widehat{m}(x_i),y_i\big).$$

Again, in the regression context, with a quadratic loss function, the empirical risk
is the mean squared error (MSE), defined as

$$\widehat{\mathcal{R}}_n(\widehat{m})=\text{MSE}_n=\frac{1}{n}\sum_{i=1}^{n}\big(y_i-\widehat{m}(x_i)\big)^2.$$

Note that m̂, defined as the empirical risk minimizer over a training sample and a
collection of models, is also called an M-estimator, as in Huber (1964).

In the context of a classifier, where y ∈ {0, 1} as well as ŷ, a natural loss is the
so-called “0/1 loss,”

$$\ell_{0/1}(y,\widehat{y})=\mathbf{1}(y\neq\widehat{y})=\begin{cases}1 & \text{if }y\neq\widehat{y}\\ 0 & \text{if }y=\widehat{y}.\end{cases}$$

In the context of a classifier, the loss is a function on .Y × Y, i.e., .{0, 1} × {0, 1},
taking values in .R+ . But in many cases, we want to compute a “loss” between y and
an estimation of .P[Y = 1], instead of a predicted class . y ∈ {0, 1}, therefore, it will
be a function defined on .{0, 1} × [0, 1]. That will correspond to a “scoring rule” (see
Definition 4.16). The empirical risk associated with the . 0/1 loss is the proportion of
misclassified individuals, also named “classifier error rate.” But it is possible to get
more information: given a sample of size n, it is possible to compute the “confusion
matrix,” which is simply the contingency table of the pairs .(yi ,  yi ), as in Figs. 3.4
and 3.5 .
Given a threshold t, one will get the confusion matrix, and various quantities can
be computed. To illustrate, consider a simple logistic regression model, on x (and
not s), with predictions on n = 40 observations from toydata2 (as in Table
8.1), and two values for the threshold t, 30% and 50%.

Fig. 3.4 General representation of the “confusion matrix,” with counts of y = 0 (column on
the left, n•0) and y = 1 (column on the right, n•1), counts of ŷ = 0, “negative” outcomes (row
on top, n0•) and ŷ = 1, “positive” outcomes (row below, n1•): true negatives (TN = n00), false
negatives (FN = n01), false positives (FP = n10) and true positives (TP = n11)

Fig. 3.5 Expressions of the standard metrics associated with the “confusion matrix”: false positive
rate FPR = FP/(FP + TN), true positive rate TPR = TP/(TP + FN), false negative rate
FNR = FN/(TP + FN), true negative rate TNR = TN/(FP + TN), positive predictive value
PPV = TP/(FP + TP) and negative predictive value NPV = TN/(FN + TN)

Fig. 3.6 Confusion matrices with threshold 30% (TN = 12, FN = 3, FP = 8, TP = 17) and 50%
(TN = 18, FN = 5, FP = 2, TP = 15), for n = 40 observations from the toydata2 dataset, and a
logistic regression for m

From Fig. 3.6, we can compute various quantities (as explained in Figs. 3.4
and 3.5). Sensitivity (true positive rate) is the probability of a positive test result,
conditioned on the individual truly being positive. Thus, here we have

$$\text{TPR}(30\%)=\frac{17}{3+17}=0.85\quad\text{and}\quad\text{TPR}(50\%)=\frac{15}{5+15}=0.75,$$

whereas the miss rate (false negative rate) is

$$\text{FNR}(30\%)=\frac{3}{3+17}=0.15\quad\text{and}\quad\text{FNR}(50\%)=\frac{5}{5+15}=0.25.$$

Specificity (true negative rate) is the probability of a negative test result, conditioned
on the individual truly being negative,

$$\text{TNR}(30\%)=\frac{12}{8+12}=0.6\quad\text{and}\quad\text{TNR}(50\%)=\frac{18}{2+18}=0.9,$$

whereas the fall-out (false positive rate) is

$$\text{FPR}(30\%)=\frac{8}{8+12}=0.4\quad\text{and}\quad\text{FPR}(50\%)=\frac{2}{2+18}=0.1.$$

The negative predictive value (NPV) is

$$\text{NPV}(30\%)=\frac{12}{12+3}=0.8\quad\text{and}\quad\text{NPV}(50\%)=\frac{18}{18+5}=0.7826,$$

whereas the precision (positive predictive value) is

$$\text{PPV}(30\%)=\frac{17}{17+8}=0.68\quad\text{and}\quad\text{PPV}(50\%)=\frac{15}{15+2}=0.8824.$$

Accuracy is the proportion of good predictions,

$$\text{ACC}(30\%)=\frac{12+17}{12+8+3+17}=0.725\quad\text{and}\quad\text{ACC}(50\%)=\frac{18+15}{18+2+5+15}=0.825,$$

whereas the “balanced accuracy” (see Langford and Schapire 2005) is the average of
the true positive rate (TPR) and the true negative rate (TNR),

$$\text{BACC}(30\%)=\frac{0.85+0.6}{2}=0.725\quad\text{and}\quad\text{BACC}(50\%)=\frac{0.75+0.9}{2}=0.825.$$

Finally, Cohen’s kappa (from Cohen (1960), which is based on the accuracy,
assuming that y and ŷ are independent—as in the Chi-squared test) is

$$\kappa(30\%)=\frac{\tfrac{29}{40}-\tfrac{20}{40}}{1-\tfrac{20}{40}}=0.45\quad\text{and}\quad\kappa(50\%)=\frac{\tfrac{33}{40}-\tfrac{20}{40}}{1-\tfrac{20}{40}}=0.65,$$

whereas the Matthews correlation coefficient (see Definition 8.15) is

$$\text{MCC}(30\%)=0.464758\quad\text{and}\quad\text{MCC}(50\%)=0.6574383.$$
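A minimal sketch of these computations, with a simulated stand-in for the toydata2 example (the actual data and fitted model are not reproduced here), for a given cutoff threshold t:

set.seed(1)
n <- 40
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(2 * x))
m <- glm(y ~ x, family = binomial)                 # classifier score m(x)
confusion_metrics <- function(score, y, t) {
  yhat <- as.numeric(score > t)
  TP <- sum(yhat == 1 & y == 1); TN <- sum(yhat == 0 & y == 0)
  FP <- sum(yhat == 1 & y == 0); FN <- sum(yhat == 0 & y == 1)
  c(TPR = TP / (TP + FN), FPR = FP / (FP + TN),
    PPV = TP / (TP + FP), NPV = TN / (TN + FN),
    ACC = (TP + TN) / (TP + TN + FP + FN))
}
score <- predict(m, type = "response")
rbind("t = 30%" = confusion_metrics(score, y, .30),
      "t = 50%" = confusion_metrics(score, y, .50))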

One issue here is that the sample used to compute the empirical risk is the same
as the one used to fit the model; this is also called the “in-sample risk,”

$$\widehat{\mathcal{R}}_n^{\text{is}}(m)=\frac{1}{n}\sum_{i=1}^{n}\ell\big(m(x_i),y_i\big).$$

Thus, if we consider

$$\widehat{m}_n=\underset{m\in\mathcal{M}}{\text{argmin}}\big\{\widehat{\mathcal{R}}_n^{\text{is}}(m)\big\},$$

on a set M of admissible models, we will have a tendency to capture a lot of noise
and to over-adjust the data: this is called “over-fitting.” For example, in Fig. 3.7, we
have two fitted models m̂ such that the in-sample risk is null,

$$\widehat{\mathcal{R}}_n^{\text{is}}(\widehat{m})=\frac{1}{n}\sum_{i=1}^{n}\ell\big(\widehat{m}(x_i),y_i\big)=0.$$

Fig. 3.7 Two fitted models from a (fake) dataset (x1 , y1 ), · · · , (x10 , y10 ), with a linear model on
the left, and a polynomial model on the right, such that for both the in-sample risk is null,
R̂nis(m̂) = 0

To avoid this problem, randomly divide the initial database into a training dataset
and a validation dataset. The training database, with nT < n observations, will be
used to estimate the parameters of the model,

$$\widehat{m}_{n_T}=\underset{m\in\mathcal{M}}{\text{argmin}}\big\{\widehat{\mathcal{R}}_{n_T}^{\text{is}}(m)\big\}.$$

Then, the validation dataset, with nV = n − nT observations, will be used to select
the model, using the “out-of-sample risk,”

$$\widehat{\mathcal{R}}_{n_V}^{\text{os}}(\widehat{m}_{n_T})=\frac{1}{n_V}\sum_{i=1}^{n_V}\ell\big(\widehat{m}_{n_T}(x_i),y_i\big).$$
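A minimal sketch of this split, on simulated data (with hypothetical variable names), comparing in-sample and out-of-sample quadratic risks for two polynomial models:

set.seed(1)
n <- 200
x <- runif(n); y <- sin(2 * pi * x) + rnorm(n, sd = .3)
idx_train <- sample(n, size = 150)                 # nT = 150, nV = 50
train <- data.frame(x = x[idx_train], y = y[idx_train])
valid <- data.frame(x = x[-idx_train], y = y[-idx_train])
risks <- function(fit) c(
  in_sample  = mean((train$y - predict(fit, train))^2),
  out_sample = mean((valid$y - predict(fit, valid))^2))
rbind(poly_3  = risks(lm(y ~ poly(x, 3),  data = train)),
      poly_20 = risks(lm(y ~ poly(x, 20), data = train)))
# the degree-20 polynomial has a lower in-sample risk, but (usually) a higher
# out-of-sample risk, illustrating over-fitting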

Quite generally, given a loss function . : Y × Y → R+ , and a collection .Dn of n


independent observations drawn from .(X, Y ) (corresponding to the dataset) the risk
is

$$E_{X}\Big[E_{\mathcal{D}_n}\big[E_{Y|X}[\ell(Y,\widehat{m}(X))\,|\,\mathcal{D}_n]\big]\Big],$$

that cannot be calculated without knowing the true distribution of (Y, X).
If ℓ is the quadratic loss, ℓ2(y, ŷ) = (y − ŷ)2,

$$\mathcal{R}_2(\widehat{m})=E_{\mathcal{D}_n}\big[E_{Y|X}[\ell_2(Y,\widehat{m}(X))\,|\,\mathcal{D}_n]\big]
=\big(E_{Y|X}[Y]-E_{\mathcal{D}_n}[\widehat{m}(X)]\big)^2
+E_{Y|X}\big[(Y-E_{Y|X}[Y])^2\big]
+E_{\mathcal{D}_n}\big[(\widehat{m}(X)-E_{\mathcal{D}_n}[\widehat{m}(X)])^2\big].$$

We recognize the square of the bias (.bias2 ), the stochastic error, and the variance of
the estimator.
Here, so far, all observations in the training dataset have the same importance.
But it is possible to include weights, for each observation, in the optimization
procedure. A classic example could be the weighted least squares objective,

$$\sum_{i=1}^{n}\omega_i\big(y_i-x_i^\top\beta\big)^2,$$

for some positive (or null) weights (ω1 , · · · , ωn ) ∈ Rn+. The weighted least squares
estimator is

$$\widehat{\beta}=(X^\top\Omega X)^{-1}X^\top\Omega y,\quad\text{where }\Omega=\text{diag}(\omega).$$

More generally, it is possible to consider a weighted version of the empirical risk,

$$\widehat{\mathcal{R}}_{\omega}(\widehat{m})=\sum_{i=1}^{n}\omega_i\,\ell\big(\widehat{m}(x_i),y_i\big).$$
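A minimal check of the weighted least squares expression, on simulated data, against the weights argument of lm():

set.seed(1)
n <- 100
x <- rnorm(n); y <- 1 + 2 * x + rnorm(n, sd = 1 + abs(x))
w <- 1 / (1 + abs(x))^2                      # arbitrary positive weights
X <- cbind(1, x); Omega <- diag(w)
beta_wls <- solve(t(X) %*% Omega %*% X, t(X) %*% Omega %*% y)
cbind(manual = as.numeric(beta_wls), lm = coef(lm(y ~ x, weights = w)))
# both columns coincide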

We have considered here losses, which could be seen as a distance between


a prediction .y and an actual observation y. But Huttegger (2017) claims that
those losses—which were denoted . (y,  y )—are maybe not a correct measure of
“epistemic accuracy.” The first extension is based on the fact that, instead of a
point estimate .
y , we could have a confidence interval, or a predictive distribution.
Observing .y = 100 when we predict . y = 50 with a standard deviation on the
prediction of 50 is not the same as observing .y = 100 when we predict . y = 120
with a standard deviation on the prediction of 10. Formally, it means that we want to
quantify a distance between a single point and a distribution. That will be discussed
in Sect. 4.2.1, when we introduce “scoring rules.” Another important tool will be a
“distance” between two distributions.
Definition 3.5 (Hellinger Distance (Hellinger 1909)) For two discrete distribu-
tions p and q, the Hellinger distance is given by

$$d_H(p,q)^2=\frac{1}{2}\sum_{i}\Big(\sqrt{p(i)}-\sqrt{q(i)}\Big)^2=1-\sum_{i}\sqrt{p(i)q(i)}\in[0,1],$$

and for absolutely continuous distributions, if p and q are densities (on R, or on Rd
in higher dimension),

$$d_H(p,q)^2=\frac{1}{2}\int\Big(\sqrt{p(x)}-\sqrt{q(x)}\Big)^2dx=1-\int\sqrt{p(x)q(x)}\,dx.$$

For example, for two Gaussian distributions with means μ1, μ2 and variances σ1², σ2²,

$$d_H^2(p_1,p_2)=1-\sqrt{\frac{2\sigma_1\sigma_2}{\sigma_1^2+\sigma_2^2}}\,
\exp\left\{-\frac{1}{4}\,\frac{(\mu_1-\mu_2)^2}{\sigma_1^2+\sigma_2^2}\right\}$$

(that can be extended into a higher dimension, as in Pardo (2018)), whereas for two
exponential distributions with means μ1 and μ2,

$$d_H^2(p_1,p_2)=1-\frac{2\sqrt{\mu_1\mu_2}}{\mu_1+\mu_2}.$$
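A minimal numerical check of the Gaussian closed form above, using numerical integration of the Hellinger integral:

mu1 <- 0; s1 <- 1; mu2 <- 2; s2 <- 1.5
h2_num <- 1 - integrate(function(x) sqrt(dnorm(x, mu1, s1) * dnorm(x, mu2, s2)),
                        lower = -Inf, upper = Inf)$value
h2_closed <- 1 - sqrt(2 * s1 * s2 / (s1^2 + s2^2)) *
  exp(-(mu1 - mu2)^2 / (4 * (s1^2 + s2^2)))
c(numerical = h2_num, closed_form = h2_closed)   # the two values coincide (up to numerical error)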

A few years after, Saks (1937) introduced the concept of “total variation”
(between measures) in the context of signed measures on a measurable space, and
it can be used to define a total variation distance between probability measures (see
Dudley 2010). Quite generally, given two discrete distributions p and q, the total
variation is the largest possible difference between the probabilities that the two
probability distributions can assign to the same event.
Definition 3.6 (Total Variation (Jordan 1881; Rudin 1966)) For two univariate
distributions p and q, the total variation distance between p and q is

$$d_{TV}(p,q)=\sup_{A\subset\mathbb{R}}\big\{|p(A)-q(A)|\big\}.$$

It should be stressed here that in the context of discrimination, Zafar et al. (2019)
or Zhang and Bareinboim (2018) suggest removing the symmetry property, to take
into account that there is a favored and a disfavored group, and therefore to consider

$$D_{TV}(p\|q)=\sup_{A\subset\mathbb{R}}\big\{p(A)-q(A)\big\}.$$

Removing the standard property of symmetry (that we have on distances) yields the
concept of “divergence,” which is still a non-negative function, positive (in the sense
that it is null if and only if “.p = q”), and the triangle inequality is not satisfied
(even if it could satisfy some sort of Pythagorean theorem, if an “inner product”
can be derived). As Amari (1982) explains, it is mainly because divergences are
generalizations of “squared distances,” not “linear distances.”
Definition 3.7 (Kullback–Leibler (Kullback and Leibler 1951)) For two discrete
distributions p and q, the Kullback–Leibler divergence of p with respect to q is

$$D_{KL}(p\|q)=\sum_{i}p(i)\log\frac{p(i)}{q(i)},$$

and for absolutely continuous distributions,

$$D_{KL}(p\|q)=\int_{\mathbb{R}}p(x)\log\frac{p(x)}{q(x)}\,dx
\quad\text{or}\quad\int_{\mathbb{R}^d}p(x)\log\frac{p(x)}{q(x)}\,dx,$$

in higher dimension.
This corresponds to the relative entropy from q to p. Interestingly, (Kullback
2004) mentioned that he preferred the term “discrimination information.”
Notice that the ratio .log(p/q) is sometimes called “weight-of-evidence,” follow-
ing Good (1950) and Ayer (1972), see also Wod (1985) or Weed (2005) for some
surveys. Again, this is not a distance (even if it satisfies the nice property .p = q
if and only if .DKL (pq) = 0), so we use the term “divergence” (and notation D
instead of d).
Definition 3.8 (Mutual Information (Shannon and Weaver 1949)) For a pair of
two discrete variables x and y with joint distribution pxy (and marginal ones px
and py ), the mutual information is

$$IM(x,y)=D_{KL}(p_{xy}\|p_{xy}^{\perp})=\sum_{i,j}p_{xy}(i,j)\log\frac{p_{xy}(i,j)}{p_{xy}^{\perp}(i,j)}
=\sum_{i,j}p_{xy}(i,j)\log\frac{p_{xy}(i,j)}{p_{x}(i)\,p_{y}(j)},$$

where pxy⊥ is the independent version of pxy , i.e., pxy⊥(i, j ) = px (i)py (j ).

Observe that, for two Gaussian distributions,

$$D_{KL}(p_1\|p_2)=\frac{1}{2}\left[\log\frac{\sigma_2^2}{\sigma_1^2}+\frac{\sigma_1^2}{\sigma_2^2}+\frac{(\mu_1-\mu_2)^2}{\sigma_2^2}-1\right],$$

and in higher dimension (say k),

$$D_{KL}(p_1\|p_2)=\frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|}+\text{tr}\{\Sigma_2^{-1}\Sigma_1\}+(\mu_2-\mu_1)^\top\Sigma_2^{-1}(\mu_2-\mu_1)-k\right].$$

As proved in Tsybakov (2009), it is possible to find relationships between those
measures, such as

$$d_{TV}(p,q)\le\sqrt{1-\exp[-D_{KL}(p\|q)]}
\quad\text{or}\quad
d_H(p,q)^2\le d_{TV}(p,q)\le\sqrt{2}\,d_H(p,q).$$

It is possible to derive a symmetric divergence measure by averaging with the so-


called “dual divergence,” also named “PSI index” (defined, as well as most functions
in Siddiqi (2012), in the scorecard R package).

Definition 3.9 (Population Stability Index (PSI) (Siddiqi 2012)) PSI is a mea-
sure of population stability between two population samples,

$$\text{PSI}=D_{KL}(p_1\|p_2)+D_{KL}(p_2\|p_1).$$

An alternative is to consider the following approach, with the “Jensen–Shannon
divergence.”
Definition 3.10 (Jensen–Shannon (Lin 1991)) The Jensen–Shannon distance is a
symmetric distance induced by the Kullback–Leibler divergence,

$$D_{JS}(p_1,p_2)=\frac{1}{2}D_{KL}(p_1\|q)+\frac{1}{2}D_{KL}(p_2\|q),$$

where q = ½(p1 + p2).
Another popular distance is the Wasserstein distance,5 also called Mallows’
distance, from Mallows (1972).
Definition 3.11 (Wasserstein (Wasserstein 1969)) Consider two measures p
and q on Rd, with a norm ∥ · ∥ (on Rd). Then define

$$W_k(p,q)=\left(\inf_{\pi\in\Pi(p,q)}\int_{\mathbb{R}^d\times\mathbb{R}^d}\|x-y\|^k\,d\pi(x,y)\right)^{1/k},$$

where Π(p, q) is the set of all couplings of p and q.


Without specifying, the Wasserstein distance will be .W2 , and . ·  is the
Euclidean norm. .W1 is also called “earth mover’s distance,” see for example Levina
and Bickel (2001) or Gottschlich and Schuhmacher (2014), for a fast numerical
implementation. As mentioned in Villani (2009), the total variation distance arises
quite naturally as the optimal transportation cost, when the cost function is ℓ0/1, or
1(x ≠ y), since

$$d_{TV}(p,q)=\inf_{\pi\in\Pi(p,q)}\big\{P[X\neq Y],\ (X,Y)\sim\pi\big\}
=\inf_{\pi\in\Pi(p,q)}\big\{E[\ell_{0/1}(X,Y)],\ (X,Y)\sim\pi\big\}.$$

With the Wasserstein distance, we consider

$$\inf_{\pi\in\Pi(p,q)}\big\{E[\ell(X,Y)],\ (X,Y)\sim\pi\big\}
\quad\text{or}\quad
\inf_{\pi\in\Pi(p,q)}\left\{\int\ell(x,y)\,\pi(dx,dy)\right\}.$$
5 The original name is also written “Vaserstein,” but as the distance is usually denoted “W ,” we
write “Wasserstein.”

The connection with “transport” is obtained as follows: given T : Rk → Rk, define
the “push-forward” measure

$$P_1(A)=T_{\#}P_0(A)=P_0\big(T^{-1}(A)\big),\quad\forall A\subset\mathbb{R}^k.$$

An optimal transport T⋆ (in Brenier’s sense, from Brenier (1991), see Villani (2009)
or Galichon (2016)) from P0 toward P1 will be the solution of

$$T^{\star}\in\underset{T:\,T_{\#}P_0=P_1}{\text{arginf}}\left\{\int_{\mathbb{R}^k}\ell\big(x,T(x)\big)\,dP_0(x)\right\}.$$

In dimension 1 (distributions on R), let F0 and F1 denote the cumulative distribution
functions, and F0−1 and F1−1 the associated quantile functions. Then

$$W_k(p_0,p_1)=\left(\int_0^1\big|F_0^{-1}(u)-F_1^{-1}(u)\big|^k\,du\right)^{1/k},$$

and one can prove that the optimal transport T⋆ is a monotone transformation. More
precisely,

$$T^{\star}:x_0\mapsto x_1=F_1^{-1}\circ F_0(x_0).$$

For empirical measures, in dimension 1, the distance is a simple function of the
order statistics,

$$W_k(\boldsymbol{x},\boldsymbol{y})=\left(\frac{1}{n}\sum_{i=1}^{n}\big|x_{(i)}-y_{(i)}\big|^k\right)^{1/k}.$$
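A minimal sketch using this order-statistics expression, on two simulated samples of equal size (sorting the two samples gives the optimal matching in dimension 1):

set.seed(1)
x <- rnorm(200, mean = 0, sd = 1)
y <- rnorm(200, mean = 2, sd = 2)
wasserstein_1d <- function(x, y, k = 2) mean(abs(sort(x) - sort(y))^k)^(1 / k)
c(W1 = wasserstein_1d(x, y, k = 1), W2 = wasserstein_1d(x, y, k = 2))
# for Gaussian samples, W2^2 should be close to (mu1 - mu0)^2 + (sigma1 - sigma0)^2,
# i.e., 4 + 1 = 5 here (see the closed-form expression below)
wasserstein_1d(x, y, k = 2)^2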

Observe that, for two Gaussian distributions, and the Euclidean distance,

$$W_2(p_0,p_1)^2=(\mu_1-\mu_0)^2+(\sigma_1-\sigma_0)^2,$$

and in higher dimension,

$$W_2(p_0,p_1)^2=\|\mu_1-\mu_0\|_2^2+\text{tr}\Big(\Sigma_0+\Sigma_1-2\big(\Sigma_1^{1/2}\Sigma_0\Sigma_1^{1/2}\big)^{1/2}\Big).$$

If variances are equal, we can write simply

$$W_2(p_0,p_1)^2=\|\mu_1-\mu_0\|_2^2=(\mu_1-\mu_0)^\top(\mu_1-\mu_0)
\quad\text{and}\quad
D_{KL}(p_0\|p_1)=(\mu_1-\mu_0)^\top\Sigma^{-1}(\mu_1-\mu_0).$$

And in that Gaussian case, there is an explicit expression for the optimal transport,
which is simply an affine map (see Villani (2003) for more details). In the univariate
case, x1 = TN(x0) = μ1 + (σ1/σ0)(x0 − μ0), whereas in the multivariate case, an
analogous expression can be derived,

$$x_1=T_{\mathcal{N}}(x_0)=\mu_1+A(x_0-\mu_0),$$

where A is a symmetric positive matrix that satisfies AΣ0A = Σ1, which has a
unique solution given by

$$A=\Sigma_0^{-1/2}\big(\Sigma_0^{1/2}\Sigma_1\Sigma_0^{1/2}\big)^{1/2}\Sigma_0^{-1/2},$$

where M1/2 is the square root of the square (symmetric) positive matrix M based
on the Schur decomposition (M1/2 is a positive symmetric matrix), as described in
Higham (2008).
In the non-Gaussian case, one can prove (see Alvarez-Esteban et al. 2018) that

$$W_2(p_0,p_1)^2=\|\mu_1-\mu_0\|_2^2+W_2(\bar{p}_0,\bar{p}_1)^2,$$

where μ0 and μ1 are the means of p0 and p1, and p̄0 and p̄1 are the corresponding
centered probabilities. And if the measures are not Gaussian, but have variances Σ0
and Σ1, Gelbrich (1990) proved that

$$W_2(p_0,p_1)^2\ge\|\mu_1-\mu_0\|_2^2+\text{tr}\Big(\Sigma_0+\Sigma_1-2\big(\Sigma_1^{1/2}\Sigma_0\Sigma_1^{1/2}\big)^{1/2}\Big).$$

If variances are equal, we can write simply

$$W_2(p_1,p_2)^2=\|\mu_2-\mu_1\|_2^2=(\mu_2-\mu_1)^\top(\mu_2-\mu_1)
\quad\text{and}\quad
D_{KL}(p_1\|p_2)=(\mu_2-\mu_1)^\top\Sigma_2^{-1}(\mu_2-\mu_1).$$

To conclude this part, Banerjee et al. (2005) suggested loss functions named
“Bregman distance functions.”
Definition 3.12 (Bregman Distance Functions (Banerjee et al. 2005)) Given a
strictly convex differentiable function φ : R → R,

$$B_{\varphi}(y_1,y_2)=\varphi(y_1)-\varphi(y_2)-(y_1-y_2)\varphi'(y_2),$$

or, if φ : Rd → R,

$$B_{\varphi}(\boldsymbol{y}_1,\boldsymbol{y}_2)=\varphi(\boldsymbol{y}_1)-\varphi(\boldsymbol{y}_2)-(\boldsymbol{y}_1-\boldsymbol{y}_2)^\top\nabla\varphi(\boldsymbol{y}_2).$$

Note that Bφ is positive, with Bφ(y1, y2) = 0 if and only if y1 = y2 (but it is not
symmetric in general). For example, if φ(t) = t2, Bφ(y1, y2) = ℓ2(y1, y2) = (y1 − y2)2.
Huttegger (2017) pointed out that those functions have a “nice epistemological
motivation.” Consider

some very general distance function ψ, such that, for any random variable Z,

$$E\big[\psi(Y,E[Y])\big]\le E\big[\psi(Y,Z)\big],$$

meaning that E[Y] is the “best estimate” of Y (according to this distance ψ). If
Y = 1A, it means that

$$E\big[\psi(\mathbf{1}_A,P[A])\big]\le E\big[\psi(\mathbf{1}_A,Z)\big],$$

meaning that P[A] is the “best” degree of belief of 1A. If we suppose that ψ is
continuously differentiable in its first argument, and ψ(0, 0) = 0, then ψ is a
Bregman distance function. And one can write, if φ : Rd → R,

$$B_{\varphi}(\boldsymbol{y}_1,\boldsymbol{y}_2)=(\boldsymbol{y}_1-\boldsymbol{y}_2)^\top\nabla^2\varphi(\boldsymbol{y}_t)(\boldsymbol{y}_1-\boldsymbol{y}_2),$$

where ∇2 denotes the Hessian matrix, and where yt = ty1 + (1 − t)y2, for some t ∈
[0, 1]. We recognize some sort of local Mahalanobis distance, induced by ∇2φ(yt).

3.3.2 Generalized Linear Models

Generalized linear models (GLMs) cover a vast class of probabilistic models


that contains the logistic and the probit regression (for binary y) and the Poisson
regression (when y corresponds to counts). For more than 30 years, GLMs have
been the most popular predictive technique for actuaries (see Haberman and
Renshaw (1996), Denuit and Charpentier (2005), Denuit et al. (2007), Ohlsson and
Johansson (2010) or Frees (2006); Frees et al. (2014a), among many others). The
starting point is that the density of y (or the probability function if y is a discrete
variable) should be in the exponential family.
Definition 3.13 (Exponential Family (McCullagh and Nelder 1989)) The dis-
tribution of Y is in the exponential family if its density (with respect to some
appropriate measure) is

$$f_{\theta,\varphi}(y)=\exp\left\{\frac{y\theta-b(\theta)}{\varphi}+c(y,\varphi)\right\},$$

where θ is the canonical parameter, φ is a nuisance parameter, and b : R → R is
some function.
The binomial, Poisson, Gaussian, Gamma, and Inverse Gaussian distributions
belong to this family (see McCullagh and Nelder (1989) for more examples).
Consider some dataset .(yi , x i ) such that .yi is supposed to be a realization of
some random variable .Yi with distribution .fθi ,ϕ , in the exponential family. More

specifically, in this GLM framework, different quantities are used, namely the
canonical parameter θi, the prediction for yi,

$$\mu_i=E(Y_i)=b'(\theta_i),$$

the score associated with yi,

$$\eta_i=x_i^\top\beta,$$

and the link function g such that

$$\eta_i=g(\mu_i)=g\big(b'(\theta_i)\big).$$

For the “canonical link function,” g−1 = b′ and then θi = x i⊤β = ηi and μi =
E(Yi) = g−1(x i⊤β). Inference is performed by finding the maximum of the log-
likelihood, that is,

$$\log\mathcal{L}=\frac{1}{\varphi}\sum_{i=1}^{n}\big[y_i\,x_i^\top\beta-b(x_i^\top\beta)\big]
+\underbrace{\sum_{i=1}^{n}c(y_i,\varphi)}_{\text{independent of }\beta},$$

and if β̂ denotes the optimal parameter, the prediction is ŷi = m(x i) = g−1(x i⊤β̂).
In Fig. 3.8, we have an explanatory diagram of a GLM, starting from some
predictor variables .x = (x1 , · · · , xk ) (on the left) and a target variable y (on the
right). The score, .η = x β is created from the predictors .x, and then the prediction
is obtained by a nonlinear transformation, .m(x) = g −1 (x β).


Fig. 3.8 Explanatory diagram of a generalized linear model (GLM), starting from some predictor
variables .x = (x1 , · · · , xk ) and a target variable y

With the canonical link function, the first-order condition is simply (with a
standard matrix formulation)

$$\nabla_{\beta}\log\mathcal{L}=X^\top(y-\widehat{y})=X^\top\big(y-g^{-1}(X\widehat{\beta})\big)=0.$$

This is the equation solved numerically when calling glm in R (using
Fisher’s iterative algorithm, which is equivalent to a gradient descent with the
Newton–Raphson iterative technique, where the explicit expression of the Hessian
is used). In a sense, the probabilistic construction is simply a way of interpreting
the derivation. For example, a Poisson regression can be performed on positive
observations .yi (not necessarily integers), which makes sense if we focus only on
solving the first-order condition (as done by the computer), not if we care about the
interpretation. And actually, we can see this approach as a “machine-learning” one.
For convenience, let us write quite generally .log Li (
yi ) the contribution of the i-th
observation to the log-likelihood.
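A minimal check of the first-order condition above, on simulated data, for a Poisson regression with its canonical (log) link: the residuals y − ŷ are orthogonal to the columns of the design matrix.

set.seed(1)
n <- 1000
x1 <- rnorm(n); x2 <- rbinom(n, 1, .4)
y  <- rpois(n, exp(-1 + 0.5 * x1 + 0.3 * x2))
fit  <- glm(y ~ x1 + x2, family = poisson(link = "log"))
X    <- model.matrix(fit)
yhat <- fitted(fit)
t(X) %*% (y - yhat)    # numerically zero, as X'(y - g^{-1}(X beta_hat)) = 0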
As mentioned previously, with a machine-learning approach, the in-sample
empirical risk is

$$\widehat{\mathcal{R}}_n(\widehat{m})=\frac{1}{n}\sum_{i=1}^{n}\ell\big(\widehat{m}(x_i),y_i\big),$$

and it is possible to make connections between the maximum likelihood optimiza-
tion and the minimization of the in-sample empirical risk, introducing the scaled
deviance of the exponential model,

$$D^{\star}=\sum_{i=1}^{n}d^{\star}(y_i,\widehat{y}_i),\quad\text{where }d^{\star}(y_i,\widehat{y}_i)=2\big[\log\mathcal{L}_i(y_i)-\log\mathcal{L}_i(\widehat{y}_i)\big].$$

Here, the first term log Li(yi) corresponds to the log-likelihood of a “perfect fit”
(as ŷi is supposed to be equal to yi), also called the “saturated model.” The unscaled
deviance, d = φ d⋆, is used as a loss function. For the Gaussian model, ℓ(yi, ŷi) =
(yi − ŷi)2, and the deviance corresponds to the ℓ2 loss. For the Poisson distribution
(with a log-link), the loss would be

$$\ell(y_i,\widehat{y}_i)=\begin{cases}2\,(y_i\log y_i-y_i\log\widehat{y}_i-y_i+\widehat{y}_i) & y_i>0\\ 2\,\widehat{y}_i & y_i=0,\end{cases}$$

whereas for the logistic regression,

$$\ell(y_i,\widehat{y}_i)=\begin{cases}
2\left[y_i\log\left(\dfrac{y_i}{\widehat{y}_i}\right)+(1-y_i)\log\left(\dfrac{1-y_i}{1-\widehat{y}_i}\right)\right] & y_i\in(0,1)\\[2mm]
-2\log(1-\widehat{y}_i) & y_i=0\\[1mm]
-2\log(\widehat{y}_i) & y_i=1.
\end{cases}$$

In insurance (and loss modeling), a classical model is the one introduced in
Tweedie (1984), the “Tweedie model,” which corresponds to the compound Poisson-
gamma distribution. Thus, it is a distribution with a density on (0, ∞), satisfying

$$\log[f(y)]=\frac{1}{\varphi}\left[y\,\frac{\mu^{1-a}}{1-a}-\frac{\mu^{2-a}}{2-a}\right]+\text{constant},\quad\text{for }y\in\mathbb{R}_+,\ \text{where }a\in(1,2),$$

and with probability mass at 0 equal to exp[−μ2−a/(φ(2 − a))]. Here we use a
parametrization not based on θ but on μ, so that it is easier to derive the “Tweedie
loss” when a ∈ (1, 2),

$$\ell_a(y,\widehat{y})=\frac{\widehat{y}^{\,2-a}}{2-a}-y\,\frac{\widehat{y}^{\,1-a}}{1-a}.$$
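A minimal sketch of the Poisson deviance and Tweedie losses above, written as R functions (with ŷ > 0 assumed):

poisson_deviance <- function(y, yhat) {
  ifelse(y > 0, 2 * (y * log(y) - y * log(yhat) - y + yhat), 2 * yhat)
}
tweedie_loss <- function(y, yhat, a = 1.5) {
  yhat^(2 - a) / (2 - a) - y * yhat^(1 - a) / (1 - a)
}
y    <- c(0, 0, 1, 3)
yhat <- c(0.2, 0.8, 1.5, 2.5)
cbind(poisson = poisson_deviance(y, yhat), tweedie = tweedie_loss(y, yhat))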

Among other complex models considered in the actuarial literature, to model more
precisely claims frequency, Cragg (1971) introduced the so-called “hurdle model,”
whereas Lambert (1992) introduced “zero-inflated” count models (see Hilbe (2014)
for a general survey on models for counts). In the first case, for the hurdle Poisson
model (Welsh et al. (1996) referred to it as the “conditional Poisson model”),

$$P(Y_i=y_i)=\begin{cases}\pi & \text{if }y_i=0\\[1mm]
(1-\pi)\,\dfrac{\lambda^{y_i}}{[e^{\lambda}-1]\,y_i!} & \text{if }y_i=1,2,\cdots\end{cases}$$

for some λ > 0 and π ∈ (0, 1), whereas for the zero-inflated Poisson,

$$P(Y_i=y_i)=\begin{cases}\pi+(1-\pi)e^{-\lambda} & \text{if }y_i=0\\[1mm]
(1-\pi)\,\dfrac{\lambda^{y_i}e^{-\lambda}}{y_i!} & \text{if }y_i=1,2,\cdots\end{cases}$$

A zero-inflated model can only increase the probability of having y = 0, but this
is not a restriction in hurdle models. From those distributions, the idea here is to
consider a logistic regression for the binomial component, so that logit(πi) = x i⊤β b,
and a Poisson regression for the counts, λi = exp(x i⊤β p). For the hurdle model,
because there are two parameters, the log-likelihood can be separated in two terms
(the parameters are therefore estimated independently), so that one can derive the
associated loss functions. For example, the R package countreg contains loss
functions that can be used in the package mboost for boosting.
A strong assumption of linear models is the linearity assumption. Actually, the term
“linear” is ambiguous, because ηi = x i⊤β = ⟨x i , β⟩ (using geometric notations for
inner products on vector spaces) is linear both in x and in β. If x is the age of a
policyholder, we can consider a linear model in x but also a linear model in log(x),
√x, (x − 20)+ or any other nonlinear transformation.
“Natura non facit saltus” (nature does not make jumps) as claimed by Leibniz,
meaning that in most applications, we consider continuous transformations. For

instance, the regression function of claim frequency, as a function of the age of


the driver, should continuously decrease (at least when they are young, as they
continuously move from inexperienced to experienced drivers). An extension is
based on more general functional forms. Thus,

$$\eta_i=g(\mu_i)=x_i^\top\beta=\beta_0+\beta_1x_1+\cdots+\beta_kx_k$$

will become

$$\eta_i=g(\mu_i)=\beta_0+s_1(x_1)+\cdots+s_k(x_k),$$

where each function sj is some unspecified function, introduced to make the model
more flexible. Note that it is still additive here, so those models are named GAMs,
for “generalized additive models.” In the Gaussian case, if g is the identity function,
instead of seeking a model m(x1 , · · · , xk ) = β0 + s1 (x1 ) + · · · + sk (xk ) such that

$$\widehat{m}\in\text{argmin}\left\{\sum_{i=1}^{n}\big(y_i-(\beta_0+s_1(x_{1i})+\cdots+s_k(x_{ki}))\big)^2\right\},$$

on a very general set of functions s1 , · · · , sk , it could be natural to ask the functions
to be sufficiently smooth. Following Hastie and Tibshirani (1987), “un-smooth” means
that the average value of the square of the second derivative, the integral of (sj″)2, is
too large, so we could add a penalty in the previous problem and try to solve

$$\widehat{m}\in\text{argmin}\left\{\sum_{i=1}^{n}\big(y_i-(\beta_0+s_1(x_{1i})+\cdots+s_k(x_{ki}))\big)^2
+\sum_{j=1}^{k}\lambda_j\int s_j''(t_j)^2\,dt_j\right\},$$

that can be solved using simple numerical techniques. As mentioned in Verrall
(1996) or Lee and Antonio (2015), in the context of actuarial applications, such
continuous and nonlinear functions can be very useful, not only because they help
to prevent misspecification (and therefore possibly incorrect predictions) but also
because they provide information about each covariate and the outcome y. As in
Fig. 3.9, each variable xj is converted into another one through a function sj that
can be expressed in a specific functional basis (in the gam function of the mgcv R
package, the s() terms can denote either a thin plate spline or a cubic spline).
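A minimal sketch of such a penalized additive model with mgcv, on simulated data (the toydata2 dataset is not reproduced here); the smoothing parameters λj are selected automatically.

library(mgcv)
set.seed(1)
n <- 1000
x1 <- runif(n, 18, 80); x2 <- runif(n, 0, 10)
p  <- plogis(-2 + .05 * (x1 - 40)^2 / 40 + sin(x2))
y  <- rbinom(n, 1, p)
fit_gam <- gam(y ~ s(x1, bs = "cr") + s(x2), family = binomial)  # cubic and thin plate splines
summary(fit_gam)
plot(fit_gam, pages = 1)    # estimated functions s1 and s2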
In Fig. 3.10, with a binary outcome y, a plain logistic regression is used on the
left, with ηi = β0 + β1 x1i + β2 x2i (and linear isodensity curves), and ηi = β0 +
s1 (x1i ) + s2 (x2i ) on the right, on the toydata2 dataset. The true level curves
can be visualized in Fig. 1.4.

Fig. 3.9 Explanatory diagram of a generalized additive model (GAM), starting from the same
predictor variables x = (x1 , · · · , xk ) (on the left) and with the same target variable y (on the
right). Each continuous variable xj is converted into a function h(xj ), expressed in some basis
functions (such as splines), h1 (xj ), · · · , hk (xj )

Fig. 3.10 Evolution of (x1 , x2 ) → m̂(x1 , x2 , A), on the toydata2, with a plain logistic
regression on the left (generalized linear model [GLM]), and a generalized additive model (GAM)
on the right, fitted on the toydata2 training dataset. The area in the lower left corner corresponds
to low probabilities for P[Y = 1|X1 = x1 , X2 = x2 ], whereas the area in the upper right corner
corresponds to high probabilities. True values of (x1 , x2 ) → μ(x1 , x2 , A) = E[Y |x1 , x2 , A] are on
the left of Fig. 1.6. The scale can be visualized on the right (in %)

One could also consider adding “interaction” terms. Following Friedman and
Popescu (2008), given two variables xj and xj′ , a model x → m(x) contains
interactions between xj and xj′ if

$$E\left[\left(\frac{\partial^2 m(x)}{\partial x_j\,\partial x_{j'}}\bigg|_{x=X}\right)^2\right]>0.$$
Classically, actuaries have considered a simple product, (x1 , x2 ) → x1 x2, between
the two variables to capture the joint effect.
Another natural extension is the class of “linear mixed models” (LMM), with

$$y=X\beta+Z\gamma+\varepsilon,$$

where .X and .β are the fixed effects design matrix, and fixed effects (fixed but
unknown) respectively, whereas .Z and .γ are the random effects design matrix
and random effects, respectively. The latter is used to capture some remaining
heterogeneity that was not captured by the .x variables. Here, .ε contains the
residual components, supposedly independent of .γ . And naturally, one can define
a “generalized linear mixed model” (GLMM), as studied in McCulloch and Searle
(2004),

$$g\big(E[Y|\gamma]\big)=x^\top\beta+z^\top\gamma,$$

as in Jiang and Nguyen (2007) and Antonio and Beirlant (2007). It is possible to
make connections between credibility and a GLMM, as in Klinker (2010). The R
package glmm can be used here.

3.3.3 Penalized Generalized Linear Models

In the context of linear models, y = Xβ + ε, the ordinary least squares (OLS)
estimate of β is β̂ = (X⊤X)−1X⊤y, and can be computed only if X is a full-
rank matrix (X⊤X has to be inverted). But when some variables are highly correlated,
there can be numerical issues. Hoerl and Kennard (1970) suggested using Tikhonov
regularization, consisting in adding a (small) positive value on the diagonal of X⊤X,
so that the matrix can be inverted. Thus, one could consider, for some λ > 0,

$$\widehat{\beta}_{\lambda}^{\,\text{ridge}}=(X^\top X+\lambda\mathbb{I})^{-1}X^\top y,$$

which can also be seen as the solution to the penalized objective

$$\widehat{\beta}_{\lambda}^{\,\text{ridge}}=\underset{\beta}{\text{argmin}}\left\{\sum_{i=1}^{n}(y_i-x_i^\top\beta)^2+\lambda\sum_{j=1}^{k}\beta_j^2\right\}
=\underset{\beta}{\text{argmin}}\Big\{\underbrace{\|y-X\beta\|_2^2}_{\text{empirical risk}}+\underbrace{\lambda\|\beta\|_2^2}_{\text{penalty}}\Big\}.$$

Here, λ ≥ 0 is a tuning parameter. It can be related to the Lagrangian in some
constrained optimization problem,

$$\min\big\{\|y-X\beta\|_2^2\big\}\quad\text{subject to}\quad\|\beta\|_2^2\le k.$$

For the interpretation, variables should have the same scale, so classically, variables
are standardized, to have unit variance. In an OLS context, we want to solve
Definition 3.14 (Ridge Estimator (OLS) (Hoerl and Kennard 1970))

$$\widehat{\beta}_{\lambda}^{\,\text{ridge}}=\underset{\beta\in\mathbb{R}^k}{\text{argmin}}\left\{\frac{1}{2}\sum_{i=1}^{n}(y_i-x_i^\top\beta)^2+\lambda\sum_{j=1}^{k}\beta_j^2\right\},$$

or more generally (when maximizing the log-likelihood),
Definition 3.15 (Ridge Estimator (GLM))

$$\widehat{\beta}_{\lambda}^{\,\text{ridge}}=\underset{\beta\in\mathbb{R}^k}{\text{argmin}}\left\{-\sum_{i=1}^{n}\log f\big(y_i|\mu_i=g^{-1}(x_i^\top\beta)\big)+\lambda\sum_{j=1}^{k}\beta_j^2\right\}.$$

See van Wieringen (2015) for many more results on ridge regression. The “least
absolute shrinkage and selection operator” regression, also called “LASSO,” was
introduced in Santosa and Symes (1986), and popularized by Tibshirani (1996), who
extended Breiman (1995). Heuristically, the best subset selection problem can be
expressed as

$$\min\big\{\|y-X\beta\|_2^2\big\}\quad\text{subject to}\quad\|\beta\|_0\le\kappa,$$

where ∥β∥0 denotes the so-called “ℓ0-norm,” which is defined as ∥β∥0 = κ if
exactly κ components of β are nonzero. More generally, consider

$$\min\big\{\|y-X\beta\|_2^2\big\}\quad\text{subject to}\quad\|\beta\|_p\le\kappa,$$

with $\|\beta\|_p=\big(\sum_{j}|\beta_j|^p\big)^{1/p}$. On the one hand, if p ≤ 1, the optimization
problem can be seen as a variable selection technique, as the optimal parameter has
some null components (this corresponds to the statistical concept of “sparsity,” see
Hastie et al. (2015)). On the other hand, if p ≥ 1, it is a convex constraint (strictly
convex if p > 1), which simplifies computations. Thus, p = 1 is an interesting
case. When the objective is the sum of the squares of residuals, we want to solve
Definition 3.16 (LASSO Estimator (OLS) (Tibshirani 1996))

$$\widehat{\beta}_{\lambda}^{\,\text{lasso}}=\underset{\beta}{\text{argmin}}\left\{\frac{1}{2}\sum_{i=1}^{n}(y_i-x_i^\top\beta)^2+\lambda\sum_{j=1}^{k}|\beta_j|\right\},$$
or more generally (when maximizing the log-likelihood)


Definition 3.17 (LASSO Estimator (GLM))
⎧ ⎫
⎨  n 
k ⎬
.
lasso
βλ = argmin − log f (yi |μi = g −1 (x i β)) + λ |βj | .
⎩ ⎭
i=1 j =1

And it is actually possible to consider the “elastic net” method that (linearly)
combines the ℓ1 and ℓ2 penalties of the LASSO and ridge methods. Starting from
the LASSO penalty, Zou and Hastie (2005) suggested adding a quadratic penalty
term that serves to enforce strict convexity of the loss function, resulting in a unique
minimum. Within the OLS framework, consider

$$\widehat{\beta}_{\lambda_1,\lambda_2}^{\,\text{elastic}}=\underset{\beta}{\text{argmin}}\left\{\frac{1}{2}\sum_{i=1}^{n}(y_i-x_i^\top\beta)^2+\lambda_1\sum_{j=1}^{k}|\beta_j|+\lambda_2\sum_{j=1}^{k}\beta_j^2\right\}.$$

In R, the package glmnet can be used to estimate those models, as in Fig. 3.11,
on the GermanCredit training dataset.6 We can visualize the shrinkage and the
variable selection: if we want to consider only two variables, the indicator associated
with no checking account (1(no checking account)) and the duration of the
credit (duration) are supposed to be the “best” two. Note that in those algorithms,
variables y and x are usually centered, to remove the intercept β0, and are also
scaled. If we further assume that the variables x are orthogonal with unit ℓ2 norm,
X⊤X = I, then

$$\widehat{\beta}_{\lambda}^{\,\text{ridge}}=\frac{1}{1+\lambda}\,\widehat{\beta}^{\,\text{ols}}\propto\widehat{\beta}^{\,\text{ols}},$$

6 This is a simple example, with covariates that are both continuous and categorical. See Friedman

et al. (2001) or Hastie et al. (2015) for more details.



Fig. 3.11 Evolution of λ → β̂λlasso, on the germancredit dataset, for a logistic regres-
sion with continuous variables duration and Credit_amount, as well as indicators
1(no checking account) or 1(car (new))

which is a “shrinkage estimator” of the least squares estimator (as the propor-
tionality coefficient is smaller than 1). Considering λ > 0 will induce a bias,
E[β̂ols] ≠ E[β̂λridge], but at the same time it could (hopefully) “reduce” the variance,
in the sense that Var[β̂ols] − Var[β̂λridge] is a positive matrix. Theobald (1974) and
Farebrother (1976) proved that such a property holds for some λ > 0.
The variance reduction of those estimators comes at a price: the estimators
of .β are deliberately biased, as well as predictions as . yi = x i 
β λ (whatever
the penalty considered); consequently, those models are not well calibrated (in
the sense discussed in Sect. 4.3.3). Nevertheless, as discussed for instance in
Steyerberg et al. (2001), in a classification context, those techniques might actually
improve the calibration of predictions, especially when the number of covariates
is large. And as proved by Fan and Li (2001), the LASSO estimate satisfies a nice
consistency property, in the sense that the probability of estimating 0’s for zero-
valued parameters tends to one, when .n → ∞. The algorithm selects the correct
variables and provides unbiased estimates of selected parameters, satisfying an
“oracle property.”
It is also possible to make a connection between credibility and penalized regres-
sion, as shown in Miller (2015a) and Fry (2015). Recall that Bühlmann credibility is
the solution of a Bayesian model whose prior depends on the likelihood hypothesis
for the target, as discussed in Jewell (1974), Klugman (1991) or Bühlmann and
Gisler (2005). And penalized regression is the solution of a Bayesian model with
either a normal (for ridge) or a Laplace (for LASSO) prior.
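A minimal sketch of such penalized logistic regressions with glmnet, on simulated data (with the actual GermanCredit data, x would be the model matrix of the selected covariates and y the default indicator):

library(glmnet)
set.seed(1)
n <- 500; k <- 10
x <- matrix(rnorm(n * k), n, k)
y <- rbinom(n, 1, plogis(x %*% c(1, -1, .5, rep(0, k - 3))))
fit_ridge <- glmnet(x, y, family = "binomial", alpha = 0)    # ridge (l2 penalty)
fit_lasso <- glmnet(x, y, family = "binomial", alpha = 1)    # LASSO (l1 penalty)
plot(fit_lasso, xvar = "lambda")          # coefficient paths, in the spirit of Fig. 3.11
cv <- cv.glmnet(x, y, family = "binomial", alpha = 1)        # lambda by cross-validation
coef(cv, s = "lambda.1se")                # sparse vector: many coefficients exactly 0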

3.3.4 Neural Networks

We refer to Denuit et al. (2019b) for more details, and applications in actuarial
science. Neural networks are first an architecture that can be seen as an extension of
the one we have seen with GLMs and GAMs. In Fig. 3.12, we have a neural network
with two “hidden layers,” between the predictor variables .x = (x1 , · · · , xk ), and
the output. The first layer consists in three “neurons” (or latent variables), and the
second layer consists in two neurons.
To get a more visual understanding, one can consider the use of principal
component analysis (PCA) to reduce dimension in a GLM, as in Fig. 3.13. The
single layer consists here in the collection of the k principal components, obtained
using simple algebra, so that each component .zj is a linear combination of the
predictors .x. In this architecture, we consider a single layer, with k neurons, and
only two are used afterward. Here, we keep the idea of using a linear combination
of the variables.
Once the architecture is fixed, we try to construct (a priori) interpretative neurons
in the intermediate layer, we care only about accuracy, and the construction of
intermediate variables is optimized (this is called “back propagation”). As a starting
point, consider some binary explanatory variables, .x ∈ {0, 1}k ; McCulloch and Pitts
(1943) suggested a simple model, with threshold b
⎛ ⎞
k
.yi = h ⎝ xj,i ⎠ , where h(x) = 1(x ≥ b),
j =1


Fig. 3.12 Explanatory diagram of a neural network, starting from the same predictor variables
.x = (x1 , · · · , xk ) (actually, a normalized version of those variables, as explained in Friedman
et al. (2001)) and with the same target variable y. The intermediate layers (of neurons) can be
considered the constitution of intermediate features that are then aggregated


Fig. 3.13 Explanatory diagram showing the use of principal component analysis (PCA, see
Sect. 3.4) to reduce dimension in a GLM, seen as a neural network architecture, starting from the
same predictor variables .x = (x1 , · · · , xk ) and with the same target variable y. The intermediate
layer consists in the k principal components. Then, the GLM is considered not on the k predictors
but on the first two principal components

or (equivalently)

$$y_i=h\left(\omega+\sum_{j=1}^{k}x_{j,i}\right),\quad\text{where }h(x)=\mathbf{1}(x\ge 0),$$

with weight .ω = −b. The trick of adding 1 as an input was very important (as the
intercept in the linear regression) and can lead to simple interpretation. For instance,
if .ω = −1 we recognize the “or” logical operator (.yi = 1 if .∃j such that .xj,i = 1),
whereas if .ω = −k, we recognize the “and” logical operator (.yi = 1 if .∀j , .xj,i =
1). Unfortunately, it is not possible to get the “xor” logical operator (the “exclusive
or,” i.e., yi = 1 if x1,i ≠ x2,i) with this architecture. Rosenblatt (1961) considered
the extension where x’s are real-valued (instead of binary), with “weight” ω ∈ Rk
(the word is between quotation marks because here, weights can be negative),

$$y_i=h\left(\sum_{j=1}^{k}\omega_j x_{j,i}\right),\quad\text{where }h(x)=\mathbf{1}(x\ge b),$$

or (equivalently, with ω0 = −b)

$$y_i=h\left(\omega_0+\sum_{j=1}^{k}\omega_j x_{j,i}\right),\quad\text{where }h(x)=\mathbf{1}(x\ge 0),$$

with “weights” ω ∈ Rk+1. Even if there are no probabilistic foundations here, we
recognize an expression close to y = g−1(x⊤β).
Minsky and Papert (1969) proved that those perceptrons were linear separators,
unfortunately not very powerful. The step function h was quickly replaced by the
sigmoid function h(x) = (1 + e−x)−1 that corresponds to the logistic function,
already used by statisticians since Wilson and Worcester (1943) and Berkson (1944).
This function h is called the “activation function.” Scientists working in signal
theory used both y ∈ {0, 1} (off and on) and y ∈ {−1, +1} (negative and positive).
In the latter case, one can consider the hyperbolic tangent as an activation function,

$$h(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}},$$

or the popular ReLU function (“rectified linear unit”). So here, for a classification
problem,

$$y_i=h\left(\omega_0+\sum_{j=1}^{k}\omega_j x_{j,i}\right)=m(x_i).$$

But so far, there is nothing more, compared with an econometric model (actually
we have less since there is no probabilistic foundations, so no confidence intervals
for instance). The interesting idea was to replicate those models, in a network.
Instead of mapping .x and y, the idea is to map .x and some latent variables .zj , and
then to map .z and y, using the same structure. And possibly, we can use a deeper
network, not with one single layer, but many more. Consider Fig. 3.14, which is a
simplified version of Fig. 3.12.


Fig. 3.14 Neural network, starting from the same predictor variables .x = (x1 , · · · , xk ) and with
the same target variable y, and a single layer, with variables .z = (z1 , · · · , zJ )

Here we consider a single layer, with z = (z1 , z2 , z3 ),

$$\begin{cases}
z_j=h_j\big(\omega_{j,0}+x^\top\omega_j\big), & (\omega_{j,0},\omega_j)\in\mathbb{R}^{k+1}\\[1mm]
y=h\big(\omega_0+z^\top\omega\big), & (\omega_0,\omega)\in\mathbb{R}^{J+1},
\end{cases}$$

so that, when plugging in,

$$y=m_{\omega}(x)=h\big(\omega_0+z^\top\omega\big)=h\left(\omega_0+\sum_{j=1}^{J}\omega_j\,h_j\big(\omega_{j,0}+x^\top\omega_j\big)\right).$$
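A minimal sketch of this forward pass: a single hidden layer with J = 3 neurons and sigmoid activations, with weights set by hand (no back-propagation here, and all names are illustrative).

sigmoid <- function(x) 1 / (1 + exp(-x))
forward <- function(x, W, w0, omega, omega0) {
  # W: J x k matrix of input weights, w0: J intercepts (one per neuron)
  z <- sigmoid(w0 + W %*% x)              # hidden layer z_1, ..., z_J
  sigmoid(omega0 + sum(omega * z))        # output m_omega(x)
}
set.seed(1)
k <- 4; J <- 3
W <- matrix(rnorm(J * k), J, k); w0 <- rnorm(J)
omega <- rnorm(J); omega0 <- rnorm(1)
forward(x = c(0.2, -1, 0.5, 1), W = W, w0 = w0, omega = omega, omega0 = omega0)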

Our model mω is now based on (k + 1)(J + 1) parameters. Given a model mω, we
can compute the quadratic loss function,

$$\sum_{i=1}^{n}\big(y_i-m_{\omega}(x_i)\big)^2,$$

the cross-entropy (the negative log-likelihood),

$$-\sum_{i=1}^{n}\Big(y_i\log m_{\omega}(x_i)+[1-y_i]\log[1-m_{\omega}(x_i)]\Big),$$

or any empirical risk, based on a loss ℓ,

$$\sum_{i=1}^{n}\ell\big(y_i,m_{\omega}(x_i)\big).$$

The idea is then to get optimal weights ω̂ = (ω̂, ω̂1 , · · · , ω̂J ), by solving

$$\widehat{\omega}=\underset{\omega}{\text{argmin}}\left\{\sum_{i=1}^{n}\ell\big(y_i,m_{\omega}(x_i)\big)\right\}.$$

This corresponds to the “back-propagation” problem. Of course, theoretically, this


can be performed whatever the architecture of the network, with multiple layers.
Although classical regression models (such as GLM) are based on convex
optimization problems, (deep) neural networks are usually not convex problems.
And sometimes, the dimension of the dataset can be really big, so we cannot pass
all the data to the computer at once to compute .m(x), we need to divide the data
into smaller subsets and give it to our computer one by one (called “batches”).
In the previous figures, we considered the case where layers were somehow
related to some dimension reduction, with a number of hidden nodes .z smaller than
the number of predictors .x, but it is actually possible to consider more nodes in the

hidden layers than the number of predictors. The point is to capture nonlinearity and
nonmonotonicity.
Without going too much further (see Denuit et al. (2019b) or Wüthrich and Merz
(2022) for more details), let us stress here that neural networks are intensively used
because they have strong theoretical foundations and some “universal approxima-
tion theorems,” even in very general frameworks. Those results were obtained at
the end of the 1980s, with Cybenko (1989), with an arbitrary number of artificial
neurons, or Hornik et al. (1989); Hornik (1991), with multilayer feed-forward
networks and only one single hidden layer. Later on, Leshno et al. (1993) proved that
the “universal approximation theorem” was equivalent to having a nonpolynomial
activation function (see Haykin (1998) for more details on theoretical properties).
Those theorems provide a guarantee of having a uniform approximation of μ by
some mw, in the sense that, for any ε > 0, one can find w—and an appropriate
architecture—such that sup{|mw(x) − μ(x)|} ≤ ε. This leads to the idea that if
we cannot model properly, we simply need more data (and more complex models),
corresponding to “deep” models.
On top of that, a major drawback, especially in actuarial science, is the lack
of probabilistic foundations of those models. This concern was raised in some
literature published in the 1980s, starting with Rumelhart et al. (1985, 1986), or
more recently Hertz et al. (1991) and Buntine and Weigend (1991), where back-
propagation is formalized in a Bayesian context, taken up by MacKay (1992) and
Neal (1992) more than 30 years ago (or more recently Neal (2012) Theodoridis
(2015), Gal and Ghahramani (2016) and Goulet et al. (2021)). Observe that similar
issues were discussed in other machine-learning techniques, such as “support vector
machine,” where the distance to the separation line is used as a score that can
then be interpreted as a probability, as “Platt scaling” in Platt et al. (1999) or
“isotonic regression” in Zadrozny and Elkan (2001, 2002) (see also Niculescu-
Mizil and Caruana (2005a), where “good probabilities” are defined), as discussed in
Sect. 4.3.3.

3.3.5 Trees and Forests

Again, let us briefly explain the general idea (see Denuit et al. (2020) for more
details). Decision trees appeared in the statistical literature in the 1970s and the
1980s, with Messenger and Mandell (1972) with Theta Automatic Interaction
Detection (THAID), then Breiman and Stone (1977) and Breiman et al. (1984)
with Classification And Regression Trees (CART), as well as Quinlan (1986, 1987,
1993) with Iterative Dichotomiser 3 (ID3) that later became C4.5 (“a landmark
decision tree program that is probably the machine learning workhorse most widely
used in practice to date” as Witten et al. (2016) wrote). One should probably
mention here the fact that the idea of “decision trees” was mentioned earlier in
psychology, as discussed in Winterfeldt and Edwards (1986), but without any details
about algorithmic construction (see also Lima (2014) for old representations of
“decision trees”). Indeed, as already noted in Laurent and Rivest (1976) constructing
optimal binary decision trees is computationally demanding, and a greedy (top-down) procedure is used instead to speed up the process. Starting with the entire training set, we select the most predictive variable (with respect to some criterion, discussed below) and we split the population in two, using that variable and an appropriately chosen threshold. We then iterate within each sub-population. Heuristically, each sub-population should be as homogeneous as possible with respect to y. In the methods considered previously, all explanatory variables were considered together, using linear algebra techniques to solve optimization problems efficiently; here, variables are used sequentially (or, to be more specific, binary step functions, as $x_j$ becomes $\mathbf{1}(x_j \leq t)$ for some optimally selected threshold t). After some iterations, we end up with subgroups, or sub-regions of $\mathcal{X}$, called "leaves" or terminal nodes, whereas intermediary splits are called internal nodes. In the visual representation, segments connecting nodes are called "branches," and, to continue with the arboricultural and forestry metaphors, we speak of "pruning" when we trim the tree to avoid possible overfitting.
As mentioned, after several iterations, we split the population and the space $\mathcal{X}$ into a partition of J regions, $R_1, \cdots, R_J$, such that $R_i \cap R_j = \emptyset$ when $i \neq j$ and $R_1 \cup R_2 \cup \cdots \cup R_J = \mathcal{X}$. Then, in each region, prediction is performed simply by considering the average of y for the observations in that specific region; in a regression context, if $|R_j|$ denotes the number of observations in region $R_j$,
$$\widehat{y}_{R_j} = \frac{1}{|R_j|} \sum_{i:\, \boldsymbol{x}_i \in R_j} y_i,$$

or using a majority rule for a classifier. Classically, regions are (hyper-)rectangles in $\mathcal{X} \subset \mathbb{R}^k$ (or orthants), in order to simplify the construction of the tree, and to have a simple and graphical interpretation of the model. In a regression context, the classical strategy is to minimize the mean squared error, which is the in-sample empirical risk for the $\ell_2$-norm,
$$\text{MSE} = \sum_{j=1}^{J} \text{MSE}_j \quad\text{where}\quad \text{MSE}_j = \sum_{i:\,\boldsymbol{x}_i \in R_j} \big(y_i - \widehat{y}_{R_j}\big)^2,$$
where $\widehat{y}_{R_j}$ is the prediction in region $R_j$. For a classification problem, observe that
$$\text{MSE}_j = \sum_{i:\,\boldsymbol{x}_i \in R_j} \big(y_i - \widehat{y}_{R_j}\big)^2 = n_{0,j}\big(0 - \widehat{y}_{R_j}\big)^2 + n_{1,j}\big(1 - \widehat{y}_{R_j}\big)^2,$$

so that, if $n_{0,j}$ and $n_{1,j}$ denote the number of observations such that $y=0$ and $y=1$ respectively in the region $R_j$ (so that $\widehat{y}_{R_j} = n_{1,j}/(n_{0,j}+n_{1,j})$),
$$\text{MSE}_j = n_{0,j}\left(\frac{n_{1,j}}{n_{0,j}+n_{1,j}}\right)^2 + n_{1,j}\left(\frac{n_{0,j}}{n_{0,j}+n_{1,j}}\right)^2 = \frac{n_{0,j}\,n_{1,j}}{n_{0,j}+n_{1,j}},$$
and therefore
$$\text{MSE} = \sum_{j=1}^{J} \frac{n_{0,j}\,n_{1,j}}{n_{0,j}+n_{1,j}} = \sum_{j=1}^{J} n_j \cdot p_{1,j}(1-p_{1,j}),$$
which corresponds to the "Gini impurity index".


Formally, an impurity index is some function .ψ : [0, 1] → R+ positive,
symmetric (.ψ(u) = ψ(1 − u)), minimal in 0 and 1 (consider some normalized
version .ψ(0) = ψ(1) = 0). Classical indices are the Gini index, .ψ(u) = u(1 − u),
the misclassification function .ψ(u) = 1 − max{u, 1 − u}, and the (cross) entropy,
.ψ(u) = −u log u − (1 − u) log(1 − u).

Instead of considering all possible partitions of .X that are rectangles, some “top-
down greedy approach” is usually considered, with some recursive binary splitting.
The algorithm is top-down as we start with all observations in the same class
(usually called the “root” of the tree) and then we split into regions that are smaller
and smaller. And it is greedy because optimization is performed without looking
back at the past. Formally, at the first stage, select a variable $x_\kappa$ and some cutoff point t, and consider the half spaces
$$R_1(\kappa,t) = \{\boldsymbol{x} = (x_1,\cdots,x_k)\,|\,x_\kappa < t\} \subset \mathcal{X} \quad\text{and}\quad R_2(\kappa,t) = \{\boldsymbol{x} = (x_1,\cdots,x_k)\,|\,x_\kappa \geq t\} \subset \mathcal{X}.$$

We seek the regions that minimize the mean squared error,
$$\text{MSE}_\kappa(t) = \sum_{i:\,\boldsymbol{x}_i \in R_1(\kappa,t)} \big(y_i - \widehat{y}_{R_1(\kappa,t)}\big)^2 + \sum_{i:\,\boldsymbol{x}_i \in R_2(\kappa,t)} \big(y_i - \widehat{y}_{R_2(\kappa,t)}\big)^2,$$
or, more generally, any impurity function for a classification problem,
$$n_1 \cdot \psi\big(\widehat{y}_{R_1(\kappa,t)}\big) + n_2 \cdot \psi\big(\widehat{y}_{R_2(\kappa,t)}\big).$$
Then find the best variable and the best cutoff, solution of
$$\min_{\kappa=1,\cdots,k} \Big\{ \inf_{t \in \mathcal{X}_\kappa} \text{MSE}_\kappa(t) \Big\}.$$

At stage $j+1$, repeat the previous procedure on all regions created at stage j: within each region $\mathcal{X}_r$, identify the variable $x_\kappa$ and the cut-off point t that minimize the empirical risk and yield the split of $\mathcal{X}_r$ into the two half spaces
$$\{\boldsymbol{x}\in\mathcal{X}_r\,|\,x_\kappa < t\} \quad\text{and}\quad \{\boldsymbol{x}\in\mathcal{X}_r\,|\,x_\kappa \geq t\}.$$

And iterate. Ultimately, each leaf contains a single observation, which corresponds to a null empirical risk on the training dataset, but such a tree will hardly generalize. To avoid this overfit, it is necessary either to introduce a stopping criterion or to prune the complete tree. In practice, we stop when the leaves have reached a minimum number of observations, set beforehand. An alternative is to stop if the (relative or absolute) variation of the objective function does not decrease enough. Formally, to decide whether a leaf $\{N\}$ should be divided into $\{N_L, N_R\}$, compute the impurity variation
$$\Delta I(N_L,N_R) = I(N) - I(N_L,N_R) = \psi\big(\widehat{y}_{N}\big) - \Big(\frac{n_L}{n}\,\psi\big(\widehat{y}_{N_L}\big) + \frac{n_R}{n}\,\psi\big(\widehat{y}_{N_R}\big)\Big).$$
We decide to split if $\Delta I(N_L,N_R)/I(N)$ exceeds some complexity parameter, usually denoted cp (the default value in rpart in R is 1%).
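To fix ideas, here is a minimal sketch in R of how such a tree can be grown and pruned with rpart; the data frame toydata2 and its column names (y, x1, x2, x3) are assumptions used for illustration only.

library(rpart)

# Grow a classification tree (y assumed to be a factor with levels "0" and "1");
# cp is the complexity parameter discussed above, minbucket the minimum number
# of observations allowed in a terminal leaf
tree <- rpart(y ~ x1 + x2 + x3, data = toydata2, method = "class",
              control = rpart.control(cp = 0.01, minbucket = 20))

# Inspect the cross-validated error of each subtree, then prune
printcp(tree)
pruned <- prune(tree, cp = tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"])

# Predicted probability of y = 1 for a new observation
predict(pruned, newdata = data.frame(x1 = -1, x2 = 8, x3 = -2), type = "prob")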
In Fig. 3.15 we can visualize the classification tree obtained on the toydata2 dataset. Recall that in the entire training population, $\overline{y} = 40\%$. Then the tree grows as follows:
• if $x_1 < 0.61$ (first node, 73% of the population, $\overline{y} = 26\%$)
  • if $x_2 < 5.9$ (first node, first branch, 43%, $\overline{y} = 12\%$, final leaf)
  • if $x_2 \geq 5.9$ (second node, first branch, 30%, $\overline{y} = 30\%$)
    • if $x_1 < -0.39$ (first node, second branch, 21%, $\overline{y} = 38\%$, final leaf)
    • if $x_1 \geq -0.39$ (second node, second branch, 9%, $\overline{y} = 63\%$, final leaf)

Fig. 3.15 Classification tree on the toydata2 dataset, with default pruning parameters of
rpart. At the top, the entire population (.100%, .y = 40%) and at the bottom six leaves, on the
left, .43% of the population and .y = 12% and on the right, .16% of the population and .y = 89%.
The six regions (in space .X), associated with the six leaves, can be visualized on the right-hand
side of Fig. 3.18

Table 3.1 Predictions for two individuals, Andrew and Barbara, with models trained on toydata2, with the true value $\mu$ (used to generate the data), then a plain logistic model (glm), an additive logistic model (gam), a classification tree (cart), a random forest (rf), and a boosting model (gbm, as in Fig. 3.21)

            x1   x2   x3   s   mu(x,s)   m_glm(x)   m_gam(x)   m_cart(x)   m_rf(x)   m_gbm(x)
  Andrew    -1    8   -2   A   0.366     0.379      0.372      0.384       0.310     0.354
  Barbara    1    4    2   B   0.587     0.615      0.583      0.434       0.622     0.595

Fig. 3.16 Classification tree on the GermanCredit dataset, with default pruning parameters of
rpart. At the top, the entire population (.100%, .y = 30%) and at the bottom nine leaves, on the
left, .45% of the population and .y = 13% and on the right, .3% of the population and .y = 35%

• if $x_1 \geq 0.61$ (second node, 27% of the population, $\overline{y} = 77\%$)
  • if $x_2 < 4.3$ (first node, third branch, 11%, $\overline{y} = 60\%$)
    • if $x_1 < 1.6$ (first node, fourth branch, 6%, $\overline{y} = 43\%$, final leaf)
    • if $x_1 \geq 1.6$ (second node, fourth branch, 5%, $\overline{y} = 83\%$, final leaf)
  • if $x_2 \geq 4.3$ (second node, third branch, 16%, $\overline{y} = 89\%$, final leaf)

Observe that only variables $x_1$ and $x_2$ are used here. It is then possible to visualize the prediction for any given pair $(x_1, x_2)$, the values of $x_3$ and s having no influence, as in Fig. 3.15. It is also possible to visualize two specific predictions, as in Table 3.1 (we discuss further interpretations for those two specific individuals in Sect. 4.1).
In Fig. 3.16 we can visualize the classification tree obtained on the
GermanCredit dataset. Recall that in the entire training population (.n = 700),
.y = 30.1%. Then the tree grows as follows,

• if Account_status ≥ 200 (first node, . . . % of the population, $\overline{y} = 13.2\%$, final leaf)
• if Account_status < 200 (first node, 73% of the population, $\overline{y} = 44.1\%$)
  • if Duration < 22.5 (first node, second branch, 43%, $\overline{y} = 12\%$, final leaf)
  • . . .

For the pruning procedure, create a very large and deep tree, and then cut some branches. Formally, given a large tree $T_0$, identify a subtree $T \subset T_0$ that minimizes
$$\sum_{m=1}^{|T|} \sum_{i:\,\boldsymbol{x}_i \in R_m} \big(y_i - \widehat{y}_{R_m}\big)^2 + \alpha\,|T|,$$
where $\alpha$ is some complexity parameter, and $|T|$ is the number of leaves in the subtree T. Observe that it is similar to the penalized methods described previously, used to get a tradeoff between bias and variance, between accuracy and parsimony.


Those classification and regression trees are easy to compute and to interpret. Unfortunately, trees are rather unstable (even if the prediction is much more robust). The idea introduced by Breiman (1996a) consists in growing multiple trees to get a collection of trees, or a "forest," to improve classification by combining classifiers obtained from randomly generated training sets (using a "bootstrap" procedure, corresponding to resampling with replacement), and then aggregating them. This corresponds to "bagging," which is an ensemble approach.

3.3.6 Ensemble Approaches

We have seen so far how to estimate various models, and then use some metrics to select "the best one." But rather than choosing the best among different models, it could be more efficient to combine them. Among those "ensemble methods" are bagging, random forests, and boosting (see Sollich and Krogh (1995), Opitz and Maclin (1999) or Zhou (2012)). Those techniques can be related to "Bayesian model averaging," which linearly combines submodels of the same family, weighted by the posterior probabilities of each model, as coined in Raftery et al. (1997) or Wasserman (2000), and to "stacking," which involves training a model to combine the predictions of several other learning algorithms, as described in Wolpert (1992) or Breiman (1996c).
It should be stressed here that a "weak learner" is defined as a classifier that correlates only slightly with the true classification (it can label examples slightly better than random guessing, so to speak). In contrast, a "strong learner" is a classifier that is arbitrarily well correlated with the true classification. Long story short, we discuss in this section the idea that combining "weak learners" could yield better results than seeking a single "strong learner".
A first approach is the one described in Fig. 3.17: consider a collection of predictions, $\{\widehat{y}^{(1)}, \cdots, \widehat{y}^{(k)}\}$, obtained using k models (possibly from different families: GLM, trees, neural networks, etc.), consider a linear combination of those models, and solve a problem like
$$\min_{\boldsymbol{\alpha} \in \mathbb{R}^k} \sum_{i=1}^{n} \ell\big(y_i,\, \boldsymbol{\alpha}^\top \widehat{\boldsymbol{y}}_i\big), \quad\text{where } \widehat{\boldsymbol{y}}_i = \big(\widehat{y}_i^{(1)},\cdots,\widehat{y}_i^{(k)}\big).$$

Fig. 3.17 Explanatory diagram of parallel training, or “bagging” (bootstrap and aggregation),
starting from the same predictor variables .x = (x1 , · · · , xk ) and with the same target variable
y. Different models .mj (x) are fitted, and the outcome is usually the average of the models

In a classification problem, a popular aggregation function could be the “majority


rule”. The predicted class is the one most models predicted. Unfortunately, this opti-
mization problem can be rather unstable, as predictions obtained using the different
models are usually highly (positively) correlated with one another. Therefore, it is
rather natural to add a penalization in the previous problem, to use only a subset of
models.
Historically, de Condorcet (1785) already suggested similar techniques to make
decisions by taking into account a plurality of votes. Consider a jury, with k judges,
and suppose that p is the probability of making a mistake (with a binary decision). If
we suppose that mistakes are made independently, the probability that the majority
makes a mistake is
$$\sum_{j \geq [(k+1)/2]} \binom{k}{j}\, p^j (1-p)^{k-j}.$$

For example, with $k = 11$ judges, if each judge has a 30% chance of making the wrong decision, the majority is wrong with only an 8% chance. This probability decreases with k, unless there is a strong correlation between the judges. So, in an ideal situation, ensemble techniques work better if predictions are independent from each other.
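As a small numerical check of the 8% figure above (a sketch, nothing more), the probability that the majority of k independent judges is wrong can be computed directly in R:

# Probability that the majority of k independent judges is wrong, when each
# judge is wrong with probability p (binary decision, k odd)
p_majority_wrong <- function(k, p) {
  j <- ceiling((k + 1) / 2):k          # at least (k+1)/2 wrong decisions
  sum(choose(k, j) * p^j * (1 - p)^(k - j))
}
p_majority_wrong(11, 0.3)              # approximately 0.078, i.e. about 8%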
Galton (1907) suggested that technique while trying to “guess the weight of
an ox” in a county fair in Cornwall, England, as recalled in Wallis (2014) and
Surowiecki (2004). .k = 787 participants provided guesses . y1 , · · · , 
yk . The ox
weighed 1198 pounds, and the average of the estimates was 1197 pounds. Francis
Galton compared two strategies, either picking a single prediction .
yj or considering

the average $\overline{y}$. If t is the truth, we can write
$$\mathbb{E}\big[(\widehat{y}_J - t)^2\big] = \frac{1}{k}\sum_{j=1}^{k}\big(\widehat{y}_j - t\big)^2 = \big(\overline{y} - t\big)^2 + \frac{1}{k}\sum_{i=1}^{k}\big(\widehat{y}_i - \overline{y}\big)^2,$$
where the expected value corresponds to picking one prediction $\widehat{y}_J$ uniformly at random among the k guesses;

so clearly, using the average prediction is better than seeking the “best one.” From
a statistical perspective, hopefully, ensemble generalizes better than a single chosen
model. From a computational perspective, averaging should be faster than solving
an optimization problem (seeking the “best” model). And interestingly, it will take
us outside classical model classes.
In ensemble learning, we want models to be reasonably accurate, and as inde-
pendent as possible. A popular example is “bagging” (for bootstrap aggregating),
introduced to increase the stability of predictions and accuracy. It can also reduce
variance and overfit. The algorithm would be
1. Generate k training datasets using the bootstrap (resampling with replacement), since subdividing the database into k smaller (independent) datasets would lead to training datasets that are too small;
2. Each dataset is used as a training sample to fit a model $\widehat{m}^{(j)}$, with $j = 1,\cdots,k$, such as (deep) trees, with small bias and large variance (the variance decreases with aggregation, as shown below);
3. For a new observation $\boldsymbol{x}$, the aggregated prediction is
$$\widehat{m}^{\text{bagging}}(\boldsymbol{x}) = \frac{1}{k}\sum_{j=1}^{k} \widehat{m}^{(j)}(\boldsymbol{x}).$$
Observe that here, we use the same weights for all models. The simple version of
“random forest” is based on the aggregation of trees. In Fig. 3.18 we can visualize
the predictions on the toydata2 dataset, as a function of .x1 and .x2 (and .x3 = 0),
with a classification tree on the left-hand side, and the aggregation of .k = 500 trees
on the right.
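The bagging algorithm above can be sketched in a few lines of R, with deep rpart trees as base learners (the data frame toydata2, its columns, and the factor coding of y with levels "0" and "1" are assumptions, for illustration only):

library(rpart)

k <- 500
# 1. bootstrap resampling, 2. fit a deep tree on each resample
trees <- lapply(1:k, function(b) {
  boot <- toydata2[sample(nrow(toydata2), replace = TRUE), ]
  rpart(y ~ x1 + x2 + x3, data = boot, method = "class",
        control = rpart.control(cp = 0, minbucket = 5))   # deep trees: low bias
})

# 3. aggregated prediction: average of the k predicted probabilities
m_bagging <- function(newdata) {
  probs <- lapply(trees, function(tr) predict(tr, newdata = newdata, type = "prob")[, "1"])
  Reduce(`+`, probs) / k
}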
As explained in Friedman et al. (2001), the variance of aggregated models is
$$\mathrm{Var}\big[\widehat{m}^{\text{bagging}}(\boldsymbol{x})\big] = \mathrm{Var}\Big[\frac{1}{k}\sum_{j=1}^{k}\widehat{m}^{(j)}(\boldsymbol{x})\Big] = \frac{1}{k^2}\,\mathrm{Var}\Big[\sum_{j=1}^{k}\widehat{m}^{(j)}(\boldsymbol{x})\Big]$$
$$= \frac{1}{k^2}\sum_{j_1=1}^{k}\sum_{j_2=1}^{k}\mathrm{Cov}\big[\widehat{m}^{(j_1)}(\boldsymbol{x}),\,\widehat{m}^{(j_2)}(\boldsymbol{x})\big] \leq \frac{1}{k^2}\,k^2\,\mathrm{Var}\big[\widehat{m}^{(j)}(\boldsymbol{x})\big] = \mathrm{Var}\big[\widehat{m}^{(j)}(\boldsymbol{x})\big]$$
(here, variances and covariances are associated with the random variables $U_{j,\boldsymbol{x}} = \widehat{m}^{(j)}(\boldsymbol{x})$, for a given $\boldsymbol{x}$, reflecting the uncertainty of the training sample),

Fig. 3.18 Evolution of .(x1 , x2 ) → m


(x1 , x2 , A), on the toydata2, with a single classification
tree on the left-hand side (the output being the probability of having .y = 1), and a random forest
on the right-hand side. CART = Classification And Regression Tree

as, when $j_1 \neq j_2$, $\mathrm{Corr}\big[\widehat{m}^{(j_1)}(\boldsymbol{x}),\,\widehat{m}^{(j_2)}(\boldsymbol{x})\big] \leq 1$. And actually, if $\mathrm{Var}\big[\widehat{m}^{(j)}(\boldsymbol{x})\big] = \sigma^2(\boldsymbol{x})$ and $\mathrm{Corr}\big[\widehat{m}^{(j_1)}(\boldsymbol{x}),\,\widehat{m}^{(j_2)}(\boldsymbol{x})\big] = r(\boldsymbol{x})$,
$$\mathrm{Var}\big[\widehat{m}^{\text{bagging}}(\boldsymbol{x})\big] = r(\boldsymbol{x})\,\sigma^2(\boldsymbol{x}) + \frac{1-r(\boldsymbol{x})}{k}\,\sigma^2(\boldsymbol{x}).$$
The variance is lower when "more different" models are aggregated. More generally, as explained in (Denuit et al. 2020, Sect. 4.3.3), for a convex loss $\ell$,
$$\mathbb{E}\big[\ell\big(Y, \widehat{m}^{\text{bagging}}(\boldsymbol{X})\big)\big] \;\leq\; \mathbb{E}\big[\ell\big(Y, \widehat{m}^{(j)}(\boldsymbol{X})\big)\big].$$

Another type of ensemble model is related to sequential learning, with “boost-


ing,” described in Fig. 3.19, introduced in Schapire (1990) and Breiman (1996b).
Boosting is based on the question posed by Kearns and Valiant (1989) “can a set of
weak learners create a single strong learner?” For a regression, the algorithm for
boosting would be
1. Initialization: k (number of trees), $\gamma$ (learning rate), $m_0(\boldsymbol{x}) = \overline{y}$;
2. At stage $t \geq 1$,
   2.1 Compute "residuals" $r_{i,t} \leftarrow y_i - m_{t-1}(\boldsymbol{x}_i)$;
   2.2 Fit a model $r_{i,t} \sim h_t(\boldsymbol{x}_i)$ for some weak learner (a not-too-deep tree) $h_t$;
   2.3 Update $m_t(\cdot) = m_{t-1}(\cdot) + \gamma\, h_t(\cdot)$;
   2.4 Loop ($t \leftarrow t+1$ and return to 2.1).

Fig. 3.19 Explanatory diagram of sequential learning (“boosting”) starting from the same predic-
tor variables .x = (x1 , · · · , xk ) and with the same target variable y. Using “weak” learners .ht
learning from the residuals of the previous model, and to improve sequentially the model (.ht+1 is
fitted to explain residuals .y − mt (x), and the update is .mt+1 = mt + ht+1 )

or, more generally,
1. Initialization: k (number of trees), $\gamma$ (learning rate), and $m_0(\boldsymbol{x}) = \underset{\varphi}{\text{argmin}} \sum_{i=1}^{n} \ell(y_i, \varphi)$;
2. At stage $t \geq 1$,
   2.1 Compute "residuals" $r_{i,t} \leftarrow -\left.\dfrac{\partial \ell(y_i, \widehat{y})}{\partial \widehat{y}}\right|_{\widehat{y} = m_{t-1}(\boldsymbol{x}_i)}$;
   2.2 Fit a model $r_{i,t} \sim h_t(\boldsymbol{x}_i)$ for some weak learner $h \in \mathcal{H}$, with $h_t = \underset{h \in \mathcal{H}}{\text{argmin}} \sum_{i=1}^{n} \ell\big(r_{i,t}, h(\boldsymbol{x}_i)\big)$;
   2.3 Update $m_t(\cdot) = m_{t-1}(\cdot) + \gamma\, h_t(\cdot)$;
   2.4 Loop ($t \leftarrow t+1$ and return to 2.1).
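To make the mechanics of the first (regression) version explicit, here is a minimal sketch in R with the squared loss, so that the "residuals" are ordinary residuals, and shallow rpart trees as weak learners (the data frame train, its columns and the tuning values are assumptions, for illustration only):

library(rpart)

gamma   <- 0.1                                  # learning rate
n_trees <- 100
m0      <- mean(train$y)                        # m_0(x) = ybar
m_pred  <- rep(m0, nrow(train))
weak_learners <- vector("list", n_trees)

for (t in 1:n_trees) {
  r <- train$y - m_pred                         # residuals y - m_{t-1}(x)
  h <- rpart(r ~ x1 + x2 + x3, data = data.frame(train, r = r),
             control = rpart.control(maxdepth = 2, cp = 0))  # weak (shallow) learner
  weak_learners[[t]] <- h
  m_pred <- m_pred + gamma * predict(h, newdata = train)     # m_t = m_{t-1} + gamma * h_t
}

# prediction for new data: m_0 plus the sum of the (scaled) weak learners
m_boost <- function(newdata) {
  m0 + gamma * Reduce(`+`, lapply(weak_learners, predict, newdata = newdata))
}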
In the context of classification, Freund and Schapire (1997) introduced the
“adaboost” algorithm (for “adaptive boosting”), based on updating weights
(see also Schapire (2013) for additional heuristics). Consider some binary clas-
sification problem (where y takes values 0 or 1) with a training dataset .D =
({yi ; x i }, i = 1, . . . , n). Following Algorithm 10.1 in Friedman et al. (2001),
1. Set weights $\omega_{i,1} = 1/n$, for $i = 1,\ldots,n$, $\gamma > 0$, $m_0(\boldsymbol{x}) = 0$, and define $\boldsymbol{\omega}_1$;
2. At stage $t \geq 1$,
   2.1 Generate (by resampling, using weights $\boldsymbol{\omega}_t$) a training dataset $\mathcal{D}_t = (\mathcal{D}, \boldsymbol{\omega}_t)$;
   2.2 Learn a model $y_i \sim h_t(\boldsymbol{x}_i)$ on $\mathcal{D}_t$, with $h_t = \underset{h\in\mathcal{H}}{\text{argmin}} \sum_{i=1}^{n} \omega_{i,t}\,\ell\big(y_i, h(\boldsymbol{x}_i)\big)$;
   2.3 Compute the "weighted error rate"
$$\bar{\epsilon}_t \leftarrow \frac{\sum_{i=1}^{n} \omega_{i,t}\cdot \mathbf{1}\big(h_t(\boldsymbol{x}_i) \neq y_i\big)}{\sum_{i=1}^{n} \omega_{i,t}}$$
   on the training dataset $\mathcal{D}_t$, and compute $\alpha_t = \log\big((1-\bar{\epsilon}_t)/\bar{\epsilon}_t\big)$;
   2.4 Update the weights,
$$\omega_{i,t+1} \leftarrow \omega_{i,t}\cdot\exp\big(\alpha_t\,\mathbf{1}\big(h_t(\boldsymbol{x}_i) \neq y_i\big)\big),$$
   and update $m_{t+1}(\cdot) = m_t(\cdot) + \gamma\,\alpha_t\,h_t(\cdot)$;
   2.5 Loop ($t \leftarrow t+1$ and return to 2.1).
Friedman et al. (2000) and Niculescu-Mizil and Caruana (2005b) proved strong connections between additive models (GAM) and boosting. As we consider a classifier, a standard loss is $\ell_{0/1}(y, \widehat{y}) = \mathbf{1}(y \neq \widehat{y})$. But for computations, a classical strategy consists in "convexifying" the loss function.
In Fig. 3.20, two boosting models are trained on the toydata2 dataset, with
different learning rates .γ (respectively .2% and .10% on the left and on the right).
With a higher learning rate (on the right), convergence is obtained faster. When the
number of iterations t is too large, the model is “overfitting” , as we can see on the
right, on the validation dataset (with function gbm in R, k-fold cross-validation is
used, instead of the training/validation approach described previously; see Friedman
et al. (2001) for more discussions). On the y-axis, the Bernoulli deviance is used.
In Fig. 3.21, we focus on predictions of two specific individuals Andrew and
Barbara (as described in Table 3.1), with .t → mt (x). The “optimal” predictions

Fig. 3.20 Evolution of Bernoulli deviance of .mt with a boosting learning (adaboost) on the
toydata2 dataset, as a function of t, on the validation sample (on top) and on the training sample
(below), with two different learning rates (on the left and on the right). Vertical lines correspond to
the “optimal” number of (sequential) trees

Fig. 3.21 Evolution of .mt (x) with a boosting learning (adaboost) on the toydata2 dataset,
as a function of t (for the models .mt from Fig. 3.20), with .xB on top and .xA below, with different
learning rates

are respectively obtained with 450 and 100 trees, as those were values that minimize
the overall error, on the validation samples (in Fig. 3.21).

3.3.7 Application on the toydata2 Dataset

On the toydata2 dataset, we can visualize the prediction on two individuals,


Andrew and Barbara, in Table 3.1.
To compare ensemble method predictions (and possibly other models), we can
visualize prediction surfaces, or some more visual representations, with interpo-
lations of predictions between some individuals. More specifically, consider two
individuals, possibly in the two groups (A and B), characterized by .zA = (x A , A)
and .zB = (x B , B) respectively. Because we consider models that do not depend
on the sensitive attribute s here, we can now consider some linear interpolation
between those two individuals (Andrew on the left, Barbara on the right),
.x t = tx B + (1 − t)x A , with .t ∈ [0, 1]. In Fig. 3.22, we can visualize .t → m
(x t ).

3.3.8 Application on the GermanCredit Dataset

The classification tree on the GermanCredit training dataset can be visualized


in Fig. 3.16, with nine “groups” (corresponding to terminal leaves). Confusion
matrices, associated with various models (logistic regression and random forest),
trained on the GermanCredit dataset, can be visualized on Fig. 3.23, with
threshold .t = 30% and .50%.
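Such confusion matrices, and the associated sensitivity and specificity, can be computed with a few lines of base R (a sketch; fit_glm, the validation data frame valid and its response column y are assumptions, for illustration only):

# Confusion matrix for a probability threshold t
confusion <- function(actual, scores, t = 0.3) {
  pred <- as.integer(scores > t)
  table(prediction = factor(pred, 0:1), actual = factor(actual, 0:1))
}

scores <- predict(fit_glm, newdata = valid, type = "response")
cm <- confusion(valid$y, scores, t = 0.3)

sensitivity <- cm["1", "1"] / sum(cm[, "1"])   # true positive rate
specificity <- cm["0", "0"] / sum(cm[, "0"])   # true negative rate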

Fig. 3.22 Roughness of predictive models, with, on the right-hand side, interpolation .t → m (x t ),
where .x t = tx B +(1−t)x A , for .t ∈ [0, 1] for five models on the toydata2 dataset, corresponding
to some individual in-between Andrew and Barbara. .x t correspond to fictitious individuals on
the segments .[x A , x B ] on the left, connecting Andrew and Barbara

  Logistic regression, threshold 30%          Random forest, threshold 30%
                  actual 0  actual 1  total                   actual 0  actual 1  total
  prediction 0    160 (TN)   29 (FN)    189   prediction 0    143 (TN)   28 (FN)    171
  prediction 1     51 (FP)   60 (TP)    111   prediction 1     68 (FP)   61 (TP)    129
  total            211        89              total            211        89

  Logistic regression, threshold 50%          Random forest, threshold 50%
                  actual 0  actual 1  total                   actual 0  actual 1  total
  prediction 0    185 (TN)   46 (FN)    231   prediction 0    197 (TN)   57 (FN)    254
  prediction 1     26 (FP)   43 (TP)     69   prediction 1     14 (FP)   32 (TP)    146
  total            211        89              total            211        89

Fig. 3.23 Confusion matrices with threshold 30% (on top) and 50% (below) on 300 observations from the GermanCredit dataset, with a logistic regression on the left, and a random forest on the right (without the sensitive attribute, gender)

In Table 3.2 different metrics are used on a plain logistic regression and a random
forest approach, with the accuracy (with a confidence interval) as well as specificity
and sensitivity, computed with thresholds 70%, 50%, and 30%. On the left, models
with all features (including sensitive ones) are considered. Then, from the left to the
right, we consider models (plain logistic and random forest) without gender, without
age, and without both.

Table 3.2 Various statistics for two classifiers, a logistic regression and a random forest on
the GermanCredit dataset (accuracy—with a confidence interval (lower and upper bounds)—
specificity and sensitivity are computed with a 70%, 50%, and 30% threshold), where all variables
are considered, on the left. Then, from the left to the right, without gender, without age, and without
both
All variables – gender – age – gender and age
GLM RF GLM RF GLM RF GLM RF
AUC 0.793 0.776 0.790 0.783 0.794 0.790 0.794 0.790
τ = 70% Accuracy 0.730 0.723 0.737 0.727 0.740 0.737 0.740 0.737
Accuracy− 0.676 0.669 0.683 0.672 0.686 0.683 0.686 0.683
Accuracy+ 0.779 0.773 0.786 0.776 0.789 0.786 0.789 0.786
Specificity 0.654 1.000 0.692 1.000 0.704 0.917 0.704 0.917
Sensitivity 0.737 0.718 0.741 0.720 0.744 0.729 0.744 0.729
τ = 50% Accuracy 0.757 0.773 0.760 0.767 0.753 0.787 0.753 0.787
Accuracy− 0.704 0.722 0.708 0.715 0.701 0.736 0.701 0.736
Accuracy+ 0.804 0.819 0.807 0.813 0.801 0.832 0.801 0.832
Specificity 0.618 0.756 0.623 0.721 0.609 0.778 0.609 0.778
Sensitivity 0.797 0.776 0.801 0.774 0.797 0.788 0.797 0.788
τ = 30% Accuracy 0.723 0.677 0.733 0.680 0.723 0.690 0.723 0.690
Accuracy− 0.669 0.621 0.679 0.624 0.669 0.634 0.669 0.634
Accuracy+ 0.773 0.729 0.783 0.732 0.773 0.742 0.773 0.742
Specificity 0.526 0.469 0.541 0.472 0.526 0.485 0.526 0.485
Sensitivity 0.844 0.835 0.847 0.832 0.844 0.847 0.844 0.847

Receiver operating characteristic (ROC) curves associated with the plain logistic
regression and the random forest model (on the validation dataset) can be visualized
in Fig. 3.24, with models including all features, and then when possibly sensitive
attributes are not used.
More precisely, it is possible to compare models (here the plain logistic regression and the random forest) not with global metrics, but by plotting $\widehat{m}_{\text{rf}}(\boldsymbol{x}_i, s_i)$ against $\widehat{m}_{\text{glm}}(\boldsymbol{x}_i, s_i)$, on the left-hand side of Fig. 3.25. On the right-hand side, we can compare ranks among predictions. The outlier in the lower right corner corresponds to some individual i whose prediction $\widehat{m}_{\text{glm}}(\boldsymbol{x}_i, s_i)$ is close to the 80% quantile (and would thus be seen as a "large risk," in the worst 20% tail, with the plain logistic model) whereas $\widehat{m}_{\text{rf}}(\boldsymbol{x}_i, s_i)$ is close to the 20% quantile (and would be seen as a "small risk," in the best 20%, with the random forest model).

3.4 Unsupervised Learning

Supervised learning corresponds to the case we want to model and predict a target
variable y (as explained in the previous sections). In the case of unsupervised
learning, there is no target, and we only have a collection of variables .x. The two

Fig. 3.24 Receiver operating characteristic curves, for different models, plain logistic regression
and random forest on GermanCredit, with all variables .(x, s) at the top left, and on .x at the top
right, without gender and age. Below, models are on .(x, s) with a single sensitive attribute . GLM
= generalized linear model

general problems we want to solve are dimension reduction (where we want to use a smaller number of features) and cluster construction (where we try to group individuals together).
For cluster analysis, see Hartigan (1975), Jain and Dubes (1988) or Gan and
Valdez (2020) for theoretical foundations. Campbell (1986) applied cluster analysis
to identify groups of car models with similar technical attributes for the purpose of
estimating risk premium for individual car models, whereas Yao (2016) explored
territory clustering for ratemaking in motor insurance.
Reducing dimension simply means that, instead of our initial vectors of data, $\boldsymbol{x}_i$, we want to consider lower dimensional vectors $\boldsymbol{x}_i^*$. Instead of the matrix $\boldsymbol{X}$, we consider $\boldsymbol{X}^*$, of lower rank. In the case of a linear transformation, corresponding to PCA (see

Fig. 3.25 Scatterplot of $(\widehat{m}_{\text{glm}}(\boldsymbol{x}_i), \widehat{m}_{\text{rf}}(\boldsymbol{x}_i))$ on the left, on the GermanCredit dataset, and the associated ranks on the right (with a linear transformation to have "ranks" on 0-100, corresponding to the empirical copula). GLM = generalized linear model

X Z c
X

Fig. 3.26 Nonlinear autoencoder at the top (with transformations $\varphi$ and $\psi$), and a linear autoencoder at the bottom (with transformations $\boldsymbol{P}$ and $\boldsymbol{P}^\top$, corresponding to principal component analysis). $\boldsymbol{X}$ is the original dataset, $\boldsymbol{Z}$ the embedded version in a smaller latent space (principal factors), and $\boldsymbol{X}^*$ is the reconstruction

Jolliffe (2002) or Hastie et al. (2015)) or a linear autoencoder (see Sakurada and Yairi (2014) or Wang et al. (2016)), the error is
$$\|\boldsymbol{X} - \boldsymbol{X}^*\|^2 = \|\boldsymbol{X} - \boldsymbol{P}^\top\boldsymbol{P}\boldsymbol{X}\|^2 = \sum_{i=1}^{n}\big(\boldsymbol{P}^\top\boldsymbol{P}\boldsymbol{x}_i - \boldsymbol{x}_i\big)^\top\big(\boldsymbol{P}^\top\boldsymbol{P}\boldsymbol{x}_i - \boldsymbol{x}_i\big).$$

More generally, the error function of a nonlinear autoencoder is
$$\|\boldsymbol{X} - \boldsymbol{X}^*\|^2 = \|\boldsymbol{X} - \psi\circ\varphi(\boldsymbol{X})\|^2 = \sum_{i=1}^{n}\big(\psi\circ\varphi(\boldsymbol{x}_i) - \boldsymbol{x}_i\big)^\top\big(\psi\circ\varphi(\boldsymbol{x}_i) - \boldsymbol{x}_i\big).$$

In Fig. 3.26, we can visualize the general architecture of a nonlinear autoencoder


at the top, and a linear autoencoder below (corresponding to PCA).
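A rank-k linear reconstruction (the linear autoencoder, i.e. PCA) and its error can be computed directly from the singular value decomposition; a minimal sketch in R, where X is any numeric data matrix and k the dimension of the latent space (both assumptions, for illustration):

X <- scale(X, center = TRUE, scale = FALSE)   # center the columns
k <- 2
s <- svd(X)
Z     <- X %*% s$v[, 1:k]                     # embedded version (principal factors)
X_hat <- Z %*% t(s$v[, 1:k])                  # linear reconstruction
error <- sum((X - X_hat)^2)                   # squared Frobenius norm of the residual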

The error function of a linear autoencoder is
$$\sum_{i=1}^{n} \text{trace}\Big[\big(\boldsymbol{P}^\top\boldsymbol{P} - \mathbb{I}\big)\,\boldsymbol{x}_i\boldsymbol{x}_i^\top\,\big(\boldsymbol{P}^\top\boldsymbol{P} - \mathbb{I}\big)\Big] = \text{trace}\Big[\big(\boldsymbol{P}^\top\boldsymbol{P} - \mathbb{I}\big)\Big(\sum_{i=1}^{n}\boldsymbol{x}_i\boldsymbol{x}_i^\top\Big)\big(\boldsymbol{P}^\top\boldsymbol{P} - \mathbb{I}\big)\Big];$$
the middle term is a covariance matrix, which can be written $\boldsymbol{V}\boldsymbol{\Lambda}\boldsymbol{V}^\top$, and we recognize
$$\Big\|\big(\boldsymbol{P}^\top\boldsymbol{P} - \mathbb{I}\big)\,\boldsymbol{V}\boldsymbol{\Lambda}^{1/2}\Big\|_F^2,$$
where $\|\cdot\|_F$ denotes the Frobenius norm, corresponding to the elementwise $\ell_2$ norm of a matrix,
$$\|\boldsymbol{M}\|_F^2 = \sum_{i,j} M_{i,j}^2 = \text{trace}\big(\boldsymbol{M}\boldsymbol{M}^\top\big), \quad\text{where } \boldsymbol{M} = [M_{i,j}].$$

Consider the following problem,
$$\min_{\boldsymbol{Y}} \big\{\|\boldsymbol{X} - \boldsymbol{Y}\|_F^2\big\} \quad\text{subject to } \text{rank}(\boldsymbol{Y}) = k \ (\leq \text{rank}(\boldsymbol{X})).$$
If $\boldsymbol{X} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^\top$ (singular value decomposition), then $\boldsymbol{Y}^* = \boldsymbol{U}_k\boldsymbol{\Sigma}_k\boldsymbol{V}_k^\top$, where we keep the first k columns of $\boldsymbol{U}$ and $\boldsymbol{V}$, and the first k singular values in $\boldsymbol{\Sigma}$. One can rewrite
$$\min_{\boldsymbol{P} \in \Pi} \big\{\|\boldsymbol{X} - \boldsymbol{P}\boldsymbol{X}\|_F^2\big\} \quad\text{subject to } \text{rank}(\boldsymbol{P}) = k,$$
where $\Pi$ is the set of projection matrices. If $\boldsymbol{S} = \boldsymbol{X}^\top\boldsymbol{X}$, we can write (equivalently)
$$\max_{\boldsymbol{P} \in \Pi} \big\{\text{trace}(\boldsymbol{S}\boldsymbol{P})\big\} \quad\text{subject to } \text{rank}(\boldsymbol{P}) = k,$$
or
$$\max_{\boldsymbol{P} \in \mathcal{P}} \big\{\text{trace}(\boldsymbol{S}\boldsymbol{P})\big\}, \quad\text{where } \mathcal{P} = \big\{\boldsymbol{P} \in \Pi : \text{eigenvalues}(\boldsymbol{P}) \in \{0,1\} \text{ and } \text{trace}(\boldsymbol{P}) = k\big\}.$$

As explained in Samadi et al. (2018), if $\|\boldsymbol{X} - \boldsymbol{X}^*\|_F^2$ can be seen as the reconstruction error of $\boldsymbol{X}^*$ with respect to $\boldsymbol{X}$, a way to define a rank-k reconstruction loss is to consider
$$\ell_k\big(\boldsymbol{X}, \boldsymbol{X}^*\big) = \|\boldsymbol{X} - \boldsymbol{X}^*\|_F^2 - \|\boldsymbol{X} - \widehat{\boldsymbol{X}}\|_F^2, \quad\text{where } \widehat{\boldsymbol{X}} = \underset{\boldsymbol{Y}}{\text{argmin}}\big\{\|\boldsymbol{X} - \boldsymbol{Y}\|_F^2\big\} \text{ subject to rank}(\boldsymbol{Y}) = k.$$
Chapter 4
Models: Interpretability, Accuracy,
and Calibration

Abstract In this chapter, we present important concepts for dealing with predictive models. We start with a discussion of the interpretability and explainability of models and algorithms, presenting different tools that can help us understand "why" the predicted outcome of the model is the one we got. Then, we discuss accuracy, which is usually the ultimate target of most machine-learning techniques. But, as we shall see, the most important concept is the "good calibration" of the model, meaning that we want to have, locally, a balanced portfolio, and that the probability predicted by the model is, indeed, related to the true risk.

In a popular book on the philosophy of science, Nancy Cartwright examines the


relationship between theoretical models, experiments, explanations, and various
concepts linked to causal relations. And she tells the following anecdote, “My newly
planted lemon tree is sick, the leaves yellow and dropping off. I finally explain this
by saying that water has accumulated in the base of the planter: the water is the
cause of the disease. I drill a hole in the base of the oak barrel where the lemon
tree lives, and foul water flows out. That was the cause. Before I had drilled the
hole, I could still give the explanation and to give that explanation was to present
the supposed cause, the water. There must be such water for the explanation to be
correct. An explanation of an effect by a cause has an existential component, not
just an optional extra ingredient,” Cartwright (1983).

4.1 Interpretability and Explainability

Interpretability is about transparency, about understanding exactly why and how


the model is generating predictions, and therefore, it is important to observe the
inner mechanics of the algorithm considered. This leads to interpreting the model’s
parameters and features used to determine the given output. Explainability is about
explaining the behavior of the model in human terms.


Fig. 4.1 On the left, the ceteris paribus approach (only the direct relationship from $x_1$ to y is considered, and $x_2$ is supposed to remain unchanged); on the right, the mutatis mutandis approach (a change in $x_1$ has a direct impact on y, and there could be an additional effect via $x_2$)

Definition 4.1 (Ceteris Paribus (Marshall 1890)) Ceteris paribus (or more pre-
cisely ceteris paribus sic stantibus) is a Latin phrase, meaning “all other things
being equal” or “other things held constant.”
The ceteris paribus approach is commonly used to consider the effects of a cause,
in isolation, by assuming that any other relevant conditions are absent. In Fig. 4.1,
the output of a model, . y can be influenced by .x1 and .x2 , and in the ceteris paribus
analysis of the influence of .x1 on . y , we isolate the effect of .x1 on .
y . In the mutatis
mutandis approach, if .x1 and .x2 are correlated, we add to the “direct effect” (from
.x1 to .
y ) a possible “indirect effect” (through .x2 ).
Definition 4.2 (Mutatis Mutandis) Mutatis mutandis is a Latin phrase meaning
“with things changed that should be changed” or “once the necessary changes have
been made.”
In order to illustrate, let $(X_1, X_2, \varepsilon)^\top$ denote some Gaussian random vector, where the first two components are correlated, and $\varepsilon$ is some unpredictable random noise, independent of the pair $(X_1, X_2)^\top$,
$$\begin{pmatrix} X_1 \\ X_2 \\ \varepsilon \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \mu_1 \\ \mu_2 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & r\sigma_1\sigma_2 & 0 \\ r\sigma_1\sigma_2 & \sigma_2^2 & 0 \\ 0 & 0 & \sigma^2 \end{pmatrix}\right).$$

Suppose that $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$ (as in a standard linear model); then, for some $\boldsymbol{x}^* = (x_1^*, x_2^*)$,
$$\mathbb{E}_{Y|\boldsymbol{X}}[Y|\boldsymbol{x}^*] = \mathbb{E}_{\boldsymbol{X}}[Y|x_1^*, x_2^*] = \beta_0 + \beta_1 x_1^* + \beta_2 x_2^*,$$
whereas $\mathbb{E}_Y[Y] = \beta_0 + \beta_1\mu_1 + \beta_2\mu_2$. Then, on the one hand, if we compute the standard conditional expected value of $X_2$, conditional on $X_1$, we have
$$\mathbb{E}_{X_2|X_1}[X_2|x_1^*] = \mu_2 + \frac{r\sigma_2}{\sigma_1}\big(x_1^* - \mu_1\big),$$

and therefore
$$\mathbb{E}_{Y|X_1}[Y|x_1^*] = \beta_0 + \beta_1 x_1^* + \beta_2\Big(\mu_2 + \frac{r\sigma_2}{\sigma_1}\big(x_1^* - \mu_1\big)\Big) \quad : \text{mutatis mutandis}.$$
On the other hand, in the ceteris paribus approach, "isolating" the effect of $x_1$ from other possible causes means that we pretend that $X_1$ and $X_2$ are independent. Therefore, formally, instead of $(X_1, X_2)$, we consider $(X_1^\perp, X_2^\perp)$, a "copy" with independent components and the same marginal distributions,¹ so that $\mathbb{E}_{X_2^\perp|X_1^\perp}[X_2^\perp|x_1^*] = \mu_2$, and
$$\mathbb{E}_{Y|X_1^\perp}[Y|x_1^*] = \beta_0 + \beta_1 x_1^* + \beta_2\mu_2 \quad : \text{ceteris paribus}.$$
Therefore, we clearly have the direct effect (ceteris paribus) and the indirect effect,
$$\underbrace{\mathbb{E}_{Y|X_1}[Y|x_1^*]}_{\text{mutatis mutandis}} = \underbrace{\mathbb{E}_{Y|X_1^\perp}[Y|x_1^*]}_{\text{ceteris paribus}} + \beta_2\,\frac{r\sigma_2}{\sigma_1}\big(x_1^* - \mu_1\big).$$

As expected, if variables .x1 and .x2 are independent, .r = 0, and the mutatis mutandis
and the ceteris paribus approaches are identical. Later on, when presenting various
techniques in this chapter, we may use notation .EX1 and .EX⊥ , instead of .EY |X1 or
1
.E
Y |X1⊥ respectively, to avoid notations that are too heavy.
More generally, from a statistical perspective, if we consider a nonlinear model $\mathbb{E}_{Y|\boldsymbol{X}}[Y|\boldsymbol{x}^*] = \mathbb{E}_{\boldsymbol{X}}[Y|x_1^*, x_2^*] = m(x_1^*, x_2^*)$, a natural ceteris paribus estimate of the effect of $x_1$ on the prediction is
$$\mathbb{E}_{Y|X_1^\perp}\big[m(X_1^\perp, X_2^\perp)\,\big|\,x_1^*\big] \approx \frac{1}{n}\sum_{i=1}^{n} m(x_1^*, x_{i,2})$$
(the average on the right being the empirical counterpart of the expected value on the left), whereas to estimate the mutatis mutandis effect, we need a local version, to take into account a possible (local) correlation between $x_1$ and $x_2$, i.e.,
$$\mathbb{E}_{Y|X_1}\big[m(X_1, X_2)\,\big|\,x_1^*\big] \approx \frac{1}{\|\mathcal{V}_\epsilon(x_1^*)\|}\sum_{i \in \mathcal{V}_\epsilon(x_1^*)} m(x_1^*, x_{i,2}),$$
where $\mathcal{V}_\epsilon(x_1^*) = \{i : |x_{i,1} - x_1^*| \leq \epsilon\}$ is a neighborhood of $x_1^*$. It should be stressed that the notations "$\mathbb{E}_{Y|X_1}[m(X_1,X_2)|x_1^*]$" and "$\mathbb{E}_{Y|X_1^\perp}[m(X_1^\perp,X_2^\perp)|x_1^*]$" do not have measure-theoretic foundations, but they will be useful to highlight that, in some cases, metrics and mathematical objects "pretend" that explanatory variables are independent.
(Footnote 1: in the sense that $X_2^\perp \overset{\mathcal{L}}{=} X_2$, $X_1^\perp = X_1$ almost surely, and $X_1^\perp \perp\!\!\!\perp X_2^\perp$.)
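The decomposition above can be illustrated with a small simulation (a sketch; all the numerical values below are assumptions chosen for illustration only):

set.seed(1)
n <- 1e5; r <- 0.7; mu1 <- 0; mu2 <- 0; s1 <- 1; s2 <- 2
x1 <- rnorm(n, mu1, s1)
x2 <- rnorm(n, mu2 + r * s2 / s1 * (x1 - mu1), s2 * sqrt(1 - r^2))
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)          # beta0 = 1, beta1 = 2, beta2 = 3

x1_star <- 1
cp <- 1 + 2 * x1_star + 3 * mean(x2)          # ceteris paribus: x2 kept at its average
mm <- mean(y[abs(x1 - x1_star) < 0.05])       # mutatis mutandis: local average of y
mm - cp    # close to beta2 * r * s2 / s1 * (x1_star - mu1) = 3 * 1.4 = 4.2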

4.1.1 Variable Importance

When introducing random forests, Breiman (2001) suggested a simple technique


to rank the importance of variables, in a natural way. This technique has been
improved, in Helton and Davis (2002), Azen and Budescu (2003), Rifkin and
Klautau (2004), and Saltelli et al. (2008), in the context of classification and
regression trees, and random forests. The general definition, for other models, could
be the following.
Definition 4.3 ($\text{VI}_j$ or "permutation $\text{VI}_j$" (Fisher et al. 2019)) Given a loss function $\ell$ and a model m, the importance of the j-th variable is
$$\text{VI}_j = \mathbb{E}\big[\ell\big(Y, m(\boldsymbol{X}_{-j}, X_j)\big)\big] - \mathbb{E}\big[\ell\big(Y, m(\boldsymbol{X}_{-j}, X_j^\perp)\big)\big],$$
and the empirical version is
$$\widehat{\text{VI}}_j = \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, m(\boldsymbol{x}_{i,-j}, x_{i,j})\big) - \ell\big(y_i, m(\boldsymbol{x}_{i,-j}, \tilde{x}_{i,j})\big),$$
for some permutation $\tilde{\boldsymbol{x}}_j$ of $\boldsymbol{x}_j$.


On the toydata2 dataset, with three explanatory variables ($x_1$, $x_2$ and $x_3$) and a sensitive attribute (s), $\widehat{\text{VI}}_j$ can be computed using the variable-importance function variable_importance from the DALEX package (see Biecek and Burzykowski 2021 for more details). By default, the loss considered is the one associated with 1 - AUC for classification (loss_one_minus_auc, as here), but cross-entropy can be used for multilabel classification, whereas root mean squared error is the default loss for regression. In Figs. 4.2 and 4.3 we can visualize variable importance for the four models (including some confidence bands), for models without and with the sensitive attribute s respectively. This measure can be quantified as some "drop-out loss of AUC," and therefore, as a measure of variable importance. One could also use FeatureImp from the iml R package, based on Molnar (2023). Observe, in those figures, that all four models are comparable in this example.
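For completeness, permutation importance as in Definition 4.3 can also be coded by hand; a minimal sketch in base R, using the 1 - AUC loss, where model is any fitted model whose predict() returns probabilities and df a validation data frame with a binary response column y (all names are assumptions):

# 1 - AUC loss, via the Mann-Whitney formula (y coded 0/1, p predicted probabilities)
one_minus_auc <- function(y, p) {
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  1 - (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Permutation variable importance: difference in loss after permuting column j
perm_importance <- function(model, df, loss = one_minus_auc, n_perm = 10) {
  base <- loss(df$y, predict(model, newdata = df, type = "response"))
  vars <- setdiff(names(df), "y")
  sapply(vars, function(j) {
    mean(replicate(n_perm, {
      dfp <- df
      dfp[[j]] <- sample(dfp[[j]])   # permute column j to break its link with y
      loss(df$y, predict(model, newdata = dfp, type = "response")) - base
    }))
  })
}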
Fig. 4.2 Variable importance for different models trained on toydata2, without the sensitive attribute s, with four variables, $x_1$, $x_2$, $x_3$, and s. CART = classification and regression trees, GAM = generalized additive model, GLM = generalized linear model, RF = random forest

Fig. 4.3 Variable importance for different models trained on toydata2, with the sensitive attribute s, with four variables, $x_1$, $x_2$, $x_3$, and s. CART = classification and regression trees, GAM = generalized additive model, GLM = generalized linear model, RF = random forest

4.1.2 Ceteris Paribus Profiles

Instead of a global measure, some local metrics can be considered. Goldstein et al.
(2015) defined the “individual conditional expectation” directly derived from ceteris
paribus functions, coined “ceteris paribus profile” in Biecek and Burzykowski
(2021).
Definition 4.4 (Ceteris Paribus Profile $z \mapsto m_{\boldsymbol{x}^*,j}(z)$ (Goldstein et al. 2015)) Given $\boldsymbol{x}^* \in \mathcal{X}$, define, on $\mathcal{X}_j$,
$$z \mapsto m_{\boldsymbol{x}^*,j}(z) = m(\boldsymbol{x}^*_{-j}, z) = m(x_1^*, \ldots, x_{j-1}^*, z, x_{j+1}^*, \ldots, x_k^*).$$
Here, it is a ceteris paribus profile in the sense that $x_j^*$ changes (and takes the generic value z) whereas all other components remain unchanged. Define then the difference between the profile at a generic value z and at $x_j^*$,
$$\delta m_{\boldsymbol{x}^*,j}(z) = m_{\boldsymbol{x}^*,j}(z) - m_{\boldsymbol{x}^*,j}(x_j^*).$$

Definition 4.5 ($dm_j^{\text{cp}}(\boldsymbol{x}^*)$) The mean absolute deviation associated with the j-th variable, at $\boldsymbol{x}^*$, is
$$dm_j^{\text{cp}}(\boldsymbol{x}^*) = \mathbb{E}\big[|\delta m_{\boldsymbol{x}^*,j}(X_j)|\big] = \mathbb{E}\big[|m(\boldsymbol{x}^*_{-j}, X_j) - m(\boldsymbol{x}^*_{-j}, x_j^*)|\big].$$
Definition 4.6 ($\widehat{dm}_j^{\text{cp}}(\boldsymbol{x}^*)$) The empirical mean absolute deviation associated with the j-th variable, at $\boldsymbol{x}^*$, is
$$\widehat{dm}_j^{\text{cp}}(\boldsymbol{x}^*) = \frac{1}{n}\sum_{i=1}^{n} \big|m(\boldsymbol{x}^*_{-j}, x_{i,j}) - m(\boldsymbol{x}^*_{-j}, x_j^*)\big|.$$

In Figs. 4.4 and 4.5, we can visualize "ceteris paribus profiles" $z \mapsto m_{\boldsymbol{x}^*,1}(z)$ for our four models, on toydata2, with $j = 1$ (variable $x_1$): the plain logistic regression, the GAM, the classification tree, and the random forest. In Fig. 4.4, it is the profile associated with Andrew (where $(\boldsymbol{x}^*, s^*) = (-1, 8, -2, A)$) and, in Fig. 4.5, the profile associated with Barbara (where $(\boldsymbol{x}^*, s^*) = (1, 4, 2, B)$). Bullet points indicate the values $m_{\boldsymbol{x}^*,1}(x_1^*)$, for Andrew in Fig. 4.4 and Barbara in Fig. 4.5. On the top left, the function is monotonic, with a "logistic" shape. On the right, we see that a GLM will probably miss a nonlinear effect, with a (capped) J shape.
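Such a profile is straightforward to compute by hand; a minimal sketch in base R, where model_glm is a fitted model whose predict() returns probabilities, and andrew is a one-row data frame with his characteristics (both names are assumptions):

# Ceteris paribus profile z -> m(x*_{-j}, z) for one individual (Definition 4.4)
cp_profile <- function(model, x_star, j, grid = seq(-3, 3, length.out = 101)) {
  newdata <- x_star[rep(1, length(grid)), , drop = FALSE]  # replicate the individual
  newdata[[j]] <- grid                                     # let variable j vary
  data.frame(z = grid,
             m = predict(model, newdata = newdata, type = "response"))
}

prof <- cp_profile(model_glm, andrew, "x1")
plot(prof$z, prof$m, type = "l", xlab = "x1", ylab = "prediction")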
Fig. 4.4 "Ceteris paribus profiles" for Andrew for different models trained on toydata2 (see Table 3.1 for numerical values, for variable $x_1$; here $\boldsymbol{z}^* = (\boldsymbol{x}^*, s^*) = (-1, 8, -2, A)$)

Fig. 4.5 "Ceteris paribus profiles" for Barbara for different models trained on toydata2 (see Table 3.1 for numerical values, for variable $x_1$; here $\boldsymbol{z}^* = (\boldsymbol{x}^*, s^*) = (1, 4, 2, B)$)

4.1.3 Breakdowns

For a standard linear model, observe that we can write
$$\widehat{m}(\boldsymbol{x}^*) = \widehat{\beta}_0 + \widehat{\boldsymbol{\beta}}^\top\boldsymbol{x}^* = \widehat{\beta}_0 + \sum_{j=1}^{k}\widehat{\beta}_j x_j^* = \overline{y} + \sum_{j=1}^{k}\underbrace{\widehat{\beta}_j\big(x_j^* - \overline{x}_j\big)}_{=\widehat{v}_j(\boldsymbol{x}^*)},$$

where $v_j(\boldsymbol{x}^*)$ is interpreted as the contribution of the j-th variable to the prediction for an individual with characteristics $\boldsymbol{x}^*$. More generally, Robnik-Šikonja and Kononenko (1997, 2003, 2008) defined the (additive) contribution of the j-th variable to the prediction for an individual with characteristics $\boldsymbol{x}^*$ as
$$v_j(\boldsymbol{x}^*) = m(x_1^*, \ldots, x_{j-1}^*, x_j^*, x_{j+1}^*, \ldots, x_k^*) - \mathbb{E}_{X_j^\perp}\big[m(x_1^*, \ldots, x_{j-1}^*, X_j, x_{j+1}^*, \ldots, x_k^*)\big],$$

so that
$$m(\boldsymbol{x}^*) = \mathbb{E}\big[m(\boldsymbol{X})\big] + \sum_{j=1}^{k} v_j(\boldsymbol{x}^*),$$
and, for the linear model, $v_j(\boldsymbol{x}^*) = \beta_j\big(x_j^* - \mathbb{E}_{X_j^\perp|\boldsymbol{X}_{-j}}[X_j^\perp|\boldsymbol{X}_{-j} = \boldsymbol{x}^*_{-j}]\big)$, with empirical version $\widehat{v}_j(\boldsymbol{x}^*) = \widehat{\beta}_j\big(x_j^* - \overline{x}_j\big)$. More generally, $v_j(\boldsymbol{x}^*) = m(\boldsymbol{x}^*) - \mathbb{E}_{X_j^\perp|\boldsymbol{X}_{-j}}\big[m(\boldsymbol{x}^*_{-j}, X_j)\big]$, where we can write $m(\boldsymbol{x}^*)$ as $\mathbb{E}\big[m(\boldsymbol{X})\,|\,\boldsymbol{x}^*\big]$, i.e.,
$$v_j(\boldsymbol{x}^*) = \begin{cases} \mathbb{E}\big[m(\boldsymbol{X})\,\big|\,x_1^*,\ldots,x_k^*\big] - \mathbb{E}_{X_j^\perp|\boldsymbol{X}_{-j}}\big[m(\boldsymbol{X})\,\big|\,x_1^*,\ldots,x_{j-1}^*,x_{j+1}^*,\ldots,x_k^*\big], \\ \mathbb{E}\big[m(\boldsymbol{X})\,\big|\,\boldsymbol{x}^*\big] - \mathbb{E}_{X_j^\perp|\boldsymbol{X}_{-j}}\big[m(\boldsymbol{X})\,\big|\,\boldsymbol{x}^*_{-j}\big]. \end{cases}$$
Definition 4.7 ($\gamma_j^{\text{bd}}(\boldsymbol{x}^*)$ (Biecek and Burzykowski 2021)) The breakdown contribution of the j-th variable, at $\boldsymbol{x}^*$, is
$$\gamma_j^{\text{bd}}(\boldsymbol{x}^*) = v_j(\boldsymbol{x}^*) = \mathbb{E}\big[m(\boldsymbol{X})\,\big|\,\boldsymbol{x}^*\big] - \mathbb{E}_{X_j^\perp|\boldsymbol{X}_{-j}}\big[m(\boldsymbol{X})\,\big|\,\boldsymbol{x}^*_{-j}\big].$$

“In other words, the contribution of the j -th variable is the difference between
the expected value of the model’s prediction conditional on setting the values of the
first j variables equal to their values in .x ∗ and the expected value conditional on
setting the values of the first .j − 1 variables equal to their values in .x ∗ ,” as Biecek
and Burzykowski (2021) said.

We can rewrite the contribution of the j-th variable, at $\boldsymbol{x}^*$, as
$$v_j(\boldsymbol{x}^*) = \mathbb{E}\big[m(\boldsymbol{X})\,\big|\,x_1^*,\ldots,x_k^*\big] - \mathbb{E}_{X_j^\perp|\boldsymbol{X}_{-j}}\big[m(\boldsymbol{X})\,\big|\,x_1^*,\ldots,x_{j-1}^*,x_{j+1}^*,\ldots,x_k^*\big].$$

Definition 4.8 ($\Delta_{j|S}(\boldsymbol{x}^*)$) The contribution of the j-th variable, at $\boldsymbol{x}^*$, conditional on a subset of variables $S \subset \{1,\ldots,k\}\backslash\{j\}$, is
$$\Delta_{j|S}(\boldsymbol{x}^*) = \mathbb{E}_{\boldsymbol{X}_{S\cup\{j\}}^\perp}\big[m(\boldsymbol{X})\,\big|\,\boldsymbol{x}^*_{S\cup\{j\}}\big] - \mathbb{E}_{\boldsymbol{X}_{S}^\perp}\big[m(\boldsymbol{X})\,\big|\,\boldsymbol{x}^*_{S}\big],$$
so that $v_j(\boldsymbol{x}^*) = \Delta_{j|\{1,2,\ldots,k\}\backslash\{j\}} = \Delta_{j|-j}$.


On the toydata2 dataset, we can compute the contributions of $x_1$, $x_2$ and $x_3$ for two individuals, Andrew and Barbara, as in Figs. 4.6 and 4.7 respectively, using type = "break_down" in the predict_parts function of the DALEX R package. For Andrew (in Fig. 4.6), the starting point is the average value over the entire population (close to 40%). The large value of $x_2$ (here 8) adds about $+0.18$ to the prediction, whereas the negative value of $x_1$ (here $-1$) contributes between about $-0.19$ and $-0.14$, depending on the model. Here, s has no impact, as we consider models trained without the sensitive attribute.

4.1.4 Shapley Value and Shapley Contributions

In order to get a robust way of defining contributions, in the context of predictive


modeling, Lipovetsky and Conklin (2001) suggested using the Shapley value in
statistics, to decompose the .R 2 of a linear regression into additive contributions
of each single covariate. Then, (Štrumbelj and Kononenko 2010, 2014) suggested
using Shapley values to decompose predictions into feature contributions, and, more
recently, Lundberg and Lee (2017) provided a unified version.
Recall that the “Shapley value,” as defined in Shapley (1953), is based on a
coalitional game, with k players, and a “value function” (also named “characteristic
function”) .V that can be defined on any coalition of players, .S ⊂ {1, 2, . . . , k}.
Given a coalition .S ⊂ {1, 2, . . . , k} of players, then .V(S) corresponds to the “worth
of coalition S,” which should reflect payoffs that the members of S would obtain
from this cooperation. In the context of games, assuming that all players collaborate,
the Shapley value is one way (among many others) of distributing the total gains
among all players. In game theory literature (starting with Shapley and Shubik 1969
but then emphasized by Moulin 1992 and Moulin 2004), it can be referred to as a
“fair” mechanism, in the sense that it is the only distribution with certain desirable
Fig. 4.6 Breakdown decomposition $\widehat{\gamma}_j^{\text{bd}}(\boldsymbol{z}^*_A)$ for Andrew for different models trained on toydata2 (see Table 3.1 for numerical values; here $\boldsymbol{z}^*_A = (\boldsymbol{x}^*_A, s^*) = (-1, 8, -2, A)$)

Fig. 4.7 Breakdown decomposition $\widehat{\gamma}_j^{\text{bd}}(\boldsymbol{z}^*_B)$ for Barbara for different models trained on toydata2 (see Table 3.1 for numerical values; here $\boldsymbol{z}^* = (\boldsymbol{x}^*, s^*) = (1, 4, 2, B)$)

properties. The Shapley value describes contribution to the payout, weighted and
summed over all possible feature value combinations, as follows,

1 |S|! (k − |S| − 1)!


φj (V) =
. (V (S ∪ {j }) − V(S)) ,
k k!
S⊆{1,...,k}\{j }

As explained in Ichiishi (2014), if we suppose that coalitions are formed one player at a time, it should be fair for player j to be given $\mathcal{V}(S\cup\{j\}) - \mathcal{V}(S)$ as a compensation for joining coalition S, and then, for each player, to take the average of this contribution over all the different orders in which the coalition can be formed. This is exactly the expression above, which we can rewrite as
$$\phi_j(\mathcal{V}) = \frac{1}{\text{number of players}}\sum_{\text{coalitions } S \text{ excluding } j} \frac{\text{marginal contribution of } j \text{ to } S}{\text{number of coalitions of size } |S| \text{ excluding } j}.$$

The goal, in Shapley (1953), was to find contributions $\phi_j(\mathcal{V})$, for some value function $\mathcal{V}$, that satisfy a series of desirable properties, namely
• "Efficiency": $\sum_{j=1}^{k} \phi_j(\mathcal{V}) = \mathcal{V}(\{1,\ldots,k\})$,
• "Symmetry": if $\mathcal{V}(S\cup\{j\}) = \mathcal{V}(S\cup\{j'\})$ for all S, then $\phi_j = \phi_{j'}$,
• "Dummy" (or "null player"): if $\mathcal{V}(S\cup\{j\}) = \mathcal{V}(S)$ for all S, then $\phi_j = 0$,
• "Additivity": if $\mathcal{V}^{(1)}$ and $\mathcal{V}^{(2)}$ have decompositions $\phi(\mathcal{V}^{(1)})$ and $\phi(\mathcal{V}^{(2)})$, then $\mathcal{V}^{(1)} + \mathcal{V}^{(2)}$ has decomposition $\phi(\mathcal{V}^{(1)}+\mathcal{V}^{(2)}) = \phi(\mathcal{V}^{(1)}) + \phi(\mathcal{V}^{(2)})$. "Linearity" is obtained if we add $\phi(\lambda\cdot\mathcal{V}) = \lambda\cdot\phi(\mathcal{V})$.


In the context of predictive models, S denotes some subset of features used in
the model (.S ⊂ {1, 2, . . . , k}), .x is some vector of features. Here, it could be
natural to suppose that .Vx denotes the prediction for feature values in set S that are
marginalized, over features that are not included in set S. Štrumbelj and Kononenko
(2014) suggested Monte Carlo sampling  to
 compute
 contributions .φj (Vx ).
Here, we use .Vx ∗ (S) = EX⊥ m(X)x ∗S , as value function, for any set S of
S
variables, so that .Δj |S (x ∗ ) = Vx ∗ (S ∪ {j }) − Vx ∗ (S), from Definition 4.8.
Definition 4.9 (Shapley Contributions $\gamma_j^{\text{shap}}(\boldsymbol{x}^*)$) The Shapley contribution of the j-th variable, at $\boldsymbol{x}^*$, is
$$\gamma_j^{\text{shap}}(\boldsymbol{x}^*) = \frac{1}{k}\sum_{S\subseteq\{1,\ldots,k\}\backslash\{j\}} \binom{k-1}{|S|}^{-1} \Delta_{j|S}(\boldsymbol{x}^*) = \phi_j(\mathcal{V}_{\boldsymbol{x}^*}).$$

Interestingly, for a linear regression with k uncorrelated and mean-centered features,
$$m(\boldsymbol{x}^*) = \underbrace{\beta_0}_{=\mathbb{E}[m(\boldsymbol{X})]} + \underbrace{\beta_1 x_1^*}_{\gamma_1^{\text{shap}}(\boldsymbol{x}^*)} + \underbrace{\beta_2 x_2^*}_{\gamma_2^{\text{shap}}(\boldsymbol{x}^*)} + \cdots + \underbrace{\beta_k x_k^*}_{\gamma_k^{\text{shap}}(\boldsymbol{x}^*)},$$
as discussed in Aas et al. (2021).


More generally, these contributions satisfy the following properties:
k
 
(x ∗ ) = m(x ∗ ) − E m(X)
shap
• “Local accuracy”: . γj
j =1
(x ∗ ) = γk (x ∗ )
shap shap
• “Symmetry”: if j and k are interchangeable, .γj
(x ∗ ) = 0.
shap
• “Dummy”: if .Xj does not contribute to the model, .γj
Here, the interpretation of the additivity principle is not easy to derive (and to
legitimate as a “desirable property,” in the context of models). Observe that if
there are two variables, .k = 2, .γ1 (x ∗ ) = Δ1|2 (x ∗ ) = γ1bd (x ∗ ). And if
shap

.p ⪢ 2, computations can be heavy. Štrumbelj and Kononenko (2014) suggested

an approach based on simulations.


Given .x ∗ and some individual .x i , define
 
xj∗' with probability 1/2 x ∗+ ∗
i = (x̃i,1 , . . . , xj , . . . , x̃i,k )
x̃i,j ' =
. and
xi,j ' with probability 1/2 x ∗−
i = (x̃i,1 , . . . , xi,j , . . . , x̃i,k ).

(x ∗ ) ≈ m(x ∗+ ∗−
shap
Observe that .γj i ) − m(x i ), and therefore

1
(x ∗ ) = m(x ∗+ ∗−
shap
j
γ
.
i ) − m(x i ),
s
i∈{1,...,n}

(at each step we pick individual i in the training dataset, s times).
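A minimal sketch of this sampling scheme in R (where model, the data frame of training features X, and a predict() call returning probabilities are assumptions, for illustration only):

# Monte Carlo approximation of the Shapley contribution of variable j at x_star,
# following the simplified sampling scheme above
shap_mc <- function(model, X, x_star, j, s = 500) {
  k <- ncol(X)
  contrib <- replicate(s, {
    i <- sample(nrow(X), 1)               # pick an individual in the training set
    pick <- runif(k) < 0.5                # each coordinate: from x_star or from i
    z <- X[i, , drop = FALSE]
    z[pick] <- x_star[pick]
    z_plus <- z;  z_plus[[j]]  <- x_star[[j]]
    z_minus <- z; z_minus[[j]] <- X[i, j]
    predict(model, newdata = z_plus, type = "response") -
      predict(model, newdata = z_minus, type = "response")
  })
  mean(contrib)
}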


In the context of our toydata2 dataset, it is possible to compute Shapley values
for two individuals (Andrew and Barbara), as in Figs. 4.8 and 4.9 respectively,
obtained using option type = "shap" in function predict_parts of pack-
age DALEX, as in Biecek and Burzykowski (2021). Observe that, at least, signs of
contributions are consistent among models: .x1∗ has a negative contribution whereas

.x has a positive one, for Andrew; however, it is the opposite for Barbara.
2
Štrumbelj and Kononenko (2014) and Lundberg and Lee (2017) suggested using
that decomposition to get a global contribution of each variable, instead of a local
version.
Fig. 4.8 Shapley contributions $\gamma_j^{\text{shap}}(\boldsymbol{z}^*_A)$ for Andrew for different models trained on toydata2 (see Table 3.1 for numerical values; here $\boldsymbol{z}^* = (\boldsymbol{x}^*, s^*) = (-1, 8, -2, A)$)

Fig. 4.9 Shapley contributions $\gamma_j^{\text{shap}}(\boldsymbol{z}^*_B)$ for Barbara for different models trained on toydata2 (see Table 3.1 for numerical values; here $\boldsymbol{z}^* = (\boldsymbol{x}^*, s^*) = (1, 4, 2, B)$)

Definition 4.10 (Shapley Contribution $\overline{\gamma}_j^{\text{shap}}$) The (global) contribution of the j-th variable is
$$\overline{\gamma}_j^{\text{shap}} = \frac{1}{n}\sum_{i=1}^{n} \gamma_j^{\text{shap}}(\boldsymbol{x}_i).$$

One interesting feature of the Shapley value is that the contribution can be extended from a single player j to any coalition, for example two players $\{i,j\}$. This yields the concept of "Shapley interaction."
Definition 4.11 (Shapley Interaction $\gamma_{i,j}^{\text{shap}}(\boldsymbol{x}^*)$) The interaction contribution between the i-th and the j-th variable, at $\boldsymbol{x}^*$, is
$$\gamma_{i,j}^{\text{shap}}(\boldsymbol{x}^*) = \sum_{S\subseteq\{1,\ldots,k\}\backslash\{i,j\}} \frac{|S|!\,(k-|S|-2)!}{2\,k!}\,\Delta_{i,j|S}(\boldsymbol{x}^*),$$
where
$$\Delta_{i,j|S}(\boldsymbol{x}^*) = \mathbb{E}_{\boldsymbol{X}^\perp_{S\cup\{i,j\}}}\big[m(\boldsymbol{X})\,\big|\,\boldsymbol{x}^*_{S\cup\{i,j\}}\big] - \mathbb{E}_{\boldsymbol{X}^\perp_{S\cup\{j\}}}\big[m(\boldsymbol{X})\,\big|\,\boldsymbol{x}^*_{S\cup\{j\}}\big] - \mathbb{E}_{\boldsymbol{X}^\perp_{S\cup\{i\}}}\big[m(\boldsymbol{X})\,\big|\,\boldsymbol{x}^*_{S\cup\{i\}}\big] + \mathbb{E}_{\boldsymbol{X}^\perp_{S}}\big[m(\boldsymbol{X})\,\big|\,\boldsymbol{x}^*_{S}\big].$$

4.1.5 Partial Dependence

The “partial dependence plot,” formally defined and coined in Friedman (2001), is
simply the average of “ceteris paribus profiles.”
Definition 4.12 (PDP $p_j(x_j^*)$ and $\widehat{p}_j(x_j^*)$) The partial dependence plot associated with the j-th variable is the function $\mathcal{X}_j \to \mathbb{R}$ defined as
$$p_j(x_j^*) = \mathbb{E}_{X_j^\perp}\big[m(\boldsymbol{X})\,\big|\,x_j^*\big],$$
and the empirical version is
$$\widehat{p}_j(x_j^*) = \frac{1}{n}\sum_{i=1}^{n} m(x_j^*, \boldsymbol{x}_{i,-j}) = \frac{1}{n}\sum_{i=1}^{n} \underbrace{m_{\boldsymbol{x}_i,j}(x_j^*)}_{\text{ceteris paribus}}.$$
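The empirical version can be coded by hand in a few lines; a minimal sketch in base R, where pred_fun is any function returning the model's predictions on a data frame, and df, model_glm and the column names are assumptions (the group-wise versions used later in this section are obtained by restricting df to one group):

# Empirical partial dependence profile: average of the ceteris paribus profiles
pdp_profile <- function(pred_fun, df, j, grid) {
  sapply(grid, function(z) {
    df_z <- df
    df_z[[j]] <- z                 # set x_j to z for every observation
    mean(pred_fun(df_z))
  })
}

grid  <- seq(-3, 3, by = 0.1)
pdp_1 <- pdp_profile(function(d) predict(model_glm, d, type = "response"),
                     df, "x1", grid)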

See Greenwell (2017) for the implementation in R, with the pdp package.
One can also use type = "partial" in the predict_parts function of
the DALEX package, as in Biecek and Burzykowski (2021). In Fig. 4.10 we can visualize $\widehat{p}_1$ (associated with variable $x_1$) on the toydata2 dataset, the average of the $m(x_j^*, \boldsymbol{x}_{i,-j})$'s for $i = 1,\ldots,n$; all the individual profiles $m(x_j^*, \boldsymbol{x}_{i,-j})$ are displayed in Fig. 4.11.
Interestingly, instead of the average over the n predictions, sub-averages can be considered, with respect to some criterion. For example, in Figs. 4.12, 4.13 and 4.14, averages over $s_i = A$ or $s_i = B$ are considered,
$$\widehat{p}_j^A(x_j^*) = \frac{1}{n_A}\sum_{i:\, s_i = A} m(x_j^*, \boldsymbol{x}_{i,-j}) \quad\text{and}\quad \widehat{p}_j^B(x_j^*) = \frac{1}{n_B}\sum_{i:\, s_i = B} m(x_j^*, \boldsymbol{x}_{i,-j}).$$

On the toydata2 data, the three variables (namely $x_1$, $x_2$ and $x_3$) are considered in Figs. 4.12, 4.13 and 4.14 respectively. Even if $x_3$ has a very flat impact (in Fig. 4.14), with no influence on the outcome, one can observe that $\widehat{p}_3^A(x_3^*)$ and $\widehat{p}_3^B(x_3^*)$ are significantly different.
But instead of those ceteris paribus dependence plots, it could be interesting to
consider some local versions, or mutatis mutandis dependence plots. Apley and Zhu
(2020) introduced the “local dependence plot” and the “accumulated local plot,”
defined as follows,
Definition 4.13 (Local Dependence Plot $\ell_j(x_j^*)$ and $\widehat{\ell}_j(x_j^*)$) The local dependence plot is defined as
$$\ell_j(x_j^*) = \mathbb{E}_{X_j}\big[m(\boldsymbol{X})\,\big|\,x_j^*\big],$$
with empirical version
$$\widehat{\ell}_j(x_j^*) = \frac{1}{\text{card}(V(x_j^*))}\sum_{i\in V(x_j^*)} m(x_j^*, \boldsymbol{x}_{i,-j}), \quad\text{where } V(x_j^*) = \big\{i : d(x_{i,j}, x_j^*) \leq \epsilon\big\},$$
or, for a smooth version, for some kernel $K_h$,
$$\widehat{\ell}_j(x_j^*) = \frac{1}{\sum_i \omega_i(x_j^*)}\sum_{i=1}^{n} \omega_i(x_j^*)\, m(x_j^*, \boldsymbol{x}_{i,-j}), \quad\text{where } \omega_i(x_j^*) = K_h\big(x_j^* - x_{i,j}\big).$$


Apley and Zhu (2020) suggested to use, instead,
Definition 4.14 (Accumulated Local $a_j(x_j^*)$)
$$a_j(x_j^*) = \int_{-\infty}^{x_j^*} \mathbb{E}_{X_j}\left[\left.\frac{\partial m(x_j, \boldsymbol{X}_{-j})}{\partial x_j}\right| x_j\right] dx_j.$$
Fig. 4.10 Partial dependence profile $\widehat{p}_1$, associated with variable $x_1$, for four different models trained on toydata2. CART = classification and regression trees, GAM = generalized additive model, GLM = generalized linear model, RF = random forest

Fig. 4.11 Partial dependence profile $\widehat{p}_1$, associated with variable $x_1$, seen as the average of the ceteris paribus profiles $m(x_j^*, \boldsymbol{x}_{i,-j})$ (in gray), for different models trained on toydata2. CART = classification and regression trees, GAM = generalized additive model, GLM = generalized linear model, RF = random forest

Fig. 4.12 Partial dependence profiles $\widehat{p}_1^A$ and $\widehat{p}_1^B$, for $x_1$, when the sensitive attribute s is either A or B, as averages over subgroups ($s_i$ being either A or B), for different models trained on toydata2. CART = classification and regression trees, GAM = generalized additive model, GLM = generalized linear model, RF = random forest

Fig. 4.13 Partial dependence profiles $\widehat{p}_2^A$ and $\widehat{p}_2^B$, for $x_2$, when the sensitive attribute s is either A or B, as averages over subgroups ($s_i$ being either A or B), for different models trained on toydata2. CART = classification and regression trees, GAM = generalized additive model, GLM = generalized linear model, RF = random forest

Fig. 4.14 Partial dependence profiles $\widehat{p}_3^A$ and $\widehat{p}_3^B$, for $x_3$, when the sensitive attribute s is either A or B, as averages over subgroups ($s_i$ being either A or B), for different models trained on toydata2. CART = classification and regression trees, GAM = generalized additive model, GLM = generalized linear model, RF = random forest
The following estimate was considered.
Definition 4.15 (Accumulated Local Function $\widehat{a}_j(x_j^*)$)
$$\widehat{a}_j(x_j^*) = \alpha + \sum_{u=1}^{k_j^*} \frac{1}{n_u}\sum_{i:\, x_{i,j}\in(a_{u-1},a_u]} \Big(m(a_u, \boldsymbol{x}_{i,-j}) - m(a_{u-1}, \boldsymbol{x}_{i,-j})\Big),$$
where $a_0 < a_1 < \cdots$ is a grid on $\mathcal{X}_j$, $n_u$ is the number of observations with $x_{i,j}\in(a_{u-1},a_u]$, $k_j^*$ is the index of the cell containing $x_j^*$, and $\alpha$ is some normalization constant, such that $\mathbb{E}[\widehat{a}_j(X_j)] = 0$.
Observe in Fig. 4.15, the three dependence profiles for .x1 , for the random
forest model, with respectively the “partial dependence plot” on the left, the
“local dependence plot” in the middle, and the “accumulated local plot” on the
right, on the toydata2 dataset, with options type = "accumulated" in the
predict_parts function, as in Biecek and Burzykowski (2021). One could also
use the FeatureEffect function in the iml R package, based on Molnar (2023),
respectively with method = "pdp", "ale" and "ice".

4.1.6 Application on the GermanCredit Dataset

In order to illustrate, consider four models estimated on the GermanCredit


dataset: a plain GLM (estimated using glm), a gradient boosting model (GBM;
estimated using adaboost algorithm, in gbm, with the default tuning parameters),
a classification and regression tree (CART; estimated using rpart and the default
tuning parameters) and a random forest (RF; estimated using randomForest,
based on 500 trees). On the GermanCredit dataset, consider two individuals with
a loan, Barbara and Andrew, in Table 4.1.
In Figs. 4.16 and 4.17 we can visualize the breakdown decompositions $\widehat{\gamma}_j^{\text{bd}}(\boldsymbol{z}^*_A)$ for Andrew and $\widehat{\gamma}_j^{\text{bd}}(\boldsymbol{z}^*_B)$ for Barbara, respectively, on four models trained on GermanCredit.
In Figs. 4.18 and 4.19, we can visualize the Shapley contributions $\gamma_j^{\text{shap}}(\boldsymbol{z}^*_A)$ for Andrew and $\gamma_j^{\text{shap}}(\boldsymbol{z}^*_B)$ for Barbara, respectively.
In Fig. 4.20, partial dependence plots $\widehat{p}^A_{\texttt{duration}}$ and $\widehat{p}^B_{\texttt{duration}}$ are given for the duration variable, for male ($s = A$) and female ($s = B$) policyholders respectively, on the GermanCredit dataset, with the plain logistic model and adaboost (GBM) on top, and a classification tree and a random forest below. All functions are increasing, indicating that the probability of having a default on a loan should increase with the duration.
Fig. 4.15 Partial dependence plot $\widehat{p}_1$ on the left, local dependence plot $\widehat{\ell}_1$ in the middle, and accumulated local function $\widehat{a}_1$ on the right, for $x_1$, for the random forest (rf) model m, trained on toydata2
Table 4.1 Information about Barbara and Andrew, two policyholders (names are fictional) from the GermanCredit dataset. Gender corresponds to the protected attribute, s

          Gender (s)  Age  Housing   Property     Credit amount  Account status  Duration  Employment since  Job               Purpose  Default (y)
Barbara   Female      63   For free  No property  13,756         None            60        ≥7 years          Highly qualified  New car  Good (0)
Andrew    Male        28   Rent      Car          4,113          [0, 200]        24        <1 year           Skilled employee  Old car  Bad (1)

          GLM                       Boosting (GBM)            CART                      RF
          With s       Without s    With s       Without s    With s       Without s    With s       Without s
Barbara   38.4% (69%)  40.0% (69%)  38.4% (76%)  40.6% (73%)  66.7% (77%)  66.7% (77%)  46.4% (82%)  46.2% (81%)
Andrew    46.7% (74%)  44.2% (72%)  46.7% (60%)  27.7% (58%)  15.4% (54%)  15.4% (54%)  29.0% (55%)  26.0% (52%)
(Entries are predicted probabilities, Prob., with the corresponding Rank in parentheses.)

Fig. 4.16 Breakdown decomposition $\gamma_j^{\text{bd}}(\boldsymbol{z}_A^\star)$ for Andrew, GermanCredit, on four models. CART = classification and regression trees, GBM = gradient boosting model, GLM = generalized linear model, RF = random forest

Fig. 4.17 Breakdown decomposition $\gamma_j^{\text{bd}}(\boldsymbol{z}_B^\star)$ for Barbara, GermanCredit, on four models. CART = classification and regression trees, GBM = gradient boosting model, GLM = generalized linear model, RF = random forest

Fig. 4.18 Shapley contributions for Andrew, on the GermanCredit dataset. CART = classification and regression trees, GBM = gradient boosting model, GLM = generalized linear model, RF = random forest

Fig. 4.19 Shapley contributions for Barbara, on the GermanCredit dataset. CART =
classification and regression trees, GBM = gradient boosting model, GLM = generalized linear
model, RF = random forest

Fig. 4.20 Partial dependence plots $\hat{p}^A$ and $\hat{p}^B$ for the duration variable, for male ($s = A$) and female ($s = B$), on the GermanCredit dataset, with the plain logistic and adaboost (GBM) on top, a classification tree and a random forest below. CART = classification and regression trees, GBM = gradient boosting model, GLM = generalized linear model, RF = random forest

4.1.7 Application on the FrenchMotor Dataset

On the FrenchMotor dataset, we model the occurrence of a claim (variable claim), with four models: a plain GLM (estimated using glm), a generalized additive model (GAM; estimated using gam from the mgcv R package, with splines (function s()) on the age of the driving license, LicAge, the driver's age, DrivAge, the bonus-malus coefficient, BonusMalus, and the RiskVar variable), a classification tree (estimated using rpart and the default tuning parameters) and a random forest (estimated using randomForest, based on 500 trees). The sensitive attribute s here is the gender. Two individuals are considered in Table 4.2, at the top (named respectively Barbara and Andrew). Predictions of the probability of claiming a loss, for those two individuals, are given below. For Barbara, the classification tree (including the sensitive attribute s, here the gender) predicts a probability of claiming a loss of 6.6% (seeing Barbara as a median driver, as 6.6% corresponds to the 49% quantile).

In Figs. 4.21 and 4.22, we can visualize the breakdown decompositions $\gamma_j^{\text{bd}}(\boldsymbol{z}_A^\star)$ for Andrew and $\gamma_j^{\text{bd}}(\boldsymbol{z}_B^\star)$ for Barbara respectively, on four models trained on the FrenchMotor dataset.
In Figs. 4.23 and 4.24, we can visualize the Shapley contributions (including confidence intervals) $\gamma_j^{\text{shap}}(\boldsymbol{z}_A^\star)$ for Andrew and $\gamma_j^{\text{shap}}(\boldsymbol{z}_B^\star)$ for Barbara respectively, for models estimated using the FrenchMotor dataset. One could also use Shapley in the iml R package (see Molnar et al. 2018).
Figure 4.25 is the partial dependence plot for a specific (continuous) variable, the
license age (license_age), with the average of ceteris paribus profiles, for males
and females respectively. Observe that, even if s is not included in the models, partial
dependence plots are different in the two groups: for GLM and GAM, predicted
probabilities for males are higher than for females, whereas for the random forest,
predicted probabilities for females are higher than for males.
Note that there are further connections between interpretation and causal models, as discussed in Feder et al. (2021), Geiger et al. (2022) or Wu et al. (2022). We return to those approaches in Chaps. 7 (on experimental and observational data) and 9 (on individual fairness and counterfactuals).

4.1.8 Counterfactual Explanation

To conclude this section, let us briefly mention here a concept that will be
discussed further in the context of fairness and discrimination, related to the idea
of “counterfactuals” (as named in Lewis 1973). The word “counterfactual” can be
either an adjective describing something “thinking about what did not happen but
could have happened, or relating to this kind of thinking,” or a noun defined as
“something such as a piece of writing or an argument that considers what would
have been the result if events had happened in a different way to how they actually

Table 4.2 Information about Barbara and Andrew, two policyholders (names are fictional) from the FrenchMotor dataset. CSP1 corresponds to
‘farmers and breeders’ (employees of their farm) whereas CSP5 corresponds to ‘employees.’ The risk variable (RiskVar) is an internal score going from 1
(low) to 20 (high). A bonus of 50 is the best (lowest) level, and 100 is usually the entry level for new drivers. Neither has a garage. For the predictions, a ‘rank’
of 11% means that the policyholder is perceived as almost in the top 10% of all drivers in the validation database
          Gender (s)  Age (years)  Marital status  Social category  License (months)  Car age  Car use           Car class  Car gas  Bonus malus  Risk score  Claim (y)
Barbara   Female      26           Alone           CSP5             67                10+      Private + office  M1         Regular  76           7           0
Andrew    Male        36           Couple          CSP1             206               0        Professional      M2         Diesel   78           19          0

          GLM                      GAM                      CART                       RF
          With s      Without s    With s      Without s    With s       Without s     With s       Without s
Barbara   3.4% (11%)  3.3% (10%)   3.8% (12%)  3.8% (12%)   6.6% (49%)   6.6% (43%)    2.4% (68%)   4.0% (74%)
Andrew    9.8% (64%)  9.9% (65%)   7.8% (46%)  7.8% (46%)   17.1% (93%)  17.1% (93%)   80.2% (98%)  75.2% (98%)
(Entries are predicted probabilities of claiming a loss, Prob., with the corresponding Rank in parentheses.)

Fig. 4.21 Breakdown decomposition $\gamma_j^{\text{bd}}(\boldsymbol{z}_A^\star)$ for Andrew for different models trained on FrenchMotor. CART = classification and regression trees, GAM = generalized additive model, GLM = generalized linear model, RF = random forest

Fig. 4.22 Breakdown decomposition $\gamma_j^{\text{bd}}(\boldsymbol{z}_B^\star)$ for Barbara for different models trained on FrenchMotor. CART = classification and regression trees, GAM = generalized additive model, GLM = generalized linear model, RF = random forest

Fig. 4.23 Shapley contributions $\gamma_j^{\text{shap}}(\boldsymbol{z}_A^\star)$ for Andrew, on the FrenchMotor dataset. CART = classification and regression trees, GAM = generalized additive model, GLM = generalized linear model, RF = random forest

Fig. 4.24 Shapley contributions $\gamma_j^{\text{shap}}(\boldsymbol{z}_B^\star)$ for Barbara, on the FrenchMotor dataset. CART = classification and regression trees, GAM = generalized additive model, GLM = generalized linear model, RF = random forest

Fig. 4.25 Partial dependence plots $\hat{p}^A$ and $\hat{p}^B$ for the license_age variable, for male ($s = A$) and female ($s = B$), on the FrenchMotor dataset. CART = classification and regression trees, GAM = generalized additive model, GLM = generalized linear model, RF = random forest

happened.” Therefore, a counterfactual is a statement of the form “had A occurred


then B would have occurred.” For example, Wachter et al. (2017) suggested using
such counterfactual statements as a psychology-grounded approach for explaining opaque decision rules. It is clearly related to the idea of deriving explanations. Finally, we have presented these tools because, in practice, some insurers believe that accuracy (as discussed in the next section) is the ultimate goal when creating a predictive model, even if the price to pay is a very opaque rule. As
Rudin (2019) wrote, “some people hope that creating methods for explaining these
black box models will alleviate some of the problems, but trying to explain black
box models, rather than creating models that are interpretable in the first place, is
likely to perpetuate bad practice and can potentially cause great harm to society
(. . . ) Explanations are often not reliable, and can be misleading (. . . ) If we instead
use models that are inherently interpretable, they provide their own explanations,
which are faithful to what the model actually computes.”

4.2 Accuracy of Actuarial Models

For classification problems, calibration measures how well a model's scores can be interpreted as probabilities, according to Cameron (2004) or Gross (2017). Accuracy measures how often a model produces correct answers, as defined in Schilling (2006).

4.2.1 Accuracy and Scoring Rules

Accurate, from Latin accuratus (past participle of accurare), means "done with care." From a statistical perspective, accuracy is how far off a prediction ($\widehat{y}$) is from its true value (y). Thus, a model is accurate if the errors ($\widehat{\varepsilon} = y - \widehat{y}$) are small. In the least-squares approach, accuracy can be measured simply by looking at the loss, or mean squared error, that is the empirical risk
$$\widehat{R}_n = \frac{1}{n}\sum_{i=1}^{n} \ell_2(y_i, \widehat{y}_i) = \frac{1}{n}\sum_{i=1}^{n} \widehat{\varepsilon}_i^{\,2}.$$

One could also consider the mean absolute error
$$\frac{1}{n}\sum_{i=1}^{n} \ell_1(y_i, \widehat{y}_i) = \frac{1}{n}\sum_{i=1}^{n} |\widehat{\varepsilon}_i|,$$

or the symmetric mean absolute percentage error, as introduced by Armstrong (1985),
$$\sum_{i=1}^{n} \frac{|\widehat{\varepsilon}_i|}{|y_i| + |\widehat{y}_i|}.$$

But those metrics are based on a loss function defined on .Y × Y, that measures
a distance between the observation y and the prediction . y , seen as a pointwise
prediction. But in many applications, the prediction can be a distribution. So instead
of a loss defined on .Y×Y, one could consider a scoring function defined on .Py ×Y,
where .Py denotes a set of distributions on .Y, as discussed in Sect. 3.3.1. Following
Gneiting and Raftery (2007), define
Definition 4.16 (Scoring Rules (Gneiting and Raftery 2007)) A scoring rule is a function $s : \mathcal{Y}\times\mathcal{P}_y \to \mathbb{R}$ that quantifies the error of reporting $P_y$ when the outcome is y. The expected score when the belief is $Q$ is $S(P_y, Q) = \mathbb{E}_Q\big[s(P_y, Y)\big]$.
Definition 4.17 (Proper Scoring Rules (Gneiting and Raftery 2007)) A scoring rule is proper if $S(P_y, Q) \ge S(Q, Q)$ for all $P_y, Q \in \mathcal{P}_y$, and strictly proper if equality holds only when $P_y = Q$.
Let us start with the binary case, when $y \in \mathcal{Y} = \{0, 1\}$ and $\mathcal{P}_y$ is the set of Bernoulli distributions, $\mathcal{B}(p)$, with $p \in [0, 1]$. The scoring rule can be denoted $s(p, y) \in \mathbb{R}$, and the expected score is $S(p, q) = \mathbb{E}(s(p, Y))$ with $Y \sim \mathcal{B}(q)$, for some $q \in [0, 1]$. For example, the Brier scoring rule is defined as follows: let $f_q(y) = q^y (1 - q)^{1-y}$, and define
$$s(f_q, y) = -2 f_q(y) + \sum_{y'=0}^{1} f_q(y')^2,$$
which can be written
$$s(q, y) = -q^y (1 - q)^{1-y} + \frac{1}{2}\big(q^2 + (1 - q)^2\big).$$
This is a proper scoring rule. And we can define .G(q) = S(q, q) that then satisfies
.G(q) ≤ S(p, q) for all .p, q. Interestingly, we can recover S from G. Because the
expected value is a linear operator, .q → S(p, q) is linear, as mentioned in Parry
(2016). S is proper if .S(q, q) ≤ S(p, q) for all .p, q. And in that case, the divergence
.d(p, q) = S(p, q) − S(q, q) cannot be negative. And for a proper scoring rule,

.d(p, q) = 0 if and only if .p = q. Under some mild conditions on .P, S is a proper

scoring rule if and only if there exists a concave function G such that

$$s(y, p) = G(p) + (y - p)\,\partial G(p),$$

and $G(q) = S(q, q)$. For example, if $G(p) = -p \log p - (1 - p)\log(1 - p)$ (corresponding to the entropy),
$$S(y, q) = -y \log q - (1 - y)\log(1 - q).$$
In that case, the divergence is
$$d(p, q) = q \log\frac{q}{p} + (1 - q)\log\frac{1 - q}{1 - p},$$
which is the Kullback–Leibler divergence (from Definition 3.7). If $G(p) = p(1 - p)$ (corresponding to the Gini index), then $S(y, q) = (y - q)^2$, which is the Brier score. In that case, the divergence is the mean squared difference between probabilities, $d(p, q) = (p - q)^2$.

One can use the R package scoringRules to compute those quantities, as


presented in Jordan et al. (2019). Thus, scoring rules provide summary measures of predictive performance, assigning a numerical score to probabilistic forecasts. An interesting property is the "sharpness" of the predictive distributions, or their concentration. The less variable the predictions, the more concentrated the predictive distribution is. See Candille and Talagrand (2005) and Gneiting et al. (2007)
for more details. Note that Duan et al. (2020) suggested using a gradient of the
scoring rule with respect to the distribution of the prediction, in a boosting procedure
(instead of the standard gradient of the loss, as in Sect. 3.3.6).
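As a short illustration, the sketch below evaluates two proper scoring rules on a validation sample; the vectors y (0/1 outcomes) and p_hat (predicted probabilities) are hypothetical names used only for the example, and the call to the scoringRules package is indicated for the case of continuous predictive distributions.

# Brier (quadratic) and logarithmic scores for probabilistic binary predictions
brier <- mean((y - p_hat)^2)
logsc <- -mean(y * log(p_hat) + (1 - y) * log(1 - p_hat))

# For continuous predictive distributions, scoringRules can be used, e.g.
# library(scoringRules)
# crps_sample(y = y_cont, dat = sim_matrix)   # CRPS from simulated predictive samples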
To conclude this section on accuracy, let us recall that in a classification problem, "accuracy" has a precise quantitative definition (and is not only an abstract concept): it is the fraction of predictions our model got right. Therefore, it is a property of $m_t$ (for a given threshold t) and not of the model m. Note that instead of $m_t = \mathbf{1}(m > t)$ it is also possible to define the "decision boundary function" $d = m - t$, so that $m_t$ is 1 if d is positive, 0 otherwise.
Definition 4.18 (Accuracy (Classifier)) The accuracy of a classifier $m_t$ is the number of correct predictions over the total number of predictions,
$$\text{accuracy}(m_t) = \frac{TP + TN}{TP + TN + FP + FN}.$$
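In R, this quantity is immediate to compute once a threshold is chosen; y_valid and score_valid below are hypothetical names, standing for observed 0/1 outcomes and predicted scores on a validation sample.

t <- 0.5                                        # chosen threshold
y_pred <- as.numeric(score_valid > t)           # classifier m_t
table(observed = y_valid, predicted = y_pred)   # confusion matrix (TP, TN, FP, FN)
mean(y_pred == y_valid)                         # accuracy of m_t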

In some heavily unbalanced data (for instance to detect fraudulent transactions), with $\mathbb{P}[Y = 1] = 1\%$, observe that a model m which always predicts 0 will have (on average) 99% accuracy. Instead of looking at a classifier $m_t$, for a given t, we can consider the overall scoring function m (for all t in $[0, 1]$), through
$$\big(\mathbb{P}[m(\boldsymbol{X}) > t \mid Y = 0],\; \mathbb{P}[m(\boldsymbol{X}) > t \mid Y = 1]\big)_{t\in[0,1]},$$
or, since $\widehat{y} = m_t(\boldsymbol{x}) = \mathbf{1}_{m(\boldsymbol{x}) > t}$ for threshold t,
$$\big(\mathbb{P}[\widehat{Y} = 1 \mid Y = 0],\; \mathbb{P}[\widehat{Y} = 1 \mid Y = 1]\big) = (\text{FPR}, \text{TPR}).$$

The receiver operating characteristic (ROC) curve is the curve obtained by representing the true-positive rates as a function of the false-positive rates, when the threshold varies. This can be related to the "discriminant curve" in the context of credit scores, in Gourieroux and Jasiak (2007).
Definition 4.19 (ROC Curve) The ROC curve is the parametric curve
$$\big(\mathbb{P}[m(\boldsymbol{X}) > t \mid Y = 0],\; \mathbb{P}[m(\boldsymbol{X}) > t \mid Y = 1]\big), \quad\text{for } t \in [0, 1],$$
when the score $m(\boldsymbol{X})$ and Y evolve in the same direction (a high score indicates a high risk). It can be written
$$C(t) = \text{TPR}\circ\text{FPR}^{-1}(t),$$
where
$$\text{FPR}(t) = \mathbb{P}[m(\boldsymbol{X}) > t \mid Y = 0] = \mathbb{P}[m_0(\boldsymbol{X}) > t] \quad\text{and}\quad \text{TPR}(t) = \mathbb{P}[m(\boldsymbol{X}) > t \mid Y = 1] = \mathbb{P}[m_1(\boldsymbol{X}) > t].$$

In other words, the ROC curve is obtained from the two survival functions of $m(\boldsymbol{X})$, FPR and TPR (conditional on $Y = 0$ and $Y = 1$, respectively). The AUC, the area under the curve, is then written as follows.
Definition 4.20 (AUC, Area Under the ROC Curve) The area under the curve is defined as the area below the ROC curve,
$$\text{AUC} = \int_0^1 C(t)\,dt = \int_0^1 \text{TPR}\circ\text{FPR}^{-1}(t)\,dt.$$

In Fig. 4.26, we can visualize on the left-hand side a classification tree, when we try to predict the gender of a driver using telematic information (from the telematic dataset), and on the right-hand side, ROC curves associated with three models, a smooth logistic regression (GAM), adaboost (boosting, GBM) and a random forest, trained on 824 observations; the ROC curves are based on the 353 observations of the validation dataset.
The AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example. Indeed, assume for simplicity that the score (actually $m_0$ and $m_1$) has a derivative, so that the true-positive rate and the false-positive rate are given by
$$\text{TPR}(t) = \int_t^1 m_1'(x)\,dx \qquad\text{and}\qquad \text{FPR}(t) = \int_t^1 m_0'(x)\,dx,$$

Fig. 4.26 On the right, ROC curves $\widehat{C}_n(t)$, for various models on the telematic (validation) dataset, where we try to predict the gender of the driver using telematic data (and the age). The area under the curve (AUC) using a generalized additive model (GAM) is close to 69%. A classification tree is also plotted on the left. GBM = gradient boosting model, RF = random forest

then
$$\text{AUC} = \int_0^1 \text{TPR}\big(\text{FPR}^{-1}(t)\big)\,dt = \int_{\infty}^{-\infty} \text{TPR}(u)\,\text{FPR}'(u)\,du,$$
with a simple change of variable, and therefore
$$\text{AUC} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \mathbf{1}(v > u)\, m_1'(v)\, m_0'(u)\, dv\, du = \mathbb{P}\big[M_1 > M_0\big],$$

where .M1 is the score for a positive instance and .M0 is the score for a negative
instance. Therefore, as discussed in Calders and Jaroszewicz (2007), the AUC is
very close to the Mann–Whitney U test, used to test the null hypothesis that, for
randomly selected values .Z0 and .Z1 from two populations, the probability of .Z0
being greater than .Z1 is equal to the probability of .Z1 being greater than .Z0 , which
is written, empirically (where $z_i = m(\boldsymbol{x}_i)$ denotes the score of observation i),
$$\frac{1}{n_0 n_1}\sum_{i:\, y_i = 0}\ \sum_{j:\, y_j = 1} \mathbf{1}(z_j > z_i).$$
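In R, the ROC curve and the AUC can be obtained directly, and the Mann–Whitney interpretation can be checked empirically; y_valid and score_valid are hypothetical names for validation outcomes and scores.

library(pROC)
roc_obj <- roc(response = y_valid, predictor = score_valid, quiet = TRUE)
plot(roc_obj)      # ROC curve
auc(roc_obj)       # area under the curve

# Probability that a random positive case is ranked above a random negative one
mean(outer(score_valid[y_valid == 1], score_valid[y_valid == 0], ">"))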

Therefore, the ROC curve and the AUC have more to do with the rankings of individuals than with the values of the predictions: if h is some strictly increasing function, m and $h \circ m$ have the same ROC curves. For example, if $\widehat{m}$ is a trained model, using any technique discussed previously, then both $\widehat{m}^{1/2}$ and $\widehat{m}^{2}$ are valid models, in the sense that they have the same AUC, the same ROC curve, and can be considered as

Table 4.3 Area under the curve for various models on the toydata2

            Training data                       Validation data
            GLM     CART    GAM     RF          GLM     CART    GAM     RF
m(x, s)     85.3    82.7    86.1    100.0       86.0    81.4    86.3    83.6
m(x)        85.0    82.7    85.9    100.0       85.5    81.4    85.9    83.6
CART = classification and regression tree, GAM = generalized additive model, GLM = generalized linear model, RF = random forest

Table 4.4 Area under the curve for various models on the GermanCredit (validation subsample) dataset, predicting the default. At the top, models including the sensitive variable (the gender), and at the bottom, models without the sensitive variable, corresponding to fairness through unawareness

            GLM        Tree       Boosting   Bagging
m(x, s)     79.339%    72.922%    79.488%    77.914%
m(x)        78.992%    72.922%    79.035%    78.287%
GLM = generalized linear model

scores as they both take values in .[0, 1]. Thus, accuracy for classifiers has to do with
the ordering of predictions, not their actual value (which is related to calibration, and
discussed in Sect. 4.3.3). Austin and Steyerberg (2012) used the term “concordance
statistic” for the AUC. Note that the AUC is also related to the Gini index (discussed
in the next section) (Table 4.3).
On the GermanCredit dataset, the variable of interest y is the default, taking
values in .{0, 1}, and the protected attribute p is the gender (binary, with male and
female). We consider four models, either on both .x and p, or only on .x (without
the sensitive attribute, corresponding to fairness through unawareness, as defined in
Chap. 8): (1) a logistic regression, or GLM, (2) a classification tree, (3) a boosted model, and (4) a bagging model, corresponding to a random forest. The AUC for those models is given in Table 4.4 and ROC curves are in Fig. 4.27.
If y is a categorical variable with more than two classes, different scoring rules can be used, as suggested by Kull and Flach (2015).

4.3 Calibration of Predictive Models

Somehow, calibration is related to ideas mentioned previously, as we simply want the "probabilities" given by a model to make sense. While that property is somewhat natural for GLMs, it is usually not the case for machine-learning models. According to Wang et al. (2021), "deep neural networks tend to be overconfident and poorly calibrated after training," while Guo et al. (2017) "have shown that modern neural networks are poorly calibrated and over-confident despite having better performance" (as written in Müller et al. 2019).

Fig. 4.27 Receiver operating characteristic curves .C n (t), for various models on the
GermanCredit (validation) dataset, predicting a default. Thick lines correspond to models
including the sensitive variable (gender), and thin lines, models without the sensitive variable,
corresponding to “fairness through unawareness”. GLM = generalized linear model

4.3.1 From Accuracy to Calibration

According to Heckert et al. (2002), “accuracy is a qualitative term referring to


whether there is agreement between a measurement made on an object and its true
(target or reference) value. Bias is a quantitative term describing the difference
between the average of measurements made on the same object and its true value.”
And “calibration” could be related to this “bias.” Silver (2012) raised that issue in
the context of probabilities and forecasting, “out of all the times you said there
was a 40 percent chance of rain, how often did rain actually occur? If, over
the long run, it really did rain about 40 percent of the time, that means your
forecasts were well calibrated.” The same idea can be found in Scikit Learn (2017),
“well calibrated classifiers are probabilistic classifiers for which the output can be
directly interpreted as a confidence level. For instance, a well calibrated (binary)

classifier should classify the samples such that among the samples to which it gave
a [predicted probability] value close to 0.8, approximately 80% actually belong to
the positive class.”
As defined in Kuhn et al. (2013), when seeking good calibration, "we desire
that the estimated class probabilities are reflective of the true underlying probability
of the sample.” This was initially written in the context of posterior distribution (in
a Bayesian setting), as in Dawid (1982): “suppose that a forecaster sequentially
assigns probabilities to events. He is well calibrated if, for example, of those events
to which he assigns a probability 30 percent, the long-run proportion that actually
occurs turns out to be 30 percent.” Van Calster et al. (2019) gave a nice state of the
art.

4.3.2 Lorenz and Concentration Curves

Before properly defining “calibration,” let us mention the Lorenz curve, as well as
“concentration curves,” in the context of actuarial models. Frees et al. (2011, 2014b)
suggested using the Lorenz curve and the Gini index to provide accuracy measures
for a classifier, and a regression. In economics, the Lorenz curve is a graphical
representation of the distribution of income or of wealth, and it was popularized
to represent inequalities in wealth distribution. It simply shows the proportion of
overall income, or wealth, owned by the bottom u% of the people. The first diagonal
(obtained when $L(u) = u$) is called the "line of perfect equality," in the sense that the bottom u%
of the population always gets u% of the income. Inequalities arise when the bottom
u% of the population gets v% of the income, with .v < u (because the population
is sorted based on incomes, the Lorenz curve cannot rise above the line of perfect
equality). Formally, we have the following definition.
Definition 4.21 (Lorenz Curve (Lorenz 1905)) If Y is a positive random variable, with cumulative distribution function F,
$$L(u) = \frac{\mathbb{E}\big[Y\cdot\mathbf{1}(Y \le F^{-1}(u))\big]}{\mathbb{E}[Y]} = \frac{\int_0^u F^{-1}(t)\,dt}{\int_0^1 F^{-1}(t)\,dt},$$
and the empirical version is
$$\widehat{L}_n(u) = \frac{\sum_{i=1}^{[nu]} y_{(i)}}{\sum_{i=1}^{n} y_{(i)}},$$
for a sample $\{y_1, \ldots, y_n\}$, where $y_{(i)}$ denote the order statistics, in the sense that $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n-1)} \le y_{(n)}$.

Fig. 4.28 Lorenz curves $\widehat{L}_n(u)$ for various models on the GermanCredit (validation) dataset. Thick lines correspond to models including the sensitive variable (the gender), and thin lines, models without the sensitive variable, corresponding to "fairness through unawareness." GLM = generalized linear model

In Fig. 4.28, we can visualize some Lorenz curves for various models fitted on the toydata2 (validation) dataset. Those functions can be obtained using the Lc function in the ineq R package. In order to measure the concentration of wealth, Lorenz (1905) suggested such a function $[0, 1] \to [0, 1]$, for positive variables $y_i$.
The expression of the Lorenz curve, $u \mapsto L(u)$, reminds us of the expected shortfall, where the denominator is the quantile, and not the expected value. For example, if Y has a log-normal distribution $\mathcal{LN}(\mu, \sigma^2)$, $L(u) = \Phi\big(\Phi^{-1}(u) - \sigma\big)$, while if Y has a Pareto distribution with tail index $\alpha$ (i.e., $\mathbb{P}[Y > y] = (y/x_0)^{-\alpha}$), the Lorenz curve is $L(u) = 1 - (1 - u)^{(\alpha-1)/\alpha}$ (see Cowell 2011). It is rather common to summarize the Lorenz curve into a single parameter, the Gini index, introduced in Gini (1912), corresponding to a linear transformation of the area under the Lorenz curve. More precisely, $G = 1 - 2\,\text{AUC}$, so that a small AUC corresponds

to a distribution with a strong concentration ($G \sim 1$), whereas $\text{AUC} = 1/2$ corresponds to identical values for the $y_i$'s, and $G = 0$. Thus,
$$G = 1 - 2\int_0^1 L(u)\,du = \frac{2\,\text{Cov}\big(Y, F(Y)\big)}{\mathbb{E}[Y]} = \frac{\mathbb{E}\big[|Y - Y'|\big]}{2\,\mathbb{E}[Y]},$$

where $Y'$ is an independent version of Y, and the empirical version is
$$\widehat{G}_n = \frac{1}{2\overline{y}}\cdot\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} |y_i - y_j|.$$

The numerator in the computation of the Gini index is the mean absolute difference, also named "Gini mean difference" in Yitzhaki and Schechtman (2013),
$$\gamma = \mathbb{E}\big[|Y - Y'|\big], \quad\text{where } Y, Y' \sim F \text{ are independent copies.}$$

While this framework is typically used in the case where $y \in \mathbb{R}_+$, it is also possible to consider the case where $y \in \{0, 1\}$. Hence, if F corresponds to a Bernoulli variable $\mathcal{B}(p)$,
$$\gamma = 2p(1 - p) = 1 - p^2 - (1 - p)^2,$$
which corresponds to the Brier score as in Gneiting and Raftery (2007). Consider now, more generally, some classification problem, with a training sample $(\boldsymbol{x}_i, y_i)$. Murphy (1973), Murphy and Winkler (1987), Dawid (2004), and more recently Pohle (2020), introduced a so-called "Murphy decomposition." For a squared loss,
$$\mathbb{E}\big[(X - Y)^2\big] = \underbrace{\text{Var}[Y]}_{\text{UNC}} - \underbrace{\text{Var}\big[\mathbb{E}[Y\mid X]\big]}_{\text{RES}} + \underbrace{\mathbb{E}\Big[\big(X - \mathbb{E}[Y\mid X]\big)^2\Big]}_{\text{CAL}},$$

where the first term, UNC, is the unconditional uncertainty (or entropy), which represents the "uncertainty" in the variable of interest and does not depend on the predictions (also called "pure randomness"); the second component, RES, is called "resolution," and corresponds to the part of the uncertainty in Y that can be explained by the prediction, so it should reduce the expected score by that amount (compared with the unconditional forecast); the last part, CAL, corresponds to "miscalibration," or "reliability."
As explained in Murphy (1996), an original decomposition was derived as a
partition of the Brier score, which can be interpreted as the MSE for probability
forecasts. For the Brier score in a binary setting, as discussed in Bröcker (2009),
with a calibration function $\pi(p) = \mathbb{P}[Y = 1 \mid p]$, where p is the forecast probability and y the observed value, if $\pi = \mathbb{P}[Y = 1]$, the Brier score is decomposed as
$$\underbrace{\pi(1 - \pi)}_{\text{UNC}} - \underbrace{\mathbb{E}\big[\big(\pi(p) - \pi\big)^2\big]}_{\text{RES}} + \underbrace{\mathbb{E}\big[\big(p - \pi(p)\big)^2\big]}_{\text{CAL}}.$$

A second way of assessing models is to study the Kullback–Leibler divergence (see Definition 3.7). For a classification problem, if $\widehat{m}$ is the fitted model whereas $\mu$ is the true model, the Kullback–Leibler divergence (also named "discrimination information," as in Kullback 2004) can be defined as follows,
$$D_{\text{KL}}\big(\mu(\boldsymbol{x}_i)\,\|\,\widehat{m}(\boldsymbol{x}_i)\big) = \mu(\boldsymbol{x}_i)\log\frac{\mu(\boldsymbol{x}_i)}{\widehat{m}(\boldsymbol{x}_i)} + \big(1 - \mu(\boldsymbol{x}_i)\big)\log\frac{1 - \mu(\boldsymbol{x}_i)}{1 - \widehat{m}(\boldsymbol{x}_i)}.$$
Given a collection of fitted models, $\widehat{m} \in \mathcal{M}$, we select the one that minimizes the divergence, i.e.,
$$\min_{\widehat{m}\in\mathcal{M}}\left\{\frac{-1}{n}\sum_{i=1}^{n}\Big(\mu(\boldsymbol{x}_i)\log\widehat{m}(\boldsymbol{x}_i) + \big(1 - \mu(\boldsymbol{x}_i)\big)\log\big(1 - \widehat{m}(\boldsymbol{x}_i)\big)\Big)\right\}.$$

As the true model $\mu$ is unknown, this problem cannot be solved, and a "natural" idea is to replace $\mu(\boldsymbol{x}_i)$ by the observed values $y_i$, that is, to solve
$$\min_{\widehat{m}\in\mathcal{M}}\left\{\frac{-1}{n}\sum_{i=1}^{n}\Big(y_i\log\widehat{m}(\boldsymbol{x}_i) + (1 - y_i)\log\big(1 - \widehat{m}(\boldsymbol{x}_i)\big)\Big)\right\},$$
which means that we select the model that minimizes the Bernoulli deviance loss on the training sample.
Logistic regression satisfies the "balance property," that could be seen as some global "unbiased" estimator property, as discussed in Sect. 3.3.2,
$$\frac{1}{n}\sum_{i=1}^{n}\big(y_i - \widehat{m}(\boldsymbol{x}_i)\big) = 0,$$
on the training sample.


It is possible to use the Lorenz curve on the scores $\{\widehat{m}(\boldsymbol{x}_1), \ldots, \widehat{m}(\boldsymbol{x}_n)\}$. Let $\widehat{F}_{\widehat{m}}$ denote the empirical cumulative distribution function,
$$u \mapsto \widehat{F}_{\widehat{m}}(u) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\big(\widehat{m}(\boldsymbol{x}_i) \le u\big),$$

whereas the Lorenz curve is
$$u \mapsto \widehat{L}(u) = \frac{\sum_{i=1}^{[nu]} m_{(i)}}{\sum_{i=1}^{n} m_{(i)}},$$
where $m_i = \widehat{m}(\boldsymbol{x}_i)$ and $m_{(i)}$ are the order statistics. Observe that some mirrored Lorenz curve can also be used, with $\widehat{M} : u \mapsto 1 - \widehat{L}(1 - u)$,
$$u \mapsto \widehat{M}(u) = \frac{\sum_{i=[n(1-u)]}^{n} m_{(i)}}{\sum_{i=1}^{n} m_{(i)}} = \frac{\sum_{i=1}^{n} m_i\,\mathbf{1}(m_i > m_{i_u^\star})}{\sum_{i=1}^{n} m_i},$$

with .iu✶ = [n(1 − u)], so that the curve should be in the upper corner (as for the
ROC curve), as suggested in Tasche (2008). If .mi is identical for all individuals,

.u |→ M(u) is on the first diagonal. With a “perfect model” (or “saturated model”),
when $m_i = y_i$, we have a piecewise linear function, from $(0, 0)$ to $(\overline{y}, 1)$ and from $(\overline{y}, 1)$ to $(1, 1)$. In the context of motor insurance, it does not mean that we have

perfectly estimated the probability but it means that our model was able to predict
without any mistakes who would have claimed a loss.
Ling and Li (1998) introduced a "gain curve," also called "(cumulative) lift curve," defined as
$$u \mapsto \widehat{\Gamma}(u) = \frac{\sum_{i=1}^{n} y_i\,\mathbf{1}(m_i > m_{i_u^\star})}{\sum_{i=1}^{n} y_i},$$

where observed outcomes $y_i$ are aggregated in the ordering of their prediction $m_i = \widehat{m}(\boldsymbol{x}_i)$.
Definition 4.22 (Concentration Curve (Gourieroux and Jasiak 2007; Frees et al. 2011, 2014b)) If Y is a positive random variable, observed jointly with $\boldsymbol{X}$, and if $m(\boldsymbol{X})$ and $\mu(\boldsymbol{X})$ denote a predictive model and the regression curve, respectively, the concentration curve of $\mu$ with respect to m is
$$\Gamma(u) = \frac{\mathbb{E}\big[\mu(\boldsymbol{X})\cdot\mathbf{1}\big(m(\boldsymbol{X}) \le Q_m(u)\big)\big]}{\mathbb{E}\big[\mu(\boldsymbol{X})\big]} = \frac{\mathbb{E}\big[Y\cdot\mathbf{1}\big(m(\boldsymbol{X}) \le Q_m(u)\big)\big]}{\mathbb{E}[Y]},$$
where $Q_m$ is the quantile function of $m(\boldsymbol{X})$, i.e.,
$$Q_m(u) = \inf\big\{t : \mathbb{P}[m(\boldsymbol{X}) \le t] > u\big\},$$

and the empirical version is
$$\widehat{\Gamma}_n(u) = \frac{\sum_{i=1}^{[nu]} y_{(i):m}}{\sum_{i=1}^{n} y_{(i):m}},$$
for a sample $\{y_1, \ldots, y_n\}$, where the $y_{(i):m}$ are ordered with respect to m, in the sense that $m(\boldsymbol{x}_{(1):m}) \le m(\boldsymbol{x}_{(2):m}) \le \cdots \le m(\boldsymbol{x}_{(n-1):m}) \le m(\boldsymbol{x}_{(n):m})$.
This function could be seen as an extension of the Lorenz curve, in the sense that if $L(u)$ was the proportion of wealth owned by the lower u% of the population, $\Gamma(u)$ would represent the proportion of the total true expected loss corresponding to the u% of the policyholders with the smallest premiums (pure premiums, computed using model m). Such a function provides more information than some simple "accuracy" plot (like the ROC curve), and it can be related to $\mathbb{E}[Y \mid m(\boldsymbol{X}) \le t]$, or, more interestingly, $\mathbb{E}[Y \mid m(\boldsymbol{X}) = t]$, which corresponds to a "calibration" curve.
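Empirically, these curves are straightforward to compute on a validation sample; the sketch below uses hypothetical names (y_valid for observed losses or claim indicators, m_valid for the corresponding predictions of some fitted model).

ord  <- order(m_valid)                          # sort policyholders by predicted value
u    <- (1:length(y_valid)) / length(y_valid)
conc <- cumsum(y_valid[ord]) / sum(y_valid)     # empirical concentration curve
plot(u, conc, type = "l")
abline(a = 0, b = 1, lty = 2)                   # line of perfect equality

# Lorenz curve and Gini index of the predictions, with the ineq package
library(ineq)
plot(Lc(m_valid))
Gini(m_valid)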

4.3.3 Calibration, Global, and Local Biases

According to Kuhn et al. (2013), Section 11.1, “we desire that the estimated class
probabilities are reflective of the true underlying probability of the sample. That
is, the predicted class probability (or probability-like value) needs to be well-
calibrated. To be well-calibrated, the probabilities must effectively reflect the true
likelihood of the event of interest." That is the definition we consider here.
Definition 4.23 (Well-Calibrated (1) (Van Calster et al. 2019; Krüger and Ziegel
2021)) The forecast X of Y is a well-calibrated forecast of Y if .E(Y |X) = X
almost surely, or .E[Y |X = x] = x, for all x.
Definition 4.24 (Well Calibrated (2) (Zadrozny and Elkan 2002; Cohen and Goldszmidt 2004)) The prediction $m(\boldsymbol{X})$ of Y is a well-calibrated prediction if $\mathbb{E}[Y \mid m(\boldsymbol{X}) = \widehat{y}] = \widehat{y}$, for all $\widehat{y}$.
Definition 4.25 (Calibration Plot) The calibration plot associated with model m is the function $\widehat{y} \mapsto \mathbb{E}(Y \mid m(\boldsymbol{X}) = \widehat{y})$. The empirical version is some local regression on $\{y_i, m(\boldsymbol{x}_i)\}$.

Such a plot can be obtained using binning techniques, as in Hosmer and Lemesbow (1980). Given a partition $\{I_1, \ldots, I_k, \ldots\}$ of $\mathcal{Y}$, we would have a well-calibrated model if, for any k,
$$\frac{\sum_{i=1}^{n} m(\boldsymbol{x}_i)\,\mathbf{1}\big(m(\boldsymbol{x}_i) \in I_k\big)}{\sum_{i=1}^{n} \mathbf{1}\big(m(\boldsymbol{x}_i) \in I_k\big)} \approx \frac{\sum_{i=1}^{n} y_i\,\mathbf{1}\big(m(\boldsymbol{x}_i) \in I_k\big)}{\sum_{i=1}^{n} \mathbf{1}\big(m(\boldsymbol{x}_i) \in I_k\big)}.$$

One can also consider the k-nearest neighbors approach, or a local regression
(using the locfit R package), as in Denuit et al. (2021). In Figs. 4.29, 4.30
and 4.31, we can visualize some empirical calibration plots on the toydata2

Fig. 4.29 The blue line is the empirical calibration plot $\widehat{y} \mapsto \mathbb{E}(Y \mid m(\boldsymbol{X}) = \widehat{y})$ on two models estimated on the toydata2 dataset (based on k-nearest neighbors), with the initial validation dataset on the left-hand side ($n = 1{,}000$) and some simulated larger sample ($n = 100{,}000$) on the right-hand side, with the generalized additive model (GAM) on top, and the random forest model below. Bars in the back correspond to some histogram of $m(\boldsymbol{X})$

Fig. 4.30 The blue line is the empirical calibration plot $\widehat{y} \mapsto \mathbb{E}[Y \mid \widehat{Y} = \widehat{y}]$, on the GermanCredit dataset (based on k-nearest neighbors), with a GLM, a classification tree (on top), a boosting and a bagging model (below), for models $\widehat{m}(\boldsymbol{x}, s)$ including the sensitive attribute (here the gender, s)

(Fig. 4.29), for the GLM and the random forest, and on the four models on
the GermanCredit dataset (Figs. 4.30 and 4.31). The calibPlot function in
package predtools can also be used.
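A simple empirical calibration plot can also be coded directly; the sketch below assumes a validation sample with 0/1 outcomes y_valid and predicted probabilities p_valid (hypothetical names), and uses a base-R smoother, although the locfit package mentioned above could be used instead.

# Binned calibration plot: average prediction vs observed frequency per bin
bins <- cut(p_valid, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
plot(tapply(p_valid, bins, mean), tapply(y_valid, bins, mean),
     xlab = "average prediction", ylab = "observed frequency",
     xlim = c(0, 1), ylim = c(0, 1))
abline(a = 0, b = 1, lty = 2)

# Smooth calibration curve (local regression)
lines(lowess(p_valid, y_valid))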
We introduced in Sect. 2.5.2 the idea of a “balanced” model, with Definition 2.8,
which corresponds to a property of “globally unbiased.”
Definition 4.26 (Globally Unbiased Model m (Denuit et al. 2021)) Model m is
globally unbiased if .E[Y ] = E[m(X)].
But it is possible to consider a local version,
Definition 4.27 (Locally Unbiased Model m (Denuit et al. 2021)) Model m is locally unbiased at $\widehat{y}$ if $\mathbb{E}[Y \mid m(\boldsymbol{X}) = \widehat{y}] = \widehat{y}$.
It means that the model is balanced "locally" on the group of individuals whose $\boldsymbol{x}$'s satisfy $m(\boldsymbol{x}) = \widehat{y}$ (and therefore there is no cross-financing between groups).

Fig. 4.31 Calibration plot $\widehat{y} \mapsto \mathbb{E}[Y \mid \widehat{Y} = \widehat{y}]$, on the GermanCredit dataset (based on k-nearest neighbors), with a GLM, a classification tree (on top), a boosting and a bagging model (below), when $\widehat{m}(\boldsymbol{x})$ excludes the sensitive attribute (gender s)

On the GermanCredit dataset, the variable of interest y is the default, taking


values in .{0, 1}, and the protected attribute p is gender (binary, with male and
female). We consider four models, either on both .x and p, or only on .x (without
the sensitive attribute, corresponding to fairness through unawareness, as defined in
Chap. 8): (1) a logistic regression, or GLM, (2) a classification tree, (3) a boosted
model, and (4) a bagging model, corresponding to a random forest. In Fig. 4.30, we
can visualize the calibration graphs for the four models, when .x and p are used, and
in Fig. 4.31, when .x only is used.

In the context of GLMs, as mentioned in Vidoni (2003), some theoretical results


can be derived. Consider some sample .(yi , x i ) such that .Yi is supposed to have
distribution f , in the exponential family, as discussed in Sect. 3.3.2, i.e.,
 
$$f(y_i) = \exp\left(\frac{y_i\theta_i - b(\theta_i)}{\phi} + c(y_i, \phi)\right).$$
In the GLM framework, different quantities are used, namely the natural parameter for $y_i$ ($\theta_i$), the prediction for $y_i$ ($\widehat{y}_i = \mu_i = \mathbb{E}(Y_i) = b'(\theta_i)$), the score associated with $y_i$ ($\eta_i = \boldsymbol{x}_i^\top\boldsymbol{\beta}$) and the link function g such that $\eta_i = g(\mu_i) = g(b'(\theta_i))$. The first-order conditions can be written, using the standard chain rule technique,
$$\frac{\partial \log L_i}{\partial \beta_j} = \frac{\partial \log L_i}{\partial \theta_i}\cdot\frac{\partial \theta_i}{\partial \mu_i}\cdot\frac{\partial \mu_i}{\partial \eta_i}\cdot\frac{\partial \eta_i}{\partial \beta_j} = 0,$$
where the four terms are
$$\frac{\partial \log L_i}{\partial \beta_j} = \frac{y_i - \mu_i}{\phi}\cdot\frac{1}{V(\mu_i)}\cdot x_{i,j}\cdot\left(\frac{\partial \eta_i}{\partial \mu_i}\right)^{-1}, \quad\text{with}\quad \left(\frac{\partial \eta_i}{\partial \mu_i}\right)^{-1} = \big(g'(\mu_i)\big)^{-1}.$$

In that case, with the canonical link $g_\star = b'^{-1}$, i.e., $\eta_i = \theta_i$, the first-order condition is (with notation $\widehat{\boldsymbol{y}} = \widehat{\boldsymbol{\mu}}$),
$$\nabla \log\mathcal{L} = \boldsymbol{X}^\top(\boldsymbol{y} - \widehat{\boldsymbol{y}}) = \boldsymbol{0},$$
so, if there is an intercept, $\boldsymbol{1}^\top(\boldsymbol{y} - \widehat{\boldsymbol{y}}) = 0$, i.e., $\sum_{i=1}^{n} y_i = \sum_{i=1}^{n}\widehat{y}_i$, which is the empirical version (on the training dataset) of $\mathbb{E}[Y] = \mathbb{E}[\widehat{Y}]$. If a noncanonical link function is used (which is the case for the Tweedie model or the gamma model with a logarithm link function), the first-order condition is
$$\nabla \log\mathcal{L} = \boldsymbol{X}^\top\boldsymbol{\Omega}(\boldsymbol{y} - \widehat{\boldsymbol{y}}) = \boldsymbol{0},$$
where $\boldsymbol{\Omega}$ is a diagonal matrix,² so we no longer have $\mathbb{E}[Y] = \mathbb{E}[\widehat{Y}]$ (unless we consider an appropriate change of measure).

   
² $\boldsymbol{\Omega} = \boldsymbol{W}\boldsymbol{\Delta}$, where $\boldsymbol{W} = \text{diag}\big((V(\mu_i)\,g'(\mu_i)^2)^{-1}\big)$ and $\boldsymbol{\Delta} = \text{diag}\big(g'(\mu_i)\big)$, so that we recognize Fisher information—corresponding to the Hessian matrix (up to a negative sign)—$\boldsymbol{X}^\top\boldsymbol{W}\boldsymbol{X}$.

Table 4.5 Balance property on the FrenchMotor dataset, where y is the occurrence of a claim within the year, with $\frac{1}{n}\sum_{i=1}^{n}\widehat{m}(\boldsymbol{x}_i, s_i)$ on top, and $\frac{1}{n}\sum_{i=1}^{n}\widehat{m}(\boldsymbol{x}_i)$ below (in %). Recall that on the training dataset $\frac{1}{n}\sum_{i=1}^{n} y_i = 8.73\%$

            Training data                              Validation data
            ȳ       GLM     CART    GAM     RF         ȳ       GLM     CART    GAM     RF
m(x, s)     8.73    8.73    8.73    8.73    8.27       8.55    9.05    9.03    8.84    8.70
m(x)        8.73    8.73    8.73    8.73    8.29       8.55    9.05    9.03    8.84    8.73

Proposition 4.1 In the GLM framework with the canonical link function, $\widehat{m}(\boldsymbol{x}_i) = g_\star^{-1}(\boldsymbol{x}_i^\top\widehat{\boldsymbol{\beta}})$ is balanced, or globally unbiased, but possibly locally biased.
This property can be visualized in Table 4.5.
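The balance property is also easy to check numerically; the sketch below assumes a logistic regression glm_fit estimated with glm(y ~ ., family = binomial) on a training data frame train (hypothetical names).

mean(train$y)                              # observed frequency on the training data
mean(predict(glm_fit, type = "response"))  # average prediction
# With the canonical (logit) link and an intercept, the two quantities coincide.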
If a model is not well-calibrated, several techniques have been considered in the literature, such as a logistic correction, Platt scaling, and the "isotonic regression," as discussed in Niculescu-Mizil and Caruana (2005b), in the context of boosting. Friedman et al. (2000) showed that adaboost builds an additive logistic regression model, and that the optimal m satisfies
$$m^\star(\boldsymbol{x}) = \frac{1}{2}\log\frac{\mathbb{P}[Y = 1 \mid \boldsymbol{X} = \boldsymbol{x}]}{\mathbb{P}[Y = 0 \mid \boldsymbol{X} = \boldsymbol{x}]} = \frac{1}{2}\log\frac{\mu(\boldsymbol{x})}{1 - \mu(\boldsymbol{x})},$$
which suggests applying a "logistic correction" in order to get back the conditional probability. Platt et al. (1999) suggested the use of a sigmoid function (coined "Platt scaling"),
$$m'(\boldsymbol{x}) = \frac{1}{1 + \exp\big[a\,m(\boldsymbol{x}) + b\big]},$$
where a and b are obtained using maximum likelihood techniques. Finally, the isotonic (or monotonic) regression, introduced in Robertson et al. (1988) and Zadrozny and Elkan (2001), considers a nonparametric increasing transformation from the $m(\boldsymbol{x}_i)$'s to the $y_i$'s (which can be performed with either the Iso or the rfUtilities R package, and its probability.calibration function).
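Both recalibration steps can be sketched with a few lines of base R; score_train and y_train are hypothetical names for raw scores and 0/1 outcomes on a calibration sample, and the Platt step is written here as a plain logistic regression of the outcome on the score.

# Platt scaling: fit a sigmoid on the raw scores
platt <- glm(y_train ~ score_train, family = binomial)
recalibrate_platt <- function(s) {
  predict(platt, newdata = data.frame(score_train = s), type = "response")
}

# Isotonic regression: nonparametric increasing mapping from scores to outcomes
o   <- order(score_train)
iso <- isoreg(score_train[o], y_train[o])
recalibrate_iso <- approxfun(iso$x, iso$yf, rule = 2)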
In Fig. 4.32, we can visualize smooth calibration curves (estimated using
local regression, with the locfit package) on three models estimated from
the toydata1 dataset. At the top, crude calibration curves, fitted on the $\{(\widehat{y}_i, y_i)\}$'s, and
at the bottom, calibration curves after some “isotonic correction” (obtained with the
probability.calibration function from the rfUtilities R package).

Fig. 4.32 Calibration plots $\widehat{y} \mapsto \mathbb{E}(Y \mid m(\boldsymbol{X}) = \widehat{y})$ (based on smooth local regression) on three models estimated on the toydata1 dataset, at the top (generalized linear model [GLM], gradient boosting model [GBM], and random forest [RF], from the left to the right), and the calibration after the "isotonic correction" at the bottom
Part II
Data

Insurance companies are in the business of discrimination. Insurers attempt to segregate


insureds into separate risk pools based on the differences in their risk profiles, first, so that
different premiums can be charged to the different groups based on their differing risks and,
second, to incentivize risk reduction by insureds. This is why we let insurers discriminate.
There are limits, however, to the types of discrimination that are permissible for insurers.
But what exactly are those limits and how are they justified?, Avraham et al. (2014)
–Tu la troubles, reprit cette bête cruelle,
Et je sais que de moi tu médis l’an passé.
–Comment l’aurais-je fait si je n’étais pas né ?
Reprit l’Agneau, je tette encor ma mère.
–Si ce n’est toi, c’est donc ton frère.
–Je n’en ai point.
–C’est donc quelqu’un des tiens
de La Fontaine (1668), Le Loup et l’Agneau.
–You roil it, said the wolf; and, more, I know
You cursed and slander’d me a year ago.
–O no! how could I such a thing have done!
A lamb that has not seen a year,
A suckling of its mother dear?
–Your brother then.
–But brother I have none.
–Well, well, what’s all the same,
’Twas some one of your name
de La Fontaine (1668), The Wolf and the Lamb.1

1 (.Αισωπος) Aesop’s original fables did not use this motivation as casus belli, even if the translation

by Jacobs (1894) is “if it was not you, it was your father, and that is all one”.
Chapter 5
What Data?

Abstract Actuaries now collect all kinds of information about policyholders,


which can not only be used to refine a premium calculation but also to carry out
prevention operations. We return here to the choice of relevant variables in pricing,
with emphasis on actuarial, operational, legal and ethical motivations. In particular,
we discuss the idea of capturing information on the behavior of an insured person,
and the difficult reconciliation with the strong constraints not only of privacy but
also of fairness.

In Part I, we used the symbolic notation .x = (x1 , . . . , xk ) to formalize “predictive


variables” or “pricing variables,” corresponding to variables that could be used to
compute the premium. In this chapter, we briefly discuss which variables could be
used. As explained in Flanagan (1985), although a wide range of diverse criteria
can be utilized for premium differentiation, insurers have traditionally relied upon
the personal characteristics of an insured person in motor insurance, including
factors such as age, gender, and marital status. These personal characteristics are
used as grounds for assessing risk because they are convenient: they are accessible,
cheap, verifiable, stable, and most of them are “reliably correlated with many
aspects of behavior important in insurance.” For example, motor vehicle accident
rates are statistically higher for very young drivers and for elderly drivers. This is
because young drivers are more likely to have less driving experience and have a
higher propensity to engage in high-risk driving conduct. On the other hand, elderly drivers are likely to suffer from sensory or cognitive impairments, which negatively impact their driving abilities, as explained in Kelly and Nielson (2006) or Brown et al. (2007). In this chapter, we discuss data, personal data, sensitive data,
behavioral data, data related to past historical data, protected data, etc. But before
starting, it is important to keep in mind that “all data is man-made,” as Christensen
et al. (2016) wrote. “Somebody, at some point, decided what data to collect, how to
organize it, how to present it, and how to infer meaning from it—and it embeds all
kinds of false rigor into the process. Data have the same agenda as the person who
created them, wittingly or unwittingly. For all the time that senior leaders spend
analyzing data, they should be making equal investments in determining what data


should be created in the first place (. . . ) “Data has an annoying way of conforming
itself to support whatever point of view we want it to support.”

5.1 Data (a Brief Introduction)

“All data is credit data,” said Merrill (2012) at a conference. And if credit
institutions collect a lot of “information,” so do insurance companies, to assess
and prevent risk, target ideal customers, accurately price policies, provide quotes,
conduct investigations, follow trends, create new products, etc. Such information is
now called “data” (from the Latin datum, data being the plural, past participle of
dare “to give,” used in the XVII-th century to designate a fact given as the basis for
calculation, in mathematical problems1 )
Definition 5.1 (Data Wikipedia 2023) In common usage, data are a collection
of discrete or continuous values that convey information, describing the quantity,
quality, fact, statistics, other basic units of meaning, or simply sequences of symbols
that may be further interpreted.
A few years ago, the term “statistics” was also popular, as introduced in
Achenwall (1749). It was based on the Latin "statisticum (collegium)," meaning "(lecture course on) state affairs," and the Italian statista, "one skilled in statecraft"; the German term "Statistik" designated the analysis of data about the state,
signifying the “science of state” (corresponding to “political arithmetic” in English).
So in a sense, “statistics” was used to designate official data, collected for instance
by the Census, with a strict protocol, as explained in Bouk (2022), whereas the term
“data” would correspond to any kind of information that could be scraped, stored,
and used.
The “big data” hype has given us the opportunity to talk about not only its large
volume and value but also its variety (and all sorts of words beginning with the letter
"v"). Although, for actuaries, data have often been "tabular data," corresponding to numerical matrices as seen in Part I, in the last few years the variety of data types has
become more apparent. There will naturally be text, starting with a name, an address
(which can be converted into spatial coordinates), but also possibly drug names in
prescriptions, telephone conversations with an underwriter or claims manager, or in
the case of companies, contracts with digitized clauses, etc. We can have images,
such as a picture of the car after a fender-bender or of the roof of a house after a
fire, medical images (X-rays, MRI), a satellite image of a field for a crop insurance
contract, or of a village after a flood, etc. Finally, there will also be information
associated with connected objects, data obtained from devices in a car fleet, from

1 Euclid’s treatise on plane geometry was named δεδομένα, translated as “data.”



a water leak detector or from chimney monitoring and control devices. However, statistical summaries or "scores" are often computed from these raw data (which are frequently not available to the insurer), such as the number of kilometers driven in a given week by a car insurance policyholder, or an acceleration score. These data, which
are much more extensive than tabular variables with predefined fields (admittedly,
sometimes with definition issues, as Desrosières 1998 points out), can provide
sensitive information that can be exploited by an opaque algorithm, possibly without
the knowledge of the actuary.
In non-commercial insurance, the policyholder is an individual, a person (even in property insurance), and some part of the information collected will be considered as "personal data," while some of it is sometimes considered "sensitive" or "protected." In Europe, "personal data" is any information relating
to a natural person who is identified or can be identified, directly or indirectly. The
definition of personal data is specified in Article 4 of the General Data Protection
Regulation (GDPR). This information can be an identifier (a name, an identification
number, location data, for example) or one or more specific elements specific to the
physical, physiological, genetic, mental, economic, cultural or social identity of the
person. Among the (non-exhaustive) list given by the French CNIL,2 there may be
the surname, first name, telephone number, license plate, social security number,
postal address, e-mail, a voice recording, a photograph, etc. Such information is
relevant, and important, in the insurance business.
“Sensitive data” constitute a subset of personal data that include religious
beliefs, sexual orientation, union involvement, ethnicity, medical status, criminal
convictions and offences, biometric data, genetic information, or sexual activities.
According to the GDPR, in 2016,3 “processing of personal data revealing racial
or ethnic origin, political opinions, religious or philosophical beliefs or trade
union membership, as well as processing of genetic data, biometric data for the
purpose of uniquely identifying a natural person, data concerning health or data
concerning the sex life or sexual orientation of a natural person are prohibited.”
Such information is considered "sensitive." In Europe, the 2018 "Convention 108"⁴
(or “convention for the protection of individuals with regard to automatic processing
of personal data”) further clarifies the contours.

2 CNIL is the Commission Nationale de l’Informatique et des Libertés (National Commission on

Informatics and Liberty), an independent French administrative regulatory body whose mission is
to ensure that data privacy law is applied to the collection, storage, and use of personal data, in
France.
3 See https://gdpr-info.eu/.
4 See https://www.coe.int/en/web/data-protection/convention108-and-protocol.

5.2 Personal and Sensitive Data

5.2.1 Personal and Nonpersonal Data

Classically, we can distinguish between different types of insurance contracts. Life


insurance is a contract (between a policyholder and an insurance company) where
the insurer promises to pay a designated beneficiary a sum of money upon the death
of a person (or possibly some critical illness, including disability insurance). Health
insurance is a type of insurance that covers the whole (or a part of the risk) of a
person incurring medical expenses. Property insurance provides protection against
most risks to property, such as fire, theft, and some weather damage. Casualty
insurance is the term that broadly encompasses insurance not directly concerned
with life insurance, health insurance, or property insurance. There is clearly a
difference between the first cases (life insurance or health insurance) where the
object of the contract is a person, and the last cases (property and casualty, or
general insurance) where the object of the contract is a material good (such as a house or a car—in the latter case, including damage caused by its use, to third parties and to people). In the first two cases, actuaries
consider the information provided by the person in question to be important, if not
essential, for valuing the contract. In the other two cases, information about the car
is important, but information about the driver is almost as important. For household
insurance, personal data can be used to identify previous insurance coverage, to
assess fraudulent activities (false claim), etc.
Obviously, a lot of personal information can be used to assess the probability of
dying, of becoming disabled, or of facing medical expenses, including biological
factors (attained age, and health information), genetic factors (sex, individual geno-
type, which will be discussed in Sect. 6.4 in the next chapter), lifestyle (smoking,
drinking habits, etc.), social factors (marital status, social class, occupation, etc.),
geographical factors (region, postal code).

5.2.2 Sensitive and Protected Data

Based on Avraham et al. (2013), Fig. 5.1 provides, for the USA, a general perspective on the variables that are prohibited across all states, with the "highest average level of strictness," for different types of insurance. While there is a strong consensus about religion and "race," it is more complicated for other criteria.

[Figure 5.1 displays a grid of rating factors (race or origin, religion, gender, sexual orientation, age, credit score, zip code, genetics) against lines of business (auto, P&C, disability, health, life).]
Fig. 5.1 (U.S.) State Insurance Antidiscrimination Laws. A factor is either considered "permitted," or there is no specific regulation (green filled circle), usually because the factor is not relevant to the risk; "prohibited" (red filled cross); or there can be variation across states (open circle). ∗ means limited regulation; + is specifically permitted because of adverse selection (source: Avraham et al. 2013)

[Figure 5.2 displays a grid of rating factors (gender, age, driving experience, credit history, education, profession, employment, family, housing, address/ZIP code) against states and provinces (CA, HI, GA, NC, NY, MA, PA, FL, TX, AL, ON, NB, NL, QC).]
Fig. 5.2 A factor is considered "permitted" (green filled circle) when there are no laws or regulatory policies in the state or province that prohibit insurers from using that factor. Otherwise, it is "prohibited" (red filled cross). In North Carolina, age is only allowed when giving a discount to drivers 55 years of age and older. In Pennsylvania, credit score can be used for new business and to reduce rates at renewal, but not to increase rates at renewal. In Alberta, credit score and driver's license seniority cannot be used for mandatory coverage (but can be used for optional coverage). In Labrador, age cannot be used before 55, and beyond that, it must be a discount (as in North Carolina) (source: in the United States, The Zebra 2022, and in Canada, Insurance Bureau of Canada 2021)

Figure 5.2 presents some variables traditionally used in car insurance, in the
United States5 and in Canada.6 Unlike most European countries, which have a civil
law culture (where the main source of law is found in legal codes), the Canadian
provinces and the states of the United States of America have a common law system
(where rules are primarily made by the courts as individual decisions are made).
Québec uses a mixed law. Most states and provinces have documents listing the
Prohibited Rating Variables, such as Alberta’s (Automobile Insurance Rate Board,

5 CA: California, HI: Hawaii, GA: Georgia, NC: North Carolina, NY: New York, MA: Mas-

sachusetts, PA: Pennsylvania, FL: Florida, TX: Texas.


6 AL: Alberta, ON: Ontario, NB: New Brunswick, NL: Newfoundland and Labrador, QC: Québec.

2022). This heterogeneity makes it possible to highlight the cultural character of


what is considered as “discriminatory.”
In Québec (Canada), as stated in Section 20.1 of the Charter of Human Rights
and Freedoms, “a distinction based on age, sex or marital status is permitted
when it is based on a factor that allows a risk to be determined. For example, an
insurance company may ask you questions about your age and sex to determine your
premium.” In California, as noted by Butler and Butler (1989), many of the rating
variables cannot be used by insurers if they are not causally related to the risk of
accidents and their cost (we return to this “causal” issue in Chap. 7).
In Europe, Article 5-2 of Directive 2004/113 based on the Charter of Funda-
mental Rights provided that “Member States may decide (...) to allow proportional
differences in premiums and benefits for individuals where the use of sex is a
determining factor in the assessment of risk, on the basis of relevant and accurate
actuarial and statistical data,” as recalled by Laulom (2012).
In Box 5.1, we have a definition of "sensitive data," according to the European directive on the protection of personal data.

Box 5.1 Sensitive data, European Commission (1995)


Sensitive data is data revealing racial or ethnic origin, political opinions,
religious or philosophical beliefs or trade union membership, data concerning
health or sex life. The prohibition on the processing of sensitive data does not
apply if:
(a) The data subject has given his explicit consent to the processing of those
data, except where the laws of the Member State provide that the prohibition
referred to in paragraph 1 may not be lifted by the data subject’s giving his
consent; or
(b) Processing is necessary for the purposes of carrying out the obligations
and specific rights of the controller in the field of employment law in so far as
it is authorized by national law providing for adequate safeguards; or
(c) processing is necessary to protect the vital interests of the data subject
or of another person where the data subject is physically or legally incapable
of giving his consent; or
(d) processing is carried out in the course of its legitimate activities with
appropriate guarantees by a foundation, association or any other nonprofit-
seeking body with a political, philosophical, religious or trade-union aim, and
on the condition that the processing relates solely to the members of the body
or to persons who have regular contact with it in connection with its purposes
and that the data are not disclosed to a third party without the consent of the
data subjects; or
(e) The processing relates to data that are manifestly made public by the
data subject or are necessary for the establishment, exercise, or defense of
legal claims.

A variable that has been intensively discussed is "gender," from the Council
Directive 2004/113/EC (see Sect. 6.2). In France, Article L. 111-7 of the Insurance
Code states that “the Minister in charge of the economy may authorize, by decree,
differences in premiums and benefits based on the taking into account of sex and
proportionate to the risks when relevant and precise actuarial and statistical data
establish that sex is a determining factor in the evaluation of the insurance risk.”
Here, "determining factor" echoes the "causal" effect required in California. In Box 5.2, Rodolphe Bigot describes the legal aspects of discrimination in the context of insurance in France, covering explicitly age, family status and sexual orientation, pregnancy, maternity and "social insecurity," and finally sex.

Box 5.2 Discrimination & Insurance in France, by Rodolphe Bigot


To fight against discrimination, there are general prohibitions, formulated in
the Penal Code and the Civil Code, and special prohibitions in insurance
law, inserted into the Insurance Code. Both are subject to variable geometry
adjustments linked to the specificities of the insurance business. In civil
matters, a general prohibition applicable to insurance is set forth in Article
16-13 of the Civil Code, which provides that “no one may be discriminated
against on the basis of his or her genetic characteristics.” In criminal matters,
the refusal or subordination of the provision of a service or good based on
one of the criteria—discriminatory—listed in article 225-1 or 225-1-1 of the
French Criminal Code constitutes reprehensible discrimination (C. pén., art.
225-1 et seq.). Since 2008, all direct or indirect discrimination based on such
criteria has been covered (L. 27 May 2008; modified by L. 18 Nov. 2016 de
modernisation de la justice du XXIe siècle), with a general derogation in the
presence of differences in treatment “justified by a legitimate aim” and if “the
means of achieving this aim are necessary and appropriate.” A reversal of
the burden of proof is provided for in actions brought before the civil courts
(L. 27 May 2008, art. 4), unlike the rules of evidence applicable before the
criminal courts, where it is up to the plaintiff to prove the discrimination.
First, a distinction based on age in particular (C. pén., art. 225-1 and 225-
2) constitutes criminally sanctioned discrimination if it tends to apply a
difference in treatment for access to insurance coverage or for the termination
of insurance benefits: it is therefore prohibited in the case of risks of loss of
employment for the purpose of securing a bank loan, for example. However,
an exemption is provided for the pricing of life insurance contracts (death
insurance and life annuities), by applying mortality tables (C. assur., art.
A 132-18). The Defender of Rights admitted this discrimination for a 74-
year-old health insurance applicant whose enrollment was refused because
the policy limited it to 70 years of age (opinion no. MLD / 2012-150,

16 November 2012: “when objectively justified by actuarial and statistical
elements, age limits in access to a personal insurance contract do not
constitute discrimination” (The random nature of the insurance contract and
the principles of risk selection and pooling may justify taking age into account
in personal insurance. Article 2 of the law of 27 May 2008 provides for
two other exceptions, which should be confirmed by the proposed directive
on equal treatment between persons (COM 2008 / 426 of July 2, 2008): 1)
differences in treatment justified by a legitimate aim, if the means of achieving
this aim is necessary and appropriate; 2) when the differences in treatment are
provided for and authorized by the laws and regulations in force.)
Second, discrimination based on family status or sexual orientation is
also prohibited (C. pén., art. 225-1 and 225-2). In the case of a same-sex
couple, this would include an employer’s refusal to pay a death benefit to an
employee’s partner in a civil union or to the employee in a civil union in the
event of his/her partner’s death (...)

(...) Third, discrimination based on a woman's pregnancy and maternity is prohibited. In insurance, women cannot be treated less favorably in terms
of premiums and benefits as a result of expenses related to pregnancy and
maternity (C. assur., art. L. 111-7, I, paragraph 2), because, in principle, both
sexes are affected by maternity. In other words, insurance contracts, whatever
the risk covered (health, provident fund, etc.), must no longer include differ-
entiated treatment of these factors. Clauses instituting a longer waiting period
for the coverage of hospitalization costs when the latter is consecutive to a
pregnancy are therefore no longer legal. However, if there is relevant actuarial
and statistical data, i.e., objective considerations, “contractual provisions that
are more favorable to women may remain.” Failure to comply with these
provisions exposes the employer to the penalties provided for in the Criminal
Code for acts of discrimination (Lamy Assurances, 2021, n° 3806).
Fourth, the refusal to provide a good or service because of a person’s place
of residence constitutes discrimination in the penal sense (C. pén., art. 225-
1). Although it is still possible to adjust the amount of the premium according
to the place of the insured person’s residence, the insurer is prohibited from
refusing an applicant for insurance because of this factor. This does not apply
to subscriptions or memberships made by residents of States a French insurer
is not authorized to contract with (Lamy Assurances, 2021, No. 3808).
Fifth, refusal to insure on the basis of the insured person’s social
insecurity is prohibited, which falls within the scope of the offence of

discrimination for refusing to supply a good or service on the basis of "particular vulnerability resulting from the economic situation, apparent or
known to its author” (C. pén., art. 225-1, amended by L. 24 June 2016,
aimed at combating discrimination on the grounds of social insecurity). The
insurer cannot reject a membership or subscription on this basis, but it can
nevertheless take this factor into account to modulate the amount of the
premium. Note that since 26 June 2016, indirect discrimination can result
from a provision, criterion or practice that is neutral on the surface, but
likely to result in a particular disadvantage for individuals. Moreover, there
is a complex system of protection for vulnerable persons in insurance: “the
movement for vulnerable persons is a dance that alternates between special
protection and undifferentiated protection” (Noguéro 2010, p. 633). (...)

(...) Discrimination on the basis of sex is subject to a renewed regime dedicated to new contracts concluded as of 21 December 2012 (except for
retirement, health and accident contracts subscribed by an employer and
to which the employee is a compulsory member; C. séc. soc. L. 911-1)
and aimed at uniformly applying the unisex rule—prohibiting any direct
or indirect discrimination based on sex—to insurance contracts within the
European Union (Parléani 2012, p. 563). Article A. 111-6 of the Insurance
Code incorporated the European Commission’s guidelines (Arr. 18 Dec. 2012,
NOR: EFIT1238658A, relating to equality between men and women in insur-
ance, JO 20 December, mod. by Arr. 3 February 2014, NOR: EFIT1400411A,
JO 11 February).
The calculation of premiums and benefits falls within the scope of the
unisex rule (in insurance operations classified, by reference to Article R.
321-1, in the branches: 1 “Accidents (including occupational accidents and
diseases)" (Art. A. 111-2), 2 "Sickness" (Art. A. 111-3), 3 "Land vehicle
bodies (other than railways)” (Art. A. 111-4), 10 “Civil liability for motor
vehicles” (Art. A. 111-4), 20 “Life and death” (Art. A. 111-5), 22 “Insurance
linked to investment funds” (Art. A. 111-5), 23 “Tontine operations” (Art.
A. 111-5), 26 “Any operation of a collective nature defined in Section I of
Chapter I of Title IV of Book IV” (Art. A. 111-5)). By way of derogation,
the criterion of sex may be used “as a factor in the assessment of risks in
general to collect, store and use information on sex or related to sex for
internal provisioning and pricing, marketing and advertising, and pricing of
reinsurance.” “The use of sex as indirect discrimination is also allowed for
the pricing of certain real risks, such as, for example, a differentiation of


premiums based on the size of a car’s engine, even though the most powerful
cars are in fact bought more by men" (Lamy Assurances, 2021, n° 3803).
To bring French law into compliance with European rules, Article L. 111-
7 of the Insurance Code was rewritten with the law of 26 July 2013. A
paragraph II bis was added: “The derogation provided in the last paragraph
of I is applicable to contracts and memberships in group insurance contracts
concluded or made no later than December 20, 2012 and to such contracts
and memberships tacitly renewed after that date. The derogation does not
apply to contracts and memberships mentioned in the first paragraph of
this IIa which have been substantially modified after December 20, 2012,
requiring the parties’ agreement, other than a modification that at least one
of the parties cannot refuse.” In terms of collective supplementary employee
benefits, no discrimination on the basis of gender can be made. However,
insurers still have the possibility of offering options in policies or insurance
products according to gender in order to cover conditions that exclusively or
mainly concern men or women. Differentiated coverage is therefore possible
for breast cancer, uterine cancer, or prostate cancer.

As Debet (2007) said, "in order to fight against discrimination, it is still necessary to be able to identify it; in order to identify it, it seems natural to
proceed with the statistical observation of differences, of diversity.” And as we have
seen in Box 5.1, in European regulation “sensitive data is data revealing racial or
ethnic origin, political opinions, etc.," meaning that there could be discrimination
if sensitive inferences can be made about individuals, as discussed in Wachter and
Mittelstadt (2019).

5.2.3 Sensitive Inferences

Sometimes, data are “derived” (e.g., country of residence derived from the subject’s
postcode) or “inferred” data (e.g., credit score, outcome of a health assessment,
results of a personalization or recommendation process) and not “provided” by the
data subject actively or passively, but rather created by a data controller or third
party from data provided by the data subject and, in some cases, other background
data, as explained in Wachter and Mittelstadt (2019). According to Abrams (2014),
inferences can be considered personal data as they derive from personal data.
Another precaution that should be kept in mind relates to the distinction between
what "reveals" and what "is likely to reveal," as Debet (2007) states (see also Van Deemter 2010, "in praise of vagueness"). Some information is self-reported, and some is inferred data. For instance, it could be possible to ask for "sex at birth" to collect a sex variable, but in most cases the variable is based on the title (where

“Mrs” or “Mr” are proposed), so the information is more a gender variable. But
it can be more complex, as some models can be used to infer some information.
One can imagine that “being pregnant” could be a sensitive piece of information in
many situations. This information exists in some health organization databases (or
health care reimbursements). But as shown by Duhigg (2019) (in how companies
learn your secrets), there are organizations that try to infer this information from
purchases. This is the famous story of the man, in the Minneapolis area, who was
surprised that coupons for various products for young mothers were addressed to
his daughter. In this story, the inference by the model had been correct. We can
imagine, for marketing reasons, that some insurers might be interested in knowing
such information.
Recently, Lovejoy (2021) recalls that in June 2021, LinkedIn had a massive
breach (exposing the data of 700 million users), with a database of records including
phone numbers, physical addresses, geolocation data and... “inferred salaries.”
Again, knowing the salary of policyholders could be seen as interesting for some
insurers (any micro-economic model used to study insurance demand is based on
the “wealth” of the policyholder, as in Sect. 2.6.1). Much more information can
be inferred from telematics data, albeit with varying degrees of uncertainty. As
mentioned by Bigot et al. (2019), by observing that a person parks almost every
Friday morning near a mosque, one could say that there is a high probability that he
or she is Muslim (based on surveys of Muslim practices). But it is possible that this
inference is completely wrong, and that this person actually goes to the gym, across
the street from the mosque, and moreover, is a regular attendee. Facebook may be
able to infer protected attributes such as sexual orientation, race (Speicher et al.
2018), as well as political views (Tufekci 2018) and impending suicide attempts
(Constine 2017), whereas third parties have used Facebook data to decide eligibility
for loans (Taylor and Sadowski 2015) and infer political positions on abortion
(Coutts 2016). Susceptibility to depression can also be inferred from Facebook and
Twitter usage data. Microsoft can also predict Parkinson’s (Allerhand et al. 2018)
and Alzheimer’s (White et al. 2018) disease from search engine interactions. Other
recent invasive applications include predicting pregnancy by Target, and assessing
user satisfaction from mouse tracking (Chen et al. 2017). Even if such inferences are, most of the time, impossible to understand (as they are produced by opaque models) or to refute, they can deeply affect our private lives, reputation, and identity, as discussed in Wachter and Mittelstadt (2019). Inferences are therefore closely related to the data available, and to the notion of "attacks," as coined in the privacy literature.
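To make the notion of a sensitive inference concrete, here is a purely illustrative sketch on synthetic data (the purchase indicators, their probabilities, and the use of a logistic regression are assumptions made for the example, not taken from the studies cited above): a simple model recovers a hidden attribute that was never asked for directly.

```python
# Illustrative sketch (synthetic data): inferring a hidden sensitive attribute
# from seemingly innocuous purchase indicators, in the spirit of the examples above.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 10_000

# Hidden sensitive attribute (e.g., "expecting a child"), never asked directly
s = rng.binomial(1, 0.1, size=n)

# Three innocuous-looking purchase indicators, correlated with s
# (hypothetical items: unscented lotion, vitamin supplements, large tote bags)
p_buy = np.where(s[:, None] == 1, [0.70, 0.60, 0.50], [0.20, 0.25, 0.30])
X = rng.binomial(1, p_buy)

clf = LogisticRegression().fit(X, s)
print("base rate:", s.mean())
print("inferred probability for someone buying all three items:",
      round(clf.predict_proba([[1, 1, 1]])[0, 1], 2))
```

In this simulated setting, the predicted probability for a customer buying all three items is several times the 10% base rate, which is all a marketer, or an insurer, needs in order to act on the inference.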

5.2.4 Privacy

As explained by Kelly (2021), “often the location data is used to determine what
stores people visit. Things like sexual orientation are used to determine what
demographics to target,” (in a marketing context). Each type of data can reveal
something about our interests and preferences, our opinions, our hobbies, and
190 5 What Data?

our social interactions. For example, an MIT study7 demonstrated how email
metadata can be used to map our lives, showing the changing dynamics of our
professional and personal networks. These data can be used to infer personal
information, including a person’s background, religion or beliefs, political views,
sexual orientation and gender identity, social relationships, or health. For example,
it is possible to infer our specific health conditions simply by connecting the dots
between a series of phone calls. For Mayer et al. (2016), the law currently treats
call content and metadata separately and makes it easier for government agencies
to obtain metadata, in part because it assumes that it should not be possible to infer
specific sensitive details about individuals from metadata alone. Chakraborty et al.
(2013) reminds us that current approaches to privacy protection, typically defined in
multi-user contexts, rely on anonymization to prevent such sensitive behavior from
being traced back to the user—a strategy that does not apply if the user’s identity is
already known. In time, a tracking system may be accurate enough to place a person
in the vicinity of a bank, bar, mosque, clinic or other privacy-sensitive location. In
2015, as told in Miracle (2016), Noah Deneau wondered if it would be possible to
identify devout Muslim drivers in New York City by looking at anonymized data
and inactive drivers during the five times of the day when they are supposed to
pray. He quickly searched for drivers who were not very active during the 30- to
45-min Muslim prayer period and was able to find four examples of drivers who
might fit this pattern. This brings to mind Gambs et al. (2010), who conducted
an investigation on a dataset containing mobility data of taxi drivers in the San
Francisco Bay Area. By finding places where the taxi’s GPS sensor was turned off
for a long period of time (e.g., 2 h), they were able to infer the interests of the
drivers. For 20 of the 90 analyzed users, they were able to locate a plausible home
in a small neighborhood. They even confirmed these results for 10 users by using
a satellite view of the area: It showed the presence of a yellow taxi parked in front
of the driver's supposed home. Dalenius (1977) introduced an interesting concept of privacy: nothing about an individual should be learnable from a dataset that could not be learned without access to the dataset. We will return to this idea when we
define the fairness criteria, and when we require that the protected variable s cannot
be predicted from the data, and from the predictions.
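As a purely illustrative sketch of this kind of location-based inference (a synthetic trace, hypothetical coordinates, and an arbitrary two-hour threshold), one can simply look at the points where the sensor stays silent for a long time, and keep the most frequent stopping cell as a plausible "home":

```python
# Illustrative sketch (synthetic trace): inferring a plausible "home" from the
# places where a GPS sensor stays silent for more than two hours.
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
home = (45.50, -73.57)                       # hypothetical coordinates
times, coords = [], []
t = 0.0
for day in range(30):
    for _ in range(60):                      # daytime pings, roughly every 10 minutes
        t += 1 / 6
        times.append(t)
        coords.append((home[0] + rng.normal(0, 0.05),
                       home[1] + rng.normal(0, 0.05)))
    times.append(t)                          # last ping of the day: parked at home
    coords.append((home[0] + rng.normal(0, 0.001),
                   home[1] + rng.normal(0, 0.001)))
    t += 10                                  # sensor off overnight

gaps = np.diff(np.array(times))
stops = [coords[i] for i in np.where(gaps > 2)[0]]   # last ping before each 2h+ gap
cells = Counter((round(lat, 2), round(lon, 2)) for lat, lon in stops)
print("most frequent long-stop cell (about 1 km):", cells.most_common(1)[0])
```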

5.2.5 Right to be Forgotten

The “right to be forgotten” is the right to have private information about a person be
removed from various directories, as discussed in Rosen (2011), Mantelero (2013),
or Jones (2016). It is also named “ right to oblivion” in de Andrade (2012) and “right
to erasure” in Ausloos (2020).

7 Project https://immersion.media.mit.edu/.
5.2 Personal and Sensitive Data 191

As explicitly mentioned in Mbungo (2014), "individuals with a criminal record constitute a minority that cannot escape discrimination." Criminal records can have
a wide range of collateral consequences that often extend well beyond any sentence
imposed by the courts, and in many countries, the mark of a criminal record follows
a person long after he or she has served a prison sentence or paid a fine (Pager 2003, 2008). As explained in Henley (2014), a criminal record thus becomes a
legitimate reason to refuse cover, or could lead to a massive increase in insurance
premiums. According to Bath and Edgar (2010), more than four in five ex-prisoners
said that their previous convictions made it harder for them to get insurance and
that, even when they were successful, they were charged far more. In Québec, it
is not considered discriminatory to refuse to insure a person who has a criminal
record, within the meaning of the Québec Charter of Human Rights and Freedoms,
simply because criminal history is not included in the list of prohibited grounds
for discrimination, as it is, for example, for ethnic origin or religion. On the other
hand, if you are refused insurance because you are living with a spouse or parent
who has a criminal record, this could constitute discrimination based on your civil
status (and civil status is on the list of prohibited grounds of discrimination in
the Québec Charter, including marriage, civil union, and de facto union, but also
filiation). There is nothing in Québec law that prohibits insurers from asking you
about your criminal record and making their decision based on this factor. In fact, as
mentioned in Sanche and Roberge (2023), the province’s Civil Code even requires
you to declare—in good faith—factors that could influence the insurer’s assessment
of the risk you represent to it. If you fail to disclose such information, it is possible
that a court could rule in favor of your insurer if it eventually refuses to compensate
you following a loss, or if it revokes your policy. In Québec, some companies refuse
to insure a person with a criminal record even if the offence is not related to the
insurance applied for. Other companies will offer insurance, but the premium will
be significantly higher (double or quadruple). This premium surcharge applies to all
persons living in the same household as a person who has been prosecuted.
Medical historical data can also be problematic. “The financial burden of cancer
can extend decades after diagnosis,” wrote Dumas et al. (2017). Some countries
in Europe (France, Belgium, Luxembourg, the Netherlands, Portugal, or Romania)
adopted national legislative initiatives to recognize a “right to be forgotten” for
cancer survivors. On 26 January 2016, a law (2016-41, commonly referred to as
“the right to be forgotten”) was adopted in France. According to this law, survivors
do not have to disclose their past history of cancer to the insurer after a fixed number
of years post-treatment: 5 years for patients under the age of 18 at the time of diagnosis, and 10 years for patients aged 18 or above (in 2022, it became 5 years
for adults too). Before 2016 and the adoption of this law, insurers could impose
higher premiums or could refuse to insure survivors because of their past history of
cancer, even when they had no health problem at the time they purchased insurance.

5.3 Internal and External Data

5.3.1 Internal Data

The process of collecting “internal data” typically begins with a “form,” derived
from the Latin word forma meaning form, contour, figure, or shape. In the fourteenth century, the term started to refer to a legal agreement.8 Forms are used by insurers
in the underwriting process or to handle claims. “Think of a form to be filled in,
on paper or a screen, intended to gather information that can later be quantified,”
wrote Bouk (2022). Almost paraphrasing Christensen et al. (2016), he adds that
“someone, somewhere, designed that form, deciding on the set of questions to be
asked, or the spaces to be left blank. Maybe they also listed some possible answers
or limited the acceptable responses (...) The final resulting form and all that is
written upon it as well as all negotiations that shaped it, whether backstage or
offscreen, so to speak—all of this is data too. The data behind the numbers. To
find stories in the data, we must widen our lens to take in not only the numbers but
also the processes that generated those numbers.”
Traditionally, with some simple perspective, insurance companies use two
kinds of databases: an underwriting database (one line represents a policy, with
information on the policyholder, the insured property, etc.) and a claims database
(one line corresponds to a claim, with the policy number, and the last view of
the associated expenses), as in Fig. 5.3. These two bases are linked by the policy
number. But it is possible to use other secondary “keys” (as coined in database
management systems), corresponding to a single variable (or set of them) that can
uniquely identify rows. For internal data, classical keys are the policy number (to
connect the underwriting and the claim database), a “client number” (to connect
different policies that could hold the same person), or a claim number (usually to
connect the claims database to financial records). But it is also possible to use some
“keys” to connect to other databases, such as the license plate of the car, the model
of a car (“2018 Honda Civic Touring Sedan”), the address of a house, etc.
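As a minimal sketch of such a link (the column names and the toy records below are hypothetical; actual insurance schemas are far richer), the two databases can be joined on the policy number used as a key:

```python
# Illustrative sketch (hypothetical column names): linking the underwriting
# database and the claims database through the policy number.
import pandas as pd

underwriting = pd.DataFrame({
    "policy_id": ["P001", "P002", "P003"],
    "driver_age": [27, 44, 61],
    "car_model": ["2018 Honda Civic Touring Sedan", "hatchback", "sedan"],
    "exposure": [1.0, 0.5, 1.0],             # in policy-years
})
claims = pd.DataFrame({
    "claim_id": ["C10", "C11", "C12"],
    "policy_id": ["P001", "P001", "P003"],
    "paid": [1200.0, 450.0, 3000.0],         # last view of the associated expenses
})

# One line per policy, with claim count and total cost (zero if no claim)
per_policy = (claims.groupby("policy_id")
                    .agg(n_claims=("claim_id", "count"), total_paid=("paid", "sum"))
                    .reset_index())
portfolio = (underwriting.merge(per_policy, on="policy_id", how="left")
                         .fillna({"n_claims": 0, "total_paid": 0.0}))
print(portfolio)
```

The same mechanism extends to external keys (license plate, address, car model), which is how the enrichment described in the next section operates.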

5.3.2 Connecting Internal and External Data

In recent years, however, companies have increasingly relied on data obtained from
a wide variety of external sources. These data are either on the insured property, with

8 Instead of the Latin formula that could designate a contract. Actually, “formula” nowadays

refers to “mathematical formulas” as seen in Chaps. 3 and 4, or “magic formulas,” the two being
very close for many people (see for example the introduction of O’Neil (2016) explaining how
mathematics “was not only deeply entangled in the world’s problems but also fueling many of
them. The housing crisis, the collapse of major financial institutions, the rise of unemployment—
all had been aided and abetted by mathematicians wielding magic formulas.").

[Fig. 5.3 (diagram): the underwriting database (id, policy, address) is linked to the claims database by the policy number, and to socio-economic data (IRIS) and geographic data (lat, long) by the address.]

Fig. 5.3 The databases of an insurer: an underwriting database (in the center), with one line
per insurance policy. This database is linked to the claims database, which contains all the claims,
and an associated policy number. Other data can be used to enhance the database, for example,
based on the insured person’s home address, with either socio-economic information (aggregated
by neighborhood, wealth, number of crimes, etc.) but also other information, extracted from maps,
satellite images (distance to the nearest fire hydrant, presence of a swimming pool, etc.). In car
insurance, it is possible to find the value of a car from its characteristics, etc.

information about the car model, or about the house, obtained from the address,
as in Fig. 5.3. The address historically allowed (aggregated) information on the
neighborhood to be obtained, with numbers of violations, on past floods, on the
distance to the nearest fire station, etc. We can also use satellite images (via Google
Earth) or information from OpenStreetMap (we will mention those in Sect. 6.8).
And insurers rely on data that are becoming increasingly extensive, with sensors
deployed everywhere, in the car, or cell phones, as Prince and Schwarcz (2019)
recall. This “data boom” raises the question of whether an increasingly detailed
insight into the lives of policyholders can lead to more accurate pricing of risks.
Following Heidorn (2008), Hand (2020) discussed the concept of “dark data,”
noting that all data has a hidden side, which could potentially generate bias. We
formalize in the following sections these notions of bias, some of which can
be visualized in Fig. 5.4, inspired by Suresh and Guttag (2019). The “historical
bias” is the one that exists in the world as it is. This is the bias evoked in Garg
et al. (2018) in contextualization in textual analysis (“word embedding”) where
the vectorization reflects the biases existing between men and women (the word
“nurse” (nongendered) is often associated with words associated with women,
whereas “doctor” is often associated with words associated with men), or toward
minorities. The “sampling bias” is the one mentioned in Ribeiro et al. (2016), where
a classification algorithm is trained, to distinguish dogs from wolves, except that

[Fig. 5.4 (diagrams): panel (a) shows historical bias, sampling bias, and measurement bias arising between the dataset and the training/validation (testing) samples; panel (b) shows cultural bias, learning bias, evaluation bias, and deployment bias arising between the definitions, the training and validation samples, evaluation, and deployment.]

Fig. 5.4 Bias in data generation, and in model building (loosely based on Suresh and Guttag
2019). (a) Data generation. (b) Modeling process

all the images of wolves are taken in the snow, and the algorithm just looks at
the background of the image to assign a label. For “measurement bias,” Dressel
and Farid (2018) refers to reoffending in predictive justice, which is sometimes
measured not as a new conviction but as a second arrest. The “cultural bias” (called
“aggregation bias” in Suresh and Guttag 2019) refers to the following problem:
a particular dataset might represent people or groups with different backgrounds,
cultures, or norms, and a given variable can mean something quite different for
them. Examples include irony in textual analysis, or a cultural reference (which
the algorithm cannot understand). Hooker et al. (2020) observe that compression
amplifies existing algorithmic biases (where compression is similar to tree pruning,
when attempting to simplify models). Another example is Bagdasaryan et al. (2019)
who point out that data anonymization techniques can be problematic. Differential
privacy learning mechanisms such as gradient clipping and noise addition have a
disproportionate effect on under-represented and more complex subgroups. This
phenomenon is called “learning bias.” Evaluation bias takes place when the
reference data used for a particular task is not representative. This can be seen in
facial recognition algorithms, trained on a population that is very different from
the real one. Buolamwini and Gebru (2018) note that darker-skinned women make

up 7.4% of the Adience database (published in Eidinger et al. 2014, with 26,580
photos across 2,284 subjects with a binary gender label and one label from eight
different age groups), and this lack of representativeness of certain populations
can be an issue (e.g., for an algorithm designed to detect skin cancers). Finally,
“deployment bias” refers to the gap between the problem a model is expected to
solve, and the way it is actually used. This is what Collins (2018) or Stevenson
(2018) show, describing the harmful consequences of risk assessment tools for
actuarial sentencing, particularly in justifying an increase in incarceration on the
basis of individual characteristics. In Box 5.3, Olivier L'Haridon discusses "decision
bias.”

Box 5.3 Decision bias, by Olivier L'Haridon


The literature developed in behavioral economics and risk psychology has
highlighted several "biases" that affect decision-making in times of uncer-
tainty. Typically, a decision bias corresponds to an observed deviation from
a norm of rational behavior or judgment. Traditionally, economic theory
provides, with the expected utility theory of von Neumann, Morgenstern,
and Savage, and Bayes’ theorem, a norm of rational behavior in the face of
uncertainty, whether the probabilities associated with events are known or
not. Any deviation from this clearly defined norm is generally qualified as a
decision bias, even if this deviation can be justified by the particular decision
context, the decision-making speed, or the lack of probabilistic information.
Among decision biases, there are numerous choice heuristics, which are
mental shortcuts corresponding to intuitive and rapid mental operations. It is
important to keep in mind that these shortcuts can be justified by the need
to obtain a satisfactory answer, having the advantage of being fast without
necessarily being optimal. The most famous choice heuristic in the field
of uncertainty is certainly the representativeness heuristic. It occurs when
individuals misjudge the frequency of an event by an abusive generalization of
a similar past event. The direct translation of the representativeness heuristic
is the use of stereotypes to make predictions: events that have only been
observed on a limited sample are generalized to the whole population, extreme
events are trivialized, phenomena for which more details are known are
considered more probable, data on recent events are favored over initial
data on the situation, etc. This last characteristic can be reinforced by the availability heuristic, whereby individuals judge as more likely the events that are more easily available in their memory, i.e., those that are significant for them, such as frequent events, more recent events, or even larger events.
The last major heuristic identified by the literature is the anchoring heuristic,
which refers to the tendency to set beliefs at a certain level and to maintain
this anchor in subsequent evaluations of uncertainty, leading to a conservatism bias. The use of these heuristics is fundamentally linked to the decision-
making process: by providing frequent feedback on past decisions, or by
emphasizing decisions, the impact of these heuristics on decision-making can
be significantly reduced.
Alongside these heuristics, motivational biases are more difficult to con-
sider, particularly in terms of possible interventions to reduce them. For
example, behavioral economics has shown that individuals tend to establish
a separate mental accounting for each small risk to which they are subjected,
without considering their potential compensations. Moreover, even for small
risks, individuals are much more sensitive to those that lead to losses than to
those that lead to gains, demonstrating a strong aversion to losses. Individuals
therefore tend to over-insure themselves for these small risks but also to
choose low-deductible contracts even if they are less advantageous for them.
Note that loss aversion also generates a form of risk conservatism by encour-
aging individuals to stay at their status quo, their point of reference, rather
than risk losses, even for the sake of greater prospects of gains. In a different
vein, risk decisions are also biased by the strong tendency of individuals to
overestimate small probabilities and underestimate large probabilities, which
are important sources of optimism and pessimism about subjective views of
luck and uncertainty. (...)


Moreover, the absence of information, or the need to own it before making a decision, also results in a certain number of decision biases. For example,
individuals may react strongly to ambiguity, typically when they are aware of
the absence of information that might be available but is not given to them.
Some individuals also have difficulty taking into account new information,
positive or negative, that they learn about risk factors, and these difficulties
can have particularly important consequences when they involve risk factors
affecting health, such as heredity, smoking, or information on cardiovascular
risks.
Finally, decision biases arising from fields related to uncertainty, typically
involving the perception of the present and the future, can strongly interact
with risky decision making. For instance, the immediacy bias, a strong increase in impatience when consequences are immediate, coupled with loss aversion, leads individuals to overweight present costs relative to future benefits, and to neglect prevention and foresight, whether in the monetary or in the health domain.

5.3.3 External Data

Insurers have the feeling that they can use external data to get valuable information
about their policyholders. And not only insurers, all major companies are fascinated
by external data, especially those collected by big tech companies. As Hill (2022)
wrote, “Facebook defines who we are, Amazon defines what we want, and Google
defines what we think.”
But it is not new. Scism and Maremont (2010a,b) reported that a US insurer
had teamed up with a consulting firm, to look at 60,000 recent insurance applicants
(health related) and they found that a predictive model based partly on consumer-
marketing data was "persuasive" in its ability to replicate traditional underwriting
techniques (based on costly blood and urine testing). So-called “external informa-
tion” was personal and family medical history, as well as information shared by
the industry from previous insurance applications, and data provided by Equifax
Inc., such as likely hobbies, television viewing habits, estimated income, etc. Some
examples of “good” risk assessment factors were being a “foreign traveler,” making
healthy food choices, being an outdoor enthusiast, and having strong ties to the
community. On the other hand, “bad” risk factors included having a long commute,
high television consumption, and making purchases associated with obesity, among
many others.
This is possible because of “data brokers” or “data aggregators,” as discussed
in Beckett (2014) and Spender et al. (2019). These companies collect data on
a grand scale from various sources, independently, they clean the data, link it,

and use machine-learning techniques to extract information. As mentioned by Harcourt (2015a), one of the largest data brokers in the USA sells lists of "Elderly
Opportunity Seekers” (with 3.3 million older people “looking for ways to make
money”), “Suffering Seniors” (with 4.7 million people with cancer or Alzheimer’s
disease), or “Oldies but Goodies” (with 500,000 gamblers over 55 years old). As
explained in Schneier (2015), Lauer (2017), and Zuboff (2019), those profiling societies are replacing the societies of control that took over from the disciplinary societies. Nakashima (2018) revealed that Google continued to track movements
even when a user explicitly asked it not to, by locking location tracking in Android,
for example. Knowing the main movements and locations gives a lot of information
about a person, often sensitive information. Beyond home and work, it is possible
to infer sexual preferences, religion, etc.
This aggregation of data is problematic, as discussed in O’Neil (2016) or Lauer
(2017), without even mentioning any ethical considerations. This could explain
the interest of insurers in connected objects, also named "Internet of Things," as described in Iten et al. (2021). Not only can those devices help to collect new information; overall, three main business opportunities have been identified:
customer engagement, risk reduction, and risk assessment. If we think of health
insurance, contacts between the insurance company and a future policyholder start
with sickness-related questions. Smartphone applications could, in a way, appear less intrusive, and Barbosa (2019) recalls that offering a free fitness tracker
increased “customer engagement.” Those wearable devices offer an opportunity to
encourage customers to have a healthier lifestyle, by simple push notifications to
nudge the customer to leave their sofa, and go for a walk, or by setting and tracking
workout goals to motivate the policyholder to pursue a regular exercise regime, as
discussed in Spender et al. (2019). They can also be used for prevention, to detect
a problem before it leads to very costly medical procedures. Those devices are also
used in motor insurance, with different types of insurance, usage-based insurance
(UBI), pay-as-you-drive (PAYD), pay-how-you-drive (PHYD) and manage-how-
you-drive (MHYD). The simplest one is UBI, and the only difference compared with
traditional insurance is that instead of estimating the mileage beforehand the exact
milage is calculated at the end of the years, and adjustments are made. PAYD takes
not only the exact mileage but also other nonbehavioral risk factors into account,
e.g., kind of road or time of day. Such data can come from a cellphone. PHYD adds
behavioral risk factors, such as speeding and driving style, measured by an on-board
device. The main difference is that a cellphone is usually related to a single person,
whereas the on-board device is related to a car. Finally, MHYD is in principle the
same as PHYD, except that it is based on dynamic interactions. The driver gets
suggestions on how to drive more safely and therefore lowers the premium. Such
“gamification” introduces feedback biases, and makes the data harder to analyze.
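As a minimal sketch of how these schemes could translate into premium adjustments (the per-kilometre rate and the behavioral loadings below are made up for the example and do not come from any actual tariff):

```python
# Illustrative sketch: usage-based (UBI) and behavior-based (PHYD) adjustments.
def ubi_premium(rate_per_km: float, km_driven: float) -> float:
    """UBI: the premium is settled on the exact mileage observed."""
    return rate_per_km * km_driven

def phyd_premium(base_premium: float, night_share: float,
                 harsh_brakes_per_100km: float) -> float:
    """PHYD: behavioral surcharges on top of a base premium (loadings are illustrative)."""
    factor = (1.0 + 0.30 * night_share)              # share of kilometres driven at night
    factor *= (1.0 + 0.05 * harsh_brakes_per_100km)  # aggressive braking style
    return base_premium * factor

print(ubi_premium(0.04, 8000))                               # 320.0
print(round(phyd_premium(500.0, night_share=0.2,
                         harsh_brakes_per_100km=1.5), 2))    # 569.75
```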

5.4 Typology of Ratemaking Variables

In Sect. 3.2, we have mentioned the historical importance of "categorization" for insurers, and now we will look more carefully at the variables used as ratemaking variables.

5.4.1 Ratemaking Variables in Motor Insurance

In France, as recalled in Delaporte (1962, 1965), in the aftermath of the Second World War, the price of motor liability insurance was defined using the "méthode
de la tarification à la moyenne” (or “average pricing method”). The work of
the actuaries consisted in defining large categories of insured persons deemed
to be homogeneous. The construction of these equivalence classes is based on
the grouping of three criteria: (1) the zone where the vehicle is parked, (2) its
fiscal power, (3) the insured person’s profession and the use of the vehicle. In
1962, the rate structure of the Groupement Technique Accidents (GTA) of the
Association Générale des Sociétés d’Assurances contre les Accidents (AGSAA), the
professional organization of automobile insurance companies, in France, was based
on 6 geographic zones, 12 vehicle groups, as well as 3 usages and 6 professional
categories (namely, for the 3 usages, “business or commercial,” “leisure,” and
“public transport of goods,” and the 6 professional categories were (1) craftsmen,
(2) ministerial officers (lawyers, bailiffs, notaries, etc.), (3) travelers, representatives
and salesmen, (4) full-time employees of commercial, industrial and craft enter-
prises, ministerial offices, and liberal professions, (5) clergymen, (6) civil servants,
magistrates, and members of the education system). According to Delaporte (1965),
“the grouping of observations into classes, while it masks all or part of the inter-
individual differences, does not eliminate them.” At the turn of the 1950s and 1960s,
a "risk-modelled premium" ("prime modelée sur le risque") is introduced. The
estimation of the risk should therefore take into consideration, on the one hand,
the class in which the insured is classified, and on the other hand, the number of
accidents observed during the previous years. On the basis of these data, coefficients
are calculated that should be applied to the premium for each insured person.
“This is the ‘risk-modelled premium’, which represents the probable value of an
individual risk,” as discussed in Chap. 2. This model would therefore not only be
more technically correct but also socially correct. According to Delaporte (1965) “if
the risks of the insured are not all equal, it is normal to ask each insured a premium
proportional to the risk that he or she makes the mutual insurance company bear.”
Within the same class, the bad risks, a minority, would significantly increase the
premium of the good risks.
More recently, Jean Lemaire presented classification variables commonly used,
in Lemaire (1985), see Box 5.4.

Box 5.4 Classification variables, Lemaire (1985)


“the main task of the actuary who sets up a new tariff is to make it as fair
as possible by partitioning the policies into homogeneous classes, with all
policyholders belonging to the same class paying the same premium (...)
Most developed countries use several classification variables to differentiate
premiums among automobile third-party liability policyholders. Typical vari-
ables include age, sex, and occupation of the main driver, the town where
he resides, and the type and use of his car. More exotic variables, such
as the driver’s marital status and smoking behavior, or even the color of
his car, have been introduced in some countries. Such variables are often
called prior rating variables, as their values can be determined before the
policyholder starts to drive. The main purpose for their use is to subdivide
policyholders into homogeneous classes. If, for instance, females are proved
to cause significantly fewer accidents than males, equity arguments suggest
that they should be charged a lower premium (...) Despite the use of many
a priori variables, very heterogeneous driving behaviors are still observed
in each tariff cell (...) Hence the idea came in the mid-1950s to allow for
posterior premium adjustments, after having observed the claims history of
each policyholder. Such practices, called experience rating, merit-rating,
no-claim discount, or bonus-malus systems (BMS), penalize the insureds
responsible for one or more accidents by an additional premium or malus,
and reward claim-free policyholders, by awarding a discount or bonus.”
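As a minimal sketch of the bonus-malus mechanism mentioned above, loosely inspired by the French coefficient de réduction-majoration (the figures are illustrative, not an actual scale):

```python
# Illustrative sketch of a bonus-malus scale: a claim-free year lowers the
# coefficient, each at-fault claim raises it, within fixed bounds.
def update_coefficient(coef: float, n_at_fault_claims: int) -> float:
    if n_at_fault_claims == 0:
        coef *= 0.95                            # bonus for a claim-free year
    else:
        coef *= 1.25 ** n_at_fault_claims       # malus per at-fault claim
    return min(max(coef, 0.50), 3.50)           # bounded scale

coef, base_premium = 1.00, 600.0
for year, n_claims in enumerate([0, 0, 1, 0, 0], start=1):
    coef = update_coefficient(coef, n_claims)
    print(f"year {year}: coefficient {coef:.3f}, premium {base_premium * coef:.2f}")
```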

Banham (2015) and Karapiperis et al. (2015), cited in Kiviat (2021), mentioned
that in car insurance, policymakers have investigated the use of credit scores, web
browser history, grocery store purchases, zip codes, the number of past addresses,
speeding tickets, education level, data from devices that track driving in real time,
etc. In Box 5.5, we reproduce a list of variables given by David Backer, an insurer in
Florida, USA.

Box 5.5 Motor Insurance Pricing, Backer (2017)


Examples of factors are:
• Characteristics of the automobile—age, manufacturer, value, safety
features (anti-lock brakes, anti-theft devices, adaptive cruise control, lane
departure feature) all matter to insurance companies.
• The coverage selected—liability limits, uninsured motorist, collision,
comprehensive coverages vary significantly depending on your willingness
to take on more risk.


• Deductible you select—higher deductibles mean that the driver pays more
to repair his vehicle before the insurance kicks in, thus reducing your
premium.
• Profile of the driver—age, gender, marital status, place of residence,
driving record all help to determine your ultimate premium.
• Usage of the car—do you commute to work, use your vehicle for business
or pleasure only?
The following are individual risk factors that determine if you are charged
more or less money by your auto insurance company:
1. Credit Rating—study after study indicate that a person's credit rating can determine their propensity to have an accident. The best automobile insurance companies insure drivers with the best credit scores.
2. Payment History—similar to your credit history, if you pay your automobile premiums on time, you may be able to reduce your premium.
3. Age—around 30% of all vehicle injuries in the USA are caused by drivers aged 15–24. Those in the 16–19 group are three times as likely as those over age 20 to be in a fatal car crash. Motor vehicle accidents are the leading cause of death for teenagers in the country. For this reason, drivers aged 16–19 pay 50% more on average for automobile insurance than drivers aged 20–24. As drivers age, their rates typically decline until, by age 55, drivers can enjoy senior discounts.
4. Driving Record—driving history has been shown to be an accurate indicator of future claims. Drivers with a clean driving record enjoy discounts not granted to those with tickets or accidents.
5. Gender—men, especially young ones, are much more likely to be involved in automobile accidents than women, regardless of their age. For this reason, men pay significantly more for their automobile insurance over the years than women.
6. Nature of Employment—drivers who use their cars for business, such as real-estate salespeople, pay higher rates than drivers who work from home. Although it is difficult to quantify, drivers who work in high-stress jobs for long hours, such as doctors, tend to have more accidents than those in low-stress occupations.
7. Vehicle type—cars that are more expensive to repair or that are most likely to be stolen carry higher rates than others. Some high-performance automobiles are very difficult and expensive to insure. Cars with extra safety features may be subject to additional credits.
8. Location of your home—drivers who live in high-risk urban areas of the country and those who live in high-crime neighborhoods, where their vehicles are more likely to be stolen or vandalized, pay higher rates than others.
9. Marital Status—married people tend to be better drivers than singles and pay lower premiums. In many cases, the multi-car discount kicks in, thus reducing your premium.

5.4.2 Criteria for Variable Selection

In Chap. 3 we discussed variable selection in the context of machine learning and statistics, in the sense that a variable is legitimate in a pricing model if the
association with the outcome y (claim occurrence or frequency, average cost or
aggregated losses) is significant. In Chap. 4, we have seen that interpretability was
important, so, including that variable “should make sense.” If we want to go further,
several factors should be taken into account. In an interview mentioned in Meyers
and Van Hoyweghen (2018), a senior reinsurance manager says “for example, we
are not allowed to use genetic testing because you are born like that. You have no
say in it. No matter what you eat, you have no influence at all on your genetics. So
in the long run the regulator and consumer interest groups will probably shut down
us using things that people have no control over. Now, the one thing people have
clear control over, is their behaviour. And that is a very easy discussion to have with
people to say: ‘You know you are doing wrong. Yes I do. Why don’t you change it?
I don’t feel like it.’ Then you say: ‘You understand, I cannot actually reward you for
that or accept you as a client. There is not much you can argue because all you have
to do is act in a positive way and that solves the problem.’ It is something they have
full control over." Therefore, if people can control how they behave, they should
take responsibility for the outcomes that their behavior produces. This was actually
the statement made by a Dutch telematics-based motor insurer in a video,9 “Other
peoples’ bad luck, recklessness and carelessness determine how much you pay for
your insurance. Even if you never cause any damage, you end up paying for people
who do. At Fairzekering, we don’t think that is fair. How badly or, more importantly,
how well you drive should make a difference.” It is a classical actor–observer bias, as
defined in Jones and Nisbett (1971), corresponding to the tendency to attribute the
behavior of others to internal causes, while attributing our own behavior to external
causes. We often think that it is the structure, or circumstances, that constrain our
own choices, but at the same time, it is the behavior of others that explains theirs.
And the story is biased because if we don’t want to pay for a bad neighbor’s claims,

9 See https://vimeo.com/123722811.

we have to accept that they don’t want to pay for ours, in return. Furthermore, our
behaviors can certainly be influenced by our own decisions, but they can also most
certainly be influenced by the decisions of others. Our behaviors are not isolated.
They are the product of our own actions and the product of our interactions with
others. And this applies when we are driving a car, as well as in a million and one
other activities.
As explained in Finger (2006) and Cummins et al. (2013), the risk factors
of a risk classification system have to meet many requirements: they have to be
(i) statistically (actuarially) relevant, (ii) accepted by society, (iii) operationally
suitable, and (iv) legally accepted.

5.4.3 An Actuarial Criterion

A classification variable is considered actuarially fair, in the sense of Cummins et al. (2013), if it is accurate, provides homogeneity among members, has statistical
credibility, and is reliable over time. A classification variable will be said to be
“accurate” if it allocates policyholders in such a way that they each pay a premium
proportional to their expected cost of claims. To prevent adverse selection, this is
probably the most important criterion. Homogeneity requires that all policyholders
in the same risk class have the same expected claim costs. A large number of
insureds in each group is needed to make the group’s loss history statistically
credible, and to make pooling still meaningful. Too few members result in losses
that vary widely from year to year and cause premiums to fluctuate in the same way.
Finally, a reliable classification variable produces cost differences between different
groups that remain relatively stable over time. Being perceived as a "good risk" in
2020 and then as a “bad risk” in 2021, without having felt that one’s behavior has
changed in the meantime, will generally not be well perceived.
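As a purely illustrative sketch of the statistical credibility requirement (synthetic portfolio, arbitrary claim frequencies and class sizes), the observed claim frequency of a small class fluctuates far more from sample to sample than that of a large one:

```python
# Illustrative sketch: loss experience of small rating classes is not credible.
import numpy as np

rng = np.random.default_rng(0)
true_freq = {"A": 0.05, "B": 0.10, "C": 0.10}     # classes B and C share the same risk
sizes     = {"A": 20_000, "B": 20_000, "C": 200}  # but class C is very small

for cls, lam in true_freq.items():
    n = sizes[cls]
    claims = rng.poisson(lam, size=n)             # one exposure-year per insured
    freq = claims.mean()
    half_width = 2 * claims.std(ddof=1) / np.sqrt(n)
    print(f"class {cls}: n = {n:6d}, observed frequency {freq:.4f} +/- {half_width:.4f}")
```

With only 200 insureds, the confidence interval of class C is of the same order as the frequency itself, so its premium would fluctuate widely from year to year.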

5.4.4 An Operational Criterion

Some actuarially fair risk classification variables cannot be applied in practice because they do not meet the operational standards of objectivity, low cost of
implementation, and handling difficulty. Data that are laborious and costly to
collect and verify rarely make good classification variables. In connection with the
objective of low administrative costs, data used for another purpose make good risk
classification variables. The use of a variable that is reported or collected by other
agencies reduces the likelihood of it being manipulated and, as Cummins et al.
(2013) point out, reduces the cost of verification. A classification variable needs to
offer minimal ambiguity between insured persons, and the total categories described
by the variable should be mutually exclusive and exhaustive.

5.4.5 A Criterion of Social Acceptability

A third consideration in the selection of risk classification variables is social acceptability. According to the classification of Cummins et al. (2013), the four
main criteria are privacy, causality, controllability, and affordability/availability.
Privacy affects the willingness of individuals to disclose certain information, which
in turn affects the accuracy of a risk classification variable as well as the ease with
which it can be collected and verified (more at the end of this section). Causality
requires more than an intuitive relationship between the classification variable and
expected losses. A good risk classification variable must encourage individuals to
reduce the expected frequency and/or severity of their losses (corresponding to the
“controllability” criterion). The social criterion of affordability/availability requires
those who need to purchase insurance coverage to be able to do so reasonably. Social
acceptability seems to be even greater when the risk is linked to a criterion that is a
matter of choice for the insured.

5.4.6 A Legal Criterion

Finally, in practice, the use or prohibition of certain classification variables is most often imposed by law (or regulation). In Canada, provincial laws, which
are generally more stringent than those in many states in the USA, typically
mandate that classification variables should not be unfairly discriminatory. This
means that the actuarial fairness test must be demonstrated to ensure fairness in
how variables are used for classification purposes. However, classification variables
have sometimes been prohibited because there is only a correlation, not a causal
relationship, between the classification variable and expected claims costs, or
because they have been deemed socially unacceptable. The legal criteria, of course,
vary by state and province, as noted previously.

5.5 Behaviors and Experience Rating

Algorithmic differentiation can also be used to differentiate behavioral controls, usually with a distinction between "hard" and "soft" behavioral controls. "Hard"
behavioral controls are based on strict rules and obligations, whereas “soft”
behavioral controls are achieved through financial incentives, recommendations,
“nudges,” etc., as discussed in Yeung (2018a,b). Even if it is usually based on
technological tools, the idea echoes “experience rating,” as coined in Keffer (1929),

which occurs when the premium charged for an insurance policy is explicitly linked
to the previous claim record of the policy or the policyholder. Actually, claims
history has been the most important rating factor in motor insurance, over the past
60 years. Lemaire et al. (2016) argued that annual mileage and claims history (such
as a bonus-malus class) are the two most powerful rating variables.
Experience rating has to do with “merit-based” insurance pricing, “merit rating”
as coined by Van Schaack (1926), studied in Wilcox (1937) or Rubinow (1936), and
popularized by Bailey and Simon (1960). It seems like “merit” has always been seen
as a morally valid predictor. But recently, Sandel (2020) criticized this “tyranny of merit”: “the meritocratic ideal places great weight on the notion of personal
responsibility. Holding people responsible for what they do is a good thing, up to
a point. It respects their capacity to think and act for themselves, as moral agents
and as citizens. But it is one thing to hold people responsible for acting morally;
it is something else to assume that we are, each of us, wholly responsible for our
lot in life.” “What matters for a meritocracy is that everyone has an equal chance
to climb the ladder of success; it has nothing to say about how far apart the rungs
of the ladder should be. The meritocratic ideal is not a remedy for inequality; it is
a justification of inequality.” Hence, behavioral fairness corresponds to the fairness
of merit. But as pointed out by Szalavitz (2017), the narrative that legitimizes the
idea of merit is an often biased narrative, with a classical “actor-observer bias.”
Personalization is close to this idea, as it means looking at the individual based on his or her own risk profile, claims experience, and characteristics, rather than viewing him or her as part of a set of similar risks. Hyper-personalization takes personalization
further.

5.6 Omitted Variable Bias and Simpson’s Paradox

Omitted variable bias occurs when a regression model is fitted without considering
an important (predictor) variable. For example, Pradier (2011) noted that actuarial
textbooks (such as Depoid 1967) state that “the pure premium of women in the North
American market would be equal to that of men if it were conditioned on mileage.”

5.6.1 Omitted Variable in a Linear Model

The sub-identification corresponds to the case where the true model would be $y_i = \beta_0 + \boldsymbol{x}_1^\top\boldsymbol{\beta}_1 + \boldsymbol{x}_2^\top\boldsymbol{\beta}_2 + \varepsilon_i$, but the estimated model is $y_i = b_0 + \boldsymbol{x}_1^\top\boldsymbol{b}_1 + \eta_i$ (i.e., the variables $\boldsymbol{x}_2$ are not used in the regression). The least square estimate of $\boldsymbol{b}_1$, in the mis-specified model, is (with the standard matrix notation in econometrics, as in Davidson et al. 2004 or Charpentier et al. 2018)

$$\begin{aligned}
\widehat{\boldsymbol{b}}_1 &= (\boldsymbol{X}_1^\top\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^\top\boldsymbol{y} \\
&= (\boldsymbol{X}_1^\top\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^\top[\boldsymbol{X}_1\boldsymbol{\beta}_1 + \boldsymbol{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}] \\
&= (\boldsymbol{X}_1^\top\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^\top\boldsymbol{X}_1\boldsymbol{\beta}_1 + (\boldsymbol{X}_1^\top\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^\top\boldsymbol{X}_2\boldsymbol{\beta}_2 + (\boldsymbol{X}_1^\top\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^\top\boldsymbol{\varepsilon} \\
&= \boldsymbol{\beta}_1 + \underbrace{(\boldsymbol{X}_1^\top\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^\top\boldsymbol{X}_2\boldsymbol{\beta}_2}_{\boldsymbol{\beta}_{12}} + \underbrace{(\boldsymbol{X}_1^\top\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^\top\boldsymbol{\varepsilon}}_{\boldsymbol{\nu}_i},
\end{aligned}$$

so that $\mathbb{E}[\widehat{\boldsymbol{b}}_1] = \boldsymbol{\beta}_1 + \boldsymbol{\beta}_{12}$, the bias (which we have noted $\boldsymbol{\beta}_{12}$) being null only in the case where $\boldsymbol{X}_1^\top\boldsymbol{X}_2 = \boldsymbol{0}$ (that is to say $\boldsymbol{X}_1 \perp \boldsymbol{X}_2$): we find here a consequence of the Frisch–Waugh theorem (from Frisch and Waugh 1933). If we simplify a little, let us suppose that the real underlying model of the data is

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon,$$

where $x_1$ and $x_2$ are explanatory variables, $y$ is the target variable, and $\varepsilon$ is random noise. The estimated model obtained by removing $x_2$ gives

$$\widehat{y} = \widehat{b}_0 + \widehat{b}_1 x_1.$$

One can think of a missing significant variable $x_2$, or the case where $x_2$ is a protected variable. Estimates of the regression coefficients obtained by least squares are (usually) biased, in the sense that

$$\widehat{b}_1 = \frac{\widehat{\mathrm{cov}}[x_1, y]}{\widehat{\mathrm{Var}}[x_1]} = \frac{\widehat{\mathrm{cov}}[x_1, \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon]}{\widehat{\mathrm{Var}}[x_1]},$$

or

$$\widehat{b}_1 = \beta_1 \cdot \underbrace{\frac{\widehat{\mathrm{cov}}[x_1, x_1]}{\widehat{\mathrm{Var}}[x_1]}}_{=1} + \beta_2 \cdot \frac{\widehat{\mathrm{cov}}[x_1, x_2]}{\widehat{\mathrm{Var}}[x_1]} + \underbrace{\frac{\widehat{\mathrm{cov}}[x_1, \varepsilon]}{\widehat{\mathrm{Var}}[x_1]}}_{=0} = \beta_1 + \beta_2 \cdot \frac{\widehat{\mathrm{cov}}[x_1, x_2]}{\widehat{\mathrm{Var}}[x_1]}.$$

When $x_2$ is omitted, $\widehat{b}_1$ is biased, especially because $x_1$ and $x_2$ are correlated. Therefore, in most realistic cases, removing the sensitive variable (if $x_2 = p$) not
only fails to make the regression models fair but, on the contrary, it is likely to
amplify discrimination. For example, in labor economics, if immigrants tend to have
lower levels of education, then the regression model would “punish” low education
even more by offering even lower wages to those with low levels of education (who
are mostly immigrants). Žliobaite and Custers (2016) suggests that a better strategy
for building fair regression models would be to learn a model on complete data that
includes the sensitive variable, then remove the component containing the sensitive

variable and replace it with a constant that does not depend on the sensitive variable.
A study on discrimination prevention for regression, Calders and Žliobaite (2013), is
related to the topic, but with a different focus. Their goal is to analyze the role of the
sensitive variable in suppressing discrimination, and to demonstrate the need to use
it for discrimination prevention. Calders and Žliobaite (2013) does, of course, use
the sensitive variable to formulate nondiscriminatory constraints, which are applied
during model fitting. But a discussion of the role of the sensitive variable is not the
focus of their study. Similar approaches have been discussed in economic modeling,
as in Pope and Sydnor (2011), where the focus was on the sanitization of regression
models; our focus here is on the implications for data regulations.
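To make this strategy concrete, here is a minimal sketch in Python (using only numpy, with simulated data; the variable names x1 and p, the coefficients, and the choice of replacing the sensitive variable by its sample mean are illustrative assumptions, not the exact procedure of Žliobaite and Custers 2016):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# hypothetical data: p is a sensitive attribute, correlated with the legitimate predictor x1
p = rng.binomial(1, 0.5, size=n)                  # sensitive variable
x1 = 2 * p + rng.normal(size=n)                   # legitimate predictor, correlated with p
y = 1 + 1.0 * x1 + 3.0 * p + rng.normal(size=n)   # outcome depends on both

# (a) "fairness through unawareness": drop p and regress y on x1 only
X_unaware = np.column_stack([np.ones(n), x1])
b_unaware = np.linalg.lstsq(X_unaware, y, rcond=None)[0]

# (b) fit the full model including p, then neutralize the p-component at
#     prediction time by replacing p with a constant (here, its sample mean)
X_full = np.column_stack([np.ones(n), x1, p])
b_full = np.linalg.lstsq(X_full, y, rcond=None)[0]
X_neutral = np.column_stack([np.ones(n), x1, np.full(n, p.mean())])
pred_neutral = X_neutral @ b_full                 # predictions no longer vary with p, given x1

print("slope on x1 when p is omitted :", round(b_unaware[1], 2))  # around 1.75, biased upwards
print("slope on x1 in the full model :", round(b_full[1], 2))     # close to the true value 1.0
```

In this toy example, dropping the sensitive variable inflates the coefficient on $x_1$ (which absorbs part of the effect of $p$), whereas the full model recovers the correct coefficient, and its sensitive component can then be neutralized at prediction time.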
Returning to our example, there are cases where $\widehat{b}_1 < 0$ (for example) whereas in the true model, $\beta_1 > 0$. This is called a (Simpson) paradox or spurious correlation
(in ecological inference) in the sense that the direction of impact of a predictor
variable is not clear.
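A short simulation (hypothetical coefficients, numpy only) illustrates such a sign reversal, using the same covariance formula as above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# true model: y = 2 + 1*x1 - 3*x2 + noise, with beta_1 = +1
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)     # x2 is strongly correlated with x1
y = 2 + 1.0 * x1 - 3.0 * x2 + rng.normal(size=n)

# slope of the mis-specified regression of y on x1 alone: cov(x1, y) / var(x1)
b1_omitted = np.cov(x1, y)[0, 1] / np.var(x1, ddof=1)

# theoretical value: beta_1 + beta_2 * cov(x1, x2) / var(x1)
b1_theory = 1.0 - 3.0 * np.cov(x1, x2)[0, 1] / np.var(x1, ddof=1)

print(round(b1_omitted, 2), round(b1_theory, 2))   # both close to -2, although beta_1 = +1
```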

5.6.2 School Admission and Affirmative Action

Although examples of such paradoxes are numerous (Alipourfard et al. 2018 compiles a substantial list), Simpson's paradox has historically been illustrated with college admissions, in Bickel et al. (1975), as in Fig. 5.5.
With mathematical notations, Simpson's paradox can be written as

$$\begin{cases}
\mathbb{P}[A \mid B \cap C] < \mathbb{P}[A \mid \overline{B} \cap C] \\
\mathbb{P}[A \mid B \cap \overline{C}] < \mathbb{P}[A \mid \overline{B} \cap \overline{C}] \\
\mathbb{P}[A \mid B] > \mathbb{P}[A \mid \overline{B}],
\end{cases}$$

(where the bar denotes the complement of the event) or analytically,

$$\frac{a_1}{c_1} < \frac{a_2}{c_2} \quad\text{and}\quad \frac{b_1}{d_1} < \frac{b_2}{d_2} \quad\text{while}\quad \frac{a_1+b_1}{c_1+d_1} > \frac{a_2+b_2}{c_2+d_2}.$$

Fig. 5.5 Admission statistics for the six largest graduate programs at Berkeley, with the number of admissions/number of applications and the percentage of admissions. The bold numbers indicate, by row, whether men or women have the highest admission rate. The proportion column shows the male–female proportions among the applications submitted. The total corresponds to the 12,763 applications in 85 graduate programs; the six largest programs are detailed below, and the “top 6” line is the total of these six programs, i.e., 4,526 applications (source: Bickel et al. 1975)

The conclusion of Bickel et al. (1975) emphasizes “the bias in the aggregated
data stems not from any pattern of discrimination on the part of admissions com-
mittees, which seems quite fair on the whole, but apparently from prior screening at
earlier levels of the educational system. Women are shunted by their socialization
and education toward fields of graduate study that are generally more crowded,
less productive of completed degrees, and less well funded, and that frequently offer
poorer professional employment prospects.” In other words, the source of the gender
bias in admissions was a field problem: through no fault of the departments, women
were “separated by their socialization,” which occurred at an earlier stage in their
lives.

5.6.3 Survival of the Sinking of the Titanic

Another illustration is found in the Titanic data, specifically when comparing crew members with third-class passengers. Figure 5.6 shows the same paradox, in the context of survival following the sinking.
Let us look at the survival rates for men and women separately, as presented
in Fig. 5.6. Among the crew, there were 885 men, of whom 192 survived, a rate of
21.7%. Among the third-class passengers, 462 were men, and 75 survived, a rate of
16.2%. Among the crew, there were 23 women, and 20 of them survived, a rate of
87.0%. And among the third-class passengers, 165 were women, and 76 survived,
a rate of 46.1%. In other words, for males and females separately, the crew had
a higher survival rate than the third-class passengers; but overall, the crew had a
lower survival rate than the third-class passengers. As with the admissions, there is
no miscalculation, or catch. There is simply a misinterpretation, because gender $x_2$ and status $x_1$ (passenger/crew member) are not independent, just as gender $x_2$ and survival $y$ are not independent. Indeed, although women represent 22% of the total
population, they represent more than 50% of the survivors... and 2.5% of the crew.

Fig. 5.6 Survival statistics for Titanic passengers conditional on two factors, crew/passenger ($x_1$) and gender ($x_2$)

5.6.4 Simpson’s Paradox in Insurance

In a demographic context, Cohen (1986) looked at mortality in Costa Rica and Sweden (Sweden was then known, and promoted, for its excellent life expectancy).
Not surprisingly, he found that in 1960, the mortality rate for women in all age
groups was higher in Costa Rica than in Sweden, as shown in Fig. 5.7. Yet the overall
mortality rate for women in Costa Rica was lower than in Sweden, with a mortality
rate of 8.12 in Costa Rica, compared with 9.29 in Sweden. The explanation
is related to Simpson’s paradox, and it comes from the different structure of the
populations. The population of Costa Rica is much younger on average than that
of Sweden, and therefore the younger age groups (which have a low mortality rate)
weigh more on average for Costa Rica than for Sweden, leading to a fairly low
overall mortality rate in Costa Rica, despite a fairly bad rate in each age group, as
seen in the age pyramid, on the right of Fig. 5.7.
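The weighting argument can be made explicit with a toy computation (the age groups, rates, and population shares below are purely illustrative numbers, not the rates of Keyfitz et al. 1968):

```python
# hypothetical age-specific mortality rates (per 1,000) and population shares
rates_A  = {"young": 2.0, "old": 60.0}    # country A: higher rate in every age group
rates_B  = {"young": 1.5, "old": 50.0}    # country B: lower rate in every age group
shares_A = {"young": 0.90, "old": 0.10}   # but A has a much younger population
shares_B = {"young": 0.60, "old": 0.40}

crude_A = sum(rates_A[a] * shares_A[a] for a in rates_A)   # 2.0*0.9 + 60*0.1  = 7.8
crude_B = sum(rates_B[a] * shares_B[a] for a in rates_B)   # 1.5*0.6 + 50*0.4 = 20.9
print(crude_A, "<", crude_B)   # A has the lower crude rate despite higher age-specific rates
```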
Davis (2004) studied the relationship between the number of pedestrian-vehicle
accidents and the average speed of vehicles at various locations in a city. Specif-
ically, the study assessed the value of imposing stricter speed limits on vehicles
in cities, and unexpectedly, the model showed that lowering the speed limit from
30 to 25 mph would increase the number of accidents. The explanation is that an
unfortunate aggregation of the data (which did not take into account that the number
of accidents was much lower in residential areas, for example) led to a paradoxical
conclusion (whereas, in reality, the number of accidents is expected to decrease when the speed limit is reduced).

5.6.5 Ecological Fallacy

Problems similar to Simpson’s paradox also occur in other forms. For example,
the ecological paradox (analyzed by Freedman 1999, Gelman 2009, and King et al.
2004) describes a contradiction between a global correlation and a correlation within
groups. A typical example was described by Robinson (1950). The correlation
between the percentage of the foreign-born population and the percentage of literate
people in the 48 states of the USA in 1930 was +53%. This means that states
with a higher proportion of foreign-born people were also more likely to have
higher literacy rates (more people who could read, at least in American English).
Superficially, this value suggests that being foreign born means that people are
more likely to be literate. But if we look at the state level, the picture is quite
different. Within states, the average correlation is −11%. The negative value means
that being foreign born means that people are less likely to be literate. If the within-
state information had not been available, an erroneous conclusion could have been
drawn about the relationship between country of birth and literacy.
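A small simulation reproduces this kind of contradiction between the aggregate-level and the within-group correlations (hypothetical data, not Robinson's 1930 census figures; the group-level factor mu plays the role of a "state effect"):

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_per_group = 48, 200

group_means, within_corrs = [], []
for g in range(n_groups):
    mu = rng.normal()                                        # group-level ("state") factor
    x = mu + 0.3 * rng.normal(size=n_per_group)
    y = 2 * mu - x + 0.3 * rng.normal(size=n_per_group)      # within a group, y decreases with x
    group_means.append((x.mean(), y.mean()))
    within_corrs.append(np.corrcoef(x, y)[0, 1])

gx, gy = np.array(group_means).T
print("correlation across group averages:", round(np.corrcoef(gx, gy)[0, 1], 2))     # strongly positive
print("average correlation within groups:", round(float(np.mean(within_corrs)), 2))  # negative
```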

Fig. 5.7 Annual mortality rate for women, Costa Rica and Sweden, and the age pyramid for both countries (inspired by Cohen 1986, data source: Keyfitz et al.
1968)

5.7 Self-Selection, Feedback Bias, and Goodhart’s Law

Nonresponse is a form of self-selection where individuals refuse to be part of the learning sample. In fact, this situation is increasingly found in administrative files,
as pointed out by Westreich (2012). Until very recently, our data were often stored
automatically, without our knowledge, and without us having any say in the matter.
In Europe, the European Union’s GDPR has changed that, as you probably realize
from all the invitations to check boxes indicating that you understand and consent to
the storage of your personal data when you browse websites. In application of this
principle, many countries in Europe have passed laws giving those who request it
the opportunity to have their data deleted. This concept of “opting-out” is restrictive,
and can strongly bias the retained data. One can think of Dilley and Greenwood
(2017), who noted that the number of abandoned emergency calls (to 999) in the
UK had more than doubled in 2016. How do we account for these unsuccessful calls
if we want to seriously study feelings of insecurity?
This problem of self-selection can also resurface when we try to model admission
to a university program (to echo the problem presented in the previous section).
Here we consider three indicator variables $(y_{1:i}, y_{2:i}, y_{3:i})$, where the first variable indicates whether a person has applied to a program, the second indicates admission to a program, and the third indicates enrolment in a program. Assume furthermore that the records are evaluated on the basis of two quantities $(x_{1:i}, x_{2:i})$, or $\boldsymbol{x}_i$. In the examples from the USA, the two main measures studied are the GPA score (Grade Point Average, often ranging from 1 to 4, respectively for a D and an A or an A+) and the SAT score (Scholastic Aptitude Test, ranging from 400 to 1,600). For simplicity, let us assume that both scores are normalized on a scale of 0–100, as in Fig. 5.8. Suppose that very simple rules are adopted for each of the stages $j$: $y_{j:i} = \mathbf{1}[\boldsymbol{x}_i^\top\boldsymbol{\beta}_j > s_j]$. For example, no student whose sum $x_1 + x_2$ does not exceed 60 applies ($y_{1:i} = \mathbf{1}[x_{1:i} + x_{2:i} > 60]$), any student whose sum of scores exceeds 120 is admitted ($y_{2:i} = \mathbf{1}[x_{1:i} + x_{2:i} > 120]$), and students who are too good do not enrol in the end ($y_{3:i} = \mathbf{1}[x_{1:i} + x_{2:i} \in (120, 160]]$). As can be seen in Fig. 5.8, depending on the data used, the correlation between the variables $x_1$ and $x_2$ can change markedly: globally, the variables are strongly and positively correlated, but on the subset of the students enrolled in the program (i.e., $\{i : y_{3:i} = 1\}$), the variables $x_1$ and $x_2$ are strongly but negatively correlated this time.
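A simulation along these lines shows how conditioning on the selection rules changes the correlation (the scores below are simulated, not the data behind Fig. 5.8; the thresholds 60, 120, and 160 are those of the rules described above, and the exact correlations differ from the figure, although the qualitative pattern, positive overall, close to zero among admitted students, negative among enrolled students, is the same):

```python
import numpy as np

rng = np.random.default_rng(2023)
n = 50_000

# two positively correlated scores on a (roughly) 0-100 scale, mimicking GPA and SAT
z = rng.multivariate_normal([60, 60], [[225, 120], [120, 225]], size=n)
x1, x2 = np.clip(z[:, 0], 0, 100), np.clip(z[:, 1], 0, 100)
s = x1 + x2

applied  = s > 60                       # y1: applies to the program
admitted = s > 120                      # y2: admitted
enrolled = (s > 120) & (s <= 160)       # y3: enrolled ("too good" students go elsewhere)

for name, mask in [("all", np.ones(n, dtype=bool)), ("applied", applied),
                   ("admitted", admitted), ("enrolled", enrolled)]:
    r = np.corrcoef(x1[mask], x2[mask])[0, 1]
    print(f"{name:9s} r = {r:+.2f}")
```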

This example of school admissions can be found in many cases, in insurance or banking. In credit risk, we try to estimate the probability of a borrower defaulting. But we can wonder about the data available to estimate this probability
(or to build a credit score). The first barrier is self-selection, as some people do not
apply for credit, for several reasons.

10 Equivalent, in the UK, of 911 in North America, 112 in many European countries, or 0118 999 881 999 119 725 3.



Fig. 5.8 Relationship between the two variables, $x_1$ (GPA) and $x_2$ (SAT), as a function of the population studied: total population top left ($r \sim 0.55$), observable population top right, i.e., students who applied to the program ($r \sim 0.45$), population of students admitted to the program at the bottom left ($r \sim -0.05$), and population of students enrolled in the program at the bottom right ($r \sim -0.7$) (source: author, dummy data). GPA = grade point average, SAT = Scholastic Aptitude Test

This self-selection is all the more disturbing as we often have no information
on the people who do not apply for credit.11 We can also think of speed cameras
to identify areas where speeding would be frequent. Are all the speeds of the cars
measured and recorded? Or only the cars that have committed a speeding offence?
Do the cameras operate permanently, or only at times when children are playing
in the street? Does the device record only speed or also other suspicious behavior?
One can also think of a relatively popular French database of road traffic accidents, where
“every road traffic accident known to the police is the subject of a Bulletin d’Analyse
d’Accident Corporel (BAAC).” We can imagine that some accidents do not appear

11 To continue the analogy, in credit risk, we find the three previous levels, with (1) those who do not apply for credit, (2) those to whom the institution does not offer credit, and (3) those who are not interested in the offer made.

in the database, and that some information is only partially reported, such as
information related to blood alcohol content, as pointed out by Carnis and Lassarre
(2019): does missing information mean that the test was negative or that it was not
done? It is crucial to know how the data were collected before you begin to analyze
it. Missing information is also common in health insurance, as medical records are
often relatively complex, with all kinds of procedure codes (that vary from one
hospital to another, from one insurer to another). In France, the majority of drugs
are pre-packaged (in boxes of 12, 24, 36), which makes it difficult to quantify the
“real” consumption of drugs.

5.7.1 Goodhart’s Law

When measuring the wealth of a population, there are two common approaches:
surveys and tax data. Surveys can be expensive and logistically complex to conduct.
On the other hand, tax data can be biased, particularly when it comes to capturing
very high incomes owing to tax optimization strategies that may vary over time.
For instance, in the UK, individuals can circumvent inheritance tax by utilizing
strategies such as borrowing against a taxable asset (e.g., their home) and investing
the loan in a nontaxable asset such as woodland. One can also evade taxes in the
UK by purchasing property through an offshore company, as non-UK companies
and residents are exempt from UK taxation. When loopholes in a tax system are
identified and individuals start to exploit them extensively, it often results in the
creation of more complex structures that, in turn, possess their own loopholes. This
phenomenon is known as “Goodhart’s Law.”
According to Marilyn Strathern, Goodhart's Law states that “when a measure becomes a target, it ceases to be a good measure.” In US health care, Poku (2016)
notes that starting in 2012, under the Affordable Care Act, Medicare began charging
financial penalties to hospitals with “higher than expected” 30-day readmission
rates. Consequently, the average 30-day hospital readmission rate for fee-for-service
beneficiaries decreased. Is this due to improved efforts by hospitals to transition and
coordinate care, or is it related to the increase in “observation” stays over the same
period? Very often, setting a target based on a precise measure (here the 30-day
readmission rate) makes this variable completely unusable to quantify the risk of
getting sick again, but also has a direct impact on other variables (in this case the
number of “observation” stays), making it difficult to monitor over time. On the
Internet, algorithms are increasingly required to sort content, judge the defamatory
or racist nature of Tweets, see if a video is a deepfake, put a reliability score on
a Facebook account, etc. There is a growing demand from many individuals to
make algorithms transparent, allowing them to understand the process behind the
creation of these scores. Unfortunately, as noted by Dwoskin (2018) “not knowing
how [Facebook is] judging us is what makes us uncomfortable. But the irony is that
they can’t tell us how they are judging us—because if they do, the algorithms that
they built will be gamed,” exactly as Goodhart’s law implies.

As Desrosières (1998) wrote, “quantitative indicators retroact on quantified actors.” During Spring 2020, television news channels provided real-time updates
on the number of people in intensive care and hospital deaths, which were then
presented in the form of graphs on dedicated websites. As hospitals faced a crisis
with overwhelming patient numbers, the National Health Service (NHS) in the UK
requested each hospital to estimate their bed capacity to facilitate resource alloca-
tion. In this situation, announcing a shortage of available beds became the optimal
strategy for obtaining additional funding. This raises concerns about the accuracy
of how full the system truly is, as each hospital may manipulate the measurements
to suit their own interests, thus questioning the reliability of the reported data. And
just as troubling, while governments focused on hospitals (providing the official
data used to make most indicators), nursing homes experienced disastrous death
tolls, which took a long time to be quantified, as told by Giles (2020).
But aside from data errors, the idea of crime maps raises more subtle issues
related to hidden data and feedback. Attention was drawn to these problems when
the British insurance group Direct Line conducted a survey that found that “10%
of all British adults would definitely or probably consider not reporting a crime to
the police because it would appear on an online crime map, which could have a
negative impact on their ability to rent/sell or reduce the value of their property.”
Instead of showing where incidents have occurred, the maps may show where
people do not bother reporting them. This is quite different, and anyone making
decisions based on these data could easily be misled. O’Neil (2016) also recalls
the selection bias in early telematics data systems: “in these early days, the auto
insurers’ tracking systems are opt-in. Only those willing to be tracked have to turn
on their telematic boxes. They get rewarded with a discount of between 5% and
50% and the promise of more down the road. (And the rest of us subsidize those
discounts with higher rates).” This problem was already discussed more than 20 years ago, e.g., by Morrison (1996), who recalls (quoting Stein 1994) that certain insurance companies in the USA used proof of domestic violence to discriminate against the victim by refusing her access to any form of insurance. The
argument was the same as the one used centuries ago to prohibit life insurance:
“There is some fear that if the beneficiary is the batterer, we would be providing a
financial incentive, if it’s life insurance, for the proceeds to be paid for him to kill
her,” reported Seelye (1994). Therefore, victims realized that if they sought medical
or police protection, the resulting records could compromise their insurability. As
a result, Morrison (1996) asserts that victims could stop seeking help, or reporting
incidents of violence, in order to preserve their insurance coverage.
Another example is connected objects, finally providing insight into the behavior
of policyholders. Car insurers have known for a long time that the risk is very
strongly linked to the behavior of the motorist, but, as Lancaster and Ward (2002)
pointed out, this hunch has never been used. “As certainty replaces uncertainty,” in
the words of Zuboff (2019), “premiums that once reflected the necessary unknowns
of everyday life can now rise and fall from millisecond to millisecond, informed by
the precise knowledge of how fast you drive to work after an unexpectedly hectic
early morning caring for a sick child or if you perform wheelies in the parking

lot behind the supermarket.” Policyholders are rewarded when they improve their
driving behavior, “relative to the broader policy holder pool” as stated by Friedman
and Canaan (2014). This approach, sometimes referred to as “gamification,” may
even encourage drivers to change their behavior and risks. Jarvis et al. (2019) goes
so far as to assert that “insurers can eliminate uncertainty by shaping behaviour.”

5.7.2 Other Biases and “Dark Data”

In attempting to typify dark data biases, Hand (2020) listed dozens of other existing
biases. Beyond the missing variables mentioned above, there is a particularly impor-
tant selection bias. In 2017, during one of the debates at the NeurIPS conference on
interpretability,12 an example of pneumonia detection was mentioned: a deep neural
network is trained to distinguish low-risk patients from high-risk patients, in order to
determine who to treat first. The model was extremely accurate on the training data.
Upon closer inspection, it turns out that the neural network found out that patients
with a history of asthma were extremely low risk, and did not require immediate
treatment. This may seem counter-intuitive, as pneumonia is a lung disease and
patients with asthma tend to be more prone to it (typically making them high-risk
patients). Looking in more detail, asthma patients in the training data did have a
low risk of pneumonia, as they tended to seek medical attention much earlier than
non-asthma patients. In contrast, non-asthmatics tended to wait until the problem
became more severe before seeking care.
Survival bias is another type of bias that is relatively well known and doc-
umented. The best-known example is that presented by Mangel and Samaniego
(1984). During World War II, engineers and statisticians were asked how
to strengthen bombers that were under enemy fire. The statistician Abraham Wald
began to collect data on cabin impacts. To everyone’s surprise, he recommended
armoring the aircraft areas that showed the least damage. Indeed, the aircraft used
in the sample had a significant bias: only aircraft that returned from the theatre
of operations were considered. If they were able to come back with holes in the
wingtips, it meant that the parts were strong enough. And because no aircraft came
back with holes in the propeller engines, those are the parts that needed to be
reinforced. Another example is patients with advanced cancer. To determine which
of two treatments is more effective in extending life spans, patients are randomly
assigned to the two treatments and the average survival times in the two groups are
compared. But inevitably, some patients survive for a long time—perhaps decades—
and we don’t want to wait decades to find out which treatment is better. So the study
will probably end before all the patients have died. That means we won’t know the

12 Called “The Great AI Debate: Interpretability is necessary for machine learning,” opposing Rich Caruana and Patrice Simard (for) to Kilian Weinberger and Yann LeCun (against), https://youtu.be/93Xv8vJ2acI.

survival times of patients who have lived past the study end date. Another concern
is that over time, patients may die of causes other than cancer. And again, the data
telling us how long they would have survived before dying of cancer is missing.
Finally, some patients might drop out (for reasons unrelated to the study, or not).
Once again, their survival times are missing data. Related to this example, we can
return to another important example: why are more people dying from Alzheimer's disease than in the past? One answer may seem paradoxical: it is due to the progress
of medical science. Thanks to medical advances, people who would have died young
are now surviving long enough to be vulnerable to potentially long-lasting diseases,
such as Alzheimer’s. This raises all sorts of interesting questions, including the
consequences of living longer.
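As a minimal sketch of why discarding censored observations is misleading (simulated survival times, numpy only; the exponential distribution and the four-year study horizon are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000

true_times = rng.exponential(scale=5.0, size=n)   # true survival times, mean 5 years
study_end = 4.0                                   # administrative censoring after 4 years

observed = np.minimum(true_times, study_end)      # what is actually recorded
censored = true_times > study_end                 # patients still alive at the end of the study

print("true mean survival time        :", round(true_times.mean(), 2))           # about 5.0
print("mean of uncensored times only  :", round(observed[~censored].mean(), 2))  # far too small
print("share of censored observations :", round(censored.mean(), 2))             # about 0.45
```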
Chapter 6
Some Examples of Discrimination

Abstract We return here to the usual protected, or sensitive, variables that can
lead to discrimination in insurance. We mention direct discrimination, with race and
ethnic origin, gender and sex, or age. We also discuss genetic-related discrimination,
and as several official protected attributes are not related to biology but to social
identity, we return to this concept. We also discuss other inputs used by insurers that could be related to sensitive attributes, such as text, pictures, and spatial information, and that could be seen as a form of discrimination by proxy. We also mention the use of credit scores and network data.

In this chapter, we present and discuss some examples of discrimination. Again, as was stressed previously (for example in Sect. 1.1.3), a pricing model can be seen as “discriminatory” (based on a criterion that we present in Part III) without
any intention. Using a sensitive attribute or a variable correlated with it to improve
the accuracy of a model is considered “rational discrimination,” as discussed in
Sect. 1.1.5, and referred to as “unintentional proxy discrimination” in Prince and
Schwarcz (2019). However, such practices can potentially be illegal and morally
questionable. Determining their legality requires legal expertise, while assessing
their ethical implications involves philosophical considerations. Hellman (2011)
addresses that issue, answering the question “when is discrimination wrong?” In this
chapter, we try to present some features that are usually seen either as “sensitive”
or as “correlated with a sensitive attribute,” and discuss the issue in an insurance
context.

6.1 Racial Discrimination

In a chapter dedicated to examples of discrimination, a natural and simple starting


point could be “racism,” corresponding to discrimination and prejudice toward
people based on their race or ethnicity. As mentioned earlier (such as Table 5.1 for
the USA), racism is commonly recognized as a wrongful form of discrimination.


According to Dennis (2004), “racism is the idea that there is a direct cor-
respondence between a group’s values, behavior and attitudes, and its physical
features ... Racism is also a relatively new idea: its birth can be traced to
the European colonization of much of the world, the rise and development of
European capitalism, and the development of the European and US slave trade.”
Nevertheless, we can probably also mention Aristotle (350–320 before our era),
according to whom Greeks (or Hellenes, Ελληνες) were free by nature, whereas
“barbarians” (βάρβαρος, non-Greeks) were slaves by nature. But in a discussion
on Aristotle and racism, Puzzo (1964) claimed that “racism rests on two basic
assumptions: that a correlation exists between physical characteristics and moral
qualities; that mankind is divisible into superior and inferior stocks. Racism, thus
defined, is a modern concept, for prior to the XVI-th century there was virtually
nothing in the life and thought of the West that can be described as racist. To
prevent misunderstanding, a clear distinction must be made between racism and
ethnocentrism (...) The Ancient Hebrews, in referring to all who were not Hebrews
as Gentiles, were indulging in ethnocentrism, not in racism (...) So it was with the
Hellenes who denominated all non-Hellenes.” We could also mention the “blood
purity laws” (“limpieza de sangre”) that were once commonplace in the Spanish
empire (that differentiated converted Jews and Moors (conversos and moriscos)
from the majority Christians, as discussed in Kamen (2014)), and required each
candidate to prove, with family tree in hand, the reliability of his or her identity
through public disclosure of his or her origins.

6.1.1 A Sensitive Variable Difficult to Define

There are dozens of books, and dedicated entries in encyclopedias, discussing “race” and its scientific grounds, such as Memmi (2000), Dennis (2004), Ghani (2008), or Zack (2014). Historically, we should probably start in the XVIII-th century,
with Georges-Louis Leclerc de Buffon (“the father of all thought in natural history
in the second half of the XVIII-th century,” Mayr (1982)) and Carl Linnaeus (“father
of modern taxonomy,” Calisher (2007)), who defined “varietas” or “species”—the
largest group of organisms in which any two individuals of the appropriate sexes or
mating types can produce fertile offspring—and “subspecies”—rank below species,
used for populations that live in different areas and vary in size, shape, or other
physical characteristics. Following Keita et al. (2004), we can consider “race” as a
synonym for “subspecies.” During the same period, Johann Friedrich Blumenbach
explored the biodiversity of humans, in Blumenbach (1775), mainly by comparing
skull anatomy and skin color, and suggested five human “races,” or more precisely
“generis humani varietates quinae principes, species vero unica” (one species,
and five principal varieties of humankind): the “Caucasian” (or white race, for
Europeans, including Middle Easterners and South Asians in the same category),
the “Mongolian” (or yellow race, including all East Asians), the “Malayan” (or
brown race, including Southeast Asians and Pacific Islanders), the “Ethiopian” (or

Fig. 6.1 On the right-hand side, color chart showing various shades of skin color, by L’Oréal
(2022), as well as the Fitzpatrick Skin Scale at the bottom (six levels), and on the left-hand side,
the RGB (red-green-blue) decomposition of those colors (source: Telles (2014))

black race, including all sub-Saharan Africans), and the “American” (or red race,
including all Native Americans), as discussed in Rupke and Lauer (2018). Johann
Friedrich Blumenbach and Carl Linnaeus investigated the idea of “human race,”
from an empirical perspective, and at the same time, Immanuel Kant became one
of the most notable Enlightenment thinkers to defend racism, from a philosophical
and scientific perspective, as discussed in Eze (1997) or Mills (2017) (even if Kant
(1795) ultimately rejected racial hierarchies and European colonialism).
A more simple version of “racism” is discrimination based on skin color, also
known as “colorism,” or “shadeism,” which is a form of prejudice and discrimination
in which people who are perceived as belonging to a darker skinned race are treated
differently based on their darker skin color. Somehow, this criterion could be seen
as more objective, because it is based on a color. Telles (2014) relied on the Fitzpatrick Skin Scale, used by dermatologists and researchers. Types 1 through 3 are widely considered lighter skin (on the left), and 4 through 6 darker skin (on the right-hand side). One could also consider a larger palette, as in Fig. 6.1.
This connection between “racism” and “colorism” is an old one. Kant (1775),
entitled “Of the Different Human Races” in its English translation (the original German title was “Von den verschiedenen Racen der Menschen”), was the preliminary work for his lectures
on physical geography (collected in Rink (1805)). In Kant (1785), “Determination
of the Concept of a Human Race,” his initial position, on the existence of human
races, was confirmed. The first essay was published a few months before Johann
Friedrich Blumenbach's “De generis humani varietate nativa” (of the native variety
of the human race), and proposed a classification system less complex than the one
in Blumenbach (1775), based almost solely on color. Both used color as a way
of differentiating the races, even if it was already being criticized. For Immanuel
Kant, there were four “races”: whites, Black people, Hindustani, and Kalmuck, and
the explanation of such differences lay in the effects of air and sun. For example,
he argued that by the solicitude of nature, human beings were equipped with seeds
(“keime”) and natural predispositions (“anlagen”) that were developed, or held back,
depending on climate.

The use of colors could be seen as simple and practical, with a more objective
definition than the one underlying the concept of “race,” but it is also very
problematic. For example, Hunter (1775) classified as “light brown” Southern
Europeans, Sicilians, Abyssinians, the Spanish, Turks, and Laplanders, and as
“brown” Tartars, Persians, Africans on the Mediterranean, and the Chinese. In the
XX-th century in the USA, simple categories such as “white” and “black” were not enough. As mentioned by Marshall (1993), in New York City, populations who spoke Spanish were not usually referred to as belonging either to the “white race” or the “black race,” but were designated as “Spanish,” “Cuban,” or “Puerto Ricans.”
Obviously, the popular racial typologies used in the USA were not based on any
competent genetic studies. And it was the same almost everywhere. In Brazil for
instance, descent played a negligible role in establishing racial identity. Harris
(1970) has shown that siblings could be assigned to different racial categories.
He counted more than 40 racial categories that were used in Brazil. Phenotypical
attributes (such as skin color, hair form, and nose or mouth shape) entered into
the Brazilian racial classification, but the most important determinant of racial status
was socio-economic position. It was the same for the Eta in Japan, as discussed by Smythe (1952)
and Donoghue (1957). As Atkin (2012) wrote, an “east African will be classified as
‘black’ under our ordinary concept but this person shares a skin colour with people
from India and a nose shape with people from northern Europe. This makes our
ordinary concept of race look to be in bad shape as an object for scientific study—it
fails to divide the world up as it suggests it should.”
From a scientific and biological standpoint, races do not exist as there are no
inherent “biobehavioral racial essences,” as described by Mallon (2006). Instead,
races are sociological constructs that are created and perpetuated by racism. Racism
here is a set of processes that create (or perpetuate) inequalities based on racializing
some groups, with a “privileged” group that will be favored, and a “racialized”
group that will be disadvantaged. Given the formal link between racism and
perceived discrimination, it is natural to start with this protected variable (even if
it is not clearly defined at the moment). Historically, in the USA, the notion of
race has been central in discussions on discrimination (Anderson (2004) provides a
historical review in the United States), see also Box 6.1, by Elisabeth Vallet.

Box 6.1 Race and Ethnicity in the U.S., by Elisabeth Vallet1


Since the early days of the Republic, the U.S. Census has identified indi-
viduals according to broad racial and ethnic categories. In addition to the
traditional federal categories (White/Caucasian, Black/African American, Asian American, American Indian/Alaska Native, Native Hawaiian/Pacific Islander, Multiracial), the decennial census adds the Hispanic/Latino category. It is also expected that eventually the Middle East and North Africa category will be included in the census.

1 Professor, Director of the Raoul Dandurand Center in Montréal, Canada.
This racial classification, which now serves as the basis for statistical
analysis in the country, has its roots in slavery and a classification designed
to establish a form of racial purity, when the “single drop of blood” and
“blood quantum” rules (Villazor 2008) were applied to determine racial group
membership and helped to shape the population groups involved and their
own self-perceptions. Moreover, the complexity of this classification lies in
the fact that it is not always based on constant elements. On the one hand, the
categories, as stated in the forms, change over time. Therefore, the nature of
the questionnaire and the way in which the questions are formulated, because
they evolve, sometimes alter the overall picture of the population over time.
On the other hand, this classification is now based on the self-identification
of individuals, which may also vary from one census to another. Therefore,
some studies have shown significant changes in the self-identification of
certain individuals, which must be taken into account when using these data.
According to Liebler et al. (2017), for example, there was significant variation
for the same individuals between the 2000 and 2010 censuses, particularly
in categories that could report multiple ethnic or Hispanic backgrounds.
For example, the study shows that the consistency of responses was most
pronounced among non-Hispanic whites, Black people, and Asians. Identi-
fication instability was more pronounced among people who identified as
Native American, Pacific Islanders, people with multiple backgrounds, and
Hispanics. There are many reasons for this, but some of them have to do
with the evolution of the integration of immigrant communities, or with the
fact that miscegenation can lead to the prevalence of one identity over the
other over time. Therefore, analyses must include the fact that the very idea
of race/ethnicity as a social construct must be included as such in the analysis
of statistical data.

6.1.2 Race and Risk

As recalled by Wolff (2006), in 1896, Frederick L. Hoffman, an actuary with Prudential Life Insurance, published a book demonstrating, with statistics, that the
American Black man was uninsurable (see Hoffman (1896)). Du Bois (1896) wryly
noted that the death rate of Black people in the United States was only slightly higher than (and comparable with) that of white citizens in Munich, Germany, at the same time.
But above all, the main criticism is that it aggregated all kinds of data, preventing a

more refined analysis of other causes of (possible) excess mortality (this is also the
argument put forward by O’Neil (2016)). At that time, in the United States, several
states were passing anti-discrimination laws, prohibiting the charging of different
premiums on the basis of racial information. For example, as Wiggins (2013) points
out, in the summer of 1884, the Massachusetts state legislature passed the Act to
Prevent Discrimination by Life Insurance Companies Against People of Color. This
law prevented life insurers operating in the state from making any distinction or
discrimination between white persons and persons of color wholly or partially of
African descent, as to the premiums or rates charged for policies upon the lives
of such persons. The law also required insurers to pay full benefits to African
American policyholders. It was on the basis of these laws that the uninsurability
argument was made: insuring Black people at the same rate as white people would
be statistically unfair, argued Hoffman (1896), and not insuring Black people was
the only way to comply with the law (see also Heen (2009)). As Bouk (2015)
points out “industrial insurers operated a high-volume business; so to simplify sales
they charged the same nickel to everyone. The home office then calculated benefits
according to actuarially defensible discrimination, by age initially and then by race.
In November 1881, Metropolitan decided to mimic Prudential, allowing policies to
be sold to African Americans once again, but with the understanding that Black
policyholders’ survivors only received two-thirds of the standard benefit.”
In the credit market, Bartlett et al. (2021), following up on Bartlett et al.
(2018), shows that discrimination based on ethnicity has continued to exist in the US mortgage market (for African Americans and Latinos), in both traditional and algorithm-based lending. But algorithms have changed the nature
of discrimination from one based on prejudice, or human dislike, to illegitimate
applications of statistical discrimination. Moreover, algorithms discriminate not by
denying loans, as traditional lenders do, but by setting higher prices or interest rates.
In health care, Obermeyer et al. (2019) shows that there is discrimination, based on ethnicity or “racial” bias, in a commercial software program widely used to assign
patients requiring intensive medical care to a managed care program. White patients
were more likely to be assigned to the care program than Black patients with a
comparable health status. The assignment was made using a risk score generated
by an algorithm. The calculation included data on total medical expenditures in a
given year and fine-grained data on health service use in the previous year. The score
therefore does not reflect expected health status, but predicts the cost of treatment.
Bias, stereotyping, and clinical uncertainty on the part of health care providers
can contribute to racial and ethnic disparities in health care, as noted by Nelson
(2002). Finally, in auto insurance, Heller (2015) found that predominantly African
American neighborhoods pay 70% more, on average, for auto insurance premiums
than other neighborhoods. In response, the Property Casualty Insurers Association
of America responded2 in November 2015 that “insurance rates are color-blind and
solely based on risk.” This position is still held by actuarial associations in the USA,

2 Online on their website, https://www.pciaa.net/, see https://bit.ly/43ls6eb.



for whom questions about discrimination are meaningless. Larson et al. (2017)
obtained 30 million premium quotes, by zip code, for major insurance companies
across the USA, and confirmed that a gap existed, albeit a smaller one. Also, in
Illinois, insurance companies charged on average more than 10% more in auto
liability premiums for “majority minority” zip codes (in the sense that minorities form the majority of residents) than in majority-white zip codes. Historically, as recalled
by Squires (2003), many financial institutions have used such discrimination by
refusing to serve predominantly African American geographic areas.
Although such analyses have recently proliferated (Klein (2021) provides a
relatively comprehensive literature review), this potential racial discrimination issue
was analyzed by Klein and Grace (2001), for instance, who controlled for covariates correlated with race, and showed that there was no statistical
evidence of geographic redlining. This conclusion was consistent with the analysis
of Harrington and Niehaus (1998), and was subsequently echoed by Dane (2006),
Ong and Stoll (2007), or Lutton et al. (2020) among many others. It should be noted
here that redlining is not only associated with a racial criterion, but very often
with an economic criterion. A recent case of statistical discrimination is currently
being investigated in Belgium, as mentioned by Orwat (2020). In this country, the
energy supplier EDF Luminus refuses to supply electricity to people living in a
certain postal code district. For the energy supplier, this postal code area represents
a zone where many people with a bad credit history live.
Although “ethnic statistics” are a sensitive subject in France, the censuses have traditionally (for more than a century) asked about nationality at birth, therefore distinguishing French by birth from French by adoption. And since 1992, the variable
“parents’ country of birth” has been introduced in a growing number of public
surveys. In French statistics, the word “ethnic” in the anthropological sense (sub-
national or supra-national human groups whose existence is proved even though
they do not have a state) has long had a place, particularly in surveys on migration
between Africa and Europe. The 1978 Data Protection Act therefore uses the
expression “racial or ethnic origins.” In this sense, “ethnic origin” means any
reference to a foreign origin, whether it is a question of nationality at birth, the
parents’ nationality, or “reconstructions” based on the family name or physical
appearance. Some general intelligence and judicial police files contain personal
information on an individual’s physical characteristics, and in particular on their
skin color, as recalled by Debet (2007). Some medical research files (e.g., in
dermatology) may contain similar information. The French National Institute of
Statistics (INSEE) had initially refused to introduce a question on the parent’s
birthplace in its 1990 family survey, which could have served as a sampling frame. It
was not until the 1999 survey that it was introduced, as recalled by Tribalat (2016).
Another concern that may arise is that the difference that may exist in insurance
premiums between ethnic origins is not a reflection of different risks but of different
treatments. For instance, Hoffman et al. (2016) shows that racial prejudice and false
beliefs about biological differences between Black and white people continue to
shape the way we perceive and treat the former—they are associated with racial
disparities in pain assessment and treatment recommendations.

6.2 Sex and Gender Discrimination

In examining historical perspectives on sex and gender, it is important to note the contrasting viewpoints of influential philosophers. Aristotle (350–320 BCE) posited
the notion that a woman is essentially a man who has failed to fully develop, as
discussed by Horowitz (1976) and Merchant (1980). This belief was perpetuated
further as Aristotle’s Politics (Πολιτικά) became a standard textbook in medieval
and early modern universities. On the contrary, Plato (380–350 BCE) advocated for
the equal education of women and men, particularly in military, intellectual, and
political leadership, as expressed in his work Republic (Πολιτεία). However, the
widespread familiarity with Plato’s ideas in the West was limited until its translation
from Greek to Latin in the early XV-th century, as mentioned by Allen (1975).
Moving beyond historical perspectives, the term “sexism” is commonly defined
as prejudice, discrimination, or stereotyping based on sex or gender, particu-
larly directed against women and girls. Encyclopædia Britannica describes it as
such, whereas the New Oxford American Dictionary echoes this definition by
emphasizing sexism as prejudice, stereotyping, or discrimination typically targeting
women. According to Cudd and Jones (2005), sexism represents a pervasive form
of oppression against women, spanning across different historical periods and
geographical locations.
Before discussing the “gender directive,” in Box 6.2, Émilie Biland-Curinier
explains the differences between “sex” and “gender.”

Box 6.2 Sex and Gender, by Émilie Biland-Curinier3


We traditionally differentiate sex, which is a biological characteristic (related
to physical and physiological features, e.g., chromosomes, gene expression,
hormone levels and function, etc.) and gender, which refers to the sexual
identity of an individual. Sex and gender are often described in binary terms
(girl/woman or boy/man). However, the diversity of sexual development and
atypical formulas is important, whether they are of chromosomal, hormonal,
or environmental origin. In reality, the sex/gender debates are numerous, and
are reflected in many fields of knowledge and public action.
In the social sciences, sex is today considered less as a “biological reality”
than as a social and, above all, legal construction: sex is the one attributed
to each individual on his or her birth certificate and then on all of his or
her more or less official papers. Most people keep this native and legal sex
all their life but some change it (a large number of “trans” people, whether

they have undergone genital surgery or not). In the case of intersex people, the assignment of legal sex involves a good deal of arbitrariness (or, more sociologically, medical discretion), as the biological attributes do not fit into either of the binary female/male categories. Most countries base this legal category of sex on this duality, but some have recently opened up other possibilities (e.g., it is possible to fill in ‘X’ in the Netherlands, and ‘various’ in Germany). In these countries, we speak of a neutral sex/gender, in others of a neutral third.

3 Professor at Science-Po Paris.
In statistics, the “sex” category is the one that is mostly used. This category
is indeed declarative: it reflects most often the legal sex, but in the case where
individuals are asked to fill it in (e.g., in the census), there may be very
few discrepancies. Interesting fact: Statistics Canada has recently adapted
its categories to take into account gender identity (i.e., to make trans people
visible).
As a result, gender relations in sociology and economics (and in particular
inequalities between women and men) are mainly analyzed quantitatively on
the basis of the sex variable. For a brief presentation of this issue in the French
context see Grobon and Mourlot (2014), otherwise Amossé and De Peretti
(2011). In considering the definition of gender as a concept within the realm of
social science, philosopher Elsa Dorlin eloquently highlights its significance:
“The concept of gender has made it possible to historicize the identities,
the roles and the symbolic attributes of the feminine and the masculine,
defining them, not only as the product of a differentiated socialization of the
individuals, specific to each society and variable in time, but also as the effect
of an asymmetrical relation, of a power relation.”
Building upon this understanding, historian Joan Scott offers a compelling
definition of the gender relationship: “Gender is a primary way of signifying
power relationships. The categories of masculine and feminine, such as
“men” and “women,” therefore have meaning and existence only in their
antagonistic relationship, and not as “identities” or as “essences” taken
separately” (Dorlin 2005).

6.2.1 Sex or Gender?

Insights into the distinction between gender and sex, exploring various perspectives
from different fields of knowledge and public action are given in Box 6.3. These
discussions shed light on the complex nature of gender and sex, moving beyond
binary categorizations and acknowledging the diversity of sexual development. By
examining social, legal, and statistical aspects, as well as their implications in
sociology and economics, the box delves into the multifaceted nature of gender

and its significance in understanding power dynamics and social relations.4 As an interesting anecdote highlighting the complexities surrounding gender identifica-
tion, it is worth noting a case from 2018. In this instance, a cisgender Canadian
man deliberately changed the gender identification on his driver’s license from male
to female with the intention of obtaining lower motor vehicle liability insurance
premiums typically available to “female drivers,” as reported by Ashley (2018). This
anecdote serves to underscore how gender markers can be manipulated for personal
gain and raises questions about the fairness and potential loopholes within certain
systems or policies.

Box 6.3 Gender Identity, in Canada


The gender identity categories offered as potential responses represent the
considerable diversity in how individuals and groups understand, experience,
and express gender identity.
Gender fluid refers to a person whose gender identity or expression
changes or shifts along the gender spectrum.
Man refers to a person who internally identifies and/or publicly expresses
as a man. This may include cisgender and transgender individuals. Cisgender
means that one’s gender identity matches one’s sex assigned at birth.
Nonbinary refers to a person whose gender identity does not align with a
binary understanding of gender such as man or woman.
Trans man refers to a person whose sex assigned at birth is female, and
who identifies as a man.
Trans woman refers to a person whose sex assigned at birth is male, and
who identifies as a woman.
Two-Spirit is a term used by some North American Indigenous people
to indicate a person who embodies both female and male spirits or whose
gender identity, sexual orientation, or spiritual identity is not limited by the
male/female dichotomy.
Woman refers to a person who internally identifies and/or publicly
expresses as a woman. This may include cisgender and transgender individu-
als. Cisgender means that one’s gender identity matches one’s sex assigned at
birth.

4 From https://science.gc.ca/site/science/en/interagency-research-funding/policies-and-
guidelines/self-identification-data-collection-support-equity-diversity-and-inclusion/.

6.2.2 Sex, Risk and Insurance

Many books, published in the XIX-th and early XX-th centuries, mention that men and women have very different behaviors when it comes to insurance. According to Fish (1868),
“upon no class of society do the blessings of life insurance fall so sweetly as
upon women. And yet” agents “have more difficulty in winning them to their cause
than their husbands.” Phelps (1895) asked explicitly “do women like insurance?”
whereas Alexander (1924) collects fables and short stories, published by insurance
companies, with the idea of scaring women by dramatizing the “disastrous economic
consequences of their stubbornness,” as Zelizer (2018) named it.
Women live longer than men across the world, and scientists have by and large
linked this sex difference in longevity to biological foundations of survival. A
new study of wild mammals has found considerable differences in life span and
aging in various mammalian species. Among humans, women's life span is, on
average, almost 8% longer than men's. But among wild mammals, females in
60% of the studied species have, on average, 18.6% longer lifespans. Everywhere
in the world women live longer than men—but this was not always the case (see
also life tables in Chap. 2). In Fig. 6.2, we can compare, on the left-hand side,
the probability (at birth) of dying between 30 and 70 years of age, in several
countries, for women (x-axis) and men (y-axis). On the right-hand side, we compare
life expectancy at birth.

Fig. 6.2 Probability of dying between 30 and 70 years old, on the left-hand side, for most
countries, and life expectancy at birth, on the right-hand side. Women are on the x-axis whereas
men are on the y-axis. The size of the dots is related to the population size (data source: Ortiz-
Ospina and Beltekian (2018))

6.2.3 The “Gender Directive”

The 2004 EU Goods and Services Directive, Council of the European Union (2004),
was aimed at reducing gender gaps in access to all goods and services, discussed for
example by Thiery and Van Schoubroeck (2006). A special derogation in Article
5(2) allowed insurers to set gender-based prices for men and women. Indeed,
“Member States may decide (...) to allow proportionate differences in premiums
and benefits for individuals where the use of sex is a determining factor in the
assessment of risk, on the basis of relevant and accurate actuarial and statistical
data.” In other words, this clause allowed an exception for insurance companies,
provided that they produced actuarial and statistical data that established that sex
was an objective factor in the assessment of risk. The European Court of Justice
canceled this legal exception in 2011, in a ruling discussed at length by Schmeiser
et al. (2014) or Rebert and Van Hoyweghen (2015), for example. This regulation,
which generated a lot of debate in Europe in 2007 and then in 2011, echoes questions
raised in the USA several decades earlier, in the late 1970s, by Martin
(1977), Hedges (1977), and Myers (1977). For example, in the case of Los Angeles
Department of Water and Power v. Manhart, the Supreme Court considered a
pension system in which female employees made higher contributions than males
for the same monthly benefit because of longer life expectancy. The majority
ultimately determined that the plan violated Title VII of the Civil Rights Act of
1964 because it assumes that individuals conform to the broader trends associated
with their gender. The court suggested that such discrimination might be troubling
from a civil rights perspective because it does not treat individuals as such, as
opposed to merely members of a group that they belong to. These laws were
driven, in part, by the fact that employment decisions are generally individual: a
specific person is hired, fired, or demoted, based on his or her past or expected
contribution to the employer’s mission. In contrast, stereotyping of individuals based
on group characteristics is generally more tolerated in fields such as insurance,
where individualized decision making does not make sense.
In Box 6.4, Avner Bar-Hen discusses measurement of gender-related inequalities.

Box 6.4 Gender Inequality and Discrimination, by Avner Bar-Hen5

5 Professor at the Conservatoire National des Arts et Métiers in Paris.

We all have our own ideas about inequality and talk about it as if it were
something simple. Let us take a look at the making of gender disparity
indicators. Traditionally, there are two main approaches to measuring gender
inequality: (i) The first method measures gender equality using
surveys, usually as a complement to other questions. In this case, knowledge
of gender equality is not the main objective. (ii) The second approach is based
on targets and statistics to quantify them. The first international indices only
date back to 1995 and are heavily based on the UNDP Human Development
Index (HDI). The HDI is a well-theorized and widely disseminated concept;
this makes it possible to know fairly precisely what is being measured, but
international comparisons are difficult. The complementary variables to the
HDI that have been proposed are often interesting, but above all complicate
the readability of the results.
Since the 2000s, several alternatives have been proposed. Among the most
popular are the Gender Equity Index (introduced by Social Watch in 2004)
and the Global Gender Gap Index (GGGI, proposed by the World Economic
Forum in 2006). These indices claim to be measures of gender equality, but
their general concepts are not clearly formulated. For example, the GGGI
ignores the underlying causes of gender inequality, such as health. Other
proposals such as the Social Institutions and Gender Index (SIGI, proposed by
the Organization for Economic Co-operation and Development in 2007) focus
not only on social institutions that affect gender equality but also on family
codes or property rights. This index can be seen as a combined measure of
women’s disadvantages relative to men regarding certain basic rights and the
fulfillment of distinctive rights for women. The lack of symmetry in the SIGI
indicators and the scales of the indicators makes it ultimately unclear what is
being measured.
The purpose of these different indices is to assess multiple aspects of
gender disparities, both to support academic research on the causes and
consequences of gender inequality and to inform public policy debates. A
single index is never the solution to the problem caused by the large number
of indicators: it is therefore necessary to compare different gender indices.
However, the description of methodologies uses different terminologies, does
not adequately describe the methodological choices, and is often silent on
potential sources of measurement error. Finally, index developers are rarely
explicit about the overall concept they seek to measure. Another important
point is that the use of gender data raises questions about the quality of
the data. If they are of questionable quality, the legitimacy of their use for
collective purposes is questionable. The methodological choices underlying
the construction of indices often show that what the index measures is
different from what it claims to determine!
The valuable contributions made by the proponents of gender indices must
be recognized. They are a resource that has made it possible to describe
gender disparities and promote women’s rights in a more effective way than
was possible before 1995. Having a comprehensive empirical index is better
than not having one, even if current measures suffer from methodological
weaknesses. However, producing multiple indices also creates a problem for
users: they do not have the means to compare them with each other.

6.3 Age Discrimination

As Robbins (2015) said, “if you are not already part of a group disadvantaged by
prejudice, just wait a couple of decades—you will be.” As explained in Ayalon and
Tesch-Römer (2018), “ageism” is a recent concept, defined in a very neutral way in
Butler (1969) as “a prejudice by one age group against another age group.” Butler
(1969) argued that ageism represents discrimination by the middle-aged group
against the younger and older groups in society, because the middle-aged group is
responsible for the welfare of the younger and older age groups, which are seen as
dependent. But according to Palmore (1978), ageing is seen as a loss of functioning
and abilities, and therefore, it carries a negative connotation. Accordingly, terms
such as “old” or “elderly” have negative connotations and thus should be avoided.
Age contrasts with race and sex in that our membership of age-based groups
changes over time, as mentioned in Daniels (1998). And unlike caterpillars becom-
ing butterflies, human aging is usually considered as a continuous process. There-
fore, the boundaries of age-defined categories (such as the “under 18” or the “over
65”) will always be based on arbitrariness.
In Fig. 6.3, the screening process used by the NHS during the COVID-19
pandemic is explained.6 It is explicitly based on a score (with points added) that is a
function of the age of the person arriving at the hospital. Thus, the first treatment received
will be based on the age of the person, which corresponds to a strong selection bias,
based on a possibly sensitive attribute.

6.3.1 Young or Old?

According to the Council of the European Union (2018), differences in treatment "on the
grounds of age do not constitute discrimination (...) if age is a determining factor
in the assessment of risk for the service in question and this assessment is based
on actuarial principles and relevant and reliable statistical data.” In the USA, the
idea that age can be grounds for discrimination was reflected in the 1967 Age
Discrimination in Employment Act, which followed the 1964 Civil Rights Act,

6 From https://www.nhsdghandbook.co.uk/wp-content/uploads/2020/04/COVID-Decision-Support-Tool.pdf.

Fig. 6.3 COVID-19 Decision Support Tool used in England, in March 2020, provided by the NHS
(National Health Service)

which focused primarily on ethical and racial issues, as shown in the example of
Macnicol (2006). In the majority of cases, age discrimination is considered from
an employment perspective, as in Duncan and Loretto (2004) or Adams (2004).
In terms of insurance, age is considered “less discriminatory” than gender, as we
have seen, because as Macnicol (2006) observes, age is not a club in which one
enters at birth, and it will change with time. We can observe some insurers refusing
to discriminate according to age, embracing this as a form of "raison d'être" (in
the sense given by the 2019 French Pact law). In France, for example, Mutuelle
Générale is committed to strengthening solidarity between generations. This commitment
should be reflected in a refusal to discriminate, or segment, according to this criterion.
But just as a distinction exists between biological sex and gender, some suggest
distinguishing between biological age and perceived (or subjective) age, such as
Stephan et al. (2015) or Kotter-Grühn et al. (2016). Uotinen et al. (2005) showed
that this subjective age would be a better predictor of mortality than biological
age. As Beider (1987) points out, it can be argued that people do not have a
fair chance based on their age, because not everyone ages equally and people die at
different ages. Bidadanure (2017) reminds us that age discrimination is always
perceived as less “preoccupying” than other kinds. The aging process, from birth to
adulthood, correlates with various developmental and cognitive processes that make
it relevant to assign different responsibilities, consent capacities, and autonomy
to children, young adults, or the elderly. But unlike sex and race, age is not a
discrete and immutable characteristic. We expect to go through the different stages
of a life and old age is a club we know we will most likely be joining one day.
Therefore, differential treatment on age does not necessarily generate inequalities
among people over time whereas differential treatment on ethnicity and gender does:
“a society that relentlessly discriminates against people because of their age can
still treat them equally throughout their lives. Everyone’s turn [to be discriminated
against] is coming,” said Gosseries (2014). Citing a decision of the Court of Appeal
of 2008, Mercat-Bruns (2020) recalls that “the legislator was careful to make a
distinction between age and health, and these two grounds cannot be confused by
considering that advanced age necessarily implies poor health."
In automobile insurance, as recalled by Cooper (1990), Liisa (1994) and Clarke
et al. (2010), the increase in the number of claims among the elderly can be
explained by a loss of sensory and motor acuity, the use of medication (in particular
psychotropic drugs), and a decrease in reflexes. But the elderly also tend to drive
less, as pointed out by Fontaine (2003). Figure 6.4 shows the frequency of accidents
and the number of fatalities, per million miles driven, for different age groups. The
risk of personal injury (or death) in a car accident increases significantly from age
60 onward and continues to increase rapidly with age, as shown by Li et al. (2003).
But in the majority of countries, the average risk of older people does not appear
to be particularly high (as long as older people are seen as a homogeneous
group). Figure 6.5 shows the evolution of the (annual) frequency of claims, the
average cost of claims, and the average premium for automobile insurance in
Quebec, as a function of the age of the insured (by age group), for collision or upset coverage.
To go further, Dulisse (1997), Meuleners et al. (2006) and Cheung and McCartt
(2011) note that the share of responsibility in accidents also increases with age, in
particular with more accidents involving vehicles coming from the right, often reflecting a failure to give way.
Many countries have raised the question of regulation in relation to very advanced
ages (over 80 years).

Fig. 6.4 Number of crashes (left) and number of fatalities (right), per million miles driven, for
both males and females (males in blue and females in red), by driver age. The reference (0) is
men aged 30–60 years. The number of accidents is three times higher (+200%) for those over 85,
and the number of deaths more than ten times higher (+900%) (data source: Li et al. (2003))

Fig. 6.5 From left to right, average written premium (Canadian dollars), claims frequency, and average claims cost (Canadian dollars), by age group (x-axis)
and gender (males in blue and females in red) in Quebec (data source: Groupement des Assureurs Automobiles (2021))

In terms of disability, insurers are not allowed to discriminate
on the basis of disability if the person is allowed to drive. But for degenerative
diseases, a few laws explicitly prohibit driving, for example, for someone with an
established disease, such as Parkinson's disease (Crizzle et al. 2012). The fact that
older people are more often responsible for accidents raises many moral questions, as
putting oneself at risk as a driver is one thing, but potentially injuring or killing
others is less acceptable.

6.4 Genetics versus Social Identity

6.4.1 Genetics-Related Discrimination

Genetic discrimination, or “genoism,” occurs when people treat others (or are
treated) differently because they have or are perceived to have a gene mutation(s)
that causes or increases the risk of an inherited disorder. According to Ajunwa
(2014, 2016), “genetic discrimination should be defined as when an individual
is subjected to negative treatment, not as a result of the individual’s physical
manifestation of disease or disability, but solely because of the individual’s genetic
composition.” This concept is related to “genetic determinism” (as defined in
de Melo-Martín (2003) and Harden (2023)) or more recently “genetic essentialism”
(as in Peters (2014)). Dar-Nimrod and Heine (2011) defined “genetic essentialism”
as the belief that people of the same group share some set of genes that make
them physically, cognitively, and behaviorally uniform, but different from others.
Consequently, “genetic essentialists” believe that some traits are not influenced (or
only a little) by the social environment. But as explained in Jackson and Depew
(2017), essentialism is genetically inaccurate because it not only overestimates the
amount of genetic differentiation between human races but it also underestimates
the amount of genetic variation among same-race individuals (Rosenberg 2011;
Graves Jr 2015).
An important issue in the definition of Ajunwa (2014) is that the negative treatment must
result solely from the individual's genetic composition, and not from any physical
manifestation of disease or disability. And that is usually difficult to assess (as
we can see, for instance, with obesity). But other definitions are more vague.
For example, according to Erwin et al. (2010), “the denial of rights, privileges,
or opportunities or other adverse treatment based solely on genetic information,
including family history or genetic test results,” could be seen as “genetic discrim-
ination.” According to the legislation in Florida, USA (Title XXXVII, Chapter 627,
Section 4301, 2017) “genetic information means information derived from genetic
testing to determine the presence or absence of variations or mutations, including
carrier status, in an individual’s genetic material or genes that are scientifically or
medically believed to cause a disease, disorder, or syndrome, or are associated with
a statistically increased risk of developing a disease, disorder, or syndrome, which
is asymptomatic at the time of testing. Such testing does not include routine physical
examinations or chemical, blood, or urine analysis unless conducted purposefully
to obtain genetic information, or questions regarding family history.” In Box 6.5,
an excerpt explaining what “genetic tests” are is provided, from Bélisle-Pipon et al.
(2019).

Box 6.5 Genetic Tests: Individuals and Insurers, Bélisle-Pipon et al. (2019)

“‘Genetic tests’ largely refers to germline (not somatic) DNA changes that
are discovered in apparently healthy individuals, through direct-to-consumer
genetic testing, consumer-facing but physician-mediated genetic testing over
the internet or the rare but growing phenomenon of predictive genomics
clinics in conventional medical environments. This testing may result in
the identification of disease risk information through either the presence of
monogenic risk variants, Torkamani et al. (2018), or extremes in polygenic
risk scores, Vassy et al. (2017). DNA testing is not fully accurate at predicting
disease, and this inaccuracy can arise in two ways. The techniques used
(generally next-generation sequencing) can have analytic errors, producing
incorrect ‘calls’ such that the variants identified are simply wrong. These
types of errors are increasingly rare as the technology improves. However,
simply carrying a pathogenic variant or a high polygenic risk score does
not mean that someone will eventually get the disease, as genetic markers
have variable ‘penetrance’, Cooper et al. (2013). Thus, genetic markers, even
well-accepted pathogenic variants, may occur in individuals who will never
develop the disease in question. While these points are true, the presence of
both pathogenic variants for monogenic diseases and very high polygenic risk
scores may increase the probability that an individual will develop the disease
in question and thus can form the basis of the discrimination concerns. While
there is an ongoing debate about the clinical utility of this information in the
individual patient, there is no question that such information can predict risk
on a population basis and is thus of great interest to life insurance companies
and others that are in the business of estimating and monetizing risk.”

According to Rawls (1971), the starting point for each person in society is
the result of a social lottery (the political, social, and economic circumstances in
which each person is born) and a natural lottery (the biological potentials with
which each person is born—recently, Harden (2023) revisited this genetic lottery
and its social consequences). John Rawls argues that the outcome of each person’s
social and natural lottery is a matter of good or bad “fortune” or “chance” (like
ordinary lotteries). And as it is impossible to deserve the outcome of this lottery,
discrimination resulting from these lotteries should not exist. For egalitarians (“luck-
egalitarians”), it is appropriate to eliminate the differential effects on people’s
interests that, from their perspective, are a matter of luck. Affirmative action in
favor of women is a means of neutralizing the effects of sexist discrimination. Stone
(2007) revisits the idea that this ex ante equality is part of what makes lotteries
fair and appealing. Abraham (1986) discusses the consequences of natural lotteries
in insurance. Around the same time, Wortham (1986) stated, “those suffering from
disease, a genetic defect, or disability on the basis of a natural lottery should not be
penalized in insurance.”
As Natowicz et al. (1992) explained, “People at risk for genetic discrimina-
tion are (1) those individuals who are asymptomatic but carry a gene(s) that
increases the probability that they will develop some disease, (2) individuals who
are heterozygotes (carriers) for some recessive or X-linked genetic condition but
who are and will remain asymptomatic, (3) individuals who have one or more
genetic polymorphisms that are not known to cause any medical condition, and (4)
immediate relatives of individuals with known or presumed genetic conditions.”

6.4.2 Social Identity

As discussed so far, we have seen that it is perceived as "unfair" or "discriminatory"
if individuals (or more precisely groups of individuals) face higher premiums, or
limited coverage, owing to characteristics that they cannot control. That would be
the case of genetic or biological characteristics. But when we talk about "racism,"
or "gender-neutral pricing," we are talking about criteria that are not biologically
but socially defined. We are talking about social identity. Social identity refers to
a person’s membership in a social group. The common groups that make up a
person’s social identity are age, ability, ethnicity, race, gender, sexual orientation,
socio-economic status, and religion, as discussed by Tajfel (1978) and Tajfel et al.
(1986).
The notion of “identity” is paradoxical. It articulates similarity and difference,
uniqueness and community. For the individual, as for the group, identity is the result
of complex interactions between feeling and objective social determinants, between
self-perception and the gaze of others, between the intimate and the social. These
are not just external factors that influence self-representation, but the constitutive
processes on which identity is founded. The second paradox is linked to the fact
that the individual must think of others in order to think of herself, or himself.
It is because of their relationship with others that they become self-aware and
construct their own identity. An individual’s “social identity” is the sum of relations
of inclusion and exclusion in relation to the sub-groups that make up a society.
In many cases, sensitive attributes are related to self-identification. Affirmative
action measures require applicants to self-identify as “native,” disabled, female,
racialized, or gender minorities.

6.4.3 “Lookism,” Obesity, and Discrimination

Williams (2017) and Liu (2017) discussed discrimination based on people’s appear-
ance, coined “lookism.” To quote Liu (2015), “everybody deserves to be treated
based on what kind of person he or she is, not based on what kind of person other
people are." This is probably one reason why the question of discrimination against
"overweight people" is challenging. As explained in Loos and Yeo (2022), although it is undeniable that
changes in the environment have played a significant role in the rapid rise of obesity,
it is important to recognize that obesity is the outcome of a complex interplay
between environmental factors and inherent biological traits. And it is the "social
norm" that exposes overweight people to possible discrimination.
The stereotype “fat is bad” has existed in the medical field for decades, as
Nordholm (1980) reminds us. Further study is needed to ascertain how this affects
practice. It appears that obese individuals, as a group, avoid seeking medical
care because of their weight. As claimed by Czerniawski (2007), “with the rise
of actuarial science, weight became a criterion insurance companies used to
assess risk. Used originally as a tool to facilitate the standardization of the
medical selection process throughout the life insurance industry, these tables later
operationalized the notion of ideal weight and became recommended guidelines
for body weights.” At the end of the 1950s, 26 insurance companies cooperated to
determine the mortality of policyholders according to their body weight, as shown
in Wiehl (1960). The conclusion is clear with regard to mortality: “Studies bring out
the clear-cut disadvantage of overweight-mortality ratios rising in every instance
with increase in degree of overweight." This is also what Baird (1994) said,
40 years later: "obesity is regarded by insurance companies as a substantial risk for
both life and disability policies.” This risk increases proportionally with the degree
of obesity (the same conclusion is found in Lew and Garfinkel (1979) or Must et al.
(1999)).

6.5 Statistical Discrimination by Proxy

In statistics, as explained by Upton and Cook (2014), a “proxy” (we will also use
“substitute variable”) is a variable that, in a predictive model, replaces a useful but
unobservable, unmeasurable variable.
Definition 6.1 (Proxy (Upton and Cook 2014)) A proxy is a measurable variable
that is used in place of a variable that cannot be measured.
And for a variable to be a good proxy, it must have a good correlation with the
interest variable. A relatively popular example is the fact that in elementary school,
shoe size is often a good proxy for reading ability. In reality, foot size has little to do
with cognitive ability, but in children, foot size correlates greatly with age, which in

turn correlates greatly with cognitive ability. This concept is quite related to notions
of causality, as discussed in the next chapter.
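
To make the shoe-size illustration concrete, here is a minimal simulation sketch in Python (the coefficients, noise levels, and sample size are invented purely for illustration): shoe size and reading ability are both generated from age, with no direct link between them, yet their marginal correlation is strong and essentially disappears once age is controlled for.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Age (in years) of elementary-school children: the common cause (confounder)
age = rng.uniform(6, 12, size=n)

# Shoe size and reading score both depend on age, not on each other
# (coefficients are purely illustrative)
shoe_size = 28 + 1.2 * age + rng.normal(0, 1.0, size=n)
reading = 10 + 8.0 * age + rng.normal(0, 5.0, size=n)

# Marginal correlation: shoe size seemingly "predicts" reading ability
print("cor(shoe size, reading)      :", round(np.corrcoef(shoe_size, reading)[0, 1], 2))

def residuals(y, x):
    """Residuals of the least-squares regression of y on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Partial correlation, controlling for age: essentially zero
r_shoe = residuals(shoe_size, age)
r_read = residuals(reading, age)
print("cor(shoe size, reading | age):", round(np.corrcoef(r_shoe, r_read)[0, 1], 2))
```

The same mechanism is at work whenever a rating variable owes its predictive power only to its correlation with an unobserved, possibly protected, characteristic.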
We mentioned earlier economists’ vision of discrimination, such as Gary Becker,
and the link with a form of rationality and efficiency. And indeed, for many authors,
what matters is that the association between variables is strong enough to make
up a reliable predictor. For Norman (2003), group membership provides reliable
information on the group, and by extension on any individual who is a member
of it; the systematic use of this information (the generalization and stereotyping
discussed in Schauer (2006) and Puddifoot (2021)) can be economically efficient.
Taking an ethical counterpoint, Greenland (2002) reminds us that some information
sources should be excluded from our decision making because they are irrelevant, or
noncausal, even though they may provide fairly reliable information because of their
strong correlation with another indicator. The central argument is that if variables
are noncausal, then they lack moral justification. And proxy discrimination raises
complex ethical issues. As Birnbaum (2020) states, “if discriminating intentionally
on the basis of prohibited classes is prohibited—e.g., insurers are prohibited from
using race, religion or national origin as underwriting, tier placement or rating
factors—why would practices that have the same effect be permitted?” In other
words, it is not enough to compare paid premiums, but the narrative process
of modeling (i.e., the notions of interpretability and explicability, or the “fable”
mentioned in the introduction) is equally important in judging whether a model
is discriminatory or not. And the difficulty, as Seicshnaydre (2007) said, is that it
is not about looking for a proof of a sexist, or racist intention, or motivation, but
of establishing that an algorithm discriminates according to a prohibited (because
protected) criterion.
Obermeyer et al. (2019) reports that in 2019, a large health care company had
used medical expenses as a proxy for medical condition severity. The use of this
proxy resulted in a racially discriminatory algorithm, because although the severity of the
medical condition may not differ across racial groups, health care spending does (in the USA
at least). More generally, all sorts of proxies are used, more or less correlated
with the variables of interest. For example, a person’s (or household’s) income is
estimated by income tax, or by living conditions (or the neighborhood where the
person lives). The first version of this paper speaks of "indirect risk factors." We return
to the importance of causal graphs for understanding whether one variable causes
another, or whether they simply correlate, in Sect. 7.1.
There are also certain quantities that are essential for modeling decision making
in an uncertain context, but that are difficult to measure. This is the case of the
abstract concept of “risk aversion” (widely discussed by Menezes and Hanson
(1970) and Slovic (1987)). Hofstede (1995) proposes an uncertainty avoidance
index, calculated from survey data. The first two studies, Outreville (1990) and
1996, suggested using the education level to assess risk aversion. According to
Outreville (1996), education promotes an understanding of risk and therefore an
increase in the demand for insurance, for example (although an inverse relationship
could exist, if one assumes that increased levels of education are associated with an
increase in transferable human capital, which induces greater risk-taking).

Recently, the Court of Justice of the European Union issued a ruling on 1 August
2022 with potentially far-reaching implications about inferred sensitive data (Case
C-184/20, ECLI:EU:C:2022:601). In essence, the question posed to the European
court was whether the disclosure of information such as the name of a partner
or spouse would constitute processing of sensitive data within the meaning of the
GDPR, even though such data is not in itself directly sensitive, but only allows the
indirect inference of sensitive information, such as the sexual orientation of the data
subject. More precisely, the question was “The referring court asks, in substance,
whether (...) the publication (...) of personal data likely to disclose indirectly
the political opinions, trade-union membership or sexual orientation of a natural
person constitutes processing of special categories of personal data (...)." The
Court of Justice of the European Union has made a clear ruling on this issue, stating
that processing of sensitive data must be considered to be “processing not only of
intrinsically sensitive data, but also of data which indirectly reveals, by means of
an intellectual deduction or cross-checking operation, information of that kind.”
For example, location data indicating places of worship or health facilities visited
by an individual could now be qualified as sensitive data, as well as the recording
of the order of a vegetarian menu at a restaurant in a food delivery application.
And typically, although the pages "liked" by a user of a social network or the groups to
which they belong are not technically sensitive data, membership in a support group
for pregnant women, or the placing of "likes" on the pages of politically oriented
newspapers, allows the deduction of quite precise sensitive information relating
to a person's state of health or political positions.
“Humans think in stories rather than facts, numbers or equations—and the
simpler the story, the better,” said Harari (2018), but for insurers, it is often a mixture
of both. For Glenn (2000), like the Roman god Janus, an insurer’s risk selection
process has two sides: the one presented to regulators and policyholders, and the
other presented to underwriters. On the one hand, there is the face of numbers,
statistics, and objectivity. On the other, there is the face of stories, character, and
subjective judgment. The rhetoric of insurance exclusion—numbers, objectivity,
and statistics—forms what Brian Glenn calls “the myth of the actuary,” “a powerful
rhetorical situation in which decisions appear to be based on objectively determined
criteria when they are also largely based on subjective ones” or “the subjective
nature of a seemingly objective process”. And for Daston (1992), this alleged
objectivity of the process is false, and dangerous, as also pointed out by Desrosières
(1998). Glenn (2003) claimed that there are many ways to rate accurately. Insurers
can rate risks in many different ways depending on the stories they tell on which
characteristics are important and which are not. “The fact that the selection of risk
factors is subjective and contingent upon narratives of risk and responsibility has in
the past played a far larger role than whether or not someone with a wood stove is
charged higher premiums.” Going further, “virtually every aspect of the insurance
industry is predicated on stories first and then numbers.” We remember Box et al.
(2011)’s “all models are wrong but some models are useful,” in other words, any
model is at best a useful fable.

6.5.1 Stereotypes and Generalization

As Bernstein (2013) reminds us, the word “stereotype” merges a Greek adjective
meaning solid, στερεός, with a noun meaning a mold, τύπος. Combining the two
terms, the word refers to a hard molding, something that can leave a mark, which
gave rise to a printing term, namely the printing form used for letterpress printing. In 1802,
the dictionary of the French Academy mentions, for the word "stereotype," "a new
word which is said of stereotyped books, or printed with solid forms or plates,"
conveying the idea of an image perpetuated without change. The American journalist and
public intellectual Walter Lippmann gave the word its contemporary meaning in
Lippmann (1922). For him, it was a description of how human beings fit “the
world outside" into "the pictures in our heads," which form simplified descriptive
categories by which we seek to locate others or groups of individuals. Walter
Lippmann tried to explain how images that spontaneously arise in people’s minds
become concrete. Stereotypes, he observed, are “the subtlest and most pervasive of
all influences." A few years later, the first experiments to better understand
this concept began. One could observe that Lippmann (1922) was one of the first books
on public opinion, manipulation, and storytelling. Therefore, it is natural to see
connections between the word "stereotype" and storytelling, as well as explanations
and interpretations.
The importance of stereotypes in understanding many decision-making processes
is analyzed in detail in Kahneman (2011), inspired in large part by Bruner (1957),
and more recently, Hamilton and Gifford (1976) and especially Devine (1989). For
Daniel Kahneman, schematically, two types of mechanisms are used to make a
decision. System 1 is used for rapid decision making: it allows us to recognize
people and objects, helps us to direct our attention, and encourages us to fear spiders.
It is based on knowledge stored in the memory and accessible without intention, and
without effort. It can be contrasted with System 2, which allows decision making
in a more complex context, requiring discipline and sequential thinking. In the first
case, our brain uses the stereotypes that govern representativeness judgments, and
uses this heuristic to make decisions. If I cook a fish for some friends to eat, I
open a bottle of white wine. The stereotype “fish goes well with white wine” allows
me to make a decision quickly, without having to think. Stereotypes are statements
about a group that are accepted (at least provisionally) as facts about each member.
Whether correct or not, stereotypes are the basic tool for thinking about categories in
System 1. But often, further, more constructed thinking—corresponding to System
2—leads to a better, even optimal decision. Without choosing just any red wine,
perhaps a pinot noir would also be perfectly suited to grilled mullet. As Fricker
(2007) asserted, “stereotypes are [only] widely held associations between a given
social group and one or more attributes.” Isn’t this what actuaries do every day?

6.5.2 Generalization and Actuarial Science

In the “Ten Oever” judgment (Gerardus Cornelis Ten Oever v Stichting Bedrijf-
spensioenfonds voor het Glazenwassers- en Schoonmaakbedrijf, in April 1993),
the Advocate General Van Gerven argued that “the fact that women generally
live longer than men has no significance at all for the life expectancy of a
specific individual and it is not acceptable for an individual to be penalized on
account of assumptions which are not certain to be true in his specific case,”
as mentioned in De Baere and Goessens (2011). Schanze (2013) used the term
“injustice by generalization.” But at the same time, as explained by Schauer (2006),
this “generalization” is probably the reason for the actuary’s existence: “To be an
actuary is to be a specialist in generalization, and actuaries engage in a form of
decision-making that is sometimes called actuarial.” This idea can be found in
actuarial justice, for example in Harcourt (2008). Schauer (2006) reported that we
might be led to believe that it is better to have airline pilots with good vision than bad
ones (this point is also raised in the context of driving, and car insurance, Owsley
and McGwin Jr (2010)). This criterion could be used in hiring, and would, of course,
constitute a kind of discrimination, distinguishing “good” pilots from “bad” ones,
pilots with good vision from others. Some airlines might impose a maximum age
for airline pilots (55 or 60, for example), age being a reliable, if imperfect, indicator
of poor vision (or more generally of poor health, with impaired hearing or slower
reflexes). If we exclude the elderly from being commercial airline pilots we will
end up, ceteris paribus, with a cohort of airline pilots who have better vision, better
hearing, and faster reflexes. The explanation given here is clearly causal, and the
underlying goals of the discrimination then seem clearly legitimate, so that even the
use of age becomes “proxy discrimination” in the sense of Hellman (1998), called
“statistical discrimination” in Schauer (2017).
For Thiery and Van Schoubroeck (2006) (but also Wortham (1985)), lawyers
and actuaries have fundamentally different conceptions of discrimination and
segmentation in insurance, one being individual, the other collective, as illustrated
in the USA by the Manhart and Norris cases (Hager and Zimpleman 1982; Bayer
1986). In the Manhart case in 1978, the Court ruled that an annuity plan in which
men and women received equal benefits at retirement, even though women made
larger contributions, was illegal. In 1983, the Supreme Court ruled in the Norris
case that the use of gender-differentiated actuarial factors for benefits in pension
plans was illegal because it fell within the prohibition against discrimination. These
two decisions are a legal affirmation that insurance technique could not always be
used as a guarantee to justify differential treatment of members of certain groups
in the context of insurance premium segmentation. Indeed, legally, the right to
equal treatment is one that is granted to a person in his or her capacity as an
individual, not as a member of a racial, sexual, religious, or ethnic group. Therefore,
an individual cannot be treated differently because of his or her membership in
such a group, particularly one to which he or she has not chosen to belong. These
orders emphasize that individuals cannot be treated as mere components of a racial,
religious, or sexual class, asserting that fairness to individuals trumps fairness to
classes. But this view is fundamentally opposed to the actuarial approach, which
historically analyzes risks and calculates premiums in terms of groups. As explained
in Chap. 3, and Barry (2020b), until recently, actuaries considered individuals only
as members of a group.
The actuarial approach is the one mentioned in the first paragraph. An individual
belonging to a group with a higher statistical risk of survival or death ends up paying
a higher premium or receiving fewer benefits. In automobile insurance, an individual
belonging to a group with a higher statistical risk of accident has to pay higher
premiums. Brilmayer et al. (1983) recalled that it is
the differences between the probabilities of having an accident according to gender
(and not individual differences) that are taken into account to justify the difference
in premiums, to explain the difference in benefits, or to base a selection mechanism.
Insurance classification systems are based on the assumption that individuals meet
the average (one might say stereotyped) characteristics of a group to which they belong.
Insurers argue that current statistics indicate that, on average, more women than
men drive without an accident and that, as a result, the average woman has a lower
loss expectancy than the average man. Based on these data, women have to pay
a lower premium than men. Insurance companies aim to preserve equality between
groups, not individuals, and this is the reason why insurers think in terms of “average
woman” and “average man.”
A foundational principle of insurance is the idea of risk pooling, as explained
in Chap. 2, that is, the formation of groups. Risk in insurance cannot be considered
without this notion of mutualization, and this is the major difference with financial
mathematics, where there is a fundamental value of a risk (in a market). Mutual-
ization is intrinsic to the segmentation of insurance risks, and imposes a form of
solidarity within the group, as all the premiums of a group must be statistically
entirely compensated for by all the reimbursements of this same group. The insurer
then imposes solidarity between policyholders who have the same risk profile (with
a comparable probability of loss and size of loss). This is known as “pure luck
solidarity,” as coined in Barry (2020a). Without segmentation, or if the groups
are not composed of members with a comparable risk profile, we will observe a
phenomenon of subsidizing solidarity, in the sense that a person with a certain risk
profile pays part of the losses of persons with a higher loss expectancy.
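
A minimal numerical sketch of this distinction (the claim probabilities, claim amounts, and group sizes below are made up for illustration): with segmentation, each homogeneous group pays its own expected loss; with a single pooled premium, the lower-risk group pays more than its expected loss and thereby subsidizes the higher-risk group.

```python
# Two risk profiles, with illustrative (made-up) annual claim frequencies and severities
groups = {
    "low risk":  {"n": 8_000, "claim_prob": 0.05, "avg_claim": 2_000},
    "high risk": {"n": 2_000, "claim_prob": 0.20, "avg_claim": 2_000},
}

# Segmented pure premiums: each group pays its own expected loss
for name, g in groups.items():
    g["pure_premium"] = g["claim_prob"] * g["avg_claim"]
    print(f"{name:9s}: segmented pure premium = {g['pure_premium']:.0f}")

# Pooled premium: one price for everyone, balancing total premiums and total expected losses
total_expected_loss = sum(g["n"] * g["pure_premium"] for g in groups.values())
total_insured = sum(g["n"] for g in groups.values())
pooled_premium = total_expected_loss / total_insured
print(f"pooled premium = {pooled_premium:.0f}")

# Subsidy: difference between what each group pays and what it is expected to cost
for name, g in groups.items():
    subsidy = pooled_premium - g["pure_premium"]
    print(f"{name:9s}: pays {subsidy:+.0f} relative to its expected loss")
```

In this toy example the low-risk group pays 60 above its expected loss and the high-risk group 240 below it, which is exactly the subsidizing solidarity described above.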
We find again the opposition (using the terminology of Hume (1739)) between
is and ought, between what is and what ought to be, between the statistical norm of
the actuary and the norm of the legislator, which we mentioned in the introduction.
From an empirical, descriptive point of view, to be within the norm means nothing
more than to be within the average, not to be too far removed from this average. We
can then define the norm as the frequency of what occurs most often, as the most
frequently encountered attitude or the most regularly manifested preference. But this
normality is not normativity, and “to be in the norm”, to be exemplary, then comes
under a different dimension, which this time is no longer linked to a description of
reality but to an identification of what it must tend toward. Therefore, we move from
the register of being to that of ought-to-be, from is to ought. It is indeed difficult
to envisage the model (or normality) without slipping toward this second meaning
that can be found in the concept of norm, and which deploys a dimension that is
properly normative. This vision leads to a confusion between norms and laws, even
if all normativity is not expressed in the form of laws. Therefore, David Hume notes
that, in all moral systems, the authors move from statements of fact, i.e., enunciative
statements of the type “there is,” to propositions that include a normative expression,
such as “it is necessary,” “we must.” What he contests is the passage from one type
of assertion to another: for him, these are two types of statements that have nothing
to do with one another, and therefore that cannot be logically linked to each other, in
particular from an empirical norm to a normative rule. For Hume, a non-normative
statement cannot give rise to a normative conclusion. This assertion of David Hume
has triggered many comments and interpretations, in particular because, stated as it
is, it seems to be an obstacle to any attempt at naturalizing morality (as detailed in
MacIntyre (1969) or Rescher (2013)). In this sense, we find the strong distinction
between the norm in regularity (normality) and the rule (normativity).
The principle of equality of human beings, recognized as a fundamental right,
imposes the corresponding obligation not to discriminate. Therefore, to try to define
discrimination is to try to specify what this principle of equality of all people
means in concrete terms. Discrimination is defined as unequal and unfavorable
treatment applied to certain people because of a criterion prohibited by law (i.e.,
race, origin, sex, etc.). According to Article 225-1 of the French Penal Code, “any
distinction made between individuals on the basis of their origin, their gender, ...
constitutes discrimination.” In the USA, the Prohibit Auto Insurance Discrimination
Act, introduced in Congress in July 2019, would prohibit an automobile insurer from
taking certain factors into account when determining the insurance premium or
eligibility for coverage. More explicitly, these prohibited factors include a driver's gender,
employment status, zip code, census tract, marital status, and credit score.
Epstein and King (2002) point out that, unlike traditional statistical models, an
artificial intelligence (AI) algorithm does not proceed by relying on a human's initial
intuition about the causal explanations for the statistical relationships between the
input data and the target variable. Instead, AI algorithms use training data to discover
on their own which features can be used to predict the target variable. Although this
process completely ignores causality, it inevitably leads AIs to “seek out” proxies
for directly predictive features when data on those features are not made available
to the AI because of legal prohibitions (Mittelstadt et al. 2016). As pointed out by
Barocas and Selbst (2016), “therefore, a data mining model with a large number
of variables will determine the extent to which membership in a protected class is
relevant to the sought-after trait whether or not that information is an input.”
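
A minimal sketch of this phenomenon, on simulated data (the data-generating process and all coefficients are invented for illustration; this is not a description of any real insurance portfolio): the protected attribute p is withheld from the learner, but a facially neutral feature correlated with p remains, and the fitted model still produces predictions that differ systematically between the two groups.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000

# Protected attribute (never given to the model)
p = rng.binomial(1, 0.5, size=n)

# A facially neutral feature correlated with p (think of a geographic indicator),
# plus a genuinely risk-related feature; coefficients are purely illustrative
x_geo = rng.normal(loc=2.0 * p, scale=1.0, size=n)
x_risk = rng.normal(loc=0.0, scale=1.0, size=n)

# Outcome depends on p and on the legitimate risk factor
logit = -1.0 + 1.5 * p + 1.0 * x_risk
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# The model never sees p, only the two "neutral" features
X = np.column_stack([x_geo, x_risk])
model = LogisticRegression().fit(X, y)
pred = model.predict_proba(X)[:, 1]

print("average predicted risk, group p=0:", round(pred[p == 0].mean(), 3))
print("average predicted risk, group p=1:", round(pred[p == 1].mean(), 3))
print("coefficient on the geographic proxy:", round(model.coef_[0][0], 2))
```

Dropping the protected attribute therefore does not, by itself, prevent the model from reconstructing it through its correlates.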

6.5.3 Massive Data and Proxy

Again, risk mutualization consists, for a group of economic agents, in pooling a
certain fraction of their resources to compensate the members of the group who
suffer damage, as recalled by Henriet and Rochet (1987). In the early forms of
insurance activity, this principle implied equal participation by each member of
the community, or possibly participation proportional to individual resources, but
certainly not participation proportional to individual risks. As competition in the
insurance markets of several countries (especially in motor insurance) has become
more and more active, a new trend has emerged: the personalization of premiums.
This antagonistic vision of mutualization considers that each insured person should
pay a premium proportional to their individual risk. Many insurers are strongly
opposed to this trend toward personalization, some even rejecting the concept of
“measurable individual risk” as statistical nonsense. However, personalization is
also defended by others for equity reasons (each member participates proportional
to the burden that he or she imposes on the community) and can even be reconciled
with mutualization: if the market is large enough, individuals can be grouped into
a very large number of mutual funds, each being homogeneous from the risk point
of view. With the advent of extensive data and the application of machine-learning
techniques in the insurance industry, the potential for personalization appears to be
within reach, as highlighted by Barry and Charpentier (2020). This could potentially
lead to the concept of a “pool of one risk,” as described by McFall et al. (2020).
Automobile insurance covers risks that are linked to the habits and behaviors
of the driver, as Landes (2015) reminds us. Fire insurance compensates for risks
that arise as a result of carelessness (regarding, for example, electrical appliances or
furnaces). Home theft insurance covers risks that can usually be avoided with proper
care (e.g., extra door locks, camera surveillance system, watch dogs, alarm, etc.).
And actuaries are trying to capture this information, enriching their data. According
to Daniels (1990), there is a moral obligation to ensure that insurance premiums
accurately reflect the risks associated with the insured individuals.
And in the context of risk personalization, the idea that insurance products
must first and foremost be “fair” to the insured is increasingly expressed by
commentators, as discussed by Meyers and Van Hoyweghen (2018). Yet for McFall
(2019), if individual behavior rather than group membership were to become the
basis for risk assessment, the social, economic, and political consequences would
be considerable. It would disrupt the distributive and supportive character that is
expressed in all health insurance plans, even those nominally designated private or
commercial. Personalized risk pricing is at odds with the infrastructure that currently
defines, regulates, and delivers health insurance. Going further, Van Lancker (2020)
sees the widespread use of digital technology as changing the organization of the
postwar welfare state in ways that affect its potential to ensure a decent standard
of living for all. Moor and Lury (2018) explore how pricing has historically been
implicated in the constitution of people and how the ability to “personalize” pricing
reconfigures the ability of markets to discriminate. Recent pricing techniques make
it more difficult for consumers to identify themselves as part of a recognized group.
They spoke of this when they wrote “now, with the evolution of data science
and network computers, insurance is facing fundamental change. With ever more
information available (...) insurers will increasingly calculate risk for the individual
and free themselves from the generalities of the larger pool.”
And although big data may present a danger, insurers like to create synthetic
scores. Beniger (2009) points out that the volume of data is drastically reduced, with
a single variable seemingly containing all the information related to all kinds of risks
for a given individual. In automobile insurance, the vehicle may be associated with
dozens of variables (make, model, maximum speed, weight, fuel, color, number
of seats, presence of safety features, etc.) but most insurers have a “vehiculier”
that summarizes all the information related to the vehicle in about ten classes. In
home insurance, the address of the dwelling, which can be associated with dozens
of variables (flood zone, distance to the nearest fire station, building materials, etc.),
is associated with a “zonier” made up of a few classes.
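
As a rough sketch of how such a synthetic score can be built (purely illustrative: the variables, effects, and the use of a Poisson regression followed by decile binning are assumptions, not a description of any insurer's actual "vehiculier"):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(1)
n = 10_000

# Illustrative vehicle characteristics (made-up effects)
vehicles = pd.DataFrame({
    "power":  rng.normal(100, 30, size=n),    # horsepower
    "weight": rng.normal(1300, 250, size=n),  # kg
    "sport":  rng.binomial(1, 0.15, size=n),  # sporty model indicator
})
true_lambda = np.exp(-3 + 0.004 * vehicles["power"] + 0.5 * vehicles["sport"])
claims = rng.poisson(true_lambda)

# Step 1: estimate a claim-frequency score from the raw vehicle variables
glm = PoissonRegressor(alpha=0, max_iter=300).fit(vehicles, claims)
score = glm.predict(vehicles)

# Step 2: the "vehiculier": collapse the continuous score into ten ordered classes
vehicles["vehicle_class"] = pd.qcut(score, q=10, labels=list(range(1, 11)))

# Each class then replaces all the underlying vehicle variables in the tariff
summary = vehicles.assign(claims=claims).groupby("vehicle_class", observed=True).agg(
    mean_power=("power", "mean"), claim_frequency=("claims", "mean"))
print(summary.round(3))
```

In the tariff, the ten-level class then stands in for all the underlying vehicle characteristics, which is precisely what makes such scores convenient, and also what makes them potential carriers of proxy effects.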
Prince and Schwarcz (2019) used the term “discrimination by proxy” to describe
this incidental impact. Proxy discrimination occurs when a facially neutral feature
is used as a substitute—or proxy—for a prohibited feature. Austin (1983), quoting
Works (1977), claimed that “although the core concern of the underwriter is
the human characteristics of the risk, cheap screening indicators are adopted as
surrogates for solid information about the attitudes and values of the prospective
insured.” The invitations to underwriters to introduce prejudgments and biases
and to indulge amateur psychological stereotypes are apparent. Even generalized
underwriting texts include occupational, ethnic, racial, geographic, and cultural
characterizations certain to give offense if publicly stated. Therefore, Prince and
Schwarcz (2019) considers three types of proxy discrimination: (1) causal proxy
discrimination, (2) opaque proxy discrimination, and (3) indirect proxy discrimina-
tion. Opaque proxy discrimination occurs when one is unable to formally establish
a causal link between the sensitive variable p and the target variable y. In the
genetic context, even for many pathogenic genetic variants, it is often not known
why a particular sequence of a gene leads to increased risk. In both causal and
opaque proxy discrimination, prohibited characteristics are “directly predictive” of
legitimate outcomes of interest. In indirect proxy discrimination, a variable x has
significant predictive power simply because it correlates with the target variable y,
and the true causal variable is not present in the database. A typical example is that
in school, shoe size is indirectly predictive of the number of spelling mistakes in a
dictation, as mentioned after Definition 6.1.
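
Continuing the shoe-size sketch given after Definition 6.1 (the same kind of simulated data, with invented coefficients), indirect proxy discrimination can be seen in a regression: when the true causal variable (age) is absent from the database, the proxy (shoe size) picks up a sizeable coefficient; once age is included, that coefficient collapses toward zero.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000

age = rng.uniform(6, 12, size=n)
shoe_size = 28 + 1.2 * age + rng.normal(0, 1.0, size=n)  # proxy, caused by age
mistakes = 40 - 3.0 * age + rng.normal(0, 4.0, size=n)   # spelling mistakes, caused by age

def ols(y, *features):
    """Least-squares coefficients (intercept first)."""
    X = np.column_stack([np.ones(len(y))] + list(features))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Age not in the database: shoe size looks strongly (negatively) predictive
print("mistakes ~ shoe size      :", np.round(ols(mistakes, shoe_size)[1:], 2))

# Age available: the proxy's coefficient collapses toward zero
print("mistakes ~ shoe size + age:", np.round(ols(mistakes, shoe_size, age)[1:], 2))
```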
This unintended discrimination by proxy by AIs cannot be avoided simply by
depriving the AI of information about the membership of individuals in legally
suspect classes or obvious proxies for such group membership. However, the
exclusion of the forbidden input alone may not be enough when there are other
characteristics that correlate with the forbidden input—an issue that is exacerbated
in the context of big data. Additional readings include Bornstein (2018) or Selbst
and Barocas (2018), who delve into the “stereotype theory of liability.”

6.6 Names, Text, and Language

The first name, also known as a personal name, is an individual’s given name used
to specifically identify them, alongside their patronymic or family name. In the
majority of Indo-European languages, the first name is positioned before the family
name, known as the Western order. This differs from the Eastern order, where the
family name precedes the first name. Unlike the family name (usually inherited from
the fathers in patriarchal societies), the first name is chosen by the parents at birth
(or before), according to criteria influenced by the law and/or social conventions,
pressures and/or trends. More precisely, at birth (and/or baptism), each person is
usually given one or more first names, of which only one (which may be a compound name)
will be used afterward: the usual first name.

6.6.1 Last Name and Origin or Gender

The use of family names appeared in Venice in the IX-th century (Brunet and Bideau
2000; Ahmed 2010) and came into use across Europe in the later Middle Ages
(beginning roughly in the XI-th century), according to the “Family names” entry of
the Encyclopædia Britannica. Family names seem to have originated in aristocratic
families and in big cities, as having a hereditary surname that develops into a
family name preserves the continuity of the family and facilitates official property
records and other matters. Sources of family names are original nicknames (Biggs,
Little, Grant), occupations (Archer, Clark), place names (Wallace, Murray, Hardes,
Whitney, Fields, Holmes, Brookes, Woods), as mentioned in McKinley (2014).
In English, the suffix -son has also been very popular (Richardson,
Dickson, Harrison, Gibson), as has the prefix Fitz- (Fitzgerald), which goes
back to Norman French fis (fils in French), meaning “son,” as explained in McKinley
(2014). In Russian, as discussed in Plakans and Wetherell (2000), the suffix -ov
(-ов, "son of") was also used, as in "Ivan Petrov (Иван Петров)" for "Ivan (Иван),
the son of Piotr (Пётр, or Петр+ов)," with the possibility of designating the
successive fathers, with the use of patronymics: Vasily Ivanovich Petrov (Василий
Иванович Петров) is Vasily (Василий), son of Ivan (Иван+ович), born from
the ancestor Piotr (Пётр+ов).
Icelandic surnames are different from most other naming systems in the modern
Western world by being patronymic or occasionally matronymic, as mentioned in
Willson (2009) and Johannesson (2013): they indicate the father (or mother) of
the child and not the historic family lineage. Generally, with few exceptions, a
person’s last name indicates the first name of their father (patronymic) or in some
cases mother (matronymic) in the genitive, followed by -son ("son") or -dóttir
(“daughter”). For instance, in 2017, Iceland’s national Women’s soccer team players
were Agla Maria Albertsdóttir, Sigridur Gardarsdóttir, Ingibjorg Sigurdardóttir,
Glodis Viggosdóttir, Dagny Brynjarsdóttir, Sara Bjork Gunnarsdóttir, Fanndis
Fridriksdóttir, Hallbera Gisladóttir, Gudbjorg Gunnarsdóttir, Sif Atladóttir or
Gunnhildur Jonsdóttir. In the national Men’s soccer team, players were Hákon
Rafn Valdimarsson, Patrik Gunnarsson, Höskuldur Gunnlaugsson, Júlíus Magnússon, Viktor Örlygur Andrason or Kristall Máni Ingason.
The development is slightly different among Jewish people. They chose, or were
given, family names only at the end of the XVIII-th and beginning of the XIX-
th century, and most derived from religious vocations, as explained in Kaganoff
(1996): Cantor, Canterini, Kantorowicz (lower priest); Kohn, Cohen, Cahen, Kaan,
Kahane (priest); Levi, Halévy, Löwy (name of the tribe of priests), etc. In China, Liu et al. (2012) mention that there are about 1000 family names, of which only about 60 are commonly used. Most Chinese surnames are monosyllabic and originate from a
physical characteristic or description: Wang (yellow), Wong (wild water field),
Chan (old) or Chu (mountain). In Japan, until the XIX-th century, only members
of the nobility had a family name, as explained by Tanaka (2012). But everything
changed when the Emperor declared that everyone should have a family name.
Entire villages took the same name. That is why there are only about 10,000 names
in Japan and most of them are geographical place names: Arakawa (rough, river),
Yamada (mountain, rice field), Hata (farm), and Shishido (flesh, door).
Using birth record data from California, Fryer Jr and Levitt (2004) find that
“Blacker names are associated with lower-income zip codes [and] lower levels of
parental education." Table 6.1 presents names along with the respective proportions of individuals identifying themselves as white, Black, or Hispanic across different geographic areas. In many cases, the last name can serve as a proxy or indicator of a person's racial attribute.

Table 6.1 Last name and racial proportions in the USA, from Gaddis (2017) (data from US Census (2012))

Name         Rank   White (%)   Black (%)   Hispanic (%)
Washington    138      5.2        89.9          1.5
Jefferson     594     18.7        75.2          1.6
Booker        902     30.0        65.6          1.5
Banks         278     41.3        54.2          1.5
Jackson        18     41.9        53.0          1.5
Mosley        699     42.7        52.8          1.5
Becker        315     96.4         0.5          1.4
Meyer         163     96.1         0.5          1.6
Walsh         265     95.9         1.0          1.4
Larsen        572     95.6         0.4          1.5
Nielsen       765     95.6         0.3          1.7
McGrath       943     95.9         0.6          1.6
Stein         720     95.6         0.9          1.6
Decker        555     95.4         0.8          1.7
Andersen      954     95.5         0.6          1.7
Hartman       470     95.4         1.5          1.2
Orozco        690      3.9         0.1         95.1
Velazquez     789      4.0         0.5         94.9
Gonzalez       23      4.8         0.4         94.0
Hernandez      15      4.6         0.4         93.8

6.6.2 First Name and Age or Gender

As Bosmajian (1974) states, “an individual has no definition, no validity for himself,
without a name. His name is his badge of individuality, the means whereby he
identifies himself and enters upon a truly subjective existence.” Names are often the
first information people have in a social interaction. Sometimes we know individuals
by name even before we meet them in person, as Erwin (1995) reminds us. First and
last names can carry a lot of information, as shown by Hargreaves et al. (1983),
Dinur et al. (1996), or Daniel and Daniel (1998). To quote Young et al. (1993), “the
name Bertha might be judged as belonging to an older Caucasian woman of lower-
middle class social status, with attitudes common to those of an older generation (...) a person with a name such as Fred, Frank, Edith, or Norma is likely to be judged, at least in the absence of other information, to be either less intelligent, less
popular, or less creative than would a person with a name such as Kevin, Patrick,
Michelle, or Jennifer.”
As discussed in Riach and Rich (1991) and Rorive (2009), a popular technique to
test for discrimination (in a real-life context) is to use “practice testing” or “situation
testing”. This probably started in the 1960s in the UK, with Daniel et al. (1968),
who aimed to compare the likelihood of success among three fictitious individuals,
British citizens originally from Hungary, the Caribbean, and England, across various
domains including employment, housing, and insurance, as noted by Héran (2010).
Smith (1977) also references a protocol used in 1974, which involved random job
offer selections and the submission of identical written applications. Concurrently in
the USA, Fix and Turner (1998) and Blank et al. (2004) mention “affirmative action
(pair-testing).” These so-called “scientific” tests rely on stringent, demanding, and
often costly protocols. In the context of insurance, in France, Petit et al. (2015)
and l’Horty et al. (2019) organized such experiments, where three profiles were
used: the first corresponds to the candidate whose name and surname are North
African sounding (for example, Abdallah Kaïdi, Soufiane Aazouz, or Medji Ben
Chargui), the second corresponds to the candidate whose first name is French
sounding and whose surname is North African sounding (for example, François El
Hadj, Olivier Ait Ourab, or Nicolas Mekhloufi), and finally the one whose first name
and last name are French sounding (for example Julien Dubois, Bruno Martin, or
Thomas Lecomte). Amadieu (2008) mentions that the first names (male) Sébastien,
Mohammed, and Charles-Henri are used for tests. Table 6.2 shows the main first
names in three generations of immigrants.
In Box 6.6, Baptiste Coulmont discusses the use of the first name as a proxy for
some sensitive attribute.
Table 6.2 Top 3 first names by sex and generations in France, according to the origin (Southern Europe or Maghreb) of grandparents, Coulmont and Simon (2019)

                        Immigrants                Children of immigrants    Grandchildren of immigrants
Southern Europe  Men    José, Antonio, Manuel     Jean, David, Alexandre    Thomas, Lucas, Enzo
                 Women  Maria, Marie, Ana         Marie, Sandrine, Sandra   Laura, Léa, Camille
Maghreb          Men    Mohamed, Ahmed, Rachid    Mohamed, Karim, Mehdi     Yanis, Nicolas, Mehdi
                 Women  Fatima, Fatiha, Khaduja   Sarah, Nadia, Myriam      Sarah, Ines, Lina

Box 6.6 The First Name as Proxy, by Baptiste Coulmont (Professor at the École Normale Supérieure Paris-Saclay)


As early as the 1930s, historians such as Marc Bloch in France felt that first
names could be used for historical research. “The very choice of baptismal
names, their nature, relative frequency, are all traits which, when properly
interpreted, reveal currents of thought or feeling,” Bloch (1932). The first
name of a person is indeed on the one hand the intimate choice of the parental
couple (and their close relations), and on the other hand information very
often present in population registers. This connection between the private and
the public spheres makes it possible to conceive the first name as a gateway
to the study of a group’s culture, because it says things about the givers of
first names and the bearers of first names. What does it say? Marc Bloch is
careful enough to insist on the necessity of a “proper interpretation.” With
the gender of the first name and the sex of the civil status, the associations
are strong enough to guide the interpretation: if epicene first names exist,
they name only a small part of the population. And it is even possible, for a
large number of first names, from this simple first name, to infer with great
certainty the gender of the person who bears it. This is how, after a time when
it was forbidden to use married people’s sex, it was possible to reconstitute
the missing information, Carrasco (2007). The fashion phenomenon around
first names, visible with a craze and then disappearance over the years,
makes it possible to estimate the “average age” of the carriers of a first
name. Today, in France, individuals called Nathalie are older than those
called Emma. But this does not make the first name a transparent signal of
a person's age. An interesting survey, which asked participants to estimate
a person’s age from their first name, indicates that “the perceived age of
a first name corresponds only weakly with the true average year of birth
of people with that name.” Interpretation must be undertaken with caution
when the study focuses on other characteristics of first names. Parents located at different positions in the social space choose different first names, Besnard and Grange (1993). The frequency of first names therefore varies with the
social environment. Today, individuals called Anouk, Adèle, or Joséphine
taken as a group have parents with more education than those called Anissa,
Mégane, or Deborah. But this relationship depends on time, as some of the
first names spread—following fashion—from one environment to another.
Finally, when the characteristics associated with first names are linked to fluid,
contextual identities, or assigned administratively from the outside, it is the
addition of other variables that gives meaning to first names. Many studies
(Tzioumis 2018 or Mazieres and Roth 2018) seek to exploit the information
on ethnic, national or geographical origin contained in names and surnames,
but they need, to start the investigation, a link between the first name and
the variable studied. Directories from different countries, for example. The
generalization of correlations between origin and first name to other countries
or other populations must be made with caution.

Similarly, in "are Emily and Greg more employable than Lakisha and Jamal?,"
Bertrand and Mullainathan (2004) randomly assigned African American or white-
sounding names in resumes to manipulate the perception of race. “White names”
received 50% more callbacks for interviews. Voicu (2018) presents the Bayesian
Improved First Name Surname Geocoding (BIFSG) model to use first names to
improve the classification of race and ethnicity in a mortgage-lending context,
drawing on Coldman et al. (1988) and Fiscella and Fremont (2006). Analyzing data
from the German Socio-Economic Panel, Tuppat and Gerhards (2021) show that
immigrants with first names considered uncommon in the host country dispropor-
tionately complain of discrimination. When names are used as markers indicating
ethnicity, it has been observed that highly educated immigrants tend to report
perceiving discrimination in the host country more frequently than less educated
immigrants. This phenomenon is referred to as the “discrimination paradox.”
Rubinstein and Brenner (2014) show that the Israeli labor market discriminates
on the basis of perceived ethnicity (between Sephardic and Ashkenazi-sounding
surnames). Carpusor and Loges (2006) analyze the impact of first and last names on the rental market, whereas Sweeney (2013) analyzes their impact on online advertising.
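As a rough sketch of the Bayesian logic behind the BIFSG approach mentioned above (with entirely made-up probabilities, and not Voicu (2018)'s actual implementation), a prior P(race | surname), of the kind reported in Table 6.1, is combined with the geographic distribution P(geography | race) using Bayes' rule, assuming conditional independence:

# Made-up illustrative inputs, for a surname such as "Washington"
p_race_given_surname <- c(white = 0.052, black = 0.899, hispanic = 0.015)
# Made-up share of each group living in the census tract of interest
p_geo_given_race <- c(white = 0.010, black = 0.002, hispanic = 0.001)
# Posterior P(race | surname, geography), up to normalization
posterior <- p_race_given_surname * p_geo_given_race
round(posterior / sum(posterior), 3)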
6.6.3 Text and Biases

Chatbots raise critical ethical challenges and have implications for the democratization of technology, and research addressing these issues is important. Chatbots permit users to interact through natural language,
and consequently are a potential low threshold means of accessing information and
services and promoting inclusion. However, owing to technological limitations and
design choices, they can be the means of perpetuating and even reinforcing existing
biases in society, excluding or discriminating against some user groups, as discussed
in Harwell et al. (2018) or Feine et al. (2019), and over-representing or enshrining
specific values.
If we translate Turkish sentences that use the gender-neutral pronoun 'o' into English, we
obtain outputs such as the ones in Table 6.3.
For a model to make such choices, the corpora used to train it must encode sexist (and possibly also ageist or racist) associations. The words we use can therefore act as proxies for some sensitive attributes. This is related to "social norm bias" as defined in Cheng et al. (2023), inspired by Antoniak and Mimno (2021). Social norm bias refers to the association between an algorithm's predictions and individuals' adherence to inferred social norms; penalizing individuals for their adherence to, or deviation from, such norms is a form of algorithmic unfairness. More precisely, social norm bias occurs when an algorithm is more likely to correctly classify the women in a given occupation who adhere to these norms than the women in the same occupation who do not. Tang et al. (2017) used manually compiled lists of gendered words, relying only on the frequency of these words. "Occupations are
socially and culturally ‘gendered’” wrote Stark et al. (2020); many jobs (for instance
in science, technology, and engineering) are traditionally masculine, Light (1999)
and Ensmenger (2015). Different English words have gender- and age-related
connotations, as shown in Moon (2014), inspired by Williams and Bennett (1975).
Table 6.3 Translation from Turkish to English, inspired by Kuczmarski (2018), based on https://translate.google.com/ (accessed in January 2023)

                     2017                  2023
o bir öğretmen       she is a teacher      he is a teacher
o bir hemşire        she is a nurse        she is a nurse
o bir doktor         he is a doctor        she is a doctor
o bir şarkıcı        she is a singer       he is a singer
o bir sekreter       she is a secretary    she is a secretary
o bir dişçi          he is a dentist       he is a dentist
o bir çiçekçi        she is a florist      she is a florist
o çalışkan           he is hard working    he is hard working
o tembel             she is lazy           he is lazy
o güzel              she is beautiful      she is beautiful
o çirkin             he is ugly            he is ugly

Based on a large reference corpus (the 450-million-word Bank of English, BoE), Moon (2014) observed that the most frequent adjectives co-occurring with "young"
are: inexperienced, beautiful, fresh, attractive, healthy, vulnerable, pretty, naive, tal-
ented, impressionable, energetic, crazy, single, dynamic, fit, strong, trendy, innocent,
foolish, handsome, hip, stupid, ambitious, free, full (of life/ideas/hope/etc.), lovely,
enthusiastic, eager, small, vibrant, gifted, immature, slim, good-looking. In contrast,
the most frequent adjectives co-occurring with “old” are: sick, tired, infirm, frail,
gray, fat, worn(-out), decrepit, disabled, wrinkly/wrinkled, slow, poor, weak, wise,
beautiful, rare, ugly. The following associations were observed:
• Precocious, shy: teens, tailing off in the twenties
• Pretty, promising: peaking with teens, twenties, tailing off in the thirties
• Beautiful, fresh-faced, stunning: mainly teens and twenties
• Blonde: strongly associated with women in their twenties; to a lesser extent teens
and thirties
• Ambitious, brilliant, talented: peaking in the twenties
• Attractive: peaking in the twenties, tailing off in the thirties
• Handsome: peaking in the twenties, tailing off in the forties
• Balding, dapper, formidable, genial, portly, paunchy: mainly forties and older
• Sprightly/spritely: beginning to appear in the sixties, stronger in the seven-
ties/eighties
• Frail: mainly the seventies and older
Crawford et al. (2004) provide a corpus of 600 words and human-labeled gender
scores, as scored on a scale of 1–5 (1 is the most feminine, 5 is the most masculine)
by undergraduates at US universities. They find that words referring to explicitly
gendered roles such as wife, husband, princess, and prince are the most strongly
gendered, whereas words such as pretty and handsome also skew strongly in the
feminine and masculine directions respectively.
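As a crude illustration of how such a lexicon can turn free text into a gender proxy (using a tiny, hypothetical word list, not the actual Crawford et al. (2004) corpus), one can simply average the scores of the lexicon words found in a text:

# Hypothetical scores on the 1 (most feminine) to 5 (most masculine) scale
lexicon <- c(wife = 1.1, pretty = 1.8, talented = 3.0, handsome = 4.2, husband = 4.9)
gender_score <- function(text, lexicon) {
  words <- tolower(unlist(strsplit(text, "[^A-Za-z]+")))
  scores <- lexicon[words[words %in% names(lexicon)]]
  if (length(scores) == 0) return(NA_real_)
  mean(scores)
}
gender_score("a talented and pretty wife", lexicon)      # skews toward the feminine end
gender_score("a handsome and talented husband", lexicon) # skews toward the masculine end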

6.6.4 Language and Voice

The voice is also an important element, often assessed very subjectively during in-person meetings with agents, but today it is analyzed by algorithms, in particular when an insurer uses conversational robots (we sometimes speak of a "chatbot"), as analyzed
by Hunt (2016), Koetter et al. (2019), Nuruzzaman and Hussain (2020), Oza et al.
(2020), or Rodríguez Cardona et al. (2021). We can note, in France, the experience
of the virtual assistant launched by Axa Legal Protection, called Maxime, described
by Chardenon (2019).
In the fall of 2020, in France, a bill penalizing discrimination based on accent was presented to parliament, as reported by Le Monde (2021). Linguists
would speak of phonostyle discrimination, as in Léon (1993), or of “diastratic
variation,” with differences between usages by gender, age, and social background
(in the broad sense), in Gadet (2007), or of “glottophobia”, a term introduced by
Blanchet (2017). Glottophobia can be defined as “contempt, hatred, aggression,
rejection, exclusion, negative discrimination actually or allegedly based on the
fact of considering incorrect, inferior, bad certain linguistic forms (perceived as
languages, dialects, or uses of languages) used by these people, in general by
focusing on the linguistic forms (and without always being fully aware of the extent
of the effects produced on the people).” If Van Parijs (2002) is mainly interested in
people discriminated against because French is not their mother tongue (“the mother
tongue, in this perspective, is as illegitimate a basis for discrimination as race,
gender or faith”), Blanchet (2017) emphasizes above all cultural differences (the
lengthening of vowels, the distribution of pauses, the rate of speech, the accentuation
of certain syllables, the richness of vocabulary, etc.). One can think of the “suburban
accents” as Fagyal (2010) called them.
Blodgett and O’Connor (2017) have drawn attention to a significant aspect
of algorithmic fairness: the unequal performance of natural language-processing
algorithms when applied to the language of authors belonging to different social
groups. Notably, existing systems occasionally exhibit poorer analysis of the
language used by women and minorities compared with that of white individuals
and men. Finally, it is worth noting that in Canada, the expression “speak white”
was a racist slur used by English speakers to characterize those who did not speak
English in public places. Historically, this is what was said to Liberal Party MP
Henri Bourassa in 1889 when he attempted to speak French during debates in
the Canadian House of Commons. As Michèle Lalonde, author of the 1968 poem
“Speak White” (reported by Dostie (1974)) puts it, “Speak White is the protest
of white Negroes in America. Language here is the equivalent of colour for the
American Negro. The French language is our black colour.”
For Squires and Chadwick (2006), “linguistic profiling,” i.e., the identification
of a person’s race from the sound of their voice and the use of this information
to discriminate on the basis of race, has been documented in the home rental
market, and they analyze its impact in the home insurance
sector. Based on an analysis of paired tests conducted by fair housing organizations,
they show that home insurance agents are generally able to detect the race of a
person who contacts them by phone and that this information affects the services
provided to people inquiring about purchasing a home insurance policy. Finally, as
an anecdote, Durry (2001) reports that in 1982 an insurance company in France was
sued for refusing to provide car insurance to people who could not read or write.
On the legal basis that “everything that is not [explicitly] prohibited is permitted,”
the court ruled that the insurance company could not be held liable. However, the
chosen criterion arguably led to the elimination of foreigners more often than French
citizens, Durry (2001) pointed out. More recently, to return to our introductory
discussion, chat bots have been shown to be capable of reproducing discrimination,
mostly in relation to the speaker’s gender, as explained by Feine et al. (2019),
Aran et al. (2019), McDonnell and Baxter (2019), or Maedche (2020), and can
also be clearly racist, as shown by Schlesinger et al. (2018). One should then be
careful when using voice in an opaque algorithm (as are most algorithms used with
language).
6.7 Pictures

6.7.1 Pictures and Facial Information

More than a century ago, first Lombroso (1876), and then Bertillon and Chervin
(1909), laid the foundations of phrenology and the “born criminal” theory, which
assumes that physical characteristics correlate with psychological traits and criminal
inclinations. The idea was to build classifications of human types on the basis of
morphological characteristics, in order to explain and predict differences in morals
and behavior. One could speak of the invention of a “prima facie”. We can also
mention “ugly laws,” in the sense of Schweik (2009), taking up a term used by
Burgdorf and Burgdorf Jr (1974) to describe laws in force in several cities in the
USA until the 1920s, but some of which lasted until 1974. These laws allowed
people with “unsightly” marks and scars on their bodies to be banned from public
places, especially parks. In New York, in the XVIII-th century, Browne (2015) recalls
that “lantern laws” demanded that Black, mixed-race, and Indigenous enslaved
people carry candle lanterns with them if they walked about the city after sunset,
and not in the company of a white person. The law prescribed various punishments
for those who did not carry this supervisory device.
These debates are resurfacing as the number of applications of facial recognition
technology increases, thanks to improvements in the quality of the images, the
algorithms used, and the processing power of computers. The potential of these
facial recognition tools to perform health assessment is demonstrated in Boczar et al.
(2021). Historically, Anguraj and Padma (2012) had proposed a diagnostic tool for
facial paralysis, and recently, Hong et al. (2021) uses the fact that many genetic
syndromes have facial dysmorphism and/or facial gestures that can be used as a
diagnostic tool to recognize a syndrome. As shown in Fig. 6.6, many applications
online can, for free, label pictures, and extract information, personal if not sensitive,
such as gender (with a “confidence” value), the age, and also some sort of emotion.
In Chap. 3, we explained that in the context of risk modeling, a "classifier" that simply assigns a picture to a class is not extremely interesting, and having the probability of belonging to each class is more informative. In Fig. 6.7, we challenge the "confidence" value given by Picpurify, using pictures generated by a generative adversarial network (used in Hill and White (2020) to generate faces), with a "continuous" transformation from one picture (top left) to another one (bottom right). Based on the terminology we use later, when using barycenters in Chap. 12, we have here some sort of "geodesic" between the picture of a woman and the picture of a man. We would expect the "confidence" (for identifying a "man") to increase continuously from a low value to a higher one, but this is not the case: the algorithm predicts with very high confidence that the person on the top right is a "female," and also with very high confidence that the person on the bottom left is a "male," with only very few changes in the pixels between the two pictures.
Fig. 6.6 Faces generated by Karras et al. (2020). Gender and age were provided by https://gender.toolpie.com/ and facelytics, https://www.picpurify.com/demo-face-gender-age.html with a "confidence," and https://cloud.google.com/vision/, https://howolddoyoulook.com/ and https://www.facialage.com/ (accessed in January 2023)

More generally, beyond medical considerations, Wolffhechel et al. (2014) reminds us that several personality traits can be read from a face, and that facial features influence first impressions. That said, the prediction model considered fails
to reliably predict personality traits from facial features. However, recent technical
developments, accompanied by the development of large image banks, have made it
possible, as claimed by Kachur et al. (2020), to predict multidimensional personality
profiles from static facial images, using neural networks trained on large labeled
data sets. About 10 years ago, Cao et al. (2011) proposed predicting the gender of a
person from facial images (Rattani et al. 2017 or Rattani et al. 2018), and recently
Kosinski (2021) used a facial recognition algorithm to predict political orientation
(in a binary context, opposing liberals and conservatives, in the spirit of Rule and
Ambady (2010)). Wang and Kosinski (2018) and Leuner (2019) proposed using
these algorithms to predict sexual orientation.

Fig. 6.7 GAN used in Hill and White (2020) to generate faces, with a "continuous" transformation from a picture (top left) to another one (bottom right), and then gender predicted using https://www.picpurify.com/demo-face-gender-age.html with a "confidence" (accessed in March 2023)

But as explained in Agüera y Arcas
et al. (2018), neural networks appeared to be fixating on peripheral attributes, such
as makeup choices and specific eyewear styles, as indicators of sexual orientation,
instead of focusing on inherent facial features. A salient observation in the training
dataset revealed that heterosexual women demonstrated a higher propensity for
wearing eye shadow than their homosexual counterparts, whereas heterosexual
men exhibited a greater likelihood of wearing glasses in contrast to gay men.
This phenomenon underscored the neural networks’ inclination to discern sexual
orientation by aligning with societal fashion and superficial biases, rather than
conducting a meticulous analysis of facial characteristics, encompassing features
such as cheek structure, nose shape, and ocular properties, among others.
In Shikhare (2021), the idea of using a facial score model for life insurance
underwriting was evoked (in connection with the concept of AUW—accelerated
underwriting), with images as in Fig. 6.6. The idea is to look for “abnormal
characteristics” like those related to a particular condition (Down’s syndrome,
Cornelia de Lange, Cushing’s, acromegaly, etc.), including abnormal skin color to
detect bronchial asthma or hepatitis, or to infer gender, as mentioned in Shikhare
(2021).
6.7.2 Pictures of Houses

Using deep-learning algorithms, it is possible to extract information from pictures
taken from cars after an accident, as discussed in Dwivedi et al. (2021), and several
companies are using this technique through various applications. After an accident,
the app directs the policyholder to take photos of the damaged cars at certain angles,
and in certain lights. And using just those photos, an opaque model estimates how
much it will cost to fix the car. “Garage operators say the numbers can be wildly
inaccurate,” mentioned Marshall (2021). “You can’t diagnose suspension damage
or a bent wheel or frame misalignment from a photograph.”
Some companies try to sell models that could estimate how much it will cost to
fix water damage or a fire, in a house. For example, in Fig. 6.8, we can visualize
three photos of damage to a house, tagged by Google API Cloud Vision.
Figure 6.9 showcases a range of images pertaining to a building search. On the
left, street photos captured by Google Street View (launched in 2007) are displayed,
whereas on the right, aerial imagery from Google Earth (launched in 2001) is
presented. These images offer the opportunity to detect noteworthy elements such
as a neighbor’s swimming pool in close proximity to a house and the presence
of public electrical equipment. The availability of such visual information could
be beneficial for assessing the potential risks associated with water damage or fire
incidents accurately.

Fig. 6.8 Labels from Google API Cloud Vision https://cloud.google.com/vision/ (accessed in January 2023) (source: personal collection)

Fig. 6.9 Examples of images associated with a building search, with street photos on the left-hand
side, with Google Street View, and aerial imagery, with Google Earth, on the right-hand side
Fig. 6.10 Examples of images associated with building search, with street photos on the left, from
Google Street View, with different views from 2012 until 2018 on top, and from 2009 until 2018
below

Fig. 6.11 Examples of aerial imagery, from Google Earth

Figure 6.10 showcases an additional series of images obtained from Google Street View, specifically associated with building search. The images on the left-
hand side depict views captured in a particular year, whereas those on the right-hand
side display views taken a few years later (2012 and 2018 for the images on top
and 2009 to 2018 for the images at the bottom). This collection of images offers
valuable insights into various aspects of buildings. Notably, these images allow for
the observation of potential house extensions that may not have been declared by
the policyholder. Additionally, they may reveal instances where a tree is growing at
a faster rate than anticipated, posing a potential risk, especially when it is in close
proximity to a load-bearing wall.
Figure 6.11 presents aerial imagery obtained from Google Earth, showing four
buildings. These images provide valuable information for detecting the presence
or absence of swimming pools, as explored by Galindo et al. (2009), Rodríguez-
Cuenca and Alonso (2014), and Cunha et al. (2021). Furthermore, these views can
be utilized to obtain a more accurate estimate of the size of the houses. However, a
challenge arises when a house has a black tile roof, as it becomes more difficult to
differentiate between the shadow of the house and the roof itself. This difficulty can
lead to overestimations of the total surface area of the building, as depicted in the
images.
6.8 Spatial Information

"Geographic location is a well-established variable in many lines of insurance," recalls Bender et al. (2022). "Geographic information is crucial for estimating
the future costs of an insurance contract” claimed Blier-Wong et al. (2021). For
weather-related perils, such as flooding, geographic information is crucial because
location is a key piece of information for predicting loss frequency. In motor
insurance, policyholders living in rural areas are less likely to have accidents
because they use roads with little traffic. When they do have accidents, they tend to
generate higher claims costs because those accidents are more severe. For this reason, insurance
companies define territory levels, which serve as unique coding in their pricing
models.
Open data sources, such as OpenStreetMap, provide a wealth of spatial informa-
tion that can be utilized for various purposes. For example, Fig. 6.12 demonstrates
the extraction of buildings in Montréal, Canada, using that open geographic
database. However, this is just one example, as it is also possible to extract other
elements such as streets, roads, rivers, and lakes from the same dataset.
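The extraction of Fig. 6.12 can be reproduced along the following lines with the osmdata package (a sketch: the bounding box is hard-coded to a small area here, since downloading every building footprint of Montréal through the Overpass API would be heavy; getbb("Montréal, Canada") would return the bounding box of the whole city):

library(osmdata)
library(sf)
# Small bounding box around a Montréal neighborhood (approximate coordinates)
bb <- c(-73.58, 45.50, -73.56, 45.52)              # xmin, ymin, xmax, ymax
q <- add_osm_feature(opq(bbox = bb), key = "building")
buildings <- osmdata_sf(q)$osm_polygons            # building contours as sf polygons
plot(st_geometry(buildings), col = "grey80", border = "grey40")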
Given that the locations of fire hydrants in Montréal are publicly available infor-
mation, it is feasible to visualize and compute the distance between a policyholder’s
house and the nearest fire hydrants. This is exemplified in Fig. 6.13, where the
proximity of the policyholder’s house to the closest fire hydrants is represented.
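Once the hydrant locations (available as open data) and the policyholder's address have been geocoded, the distance to the nearest hydrant can be computed with the sf package, as in the following sketch (with made-up coordinates):

library(sf)
# Made-up longitude/latitude coordinates, for illustration only
hydrants <- st_as_sf(data.frame(lon = c(-73.571, -73.569, -73.574),
                                lat = c(45.511, 45.508, 45.513)),
                     coords = c("lon", "lat"), crs = 4326)
house <- st_as_sf(data.frame(lon = -73.570, lat = 45.510),
                  coords = c("lon", "lat"), crs = 4326)
i <- st_nearest_feature(house, hydrants)  # index of the closest hydrant
st_distance(house, hydrants[i, ])         # great-circle distance, in meters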
In Fig. 6.14, an additional visualization approach is shown where buildings are
presented from a three-dimensional perspective. The colors assigned to the buildings
are based on thermal diagnostics, providing insights into the thermal characteristics
of the structures.

Fig. 6.12 Some polygons of building contours, in Montréal (Canada), extracted from Open-
StreetMap (getbb function of osmdata package)
Fig. 6.13 Locations of fire hydrants in a neighborhood in Montréal (Canada), with a satellite
picture from GoogleView on the right-hand side (qmap function of ggmap package)

Fig. 6.14 Three-dimensional perspective of a building in Paris (France) (source: https://particulier.gorenove.fr/)

6.8.1 Redlining

While geographic information is important for assessing risk, it has also been seen as problematic, because of "redlining" (see Squires and DeWolfe (1981)
and Baker and McElrath (1997)). In the USA, the federal government created the
Home Owners’ Loan Corporation (HOLC) during the Depression, in 1933, to slow
down the dramatic increase in the rate of housing foreclosures. Redlining refers to
lending (or insurance) discrimination that bases credit decisions on the location of a
property to the exclusion of characteristics of the borrower or property. Usually,
it means that lenders will not make loans to areas with African Americans or
other perceived risks to real-estate investments. Jackson argued that the Federal
Housing Administration and private lenders obtained copies of the HOLC maps
and that the grades on the maps impacted their lending decisions. Community
groups in Chicago’s Austin neighborhood coined the word “redlining” in the late
1960s, referring literally to red lines lenders and insurance providers admitted
drawing around areas that they would not service. “The practice of redlining
was first identified and named in the Chicago neighborhood of Austin in the late
1960s. Saving and loan associations, at the time the primary source of residential
mortgages, drew red lines around neighborhoods they thought were susceptible to
racial change and refused to make mortgages in those neighborhoods,” explained
Squires (2011). And because those areas also had a higher proportion of African
American people, “redlining” started to be perceived as a discriminatory practice.
More generally, the use of geographic attributes may hide (intentionally or not) the
fact that some neighborhoods are populated mainly by people of a specific race, or
minority.

6.8.2 Geography and Wealth

Unfortunately, it is difficult to assess whether people are discriminated against because of their ethnic origin, or because of their wealth. As observed in Institute and Faculty
of Actuaries (2021), “people in poverty pay more for a range of products, including
energy, through standard variable tariffs; credit, through high interest loans and
credit cards; insurance, through postcodes considered higher risk; and payments,
through not being able to benefit from direct debits as they are presently structured.”
Historically, the address was used to associate an insured person with an urban
area. Figure 6.15 presents a visualization of Paris using IRIS zones.8 This visual-
ization incorporates various socio-economic indicators such as the median income
per household in each neighborhood, the proportion of people over 65 years old,
and the rates of dwellings with sizes less than 40 m² and of more than 100 m².
Additionally, statistics for each IRIS zone could include information on building
dilapidation, burglary rates, and other relevant data.

8 IRIS = Ilôts Regroupés pour l'Information Statistique, a division of the French territory into grids of homogeneous size, with a reference size of 2000 inhabitants per elementary grid. France has 16,100 IRIS, including 650 in the overseas departments, and Paris (intra muros) has 992 IRIS.

Fig. 6.15 Median income per household, proportion of elderly people, proportion of dwellings according to their size, by neighborhood (IRIS), in Paris, France. This statistical information on the neighborhood (and not the insured person) can be used to rate a home insurance policy, for example (data source: INSEE open data)

Initially, Jean et al. (2016) had noted that at a fairly coarse level, night lighting was a rough indicator of economic wealth. Indeed, world night-time maps show that many developing countries are poorly lit. Jean et al. (2016) combined night-time maps with high-resolution daytime satellite imagery, and the combined images made it possible, "with a bit of machine-learning wizardry," to obtain accurate estimates of household consumption and assets (two quantities often difficult to measure in poorer countries). Subsequently, Seresinhe et al. (2017) suggested estimating the amount of green space in different locations based on satellite images. Taking this a step further, Gebru et al. (2017) claimed to be able to quantify socio-economic attributes such as income, ethnicity, education, and voting patterns from cars detected on Google Street View images. For example, in the USA, if the
number of Sedans in a neighborhood is greater than the number of pickup trucks,
that neighborhood is likely to vote Democrat in the next presidential election (88%
chance); otherwise, the city is likely to vote Republican (82% chance).
As Law et al. (2019) puts it, when an individual buys a home, they are
simultaneously buying its structural characteristics, its accessibility to work, and
the neighborhood’s amenities. Some amenities, such as air quality, are measurable,
whereas others, such as the prestige or visual impression of a neighborhood, are
difficult to quantify. Rundle et al. (2011) notes that Google Street View provides a
sense of neighborhood safety (related to traffic safety), by looking for crosswalks, the
presence of parks and green spaces, etc. Using street and satellite image data, Law
et al. (2019) show that it is possible to capture these unquantifiable features and improve the estimation of housing prices in London, UK. A neural network, with
input from traditional housing characteristics such as age, size and accessibility, as
well as visual features from Google Street View and aerial images, is trained to
estimate house prices. It is also possible to infer some possibly sensitive personal
information, such as a possible disability with the presence of a ramp for the house
(the methodology is detailed in Hara et al. (2014)), or sexual orientation with the
presence of a rainbow flag in the window, or political orientation with a Confederate flag
(as mentioned in Mas (2020)). Ilic et al. (2019) also evokes this “deep mapping”
of environmental attributes. A Siamese Convolutional Neural Network (SCNN)
is trained on temporal sequences of Google Street View images. The evolution
over time confirms some urban areas known to be undergoing gentrification, while
revealing areas undergoing gentrification that were previously unknown. Finally,
Kita and Kidziński (2019) discusses possible direct applications in insurance.

6.9 Credit Scores

Certain authors have, in the course of historical discourse, advanced the proposition
that discrimination grounded in wealth or income should be taken into account, as
advocated by Brudno (1976), Gino and Pierce (2010), or more recently Paugam
et al. (2017). However, it is pertinent to note that this particular criterion is not
within the purview of this discussion. Nevertheless, the contemplations posited by
these scholars evoke pertinent inquiries concerning the intricate interplay between
discrimination, economic disparities, and the principle of meritocracy. These con-
nections have also been underscored by Dubet (2014) and further elaborated upon
in Dubet (2016).

6.9.1 Credit Scoring

"Credit scoring is one of the most successful applications of statistical and operations research modeling in finance and banking" claimed Thomas et al. (2002). "Credit scoring" describes statistical models used to assist credit institutions
and banks in the process of running credit granting operations, including bank
loans, credit cards, mortgages, etc. These techniques were developed to provide a
more quantitative and supposedly more objective alternative to historical approaches,
which were essentially based on personal judgment. Credit cards provided a
guaranteed source of credit, so that the customer did not have to go back and
negotiate a loan for each individual purchase. In 1949, the “Diner’s Club” card was
offered, before banks offered the first cards in the 1950s and 1960s. In 1958, the
Bank of America introduced the BankAmericard, a statewide card in California that
offered customers a debit, and especially a credit, card service, before spreading
nationwide in 1965. In 1966, a group of Illinois banks got together and formed
the Interbank Card Association. Together they created Mastercard (originally called
Master Charge, the current name dates from 1979). In England, Barclays launched
Barclaycard (which had a monopoly until the arrival of the Access card in 1972).
In 1973, electronic payment was introduced with faster authorization systems (the
waiting time went from 5 min to 56 s per authorization) and clearing. The Bank
of America card was renamed Visa in 1977 to facilitate the roll-out of the card
worldwide. With the development of these credit cards, which were to allow credit
to be granted almost instantaneously, it became necessary to quantify the associated
risk (of nonpayment of the debt). In the 1960s, several articles discussed the use of
predictive models, such as Chatterjee and Barcun (1970) or Churchill et al. (1977).
In the brief section “how insurers determine your premium,” in the National
Association of Insurance Commissioners (2011, 2022) reports, it is explained that
“most insurers use the information in your credit report to calculate a credit-based
insurance score. They do this because studies show a correlation between this score
and the likelihood of filing a claim. Credit-based insurance scores are different from
other credit scores.” And as mentioned in Kuhn (2020), credit scoring is allowed
in 47 states in the USA (all except California, Massachusetts, and Hawaii), and it
is used by the 15 largest auto insurers in the country and over 90% of all US auto
insurers.
More generally, credit scores are an important piece of individual information
in the USA or Canada, and are widely used in many lines of insurance, not just
loan insurance. As noted, a (negative) credit event, such as a default (or late)
payment on a mortgage, or a bankruptcy, can impact an individual for a considerable
period of time. These credit scores are numbers that represent an assessment of
a person’s creditworthiness, or the likelihood that he or she will pay back their
debts. But increasingly, these scores are being used in quite different contexts,
such as insurance. As mentioned by Kiviat (2019), “the field of property and
casualty insurance is dominated by the idea that it is fair to use data for pricing
if the data actuarially relate to loss insurers expect to incur.” And as she explains,
actuaries, underwriters, and policymakers seem to have gone along with that, being
conform with their sense of “moral deservingness” (using the term introduced by
Watkins-Hayes and Kovalsky (2016)). Guseva and Rona-Tas (2001) recalls that
in North America, Experian, Equifax, and TransUnion keep records of a person’s
borrowing and repayment activities. And the Fair Isaac Corporation (FICO) has
developed a formula (not known) that calculates, using these records, a score,
based on debt and available credit, income, or rather their variations (along with
payment history, number of recent credit applications and negative events—such
as bankruptcy or foreclosure—as well as changes in income due to changes in
employment or a family situation). The FICO score starts at 300 and goes up to
850, with a poor score below 600, and an “exceptional” score above 800. The
average score is around 711. This score, created for banking institutions, is now
used in pre-employment screening, as Bartik and Nelson (2016) reminds us. For
O’Neil (2016), this use of credit scores in hiring and promotion creates a dangerous
vicious cycle in terms of poverty. Credit-based insurance scores cannot use any personal information other than what is in the credit report to determine the score; in particular, they cannot use race and ethnicity, religion, gender, marital status, age, employment
history, occupation, place of residence, child/family support obligations, or rental
agreements. In Table 6.4, via Solution (2020) we can see the evolution of the
proposed credit rate according to the credit score, over 30 years, for a “good risk”
(3.2%) and a “bad risk” (4.8%), with the total amount of credit (a “bad risk” will
have an extra cost of 20%), and on an insurance premium, for the same profile, a
“bad risk” having an extra premium of about 70%.
In Table 6.4, we have at the top, the cost of a 30-year credit, for an amount of
$150,000, according to the credit score (with five categories, going from the most
risky on the left-hand side to the least risky on the right-hand side), with the average
rate proposed. At the bottom, an insurance premium charged to a 30-year-old driver, with no convictions, driving an average car, 12,000 miles per year, in the city, based on the same credit score categories, is shown.

Table 6.4 Top, cost of a 30-year credit, for an amount of $150,000, according to the credit score, with the average interest rate. Bottom, insurance premium charged to a 30-year-old driver, with no convictions, driving an average car, 12,000 miles per year, in the city (source: InCharge Debt Solutions)

Credit score              300–619    620–659    660–699    700–759    760–850
Total credit cost         $283,319   $251,617   $245,508   $239,479   $233,532
Rate                      4.8%       3.8%       3.6%       3.4%       3.2%
Motor insurance premium   $2580      $2400      $2150      $2000      $1500
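The credit costs at the top of Table 6.4 can be recovered from the standard annuity formula: for a principal P repaid over n monthly payments at monthly rate r, the payment is m = P r / (1 - (1 + r)^(-n)), and the total cost is n times m. A quick R check (a sketch, assuming a fixed-rate, fully amortizing 30-year loan):

total_cost <- function(principal, annual_rate, years = 30) {
  r <- annual_rate / 12                    # monthly rate
  n <- 12 * years                          # number of monthly payments
  m <- principal * r / (1 - (1 + r)^(-n))  # monthly payment (annuity formula)
  n * m                                    # total amount repaid
}
total_cost(150000, 0.048)  # about $283,300, close to the 300-619 column
total_cost(150000, 0.032)  # about $233,500, close to the 760-850 column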
As noted by Lauer (2017), credit institutions have long compiled data on
individuals’ financial histories, such as the timeliness of loan repayments, total
amount borrowed, frequency of applying for new loans, among others. Insurers
quickly realized that this information was highly predictive of all kinds of risks.
As early as the 1970s, companies in the USA were required by law to notify people
when credit reports resulted in adverse actions, such as an increase in insurance
premium or credit. Decades later, this rule let people know that credit scores were
causing their auto insurance rates to suddenly rise. For Kiviat (2019), it is only
recently that it has become clear how important credit score is in insurance product
pricing, especially in auto insurance. Miller et al. (2003) illustrate credit reports that
contain a wide variety of credit information on an individual consumer. In addition
to information that identifies a particular person, the report contains data on credit
card and loan balances, types of credit, the status of each account, judgments, liens,
collections, bankruptcies, and credit inquiries.
These credit scores have long been used by lending institutions to predict the
risk associated with repaying a loan or meeting another financial responsibility.
Insurance score modelers have begun to combine and weight selected credit
attributes to develop a single insurance score. These “insurance scores” are added
as a risk factor to create risk classification schemes in order to obtain more accurate
results. Although both the credit score and the insurance score are derived from a
person’s credit report, the two scores are different. There is no reason to believe
that a credit score measuring the likelihood of loan repayment will be based on
the same attributes (or that each attribute will be given the same weight) as those
used to calculate an insurance score, and vice versa. Unfortunately, some in the
insurance industry have come to call credit-based insurance scores simply “credit
score.” This abuse of language may have led some to conclude that the advantages
and disadvantages of using credit scores in the lending industry are directly related
to the advantages and disadvantages of insurance. It may also have led some to
attempt to apply the results of studies of credit scores by lending institutions to
the use of insurance scores by insurers. Morris et al. (2017) point out that credit-
based insurance scores are perhaps the most important example of the insured
person’s characteristics that is regulated because it may yield potentially suspect
classifications. Any correlation between insurance scores and race or income is
potentially troubling. First, insurance scoring can have a disparate impact on racial
minorities and low-income households, causing members of these groups to pay
higher premiums on average. Credit information, for example, has been used for
auto and home insurance underwriting for several years, despite its potential for
proxy discrimination (as classified by Kiviat (2019) and Prince and Schwarcz
(2019)).
In particular, Brockett and Golden (2007) questioned the use of credit scores
to price insurance contracts. As noted by Kiviat (2019), inspired by Morris et al.
(2017), policyholder income, in particular, could be predictive of future insurance
claims if low-income policyholders are more likely to file claims even when losses
are only slightly above their deductible. Some insurers may not even be aware that
there is a correlation between the proxy variable (credit scores) and the suspect
variable (race and income). Even if insurers are aware of this correlation, they may
not believe that it helps to explain the power of credit information to predict claims.
Instead, they may believe, as in fact much available evidence indicates, that credit
information is predictive of claims because it measures policyholder care levels.
Some insurers, such as Root Insurance, explain that they refuse to use these credit
scores because they consider them discriminatory: “for decades, the car insurance
industry has used credit score as a major factor in calculating rates. By basing
rates on demographic factors like occupation, education, and credit score, the
traditional car insurance industry has long relied on unfair, discriminatory biases in
its pricing practices. These practices unfairly penalize historically under-resourced
communities, immigrants, and those struggling to pay large medical expenses.”
Credit scores, or more precisely variations in credit scores, can indicate changes
in protected attributes. As explained in Avery et al. (2004), individuals who recently
experienced a divorce might be expected to have a higher likelihood of payment
problems on accounts that will translate into their credit score. But marital status is
usually a protected attribute. And as shown in Dean and Nicholas (2018) and Dean
et al. (2018), “credit scores are increasingly used to understand health outcomes,”
and more and more papers describe this so-called “out-of-pocket health care” in the
USA (see Pisu et al. (2010), Bernard et al. (2011), or Zafar and Abernethy (2013)).

6.9.2 Discrimination Against the Poor

In 2013, Martin Hirsch (former director of Emmaüs, a charitable organization, and of Assistance Publique-Hôpitaux de Paris, a university hospital trust, in France)
claimed that “it’s getting expensive to be poor.” To understand why there could be
discrimination against poor people, it is important to get back to “merit,” and try
to understand on what criteria we admire people. For the Greeks, excellence, or
ἀρετή (arete), was a major virtue. This excellence went beyond moral excellence:
in the Greco-Roman world, the term evoked a form of nobility, recognizable by
the beauty, strength, courage, or intelligence of the person. Now this excellence
had little to do with wealth; thus, Herodotus is astonished that the winners of the Olympian games were content with an olive wreath and a "glorious renown" (περὶ ἀρετῆς). In the Greek ethical vision, especially among the Stoics, a "good life"
does not depend on material wealth—a precept pushed to its height by Diogenes
who, seeing a child drinking from his hands at the fountain, threw away the bowl he had, telling himself that it was yet another useless possession.
Greek society, despite its orientation toward values beyond mere material wealth,
remained profoundly hierarchical. This prompts the inquiry of pinpointing the
juncture in Western history when affluence emerged as the paramount metric for
evaluating all aspects of life. Max Weber’s theory, as expounded in Weber (1904),
is illuminating in this context: he posits that the Protestant work ethic fostered a
mindset where worldly achievements and success were considered indicative of
divine predestination for the afterlife. Concomitantly, the affluent individuals in the
present realm were perceived as the chosen ones for the future. Adam Smith, a
discerning critic of the nascent capitalist society in his era, devoted a chapter in “The
Theory of Moral Sentiments," as delineated in Smith (1759), to the subject titled "of
the corruption of our moral feelings occasioned by that disposition to admire the
rich and great, and to despise or neglect the poor and lowly.” This underscores
the historical backdrop in which adulation of the wealthy and powerful, coupled
with indifference or disdain for the impoverished and humble, began to take root.
Today, the veneration of wealth appears to have reached unprecedented levels, with
material success nearly exalted to the status of a moral virtue. Conversely, poverty
has metamorphosed into a stigmatized condition that is arduous to transcend.
Nonetheless, historical context reminds us that these circumstances are not inherent
or immutable.
Indeed, the poor have not always been “bad.” In Europe, the Church has largely
contributed to disseminating the image of the “good poor,” as it appears in the
Gospels: “happy are you poor, the kingdom of God is yours,” or “God could have
made all men rich, but he wanted there to be poor people in this world, so that the
rich would have an opportunity to redeem their sins.” Beyond this, the poor is seen
as an image of Christ, Jesus having said “whatever you do to the least of these, you
will do to me.” Helping the poor, doing a work of mercy, is a means of salvation.
For Saint Thomas Aquinas, charity is thus essential to correct social inequalities
by redistributing wealth through alms giving. In the Middle Ages, merchants
were seen as useful, even virtuous, as they allowed wealth to circulate within the
community. Priests played the role of social assistants, helping the sick, the elderly,
and people with disabilities. The hospices and “xenodochia” of the Middle Ages
(ξενοδοχεῖον, the "place for strangers," ξένος) are the symbol of this care of the
poor. And quite often, poverty is not limited to material capital, but also social and
cultural, to use more contemporary terminology.
Toward the end of the Middle Ages, the figure of the “bad poor,” the parasitic and
dangerous vagabond, appeared. Brant (1494) denounced these welfare recipients,
“some become beggars at an age when, young and strong, and in full health, one
could work: why bother.” This mistrust was reinforced by the great pandemic of the
Black Death. The hygienic theories of the end of the XIX-th century added the final
touch: if fevers and diseases were caused by insalubrity and poor living conditions,
then by keeping the poor out, the rich were protected from disease.
In the words of Mollat and du Jourdin (1986), “the poor are those who,
permanently or temporarily, find themselves in a situation of weakness, dependence,
humiliation, characterized by the deprivation of means, variable according to the
times and the societies, of power and social consideration.” Recently, Cortina
(2022) proposed the term “aporophobia,” or “pauvrophobia,” to describe a whole
set of prejudices that exist toward the poor. The unemployed are said to be
welfare recipients and lazy. These prejudices, which stigmatize a group, “the poor,”
lead to fear or hatred, generating an important cleavage, and finally a form of
discrimination. Cortina (2022)’s “pauvrophobia” is discrimination against social
precariousness, which would be almost more important than standard forms of
discrimination, such as racism or xenophobia. Cortina (2022) ironically notes that
rich foreigners are often not rejected.
But these prejudices also turn into accusations. Szalavitz (2017) thus abruptly
asks the question, “why do we think poor people are poor because of their own
bad choices?” The “actor-observer” bias provides one element of an answer: we
often think that it is circumstances that constrain our own choices, but that it is the
behavior of others that changes theirs. In other words, others are poor because they
made bad choices, but if I am poor, it is because of an unfair system. This bias is
also valid for the rich: winners often tend to believe that they got where they are by
their own hard work, and that they therefore deserve what they have.
Social sciences studies show, however, that the poor are rarely poor by choice,
and increasing inequality and geographic segregation do not help. The lack of
empathy then leads to more polarization, more rejection, and, in a vicious circle,
even less empathy.
To discriminate is to distinguish (exclude or prefer) a person because of his/her
“personal characteristics.” Can we then speak of discrimination against the poor? Is
poverty (like gender or skin color) a personal characteristic? In Québec, the concept
of “social condition” (which explicitly encompasses poverty) is considered a protected
attribute. Consequently, discrimination based on that condition is legally prohibited.
A correlation between wealth and risk exists in various contexts. In France, for
instance, there is a notable disparity in road accident deaths, with approximately
3% of “executives” and 15% of “workers” being affected, despite both groups
representing nearly 20% of the working population each. Additionally, Blanpain
(2018) highlights that there is a significant 13-year gap in life expectancy at
birth between the most affluent and the most economically disadvantaged men, as
discussed in Chap. 2.

6.10 Networks

Mathematically, a “graph” G comprises “nodes” (also known as “vertices,” the
set of nodes being denoted V ) that are interconnected by “edges” (the set of

edges being denoted as E). These mathematical structures serve as the foundation
for defining networks, which are graphs where nodes or edges possess attributes. In
a social network, nodes are individuals (or policyholders) and edges denote some
sort of “connections” between individuals. On social media, such as Facebook or
LinkedIn, an edge indicates that two individuals are friends or colleagues, linked
through some reciprocal connection. On Twitter (now X),
connections are directional, in the sense that one account is following another one
(but of course, some edges could be bidirectional). In genealogical trees, nodes are
individuals, and connections will usually denote parent-children connections. Other
popular network structures are “bipartite graphs,” where nodes are divided into two
disjoint and independent sets .V1 and .V2 , and edges connect nodes between .V1 and
V2 (and not within). Classical examples are “affiliation networks,” where employees
and employers are connected, or policyholders and brokers, car accidents and experts,
disability claims and medical doctors, etc. It is also possible to have two groups of
nodes, such as disease and drugs, and edges denote associations. Based on such a
graph, insurance companies may infer diseases based on drugs purchased.
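To make this concrete, the short sketch below (in Python with the networkx library, an illustrative choice, as no specific toolkit is prescribed here) builds a small, entirely made-up policyholder-broker affiliation network and its one-mode projection.

```python
# A minimal sketch (illustrative, not from the text): a small bipartite
# "affiliation" network linking (made-up) policyholders to brokers.
import networkx as nx
from networkx.algorithms import bipartite

G = nx.Graph()
policyholders = ["p1", "p2", "p3", "p4"]      # first node set, V1
brokers = ["broker_A", "broker_B"]            # second node set, V2
G.add_nodes_from(policyholders, bipartite=0)
G.add_nodes_from(brokers, bipartite=1)
# edges only connect V1 to V2 (who bought a policy through which broker)
G.add_edges_from([("p1", "broker_A"), ("p2", "broker_A"), ("p2", "broker_B"),
                  ("p3", "broker_B"), ("p4", "broker_B")])

print(bipartite.is_bipartite(G))              # True
# one-mode projection: two policyholders are linked if they share a broker
P = bipartite.projected_graph(G, policyholders)
print(sorted(P.edges()))                      # pairs of policyholders sharing a broker
```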
Historically, connections among people were used to understand informal risk
sharing. “Villagers experience health problems, crop failures, wildly changing
employment, in addition to a variety of needs for cash, such as dowries. They
don’t have insurance or much, if any, savings: they rely on each other for help,”
said Jackson (2019). Increasingly, insurers try to extract information from various
networks. And as Bernstein (2007) claimed, “network and data analyses compound
and reflect discrimination embedded within society.” This can be related to “peer-
to-peer” or “decentralized insurance,” as studied in Feng (2023).

6.10.1 On the Use of Networks

Scism (2019) presented a series of “life hacks,” including tips on how to behave on
social media in order to bypass insurers’ profile evaluations. For example “do not
post photos of yourself smoking,” “ post pictures of yourself exercising (but not while
engaging in a risky sport),” “use fitness tracking devices to show you are concerned
about your health,” “purchase food from healthy online meal-preparation services,”
and “visit the gym with mobile location-tracking enabled (while leaving your phone
at home when you go to a bar).”
Social networks are also important to analyse fraud. Fraud is often committed
through illegal set-ups with many accomplices. When traditional analytical tech-
niques fail to detect fraud owing to a lack of evidence, social network analysis
may give new insights by investigating how people influence each other. These are
the so-called guilt-by-associations, where we assume that fraudulent influences run
through the network. For example, insurance companies often have to deal with
groups of fraudsters trying to swindle by resubmitting the same claim using different
people. Suspicious claims often involve the same claimers, claimees, vehicles,
witnesses, and so on. By creating and analyzing an appropriate network, inspectors
270 6 Some Examples of Discrimination

may gain new insight in the suspiciousness of the claim and can prevent pursuit of
the claim. In many applications, it may be useful to integrate a second node type
in the network. Affiliation or bipartite networks represent the reason why people
connect to each other, and include the events that network objects—such as people
or companies—attend or share. An event can for example refer to a paper (scientific
fraud), a resource (social security fraud), an insurance claim (insurance fraud), a
store (credit card fraud), and so on. Adding a new type of node to the network
not only enriches the imaginative power of graphs but also creates new insights in
the network structure and provides additional information neglected before. On the
other hand, including a second type of node results in an increasing complexity of
the analysis.
As mentioned by the National Association of Insurance Commissioners (2011,
2022), “insurance companies can base premiums on all insured drivers in your
household, including those not related by blood, such as roommates.” And Boyd
et al. (2014) asserted that there is a new kind of discrimination associated not
with personal characteristics (like those discussed in the previous section) but with
personal networks. Beyond their personal characteristics (such as race or gender), an
important source of information is “who they know.” In the context of usage-based
auto insurance, the nuance is that personal networks are not those represented by
driving behavior (strictly speaking), but those defined by the places to which people
physically go.
In many countries, when it comes to employment, most companies are required
to respect equal opportunity: discrimination on the basis of race, gender, beliefs, reli-
gion, color, and national origin is prohibited. Additional regulations prohibit many
employers from discriminating on the basis of age, disability, genetic information,
military history, and sexual orientation. However, nothing prevents an employer
from discriminating based on a person’s personal network. And increasingly, as
Boyd et al. (2014) reminds us, technical decision-making tools are providing new
mechanisms by which this can happen. Some employers use LinkedIn (and other
social networking sites) to determine a candidate’s “cultural fit” for hire, including
whether or not a candidate knows people already known to the company. Although
hiring on the basis of personal relationships is by no means new, it takes on new
meaning when it becomes automated and occurs on a large scale. Algorithms
that identify our networks, or predict our behavior based on them, offer new
opportunities for discrimination and unfair treatment.

6.10.2 Mathematics of Networks, and Paradoxes

Feld (1991) has shown that in any network the average degree (i.e., the number
of neighbors, or connections) of the neighbors of a node is at least as large as
the average degree of nodes in the network as a whole. Applied to networks of
friendship, this translates simply as “on average your friends have more friends
than you do.” This phenomenon is known as the “friendship paradox.” A related

phenomenon, the generalized friendship paradox, describes similar behavior with
respect to other attributes of network nodes (Jo and Eom 2014). Are your friends
richer than you? Generalized friendship paradoxes arise when such attributes
correlate with node degree. If richer people are on average also more popular, then
wealth and popularity correlate positively and hence the tendency for your friends
to be more popular than you could mean that they are also richer. Heuristically,
the friendship paradox states that people’s friends tend to be more popular than they
themselves are. Stated a little more precisely, nodes in a network tend to have a lower
degree than their neighbors do. Consider an undirected network of n individuals
(nodes). Let A = [Ai,j ] denote the n × n adjacency matrix, and d = (di ) denote the
n-vector of degrees, in the sense that di = A⊤i 1 (the sum of the entries of row i of A).
Proposition 6.1 The average number of friends of the collection of friends of
individuals in a social network is higher than the average number of friends of
the collection of the individuals themselves. More formally,

$$\frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{d_i}\sum_{j=1}^{n}A_{ij}\,d_j\right)\;\geq\;\frac{1}{n}\sum_{i=1}^{n}d_i.$$

Proof Define the difference Δi between the average of node i’s neighbors’ degrees and its
own degree, in the sense that

$$\Delta_i=\frac{1}{d_i}\sum_{j=1}^{n}A_{ij}\,d_j-d_i,$$

where we suppose here that all nodes have nonzero degrees (at least one neighbor).
The friendship paradox states that the average of Δi across all nodes is greater than
zero. In order to prove it, write the average as

$$\frac{1}{n}\sum_{i=1}^{n}\Delta_i=\frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{d_i}\sum_{j=1}^{n}A_{ij}\,d_j-d_i\right)=\frac{1}{n}\sum_{i,j=1}^{n}A_{ij}\,\frac{d_j}{d_i}-\frac{1}{n}\sum_{i,j=1}^{n}A_{ij},$$

which yields

$$\frac{1}{n}\sum_{i=1}^{n}\Delta_i=\frac{1}{n}\sum_{i,j=1}^{n}A_{ij}\left(\frac{d_j}{d_i}-1\right)\quad\text{but also}\quad\frac{1}{n}\sum_{i,j=1}^{n}A_{ij}\left(\frac{d_i}{d_j}-1\right),$$

by exchanging the summation indices, and because A is a symmetric matrix. By
adding the two, we can write

$$\frac{2}{n}\sum_{i=1}^{n}\Delta_i=\frac{1}{n}\sum_{i,j}A_{ij}\left(\frac{d_j}{d_i}+\frac{d_i}{d_j}-2\right)=\frac{1}{n}\sum_{i,j}A_{ij}\left(\sqrt{\frac{d_j}{d_i}}-\sqrt{\frac{d_i}{d_j}}\right)^{2}\geq 0,$$

that is, $\frac{1}{n}\sum_i \Delta_i = \frac{1}{2n}\sum_{i,j}A_{ij}\big(\sqrt{d_j/d_i}-\sqrt{d_i/d_j}\big)^2 \ge 0$.
Observe that the exact equality holds only when di = dj for all pairs of neighbors,
corresponding to the case where the network is a regular graph (or possibly the
union of disjoint regular graphs). ∎
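As a quick sanity check of Proposition 6.1, the following sketch (in Python; the small adjacency matrix is made up for the illustration) compares the average degree with the average degree of neighbors.

```python
# Numerical check of the friendship paradox on a small made-up graph.
import numpy as np

A = np.array([[0, 1, 1, 1, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [1, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]])        # symmetric adjacency matrix, no self-loops
d = A.sum(axis=1)                      # degrees d_i

avg_degree = d.mean()                        # (1/n) sum_i d_i
avg_neighbor_degree = np.mean((A @ d) / d)   # (1/n) sum_i (1/d_i) sum_j A_ij d_j

print(avg_degree, avg_neighbor_degree)       # here 2.0 versus 2.2
assert avg_neighbor_degree >= avg_degree - 1e-12
```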
For the generalized friendship paradox, which considers attributes other than
degree, as in Cantwell et al. (2021), one can define an analogous quantity, Δi(x),
for some attribute x (such as wealth),

$$\Delta_i^{(x)}=\frac{1}{d_i}\sum_{j}A_{ij}\,x_j-x_i,$$

which measures the difference between the average of the attribute for node i’s
neighbors and the value for i itself. When the average of this quantity over all nodes
is positive, one may say that the generalized friendship paradox holds. In contrast
to the case of degree, this is not always true (the value of Δi(x) can be zero or
negative), but we can write the average as

$$\frac{1}{n}\sum_{i}\Delta_i^{(x)}=\frac{1}{n}\sum_{i}\left(\frac{1}{d_i}\sum_{j}A_{ij}\,x_j-x_i\right)=\frac{1}{n}\sum_{i}x_i\left(\sum_{j}\frac{A_{ij}}{d_j}\right)-\frac{1}{n}\sum_{i}x_i,$$

where the second expression again follows from interchanging summation indices (and
the symmetry of A). Defining the new quantity

$$\delta_i=\sum_{j}\frac{A_{ij}}{d_j},$$

and noting that

$$\frac{1}{n}\sum_{i}\delta_i=\frac{1}{n}\sum_{i,j}\frac{A_{ij}}{d_j}=\frac{1}{n}\sum_{j}\frac{1}{d_j}\sum_{i}A_{ij}=1,$$

we can then write

$$\frac{1}{n}\sum_{i}\Delta_i^{(x)}=\frac{1}{n}\sum_{i}x_i\delta_i-\left(\frac{1}{n}\sum_{i}x_i\right)\left(\frac{1}{n}\sum_{i}\delta_i\right)=\mathrm{Cov}(x,\delta).$$

Thus, we will have a generalized friendship paradox in the sense defined here if (and
only if) x and δ correlate positively. But this is not always the case:

$$\begin{cases}\mathrm{Cov}(d,\delta)\geq 0\\ \mathrm{Cov}(x,\delta)\geq 0\end{cases}\;\not\Rightarrow\;\mathrm{Cov}(d,x)\geq 0.$$
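The computation above can be reproduced numerically; in the sketch below (Python, with a made-up graph and a made-up attribute x), the average of Δi(x) coincides with the empirical covariance between x and δ, and happens to be negative, so the generalized paradox does not hold for that attribute.

```python
# Generalized friendship paradox: average of Delta_i^{(x)} equals Cov(x, delta),
# where delta_i = sum_j A_ij / d_j. Graph and attribute values are made up.
import numpy as np

A = np.array([[0, 1, 1, 1, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [1, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]])
d = A.sum(axis=1)
x = np.array([10.0, 50.0, 40.0, 20.0, 5.0])   # an attribute, e.g. wealth

delta = (A / d).sum(axis=1)           # delta_i = sum_j A_ij / d_j (its mean is 1)
lhs = np.mean((A @ x) / d - x)        # average of Delta_i^{(x)}
rhs = np.mean(x * delta) - np.mean(x) * np.mean(delta)   # empirical Cov(x, delta)

print(lhs, rhs)                       # identical (about -1.167 here): no paradox for x
```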

French poet Paul Verlaine warned us,9 “il ne faut jamais juger les gens sur leurs
fréquentations. Tenez, Judas, par exemple, il avait des amis irréprochables.” Never-
theless, several organizations have proposed to use information about our “friends”
in order to learn more about us, following an homophily principle (in the sense of
McPherson et al. (2001)), because as the popular saying goes, “birds of a feather
flock together.” Therefore, Bhattacharya (2015) noted that “you apply for a loan
and your would-be lender somehow examines the credit ratings of your Facebook
friends. If the average credit rating of these members is at least a minimum credit
score, the lender continues to process the loan application. Otherwise, the loan
application is rejected.” As we can see, mathematical guarantees are not strong here,
and it is very likely that such a strategy creates more biases.

9 “you should never judge people by who they associate with. Take Judas, for example, he had

friends who were beyond reproach.”


Chapter 7
Observations or Experiments:
Data in Insurance

Abstract An important challenge for actuaries is that they need to answer causal
questions with observational data. After a brief discussion about correlation and
causality, we describe the “causation ladder,” and the three rungs: association or
correlation (“what if I see...”), intervention (“what if I do...”), and counterfactuals
(“what if I had done...”). Counterfactuals are important for quantifying discrimina-
tion.

To take up the classification of Rosenbaum (2018), it is important to distinguish


“experimental” and “observational” data. In the latter case, we use data also called
“administrative data,” such as financial transaction or medical visit records, which
show what people actually do, and not what they say they do (as in surveys, for
example). Such data allow us to see what people buy, eat, where they travel, etc.,
from what they leave behind. For example, there were 185 billion transactions
using Visa cards in 2019, and 85 billion using Mastercard. This massive data
gives us information on the holder’s behavior, without their knowledge. Looking
at transactions probably allows us to better quantify the number of air trips taken by
a given person than a survey or a quick poll would; but we also suspect that these
records do not capture all transactions, and may therefore give a biased
view of behavior. If we want to understand the impact of prevention on risk,
we take data we collected from doctors, and compare the outcome of those who
go for a routine check, and those who do not. Those are “observational data” in
the sense that agents are free to act as they please, and we are content to observe
from a distance, as explained in Rosenbaum (2005). But of course, there may be a
strong bias, as discussed previously: people might be doing a routine examination
because they know that they are at risk. With “experimental data,” we consider some
controlled randomized assignment to some treatment, as explained in Shadish and
Luellen (2005). In our example, we ask some people, randomly chosen, to do a
routine check.


7.1 Correlation and Causation

In Sect. 3.4, we have described various predictive models, where some variables,
called “explanatory variables” or “predictors,” x, are used to “predict” a variable y,
through some function m. We needed variables x to be as correlated as possible with
y, so that m(x) also, hopefully, correlates with y. This correlation makes them appear
as valid predictors, and that is usually sufficient motivation to use them as pricing
variables. “Databases about people are full of correlations, only some of which
meaningfully reflect the individual’s actual capacity or needs or merit, and even
fewer of which reflect relationships that are causal in nature,” claimed Williams
et al. (2018). In Sect. 4.1, we discussed interpretability and explainability, where we
try to explain why a variable is legitimate in a pricing model, and why a specific
individual gets a large prediction (and is therefore asked a larger premium). But as
mentioned by Kiviat (2019), the causal effect from x to y was traditionally not a
worthwhile question for insurers, and the concept of “actuarial fairness” allows this
problem to be swept under the rug. But more and more “policymakers want(ed) to
understand why” some variable “predict(ed) insurance loss in order to determine if
any links in the causal chain held the wrong people to account, and as a consequence
gave them prices they did not deserve,” recalls Kiviat (2019). That is actually simply a
regulatory requirement across many countries, states, and provinces, if one wants to
prove that rates are not “unfairly discriminatory” (Fig. 7.1).

7.1.1 Correlation is (Probably) Not Causation

Everyone is familiar with the adage “correlation does not imply causation.” And
indeed, a classical question for researchers, but also policy makers, is whether
a significant correlation between two variables represents a causal effect. As
mentioned in Traag and Waltman (2022), a similar question arises when we observe
a difference in outcomes y between two groups: does the difference represent a bias

Fig. 7.1 Correlation, by Randall Munroe, 2009 (Source: https://xkcd.com/552/ )



or disparity? If the difference is based on a causal effect, there may be a disparity,


or a bias. But if the difference does not represent a causal effect, there is probably
no bias or disparity.
To illustrate that issue, in an article published in the Wall Street Journal, entitled
Medicaid is worse than no coverage at all, Scott Gottlieb stated that “dozens of
recent medical studies show that Medicaid patients suffer for it. In some cases,
they’d do just as well without health insurance.” Specifically, Gottlieb (2011) relied
on LaPar et al. (2010), which showed, statistically, that uninsured patients were
about 25% less likely than those with Medicaid to have an “in-hospital death,” see
Table 7.1.
Several studies have found that Medicaid patients with specific conditions (e.g.,
cancer) or who have undergone specific treatments (e.g., lung transplantation or
coronary angioplasty) had significantly poorer health outcomes than patients with
private insurance, the same conditions, or same procedures. For example, for major
surgery, uninsured patients were about 25% less likely than Medicaid patients to
die in the hospital. Being insured by Medicaid would almost seem to make the risk
worse! The potential problem with this type of study is that the comparison groups,
Medicaid patients and privately insured or uninsured patients, may be subject to
self-selection. It is likely that many patients do not enroll in Medicaid until after,
sometimes long after, the onset of a serious medical problem. In this case, those
who choose to enroll in Medicaid may be sicker than those with private insurance
or those who are uninsured. In a way, this self-selection effect is a form of reverse
causation. Being sick drives people to enroll in Medicaid, not the other way around.
This is the concern with “observational data.”

Table 7.1 In-hospital mortality for all patients undergoing major surgery, by major payer group
(Source: LaPar et al. (2010), Tables 4 and 5)

                                   Medicare        Medicaid        Uninsured       Insurance
In-hospital mortality              4.4%            3.7%            3.2%            1.3%
Pulmonary resection                4.3%            4.3%            6.2%            2.0%
Esophagectomy                      8.7%            7.5%            6.5%            3.0%
Colectomy                          7.5%            5.4%            3.9%            1.8%
Pancreatectomy                     6.1%            5.8%            8.4%            2.7%
Gastrectomy                        10.8%           5.4%            5.0%            3.5%
Aortic aneurysm                    12.4%           14.5%           14.8%           7.0%
Hip replacement                    0.4%            0.2%            0.1%            0.1%
Coronary artery bypass grafting    4.0%            2.8%            2.3%            1.4%
Number of cases                    491,829         40,259          24,035          337,535
Age (years)                        73.5 ± 8.6      49.8 ± 16.4     51.8 ± 12.8     55.5 ± 11.4
Women                              49.6%           48.8%           35.8%           39.7%
Length of stay (days)              9.5 ± 0.1       12.7 ± 0.4      10.1 ± 0.3      7.4 ± 0.1
Total cost ($)                     76,374 ± 53.1   93,567 ± 251.4  78,279 ± 231.0  63,057 ± 53.0
Rural location                     10.1%           8.5%            9.8%            6.6%

In 2008, owing to severe budget constraints, Oregon found itself with a Medicaid
waiting list of 90,000 people and only enough money to cover 10,000 of them.
So the state created a lottery to randomly select people who would qualify for
Medicaid, therefore recreating the necessary preconditions. The reality, however,
was a bit more complex, as many of the lottery winners were not eligible for
Medicaid or chose not to submit their paperwork to enroll in the program. Compared
with the control group (people who did not have access to Medicaid), Finkelstein
et al. (2012) observed that the treatment group had substantially and statistically
significantly higher health care use (including primary and preventive care as well
as hospitalizations), lower out-of-pocket medical expenditures and medical debt
(including fewer bills sent to collection), and better self-reported physical and
mental health. These “experiments,” which are often difficult to implement (for
financial and sometimes ethical reasons), make it possible to bypass the bias of
administrative data. Having a non-null correlation is quite easy, but proving a causal
effect is difficult.
Definition 7.1 (Common Cause (Reichenbach 1956)) If X and Y are non-
independent, X ⊥̸⊥ Y , then either

$$\begin{cases} X \text{ causes } Y,\\ Y \text{ causes } X, \text{ or}\\ \text{there exists } Z \text{ such that } Z \text{ causes both } X \text{ and } Y.\end{cases}$$

This concept of “common cause” is ill-defined and may be seen as tautological,


as “cause” has not been defined properly. Heuristically, the (probabilistic) causation
is not that the occurrence of a single event “causes” another to happen, but rather
that the occurrence increases the likelihood of the other event happening, all else
being equal (see Hitchcock 1997). A starting point could be the case of sequential
variables, where the dynamics could make causality easier to define.

7.1.2 Causality in a Dynamic Context

Before defining causality in the context of individual data, let us recall that, in a
context of temporal data, Granger (1969) introduced a concept of “causality” that
takes a relatively simple form. Sequences of observations are useful to properly
capture this “causal” effect. Consider here a standard bivariate autoregressive time
series, where a regression of variables at time .t + 1 on the same variables at time t
is performed
$$\begin{cases} x_{1,t+1} = c_1 + a_{1,1}x_{1,t} + a_{1,2}x_{2,t} + \varepsilon_{1,t}\\ x_{2,t+1} = c_2 + a_{2,1}x_{1,t} + a_{2,2}x_{2,t} + \varepsilon_{2,t},\end{cases}$$

also noted .x t+1 = c + Ax t + ε t+1 , where the off-diagonal terms of the autoregres-
sion matrix A make it possible to quantify the lagged causality, i.e., a lagged causal effect
(between t and .t + 1) with respectively .x1 → x2 or .x1 ← x2 (see Hamilton (1994)
or Reinsel (2003) for more details). For example, Fig. 7.2 shows the scatterplot
.(x1,t , x2,t+1 ) and .(x2,t , x1,t+1 ), left and right, where respectively .x1 denotes the

number of cyclists in Helsinki, per day, in 2014 (at a given road intersection) and
.x2 denotes the (average) temperature on the same day. The graph on the left is

equivalent to asking whether the temperature “causes” the number of cyclists (if
the temperature rises, the number of cyclists on the roads rises) and the graph on the
right-hand side is equivalent to asking whether the number of cyclists “causes” the
temperature (if the number of cyclists on the roads rises, the temperature rises). In
both cases, if we estimate
$$\begin{cases} x_1 \to x_2:\; x_{2,t+1} = \gamma_1 + \alpha_{2,1}x_{1,t} + \eta_{1,t}\\ x_1 \leftarrow x_2:\; x_{1,t+1} = \gamma_2 + \alpha_{1,2}x_{2,t} + \eta_{2,t},\end{cases}$$

we observe significant regressions (but this is not a causal test)




$$\begin{cases} x_1 \to x_2:\; x_{2,t+1} = \underset{(234)}{4320} + \underset{(23)}{757}\, x_{1,t} + \eta_{1,t}, \quad R^2 = 75.72\%\\[4pt] x_1 \leftarrow x_2:\; x_{1,t+1} = \underset{(0.04)}{-1.98} + \underset{(0.00003)}{0.001}\, x_{2,t} + \eta_{2,t}, \quad R^2 = 72.41\%,\end{cases}$$

with standard errors in parentheses.

We can use the Granger test (see Hamilton 1994), on the data of Fig. 7.2, on the two
causal hypotheses (not on the levels but on the daily variations, of the number of
cyclists, and of the temperature)
$$\begin{cases} x_1 \to x_2:\; H_0: a_{2,1} = 0,\; p\text{-value} = 56.66\%\\ x_1 \leftarrow x_2:\; H_0: a_{1,2} = 0,\; p\text{-value} = 0.004\%.\end{cases}$$

Fig. 7.2 Number of cyclists per day (x2 ), in 2014, in Helsinki (Finland), and average daily
temperature (x1 ), respectively at time t on the x-axis and t + 1 on the y-axis. The regression lines
are estimated only on days when the temperature exceeded 0°C. (Data source:
https://www.reddit.com/r/dataisbeautiful/comments/8k40wl )

In other words, temperature is causally related to the presence of cyclists on the road
(the temperature “causes” the number of cyclists, according to Granger’s approach),
but not vice versa.
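The mechanics of such a lag-one Granger-type comparison can be sketched as follows (in Python, on simulated series rather than the Helsinki data; all data-generating values are arbitrary), by comparing a restricted autoregression with an unrestricted one through an F-test.

```python
# Sketch of a lag-1 Granger-type test on simulated series where, by
# construction, x1 drives x2 one step ahead (and not the other way around).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 300
x1, x2 = np.zeros(n), np.zeros(n)
for t in range(n - 1):
    x1[t + 1] = 0.7 * x1[t] + rng.normal()
    x2[t + 1] = 0.5 * x2[t] + 0.8 * x1[t] + rng.normal()

def granger_lag1(cause, effect):
    """p-value of H0: the lagged 'cause' does not improve the AR(1) of 'effect'."""
    y = effect[1:]
    X_r = np.column_stack([np.ones(len(y)), effect[:-1]])               # restricted
    X_u = np.column_stack([np.ones(len(y)), effect[:-1], cause[:-1]])   # unrestricted
    rss = lambda X: np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)
    df2 = len(y) - X_u.shape[1]
    F = (rss(X_r) - rss(X_u)) / (rss(X_u) / df2)
    return stats.f.sf(F, 1, df2)

print(granger_lag1(x1, x2))   # very small p-value: evidence that x1 "causes" x2
print(granger_lag1(x2, x1))   # typically large: no evidence in the other direction
```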
In a nondynamic context, defining causality is a more perilous exercise. In
the upcoming sections, we revisit the concept of the “causation ladder” initially
introduced by Pearl (2009b) and more recently discussed in Pearl and Mackenzie
(2018). The first stage (Sect. 7.2), referred to as “association,” represents the most
basic level where we observe a connection or correlation between two or more
variables. Moving on to the second stage (Sect. 7.3), known as “intervention,”
we encounter situations where we not only observe an association but also have
the ability to alter the world through suitable interventions (or experiments). The
third stage (Sect. 7.4) focuses on “counterfactuals.” According to Holland (1986),
the “fundamental problem of causal inference” arises from the fact that we are
confined to observing just one realization, despite the potential existence of several
alternative scenarios that could have been observed. In this section, we employ
counterfactuals to estimate the causal effect, which quantifies the disparity between
what we observed and the “potential outcome” if the individual had received the
treatment.

7.2 Rung 1, Association (Seeing, “what if I see...”)

Two variables are independent when the value of one gives no information about the
value of the other. In everyday language, dependence, association, and correlation
are used interchangeably. Technically, however, association is synonymous with
dependence (or non-independence) and is different from correlation. “Association
is a very general relationship: one variable provides information about another,”
as explained in Altman and Krzywinski (2015), in the sense that there could be
association even if the “correlation coefficient” is statistically null.

7.2.1 Independence and Correlation

Here, we have to study the joint distribution of some variables, and their conditional
distribution.
Proposition 7.1 (Chain Rule) In dimension 2, for all sets A and B,

$$\mathbb{P}[X \in A, Y \in B] = \mathbb{P}[Y \in B \mid X \in A] \cdot \mathbb{P}[X \in A],$$

and in dimension 3, for all sets A, B and C,

$$\mathbb{P}[X \in A, Y \in B, Z \in C] = \mathbb{P}[Z \in C \mid Y \in B, X \in A] \cdot \mathbb{P}[Y \in B \mid X \in A] \cdot \mathbb{P}[X \in A].$$

Proof This chain rule is a simple consequence of the definition of conditional
probability. ∎
Definition 7.2 (Independence (Dimension 2)) X and Y are independent, denoted
X ⊥⊥ Y , if for any sets A, B ⊂ R,

$$\mathbb{P}[X \in A, Y \in B] = \mathbb{P}[X \in A] \cdot \mathbb{P}[Y \in B].$$

An equivalent expression is, from the Chain Rule, that

$$\mathbb{P}[Y \in B \mid X \in A] = \mathbb{P}[Y \in B].$$

Definition 7.3 (Linear Independence) Consider two random variables X and Y .
X ⊥ Y if and only if Cov[X, Y ] = 0.

Proposition 7.2 Consider two random variables X and Y . X ⊥⊥ Y if and only if
for any functions ϕ : R → R and ψ : R → R (such that the expected values below
exist and are well defined) Cov[ϕ(X), ψ(Y )] = 0, i.e.,

$$\mathbb{E}[\varphi(X)\psi(Y)] = \mathbb{E}[\varphi(X)] \cdot \mathbb{E}[\psi(Y)].$$

Proof Hirschfeld (1935), Gebelein (1941), Rényi (1959) and Sarmanov (1963)
introduced the concept of “maximal correlation,” defined as

$$r^{*}(X,Y) = \max_{\varphi,\psi}\big\{\mathrm{Corr}[\varphi(X), \psi(Y)]\big\}$$

(provided that expected values exist and are well defined). And those authors proved
that X ⊥⊥ Y if and only if r*(X, Y ) = 0. ∎
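A short simulation (a sketch, with an arbitrary choice of distribution) illustrates the gap between zero covariance and independence: with Y = X² and X symmetric around zero, Cov(X, Y ) vanishes, but a nonlinear transformation, as in Proposition 7.2, immediately reveals the dependence.

```python
# Zero covariance does not imply independence: take Y = X^2 with X symmetric.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=100_000)
Y = X ** 2

print(np.cov(X, Y)[0, 1])            # approximately 0: X and Y are uncorrelated
print(np.corrcoef(X ** 2, Y)[0, 1])  # equals 1: Corr[phi(X), psi(Y)] with
                                     # phi(x) = x^2 and psi(y) = y, so X is not
                                     # independent of Y
```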
Definition 7.4 (Linear Independence) In a general context, consider two random
vectors .X and .Y , in .Rdx and .Rdy respectively. .X ⊥ Y if and only if for any .a ∈ Rdx
and .b ∈ Rdy

Cov[a T X, bT Y ] = 0.
.

Definition 7.5 (Independence) In a general context, consider two random vectors


X and .Y . .X ⊥⊥ Y if and only if for any .A ⊂ Rdx and .B ⊂ Rdy ,
.

P[{X ∈ A} ∩ {Y ∈ B}] = P[{X ∈ A}] · P[{Y ∈ B}].


.

Proposition 7.3 Consider two random vectors .X and .Y . .X ⊥⊥ Y if and only if for
any functions .ϕ : Rdx → R and .ψ : Rdy → R (such that the expected values below
exist and are well defined)

. E[ϕ(X)ψ(Y )] = E[ϕ(X)] · E[ψ(Y )],



or equivalently

Cov[ϕ(X), ψ(Y )] = 0.
.

Proof This is the extension in the higher dimension of Proposition 7.2. u


n
Definition 7.6 (Mutual Independence) Let Y = (Y1 , · · · , Yd ) denote some
random vector. All components of Y are (mutually) independent if for any
A1 , · · · , Ad ⊂ R,

$$\mathbb{P}\big[\{(Y_1,\cdots,Y_d)\in A_1\times\cdots\times A_d\}\big]=\prod_{i=1}^{d}\mathbb{P}[\{Y_i\in A_i\}].$$

Definition 7.7 (Independent Version of Some Random Vector) Let Y =
(Y1 , · · · , Yd ) denote some random vector. Y⊥ = (Y1⊥ , · · · , Yd⊥ ) is an independent
version of Y if

$$\begin{cases}(Y_1^{\perp},\cdots,Y_d^{\perp}) \text{ are mutually independent random variables}\\ Y_i^{\perp}\overset{\mathcal{L}}{=} Y_i,\;\forall i=1,\cdots,d.\end{cases}$$

In Sect. 4.1, we have discussed the difference between ceteris paribus (“all other
things being equal” or “other things held constant”) and mutatis mutandis (“once the
necessary changes have been made”). As discussed, ceteris paribus has to do with
versions of some random vector with some independent components (we consider
some explanatory variables as if they were independent of the other ones).
Definition 7.8 (Version of Some Random Vector with Independent Margin)
Let Y = (Y1 , · · · , Yd ) denote some random vector. (Y1⊥ , Y2 , · · · , Yd ) is a version
of Y with independent first margin if

$$\begin{cases}Y_1^{\perp} \perp\!\!\!\perp \boldsymbol{Y}_{-1}=(Y_2,\cdots,Y_d)\\ Y_1^{\perp}\overset{\mathcal{L}}{=} Y_1.\end{cases}$$

One can easily extend the previous concept for some subset of indices .J ⊂
{1, · · · , d}. All those concepts can be extended to the case of conditional indepen-
dence.
Definition 7.9 (Conditional Independence (Dimension 2)) X and Y are indepen-
dent conditionally on Z, denoted X ⊥⊥ Y | Z, if for any sets A, B, C ⊂ R,

$$\mathbb{P}[X \in A, Y \in B \mid Z \in C] = \mathbb{P}[X \in A \mid Z \in C] \cdot \mathbb{P}[Y \in B \mid Z \in C].$$

Again, an alternative characterization is

$$\mathbb{P}[Y \in B \mid X \in A, Z \in C] = \mathbb{P}[Y \in B \mid Z \in C].$$

Again, this notion can be extended to higher dimensions.


Definition 7.10 (Conditional Independence) In a general context, consider three
random vectors .X, .Y and .Z. .(X ⊥⊥ Y )|Z if and only if for any .A ⊂ Rdx , .B ⊂ Rdy
and .C ⊂ Rdz ,

P[{X ∈ A} ∩ {Y ∈ B}|Z ∈ C] = P[{X ∈ A}|Z ∈ C] · P[{Y ∈ B}|Z ∈ C].


.

See Dawid (1979) for various properties on conditional independence. All those
concepts are well known in actuarial science, but there still are several pitfalls. So
let us recall some simple properties.
Proposition 7.4 Consider three random variables X, Y , and Z. If X ⊥ Z and
Y ⊥ Z, then aX + bY ⊥ Z, for any a, b ∈ R.
Proof By the linearity of the covariance,

$$\mathrm{Cov}[aX+bY,Z]=a\underbrace{\mathrm{Cov}[X,Z]}_{=0\;(X\perp Z)}+b\underbrace{\mathrm{Cov}[Y,Z]}_{=0\;(Y\perp Z)}=0. \qquad ∎$$
Proposition 7.5 Consider three random variables X, Y , and Z. If X ⊥ Z and
Y ⊥ Z, it does not imply that ψ(X, Y ) ⊥ Z, for any ψ : R² → R.
Proof Consider

$$(X,Y,Z)=\begin{cases}(0,0,0)&\text{with probability }1/4,\\ (0,1,1)&\text{with probability }1/4,\\ (1,0,1)&\text{with probability }1/4,\\ (1,1,0)&\text{with probability }1/4.\end{cases}$$

One can easily get that

$$\begin{cases}X,\,Y,\,Z\sim\mathcal{B}(1/2)\\ XY,\,YZ,\,XZ\sim\mathcal{B}(1/4)\\ XYZ=0,\text{ a.s.}\end{cases}$$

Thus, on the one hand,

$$\mathbb{E}[XZ]=\frac{1}{4}=\frac{1}{2}\cdot\frac{1}{2}=\underbrace{\mathbb{E}[X]}_{=1/2}\cdot\underbrace{\mathbb{E}[Z]}_{=1/2};$$

therefore, Cov(X, Z) = 0, and similarly for the pair (Y, Z), so X ⊥ Z and Y ⊥ Z.
On the other hand, XY ⊥̸ Z as Cov(XY, Z) ≠ 0, because

$$\mathbb{E}[XY\cdot Z]=0\neq \underbrace{\mathbb{E}[XY]}_{=1/4}\cdot\underbrace{\mathbb{E}[Z]}_{=1/2}. \qquad ∎$$
Proposition 7.6 Consider a random vector X in R^k , and a random variable Z.
X ⊥ Z does not imply that ψ(X) ⊥ Z, for any ψ : R^k → R.
In the context of fairness, even if X is orthogonal to S (by construction), we can
still have Ŷ = m(X) ⊥̸ S if m is a nonlinear function. This corresponds to the
following property,
Proposition 7.7 Consider three random variables X, Y , and Z. Even if X ⊥⊥ Z
and Y ⊥⊥ Z, it does not imply either that ψ(X, Y ) ⊥ Z or that ψ(X, Y ) ⊥⊥ Z, for
any ψ : R² → R.
Proof A counter-example can be obtained using variables that are pairwise indepen-
dent, but not mutually independent. The example of the previous proof works (with
ψ(x, y) = xy). The pairs are more than uncorrelated, they are pairwise independent,
so X ⊥⊥ Z and Y ⊥⊥ Z. But

$$(XY,Z)=\begin{cases}(0,0)&\text{with probability }1/4,\\ (0,1)&\text{with probability }1/2,\\ (1,0)&\text{with probability }1/4,\end{cases}$$

and therefore, XY ⊥̸⊥ Z as

$$\mathbb{P}[XY=1,Z=0]=\frac{1}{4}\neq\frac{1}{4}\cdot\frac{1}{2}=\mathbb{P}[XY=1]\cdot\mathbb{P}[Z=0]. \qquad ∎$$
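The counterexample used in the two proofs above can be verified by direct enumeration of the four equally likely outcomes (a small sketch; the probabilities are exact, not simulated).

```python
# Enumerate the four equally likely outcomes of (X, Y, Z) from the proofs of
# Propositions 7.5 and 7.7, and check the pairwise relations exactly.
import numpy as np

outcomes = np.array([(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)])  # prob. 1/4 each
X, Y, Z = outcomes.T
E = lambda v: v.mean()                  # expectation under the uniform weights

print(E(X * Z) - E(X) * E(Z))           #  0.0   : Cov(X, Z) = 0
print(E(Y * Z) - E(Y) * E(Z))           #  0.0   : Cov(Y, Z) = 0
print(E(X * Y * Z) - E(X * Y) * E(Z))   # -0.125 : Cov(XY, Z) != 0

p_joint = np.mean((X * Y == 1) & (Z == 0))      # P[XY = 1, Z = 0] = 1/4
p_prod = np.mean(X * Y == 1) * np.mean(Z == 0)  # P[XY = 1] P[Z = 0] = 1/8
print(p_joint, p_prod)                  # so XY and Z are not independent either
```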
Proposition 7.8 Consider a random vector X in R^k , and a random variable Z.
X ⊥⊥ Z does not imply either that ψ(X) ⊥ Z or ψ(X) ⊥⊥ Z, for any ψ : R^k → R.
In the context of fairness, if we were able to ensure that X⊥ ⊥⊥ S, we can still
have Ŷ = m(X⊥ ) ⊥̸⊥ S (and even Ŷ = m(X⊥ ) ⊥̸ S).

7.2.2 Dependence with Graphs

Although we have introduced various concepts related to independence, we do not have
any proper definition of a “causal effect.” Spirtes et al. (1993) noted that it would

be difficult to “define” what causality is, and that it would probably be simpler
to “axiomatize” it. In other words, the approach involves identifying the essential
attributes that must be present for a relationship to be classified as “causal,”
expressing these properties in mathematical terms, and then assessing whether these
axioms lead to the meaningful characterization of a causal relationship.
For example, it seems legitimate that these relations are transitive: if .x1 causes
.x2 and if .x2 causes .x3 , then it must also be true that .x1 causes .x3 . We could then

talk about global causality. But a local version seems to exist: if .x1 causes .x3 only
through .x2 , then it is possible to block the influence of .x1 on .x3 if we prevent .x2 from
being influenced by .x1 . One could also ask that the causal relation be irreflexive, in
the sense that .x1 cannot cause itself. The danger of this property is that it tends to
seek a causal explanation for any variable. Finally, an asymmetrical property of the
relation is often desired, in the sense that .x1 causes .x2 implies that .x2 cannot cause
.x1 . Here again, precision is necessary, especially if the variables are dynamic: this

property does not prevent a lagged feedback effect. It is indeed possible for .x1,t to
cause .x2,t+1 and for .x2,t+1 to cause .x1,t+2 , but not for .x1,t to cause .x2,t and for
x2,t to cause x1,t , with this approach. As noted by Wright (1921), well before Pearl
(1988), the most natural tool to describe these causal relations visually and simply
1988), the most natural tool to describe these causal relations visually and simply
is probably through a directed graph. Within this conceptual framework, a variable
is depicted as a node within the network (e.g., .x1 or .x2 ), and a causal relationship,
such as “.x1 causes .x2 ,” is visually represented by an arrow that directs from .x1 to .x2
(akin to our previous approach in analyzing time series).
A variable is here a node of the network (for example .x1 or .x2 ), and a causal
relation, in the sense “.x1 causes .x2 ” will be translated by an arrow directed from .x1
to .x2 (as we did on time series).
Definition 7.11 (Confounder) A variable is a confounder when it influences both
the dependent variable and the independent variable, causing a spurious association.
The existence of confounders is an important quantitative explanation why “corre-
lation does not imply causation.” For example, in Fig. 7.3a, .x1 is a confounder for
.x2 and .x3 . The term “fork” is also used.


Fig. 7.3 Some examples of directed (acyclical) graphs, with three nodes, and two connections. (a)
corresponds to the case where .x1 is a confounder for .x2 and .x3 , corresponding to a common shock
or mutual dependence (also called “fork”), (b) corresponds to the case where .x2 is a mediator for
.x1 and .x3 (also called “chain”), and (c) corresponds to the case where .x3 is a collider for .x1 and
.x2 , corresponding to a mutual causation case. (a) Confounder (b) Mediator (c) Collider

Definition 7.12 (Collider) A variable is a collider when it is causally influenced


by two or more variables.
For example, in Fig. 7.3c, .x3 is a collider for .x1 and .x2 . The term “inverted fork” is
also used.
Definition 7.13 (Mediator) A mediation model proposes that the independent
variable influences the mediator variable, which in turn influences the dependent
variable.
For example, in Fig. 7.3b, .x2 is a mediator for .x1 (the independent variable) and .x3
(the dependent variable). The term “chain” is also used.
In example (b), we say that x2 and x3 are causally dependent on x1 , x2 directly
and x3 indirectly. We will say that x1 is a causal parent of x2 (a cause),
and conversely that .x2 is a causal child of .x1 (a consequence). This parent/child
relationship is associated with the existence of a link between the two variables. We
say that .x1 is a causal ancestor of .x3 , and conversely that .x3 is a causal descendant of
.x1 . This ancestor/descendant relation is associated with the existence of a (directed)

path between the two variables, i.e., a succession of links. If there is no path between
two nodes, we say that the two variables are causally independent. And if there is
a directed path from node .x1 to node .x2 , then .x1 causally affects .x2 : if .x1 had been
different, .x2 would also have been different. Therefore, causality has a direction.
A collider is a variable that is the consequence of two or more variables, like .x3
in (c). A noncollider is a variable influenced by only one variable, and it allows
a consequence to be causally transmitted along a path. The causal variables that
influence the collider are themselves not necessarily associated, together. If this is
the case, the collider is said to be shielded and the variable is the vertex of a triangle.
For (a) .x2 /⊥⊥ x3 whereas .x2 ⊥⊥ x3 | x1 , for (b) .x1 /⊥⊥ x3 whereas .x1 ⊥⊥ x3 | x2 , and
for (c) .x1 ⊥⊥ x2 .
In Fig. 7.4, x4 is a “descendant” of x1 , a child of x2 (and x3 ), a parent of x5 (and
.x6 ), and an “ancestor” of .x7 . The variables .x3 and .x5 are not causally independent.

.x4 is a collider, but .x6 is not. .x4 is an unshielded collider because .x2 and .x3 (the two

parents) are not connected (but they are not independent)


The idea now is to associate “dependence” with “connectedness” between nodes
(i.e., the existence of a connecting path) and “independence” with “unconnected-
ness,” also coined “separation.” If Kiiveri and Speed (1982) introduced most of
the concepts, Koller and Friedman (2009) and Peters et al. (2017) provided recent
overviews. Let us formalize here the concepts described above.

Fig. 7.4 An example of a directed graph, with seven nodes, x1 , · · · , x7 , and eight edges,
{1 → 2}, · · · , {6 → 7}

Fig. 7.5 On the directed graph of Fig. 7.4, examples of blocking a path. Path
π = {x1 → x2 → x4 → x5 } (blue), from x1 (blue) to x5 (blue), is blocked by x2 (red) (on
the left, (a)), and not blocked by x3 (blue) (on the right, (b))

Definition 7.14 (Path) A path .π from a node .xi to another node .xj is a sequence
of nodes and edges starting at .xi and ending at .xj .
On the causal graph of Fig. 7.4, .π = {x1 → x2 → x4 → x5 } is a path from node
x1 to .x5 . To go further, a conditioning set .x c is simply a collection of nodes.
.

Definition 7.15 (Blocking Path) A path π from a node xi to another node xj is
blocked by x c whenever there is a node xk on the path such that either xk ∈ x c and

$$\{x_{k^-}\to x_k\to x_{k^+}\}\quad\text{or}\quad\{x_{k^-}\leftarrow x_k\leftarrow x_{k^+}\}\quad\text{or}\quad\{x_{k^-}\leftarrow x_k\to x_{k^+}\},$$

or xk ∉ x c (as well as all descendants of xk ) and {x_{k^-} → x_k ← x_{k^+}}. In that case, write
xi ⊥G−π xj | x c .
On the causal graph of Fig. 7.5, the path .π = {x1 → x2 → x4 → x5 } (from .x1
to .x5 ) is blocked by .x2 , on the left-hand side, (a), and not blocked by .x3 , on the
right-hand side, (b).
Definition 7.16 (d-separation (nodes)) A node .xi is said to be d-separated with
another node .xj by .x c whenever every path from .xi to .xj is blocked by .x c . We will
simply denote .xi ⊥G xj | x c .
On the causal graph of Fig. 7.6, at the top, nodes x1 and x5 are d-separated by x4 ,
at the top left (a), as no path that does not contain x4 connects x1 and x5 . Similarly,
they are d-separated by nodes (x2 , x3 ), at the top right (b), as there is no path that
does not contain x2 or x3 that connects x1 and x5 . At the bottom, nodes x1 and x5 are
neither d-separated by x3 , bottom left (c), nor by the pair (x3 , x6 ), bottom right (d),
as there is a path that does not contain x3 or (x3 , x6 ) respectively, that connects x1 and x5 .
In both cases, path π = {x1 → x2 → x4 → x5 } can be considered (it connects x1
and x5 , and does not contain x3 or x6 ).
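These statements can be verified mechanically. The sketch below uses networkx (assuming a version that provides d_separated; it is renamed is_d_separator in more recent releases), and the edge list of the graph of Fig. 7.4 is reconstructed from the text, so it should be read as an assumption.

```python
# Checking the d-separation claims of Fig. 7.6 with networkx (>= 2.8).
# The edge list below is a reconstruction of the graph of Fig. 7.4.
import networkx as nx

G = nx.DiGraph([(1, 2), (1, 3), (2, 4), (3, 4),
                (4, 5), (4, 6), (5, 7), (6, 7)])

print(nx.d_separated(G, {1}, {5}, {4}))      # True : blocked by x4
print(nx.d_separated(G, {1}, {5}, {2, 3}))   # True : blocked by (x2, x3)
print(nx.d_separated(G, {1}, {5}, {3}))      # False: 1 -> 2 -> 4 -> 5 stays open
print(nx.d_separated(G, {1}, {5}, {3, 6}))   # False: same open path
```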
Definition 7.17 (d-separation (sets)) A set of nodes .x i is said to be d-separated
with another set of nodes .x j by .x c whenever every path from any .xi ∈ x i to any
.xj ∈ x j is blocked by .x c . We simply denote .x i ⊥
G xj | xc.


Fig. 7.6 On the directed graph of Fig. 7.4, examples of d-separation. Nodes .x1 (blue) and .x5 (blue)
are d-separated by .x4 (red) (top left (a)), by .(x2 , x3 ) (red) (top right (b)), and not blocked by .x3
(blue) (bottom left (c)) and .(x3 , x6 ) (blue) (bottom right (d))

Proposition 7.9 Two nodes .xi and .xj are d-separated by .x c if and only if members
of .x c block all paths from .xi to .xj .
Proposition 7.10 Similarly, two sets of nodes x i and x j are d-separated by x c if and
only if members of x c block all paths from any node in x i to any node in x j .
Definition 7.18 (Markov Property) Given a causal graph .G with nodes .x, the joint
distribution of .X satisfies the (global) Markov property with respect to .G if, for any
disjoints .x 1 , .x 2 and .x c

x 1 ⊥G x 2 | x c ⇒ X1 ⊥⊥ X2 | X c .
.

The probability chain rule allows us to calculate the probability of an intersection


of events using conditional probabilities, because (if .P[B|A] is denoted .PA [B])

P[A1 ∩ · · · ∩ An ]= P[A1 ]×PA1 (A2 )×PA1 ∩A2 [A3 ]× · · · ×PA1 ∩···∩An−1 [An ],
.

which we could write

P[x1 , · · · , xn ] = P[x1 ] × P[x2 |x1 ] × P[x3 |x1 , x2 ] × · · · × P[xn |x1 , · · · , xn−1 ].


.

But this writing is not unique, because we could also write (for example)

P[x1 , · · · , xn ] = P[xn ] × P[xn−1 |xn ] × P[xn−2 |xn , xn−1 ] × · · · × P[x1 |xn , · · · , x2 ],


.

as .P[xn , xn−1 ] = P[xn ] × P[xn−1 |xn ] as well as .P[xn−1 ] × P[xn |xn−1 ].



Fig. 7.7 Other examples of directed causal graphs, (e) and (f)

The idea here is to write conditional probabilities involving only the variables
and their causal parents. For example, the graph (e) in Fig. 7.7 would correspond to

P[x1 , x2 , x3 , x4 ] = P[x1 ] × P[x2 |x1 ] × P[x3 |x1 , x2 ] × P[x4 |x1 , x2 , x3 ],


.

whereas graph (f) of Fig. 7.7 would be associated with the writing

P[x1 , x2 , x3 , x4 ] = P[x1 , x2 ] × P[x3 |x1 , x2 ] × P[x4 |x1 , x2 , x3 ],


.

which can also be written as

P[x1 , x2 , x3 , x4 ] = P[x1 ] × P[x2 ] × P[x3 |x1 , x2 ] × P[x4 |x1 , x2 , x3 ],


.

because .x1 and .x2 are assumed to be independent. It is not uncommon to add a
Markovian assumption, corresponding to the case where each variable is indepen-
dent of all its ancestors conditional on its parents. For example, on the graph (e) of
Fig. 7.7, the Markov hypothesis allows

.P[x3 |x1 , x2 ] = P[x3 |x2 ]

to be written. Also, the graph (e) of Fig. 7.7 would correspond to

P[x1 , x2 , x3 , x4 ] = P[x1 ] × P[x2 |x1 ] × P[x3 |x2 ] × P[x4 |x3 ].


.

whereas graph (f) of Fig. 7.7 would be associated with the writing

P[x1 , x2 , x3 , x4 ] = P[x1 ] × P[x2 ] × P[x3 |x1 , x2 ] × P[x4 |x3 ].


.

To go further in the examples, on the diagram (a) of Fig. 7.3,

$$\mathbb{P}[x_1,x_2,x_3]=\mathbb{P}[x_1]\cdot\mathbb{P}[x_2\mid x_1]\cdot\mathbb{P}[x_3\mid x_1],$$


.

such that
P[x1 , x2 , x3 ]
.P[x2 , x3 |x1 ] = = P[x2 |x1 ] · P[x3 |x1 ],
P[x1 ]

and therefore .x2 ⊥⊥ x3 conditionally to .x1 . In the diagram (b) of Fig. 7.3

P[x1 , x2 , x3 ] = P[x1 ]P[x2 |x1 ] · P[x3 |x2 ],


.

such that
P[x1 , x2 , x3 ] P[x1 ]P[x2 |x1 ]
P[x1 , x3 |x2 ] =
. = P[x3 |x2 ] = P[x1 |x2 ] · P[x3 |x2 ],
P[x2 ] P[x2 ]

and therefore .x1 ⊥⊥ x3 conditionally to .x2 .


In the diagram (c) of Fig. 7.3

P[x1 , x2 , x3 ] = P[x1 ] · P[x2 ] · P[x3 |x1 , x2 ],


.

such that

.P[x1 , x2 ] = P[x1 ] · P[x2 ],

in other words .x1 ⊥⊥ x2 , but

P[x1 , x2 , x3 ] P[x1 ] · P[x2 ] · P[x3 |x1 , x2 ]


P[x1 , x2 |x3 ] =
. = ,
P[x3 ] P[x3 ]

and therefore .x1 and .x2 are not independent, conditional on .x3 .
See Côté et al. (2023) for more details about causal graphs, conditional indepen-
dence and fairness, in the context of insurance.

7.3 Rung 2, Intervention (Doing, “what if I do...”)

In this section, we discuss “interventions,” corresponding to the idea of cutting
arrows in causal graphs, and forcing some values for some nodes. This is denoted
do(x1 = x), meaning that (1) all arrows arriving at node x1 are removed, and (2) the
value of x1 is forced to take value x.

7.3.1 The do() Operator and Computing Causal Effects

To formalize the concept of “intervention,” we simply note that .P[Y ∈ A|X = x]


describes how .Y ∈ A is likely to occur if X happened to be equal to x. Therefore,
it is an observational statement. We denote .P [Y ∈ A|do(X = x)] to describe how
.Y ∈ A is likely to occur if X is set to x (to avoid confusion, we use P and not

.P, in this introduction). Here, it is an intervention statement. Using causal graphs,



Fig. 7.8 Illustration of the .do operator, with two forks, (a) and (d), z being a collider on (a) and
a confounder on (d), and two chains, (b) and (c). At the top, the causal graphs, and at the bottom,
the implied graphs when an intervention on x, corresponding to “.do(x)” is considered. (a) Fork z
collider (b) Chain (c) Chain (d) Fork z confounder

the intervention .do(X = x) means that all incoming edges to x are cut. Hence,
.P [Y ∈ A|do(X = x)] can be seen as a .Q[Y ∈ A|X = x], where the causal graph
has been manipulated. On the two graphs on the left-hand side of Fig. 7.8a and b,
.P [Y ∈ A|do(X = x)] = P[Y ∈ A|X = x]. On the two graphs on the right-hand

side of Fig. 7.8c and d,


E
P [Y ∈ A|do(X = x)] = Q[Y ∈ A|X = x] =
. Q[Y ∈ A, Z = z|X = x],
z

by the law of total probability. Using Bayes’ rule


E
P [Y ∈ A|do(X = x)] =
. Q[Y ∈ A|X = x, Z = z] · Q[Z = z],
z

and as .Q is the probability on the cut graph,


E
P [Y ∈ A|do(X = x)] =
. P[Y ∈ A|X = x, Z = z]·P[Z = z] /= P[Y ∈ A|X=x].
z

If P[Y ∈ A|do(X = x)] ≠ P[Y ∈ A|X = x], it means that X and Y are
confounded. And now that we have a better understanding, as in most textbooks, let
us denote this expression P[Y ∈ A|do(X = x)] (keeping in mind that it is not equal
to P[Y ∈ A|X = x]). To get a better understanding of this idea of interventions via
this do() operator, let us introduce “structural causal models.”
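A toy numerical example (all probabilities below are made up) may help: with a binary confounder Z, the adjustment formula averages P[Y = 1 | X = x, Z = z] over the marginal distribution of Z, not over its conditional distribution given X = x, and the two answers differ.

```python
# Toy illustration of the adjustment formula when Z confounds X and Y:
# P[Y=1 | do(X=x)] = sum_z P[Y=1 | X=x, Z=z] P[Z=z], which generally differs
# from the observational P[Y=1 | X=x]. All numbers are made up.
p_z = {0: 0.7, 1: 0.3}                        # P[Z = z]
p_x1_given_z = {0: 0.2, 1: 0.8}               # P[X = 1 | Z = z]
p_y1_given_xz = {(0, 0): 0.10, (0, 1): 0.30,  # P[Y = 1 | X = x, Z = z]
                 (1, 0): 0.20, (1, 1): 0.40}

# observational quantity P[Y = 1 | X = 1], by Bayes' rule
num = sum(p_y1_given_xz[(1, z)] * p_x1_given_z[z] * p_z[z] for z in (0, 1))
den = sum(p_x1_given_z[z] * p_z[z] for z in (0, 1))
p_obs = num / den                             # about 0.326

# interventional quantity P[Y = 1 | do(X = 1)], by the adjustment formula
p_do = sum(p_y1_given_xz[(1, z)] * p_z[z] for z in (0, 1))   # 0.26

print(p_obs, p_do)                            # seeing is not doing
```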

7.3.2 Structural Causal Models

For simplicity, we present here linear Gaussian structural causal models, associated
with an acyclic causal graph. The definition is very close to the “algorithmic”

definition of the Markov property, well known in the context of homogeneous


Markov chains (see, for example, Rolski et al. 2009): if .(Xt ) is a Markov chain,
then

$$(X_t\mid X_{t-1}=x_{t-1}, X_{t-2}=x_{t-2},\cdots,X_1=x_1,X_0=x_0)\overset{\mathcal{L}}{=}(X_t\mid X_{t-1}=x_{t-1}),$$

and equivalently, there are independent variables .(Ut ) and a measurable function h
such that .Xt = h(Xt−1 , Ut ). In the case of a causal graph, quite naturally, if C is a
cause, and E the effect, we should expect to have .E = h(C, U ) for some measurable
function h and some random noise U . This is the idea of structural models.
Definition 7.19 (Structural Causal Models (SCM) (Pearl 2009b)) In a simple
causal graph, with two nodes C (the cause) and E (the effect), the causal graph
is .C → E, and the mathematical interpretation can be summarized in two
assignments:
$$\begin{cases} C = h_c(U_C)\\ E = h_e(C, U_E),\end{cases}$$

where .UC and .UE are two independent random variables, .UC ⊥⊥ UE .
More generally, a structural causal model is a triplet .(U , V , h), as in Pearl
(2010) or Halpern (2016). The variables in .U are called exogenous variables, in
other words, they are external to the model (we do not have to explain how they
are caused). The variables in .V are called endogenous. Each endogenous variable
is a descendant of at least one exogenous variable. Exogenous variables cannot
be descendants of any other variable and, in particular, cannot be descendants
of an endogenous variable. Also, they have no ancestors and are represented as
roots in causal graphs. Finally, if we know the value of each exogenous variable,
we can, using .h functions, determine with perfect certainty the value of each
endogenous variable. The causal graphs we have described consist of a set of n
nodes representing the variables in .U and .V , and a set of edges between the n
nodes representing the functions in .h. Observe that we consider acyclical graphs,
not only for a mathematical reason (to ensure that the model is solvable) but also
for interpretation: a cycle between x, y, and z would mean that x causes y, y causes
z, and z causes x. In a static setting (such as the one we consider here), that is not
possible.
In the causal diagram (a) in Fig. 7.9, we have two endogenous variables, x and
y, and two exogenous variables, .ux and .uy . The diagram (a) is a representation

Fig. 7.9 Causal diagram of a structural causal model, with an intervention on x on the right

Fig. 7.10 Two causal diagrams, x → y, with a mediator m on the left-hand side ((a) and (b), with
an intervention on x in (b)) and a confounding factor w on the right-hand side ((c) and (d), with an
intervention on x in (d)). Variables u are exogenous

of the real world, but we assume here that it is possible to make interventions,
and to change the value of x, assuming that all things remain equal. We use here
the notation .Y * to write the “potential” outcome if an intervention were to be
considered:

real world:
$$\begin{cases} X = h_x(U_x)\\ Y = h_y(X, U_y)\end{cases}$$
with intervention (do(x)):
$$\begin{cases} X = x\\ Y_x^{*} = h_y(x, U_y)\end{cases}$$

In the causal diagram (a) in Fig. 7.10, we have three endogenous variables, x, y,
and a mediator m, and three exogenous variables, .ux , .uy and .um . The diagram (a) is
a representation of the real world, but as before, it is assumed here that it is possible
to make interventions on X.
In other words, in the presence of a mediator (m),

real world:
$$\begin{cases} X = h_x(U_x)\\ M = h_m(X, U_m)\\ Y = h_y(X, M, U_y)\end{cases}$$
with intervention (do(x)):
$$\begin{cases} X = x\\ M_x = h_m(x, U_m)\\ Y_x^{*} = h_y(x, M_x, U_y).\end{cases}$$

In the causal diagram (c) of Fig. 7.10, we have three endogenous variables, x, y,
and a confounding factor w, and three exogenous variables, ux , uy and uw . Diagram
(c) is a representation of the real world, but as before, it is assumed here that it is
possible to make interventions on X. In other words, in the presence of a confounding
factor (w),

real world:
$$\begin{cases} X = h_x(W, U_x)\\ W = h_w(U_w)\\ Y = h_y(X, W, U_y)\end{cases}$$
with intervention (do(x)):
$$\begin{cases} X = x\\ W = h_w(U_w)\\ Y_x^{*} = h_y(x, W, U_y).\end{cases}$$

$$\begin{cases} \text{mediator:} & \mathbb{P}[Y_x^{*}=1]=\mathbb{P}[Y=1\mid do(X=x)]=\mathbb{P}[Y=1\mid X=x]\\ \text{confounder:} & \mathbb{P}[Y_x^{*}=1]=\mathbb{P}[Y=1\mid do(X=x)]\neq\mathbb{P}[Y=1\mid X=x].\end{cases}$$

In fact, in the presence of a confounding factor, P[Yx* = 1], which corresponds to
P[Y = 1|do(X = x)], should be written

$$\sum_{w}\mathbb{P}[Y=1\mid W=w,X=x]\cdot\mathbb{P}[W=w]=\mathbb{E}\big(\mathbb{P}[Y=1\mid W,X=x]\big).$$

For example, we can suppose that P[Y = 1|W = w, X = x] is obtained using a
logistic model: if μx (w) = P[Y = 1|W = w, X = x],

$$\widehat{\mu}_x(w)=\frac{\exp[\widehat{\beta}_0+\widehat{\beta}_x x+\widehat{\beta}_w w]}{1+\exp[\widehat{\beta}_0+\widehat{\beta}_x x+\widehat{\beta}_w w]},$$

and the average causal effect, ACE = E(μ1 (W ) − μ0 (W )), will be estimated by

$$\widehat{\mathrm{ACE}}=\frac{1}{n}\sum_{i=1}^{n}\big(\widehat{\mu}_1(w_i)-\widehat{\mu}_0(w_i)\big).$$
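This plug-in (or g-computation) estimator can be sketched as follows, on simulated data with a single confounder W; the data-generating values and the use of scikit-learn are illustrative choices, not part of the text.

```python
# Sketch of the plug-in estimate of the average causal effect, with a logistic
# regression for mu_x(w) = P[Y = 1 | X = x, W = w]. Values are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 20_000
w = rng.normal(size=n)                                  # confounder
x = rng.binomial(1, 1 / (1 + np.exp(-0.8 * w)))         # treatment depends on w
y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 1.0 * x + 1.2 * w))))

fit = LogisticRegression(C=1e6).fit(np.column_stack([x, w]), y)
mu = lambda x0: fit.predict_proba(np.column_stack([np.full(n, x0), w]))[:, 1]

ace_hat = np.mean(mu(1) - mu(0))        # (1/n) sum_i (mu_1(w_i) - mu_0(w_i))
naive = y[x == 1].mean() - y[x == 0].mean()
print(ace_hat, naive)                   # the naive contrast is inflated by w
```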

As explained in Pearl (1998), the structural equation y = hy (x, uy ) represents
a causal mechanism that specifies the value taken by the variable y in response
to each pair of values taken by the variable x and the (exogenous) factor uy .
y = hy (x, uy ) is computed under the formal intervention “X is set to x,” which
Pearl (1998) notes “do(X = x)” (or simply do(x)), then assigned to y (historically,
from Wright (1921) to Holland (1986), passing by Neyman et al. (1923) or Rubin
(1974), various notations have been proposed). To use the notation and interpretation
of Pearl (2010), “Y would be yx* , had X been x in situation uy ” will be written
Yx* (uy ) = y, since the structural error uy has not been impacted by an intervention
on x.
In probabilistic terminology, P[Y = y|X = x] denotes the population
distribution of Y among individuals whose X value is x. Here, P[Y = y|do(X = x)]
represents the population distribution of Y if all individuals in the population had
their X value set to x. And more generally, P[Y = y|do(X = x), Z = z]
denotes the conditional probability that Y = y, given Z = z, in the distribution
created by the intervention do(X = x). Also, in the literature, the average causal
effect (ACE) corresponds to E[Y |do(X = 1)] − E[Y |do(X = 0)], or Ŷ1 − Ŷ0 if
Ŷx = E[Y |do(X = x)] (which we also note hereafter Y*X←1 − Y*X←0 , as in Russell
et al. (2017)). To calculate this quantity, given a causal graph,

$$\mathbb{P}[Y=y\mid do(X=x)]=\sum_{z}\mathbb{P}[Y=y\mid X=x, PA=z]\cdot\mathbb{P}[PA=z],$$

where P A denotes the parents of x, and z covers all combinations of values that
the variables in P A can take. A sufficient condition for identifying the causal effect
.P(y|do(x)) is that each path between X and one of its children traces at least one

arrow emanating from the measured variable, as in Tian and Pearl (2002).
To illustrate this difference between the intervention (via the do operator) and
conditioning, consider the causal graph (c) of Fig. 7.10, discussed in de Lara (2023),
based on the following structural causal model, with additive functions,


$$\begin{cases} W = h_w(u_w) = u_w\\ X = h_x(w, u_x) = w + u_x\\ Y = h_y(x, w, u_y) = x + w + u_y.\end{cases}$$
y y y

A solution (X, W, Y ) is such that

$$\begin{cases} X = W + u_x\\ W = u_w\\ Y = X + W + u_y\end{cases}\quad\text{or}\quad\begin{cases} X = u_w + u_x\\ W = u_w\\ Y = 2u_w + u_x + u_y.\end{cases}$$

As mentioned in Bongers et al. (2021), structural causal models are not always
solvable; this is why the “acyclicity” assumption is important, because it ensures
unique solvability. If we consider now a “do intervention,” with do(x = 0), we have

$$\begin{cases} X = 0\\ W^{*}_{x\leftarrow 0} = u_w\\ Y^{*}_{x\leftarrow 0} = x + w + u_y\end{cases}\quad\text{or}\quad\begin{cases} X = 0\\ W^{*}_{x\leftarrow 0} = u_w\\ Y^{*}_{x\leftarrow 0} = u_w + u_y.\end{cases}$$

Thus, on the one hand, observe that .W |X = 0 as the same distribution as .Uw
conditional on .Ux + Uw = 0, i.e., .W |X = 0 as the same distribution as .−Ux . On
L
the other hand, the distribution of .W * is .Uw . Thus, generally, .(W |X = 0) /= Wx←0
*

which corresponds to .(W |do(x = 0)).
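The difference between conditioning and intervening can be checked by simulation; here is a minimal R sketch of the structural causal model above, assuming standard Gaussian exogenous noises (an assumption made only for illustration).

# Minimal sketch: compare the conditional distribution of W given X ~ 0
# with its distribution under the intervention do(X = 0)
set.seed(123)
n <- 1e6
u_w <- rnorm(n); u_x <- rnorm(n); u_y <- rnorm(n)
w <- u_w                  # W = h_w(u_w)
x <- w + u_x              # X = h_x(w, u_x)
y <- x + w + u_y          # Y = h_y(x, w, u_y)
# Conditioning on X = 0, approximated by |X| < 0.05 (since P[X = 0] = 0)
w_cond <- w[abs(x) < 0.05]
# Intervention do(X = 0): W is unaffected, W* = U_w
w_do <- u_w
c(var(w_cond), var(w_do)) # roughly 0.5 versus 1: (W | X = 0) and W*_{x<-0} differ in law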


Before concluding, and reaching rung 3 in the “ladder of causation,” observe that
it is possible to use causal graphs to define more precisely “bias” and “disparity”
(or “discrimination”).

Definition 7.20 (Bias (Traag and Waltman 2022)) A bias is a direct causal effect
of s on y that is considered unjustified (in a moral or ethical sense).

As stressed in Traag and Waltman (2022), whether a direct causal effect is
justified or not is “an ethical question that cannot be determined empirically using
data.”

Definition 7.21 (Disparity (Traag and Waltman 2022)) A disparity is a causal
effect of s on y that includes a bias.

Therefore, there is a disparity of s on y if at least one causal effect from s on y
represents a bias. Each bias is a disparity, but a disparity does not need to be a bias.
And a disparity is seen as unfair. It is possible to discuss redlining, mentioned in
the early parts of Chap. 1, and in Sect. 6.1.2, with respect to those definitions. In its
strict (historical) definition, the practice of redlining in the USA corresponds to the
case where organizations, such as financial institutions and insurers, deny people
services based on the area in which they live (no loans, or no insurance). And as we
have seen, owing to geographic racial segregation, intentionally or not, this leads to
not serving people of certain races. Here, there could be multiple “biases.” The use
of ZIP codes, to determine whom to insure or not, may be considered “unjustified,”
in which case there is a location bias, and a racial disparity, when insuring. But
it is also possible that there is a geographic racial bias, where people of a certain
race could be denied access to certain neighborhoods. This geographic racial bias
consequently produces a racial disparity in insuring, even if using ZIP codes were
not itself “unjustified.” If insurers use race to determine whom to insure, there is a
racial bias in insuring, and not only a racial disparity. Therefore, even if there is no
racial bias in insuring, it would not imply that there is no problem. A racial disparity
in insurance indicates that the outcome is unfair with respect to race, as discussed
in Traag and Waltman (2022).

7.4 Rung 3, Counterfactuals (Imagining, “what if I had done...”)

In the previous section, we considered Pearl’s causal modeling approach, based


on Pearl (2009a). According to Holland (1986), there is “no causation without
manipulation,” and therefore, there should be only two rungs on the “ladder
of causation” (and gender or race should not be understood to have a causal
effect, as discussed in Holland (2003)). Nevertheless, it is possible to go one step
further, with the “potential-outcome framework,” also known as Neyman–Rubin
causal modeling, from Rubin (1974), based on the concept of “counterfactuals.”
Rather than considering whether some variables (such as gender or race) are
manipulable, we consider here hypothetical possibilities (see Kohler-Hausmann
(2018) for further discussions on counterfactual causal thinking in the context of
racial discrimination).

7.4.1 Counterfactuals

Pearl and Mackenzie (2018) noted that causal inference was intended to answer the
question “what would have happened if...?” This question is central in epidemiology
(what would have happened if this person had received the treatment?) or as soon as

we try to evaluate the impact of a public policy (e.g., what would have happened
if we had not removed this tax?). But we note that this is the question we ask
as soon as we talk about discrimination (e.g., what would have happened if this
woman had been a man?). In causal inference, in order to quantify the effect of
a drug or a public policy measure, two groups are constituted, one that receives
the treatment and another that does not, and therefore serves as a counterfactual, in
order to answer the question “what would have happened if the same person had
had access to the treatment?” When analyzing discrimination, similar questions are
asked, for example, “would the price of risk be different if the same person had been
a man and not a woman?,” except that here gender is not a matter of choice, of an
arbitrary assignment to a treatment (random in so-called randomized experiments).
In fact, this parallel between discrimination analysis and causal inference was
initially criticized: changing treatment is possible, whereas sex change is a product
of the imagination. One can also think of the questions regarding the links between
smoking and certain cancers: seeing smoking as a “treatment” may make sense
mathematically, but ethically, one could not force someone to smoke just to quantify
the probability of getting cancer a few years later1 (whereas in a clinical experiment,
one could imagine that a patient is given the blue pills, instead of the red pills). We
enter here into the category of the so-called “quasi-experimental” approaches, in the
sense of Cook et al. (2002) and DiNardo (2016).
In the data, y (often called “outcome”) is the variable that we seek to model
and predict, and which will serve as a measure of treatment effectiveness. The
potential outcomes are the outcomes that would be observed under each possible
treatment, and we note y*_t the outcome that would be observed if the treatment T
had taken the value t. And the counterfactual outcomes are what would have been
observed if the treatment had been different; in other words, for a person of type t,
the counterfactual outcome is y*_{1−t} (because t takes the values {0, 1}). The typical
example is that of a person who received a vaccine (t = 1), who did not get sick
(y = 0), whose counterfactual outcome would be y*_0, sometimes noted y*_{t←0}. Before
launching the vaccine efficacy study, the two outcomes are potential, y*_0 and y*_1.
Once the study is launched, the observed outcome will be y and the counterfactual
outcome will be y*_{1−t}. Note that different notations are used in the literature, y(1)
and y(0) in Imbens and Rubin (2015), y¹ and y⁰ in Cunningham (2021), or y_{t=1} and
y_{t=0} in Pearl and Mackenzie (2018). Here, we use y*_{t←1} and y*_{t←0}, the star being a
reminder that those quantities are potential outcomes, as shown in Table 7.2.
The treatment formally corresponds, in our vaccine example, to an intervention,
which is formally a shot given to a person, or a pill that the person must swallow.
In this section, it is not possible to manipulate the variable whose causal effect we
want to measure. In the Introduction, we mentioned the idea that body mass index
(BMI) could have an impact on health status, but BMI is not a pill, it is an observed
quantity. It could be possible to manipulate variables that will have an impact on the

1 In a humorous article, Smith and Pell (2003) asked the question of setting up randomized
experiments to prove the causal link between having a parachute and surviving a plane crash.

Table 7.2 Excerpt of a standard table, with observed data t_i, x_i, y_i, and potential outcomes
y*_{i,T←0} and y*_{i,T←1}, respectively when treatment (t) is either 0 or 1. The question mark ?
corresponds to the unobserved outcome, and will correspond to the counterfactual value of the
observed outcome

  i   Treatment t_i   Outcome y_i   y*_{i,T←0}   y*_{i,T←1}   Features x_i   ···
  1        0              75            75            ?           172        ···
  2        1              52             ?           52           161        ···
  3        1              57             ?           57           163        ···
  4        0              78            78            ?           183        ···

index (by forcing a person to practice sports regularly, change their eating habits,
etc.), so that one is not measuring strictly speaking the causal effect of the BMI,
but rather that of the interventions that influence the index. In the same way, it is
impossible to intervene on certain variables, said to be immutable, such as gender
or racial origin. The counterfactual is then purely hypothetical. Dawid (2000) was
very critical of the idea that we can create (or observe) a counterfactual, because “by
definition, we can never observe such [counterfactual] quantities, nor can we assess
empirically the validity of any modelling assumption we may make about them, even
though our conclusions may be sensitive to these assumptions.”
We will say that there is a causal effect (or “identified causal effect”) of a (binary)
treatment t on an outcome y if y0* and y1* are significantly different. And as we
cannot observe these variables at the individual level, we compare the effect on sub-
populations, as shown by Rubin (1974), Hernán and Robins (2010), or Imai (2018).
Quite naturally, one might want to measure the causal effect as the difference in y
between the two groups, the treated (t = 1) and the untreated (t = 0), but unless
additional assumptions are made, this difference does not correspond to the average
causal effect (ATE, “average treatment effect”). But let us formalize a little bit more
the different concepts used here.
Definition 7.22 (Average Treatment Effect (Holland 1986)) Given a treatment
T, the average treatment effect on outcome Y is

$$\tau = \text{ATE} = E\big[Y^*_{t\leftarrow 1} - Y^*_{t\leftarrow 0}\big].$$

Definition 7.23 (Conditional Average Treatment Effect (Wager and Athey
2018)) Given a treatment T, the conditional average treatment effect on outcome
Y, given some covariates X, is

$$\tau(x) = \text{CATE}(x) = E\big[Y^*_{t\leftarrow 1} - Y^*_{t\leftarrow 0} \,\big|\, X = x\big].$$

Definition 7.24 (Individual Average Treatment Effect) Given a treatment T, the
individual average treatment effect on outcome Y, for individual i, given covariates
X_i, is

$$\text{IATE}(i) = E\big[Y^*_{i,t\leftarrow(1-t_i)} - Y^*_{i,t\leftarrow t_i}\big].$$

A naïve attempt to estimate the average treatment effect is to consider

$$\widehat{\tau}_{\text{naive}} = \underbrace{\frac{\sum_{i=1}^n y_i \mathbf{1}(t_i = 1)}{\sum_{i=1}^n \mathbf{1}(t_i = 1)}}_{\overline{y}_1} - \underbrace{\frac{\sum_{i=1}^n y_i \mathbf{1}(t_i = 0)}{\sum_{i=1}^n \mathbf{1}(t_i = 0)}}_{\overline{y}_0},$$

where ȳ_1 is the average outcome of treated observations (t_i = 1), and ȳ_0 is the
average outcome of individuals in the control group (t_i = 0). Observe that ȳ_1 and
ȳ_0 are unbiased estimates of E[Y|T = 1] and E[Y|T = 0] respectively. Therefore,
τ̂_naive is an unbiased estimate of

$$E[Y|T = 1] - E[Y|T = 0]. \tag{7.1}$$

Equation (7.1) is called the “association effect” of T on Y (first level on the ladder of
causality). This is not yet the average treatment effect, as E[Y*_{T←t}] and E[Y|T =
t] are different quantities (unless data were obtained in a randomized experiment,
without any selection bias). It is possible to “identify” the causal effect when adding
some properties.
Definition 7.25 (Unconfoundedness/Ignorability) The treatment indicator T is
said to be unconfounded, or ignorable, if

$$\big(Y^*_{T\leftarrow 1}, Y^*_{T\leftarrow 0}\big) \perp\!\!\!\perp T.$$

This property is a classical consequence of randomization, but not only that.
It implies that knowing T gives no information for predicting (Y*_{T←1}, Y*_{T←0}) (the
distributions of potential outcomes in the different treatment groups are the same).
Of course, it is necessary to take into account heterogeneity, through covariates x.
A stronger condition is then needed.

Definition 7.26 (Conditional Unconfoundedness/Strong Ignorability) The
treatment indicator T is said to be conditionally unconfounded, if

$$\big(Y^*_{T\leftarrow 1}, Y^*_{T\leftarrow 0}\big) \perp\!\!\!\perp T \mid X.$$

For simplicity, assume that all confounders have been identified, and are
categorical variables. Within a class (or stratum) x, there is no confounding effect,
and therefore, the causal effect can be identified by naive estimation, and the overall
average treatment effect is identified by aggregation: by the law of total probability,
P[Y*_{T←1} = y] is equal to

$$\sum_x P\big[Y^*_{T\leftarrow 1} = y|X = x\big]\, P[X = x] = \sum_x P\big[Y^*_{T\leftarrow 1} = y|X = x, T = 1\big]\, P[X = x],$$

that we can write

$$P\big[Y^*_{T\leftarrow 1} = y\big] = \sum_x P[Y = y|X = x, T = 1]\, P[X = x],$$

and similarly for P[Y*_{T←0} = y], so that

$$E\big[Y^*_{T\leftarrow 1} - Y^*_{T\leftarrow 0}\big] = \sum_x \big(E[Y|X = x, T = 1] - E[Y|X = x, T = 0]\big)\, P[X = x],$$

and the estimate will be

$$\widehat{\tau}_{\text{strata}} = \sum_x \widehat{\tau}_{\text{naive}}(x)\, \widehat{p}_n(X = x),$$

also called the “exact matching estimate.” Here, p̂_n(X = x) is the proportion of stratum
x in the training dataset (or P_n(X = x) with notations of Part I).
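As an illustration, here is a minimal R sketch of the naive and exact-matching (stratified) estimators, on simulated data with a single categorical confounder; the variable names and data-generating process are purely illustrative.

# Minimal sketch (illustrative data): naive versus exact-matching estimators
set.seed(123)
n <- 10000
x <- sample(c("a", "b", "c"), n, replace = TRUE, prob = c(.5, .3, .2))
e <- c(a = .2, b = .5, c = .8)[x]                         # treatment probability per stratum
t <- rbinom(n, 1, e)
y <- rnorm(n, mean = 2 * t + c(a = 0, b = 1, c = 3)[x])   # the true ATE is 2
tau_naive <- mean(y[t == 1]) - mean(y[t == 0])            # confounded contrast
# Exact matching: naive contrast within each stratum, weighted by its frequency
tau_x <- tapply(seq_len(n), x, function(i)
  mean(y[i][t[i] == 1]) - mean(y[i][t[i] == 0]))
p_x <- table(x) / n
tau_strata <- sum(tau_x * p_x[names(tau_x)])
c(naive = tau_naive, strata = tau_strata)                 # the stratified estimate is close to 2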
Unfortunately, strata might be sparse, and that will generate a lot of variability.
So, instead of considering all possible strata, it is possible to create a score, such
that conditional on the score, we would have independence. This is the idea of a
“balancing score.”
Definition 7.27 (Balancing Score (Rosenbaum and Rubin 1983)) A balancing
score is a function b(X) such that

$$X \perp\!\!\!\perp T \mid b(X).$$
An obvious scoring function is b(X) = X, but the idea here is to have b :
𝒳 → ℝ^d, where d is small (ideally d = 1), at least smaller than the dimension
of X. It could be seen as the equivalent of the concept of “sufficient statistic” in
parametric inference, in the sense that all the information contained in a sample X
can be summarized into that statistic.
Proposition 7.11 If b(X) is a balancing score, and if conditional unconfounded-
ness (as in Definition 7.26) is satisfied, then

$$\big(Y^*_{T\leftarrow 1}, Y^*_{T\leftarrow 0}\big) \perp\!\!\!\perp T \mid b(X).$$

Proof The complete proof can be found in Rosenbaum and Rubin (1983) and
Borgelt et al. (2009). It is a direct consequence of the so-called “contraction”
property, in the sense that Y ⊥⊥ T | B and Y ⊥⊥ B imply Y ⊥⊥ (T, B). See Zenere
et al. (2022) for more details about balancing scores and conditional independence
properties. ∎
A popular balancing score is the “propensity score.” With our previous notations,
let n(x) be the number of observations (out of n) within stratum x, and p̂_n(x) =
n(x)/n. The propensity score will be ê_n(x) = n_1(x)/n(x), where n_1(x) is the
number of treated individuals in stratum x. And one can write

$$\widehat{\tau}_{\text{strata}} = \sum_x \frac{n(x)}{n}\, \widehat{\tau}_{\text{naive}}(x) = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i \mathbf{1}(t_i = 1)}{\widehat{e}(x_i)} - \frac{y_i \mathbf{1}(t_i = 0)}{1 - \widehat{e}(x_i)}\right).$$

With a few calculations (see the next section for the probabilistic version), we can
actually write the latter as

$$\widehat{\tau}_{\text{strata}} = \frac{1}{n}\sum_e \sum_{e(x_i) = e} \left(\frac{y_i \mathbf{1}(t_i = 1)}{e} - \frac{y_i \mathbf{1}(t_i = 0)}{1 - e}\right) = \sum_e \widehat{\tau}_{\text{naive}}(e)\, \frac{n(e)}{n},$$

where we recognize a matching estimator, not on the strata x but on the score.
The interpretation is that, conditional on the score, we can pretend that data were
collected through some randomization process.
Definition 7.28 (Propensity Score (Rosenbaum and Rubin 1983)) The propen-
sity score e(x) is the probability of being assigned to a particular treatment given a
set of observed covariates. For a binary treatment (t ∈ {0, 1}),

$$e(x) = P[T = 1|X = x].$$

As mentioned earlier, and as proved in Rosenbaum and Rubin (1983), the
propensity score is a balancing score. Even more, a score b(x) is a balancing score
if and only if the propensity score e(x) is a function of b(x) (b is “finer” than the
propensity score). As T is binary, this comes from the fact that P[T = 1|X, e(X)] =
P[T = 1|e(X)].
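In practice, the propensity score is estimated, typically with a logistic regression. Here is a minimal R sketch, on simulated data with illustrative names, together with a rough check of the balancing property.

# Minimal sketch (illustrative data): estimating e(x) = P[T = 1 | X = x]
set.seed(1)
n <- 5000
x <- rnorm(n)
t <- rbinom(n, 1, plogis(x))                  # treatment assignment depends on x
e_hat <- predict(glm(t ~ x, family = binomial), type = "response")
# Balancing check: within bins of the estimated score, the distribution of x
# should be similar among treated and control units
bins <- cut(e_hat, quantile(e_hat, 0:5 / 5), include.lowest = TRUE)
aggregate(x, by = list(score_bin = bins, treated = t), FUN = mean)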

7.4.2 Weights and Importance Sampling

A classical technique in surveys to correct biases is to use properly chosen
weights, as explained in Pfeffermann (1993) and Biemer and Christ (2012). Inverse
probability weighting leverages the propensity score e(x) to balance covariates
between groups. Intuitively, if we divide the weight of each unit by the probability
that it will be treated, each unit has an equal probability of being treated.
Mathematically, this corresponds to a change in probability, from P to Q, where

$$\frac{dQ}{dP} = \frac{1}{2}\left(\frac{T}{e(X)} + \frac{1-T}{1-e(X)}\right),$$

so that Q is such that Q(T = 0) = Q(T = 1), and under Q, T ⊥⊥ X. This means
that the pseudo-population (obtained by re-weighting) looks as if the treatment was
randomly allocated by tossing an unbiased coin, and

$$\begin{aligned}
E_Q[Y \mid T = 1] &= \frac{E_Q[TY]}{Q(T = 1)} = 2 \cdot \frac{1}{2} \cdot E_P\left[YT\left(\frac{T}{e(X)} + \frac{1-T}{1-e(X)}\right)\right] \\
&= E_P\left[\frac{T}{e(X)} \cdot Y\right] = E_P\left[E_P\left[\frac{T}{e(X)} \cdot Y \,\Big|\, X\right]\right] = E_P\left[\frac{E_P[TY \mid X]}{e(X)}\right] \\
&= E_P\left[\frac{E_P\big[T\, Y^*_{T\leftarrow 1} \mid X\big]}{e(X)}\right] = E_P\left[\frac{e(X)\cdot E_P\big[Y^*_{T\leftarrow 1} \mid X\big]}{e(X)}\right] \\
&= E_P\Big[E_P\big[Y^*_{T\leftarrow 1} \mid X\big]\Big] = E_P\big[Y^*_{T\leftarrow 1}\big],
\end{aligned}$$

and

$$E_Q[Y \mid T = 0] = E_P\left[\frac{1-T}{1-e(X)} \cdot Y\right] = E_P\big[Y^*_{T\leftarrow 0}\big].$$

Thus, if we combine,

$$E_P\big[Y^*_{T\leftarrow 1} - Y^*_{T\leftarrow 0}\big] = E_P\left[\frac{T}{e(X)} \cdot Y\right] - E_P\left[\frac{1-T}{1-e(X)} \cdot Y\right].$$
The price to pay to be able to identify the average treatment effect under .P is that
we need to estimate the propensity score e (see Kang and Schafer (2007) or Imai
and Ratkovic (2014)). We return to the Radon–Nikodym derivative and weights in
Sect. 12.2.
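A minimal R sketch of the resulting inverse-probability-weighting estimator, on simulated data with a known treatment effect (the data-generating process is illustrative only).

# Minimal sketch (illustrative data): IPW estimate of E[Y*_{T<-1}] - E[Y*_{T<-0}]
set.seed(42)
n <- 100000
x <- rnorm(n)
t <- rbinom(n, 1, plogis(0.8 * x))            # treatment, confounded by x
y <- 1 + 2 * t + x + rnorm(n)                 # outcome; the true ATE is 2
e_hat <- predict(glm(t ~ x, family = binomial), type = "response")
ate_ipw   <- mean(t * y / e_hat) - mean((1 - t) * y / (1 - e_hat))
ate_naive <- mean(y[t == 1]) - mean(y[t == 0])
c(ipw = ate_ipw, naive = ate_naive)           # the naive contrast is biased upwards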
Importance sampling is a classical technique, popular when considering Monte
Carlo simulations to compute some quantities efficiently. Recall that Monte Carlo is
simply based on the law of large numbers: if we can draw i.i.d. copies of a random
variable, X_i's, under probability P, then

$$\frac{1}{n}\sum_{i=1}^n h(x_i) \to E_P[h(X)], \quad\text{as } n \to \infty.$$

And much more can be obtained, as the empirical distribution P_n (associated with
the sample {x_1, · · · , x_n}) converges to P as n → ∞ (see, for example, Van der Vaart
2000).

Now, assume that we have an algorithm to draw efficiently i.i.d. copies of a
random variable X_i, under probability Q, and we still want to compute E_P[h(X)].
The idea of importance sampling is to use some weights,

$$\frac{1}{n}\sum_{i=1}^n \underbrace{\frac{dP(x_i)}{dQ(x_i)}}_{\omega_i}\, h(x_i) \to E_P[h(X)], \quad\text{as } n \to \infty,$$

where weights are simply based on the likelihood ratio of P over Q. To introduce
notations that we use afterwards, define

$$\widehat{\mu}_{\text{is}} = \frac{1}{n}\sum_{i=1}^n \frac{dP(x_i)}{dQ(x_i)}\, h(x_i),$$

and if the likelihood ratio is known only up to a multiplicative constant, define a
“self-normalized importance sampling” estimate, as coined in Neddermeyer (2009)
and Owen (2013),

$$\widehat{\mu}_{\text{is}'} = \frac{\sum_{i=1}^n \omega_i h(x_i)}{\sum_{i=1}^n \omega_i}, \quad\text{with } \omega_i \propto \frac{dP(x_i)}{dQ(x_i)}.$$

At the top of Fig. 7.11, we supposed that we had a nice code to generate a Poisson
distribution P(8); unfortunately, we want to generate some Poisson P(5). At the
bottom, we consider the opposite: we can generate some P(5) variable, but we want
a P(8). On the left, the values of the weights, dP(x)/dQ(x), with x ∈ ℕ. In the
center, the histogram of n = 500 observations from the algorithm we have (P(8) at
the top, P(5) at the bottom), and on the right, a weighted histogram for observations
that we wish we had, mixing the first sample and appropriate weights (P(5) at the
top, P(8) at the bottom). At the bottom, we generate data from P(5), and the largest
observation was here 13 (whereas before, all values from 0 to 11 were obtained). As
we can see on the right, it is not possible to get data outside the range of the data
initially obtained. Clearly, this approach works well only when the supports are close.
The weighted histogram was obtained using wtd.hist, in package weights.
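For the Poisson example described above, a minimal R sketch is given below (the weighted histogram itself can be drawn with wtd.hist from package weights, as mentioned; it is only indicated here as a comment).

# Minimal sketch: draws from Q = P(8), re-weighted to approximate the P(5) distribution
set.seed(1)
x <- rpois(500, 8)                     # sample from Q = P(8)
w <- dpois(x, 5) / dpois(x, 8)         # importance weights dP(x)/dQ(x)
mean(w * x)                            # importance sampling estimate of E_P[X] = 5
sum(w * x) / sum(w)                    # self-normalized version
# A weighted histogram, as in Fig. 7.11, can then be drawn with weights::wtd.hist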
In our context, one can define the importance sampling estimator of E[Y*_{T←1}] as

$$\widehat{\mu}_{\text{is}}\big(Y^*_{T\leftarrow 1}\big) = \frac{1}{n_1}\sum_{t_i=1} \frac{y_i}{e(x_i)} \cdot \frac{n_1}{n} = \frac{1}{n}\sum_{t_i=1} \frac{y_i}{e(x_i)},$$

and a “self-normalized importance sampling” estimate for E[Y*_{T←1}] is

$$\widehat{\mu}_{\text{is}'}\big(Y^*_{T\leftarrow 1}\big) = \frac{\sum_{t_i=1} \omega_i y_i}{\sum_{t_i=1} \omega_i}, \quad\text{where } \omega_i = \frac{1}{e(x_i)}.$$

The “self-normalized importance sampling” estimate for the average treatment
effect is then

$$\widehat{\tau}_{\text{is}'} = \frac{\sum_{t_i=1} \omega_i y_i}{\sum_{t_i=1} \omega_i} - \frac{\sum_{t_i=0} \omega'_i y_i}{\sum_{t_i=0} \omega'_i}, \quad\text{where } \omega_i = \frac{1}{e(x_i)} \text{ and } \omega'_i = \frac{1}{1-e(x_i)}.$$
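A minimal sketch of this self-normalized (Hájek-type) estimator, written as a short R function of the outcomes, the treatment indicators, and the propensity scores (the function name is ours, for illustration).

# Minimal sketch: self-normalized importance sampling estimate of the treatment effect
tau_sn <- function(y, t, e) {
  w1 <- 1 / e[t == 1]                  # weights omega_i  = 1 / e(x_i), for treated units
  w0 <- 1 / (1 - e[t == 0])            # weights omega'_i = 1 / (1 - e(x_i)), for controls
  sum(w1 * y[t == 1]) / sum(w1) - sum(w0 * y[t == 0]) / sum(w0)
}
# e.g., tau_sn(y, t, e_hat), with the quantities of the previous sketches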

Fig. 7.11 Illustration of the importance sampling procedure, and the use of weights
(dP(x)/dQ(x)). At the top, we have an algorithm to generate a Poisson distribution P(8) (in
the middle panel, where thin lines represent the theoretical histogram, and the plain boxes the
empirical histogram). We distort that sample to generate some Poisson P(5), and we have the
histogram on the right-hand side (using function wtd.hist from package weights), with the
empirical histogram in plain boxes. Thin lines represent the theoretical histogram associated with
the P(5) distribution. Below, it is the opposite: we generate P(5) samples, and distort them to get
a P(8) sample

7.5 Causal Techniques in Insurance

We will use these ideas of counterfactuals in Sect. 9.3, to quantify “individual


fairness,” but observe that causal inference is important in insurance in at least two
applications: prevention and uplift modeling.
Heuristically, “prevention” means that, in health insurance for instance, measures are
taken to prevent risks, such as diseases or injuries, rather than curing them or treating
their symptoms. If there is a causal relationship, an intervention could actually be
effective.
Farbmacher et al. (2022) investigate the causal effect of health insurance
coverage (T ) on general health (Y ) and decompose it into an indirect pathway via

the incidence of a regular medical checkup (X) and a direct effect entailing any
other causal mechanisms. Whether or not an individual undergoes routine checkups
appears to be an interesting mediator, as it is likely to be affected by health insurance
coverage and may itself have an impact on the individual’s health (simply because
checkups can help to identify medical conditions before they become serious).
Another classical application of causal inference and predictive modeling could
be “uplift modeling.” The idea is to model the impact of a treatment (such as a
direct marketing action) on an individual's behavior. Those ideas were formalized
more than 20 years ago, in Hansotia and Rukstales (2002) or Hanssens et al.
(2003). In Radcliffe and Surry (1999), the term “true response modelling” was used,
Lo (2002) used “true lift,” and finally Radcliffe (2007) suggested techniques for
“building and assessing uplift models.” More specifically, Hansotia and Rukstales
(2002) used two models, estimated separately (namely two logistic regressions),
one for the treated individuals and one for the nontreated ones, while Lo (2002)
suggested an interaction model, where interaction terms between the predictive
variables x and the treatment t are added. Over the past 20 years, several papers
have applied those techniques, in personalized medicine, such as Nassif et al. (2013),
but also in insurance, with Guelman et al. (2012, 2014) and Guelman and Guillén (2014).
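To fix ideas, here is a minimal R sketch of the two-model approach mentioned above (two logistic regressions estimated separately); the simulated data and the variable names are purely illustrative.

# Minimal sketch (illustrative data): "two-model" uplift scores
set.seed(1)
n <- 10000
x1 <- rnorm(n); x2 <- rnorm(n)
t  <- rbinom(n, 1, 0.5)                                  # randomized marketing action
y  <- rbinom(n, 1, plogis(-1 + x1 + t * (0.5 + x2)))     # response
d  <- data.frame(y, x1, x2, t)
fit1 <- glm(y ~ x1 + x2, family = binomial, data = subset(d, t == 1))
fit0 <- glm(y ~ x1 + x2, family = binomial, data = subset(d, t == 0))
uplift <- predict(fit1, newdata = d, type = "response") -
          predict(fit0, newdata = d, type = "response")  # individual uplift scores
summary(uplift)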
Part III
Fairness

For leaders today—both in business and regulation—the dominant theme of 21st century
financial services is fast turning out to be a complicated question of fairness, Wheatley
(2013), Chief Executive of the FCA, at Mansion House, London
When you can measure what you are speaking about, and express it in numbers, you
know something about it; but when you cannot measure it, when you cannot express
it in numbers, your knowledge is of a meagre and unsatisfactory kind1 : it may be the
beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of
science, whatever the matter may be, Lord Kelvin, Thomson (1883)
Voulez-vous croire au réel, mesurez-le,
Do you want to believe in reality? Measure it, Bachelard (1927)

1 Sociologist William Ogburn, a onetime head of the Social Sciences Division at the University of

Chicago, was responsible for perhaps the most contentious carving on campus. Curving around an
oriel window facing 59th Street is the quote from Lord Kelvin: “when you cannot measure, your
knowledge is meager and unsatisfactory”. Leu (2015) mentioned that in a 1939 symposium,
economist Frank Knight Jr. snarkily suggested that the quote should be changed to “if you cannot
measure, measure anyhow”.
Chapter 8
Group Fairness

Abstract Assessing whether a model is discriminatory, or not, is a complex
problem. As in Chap. 3, where we discussed global and local interpretability of
predictive models, we start with some global approaches (the local ones will be
discussed in Chap. 9), also called “group fairness,” comparing quantities between
groups, usually identified by sensitive attributes (e.g., gender, ethnicity, age, etc.).
Using the formalism introduced in the previous chapters, y denotes the variable of
interest, ŷ or m(x) denotes the prediction given by the model, and s the sensitive
attribute. Most concepts are derived from three main principles: independence
(ŷ ⊥⊥ s), separation (ŷ ⊥⊥ s conditional on y), and sufficiency (y ⊥⊥ s conditional
on ŷ). We review these approaches here, linking them while opposing them, and we
implement metrics related to those notions on various datasets.

Before starting formally to define fairness principles, remember that we have a
simple dataset, the toydata2 dataset, with only three admissible features, and several
models were considered in Chap. 3. In Fig. 8.1, we can visualize those models on
a subset with n = 40 individuals. As in Kearns and Roth (2019), the shape of
the points reflects the value of the sensitive attribute s, with two types of individuals
here, circles and squares (as in Ruillier (2004)), corresponding to group A and group
B respectively. In this dataset, y is binary, taking values in {0, 1}, corresponding
to a good or a bad outcome. The probability of a bad outcome is a quantity of
interest, as the premium is proportional to that quantity. If y is the numerical version
of the categorical output (here the indicator 1(y = 1)), the probability P[Y = 1]
corresponds to E[Y]. The sensitive attribute is also binary, taking values in {A, B},
that will correspond to a circle (•) or a square (█). There are also three legitimate
continuous covariates, x1, x2 and x3. Those variables x were used to build a score
m(x) ∈ [0, 1], that will be interpreted as an estimation of both P[Y = 1|X = x] and
E[Y|X = x]. This predictor is said to be “fair through unawareness” (discussed in
Sect. 8.1) when s is not used (also denoted “without sensitive” in the applications).
But it is also possible to include s, and the score is then m(x, s) ∈ [0, 1]. Four different
models are considered here (and trained on the complete toydata2 dataset): a
plain logistic regression (GLM, fitted with glm), an additive logistic regression


Fig. 8.1 Scores m(x_i, s_i), on top (“with sensitive” attribute) or m(x_i) ∈ [0, 1] at the bottom
(“without sensitive” attribute), for the different models, fitted on the toydata2 dataset. Colors
correspond to the value of y, in {0, 1}, and the shape corresponds to the value of s (filled square
and filled circle correspond to A and B respectively)

(GAM, with splines, for the three continuous variables, using gam from the mgcv
package), a classification tree (CART, fitted using rpart), and a random forest
(fitted using randomForest). Details on those n = 40 individuals are given in
Table 8.1. In Fig. 8.1, at the top, the x-values (from the left to the right) of the n = 40
points correspond to values of m(x_i)'s in [0, 1]. Colors correspond to the value of
y, in {0, 1}, and the shape corresponds to the value of s (█ and • correspond to A
and B respectively). As expected, individuals associated with y_i = 0 are more likely
to be on the left-hand side (small m(x_i)'s).

In Fig. 8.2, instead of visualizing n = 40 individuals on a scatter-plot (as
in Fig. 8.1), we can visualize the distribution of scores, that is, the distribution
of m(x_i, s_i) (“with sensitive” attribute) or m(x_i) ∈ [0, 1] (“without sensitive”
attribute), using box plots, respectively when y_i = 0 and y_i = 1 at the top, and
respectively when s_i = A and s_i = B at the bottom. For example, at the bottom of the
two graphs, the two boxes correspond to the predictions m(x_i) when a logistic regression
is considered. At the top, we have the distinction y_i = 0 and y_i = 1: for individuals
y_i = 0, the median value for m(x_i) is 23% with the logistic regression, whereas it is
60% for individuals y_i = 1. All models here were able to “discriminate” according
to the risk of individuals. Below, we have the distinction s_i = A and s_i = B: for
individuals s_i = A, the median value for m(x_i) is 30%, whereas it is 47% for
individuals s_i = B. All models here also “discriminate” between the two groups
defined by the sensitive attribute. Based on the discussion we had in the introduction, in Sect. 1.1.6 (and
Fig. 1.2 on the compas dataset, respectively with Dieterich et al. (2016) and Feller
et al. (2016) interpretations), if the premium is proportional to m(x_i), individuals in
group B would be asked, on average, a higher premium than individuals in group
A, and that difference could be perceived as discriminatory. When comparing box
plots at the top and at the bottom, at least, observe that all models “discriminate”
more, between groups, based on the true risk y than based on the sensitive attribute
s. In Fig. 8.3 we can visualize survival functions (on the small dataset, with n = 40

Table 8.1 The small version of toydata2, with n = 40 observations. p corresponds to the
“true probability” (used to generate y), and m(x) is the predicted probability, from a plain logistic
regression

x1    -1.220 -1.280 -0.930 -0.330  0.050  1.000 -0.070 -1.340 -1.800 -1.890
x2     3.700  9.600  4.500  5.800  2.800  1.200  1.400  3.800  5.900  3.200
x3    -1.190 -0.410 -1.050 -0.660 -0.890  0.250 -1.530 -0.300 -1.780 -0.570
s          A      A      A      A      A      A      A      A      A      A
y          0      0      0      0      0      0      0      0      0      0
p      0.090  0.500  0.130  0.270  0.140  0.220  0.070  0.090  0.160  0.060
m(x)   0.099  0.506  0.167  0.379  0.220  0.312  0.121  0.102  0.116  0.049

x1    -0.330  0.800 -1.040 -0.160 -0.990 -0.440 -0.220 -3.200 -1.720 -1.090
x2     7.900  9.200  9.400  9.800  6.700  7.900  9.700  3.700  2.700  9.600
x3    -0.230 -0.650 -1.530 -1.650 -1.150 -0.030 -0.040 -3.440 -0.700 -1.620
s          A      A      A      A      A      A      A      A      A      A
y          1      1      1      1      1      1      1      1      1      1
p      0.460  0.840  0.500  0.690  0.260  0.440  0.660  0.100  0.050  0.510
m(x)   0.585  0.867  0.511  0.739  0.297  0.565  0.759  0.012  0.047  0.515

x1     1.480  0.720  0.400 -0.470 -0.230 -0.820  0.740  1.440  0.200 -0.610
x2     3.000  0.000  4.800  5.300  6.500  0.500  0.700  1.200  3.000  3.900
x3     3.520  1.690  0.080  0.910  0.220  1.220  0.560  1.620  0.190 -0.010
s          B      B      B      B      B      B      B      B      B      B
y          0      0      0      0      0      0      0      0      0      0
p      0.670  0.160  0.460  0.310  0.470  0.050  0.210  0.480  0.260  0.190
m(x)   0.681  0.209  0.485  0.350  0.494  0.063  0.233  0.453  0.287  0.199

x1     0.820  2.040  1.740  1.580  1.550  1.030  0.440  3.090  2.180  2.300
x2     4.200  2.800  0.400  2.300  9.600  8.500  4.200  5.900  7.100  4.900
x3     2.920  2.480  2.010  1.570  2.830  1.240  0.570  3.510  1.260  1.200
s          B      B      B      B      B      B      B      B      B      B
y          1      1      1      1      1      1      1      1      1      1
p      0.540  0.840  0.540  0.650  0.970  0.900  0.420  1.000  0.980  0.960
m(x)   0.619  0.750  0.463  0.587  0.961  0.889  0.455  0.968  0.936  0.878

individuals) of those scores, because, as discussed in Chap. 3, classical quantities
used on classifiers (true-positive rates, etc.) can be visualized on such a graph, even
if each curve is obtained on 20 individuals (as shown in Table 8.1, the dataset is well
balanced, with 10 individuals in each group (y, s)). In Table 8.2, the Kolmogorov–
Smirnov test is performed, to assess whether the difference between the survival
functions is significant. Here H_0 would be “there is no discrimination with respect
to s” (for the lower part of the Table; it would be y at the top), whereas H_1 would
be “there is discrimination with respect to s.” Here, we do not consider some
favored–disfavored distinction between the two groups; discrimination could be in
both directions.
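A minimal R sketch of such a Kolmogorov–Smirnov comparison, on simulated scores (the data-generating process below is purely illustrative, not the toydata2 dataset).

# Minimal sketch (illustrative data): KS tests as in Table 8.2
set.seed(1)
n <- 40
s   <- rep(c("A", "B"), each = n / 2)
m_x <- plogis(rnorm(n, mean = ifelse(s == "B", 0.5, -0.5)))  # hypothetical scores
y   <- rbinom(n, 1, m_x)
ks.test(m_x[s == "A"], m_x[s == "B"])   # H0: same score distribution in both groups
ks.test(m_x[y == 0], m_x[y == 1])       # same comparison, conditional on the outcome y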

Fig. 8.2 Box plot of the scores m(x_i, s_i) (with sensitive attribute) or m(x_i) ∈ [0, 1] (without
sensitive attribute), for the different models, conditional on y ∈ {0, 1} at the top and conditional on
s ∈ {A, B} at the bottom

Table 8.2 Kolmogorov–Smirnov test, to compare the conditional distribution of m(x_i, s_i) (“with
sensitive” attribute) or m(x_i) ∈ [0, 1] (“without sensitive” attribute), conditional on the value
of y at the top, and s at the bottom. The distance is the maximum distance between the two
survival functions, and the p-value is the one obtained with a two-sided alternative hypothesis.
H_0 corresponds to the hypothesis that both distributions are identical (“no discrimination”)

                 Unaware (without s)               Aware (with s)
                 GLM      GAM      CART     RF     GLM      GAM      CART     RF
Difference between score distributions, y ∈ {0, 1}
Distance         0.700    0.650    0.500    0.700  0.650    0.600    0.500    0.650
p-value          0.01%    0.03%    0.29%    0.01%  0.03%    0.11%    0.29%    0.03%
Difference between score distributions, s ∈ {A, B}
Distance         0.350    0.350    0.450    0.400  0.400    0.400    0.450    0.400
p-value          17.45%   17.13%   1.66%    7.87%  8.11%    8.11%    1.66%    8.11%

8.1 Fairness Through Unawareness

The very first concept we discuss is based on “blindness,” also coined “fairness
through unawareness” in Dwork et al. (2012), and relies on the (naïve) idea that

Fig. 8.3 Survival distribution of the scores m(x_i) ∈ [0, 1] (“without sensitive” attribute), for the
different models (plain logistic regression on the left-hand side, and random forest on the right-
hand side), conditional on y ∈ {0, 1} at the top and conditional on s ∈ {A, B} at the bottom

a model fitted on the subpart of the dataset that contains only legitimate attributes
(and not sensitive ones) is “fair.” As discussed previously, x ∈ 𝒳 denotes legitimate
features, whereas s ∈ 𝒮 = {A, B} denotes the protected attribute. As discussed in
Chap. 3, we may use here z as a generic notation, to denote either x, or (x, s).

Definition 8.1 (Fairness Through Unawareness (Dwork et al. 2012)) A model
m satisfies the fairness through unawareness criterion, with respect to a sensitive
attribute s ∈ 𝒮, if m : 𝒳 → 𝒴.

Based on that idea, we extend the notion of “regression function” (from
Definition 3.1), and distinguish “aware” and “unaware regression functions.”

Definition 8.2 (Aware and Unaware Regression Functions μ) The aware regres-
sion function is μ(x, s) = E[Y|X = x, S = s] and the unaware regression function
is μ(x) = E[Y|X = x].

Fig. 8.4 Scatterplot with the aware model (with the sensitive attribute s) on the x-axis and the
unaware model (without the sensitive attribute s) on the y-axis, i.e., {μ̂(x_i, s_i), μ̂(x_i)}, for the
GLM, the GAM and the Random Forest models, on the toydata2 dataset, n = 1000 individuals

In Lindholm et al. (2022a,b), the “aware regression function” is named “best-
estimate price (given full information).” In Fig. 8.4, we can visualize empirical
versions of those two regression functions, on the toydata2 dataset, with
scatterplots {μ̂(x_i, s_i), μ̂(x_i)}, where three kinds of models are considered.

This principle prescribes not to explicitly employ sensitive features when making
decisions, and assessing premiums. It is not a group fairness principle per se, but it
is the first one that we mention in this chapter. Apfelbaum et al. (2010) remind
us that “the color-blind approach to managing diversity has become a leading
institutional strategy for promoting racial equality, across domains and scales of
practice,” making this principle extremely important. Goodwin and Voola (2013)
discussed the “gender-blind approach,” asking if “gender neutral” and “gender
blind” are equivalent (the answer is no). To go further, a simple way of achieving
this is simply not to ask for such information (gender, age, race, etc.). It is precisely
why most resumes do not mention the age of candidates, or their gender. Omitting
the collection of sensitive information, both from the training data and the testing
or validation data, poses a significant drawback to this strategy. By not gathering
such data, we are unable to assess the fairness of the model, as highlighted in Lewis
(2004). But as stated in Apfelbaum et al. (2010), “institutional messages of color
blindness may therefore artificially depress formal reporting of racial injustice.
Color-blind messages may thus appear to function effectively on the surface even as
they allow explicit forms of bias to persist.” As discussed in the previous chapters,
proxy-based discrimination is still possible, and this approach of fairness “functions
effectively on the surface,” only. So we need other definitions to assess whether a
model is fair, or not.

8.2 Independence and Demographic Parity

As pointed out by Caton and Haas (2020), there are at least a dozen ways to formally
define the fairness of a classifier, or more generally of a model. For example, one can
wish for independence between the score and the group membership, m(Z) ⊥⊥ S,
or between the prediction (as a class) and the protected variable, Ŷ ⊥⊥ S.

Definition 8.3 (Independence (Barocas et al. 2017)) A model m satisfies the
independence property if m(Z) ⊥⊥ S, with respect to the distribution P of the triplet
(X, S, Y).

Observe that this was implicitly used in the introduction of this chapter, for example, in
Table 8.2, when we compared conditional distributions of the score m(x_i) in the two
groups, s = A and s = B. Inspired by Darlington (1971), we define as follows, for a
classifier m_t:
Definition 8.4 (Demographic Parity (Calders and Verwer 2010; Corbett-Davies
et al. 2017)) A decision function ŷ—or a classifier m_t, taking values in {0, 1}—
satisfies demographic parity, with respect to some sensitive attribute S, if (equiva-
lently)

$$\begin{cases} P[\widehat{Y} = 1|S = A] = P[\widehat{Y} = 1|S = B] = P[\widehat{Y} = 1] \\ E[\widehat{Y}|S = A] = E[\widehat{Y}|S = B] = E[\widehat{Y}] \\ P[m_t(Z) = 1|S = A] = P[m_t(Z) = 1|S = B] = P[m_t(Z) = 1]. \end{cases}$$

In the regression case, when y is continuous, there are two possible definitions
of demographic parity. If ŷ = m(z), we ask only for the equality of the conditional
expectations (weak notion) or for the equality of the conditional distributions (strong
notion).

Definition 8.5 (Weak Demographic Parity) A decision function ŷ satisfies weak
demographic parity if

$$E[\widehat{Y}|S = A] = E[\widehat{Y}|S = B].$$

Definition 8.6 (Strong Demographic Parity) A decision function ŷ satisfies strong
demographic parity if Ŷ ⊥⊥ S, i.e., for all 𝒜,

$$P[\widehat{Y} \in \mathcal{A}|S = A] = P[\widehat{Y} \in \mathcal{A}|S = B], \quad \forall \mathcal{A} \subset \mathcal{Y}.$$

If y and .
y are binary, the two definitions are equivalent, but this is usually not
the case. When the score is used to select clients, so as to authorize the granting
of a loan by a bank or a financial institution, this “demographic parity” concept
(also called “statistical fairness,” “equal parity,” “equal acceptance rate,” or simply
“independence,” as mentioned in Calders and Verwer (2010)), simply requires the

fraction of applicants in group A who are granted credit to be approximately the


same as the fraction of applicants in group B who are granted credit. And by
symmetry, the rejection proportions must be identical. Using the same threshold
t on the scores, to grant a loan, we get the values of Table 8.3. For example, with a plain
logistic regression,

$$P[\widehat{Y} = 1|S = A] = \frac{8}{20} = 40\% \quad\text{while}\quad P[\widehat{Y} = 1|S = B] = \frac{9}{20} = 45\%,$$

so that, strictly speaking,

$$P[\widehat{Y} = 1|S = A] \neq P[\widehat{Y} = 1|S = B].$$
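In practice, this ratio can be computed directly from the scores and the group labels; here is a minimal base-R sketch (the function name is ours, and is not the dem_parity function of the fairness package).

# Minimal sketch: empirical demographic parity ratio for m_t(x) = 1(m(x) > t)
dem_parity_ratio <- function(m_x, s, t = 0.5) {
  p <- tapply(m_x > t, s, mean)       # P[Y_hat = 1 | S = s] for each group
  unname(p["B"] / p["A"])             # should be close to 1 under demographic parity
}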

In Table 8.3, the small dataset with n = 40 is at the top, and the entire dataset
(n = 1000) is at the bottom. One can observe an important difference between the
ratios, for identical thresholds t. This is simply because the small dataset (n = 40) is a
distorted version of the entire one, with selection bias. Recall that in the entire
dataset,

$$\begin{cases} P[Y = 0, S = A] \sim 47\% \\ P[Y = 1, S = B] \sim 27\% \\ P[Y = 1, S = A] \sim 13\% \\ P[Y = 0, S = B] \sim 13\%, \end{cases}$$

whereas in our small dataset, all probabilities are 25% (in order to have exactly 10
individuals in each group). So clearly, selection bias has an impact on discrimination
assessment.
In Fig. 8.5, we can visualize t ↦ P[m_t(Z) = 0|S = A]/P[m_t(Z) = 0|S = B] for
the aware and unaware models (here a plain logistic regression) on the left-hand side,
and more specifically t ↦ P[m_t(Z) = 0|S = A] and t ↦ P[m_t(Z) = 0|S = B]
in the middle and on the right-hand side. The model without the sensitive attribute
is fairer, with respect to the demographic parity criterion, than the one with the
sensitive attribute. The dots are the values obtained on the small dataset,
with n = 40. As mentioned previously, on that dataset, it seems that there is less
discrimination (with respect to s), probably because of some selection bias. When
t = 30% for the plain unaware logistic regression, out of 300,000 individuals in
group A, there are 100,644 “positive,” which is a 33.55% proportion (P[Ŷ = 1|S =
A]), and out of 200,000 individuals in group B, there are 163,805 “positive,” which
is an 81.9% proportion (P[Ŷ = 1|S = B]). Thus, the ratio P[Ŷ = 1|S = B]/P[Ŷ =
1|S = A] is 2.44. As t increases, both proportions (of “positive”) decrease, but the
ratio P[Ŷ = 1|S = B]/P[Ŷ = 1|S = A] increases. In Fig. 8.6, we visualize on the
right-hand side P[Ŷ = 1|S = B] and P[Ŷ = 1|S = A], and on the left-hand side,
t ↦ P[m_t(X) ≤ t|S = A] and t ↦ P[m_t(X) ≤ t|S = B].
Table 8.3 Quantifying demographic parity on toydata2, using dem_parity from R package
fairness. The ratio P[Ŷ = 1|S = B]/P[Ŷ = 1|S = A] (and P[Ŷ = 0|S = A]/P[Ŷ = 0|S = B])
should be equal to 1 to satisfy the demographic parity criterion

                      Unaware (without s)                   Aware (with s)
                      GLM      GAM      CART     RF         GLM      GAM      CART     RF
n = 40, t = 50%, ratio P[Ŷ = 1|S = B]/P[Ŷ = 1|S = A]
P[Ŷ = 1|S = A]        40%      35%      20%      30%        30%      25%      20%      30%
P[Ŷ = 1|S = B]        45%      55%      40%      55%        60%      55%      40%      55%
Ratio                 1.125    1.571    2.000    1.833      2.000    2.200    2.000    1.833
n = 40, various t, ratio P[Ŷ = 1|S = B]/P[Ŷ = 1|S = A]
t = 30%               1.500    1.400    1.273    1.333      1.778    1.875    1.273    1.556
t = 50%               1.125    1.571    2.000    1.833      2.000    2.200    2.000    1.833
t = 70%               2.000    2.333    7.000    2.333      2.000    6.000    7.000    2.000
n = 40, various t, ratio P[Ŷ = 0|S = A]/P[Ŷ = 0|S = B]
t = 30%               2.000    1.667    1.500    1.375      2.750    2.400    1.500    1.833
t = 50%               1.091    1.444    1.333    1.556      1.750    1.667    1.333    1.556
t = 70%               1.214    1.308    1.462    1.308      1.214    1.357    1.462    1.214
n = 40, ratio E[m(X)|S = B]/E[m(X)|S = A]
E[m(X)|S = A]         34.840%  33.570%  33.185%  34.580%    32.290%  31.125%  33.185%  33.110%
E[m(X)|S = B]         54.800%  54.410%  50.400%  54.000%    56.870%  56.155%  50.400%  54.700%
Ratio                 1.125    1.571    2.000    1.833      2.000    2.200    2.000    1.833
n = 1000, various t, ratio P[Ŷ = 1|S = B]/P[Ŷ = 1|S = A]
t = 30%               1.652    1.519    1.235    1.559      1.918    1.714    1.235    1.798
t = 50%               1.877    2.451    2.918    2.404      2.944    3.457    2.918    2.180
t = 70%               6.033    8.711    26.000   4.621      7.917    19.333   26.000   4.578
n = 1000, various t, ratio P[Ŷ = 0|S = A]/P[Ŷ = 0|S = B]
t = 30%               5.507    4.667    4.059    4.682      8.510    5.746    4.059    5.873
t = 50%               3.603    3.806    3.256    4.159      4.735    6.825    4.884    6.332
t = 70%               2.648    2.868    2.902    2.869      2.781    3.010    2.902    2.938
n = 1000, ratio E[m(X)|S = B]/E[m(X)|S = A]
E[m(X)|S = A]         34.840%  33.570%  33.185%  34.580%    32.290%  31.125%  33.185%  33.110%
E[m(X)|S = B]         54.800%  54.410%  50.400%  54.000%    56.870%  56.155%  50.400%  54.700%
Ratio                 ×1.573   ×1.621   ×1.519   ×1.562     ×1.761   ×1.804   ×1.519   ×1.652

Fig. 8.5 Demographic parity as a function of the threshold t, for classifier m_t(x), when m is a plain
logistic regression—with and without the sensitive attribute s—with groups A and B, on toydata2,
using dem_parity from R package fairness. Here, n = 500,000 simulated data are considered for
the plain lines, whereas dots on the left-hand side are empirical values obtained on the smaller
subset, with n = 40, as in Table 8.3. On the left-hand side, evolution of the ratio
P[Ŷ = 1|S = B]/P[Ŷ = 1|S = A]. The horizontal line (at y = 1) corresponds to perfect demographic
parity. In the middle, t ↦ P[m_t(X) > t|S = B] and t ↦ P[m_t(X) > t|S = A] on the model with s,
and on the right-hand side without s

Fig. 8.6 Alternative demographic parity graphs (compared with Fig. 8.5), with the ratio
P[Ŷ = 0|S = A]/P[Ŷ = 0|S = B] on the left-hand side, with and without the sensitive attribute s.
In the middle and on the right-hand side, t ↦ P[m_t(X) ≤ t|S = B] and t ↦ P[m_t(X) ≤ t|S = A],
respectively with and without the sensitive attribute s

Fig. 8.7 Receiver operating characteristic curve of the plain logistic regression on the left, on the
small dataset with n = 40 individuals, with the optimal threshold (t = 50%), and the evolution of
the rate of error on the right, with the optimal threshold (also t = 50%)

As we can see in Definition 8.4, this measure is based on m_t, and not m. A
classical choice for t is 50% (used in the majority rule in bagging approaches), but
other choices are possible. An alternative is to choose t so that P[m_t(Z) > t] is
close to P[Y = 1] (at least on the training dataset). In Sect. 8.9, we consider the
probability of claiming a loss in motor insurance (on the FrenchMotor dataset).
In the training subset, 8.72% of the policyholders claimed a loss, whereas in the
validation dataset, the percentage was slightly lower, at 8.55%. With the logistic
regression model, on the validation dataset, the average prediction (m(z)) is 9% and
the median one is 8%. With a threshold t = 16%, and a logistic regression, about
10% of the policyholders get ŷ = 1 (which is close to the claim frequency in the
dataset), and 9% with a classification tree. Therefore, another natural threshold is
the quantile (associated with the m(z_i)) when the probability is the proportion of 0
among the y_i. Finally, as discussed in Chapter 9 of Krzanowski and Hand (2009),
several approaches can be considered, based on the receiver operating characteristic
(ROC) curve or the rate of error. In Fig. 8.7, we can see the “optimal” threshold, in
the sense of maximum predictive power, on the left-hand side, or minimizing the rate
of error committed, on the right-hand side.
In this definition of fairness, observe that we consider the conditional distribution
of ŷ given s, but other variables are not considered. If there is a lot of heterogeneity
based on a specific legitimate attribute x, it could be interesting to add it. Therefore,
we can consider the following extension.

Definition 8.7 (Conditional Demographic Parity (Corbett-Davies et al. 2017))
We have conditional demographic parity if (at choice), for all x,

$$\begin{cases} P[\widehat{Y} = y|X_L = x, S = A] = P[\widehat{Y} = y|X_L = x, S = B], \; \forall y \in \{0, 1\} \\ E[\widehat{Y}|X_L = x, S = A] = E[\widehat{Y}|X_L = x, S = B], \\ P[\widehat{Y} \in \mathcal{A}|X_L = x, S = A] = P[\widehat{Y} \in \mathcal{A}|X_L = x, S = B], \; \forall \mathcal{A} \subset \mathcal{Y}, \end{cases}$$

where L denotes a “legitimate” subset of unprotected covariates.


Finally, “demographic parity” claims that a model is “fair” if the prediction .
y is
independent of the protected attribute s,
 ⊥⊥ S.
Y
.

An alternative formulation would be (in a very general setting, such as a regression


problem) to use distances between conditional distribution, .Y |S = A and .Y |S =
B, as defined in Sect. 3.3.1. If the conditional distribution of .m(X) in the two
groups is too different, e.g., a large population stability index (.PSI) as defined in
 = 1|S = A) − P(Y
Definition 3.3.1, the empirical estimation of .P(Y  = 1|S = B)
may not be robust, as discussed in Siddiqi (2012). Therefore, Szepannek and Lübke
(2021) suggested using a “group unfairness index” (GUI) defined as
  
y , s) =
GUI(
. P(Y  = i|s = A) log P(Y = i|s = A) .
 = i|s = B) − P(Y
 = i|s = B)
P(Y
i∈{0,1}

And inspired by Shannon and Weaver (1949), it is also possible to define some
mutual information, based on Kullback–Leibler divergence between the joint
distribution and the independent version (see Definition 3.7).
   = i, S = s)
P(Y
y , s) =
IM(
.  = i, S = s) log
P(Y .
 = i)P(S = s)
P(Y
i∈{0,1} s∈{A,B}
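A minimal base-R sketch of these two quantities, computed from a vector of predicted classes y_hat and a sensitive attribute s (the function names are ours, for illustration, and both classes are assumed to appear in both groups).

# Minimal sketch: group unfairness index and mutual information
gui <- function(y_hat, s) {
  y_hat <- factor(y_hat, levels = c(0, 1))
  pA <- prop.table(table(y_hat[s == "A"]))
  pB <- prop.table(table(y_hat[s == "B"]))
  sum((pA - pB) * log(pA / pB))
}
mutual_info <- function(y_hat, s) {
  p_joint <- prop.table(table(y_hat, s))                 # joint distribution of (Y_hat, S)
  p_indep <- outer(rowSums(p_joint), colSums(p_joint))   # product of the margins
  sum(p_joint * log(p_joint / p_indep))
}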

“Strong demographic parity” can be related to the total variation distance (as in Defini-
tion 3.6): a decision function ŷ satisfies strong demographic parity if Ŷ ⊥⊥ S,
i.e., P_A(𝒜) = P_B(𝒜) for all 𝒜 ⊂ ℝ, or equivalently d_TV(P_A, P_B) = 0, where P_A and
P_B denote the conditional distributions of the score m(X, S). Quite naturally, one
could consider another distance, for instance, the Wasserstein distance, as defined in
Definition 3.11.

Proposition 8.1 A model m satisfies the strong demographic parity property if and
only if W_2(P_A, P_B) = 0.

But we can also consider the Kullback–Leibler divergence.

Proposition 8.2 A model m satisfies the strong demographic parity property if and
only if D_KL(P_A‖P_B) = D_KL(P_B‖P_A) = 0.
Figure 8.8 provides a visual representation of this property using three models
estimated on the FrenchMotor dataset. The x-axis displays the distribution of

Fig. 8.8 Matching between m(x, s = A) (distribution on the x-axis) and m(x, s = B) (distribution
on the y-axis), where m is (from the left to the right) GLM, GBM, and RF. The solid line is the
(monotonic) optimal transport T⋆ for the different models

scores in group A, whereas the y-axis represents the distribution of scores in group B.
The solid line depicts the optimal transport T⋆, which follows a monotonic pattern.
If this line aligns with the diagonal, it indicates that the model m satisfies the “strong
demographic parity” criterion and is considered fair.

One shortcoming of those approaches is that “demographic parity” is simply
based on the independence between the protected variable s and the prediction
ŷ. And this does not take into account the fact that the outcome y may correlate
with the sensitive variable s. In other words, if the groups induced by the sensitive
attribute s have different underlying distributions for y, ignoring these dependencies
may lead to misleading conclusions when assessing fairness. Therefore, quite naturally, an
extension of the independence property is the “separation” criterion, which adds the
value of the outcome y. More precisely, we require independence between the
prediction ŷ and the sensitive variable s, conditional on the value of the outcome
variable y, or formally Ŷ ⊥⊥ S conditional on Y.

8.3 Separation and Equalized Odds

In a general context, define “separation” as follows.

Definition 8.8 (Separation (Barocas et al. 2017)) A model m : 𝒵 → 𝒴 satisfies
the separation property if m(Z) ⊥⊥ S | Y, with respect to the distribution P of the
triplet (X, S, Y).

Based on that principle, several notions can be introduced when y is a binary
variable (y ∈ {0, 1}). The first one, “equal opportunity,” is achieved when the
predicted target variable ŷ of a model and the label of a protected category s
are statistically independent of each other, conditional on the actual value of the
target variable y (either 0 or 1). In the context of this binary classification problem,
this implies that the true-positive rates and false-positive rates should be equal
across groups defined by the protected category. A fairness criterion that is slightly
less strict than strong equal opportunity is weak equal opportunity. In weak equal
opportunity, only the probability of true positives is required to be equalized across
groups within a protected category. Formally, this criterion mandates parity in either
false positives or true positives between the two groups, A and B (refer to Fig. 8.9
for a visual representation).
Definition 8.9 (True-Positive Equality, (Weak) Equal Opportunity (Hardt et al.
2016)) A decision function ŷ—or a classifier m_t(·), taking values in {0, 1}—
satisfies equal opportunity, with respect to some sensitive attribute S, if

$$\begin{cases} P[\widehat{Y} = 1|S = A, Y = 1] = P[\widehat{Y} = 1|S = B, Y = 1] = P[\widehat{Y} = 1|Y = 1] \\ P[m_t(Z) = 1|S = A, Y = 1] = P[m_t(Z) = 1|S = B, Y = 1] = P[m_t(Z) = 1|Y = 1], \end{cases}$$

which corresponds to the parity of true positives, in the two groups, {A, B}.

Fig. 8.9 Receiver operating characteristic curves (true-positive rates against false-positive rates)
for the plain logistic regression m, on the toydata2 dataset. Percentages indicated are thresholds
t used, in each group (A and B), with the false-positive rate (on the x-axis) and the true-positive
rate (on the y-axis)

The previous property can also be named “weak equal opportunity,” and as for
demographic parity, a stronger property can be defined.

Definition 8.10 (Strong Equal Opportunity) A model m(·), taking values in
[0, 1], satisfies strong equal opportunity, with respect to some sensitive attribute S, if

$$P[m(X, S) \in \mathcal{A}|S = A, Y = 1] = P[m(X, S) \in \mathcal{A}|S = B, Y = 1] = P[m(X, S) \in \mathcal{A}|Y = 1],$$

for all 𝒜 ⊂ [0, 1].


Definition 8.11 (False-Positive Equality (Hardt et al. 2016)) A decision function
ŷ—or a classifier m_t(·), taking values in {0, 1}—satisfies parity of false positives,
with respect to some sensitive attribute s, if

$$\begin{cases} P[\widehat{Y} = 1|S = A, Y = 0] = P[\widehat{Y} = 1|S = B, Y = 0] = P[\widehat{Y} = 1|Y = 0] \\ P[m_t(Z) = 1|S = A, Y = 0] = P[m_t(Z) = 1|S = B, Y = 0] = P[m_t(Z) = 1|Y = 0]. \end{cases}$$

In Fig. 8.9, the point • in the top left corner corresponds to the case where the
threshold t is 50% in group B, on the left-hand side, and the point ◦ in the bottom left
corner corresponds to the case where the threshold t is 50% in group A. As those
two points are not on the same vertical or horizontal line, m_t satisfies neither “true-
positive equality” nor “false-positive equality” when t = 50%. Nevertheless, if we
allow t to differ between the two groups, it is possible to achieve each of them (but
never both at the same time). If we use a threshold t = 24.1% in group A, we have
“false-positive equality” with the classifier obtained when using a threshold t = 50%
in group B. And if we use a threshold t = 15.2% in group A, we have “true-positive
equality” with the classifier obtained when using a threshold t = 50% in group B.
When examining the graph on the right, we can identify the specific thresholds that
need to be employed in group B to achieve either “true-positive equality” or “false-
positive equality” when the threshold t = 50% is used in group A.
If the two properties are satisfied at the same time, we have “equalized odds.”

Definition 8.12 (Equalized Odds (Hardt et al. 2016)) A decision function ŷ—or a
classifier m_t(·) taking values in {0, 1}—satisfies the equalized odds constraint, with
respect to some sensitive attribute S, if

$$\begin{cases} P[\widehat{Y} = 1|S = A, Y = y] = P[\widehat{Y} = 1|S = B, Y = y] = P[\widehat{Y} = 1|Y = y], \; \forall y \in \{0, 1\} \\ P[m_t(Z) = 1|S = A, Y = y] = P[m_t(Z) = 1|S = B, Y = y], \; \forall y \in \{0, 1\}, \end{cases}$$

which corresponds to the parity of true positives and false positives, in the two groups.

Note that instead of using the value of ŷ (conditional on y and s), “equalized
odds” means, for some specific threshold t,

$$E[m_t(Z)|Y = y, S = A] = E[m_t(Z)|Y = y, S = B], \quad \forall y \in \{0, 1\}.$$

The separation criterion implies that the prediction ŷ should show the same error
rate for each value of s, which explains why it is coined “equalized odds.”
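These per-group error rates are easily computed; here is a minimal base-R sketch (the function name and arguments are illustrative), where equalized odds corresponds to equality of both rows across the two groups.

# Minimal sketch: true-positive and false-positive rates per group
group_rates <- function(y, s, m_x, t = 0.5) {
  y_hat <- as.numeric(m_x > t)
  sapply(split(data.frame(y, y_hat), s), function(d)
    c(TPR = mean(d$y_hat[d$y == 1]),    # P[Y_hat = 1 | Y = 1, S = s]
      FPR = mean(d$y_hat[d$y == 0])))   # P[Y_hat = 1 | Y = 0, S = s]
}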
Therefore, as in Kleinberg et al. (2016), we can consider a so-called “class
balance” property, where we consider m instead of m_t.

Definition 8.13 (Class Balance (Kleinberg et al. 2016)) We have class balance
in the weak sense if

$$E[m(X)|Y = y, S = A] = E[m(X)|Y = y, S = B], \quad \forall y \in \{0, 1\},$$

or in the strong sense if

$$P[m(X) \in \mathcal{A}|Y = y, S = A] = P[m(X) \in \mathcal{A}|Y = y, S = B], \quad \forall \mathcal{A} \subset [0, 1], \; \forall y \in \{0, 1\}.$$

As in the previous section, it is possible to compute, for a threshold t, the
ratios t ↦ P[m_t(Z) = 1|S = A, Y = y]/P[m_t(Z) = 1|S = B, Y = y] or
t ↦ P[m_t(Z) = 0|S = A, Y = y]/P[m_t(Z) = 0|S = B, Y = y], as in Table 8.4,
with y = 1 and y = 0 at the top and at the bottom respectively.

At the top of Table 8.4, we can visualize the false-positive rate (FPR) parity metric,
as described by Chouldechova (2017). For instance, when employing the unaware logistic
regression with a threshold of t = 50%, in groups A and B, the FPR is 30% and 20%

Table 8.4 False-positive and false-negative metrics, on our n = 40 individuals from dataset
toydata2, using fpr_parity and fnr_parity from R package fairness. The ratio
(group B against group A) should be close to 1 to have either false-positive fairness (at the top) or
false-negative fairness (at the bottom)

                  Unaware (without s)            Aware (with s)
                  GLM    GAM    CART   RF        GLM    GAM    CART   RF
False-positive rate ratio, various t
t = 50%           1.50   4.00   2.00   –         4.00   5.00   2.00   Inf
t = 70%           1.75   1.75   3.00   2.67      1.75   2.25   3.00   2.00
False-negative rate ratio, various t
t = 30%           0.60   0.75   0.60   1.00      0.33   0.20   0.60   0.50
t = 50%           1.00   0.50   0.00   2.00      0.00   0.00   0.00   1.00

respectively (based on the small dataset with $n = 40$ observations). If we take a baseline value of 1 for the FPR in group B, the corresponding value in group A is 1.5, which matches the false-positive rate ratio displayed in the table. If the criterion of equality of FPRs were satisfied, this ratio would be close to 1. In Fig. 8.10, we can look at this metric as a function of the threshold $t$, when $n$ is very large (so that the curve is smooth). At the bottom of Table 8.4, the false-negative rate (FNR) parity metric is used, as described by Chouldechova (2017). Here, with the same model and a threshold of $t = 50\%$, the FNR is the same in both groups A and B, namely 10%; with a basis of 1 in group B, the corresponding value in group A is therefore 1. In Fig. 8.11, we can look at this metric as a function of the threshold $t$, when $n$ is very large.
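To make the construction of Figs. 8.10 and 8.11 concrete, here is a minimal R sketch, on simulated data only (the variables s, y and score below are illustrative stand-ins, not the actual toydata2 dataset nor the fitted models of Sect. 4.2), computing the false-positive and false-negative rate ratios between the two groups as functions of the threshold t:

set.seed(123)
n <- 1e5
s <- sample(c("A", "B"), n, replace = TRUE)              # sensitive attribute
x <- rnorm(n, mean = ifelse(s == "B", 1, 0))             # covariate, a proxy of s
y <- rbinom(n, 1, plogis(-1 + x))                        # binary outcome
score <- plogis(-1 + x + rnorm(n, sd = .5))              # some score m(x) in (0,1)
fpr <- function(t, g) mean(score[s == g & y == 0] > t)   # false-positive rate in group g
fnr <- function(t, g) mean(score[s == g & y == 1] <= t)  # false-negative rate in group g
t_grid <- seq(.05, .95, by = .01)
fpr_ratio <- sapply(t_grid, function(t) fpr(t, "B") / fpr(t, "A"))
fnr_ratio <- sapply(t_grid, function(t) fnr(t, "B") / fnr(t, "A"))
plot(t_grid, fpr_ratio, type = "l", xlab = "threshold t", ylab = "ratio (group B / group A)")
lines(t_grid, fnr_ratio, lty = 2)
abline(h = 1, col = "grey")                              # a ratio of 1 means parity at threshold t

Ratios close to 1, for a given t, correspond to false-positive (solid line) or false-negative (dashed line) parity.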
The metric used for “equalized odds” (also known as “equal opportunity,”
“positive rate parity” or simply “separation”) in function equal_odds from R
package fairness is simply the “sensitivity” (or true-positive rate). In Table 8.5,
we can see numerical estimates on our .n = 40 database, and in Fig. 8.12, we can
visualize this metric, as a function of the threshold t, when n is very large.
A concept of "lack of disparate mistreatment" has been introduced simultaneously in Zafar et al. (2019) and Berk et al. (2017) (and coined "accuracy equality").
Definition 8.14 (Similar Mistreatment (Zafar et al. 2019)) We have similar mistreatment, or "lack of disparate mistreatment," if
$$P[\hat{Y} = Y \mid S = A] = P[\hat{Y} = Y \mid S = B] = P[\hat{Y} = Y],$$
$$P[m_t(X) = Y \mid S = A] = P[m_t(X) = Y \mid S = B] = P[m_t(X) = Y].$$

In the context of a regression model, we could consider a strong property on the distribution of residuals,
$$P[Y - \hat{Y} \in \mathcal{A} \mid S = A] = P[Y - \hat{Y} \in \mathcal{A} \mid S = B] = P[Y - \hat{Y} \in \mathcal{A}], \quad \forall \mathcal{A}.$$

Fig. 8.10 On the right-hand side, evolution of the false-positive rates, in groups A and B, for .mt (x)—without the sensitive attribute s—as a function of
threshold t (on a plain logistic regression), on toydata2, using fpr_parity from R package fairness

Fig. 8.11 On the right-hand side, evolution of the false-negative rates, in groups A and B, for .mt (x)—without the sensitive attribute s—as a function of
threshold t (on plain logistic regression), on toydata2, using fnr_parity from R package fairness. Here, .n = 500,000 simulated data are considered

Table 8.5 “equalized odds” on the .n = 40 subset of toydata2, using equal_odds from R
package fairness
Unaware (without s) Aware (with s)
GLM GAM CART RF GLM GAM CART RF
.t = 30% 1.400 1.167 1.400 1.000 2.000 1.800 1.400 1.333
.t = 50% 1.000 1.125 1.111 0.889 1.429 1.250 1.111 1.000
.t = 70% 1.000 1.111 1.000 0.900 1.000 1.000 1.000 0.900

To address the equality of the distribution of residuals, we can explore a less stringent condition (yet easier to test) expressed in terms of moments,
$$E\big[|Y - \hat{Y}|^a \mid S = A\big] = E\big[|Y - \hat{Y}|^a \mid S = B\big] = E\big[|Y - \hat{Y}|^a\big], \quad \forall a > 0.$$

Several extensions of this notion, introduced in Zafar et al. (2019), could be considered, such as lack of disparate mistreatment for "false-positive signals,"
$$P[\hat{Y} \neq Y \mid S = A, Y = 0] = P[\hat{Y} \neq Y \mid S = B, Y = 0],$$
or for "false discovery signals,"
$$P[\hat{Y} \neq Y \mid S = A, \hat{Y} = 1] = P[\hat{Y} \neq Y \mid S = B, \hat{Y} = 1],$$
for example.
One can also use any metric based on confusion matrices, such as $\phi$, introduced by Matthews (1975), also denoted MCC, for "Matthews' correlation coefficient," in Baldi et al. (2000) or Tharwat (2021).
Definition 8.15 ($\phi$-Fairness (Chicco and Jurman 2020)) We have $\phi$-fairness if $\phi_A = \phi_B$, where $\phi_s$ denotes Matthews' correlation coefficient for group $s$,
$$\phi_s = \frac{\mathrm{TP}_s \cdot \mathrm{TN}_s - \mathrm{FP}_s \cdot \mathrm{FN}_s}{\sqrt{(\mathrm{TP}_s + \mathrm{FP}_s)(\mathrm{TP}_s + \mathrm{FN}_s) \cdot (\mathrm{TN}_s + \mathrm{FP}_s)(\mathrm{TN}_s + \mathrm{FN}_s)}}, \quad s \in \{A, B\}.$$

The evolution of $\phi_A/\phi_B$ as a function of the threshold is reported in Table 8.6. For example, with a 50% threshold $t$, MCC is 0.612 in group A with a plain logistic regression, whereas it is 0.704 in group B, with the same model. When employing a baseline value of 1 for MCC in group B, the corresponding value in group A is 0.870 (see Table 3.6 for details on computations, based on confusion matrices). In Fig. 8.13, we can look at this metric, as a function of the threshold $t$, when $n$ is very large.
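As a complement to Table 8.6, the following R sketch (again on simulated placeholder data, not on toydata2) computes Matthews' correlation coefficient within each group, and the ratio $\phi_A/\phi_B$, for a given threshold:

set.seed(123); n <- 1e5                                   # same simulated toy data as in the earlier sketch
s <- sample(c("A", "B"), n, replace = TRUE); x <- rnorm(n, mean = ifelse(s == "B", 1, 0))
y <- rbinom(n, 1, plogis(-1 + x)); score <- plogis(-1 + x + rnorm(n, sd = .5))
mcc <- function(t, g) {
  yh <- as.numeric(score[s == g] > t); yo <- y[s == g]    # predictions and outcomes in group g
  TP <- sum(yh == 1 & yo == 1) * 1.0; TN <- sum(yh == 0 & yo == 0) * 1.0  # counts as doubles
  FP <- sum(yh == 1 & yo == 0) * 1.0; FN <- sum(yh == 0 & yo == 1) * 1.0  # (avoids integer overflow)
  (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
}
mcc(.5, "A") / mcc(.5, "B")                               # close to 1 under phi-fairness at t = 50%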
Finally, observe that instead of asking for true-positive rates and false-positive
rates to be equal, one can ask to have identical ROC curves, in the two groups.
As a reminder, we have defined (see Definition 4.19) the ROC curve as .C(t) =
TPR ◦ FPR−1 (t), where .FPR(t) = P[m(X) > t|Y = 0] and .TPR(t) = P[m(X) >
t|Y = 1].

Fig. 8.12 On the right-hand side, evolution of the equalized odds metrics, in groups A and B, for .mt (x)—without the sensitive attribute s—as a function of
threshold t (on plain logistic regression), on toydata2, using equal_odds from R package fairness

Table 8.6 Evolution of .φ-fairness, as a function of threshold t, on the small version of


toydata2 (40 observations), using mcc_parity from R package fairness
Unaware (without s) Aware (with s)
GLM GAM CART RF GLM GAM CART RF
.t = 30% 0.693 0.611 1.151 0.615 1.005 1.061 1.151 0.768
.t = 50% 0.870 0.745 0.816 0.241 1.069 0.821 0.816 0.482
.t = 70% 0.642 0.801 0.313 0.191 0.642 0.350 0.313 0.214

Definition 8.16 (Equality of ROC Curves (Vogel et al. 2021)) Let $\mathrm{FPR}_s(t) = P[m(X) > t \mid Y = 0, S = s]$ and $\mathrm{TPR}_s(t) = P[m(X) > t \mid Y = 1, S = s]$, where $s \in \{A, B\}$. Set $\Delta_{TPR}(t) = \mathrm{TPR}_B \circ \mathrm{TPR}_A^{-1}(t) - t$ and $\Delta_{FPR}(t) = \mathrm{FPR}_B \circ \mathrm{FPR}_A^{-1}(t) - t$. We have fairness with respect to ROC curves if $\|\Delta_{TPR}\|_\infty = \|\Delta_{FPR}\|_\infty = 0$.


And a weaker condition can be required, as in Beutel et al. (2019) and Borkan et al. (2019), using not the entire ROC curve, but only the area under that curve (AUC).
Definition 8.17 (AUC Fairness (Borkan et al. 2019)) We will have AUC fairness
if .AUCA = AUCB , where .AUCs is the AUC associated with model m within the s
group.
In Table 8.7, we can visualize the ratio .AUCB /AUCA . For example, with a
logistic regression without the sensitive attribute s, AUC in groups A and B are 77%
and 92% respectively. Thus, when employing a basis 1 in group B, the corresponding
value in group A would be 0.837.
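A minimal R sketch of this AUC-based criterion, on the simulated placeholder data used above (not the actual toydata2 scores), computes the within-group AUC from the rank (Mann–Whitney) statistic and takes the ratio of Definition 8.17:

set.seed(123); n <- 1e5                                   # same simulated toy data as in the earlier sketch
s <- sample(c("A", "B"), n, replace = TRUE); x <- rnorm(n, mean = ifelse(s == "B", 1, 0))
y <- rbinom(n, 1, plogis(-1 + x)); score <- plogis(-1 + x + rnorm(n, sd = .5))
auc <- function(g) {
  sc <- score[s == g]; yo <- y[s == g]
  n1 <- sum(yo == 1); n0 <- sum(yo == 0)
  r <- rank(sc)                                           # ranks of the scores within group g
  (sum(r[yo == 1]) - n1 * (n1 + 1) / 2) / (as.numeric(n1) * n0)   # P[score of a 1 > score of a 0]
}
auc("A") / auc("B")                                       # equals 1 under AUC fairness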
Recall that equal opportunity is satisfied if the “separation” property is satisfied,
in the sense that the prediction $\hat{Y}$ is conditionally independent of the protected attribute $S$, given the actual value $Y$,
$$\forall y: \ \hat{Y} \perp\!\!\!\perp S \mid Y = y.$$

In our illustration, with $n = 40$ individuals, equality of opportunity is impossible to achieve. Indeed, this definition of fairness requires the false-positive and true-positive rates to be the same for both populations. This may be reasonable, but in the illustrative example it is impossible, because the two ROC curves do not intersect, as visualized in Fig. 8.9. Note that if the curves did cross, this could impose threshold choices that would be unattractive in practice (with acceptance rates potentially much too low, or too high). Nevertheless, in some applications, it is possible to transform
$$P[m_t(X) = 1 \mid S = A, Y = y] = P[m_t(X) = 1 \mid S = B, Y = y], \quad \forall y \in \{0, 1\},$$
into
$$P[m_{t_A}(X) = 1 \mid S = A, Y = y] = P[m_{t_B}(X) = 1 \mid S = B, Y = y], \quad \forall y \in \{0, 1\},$$

Fig. 8.13 On the right-hand side, evolution of the .φ-fairness metric, in groups A and B, for .mt (x)—without the sensitive attribute s—as a function of threshold
t (on plain logistic regression), on toydata2, using mcc_parity from R package fairness. Here, .n = 500,000 simulated data are considered. In the
middle evolution of MCC, in groups A and B, for .mt (x, s)—with the sensitive attribute s, as a function of t. On the left, evolution of the ratio between groups
A and B, with and without the use of the sensitive attribute in .mt respectively, as a function of t. Dots on the left are empirical values obtained on a smaller
subset, as in Table 8.6

Table 8.7 AUC fairness on the small version of toydata2, using roc_parity (that actually
compared AUC and not ROC curves) based on the ratio of AUC in the two groups, from R package
fairness
Unaware (without s) Aware (with s)
GLM GAM CART RF GLM GAM CART RF
Ratio of AUC 0.837 0.839 0.913 0.768 0.857 0.860 0.913 0.763

for some appropriate thresholds $t_A$ and $t_B$. We can visualize this in Fig. 8.14. On the left-hand side of Fig. 8.14, the thresholds are chosen so that the rate of false positives is the same for both populations (A and B). In other words, if our model is related to loan acceptance, we must have the same proportion of individuals who were offered a loan in each group (advantaged and disadvantaged). For example, here, we keep the threshold of 50% on the score for group B (corresponding to a true-positive rate of about 31.8%), and we must use a threshold of about 24.1% on the score for group A. The right-hand side considers, similarly, true positives.
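The group-specific thresholds of Fig. 8.14 can be obtained numerically with a simple grid search, as in the following R sketch (on the simulated placeholder data used in the previous sketches, so the value of $t_A$ found here is not the 24.1% of toydata2):

set.seed(123); n <- 1e5                                   # same simulated toy data as in the earlier sketch
s <- sample(c("A", "B"), n, replace = TRUE); x <- rnorm(n, mean = ifelse(s == "B", 1, 0))
y <- rbinom(n, 1, plogis(-1 + x)); score <- plogis(-1 + x + rnorm(n, sd = .5))
fpr <- function(t, g) mean(score[s == g & y == 0] > t)    # false-positive rate in group g
target <- fpr(.50, "B")                                   # keep the 50% threshold in group B
t_grid <- seq(.01, .99, by = .001)
t_A <- t_grid[which.min(abs(sapply(t_grid, fpr, g = "A") - target))]  # threshold matching the FPRs
c(t_A = t_A, fpr_A = fpr(t_A, "A"), fpr_B = target)

The same construction, with the true-positive rate instead of the false-positive rate, gives the thresholds used on the right-hand side.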
Definition 8.18 (Equal Treatment (Berk et al. 2021a)) Equal treatment is achieved when the ratio of the rates of false positives and false negatives is identical across the protected groups,
$$\frac{P[\hat{Y} = 1 \mid S = A, Y = 0]}{P[\hat{Y} = 0 \mid S = A, Y = 1]} = \frac{P[\hat{Y} = 1 \mid S = B, Y = 0]}{P[\hat{Y} = 0 \mid S = B, Y = 1]}.$$

Berk et al. (2021a) uses the term "treatment" in connection with causal inference,
which we discuss next. When the classifier yields a higher number of false
negatives for the supposedly privileged group, it indicates that a larger proportion
of disadvantaged individuals are receiving favorable outcomes compared with the
opposite scenario of false positives. A slightly different version had been proposed
by Jung et al. (2020).
Definition 8.19 (Equalizing Disincentives (Jung et al. 2020)) The difference between the true-positive rate and the false-positive rate must be the same in the protected groups,
$$P[\hat{Y} = 1 \mid S = A, Y = 1] - P[\hat{Y} = 1 \mid S = A, Y = 0] = P[\hat{Y} = 1 \mid S = B, Y = 1] - P[\hat{Y} = 1 \mid S = B, Y = 0].$$

Before moving to the third criterion, it should be stressed that "equalized odds" may not be a legitimate criterion, because equal error rates could simply reproduce existing biases. If there is a disparity in y across the levels of s, then equal error rates could simply reproduce this disparity in the prediction $\hat{y}$. Therefore, correcting the disparity in y across the levels of s actually requires different error rates for different values of s.

Fig. 8.14 Densities of scores in solid strong lines, conditional of .y = 0 on the left-hand side, and
.y = 1 on the right-hand side, in groups A and B on the large toydata2 data set. Dotted lines
are unconditional on y. We consider here threshold .t = 50%, and positive predictions . y = 1 or
.mt (x) > t. At the top, threshold .t = 50% is kept in group B, and we select another threshold for
group A (to have the same proportion of .y = 1), whereas at the bottom, threshold .t = 50% is kept
in group A, and we select another threshold for group B

8.4 Sufficiency and Calibration

A third commonly used criterion is sometimes called "sufficiency," which requires independence between the target Y and the sensitive variable S, conditional either on a given score $m(X)$ or prediction $\hat{Y}$; it was introduced by Sokolova et al. (2006), and later taken up by Kleinberg et al. (2016), Zafar et al. (2019), and Pleiss et al. (2017). Formally, it means that
$$Y \perp\!\!\!\perp S \mid \hat{Y} \text{ or } m(X).$$

Definition 8.20 (Sufficiency (Barocas et al. 2017)) A model $m : \mathcal{Z} \to \mathcal{Y}$ satisfies the sufficiency property if $Y \perp\!\!\!\perp S \mid m(Z)$, with respect to the distribution $\mathbb{P}$ of the triplet $(X, S, Y)$.
As discussed in Sect. 4.2 (and Definition 4.23) this property is closely related
to calibration of the model. For Hedden (2021), it is the only interesting criterion
to define fairness on solid philosophical grounds, and Baumann and Loi (2023) relate this criterion to the concept of "actuarial fairness" discussed earlier.
Definition 8.21 (Calibration Parity, Accuracy Parity (Kleinberg et al. 2016; Zafar et al. 2019)) Calibration parity is met if
$$P[Y = 1 \mid m(X) = t, S = A] = P[Y = 1 \mid m(X) = t, S = B], \quad \forall t \in [0, 1].$$

We can go further by asking for a little more, not only for parity but also for a
good calibration.
Definition 8.22 (Good Calibration (Kleinberg et al. 2017; Verma and Rubin 2018)) Fairness of good calibration is met if
$$P[Y = 1 \mid m(X) = t, S = A] = P[Y = 1 \mid m(X) = t, S = B] = t, \quad \forall t \in [0, 1].$$
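Calibration parity can be inspected empirically by comparing, per score bin, the proportion of $Y = 1$ in each group; a minimal R sketch, on the simulated placeholder data used in the previous sketches (not the book's datasets), is the following:

set.seed(123); n <- 1e5                                   # same simulated toy data as in the earlier sketch
s <- sample(c("A", "B"), n, replace = TRUE); x <- rnorm(n, mean = ifelse(s == "B", 1, 0))
y <- rbinom(n, 1, plogis(-1 + x)); score <- plogis(-1 + x + rnorm(n, sd = .5))
bin <- cut(score, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)   # score bins
round(tapply(y, list(bin, s), mean), 3)                   # empirical P[Y = 1 | score bin, group]

Under calibration parity, the two columns should be close to each other; under good calibration, they should in addition be close to the bin midpoints.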

In the context of a classifier, instead of conditioning on $m(X)$, we can simply use $\hat{Y}$, as suggested in Chouldechova (2017).

Definition 8.23 (Predictive Parity (1)—Outcome Test (Chouldechova 2017)) We have predictive parity if
$$P[Y = 1 \mid \hat{Y} = 1, S = A] = P[Y = 1 \mid \hat{Y} = 1, S = B].$$

Note that if $\hat{y}$ is not a perfect classifier ($P[\hat{Y} \neq Y] > 0$), and if the two groups are not balanced ($P[S = A] \neq P[S = B]$), then it is impossible to have predictive parity and equal opportunity at the same time. Note that the positive predictive value is
$$\mathrm{PPV}_s = \frac{\mathrm{TPR} \cdot P[S = s]}{\mathrm{TPR} \cdot P[S = s] + \mathrm{FPR} \cdot (1 - P[S = s])}, \quad \forall s \in \{A, B\},$$
such that $\mathrm{PPV}_A = \mathrm{PPV}_B$ implies that either $\mathrm{TPR}$ or $\mathrm{FPR}$ is zero, and as the negative predictive value can be written
$$\mathrm{NPV}_s = \frac{(1 - \mathrm{FPR}) \cdot (1 - P[S = s])}{(1 - \mathrm{TPR}) \cdot P[S = s] + (1 - \mathrm{FPR}) \cdot (1 - P[S = s])}, \quad \forall s \in \{A, B\},$$
such that $\mathrm{NPV}_A \neq \mathrm{NPV}_B$, and predictive parity cannot be achieved.


Continuing the formalism of Chouldechova (2017), Barocas et al. (2019) proposed an extension of predictive parity, distinguishing positive and negative predictions.

Definition 8.24 (Predictive Parity (2) (Barocas et al. 2019))
$$
\begin{cases}
P[Y = 1 \mid S = A, \hat{Y} = 1] = P[Y = 1 \mid S = B, \hat{Y} = 1] & \text{(positive prediction)}\\
P[Y = 1 \mid S = A, \hat{Y} = 0] = P[Y = 1 \mid S = B, \hat{Y} = 0] & \text{(negative prediction)}
\end{cases}
$$
or
$$P[Y = 1 \mid S = A, \hat{Y} = \hat{y}] = P[Y = 1 \mid S = B, \hat{Y} = \hat{y}], \quad \forall \hat{y} \in \{0, 1\}.$$

Finally, let us note that Kleinberg et al. (2017) introduced a notion of balance for the positive/negative class,
$$
\begin{cases}
E[m(X) \mid Y = 1, S = B] = E[m(X) \mid Y = 1, S = A], & \text{balance for the positive class},\\
E[m(X) \mid Y = 0, S = B] = E[m(X) \mid Y = 0, S = A], & \text{balance for the negative class}.
\end{cases}
$$

In Table 8.8, we use the “accuracy parity metric” as described by Kleinberg et al.
(2016) and Friedler et al. (2019). In groups A and B, accuracy metrics are 80% and
85%, in the .n = 40 dataset. Therefore, if basis 1 in group B is considered, the value
in group A would have been 0.941. In Fig. 8.15, we can look at this metric, as a
function of the threshold, when n is very large.
Another approach can be inspired by Kim (2017), for whom another way of defining whether a classification is fair or not is to say that we cannot tell from the result whether the subject was a member of a protected group or not. In other words, if an individual's score does not allow us to predict that individual's attributes better than guessing them without any information, we can say that the score was assigned fairly.
Definition 8.25 (Non-Reconstruction of the Protected Attribute (Kim 2017)) If we cannot tell from the result ($x$, $m(x)$, $y$ and $\hat{y}$) whether the subject was a member of a protected group or not, we talk about fairness by nonreconstruction of the protected attribute,
$$P[S = A \mid X, m(X), \hat{Y}, Y] = P[S = B \mid X, m(X), \hat{Y}, Y].$$

Table 8.8 Accuracy parity on the small subset of toydata2, using acc_parity from R
package fairness. 1.071 means that accuracy is 7.1% higher in group A than in group B
Unaware (without s) Aware (with s)
GLM GAM CART RF GLM GAM CART RF
.t = 30% 0.933 0.875 1.071 0.833 1.071 1.067 1.071 0.938
.t = 50% 0.941 0.882 0.875 0.632 1.000 0.882 0.875 0.737
.t = 70% 0.812 0.867 0.647 0.647 0.812 0.688 0.647 0.688

Fig. 8.15 On the right-hand side, evolution of accuracy, in groups A and B, for .mt (x)—without the sensitive attribute s—as a function of threshold t (on
plain logistic regression), on toydata2, using acc_parity from R package fairness. Here, .n = 500,000 simulated data are considered. In the middle
evolution of accuracy, in groups A and B, for .mt (x, s)—with the sensitive attribute s, as a function of t

8.5 Comparisons and Impossibility Theorems

The different notions of “group fairness” can be summarized in Table 8.9. And as
we now see, those notions are incompatible.
Proposition 8.3 Suppose that a model m satisfies the independence condition (in Sect. 8.2) and the sufficiency property (in Sect. 8.4), with respect to a sensitive attribute s; then necessarily $Y \perp\!\!\!\perp S$.
Proof From the sufficiency property (8.4), $S \perp\!\!\!\perp Y \mid m(Z)$; then, for $s \in \mathcal{S}$ and $\mathcal{A} \subset \mathcal{Y}$,
$$P[S = s, Y \in \mathcal{A}] = E\big[P[S = s, Y \in \mathcal{A} \mid m(Z)]\big],$$
which can be written
$$P[S = s, Y \in \mathcal{A}] = E\big[P[S = s \mid m(Z)] \cdot P[Y \in \mathcal{A} \mid m(Z)]\big].$$
And from the independence property (8.2), $m(Z) \perp\!\!\!\perp S$, we can write the first component $P[S = s \mid m(Z)] = P[S = s]$, almost surely, and therefore
$$P[S = s, Y \in \mathcal{A}] = E\big[P[S = s] \cdot P[Y \in \mathcal{A} \mid m(Z)]\big] = P[S = s] \cdot P[Y \in \mathcal{A}],$$
for all $s \in \mathcal{S}$ and $\mathcal{A} \subset \mathcal{Y}$, corresponding to the independence between S and Y. ⨆



Therefore, unless the sensitive attribute s has no impact on the outcome y, there
is no model m that satisfies independence and sufficiency simultaneously.
Proposition 8.4 Consider a classifier .mt taking values in .Y = {0, 1}. Suppose
that .mt satisfies the independence condition (Sect. 8.2) and the separation property
(Sect. 8.3), with respect to a sensitive attribute s, then necessarily either .mt (Z) ⊥⊥ Y
or .Y ⊥⊥ S (possibly both).
Proof Because $m_t$ satisfies the independence condition (8.2), $m_t(Z) \perp\!\!\!\perp S$, and the separation property (8.3), $m_t(Z) \perp\!\!\!\perp S \mid Y$, then, for $\hat{y} \in \mathcal{Y}$ and for $s \in \mathcal{S}$,
$$P[m_t(Z) = \hat{y}] = P[m_t(Z) = \hat{y} \mid S = s] = E\big[P[m_t(Z) = \hat{y} \mid Y, S = s] \mid S = s\big],$$
that we can write
$$P[m_t(Z) = \hat{y}] = \sum_{y} P\big[m_t(Z) = \hat{y} \mid Y = y, S = s\big] \cdot P\big[Y = y \mid S = s\big],$$
or
$$P[m_t(Z) = \hat{y}] = \sum_{y} P\big[m_t(Z) = \hat{y} \mid Y = y\big] \cdot P\big[Y = y \mid S = s\big],$$

Table 8.9 Group fairness definitions, where M = m(Z)

Independence, Ŷ ⊥⊥ S (Definition 8.3)
  Statistical parity                Dwork et al. (2012)            P[Ŷ = 1 | S = s] = cst, ∀s                 (8.4)
  Conditional statistical parity    Corbett-Davies et al. (2017)   P[Ŷ = 1 | S = s, X = x] = cst_x, ∀s, x     (8.7)
Separation, Ŷ ⊥⊥ S | Y (Definition 8.8)
  Equalized odds                    Hardt et al. (2016)            P[Ŷ = 1 | S = s, Y = y] = cst_y, ∀s, y     (8.12)
  Equalized opportunity             Hardt et al. (2016)            P[Ŷ = 1 | S = s, Y = 1] = cst, ∀s          (8.9)
  Predictive equality               Corbett-Davies et al. (2017)   P[Ŷ = 1 | S = s, Y = 0] = cst, ∀s          (8.11)
  Balance                           Kleinberg et al. (2016)        E[M | S = s, Y = y] = cst_y, ∀s, y         (8.13)
Sufficiency, Y ⊥⊥ S | Ŷ (Definition 8.20)
  Disparate mistreatment            Zafar et al. (2019)            P[Y = y | S = s, Ŷ = y] = cst_y, ∀s, y     (8.14)
  Predictive parity                 Chouldechova (2017)            P[Y = 1 | S = s, Ŷ = 1] = cst, ∀s          (8.23)
  Calibration                       Chouldechova (2017)            P[Y = 1 | M = m, S = s] = cst_m, ∀m, s     (8.21)
  Well-calibration                  Chouldechova (2017)            P[Y = 1 | M = m, S = s] = m, ∀m, s         (8.22)
Other
  AUC fairness                      Borkan et al. (2019)           Same AUC in both groups                    (8.17)
  ROC fairness                      Vogel et al. (2021)            Same ROC curves in both groups             (8.16)
  Nonreconstruction                 Kim (2017)                     P[S = s | X, M, Ŷ, Y] = cst_s, ∀s          (8.25)

almost surely. Furthermore, we can also write
$$P[m_t(Z) = \hat{y}] = \sum_{y} P\big[m_t(Z) = \hat{y} \mid Y = y\big] \cdot P\big[Y = y\big],$$
so that, if we combine the two expressions, we get
$$\sum_{y} P\big[m_t(Z) = \hat{y} \mid Y = y\big] \cdot \Big(P\big[Y = y \mid S = s\big] - P\big[Y = y\big]\Big) = 0,$$
almost surely. And since we assumed that $y$ was a binary variable, $P[Y = 0] = 1 - P[Y = 1]$, as well as $P[Y = 0 \mid S = s] = 1 - P[Y = 1 \mid S = s]$, and therefore
$$P\big[m_t(Z) = \hat{y} \mid Y = 1\big] \cdot \Big(P[Y = 1 \mid S = s] - P[Y = 1]\Big) - P\big[m_t(Z) = \hat{y} \mid Y = 0\big] \cdot \Big(P[Y = 1 \mid S = s] - P[Y = 1]\Big) = 0,$$
that is,
$$\Big(P\big[m_t(Z) = \hat{y} \mid Y = 1\big] - P\big[m_t(Z) = \hat{y} \mid Y = 0\big]\Big) \cdot \Big(P[Y = 1 \mid S = s] - P[Y = 1]\Big) = 0.$$
Thus, either $P[Y = 1 \mid S = s] = P[Y = 1]$ almost surely, or $P[m_t(Z) = \hat{y} \mid Y = 0] = P[m_t(Z) = \hat{y} \mid Y = 1]$ (or both). ⨆

Of course, the previous proposition holds only when y is a binary variable.
Proposition 8.5 Consider a classifier .mt taking values in .Y = {0, 1}. Suppose
that .mt satisfies the sufficiency condition (Sect. 8.4) and the separation property
(Sect. 8.3), with respect to a sensitive attribute s, then necessarily either .P[mt (Z) =
1|Y = 1] = 0 or .Y ⊥⊥ S or .mt (Z) ⊥⊥ Y .
Proof Suppose that $m_t$ satisfies the sufficiency condition (8.4) and the separation property (8.3), respectively $Y \perp\!\!\!\perp S \mid m_t(Z)$ and $m_t(Z) \perp\!\!\!\perp S \mid Y$. For all $s \in \mathcal{S}$, we can write, using Bayes' formula,
$$P[Y = 1 \mid S = s, m_t(Z) = 1] = \frac{P[m_t(Z) = 1 \mid Y = 1, S = s] \cdot P[Y = 1 \mid S = s]}{P[m_t(Z) = 1 \mid S = s]},$$
i.e.,
$$P[Y = 1 \mid S = s, m_t(Z) = 1] = \frac{P[m_t(Z) = 1 \mid Y = 1] \cdot P[Y = 1 \mid S = s]}{\displaystyle\sum_{y \in \{0,1\}} P[m_t(Z) = 1 \mid Y = y] \cdot P[Y = y \mid S = s]},$$
which should not depend on $s$ (from the sufficiency property). So a similar property holds if $S = s'$. Observe further that $P[m_t(Z) = 1 \mid Y = 1]$ is the true-positive rate (TPR) whereas $P[m_t(Z) = 1 \mid Y = 0]$ is the false-positive rate (FPR). Let $p_s = P[Y = 1 \mid S = s]$, so that
$$P[Y = 1 \mid S = s, m_t(Z) = 1] = \frac{\mathrm{TPR}}{p_s \cdot \mathrm{TPR} + (1 - p_s) \cdot \mathrm{FPR}}.$$
Suppose that $Y$ and $S$ are not independent (otherwise $Y \perp\!\!\!\perp S$, as stated in the proposition), i.e., there are $s$ and $s'$ such that $p_s = P[Y = 1 \mid S = s] \neq P[Y = 1 \mid S = s'] = p_{s'}$. Hence $p_s \neq p_{s'}$, but at the same time
$$\frac{\mathrm{TPR}}{p_s \cdot \mathrm{TPR} + (1 - p_s) \cdot \mathrm{FPR}} = \frac{\mathrm{TPR}}{p_{s'} \cdot \mathrm{TPR} + (1 - p_{s'}) \cdot \mathrm{FPR}}.$$
Suppose that $\mathrm{TPR} \neq 0$ (otherwise $\mathrm{TPR} = P[m_t(Z) = 1 \mid Y = 1] = 0$, as stated in the proposition); then
$$(p_s - p_{s'}) \cdot \mathrm{TPR} = (p_s - p_{s'}) \cdot \mathrm{FPR}, \quad\text{with } p_s - p_{s'} \neq 0,$$
so that $\mathrm{TPR} = \mathrm{FPR}$, and therefore $m_t(Z) \perp\!\!\!\perp Y$. ⨆



So, to summarize that section, unless very specific properties are assumed on .P,
there is no prediction function .m(·) that can satisfy two fairness criteria at the same
time.

8.6 Relaxation and Confidence Intervals

We have seen that demographic fairness translates into the following equality,
$$\frac{P[\hat{Y} = 1 \mid S = A]}{P[\hat{Y} = 1 \mid S = B]} = 1 = \frac{P[\hat{Y} = 1 \mid S = B]}{P[\hat{Y} = 1 \mid S = A]}.$$

While this approach is appealing, the statistical reality is that a perfect equality between two (predicted) probabilities is usually impossible to achieve. It is actually possible to relax that equality, as follows.
Definition 8.26 (Disparate Impact (Feldman et al. 2015)) A decision function $\hat{Y}$ has a disparate impact, for a given threshold $\tau$, if
$$\min\left\{\frac{P[\hat{Y} = 1 \mid S = A]}{P[\hat{Y} = 1 \mid S = B]}, \ \frac{P[\hat{Y} = 1 \mid S = B]}{P[\hat{Y} = 1 \mid S = A]}\right\} < \tau \quad (\text{usually } 80\%).$$

This so-called “four-fifths rule,” coupled with the .τ = 80% threshold, was
originally defined by the State of California Fair Employment Practice Commission
Technical Advisory Committee on Testing, which issued the California State
Guidelines on Employee Selection Procedures in October 1972, as recalled in
Feldman et al. (2015), Mercat-Bruns (2016), Biddle (2017), or Lipton et al. (2018).
This standard was later adopted in the 1978 Uniform Guidelines on Employee
Selection Procedures, used by the Equal Employment Opportunity Commission,
the US Department of Labor, and the US Department of Justice. An important point
here is that this form of discrimination occurred even when the employer did not
intend to discriminate, but by looking at employment statistics (on gender or racial
grounds), it was possible to observe (and correct) discriminatory bias.
For example, on the toydata2 dataset with $n = 1000$ individuals,
$$\frac{P[\hat{Y} = 1 \mid S = A]}{P[\hat{Y} = 1 \mid S = B]} = \frac{134/600}{270/400} = \frac{134}{405} \simeq 33.1\% \ll 80\%.$$
Another approach, suggested to relax the equality $P(\hat{Y} = 1 \mid S = A) = P(\hat{Y} = 1 \mid S = B)$, consists in introducing a notion of "approximate fairness," as in Holzer and Neumark (2000), Collins (2007) and Feldman et al. (2015), or $\varepsilon$-fairness, as in Hu (2022),
$$\big|P(\hat{Y} = 1 \mid S = A) - P(\hat{Y} = 1 \mid S = B)\big| < \varepsilon.$$
The deviation on the left-hand side is sometimes called the "statistical parity difference" (SPD).


Žliobaite (2015) suggests normalizing the statistical parity difference,
   = 0) 
SPD P(Y = 1) P(Y
.NSPD where Dmax = min , ,
Dmax P(S = B) P(S = A)

so that .NSPD = 1 for maximum discrimination (otherwise .NSPD < 1).


For strong concepts of fairness, we can use distances (or divergences) between
distributions, as in Proposition 8.1, using the Wasserstein distance,

W2 (PA , PB ) < ε,
.

or Proposition 8.2, using Kullback-Leibler divergence.


Besse et al. (2018) considered another approach, based on confidence intervals
for fairness criteria. For example, for the disparate impact, we have seen that we
should calculate
 = 1|S = A]
P[Y
T =
. ,
 = 1|S = B]
P[Y
344 8 Group Fairness

whose empirical version is


 

yi 1(si = A) 1(si = B)
Tn = i
. · i ,

y
i i 1(s i = B) i 1(si = A)

which can be used to construct a confidence interval for T (Besse et al. (2018)
propose an asymptotic test, but using resampling techniques it is also possible to get
confidence intervals).
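A minimal R sketch of these quantities, with a naive bootstrap in place of the asymptotic test (on the simulated placeholder data used in the previous sketches, not on toydata2):

set.seed(123); n <- 1e5                                   # same simulated toy data as in the earlier sketch
s <- sample(c("A", "B"), n, replace = TRUE); x <- rnorm(n, mean = ifelse(s == "B", 1, 0))
y <- rbinom(n, 1, plogis(-1 + x)); score <- plogis(-1 + x + rnorm(n, sd = .5))
yhat <- as.numeric(score > .5)                            # classifier with threshold t = 50%
DI  <- mean(yhat[s == "A"]) / mean(yhat[s == "B"])        # empirical disparate impact ratio T_n
SPD <- mean(yhat[s == "A"]) - mean(yhat[s == "B"])        # statistical parity difference
boot <- replicate(500, {                                  # resample individuals (may take a few seconds)
  i <- sample(n, replace = TRUE)
  mean(yhat[i][s[i] == "A"]) / mean(yhat[i][s[i] == "B"])
})
c(DI = DI, SPD = SPD, quantile(boot, c(.025, .975)))      # point estimates and a 95% bootstrap interval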

8.7 Using Decomposition and Regressions

While the three previous approaches (independence, separation, and sufficiency) are now quite popular in the machine-learning literature, other techniques to quantify discrimination have been introduced in econometrics. A classical case in labor economics
is the gender wage gap. Such a gap has been observed for decades and economists
have tried to explain the difference in average wages between men and women. In a
nutshell, as in insurance, such a gap could be a “fair demographic discrimination” if
there were group differences in wage determinants, that is, in characteristics that are
relevant to wages, such as education. This is called “compositional differences.”
But that gap can also be associated with “unfair discrimination,” if there was
a differential compensation for these determinants, such as different returns to
education for men and women. This is called “differential mechanisms.” In order
to construct a counterfactual state—to answer a question “If women had the same
characteristics as men, what would be their wage?”—economists have considered
the decomposition method, to recover the causal effect of a sensitive attribute. Cain
(1986) and Fortin et al. (2011) have provided the state of the art on those techniques.
Although the seminal work by Kitagawa (1955) and Solow (1957) introduced the "decomposition method," Oaxaca (1973) and Blinder (1973) laid the foundations of the decomposition approach to analyze mean wage differences between groups, based either on gender or on race. This approach offers a straightforward means of disentangling cause and effect within the context of linear models,
$$y_i = \gamma\,\mathbf{1}_B(s_i) + x_i^\top\beta + \varepsilon_i,$$

where $y$ is the salary, $s$ denotes the binary sensitive attribute (through $\mathbf{1}_B(s)$), and $x$ is a collection of predictive variables (or control variables that might have an influence on the salary, such as education or work experience). In such a model, $\widehat{\gamma}$ can be used to answer the question we asked earlier, "what would the wage be if women ($s = B$) had the same characteristics $x$ as men ($s = A$)?" To introduce the Blinder–Oaxaca approach, suppose that we consider two (linear) models, for men and women respectively,

$$
\begin{cases}
y_{A:i} = x_{A:i}^\top\beta_A + \varepsilon_{A:i} & \text{(group A)}\\
y_{B:i} = x_{B:i}^\top\beta_B + \varepsilon_{B:i} & \text{(group B)}.
\end{cases}
$$

Using ordinary least squares estimates (and standard properties of linear models),
$$
\begin{cases}
\overline{y}_A = \overline{x}_A^\top\widehat{\beta}_A & \text{(group A)}\\
\overline{y}_B = \overline{x}_B^\top\widehat{\beta}_B & \text{(group B)},
\end{cases}
$$
so that $\overline{y}_A - \overline{y}_B = \overline{x}_A^\top\widehat{\beta}_A - \overline{x}_B^\top\widehat{\beta}_B$, which we can write (by adding and removing $\overline{x}_A^\top\widehat{\beta}_B$)
$$\overline{y}_A - \overline{y}_B = \underbrace{\big(\overline{x}_A - \overline{x}_B\big)^\top\widehat{\beta}_B}_{\text{characteristics}} + \underbrace{\overline{x}_A^\top\big(\widehat{\beta}_A - \widehat{\beta}_B\big)}_{\text{coefficients}}, \qquad (8.1)$$

where the first term is the “characteristics effect,” which describes how much the
difference in outcome y (on average) is due to the differences in the levels of
explanatory variables .x, whereas the second term is the “coefficients effect,” which
describes how much the difference in outcome y (on average) is due to differences in
the magnitude of regression coefficients .β. The first one is the legitimate component,
also called “endowment effect” in Woodhams et al. (2021) or “composition effect” in
Hsee and Li (2022), whereas the second one can be interpreted as some illegitimate
discrimination, and is called “returns effect” in Agrawal (2013) or “structure effect”
in Firpo (2017). For the first component,
$$\big(\overline{x}_A - \overline{x}_B\big)^\top\widehat{\beta}_B = \big(\overline{x}_{A:1} - \overline{x}_{B:1}\big)\widehat{\beta}_{B:1} + \cdots + \big(\overline{x}_{A:j} - \overline{x}_{B:j}\big)\widehat{\beta}_{B:j} + \cdots + \big(\overline{x}_{A:k} - \overline{x}_{B:k}\big)\widehat{\beta}_{B:k},$$
where, for the $j$-th term, we explicitly see the average difference between the two groups, $\overline{x}_{A:j} - \overline{x}_{B:j}$, whereas for the second component,
$$\overline{x}_A^\top\big(\widehat{\beta}_A - \widehat{\beta}_B\big) = \overline{x}_{A:1}\big(\widehat{\beta}_{A:1} - \widehat{\beta}_{B:1}\big) + \cdots + \overline{x}_{A:j}\big(\widehat{\beta}_{A:j} - \widehat{\beta}_{B:j}\big) + \cdots + \overline{x}_{A:k}\big(\widehat{\beta}_{A:k} - \widehat{\beta}_{B:k}\big).$$
But similarly, we could have written
$$\overline{y}_A - \overline{y}_B = \underbrace{\big(\overline{x}_A - \overline{x}_B\big)^\top\widehat{\beta}_A}_{\text{characteristics}} + \underbrace{\overline{x}_B^\top\big(\widehat{\beta}_A - \widehat{\beta}_B\big)}_{\text{coefficients}}, \qquad (8.2)$$

which is analogous to the previous decomposition, in Eq. (8.1), but can be rather
different. This is more or less the same as the regression on categorical variables:
changing the reference does not change the prediction, only the interpretation of
the coefficient. To visualize it, consider the case where there is only one single
characteristic x, as in Fig. 8.16.
In the context of categorical variables, it is rather common to use a contrast
approach where all quantities are expressed with respect to some “average bench-
mark.” We do the same here, except that the “average benchmark” is now a “fair

Fig. 8.16 Visualization of the Blinder–Oaxaca decomposition of $\overline{y}_A - \overline{y}_B$, with a single covariate $x$, as $\overline{x}_A(\widehat{\beta}_A - \widehat{\beta}_B)$ and $(\overline{x}_A - \overline{x}_B)\widehat{\beta}_B$ (as in Eq. 8.1) on the left, and $\overline{x}_B(\widehat{\beta}_A - \widehat{\beta}_B)$ and $(\overline{x}_A - \overline{x}_B)\widehat{\beta}_A$ (as in Eq. 8.2) on the right (fictitious data)

benchmark." So suppose that it could be possible to have a nondiscriminatory (potential) outcome $y^\star$, so that
$$y^\star = x^\top\beta^\star + \varepsilon,$$

where $E[\varepsilon \mid X] = 0$. Then we could write, unambiguously,
$$\overline{y}_A - \overline{y}_B = \underbrace{\big(\overline{x}_A - \overline{x}_B\big)^\top\beta^\star}_{\text{characteristics}} + \underbrace{\overline{x}_A^\top\big(\beta_A - \beta^\star\big)}_{\text{coefficients (A)}} + \underbrace{\overline{x}_B^\top\big(\beta^\star - \beta_B\big)}_{\text{coefficients (B)}}, \qquad (8.3)$$

where the “coefficient effect” component is now decomposed into two parts: an
illegitimate discrimination in favor of group A, and an illegitimate discrimination
against group B. This approach can be used to get a better interpretation of the first
two models. In fact, if we assume that there is only discrimination against group B,
and no discrimination in favor of group A, then .β ⋆ = β A and we obtain Eq. (8.1),
whereas if we assume that there is only discrimination in favor of group A, and no
discrimination against group B, then .β ⋆ = β B and we obtain Eq. (8.2).
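The decomposition of Eq. (8.1) is easy to compute with two group-specific regressions. The R sketch below uses simulated wage-type data (the variables grp, edu, expe and wage are fictitious placeholders, not an actual labor dataset), and splits the average gap into the "characteristics" (endowment) and "coefficients" (structure) components:

set.seed(1)
grp  <- sample(c("A", "B"), 2000, replace = TRUE)
edu  <- rnorm(2000, mean = ifelse(grp == "A", 13, 12))    # group difference in characteristics
expe <- rpois(2000, 10)
wage <- 10 + 2 * edu + 0.5 * expe + ifelse(grp == "A", 0, -3) + rnorm(2000)
fit_A <- lm(wage ~ edu + expe, subset = grp == "A")       # group-specific regressions
fit_B <- lm(wage ~ edu + expe, subset = grp == "B")
xbar_A <- colMeans(cbind(1, edu, expe)[grp == "A", ])     # average characteristics, group A
xbar_B <- colMeans(cbind(1, edu, expe)[grp == "B", ])     # average characteristics, group B
gap         <- mean(wage[grp == "A"]) - mean(wage[grp == "B"])
char_effect <- sum((xbar_A - xbar_B) * coef(fit_B))       # characteristics (endowment) effect
coef_effect <- sum(xbar_A * (coef(fit_A) - coef(fit_B)))  # coefficients (structure) effect
c(gap = gap, characteristics = char_effect, coefficients = coef_effect)

With ordinary least squares and an intercept in each group, the two components sum exactly to the gap; swapping the roles of the reference coefficients gives Eq. (8.2).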
As for the contrast approach, one can consider an average approach for the fair benchmark. Reimers (1983) suggested considering the average coefficient between the two groups,
$$\beta^\star = \frac{1}{2}\widehat{\beta}_A + \frac{1}{2}\widehat{\beta}_B,$$
whereas Cotton (1988) suggested a weighted average, based on population sizes in the two groups,
$$\beta^\star = \frac{n_A}{n_A + n_B}\widehat{\beta}_A + \frac{n_B}{n_A + n_B}\widehat{\beta}_B.$$

And quite naturally, consider
$$\beta^\star_\omega = \omega\,\widehat{\beta}_A + (1 - \omega)\,\widehat{\beta}_B, \quad\text{where } \omega \in [0, 1].$$

Then we can write Eq. (8.3) as
$$\overline{y}_A - \overline{y}_B = \big(\overline{x}_A - \overline{x}_B\big)^\top\big[\omega\widehat{\beta}_A + (1 - \omega)\widehat{\beta}_B\big] + \big[(1 - \omega)\overline{x}_A + \omega\overline{x}_B\big]^\top\big(\widehat{\beta}_A - \widehat{\beta}_B\big), \qquad (8.4)$$
or more generally
$$\overline{y}_A - \overline{y}_B = \big(\overline{x}_A - \overline{x}_B\big)^\top\big[\Omega\widehat{\beta}_A + (\mathbb{I} - \Omega)\widehat{\beta}_B\big] + \big[(\mathbb{I} - \Omega)\overline{x}_A + \Omega\overline{x}_B\big]^\top\big(\widehat{\beta}_A - \widehat{\beta}_B\big), \qquad (8.5)$$
for some $p \times p$ matrix $\Omega$, that corresponds to the relative weights given to the coefficients of group A. Oaxaca and Ransom (1994) suggested using
$$\Omega = (X^\top X)^{-1}(X_A^\top X_A).$$

But this approach suffers several drawbacks, related to the "omitted variable bias" discussed in Sect. 5.6 (see also Jargowsky (2005) or, more recently, Cinelli and Hazlett (2020)). Therefore, Jann (2008) suggested simply using a pooled regression over the two groups, controlled by the group membership, that is,
$$y_i = \gamma\,\mathbf{1}_B(s_i) + x_i^\top\beta + \varepsilon_i,$$
so that, with heuristically standard matrix notations,
$$\widehat{\beta} = \big([X, s]^\top[X, s]\big)^{-1}[X, s]^\top y.$$

As in Brown et al. (1980), consider now some binary sensitive attribute $s$, and possibly some class $x$, corresponding to the industry. This corresponds to the framework of Simpson's paradox discussed in Sect. 5.6, where $y$ is the acceptance in a graduate program, $s$ the gender of the candidate, and $x$ the program. Here, $y$ is the wage, $s$ the gender, and $x$ the industry. Let $p_{s:j}$ denote the probability of a person of gender $s$ entering industry $j$. Then,
$$\overline{y}_s = \sum_j p_{s:j}\,\overline{y}_{s:j},$$
so that the average wage gap between men and women in the labor market is
$$\overline{y}_B - \overline{y}_A = \sum_j \big(p_{B:j}\,\overline{y}_{B:j} - p_{A:j}\,\overline{y}_{A:j}\big),$$

and we decompose this wage gap into industry wage differentials, and the probability of entering a certain industry. With personal data, one can consider some multinomial logit model, so that the probability of an individual with characteristics $x_i$ joining industry $j$ would be
$$p_{j,i} = \frac{\exp(x_i^\top\beta)}{1 + \exp(x_i^\top\beta)}.$$
Such a model can be estimated for men and women, independently,
$$p_{s:j,i} = \frac{\exp(x_i^\top\beta_s)}{1 + \exp(x_i^\top\beta_s)}.$$

For a woman ($s = B$) with characteristics $x_i$, define the "nondiscriminatory" probability (or probability of working in industry $j$ if that person had been treated as a man, A) as
$$p^\star_{B:j,i} = \frac{\exp(x_i^\top\widehat{\beta}_A)}{1 + \exp(x_i^\top\widehat{\beta}_A)},$$
and decompose the average wage gap
$$\overline{y}_B - \overline{y}_A = \sum_j \big(p_{B:j}\,\overline{y}_{B:j} - p_{A:j}\,\overline{y}_{A:j}\big) = \sum_j d_j,$$

into two parts,
$$d_j = \big[p_{B:j} - p_{A:j}\big]\overline{y}_{B:j} + p_{A:j}\big[\overline{y}_{B:j} - \overline{y}_{A:j}\big].$$

The first term can also be decomposed as
$$\big[p_{B:j} - p_{A:j}\big]\overline{y}_{B:j} = \Big(\big(p_{B:j} - p^\star_{A:j}\big) + \big(p^\star_{A:j} - p_{A:j}\big)\Big)\overline{y}_{B:j}.$$

And similarly, it is possible to use the Blinder–Oaxaca decomposition for the second term, as
$$\big[\overline{y}_{B:j} - \overline{y}_{A:j}\big] = \big[\overline{x}_{B:j}^\top\widehat{\beta}_{B:j} - \overline{x}_{A:j}^\top\widehat{\beta}_{A:j}\big],$$
i.e.,
$$\big[\overline{x}_{B:j} - \overline{x}_{A:j}\big]^\top\widehat{\beta}_{B:j} + \overline{x}_{A:j}^\top\big[\widehat{\beta}_{B:j} - \widehat{\beta}_{A:j}\big].$$

And therefore, $d_j$ (and $\overline{y}_B - \overline{y}_A$) can be decomposed into four components
$$
\begin{aligned}
d_j ={}& p_{A:j}\big[\overline{x}_{B:j} - \overline{x}_{A:j}\big]^\top\widehat{\beta}_{B:j} && \text{legitimate difference within industries}\\
&+ p_{A:j}\,\overline{x}_{A:j}^\top\big[\widehat{\beta}_{B:j} - \widehat{\beta}_{A:j}\big] && \text{illegitimate within industries}\\
&+ \big(p_{B:j} - p^\star_{A:j}\big)\,\overline{y}_{B:j} && \text{legitimate across industries}\\
&+ \big(p^\star_{A:j} - p_{A:j}\big)\,\overline{y}_{B:j} && \text{illegitimate across industries}.
\end{aligned}
$$

If x is no longer defined on classes, DiNardo et al. (1995) suggested a continuous


version, where the vector of probabilities .ps:j now has a continuous density .fs that
can be estimated using kernel density estimates.
An alternative is to consider so-called “direct and reverse regressions,” to
quantify discrimination. In the context of labor market discrimination, Conway
and Roberts (1983) considered some data, where y denotes the income, a job
offer or a promotion. The protected variable p is either the gender or the indicator
associated with racial minorities (and is binary). And .x denotes some information
about job qualifications that will be univariate here, and is considered to be
an imperfect measure of actual productivity (“we shall mainly think of x as a
job qualification rather than as a proxy for productivity”). When performing a
regression of y on x and s, the regression coefficient of s estimates mean salary
differences between females and males after statistical allowance for the measured
qualifications, x. Such a regression corresponds to the idea that discrimination
has to do with disparity in mean salaries, for a given measured job qualification.
Conway and Roberts (1983) considered another type of possible discrimination,
called “placement discrimination,” which refers to the “shunting” or “steering”
of females (or minorities) into lower job levels than their qualifications warrant.
Therefore, Conway and Roberts (1983) claims that it is more natural to compare the
average qualifications of males and females within each job group, that is, to regress
x on y and s, so that discrimination corresponds to disparities in mean measured
qualifications for entering given job groups. Hence, the conditional distribution
of y given x is named “direct regression method,” whereas the one based on the
conditional distribution of x given y is the “reverse regression method.” The former
was considered in Finkelstein (1980), Fisher (1980), or Weisberg and Tomberlin
(1982), who discuss the use of linear regression models in legal cases of employment
discrimination. And the latter idea was discussed in Ferber and Green (1982a,b),
Kamalich and Polachek (1982), Goldberger (1984), Greene (1984), and Michelson
and Blattenberger (1984), in the early 1980s.
Those two approaches provide two different perspectives on fairness: (1) fairness
in the sense that the distributions of male and female incomes are the same, given
qualifications, corresponding to a direct regression, (2) fairness in the sense that

the distributions of male and female qualifications are the same, given incomes,
corresponding to a reverse regression.

$$
\begin{cases}
E[Y \mid X = x] = \beta_0 + x^\top\beta_1 + \beta_2 s\\
E[X \mid Y = y] = \alpha_0 + \alpha_1 y + \alpha_2 s,
\end{cases}
$$
or
$$
\begin{cases}
y_i = \beta_0 + x_i^\top\beta_1 + \beta_2 s_i + \varepsilon_i\\
x_i = \alpha_0 + \alpha_1 y_i + \alpha_2 s_i + \eta_i.
\end{cases}
$$

The idea underlying both proposals is the intuitively appealing one that if the
protected class is underpaid, that is, .β2 < 0, then we should expect its members to
be overqualified in the jobs that they hold, that is, .α2 > 0. Kamalich and Polachek
(1982) suggested a multivariate extension, with
$$
\begin{cases}
x_{1,i} = \alpha_{1,0} + \alpha_{1,1} y_i + \alpha_{1,2} s_i + \alpha_{1,3,2}\,x_{2,i} + \alpha_{1,3,3}\,x_{3,i} + \cdots + \alpha_{1,3,k}\,x_{k,i} + \eta_{1,i}\\
x_{2,i} = \alpha_{2,0} + \alpha_{2,1} y_i + \alpha_{2,2} s_i + \alpha_{2,3,1}\,x_{1,i} + \alpha_{2,3,3}\,x_{3,i} + \cdots + \alpha_{2,3,k}\,x_{k,i} + \eta_{2,i}\\
\qquad\vdots\\
x_{k,i} = \alpha_{k,0} + \alpha_{k,1} y_i + \alpha_{k,2} s_i + \alpha_{k,3,1}\,x_{1,i} + \alpha_{k,3,2}\,x_{2,i} + \cdots + \alpha_{k,3,k-1}\,x_{k-1,i} + \eta_{k,i},
\end{cases}
$$
whereas Conway and Roberts (1983) suggested some aggregated index of $x$. More precisely,
$$
\begin{cases}
y_i = \beta_0 + \underbrace{x_i^\top\beta_1}_{z_i} + \beta_2 s_i + \varepsilon_i\\
z_i = \alpha_0 + \alpha_1 y_i + \alpha_2 s_i + \eta_i.
\end{cases}
$$
Greene (1984) observed that
$$\widehat{\alpha}_2 = \frac{n}{n - n_A}\cdot(\overline{y}_A - \overline{y})(1 - R^2) - \widehat{\beta}_2.$$

As mentioned by Conway and Roberts (1983), “because of the conditioning on


salary, reverse regression exposes features of the data that might go undetected by
the use of direct regression alone. Comparison of employee qualifications for given
jobs or salaries is especially pertinent when one is trying to detect shunting, which
is usually thought of as placement of members of protected classes into lower job
groups than would be suggested by their qualifications.”
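A minimal R sketch of the direct and reverse regressions, on simulated wage-type data (fictitious placeholders, with a single qualification variable edu, not an actual employment dataset): a negative coefficient on the group indicator in the direct regression, together with a positive one in the reverse regression, is the "underpaid and overqualified" pattern discussed above.

set.seed(1)
grp  <- sample(c("A", "B"), 2000, replace = TRUE)
edu  <- rnorm(2000, mean = ifelse(grp == "A", 13, 12))    # qualification (education)
wage <- 10 + 2 * edu + ifelse(grp == "A", 0, -3) + rnorm(2000)   # group B is underpaid by design
sB <- as.numeric(grp == "B")                              # indicator of the (possibly) disfavored group
direct  <- lm(wage ~ edu + sB)                            # salary given qualification and group
reverse <- lm(edu ~ wage + sB)                            # qualification given salary and group
c(beta2 = unname(coef(direct)["sB"]), alpha2 = unname(coef(reverse)["sB"]))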

Racine and Rilstone (1995) suggested using nonparametric regression to avoid possible misspecifications. Indeed, suppose that
$$
\begin{cases}
y_i = g(x_i) + \beta_2 s_i\\
x_i = h(y_i) + \alpha_2 s_i,
\end{cases}
$$
which can be written
$$
\begin{cases}
y_i - E[Y \mid X = x_i] = \beta_2\big(s_i - E[S \mid X = x_i]\big) + \varepsilon_i\\
x_i - E[X \mid Y = y_i] = \alpha_2\big(s_i - E[S \mid Y = y_i]\big) + \eta_i.
\end{cases}
$$
First, the unknown conditional means are estimated nonparametrically. Then they are substituted for the unknown functions, and least squares is used to estimate $\beta_2$ and $\alpha_2$.
Robinson (1988) proved that these estimates are asymptotically equivalent to those
obtained using the true conditional mean functions for estimation.

8.8 Application on the GermanCredit Dataset

In Sect. 4.2, four models were presented on the GermanCredit dataset (logistic
regression, classification tree, boosting, and bagging), with and without the sensitive
variable (gender). Cumulative distribution functions of the scores, for the plain
logistic regression and the boosting algorithm, can be visualized in Fig. 8.17.
In the training subset of GermanCredit, 30.1% of people got a default ($y$ = BAD), and 29.7% in the validation dataset ($y$). If we consider predictions from the logistic regression model, with the sensitive attribute, on the validation dataset, the average prediction ($m(x)$) is 28.7% and the median one is 20.4%. With a (classifier) threshold $t = 20\%$, we have a balanced dataset, with $\hat{y} = 1$ (risky) for 50% of the people (see Fig. 8.17). With a threshold $t = 40\%$, 30% of the individuals get $\hat{y} = 1$ (which is close to the default frequency in the dataset). In Tables 8.10 and 8.11, we use thresholds $t = 20\%$ and 40% respectively.
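The cutoffs used here can be read directly from the quantiles of the predicted scores, as in the following R sketch (the vector m_x is a simulated placeholder for the validation-set predictions, not the actual GermanCredit scores):

set.seed(42)
m_x <- rbeta(1000, 2, 5)                                  # placeholder scores on a validation set
quantile(m_x, probs = 1 - 0.50)                           # cutoff so that 50% of individuals get yhat = 1
quantile(m_x, probs = 1 - 0.30)                           # cutoff so that about 30% get yhat = 1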

8.9 Application on the FrenchMotor Dataset

In Sect. 4.2, four models were presented on the FrenchMotor dataset (logistic
regression, classification tree, boosting, and bagging), with and without the sen-
sitive variable (gender). In the training subset of FrenchMotor 8.72% of the
policyholders claimed a loss, 8.55% in the validation dataset (.y). If we consider
predictions from the logistic regression model, with the sensitive attribute, on the
validation dataset, the average prediction (.m(x)) is 9% and the median one is 8%.
With a threshold .t = 8%, we have a balanced dataset, with . y = 1 for .50% of the

Fig. 8.17 Distributions of the score .m(z), on the GermanCredit dataset, conditional on y on the
left-hand side, and s on the right-hand side, when m is a plain logistic regression without sensitive
attribute s at the top, and boosting without sensitive attribute s at the bottom, with threshold .t =
40%

risky drivers (see Fig. 8.18). With a threshold $t = 16\%$, 10% of the policyholders get $\hat{y} = 1$ (which is close to the claim frequency in the dataset). In Tables 8.12 and 8.13, we use thresholds $t = 8\%$ and 16% respectively.

Table 8.10 Fairness metrics on the GermanCredit dataset, with the fairness R package,
by Varga and Kozodoi (2021), for women, the reference being men, with the threshold at .20%
With sensitive Without sensitive
GLM Tree Boosting Bagging GLM Tree Boosting Bagging
.P[m(X) > t] 51.7% 28.0% 54.7% 61.7% 50.7% 28.0% 56.0% 60.7%
Predictive rate parity 0.992 1.190 0.992 1.050 0.957 1.190 1.041 1.037
Demographic parity 0.998 1.091 1.159 1.027 1.213 1.091 1.112 1.208
FNR parity 1.398 0.740 1.078 1.124 1.075 0.740 1.064 0.970
Proportional parity 0.922 1.008 1.071 0.949 1.121 1.008 1.027 1.116
Equalized odds 0.816 1.069 0.947 0.888 0.956 1.069 0.953 1.031
Accuracy parity 0.843 1.181 0.912 0.904 0.896 1.181 0.943 0.966
FPR parity 1.247 0.683 1.470 0.855 2.004 0.683 0.962 1.069
NPV parity 0.676 1.141 0.763 0.772 0.735 1.141 0.799 0.823
Specificity parity 0.941 1.439 0.930 1.028 0.851 1.439 1.007 0.990
ROC AUC parity 0.928 1.162 0.997 1.108 0.926 1.162 1.004 1.090
MCC parity 0.604 2.013 0.744 0.851 0.639 2.013 0.884 0.930

Table 8.11 Fairness metrics on the GermanCredit dataset, with the fairness R package,
by Varga and Kozodoi (2021), for women, the reference being men, with the threshold at .40%
With sensitive Without sensitive
GLM Tree Boosting Bagging GLM Tree Boosting Bagging
.P[m(X) > t] 30.3% 26.0% 27.7% 25.7% 30.7% 26.0% 28.0% 27.0%
Predictive rate parity 1.030 1.179 1.110 1.182 1.034 1.179 1.111 1.200
Demographic parity 1.090 1.062 1.074 1.069 1.108 1.062 1.044 1.019
FNR parity 1.533 0.851 1.110 0.781 1.342 0.851 1.322 0.962
Proportional parity 1.007 0.981 0.992 0.987 1.024 0.981 0.964 0.942
Equalized odds 0.925 1.032 0.982 1.041 0.944 1.032 0.955 1.008
Accuracy parity 0.949 1.154 1.054 1.164 0.963 1.154 1.038 1.159
FPR parity 1.118 0.703 0.820 0.653 1.118 0.703 0.784 0.641
NPV parity 0.738 1.080 0.890 1.108 0.766 1.080 0.848 1.082
Specificity parity 0.935 1.470 1.169 1.480 0.935 1.470 1.203 1.652
ROC AUC parity 0.928 1.162 0.997 1.108 0.926 1.162 1.004 1.090
MCC parity 0.745 1.817 1.105 1.754 0.779 1.817 1.056 2.055

Fig. 8.18 Distributions of the score .m(z), on the FrenchMotor dataset, conditional on y on the
left-hand side, and s on the right-hand side, when m is a plain logistic regression without sensitive
attribute s at the top, and boosting without sensitive attribute s at the bottom, with threshold .t = 8%

Table 8.12 Fairness metrics on the FrenchMotor dataset, with the fairness R package, by
Varga and Kozodoi (2021), for women, the reference being men, with the threshold at .t = 8%
With sensitive Without sensitive
GLM Tree Boosting Bagging GLM Tree Boosting Bagging
.P[m(X) > t] 51.1% 29.2% 49.6% 18.7% 50.8% 29.2% 51.6% 18.6%
Predictive rate parity 1.019 1.021 1.017 1.011 1.018 1.021 1.027 1.012
Demographic parity 0.673 0.588 0.700 0.589 0.649 0.588 0.693 0.588
FNR parity 0.833 0.900 0.789 0.813 0.865 0.900 0.806 0.818
Proportional parity 1.182 1.034 1.231 1.035 1.141 1.034 1.217 1.033
Equalized odds 1.187 1.040 1.234 1.031 1.145 1.040 1.232 1.030
Accuracy parity 1.161 1.051 1.198 1.037 1.125 1.051 1.205 1.036
FPR parity 1.004 0.886 1.125 0.775 0.975 0.886 0.956 0.727
NPV parity 1.004 1.054 0.986 1.071 0.982 1.054 1.060 1.074
Specificity parity 0.998 1.141 0.927 1.079 1.012 1.141 1.026 1.091
ROC AUC parity 1.023 1.098 1.027 1.059 1.023 1.098 1.046 1.063
MCC parity 1.482 1.496 1.505 1.128 1.394 1.496 2.273 1.136

Table 8.13 Fairness metrics on the FrenchMotor dataset, with the fairness R package, by
Varga and Kozodoi (2021), for women, the reference being men, with the threshold at .t = 16%
With sensitive Without sensitive
GLM Tree Boosting Bagging GLM Tree Boosting Bagging
.P[m(X) > t] 10.0% 9.2% 6.6% 14.6% 10.2% 9.2% 5.6% 14.5%
Predictive rate parity 1.011 1.016 1.009 1.022 1.014 1.016 1.010 1.020
Demographic parity 0.596 0.591 0.587 0.577 0.588 0.591 0.592 0.577
FNR parity 0.618 0.620 0.642 0.819 0.710 0.620 0.478 0.827
Proportional parity 1.048 1.039 1.032 1.014 1.034 1.039 1.040 1.014
Equalized odds 1.045 1.040 1.027 1.021 1.033 1.040 1.034 1.020
Accuracy parity 1.050 1.050 1.032 1.038 1.043 1.050 1.040 1.036
FPR parity 1.071 1.003 1.090 0.569 1.015 1.003 1.090 0.613
NPV parity 1.011 1.259 0.652 1.160 1.092 1.259 0.847 1.143
Specificity parity 0.748 0.987 0.467 1.256 0.944 0.987 0.467 1.234
ROC AUC parity 1.023 1.098 1.027 1.059 1.023 1.098 1.046 1.063
MCC parity 0.993 1.452 0.354 1.289 1.236 1.452 0.610 1.265
Chapter 9
Individual Fairness

Abstract Group fairness, as studied in Chap. 8, considered fairness from a global


perspective, in the entire population, by attempting to answer the question “are
individuals in the advantaged group and in the disadvantaged group treated dif-
ferently?” Or more formally, are the predictions and the protected variable globally
independent? Here, we focus on a specific individual, in the disadvantaged group,
and we talk about discrimination (in a broad sense, or inequity) by asking what
the model would have predicted if this same person had been in the favored group.
We return here to the classical approaches, emphasizing the different ways of constructing a counterfactual for this individual.

In the previous Chapter, we were interested in a notion of “group fairness” (with


subgroups constituted from the values of y, s, or .y ). It was probably first formalized
in Dwork et al. (2012). The notion of individual fairness emphasizes that similar
individuals (based on unprotected attributes only) should be treated similarly.
Edwards (1932) mentioned the following example, almost a century ago, “Two
actual cases may be cited as examples: Two neighbours, a man and woman, residing
door to door in the city of Toronto, insured their Ford sedans through different agents
with this insurer within a few days of each other, no change in rating policy having
taken place in the meantime. A scrutiny of the written applications showed the risks
as alike as two peas in a pod. The woman paid a thirteen per cent higher rate
than the man. The only explanation of the discrimination on the daily report was
‘Premium arranged by A.H.B.’. ‘A.H.B.’ were the initials of a general agent of the
insurer ,” cited in Barry (2020a).
Many definitions of individual fairness have been recently introduced in the
literature (for instance, Jung et al. (2019a,b) introduced “fairness elicitation,”
whereas Salimi et al. (2020) defined “justifiable fairness,” among many others).
In this chapter, we first discuss the popular idea that two similar individuals should
have the same prediction (related to some “Lipschitz property”), and then, discuss
ideas related to causal inference and counterfactual fairness, based on concepts
introduced in Sect. 7.4. As previously, consider a binary sensitive attribute s, taking


values in .{A, B}, A being the favored group (or supposed to be), and B the disfavored
one.

9.1 Similarity Between Individuals (and Lipschitz Property)

The natural idea, formalized in Luong et al. (2011), is that two “close” individuals
(in the sense of unprotected characteristics .x) must have the same prediction.
Definition 9.1 (Similarity Fairness (Luong et al. 2011; Dwork et al. 2012)) Consider two metrics, one on $\mathcal{Y} \times \mathcal{Y}$ (or on $[0,1]$, and not $\{0,1\}$, for a classifier), denoted $D_y$, and one on $\mathcal{X}$, denoted $D_x$. We have similarity fairness on a database of size $n$ if the following property (called the Lipschitz property) holds,
$$D_y\big(m(x_i, s_i), m(x_j, s_j)\big) \le L \cdot D_x(x_i, x_j), \quad \forall i, j = 1, \dots, n,$$
for some $L < \infty$.


Duivesteijn and Feelders (2008) defined a quite similar concept, called “mono-
tonic classification.” From a practical perspective, it is difficult to determine which
metric to use to measure the similarity of two individuals (i.e., between .x i and .x j ),
as explained by Kim et al. (2018). As Dwork et al. (2012) noticed, “our approach is
centered around the notion of a task-specific similarity metric describing the extent
to which pairs of individuals should be regarded as similar for the classification task
at hand. The similarity metric expresses ground truth. When ground truth is unavail-
able, the metric may reflect the “best” available approximation as agreed upon by
society. Following established tradition—Rawls (2001)—the metric is assumed to
be public and open to discussion and continual refinement. Indeed, we envision
that, typically, the distance metric would be externally imposed, for example, by a
regulatory body or externally proposed by a civil rights organization.” The most
usual ones for numeric variables are derived from the Mahalanobis distance, to
take into account the different scales between the variables (a normalized Euclidean
distance), or some normalized .𝓁1 distance. For categorical variables, some similarity
indices can be used (as in Jaccard (1901), Dice (1945) or Sorensen (1948)). Gower
(1971) suggested combining those two for mixed data. But one should keep in mind
that this choice is not neutral. Nevertheless, because of the separation property of
distances, whatever they are (.d(x1 , x2 ) = 0 if and only if .x1 = x2 ), a consequence
of the Lipschitz property is that a fair model m should satisfy .m(x, A) = m(x, B).
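An empirical version of this Lipschitz criterion is easy to compute: for all pairs of individuals, compare the distance between predictions with the distance between (unprotected) characteristics. The R sketch below, on simulated placeholder data (X and score are not from toydata2, and the normalized Euclidean distance is just one possible choice for $D_x$), returns the smallest admissible Lipschitz constant:

set.seed(1)
n <- 200
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))                  # unprotected characteristics
score <- plogis(X %*% c(1, -1) + rnorm(n, sd = .1))       # predictions m(x)
Dx <- as.matrix(dist(scale(X)))                           # normalized Euclidean distance on x
Dy <- as.matrix(dist(score))                              # absolute difference between predictions
ratio <- Dy / Dx                                          # pairwise D_y / D_x
max(ratio[upper.tri(ratio)], na.rm = TRUE)                # empirical Lipschitz constant L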
Rewriting the criterion of Definition 9.1 as
$$\frac{D_y\big(m(x_i, s_i), m(x_j, s_j)\big)}{D_x(x_i, x_j)} \le L, \quad \forall i, j = 1, \dots, n,$$
for some $L$, Petersen et al. (2021) defined "local fairness" as follows.



Definition 9.2 (Local Individual Fairness (Petersen et al. 2021)) Consider two metrics, one on $\mathcal{Y}$ ($[0,1]$, and not $\{0,1\}$, for a classifier), denoted $D_y$, and one on $\mathcal{X}$, denoted $D_x$; a model $m$ is locally individually fair if
$$E_{(X,S)}\left[\limsup_{x' : D_x(X, x') \to 0} \frac{D_y\big(m(X, S), m(x', S)\big)}{D_x(X, x')}\right] \le L < \infty.$$

Heidari and Krause (2018) and Gupta and Kamble (2021) considered some
dynamic extension of that rule, to define fair rules in reinforcement learning, for
example, inspired by Bolton et al. (2003) who found that customers’ impression of
fairness of prices critically relies on past prices that act as reference points.

9.2 Fairness with Causal Inference

As Loftus et al. (2018) wrote, “only by understanding and accurately modelling


the mechanisms that propagate unfairness through society can we make informed
decisions as to what should be done.” Galles and Pearl (1998) first introduced the
idea of counterfactual reasoning in the context of fairness, formalized later on by
Kusner et al. (2017) and Kilbertus et al. (2017), which states that the decision made
should remain fixed, even if, hypothetically, the protected attribute (such as the race,
or the gender) were to be changed.
As pointed out by Wu et al. (2019) and Carey and Wu (2022), there are
a few concepts of fairness that are related to causal inference. In Chap. 7, we
introduced the “ladder of causation” distinguishing “intervention” (in Sect. 7.3)
and “counterfactuals” (in Sect. 7.4). Those two perspectives yield two definitions
of individual fairness,
Definition 9.3 (Proxy-Based Fairness (Kilbertus et al. 2017)) A decision-making process $\hat{y}$ exhibits no proxy discrimination with respect to sensitive attribute $s$ if
$$E\big[\hat{Y} \mid do(S = A)\big] = E\big[\hat{Y} \mid do(S = B)\big].$$
Definition 9.4 (Fairness on Average Treatment Effect (Kusner et al. 2017)) We achieve fairness on the average treatment effect (counterfactual fairness on average) if
$$\mathrm{ATE} = E\big[Y^\star_{S \leftarrow A} - Y^\star_{S \leftarrow B}\big] = 0.$$

Based on the previous definition, a quite natural extension is the local one.
Definition 9.5 (Counterfactual Fairness (Kusner et al. 2017)) We achieve counterfactual fairness for an individual with characteristics $x$ if
$$\mathrm{CATE}(x) = E\big[Y^\star_{S \leftarrow A} - Y^\star_{S \leftarrow B} \mid X = x\big] = 0.$$

Observe that there are several variations in the literature around those definitions.
Kusner et al. (2017) formally defined counterfactual fairness “conditional on a
factual condition,” whereas Wu et al. (2019) considered “path-specific causal fair-
ness.” Zhang and Bareinboim (2018) distinguished “counterfactual direct effect,”
“counterfactual indirect effect” and “counterfactual spurious effect.” To explain
quickly the differences, following Avin et al. (2005), let us define the idea of “path-
specific causal effect” (studied in Zhang et al. (2016) and Chiappa (2019)).
Definition 9.6 (Path-Specific Effect (Avin et al. 2005)) Given a causal diagram, and a path $\pi$ from $s$ to $y$, the $\pi$-effect of a change of $s$ from B to A on $y$ is
$$\mathrm{PE}_\pi(B \to A) = E[Y \mid do_\pi(S = A)] - E[Y \mid S = B],$$
where "$do_\pi(S = A)$" denotes the intervention on $s$ transmitted only along path $\pi$.
Transmission along a path (in a causal graph) was introduced with Definition 7.14. Then, following Wu et al. (2019), define the "path-specific counterfactual effect."
Definition 9.7 (Path-Specific Counterfactual Effect (Wu et al. 2019)) Given a causal diagram, a factual condition (denoted $\mathcal{F}$), and a path $\pi$ from $s$ to $y$, the $\pi$-effect of a change of $s$ from B to A on $y$ is
$$\mathrm{PCE}_\pi(B \to A \mid \mathcal{F}) = E[Y \mid do_\pi(S = A), \mathcal{F}] - E[Y \mid S = B, \mathcal{F}].$$

The “factual condition” .F is a very general notation, that could be converted


later on into .{X = x} or .{Y = y}, for example. Based on those concepts, one
can define “path-specific causal fairness” simply by asking that .PCEπ (B → A|F)
should be null. Based on those definitions, “counterfactual fairness” is obtained
when .F = {(X, S) = (x, s)}, for any path .π , whereas “counterfactual indirect
fairness” is obtained when .F = {(Y, S) = (y, s)}, for any indirect path .π (the effect
of s on y should be transmitted through some .x’s) and “indirect causal fairness” is
only for any indirect path .π (and no factual condition). See Baer et al. (2019) for a
review of fairness concepts based on causal graphs.
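To illustrate Definition 9.4, the following R sketch simulates a structural causal model in the spirit of Fig. 9.1b (s acting on y both directly and through the mediator $x_1$); the structural equations below are arbitrary placeholders, not those used to build toydata2, and the average treatment effect is estimated by simulating the two interventions do(S = A) and do(S = B) with the same exogenous noise:

set.seed(1)
n  <- 1e5
u1 <- rnorm(n); uy <- runif(n)                            # exogenous noise terms
gen <- function(s_forced) {                               # simulate y under do(S = s_forced)
  x1 <- 1 + (s_forced == "A") + u1                        # mediator: x1 depends on s
  as.numeric(uy < plogis(-1 + x1 + 0.5 * (s_forced == "A")))   # y depends on s and x1
}
ATE <- mean(gen("A")) - mean(gen("B"))                    # E[Y | do(S=A)] - E[Y | do(S=B)]
ATE                                                       # zero would correspond to fairness on average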

9.3 Counterfactuals and Optimal Transport

So far, we have used the general terminology in causal inference to assess if a


model is discriminatory, or not. But we have not created, per se a counterfactual
of an individual who may feel discriminated. Counterfactuals were introduced in
Sect. 7.4. Hume (1748) probably introduced the idea of counterfactuals: “we may
define a cause to be an object followed by another, and where all the objects,
similar to the first, are followed by objects similar to the second. Or, in other
words, where, if the first object had not been, the second never had existed.”
Unfortunately, it took some time to introduce the concept of having a “potential
counterfactual,” starting with Lewis (1973). Formally, having observed .(x, s, y) we
hope to create a counterfactual by considering .(x, s ⋆ , y ⋆ ), where .s ⋆ is a category
other than s, when the sensitive attribute is categorical (if we had observed .s = A,
we wish to create a protected class observation .s = B, and vice versa), in order to
compare the outcome y with the potential outcome .y ⋆ of the counterfactual. And
briefly mentioned in Sect. 7.4, if s (hypothetically) changes, chances are that .x also
changes, as Gordaliza et al. (2019), Berk et al. (2021b), Torous et al. (2021), de Lara
et al. (2021), and Charpentier et al. (2023a) reminded us. Formally, we simply say
that the distributions of .x, conditionally to .s = A or .s = B, are not identical,
which will necessarily happen in the presence of a proxy of s among the explanatory
variables .x. This can be related to the concept of “pairwise fair representations,” as
defined in Lahoti et al. (2019).
In Sect. 4.2.1, we have mentioned the Kullback–Leibler divergence (in Definition 3.7), a symmetric extension with the Jensen–Shannon divergence (in Definition 3.10), and the Wasserstein distance (in Definition 3.11). Given two measures $p$ and $q$ on $\mathbb{R}^d$, with a norm $\|\cdot\|$, the Wasserstein distance is defined as $W_k(p, q)$ where
$$W_k(p, q)^k = \inf_{\pi \in \Pi(p, q)}\left\{\int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|^k \, d\pi(x, y)\right\},$$
where $\Pi(p, q)$ is the set of all couplings of $p$ and $q$. As mentioned in Definition 3.11, without any further specification, the Wasserstein distance is $W_2$, where the ground distance is the Euclidean one. In the univariate case, it is possible to simplify, and to see a connection with quantiles,
$$W_k(p, q)^k = \int_0^1 \big|F_p^{-1}(u) - F_q^{-1}(u)\big|^k \, du.$$


Fig. 9.1 (a) Causal graph used to generate variables in toydata2. (b) Simple causal graph that
can be used on toydata2, where the sensitive attribute may actually cause the outcome y, either
directly (upper arrow), or indirectly, through .x1 , a mediator variable. (c) Causal graph where the
sensitive attribute s may cause the outcome y, either directly or indirectly, via two possible paths
and two mediator variables, .x1 and .x2

9.3.1 Quantile-Based Transport

Before defining properly this concept of “quantile-based transport,” let us recall a


simple lemma,
Lemma 9.1 Let X denote a random variable with cumulative distribution function F_p, and h : R → R an increasing one-to-one mapping; then the distribution of Y = h(X) is F_q(y) = F_p(h^{-1}(y)). Conversely, given two distributions F_p and F_q, let h^{-1} = F_p^{-1} ∘ F_q, or h = F_q^{-1} ∘ F_p; then if X has distribution F_p, Y = h(X) has distribution F_q (and in that case, X and Y are comonotone).


Proof This is a consequence of the probability integral transform and the probabil-
ity density function after transformation. ⨆

In Fig. 9.1a, we have the causal graph used to generate the toydata2 dataset
(see Sect. 1.4 for a description). To illustrate the “quantile-based transport,” consider
a simplified version, with the causal model of Fig. 9.1b: s, a binary variable (taking
values in .{A, B}), is generated; then, .x1 is generated, conditionally on s, and finally;
y is generated, a function of s and .x1 .
Let m denote a model fitted on the data, so that .m(z) is an estimate of the
regression function .μ(z) = E[Y |Z = z], where .z is either only .x (fairness
through unawareness, without the sensitive attribute) or .(x, s) (function of the
sensitive attribute). Consider an individual with characteristics .x, in class .B, could
be considered as discriminated, if a counterfactual version would get a different
output. And the counterfactual of .(x, B) would be .(x s←A , A), as an (hypothetical)
intervention of s should have an impact on .x, from the model considered, Fig. 9.1b.

In this section, .x is a single variable, .x1 . Heuristically, counterfactuals are


obtained by requiring relative stability within the classes. It means that the
counterfactual of .(x1 , B) depends on the relative position of .x1 within group B.
Let .F1A and .F1B denote the empirical distribution functions of .X1 in both groups,
.F1s (x) = P[X1 ≤ x|S = s]. If u is the probability associated with .x1 in group B—

in the sense .u = F1B (x1 )—then the counterfactual should be associated with the
quantile (in group A) with the same probability u. Thus, the counterfactual of .(x1 , B)
would be (T(x1), A), where T = F_{1A}^{-1} ∘ F_{1B}. Following Berk et al. (2021b) and
Charpentier et al. (2023a) it is possible to define a “quantile based counterfactual”
(or “adaptation with quantile preservation,” as defined in Plečko et al. (2021)), as
follows.
Definition 9.8 (Quantile-Based Counterfactual) The counterfactual of .(x1 , B) is
(T(x1), A), where T(x1) = F_{1A}^{-1} ∘ F_{1B}(x1).
Definition 9.9 (Quantile-Based Counterfactual Discrimination) There is coun-
terfactual discrimination with model m for individual .(x1 , B) if
m(x1, B) ≠ m(T(x1), A), where T = F_{1A}^{-1} ∘ F_{1B}.

In real-life applications, T is estimated empirically, using T̂_n = F̂_{1A}^{-1} ∘ F̂_{1B}. And if X_1 has a Gaussian distribution, conditional on s, in the sense that

X_A = (X | S = A) ∼ N(μ_A, σ_A^2) and X_B = (X | S = B) ∼ N(μ_B, σ_B^2),

then

T(X_B) = μ_A + σ_A · (X_B − μ_B)/σ_B,

which has the same distribution as X_A.
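To fix ideas, here is a minimal R sketch of this quantile-based counterfactual on simulated data (the objects x1_A, x1_B and the individual x1 are illustrative, not taken from toydata2): the empirical map is the empirical quantile function of group A composed with the empirical distribution function of group B.

# Quantile-based counterfactual, T(x) = F_{1A}^{-1}(F_{1B}(x)), estimated empirically
set.seed(123)
x1_A <- rnorm(1000, mean = -1, sd = 1)    # feature x1 in group A (illustrative)
x1_B <- rnorm(1000, mean =  1, sd = 2)    # feature x1 in group B (illustrative)

F1B_hat <- ecdf(x1_B)                     # empirical c.d.f. in group B
T_hat <- function(x) {
  # empirical quantile in group A, at the probability level of x within group B
  quantile(x1_A, probs = F1B_hat(x), names = FALSE)
}

x1 <- 1.5                                 # an individual observed in group B
T_hat(x1)                                 # its counterfactual value in group A
# with Gaussian features, this is close to mu_A + sd_A * (x1 - mu_B) / sd_B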

9.3.2 Optimal Transport (Discrete Setting)

The empirical version of the problem described in the previous section could
be expressed as follows: consider two samples with identical size, denoted
{x_1^A, · · · , x_n^A} and {x_1^B, · · · , x_n^B}. For each individual x_i^B, the counterfactual is an
individual in the other group, x_j^A, with two constraints: (1) it should be a one-to-one
matching: each individual in group B should be associated with a single observation
in group A, and conversely; (2) individuals should be matched with a “close” one,
in the other group. Stuart (2010) used the name “1:1 nearest-neighbor matching,”
to describe that matching procedure (see also Dehejia and Wahba (1999) or Ho
et al. (2007)). The first condition imposes that a matching is simply a permutation
σ of {1, 2, · · · , n}, so that, for all i, the counterfactual of x_i^B will be x_{σ(i)}^A. Recall

that σ can be characterized by an n × n permutation matrix P (with P_{ij} = 1 if j = σ(i) and P_{ij} = 0 otherwise). n × n permutation matrices, with entries in {0, 1}, satisfy P 1_n = 1_n and P^⊤ 1_n = 1_n, as defined in Brualdi (2006). If C denotes the n × n matrix that quantifies the distance between individuals in the two groups, C_{i,j} = d(x_i^B, x_j^A)^2 = (x_i^B − x_j^A)^2, the optimal matching is the solution of

min_{P ∈ 𝒫} ⟨P, C⟩ = min_{P ∈ 𝒫} { Σ_{i,j} P_{i,j} C_{i,j} },    (9.1)

where 𝒫 is the set of n × n permutation matrices. The solution is very intuitive, and
where .P is the set of .n × n permutation matrices. The solution is very intuitive, and
based on the following rearrangement inequality,
Lemma 9.2 (Hardy–Littlewood–Pólya Inequality (1)) Given x_1 ≤ · · · ≤ x_n and y_1 ≤ · · · ≤ y_n, n pairs of ordered real numbers, for every permutation σ of {1, 2, · · · , n},

Σ_{i=1}^n x_i y_{n+1−i} ≤ Σ_{i=1}^n x_i y_{σ(i)} ≤ Σ_{i=1}^n x_i y_i.

Proof See Hardy et al. (1952). ⨆



That previous inequality can be extended, from this product version (terms are
products between .xi and some .yj ) to more general function .Ф(xi , yj ).
Definition 9.10 (Supermodular) Function .Ф : Rk × Rk → R is supermodular if
for any .z, z' ∈ Rk ,

Φ(z ∧ z') + Φ(z ∨ z') ≥ Φ(z) + Φ(z'),

where z ∧ z' and z ∨ z' denote respectively the componentwise minimum and maximum. If −Φ is supermodular, Φ is said to be submodular.
From Topkis' characterization theorem (see Topkis 1998), if Φ : R × R → R is twice differentiable, Φ is supermodular if and only if ∂²Φ/∂x∂y ≥ 0 (and, in higher dimension, ∂²Φ/∂z_i∂z_j ≥ 0 for all i ≠ j). And as mentioned in Galichon (2016), many popular functions in applied mathematics and economics satisfy this property, such as Cobb–Douglas functions, Φ(x, y) = x^a y^b when a, b ≥ 0 (on R_+ × R_+), or Φ(x, y) = γ(x − y) for some concave function γ : R → R, such as Φ(x, y) = −|x − y|^k with k ≥ 1 or Φ(x, y) = −(x − y − k)_+. Note that −Φ can be seen as a cost function.

Lemma 9.3 (Hardy–Littlewood–Pólya Inequality (2)) Given x_1 ≤ · · · ≤ x_n and y_1 ≤ · · · ≤ y_n, n pairs of ordered real numbers, and some supermodular function Φ : R × R → R, for every permutation σ of {1, 2, · · · , n},

Σ_{i=1}^n Φ(x_i, y_{n+1−i}) ≤ Σ_{i=1}^n Φ(x_i, y_{σ(i)}) ≤ Σ_{i=1}^n Φ(x_i, y_i),

whereas if Φ : R × R → R is submodular,

Σ_{i=1}^n Φ(x_i, y_i) ≤ Σ_{i=1}^n Φ(x_i, y_{σ(i)}) ≤ Σ_{i=1}^n Φ(x_i, y_{n+1−i}).

Proof See Hardy et al. (1952). ⨆



Another way of writing that inequality is that, given two sets of real numbers {x_1, · · · , x_n} and {y_1, · · · , y_n} with no ties, if Φ is submodular (such as Φ(x, y) = |x − y|^k with k ≥ 1),

Σ_{i=1}^n Φ(x_i, y_i) ≥ Σ_{i=1}^n Φ(x_i, y_{σ⋆(i)}),

where .σ ⋆ is the permutation such that the rank of .yσ ⋆ (i) (among the y) is equal to the
rank of .xi (among the x), as discussed in Chapter 2 of Santambrogio (2015). This
corresponds to a “monotone rearrangement.”
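Because the optimal permutation is this monotone rearrangement, it can be obtained, in the univariate two-sample setting with equal group sizes, from a simple rank matching; a minimal R sketch on simulated values (xB and xA are illustrative):

# Monotone rearrangement: match the i-th smallest value of group B
# with the i-th smallest value of group A
set.seed(1)
n  <- 6
xB <- runif(n)                        # individuals in group B
xA <- runif(n)                        # individuals in group A

sigma_star <- order(xA)[rank(xB)]     # index in A with the same rank as xB[i]
cbind(i = 1:n, j = sigma_star, xB = xB, xA_matched = xA[sigma_star])

# the monotone matching minimizes the total (squared) cost, as in Eq. (9.1)
sum((xB - xA[sigma_star])^2) <= sum((xB - sample(xA))^2)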
A numerical illustration is provided in Table 9.1. There are .n = 6 individuals
in class B (per row) and in class A (per column). Table on the left-hand side is the
distance matrix .C, between .xi and .xj , whereas the table on the right-hand side is
the optimal permutation .P , solution of Eq. (9.1). Here, individual .i = 3, in group B,
is matched with individual .j = 9, in group A. Thus, in this very specific example,
model m would be seen to be fair for individual .i = 3 if .m(x 3 , B) = m(x 9 , A).
Observed that fairness can be assessed here only for individuals who belong to the
training dataset (and not any fictional individual .(x, B)).
A more general setting could be considered, where the two groups no longer have
the same size n. Consider two groups, A and B. Given ν_B ∈ R_+^{n_B} and ν_A ∈ R_+^{n_A} such

Table 9.1 Optimal matching, .n = 6 individuals in class B (per row) and in class A (per column).
The table on the left-hand side is the distance matrix .C, whereas the table on the right-hand side is
the optimal permutation .P ⋆ , solution of Eq. (9.1)
7 8 9 10 11 12 7 8 9 10 11 12
1 0.41 0.55 0.22 0.64 0.04 0.25 1 .· .· .· .· 1 .· 1 .→ 11
2 0.28 0.24 0.73 0.22 0.64 0.80 2 .· 1 .· .· .· .· 2 .→ 8
3 0.28 0.47 0.32 0.52 0.16 0.37 3 .· .· 1 .· .· .· 3 .→ 9
4 0.28 0.62 0.81 0.25 0.64 0.85 4 1 .· .· .· .· .· 4 .→ 7
5 0.41 0.37 0.89 0.25 0.81 0.97 5 .· .· .· 1 .· .· 5 .→ 10
6 0.66 0.76 0.21 0.89 0.22 0.14 6 .· .· .· .· .· 1 6 .→ 12

Table 9.2 Optimal matching, .nB = 6 individuals in class B (per row) and .nA = 10 individuals in
class A (per column). The table on the left-hand side is the distance matrix .C, whereas the table on
the right-hand side is the optimal weight matrix .P ⋆ in .U6,10 , solution of Eq. (9.2)
7 8 9 10 11 12 13 14 15 16 7 8 9 10 11 12 13 14 15 16
1 0.41 0.55 0.22 0.64 0.04 0.25 0.24 0.77 0.74 0.55 1 .· .· .1/5 .· .3/5 .· .1/5 .· .· .·
2 0.28 0.24 0.73 0.22 0.64 0.80 0.76 0.76 0.12 0.10 2 .· .2/5 .· .· .· .· .· .· .· .3/5
3 0.28 0.47 0.32 0.52 0.16 0.37 0.27 0.68 0.63 0.45 3 .3/5 .· .· .· .· .· .2/5 .· .· .·
4 0.28 0.62 0.81 0.25 0.64 0.85 0.58 0.32 0.51 0.48 4 .· .· .· .2/5 .· .· .· .3/5 .· .·
5 0.41 0.37 0.89 0.25 0.81 0.97 0.91 0.81 0.05 0.25 5 .· .1/5 .· .1/5 .· .· .· .· .3/5 .·
6 0.66 0.76 0.21 0.89 0.22 0.14 0.33 0.96 0.99 0.79 6 .· .· .2/5 .· .· .3/5 .· .· .· .·

that ν_B^⊤ 1_{n_B} = ν_A^⊤ 1_{n_A} (identical sums), define

U(ν_B, ν_A) = { M ∈ R_+^{n_B × n_A} : M 1_{n_A} = ν_B and M^⊤ 1_{n_B} = ν_A },

where R_+^{n_B × n_A} is the set of n_B × n_A matrices with positive entries. This set of matrices is a convex polytope (see Brualdi 2006).
In our case, let us denote U(1_{n_B}, (n_B/n_A) 1_{n_A}) as U_{n_B,n_A}. Then the problem we want to solve is simply

P⋆ ∈ argmin_{P ∈ U_{n_B,n_A}} ⟨P, C⟩, or argmin_{P ∈ U_{n_B,n_A}} { Σ_{i=1}^{n_B} Σ_{j=1}^{n_A} P_{i,j} C_{i,j} }.    (9.2)

Here, .P ⋆ is no longer a permutation matrix, but .P ⋆ ∈ UnB ,nA , so that sums per
row are equal to one (with positive entries), and can be considered as weights (as
permutation matrices), but here, sums per column are equal, but not to one—they are
equal to the ratio .nB /nA . To illustrate, consider Table 9.2, with .nB = 6 individuals
in class B (per row) and .nA = 10 individuals in class A (per column). Consider
individual .i = 3, in group B. She is matched with a weighted sum of individuals in
group A, namely .j = 7 and .13, with respective weights .3/5 and .2/5. Thus, on this
very specific example, model m would be seen to be fair for individual .i = 3 if

m(x_3, B) = (3/5) m(x_7, A) + (2/5) m(x_13, A).
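The coupling of Eq. (9.2) is a small linear program; as an illustration, here is a sketch using the classical transportation solver of the lpSolve package (an assumption on tooling here; the transport package, used later in this chapter, offers similar functionality). Masses are scaled to integers so that the program stays exact:

# Optimal coupling between nB individuals in group B and nA individuals in group A
library(lpSolve)
set.seed(42)
nB <- 6; nA <- 10
xB <- runif(nB); xA <- runif(nA)
C  <- outer(xB, xA, function(u, v) (u - v)^2)     # cost matrix C[i, j]

# each row ships nA units, each column receives nB units (total mass nA * nB)
res <- lp.transport(C, direction = "min",
                    row.signs = rep("=", nB), row.rhs = rep(nA, nB),
                    col.signs = rep("=", nA), col.rhs = rep(nB, nA))
P <- res$solution / nA        # rescaled weights: rows sum to 1, columns to nB / nA
round(P, 2)
# row i of P gives the weights of the individuals of group A matched with i in group B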

9.3.3 Optimal Transport (General Setting)

A transport .T is a deterministic function that couples .x0 and .x1 (or more generally
vectors in the same space) in the sense that .x1 = T(x0 ). And .(X0 , T(X0 )) is a
coupling of two measures .P0 and .P1 (for a more general framework than the p and
q used previously) if X_0 ∼ P_0 and T(X_0) ∼ P_1. We denote T#P_0 = P_1 the “push-


forward” of .P0 by mapping .T.
Lemma 9.4 The distribution of X on .R is the push-forward measure of the uniform
measure on .[0, 1], with .T = FX−1 .
Proof This is a consequence of the probability integral transform. ⨆

There are many couplings, many transport functions, and we seek an “optimal”
one. An optimal transport .T⋆ (in Brenier’s sense, from Brenier (1991), see Villani
(2009) or Galichon (2016)) from .P0 toward .P1 is the solution of

T⋆ ∈ arginf_{T : T#P_0 = P_1} { ∫_{R^k} γ(x − T(x)) dP_0(x) },

that we can relate to the equation below, which could be seen as the continuous (and
more general) version of Eq. (9.2),

inf_{ν ∈ Π(P_0,P_1)} ∫_{R^k × R^k} Φ(x_0, x_1) ν(dx_0, dx_1),

for some function Φ : R^k × R^k → R_+ such that Φ(x_0, x_1) = γ(x_0 − x_1) (here Φ plays the role of the cost C in Eq. (9.2), and ν the role of the coupling P).


Definition 9.11 (Monge–Kantorovich Problem) Given a submodular function Φ : R^k × R^k → R, and two measures P_0 and P_1 on R^k, the coupling ν⋆ ∈ Π(P_0, P_1) is optimal if

ν⋆ ∈ arginf_{ν ∈ Π(P_0,P_1)} E[Φ(X_0, X_1)], where (X_0, X_1) ∼ ν.

An optimal transport is a mapping T⋆ : R^k → R^k such that if X_0 ∼ P_0, then T⋆(X_0) ∼ P_1, and therefore (X_0, T⋆(X_0)) ∼ ν⋆.
Formally, in general settings, such a deterministic correspondence via .T between
probability distributions may not exist, in particular if .P0 and .P1 are not Lebesgue
absolutely continuous. Solving a problem with a function .Ф is called Kantorovich
relaxation of Monge’s formulation of optimal transport (from Kantorovich and
Rubinshtein 1958).
In dimension 1, as discussed previously, there is a simple solution to this problem,
under mild conditions. Let .Fp and .Fq denote the cumulative distribution functions
associated with measures p and q,
Proposition 9.1 Assume that .Ф is strictly submodular, and that p has no mass
point. Then the Monge–Kantorovich problem has a unique optimal assignment, and
this assignment is characterized by .X1 = T⋆ (X0 ), where .T⋆ is given by .T⋆ =
Fq−1 ◦ Fp .
Proof See Villani (2003) or Galichon (2016). ⨆


Observe that the optimal assignment is a comonotone solution (.T⋆ is an


increasing mapping), which could be seen as an extension of Hardy–Littlewood–
Pólya inequality (Lemma 9.3). In a higher dimension, .T⋆ is an “increasing” mapping
in the (unique) sense that it is the gradient of a convex function.
This idea of transport in the context of quantifying fairness was originally mentioned in Dwork et al. (2012), which suggested the use of “earthmover distances” together with the Lipschitz condition. As discussed in Sect. 3.3, the “earthmover distance” is simply the Wasserstein distance with index 1 (denoted W_1).

9.3.4 Optimal Transport Between Gaussian Distributions

After defining the optimal mapping in the univariate case using quantiles, we
mentioned that the optimal transport between two univariate Gaussian distributions
(.N(μ0 , σ02 ) and .N(μ1 , σ12 ) respectively) is

x_1 = T⋆_N(x_0) = μ_1 + (σ_1/σ_0) · (x_0 − μ_0),

and actually, a similar transformation can be obtained in a higher dimension.


Proposition 9.2 Suppose that P_0 and P_1 are two Gaussian measures (respectively N(μ_0, Σ_0) and N(μ_1, Σ_1)); then the optimal transport is

x_1 = T⋆_N(x_0) = μ_1 + A(x_0 − μ_0),

where A is a symmetric positive matrix that satisfies A Σ_0 A = Σ_1, which has a unique solution given by A = Σ_0^{-1/2} (Σ_0^{1/2} Σ_1 Σ_0^{1/2})^{1/2} Σ_0^{-1/2}.
Proof See Villani (2003) or Galichon (2016). ⨆

Here, .M 1/2 is the square root of the square (symmetric) positive matrix .M based
on the Schur decomposition (.M 1/2 is a positive symmetric matrix), as described
in Higham (2008). In R, such a function is obtained using sqrtm in the expm
package.
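A minimal sketch of that Gaussian mapping (simulated parameters; only expm::sqrtm is assumed, as mentioned above):

# Optimal transport map between two multivariate Gaussian distributions
library(expm)                               # provides sqrtm()

mu0 <- c(0, 0);  Sigma0 <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
mu1 <- c(2, -1); Sigma1 <- matrix(c(2, -0.3, -0.3, 0.5), 2, 2)

S0h  <- sqrtm(Sigma0)                       # Sigma_0^{1/2}
S0hi <- solve(S0h)                          # Sigma_0^{-1/2}
A    <- S0hi %*% sqrtm(S0h %*% Sigma1 %*% S0h) %*% S0hi

T_N <- function(x0) as.vector(mu1 + A %*% (x0 - mu0))

T_N(c(1, 1))                                # image of one point
round(A %*% Sigma0 %*% A - Sigma1, 10)      # check: A Sigma_0 A = Sigma_1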

9.3.5 Transport and Causal Graphs

Plečko et al. (2021) considered a more general approach to solving problems such as the one in Fig. 9.2d: as previously, s causes x1 and x2, but x2 is also influenced by
x1 . Therefore, when transporting on x2 , we should consider not the quantile function
of x2 conditional on s, but a quantile regression function, of x2 conditional on s

and x1 . A quantile random forest, as in Meinshausen and Ridgeway (2006) can be


considered (or any machine-learning algorithm based on a quantile loss function
𝓁q,α ) or more generally, a quantile regression based on optimal transport, as in
Carlier et al. (2016).
Numerical applications on toydata2 and GermanCredit dataset, based on
causal graphs of Figs. 9.1 and 9.8 respectively, are considered in the next section
and at the end of this chapter.

9.4 Mutatis Mutandis Counterfactual Fairness

A model satisfies the “counterfactual fairness” property (Definition 9.5) if “had the protected attributes (e.g., race) of the individual been different, other things being equal, the decision would have remained the same.” This is a ceteris paribus
definition of counterfactual fairness. But a mutatis mutandis definition of the
conditional average treatment effect can be considered (as in Berk et al. (2021b)
and Charpentier et al. (2023a)).
Definition 9.12 (Mutatis Mutandis Counterfactual Fairness (Kusner et al.
2017)) If the prediction in the real world is the same as the prediction in the
counterfactual world, mutatis mutandis where the individual would have belonged
to a different demographic group, we have counterfactual fairness, i.e.,

E[Y⋆_{S←A} | X = T(x)] = E[Y⋆_{S←B} | X = x], ∀x,

where T is the optimal transport from the distribution of X conditional on S = B to the distribution of X conditional on S = A (Table 9.3).

Table 9.3 Definitions of individual fairness, see Carey and Wu (2022) for a complete list
Similarity (Lipschitz), Dwork et al. (2012), (9.1): D_y(ŷ_i, ŷ_j) ≤ D_x(x_i, x_j), ∀i, j
Proxy-Based Fairness, Kilbertus et al. (2017), (9.3): E[Y | do(S = A)] = E[Y | do(S = B)]
Fairness on Average Treatment Effect, Kusner et al. (2017), (9.4): E[Y⋆_{S←A}] = E[Y⋆_{S←B}]
Counterfactual Fairness, Kusner et al. (2017), (9.5): E[Y⋆_{S←A} | X = x] = E[Y⋆_{S←B} | X = x]
Path-Specific Effect, Avin et al. (2005), (9.6): E[Y | do_π(S = A)] = E[Y | do_π(S = B)]
Path-Specific Counterfactual Effect, Wu et al. (2019), (9.7): E[Y | do_π(S = A), F] = E[Y | do_π(S = B), F]
Mutatis Mutandis Counterfactual, Kusner et al. (2017), (9.12): E[Y⋆_{S←A} | X = T(x)] = E[Y⋆_{S←B} | X = x]

9.5 Application on the toydata2 Dataset

Throughout Chap. 8, we have studied various group-related notions of fairness,


on some simulated data, toydata2. Recall that empirical proportions of 1
are 24% and 65%, in groups A and B respectively. From a demographic parity
perspective, we may claim that there is discrimination, as the two proportions
are significantly different. Here, instead of that global perspective, we consider
specific individuals (the same as those considered in Sect. 4.1 to illustrate local
interpretability concepts).
At the top of Table 9.4, in the first block, we compare m̂(x, A) and m̂(x, B) for
different models, non-aware and aware, with a logistic regression, a logistic smooth
version (GAM), and a random forest (those models were described previously). At

Table 9.4 Creating counterfactuals for Betty, Brienne and Beatrix


Original data
s  x1  x2  x3  m̂glm(x)  m̂glm(x, s)  m̂gam(x)  m̂gam(x, s)  m̂rf(x)  m̂rf(x, s)
Betty B 0 2 0 18.22% 24.06% 13.23% 17.63% 17.4% 29.6%
Brienne B 1 5 1 67.19% 70.47% 66.18% 67.09% 63.60% 61.80%
Beatrix B 2 8 2 94.95% 94.73% 97.53% 97.58% 96.60% 98.40%
Alex A 0 2 0 18.22% 13.71% 13.23% 10.05% 17.40% 9.20%
Ahmad A 1 5 1 67.19% 54.48% 66.18% 50.49% 63.60% 64.40%
Anthony A 2 8 2 94.95% 90.02% 97.53% 90.51% 96.60% 68.20%
Counterfactual
adjusted data, using marginal quantiles
Betty A .−1.68 2.1 .−1.68 3.51% 3.58% 4.78% 4.85% 10.40% 10.80%
Brienne A .−0.98 5.1 .−0.96 19.39% 17.65% 16.64% 16.13% 29.00% 41.00%
Beatrix A .−0.27 7.9 .−0.26 59.83% 53.65% 51.89% 46.37% 53.60% 49.00%
adjusted data, using optimal transport, Fig. 9.1c
Betty A .−1.96 2.1 .−1.9 2.62% 2.82% 4.65% 4.81% 0.00% 0.00%
Brienne A 0.29 5 0.25 48.24% 38.92% 40.04% 32.14% 21.40% 12.20%
Beatrix A 0.31 7.8 0.21 72.83% 65.1% 67.5% 58.83% 20.80% 15%
adjusted data, using Gaussian transport, Fig. 9.1c
Betty A .−1.58 2.15 .−1.59 3.95% 3.96% 4.96% 4.99% 0.40% 0.40%
Brienne A .−0.98 4.96 .−0.99 18.47% 16.84% 15.84% 15.40% 19.80% 27.20%
Beatrix A .−0.38 7.79 .−0.38 55.71% 50.05% 47.86% 43.16% 51.80% 63.60%
adjusted data, with fairAdapt, Fig. 9.2e
Betty A .−1.65 2 .−1.32 3.63% 3.54% 4.72% 4.60% 14.60% 8.00%
Brienne A .−0.97 4.55 .−0.94 16.57% 14.96% 13.96% 13.51% 2.20% 5.20%
Beatrix A .−0.33 7.72 .−0.44 56.3% 50.71% 48.49% 43.74% 70.60% 74.80%
adjusted data, with fairAdapt, Fig. 9.2f
Betty A .−1.75 2.28 .−1.68 3.5% 3.6% 5.03% 5.13% 7.20% 7.00%
Brienne A .−0.96 5.3 .−0.96 20.9% 19.05% 17.91% 17.34% 5.80% 8.40%
Beatrix A .−0.24 8.12 .−0.34 62.31% 56.43% 54.8% 49.3% 45.60% 39.20%

the top, we have Betty, Brienne, and Beatrix, and for those three individuals, we
want to quantify some possible individual discrimination. At the bottom, we have
Alex, Ahmad, and Anthony, who are somehow “male versions” of Betty, Brienne,
and Beatrix, in the sense that they share the same legitimate characteristics .x. Even
if the distance between .x is null (as in the Lipschitz property) within each pair, a
“proper counterfactual” of Betty is not Alex (neither is Ahmad the counterfactual of
Brienne and neither is Anthony the counterfactual of Beatrix). We use the techniques
mentioned previously to construct counterfactuals to those individuals.
For the first block, we simply use marginal transformations, as in Definition 9.8.
For example, for Brienne, .x1 = 1, which is the median value among women (group
B), if Brienne had been a man (group A), and if we assume that she would have
kept the same relative position (the median), her corresponding value for .x1 would
have been .−1. Similarly for .x3 . But for .x2 , 5 is also the median in group B, and as
the median is almost the same in group A, .x2 remains the same. Thus, if Brienne,
characterized by .(x1 , x2 , x3 , B), had been a man, there would be discrimination with
model m if .m(T1 (x1 ), T2 (x2 ), T3 (x3 ), A) < m(x1 , x2 , x3 , B).
Instead of marginal transformations, it is possible to consider the two other
techniques mentioned previously, optimal transport (using transport in R) and
a multivariate Gaussian transport. Formally, we consider here causal graphs of
Fig. 9.1, in the sense that s has a causal impact on both .x1 and .x2 , and not on
.x3 . In Table 9.4, counterfactuals are created for Betty, Brienne, and Beatrix, with

predictions if those individuals had been in group A instead of B. In Fig. 9.3, we


have the optimal transport on a regular grid (x1, x2), where the arrow starts at x = (x1, x2) and ends at T(x), respectively with marginal quantile transport (as in

Definition 9.8) and with multivariate Gaussian transport (based on Proposition 9.2).
For example, Beatrix, corresponding to .(2, 8) is mapped with .(−0.27, 7.9) in the
first case and .(−0.38, 7.82) in the second case.
With the fairadapt function, in the fairAdapt package based on Plečko
et al. (2021), it is also possible to create some counterfactuals. To do so, we consider
the causal networks of Fig. 9.2e and f, the first one being that used to generate
data. Compared with the causal networks of Fig. 9.1, here, we take into account
the existing correlation between .x1 and .x3 , in the sense that an intervention on s
changes .x1 , and even if s has no direct impact on .x3 , it will go through the path
.π = {s, x1 , x3 }. Then, all variables may have an impact on y. To model properly that

complex graph, we use the fairadapt function to create counterfactuals (Fig. 9.3).
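As an indication of what such a call looks like, here is a sketch with the fairadapt package; the data frame df_train and the adjacency matrix are illustrative, and the argument names (prot.attr, adj.mat, train.data) are recalled from the package documentation and should be double-checked there:

# Adapted ("counterfactual") data with fairadapt, for a graph where
# s -> x1, s -> x2, x1 -> x3, and (x1, x2, x3) -> y  (illustrative graph)
library(fairadapt)

vars <- c("s", "x1", "x2", "x3", "y")
adj  <- matrix(0, 5, 5, dimnames = list(vars, vars))
adj["s",  c("x1", "x2")] <- 1
adj["x1", "x3"] <- 1
adj[c("x1", "x2", "x3"), "y"] <- 1

# df_train: hypothetical data.frame with columns s (factor), x1, x2, x3, y
fa <- fairadapt(y ~ ., prot.attr = "s", adj.mat = adj, train.data = df_train)
df_adapted <- adaptedData(fa)     # counterfactual version of the training data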


In Fig. 9.4, we have scatterplots .(m(x i ), m(T(x i ))) for individuals in groups A
and B. Light points are those from the original dataset, whereas the dark ones are the
six individuals described in Table 9.4. Points in group A are on the diagonal, as the
points are not adapted. Points in group B are here, for all models, below the diagonal,
in the sense that adapted predictions are below initial predictions. Therefore, this could support the idea that people in group B are discriminated against, compared with people in group A: if treated as someone in group A, people in group B would receive smaller predictions.
In Fig. 9.5, we can visualize on the left-hand side the distribution of .m(x i ) in
groups A and B. On the right-hand side, the distribution of .m(T(x i )) in group B.


Fig. 9.2 (d) Causal graph with no direct impact of s on y, but two mediators, and possibly, .x1 may
cause .x2 . (e) is similar to (c) with an additional indirect connection from .x1 to y, via mediator .x3 .
(f) is similar to (d) with an additional indirect connection from .x1 to y, via mediator .x3

Fig. 9.3 Optimal transport on a .(x1 , x2 ) grid, from B to A, using fairadapt on the left-hand
side, and, on the right-hand side using a parametric estimate of .T⋆ (based on Gaussian quantiles).
Only individuals in group B are transported here

On the left-hand side of Fig. 9.6, we have the scatterplot on .(x1 , x2 ), with points
in the A group mainly on the left-hand side and in the group B on the right-hand side.
On the right-hand side of Fig. 9.6, gray segments map individuals in group B and in
group A. The three points on the right-hand side, .• are Betty, Brienne, and Beatrix,
and on the left-hand side .• they denote the three matched individuals in group A.
In Fig. 9.7, we can visualize the optimal transport on a .(x1 , x2 ) grid, from B to A,
using a nonparametric estimate of .T⋆ (based on empirical quantiles) on the left-hand
side, and a Gaussian distribution on the right-hand side.

Fig. 9.4 Scatterplot .(m(x i ), m(T⋆ (x i ))), adapted prediction against the original prediction, for
individuals in groups A and B, on the toydata2 dataset. Transformation is from group B to
group A (therefore predictions in group A remain unchanged). Model m is, at the top, plain logistic
regression (GLM) and at the bottom, a random forest

9.6 Application to the GermanCredit Dataset

The GermanCredit dataset was considered in Sect. 8.8, where group fairness
metrics were computed. Recall that the proportion of empirical defaults in the
training dataset was 35% for males (group .M) whereas it was 28% for females
(group .F). And here, we try to quantify potential individual discrimination against
women. In Fig. 9.8, we have a simple causal graph on the germancredit
dataset, with .s → {duration, credit}, and .{duration, credit} → y,
as well as all other variables (that could cause y) (Fig. 9.9). The causal graph of
Fig. 9.10 is the one used in Watson et al. (2021). On that second causal graph,
.s → {savings, job} and then multiple causal relationships. Finally, on the

Fig. 9.5 The plain line is the density of .x1 for group A, whereas the plain area corresponds to
group B, on the toydata2 dataset. On the left-hand side, we have the distribution on the training
dataset, and on the right-hand side, the density of the adapted variables .x1 (with a transformation
from group B to group A)

Fig. 9.6 Optimal matching, of individuals in group B to individuals in group A, on the right, where
points .• are Betty, Brienne, and Beatrix, and .• their counterfactual version in group A

causal graph of Fig. 9.11, four causal links are added, to the previous one, .s →
{duration, credit_Amount} and .{duration, credit_Amount} → y.
Based on the causal graph of Fig. 9.8, when computing the counterfactual, most
variables that are children of the sensitive attribute s are adjusted (here .x1 is
Duration and .x2 is Credit_Amount). As in the previous section, we consider
three pairs of individuals; within each pair, individuals share the same .x, only
s changes (more details on the other features are also given). And again, the
counterfactual version of Betty (.F) is not Alex (.M). For instance, the amount of credit
that was initially 1262 for Betty should be different if Betty had been a man. If we
were using some quantile-based transport, as in Table 9.5, 1262 corresponds to the
25% quantile for men, and 17% quantile for women. So, if Betty were considered a
man, with a credit amount corresponding to the 17% quantile within men, it should

Fig. 9.7 Optimal transport on a .(x1 , x2 ) grid, from B to A, using a nonparametric estimate of .T⋆
(based on empirical quantiles) on the left-hand side, and a Gaussian distribution on the right-hand
side


Fig. 9.8 Simple causal graph on the GermanCredit dataset, where all variables could have
a direct impact on y (or default), except for the sensitive attribute (s), which has an indirect
impact through two possible paths, via either the duration of the credit (duration) or the amount
(credit_Amount)

be a smaller amount, about 1074. Again, after that transformation, when everyone
is considered as a man, distributions of credit and duration are identical.
Here, eight models are considered: a logistic regression (GLM), a boosting (with
Adaboost algorithm, as presented in Sect. 3.3.6), a classification tree (as in Fig. 3.25)
and a random forest (RF, with 500 trees), each time the unaware version (based on
x only, not s) and the aware version (including s). For all unaware models, individuals within each pair, such as Alex (M) and Betty (F), get the same prediction. But for most models, aware models yield different predictions when individuals have a different gender. Predictions of those
six individuals are given in Table 9.7. Those individual predictions can be visualized
in Fig. 9.9, when the causal graph is the one of Fig. 9.8. Lighter points correspond

Fig. 9.9 Scatterplot .(m(x i ), m(T⋆ (x i ))) for individuals in groups M and F, on the
GermanCredit dataset. Transformation is only from group F to group M, so that all individuals
are seen as men, on the y-axis. Model m is, from top to bottom, a plain logistic regression (GLM),
a boosting model (GBM) and a random forest (RF). We used fairadapt codes and the causal
graph of Fig. 9.8


Fig. 9.10 Causal graph on the germancredit dataset, from Watson et al. (2021)


Fig. 9.11 Causal graph on the GermanCredit dataset

Table 9.5 Counterfactuals based on the causal graph of Fig. 9.8, based on marginal quantile
transformations. We consider here only counterfactual versions of women, considered as the
“disfavored”
Alex Ahmad Anthony Betty Brienne Beatrix
s (gender) M M M F F F
.x1 Duration 12 18 30 12 18 30
.u = F1|s (x1 ) 36% 57% 86% 34% 50% 79%
T(x1) = F_{1|s=M}^{-1}(u) 12 18 30 12 18 24
.x2 Credit 1262 2319 4720 1262 2319 4720
.u = F2|s (x2 ) 25% 55% 82% 17% 45% 76%
T(x2) = F_{2|s=M}^{-1}(u) 1262 2319 4720 1074 1855 3854

to the entire dataset, and darker points are the six individuals. Predictions with the
random forest are not very robust here.
For the logistic regression and the boosting algorithm (see Table 9.7), counter-
factual predictions are rather close to the original ones. For example, if Brienne
had been a man, mutatis mutandis, the default probability would have been .23.95%
instead of .24.30% (as Ahmad), when considering the unaware logistic regression.

Table 9.6 Predictions, with .95% confidence intervals, based on the causal graph of Fig. 9.8, using
marginal quantile transformations. We consider here only counterfactual versions of women only,
considered as the “disfavored”
Betty Brienne Beatrix
Unaware logistic 39.7% [23.9% ; 57.9%] 24.3% [13.8% ; 39.1%] 30.9% [15.7% ; 51.7%]
.m(x)

Unaware logistic 39.5% [23.8% ; 57.8%] 24.0% [13.6% ; 38.7%] 24.9% [12.2% ; 44.1%]
.m(T(x))
Aware logistic 36.7% [21.1% ; 55.6%] 22.6% [12.5% ; 37.4%] 30.1% [15.2% ; 50.8%]
.m(x, s = F)
Aware logistic 42.1% [25.4% ; 60.8%] 26.8% [15.0% ; 43.2%] 35.1% [17.6% ; 57.8%]
.m(x, s = M)

Aware logistic 41.9% [25.2% ; 60.7%] 26.5% [14.8% ; 42.8%] 28.6% [13.7% ; 50.2%]
.m(T(x), s = M)

The impact is larger for Beatrix, who had an initial prediction of default of .30.88%,
as a woman (even if the logistic regression is gender blind), and the same model, on
the counterfactual version of Beatrix, would have predicted .24.91%. It could be seen
as sufficiently different to consider that Beatrix could legitimately feel discriminated
because of her gender. But as shown in Table 9.6, we can easily compute confidence
intervals for predictions on GLM (it would be more complicated for boosting
algorithms and random forests), and the difference is not statistically significant.
More complex models can be considered. Using function fairTwins, it is
possible to get a counterfactual for each individual in the dataset, as suggested in
Szepannek and Lübke (2021), even when we consider categorical covariates. The
first “realistic” causal graph we consider is the one used in Watson et al. (2021), on
that same dataset, that can be visualized in Fig. 9.10.
In the GermanCredit dataset, several variables in .x are categorical. For
ordered categorical variables (such as Savings_Bonds, taking values < 100
DM, 100 <= ...< 500 DM, etc.) it is possible to adapt optimal transport
techniques, as suggested in Plečko et al. (2021), assuming that ordered categories
are .{1, 2, · · · , m} and using a cost function .γ (i, j ) = |i − j |k . For non-ordered
categorical variables (such as Job, or Housing, the latter taking values such as
‘rent’, ‘own,’ or ‘for free’), a cost function γ(i, j) = 1(i ≠ j) is used.
And for continuous variables (Age, Credit_Amount or Duration), previous
techniques can be used. As Age does not lie on any path from s to y in the causal graphs, it
will not change. In Fig. 9.12 we can visualize the distributions of x conditional on
.s = M and .s = F respectively when x is Credit_Amount and Duration.
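A small sketch of those two cost matrices (the category labels are illustrative and do not necessarily match the exact coding of the dataset):

# Ordered categorical variable: gamma(i, j) = |i - j|^k
savings_levels <- c("< 100 DM", "100 <= ... < 500 DM", "500 <= ... < 1000 DM",
                    "no savings")
k <- 1
cost_ordered <- abs(outer(seq_along(savings_levels),
                          seq_along(savings_levels), "-"))^k
dimnames(cost_ordered) <- list(savings_levels, savings_levels)

# Non-ordered categorical variable: gamma(i, j) = 1(i != j)
housing_levels <- c("rent", "own", "for free")
cost_unordered <- 1 - diag(length(housing_levels))
dimnames(cost_unordered) <- list(housing_levels, housing_levels)

cost_ordered
cost_unordered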

Fig. 9.12 Distributions of x (Credit_Amount at the top and Duration at the bottom)
conditional on .s = M and .s = F

Predictions of the six individuals from Table 9.7 can be visualized in Fig. 9.13,
when the causal graph is that of Fig. 9.10. As in the previous section, there are
three pairs of individuals; within each pair, individuals share the same .x, only s
changes. Then, eight models are considered, a logistic regression (GLM), a boosting
(with Adaboost algorithm, as presented in Sect. 3.3.6), a classification tree (as in
Fig. 3.25), and a random forest (RF, with 500 trees), each time, with the unaware
version (based on .x only, not s) and the aware version (including s). For all unaware
models, individuals within each pair, such as Alex (M) and Betty (F), get the same prediction. But for most models, aware models yield different predictions when individuals have a different gender.

Table 9.7 Creating counterfactuals for Betty, Brienne, and Beatrix in the GermanCredit dataset
Firstname s Firstname s Job Savings Housing Purpose
Alex M Betty F highly qualified employee 100 DM rent radio/television
Ahmad M Brienne F skilled employee 100<=...<500 DM own furniture
Anthony M Beatrix F unskilled - resident no savings for free car (new)
Original data
s  Age  Duration  Credit  m̂glm(x)  m̂glm(x, s)  m̂gbm(x)  m̂gbm(x, s)  m̂cart(x)  m̂cart(x, s)  m̂rf(x)  m̂rf(x, s)
Betty F 26 12 1262 39.69% 36.66% 42.30% 43.26% 31.75% 31.75% 25.40% 23.20%
Brienne F 33 18 2320 24.30% 22.61% 23.88% 21.08% 21.31% 21.31% 43.60% 33.60%
Beatrix F 45 30 4720 30.88% 30.08% 28.49% 30.42% 15.38% 15.38% 23.40% 25.80%
Alex M 26 12 1262 39.69% 42.10% 42.30% 44.86% 31.75% 31.75% 25.40% 21.60%
Ahmad M 33 18 2320 24.30% 26.84% 23.88% 22.18% 21.31% 21.31% 43.60% 31.00%
Anthony M 45 30 4720 30.88% 35.08% 28.49% 31.82% 15.38% 15.38% 23.40% 31.60%
Counterfactual
adjusted data, with marginal quantile transformations, causal graph from Fig. 9.8
Betty M 26 12 1074 39.51% 41.90% 40.69% 44.86% 31.75% 31.75% 23.80% 25.60%
Brienne M 33 18 1855 23.95% 26.46% 23.88% 22.18% 21.31% 21.31% 43.00% 38.60%
Beatrix M 45 24 3854 24.91% 28.58% 20.55% 20.31% 21.31% 21.31% 17.60% 24.80%
adjusted data, with fairAdapt, causal graph from Fig. 9.8

Betty M 26 12 1110 42.73% 45.18% 44.24% 46.64% 31.75% 31.75% 22.2% 25.6%
Brienne M 33 18 1787 23.90% 26.40% 23.88% 22.18% 21.31% 21.31% 43.2% 38.2%
Beatrix M 45 24 3990 25.01% 28.70% 22.17% 23.60% 21.31% 21.31% 19.6% 26.4%
adjusted data, with fairAdapt, causal graph from Fig. 9.10
Betty M 26 18 1778 52.23% 54.03% 40.05% 46.81% 21.31% 21.31% 34.80% 31.80%
Brienne M 33 15 1864 32.25% 35.85% 31.60% 25.97% 21.31% 21.31% 23.00% 20.40%
Beatrix M 45 21 3599 39.70% 43.16% 28.36% 28.90% 21.31% 21.31% 10.60% 13.40%
adjusted data, with fairAdapt, causal graph from Fig. 9.11
Betty M 26 15 1882 49.05% 50.86% 35.32% 40.12% 21.31% 21.31% 27.8% 30.0%
Brienne M 33 18 1881 50.76% 53.49% 43.00% 38.77% 21.31% 21.31% 10.8% 13.8%
Beatrix M 45 24 3234 24.20% 26.23% 14.63% 16.84% 21.31% 21.31% 22.4% 19.0%

Fig. 9.13 Scatterplot .(m(x i ), m(T(x i ))) for individuals in groups M and F, on the
GermanCredit dataset. Transformation is only from group F to group M, so that all individuals
are seen as men, on the y-axis. Model m is, from top to bottom, a plain logistic regression (GLM),
a boosting model (GBM). We used fairadapt codes and the causal graph of Fig. 9.10
Part IV
Mitigation

Technology is neither good nor bad; nor is it neutral, Kranzberg (1986)1


She didn’t tell the algorithm to try to equalize the false rejection rates between the two
groups, so it didn’t. In its standard form, machine learning won’t give you anything ‘for
free’ you didn’t explicitly ask for, and may in fact often give you the opposite of what you
wanted. (. . . ) As we discussed in the introduction, machine learning won’t give you things
like gender neutrality ‘for free’ that you didn’t explicitly ask for Kearns and Roth (2019)

1 Also called “Kranzberg’s First Law”. “By that I mean that technology’s interaction with the social

ecology is such that technical developments frequently have environmental, social, and human
consequences that go far beyond the immediate purposes of the technical devices and practices
themselves, and the same technology can have quite different results when introduced into different
contexts or under different circumstances,” Kranzberg (1995).
Chapter 10
Pre-processing

Abstract “Pre-processing” is about distorting the training sample to ensure that


the model we obtain is “fair,” with respect to some criteria (defined in the previous
chapters). The two standard techniques are either to modify the original dataset (and
to distort features to make them “fair,” or independent of the sensitive attribute), or
to use weights (as used in surveys to correct for biases). Not only do those techniques offer weak theoretical guarantees, they also raise legal issues.

As we have seen previously, given a dataset .Dn , which is a collection of observations


(x_i, s_i, y_i), it is possible to estimate a model m̂ (as discussed in Part II) and to
quantify the fairness of the model (as discussed in Part III) based on appropriate
metrics. Therefore, there are different ways of mitigating a possible discrimination, if any. The first one is to modify D_n (namely “pre-processing,” described in this chapter), the second is to modify the training algorithm (“in-processing,” possibly by adding a fairness constraint to the objective function, as described in Chap. 11), and the third is to distort the trained model m̂ (“post-processing,” as described in Chap. 12).
Pre-processing techniques can be divided into two categories. The first one
modifies the original data, as suggested in Calmon et al. (2017) and Feldman et al.
(2015), but it does not have many statistical guarantees (see Propositions 7.6 and 7.8,
where we have seen that even if we were able to ensure that X^⊥ ⊥⊥ S, we can still have Ŷ = m(X^⊥) ̸⊥⊥ S, and even Ŷ = m(X^⊥) ̸⊥ S). Barocas and Selbst (2016) and
Krasanakis et al. (2018) also question the legality of that approach where training
data are, somehow, falsified. And another approach is based on reweighing, where
instead of having observations with equal weights in the training sample, we adjust
weights in the training sample (in the training function to be more specific, so it
could actually be seen as an “in-processing” approach). Kamiran and Calders (2012)
suggest simply having two weights, depending on whether .si is either A or B, as well
as Jiang and Nachum (2020). Heuristically, the idea is to amplify the error from an
“underrepresented group” in the training sample, so that the optimization procedure
can equally update a model for a different group.


10.1 Removing Sensitive Attributes

The simplest “pre-processing” approach is to remove the sensitive


attribute s. As discussed in Chap. 8, this corresponds to the concept of “fairness
by unawareness.” That is what is required by the “gender directive” (C-236/09,
of March 2011), in Europe, where insurers are no longer allowed to use gender to
price insurance products.
As mentioned in Sect. 1.1.7, mitigating discrimination is usually seen as
paradoxical, because in order to avoid discrimination, we must create another
discrimination. The point of the “reverse discrimination objection,” as defined
in Goldman (1979), is that there is an absolute ethical constraint against unfair
discrimination. The motto of that approach could be the quote of John G. Roberts
of the US Supreme Court: “the way to stop discrimination on the basis of race is to
stop discriminating on the basis of race,” as mentioned in Sabbagh (2007). Under
that principle, this approach is actually the only one that should be used, a ban on
the use of any sensitive attribute.
In Italy, Porrini and Fusco (2020) used data from the National Institute for the
Supervision of Insurance (IVASS) to understand the effect of such a ban on the
market prices. They measure the influence on the premiums of the gender variable
(which was still in the databases, even if insurers were not allowed to use it to
discriminate) and other variables such as the age of the driver, the type of vehicle,
and the geolocation, for the period 2011–2014. The price paid (insurance premium)
by males and females was collected in that dataset. They observed that the legal
limitation in the use of a rating factor such as gender may have effects on the market,
with increases in premiums. More precisely, the effect of the gender discrimination
ban is that women are not directly discriminated by gender, because after the ban,
the premium was the same for males and females with identical other features, but
overall, “there is less gender equality because in the same conditions a woman pays
more than a man, given the effects of other risky variables”.

10.2 Orthogonalization

10.2.1 General Case

This approach is a classical one in the econometric literature, when linear models
are considered. Interestingly, it is possible to use it even if there are multiple
sensitive attributes. It is also the one discussed in Frees and Huang (2023). Write the n × k matrix S as a collection of k vectors in R^n, S = (s_1 · · · s_k), that
will correspond to k sensitive attributes. The orthogonal projection on variables
{s_1, · · · , s_k} is associated with the matrix Π_S = S(S^T S)^{-1} S^T, whereas the projection on the orthogonal of S is Π_{S^⊥} = I − Π_S (see Gram–Schmidt orthogonalization, and

Fig. 10.1 Orthogonal projection of a vector x on the line generated by s, Π_s x, and its orthogonal component, Π_{s⊥} x

Fig. 10.1). Let S̃ denote the collection of centered vectors (using matrix notations, S̃ = HS where H = I − (11^T)/n).
Write the n × p matrix X as a collection of p vectors in R^n, X = (x_1 · · · x_p). For any x_j, define

x_j^⊥ = Π_{S̃^⊥} x_j = x_j − S̃(S̃^T S̃)^{-1} S̃^T x_j.

One can easily prove that x_j^⊥ is then orthogonal to any s, as

Cov(s, x_j^⊥) = (1/n) s^T H x_j^⊥ = (1/n) s̃^T Π_{S̃^⊥} x_j = 0.

And similarly, the centered version of x_j^⊥ is then also orthogonal to any s. From an econometric perspective, x_j^⊥ can be seen as the residual of the regression of x_j against s, obtained from least-squares estimation,

x_j = S̃ β̂_j + x_j^⊥.

Quite naturally, instead of training a model on D_n = (x_i, s_i, y_i), we could use D_n^⊥ = (x_i^⊥, y_i), where, by construction, all variables x_j^⊥ are now orthogonal to any s. Unfortunately, as mentioned in Proposition 7.6, if X^⊥ ⊥⊥ S, we could still have Ŷ = m(X^⊥) ̸⊥⊥ S if we consider a nonlinear class of model for m.
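In practice this orthogonalization is a column-by-column least-squares residual, so a few lines of R are enough; a minimal sketch on simulated data (the design and names are placeholders):

# Orthogonalize the columns of X with respect to the sensitive attribute(s) in S
set.seed(1)
n <- 200
S <- cbind(s1 = rbinom(n, 1, 0.5))              # one binary sensitive attribute
X <- cbind(x1 = rnorm(n) + 2 * S[, 1],          # x1 is correlated with s1
           x2 = rnorm(n))

# residuals of the regression of each x_j on S (the intercept centers S)
X_perp <- residuals(lm(X ~ S))

round(cor(S[, 1], X), 3)         # correlations before orthogonalization
round(cor(S[, 1], X_perp), 3)    # correlations after: numerically zero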

10.2.2 Binary Sensitive Attribute

Asking for orthogonality, .Xj ⊥ S is related to a null correlation between the two
variables. But if S is binary, this can be simplified. Recall that if .x = (x1 , · · · , xn ),
then
x̄ = (1/n)(1^T x) and Var(x) = (1/n) x^T H x, where H = I − (1/n)(1 1^T),

H being an idempotent (projection) matrix (as H² = HH = H). The empirical covariance is defined, using matrix notations,

Cov(s, x) = (1/n) s^T H x.
Observe that Pearson’s linear correlation is

Cor(s, x) = (Hs/‖Hs‖) · (Hx/‖Hx‖) = s^T H x / (‖Hs‖ ‖Hx‖),

as .H is an idempotent (projection) matrix. If .s ∈ {0, 1}n , the covariance can be


written
Cov(s, x) = s̄_0 s̄_1 (Δx̄_1 − Δx̄_0),

where

s̄_j = n_j/n and Δx̄_j = (1/n_j) Σ_{i : s_i = j} (x_i − x̄) = x̄_j − x̄.

Observe that to ensure that our predictor is fair, we need to compute the
correlation between y and s. If y is also binary, .y ∈ {0, 1}n , the covariance between
s and y can be written

Cov(s, y) = n_{11}/n − (n_{1•}/n)(n_{•1}/n), where n_{jk} = Σ_{i=1}^n 1_{s_i = j} 1_{y_i = k}.
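These identities are easy to check numerically; a short sketch on simulated binary s and continuous x:

# Check Cov(s, x) = s0_bar * s1_bar * (mean(x | s = 1) - mean(x | s = 0))
set.seed(2)
n <- 500
s <- rbinom(n, 1, 0.4)
x <- rnorm(n, mean = 2 * s)

cov_sx   <- mean((s - mean(s)) * (x - mean(x)))   # empirical covariance (1/n version)
s0 <- mean(s == 0); s1 <- mean(s == 1)
cov_form <- s0 * s1 * (mean(x[s == 1]) - mean(x[s == 0]))

c(cov_sx, cov_form)                               # the two values coincide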

10.3 Weights

The propensity score was introduced in Rosenbaum and Rubin (1983), as the
probability of treatment assignment conditional on observed baseline covariates.
The propensity score exists in both randomized experiments and in observational
studies, the difference is that in randomized experiments, the true propensity score
is known (and is related to the design of the experiment) whereas in observational
studies, the propensity score is unknown, and should be estimated. If the logistic
regression is the classical technique used to estimate that conditional probability,
McCaffrey et al. (2004) suggested some boosted regression approach, whereas Lee
et al. (2010) suggested bagging techniques.
In Fig. 10.2, we can visualize ω ↦ Cor[m̂_ω(x), 1_B(s)], on two datasets, toydata2 (on the left) and GermanCredit (on the right), where m̂_ω is a logistic
regression with weights proportional to 1 in class A and .ω in class B. The large dot
is the plain logistic regression (with identical weights).
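A sketch of this reweighing idea, on simulated data (the weight values and variable names are illustrative; quasibinomial is used only to silence the non-integer-weight warning):

# Weighted logistic regression: weight 1 in class A, omega in class B
set.seed(3)
n <- 1000
s <- factor(sample(c("A", "B"), n, replace = TRUE))
x <- rnorm(n) + (s == "B")
y <- rbinom(n, 1, plogis(-1 + x + 0.8 * (s == "B")))
df <- data.frame(y = y, x = x, s = s)

fit_weighted <- function(omega) {
  w <- ifelse(df$s == "B", omega, 1)
  m <- glm(y ~ x + s, family = quasibinomial, data = df, weights = w)
  cor(fitted(m), as.numeric(df$s == "B"))   # correlation between scores and 1_B(s)
}

sapply(c(1, 2, 5, 10), fit_weighted)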

Fig. 10.2 .ω |→ Cor[- mω (x, s), 1B (s)], on two datasets, toydata2 (on the left-hand side)
and GermanCredit (on the right-hand side), where .m -ω is a logistic regression with weights
proportional to 1 in class A and .ω in class B

10.4 Application to toydata2

In the first block, at the top of Table 10.1, we have predictions for six individuals,
using logistic regression trained on toydata2, including the sensitive attribute. On
the left are the original values of the features .(x i , si ). The second part corresponds
to the transformation of the features, .x ⊥ i , so that each feature is orthogonal to the
sensitive attribute (.x ⊥j is then orthogonal to any .s). Observe that features have on
average the same values when conditioning with respect to the sensitive attribute.
Based on orthogonalized features .m -⊥ (x ⊥ ) is the fitted unaware logistic regression.
This model is fair with respect to demographic parity as the demographic parity
ratio, .E[-
m(Z)|S = B]/E[- m(Z)|S = A], is close to one. Observe further that the
empirical correlation between .m -(x i , si ) and .1B (si ) was initially .0.72, and after
orthogonalization, the empirical correlation between .m -⊥ (x ⊥ i ) and .1B (si ) is .0.01.
At the bottom (third block), we can also visualize statistics about equalized odds,
with ratios .E[- m(Z)|S = B, Y = y]/E[- m(Z)|S = A, Y = y] for .y = 0 (at the
top) and .y = 1 (at the bottom). Values should be close to one to have equalized
odds, which is not the case here. On the right-hand side, at the very last column, we
consider a model based on weights in the training sample .m -ω (x). For the choice of
the weight, Fig. 10.2 suggested using very large weights for individuals in class B.
This model is slightly fairer than the original plain logistic regression.
In Figs. 10.3 and 10.4, we can visualize the orthogonalization method, with, in
Fig. 10.3, the optimal transport plot, between distributions of .m -(x i , si ) (on the x-
-⊥ (x ⊥
axis) to .m i ) (on the y-axis), for individuals in group A on the left-hand side, and
in group B on the right-hand side. In Fig. 10.4, on the left, we can visualize densities
-⊥ (x ⊥
of .m i ) from individuals in group A and in group B (thin lines are densities
of scores from the original plain logistic regression .m -(x i , si )). On the right-hand
side, we have the scatterplot of points .(- m(x i , si = A), m-⊥ (x ⊥ i )) and .(- m(x i , si =
B), m ⊥ ⊥
- (x i )). The six individuals at the top of Table 10.1 are emphasized, with

Table 10.1 Predictions using logistic regressions on toydata2, with two “pre-processing” approaches, with orthogonalized features .m -⊥ (x ⊥ ) and using
weights .m-ω (x). The first six rows describes six individuals. The next three rows are a check for demographic parity, with averages conditional on s, whereas
the last six rows are a check for equalized odds, with averages conditional on s and y
x1 x2 x3 s y m
-(x, s) x1⊥ x2⊥ x3⊥ m
-⊥ (x ⊥ ) m
-ω (x)
Alex 0 2 0 A GOOD 0.138 0.957 −2.924 0.968 0.355 0.221
Betty 0 2 0 B BAD 0.274 −0.958 −3.128 −0.938 0.123 0.221
Ahmad 1 5 1 A BAD 0.546 1.957 0.076 1.968 0.709 0.761
Brienne 1 5 1 B GOOD 0.739 0.042 −0.128 0.062 0.383 0.761
Anthony 2 8 2 A GOOD 0.900 2.957 3.076 2.968 0.915 0.973
Beatrix 2 8 2 B BAD 0.955 1.042 2.872 1.062 0.733 0.973
Average |S = A −0.967 4.919 −0.976 0.223 0.223 0.223 −0.010 −0.006 0.402 0.288
Average |S = B 0.958 5.132 0.935 0.675 0.675 0.223 −0.010 −0.006 0.408 0.674
(Difference) +2 ∼0 +2 ×3.027 ×3.027 ∼0 ∼0 ∼0 ×1.015 × 2.340
Average |S = A, Y =0 −1.018 4.357 −0.993 0.000 0.184 0.000 −0.061 −0.567 0.362 0.239
Average |S = B, Y =0 0.194 3.569 0.437 0.000 0.463 0.000 −0.061 −0.567 0.228 0.435
(Difference) ×2.516 ∼0 ∼0 ∼0 ×0.630 ×1.820
Average |S = A, Y =1 −0.788 6.873 −0.917 1.000 0.362 1.000 0.169 1.949 0.540 0.458
Average |S = B, Y =1 1.325 5.884 1.175 1.000 0.777 1.000 0.169 1.949 0.494 0.789
(Difference) ×2.146 ∼0 ∼0 ∼0 ×0.915 ×1.723

Fig. 10.3 Optimal transport between distributions of .m -⊥ (x ⊥


-(x i , si ) (x-axis) to .m i ) (y-axis), for
individuals in group A on the left-hand side, and in group B on the right-hand side

Fig. 10.4 On the left-hand side, densities of .m -⊥ (x ⊥i ) from individuals in group A and in B
(thin lines are densities of scores from the plain logistic regression .m -(x i , si )). On the right-hand
side, scatterplot of points .(- -⊥ (x ⊥
m(x i , si = A), m i )) and .(-
m (x i , si = B), m -⊥ (x ⊥
i )), from the
toydata2 dataset

vertical segments showing the (individual) difference between the initial model and
the fair one.
In Figs. 10.5 and 10.6, we can visualize similar plots for the weight method,
with, in Fig. 10.5, the optimal transport plot between distributions of .m -(x i , si ) (on
the x-axis) to .m-ω (x i )’s (on the y-axis), for individuals in group A on the left-hand
side, and in group B on the right-hand side. In Fig. 10.6, on the left-hand side, we can
visualize densities of .m -ω (x i ) from individuals in group A and in group B (again, thin
lines are the original densities of scores from the plain logistic regression .m-(x i , si )).
On the right-hand side, we have the scatterplot of points .(- -⊥ (x ⊥
m(x i , si = A), m i ))
m(x i , si = B), m
and .(- ⊥ ⊥
- (x i )). We emphasize again the six individuals at the top of
Table 10.1.

Fig. 10.5 Optimal transport between distributions of .m -(x i , si ) (x-axis) to .m


-ω (x i ) (y-axis), for
individuals in group A on the left-hand side, and in group B on the right-hand side

Fig. 10.6 On the left-hand side, densities of .m -ω (x i ) from individuals in group A and in B (thin
lines are densities of scores from the plain logistic regression .m -(x i , si )). On the right-hand side,
m(x i , si = A), m
scatterplot of points .(- -ω (x i )) and .(-
m(x i , si = B), m
-ω (x i )), from the toydata2
dataset

In Fig. 10.7, we can visualize the optimal transport between distributions of m̂^⊥(x_i^⊥) for individuals in group A (x-axis) to individuals in group B (y-axis) on the left-hand side, and m̂_ω(x_i) on the right-hand side. If the “transport line” is on the first diagonal, the Wasserstein distance between the two conditional distributions is close to 0, meaning that demographic parity is satisfied.

Fig. 10.7 Optimal transport between distributions of .m -⊥ (x ⊥i ) for individuals in group A (x-axis)
-ω (x i ) on the right-hand side
to individuals in group B (y-axis) on the left-hand side, and .m

10.5 Application to the GermanCredit Dataset

The same analysis can be performed on the GermanCredit dataset, with plain
logistic regression here too, but other techniques described in Chap. 3 could be
considered. For the orthogonalization, it is performed on the .X design matrix,
that contains indicators for all factors (but the reference) for categorical variable.
Observe that the empirical correlation between .m -(x i , si ) and .1B (si ) was initially
.−0.195. After orthogonalization, the empirical correlation between .m -⊥ (x ⊥i ) and
.1B (si ) is now .0.009. In Table 10.2, we have averages of score predictions using the

plain logistic regressions on GermanCredit on top. As averages of .m -(x i , si = A)


and .m-(x i , si = B) are 35.2% and 27.7% respectively, the demographic parity
ratio is .78.7%, which is close to 80%, used in the “four-fifths rule” for disparate
impact (corresponding to a .80% threshold in Definition 8.26). The second row
is the first “pre-processing” approach, with orthogonalized features .m -⊥ (x ⊥ ). The
demographic parity ratio is now 1.010, which means that model m̂^⊥ could be considered to be fair with respect to this criterion. It could even be considered
to be fair with respect to the equalized odds criteria, as averages .m -(x i ) between
individuals such that .si = A and .si = B, in the group .yi = 0 (or GOOD risk), have
a ratio of .1.045, whereas it is .1.054 in the group .yi = 1 (or BAD risk). The third
row is obtained when weights are considered .m -ω (x). From Fig. 10.2, a large weight
for individuals in class B is considered here too (so that the correlation between the
scores .m-ω (x i ) and .1B (si ) gets closer to 0). We can observe that this model can also
be considered fair, because averages .m -(x i ) between individuals such that .si = A
and .si = B could be considered similar.
In Figs. 10.8 and 10.9, we can visualize the orthogonalization method, with, in
Fig. 10.8, the optimal transport plot, between distributions of .m -(x i , si ) (on the x-
axis) to .m-⊥ (x ⊥i ) (on the y-axis), for individuals in group A on the left-hand side,

Table 10.2 Averages of score predictions using the plain logistic regressions on
GermanCredit at the top, with two “pre-processing” approaches, with orthogonalized
-⊥ (x ⊥ ) and using weights .m
features .m -ω (x), at the bottom. The second block is a check for
demographic parity, with averages conditional on s, whereas the second block is a check for
equalized odds, with averages conditional on s and y
Demographic Equalized Equalized
parity odds (.y = 0) odds (.y = 1)
A B (Ratio) A B (Ratio) A B (Ratio)
-(x, s)
.m 0.352 0.277 .×0.787 0.299 0.237 .×0.792 0.448 0.381 .×0.850
-
.m
⊥ (x ⊥ ) 0.298 0.301 .×1.010 0.249 0.260 .×1.045 0.387 0.408 .×1.054

-ω (x)
.m 0.289 0.277 .×0.958 0.249 0.235 .×0.946 0.363 0.386 .×1.065

Fig. 10.8 Optimal transport between distributions of .m -⊥ (x ⊥


-(x i , si ) (x-axis) to .m i ) (y-axis), for
individuals in group A on the left-hand side, and in group B on the right-hand side

-⊥ (x ⊥
Fig. 10.9 On the left-hand side, densities of .m i ) from individuals in group A and in B (thin
-(x i , si )). On the right-hand side,
lines are densities of scores from the plain logistic regression .m
m(x i , si = A), m
the scatterplot of points .(- -⊥ (x ⊥ m(x i , si = B), m
i )) and .(- -⊥ (x ⊥i )), from the
germancredit dataset

Fig. 10.10  Optimal transport between the distributions of $\widehat{m}(x_i,s_i)$ (x-axis) and $\widehat{m}_{\omega}(x_i)$ (y-axis), for individuals in group A on the left-hand side, and in group B on the right-hand side

Fig. 10.11  On the left-hand side, densities of $\widehat{m}_{\omega}(x_i)$ for individuals in group A and in group B (thin lines are densities of scores from the plain logistic regression $\widehat{m}(x_i,s_i)$). On the right-hand side, the scatterplot of points $(\widehat{m}(x_i,s_i=A),\widehat{m}_{\omega}(x_i))$ and $(\widehat{m}(x_i,s_i=B),\widehat{m}_{\omega}(x_i))$, from the GermanCredit dataset

and in group B on the right-hand side. In Fig. 10.9, on the left-hand side, we can
visualize densities of $\widehat{m}^{\perp}(x_i^{\perp})$ for individuals in group A and in group B (thin
lines are densities of scores from the original plain logistic regression $\widehat{m}(x_i,s_i)$).
On the right-hand side, we have the scatterplot of points $(\widehat{m}(x_i,s_i=A),\widehat{m}^{\perp}(x_i^{\perp}))$
and $(\widehat{m}(x_i,s_i=B),\widehat{m}^{\perp}(x_i^{\perp}))$. The six individuals mentioned (in Table 9.7) are
again emphasized.
In Figs. 10.10 and 10.11, we can visualize similar plots for the weight method,
with, in Fig. 10.10, the optimal transport plot, between the distribution of $\widehat{m}(x_i,s_i)$
(on the x-axis) and that of $\widehat{m}_{\omega}(x_i)$ (on the y-axis), for individuals in group A on the left-hand
side, and in group B on the right-hand side. In Fig. 10.11, on the left-hand side,
we can visualize densities of $\widehat{m}_{\omega}(x_i)$ for individuals in group A and in group B
(again, thin lines are the original densities of scores from the plain logistic regression

Fig. 10.12  Optimal transport between the distributions of $\widehat{m}^{\perp}(x_i^{\perp})$ for individuals in group A (x-axis) and individuals in group B (y-axis) on the left-hand side, and of $\widehat{m}_{\omega}(x_i)$ on the right-hand side, on the GermanCredit dataset

$\widehat{m}(x_i,s_i)$). On the right-hand side, we have the scatterplot of points $(\widehat{m}(x_i,s_i=A),\widehat{m}_{\omega}(x_i))$ and $(\widehat{m}(x_i,s_i=B),\widehat{m}_{\omega}(x_i))$.
In Fig. 10.12, we can visualize the optimal transport between the distributions of
$\widehat{m}^{\perp}(x_i^{\perp})$ for individuals in group A (x-axis) and in group B (y-axis) on
the left-hand side, and of $\widehat{m}_{\omega}(x_i)$ on the right-hand side. If the "transport line" lies on the first
diagonal, the Wasserstein distance between the two conditional distributions is close
to 0, meaning that demographic parity is satisfied.
Chapter 11
In-processing

Abstract Classically, to estimate a model, we look for a model (in a pre-defined


class) that minimizes a prediction error, or that maximizes the accuracy. If the
model is required to satisfy constraints, a natural idea is to add a penalty term in
the objective function. The idea of “in-processing” is to get a trade-off between
accuracy and fairness. As previously, we apply that approach to several datasets.

Classically, we have seen (see Definition 3.3) that models were obtained by
minimizing the empirical risk, the sample-based version of $R(m)$,

$$\widehat{m} \in \underset{m\in\mathcal{M}}{\operatorname{argmin}}\big\{R(m)\big\},\quad\text{where } R(m)=\mathbb{E}\big[\ell(Y,m(X,S))\big],$$

for some set of models, $\mathcal{M}$. Quite naturally, we could consider a constrained
optimization problem

$$\widehat{m} \in \underset{m\in\mathcal{M}}{\operatorname{argmin}}\big\{R(m)\big\}\quad\text{s.t. } m \text{ fair},$$

for some fairness criterion. Using standard results in optimization, one could
consider a penalized version of that problem

$$\widehat{m} \in \underset{m\in\mathcal{M}}{\operatorname{argmin}}\big\{R(m)+\lambda\,\mathcal{R}(m)\big\},$$

where $\mathcal{R}(m)$ denotes a positive regularizer, indicating the extent to which the
fairness criterion is violated, as suggested in Zafar et al. (2017), Zhang and
Bareinboim (2018), Agarwal et al. (2018), Kearns et al. (2018), Li and Fan (2020),
and Li and Liu (2022). As a (technical) drawback, adding a regularizer that is
nonconvex could increase the complexity of optimization, as mentioned in Roth
et al. (2017) and Cotter et al. (2019).


11.1 Adding a Group Discrimination Penalty

Recall that weak demographic parity (Sect. 8.2) is achieved if

$$\mathbb{E}\big[\widehat{m}(X,S)\mid S=s\big]=\mathbb{E}\big[\widehat{m}(X,S)\big],\quad\forall s,$$

whereas weak equal opportunity (Sect. 8.3) is achieved if

$$\mathbb{E}\big[\widehat{m}(X,S)\mid S=s,Y=1\big]=\mathbb{E}\big[\widehat{m}(X,S)\mid Y=1\big],\quad\forall s.$$

And because strong demographic parity is achieved if

$$\mathbb{P}\big[\widehat{m}(X,S)\in A\mid S=s\big]=\mathbb{P}\big[\widehat{m}(X,S)\in A\big],\quad\forall s,\ \forall A,$$

or, with synthetic notations, $\mathbb{P}_A[A]=\mathbb{P}_B[A]=\mathbb{P}[A]$, a classical metric for
strong demographic parity would be

$$w_A\,D_{KL}(\mathbb{P}_A\,\|\,\mathbb{P})+w_B\,D_{KL}(\mathbb{P}_B\,\|\,\mathbb{P}),$$

for some appropriate weights $w_A$ and $w_B$. Strong equal opportunity is achieved if

$$\mathbb{P}\big[\widehat{m}(X,S)\in A\mid S=s,Y=1\big]=\mathbb{P}\big[\widehat{m}(X,S)\in A\mid Y=1\big],\quad\forall s,\ \forall A,$$

and therefore, a metric for strong equal opportunity would be

$$w_A\,D_{KL}(\mathbb{P}_{A,Y=1}\,\|\,\mathbb{P}_{Y=1})+w_B\,D_{KL}(\mathbb{P}_{B,Y=1}\,\|\,\mathbb{P}_{Y=1}).$$

In the classical logistic regression, we solve

$$\widehat{\beta}=\underset{\beta}{\operatorname{argmax}}\big\{\log\mathcal{L}(\beta)\big\},\quad\text{where}\quad\log\mathcal{L}(\beta)=\sum_{i=1}^n y_i\log[m(x_i)]+(1-y_i)\log[1-m(x_i)],$$

and $m(x_i)=\dfrac{\exp[x_i^{\top}\beta]}{1+\exp[x_i^{\top}\beta]}$. Inspired by Zafar et al. (2019), fairness constraints
related to the disparate impact and disparate mistreatment criteria discussed previously,
which should be strictly satisfied, could be introduced. For example, weak demographic
parity is achieved if $\mathbb{E}[\widehat{m}(X)\mid S=A]=\mathbb{E}[\widehat{m}(X)\mid S=B]$, and therefore

$$\widehat{\beta}^{\,*}=\underset{\beta}{\operatorname{argmin}}\big\{-\log\mathcal{L}(\beta)\big\},\quad\text{s.t. } \mathbb{E}[\widehat{m}(X)\mid S=A]=\mathbb{E}[\widehat{m}(X)\mid S=B].$$

One could consider a more flexible version, where mistreatment is bounded,

$$\widehat{\beta}_{\epsilon}=\underset{\beta}{\operatorname{argmin}}\big\{-\log\mathcal{L}(\beta)\big\},\quad\text{s.t. } \big|\mathbb{E}[\widehat{m}(X)\mid S=A]-\mathbb{E}[\widehat{m}(X)\mid S=B]\big|\leq\epsilon.$$

Quite naturally, one can consider a penalized version of that constrained optimization
problem,

$$\widehat{\beta}_{\lambda}=\underset{\beta}{\operatorname{argmin}}\big\{-\log\mathcal{L}(\beta)+\lambda\,\big|\mathbb{E}[\widehat{m}(X)\mid S=A]-\mathbb{E}[\widehat{m}(X)\mid S=B]\big|\big\}.$$

Following Scutari et al. (2022), observe that one could also write

$$\widehat{\beta}_{\epsilon}=\underset{\beta}{\operatorname{argmin}}\big\{-\log\mathcal{L}(\beta)\big\},\quad\text{s.t. } \big|\operatorname{cov}[m_{\beta}(x),\mathbf{1}_B(s)]\big|\leq\epsilon,$$

for some $\epsilon\geq 0$, or the penalized version

$$\widehat{\beta}_{\lambda}=\underset{\beta}{\operatorname{argmin}}\big\{-\log\mathcal{L}(\beta)+\lambda\,\big|\operatorname{cov}[m_{\beta}(x),\mathbf{1}_B(s)]\big|\big\},$$

where $m_{\beta}(x_i)=\dfrac{\exp[x_i^{\top}\beta]}{1+\exp[x_i^{\top}\beta]}$.

The sample-based version of the penalty would be

$$\widehat{\beta}_{\lambda}=\underset{\beta}{\operatorname{argmin}}\Bigg\{-\log\mathcal{L}(\beta)+\lambda\,\bigg|\frac{1}{n_A}\sum_{i:s_i=A}m_{\beta}(x_i)-\frac{1}{n_B}\sum_{i:s_i=B}m_{\beta}(x_i)\bigg|\Bigg\},$$

or, because

$$\operatorname{cov}[m_{\beta}(x),\mathbf{1}_B(s)]=\mathbb{E}\big(m_{\beta}(x)\big[\mathbf{1}_B(s)-\mathbb{E}(\mathbf{1}_B(s))\big]\big)-\underbrace{\mathbb{E}\big(\mathbf{1}_B(s)-\mathbb{E}[\mathbf{1}_B(s)]\big)}_{=0}\,\mathbb{E}\big[m_{\beta}(x)\big],$$

the sample-based version of the penalty can also be written as

$$\widehat{\operatorname{cov}}[m_{\beta}(x),\mathbf{1}_B(s)]=\frac{1}{n}\sum_{i=1}^n\big(\mathbf{1}_B(s_i)-\overline{\mathbf{1}_B}\big)\cdot m_{\beta}(x_i),\quad\text{where } \overline{\mathbf{1}_B}=\frac{n_B}{n}.$$

Komiyama et al. (2018) actually initiated that idea: the approach regresses the fitted values
against the sensitive attributes and the response, and bounds the proportion
of the variance explained by the sensitive attributes over the total explained
variance in that model. In R, the functions frrm and nclm in the package fairml
perform such regressions, where in the former, the constraint is actually written as
a Ridge penalty (see Definition 3.15). Demographic parity is achieved with the

option "sp-komiyama" whereas equalized odds are obtained with the option
"eo-komiyama".
An alternative, suggested from Proposition 7.2, is to use the maximal correlation.
Recall that given two random variables U and V, the maximal correlation (also
coined "HGR," as it was introduced in Hirschfeld (1935), Gebelein (1941), and
Rényi (1959)) is

$$r^{\star}(U,V)=\operatorname{HGR}(U,V)=\max_{f\in\mathcal{F}_U,\,g\in\mathcal{G}_V}\mathbb{E}[f(U)g(V)],$$

where

$$\begin{cases}\mathcal{F}_U=\{f:\mathcal{U}\to\mathbb{R}:\mathbb{E}[f(U)]=0\text{ and }\mathbb{E}[f^2(U)]=1\}\\ \mathcal{G}_V=\{g:\mathcal{V}\to\mathbb{R}:\mathbb{E}[g(V)]=0\text{ and }\mathbb{E}[g^2(V)]=1\}.\end{cases}$$

Mary et al. (2019) estimated the maximal correlation using (Gaussian) kernel
density estimates, in the context of fairness constraints (instead of a plain linear
correlation as done previously).

11.2 Adding an Individual Discrimination Penalty

Observe that instead of a regularization based on a group fairness criterion, one
could consider some individual fairness criteria. With the notations of Sect. 11.1, the
penalized risk can be defined, for a binary sensitive attribute (if S is not equal to s,
it is equal to $\bar{s}$), as

$$R_{\lambda}(\widehat{m})=\mathbb{E}\big[\ell(Y,\widehat{m}(X,S))\big]+\lambda\sum_{s\in\mathcal{S}}\mathbb{P}[S=s]\cdot\mathbb{E}\big[r\big(X_s,s,X_{\bar{s}},\bar{s}\big)\big],$$

$\bar{s}$ being the other category, and where $r(x,s,x',s')$ is a penalty on the
difference between the outcomes for an individual and its counterfactual version.
Russell et al. (2017) considered

$$r(x,s,x',s')=\big|\widehat{m}(x,s)-\widehat{m}(x',s')\big|,$$

and more generally

$$r_{\varepsilon}(x,s,x',s')=\max\big\{0,\big|\widehat{m}(x,s)-\widehat{m}(x',s')\big|-\varepsilon\big\},$$

for $\varepsilon$-fairness, as introduced in Sect. 8.6, which corresponds to a convex relaxation
of the previous condition. For instance, de Lara (2023) considered

$$r(x,s,x',s')=\big(\widehat{m}(x,s)-\widehat{m}(x',s')\big)^2.$$
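A minimal sketch, in R, of such an individual penalty is given below, assuming that a counterfactual design matrix X_cf (one counterfactual row per individual, for instance obtained by optimal transport) is available; how it is constructed is not shown, and all names are illustrative. The quadratic penalty of de Lara (2023) is here added to the logistic deviance.

# Minimal sketch: logistic deviance plus a quadratic individual (counterfactual)
# penalty. X and X_cf are design matrices with matching rows, y is a 0/1 response.
individual_fair_logit <- function(X, X_cf, y, lambda) {
  obj <- function(beta) {
    m    <- 1 / (1 + exp(-as.vector(X %*% beta)))     # m(x_i, s_i)
    m_cf <- 1 / (1 + exp(-as.vector(X_cf %*% beta)))  # m(x_i', s_i')
    -sum(y * log(m) + (1 - y) * log(1 - m)) + lambda * mean((m - m_cf)^2)
  }
  optim(rep(0, ncol(X)), obj, method = "BFGS")$par
}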

11.3 Application on toydata2

11.3.1 Demographic Parity

In Figs. 11.1 and 11.2, we can visualize $\operatorname{AUC}(\widehat{m}_{\lambda})$ against the demographic parity
ratio $\mathbb{E}[\widehat{m}_{\beta_{\lambda}}(Z)\mid S=B]/\mathbb{E}[\widehat{m}_{\beta_{\lambda}}(Z)\mid S=A]$, in Fig. 11.1 when $Z=(X,S)$ and
in Fig. 11.2 when $Z=X$ (unaware model), where, with general notations, the
prediction is obtained from a logistic regression,

$$\widehat{m}_{\beta_{\lambda}}(z)=\frac{\exp[z^{\top}\widehat{\beta}_{\lambda}]}{1+\exp[z^{\top}\widehat{\beta}_{\lambda}]},$$

Fig. 11.1 On the left-hand side, accuracy–fairness trade-off plot (based on Table 11.1), with the
AUC of m βλ on the y-axis and the fairness ratio (demographic parity) on the x-axis. Top left
βλ (x i , si ) (with a logistic
corresponds to accurate and fair. On the right-hand side, evolution of m
regression) for three individuals in group A and three in B, on the toydata2 dataset

Fig. 11.2 On the left-hand side, accuracy–fairness (demographic parity) trade-off plot (based on
Table 11.1), with the AUC of m βλ on the y-axis and the fairness ratio on the x-axis. The thin line
is the one with the model taking into account s (from Fig. 11.1). On the right-hand side, evolution
of mβλ (x i ) (with a logistic regression) for three individuals in group A and three in B, on the
toydata2 dataset

Table 11.1  Penalized logistic regression, for different values of $\lambda$, including s on the left-hand side and excluding s on the right-hand side. At the top, values of $\widehat{\beta}_{\lambda}$ in the first block, and predictions $\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$ and $\widehat{m}_{\beta_{\lambda}}(x_i)$ for a series of individuals in the second block. At the bottom, the demographic parity ratio, $\mathbb{E}[\widehat{m}_{\beta_{\lambda}}(Z)\mid S=B]/\mathbb{E}[\widehat{m}_{\beta_{\lambda}}(Z)\mid S=A]$ (fairness is achieved when the ratio is 1), and the AUC (the higher the value, the higher the accuracy of the model)
                        m(x, s), aware                             m(x), unaware
← less fair more fair → ← less fair more fair →

β 0 (Intercept) −2.55 −2.29 −1.97 −1.51 −1.03 −2.14 −1.98 −1.78 −1.63

β 1 (x1 ) 0.88 0.88 0.85 0.77 0.62 1.01 0.84 0.57 0.26

β 2 (x2 ) 0.37 0.37 0.35 0.32 0.25 0.37 0.35 0.31 0.24

β 3 (x3 ) 0.02 0.02 0.02 0.02 0.03 0.15 0.02 −0.15 −0.29

β B (1B ) 0.82 0.44 −0.03 −0.70 −1.31 – – – –
Betty 0.27 0.25 0.22 0.17 0.14 0.20 0.22 0.24 0.24
Brienne 0.74 0.71 0.66 0.54 0.40 0.70 0.66 0.55 0.38
Beatrix 0.95 0.95 0.93 0.87 0.73 0.96 0.93 0.82 0.55
Alex 0.14 0.17 0.22 0.29 0.37 0.20 0.22 0.24 0.24
Ahmad 0.55 0.61 0.66 0.70 0.71 0.70 0.66 0.55 0.38
Anthony 0.90 0.92 0.93 0.93 0.91 0.96 0.93 0.82 0.55
E[m(x i , si )|S = A] 0.23 0.26 0.31 0.36 0.42 0.25 0.30 0.37 0.41
E[m(x i , si )|S = B] 0.67 0.65 0.61 0.53 0.42 0.64 0.61 0.54 0.41
(ratio) ×2.97 ×2.49 ×2.01 ×1.46 ×1.00 ×2.53 ×2.02 ×1.48 ×1.00
AUC 0.86 0.86 0.85 0.82 0.74 0.86 0.85 0.82 0.70

where $\widehat{\beta}_{\lambda}$ is a solution of a penalized maximum likelihood problem,

$$\widehat{\beta}_{\lambda}\in\underset{\beta}{\operatorname{argmin}}\big\{-\log\mathcal{L}(\beta)+\lambda\cdot\big|\operatorname{cov}[m_{\beta}(x),\mathbf{1}_B(s)]\big|\big\}.$$

In Table 11.1, we have the outcome of penalized logistic regressions, for different
values of $\lambda$, including s on the left-hand side and excluding s on the right-hand side.
At the top, values of $\widehat{\beta}_{\lambda}$ in the first block, and predictions $\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$ and $\widehat{m}_{\beta_{\lambda}}(x_i)$
for a series of individuals in the second block. At the bottom, the demographic parity
ratio, $\mathbb{E}[\widehat{m}_{\beta_{\lambda}}(Z)\mid S=B]/\mathbb{E}[\widehat{m}_{\beta_{\lambda}}(Z)\mid S=A]$ (fairness is achieved when the ratio is
1), and the AUC (the higher the value, the higher the accuracy of the model).
In Fig. 11.3, we have the optimal transport plot, between the distributions of
$\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$ for individuals in group A and in B, for different values of $\lambda$ (low
value on the left-hand side and high value on the right-hand side), associated with a
demographic parity penalty criterion.
In Fig. 11.4, we have, on the left-hand side, densities of $\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$ for
individuals in group A and in B (thin lines are densities of $\widehat{m}_{\beta}(x_i,s_i)$). On the
right-hand side, the scatterplot of points $(\widehat{m}_{\beta}(x_i,s_i=A),\widehat{m}_{\beta_{\lambda}}(x_i))$ and
$(\widehat{m}_{\beta}(x_i,s_i=B),\widehat{m}_{\beta_{\lambda}}(x_i))$, where $\widehat{m}_{\beta}$ is the plain logistic regression, and
$\widehat{m}_{\beta_{\lambda}}$ is the penalized logistic regression, from the toydata2 dataset, associated
with a demographic parity penalty criterion.



Fig. 11.3  Optimal transport between the distributions of $\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$ for individuals in group A and in B, for different values of $\lambda$ (low value on the left-hand side and high value on the right-hand side), associated with a demographic parity penalty criterion

In Fig. 11.5, the optimal transport plot, from the distribution of $\widehat{m}_{\beta}(x_i,s_i)$ to the
distribution of $\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$, for individuals in group A on the left-hand side, and in
group B on the right-hand side, for different values of $\lambda$, with a low value at the top
and a high value at the bottom (fair model, associated with a demographic parity
penalty criterion).

11.3.2 Equalized Odds and Class Balance

In Fig. 11.6, we can visualize $\operatorname{AUC}(\widehat{m}_{\lambda})$ against the demographic parity ratio
$\mathbb{E}[\widehat{m}_{\beta_{\lambda}}(Z)\mid S=B]/\mathbb{E}[\widehat{m}_{\beta_{\lambda}}(Z)\mid S=A]$; with general notations, the prediction is
obtained from a logistic regression,

$$\widehat{m}_{\beta_{\lambda}}(z)=\frac{\exp[z^{\top}\widehat{\beta}_{\lambda}]}{1+\exp[z^{\top}\widehat{\beta}_{\lambda}]},$$

where $\widehat{\beta}_{\lambda}$ is a solution of a penalized maximum likelihood problem,

$$\widehat{\beta}_{\lambda}\in\underset{\beta}{\operatorname{argmin}}\big\{-\log\mathcal{L}(\beta)+\lambda\cdot\big|\operatorname{cov}[m_{\beta}(x),\mathbf{1}_B(s)]\big|\big\}.$$

In Table 11.2, we can visualize the outcome of some penalized logistic regressions,
for different values of $\lambda$, including s on the left-hand side and excluding s
on the right-hand side. At the top, values of $\widehat{\beta}_{\lambda}$ in the first block, and predictions
$\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$ and $\widehat{m}_{\beta_{\lambda}}(x_i)$ for a series of individuals in the second block. At the
bottom, the class balance ratios, $\mathbb{E}[\widehat{m}_{\beta_{\lambda}}(Z)\mid S=B,Y=y]/\mathbb{E}[\widehat{m}_{\beta_{\lambda}}(Z)\mid S=A,Y=y]$ (fairness is achieved when the ratio is 1), and the AUC (the higher the value, the
higher the accuracy of the model).

Fig. 11.4  On the left-hand side, densities of $\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$ for individuals in group A and in B (thin lines are densities of $\widehat{m}_{\beta}(x_i,s_i)$). On the right-hand side, the scatterplot of points $(\widehat{m}_{\beta}(x_i,s_i=A),\widehat{m}_{\beta_{\lambda}}(x_i),s=A)$ and $(\widehat{m}_{\beta}(x_i,s_i=B),\widehat{m}_{\beta_{\lambda}}(x_i),s=B)$, where $\widehat{m}_{\beta}$ is the plain logistic regression, and $\widehat{m}_{\beta_{\lambda}}$ is the penalized logistic regression, from the toydata2 dataset, associated with a demographic parity penalty criterion

Fig. 11.5  Optimal transport from the distribution of $\widehat{m}_{\beta}(x_i,s_i)$ to the distribution of $\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$, for individuals in group A on the left-hand side, and in group B on the right-hand side, for different values of $\lambda$, with a low value at the top and a high value at the bottom (fair model, associated with a demographic parity penalty criterion)

Fig. 11.6 On the left-hand side, accuracy–fairness trade-off plot (based on Table 11.1), with the
AUC of .mβλ on the y-axis and the fairness ratio (class balance) on the x-axis. Top left corresponds
βλ (x i , si ) (with a logistic regression) for
to accurate and fair. On the right-hand side, evolution of .m
three individuals in group A and three in B, in the toydata2 dataset

Table 11.2  Penalized logistic regression, for different values of $\lambda$, including s on the left-hand side, and excluding s on the right-hand side. At the top, values of $\widehat{\beta}_{\lambda}$ in the first block, and predictions $\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$ and $\widehat{m}_{\beta_{\lambda}}(x_i)$ for a series of individuals in the second block. At the bottom, the class balance ratios, $\mathbb{E}[\widehat{m}_{\beta_{\lambda}}(Z)\mid S=B,Y=y]/\mathbb{E}[\widehat{m}_{\beta_{\lambda}}(Z)\mid S=A,Y=y]$ (fairness is achieved when the ratio is 1), and the AUC (the higher the value, the higher the accuracy of the model)
                        m(x, s), aware
← less fair more fair →

β 0 (Intercept) −2.55 −2.45 −2.34 −2.21 −2.27 −2.08 −1.44 -2.61

β 1 (x1 ) 0.89 0.90 0.92 0.94 0.88 0.82 0.52 0.39

β 2 (x2 ) 0.37 0.37 0.37 0.37 0.41 0.40 0.30 0.39

β 3 (x3 ) 0.02 0.03 0.03 0.03 0.01 0.00 0.04 −0.42

β B (1B ) 0.81 0.63 0.39 0.11 −0.03 −0.48 −0.60 0.40
Betty 0.27 0.25 0.23 0.20 0.19 0.15 0.19 0.20
Brienne 0.74 0.72 0.70 0.67 0.66 0.57 0.51 0.44
Beatrix 0.95 0.95 0.95 0.94 0.94 0.91 0.81 0.71
Alex 0.14 0.15 0.17 0.19 0.19 0.22 0.30 0.14
Ahmad 0.55 0.58 0.61 0.65 0.66 0.68 0.65 0.33
Anthony 0.90 0.91 0.93 0.94 0.94 0.94 0.89 0.60
E[m(x_i, s_i)|S = A, Y = 0]   0.19 0.20 0.21 0.23 0.26 0.29 0.36 0.33
E[m(x_i, s_i)|S = B, Y = 0]   0.46 0.44 0.42 0.39 0.38 0.32 0.33 0.33
(Ratio)                       2.47 2.25 2.00 1.74 1.47 1.10 0.91 1.00
E[m(x_i, s_i)|S = A, Y = 1]   0.37 0.38 0.40 0.43 0.48 0.52 0.55 0.54
E[m(x_i, s_i)|S = B, Y = 1]   0.78 0.76 0.75 0.73 0.72 0.66 0.59 0.54
(Ratio)                       2.12 2.00 1.86 1.72 1.50 1.26 1.09 1.00
(Global ratio)                2.47 2.25 2.00 1.74 1.50 1.26 1.09 1.00
AUC 0.86 0.86 0.86 0.86 0.85 0.83 0.80 0.75

In Fig. 11.7, we can visualize, on the left-hand side, densities of $\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$
for individuals in group A and in B (thin lines are densities of $\widehat{m}_{\beta}(x_i,s_i)$).
On the right-hand side, the scatterplot of points $(\widehat{m}_{\beta}(x_i,A),\widehat{m}_{\beta_{\lambda}}(x_i,A))$ and
$(\widehat{m}_{\beta}(x_i,B),\widehat{m}_{\beta_{\lambda}}(x_i,B))$, where $\widehat{m}_{\beta}$ is the plain logistic regression, and $\widehat{m}_{\beta_{\lambda}}$ is the
penalized logistic regression, from the toydata2 dataset, associated with a class
balance penalty criterion.
In Fig. 11.8, we have the optimal transport plot, from the distribution of $\widehat{m}_{\beta}(x_i,s_i)$
to the distribution of $\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$, for individuals in group A on the left-hand side,
and in group B on the right-hand side, for different values of $\lambda$, with a low value at
the top and a high value at the bottom (fair model, associated with a class balance
(equalized odds) penalty criterion).

11.4 Application to the GermanCredit Dataset

11.4.1 Demographic Parity

In Figs. 11.9 and 11.10, we can visualize $\operatorname{AUC}(\widehat{m}_{\lambda})$ against the demographic parity
ratio $\mathbb{E}[\widehat{m}_{\beta_{\lambda}}(Z)\mid S=B]/\mathbb{E}[\widehat{m}_{\beta_{\lambda}}(Z)\mid S=A]$ (as in Figs. 11.1 and 11.2), where,
with general notations, the prediction is obtained from a logistic regression,

$$\widehat{m}_{\beta_{\lambda}}(z)=\frac{\exp[z^{\top}\widehat{\beta}_{\lambda}]}{1+\exp[z^{\top}\widehat{\beta}_{\lambda}]},$$

where $\widehat{\beta}_{\lambda}$ is a solution of a penalized maximum likelihood problem,

$$\widehat{\beta}_{\lambda}\in\underset{\beta}{\operatorname{argmin}}\big\{-\log\mathcal{L}(\beta)+\lambda\cdot\big|\operatorname{cov}[m_{\beta}(x),\mathbf{1}_B(s)]\big|\big\}.$$

In Fig. 11.11, we can visualize densities of $\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$ for individuals in group
A and in B (thin lines are densities of $\widehat{m}_{\beta}(x_i,s_i)$), on the left-hand side. On the
right-hand side, the scatterplot of points $(\widehat{m}_{\beta}(x_i,s_i=A),\widehat{m}_{\beta_{\lambda}}(x_i),s=A)$ and
$(\widehat{m}_{\beta}(x_i,s_i=B),\widehat{m}_{\beta_{\lambda}}(x_i),s=B)$, where $\widehat{m}_{\beta}$ is the plain logistic regression, and
$\widehat{m}_{\beta_{\lambda}}$ is the penalized logistic regression, from the GermanCredit dataset, associated
with a demographic parity penalty criterion.
In Fig. 11.12, we have the optimal transport plot, from the distribution of $\widehat{m}_{\beta}(x_i,s_i)$
to the distribution of $\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$, for individuals in group A on the left-hand side,
and in group B on the right-hand side, for different values of $\lambda$, with a low value at
the top and a high value at the bottom (fair model, associated with a demographic
parity penalty criterion).

Fig. 11.7  On the left-hand side, densities of $\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$ for individuals in group A and in B (thin lines are densities of $\widehat{m}_{\beta}(x_i,s_i)$). On the right-hand side, the scatterplot of points $(\widehat{m}_{\beta}(x_i,s_i=A),\widehat{m}_{\beta_{\lambda}}(x_i),s=A)$ and $(\widehat{m}_{\beta}(x_i,s_i=B),\widehat{m}_{\beta_{\lambda}}(x_i),s=B)$, where $\widehat{m}_{\beta}$ is the plain logistic regression, and $\widehat{m}_{\beta_{\lambda}}$ is the penalized logistic regression, from the toydata2 dataset, associated with a class balance penalty criterion

Fig. 11.8  Optimal transport from the distribution of $\widehat{m}_{\beta}(x_i,s_i)$ to the distribution of $\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$, for individuals in group A on the left-hand side, and in group B on the right-hand side, for different values of $\lambda$, with a low value at the top and a high value at the bottom (fair model, associated with a class balance (equalized odds) penalty criterion)

Fig. 11.9 On the left-hand side, accuracy–fairness trade-off plot (based on Table 11.1), with the
AUC of m βλ on the y-axis and the fairness ratio (demographic parity) on the x-axis. Top left
corresponds to accurate and fair. On the right-hand side, evolution of mβλ (x i , si ) (with a logistic
regression) for three individuals in group A and three in B, in the GermanCredit dataset

Fig. 11.10 Optimal transport between distributions of m βλ (x i , si ) from individuals in group A
and in B, for different values of λ (low value on the left-hand side and high value on the right-hand
side), associated with a demographic parity penalty criterion

11.4.2 Equalized Odds and Class Balance

In Figs. 11.13 and 11.14, we can visualize $\operatorname{AUC}(\widehat{m}_{\lambda})$ against the demographic parity
ratio $\mathbb{E}[\widehat{m}_{\beta_{\lambda}}(Z)\mid S=B]/\mathbb{E}[\widehat{m}_{\beta_{\lambda}}(Z)\mid S=A]$ (as in Figs. 11.1 and 11.2), where,
with general notations, the prediction is obtained from a logistic regression,

$$\widehat{m}_{\beta_{\lambda}}(z)=\frac{\exp[z^{\top}\widehat{\beta}_{\lambda}]}{1+\exp[z^{\top}\widehat{\beta}_{\lambda}]},$$

Fig. 11.11  On the left-hand side, densities of $\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$ for individuals in group A and in B (thin lines are densities of $\widehat{m}_{\beta}(x_i,s_i)$). On the right-hand side, the scatterplot of points $(\widehat{m}_{\beta}(x_i,s_i=A),\widehat{m}_{\beta_{\lambda}}(x_i),s=A)$ and $(\widehat{m}_{\beta}(x_i,s_i=B),\widehat{m}_{\beta_{\lambda}}(x_i),s=B)$, where $\widehat{m}_{\beta}$ is the plain logistic regression, and $\widehat{m}_{\beta_{\lambda}}$ is the penalized logistic regression, from the GermanCredit dataset, associated with a demographic parity penalty criterion

Fig. 11.12  Optimal transport from the distribution of $\widehat{m}_{\beta}(x_i,s_i)$ to the distribution of $\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$, for individuals in group A on the left-hand side, and in group B on the right-hand side, for different values of $\lambda$, with a low value at the top and a high value at the bottom (fair model, associated with a demographic parity penalty criterion)

Fig. 11.13 On the left-hand side, accuracy–fairness trade-off plot (based on Table 11.1), with the
AUC of m βλ on the y-axis and the fairness ratio (class balance) on the x-axis. Top left corresponds
to accurate and fair. On the right-hand side, evolution of mβλ (x i , si ) (with a logistic regression) for
three individuals in group A and three in B, in the GermanCredit dataset

Fig. 11.14 Optimal transport between distributions of m βλ (x i , si ) from individuals in group A
and in B, for different values of λ (low value on the left-hand side and high value on the right-hand
side), associated with a class balance penalty criterion

where $\widehat{\beta}_{\lambda}$ is a solution of a penalized maximum likelihood problem,

$$\widehat{\beta}_{\lambda}\in\underset{\beta}{\operatorname{argmin}}\big\{-\log\mathcal{L}(\beta)+\lambda\cdot\big|\operatorname{cov}[m_{\beta}(x),\mathbf{1}_B(s)]\big|\big\}.$$

In Fig. 11.15, we can visualize densities of $\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$ for individuals in group
A and in B (thin lines are densities of $\widehat{m}_{\beta}(x_i,s_i)$), on the left-hand side. On the
right-hand side, the scatterplot of points $(\widehat{m}_{\beta}(x_i,s_i=A),\widehat{m}_{\beta_{\lambda}}(x_i),s=A)$ and
$(\widehat{m}_{\beta}(x_i,s_i=B),\widehat{m}_{\beta_{\lambda}}(x_i),s=B)$, where $\widehat{m}_{\beta}$ is the plain logistic regression, and
$\widehat{m}_{\beta_{\lambda}}$ is the penalized logistic regression, from the GermanCredit dataset, associated
with a class balance penalty criterion.
In Fig. 11.16, we can visualize the optimal transport plot, from the distribution
of $\widehat{m}_{\beta}(x_i,s_i)$ to the distribution of $\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$, for individuals in group A on the
left-hand side, and in group B on the right-hand side, for different values of $\lambda$, with
a low value at the top and a high value at the bottom (fair model, associated with a
class balance (equalized odds) penalty criterion).

Fig. 11.15  On the left-hand side, densities of $\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$ for individuals in group A and in B (thin lines are densities of $\widehat{m}_{\beta}(x_i,s_i)$). On the right-hand side, the scatterplot of points $(\widehat{m}_{\beta}(x_i,s_i=A),\widehat{m}_{\beta_{\lambda}}(x_i),s=A)$ and $(\widehat{m}_{\beta}(x_i,s_i=B),\widehat{m}_{\beta_{\lambda}}(x_i),s=B)$, where $\widehat{m}_{\beta}$ is the plain logistic regression, and $\widehat{m}_{\beta_{\lambda}}$ is the penalized logistic regression, from the GermanCredit dataset, associated with a class balance penalty criterion

Fig. 11.16  Optimal transport from the distribution of $\widehat{m}_{\beta}(x_i,s_i)$ to the distribution of $\widehat{m}_{\beta_{\lambda}}(x_i,s_i)$, for individuals in group A on the left-hand side, and in group B on the right-hand side, for different values of $\lambda$, with a low value at the top and a high value at the bottom (fair model, associated with a class balance (equalized odds) penalty criterion)
Chapter 12
Post-Processing

Abstract The idea of “post-processing” is relatively simple, as we change neither


the training data, nor the model that has been estimated; we simply transform the
predictions obtained, to make them “fair” (according to some specific criteria). As
actuaries care about calibration, and the associated concept of a “well-balanced”
model, quite naturally, we use averages and barycenters. Using optimal transport,
we describe techniques, with strong mathematical guarantees, that could be used to
get a “fair” pricing model.

In some applications, training a model is a long and painful process, so we do not


want to re-train it, and “in-processing” (discussed in Chap. 11) is not an option. As
we see in this chapter, an alternative is to re-calibrate the outcomes independently
of a model, as suggested in Hardt et al. (2016) or Pleiss et al. (2017).

12.1 Post-Processing for Binary Classifiers

If weak demographic parity is not satisfied, in the sense that $\mathbb{E}_{X\mid S=A}[m(X)]\neq\mathbb{E}_{X\mid S=B}[m(X)]$, a simple technique to get a fair model is to consider

$$m^{\star}(x,s)=\frac{\mathbb{E}[m(X,S)]}{\mathbb{E}[m(X,s)]}\cdot m(x,s)\quad\text{for a policyholder in group } s.$$

For example, in the FrenchMotor dataset, overall, a single policyholder has
an 8.67% chance of claiming a loss, 8.94% for a man (group A) and 8.20% for a
woman (group B). Because of this difference, in order to get a fair, "gender-neutral"
model, the premium for a woman should be 8.67/8.20 = 1.058 (or 5.8%) higher,
$m^{\star}(x,s)=1.058\cdot m(x,s)$, and about 3% lower than the predicted one for men. Of course,
this is perhaps simplistic, so let us consider the use of barycenters of distributions.
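This proportional adjustment is straightforward to implement; a minimal sketch, in R, is given below (the function name and arguments are illustrative).

# Minimal sketch: rescale scores so that both groups share the same average
# prediction, m_star(x, s) = E[m] / E[m | S = s] * m(x, s).
rescale_scores <- function(m_hat, s) {
  overall <- mean(m_hat)
  group   <- ave(m_hat, s, FUN = mean)   # within-group average, per individual
  m_hat * overall / group
}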


12.2 Weighted Averages of Outputs

Another natural approach, inspired by techniques used in sampling theory, is to
use post-stratification techniques, which is standard when dealing with a "biased
sample." This is the idea discussed in Lindholm et al. (2022a). Let us extend here
some concepts introduced in Chap. 7, and more specifically Sect. 7.4. Recall that
the regression function (see Definition 3.1) is defined as

$$\mu(x)=\mathbb{E}[Y\mid X=x]=\mathbb{E}\big[\mathbb{E}[Y\mid X=x,S]\big]=\int_{\mathcal{S}}\mathbb{E}[Y\mid X=x,S=s]\,d\mathbb{P}[S=s].$$

Following Moodie and Stephens (2022), the latter can be written

$$\mu(x)=\int_{\mathcal{S}}\mathbb{E}[YW\mid X=x,S=s]\,d\mathbb{P}[S=s\mid X=x]=\mathbb{E}[YW\mid X=x],$$

where W is a version of the Radon–Nikodym derivative

$$W=\frac{d\mathbb{P}[S=s]}{d\mathbb{P}[S=s\mid X=x]},$$

corresponding to the change of measure that gives independence between $X$ and
the sensitive attribute S. Following Côté (2023), we have the following interesting
property.

Proposition 12.1 Let W be a version of the Radon–Nikodym derivative

$$W=\frac{d\mathbb{P}[S=s]}{d\mathbb{P}[S=s\mid X=x]},$$

then $\mathbb{E}[W]=1$, $\mathbb{E}[SW]=\mathbb{E}[S]$, $\mathbb{E}[XW]=\mathbb{E}[X]$ and $\mathbb{E}[XSW]=\mathbb{E}[X]\,\mathbb{E}[S]$.


Proof As proved in Fong et al. (2018),
 
E[W ] =
. wdP[S = s, X = x] = wdP[S = s|X = x]dP[X = x],

which can be written



dP[S = s]
.E[W ] = dP[S = s|X = x]dP[X = x],
dP[S = s|X = x]

and therefore

.E[W ] = dP[S = s]dP[X = x] = 1.
12.3 Average and Barycenters 419

Similarly,
 
E[SW ] =
. swdP[S = s, X = x] = swdP[S = s|X = x]dP[X = x],

and
 
E[SW ] =
. sdP[S = s]dP[X = x] = E[S]dP[X = x] = E[S].

The proof of .E[XW ] = E[X] is similar, and finally


 
E[XSW ] =
. xswdP[S = s|X = x]dP[X = x] = xsdP[S = s]dP[X = x]


E[XSW ] =
. xE[S]dP[X = x] = E[X]E[S].

As a consequence, observe that $\operatorname{Cov}[XW,S]=0$. In statistics, this Radon–Nikodym
derivative is related to the propensity score (see Definition 7.28), as
discussed in Freedman and Berk (2008), Li and Li (2019), and Karimi et al. (2022).
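In practice, with a binary sensitive attribute, these weights can be estimated from a propensity model for S given X; a minimal sketch, in R, could be as follows (the data frame and variable names are illustrative, and the logistic propensity model is only one possible choice).

# Minimal sketch: weights W = dP[S = s] / dP[S = s | X = x], with P[S | X]
# estimated by a logistic regression. 'data' contains the features and a
# column s with labels "A"/"B".
balance_weights <- function(data) {
  ps_model <- glm(I(s == "B") ~ . - s, data = data, family = binomial)
  p_B_x    <- predict(ps_model, type = "response")   # estimate of P[S = B | X = x]
  p_B      <- mean(data$s == "B")                    # estimate of P[S = B]
  ifelse(data$s == "B", p_B / p_B_x, (1 - p_B) / (1 - p_B_x))
}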

12.3 Average and Barycenters

Recall that “weighted averages” are solutions of


 n 

.y = argmin

ωi (y − yi ) 2
,
y
i=1

for some collection of observations .{y1 , · · · , yn } and some weights .ω1 , · · · , ωn ≥


0. The extension in standard Euclidean spaces, named “barycenters,” or “centroids,”
are defined as the solution of
 n 

.z = argmin ωi d(z, zi )2 ,

z
i=1

for some collection of points .{z1 , · · · , zn }, some weights .ω1 , · · · , ωn ≥ 0, and


where d is the Euclidean distance. This can be extended to more general spaces,
where points are measures. We can therefore define some sort of average measure,
solution of
 n 

.P = argmin ωi d(P, Pi )2 ,

P i=1

for some distance (or divergence) d, as in Nielsen and Boltz (2011). Those are
also called "centroids" associated with the measures $\mathcal{P}=\{\mathbb{P}_1,\cdots,\mathbb{P}_n\}$ and weights
$\omega$. Instead of theoretical measures $\mathbb{P}_i$, the idea of "averaging histograms" (or
empirical measures) has been considered in Nielsen and Nock (2009), using the "generalized
Bregman centroid," and in Nielsen (2013), who introduced the "generalized
Kullback–Leibler centroid," based on the Jeffreys divergence, introduced in Jeffreys
(1946), which corresponds to a symmetric divergence, extending the Kullback–Leibler
divergence (see Definition 3.7).
An alternative (see Agueh and Carlier (2011) and Definition 3.11) is to use the
Wasserstein distance $\mathcal{W}_2$. As shown in Santambrogio (2015), if one of the measures
$\mathbb{P}_i$ is absolutely continuous, the minimization problem has a unique solution. As
discussed in Section 5.5.5 in Santambrogio (2015), it is possible to use a simple
version for univariate measures. Given a reference measure, say $\mathbb{P}_1$, it is possible to
write the barycenter as the "average push-forward" transformation of $\mathbb{P}_1$: if $\mathbb{P}_i=T^{1\to i}_{\#}\mathbb{P}_1$ (with the convention that $T^{1\to 1}_{\#}$ is the identity),

$$\mathbb{P}^{\star}=\Bigg(\sum_{i=1}^n\omega_i\,T^{1\to i}\Bigg)_{\#}\mathbb{P}_1.$$

And in the univariate case, $T^{1\to i}$ is simply a rearrangement, defined as $T^{1\to i}=F_i^{-1}\circ F_1$, where $F_i(t)=\mathbb{P}_i((-\infty,t])$ and $F_i^{-1}$ is its generalized inverse. Note that
the Wasserstein barycenter is also the "Fréchet mean of distributions" of Petersen
and Müller (2019), with the associated R package frechet. See Le Gouic and
Loubes (2017) for technical results. As discussed in Alvarez-Esteban et al. (2018),
moments and risk measures associated with $\mathbb{P}^{\star}$ can be expressed simply from the
associated measures $\mathbb{P}_i$ and $\omega$.
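In the univariate case, this representation leads to a very simple empirical construction, since the quantile function of the barycenter is the weighted average of the quantile functions. A minimal sketch, in R, with two samples (the names and the probability grid are illustrative):

# Minimal sketch: empirical Wasserstein barycenter of two univariate samples,
# obtained by averaging quantile functions on a grid of probability levels.
wasserstein_barycenter_1d <- function(x1, x2, w1 = 0.5,
                                      grid = seq(0.005, 0.995, by = 0.005)) {
  q1 <- quantile(x1, grid, names = FALSE)
  q2 <- quantile(x2, grid, names = FALSE)
  w1 * q1 + (1 - w1) * q2    # quantiles of the barycenter, on 'grid'
}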
To illustrate, consider, as Mallasto and Feragen (2017), the case of Gaussian distributions,
$\mathcal{N}(\mu_i,\Sigma_i)$, which works in any dimension actually. The Jeffreys–Kullback–Leibler
centroid of those distributions would be

$$\mathcal{N}(\mu^{*},\Sigma^{*}),\quad\text{where }\mu^{*}=\sum_{i=1}^n\omega_i\mu_i\ \text{ and }\ \Sigma^{*}=\sum_{i=1}^n\omega_i\Sigma_i.$$

But this is not the Wasserstein barycenter, which is actually

$$\mathcal{N}(\mu^{\circ},\Sigma^{\circ}),\quad\text{where }\mu^{\circ}=\sum_{i=1}^n\omega_i\mu_i,$$

and where $\Sigma^{\circ}$ is the unique positive definite matrix such that

$$\Sigma^{\circ}=\sum_{i=1}^n\omega_i\big({\Sigma^{\circ}}^{1/2}\,\Sigma_i\,{\Sigma^{\circ}}^{1/2}\big)^{1/2}.$$

Fig. 12.1 Wasserstein barycenter and Jeffreys-Kullback–Leibler centroid of two Gaussian dis-
tributions on the left, and empirical estimate of the density of the Wasserstein barycenter and
Jeffreys–Kullback–Leibler centroid of two samples .x1 and .x2 (drawn from normal distributions)

In the univariate case, with two Gaussian measures, the difference is that in the
first case, the variance is the average of variances, whereas in the second case, the
standard deviation is the average of standard deviations,

$$\begin{cases}\sigma^{*}=\sqrt{\omega_1\sigma_1^2+\omega_2\sigma_2^2} &:\text{ Jeffreys–Kullback–Leibler centroid}\\ \sigma^{\circ}=\omega_1\sigma_1+\omega_2\sigma_2 &:\text{ Wasserstein barycenter.}\end{cases}$$
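This univariate Gaussian case can be checked in a few lines of R; a minimal sketch (with illustrative names) is

# Minimal sketch: barycenters of univariate Gaussian distributions. The
# Wasserstein barycenter averages standard deviations, whereas the
# Jeffreys-Kullback-Leibler centroid averages variances.
gaussian_barycenters <- function(mu, sigma, w) {
  list(mean            = sum(w * mu),
       sd_wasserstein  = sum(w * sigma),          # average of standard deviations
       sd_jkl          = sqrt(sum(w * sigma^2)))  # root of the averaged variances
}
gaussian_barycenters(mu = c(-1, 2), sigma = c(1, 2), w = c(0.5, 0.5))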

In Fig. 12.1 we can visualize the barycenters of two Gaussian distributions, on the
left, and the empirical version with kernel density estimators on the right (based
on two Gaussian samples of sizes $n=1000$). More specifically, if $\widehat{f}_1$ is the kernel
density estimator of sample $x_1$, if $\widehat{F}_1$ is the integral of $\widehat{f}_1$, and if

$$\widehat{T}(x)=\omega_1 x+\omega_2\,\widehat{F}_2^{-1}\big(\widehat{F}_1(x)\big),$$

the density of the barycenter is

$$\widehat{f}^{\star}(x)=\widehat{f}_1\big(\widehat{T}^{-1}(x)\big)\cdot\big|\mathrm{d}\,\widehat{T}^{-1}(x)\big|.$$

Fairness is achieved by considering barycenters of regression predictions. For
unbounded predictions, classical kernel density estimators can be considered (using
density in R), but in classification, scores take values in $[0,1]$. One could use
transformed kernel techniques (as discussed in Geenens (2014)), or Beta kernels (as
in Chen (1999)), using kdensity from the kdensity R package (and the option
kernel="beta").
In Fig. 12.2 we can visualize the barycenters of two samples generated from
some Beta distributions. On the left-hand side, histograms are used, and the Jeffreys
centroid is plotted. In the middle and on the right-hand side, smooth density
estimators of $\widehat{f}_1$ and $\widehat{f}_2$ are considered, using Beta kernels. In the middle, we
consider the Jeffreys centroid, and on the right-hand side, the Wasserstein barycenter.

Fig. 12.2 Empirical Jeffreys–Kullback–Leibler centroid of two samples generated from beta
distributions on the left-hand side (based on histograms) and in the middle (based on the estimated
density of the beta kernel ), the Wasserstein barycenter on the right-hand side (based on the
estimated density of the beta kernel)

Fig. 12.3 Optimal transport for two samples drawn from two beta distributions (one skewed to
the left (on the left-hand side) and one to the right (on the right-hand side), on the x-axis) to the
barycenter (on the y-axis)

The transport, from $\widehat{f}_1$ or $\widehat{f}_2$ to $\widehat{f}^{\star}$ (all three on the right-hand side of Fig. 12.2),
can be visualized in Fig. 12.3, on the left-hand side and on the right-hand side
respectively.
Given two scoring functions $m(x,s=A)$ and $m(x,s=B)$, it is possible, post-processing,
to construct a fair score $m^{\star}$ using the approach we just described.

Definition 12.1 (Fair Barycenter Score) Given two scores $m(x,s=A)$ and
$m(x,s=B)$, the "fair barycenter score" is

$$\begin{cases}m^{\star}(x,s=A)=\mathbb{P}[S=A]\cdot m(x,s=A)+\mathbb{P}[S=B]\cdot F_B^{-1}\circ F_A\big(m(x,s=A)\big)\\ m^{\star}(x,s=B)=\mathbb{P}[S=A]\cdot F_A^{-1}\circ F_B\big(m(x,s=B)\big)+\mathbb{P}[S=B]\cdot m(x,s=B).\end{cases}$$

In Definition 2.7, we defined "balanced" scores.

Proposition 12.2 The score $m^{\star}$ is balanced.
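An empirical version of Definition 12.1 only requires the empirical distribution functions and quantile functions of the scores in each group; a minimal sketch, in R, is given below (function and argument names are illustrative, and ties or boundary effects are ignored).

# Minimal sketch of the "fair barycenter score": each score is averaged with
# its optimal-transport counterpart in the other group, with weights P[S = A]
# and P[S = B]. m_hat is the vector of fitted scores, s the group label.
fair_barycenter_score <- function(m_hat, s) {
  mA <- m_hat[s == "A"]; mB <- m_hat[s == "B"]
  pA <- mean(s == "A");  pB <- 1 - pA
  FA <- ecdf(mA); FB <- ecdf(mB)
  QA <- function(u) quantile(mA, u, names = FALSE)   # empirical F_A^{-1}
  QB <- function(u) quantile(mB, u, names = FALSE)   # empirical F_B^{-1}
  ifelse(s == "A",
         pA * m_hat + pB * QB(FA(m_hat)),
         pA * QA(FB(m_hat)) + pB * m_hat)
}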
In Fig. 12.4, inspired by Fig. 8.8, we can visualize the matching between
$m(x,s=A)$ and $m^{\star}(x,s=A)$ at the top, and between $m(x,s=B)$ and
$m^{\star}(x,s=B)$ at the bottom.
In Fig. 12.5, we have scatterplots of points $(m(x_i,s_i=A),m^{\star}(x_i,s=A))$ and
$(m(x_i,s_i=B),m^{\star}(x_i,s=B))$, with three models (GLM, GBM, and RF), on the
probability of claiming a loss in motor insurance when s is the gender of the driver,
from the left to the right.

12.4 Application to toydata1

In Fig. 12.6, we show the distribution of the "fair score," defined as the
barycenter of the distributions of the scores $m(x_i,s_i=A)$ and $m(x_i,s_i=B)$ in
the two groups. On the left, the Wasserstein barycenter, and then two computations
of the Jeffreys–Kullback–Leibler centroid.
In Fig. 12.7, the optimal transport plot, with the optimal matching between $m(x)$
and $m^{\star}(x)$ for individuals in group $s=A$, on the left-hand side, and between $m(x)$
and $m^{\star}(x)$ for individuals in group $s=B$ on the right-hand side, with unaware
scores on the toydata1 dataset.
In Fig. 12.8, the optimal transport plot, with the matching between $m(x,s=A)$ and
$m^{\star}(x)$ for individuals in group $s=A$, on the left-hand side, and between $m(x,s=B)$ and $m^{\star}(x)$ for individuals in group $s=B$ on the right-hand side, with aware
scores on the toydata1 dataset (plain lines; thin lines are the unaware scores from
Fig. 12.7).
In Table 12.1, we can visualize the predictions for six individuals, with an
aware model $m(x,s)$, an unaware model $m(x)$, and two barycenters, the Wasserstein
barycenter $m^{\star}_w(x)$ and the Jeffreys–Kullback–Leibler centroid, $m^{\star}_{jkl}(x)$.
In Fig. 12.9, we have the distributions of the scores in the two groups, A and B, after
optimal transport to the barycenter, with the Jeffreys–Kullback–Leibler centroid at
the top and the Wasserstein barycenter at the bottom, following Definition 12.1. Given

Fig. 12.4  Matching between $m(x,s=A)$ and $m^{\star}(x)$, at the top, and between $m(x,s=B)$ and $m^{\star}(x)$, at the bottom, on the probability of claiming a loss in motor insurance when s is the gender of the driver, from FrenchMotor

Fig. 12.5  Scatterplot of points $(m(x_i,s_i=A),m^{\star}(x_i,s=A))$ and $(m(x_i,s_i=B),m^{\star}(x_i,s=B))$, with three models (GLM, GBM, and RF), on the probability of claiming a loss in motor insurance when s is the gender of the driver

Fig. 12.6  Barycenter of the two distributions of scores, $m(x_i,s_i=A)$ and $m(x_i,s_i=B)$. On the left, the Wasserstein barycenter, and then two computations of the Jeffreys–Kullback–Leibler centroid

two scores $m(x,s=A)$ and $m(x,s=B)$, the "fair barycenter score" being that of
Definition 12.1,

$$\begin{cases}m^{\star}(x,s=A)=\mathbb{P}[S=A]\cdot m(x,s=A)+\mathbb{P}[S=B]\cdot F_B^{-1}\circ F_A\big(m(x,s=A)\big)\\ m^{\star}(x,s=B)=\mathbb{P}[S=A]\cdot F_A^{-1}\circ F_B\big(m(x,s=B)\big)+\mathbb{P}[S=B]\cdot m(x,s=B).\end{cases}$$

Fig. 12.7 Matching between .m(x) and .m (x) for individuals in group .s = A, on the left-hand
side, and between .m(x) and .m (x) for individuals in group .s = B on the right-hand side, with
unaware scores on the toydata1 dataset

Fig. 12.8 Matching between .m(x, s = A) and .m (x) for individuals in group .s = A, on the left-
hand side, and between .m(x, s = B) and .m (x) for individuals in group .s = B on the right-hand
side, with aware scores on the toydata1 dataset (plain lines, thin lines are the unaware score
from Fig. 12.7)

12.5 Application on FrenchMotor

In the entire dataset, we have 64% men (7973) and 36% women (4464) registered
as “main driver.” Overall, if we consider “weak demographic parity,” .8.2% of
women claim a loss, as opposed to .8.9% of men. In Table 12.2, we can visualize
“gender-neutral” predictions, derived from the logistic regression (GLM), a boost-

Table 12.1  Individual predictions for six fictitious individuals, with two models (aware and unaware) and two barycenters, on toydata1
             x    s    y       m(x, s)   m(x)    m*_w(x)   m*_jkl(x)
  Alex      −1    A    0.475   0.250     0.219   0.154     0.094
  Betty     −1    B    0.475   0.205     0.219   0.459     0.357
  Ahmad      0    A    0.475   0.490     0.465   0.341     0.279
  Brienne    0    B    0.475   0.426     0.465   0.719     0.692
  Anthony   +1    A    0.475   0.734     0.730   0.571     0.521
  Beatrix   +1    B    0.475   0.681     0.730   0.842     0.932

Fig. 12.9 Distributions of the scores in the two groups, A and B, after optimal transport to the
barycenter, with Jeffreys–Kullback–Leibler centroid at the top and the Wasserstein barycenter at
the bottom

ing algorithm (GBM), and a random forest (RF). The first column corresponds to
the proportional approach discussed in Sect. 12.1.
In Table 12.2, we have, for the two groups, the global correction discussed in
Sect. 12.2, with .−6% for the men (.×0.94, group A) and .+11% for the women
(.×1.11, group B). As on average, women have fewer accidents than men, they need

Table 12.2 “Gender-free” prediction if the initial prediction was 5% (at the top), 10% (in the
middle), and 20% (at the bottom). The first approach is the simple “benchmark” based on .P[Y =
1]/P[Y = 1|S = s], and then three models are considered, GLM, GBM, and RF
A (men) B (women)
.×0.94 GLM GBM RF .×1.11 GLM GBM RF
.m(x) = 5% 4.73% 4.94% 4.80% 4.42% 5.56% 5.16% 5.25% 6.15%
.m(x) = 10% 9.46% 9.83% 9.66% 8.92% 11.12% 10.38% 10.49% 12.80%
.m(x) = 20% 18.91% 19.50% 18.68% 18.26% 22.25% 20.77% 21.63% 21.12%

to be charged more to have a “fair” (nondiscriminatory) premium. Then, we consider


the “fair barycenter score,” from Definition 12.1.

m (x, s = A) = P[S = A] · m(x, s = A) + P[S = B] · FB−1 ◦ FA m(x, s = A)
.
m (x, s = B) = P[S = A] · FA−1 ◦ FB m(x, s = B) + P[S = B] · m(x, s = B),

where m is either a logistic regression (GLM, on the left-hand side), boosting


(Adaboost, GBM, in the middle) and random forest (RF, on the right-hand side).
To illustrate, we consider different individuals in groups A and B, where an initial
prediction of a .5% probability of claiming a loss is considered (at the top), .10% (in
the middle), and .20% (at the bottom).
In Figs. 12.4 and 12.5, we have seen how to get a “fair prediction,” with the
matching between .m(x, s = A) and .m (x, s = A), at the top, and between .m(x, s =
B) and .m (x, s = B), in Fig. 12.4, and with the scatterplot of points .(m(x i , si =
A), m (x i , s = A)) and .(m(x i , si = B), m (x i , s = B)) in Fig. 12.5.
We can also consider a binary sensitive attribute, related to age, with .s =
1(age > 65) (discrimination against old people), in Table 12.3 and .s = 1(age <
30) (discrimination against young people), in Table 12.4.
Tables 12.4 and 12.3 extend Table 12.2, from the case where the sensi-
tive attribute was gender to the case where the sensitive attribute is age, with
young/nonyoung in Table 12.4, old/non-old in Table 12.3.
In Figs. 12.10 and 12.12 we can visualize matchings between .m(x, s = A) and
.m (x, s = A), at the top, and between .m(x, s = B) and .m (x, s = B) at the
 

bottom, respectively with .s = 1(age > 65) (discrimination against old people) and
.s = 1(age < 30) (discrimination against young people).

Table 12.3 “Age-free” prediction (against old driver) if the initial prediction was 5% (at the top),
10% (in the middle) and 20% (at the bottom)
A (younger .< 65) B (old .> 65)
.×1.01 GLM GBM RF .×0.94 GLM GBM RF
.m(x) = 5% 5.05% 5.17% 5.10% 5.27% 4.71% 3.84% 3.84% 3.96%
.m(x) = 10% 10.09% 10.37% 10.16% 11.00% 9.42% 7.81% 9.10% 6.88%
.m(x) = 20% 20.19% 19.98% 19.65% 21.26% 18.85% 19.78% 23.79% 12.54%

Table 12.4 “Age-free” (against young drivers) prediction if the initial prediction was 5% (at the
top), 10% (in the middle), and 20% (at the bottom)
A (young .< 25) B (older .> 25)
.×0.74 GLM GBM RF .×1.06 GLM GBM RF
.m(x) = 5% 3.71% 3.61% 4.45% 2.41% 5.29% 5.29% 5.14% 6.05%
.m(x) = 10% 7.42% 7.89% 8.69% 5.17% 10.59% 10.29% 10.19% 11.95%
.m(x) = 20% 14.84% 21.82% 18.09% 9.93% 21.17% 19.87% 20.33% 21.29%

Fig. 12.10 Matching between .m(x, s = A) and .m (x, s = A), at the top, and between .m(x, s =
B) and .m (x, s = B), at the bottom, based on the probability of claiming a loss in motor insurance
when s is the indicator that the driver is “old” .1(age > 65)

In Fig. 12.11, we have a scatterplot of points $(m(x_i,s_i=A),m^{\star}(x_i))$ and
$(m(x_i,s_i=B),m^{\star}(x_i))$, with three models (GLM, GBM, RF), based on the
probability of claiming a loss in motor insurance when s is the indicator that the
driver is "old," $\mathbf{1}(\text{age}>65)$ (more than 65 years old).
In Fig. 12.12, we can visualize the optimal transport plot, with the matching
between $m(x,s=A)$ and $m^{\star}(x,s=A)$, at the top, and between $m(x,s=B)$
and $m^{\star}(x,s=B)$, at the bottom, based on the probability of claiming a loss in
motor insurance when s is the indicator that the driver is "young," $\mathbf{1}(\text{age}<30)$
(less than 30 years old).
In Fig. 12.13, we have the scatterplot of points $(m(x_i,s_i=A),m^{\star}(x_i))$ and
$(m(x_i,s_i=B),m^{\star}(x_i))$, with three models (GLM, GBM, and RF), based on the
probability of claiming a loss in motor insurance when s is the indicator that the
driver is "young," $\mathbf{1}(\text{age}<30)$.

Fig. 12.11 Scatterplot of points .(m(x i , si = A), m( x i )) and .(m(x i , si = B), m( x i )), with three
models (GLM, GBM, and RF), based on the probability of claiming a loss in motor insurance when
s is the indicator that the driver is “old” .1(age > 65)

Fig. 12.12 Matching between .m(x, s = A) and .m (x, s = A), at the top, and between .m(x, s =
B) and .m (x, s = B), at the bottom, based on the probability of claiming a loss in motor insurance
when s is the indicator that the driver is “young” .1(age < 30)

Fig. 12.13 Scatterplot of points .(m(x i , si = A), m( x i )) and .(m(x i , si = B), m( x i )), with three
models (GLM, GBM, and RF), based on the probability of claiming a loss in motor insurance when
s is the indicator that the driver is “young” .1(age < 30)

12.6 Penalized Bagging

If m is a random forest, instead of using equal weights in the bagging procedure,
one can consider seeking weights so that the outcome will be "fair," as suggested
in Fermanian and Guegan (2021). Formally, consider k models, $m_1,\cdots,m_k$. In a
classification problem, $m_j(x)\approx\mathbb{P}[Y=1\mid X=x]$, and in the case of a regression,
$m_j(x)\approx\mathbb{E}[Y\mid X=x]$.

Let $M_{\omega}(x)=\sum_{j=1}^k\omega_j m_j(x)=\omega^{\top}\boldsymbol{m}(x)$. For example, with random forests, k is
large, and $\omega_j=1/k$. But we can consider an ensemble approach, on a few models.
Recall that "demographic parity" (Sect. 8.2) is achieved if

$$\mathbb{E}\big[\widehat{Y}\mid S=B\big]=\mathbb{E}\big[\widehat{Y}\mid S=A\big]=\mathbb{E}\big[\widehat{Y}\big],$$

meaning here

$$\mathbb{E}\big[\omega^{\top}\boldsymbol{m}(X)\mid S=A\big]=\mathbb{E}\big[\omega^{\top}\boldsymbol{m}(X)\mid S=B\big].$$

Empirically, we can compute (for some loss $\ell$)

$$\frac{\sum_{i=1}^n\mathbf{1}(s_i=A)\,\omega^{\top}\boldsymbol{m}(x_i)}{\sum_{i=1}^n\mathbf{1}(s_i=A)},\quad\frac{\sum_{i=1}^n\mathbf{1}(s_i=B)\,\omega^{\top}\boldsymbol{m}(x_i)}{\sum_{i=1}^n\mathbf{1}(s_i=B)}.$$

The problem we should solve could be

$$\underset{\omega}{\operatorname{argmin}}\big\{R(\omega)+\lambda\,\mathcal{R}(\omega)\big\},$$

where the empirical risk (associated with accuracy) could be, if the risk is associated
with loss $\ell$,

$$R(\omega)=\frac{1}{n}\sum_{i=1}^n\ell\big(\omega^{\top}\boldsymbol{m}(x_i),y_i\big),$$

and where $\mathcal{R}(\omega)$ is some fairness criterion, as discussed in Sect. 11.1. Following
Friedler et al. (2019), an alternative could be to consider the $\alpha$-disparate impact

$$\mathcal{R}_{\alpha}=\frac{\mathbb{E}\big[|\widehat{Y}|^{\alpha}\mid S=A\big]}{\mathbb{E}\big[|\widehat{Y}|^{\alpha}\mid S=B\big]}\quad\text{for }\alpha>0,$$

and the $\beta$-equalized odds

$$\mathcal{R}_{\beta}=1-\mathbb{E}_Y\Big[\big|\mathbb{E}[\widehat{Y}\mid S=A,Y]-\mathbb{E}[\widehat{Y}\mid S=B,Y]\big|^{\beta}\Big].$$

As in Sect. 11.1, we want to find some weights that give a trade-off between
fairness and accuracy. The only difference is that in Chap. 11, "in-processing," we
were still training the model. Here, we already have a collection of models, and we
just want to consider a weighted average of these models. Given a sample $\{(y_i,x_i)\}$,
consider the following penalized problem, where the fairness criterion is related to
$\alpha$-"demographic parity," and accuracy is characterized by some loss $\ell$,

$$\min_{\omega\in\mathcal{S}_k}\Bigg\{\bigg|\frac{1}{n_A}\sum_{i:s_i=A}\omega^{\top}\boldsymbol{m}(x_i)-\frac{1}{n}\sum_{i}\omega^{\top}\boldsymbol{m}(x_i)\bigg|^{\alpha}+\frac{\lambda}{n}\sum_{i}\ell\big(\omega^{\top}\boldsymbol{m}(x_i),y_i\big)\Bigg\},$$

for some $\alpha>0$, where $\mathcal{S}_k$ is the standard probability simplex (as defined in Boyd
and Vandenberghe (2004)), $\mathcal{S}_k=\{w\in\mathbb{R}_+^k:w^{\top}\mathbf{1}=1\}$. As proved in Fermanian
and Guegan (2021), if $\alpha=2$ and $\ell=\ell_2$, then

$$\omega_{dp}=\frac{\Sigma^{-1}\mathbf{1}}{\mathbf{1}^{\top}\Sigma^{-1}\mathbf{1}},\quad\text{where }\Sigma=AA^{\top}+\lambda B,$$

where

$$A=\frac{1}{n_A}\sum_{i:s_i=A}\boldsymbol{m}(x_i)-\frac{1}{n}\sum_{i}\boldsymbol{m}(x_i)$$

and

$$B=\frac{1}{n}\sum_{i=1}^n\big(\boldsymbol{m}(x_i)-y_i\mathbf{1}\big)\big(\boldsymbol{m}(x_i)-y_i\mathbf{1}\big)^{\top}.$$

The tuning parameter $\lambda$ is positive, and selecting $\lambda>0$ yields a unique solution.
For $\beta$-"equalized odds," the optimization problem is

$$\min_{\omega\in\mathcal{S}_k}\Bigg\{\frac{1}{n}\sum_{i=1}^n\big|\widehat{e}^{(A)}(y_i)-\widehat{e}(y_i)\big|^{\alpha}+\frac{\lambda}{n}\sum_{i}\ell\big(\omega^{\top}\boldsymbol{m}(x_i),y_i\big)\Bigg\},$$

where $\widehat{e}(y_i)$ is an estimation of $\mathbb{E}[\widehat{Y}\mid Y=y_i]$ and $\widehat{e}^{(A)}(y_i)$ is an estimation
of $\mathbb{E}[\widehat{Y}\mid Y=y_i,S=A]$. For instance, consider some standard nonparametric,
kernel-based estimators,

$$\widehat{e}(y)\propto\sum_{i=1}^n\omega^{\top}\boldsymbol{m}(x_i)\,K_h(y_i-y)=\omega^{\top}\boldsymbol{v}_h(y),$$

and

$$\widehat{e}^{(A)}(y)\propto\sum_{i:s_i=A}\omega^{\top}\boldsymbol{m}(x_i)\,K_h(y_i-y)=\omega^{\top}\boldsymbol{v}_h^{(0)}(y).$$

Again, if $\alpha=2$ and with $\ell=\ell_2$,

$$\omega_{eo}=\frac{\Sigma^{-1}\mathbf{1}}{\mathbf{1}^{\top}\Sigma^{-1}\mathbf{1}},\quad\text{where }\Sigma=\frac{1}{n}\sum_{i=1}^n\gamma_i\gamma_i^{\top}+\lambda B,$$

where

$$\gamma_i=\boldsymbol{v}_h^{(0)}(y_i)-\boldsymbol{v}_h(y_i)$$

and

$$B=\frac{1}{n}\sum_{i=1}^n\big(\boldsymbol{m}(x_i)-y_i\mathbf{1}\big)\big(\boldsymbol{m}(x_i)-y_i\mathbf{1}\big)^{\top}.$$

We should keep in mind here that we can always solve this problem numerically.
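A minimal sketch, in R, of the closed-form demographic-parity weights above is given below (matrix and argument names are illustrative; the simplex constraint is handled here only through the final normalization, as in the displayed formula, and the equalized-odds weights could be computed analogously).

# Minimal sketch: penalized bagging weights for demographic parity, with
# alpha = 2 and squared loss. M is an n x k matrix of the k models'
# predictions, y the response, s the group label, lambda > 0.
penalized_bagging_weights <- function(M, y, s, lambda) {
  k     <- ncol(M)
  A     <- colMeans(M[s == "A", , drop = FALSE]) - colMeans(M)  # k-vector
  R     <- M - matrix(y, nrow(M), k)                            # m(x_i) - y_i 1
  B     <- crossprod(R) / nrow(M)                               # (1/n) sum r_i r_i'
  Sigma <- tcrossprod(A) + lambda * B                           # A A' + lambda B
  w     <- solve(Sigma, rep(1, k))                              # Sigma^{-1} 1
  w / sum(w)                                                    # normalize to sum to one
}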
References

Aas K, Jullum M, Løland A (2021) Explaining individual predictions when features are dependent:
More accurate approximations to shapley values. Artif Intell 298:103502
Abraham K (1986) Distributing risk: Insurance, legal theory and public policy. Yale University
Press, Yale
Abrams M (2014) The origins of personal data and its implications for governance. SSRN 2510927
Achenwall G (1749) Abriß der neuesten Staatswissenschaft der vornehmsten Europäischen Reiche
und Republicken zum Gebrauch in seinen Academischen Vorlesungen. Schmidt
Adams SJ (2004) Age discrimination legislation and the employment of older workers. Labour
Econ 11(2):219–241
Agarwal A, Beygelzimer A, Dudík M, Langford J, Wallach H (2018) A reductions approach to fair
classification. In: Dy J, Krause A (eds) International Conference on Machine Learning, Pro-
ceedings of Machine Learning Research, Stockholmsmässan, Stockholm Sweden, Proceedings
of Machine Learning Research, vol 80, pp 60–69
Agrawal T (2013) Are there glass-ceiling and sticky-floor effects in india? an empirical examina-
tion. Oxford Dev Stud 41(3):322–342
Agresti A (2012) Categorical data analysis. Wiley, New York
Agresti A (2015) Foundations of linear and generalized linear models. Wiley, New York
Agueh M, Carlier G (2011) Barycenters in the Wasserstein space. SIAM J Math Anal 43(2):904–
924
Ahima RS, Lazar MA (2013) The health risk of obesity–better metrics imperative. Science
341(6148):856–858
Ahmed AM (2010) What is in a surname? the role of ethnicity in economic decision making. Appl
Econ 42(21):2715–2723
Aigner DJ, Cain GG (1977) Statistical theories of discrimination in labor markets. Ind Labor Relat
Rev 30(2):175–187
Ajunwa I (2014) Genetic testing meets big data: Tort and contract law issues. Ohio State Law J
75:1225
Ajunwa I (2016) Genetic data and civil rights. Harvard Civil Rights-Civil Liberties Law Rev 51:75
Akerlof GA (1970) The market for “lemons”: Quality uncertainty and the market mechanism. Q J
Econ 84(3):488–500
Al Ramiah A, Hewstone M, Dovidio JF, Penner LA (2010) The social psychology of discrimi-
nation: Theory, measurement and consequences. In: Making equality count, pp 84–112. The
Liffey Press, Dublin
Alexander L (1992) What makes wrongful discrimination wrong? biases, preferences, stereotypes,
and proxies. Univ Pennsylvania Law Rev 141(1):149–219


Alexander W (1924) Insurance fables for life underwriters. The Spectator Company, London
Alipourfard N, Fennell PG, Lerman K (2018) Can you trust the trend? discovering simpson’s
paradoxes in social data. In: Proceedings of the Eleventh ACM International Conference on
Web Search and Data Mining, pp 19–27
Allen CG (1975) Plato on women. Feminist Stud 2(2):131
Allerhand L, Youngmann B, Yom-Tov E, Arkadir D (2018) Detecting Parkinson’s disease from
interactions with a search engine: Is expert knowledge sufficient? In: Proceedings of the 27th
ACM International Conference on Information and Knowledge Management, pp 1539–1542
Altman A (2011) Discrimination. Stanford Encyclopedia of Philosophy
Altman N, Krzywinski M (2015) Association, correlation and causation. Nature Methods
12(10):899–900
Alvarez-Esteban PC, del Barrio E, Cuesta-Albertos JA, Matrán C (2018) Wide consensus aggrega-
tion in the Wasserstein space. application to location-scatter families. Bernoulli 24:3147–3179
Amadieu JF (2008) Vraies et fausses solutions aux discriminations. Formation emploi Revue
française de sciences sociales 101:89–104
Amari SI (1982) Differential geometry of curved exponential families-curvatures and information
loss. Ann Stat 10(2):357–385
American Academy of Actuaries (2011) Market consistent embedded value. Life Financial
Reporting Committee
Amnesty International (2023) Discrimination. https://www.amnesty.org/en/what-we-do/discrimination/
Amossé T, De Peretti G (2011) Hommes et femmes en ménage statistique: une valse à trois temps.
Travail, genre et sociétés 2:23–46
Anderson TH (2004) The pursuit of fairness: A history of affirmative action. Oxford University
Press, Oxford
de Andrade N (2012) Oblivion: The right to be different from oneself-reproposing the right to be
forgotten. In: Cerrillo Martínez A, Peguera Poch M, Peña López I, Vilasau Solana M (eds) VII
international conference on internet, law & politics. Net neutrality and other challenges for the
future of the Internet, IDP. Revista de Internet, Derecho y Política, 13, pp 122–137
Angrist JD, Pischke JS (2009) Mostly harmless econometrics: An empiricist’s companion.
Princeton University Press, Princeton
Anguraj K, Padma S (2012) Analysis of facial paralysis disease using image processing technique.
Int J Comput Appl 54(11). https://doi.org/10.5120/8607-2455
Angwin J, Larson J, Mattu S, Kirchner L (2016) Machine bias: There’s software used across the
country to predict future criminals and it’s biased against blacks. ProPublica May 23
Antoniak M, Mimno D (2021) Bad seeds: Evaluating lexical methods for bias measurement. In:
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics
and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long
Papers), pp 1889–1904
Antonio K, Beirlant J (2007) Actuarial statistics with generalized linear mixed models. Insurance
Math Econ 40(1):58–76
Apfelbaum EP, Pauker K, Sommers SR, Ambady N (2010) In blind pursuit of racial equality?
Psychol Sci 21(11):1587–1592
Apley DW, Zhu J (2020) Visualizing the effects of predictor variables in black box supervised
learning models. J Roy Stat Soc B (Stat Methodol) 82(4):1059–1086
Aran XF, Such JM, Criado N (2019) Attesting biases and discrimination using language semantics.
arXiv 1909.04386
Agüera y Arcas B, Todorov A, Mitchell M (2018) Do algorithms reveal sexual orientation or just
expose our stereotypes. Medium January 11
Armstrong JS (1985) Long-range Forecasting: from crystal ball to computer. Wiley, New York
Arneson RJ (1999) Egalitarianism and responsibility. J Ethics 3:225–247
Arneson RJ (2007) Desert and equality. In: Egalitarianism: New essays on the nature and value of
equality, pp 262–293. Oxford University Press, Oxford

Arneson RJ (2013) Discrimination, disparate impact, and theories of justice. In: Hellman D,
Moreau S (eds) Philosophical foundations of discrimination law, vol 87, p 105. Oxford
University Press, Oxford
Arrow KJ (1963) Uncertainty and the welfare economics of medical care. Am Econ Rev 53:941–
973
Arrow KJ (1973) The theory of discrimination. In: Ashenfelter O, Rees A (eds) Discrimination in
labor markets. Princeton University Press, Princeton
Artıs M, Ayuso M, Guillen M (1999) Modelling different types of automobile insurance fraud
behaviour in the spanish market. Insurance Math Econ 24(1–2):67–81
Artís M, Ayuso M, Guillén M (2002) Detection of automobile insurance fraud with discrete choice
models and misclassified claims. J Risk Insurance 69(3):325–340
Ashenfelter O, Oaxaca R (1987) The economics of discrimination: Economists enter the court-
room. Am Econ Rev 77(2):321–325
Ashley F (2018) Man who changed legal gender to get cheaper insurance exposes the unreliability
of gender markers. CBC (Canadian Broadcasting Corporation) - Radio Canada July 28
Atkin A (2012) The philosophy of race. Acumen
Ausloos J (2020) The right to erasure in EU data protection law. Oxford University Press, Oxford
Austin PC, Steyerberg EW (2012) Interpreting the concordance statistic of a logistic regression
model: relation to the variance and odds ratio of a continuous explanatory variable. BMC Med
Res Methodol 12:1–8
Austin R (1983) The insurance classification controversy. Univ Pennsylvania Law Rev 131(3):517–
583
Automobile Insurance Rate Board (2022) Technical guidance: Change in rates and rating pro-
grams. Albera AIRB
Autor D (2003) Lecture note: the economics of discrimination-theory. Graduate Labor Economics,
Massachusetts Institute of Technology, pp 1–18
Avery RB, Calem PS, Canner GB (2004) Consumer credit scoring: do situational circumstances
matter? J Bank Finance 28(4):835–856
Avin C, Shpitser I, Pearl J (2005) Identifiability of path-specific effects. In: IJCAI International
Joint Conference on Artificial Intelligence, pp 357–363
Avraham R (2017) Discrimination and insurance. In: Lippert-Rasmussen K (ed) Handbook of the
Ethics of Discrimination, Routledge, pp 335–347
Avraham R, Logue KD, Schwarcz D (2013) Understanding insurance antidiscrimination law. South
California Law Rev 87:195
Avraham R, Logue KD, Schwarcz D (2014) Towards a universal framework for insurance anti-
discrimination laws. Connecticut Insurance Law J 21:1
Ayalon L, Tesch-Römer C (2018) Introduction to the section: Ageism–concept and origins.
Contemporary perspectives on ageism, pp 1–10
Ayer AJ (1972) Probability and evidence. Columbia University Press, New York
Azen R, Budescu DV (2003) The dominance analysis approach for comparing predictors in
multiple regression. Psychol Methods 8(2):129
Bachelard G (1927) Essai sur la connaissance approchée. Vrin
Backer DC (2017) Risk profiling in the auto insurance industry. Gracey-Backer, Inc Blog March
14
Baer BR, Gilbert DE, Wells MT (2019) Fairness criteria through the lens of directed acyclic
graphical models. arXiv 1906.11333
Bagdasaryan E, Poursaeed O, Shmatikov V (2019) Differential privacy has disparate impact on
model accuracy. Adv Neural Inf Process Syst 32:15479–15488
Bailey RA, Simon LJ (1959) An actuarial note on the credibility of experience of a single private
passenger car. Proc Casualty Actuarial Soc XLVI:159
Bailey RA, Simon LJ (1960) Two studies in automobile insurance ratemaking. ASTIN Bull J IAA
1(4):192–217
Baird IM (1994) Obesity and insurance risk. Pharmacoeconomics 5(1):62–65
Baker T (2011) Health insurance, risk, and responsibility after the Patient Protection and Affordable
Care Act. University of Pennsylvania Law Review 1577–1622
Baker T, McElrath K (1997) Insurance claims discrimination. In: Insurance redlining: Disinvest-
ment, reinvestment, and the evolving role of financial institutions, pp 141–156. The Urban
Institute Press, Washington, DC
Baker T, Simon J (2002) Embracing risk: the changing culture of insurance and responsibility.
University Chicago Press, Chicago
Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H (2000) Assessing the accuracy of
prediction algorithms for classification: an overview. Bioinformatics 16(5):412–424
Ban GY, Keskin NB (2021) Personalized dynamic pricing with machine learning: High-
dimensional features and heterogeneous elasticity. Manag Sci 67(9):5549–5568
Banerjee A, Guo X, Wang H (2005) On the optimality of conditional expectation as a bregman
predictor. IEEE Trans Inf Theory 51(7):2664–2669
Banham R (2015) Price optimization or price discrimination? regulators weigh in. Carrier
Management May 17
Barbosa JJR (2019) The business opportunities of implementing wearable based products in the
health and life insurance industries. PhD thesis, Universidade Católica Portuguesa
Barbour V (1911) Privateers and pirates of the West Indies. Am Hist Rev 16(3):529–566
Barocas S, Selbst AD (2016) Big data’s disparate impact. California Law Rev 104:671–732
Barocas S, Hardt M, Narayanan A (2017) Fairness in machine learning. Nips Tutor 1:2017
Barocas S, Hardt M, Narayanan A (2019) Fairness and machine learning. fairmlbook.org
Barry L (2020a) Insurance, big data and changing conceptions of fairness. Eur J Sociol 61:159–184
Barry L (2020b) L’invention du risque catastrophes naturelles. Chaire PARI, Document de Travail
18
Barry L, Charpentier A (2020) Personalization as a promise: Can big data change the practice of
insurance? Big Data Soc 7(1):2053951720935143
Bartik A, Nelson S (2016) Deleting a signal: Evidence from pre-employment credit checks. SSRN
2759560
Bartlett R, Morse A, Stanton R, Wallace N (2018) Consumer-lending discrimination in the era of
fintech. University of California, Berkeley, Working Paper
Bartlett R, Morse A, Stanton R, Wallace N (2021) Consumer-lending discrimination in the fintech
era. J Financ Econ 140:30–56
Bath C, Edgar K (2010) Time is money: Financial responsibility after prison. Prison Reform Trust,
London
Baumann J, Loi M (2023) Fairness and risk: An ethical argument for a group fairness definition
insurers can use. Philos Technol 36(3):45
Bayer PB (1986) Mutable characteristics and the definition of discrimination under Title VII. UC
Davis Law Rev 20:769
Bayes T (1763) An essay towards solving a problem in the doctrine of chances. Philos Trans Roy
Soc Lond (53):370–418
Becker GS (1957) The economics of discrimination. University of Chicago Press, Chicago
Beckett L (2014) Everything we know about what data brokers know about you. ProPublica June
13
Beider P (1987) Sex discrimination in insurance. J Appl Philos 4:65–75
Belhadji EB, Dionne G, Tarkhani F (2000) A model for the detection of insurance fraud. Geneva
papers on risk and insurance issues and practice, pp 517–538
Bélisle-Pipon JC, Vayena E, Green RC, Cohen IG (2019) Genetic testing, insurance discrimination
and medical research: what the United States can learn from peer countries. Nature Med
25(8):1198–1204
Bell ET (1945) The development of mathematics. Courier Corporation, Chelmsford
Bender M, Dill C, Hurlbert M, Lindberg C, Mott S (2022) Understanding potential influences of
racial bias on p&c insurance: Four rating factors explored. CAS research paper series on race
and insurance pricing
Beniger J (2009) The control revolution: Technological and economic origins of the information
society. Harvard University Press, Harvard
Benjamin B, Michaelson R (1988) Mortality differences between smokers and non-smokers. J Inst
Actuaries 115(3):519–525
Bennett M (1978) Models in motor insurance. J Staple Inn Actuarial Soc 22:134–160
Bergstrom CT, West JD (2021) Calling bullshit: the art of skepticism in a data-driven world.
Random House Trade Paperbacks
Berk R, Heidari H, Jabbari S, Joseph M, Kearns M, Morgenstern J, Neel S, Roth A (2017) A
convex framework for fair regression. arXiv 1706.02409
Berk R, Heidari H, Jabbari S, Kearns M, Roth A (2021a) Fairness in criminal justice risk
assessments: The state of the art. Sociol Methods Res 50(1):3–44
Berk RA, Kuchibhotla AK, Tchetgen ET (2021b) Improving fairness in criminal justice algorith-
mic risk assessments using optimal transport and conformal prediction sets. arXiv 2111.09211
Berkson J (1944) Application of the logistic function to bio-assay. J Am Stat Assoc 39(227):357–
365
Bernard DS, Farr SL, Fang Z (2011) National estimates of out-of-pocket health care expenditure
burdens among nonelderly adults with cancer: 2001 to 2008. J Clin Oncol 29(20):2821
Bernoulli J (1713) Ars conjectandi: opus posthumum: accedit Tractatus de seriebus infinitis; et
Epistola gallice scripta de ludo pilae reticularis. Impensis Thurnisiorum
Bernstein A (2013) What’s wrong with stereotyping. Arizona Law Rev 55:655
Bernstein E (2007) Temporarily yours: intimacy, authenticity, and the commerce of sex. University
of Chicago Press, Chicago
Bertillon A, Chervin A (1909) Anthropologie métrique: conseils pratiques aux missionnaires
scientifiques sur la manière de mesurer, de photographier et de décrire des sujets vivants et
des pièces anatomiques. Imprimerie nationale. Paris, France
Bertrand M, Duflo E (2017) Field experiments on discrimination. Handbook Econ Field Exp
1:309–393
Bertrand M, Mullainathan S (2004) Are Emily and Greg more employable than Lakisha and Jamal?
a field experiment on labor market discrimination. Am Econ Rev 94(4):991–1013
Besnard P, Grange C (1993) La fin de la diffusion verticale des goûts ? (prénoms de l'élite et du
vulgum). L’Année sociologique, pp 269–294
Besse P, del Barrio E, Gordaliza P, Loubes JM (2018) Confidence intervals for testing disparate
impact in fair learning. arXiv 1807.06362
Beutel A, Chen J, Doshi T, Qian H, Wei L, Wu Y, Heldt L, Zhao Z, Hong L, Chi EH, et al.
(2019) Fairness in recommendation ranking through pairwise comparisons. In: Proceedings of
the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining,
pp 2212–2220
Bhattacharya A (2015) Facebook patent: Your friends could help you get a loan - or not. CNN
Business 2015/08/04
Bickel PJ, Hammel EA, O’Connell JW (1975) Sex bias in graduate admissions: Data from
Berkeley. Science 187(4175):398–404
Bidadanure J (2017) Discrimination and age. In: Lippert-Rasmussen K (ed) Handbook of the ethics
of discrimination, Routledge, pp 243–253
Biddle D (2017) Adverse impact and test validation: A practitioner’s guide to valid and defensible
employment testing. Routledge
Biecek P, Burzykowski T (2021) Explanatory model analysis: explore, explain, and examine
predictive models. CRC Press, Boca Raton
Bielby WT, Baron JN (1986) Men and women at work: Sex segregation and statistical discrimina-
tion. Am J Sociol 91(4):759–799
Biemer PP, Christ SL (2012) Weighting survey data. In: International handbook of survey
methodology, Routledge, pp 317–341
Bigot R, Cayol A (2020) Le droit des assurances en tableaux. Ellipses
Bigot R, Cocteau-Senn D, Charpentier A (2019) La protection des données personnelles en assurance :
dialogue du juriste avec l'actuaire. In: Netter E (ed) Regards sur le nouveau droit des données
personnelles, CEPRISCA, collection Colloques
Billingsley P (2008) Probability and measure. Wiley, New York
Birnbaum B (2020) Insurance consumer protection issues resulting from, or heightened by,
COVID-19. Center for Economic Justice Report
Blanchet P (2017) Discriminations: combattre la glottophobie. Éditions Textuel
Blank RM, Dabady M, Citro CF (2004) Measuring racial discrimination. National Academies
Press, Washington, D.C.
Blanpain N (2018) L’espérance de vie par niveau de vie-méthode et principaux résultats. INSEE
Document de Travail F1801
Blier-Wong C, Cossette H, Lamontagne L, Marceau E (2021) Geographic ratemaking with spatial
embeddings. ASTIN Bull J IAA, 1–31
Blinder AS (1973) Wage discrimination: Reduced form and structural estimates. J Human Resour
8(4):436–455
Bloch M (1932) Noms de personne et histoire sociale. Annales d’histoire économique et sociales
4(13):67–69
Blodgett SL, O’Connor B (2017) Racial disparity in natural language processing: A case study of
social media African-American English. arXiv 1707.00061
Blumenbach JF (1775) De generis humani varietate nativa. Vandenhoek & Ruprecht
Boczar D, Avila FR, Carter RE, Moore PA, Giardi D, Guliyeva G, Bruce CJ, McLeod CJ, Forte AJ
(2021) Using facial recognition tools for health assessment. Plastic Surg Nurs 41(2):112–116
Bohren JA, Haggag K, Imas A, Pope DG (2019) Inaccurate statistical discrimination: An
identification problem. Tech. rep., National Bureau of Economic Research
Bolton LE, Warlop L, Alba JW (2003) Consumer perceptions of price (un) fairness. J Consumer
Res 29(4):474–491
Bongers S, Forré P, Peters J, Mooij JM (2021) Foundations of structural causal models with cycles
and latent variables. Ann Stat 49(5):2885–2915
Bonnefon JF (2019) La voiture qui en savait trop. L’intelligence artificielle a-t-elle une morale?
Humensciences Editions
Boonekamp C, Donaldson D (1979) Certain alternatives for price uncertainty. Canad J Econ Revue
canadienne d’Economique 12(4):718–728
Boonen TJ, Liu F (2022) Insurance with heterogeneous preferences. J Math Econ 102:102742
Borch K (1962) Application of game theory to some problems in automobile insurance. ASTIN
Bull J IAA 2(2):208–221
Borgelt C, Steinbrecher M, Kruse RR (2009) Graphical models: representations for learning,
reasoning and data mining. Wiley, New York
Borges JL (1946) Del rigor en la ciencia. Los Anales de Buenos Aires
Borkan D, Dixon L, Sorensen J, Thain N, Vasserman L (2019) Nuanced metrics for measuring
unintended bias with real data for text classification. In: Companion Proceedings of the 2019
World Wide Web Conference, pp 491–500
Bornstein S (2018) Antidiscriminatory algorithms. Alabama Law Rev 70:519
Bosmajian HA (1974) The language of oppression, vol 10. Public Affairs Press, New York
Bouk D (2015) How our days became numbered: risk and the rise of the statistical individual. The
University of Chicago Press, Chicago
Bouk D (2022) Democracy’s data: the hidden stories in the U.S. census and how to read them.
MCD
Bourdieu P (2018) Distinction a social critique of the judgement of taste. In: Inequality classic
readings in race, class, and gender, Routledge, pp 287–318
Bowles S, Gintis H (2004) The evolution of strong reciprocity: cooperation in heterogeneous
populations. Theor Popul Biol 65(1):17–28
Box GE, Luceño A, del Carmen Paniagua-Quinones M (2011) Statistical control by monitoring
and adjustment, vol 700. Wiley, New York
Boxill BR (1992) Blacks and social justice. Rowman & Littlefield, Lanham
Boyd D, Levy K, Marwick A (2014) The networked nature of algorithmic discrimination. Data
and Discrimination: Collected Essays Open Technology Institute
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
Brams SJ, Brams SJ, Taylor AD (1996) Fair division: from cake-cutting to dispute resolution.
Cambridge University Press, Cambridge
Brant S (1494) Das Narrenschiff. von Jakob Locher
Breiman L (1995) Better subset regression using the nonnegative garrote. Technometrics
37(4):373–384
Breiman L (1996a) Bagging predictors. Mach Learn 24:123–140
Breiman L (1996b) Bias, variance, and arcing classifiers. Tech. rep., University of California,
Berkeley
Breiman L (1996c) Stacked regressions. Mach Learn 24:49–64
Breiman L (2001) Random forests. Mach Learn 45:5–32
Breiman L, Stone C (1977) Parsimonious binary classification trees. Technical report, Technology
Service Corporation, Santa Monica, CA
Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and regression trees. Taylor &
Francis, London
Brenier Y (1991) Polar factorization and monotone rearrangement of vector-valued functions.
Commun Pure Appl Math 44(4):375–417
Brilmayer L, Hekeler RW, Laycock D, Sullivan TA (1979) Sex discrimination in employer-
sponsored insurance plans: A legal and demographic analysis. Univ Chicago Law Rev 47:505
Brilmayer L, Laycock D, Sullivan TA (1983) The efficient use of group averages as nondiscrimi-
nation: A rejoinder to professor benston. Univ Chicago Law Rev 50(1):222–249
Bröcker J (2009) Reliability, sufficiency, and the decomposition of proper scores. Q J Roy Meteorol
Soc J Atmos Sci Appl Meteorol Phys Oceanogr 135(643):1512–1519
Brockett PL, Golden LL (2007) Biological and psychobehavioral correlates of credit scores and
automobile insurance losses: Toward an explication of why credit scoring works. J Risk
Insurance 74(1):23–63
Brockett PL, Xia X, Derrig RA (1998) Using Kohonen's self-organizing feature map to uncover
automobile bodily injury claims fraud. J Risk Insurance, 245–274
Brosnan SF (2006) Nonhuman species’ reactions to inequity and their implications for fairness.
Social Justice Res 19(2):153–185
Brown RL, Charters D, Gunz S, Haddow N (2007) Colliding interests–age as an automobile
insurance rating variable: Equitable rate-making or unfair discrimination? J Bus Ethics
72(2):103–114
Brown RS, Moon M, Zoloth BS (1980) Incorporating occupational attainment in studies of male-
female earnings differentials. J Human Res, 3–28
Browne S (2015) Dark matters: On the surveillance of blackness. Duke University Press, Durham
Brownstein M, Saul J (2016a) Implicit bias and philosophy, volume 1: Metaphysics and epistemol-
ogy. Oxford University Press, Oxford
Brownstein M, Saul J (2016b) Implicit bias and philosophy, volume 2: Moral responsibility,
structural injustice, and ethics. Oxford University Press, Oxford
Brualdi RA (2006) Combinatorial matrix classes, vol 13. Cambridge University Press, Cambridge
Brubaker R (2015) Grounds for difference. Harvard University Press, Harvard
Brudno B (1976) Poverty, inequality, and the law. West Publishing Company, Eagan
Bruner JS (1957) Going beyond the information given. In: Bruner J, Brunswik E, Festinger L,
Heider F, Muenzinger K, Osgood C, Rapaport D (eds) Contemporary approaches to cognition,
pp 119–160. Harvard University Press, Harvard
Brunet G, Bideau A (2000) Surnames: history of the family and history of populations. Hist Family
5(2):153–160
Buchanan R, Priest C (2006) Deductible. Encyclopedia of Actuarial Science
Budd LP, Moorthi RA, Botha H, Wicks AC, Mead J (2021) Automated hiring at Amazon.
Universiteit van Amsterdam E-0470
Bugbee M, Matthews B, Callanan S, Ewert J, Guven S, Boison L, Liao C (2014) Price optimization
overview. Casualty Actuarial Society
Bühlmann H, Gisler A (2005) A course in credibility theory and its applications, vol 317. Springer,
New York
Buntine WL, Weigend AS (1991) Bayesian back-propagation. Complex Syst 5:603–643
Buolamwini J, Gebru T (2018) Gender shades: Intersectional accuracy disparities in commercial
gender classification. In: Conference on Fairness, Accountability and Transparency, Proceed-
ings of Machine Learning Research, pp 77–91
Burgdorf MP, Burgdorf Jr R (1974) A history of unequal treatment: The qualifications of
handicapped persons as a suspect class under the equal protection clause. Santa Clara Lawyer
15:855
Butler P, Butler T (1989) Driver record: A political red herring that reveals the basic flaw in
automobile insurance pricing. J Insurance Regulat 8(2):200–234
Butler RN (1969) Age-ism: Another form of bigotry. Gerontologist 9(4_Part_1):243–246
Cain GG (1986) The economic analysis of labor market discrimination: A survey. Handbook Labor
Econ 1:693–785
Calders T, Jaroszewicz S (2007) Efficient AUC optimization for classification. In: Knowledge
Discovery in Databases: PKDD 2007: 11th European Conference on Principles and Practice
of Knowledge Discovery in Databases, Warsaw, Poland, September 17–21, 2007. Proceedings
11, pp 42–53. Springer, New York
Calders T, Verwer S (2010) Three naive Bayes approaches for discrimination-free classification.
Data Mining Knowl Discovery 21(2):277–292
Calders T, Žliobaite I (2013) Why unbiased computational processes can lead to discriminative
decision procedures. In: Discrimination and privacy in the information society, pp 43–57.
Springer, New York
Calisher CH (2007) Taxonomy: what's in a name? Doesn't a rose by any other name smell as sweet?
Croatian Med J 48(2):268
Callahan A (2021) Is BMI a scam? The New York Times May 18th
Calmon FP, Wei D, Ramamurthy KN, Varshney KR (2017) Optimized data pre-processing for
discrimination prevention. arXiv 1704.03354
Cameron J (2004) Calibration - I. Encyclopedia of Statistical Sciences, 2nd edn.
Campbell M (1986) An integrated system for estimating the risk premium of individual car models
in motor insurance. ASTIN Bull J IAA 16(2):165–183
Candille G, Talagrand O (2005) Evaluation of probabilistic prediction systems for a scalar variable.
Q J Roy Meteor Soc J Atmos Sci Appl Meteo Phys Oceanogr 131(609):2131–2150
Cantwell GT, Kirkley A, Newman MEJ (2021) The friendship paradox in real and model networks.
J Complex Netw 9(2):cnab011
Cao D, Chen C, Piccirilli M, Adjeroh D, Bourlai T, Ross A (2011) Can facial metrology predict
gender? In: 2011 International Joint Conference on Biometrics (IJCB), pp 1–8. IEEE
Cardano G (1564) Liber de ludo aleae. Franco Angeli
Cardon D (2019) Culture numérique. Presses de Sciences Po
Carey AN, Wu X (2022) The causal fairness field guide: Perspectives from social and formal
sciences. Front Big Data 5. https://doi.org/10.3389/fdata.2022.892837
Carlier G, Chernozhukov V, Galichon A (2016) Vector quantile regression: an optimal transport
approach. Ann Stat 44:1165–1192
Carnis L, Lassarre S (2019) Politique et management de la sécurité routière. In: Laurent C,
Catherine G, Marie-Line G (eds) La sécurité routière en France, Quand la recherche fait son
bilan et trace des perspectives, L’Harmattan
Carpusor AG, Loges WE (2006) Rental discrimination and ethnicity in names. J Appl Soc Psychol
36(4):934–952
Carrasco V (2007) Le pacte civil de solidarité: une forme d’union qui se banalise. Infostat Justice
97(4)
Cartwright N (1983) How the laws of physics lie. Oxford University Press, Oxford
Casella G, Berger RL (1990) Statistical Inference. Duxbury Advanced Series
Casey B, Pezier J, Spetzler C (1976) The role of risk classification in property and casualty
insurance: a study of the risk assessment process : final report. Stanford Research Institute,
Stanford
Cassedy JH (2013) Demography in early America. Harvard University Press, Harvard
Castelvecchi D (2016) Can we open the black box of AI? Nature News 538(7623):20
Caton S, Haas C (2020) Fairness in machine learning: A survey. arXiv 2010.04053
Cavanagh M (2002) Against equality of opportunity. Clarendon Press, Oxford, England
Central Bank of Ireland (2021) Review of differential pricing in the private car and home insurance
markets. Central Bank of Ireland Publications, Dublin, Ireland
Chakraborty S, Raghavan KR, Johnson MP, Srivastava MB (2013) A framework for context-aware
privacy of sensor data on mobile systems. In: Proceedings of the 14th Workshop on Mobile
Computing Systems and Applications, Association for Computing Machinery, HotMobile ’13
Chardenon A (2019) Voici Maxime, le chatbot juridique d'Axa, fruit d'une démarche collaborative.
L’usine digitale 12 février
Charles KK, Guryan J (2011) Studying discrimination: Fundamental challenges and recent
progress. Annu Rev Econ 3(1):479–511
Charpentier A (2014) Computational actuarial science with R. CRC Press, Boca Raton
Charpentier A, Flachaire E, Ly A (2018) Econometrics and machine learning. Economie et
Statistique 505(1):147–169
Charpentier A, Élie R, Remlinger C (2021) Reinforcement learning in economics and finance.
Comput Econ 10014
Charpentier A, Flachaire E, Gallic E (2023a) Optimal transport for counterfactual estimation: A
method for causal inference. In: Thach NN, Kreinovich V, Ha DT, Trung ND (eds) Optimal
transport statistics for economics and related topics. Springer, New York
Charpentier A, Hu F, Ratz P (2023b) Mitigating discrimination in insurance with Wasserstein
barycenters. BIAS, 3rd Workshop on Bias and Fairness in AI, International Workshop of ECML
PKDD
Chassagnon A (1996) Sélection adverse: modèle générique et applications. PhD thesis, Paris,
EHESS
Chassonnery-Zaïgouche C (2020) How economists entered the ‘numbers game’: Measuring
discrimination in the US courtrooms, 1971–1989. J Hist Econ Thought 42(2):229–259
Chatterjee S, Barcun S (1970) A nonparametric approach to credit screening. J Am Stat Assoc
65(329):150–154
Chaufton A (1886) Les assurances, leur passé, leur présent, leur avenir, au point de vue rationnel,
technique et pratique, moral, économique et social, financier et administratif, légal, législatif et
contractuel, en France et à l’étranger. Chevalier-Marescq
Chen SX (1999) Beta kernel estimators for density functions. Comput Stat Data Anal 31(2):131–
145
Chen Y, Liu Y, Zhang M, Ma S (2017) User satisfaction prediction with mouse movement
information in heterogeneous search environment. IEEE Trans Knowl Data Eng 29(11):2470–
2483
Cheney-Lippold J (2017) We are data. New York University Press, New York
Cheng M, De-Arteaga M, Mackey L, Kalai AT (2023) Social norm bias: residual harms of fairness-
aware algorithms. Data Mining and Knowledge Discovery, pp 1–27
Chetty R, Stepner M, Abraham S, Lin S, Scuderi B, Turner N, Bergeron A, Cutler D (2016)
The association between income and life expectancy in the United States, 2001–2014. JAMA
315(16):1750–1766
Cheung I, McCartt AT (2011) Declines in fatal crashes of older drivers: Changes in crash risk and
survivability. Accident Anal Prevent 43(3):666–674
Chiappa S (2019) Path-specific counterfactual fairness. Proc AAAI Confer Artif Intell
33(01):7801–7808
Chicco D, Jurman G (2020) The advantages of the Matthews correlation coefficient (MCC) over F1
score and accuracy in binary classification evaluation. BMC Genom 21(1):1–13
Chollet F (2021) Deep learning with Python. Simon and Schuster, New York
Chouldechova A (2017) Fair prediction with disparate impact: A study of bias in recidivism
prediction instruments. Big Data 5(2):153–163
Christensen CM, Dillon K, Hall T, Duncan DS (2016) Competing against luck: The story of
innovation and customer choice. Harper Business, New York
Churchill G, Nevin JR, Watson RR (1977) The role of credit scoring in the loan decision. Credit
World 3(3):6–10
Cinelli C, Hazlett C (2020) Making sense of sensitivity: Extending omitted variable bias. J Roy
Stat Soc B (Stat Methodol) 82(1):39–67
Clark G, Clark GW (1999) Betting on lives: the culture of life insurance in England, 1695–1775.
Manchester University Press, Manchester
Clarke DD, Ward P, Bartle C, Truman W (2010) Older drivers’ road traffic crashes in the UK.
Accident Anal Prevent 42(4):1018–1024
Cohen I, Goldszmidt M (2004) Properties and benefits of calibrated classifiers. In: 8th European
Conference on Principles and Practice of Knowledge Discovery in Databases, vol 3202,
pp 125–136. Springer, New York
Cohen J (1960) A coefficient of agreement for nominal scales. Educat Psychol Measur 20(1):37–46
Cohen JE (1986) An uncertainty principle in demography and the unisex issue. Am Stat 40(1):32–
39
Coldman AJ, Braun T, Gallagher RP (1988) The classification of ethnic status using name
information. J Epidemiol Community Health 42(4):390–395
Collins BW (2007) Tackling unconscious bias in hiring practices: The plight of the Rooney Rule.
New York University Law Rev 82:870
Collins E (2018) Punishing risk. Georgetown Law J 107:57
de Condorcet N (1785) Essai sur l’application de l’analyse à la probabilité des décisions rendues à
la pluralité des voix. Imprimerie royale, Paris
Constine J (2017) Facebook rolls out AI to detect suicidal posts before they’re reported. Techcrunch
November 27
Conway DA, Roberts HV (1983) Reverse regression, fairness, and employment discrimination. J
Bus Econ Stat 1(1):75–85
Cook TD, Campbell DT, Shadish W (2002) Experimental and quasi-experimental designs for
generalized causal inference. Houghton Mifflin Boston, MA
Cooper DN, Krawczak M, Polychronakos C, Tyler-Smith C, Kehrer-Sawatzki H (2013) Where
genotype is not predictive of phenotype: towards an understanding of the molecular basis of
reduced penetrance in human inherited disease. Human Genetics 132:1077–1130
Cooper PJ (1990) Differences in accident characteristics among elderly drivers and between elderly
and middle-aged drivers. Accident Anal Prevention 22(5):499–508
Corbett-Davies S, Pierson E, Feller A, Goel S, Huq A (2017) Algorithmic decision making and the
cost of fairness. arXiv 1701.08230
Corlier F (1998) Segmentation : le point de vue de l’assureur. In: Cousy H, Classens H,
Van Schoubroeck C (eds) Compétitivité, éthique et assurance. Academia Bruylant
Cornell B, Welch I (1996) Culture, information, and screening discrimination. J Polit Econ
104(3):542–571
Correll J, Judd CM, Park B, Wittenbrink B (2010) Measuring prejudice, stereotypes and discrimi-
nation. The SAGE handbook of prejudice, stereotyping and discrimination, pp 45–62
Correll SJ, Benard S (2006) Biased estimators? comparing status and statistical theories of gender
discrimination. In: Advances in group processes, vol 23, pp 89–116. Emerald Group Publishing
Limited, Leeds, England
Cortina A (2022) Aporophobia: why we reject the poor instead of helping them. Princeton
University Press, Princeton
Côté O (2023) Methodology applied to build a non-discriminatory general insurance rate according
to a pre-specified sensitive variable. MSc Thesis, Université Laval
Côté O, Côté MP, Charpentier’ A (2023) A fair price to pay: exploiting directed acyclic graphs for
fairness in insurance. Mimeo
Cotter A, Jiang H, Gupta MR, Wang S, Narayan T, You S, Sridharan K (2019) Optimization with
non-differentiable constraints with applications to fairness, recall, churn, and other goals. J
Mach Learn Res 20(172):1–59
Cotton J (1988) On the decomposition of wage differentials. Rev Econ Stat, 236–243
Coulmont B, Simon P (2019) Quels prénoms les immigrés donnent-ils à leurs enfants en France?
Populat Soc (4):1–4
Council of the European Union (2004) Council Directive 2004/113/EC of 13 December 2004
implementing the principle of equal treatment between men and women in the access to and
supply of goods and services. Official J Eur Union 373:37–43
Cournot AA (1843) Exposition de la théorie des chances et des probabilités. Hachette
Coutts S (2016) Anti-choice groups use smartphone surveillance to target ‘abortion-minded
women’during clinic visits. Rewire News Group May 25
Cowell F (2011) Measuring inequality. Oxford University Press, Oxford
Cragg JG (1971) Some statistical models for limited dependent variables with application to the
demand for durable goods. Econometrica J Econometric Soc, 829–844
Crawford JT, Leynes PA, Mayhorn CB, Bink ML (2004) Champagne, beer, or coffee? a corpus of
gender-related and neutral words. Behav Res Methods Instrum Comput 36:444–458
Cresta J, Laffont J (1982) The value of statistical information in insurance contracts. GREMAQ
Working Paper 8212
Crizzle AM, Classen S, Uc EY (2012) Parkinson disease and driving: an evidence-based review.
Neurology 79(20):2067–2074
Crocker KJ, Snow A (2013) The theory of risk classification. In: Loubergé H, Dionne G (eds)
Handbook of insurance, pp 281–313. Springer, New York
Crossney KB (2016) Redlining. https://philadelphiaencyclopedia.org/essays/redlining/
Cudd AE, Jones LE (2005) Sexism. A companion to applied ethics, pp 102–117
Cummins JD, Smith BD, Vance RN, Vanderhel J (2013) Risk classification in life insurance, vol 1.
Springer Science & Business Media, New York
Cunha HS, Sclauser BS, Wildemberg PF, Fernandes EAM, Dos Santos JA, Lage MdO, Lorenz
C, Barbosa GL, Quintanilha JA, Chiaravalloti-Neto F (2021) Water tank and swimming pool
detection based on remote sensing and deep learning: Relationship with socioeconomic level
and applications in dengue control. Plos One 16(12):e0258681
Cunningham S (2021) Causal inference. Yale University Press, Yale
Cybenko G (1989) Approximation by superpositions of a sigmoidal function. Math Control Signals
Syst 2(4):303–314
Czerniawski AM (2007) From average to ideal: The evolution of the height and weight table in the
United States, 1836–1943. Soc Sci Hist 31(2):273–296
Da Silva N (2023) La bataille de la Sécu: une histoire du système de santé. La fabrique éditions
Dalenius T (1977) Towards a methodology for statistical disclosure control. Statistisk Tidskrift
15:429–444
Dalziel JR, Job RS (1997) Motor vehicle accidents, fatigue and optimism bias in taxi drivers.
Accident Analy Prevent 29(4):489–494
Dambrum M, Despres G, Guimond S (2003) On the multifaceted nature of prejudice: Psy-
chophysiological responses to ingroup and outgroup ethnic stimuli. Current Res Soc Psychol
8(14):187–206
Dane SM (2006) The potential for racial discrimination by homeowners insurers through the use
of geographic rating territories. J Insurance Regulat 24(4):21
Daniel JE, Daniel JL (1998) Preschool children’s selection of race-related personal names. J Black
Stud 28(4):471–490
Daniel WW, et al. (1968) Racial discrimination in England: based on the PEP report. Penguin
Books, Harmondsworth
Daniels N (1990) Insurability and the hiv epidemic: ethical issues in underwriting. Milbank Q,
497–525
Daniels N (1998) Am I my parents’ keeper? An essay on justice between the young and the old.
Oxford University Press, Oxford
Dar-Nimrod I, Heine SJ (2011) Genetic essentialism: on the deceptive determinism of DNA. Psychol
Bull 137(5):800
Darlington RB (1971) Another look at ‘cultural fairness’ 1. J Educat Measur 8(2):71–82
Daston L (1992) Objectivity and the escape from perspective. Soc Stud Sci 22(4):597–618
Davenport T (2006) Competing on analytics. Harvard Bus Rev 84:1–10
David H (2015) Why are there still so many jobs? The history and future of workplace automation.
J Econ Perspect 29(3):3–30
Davidson R, MacKinnon JG, et al. (2004) Econometric theory and methods, vol 5. Oxford
University Press, New York
Davis GA (2004) Possible aggregation biases in road safety research and a mechanism approach
to accident modeling. Accident Anal Prevent 36(6):1119–1127
Dawid AP (1979) Conditional independence in statistical theory. J Roy Stat Soc B (Methodologi-
cal) 41(1):1–15
Dawid AP (1982) The well-calibrated Bayesian. J Am Stat Assoc 77(379):605–610
Dawid AP (2000) Causal inference without counterfactuals. J Am Stat Assoc 95(450):407–424
Dawid AP (2004) Probability forecasting. Encyclopedia of Statistical Sciences 10
De Alba E (2004) Bayesian claims reserving. Encyclopedia of Actuarial Science
De Baere G, Goessens E (2011) Gender differentiation in insurance contracts after the judgment in
Case C-236/09, Association Belge des Consommateurs Test-Achats ASBL v. Conseil des Ministres.
Colum J Eur L 18:339
De Pril N, Dhaene J (1996) Segmentering in verzekeringen. DTEW Research Report 9648, pp 1–56
De Wit GW, Van Eeghen J (1984) Rate making and society’s sense of fairness. ASTIN Bull J IAA
14(2):151–163
De Witt J (1671) Value of life annuities in proportion to redeemable annuities. Originally in Dutch;
translated in Hendriks (1853), pp 232–249
Dean LT, Nicholas LH (2018) Using credit scores to understand predictors and consequences. Am
J Public Health 108(11):1503–1505
Dean LT, Schmitz KH, Frick KD, Nicholas LH, Zhang Y, Subramanian S, Visvanathan K (2018)
Consumer credit as a novel marker for economic burden and health after cancer in a diverse
population of breast cancer survivors in the USA. J Cancer Survivorship 12(3):306–315
Debet A (2007) Mesure de la diversité et protection des données personnelles. Commission
Nationale de l’Informatique et des Libertés 16/05/2007 08:40 DECO/IRC
Défenseur des droits (2020) Algorithmes: prévenir l’automatisation des discriminations. https://
www.defenseurdesdroits.fr/sites/default/files/2023-07/ddd_rapport_algorithmes_2020_EN_
20200531.pdf
Dehejia RH, Wahba S (1999) Causal effects in nonexperimental studies: Reevaluating the
evaluation of training programs. J Am Stat Assoc 94(448):1053–1062
Delaporte P (1962) Sur l’efficacité des critères de tarification de l’assurance contre les accidents
d’automobiles. ASTIN Bull J IAA 2(1):84–95
Delaporte PJ (1965) Tarification du risque individuel d’accidents d’automobiles par la prime
modelée sur le risque. ASTIN Bull J IAA 3(3):251–271
Demakakos P, Biddulph JP, Bobak M, Marmot MG (2016) Wealth and mortality at older ages: a
prospective cohort study. J Epidemiol Community Health 70(4):346–353
Dennis RM (2004) Racism. In: Kuper A, Kuper J (eds) The social science encyclopedia, Routledge
Denuit M, Charpentier A (2004) Mathématiques de l’assurance non-vie: Tome I Principes
fondamentaux de théorie du risque. Economica. Paris, France
Denuit M, Charpentier A (2005) Mathématiques de l’assurance non-vie: Tome II Tarification et
provisionnement. Economica. Paris, France
Denuit M, Maréchal X, Pitrebois S, Walhin JF (2007) Actuarial modelling of claim counts: Risk
classification, credibility and bonus-malus systems. Wiley, New York
Denuit M, Hainault D, Trufin J (2019a) Effective statistical learning methods for actuaries I (GLMs
and extensions). Springer, New York
Denuit M, Hainault D, Trufin J (2019b) Effective statistical learning methods for actuaries III
(neural networks and extensions). Springer, New York
Denuit M, Hainault D, Trufin J (2020) Effective statistical learning methods for actuaries II (tree-
based methods and extensions). Springer, New York
Denuit M, Charpentier A, Trufin J (2021) Autocalibration and tweedie-dominance for insurance
pricing with machine learning. Insurance Math Econ
Depoid P (1967) Applications de la statistique aux assurances accidents et dommages: cours
professé à l'Institut de statistique de l'Université de Paris, 2e édition revue et augmentée.
Berger-Levrault
Derrig RA, Ostaszewski KM (1995) Fuzzy techniques of pattern recognition in risk and claim
classification. J Risk Insurance, 447–482
Derrig RA, Weisberg HI (1998) AIB PIP claim screening experiment final report. Understanding
and improving the claim investigation process. AIB Filing on Fraudulent Claims Payment
Desrosières A (1998) The politics of large numbers: A history of statistical reasoning. Harvard
University Press, Harvard
Devine PG (1989) Stereotypes and prejudice: Their automatic and controlled components. J
Personality Soc Psychol 56(1):5
Dice LR (1945) Measures of the amount of ecologic association between species. Ecology
26(3):297–302
Dierckx G (2006) Logistic regression model. Encyclopedia of Actuarial Science
Dieterich W, Mendoza C, Brennan T (2016) COMPAS risk scales: Demonstrating accuracy equity
and predictive parity. Northpointe Inc 7(7.4):1
Dilley S, Greenwood G (2017) Abandoned 999 calls to police more than double. BBC 19
September 2017
DiNardo J (2016) Natural experiments and quasi-natural experiments, pp 1–12. Palgrave Macmil-
lan UK, London
DiNardo J, Fortin N, Lemieux T (1995) Labor market institutions and the distribution of wages,
1973–1992: A semiparametric approach. National Bureau of Economic Research (NBER)
Dingman H (1927) Insurability, prognosis and selection. The Spectator Company
Dinur R, Beit-Hallahmi B, Hofman JE (1996) First names as identity stereotypes. J Soc Psychol
136(2):191–200
Dionne G (2000) Handbook of insurance. Springer, New York
Dionne G (2013) Contributions to insurance economics. Springer, New York
Dionne G, Harrington SE (1992) An introduction to insurance economics. Springer, New York
Dobbin F (2001) Do the social sciences shape corporate anti-discrimination practice: The United
States and France. Comparative Labor Law Pol J 23:829
Donoghue JD (1957) An Eta community in Japan: the social persistence of outcaste groups. Am
Anthropol 59(6):1000–1017
Dorlin E (2005) Sexe, genre et intersexualité: la crise comme régime théorique. Raisons Politiques
2:117–137
Dostie G (1974) Entrevue de Michèle Lalonde. Le Journal 1er juin 1974
Dressel J, Farid H (2018) The accuracy, fairness, and limits of predicting recidivism. Sci Adv
4(1):eaao5580
Du Bois W (1896) Review of race traits and tendencies of the American Negro. Ann Am Acad,
127–33
Duan T, Anand A, Ding DY, Thai KK, Basu S, Ng A, Schuler A (2020) NGBoost: Natural gradient
boosting for probabilistic prediction. In: International Conference on Machine Learning,
PMLR, pp 2690–2700
Dubet F (2014) La Préférence pour l’inégalité. Comprendre la crise des solidarités: Comprendre la
crise des solidarités. Seuil - La République des idées
Dubet F (2016) Ce qui nous unit : Discriminations, égalité et reconnaissance. Seuil - La République
des idées
Dublin L (1925) Report of the joint committee on mortality of the association of life insurance
medical directors. The Actuarial Society of America
Dudley RM (2010) Distances of probability measures and random variables. In: Selected works of
RM dudley, pp 28–37. Springer, New York
Duggan JE, Gillingham R, Greenlees JS (2008) Mortality and lifetime income: evidence from US
Social Security records. IMF Staff Papers 55(4):566–594
Duhigg C (2019) How companies learn your secrets. The New York Times 02-16-2019
Duivesteijn W, Feelders A (2008) Nearest neighbour classification with monotonicity constraints.
In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases,
pp 301–316. Springer, New York
Dulisse B (1997) Older drivers and risk to other road users. Accident Anal Prevent 29(5):573–582
Dumas A, Allodji R, Fresneau B, Valteau-Couanet D, El-Fayech C, Pacquement H, Laprie A,
Nguyen TD, Bondiau PY, Diallo I, et al. (2017) The right to be forgotten: a change in access to
insurance and loans after childhood cancer? J Cancer Survivorship 11:431–437
Duncan A, McPhail M (2013) Price optimization for the US market: techniques and implementation
strategies. In: Ratemaking and Product Management Seminar
Duncan C, Loretto W (2004) Never the right age? gender and age-based discrimination in
employment. Gender Work Organization 11(1):95–115
Durkheim É (1897) Le suicide: étude sociologique. Félix Alcan Editeur
Durry G (2001) La sélection de la clientèle par l’assureur : aspects juridiques. Risques 45:65–71
Dwivedi M, Malik HS, Omkar S, Monis EB, Khanna B, Samal SR, Tiwari A, Rathi A (2021) Deep
learning-based car damage classification and detection. In: Advances in Artificial Intelligence
and Data Engineering: Select Proceedings of AIDE 2019, pp 207–221. Springer, New York
Dwork C, Hardt M, Pitassi T, Reingold O, Zemel R (2012) Fairness through awareness. In: Pro-
ceedings of the 3rd Innovations in Theoretical Computer Science Conference, vol 1104.3913,
pp 214–226
Dwoskin E (2018) Facebook is rating the trustworthiness of its users on a scale from zero to one.
Washington Post 21-08
Eco U (1992) Comment voyager avec un saumon. Grasset
Edgeworth FY (1922) Equal pay to men and women for equal work. Econ J 32(128):431–457
Edwards J (1932) Ten years of rates and rating bureaus in Ontario, applied to automobile insurance.
Proc Casualty Actuarial Soc 19:22–64
Eidelson B (2015) Discrimination and disrespect. Oxford University Press, Oxford
Eidinger E, Enbar R, Hassner T (2014) Age and gender estimation of unfiltered faces. IEEE Trans
Inf Forens Secur 9(12):2170–2179
Eisen R, Eckles DL (2011) Insurance economics. Springer, New York
Ekeland I (1995) Le chaos. Flammarion
England P (1994) Neoclassical economists’ theories of discrimination. In: Equal employment
opportunity: Labor market discrimination and public policy, Aldine de Gruyter, pp 59–70
Ensmenger N (2015) “Beards, sandals, and other signs of rugged individualism”: masculine culture
within the computing professions. Osiris 30(1):38–65
Epstein L, King G (2002) The rules of inference. The University of Chicago Law Review, pp 1–133
Erwin C, Williams JK, Juhl AR, Mengeling M, Mills JA, Bombard Y, Hayden MR, Quaid
K, Shoulson I, Taylor S, et al. (2010) Perception, experience, and response to genetic
discrimination in Huntington disease: The international RESPOND-HD study. Am J Med Genet
B Neuropsychiatric Genet 153(5):1081–1093
Erwin PG (1995) A review of the effects of personal name stereotypes. Representative Research in
Social Psychology
European Commission (1995) Directive 95/46/EC of the European Parliament and of the Council of
24 October 1995 on the protection of individuals with regard to the processing of personal data
and on the free movement of such data. Official J Eur Communit 38(281):31–50
Council of the European Union (2018) Proposal for a Council directive on implementing the principle
of equal treatment between persons irrespective of religion or belief, disability, age or sexual
orientation. Proceedings of Council of the European Union 11015/08
Ewald F (1986) Histoire de l’Etat providence: les origines de la solidarité. Grasset
Eze EC (1997) Race and the Enlightenment: A reader. Wiley, New York
Fagyal Z (2010) Accents de banlieue. Aspects prosodiques du français populaire en contact avec
les langues de l’immigration, L’Harmattan
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties.
J Am Stat Assoc 96(456):1348–1360
Farbmacher H, Huber M, Lafférs L, Langen H, Spindler M (2022) Causal mediation analysis with
double machine learning. Economet J 25(2):277–300
Farebrother RW (1976) Further results on the mean square error of ridge regression. J Roy Stat
Soc B (Methodological), 248–250
Feder A, Oved N, Shalit U, Reichart R (2021) causalm: Causal model explanation through
counterfactual language models. Comput Linguist 47(2):333–386
Feeley MM, Simon J (1992) The new penology: Notes on the emerging strategy of corrections and
its implications. Criminology 30(4):449–474
Feinberg J (1970) Justice and personal desert. In: Feinberg J (ed) Doing and deserving. Princeton
University Press, Princeton
Feine J, Gnewuch U, Morana S, Maedche A (2019) Gender bias in chatbot design. In: International
workshop on chatbot research and design, pp 79–93. Springer, New York
Feiring E (2009) Reassessing insurers' access to genetic information: genetic privacy, ignorance,
and injustice. Bioethics 23(5):300–310
Feld SL (1991) Why your friends have more friends than you do. Am J Sociol 96(6):1464–1477
Feldblum S (2006) Risk classification, pricing aspects. Encyclopedia of actuarial science
Feldblum S, Brosius JE (2003) The minimum bias procedure: A practitioner’s guide. In: Pro-
ceedings of the Casualty Actuarial Society, Casualty Actuarial Society Arlington, vol 90, pp
196–273
Feldman F (1995) Desert: Reconsideration of some received wisdom. Mind 104(413):63–77
Feldman M, Friedler SA, Moeller J, Scheidegger C, Venkatasubramanian S (2015) Certifying
and removing disparate impact. In: Proceedings of the 21th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, vol 1412.3756, pp 259–268
Feller A, Pierson E, Corbett-Davies S, Goel S (2016) A computer program used for bail and
sentencing decisions was labeled biased against blacks. It's actually not that clear. The
Washington Post October 17
Feller W (1957) An introduction to probability theory and its applications. Wiley, New York
Feng R (2023) Decentralized insurance. Springer, New York
Fenton N, Neil M (2018) Risk assessment and decision analysis with Bayesian networks. CRC
Press, Boca Raton
Ferber MA, Green CA (1982a) Employment discrimination: Reverse regression or reverse logic.
Working Paper, University of Illinois, Champaign
Ferber MA, Green CA (1982b) Traditional or reverse sex discrimination? a case study of a large
public university. Ind Labor Relat Rev 35(4):550–564
Fermanian JD, Guegan D (2021) Fair learning with bagging. SSRN 3969362
Finger RJ (2006) Risk classification. In: Bass I, Basson S, Bashline D, Chanzit L, Gillam W,
Lotkowski E (eds) Foundations of Casualty Actuarial Science, Casualty Actuarial Society, pp
287–341
Finkelstein A, Taubman S, Wright B, Bernstein M, Gruber J, Newhouse JP, Allen H, Baicker K,
Group OHS (2012) The Oregon health insurance experiment: evidence from the first year. Q J
Econ 127(3):1057–1106
Finkelstein EA, Brown DS, Wrage LA, Allaire BT, Hoerger TJ (2010) Individual and aggregate
years-of-life-lost associated with overweight and obesity. Obesity 18(2):333–339
Finkelstein MO (1980) The judicial reception of multiple regression studies in race and sex
discrimination cases. Columbia Law Rev 80(4):737–754
Firpo SP (2017) Identifying and measuring economic discrimination. IZA World of Labor
Fiscella K, Fremont AM (2006) Use of geocoding and surname analysis to estimate race and
ethnicity. Health Ser Res 41(4p1):1482–1500
Fish HC (1868) The agent’s manual of life assurance. Wynkoop & Hallenbeck
Fisher A, Rudin C, Dominici F (2019) All models are wrong, but many are useful: Learning a
variable’s importance by studying an entire class of prediction models simultaneously. J Mach
Learn Res 20(177):1–81
Fisher FM (1980) Multiple regression in legal proceedings. Columbia Law Rev 80:702
Fisher RA (1921) Studies in crop variation. I. An examination of the yield of dressed grain from
Broadbalk. J Agricultural Sci 11(2):107–135
Fisher RA, Mackenzie WA (1923) Studies in crop variation. II. The manurial response of different
potato varieties. J Agricultural Sci 13(3):311–320
Fix M, Turner MA (1998) A National Report Card on Discrimination in America: The Role of
Testing. ERIC: Proceedings of the Urban Institute Conference (Washington, DC, March 1998)
Flanagan T (1985) Insurance, human rights, and equality rights in Canada: When is discrimination
“reasonable?”. Canad J Polit Sci/Revue canadienne de science politique 18(4):715–737
Fleurbaey M, Maniquet F (1996) A theory of fairness and social welfare. Cambridge University
Press, Cambridge
Flew A (1993) Three concepts of racism. Int Soc Sci Rev 68(3):99
Fong C, Hazlett C, Imai K (2018) Covariate balancing propensity score for a continuous treatment:
Application to the efficacy of political advertisements. Ann Appl Stat 12(1):156–177
Fontaine H (2003) Driver age and road traffic accidents: what is the risk for seniors? Recherche-
transports-sécurité
Fontaine KR, Redden DT, Wang C, Westfall AO, Allison DB (2003) Years of life lost due to
obesity. J Am Med Assoc 289(2):187–193
Foot P (1967) The problem of abortion and the doctrine of the double effect. Oxford Rev 5
Forfar DO (2006) Life table. Encyclopedia of Actuarial Science
Fortin N, Lemieux T, Firpo S (2011) Decomposition methods in economics. In: Handbook of labor
economics, vol 4, pp 1–102. Elsevier
Fourcade M (2016) Ordinalization: Lewis A. Coser Memorial Award for Theoretical Agenda Setting
2014. Sociological Theory 34(3):175–195
Fourcade M, Healy K (2013) Classification situations: Life-chances in the neoliberal era. Account
Organizations Soc 38(8):559–572
Fox ET (2013) ’Piratical Schemes and Contracts’: Pirate Articles and Their Society 1660–1730.
PhD Thesis, University of Exeter
François P (2022) Catégorisation, individualisation. Retour sur les scores de crédit. HAL 03508245
Freedman DA (1999) Ecological inference and the ecological fallacy. Int Encyclopedia Soc Behav
Sci 6(4027-4030):1–7
Freedman DA, Berk RA (2008) Weighting regressions by propensity scores. Evaluat Rev
32(4):392–409
Freeman S (2007) Rawls. Routledge
Frees EW (2006) Regression models for data analysis. Encyclopedia of Actuarial Science
Frees EW, Huang F (2023) The discriminating (pricing) actuary. North American Actuarial J
27(1):2–24
Frees EW, Meyers G, Cummings AD (2011) Summarizing insurance scores using a Gini index. J
Am Stat Assoc 106(495):1085–1098
Frees EW, Derrig RA, Meyers G (2014a) Predictive modeling applications in actuarial science,
vol 1. Cambridge University Press, Cambridge
Frees EW, Meyers G, Cummings AD (2014b) Insurance ratemaking and a Gini index. J Risk
Insurance 81(2):335–366
Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an
application to boosting. J Comput Syst Sci 55(1):119–139
Frezal S, Barry L (2019) Fairness in uncertainty: Some limits and misinterpretations of actuarial
fairness. J Bus Ethics 167(1):127–136
Fricker M (2007) Epistemic injustice: Power and the ethics of knowing. Oxford University Press,
Oxford
Friedler SA, Scheidegger C, Venkatasubramanian S (2016) On the (im) possibility of fairness.
arXiv 1609.07236
Friedler SA, Scheidegger C, Venkatasubramanian S, Choudhary S, Hamilton EP, Roth D (2019) A
comparative study of fairness-enhancing interventions in machine learning. In: Proceedings of
the Conference on Fairness, Accountability, and Transparency, pp 329–338
Friedman J (2001) Greedy function approximation: a gradient boosting machine. Ann Stat, 1189–
1232
Friedman J, Popescu BE (2008) Predictive learning via rule ensembles. Ann Appl Stat, 916–954
Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting
(with discussion and a rejoinder by the authors). Ann Stat 28(2):337–407
Friedman J, Hastie T, Tibshirani R, et al. (2001) The elements of statistical learning. Springer, New
York
Friedman S, Canaan M (2014) Overcoming speed bumps on the road to telematics. In: Challenges
and opportunities facing auto insurers with and without usage-based programs, Deloitte
Frisch R, Waugh FV (1933) Partial time regressions as compared with individual trends.
Econometrica, 387–401
Froot KA, Kim M, Rogoff KS (1995) The law of one price over 700 years. National Bureau of
Economic Research (NBER) 5132
Fry T (2015) A discussion on credibility and penalised regression, with implications for actuarial
work. Actuaries Institute
Fryer Jr RG, Levitt SD (2004) The causes and consequences of distinctively black names. Q J Econ
119(3):767–805
Gaddis SM (2017) How black are Lakisha and Jamal? Racial perceptions from names used in
correspondence audit studies. Sociol Sci 4:469–489
Gadet F (2007) La variation sociale en français. Editions Ophrys
Gal Y, Ghahramani Z (2016) Dropout as a Bayesian approximation: Representing model
uncertainty in deep learning. In: International Conference on Machine Learning, PMLR, pp
1050–1059
Galichon A (2016) Optimal transport methods in economics. Princeton University Press, Princeton
Galindo C, Moreno P, González J, Arevalo V (2009) Swimming pools localization in colour
high-resolution satellite images. In: 2009 IEEE International Geoscience and Remote Sensing
Symposium, vol 4, pp IV–510. IEEE
Galles D, Pearl J (1998) An axiomatic characterization of causal counterfactuals. Foundations Sci
3:151–182
Galton F (1907) Vox populi. Nature 75(7):450–451
Gambs S, Killijian MO, del Prado Cortez MNn (2010) Show me how you move and I will tell you
who you are. In: Proceedings of the 3rd ACM International Workshop on Security and Privacy
in GIS and LBS
Gan G, Valdez EA (2020) Data clustering with actuarial applications. North American Actuarial J
24(2):168–186
Gandy OH (2016) Coming to terms with chance: Engaging rational discrimination and cumulative
disadvantage, Routledge
Garg N, Schiebinger L, Jurafsky D, Zou J (2018) Word embeddings quantify 100 years of gender
and ethnic stereotypes. Proc Natl Acad Sci 115(16):E3635–E3644
Garrioch D (2011) Mutual aid societies in eighteenth-century Paris. French History & Civiliza-
tion 4
Gautron V, Dubourg É (2015) La rationalisation des outils et méthodes d’évaluation: de l’approche
clinique au jugement actuariel. Criminocorpus Revue d’Histoire de la justice, des crimes et des
peines
Gebelein H (1941) Das statistische problem der korrelation als variations- und eigenwertproblem
und sein zusammenhang mit der ausgleichsrechnung. ZAMM - Journal of Applied Mathemat-
ics and Mechanics / Zeitschrift für Angewandte Mathematik und Mechanik 21(6):364–379
Gebru T, Krause J, Wang Y, Chen D, Deng J, Aiden EL, Fei-Fei L (2017) Using deep learning and
Google Street View to estimate the demographic makeup of neighborhoods across the United
States. Proc Natl Acad Sci 114(50):13108–13113
Geenens G (2014) Probit transformation for kernel density estimation on the unit interval. J Am
Stat Assoc 109(505):346–358
Geiger A, Wu Z, Lu H, Rozner J, Kreiss E, Icard T, Goodman N, Potts C (2022) Inducing causal
structure for interpretable neural networks. In: International Conference on Machine Learning,
PMLR, pp 7324–7338
Gelbrich M (1990) On a formula for the L2 Wasserstein metric between measures on Euclidean
and Hilbert spaces. Mathematische Nachrichten 147(1):185–203
Gelman A (2009) Red state, blue state, rich state, poor state: Why Americans vote the way they
do. Princeton University Press, Princeton
Ghani N (2008) Racism. In: Schaefer RT (ed) Encyclopedia of race, ethnicity, and society, pp
1113–1115. Sage Publications
Giles C (2020) Goodhart's law comes back to haunt the UK's Covid strategy. Financial Times 14-5
Gini C (1912) Variabilità e mutabilità. Contributo allo studio delle distribuzioni e delle relazioni
statistiche
Gino F, Pierce L (2010) Robin Hood under the hood: Wealth-based discrimination in illicit customer
help. Organization Sci 21(6):1176–1194
Ginsburg M (1940) Roman military clubs and their social functions. In: Transactions and
Proceedings of the American Philological Association, JSTOR, vol 71, pp 149–156
Gintis H (2000) Strong reciprocity and human sociality. J Theor Biol 206(2):169–179
Glenn BJ (2000) The shifting rhetoric of insurance denial. Law Soc Rev, 779–808
Glenn BJ (2003) Postmodernism: the basis of insurance. Risk Manag Insurance Rev 6(2):131–143
Gneiting T, Raftery AE (2007) Strictly proper scoring rules, prediction, and estimation. J Am Stat
Assoc 102(477):359–378
Gneiting T, Balabdaoui F, Raftery AE (2007) Probabilistic forecasts, calibration and sharpness. J
Roy Stat Soc B (Stat Methodol) 69(2):243–268
Goldberger AS (1984) Reverse regression and salary discrimination. J Human Resour, 293–318
Goldman A (1979) Justice and reverse discrimination. Princeton University Press, Princeton
Goldstein A, Kapelner A, Bleich J, Pitkin E (2015) Peeking inside the black box: Visualizing
statistical learning with plots of individual conditional expectation. J Comput Graph Stat
24(1):44–65
Gollier C (2002) La solidarité sous l’angle économique. Revue Générale du Droit des Assurances,
pp 824–830
Gompertz B (1825) On the nature of the function expressive of the law of human mortality, and
on a new mode of determining the value of life contingencies. Philos Trans Roy Soc Lond
115:513–583
Gompertz B (1833) On the nature of the function expressive of the law of human mortality, and on
a new mode of determining the value of life contingencies. In a letter to Francis Baily, Esq. FRS
&c. By Benjamin Gompertz, Esq. FRS. Philos Trans Roy Soc Lond 2:252–253
Good IJ (1950) Probability and the weighing of evidence. Griffin
Goodwin S, Voola AP (2013) Framing microfinance in Australia–gender neutral or gender blind?
Aust J Soc Issu 48(2):223–239
Gordaliza P, Del Barrio E, Fabrice G, Loubes JM (2019) Obtaining fairness using optimal transport
theory. In: International Conference on Machine Learning, Proceedings of Machine Learning
Research, pp 2357–2365
Goscinny R, Uderzo A (1965) Astérix et Cléopâtre. Hachette
Gosden PHJH (1961) The friendly societies in England, 1815–1875. Manchester University Press,
Manchester
Gosseries A (2014) What makes age discrimination special: A philosophical look at the ECJ case
law. Netherlands J Legal Philos 43:59–80
Gottlieb S (2011) Medicaid is worse than no coverage at all. Wall Street J 10/03
Gottschlich C, Schuhmacher D (2014) The shortlist method for fast computation of the
earth mover’s distance and finding optimal solutions to transportation problems. PloS One
9(10):e110214
Goulet JA, Nguyen LH, Amiri S (2021) Tractable approximate Gaussian inference for Bayesian
neural networks. J Mach Learn Res 22:251–1
Gouriéroux C (1999) The econometrics of risk classification in insurance. Geneva Papers Risk
Insurance Theory 24(2):119–137
Gourieroux C (1999) Statistique de l’assurance. Economica
Gourieroux C, Jasiak J (2007) The econometrics of individual risk. Princeton University Press,
Princeton
Gower JC (1971) A general coefficient of similarity and some of its properties. Biometrics, 857–
871
Gowri A (2014) The irony of insurance: Community and commodity. PhD thesis, University of
Southern California
Granger CW (1969) Investigating causal relations by econometric models and cross-spectral
methods. Econometrica J Econ Soc, 424–438
Graves Jr JL (2015) Great is their sin: Biological determinism in the age of genomics. Ann Am
Acad Polit Soc Sci 661(1):24–50
Greene WH (1984) Reverse regression: The algebra of discrimination. J Bus Econ Stat 2(2):117–
120
Greenland S (2002) Causality theory for policy uses. In: Murray C (ed) Summary measures of
population, pp 291–302. Harvard University Press, Harvard
Greenwell BM (2017) pdp: an R package for constructing partial dependence plots. R J 9(1):421
Grobon S, Mourlot L (2014) Le genre dans la statistique publique en France. Regards croisés sur
l’économie 2:73–79
Gross ST (2017) Well-calibrated forecasts. In: Balakrishnan N, Colton T, Everitt B, Piegorsch W,
Ruggeri F, Teugels JL (eds) Wiley StatsRef: Statistics Reference Online. https://doi.org/10.1002/9781118445112.stat00252.pub2
Groupement des Assureurs Automobiles (2021) Plan statistique automobile, résultats généraux,
voitures de tourisme. GAA
Guelman L, Guillén M (2014) A causal inference approach to measure price elasticity in
automobile insurance. Exp Syst Appl 41(2):387–396
Guelman L, Guillén M, Pérez-Marín AM (2012) Random forests for uplift modeling: an
insurance customer retention case. In: International conference on modeling and simulation
in engineering, economics and management, pp 123–133. Springer, New York
Guelman L, Guillén M, Perez-Marin AM (2014) A survey of personalized treatment models for
pricing strategies in insurance. Insurance Math Econ 58:68–76
Guillen M (2006) Fraud in insurance. Encyclopedia of Actuarial Science
Guillen M, Ayuso M (2008) Fraud in insurance. Encyclopedia of Quantitative Risk Analysis and
Assessment 2
Guo C, Pleiss G, Sun Y, Weinberger KQ (2017) On calibration of modern neural networks. In:
International conference on machine learning, pp 1321–1330. PMLR
Gupta S, Kamble V (2021) Individual fairness in hindsight. J Mach Learn Res 22(1):6386–6420
Guseva A, Rona-Tas A (2001) Uncertainty, risk, and trust: Russian and American credit card
markets compared. Am Sociol Rev, 623–646
Guven S, McPhail M (2013) Beyond the cost model: Understanding price elasticity and its
applications. In: Casualty actuarial society E-forum, Spring 2013, Citeseer
Haas D (2013) Merit, fit, and basic desert. Philos Explorat 16(2):226–239
Haberman S, Renshaw AE (1996) Generalized linear models and actuarial science. J Roy Stat Soc
D (Statistician) 45(4):407–436
Hacking I (1990) The taming of chance. 17. Cambridge University Press, Cambridge
Hager WD, Zimpleman L (1982) The Norris decision, its implications and application. Drake Law
Rev 32:913
Hale K (2021) A.I. bias caused 80% of Black mortgage applicants to be denied. Forbes 09/2021
Halley E (1693) An estimate of the degrees of the mortality of mankind, drawn from curious
tables of the births and funerals at the city of Breslaw; with an attempt to ascertain the price of
annuities upon lives. Philos Trans Roy Soc 17:596–610
Halpern JY (2016) Actual causality. MIT Press, Cambridge, MA
Hamilton DL, Gifford RK (1976) Illusory correlation in interpersonal perception: A cognitive basis
of stereotypic judgments. J Exp Soc Psychol 12(4):392–407
Hamilton JD (1994) Time series analysis. Princeton University Press, Princeton
Hand DJ (2020) Dark data: why what you don’t know matters. Princeton University Press,
Princeton
Hansotia BJ, Rukstales B (2002) Direct marketing for multichannel retailers: Issues, challenges
and solutions. J Database Market Customer Strat Manag 9:259–266
Hanssens DM, Parsons LJ, Schultz RL (2003) Market response models: Econometric and time
series analysis, vol 2. Springer Science & Business Media, New York
Hara K, Sun J, Moore R, Jacobs DW, Froehlich J (2014) Tohme: detecting curb ramps in Google
Street View using crowdsourcing, computer vision, and machine learning. In: Proceedings of
the 27th Annual ACM Symposium on User Interface Software and Technology
Harari YN (2018) 21 Lessons for the 21st century. Random House, New York
Harcourt BE (2008) Against prediction. University of Chicago Press, Chicago
Harcourt BE (2011) Surveiller et punir à l’âge actuariel. Déviance et Société 35:163
Harcourt BE (2015a) Exposed: Desire and disobedience in the digital age. Harvard University
Press, Harvard
Harcourt BE (2015b) Risk as a proxy for race: The dangers of risk assessment. Federal Sentencing
Rep 27(4):237–243
Harden KP (2023) Genetic determinism, essentialism and reductionism: semantic clarity for
contested science. Nature Rev Genet 24(3):197–204
Hardt M, Price E, Srebro N (2016) Equality of opportunity in supervised learning. Adv Neural Inf
Process Syst 29:3315–3323
Hardy GH, Littlewood JE, Pólya G (1952) Inequalities. Cambridge University Press, Cambridge
Hargreaves DJ, Colman AM, Sluckin W (1983) The attractiveness of names. Human Relat
36(4):393–401
Harrington SE, Niehaus G (1998) Race, redlining, and automobile insurance prices. J Bus
71(3):439–469
Harris M (1970) Referential ambiguity in the calculus of Brazilian racial identity. Southwestern J
Anthropol 26(1):1–14
Hartigan JA (1975) Clustering algorithms. Wiley, New York
Harwell D, Mayes B, Walls M, Hashemi S (2018) The accent gap. The Washington Post July 19
Hastie T, Tibshirani R (1987) Generalized additive models: some applications. J Am Stat Assoc
82(398):371–386
Hastie T, Tibshirani R, Wainwright M (2015) Statistical learning with sparsity. Monogr Stat Appl
Probab 143:143
Haugeland J (1989) Artificial intelligence: The very idea. MIT Press
Havens HV (1979) Issues and needed improvements in state regulation of the insurance business.
US General Accounting Office
Haykin S (1998) Neural networks: a comprehensive foundation. Prentice Hall PTR, Indianapolis,
IN
He Y, Xiong Y, Tsai Y (2020) Machine learning based approaches to predict customer churn for
an insurance company. In: 2020 Systems and Information Engineering Design Symposium
(SIEDS), pp 1–6. IEEE
Heckert NA, Filliben JJ, Croarkin CM, Hembree B, Guthrie WF, Tobias P, Prinz J, et al. (2002)
Handbook 151: SEMATECH e-handbook of statistical methods. NIST
Hedden B (2021) On statistical criteria of algorithmic fairness. Philos Public Affairs 49(2):209–
231
Hedges BA (1977) Gender discrimination in pension plans: Comment. J Risk Insurance 44(1):141–
144
Heen ML (2009) Ending Jim Crow life insurance rates. Northwestern J Law Soc Policy 4:360
Heidari H, Krause A (2018) Preventing disparate treatment in sequential decision making. In:
IJCAI, pp 2248–2254
Heidorn PB (2008) Shedding light on the dark data in the long tail of science. Library Trends
57(2):280–299
Heimer CA (1985) Reactive risk and rational action. University of California Press, California
Heller D (2015) High price of mandatory auto insurance in predominantly African American
communities. Tech. rep., Consumer Federation of America
Hellinger E (1909) Neue Begründung der Theorie quadratischer Formen von unendlichvielen
Veränderlichen. Journal für die reine und angewandte Mathematik 1909(136):210–271
Hellman D (1998) Two types of discrimination: The familiar and the forgotten. California Law
Rev 86:315
Hellman D (2011) When is discrimination wrong? Harvard University Press, Harvard
Helton JC, Davis F (2002) Illustration of sampling-based methods for uncertainty and sensitivity
analysis. Risk Anal 22(3):591–622
Henley A (2014) Abolishing the stigma of punishments served: Andrew Henley argues that those
who have been punished should be free from future discrimination. Criminal Justice Matters
97(1):22–23
Henriet D, Rochet JC (1987) Some reflections on insurance pricing. Eur Econ Rev 31(4):863–885
Héran F (2010) Inégalités et discriminations: Pour un usage critique et responsable de l’outil
statistique
Heras AJ, Pradier PC, Teira D (2020) What was fair in actuarial fairness? Hist Human Sci
33(2):91–114
Hernán MA, Robins JM (2010) Causal inference
Hertz J, Krogh A, Palmer RG (1991) Introduction to the theory of neural computation. CRC Press,
Boca Raton
Hesselager O, Verrall R (2006) Reserving in non-life insurance. Encyclopedia of Actuarial Science
Higham NJ (2008) Functions of matrices: theory and computation. SIAM, Philadelphia, PA
Hilbe JM (2014) Modeling count data. Cambridge University Press, Cambridge
Hill K (2022) A dad took photos of his naked toddler for the doctor. Google flagged him as a
criminal. The New York Times August 25
Hill K, White J (2020) Designed to deceive: do these people look real to you? The New York Times
11(21)
Hillier R (2022) The legal challenges of insuring against a pandemic. In: Pandemics: insurance and
social protection, pp 267–286. Springer, Cham
Hiltzik M (2013) Yes, men should pay for pregnancy coverage and here’s why. Los Angeles Times
November 1st
Hirschfeld HO (1935) A connection between correlation and contingency. Math Proc Camb Philos
Soc 31(4):520–524
Hitchcock C (1997) Probabilistic causation. Stanford Encyclopedia of Philosophy
Ho DE, Imai K, King G, Stuart EA (2007) Matching as nonparametric preprocessing for reducing
model dependence in parametric causal inference. Polit Anal 15(3):199–236
Hoerl AE, Kennard RW (1970) Ridge regression: Biased estimation for nonorthogonal problems.
Technometrics 12(1):55–67
Hoffman FL (1896) Race traits and tendencies of the American Negro, vol 11. American Economic
Association
Hoffman FL (1931) Cancer and smoking habits. Ann Surg 93(1):50
Hoffman KM, Trawalter S, Axt JR, Oliver MN (2016) Racial bias in pain assessment and treatment
recommendations, and false beliefs about biological differences between blacks and whites.
Proc Natl Acad Sci 113(16):4296–4301
Hofmann HJ (1990) Die Anwendung des CART-Verfahrens zur statistischen Bonitätsanalyse von
Konsumentenkrediten. Zeitschrift für Betriebswirtschaft 60:941–962
Hofstede G (1995) Insurance as a product of national values. Geneva Papers Risk Insurance-Issues
Pract 20(4):423–429
Holland PW (1986) Statistics and causal inference. J Am Stat Assoc 81(396):945–960
Holland PW (2003) Causation and race. ETS Research Report Series RR-03-03
Holzer H, Neumark D (2000) Assessing affirmative action. J Econ Liter 38(3):483–568
Homans S, Phillips GW (1868) Tontine dividend life assurance policies. Equitable Life Assurance
Society of the United States
Hong D, Zheng YY, Xin Y, Sun L, Yang H, Lin MY, Liu C, Li BN, Zhang ZW, Zhuang J, et al.
(2021) Genetic syndromes screening by facial recognition technology: VGG-16 screening model
construction and evaluation. Orphanet J Rare Dis 16(1):1–8
Hooker S, Moorosi N, Clark G, Bengio S, Denton E (2020) Characterising bias in compressed
models. arXiv 2010.03058
Hornik K (1991) Approximation capabilities of multilayer feedforward networks. Neural Networks
4(2):251–257
Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal
approximators. Neural Networks 2(5):359–366
Horowitz MAC (1976) Aristotle and woman. J Hist Biol 9:183–213
Hosmer DW, Lemesbow S (1980) Goodness of fit tests for the multiple logistic regression model.
Commun Stat Theory Methods 9(10):1043–1069
Houston R (1992) Mortality in early modern Scotland: the life expectancy of advocates. Continuity
Change 7(1):47–69
Hsee CK, Li X (2022) A framing effect in the judgment of discrimination. Proc Natl Acad Sci
119(47):e2205988119
Hu F (2022) Semi-supervised learning in insurance: fairness and active learning. PhD thesis,
Institut polytechnique de Paris
Hu F, Ratz P, Charpentier A (2023a) Fairness in multi-task learning via Wasserstein barycenters.
Joint European Conference on Machine Learning and Knowledge Discovery in Databases –
ECML PKDD
Hu F, Ratz P, Charpentier A (2023b) A sequentially fair mechanism for multiple sensitive attributes.
ArXiv 2309.06627
Hubbard GN (1852) De l’organisation des sociétés de bienfaisance ou de secours mutuels et des
bases scientifiques sur lesquelles elles doivent être établies. Guillaumin, Paris
Huber PJ (1964) Robust estimation of a location parameter. Ann Math Stat 35:73–101
Hume D (1739) A treatise of human nature. Cambridge University Press, Cambridge
Hume D (1748) An enquiry concerning human understanding. Cambridge University Press,
Cambridge
Hunt E (2016) Tay, Microsoft's AI chatbot, gets a crash course in racism from Twitter. Guardian
24(3):2016
Hunter J (1775) Inaugural disputation on the varieties of man. In: Blumenbach JF (ed) De generis
humani varietate nativa. Vandenhoek & Ruprecht, Gottingae
Huttegger SM (2013) In defense of reflection. Philos Sci 80(3):413–433
Huttegger SM (2017) The probabilistic foundations of rational learning. Cambridge University
Press, Cambridge
Ichiishi T (2014) Game theory for economic analysis. Academic Press, Cambridge, MA
Ilic L, Sawada M, Zarzelli A (2019) Deep mapping gentrification in a large Canadian city using
deep learning and Google Street View. PloS One 14(3):e0212814
Imai K (2018) Quantitative social science: an introduction. Princeton University Press, Princeton
Imai K, Ratkovic M (2014) Covariate balancing propensity score. J Roy Stat Soc B Stat Methodol,
243–263
Imbens GW, Rubin DB (2015) Causal inference in statistics, social, and biomedical sciences.
Cambridge University Press, Cambridge
Ingold D, Soper S (2016) Amazon doesn't consider the race of its customers. Should it? Bloomberg
April 21st
Institute and Faculty of Actuaries (2021) The hidden risks of being poor: the poverty premium in
insurance. Faculty of Actuaries Report
Insurance Bureau of Canada (2021) Facts of the property and casualty insurance industry in
Canada. Insurance Bureau of Canada
Ismay P (2018) Trust among strangers: friendly societies in modern Britain. Cambridge University
Press, Cambridge
Iten R, Wagner J, Zeier Röschmann A (2021) On the identification, evaluation and treatment of
risks in smart homes: A systematic literature review. Risks 9(6):113
Ito J (2021) Supposedly ‘fair’ algorithms can perpetuate discrimination. Wired 02.05.2019
Jaccard P (1901) Distribution de la flore alpine dans le bassin des dranses et dans quelques régions
voisines. Bull de la Société Vaudoise de Sci Nature 37:241–272
Jackson JP, Depew DJ (2017) Darwinism, democracy, and race: American anthropology and
evolutionary biology in the twentieth century. Taylor & Francis, London
Jackson MO (2019) The human network: How your social position determines your power, beliefs,
and behaviors. Vintage
Jacobs J (1894) Aesop’s fables: selected and told anew. Capricorn Press, Santa Barbara, CA
Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Hoboken, NJ
Jann B (2008) The blinder–oaxaca decomposition for linear regression models. Stata J 8(4):453–
479
Jargowsky PA (2005) Omitted variable bias. Encyclopedia Social Measur 2:919–924
Jarvis B, Pearlman RF, Walsh SM, Schantz DA, Gertz S, Hale-Pletka AM (2019) Insurance rate
optimization through driver behavior monitoring. Google Patents 10,169,822
Jean N, Burke M, Xie M, Davis WM, Lobell DB, Ermon S (2016) Combining satellite imagery
and machine learning to predict poverty. Science 353(6301):790–794
Jeffreys H (1946) An invariant form for the prior probability in estimation problems. Proc Roy Soc
Lond A Math Phys Sci 186(1007):453–461
Jerry RH (2023) Understanding parametric insurance: A potential tool to help manage pandemic
risk. In: Covid-19 and insurance, pp 17–62. Springer, New York
Jewell WS (1974) Credible means are exact Bayesian for exponential families. ASTIN Bull J IAA
8(1):77–90
Jiang H, Nachum O (2020) Identifying and correcting label bias in machine learning. In:
International Conference on Artificial Intelligence and Statistics, Proceedings of Machine
Learning Research, pp 702–712
Jiang J, Nguyen T (2007) Linear and generalized linear mixed models and their applications, vol 1.
Springer, New York
Jo HH, Eom YH (2014) Generalized friendship paradox in networks with tunable degree-attribute
correlation. Phys Rev E 90(2):022809
Johannesson GT (2013) The history of Iceland. ABC-CLIO
Johnston L (1945) Effects of tobacco smoking on health. British Med J 2(4411):98
Jolliffe IT (2002) Principal component analysis. Springer, New York
Jones EE, Nisbett RE (1971) The actor and the observer: Divergent perceptions of the causes of
behavior. General Learning Press, New York
Jones ML (2016) Ctrl + Z: The Right to Be Forgotten. New York University Press, New York
Jordan A, Krüger F, Lerch S (2019) Evaluating probabilistic forecasts with scoringRules. J
Stat Softw 90:1–37
Jordan C (1881) Sur la série de Fourier. Comptes Rendus Hebdomadaires de l'Académie des Sci
92:228–230
Joseph S, Castan M (2013) The international covenant on civil and political rights: cases, materials,
and commentary. Oxford University Press, Oxford
Jost JT, Rudman LA, Blair IV, Carney DR, Dasgupta N, Glaser J, Hardin CD (2009) The existence
of implicit bias is beyond reasonable doubt: A refutation of ideological and methodological
objections and executive summary of ten studies that no manager should ignore. Res Organizat
Behav 29:39–69
Jung C, Kearns M, Neel S, Roth A, Stapleton L, Wu ZS (2019a) An algorithmic framework for
fairness elicitation. arXiv 1905.10660
Jung C, Kearns M, Neel S, Roth A, Stapleton L, Wu ZS (2019b) Eliciting and enforcing subjective
individual fairness. arXiv:1905.10660
Jung C, Kannan S, Lee C, Pai MM, Roth A, Vohra R (2020) Fair prediction with endogenous
behavior. arXiv 2002.07147
Kachur A, Osin E, Davydov D, Shutilov K, Novokshonov A (2020) Assessing the big five
personality traits using real-life static facial images. Sci Rep 10(1):1–11
Kaganoff BC (1996) A dictionary of Jewish names and their history. Jason Aronson, Lanham, MD
Kahlenberg RD (1996) The remedy: Class, race and affirmative action. Basic Books, New York
Kahneman D (2011) Thinking, fast and slow. Farrar, Straus and Giroux
Kamalich RF, Polachek SW (1982) Discrimination: Fact or fiction? An examination using an
alternative approach. Southern Econ J, 450–461
Kamen H (2014) The Spanish Inquisition: a historical revision. Yale University Press, Yale
Kamiran F, Calders T (2012) Data preprocessing techniques for classification without discrimina-
tion. Knowl Inf Syst 33(1):1–33
Kang JD, Schafer JL (2007) Demystifying double robustness: A comparison of alternative
strategies for estimating a population mean from incomplete data. Stat Sci 22(4):523–539
Kanngiesser P, Warneken F (2012) Young children consider merit when sharing resources with
others. PLOS ONE 8(8):e43979
Kant I (1775) Über die verschiedenen Rassen der Menschen. Nicolovius Edition
Kant I (1785) Bestimmung des Begriffs einer Menschenrace. Haude und Spener, Berlin
Kant I (1795) Zum ewigen Frieden. Ein philosophischer Entwurf. Nicolovius Edition
Kantorovich LV, Rubinshtein S (1958) On a space of totally additive functions. Vestnik of the St
Petersburg Univ Math 13(7):52–59
Kaplan D (2023) Bayesian statistics for the social sciences. Guilford Publications, New York
Karapiperis D, Birnbaum B, Brandenburg A, Castagna S, Greenberg A, Harbage R, Obersteadt
A (2015) Usage-based insurance and vehicle telematics: insurance market and regulatory
implications. CIPR Study Ser 1:1–79
Karimi H, Khan MFA, Liu H, Derr T, Liu H (2022) Enhancing individual fairness through
propensity score matching. In: 2022 IEEE 9th International Conference on Data Science and
Advanced Analytics (DSAA), pp 1–10. IEEE
Karras T, Laine S, Aittala M, Hellsten J, Lehtinen J, Aila T (2020) Analyzing and improving the
image quality of StyleGAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp 8110–8119
Kearns M, Roth A (2019) The ethical algorithm: The science of socially aware algorithm design.
Oxford University Press, Oxford
Kearns M, Valiant L (1989) Cryptographic limitations on learning boolean formulae and finite
automata. J ACM 21(1):433–444
Kearns M, Neel S, Roth A, Wu ZS (2018) Preventing fairness gerrymandering: Auditing and
learning for subgroup fairness. In: International Conference on Machine Learning, Proceedings
of Machine Learning Research, vol 1711.05144, pp 2564–2572
Keffer R (1929) An experience rating formula. Trans Actuarial Soc Am 30:130–139
Keita SOY, Kittles RA, Royal CD, Bonney GE, Furbert-Harris P, Dunston GM, Rotimi CN (2004)
Conceptualizing human variation. Nature Genet 36(Suppl 11):S17–S20
Kekes J (1995) The injustice of affirmative action involving preferential treatment. In: Cahn S (ed)
The Affirmative Action Debate, Routledge, pp 293–304
Kelly H (2021) A priest's phone location data outed his private life. It could happen to anyone. The
Washington Post 22-07-2021
Kelly M, Nielson N (2006) Age as a variable in insurance pricing and risk classification. Geneva
Papers Risk Insurance Issues Pract 31(2):212–232
Keyfitz K, Flieger W, et al. (1968) World population: an analysis of vital data. The University of
Chicago Press, Chicago
Keys A, Fidanza F, Karvonen MJ, Kimura N, Taylor HL (1972) Indices of relative weight and
obesity. J Chronic Dis 25(6–7):329–343
Kiiveri H, Speed T (1982) Structural analysis of multivariate data: A review. Sociological Methodol
13:209–289
Kilbertus N, Rojas-Carulla M, Parascandolo G, Hardt M, Janzing D, Schölkopf B (2017) Avoiding
discrimination through causal reasoning. arXiv 1706.02744
Kim MP, Reingold O, Rothblum GN (2018) Fairness through computationally-bounded awareness.
arXiv 1803.03239
Kim PT (2017) Auditing algorithms for discrimination. University of Pennsylvania Law Rev
166:189
Kimball SL (1979) Reverse sex discrimination: Manhart. Am Bar Foundation Res J 4(1):83–139
King G, Tanner MA, Rosen O (2004) Ecological inference: New methodological strategies.
Cambridge University Press, Cambridge
Kirkpatrick K (2017) It’s not the algorithm, it’s the data. Commun ACM 60(2):21–23
Kita K, Kidziński Ł (2019) Google Street View image of a house predicts car accident risk of its
resident. arXiv 1904.05270
Kitagawa EM (1955) Components of a difference between two rates. J Am Stat Assoc
50(272):1168–1194
Kitagawa EM, Hauser PM (1973) Differential mortality in the United States. Harvard University
Press, Harvard
Kitchin R (2017) Thinking critically about and researching algorithms. Inf Commun Soc 20(1):14–
29
Kiviat B (2019) The moral limits of predictive practices: The case of credit-based insurance scores.
Am Sociol Rev 84(6):1134–1158
Kiviat B (2021) Which data fairly differentiate? American views on the use of personal data in two
market settings. Sociol Sci 8:26–47
Klein R (2021) Matching rate to risk: Analysis of the availability and affordability of private
passenger automobile insurance. Tech. rep., Insurance Information Institute
Klein R, Grace MF (2001) Urban homeowners insurance markets in Texas: A search for redlining.
J Risk Insurance, 581–613
Kleinberg J, Mullainathan S, Raghavan M (2016) Inherent trade-offs in the fair determination of
risk scores. arXiv 1609.05807
Kleinberg J, Lakkaraju H, Leskovec J, Ludwig J, Mullainathan S (2017) Human Decisions and
Machine Predictions. Q J Econ 133(1):237–293
Klinker F (2010) Generalized linear mixed models for ratemaking: a means of introducing
credibility into a generalized linear model setting. In: Casualty actuarial society E-forum,
Winter 2011 Volume 2
Klugman SA (1991) Bayesian statistics in actuarial science: with emphasis on credibility, vol 15.
Springer Science & Business Media, New York
Knowlton RE (1978) Regents of the University of California v. Bakke. Arkansas Law Rev 32:499
Koetter F, Blohm M, Drawehn J, Kochanowski M, Goetzer J, Graziotin D, Wagner S (2019)
Conversational agents for insurance companies: from theory to practice. In: International
Conference on Agents and Artificial Intelligence, pp 338–362. Springer, New York
Kohler-Hausmann I (2018) Eddie Murphy and the dangers of counterfactual causal thinking about
detecting racial discrimination. Northwestern Univ Law Rev 113:1163
Kohlleppel L (1983) Multidimensional market signalling. Institut für Gesellschafts und
Wirtschaftswissenschaften, Wirtschaftstheoretische Abteilung
Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques. MIT Press
Kolmogorov A (1933) Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer, Berlin
Komiyama J, Takeda A, Honda J, Shimao H (2018) Nonconvex optimization for regression
with fairness constraints. In: International Conference on Machine Learning, Proceedings of
Machine Learning Research, pp 2737–2746
Korzybski A (1958) Science and sanity: An introduction to non-Aristotelian systems and general
semantics. Institute of GS
Kosinski M (2021) Facial recognition technology can expose political orientation from naturalistic
facial images. Sci Rep 11(1):1–7
Kotchen TA (2011) Historical trends and milestones in hypertension research: a model of the
process of translational research. Hypertension 58(4):522–538
Kotter-Grühn D, Kornadt AE, Stephan Y (2016) Looking beyond chronological age: Current
knowledge and future directions in the study of subjective age. Gerontology 62(1):86–93
Kranzberg M (1986) Technology and history: “Kranzberg’s laws”. Technol Culture 27(3):544–560
Kranzberg M (1995) Technology and history: “Kranzberg’s laws”. Bull Sci Technol Soc 15(1):5–
13
Krasanakis E, Spyromitros-Xioufis E, Papadopoulos S, Kompatsiaris Y (2018) Adaptive sensitive
reweighting to mitigate bias in fairness-aware classification. In: Proceedings of the 2018 World
Wide Web Conference, pp 853–862
Kremer E (1982) IBNR-claims and the two-way model of ANOVA. Scandinavian Actuarial J
1982(1):47–55
Krikler S, Dolberger D, Eckel J (2004) Method and tools for insurance price and revenue
optimisation. J Financ Serv Market 9(1):68–79
Krippner GR (2023) Unmasked: A history of the individualization of risk. Sociol Theory,
07352751231169012
Kroll JA, Huey J, Barocas S, Felten EW, Reidenberg JR, Robinson DG, Yu H (2017) Accountable
algorithms. Univ Pennsylvania Law Rev 165:633–705
Krüger F, Ziegel JF (2021) Generic conditions for forecast dominance. J Bus Econ Stat 39(4):972–
983
Krzanowski WJ, Hand DJ (2009) ROC curves for continuous data. CRC Press, Boca Raton
Kuczmarski J (2018) Reducing gender bias in Google Translate. Keyword 6:2018
Kudryavtsev AA (2009) Using quantile regression for rate-making. Insurance Math Econ
45(2):296–304
Kuhn M, Johnson K, et al. (2013) Applied predictive modeling, vol 26. Springer, New York
Kuhn T (2020) Root insurance commits to eliminate bias from its car insurance rates. Business
Wire August 6
Kull M, Flach P (2015) Novel decompositions of proper scoring rules for classification: Score
adjustment as precursor to calibration. In: Machine Learning and Knowledge Discovery in
Databases: European Conference, ECML PKDD 2015, Porto, Portugal, September 7–11, 2015,
Proceedings, Part I 15, pp 68–85. Springer, New York
Kullback S (2004) Minimum discrimination information (MDI) estimation. Encyclopedia Stat
Sci 7:4821–4823
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86
Kusner MJ, Loftus J, Russell C, Silva R (2017) Counterfactual fairness. Adv Neural Inf Process
Syst 30:4067–4077
de La Fontaine J (1668) Fables. Barbin
Laffont JJ, Martimort D (2002) The theory of incentives: the principal-agent model, pp 273–306.
Princeton University Press, Princeton
Lahoti P, Gummadi KP, Weikum G (2019) Operationalizing individual fairness with pairwise fair
representations. arXiv 1907.01439
Lambert D (1992) Zero-inflated Poisson regression, with an application to defects in manufacturing.
Technometrics 34(1):1–14
Lamont M, Molnár V (2002) The study of boundaries in the social sciences. Annu Rev Sociol,
167–195
Lancaster R, Ward R (2002) The contribution of individual factors to driving behaviour: Implica-
tions for managing work-related road safety. HM Stationery Office
Landes X (2015) How fair is actuarial fairness? J Bus Ethics 128(3):519–533
Langford J, Schapire R (2005) Tutorial on practical prediction theory for classification. J Mach
Learn Res 6(3):273–306
LaPar DJ, Bhamidipati CM, Mery CM, Stukenborg GJ, Jones DR, Schirmer BD, Kron IL, Ailawadi
G (2010) Primary payer status affects mortality for major surgical operations. Ann Surg
252(3):544
Laplace PS (1774) Mémoire sur la probabilité des causes par les événements. Mémoires de
l'Académie Royale des Sciences
de Lara L (2023) Counterfactual models for fair and explainable machine learning: A mass
transportation approach. PhD thesis, Institut de Mathématiques de Toulouse
de Lara L, González-Sanz A, Asher N, Loubes JM (2021) Transport-based counterfactual models.
arXiv 2108.13025
Larson J, Mattu S, Kirchner L, Angwin J (2016) How we analyzed the COMPAS recidivism
algorithm. ProPublica 23-05
Larson J, Angwin J, Kirchner L, Mattu S (2017) How we examined racial discrimination in auto
insurance prices. ProPublica, April 5
Lasry JM (2015) La rencontre choc de l’assurance et du big data. Risques 103:19–24
Lauer J (2017) Creditworthy: a history of consumer surveillance and financial identity in America.
Columbia University Press, New York
Laulom S (2012) Égalité des sexes et primes d’assurances. Semaine Sociale Lamy 1531:44–49
Laurent H, Rivest RL (1976) Constructing optimal binary decision trees is NP-complete. Inf Process
Lett 5(1):15–17
Law S, Paige B, Russell C (2019) Take a look around: Using street view and satellite images to
estimate house prices. ACM Trans Intell Syst Technol 10(5):1–19
Le Gouic T, Loubes JM (2017) Existence and consistency of Wasserstein barycenters. Probab
Theory Relat Fields 168:901–917
Le Monde (2021) La discrimination par l'accent bientôt réprimée ? Une proposition de loi adoptée
jeudi à l'Assemblée. Le Monde 26-11-2020
Leben D (2020) Normative principles for evaluating fairness in machine learning. In: Proceedings
of the AAAI/ACM Conference on AI, Ethics, and Society, pp 86–92
Lebesgue H (1918) Remarques sur les théories de la mesure et de l’intégration. Ann scientifiques
de l’École Normale Supérieure 35:191–250
Ledford H (2019) Millions affected by racial bias in health-care algorithm. Nature 574(31):2
Lee BK, Lessler J, Stuart EA (2010) Improving propensity score weighting using machine learning.
Stat Med 29(3):337–346
Lee RD, Carter LR (1992) Modeling and forecasting US mortality. J Am Stat Assoc 87(419):659–
671
Lee S, Antonio K (2015) Why high dimensional modeling in actuarial science. ASTIN, AFIR/ERM
& IACA Colloquia
Leeson PT (2009) The calculus of piratical consent: the myth of the myth of social contract. Public
Choice 139:443–459
Lemaire J (1985) Automobile insurance: actuarial models, vol 4. Springer Science & Business
Media, New York
Lemaire J, Park SC, Wang KC (2016) The use of annual mileage as a rating variable. ASTIN Bull
J IAA 46(1):39–69
Léon PR (1993) Précis de phonostylistique: parole et expressivité. Nathan
Leshno M, Lin VY, Pinkus A, Schocken S (1993) Multilayer feedforward networks with a
nonpolynomial activation function can approximate any function. Neural Networks 6(6):861–
867
Leu C (2015) Looking up. The University of Chicago Magazine May-June(15)
Leuner J (2019) A replication study: Machine learning models are capable of predicting sexual
orientation from facial images. arXiv 1902.10739
Levantesi S, Pizzorusso V (2019) Application of machine learning to mortality modeling and
forecasting. Risks 7(1):26
Levina E, Bickel P (2001) The earth mover's distance is the Mallows distance: Some insights from
statistics. In: Proceedings Eighth IEEE International Conference on Computer Vision. ICCV
2001, vol 2, pp 251–256. IEEE
Levy J (2012) Freaks of fortune. Harvard University Press, Harvard
Lew EA, Garfinkel L (1979) Variations in mortality by weight among 750,000 men and women. J
Chronic Dis 32(8):563–576
Lewis AE (2004) "What group?" Studying whites and whiteness in the era of "color-blindness".
Sociol Theory 22(4):623–646
Lewis D (1973) Counterfactuals. Wiley, New York
l'Horty Y, Bunel M, Mbaye S, Petit P, Du Parquet L (2019) Discriminations dans l'accès à la
banque et à l'assurance: Les enseignements de trois testings. Revue d'économie politique
129(1):49–78
Li C, Fan X (2020) On nonparametric conditional independence tests for continuous variables.
Wiley Interdisciplinary Rev Comput Stat 12(3):e1489
Li F, Li F (2019) Propensity score weighting for causal inference with multiple treatments. Ann
Appl Stat 13:2389–2415
Li G, Braver ER, Chen LH (2003) Fragility versus excessive crash involvement as determinants
of high death rates per vehicle-mile of travel among older drivers. Accident Anal Prevent
35(2):227–235
Li P, Liu H (2022) Achieving fairness at no utility cost via data reweighing with influence. In:
International Conference on Machine Learning, Proceedings of Machine Learning Research,
pp 12917–12930
Liebler CA, Porter SR, Fernandez LE, Noon JM, Ennis SR (2017) America’s churning races:
Race and ethnicity response changes between census 2000 and the 2010 census. Demography
54(1):259–284
Light JS (1999) When computers were women. Technol Culture 40(3):455–483
Liisa HB (1994) Aging and fatal accidents in male and female drivers. J Gerontol 49(6):S286–S290
Lima M (2014) The book of trees: Visualizing branches of knowledge, vol 5. Princeton Architec-
tural Press, New York
Lin J (1991) Divergence measures based on the Shannon entropy. IEEE Trans Inf Theory
37(1):145–151
Lindholm M, Richman R, Tsanakas A, Wüthrich MV (2022a) Discrimination-free insurance
pricing. ASTIN Bull J IAA 52(1):55–89
Lindholm M, Richman R, Tsanakas A, Wüthrich MV (2022b) A discussion of discrimination and
fairness in insurance pricing. arXiv 2209.00858
Ling CX, Li C (1998) Data mining for direct marketing: Problems and solutions. In: Conference
on knowledge discovery and data mining, vol 98, pp 73–79
Lipovetsky S, Conklin M (2001) Analysis of regression in game theory approach. Appl Stochast
Models Bus Ind 17(4):319–330
Lippert-Rasmussen K (2006) The badness of discrimination. Ethical Theory Moral Pract 9:167–
185
Lippert-Rasmussen K (2013) Discrimination. In: LaFollette H (ed) The international encyclopedia
of ethics. Wiley-Blackwell, New York
Lippert-Rasmussen K (2014) Born free and equal? A philosophical inquiry into the nature of
discrimination. Oxford University Press, Oxford
Lippert-Rasmussen K (2017) The philosophy of discrimination: An introduction. In: The Rout-
ledge handbook of the ethics of discrimination, Routledge, pp 1–16
Lippert-Rasmussen K (2020) Making sense of affirmative action. Oxford University Press, Oxford
Lippmann W (1922) Public opinion. Routledge
Lipton ZC, Chouldechova A, McAuley J (2018) Does mitigating ml’s impact disparity require
treatment disparity? In: Proceedings of the 32nd International Conference on Neural Informa-
tion Processing Systems, pp 8136–8146
Liu X (2015) No fats, femmes, or asians. Moral Philos Polit 2(2):255–276
Liu X (2017) Discrimination and lookism. In: Lippert-Rasmussen K (ed) Handbook of the ethics
of discrimination, pp 276–286, Routledge
Liu Y, Chen L, Yuan Y, Chen J (2012) A study of surnames in China through isonymy. Am J Phys
Anthropol 148(3):341–350
Lo VS (2002) The true lift model: a novel data mining approach to response modeling in database
marketing. ACM SIGKDD Explorat Newsl 4(2):78–86
Loève M (1977) Probability theory. Springer, New York
Löffler M, Münstermann B, Schumacher T, Mokwa C, Behm S (2016) Insurers need to plug into
the internet of things–or risk falling behind. European Insurance
Loftus JR, Russell C, Kusner MJ, Silva R (2018) Causal reasoning for algorithmic fairness. arXiv
1805.05859
Loi M, Christen M (2021) Choosing how to discriminate: navigating ethical trade-offs in fair
algorithmic design for the insurance sector. Philos Technol, 1–26
Lombroso C (1876) L’uomo delinquente. Hoepli
Loos RJ, Yeo GS (2022) The genetics of obesity: from discovery to biology. Nature Rev Genet
23(2):120–133
L'Oréal (2022) A new geography of skin color. https://www.loreal.com/en/articles/science-and-technology/expert-inskin/
Lorenz MO (1905) Methods of measuring the concentration of wealth. Publ Am Stat Assoc
9(70):209–219
Lovejoy B (2021) Linkedin breach reportedly exposes data of 92% of users, including inferred
salaries. 9to5mac 06/29
Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. In: Guyon I,
Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Proceedings
of the 31st International Conference on Neural Information Processing Systems, vol 30, pp
4768–4777. Curran Associates, Inc.
Luong BT, Ruggieri S, Turini F (2011) k-nn as an implementation of situation testing for discrim-
ination discovery and prevention. In: Proceedings of the 17th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pp 502–510
Lutton L, Fan A, Loury A (2020) Where banks don’t lend. WBEZ
MacIntyre AC (1969) Hume on ‘is’ and ‘ought’. In: The is-ought question, pp 35–50. Springer,
New York
MacKay DJ (1992) A practical Bayesian framework for backpropagation networks. Neural
Comput 4(3):448–472
Macnicol J (2006) Age discrimination: An historical and contemporary analysis. Cambridge
University Press, Cambridge
Maedche A (2020) Gender bias in chatbot design. Chatbot Research and Design, p 79
Mallasto A, Feragen A (2017) Learning from uncertain curves: The 2-Wasserstein metric for
Gaussian processes. Advances in Neural Information Processing Systems 30
Mallon R (2006) ‘race’: normative, not metaphysical or semantic. Ethics 116(3):525–551
Mallows CL (1972) A note on asymptotic joint normality. Ann Math Stat, 508–515
Mangel M, Samaniego FJ (1984) Abraham Wald's work on aircraft survivability. J Am Stat Assoc
79(386):259–267
Mantelero A (2013) The eu proposal for a general data protection regulation and the roots of the
‘right to be forgotten’. Comput Law Secur Rev 29(3):229–235
Marshall A (1890) General relations of demand, supply, and value. Principles of economics:
unabridged eighth edition
Marshall A (2021) Ai comes to car repair, and body shop owners aren’t happy. Wired April 13
Marshall GA (1993) Racial classifications: popular and scientific. In: Harding S (ed) The “racial”
economy of science: Toward a democratic future, pp 116–125. Indiana University Press,
Bloomington, IN
Martin GD (1977) Gender discrimination in pension plans: Author’s reply. J Risk Insurance
44(1):145–149
Mary J, Calauzènes C, El Karoui N (2019) Fairness-aware learning for continuous attributes and
treatments. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of the 36th International
Conference on Machine Learning, Proceedings of Machine Learning Research, vol 97, pp
4382–4391
Mas L (2020) A Confederate flag spotted in the window of police barracks in Paris. France 24 10/07
Massey DS (2007) Categorically unequal: The American stratification system. Russell Sage
Foundation
Matthews BW (1975) Comparison of the predicted and observed secondary structure of T4 phage
lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure 405(2):442–451
Mayer J, Mutchler P, Mitchell JC (2016) Evaluating the privacy properties of telephone metadata.
Proc Natl Acad Sci 113(20):5536–5541
Maynard A (1979) Pricing, insurance and the national health service. J Soc Pol 8(2):157–176
Mayr E (1982) The growth of biological thought: Diversity, evolution, and inheritance. Harvard
University Press, Harvard
Mazieres A, Roth C (2018) Large-scale diversity estimation through surname origin inference.
Bull Sociol Methodol/Bull de Méthodologie Sociologique 139(1):59–73
Mbungo R (2014) L’approche juridique internationale du phénomène de discrimination fondée sur
le motif des antécédents judiciaires. Revue québécoise de droit international 27(2):59–97
McCaffrey DF, Ridgeway G, Morral AR (2004) Propensity score estimation with boosted
regression for evaluating causal effects in observational studies. Psychol Methods 9(4):403
McClenahan CL (2006) Ratemaking. In: Foundations of casualty actuarial science. Casualty
Actuarial Society
McCullagh P, Nelder J (1989) Generalized linear models. Chapman & Hall
McCulloch CE, Searle SR (2004) Generalized, linear, and mixed models. Wiley, New York
McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull
Math Biophys 5:115–133
McDonnell M, Baxter D (2019) Chatbots and gender stereotyping. Interact Comput 31(2):116–121
McFall L (2019) Personalizing solidarity? the role of self-tracking in health insurance pricing.
Econ Soc 48(1):52–76
McFall L, Meyers G, Hoyweghen IV (2020) Editorial: The personalisation of insurance: Data,
behaviour and innovation. Big Data Soc 7(2):1–11
McKinley R (2014) A history of British surnames. Routledge, Abingdon
McKinsey (2017) Technology, jobs and the future of work. McKinsey Global Institute
McPherson M, Smith-Lovin L, Cook JM (2001) Birds of a feather: Homophily in social networks.
Annu Rev Sociol 27(1):415–444
Meilijson I (2006) Risk aversion. Encyclopedia of Actuarial Science
Meinshausen N, Ridgeway G (2006) Quantile regression forests. J Mach Learn Res 7(6):983–999
de Melo-Martín I (2003) When is biology destiny? Biological determinism and social responsibility.
Philos Sci 70(5):1184–1194
Memmi A (2000) Racism. University of Minnesota Press, Minnesota
Menezes CF, Hanson DL (1970) On the theory of risk aversion. Int Econ Rev, 481–487
Mercat-Bruns M (2016) Discrimination at work. University of California Press, California
Mercat-Bruns M (2020) Les rapports entre vieillissement et discrimination en droit: une fertilisa-
tion croisée utile sur le plan individuel et collectif. La Revue des Droits de l’Homme 17
Merchant C (1980) The death of nature: Women, ecology, and the scientific revolution.
HarperCollins
Merriam-Webster (2022) Dictionary. Merriam-Webster
Merrill D (2012) New credit scores in a new world: Serving the underbanked. TEDxNewWallStreet
Messenger R, Mandell L (1972) A modal search technique for predictive nominal scale multivari-
ate analysis. J Am Stat Assoc 67(340):768–772
Meuleners LB, Harding A, Lee AH, Legge M (2006) Fragility and crash over-representation among
older drivers in Western Australia. Accident Anal Prevent 38(5):1006–1010
Meyers G, Van Hoyweghen I (2018) Enacting actuarial fairness in insurance: From fair discrimi-
nation to behaviour-based fairness. Sci Culture 27(4):413–438
Michelbacher G (1926) ‘moral hazard’ in the field of casualty insurance. Proc Casualty Actuar Soc
13(27):448–471
Michelson S, Blattenberger G (1984) Reverse regression and employment discrimination. J Bus
Econ Stat 2(2):121–122
Milanković M (1920) Théorie mathématique des phénomènes thermiques produits par la radiation
solaire. Gauthier-Villars, Paris
Miller G, Gerstein DR (1983) The life expectancy of nonsmoking men and women. Public Health
Rep 98(4):343
Miller H (2015a) A discussion on credibility and penalised regression, with implications for
actuarial work. Actuaries Institute
Miller MJ, Smith RA, Southwood KN (2003) The relationship of credit-based insurance scores to
private passenger automobile insurance loss propensity. Actuarial Study, Epic Actuaries
Miller T (2015b) Price optimization. Commonwealth of Pennsylvania, Insurance Department
August 25
Mills CW (2017) Black rights/white wrongs: The critique of racial liberalism. Oxford University
Press, Oxford
Milmo D (2021) Working of algorithms used in government decision-making to be revealed. The
Guardian November 29
Milne J (1815) A Treatise on the Valuation of Annuities and Assurances on Lives and Survivor-
ships: On the Construction of Tables of Mortality and on the Probabilities and Expectations of
Life, vol 2. Longman, Hurst, Rees, Orme, and Brown
Minsky M, Papert S (1969) An introduction to computational geometry. MIT Press, Cambridge,
MA
Miracle JM (2016) De-anonymization attack anatomy and analysis of Ohio nursing workforce data
anonymization. PhD thesis, Wright State University
Mittelstadt BD, Allo P, Taddeo M, Wachter S, Floridi L (2016) The ethics of algorithms: Mapping
the debate. Big Data Soc 3(2):2053951716679679
Mittra J (2007) Predictive genetic information and access to life assurance: The poverty of ‘genetic
exceptionalism’. BioSocieties 2(3):349–373
Mollat M, du Jourdin MM (1986) The poor in the Middle Ages: an essay in social history. Yale
University Press, Yale
Molnar C (2023) Interpretable machine learning: A guide for making black box models explainable.
https://christophm.github.io/interpretable-ml-book
Molnar C, Casalicchio G, Bischl B (2018) iml: An R package for interpretable machine learning. J
Open Source Soft 3(26):786
Monnet J (2017) Discrimination et assurance. Journal de Droit de la Santé et de l’Assurance
Maladie 16:13–19
Moodie EE, Stephens DA (2022) Causal inference: Critical developments, past and future. Canad
J Stat 50(4):1299–1320
Moon R (2014) From gorgeous to grumpy: adjectives, age and gender. Gender Lang 8(1):5–41
Moor L, Lury C (2018) Price and the person: Markets, discrimination, and personhood. J Cultural
Econ 11(6):501–513
Morel P, Stalk G, Stanger P, Wetenhall P (2003) Pricing myopia. The Boston Consulting Group
Perspectives
Morris DS, Schwarcz D, Teitelbaum JC (2017) Do credit-based insurance scores proxy for income
in predicting auto claim risk? J Empir Legal Stud 14(2):397–423
Morrison EJ (1996) Insurance discrimination against battered women: Proposed legislative
protections. Ind LJ 72:259
Moulin H (1992) An application of the shapley value to fair division with money. Econometrica,
1331–1349
Moulin H (2004) Fair division and collective welfare. MIT Press, Cambridge, MA
Mowbray A (1921) Classification of risks as the basis of insurance rate making with special
reference to workmen’s compensation. Proceedings of the Casualty Actuarial Society
Müller R, Kornblith S, Hinton GE (2019) When does label smoothing help? Adv Neural Inf Process
Syst 32:4694–4703
Mundubeltz-Gendron S (2019) Comment l’intelligence artificielle va bouleverser le monde du
travail dans l’assurance. L’Argus de l’Assurance 10/04
Murphy AH (1973) A new vector partition of the probability score. J Appl Meteor Climatol
12(4):595–600
Murphy AH (1996) General decompositions of mse-based skill scores: Measures of some basic
aspects of forecast quality. Month Weather Rev 124(10):2353–2369
Murphy AH, Winkler RL (1987) A general framework for forecast verification. Month Weather
Rev 115(7):1330–1338
Must A, Spadano J, Coakley EH, Field AE, Colditz G, Dietz WH (1999) The disease burden
associated with overweight and obesity. J Am Med Assoc 282(16):1523–1529
Myers RJ (1977) Gender discrimination in pension plans: Further comment. J Risk Insurance
44(1):144–145
Nadaraya EA (1964) On estimating regression. Theory Probab Appl 9(1):141–142
Nakashima R (2018) Google tracks your movements, like it or not. Associated Press August 14
Nassif H, Kuusisto F, Burnside ES, Page D, Shavlik J, Santos Costa V (2013) Score as you lift
(SAYL): A statistical relational learning approach to uplift modeling. In: Machine Learning and
Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech
Republic, September 23–27, 2013, Proceedings, Part III 13, pp 595–611. Springer, New York
Nathan EB (1925) Analysed mortality: the English Life No. 8a tables. Trans Faculty Actuaries
10:45–124
National Association of Insurance Commissioners (2011) A consumer’s guide to auto insurance.
NAIC Reports
National Association of Insurance Commissioners (2022) A consumer’s guide to auto insurance.
NAIC Reports
Natowicz MR, Alper JK, Alper JS (1992) Genetic discrimination and the law. Am J Human Genet
50(3):465
Neal RM (1992) Bayesian training of backpropagation networks by the hybrid Monte Carlo method.
Tech. rep., Citeseer
Neal RM (2012) Bayesian learning for neural networks, vol 118. Springer Science & Business
Media, New York
Neddermeyer JC (2009) Computationally efficient nonparametric importance sampling. J Am Stat
Assoc 104(486):788–802
Nelson A (2002) Unequal treatment: confronting racial and ethnic disparities in health care. J Natl
Med Assoc 94(8):666
Neyman J, Dabrowska DM, Speed T (1923) On the application of probability theory to agricultural
experiments. Essay on principles, section 9. Stat Sci, 465–472
Niculescu-Mizil A, Caruana R (2005a) Predicting good probabilities with supervised learning. In:
Proceedings of the 22nd International Conference on Machine Learning, pp 625–632
Niculescu-Mizil A, Caruana RA (2005b) Obtaining calibrated probabilities from boosting. In:
Proc. 21st Conference on Uncertainty in Artificial Intelligence (UAI’05), pp 413–420. AUAI
Press, Arlington, VA
Nielsen F (2013) Jeffreys centroids: A closed-form expression for positive histograms and a
guaranteed tight approximation for frequency histograms. IEEE Signal Process Lett 20(7):657–
660
Nielsen F, Boltz S (2011) The Burbea-Rao and Bhattacharyya centroids. IEEE Trans Inf Theory
57(8):5455–5466
Nielsen F, Nock R (2009) Sided and symmetrized Bregman centroids. IEEE Trans Inf Theory
55(6):2882–2904
Noguéro D (2010) Sélection des risques. Discrimination, assurance et protection des personnes
vulnérables. Revue générale du droit des assurances 3:633–663
Nordholm LA (1980) Beautiful patients are good patients: evidence for the physical attractiveness
stereotype in first impressions of patients. Soc Sci Med Part A Med Psychol Med Sociol
14(1):81–83
Norman P (2003) Statistical discrimination and efficiency. Rev Econ Stud 70(3):615–627
Nuruzzaman M, Hussain OK (2020) Intellibot: A dialogue-based chatbot for the insurance
industry. Knowl Based Syst 196:105810
Oaxaca R (1973) Male-female wage differentials in urban labor markets. Int Econ Rev 14(3):693–
709
Oaxaca RL, Ransom MR (1994) On discrimination and the decomposition of wage differentials. J
Economet 61(1):5–21
Obermeyer Z, Powers B, Vogeli C, Mullainathan S (2019) Dissecting racial bias in an algorithm
used to manage the health of populations. Science 366(6464):447–453
Ohlsson E, Johansson B (2010) Non-life insurance pricing with generalized linear models, vol 174.
Springer, New York
O’Neil C (2016) Weapons of math destruction: How big data increases inequality and threatens
democracy. Crown
Ong PM, Stoll MA (2007) Redlining or risk? A spatial analysis of auto insurance rates in Los
Angeles. J Policy Anal Manag 26(4):811–830
Opitz D, Maclin R (1999) Popular ensemble methods: An empirical study. J Artif Intell Res
11:169–198
Ortiz-Ospina E, Beltekian D (2018) Why do women live longer than men? Our World in Data
Orwat C (2020) Risks of discrimination through the use of algorithms. Institute for Technology
Assessment and Systems Analysis
Outreville JF (1990) The economic significance of insurance markets in developing countries. J
Risk Insurance, 487–498
Outreville JF (1996) Life insurance markets in developing countries. J Risk Insurance, 263–278
Owen AB (2013) Monte Carlo theory, methods and examples. Stanford Lectures Notes
Owsley C, McGwin Jr G (2010) Vision and driving. Vision Res 50(23):2348–2361
Oza D, Padhiyar D, Doshi V, Patil S (2020) Insurance claim processing using rpa along with
chatbot. In: Proceedings of the 3rd International Conference on Advances in Science &
Technology (ICAST)
Pager D (2003) The mark of a criminal record. Am J Sociol 108(5):937–975
Pager D (2008) Marked: Race, crime, and finding work in an era of mass incarceration. University
of Chicago Press, Chicago
Palmore E (1978) Are the aged a minority group? J Am Geriatrics Soc 26(5):214–217
Pardo L (2018) Statistical inference based on divergence measures. CRC Press, Boca Raton
Parléani G (2012) Commentaire des lignes directrices de la commission européenne sur les suites
de l’arrêt “test achats”. Revue générale du droit des assurances 3:563
Parry M (2016) Linear scoring rules for probabilistic binary classification. Electron J Stat 10:1596–
1607
Pasquale F (2015) The black box society: the secret algorithms that control money and information.
Harvard University Press, Harvard
Paugam S, Cousin B, Giorgetti C, Naudet J (2017) Ce que les riches pensent des pauvres. Seuil,
Paris
Pearl J (1988) Probabilistic reasoning in intelligent systems: networks of plausible inference.
Morgan Kaufmann, San Francisco, CA
Pearl J (1998) Graphs, causality, and structural equation models. Sociol Methods Res 27(2):226–
284
Pearl J (2009a) Causal inference in statistics: An overview. Stat Surv 3:96–146
Pearl J (2009b) Causality. Cambridge University Press, Cambridge
Pearl J (2010) An introduction to causal inference. Int J Biostat 6(2):1–59
Pearl J, Mackenzie D (2018) The book of why: the new science of cause and effect. Basic Books,
New York
Pedreshi D, Ruggieri S, Turini F (2008) Discrimination-aware data mining. In: Proceedings of
the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
KDD ’08, pp 560–568. Association for Computing Machinery
Perla F, Richman R, Scognamiglio S, Wüthrich MV (2021) Time-series forecasting of mortality
rates using deep learning. Scandinavian Actuarial J 2021(7):572–598
Petauton P (1998) L’opération d’assurance : définitions et principes. In: Ewald F, Lorenzi JH (eds)
Encyclopédie de l’assurance, Economica
Peters J, Janzing D, Schölkopf B (2017) Elements of causal inference: foundations and learning
algorithms. MIT Press, Cambridge, MA
Peters T (2014) Playing God? Genetic determinism and human freedom. Routledge
Petersen A, Müller HG (2019) Fréchet regression for random objects with Euclidean predictors.
Ann Stat 47(2):691–719
Petersen F, Mukherjee D, Sun Y, Yurochkin M (2021) Post-processing for individual fairness. Adv
Neural Inf Process Syst 34:25944–25955
Petit P, Duguet E, L'Horty Y (2015) Discrimination résidentielle et origine ethnique: une étude
expérimentale sur les serveurs en Île-de-France. Economie Prevision (1):55–69
Pfanzagl P (1979) Conditional distributions as derivatives. Ann Probab 7(6):1046–1050
Pfeffermann D (1993) The role of sampling weights when modeling survey data. International
Statistical Review/Revue Internationale de Statistique, pp 317–337
Phelps ES (1972) The statistical theory of racism and sexism. Am Econ Rev 62(4):659–661
Phelps JT (1895) Life insurance sayings. Riverside Press
Picard P (2003) Les frontières de l’assurabilité. Risques 54:65–66
Pichard M (2006) Les droits à: étude de législation française. Economica
Pisu M, Azuero A, McNees P, Burkhardt J, Benz R, Meneses K (2010) The out of pocket cost of
breast cancer survivors: a review. J Cancer Survivorship 4(3):202–209
Plakans A, Wetherell C (2000) Patrilines, surnames, and family identity: A case study from the
Russian Baltic provinces in the nineteenth century. Hist Family 5(2):199–214
Platt J, et al. (1999) Probabilistic outputs for support vector machines and comparisons to
regularized likelihood methods. Adv Large Margin Classifiers 10(3):61–74
Plečko D, Bennett N, Meinshausen N (2021) fairadapt: Causal reasoning for fair data pre-
processing. arXiv 2110.10200
Pleiss G, Raghavan M, Wu F, Kleinberg J, Weinberger KQ (2017) On fairness and calibration.
arXiv 1709.02012
Pohle MO (2020) The murphy decomposition and the calibration-resolution principle: A new
perspective on forecast evaluation. arXiv 2005.01835
Pojman LP (1998) The case against affirmative action. Int J Appl Philos 12(1):97–115
Poku M (2016) Campbell’s law: implications for health care. J Health Serv Res Policy 21(2):137–
139
Pope DG, Sydnor JR (2011) Implementing anti-discrimination policies in statistical profiling
models. Am Econ J Econ Policy 3(3):206–31
Porrini D, Fusco G (2020) Less discrimination, more gender inequality: The case of the Italian
motor-vehicle insurance. J Risk Manag Insurance 24(1):1–11
Porter TM (2020) Trust in numbers. Princeton University Press, Princeton
Powell L (2020) Risk-based pricing of property and liability insurance. J Insurance Regulat 1:1–23
Pradier PC (2011) (petite) histoire de la discrimination (dans les assurances). Risques 87:51–57
Pradier PC (2012) Les bénéfices terrestres de la charité. Les rentes viagères des hôpitaux parisiens,
1660–1690. Histoire Mesure 26(XXVI-2):31–76
Prince AE, Schwarcz D (2019) Proxy discrimination in the age of artificial intelligence and big
data. Iowa Law Rev 105:1257
Proschan MA, Presnell B (1998) Expect the unexpected from conditional expectation. Am Stat
52(3):248–252
Puddifoot K (2021) How stereotypes deceive us. Oxford University Press, Oxford
Puzzo DA (1964) Racism and the western tradition. J Hist Ideas 25(4):579–586
Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81–106
Quinlan JR (1987) Simplifying decision trees. Int J Man Mach Stud 27(3):221–234
Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Francisco, CA
Quinzii M, Rochet JC (1985) Multidimensional signalling. J Math Econ 14(3):261–284
Racine J, Rilstone P (1995) The reverse regression problem: statistical paradox or artefact of
misspecification? Canad J Econ, 502–531
Radcliffe N (2007) Using control groups to target on predicted lift: Building and assessing uplift
model. Direct Market Anal J, 14–21
Radcliffe N, Surry P (1999) Differential response analysis: Modeling true responses by isolating
the effect of a single action. Credit Scoring and Credit Control IV
Raftery AE, Madigan D, Hoeting JA (1997) Bayesian model averaging for linear regression
models. J Am Stat Assoc 92(437):179–191
Ransom RL, Sutch R (1987) Tontine insurance and the Armstrong investigation: a case of stifled
innovation, 1868–1905. J Econ Hist 47(2):379–390
Rattani A, Reddy N, Derakhshani R (2017) Gender prediction from mobile ocular images: A
feasibility study. In: IEEE International Symposium on Technologies for Homeland Security
(HST), pp 1–6. IEEE
Rattani A, Reddy N, Derakhshani R (2018) Convolutional neural networks for gender prediction
from smartphone-based ocular images. IET Biometrics 7(5):423–430
Rawls J (1971) A theory of justice: Revised edition. Harvard University Press, Harvard
Rawls J (2001) Justice as fairness: A restatement. Harvard University Press, Harvard
Rebert L, Van Hoyweghen I (2015) The right to underwrite gender: The goods & services directive
and the politics of insurance pricing. Tijdschrift Voor Genderstudies 18(4):413–431
Reichenbach H (1956) The direction of time. University of California Press, Berkeley
Reijns T, Weurding R, Schaffers J (2021) Ethical artificial intelligence – the Dutch insurance
industry makes it a mandate. KPMG Insights 03/2021
Reimers CW (1983) Labor market discrimination against hispanic and black men. Rev Econ Stat,
570–579
Reinsel GC (2003) Elements of multivariate time series analysis. Springer, New York
Rényi A (1959) On measures of dependence. Acta mathematica hungarica 10(3–4):441–451
Rescher N (2013) How wide is the gap between facts and values? In: Studies in Value Theory, pp
25–52. De Gruyter
Resnick S (2019) A probability path. Springer, New York
Rhynhart R (2020) Mapping the legacy of structural racism in Philadelphia. Office of the
Controller, Philadelphia
Riach PA, Rich J (1991) Measuring discrimination by direct experimental methods: Seeking
gunsmoke. J Post Keynesian Econ 14(2):143–150
Ribeiro MT, Singh S, Guestrin C (2016) “Why should I trust you?” Explaining the predictions of any
classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp 1135–1144
Ridgeway CL (2011) Framed by gender: How gender inequality persists in the modern world.
Oxford University Press, Oxford
Rifkin R, Klautau A (2004) In defense of one-vs-all classification. J Mach Learn Res 5:101–141
Riley JG (1975) Competitive signalling. J Econ Theory 10(2):174–186
Rink FT (1805) Ansichten aus Immanuel Kant’s Leben. Göbbels und Unzer
Rivera LA (2020) Employer decision making. Annu Rev Sociol 46:215–232
Robbins LA (2015) The pernicious problem of ageism. Generations 39(3):6–9
Robertson T, Wright FT, Dykstra R (1988) Order restricted statistical inference. Wiley, New York
Robinson PM (1988) Root-n-consistent semiparametric regression. Econometrica, 931–954
Robinson WS (1950) Ecological correlations and the behavior of individuals. Am Sociol Rev
15(3):351–357
Robnik-Šikonja M, Kononenko I (1997) An adaptation of relief for attribute estimation in
regression. In: Machine Learning: Proceedings of the Fourteenth International Conference
(ICML’97), vol 5, pp 296–304
Robnik-Šikonja M, Kononenko I (2003) Theoretical and empirical analysis of ReliefF and RReliefF.
Mach Learn 53(1):23–69
Robnik-Šikonja M, Kononenko I (2008) Explaining classifications for individual instances. IEEE
Trans Knowl Data Eng 20(5):589–600
Rodríguez Cardona D, Janssen A, Guhr N, Breitner MH, Milde J (2021) A matter of trust?
examination of chatbot usage in insurance business. In: Proceedings of the 54th Hawaii
International Conference on System Sciences, p 556
Rodríguez-Cuenca B, Alonso MC (2014) Semi-automatic detection of swimming pools from aerial
high-resolution images and lidar data. Remote Sens 6(4):2628–2646
Roemer JE (1996) Theories of distributive justice. Harvard University Press, Harvard
Roemer JE (1998) Equality of opportunity. Harvard University Press, Harvard
Roemer JE, Trannoy A (2016) Equality of opportunity: Theory and measurement. J Econ Literature
54(4):1288–1332
Rolski T, Schmidli H, Schmidt V, Teugels JL (2009) Stochastic processes for insurance and finance.
Wiley, New York
Rorive I (2009) Proving discrimination cases: The role of situation testing. Centre for Equal Rights
and Migration Policy Group
Rosen J (2011) The right to be forgotten. Stan L Rev Online 64:88
Rosenbaum P (2005) Observational study. Encyclopedia of statistics in behavioral science
Rosenbaum P (2018) Observation and experiment. Harvard University Press, Harvard
Rosenbaum PR, Rubin DB (1983) The central role of the propensity score in observational studies
for causal effects. Biometrika 70(1):41–55
Rosenberg NA (2011) A population-genetic perspective on the similarities and differences among
worldwide human populations. Human Biol 83(6):659
Rosenblatt F (1961) Principles of neurodynamics: perceptrons and the theory of brain mechanisms.
Tech. rep., Cornell Aeronautical Lab Inc Buffalo NY
Rosenthal JS (2006) A first look at rigorous probability theory. World Scientific Publishing
Company, Singapore
Ross SM (1972) Introduction to probability models. Academic Press, Cambridge, MA
Roth K, Lucchi A, Nowozin S, Hofmann T (2017) Stabilizing training of generative adversarial
networks through regularization. Adv Neural Inf Process Syst 30:2019–2029
Rothschild-Elyassi G, Koehler J, Simon J (2018) Actuarial justice, chap 14, pp 194–206. Wiley,
New York
Rothstein WG (2003) Public health and the risk factor: A history of an uneven medical revolution,
vol 3. Boydell & Brewer, Suffolk
Rouvroy A, Berns T, Carey-Libbrecht L (2013) Algorithmic governmentality and prospects of
emancipation. Réseaux 177(1):163–196
Royal A, Walls M (2019) Flood risk perceptions and insurance choice: Do decisions in the
floodplain reflect overoptimism? Risk Anal 39(5):1088–1104
Rubin DB (1974) Estimating causal effects of treatments in randomized and nonrandomized
studies. J Educat Psychol 66(5):688
Rubinow I (1936) State pool plans and merit rating. Law Contemp Probl 3(1):65–88
Rubinstein A (2012) Economic fables. Open Book Publishers, Cambridge
Rubinstein Y, Brenner D (2014) Pride and prejudice: Using ethnic-sounding names and inter-ethnic
marriages to identify labour market discrimination. Rev Econ Stud 81(1):389–425
Rudin C (2019) Stop explaining black box machine learning models for high stakes decisions and
use interpretable models instead. Nature Mach Intell 1(5):206–215
Rudin W (1966) Real and complex analysis. McGraw-hill, New York
Ruillier J (2004) Quatre petits coins de rien du tout. Bilboquet
Rule NO, Ambady N (2010) Democrats and Republicans can be differentiated from their faces.
PloS One 5(1):e8733
Rumelhart DE, Hinton GE, Williams RJ (1985) Learning internal representations by error
propagation. Tech. rep., California Univ San Diego La Jolla Inst for Cognitive Science
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating
errors. Nature 323(6088):533–536
Rundle AG, Bader MD, Richards CA, Neckerman KM, Teitler JO (2011) Using Google Street
View to audit neighborhood environments. Am J Prevent Med 40(1):94–100
Rupke N, Lauer G (2018) Johann Friedrich Blumenbach: race and natural history, 1750–1850.
Routledge
Russell C, Kusner M, Loftus C, Silva R (2017) When worlds collide: integrating different counter-
factual assumptions in fairness. In: Advances in Neural Information Processing Systems, NIPS
Proceedings, vol 30, pp 6414–6423
Sabbagh D (2007) Equality and transparency: A strategic perspective on affirmative action in
American law. Springer, New York
Saks S (1937) Theory of the integral. Instytut Matematyczny Polskiej Akademi Nauk, Warszawa-
Lwów
Sakurada M, Yairi T (2014) Anomaly detection using autoencoders with nonlinear dimensionality
reduction. In: Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for
Sensory Data Analysis, pp 4–11
Salimi B, Howe B, Suciu D (2020) Database repair meets algorithmic fairness. ACM SIGMOD
Record 49(1):34–41
Saltelli A, Ratto M, Andres T, Campolongo F, Cariboni J, Gatelli D, Saisana M, Tarantola S (2008)
Global sensitivity analysis: the primer. Wiley, New York
Samadi S, Tantipongpipat U, Morgenstern JH, Singh M, Vempala S (2018) The price of fair PCA:
One extra dimension. Adv Neural Inf Process Syst 31:3992–4001
Sanche F, Roberge I (2023) La question de la semaine sur le casier judiciaire et les assurances.
Radio Canada January 31
Sandel MJ (2020) The tyranny of merit: What’s become of the common good? Penguin, UK
Santambrogio F (2015) Optimal transport for applied mathematicians. Birkhäuser, New York
Santosa F, Symes WW (1986) Linear inversion of band-limited reflection seismograms. SIAM J
Scientific Statist Comput 7(4):1307–1330
Sarmanov O (1963) Maximum correlation coefficient (nonsymmetric case). Sel Transl Math Stat
Probab 2:207–210
Schanze E (2013) Injustice by generalization: notes on the Test-Achats decision of the European
Court of Justice. German Law J 14(2):423–433
Schapire RE (1990) The strength of weak learnability. Mach Learn 5:197–227
Schapire RE (2013) Explaining AdaBoost. Empirical Inference: Festschrift in Honor of Vladimir N
Vapnik, pp 37–52
Schauer F (2006) Profiles, probabilities, and stereotypes. Harvard University Press, Harvard
Schauer F (2017) Statistical (and non-statistical) discrimination. In: Lippert-Rasmussen K (ed)
Handbook of the ethics of discrimination, pp 42–53. Routledge
Schilling E (2006) Accuracy and precision. Encyclopedia of Statistical Sciences, p 25
Schlesinger A, O’Hara KP, Taylor AS (2018) Let’s talk about race: Identity, chatbots, and AI. In:
Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp 1–14
Schmeiser H, Störmer T, Wagner J (2014) Unisex insurance pricing: consumers’ perception and
market implications. Geneva Papers Risk Insurance Issues Pract 39(2):322–350
Schmidt KD (2006) Prediction. Encyclopedia of Actuarial Science
Schneier B (2015) Data and Goliath: The hidden battles to collect your data and control your world.
WW Norton & Company, New York
Schweik SM (2009) The ugly laws. New York University Press, New York
Scikit Learn (2017) Probability calibration. https://scikit-learn.org/stable/modules/calibration.html
Scism L (2019) New York insurers can evaluate your social media use–if they can prove why it’s
needed. The Wall Street Journal January 30
Scism L, Maremont M (2010a) Inside Deloitte’s life-insurance assessment technology. Wall Street
Journal November 19
Scism L, Maremont M (2010b) Insurers test data profiles to identify risky clients. Wall Street
Journal November 19
Scutari M, Panero F, Proissl M (2022) Achieving fairness with a simple ridge penalty. Stat Comput
32(5):77
Seelye KQ (1994) Insurability for battered women. New York Times May 12
Segall S (2013) Equality and opportunity. Oxford University Press, Oxford
Seicshnaydre SE (2007) Is the road to disparate impact paved with good intentions: Stuck on state
of mind in antidiscrimination law. Wake Forest L Rev 42:1141
Selbst AD, Barocas S (2018) The intuitive appeal of explainable machines. Fordham Law Rev
87:1085
Seligman D (1983) Insurance and the price of sex. Fortune February 21st
Seresinhe CI, Preis T, Moat HS (2017) Using deep learning to quantify the beauty of outdoor
places. Roy Soc Open Sci 4(7):170170
Shadish WR, Luellen JK (2005) Quasi-experimental designs. Encyclopedia of Statistics in
Behavioral Science
Shannon CE, Weaver W (1949) The mathematical theory of communication. University of Illinois
Press, Urbana, IL
Shapley LS (1953) A value for n-person games. In: Kuhn HW, Tucker AW (eds) Contributions to
the theory of games II, pp 307–317. Princeton University Press, Princeton
Shapley LS, Shubik M (1969) Pure competition, coalitional power, and fair division. Int Econ Rev
10(3):337–362
Shikhare S (2021) Next generation LTC - life insurance underwriting using facial score model. In:
Insurance Data Science Conference
Siddiqi N (2012) Credit risk scorecards: developing and implementing intelligent credit scoring,
vol 3. Wiley, New York
Silver N (2012) The signal and the noise: Why so many predictions fail-but some don’t. Penguin,
Harmondsworth
Simon J (1987) The emergence of a risk society-insurance, law, and the state. Socialist Rev 95:60–
89
Simon J (1988) The ideological effects of actuarial practices. Law Soc Rev 22:771
Singer P (2011) Practical ethics. Cambridge University Press, Cambridge
Slovic P (1987) Perception of risk. Science 236(4799):280–285
Small ML, Pager D (2020) Sociological perspectives on racial discrimination. J Econ Perspect
34(2):49–67
Smith A (1759) The theory of moral sentiments. Penguin, Harmondsworth
Smith CS (2021) A.I. here, there, everywhere. New York Times (February 23)
Smith DJ (1977) Racial disadvantage in Britain: the PEP report. Penguin, Harmondsworth
Smith GC, Pell JP (2003) Parachute use to prevent death and major trauma related to gravitational
challenge: systematic review of randomised controlled trials. BMJ 327(7429):1459–1461
Smythe HH (1952) The eta: a marginal Japanese caste. Am J Sociol 58(2):194–196
Sokolova M, Japkowicz N, Szpakowicz S (2006) Beyond accuracy, F-score and ROC: a family
of discriminant measures for performance evaluation. In: Australasian Joint Conference on
Artificial Intelligence, pp 1015–1021. Springer, New York
Sollich P, Krogh A (1995) Learning with ensembles: How overfitting can be useful. Adv Neural
Inf Process Syst 8:190–196
Solow RM (1957) Technical change and the aggregate production function. The Review of
Economics and Statistics, pp 312–320
Solution ICD (2020) How to increase credit score
Sorensen TA (1948) A method of establishing groups of equal amplitude in plant sociology based
on similarity of species content and its application to analyses of the vegetation on Danish
commons. Biol Skr 5:1–34
Spedicato GA, Dutang C, Petrini L (2018) Machine learning methods to perform pricing
optimization. A comparison with standard GLMs. Variance 12(1):69–89
Speicher T, Ali M, Venkatadri G, Ribeiro FN, Arvanitakis G, Benevenuto F, Gummadi KP, Loiseau
P, Mislove A (2018) Potential for discrimination in online targeted advertising. In: Conference
on Fairness, Accountability and Transparency, Proceedings of Machine Learning Research, pp
5–19
Spence M (1974) Competitive and optimal responses to signals: An analysis of efficiency and
distribution. J Econ Theory 7(3):296–332
Spence M (1976) Informational aspects of market structure: An introduction. Q J Econ, 591–597
Spender A, Bullen C, Altmann-Richer L, Cripps J, Duffy R, Falkous C, Farrell M, Horn T, Wigzell
J, Yeap W (2019) Wearables and the internet of things: Considerations for the life and health
insurance industry. Brit Actuarial J 24:e22
Spiegelhalter DJ, Dawid AP, Lauritzen SL, Cowell RG (1993) Bayesian analysis in expert systems.
Stat Sci, 219–247
Spirtes P, Glymour C, Scheines R (1993) Discovery algorithms for causally sufficient structures.
In: Causation, prediction, and search, pp 103–162. Springer, New York
Squires G (2011) Redlining to reinvestment. Temple University Press, Philadelphia, PA
Squires GD (2003) Racial profiling, insurance style: Insurance redlining and the uneven develop-
ment of metropolitan areas. J Urban Affairs 25(4):391–410
Squires GD, Chadwick J (2006) Linguistic profiling: A continuing tradition of discrimination in
the home insurance industry? Urban Affairs Rev 41(3):400–415
Squires GD, DeWolfe R (1981) Insurance redlining in minority communities. Rev Black Polit
Econ 11(3):347–364
Stark L, Stanhaus A, Anthony DL (2020) “I don’t want someone to watch me while I’m working”:
Gendered views of facial recognition technology in workplace surveillance. J Assoc Inf Sci
Technol 71(9):1074–1088
Steensma C, Loukine L, Orpana H, Lo E, Choi B, Waters C, Martel S (2013) Comparing life
expectancy and health-adjusted life expectancy by body mass index category in adult Canadians:
a descriptive study. Populat Health Metrics 11(1):1–12
Stein A (1994) Will health care reform protect victims of abuse-treating domestic violence as a
public health issue. Human Rights 21:16
Stenholm S, Head J, Aalto V, Kivimäki M, Kawachi I, Zins M, Goldberg M, Platts LG, Zaninotto
P, Hanson LM, et al. (2017) Body mass index as a predictor of healthy and disease-free life
expectancy between ages 50 and 75: a multicohort study. Int J Obesity 41(5):769–775
Stephan Y, Sutin AR, Terracciano A (2015) How old do you feel? the role of age discrimination
and biological aging in subjective age. PloS One 10(3):e0119293
Stevenson M (2018) Assessing risk assessment in action. Minnesota Law Rev 103:303
Steyerberg E, Eijkemans M, Habbema J (2001) Application of shrinkage techniques in logistic
regression analysis: a case study. Statistica Neerlandica 55(1):76–88
Stone DA (1993) The struggle for the soul of health insurance. J Health Polit Policy Law
18(2):287–317
Stone P (2007) Why lotteries are just. J Polit Philos 15(3):276–295
Štrumbelj E, Kononenko I (2010) An efficient explanation of individual classifications using game
theory. J Mach Learn Res 11:1–18
Štrumbelj E, Kononenko I (2014) Explaining prediction models and individual predictions with
feature contributions. Knowl Inf Syst 41:647–665
Struyck N (1740) Inleiding tot de algemeene geographie. Tirion 1740:231
Struyck N (1912) Les oeuvres de Nicolas Struyck (1687–1769): qui se rapportent au calcul des
chances, à la statistique générale, à la statistique des décès et aux rentes viagères. Société
générale néerlandaise d’assurances sur la vie et de rentes viagères
Stuart EA (2010) Matching methods for causal inference: A review and a look forward. Stat Sci
25(1):1
Suresh H, Guttag JV (2019) A framework for understanding sources of harm throughout the
machine learning life cycle. arXiv 1901.10002
Surowiecki J (2004) The wisdom of crowds: why the many are smarter than the few and how
collective wisdom shapes business, economies, societies and nations. Doubleday & Co, New
York
Sutton W (1874) On the method used by Dr. Price in the construction of the Northampton mortality
table. J Inst Actuaries 18(2):107–122
Swauger S (2021) The next normal: Algorithms will take over college, from admissions to
advising. Washington Post (November 12)
Sweeney L (2013) Discrimination in online ad delivery: Google ads, black names and white names,
racial discrimination, and click advertising. Queue 11(3):10–29
Szalavitz M (2017) Why do we think poor people are poor because of their own bad choices. The
Guardian July 5
Szepannek G, Lübke K (2021) Facing the challenges of developing fair risk scoring models. Front
Artif Intell 4:681915
Tajfel H, Turner JC, Worchel S, Austin WG, et al. (1986) Psychology of intergroup relations, pp
7–24. Nelson-Hall, Chicago
Tajfel HE (1978) Differentiation between social groups: Studies in the social psychology of
intergroup relations. Academic Press, Cambridge, MA
Tanaka K (2012) Surnames and gender in Japan: Women’s challenges in seeking own identity. J
Family Hist 37(2):232–240
Tang S, Zhang X, Cryan J, Metzger MJ, Zheng H, Zhao BY (2017) Gender bias in the job market:
A longitudinal analysis. Proc ACM Human Comput Interact 1(CSCW):1–19
Tasche D (2008) Validation of internal rating systems and PD estimates. In: The analytics of risk
model validation, pp 169–196. Elsevier
Taylor A, Sadowski J (2015) How companies turn your Facebook activity into a credit score. The
Nation May 27
Taylor S (2015) Price optimization ban. Government of the District of Columbia, Department of
Insurance August 25
Telles E (2014) Pigmentocracies: Ethnicity, race, and color in Latin America. UNC Press Books
Tharwat A (2021) Classification assessment methods. Appl Comput Inf 17(1):168–192
The Zebra (2022) Car insurance rating factors by state. https://www.thezebra.com/
Theobald CM (1974) Generalizations of mean square error applied to ridge regression. J Roy Stat
Soc B (Methodological) 36(1):103–106
Theodoridis S (2015) Machine learning: a Bayesian and optimization perspective. Academic Press,
Cambridge, MA
Thiery Y, Van Schoubroeck C (2006) Fairness and equality in insurance classification. Geneva
Papers Risk Insurance Issues Pract 31(2):190–211
Thomas G (2012) Non-risk price discrimination in insurance: market outcomes and public policy.
Geneva Papers Risk Insurance Issues Pract 37:27–46
Thomas G (2017) Loss coverage: Why insurance works better with some adverse selection.
Cambridge University Press, Cambridge
Thomas L, Crook J, Edelman D (2002) Credit scoring and its applications. SIAM
Thomas RG (2007) Some novel perspectives on risk classification. Geneva Papers Risk Insurance
Issues Pract 32(1):105–132
Thomson JJ (1976) Killing, letting die, and the trolley problem. Monist 59(2):204–217
Thomson W (1883) Electrical units of measurement. Popular Lect Addresses 1(73):73–136
Thornton SM, Pan S, Erlien SM, Gerdes JC (2016) Incorporating ethical considerations into
automated vehicle control. IEEE Trans Intell Transp Syst 18(6):1429–1439
Tian J, Pearl J (2002) A general identification condition for causal effects. In: Proceedings of the
Eighteenth National Conference on Artificial Intelligence, pp 567–573. MIT Press
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Roy Stat Soc B
(Methodological) 58(1):267–288
Tilcsik A (2021) Statistical discrimination and the rationalization of stereotypes. Am Sociol Rev
86(1):93–122
Topkis DM (1998) Supermodularity and complementarity. Princeton University Press, Princeton
Torkamani A, Wineinger NE, Topol EJ (2018) The personal and clinical utility of polygenic risk
scores. Nature Rev Genet 19(9):581–590
Torous W, Gunsilius F, Rigollet P (2021) An optimal transport approach to causal inference. arXiv
2108.05858
Traag V, Waltman L (2022) Causal foundations of bias, disparity and fairness. arXiv 2207.13665
Tribalat M (2016) Statistiques ethniques, Une querelle bien française. L’Artilleur, Paris
Tsamados A, Aggarwal N, Cowls J, Morley J, Roberts H, Taddeo M, Floridi L (2021) The ethics
of algorithms: key problems and solutions, pp 1–16. AI & Society
Tsybakov AB (2009) Introduction to nonparametric estimation. Springer, New York
Tufekci Z (2018) Facebook’s surveillance machine. New York Times 19:1
Tukey JW (1961) Curves as parameters, and touch estimation. In: Neyman J (ed) Proceedings of
the 4th Berkeley Symposium on Mathematical Statistics and Probability, vol 1, pp 681–694.
University of California Press, California
Tuppat J, Gerhards J (2021) Immigrants’ first names and perceived discrimination: A contribution
to understanding the integration paradox. Eur Sociol Rev 37(1):121–135
Turner R (2015) The way to stop discrimination on the basis of race. Stanford J Civil Rights Civil
Liberties 11:45
Tweedie MCK (1984) An index which distinguishes between some important exponential families.
Statistics: applications and new directions (Calcutta, 1981), pp 579–604
Tzioumis K (2018) Demographic aspects of first names. Scientific Data 5(1):1–9
Uotinen V, Rantanen T, Suutama T (2005) Perceived age as a predictor of old age mortality: a
13-year prospective study. Age Ageing 34(4):368–372
Upton G, Cook I (2014) A dictionary of statistics 3e. Oxford University Press, Oxford
US Census (2012) Frequently occurring surnames from census 2000, census report data file a: Top
1000 names. Genealogy Data
Van Calster B, McLernon DJ, Van Smeden M, Wynants L, Steyerberg EW (2019) Calibration: the
Achilles heel of predictive analytics. BMC Med 17(1):1–7
Van Deemter K (2010) Not exactly: In praise of vagueness. Oxford University Press, Oxford
Van der Vaart AW (2000) Asymptotic statistics. Cambridge University Press, Cambridge
Van Gerven G (1993) Case C-109/91, Gerardus Cornelis Ten Oever v. Stichting Bedrijfspensioenfonds
voor het Glazenwassers- en Schoonmaakbedrijf. EUR-Lex 61991CC0109
Van Lancker W (2020) Automating the welfare state: Consequences and challenges for the
organisation of solidarity. In: Shifting solidarities, pp 153–173. Springer, New York
Van Parijs P (2002) Linguistic justice. Polit Philos Econ 1(1):59–74
Van Schaack D (1926) The part of the casualty insurance company in accident prevention. Ann
Am Acad Polit Soc Sci 123(1):36–40
Vandenhole W (2005) Non-discrimination and equality in the view of the UN human rights treaty
bodies. Intersentia nv
Varga TV, Kozodoi N (2021) fairness, algorithmic fairness R package. R Vignette
Vassy JL, Christensen KD, Schonman EF, Blout CL, Robinson JO, Krier JB, Diamond PM, Lebo
M, Machini K, Azzariti DR, et al. (2017) The impact of whole-genome sequencing on the
primary care and outcomes of healthy adult patients: a pilot randomized trial. Ann Int Med
167(3):159–169
Verboven K (2011) Introduction: Professional collegia: Guilds or social clubs? Ancient Society, pp
187–195
Verma S, Rubin J (2018) Fairness definitions explained. In: 2018 IEEE/ACM International
Workshop on Software Fairness (Fairware), pp 1–7. IEEE
Verrall R (1996) Claims reserving and generalised additive models. Insurance Math Econ
19(1):31–43
Viaene S, Dedene G, Derrig RA (2005) Auto claim fraud detection using Bayesian learning neural
networks. Expert Syst Appl 29(3):653–666
Vidoni P (2003) Prediction and calibration in generalized linear models. Ann Inst Stat Math
55:169–185
Villani C (2003) Topics in optimal transportation, vol 58. American Mathematical Society,
Providence, RI
Villani C (2009) Optimal transport: old and new, vol 338. Springer, New York
Villazor RC (2008) Blood quantum land laws and the race versus political identity dilemma.
California Law Rev 96:801
Viswanathan KS (2006) Demutualization. Encyclopedia of Actuarial Science
Vogel R, Bellet A, Clémençon S, et al. (2021) Learning fair scoring functions: Bipartite ranking
under ROC-based fairness constraints. In: International Conference on Artificial Intelligence and
Statistics, Proceedings of Machine Learning Research, pp 784–792
Voicu I (2018) Using first name information to improve race and ethnicity classification. Stat Public
Policy 5(1):1–13
Volkmer S (2015) Notice regarding unfair discrimination in rating: optimization. State of Califor-
nia, Department of Insurance February 18
von Mises R (1928) Wahrscheinlichkeit Statistik und Wahrheit. Springer, New York
von Mises R (1939) Probability, statistics and truth. Macmillan, New York
Von Neumann J (1955) Collected works. Pergamon Press, Oxford, England
Von Neumann J, Morgenstern O (1953) Theory of games and economic behavior. Princeton
University Press, Princeton
Wachter S, Mittelstadt B (2019) A right to reasonable inferences: re-thinking data protection law
in the age of big data and ai. Columbia Bus Law Rev, 494
Wachter S, Mittelstadt B, Russell C (2017) Counterfactual explanations without opening the black
box: Automated decisions and the GDPR. Harvard J Law Technol 31:841
Wager S, Athey S (2018) Estimation and inference of heterogeneous treatment effects using
random forests. J Am Stat Assoc 113(523):1228–1242
Waldron H (2013) Mortality differentials by lifetime earnings decile: Implications for evaluations
of proposed social security law changes. Social Secur Bull 73:1
Wallis KF (2014) Revisiting Francis Galton’s forecasting competition. Stat Sci, 420–424
Wang DB, Feng L, Zhang ML (2021) Rethinking calibration of deep neural networks: Do not be
afraid of overconfidence. Adv Neural Inf Process Syst 34:11809–11820
Wang Y, Kosinski M (2018) Deep neural networks are more accurate than humans at detecting
sexual orientation from facial images. J Personality Soc Psychol 114(2):246
Wang Y, Yao H, Zhao S (2016) Auto-encoder based dimensionality reduction. Neurocomputing
184:232–242
Wasserman L (2000) Bayesian model selection and model averaging. J Math Psychol 44(1):92–107
Wasserstein LN (1969) Markov processes over denumerable products of spaces, describing large
systems of automata. Problemy Peredachi Informatsii 5(3):64–72
Watkins-Hayes C, Kovalsky E (2016) The discourse of deservingness. The Oxford Handbook of
the Social Science of Poverty 1
Watson DS, Gultchin L, Taly A, Floridi L (2021) Local explanations via necessity and sufficiency:
Unifying theory and practice. Uncertainty in Artificial Intelligence, pp 1382–1392
Watson GS (1964) Smooth regression analysis. Sankhyā Indian J Stat A, 359–372
Weber M (1904) Die protestantische Ethik und der Geist des Kapitalismus. Archiv für Sozialwis-
senschaft und Sozialpolitik 20:1–54
Weed DL (2005) Weight of evidence: a review of concept and methods. Risk Anal Int J 25(6):1545–
1557
Weisberg HI, Tomberlin TJ (1982) A statistical perspective on actuarial methods for estimating
pure premiums from cross-classified data. J Risk Insurance, 539–563
Welsh AH, Cunningham RB, Donnelly CF, Lindenmayer DB (1996) Modelling the abundance of
rare species: statistical models for counts with extra zeros. Ecol Modell 88(1–3):297–308
Westreich D (2012) Berkson’s bias, selection bias, and missing data. Epidemiology 23(1):159
Wheatley M (2013) The fairness challenge. FCA Financial Conduct Authority Speeches October
10
White RW, Doraiswamy PM, Horvitz E (2018) Detecting neurodegenerative disorders from web
search signals. NPJ Digit Med 1(1):8
Wiehl DG (1960) Build and blood pressure. Society of Actuaries
van Wieringen WN (2015) Lecture notes on ridge regression. arXiv 1509.09169
Wiggins B (2013) Managing Risk, Managing Race: Racialized Actuarial Science in the United
States, 1881–1948. University of Minnesota PhD thesis
Wikipedia (2023) Data. Wikipedia, The Free Encyclopedia
Wilcox C (1937) Merit rating in state unemployment compensation laws. Am Econ Rev, 253–259
Wilkie D (1997) Mutuality and solidarity: assessing risks and sharing losses. Philos Trans Roy Soc
Lond B Biol Sci 352(1357):1039–1044
Williams BA, Brooks CF, Shmargad Y (2018) How algorithms discriminate based on data they
lack: Challenges, solutions, and policy implications. J Inf Policy 8:78–115
Williams G (2017) Discrimination and obesity. In: Lippert-Rasmussen K (ed) Handbook of the
ethics of discrimination, pp 264–275. Routledge
Williams JE, Bennett SM (1975) The definition of sex stereotypes via the adjective check list. Sex
Roles 1(4):
Willson K (2009) Name law and gender in Iceland. UCLA: Center for the Study of Women
Wilson EB, Worcester J (1943) The determination of LD50 and its sampling error in bio-assay. Proc
Natl Acad Sci 29(2):79–85
Wing-Heir L (2015) Price optimization in ratemaking. State of Alaska, Department of Commerce,
Community and Economic Development Bulletin B 15–12
Winter RA (2000) Optimal insurance under moral hazard. Handbook of insurance, pp 155–183
Winterfeldt D, Edwards W (1986) Decision analysis and behavioral research
Witten IH, Frank E, Hall MA, Pal CJ (2016) Data mining: practical machine learning tools and
techniques. Morgan Kaufmann, Burlington, MA
Wod I (1985) Weight of evidence: A brief survey. Bayesian Stat 2:249–270
Wolff MJ (2006) The myth of the actuary: life insurance and Frederick L. Hoffman’s race traits
and tendencies of the American Negro. Public Health Rep 121(1):84–91
Wolffhechel K, Fagertun J, Jacobsen UP, Majewski W, Hemmingsen AS, Larsen CL, Lorentzen
SK, Jarmer H (2014) Interpretation of appearance: The effect of facial features on first
impressions and personality. PloS One 9(9):e107721
Wolpert DH (1992) Stacked generalization. Neural Networks 5(2):241–259
Wolthuis H (2004) Heterogeneity. Encyclopedia of Actuarial Science, pp 819–821
Woodhams C, Williams M, Dacre J, Parnerkar I, Sharma M (2021) Retrospective observational
study of ethnicity-gender pay gaps among hospital and community health service doctors in
England. BMJ Open 11(12):e051043
Wortham L (1985) Insurance classification: too important to be left to the actuaries. Univ Michigan
J Law 19:349
Works R (1977) Whatever’s fair-adequacy, equity, and the underwriting prerogative in property
insurance markets. Nebraska Law Rev 56:445
Wortham L (1986) The economics of insurance classification: The sound of one invisible hand
clapping. Ohio State Law J 47:835
Wright S (1921) Correlation and causation. J Agricultural Res 20:557–585
Wu Y, Zhang L, Wu X, Tong H (2019) Pc-fairness: A unified framework for measuring causality-
based fairness. Adv Neural Inf Process Syst 32:3404–3414
Wu Z, D’Oosterlinck K, Geiger A, Zur A, Potts C (2022) Causal proxy models for concept-based
model explanations. arXiv 2209.14279
Wüthrich MV, Merz M (2008) Stochastic claims reserving methods in insurance. Wiley, New York
Wüthrich MV, Merz M (2022) Statistical foundations of actuarial learning and its applications.
Springer Nature, New York
Yang TC, Chen VYJ, Shoff C, Matthews SA (2012) Using quantile regression to examine
the effects of inequality across the mortality distribution in the US counties. Soc Sci Med
74(12):1900–1910
Yao J (2016) Clustering in general insurance pricing. In: Frees E, Meyers G, Derrig R (eds)
Predictive modeling applications in actuarial science, pp 159–79. Cambridge University Press,
Cambridge
Yeung K (2018a) Algorithmic regulation: a critical interrogation. Regulation Governance
12(4):505–523
Yeung K (2018b) A study of the implications of advanced digital technologies (including AI
systems) for the concept of responsibility within a human rights framework. MSI-AUT (2018)
5
Yinger J (1998) Evidence on discrimination in consumer markets. J Econ Perspect 12(2):23–40
Yitzhaki S, Schechtman E (2013) The Gini methodology: a primer on a statistical methodology.
Springer, New York
Young IM (1990) Justice and the politics of difference. Princeton University Press, Princeton
Young RK, Kennedy AH, Newhouse A, Browne P, Thiessen D (1993) The effects of names on
perception of intelligence, popularity, and competence. J Appl Soc Psychol 23(21):1770–1788
Zack N (2014) Philosophy of science and race. Routledge
Zadrozny B, Elkan C (2001) Obtaining calibrated probability estimates from decision trees and
naive Bayesian classifiers. In: ICML, Citeseer, vol 1, pp 609–616
Zadrozny B, Elkan C (2002) Transforming classifier scores into accurate multiclass probability
estimates. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining, pp 694–699
Zafar MB, Valera I, Rodriguez MG, Gummadi KP (2017) Fairness constraints: Mechanisms for
fair classification. arXiv 1507.05259:962–970
Zafar MB, Valera I, Gomez-Rodriguez M, Gummadi KP (2019) Fairness constraints: A flexible
approach for fair classification. J Mach Learn Res 20(1):2737–2778
Zafar SY, Abernethy AP (2013) Financial toxicity, part i: a new name for a growing problem.
Oncology 27(2):80
Zelizer VAR (2017) Morals and markets: The development of life insurance in the United States.
Columbia University Press, New York
Zelizer VAR (2018) Morals and markets. Columbia University Press, New York
Zenere A, Larsson EG, Altafini C (2022) Relating balance and conditional independence in
graphical models. Phys Rev E 106(4):044309
Zhang J, Bareinboim E (2018) Fairness in decision-making–the causal explanation formula. In:
Thirty-Second AAAI Conference on Artificial Intelligence
Zhang L, Wu Y, Wu X (2016) A causal framework for discovering and removing direct and indirect
discrimination. arXiv 1611.07509
Zhou ZH (2012) Ensemble methods: foundations and algorithms. CRC Press, Boca Raton
Žliobaite I (2015) On the relation between accuracy and fairness in binary classification. arXiv
1505.05723
Žliobaite I, Custers B (2016) Using sensitive personal data may be necessary for avoiding
discrimination in data-driven decision models. Artif Intell Law 24(2):183–201
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J Roy Stat Soc B
(Stat Methodol) 67(2):301–320
Zuboff S (2019) The age of surveillance capitalism: The fight for a human future at the new frontier
of power. Public Affairs, New York
Index

Symbols Between variance, 72
compas, 12 Bias, 85
GermanCredit, 74, 117 global, 43, 172
local, 172
minimum, 74
A Big data, 180, 244
Accumulated local, 147 Binning, 74
Accuracy, 158 Black box, 47, 64
Activation function, 103 Blood pressure, 56
Actuarialism, 69 Boosting, 113
Actuarial pricing, v Bootstrap, 110
Adaboost, 114, 115, 375, 379 Branches, 106
Additive, 95 Breakdown, 132
Adverse-selection, 32, 34
Affirmative action, 15, 386
Aggregation, 69 C
AIDS, 43 Calibration, 74, 164, 254
Algorithm, 64 plot, 170
Altruism, 26 California, 184
Area under the curve (AUC), 161 Casualty, 182
Artificial intelligence, 64, 243 Categorization, 76
Association, 238 Causal, 184
Autoencoder, 120 Cause, 5
common, 278
Ceteris paribus, 123, 129
B Chain, 286
Back propagation, 104 Chain ladder, 54
Bagging, 110, 112, 431 Civil rights, 4
Balance, 43, 172, 326 Classes, 75
Barycenter, 254 Classification, 69
Bayes formula, 61 Classification tree, 108, 113
Bayesian, 105 Classifier, 79, 81
Becker, Gary, 238 Common cause, 278
Behavior, 202 Compound Poisson, 44
Best estimate, 91 Confounder, 17

Confusion matrix, 81 E
Consequentialism, 6 Econometrics, 34
Contingency table, 81 Efficiency, 26
Continuous, 68 Egalitarian, 8
Contribution Elastic, 99
Shapley, 136, 140 Elicitable, 79
Conundrum, 123 Empirical risk, 80
Cooperation, 26 Employment, 201
Counterfactual, 153, 369 Entropy, 107
Coverage, 28, 29 Equality, 15
Credit, 201 opportunity, 6
Credit scoring, 222, 264 treatment, 10
Criminal records, 191 Equalized odds, 326
Cross-financing, 42, 172 Equal opportunity, 6, 324
Equitable, 10
Ethics, 7
D Europe, 184
Data, 180 Expected utility, 52
big, 180, 244 Explainability, 123, 276
inferred, 188 Exponential family, 91
man made, 179
personal, 181
sensitive, 181 F
Decision, 240 Fairness
Decision making, 240 φ, 330
Decision tree, 105 Fable, 23, 47
Deductible, 29 Facial, 256
Demographic parity, 315, 322 Fair, 10
Deviance, 93 Fairness
Difference, 15 actuarial, 43, 276
Directed acyclical graph (DAG), 284 local individual, 359
Discrimination, v, 1 similarity, 358
direct, 5 Fat, 37
efficient, 10 First name, 248
fair, v Fraud, 182
indirect, 5 Friends, 273
legitimate, 34 Frisch–Waugh, 206
by proxy, 244, 245
rational, 10
reverse, 15, 386 G
statistical, 10 Gaussian, 87
unfair, 4 Gender, 37, 201, 224
Diseases, 256 fluid, 226
Disparate impact, 10 non binary, 226
Disparate treatment, 10 General Data Protection Regulation (GDPR),
Distance, 86 181, 211
Hellinger, 85 General insurance, 182
Mahalanobis, 91 Generalization, 240, 241
total variation, 86 Generalized additive models (GAM), 95
Wasserstein, 88 Generalized linear models (GLM), 91
Divergence, 86 lasso, 99
Jensen–Shannon, 88 ridge, 98
Kullback–Leibler, 86 Genetics, 43, 182
Durkheim, Emile, 9 Geodesic, 254
Geographical, 201 mutual, 27
Gini impurity, 107 public, 26
Global bias, 43, 172 Intention, 2, 4, 5, 11
Glottophobia, 252 Intentional, 5
Google Street View, 257 Interaction
Graphs, 284 Shapley, 140
Group, v Interpretability, 123, 238
Group fairness Isotonic, 105
AUC, 332 Is-ought, 242
calibration, 335
class balance, 326
demographic parity, 315 K
disparate mistreatment, 327 Kahneman, Daniel, 240
equal opportunity, 324 Kelvin, 307
equal treatment, 334 Kranzberg’s law, 383
false positive, 325
independence, 315
nonreconstruction, 337
separation, 324 L
sufficiency, 335 Ladder of causation
true positive, 325 association, 280
counterfactual, 296
intervention, 290
H La Fontaine, Jean, 177
Halley, Edmond, 35 Lagrangian, 98
Health, 56 Lantern laws, 254
Health insurance, 182 Latent model, 17
Heterogeneity, 34 Learner
History, 201 strong, 110
Homophilia, 270 weak, 110
Hume, David, 9, 242 Leaves, 106, 108
Hurdle, 94 Lemons, 33
Life insurance, 182
Life table, 35, 209
I Likelihood, 92
IBNR, 54 Linguistic, 253
ICE, 129 Link, 93
Identifiability, 72 Lipschitz, 358
Identity, 69, 236 Local, 359
Immutable, 6 Local bias, 172
Indifference utility, 49 Local dependence plot, 141
Individual, v Local interpretability, 129
Individual conditional expectation, 129 Location, 201
Individual fairness Log-likelihood, 92
counterfactual, 369 Loss, 79
proxy discrimination, 359 ℓ, 79
Individualization, 76 ℓ1, 80, 99
Inefficiencies, 26 ℓ2, 79, 84, 98, 121
Information ℓ0/1, 81, 115
asymmetry, 33 absolute, 80
In-sample, 83 quadratic, 79, 84
Insurance quantile, 80
motor, 29
M Orthogonalization, 389
Mahalanobis, 91, 358 Out-of-sample, 84
Majority, 106 Overfit, 63, 83, 106, 115
Majority rule, 111
Manhart, 241
P
Marital status, 202
Paradox
Mean squared error, 80, 107
Simpson, 209
Merit, 6, 205, 264
Partial dependence plot (PDP), 140
Mileage, 29
Penalty, 98
Minimum bias, 74
Personalization, 76
Misclassification, 46
Personalized premium, 44
Mitigation, 15, 386
Personalized pricing, 244
Mixed models, 97
Phrenology, 254
Model, 63
Pirates, 66
additive, 74
Platt scaling, 105
bagging, 110, 112
Poisson, 94
black box, 64
Pooling, 242
boosting, 113
Population Stability Index (PSI), 87
ensemble, 110
Predictive parity, 336, 337
multiplicative, 74
Prejudice, v, 2
neural networks, 101
Pricing, v
random forest, 110
Prima facie, 254
trees, 105
Principal component analysis, 120
Monge–Kantorovich, 367
Principal components, 101
Moral, 6, 7
Privacy, 189
Moral hazard, 32, 34
Probability, 61
Moral machine, 8
conditional, 61
Morgenstern, Oskar, 52
Propensity score, 301, 388
Murphy decomposition, 167
Propublica, 12
Mutatis mutandis, 124
Proxy, 237, 244
Mutual aid, 66
Pruning, 106
Mutual information, 87
Mutuality, 25
Mutualization, 242 Q
Quantile, 80, 89
Québec, 184
N
Names, 250
Network, 268, 270 R
Neural networks, 101 Race, 6
Neutral, v Raison d’être, 231
Non-reconstruction, 337 Random forest, 110, 113
Norm, 242 Rawls, John, 8
Normal, 56 Receiver operating characteristic (ROC), 118,
161
Reciprocity, 26
O Redistribution, 26
Obesity, 37 Redlining, 261
Ockham, 63 Regression, 78
Old, 232 Reichenbach, Hans, 278
Opaque, 64 Religion, 6
OpenStreetMap, 259 Resampling, 110
Optimal Reserving, 54
coupling, 367 Resolution, 167
Responsibility, 205 Subsidize, 42
Reverse discrimination, 15, 386 Supermodular, 364
Risk, 79 System 1/2, 240
empirical, 80
in-sample, 83
out-of-sample, 84 T
Risk aversion, 50 Ten oever, 241
ROC curve, 330 Tobacco, 37
Rome, 65 Top-bottom, 107
Rubinstein, Ariel, 23 Total expectations, 41
Total probability, 41
Total variation, 86
S Tree, 105, 108, 112
Sandel, Michael, 205 True-positive equality, 324
Satellite, 262 Tweedie, 44, 94
Score, 16, 18, 76
Scoring rules, 85, 159
Sensitive attribute U
age, 185, 232 Uncertainty, 167
family status, 186 Unfairly discriminatory, 4
gender, 224 Unsupervised, 119
race, 6, 223 Utility function, 49
religion, 6
sex, 224
Separation, 286 V
Sex, 37, 224 Variable importance, 126
Shapley, 140 Variable selection, 98, 202
Shapley contributions, 129, 136 Variance
Sharpness, 160 covariance, 112
Signaling, 10 decomposition, 45, 71
Similarity, 358 Vehicle, 201
Simpson’s paradox, 207 Verlaine, Paul, 272
Smoker, 37 Von Neumann, John, 52
Smooth, 95
Socialization, 26
Solidarity, 26 W
luck, 242 Weak demographic parity, 315
random, 46, 242 Weak equal opportunity, 324
subsidiary, 46 Weak learner, 110
Sparsity, 99 Wealth, 49, 261
Spatial data, 261 Weights, 115
Stacking, 110 Welfare, 26
Statistics, 180 Well calibrated, 170
Stereotype, 240 Why, 276
Street views, 262 Within variance, 72
Strong demographic parity, 315 Wrong, 1, 6
Strong equal opportunity, 325
Strong learner, 110
Structural causal model, 291, 292 Z
Struyck, Nicolas, 36 Zero-inflated, 94
Subjectivity, 239
