Abstract
Calibrated uncertainty estimates are essential for classifiers used in safety-critical applications. If a classifier is uncalibrated, then there is a unique way to calibrate its uncertainty using the idealistic true calibration map corresponding to this classifier. Although the true calibration map is typically unknown in practice, it can be estimated with many post-hoc calibration methods which fit some family of potential calibration functions on a validation dataset. This paper examines the connection between such post-hoc calibration methods and calibration evaluation. Despite the negative connotations of fitting on test data in machine learning, we claim that fitting calibration maps on test data as part of the calibration evaluation process is a method worth considering, and we refer to this view as fit-on-test. This view enables the usage of any post-hoc calibration method as an evaluation measure, unlocking missed opportunities in development of evaluation methods. We prove that even ECE, which is the most common calibration evaluation method, is actually a fit-on-test measure. This observation leads us to a new method of tuning the number of bins in ECE with cross-validation. Fitting on test data can lead to test-time overfitting, and therefore, we discuss the limitations and concerns with the fit-on-test view. Our contributions also include: (1) enhancement of reliability diagrams with diagonal filling; (2) development of new calibration map families PL and PL3; and (3) an experimental study of which families perform strongly both as post-hoc calibrators and calibration evaluators.
1 Introduction
When classifiers are incorporated into safety-critical applications, it is essential that the predictions of these classifiers involve reliable uncertainty estimates. If the predictions are over-confident, then this can cause costly errors, such as an autonomous vehicle getting into an accident. If the predictions are under-confident, then this can result in a failure of the system to fulfill its task, e.g. an autonomous car moving too slowly to mitigate the over-estimated risks. Therefore, classifiers are expected to report calibrated uncertainty in the form of class probability estimates. A probabilistic classifier is considered calibrated if, within groups of similar predictions, the average prediction is in agreement with the actual class proportions. For example, in binary classification this implies that if the classifier predicts 80% probability to be positive for each of a set of 100 instances, then 80 of these instances are expected to be truly positive. Most learning algorithms result in classifiers that are not well-calibrated and need dedicated post-hoc calibration methods to be applied (Niculescu-Mizil & Caruana, 2005; Guo et al., 2017).
Progress in developing methods to get calibrated classifiers can only be made if we have reliable methods for evaluating calibration. In binary classification, the most common way of estimating a classifier’s calibration is through reliability diagrams and ECE (estimated calibration error, also known as the expected calibration error) (Murphy & Winkler, 1977; Broecker, 2011; Naeini et al., 2015). In reliability diagrams, many instances with similar predicted probabilities are binned together to get an estimate of the true calibrated probability in each bin by averaging the corresponding class labels. However, there is no consensus on how to place the bins in reliability diagrams or how many bins there should be (Roelofs et al., 2020). Usually 10, 15, or 20 bins are used (Naeini et al., 2015; Guo et al., 2017). Bins are placed either with equal width, so that each bin covers an equal region of the probability space, or with equal size, so that each bin contains an equal number of predictions. The choice of binning can drastically impact the shape of the reliability diagram and alter the estimated calibration error (Roelofs et al., 2020; Kumar et al., 2019; Nixon et al., 2019). Failure to measure calibration reliably leads to problems deciding which classifier is better calibrated or which method of post-hoc calibration is better. This in turn harms the performance of safety-critical systems.
Even though multiple works have been published on methods of evaluating calibration, e.g. Widmann et al. (2019) and Zhang et al. (2020), these have not exploited the more direct link between post-hoc calibration and evaluation, which we refer to as the fit-on-test paradigm. According to this paradigm, any post-hoc calibration method can be repurposed for calibration evaluation by applying it on the test data (not on the validation data as in post-hoc calibration) and then using it as a plug-in estimator of calibration error.
The contributions of this paper are the following:

- We introduce the fit-on-test paradigm of evaluating calibration, showing that any post-hoc calibration method can also be used for evaluating calibration (Sect. 4.2);
- We prove that the classical binning-based ECE measure follows from the fit-on-test paradigm using a particular calibration map family (Sect. 4.4);
- Exploiting this fact, we show how cross-validation can be used for optimising the number of bins in ECE (Sect. 4.6);
- We demonstrate shortcomings in the common visualisations of reliability diagrams and propose reliability diagrams with diagonal filling (Sect. 4.5);
- Using the fit-on-test paradigm, we develop new methods PL and PL3 of evaluating calibration using continuous piecewise linear functions (Sect. 5);
- We clarify the methodology of assessing calibrators and calibration evaluators (Sect. 6) and introduce the usage of pseudo-real data for this purpose (Sect. 7);
- We perform experimental comparisons to find out which families of calibration maps result in better post-hoc calibration, better reliability diagrams, and better approximations of calibration errors (Sect. 7);
- We discuss the limitations of the fit-on-test paradigm (Sect. 8).
2 Related work
In this section, an overview of different approaches to calibration error evaluation is given. Evaluation of calibration has been the main focus of several works after the introduction of reliability diagrams and ECE (Murphy & Winkler, 1977; Broecker, 2011; Naeini et al., 2015). To start with, Vaicenavicius et al. (2019) proposed a more general definition of calibration and a method to perform statistical calibration tests based on binning. Widmann et al. (2019) proposed the kernel calibration error for calibration evaluation in multi-class classification. Roelofs et al. (2020) proposed a method which chooses the maximal number of bins such that it leads to a monotonically increasing reliability diagram. Popordanoska et al. (2022) proposed estimating multi-class calibration error using kernel density estimation with Dirichlet kernels. Popordanoska et al. (2023) proposed Kullback-Leibler calibration error, allowing one to estimate all proper calibration errors and refinement terms.
The research on evaluating calibration has gone hand-in-hand with the research on post-hoc calibration, which aims to learn a calibration map transforming the classifier’s output probabilities into calibrated probabilities. Many papers have contributed to both post-hoc calibration and evaluation. Naeini et al. (2015) proposed BBQ and used ECE to evaluate calibration. Guo et al. (2017) proposed temperature, vector and matrix scaling and used reliability diagrams and ECE for evaluating confidence calibration in multi-class classification. Confidence stands for the predicted probability of the most likely class, leaving out all the other probabilities and class-wise relations. This is also referred to as top-label calibration error by Kumar et al. (2019). Kull et al. (2019) proposed Dirichlet calibration and the notion of classwise-calibration error, which measures the calibration error for each class separately. Kumar et al. (2019) proposed scaling-binning calibration, a new debiasing method for ECE and the notion of marginal calibration error. Marginal calibration error is similar to classwise-calibration error defined concurrently with Kull et al. (2019), adding a possibility to control how much each class counts towards the error. Zhang et al. (2020) proposed generic Mix-n-Match calibration strategies and used kernel density estimation (KDE) for estimating calibration error. Gupta et al. (2021) proposed a calibration evaluation metric based on the Kolmogorov-Smirnov test and a calibration method based on fitting splines. Xiong et al. (2023) proposed proximity calibration (procal) for confidence calibration and proximity-informed ECE (PIECE). PIECE divides instances into proximity groups based on their distance in the representation space and measures the calibration error of each proximity group separately.
3 Notation and background
3.1 True calibration error
We present the methods for binary classification, but Sect. 3.3 shows applicability to multi-class classification as well. Consider a binary classifier \(f:\mathcal {X}\rightarrow [0,1]\) predicting the probabilities of instances to be positive. Let \(X\in \mathcal {X}\) be a randomly drawn instance, \(Y\in \{0,1\}\) its true class, and let us denote the model’s predictions with \(\hat{P}=f(X)\). Every classifier f has a corresponding true calibration map, which could be used to perfectly calibrate the model: \(c^*_f(\hat{p})=\mathbb {E}[Y\mid \hat{P}=\hat{p}]\) (also known as the canonical calibration function (Vaicenavicius et al., 2019)). For evaluation of calibration, consider a test dataset with instances \(x_1,\dots ,x_n\in \mathcal {X}\) and true labels \(y_1,\dots ,y_n\in \{0,1\}\), and denote the predictions by \(\hat{p}_i=f(x_i)\). The true calibration error (CE) is the model’s average violation of calibration; it could be defined on the overall test distribution as \(\mathbb {E}[\vert c^*_f(\hat{P})-\hat{P}\vert ^\alpha ]\) (Kumar et al., 2019) but we define it for the test dataset:

$$\begin{aligned} \textsf{CE}^{(\alpha )}=\frac{1}{n}\sum _{i=1}^{n}\bigl \vert c^*_f(\hat{p}_i)-\hat{p}_i\bigr \vert ^\alpha , \end{aligned}$$
(1)
where \(\alpha =1\) corresponds to absolute error (MAE) and \(\alpha =2\) to squared error (MSE). Figure 1a shows an example of a true calibration map, where each red line shows the violation of calibration corresponding to a particular data point, and the average length of the red lines equals CE.
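As a concrete illustration of Eq.(1): on synthetic data where the true calibration map is known by construction, the true calibration error can be computed directly. The sketch below is illustrative only; the map `c_star` is a made-up example of an over-confident classifier, not the one used in Fig. 1.

```python
import numpy as np

def true_calibration_error(p_hat, c_star, alpha=1):
    """Eq. (1): average violation |c*_f(p_i) - p_i|^alpha over the test set."""
    p_hat = np.asarray(p_hat, dtype=float)
    return float(np.mean(np.abs(c_star(p_hat) - p_hat) ** alpha))

# Made-up true calibration map of an over-confident classifier:
# predictions are pulled back towards 0.5.
c_star = lambda p: 0.5 + 0.5 * (2 * p - 1) ** 3

p_hat = np.array([0.1, 0.3, 0.7, 0.9])
mae = true_calibration_error(p_hat, c_star, alpha=1)  # CE^(1) = 0.156
mse = true_calibration_error(p_hat, c_star, alpha=2)  # CE^(2) = 0.02448
```

In practice \(c^*_f\) is unknown, which is exactly why the estimation methods discussed next are needed.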
a True calibration map (orange line) versus the predicted probabilities (dashed line). Connecting lines show instance-wise miscalibration. b Reliability diagram consists of bars (blue) with the height of average label. The red lines show the error between the mean labels and predicted probabilities in each bin. The diagrams are made with synthetic data (3000 data points, stratified sample of 50 data points from the bins shown for instance-wise errors, see Appendix D.3 for more details) (Color figure online)
3.2 Reliability diagrams and ECE
There are multiple ways to estimate calibration error. One of the most popular ways is using reliability diagrams (Murphy & Winkler, 1977). The reliability diagram is a bar plot, where each bar contains a certain region of probabilities (a bin) and the bar height corresponds to the average label (\(\bar{y}_k\)) in the k-th bin (Fig. 1b). Each red line in Fig. 1b shows the difference between the average label \(\bar{y}_k\) and the average prediction \(\bar{p}_k\) in the k-th bin. The vector \(\textbf{B}=(B_1,\dots ,B_{b+1})\) provides the bin boundaries \(0=B_1<B_2<\ldots<B_b<B_{b+1}=1+\epsilon\), resulting in bins \([B_1,B_2),\dots ,[B_b,B_{b+1})\), where \(\epsilon\) is an infinitesimal to ensure that \(1\in [B_b,B_{b+1})\). Thus, \(\bar{y}_k=\frac{1}{n_k}\sum _{i:\hat{p}_i\in [B_k,B_{k+1})} y_i\) and \(\bar{p}_k=\frac{1}{n_k}\sum _{i:\hat{p}_i\in [B_k,B_{k+1})} \hat{p}_i\) where \(n_k=\vert \{i: \hat{p}_i\in [B_k,B_{k+1})\}\vert\) is the size of bin k. The bins can be either equal size (each bin has the same number of instances), or equal width (each bin covers an equal region of the probability space).
Based on the reliability diagrams (Fig. 1b), the estimated calibration error (ECE) (Naeini et al., 2015) is a weighted average of the differences between the mean label and the mean prediction in each bin:

$$\begin{aligned} \textsf{ECE}^{(\alpha )}_{\textbf{B}}=\sum _{k=1}^{b}\frac{n_k}{n}\,\vert \bar{y}_k-\bar{p}_k\vert ^\alpha . \end{aligned}$$
(2)
The binning-based ECE is known to be biased (Broecker, 2011; Ferro & Fricker, 2012) with \(\mathbb {E}[\textsf{ECE}^{(\alpha )}_\textbf{B}]\ne \mathbb {E}[\textsf{CE}^{(\alpha )}]\), hence in our experiments we use debiasing as proposed by Kumar et al. (2019).
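For reference, a minimal sketch of binned ECE as in Eq.(2), supporting both equal-width and equal-size binning (the debiasing of Kumar et al. (2019) mentioned above is deliberately left out):

```python
import numpy as np

def ece(p_hat, y, b=10, alpha=1, scheme="width"):
    """Binned ECE of Eq. (2): sum_k (n_k / n) * |ybar_k - pbar_k|^alpha.

    scheme="width": b equal-width bins over [0, 1];
    scheme="size":  b equal-size bins with boundaries at prediction quantiles.
    """
    p_hat, y = np.asarray(p_hat, float), np.asarray(y, float)
    if scheme == "width":
        edges = np.linspace(0.0, 1.0, b + 1)
    else:
        edges = np.quantile(p_hat, np.linspace(0.0, 1.0, b + 1))
    # Right-open bins [B_k, B_{k+1}); clip so that p = 1 lands in the last bin.
    idx = np.clip(np.digitize(p_hat, edges[1:-1]), 0, b - 1)
    total = 0.0
    for k in range(b):
        mask = idx == k
        if mask.any():  # mask.mean() equals the bin weight n_k / n
            total += mask.mean() * abs(y[mask].mean() - p_hat[mask].mean()) ** alpha
    return total

# Toy test set with two bins split at 0.5:
p = [0.1, 0.3, 0.65, 0.75, 0.85, 0.95]
y = [0, 1, 0, 1, 1, 1]
print(round(ece(p, y, b=2), 3))  # 0.133
```

The same data is used in the worked example of Sect. 4.4.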
3.3 Calibration evaluation for multi-class classification
In contrast to binary classification, there are multiple different definitions of calibration for multi-class tasks:

- a binary classifier is calibrated if all predicted probabilities to be positive are calibrated: \(\Pr [Y=1\vert f(X)=\hat{p}]=\hat{p}\) for all \(\hat{p}\in [0,1]\);
- a multi-class classifier is class-k-calibrated if all the predicted probabilities of class k are calibrated: \(\Pr [Y=k\vert f_k(X)=\hat{p}]=\hat{p}\) for all \(\hat{p}\in [0,1]\) (Kull et al., 2019; Kumar et al., 2019; Nixon et al., 2019);
- a multi-class classifier is confidence-calibrated if \(\Pr [Y=\mathop {\mathrm {arg\,max}}\limits f(X)\vert \max f(X)=\hat{p}]=\hat{p}\) for all \(\hat{p}\in [0,1]\) (Kull et al., 2019; Guo et al., 2017).
However, in all of the above scenarios, we need to evaluate whether the predicted and actual probabilities of an event are equal among all instances with shared predictions. By redefining \(Y=1\) and \(Y=0\) to denote whether or not the event happened and \(\hat{P}=f(X)\) to denote the estimated probability of that event, we have essentially reduced all three evaluation tasks to the first task of evaluating calibration in binary classification. The shared definition of calibration then becomes \(\Pr [Y=1\vert f(X)=\hat{p}]=\hat{p}\), or equivalently, \(\mathbb {E}[Y\vert f(X)=\hat{p}]=\hat{p}\). This also explains why ECE has been applied in all three scenarios.
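The reduction described above can be made concrete in a few lines. The sketch below (with hypothetical helper names) maps a multi-class prediction matrix to the binary setting, for both confidence calibration and class-k-calibration:

```python
import numpy as np

def to_binary_confidence(probs, labels):
    """Confidence calibration as binary calibration: the 'event' is that the
    predicted class is correct, and p_hat is the confidence max f(X)."""
    probs = np.asarray(probs, dtype=float)
    p_hat = probs.max(axis=1)
    y = (probs.argmax(axis=1) == np.asarray(labels)).astype(int)
    return p_hat, y

def to_binary_classwise(probs, labels, k):
    """Class-k-calibration as binary calibration: the 'event' is Y = k,
    and p_hat is the predicted probability f_k(X)."""
    probs = np.asarray(probs, dtype=float)
    y = (np.asarray(labels) == k).astype(int)
    return probs[:, k], y

probs = [[0.7, 0.2, 0.1],
         [0.1, 0.6, 0.3]]
labels = [0, 2]
print(to_binary_confidence(probs, labels))  # p_hat = [0.7, 0.6], y = [1, 0]
```

Any binary calibration evaluation method, ECE included, can then be applied to the resulting pairs \((\hat{p}_i, y_i)\).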
3.4 Post-hoc calibration
Post-hoc calibration is the task where the goal is to use a validation set to obtain an estimate \(\hat{c}\) of the true calibration map \(c^*_f\) for a given uncalibrated classifier f. Post-hoc calibration methods view the task essentially as binary regression: given the predictions \(\hat{p}_1,\dots ,\hat{p}_n\in [0,1]\) and the corresponding true binary labels \(y_1,\dots ,y_n\in \{0,1\}\), find a ‘regression’ model \(\hat{c}:[0,1]\rightarrow [0,1]\) that best predicts the labels from the predictions, evaluated typically by cross-entropy or mean squared error, which in this context are respectively known as the log-loss and the Brier score, two members of the family of strictly proper losses (Brier, 1950). Why are proper losses a good way of evaluating progress towards estimating the true calibration map? A common justification is that these losses have the virtue that they are minimised by the perfectly calibrated model \(c^*_f\) (Kumar et al., 2019), that is: \(\mathop {\mathrm {arg\,min}}\limits _{\hat{c}(\hat{p})}\mathbb {E}[l(\hat{c}(\hat{p}),Y)\vert \hat{P}=\hat{p}]=c^*_f(\hat{p})\) for any \(\hat{p}\in [0,1]\) and any strictly proper loss l. However, this justification refers to the optimum only. Our following Theorem 1 makes an even stronger claim: a reduction of the expected loss l leads to a same-sized improvement in how well \(\hat{c}(\hat{p})\) approximates \(c^*_f(\hat{p})\), measured by any Bregman divergence \(d:[0,1]\times [0,1]\rightarrow \mathbb {R}\) (here d quantifies the dissimilarity between two binary categorical probability distributions, and it is a strictly proper loss when the label is its second argument; see details and proofs of the theorems in Appendix B):
Theorem 1
Let \(d:[0,1]\times [0,1]\rightarrow \mathbb {R}\) be any Bregman divergence and \(\hat{c}_1,\hat{c}_2:[0,1]\rightarrow [0,1]\) be two estimated calibration maps. Then for any \(\hat{p}\in [0,1]\):

$$\begin{aligned} \mathbb {E}[d(\hat{c}_1(\hat{P}),Y)\mid \hat{P}=\hat{p}]-\mathbb {E}[d(\hat{c}_2(\hat{P}),Y)\mid \hat{P}=\hat{p}] = d(\hat{c}_1(\hat{p}),c^*_f(\hat{p}))-d(\hat{c}_2(\hat{p}),c^*_f(\hat{p})). \end{aligned}$$
The above theorem involves expectations conditioned on \(\hat{p}\) which are typically impossible to estimate for any particular \(\hat{p}\) in isolation, because there is just one or very few instances with exactly the same predicted probability \(\hat{p}\). Therefore, most post-hoc calibration methods minimize the empirical loss \(\sum _{i=1}^{n}d(\hat{c}(\hat{p}_i),y_i)\) for \(\hat{c}\) in some sub-family \(\mathcal {C}\) within all possible calibration maps, using inductive biases such as assuming \(c^*_f\) is monotonic (isotonic calibration (Zadrozny & Elkan, 2002)), or belongs to some parametric family, e.g. logistic functions (Platt scaling (Platt, 2000)).
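As an illustration of fitting within a parametric sub-family \(\mathcal {C}\), the following is a simplified sketch of Platt scaling: a logistic model \(\hat{c}(\hat{p})=\sigma (a\cdot \textrm{logit}(\hat{p})+b)\) fitted by minimising log-loss. The near-unregularised `C=1e6` setting is an assumption for this illustration, and the label smoothing of the original method is omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(p_val, y_val, eps=1e-12):
    """Fit c_hat(p) = sigmoid(a * logit(p) + b) on validation data by
    minimising log-loss (cross-entropy). Returns the calibration map c_hat."""
    p_val = np.clip(np.asarray(p_val, float), eps, 1 - eps)
    z = np.log(p_val / (1 - p_val)).reshape(-1, 1)  # log-odds of predictions
    lr = LogisticRegression(C=1e6).fit(z, np.asarray(y_val))  # ~unregularised

    def c_hat(p):
        p = np.clip(np.asarray(p, float), eps, 1 - eps)
        z = np.log(p / (1 - p)).reshape(-1, 1)
        return lr.predict_proba(z)[:, 1]

    return c_hat
```

Fitted on the validation predictions of an over-confident classifier, the resulting map pulls predictions towards 0.5; for an under-confident classifier it pushes them outwards.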
4 The fit-on-test paradigm
4.1 Evaluation of calibration always involves estimation
The goal of evaluating calibration is to measure how far a classifier f is from being perfectly calibrated, based on a given test set. Ideally, we would like to know for each test instance how far the prediction \(\hat{p}_i\) is from the corresponding perfectly calibrated probability \(c^*_f(\hat{p}_i)\). The fundamental problem is that we can never directly observe \(c^*_f(\hat{p}_i)\), even on the test data. Therefore, evaluation of calibration always involves some form of estimation, and one cannot measure the true calibration error precisely.
The standard ECE measure gets around this problem by introducing bins. The idea is that if there are sufficiently many instances in the bin \([B_k,B_{k+1})\), and the bin is narrow enough so that the corresponding perfectly calibrated probabilities \(c^*_f(\hat{p}_i)=\mathbb {E}[Y\mid \hat{P}=\hat{p}_i]\) do not vary much within the bin, then one can estimate the calibration error in the bin as follows:

$$\begin{aligned} \frac{1}{n_k}\sum _{i:\hat{p}_i\in [B_k,B_{k+1})}\bigl \vert c^*_f(\hat{p}_i)-\hat{p}_i\bigr \vert ^\alpha \approx \vert \bar{y}_k-\bar{p}_k\vert ^\alpha , \end{aligned}$$
that is by the difference between the proportion of positives and the average prediction within the bin. If the bins are too narrow, then there are not sufficiently many instances in them, resulting in high variance of calibration error estimation. If the bins are too wide, then the corresponding perfectly calibrated probabilities \(c^*_f(\hat{p}_i)\) vary too much inside the bin, resulting in potential bias in the estimation.
4.2 Fit-on-test estimation of calibration error
Seeing the challenges of choosing a good binning for ECE and the existing attempts of improving over ECE (Roelofs et al., 2020), we looked for alternatives in estimating the calibration error. A classical and intuitive estimation method is plug-in estimation, where the estimate is calculated using the same formula as the population statistic it is estimating. We propose to use this for the true calibration error defined earlier as Eq.(1), getting the plug-in estimator:

$$\begin{aligned} \widehat{\textsf{CE}}{}^{(\alpha )}=\frac{1}{n}\sum _{i=1}^{n}\vert \hat{c}(\hat{p}_i)-\hat{p}_i\vert ^\alpha , \end{aligned}$$
(3)
where \(\hat{c}(\cdot )\) is some estimator of the function \(c^*_f(\cdot )\). The intuition is that in order to estimate the true calibration error we would first estimate the true calibration map. After that we can use the average discrepancy between the predictions and the corresponding estimated calibration map values as our estimate of the true calibration error.
Our task now is to find a way to estimate the true calibration map \(c^*_f(\cdot )\). Importantly, this estimation must be performed using only the given test set, because the goal of evaluating calibration is to do so based on that test set.
Here we can turn to the existing literature on post-hoc calibration. Indeed, the goal of post-hoc calibration is also to estimate the true calibration map, except that the estimation is performed there on the validation set. All we need to do is to take a post-hoc calibration method and apply it instead on test data. By this we can get an estimated calibration map \(\hat{c}(\cdot )\) which can be used within Eq.(3) to approximate calibration error. As estimating a calibration map is essentially fitting a function, we refer to such plug-in estimation as the fit-on-test estimation of calibration error. To summarise, any post-hoc calibration method can be applied on the test data and used within the plug-in estimator to turn it into a fit-on-test estimator of calibration error. Note that this does not mean that all methods would be equally useful as plug-in estimators; more discussion about the limitations of a fit-on-test estimator is in Sects. 4.3 and 8.
However, nothing prevents us from going beyond the set of existing post-hoc calibration methods. Whenever we have some family \(\mathcal {C}\) of potential calibration map functions and some strictly proper loss l, we can define the corresponding fit-on-test calibration evaluation measure by first performing fitting on the test data:

$$\begin{aligned} \hat{c}=\mathop {\mathrm {arg\,min}}\limits _{c\in \mathcal {C}}\sum _{i=1}^{n}l(c(\hat{p}_i),y_i), \end{aligned}$$
and then using it within the plug-in estimator:

$$\begin{aligned} \widehat{\textsf{CE}}{}^{(\alpha )}_{\mathcal {C},l}=\frac{1}{n}\sum _{i=1}^{n}\vert \hat{c}(\hat{p}_i)-\hat{p}_i\vert ^\alpha . \end{aligned}$$
Fit-on-test estimation of calibration error has been visualised in Fig. 2 (illustration only, not on real data).
Fit-on-test estimation of calibration error: (1) a calibration map \(\hat{c}\) is obtained fitting a family of calibration maps \(\mathcal {C}\) by minimising the loss l on the test data; (2) instance-wise calibration errors are estimated as distances of predictions from calibrated predictions; (3) overall calibration error is estimated as the average of instance-wise errors. The plots are illustrative, not based on real data
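To make the procedure of Fig. 2 concrete, the following sketch uses isotonic regression (i.e. the family of monotonic maps, fitted with the Brier score) as the calibration map family fitted on the test set; any other post-hoc calibration method could be substituted for the isotonic fit.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_on_test_ce(p_test, y_test, alpha=1):
    """Fit-on-test estimate of calibration error with isotonic regression:
    (1) fit a monotonic calibration map on the *test* set (least-squares,
    i.e. Brier score); (2) average |c_hat(p_i) - p_i|^alpha as in Eq. (3)."""
    p_test = np.asarray(p_test, float)
    y_test = np.asarray(y_test, float)
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    c_hat = iso.fit(p_test, y_test).predict(p_test)         # step (1): fit on test
    return float(np.mean(np.abs(c_hat - p_test) ** alpha))  # step (2): plug-in

p = [0.1, 0.3, 0.65, 0.75, 0.85, 0.95]
y = [0, 1, 0, 1, 1, 1]
print(round(fit_on_test_ce(p, y), 3))  # 0.15
```

Note that the unconstrained isotonic fit used here is prone to the test-time overfitting concern discussed next in Sect. 4.3.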
4.3 Discussion
As the idea of using plug-in estimation is almost trivial, one might wonder why it has not been introduced before. We guess this is partly due to the following potential concerns:
1. Due to inevitable overfitting (or the generalisation gap) in any fitting process, we are bound to get our estimated \(\hat{c}\) closer to the observed labels than \(c^*_f\) is. This bias can harm our capability of estimating the true calibration error \(\vert c^*_f(\hat{p})-\hat{p}\vert\);
2. By choosing a particular family \(\mathcal {C}\) of functions to be used during the fitting process, we would potentially misjudge the calibration error in the cases where \(c^*_f\) is not in this family.
It seems impossible to fully solve both problems at the same time: a more restrictive set of functions helps against overfitting and alleviates the first problem, but increases the second problem; a bigger set of functions helps against the second problem, but increases overfitting. However, our experiments demonstrate that good tradeoffs are possible, using flexible families but still with relatively few parameters.
The classical binning-based ECE might seem to sidestep this problem: instead of estimating \(c^*_f\) at all given points, it compares the bin averages \(\bar{p}\) and \(\bar{y}\). Perhaps surprisingly though, it can be proved (see the next subsection) that the binning-based ECE can also be seen as a fit-on-test estimator of calibration error for a particular family of functions \(\mathcal {C}\) with the Brier score (MSE) as the loss function. Therefore, the above concerns are valid for the standard binning-based ECE as well.
4.4 Classical binned ECE is also a fit-on-test estimator
Next we prove that the standard binning-based ECE measure as defined by Eq.(2) is a fit-on-test estimator with a certain calibration map family \(\mathcal {C}\) that we will present in a moment. Before this, we first introduce a bigger family \(\mathcal {C}_{\mathcal {B},\mathcal {H},\mathcal {A}}^{(b)}=\{c_{(\textbf{B},\textbf{H},\textbf{A})}\}\) of piecewise linear functions with b pieces (or bins), parametrised by the following 3 vectors:
- \(\textbf{B}\in [0,1]^{b+1}\) - the boundaries of the b pieces (or bins) with the constraint \(0=B_1<B_2<\ldots<B_b<B_{b+1}=1+\epsilon\);
- \(\textbf{H}\in \mathbb {R}^{b}\) - values of the function at the boundaries \(B_1,\dots ,B_b\);
- \(\textbf{A}\in \mathbb {R}^{b}\) - slopes of the linear functions within the bins.
The values of these functions can be calculated as follows:

$$\begin{aligned} c_{(\textbf{B},\textbf{H},\textbf{A})}(\hat{p})=\sum _{k=1}^{b}I[B_k\le \hat{p}<B_{k+1}]\,\bigl (H_k+A_k(\hat{p}-B_k)\bigr ), \end{aligned}$$
where \(I[\cdot ]\) is the indicator function. Note that the resulting functions \(c_{(\textbf{B},\textbf{H},\textbf{A})}\) can be non-continuous because the right side of the bin \([B_k,B_{k+1})\) ends near the value \(H_k+A_k(B_{k+1}-B_k)\) and nothing is preventing this value from being different than \(H_{k+1}\) which is the left side of the bin \([B_{k+1},B_{k+2})\).
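A direct implementation of this family evaluates, for each prediction, the linear piece of its bin; note the possible discontinuities at bin boundaries mentioned above:

```python
import numpy as np

def c_BHA(p_hat, B, H, A):
    """Evaluate c_(B,H,A): inside bin [B_k, B_{k+1}) the value is
    H_k + A_k * (p_hat - B_k); the function may jump at bin boundaries."""
    p_hat = np.asarray(p_hat, float)
    B, H, A = np.asarray(B, float), np.asarray(H, float), np.asarray(A, float)
    # Index k of the bin containing each prediction.
    k = np.clip(np.searchsorted(B, p_hat, side="right") - 1, 0, len(H) - 1)
    return H[k] + A[k] * (p_hat - B[k])

# Two bins: slope 1 on [0, 0.5), constant 0.6 on [0.5, 1]; discontinuous
# at 0.5 (left limit 0.7, right value 0.6).
print(c_BHA([0.1, 0.4, 0.5, 0.75], B=[0.0, 0.5, 1.0], H=[0.2, 0.6], A=[1.0, 0.0]))
```

Setting all slopes to 0 gives the piecewise constant functions of reliability diagrams; setting all slopes to 1 gives the subfamily behind classical ECE discussed next.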
It turns out that the classical binning-based ECE is a fit-on-test estimator of calibration error with respect to a particular subfamily of \(\mathcal {C}_{\mathcal {B},\mathcal {H},\mathcal {A}}^{(b)}\) that we will describe next. As ECE is calculated from the reliability diagrams that are piecewise constant, one might guess that this subfamily would contain all the piecewise constant (slope 0) functions with a particular fixed binning \(\textbf{B}\). However, this is not true. To see this, consider a synthetic example in Fig. 3a with 6 instances (2 negatives and 4 positives) shown as red dots, and 2 bins \(\textbf{B}=(0,0.5,1+\varepsilon )\). The traditional definition of ECE in Eq.(2) yields \(ECE=0.133\) in this example. Fitting a piecewise constant function with binning \(\textbf{B}\) by minimizing the Brier score results in the calibration map \(\hat{c}\) visualised in Fig. 3b. As the Brier score is a proper loss, the optimum is achieved by empirical averages of labels within each bin. Hence, the shape of the calibration map \(\hat{c}\) in Fig. 3b matches with the reliability diagram in Fig. 3a. However, if we now use the fit-on-test estimator of calibration error as the average length of vertical red lines in Fig. 3b, then we get \(\widehat{CE}=0.167\), which is different from \(ECE=0.133\) with the same binning.
A synthetic example about how ECE can be viewed as a fit-on-test estimator of calibration error. a A reliability diagram of a test set with 6 instances (4 positives with predicted probabilities 0.3, 0.75, 0.85, 0.95 and 2 negatives with predicted probabilities 0.1, 0.65), yielding \(ECE=0.133\); b a piecewise constant fit-on-test estimator yields \(\widehat{CE}=0.167\); c a piecewise slope-1 fit-on-test estimator provably yields the same \(ECE=0.133\) as the original in (a) while visualizing per-instance calibration errors also; d our proposed reliability diagram with diagonal filling combines elements from (a) and (c)
Instead, ECE with the binning \(\textbf{B}\) is actually a fit-on-test estimator of calibration error with respect to the subfamily \(\mathcal {C}^{(b)}_{(\textbf{B},\mathcal {H},\textbf{1})}\) which contains all piecewise linear functions with slope 1 (i.e. a 45-degree ascending slope) in each of the bins. In other words, \(\mathcal {C}_{(\textbf{B},\mathcal {H},\textbf{1})}\) contains all functions with the fixed binning \(\textbf{B}\), fixed slopes \(\textbf{A}=\textbf{1}=(1,1,\dots ,1)\), i.e. slope 1 for each of the b bins, and any heights \(\textbf{H}\in \mathbb {R}^b\) for the left-side boundaries of these bins. Fitting this family for our example results in the calibration map visualised in Fig. 3c. Using the fit-on-test estimator of calibration error we now get exactly the standard \(ECE=0.133\) as the average length of the vertical red lines. This can be confirmed visually, seeing that each of the vertical red lines in Fig. 3c has exactly the same length as the bin-specific red line in the standard reliability diagram of Fig. 3a. These lengths are equal because the diagonal and the calibration map both have slope 1. The following theorem confirms this by proving that the standard ECE with binning \(\textbf{B}\) can be seen as a fit-on-test estimator of calibration error.
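The numbers in this example can be verified mechanically. The snippet below reproduces \(ECE=0.133\), the piecewise-constant fit-on-test estimate 0.167, and the piecewise slope-1 fit-on-test estimate 0.133 for the six instances of Fig. 3:

```python
import numpy as np

# The six instances of Fig. 3 and the two bins with boundary 0.5.
p = np.array([0.1, 0.3, 0.65, 0.75, 0.85, 0.95])
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 1.0])
bins = [p < 0.5, p >= 0.5]

# Classical ECE (Eq. (2)): weighted |ybar_k - pbar_k| over the bins.
ece = sum(m.mean() * abs(y[m].mean() - p[m].mean()) for m in bins)

# Piecewise-constant fit-on-test: Brier-optimal constants are the
# within-bin label means ybar_k.
c_const = np.where(bins[0], y[bins[0]].mean(), y[bins[1]].mean())
ce_const = np.mean(np.abs(c_const - p))

# Piecewise slope-1 fit-on-test: the Brier-optimal offset per bin is
# ybar_k - pbar_k, so c_hat(p) = p + offset_k recovers classical ECE.
offsets = np.where(bins[0],
                   y[bins[0]].mean() - p[bins[0]].mean(),
                   y[bins[1]].mean() - p[bins[1]].mean())
ce_slope1 = np.mean(np.abs((p + offsets) - p))

print(round(ece, 3), round(ce_const, 3), round(ce_slope1, 3))  # 0.133 0.167 0.133
```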
Theorem 2
Consider a predictive model with predictions \(\hat{p}_1,\dots ,\hat{p}_n\in [0,1]\) on a test set with actual labels \(y_1,\dots ,y_n\) and a binning \(\textbf{B}\) with \(b\ge 1\) bins and boundaries \(0=B_1<\dots <B_{b+1}=1+\epsilon\). Then for any \(\alpha >0\), the measure \(\textsf{ECE}^{(\alpha )}_{\textbf{B}}\) as defined by Eq.(2) is equal to the fit-on-test estimator of calibration error using the family \(\mathcal {C}_{(\textbf{B},\mathcal {H},\textbf{1})}\) and fitting the Brier score:

$$\begin{aligned} \textsf{ECE}^{(\alpha )}_{\textbf{B}}=\widehat{\textsf{CE}}{}^{(\alpha )}_{\mathcal {C}_{(\textbf{B},\mathcal {H},\textbf{1})},\,\textsf{BS}}. \end{aligned}$$
Furthermore, \(\hat{c}(\bar{p}_k)=\bar{y}_k\) for \(k=1,\dots ,b\), where \(\bar{p}_k\) and \(\bar{y}_k\) are the average \(\hat{p}_i\) and \(y_i\) in the bin \([B_k,B_{k+1})\).
Proof
See the Supplementary Material. \(\hfill\square\)
4.5 Fit-on-test reliability diagrams
Reliability diagrams are a common way of evaluating calibration of classifiers visually. Next we discuss the shortcomings of existing reliability diagrams and propose enhancements to them.
The simplest classical binning-based reliability diagrams just present the bar plot showing the bins and the proportions of positives in these bins. As we demonstrate in Fig. 4 (top row), it can happen that 3 classifiers with very different ECE values of 0.002, 0.090 and 0.070 have an identical reliability diagram. This is because the proportions \(\bar{y}_k\) of positives in the bins \(k=1,...,b\) are respectively the same for the 3 classifiers. However, ECE is different for these classifiers due to differences in the average predictions \(\bar{p}_k\) in the bins. A common way to address this problem is to visually indicate the average prediction within each bin (Song et al., 2021). The second row of Fig. 4 does so by showing bin centres \((\bar{p}_k,\bar{y}_k)\) with red dots. This reveals that the first classifier is actually nearly calibrated according to this binning, because the red dots are almost at the diagonal of perfect calibration, hence very low ECE of 0.002. However, the second and third classifiers still have identical visualisation, while the values of ECE are different. This is caused by different numbers \(n_k\) of instances within the bins, thus resulting in different weights for the terms in the formula Eq.(2) for ECE. Therefore, it is important to complement reliability diagrams with frequency histograms (Song et al., 2021), as we have done in the third row of Fig. 4.
In the last row of Fig. 4 we propose new reliability diagrams that we call reliability diagrams with diagonal filling. The original bar plot is kept there with a black line. The blue colour does not fill the bars to the horizontal top as usual, but instead to a diagonal line with slope 1 that crosses the top of the bar at the red bin centre \((\bar{p}_k,\bar{y}_k)\). As we know from Theorem 2, the top boundary of the blue filling represents the calibration map resulting from fitting the family \(\mathcal {C}_{(\textbf{B},\mathcal {H},\textbf{1})}\). The theorem also states that the average distance between this calibration map and the main diagonal of perfect calibration is equal to the standard ECE. In this way, the difference between classifiers 1 and 2 becomes clearly evident from the figure. Classifier 1 is almost perfectly calibrated because the estimated calibration map nearly matches the main diagonal, whereas classifier 2 is quite far from being calibrated. More precisely, the area between the diagonal filling and the main diagonal is exactly equal to ECE, assuming that the bins have equal width and an equal number of instances in them. However, if the bins have different numbers of instances (e.g. as for the 3rd classifier), then the areas between the diagonal filling and the main diagonal need to be weighted accordingly, as in the third row of the figure.
Our reliability diagrams with diagonal filling are just one example of obtaining reliability diagrams from fit-on-test calibration maps. More generally, one could use any post-hoc calibration method on the test data to obtain an estimated calibration map, and then fill the area under this curve. We call such visualisations fit-on-test reliability diagrams. With any such diagram, the fit-on-test estimator of calibration error can be measured as the average distance across all instances between the reliability diagram and the main diagonal. Figure 5 shows examples of fit-on-test reliability diagrams; the calibration map families PL and PL3 used there will be introduced in Sect. 5.
Reliability diagrams of 3 classifiers (columns) created with different visualisation methods (rows); the concrete classification task is irrelevant. All classifiers have different ECE, but the plain reliability diagrams (top row) are identical. Differences are gradually revealed by adding bin centres indicating the average predicted probabilities (red dots, in the second row) and frequency histograms (in the third row). The last row shows our proposed reliability diagrams with diagonal filling, where ECE is better visualised because it is equal to the instance-wise average distance from the blue boundary to the main diagonal of perfect calibration (Color figure online)
4.6 Cross-validated number of bins for ECE
As discussed at the beginning of Sect. 4, the choice of the number of bins strongly influences how well the standard binning-based \(\textsf{ECE}^{(\alpha )}_{\textbf{B}}\) estimates the true calibration error. Viewing ECE as fitting the family \(\mathcal {C}^{(b)}_{(\textbf{B},\mathcal {H},\textbf{1})}\) on the test data by minimising the Brier score, we can treat the choice of the number of bins b as a hyperparameter optimisation task. This opens up novel methods for choosing the number of bins for ECE. For example, we can split the test set randomly into two folds: on one fold we perform fitting with different numbers of bins, and on the other fold we evaluate which number of bins provides the best fit according to the Brier score. After that, the final reliability diagram can be drawn with the selected number of bins, and ECE calculated based on this diagram. Instead of such a fixed split into two folds, any other hyperparameter optimisation technique can be used. We propose to use cross-validation (CV), a standard hyperparameter tuning method, to select the number of bins which provides the best fit. In the example of Fig. 1, the optimum was achieved with 14 bins, as shown in Fig. 5a. Note that we are fitting \(\hat{c}(\hat{p}_i)\) to \(y_i\) for \(i=1,\dots ,n\), and thus CV improves the fit between the estimated calibration map (the top of the diagonal filling of the reliability diagram) and the binary labels. As a result, the fit between the estimated calibration function \(\hat{c}\) and the true calibration curve \(c^*_f\) also improves in expectation, as implied by Theorem 1. A better fit between \(\hat{c}\) and \(c^*_f\) implies a ‘more reliable’ reliability diagram, in the sense that the top of the filling is on average closer to the true calibration function. While cross-validation is a standard tool for hyperparameter tuning, it has not previously been applied to ECE, because ECE has not been seen as fitting before.
In the implementation of CV, inspired by Tikka and Hollmén (2008), we prefer a lower number of bins whenever the relative difference in loss is less than 0.1 percent, which further improves performance (for details see Appendix C.2).
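The bin-count selection described above can be sketched in a few lines. Below is a minimal illustration of our own, assuming equal-width binning and the Brier score as the fitting loss; all function names are ours, and the 0.1% tolerance rule from Appendix C.2 is omitted for brevity:

```python
import numpy as np

def fit_binning(p, y, b):
    """Fit an equal-width binning calibration map: per-bin mean label."""
    edges = np.linspace(0.0, 1.0, b + 1)
    idx = np.clip(np.digitize(p, edges[1:-1]), 0, b - 1)
    means = np.empty(b)
    for k in range(b):
        mask = idx == k
        # Empty bins fall back to the bin centre (identity map).
        means[k] = y[mask].mean() if mask.any() else (edges[k] + edges[k + 1]) / 2
    return edges, means

def apply_binning(p, edges, means):
    """Evaluate the fitted binning calibration map on new predictions."""
    b = len(means)
    idx = np.clip(np.digitize(p, edges[1:-1]), 0, b - 1)
    return means[idx]

def cv_num_bins(p, y, candidates=range(2, 31), n_folds=10, seed=0):
    """Select the number of bins minimising the cross-validated Brier score."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(p)), n_folds)
    scores = {}
    for b in candidates:
        losses = []
        for f in range(n_folds):
            val = folds[f]
            trn = np.concatenate([folds[g] for g in range(n_folds) if g != f])
            edges, means = fit_binning(p[trn], y[trn], b)
            c_hat = apply_binning(p[val], edges, means)
            losses.append(np.mean((c_hat - y[val]) ** 2))  # Brier score
        scores[b] = np.mean(losses)
    return min(scores, key=scores.get)

def ece(p, y, b, alpha=1):
    """Standard equal-width binning ECE with the selected number of bins."""
    edges = np.linspace(0.0, 1.0, b + 1)
    idx = np.clip(np.digitize(p, edges[1:-1]), 0, b - 1)
    total = 0.0
    for k in range(b):
        mask = idx == k
        if mask.any():
            total += mask.sum() * abs(y[mask].mean() - p[mask].mean()) ** alpha
    return total / len(p)
```

The selected number of bins can then be used both for drawing the reliability diagram and for computing the final ECE value.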
Different reliability diagrams with the number of bins or pieces optimised using cross-validation, on the same data as in Fig. 1: a a reliability diagram with diagonal filling using 14 bins; b piecewise linear reliability diagram with 3 pieces; c piecewise linear in logit-logit space reliability diagram with 2 pieces
5 Calibration map families PL and PL3
The family \(\mathcal {C}^{(b)}_{(\textbf{B},\mathcal {H},\textbf{1})}\) of functions used by the binning-based ECE has several weaknesses: (1) it contains discontinuous functions, while ‘jumps’ are unlikely to be present in the true calibration function; (2) it contains only segments with slope 1, making it hard to fit the true calibration function in regions with a different slope.
Therefore, we are instead looking for a family satisfying the following criteria: (1) it contains only continuous functions; (2) it has the flexibility to fit any curve; (3) it contains the identity function; (4) it has few enough parameters to avoid heavy overfitting.
We first revisit the families used by existing post-hoc calibration methods. Some methods use families with a fixed number of parameters, such as Platt scaling (Platt, 2000) (2 parameters) and beta calibration (Kull et al., 2017) (3 parameters); see also Table 1. A small fixed number of parameters is clearly not sufficient for the flexibility to fit any curve. Non-parametric methods like isotonic calibration (Zadrozny & Elkan, 2002) have a tendency to be overconfident and thus overfit the data (Allikivi & Kull, 2019). Therefore, we are looking for methods with a flexible number of parameters, so that a suitable number can be selected according to the dataset, for example by cross-validation. Piecewise linear functions with slope 1 satisfy this requirement, but they in turn are not continuous. This motivates our proposal of using the family of continuous piecewise linear calibration maps with unconstrained slopes.
5.1 PL - piecewise linear calibration maps
We propose to use the subfamily \(\mathcal {C}_{\mathcal {B},\mathcal {H},\text {cont}}^{(b)}\) of continuous functions from the piecewise linear function family \(\mathcal {C}_{\mathcal {B},\mathcal {H},\mathcal {A}}^{(b)}\). The family \(\mathcal {C}_{\mathcal {B},\mathcal {H},\text {cont}}^{(b)}\) is only parametrised by the bin boundaries \(\textbf{B}\) and by the values \(\textbf{H}\) of the function at the boundaries, whereas the slopes can be calculated from \(\textbf{B}\) and \(\textbf{H}\) with \(A_k=\frac{H_{k+1}-H_{k}}{B_{k+1}-B_{k}}\) to ensure that the line at the right end of bin k coincides with the line at the left end of bin \(k+1\). Note that the bin boundaries \(\textbf{B}\) are now also parameters to be fitted, together with the values \(\textbf{H}\). As \(B_1=0\) and \(B_{b+1}=1+\epsilon\) are fixed, we are fitting 2b parameters: \(b-1\) bin boundaries and \(b+1\) values in \(\textbf{H}\). The number of bins b is optimised through cross-validation, similarly to Sect. 4.6.
We call the corresponding fit-\((\mathcal {C}_{\mathcal {B},\mathcal {H},\text {cont}}^{(b)},l)\)-on-test method PL: the piecewise linear method for evaluating calibration. In particular, we can now draw a new kind of piecewise linear reliability diagram (Fig. 5b) which provides a better fit to the true calibration function than the binning-based methods, as demonstrated in our experiments (Sect. 7). For a visual comparison, see Fig. 5a and b, which have been made with the same data. More comparative examples can be found in Appendix E.1. Therefore, the piecewise linear reliability diagram can be used similarly to the classical ECE reliability diagram to check visually how well the model is calibrated. We can then also use this estimated calibration map for fit-on-test estimation of the calibration error, obtaining the measure which we call the piecewise linear ECE, or in short ECE-PL. Following the fit-on-test estimation method, ECE-PL measures the instance-wise average distance from the piecewise linear function to the main diagonal: \(\textsf{ECE}_{\textsf{PL}}=\frac{1}{n}\sum _{i=1}^n \vert \hat{c}(\hat{p}_i)-\hat{p}_i\vert ^{\alpha }\) where \(\hat{c}=\mathop {\mathrm {arg\,min}}\limits _{c\in \mathcal {C}_{\mathcal {B},\mathcal {H},\text {cont}}^{(b)}} \frac{1}{n}\sum _{i=1}^n (c(\hat{p}_i)-y_i)^2\).
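Once the family has been fitted, computing \(\textsf{ECE}_{\textsf{PL}}\) from the fitted parameters reduces to linear interpolation followed by averaging distances to the diagonal. A minimal NumPy sketch of our own, with hypothetical boundary and value parameters (the fitting itself is not shown):

```python
import numpy as np

def ece_pl(p_hat, B, H, alpha=1):
    """ECE-PL: instance-wise average distance between the fitted
    continuous piecewise linear calibration map and the diagonal."""
    c_hat = np.interp(p_hat, B, H)  # evaluate the piecewise linear map
    return np.mean(np.abs(c_hat - p_hat) ** alpha)

# Hypothetical fitted parameters: 2 pieces with boundaries B and values H.
p_hat = np.array([0.1, 0.5, 0.9])
score = ece_pl(p_hat, B=[0.0, 0.4, 1.0], H=[0.0, 0.3, 1.0])
```

For the identity map (H equal to B) the measure is zero, as expected for perfect calibration.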
Implementation Details
Although continuous piecewise linear functions are mathematically very well known, we found only one existing public implementation (Jekel & Venter, 2019), based on least squares fitting with differential evolution. We included this in our experiments under the name \(PL_{DE}\) (results in Appendix F). However, we also created our own neural-network-based implementation, depicted in Fig. 6, which allowed us to add cross-entropy fitting. Full details about the architecture are given in Appendix C, but here is a short overview.
We have a single input (\(\hat{p}\)) and a single output (\(\hat{c}\)) connected through two layers: the binning layer and the interpolation layer. The binning layer has \(b+1\) gating units corresponding to the bin boundaries, each outputting whether \(\hat{p}\) is to the left or to the right of the boundary (\(L_i=B_i-\hat{p}\) and \(R_i=\hat{p}-B_i\)). The binning layer is parametrised by b real values (\(\textbf{Z}=z_1, z_2, \dots , z_b\)) which are passed through the softmax (\(\sigma\)) to obtain the widths of the bins and through the cumulative sum (\(B_i = \sum _{k=1}^{i} \sigma _k(\textbf{Z})\)) to obtain the bin boundaries. These parameters are initialised such that all bins contain the same number of training instances. The interpolation layer has \(b+1\) parameters which are each passed through the logistic function (\(\phi\)) to obtain the calibration map values \(H_1,\dots ,H_{b+1}\) at the bin boundaries. These are initialised such that the represented calibration map is the identity function. The interpolation layer consists of b units corresponding to the bins, and the unit of the bin to which \(\hat{p}\) belongs produces the linearly interpolated output. Based on these values, the piecewise linear function value is calculated from the two neighbouring nodes (\(H_k\), \(H_{k+1}\)) as follows:
\(\hat{c}(\hat{p})=\sum _{k=1}^{b} g(L_{k+1}, R_k)\,\frac{H_k\, L_{k+1} + H_{k+1}\, R_k}{B_{k+1}-B_k},\)
where the gating function g is defined as \(g(L_{k+1}, R_k) = I[(L_{k+1}>0) \& (R_k>0)]\).
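The forward pass described above can be illustrated without any training. Below is a non-trainable NumPy sketch of our own, with the simplification \(B_{b+1}=1\) instead of \(1+\epsilon\), and with np.interp playing the role of the gated interpolation units:

```python
import numpy as np

def pl_forward(p_hat, z, h):
    """Forward pass of the piecewise linear map parametrised by logits.
    z: b raw bin-width logits; h: b+1 raw boundary-value logits."""
    widths = np.exp(z) / np.exp(z).sum()            # softmax -> bin widths
    B = np.concatenate(([0.0], np.cumsum(widths)))  # cumulative sum -> boundaries
    H = 1.0 / (1.0 + np.exp(-h))                    # logistic -> values H_1..H_{b+1}
    return np.interp(p_hat, B, H)                   # gated linear interpolation

# b = 4 equal-width bins (all width logits equal), increasing boundary values.
out = pl_forward(np.array([0.1, 0.9]), np.zeros(4),
                 np.array([-2.0, -1.0, 0.0, 1.0, 2.0]))
```

In the actual implementation these operations are differentiable network layers, so the parameters can be fitted with cross-entropy or Brier loss by gradient descent.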
Architecture of the piecewise linear function implementation as a neural network. The bin boundaries \(B_i\) are parametrised by logits (\(\textbf{Z}=z_1, z_2, \dots , z_b\)), which are passed through the softmax followed by a cumulative sum (\(B_i = \sum _{k=1}^{i} \sigma _k(\textbf{Z})\)). The \(H_i\) values are activated by the logistic function \(\phi\). The symbol g stands for a gate: an indicator function that outputs 1 if the input is in the bin and 0 otherwise
We use 10-fold cross-validation to select the number of segments in the piecewise linear function that best approximates the true calibration map, similarly to Sect. 4.6. As in the hyperparameter optimisation for Dirichlet calibration (Kull et al., 2019), the predictions on test data are obtained as the average output of all 10 models with the chosen number of segments but trained on different folds, i.e. we do not refit a single model on all 10 folds.
5.2 PL3 - piecewise linear in logit-logit space
The piecewise linear method can be used for calibration evaluation universally for any kind of model, but it also makes sense to seek dedicated families for special cases, such as neural networks. Next we propose the family PL3 specifically for evaluating calibration in neural networks, taking inspiration from temperature scaling (Guo et al., 2017) and beta calibration (Kull et al., 2017).
Temperature scaling fits a family of functions \(\hat{c}(\hat{p})=\sigma (\textbf{z}/t)\) with a single temperature parameter t, where the softmax \(\sigma\) is applied to the logits \(\textbf{z}\) that the uncalibrated model would have directly converted into probabilities with \(\hat{p}=\sigma (\textbf{z})\) (Guo et al., 2017). In the binary classification case with a single output, \(\sigma (z)=1/(1+e^{-z})\) is the logistic function, the inverse of the logit function \(\sigma ^{-1}(p)=\ln (p/(1-p))\). Importantly, if plotted on the logit-logit scale, binary temperature scaling fits a straight line (Fig. 7), since \(\sigma ^{-1}(\hat{c}(\hat{p}))=\sigma ^{-1}(\sigma (z/t))=z/t=\frac{1}{t}\cdot \sigma ^{-1}(\sigma (z))=\frac{1}{t}\cdot \sigma ^{-1}(\hat{p})\). Further, Fig. 7 compares various methods in the probability space and the logit-logit space, giving extra information on how well the methods fit the calibration map in the low- and high-probability regions.
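The linearity in logit-logit space is easy to verify numerically. A small sketch for the binary case, with arbitrarily chosen logits and temperature:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

z = np.array([-3.0, -0.5, 2.0])  # arbitrary logits
t = 2.5                          # arbitrary temperature
p_hat = sigmoid(z)               # uncalibrated probabilities
c_hat = sigmoid(z / t)           # temperature-scaled probabilities

# In logit-logit space, temperature scaling is a line through the
# origin with slope 1/t:
assert np.allclose(logit(c_hat), logit(p_hat) / t)
```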
Interestingly, another post-hoc calibration method known as beta calibration (Kull et al., 2017) fits calibration maps which in the logit-logit space are approximately piecewise linear with two segments, as shown in Fig. 7 (top right subfigure, blue line). The proof of this fact is given in Appendix E.2. This motivates our calibration map family PL3 of Piecewise Linear functions in the Logit-Logit space (PLLL=PL3), which corresponds to temperature scaling when using 1 piece, approximates beta calibration when using 2 pieces, and can take more complicated shapes when using more linear pieces in the logit-logit space. Fig. 5c shows an example of a reliability diagram that is piecewise linear in the logit-logit space. Further details are provided in Appendix E.2.
Motivation for PL3. Comparison of results in the probability space (left column) and in the logit-logit space (right column). Methods are split into two groups for clarity. In essence, subfigures (a) and (b) are the same; they only display different functions. Model: ResNet110 (He et al., 2015) on the "cats vs rest" task of CIFAR-5m (Color figure online)
Implementation Details
To implement PL3, we use the same neural architecture as for PL, with the following modifications: (1) the expected input is in the logit space; (2) the bin boundaries are converted into the logit space; (3) the logistic is removed from the parameters feeding the interpolation layer and instead the logistic is applied on the final output. Further details are given in Appendix C.
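These three modifications can be summarised in a small non-trainable sketch of our own, with np.interp standing in for the interpolation layer: the input is mapped to logit space, interpolation happens between logit-space knots, and the logistic is applied only at the end.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

def pl3_forward(p_hat, knots, values):
    """PL3: piecewise linear in logit-logit space. `knots` and `values`
    are the bin boundaries and map values, both in logit space."""
    z = logit(np.clip(p_hat, 1e-12, 1.0 - 1e-12))  # input to logit space
    return sigmoid(np.interp(z, knots, values))    # logistic on the final output

# With a single piece of slope 1/t through the origin, PL3 reduces
# to binary temperature scaling:
t = 2.5
x = np.array([-3.0, 0.5, 2.0])
out = pl3_forward(sigmoid(x), np.array([-10.0, 10.0]),
                  np.array([-10.0 / t, 10.0 / t]))
```

With more pieces, the knots and values become trainable parameters, analogously to the PL architecture above.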
6 Assessment of calibrators and evaluators
Before proceeding to the experiments, let us discuss the methods for assessing the quality of calibrators and evaluators of calibration.
6.1 Assessment of post-hoc calibrators
Post-hoc calibrators should be evaluated based on the effectiveness of calibration, but this effectiveness can be interpreted in two ways. Firstly, we could evaluate how well-calibrated the outputs of the calibrator are by calculating the calibration error after calibration (CEAC):
\(\textsf{CEAC}=\mathbb {E}\left[ d\big (\hat{C},\,c^*_{\hat{c}\circ f}(\hat{C})\big )\right] ,\)
Secondly, we could evaluate how well the calibrator approximates the true calibration map by calculating the calibration map estimation error (CMEE):
\(\textsf{CMEE}=\mathbb {E}\left[ d\big (\hat{c}(\hat{P}),\,c^*_f(\hat{P})\big )\right] ,\)
where d is any Bregman divergence and \(\hat{C}=\hat{c}(\hat{P})=(\hat{c}\circ f)(X)=\hat{c}(f(X))\). We prove that low calibration map estimation error is a stronger requirement than low calibration error after calibration, because CMEE is an upper bound for CEAC:
Theorem 3
\(\textsf{CEAC}\le \textsf{CMEE}.\)
Intuitively, the difference between CMEE and CEAC is due to ties introduced by \(\hat{c}\), where \(\hat{c}(\hat{p})=\hat{c}(\hat{p}')\) while \(c^*_f(\hat{p})\ne c^*_f(\hat{p}')\) for some \(\hat{p}\ne \hat{p}'\). As low CMEE is the stronger requirement, we prefer this measure in the experiments. We use the notation \(d(\hat{c},c^*)\) as a more memorable synonym for CMEE.
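The intuition about ties can be made concrete with a toy numerical example of our own, using the squared distance as the divergence d:

```python
import numpy as np

# Two predictions that c_hat collapses into a tie at 0.5.
p_hat  = np.array([0.3, 0.7])  # classifier outputs
c_star = np.array([0.4, 0.6])  # true calibration map values c*_f(p_hat)
c_hat  = np.array([0.5, 0.5])  # tied outputs of the fitted calibrator

# CMEE: average divergence between c_hat and the true calibration map.
cmee = np.mean((c_hat - c_star) ** 2)

# CEAC: the true calibration map of the *calibrated* model at the tied
# output 0.5 is the average of the merged groups' c* values, i.e. 0.5,
# so the calibration error after calibration vanishes.
c_star_after = np.full(2, c_star.mean())
ceac = np.mean((c_hat - c_star_after) ** 2)

assert ceac <= cmee  # consistent with Theorem 3
```

Here CEAC is exactly zero while CMEE is positive: the ties hide the remaining miscalibration from CEAC but not from CMEE.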
6.2 Assessment of calibration evaluators
Calibration evaluators that follow the fit-on-test paradigm estimate \(\hat{c}\) on the test data. The resulting \(\hat{c}\) can be viewed as a reliability diagram, or used to estimate the calibration error with \(\textsf{ECE}^{(\alpha )}_{\text {fit-}(\mathcal {C},l)\text {-on-test}}\).
We see three main application scenarios:
1. the reliability diagram is needed if the reliability of individual predictions has to be known separately; for example, this is needed if there is a downstream decision-making process performed by a human or AI;
2. the estimated calibration error is needed if the general level of trust needs to be known, not specifically for each output;
3. the ranking of the estimated calibration errors is needed if performing selection of the best calibrated model.
In the experiments, we evaluate each calibration evaluation method \(\mathcal {M}\) against these three objectives, measuring: (1) the quality of reliability diagrams with \(d(\hat{c}_\mathcal {M},c^*)\); (2) the quality of calibration error estimates with \(\vert ECE_\mathcal {M}-CE\vert\); and (3) the quality of ranking using Spearman's correlation between the ranking produced by the calibration evaluator and the ranking by true calibration errors, which we refer to as \(rankcorr(ECE_\mathcal {M},CE)\). All these measures assume good estimates of the true calibration map, which can be obtained either on synthetic data or with access to orders of magnitude more data than the calibration evaluator uses.
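For the third objective, Spearman's correlation is simply the Pearson correlation of the ranks. A minimal helper of our own (ignoring ties, which a library routine such as scipy.stats.spearmanr would handle with average ranks):

```python
import numpy as np

def rankcorr(a, b):
    """Spearman rank correlation between two score lists (no ties)."""
    ranks_a = np.argsort(np.argsort(a)).astype(float)
    ranks_b = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ranks_a, ranks_b)[0, 1]

# Hypothetical estimated ECEs of 6 calibrators vs their true CEs:
rho = rankcorr([0.02, 0.05, 0.01, 0.04, 0.03, 0.06],
               [0.03, 0.06, 0.02, 0.04, 0.01, 0.07])
```

A value of 1 means the evaluator ranks the calibrators exactly as the true calibration errors do.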
Note that in the experiments we mostly use the absolute difference \(d(\hat{c},c^*)=\vert \hat{c}-c^*\vert\) as the distance measure between the estimated and true calibration maps, following the tradition of how ECE is calculated from the reliability diagrams. As the absolute difference is not a Bregman divergence, the Appendix also considers the squared difference \(d(\hat{c},c^*)=\vert \hat{c}-c^*\vert ^2\) which is a Bregman divergence. Importantly, a method \(\mathcal {M}\) that produces the best calibration map estimate with the lowest \(\vert \hat{c}_\mathcal {M}-c^*\vert\) might not be the same as method \(\mathcal {M}'\) that produces the best calibration error estimate \(\vert ECE_{\mathcal {M}'}-CE\vert\). For example, this occurs when comparing the methods in Figs. 5a and b. In Fig. 5b, the piecewise linear method (PL) achieves \(\vert ECE_{PL}-CE\vert =0.0045\) and \(\vert \hat{c}_{PL}-c^*\vert =0.0204\), while in Fig. 5a, equal-width binning with cross-validated number of bins (EWCV) achieves \(\vert ECE_{EWCV}-CE\vert =0.0024\) and \(\vert \hat{c}_{EWCV}-c^*\vert =0.0398\). Thus, even though PL has in this example better \(\vert \hat{c}-c^*\vert\), it still loses to EWCV with respect to \(\vert ECE-CE\vert\). Intuitively, EWCV has bigger errors in the reliability diagram, but during the calculation of ECE these errors happen to cancel out more than in the case of PL.
7 Experiments and results
7.1 Pseudo-real experiments and results
The goal of our experimental studies is to (1) evaluate our proposed PL and PL3 calibration map families as calibration methods; and (2) find the best calibration map family for calibration evaluation (thanks to the fit-on-test paradigm).
A key problem for research in calibration evaluation methods and post-hoc calibration methods is that proper evaluation requires access to the true calibration map, which is unknown in practice. One workaround would be to use synthetically created data, where the true calibration map is known. However, with synthetic data the shape of the true calibration map might not be realistic. Our solution is to use the pseudo-real dataset CIFAR-5m (Nakkiran et al., 2021) with 5 million synthetic images, created such that models trained on CIFAR-10 (Krizhevsky et al., 2009) have very similar performance on CIFAR-5m, and vice versa. Thus, it is likely that the true calibration maps are also very realistic. Thanks to the vast size of CIFAR-5m, we can estimate the true calibration map very precisely. In our experiments, we used isotonic calibration on 1 million hold-out datapoints to estimate the true calibration map (other options than isotonic are considered in Appendix F, with minor differences in the results). The main part of the experiments concentrates on CIFAR-5m, while Appendix F shows more results on synthetic and real datasets; these results are comparable to CIFAR-5m with minor differences, supporting the overall conclusions. On CIFAR-5m, we concentrated on three 1-vs-rest calibration tasks (car, cat, dog) and on confidence calibration. Only three 1-vs-rest tasks were used due to computational limitations.
ResNet110 (He et al., 2015), WideNet32 (Zagoruyko & Komodakis, 2016) and DenseNet40 (Huang et al., 2016) models were trained on 45k datapoints from CIFAR-5m; an additional 5k datapoints were used to calibrate the outputs of the models with multi-class calibration methods: temperature scaling (TempS), vector scaling (VecS), matrix scaling with off-diagonal and intercept regularisation (MSODIR), Dirichlet calibration with ODIR regularisation (dirODIR), Spline calibration with the natural method (Spline), and the order-invariant version of intra-order-preserving functions (IOP); and with binary calibration methods: Platt scaling (Platt), isotonic calibration (isotonic), beta calibration (beta), scaling-binning (ScaleBin), ECE-based binning with equal-size binning (ES) and with the sweeping method (\(ES_{sweep}\)), and the piecewise linear methods (PL and PL3). For PL we used our neural network model trained with cross-entropy loss (results with optimising for the Brier score and with the alternative optimisation method based on differential evolution are included in Appendix F).
Results
We report the results using absolute differences (i.e. \(\alpha =1\)), as in most earlier works (Appendix F also includes the quadratic case, i.e. \(\alpha =2\)). Table 2 assesses PL and PL3 as post-hoc calibration methods. The calibration map family corresponding to ECE fit with the sweeping method (Roelofs et al., 2020) is also included as \(ES_{sweep}\). The calibration methods are evaluated by how well they approximate the true calibration map: \(\vert \hat{c}_\mathcal {M}-c^*\vert =\frac{1}{n}\sum _{i=1}^n\vert \hat{c}_\mathcal {M}(\hat{p}_i)-c^*(\hat{p}_i)\vert\) is measured on 1 million unseen data points against the ground truth, where \(\hat{p}_i\) are the outputs of the classifier, \(\hat{c}_\mathcal {M}(\hat{p}_i)\) are the post-hoc calibrated predictions, and \(c^*(\hat{p}_i)\) is the ‘true’ calibration map.
PL3 is the best method in all cases, showing the usefulness of the logit-logit space when calibrating neural models which are far from being calibrated. Note that the errors of PL3 are in all cases more than 20% smaller than the errors of the second-best method, Platt. Appendix F includes more variations of these methods, the KDE method, and results for each architecture separately (minor differences).
Next we compare the calibration evaluators against each of the three objectives listed in Sect. 6.2. We perform the comparison in tasks where the models are already quite close to being calibrated, because this is typical when evaluators are used to find out which post-hoc calibrator performs best. We compare evaluators on the task of evaluating 6 post-hoc calibrators: we measure how precisely the evaluators \(\mathcal {M}\) estimate the reliability diagrams (Table 3 showing \(\vert \hat{c}_\mathcal {M}-c^*\vert\)), the total true calibration errors (Table 4 showing \(\vert ECE_\mathcal {M}-CE\vert\)), and how well the estimated ranking of the 6 calibrators agrees with the true ranking based on true calibration errors (Table 4 showing \(rankcorr(ECE_\mathcal {M},CE)\)). The 6 calibrators were chosen as the best methods from Table 2: beta, vector scaling, Platt, PL3, scaling-binning, and isotonic. The evaluators are compared on different test set sizes of 1k, 3k, and 10k, with 5 different random seeds for each size.
The first objective is to assess the reliability diagrams using \(\vert \hat{c}_\mathcal {M}-c^*\vert =\frac{1}{n}\sum _{i=1}^n\vert \hat{c}_\mathcal {M}(\hat{p}_i)-c^*(\hat{p}_i)\vert\) measured on 1 million unseen data points, where \(\hat{p}_i\) are now the already post-hoc calibrated outputs of the classifier (calibrated with the 6 best methods from Table 2, trained on 5k data points), \(\hat{c}_\mathcal {M}(\hat{p}_i)\) are the results of fit-on-test calibration applied on top of the post-hoc calibrated predictions (trained on another separate data set of size 1k, 3k, or 10k), and \(c^*(\hat{p}_i)\) is the ‘true’ calibration map of the post-hoc calibrated predictions.
The results in Table 3 show that PL is the best on average, as well as after disaggregating by the classifier's architecture, the test set size, or the task. PL3 is mostly second, and the evaluator using the beta calibration map family is mostly third. The order-invariant version of IOP also shows promising results. The performance of beta calibration varies with the size of the test dataset. This is expected, because the method has only 3 parameters, which is good for small test sets but worse for bigger ones. Similarly, the dataset size affects IOP, again because of its small number of parameters. The methods based on equal-size binning perform worse, with \(ES_{CV}\) ranking the highest among them on average, closely followed by \(ES_{sweep}\), and the classical \(ES_{15}\) with 15 bins lagging behind. Further, the Spline method is among the worst-performing methods. From Tables 2 and 3 we can conclude that PL3 is best for approximating the true calibration map when predictions are far from being calibrated, and PL is best when predictions are nearly calibrated. Appendix F discusses the results with different aggregations, and reports the optimal numbers of bins for the ES and PL methods.
The second objective is to estimate the numeric value of the total true calibration error, and here the rows \(\vert ECE_\mathcal {M}-CE\vert\) of Table 4 show the benefits of \(ES_{15}\), beta calibration and PL3 (with some differences across tasks). This demonstrates that while the tilted-top reliability diagrams of \(ES_{15}\) are not precise, their debiased average distance from the diagonal closely agrees with the average distance of the true reliabilities (true calibration map) from the diagonal. While PL and PL3 perform reasonably well, there is large potential for further improvement, because debiasing remains future work for these methods. The ranking of the 6 calibrators is best done by the isotonic fit-on-test evaluator, achieving over 55% correlation with the true ranking for one-vs-rest tasks and over 65% for the confidence task.
8 Discussion
As explained in our work, evaluating the calibration error with the fit-on-test approach is done by estimating the calibration map, i.e. fitting a calibration map family on the test data, and then using the plug-in estimator of the calibration error. The most popular method for calibration error evaluation is binning-based ECE, which we proved to also be fit-on-test. The name fit-on-test might sound controversial, but it has been chosen deliberately to highlight the essence of the issues in current methods for assessing calibration. The main concern with fitting on the test data is obtaining a good fit, as in any other fitting task. Thus, one needs a suitable family of functions and a suitable fitting procedure, so that the data is neither overfitted nor underfitted. A poor fit leads to poor performance of the method and unreliable results. However, since the true calibration map and the true calibration error are not available, in practice we do not even know whether there is overfitting or underfitting during fit-on-test evaluation. One way to tackle this problem is to use synthetic or pseudo-real data that looks very similar to real data; this gives an estimate of how well each fit-on-test method could work on real data. We experimentally tested several deep neural network architectures on different datasets and saw some performance fluctuations between methods across different subtasks. However, there are bigger differences across different evaluation tasks. Even though the piecewise linear method performs very well in calibration map estimation, it is still not better than \(ES_{15}\) at estimating the calibration error or at ranking. This might be because of the bias introduced, and thus there might be a need for debiasing. Note that we did not dive into the problems of data shift and out-of-distribution prediction, which in practice certainly affect model uncertainty calibration as well.
9 Conclusion and future work
We suggest viewing the evaluation of calibration through the fit-on-test paradigm, promoting the use of post-hoc calibration methods for calibration evaluation. This view enables reliability diagrams that are closer to the true calibration maps, more exact estimates of the total calibration error, and rankings of calibrators that better correspond to their true quality. Following fit-on-test, we have proposed cross-validation to tune the number of bins in ECE, and demonstrated the benefits of piecewise linearity in the original as well as in the logit-logit space, inspired by temperature scaling and beta calibration. Having said that, the limitation of this approach is, as the name states, fitting on the test set. Thus, it is essential to be careful not to overfit the evaluation method, and to check which methods work best on the given data.
Future work involves the development of debiasing methods for \(ECE_{PL}\) and \(ECE_{PL3}\) and analysing further the benefits of different calibration map families in different scenarios, including dataset shift.
Data availability
All data used in this work is publicly available; the generation of new data and the experimental results are included in the source code.
Code availability
The source code is available at https://github.com/markus93/fit-on-test.
Notes
Following Roelofs et al. (2020), we prefer ‘estimated’ to avoid confusion with the true calibration error which involves an expectation.
References
Allikivi, M.L., & Kull, M. (2019). Non-parametric Bayesian isotonic calibration: Fighting over-confidence in binary classification. In: Machine Learning and Knowledge Discovery in Databases (ECML-PKDD’19). Springer, 68–85
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3.
Broecker, J. (2011). Estimating reliability and resolution of probability forecasts through decomposition of the empirical score. Climate Dynamics, 39, 655–667.
Ferro, C., & Fricker, T. E. (2012). A bias-corrected decomposition of the Brier score. Quarterly Journal of the Royal Meteorological Society, 138, 1954–1960.
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K.Q. (2017). On Calibration of Modern Neural Networks. In: Thirty-fourth International Conference on Machine Learning, Sydney, Australia, arXiv:1706.04599.
Gupta, K., Rahimi, A., Ajanthan, T., Mensink, T., Sminchisescu, C., & Hartley, R. (2021). Calibration of neural networks using splines. In: International Conference on Learning Representations, https://openreview.net/forum?id=eQe8DEWNN2W.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. CoRR abs/1512.03385. arXiv:1512.03385.
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. arXiv preprint arxiv:2006.11239.
Huang, G., Liu, Z., & Weinberger, K.Q. (2016). Densely connected convolutional networks. CoRR abs/1608.06993. arXiv:1608.06993,
Jekel, C.F., & Venter, G. (2019). pwlf: A Python Library for Fitting 1D Continuous Piecewise Linear Functions. https://github.com/cjekel/piecewise_linear_fit_py
Kingma, D.P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images
Kull, M., Silva Filho, T., & Flach, P. (2017). Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In: Singh A, Zhu J (eds) Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol 54. PMLR, Fort Lauderdale, FL, USA, 623–631
Kull, M., Perelló-Nieto, M., Kängsepp, M., de Menezes e Silva Filho, T., Song, H., & Flach, Peter A. (2019). Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with dirichlet calibration. In: Advances in Neural Information Processing Systems (NeurIPS).
Kumar, A., Liang, P., & Ma, T. (2019). Verified uncertainty calibration. In: Advances in Neural Information Processing Systems (NeurIPS’19).
Murphy, A. H., & Winkler, R. L. (1977). Reliability of subjective probability forecasts of precipitation and temperature. Journal of the Royal Statistical Society Series C (Applied Statistics), 26(1), 41–47.
Naeini, M.P., Cooper, G., & Hauskrecht, M. (2015). Obtaining well calibrated probabilities using bayesian binning. In: AAAI Conference on Artificial Intelligence
Nakkiran, P., Neyshabur, B., & Sedghi, H. (2021). The deep bootstrap framework: Good online learners are good offline generalizers. arXiv:2010.08127.
Niculescu-Mizil, A., & Caruana, R. (2005). Predicting good probabilities with supervised learning. In: Proceedings of the 22nd international conference on Machine learning, 625–632.
Nieto, M.P., Song, H., Filho, T.S., & Kängsepp, M. (2019). PyCalib: Python library for classifier calibration. https://github.com/classifier-calibration/PyCalib
Nixon, J., Dusenberry, M., & Zhang, L., et al. (2019). Measuring calibration in deep learning. ArXiv arXiv:1904.01685
Platt, J., et al. (2000). Probabilities for SV machines. In A. Smola, P. Bartlett, & B. Schölkopf (Eds.), Advances in Large Margin Classifiers, 61–74. MIT Press.
Popordanoska, T., Sayer, R., & Blaschko, M. (2022). A consistent and differentiable lp canonical calibration error estimator. Advances in Neural Information Processing Systems, 35, 7933–7946.
Popordanoska, T., Gruber, S.G., & Tiulpin, A., et al. (2023). Consistent and asymptotically unbiased estimation of proper calibration errors. arXiv preprint arXiv:2312.08589
Rahimi, A., Shaban, A., Cheng, C.A., Hartley, R., & Boots, B. (2020). Intra order-preserving functions for calibration of multi-class neural networks. In: Advances in Neural Information Processing Systems (NeurIPS).
Roelofs, R., Cain, N., Shlens, J., & Mozer, M. (2020). Mitigating bias in calibration error estimation. arXiv preprint arXiv:2012.08668.
Song, H., Perello-Nieto, M., Santos-Rodriguez, R., Kull, M., & Flach, P. (2021). Classifier calibration: How to assess and improve predicted class probabilities: A survey. arXiv preprint arXiv:2112.10327.
Tikka, J., & Hollmén, J. (2008). Sequential input selection algorithm for long-term prediction of time series. Neurocomputing, 71(13), 2604–2615. https://doi.org/10.1016/j.neucom.2007.11.037
Vaicenavicius, J., Widmann, D., Andersson, C., Lindsten, F., Roll, J., & Schön, T. (2019). Evaluating model calibration in classification. In: Chaudhuri K, Sugiyama M (eds) Proceedings of Machine Learning Research, Proceedings of Machine Learning Research, 89. PMLR, 3459–3467.
Widmann, D., Lindsten, F., & Zachariah, D. (2019). Calibration tests in multi-class classification: A unifying framework. In: NeurIPS.
Xiong, M., Deng, A., Koh, P.W., Wu, J., Li, S., Xu, J., & Hooi B. (2023). Proximity-informed calibration for deep neural networks. In: Thirty-seventh Conference on Neural Information Processing Systems, https://openreview.net/forum?id=xOJUmwwlJc
Zadrozny, B., & Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. In: Proc. 8th Int. Conf. on Knowledge Discovery and Data Mining (KDD’02). ACM, 694–699.
Zagoruyko, S., & Komodakis, N. (2016). Wide residual networks. arXiv preprint arXiv:1605.07146.
Zhang, J., Kailkhura, B., & Han, T.Y. (2020). Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning. In: ICML.
Acknowledgements
This work was supported by the Estonian Research Council grant PRG1604 and by the European Social Fund via IT Academy programme.
Author information
Contributions
All the authors worked together on developing and analyzing methods and concepts. Mathematical theorems and proofs were written by Meelis Kull. Data preparation and practical side were done by Markus Kängsepp and Kaspar Valk. Writing of the article was done jointly by all of the authors.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Ethics approval
Not applicable.
Consent to participate
The authors agree to participate in the conference.
Consent for publication
The authors permit the publication of the article and its materials.
Additional information
Editor: Eyke Hüllermeier.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Source code
The source code is available at https://github.com/markus93/fit-on-test. It contains everything needed to run the experiments and to reproduce the results presented in the article, including a yml-file for generating a Conda environment with the needed packages and versions.
Appendix B: Proofs
1.1 B.1: Definitions Related to Theorems 1 and 3
The main paper includes 3 theorems. Theorems 1 and 3 involve Bregman divergences as a way to measure the dissimilarity of two probability distributions. We use Bregman divergences in the context of binary class probability estimation, in which case a probability distribution can be represented by a single real number in the range [0, 1], namely the probability of the positive class. Thus, in our case, Bregman divergences are functions that take two positive class probabilities and output a real number representing the divergence of the distributions represented by these probabilities, \(d:[0,1]\times [0,1]\rightarrow \mathbb {R}\). Note that the order of the two arguments is such that the first is the ‘prediction’ and the second is the ‘ground truth’ (we have seen both orders in the literature, but have found this order more natural). The formal definition is as follows:
Definition 1
A function \(d:[0,1]\times [0,1]\rightarrow \mathbb {R}\) is called a Bregman divergence, if there exists a continuously-differentiable and strictly convex function \(\phi :[0,1]\rightarrow \mathbb {R}\), such that for every \(p,q\in [0,1]\):
$$d(p,q)=\phi (q)-\phi (p)-\phi '(p)\,(q-p),$$
where \(\phi '\) is the derivative of \(\phi\).
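As a quick numeric sanity check of this definition (the helper name is ours; we use the convention with the prediction as the first argument): with \(\phi (x)=x^2\), the Bregman divergence reduces to the squared error, for which both argument orders coincide.

```python
def bregman(p, q, phi, dphi):
    """Bregman divergence d(p, q) of ground truth q from prediction p,
    generated by a strictly convex phi with derivative dphi."""
    return phi(q) - phi(p) - dphi(p) * (q - p)

# phi(x) = x^2 generates the squared error: d(p, q) = (q - p)^2
d = bregman(0.8, 0.3, lambda x: x * x, lambda x: 2 * x)
```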
Theorems 1 and 3 involve random variables. As described in the main paper, we have X as a randomly drawn instance (i.e. X is a vector of its feature values) and Y is its label. In Theorem 1 we consider a particular fixed probabilistic classifier \(f:\mathcal {X}\rightarrow \mathbb {R}\) and thus, its output can be viewed as a random variable also, \(\hat{P}=f(X)\). The true calibration map of f can be calculated as \(c^*_f(\hat{p})=\mathbb {E}\Bigl [Y\Big \vert f(X)=\hat{p}\Bigr ]=\mathbb {E}\Bigl [Y\Big \vert \hat{P}=\hat{p}\Bigr ]\) for any \(\hat{p}\in [0,1]\).
In Theorem 3 we consider a particular calibrator \(\hat{c}:[0,1]\rightarrow [0,1]\), and the true calibration map of the approximately calibrated model \(\hat{c}\circ f\), which according to the definition of \(c^*\) can be written out as \(c^*_{\hat{c}\circ f}(c)=\mathbb {E}\Bigl [Y\Big \vert (\hat{c}\circ f)(X)=c\Bigr ]=\mathbb {E}\Bigl [Y\Big \vert \hat{c}(f(X))=c\Bigr ]\) for any \(c\in [0,1]\).
1.2 B.2: Proof of Theorem 1
Theorem 1
Let \(d:[0,1]\times [0,1]\rightarrow \mathbb {R}\) be any Bregman divergence and \(\hat{c}_1,\hat{c}_2:[0,1]\rightarrow [0,1]\) be two estimated calibration maps. Then
$$\mathbb {E}\bigl [d(\hat{c}_1(\hat{P}),Y)\bigr ]-\mathbb {E}\bigl [d(\hat{c}_2(\hat{P}),Y)\bigr ]=\mathbb {E}\bigl [d(\hat{c}_1(\hat{P}),c^*_f(\hat{P}))\bigr ]-\mathbb {E}\bigl [d(\hat{c}_2(\hat{P}),c^*_f(\hat{P}))\bigr ].$$
Proof
It is sufficient to prove that the value of the following expression does not depend on \(\hat{c}_i\) where \(i=1\) or \(i=2\), and is thus the same for \(\hat{c}_1\) and \(\hat{c}_2\):
$$\mathbb {E}\bigl [d(\hat{c}_i(\hat{P}),Y)\bigr ]-\mathbb {E}\bigl [d(\hat{c}_i(\hat{P}),c^*_f(\hat{P}))\bigr ].$$
Let \(\phi\) be a convex function that gives rise to d, then according to the definition of the Bregman divergence we can rewrite the above expression as follows:
As \(\mathbb {E}\Bigl [\phi (Y)\Big \vert \hat{P}=\hat{p}\Bigr ]\) and \(\phi (c^*_f(\hat{p}))\) do not depend on \(\hat{c}_i\) and as the terms \(\phi (\hat{c}_i(\hat{p}))\) and \(\hat{c}_i(\hat{p})\phi '(\hat{c}_i(\hat{p}))\) both cancel out, we are left to prove that the value of the remaining expression does not depend on \(\hat{c}_i\):
Noting that \(\phi '(\hat{c}_i(\hat{p}))\) does not depend on \(\hat{P}\), it can be taken out of the expectation, and thus the expression can be written as:
$$\phi '(\hat{c}_i(\hat{p}))\Bigl (c^*_f(\hat{p})-\mathbb {E}\bigl [Y\,\big \vert \,\hat{P}=\hat{p}\bigr ]\Bigr ).$$
However, this is equal to zero, since by the definition of the true calibration map \(c^*_f\) we have:
$$c^*_f(\hat{p})=\mathbb {E}\bigl [Y\,\big \vert \,\hat{P}=\hat{p}\bigr ].$$
\(\square\)
1.3 B.3: Proof of Theorem 2
Theorem 2
Consider a predictive model with predictions \(\hat{p}_1,\dots ,\hat{p}_n\in [0,1]\) on a test set with actual labels \(y_1,\dots ,y_n\) and a binning \(\textbf{B}\) with \(b\ge 1\) bins and boundaries \(0=B_1<\dots <B_{b+1}=1+\epsilon\). Then for any \(\alpha >0\), the measure \(\textsf{ECE}^{(\alpha )}_{\textbf{B}}\) is equal to:
$$\textsf{ECE}^{(\alpha )}_{\textbf{B}}=\frac{1}{n}\sum _{i=1}^n \vert \hat{c}(\hat{p}_i)-\hat{p}_i\vert ^\alpha ,\quad \text {where }\ \hat{c}=\mathop {\mathrm {arg\,min}}\limits _{c\in \mathcal {C}_{(\textbf{B},\mathcal {H},\textbf{1})}}\ \sum _{i=1}^n \bigl (c(\hat{p}_i)-y_i\bigr )^2.$$
Furthermore, \(\hat{c}(\bar{p}_k)=\bar{y}_k\) for \(k=1,\dots ,b\), where \(\bar{p}_k\) and \(\bar{y}_k\) are the average \(\hat{p}_i\) and \(y_i\) in the bin \([B_k,B_{k+1})\).
Proof
Our first goal is to prove that \(\hat{c}(\bar{p}_k)=\bar{y}_k\) for \(k=1,\dots ,b\). To find the values of the parameters at the optimum \(\hat{c}\), we study the stated minimization task and consider any \(c\in \mathcal {C}_{(\textbf{B},\mathcal {H},\textbf{1})}\) with any values of parameters \((H_1,\dots ,H_b)\). Let us rewrite the quantity to be minimized, grouping the instances by bins and using the definition of \(c(\cdot )\):
$$\sum _{i=1}^n \bigl (c(\hat{p}_i)-y_i\bigr )^2=\sum _{k=1}^b\ \sum _{\hat{p}_i\in [B_k,B_{k+1})} \bigl (H_k+(\hat{p}_i-B_k)-y_i\bigr )^2.$$
Equating the derivatives of this expression with respect to each \(H_k\) to zero, we get:
$$\sum _{\hat{p}_i\in [B_k,B_{k+1})} 2\bigl (H_k+(\hat{p}_i-B_k)-y_i\bigr )=2n_k\bigl (H_k-B_k+\bar{p}_k-\bar{y}_k\bigr )=0$$
for each k. Therefore, \(H_k-B_k=\bar{y}_k-\bar{p}_k\) holds at the optimum \(\hat{c}\) and we get
$$\hat{c}(\bar{p}_k)=H_k+(\bar{p}_k-B_k)=\bar{p}_k+(\bar{y}_k-\bar{p}_k)=\bar{y}_k.$$
More generally, for any \(\hat{p}\in [B_k,B_{k+1})\) we get \(\hat{c}(\hat{p})=H_k+1\cdot (\hat{p}-B_k)=\hat{p}+(H_k-B_k)=\hat{p}+(\bar{y}_k-\bar{p}_k)\). This implies that the term \(\vert \bar{p}_k-\bar{y}_k\vert\) in the definition of \(\textsf{ECE}^{(\alpha )}_{\textbf{B}}\) is equal to \(\vert \hat{c}(\hat{p})-\hat{p}\vert\) for any \(\hat{p}\in [B_k,B_{k+1})\). Using this, we can rewrite \(\textsf{ECE}^{(\alpha )}_{\textbf{B}}\) as follows:
$$\textsf{ECE}^{(\alpha )}_{\textbf{B}}=\sum _{k=1}^b \frac{n_k}{n}\,\vert \bar{p}_k-\bar{y}_k\vert ^\alpha =\frac{1}{n}\sum _{i=1}^n \vert \hat{c}(\hat{p}_i)-\hat{p}_i\vert ^\alpha .$$
\(\square\)
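Theorem 2 can also be checked numerically: the binned \(\textsf{ECE}^{(1)}\) coincides with the average of \(\vert \hat{c}(\hat{p}_i)-\hat{p}_i\vert\) under the fitted unit-slope map \(\hat{c}(\hat{p})=\hat{p}+(\bar{y}_k-\bar{p}_k)\). A minimal sketch with equal-width bins (the helper names are ours):

```python
import numpy as np

def ece_classic(p, y, bins):
    """Binned ECE^(1): weighted average of |mean prediction - mean label|."""
    idx = np.clip(np.digitize(p, bins) - 1, 0, len(bins) - 2)
    total = 0.0
    for k in range(len(bins) - 1):
        m = idx == k
        if m.any():
            total += m.sum() * abs(p[m].mean() - y[m].mean())
    return total / len(p)

def ece_fit_on_test(p, y, bins):
    """Same value via the fitted unit-slope map c(p) = p + (y_bar_k - p_bar_k)."""
    idx = np.clip(np.digitize(p, bins) - 1, 0, len(bins) - 2)
    shift = np.zeros(len(bins) - 1)
    for k in range(len(bins) - 1):
        m = idx == k
        if m.any():
            shift[k] = y[m].mean() - p[m].mean()
    c_hat = p + shift[idx]              # fitted calibration map applied to p
    return np.mean(np.abs(c_hat - p))   # average |c_hat(p_i) - p_i|

rng = np.random.default_rng(0)
p = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < p ** 2).astype(float)  # a miscalibrated model
bins = np.linspace(0, 1, 11)
```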
1.4 B.4: Proof of Theorem 3
Theorem 3 applies for any positive class probability estimator \(f:\mathcal {X}\rightarrow [0,1]\), any calibrator \(\hat{c}:[0,1]\rightarrow [0,1]\), and any Bregman divergence d. First we recall the definitions of CMEE and CEAC given in the main paper:
$$\textsf{CMEE}=\mathbb {E}\bigl [d(\hat{C},c^*_f(\hat{P}))\bigr ],\qquad \textsf{CEAC}=\mathbb {E}\bigl [d(\hat{C},c^*_{\hat{c}\circ f}(\hat{C}))\bigr ],$$
where \(\hat{C}=\hat{c}(\hat{P})=(\hat{c}\circ f)(X)=\hat{c}(f(X))\).
Theorem 3
\(CMEE=CEAC+\mathbb {E}[d(c^*_{\hat{c}\circ f}(\hat{C}),c^*_f(\hat{P}))]\).
Proof
We have to prove that \(CEAC+\mathbb {E}[d(c^*_{\hat{c}\circ f}(\hat{C}),c^*_f(\hat{P}))]-CMEE=0\), that is:
$$\mathbb {E}\bigl [d(\hat{C},c^*_{\hat{c}\circ f}(\hat{C}))\bigr ]+\mathbb {E}\bigl [d(c^*_{\hat{c}\circ f}(\hat{C}),c^*_f(\hat{P}))\bigr ]-\mathbb {E}\bigl [d(\hat{C},c^*_f(\hat{P}))\bigr ]=0.$$
We will prove the variant of this equality where all expectations are replaced by conditional expectations conditioned on \(\hat{C}\), from which the original equality follows due to the law of total expectation, i.e. \(\mathbb {E}[\mathbb {E}[V\vert W]]=\mathbb {E}[V]\) for any random variables V and W. That is, it is sufficient to prove that:
$$\mathbb {E}\bigl [d(\hat{C},c^*_{\hat{c}\circ f}(\hat{C}))\big \vert \hat{C}\bigr ]+\mathbb {E}\bigl [d(c^*_{\hat{c}\circ f}(\hat{C}),c^*_f(\hat{P}))\big \vert \hat{C}\bigr ]-\mathbb {E}\bigl [d(\hat{C},c^*_f(\hat{P}))\big \vert \hat{C}\bigr ]=0.$$
Let \(\phi\) be a convex function that gives rise to d, then according to the definition of the Bregman divergence we can rewrite the above equality as follows:
The first two terms in each conditional expectation cancel out among the three conditional expectations, leaving us with only the last terms. Taking into account that \(c^*_f(\hat{P})\) is the only term which is not a constant under conditioning with \(\hat{C}\), the above is equivalent to:
As \(\hat{C}\phi '(\hat{C})\) cancels out, we can reorganise the remaining terms as follows:
It now suffices to prove that \(\mathbb {E}[c^*_f(\hat{P})\vert \hat{C}]=c^*_{\hat{c}\circ f}(\hat{C})\) for every value of \(\hat{C}\). According to the definitions of the true calibration maps \(c^*_f\) and \(c^*_{\hat{c}\circ f}\), this equality can be rewritten as:
$$\mathbb {E}\bigl [\mathbb {E}[Y\vert \hat{P}]\,\big \vert \,\hat{C}\bigr ]=\mathbb {E}[Y\vert \hat{C}].$$
Since \(\hat{P}\) functionally determines \(\hat{C}\) through \(\hat{C}=\hat{c}(\hat{P})\), the above equality follows directly from the law of total expectation applied on conditional expectations, i.e. \(\mathbb {E}[\mathbb {E}[V\vert \mathcal {G}_2]\vert \mathcal {G}_1]=\mathbb {E}[V\vert \mathcal {G}_1]\) for any random variable V and \(\sigma\)-algebras \(\mathcal {G}_1\subseteq \mathcal {G}_2\). \(\square\)
Appendix C: Implementation details
This section provides the implementation details for all the methods used in the experiments.
1.1 C.1: Piecewise linear fitting using a neural network
1.1.1 C.1.1: PL details
This section gives a more detailed overview of the implementation of the piecewise linear method. Information about the overall architecture is available in the main part of the article. Furthermore, the exact implementation is available in the source code in the file "piecewise_linear.py".
The model is initialised such that it represents the identity calibration map. It is trained for up to 1500 epochs. The early stopping patience is set to 20, which means that fitting is stopped if the training loss has not decreased for 20 epochs. The model is optimised using the Adam (Kingma & Ba, 2014) optimiser with a learning rate of 0.01. The batch size is \(min(n_{data}/4, 512)\), e.g. for 1000 data points the batch size is 250, and for 10000 data points it is 512. Cross-validation is used to pick the number of nodes for the binning layer; the number of nodes minus 1 gives the number of bins. All numbers of bins from 1 to 16 are considered for the model, except when there are 1000 instances, in which case the maximum number of bins considered is 6. The model is trained using the MSE or cross-entropy loss. The cross-entropy loss (CE) was showcased in the main article due to its better performance. In the following tables of results, \(PL_{NN}^{CE}\) stands for training with the CE loss.
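The core idea can be illustrated with a simplified numpy sketch: a piecewise linear map on an equal-width node grid, initialised to the identity and fitted by full-batch gradient descent on the MSE loss. This is only an illustration of the idea, not the authors' exact neural-network architecture or training setup; all helper names are ours.

```python
import numpy as np

def fit_pl_map(p_hat, y, n_bins=4, lr=1.0, epochs=1000):
    """Fit a piecewise linear calibration map on an equal-width node grid
    by full-batch gradient descent on the MSE loss (illustrative sketch)."""
    nodes = np.linspace(0.0, 1.0, n_bins + 1)
    heights = nodes.copy()                      # identity map initialisation
    idx = np.clip(np.searchsorted(nodes, p_hat, side="right") - 1, 0, n_bins - 1)
    t = (p_hat - nodes[idx]) / (nodes[idx + 1] - nodes[idx])  # position in bin
    for _ in range(epochs):
        pred = (1 - t) * heights[idx] + t * heights[idx + 1]
        res = pred - y                          # gradient of 0.5*(pred - y)^2
        grad = np.zeros_like(heights)
        np.add.at(grad, idx, (1 - t) * res)     # route residuals to both
        np.add.at(grad, idx + 1, t * res)       # neighbouring node heights
        heights -= lr * grad / len(p_hat)
    return nodes, heights

rng = np.random.default_rng(0)
p_hat = rng.uniform(size=2000)
target = p_hat ** 2                             # a known 'true' map to recover
nodes, heights = fit_pl_map(p_hat, target)
mse = np.mean((np.interp(p_hat, nodes, heights) - target) ** 2)
```

The fitted map is then applied to new predictions with `np.interp(p, nodes, heights)`.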
Fitting the PL method for a single model takes under 15 s, and with 10-fold cross-validation it takes up to 150 s depending on the number of nodes the model has. In total, finding the best number of bins (1 to 16) with 10-fold CV takes up to 25 min depending on the data size and the complexity of the fitted function. In most cases the fitting is much faster, but it slows down as more bins are used or the data sizes get bigger. Further speedups can be obtained by reducing the number of folds and the number of different bin counts considered in the hyperparameter optimisation. The scripts were run in a high performance computing center using CPU processing power (Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz) with up to 6GB of RAM.
1.1.2 C.1.2: PL3 details
This section gives further implementation details of the piecewise linear method in the logit-logit space. Similarly to PL, information about the overall architecture is available in the main part of the article and the exact implementation is available in the source code in "piecewise_linear.py".
The model training part is exactly the same as for the PL method.
Fitting the PL3 method tends to take more time compared to the PL method. Fitting a single model takes under 30 s, and with 10-fold cross-validation it takes up to 300 s. In total, finding the best number of bins with 10-fold CV takes up to 80 min. Large speedups can be obtained by reducing the number of folds and the number of different bin counts considered in the hyperparameter optimisation.
1.2 C.2: Details of cross-validation
In our experiments, cross-validation is used to find the best number of bins for the \(ES_{CV}\), PL and PL3 methods. We have seen that the results improve with a simple complexity-reducing regularisation trick, according to which we prefer a lower number of bins over a higher number of bins whenever the relative difference in the cross-validated loss estimate is less than 0.1 percent. Furthermore, in the same way as in the hyperparameter optimisation for Dirichlet calibration (Kull et al., 2019), the predictions on the test data are obtained as the average output of all 10 models with the chosen number of segments but trained on different folds, i.e. we do not refit a single model on all 10 folds.
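Our reading of the complexity-reducing rule can be sketched as follows (the helper name and the exact form of the relative-difference test are assumptions; losses are assumed positive):

```python
def choose_n_bins(cv_losses, rel_tol=1e-3):
    """Pick the smallest number of bins whose cross-validated loss is
    within rel_tol (relative) of the best loss over all candidates.

    cv_losses: dict mapping number of bins -> mean CV loss (positive)."""
    best = min(cv_losses.values())
    for b in sorted(cv_losses):          # prefer fewer bins
        if (cv_losses[b] - best) / best < rel_tol:
            return b
```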
Table 20 depicts the difference in results for estimating \(\vert \hat{c}_\mathcal {M}-c^*\vert\) with classical CV and with CV using the complexity-reducing regularisation. In Table 21 the same is depicted for estimating \(\vert ECE_\mathcal {M}-CE\vert\). It can be seen that the complexity-reducing regularisation improves the results for \(\vert \hat{c}_\mathcal {M}-c^*\vert\) estimation in most cases. The only exception is \(PL3^{CE}\), where the regularisation makes the results slightly worse. On the other hand, the effect of the regularisation on \(\vert ECE_\mathcal {M}-CE\vert\) is the opposite, and the overall results get worse. Nevertheless, the regularisation was kept, as it showed promising results on smaller datasets.
1.3 C.3: Details about the binning methods
The binning methods were implemented using the NumPy and Scikit-Learn packages. They follow previously established approaches; however, cross-validation and unit-slope calibration maps have been added. We have also implemented the sweep method to choose the highest number of bins just before the calibration map becomes non-monotonic. All of the binning methods support both equal-size and equal-width binning. The implementation is available in the source code in the file "binnings.py".
1.4 C.4: Implementation details of other methods
Other methods used for comparison are taken from publicly available packages as follows: isotonic calibration and Platt scaling from the pycalib package (Nieto et al., 2019); beta calibration from the betacal package (Kull et al., 2017); KCE from the pycalibration package (Widmann et al., 2019); and \(PL_{DE}\) from the pwlf package (Jekel & Venter, 2019). For KCE we used the unbiased version with RBFKernel. Both splines (Gupta et al., 2021) and intra-order-preserving (IOP) functions (Rahimi et al., 2020) are implemented using the official implementations provided with the original articles. The best number of splines for the spline methods was found using CV similarly as for the piecewise linear methods (Subsection C.2). For splines, three different variants were used: natural, parabolic and cubic. For the IOP functions, the configurations given in the original paper were used, with two versions: order-invariant (\(IOP_{OI}\)) and order-preserving (\(IOP_{OP}\)). Unfortunately, the third, diagonal version did not work. For KDE we used the implementation provided with the original article (Zhang et al., 2020). We used point-wise estimates for KDE as they seemed to offer better results than the integral-based estimates proposed in the original article.
The best number of bins for \(PL_{DE}\) was found using CV similarly as for the piecewise linear methods (Subsection C.2). The pwlf package also supports fitting piecewise quadratic curves (degree 2) in addition to piecewise linear curves (degree 1). Degree 1 gave better results than degree 2 according to Table 16. For degree 1, seven bins were chosen as the maximum limit for CV, as beyond that the model fitting became very slow. For degree 2, five bins were chosen as the maximum limit for CV. On average, fitting \(PL_{DE}\) with 10-fold CV took about 30 min. The licenses of the packages have been checked to be freely usable for our work.
1.5 C.5: Debiasing ECE
Debiasing was applied for the binning-based ECE methods. Recalling the notation: \(\bar{y}_k=\frac{1}{n_k}\sum _{\hat{p}_i\in [B_k,B_{k+1})} y_i\) is the average label in the k-th bin, \(\bar{p}_k=\frac{1}{n_k}\sum _{\hat{p}_i\in [B_k,B_{k+1})} \hat{p}_i\) is the average prediction in the k-th bin, \(n_k=\vert \{i\mid \hat{p}_i\in [B_k,B_{k+1})\}\vert\) is the size of bin k, and \(\textsf{ECE}^{(1)}\) is defined as \(\textsf{ECE}^{(1)}_\textbf{B}=\frac{1}{n}\sum _{k=1}^b n_k\cdot \vert \bar{p}_k-\bar{y}_k\vert ^1\).
Kumar et al. (2019) proposed to debias \(\textsf{ECE}^{(1)}\) by modelling \(\bar{y}_k\) in each bin as a sample from a random variable with the Gaussian distribution \(N(\bar{y}_k, \frac{\bar{y}_k(1-\bar{y}_k)}{n_k})\). The bias can then be estimated by drawing repeated samples from this random variable. The final debiased estimate of \(\textsf{CE}^{(1)}\) is then obtained by subtracting the estimated bias from the \(\textsf{ECE}^{(1)}\) value.
Instead of drawing repeated random samples, we propose to use numerical integration with the probability density function of the same random variable, which is computationally faster. Instead of drawing samples from \(R_k \sim N(\bar{y}_k, \frac{\bar{y}_k(1-\bar{y}_k)}{n_k})\) for each bin to find \(\mathbb {E}[n_k\cdot \vert \bar{p}_k-R_k\vert ]\) as proposed by Kumar et al. (2019), one can compute in each bin
$$\mathbb {E}[n_k\cdot \vert \bar{p}_k-R_k\vert ]=n_k\int \vert \bar{p}_k-r\vert \,f_{R_k}(r)\,dr,$$
where \(f_{R_k}\) is the probability density function of \(R_k\). We used the simple trapezoidal rule with 10,000 equally spaced integration points within 5 standard deviations of the mean.
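This integration can be sketched as follows (the function name is ours; the grid follows the 5-standard-deviation range described above, and the value returned is per bin, before multiplying by \(n_k\)):

```python
import numpy as np

def bin_abs_dev(p_bar, y_bar, n_k, n_grid=10_000):
    """E[|p_bar - R_k|] for R_k ~ N(y_bar, y_bar*(1 - y_bar)/n_k),
    computed with the trapezoidal rule on [mean - 5*sd, mean + 5*sd]."""
    var = y_bar * (1.0 - y_bar) / n_k
    sd = np.sqrt(var)
    r = np.linspace(y_bar - 5 * sd, y_bar + 5 * sd, n_grid)
    f = np.abs(p_bar - r) * np.exp(-(r - y_bar) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return float(np.sum((f[1:] + f[:-1]) * np.diff(r)) / 2)  # trapezoidal rule
```

For p_bar = y_bar the result should approach the half-normal mean, sd * sqrt(2 / pi).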
Appendix D: Datasets and experimental setup
We ran experiments on pseudo-real, synthetic and real datasets. More details about these datasets and the experimental setup are given in the following sections. We have checked the licenses of the datasets and confirmed that they are freely usable for this work.
1.1 D.1: List of methods with shortened names
- \(ES_{15}\) - ECE method with equal-size binning and 15 bins; EW for equal-width binning;
- \(ES_{sweep}\) - ECE method with equal-size binning and the sweep method for choosing the number of bins;
- \(ES_{CV}\) - ECE method with equal-size binning and cross-validation for choosing the number of bins;
- Platt - Platt scaling;
- beta - beta calibration;
- isotonic - isotonic calibration;
- Scaling-Binning - scaling-binning method;
- TempS/VecS - temperature/vector scaling;
- MSODIR - matrix scaling with off-diagonal and intercept regularisation;
- dirODIR/dirL2 - Dirichlet calibration with ODIR and L2 regularisation;
- 1-Temp - temperature scaling learned in one-vs-rest fashion for binary calibration;
- Spline (natural/cubic/parabolic) - spline calibration with different methods;
- \(IOP_{OI}\) and \(IOP_{OP}\) - intra-order-preserving functions, order-invariant and order-preserving variants;
- \(PL_{NN}^{CE}\) and \(PL_{NN}^{MSE}\) - piecewise linear method with cross-entropy or MSE loss;
- \(PL3^{CE}\) - piecewise linear method in the logit-logit space;
- \(PL_{DE}\) - piecewise linear method based on least squares fitting with differential evolution; \(PL_{DE}^{2}\) - degree 2 for quadratic curves;
- KDE - kernel density estimation;
- KCE - kernel calibration error.
1.2 D.2: Pseudo-real experiments
The pseudo-real dataset is called CIFAR-5m (Nakkiran et al., 2021) and contains over 5 million synthetic 32x32-pixel images similar to CIFAR-10 (Krizhevsky et al., 2009). These images were created by sampling a DDPM model (Ho et al., 2020) trained on the CIFAR-10 data. The pseudo-real data were used to get close to real data while retaining the ability to estimate the true calibration map very precisely.
To estimate the true calibration map on 1 million hold-out data points, 3 different calibration methods were used: (1) isotonic calibration; (2) equal-size binning with 100 bins and flat (i.e. slope 0) bin tops; (3) equal-size binning with 100 bins and slope-1 tops (i.e. tops of the bins parallel to the main diagonal). Only minor differences across the different ‘true’ calibration maps occurred (see Tables 6, 10 and 11).
To set up the experiments, the datasets were calibrated with various 2-class calibration methods (beta, isotonic, Platt, ScalingBinning), multiclass calibrators (TempS, VecS, MSODIR, dirODIR, IOP, Spline), the piecewise linear methods PL3 and \(PL_{NN}\) with cross-entropy loss, and the method \(ES_{sweep}\) adapted from calibration evaluation to calibration map fitting. Table 5 compares the results. The calibration is done in one-vs-rest fashion to obtain results for a 2-class problem. We have 1 confidence calibration task and 3 one-vs-rest subtasks (car, cat, dog), for 6 calibration methods on 3 models, in total 72 combinations. The 6 calibration methods were chosen as the best methods in Table 2 of the main article: beta, vector scaling, Platt, PL3, ScalingBinning, isotonic.
1.3 D.3: Synthetic experiments
For the synthetic data, only the predicted probabilities \(\hat{p}\), labels y, and corresponding calibrated probabilities \(c^*_f(\hat{p})\) were generated. The synthetic data were generated based on five different base shapes (Fig. 8). The first 4 shapes were chosen to mimic likely scenarios of calibration maps, with combinations of over- and under-confidence for values below and above 0.5. The 5th shape ‘stairs’ was added as a more challenging shape that crosses the identity function in two places.
From each shape, multiple variants that we refer to as ‘derivates’ were generated by linearly mixing the shape with the identity function in different proportions. The derivates were generated to obtain datasets with different expected calibration errors. To generate a synthetic dataset from a particular derivate, first the calibrated probabilities \(c^*_f(\hat{p})\) were sampled from a base distribution. Then the labels y were sampled from the calibrated probabilities. Finally, the derivate was used to transform the calibrated probabilities \(c^*_f(\hat{p})\) into the corresponding predicted probabilities \(\hat{p}\). Therefore, the construction of derivates creates the inverse of the calibration map, and the calibration map itself can then be obtained by inverting this function. It might seem more intuitive to first sample the predicted probabilities \(\hat{p}\) and then use the true calibration map to find the corresponding calibrated probabilities \(c^*_f(\hat{p})\). However, the less intuitive approach used here has a benefit that the other approach does not: by first sampling \(c^*_f(\hat{p})\) and the labels, we can generate synthetic datasets that have exactly the same set of labels but different predicted probabilities. This allows us to mimic a realistic scenario where several models have been trained on the same dataset and we would like to choose the model with the lowest calibration error, e.g. when ranking different post-hoc calibration methods. This fact has been used to generate Table 27. The five base shapes were defined by the following functions from \(c^*_f(\hat{p})\) to \(\hat{p}\):
- \(square(x)=x^2\)
- \(sqrt(x)=\sqrt{x}\)
- \(beta1(x)=1 / (1 + 1 / (e^c\cdot x^a / (1-x)^b)), \text { where }a=0.4,b=0.45,c=b\ln {(0.6)}-a\ln (0.4)\)
- \(beta2(x)=1/(1+1/(e^c\cdot x^a/(1-x)^b)), \text { where } a=2, b=2.2, c=b\ln {(0.52)}-a\ln (0.48)\)
- \(stairs(x)=stairs\_helper(x+1/3) - stairs\_helper(1/3), \text { where } stairs\_helper(x)= step(step(3x\pi ))/(3\pi ), \text { and } step(x)=x-\sin (x)\)
Synthetic data were generated for data sizes 1000, 3000 and 10000 with 5 different data seeds. The calibrated predictions \(c^*_f(\hat{p})\) were sampled from the uniform distribution. Derivates with expected absolute calibration errors \(0.00,0.005, 0.01,\dots ,0.10\) were generated for each of the 5 base shapes. In total, we used 3 data sizes, 5 random seeds, 5 base shapes, and 21 derivates per shape. In Fig. 8, two examples of derivates for the "stairs" function are also shown. Note that the derivates have been obtained by linear mixing with the identity function in the horizontal direction, since the inverse calibration maps are created first.
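The sampling procedure described above can be sketched as follows (the helper name and the exact mixing parametrisation are our assumptions for illustration):

```python
import numpy as np

def gen_synthetic(shape_fn, mix, n, seed=0):
    """Sample a synthetic dataset from a 'derivate' of a base shape.

    shape_fn maps a calibrated probability c* to a predicted probability,
    and mix in [0, 1] linearly mixes the shape with the identity function
    (mix = 0 gives a perfectly calibrated model)."""
    rng = np.random.default_rng(seed)
    c_star = rng.uniform(size=n)                         # calibrated probabilities
    y = (rng.uniform(size=n) < c_star).astype(int)       # labels sampled from c*
    p_hat = (1 - mix) * c_star + mix * shape_fn(c_star)  # derivate: c* -> p-hat
    return p_hat, y, c_star

square = lambda x: x ** 2
# With a fixed seed, every derivate shares the same labels, so different
# 'models' on identical data can be compared, as described above.
p_hat, y, c_star = gen_synthetic(square, mix=0.5, n=3000)
p_id, y_id, c_id = gen_synthetic(square, mix=0.0, n=3000)
```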
For Figs. 1 and 5 in the main article 3000 data points were generated in the way described above. Uniform distribution and the base function beta2 were used.
1.4 D.4: Real experiments
The real datasets are CIFAR-10/100 (Krizhevsky et al., 2009), each containing 60k images of size 32x32 pixels with 10 or 100 classes. Each of the datasets has 5k validation and 10k test instances. In total, there are 10 model-dataset combinations, and the model outputs have been calibrated using 5 different methods (TempS (Guo et al., 2017), VecS (Guo et al., 2017), MS_ODIR (Guo et al., 2017), dir_L2 (Kull et al., 2019), dir_ODIR (Kull et al., 2019)) on the validation set. This gives us 50 combinations of model and calibration method. To enable a comparison with the synthetic data, we extracted test data subsets of sizes 1000 (10 sets), 3000 (3 sets) and 10000 (1 set). The number of sets is further multiplied by 5, as there were 5 different calibration methods. In total, we got 500 sets with 1000 instances, 150 sets with 3000 instances, and 50 sets with 10000 instances.
Appendix E: Visualizations of calibration
1.1 E.1: Comparisons of reliability diagrams
Figures 9, 10, 11, 12 and 13 depict comparisons between different reliability diagrams obtained on synthetic data with 10000 instances, a true calibration error of 0.1, and different calibration functions. The middle three diagrams of every figure are depicted with slope-1 bin tops (CE estimation by fitting). Therefore, the diagrams for \(ES_{sweep}\) might not seem monotonic, but they would be if plotted with the classic flat bin tops. Based on the figures, the piecewise linear method follows the true calibration map best.
1.2 E.2: Calibration maps in the logit-logit scale
As proved in the main paper, the temperature scaling method applied in binary classification fits a calibration map which is linear in the logit-logit scale. Furthermore, here we show that the beta calibration method fits a calibration map which is approximately piecewise linear in the logit-logit scale with 2 pieces.
Consider the calibration map family of beta calibration, \(\hat{c}(\hat{p})=\frac{1}{1+1/\left( e^c\frac{\hat{p}^a}{(1-\hat{p})^b}\right) }\). Changing the y-axis to the logit scale, we get
$$\mathrm {logit}(\hat{c}(\hat{p}))=c+a\ln \hat{p}-b\ln (1-\hat{p}).$$
We can rewrite it in two ways:
$$\mathrm {logit}(\hat{c}(\hat{p}))=c+a\,\mathrm {logit}(\hat{p})+(a-b)\ln (1-\hat{p})=c+b\,\mathrm {logit}(\hat{p})+(a-b)\ln \hat{p}.$$
For low values of \(\hat{p}\) near 0, the term \((a-b)\ln (1-\hat{p})\) in the first way of writing is nearly zero, and thus we get a linear approximation in the logit-logit scale with slope a. For high values of \(\hat{p}\) near 1, the term \((a-b)\ln \hat{p}\) in the second way of writing is nearly zero, and thus we get a linear approximation in the logit-logit scale with slope b. This can be clearly seen visually in Fig. 14, where the breakpoint between the two pieces is at \(\hat{p}=0.5\) and near this point the two linear segments are interpolated non-linearly.
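This two-slope behaviour can be verified numerically with finite differences in the logit-logit scale (the parameter values below are illustrative, not fitted):

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

a, b, c = 0.4, 0.45, 0.1   # illustrative beta-calibration parameters

def beta_map(p):
    return 1 / (1 + 1 / (np.exp(c) * p ** a / (1 - p) ** b))

# Finite-difference slopes of logit(beta_map(p)) against logit(p)
p = np.array([1e-6, 1e-5, 1 - 1e-5, 1 - 1e-6])
z = logit(beta_map(p))
slope_lo = (z[1] - z[0]) / (logit(p[1]) - logit(p[0]))  # near 0: approx a
slope_hi = (z[3] - z[2]) / (logit(p[3]) - logit(p[2]))  # near 1: approx b
```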
Fig. 7a from the main article shows a case from the pseudo-real dataset. As seen in the figure on the right, PL3 has been able to approximate most of the true calibration map quite well in the logit-logit space piecewise linearly, with two long linear segments (and a third short segment to the right of \(1-10^{-6}\)). In contrast, beta calibration has failed to capture these segments because it is bound to have the breakpoint between the two segments at \(\hat{p}=0.5\).
Appendix F: Results
1.1 F.1: Results of pseudo-real experiments
Here is a brief guide to how we have arranged the tables with the results:
- Table 5. Pseudo-Real: Calibration method comparison;
- Tables 6 and 7. Pseudo-Real: Calibration Maps - absolute and square errors;
- Tables 8 and 9. Pseudo-Real: ECE - absolute and square errors;
- Tables 10, 11, 12 and 13. Pseudo-Real: Different ground-truths;
- Tables 14 and 15. Pseudo-Real: Comparison of different numbers of instances;
- Tables 16 and 17. Pseudo-Real: CE and MSE comparison, including \(PL_{DE}\) with degree 1 and 2;
- Table 28. Real: biases.
1.1.1 F.1.1: Results of estimating the reliability diagram with \(\vert \hat{c}_\mathcal {M}-c^*\vert\) and \(\vert \hat{c}_\mathcal {M}-c^*\vert ^2\)
The following paragraphs present the results of estimating the reliability diagram with \(\vert \hat{c}_\mathcal {M}-c^*\vert\) and \(\vert \hat{c}_\mathcal {M}-c^*\vert ^2\) on CIFAR-5m.
Results on CIFAR-5m indicate that the performance of the measures varies across methods. However, beta calibration and the piecewise linear methods PL and PL3 with cross-entropy loss are in the lead. Beta calibration has an average rank of 4.6, \(PL3^{CE}\) of 4.2, \(PL_{NN}^{MSE}\) of 3.4 and \(PL_{NN}^{CE}\) of 1.2 (Table 6). In the case of the quadratic loss, \(PL_{NN}^{CE}\) and \(PL_{NN}^{MSE}\) perform the best, followed by \(IOP_{OI}\), \(PL3^{CE}\), \(ES_{10}\) and beta calibration (Table 7). Note that \(PL_{NN}^{MSE}\) performs well; however, it is clearly outperformed by \(PL_{NN}^{CE}\).
The number of data instances (Table 14) does not change the overall ordering much. However, with more data instances, all methods get closer to the true calibration map. Additionally, as stated in the main article, beta calibration works very well with smaller data sizes (1000 and 3000), and so does \(IOP_{OI}\).
Next, the cross-entropy loss seems to be beneficial for both the PL3 and \(PL_{NN}\) methods (Table 16). The method \(PL_{DE}\) with degree 1 performs better than with degree 2. Overall, \(PL_{DE}\) is mostly behind PL3 and \(PL_{NN}\) with the CE loss.
As expected, equal-size (ES) binning methods performed much better than equal-width (EW) binning methods (Table 18). Furthermore, comparing \(ES_{sweep}\) with \(ES_{CV}\), the CV variant gets better results 8 times out of 12, and thus also a better average rank of 8.8 compared to 11.3 for the sweep variant (Table 6).
The standard deviations (Table 22) of the calibration measures are very similar; only \(PL_{DE}\) and KDE perform worse than the other methods. Interestingly, the standard deviations are generally higher for DenseNet40 on the confidence dataset.
Different estimated ground-truths (isotonic; equal-size binning with flat tops; with slope-1 tops) of the CIFAR-5m data result in only minor differences (Tables 6, 10, 11).
F.1.2: Results of estimating the true calibration error with \(\vert ECE_\mathcal {M}-CE\vert\) and \(\vert ECE_\mathcal {M}-CE\vert ^2\)
The following paragraphs present the results of estimating the true calibration error with \(\vert ECE_\mathcal {M}-CE\vert\) and \(\vert ECE_\mathcal {M}-CE\vert ^2\) on CIFAR-5m.
\(ECE_{15}\) (among all ES) and \(PL3^{CE}\) are the best at estimating the calibration error, with average ranks of 3.8 and 5.2 (Table 8). For the quadratic loss, the best measures are \(PL_{NN}^{CE}\) with an average rank of 2.8 and \(IOP_{OI}\) with an average rank of 4.2 (Table 9).
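To make the \(\vert ECE_\mathcal {M}-CE\vert\) criterion concrete, the sketch below simulates data from a hypothetical true calibration map \(c^*(p)=p^2\) (an illustrative choice, not one of the paper's maps), computes the true calibration error \(CE\), and measures how far a 15-bin equal-width ECE estimate is from it.

```python
import numpy as np

rng = np.random.default_rng(0)

def c_star(p):
    # hypothetical true calibration map of an over-confident model
    return p ** 2

n = 10_000
preds = rng.uniform(0.0, 1.0, n)  # the model's predicted probabilities
labels = (rng.uniform(0.0, 1.0, n) < c_star(preds)).astype(float)

# true calibration error: E|p - c*(p)| over the prediction distribution
true_ce = np.mean(np.abs(preds - c_star(preds)))

# ECE_15: equal-width binning estimate of the calibration error
n_bins = 15
edges = np.linspace(0.0, 1.0, n_bins + 1)
idx = np.clip(np.searchsorted(edges, preds, side="right") - 1, 0, n_bins - 1)
ece_15 = sum((idx == b).mean()
             * abs(preds[idx == b].mean() - labels[idx == b].mean())
             for b in range(n_bins) if (idx == b).any())

# the criterion compared in these tables
estimation_error = abs(ece_15 - true_ce)
```

For uniform predictions and this map, \(CE = \mathbb{E}[p - p^2] = 1/6\), so the sketch also shows why simulated data with a known \(c^*\) makes the criterion computable at all.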
The number of data instances (Table 15) does not change the ranking for most methods, similarly to estimating the true calibration map. Comparing the methods, \(ES_{sweep}\), beta calibration, and \(IOP_{OI}\) work better with less data, while \(ES_{20}\), \(ES_{25}\), and \(ES_{CV}\) work better with more data. Furthermore, all methods again become more accurate with more data.
The cross-entropy loss is beneficial for PL3, but \(PL_{NN}\) with MSE loss is better at estimating \(\vert ECE_\mathcal {M}-CE\vert\) (Table 17). Again, \(PL_{DE}\) with degree 1 outperforms degree 2. Among the piecewise linear methods, PL3 with CE loss is the best at estimating \(\vert ECE_\mathcal {M}-CE\vert\).
For estimating \(\vert ECE_\mathcal {M}-CE\vert\), the equal-size binning methods performed much better than the equal-width binning methods (Table 19); the only exception is \(ES_{10}\), where equal-size performs similarly to equal-width binning. Comparing \(ES_{sweep}\) with \(ES_{CV}\), \(ES_{sweep}\) is better at the absolute error, while \(ES_{CV}\) is better at the quadratic error (Tables 8 and 9).
The standard deviations (Table 23) of the calibration measures are very similar; only \(PL_{DE}\) and KDE perform worse than the other methods.
Different estimated ground-truths (isotonic; equal-size binning with flat tops; with slope-1 tops) of the CIFAR-5m data result in only minor differences; however, they do change which method is ranked first (Tables 12, 13 and Table 4 in the main article).
F.2: Results of synthetic experiments
The results of the synthetic experiments indicate that beta calibration and Platt scaling are very good when the parametric family is able to approximate \(c^*\) closely (Table 24). However, these methods can fail on more difficult shapes (e.g. ‘stairs’). The piecewise linear methods follow in performance, and the binning methods come last. To achieve a good score on all shapes, the piecewise linear methods should be picked. The repeating results for isotonic calibration in Table 24 are not a bug, but a peculiarity arising from the way the synthetic data were generated.
The performance with regard to \(\vert ECE_\mathcal {M}-CE\vert\) is similar to that for \(\vert \hat{c}_\mathcal {M}-c^*\vert\) (Table 25). The best measure is \(PL_{DE}\), followed by Platt scaling and \(ES_{CV}\). Again, Platt scaling and beta calibration fail on the ‘stairs’ dataset.
In Table 26, the performance of estimating \(\vert ECE_\mathcal {M}-CE\vert ^2\) is compared, so that KCE could also be included; the unbiased version of KCE with the RBF kernel is used. KCE performs worse than the other selected methods.
In Table 27, Spearman correlation is used to assess how well the measures are able to rank different models. All measures perform very well; the only exception is ‘stairs’, where Platt scaling and beta calibration are not as good.
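As a sketch of this ranking comparison, Spearman correlation between true and estimated calibration errors can be computed as Pearson correlation on ranks. The numbers below are made up for illustration; they are not values from Table 27.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation (assumes distinct values, so no tie handling)."""
    # argsort of argsort turns values into 0-based ranks
    rx = np.argsort(np.argsort(x)) - (len(x) - 1) / 2.0
    ry = np.argsort(np.argsort(y)) - (len(y) - 1) / 2.0
    # Pearson correlation of the centered ranks
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

# hypothetical true calibration errors of 5 post-hoc calibrated models
true_errors = [0.010, 0.025, 0.040, 0.055, 0.070]
# estimates from a measure that swaps the ranks of two adjacent models
estimated_errors = [0.012, 0.024, 0.052, 0.050, 0.080]
```

One adjacent swap among five models gives a correlation of 0.9, which illustrates why most measures score close to 1 in Table 27: only the ordering matters, not the magnitude of the estimates.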
F.3: Results of real experiments
Table 28 also shows the biases on the real data, to be compared with the pseudo-real results. To make the comparison as valid as possible, the pseudo-real data is restricted to the subset with the same architectures (ResNet110, WideNet32, DenseNet40) calibrated with the same methods (TempS, VecS, MSODIR, dirL2, dirODIR). The only difference is that the real experiments are trained and evaluated on CIFAR-10, while the pseudo-real experiments are trained on the CIFAR-5m dataset. To calculate the average bias, one needs to subtract the average true calibration error from the average estimates (the same constant for each value in the row). For the real data, we do not know this constant, but we have subtracted a value which makes the row's average bias match the corresponding row's average bias on the pseudo-real data. Thus, the true average biases are a constant shift away from these. However, as this does not affect the ranks, we can still compare whether the ranking of biases across different methods agrees between the pseudo-real and real data. There is strong agreement, increasing confidence that the methods that are stronger on pseudo-real data in Tables 6 and 8 are also stronger on real data.
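The constant-shift alignment described above can be sketched as follows. The numbers are made up for illustration; they are not values from Table 28.

```python
import numpy as np

# hypothetical average ECE estimates of four measures on real data (one row
# of the bias table); the true calibration error on real data is unknown
real_estimates = np.array([0.042, 0.038, 0.051, 0.045])

# average biases of the same measures on pseudo-real data, where the true
# calibration error is known and can be subtracted directly
pseudo_real_biases = np.array([0.006, 0.001, 0.013, 0.008])

# subtract the constant that makes the row's average bias match the
# pseudo-real row; the true biases differ from these by a constant shift
shift = real_estimates.mean() - pseudo_real_biases.mean()
aligned_biases = real_estimates - shift
```

Because the same constant is subtracted from every entry in the row, the ranking of the biases across measures is preserved, which is exactly what makes the rank comparison between real and pseudo-real data valid.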
Figure 15 depicts how different evaluation methods order models by calibration. The results are shown for 10 different real model-dataset combinations, with a test set size of 10k for each. For each model-dataset combination, there are 5 models obtained by applying post-hoc calibration methods.
The figure shows that in all cases the choice of the evaluation method affects the ordering of the models. For example, for resnet110SD_c100, beta calibration, isotonic regression, and Platt scaling rank vector scaling (VecS) as the most calibrated model, while the other methods rank temperature scaling (TempS) as the most calibrated.
F.4: Running time of the experiments
The experiments on the pseudo-real data combined 13 calibration maps (6 were used for the final results), 5 seeds, 3 models, 3 data sizes, and 4 two-class experiments (3 one-vs-rest and 1 confidence), for 2340 combinations in total; running all the ECE methods, isotonic, beta, Platt, kernel methods, and PL methods took around 6500 h. The experiments on the synthetic data combined 5 calibration maps, 5 seeds, 21 derivates, and 3 data sizes; running all the ECE methods, isotonic, beta, Platt, kernel methods, and PL methods took around 5000 h. The experiments on the real data combined 11 model-dataset combinations, three data sizes with different numbers of subsets (10 in total), and 5 calibration methods, for 700 combinations in total, which took under 350 h. The scripts were run in a high-performance computing center using CPU processing power (Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz) with up to 6 GB of RAM.
Appendix G: Limitations and future work
While the piecewise linear methods in both the original probability space and in the logit-logit space have improved the state-of-the-art as calibrators and as evaluators of calibration, there is room for improvement: (1) we have not experimented on non-neural classification methods to identify whether PL, PL3, or some other method would work well there; (2) when estimating calibration error, PL and PL3 would benefit from debiasing methods to further improve performance; (3) the running time of PL and PL3 should be reduced, e.g. by considering fewer bins (usually 5 or fewer bins were used) and using 3- or 5-fold CV instead of 10-fold CV; (4) new pseudo-real datasets should be created from generative models trained on datasets other than CIFAR-10, so that true calibration maps could be estimated and calibration methods compared; and (5) new calibration map families could be created based on results with new pseudo-real datasets.
Kängsepp, M., Valk, K. & Kull, M. On the usefulness of the fit-on-test view on evaluating calibration of classifiers. Mach Learn 114, 105 (2025). https://doi.org/10.1007/s10994-024-06652-6