1 Introduction

When classifiers are incorporated into safety-critical applications, it is essential that their predictions come with reliable uncertainty estimates. If the predictions are over-confident, this can cause costly errors, such as an autonomous vehicle getting into an accident. If the predictions are under-confident, this can result in the system failing to fulfil its task, e.g. an autonomous car moving too slowly to mitigate the over-estimated risks. Therefore, classifiers are expected to report calibrated uncertainty in the form of class probability estimates. A probabilistic classifier is considered calibrated if, within groups of similar predictions, the average prediction agrees with the actual class proportions. For example, in binary classification this implies that if the classifier predicts a probability of 80% to be positive for each of a set of 100 instances, then 80 of these instances are expected to be truly positive. Most learning algorithms result in classifiers that are not well-calibrated and require dedicated post-hoc calibration methods to be applied (Niculescu-Mizil & Caruana, 2005; Guo et al., 2017).

Progress in developing methods to obtain calibrated classifiers can only be made if we have reliable methods for evaluating calibration. In binary classification, the most common way of estimating a classifier’s calibration is through reliability diagrams and ECE (estimated calibration error, also known as the expected calibration error) (Murphy & Winkler, 1977; Broecker, 2011; Naeini et al., 2015). In reliability diagrams, many instances with similar predicted probabilities are binned together to get an estimate of the true calibrated probability in each bin by averaging the corresponding class labels. However, there is no consensus on how to place the bins in reliability diagrams or how many bins there should be (Roelofs et al., 2020). Usually 10, 15, or 20 bins are used (Naeini et al., 2015; Guo et al., 2017). Bins are placed either with equal width, so that each covers an equal chunk of the probability space, or with equal size, so that each contains an equal number of predictions. The choice of binning can drastically impact the shape of the reliability diagram and alter the estimated calibration error (Roelofs et al., 2020; Kumar et al., 2019; Nixon et al., 2019). Failure to measure calibration reliably makes it hard to decide which classifier is better calibrated or which method of post-hoc calibration is better. This in turn harms the performance of safety-critical systems.

Even though multiple works on methods of evaluating calibration have been published, e.g. Widmann et al. (2019) and Zhang et al. (2020), these have not exploited the more direct link between post-hoc calibration and evaluation, which we refer to as the fit-on-test paradigm. According to this paradigm, any post-hoc calibration method can be repurposed for calibration evaluation by applying it on the test data (rather than on the validation data as in post-hoc calibration) and then using it as a plug-in estimator of calibration error.

The contributions of this paper are the following:

  • We introduce the fit-on-test paradigm of evaluating calibration, showing that any post-hoc calibration method can also be used for evaluating calibration (Sect. 4.2);

  • We prove that the classical binning-based ECE measure follows from the fit-on-test paradigm using a particular calibration map family (Sect. 4.4);

  • Exploiting this fact, we show how cross-validation can be used for optimising the number of bins in ECE (Sect. 4.6);

  • We demonstrate shortcomings in the common visualisations of reliability diagrams and propose reliability diagrams with diagonal filling (Sect. 4.5);

  • Using the fit-on-test paradigm, we develop new methods PL and PL3 of evaluating calibration using continuous piecewise linear functions (Sect. 5);

  • We clarify the methodology of assessing calibrators and calibration evaluators (Sect. 6) and introduce the usage of pseudo-real data for this purpose (Sect. 7);

  • We perform experimental comparisons to find out which families of calibration maps result in better post-hoc calibration, better reliability diagrams, and better approximations of calibration errors (Sect. 7);

  • We discuss the limitations of the fit-on-test paradigm (Sect. 8).

2 Related work

In this section, we give an overview of different approaches to calibration error evaluation. Evaluation of calibration has been the main focus of several works since the introduction of reliability diagrams and ECE (Murphy & Winkler, 1977; Broecker, 2011; Naeini et al., 2015). To start with, Vaicenavicius et al. (2019) proposed a more general definition of calibration and a method to perform statistical calibration tests based on binning. Widmann et al. (2019) proposed the kernel calibration error for calibration evaluation in multi-class classification. Roelofs et al. (2020) proposed a method which chooses the maximal number of bins such that it leads to a monotonically increasing reliability diagram. Popordanoska et al. (2022) proposed estimating multi-class calibration error using kernel density estimation with Dirichlet kernels. Popordanoska et al. (2023) proposed the Kullback-Leibler calibration error, allowing one to estimate all proper calibration errors and refinement terms.

The research on evaluating calibration has gone hand-in-hand with the research on post-hoc calibration, which aims to learn a calibration map transforming the classifier’s output probabilities into calibrated probabilities. Many papers contributed to both post-hoc calibration and evaluation. Naeini et al. (2015) proposed BBQ and used ECE to evaluate calibration. Guo et al. (2017) proposed temperature, vector and matrix scaling and used reliability diagrams and ECE for evaluating confidence in multi-class classification. Confidence stands for the predicted probability of the predicted class, leaving out all the other probabilities and class-wise relations; this is also referred to as top-label calibration error by Kumar et al. (2019). Kull et al. (2019) proposed Dirichlet calibration and the notion of classwise-calibration error, which measures the calibration error for each class separately. Kumar et al. (2019) proposed scaling-binning calibration, a new debiasing method for ECE, and the notion of marginal calibration error. Marginal calibration error is similar to the classwise-calibration error defined concurrently by Kull et al. (2019), adding the possibility to control how much each class counts towards the error. Zhang et al. (2020) proposed generic Mix-n-Match calibration strategies and used kernel density estimation (KDE) for estimating calibration error. Gupta et al. (2021) proposed a calibration evaluation metric based on the Kolmogorov-Smirnov test and a calibration method based on fitting splines. Xiong et al. (2023) proposed proximity calibration (procal) for confidence calibration and proximity-informed ECE (PIECE). PIECE divides instances into proximity groups based on their distance in the representation space and uses this information to measure the calibration error of each proximity group separately.

3 Notation and background

3.1 True calibration error

We present the methods for binary classification, but Sect. 3.3 shows applicability to multi-class classification as well. Consider a binary classifier \(f:\mathcal {X}\rightarrow [0,1]\) predicting the probability of an instance being positive. Let \(X\in \mathcal {X}\) be a randomly drawn instance, \(Y\in \{0,1\}\) its true class, and let us denote the model’s prediction by \(\hat{P}=f(X)\). Every classifier f has a corresponding true calibration map, which could be used to perfectly calibrate the model: \(c^*_f(\hat{p})=\mathbb {E}[Y\mid \hat{P}=\hat{p}]\) (also known as the canonical calibration function (Vaicenavicius et al., 2019)). For evaluation of calibration, consider a test dataset with instances \(x_1,\dots ,x_n\in \mathcal {X}\) and true labels \(y_1,\dots ,y_n\in \{0,1\}\), and denote the predictions by \(\hat{p}_i=f(x_i)\). The true calibration error (CE) is the model’s average violation of calibration; it could be defined on the overall test distribution as \(\mathbb {E}[\vert c^*_f(\hat{P})-\hat{P}\vert ^\alpha ]\) (Kumar et al., 2019), but we define it for the test dataset:

$$\begin{aligned} \textsf{CE}^{(\alpha )}=\frac{1}{n}\sum _{i=1}^n \vert c^*_f(\hat{p}_i)-\hat{p}_i\vert ^\alpha \end{aligned}$$
(1)

where \(\alpha =1\) corresponds to absolute error (MAE) and \(\alpha =2\) to squared error (MSE). Figure 1a shows an example of a true calibration map, where each red line shows the violation of calibration for a particular data point, and the average length of the red lines equals CE.
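As a concrete illustration of Eq. (1), the sketch below computes CE on synthetic data where the true calibration map is known by construction (the quadratic \(c^*_f\) and the function name `true_calibration_error` are our own illustrative choices; for real classifiers \(c^*_f\) is unknown):

```python
def true_calibration_error(p_hat, c_star, alpha=1):
    """Eq. (1): the average |c*_f(p_i) - p_i|^alpha over the test predictions."""
    return sum(abs(c_star(p) - p) ** alpha for p in p_hat) / len(p_hat)

# Synthetic check with a true calibration map known by construction
# (c*_f(p) = p**2 is a hypothetical choice, not a fitted quantity).
c_star = lambda p: p ** 2
p_hat = [0.1, 0.5, 0.9]

ce_mae = true_calibration_error(p_hat, c_star, alpha=1)  # average of 0.09, 0.25, 0.09
ce_mse = true_calibration_error(p_hat, c_star, alpha=2)
```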

Fig. 1

a True calibration map (orange line) versus the predicted probabilities (dashed line). Connecting lines show instance-wise miscalibration. b Reliability diagram consists of bars (blue) with the height of average label. The red lines show the error between the mean labels and predicted probabilities in each bin. The diagrams are made with synthetic data (3000 data points, stratified sample of 50 data points from the bins shown for instance-wise errors, see Appendix D.3 for more details) (Color figure online)

3.2 Reliability diagrams and ECE

There are multiple ways to estimate calibration error. One of the most popular ways is using reliability diagrams (Murphy & Winkler, 1977). The reliability diagram is a bar plot where each bar covers a certain region of probabilities (a bin) and the bar height corresponds to the average label (\(\bar{y}_k\)) in the k-th bin (Fig. 1b). Each red line in Fig. 1b shows the difference between the average label \(\bar{y}_k\) and the average prediction \(\bar{p}_k\) in the k-th bin. The vector \(\textbf{B}=(B_1,\dots ,B_{b+1})\) provides the bin boundaries \(0=B_1<B_2<\ldots<B_b<B_{b+1}=1+\epsilon\), resulting in bins \([B_1,B_2),\dots ,[B_b,B_{b+1})\), where \(\epsilon\) is an infinitesimal to ensure that \(1\in [B_b,B_{b+1})\). Thus, \(\bar{y}_k=\frac{1}{n_k}\sum _{i:\hat{p}_i\in [B_k,B_{k+1})} y_i\) and \(\bar{p}_k=\frac{1}{n_k}\sum _{i:\hat{p}_i\in [B_k,B_{k+1})} \hat{p}_i\) where \(n_k=\vert \{i: \hat{p}_i\in [B_k,B_{k+1})\}\vert\) is the size of bin k. The bins can be either of equal size (each bin contains the same number of instances) or of equal width (each bin covers an equal region in the probability space).
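The binning notation above can be made concrete with a short sketch (the function names and the toy predictions and labels are our own illustrations, not a reference implementation):

```python
def equal_width_bins(b):
    """Boundaries 0 = B_1 < ... < B_{b+1} = 1 + eps for b equal-width bins."""
    B = [k / b for k in range(b + 1)]
    B[-1] += 1e-9           # the infinitesimal ensuring that 1 falls in the last bin
    return B

def equal_size_bins(p_hat, b):
    """Boundaries placed so that each bin holds (roughly) n/b predictions."""
    s = sorted(p_hat)
    return [0.0] + [s[(k * len(s)) // b] for k in range(1, b)] + [1.0 + 1e-9]

def bin_averages(p_hat, y, B):
    """Per-bin (p̄_k, ȳ_k, n_k) as used in reliability diagrams."""
    stats = []
    for lo, hi in zip(B[:-1], B[1:]):
        idx = [i for i, p in enumerate(p_hat) if lo <= p < hi]
        if idx:
            p_bar = sum(p_hat[i] for i in idx) / len(idx)
            y_bar = sum(y[i] for i in idx) / len(idx)
            stats.append((p_bar, y_bar, len(idx)))
        else:
            stats.append((None, None, 0))   # empty bin
    return stats

p_hat = [0.1, 0.3, 0.65, 0.75, 0.85, 0.95]  # toy predictions
y = [0, 1, 0, 1, 1, 1]                      # toy labels
stats = bin_averages(p_hat, y, equal_width_bins(2))  # ≈ [(0.2, 0.5, 2), (0.8, 0.75, 4)]
```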

Based on the reliability diagrams (Fig. 1b), the estimated calibration error (ECE) (Naeini et al., 2015) is the bin-size-weighted average of the differences between the mean label and the mean prediction across the bins:

$$\begin{aligned} \textsf{ECE}^{(\alpha )}_\textbf{B}=\frac{1}{n}\sum _{k=1}^b n_k\cdot \vert \bar{y}_k-\bar{p}_k\vert ^\alpha . \end{aligned}$$
(2)

The binning-based ECE is known to be biased (Broecker, 2011; Ferro & Fricker, 2012) with \(\mathbb {E}[\textsf{ECE}^{(\alpha )}_\textbf{B}]\ne \mathbb {E}[\textsf{CE}^{(\alpha )}]\), hence in our experiments we use debiasing as proposed by Kumar et al. (2019).
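Putting Eq. (2) into code, a minimal sketch on invented toy data (the debiasing step is deliberately omitted here; the function name is our own):

```python
def ece(p_hat, y, B, alpha=1):
    """Eq. (2): binned ECE, weighting each bin's |ȳ_k - p̄_k|^alpha by n_k / n."""
    total = 0.0
    for lo, hi in zip(B[:-1], B[1:]):
        bin_p = [p for p in p_hat if lo <= p < hi]
        bin_y = [t for p, t in zip(p_hat, y) if lo <= p < hi]
        if bin_p:
            p_bar = sum(bin_p) / len(bin_p)
            y_bar = sum(bin_y) / len(bin_y)
            total += len(bin_p) * abs(y_bar - p_bar) ** alpha
    return total / len(p_hat)

# Toy example with two bins [0, 0.5) and [0.5, 1]
p_hat = [0.1, 0.3, 0.65, 0.75, 0.85, 0.95]
y = [0, 1, 0, 1, 1, 1]
ece_value = ece(p_hat, y, B=[0.0, 0.5, 1.0 + 1e-9])  # ≈ 0.133
```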

3.3 Calibration evaluation for multi-class classification

In contrast to binary classification, there are multiple different definitions of calibration for multi-class tasks:

  • a binary classifier is calibrated if all predicted probabilities to be positive are calibrated: \(\Pr [Y=1\vert f(X)=\hat{p}]=\hat{p}\) for all \(\hat{p}\in [0,1]\);

  • a multi-class classifier is class-k-calibrated if all the predicted probabilities of class k are calibrated: \(\Pr [Y=k\vert f_k(X)=\hat{p}]=\hat{p}\) for all \(\hat{p}\in [0,1]\) (Kull et al., 2019; Kumar et al., 2019; Nixon et al., 2019);

  • a multi-class classifier is confidence calibrated if \(\Pr [Y=\mathop {\mathrm {arg\,max}}\limits f(X)\vert \max f(X)=\hat{p}]=\hat{p}\) for all \(\hat{p}\in [0,1]\) (Kull et al., 2019; Guo et al., 2017).

However, in all of the above scenarios, we need to evaluate whether the predicted and actual probabilities of an event are equal among all instances with shared predictions. By redefining \(Y=1\) and \(Y=0\) to denote whether or not the event happened and \(\hat{P}=f(X)\) to denote the estimated probability of that event, we have essentially reduced all three evaluation tasks to the first task of evaluating calibration in binary classification. The shared definition of calibration then becomes \(\Pr [Y=1\vert f(X)=\hat{p}]=\hat{p}\), or equivalently \(\mathbb {E}[Y\vert f(X)=\hat{p}]=\hat{p}\). This also explains why ECE has been applied in all three scenarios.
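The reductions described above can be sketched as follows (the function names and the three-class toy predictions are our own illustrations):

```python
def confidence_reduction(probs, labels):
    """Confidence calibration as a binary task: p̂ = max f(X), Y = [argmax is correct]."""
    p_hat = [max(row) for row in probs]
    y = [int(row.index(max(row)) == label) for row, label in zip(probs, labels)]
    return p_hat, y

def classwise_reduction(probs, labels, k):
    """Class-k-calibration as a binary task: p̂ = f_k(X), Y = [true class is k]."""
    p_hat = [row[k] for row in probs]
    y = [int(label == k) for label in labels]
    return p_hat, y

# Three hypothetical 3-class predictions
probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
labels = [0, 2, 2]
p_conf, y_conf = confidence_reduction(probs, labels)   # ([0.7, 0.6, 0.4], [1, 0, 1])
p_k2, y_k2 = classwise_reduction(probs, labels, k=2)   # ([0.1, 0.3, 0.4], [0, 1, 1])
```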

3.4 Post-hoc calibration

Post-hoc calibration is the task of using a validation set to obtain an estimate \(\hat{c}\) of the true calibration map \(c^*_f\) for a given uncalibrated classifier f. Post-hoc calibration methods essentially view the task as binary regression: given the predictions \(\hat{p}_1,\dots ,\hat{p}_n\in [0,1]\) and the corresponding true binary labels \(y_1,\dots ,y_n\in \{0,1\}\), find a ‘regression’ model \(\hat{c}:[0,1]\rightarrow [0,1]\) that best predicts the labels from the predictions, typically evaluated by cross-entropy or mean squared error, which in this context are known as the log-loss and the Brier score, respectively: two members of the family of strictly proper losses (Brier, 1950). Why are proper losses a good way of evaluating progress towards estimating the true calibration map? A common justification is that these losses have the virtue that they are minimised by the perfectly calibrated model \(c^*_f\) (Kumar et al., 2019), that is: \(\mathop {\mathrm {arg\,min}}\limits _{\hat{c}(\hat{p})}\mathbb {E}[l(\hat{c}(\hat{p}),Y)\vert \hat{P}=\hat{p}]=c^*_f(\hat{p})\) for any \(\hat{p}\in [0,1]\) and any strictly proper loss l. However, this justification refers to the optimum only. Our following Theorem 1 makes an even stronger claim: a reduction of the expected loss l leads to a same-sized improvement in how well \(\hat{c}(\hat{p})\) approximates \(c^*_f(\hat{p})\), measured by any Bregman divergence \(d:[0,1]\times [0,1]\rightarrow \mathbb {R}\) (here d quantifies the dissimilarity between two binary categorical probability distributions, and it is a strictly proper loss when the label is its second argument; see details and proofs of the theorems in Appendix B):

Theorem 1

Let \(d:[0,1]\times [0,1]\rightarrow \mathbb {R}\) be any Bregman divergence and \(\hat{c}_1,\hat{c}_2:[0,1]\rightarrow [0,1]\) be two estimated calibration maps. Then

$$\begin{aligned}&\mathbb {E}\Bigl [d(\hat{c}_1(\hat{p}),Y)\vert \hat{P}=\hat{p}\Bigr ]-\mathbb {E}\Bigl [d(\hat{c}_2(\hat{p}),Y)\vert \hat{P}=\hat{p}\Bigr ]\\&\quad =d\Bigl (\hat{c}_1(\hat{p}),c^*_f(\hat{p})\Bigr )- d\Bigl (\hat{c}_2(\hat{p}),c^*_f(\hat{p})\Bigr ). \end{aligned}$$

The above theorem involves expectations conditioned on \(\hat{p}\), which are typically impossible to estimate for any particular \(\hat{p}\) in isolation, because there is just one or very few instances with exactly the same predicted probability \(\hat{p}\). Therefore, most post-hoc calibration methods minimise the empirical loss \(\sum _{i=1}^{n}d(\hat{c}(\hat{p}_i),y_i)\) for \(\hat{c}\) in some sub-family \(\mathcal {C}\) within all possible calibration maps, using inductive biases such as assuming that \(c^*_f\) is monotonic (isotonic calibration (Zadrozny & Elkan, 2002)) or belongs to some parametric family, e.g. logistic functions (Platt scaling (Platt, 2000)).
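As a concrete instance of such empirical loss minimisation, isotonic calibration can be fitted with the classical pool-adjacent-violators algorithm. The following is a minimal sketch for binary labels (our own code, not an optimised implementation):

```python
def isotonic_fit(p_hat, y):
    """Pool-adjacent-violators: least-squares monotone fit of y against p_hat.
    Returns the fitted value ĉ(p̂_i) for each instance, in the original order."""
    order = sorted(range(len(p_hat)), key=lambda i: p_hat[i])
    blocks = []                      # each block holds [sum_of_y, count]
    for i in order:
        blocks.append([y[i], 1])
        # merge while the previous block's mean exceeds the current block's mean
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted_sorted = [s / c for s, c in blocks for _ in range(c)]
    fitted = [0.0] * len(p_hat)
    for rank, i in enumerate(order):
        fitted[i] = fitted_sorted[rank]
    return fitted

p_hat = [0.1, 0.3, 0.65, 0.75, 0.85, 0.95]
y = [0, 1, 0, 1, 1, 1]
c_hat = isotonic_fit(p_hat, y)   # [0.0, 0.5, 0.5, 1.0, 1.0, 1.0]
```

Note how the violating pair at 0.3 and 0.65 (label 1 followed by label 0) is pooled into a single block with mean 0.5, restoring monotonicity.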

4 The fit-on-test paradigm

4.1 Evaluation of calibration always involves estimation

The goal of evaluating calibration is to measure how far a classifier f is from being perfectly calibrated, based on a given test set. Ideally, we would like to know for each test instance how far the prediction \(\hat{p}_i\) is from the corresponding perfectly calibrated probability \(c^*_f(\hat{p}_i)\). The fundamental problem is that we can never directly observe \(c^*_f(\hat{p}_i)\), even on the test data. Therefore, evaluation of calibration always involves some form of estimation, and one cannot measure the true calibration error precisely.

The standard ECE measure gets around this problem by introducing bins. The idea is that if there are sufficiently many instances in the bin \([B_k,B_{k+1})\), and the bin is narrow enough so that the corresponding perfectly calibrated probabilities \(c^*_f(\hat{p}_i)=\mathbb {E}[Y\mid \hat{P}=\hat{p}_i]\) do not vary much within the bin, then one can estimate the calibration error in the bin as follows:

$$\begin{aligned} \frac{1}{n_k}\sum _{i:\hat{p}_i\in [B_k,B_{k+1})}\vert \mathbb {E}[Y\mid \hat{P}=\hat{p}_i]-\hat{p}_i\vert ^\alpha \approx \vert \bar{y}_k-\bar{p}_k\vert ^\alpha \end{aligned}$$

that is by the difference between the proportion of positives and the average prediction within the bin. If the bins are too narrow, then there are not sufficiently many instances in them, resulting in high variance of calibration error estimation. If the bins are too wide, then the corresponding perfectly calibrated probabilities \(c^*_f(\hat{p}_i)\) vary too much inside the bin, resulting in potential bias in the estimation.

4.2 Fit-on-test estimation of calibration error

Seeing the challenges of choosing a good binning for ECE and the existing attempts to improve over ECE (Roelofs et al., 2020), we looked for alternatives for estimating the calibration error. A classical and intuitive estimation method is plug-in estimation, where the estimate is calculated using the same formula as the population statistic it is estimating. We propose to use this for the true calibration error defined earlier in Eq.(1), obtaining the plug-in estimator:

$$\begin{aligned} \widehat{\textsf{CE}}^{(\alpha )}=\frac{1}{n}\sum _{i=1}^n \vert \hat{c}(\hat{p}_i)-\hat{p}_i\vert ^\alpha \end{aligned}$$
(3)

where \(\hat{c}(\cdot )\) is some estimator of the function \(c^*_f(\cdot )\). The intuition is that in order to estimate the true calibration error we would first estimate the true calibration map. After that we can use the average discrepancy between the predictions and the corresponding estimated calibration map values as our estimate of the true calibration error.

Our task now is to find a way to estimate the true calibration map \(c^*_f(\cdot )\). It is important to note that this estimation must be performed using only the given test set, since the goal is to evaluate calibration based on that test set alone.

Here we can turn to the existing literature on post-hoc calibration. Indeed, the goal of post-hoc calibration is also to estimate the true calibration map, except that there the estimation is performed on the validation set. All we need to do is to take a post-hoc calibration method and apply it on the test data instead. This yields an estimated calibration map \(\hat{c}(\cdot )\) which can be used within Eq.(3) to approximate the calibration error. As estimating a calibration map is essentially fitting a function, we refer to such plug-in estimation as the fit-on-test estimation of calibration error. To summarise, any post-hoc calibration method can be applied on the test data and used within the plug-in estimator, turning it into a fit-on-test estimator of calibration error. Note that this does not mean that all methods are equally useful as plug-in estimators; the limitations of fit-on-test estimators are discussed further in Sects. 4.3 and 8.
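A sketch of this recipe, using simple equal-width histogram binning as the post-hoc method fitted on the test set; any other post-hoc calibration method could be substituted for `histogram_binning` (all names here are our own illustrative choices):

```python
def histogram_binning(p_hat, y, b=10):
    """Fit a piecewise constant ĉ: the average label in each of b equal-width bins."""
    edges = [k / b for k in range(b + 1)]
    edges[-1] += 1e-9
    means = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        ys = [t for p, t in zip(p_hat, y) if lo <= p < hi]
        means.append(sum(ys) / len(ys) if ys else (lo + hi) / 2)  # empty-bin fallback
    return lambda p: means[min(int(p * b), b - 1)]

def fit_on_test_ce(p_hat, y, fit, alpha=1):
    """Eq. (3): fit ĉ on the *test* data, then average |ĉ(p̂_i) - p̂_i|^alpha."""
    c_hat = fit(p_hat, y)
    return sum(abs(c_hat(p) - p) ** alpha for p in p_hat) / len(p_hat)

# Toy test set; with two bins the fitted ĉ is the piecewise constant map
p_hat = [0.1, 0.3, 0.65, 0.75, 0.85, 0.95]
y = [0, 1, 0, 1, 1, 1]
ce_hat = fit_on_test_ce(p_hat, y, lambda p, t: histogram_binning(p, t, b=2))  # ≈ 0.167
```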

However, nothing prevents us from going beyond the set of existing post-hoc calibration methods. Whenever we have some family \(\mathcal {C}\) of potential calibration map functions and some strictly proper loss l, we can define the corresponding fit-on-test calibration evaluation measure by first performing fitting on the test data:

$$\begin{aligned} \hat{c}_{\text {fit-}(\mathcal {C},l)\text {-on-test}}=\mathop {\mathrm {arg\,min}}\limits _{c\in \mathcal {C}} \frac{1}{n}\sum _{i=1}^n l(c(\hat{p}_i),y_i) \end{aligned}$$

and then using it within the plug-in estimator:

$$\begin{aligned} \textsf{ECE}^{(\alpha )}_{\text {fit-}(\mathcal {C},l)\text {-on-test}}=\frac{1}{n}\sum _{i=1}^n \vert \hat{c}_{\text {fit-}(\mathcal {C},l)\text {-on-test}}(\hat{p}_i)-\hat{p}_i\vert ^{\alpha }. \end{aligned}$$

Fit-on-test estimation of calibration error is visualised in Fig. 2 (illustration only, not based on real data).

Fig. 2

Fit-on-test estimation of calibration error: (1) a calibration map \(\hat{c}\) is obtained fitting a family of calibration maps \(\mathcal {C}\) by minimising the loss l on the test data; (2) instance-wise calibration errors are estimated as distances of predictions from calibrated predictions; (3) overall calibration error is estimated as the average of instance-wise errors. The plots are illustrative, not based on real data

4.3 Discussion

As the idea of using plug-in estimation is almost trivial, one might wonder why it has not been introduced before. We suspect this is partly due to the following potential concerns:

  1. Due to inevitable overfitting (or the generalisation gap) in any fitting process, we are bound to get our estimated \(\hat{c}\) closer to the observed labels than \(c^*_f\) is. This bias can harm our capability of estimating the true calibration error \(\vert c^*_f(\hat{p})-\hat{p}\vert\);

  2. By choosing a particular family \(\mathcal {C}\) of functions to be used during the fitting process, we would potentially misjudge the calibration error in the cases where \(c^*_f\) is not in this family.

It seems impossible to fully solve both problems at the same time: a more restrictive set of functions helps against overfitting and alleviates the first problem, but increases the second problem; a bigger set of functions helps against the second problem, but increases overfitting. However, our experiments demonstrate that good tradeoffs are possible, using flexible families but still with relatively few parameters.

The classical binning-based ECE might seem to sidestep this problem: instead of estimating \(c^*_f\) at all given points, it compares the bin averages \(\bar{p}\) and \(\bar{y}\). Perhaps surprisingly though, it can be proved (see the next subsection) that the binning-based ECE can also be seen as a fit-on-test estimator of calibration error for a particular family of functions \(\mathcal {C}\) with the Brier score (MSE) as the loss function. Therefore, the above concerns are valid for the standard binning-based ECE as well.

4.4 Classical binned ECE is also a fit-on-test estimator

Next we prove that the standard binning-based ECE measure as defined by Eq.(2) is a fit-on-test estimator with a certain calibration map family \(\mathcal {C}\) that we will present in a moment. Before this, we first introduce a bigger family \(\mathcal {C}_{\mathcal {B},\mathcal {H},\mathcal {A}}^{(b)}=\{c_{(\textbf{B},\textbf{H},\textbf{A})}\}\) of piecewise linear functions with b pieces (or bins), parametrised by the following 3 vectors:

  • \(\textbf{B}\in [0,1]^{b+1}\) - the boundaries of the b pieces (or bins) with the constraint \(0=B_1<B_2<\ldots<B_b<B_{b+1}=1+\epsilon\);

  • \(\textbf{H}\in \mathbb {R}^{b}\) - values of the function at the boundaries \(B_1,\dots ,B_b\);

  • \(\textbf{A}\in \mathbb {R}^{b}\) - slopes of linear functions within the bins.

The values of these functions can be calculated as follows:

$$\begin{aligned} c_{(\textbf{B},\textbf{H},\textbf{A})}(\hat{p})=\sum _{k=1}^b I[B_k\,{\le }\,\hat{p}\,{<}\,B_{k+1}]\cdot (H_k+A_k(\hat{p}-B_k)) \end{aligned}$$

where \(I[\cdot ]\) is the indicator function. Note that the resulting functions \(c_{(\textbf{B},\textbf{H},\textbf{A})}\) can be discontinuous, because the right side of the bin \([B_k,B_{k+1})\) ends near the value \(H_k+A_k(B_{k+1}-B_k)\), and nothing forces this value to equal \(H_{k+1}\), the value at the left side of the bin \([B_{k+1},B_{k+2})\).
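A minimal sketch of evaluating a member of this family (the function name and the parameter values below are hypothetical choices for illustration):

```python
def piecewise_linear(B, H, A):
    """c_{(B,H,A)}: value H_k + A_k * (p - B_k) on the bin [B_k, B_{k+1})."""
    def c(p):
        for k in range(len(H)):
            if B[k] <= p < B[k + 1]:
                return H[k] + A[k] * (p - B[k])
        raise ValueError("p must lie in [0, 1]")
    return c

# Two bins with slope 1 and hypothetical heights: the function jumps at B_2 = 0.5,
# since the first piece ends near 0.3 + 1 * 0.5 = 0.8 while the second starts at 0.45.
c = piecewise_linear(B=[0.0, 0.5, 1.0 + 1e-9], H=[0.3, 0.45], A=[1.0, 1.0])
```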

It turns out that the classical binning-based ECE is a fit-on-test estimator of calibration error with respect to a particular subfamily of \(\mathcal {C}_{\mathcal {B},\mathcal {H},\mathcal {A}}^{(b)}\) that we will describe next. As ECE is calculated from reliability diagrams, which are piecewise constant, one might guess that this subfamily would contain all the piecewise constant (slope 0) functions with a particular fixed binning \(\textbf{B}\). However, this is not true. To see this, consider a synthetic example in Fig. 3a with 6 instances (2 negatives and 4 positives) shown as red dots, and 2 bins \(\textbf{B}=(0,0.5,1+\epsilon )\). The traditional definition of ECE in Eq.(2) yields \(ECE=0.133\) in this example. Fitting a piecewise constant function with binning \(\textbf{B}\) by minimising the Brier score results in the calibration map \(\hat{c}\) visualised in Fig. 3b. As the Brier score is a proper loss, the optimum is achieved by the empirical averages of labels within each bin. Hence, the shape of the calibration map \(\hat{c}\) in Fig. 3b matches the reliability diagram in Fig. 3a. However, if we now use the fit-on-test estimator of calibration error as the average length of vertical red lines in Fig. 3b, then we get \(\widehat{CE}=0.167\), which is different from \(ECE=0.133\) with the same binning.

Fig. 3

A synthetic example about how ECE can be viewed as a fit-on-test estimator of calibration error. a A reliability diagram of a test set with 6 instances (4 positives with predicted probabilities 0.3, 0.75, 0.85, 0.95 and 2 negatives with predicted probabilities 0.1, 0.65), yielding \(ECE=0.133\); b a piecewise constant fit-on-test estimator yields \(\widehat{CE}=0.167\); c a piecewise slope-1 fit-on-test estimator provably yields the same \(ECE=0.133\) as the original in (a) while visualizing per-instance calibration errors also; d our proposed reliability diagram with diagonal filling combines elements from (a) and (c)

Instead, ECE with the binning \(\textbf{B}\) is actually a fit-on-test estimator of calibration error with respect to the subfamily \(\mathcal {C}^{(b)}_{(\textbf{B},\mathcal {H},\textbf{1})}\) which contains all piecewise linear functions with slope 1 (i.e. a 45-degree ascending line) in each of the bins. In other words, \(\mathcal {C}_{(\textbf{B},\mathcal {H},\textbf{1})}\) contains all functions with the fixed binning \(\textbf{B}\), fixed slopes \(\textbf{A}=\textbf{1}=(1,1,\dots ,1)\), i.e. slope 1 in each of the b bins, and any heights \(\textbf{H}\in \mathbb {R}^b\) for the left-side boundaries of these bins. Fitting this family for our example results in the calibration map visualised in Fig. 3c. Using the fit-on-test estimator of calibration error, we now get exactly the standard \(ECE=0.133\) as the average length of the vertical red lines. This can be confirmed visually, seeing that each of the vertical red lines in Fig. 3c has exactly the same length as the bin-specific red line in the standard reliability diagram of Fig. 3a. These lengths are equal because the diagonal and the calibration map both have slope 1. The following theorem confirms this by proving that the standard ECE with binning \(\textbf{B}\) can be seen as a fit-on-test estimator of calibration error.

Theorem 2

Consider a predictive model with predictions \(\hat{p}_1,\dots ,\hat{p}_n\in [0,1]\) on a test set with actual labels \(y_1,\dots ,y_n\) and a binning \(\textbf{B}\) with \(b\ge 1\) bins and boundaries \(0=B_1<\dots <B_{b+1}=1+\epsilon\). Then for any \(\alpha >0\), the measure \(\textsf{ECE}^{(\alpha )}_{\textbf{B}}\) as defined by Eq.(2) is equal to the fit-on-test estimator of calibration error using the family \(\mathcal {C}_{(\textbf{B},\mathcal {H},\textbf{1})}\) and fitting the Brier score:

$$\begin{aligned} \textsf{ECE}^{(\alpha )}_{\textbf{B}}=\frac{1}{n}\sum _{i=1}^n \vert \hat{c}(\hat{p}_i)-\hat{p}_i\vert ^{\alpha } \end{aligned}$$
$$\begin{aligned} \text { where}\quad \hat{c}=\mathop {\mathrm {arg\,min}}\limits _{c\in \mathcal {C}_{(\textbf{B},\mathcal {H},\textbf{1})}} \frac{1}{n}\sum _{i=1}^n (c(\hat{p}_i)-y_i)^2. \end{aligned}$$

Furthermore, \(\hat{c}(\bar{p}_k)=\bar{y}_k\) for \(k=1,\dots ,b\), where \(\bar{p}_k\) and \(\bar{y}_k\) are the average \(\hat{p}_i\) and \(y_i\) in the bin \([B_k,B_{k+1})\).

Proof

See the Supplementary Material. \(\hfill\square\)
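The theorem can also be checked numerically on the six-instance example of Fig. 3: for the slope-1 family, minimising the Brier score within bin k has the closed-form solution \(H_k=B_k+\bar{y}_k-\bar{p}_k\), so the instance-wise error \(\vert \hat{c}(\hat{p}_i)-\hat{p}_i\vert\) equals \(\vert \bar{y}_k-\bar{p}_k\vert\) for every instance in bin k, and the plug-in estimate coincides with binned ECE. A sketch of this check (our own code, under the stated closed form):

```python
# Six-instance example of Fig. 3 with two bins [0, 0.5) and [0.5, 1]
p_hat = [0.1, 0.3, 0.65, 0.75, 0.85, 0.95]
y = [0, 1, 0, 1, 1, 1]
B = [0.0, 0.5, 1.0 + 1e-9]

n = len(p_hat)
ece_value, fit_on_test_value = 0.0, 0.0
for lo, hi in zip(B[:-1], B[1:]):
    idx = [i for i in range(n) if lo <= p_hat[i] < hi]
    p_bar = sum(p_hat[i] for i in idx) / len(idx)
    y_bar = sum(y[i] for i in idx) / len(idx)
    H = lo + y_bar - p_bar                            # least-squares height of the slope-1 piece
    ece_value += len(idx) * abs(y_bar - p_bar) / n    # term of Eq. (2)
    fit_on_test_value += sum(abs((H + (p_hat[i] - lo)) - p_hat[i]) for i in idx) / n  # Eq. (3)
```

Both quantities come out as \(0.8/6\approx 0.133\), matching the \(ECE=0.133\) of Fig. 3a, c.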

4.5 Fit-on-test reliability diagrams

Reliability diagrams are a common way of evaluating calibration of classifiers visually. Next we discuss the shortcomings of existing reliability diagrams and propose enhancements to them.

The simplest classical binning-based reliability diagrams just present the bar plot showing the bins and the proportions of positives in these bins. As we demonstrate in Fig. 4 (top row), it can happen that 3 classifiers with very different ECE values of 0.002, 0.090 and 0.070 have an identical reliability diagram. This is because the proportions \(\bar{y}_k\) of positives in the bins \(k=1,...,b\) are respectively the same for the 3 classifiers. However, ECE is different for these classifiers due to differences in the average predictions \(\bar{p}_k\) in the bins. A common way to address this problem is to visually indicate the average prediction within each bin (Song et al., 2021). The second row of Fig. 4 does so by showing bin centres \((\bar{p}_k,\bar{y}_k)\) with red dots. This reveals that the first classifier is actually nearly calibrated according to this binning, because the red dots are almost at the diagonal of perfect calibration, hence very low ECE of 0.002. However, the second and third classifiers still have identical visualisation, while the values of ECE are different. This is caused by different numbers \(n_k\) of instances within the bins, thus resulting in different weights for the terms in the formula Eq.(2) for ECE. Therefore, it is important to complement reliability diagrams with frequency histograms (Song et al., 2021), as we have done in the third row of Fig. 4.

In the last row of Fig. 4 we propose new reliability diagrams that we call reliability diagrams with diagonal filling. The outline of the original bar plot is kept as a black line. The blue colour does not fill the bars up to a horizontal top as usual, but instead up to a diagonal line with slope 1 that crosses the top of the bar at the red bin centre \((\bar{p}_k,\bar{y}_k)\). As we know from Theorem 2, the top boundary of the blue filling represents the calibration map resulting from fitting the family \(\mathcal {C}_{(\textbf{B},\mathcal {H},\textbf{1})}\). The theorem also states that the average distance between this calibration map and the main diagonal of perfect calibration equals the standard ECE. In this way, the difference between classifiers 1 and 2 becomes clearly evident from the figure: classifier 1 is almost perfectly calibrated because the estimated calibration map nearly matches the main diagonal, whereas classifier 2 is quite far from being calibrated. More precisely, the area between the diagonal filling and the main diagonal is exactly equal to ECE, assuming that the bins have equal width and an equal number of instances in them. However, if the bins have different numbers of instances (e.g. as for the 3rd classifier), then the areas between the diagonal filling and the main diagonal need to be weighted accordingly, as in the third row of the figure.

Our reliability diagrams with diagonal filling are just one example of obtaining reliability diagrams from fit-on-test calibration maps. More generally, one could use any post-hoc calibration method on the test data to obtain an estimated calibration map, and then fill the area under this curve. We call such visualisations fit-on-test reliability diagrams. With any such diagram, the fit-on-test estimate of calibration error can be measured as the average distance across all instances between the reliability diagram and the main diagonal. Figure 5 shows examples of fit-on-test reliability diagrams; the calibration map families PL and PL3 used there will be introduced in Sect. 5.

Fig. 4

Reliability diagrams of 3 classifiers (columns) created with different visualisation methods (rows); the concrete classification task is irrelevant. All classifiers have different ECE, but the plain reliability diagrams (top row) are identical. Differences are gradually revealed by adding bin centres indicating the average predicted probabilities (red dots, in the second row) and frequency histograms (in the third row). The last row shows our proposed reliability diagrams with diagonal filling, where ECE is better visualised because it is equal to the instance-wise average distance from the blue boundary to the main diagonal of perfect calibration (Color figure online)

4.6 Cross-validated number of bins for ECE

As discussed at the beginning of Sect. 4, the choice of the number of bins strongly influences how well the standard binning-based \(\textsf{ECE}^{(\alpha )}_{\textbf{B}}\) estimates the true calibration error. Viewing ECE as fitting the family \(\mathcal {C}^{(b)}_{(\textbf{B},\mathcal {H},\textbf{1})}\) on the test data by minimising the Brier score, we can see the choice of the number of bins b as a hyper-parameter optimisation task. This opens the door to novel methods for choosing the number of bins for ECE. For example, we can split the test set randomly into two folds: on one fold we perform the fitting with different numbers of bins, and on the other fold we evaluate which number of bins provides the best fit according to the Brier score. After that, the final reliability diagram can be drawn with this selected number of bins, and ECE calculated based on this diagram. Instead of such a fixed split into two folds, any other hyper-parameter optimisation technique can be used. We propose to use cross-validation (CV), a standard hyper-parameter tuning method, to select the number of bins which provides the best fit. In the example of Fig. 1, the optimum was achieved with 14 bins, as shown in Fig. 5a. Note that we are fitting \(\hat{c}(\hat{p}_i)\) to \(y_i\) for \(i=1,\dots ,n\), and thus CV improves the fit between the estimated calibration map (the top of the diagonal filling of the reliability diagram) and the binary labels. As a result, the fit between the estimated calibration function \(\hat{c}\) and the true calibration curve \(c^*_f\) also improves in expectation, as implied by Theorem 1. A better fit of \(\hat{c}\) to \(c^*_f\) implies a ‘more reliable’ reliability diagram, in the sense that the top of the filling is on average closer to the true calibration function. While cross-validation is a standard tool for hyper-parameter tuning, it has not previously been applied to ECE, because ECE has not been viewed as a fitting procedure.

In our implementation of CV, inspired by Tikka and Hollmén (2008), we prefer a lower number of bins whenever the relative difference in loss is below 0.1 percent, which further improves performance (for details see Appendix C.2).
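The fold-based selection described above can be sketched as follows. This is our own minimal illustration with hypothetical function names, using equal-width bins, slope-1 segments, and the 0.1% preference rule:

```python
import numpy as np

def brier_of_binning(p_tr, y_tr, p_va, y_va, b):
    """Fit the slope-1 equal-width binning map on one fold,
    score it with the Brier score on the other fold."""
    edges = np.linspace(0.0, 1.0, b + 1)
    idx_tr = np.clip(np.digitize(p_tr, edges) - 1, 0, b - 1)
    offs = np.zeros(b)  # empty training bins keep offset 0 (identity)
    for k in range(b):
        m = idx_tr == k
        if m.any():
            offs[k] = y_tr[m].mean() - p_tr[m].mean()
    idx_va = np.clip(np.digitize(p_va, edges) - 1, 0, b - 1)
    c_va = np.clip(p_va + offs[idx_va], 0.0, 1.0)
    return np.mean((c_va - y_va) ** 2)

def cv_num_bins(p_hat, y, candidates=range(1, 31), n_folds=10,
                rel_tol=1e-3, seed=0):
    """Select the number of bins by cross-validated Brier score,
    preferring fewer bins within a 0.1% relative tolerance."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(p_hat)), n_folds)
    losses = []
    for b in candidates:
        fold_losses = []
        for i in range(n_folds):
            va = folds[i]
            tr = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            fold_losses.append(brier_of_binning(p_hat[tr], y[tr],
                                                p_hat[va], y[va], b))
        losses.append(np.mean(fold_losses))
    best = min(losses)
    for b, loss in zip(candidates, losses):
        if loss <= best * (1 + rel_tol):  # smallest b close enough to the best
            return b
```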

Fig. 5

Different reliability diagrams with the number of bins or pieces optimised using cross-validation, on the same data as in Fig. 1: a a reliability diagram with diagonal filling using 14 bins; b piecewise linear reliability diagram with 3 pieces; c piecewise linear in logit-logit space reliability diagram with 2 pieces

5 Calibration map families PL and PL3

The family \(\mathcal {C}^{(b)}_{(\textbf{B},\mathcal {H},\textbf{1})}\) of functions used by the binning-based ECE has several weaknesses: (1) it contains discontinuous functions, while ‘jumps’ are unlikely to be present in the true calibration function; (2) it only contains segments with slope 1, making it hard to fit the true calibration function in regions with a different slope.

Therefore, we are instead looking for a family satisfying the following criteria: (1) contains only continuous functions; (2) has flexibility to fit any curve; (3) contains the identity function; (4) has few parameters not to overfit heavily.

We first revisit the families used by existing post-hoc calibration methods. Some methods use families with a fixed number of parameters, such as Platt scaling (Platt, 2000) (2 parameters) and beta calibration (Kull et al., 2017) (3 parameters); see also Table 1. A small fixed number of parameters is clearly not sufficient for the flexibility to fit any curve. Non-parametric methods such as isotonic calibration (Zadrozny & Elkan, 2002) have a tendency to be overconfident and thus to overfit the data (Allikivi & Kull, 2019). Therefore, we are looking for methods with a flexible number of parameters, so that a suitable number can be selected according to the dataset, for example by cross-validation. Piecewise linear functions with slope 1 satisfy this requirement, but they in turn are not continuous. This motivates our following proposal of using the family of continuous piecewise linear calibration maps with unconstrained slopes.

Table 1 A selection of fit-on-test estimators of calibration error in binary classification, resulting from different choices of the calibration map family \(\mathcal {C}\) and proper loss function l. The baseline method, Classical ECE, is given in bold

5.1 PL - piecewise linear calibration maps

We propose to use the subfamily \(\mathcal {C}_{\mathcal {B},\mathcal {H},\text {cont}}^{(b)}\) of continuous functions from the piecewise linear function family \(\mathcal {C}_{\mathcal {B},\mathcal {H},\mathcal {A}}^{(b)}\). The family \(\mathcal {C}_{\mathcal {B},\mathcal {H},\text {cont}}^{(b)}\) is only parametrised by the bin boundaries \(\textbf{B}\) and by the values \(\textbf{H}\) of the function at the boundaries, whereas the slopes can be calculated from \(\textbf{B}\) and \(\textbf{H}\) with \(A_k=\frac{H_{k+1}-H_{k}}{B_{k+1}-B_{k}}\) to ensure that the line at the right end of bin k coincides with the line at the left end of bin \(k+1\). Note that the bin boundaries \(\textbf{B}\) are now also parameters to be fitted, together with the values \(\textbf{H}\). As \(B_1=0\) and \(B_{b+1}=1+\epsilon\) are fixed, we are fitting 2b parameters: \(b-1\) bin boundaries and \(b+1\) values in \(\textbf{H}\). The number of bins b is optimised through cross-validation, similarly to Sect. 4.6.

We call the corresponding fit-\((\mathcal {C}_{\mathcal {B},\mathcal {H},\text {cont}}^{(b)},l)\)-on-test method PL: the piecewise linear method for evaluating calibration. In particular, we can now draw a new kind of piecewise linear reliability diagram (Fig. 5b), which provides a better fit to the true calibration function than the binning-based methods, as demonstrated in our experiments (Sect. 7). For a visual comparison, see Figs. 5a and 5b, which have been made with the same data. More comparative examples can be found in Appendix E.1. Therefore, the piecewise linear reliability diagram can be used similarly to the classical ECE reliability diagram to check visually how well the model is calibrated. We can then also use this estimated calibration map for fit-on-test estimation of the calibration error, obtaining a measure which we call the piecewise linear ECE, or ECE-PL for short. Following the fit-on-test estimation method, ECE-PL measures the instance-wise average distance from the piecewise linear function to the main diagonal: \(\textsf{ECE}_{\textsf{PL}}=\frac{1}{n}\sum _{i=1}^n \vert \hat{c}(\hat{p}_i)-\hat{p}_i\vert ^{\alpha }\) where \(\hat{c}=\mathop {\mathrm {arg\,min}}\limits _{c\in \mathcal {C}_{\mathcal {B},\mathcal {H},\text {cont}}^{(b)}} \frac{1}{n}\sum _{i=1}^n (c(\hat{p}_i)-y_i)^2\).
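Once the boundaries \(\textbf{B}\) and values \(\textbf{H}\) have been fitted, evaluating \(\hat{c}\) and ECE-PL amounts to linear interpolation. A minimal numpy sketch (the function name is ours and the fitting itself is omitted):

```python
import numpy as np

def ece_pl(p_hat, boundaries, values, alpha=1):
    """ECE-PL for an already fitted continuous piecewise linear map:
    c_hat is the linear interpolation through the knots (B_k, H_k)."""
    c_hat = np.interp(p_hat, boundaries, values)
    return np.mean(np.abs(c_hat - p_hat) ** alpha)
```

For the identity map (knots on the main diagonal) the estimate is 0, matching the intuition that a perfectly calibrated classifier has zero calibration error.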

Implementation Details

Although continuous piecewise linear functions are mathematically very well known, we found only one existing public implementation (Jekel & Venter, 2019), based on least squares fitting with differential evolution. We included this in our experiments under the name \(PL_{DE}\) (results in Appendix F). However, we also created our own neural-network-based implementation, depicted in Fig. 6, which allowed us to add cross-entropy fitting. Full details about the architecture are given in Appendix C, but here is a short overview.

We have a single input (\(\hat{p}\)) and a single output (\(\hat{c}\)) connected through two layers: the binning layer and the interpolation layer. The binning layer has \(b+1\) gating units corresponding to the bin boundaries, each outputting whether \(\hat{p}\) is to the left or to the right of the boundary (\(L_i=\hat{p}-B_i\) and \(R_i=B_i-\hat{p}\)). The binning layer is parametrised by b real values (\(\textbf{Z}=z_1, z_2, \dots , z_b\)), which are passed through the softmax (\(\sigma\)) to obtain the widths of the bins and then through a cumulative sum (\(B_i = \sum _{k=1}^{i} \sigma _k(\textbf{Z})\)) to obtain the bin boundaries. These parameters are initialised such that all bins contain the same number of training instances. The interpolation layer has b units corresponding to the bins, and it is parametrised by \(b+1\) values which are each passed through the logistic function (\(\phi\)) to obtain the calibration map values \(H_1,\dots ,H_{b+1}\) at the bin boundaries. These are initialised such that the represented calibration map is the identity function. The unit of the bin to which \(\hat{p}\) belongs produces the linearly interpolated output, calculated from the two neighbouring boundary values (\(H_k\), \(H_{k+1}\)) as follows:

$$\begin{aligned} \hat{c}= \sum _{k=1}^{b} g(L_{k+1}, R_k)\left( H_k\frac{L_{k+1}}{L_{k+1} + R_k} + H_{k+1}\frac{R_k}{L_{k+1} + R_k}\right) \end{aligned}$$

where the gating function g is defined as \(g(L_{k+1}, R_k) = I[(L_{k+1}>0) \& (R_k>0)]\).
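The forward pass through the two layers can be sketched in plain numpy. This is our own illustration rather than the authors' implementation; in particular, we realise the gated interpolation sum simply via `np.interp`, which selects the bin containing \(\hat{p}\) and interpolates between its boundary values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pl_forward(p_hat, z, h):
    """Forward pass of the piecewise linear model: b bin-width logits z and
    b+1 boundary-value logits h produce c_hat(p_hat)."""
    widths = softmax(z)                              # b positive widths, sum 1
    B = np.concatenate(([0.0], np.cumsum(widths)))   # b+1 ordered boundaries
    H = sigmoid(h)                                   # map values at boundaries
    return np.interp(p_hat, B, H)                    # gated linear interpolation
```

The softmax-plus-cumsum parametrisation guarantees that the boundaries stay ordered within \([0,1]\) during gradient-based fitting, while the logistic keeps the map values in \([0,1]\); this mirrors the design in the overview above.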

Fig. 6

Architecture of the piecewise linear function implementation as a neural network. The bin boundaries \(B_i\) are parametrised by logits (\(\textbf{Z}=z_1, z_2, \dots , z_b\)), which are passed through the softmax followed by a cumulative sum (\(B_i = \sum _{k=1}^{i} \sigma _k(\textbf{Z})\)). The \(H_i\) values are obtained through the logistic function \(\phi\). The g stands for a gate: an indicator function that outputs 1 if the input is in the bin and 0 otherwise

We use 10-fold cross-validation to select the number of segments in the piecewise linear function that best approximates the true calibration map, similarly to Sect. 4.6. As in the hyper-parameter optimisation for Dirichlet calibration (Kull et al., 2019), the predictions on test data are obtained as the average output of all 10 models with the chosen number of segments trained on the different folds, i.e. we do not refit a single model on all 10 folds.

5.2 PL3 - piecewise linear in logit-logit space

The piecewise linear method can be used universally for calibration evaluation with any kind of model, but it also makes sense to seek dedicated families for special cases, such as neural networks. Next we propose the family PL3 specifically for evaluating calibration in neural networks, taking inspiration from temperature scaling (Guo et al., 2017) and beta calibration (Kull et al., 2017).

Temperature scaling fits a family of functions \(\hat{c}(\hat{p})=\sigma (\textbf{z}/t)\) with a single temperature parameter t, where the softmax \(\sigma\) is applied to the logits \(\textbf{z}\) that the uncalibrated model would have directly converted into probabilities with \(\hat{p}=\sigma (\textbf{z})\) (Guo et al., 2017). In the binary classification case with a single output, \(\sigma (z)=1/(1+e^{-z})\) is the logistic function, the inverse of the logit \(\sigma ^{-1}(p)=\ln (p/(1-p))\). Importantly, when plotted in the logit-logit scale, binary temperature scaling fits a straight line with slope 1/t (Fig. 7), since \(\sigma ^{-1}(\hat{c}(\hat{p}))=\sigma ^{-1}(\sigma (z/t))=z/t=\frac{1}{t}\cdot \sigma ^{-1}(\sigma (z))=\frac{1}{t}\cdot \sigma ^{-1}(\hat{p})\). Further, Fig. 7 compares various methods in the probability space and in the logit-logit space, giving extra information on how well the methods fit the calibration map in the low- and high-probability regions.
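This identity is easy to verify numerically. The following self-contained check (our own illustration) confirms that binary temperature scaling is a line with slope 1/t through the origin in logit-logit space:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

# temperature-scale an uncalibrated binary probability p_hat = sigmoid(z)
z = np.linspace(-4.0, 4.0, 9)
t = 2.5
p_hat = sigmoid(z)
c_hat = sigmoid(z / t)
# in logit-logit space: logit(c_hat) = logit(p_hat) / t
assert np.allclose(logit(c_hat), logit(p_hat) / t)
```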

Interestingly, another post-hoc calibration method known as beta calibration (Kull et al., 2017) fits calibration maps which in the logit-logit space are approximately piecewise linear with two segments, as shown in Fig. 7 (top right subfigure, blue line). The proof of this fact is given in Appendix E.2. This motivates our calibration map family PL3 of Piecewise Linear functions in the Logit-Logit space (PLLL=PL3), which corresponds to temperature scaling when using 1 piece, approximates beta calibration when using 2 pieces, and can take more complicated shapes when using more linear pieces in the logit-logit space. Fig. 5c shows an example of a reliability diagram that is piecewise linear in the logit-logit space. Further details are provided in Appendix E.2.

Fig. 7

Motivation for PL3. Comparison of results in the probability space (left column) and in the logit-logit space (right column). Methods are split into two groups for clarity. In essence, subfigures (a) and (b) are the same but display different functions. Model ResNet110 (He et al., 2015) on the "cats vs rest" task of CIFAR-5m (Color figure online)

Implementation Details

To implement PL3, we use the same neural architecture as for PL, with the following modifications: (1) the expected input is in the logit space; (2) the bin boundaries are converted into the logit space; (3) the logistic function is removed from the parameters feeding the interpolation layer and is instead applied to the final output. Further details are given in Appendix C.
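Under these modifications, applying a fitted PL3 map reduces to logit transform, interpolate, logistic. A minimal sketch of the forward evaluation (our own function names; fitting is omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def pl3_forward(p_hat, boundaries_logit, values_logit):
    """PL3: interpolate linearly in logit-logit space between the fitted
    knots (boundaries_logit, values_logit), then map back to probabilities."""
    return sigmoid(np.interp(logit(p_hat), boundaries_logit, values_logit))
```

As a sanity check of the connection to temperature scaling: with a single piece whose two knots lie on the line of slope 1/t through the origin, `pl3_forward` reproduces \(\sigma(z/t)\) inside the knot range.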

6 Assessment of calibrators and evaluators

Before proceeding to the experiments, let us discuss the methods for assessing the quality of calibrators and evaluators of calibration.

6.1 Assessment of post-hoc calibrators

Post-hoc calibrators should be evaluated based on the effectiveness of calibration, but this effectiveness could be interpreted in two ways. Firstly, we could evaluate how well-calibrated the outputs of the calibrator are by calculating the calibration error after calibration (CEAC):

$$\begin{aligned} CEAC&=\textsf{CE}(\hat{c}\circ f)=\mathbb {E}[d(\hat{C},c^*_{\hat{c}\circ f}(\hat{C}))] \end{aligned}$$

Secondly, we could evaluate how well the calibrator approximates the true calibration map by calculating the calibration map estimation error (CMEE):

$$\begin{aligned} CMEE&=\mathbb {E}[d(\hat{c}(\hat{P}),c^*_f(\hat{P}))] \end{aligned}$$

where d is any Bregman divergence and \(\hat{C}=\hat{c}(\hat{P})=(\hat{c}\circ f)(X)=\hat{c}(f(X))\). We prove that low calibration map estimation error is a stronger requirement than low calibration error after calibration, because CMEE is an upper bound for CEAC:

Theorem 3

$$\begin{aligned} CMEE=CEAC+\mathbb {E}[d(c^*_{\hat{c}\circ f}(\hat{C}),c^*_f(\hat{P}))]. \end{aligned}$$

Intuitively, the difference between CMEE and CEAC is due to ties introduced by \(\hat{c}\) where \(\hat{c}(\hat{p})=\hat{c}(\hat{p}')\) while \(c^*_f(\hat{p})\ne c^*_f(\hat{p}')\) for some \(\hat{p}\ne \hat{p}'\). As low CMEE is a stronger requirement, we prefer this measure in the experiments. We use the notation \(d(\hat{c},c^*)\) as a more memorable synonym for CMEE.
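The decomposition of Theorem 3 can be checked on a tiny discrete example. Here we use the squared distance (a Bregman divergence) and a two-point score distribution in which \(\hat{c}\) introduces exactly such a tie; all numbers are our own illustration:

```python
import numpy as np

# Two equally likely scores p' = 0.3 and p'' = 0.7.
p = np.array([0.3, 0.7])
c_star_f = np.array([0.2, 0.8])   # true calibration map c*_f at the two scores
c_hat = np.array([0.5, 0.5])      # estimated map collapses both scores (a tie)

# True calibration map of the recalibrated model c_hat∘f: both instances now
# share the output 0.5, so its calibrated value is the pooled mean 0.5.
c_star_after = np.full(2, c_star_f.mean())

d = lambda a, b: (a - b) ** 2     # squared distance, a Bregman divergence
CMEE = d(c_hat, c_star_f).mean()        # calibration map estimation error
CEAC = d(c_hat, c_star_after).mean()    # calibration error after calibration
gap = d(c_star_after, c_star_f).mean()  # information lost to the ties
assert np.isclose(CMEE, CEAC + gap)     # Theorem 3 on this example
```

In this example CEAC is 0 (the tied output 0.5 is perfectly calibrated) while CMEE is 0.09, so the whole estimation error comes from the tie term, illustrating why low CMEE is the stronger requirement.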

6.2 Assessment of calibration evaluators

Calibration evaluators that follow the fit-on-test paradigm estimate \(\hat{c}\) on the test data. The resulting \(\hat{c}\) can be viewed as a reliability diagram, or used to estimate the calibration error with \(\textsf{ECE}^{(\alpha )}_{\text {fit-}(\mathcal {C},l)\text {-on-test}}\).

We see 3 main application scenarios:

  1. the reliability diagram is needed if the reliability of individual predictions has to be known separately; for example, this is needed if there is a downstream decision-making process performed by a human or AI;

  2. the estimated calibration error is needed if the general level of trust needs to be known, not specifically for each output;

  3. the ranking of the estimated calibration errors is needed when selecting the best calibrated model.

In the experiments, we evaluate each calibration evaluation method \(\mathcal {M}\) against these three objectives, measuring: (1) the quality of reliability diagrams with \(d(\hat{c}_\mathcal {M},c^*)\); (2) the quality of calibration error estimates with \(\vert ECE_\mathcal {M}-CE\vert\); and (3) the quality of the ranking using Spearman's correlation between the ranking produced by the calibration evaluator and the ranking by true calibration errors, which we refer to as \(rankcorr(ECE_\mathcal {M},CE)\). All these measures assume good estimates of the true calibration map, which can be obtained either on synthetic data or if there is access to orders of magnitude more data than the calibration evaluator uses.
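The three measurements can be sketched in plain numpy (helper names are ours; Spearman's correlation is implemented directly via ranks, assuming no ties):

```python
import numpy as np

def rankcorr(a, b):
    """Spearman rank correlation (no ties assumed), in plain numpy."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))

def assess_evaluator(c_hat_on_test, c_star_on_test, ece_by_model, ce_by_model):
    """The three objectives for one calibration evaluator M:
    (1) reliability diagram quality, (2) per-model |ECE - CE|,
    (3) rank correlation of estimated vs true calibration errors."""
    diagram_quality = np.mean(np.abs(c_hat_on_test - c_star_on_test))
    ce_estimate_quality = np.abs(ece_by_model - ce_by_model)
    ranking_quality = rankcorr(ece_by_model, ce_by_model)
    return diagram_quality, ce_estimate_quality, ranking_quality
```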

Note that in the experiments we mostly use the absolute difference \(d(\hat{c},c^*)=\vert \hat{c}-c^*\vert\) as the distance measure between the estimated and true calibration maps, following the tradition of how ECE is calculated from reliability diagrams. As the absolute difference is not a Bregman divergence, the Appendix also considers the squared difference \(d(\hat{c},c^*)=\vert \hat{c}-c^*\vert ^2\), which is a Bregman divergence. Importantly, a method \(\mathcal {M}\) that produces the best calibration map estimate with the lowest \(\vert \hat{c}_\mathcal {M}-c^*\vert\) might not be the same as the method \(\mathcal {M}'\) that produces the best calibration error estimate \(\vert ECE_{\mathcal {M}'}-CE\vert\). For example, this occurs when comparing the methods in Figs. 5a and b. In Fig. 5b, the piecewise linear method (PL) achieves \(\vert ECE_{PL}-CE\vert =0.0045\) and \(\vert \hat{c}_{PL}-c^*\vert =0.0204\), while in Fig. 5a, equal-width binning with a cross-validated number of bins (EWCV) achieves \(\vert ECE_{EWCV}-CE\vert =0.0024\) and \(\vert \hat{c}_{EWCV}-c^*\vert =0.0398\). Thus, even though PL has a better \(\vert \hat{c}-c^*\vert\) in this example, it still loses to EWCV with respect to \(\vert ECE-CE\vert\). Intuitively, EWCV has bigger errors in the reliability diagram, but during the calculation of ECE these errors happen to cancel out more than in the case of PL.

7 Experiments and results

7.1 Pseudo-real experiments and results

The goal of our experimental studies is to (1) evaluate our proposed PL and PL3 calibration map families as calibration methods; and (2) find the best calibration map family for calibration evaluation (made possible by the fit-on-test paradigm).

A key problem for research in calibration evaluation methods and post-hoc calibration methods is that proper evaluation requires access to the true calibration map, which, however, is unknown. One workaround would be to use synthetically created data, where the true calibration map is known; however, with synthetic data the shape of the true calibration map might not be realistic. Our solution is to use the pseudo-real dataset CIFAR-5m (Nakkiran et al., 2021) with 5 million synthetic images, created such that models trained on CIFAR-10 (Krizhevsky et al., 2009) have very similar performance on CIFAR-5m, and vice versa. Thus, it is likely that the true calibration maps are also very realistic. Thanks to the vast size of CIFAR-5m we can estimate the true calibration map very precisely. In our experiments, we used isotonic calibration on 1 million hold-out datapoints to estimate the true calibration map (options other than isotonic are considered in Appendix F, with minor differences in the results). The main part of the experiments concentrates on CIFAR-5m, while Appendix F shows more results on synthetic and real datasets, where the results are comparable to CIFAR-5m with minor differences, supporting the overall conclusions. On CIFAR-5m, we concentrated on three 1-vs-rest calibration tasks (car, cat, dog) and on confidence calibration. Only three 1-vs-rest tasks were used due to computational limitations.
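The ‘true’ calibration map is estimated here with isotonic calibration on a large hold-out set. As a rough illustration of the idea (not the authors' implementation), the pool-adjacent-violators algorithm behind isotonic regression can be sketched in plain numpy; the function name is ours:

```python
import numpy as np

def isotonic_fit(p_hat, y):
    """Pool-adjacent-violators: a non-decreasing step estimate of the
    true calibration map from hold-out pairs (p_hat_i, y_i)."""
    order = np.argsort(p_hat)
    vals = y[order].astype(float)
    # stack of (value, weight) blocks, merged while order is violated
    v, w = [], []
    for val in vals:
        v.append(val)
        w.append(1.0)
        while len(v) > 1 and v[-2] > v[-1]:
            merged_w = w[-2] + w[-1]
            merged_v = (v[-2] * w[-2] + v[-1] * w[-1]) / merged_w
            v = v[:-2] + [merged_v]
            w = w[:-2] + [merged_w]
    # expand blocks back to one fitted value per sorted instance
    fitted = np.repeat(v, np.array(w).astype(int))
    c_star = np.empty_like(fitted)
    c_star[order] = fitted
    return c_star
```

In practice a library implementation (e.g. isotonic regression in scikit-learn) would be used on the 1 million hold-out points; the sketch only shows why the resulting estimate is monotone.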

ResNet110 (He et al., 2015), WideNet32 (Zagoruyko & Komodakis, 2016) and DenseNet40 (Huang et al., 2016) models were trained on 45k datapoints from CIFAR-5m; an additional 5k datapoints were used to calibrate the outputs of the models. The multi-class calibration methods were: temperature scaling (TempS), vector scaling (VecS), matrix scaling with off-diagonal and intercept regularisation (MSODIR), Dirichlet calibration with ODIR regularisation (dirODIR), Spline with the natural method (Spline), and the order-invariant version of intra-order-preserving functions (IOP). The binary calibration methods were: Platt, isotonic, beta calibration (beta), scaling-binning (ScaleBin), ECE-based equal-size binning methods (ES, including \(ES_{sweep}\)), and the piecewise linear methods (PL and PL3). For PL we used our neural network model trained with cross-entropy loss (the results with optimising for the Brier score and with the alternative optimisation method based on differential evolution are included in Appendix F).

Table 2 Comparison of post-hoc calibrators with respect to \(\vert \hat{c}_\mathcal {M}-c^*\vert\) \((\times 10^{3})\) in estimating the true calibration map of DenseNet40, ResNet110 and WideNet32 classifiers on CIFAR-5m

Results

We show the results as absolute differences (i.e. \(\alpha =1\)), as in most earlier works (Appendix F also reports the quadratic case, i.e. \(\alpha =2\)). Table 2 assesses PL and PL3 as post-hoc calibration methods. The calibration map family corresponding to ECE fit with the sweeping method (Roelofs et al., 2020) is also included as \(ES_{sweep}\). The calibration methods are evaluated by how well they approximate the true calibration map: \(\vert \hat{c}_\mathcal {M}-c^*\vert =\frac{1}{n}\sum _{i=1}^n\vert \hat{c}_\mathcal {M}(\hat{p}_i)-c^*(\hat{p}_i)\vert\) is measured on 1 million unseen data points against the ground truth, where \(\hat{p}_i\) are the outputs of the classifier, \(\hat{c}_\mathcal {M}(\hat{p}_i)\) are the post-hoc calibrated predictions, and \(c^*(\hat{p}_i)\) is the ‘true’ calibration map.

Table 3 Comparison of calibration evaluators in estimating the reliability diagram with \(\vert \hat{c}_\mathcal {M}-c^*\vert\) \((\times 10^{3})\) on CIFAR-5m. In contrast to Table 2, the evaluators are compared on predictions that have been previously post-hoc calibrated. The best scores are given in bold

PL3 is the best method in all cases, showing the usefulness of the logit-logit space when calibrating neural models which are far from being calibrated. Note that the errors of PL3 are in all cases more than 20% smaller than the errors of the second-best method, Platt. Appendix F includes more variations of these methods, the KDE method, and results for each architecture separately (with minor differences).

Next we compare the calibration evaluators against each of the 3 objectives listed in Sect. 6.2. We perform the comparison in tasks where the models are already quite close to being calibrated, because this is typical when evaluators are used to find out which post-hoc calibrator performs best. We compare evaluators in the task of evaluating 6 post-hoc calibrators: we measure how precisely the evaluators \(\mathcal {M}\) estimate the reliability diagrams (Table 3, showing \(\vert \hat{c}_\mathcal {M}-c^*\vert\)), the total true calibration errors (Table 4, showing \(\vert ECE_\mathcal {M}-CE\vert\)), and how well the estimated ranking of the 6 calibrators agrees with the true ranking based on true calibration errors (Table 4, showing \(rankcorr(ECE_\mathcal {M},CE)\)). The 6 calibrators were chosen as the best methods from Table 2: beta, vector scaling, Platt, PL3, scaling-binning, and isotonic. The evaluators are compared on test sets of sizes 1k, 3k and 10k, with 5 different random seeds for each size.

Table 4 Comparison of calibration evaluators in estimating the true calibration error with \(\vert ECE_\mathcal {M}-CE\vert\) \((\times 10^{3})\) and the ranking of 6 calibrators with \(rankcorr(ECE_\mathcal {M},CE)\) on CIFAR-5m. The best scores are given in bold

The first objective is to assess the reliability diagrams using \(\vert \hat{c}_\mathcal {M}-c^*\vert =\frac{1}{n}\sum _{i=1}^n\vert \hat{c}_\mathcal {M}(\hat{p}_i)-c^*(\hat{p}_i)\vert\) measured on 1 million unseen data points, where \(\hat{p}_i\) are now the already post-hoc calibrated outputs of the classifier (calibrated with the 6 best methods from Table 2, trained on 5k data points), \(\hat{c}_\mathcal {M}(\hat{p}_i)\) are the results of fit-on-test calibration applied on top of the post-hoc calibrated predictions (trained on a separate data set of size 1k, 3k, or 10k), and \(c^*(\hat{p}_i)\) is the ‘true’ calibration map of the post-hoc calibrated predictions.

The results in Table 3 show that PL is the best on average, as well as after disaggregating by the classifier's architecture, the test set size, or the task. PL3 is mostly second, and the evaluator using the beta calibration map family is mostly third. The order-invariant version of IOP also shows promising results. The performance of beta calibration varies with the size of the test dataset. This is expected, because this method has only 3 parameters, which is good for small test sets but worse for bigger test sets. Similarly, the data set size affects IOP, again because of its small number of parameters. The methods based on equal-size binning perform worse, with \(ES_{CV}\) ranking the highest on average, closely followed by \(ES_{sweep}\), and the classical \(ES_{15}\) with 15 bins lagging behind. Further, the Spline method is among the worst-performing methods. From Tables 2 and 3 we can conclude that PL3 is best for approximating the true calibration map when predictions are far from being calibrated, and PL is best when predictions are nearly calibrated. Appendix F discusses the results with different aggregations and reports the optimal numbers of bins for the ES and PL methods.

The second objective is to estimate the numeric value of the total true calibration error, and here the rows \(\vert ECE_\mathcal {M}-CE\vert\) of Table 4 show the benefits of \(ES_{15}\), beta calibration and PL3 (with some differences across tasks). This demonstrates that while the tilted-top reliability diagrams of \(ES_{15}\) are not precise, their debiased average distance from the diagonal closely agrees with the average distance of the true reliabilities (the true calibration map) from the diagonal. While PL and PL3 perform reasonably well, there is great potential for further improvement, because debiasing remains future work for these methods. The ranking of the 6 calibrators is best done by the isotonic fit-on-test evaluator, achieving over 55% correlation with the true ranking for the one-vs-rest tasks and over 65% for the confidence task.

8 Discussion

As explained in our work, evaluating the calibration error with the fit-on-test approach is done by estimating the calibration map, i.e. fitting a calibration map family on the test data, and then applying the plug-in estimator of the calibration error. The most popular method for calibration error evaluation is the binning-based ECE, which we proved to also be a fit-on-test method. The name fit-on-test might sound controversial, but it has been chosen deliberately to highlight the essence of the issues in current methods for assessing calibration. The main concern with fitting on the test data is getting a good fit, as with any other fitting task. Thus, one needs a suitable family of functions and a suitable fitting procedure, so that the data is neither overfitted nor underfitted. A poor fit leads to poor performance of the method and unreliable results. However, since the true calibration map and the true calibration error are not available, in practice we do not even know whether there is overfitting or underfitting during fit-on-test evaluation. One way to tackle this problem is to use synthetic or pseudo-real data that look very similar to real data. This way, we obtain an estimate of how well each fit-on-test method could work on real data. We experimentally tested some deep neural network architectures on different datasets and saw some performance fluctuations between methods across different subtasks. However, there are bigger differences across different evaluation tasks. Even though the piecewise linear method performs very well in calibration map estimation, it is still not better than \(ES_{15}\) at estimating the calibration error or the ranking. This might be because of the bias introduced, and thus there might be a need for debiasing. Note that we did not dive into the problems of data shift and out-of-distribution prediction, which certainly also affect model calibration in practice.

9 Conclusion and future work

We suggest viewing the evaluation of calibration through the fit-on-test paradigm, promoting the use of post-hoc calibration methods for calibration evaluation. This view enables reliability diagrams that are closer to the true calibration maps, more exact estimates of the total calibration error, and rankings of calibrators that better correspond to their true quality. Following fit-on-test, we have proposed cross-validation to tune the number of bins in ECE, and demonstrated the benefits of piecewise linearity in the original as well as in the logit-logit space, inspired by temperature scaling and beta calibration. Having said that, the limitation of this approach is, as the name states, fitting on the test set. Thus, it is essential to be careful not to overfit the evaluation method and to check which methods would work best on the given data.

Future work includes the development of debiasing methods for \(ECE_{PL}\) and \(ECE_{PL3}\), and further analysis of the benefits of different calibration map families in different scenarios, including under dataset shift.