Abstract
Calibrated uncertainty estimates are essential for classifiers used in safety-critical applications. If a classifier is uncalibrated, then there is a unique way to calibrate its uncertainty using the idealistic true calibration map corresponding to this classifier. Although the true calibration map is typically unknown in practice, it can be estimated with many post-hoc calibration methods which fit some family of potential calibration functions on a validation dataset. This paper examines the connection between such post-hoc calibration methods and calibration evaluation. Despite the negative connotations of fitting on test data in machine learning, we claim that fitting calibration maps on test data as part of the calibration evaluation process is a method worth considering, and we refer to this view as fit-on-test. This view enables the usage of any post-hoc calibration method as an evaluation measure, unlocking missed opportunities in development of evaluation methods. We prove that even ECE, which is the most common calibration evaluation method, is actually a fit-on-test measure. This observation leads us to a new method of tuning the number of bins in ECE with cross-validation. Fitting on test data can lead to test-time overfitting, and therefore, we discuss the limitations and concerns with the fit-on-test view. Our contributions also include: (1) enhancement of reliability diagrams with diagonal filling; (2) development of new calibration map families PL and PL3; and (3) an experimental study of which families perform strongly both as post-hoc calibrators and calibration evaluators.
1 Introduction
When classifiers are incorporated into safety-critical applications, it is essential that the predictions of these classifiers involve reliable uncertainty estimates. If the predictions are over-confident, then this can cause costly errors, such as an autonomous vehicle getting into an accident. If the predictions are under-confident, then this can result in a failure of the system to fulfill its task, e.g. an autonomous car moving too slowly to mitigate the over-estimated risks. Therefore, classifiers are expected to report calibrated uncertainty in the form of class probability estimates. A probabilistic classifier is considered calibrated if, within groups of similar predictions, the average prediction is in agreement with the actual class proportions. For example, in binary classification this implies that if the classifier predicts 80% probability to be positive for each of a set of 100 instances, then 80 of these instances are expected to be truly positive. Most learning algorithms result in classifiers that are not well-calibrated and need dedicated post-hoc calibration methods to be applied (Niculescu-Mizil & Caruana, 2005; Guo et al., 2017).
Progress in developing methods to get calibrated classifiers can only be made if we have reliable methods for evaluating calibration. In binary classification, the most common way of estimating a classifier’s calibration is through reliability diagrams and ECE (estimated calibration error, also known as the expected calibration error) (Murphy & Winkler, 1977; Broecker, 2011; Naeini et al., 2015). In reliability diagrams, many instances with similar predicted probabilities are binned together to get an estimate of the true calibrated probability in each bin by averaging the corresponding class labels. However, there is no consensus on how to place the bins in reliability diagrams or how many bins there should be (Roelofs et al., 2020). Usually 10, 15, or 20 bins are used (Naeini et al., 2015; Guo et al., 2017). Bins are placed either with equal width, so that each bin covers an equal region of the probability space, or with equal size, so that each bin contains an equal number of predictions. The choice of binning can drastically impact the shape of the reliability diagram and alter the estimated calibration error (Roelofs et al., 2020; Kumar et al., 2019; Nixon et al., 2019). Failure to measure calibration reliably leads to problems deciding which classifier is better calibrated or which method of post-hoc calibration is better. This in turn harms the performance of safety-critical systems.
Even though multiple works have been published on methods of evaluating calibration, e.g. Widmann et al. (2019) and Zhang et al. (2020), these have not exploited the more direct link between post-hoc calibration and evaluation, which we refer to as the fit-on-test paradigm. According to this paradigm, any post-hoc calibration method can be repurposed for calibration evaluation by applying it on the test data (not on the validation data as in post-hoc calibration) and then using it as a plug-in estimator of calibration error.
The contributions of this paper are the following:

- We introduce the fit-on-test paradigm of evaluating calibration, showing that any post-hoc calibration method can also be used for evaluating calibration (Sect. 4.2);
- We prove that the classical binning-based ECE measure follows from the fit-on-test paradigm using a particular calibration map family (Sect. 4.4);
- Exploiting this fact, we show how cross-validation can be used for optimising the number of bins in ECE (Sect. 4.6);
- We demonstrate shortcomings in the common visualisations of reliability diagrams and propose reliability diagrams with diagonal filling (Sect. 4.5);
- Using the fit-on-test paradigm, we develop new methods PL and PL3 of evaluating calibration using continuous piecewise linear functions (Sect. 5);
- We clarify the methodology of assessing calibrators and calibration evaluators (Sect. 6) and introduce the usage of pseudo-real data for this purpose (Sect. 7);
- We perform experimental comparisons to find out which families of calibration maps result in better post-hoc calibration, better reliability diagrams, and better approximations of calibration errors (Sect. 7);
- We discuss the limitations of the fit-on-test paradigm (Sect. 8).
2 Related work
In this section, an overview of different approaches to calibration error evaluation is given. Evaluation of calibration has been the main focus of several works after the introduction of reliability diagrams and ECE (Murphy & Winkler, 1977; Broecker, 2011; Naeini et al., 2015). To start with, Vaicenavicius et al. (2019) proposed a more general definition of calibration and a method to perform statistical calibration tests based on binning. Widmann et al. (2019) proposed the kernel calibration error for calibration evaluation in multi-class classification. Roelofs et al. (2020) proposed a method which chooses the maximal number of bins such that it leads to a monotonically increasing reliability diagram. Popordanoska et al. (2022) proposed estimating multi-class calibration error using kernel density estimation with Dirichlet kernels. Popordanoska et al. (2023) proposed Kullback-Leibler calibration error, allowing one to estimate all proper calibration errors and refinement terms.
The research on evaluating calibration has gone hand-in-hand with the research on post-hoc calibration, which aims to learn a calibration map transforming the classifier’s output probabilities into calibrated probabilities. Many papers have contributed to both post-hoc calibration and evaluation. Naeini et al. (2015) proposed BBQ and used ECE to evaluate calibration. Guo et al. (2017) proposed temperature, vector and matrix scaling and used reliability diagrams and ECE for evaluating confidence calibration in multi-class classification. Confidence stands for the predicted probability of the most likely class, leaving out all the other probabilities and class-wise relations. This is also referred to as top-label calibration error by Kumar et al. (2019). Kull et al. (2019) proposed Dirichlet calibration and the notion of classwise-calibration error, which measures the calibration error for each class separately. Kumar et al. (2019) proposed scaling-binning calibration, a new debiasing method for ECE and the notion of marginal calibration error. Marginal calibration error is similar to classwise-calibration error defined concurrently with Kull et al. (2019), adding a possibility to control how much each class counts towards the error. Zhang et al. (2020) proposed generic Mix-n-Match calibration strategies and used kernel density estimation (KDE) for estimating calibration error. Gupta et al. (2021) proposed a calibration evaluation metric based on the Kolmogorov-Smirnov test and a calibration method based on fitting splines. Xiong et al. (2023) proposed proximity calibration (procal) for confidence calibration and proximity-informed ECE (PIECE). PIECE divides instances into proximity groups based on their distance in the representation space and measures the calibration error of each proximity group separately.
3 Notation and background
3.1 True calibration error
We present the methods for binary classification, but Sect. 3.3 shows applicability to multi-class classification as well. Consider a binary classifier \(f:\mathcal {X}\rightarrow [0,1]\) predicting the probabilities of instances to be positive. Let \(X\in \mathcal {X}\) be a randomly drawn instance, \(Y\in \{0,1\}\) its true class, and let us denote the model’s predictions with \(\hat{P}=f(X)\). Every classifier f has a corresponding true calibration map, which could be used to perfectly calibrate the model: \(c^*_f(\hat{p})=\mathbb {E}[Y\mid \hat{P}=\hat{p}]\) (also known as the canonical calibration function (Vaicenavicius et al., 2019)). For evaluation of calibration, consider a test dataset with instances \(x_1,\dots ,x_n\in \mathcal {X}\) and true labels \(y_1,\dots ,y_n\in \{0,1\}\), and denote the predictions by \(\hat{p}_i=f(x_i)\). The true calibration error (CE) is the model’s average violation of calibration; it could be defined on the overall test distribution as \(\mathbb {E}[\vert c^*_f(\hat{P})-\hat{P}\vert ^\alpha ]\) (Kumar et al., 2019) but we define it for the test dataset:

$$\begin{aligned} \textsf{CE}^{(\alpha )}=\frac{1}{n}\sum _{i=1}^{n}\bigl \vert c^*_f(\hat{p}_i)-\hat{p}_i\bigr \vert ^\alpha , \end{aligned}$$
(1)
where \(\alpha =1\) corresponds to absolute error (MAE) and \(\alpha =2\) to squared error (MSE). Figure 1a shows an example of a true calibration map, where each red line shows the violation of calibration corresponding to a particular data point, and the average length of the red lines equals CE.
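As a concrete illustration of Eq.(1): on synthetic data where the true calibration map is known by construction, the true calibration error can be computed directly. The sketch below is illustrative only; the map `c_star` is a made-up example of an over-confident classifier, not the one used in Fig. 1.

```python
import numpy as np

def true_calibration_error(p_hat, c_star, alpha=1):
    """Eq. (1): average violation |c*_f(p_i) - p_i|^alpha over the test set."""
    p_hat = np.asarray(p_hat, dtype=float)
    return float(np.mean(np.abs(c_star(p_hat) - p_hat) ** alpha))

# Made-up true calibration map of an over-confident classifier:
# predictions are pulled back towards 0.5.
c_star = lambda p: 0.5 + 0.5 * (2 * p - 1) ** 3

p_hat = np.array([0.1, 0.3, 0.7, 0.9])
mae = true_calibration_error(p_hat, c_star, alpha=1)  # CE^(1) = 0.156
mse = true_calibration_error(p_hat, c_star, alpha=2)  # CE^(2) = 0.02448
```

In practice \(c^*_f\) is unknown, which is exactly why the estimation methods discussed next are needed.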
a True calibration map (orange line) versus the predicted probabilities (dashed line). Connecting lines show instance-wise miscalibration. b Reliability diagram consists of bars (blue) with the height of average label. The red lines show the error between the mean labels and predicted probabilities in each bin. The diagrams are made with synthetic data (3000 data points, stratified sample of 50 data points from the bins shown for instance-wise errors, see Appendix D.3 for more details) (Color figure online)
3.2 Reliability diagrams and ECE
There are multiple ways to estimate calibration error. One of the most popular ways is using reliability diagrams (Murphy & Winkler, 1977). The reliability diagram is a bar plot, where each bar contains a certain region of probabilities (a bin) and the bar height corresponds to the average label (\(\bar{y}_k\)) in the k-th bin (Fig. 1b). Each red line in Fig. 1b shows the difference between the average label \(\bar{y}_k\) and the average prediction \(\bar{p}_k\) in the k-th bin. The vector \(\textbf{B}=(B_1,\dots ,B_{b+1})\) provides the bin boundaries \(0=B_1<B_2<\ldots<B_b<B_{b+1}=1+\epsilon\), resulting in bins \([B_1,B_2),\dots ,[B_b,B_{b+1})\), where \(\epsilon\) is an infinitesimal to ensure that \(1\in [B_b,B_{b+1})\). Thus, \(\bar{y}_k=\frac{1}{n_k}\sum _{i:\hat{p}_i\in [B_k,B_{k+1})} y_i\) and \(\bar{p}_k=\frac{1}{n_k}\sum _{i:\hat{p}_i\in [B_k,B_{k+1})} \hat{p}_i\) where \(n_k=\vert \{i: \hat{p}_i\in [B_k,B_{k+1})\}\vert\) is the size of bin k. The bins can be either equal size (each bin has the same number of instances), or equal width (each bin covers an equal region of the probability space).
Based on the reliability diagrams (Fig. 1b), the estimated calibration error (ECE) (Naeini et al., 2015) is a weighted average of the differences between the mean label and the mean prediction in each bin:

$$\begin{aligned} \textsf{ECE}^{(\alpha )}_{\textbf{B}}=\sum _{k=1}^{b}\frac{n_k}{n}\,\vert \bar{y}_k-\bar{p}_k\vert ^\alpha . \end{aligned}$$
(2)
The binning-based ECE is known to be biased (Broecker, 2011; Ferro & Fricker, 2012) with \(\mathbb {E}[\textsf{ECE}^{(\alpha )}_\textbf{B}]\ne \mathbb {E}[\textsf{CE}^{(\alpha )}]\), hence in our experiments we use debiasing as proposed by Kumar et al. (2019).
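For reference, a minimal sketch of binned ECE as in Eq.(2), supporting both equal-width and equal-size binning (the debiasing of Kumar et al. (2019) mentioned above is deliberately left out):

```python
import numpy as np

def ece(p_hat, y, b=10, alpha=1, scheme="width"):
    """Binned ECE of Eq. (2): sum_k (n_k / n) * |ybar_k - pbar_k|^alpha.

    scheme="width": b equal-width bins over [0, 1];
    scheme="size":  b equal-size bins with boundaries at prediction quantiles.
    """
    p_hat, y = np.asarray(p_hat, float), np.asarray(y, float)
    if scheme == "width":
        edges = np.linspace(0.0, 1.0, b + 1)
    else:
        edges = np.quantile(p_hat, np.linspace(0.0, 1.0, b + 1))
    # Right-open bins [B_k, B_{k+1}); clip so that p = 1 lands in the last bin.
    idx = np.clip(np.digitize(p_hat, edges[1:-1]), 0, b - 1)
    total = 0.0
    for k in range(b):
        mask = idx == k
        if mask.any():  # mask.mean() equals the bin weight n_k / n
            total += mask.mean() * abs(y[mask].mean() - p_hat[mask].mean()) ** alpha
    return total

# Toy test set with two bins split at 0.5:
p = [0.1, 0.3, 0.65, 0.75, 0.85, 0.95]
y = [0, 1, 0, 1, 1, 1]
print(round(ece(p, y, b=2), 3))  # 0.133
```

The same data is used in the worked example of Sect. 4.4.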
3.3 Calibration evaluation for multi-class classification
In contrast to binary classification, there are multiple different definitions of calibration for multi-class tasks:

- a binary classifier is calibrated if all predicted probabilities to be positive are calibrated: \(\Pr [Y=1\vert f(X)=\hat{p}]=\hat{p}\) for all \(\hat{p}\in [0,1]\);
- a multi-class classifier is class-k-calibrated if all the predicted probabilities of class k are calibrated: \(\Pr [Y=k\vert f_k(X)=\hat{p}]=\hat{p}\) for all \(\hat{p}\in [0,1]\) (Kull et al., 2019; Kumar et al., 2019; Nixon et al., 2019);
- a multi-class classifier is confidence-calibrated if \(\Pr [Y=\mathop {\mathrm {arg\,max}}\limits f(X)\vert \max f(X)=\hat{p}]=\hat{p}\) for all \(\hat{p}\in [0,1]\) (Kull et al., 2019; Guo et al., 2017).
However, in all of the above scenarios, we need to evaluate whether the predicted and actual probabilities of an event are equal among all instances with shared predictions. By redefining \(Y=1\) and \(Y=0\) to denote whether or not the event happened and \(\hat{P}=f(X)\) to denote the estimated probability of that event, we have essentially reduced all three evaluation tasks to the first task of evaluating calibration in binary classification. The shared definition of calibration then becomes \(\Pr [Y=1\vert f(X)=\hat{p}]=\hat{p}\), or equivalently, \(\mathbb {E}[Y\vert f(X)=\hat{p}]=\hat{p}\). This also explains why ECE has been applied in all three scenarios.
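The reduction described above can be made concrete in a few lines. The sketch below (with hypothetical helper names) maps a multi-class prediction matrix to the binary setting, for both confidence calibration and class-k-calibration:

```python
import numpy as np

def to_binary_confidence(probs, labels):
    """Confidence calibration as binary calibration: the 'event' is that the
    predicted class is correct, and p_hat is the confidence max f(X)."""
    probs = np.asarray(probs, dtype=float)
    p_hat = probs.max(axis=1)
    y = (probs.argmax(axis=1) == np.asarray(labels)).astype(int)
    return p_hat, y

def to_binary_classwise(probs, labels, k):
    """Class-k-calibration as binary calibration: the 'event' is Y = k,
    and p_hat is the predicted probability f_k(X)."""
    probs = np.asarray(probs, dtype=float)
    y = (np.asarray(labels) == k).astype(int)
    return probs[:, k], y

probs = [[0.7, 0.2, 0.1],
         [0.1, 0.6, 0.3]]
labels = [0, 2]
print(to_binary_confidence(probs, labels))  # p_hat = [0.7, 0.6], y = [1, 0]
```

Any binary calibration evaluation method, ECE included, can then be applied to the resulting pairs \((\hat{p}_i, y_i)\).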
3.4 Post-hoc calibration
Post-hoc calibration is the task where the goal is to use a validation set to obtain an estimate \(\hat{c}\) of the true calibration map \(c^*_f\) for a given uncalibrated classifier f. Post-hoc calibration methods view the task essentially as binary regression: given the predictions \(\hat{p}_1,\dots ,\hat{p}_n\in [0,1]\) and the corresponding true binary labels \(y_1,\dots ,y_n\in \{0,1\}\), find a ‘regression’ model \(\hat{c}:[0,1]\rightarrow [0,1]\) that best predicts the labels from the predictions, evaluated typically by cross-entropy or mean squared error, which in this context are respectively known as the log-loss and the Brier score, two members of the family of strictly proper losses (Brier, 1950). Why are proper losses a good way of evaluating progress towards estimating the true calibration map? A common justification is that these losses have the virtue that they are minimised by the perfectly calibrated model \(c^*_f\) (Kumar et al., 2019), that is: \(\mathop {\mathrm {arg\,min}}\limits _{\hat{c}(\hat{p})}\mathbb {E}[l(\hat{c}(\hat{p}),Y)\vert \hat{P}=\hat{p}]=c^*_f(\hat{p})\) for any \(\hat{p}\in [0,1]\) and any strictly proper loss l. However, this justification refers to the optimum only. Our following Theorem 1 makes an even stronger claim: a reduction of the expected loss l leads to a same-sized improvement in how well \(\hat{c}(\hat{p})\) approximates \(c^*_f(\hat{p})\), measured by any Bregman divergence \(d:[0,1]\times [0,1]\rightarrow \mathbb {R}\) (here d quantifies the dissimilarity between two binary categorical probability distributions, and it is a strictly proper loss when the label is its second argument; see details and proofs of the theorems in Appendix B):
Theorem 1
Let \(d:[0,1]\times [0,1]\rightarrow \mathbb {R}\) be any Bregman divergence and \(\hat{c}_1,\hat{c}_2:[0,1]\rightarrow [0,1]\) be two estimated calibration maps. Then for any \(\hat{p}\in [0,1]\):

$$\begin{aligned} \mathbb {E}[d(\hat{c}_1(\hat{P}),Y)\mid \hat{P}=\hat{p}]-\mathbb {E}[d(\hat{c}_2(\hat{P}),Y)\mid \hat{P}=\hat{p}] = d(\hat{c}_1(\hat{p}),c^*_f(\hat{p}))-d(\hat{c}_2(\hat{p}),c^*_f(\hat{p})). \end{aligned}$$
The above theorem involves expectations conditioned on \(\hat{p}\) which are typically impossible to estimate for any particular \(\hat{p}\) in isolation, because there is just one or very few instances with exactly the same predicted probability \(\hat{p}\). Therefore, most post-hoc calibration methods minimize the empirical loss \(\sum _{i=1}^{n}d(\hat{c}(\hat{p}_i),y_i)\) for \(\hat{c}\) in some sub-family \(\mathcal {C}\) within all possible calibration maps, using inductive biases such as assuming \(c^*_f\) is monotonic (isotonic calibration (Zadrozny & Elkan, 2002)), or belongs to some parametric family, e.g. logistic functions (Platt scaling (Platt, 2000)).
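As an illustration of fitting within a parametric sub-family \(\mathcal {C}\), the following is a simplified sketch of Platt scaling: a logistic model \(\hat{c}(\hat{p})=\sigma (a\cdot \textrm{logit}(\hat{p})+b)\) fitted by minimising log-loss. The near-unregularised `C=1e6` setting is an assumption for this illustration, and the label smoothing of the original method is omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(p_val, y_val, eps=1e-12):
    """Fit c_hat(p) = sigmoid(a * logit(p) + b) on validation data by
    minimising log-loss (cross-entropy). Returns the calibration map c_hat."""
    p_val = np.clip(np.asarray(p_val, float), eps, 1 - eps)
    z = np.log(p_val / (1 - p_val)).reshape(-1, 1)  # log-odds of predictions
    lr = LogisticRegression(C=1e6).fit(z, np.asarray(y_val))  # ~unregularised

    def c_hat(p):
        p = np.clip(np.asarray(p, float), eps, 1 - eps)
        z = np.log(p / (1 - p)).reshape(-1, 1)
        return lr.predict_proba(z)[:, 1]

    return c_hat
```

Fitted on the validation predictions of an over-confident classifier, the resulting map pulls predictions towards 0.5; for an under-confident classifier it pushes them outwards.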
4 The fit-on-test paradigm
4.1 Evaluation of calibration always involves estimation
The goal of evaluating calibration is to measure how far a classifier f is from being perfectly calibrated, based on a given test set. Ideally, we would like to know for each test instance how far the prediction \(\hat{p}_i\) is from the corresponding perfectly calibrated probability \(c^*_f(\hat{p}_i)\). The fundamental problem is that we can never directly observe \(c^*_f(\hat{p}_i)\), even on the test data. Therefore, evaluation of calibration always involves some form of estimation, and one cannot measure the true calibration error precisely.
The standard ECE measure gets around this problem by introducing bins. The idea is that if there are sufficiently many instances in the bin \([B_k,B_{k+1})\), and the bin is narrow enough so that the corresponding perfectly calibrated probabilities \(c^*_f(\hat{p}_i)=\mathbb {E}[Y\mid \hat{P}=\hat{p}_i]\) do not vary much within the bin, then one can estimate the calibration error in the bin as follows:

$$\begin{aligned} \frac{1}{n_k}\sum _{i:\hat{p}_i\in [B_k,B_{k+1})}\bigl \vert c^*_f(\hat{p}_i)-\hat{p}_i\bigr \vert ^\alpha \approx \vert \bar{y}_k-\bar{p}_k\vert ^\alpha , \end{aligned}$$
that is by the difference between the proportion of positives and the average prediction within the bin. If the bins are too narrow, then there are not sufficiently many instances in them, resulting in high variance of calibration error estimation. If the bins are too wide, then the corresponding perfectly calibrated probabilities \(c^*_f(\hat{p}_i)\) vary too much inside the bin, resulting in potential bias in the estimation.
4.2 Fit-on-test estimation of calibration error
Seeing the challenges of choosing a good binning for ECE and the existing attempts of improving over ECE (Roelofs et al., 2020), we looked for alternatives in estimating the calibration error. A classical and intuitive estimation method is plug-in estimation, where the estimate is calculated using the same formula as the population statistic it is estimating. We propose to use this for the true calibration error defined earlier as Eq.(1), getting the plug-in estimator:

$$\begin{aligned} \widehat{\textsf{CE}}{}^{(\alpha )}=\frac{1}{n}\sum _{i=1}^{n}\vert \hat{c}(\hat{p}_i)-\hat{p}_i\vert ^\alpha , \end{aligned}$$
(3)
where \(\hat{c}(\cdot )\) is some estimator of the function \(c^*_f(\cdot )\). The intuition is that in order to estimate the true calibration error we would first estimate the true calibration map. After that we can use the average discrepancy between the predictions and the corresponding estimated calibration map values as our estimate of the true calibration error.
Our task now is to find a way to estimate the true calibration map \(c^*_f(\cdot )\). Importantly, this estimation must be performed using only the given test set, because the goal of evaluating calibration is to do so based on that test set.
Here we can turn to the existing literature on post-hoc calibration. Indeed, the goal of post-hoc calibration is also to estimate the true calibration map, except that the estimation is performed there on the validation set. All we need to do is to take a post-hoc calibration method and apply it instead on test data. By this we can get an estimated calibration map \(\hat{c}(\cdot )\) which can be used within Eq.(3) to approximate calibration error. As estimating a calibration map is essentially fitting a function, we refer to such plug-in estimation as the fit-on-test estimation of calibration error. To summarise, any post-hoc calibration method can be applied on the test data and used within the plug-in estimator to turn it into a fit-on-test estimator of calibration error. Note that this does not mean that all methods would be equally useful as plug-in estimators; more discussion about the limitations of a fit-on-test estimator is in Sects. 4.3 and 8.
However, nothing prevents us from going beyond the set of existing post-hoc calibration methods. Whenever we have some family \(\mathcal {C}\) of potential calibration map functions and some strictly proper loss l, we can define the corresponding fit-on-test calibration evaluation measure by first performing fitting on the test data:

$$\begin{aligned} \hat{c}=\mathop {\mathrm {arg\,min}}\limits _{c\in \mathcal {C}}\sum _{i=1}^{n}l(c(\hat{p}_i),y_i), \end{aligned}$$
and then using it within the plug-in estimator:

$$\begin{aligned} \widehat{\textsf{CE}}{}^{(\alpha )}_{\mathcal {C},l}=\frac{1}{n}\sum _{i=1}^{n}\vert \hat{c}(\hat{p}_i)-\hat{p}_i\vert ^\alpha . \end{aligned}$$
Fit-on-test estimation of calibration error has been visualised in Fig. 2 (illustration only, not on real data).
Fit-on-test estimation of calibration error: (1) a calibration map \(\hat{c}\) is obtained fitting a family of calibration maps \(\mathcal {C}\) by minimising the loss l on the test data; (2) instance-wise calibration errors are estimated as distances of predictions from calibrated predictions; (3) overall calibration error is estimated as the average of instance-wise errors. The plots are illustrative, not based on real data
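To make the procedure of Fig. 2 concrete, the following sketch uses isotonic regression (i.e. the family of monotonic maps, fitted with the Brier score) as the calibration map family fitted on the test set; any other post-hoc calibration method could be substituted for the isotonic fit.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_on_test_ce(p_test, y_test, alpha=1):
    """Fit-on-test estimate of calibration error with isotonic regression:
    (1) fit a monotonic calibration map on the *test* set (least-squares,
    i.e. Brier score); (2) average |c_hat(p_i) - p_i|^alpha as in Eq. (3)."""
    p_test = np.asarray(p_test, float)
    y_test = np.asarray(y_test, float)
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    c_hat = iso.fit(p_test, y_test).predict(p_test)         # step (1): fit on test
    return float(np.mean(np.abs(c_hat - p_test) ** alpha))  # step (2): plug-in

p = [0.1, 0.3, 0.65, 0.75, 0.85, 0.95]
y = [0, 1, 0, 1, 1, 1]
print(round(fit_on_test_ce(p, y), 3))  # 0.15
```

Note that the unconstrained isotonic fit used here is prone to the test-time overfitting concern discussed next in Sect. 4.3.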
4.3 Discussion
As the idea of using plug-in estimation is almost trivial, one might wonder why it has not been introduced before. We guess this is partly due to the following potential concerns:
1. Due to inevitable overfitting (or the generalisation gap) in any fitting process, we are bound to get our estimated \(\hat{c}\) closer to the observed labels than \(c^*_f\) is. This bias can harm our capability of estimating the true calibration error \(\vert c^*_f(\hat{p})-\hat{p}\vert\);
2. By choosing a particular family \(\mathcal {C}\) of functions to be used during the fitting process, we would potentially misjudge the calibration error in the cases where \(c^*_f\) is not in this family.
It seems impossible to fully solve both problems at the same time: a more restrictive set of functions helps against overfitting and alleviates the first problem, but increases the second problem; a bigger set of functions helps against the second problem, but increases overfitting. However, our experiments demonstrate that good tradeoffs are possible, using flexible families but still with relatively few parameters.
The classical binning-based ECE might seem to sidestep this problem: instead of estimating \(c^*_f\) at all given points, it compares the bin averages \(\bar{p}\) and \(\bar{y}\). Perhaps surprisingly though, it can be proved (see the next subsection) that the binning-based ECE can also be seen as a fit-on-test estimator of calibration error for a particular family of functions \(\mathcal {C}\) with the Brier score (MSE) as the loss function. Therefore, the above concerns are valid for the standard binning-based ECE as well.
4.4 Classical binned ECE is also a fit-on-test estimator
Next we prove that the standard binning-based ECE measure as defined by Eq.(2) is a fit-on-test estimator with a certain calibration map family \(\mathcal {C}\) that we will present in a moment. Before this, we first introduce a bigger family \(\mathcal {C}_{\mathcal {B},\mathcal {H},\mathcal {A}}^{(b)}=\{c_{(\textbf{B},\textbf{H},\textbf{A})}\}\) of piecewise linear functions with b pieces (or bins), parametrised by the following 3 vectors:
- \(\textbf{B}\in [0,1]^{b+1}\) - the boundaries of the b pieces (or bins) with the constraint \(0=B_1<B_2<\ldots<B_b<B_{b+1}=1+\epsilon\);
- \(\textbf{H}\in \mathbb {R}^{b}\) - values of the function at the boundaries \(B_1,\dots ,B_b\);
- \(\textbf{A}\in \mathbb {R}^{b}\) - slopes of the linear functions within the bins.
The values of these functions can be calculated as follows:

$$\begin{aligned} c_{(\textbf{B},\textbf{H},\textbf{A})}(\hat{p})=\sum _{k=1}^{b}I[B_k\le \hat{p}<B_{k+1}]\,\bigl (H_k+A_k(\hat{p}-B_k)\bigr ), \end{aligned}$$
where \(I[\cdot ]\) is the indicator function. Note that the resulting functions \(c_{(\textbf{B},\textbf{H},\textbf{A})}\) can be non-continuous because the right side of the bin \([B_k,B_{k+1})\) ends near the value \(H_k+A_k(B_{k+1}-B_k)\) and nothing is preventing this value from being different than \(H_{k+1}\) which is the left side of the bin \([B_{k+1},B_{k+2})\).
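A direct implementation of this family evaluates, for each prediction, the linear piece of its bin; note the possible discontinuities at bin boundaries mentioned above:

```python
import numpy as np

def c_BHA(p_hat, B, H, A):
    """Evaluate c_(B,H,A): inside bin [B_k, B_{k+1}) the value is
    H_k + A_k * (p_hat - B_k); the function may jump at bin boundaries."""
    p_hat = np.asarray(p_hat, float)
    B, H, A = np.asarray(B, float), np.asarray(H, float), np.asarray(A, float)
    # Index k of the bin containing each prediction.
    k = np.clip(np.searchsorted(B, p_hat, side="right") - 1, 0, len(H) - 1)
    return H[k] + A[k] * (p_hat - B[k])

# Two bins: slope 1 on [0, 0.5), constant 0.6 on [0.5, 1]; discontinuous
# at 0.5 (left limit 0.7, right value 0.6).
print(c_BHA([0.1, 0.4, 0.5, 0.75], B=[0.0, 0.5, 1.0], H=[0.2, 0.6], A=[1.0, 0.0]))
```

Setting all slopes to 0 gives the piecewise constant functions of reliability diagrams; setting all slopes to 1 gives the subfamily behind classical ECE discussed next.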
It turns out that the classical binning-based ECE is a fit-on-test estimator of calibration error with respect to a particular subfamily of \(\mathcal {C}_{\mathcal {B},\mathcal {H},\mathcal {A}}^{(b)}\) that we will describe next. As ECE is calculated from the reliability diagrams that are piecewise constant, one might guess that this subfamily would contain all the piecewise constant (slope 0) functions with a particular fixed binning \(\textbf{B}\). However, this is not true. To see this, consider a synthetic example in Fig. 3a with 6 instances (2 negatives and 4 positives) shown as red dots, and 2 bins \(\textbf{B}=(0,0.5,1+\varepsilon )\). The traditional definition of ECE in Eq.(2) yields \(ECE=0.133\) in this example. Fitting a piecewise constant function with binning \(\textbf{B}\) by minimizing the Brier score results in the calibration map \(\hat{c}\) visualised in Fig. 3b. As the Brier score is a proper loss, the optimum is achieved by empirical averages of labels within each bin. Hence, the shape of the calibration map \(\hat{c}\) in Fig. 3b matches with the reliability diagram in Fig. 3a. However, if we now use the fit-on-test estimator of calibration error as the average length of vertical red lines in Fig. 3b, then we get \(\widehat{CE}=0.167\), which is different from \(ECE=0.133\) with the same binning.
A synthetic example about how ECE can be viewed as a fit-on-test estimator of calibration error. a A reliability diagram of a test set with 6 instances (4 positives with predicted probabilities 0.3, 0.75, 0.85, 0.95 and 2 negatives with predicted probabilities 0.1, 0.65), yielding \(ECE=0.133\); b a piecewise constant fit-on-test estimator yields \(\widehat{CE}=0.167\); c a piecewise slope-1 fit-on-test estimator provably yields the same \(ECE=0.133\) as the original in (a) while visualizing per-instance calibration errors also; d our proposed reliability diagram with diagonal filling combines elements from (a) and (c)
Instead, ECE with the binning \(\textbf{B}\) is actually a fit-on-test estimator of calibration error with respect to the subfamily \(\mathcal {C}^{(b)}_{(\textbf{B},\mathcal {H},\textbf{1})}\) which contains all piecewise linear functions with slope 1 (i.e. a 45-degree ascending slope) in each of the bins. In other words, \(\mathcal {C}_{(\textbf{B},\mathcal {H},\textbf{1})}\) contains all functions with the fixed binning \(\textbf{B}\), fixed slopes \(\textbf{A}=\textbf{1}=(1,1,\dots ,1)\), i.e. slope 1 for each of the b bins, and any heights \(\textbf{H}\in \mathbb {R}^b\) for the left-side boundaries of these bins. Fitting this family for our example results in the calibration map visualised in Fig. 3c. Using the fit-on-test estimator of calibration error we now get exactly the standard \(ECE=0.133\) as the average length of the vertical red lines. This can be confirmed visually, seeing that each of the vertical red lines in Fig. 3c has exactly the same length as the bin-specific red line in the standard reliability diagram of Fig. 3a. These lengths are equal because the diagonal and the calibration map both have slope 1. The following theorem confirms this by proving that the standard ECE with binning \(\textbf{B}\) can be seen as a fit-on-test estimator of calibration error.
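The numbers in this example can be verified mechanically. The snippet below reproduces \(ECE=0.133\), the piecewise-constant fit-on-test estimate 0.167, and the piecewise slope-1 fit-on-test estimate 0.133 for the six instances of Fig. 3:

```python
import numpy as np

# The six instances of Fig. 3 and the two bins with boundary 0.5.
p = np.array([0.1, 0.3, 0.65, 0.75, 0.85, 0.95])
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 1.0])
bins = [p < 0.5, p >= 0.5]

# Classical ECE (Eq. (2)): weighted |ybar_k - pbar_k| over the bins.
ece = sum(m.mean() * abs(y[m].mean() - p[m].mean()) for m in bins)

# Piecewise-constant fit-on-test: Brier-optimal constants are the
# within-bin label means ybar_k.
c_const = np.where(bins[0], y[bins[0]].mean(), y[bins[1]].mean())
ce_const = np.mean(np.abs(c_const - p))

# Piecewise slope-1 fit-on-test: the Brier-optimal offset per bin is
# ybar_k - pbar_k, so c_hat(p) = p + offset_k recovers classical ECE.
offsets = np.where(bins[0],
                   y[bins[0]].mean() - p[bins[0]].mean(),
                   y[bins[1]].mean() - p[bins[1]].mean())
ce_slope1 = np.mean(np.abs((p + offsets) - p))

print(round(ece, 3), round(ce_const, 3), round(ce_slope1, 3))  # 0.133 0.167 0.133
```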
Theorem 2
Consider a predictive model with predictions \(\hat{p}_1,\dots ,\hat{p}_n\in [0,1]\) on a test set with actual labels \(y_1,\dots ,y_n\) and a binning \(\textbf{B}\) with \(b\ge 1\) bins and boundaries \(0=B_1<\dots <B_{b+1}=1+\epsilon\). Then for any \(\alpha >0\), the measure \(\textsf{ECE}^{(\alpha )}_{\textbf{B}}\) as defined by Eq.(2) is equal to the fit-on-test estimator of calibration error using the family \(\mathcal {C}_{(\textbf{B},\mathcal {H},\textbf{1})}\) and fitting the Brier score:

$$\begin{aligned} \textsf{ECE}^{(\alpha )}_{\textbf{B}}=\widehat{\textsf{CE}}{}^{(\alpha )}_{\mathcal {C}_{(\textbf{B},\mathcal {H},\textbf{1})},\,\textsf{BS}}. \end{aligned}$$
Furthermore, \(\hat{c}(\bar{p}_k)=\bar{y}_k\) for \(k=1,\dots ,b\), where \(\bar{p}_k\) and \(\bar{y}_k\) are the average \(\hat{p}_i\) and \(y_i\) in the bin \([B_k,B_{k+1})\).
Proof
See the Supplementary Material. \(\hfill\square\)
4.5 Fit-on-test reliability diagrams
Reliability diagrams are a common way of evaluating calibration of classifiers visually. Next we discuss the shortcomings of existing reliability diagrams and propose enhancements to them.
The simplest classical binning-based reliability diagrams just present the bar plot showing the bins and the proportions of positives in these bins. As we demonstrate in Fig. 4 (top row), it can happen that 3 classifiers with very different ECE values of 0.002, 0.090 and 0.070 have an identical reliability diagram. This is because the proportions \(\bar{y}_k\) of positives in the bins \(k=1,...,b\) are respectively the same for the 3 classifiers. However, ECE is different for these classifiers due to differences in the average predictions \(\bar{p}_k\) in the bins. A common way to address this problem is to visually indicate the average prediction within each bin (Song et al., 2021). The second row of Fig. 4 does so by showing bin centres \((\bar{p}_k,\bar{y}_k)\) with red dots. This reveals that the first classifier is actually nearly calibrated according to this binning, because the red dots are almost at the diagonal of perfect calibration, hence very low ECE of 0.002. However, the second and third classifiers still have identical visualisation, while the values of ECE are different. This is caused by different numbers \(n_k\) of instances within the bins, thus resulting in different weights for the terms in the formula Eq.(2) for ECE. Therefore, it is important to complement reliability diagrams with frequency histograms (Song et al., 2021), as we have done in the third row of Fig. 4.
In the last row of Fig. 4 we propose new reliability diagrams that we call reliability diagrams with diagonal filling. The original bar plot is kept there with a black line. The blue colour does not fill the bars to the horizontal top as usual, but instead to a diagonal line with slope 1 that crosses the top of the bar at the red bin centre \((\bar{p}_k,\bar{y}_k)\). As we know from Theorem 2, the top boundary of the blue filling represents the calibration map resulting from fitting the family \(\mathcal {C}_{(\textbf{B},\mathcal {H},\textbf{1})}\). The theorem also states that the average distance between this calibration map and the main diagonal of perfect calibration is equal to the standard ECE. In this way, the difference between classifiers 1 and 2 becomes clearly evident from the figure. Classifier 1 is almost perfectly calibrated because the estimated calibration map nearly matches the main diagonal, whereas classifier 2 is quite far from being calibrated. More precisely, the area between the diagonal filling and the main diagonal is exactly equal to ECE, assuming that the bins have equal width and an equal number of instances in them. However, if the bins have different numbers of instances (e.g. as for the 3rd classifier), then the areas between the diagonal filling and the main diagonal need to be weighted accordingly, as in the third row of the figure.
Our reliability diagrams with diagonal filling are just one example of obtaining reliability diagrams from fit-on-test calibration maps. More generally, one could use any post-hoc calibration method on the test data to obtain an estimated calibration map, and then fill the area under this curve. We call such visualisations fit-on-test reliability diagrams. With any such diagram, the fit-on-test estimator of calibration error can be measured as the average distance across all instances between the reliability diagram and the main diagonal. Figure 5 shows examples of fit-on-test reliability diagrams; the calibration map families PL and PL3 used there will be introduced in Sect. 5.
Reliability diagrams of 3 classifiers (columns) created with different visualisation methods (rows); the concrete classification task is irrelevant. All classifiers have different ECE, but the plain reliability diagrams (top row) are identical. Differences are gradually revealed by adding bin centres indicating the average predicted probabilities (red dots, in the second row) and frequency histograms (in the third row). The last row shows our proposed reliability diagrams with diagonal filling, where ECE is better visualised because it is equal to the instance-wise average distance from the blue boundary to the main diagonal of perfect calibration (Color figure online)
4.6 Cross-validated number of bins for ECE
As discussed at the beginning of Sect. 4, the choice of the number of bins strongly influences how well the standard binning-based \(\textsf{ECE}^{(\alpha )}_{\textbf{B}}\) estimates the true calibration error. Viewing ECE as fitting the family \(\mathcal {C}^{(b)}_{(\textbf{B},\mathcal {H},\textbf{1})}\) on the test data by minimising the Brier score, we can treat the choice of the number of bins b as a hyperparameter optimisation task. This opens up novel methods for choosing the number of bins for ECE. For example, we can split the test set randomly into two folds: on one fold we perform fitting with different numbers of bins, and on the other fold we evaluate which number of bins provides the best fit according to the Brier score. After that, the final reliability diagram can be drawn with the selected number of bins, and ECE calculated based on this diagram. Instead of such a fixed split into two folds, any other hyperparameter optimisation technique can be used. We propose to use cross-validation (CV), a standard hyperparameter tuning method, to select the number of bins which provides the best fit. In the example of Fig. 1, the optimum was achieved with 14 bins, as shown in Fig. 5a. Note that we are fitting \(\hat{c}(\hat{p}_i)\) to \(y_i\) for \(i=1,\dots ,n\), and thus CV improves the fit between the estimated calibration map (the top of the diagonal filling of the reliability diagram) and the binary labels. As a result, the fit between the estimated calibration function \(\hat{c}\) and the true calibration curve \(c^*_f\) also improves in expectation, as implied by Theorem 1. A better fit between \(\hat{c}\) and \(c^*_f\) implies a ‘more reliable’ reliability diagram, in the sense that the top of the filling is on average closer to the true calibration function. While cross-validation is a standard tool for hyperparameter tuning, it has not previously been applied to ECE, because ECE has not been seen as fitting before.
In the implementation of CV, inspired by Tikka and Hollmén (2008), we prefer a lower number of bins whenever the relative difference in loss is less than 0.1 percent, which further improves performance (for details see Appendix C.2).
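The bin-count selection described above can be sketched in a few lines. Below is a minimal illustration of our own, assuming equal-width binning and the Brier score as the fitting loss; all function names are ours, and the 0.1% tolerance rule from Appendix C.2 is omitted for brevity:

```python
import numpy as np

def fit_binning(p, y, b):
    """Fit an equal-width binning calibration map: per-bin mean label."""
    edges = np.linspace(0.0, 1.0, b + 1)
    idx = np.clip(np.digitize(p, edges[1:-1]), 0, b - 1)
    means = np.empty(b)
    for k in range(b):
        mask = idx == k
        # Empty bins fall back to the bin centre (identity map).
        means[k] = y[mask].mean() if mask.any() else (edges[k] + edges[k + 1]) / 2
    return edges, means

def apply_binning(p, edges, means):
    """Evaluate the fitted binning calibration map on new predictions."""
    b = len(means)
    idx = np.clip(np.digitize(p, edges[1:-1]), 0, b - 1)
    return means[idx]

def cv_num_bins(p, y, candidates=range(2, 31), n_folds=10, seed=0):
    """Select the number of bins minimising the cross-validated Brier score."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(p)), n_folds)
    scores = {}
    for b in candidates:
        losses = []
        for f in range(n_folds):
            val = folds[f]
            trn = np.concatenate([folds[g] for g in range(n_folds) if g != f])
            edges, means = fit_binning(p[trn], y[trn], b)
            c_hat = apply_binning(p[val], edges, means)
            losses.append(np.mean((c_hat - y[val]) ** 2))  # Brier score
        scores[b] = np.mean(losses)
    return min(scores, key=scores.get)

def ece(p, y, b, alpha=1):
    """Standard equal-width binning ECE with the selected number of bins."""
    edges = np.linspace(0.0, 1.0, b + 1)
    idx = np.clip(np.digitize(p, edges[1:-1]), 0, b - 1)
    total = 0.0
    for k in range(b):
        mask = idx == k
        if mask.any():
            total += mask.sum() * abs(y[mask].mean() - p[mask].mean()) ** alpha
    return total / len(p)
```

The selected number of bins can then be used both for drawing the reliability diagram and for computing the final ECE value.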
Different reliability diagrams with the number of bins or pieces optimised using cross-validation, on the same data as in Fig. 1: a a reliability diagram with diagonal filling using 14 bins; b piecewise linear reliability diagram with 3 pieces; c piecewise linear in logit-logit space reliability diagram with 2 pieces
5 Calibration map families PL and PL3
The family \(\mathcal {C}^{(b)}_{(\textbf{B},\mathcal {H},\textbf{1})}\) of functions used by the binning-based ECE has several weaknesses: (1) it contains discontinuous functions, while ‘jumps’ are unlikely to be present in the true calibration function; (2) it contains only segments with slope 1, making it hard to fit the true calibration function in regions with a different slope.
Therefore, we are instead looking for a family satisfying the following criteria: (1) it contains only continuous functions; (2) it has the flexibility to fit any curve; (3) it contains the identity function; (4) it has few enough parameters to avoid heavy overfitting.
We first revisit the families used by existing post-hoc calibration methods. Some methods use families with a fixed number of parameters, such as Platt scaling (Platt, 2000) (2 parameters) and beta calibration (Kull et al., 2017) (3 parameters); see also Table 1. A small fixed number of parameters is clearly not sufficient for the flexibility to fit any curve. Non-parametric methods like isotonic calibration (Zadrozny & Elkan, 2002) have a tendency to be overconfident and thus overfit the data (Allikivi & Kull, 2019). Therefore, we are looking for methods with a flexible number of parameters, so that a suitable number can be selected according to the dataset, for example by cross-validation. Piecewise linear functions with slope 1 satisfy this requirement, but they in turn are not continuous. This motivates our proposal of using the family of continuous piecewise linear calibration maps with unconstrained slopes.
5.1 PL - piecewise linear calibration maps
We propose to use the subfamily \(\mathcal {C}_{\mathcal {B},\mathcal {H},\text {cont}}^{(b)}\) of continuous functions from the piecewise linear function family \(\mathcal {C}_{\mathcal {B},\mathcal {H},\mathcal {A}}^{(b)}\). The family \(\mathcal {C}_{\mathcal {B},\mathcal {H},\text {cont}}^{(b)}\) is only parametrised by the bin boundaries \(\textbf{B}\) and by the values \(\textbf{H}\) of the function at the boundaries, whereas the slopes can be calculated from \(\textbf{B}\) and \(\textbf{H}\) with \(A_k=\frac{H_{k+1}-H_{k}}{B_{k+1}-B_{k}}\) to ensure that the line at the right end of bin k coincides with the line at the left end of bin \(k+1\). Note that the bin boundaries \(\textbf{B}\) are now also parameters to be fitted, together with the values \(\textbf{H}\). As \(B_1=0\) and \(B_{b+1}=1+\epsilon\) are fixed, we are fitting 2b parameters: \(b-1\) bin boundaries and \(b+1\) values in \(\textbf{H}\). The number of bins b is optimised through cross-validation, similarly to Sect. 4.6.
We call the corresponding fit-\((\mathcal {C}_{\mathcal {B},\mathcal {H},\text {cont}}^{(b)},l)\)-on-test method PL: the piecewise linear method for evaluating calibration. In particular, we can now draw a new kind of piecewise linear reliability diagram (Fig. 5b) which provides a better fit to the true calibration function than the binning-based methods, as demonstrated in our experiments (Sect. 7). For a visual comparison, see Fig. 5a and b, which have been made with the same data. More comparative examples can be found in Appendix E.1. Therefore, the piecewise linear reliability diagram can be used similarly to the classical ECE reliability diagram to check visually how well the model is calibrated. We can then also use this estimated calibration map for fit-on-test estimation of the calibration error, obtaining the measure which we call the piecewise linear ECE, or in short ECE-PL. Following the fit-on-test estimation method, ECE-PL measures the instance-wise average distance from the piecewise linear function to the main diagonal: \(\textsf{ECE}_{\textsf{PL}}=\frac{1}{n}\sum _{i=1}^n \vert \hat{c}(\hat{p}_i)-\hat{p}_i\vert ^{\alpha }\) where \(\hat{c}=\mathop {\mathrm {arg\,min}}\limits _{c\in \mathcal {C}_{\mathcal {B},\mathcal {H},\text {cont}}^{(b)}} \frac{1}{n}\sum _{i=1}^n (c(\hat{p}_i)-y_i)^2\).
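Once the family has been fitted, computing \(\textsf{ECE}_{\textsf{PL}}\) from the fitted parameters reduces to linear interpolation followed by averaging distances to the diagonal. A minimal NumPy sketch of our own, with hypothetical boundary and value parameters (the fitting itself is not shown):

```python
import numpy as np

def ece_pl(p_hat, B, H, alpha=1):
    """ECE-PL: instance-wise average distance between the fitted
    continuous piecewise linear calibration map and the diagonal."""
    c_hat = np.interp(p_hat, B, H)  # evaluate the piecewise linear map
    return np.mean(np.abs(c_hat - p_hat) ** alpha)

# Hypothetical fitted parameters: 2 pieces with boundaries B and values H.
p_hat = np.array([0.1, 0.5, 0.9])
score = ece_pl(p_hat, B=[0.0, 0.4, 1.0], H=[0.0, 0.3, 1.0])
```

For the identity map (H equal to B) the measure is zero, as expected for perfect calibration.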
Implementation Details
Although continuous piecewise linear functions are mathematically very well known, we found only one existing public implementation (Jekel & Venter, 2019), based on least squares fitting with differential evolution. We included this in our experiments under the name \(PL_{DE}\) (results in Appendix F). However, we also created our own neural-network-based implementation, depicted in Fig. 6, which allowed us to add cross-entropy fitting. Full details about the architecture are given in Appendix C, but here is a short overview.
We have a single input (\(\hat{p}\)) and a single output (\(\hat{c}\)) connected through two layers: the binning layer and the interpolation layer. The binning layer has \(b+1\) gating units corresponding to the bin boundaries, each outputting whether \(\hat{p}\) is to the left or to the right of the boundary (\(L_i=B_i-\hat{p}\) and \(R_i=\hat{p}-B_i\)). The binning layer is parametrised by b real values (\(\textbf{Z}=z_1, z_2, \dots , z_b\)) which are passed through the softmax (\(\sigma\)) to obtain the widths of the bins and through the cumulative sum (\(B_i = \sum _{k=1}^{i} \sigma _k(\textbf{Z})\)) to obtain the bin boundaries. These parameters are initialised such that all bins contain the same number of training instances. The interpolation layer has \(b+1\) parameters which are each passed through the logistic function (\(\phi\)) to obtain the calibration map values \(H_1,\dots ,H_{b+1}\) at the bin boundaries. These are initialised such that the represented calibration map is the identity function. The interpolation layer consists of b units corresponding to the bins, and the unit of the bin to which \(\hat{p}\) belongs produces the linearly interpolated output. Based on these values, the piecewise linear function value is calculated from the two neighbouring nodes (\(H_k\), \(H_{k+1}\)) as follows:
\(\hat{c}(\hat{p})=\sum _{k=1}^{b} g(L_{k+1}, R_k)\,\frac{H_k\, L_{k+1} + H_{k+1}\, R_k}{B_{k+1}-B_k},\)
where the gating function g is defined as \(g(L_{k+1}, R_k) = I[(L_{k+1}>0) \& (R_k>0)]\).
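The forward pass described above can be illustrated without any training. Below is a non-trainable NumPy sketch of our own, with the simplification \(B_{b+1}=1\) instead of \(1+\epsilon\), and with np.interp playing the role of the gated interpolation units:

```python
import numpy as np

def pl_forward(p_hat, z, h):
    """Forward pass of the piecewise linear map parametrised by logits.
    z: b raw bin-width logits; h: b+1 raw boundary-value logits."""
    widths = np.exp(z) / np.exp(z).sum()            # softmax -> bin widths
    B = np.concatenate(([0.0], np.cumsum(widths)))  # cumulative sum -> boundaries
    H = 1.0 / (1.0 + np.exp(-h))                    # logistic -> values H_1..H_{b+1}
    return np.interp(p_hat, B, H)                   # gated linear interpolation

# b = 4 equal-width bins (all width logits equal), increasing boundary values.
out = pl_forward(np.array([0.1, 0.9]), np.zeros(4),
                 np.array([-2.0, -1.0, 0.0, 1.0, 2.0]))
```

In the actual implementation these operations are differentiable network layers, so the parameters can be fitted with cross-entropy or Brier loss by gradient descent.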
Architecture of the piecewise linear function implementation as a neural network. The bin boundaries \(B_i\) are parametrised by logits (\(\textbf{Z}=z_1, z_2, \dots , z_b\)), which are passed through the softmax followed by a cumulative sum (\(B_i = \sum _{k=1}^{i} \sigma _k(\textbf{Z})\)). The \(H_i\) values are activated by the logistic function \(\phi\). The symbol g stands for a gate: an indicator function that outputs 1 if the input is in the bin and 0 otherwise
We use 10-fold cross-validation to select the number of segments in the piecewise linear function that best approximates the true calibration map, similarly to Sect. 4.6. As in the hyperparameter optimisation for Dirichlet calibration (Kull et al., 2019), the predictions on test data are obtained as the average output of all 10 models with the chosen number of segments but trained on different folds, i.e. we do not refit a single model on all 10 folds.
5.2 PL3 - piecewise linear in logit-logit space
The piecewise linear method can be used for calibration evaluation universally for any kind of model, but it also makes sense to seek dedicated families for special cases, such as neural networks. Next we propose the family PL3 specifically for evaluating calibration in neural networks, taking inspiration from temperature scaling (Guo et al., 2017) and beta calibration (Kull et al., 2017).
Temperature scaling fits a family of functions \(\hat{c}(\hat{p})=\sigma (\textbf{z}/t)\) with a single temperature parameter t, where the softmax \(\sigma\) is applied to the logits \(\textbf{z}\) that the uncalibrated model would have directly converted into probabilities with \(\hat{p}=\sigma (\textbf{z})\) (Guo et al., 2017). In the binary classification case with a single output, \(\sigma (z)=1/(1+e^{-z})\) is the logistic function, the inverse of the logit function \(\sigma ^{-1}(p)=\ln (p/(1-p))\). Importantly, if plotted on the logit-logit scale, binary temperature scaling fits a straight line (Fig. 7), since \(\sigma ^{-1}(\hat{c}(\hat{p}))=\sigma ^{-1}(\sigma (z/t))=z/t=\frac{1}{t}\cdot \sigma ^{-1}(\sigma (z))=\frac{1}{t}\cdot \sigma ^{-1}(\hat{p})\). Further, Fig. 7 compares various methods in the probability space and the logit-logit space, giving extra information on how well the methods fit the calibration map in the low- and high-probability regions.
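The linearity in logit-logit space is easy to verify numerically. A small sketch for the binary case, with arbitrarily chosen logits and temperature:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

z = np.array([-3.0, -0.5, 2.0])  # arbitrary logits
t = 2.5                          # arbitrary temperature
p_hat = sigmoid(z)               # uncalibrated probabilities
c_hat = sigmoid(z / t)           # temperature-scaled probabilities

# In logit-logit space, temperature scaling is a line through the
# origin with slope 1/t:
assert np.allclose(logit(c_hat), logit(p_hat) / t)
```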
Interestingly, another post-hoc calibration method known as beta calibration (Kull et al., 2017) fits calibration maps which in the logit-logit space are approximately piecewise linear with two segments, as shown in Fig. 7 (top right subfigure, blue line). The proof of this fact is given in Appendix E.2. This motivates our calibration map family PL3 of Piecewise Linear functions in the Logit-Logit space (PLLL=PL3), which corresponds to temperature scaling when using 1 piece, approximates beta calibration when using 2 pieces, and can take more complicated shapes when using more linear pieces in the logit-logit space. Fig. 5c shows an example of a reliability diagram that is piecewise linear in the logit-logit space. Further details are provided in Appendix E.2.
Motivation for PL3. Comparison of results in the probability space (left column) and in the logit-logit space (right column). Methods are split into two groups for clarity. In essence, subfigures (a) and (b) are the same; they only display different functions. Model: ResNet110 (He et al., 2015) on the "cats vs rest" task of CIFAR-5m (Color figure online)
Implementation Details
To implement PL3, we use the same neural architecture as for PL, with the following modifications: (1) the expected input is in the logit space; (2) the bin boundaries are converted into the logit space; (3) the logistic is removed from the parameters feeding the interpolation layer and instead the logistic is applied on the final output. Further details are given in Appendix C.
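These three modifications can be summarised in a small non-trainable sketch of our own, with np.interp standing in for the interpolation layer: the input is mapped to logit space, interpolation happens between logit-space knots, and the logistic is applied only at the end.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

def pl3_forward(p_hat, knots, values):
    """PL3: piecewise linear in logit-logit space. `knots` and `values`
    are the bin boundaries and map values, both in logit space."""
    z = logit(np.clip(p_hat, 1e-12, 1.0 - 1e-12))  # input to logit space
    return sigmoid(np.interp(z, knots, values))    # logistic on the final output

# With a single piece of slope 1/t through the origin, PL3 reduces
# to binary temperature scaling:
t = 2.5
x = np.array([-3.0, 0.5, 2.0])
out = pl3_forward(sigmoid(x), np.array([-10.0, 10.0]),
                  np.array([-10.0 / t, 10.0 / t]))
```

With more pieces, the knots and values become trainable parameters, analogously to the PL architecture above.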
6 Assessment of calibrators and evaluators
Before proceeding to the experiments, let us discuss the methods for assessing the quality of calibrators and evaluators of calibration.
6.1 Assessment of post-hoc calibrators
Post-hoc calibrators should be evaluated based on the effectiveness of calibration, but this effectiveness can be interpreted in two ways. Firstly, we could evaluate how well-calibrated the outputs of the calibrator are by calculating the calibration error after calibration (CEAC):
\(\textsf{CEAC}=\mathbb {E}\left[ d\big (\hat{C},\,c^*_{\hat{c}\circ f}(\hat{C})\big )\right] ,\)
Secondly, we could evaluate how well the calibrator approximates the true calibration map by calculating the calibration map estimation error (CMEE):
\(\textsf{CMEE}=\mathbb {E}\left[ d\big (\hat{c}(\hat{P}),\,c^*_f(\hat{P})\big )\right] ,\)
where d is any Bregman divergence and \(\hat{C}=\hat{c}(\hat{P})=(\hat{c}\circ f)(X)=\hat{c}(f(X))\). We prove that low calibration map estimation error is a stronger requirement than low calibration error after calibration, because CMEE is an upper bound for CEAC:
Theorem 3
\(\textsf{CEAC}\le \textsf{CMEE}.\)
Intuitively, the difference between CMEE and CEAC is due to ties introduced by \(\hat{c}\), where \(\hat{c}(\hat{p})=\hat{c}(\hat{p}')\) while \(c^*_f(\hat{p})\ne c^*_f(\hat{p}')\) for some \(\hat{p}\ne \hat{p}'\). As low CMEE is the stronger requirement, we prefer this measure in the experiments. We use the notation \(d(\hat{c},c^*)\) as a more memorable synonym for CMEE.
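The intuition about ties can be made concrete with a toy numerical example of our own, using the squared distance as the divergence d:

```python
import numpy as np

# Two predictions that c_hat collapses into a tie at 0.5.
p_hat  = np.array([0.3, 0.7])  # classifier outputs
c_star = np.array([0.4, 0.6])  # true calibration map values c*_f(p_hat)
c_hat  = np.array([0.5, 0.5])  # tied outputs of the fitted calibrator

# CMEE: average divergence between c_hat and the true calibration map.
cmee = np.mean((c_hat - c_star) ** 2)

# CEAC: the true calibration map of the *calibrated* model at the tied
# output 0.5 is the average of the merged groups' c* values, i.e. 0.5,
# so the calibration error after calibration vanishes.
c_star_after = np.full(2, c_star.mean())
ceac = np.mean((c_hat - c_star_after) ** 2)

assert ceac <= cmee  # consistent with Theorem 3
```

Here CEAC is exactly zero while CMEE is positive: the ties hide the remaining miscalibration from CEAC but not from CMEE.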
6.2 Assessment of calibration evaluators
Calibration evaluators that follow the fit-on-test paradigm estimate \(\hat{c}\) on the test data. The resulting \(\hat{c}\) can be viewed as a reliability diagram, or used to estimate the calibration error with \(\textsf{ECE}^{(\alpha )}_{\text {fit-}(\mathcal {C},l)\text {-on-test}}\).
We see three main application scenarios:
1. the reliability diagram is needed if the reliability of individual predictions has to be known separately; for example, this is needed if there is a downstream decision-making process performed by a human or AI;
2. the estimated calibration error is needed if the general level of trust needs to be known, not specifically for each output;
3. the ranking of the estimated calibration errors is needed if performing selection of the best calibrated model.
In the experiments, we evaluate each calibration evaluation method \(\mathcal {M}\) against these three objectives, measuring: (1) the quality of reliability diagrams with \(d(\hat{c}_\mathcal {M},c^*)\); (2) the quality of calibration error estimates with \(\vert ECE_\mathcal {M}-CE\vert\); and (3) the quality of ranking using Spearman's correlation between the ranking produced by the calibration evaluator and the ranking by true calibration errors, which we refer to as \(rankcorr(ECE_\mathcal {M},CE)\). All these measures assume good estimates of the true calibration map, which can be obtained either on synthetic data or with access to orders of magnitude more data than the calibration evaluator uses.
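For the third objective, Spearman's correlation is simply the Pearson correlation of the ranks. A minimal helper of our own (ignoring ties, which a library routine such as scipy.stats.spearmanr would handle with average ranks):

```python
import numpy as np

def rankcorr(a, b):
    """Spearman rank correlation between two score lists (no ties)."""
    ranks_a = np.argsort(np.argsort(a)).astype(float)
    ranks_b = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ranks_a, ranks_b)[0, 1]

# Hypothetical estimated ECEs of 6 calibrators vs their true CEs:
rho = rankcorr([0.02, 0.05, 0.01, 0.04, 0.03, 0.06],
               [0.03, 0.06, 0.02, 0.04, 0.01, 0.07])
```

A value of 1 means the evaluator ranks the calibrators exactly as the true calibration errors do.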
Note that in the experiments we mostly use the absolute difference \(d(\hat{c},c^*)=\vert \hat{c}-c^*\vert\) as the distance measure between the estimated and true calibration maps, following the tradition of how ECE is calculated from the reliability diagrams. As the absolute difference is not a Bregman divergence, the Appendix also considers the squared difference \(d(\hat{c},c^*)=\vert \hat{c}-c^*\vert ^2\) which is a Bregman divergence. Importantly, a method \(\mathcal {M}\) that produces the best calibration map estimate with the lowest \(\vert \hat{c}_\mathcal {M}-c^*\vert\) might not be the same as method \(\mathcal {M}'\) that produces the best calibration error estimate \(\vert ECE_{\mathcal {M}'}-CE\vert\). For example, this occurs when comparing the methods in Figs. 5a and b. In Fig. 5b, the piecewise linear method (PL) achieves \(\vert ECE_{PL}-CE\vert =0.0045\) and \(\vert \hat{c}_{PL}-c^*\vert =0.0204\), while in Fig. 5a, equal-width binning with cross-validated number of bins (EWCV) achieves \(\vert ECE_{EWCV}-CE\vert =0.0024\) and \(\vert \hat{c}_{EWCV}-c^*\vert =0.0398\). Thus, even though PL has in this example better \(\vert \hat{c}-c^*\vert\), it still loses to EWCV with respect to \(\vert ECE-CE\vert\). Intuitively, EWCV has bigger errors in the reliability diagram, but during the calculation of ECE these errors happen to cancel out more than in the case of PL.
7 Experiments and results
7.1 Pseudo-real experiments and results
The goal of our experimental studies is to (1) evaluate our proposed PL and PL3 calibration map families as calibration methods; and (2) find the best calibration map family for calibration evaluation (thanks to the fit-on-test paradigm).
A key problem for research in calibration evaluation methods and post-hoc calibration methods is that proper evaluation requires access to the true calibration map, which is unknown in practice. One workaround would be to use synthetically created data, where the true calibration map is known. However, with synthetic data the shape of the true calibration map might not be realistic. Our solution is to use the pseudo-real dataset CIFAR-5m (Nakkiran et al., 2021) with 5 million synthetic images, created such that models trained on CIFAR-10 (Krizhevsky et al., 2009) have very similar performance on CIFAR-5m, and vice versa. Thus, it is likely that the true calibration maps are also very realistic. Thanks to the vast size of CIFAR-5m, we can estimate the true calibration map very precisely. In our experiments, we used isotonic calibration on 1 million hold-out datapoints to estimate the true calibration map (other options than isotonic are considered in Appendix F, with minor differences in the results). The main part of the experiments concentrates on CIFAR-5m, while Appendix F shows more results on synthetic and real datasets; these results are comparable to CIFAR-5m with minor differences, supporting the overall conclusions. On CIFAR-5m, we concentrated on three 1-vs-rest calibration tasks (car, cat, dog) and on confidence calibration. Only three 1-vs-rest tasks were used due to computational limitations.
ResNet110 (He et al., 2015), WideNet32 (Zagoruyko & Komodakis, 2016) and DenseNet40 (Huang et al., 2016) models were trained on 45k datapoints from CIFAR-5m; an additional 5k datapoints were used to calibrate the outputs of the models with multi-class calibration methods: temperature scaling (TempS), vector scaling (VecS), matrix scaling with off-diagonal and intercept regularisation (MSODIR), Dirichlet calibration with ODIR regularisation (dirODIR), Spline calibration with the natural method (Spline), and the order-invariant version of intra-order-preserving functions (IOP); and with binary calibration methods: Platt scaling (Platt), isotonic calibration (isotonic), beta calibration (beta), scaling-binning (ScaleBin), ECE-based binning with equal-size binning (ES) and with the sweeping method (\(ES_{sweep}\)), and the piecewise linear methods (PL and PL3). For PL we used our neural network model trained with cross-entropy loss (results with optimising for the Brier score and with the alternative optimisation method based on differential evolution are included in Appendix F).
Results
We report the results using absolute differences (i.e. \(\alpha =1\)), as in most earlier works (Appendix F also includes the quadratic case, i.e. \(\alpha =2\)). Table 2 assesses PL and PL3 as post-hoc calibration methods. The calibration map family corresponding to ECE fit with the sweeping method (Roelofs et al., 2020) is also included as \(ES_{sweep}\). The calibration methods are evaluated by how well they approximate the true calibration map: \(\vert \hat{c}_\mathcal {M}-c^*\vert =\frac{1}{n}\sum _{i=1}^n\vert \hat{c}_\mathcal {M}(\hat{p}_i)-c^*(\hat{p}_i)\vert\) is measured on 1 million unseen data points against the ground truth, where \(\hat{p}_i\) are the outputs of the classifier, \(\hat{c}_\mathcal {M}(\hat{p}_i)\) are the post-hoc calibrated predictions, and \(c^*(\hat{p}_i)\) is the ‘true’ calibration map.
PL3 is the best method in all cases, showing the usefulness of the logit-logit space when calibrating neural models which are far from being calibrated. Note that the errors of PL3 are in all cases more than 20% smaller than the errors of the second-best method, Platt. Appendix F includes more variations of these methods, the KDE method, and results for each architecture separately (minor differences).
Next we compare the calibration evaluators against each of the three objectives listed in Sect. 6.2. We perform the comparison in tasks where the models are already quite close to being calibrated, because this is typical when evaluators are used to find out which post-hoc calibrator performs best. We compare evaluators on the task of evaluating 6 post-hoc calibrators: we measure how precisely the evaluators \(\mathcal {M}\) estimate the reliability diagrams (Table 3 showing \(\vert \hat{c}_\mathcal {M}-c^*\vert\)), the total true calibration errors (Table 4 showing \(\vert ECE_\mathcal {M}-CE\vert\)), and how well the estimated ranking of the 6 calibrators agrees with the true ranking based on true calibration errors (Table 4 showing \(rankcorr(ECE_\mathcal {M},CE)\)). The 6 calibrators were chosen as the best methods from Table 2: beta, vector scaling, Platt, PL3, scaling-binning, and isotonic. The evaluators are compared on different test set sizes of 1k, 3k, and 10k, with 5 different random seeds for each size.
The first objective is to assess the reliability diagrams using \(\vert \hat{c}_\mathcal {M}-c^*\vert =\frac{1}{n}\sum _{i=1}^n\vert \hat{c}_\mathcal {M}(\hat{p}_i)-c^*(\hat{p}_i)\vert\) measured on 1 million unseen data points, where \(\hat{p}_i\) are now the already post-hoc calibrated outputs of the classifier (calibrated with the 6 best methods from Table 2, trained on 5k data points), \(\hat{c}_\mathcal {M}(\hat{p}_i)\) are the results of fit-on-test calibration applied on top of the post-hoc calibrated predictions (trained on another separate data set of size 1k, 3k, or 10k), and \(c^*(\hat{p}_i)\) is the ‘true’ calibration map of the post-hoc calibrated predictions.
The results in Table 3 show that PL is the best on average, as well as after disaggregating by the classifier's architecture, the test set size, or the task. PL3 is mostly second, and the evaluator using the beta calibration map family is mostly third. The order-invariant version of IOP also shows promising results. The performance of beta calibration varies with the size of the test dataset. This is expected, because the method has only 3 parameters, which is good for small test sets but worse for bigger ones. Similarly, the dataset size affects IOP, again because of its small number of parameters. The methods based on equal-size binning perform worse, with \(ES_{CV}\) ranking the highest among them on average, closely followed by \(ES_{sweep}\), and the classical \(ES_{15}\) with 15 bins lagging behind. Further, the Spline method is among the worst-performing methods. From Tables 2 and 3 we can conclude that PL3 is best for approximating the true calibration map when predictions are far from being calibrated, and PL is best when predictions are nearly calibrated. Appendix F discusses the results with different aggregations, and reports the optimal numbers of bins for the ES and PL methods.
The second objective is to estimate the numeric value of the total true calibration error, and here the rows \(\vert ECE_\mathcal {M}-CE\vert\) of Table 4 show the benefits of \(ES_{15}\), beta calibration and PL3 (with some differences across tasks). This demonstrates that while the tilted-top reliability diagrams of \(ES_{15}\) are not precise, their debiased average distance from the diagonal closely agrees with the average distance of the true reliabilities (true calibration map) from the diagonal. While PL and PL3 perform reasonably well, there is large potential for further improvement, because debiasing remains future work for these methods. The ranking of the 6 calibrators is best done by the isotonic fit-on-test evaluator, achieving over 55% correlation with the true ranking for one-vs-rest tasks and over 65% for the confidence task.
8 Discussion
As explained in our work, evaluating the calibration error with the fit-on-test approach is done by estimating the calibration map, i.e. fitting a calibration map family on the test data, and then using the plug-in estimator of the calibration error. The most popular method for calibration error evaluation is binning-based ECE, which we proved to also be fit-on-test. The name fit-on-test might sound controversial, but it has been chosen deliberately to highlight the essence of the issues in current methods for assessing calibration. The main concern with fitting on the test data is obtaining a good fit, as in any other fitting task. Thus, one needs a suitable family of functions and a suitable fitting procedure, so that the data is neither overfitted nor underfitted. A poor fit leads to poor performance of the method and unreliable results. However, since the true calibration map and the true calibration error are not available, in practice we do not even know whether there is overfitting or underfitting during fit-on-test evaluation. One way to tackle this problem is to use synthetic or pseudo-real data that looks very similar to real data; this gives an estimate of how well each fit-on-test method could work on real data. We experimentally tested several deep neural network architectures on different datasets and saw some performance fluctuations between methods across different subtasks. However, there are bigger differences across different evaluation tasks. Even though the piecewise linear method performs very well in calibration map estimation, it is still not better than \(ES_{15}\) at estimating the calibration error or at ranking. This might be because of the bias introduced, and thus there might be a need for debiasing. Note that we did not dive into the problems of data shift and out-of-distribution prediction, which in practice certainly affect model uncertainty calibration as well.
9 Conclusion and future work
We suggest viewing the evaluation of calibration through the fit-on-test paradigm, promoting the use of post-hoc calibration methods for calibration evaluation. This view enables reliability diagrams that are closer to the true calibration maps, more exact estimates of the total calibration error, and rankings of calibrators that better correspond to their true quality. Following fit-on-test, we have proposed cross-validation to tune the number of bins in ECE, and demonstrated the benefits of piecewise linearity in the original as well as in the logit-logit space, inspired by temperature scaling and beta calibration. Having said that, the limitation of this approach is, as the name states, fitting on the test set. Thus, it is essential to be careful not to overfit the evaluation method, and to check which methods work best on the given data.
Future work involves the development of debiasing methods for \(ECE_{PL}\) and \(ECE_{PL3}\) and analysing further the benefits of different calibration map families in different scenarios, including dataset shift.
Data availability
All data used in this work is publicly available; the generation of new data and the experimental results are included in the source code.
Code availability
The source code is available at https://github.com/markus93/fit-on-test.
Notes
Following Roelofs et al. (2020), we prefer ‘estimated’ to avoid confusion with the true calibration error which involves an expectation.
References
Allikivi, M.L., & Kull, M. (2019). Non-parametric Bayesian isotonic calibration: Fighting over-confidence in binary classification. In: Machine Learning and Knowledge Discovery in Databases (ECML-PKDD’19). Springer, 68–85
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3.
Broecker, J. (2011). Estimating reliability and resolution of probability forecasts through decomposition of the empirical score. Climate Dynamics, 39, 655–667.
Ferro, C., & Fricker, T. E. (2012). A bias-corrected decomposition of the Brier score. Quarterly Journal of the Royal Meteorological Society, 138, 1954–1960.
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K.Q. (2017). On Calibration of Modern Neural Networks. In: Thirty-fourth International Conference on Machine Learning, Sydney, Australia, arXiv:1706.04599.
Gupta, K., Rahimi, A., Ajanthan, T., Mensink, T., Sminchisescu, C., & Hartley, R. (2021). Calibration of neural networks using splines. In: International Conference on Learning Representations, https://openreview.net/forum?id=eQe8DEWNN2W.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. CoRR abs/1512.03385. arXiv:1512.03385.
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. arXiv preprint arxiv:2006.11239.
Huang, G., Liu, Z., & Weinberger, K.Q. (2016). Densely connected convolutional networks. CoRR abs/1608.06993. arXiv:1608.06993,
Jekel, C.F., & Venter, G. (2019). pwlf: A Python Library for Fitting 1D Continuous Piecewise Linear Functions. https://github.com/cjekel/piecewise_linear_fit_py
Kingma, D.P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images
Kull, M., Silva Filho, T., & Flach, P. (2017). Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In: Singh A, Zhu J (eds) Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol 54. PMLR, Fort Lauderdale, FL, USA, 623–631
Kull, M., Perelló-Nieto, M., Kängsepp, M., de Menezes e Silva Filho, T., Song, H., & Flach, Peter A. (2019). Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with dirichlet calibration. In: Advances in Neural Information Processing Systems (NeurIPS).
Kumar, A., Liang, P., & Ma, T. (2019). Verified uncertainty calibration. In: Advances in Neural Information Processing Systems (NeurIPS’19).
Murphy, A. H., & Winkler, R. L. (1977). Reliability of subjective probability forecasts of precipitation and temperature. Journal of the Royal Statistical Society Series C (Applied Statistics), 26(1), 41–47.
Naeini, M.P., Cooper, G., & Hauskrecht, M. (2015). Obtaining well calibrated probabilities using bayesian binning. In: AAAI Conference on Artificial Intelligence
Nakkiran, P., Neyshabur, B., & Sedghi, H. (2021). The deep bootstrap framework: Good online learners are good offline generalizers. arXiv:2010.08127.
Niculescu-Mizil, A., & Caruana, R. (2005). Predicting good probabilities with supervised learning. In: Proceedings of the 22nd international conference on Machine learning, 625–632.
Nieto, M.P., Song, H., Filho, T.S., & Kängsepp, M. (2019). PyCalib: Python library for classifier calibration. https://github.com/classifier-calibration/PyCalib
Nixon, J., Dusenberry, M., & Zhang, L., et al. (2019). Measuring calibration in deep learning. ArXiv arXiv:1904.01685
Platt, J., et al. (2000). Probabilities for SV machines. In A. Smola, P. Bartlett, & B. Schölkopf (Eds.), Advances in Large Margin Classifiers, 61–74. MIT Press.
Popordanoska, T., Sayer, R., & Blaschko, M. (2022). A consistent and differentiable lp canonical calibration error estimator. Advances in Neural Information Processing Systems, 35, 7933–7946.
Popordanoska, T., Gruber, S.G., & Tiulpin, A., et al. (2023). Consistent and asymptotically unbiased estimation of proper calibration errors. arXiv preprint arXiv:2312.08589
Rahimi, A., Shaban, A., Cheng, C.A., Hartley, R., & Boots, B. (2020). Intra order-preserving functions for calibration of multi-class neural networks. In: Advances in Neural Information Processing Systems (NeurIPS).
Roelofs, R., Cain, N., Shlens, J., & Mozer, M. (2020). Mitigating bias in calibration error estimation. arXiv preprint arXiv:2012.08668.
Song, H., Perello-Nieto, M., Santos-Rodriguez, R., Kull, M., & Flach, P. (2021). Classifier calibration: How to assess and improve predicted class probabilities: A survey. arXiv preprint arXiv:2112.10327.
Tikka, J., & Hollmén, J. (2008). Sequential input selection algorithm for long-term prediction of time series. Neurocomputing, 71(13), 2604–2615. https://doi.org/10.1016/j.neucom.2007.11.037
Vaicenavicius, J., Widmann, D., Andersson, C., Lindsten, F., Roll, J., & Schön, T. (2019). Evaluating model calibration in classification. In: Chaudhuri K, Sugiyama M (eds) Proceedings of Machine Learning Research, Proceedings of Machine Learning Research, 89. PMLR, 3459–3467.
Widmann, D., Lindsten, F., & Zachariah, D. (2019). Calibration tests in multi-class classification: A unifying framework. In: NeurIPS.
Xiong, M., Deng, A., Koh, P.W., Wu, J., Li, S., Xu, J., & Hooi B. (2023). Proximity-informed calibration for deep neural networks. In: Thirty-seventh Conference on Neural Information Processing Systems, https://openreview.net/forum?id=xOJUmwwlJc
Zadrozny, B., & Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. In: Proc. 8th Int. Conf. on Knowledge Discovery and Data Mining (KDD’02). ACM, 694–699.
Zagoruyko, S., & Komodakis, N. (2016). Wide residual networks. arXiv preprint arXiv:1605.07146.
Zhang, J., Kailkhura, B., & Han, T.Y. (2020). Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning. In: ICML.
Acknowledgements
This work was supported by the Estonian Research Council grant PRG1604 and by the European Social Fund via IT Academy programme.
Author information
Contributions
All the authors worked together on developing and analyzing methods and concepts. Mathematical theorems and proofs were written by Meelis Kull. Data preparation and practical side were done by Markus Kängsepp and Kaspar Valk. Writing of the article was done jointly by all of the authors.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Ethics approval
Not applicable.
Consent to participate
The authors agree to participate in the conference.
Consent for publication
The authors permit the publication of the article and its materials.
Additional information
Editor: Eyke Hüllermeier.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Source code
The source code is available at https://github.com/markus93/fit-on-test. It contains everything needed to run the experiments and to reproduce the results presented in the article, including a yml-file for generating a Conda environment with the needed packages and versions.
Appendix B: Proofs
1.1 B.1: Definitions Related to Theorems 1 and 3
The main paper includes 3 theorems. Theorems 1 and 3 involve Bregman divergences as a way to measure the dissimilarity of two probability distributions. We use Bregman divergences in the context of binary class probability estimation, in which case a probability distribution can be represented by a single real number in the range [0, 1], namely the probability of the positive class. Thus, in our case, Bregman divergences are functions that take two positive class probabilities and output a real number representing the divergence of the distributions represented by these probabilities, \(d:[0,1]\times [0,1]\rightarrow \mathbb {R}\). Note that the order of the two arguments is such that the first is the ‘prediction’ and the second is the ‘ground truth’ (we have seen both orders in the literature, but have found this order more natural). The formal definition is as follows:
Definition 1
A function \(d:[0,1]\times [0,1]\rightarrow \mathbb {R}\) is called a Bregman divergence, if there exists a continuously-differentiable and strictly convex function \(\phi :[0,1]\rightarrow \mathbb {R}\), such that for every \(p,q\in [0,1]\):
$$d(p,q)=\phi (q)-\phi (p)-\phi '(p)\,(q-p),$$
where \(\phi '\) is the derivative of \(\phi\).
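As a quick numeric sanity check of this definition (the helper name is ours; we use the convention with the prediction as the first argument): with \(\phi (x)=x^2\), the Bregman divergence reduces to the squared error, for which both argument orders coincide.

```python
def bregman(p, q, phi, dphi):
    """Bregman divergence d(p, q) of ground truth q from prediction p,
    generated by a strictly convex phi with derivative dphi."""
    return phi(q) - phi(p) - dphi(p) * (q - p)

# phi(x) = x^2 generates the squared error: d(p, q) = (q - p)^2
d = bregman(0.8, 0.3, lambda x: x * x, lambda x: 2 * x)
```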
Theorems 1 and 3 involve random variables. As described in the main paper, we have X as a randomly drawn instance (i.e. X is a vector of its feature values) and Y is its label. In Theorem 1 we consider a particular fixed probabilistic classifier \(f:\mathcal {X}\rightarrow \mathbb {R}\) and thus, its output can be viewed as a random variable also, \(\hat{P}=f(X)\). The true calibration map of f can be calculated as \(c^*_f(\hat{p})=\mathbb {E}\Bigl [Y\Big \vert f(X)=\hat{p}\Bigr ]=\mathbb {E}\Bigl [Y\Big \vert \hat{P}=\hat{p}\Bigr ]\) for any \(\hat{p}\in [0,1]\).
In Theorem 3 we consider a particular calibrator \(\hat{c}:[0,1]\rightarrow [0,1]\), and the true calibration map of the approximately calibrated model \(\hat{c}\circ f\), which according to the definition of \(c^*\) can be written out as \(c^*_{\hat{c}\circ f}(c)=\mathbb {E}\Bigl [Y\Big \vert (\hat{c}\circ f)(X)=c\Bigr ]=\mathbb {E}\Bigl [Y\Big \vert \hat{c}(f(X))=c\Bigr ]\) for any \(c\in [0,1]\).
1.2 B.2: Proof of Theorem 1
Theorem 1
Let \(d:[0,1]\times [0,1]\rightarrow \mathbb {R}\) be any Bregman divergence and \(\hat{c}_1,\hat{c}_2:[0,1]\rightarrow [0,1]\) be two estimated calibration maps. Then
$$\mathbb {E}\bigl [d(\hat{c}_1(\hat{P}),Y)\bigr ]-\mathbb {E}\bigl [d(\hat{c}_2(\hat{P}),Y)\bigr ]=\mathbb {E}\bigl [d(\hat{c}_1(\hat{P}),c^*_f(\hat{P}))\bigr ]-\mathbb {E}\bigl [d(\hat{c}_2(\hat{P}),c^*_f(\hat{P}))\bigr ].$$
Proof
It is sufficient to prove that the value of the following expression does not depend on \(\hat{c}_i\) where \(i=1\) or \(i=2\), and is thus the same for \(\hat{c}_1\) and \(\hat{c}_2\):
$$\mathbb {E}\bigl [d(\hat{c}_i(\hat{P}),Y)\bigr ]-\mathbb {E}\bigl [d(\hat{c}_i(\hat{P}),c^*_f(\hat{P}))\bigr ].$$
Let \(\phi\) be a convex function that gives rise to d, then according to the definition of the Bregman divergence we can rewrite the above expression as follows:
As \(\mathbb {E}\Bigl [\phi (Y)\Big \vert \hat{P}=\hat{p}\Bigr ]\) and \(\phi (c^*_f(\hat{p}))\) do not depend on \(\hat{c}_i\) and as the terms \(\phi (\hat{c}_i(\hat{p}))\) and \(\hat{c}_i(\hat{p})\phi '(\hat{c}_i(\hat{p}))\) both cancel out, we are left to prove that the value of the remaining expression does not depend on \(\hat{c}_i\):
Noting that \(\phi '(\hat{c}_i(\hat{p}))\) does not depend on \(\hat{P}\), it can be taken out of the expectation, and thus the expression can be written as:
$$\phi '(\hat{c}_i(\hat{p}))\Bigl (c^*_f(\hat{p})-\mathbb {E}\bigl [Y\,\big \vert \,\hat{P}=\hat{p}\bigr ]\Bigr ).$$
However, this is equal to zero, since by the definition of the true calibration map \(c^*_f\) we have:
$$c^*_f(\hat{p})=\mathbb {E}\bigl [Y\,\big \vert \,\hat{P}=\hat{p}\bigr ].$$
\(\square\)
1.3 B.3: Proof of Theorem 2
Theorem 2
Consider a predictive model with predictions \(\hat{p}_1,\dots ,\hat{p}_n\in [0,1]\) on a test set with actual labels \(y_1,\dots ,y_n\) and a binning \(\textbf{B}\) with \(b\ge 1\) bins and boundaries \(0=B_1<\dots <B_{b+1}=1+\epsilon\). Then for any \(\alpha >0\), the measure \(\textsf{ECE}^{(\alpha )}_{\textbf{B}}\) is equal to:
$$\textsf{ECE}^{(\alpha )}_{\textbf{B}}=\frac{1}{n}\sum _{i=1}^n \vert \hat{c}(\hat{p}_i)-\hat{p}_i\vert ^\alpha ,\quad \text {where }\ \hat{c}=\mathop {\mathrm {arg\,min}}\limits _{c\in \mathcal {C}_{(\textbf{B},\mathcal {H},\textbf{1})}}\ \sum _{i=1}^n \bigl (c(\hat{p}_i)-y_i\bigr )^2.$$
Furthermore, \(\hat{c}(\bar{p}_k)=\bar{y}_k\) for \(k=1,\dots ,b\), where \(\bar{p}_k\) and \(\bar{y}_k\) are the average \(\hat{p}_i\) and \(y_i\) in the bin \([B_k,B_{k+1})\).
Proof
Our first goal is to prove that \(\hat{c}(\bar{p}_k)=\bar{y}_k\) for \(k=1,\dots ,b\). To find the values of the parameters at the optimum \(\hat{c}\), we study the stated minimization task and consider any \(c\in \mathcal {C}_{(\textbf{B},\mathcal {H},\textbf{1})}\) with any values of parameters \((H_1,\dots ,H_b)\). Let us rewrite the quantity to be minimized, grouping the instances by bins and using the definition of \(c(\cdot )\):
$$\sum _{i=1}^n \bigl (c(\hat{p}_i)-y_i\bigr )^2=\sum _{k=1}^b\ \sum _{\hat{p}_i\in [B_k,B_{k+1})} \bigl (H_k+(\hat{p}_i-B_k)-y_i\bigr )^2.$$
Equating the derivatives of this expression with respect to each \(H_k\) to zero, we get:
$$\sum _{\hat{p}_i\in [B_k,B_{k+1})} 2\bigl (H_k+(\hat{p}_i-B_k)-y_i\bigr )=2n_k\bigl (H_k-B_k+\bar{p}_k-\bar{y}_k\bigr )=0$$
for each k. Therefore, \(H_k-B_k=\bar{y}_k-\bar{p}_k\) holds at the optimum \(\hat{c}\) and we get
$$\hat{c}(\bar{p}_k)=H_k+(\bar{p}_k-B_k)=\bar{p}_k+(\bar{y}_k-\bar{p}_k)=\bar{y}_k.$$
More generally, for any \(\hat{p}\in [B_k,B_{k+1})\) we get \(\hat{c}(\hat{p})=H_k+1\cdot (\hat{p}-B_k)=\hat{p}+(H_k-B_k)=\hat{p}+(\bar{y}_k-\bar{p}_k)\). This implies that the term \(\vert \bar{p}_k-\bar{y}_k\vert\) in the definition of \(\textsf{ECE}^{(\alpha )}_{\textbf{B}}\) is equal to \(\vert \hat{c}(\hat{p})-\hat{p}\vert\) for any \(\hat{p}\in [B_k,B_{k+1})\). Using this, we can rewrite \(\textsf{ECE}^{(\alpha )}_{\textbf{B}}\) as follows:
$$\textsf{ECE}^{(\alpha )}_{\textbf{B}}=\sum _{k=1}^b \frac{n_k}{n}\,\vert \bar{p}_k-\bar{y}_k\vert ^\alpha =\frac{1}{n}\sum _{i=1}^n \vert \hat{c}(\hat{p}_i)-\hat{p}_i\vert ^\alpha .$$
\(\square\)
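Theorem 2 can also be checked numerically: the binned \(\textsf{ECE}^{(1)}\) coincides with the average of \(\vert \hat{c}(\hat{p}_i)-\hat{p}_i\vert\) under the fitted unit-slope map \(\hat{c}(\hat{p})=\hat{p}+(\bar{y}_k-\bar{p}_k)\). A minimal sketch with equal-width bins (the helper names are ours):

```python
import numpy as np

def ece_classic(p, y, bins):
    """Binned ECE^(1): weighted average of |mean prediction - mean label|."""
    idx = np.clip(np.digitize(p, bins) - 1, 0, len(bins) - 2)
    total = 0.0
    for k in range(len(bins) - 1):
        m = idx == k
        if m.any():
            total += m.sum() * abs(p[m].mean() - y[m].mean())
    return total / len(p)

def ece_fit_on_test(p, y, bins):
    """Same value via the fitted unit-slope map c(p) = p + (y_bar_k - p_bar_k)."""
    idx = np.clip(np.digitize(p, bins) - 1, 0, len(bins) - 2)
    shift = np.zeros(len(bins) - 1)
    for k in range(len(bins) - 1):
        m = idx == k
        if m.any():
            shift[k] = y[m].mean() - p[m].mean()
    c_hat = p + shift[idx]              # fitted calibration map applied to p
    return np.mean(np.abs(c_hat - p))   # average |c_hat(p_i) - p_i|

rng = np.random.default_rng(0)
p = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < p ** 2).astype(float)  # a miscalibrated model
bins = np.linspace(0, 1, 11)
```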
1.4 B.4: Proof of Theorem 3
Theorem 3 applies for any positive class probability estimator \(f:\mathcal {X}\rightarrow [0,1]\), any calibrator \(\hat{c}:[0,1]\rightarrow [0,1]\), and any Bregman divergence d. First we recall the definitions of CMEE and CEAC given in the main paper:
$$\textsf{CMEE}=\mathbb {E}\bigl [d(\hat{C},c^*_f(\hat{P}))\bigr ],\qquad \textsf{CEAC}=\mathbb {E}\bigl [d(\hat{C},c^*_{\hat{c}\circ f}(\hat{C}))\bigr ],$$
where \(\hat{C}=\hat{c}(\hat{P})=(\hat{c}\circ f)(X)=\hat{c}(f(X))\).
Theorem 3
\(CMEE=CEAC+\mathbb {E}[d(c^*_{\hat{c}\circ f}(\hat{C}),c^*_f(\hat{P}))]\).
Proof
We have to prove that \(CEAC+\mathbb {E}[d(c^*_{\hat{c}\circ f}(\hat{C}),c^*_f(\hat{P}))]-CMEE=0\), that is:
$$\mathbb {E}\bigl [d(\hat{C},c^*_{\hat{c}\circ f}(\hat{C}))\bigr ]+\mathbb {E}\bigl [d(c^*_{\hat{c}\circ f}(\hat{C}),c^*_f(\hat{P}))\bigr ]-\mathbb {E}\bigl [d(\hat{C},c^*_f(\hat{P}))\bigr ]=0.$$
We will prove the variant of this equality where all expectations are replaced by conditional expectations conditioned on \(\hat{C}\), from which the original equality follows due to the law of total expectation, i.e. \(\mathbb {E}[\mathbb {E}[V\vert W]]=\mathbb {E}[V]\) for any random variables V and W. That is, it is sufficient to prove that:
$$\mathbb {E}\bigl [d(\hat{C},c^*_{\hat{c}\circ f}(\hat{C}))\big \vert \hat{C}\bigr ]+\mathbb {E}\bigl [d(c^*_{\hat{c}\circ f}(\hat{C}),c^*_f(\hat{P}))\big \vert \hat{C}\bigr ]-\mathbb {E}\bigl [d(\hat{C},c^*_f(\hat{P}))\big \vert \hat{C}\bigr ]=0.$$
Let \(\phi\) be a convex function that gives rise to d, then according to the definition of the Bregman divergence we can rewrite the above equality as follows:
The first two terms in each conditional expectation cancel out among the three conditional expectations, leaving us with only the last terms. Taking into account that \(c^*_f(\hat{P})\) is the only term which is not a constant under conditioning with \(\hat{C}\), the above is equivalent to:
As \(\hat{C}\phi '(\hat{C})\) cancels out, we can reorganise the remaining terms as follows:
It now suffices to prove that \(\mathbb {E}[c^*_f(\hat{P})\vert \hat{C}]=c^*_{\hat{c}\circ f}(\hat{C})\) for every value of \(\hat{C}\). According to the definitions of the true calibration maps \(c^*_f\) and \(c^*_{\hat{c}\circ f}\), this equality can be rewritten as:
$$\mathbb {E}\bigl [\mathbb {E}[Y\vert \hat{P}]\,\big \vert \,\hat{C}\bigr ]=\mathbb {E}[Y\vert \hat{C}].$$
Since \(\hat{P}\) functionally determines \(\hat{C}\) through \(\hat{C}=\hat{c}(\hat{P})\), the above equality follows directly from the law of total expectation applied on conditional expectations, i.e. \(\mathbb {E}[\mathbb {E}[V\vert \mathcal {G}_2]\vert \mathcal {G}_1]=\mathbb {E}[V\vert \mathcal {G}_1]\) for any random variable V and \(\sigma\)-algebras \(\mathcal {G}_1\subseteq \mathcal {G}_2\). \(\square\)
Appendix C: Implementation details
This section provides the implementation details for all the methods used in the experiments.
1.1 C.1: Piecewise linear fitting using a neural network
1.1.1 C.1.1: PL details
This section gives a more detailed overview of the implementation of the piecewise linear method. Information about the overall architecture is available in the main part of the article. Furthermore, the exact implementation is available in the source code in the file "piecewise_linear.py".
The model is initialised such that it represents the identity calibration map. It is trained for up to 1500 epochs. The early stopping patience is set to 20, which means that fitting is stopped if the training loss has not decreased for 20 epochs. The model is optimised using the Adam (Kingma & Ba, 2014) optimiser with a learning rate of 0.01. The batch size is \(min(n_{data}/4, 512)\), e.g. for 1000 data points the batch size is 250, and for 10000 data points it is 512. Cross-validation is used to pick the number of nodes for the binning layer; the number of nodes minus 1 gives the number of bins. All numbers of bins from 1 to 16 are considered for the model, except when there are 1000 instances, in which case the maximum number of bins considered is 6. The model is trained using the MSE or cross-entropy loss. The cross-entropy loss (CE) was showcased in the main article due to its better performance. In the following tables of results, \(PL_{NN}^{CE}\) stands for training with the CE loss.
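The core idea can be illustrated with a simplified numpy sketch: a piecewise linear map on an equal-width node grid, initialised to the identity and fitted by full-batch gradient descent on the MSE loss. This is only an illustration of the idea, not the authors' exact neural-network architecture or training setup; all helper names are ours.

```python
import numpy as np

def fit_pl_map(p_hat, y, n_bins=4, lr=1.0, epochs=1000):
    """Fit a piecewise linear calibration map on an equal-width node grid
    by full-batch gradient descent on the MSE loss (illustrative sketch)."""
    nodes = np.linspace(0.0, 1.0, n_bins + 1)
    heights = nodes.copy()                      # identity map initialisation
    idx = np.clip(np.searchsorted(nodes, p_hat, side="right") - 1, 0, n_bins - 1)
    t = (p_hat - nodes[idx]) / (nodes[idx + 1] - nodes[idx])  # position in bin
    for _ in range(epochs):
        pred = (1 - t) * heights[idx] + t * heights[idx + 1]
        res = pred - y                          # gradient of 0.5*(pred - y)^2
        grad = np.zeros_like(heights)
        np.add.at(grad, idx, (1 - t) * res)     # route residuals to both
        np.add.at(grad, idx + 1, t * res)       # neighbouring node heights
        heights -= lr * grad / len(p_hat)
    return nodes, heights

rng = np.random.default_rng(0)
p_hat = rng.uniform(size=2000)
target = p_hat ** 2                             # a known 'true' map to recover
nodes, heights = fit_pl_map(p_hat, target)
mse = np.mean((np.interp(p_hat, nodes, heights) - target) ** 2)
```

The fitted map is then applied to new predictions with `np.interp(p, nodes, heights)`.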
Fitting the PL method for a single model takes under 15 s, and with 10-fold cross-validation it takes up to 150 s depending on the number of nodes the model has. In total, finding the best number of bins (1 to 16) with 10-fold CV takes up to 25 min depending on the data size and the complexity of the fitted function. In most cases the fitting is much faster, but it slows down as more bins are used or the data sizes get bigger. Further speedups can be obtained by reducing the number of folds and the number of different bin counts considered in the hyperparameter optimisation. The scripts were run in a high performance computing center using CPU processing power (Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz) with up to 6GB of RAM.
1.1.2 C.1.2: PL3 details
This section gives further implementation details of the piecewise linear method in the logit-logit space. Similarly to PL, information about the overall architecture is available in the main part of the article and the exact implementation is available in the source code in "piecewise_linear.py".
The model training part is exactly the same as for the PL method.
Fitting the PL3 method tends to take more time compared to the PL method. Fitting a single model takes under 30 s, and with 10-fold cross-validation it takes up to 300 s. In total, finding the best number of bins with 10-fold CV takes up to 80 min. Large speedups can be obtained by reducing the number of folds and the number of different bin counts considered in the hyperparameter optimisation.
1.2 C.2: Details of cross-validation
In our experiments, cross-validation is used to find the best number of bins for the \(ES_{CV}\), PL and PL3 methods. We have seen that the results improve with a simple complexity-reducing regularisation trick, according to which we prefer a lower number of bins over a higher number of bins whenever the relative difference in the cross-validated loss estimate is less than 0.1 percent. Furthermore, in the same way as in the hyperparameter optimisation for Dirichlet calibration (Kull et al., 2019), the predictions on the test data are obtained as the average output of all 10 models with the chosen number of segments but trained on different folds, i.e. we do not refit a single model on all 10 folds.
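Our reading of the complexity-reducing rule can be sketched as follows (the helper name and the exact form of the relative-difference test are assumptions; losses are assumed positive):

```python
def choose_n_bins(cv_losses, rel_tol=1e-3):
    """Pick the smallest number of bins whose cross-validated loss is
    within rel_tol (relative) of the best loss over all candidates.

    cv_losses: dict mapping number of bins -> mean CV loss (positive)."""
    best = min(cv_losses.values())
    for b in sorted(cv_losses):          # prefer fewer bins
        if (cv_losses[b] - best) / best < rel_tol:
            return b
```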
Table 20 depicts the difference in results for estimating \(\vert \hat{c}_\mathcal {M}-c^*\vert\) with classical CV and with CV using the complexity-reducing regularisation. In Table 21 the same is depicted for estimating \(\vert ECE_\mathcal {M}-CE\vert\). It can be seen that the complexity-reducing regularisation improves the results for \(\vert \hat{c}_\mathcal {M}-c^*\vert\) estimation in most cases. The only exception is \(PL3^{CE}\), where the regularisation makes the results slightly worse. On the other hand, the effect of the regularisation on \(\vert ECE_\mathcal {M}-CE\vert\) is the opposite, and the overall results get worse. Nevertheless, the regularisation was kept, as it showed promising results on smaller datasets.
1.3 C.3: Details about the binning methods
The binning methods were implemented using the NumPy and Scikit-Learn packages. They follow previously established approaches; however, cross-validation and unit-slope calibration maps have been added. We have also implemented the sweep method to choose the highest number of bins just before the calibration map becomes non-monotonic. All of the binning methods support both equal-size and equal-width binning. The implementation is available in the source code in the file "binnings.py".
1.4 C.4: Implementation details of other methods
Other methods used for comparison are taken from publicly available packages as follows: isotonic calibration and Platt scaling from the pycalib package (Nieto et al., 2019); beta calibration from the betacal package (Kull et al., 2017); KCE from the pycalibration package (Widmann et al., 2019); and \(PL_{DE}\) from the pwlf package (Jekel & Venter, 2019). For KCE we used the unbiased version with RBFKernel. Both splines (Gupta et al., 2021) and intra-order-preserving (IOP) functions (Rahimi et al., 2020) are implemented using the official implementations provided with the original articles. The best number of splines for the spline methods was found using CV similarly as for the piecewise linear methods (Subsection C.2). For splines, three different variants were used: natural, parabolic and cubic. For the IOP functions, the configurations given in the original paper were used, with two versions: order-invariant (\(IOP_{OI}\)) and order-preserving (\(IOP_{OP}\)). Unfortunately, the third, diagonal version did not work. For KDE we used the implementation provided with the original article (Zhang et al., 2020). We used point-wise estimates for KDE as they seemed to offer better results than the integral-based estimates proposed in the original article.
The best number of bins for \(PL_{DE}\) was found using CV similarly as for the piecewise linear methods (Subsection C.2). The pwlf package also supports fitting piecewise quadratic curves (degree 2) in addition to piecewise linear curves (degree 1). Degree 1 gave better results than degree 2 according to Table 16. For degree 1, seven bins were chosen as the maximum limit for CV, as beyond that the model fitting became very slow. For degree 2, five bins were chosen as the maximum limit for CV. On average, fitting \(PL_{DE}\) with 10-fold CV took about 30 min. The licenses of the packages have been checked to be freely usable for our work.
1.5 C.5: Debiasing ECE
Debiasing was applied for the binning-based ECE methods. Recalling the notation: \(\bar{y}_k=\frac{1}{n_k}\sum _{\hat{p}_i\in [B_k,B_{k+1})} y_i\) is the average label in the k-th bin, \(\bar{p}_k=\frac{1}{n_k}\sum _{\hat{p}_i\in [B_k,B_{k+1})} \hat{p}_i\) is the average prediction in the k-th bin, \(n_k=\vert \{i\mid \hat{p}_i\in [B_k,B_{k+1})\}\vert\) is the size of bin k, and \(\textsf{ECE}^{(1)}\) is defined as \(\textsf{ECE}^{(1)}_\textbf{B}=\frac{1}{n}\sum _{k=1}^b n_k\cdot \vert \bar{p}_k-\bar{y}_k\vert ^1\).
Kumar et al. (2019) proposed to debias \(\textsf{ECE}^{(1)}\) by modelling \(\bar{y}_k\) in each bin as a sample from a random variable with the Gaussian distribution \(N(\bar{y}_k, \frac{\bar{y}_k(1-\bar{y}_k)}{n_k})\). The bias can then be estimated by drawing repeated samples from this random variable. The final debiased estimate of \(\textsf{CE}^{(1)}\) is then obtained by subtracting the estimated bias from the \(\textsf{ECE}^{(1)}\) value.
Instead of drawing repeated random samples, we propose to use numerical integration with the probability density function of the same random variable, which is computationally faster. Instead of drawing samples from \(R_k \sim N(\bar{y}_k, \frac{\bar{y}_k(1-\bar{y}_k)}{n_k})\) for each bin to find \(\mathbb {E}[n_k\cdot \vert \bar{p}_k-R_k\vert ]\) as proposed by Kumar et al. (2019), one can compute in each bin
$$\mathbb {E}[n_k\cdot \vert \bar{p}_k-R_k\vert ]=n_k\int \vert \bar{p}_k-r\vert \,f_{R_k}(r)\,dr,$$
where \(f_{R_k}\) is the probability density function of \(R_k\). We used the simple trapezoidal rule with 10,000 equally spaced integration points within 5 standard deviations of the mean.
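This integration can be sketched as follows (the function name is ours; the grid follows the 5-standard-deviation range described above, and the value returned is per bin, before multiplying by \(n_k\)):

```python
import numpy as np

def bin_abs_dev(p_bar, y_bar, n_k, n_grid=10_000):
    """E[|p_bar - R_k|] for R_k ~ N(y_bar, y_bar*(1 - y_bar)/n_k),
    computed with the trapezoidal rule on [mean - 5*sd, mean + 5*sd]."""
    var = y_bar * (1.0 - y_bar) / n_k
    sd = np.sqrt(var)
    r = np.linspace(y_bar - 5 * sd, y_bar + 5 * sd, n_grid)
    f = np.abs(p_bar - r) * np.exp(-(r - y_bar) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return float(np.sum((f[1:] + f[:-1]) * np.diff(r)) / 2)  # trapezoidal rule
```

For p_bar = y_bar the result should approach the half-normal mean, sd * sqrt(2 / pi).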
Appendix D: Datasets and experimental setup
We ran experiments on pseudo-real, synthetic and real datasets. More details about these datasets and the experimental setup are given in the following sections. We have checked the licenses of the datasets and confirmed that they are freely usable for this work.
1.1 D.1: List of methods with shortened names
- \(ES_{15}\) - ECE method with equal-size binning and 15 bins; EW for equal-width binning;
- \(ES_{sweep}\) - ECE method with equal-size binning and the sweep method for choosing the number of bins;
- \(ES_{CV}\) - ECE method with equal-size binning and cross-validation for choosing the number of bins;
- Platt - Platt scaling;
- beta - beta calibration;
- isotonic - isotonic calibration;
- Scaling-Binning - scaling-binning method;
- TempS/VecS - temperature/vector scaling;
- MSODIR - matrix scaling with off-diagonal and intercept regularisation;
- dirODIR/dirL2 - Dirichlet calibration with ODIR and L2 regularisation;
- 1-Temp - temperature scaling learned in one-vs-rest fashion for binary calibration;
- Spline (natural/cubic/parabolic) - spline calibration with different methods;
- \(IOP_{OI}\) and \(IOP_{OP}\) - intra-order-preserving functions, order-invariant and order-preserving variants;
- \(PL_{NN}^{CE}\) and \(PL_{NN}^{MSE}\) - piecewise linear method with cross-entropy or MSE loss;
- \(PL3^{CE}\) - piecewise linear method in the logit-logit space;
- \(PL_{DE}\) - piecewise linear method based on least squares fitting with differential evolution; \(PL_{DE}^{2}\) - degree 2 for quadratic curves;
- KDE - kernel density estimation;
- KCE - kernel calibration error.
1.2 D.2: Pseudo-real experiments
The pseudo-real dataset is called CIFAR-5m (Nakkiran et al., 2021) and contains over 5 million synthetic 32x32-pixel images similar to CIFAR-10 (Krizhevsky et al., 2009). These images were created by sampling a DDPM model (Ho et al., 2020) trained on the CIFAR-10 data. The pseudo-real data were used to get close to real data while retaining the ability to estimate the true calibration map very precisely.
To estimate the true calibration map on 1 million hold-out data points, 3 different calibration methods were used: (1) isotonic calibration; (2) equal-size binning with 100 bins and flat (i.e. slope 0) bin tops; (3) equal-size binning with 100 bins and slope-1 tops (i.e. tops of the bins parallel to the main diagonal). Only minor differences across the different ‘true’ calibration maps occurred (see Tables 6, 10 and 11).
To set up the experiments, the datasets were calibrated with various 2-class calibration methods (beta, isotonic, Platt, ScalingBinning), multiclass calibrators (TempS, VecS, MSODIR, dirODIR, IOP, Spline), the piecewise linear methods PL3 and \(PL_{NN}\) with cross-entropy loss, and the method \(ES_{sweep}\) adapted from calibration evaluation to calibration map fitting. Table 5 compares the results. The calibration is done in one-vs-rest fashion to obtain results for a 2-class problem. We have 1 confidence calibration task and 3 one-vs-rest subtasks (car, cat, dog), for 6 calibration methods on 3 models, in total 72 combinations. The 6 calibration methods were chosen as the best methods in Table 2 of the main article: beta, vector scaling, Platt, PL3, ScalingBinning, isotonic.
1.3 D.3: Synthetic experiments
For the synthetic data, only the predicted probabilities \(\hat{p}\), labels y, and corresponding calibrated probabilities \(c^*_f(\hat{p})\) were generated. The synthetic data were generated based on five different base shapes (Fig. 8). The first 4 shapes were chosen to mimic likely scenarios of calibration maps, with combinations of over- and under-confidence for values below and above 0.5. The 5th shape ‘stairs’ was added as a more challenging shape that crosses the identity function in two places.
From each shape, multiple variants that we refer to as ‘derivates’ were generated by linearly mixing the shape with the identity function in different proportions. The derivates were generated to obtain datasets with different expected calibration errors. To generate a synthetic dataset from a particular derivate, first the calibrated probabilities \(c^*_f(\hat{p})\) were sampled from a base distribution. Then the labels y were sampled from the calibrated probabilities. Finally, the derivate was used to transform the calibrated probabilities \(c^*_f(\hat{p})\) into the corresponding predicted probabilities \(\hat{p}\). Therefore, the construction of derivates creates the inverse of the calibration map, and the calibration map itself can then be obtained by inverting this function. It might seem more intuitive to first sample the predicted probabilities \(\hat{p}\) and then use the true calibration map to find the corresponding calibrated probabilities \(c^*_f(\hat{p})\). However, the less intuitive approach used here has a benefit that the other approach does not: by first sampling \(c^*_f(\hat{p})\) and the labels, we can generate synthetic datasets that have exactly the same set of labels but different predicted probabilities. This allows us to mimic a realistic scenario where several models have been trained on the same dataset and we would like to choose the model with the lowest calibration error, e.g. when ranking different post-hoc calibration methods. This fact has been used to generate Table 27. The five base shapes were defined by the following functions from \(c^*_f(\hat{p})\) to \(\hat{p}\):
- \(square(x)=x^2\)
- \(sqrt(x)=\sqrt{x}\)
- \(beta1(x)=1 / (1 + 1 / (e^c\cdot x^a / (1-x)^b)), \text { where }a=0.4,b=0.45,c=b\ln {(0.6)}-a\ln (0.4)\)
- \(beta2(x)=1/(1+1/(e^c\cdot x^a/(1-x)^b)), \text { where } a=2, b=2.2, c=b\ln {(0.52)}-a\ln (0.48)\)
- \(stairs(x)=stairs\_helper(x+1/3) - stairs\_helper(1/3), \text { where } stairs\_helper(x)= step(step(3x\pi ))/(3\pi ), \text { and } step(x)=x-\sin (x)\)
Synthetic data were generated for data sizes 1000, 3000 and 10000 with 5 different data seeds. The calibrated predictions \(c^*_f(\hat{p})\) were sampled from the uniform distribution. Derivates with expected absolute calibration errors \(0.00,0.005, 0.01,\dots ,0.10\) were generated for each of the 5 base shapes. In total, we used 3 data sizes, 5 random seeds, 5 base shapes, and 21 derivates per shape. In Fig. 8, two examples of derivates for the "stairs" function are also shown. Note that the derivates have been obtained by linear mixing with the identity function in the horizontal direction, since the inverse calibration maps are created first.
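The sampling procedure described above can be sketched as follows (the helper name and the exact mixing parametrisation are our assumptions for illustration):

```python
import numpy as np

def gen_synthetic(shape_fn, mix, n, seed=0):
    """Sample a synthetic dataset from a 'derivate' of a base shape.

    shape_fn maps a calibrated probability c* to a predicted probability,
    and mix in [0, 1] linearly mixes the shape with the identity function
    (mix = 0 gives a perfectly calibrated model)."""
    rng = np.random.default_rng(seed)
    c_star = rng.uniform(size=n)                         # calibrated probabilities
    y = (rng.uniform(size=n) < c_star).astype(int)       # labels sampled from c*
    p_hat = (1 - mix) * c_star + mix * shape_fn(c_star)  # derivate: c* -> p-hat
    return p_hat, y, c_star

square = lambda x: x ** 2
# With a fixed seed, every derivate shares the same labels, so different
# 'models' on identical data can be compared, as described above.
p_hat, y, c_star = gen_synthetic(square, mix=0.5, n=3000)
p_id, y_id, c_id = gen_synthetic(square, mix=0.0, n=3000)
```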
For Figs. 1 and 5 in the main article 3000 data points were generated in the way described above. Uniform distribution and the base function beta2 were used.
1.4 D.4: Real experiments
The real datasets are CIFAR-10/100 (Krizhevsky et al., 2009), each containing 60k images of size 32x32 pixels with 10 or 100 classes. Each of the datasets has 5k validation and 10k test instances. In total, there are 10 model-dataset combinations, and the model outputs have been calibrated using 5 different methods (TempS (Guo et al., 2017), VecS (Guo et al., 2017), MS_ODIR (Guo et al., 2017), dir_L2 (Kull et al., 2019), dir_ODIR (Kull et al., 2019)) on the validation set. This gives us 50 combinations of model and calibration method. To enable a comparison with the synthetic data, we extracted test data subsets of sizes 1000 (10 sets), 3000 (3 sets) and 10000 (1 set). The number of sets is further multiplied by 5, as there were 5 different calibration methods. In total, we got 500 sets with 1000 instances, 150 sets with 3000 instances, and 50 sets with 10000 instances.
Appendix E: Visualizations of calibration
1.1 E.1: Comparisons of reliability diagrams
Figures 9, 10, 11, 12 and 13 depict comparisons between different reliability diagrams obtained on synthetic data with 10000 instances, a true calibration error of 0.1, and different calibration functions. The middle three diagrams of every figure are depicted with slope-1 bin tops (CE estimation by fitting). Therefore, the diagrams for \(ES_{sweep}\) might not seem monotonic, but they would be if plotted with the classic flat bin tops. Based on the figures, the piecewise linear method follows the true calibration map best.
1.2 E.2: Calibration maps in the logit-logit scale
As proved in the main paper, the temperature scaling method applied in binary classification fits a calibration map which is linear in the logit-logit scale. Furthermore, here we show that the beta calibration method fits a calibration map which is approximately piecewise linear in the logit-logit scale with 2 pieces.
Consider the calibration map family of beta calibration, \(\hat{c}(\hat{p})=\frac{1}{1+1/\left( e^c\frac{\hat{p}^a}{(1-\hat{p})^b}\right) }\). Changing the y-axis to the logit scale, we get
$$\mathrm {logit}(\hat{c}(\hat{p}))=c+a\ln \hat{p}-b\ln (1-\hat{p}).$$
We can rewrite it in two ways:
$$\mathrm {logit}(\hat{c}(\hat{p}))=c+a\,\mathrm {logit}(\hat{p})+(a-b)\ln (1-\hat{p})=c+b\,\mathrm {logit}(\hat{p})+(a-b)\ln \hat{p}.$$
For low values of \(\hat{p}\) near 0, the term \((a-b)\ln (1-\hat{p})\) in the first way of writing is nearly zero, and thus we get a linear approximation in the logit-logit scale with slope a. For high values of \(\hat{p}\) near 1, the term \((a-b)\ln \hat{p}\) in the second way of writing is nearly zero, and thus we get a linear approximation in the logit-logit scale with slope b. This can be clearly seen visually in Fig. 14, where the breakpoint between the two pieces is at \(\hat{p}=0.5\) and near this point the two linear segments are interpolated non-linearly.
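This two-slope behaviour can be verified numerically with finite differences in the logit-logit scale (the parameter values below are illustrative, not fitted):

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

a, b, c = 0.4, 0.45, 0.1   # illustrative beta-calibration parameters

def beta_map(p):
    return 1 / (1 + 1 / (np.exp(c) * p ** a / (1 - p) ** b))

# Finite-difference slopes of logit(beta_map(p)) against logit(p)
p = np.array([1e-6, 1e-5, 1 - 1e-5, 1 - 1e-6])
z = logit(beta_map(p))
slope_lo = (z[1] - z[0]) / (logit(p[1]) - logit(p[0]))  # near 0: approx a
slope_hi = (z[3] - z[2]) / (logit(p[3]) - logit(p[2]))  # near 1: approx b
```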
Fig. 7a from the main article shows a case from the pseudo-real dataset. As seen in the figure on the right, PL3 has been able to approximate most of the true calibration map quite well in the logit-logit space piecewise linearly, with two long linear segments (and a third short segment to the right of \(1-10^{-6}\)). In contrast, beta calibration has failed to capture these segments because it is bound to have the breakpoint between the two segments at \(\hat{p}=0.5\).
Appendix F: Results
1.1 F.1: Results of pseudo-real experiments
Here is a brief guide to how we have arranged the tables with the results:
- Table 5. Pseudo-Real: Calibration method comparison;
- Tables 6 and 7. Pseudo-Real: Calibration Maps - absolute and square errors;
- Tables 8 and 9. Pseudo-Real: ECE - absolute and square errors;
- Tables 10, 11, 12 and 13. Pseudo-Real: Different ground-truths;
- Tables 14 and 15. Pseudo-Real: Comparison of different numbers of instances;
- Tables 16 and 17. Pseudo-Real: CE and MSE comparison, including \(PL_{DE}\) with degree 1 and 2;
- Table 28. Real: biases.
1.1.1 F.1.1: Results of estimating the reliability diagram with \(\vert \hat{c}_\mathcal {M}-c^*\vert\) and \(\vert \hat{c}_\mathcal {M}-c^*\vert ^2\)
The following paragraphs present the results of estimating the reliability diagram with \(\vert \hat{c}_\mathcal {M}-c^*\vert\) and \(\vert \hat{c}_\mathcal {M}-c^*\vert ^2\) on CIFAR-5m.
Results on CIFAR-5m indicate that the performance of the measures varies across methods. However, beta calibration and the piecewise linear methods PL and PL3 with cross-entropy loss are in the lead. Beta calibration has an average rank of 4.6, \(PL3^{CE}\) of 4.2, \(PL_{NN}^{MSE}\) of 3.4 and \(PL_{NN}^{CE}\) of 1.2 (Table 6). In the case of the quadratic loss, \(PL_{NN}^{CE}\) and \(PL_{NN}^{MSE}\) perform the best, followed by \(IOP_{OI}\), \(PL3^{CE}\), \(ES_{10}\) and beta calibration (Table 7). Note that \(PL_{NN}^{MSE}\) performs well; however, it is clearly outperformed by \(PL_{NN}^{CE}\).
The number of data instances (Table 14) does not change the overall ordering much. However, with more data instances, all methods get closer to the true calibration map. Additionally, as stated in the main article, beta calibration works very well with smaller data sizes (1000 and 3000), and so does \(IOP_{OI}\).
Next, the cross-entropy loss seems to be beneficial for both the PL3 and \(PL_{NN}\) methods (Table 16). The method \(PL_{DE}\) with degree 1 performs better than with degree 2. Overall, \(PL_{DE}\) is mostly behind PL3 and \(PL_{NN}\) with the CE loss.
As expected, equal-size (ES) binning methods performed much better than equal-width (EW) binning methods (Table 18). Furthermore, comparing \(ES_{sweep}\) with \(ES_{CV}\), the CV variant gets better results 8 times out of 12, and thus also a better average rank of 8.8 compared to 11.3 for the sweep variant (Table 6).
The standard deviations (Table 22) of the calibration measures are very similar; only \(PL_{DE}\) and KDE perform worse than the other methods. Interestingly, the standard deviations are generally higher for DenseNet40 on the confidence dataset.
Different estimated ground-truths (isotonic; equal-size binning with flat tops; with slope-1 tops) of the CIFAR-5m data result in only minor differences (Tables 6, 10, 11).
F.1.2: Results of estimating the true calibration error with \(\vert ECE_\mathcal {M}-CE\vert\) and \(\vert ECE_\mathcal {M}-CE\vert ^2\)
The following paragraphs present the results of estimating the true calibration error with \(\vert ECE_\mathcal {M}-CE\vert\) and \(\vert ECE_\mathcal {M}-CE\vert ^2\) on CIFAR-5m.
\(ECE_{15}\) (among all ES) and \(PL3^{CE}\) are the best at estimating the calibration error, with average ranks of 3.8 and 5.2 (Table 8). For the quadratic loss, the best measures are \(PL_{NN}^{CE}\) with an average rank of 2.8 and \(IOP_{OI}\) with an average rank of 4.2 (Table 9).
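To make the \(\vert ECE_\mathcal {M}-CE\vert\) criterion concrete, the sketch below simulates data from a hypothetical true calibration map \(c^*(p)=p^2\) (an illustrative choice, not one of the paper's maps), computes the true calibration error \(CE\), and measures how far a 15-bin equal-width ECE estimate is from it.

```python
import numpy as np

rng = np.random.default_rng(0)

def c_star(p):
    # hypothetical true calibration map of an over-confident model
    return p ** 2

n = 10_000
preds = rng.uniform(0.0, 1.0, n)  # the model's predicted probabilities
labels = (rng.uniform(0.0, 1.0, n) < c_star(preds)).astype(float)

# true calibration error: E|p - c*(p)| over the prediction distribution
true_ce = np.mean(np.abs(preds - c_star(preds)))

# ECE_15: equal-width binning estimate of the calibration error
n_bins = 15
edges = np.linspace(0.0, 1.0, n_bins + 1)
idx = np.clip(np.searchsorted(edges, preds, side="right") - 1, 0, n_bins - 1)
ece_15 = sum((idx == b).mean()
             * abs(preds[idx == b].mean() - labels[idx == b].mean())
             for b in range(n_bins) if (idx == b).any())

# the criterion compared in these tables
estimation_error = abs(ece_15 - true_ce)
```

For uniform predictions and this map, \(CE = \mathbb{E}[p - p^2] = 1/6\), so the sketch also shows why simulated data with a known \(c^*\) makes the criterion computable at all.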
The number of data instances (Table 15) does not change the ranking for most methods, similarly to estimating the true calibration map. Comparing the methods, \(ES_{sweep}\), beta calibration, and \(IOP_{OI}\) work better with less data, while \(ES_{20}\), \(ES_{25}\), and \(ES_{CV}\) work better with more data. Furthermore, all methods again become more accurate with more data.
The cross-entropy loss is beneficial for PL3, but \(PL_{NN}\) with MSE loss is better at estimating \(\vert ECE_\mathcal {M}-CE\vert\) (Table 17). Again, \(PL_{DE}\) with degree 1 outperforms degree 2. Among the piecewise linear methods, PL3 with CE loss is the best at estimating \(\vert ECE_\mathcal {M}-CE\vert\).
For estimating \(\vert ECE_\mathcal {M}-CE\vert\), the equal-size binning methods performed much better than the equal-width binning methods (Table 19); the only exception is \(ES_{10}\), where equal-size performs similarly to equal-width binning. Comparing \(ES_{sweep}\) with \(ES_{CV}\), \(ES_{sweep}\) is better at the absolute error, while \(ES_{CV}\) is better at the quadratic error (Tables 8 and 9).
The standard deviations (Table 23) of the calibration measures are very similar; only \(PL_{DE}\) and KDE perform worse than the other methods.
Different estimated ground-truths (isotonic; equal-size binning with flat tops; with slope-1 tops) of the CIFAR-5m data result in only minor differences; however, they do change which method is ranked first (Tables 12, 13 and Table 4 in the main article).
F.2: Results of synthetic experiments
The results of the synthetic experiments indicate that beta calibration and Platt scaling are very good when the parametric family is able to approximate \(c^*\) closely (Table 24). However, these methods can fail on more difficult shapes (e.g. ‘stairs’). The piecewise linear methods follow in performance, and the binning methods come last. To achieve a good score on all shapes, the piecewise linear methods should be picked. The repeating results for isotonic calibration in Table 24 are not a bug, but a peculiarity arising from the way the synthetic data were generated.
The performance with regard to \(\vert ECE_\mathcal {M}-CE\vert\) is similar to that for \(\vert \hat{c}_\mathcal {M}-c^*\vert\) (Table 25). The best measure is \(PL_{DE}\), followed by Platt scaling and \(ES_{CV}\). Again, Platt scaling and beta calibration fail on the ‘stairs’ dataset.
In Table 26, the performance of estimating \(\vert ECE_\mathcal {M}-CE\vert ^2\) is compared, so that KCE could also be included; the unbiased version of KCE with the RBF kernel is used. KCE performs worse than the other selected methods.
In Table 27, Spearman correlation is used to assess how well the measures are able to rank different models. All measures perform very well; the only exception is ‘stairs’, where Platt scaling and beta calibration are not as good.
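As a sketch of this ranking comparison, Spearman correlation between true and estimated calibration errors can be computed as Pearson correlation on ranks. The numbers below are made up for illustration; they are not values from Table 27.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation (assumes distinct values, so no tie handling)."""
    # argsort of argsort turns values into 0-based ranks
    rx = np.argsort(np.argsort(x)) - (len(x) - 1) / 2.0
    ry = np.argsort(np.argsort(y)) - (len(y) - 1) / 2.0
    # Pearson correlation of the centered ranks
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

# hypothetical true calibration errors of 5 post-hoc calibrated models
true_errors = [0.010, 0.025, 0.040, 0.055, 0.070]
# estimates from a measure that swaps the ranks of two adjacent models
estimated_errors = [0.012, 0.024, 0.052, 0.050, 0.080]
```

One adjacent swap among five models gives a correlation of 0.9, which illustrates why most measures score close to 1 in Table 27: only the ordering matters, not the magnitude of the estimates.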
F.3: Results of real experiments
Table 28 also shows the biases on the real data, to be compared with the pseudo-real results. To make the comparison as valid as possible, the pseudo-real data is restricted to the subset with the same architectures (ResNet110, WideNet32, DenseNet40) calibrated with the same methods (TempS, VecS, MSODIR, dirL2, dirODIR). The only difference is that the real experiments are trained and evaluated on CIFAR-10, while the pseudo-real experiments are trained on the CIFAR-5m dataset. To calculate the average bias, one needs to subtract the average true calibration error from the average estimates (the same constant for each value in the row). For the real data, we do not know this constant, but we have subtracted a value which makes the row's average bias match the corresponding row's average bias on the pseudo-real data. Thus, the true average biases are a constant shift away from these. However, as this does not affect the ranks, we can still compare whether the ranking of biases across different methods agrees between the pseudo-real and real data. There is strong agreement, increasing confidence that the methods that are stronger on pseudo-real data in Tables 6 and 8 are also stronger on real data.
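The constant-shift alignment described above can be sketched as follows. The numbers are made up for illustration; they are not values from Table 28.

```python
import numpy as np

# hypothetical average ECE estimates of four measures on real data (one row
# of the bias table); the true calibration error on real data is unknown
real_estimates = np.array([0.042, 0.038, 0.051, 0.045])

# average biases of the same measures on pseudo-real data, where the true
# calibration error is known and can be subtracted directly
pseudo_real_biases = np.array([0.006, 0.001, 0.013, 0.008])

# subtract the constant that makes the row's average bias match the
# pseudo-real row; the true biases differ from these by a constant shift
shift = real_estimates.mean() - pseudo_real_biases.mean()
aligned_biases = real_estimates - shift
```

Because the same constant is subtracted from every entry in the row, the ranking of the biases across measures is preserved, which is exactly what makes the rank comparison between real and pseudo-real data valid.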
Figure 15 depicts how different evaluation methods order models by calibration. The results are shown for 10 different real model-dataset combinations, with a test set size of 10k for each. For each model-dataset combination, there are 5 models obtained by applying post-hoc calibration methods.
The figure shows that in all cases the choice of the evaluation method affects the ordering of the models. For example, for resnet110SD_c100, beta calibration, isotonic regression, and Platt scaling rank vector scaling (VecS) as the most calibrated model, while the other methods rank temperature scaling (TempS) as the most calibrated.
F.4: Running time of the experiments
The experiments on the pseudo-real data combined 13 calibration maps (6 were used for the final results), 5 seeds, 3 models, 3 data sizes, and 4 two-class experiments (3 one-vs-rest and 1 confidence), for 2340 combinations in total; running all the ECE methods, isotonic, beta, Platt, kernel methods, and PL methods took around 6500 h. The experiments on the synthetic data combined 5 calibration maps, 5 seeds, 21 derivates, and 3 data sizes; running all the ECE methods, isotonic, beta, Platt, kernel methods, and PL methods took around 5000 h. The experiments on the real data combined 11 model-dataset combinations, three data sizes with different numbers of subsets (10 in total), and 5 calibration methods, for 700 combinations in total, which took under 350 h. The scripts were run in a high-performance computing center using CPU processing power (Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz) with up to 6 GB of RAM.
Appendix G: Limitations and future work
While the piecewise linear methods in both the original probability space and in the logit-logit space have improved the state-of-the-art as calibrators and as evaluators of calibration, there is room for improvement: (1) we have not experimented on non-neural classification methods to identify whether PL, PL3, or some other method would work well there; (2) when estimating calibration error, PL and PL3 would benefit from debiasing methods to further improve performance; (3) the running time of PL and PL3 should be reduced, e.g. by considering fewer bins (usually 5 or fewer bins were used) and using 3- or 5-fold CV instead of 10-fold CV; (4) new pseudo-real datasets should be created from generative models trained on datasets other than CIFAR-10, so that true calibration maps could be estimated and calibration methods compared; and (5) new calibration map families could be created based on results with new pseudo-real datasets.
Kängsepp, M., Valk, K. & Kull, M. On the usefulness of the fit-on-test view on evaluating calibration of classifiers. Mach Learn 114, 105 (2025). https://doi.org/10.1007/s10994-024-06652-6