1 Introduction

In this paper, we address the challenge of training multiclass classifiers using weakly labeled data. A weak label refers to a label that does not explicitly indicate the true class of a sample, but instead provides a discrete variable statistically related to the class or identifies a subset of candidate classes.

Building on prior research (Cid-Sueiro, 2012; Van Rooyen & Williamson, 2018; Chiang & Sugiyama, 2023; Chen et al., 2023, 2024; Iacovissi et al., 2023), we adopt a general framework unifying various partial supervision problems, such as learning with noisy, complementary, supplementary, or partial labels, as well as positive-unlabeled (PU) learning and unlabeled-unlabeled (UU) learning.

Many algorithms proposed for weak labels adapt traditional supervised learning loss functions to handle weak labels via loss correction. Our focus is on methods that employ a statistical model to relate classes and weak labels (Jin & Ghahramani, 2002; Xiao et al., 2015; Feng et al., 2020; Katsura & Uchida, 2021; Ishida et al., 2017; Xu et al., 2021), typically via a transition probability matrix that describes how weak labels are generated from true classes (Ishida et al., 2019; Yoshida et al., 2021) or vice versa (Menon et al., 2015; Scott et al., 2013). Although this model may not always be explicitly defined (as in (Grandvalet, 2002)), the effectiveness of some methods depends heavily on the underlying weak label generation process. In this paper, we will not address the important challenge of estimating this model, which is sometimes tackled by assuming the availability of a few noise-free labels (Xiao et al., 2015; Yu et al., 2018; Hendrycks et al., 2018) or anchor points (Patrini et al., 2017; Yao et al., 2020) or using corrupted labels only (Ghosh et al., 2015; Katz-Samuels et al., 2019). Other methods select losses that are relatively robust to uncertainty in the model (Ghosh et al., 2015; Cid-Sueiro et al., 2014).

The transition matrix has been used to construct two main types of losses: (1) losses based on the linear transformation of standard supervised losses (Natarajan et al., 2013; Cid-Sueiro, 2012; Van Rooyen & Williamson, 2018; Yoshida et al., 2021), and (2) losses defined on probabilistic predictions of weak labels, derived from the linear transformation of probabilistic class predictions (Sukhbaatar et al., 2014; Yu et al., 2018; Patrini et al., 2017). Patrini et al. (2017) defines the former as backward corrected and the latter as forward corrected losses (hereafter, simply backward and forward losses).

Despite extensive work on loss correction, a systematic comparative analysis between forward and backward losses is noticeably absent in the literature. Although some empirical evidence suggests that forward losses tend to outperform backward losses (see e.g., (Patrini et al., 2017)), this observation lacks a comprehensive theoretical and experimental validation for general weak-label models.

The purpose of this paper is twofold. First, we introduce a unifying family of losses that generalizes both forward and backward losses, encompassing them as special cases. Second, leveraging this framework, we conduct a theoretical and empirical comparative analysis, providing evidence for the superiority of forward losses.

Our main contributions are the following:

  • We define a family of forward-backward losses encompassing forward and backward losses as special cases. Additionally, we show that some types of reweighting schemes can also be formulated within this framework.

  • We establish sufficient conditions under which forward-backward losses are proper, ranking-calibrated or classification-calibrated, and identify conditions ensuring convexity and lower-boundedness.

  • We present a theoretical and experimental analysis demonstrating that proper forward losses yield higher accuracy and lower variance in probability estimates than any other proper loss in the family.

Although our analysis shows that forward proper losses consistently outperform others, the general formulation contributes toward a broader characterization of losses for learning from weak labels, an important step toward a general theory that is still lacking.

The paper is organized as follows: Sect. 2 reviews related work. Sect. 3 formulates the problem and defines the loss functions. Sect. 4 gives conditions for proper forward-backward losses. Sect. 5 analyzes ranking and classification calibration. Sect. 6 states some error bounds for the minimization of forward-backward losses. Sect. 7 presents some comparative experiments. Finally, we state some conclusions in Sect. 8.

2 Related work

Unified approaches to learning from arbitrary weak label models date back to the general formulations of backward losses in (Cid-Sueiro, 2012; Cid-Sueiro et al., 2014; Van Rooyen & Williamson, 2018). General models for maximum likelihood estimation (an instance of forward correction) can be found in Perello-Nieto et al. (2020); Chen et al. (2023, 2024). General approaches for binary classification (including learning from noisy labels, PU learning and semi-supervised learning) can be found in (Xie et al., 2024) for AUC optimization and (Gong et al., 2022) for margin-based classifiers.

Chiang and Sugiyama (2023) integrated up to 15 different scenarios into a probabilistic framework supporting both discrete weak labels and confidence scores. Their framework also introduces a risk-rewrite formulation that facilitates backward correction, and a novel “marginal chain method” applicable to all these scenarios. An even more general perspective appears in (Iacovissi et al., 2023), which situates label correction, forward/backward correction, and importance reweighting within the broader context of data corruption, including corrupted input features (e.g., concept drift).

While these contributions show that the same strategy (forward, backward, marginal chain, importance reweighting) can be applied to different corruption types, our work moves in the direction of integrating different methods (forward, backward and, to some extent, marginal chain) into a unified family of correction techniques.

To our knowledge, no systematic comparison of forward and backward correction has been published. Patrini et al. (2017) first observed the inferior performance of backward correction in noisy label scenarios, a finding corroborated by (Ma et al., 2018; Ding et al., 2018; Lukasik et al., 2020). In (Chou et al., 2020), a case study on complementary labels shows that while backward losses provide unbiased risk estimators, their negative components lead to high variance, over-fitting and reduced accuracy compared to forward losses and other methods. Similarly, Ishida et al. (2019) (referring to backward correction as Free) proposed a gradient ascent (GA) approach to mitigate negative loss effects, yet both Free and GA underperformed relative to forward losses in complementary and partial label settings (Feng et al., 2020).

The inferior performance of backward losses is often attributed to their negative components, which cause overfitting. Techniques such as training control (e.g., GA), enforcing non-negativity (Kiryo et al., 2017; Lu et al., 2020), or minimizing upper bounds (Feng et al., 2020) have improved performance but are seldom compared directly to forward losses. Our experiments explore various weak label scenarios, avoiding the pitfalls of negative loss components by building on ideas from (Van Rooyen & Williamson, 2018; Yoshida et al., 2021). Nevertheless, our theoretical analysis shows that even under ideal training conditions, the variance of posterior probability estimates with backward losses cannot be lower than that of forward losses, suggesting that optimization issues alone do not explain their inferior performance.

3 Formulation

3.1 Notation

Vectors are written in boldface, matrices in boldface capitals, and sets in calligraphic letters. \(|{{\mathcal {A}}}|\) is the cardinality of a finite set \({\mathcal {A}}\). For vectors, the superscript \(^\top\) denotes transposition, and \(\odot\) and \(\oslash\) denote pointwise multiplication and division, respectively. When \(\textbf{v}\) is a vector, \(\log (\textbf{v})\) and \(\exp (\textbf{v})\) denote the component-wise logarithm and exponential, respectively. For any matrix \(\textbf{A}\), \(\text {tr}(\textbf{A})\) is its trace, \(\Vert \textbf{A}\Vert\) its Frobenius norm and \(\Vert \textbf{A}\Vert _1\) its \(L_1\) norm.

For any integer n, \(\textbf{e}_i^n\) is a unit vector of dimension n with all zero components apart from the i-th component which is equal to one, and \(\mathbbm {1}_n\) is an all-ones vector with dimension n. The superscript may be omitted if it is clear from the context.

We will use \(\Psi\), \(\varvec{\phi }\) to denote loss functions. The number of classes is c, and the number of possible weak label vectors is d. The set of all \(d\times c\) matrices with stochastic columns, that is, the set of \(d\times c\) left-stochastic matrices is \(\mathcal {M} = \{\textbf{M} \in [0,1]^{d\times c}: \textbf{M}^\top \mathbbm {1}_d =\mathbbm {1}_c\}\), and the simplex of the probability vectors of dimension d is \(\mathcal {P}_{d} = \{\textbf{p}\in [0,1]^{d}: \textbf{p}^\top \mathbbm {1}_{d} =1\}\).

3.2 Learning from weak labels

Let \({{\mathcal {X}}}\) be a sample space, \({{\mathcal {Y}}}\) a finite set of c target categories, and \({{\mathcal {W}}}\) a finite set of \(d \ge c\) weak categories. Sample \((\textbf{x}, \varvec{\omega }) \in {{\mathcal {X}}} \times {{\mathcal {W}}}\) is drawn from an unknown distribution P.

We encode target categories as one-hot vectors: \({{\mathcal {Y}}} = \{\textbf{e}_j^c, j=0,\ldots ,c-1\}\). The goal is to learn a predictor of the target class \(\textbf{y}\in {{\mathcal {Y}}}\) given \(\textbf{x}\), using a weakly labeled dataset \({{\mathcal {S}}}=\{(\textbf{x}_k,\varvec{\omega }_k)\}_{k=0}^{n-1}\) of independent samples from P.

The interpretation of \({{\mathcal {Y}}}\) and \({{\mathcal {W}}}\) varies by application. This general formulation accommodates diverse partial supervision scenarios, with particular focus on cases where categories in \({{\mathcal {W}}}\) correspond to subsets of \({{\mathcal {Y}}}\). Examples include:

  • Clean labels: In this case, \({{\mathcal {W}}} = {{\mathcal {Y}}}\) and \(\varvec{\omega }=\textbf{y}\) with probability 1.

  • Noisy labels  (Raykar et al., 2010): \({{\mathcal {W}}} = {{\mathcal {Y}}}\) but \(P\{\varvec{\omega } \ne \textbf{y}\}>0\).

  • Complementary labels  (Ishida et al., 2017): \({{\mathcal {W}}} = {{\mathcal {Y}}}\) but \(P\{\varvec{\omega } \ne \textbf{y}\} = 1\).

  • Clean labels and unlabeled data: \({{\mathcal {W}}} = {{\mathcal {Y}}} \cup \{\textbf{0}\}\), where \(\varvec{\omega }=\textbf{0}\) when the target class is unknown.

  • Positive-Unlabeled (PU) data: \({{\mathcal {W}}} = \{(0, 1), (1, 1)\}\).

  • Partial labels (Cour et al., 2011; Jin & Ghahramani, 2002; Ambroise et al., 2001; Grandvalet & Bengio, 2004): each label is a set of candidate target categories, only one of them being true. In this case, each element in \({{\mathcal {W}}}\) is a non-empty subset of \({{\mathcal {Y}}}\).
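The transition matrices \(\textbf{M}\) induced by several of these scenarios can be written down explicitly. The following sketch (using numpy for illustration; the corruption rates `rho` and `alpha` are arbitrary choices, assuming feature-independent, symmetric corruption) builds a few of them:

```python
import numpy as np

c = 3  # number of classes

# Clean labels: W = Y and omega = y with probability 1.
M_clean = np.eye(c)

# Symmetric noisy labels: correct with prob 1 - rho, else uniform over the rest.
rho = 0.3
M_noisy = (1 - rho) * np.eye(c) + rho / (c - 1) * (np.ones((c, c)) - np.eye(c))

# Complementary labels: omega != y always, uniform over the c - 1 other classes.
M_comp = (np.ones((c, c)) - np.eye(c)) / (c - 1)

# Clean labels and unlabeled data: d = c + 1; unlabeled with probability alpha.
alpha = 0.5
M_semi = np.vstack([(1 - alpha) * np.eye(c), alpha * np.ones((1, c))])

# All transition matrices are left-stochastic: each column sums to one.
for M in (M_clean, M_noisy, M_comp, M_semi):
    assert np.allclose(M.sum(axis=0), 1.0)
```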

For convenience, we represent weak categories as one-hot vectors. Given an ordering \({{\mathcal {W}}}=\{\varvec{\omega }_0,\ldots ,\varvec{\omega }_{d-1}\}\), we define the one-hot encoding \({{\mathcal {Z}}}=\{\textbf{e}_i^d\}_{i=0}^{d-1}\), and denote by \(\textbf{z}=\textbf{e}_i^d\) the one-hot label corresponding to \(\varvec{\omega }_i\).

To summarize, we will use the following notation for the class variables:

  • \(\textbf{y} \in {{\mathcal {Y}}}\): the target class, represented as a one-hot vector.

  • \(\textbf{z} \in {{\mathcal {Z}}}\): the weak class, represented as a one-hot vector.

Thus, learning from weak labels consists in training a predictor of the target class \(\textbf{y} \in {{\mathcal {Y}}}\) given sample \(\textbf{x}\), using a weakly labeled dataset \({{\mathcal {S}}} = \{(\textbf{x}_k, \textbf{z}_k), k=0,\ldots ,n-1\}\) whose labels are elements of \({{\mathcal {Z}}}\).

Without loss of generality, we assume that \({{\mathcal {Z}}}\) contains only weak labels with nonzero probability (\(P(\textbf{z})>0\)). The statistical dependency between \(\textbf{z}\) and \(\textbf{y}\) is modeled through an arbitrary \(d\times c\) transition matrix \(\textbf{M}(\textbf{x})\in \mathcal {M}\) of conditional probabilities

$$\begin{aligned} m_{ij}(\textbf{x}) = P\{z_i=1 | y_j=1,\textbf{x} \} \end{aligned}$$
(1)

Defining the posteriors \(\textbf{p}(\textbf{x})\) and \(\varvec{\eta }(\textbf{x})\) with components \(p_i=P\{z_i=1|\textbf{x}\}\) and \(\eta _j=P\{y_j=1|\textbf{x}\}\), we can write \(\textbf{p}(\textbf{x}) = \textbf{M}(\textbf{x}){\varvec{\eta }}(\textbf{x})\). In general, the dependency on \(\textbf{x}\) will be omitted and we will write, for instance,

$$\begin{aligned} \textbf{p} = \textbf{M}{\varvec{\eta }}. \end{aligned}$$
(2)

If \(\textbf{M}\) is independent of the features \(\textbf{x}\), the relation between the random variables \(\textbf{x}\), \(\textbf{y}\) and \(\textbf{z}\) can be represented through the graphical model in Fig. 1.

Fig. 1: The graphical model describing the weak label generation process

The transition matrix is central to the design of loss functions. This paper is mostly concerned with the determination of loss functions for a given transition matrix; the issue of estimating \(\textbf{M}\) from the data is out of our scope.

The feature-independence assumption is not required for most of our theoretical analysis, except for the error bound in Sect. 6. Most of our experiments assume a known, feature-independent transition matrix. A further discussion on the estimation of the transition matrix can be found in Sect. 8.1.

3.3 Classification calibration, ranking calibration and properness

The goal of the learning algorithm is to find an accurate class predictor using a weakly labeled set. The predictor computes a score vector \(\textbf{f}= g(\textbf{x}) \in {{\mathcal {F}}}\), where \({{\mathcal {F}}} \subset {\mathbb {R}}^c\) is the hypothesis space, and a class prediction \(\hat{\textbf{y}} = \mathop {\textrm{argmax}}\limits _{\textbf{y}\in {{\mathcal {Y}}}} \{\textbf{y}^\top \textbf{f}\}\). When proper losses are used, we will require probabilistic scores, so that \({{\mathcal {F}}}= {{\mathcal {P}}}_c\).

A weak loss is any function \(\Psi :\mathcal {Z} \times {{\mathcal {F}}} \rightarrow {\mathbb {R}}\). For any loss function, \(\Psi (\textbf{z}, \textbf{f})\), we will use an alternative vector representation, by defining

$$\begin{aligned} \varvec{\Psi }(\textbf{f}) = (\Psi (\textbf{e}_0^d, \textbf{f}), \Psi (\textbf{e}_1^d, \textbf{f}), \ldots , \Psi (\textbf{e}_{d-1}^d, \textbf{f}))^\top \end{aligned}$$
(3)

so that \(\Psi (\textbf{z}, \textbf{f}) = \textbf{z}^\top \varvec{\Psi }(\textbf{f})\) for all \(\textbf{z}\in \mathcal {Z}\) and, using (2), the expected loss becomes

$$\begin{aligned} \mathbb {E}_\textbf{z}\{\Psi (\textbf{z},\textbf{f})\} = \varvec{\eta }^\top \textbf{M}^\top \varvec{\Psi }(\textbf{f}) \end{aligned}$$
(4)

The dimension of a loss is the dimension of its vector representation: d for a weak loss, and c for a standard supervised loss.
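To make the vector representation (3) and the expected loss (4) concrete, here is a small numerical sketch using numpy; the weak loss `Psi` below is an arbitrary placeholder for illustration, not one of the losses studied later:

```python
import numpy as np

rng = np.random.default_rng(0)
c, d = 3, 4

def loss_vector(Psi, f, d):
    """Vector representation (3): stack Psi(e_i, f) over all d weak labels."""
    return np.array([Psi(np.eye(d)[i], f) for i in range(d)])

# A hypothetical weak loss, just to exercise the identities.
Psi = lambda z, f: float(z @ np.arange(1, d + 1)) * float(f @ f)

M = rng.dirichlet(np.ones(d), size=c).T     # random left-stochastic d x c matrix
eta = rng.dirichlet(np.ones(c))             # class posterior
f = rng.dirichlet(np.ones(c))               # candidate score vector

Psi_vec = loss_vector(Psi, f, d)
# Expected loss (4): E_z{Psi(z, f)} = eta^T M^T Psi(f)
expected = eta @ M.T @ Psi_vec
# Equivalent direct computation, weighting by the weak posteriors p = M eta
p = M @ eta
assert np.isclose(expected, sum(p[i] * Psi(np.eye(d)[i], f) for i in range(d)))
```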

We are interested in conditions ensuring that the expected loss is minimized when the classifier is calibrated. We consider three types of calibration. The first requires that the predicted scores coincide with the posterior class probabilities.

Definition 1

(Proper loss) Weak loss \(\Psi (\textbf{z},\textbf{f})\) is \(\textbf{M}\)-proper if, for any \(\varvec{\eta }\in {{\mathcal {P}}}_c\),

$$\begin{aligned} \varvec{\eta }\in \arg \min _{\textbf{f}\in \mathcal {P}_c} \varvec{\eta }^\top \textbf{M}^\top \varvec{\Psi }(\textbf{f}), \end{aligned}$$
(5)

The loss is strictly \(\textbf{M}\)-proper if \(\varvec{\eta }\) is the unique minimizer.

A second type of calibration requires that the class scores preserve the order of the class posterior probabilities.

Definition 2

(Ranking calibration) The weak loss \(\Psi (\textbf{z},\textbf{f})\) is \(\textbf{M}\)-ranking calibrated (or \(\textbf{M}\)-RC) if, for any \(\varvec{\eta }\in {{\mathcal {P}}}_c\), any \(\textbf{f}^* \in \arg \min _\textbf{f}\varvec{\eta }^\top \textbf{M}^\top \varvec{\Psi }(\textbf{f})\) satisfies (\(\eta _i> \eta _j \Rightarrow f_i^* > f_j^*\)).

Finally, classification calibration requires that both the classifier scores and the posterior class probabilities provide the same class predictions:

Definition 3

(Classification calibration) The weak loss \(\Psi (\textbf{z},\textbf{f})\) is \(\textbf{M}\)-classification calibrated (or \(\textbf{M}\)-CC) if, for any \(\varvec{\eta }\in {{\mathcal {P}}}_c\), any \(\textbf{f}^* \in \arg \min _\textbf{f}\varvec{\eta }^\top \textbf{M}^\top \varvec{\Psi }(\textbf{f})\) satisfies (\(\eta _i> \max _{j\ne i} \eta _j \Rightarrow f_i^* > \max _{j\ne i} f_j^*\)).

3.4 Forward, backward and forward-backward losses

The losses discussed in this section are defined as transformations of a loss used for supervised learning, that we will name the base loss.

3.4.1 Backward loss

A backward loss is any linear transformation of a base loss.

Definition 4

(Backward loss) Weak loss \(\varvec{\Psi }(\textbf{f})\) is a backward loss for a weak label set \({\mathcal {Z}}\), if

$$\begin{aligned} \varvec{\Psi }(\textbf{f}) = \textbf{B}^\top \varvec{\phi }(\textbf{f}) \end{aligned}$$
(6)

for some c-dimensional loss \(\varvec{\phi }\) and some \(c\times d\) matrix \(\textbf{B}\), where \(d=|{{\mathcal {Z}}}|\).

In (Van Rooyen & Williamson, 2018), \(\textbf{B}\) is named the reconstruction matrix as it reverts the effect of the transition matrix. In (Cid-Sueiro et al., 2014), it is named a virtual label matrix because its columns play the same role as target classes in gradient-based learning algorithms. Here we refer to \(\textbf{B}\) simply as the backward matrix.

Backward losses have been proposed for noisy labels (Natarajan et al., 2013; Menon et al., 2015; Patrini et al., 2017), complementary labels (Ishida et al., 2019), multi-complementary labels (Feng et al., 2020), PU labels (Du Plessis et al., 2015), unlabeled-unlabeled (UU) data Lu et al. (2020), and general weak label models (Natarajan et al., 2013; Cid-Sueiro, 2012; Cid-Sueiro et al., 2014; Van Rooyen & Williamson, 2018; Yoshida et al., 2021).

3.4.2 Forward loss

Similarly, the forward losses can be defined as follows:

Definition 5

(Forward loss) Weak loss \(\varvec{\Psi }(\textbf{f})\) is a forward loss for a weak label set \({\mathcal {Z}}\), if

$$\begin{aligned} \varvec{\Psi }(\textbf{f}) = \varvec{\phi }(\textbf{F}\textbf{f}) \end{aligned}$$
(7)

for some d-dimensional loss \(\varvec{\phi }\) and some \(d\times c\) matrix \(\textbf{F}\), where \(d=|{{\mathcal {Z}}}|\).

When \(\textbf{F}=\textbf{M}\) and the base loss is proper, the optimization of the forward loss can be carried out in two steps: (1) estimate the posterior weak label probabilities (\(\textbf{p} = \textbf{M}\varvec{\eta }\)) with loss \(\varvec{\phi }(\textbf{p})\) from the data, and (2) compute the posterior class probabilities via the pseudoinverse \(\hat{\varvec{\eta }} =\textbf{M}^+\hat{\textbf{p}}\) (see the classifier-consistent method in (Feng et al., 2020)).
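The two-step procedure can be sketched numerically (using numpy for illustration; the complementary-label matrix, posterior and sample size are arbitrary choices, and step 1 is reduced to empirical frequency estimation for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)
c = d = 3
# Complementary-label transition matrix: uniform off-diagonal.
M = (np.ones((c, c)) - np.eye(c)) / (c - 1)

eta = np.array([0.5, 0.3, 0.2])   # true class posterior
p = M @ eta                        # weak label posterior, p = M eta

# Step 1: estimate p from weak labels (here, simple empirical frequencies).
z = rng.choice(d, size=200_000, p=p)
p_hat = np.bincount(z, minlength=d) / len(z)

# Step 2: recover the class posteriors via the pseudoinverse of M.
eta_hat = np.linalg.pinv(M) @ p_hat
assert np.allclose(eta_hat, eta, atol=0.02)
```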

The cross entropy loss,

$$\begin{aligned} \varvec{\phi }(\textbf{p}) = - \log (\textbf{p}) \end{aligned}$$
(8)

is the most common choice, making forward loss minimization equivalent to the maximum likelihood estimation of the model parameters (Zhang et al., 2019), often solved by means of the Expectation-Maximization (EM) algorithm, as in (Perello-Nieto et al., 2020).
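As an illustration, the forward cross-entropy loss for a complementary-label matrix can be checked numerically to be minimized, in expectation, at \(\textbf{f}=\varvec{\eta }\); the brute-force grid search over the simplex, the matrix and the posterior below are arbitrary choices made only for this check:

```python
import numpy as np

def forward_ce(z_idx, f, M, eps=1e-12):
    """Forward cross-entropy: Psi(z, f) = -log((M f)_z)."""
    return -np.log((M @ f)[z_idx] + eps)

# Complementary labels, c = 3 (M is invertible, so the loss is strictly proper).
M = (np.ones((3, 3)) - np.eye(3)) / 2
eta = np.array([0.6, 0.3, 0.1])

def risk(f):
    """Expected forward loss under the weak-label distribution p = M eta."""
    p_true = M @ eta
    return sum(p_true[i] * forward_ce(i, f, M) for i in range(3))

# Grid search over the probability simplex: the minimizer should be near eta.
grid = [np.array([a, b, 1 - a - b])
        for a in np.linspace(0, 1, 51)
        for b in np.linspace(0, 1 - a, 51)]
best = min(grid, key=risk)
assert np.allclose(best, eta, atol=0.05)
```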

Forward losses are closely related to some re-weighting methods (Wu et al., 2023), (Lv et al., 2020) and (Feng et al., 2020), which are based on the iterative minimization of a loss \(\Psi (\textbf{z}, \textbf{f}) = \textbf{q}^\top \varvec{\phi }(\textbf{f})\), where \(\textbf{q}\) is an estimate of the posterior class probabilities, given \(\textbf{z}\) and given the current model. For the cross-entropy in (8), this loss can be derived as the E-step of the EM algorithm (see (Perello-Nieto et al., 2020)). Appendix A further discusses this connection.

3.4.3 Forward-backward loss

Forward-backward losses are a straightforward extension of forward and backward losses, combining a forward and a backward matrix.

Definition 6

(Forward-backward loss) Weak loss \(\varvec{\Psi }(\textbf{f})\) is a forward-backward loss for a weak label set \({\mathcal {Z}}\), if

$$\begin{aligned} \varvec{\Psi }(\textbf{f}) = \textbf{B}^\top \varvec{\phi }(\textbf{F} \textbf{f}) \end{aligned}$$
(9)

for some m-dimensional loss \(\varvec{\phi }\), some \(m \times d\) matrix \(\textbf{B}\) and some \(m \times c\) matrix \(\textbf{F}\).

Forward-backward losses have potential applications in scenarios where label corruption arises from a cascade of two noisy processes, such that the transition matrix can be factorized as \(\textbf{M}= \textbf{M}_l \textbf{M}_r\). In such cases, forward-backward losses could theoretically address the effects of \(\textbf{M}_l\) through the backward component and \(\textbf{M}_r\) through the forward component.

By unifying forward and backward losses into a common framework, we can jointly analyze their properties and compare their theoretical and practical advantages. Forward-backward losses form the basis of our subsequent analysis. In the following sections, we examine conditions under which these losses are \(\textbf{M}\)-proper, \(\textbf{M}\)-RC, or \(\textbf{M}\)-CC.

4 Proper forward-backward losses

The following theorem provides sufficient conditions under which a forward-backward loss is proper.

Theorem 1

Let \(\varvec{\phi }(\textbf{q})\), \(\textbf{q}\in {{\mathcal {P}}}_k\) be a strictly proper loss with dimension \(k\ge c\), and let \(\varvec{\Psi }(\textbf{f})\) be a forward-backward loss with forward and backward matrices \(\textbf{F}\) and \(\textbf{B}\), respectively.

If \(\textbf{F}\) is left-stochastic with rank c and \(\textbf{F}=\textbf{B} \textbf{M}\), then \(\varvec{\Psi }(\textbf{f})\) is strictly \(\textbf{M}\)-proper.

Proof

If \(\textbf{F}= \textbf{B}\textbf{M}\) we have \(\varvec{\Psi }(\textbf{f}) = \textbf{B}^\top \varvec{\phi }(\textbf{B}\textbf{M}\textbf{f})\). Consider the solution set

$$\begin{aligned} {{\mathcal {B}}}&= \mathop {\textrm{argmin}}\limits _{\textbf{f}} \left\{ \varvec{\eta }^\top \textbf{M}^\top \varvec{\Psi }(\textbf{f}) \right\} \nonumber \\&= \mathop {\textrm{argmin}}\limits _{\textbf{f}} \left\{ \varvec{\eta }^\top \textbf{F}^\top \varvec{\phi }(\textbf{F}\textbf{f}) \right\} \end{aligned}$$
(10)

Since \(\textbf{F}\) is left-stochastic, \(\textbf{F}\varvec{\eta }\) is a stochastic vector and, thus, since \(\phi\) is strictly proper,

$$\begin{aligned} {{\mathcal {B}}}&= \left\{ \textbf{f}\mid \textbf{F}\textbf{f}= \textbf{F}\varvec{\eta }\right\} \end{aligned}$$
(11)

Since \(\textbf{F}\) has rank c, \(\textbf{F}\textbf{f}= \textbf{F}\varvec{\eta }\) iff \(\textbf{f}= \varvec{\eta }\) and, thus, \({{\mathcal {B}}} = \{\varvec{\eta }\}\), which proves that \(\varvec{\Psi }\) is strictly \(\textbf{M}\)-proper. \(\square\)

Th. 1 shows that, for any arbitrary choice of \(\textbf{B}\) such that \(\textbf{B} \textbf{M}\) is left stochastic with rank c, the loss

$$\begin{aligned} \varvec{\Psi }(\textbf{f}) = \textbf{B}^\top \varvec{\phi }(\textbf{B} \textbf{M} \textbf{f}) \end{aligned}$$
(12)

is strictly proper. The theorem generalizes some published results on forward and backward \(\textbf{M}\)-proper losses:

  • Taking \(\textbf{B}=\textbf{M}^+\) where \(\textbf{M}^+\) is any left inverse of \(\textbf{M}\), we get \(\varvec{\Psi }(\textbf{f}) = \left( \textbf{M}^+\right) ^\top \varvec{\phi }(\textbf{f})\), which is a general expression for backward losses (Cid-Sueiro, 2012).

  • Taking \(\textbf{B}=\textbf{I}\) we get a general expression \(\varvec{\Psi }(\textbf{f}) = \varvec{\phi }(\textbf{M}\textbf{f})\) for forward losses (Ghosh et al., 2015).
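For instance, taking \(\textbf{B}\) as the inverse of a square, invertible complementary-label matrix, the backward cross-entropy yields an unbiased estimate of the clean risk. A minimal numerical check (numpy, with arbitrary \(\varvec{\eta }\) and \(\textbf{f}\) chosen only for illustration):

```python
import numpy as np

# Backward correction for complementary labels (c = 3):
# Psi(f) = (M^+)^T phi(f), with phi the cross-entropy.
M = (np.ones((3, 3)) - np.eye(3)) / 2
B = np.linalg.inv(M)   # M is square and invertible here, so B = M^{-1}

def backward_ce(z_idx, f):
    phi = -np.log(f)            # vector of per-class cross-entropy losses
    return (B.T @ phi)[z_idx]   # backward-corrected weak loss

# Unbiasedness: E_z{Psi(z, f)} = eta^T (B M)^T phi(f) = eta^T phi(f),
# i.e., the weak risk equals the clean supervised risk.
eta = np.array([0.5, 0.3, 0.2])
f = np.array([0.4, 0.35, 0.25])
p = M @ eta
weak_risk = sum(p[i] * backward_ce(i, f) for i in range(3))
clean_risk = eta @ (-np.log(f))
assert np.isclose(weak_risk, clean_risk)
```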

Note that, as a consequence of Th. 1, if the weak labels are produced by a cascade of two corruption processes, that is, \(\textbf{M}= \textbf{M}_l\textbf{M}_r\), the loss \(\varvec{\Psi }(\textbf{f}) = \textbf{B}_l^\top \varvec{\phi }(\textbf{M}_r \textbf{f})\), where \(\textbf{B}_l\) is a left inverse of \(\textbf{M}_l\), is proper. Therefore, the decontamination processes can be potentially carried out through the combination of a backward and a forward component in the weak loss.
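A numerical sketch of this cascade construction (numpy; the two symmetric-noise factors and the posterior are arbitrary choices), using the cross entropy as base loss; a grid search confirms that the expected loss is minimized near \(\textbf{f}=\varvec{\eta }\):

```python
import numpy as np

def noise(rho, c=3):
    """Symmetric label-noise matrix: correct with prob 1 - rho."""
    return (1 - rho) * np.eye(c) + rho / (c - 1) * (np.ones((c, c)) - np.eye(c))

M_l, M_r = noise(0.2), noise(0.3)
M = M_l @ M_r                     # cascade of two corruption processes
B_l = np.linalg.inv(M_l)          # left inverse of the first factor

def Psi(z_idx, f, eps=1e-12):
    # Forward-backward loss: Psi(f) = B_l^T phi(M_r f), phi = cross-entropy
    return (B_l.T @ (-np.log(M_r @ f + eps)))[z_idx]

eta = np.array([0.6, 0.3, 0.1])
p = M @ eta

def risk(f):
    return sum(p[i] * Psi(i, f) for i in range(3))

grid = [np.array([a, b, 1 - a - b])
        for a in np.linspace(0, 1, 41)
        for b in np.linspace(0, 1, 41) if a + b <= 1]
best = min(grid, key=risk)
assert np.allclose(best, eta, atol=0.05)
```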

4.1 Convexity

The convexity of an \(\textbf{M}\)-proper forward-backward loss given by (12) depends on the backward matrix. In general, the convexity is preserved by any left-stochastic \(\textbf{B}\).

Theorem 2

Let \(\varvec{\Psi }\) be a forward-backward proper loss given by (12). If \(\varvec{\phi }\) is convex in \({{\mathcal {P}}}_c\) and \(\textbf{B}\) is left stochastic, then \(\varvec{\Psi }\) is convex.

Proof

The proof is straightforward: \(\varvec{\Psi }\) is the composition of the convex function \(\varvec{\phi }\) with the linear map \(\textbf{f}\mapsto \textbf{B}\textbf{M}\textbf{f}\), followed by a linear combination with nonnegative coefficients (the entries of \(\textbf{B}^\top\) are nonnegative because \(\textbf{B}\) is left stochastic); both operations preserve convexity. \(\square\)

In particular, for a forward loss, \(\textbf{B}=\textbf{I}\), which is left stochastic and, thus, the loss is convex.

Theorem 2 shows how to construct forward-backward losses that preserve convexity. However, it is generally not applicable to backward losses because, except in trivial cases (e.g., diagonal transition matrices), no left inverse of \(\textbf{M}\) is left-stochastic.

Nonetheless, Van Rooyen and Williamson (2018) have shown that convexity can be preserved for composite backward losses: if \(\textbf{f}= \varvec{\kappa }(\textbf{v})\), where \(\varvec{\kappa }\) is the inverse link function (Williamson et al., 2016), the composite backward loss \(\textbf{M}_\text {li}^\top \varvec{\phi }(\varvec{\kappa }(\textbf{v}))\) is a convex function of \(\textbf{v}\) for an appropriate choice of the left inverse. Extending this result to forward-backward losses is not straightforward.

4.2 Lower bounded losses

If \(\varvec{\phi }\) is proper and lower-bounded, the forward \(\textbf{M}\)-proper loss \(\varvec{\phi }(\textbf{M}\textbf{f})\) is also lower-bounded. This is not true in general for backward losses because the backward matrix, as a left inverse of a stochastic matrix, typically contains negative entries. Consequently, if \(\varvec{\phi }\) is not upper-bounded (as the cross entropy in (8)), the empirical risk is not lower-bounded, leading to overfitting (Sugiyama et al., 2022). Different types of training tricks (Kiryo et al., 2017; Ishida et al., 2019; Lu et al., 2020) or modifications of the cross entropy (Yoshida et al., 2021) have been proposed to address this problem.

Note, however, that any loss satisfying Theorem 2 with a lower-bounded \(\varvec{\phi }\) is also lower-bounded. Thus, incorporating a forward component can mitigate negative contributions of the backward matrix and ensure boundedness.
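This contrast can be illustrated numerically for complementary labels (numpy sketch; the score sequences are arbitrary). Weak label 0 excludes class 0, and as \(f_0\rightarrow 0\) the backward cross-entropy decreases without bound, while the forward loss stays nonnegative:

```python
import numpy as np

M = (np.ones((3, 3)) - np.eye(3)) / 2   # complementary labels, c = 3
B = np.linalg.inv(M)                     # equals 11^T - 2I: has negative entries

def backward_ce(z_idx, f, eps=1e-12):
    return (B.T @ (-np.log(f + eps)))[z_idx]

def forward_ce(z_idx, f, eps=1e-12):
    return -np.log((M @ f)[z_idx] + eps)

# Push f_0 towards 0 (the direction suggested by weak label 0): the backward
# loss diverges to -infinity, while the forward loss remains lower-bounded.
vals = []
for t in (1e-2, 1e-4, 1e-6):
    f = np.array([t, (1 - t) / 2, (1 - t) / 2])
    vals.append(backward_ce(0, f))
    assert forward_ce(0, f) >= 0.0
assert vals[0] > vals[1] > vals[2] and vals[2] < -10
```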

4.3 Optimizing the backward matrix

Although any pair of matrices \(\textbf{B}\) and \(\textbf{F}\) (satisfying \(\textbf{B}\textbf{M}= \textbf{F}\)) defines a proper loss, the choice has a strong impact on training performance. This raises the problem of selecting the optimal pair. In this section, we show that, for proper losses, theoretical arguments favor forward losses.

In general, the optimal choice may depend on \(\varvec{\eta }\), but we can optimize the selection for a given \(\varvec{\eta }\), following a procedure similar to that proposed in (Bacaicoa-Barber et al., 2021) for backward losses. To do so, assume \({{\mathcal {S}}} = \{\textbf{z}_k, k=0,\ldots ,n-1\}\) is a set of i.i.d. samples with probabilities \(p_i = P\{\textbf{z}_k= \textbf{e}_i^d\}= (\textbf{e}_i^d)^\top \textbf{M}\varvec{\eta }\), for some transition matrix \(\textbf{M}\) and some \(\varvec{\eta }\in {{\mathcal {P}}}_c\). To estimate \(\varvec{\eta }\) from \({{\mathcal {S}}}\), we can minimize the empirical risk based on a strictly \(\textbf{M}\)-proper forward-backward loss of the form (12), that is,

$$\begin{aligned} \textbf{f}^*&= \mathop {\textrm{argmin}}\limits _\textbf{f}\sum _{k=0}^{n-1} \textbf{z}_k^\top \textbf{B}^\top \varvec{\phi }(\textbf{F}\textbf{f}) \end{aligned}$$
(13)

Since \(\varvec{\phi }\) is strictly proper

$$\begin{aligned} \textbf{f}^*&= \textbf{F}^\ell \textbf{B}\overline{\textbf{p}} \end{aligned}$$
(14)

where \(\textbf{F}^\ell\) is any left inverse of \(\textbf{F}\) and

$$\begin{aligned} \overline{\textbf{p}} = \frac{1}{n} \sum _{k=0}^{n-1} \textbf{z}_k \end{aligned}$$
(15)

(that is, \(\overline{\textbf{p}}\) is a sample estimate of the weak label priors).

Noting that

$$\begin{aligned} {\mathbb {E}}\{\textbf{F}^\ell \textbf{B}\textbf{z}\} = \textbf{F}^\ell \textbf{B}\textbf{M}\varvec{\eta }= \textbf{F}^\ell \textbf{F}\varvec{\eta }= \varvec{\eta }\end{aligned}$$
(16)

we can see that \(\textbf{F}^\ell \textbf{B}\textbf{z}\) (and, thus, \(\textbf{F}^\ell \textbf{B}\overline{\textbf{p}}\)) is an unbiased estimate of \(\varvec{\eta }\). Therefore, we can select \(\textbf{B}\) and \(\textbf{F}^\ell\) in such a way that the variance of the estimate is minimized. Noting that

$$\begin{aligned} {\mathbb {E}}\{\Vert \textbf{F}^\ell \textbf{B}\textbf{z}&- \varvec{\eta }\Vert ^2\} = {\mathbb {E}}\{(\textbf{F}^\ell \textbf{B}\textbf{z} - \varvec{\eta })^\top (\textbf{F}^\ell \textbf{B}\textbf{z} - \varvec{\eta })\} \nonumber \\&= {\mathbb {E}}\{\textbf{z}^\top \textbf{B}^\top \textbf{F}^{\ell \top }\textbf{F}^\ell \textbf{B}\textbf{z}\} - 2\varvec{\eta }^\top \textbf{F}^\ell \textbf{B}{\mathbb {E}}\{\textbf{z}\} + \varvec{\eta }^\top \varvec{\eta }\nonumber \\&= \text {tr}\{\textbf{B}^\top \textbf{F}^{\ell \top }\textbf{F}^\ell \textbf{B}\varvec{\Delta }_{\textbf{p}}\} - 2\varvec{\eta }^\top \textbf{F}^\ell \textbf{B}\textbf{M}\varvec{\eta }+ \varvec{\eta }^\top \varvec{\eta }\nonumber \\&= \text {tr}\{\textbf{B}^\top \textbf{F}^{\ell \top }\textbf{F}^\ell \textbf{B}\varvec{\Delta }_{\textbf{p}}\} - \varvec{\eta }^\top \varvec{\eta }\end{aligned}$$
(17)

where \(\varvec{\Delta }_{\textbf{p}}\) is a diagonal matrix with the components of \({\mathbb {E}}\{\textbf{z}\}\) in the diagonal, and taking into account that the second term in (17) does not depend on \(\textbf{B}\), we can solve the optimization problem

$$\begin{aligned}&\min _{\textbf{B},\textbf{F}} \left\{ \text {tr}\{\textbf{B}^\top \textbf{F}^{\ell \top }\textbf{F}^\ell \textbf{B}\varvec{\Delta }_{\textbf{p}}\} \right\} \nonumber \\&\text {subject to } \textbf{BM} = \textbf{F}\text { and } \mathbbm {1}_m^\top \textbf{F}=\mathbbm {1}_c^\top \end{aligned}$$
(18)

Appendix B.2 shows that any pair of matrices \(\textbf{F}\) (left-stochastic) and \(\textbf{B}\) satisfying

$$\begin{aligned} \textbf{F}^\ell \textbf{B}= \left( \textbf{M}^\top \varvec{\Delta }_{\textbf{p}}^{-1} \textbf{M}\right) ^{-1} \textbf{M}^\top \varvec{\Delta }_{\textbf{p}}^{-1} \end{aligned}$$
(19)

is a solution to this problem.

This result has two key implications:

  • Any pair \((\textbf{F}, \textbf{B}^*)\), where \(\textbf{F}\) is an arbitrary left-stochastic matrix with rank c and

    $$\begin{aligned} \textbf{B}^* = \textbf{F}\left( \textbf{M}^\top \varvec{\Delta }_{\textbf{p}}^{-1} \textbf{M}\right) ^{-1} \textbf{M}^\top \varvec{\Delta }_{\textbf{p}}^{-1} \end{aligned}$$
    (20)

    is optimal. In particular, for \(\textbf{F}=\textbf{I}\), this is the solution proposed in (Bacaicoa-Barber et al., 2021) for backward losses.

  • The pair \((\textbf{F}, \textbf{B}) = (\textbf{M}, \textbf{I})\) is optimal (as the right-hand side of (19) is a left inverse of \(\textbf{M}\)). That is, forward proper losses are optimal.
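These claims can be checked numerically (numpy sketch with a random transition matrix and posterior, chosen only for illustration): both the weighted inverse in (19) and the plain Moore-Penrose pseudoinverse of \(\textbf{M}\) (the backward choice \(\textbf{F}=\textbf{I}\), \(\textbf{B}=\textbf{M}^+\)) are left inverses of \(\textbf{M}\), hence unbiased, but the former attains a variance objective (18) that is never larger:

```python
import numpy as np

rng = np.random.default_rng(2)
c, d = 3, 4
M = rng.dirichlet(np.ones(d), size=c).T   # random left-stochastic d x c matrix
eta = rng.dirichlet(np.ones(c))
Delta = np.diag(M @ eta)                  # diagonal of weak-label posteriors

def objective(A):
    """Variance term of (17), with A = F^l B: tr{A Delta A^T}."""
    return np.trace(A @ Delta @ A.T)

Dinv = np.linalg.inv(Delta)
A_opt = np.linalg.inv(M.T @ Dinv @ M) @ M.T @ Dinv   # the solution (19)
A_pinv = np.linalg.pinv(M)                            # plain Moore-Penrose inverse

# Both revert M (unbiasedness), but the weighted inverse has lower variance.
assert np.allclose(A_opt @ M, np.eye(c))
assert np.allclose(A_pinv @ M, np.eye(c))
assert objective(A_opt) <= objective(A_pinv) + 1e-9
```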

Note that, though all pairs satisfying (19) minimize the variance, they are not equivalent in practice, since \(\varvec{\Delta }_\textbf{p}\) depends on the unknown posterior weak label probabilities. As in (Bacaicoa-Barber et al., 2021), this can be mitigated by replacing these posteriors with the weak label priors. As the experiments will show, this choice usually outperforms other choices of the forward-backward loss, but it loses optimality.

On the other hand, forward losses are optimal without requiring any knowledge of the weak label probabilities. As a consequence, they tend to outperform any other choice of the forward-backward loss, as we will see in the experiments.

5 RC and CC forward-backward losses

In order to characterize RC and CC forward-backward losses, the concepts of order-preserving and max-preserving transformations will be essential.

Definition 7

(Order-preserving matrix) A square matrix \(\textbf{A}\) is order-preserving if the linear transformation \(\textbf{y} = \textbf{A} \textbf{x}\) preserves the order of the components, that is, for any \(i, j\), \(x_i < x_j\) iff \(y_i < y_j\).

Definition 8

(Max-preserving matrix) A square matrix \(\textbf{A}\) is max-preserving if the linear transformation \(\textbf{y} = \textbf{A} \textbf{x}\) preserves the component of the maximum, that is, for any i, \(x_i = \max _{j} x_j\) iff \(y_i = \max _j y_j\).

The following lemma shows that order-preserving and max-preserving matrices are equivalent and can be characterized by a general formula.

Lemma 1

Let \(\textbf{A}\) be a square \(d\times d\) matrix. The following conditions are equivalent:

  1. \(\textbf{A}\) is order-preserving
  2. \(\textbf{A}\) is max-preserving
  3. \(\textbf{A} = \lambda \textbf{I} + \mathbbm {1}_d \textbf{v}^\top\) for some \(\lambda > 0\) and some \(\textbf{v}\in \mathbb {R}^d\).

Proof

See Appendix B.3. \(\square\)
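The characterization in Lemma 1, and the closure under inversion stated in Lemma 2 below, are easy to check numerically. The following sketch builds a matrix of the form \(\lambda \textbf{I} + \mathbbm {1}_d \textbf{v}^\top\) with arbitrary (hypothetical) \(\lambda\) and \(\textbf{v}\) and verifies that it, and its inverse, preserve order and argmax:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
lam = 2.5                          # any λ > 0 (hypothetical value)
v = rng.normal(size=d)             # any v ∈ R^d
A = lam * np.eye(d) + np.outer(np.ones(d), v)   # A = λ I + 1 v^T (Lemma 1)

x = rng.normal(size=d)
y = A @ x                          # y_i = λ x_i + v·x : a common shift plus scaling

# Order of the components is preserved...
assert np.array_equal(np.argsort(x), np.argsort(y))
# ...and so is the position of the maximum (max-preserving)
assert np.argmax(x) == np.argmax(y)

# Lemma 2: the inverse is also order-preserving
x2 = np.linalg.inv(A) @ y
assert np.array_equal(np.argsort(x2), np.argsort(x))
```

The check makes the intuition explicit: \(\textbf{A}\textbf{x} = \lambda \textbf{x} + (\textbf{v}^\top \textbf{x})\mathbbm {1}_d\) rescales by a positive factor and adds a common constant, neither of which can reorder components.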

Using the above lemma, we can prove the following:

Lemma 2

If \(\textbf{A}\) is order-preserving and non-singular, its inverse is also order-preserving.

Proof

See Appendix B.4. \(\square\)

The following theorem generalizes a previous result in (Cid-Sueiro et al., 2014) for backward losses, to forward-backward losses, and provides a general formula for RC and CC losses.

Theorem 3

Let \(\varvec{\phi }(\textbf{q})\), \(\textbf{q}\in {{\mathbb {R}}^c}\), be an RC/CC loss, and let \(\textbf{B}\) be a matrix such that

$$\begin{aligned} \textbf{B} \textbf{M} = \beta \textbf{I} + \mathbbm {1}_c \textbf{b}^\top \end{aligned}$$
(21)

for some \(\textbf{b}\in \mathbb {R}^c\) and some \(\beta > 0\). Also, let \(\textbf{F}\) be a non-singular square matrix of the form

$$\begin{aligned} \textbf{F} = \lambda \textbf{I} + \mathbbm {1}_c \textbf{w}^\top \end{aligned}$$
(22)

for some \(\textbf{w}\in \mathbb {R}^c\) and some \(\lambda > 0\). Then, the forward-backward loss \(\varvec{\Psi }(\textbf{f}) = \textbf{B}^\top \varvec{\phi }(\textbf{F}\textbf{f})\) is \(\textbf{M}\)-RC/CC.

Proof

Note that, by Lemma 1, and taking into account Eqs. (21) and (22), both \(\textbf{B} \textbf{M}\) and \(\textbf{F}\) are order- and max-preserving matrices.

Let \(\textbf{f}^*\) be a risk minimizer, that is

$$\begin{aligned} \textbf{f}^*&\in \arg \min _\textbf{f}\varvec{\eta }^\top \textbf{M}^\top \varvec{\Psi }(\textbf{f}) = \arg \min _\textbf{f}\varvec{\eta }^\top \textbf{M}^\top \textbf{B}^\top \varvec{\phi }(\textbf{F}\textbf{f}) \end{aligned}$$
(23)

Since \(\textbf{F}\) is non-singular, we can write \(\textbf{f}^*=\textbf{F}^{-1} \textbf{v}^*\) where

$$\begin{aligned} \textbf{v}^*&\in \arg \min _\textbf{v} \varvec{\eta }^\top \textbf{M}^\top \textbf{B}^\top \varvec{\phi }(\textbf{v}) \end{aligned}$$
(24)

Assume \(\eta _i > \eta _j\). Since \(\textbf{B} \textbf{M}\) is order-preserving, \((\textbf{B} \textbf{M}\varvec{\eta })_i > (\textbf{B} \textbf{M}\varvec{\eta })_j\) and, thus, if \(\phi\) is RC, \(v_i > v_j\). By Lemma 2, since \(\textbf{F}\) is order preserving, so is \(\textbf{F}^{-1}\) and, thus, \(v_i > v_j\) implies \(f_i > f_j\) and, thus, \(\varvec{\Psi }\) is \(\textbf{M}\)-RC.

Assuming \(\eta _i > \eta _j\) for all \(j\ne i\), the same argument shows that, if \(\varvec{\phi }\) is CC, then \(\varvec{\Psi }\) is \(\textbf{M}\)-CC. \(\square\)
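The forward-backward construction \(\varvec{\Psi }(\textbf{f}) = \textbf{B}^\top \varvec{\phi }(\textbf{F}\textbf{f})\) can be sketched in a few lines of NumPy. The example below is a hypothetical illustration: it takes the cross-entropy as the base loss, a random column-stochastic transition matrix, and compares the forward choice (\(\textbf{F}=\textbf{M}\), \(\textbf{B}=\textbf{I}\)) with a backward choice \(\textbf{B}=\textbf{M}^+\) (the pseudoinverse), which satisfies (21) with \(\beta = 1\), \(\textbf{b}=\textbf{0}\):

```python
import numpy as np

def ce(q):
    """Base loss φ: component j is the cross-entropy -log q_j."""
    return -np.log(np.clip(q, 1e-12, None))

def fb_loss(f, F, B):
    """Forward-backward loss Ψ(f) = B^T φ(F f); component z is the
    loss incurred when weak label z is observed."""
    return B.T @ ce(F @ f)

rng = np.random.default_rng(2)
c, d = 3, 4
M = rng.dirichlet(np.ones(d), size=c).T          # d x c transition matrix (hypothetical)
f = rng.dirichlet(np.ones(c))                    # probabilistic class prediction

fwd = fb_loss(f, M, np.eye(d))                   # forward:  F = M, B = I
bwd = fb_loss(f, np.eye(c), np.linalg.pinv(M))   # backward: F = I, B = M^+

# Both produce one loss value per weak label, and M^+ M = I satisfies (21)
assert fwd.shape == (d,) and bwd.shape == (d,)
assert np.allclose(np.linalg.pinv(M) @ M, np.eye(c))
```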

6 Error bound

Even though the backward matrix may have negative components, we can establish the consistency of learning when the base loss \(\varvec{\phi }\) is lower- and upper-bounded.

We consider the function space \(\mathcal {F} = \{ f: \textbf{x}\mapsto {\mathbb {R}}^c\}\). For proper losses, the space of functions should be restricted to the simplex (i.e., \(\textbf{f}= f(\textbf{x})\in {{\mathcal {P}}}_c\)). However, this restriction does not affect the analysis presented here, so we will keep it in this general form. The c-valued function space can be decomposed into its components \(\mathcal {F} = \bigoplus _{i=0}^{c-1}\mathcal {F}_i\).

A learning algorithm is consistent if, as the sample size \(n\rightarrow \infty\),

$$\begin{aligned} f_n = \mathop {\textrm{argmin}}\limits _f \hat{R}(f) = \mathop {\textrm{argmin}}\limits _f \frac{1}{n} \sum _{k=0}^{n-1} \Psi (\textbf{z}_k, f(\textbf{x}_k)) \end{aligned}$$
(25)

and

$$\begin{aligned} f^* = \mathop {\textrm{argmin}}\limits _f R(f) = \mathop {\textrm{argmin}}\limits _f {\mathbb {E}}_P \left[ \Psi (\textbf{z}, f(\textbf{x}))\right] \end{aligned}$$
(26)

satisfy \(R(f_n)\rightarrow R(f^*)\) as \(n\rightarrow \infty\).

Theorem 4

Let \(\phi (\textbf{f})\) be a nonnegative L-Lipschitz loss bounded from above by M. Then, for any \(\delta > 0\), with probability at least \(1-\delta\)

$$\begin{aligned} R(f_n) - R(f^*) \le 4 \sqrt{2} L \left\| \textbf{B}\right\| \left\| \textbf{F}\right\| \sum _{i=0}^{c-1}{\mathfrak {R}}_{n}(\mathcal {G}_i) + 4 h M \Vert \textbf{B}\Vert _1 \sqrt{\frac{\log \frac{2}{\delta }}{2n}} \end{aligned}$$
(27)

where \({\mathfrak {R}}_n(\mathcal {G})\) is the Rademacher complexity for a sample size \(n\) and a function class \(\mathcal {G}\).

Proof

See Appendix B.5 \(\square\)

Since for many function classes (e.g. neural networks with bounded norm) the Rademacher complexities \({\mathfrak {R}}_{n}(\mathcal {G}_i)\) are \(\mathcal {O}(1/\sqrt{n})\) (Golowich et al., 2019), this theorem proves risk-consistency when the base loss is upper- and lower-bounded. If the base loss is strictly proper, this further implies classification-consistency.

However, this result cannot be trivially extended to all types of base losses: if the base loss is not upper bounded (e.g. for the cross entropy) and the backward matrix has negative entries, the empirical risk may be neither upper nor lower bounded, and learning may be inconsistent, as can be observed experimentally. Although suitable adjustments to the base loss and the backward matrix can mitigate this issue (Yoshida et al., 2021), it remains a main limitation in the application of backward losses.
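The unboundedness issue is easy to reproduce numerically: with the binary noise matrix (29), the inverse of the transition matrix has negative entries, so a confident prediction can drive the backward-corrected cross-entropy below zero. A minimal sketch, with a hypothetical \(\rho = 0.3\):

```python
import numpy as np

# Binary noise matrix (29) with ρ_{-1} = ρ_{+1} = 0.3 (hypothetical value)
rho = 0.3
M = np.array([[1 - rho, rho],
              [rho, 1 - rho]])
B = np.linalg.inv(M)            # backward matrix; has negative entries
assert (B < 0).any()

def ce(q):
    """Cross-entropy base loss, unbounded above as q_j -> 0."""
    return -np.log(np.clip(q, 1e-12, None))

# A near-degenerate prediction drives -log q_j to a large value, which a
# negative entry of B^T turns into a large *negative* corrected loss
f = np.array([1e-6, 1 - 1e-6])
psi = B.T @ ce(f)               # backward-corrected loss per weak label
assert psi.min() < 0            # the empirical risk is not lower-bounded
```

Minimizing such a risk rewards ever more extreme predictions, which is the mechanism behind the inconsistency observed experimentally.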

7 Experiments

This section presents a comparative analysis of forward, backward, and forward-backward losses under varying levels of label corruption. Specifically, we evaluate these losses in the proper case across three corruption types: noisy labels, complementary labels, and partial labels. Our goal is to empirically demonstrate the superiority of forward losses, consistent with our theoretical results and prior findings.

We evaluate the losses on a variety of datasets, including Banknote Authentication (Lohweg, 2012), for binary classification; MNIST (LeCun et al., 1998), CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), for multiclass classification; and a synthetic Gaussian mixture model, which allows a controlled evaluation of posterior probability estimation. This ensures that our comparison is independent of the architecture, data domain and size, and corruption models.

Label corruption processes follow models and parameterizations from prior work. In some cases, we replicate published setups to directly test whether forward losses outperform backward losses under identical conditions, preserving the fidelity of the original experiments.

To assess posterior probability estimation, we conduct controlled classification tasks on synthetic data. The synthetic dataset comprises 4000 samples drawn from four overlapping Gaussian distributions. This setting allows evaluating algorithm performance in the realizable case, where the classifier can perfectly fit the true posterior, and directly quantifying estimation quality, since the true posteriors are known.

Discrepancy between predicted and true posteriors is measured via:

$$\begin{aligned} \Vert \textbf{f}(\textbf{x}) - \varvec{\eta }(\textbf{x}) \Vert \end{aligned}$$
(28)

computed over the test set, providing a direct evaluation of estimation accuracy.

Regardless of corruption type, training uses multiclass logistic regression with an Adam optimizer (learning rate \(10^{-3}\)) for 50 epochs, repeated 10 times.

7.1 Noisy labels

To ensure a comprehensive evaluation, we test our method across datasets of increasing complexity: binary classification (Banknote), simple multiclass (MNIST), deep learning benchmarks (CIFAR), and synthetic datasets.

Banknote-authentication.

We begin with a binary classification task using the banknote-authentication dataset, which classifies genuine versus forged banknotes, adopting the corruption process in Natarajan et al. (2013), given by

$$\begin{aligned} \textbf{M}= \left( {\begin{smallmatrix} 1-\rho _{-1} & \rho _{+1} \\ \rho _{-1} & 1-\rho _{+1} \end{smallmatrix}}\right) \end{aligned}$$
(29)

evaluating the performance for different values of \(\varvec{\rho } =(\rho _{-1},\rho _{+1})\). We also consider a decomposition of the matrix \(\textbf{M}\) as the product of two matrices \(\textbf{M}_l\) and \(\textbf{M}_r\), such that \(\textbf{M}=\textbf{M}_l \textbf{M}_r\). This decomposition lets us set \((\textbf{F},\textbf{B}) = (\textbf{M}_r,\textbf{B}_l)\), so that the loss is computed as \(\varvec{\Psi }(\textbf{f}) = \textbf{B}_l^\top \varvec{\phi }(\textbf{M}_r \textbf{f})\), where \(\textbf{B}_l\) is a left inverse of \(\textbf{M}_l\).
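As an illustration, the label corruption process induced by (29) can be sketched as follows; the noise levels \((0.2, 0.4)\) are hypothetical values, and each noisy label is drawn from the column of \(\textbf{M}\) indexed by the clean label:

```python
import numpy as np

def noise_matrix(rho_neg, rho_pos):
    """Binary transition matrix (29): column i gives P(weak label | class i)."""
    return np.array([[1 - rho_neg, rho_pos],
                     [rho_neg, 1 - rho_pos]])

def corrupt(y, M, rng):
    """Sample a noisy label for each clean label y_k from column y_k of M."""
    return np.array([rng.choice(M.shape[0], p=M[:, yk]) for yk in y])

rng = np.random.default_rng(0)
M = noise_matrix(0.2, 0.4)
y = rng.integers(0, 2, size=10000)
z = corrupt(y, M, rng)

# Empirical flip rates should be close to (ρ_{-1}, ρ_{+1})
assert abs((z[y == 0] != 0).mean() - 0.2) < 0.03
assert abs((z[y == 1] != 1).mean() - 0.4) < 0.03
```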

We train a Logistic Regression model. Fig. 2 shows that as corruption levels increase, forward and forward-backward losses consistently outperform backward loss, particularly at higher noise levels, with higher median accuracy and lower variability in training and testing. Forward-backward loss behaves similarly to forward loss, though differences are minor due to the dataset’s small size and classification simplicity.

Fig. 2
figure 2

Comparison of training and testing accuracy for backward (Bwd), forward-backward (Fwd-Bwd), and forward (Fwd) losses for binary classification with noisy labels

MNIST

For this dataset, we follow the label corruption process described by Natarajan et al. (2013), using the transition matrix:

$$\begin{aligned} \textbf{M} = \left( {\begin{smallmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & \rho & 0 & 0 \\ 0 & 0 & 1-\rho & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1-\rho & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1-\rho & \rho & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & \rho & 1-\rho & 0 & 0 & 0 \\ 0 & 0 & \rho & 0 & 0 & 0 & 0 & 1-\rho & 0 & 0 \\ 0 & 0 & 0 & \rho & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{smallmatrix}}\right) \end{aligned}$$
(30)

As in their setting, labels are flipped with probability \(\rho\) between similar digits: \(2 \rightarrow 7,\ 3 \rightarrow 8,\ 5 \leftrightarrow 6,\ 7 \rightarrow 1\). Also, we decompose the matrix \(\textbf{M}\) such that \(\textbf{M}_l\) encompasses label flipping with probability \(\rho\) between \(2 \rightarrow 7\) and \(7 \rightarrow 1\), whereas \(\textbf{M}_r\) encompasses label flipping with probability \(\rho\) between \(3 \rightarrow 8\) and \(5 \leftrightarrow 6\). Hence, \((\textbf{F},\textbf{B}) = (\textbf{M}_r,\textbf{B}_l)\) and the loss is computed as \(\varvec{\Psi }(\textbf{f}) = \textbf{B}_l^\top \varvec{\phi }(\textbf{M}_r \textbf{f})\), where \(\textbf{B}_l\) is a left inverse of \(\textbf{M}_l\).

We train a multilayer perceptron (MLP) with an input layer and a hidden layer of size 784 and an output layer of size 10. The Adam optimizer is used again with an initial learning rate of \(10^{-3}\). Figure 3 shows that forward loss outperforms backward loss, achieving higher accuracy and lower variability in both training and testing.

Fig. 3
figure 3

Comparison of training and testing accuracy for backward (Bwd), forward-backward (Fwd-Bwd), and forward (Fwd) losses for the MNIST dataset with noisy labels

CIFAR-10

We use a ResNet-18 architecture trained with SGD (learning rate \(10^{-3}\), momentum 0.9, weight decay \(5\times 10^{-4}\)). Due to computational demands, we limit the experiment to 4 repetitions and 40 epochs.

For the label noise, we follow the process described by Natarajan et al. (2013), where labels are flipped with probability \(\rho\) between the next classes: Truck \(\rightarrow\) Automobile, Bird \(\rightarrow\) Airplane, Deer \(\rightarrow\) Horse, and Cat \(\leftrightarrow\) Dog.

The decomposition for the forward-backward loss is such that \(\textbf{M}_l\) encompasses label flipping with probability \(\rho\) between Truck \(\rightarrow\) Automobile and Bird \(\rightarrow\) Airplane, whereas \(\textbf{M}_r\) encompasses label flipping with probability \(\rho\) between Deer \(\rightarrow\) Horse and Cat \(\leftrightarrow\) Dog.

Fig. 4
figure 4

Comparison of training and testing accuracy for backward (Bwd), forward-backward (Fwd-Bwd), and forward (Fwd) losses for the CIFAR-10 dataset with noisy labels

The results in Fig. 4 show that, despite greater variability, forward loss achieves higher median accuracy in both training and testing.

CIFAR-100

Similarly, we test the losses on CIFAR-100 using a ResNet-32 architecture with the same optimizer settings as for ResNet-18. The experiment is restricted to 4 repetitions and 40 epochs due to computational constraints.

For label noise, we follow (Natarajan et al., 2013): The 100 classes are grouped into 20 superclasses (5 classes each). Noise flips each class circularly within superclasses 1–10, repeating the pattern for superclasses 11–20. Thus, \(\textbf{M}\) is a block matrix:

$$\begin{aligned} \textbf{M}=\left( {\begin{smallmatrix} \textbf{A} & \textbf{0}\\ \textbf{0} & \textbf{A}\end{smallmatrix}}\right) \ \text {where}\ \textbf{A} = \left( {\begin{smallmatrix} 1-\rho & 0 & 0 & \cdots & 0 & \rho \\ \rho & 1-\rho & 0 & \cdots & 0 & 0 \\ 0 & \rho & 1-\rho & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1-\rho & 0 \\ 0 & 0 & 0 & \cdots & \rho & 1-\rho \end{smallmatrix}}\right) , \end{aligned}$$
(31)

where \(\textbf{A}\) is a \(10 \times 10\) matrix.

A simple decomposition of the transition matrix is given by

$$\begin{aligned} \textbf{M}_l=\left( {\begin{smallmatrix} \textbf{A} & \textbf{0}\\ \textbf{0} & \textbf{I}\end{smallmatrix}}\right) \ \text {and}\ \textbf{M}_r=\left( {\begin{smallmatrix} \textbf{I} & \textbf{0}\\ \textbf{0} & \textbf{A}\end{smallmatrix}}\right) \end{aligned}$$
(32)
Fig. 5
figure 5

Comparison of training and testing accuracy for backward (Bwd), forward-backward (Fwd-Bwd), and forward (Fwd) losses for the CIFAR-100 dataset with noisy labels

In Fig. 5, the forward loss shows noticeably higher median accuracy than the backward and forward-backward losses and smaller variability. As the noise level increases, the forward loss still maintains a more stable accuracy profile.

Gaussian Mixture Model

Lastly, as mentioned before, we will train a logistic classifier for the Gaussian Mixture Model to evaluate the quality of the posterior probability estimates.

Labels were corrupted using the transition matrix

$$\begin{aligned} \textbf{M} = \left( {\begin{smallmatrix} 1-\rho & \rho /3 & \rho /3 & \rho /3 \\ \rho /3 & 1-\rho & \rho /3 & \rho /3 \\ \rho /3 & \rho /3 & 1-\rho & \rho /3 \\ \rho /3 & \rho /3 & \rho /3 & 1-\rho \\ \end{smallmatrix}}\right) \end{aligned}$$
(33)

We factorize the transition matrix as \(\textbf{M}= \textbf{A}^2\) so \(\textbf{M}_l=\textbf{M}_r=\textbf{A}\). Notice that the case \(\rho =0.8\) is not used here, as for a 4-class problem it would mean that each noisy class is more likely than the true class.
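Since \(\textbf{M}\) in (33) is symmetric and positive definite for moderate \(\rho\), the factor \(\textbf{A}\) can be computed as its principal matrix square root. A sketch with a hypothetical \(\rho = 0.3\), using an eigendecomposition:

```python
import numpy as np

rho, c = 0.3, 4
# Symmetric transition matrix (33): 1-ρ on the diagonal, ρ/3 elsewhere
M = (1 - rho) * np.eye(c) + (rho / 3) * (np.ones((c, c)) - np.eye(c))

# Eigenvalues are 1 (eigenvector 1) and 1 - 4ρ/3, so M is positive
# definite for ρ < 3/4 and its principal square root exists
w, V = np.linalg.eigh(M)
A = V @ np.diag(np.sqrt(w)) @ V.T        # A with A A = M

assert np.allclose(A @ A, M)
assert np.allclose(A.sum(axis=0), 1.0)   # A is still column-stochastic
```

Column-stochasticity of \(\textbf{A}\) follows because \(\mathbbm {1}\) is an eigenvector of \(\textbf{M}^\top\) with eigenvalue 1, which the square root preserves.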

Fig. 6
figure 6

Comparison of training and testing accuracy for backward (Bwd), forward-backward (Fwd-Bwd), and forward (Fwd) losses for the dataset with the mixture of Gaussians with noisy labels

In Fig. 6, as the noise level increases, the median accuracy declines and the variance grows for all methods, reflecting the added difficulty of heavier label noise. Nevertheless, the forward loss maintains a performance advantage, with its boxplot showing a higher central tendency and tighter interquartile ranges.

Fig. 7
figure 7

Distribution of the mean norm (left) and standard deviation norm (right) of the difference between the prediction and the true posterior distribution for the Gaussian mixture model with noisy labels

In Fig. 7, forward loss achieves lower median errors and tighter interquartile ranges, confirming its superior ability to approximate the true posterior distribution. These results underscore the advantage of forward losses in accurately modeling posterior probabilities for noisy label settings.

7.2 Complementary labels

We also evaluate a complementary label setting (Ishida et al., 2019) where the transition matrix \(\textbf{M}\) has components \(m_{ij} = \frac{1-\delta _{ij}}{c-1}\), where \(\delta _{ij}\) is the Kronecker delta.

Since a complementary label is selected at random from the negative classes (i.e., all classes other than the true class), we can decompose this selection into two steps: in the first step, we select half of the negative classes at random; in the second step, we take one of these selected classes at random. These two steps define the respective left and right matrices of the decomposition \(\textbf{M}=\textbf{M}_l \textbf{M}_r\).
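The complementary-label transition matrix itself is straightforward to construct; a minimal sketch for a hypothetical \(c = 10\):

```python
import numpy as np

c = 10
# Complementary-label transition matrix: the weak label is drawn uniformly
# from the c-1 classes other than the true one
M = (np.ones((c, c)) - np.eye(c)) / (c - 1)

assert np.allclose(np.diag(M), 0.0)       # the true class is never observed
assert np.allclose(M.sum(axis=0), 1.0)    # columns are valid distributions
```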

Using the same architectures applied to noisy labels, we evaluated MNIST, CIFAR-10, and CIFAR-100. Results are summarized in Fig. 8:

Fig. 8
figure 8

Comparison of training and testing accuracy for backward (Bwd), forward-backward (Fwd-Bwd), and forward (Fwd) losses under complementary labels. From left to right: MNIST, CIFAR-10, and CIFAR-100

For MNIST (left), the forward loss achieves the highest median accuracy with low variability, demonstrating robust performance under severe label corruption. The backward loss consistently shows the worst performance, while the forward-backward loss falls in between.

On CIFAR-10 (center), the forward loss again achieves the highest median accuracy compared to the forward-backward and backward losses, though variability across runs increases. The backward loss performs notably poorly, underscoring the advantage of forward correction.

For CIFAR-100 (right), the forward loss displays higher variability across runs but achieves the highest median accuracy among the three methods. In contrast, both the backward and forward-backward losses encounter more pronounced learning difficulties, which can lead to noticeably lower performance.

Gaussian Mixture Model

We now analyze the complementary label setting for the Gaussian mixture model.

Fig. 9
figure 9

Comparison of training and testing accuracy for forward and backward losses for the Gaussian mixture model with complementary labels

Figure 9 shows that forward loss consistently outperforms the others, achieving higher median accuracies on both training and test sets, highlighting its robustness with complementary labels.

Fig. 10
figure 10

Distribution of the mean norm (left) and standard deviation norm (right) of the difference between the prediction and the true posterior distribution for the Gaussian mixture model with complementary labels

Figure 10 shows that forward loss achieves the lowest discrepancy, demonstrating superior accuracy in approximating true posteriors. In contrast, backward loss exhibits higher mean error and greater variability, highlighting its poorer performance.

7.3 Partial labels

Finally, we explore partial label corruption as modeled in Cour et al. (2011); Feng et al. (2020). The corruption process is defined as:

$$\begin{aligned}&P(\varvec{\omega }|\textbf{y}=\textbf{e}_i)= {\left\{ \begin{array}{ll} 1-\rho & \text {if}\ \varvec{\omega }=\textbf{y}\\ \frac{\rho }{2^{c-1}-1}& \text {if}\ \varvec{\omega }\ne \textbf{y}\ \text {and}\ \varvec{\omega }^\top \textbf{y}=1\\ 0& \text {if}\ \varvec{\omega }^\top \textbf{y}=0 \\ \end{array}\right. } \end{aligned}$$
(34)

This parametrization of the transition matrix enables flexible corruption of the dataset: larger values of \(\rho\) result in higher corruption.
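The corruption process (34) can be simulated by rejection sampling over the negative classes; the sketch below uses hypothetical values \(c = 4\) and \(\rho = 0.5\):

```python
import numpy as np

def sample_partial_label(y, c, rho, rng):
    """Sample a partial label ω (binary vector) for true class y per (34):
    with probability 1-ρ, ω is the singleton {y}; otherwise ω is a uniformly
    chosen strict superset of {y}, i.e. one of the 2^(c-1) - 1 candidate sets."""
    omega = np.zeros(c, dtype=int)
    omega[y] = 1                          # the true class is always included
    if rng.random() < rho:
        while omega.sum() == 1:           # reject the empty draw over the
            others = rng.integers(0, 2, size=c - 1)   # remaining c-1 classes
            omega[np.arange(c) != y] = others
    return omega

rng = np.random.default_rng(0)
draws = [sample_partial_label(2, 4, 0.5, rng) for _ in range(5000)]
assert all(w[2] == 1 for w in draws)      # ω^T y = 1 always holds
frac_singleton = np.mean([w.sum() == 1 for w in draws])
assert abs(frac_singleton - 0.5) < 0.03   # P(ω = y) ≈ 1 - ρ
```

Rejecting the empty draw makes the \(2^{c-1}-1\) non-trivial candidate sets equally likely, matching the second case of (34).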

For partial labels, we evaluate the backward losses (\(\textbf{F}=\textbf{I}\)) with and without the convexity constraint, and with and without the optimized matrix \(\textbf{B}^*\) in (20). Additionally, we will assess the forward loss (\(\textbf{F}=\textbf{M}\), \(\textbf{B}=\textbf{I}\)), as well as the optimized forward-backward loss given by \(\textbf{F}=\textbf{M}\) and the optimal backward matrix \(\textbf{B}^*\) in (20). When needed, matrix \(\varvec{\Delta }_\textbf{p}\) is computed using weak label priors, estimated from weak label proportions following the method proposed in (Bacaicoa-Barber et al., 2021).

MNIST

First, we test on the MNIST dataset, trained in the same manner as in the noisy or complementary label setting.

Fig. 11
figure 11

Comparison of training and testing accuracy for forward, backward, and forward backward losses for the MNIST dataset with partial labels. The boxplots are ordered as follows (from left to right): Backward (bwd), Convex backward, backward (\(\textbf{B}^*\)), convex backward (\(\textbf{B}^*\)), forward (fwd)-backward, forward

Figure 11 shows that forward loss once again outperforms the other losses. Consistent with prior observations, forward loss achieves the highest median accuracy, clearly outperforming the other losses. Moreover, forward-backward loss tends to exceed the performance of the backward losses.

CIFAR 10

The CIFAR-10 dataset is trained in the same manner as in the noisy or complementary label setting.

Fig. 12
figure 12

Comparison of training and testing accuracy for forward, backward, and forward backward losses for the CIFAR-10 dataset with partial labels. The boxplots are ordered as follows (from left to right): Backward (bwd), Convex backward, backward (\(\textbf{B}^*\)), convex backward (\(\textbf{B}^*\)), forward (fwd)-backward, forward

In Fig. 12, the forward loss achieves the highest median accuracy on both training and testing sets, consistent with previous results. The forward-backward loss also tends to outperform the other backward losses, likely due to the pseudoinverse used in their computation.

Gaussian Mixture Models

Finally, we evaluate the performance of forward, backward, and forward-backward losses under partial label corruption for the Gaussian Mixture Models.

Fig. 13
figure 13

Comparison of training and testing accuracy for forward, backward, and forward-backward losses for the Gaussian mixture model with partial labels. The boxplots are ordered as follows (from left to right): Backward (bwd), Convex backward, backward (\(\textbf{B}^*\)), convex backward (\(\textbf{B}^*\)), forward (fwd)-backward, forward

As seen in Fig. 13, forward losses again outperform other losses, with forward-backward losses also performing well on both training and test sets. Backward losses continue to underperform, especially when \(\textbf{B}\) is not optimized and under higher corruption levels.

Fig. 14
figure 14

Distribution of the mean norm (left) and standard deviation norm (right) of the difference between the prediction and the true posterior distribution for the Gaussian mixture model with partial labels. The boxplots are ordered as follows (from left to right): Backward (bwd), convex backward, backward (\(\textbf{B}^*\)), convex backward (\(\textbf{B}^*\)), forward (fwd)-backward, forward

Figure 14 highlights the superior performance of forward loss in approximating true posteriors, consistently showing lower error and variability. Forward-backward loss offers a reasonable compromise, outperforming backward approaches but still falling short of forward losses. As before, methods relying on the pseudoinverse of the transition matrix underperform.

7.4 Clothing1M

We tested the approaches presented in this paper on the real-world noisy dataset Clothing1M (Xiao et al., 2015). We estimated the transition matrix empirically by counting relative frequencies on the subset of instances for which both the true and the noisy labels are available.

For the forward-backward loss, we decomposed the estimated transition matrix \(\hat{\textbf{M}}\) numerically, using a factorization \(\textbf{M}_l\textbf{M}_r \approx \hat{\textbf{M}}\).
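The counting estimator can be sketched as follows; the transition matrix and sample size below are hypothetical, used only to check that the estimator recovers a known \(\textbf{M}\):

```python
import numpy as np

def estimate_transition(z, y, c):
    """Estimate M̂[i, j] = P(noisy label i | true label j) by counting,
    on the subset of instances where both labels are available."""
    M_hat = np.zeros((c, c))
    for zi, yj in zip(z, y):
        M_hat[zi, yj] += 1
    col = M_hat.sum(axis=0, keepdims=True)
    return M_hat / np.where(col > 0, col, 1)   # normalize each column

# Synthetic check: sample from a known M and recover it
rng = np.random.default_rng(0)
c = 3
M = np.array([[0.8, 0.1, 0.0],
              [0.1, 0.8, 0.2],
              [0.1, 0.1, 0.8]])
y = rng.integers(0, c, size=50000)
z = np.array([rng.choice(c, p=M[:, yk]) for yk in y])
assert np.abs(estimate_transition(z, y, c) - M).max() < 0.02
```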

A ResNet-50 architecture, pre-trained on ImageNet, was employed as the base classifier. The model was trained for 10 epochs using the Adam optimizer with a learning rate of \(10^{-3}\). The training phase used the noisy labels provided in the Clothing1M dataset, whereas the evaluation used the clean test set. Only one repetition was made, since the weakly labeled dataset is given and no label-weakening process was applied.

Fig. 15
figure 15

Comparison of test accuracy for forward (Fwd), backward (Bwd), and forward-backward loss corrections on the Clothing1M dataset for 10 training epochs

Figure 15 shows that the forward loss correction consistently outperforms both the backward and forward-backward methods in test accuracy over the entire training period. It reaches a higher final accuracy, confirming earlier experiments that highlight the superiority of the forward loss correction for proper losses.

7.5 Overall discussion and conclusions

In summary, the experiments reveal several key findings: first, convexification helps mitigate the convergence issues typically observed with backward losses in partial label learning, generally resulting in better performance than backward losses without convexity constraints. Second, the optimal backward matrix defined in (20) consistently outperforms alternatives such as the pseudoinverse of the transition matrix. In practice, the forward-backward loss (with \(\textbf{F}=\textbf{M}\) and \(\textbf{B}=\textbf{B}^*\)) offers a balanced trade-off between forward and backward approaches.

Third, in both complementary and noisy label scenarios, forward-backward losses tend to achieve intermediate performance between backward and forward losses. Notably, the performance gap between forward and forward-backward losses increases when the transition matrix is block-structured and thus trivially factorizable, compared to cases with uniformly distributed noise. This may occur because, in such settings, two independent noise processes are effectively present, and the forward component can only correct one of them. As a result, the forward-backward method does not fully close the gap to the forward approach.

Overall, forward losses demonstrate the best performance, consistently surpassing forward-backward and backward losses in both accuracy and stability.

8 Conclusions

In this study, we introduced a unified framework that integrates forward and backward loss functions for learning from weak labels, providing a comprehensive understanding of their shared properties.

By combining these losses into a single family of forward-backward losses, we clarified their relationships and offered deeper insights into their common characteristics. We established sufficient conditions under which forward-backward losses are proper, ranking-calibrated, classification-calibrated, convex, and lower-bounded. These conditions address critical challenges often associated with backward losses, such as non-convexity and the lack of a lower bound, ensuring that forward-backward losses retain essential properties for effective learning.

This unification has also enabled a systematic comparative analysis, demonstrating that no backward or forward-backward loss can outperform forward losses for posterior probability estimation. Theoretical findings align with experimental results, confirming the robustness and effectiveness of forward losses in mitigating the challenges posed by weak labels. For RC and CC losses, our framework offers a unified perspective that can inspire the development of new learning algorithms.

8.1 Limitations and further work

One important limitation of our framework, shared by all methods based on forward or backward correction, is the assumption that the transition matrix is known and feature-independent. Some models circumvent this problem by making independence assumptions on the weak labels (Feng et al., 2020; Katsura & Uchida, 2021; Ishida et al., 2017) that may not be realistic. Additional approaches make assumptions on the weak labeling process (often related to the dominance of the true class over noisy classes in the weak label (Lv et al., 2020; Zhang et al., 2021; Ambroise et al., 2001; Wu et al., 2023)) which can be translated into constraints on \(\textbf{M}\). In general, the probabilistic calibration of the models requires loss functions that depend on \(\textbf{M}\) (Cid-Sueiro, 2012; Van Rooyen & Williamson, 2018; Yoshida et al., 2021). Some methods that have been proposed to estimate the transition matrix from data (noise-free labels, anchor points, etc.) have been discussed in Sect. 1.

The assumption that the transition matrix is feature-independent has been widely adopted, though it may not be supported by empirical evidence (for instance, one can expect higher label noise from human annotators for input images that are near the decision boundaries). While our theoretical analysis does not require feature independence, in practice our experimental results rely on this simplifying assumption. The development of instance-dependent models has attracted some recent interest, in particular for the noisy label case (see (Xia et al., 2020), for instance). Investigating feature-dependent transition matrices derived from realistic and domain-specific models of the annotation process could be a valuable direction for future research.

Our work is a step towards a complete characterization of losses for learning from weak labels, though there is further work to be done in this direction. Although some losses can be partially connected to our work (like (Wu et al., 2023)), other relevant losses, like (Cour et al., 2011) and many others, cannot be fitted into our framework. Our ongoing research aims to develop more general formulations that will lead to more efficient and robust learning algorithms.