1 Introduction

In this paper, we address the challenge of training multiclass classifiers using weakly labeled data. A weak label refers to a label that does not explicitly indicate the true class of a sample, but instead provides a discrete variable statistically related to the class or identifies a subset of candidate classes.

Building on prior research (Cid-Sueiro, 2012; Van Rooyen & Williamson, 2018; Chiang & Sugiyama, 2023; Chen et al., 2023, 2024; Iacovissi et al., 2023), we adopt a general framework unifying various partial supervision problems, such as learning with noisy, complementary, supplementary, or partial labels, as well as positive-unlabeled (PU) learning and unlabeled-unlabeled (UU) learning.

Many algorithms proposed for weak labels adapt traditional supervised learning loss functions to handle weak labels via loss correction. Our focus is on methods that employ a statistical model to relate classes and weak labels (Jin & Ghahramani, 2002; Xiao et al., 2015; Feng et al., 2020; Katsura & Uchida, 2021; Ishida et al., 2017; Xu et al., 2021), typically via a transition probability matrix that describes how weak labels are generated from true classes (Ishida et al., 2019; Yoshida et al., 2021) or vice versa (Menon et al., 2015; Scott et al., 2013). Although this model may not always be explicitly defined (as in (Grandvalet, 2002)), the effectiveness of some methods depends heavily on the underlying weak label generation process. In this paper, we will not address the important challenge of estimating this model, which is sometimes tackled by assuming the availability of a few noise-free labels (Xiao et al., 2015; Yu et al., 2018; Hendrycks et al., 2018) or anchor points (Patrini et al., 2017; Yao et al., 2020) or using corrupted labels only (Ghosh et al., 2015; Katz-Samuels et al., 2019). Other methods select losses that are relatively robust to uncertainty in the model (Ghosh et al., 2015; Cid-Sueiro et al., 2014).

The transition matrix has been used to construct two main types of losses: (1) losses based on the linear transformation of standard supervised losses (Natarajan et al., 2013; Cid-Sueiro, 2012; Van Rooyen & Williamson, 2018; Yoshida et al., 2021), and (2) losses defined on probabilistic predictions of weak labels, derived from the linear transformation of probabilistic class predictions (Sukhbaatar et al., 2014; Yu et al., 2018; Patrini et al., 2017). Patrini et al. (2017) defines the former as backward corrected and the latter as forward corrected losses (hereafter, simply backward and forward losses).

Despite extensive work on loss correction, a systematic comparative analysis between forward and backward losses is noticeably absent in the literature. Although some empirical evidence suggests that forward losses tend to outperform backward losses (see e.g., (Patrini et al., 2017)), this observation lacks a comprehensive theoretical and experimental validation for general weak-label models.

The purpose of this paper is twofold. First, we introduce a unifying family of losses that generalizes both forward and backward losses, encompassing them as special cases. Second, leveraging this framework, we conduct a theoretical and empirical comparative analysis, providing evidence for the superiority of forward losses.

Our main contributions are the following:

  • We define a family of forward-backward losses encompassing forward and backward losses as special cases. Additionally, we show that some types of reweighting schemes can also be formulated within this framework.

  • We establish sufficient conditions under which forward-backward losses are proper, ranking-calibrated or classification-calibrated, and identify conditions ensuring convexity and lower-boundedness.

  • We present a theoretical and experimental analysis demonstrating that proper forward losses yield higher accuracy and lower variance in probability estimates than any other proper loss in the family.

Although our analysis shows that forward proper losses consistently outperform others, the general formulation contributes toward a broader characterization of losses for learning from weak labels, an important step toward a general theory that is still lacking.

The paper is organized as follows: Sect. 2 reviews related work. Sect. 3 formulates the problem and defines the loss functions. Sect. 4 gives conditions for proper forward-backward losses. Sect. 5 analyzes ranking and classification calibration. Sect. 6 states some error bounds for the minimization of forward-backward losses. Sect. 7 presents some comparative experiments. Finally, we state some conclusions in Sect. 8.

2 Related work

Unified approaches to learning from arbitrary weak label models date back to the general formulations of backward losses in (Cid-Sueiro, 2012; Cid-Sueiro et al., 2014; Van Rooyen & Williamson, 2018). General models for maximum likelihood estimation (an instance of forward correction) can be found in Perello-Nieto et al. (2020); Chen et al. (2023, 2024). General approaches for binary classification (including learning from noisy labels, PU learning and semi-supervised learning) can be found in (Xie et al., 2024) for AUC optimization and (Gong et al., 2022) for margin-based classifiers.

Chiang and Sugiyama (2023) integrated up to 15 different scenarios into a probabilistic framework supporting both discrete weak labels and confidence scores. Their framework also introduces a risk-rewrite formulation that facilitates backward correction, and a novel “marginal chain method” applicable to all these scenarios. An even more general perspective appears in (Iacovissi et al., 2023), which situates label correction, forward/backward correction, and importance reweighting within the broader context of data corruption, including corrupted input features (e.g., concept drift).

While these contributions show that the same strategy (forward, backward, marginal chain, importance reweighting) can be applied to different corruption types, our work moves in the direction of integrating different methods (forward, backward and, to some extent, marginal chain) into a unified family of correction techniques.

To our knowledge, no systematic comparison of forward and backward correction has been published. Patrini et al. (2017) first observed the inferior performance of backward correction in noisy label scenarios, a finding corroborated by (Ma et al., 2018; Ding et al., 2018; Lukasik et al., 2020). In (Chou et al., 2020), a case study on complementary labels shows that while backward losses provide unbiased risk estimators, their negative components lead to high variance, over-fitting and reduced accuracy compared to forward losses and other methods. Similarly, Ishida et al. (2019) (referring to backward correction as Free) proposed a gradient ascent (GA) approach to mitigate negative loss effects, yet both Free and GA underperformed relative to forward losses in complementary and partial label settings (Feng et al., 2020).

The inferior performance of backward losses is often attributed to their negative components, which cause overfitting. Techniques such as training control (e.g., GA), enforcing non-negativity (Kiryo et al., 2017; Lu et al., 2020), or minimizing upper bounds (Feng et al., 2020) have improved performance but are seldom compared directly to forward losses. Our experiments explore various weak label scenarios, avoiding the pitfalls of negative loss components by building on ideas from (Van Rooyen & Williamson, 2018; Yoshida et al., 2021). Nevertheless, our theoretical analysis shows that even under ideal training conditions, the variance of posterior probability estimates with backward losses cannot be lower than that of forward losses, suggesting that optimization issues alone do not explain their inferior performance.

3 Formulation

3.1 Notation

Vectors are written in boldface, matrices in boldface capitals, and sets in calligraphic letters. \(|{{\mathcal {A}}}|\) is the cardinality of a finite set \({\mathcal {A}}\). For vectors, the superscript \(^\top\) denotes transposition, and \(\odot\) and \(\oslash\) denote pointwise multiplication and division, respectively. When \(\textbf{v}\) is a vector, \(\log (\textbf{v})\) and \(\exp (\textbf{v})\) denote the component-wise logarithm and exponential, respectively. For any matrix \(\textbf{A}\), \(\text {tr}(\textbf{A})\) is its trace, \(\Vert \textbf{A}\Vert\) its Frobenius norm and \(\Vert \textbf{A}\Vert _1\) its \(L_1\) norm.

For any integer n, \(\textbf{e}_i^n\) is a unit vector of dimension n with all zero components apart from the i-th component which is equal to one, and \(\mathbbm {1}_n\) is an all-ones vector with dimension n. The superscript may be omitted if it is clear from the context.

We will use \(\Psi\), \(\varvec{\phi }\) to denote loss functions. The number of classes is c, and the number of possible weak label vectors is d. The set of all \(d\times c\) matrices with stochastic columns, that is, the set of \(d\times c\) left-stochastic matrices is \(\mathcal {M} = \{\textbf{M} \in [0,1]^{d\times c}: \textbf{M}^\top \mathbbm {1}_d =\mathbbm {1}_c\}\), and the simplex of the probability vectors of dimension d is \(\mathcal {P}_{d} = \{\textbf{p}\in [0,1]^{d}: \textbf{p}^\top \mathbbm {1}_{d} =1\}\).

3.2 Learning from weak labels

Let \({{\mathcal {X}}}\) be a sample space, \({{\mathcal {Y}}}\) a finite set of c target categories, and \({{\mathcal {W}}}\) a finite set of \(d \ge c\) weak categories. Sample \((\textbf{x}, \varvec{\omega }) \in {{\mathcal {X}}} \times {{\mathcal {W}}}\) is drawn from an unknown distribution P.

We encode target categories as one-hot vectors: \({{\mathcal {Y}}} = \{\textbf{e}_j^c, j=0,\ldots ,c-1\}\). The goal is to learn a predictor of the target class \(\textbf{y}\in {{\mathcal {Y}}}\) given \(\textbf{x}\), using a weakly labeled dataset \({{\mathcal {S}}}=\{(\textbf{x}_k,\varvec{\omega }_k)\}_{k=0}^{n-1}\) of independent samples from P.

The interpretation of \({{\mathcal {Y}}}\) and \({{\mathcal {W}}}\) varies by application. This general formulation accommodates diverse partial supervision scenarios, with particular focus on cases where categories in \({{\mathcal {W}}}\) correspond to subsets of \({{\mathcal {Y}}}\). Examples include:

  • Clean labels: In this case, \({{\mathcal {W}}} = {{\mathcal {Y}}}\) and \(\varvec{\omega }=\textbf{y}\) with probability 1.

  • Noisy labels  (Raykar et al., 2010): \({{\mathcal {W}}} = {{\mathcal {Y}}}\) but \(P\{\varvec{\omega } \ne \textbf{y}\}>0\).

  • Complementary labels  (Ishida et al., 2017): \({{\mathcal {W}}} = {{\mathcal {Y}}}\) but \(P\{\varvec{\omega } \ne \textbf{y}\} = 1\).

  • Clean labels and unlabeled data: \({{\mathcal {W}}} = {{\mathcal {Y}}} \cup \{\textbf{0}\}\), where \(\varvec{\omega }=\textbf{0}\) when the target class is unknown.

  • Positive-Unlabeled (PU) data: \({{\mathcal {W}}} = \{(0, 1), (1, 1)\}\).

  • Partial labels (Cour et al., 2011; Jin & Ghahramani, 2002; Ambroise et al., 2001; Grandvalet & Bengio, 2004): each label is a set of candidate target categories, only one of them being true. In this case, each element in \({{\mathcal {W}}}\) is a non-empty subset of \({{\mathcal {Y}}}\).
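The transition matrices \(\textbf{M}\) induced by several of these scenarios can be written down explicitly. The following sketch (using numpy for illustration; the corruption rates `rho` and `alpha` are arbitrary choices, assuming feature-independent, symmetric corruption) builds a few of them:

```python
import numpy as np

c = 3  # number of classes

# Clean labels: W = Y and omega = y with probability 1.
M_clean = np.eye(c)

# Symmetric noisy labels: correct with prob 1 - rho, else uniform over the rest.
rho = 0.3
M_noisy = (1 - rho) * np.eye(c) + rho / (c - 1) * (np.ones((c, c)) - np.eye(c))

# Complementary labels: omega != y always, uniform over the c - 1 other classes.
M_comp = (np.ones((c, c)) - np.eye(c)) / (c - 1)

# Clean labels and unlabeled data: d = c + 1; unlabeled with probability alpha.
alpha = 0.5
M_semi = np.vstack([(1 - alpha) * np.eye(c), alpha * np.ones((1, c))])

# All transition matrices are left-stochastic: each column sums to one.
for M in (M_clean, M_noisy, M_comp, M_semi):
    assert np.allclose(M.sum(axis=0), 1.0)
```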

For convenience, we represent weak categories as one-hot vectors. Given an ordering \({{\mathcal {W}}}=\{\varvec{\omega }_0,\ldots ,\varvec{\omega }_{d-1}\}\), we define the one-hot encoding \({{\mathcal {Z}}}=\{\textbf{e}_i^d\}_{i=0}^{d-1}\), and denote by \(\textbf{z}=\textbf{e}_i^d\) the one-hot label corresponding to \(\varvec{\omega }_i\).

To summarize, we will use the following notation for the class variables:

  • \(\textbf{y} \in {{\mathcal {Y}}}\): the target class, represented as a one-hot vector.

  • \(\textbf{z} \in {{\mathcal {Z}}}\): the weak class, represented as a one-hot vector.

Thus, learning from weak labels consists in training a predictor of the target class \(\textbf{y} \in {{\mathcal {Y}}}\) given sample \(\textbf{x}\), using a weakly labeled dataset \({{\mathcal {S}}} = \{(\textbf{x}_k, \textbf{z}_k), k=0,\ldots ,n-1\}\) whose labels are elements of \({{\mathcal {Z}}}\).

Without loss of generality, we assume that \({{\mathcal {Z}}}\) contains only weak labels with nonzero probability (\(P(\textbf{z})>0\)). The statistical dependency between \(\textbf{z}\) and \(\textbf{y}\) is modeled through an arbitrary \(d\times c\) transition matrix \(\textbf{M}(\textbf{x})\in \mathcal {M}\) of conditional probabilities

$$\begin{aligned} m_{ij}(\textbf{x}) = P\{z_i=1 | y_j=1,\textbf{x} \} \end{aligned}$$
(1)

Defining the posteriors \(\textbf{p}(\textbf{x})\) and \(\varvec{\eta }(\textbf{x})\) with components \(p_i=P\{z_i=1|\textbf{x}\}\) and \(\eta _j=P\{y_j=1|\textbf{x}\}\), we can write \(\textbf{p}(\textbf{x}) = \textbf{M}(\textbf{x}){\varvec{\eta }}(\textbf{x})\). In general, the dependency on \(\textbf{x}\) will be omitted and we will write, for instance,

$$\begin{aligned} \textbf{p} = \textbf{M}{\varvec{\eta }}. \end{aligned}$$
(2)

If \(\textbf{M}\) is independent of the features \(\textbf{x}\), the relation between the random variables \(\textbf{x}\), \(\textbf{y}\) and \(\textbf{z}\) can be represented through the graphical model in Fig. 1.

Fig. 1: The graphical model describing the weak label generation process

The transition matrix is central to the design of loss functions. This paper is mostly concerned with the determination of loss functions for a given transition matrix; the issue of estimating \(\textbf{M}\) from the data is out of our scope.

The feature-independence assumption is not required for most of our theoretical analysis, except for the error bound in Sect. 6. Most of our experiments assume a known, feature-independent transition matrix. A further discussion on the estimation of the transition matrix can be found in Sect. 8.1.

3.3 Classification calibration, ranking calibration and properness

The goal of the learning algorithm is to find an accurate class predictor using a weakly labeled set. The predictor computes a score vector \(\textbf{f}= g(\textbf{x}) \in {{\mathcal {F}}}\), where \({{\mathcal {F}}} \subset {\mathbb {R}}^c\) is the hypothesis space, and a class prediction \(\hat{\textbf{y}} = \mathop {\textrm{argmax}}\limits _{\textbf{y}\in {{\mathcal {Y}}}} \{\textbf{y}^\top \textbf{f}\}\). When proper losses are used, we will require probabilistic scores, so that \({{\mathcal {F}}}= {{\mathcal {P}}}_c\).

A weak loss is any function \(\Psi :\mathcal {Z} \times {{\mathcal {F}}} \rightarrow {\mathbb {R}}\). For any loss function, \(\Psi (\textbf{z}, \textbf{f})\), we will use an alternative vector representation, by defining

$$\begin{aligned} \varvec{\Psi }(\textbf{f}) = (\Psi (\textbf{e}_0^d, \textbf{f}), \Psi (\textbf{e}_1^d, \textbf{f}), \ldots , \Psi (\textbf{e}_{d-1}^d, \textbf{f}))^\top \end{aligned}$$
(3)

so that \(\Psi (\textbf{z}, \textbf{f}) = \textbf{z}^\top \varvec{\Psi }(\textbf{f})\) for all \(\textbf{z}\in \mathcal {Z}\) and, using (2), the expected loss becomes

$$\begin{aligned} \mathbb {E}_\textbf{z}\{\Psi (\textbf{z},\textbf{f})\} = \varvec{\eta }^\top \textbf{M}^\top \varvec{\Psi }(\textbf{f}) \end{aligned}$$
(4)

The dimension of a loss is the dimension of its vector representation: d for a weak loss, and c for a standard supervised loss.
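To make the vector representation (3) and the expected loss (4) concrete, here is a small numerical sketch using numpy; the weak loss `Psi` below is an arbitrary placeholder for illustration, not one of the losses studied later:

```python
import numpy as np

rng = np.random.default_rng(0)
c, d = 3, 4

def loss_vector(Psi, f, d):
    """Vector representation (3): stack Psi(e_i, f) over all d weak labels."""
    return np.array([Psi(np.eye(d)[i], f) for i in range(d)])

# A hypothetical weak loss, just to exercise the identities.
Psi = lambda z, f: float(z @ np.arange(1, d + 1)) * float(f @ f)

M = rng.dirichlet(np.ones(d), size=c).T     # random left-stochastic d x c matrix
eta = rng.dirichlet(np.ones(c))             # class posterior
f = rng.dirichlet(np.ones(c))               # candidate score vector

Psi_vec = loss_vector(Psi, f, d)
# Expected loss (4): E_z{Psi(z, f)} = eta^T M^T Psi(f)
expected = eta @ M.T @ Psi_vec
# Equivalent direct computation, weighting by the weak posteriors p = M eta
p = M @ eta
assert np.isclose(expected, sum(p[i] * Psi(np.eye(d)[i], f) for i in range(d)))
```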

We are interested in conditions ensuring that the expected loss is minimized when the classifier is calibrated. We consider three types of calibration. The first requires that the predicted scores coincide with the posterior class probabilities.

Definition 1

(Proper loss) Weak loss \(\Psi (\textbf{z},\textbf{f})\) is \(\textbf{M}\)-proper if, for any \(\varvec{\eta }\in {{\mathcal {P}}}_c\),

$$\begin{aligned} \varvec{\eta }\in \arg \min _{\textbf{f}\in \mathcal {P}_c} \varvec{\eta }^\top \textbf{M}^\top \varvec{\Psi }(\textbf{f}), \end{aligned}$$
(5)

The loss is strictly \(\textbf{M}\)-proper if \(\varvec{\eta }\) is the unique minimizer.

A second type of calibration requires that the class scores preserve the order of the class posterior probabilities.

Definition 2

(Ranking calibration) The weak loss \(\Psi (\textbf{z},\textbf{f})\) is \(\textbf{M}\)-ranking calibrated (or \(\textbf{M}\)-RC) if, for any \(\varvec{\eta }\in {{\mathcal {P}}}_c\), any \(\textbf{f}^* \in \arg \min _\textbf{f}\varvec{\eta }^\top \textbf{M}^\top \varvec{\Psi }(\textbf{f})\) satisfies (\(\eta _i> \eta _j \Rightarrow f_i^* > f_j^*\)).

Finally, classification calibration requires that both the classifier scores and the posterior class probabilities provide the same class predictions:

Definition 3

(Classification calibration) The weak loss \(\Psi (\textbf{z},\textbf{f})\) is \(\textbf{M}\)-classification calibrated (or \(\textbf{M}\)-CC) if, for any \(\varvec{\eta }\in {{\mathcal {P}}}_c\), any \(\textbf{f}^* \in \arg \min _\textbf{f}\varvec{\eta }^\top \textbf{M}^\top \varvec{\Psi }(\textbf{f})\) satisfies (\(\eta _i> \max _{j\ne i} \eta _j \Rightarrow f_i^* > \max _{j\ne i} f_j^*\)).

3.4 Forward, backward and forward-backward losses

The losses discussed in this section are defined as transformations of a loss used for supervised learning, that we will name the base loss.

3.4.1 Backward loss

A backward loss is any linear transformation of a base loss.

Definition 4

(Backward loss) Weak loss \(\varvec{\Psi }(\textbf{f})\) is a backward loss for a weak label set \({\mathcal {Z}}\), if

$$\begin{aligned} \varvec{\Psi }(\textbf{f}) = \textbf{B}^\top \varvec{\phi }(\textbf{f}) \end{aligned}$$
(6)

for some c-dimensional loss \(\varvec{\phi }\) and some \(c\times d\) matrix \(\textbf{B}\), where \(d=|{{\mathcal {Z}}}|\).

In (Van Rooyen & Williamson, 2018), \(\textbf{B}\) is named the reconstruction matrix as it reverts the effect of the transition matrix. In (Cid-Sueiro et al., 2014), it is named a virtual label matrix because its columns play the same role as target classes in gradient-based learning algorithms. Here we refer to \(\textbf{B}\) simply as the backward matrix.

Backward losses have been proposed for noisy labels (Natarajan et al., 2013; Menon et al., 2015; Patrini et al., 2017), complementary labels (Ishida et al., 2019), multi-complementary labels (Feng et al., 2020), PU labels (Du Plessis et al., 2015), unlabeled-unlabeled (UU) data Lu et al. (2020), and general weak label models (Natarajan et al., 2013; Cid-Sueiro, 2012; Cid-Sueiro et al., 2014; Van Rooyen & Williamson, 2018; Yoshida et al., 2021).

3.4.2 Forward loss

Similarly, the forward losses can be defined as follows:

Definition 5

(Forward loss) Weak loss \(\varvec{\Psi }(\textbf{f})\) is a forward loss for a weak label set \({\mathcal {Z}}\), if

$$\begin{aligned} \varvec{\Psi }(\textbf{f}) = \varvec{\phi }(\textbf{F}\textbf{f}) \end{aligned}$$
(7)

for some d-dimensional loss \(\varvec{\phi }\) and some \(d\times c\) matrix \(\textbf{F}\), where \(d=|{{\mathcal {Z}}}|\).

When \(\textbf{F}=\textbf{M}\) and the base loss is proper, the optimization of the forward loss can be carried out in two steps: (1) estimate the posterior weak label probabilities (\(\textbf{p} = \textbf{M}\varvec{\eta }\)) with loss \(\varvec{\phi }(\textbf{p})\) from the data, and (2) compute the posterior class probabilities via the pseudoinverse \(\hat{\varvec{\eta }} =\textbf{M}^+\hat{\textbf{p}}\) (see the classifier-consistent method in (Feng et al., 2020)).
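The two-step procedure can be sketched numerically (using numpy for illustration; the complementary-label matrix, posterior and sample size are arbitrary choices, and step 1 is reduced to empirical frequency estimation for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)
c = d = 3
# Complementary-label transition matrix: uniform off-diagonal.
M = (np.ones((c, c)) - np.eye(c)) / (c - 1)

eta = np.array([0.5, 0.3, 0.2])   # true class posterior
p = M @ eta                        # weak label posterior, p = M eta

# Step 1: estimate p from weak labels (here, simple empirical frequencies).
z = rng.choice(d, size=200_000, p=p)
p_hat = np.bincount(z, minlength=d) / len(z)

# Step 2: recover the class posteriors via the pseudoinverse of M.
eta_hat = np.linalg.pinv(M) @ p_hat
assert np.allclose(eta_hat, eta, atol=0.02)
```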

The cross entropy loss,

$$\begin{aligned} \varvec{\phi }(\textbf{p}) = - \log (\textbf{p}) \end{aligned}$$
(8)

is the most common choice, making forward loss minimization equivalent to the maximum likelihood estimation of the model parameters (Zhang et al., 2019), often solved by means of the Expectation-Maximization (EM) algorithm, as in (Perello-Nieto et al., 2020).
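As an illustration, the forward cross-entropy loss for a complementary-label matrix can be checked numerically to be minimized, in expectation, at \(\textbf{f}=\varvec{\eta }\); the brute-force grid search over the simplex, the matrix and the posterior below are arbitrary choices made only for this check:

```python
import numpy as np

def forward_ce(z_idx, f, M, eps=1e-12):
    """Forward cross-entropy: Psi(z, f) = -log((M f)_z)."""
    return -np.log((M @ f)[z_idx] + eps)

# Complementary labels, c = 3 (M is invertible, so the loss is strictly proper).
M = (np.ones((3, 3)) - np.eye(3)) / 2
eta = np.array([0.6, 0.3, 0.1])

def risk(f):
    """Expected forward loss under the weak-label distribution p = M eta."""
    p_true = M @ eta
    return sum(p_true[i] * forward_ce(i, f, M) for i in range(3))

# Grid search over the probability simplex: the minimizer should be near eta.
grid = [np.array([a, b, 1 - a - b])
        for a in np.linspace(0, 1, 51)
        for b in np.linspace(0, 1 - a, 51)]
best = min(grid, key=risk)
assert np.allclose(best, eta, atol=0.05)
```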

Forward losses are closely related to some re-weighting methods (Wu et al., 2023), (Lv et al., 2020) and (Feng et al., 2020), which are based on the iterative minimization of a loss \(\Psi (\textbf{z}, \textbf{f}) = \textbf{q}^\top \varvec{\phi }(\textbf{f})\), where \(\textbf{q}\) is an estimate of the posterior class probabilities, given \(\textbf{z}\) and given the current model. For the cross-entropy in (8), this loss can be derived as the E-step of the EM algorithm (see (Perello-Nieto et al., 2020)). Appendix A further discusses this connection.

3.4.3 Forward-backward loss

Forward-backward losses are a straightforward extension of forward and backward losses, combining a forward and a backward matrix.

Definition 6

(Forward-backward loss) Weak loss \(\varvec{\Psi }(\textbf{f})\) is a forward-backward loss for a weak label set \({\mathcal {Z}}\), if

$$\begin{aligned} \varvec{\Psi }(\textbf{f}) = \textbf{B}^\top \varvec{\phi }(\textbf{F} \textbf{f}) \end{aligned}$$
(9)

for some m-dimensional loss \(\varvec{\phi }\), some \(m \times d\) matrix \(\textbf{B}\) and some \(m \times c\) matrix \(\textbf{F}\).

Forward-backward losses have potential applications in scenarios where label corruption arises from a cascade of two noisy processes, such that the transition matrix can be factorized as \(\textbf{M}= \textbf{M}_l \textbf{M}_r\). In such cases, forward-backward losses could theoretically address the effects of \(\textbf{M}_l\) through the backward component and \(\textbf{M}_r\) through the forward component.

By unifying forward and backward losses into a common framework, we can jointly analyze their properties and compare their theoretical and practical advantages. Forward-backward losses form the basis of our subsequent analysis. In the following sections, we examine conditions under which these losses are \(\textbf{M}\)-proper, \(\textbf{M}\)-RC, or \(\textbf{M}\)-CC.

4 Proper forward-backward losses

The following theorem provides sufficient conditions under which a forward-backward loss is proper.

Theorem 1

Let \(\varvec{\phi }(\textbf{q})\), \(\textbf{q}\in {{\mathcal {P}}}_k\) be a strictly proper loss with dimension \(k\ge c\), and let \(\varvec{\Psi }(\textbf{f})\) be a forward-backward loss with forward and backward matrices \(\textbf{F}\) and \(\textbf{B}\), respectively.

If \(\textbf{F}\) is left-stochastic with rank c and \(\textbf{F}=\textbf{B} \textbf{M}\), then \(\varvec{\Psi }(\textbf{f})\) is strictly \(\textbf{M}\)-proper.

Proof

If \(\textbf{F}= \textbf{B}\textbf{M}\) we have \(\varvec{\Psi }(\textbf{f}) = \textbf{B}^\top \varvec{\phi }(\textbf{B}\textbf{M}\textbf{f})\). Consider the solution set

$$\begin{aligned} {{\mathcal {B}}}&= \mathop {\textrm{argmin}}\limits _{\textbf{f}} \left\{ \varvec{\eta }^\top \textbf{M}^\top \varvec{\Psi }(\textbf{f}) \right\} \nonumber \\&= \mathop {\textrm{argmin}}\limits _{\textbf{f}} \left\{ \varvec{\eta }^\top \textbf{F}^\top \varvec{\phi }(\textbf{F}\textbf{f}) \right\} \end{aligned}$$
(10)

Since \(\textbf{F}\) is left-stochastic, \(\textbf{F}\varvec{\eta }\) is a stochastic vector and, thus, since \(\phi\) is strictly proper,

$$\begin{aligned} {{\mathcal {B}}}&= \left\{ \textbf{f}\mid \textbf{F}\textbf{f}= \textbf{F}\varvec{\eta }\right\} \end{aligned}$$
(11)

Since \(\textbf{F}\) has rank c, \(\textbf{F}\textbf{f}= \textbf{F}\varvec{\eta }\) iff \(\textbf{f}= \varvec{\eta }\) and, thus, \({{\mathcal {B}}} = \{\varvec{\eta }\}\), which proves that \(\varvec{\Psi }\) is strictly \(\textbf{M}\)-proper. \(\square\)

Th. 1 shows that, for any arbitrary choice of \(\textbf{B}\) such that \(\textbf{B} \textbf{M}\) is left stochastic with rank c, the loss

$$\begin{aligned} \varvec{\Psi }(\textbf{f}) = \textbf{B}^\top \varvec{\phi }(\textbf{B} \textbf{M} \textbf{f}) \end{aligned}$$
(12)

is strictly proper. The theorem generalizes some published results on forward and backward \(\textbf{M}\)-proper losses:

  • Taking \(\textbf{B}=\textbf{M}^+\) where \(\textbf{M}^+\) is any left inverse of \(\textbf{M}\), we get \(\varvec{\Psi }(\textbf{f}) = \left( \textbf{M}^+\right) ^\top \varvec{\phi }(\textbf{f})\), which is a general expression for backward losses (Cid-Sueiro, 2012).

  • Taking \(\textbf{B}=\textbf{I}\) we get a general expression \(\varvec{\Psi }(\textbf{f}) = \varvec{\phi }(\textbf{M}\textbf{f})\) for forward losses (Ghosh et al., 2015).
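For instance, taking \(\textbf{B}\) as the inverse of a square, invertible complementary-label matrix, the backward cross-entropy yields an unbiased estimate of the clean risk. A minimal numerical check (numpy, with arbitrary \(\varvec{\eta }\) and \(\textbf{f}\) chosen only for illustration):

```python
import numpy as np

# Backward correction for complementary labels (c = 3):
# Psi(f) = (M^+)^T phi(f), with phi the cross-entropy.
M = (np.ones((3, 3)) - np.eye(3)) / 2
B = np.linalg.inv(M)   # M is square and invertible here, so B = M^{-1}

def backward_ce(z_idx, f):
    phi = -np.log(f)            # vector of per-class cross-entropy losses
    return (B.T @ phi)[z_idx]   # backward-corrected weak loss

# Unbiasedness: E_z{Psi(z, f)} = eta^T (B M)^T phi(f) = eta^T phi(f),
# i.e., the weak risk equals the clean supervised risk.
eta = np.array([0.5, 0.3, 0.2])
f = np.array([0.4, 0.35, 0.25])
p = M @ eta
weak_risk = sum(p[i] * backward_ce(i, f) for i in range(3))
clean_risk = eta @ (-np.log(f))
assert np.isclose(weak_risk, clean_risk)
```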

Note that, as a consequence of Th. 1, if the weak labels are produced by a cascade of two corruption processes, that is, \(\textbf{M}= \textbf{M}_l\textbf{M}_r\), the loss \(\varvec{\Psi }(\textbf{f}) = \textbf{B}_l^\top \varvec{\phi }(\textbf{M}_r \textbf{f})\), where \(\textbf{B}_l\) is a left inverse of \(\textbf{M}_l\), is proper. Therefore, the decontamination processes can be potentially carried out through the combination of a backward and a forward component in the weak loss.
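A numerical sketch of this cascade construction (numpy; the two symmetric-noise factors and the posterior are arbitrary choices), using the cross entropy as base loss; a grid search confirms that the expected loss is minimized near \(\textbf{f}=\varvec{\eta }\):

```python
import numpy as np

def noise(rho, c=3):
    """Symmetric label-noise matrix: correct with prob 1 - rho."""
    return (1 - rho) * np.eye(c) + rho / (c - 1) * (np.ones((c, c)) - np.eye(c))

M_l, M_r = noise(0.2), noise(0.3)
M = M_l @ M_r                     # cascade of two corruption processes
B_l = np.linalg.inv(M_l)          # left inverse of the first factor

def Psi(z_idx, f, eps=1e-12):
    # Forward-backward loss: Psi(f) = B_l^T phi(M_r f), phi = cross-entropy
    return (B_l.T @ (-np.log(M_r @ f + eps)))[z_idx]

eta = np.array([0.6, 0.3, 0.1])
p = M @ eta

def risk(f):
    return sum(p[i] * Psi(i, f) for i in range(3))

grid = [np.array([a, b, 1 - a - b])
        for a in np.linspace(0, 1, 41)
        for b in np.linspace(0, 1, 41) if a + b <= 1]
best = min(grid, key=risk)
assert np.allclose(best, eta, atol=0.05)
```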

4.1 Convexity

The convexity of an \(\textbf{M}\)-proper forward-backward loss given by (12) depends on the backward matrix. In general, the convexity is preserved by any left-stochastic \(\textbf{B}\).

Theorem 2

Let \(\varvec{\Psi }\) be a forward-backward proper loss given by (12). If \(\varvec{\phi }\) is convex in \({{\mathcal {P}}}_c\) and \(\textbf{B}\) is left stochastic, then \(\varvec{\Psi }\) is convex.

Proof

The proof is straightforward: \(\varvec{\Psi }\) is the composition of the convex function \(\varvec{\phi }\) with the linear map \(\textbf{f}\mapsto \textbf{B}\textbf{M}\textbf{f}\), followed by a linear combination with nonnegative coefficients (the entries of \(\textbf{B}^\top\) are nonnegative because \(\textbf{B}\) is left stochastic); both operations preserve convexity. \(\square\)

In particular, for a forward loss, \(\textbf{B}=\textbf{I}\), which is left stochastic and, thus, the loss is convex.

Theorem 2 shows how to construct forward-backward losses that preserve convexity. However, it is generally not applicable to backward losses because, except in trivial cases (e.g., diagonal transition matrices), no left inverse of \(\textbf{M}\) is left-stochastic.

Nonetheless, Van Rooyen and Williamson (2018) have shown that convexity can be preserved for composite backward losses: if \(\textbf{f}= \varvec{\kappa }(\textbf{v})\), where \(\varvec{\kappa }\) is the inverse link function (Williamson et al., 2016), the composite backward loss \(\textbf{M}_\text {li}^\top \varvec{\phi }(\varvec{\kappa }(\textbf{v}))\) is a convex function of \(\textbf{v}\) for an appropriate choice of the left inverse. Extending this result to forward-backward losses is not straightforward.

4.2 Lower bounded losses

If \(\varvec{\phi }\) is proper and lower-bounded, the forward \(\textbf{M}\)-proper loss \(\varvec{\phi }(\textbf{M}\textbf{f})\) is also lower-bounded. This is not true in general for backward losses because the backward matrix, as a left inverse of a stochastic matrix, typically contains negative entries. Consequently, if \(\varvec{\phi }\) is not upper-bounded (as the cross entropy in (8)), the empirical risk is not lower-bounded, leading to overfitting (Sugiyama et al., 2022). Different types of training tricks (Kiryo et al., 2017; Ishida et al., 2019; Lu et al., 2020) or modifications of the cross entropy (Yoshida et al., 2021) have been proposed to address this problem.

Note, however, that any loss satisfying Theorem 2 with a lower-bounded \(\varvec{\phi }\) is also lower-bounded. Thus, incorporating a forward component can mitigate negative contributions of the backward matrix and ensure boundedness.
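This contrast can be illustrated numerically for complementary labels (numpy sketch; the score sequences are arbitrary). Weak label 0 excludes class 0, and as \(f_0\rightarrow 0\) the backward cross-entropy decreases without bound, while the forward loss stays nonnegative:

```python
import numpy as np

M = (np.ones((3, 3)) - np.eye(3)) / 2   # complementary labels, c = 3
B = np.linalg.inv(M)                     # equals 11^T - 2I: has negative entries

def backward_ce(z_idx, f, eps=1e-12):
    return (B.T @ (-np.log(f + eps)))[z_idx]

def forward_ce(z_idx, f, eps=1e-12):
    return -np.log((M @ f)[z_idx] + eps)

# Push f_0 towards 0 (the direction suggested by weak label 0): the backward
# loss diverges to -infinity, while the forward loss remains lower-bounded.
vals = []
for t in (1e-2, 1e-4, 1e-6):
    f = np.array([t, (1 - t) / 2, (1 - t) / 2])
    vals.append(backward_ce(0, f))
    assert forward_ce(0, f) >= 0.0
assert vals[0] > vals[1] > vals[2] and vals[2] < -10
```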

4.3 Optimizing the backward matrix

Although any pair of matrices \(\textbf{B}\) and \(\textbf{F}\) (satisfying \(\textbf{B}\textbf{M}= \textbf{F}\)) defines a proper loss, the choice has a strong impact on training performance. This raises the problem of selecting the optimal pair. In this section, we show that, for proper losses, theoretical arguments favor forward losses.

In general, the optimal choice may depend on \(\varvec{\eta }\), but we can optimize the selection for a given \(\varvec{\eta }\), following a procedure similar to that proposed in (Bacaicoa-Barber et al., 2021) for backward losses. To do so, assume \({{\mathcal {S}}} = \{\textbf{z}_k, k=0,\ldots ,n-1\}\) is a set of i.i.d. samples with probabilities \(p_i = P\{\textbf{z}_k= \textbf{e}_i^d\}= (\textbf{e}_i^d)^\top \textbf{M}\varvec{\eta }\), for some transition matrix \(\textbf{M}\) and some \(\varvec{\eta }\in {{\mathcal {P}}}_c\). To estimate \(\varvec{\eta }\) from \({{\mathcal {S}}}\), we can minimize the empirical risk based on a strictly \(\textbf{M}\)-proper forward-backward loss of the form (12), that is,

$$\begin{aligned} \textbf{f}^*&= \mathop {\textrm{argmin}}\limits _\textbf{f}\sum _{k=0}^{n-1} \textbf{z}_k^\top \textbf{B}^\top \varvec{\phi }(\textbf{F}\textbf{f}) \end{aligned}$$
(13)

Since \(\varvec{\phi }\) is strictly proper

$$\begin{aligned} \textbf{f}^*&= \textbf{F}^\ell \textbf{B}\overline{\textbf{p}} \end{aligned}$$
(14)

where \(\textbf{F}^\ell\) is any left inverse of \(\textbf{F}\) and

$$\begin{aligned} \overline{\textbf{p}} = \frac{1}{n} \sum _{k=0}^{n-1} \textbf{z}_k \end{aligned}$$
(15)

(that is, \(\overline{\textbf{p}}\) is a sample estimate of the weak label priors).

Noting that

$$\begin{aligned} {\mathbb {E}}\{\textbf{F}^\ell \textbf{B}\textbf{z}\} = \textbf{F}^\ell \textbf{B}\textbf{M}\varvec{\eta }= \textbf{F}^\ell \textbf{F}\varvec{\eta }= \varvec{\eta }\end{aligned}$$
(16)

we can see that \(\textbf{F}^\ell \textbf{B}\textbf{z}\) (and, thus, \(\textbf{F}^\ell \textbf{B}\overline{\textbf{p}}\)) is an unbiased estimate of \(\varvec{\eta }\). Therefore, we can select \(\textbf{B}\) and \(\textbf{F}^\ell\) in such a way that the variance of the estimate is minimized. Noting that

$$\begin{aligned} {\mathbb {E}}\{\Vert \textbf{F}^\ell \textbf{B}\textbf{z}&- \varvec{\eta }\Vert ^2\} = {\mathbb {E}}\{(\textbf{F}^\ell \textbf{B}\textbf{z} - \varvec{\eta })^\top (\textbf{F}^\ell \textbf{B}\textbf{z} - \varvec{\eta })\} \nonumber \\&= {\mathbb {E}}\{\textbf{z}^\top \textbf{B}^\top \textbf{F}^{\ell \top }\textbf{F}^\ell \textbf{B}\textbf{z}\} - 2\varvec{\eta }^\top \textbf{F}^\ell \textbf{B}{\mathbb {E}}\{\textbf{z}\} + \varvec{\eta }^\top \varvec{\eta }\nonumber \\&= \text {tr}\{\textbf{B}^\top \textbf{F}^{\ell \top }\textbf{F}^\ell \textbf{B}\varvec{\Delta }_{\textbf{p}}\} - 2\varvec{\eta }^\top \textbf{F}^\ell \textbf{B}\textbf{M}\varvec{\eta }+ \varvec{\eta }^\top \varvec{\eta }\nonumber \\&= \text {tr}\{\textbf{B}^\top \textbf{F}^{\ell \top }\textbf{F}^\ell \textbf{B}\varvec{\Delta }_{\textbf{p}}\} - \varvec{\eta }^\top \varvec{\eta }\end{aligned}$$
(17)

where \(\varvec{\Delta }_{\textbf{p}}\) is a diagonal matrix with the components of \({\mathbb {E}}\{\textbf{z}\}\) in the diagonal, and taking into account that the second term in (17) does not depend on \(\textbf{B}\), we can solve the optimization problem

$$\begin{aligned}&\min _{\textbf{B},\textbf{F}} \left\{ \text {tr}\{\textbf{B}^\top \textbf{F}^{\ell \top }\textbf{F}^\ell \textbf{B}\varvec{\Delta }_{\textbf{p}}\} \right\} \nonumber \\&\text {subject to } \textbf{BM} = \textbf{F}\text { and } \mathbbm {1}_m^\top \textbf{F}=\mathbbm {1}_c^\top \end{aligned}$$
(18)

Appendix B.2 shows that any pair of matrices \(\textbf{F}\) (left-stochastic) and \(\textbf{B}\) satisfying

$$\begin{aligned} \textbf{F}^\ell \textbf{B}= \left( \textbf{M}^\top \varvec{\Delta }_{\textbf{p}}^{-1} \textbf{M}\right) ^{-1} \textbf{M}^\top \varvec{\Delta }_{\textbf{p}}^{-1} \end{aligned}$$
(19)

is a solution to this problem.

This result has two key implications:

  • Any pair \((\textbf{F}, \textbf{B}^*)\), where \(\textbf{F}\) is an arbitrary left-stochastic matrix with rank c and

    $$\begin{aligned} \textbf{B}^* = \textbf{F}\left( \textbf{M}^\top \varvec{\Delta }_{\textbf{p}}^{-1} \textbf{M}\right) ^{-1} \textbf{M}^\top \varvec{\Delta }_{\textbf{p}}^{-1} \end{aligned}$$
    (20)

    is optimal. In particular, for \(\textbf{F}=\textbf{I}\), this is the solution proposed in (Bacaicoa-Barber et al., 2021) for backward losses.

  • The pair \((\textbf{F}, \textbf{B}) = (\textbf{M}, \textbf{I})\) is optimal (as the right-hand side of (19) is a left inverse of \(\textbf{M}\)). That is, forward proper losses are optimal.
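These claims can be checked numerically (numpy sketch with a random transition matrix and posterior, chosen only for illustration): both the weighted inverse in (19) and the plain Moore-Penrose pseudoinverse of \(\textbf{M}\) (the backward choice \(\textbf{F}=\textbf{I}\), \(\textbf{B}=\textbf{M}^+\)) are left inverses of \(\textbf{M}\), hence unbiased, but the former attains a variance objective (18) that is never larger:

```python
import numpy as np

rng = np.random.default_rng(2)
c, d = 3, 4
M = rng.dirichlet(np.ones(d), size=c).T   # random left-stochastic d x c matrix
eta = rng.dirichlet(np.ones(c))
Delta = np.diag(M @ eta)                  # diagonal of weak-label posteriors

def objective(A):
    """Variance term of (17), with A = F^l B: tr{A Delta A^T}."""
    return np.trace(A @ Delta @ A.T)

Dinv = np.linalg.inv(Delta)
A_opt = np.linalg.inv(M.T @ Dinv @ M) @ M.T @ Dinv   # the solution (19)
A_pinv = np.linalg.pinv(M)                            # plain Moore-Penrose inverse

# Both revert M (unbiasedness), but the weighted inverse has lower variance.
assert np.allclose(A_opt @ M, np.eye(c))
assert np.allclose(A_pinv @ M, np.eye(c))
assert objective(A_opt) <= objective(A_pinv) + 1e-9
```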

Note that, though all pairs satisfying (19) minimize the variance, they are not equivalent in practice, since \(\varvec{\Delta }_\textbf{p}\) depends on the unknown posterior weak label probabilities. As in (Bacaicoa-Barber et al., 2021), this can be mitigated by replacing these posteriors with the weak label priors. As the experiments will show, this choice usually outperforms other choices of the forward-backward loss, but it loses optimality.

On the other hand, forward losses are optimal without requiring any knowledge of the weak label probabilities. As a consequence, they tend to outperform any other choice of the forward-backward loss, as we will see in the experiments.

5 RC and CC forward-backward losses

In order to characterize RC and CC forward-backward losses, the concepts of order-preserving and max-preserving transformations will be essential.

Definition 7

(Order-preserving matrix) A square matrix \(\textbf{A}\) is order-preserving if the linear transformation \(\textbf{y} = \textbf{A} \textbf{x}\) preserves the order of the components, that is, for any \(i, j\), \(x_i < x_j\) iff \(y_i < y_j\).

Definition 8

(Max-preserving matrix) A square matrix \(\textbf{A}\) is max-preserving if the linear transformation \(\textbf{y} = \textbf{A} \textbf{x}\) preserves the component of the maximum, that is, for any i, \(x_i = \max _{j} x_j\) iff \(y_i = \max _j y_j\).

The following lemma shows that order-preserving and max-preserving matrices are equivalent and can be characterized by a general formula.

Lemma 1

Let \(\textbf{A}\) be a square \(d\times d\) matrix. The following conditions are equivalent:

  1. \(\textbf{A}\) is order-preserving
  2. \(\textbf{A}\) is max-preserving
  3. \(\textbf{A} = \lambda \textbf{I} + \mathbbm {1}_d \textbf{v}^\top\) for some \(\lambda > 0\) and some \(\textbf{v}\in \mathbb {R}^d\).

Proof

See Appendix B.3. \(\square\)
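The characterization in Lemma 1, and the closure under inversion stated in Lemma 2 below, are easy to check numerically. The following sketch builds a matrix of the form \(\lambda \textbf{I} + \mathbbm {1}_d \textbf{v}^\top\) with arbitrary (hypothetical) \(\lambda\) and \(\textbf{v}\) and verifies that it, and its inverse, preserve order and argmax:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
lam = 2.5                          # any λ > 0 (hypothetical value)
v = rng.normal(size=d)             # any v ∈ R^d
A = lam * np.eye(d) + np.outer(np.ones(d), v)   # A = λ I + 1 v^T (Lemma 1)

x = rng.normal(size=d)
y = A @ x                          # y_i = λ x_i + v·x : a common shift plus scaling

# Order of the components is preserved...
assert np.array_equal(np.argsort(x), np.argsort(y))
# ...and so is the position of the maximum (max-preserving)
assert np.argmax(x) == np.argmax(y)

# Lemma 2: the inverse is also order-preserving
x2 = np.linalg.inv(A) @ y
assert np.array_equal(np.argsort(x2), np.argsort(x))
```

The check makes the intuition explicit: \(\textbf{A}\textbf{x} = \lambda \textbf{x} + (\textbf{v}^\top \textbf{x})\mathbbm {1}_d\) rescales by a positive factor and adds a common constant, neither of which can reorder components.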

Using the above lemma, we can prove the following:

Lemma 2

If \(\textbf{A}\) is order-preserving and non-singular, its inverse is also order-preserving.

Proof

See Appendix B.4. \(\square\)

The following theorem generalizes a previous result in (Cid-Sueiro et al., 2014) for backward losses, to forward-backward losses, and provides a general formula for RC and CC losses.

Theorem 3

Let \(\varvec{\phi }(\textbf{q})\), \(\textbf{q}\in {{\mathbb {R}}^c}\), be an RC/CC loss, and let \(\textbf{B}\) be a matrix such that

$$\begin{aligned} \textbf{B} \textbf{M} = \beta \textbf{I} + \mathbbm {1}_c \textbf{b}^\top \end{aligned}$$
(21)

for some \(\textbf{b}\in \mathbb {R}^c\) and some \(\beta > 0\). Also, let \(\textbf{F}\) be a non-singular square matrix of the form

$$\begin{aligned} \textbf{F} = \lambda \textbf{I} + \mathbbm {1}_c \textbf{w}^\top \end{aligned}$$
(22)

for some \(\textbf{w}\in \mathbb {R}^c\) and some \(\lambda > 0\). Then, the forward-backward loss \(\varvec{\Psi }(\textbf{f}) = \textbf{B}^\top \varvec{\phi }(\textbf{F}\textbf{f})\) is \(\textbf{M}\)-RC/CC.

Proof

Note that, by Lemma 1, and taking into account Eqs. (21) and (22), both \(\textbf{B} \textbf{M}\) and \(\textbf{F}\) are order- and max-preserving matrices.

Let \(\textbf{f}^*\) be a risk minimizer, that is

$$\begin{aligned} \textbf{f}^*&\in \arg \min _\textbf{f}\varvec{\eta }^\top \textbf{M}^\top \varvec{\Psi }(\textbf{f}) = \arg \min _\textbf{f}\varvec{\eta }^\top \textbf{M}^\top \textbf{B}^\top \varvec{\phi }(\textbf{F}\textbf{f}) \end{aligned}$$
(23)

Since \(\textbf{F}\) is non-singular, we can write \(\textbf{f}^*=\textbf{F}^{-1} \textbf{v}^*\) where

$$\begin{aligned} \textbf{v}^*&\in \arg \min _\textbf{v} \varvec{\eta }^\top \textbf{M}^\top \textbf{B}^\top \varvec{\phi }(\textbf{v}) \end{aligned}$$
(24)

Assume \(\eta _i > \eta _j\). Since \(\textbf{B} \textbf{M}\) is order-preserving, \((\textbf{B} \textbf{M}\varvec{\eta })_i > (\textbf{B} \textbf{M}\varvec{\eta })_j\) and, thus, if \(\phi\) is RC, \(v_i > v_j\). By Lemma 2, since \(\textbf{F}\) is order preserving, so is \(\textbf{F}^{-1}\) and, thus, \(v_i > v_j\) implies \(f_i > f_j\) and, thus, \(\varvec{\Psi }\) is \(\textbf{M}\)-RC.

Assuming \(\eta _i > \eta _j\) for all \(j\ne i\), the same argument shows that, if \(\varvec{\phi }\) is CC, then \(\varvec{\Psi }\) is \(\textbf{M}\)-CC. \(\square\)
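The forward-backward construction \(\varvec{\Psi }(\textbf{f}) = \textbf{B}^\top \varvec{\phi }(\textbf{F}\textbf{f})\) can be sketched in a few lines of NumPy. The example below is a hypothetical illustration: it takes the cross-entropy as the base loss, a random column-stochastic transition matrix, and compares the forward choice (\(\textbf{F}=\textbf{M}\), \(\textbf{B}=\textbf{I}\)) with a backward choice \(\textbf{B}=\textbf{M}^+\) (the pseudoinverse), which satisfies (21) with \(\beta = 1\), \(\textbf{b}=\textbf{0}\):

```python
import numpy as np

def ce(q):
    """Base loss φ: component j is the cross-entropy -log q_j."""
    return -np.log(np.clip(q, 1e-12, None))

def fb_loss(f, F, B):
    """Forward-backward loss Ψ(f) = B^T φ(F f); component z is the
    loss incurred when weak label z is observed."""
    return B.T @ ce(F @ f)

rng = np.random.default_rng(2)
c, d = 3, 4
M = rng.dirichlet(np.ones(d), size=c).T          # d x c transition matrix (hypothetical)
f = rng.dirichlet(np.ones(c))                    # probabilistic class prediction

fwd = fb_loss(f, M, np.eye(d))                   # forward:  F = M, B = I
bwd = fb_loss(f, np.eye(c), np.linalg.pinv(M))   # backward: F = I, B = M^+

# Both produce one loss value per weak label, and M^+ M = I satisfies (21)
assert fwd.shape == (d,) and bwd.shape == (d,)
assert np.allclose(np.linalg.pinv(M) @ M, np.eye(c))
```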

6 Error bound

Even though the backward matrix may have negative components, we can establish the consistency of learning when the base loss \(\varvec{\phi }\) is lower- and upper-bounded.

We consider the function space \(\mathcal {F} = \{ f: \textbf{x}\mapsto {\mathbb {R}}^c\}\). For proper losses, the space of functions should be restricted to the simplex (i.e., \(\textbf{f}= f(\textbf{x})\in {{\mathcal {P}}}_c\)). However, this restriction does not affect the analysis presented here, so we will keep it in this general form. The c-valued function space can be decomposed into its components \(\mathcal {F} = \bigoplus _{i=0}^{c-1}\mathcal {F}_i\).

A learning algorithm is consistent if, as the sample size \(n\rightarrow \infty\),

$$\begin{aligned} f_n = \mathop {\textrm{argmin}}\limits _f \hat{R}(f) = \mathop {\textrm{argmin}}\limits _f \frac{1}{n} \sum _{k=0}^{n-1} \Psi (\textbf{z}_k, f(\textbf{x}_k)) \end{aligned}$$
(25)

and

$$\begin{aligned} f^* = \mathop {\textrm{argmin}}\limits _f R(f) = \mathop {\textrm{argmin}}\limits _f {\mathbb {E}}_P \left[ \Psi (\textbf{z}, f(\textbf{x}))\right] \end{aligned}$$
(26)

satisfy \(R(f_n)\rightarrow R(f^*)\) as \(n\rightarrow \infty\).

Theorem 4

Let \(\phi (\textbf{f})\) be a nonnegative L-Lipschitz loss bounded from above by M. Then, for any \(\delta > 0\), with probability at least \(1-\delta\)

$$\begin{aligned} R(f_n) - R(f^*) \le 4 \sqrt{2} L \left\| \textbf{B}\right\| \left\| \textbf{F}\right\| \sum _{i=0}^{c-1}{\mathfrak {R}}_{n}(\mathcal {G}_i) + 4 h M \Vert \textbf{B}\Vert _1 \sqrt{\frac{\log \frac{2}{\delta }}{2n}} \end{aligned}$$
(27)

where \({\mathfrak {R}}_n(\mathcal {G})\) is the Rademacher complexity for a sample size \(n\) and a function class \(\mathcal {G}\).

Proof

See Appendix B.5 \(\square\)

Since for many function classes (e.g. neural networks with bounded norm) the Rademacher complexities \({\mathfrak {R}}_{n}(\mathcal {G}_i)\) are \(\mathcal {O}(1/\sqrt{n})\) (Golowich et al., 2019), this theorem proves risk-consistency when the base loss is upper- and lower-bounded. If the base loss is strictly proper, this further implies classification-consistency.

However, this result cannot be trivially extended to all types of base losses: if the base loss is not upper bounded (e.g. for the cross entropy) and the backward matrix has negative entries, the empirical risk may be neither upper nor lower bounded, and learning may be inconsistent, as can be observed experimentally. Although suitable adjustments to the base loss and the backward matrix can mitigate this issue (Yoshida et al., 2021), it remains a main limitation in the application of backward losses.
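The unboundedness issue is easy to reproduce numerically: with the binary noise matrix (29), the inverse of the transition matrix has negative entries, so a confident prediction can drive the backward-corrected cross-entropy below zero. A minimal sketch, with a hypothetical \(\rho = 0.3\):

```python
import numpy as np

# Binary noise matrix (29) with ρ_{-1} = ρ_{+1} = 0.3 (hypothetical value)
rho = 0.3
M = np.array([[1 - rho, rho],
              [rho, 1 - rho]])
B = np.linalg.inv(M)            # backward matrix; has negative entries
assert (B < 0).any()

def ce(q):
    """Cross-entropy base loss, unbounded above as q_j -> 0."""
    return -np.log(np.clip(q, 1e-12, None))

# A near-degenerate prediction drives -log q_j to a large value, which a
# negative entry of B^T turns into a large *negative* corrected loss
f = np.array([1e-6, 1 - 1e-6])
psi = B.T @ ce(f)               # backward-corrected loss per weak label
assert psi.min() < 0            # the empirical risk is not lower-bounded
```

Minimizing such a risk rewards ever more extreme predictions, which is the mechanism behind the inconsistency observed experimentally.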

7 Experiments

This section presents a comparative analysis of forward, backward, and forward-backward losses under varying levels of label corruption. Specifically, we evaluate these losses in the proper case across three corruption types: noisy labels, complementary labels, and partial labels. Our goal is to empirically demonstrate the superiority of forward losses, consistent with our theoretical results and prior findings.

We evaluate the losses on a variety of datasets, including Banknote Authentication (Lohweg, 2012), for binary classification; MNIST (LeCun et al., 1998), CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), for multiclass classification; and a synthetic Gaussian mixture model, which allows a controlled evaluation of posterior probability estimation. This ensures that our comparison is independent of the architecture, data domain and size, and corruption models.

Label corruption processes follow models and parameterizations from prior work. In some cases, we replicate published setups to directly test whether forward losses outperform backward losses under identical conditions, preserving the fidelity of the original experiments.

To assess posterior probability estimation, we conduct controlled classification tasks on synthetic data. The synthetic dataset comprises 4000 samples drawn from four overlapping Gaussian distributions. This setting allows evaluating algorithm performance in the realizable case, where the classifier can perfectly fit the true posterior, and directly quantifying estimation quality, since the true posteriors are known.

Discrepancy between predicted and true posteriors is measured via:

$$\begin{aligned} \Vert \textbf{f}(\textbf{x}) - \varvec{\eta }(\textbf{x}) \Vert \end{aligned}$$
(28)

computed over the test set, providing a direct evaluation of estimation accuracy.

Regardless of corruption type, training uses multiclass logistic regression with an Adam optimizer (learning rate \(10^{-3}\)) for 50 epochs, repeated 10 times.

7.1 Noisy labels

To ensure a comprehensive evaluation, we test our method across datasets of increasing complexity: binary classification (Banknote), simple multiclass (MNIST), deep learning benchmarks (CIFAR), and synthetic datasets.

Banknote-authentication.

We begin with a binary classification task using the banknote-authentication dataset, which classifies genuine versus forged banknotes, adopting the corruption process in Natarajan et al. (2013), given by

$$\begin{aligned} \textbf{M}= \left( {\begin{smallmatrix} 1-\rho _{-1} & \rho _{+1} \\ \rho _{-1} & 1-\rho _{+1} \end{smallmatrix}}\right) \end{aligned}$$
(29)

evaluating the performance for different values of \(\varvec{\rho } =(\rho _{-1},\rho _{+1})\). We also consider a decomposition of the matrix \(\textbf{M}\) as the product of two matrices \(\textbf{M}_l\) and \(\textbf{M}_r\), such that \(\textbf{M}=\textbf{M}_l \textbf{M}_r\). This decomposition lets us set \((\textbf{F},\textbf{B}) = (\textbf{M}_r,\textbf{B}_l)\), so that the loss is computed as \(\varvec{\Psi }(\textbf{f}) = \textbf{B}_l^\top \varvec{\phi }(\textbf{M}_r \textbf{f})\), where \(\textbf{B}_l\) is a left inverse of \(\textbf{M}_l\).
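As an illustration, the label corruption process induced by (29) can be sketched as follows; the noise levels \((0.2, 0.4)\) are hypothetical values, and each noisy label is drawn from the column of \(\textbf{M}\) indexed by the clean label:

```python
import numpy as np

def noise_matrix(rho_neg, rho_pos):
    """Binary transition matrix (29): column i gives P(weak label | class i)."""
    return np.array([[1 - rho_neg, rho_pos],
                     [rho_neg, 1 - rho_pos]])

def corrupt(y, M, rng):
    """Sample a noisy label for each clean label y_k from column y_k of M."""
    return np.array([rng.choice(M.shape[0], p=M[:, yk]) for yk in y])

rng = np.random.default_rng(0)
M = noise_matrix(0.2, 0.4)
y = rng.integers(0, 2, size=10000)
z = corrupt(y, M, rng)

# Empirical flip rates should be close to (ρ_{-1}, ρ_{+1})
assert abs((z[y == 0] != 0).mean() - 0.2) < 0.03
assert abs((z[y == 1] != 1).mean() - 0.4) < 0.03
```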

We train a Logistic Regression model. Fig. 2 shows that as corruption levels increase, forward and forward-backward losses consistently outperform backward loss, particularly at higher noise levels, with higher median accuracy and lower variability in training and testing. Forward-backward loss behaves similarly to forward loss, though differences are minor due to the dataset’s small size and classification simplicity.

Fig. 2
figure 2

Comparison of training and testing accuracy for backward (Bwd), forward-backward (Fwd-Bwd), and forward (Fwd) losses for binary classification with noisy labels

MNIST

For this dataset, we follow the label corruption process described by Natarajan et al. (2013), using the transition matrix:

$$\begin{aligned} \textbf{M} = \left( {\begin{smallmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & \rho & 0 & 0 \\ 0 & 0 & 1-\rho & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1-\rho & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1-\rho & \rho & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & \rho & 1-\rho & 0 & 0 & 0 \\ 0 & 0 & \rho & 0 & 0 & 0 & 0 & 1-\rho & 0 & 0 \\ 0 & 0 & 0 & \rho & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{smallmatrix}}\right) \end{aligned}$$
(30)

As in their setting, labels are flipped with probability \(\rho\) between similar digits: \(2 \rightarrow 7,\ 3 \rightarrow 8,\ 5 \leftrightarrow 6,\ 7 \rightarrow 1\). Also, we decompose the matrix \(\textbf{M}\) such that \(\textbf{M}_l\) encompasses label flipping with probability \(\rho\) between \(2 \rightarrow 7\) and \(7 \rightarrow 1\), whereas \(\textbf{M}_r\) encompasses label flipping with probability \(\rho\) between \(3 \rightarrow 8\) and \(5 \leftrightarrow 6\). Hence, \((\textbf{F},\textbf{B}) = (\textbf{M}_r,\textbf{B}_l)\) and the loss is computed as \(\varvec{\Psi }(\textbf{f}) = \textbf{B}_l^\top \varvec{\phi }(\textbf{M}_r \textbf{f})\), where \(\textbf{B}_l\) is a left inverse of \(\textbf{M}_l\).

We train a multilayer perceptron (MLP) with an input layer and a hidden layer of size 784 and an output layer of size 10. The Adam optimizer is used again with an initial learning rate of \(10^{-3}\). Figure 3 shows that forward loss outperforms backward loss, achieving higher accuracy and lower variability in both training and testing.

Fig. 3
figure 3

Comparison of training and testing accuracy for backward (Bwd), forward-backward (Fwd-Bwd), and forward (Fwd) losses for the MNIST dataset with noisy labels

CIFAR-10

We use a ResNet-18 architecture trained with SGD (learning rate \(10^{-3}\), momentum 0.9, weight decay \(5\times 10^{-4}\)). Due to computational demands, we limit the experiment to 4 repetitions and 40 epochs.

For the label noise, we follow the process described by Natarajan et al. (2013), where labels are flipped with probability \(\rho\) between the next classes: Truck \(\rightarrow\) Automobile, Bird \(\rightarrow\) Airplane, Deer \(\rightarrow\) Horse, and Cat \(\leftrightarrow\) Dog.

The decomposition for the forward-backward loss is such that \(\textbf{M}_l\) encompasses label flipping with probability \(\rho\) between Truck \(\rightarrow\) Automobile and Bird \(\rightarrow\) Airplane, whereas \(\textbf{M}_r\) encompasses label flipping with probability \(\rho\) between Deer \(\rightarrow\) Horse and Cat \(\leftrightarrow\) Dog.

Fig. 4
figure 4

Comparison of training and testing accuracy for backward (Bwd), forward-backward (Fwd-Bwd), and forward (Fwd) losses for the CIFAR-10 dataset with noisy labels

The results in Fig. 4 show that, despite greater variability, forward loss achieves higher median accuracy in both training and testing.

CIFAR-100

Similarly, we test the losses on CIFAR-100 using a ResNet-32 architecture with the same optimizer settings as for ResNet-18. The experiment is restricted to 4 repetitions and 40 epochs due to computational constraints.

For label noise, we follow (Natarajan et al., 2013): The 100 classes are grouped into 20 superclasses (5 classes each). Noise flips each class circularly within superclasses 1–10, repeating the pattern for superclasses 11–20. Thus, \(\textbf{M}\) is a block matrix:

$$\begin{aligned} \textbf{M}=\left( {\begin{smallmatrix} \textbf{A} & \textbf{0}\\ \textbf{0} & \textbf{A}\end{smallmatrix}}\right) \ \text {where}\ \textbf{A} = \left( {\begin{smallmatrix} 1-\rho & 0 & 0 & \cdots & 0 & \rho \\ \rho & 1-\rho & 0 & \cdots & 0 & 0 \\ 0 & \rho & 1-\rho & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1-\rho & 0 \\ 0 & 0 & 0 & \cdots & \rho & 1-\rho \end{smallmatrix}}\right) , \end{aligned}$$
(31)

where \(\textbf{A}\) is a \(10 \times 10\) matrix.

A simple decomposition of the transition matrix is given by

$$\begin{aligned} \textbf{M}_l=\left( {\begin{smallmatrix} \textbf{A} & \textbf{0}\\ \textbf{0} & \textbf{I}\end{smallmatrix}}\right) \ \text {and}\ \textbf{M}_r=\left( {\begin{smallmatrix} \textbf{I} & \textbf{0}\\ \textbf{0} & \textbf{A}\end{smallmatrix}}\right) \end{aligned}$$
(32)
Fig. 5
figure 5

Comparison of training and testing accuracy for backward (Bwd), forward-backward (Fwd-Bwd), and forward (Fwd) losses for the CIFAR-100 dataset with noisy labels

In Fig. 5, the forward loss shows noticeably higher median accuracy than the backward and forward-backward losses and smaller variability. As the noise level increases, the forward loss still maintains a more stable accuracy profile.

Gaussian Mixture Model

Lastly, as mentioned before, we will train a logistic classifier for the Gaussian Mixture Model to evaluate the quality of the posterior probability estimates.

Labels were corrupted using the transition matrix

$$\begin{aligned} \textbf{M} = \left( {\begin{smallmatrix} 1-\rho & \rho /3 & \rho /3 & \rho /3 \\ \rho /3 & 1-\rho & \rho /3 & \rho /3 \\ \rho /3 & \rho /3 & 1-\rho & \rho /3 \\ \rho /3 & \rho /3 & \rho /3 & 1-\rho \\ \end{smallmatrix}}\right) \end{aligned}$$
(33)

We factorize the transition matrix as \(\textbf{M}= \textbf{A}^2\) so \(\textbf{M}_l=\textbf{M}_r=\textbf{A}\). Notice that the case \(\rho =0.8\) is not used here, as for a 4-class problem it would mean that each noisy class is more likely than the true class.
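Since \(\textbf{M}\) in (33) is symmetric and positive definite for moderate \(\rho\), the factor \(\textbf{A}\) can be computed as its principal matrix square root. A sketch with a hypothetical \(\rho = 0.3\), using an eigendecomposition:

```python
import numpy as np

rho, c = 0.3, 4
# Symmetric transition matrix (33): 1-ρ on the diagonal, ρ/3 elsewhere
M = (1 - rho) * np.eye(c) + (rho / 3) * (np.ones((c, c)) - np.eye(c))

# Eigenvalues are 1 (eigenvector 1) and 1 - 4ρ/3, so M is positive
# definite for ρ < 3/4 and its principal square root exists
w, V = np.linalg.eigh(M)
A = V @ np.diag(np.sqrt(w)) @ V.T        # A with A A = M

assert np.allclose(A @ A, M)
assert np.allclose(A.sum(axis=0), 1.0)   # A is still column-stochastic
```

Column-stochasticity of \(\textbf{A}\) follows because \(\mathbbm {1}\) is an eigenvector of \(\textbf{M}^\top\) with eigenvalue 1, which the square root preserves.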

Fig. 6
figure 6

Comparison of training and testing accuracy for backward (Bwd), forward-backward (Fwd-Bwd), and forward (Fwd) losses for the dataset with the mixture of Gaussians with noisy labels

In Fig. 6, as the noise level increases, the median accuracy declines and the variance grows for all methods, reflecting the added difficulty of heavier label noise. Nevertheless, the forward loss maintains a performance advantage, with its boxplot showing a higher central tendency and tighter interquartile ranges.

Fig. 7
figure 7

Distribution of the mean norm (left) and standard deviation norm (right) of the difference between the prediction and the true posterior distribution for the Gaussian mixture model with noisy labels

In Fig. 7, forward loss achieves lower median errors and tighter interquartile ranges, confirming its superior ability to approximate the true posterior distribution. These results underscore the advantage of forward losses in accurately modeling posterior probabilities for noisy label settings.

7.2 Complementary labels

We also evaluate a complementary label setting (Ishida et al., 2019) where the transition matrix \(\textbf{M}\) has components \(m_{ij} = \frac{1-\delta _{ij}}{c-1}\), where \(\delta _{ij}\) is the Kronecker delta.

Since a complementary label is selected at random from the negative classes (i.e., all classes other than the true class), we can decompose this selection into two steps: in the first step, we select half of the negative classes at random; in the second step, we take one of these selected classes at random. These two steps define the respective left and right matrices of the decomposition \(\textbf{M}=\textbf{M}_l \textbf{M}_r\).
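The complementary-label transition matrix itself is straightforward to construct; a minimal sketch for a hypothetical \(c = 10\):

```python
import numpy as np

c = 10
# Complementary-label transition matrix: the weak label is drawn uniformly
# from the c-1 classes other than the true one
M = (np.ones((c, c)) - np.eye(c)) / (c - 1)

assert np.allclose(np.diag(M), 0.0)       # the true class is never observed
assert np.allclose(M.sum(axis=0), 1.0)    # columns are valid distributions
```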

Using the same architectures applied to noisy labels, we evaluated MNIST, CIFAR-10, and CIFAR-100. Results are summarized in Fig. 8:

Fig. 8
figure 8

Comparison of training and testing accuracy for backward (Bwd), forward-backward (Fwd-Bwd), and forward (Fwd) losses under complementary labels. From left to right: MNIST, CIFAR-10, and CIFAR-100

For MNIST (left), the forward loss achieves the highest median accuracy with low variability, demonstrating robust performance under severe label corruption. The backward loss consistently shows the worst performance, while the forward-backward loss falls in between.

On CIFAR-10 (center), the forward loss again achieves the highest median accuracy compared to the forward-backward and backward losses, though variability across runs increases. The backward loss performs notably poorly, underscoring the advantage of forward correction.

For CIFAR-100 (right), the forward loss displays higher variability across runs but achieves the highest median accuracy among the three methods. In contrast, both the backward and forward-backward losses encounter more pronounced learning difficulties, which can lead to noticeably lower performance.

Gaussian Mixture Model

We now analyze the complementary label setting for the Gaussian mixture model.

Fig. 9
figure 9

Comparison of training and testing accuracy for forward and backward losses for the Gaussian mixture model with complementary labels

Figure 9 shows that forward loss consistently outperforms the others, achieving higher median accuracies on both training and test sets, highlighting its robustness with complementary labels.

Fig. 10
figure 10

Distribution of the mean norm (left) and standard deviation norm (right) of the difference between the prediction and the true posterior distribution for the Gaussian mixture model with complementary labels

Figure 10 shows that forward loss achieves the lowest discrepancy, demonstrating superior accuracy in approximating true posteriors. In contrast, backward loss exhibits higher mean error and greater variability, highlighting its poorer performance.

7.3 Partial labels

Finally, we explore partial label corruption as modeled in Cour et al. (2011); Feng et al. (2020). The corruption process is defined as:

$$\begin{aligned}&P(\varvec{\omega }|\textbf{y}=\textbf{e}_i)= {\left\{ \begin{array}{ll} 1-\rho & \text {if}\ \varvec{\omega }=\textbf{y}\\ \frac{\rho }{2^{c-1}-1}& \text {if}\ \varvec{\omega }\ne \textbf{y}\ \text {and}\ \varvec{\omega }^\top \textbf{y}=1\\ 0& \text {if}\ \varvec{\omega }^\top \textbf{y}=0 \\ \end{array}\right. } \end{aligned}$$
(34)

This parametrization of the transition matrix enables flexible corruption of the dataset: larger values of \(\rho\) result in higher corruption.
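The corruption process (34) can be simulated by rejection sampling over the negative classes; the sketch below uses hypothetical values \(c = 4\) and \(\rho = 0.5\):

```python
import numpy as np

def sample_partial_label(y, c, rho, rng):
    """Sample a partial label ω (binary vector) for true class y per (34):
    with probability 1-ρ, ω is the singleton {y}; otherwise ω is a uniformly
    chosen strict superset of {y}, i.e. one of the 2^(c-1) - 1 candidate sets."""
    omega = np.zeros(c, dtype=int)
    omega[y] = 1                          # the true class is always included
    if rng.random() < rho:
        while omega.sum() == 1:           # reject the empty draw over the
            others = rng.integers(0, 2, size=c - 1)   # remaining c-1 classes
            omega[np.arange(c) != y] = others
    return omega

rng = np.random.default_rng(0)
draws = [sample_partial_label(2, 4, 0.5, rng) for _ in range(5000)]
assert all(w[2] == 1 for w in draws)      # ω^T y = 1 always holds
frac_singleton = np.mean([w.sum() == 1 for w in draws])
assert abs(frac_singleton - 0.5) < 0.03   # P(ω = y) ≈ 1 - ρ
```

Rejecting the empty draw makes the \(2^{c-1}-1\) non-trivial candidate sets equally likely, matching the second case of (34).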

For partial labels, we evaluate the backward losses (\(\textbf{F}=\textbf{I}\)) with and without the convexity constraint, and with and without the optimized matrix \(\textbf{B}^*\) in (20). Additionally, we will assess the forward loss (\(\textbf{F}=\textbf{M}\), \(\textbf{B}=\textbf{I}\)), as well as the optimized forward-backward loss given by \(\textbf{F}=\textbf{M}\) and the optimal backward matrix \(\textbf{B}^*\) in (20). When needed, matrix \(\varvec{\Delta }_\textbf{p}\) is computed using weak label priors, estimated from weak label proportions following the method proposed in (Bacaicoa-Barber et al., 2021).

MNIST

First, we test on the MNIST dataset, trained in the same manner as in the noisy or complementary label setting.

Fig. 11
figure 11

Comparison of training and testing accuracy for forward, backward, and forward backward losses for the MNIST dataset with partial labels. The boxplots are ordered as follows (from left to right): Backward (bwd), Convex backward, backward (\(\textbf{B}^*\)), convex backward (\(\textbf{B}^*\)), forward (fwd)-backward, forward

Figure 11 shows that forward loss once again outperforms the other losses. Consistent with prior observations, forward loss achieves the highest median accuracy, clearly outperforming the other losses. Moreover, forward-backward loss tends to exceed the performance of the backward losses.

CIFAR 10

The CIFAR-10 dataset is trained in the same manner as in the noisy or complementary label setting.

Fig. 12
figure 12

Comparison of training and testing accuracy for forward, backward, and forward backward losses for the CIFAR-10 dataset with partial labels. The boxplots are ordered as follows (from left to right): Backward (bwd), Convex backward, backward (\(\textbf{B}^*\)), convex backward (\(\textbf{B}^*\)), forward (fwd)-backward, forward

In Fig. 12, the forward loss achieves the highest median accuracy on both training and testing sets, consistent with previous results. The forward-backward loss also tends to outperform the other backward losses, likely due to the pseudoinverse used in their computation.

Gaussian Mixture Models

Finally, we evaluate the performance of forward, backward, and forward-backward losses under partial label corruption for the Gaussian Mixture Models.

Fig. 13
figure 13

Comparison of training and testing accuracy for forward, backward, and forward-backward losses for the Gaussian mixture model with partial labels. The boxplots are ordered as follows (from left to right): Backward (bwd), Convex backward, backward (\(\textbf{B}^*\)), convex backward (\(\textbf{B}^*\)), forward (fwd)-backward, forward

As seen in Fig. 13, forward losses again outperform other losses, with forward-backward losses also performing well on both training and test sets. Backward losses continue to underperform, especially when \(\textbf{B}\) is not optimized and under higher corruption levels.

Fig. 14
figure 14

Distribution of the mean norm (left) and standard deviation norm (right) of the difference between the prediction and the true posterior distribution for the Gaussian mixture model with partial labels. The boxplots are ordered as follows (from left to right): Backward (bwd), convex backward, backward (\(\textbf{B}^*\)), convex backward (\(\textbf{B}^*\)), forward (fwd)-backward, forward

Figure 14 highlights the superior performance of forward loss in approximating true posteriors, consistently showing lower error and variability. Forward-backward loss offers a reasonable compromise, outperforming backward approaches but still falling short of forward losses. As before, methods relying on the pseudoinverse of the transition matrix underperform.

7.4 Clothing1M

We tested the approaches presented in this paper on the real-world noisy dataset Clothing1M (Xiao et al., 2015). We estimated the transition matrix empirically by counting relative frequencies on the subset of instances for which both the true and the noisy labels are available.

For the forward-backward loss, we decomposed the estimated transition matrix \(\hat{\textbf{M}}\) numerically, using a factorization \(\textbf{M}_l\textbf{M}_r \approx \hat{\textbf{M}}\).
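The counting estimator can be sketched as follows; the transition matrix and sample size below are hypothetical, used only to check that the estimator recovers a known \(\textbf{M}\):

```python
import numpy as np

def estimate_transition(z, y, c):
    """Estimate M̂[i, j] = P(noisy label i | true label j) by counting,
    on the subset of instances where both labels are available."""
    M_hat = np.zeros((c, c))
    for zi, yj in zip(z, y):
        M_hat[zi, yj] += 1
    col = M_hat.sum(axis=0, keepdims=True)
    return M_hat / np.where(col > 0, col, 1)   # normalize each column

# Synthetic check: sample from a known M and recover it
rng = np.random.default_rng(0)
c = 3
M = np.array([[0.8, 0.1, 0.0],
              [0.1, 0.8, 0.2],
              [0.1, 0.1, 0.8]])
y = rng.integers(0, c, size=50000)
z = np.array([rng.choice(c, p=M[:, yk]) for yk in y])
assert np.abs(estimate_transition(z, y, c) - M).max() < 0.02
```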

A ResNet-50 architecture, pre-trained on ImageNet, was employed as the base classifier. The model was trained for 10 epochs using the Adam optimizer with a learning rate of \(10^{-3}\). The training phase used the noisy labels provided in the Clothing1M dataset, whereas the evaluation used the clean test set. Only one repetition was made, since the weakly labeled dataset is given and no label-weakening process was applied.

Fig. 15
figure 15

Comparison of test accuracy for forward (Fwd), backward (Bwd), and forward-backward loss corrections on the Clothing1M dataset for 10 training epochs

Figure 15 shows that the forward loss correction consistently outperforms both the backward and forward-backward methods in test accuracy over the entire training period. It reaches a higher final accuracy, confirming earlier experiments that highlight the superiority of the forward loss correction for proper losses.

7.5 Overall discussion and conclusions

In summary, the experiments reveal several key findings: first, convexification helps mitigate the convergence issues typically observed with backward losses in partial label learning, generally resulting in better performance than backward losses without convexity constraints. Second, the optimal backward matrix defined in (20) consistently outperforms alternatives such as the pseudoinverse of the transition matrix. In practice, the forward-backward loss (with \(\textbf{F}=\textbf{M}\) and \(\textbf{B}=\textbf{B}^*\)) offers a balanced trade-off between forward and backward approaches.

Third, in both complementary and noisy label scenarios, forward-backward losses tend to achieve intermediate performance between backward and forward losses. Notably, the performance gap between forward and forward-backward losses increases when the transition matrix is block-structured and thus trivially factorizable, compared to cases with uniformly distributed noise. This may occur because, in such settings, two independent noise processes are effectively present, and the forward component can only correct one of them. As a result, the forward-backward method does not fully close the gap to the forward approach.

Overall, forward losses demonstrate the best performance, consistently surpassing forward-backward and backward losses in both accuracy and stability.

8 Conclusions

In this study, we introduced a unified framework that integrates forward and backward loss functions for learning from weak labels, providing a comprehensive understanding of their shared properties.

By combining these losses into a single family of forward-backward losses, we clarified their relationships and offered deeper insights into their common characteristics. We established sufficient conditions under which forward-backward losses are proper, ranking-calibrated, classification-calibrated, convex, and lower-bounded. These conditions address critical challenges often associated with backward losses, such as non-convexity and the lack of a lower bound, ensuring that forward-backward losses retain essential properties for effective learning.

This unification has also enabled a systematic comparative analysis, demonstrating that no backward or forward-backward loss can outperform forward losses for posterior probability estimation. Theoretical findings align with experimental results, confirming the robustness and effectiveness of forward losses in mitigating the challenges posed by weak labels. For RC and CC losses, our framework offers a unified perspective that can inspire the development of new learning algorithms.

8.1 Limitations and further work

One important limitation of our framework, shared by all methods based on forward or backward correction, is the assumption that the transition matrix is known and feature-independent. Some models circumvent this problem by making independence assumptions on the weak labels (Feng et al., 2020; Katsura & Uchida, 2021; Ishida et al., 2017) that may not be realistic. Additional approaches make assumptions on the weak labeling process (often related to the dominance of the true class over noisy classes in the weak label (Lv et al., 2020; Zhang et al., 2021; Ambroise et al., 2001; Wu et al., 2023)) which can be translated into constraints on \(\textbf{M}\). In general, the probabilistic calibration of the models requires loss functions that depend on \(\textbf{M}\) (Cid-Sueiro, 2012; Van Rooyen & Williamson, 2018; Yoshida et al., 2021). Some methods that have been proposed to estimate the transition matrix from data (noise-free labels, anchor points, etc.) have been discussed in Sect. 1.

The assumption that the transition matrix is feature-independent has been widely adopted, though it may not be supported by empirical evidence (for instance, one can expect higher label noise from human annotators for input images that are near the decision boundaries). While our theoretical analysis does not require feature independence, in practice our experimental results rely on this simplifying assumption. The development of instance-dependent models has attracted some recent interest, in particular for the noisy label case (see (Xia et al., 2020), for instance). Investigating feature-dependent transition matrices derived from realistic and domain-specific models of the annotation process could be a valuable direction for future research.

Our work is a step towards a complete characterization of losses for learning from weak labels, though there is further work to be done in this direction. Although some losses can be partially connected to our work (like (Wu et al., 2023)), other relevant losses, like (Cour et al., 2011) and many others, cannot be fitted into our framework. Our ongoing research aims to develop more general formulations that will lead to more efficient and robust learning algorithms.