
Generalization Error of the Tilted Empirical Risk

Gholamali Aminian¹, Amir R. Asadi², Tian Li³, Ahmad Beirami⁴, Gesine Reinert¹,⁵, Samuel N. Cohen¹,⁶

October 18, 2024


arXiv:2409.19431v2 [stat.ML] 17 Oct 2024

Abstract
The generalization error (risk) of a supervised statistical learning algorithm quantifies its
prediction ability on previously unseen data. Inspired by exponential tilting, Li et al. (2021)
proposed the tilted empirical risk as a non-linear risk metric for machine learning applications
such as classification and regression problems. In this work, we examine the generalization error
of the tilted empirical risk. In particular, we provide uniform and information-theoretic bounds
on the tilted generalization error, defined as the difference between the population risk and the
tilted empirical risk, with a convergence rate of O(1/√n), where n is the number of training
samples. Furthermore, we study the solution to the KL-regularized expected tilted empirical
risk minimization problem and derive an upper bound on the expected tilted generalization error
with a convergence rate of O(1/n).

¹The Alan Turing Institute. ²Department of Statistics, University of Cambridge. ³University of Chicago. ⁴Google DeepMind. ⁵Department of Statistics, University of Oxford. ⁶Mathematical Institute, University of Oxford.

Contents
1 Introduction
2 Preliminaries
2.1 Problem Formulation
2.2 Risk Functions
3 Generalization Bounds for Bounded Loss Function
3.1 Uniform Bounds
3.2 Information-theoretic Bounds
4 Generalization Bounds for Unbounded Loss Functions
5 Robustness of TERM
6 The KL-Regularized TERM Problem
7 Related Works
8 Conclusion and Future Work
A Overview of Main Results
B Other Related Works
C Technical Tools
D Proofs and Details of Section 3
D.1 Uniform bound: details for bounded loss
D.2 Information-theoretic bounds: details for bounded loss
E Proofs and Details of Section 4
E.1 Uniform bounds: details for unbounded loss
E.2 Information-theoretic Bounds
F Proof and details of Section 5
G Proofs and details of Section 6
H Other Bounds
H.1 Rademacher Complexity
H.2 A Stability Bound
H.3 A PAC-Bayesian Bound

1 Introduction
Empirical risk minimization (ERM) is a popular framework in machine learning. The performance of
the empirical risk (ER) is affected when the data set is strongly imbalanced or contains outliers. For
these scenarios, inspired by the log-sum-exponential operator with applications in multinomial linear
regression and naive Bayes classifiers (Calafiore et al., 2019; Murphy, 2012; Williams and Barber,
1998), the tilted empirical risk (TER) was proposed by Li et al. (2021) for supervised learning
applications such as classification and regression problems. Li et al. (2021, 2023a) showed that tilted
empirical risk minimization (TERM) can handle class imbalance, mitigate the effect of outliers,
and enable fairness between subgroups. Different applications of TERM have been explored, e.g.,
differential privacy (Lowy and Razaviyayn, 2021), semantic segmentation (Szabó et al., 2021) and
noisy label self-correction (Zhou et al., 2020). In this paper, we aim to corroborate the empirical
success of the TERM framework through statistical learning theory.
A central concern in statistical learning theory is understanding the efficacy of a learning al-
gorithm when applied to test data. This evaluation is typically carried out by investigating the
generalization error, which quantifies the disparity between the performance of the algorithm on
the training dataset and its performance on previously unseen data, drawn from the same un-
derlying distribution, via a risk function. Understanding the generalization behaviour of learning
algorithms is one of the most important objectives in statistical learning theory. Various approaches
have been developed (Rodrigues and Eldar, 2021), including VC dimension-based bounds (Vapnik,
1999), stability-based bounds (Bousquet and Elisseeff, 2002b), PAC Bayesian bounds (McAllester,
2003), and information-theoretic bounds (Russo and Zou, 2019; Xu and Raginsky, 2017). However,
the generalization performance of the TER under unbounded loss functions has not yet been studied. This
paper focuses on the generalization error of the tilted empirical risk (tilted generalization error) of
learning algorithms. Our contributions can be summarized as follows:

• We provide upper and lower bounds on the tilted generalization error via uniform and information-theoretic
approaches under bounded loss functions, for both positive and negative tilts, and
show that the convergence rates of the upper bounds are O(1/√n), as expected, where n is
the number of training samples.

• We provide upper and lower bounds on the tilted generalization error under unbounded loss
functions for the negative tilt, with a convergence rate of O(1/√n), via a uniform approach.

• We study the robustness of the tilted empirical risk under distribution shifts induced by noise or
outliers, for unbounded loss functions with a bounded second-moment assumption and negative
tilt, and derive generalization bounds that justify the robustness properties of TERM (Li et al.,
2021) for negative tilt.

• We study the KL-regularized TERM problem and provide an upper bound on the expected
tilted generalization error with convergence rate O(1/n).

The paper is organised as follows: Section 2 introduces notation, the problem, and the risk
functions used in this paper. Our upper and lower bounds on the tilted generalization error for
bounded loss functions via uniform and information-theoretic approaches are given in Section 3.
The upper and lower bounds on the tilted generalization error for unbounded loss functions with
bounded second moments are given in Section 4. Section 5 is devoted to the study of robustness
to the distributional shift in training samples. The KL-regularized TERM is considered in Section
6. Section 7 surveys related work. Section 8 concludes the paper. Technical tools and proofs are
deferred to the appendices.

2 Preliminaries
Notations: Upper-case letters denote random variables (e.g., Z), lower-case letters denote the
realizations of random variables (e.g., z), and calligraphic letters denote sets (e.g., Z). All logarithms
are in the natural base. The tilted expectation of a random variable X with tilting γ is defined
as (1/γ) log(E[exp(γX)]). The set of probability distributions (measures) over a space X with finite
variance is denoted by P(X ).

Information measures: For two probability measures P and Q defined on the space X, such
that P is absolutely continuous with respect to Q, the Kullback-Leibler (KL) divergence between
P and Q is KL(P‖Q) := ∫_X log(dP/dQ) dP. If Q is also absolutely continuous with respect to
P, then the symmetrized KL divergence is DSKL(P‖Q) := KL(P‖Q) + KL(Q‖P). The mutual
information between two random variables X and Y is defined as the KL divergence between
the joint distribution and the product-of-marginals distribution, I(X;Y) := KL(PX,Y‖PX ⊗ PY), or,
equivalently, as the conditional KL divergence between PY|X and PY over PX, KL(PY|X‖PY|PX) :=
∫_X KL(PY|X=x‖PY) dPX(x). The symmetrized KL information between X and Y is given by
ISKL(X;Y) := DSKL(PX,Y‖PX ⊗ PY); see (Aminian et al., 2015). The total variation distance between
two densities P and Q is defined as TV(P,Q) := ∫_X |P − Q|(dx).

2.1 Problem Formulation


Let S = {Z_i}_{i=1}^n be the training set, where each sample Z_i = (X_i, Y_i) belongs to the instance space
Z := X × Y; here X is the input (feature) space and Y is the output (label) space. We assume that
the Z_i are i.i.d., generated from the same data-generating distribution µ.
Here we consider a set of hypotheses H with elements h : X → Y. When H is finite,
its cardinality is denoted by card(H). In order to measure the performance of a hypothesis h, we
consider a non-negative loss function ℓ : H × Z → R_0^+.
We apply different methods to study the performance of our algorithms, including uniform
and information-theoretic approaches. In uniform approaches, such as the VC-dimension and the
Rademacher complexity approach (Bartlett and Mendelson, 2002; Vapnik, 1999), the hypothesis
space is independent of the learning algorithm. Therefore, these methods are algorithm-independent;
our results for these methods do not depend on the specific learning algorithm.
Learning Algorithms: For information-theoretic approaches in supervised learning, following
Xu and Raginsky (2017), we consider learning algorithms that are characterized by a Markov kernel
(a conditional distribution) PH|S . Such a learning algorithm maps a data set S to a hypothesis in
H, which is chosen according to PH|S . This concept thus includes randomized learning algorithms.
Robustness: Suppose that, due to outliers or noise in the training samples, the underlying
distribution of the training dataset Ŝ is shifted towards the distribution of the noise or outliers,
denoted by µ̃ ∈ P(Z). We model the distributional shift via the distribution µ̃, inspired by the notion
of the influence function (Christmann and Steinwart, 2004; Marceau and Rioux, 2001;
Ronchetti and Huber, 2009).

2.2 Risk Functions
The main quantity we are interested in is the population risk, defined by
\[
R(h, \mu) := \mathbb{E}_{\tilde{Z}\sim\mu}\big[\ell(h, \tilde{Z})\big], \qquad h \in \mathcal{H}.
\]
As the distribution µ is unknown, in classical statistical learning, the (true) population risk for
h ∈ H is estimated by the (linear) empirical risk
\[
\widehat{R}(h, S) = \frac{1}{n}\sum_{i=1}^{n} \ell(h, Z_i). \tag{1}
\]

The generalization error for the linear empirical risk is given by
\[
\mathrm{gen}(h, S) := R(h, \mu) - \widehat{R}(h, S); \tag{2}
\]
this is the difference between the true risk and the linear empirical risk. The TER, as a non-linear
empirical risk (Li et al., 2021) and estimator of population risk, is defined by

\[
\widehat{R}_\gamma(h, S) = \frac{1}{\gamma} \log\left( \frac{1}{n}\sum_{i=1}^{n} \exp\big(\gamma\,\ell(h, Z_i)\big) \right).
\]
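As a quick numerical illustration (ours, not part of the original analysis), the following Python sketch computes the TER from a vector of per-sample losses using a log-sum-exp formulation; the helper name tilted_empirical_risk is our own.

```python
import numpy as np

def tilted_empirical_risk(losses, gamma):
    """Tilted empirical risk: (1/gamma) * log( (1/n) * sum_i exp(gamma * loss_i) )."""
    losses = np.asarray(losses, dtype=float)
    n = losses.size
    if gamma == 0.0:
        # The limit gamma -> 0 recovers the linear empirical risk (1).
        return losses.mean()
    # log-sum-exp for numerical stability: logsumexp(gamma * losses) - log(n), then divide by gamma.
    return (np.logaddexp.reduce(gamma * losses) - np.log(n)) / gamma

losses = np.array([0.1, 0.2, 0.3, 5.0])          # the last sample plays the role of an outlier
print(tilted_empirical_risk(losses, 1e-8))       # ~ 1.4, essentially the linear empirical risk
print(tilted_empirical_risk(losses, -5.0))       # ~ 0.24, the outlier is downweighted
print(tilted_empirical_risk(losses, +5.0))       # ~ 4.7, the worst loss dominates
```

Consistent with the discussion that follows, the value increases with γ and approaches the linear empirical risk as γ → 0.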
The TER is an increasing function in γ (Li et al., 2023a, Theorem 1), and as γ → 0, the TER
converges to the linear empirical risk in (1). Inspired by (Li et al., 2021), the primary objective
is to optimize the population risk; the TERM is utilized in order to help the learning dynamics.
Therefore, we decompose the population risk as follows:
\[
R(h, \mu) = \underbrace{R(h, \mu) - \widehat{R}_\gamma(h, S)}_{\text{tilted generalization error}} \;+\; \underbrace{\widehat{R}_\gamma(h, S)}_{\text{tilted empirical risk}}, \tag{3}
\]

where we define the tilted generalization error as
\[
\mathrm{gen}_\gamma(h, S) := R(h, \mu) - \widehat{R}_\gamma(h, S). \tag{4}
\]

In learning theory, for uniform approaches, most works focus on bounding the linear generalization
error gen(h, S) from (2) such that, under the distribution of the dataset S, with probability at least
(1 − δ), it holds for all h ∈ H that
\[
|\mathrm{gen}(h, S)| \le g(\delta, n), \tag{5}
\]
where g is a real function depending on δ ∈ (0, 1) and n is the number of data samples. Similarly,
for the tilted generalization error from (4), we are interested in finding a bound g_t(δ, n, γ) such that,
with probability at least 1 − δ under the distribution of S,
\[
\mathrm{gen}_\gamma(h, S) \le g_t(\delta, n, \gamma), \tag{6}
\]
where g_t is a real function. Furthermore, we denote the excess risk under the tilted empirical risk
by
\[
E_\gamma(\mu) := R(h^*_\gamma(S), \mu) - R(h^*(\mu), \mu), \tag{7}
\]
where h*(µ) := arg min_{h∈H} R(h, µ) and h*_γ(S) := arg min_{h∈H} R̂_γ(h, S).
We denote the expected TER with respect to the distribution of S by
\[
\mathbb{E}_{P_S}\big[\widehat{R}_\gamma(h, S)\big], \tag{8}
\]
and we denote the tilted (true) population risk by
\[
R_\gamma(h, P_S) = \frac{1}{\gamma} \log\left( \mathbb{E}_{P_S}\left[ \frac{1}{n}\sum_{i=1}^{n} \exp\big(\gamma\,\ell(h, Z_i)\big) \right] \right). \tag{9}
\]

Under the i.i.d. assumption, the tilted population risk is equal to an entropic risk function (Howard and Matheson,
1972). We also introduce the non-linear generalization error, which plays an important role in
obtaining our bounds, as
\[
\widehat{\mathrm{gen}}_\gamma(h, S) := R_\gamma(h, \mu^{\otimes n}) - \widehat{R}_\gamma(h, S). \tag{10}
\]
Information-theoretic Approach: For the information-theoretic approach, as the hypothesis H
is a random variable under a learning algorithm given as a Markov kernel, i.e., P_{H|S}, we take expectations
over the hypothesis H in the above expressions for fixed h to define the expected true risk,
the expected tilted empirical risk, and the expected tilted generalization error,
\[
\begin{aligned}
R(H, P_H \otimes \mu) &:= \mathbb{E}_{P_H \otimes \mu}[\ell(H, Z)],\\
R_\gamma(H, P_{H,S}) &:= \mathbb{E}_{P_{H,S}}\big[\widehat{R}_\gamma(H, S)\big],\\
\overline{\mathrm{gen}}_\gamma(H, S) &:= \mathbb{E}_{P_{H,S}}[\mathrm{gen}_\gamma(H, S)].
\end{aligned} \tag{11}
\]
Similar to (11), we define the tilted population risk and the non-linear generalization error, (9) and
(10), as expectations. We also provide upper bounds on the absolute value of the expected tilted
generalization error with respect to the joint distribution of S and H, of the form
\[
|\overline{\mathrm{gen}}_\gamma(H, S)| \le g_e(n, \gamma),
\]
where g_e is a real function.

3 Generalization Bounds for Bounded Loss Function


Upper bounds under linear empirical risk for bounded loss functions via information-theoretic
and uniform bounds are studied by Shalev-Shwartz and Ben-David (2014) and Xu and Mannor
(2012), respectively. Inspired by these works, in this section, we provide upper bounds on the
tilted generalization error via uniform and information-theoretic approaches for bounded loss functions,
with a convergence rate of O(1/√n), which is similar to the generalization error under the linear
empirical risk. Upper bounds via stability (Bousquet and Elisseeff, 2002b), Rademacher complexity
(Bartlett and Mendelson, 2002) and PAC-Bayesian approaches (Alquier, 2021) are provided in
Appendix H. All proof details are deferred to Appendix D.
In this section the following assumption is made.
Assumption 3.1 (Bounded loss function). There is a constant M such that the loss function
(h, z) ↦ ℓ(h, z) satisfies 0 ≤ ℓ(h, z) ≤ M uniformly for all h ∈ H, z ∈ Z.
Assumption 3.1 will be relaxed in Section 4.

3.1 Uniform Bounds
For uniform bounds of the type (6) we decompose the tilted generalization error (4) as follows,
\[
\mathrm{gen}_\gamma(h, S) = \underbrace{R(h, \mu) - R_\gamma(h, \mu^{\otimes n})}_{I_1} \;+\; \underbrace{R_\gamma(h, \mu^{\otimes n}) - \widehat{R}_\gamma(h, S)}_{I_2}, \tag{12}
\]

where I1 is the difference between the population risk and the tilted population risk and I2 is the
non-linear generalization error.
We first derive an upper bound on term I1 in the following Proposition.

Proposition 3.2. Under Assumption 3.1, for γ ∈ R the difference between the population risk and
the tilted population risk satisfies
\[
\frac{-1}{2\gamma}\,\mathrm{Var}\big(\exp(\gamma \ell(h, Z))\big) \;\le\; R(h, \mu) - R_\gamma(h, \mu^{\otimes n}) \;\le\; \frac{-\exp(-2\gamma M)}{2\gamma}\,\mathrm{Var}\big(\exp(\gamma \ell(h, Z))\big). \tag{13}
\]

Note that for γ → 0, the upper and lower bounds in Proposition 3.2 are zero.
As the log function is Lipschitz on a bounded interval, applying the Hoeffding inequality to
term I2 and Proposition 3.2 to term I1 in (12), we obtain the following upper bound on the tilted
generalization error.

Theorem 3.3. Given Assumption 3.1, for any fixed h ∈ H with probability at least (1 − δ) the tilted
generalization error satisfies the upper bound
\[
\mathrm{gen}_\gamma(h, S) \le \frac{-\exp(-2\gamma M)}{2\gamma}\,\mathrm{Var}\big(\exp(\gamma \ell(h, Z))\big) + \frac{\exp(|\gamma| M) - 1}{|\gamma|}\sqrt{\frac{\log(2/\delta)}{2n}}.
\]

Theorem 3.4. Under the same assumptions of Theorem 3.3, for a fixed h ∈ H, with probability at
least (1 − δ), the tilted generalization error satisfies the lower bound
\[
\mathrm{gen}_\gamma(h, S) \ge \frac{-1}{2\gamma}\,\mathrm{Var}\big(\exp(\gamma \ell(h, Z))\big) - \frac{\exp(|\gamma| M) - 1}{|\gamma|}\sqrt{\frac{\log(2/\delta)}{2n}}. \tag{14}
\]

Remark 3.5. The lower bound in Theorem 3.4 for a negative γ can be tighter than the lower bound
on the tilted generalization error for a positive γ, (14), as the variance is positive.

Combining Theorem 3.3 and Theorem 3.4, we derive an upper bound on the absolute value of
the tilted generalization error.

Corollary 3.6. Under the same assumptions in Theorem 3.3, with
probability at least (1 − δ), and for a finite hypothesis space, the absolute value of the tilted generalization
error satisfies
\[
\sup_{h \in \mathcal{H}} |\mathrm{gen}_\gamma(h, S)| \le \frac{\exp(|\gamma| M) - 1}{|\gamma|}\sqrt{\frac{\log(\mathrm{card}(\mathcal{H})) + \log(2/\delta)}{2n}} + \frac{\max\big(1, \exp(-2\gamma M)\big)\,A(\gamma)}{8|\gamma|},
\]
where A(γ) = (1 − exp(γM))².

Corollary 3.7. Under the assumptions in Corollary 3.6, if γ is of order O(n^{-β}) for β > 0, then, as
A(γ) ∼ γ²M² for γ → 0, the upper bound on the absolute value of the tilted generalization error in Corollary 3.6
has a convergence rate of max(O(1/√n), O(n^{-β})) as n → ∞.

Remark 3.8. Choosing β ≥ 1/2 in Corollary 3.7 gives a convergence rate of O(1/√n) for the tilted
generalization error.

Remark 3.9 (The influence of γ). As γ → 0, the upper bound in Corollary 3.6 on the absolute value
of the tilted generalization error converges to the upper bound on the absolute value of the generalization
error under the ERM algorithm obtained by Shalev-Shwartz and Ben-David (2014),
\[
\sup_{h \in \mathcal{H}} |\mathrm{gen}(h, S)| \le M \sqrt{\frac{\log(\mathrm{card}(\mathcal{H})) + \log(2/\delta)}{2n}}. \tag{15}
\]
In particular, (exp(|γ|M) − 1)/|γ| → M, so the first term in Corollary 3.6 converges to the right-hand
side of (15), while the second term vanishes. Therefore, the upper bound converges to a uniform
bound on the linear empirical risk.

Remark 3.10. The factor −exp(−2γM)/(2γ) in Theorem 3.3 can make the upper bound in Theorem 3.3
tighter for γ > 0 than for γ < 0. Furthermore, in comparison with the uniform upper
bound on the linear generalization error, (15), we obtain a tilted generalization error upper bound
which can be tighter for sufficiently large numbers of samples n and small values of γ.
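For intuition only, the sketch below evaluates the right-hand side of Corollary 3.6 as a function of n, with γ scaled as n^{-1/2} as in Remark 3.8; the constants M, card(H) and δ are arbitrary choices of ours, not values from the paper.

```python
import numpy as np

def corollary_3_6_bound(n, gamma, M=1.0, card_H=100, delta=0.05):
    """Right-hand side of Corollary 3.6 for a bounded loss in [0, M]."""
    A = (1.0 - np.exp(gamma * M)) ** 2
    first = (np.exp(abs(gamma) * M) - 1.0) / abs(gamma) * \
            np.sqrt((np.log(card_H) + np.log(2.0 / delta)) / (2.0 * n))
    second = max(1.0, np.exp(-2.0 * gamma * M)) * A / (8.0 * abs(gamma))
    return first + second

for n in (10**2, 10**4, 10**6):
    gamma = n ** -0.5        # beta = 1/2, giving the O(1/sqrt(n)) rate of Remark 3.8
    print(n, corollary_3_6_bound(n, gamma))
```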

Using Corollary 3.7, we derive an upper bound on the excess risk.

Corollary 3.11. Under the same assumptions in Theorem 3.3, and a finite hypothesis space, with
probability at least (1 − δ), the excess risk of tilted empirical risk satisfies
\[
E_\gamma(\mu) \le \frac{2\big(\exp(|\gamma| M) - 1\big)}{|\gamma|}\sqrt{\frac{\log(\mathrm{card}(\mathcal{H})) + \log(2/\delta)}{2n}} + \frac{2\max\big(1, \exp(-2\gamma M)\big)\,A(\gamma)}{8|\gamma|},
\]
where A(γ) = (1 − exp(γM))².

The theorems in this section assumed that the hypothesis space is finite; this is for example the
case in classification problems with a finite number of classes. If this assumption is violated, we
can apply the growth function technique from (Bousquet et al., 2003; Vapnik, 1999). Furthermore,
the growth function can be bounded by VC-dimension in binary classification (Vapnik, 1999) or
Natarajan dimension (Holden and Niranjan, 1995) for multi-class classification scenarios. Note that
the VC-dimension and Rademacher complexity bounds are uniform bounds and are independent of
the learning algorithms.

3.2 Information-theoretic Bounds


Next, we provide an upper bound on the expected tilted generalization error. All proof details are
deferred to Appendix D.2. For information-theoretic bounds, we employ the following decomposition
of the expected tilted generalization error,
\[
\overline{\mathrm{gen}}_\gamma(H, S) = \big\{ R(H, P_H \otimes \mu) - R_\gamma(H, P_H \otimes \mu^{\otimes n}) \big\} + \big\{ R_\gamma(H, P_H \otimes \mu^{\otimes n}) - R_\gamma(H, P_{H,S}) \big\}. \tag{16}
\]

The following is helpful in deriving the upper bound.

Proposition 3.12. Under Assumption 3.1, the following inequality holds for any learning algorithm
P_{H|S}:
\[
R_\gamma(H, P_H \otimes \mu^{\otimes n}) - R_\gamma(H, P_{H,S}) \le \frac{\exp(|\gamma| M) - 1}{|\gamma|}\sqrt{\frac{I(H;S)}{2n}}. \tag{17}
\]

Using Proposition 3.12, we derive the following upper and lower bounds on the expected gener-
alization error.

Theorem 3.13. Under Assumption 3.1, the expected tilted generalization error satisfies
\[
\overline{\mathrm{gen}}_\gamma(H, S) \le \frac{\exp(|\gamma| M) - 1}{|\gamma|}\sqrt{\frac{I(H;S)}{2n}} - \frac{\gamma \exp(-\gamma M)}{2}\Big(1 - \frac{1}{n}\Big)\,\mathbb{E}_{P_H}\big[\mathrm{Var}_{\tilde{Z}\sim\mu}\big(\ell(H, \tilde{Z})\big)\big]. \tag{18}
\]

Theorem 3.14. Under the same assumptions in Theorem 3.13, the expected tilted generalization
error satisfies
\[
\overline{\mathrm{gen}}_\gamma(H, S) \ge -\frac{\exp(|\gamma| M) - 1}{|\gamma|}\sqrt{\frac{I(H;S)}{2n}} - \frac{\gamma \exp(\gamma M)}{2}\Big(1 - \frac{1}{n}\Big)\,\mathbb{E}_{P_H}\big[\mathrm{Var}_{\tilde{Z}\sim\mu}\big(\ell(H, \tilde{Z})\big)\big].
\]
Combining Theorem 3.13 and Theorem 3.14, we derive an upper bound on the absolute value
of the expected tilted generalization error.

Corollary 3.15. Under the same assumptions in Theorem 3.13, the absolute value of the expected
tilted generalization error satisfies
\[
|\overline{\mathrm{gen}}_\gamma(H, S)| \le \frac{\exp(|\gamma| M) - 1}{|\gamma|}\sqrt{\frac{I(H;S)}{2n}} + \frac{|\gamma| M^2 \exp(|\gamma| M)}{8}\Big(1 - \frac{1}{n}\Big).
\]
Remark 3.16. In Corollary 3.15, we observe that by choosing γ = O(n^{-β}), the overall convergence
rate of the generalization error upper bound is max(O(1/√n), O(n^{-β})) for bounded I(H;S). For
β ≥ 1/2, the convergence rate of (18) is the same as the convergence rate of the expected upper
bound in (Xu and Raginsky, 2017). In addition, for γ → 0, the upper bound in Corollary 3.15
converges to the expected upper bound in (Xu and Raginsky, 2017).
The results in this section are non-vacuous for bounded I(H; S). If this assumption is violated,
we can apply the individual sample method (Bu et al., 2020), chaining methods (Asadi et al., 2018),
or conditional mutual information frameworks (Steinke and Zakynthinou, 2020) to derive tighter
upper bounds on the tilted generalization error.

4 Generalization Bounds for Unbounded Loss Functions


In the previous section, we assumed that the loss function is bounded, which is limiting in practice.
Several works have already proposed solutions to overcome the boundedness assumption
under the linear empirical risk (Alquier and Guedj, 2018; Haddouche and Guedj, 2022; Holland, 2019;
Lugosi and Neu, 2022) via a PAC-Bayesian approach. In this section, we derive upper bounds on
the tilted generalization error via a uniform approach for negative tilt (γ < 0) under a bounded
second-moment assumption, with a convergence rate of O(1/√n). In particular, we relax the
boundedness assumption (Assumption 3.1). The following assumptions are made for the uniform analysis.
Assumption 4.1 (Uniform bounded second moment). There is a constant κu ∈ R⁺ such that the
loss function (h, z) ↦ ℓ(h, z) satisfies Eµ[ℓ²(h, Z)] ≤ κu² uniformly for all h ∈ H.
The second-moment assumption, Assumption 4.1, is satisfied if the loss function is sub-
Gaussian or sub-exponential (Boucheron et al., 2013) under the distribution µ for all h ∈ H. All
proof details for the results in this section are deferred to Appendix E. In the following we present
the results based on uniform bounds. The results for unbounded loss functions based on
information-theoretic bounds are presented in Appendix E.2.
For unbounded loss functions, we consider the decomposition of the tilted generalization error
in (12) for γ < 0. The term I1 can be bounded using Lemma C.9. Then, we apply Bernstein’s
inequality (Boucheron et al., 2013) to provide upper and lower bounds on the second term, I2 .
Theorem 4.2. Given Assumption 4.1, for any fixed h ∈ H, with probability at least (1 − δ),
the following upper bound holds on the tilted generalization error for γ < 0:
\[
\mathrm{gen}_\gamma(h, S) \le 2\kappa_u \exp(-\gamma \kappa_u)\sqrt{\frac{\log(2/\delta)}{n}} - \frac{4\exp(-\gamma \kappa_u)\log(2/\delta)}{3n\gamma} - \frac{\gamma}{2}\kappa_u^2. \tag{19}
\]

Theorem 4.3. Given Assumption 4.1, there exists ζ ∈ (0, 1) such that for n ≥ (4γ²κu² + 8ζ/3) log(2/δ)/(ζ² exp(2γκu)),
for any fixed h ∈ H, with probability at least (1 − δ) and for γ < 0, the following lower bound on the
tilted generalization error holds:
\[
\mathrm{gen}_\gamma(h, S) \ge -\frac{2\kappa_u \exp(-\gamma \kappa_u)}{1-\zeta}\sqrt{\frac{\log(2/\delta)}{n}} + \frac{4\exp(-\gamma \kappa_u)\log(2/\delta)}{3n\gamma(1-\zeta)}. \tag{20}
\]
Combining Theorem 4.2 and Theorem 4.3, we derive an upper bound on the absolute value of
the tilted generalization error.

Corollary 4.4. Under the same assumptions as in Theorem 4.3 and for a finite hypothesis space, if
n ≥ (4γ²κu² + 8ζ/3) log(2/δ)/(ζ² exp(2γκu)), then for γ < 0 and with probability at least (1 − δ), the absolute value of the
tilted generalization error satisfies
\[
\sup_{h \in \mathcal{H}} |\mathrm{gen}_\gamma(h, S)| \le \frac{2\kappa_u \exp(-\gamma \kappa_u)}{1-\zeta}\sqrt{\frac{B(\delta)}{n}} - \frac{4\exp(-\gamma \kappa_u)B(\delta)}{3n\gamma(1-\zeta)} - \frac{\gamma}{2}\kappa_u^2, \tag{21}
\]
where B(δ) = log(card(H)) + log(2/δ).


Remark 4.5. For γ ≍ n^{-1/2}, n ≥ (4γ²κu² + 8ζ/3) log(2/δ)/(ζ² exp(−2κu)) and −1 < γ < 0, the upper bound in
Corollary 4.4 gives a theoretical guarantee of a convergence rate of O(n^{-1/2}). Using the TER with
negative γ can thus help to derive an upper bound on the absolute value of the tilted generalization error
under the bounded second-moment assumption.
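Analogously, for illustration only, the following sketch evaluates the right-hand side of Corollary 4.4 with γ = −n^{-1/2}, as suggested in Remark 4.5; the values of κu, card(H), δ and ζ are arbitrary choices of ours, and the sample-size condition of Theorem 4.3 is assumed to hold.

```python
import numpy as np

def corollary_4_4_bound(n, gamma, kappa_u=1.0, card_H=100, delta=0.05, zeta=0.5):
    """Right-hand side of Corollary 4.4 (negative tilt, bounded second moment)."""
    B = np.log(card_H) + np.log(2.0 / delta)
    first = 2.0 * kappa_u * np.exp(-gamma * kappa_u) / (1.0 - zeta) * np.sqrt(B / n)
    second = -4.0 * np.exp(-gamma * kappa_u) * B / (3.0 * n * gamma * (1.0 - zeta))
    third = -gamma / 2.0 * kappa_u ** 2
    return first + second + third

for n in (10**3, 10**5, 10**7):
    gamma = -n ** -0.5       # the scaling from Remark 4.5, with -1 < gamma < 0
    print(n, corollary_4_4_bound(n, gamma))
```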

Similar to Corollary 3.11, an upper bound on the excess risk under the unbounded loss assumption
can be derived.

Corollary 4.6. Under the same assumptions as in Theorem 4.3 and for a finite hypothesis space, if
n ≥ (4γ²κu² + 8ζ/3) log(2/δ)/(ζ² exp(2γκu)), then with probability at least (1 − δ) and for γ < 0, the excess risk of the tilted
empirical risk satisfies
\[
E_\gamma(\mu) \le \frac{4\kappa_u \exp(-\gamma \kappa_u)}{1-\zeta}\sqrt{\frac{B(\delta)}{n}} - \gamma\kappa_u^2 - \frac{8\exp(-\gamma \kappa_u)B(\delta)}{3n\gamma(1-\zeta)}, \tag{22}
\]
where B(δ) = log(card(H)) + log(2/δ).

5 Robustness of TERM
As shown in the experiments by Li et al. (2021), the tilted empirical risk is robust to noisy or outlier
samples during training when using a negative tilt (γ < 0). In this section, we study the robustness of the
TER under the distributional shift µ̃, and the following assumption is made.

Assumption 5.1 (Uniform bounded second moment under µ̃). There is a constant κs ∈ R⁺ such
that the loss function (h, z) ↦ ℓ(h, z) satisfies Eµ̃[ℓ²(h, Z)] ≤ κs² uniformly for all h ∈ H.

Using the functional derivative (Cardaliaguet et al., 2019), we can provide the following results.

Proposition 5.2. Given Assumption 5.1 and Assumption 4.1, the difference of the tilted population
risk, (9), between µ and µ̃ is bounded as follows:
\[
\frac{1}{|\gamma|}\Big| \log\big(\mathbb{E}_{\tilde{Z}\sim\mu}[C(h, \tilde{Z})]\big) - \log\big(\mathbb{E}_{\tilde{Z}\sim\tilde{\mu}}[C(h, \tilde{Z})]\big) \Big| \le \frac{\mathrm{TV}(\mu, \tilde{\mu})}{\gamma^2}\,\frac{\exp(|\gamma|\kappa_u) - \exp(|\gamma|\kappa_s)}{\kappa_u - \kappa_s}, \tag{23}
\]
where C(h, Z̃) = exp(γℓ(h, Z̃)).

Note that, for positive γ, the result in Proposition 5.2 does not hold, and the difference can be unbounded.
Using Proposition 5.2, we can provide upper and lower bounds on the tilted generalization error
under distributional shift.

Proposition 5.3. Given Assumptions 4.1 and 5.1, for any fixed h ∈ H, with probability at least
(1 − δ) and for γ < 0, the following upper bound holds on the tilted generalization error:
\[
\mathrm{gen}_\gamma(h, \hat{S}) \le 2\kappa_s \exp(-\gamma \kappa_s)\sqrt{\frac{\log(2/\delta)}{n}} - \frac{\gamma}{2}\kappa_u^2 - \frac{4\exp(-\gamma \kappa_s)\log(2/\delta)}{3n\gamma} + \frac{\mathrm{TV}(\mu, \tilde{\mu})}{\gamma^2}\,D(\kappa_s, \kappa_u),
\]
where D(κs, κu) = (exp(|γ|κu) − exp(|γ|κs))/(κu − κs) and Ŝ is the training dataset under the distributional shift.

Proposition 5.4. Given Assumptions 4.1 and 5.1, for any fixed h ∈ H, with probability at least
(1 − δ), there exists ζ ∈ (0, 1) such that for n ≥ (4γ²κu² + 8ζ/3) log(2/δ)/(ζ² exp(2γκu)) and γ < 0, the
following lower bound holds on the tilted generalization error:
\[
\mathrm{gen}_\gamma(h, \hat{S}) \ge -\frac{2\kappa_s \exp(-\gamma \kappa_s)}{1-\zeta}\sqrt{\frac{\log(2/\delta)}{n}} + \frac{4\exp(-\gamma \kappa_s)\log(2/\delta)}{3n\gamma(1-\zeta)} - \frac{\mathrm{TV}(\mu, \tilde{\mu})}{\gamma^2}\,D(\kappa_s, \kappa_u),
\]
where D(κs, κu) = (exp(|γ|κu) − exp(|γ|κs))/(κu − κs) and Ŝ is the training dataset under the distributional shift.

Combining Proposition 5.3 and Proposition 5.4, we derive an upper bound on the absolute value
of the tilted generalization error under distributional shift.
Theorem 5.5. Under the same assumptions as in Proposition 5.4, for n ≥ (4γ²κu² + 8ζ/3) log(2/δ)/(ζ² exp(2γκu))
and γ < 0, with probability at least (1 − δ), the absolute value of the tilted generalization error
under distributional shift satisfies
\[
\sup_{h \in \mathcal{H}} |\mathrm{gen}_\gamma(h, \hat{S})| \le \frac{2\kappa_s \exp(-\gamma \kappa_s)}{1-\zeta}\sqrt{\frac{B(\delta)}{n}} - \frac{4\exp(-\gamma \kappa_s)B(\delta)}{3n\gamma(1-\zeta)} - \frac{\gamma\kappa_u^2}{2} + \frac{\mathrm{TV}(\mu, \tilde{\mu})}{\gamma^2}\,\frac{\exp(|\gamma|\kappa_u) - \exp(|\gamma|\kappa_s)}{\kappa_u - \kappa_s}, \tag{24}
\]
where B(δ) = log(card(H)) + log(2/δ).

Results similar to Proposition 5.3 can be derived via an information-theoretic approach.
Robustness vs Generalization: The term (TV(µ, µ̃)/γ²) · (exp(|γ|κu) − exp(|γ|κs))/(κu − κs) represents
the distributional shift cost (or robustness cost) associated with the TER. This cost can be reduced by increasing
|γ|. However, increasing |γ| also amplifies other terms in the upper bound on the tilted generalization
error. Therefore, there is a trade-off between robustness and generalization, particularly for γ < 0
in the TER. Interestingly, Li et al. (2021) also observed this trade-off for negative tilt.
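As a toy numerical illustration of this trade-off (ours, not an experiment from Li et al. (2021)), the sketch below contaminates a sample of losses with a few large outliers and compares the linear empirical risk with the TER for several negative tilts; the TER helper repeats the Section 2 sketch for self-containedness.

```python
import numpy as np

def tilted_empirical_risk(losses, gamma):
    n = losses.size
    return (np.logaddexp.reduce(gamma * losses) - np.log(n)) / gamma

rng = np.random.default_rng(0)
clean = rng.exponential(scale=1.0, size=1000)       # well-behaved losses, mean ~ 1
outliers = rng.exponential(scale=50.0, size=10)     # a few corrupted samples
contaminated = np.concatenate([clean, outliers])

print("empirical risk (clean)       :", clean.mean())
print("empirical risk (contaminated):", contaminated.mean())   # pulled up by the outliers
for gamma in (-0.1, -0.5, -1.0):
    # More negative tilts suppress the outliers more strongly, at the price of
    # larger exp(-gamma * kappa) factors in the generalization bounds above.
    print(f"TER, gamma = {gamma:+.1f}          :", tilted_empirical_risk(contaminated, gamma))
```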

6 The KL-Regularized TERM Problem


Our upper bound in Corollary 3.15 on the absolute value of the expected generalization error depends
on the mutual information between H and S. Therefore, it is of interest to investigate an algorithm
which minimizes the expected TERM regularized via the mutual information,
\[
P_{H|S}^{\star} = \arg\inf_{P_{H|S}} \; R_\gamma(H, P_{H,S}) + \frac{1}{\alpha} I(H;S), \tag{25}
\]
where α is the inverse temperature. As discussed by Aminian et al. (2023); Xu and Raginsky (2017),
the regularized problem in (25) depends on the data distribution P_S. Therefore, we relax
the problem in (25) by considering the following regularized version via the KL divergence,
\[
P_{H|S}^{\star} = \arg\inf_{P_{H|S}} \; R_\gamma(H, P_{H,S}) + \frac{1}{\alpha}\mathrm{KL}(P_{H|S}\|\pi_H | P_S), \tag{26}
\]
where I(H;S) ≤ KL(P_{H|S}‖π_H|P_S) and π_H is a prior distribution over the hypothesis space H. All
proof details are deferred to Appendix G.

Proposition 6.1. The solution to the expected TERM regularized via KL divergence, (26), is the
tilted Gibbs posterior (a.k.a. Gibbs algorithm),
\[
P_{H|S}^{\gamma} := \frac{\pi_H}{F_\alpha(S)}\left( \frac{1}{n}\sum_{i=1}^{n} \exp\big(\gamma\,\ell(H, Z_i)\big) \right)^{-\alpha/\gamma}, \tag{27}
\]
where F_α(S) is a normalization factor.

Note that the Gibbs posterior,
\[
P_{H|S}^{\alpha} := \frac{\pi_H}{\tilde{F}_\alpha(S)}\exp\left( -\frac{\alpha}{n}\sum_{i=1}^{n} \ell(H, Z_i) \right), \tag{28}
\]

is the solution to the KL-regularized ERM minimization problem, where F̃α (S) is the normalization
factor. Therefore, the tilted Gibbs posterior is different from the Gibbs posterior, (28). It can be
shown that for γ → 0, the tilted Gibbs posterior converges to the Gibbs posterior. Therefore, it is
interesting to study the expected generalization error of the tilted Gibbs posterior. For this purpose,
we give an exact characterization of the difference between the expected TER under the joint and
the product of marginal distributions of H and S. More results regarding the Gibbs posterior are
provided in Appendix G.
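To make the comparison concrete, here is a small sketch (ours) that computes the tilted Gibbs posterior (27) and the Gibbs posterior (28) over a finite hypothesis space from a matrix of per-sample losses; the weights coincide as γ → 0.

```python
import numpy as np

def gibbs_posterior(loss_matrix, alpha, prior):
    """Gibbs posterior (28): weights proportional to prior * exp(-alpha * empirical risk)."""
    er = loss_matrix.mean(axis=1)                 # linear empirical risk of each hypothesis
    logw = np.log(prior) - alpha * er
    w = np.exp(logw - logw.max())                 # subtract max for numerical stability
    return w / w.sum()

def tilted_gibbs_posterior(loss_matrix, alpha, gamma, prior):
    """Tilted Gibbs posterior (27): weights proportional to prior * exp(-alpha * TER_gamma)."""
    n = loss_matrix.shape[1]
    ter = (np.logaddexp.reduce(gamma * loss_matrix, axis=1) - np.log(n)) / gamma
    logw = np.log(prior) - alpha * ter
    w = np.exp(logw - logw.max())
    return w / w.sum()

rng = np.random.default_rng(1)
losses = rng.uniform(0.0, 1.0, size=(5, 200))     # 5 hypotheses, 200 training samples
prior = np.full(5, 0.2)
print(gibbs_posterior(losses, alpha=10.0, prior=prior))
print(tilted_gibbs_posterior(losses, alpha=10.0, gamma=1e-6, prior=prior))  # ~ the same
print(tilted_gibbs_posterior(losses, alpha=10.0, gamma=-5.0, prior=prior))  # differs
```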

Proposition 6.2. The difference between the expected TER under the joint and the product of marginal
distributions of H and S can be expressed as
\[
R_\gamma(H, P_H \otimes \mu^{\otimes n}) - R_\gamma(H, P_{H,S}) = \frac{I_{\mathrm{SKL}}(H;S)}{\alpha}. \tag{29}
\]
We next provide a parametric upper bound on the tilted generalization error of the tilted Gibbs
posterior.
Theorem 6.3. Under Assumption 3.1, the expected generalization error of the tilted Gibbs posterior
satisfies
\[
\overline{\mathrm{gen}}_\gamma(H, S) \le \frac{\alpha\big(\exp(|\gamma| M) - 1\big)^2}{2\gamma^2 n} - \frac{\gamma \exp(-\gamma M)}{2}\Big(1 - \frac{1}{n}\Big)\,\mathbb{E}_{P_H}\big[\mathrm{Var}_{\tilde{Z}\sim\mu}\big(\ell(H, \tilde{Z})\big)\big]. \tag{30}
\]
Similar to Corollary 3.15, we derive the following upper bound on the absolute value of the
expected tilted generalization error of the tilted Gibbs posterior.
Corollary 6.4. Under the same assumptions in Theorem 6.3, the absolute value of the expected
tilted generalization error of the tilted Gibbs posterior satisfies
\[
|\overline{\mathrm{gen}}_\gamma(H, S)| \le \frac{\alpha\big(\exp(|\gamma| M) - 1\big)^2}{2\gamma^2 n} + \frac{|\gamma| M^2 \exp(|\gamma| M)}{8}\Big(1 - \frac{1}{n}\Big). \tag{31}
\]
Remark 6.5 (Convergence rate). If γ = O(1/n), then we obtain a theoretical guarantee of a
convergence rate of O(1/n) for the upper bound on the tilted generalization error of the tilted Gibbs
posterior.
Remark 6.6 (Discussion of γ). From the upper bound in Theorem 6.3, we observe that for
γ → 0 and under Assumption 3.1, the upper bound converges to the upper bound for the Gibbs
posterior (Aminian et al., 2021a). For positive tilt (γ > 0) and sufficiently large values of n, the upper
bound in Theorem 6.3 can be tighter than the upper bound for the Gibbs posterior.

7 Related Works
Generalization Error Analysis: Different approaches have been applied to study the generaliza-
tion error of general learning problems under empirical risk minimization, including VC dimension-
based, Rademacher complexity, PAC-Bayesian, stability and information-theoretic bounds. In this
section, we discuss related work on uniform and information-theoretic bounds. More related
work on generalization error analysis is discussed in Appendix B.
Uniform Bounds: Uniform bounds (or VC bounds) are proposed by Bartlett et al. (1998, 2019);
Vapnik and Chervonenkis (1971). For any class of functions F of VC dimension d, with probability
at least 1 − δ the generalization error is O((d + log(1/δ))^{1/2} n^{-1/2}). This bound depends solely on
the VC dimension of the function class and on the sample size; in particular, it is independent of
the learning algorithm.

Information-theoretic bounds: Russo and Zou (2019); Xu and Raginsky (2017) propose using the
mutual information between the input training set and the output hypothesis to upper bound
the expected generalization error. Multiple approaches have been proposed to tighten the mutual
information-based bound: Bu et al. (2020) provide tighter bounds by considering the individual sam-
ple mutual information; Asadi and Abbe (2020); Asadi et al. (2018) propose using chaining mutual
information; and Aminian et al. (2020, 2021b); Hafez-Kolahi et al. (2020); Steinke and Zakynthinou
(2020) provide different upper bounds on the expected generalization error based on the linear em-
pirical risk framework.
The aforementioned approaches have been applied to study the generalization error in the linear empirical
risk framework. To our knowledge, the generalization error of tilted empirical risk minimization
has not been explored from information-theoretic or uniform perspectives.

8 Conclusion and Future Work


In this paper, we study tilted empirical risk minimization, as proposed by Li et al. (2021).
In particular, we establish upper and lower bounds on the tilted generalization error of the
tilted empirical risk through uniform and information-theoretic approaches, obtaining theoretical
guarantees that the convergence rate is O(1/√n) under bounded loss functions for negative and
positive tilts. Furthermore, we provide an upper bound on the tilted generalization error for
unbounded loss functions. We also study the tilted generalization error under distribution shift in
the training dataset due to noise or outliers, where we discuss the generalization and robustness
trade-off. Additionally, we explore KL-regularized tilted empirical risk minimization, whose
solution is the tilted Gibbs posterior, and we derive a parametric upper bound on the expected tilted
generalization error of this posterior with a convergence rate of O(1/n) under some conditions.
Our current results are applicable to scenarios with bounded loss functions and to scenarios
with negative tilting (γ < 0) when the possibly unbounded loss function has a bounded second
moment. However, our current results for the asymptotic regime γ → ∞, where the tilted empirical
risk is equal to the maximum loss, are vacuous. Therefore, studying the generalization performance of
tilted empirical risk minimization under unbounded loss functions for positive tilting, and obtaining
informative bounds in the asymptotic regime γ → ∞, are planned as future work. Moreover, we
intend to apply other approaches from the literature, such as those discussed by Aminian et al.
(2023), to derive further bounds on the tilted generalization error in the mean-field regime.

Acknowledgements
Gholamali Aminian, Gesine Reinert and Samuel N. Cohen acknowledge the support of the UKRI
Prosperity Partnership Scheme (FAIR) under EPSRC Grant EP/V056883/1 and the Alan Tur-
ing Institute. Gesine Reinert is also supported in part by EPSRC grants EP/W037211/1 and
EP/R018472/1. Samuel N. Cohen also acknowledges the support of the Oxford–Man Institute for
Quantitative Finance. Amir R. Asadi is supported by Leverhulme Trust grant ECF-2023-189 and
Isaac Newton Trust grant 23.08(b).

References
Pierre Alquier. User-friendly introduction to PAC-Bayes bounds. arXiv preprint arXiv:2110.11216,
2021.

Pierre Alquier and Benjamin Guedj. Simpler PAC-Bayesian bounds for hostile data. Machine
Learning, 107(5):887–902, 2018.

Amiran Ambroladze, Emilio Parrado-Hernández, and John Shawe-Taylor. Tighter PAC-Bayes
bounds. Advances in neural information processing systems, 19:9, 2007.

Gholamali Aminian, Hamidreza Arjmandi, Amin Gohari, Masoumeh Nasiri-Kenari, and Urbashi
Mitra. Capacity of diffusion-based molecular communication networks over LTI-Poisson chan-
nels. IEEE Transactions on Molecular, Biological and Multi-Scale Communications, 1(2):188–201,
2015.

Gholamali Aminian, Laura Toni, and Miguel RD Rodrigues. Jensen-Shannon information based
characterization of the generalization error of learning algorithms. In 2020 IEEE Information
Theory Workshop (ITW). IEEE, 2020.

Gholamali Aminian, Yuheng Bu, Laura Toni, Miguel Rodrigues, and Gregory Wornell. An exact
characterization of the generalization error for the gibbs algorithm. Advances in Neural Informa-
tion Processing Systems, 34:8106–8118, 2021a.

Gholamali Aminian, Laura Toni, and Miguel RD Rodrigues. Information-theoretic bounds on
the moments of the generalization error of learning algorithms. In 2021 IEEE International
Symposium on Information Theory (ISIT), pages 682–687. IEEE, 2021b.

Gholamali Aminian, Samuel N Cohen, and Łukasz Szpruch. Mean-field analysis of generalization
errors. arXiv preprint arXiv:2306.11623, 2023.

Amir R. Asadi and Emmanuel Abbe. Chaining meets chain rule: Multilevel entropic regularization
and training of neural networks. Journal of Machine Learning Research, 21(139):1–32, 2020.

Amir R. Asadi, Emmanuel Abbe, and Sergio Verdú. Chaining mutual information and tightening
generalization bounds. In Advances in Neural Information Processing Systems, pages 7234–7243,
2018.

Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and
structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Peter L. Bartlett, Vitaly Maiorov, and Ron Meir. Almost linear vc-dimension bounds for piecewise
polynomial networks. Neural Comput., 10(8):2159–2173, 1998. doi: 10.1162/089976698300017016.
URL https://doi.org/10.1162/089976698300017016.

Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight vc-dimension
and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning
Research, 20(63):1–17, 2019. URL http://jmlr.org/papers/v20/17-612.html.

Luc Bégin, Pascal Germain, François Laviolette, and Jean-Francis Roy. PAC-Bayesian bounds based
on the Rényi divergence. In Artificial Intelligence and Statistics, pages 435–444. PMLR, 2016.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymp-
totic theory of independence. Oxford University Press, 2013.

Olivier Bousquet and André Elisseeff. Stability and generalization. J. Mach. Learn. Res.,
2:499–526, March 2002a. ISSN 1532-4435. doi: 10.1162/153244302760200704. URL
https://doi.org/10.1162/153244302760200704.

Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning
Research, 2(Mar):499–526, 2002b.

Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning
theory. In Summer school on machine learning, pages 169–207. Springer, 2003.

Olivier Bousquet, Yegor Klochkov, and Nikita Zhivotovskiy. Sharper bounds for uniformly stable
algorithms. In Conference on Learning Theory, pages 610–626, 2020.

Yuheng Bu, Shaofeng Zou, and Venugopal V Veeravalli. Tightening mutual information-based
bounds on generalization error. IEEE Journal on Selected Areas in Information Theory, 1(1):
121–130, 2020.

Giuseppe C Calafiore, Stephane Gaubert, and Corrado Possieri. Log-sum-exp neural networks and
posynomial models for convex and log-log-convex data. IEEE transactions on neural networks
and learning systems, 31(3):827–838, 2019.

Pierre Cardaliaguet, François Delarue, Jean-Michel Lasry, and Pierre-Louis Lions. The master
equation and the convergence problem in mean field games. Princeton University Press, 2019.

Olivier Catoni. A pac-bayesian approach to adaptive classification. preprint, 840:2, 2003.

Olivier Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning.
arXiv preprint arXiv:0712.0248, 2007.

Yuansi Chen, Chi Jin, and Bin Yu. Stability and convergence trade-off of iterative optimization
algorithms. arXiv preprint arXiv:1804.01619, 2018.

Andreas Christmann and Ingo Steinwart. On robustness properties of convex risk minimization
methods for pattern recognition. The Journal of Machine Learning Research, 5:1007–1034, 2004.

Krishnamurthy Dvijotham and Emanuel Todorov. A unifying framework for linearly solvable control.
arXiv preprint arXiv:1202.3715, 2012.

Gintare Karolina Dziugaite and Daniel Roy. Entropy-SGD optimizes the prior of a PAC-Bayes
bound: Generalization properties of entropy-SGD and data-dependent priors. In International
Conference on Machine Learning, pages 1377–1386. PMLR, 2018.

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complex-
ity of neural networks. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigol-
let, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Pro-
ceedings of Machine Learning Research, pages 297–299. PMLR, 06–09 Jul 2018. URL
https://proceedings.mlr.press/v75/golowich18a.html.

Maxime Haddouche and Benjamin Guedj. Pac-bayes generalisation bounds for heavy-tailed losses
through supermartingales. arXiv preprint arXiv:2210.00928, 2022.

Hassan Hafez-Kolahi, Zeinab Golgooni, Shohreh Kasaei, and Mahdieh Soleymani. Conditioning
and processing: Techniques to improve information-theoretic generalization bounds. Advances in
Neural Information Processing Systems, 33, 2020.

Fredrik Hellström and Giuseppe Durisi. Generalization bounds via information density and condi-
tional information density. IEEE Journal on Selected Areas in Information Theory, 2020.

Sean B Holden and Mahesan Niranjan. On the practical applicability of vc dimension bounds.
Neural Computation, 7(6):1265–1288, 1995.

Matthew Holland. Pac-bayes under potentially heavy tails. Advances in Neural Information Pro-
cessing Systems, 32, 2019.

Ronald A Howard and James E Matheson. Risk-sensitive markov decision processes. Management
science, 18(7):356–369, 1972.

Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction:
Risk bounds, margin bounds, and regularization. Advances in neural information processing
systems, 21, 2008.

Ilja Kuzborskij and Csaba Szepesvári. Efron-stein pac-bayesian inequalities. arXiv preprint
arXiv:1909.01931, 2019.

Jaeho Lee, Sejun Park, and Jinwoo Shin. Learning bounds for risk-sensitive learning. Advances in
Neural Information Processing Systems, 33:13867–13879, 2020.

Tian Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. Tilted empirical risk minimization.
In International Conference on Learning Representations, 2021.

Tian Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. On tilted losses in machine learning:
Theory and applications. Journal of Machine Learning Research, 24(142):1–79, 2023a.

Xiaoli Li, Siran Zhao, Chuan Chen, and Zibin Zheng. Heterogeneity-aware fair federated learning.
Information Sciences, 619:968–986, 2023b.

Andrew Lowy and Meisam Razaviyayn. Output perturbation for differentially private convex opti-
mization with improved population loss bounds, runtimes and applications to private adversarial
training. arXiv preprint arXiv:2102.04704, 2021.

Gábor Lugosi and Gergely Neu. Generalization bounds via convex analysis. In Conference on
Learning Theory, pages 3524–3546. PMLR, 2022.

Gábor Lugosi and Gergely Neu. Online-to-pac conversions: Generalization bounds via regret anal-
ysis. arXiv preprint arXiv:2305.19674, 2023.

Étienne Marceau and Jacques Rioux. On robustness in risk theory. Insurance: Mathematics and
Economics, 29(2):167–185, 2001.

Pascal Massart. Some applications of concentration inequalities to statistics. In Annales de la
Faculté des sciences de Toulouse: Mathématiques, volume 9, pages 245–303, 2000.

David A McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.

David A McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.

Colin McDiarmid. On the method of bounded differences. Surveys in combinatorics, 141(1):148–188,
1989.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning.
MIT press, 2018.

Wenlong Mou, Liwei Wang, Xiyu Zhai, and Kai Zheng. Generalization bounds of SGLD for non-
convex learning: Two theoretical viewpoints. arXiv preprint arXiv:1707.05947, 2017.

Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.

Yury Polyanskiy and Yihong Wu. Information theory: From coding to learning, 2022.

Omar Rivasplata, Ilja Kuzborskij, Csaba Szepesvári, and John Shawe-Taylor. PAC-Bayes analysis
beyond the usual bounds. Advances in Neural Information Processing Systems, 33:16833–16845,
2020.

Miguel RD Rodrigues and Yonina C Eldar. Information-theoretic methods in data science. Cam-
bridge University Press, 2021.

Elvezio M Ronchetti and Peter J Huber. Robust statistics. John Wiley & Sons Hoboken, NJ, USA,
2009.

Daniel Russo and James Zou. How much does your data exploration overfit? controlling bias via
information usage. IEEE Transactions on Information Theory, 66(1):302–323, 2019.

Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algo-
rithms. Cambridge university press, 2014.

John Shawe-Taylor and Robert C Williamson. A PAC analysis of a Bayesian estimator. In Proceed-
ings of the tenth annual conference on Computational learning theory, pages 2–9, 1997.

Thomas Steinke and Lydia Zakynthinou. Reasoning about generalization via conditional mutual
information. In Conference on Learning Theory, pages 3437–3452, 2020.

Attila Szabó, Hadi Jamali-Rad, and Siva-Datta Mannava. Tilted cross-entropy (tce): Promoting
fairness in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 2305–2310, 2021.

Michel Talagrand. New concentration inequalities in product spaces. Inventiones mathematicae,
126(3):505–563, 1996.

Vladimir N Vapnik. An overview of statistical learning theory. IEEE transactions on neural networks,
10(5):988–999, 1999.

Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of
events to their probabilities. In Theory of probability and its applications, pages 11–30. Springer,
1971.

Yingjie Wang, Hong Chen, Weifeng Liu, Fengxiang He, Tieliang Gong, Youcheng Fu, and Dacheng
Tao. Tilted sparse additive models. In International Conference on Machine Learning, pages
35579–35604. PMLR, 2023.

Christopher KI Williams and David Barber. Bayesian classification with gaussian processes. IEEE
Transactions on pattern analysis and machine intelligence, 20(12):1342–1351, 1998.

Aolin Xu and Maxim Raginsky. Information-theoretic analysis of generalization capability of learn-
ing algorithms. Advances in Neural Information Processing Systems, 30, 2017.

Huan Xu and Shie Mannor. Robustness and generalization. Machine learning, 86(3):391–423, 2012.

Guojun Zhang, Saber Malekmohammadi, Xi Chen, and Yaoliang Yu. Proportional fairness in
federated learning. arXiv preprint arXiv:2202.01666, 2022.

Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Trans-
actions on Information Theory, 52(4):1307–1321, 2006.

Xuelin Zhang, Yingjie Wang, Liangxuan Zhu, Hong Chen, Han Li, and Lingjuan Wu. Robust
variable structure discovery based on tilted empirical risk minimization. Applied Intelligence, 53
(14):17865–17886, 2023.

Tianyi Zhou, Shengjie Wang, and Jeff Bilmes. Robust curriculum learning: from clean label de-
tection to noisy label self-correction. In International Conference on Learning Representations,
2020.

A Overview of Main Results
An overview of our main results is provided in Fig. 1. All notations are summarized in Table 1.

Table 1: Summary of notations in the paper

S: Training set                       Zi: i-th sample
X: Input (feature) space              Y: Output (label) space
µ: Data-generating distribution       µ̃: Data-generating distribution under distributional shift
γ: Tilt parameter                     KL(P‖Q): KL divergence between P and Q
I(X;Y): Mutual information            DSKL(P‖Q): Symmetrized KL divergence
TV(P,Q): Total variation distance     κu: Bound on the second moment
h: Hypothesis                         H: Hypothesis space
ℓ(h,z): Loss function                 R(h,µ): Population (true) risk
R̂γ(h,S): Tilted empirical risk        genγ(h,S): Tilted generalization error
Rγ(h,PS): Tilted population risk      Eγ(µ): Excess risk

[Figure 1: Overview of the main results. Bounds on the tilted generalization error are grouped by setting: bounded loss functions (γ ∈ R, Section 3), with uniform upper and lower bounds (Theorems 3.3 and 3.4) and information-theoretic upper and lower bounds (Theorems 3.13 and 3.14); unbounded loss functions (γ < 0, Section 4), with uniform bounds (Theorems 4.2 and 4.3) and information-theoretic bounds (Theorems E.4 and E.6); robustness of TERM (γ < 0, Section 5), with uniform bounds (Propositions 5.3 and 5.4); and the KL-regularized TERM problem (γ ∈ R, Section 6), with an upper bound (Theorem 6.3).]

B Other Related Works
This section details related works about Tilted empirical risk minimization, the Rademacher com-
plexity, stability and PAC-Bayesian bounds, as well as related work on unbounded loss functions.
Rademacher Complexity Bounds: This approach is a data-dependent method to provide an upper
bound on the generalization error based on the Rademacher complexity of the function class H,
see (Bartlett and Mendelson, 2002; Golowich et al., 2018). Bounding the Rademacher complexity
involves the model parameters. Typically, in Rademacher complexity analysis, a symmetrization
technique is used which can be applied to the empirical risk, but not directly to the TER.
Stability Bounds: Stability-based bounds for generalization error are given in Aminian et al.
(2023); Bousquet and Elisseeff (2002a); Bousquet et al. (2020); Chen et al. (2018); Mou et al. (2017).
For stability analysis, the key tool is Lemma 7 in Bousquet and Elisseeff (2002b), which is based on
ERM linearity. Therefore, we can not apply stability analysis to TER directly.
PAC-Bayes bounds: First proposed by McAllester (1999); Shawe-Taylor and Williamson (1997)
and McAllester (2003), PAC-Bayesian analysis provides high probability bounds on the general-
ization error in terms of the KL divergence between the data-dependent posterior induced by the
learning algorithm and a data-free prior that can be chosen arbitrarily (Alquier, 2021). There
are multiple ways to generalize the standard PAC-Bayesian bounds, including using information
measures other than KL divergence (Alquier and Guedj, 2018; Aminian et al., 2021b; Bégin et al.,
2016; Hellström and Durisi, 2020) and considering data-dependent priors (Ambroladze et al., 2007;
Catoni, 2007; Dziugaite and Roy, 2018; Rivasplata et al., 2020). However, this method has not been
applied to TER to provide generalization error bounds.
Tilted Empirical Risk Minimization: The TERM algorithm for machine learning is pro-
posed by Li et al. (2021), and good performance of the TERM under outlier and noisy label scenarios
for negative tilting (γ < 0) and under imbalance and fairness constraints for positive tilting (γ > 0)
is demonstrated. Inspired by TERM, Wang et al. (2023) propose a class of new tilted sparse ad-
ditive models based on the Tikhonov regularization scheme. Their results have some limitations.
First, in (Wang et al., 2023, Theorem 3.3) the authors derive an upper bound for λ = n^{-ζ}, where
ζ < −1/2 and λ is the regularization parameter in (Wang et al., 2023, Eq. 4). This implies λ → ∞
as n → +∞, which is impractical. Second, the analysis in (Wang et al., 2023) assumes that both the
loss function and its derivative are bounded; therefore, it cannot be applied to the unbounded loss
function scenario. Furthermore, we consider KL regularization, which is different from the Tikhonov
regularization scheme with the sparsity-inducing ℓ1,2-norm regularizer as introduced in (Wang et al.,
2023). Therefore, our current results do not cover the learning algorithm in [9]. Zhang et al. (2023)
studied the TERM as a target function to improve the robustness of estimators. The application of
TERM in federated learning has also been studied in (Li et al., 2023b; Zhang et al., 2022). The authors
of (Lee et al., 2020) propose an upper bound on the generalization error of the entropic risk function via the
representation of coherent risk functions and the Rademacher complexity approach. However,
their approach is limited to negative tilt and bounded loss functions. Although rich experiments are
given by Li et al. (2021) for the TERM algorithm in different applications, the generalization error
of the TERM has not yet been addressed for unbounded loss functions.
Unbounded loss functions: Some works studied the generalization error under unbounded
loss functions via the PAC-Bayesian approach. Losses with heavier tails are studied by Alquier and Guedj
(2018) where probability bounds (non-high probability) are developed. Using a different estimator
than empirical risk, PAC-Bayes bounds for losses with bounded second and third moments are devel-
oped by Holland (2019). Notably, their bounds include a term that can increase with the number of

samples n. Kuzborskij and Szepesvári (2019) and Haddouche and Guedj (2022) also provide bounds
for losses with a bounded second moment. The bounds in Haddouche and Guedj (2022) rely on a
parameter that must be selected before the training data is drawn. Information-theoretic bounds
based on the second moment of sup_{h∈H} |ℓ(h, Z) − E[ℓ(h, Z̃)]| are derived in Lugosi and Neu
(2022, 2023). In contrast, our second-moment assumption is more relaxed, being based on the
expected version with respect to the distribution over the hypothesis set and the data-generating
distribution. In our paper, we provide generalization error bounds on the tilted empirical risk via
uniform and information-theoretic approaches under a bounded second-moment assumption.

C Technical Tools
We first define the functional linear derivative as in Cardaliaguet et al. (2019).
Definition C.1 (Cardaliaguet et al., 2019). A functional U : P(Rⁿ) → R admits a functional
linear derivative if there is a map δU/δm : P(Rⁿ) × Rⁿ → R which is continuous on P(Rⁿ), such that
for all m, m′ ∈ P(Rⁿ), it holds that
\[
U(m') - U(m) = \int_0^1 \int_{\mathbb{R}^n} \frac{\delta U}{\delta m}(m_\lambda, a)\,(m' - m)(da)\,d\lambda,
\]
where m_λ = m + λ(m′ − m).


The following lemmas are used in our proofs.
Lemma C.2. Suppose that X > 0 and γ < 0. Then we have
\[
\mathrm{Var}(\exp(\gamma X)) \le \gamma^2\,\mathrm{Var}(X). \tag{32}
\]

Proof. By the mean value theorem, for each realisation X(ω) of X, for an element ω of its underlying
probability space, there is a value c(ω) in the interval between X(ω) and E[X] such that
\[
\exp(\gamma X(\omega)) - \exp(\gamma \mathbb{E}[X]) = \gamma\big(X(\omega) - \mathbb{E}[X]\big)\exp(\gamma c(\omega)).
\]
As X > 0 we have c(ω) > 0. Moreover,
\[
\begin{aligned}
\mathrm{Var}(\exp(\gamma X)) &= \mathbb{E}\big[(\exp(\gamma X) - \mathbb{E}[\exp(\gamma X)])^2\big]\\
&\overset{(a)}{\le} \mathbb{E}\big[(\exp(\gamma X) - \exp(\gamma \mathbb{E}[X]))^2\big]\\
&\overset{(b)}{=} \mathbb{E}\big[\gamma^2 \exp(2\gamma c)\,(X - \mathbb{E}[X])^2\big]\\
&\overset{(c)}{\le} \gamma^2\,\mathrm{Var}(X),
\end{aligned}
\]
where (a), (b) and (c) follow from the minimum mean square representation, the mean value theorem
and the negativity of γX, respectively.

Lemma C.3. Suppose that 0 < a < X < b < ∞. Then the following inequality holds,
\[
\frac{\mathrm{Var}_{P_X}(X)}{2b^2} \le \log(\mathbb{E}[X]) - \mathbb{E}[\log(X)] \le \frac{\mathrm{Var}_{P_X}(X)}{2a^2},
\]
where Var_{P_X}(X) is the variance of X under the distribution P_X.

Proof. As (d²/dx²)(log(x) + βx²) = −1/x² + 2β, the function log(x) + βx² is concave for β = 1/(2b²)
and convex for β = 1/(2a²). Hence, by Jensen's inequality,

    E[log(X)] = E[ log(X) + X²/(2b²) − X²/(2b²) ]
              ≤ log(E[X]) + E[X]²/(2b²) − E[X²]/(2b²)
              = log(E[X]) − Var_{P_X}(X)/(2b²),

which completes the proof of the lower bound. A similar approach can be applied to derive the
upper bound.
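The two-sided bound of Lemma C.3 can likewise be checked numerically; in the sketch below, the interval [a, b] and the sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 0.5, 3.0
X = rng.uniform(a, b, size=500_000)            # 0 < a < X < b

gap = np.log(X.mean()) - np.log(X).mean()      # log(E[X]) - E[log(X)]
var = X.var()
print(var / (2 * b**2) <= gap <= var / (2 * a**2))   # expected: True
```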

Lemma C.4. Suppose that 0 < a < X < b < ∞. Then the following inequality holds,

    −Var_{P_X}(X) exp(b)/2 ≤ exp(E[X]) − E[exp(X)] ≤ −Var_{P_X}(X) exp(a)/2,

where Var_{P_X}(X) is the variance of X under the distribution P_X.

In the next results, PS is the distribution of S.

Lemma C.5 (McDiarmid's inequality (McDiarmid, 1989)). Let F : Z^n → R be any measurable
function for which there exist constants c_i, i = 1, . . . , n, such that

    sup_{S∈Z^n, Z̃_i∈Z} |F(S) − F(S^{(i)})| ≤ c_i,

where S^{(i)} = {Z_1, · · · , Z̃_i, · · · , Z_n} is the replace-one-sample dataset and Z̃_i is an i.i.d. sample with
respect to all Z_i for i ∈ [n]. Then the following inequality holds with probability at least (1 − δ) under
P_S,

    |F(S) − E_{P_S}[F(S)]| ≤ √( (∑_{i=1}^n c_i²) log(1/δ) / 2 ).

Lemma C.6 (Hoeffding's inequality, Boucheron et al., 2013). Suppose that S = {Z_i}_{i=1}^n are bounded
independent random variables such that a ≤ Z_i ≤ b, i = 1, . . . , n. Then the following inequality holds
with probability at least (1 − δ) under P_S,

    E[Z] − (1/n) ∑_{i=1}^n Z_i ≤ (b − a) √( log(2/δ) / (2n) ).    (33)

Lemma C.7 (Bernstein's inequality, Boucheron et al., 2013). Suppose that S = {Z_i}_{i=1}^n are i.i.d.
random variables such that |Z_i − E[Z]| ≤ R almost surely for all i, and Var(Z) = σ². Then the
following inequality holds with probability at least (1 − δ) under P_S,

    E[Z] − (1/n) ∑_{i=1}^n Z_i ≤ √( 4σ² log(2/δ) / n ) + 4R log(2/δ) / (3n).    (34)
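To illustrate how Lemmas C.6 and C.7 behave, the following sketch (illustrative only; the Bernoulli parameter and the constants are arbitrary) compares the Hoeffding and Bernstein deviation radii; when the variance σ² is small, the Bernstein radius is the tighter of the two.

```python
import numpy as np

n, delta = 10_000, 0.05
a, b = 0.0, 1.0                     # sample range used by Hoeffding's inequality
p = 0.02                            # Bernoulli(p): small variance
sigma2 = p * (1 - p)                # Var(Z)
R = 1.0                             # |Z - E[Z]| <= R almost surely

hoeffding = (b - a) * np.sqrt(np.log(2 / delta) / (2 * n))
bernstein = np.sqrt(4 * sigma2 * np.log(2 / delta) / n) + 4 * R * np.log(2 / delta) / (3 * n)
print(f"Hoeffding radius: {hoeffding:.5f}, Bernstein radius: {bernstein:.5f}")
```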

Lemma C.8. For a positive random variable Z > 0, suppose that E[Z²] ≤ η. Then the following
inequality holds,

    E[Z] ≤ η^{1/2}.

Lemma C.9. Suppose E[X²] < ∞. Then, for X > 0 and γ < 0, the following inequality holds,

    0 ≤ E[X] − (1/γ) log E[e^{γX}] ≤ (−γ/2) E[X²].

Proof. The left inequality follows from Jensen's inequality applied to f(x) = log(x). For the right
inequality, we have for γX < 0,

    e^{γX} ≤ 1 + γX + (1/2)(γX)².    (35)

Therefore, we have

    (1/γ) log E[e^{γX}] ≥ (1/γ) log E[ 1 + γX + (1/2)γ²X² ]
                        = (1/γ) log( 1 + γE[X] + (1/2)γ²E[X²] )
                        ≥ (1/γ) ( γE[X] + (1/2)γ²E[X²] )
                        = E[X] + (γ/2) E[X²].
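A Monte Carlo check of Lemma C.9 (a sketch only; the Gamma distribution and the value of γ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
gamma = -0.5
X = rng.gamma(shape=2.0, scale=0.8, size=500_000)    # X > 0 with finite second moment

tilted = np.log(np.mean(np.exp(gamma * X))) / gamma  # (1/gamma) log E[exp(gamma X)]
lower, upper = 0.0, -gamma / 2 * np.mean(X**2)
print(lower <= X.mean() - tilted <= upper)           # expected: True
```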

Lemma C.10. Suppose that 0 < a < x < b and f (x) is an increasing and concave function. Then
the following holds,
f ′ (b)(b − a) ≤ f (b) − f (a) ≤ f ′ (a)(b − a). (36)

Lemma C.11 (Uniform bound (Mohri et al., 2018)). Let F be the set of functions f : Z → [0, M]
and µ be a distribution over Z. Let S = {z_i}_{i=1}^n be a set of n samples drawn i.i.d. from µ. Then, for
any δ ∈ (0, 1), with probability at least 1 − δ over the choice of S, we have

    sup_{f∈F} { E_{Z∼µ}[f(Z)] − (1/n) ∑_{i=1}^n f(z_i) } ≤ 2R̂_S(F) + 3M √( log(2/δ) / (2n) ).

We use the next two results, namely Talagrand’s contraction lemma and Massart’s Lemma, to
estimate the Rademacher complexity.

Lemma C.12 (Talagrand's contraction lemma (Shalev-Shwartz and Ben-David, 2014)). Let φ_i :
R → R (i ∈ {1, . . . , n}) be L-Lipschitz functions and F_r be a set of functions from Z to R. Then it
follows that for any {z_i}_{i=1}^n ⊂ Z,

    E_σ[ sup_{f∈F_r} (1/n) ∑_{i=1}^n σ_i φ_i(f(z_i)) ] ≤ L E_σ[ sup_{f∈F_r} (1/n) ∑_{i=1}^n σ_i f(z_i) ].

Lemma C.13 (Massart's lemma (Massart, 2000)). Assume that the hypothesis space H is finite.
Let B² := max_{h∈H} ∑_{i=1}^n h²(z_i). Then

    R̂_S(H) ≤ B √( 2 log(card(H)) ) / n.
Lemma C.14 (Lemma 1 in (Xu and Raginsky, 2017)). For any measurable function f : Z → [0, M],

    E_{P_{X,Y}}[f(X, Y)] − E_{P_X ⊗ P_Y}[f(X, Y)] ≤ M √( I(X; Y) / 2 ).
Lemma C.15 (Coupling Lemma). Assume that the function f : Z → [0, M] is measurable and that the
function g : R_+ → R_+ is L-Lipschitz. Then the following upper bound holds,

    E_{P_{X,Y}}[g ◦ f(X, Y)] − E_{P_X ⊗ P_Y}[g ◦ f(X, Y)] ≤ LM √( I(X; Y) / 2 ).
We recall that the notation TV denotes the total variation distance between probability distributions.

Lemma C.16 (Kantorovich–Rubinstein duality of the total variation distance, see (Polyanskiy and Wu,
2022)). The Kantorovich–Rubinstein duality (variational representation) of the total variation dis-
tance is as follows:

    TV(m_1, m_2) = (1/(2L)) sup_{g∈G_L} { E_{Z∼m_1}[g(Z)] − E_{Z∼m_2}[g(Z)] },    (37)

where G_L = {g : Z → R, ||g||_∞ ≤ L}.

D Proofs and Details of Section 3


D.1 Uniform bound: details for bounded loss
Proposition 3.2 (Restated). Under Assumption 3.1, the difference between the population risk
and the tilted population risk for γ ∈ R satisfies, for any h ∈ H,

    (−1/(2γ)) Var(exp(γℓ(h, Z))) ≤ R(h, µ) − R_γ(h, µ^{⊗n}) ≤ (−exp(−2γM)/(2γ)) Var(exp(γℓ(h, Z))).

Proof. For any h ∈ H and γ < 0 we have

    R(h, µ) = E[ℓ(h, Z)]
            = E[ (1/γ) log exp(γℓ(h, Z)) ]
            = (1/|γ|) E_{Z∼µ}[ −log exp(γℓ(h, Z)) − (exp(−2γM)/2) exp(2γℓ(h, Z)) + (exp(−2γM)/2) exp(2γℓ(h, Z)) ]    (38)
            ≤ (1/γ) log(E[exp(γℓ(h, Z))]) + (exp(−2γM)/(2|γ|)) Var(exp(γℓ(h, Z)))
            = R_γ(h, µ^{⊗n}) + (exp(−2γM)/(2|γ|)) Var(exp(γℓ(h, Z))),

where the inequality follows from Jensen's inequality applied to the concave map y ↦ −log(y) − (exp(−2γM)/2) y² on [exp(γM), 1].
A similar approach can be applied for γ > 0 by using Lemma C.3, and the final result holds.

Theorem 3.3. Given Assumption 3.1, for any fixed h ∈ H, with probability at least (1 − δ)
the tilted generalization error satisfies the upper bound

    gen_γ(h, S) ≤ (−exp(−2γM)/(2γ)) Var(exp(γℓ(h, Z))) + ((exp(|γ|M) − 1)/|γ|) √( log(2/δ) / (2n) ).    (39)

Proof. We can apply Proposition 3.2 to provide an upper bound on the term I_1. Regarding the
term I_5, we have for γ < 0,

    R_γ(h, µ^{⊗n}) − R̂_γ(h, S)
      = (1/γ) log( E_{µ^{⊗n}}[ (1/n) ∑_{i=1}^n exp(γℓ(h, Z_i)) ] ) − (1/γ) log( (1/n) ∑_{i=1}^n exp(γℓ(h, Z_i)) )
      ≤ (exp(−γM)/|γ|) ( E_{Z̃∼µ}[exp(γℓ(h, Z̃))] − (1/n) ∑_{i=1}^n exp(γℓ(h, Z_i)) )    (40)
      ≤ (exp(−γM)(1 − exp(γM))/|γ|) √( log(2/δ) / (2n) ),

where the last step applies Hoeffding's inequality (Lemma C.6) to the bounded variables exp(γℓ(h, Z_i)) ∈ [exp(γM), 1].
Similarly, for γ > 0, we have

    R_γ(h, µ^{⊗n}) − R̂_γ(h, S) ≤ ((exp(γM) − 1)/|γ|) √( log(2/δ) / (2n) ).    (41)

Combining this bound with Proposition 3.2 completes the proof.
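For intuition, the following sketch (not part of the proof) evaluates the tilted empirical risk and the two terms of the bound (39) on synthetic bounded losses; the loss values, M, γ and δ are arbitrary placeholders, and the variance term is estimated on the same synthetic losses.

```python
import numpy as np

rng = np.random.default_rng(3)
n, M, gamma, delta = 5_000, 1.0, -0.3, 0.05
losses = rng.uniform(0, M, size=n)              # stand-in for ell(h, Z_i) in [0, M]

# Tilted empirical risk: (1/gamma) * log( (1/n) * sum exp(gamma * loss) )
ter = np.log(np.mean(np.exp(gamma * losses))) / gamma

var_term = -np.exp(-2 * gamma * M) / (2 * gamma) * np.exp(gamma * losses).var()
dev_term = (np.exp(abs(gamma) * M) - 1) / abs(gamma) * np.sqrt(np.log(2 / delta) / (2 * n))
print(f"TER = {ter:.4f}, bound on gen_gamma = {var_term + dev_term:.4f}")
```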

Corollary 3.7 (Restated). Under the same assumptions as in Theorem 3.3 and assuming γ is
of order O(n^{−β}) for β > 0, the upper bound on the tilted generalization error in Theorem 3.3 has a
convergence rate of max{ O(1/√n), O(n^{−β}) } as n → ∞.

Proof. Using the inequality x/(x+1) ≤ log(1 + x) ≤ x and the Taylor expansion of the exponential function,

    exp(|γ|M) = 1 + |γ|M + |γ|²M²/2 + |γ|³M³/6 + O(γ⁴),    (42)

it follows that

    (exp(|γ|M) − 1)/|γ| ≈ M + |γ|M²/2 + |γ|²M³/6 + O(γ³).    (43)

This results in a convergence rate of O(1/√n) for ((exp(|γ|M) − 1)/|γ|) √( (log(card(H)) + log(2/δ)) / (2n) ) as γ → 0.
For the term max(1, exp(−2γM))(1 − exp(γM))²/(8|γ|), using the Taylor expansion, we have a convergence rate
of O(|γ|); this completes the proof.

Theorem 3.4. Under the same assumptions as in Theorem 3.3, for a fixed h ∈ H, with proba-
bility at least (1 − δ), the tilted generalization error satisfies the lower bound

    gen_γ(h, S) ≥ (−1/(2γ)) Var(exp(γℓ(h, Z))) − ((exp(|γ|M) − 1)/|γ|) √( log(2/δ) / (2n) ).    (44)

Proof. The proof is similar to that of Theorem 3.3, by using the lower bound in Proposition 3.2.

Corollary 3.6. Let A(γ) = (1 − exp(γM))². Under the same assumptions as in Theorem 3.3,
with probability at least (1 − δ), and for a finite hypothesis space, the absolute value of the tilted
generalization error satisfies

    sup_{h∈H} |gen_γ(h, S)| ≤ ((exp(|γ|M) − 1)/|γ|) √( (log(card(H)) + log(2/δ)) / (2n) ) + max(1, exp(−2γM)) A(γ) / (8|γ|).

Proof. We can derive the following upper bound on the absolute value of the tilted generalization error by
combining Theorem 3.3 and Theorem 3.4 for any fixed h ∈ H,

    |gen_γ(h, S)| ≤ ((exp(|γ|M) − 1)/|γ|) √( (log(card(H)) + log(2/δ)) / (2n) ) + max(1, exp(−2γM)) A(γ) / (8|γ|),    (45)

where A(γ) = (1 − exp(γM))². The final result then follows by applying the uniform bound over all
h ∈ H using (45).

Corollary 3.11 (Restated). Under the same assumptions as in Theorem 3.3, and for a finite
hypothesis space, with probability at least (1 − δ), the excess risk of the tilted empirical risk satisfies

    E_γ(µ) ≤ (2(exp(|γ|M) − 1)/|γ|) √( (log(card(H)) + log(2/δ)) / (2n) ) + 2 max(1, exp(−2γM)) A(γ) / (8|γ|),

where A(γ) = (1 − exp(γM))².

Proof. It can be proved that E_γ(µ) ≤ 2 sup_{h∈H} |gen_γ(h, S)|, since

    R(h*_γ(S), µ) ≤ R̂_γ(h*_γ(S), S) + U ≤ R̂_γ(h*(µ), S) + U ≤ R(h*(µ), µ) + 2U,

where U = sup_{h∈H} |R(h, µ) − R̂_γ(h, S)| = sup_{h∈H} |gen_γ(h, S)|.
Note that sup_{h∈H} |gen_γ(h, S)| can be bounded using Corollary 3.6.

D.1.1 Another Approach for Uniform Bounds


For this approach, we decompose the tilted generalization error as follows,

    gen_γ(h, S) = [ R(h, µ) − R_γ(h, µ^{⊗n}) ] + [ R_γ(h, µ^{⊗n}) − E_{P_S}[R̂_γ(h, S)] ] + [ E_{P_S}[R̂_γ(h, S)] − R̂_γ(h, S) ],

where the three bracketed differences are denoted I_1, I_2 and I_3, respectively: I_1 is the same as in (12), I_2 is the difference between the tilted population risk and the expected
TER, and I_3 is the difference between the expected TER and the TER.

Proposition D.1. Under Assumption 3.1, for all h ∈ H and for γ ∈ R \ {0}, we have

    (exp(−2γM)/(2γn)) Var(exp(γℓ(h, Z))) ≤ R_γ(h, µ^{⊗n}) − E_{P_S}[R̂_γ(h, S)] ≤ (1/(2γn)) Var(exp(γℓ(h, Z))).
Proof. Due to the i.i.d. assumption, we have
1
Rγ (h, µ⊗n ) = log(EZ∼µ [exp(γℓ(h, Z))])
γ
 X n !
1 1
= log ES∼µ⊗n exp(γℓ(h, Zi ))
γ n
i=1

Using Lemma C.3 and Jensen's inequality, we have for γ < 0,

Rγ (h, µ⊗n )
"  X n !#
1 1
= E log exp(γℓ(h, Zi ))
γ n
i=1
"  X n !  n 2
1 1 exp(−2γM ) 1 X
= E ⊗n − log exp(γℓ(h, Zi )) − exp(γℓ(h, Zi ))
|γ| S∼µ n 2 n
i=1 i=1
 X n 2 # (46)
exp(−2γM ) 1
+ exp(γℓ(h, Zi ))
2 n
i=1
1  exp(−2γM ) 1 X n 
≤ log EZ∼µ [exp(γℓ(h, Z))] + Var exp(γℓ(h, Zi ))
γ 2|γ| n
i=1
exp(−2γM )  
= Rγ (h, µ⊗n ) + VarZ∼µ exp(γℓ(h, Z)) .
2|γ|n

Therefore, we have

exp(−2γM )  
Rγ (h, µ⊗n ) − Rγ (h, µ⊗n ) ≤ VarZ∼µ exp(γℓ(h, Z)) .
2|γ|n

Similarly, we can show that,


1  
Rγ (h, µ⊗n ) − Rγ (h, µ⊗n ) ≥ VarZ∼µ exp(γℓ(h, Z)) .
2|γ|n

Thus, for γ < 0 we obtain


1  
VarZ∼µ exp(γℓ(h, Z)) ≥ Rγ (h, µ⊗n ) − Rγ (h, µ⊗n )
2γn
exp(−2γM )  
≥ VarZ∼µ exp(γℓ(h, Z)) .
2γn
The same approach can be applied to γ > 0. This completes the proof.

Next we use McDiarmid’s inequality, Lemma C.5, to derive an upper bound on the absolute
value of the term I3 , in the following proposition.

Proposition D.2. Under Assumption 3.1 and assuming |γ| < log(n+1)/M, the difference between the expected
TER and the TER satisfies, with probability at least (1 − δ),

    | E_{P_S}[R̂_γ(h, S)] − R̂_γ(h, S) | ≤ c √( n log(1/δ) / 2 ),    (47)

where c = (1/γ) log( 1 + (1 − exp(|γ|M))/n ).

Proof. Viewing TER as a function of the data samples {Zi }ni=1 , the following bounded difference
holds for γ < 0, with the supremum taken over {zi }ni=1 ∈ Z n , z̃i ∈ Z,

b γ (h, S) − R
sup |R b γ (h, S(i) )|
1 
n n
1X 1X
= sup log( exp(γℓ(h, zj ))) − log( exp(γℓ(h, zj )) + exp(γℓ(h, z̃i )))
γ n n
j=1 j=1,
j6=i
Pn !
1 j=1 exp(γℓ(h, zj )))

= sup log Pn
γ j=1, exp(γℓ(h, zj ))
+ exp(γℓ(h, z̃i ))
j6=i
! (48)
1 exp(γℓ(h, z̃i )) − exp(γℓ(h, zi ))
= sup log 1 + Pn
γ j=1 exp(γℓ(h, zj ))
! !!
1 exp(γM ) − 1 1 1 − exp(γM )
≤ max log 1 + , log 1 +
γ n exp(γM ) γ n exp(γM )
! !!
1 exp(−γM ) − 1 1 1 − exp(−γM )
= max log 1 + , log 1 + .
γ n γ n

Therefore, we choose
! !!
1 exp(−γM ) − 1 1 1 − exp(−γM )
c̃i = max log 1 + , log 1 +
γ n γ n
!
1 1−exp(−γM )
in McDiarmid’s inequality, Lemma C.5. Note that for the term γ log 1 + n , we need

that 1−exp(−γM
n
)
> −1, and therefore we should have |γ| < log(n+1)
M ; otherwise c̃ = ∞.
Similarly, for γ > 0, we can show that,
! !!
1 exp(γM ) − 1 1 1 − exp(γM )
c̃i = max log 1 + , log 1 + .
γ n γ n

Then, under the distribution PS , with probability at least (1 − δ), the following inequality holds,
r
⊗n b n log(1/δ)
|Rγ (h, µ ) − Rγ (h, S)| ≤ c̃ , (49)
2
  
where c̃ = max γ1 log 1 − y , γ1 log 1 + y , and y = 1−exp(|γ|M n
)
.

As we have | log(1 − x)| ≤ | log(1 + x)| for −1 < x < 0, it holds that c̃ = (1/γ) log(1 + y) with
y = (1 − exp(|γ|M))/n.
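As an illustration of Proposition D.2 (a sketch only; M, δ and γ below are arbitrary, with γ chosen so that |γ| < log(n + 1)/M for every n considered), the bounded-difference constant c̃ and the resulting McDiarmid deviation radius can be tabulated against n:

```python
import numpy as np

M, delta, gamma = 1.0, 0.05, -0.1             # |gamma| < log(n + 1) / M for all n below
for n in [100, 1_000, 10_000, 100_000]:
    y = (1 - np.exp(abs(gamma) * M)) / n
    c_tilde = np.log(1 + y) / gamma           # bounded-difference constant of Proposition D.2
    radius = c_tilde * np.sqrt(n * np.log(1 / delta) / 2)
    print(f"n={n:>7d}  c_tilde={c_tilde:.3e}  deviation radius={radius:.3e}")
```

For fixed γ the radius decays like O(1/√n), consistent with Corollary D.5 below.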

With Propositions 3.2, D.1 and D.2, we derive upper and lower bounds on the tilted generalization error.

Theorem D.3. Given Assumption 3.1, |γ| < log(n+1)/M, and assuming a finite hypothesis space, then
for γ ∈ R \ {0} we have for a fixed h ∈ H,

    gen_γ(h, S) ≤ ((1 − exp(γM))²/(8γ)) ( 1/n − exp(−2γM) ) + c √( n log(1/δ) / 2 ),    (50)

where c = c(γ) = (1/γ) log( 1 + (1 − exp(|γ|M))/n ).

Proof. Combining the results in Proposition D.1, Proposition D.2 and Proposition 3.2, we can derive
the following upper bound on the generalization error,

    gen_γ(h, S) ≤ Var_{Z∼µ}(exp(γℓ(h, Z))) ( 1/(2γn) − exp(−2γM)/(2γ) ) + c √( n (log(card(H)) + log(1/δ)) / 2 ).    (51)

From the boundedness of exp(γℓ(h, z)) ∈ [exp(γM), 1] for γ < 0 and exp(γℓ(h, z)) ∈ [1, exp(γM)]
for γ > 0, we have

    Var_{Z∼µ}( exp(γℓ(h, Z)) ) ≤ (1 − exp(γM))² / 4.

Substituting this in (51) completes the proof.

Remark D.4 (The influence of γ). As γ → 0, the upper bound in Theorem D.3 on the tilted gen-
eralization error converges to the upper bound on the generalization error under the ERM algorithm
obtained by Shalev-Shwartz and Ben-David (2014), (15). In particular, c = c(γ) → M/n and the
first term in (50) vanishes. Therefore, the upper bound converges to a uniform bound on the linear
empirical risk.
Corollary D.5. Under the assumptions of Theorem D.3, and assuming γ is of order O(n^{−β}) for β > 0, the upper bound on the tilted
generalization error in Theorem D.3 has a convergence rate of max{ O(1/√n), O(n^{−β}) } as n → ∞.
x
Proof. Using the inequality x+1 ≤ log(1+ x) ≤ x and Taylor expansion for the exponential function,

|γ|2 M 2 |γ|3 M 3
exp(|γ|M ) = 1 + |γ|M + + + O(γ 4 ), (52)
2 6
it follows that
1−exp(|γ|M )
r
⊗n b γ (h, S)| ≤ n n log(1/δ)
|Rγ (h, µ )−R 1−exp(|γ|M ))
|γ|(1 + 2
n
r
1 − exp(|γ|M ) n log(1/δ)

|γ|(n + 1 − exp(|γ|M )) 2
r
M + (1/2)|γ|M 2 + O(|γ|2 ) n log(1/δ)
≤ 2 2 3
.
n − |γ|M − 1/2γ M − O(|γ| ) 2

This results in a convergencerate of O(1/ n) for γ → 0.
))2 1
For the term (1−exp(γM

γ
n − exp(−2γM ) , using the Taylor expansion (52), we have O( n ) for
(1−exp(γM ))2 −(1−exp(γM ))2 exp(−2γM )
8nγ and O(γ) for 8γ . Therefore, the final result follows.

Theorem D.6 (Uniform Lower Bound). Under the assumptions of Theorem D.3, for a fixed h ∈ H
we have

    gen_γ(h, S) ≥ ( Var(exp(γℓ(h, Z))) / (2γ) ) ( exp(−2γM)/n − 1 ) − c √( n log(1/δ) / 2 ),    (53)

where c = c(γ) = (1/γ) log( 1 + (1 − exp(|γ|M))/n ).

Proof. The final result follows similarly to the proof of Theorem D.3, using the corresponding lower bounds in Propositions 3.2, D.1 and D.2.

Remark D.7. For log(n)/(2M) < |γ| < log(n+1)/M, the lower bound in Theorem D.6 for a negative γ can
be tighter than the lower bound on the tilted generalization error, (53), for a positive γ, due to the
term exp(−2γM)/n − 1 in Theorem D.6.

Remark D.8. The term (1/n − exp(−2γM))/γ in Theorem D.3 can cause the upper bound in Theo-
rem D.3 to be tighter for γ > 0 than for γ < 0. Furthermore, in comparison with the uniform upper
bound on the linear generalization error, (15), we obtain a tilted generalization error upper bound,
(50), which can be tighter for a sufficiently large number of samples n and small values of γ.

D.2 Information-theoretic bounds: details for bounded loss


Proposition 3.12 (Restated). Under Assumption 3.1, the following inequality holds,

    R_γ(H, P_H ⊗ µ^{⊗n}) − R_γ(H, P_{H,S}) ≤ ((exp(|γ|M) − 1)/|γ|) √( I(H; S) / (2n) ).

Proof. The proof follows directly from applying Lemma C.15 to the log(.) function and then applying
Lemma C.14.

Theorem 3.13 (Restated). Under Assumption 3.1, the expected tilted generalization error
satisfies

    gen_γ(H, S) ≤ ((exp(|γ|M) − 1)/|γ|) √( I(H; S) / (2n) ) − (γ exp(−γM)/2) (1 − 1/n) E_{P_H}[ Var_{Z̃∼µ}(ℓ(H, Z̃)) ].

Proof. We expand

genγ (H, S) = R(H, PH ⊗ µ) − Rγ (H, PH ⊗ µ⊗n ) + Rγ (H, PH ⊗ µ⊗n ) − Rγ (H, PH,S ). (54)

Using Proposition 3.12, it follows that


r
(exp(|γ|M ) − 1) I(H; S)
|Rγ (H, PH ⊗ µ ⊗n
) − Rγ (H, PH,S )| ≤ . (55)
|γ| 2n

Using the Lipschitz property of the log(.) function under Assumption 3.1, we have for γ > 0,

R(H, PH ⊗ µ) − Rγ (H, PH ⊗ µ⊗n )


n n
1 γX 1 1X
= EPH ⊗µ⊗n [ log(exp( ℓ(H, Zi )))] − EPH ⊗µ⊗n [ log( exp(γℓ(H, Zi )))]
γ n γ n
i=1 i=1

exp(−γM ) h γX
n
1X
n i
≤ EPH ⊗µ⊗n exp( ℓ(H, Zi )) − exp(γℓ(H, Zi ))
γ n n
i=1 i=1
" # (56)
− exp(−γM )  n
1X 2  1 X n
2 
2
≤ EPH ⊗µ⊗n γ ℓ(H, Zi ) − γℓ(H, Zi )
2γ n n2
i=1 i=1
− exp(−γM )  
= (1 − 1/n)EPH VarZ̃∼µ (γℓ(H, Z̃))

− exp(−γM )γ  
= (1 − 1/n)EPH VarZ̃∼µ (ℓ(H, Z̃))
2

where Z̃ ∼ µ. A similar results also holds for γ < 0. Combining (55), (56) with (54) completes the
proof.

We now give a lower bound via the information-theoretic approach.

Theorem 3.14 (Restated). Under the same assumptions as in Theorem 3.13, the expected
tilted generalization error satisfies

    gen_γ(H, S) ≥ −((exp(|γ|M) − 1)/|γ|) √( I(H; S) / (2n) ) − (γ exp(γM)/2) (1 − 1/n) E_{P_H}[ Var_{Z̃∼µ}(ℓ(H, Z̃)) ].

Proof. Similarly as in the proof of Theorem 3.13, we can prove the lower bound. Using the Lipschitz
property of the log(.) function under Assumption 3.1, we have for γ > 0,

R(H, PH ⊗ µ) − Rγ (H, PH ⊗ µ⊗n )


n n
1 γX 1 1X
= EPH ⊗µ⊗n [ log(exp( ℓ(H, Zi )))] − EPH ⊗µ⊗n [ log( exp(γℓ(H, Zi )))]
γ n γ n
i=1 i=1
1 h γX
n
1X
n i
≥ EPH ⊗µ⊗n exp( ℓ(H, Zi )) − exp(γℓ(H, Zi ))
γ n n
i=1 i=1
" # (57)
− exp(γM )  n
1X 2  1 X n
2 
2
≥ EPH ⊗µ⊗n γ ℓ(H, Zi ) − γℓ(H, Zi )
2γ n n2
i=1 i=1
− exp(γM ) 1  
= 1 − EPH VarZ̃∼µ (γℓ(H, Z̃))
2γ n
−γ exp(γM ) 1  
= 1 − EPH VarZ̃∼µ (ℓ(H, Z̃)) .
2 n
Similar results also holds for γ < 0. Combining (55), (57) with (54) completes the proof.

Corollary 3.15 (Restated). Under the same assumptions as in Theorem 3.13, the absolute
value of the expected tilted generalization error satisfies

    |gen_γ(H, S)| ≤ ((exp(|γ|M) − 1)/|γ|) √( I(H; S) / (2n) ) + (|γ|M² exp(|γ|M)/8) (1 − 1/n).

Proof. We can derive the upper bound on absolute value of the expected tilted generalization error
by combining Theorem 3.13 and Theorem 3.14.

D.2.1 Individual Sample Bound Discussion


We can apply previous information-theoretic bounding techniques (e.g., (Bu et al., 2020), (Asadi et al.,
2018), and (Steinke and Zakynthinou, 2020)), exploiting the Lipschitz property of the logarithm
function (or Lemma C.10) over a bounded support. For example, to derive an upper bound based
on individual samples (Bu et al., 2020), using the approach of Proposition 3.12 and Lemma C.10,
we have for γ > 0,

    R_γ(H, P_H ⊗ µ^{⊗n}) − R_γ(H, P_{H,S})
      ≤ (1/γ) ( E_{P_H ⊗ µ^{⊗n}}[ (1/n) ∑_{i=1}^n exp(γℓ(H, Z_i)) ] − E_{P_{H,S}}[ (1/n) ∑_{i=1}^n exp(γℓ(H, Z_i)) ] )    (58)
      ≤ ((exp(γM) − 1)/γ) (1/n) ∑_{i=1}^n √( I(H; Z_i) / 2 ),

where for the first inequality we applied Lemma C.10, and for the second inequality we used that
1 ≤ exp(γℓ(H, Z_i)) ≤ exp(γM) for γ > 0 and the approach in (Bu et al., 2020) for bounding via
individual samples. A similar approach can also be applied for γ < 0. Therefore, the following upper
bound holds on the absolute value of the expected generalization error,

    |gen_γ(H, S)| ≤ ((exp(|γ|M) − 1)/(|γ| n)) ∑_{i=1}^n √( I(H; Z_i) / 2 ) + (|γ|M² exp(|γ|M)/8) (1 − 1/n).

D.2.2 Another Approach for an Information-theoretic Bound


Instead of the decomposition (12), we can also consider the following decomposition of the expected
tilted generalization error,

    E_{P_{H,S}}[gen_γ(H, S)] = E_{P_H}[R(H, µ)] − R_γ(H, P_H ⊗ µ) + R_γ(H, P_H ⊗ µ) − R_γ(H, P_{H,S}) + R_γ(H, P_{H,S}) − E_{P_{H,S}}[R̂_γ(H, S)].    (59)

Proposition D.9. Under Assumption 3.1, the following inequality holds,

    R_γ(H, P_H ⊗ µ) − R_γ(H, P_{H,S}) ≤ ((exp(|γ|M) − 1)/|γ|) √( I(H; S) / (2n) ).    (60)

Proof. For γ < 0 and using the Lipschitz property of the log(x) function on an interval, we have

Rγ (H, PH ⊗ µ) − Rγ (H, PH,S )


n
1 1 1X
= log(EPH ⊗µ [exp(γℓ(H, Z̃))] − log(EPH,S [ exp(γℓ(H, Zi ))])
γ γ n
i=1
n (61)
exp(−γM ) 1X
≤ EPH ⊗µ [exp(γℓ(H, Z̃))] − EPH,S [ exp(γℓ(H, Zi ))]
|γ| n
i=1
r
(exp(|γ|M ) − 1) I(H; S)
≤ .
|γ| 2n

A similar bound can be derived for γ > 0.

Using Proposition D.9 and Lemma C.3, we can derive the following upper bound on the expected
generalization error.

Theorem D.10. Under Assumption 3.1, the following upper bound holds on the expected tilted
generalization error,

    gen_γ(H, S) ≤ ((exp(|γ|M) − 1)/|γ|) √( I(H; S) / (2n) ) + ( Var(exp(γℓ(H, Z))) / (2γ) ) ( 1 − exp(−2γM)/n ).    (62)

Proof. We have

EH,S [genγ (H, S)]


= EH,S [R(H, µ)] − Rγ (H, PH ⊗ µ) + Rγ (H, PH ⊗ µ) − Rγ (H, PH,S ) (63)
+ Rγ (H, PH,S ) − EP [R b γ (H, S)].
H,S

b γ (H, S) ≤ M ,
As 0 ≤ R

b γ (H, S)] ≤ Var(exp(γℓ(H, Z)))


Rγ (H, PH,S ) − EPH,S [R . (64)

Using Proposition D.9, we have
r
(exp(|γ|M ) − 1) I(H; S)
|Rγ (H, PH ⊗ µ) − Rγ (H, PH,S )| ≤ . (65)
|γ| 2n

Under Assumption 3.1, we have

EPH,S [R(H, µ)] − Rγ (H, PH ⊗ µ)


n n
1X 1 1X
= EPH ⊗µ [ ℓ(H, Zi )] − log(EPH ⊗µ [exp(γ ℓ(H, Zi ))])
n γ n (66)
i=1 i=1
Var(exp(γℓ(H, Z)))
≤ − exp(−2γM ) .
2γn

Combining (64), (65) and (66) with (63) completes the proof.

E Proofs and Details of Section 4
E.1 Uniform bounds: details for unbounded loss

Theorem 4.2 (Restated). Given Assumption 4.1, for any fixed h ∈ H, with probability at
least (1 − δ) the following upper bound holds on the tilted generalization error for γ < 0,

    gen_γ(h, S) ≤ 2κ_u exp(−γκ_u) √( log(2/δ) / n ) − 4 exp(−γκ_u) log(2/δ) / (3nγ) − (γ/2) κ_u².    (67)

Proof. Using Bernstein’s inequality, Lemma C.7, for Xi = exp(γℓ(h, Zi )) and considering 0 < Xi <
1, we have
s
Xn
1 4Var(exp(γℓ(h, Z̃))) log(2/δ) 4 log(2/δ)
exp γℓ(h, Zi ) ≤ E[exp(γℓ(h, Z̃))] + + ,
n n 3n
i=1

where we also used that


x x
log(x + y) = log(y) + log(1 + ) ≤ log(y) + for y > x > 0. (68)
y y
Thus,
1 X
n 
log exp γℓ(h, Zi )
n
i=1
 
≤ log E[exp(γℓ(h, Z̃))]
s
1 4Var(exp(γℓ(h, Z̃))) log(2/δ) 1 4 log(2/δ)
+ +
E[exp(γℓ(h, Z̃))] n E[exp(γℓ(h, Z̃))] 3n
 
≤ log E[exp(γℓ(h, Z̃))]
s
4Var(exp(γℓ(h, Z̃))) log(2/δ) 4 log(2/δ)
+ exp(−γκu ) + exp(−γκu ) .
n 3n

Therefore, for γ < 0 we have

1   1 1 X n 
log E[exp(γℓ(h, Z̃))] − log exp γℓ(h, Zi )
γ γ n
i=1
s
exp(−γκu ) 4Var(exp(γℓ(h, Z̃))) log(2/δ) 4 log(2/δ)
≤ + exp(−γκu ) .
|γ| n 3n|γ|

Using Lemma C.2, completes the proof.

Theorem 4.3 (Restated). Given Assumption 4.1, there exists a ζ ∈ (0, 1) such that for
n ≥ (4γ²κ_u² + 8ζ/3) log(2/δ) / (ζ² exp(2γκ_u)) and γ < 0, for any fixed h ∈ H, the following lower bound on the
tilted generalization error holds with probability at least (1 − δ);

    gen_γ(h, S) ≥ −(2κ_u exp(−γκ_u)/(1 − ζ)) √( log(2/δ) / n ) + 4 exp(−γκ_u) log(2/δ) / (3nγ(1 − ζ)).    (69)

Proof. Recall that Z̃ ∼ µ and


1
genγ (h, S) = R(h, µ) − log(E[exp(γℓ(h, Z̃))])
γ
n
!
1 1 1X
+ log(E[exp(γℓ(h, Z̃))]) − log exp(γℓ(h, Zi )) .
γ γ n
i=1

First, we apply Lemma C.9 to yield R(h, µ) − γ1 log(E[exp(γℓ(h, Z̃))]) ≥ 0. Next we focus on the
second line of this display. Bernstein’s inequality, Lemma C.7, for Xi = exp(γℓ(h, Zi )), so that
0 < Xi < 1, gives that with probability at least (1 − δ),
s
Xn
1 4 Var(exp(γℓ(h, Z̃))) log(2/δ) 4 log(2/δ)
exp γℓ(h, Zi ) ≥ E[exp(γℓ(h, Z̃))] − − . (70)
n n 3n
i=1

Assume for now that there is a ζ ∈ (0, 1) such that


s
4 Var(exp(γℓ(h, Z̃))) log(2/δ) 4 log(2/δ)
+ ≤ ζE[exp(γℓ(h, Z̃))]. (71)
n 3n

As log(y −x) = log(y)+log(1− xy ) ≥ log(y)− y−x x


for y > x > 0, then by taking y = E[exp(γℓ(h, Z̃))]
q
and x = 4 Var(exp(γℓ(h,
n
Z̃))) log(2/δ)
+ 4 log(2/δ)
3n , so that with (71) we have y − x ≥ (1 − ζ)y > 0, taking
logarithms on both sides of (70) gives that with probability at least (1 − δ),

1 X
n 
log exp γℓ(h, Zi )
n
i=1
s 
  1 4 Var(exp(γℓ(h, Z̃))) log(2/δ) 4 log(2/δ)
≥ log E[exp(γℓ(h, Z̃))] −  + 
E[exp(γℓ(h, Z̃))] n 3n
s
  1 4 Var(exp(γℓ(h, Z̃))) log(2/δ)
≥ log E[exp(γℓ(h, Z̃))] −
(1 − ζ)E[exp(γℓ(h, Z̃))] n
1 4 log(2/δ)

(1 − ζ)E[exp(γℓ(h, Z̃))] 3n
s
  exp(−γκ ) 4 Var(exp(γℓ(h, Z̃))) log(2/δ) exp(−γκ ) 4 log(2/δ)
u u
≥ log E[exp(γℓ(h, Z̃))] − −
(1 − ζ) n (1 − ζ) 3n
r
  |γ|2κ exp(−γκ ) log(2/δ) exp(−γκ ) 4 log(2/δ)
u u u
≥ log E[exp(γℓ(h, Z̃))] − − .
(1 − ζ) n (1 − ζ) 3n
(72)

Here we used that by Assumption 4.1 and Lemma C.8,

E[exp(γℓ(h, Z̃))] ≤ exp[E(γℓ(h, Z̃))] ≤ exp(γκu )

and by Assumption 4.1,

Var(exp(γℓ(h, Z̃))) ≤ [Eexp(γℓ(h, Z̃))2 ] ≤ exp[E(γℓ(h, Z̃))2 ] ≤ exp(γ 2 κ2u ).

This gives the stated bound assuming that (71) holds. In order to satisfy (71), viewing (71) as a

quadratic inequality in n and using that (a + b)2 ≤ 2a2 + 2b2 yields

(4 Var(exp(γℓ(h, Z̃))) + 8/3ζE[exp(γℓ(h, Z̃))]) log(2/δ)


n≥ ,
ζ 2 (E[exp(γℓ(h, Z̃))])2

Now applying exp(γκu ) ≤ E[exp(γℓ(h, Z̃))] ≤ 1 and Var(exp(γℓ(h, Z̃))) ≤ γ 2 κ2u completes the
proof.

Corollary 4.4 (Restated). Under the same assumptions as in Theorem 4.3 and for a finite hy-
pothesis space, for n ≥ (4γ²κ_u² + 8ζ/3) log(2/δ) / (ζ² exp(2γκ_u)), γ < 0, and with probability at least (1 − δ),
the absolute value of the tilted generalization error satisfies

    sup_{h∈H} |gen_γ(h, S)| ≤ (2κ_u exp(−γκ_u)/(1 − ζ)) √( (log(card(H)) + log(2/δ)) / n )
                              − 4 exp(−γκ_u) (log(card(H)) + log(2/δ)) / (3nγ(1 − ζ)) − (γ/2) κ_u².    (73)

Proof. Combining the upper and lower bounds, Theorem 4.2 and Theorem 4.3, we can derive the
following bound for a fixed h ∈ H,

    |gen_γ(h, S)| ≤ (2κ_u exp(−γκ_u)/(1 − ζ)) √( log(2/δ) / n ) − 4 exp(−γκ_u) log(2/δ) / (3nγ(1 − ζ)) − (γ/2) κ_u².    (74)

Then, using the uniform bound, Lemma C.11, completes the proof.

E.2 Information-theoretic Bounds


In the information-theoretic approach for the unbounded loss function, we relax the uniform assumption, Assumption 4.1, as follows.
Assumption E.1 (Bounded second moment). The learning algorithm PH|S , loss function ℓ, and
µ are such that there is a constant κt ∈ R+ where the loss function (H, Z) 7→ ℓ(H, Z) satisfies
EPH ⊗µ [ℓ2 (H, Z)] ≤ κ2t .
The following results are helpful in deriving the upper and lower bounds under unbounded loss
assumptions via the information-theoretic approach.
Proposition E.2. Given Assumption E.1, the following inequality holds for γ < 0,

    R_γ(H, P_H ⊗ µ) − R_γ(H, P_{H,S}) ≤ exp(−γκ_t) √( κ_t² I(H; S) / n ),              if I(H; S)/n ≤ γ²κ_t²/2,
    R_γ(H, P_H ⊗ µ) − R_γ(H, P_{H,S}) ≤ (−exp(−γκ_t)/γ) ( I(H; S)/n + γ²κ_t²/2 ),      if I(H; S)/n > γ²κ_t²/2.

Proof. For γ < 0 and we use that 0 ≤ exp(γℓ(H, Z̃)) ≤ 1 and Var(exp(γℓ(H, Z̃))) ≤ γ 2 Var(ℓ(H, Z̃)) ≤
γ 2 κ2t . Note that the variable exp(γℓ(H, Z̃)) is sub-exponential with parameters (γ 2 κ2t , 1) under the
distribution PH ⊗ µ. Using the approach in (Aminian et al., 2021a; Bu et al., 2020) for the sub-
exponential case, we have
n
1X
EPH ⊗µ [exp(γℓ(H, Z̃))] − EPH,S [ exp(γℓ(H, Zi ))]
n
i=1
 q (75)
 γ 2 κ2
|γ| κ2t I(H;S)
n if I(H;S)
n ≤ 2t

 I(H;S) + γ 2 κ2t if I(H;S) >
γ 2 κ2t
.
n 2 n 2

I(H;S) γ 2 κ2t
Therefore, we have for n ≤ 2 ,
r
h1 X
n i  h i κ2t I(H; S) 
EPH,S exp(γℓ(H, Zi )) ≤ EPH ⊗µ exp(γℓ(H, Z̃)) + |γ| . (76)
n n
i=1

Using (68) gives


1  1X
n  1
log EPH,S [ exp(γℓ(H, Zi ))] − log(EPH ⊗µ [exp(γℓ(H, Z̃))])
γ n γ
i=1
r (77)
|γ| κ2t I(H; S)
≥ .
γEPH ⊗µ [exp(γℓ(H, Z̃))] n

I(H;S) γ 2 κ2t
For n > 2 , we have
h1 X
n i
EPH,S exp(γℓ(H, Zi ))
n
i=1 (78)
 h i I(H; S) γ 2 κ2 
t
≤ EPH ⊗µ exp(γℓ(H, Z̃)) + + .
n 2
Using (68) again, we obtain,

1  1X
n  1
log EPH,S [ exp(γℓ(H, Zi ))] − log(EPH ⊗µ [exp(γℓ(H, Z̃))])
γ n γ
i=1 (79)
1  I(H; S) γ 2 κ2t 
≥ + .
γEPH ⊗µ [exp(γℓ(H, Z̃))] n 2

As under Assumption E.1 and Lemma C.8, we have exp(γκt ) ≤ EPH ⊗µ [exp(γℓ(H, Z̃))], the final
result follows.

Corollary E.3. Given Assumption E.1, the following inequality holds for γ < 0 and some ζ ∈ (0, 1),
 q
 2
 − exp(−γκt ) κt I(H;S) , γ 2 κ2t I(H;S) 
(1−ζ) n if max 2I(H;S)2 , ζ 2 exp(2γκ
≤n
Rγ (H, PH ⊗ µ) − Rγ (H, PH,S ) ≥  
2
γ κt  t)
 2 2 4I(H;S) 4ζ −2 −1
 exp(−γκt ) I(H;S) + γ κt , if min( , 2I(H;S) ) > n.
(1−ζ)γ n 2 γ 2 κ2 γ 2 κ2 t t

Proof. Similar to Proposition E.2, we have


n
1X
EPH ⊗µ [exp(γℓ(H, Z̃))] − EPH,S [ exp(γℓ(H, Zi ))]
n
i=1
 q (80)
 γ 2 κ2
|γ| κ2t I(H;S)
n ifI(H;S)
n ≤ 2t
≤ γ 2 κ2t γ 2 κ2
 I(H;S) + if I(H;S)
> 2t.
n 2 n

I(H;S) γ 2 κ2t
Therefore, we have for n ≤ 2 ,
r
h1 X
n i  h i κ2t I(H; S) 
EPH,S exp(γℓ(H, Zi )) ≥ EPH ⊗µ exp(γℓ(H, Z̃)) + γ . (81)
n n
i=1

Assume for now that there is a ζ ∈ (0, 1) such that


r
κ2t I(H; S) h i
|γ| ≤ ζEPH ⊗µ exp(γℓ(H, Z̃)) (82)
n
As log(y − x) = log(y) + log(1 − xy ) ≥ log(y) − y−x x
for y > x > 0, then by taking y =
h i q
2
κt I(H;S)
EPH ⊗µ exp(γℓ(H, Z̃)) and x = |γ| n , so that with (71) we have y − x ≥ (1 − ζ)y > 0,
taking logarithms on both sides of (81) gives that

1  1X
n  1
log EPH,S [ exp(γℓ(H, Zi ))] − log(EPH ⊗µ [exp(γℓ(H, Z̃))])
γ n γ
i=1
r (83)
1 κ2t I(H; S)
≥ ,
(1 − ζ)EPH ⊗µ [exp(γℓ(H, Z̃))] n
γ 2 κ2t I(H;S) I(H;S) γ 2 κ2t
where it holds for n ≥ ζ 2 exp(2γκt ) . For n > 2 , we have
h1 X
n i  h i I(H; S) γ 2 κ2 
t
EPH,S exp(γℓ(H, Zi )) ≥ EPH ⊗µ exp(γℓ(H, Z̃)) − − . (84)
n n 2
i=1

As log(y − x) = log(y) + log(1 − xy ) ≥ log(y) − x


y−x , we obtain,

1  1X
n  1
log EPH,S [ exp(γℓ(H, Zi ))] − log(EPH ⊗µ [exp(γℓ(H, Z̃))])
γ n γ
i=1 (85)
1  I(H; S) γ 2 κ2 
t
≥ + ,
γ(1 − ζ)EPH ⊗µ [exp(γℓ(H, Z̃))] n 2

4I(H;S) 4ζ −2 −1
where it holds for γ 2 κ2t
≥ n. Note that, we have,
h i
EPH ⊗µ exp(γℓ(H, Z̃))
r
|γ| κ2t I(H; S)
≥ (86)
ζ n
I(H; S) γ 2 κ2t
≥ + .
n 2
Therefore, we have, r
|γ| κ2t I(H; S) I(H; S) γ 2 κ2t
≥ + , (87)
ζ n n 2

where 87 is a quadratic inequality in n and using that (a + b)2 ≤ 2a2 + 2b2 yields,

4I(H; S) 4ζ −2 − 1
≥ n.
γ 2 κ2t
As under Assumption E.1 and Lemma C.8, we have exp(γκt ) ≤ EPH ⊗µ [exp(γℓ(H, Z̃))], the final
result follows.

Using Proposition E.2, we can obtain the following upper bound on the expected generalization
error.
Theorem E.4. Given Assumption E.1, the following upper bound holds on the expected tilted gen-
eralization error for γ < 0,

    gen_γ(H, S) ≤ exp(−γκ_t) √( κ_t² I(H; S) / n ) − (γ/2) κ_t²,                       if I(H; S)/n ≤ γ²κ_t²/2,
    gen_γ(H, S) ≤ (−exp(−γκ_t)/γ) ( I(H; S)/n + γ²κ_t²/2 ) − (γ/2) κ_t²,               if I(H; S)/n > γ²κ_t²/2.

Proof. We use the following decomposition,
EPH,S [genγ (H, S)]
= EH,S [R(H, µ)] − Rγ (H, PH ⊗ µ) + Rγ (H, PH ⊗ µ) − Rγ (H, PH,S ) (88)
+ Rγ (H, PH,S ) − EP [R b γ (H, S)].
H,S

Using Lemma C.9, we have

|γ|
EPH,S [R(H, µ)] − Rγ (H, PH ⊗ µ) ≤ EPH ⊗µ [ℓ2 (H, Z̃)], (89)
2
and due to Jensen’s inequality for γ < 0, we have
b γ (H, S)] ≤ 0.
Rγ (H, PH,S ) − EPH,S [R (90)
Using a similar approach as for the proof of Proposition D.9 and applying Proposition E.2, we
obtain
 q
exp(−γκ ) κ2t I(H;S) , I(H;S)

γ 2 κ2t
t
Rγ (H, PH ⊗ µ) − Rγ (H, PH,S ) ≤ exp(−γκ ) I(H;S) γ 2 κ2 
n n
I(H;S)
2
γ 2 κ2t
(91)
 t
+ t
, > .
|γ| n 2 n 2

Combining (90), (91) and (89) with (88) completes the proof.

Remark E.5. Assuming γ = O(n−1/2 ), the upper bound in Theorem E.4 has the convergence rate
O(n−1/2 ). Note that the result in Theorem E.4 holds for unbounded loss functions, provided that the
second moment of the loss function exists.
A similar approach to Theorem E.4 can be applied to derive an information-theoretical lower
bound on the expected tilted generalization error for γ < 0 by applying Corollary E.3.
Theorem E.6. Given Assumption E.1, the following lower bound holds on the expected tilted gen-
eralization error for γ < 0,
 q
 2
− exp(−γκt ) κt I(H;S) + γ κ2t , γ 2 κ2t I(H;S) 
n 2 if max 2I(H;S) ,
γ 2 κ2t ζ 2 exp(2γκt )
≤n
genγ (H, S) ≥   −2
 2 2
 exp(−γκt ) I(H;S) + γ κt + γ κ2t , 4 4ζ −1 I(H;S) 2I(H;S)
γ n 2 2 if min( γ 2 κ2
, γ 2 κ2 ) > n.
t t

Proof. The proof is similar to Theorem E.4 and using Corollary E.3.

Combining Theorem E.6 and Theorem E.4, we can derive the upper bound on the absolute value
of the expected generalization error.

F Proof and details of Section 5


Proposition 5.2 (Restated). Under Assumption 4.1 and Assumption 5.1, the difference of the
tilted population risk (9) between µ and µ̃ is bounded as follows:

    (1/γ) log(E_{Z̃∼µ}[exp(γℓ(h, Z̃))]) − (1/γ) log(E_{Z̃∼µ̃}[exp(γℓ(h, Z̃))]) ≤ (TV(µ, µ̃)/γ²) (exp(|γ|κ_u) − exp(|γ|κ_s)) / (κ_u − κ_s).    (92)

Proof. We have for a fixed h ∈ H that
1 1
log(EZ̃∼µ [exp(γℓ(h, Z̃))]) − log(EZ̃∼µ̃ [exp(γℓ(h, Z̃))])
γ γ
Z 1Z
(a) exp(γℓ(h, z))
= (µ̃ − µ)(dz)dλ
0 Z |γ|EZ̃∼µλ [exp(γℓ(h, Z̃))]
Z 1 (93)
(b) TV(µ, µ̃) exp(|γ|κ )
s
≤ exp(|γ|λ(κu − κs ))dλ
|γ| 0
TV(µ, µ̃) exp(|γ|κu ) − exp(|γ|κs )
= ,
|γ| |γ|(κu − κs )

where (a) and (b) follow from the functional derivative with µλ = µ̃ + λ(µ − µ̃) and Lemma C.16.
The same approach can be applied for lower bound.

Proposition 5.3 (Restated). Given Assumptions 4.1 and 5.1, for any fixed h ∈ H, with
probability at least (1 − δ) and for γ < 0, the following upper bound holds on the tilted
generalization error,

    gen_γ(h, Ŝ) ≤ 2κ_s exp(−γκ_s) √( log(2/δ) / n ) − 4 exp(−γκ_s) log(2/δ) / (3nγ) − (γ/2) κ_u²
                  + (TV(µ, µ̃)/γ²) (exp(|γ|κ_u) − exp(|γ|κ_s)) / (κ_u − κ_s),

where Ŝ is the training dataset under the distributional shift.

Proof. The proof follows directly from the following decomposition of the tilted generalization error
under distribution shift,

genγ (h, Ŝ) = R(h, µ) − Rγ (h, µ⊗n ) + Rγ (h, µ) − Rγ (h, µ̃) + Rγ (h, µ̃) − Rb γ (h, Ŝ),
| {z } | {z } | {z }
I5 I6 I7

where I5 , I6 and I7 can be bounded using Lemma C.9, Proposition 5.2 and Theorem 4.2, respectively.

Remark F.1. It is noteworthy that, for a fixed n, the upper bound in Proposition 5.3 diverges as γ → −∞
and as γ → 0. Consequently, there must exist a γ ∈ (−∞, 0) that minimizes this upper
bound. To illustrate this point, consider the case where n → ∞; here, the first and second terms in
the upper bound vanish. Thus, we are led to the following minimization problem:

    γ⋆ := arg min_{γ∈(−∞,0)} [ −(γ/2) κ_u² + (TV(µ, µ̃)/γ²) (exp(|γ|κ_u) − exp(|γ|κ_s)) / (κ_u − κ_s) ],    (94)

where the solution exists. As γ⋆ decreases when TV(µ, µ̃) increases, this practically implies that if the
training distribution becomes more adversarial (i.e., further away from the benign test distribution),
we would use more negative values of γ to bypass outliers.
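The minimization in (94) can be carried out numerically; the following sketch (with placeholder values of κ_u, κ_s, and a simple grid search, none of which are prescribed in the paper) illustrates how the minimizer γ⋆ moves further from zero as TV(µ, µ̃) grows.

```python
import numpy as np

kappa_u, kappa_s = 2.0, 1.0                   # placeholder moment constants

def objective(gamma, tv):
    # Right-hand side of (94): -(gamma/2)*kappa_u^2 + TV/gamma^2 * (e^{|g|ku} - e^{|g|ks})/(ku - ks)
    g = abs(gamma)
    return (-gamma / 2) * kappa_u**2 + tv / gamma**2 * (np.exp(g * kappa_u) - np.exp(g * kappa_s)) / (kappa_u - kappa_s)

grid = -np.linspace(1e-3, 5.0, 20_000)        # candidate gamma < 0
for tv in [0.01, 0.1, 0.5]:
    gamma_star = grid[np.argmin(objective(grid, tv))]
    print(f"TV = {tv:.2f}  ->  gamma* ~ {gamma_star:.3f}")
```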

Proposition 5.4 (Restated). Given Assumptions 4.1 and 5.1, for any fixed h ∈ H and
with probability at least (1 − δ), there exists a ζ ∈ (0, 1) such that for n ≥ (4γ²κ_u² + 8ζ/3) log(2/δ) / (ζ² exp(2γκ_u))
and γ < 0, the following lower bound holds on the tilted generalization error,

    gen_γ(h, Ŝ) ≥ −(2κ_s exp(−γκ_s)/(1 − ζ)) √( log(2/δ) / n ) + 4 exp(−γκ_s) log(2/δ) / (3nγ(1 − ζ))
                  − (TV(µ, µ̃)/γ²) (exp(|γ|κ_u) − exp(|γ|κ_s)) / (κ_u − κ_s),

where Ŝ is the training dataset under the distributional shift.

Proof. The proof follows directly from the following decomposition of the tilted generalization error
under distribution shift,

genγ (h, Ŝ) = R(h, µ) − Rγ (h, µ⊗n ) + Rγ (h, µ) − Rγ (h, µ̃) + Rγ (h, µ̃) − Rb γ (h, Ŝ),
| {z } | {z } | {z }
I5 I6 I7

where I5 , I6 and I7 can be bounded using Lemma C.9, Proposition 5.2 and Theorem 4.3, respectively.

Theorem 5.5 (Restated). Under the same assumptions as in Proposition 5.4, for n ≥
(4γ²κ_u² + 8ζ/3) log(2/δ) / (ζ² exp(2γκ_u)) and γ < 0, and with probability at least (1 − δ), the absolute value of the
tilted generalization error under distributional shift satisfies

    sup_{h∈H} |gen_γ(h, Ŝ)| ≤ (2κ_s exp(−γκ_s)/(1 − ζ)) √( B(δ) / n ) − 4 exp(−γκ_s) B(δ) / (3nγ(1 − ζ))
                               − (γκ_u²)/2 + (TV(µ, µ̃)/γ²) (exp(|γ|κ_u) − exp(|γ|κ_s)) / (κ_u − κ_s),    (95)

where B(δ) = log(card(H)) + log(2/δ).

Proof. The result follows from combining the results of Proposition 5.3, Proposition 5.4 and applying
the uniform bound.

G Proofs and details of Section 6
Proposition 6.1 (Restated). The solution to the expected TERM regularized via KL divergence,
(26), is the tilted Gibbs posterior,

    P^γ_{H|S} = (π_H / F_α(S)) ( (1/n) ∑_{i=1}^n exp(γℓ(H, z_i)) )^{−α/γ},    (96)

where F_α(S) is the normalization factor.

Proof. From (Zhang, 2006), we know that

    P⋆_X = arg min_{P_X} { E_{P_X}[f(X)] + (1/α) KL(P_X || Q_X) },    (97)

where P⋆_X = Q_X exp(−αf(X)) / E_{Q_X}[exp(−αf(X))]. Using (97) with f taken to be the tilted empirical risk,
f(H) = (1/γ) log( (1/n) ∑_{i=1}^n exp(γℓ(H, z_i)) ), it can be shown that the tilted Gibbs posterior (96) is the
solution to (26).
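Over a finite hypothesis set, the tilted Gibbs posterior (96) can be computed explicitly. The sketch below uses a uniform prior π_H and synthetic losses; both are placeholder choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n_hyp, n, gamma, alpha = 50, 200, -0.5, 2.0
losses = rng.uniform(0, 1, size=(n_hyp, n))        # ell(h, z_i) for each hypothesis h

prior = np.full(n_hyp, 1.0 / n_hyp)                # uniform prior pi_H (placeholder choice)
tilt = np.mean(np.exp(gamma * losses), axis=1) ** (-alpha / gamma)
posterior = prior * tilt
posterior /= posterior.sum()                       # divide by the normalization factor F_alpha(S)
print(posterior.sum(), losses[posterior.argmax()].mean())   # sums to 1; the mode has low average loss
```

For γ < 0 and α > 0 the exponent −α/γ is positive, so hypotheses with small tilted empirical risk receive the largest posterior mass.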

Proposition 6.2 (Restated). The difference between the expected TER under the joint and
the product of the marginal distributions of H and S can be characterized as

    R_γ(H, P_H ⊗ µ) − R_γ(H, P_{H,S}) = I_SKL(H; S) / α.    (98)

Proof. As in Aminian et al. (2015), the symmetrized KL information between two random variables
(S, H) can be written as

ISKL (H; S) = EPH ⊗µ⊗n [log(PH|S )] − EPH,S [log(PH|S )]. (99)

The result follows by substituting the tilted Gibbs posterior in (99).

Theorem 6.3 (Restated). Under Assumption 3.1, the expected generalization error of the
tilted Gibbs posterior satisfies

    gen_γ(H, S) ≤ α (exp(|γ|M) − 1)² / (2γ²n) + ( Var(exp(γℓ(H, Z̃))) / (2γ) ) ( 1/n − exp(−2γM) ).    (100)

Proof. Note that we have

    I(H; S)/α ≤ I_SKL(H; S)/α = R_γ(H, P_H ⊗ µ) − R_γ(H, P_{H,S})
              ≤ R_γ(H, P_H ⊗ µ^{⊗n}) − R_γ(H, P_{H,S})
              ≤ ((exp(|γ|M) − 1)/|γ|) √( I(H; S) / (2n) ).    (101)

Therefore, we have

    I(H; S)/α ≤ ((exp(|γ|M) − 1)/|γ|) √( I(H; S) / (2n) ).    (102)

Solving (102) results in

    √( I(H; S) ) ≤ α ((exp(|γ|M) − 1)/|γ|) √( 1/(2n) ).    (103)

Therefore, we obtain

    R_γ(H, P_H ⊗ µ^{⊗n}) − R_γ(H, P_{H,S}) ≤ α (exp(|γ|M) − 1)² / (2γ²n).    (104)

Using Theorem 3.13, the final result follows.

In addition to KL-regularized linear risk minimization, the Gibbs posterior is also the solution
to another problem. For this formulation we recall that the α-Rényi divergence between P and Q is
given by R_α(P || Q) := (1/(α − 1)) log ∫_X (dP/dQ)^α dQ, for α ∈ (0, 1) ∪ (1, ∞). We also define the conditional
Rényi divergence between P_{X|Y} and Q_{X|Y} as R_α(P_{X|Y} || Q_{X|Y} | P_Y) := (1/(α − 1)) E_{P_Y}[ log ∫_X (dP_{X|Y}/dQ_{X|Y})^α dQ_{X|Y} ],
for α ∈ (0, 1) ∪ (1, ∞). Here, P_{X|Y} denotes the conditional distribution of X given Y.

Proposition G.1 (Gibbs posterior). Suppose that γ = 1/α − 1 and α ∈ (0, 1) ∪ (1, ∞). Then the
solution to the minimization problem

    P^α_{H|S} = arg inf_{P_{H|S}} { E_{P_S}[ (1/γ) log( E_{P_{H|S}}[ exp( γ R̂(H, S) ) ] ) ] + R_α(P_{H|S} || π_H | P_S) },    (105)

with R̂(H, S) the linear empirical risk (1), is the Gibbs posterior,

    P^α_{H|S} = π_H exp(−γ R̂(H, S)) / E_{π_H}[ exp(−γ R̂(H, S)) ],

where π_H is the prior distribution on the space H of hypotheses.

Proof. Let us consider the following minimization problem,

    find  arg min_{P_Y} { (1/γ) log( E_{P_Y}[exp(γ f(Y))] ) + R_α(P_Y || Q_Y) },    (106)

where γ = 1/α − 1. As shown by Dvijotham and Todorov (2012), the solution to (106) is the Gibbs
posterior,

    P⋆_Y = Q_Y exp(−α f(Y)) / E_{Q_Y}[ exp(−α f(Y)) ].

If α → 1, then γ → 0 and (105) converges to the KL-regularized ERM problem.


The tilted generalization error under the Gibbs posterior can be bounded as follows.

Proposition G.2. Under Assumption 3.1, when training with the Gibbs posterior, (28), the following
upper bound holds on the expected tilted generalization error,

    gen_γ(H, S) ≤ M²α/(2n) − ( Var(exp(γℓ(H, Z))) / (2γ) ) exp(−2γM).    (107)
Proof. Let us consider the following decomposition,
n
1X
genγ (H, S) = R(H, PH ⊗ µ) − EPH,S [ ℓ(H, Zi )]
n
i=1
n (108)
1X
+ EPH,S [ ℓ(H, Zi )] − Rγ (H, PH,S ).
n
i=1

From Aminian et al. (2021a), for the Gibbs posterior we have


h1 X
n i αM 2
R(H, PH ⊗ µ) − EPH,S ℓ(H, Zi ) ≤ .
n 2n
i=1

In addition, using Lemma C.3 for uniform distribution, we have


n
1X1  1 1 X
n 
log exp(γℓ(H, Zi )) − log exp(γℓ(H, Zi ))
n γ γ n
i=1 i=1
−Var(exp(γℓ(H, Z)))
≤ exp(−2γM ).

This completes the proof.

From Proposition G.2, we can observe that for γ > 0 the upper bound on the tilted generalization
error under the Gibbs posterior is tighter than the upper bound on the generalization error of the
Gibbs posterior under linear empirical risk given in Aminian et al. (2021a).
Furthermore, we can provide an upper bound on the absolute value of the expected tilted
generalization error under the Gibbs posterior,

    |gen_γ(H, S)| ≤ M²α/(2n) + ( max(1, exp(−2γM)) / (8|γ|) ) (1 − exp(γM))².    (109)

In (109), choosing γ = O(1/n) we obtain a convergence rate of O(1/n) for the upper bound on the
absolute value of the expected tilted generalization error of the Gibbs posterior.

H Other Bounds
In this section, we provide upper bounds via Rademacher complexity, stability and PAC-Bayesian
approaches. The results are based on the assumption of bounded loss functions (Assumption 3.1).

H.1 Rademacher Complexity


Inspired by the work (Bartlett and Mendelson, 2002), we provide an upper bound on the tilted
generalization error via Rademacher complexity analysis. For this purpose, we need to define the
Rademacher complexity.
As in Bartlett and Mendelson (2002), for a hypothesis set H of functions h : X → Y, the
Rademacher complexity with respect to the dataset S is

    R_S(H) := E_{S,σ}[ sup_{h∈H} (1/n) ∑_{i=1}^n σ_i h(X_i) ],

where σ = {σ_i}_{i=1}^n are i.i.d. Rademacher random variables, i.e., σ_i ∈ {−1, 1} with σ_i = 1 or σ_i = −1,
each with probability 1/2, for i ∈ [n]. The empirical Rademacher complexity R̂_S(H) with respect to S is
defined by

    R̂_S(H) := E_σ[ sup_{h∈H} (1/n) ∑_{i=1}^n σ_i h(X_i) ].    (110)
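The empirical Rademacher complexity (110) of a finite hypothesis class can be estimated by Monte Carlo over the Rademacher signs. The sketch below is purely illustrative: the class of threshold predictors and all constants are arbitrary choices, not objects from the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_draws = 200, 2_000
x = rng.uniform(0, 1, size=n)
# Finite class: threshold predictors h_t(x) = sign(x - t) over a grid of thresholds t
preds = np.sign(x[None, :] - np.linspace(0, 1, 50)[:, None])   # shape (card(H), n)

sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))              # Rademacher variables
# For each sign draw, take the sup over h of (1/n) sum_i sigma_i h(x_i), then average over draws
rad_hat = np.mean(np.max(sigma @ preds.T / n, axis=1))
print(f"Estimated empirical Rademacher complexity: {rad_hat:.4f}")
```

Since h(x_i) ∈ {−1, 0, 1} for this class, B² ≤ n, so Massart's lemma (Lemma C.13) gives R̂_S(H) ≤ √(2 log(50)/n) ≈ 0.20 here, and the Monte Carlo estimate should not exceed this value (up to simulation noise).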

To provide an upper bound on the tilted generalization error, first, we apply the uniform bound,
Lemma C.11, and Talagrand’s contraction lemma (Talagrand, 1996) in order to derive a high-
probability upper bound on the tilted generalization error; we employ the notation (110).

Proposition H.1. Given Assumption 3.1 and assuming the loss function is M_ℓ′-Lipschitz-
continuous in a binary classification problem, the tilted generalization error satisfies, with
probability at least (1 − δ),

    ĝen_γ(h, S) ≤ 2 exp(|γ|M) M_ℓ′ R̂_S(H) + (3(exp(|γ|M) − 1)/|γ|) √( log(1/δ) / (2n) ).

Proof. Note that exp(γM ) ≤ x ≤ 1 for γ < 0 and 1 ≤ x ≤ exp(γM ) for γ > 0. Therefore, we
have the Lipschitz constant exp(−γM ) and 1 for negative and positive γ, respectively. Similarly,
for exp(γx) and 0 < x < M , we have the Lipschitz constants γ and γ exp(γM ), for γ < 0 and γ > 0,

respectively. For γ < 0, we have
gd
enγ (h, S)
1  1 1 X
n

= log EZ∼µ[exp(γℓ(h, Z))] − log exp γℓ(h, Zi )
γ γ n
i=1
1  1 1 X
n

≤ | log EZ∼µ [exp(γℓ(h, Z))] − log exp γℓ(h, Zi ) |
γ γ n
i=1
 X n 
1  1 
≤ log EZ∼µ [exp(γℓ(h, Z))] − log exp γℓ(h, Zi )
|γ| n
i=1
n (111)
(a)exp(−γM ) 1X 
≤ EZ∼µ[exp(γℓ(h, Z))] − exp γℓ(h, Zi )
|γ| n
i=1
r
(b) exp(−γM ) 3 exp(−γM )(1 − exp(γM )) log(1/δ)
≤ 2R̂S (E ◦ L ◦ H) +
|γ| |γ| 2n
r
(c) 3 exp(−γM )(1 − exp(γM )) log(1/δ)
≤ 2 exp(−γM )R̂S (L ◦ H) +
|γ| 2n
r
(d) 3(exp(−γM ) − 1) log(1/δ)
≤ 2 exp(−γM )Mℓ′ R̂S (H) + ,
|γ| 2n
where (a) holds due to the Lipschitzness of log(x) in a bounded interval, (b) holds due to the uniform
bound Lemma C.11, (c) and (d) hold due to Talagrand’s contraction Lemma C.12).
Similarly, we can prove for γ > 0, we have
r
3(exp(γM ) − 1) log(1/δ)
gd
enγ (h, S) ≤ 2 exp(γM )Mℓ′ R̂S (H) + . (112)
γ 2n

Then, we obtain an upper bound on the generalization error by combining Proposition H.1,
Massart’s lemma (Massart, 2000) and Lemma C.3.

Theorem H.2. Under the same assumptions as in Proposition H.1, and assuming a finite hy-
pothesis space, the tilted generalization error satisfies, with probability at least (1 − δ),

    gen_γ(h, S) ≤ ( max(1, exp(−2γM)) / (8|γ|) ) (exp(γM) − 1)² + 2 A(γ) M_ℓ′ B √( 2 log(card(H)) ) / n
                  + (3(A(γ) − 1)/|γ|) √( log(1/δ) / (2n) ),

where A(γ) = exp(|γ|M) and B² = max_{h∈H} ∑_{i=1}^n h²(z_i).

Proof. We consider the following decomposition of the tilted generalization error,

    gen_γ(h, S) = R(h, µ) − R_γ(h, µ^{⊗n}) + R_γ(h, µ^{⊗n}) − R̂_γ(h, S),

where R(h, µ) − R_γ(h, µ^{⊗n}) can be bounded using Proposition 3.2. The second term can be bounded
by using Proposition H.1 and Massart's lemma (Lemma C.13).
Similar to Remark 3.8, assuming γ = O(1/√n), we have a convergence rate of O(1/√n) for
the tilted generalization error. For an infinite hypothesis space, covering number bounds can be
applied to the empirical Rademacher complexity, see, e.g., (Kakade et al., 2008). We note that the
VC-dimension and Rademacher complexity bounds are uniform bounds and are independent of the
learning algorithm.

H.2 A Stability Bound
In this section, we also study the upper bound on the tilted generalization error from the stability
perspective (Bousquet and Elisseeff, 2002b), in which the learning algorithm is a deterministic function of S.
For stability analysis, we define the replace-one sample dataset as
S(i) = {Z1 , · · · , Z̃i , · · · , Zn },
where the sample Zi is replaced by an i.i.d. data sample Z̃i sampled from µ. To distinguish the
hypothesis in the stability approach from the uniform approaches, we consider hs : Z n 7→ H as the
learning algorithm. In the stability approach, the hypothesis is a deterministic function hs (S) of
the dataset. We are interested in providing an upper bound on the expected tilted generalization
error E_{P_S}[ gen_γ(h_s(S), S) ].

Theorem H.3. Under Assumption 3.1, the following upper bound holds with probability at
least (1 − δ) under the distribution P_S,

    E_{P_S}[ gen_γ(h_s(S), S) ] ≤ ( (1 − exp(γM))² / (8|γ|) ) ( 1 + exp(−2γM) ) + exp(|γ|M) E_{P_S, Z̃}[ |ℓ(h_s(S), Z̃) − ℓ(h_s(S^{(i)}), Z̃)| ].    (113)

Proof. We use the following decomposition of the tilted generalization error;


 
EPS genγ (hs (S), S)
 1 
= EPS R(hs (S), µ) − log(EPS ,µ [exp(γℓ(hs (S), Z̃))])
γ
  h1 Xn i
1 1 (114)
+ E PS log(EPS ,µ [exp(γℓ(hs (S), Z̃))]) − log EPS exp(γℓ(hs (S), Zi ))
γ γ n
i=1
  h1 X n i 
1 b
+ E PS log EPS exp(γℓ(hs (S), Zi )) − Rγ (hs (S), S) .
γ n
i=1
Using Lemma C.3, we have
 1 
EPS R(hs (S), µ) − log(EPS ,µ [exp(γℓ(hs (S), Z̃))])
γ
− exp(−2γM )
≤ VarPS ,µ (exp(γℓ(hs (S), Z̃))),

and
  h1 X n i 
1 b
E PS log EPS exp(γℓ(hs (S), Zi )) − Rγ (hs (S), S)
γ n
i=1
 h1 Xn i  1 X n 
1 1
= log EPS exp(γℓ(hs (S), Zi )) − EPS log exp(γℓ(hs (S), Zi ))
γ n γ n
i=1 i=1
1 
≤ Var exp(γℓ(hs (S), Zi )) .

Using the Lipschitz property of the log and exponential functions on a closed interval, we have

1 1  h1 X n i
log(EPS ,µ [exp(γℓ(hs (S), Z̃))]) − log EPS exp(γℓ(hs (S), Zi ))
γ γ n
i=1
1 1  h i
= log(EPS ,µ [exp(γℓ(hs (S), Z̃))]) − log EPS exp(γℓ(hs (S), Zi ))
γ γ
≤ exp(|γ|M )EPS ,µ [|ℓ(hs (S), Z̃) − ℓ(hs (S(i) ), Z̃)|].

Finally, we have
 
EPS genγ (hs (S), S)
1  exp(−2γM )
≤ Var exp(γℓ(hs (S), Zi )) − VarPS ,µ (exp(γℓ(hs (S), Z̃)))
2γ 2γ
+ exp(|γ|M )EPS ,µ [|ℓ(hs (S), Z̃) − ℓ(hs (S(i) ), Z̃)|]
(1 − exp(γM ))2  
≤ 1 + exp(−2γM ) + exp(|γ|M )EPS ,µ [|ℓ(hs (S), Z̃) − ℓ(hs (S(i) ), Z̃)|].

We also consider the uniform stability as in Bousquet and Elisseeff (2002b).

Definition H.4 (Uniform Stability). A learning algorithm is uniformly β-stable with respect to the
loss function if the following holds for all S ∈ Z^n and z̃_i ∈ Z,

    | ℓ(h_s(S), z̃_i) − ℓ(h_s(S^{(i)}), z̃_i) | ≤ β,    i ∈ [n].

Remark H.5 (Uniform Stability). Suppose that the learning algorithm is uniformly β-stable with
respect to a given loss function. Then, using Theorem H.3, we have

    E_{P_S}[ gen_γ(h_s(S), S) ] ≤ ( (1 − exp(γM))² / (8|γ|) ) ( 1 + exp(−2γM) ) + exp(|γ|M) β.    (115)

Note that for a learning algorithm with uniform β-stability where β = O(1/n), choosing γ of
order O(1/n) yields a guarantee on the convergence rate of O(1/n).
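To see the O(1/n) behaviour noted above, one can evaluate the right-hand side of (115) with β = c_β/n and γ = −c_γ/n; the constants c_β and c_γ below are arbitrary illustrative choices.

```python
import numpy as np

M, c_beta, c_gamma = 1.0, 2.0, 1.0            # arbitrary illustrative constants
for n in [100, 1_000, 10_000]:
    gamma, beta = -c_gamma / n, c_beta / n
    bound = (1 - np.exp(gamma * M))**2 / (8 * abs(gamma)) * (1 + np.exp(-2 * gamma * M)) \
            + np.exp(abs(gamma) * M) * beta
    print(f"n={n:>6d}  stability bound = {bound:.3e}  n * bound = {n * bound:.3f}")
```

The product n · bound stays roughly constant, illustrating the O(1/n) rate.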

H.3 A PAC-Bayesian Bound
Inspired by previous works on PAC-Bayesian theory, see, e.g.,(Alquier, 2021; Catoni, 2003), we
derive a high probability bound on the expectation of the tilted generalization error with respect to
the posterior distribution over the hypothesis space.
In the PAC-Bayesian approach, we fix a probability distribution over the hypothesis (parameter)
space as prior distribution, denoted as Qh . Then, we are interested in the generalization performance
under a data-dependent distribution over the hypothesis space, known as posterior distribution,
denoted as ρh .

Theorem H.6. Under Assumption 3.1, the following upper bound holds on the conditional
expected tilted generalization error with probability at least (1 − δ) under the distribution P_S;
for any η > 0,

    | E_{ρ_h}[ gen_γ(H, S) ] | ≤ max(1, exp(−2γM)) (1 − exp(γM))² / (8|γ|) + L η A² / (8n) + L ( KL(ρ_h || Q_h) + log(1/δ) ) / η,    (116)

where Q_h and ρ_h are the prior and posterior distributions over the hypothesis space, respectively.

Proof. We use the following decomposition of the generalization error,

b γ (H, S)].
Eρh [genγ (H, S)] = Eρh [R(H, µ) − Rγ (H, µ) + Rγ (H, µ) − R

The term Eρh [R(H, µ) − Rγ (H, µ)] can be bounded using Lemma C.3. The second term Rγ (H, µ) −
Rb γ (H, S) can be bounded using the Lipschitz property of the log function and Catoni’s bound
(Catoni, 2003).
Remark H.7. Choosing η and γ such that η^{−1} ≍ 1/√n and γ = O(1/√n) results in a theoretical
guarantee on the convergence rate of O(1/√n).

