Lecture 14

ERM, Uniform Convergence

Sasha Rakhlin

Oct 24, 2019

Outline

ERM

Uniform Deviations

Recall that we proved

    E L(wT) ≤ (1/(n+1)) · (D²/γ²)

for the last hyperplane wT of Perceptron (cycled until no more mistakes),
under the assumption that there exists w∗ with ∥w∗∥ = 1 such that
Y⟨w∗, X⟩ ≥ γ and ∥X∥ ≤ D almost surely.

This is a result about a particular minimizer of the empirical loss L̂01(w).

Full gradient for i.i.d. data

Suppose that instead of multi-pass Perceptron, we run full gradient descent
on the empirical objective

    L̂(w) = (1/n) ∑_{t=1}^n max{−Yt⟨w, Xt⟩, 0}

NB: the hinge-at-zero loss max{−y⟨w, x⟩, 0} is not a surrogate loss for the
indicator loss, but its minimizer does enjoy zero indicator loss.

If all we know is that w minimizes the empirical loss L̂(w) (but was not
necessarily obtained via Perceptron), what can we do?

Beyond hyperplanes with margin, what can we say about the expected loss of
an empirical minimizer?
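
A minimal sketch of the full-gradient scheme above (function names and step
size are illustrative, not from the lecture); since the objective is
piecewise linear, the "gradient" is really a subgradient:

```python
import numpy as np

def hinge_at_zero(w, X, Y):
    """Empirical objective: (1/n) * sum_t max{-Y_t <w, X_t>, 0}."""
    return np.mean(np.maximum(-Y * (X @ w), 0.0))

def full_subgradient_descent(X, Y, steps=500, lr=0.1):
    """Full-batch (sub)gradient descent on the hinge-at-zero objective.

    X: (n, d) array of features; Y: (n,) array of labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        # Examples with Y_t <w, X_t> <= 0 contribute subgradient -Y_t * X_t.
        active = Y * (X @ w) <= 0
        grad = -(Y[active, None] * X[active]).sum(axis=0) / n
        w -= lr * grad
    return w
```

Each step averages the per-example subgradients over the whole sample, in
contrast to Perceptron's one-mistake-at-a-time updates.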

It is useful to consider the function

    w ↦ (1/n) ∑_{t=1}^n I{Yt⟨w, Xt⟩ ≤ 0}

as w varies (but fixing the data). For example, values of this function over
the sphere ∥w∥ = 1 look like [draw in class], while the expected error looks
like [draw in class].
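
A sketch of what one might draw: in two dimensions, parametrize the sphere
∥w∥ = 1 by an angle θ and trace the empirical 0-1 error (the distribution
below is an illustrative choice, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative P: X uniform on the unit circle, labels given by w* = (1, 0),
# with a band removed so the margin condition Y <w*, X> >= gamma holds.
angles = rng.uniform(0, 2 * np.pi, size=200)
X = np.column_stack([np.cos(angles), np.sin(angles)])
Y = np.sign(X[:, 0])
keep = np.abs(X[:, 0]) > 0.2          # enforce the margin gamma = 0.2
X, Y = X[keep], Y[keep]

# Empirical 0-1 error over the sphere w(theta) = (cos t, sin t):
# a jagged, piecewise-constant curve, exactly zero in a neighborhood of w*.
thetas = np.linspace(-np.pi, np.pi, 500)
ws = np.column_stack([np.cos(thetas), np.sin(thetas)])
emp_err = (Y[None, :] * (ws @ X.T) <= 0).mean(axis=1)
```

The empirical curve is jagged and changes with the sample; the expected
error, by contrast, is a fixed curve of θ that is zero only near w∗.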

If we unroll this picture of L(w) and L̂(w), it would look like this:

[Figure: the empirical error and the expected error plotted as curves over w.
Labels: "zero empirical error (due to margin in data)" and "zero expected
error (due to margin in P)".]

Blue curve is fixed (given P) while red curve changes according to a draw of
data.

Interpret our claim about the last-step Perceptron as: there is a choice of
empirical minimizer (L̂(wT) = 0) such that in expectation (over the draw of
data) its out-of-sample performance (blue curve) is O(1/n).

Do we expect any minimizer of the empirical profile to have a small
expected error?

Let's make things more general. Fix some loss function ℓ and a class F of
functions X → Y. The Empirical Risk Minimization (ERM) algorithm is

    f̂n = argmin_{f∈F} L̂(f)

In the linear case,

    F = {x ↦ ⟨w, x⟩ : ∥w∥ ≤ 1}
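
As a sketch, ERM over a finite class is a one-liner; a finite candidate set
stands in for F here (this class, the data, and all names are illustrative),
since argmin over a general class is an abstract operation:

```python
import numpy as np

def empirical_loss(f, data, loss):
    """L-hat(f) = (1/n) * sum_i loss(f(x_i), y_i)."""
    return np.mean([loss(f(x), y) for x, y in data])

def erm(F, data, loss):
    """f-hat_n: a minimizer of the empirical loss over the class F."""
    return min(F, key=lambda f: empirical_loss(f, data, loss))

# Usage with the 0-1 loss and a class of thresholds f_c(x) = 1{x >= c}:
zero_one = lambda y_hat, y: float(y_hat != y)
F = [lambda x, c=c: float(x >= c) for c in np.linspace(0.0, 1.0, 11)]
rng = np.random.default_rng(0)
data = [(x, float(x >= 0.3)) for x in rng.uniform(size=50)]
f_hat = erm(F, data, zero_one)
```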

Performance of ERM

If f̂n is an ERM and fF ∈ argmin_{f∈F} L(f) is the best function in the class,

    L(f̂n) − L(fF) = {L(f̂n) − L̂(f̂n)} + {L̂(f̂n) − L̂(fF)} + {L̂(fF) − L(fF)}
                  ≤ {L(f̂n) − L̂(f̂n)} + {L̂(fF) − L(fF)}

because the middle term is nonpositive: f̂n minimizes L̂, so L̂(f̂n) ≤ L̂(fF).

Now take expectation on both sides and observe that the remaining term
{L̂(fF) − L(fF)} is zero in expectation:

    ES[L̂(fF)] − L(fF) = 0

So, the estimation error satisfies

    ES[L(f̂n)] − L(fF) ≤ ES[L(f̂n) − L̂(f̂n)]

Let's do that last step slowly: for any fixed (data-independent) function f
(including fF),

    ES[L̂(f)] = ES[(1/n) ∑_{i=1}^n ℓ(f(Xi), Yi)]
             = (1/n) ∑_{i=1}^n ES[ℓ(f(Xi), Yi)]
             = (1/n) ∑_{i=1}^n E(X,Y)[ℓ(f(X), Y)]
             = L(f)
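
A quick numerical check of this identity (the distribution and the fixed
classifier below are illustrative): averaging L̂(f) over many independent
training sets recovers L(f).

```python
import numpy as np

rng = np.random.default_rng(1)

f = lambda x: float(x >= 0.5)        # fixed, data-independent classifier
# Illustrative P: X ~ Unif[0,1], Y = 1{X >= 0.3}; f errs iff 0.3 <= X < 0.5,
# so L(f) = 0.2.
def empirical_loss(n):
    x = rng.uniform(size=n)
    y = (x >= 0.3).astype(float)
    return np.mean([f(xi) != yi for xi, yi in zip(x, y)])

print(np.mean([empirical_loss(50) for _ in range(5000)]))   # ~ 0.2 = L(f)
```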

Wait, Are We Done?

Can't we also apply the above calculation to show that

    ES[L(f̂n) − L̂(f̂n)]

is zero? No, because f̂n is data-dependent:

    ES[L̂(f̂n)] = ES[(1/n) ∑_{i=1}^n ℓ(f̂n(Xi), Yi)]
              = (1/n) ∑_{i=1}^n ES[ℓ(f̂n(Xi), Yi)]
              = ES[ℓ(f̂n(Xi), Yi)]        (each summand has the same expectation)
              ≠ ES,(X,Y)[ℓ(f̂n(X), Y)]

The next-to-last term is "in sample" while the expected loss is "out of
sample." We say that ℓ(f̂n(Xi), Yi) is a biased estimate of E ℓ(f̂n(X), Y).

How bad can this bias be?

Example
▸ X = [0, 1], Y = {0, 1}
▸ ℓ(f(Xi), Yi) = I{f(Xi) ≠ Yi}
▸ distribution P = Px × Py∣x with Px = Unif[0, 1] and Py∣x = δy=1
▸ function class

    F = ∪_{m∈N} {f = fS : S ⊂ X, ∣S∣ = m, fS(x) = I{x ∈ S}}

[Figure: the interval [0, 1].]

The ERM f̂n memorizes (perfectly fits) the data, but has no ability to
generalize. Observe that

    0 = E ℓ(f̂n(Xi), Yi) ≠ E ℓ(f̂n(X), Y) = 1

This phenomenon is called overfitting (though the term is vague and used
in a variety of ways).
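
This example runs almost verbatim as code (a minimal sketch): the memorizing
ERM is perfect in sample and useless out of sample.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X_train = rng.uniform(size=n)          # Px = Unif[0, 1]
Y_train = np.ones(n)                   # Py|x = delta_{y=1}

S = set(X_train)                       # f-hat_n = f_S with S = the sample
f_hat = lambda x: float(x in S)        # f_S(x) = 1{x in S}

in_sample = np.mean([f_hat(x) != y for x, y in zip(X_train, Y_train)])
X_test = rng.uniform(size=10_000)      # a fresh X is almost surely not in S
out_of_sample = np.mean([f_hat(x) != 1.0 for x in X_test])
print(in_sample, out_of_sample)        # 0.0 and 1.0
```
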
Example

NB: we will see later in the course that memorization methods can
generalize, so in the previous example it was not just the memorization
that was problematic, but the overall definition of f̂n.

In fact, we already saw that perfectly fitting the data (at least in terms
of perfectly separating the dataset) did not prevent Perceptron's last
output from having good generalization.

Where do we go from here? Two approaches:

1. uniform deviations (remove the hat altogether)
2. find properties of algorithms that limit in some way the bias of
   ℓ(f̂n(Xi), Yi). Stability, differential privacy, and compression are such
   approaches.

Outline

ERM

Uniform Deviations

Removing the Hat

Recall: the difficulty in bounding the generalization gap was that f̂n
depends on the data. The key trick here is to remove the dependence by
"maxing out":

    E[L(f̂n) − L̂(f̂n)] ≤ E max_{f∈F} [L(f) − L̂(f)]

For this inequality to hold, we only need to know that f̂n takes values in F
(it does not have to be an ERM).

If we have some extra knowledge about the location of f̂n, we should take the
supremum over that (ideally, data-independent) subset of F. This is called
"localized analysis."

Uniform Deviations

We first focus on understanding

    E max_{f∈F} {E(X,Y) ℓ(f(X), Y) − (1/n) ∑_{i=1}^n ℓ(f(Xi), Yi)}.

If F = {f0} consists of a single function, then the above is 0. However, if
F = {f0, f1}, the above is O(1/√n) as soon as f0 and f1 are "different
enough."
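
A Monte Carlo sketch of the two-function case (g0 and g1 below are arbitrary
bounded functions, chosen for illustration): the expected maximum deviation
shrinks roughly by half each time n is quadrupled, consistent with O(1/√n).

```python
import numpy as np

rng = np.random.default_rng(3)
# Two fixed loss functions of z ~ Unif[0, 1], with known expectations.
g0 = lambda z: (z > 0.5).astype(float)   # E g0 = 0.5
g1 = lambda z: (z > 0.8).astype(float)   # E g1 = 0.2

for n in (100, 400, 1600, 6400):
    devs = []
    for _ in range(2000):
        z = rng.uniform(size=n)
        devs.append(max(0.5 - g0(z).mean(), 0.2 - g1(z).mean()))
    print(n, np.mean(devs))              # roughly halves as n quadruples
```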

A bit of notation to simplify things...

To ease the notation,
▸ let zi = (xi, yi) so that the training data is {z1, . . . , zn}
▸ g(z) = ℓ(f(x), y) for z = (x, y)
▸ loss class G = {g : g(z) = ℓ(f(x), y), f ∈ F} = ℓ ○ F
▸ ĝn = ℓ(f̂n(⋅), ⋅), gG = ℓ(fF(⋅), ⋅)
▸ g∗ = argmin_g E g(z) = ℓ(f∗(⋅), ⋅) is the Bayes optimal (loss) function

We can now work with the set G, but keep in mind that each g ∈ G
corresponds to an f ∈ F:

    g ∈ G ←→ f ∈ F

Once again, the quantity of interest is

    max_{g∈G} {E g(z) − (1/n) ∑_{i=1}^n g(zi)}

Below, we visualize the deviations E g(z) − (1/n) ∑_{i=1}^n g(zi) for all
possible functions g and discuss all the concepts introduced so far.

Empirical Process Viewpoint

[Figure, built up over several slides: E g and the empirical average
(1/n) ∑_{i=1}^n g(zi) plotted over "all functions", with the Bayes optimal
g∗, the class G, the best-in-class gG, and the empirical minimizer ĝn
marked along the horizontal axis.]

Empirical Process Viewpoint

A stochastic process is a collection of random variables indexed by some set.
An empirical process is a stochastic process

    {E g(z) − (1/n) ∑_{i=1}^n g(zi)}_{g∈G}

indexed by a function class G.

Uniform Law of Large Numbers:

    sup_{g∈G} ∣E g − (1/n) ∑_{i=1}^n g(zi)∣ → 0

in probability.

Key question: How “big” can G be for the supremum of the empirical
process to still be manageable?

Important distinction:

A take-away message is that the following two statements are worlds apart:

    with probability at least 1 − δ, for any g ∈ G:  E g − (1/n) ∑_{i=1}^n g(zi) ≤ ε

vs

    for any g ∈ G, with probability at least 1 − δ:  E g − (1/n) ∑_{i=1}^n g(zi) ≤ ε

The second statement follows from the CLT, while the first statement is often
difficult to obtain and only holds for some G.
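
The memorization class from the earlier example makes the gap concrete (a
sketch reusing that setup): any fixed g has deviation tending to zero, yet
for every sample some g ∈ G attains deviation one, so the supremum never
shrinks.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x = rng.uniform(size=n)                # sample; y = 1 identically as before

# Fixed g = loss of f_S for a fixed finite S: g(z) = 1{x not in S}, E g = 1.
S_fixed = {0.1, 0.2, 0.3}
dev_fixed = 1.0 - np.mean([xi not in S_fixed for xi in x])
print(dev_fixed)                       # 0.0: the second statement holds

# Sup over G: choose S = the sample itself; empirical average is 0, E g = 1.
S_data = set(x)
dev_sup = 1.0 - np.mean([xi not in S_data for xi in x])
print(dev_sup)                         # 1.0 for every n: the first one fails
```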
