Lecture 14
ERM, Uniform Convergence
Sasha Rakhlin
Oct 24, 2019
Outline
ERM
Uniform Deviations
Recall that we proved

    E L(wT) ≤ (1/(n+1)) · (D²/γ²)

for the last hyperplane wT of Perceptron (cycled until no more mistakes)
under the assumption that there exists w∗ with ∥w∗∥ = 1 such that
Y⟨w∗, X⟩ ≥ γ and D ≥ ∥X∥ almost surely.

This is a result about a particular minimizer of the empirical loss L̂01(w).
Full gradient for i.i.d. data
Suppose that instead of multi-pass Perceptron, we run full gradient descent
on the empirical objective

    L̂(w) = (1/n) ∑_{t=1}^n max{−Yt⟨w, Xt⟩, 0}

NB: the hinge-at-zero loss max{−y⟨w, x⟩, 0} is not a surrogate loss for the
indicator loss, but its minimizer does enjoy zero indicator loss on this
(separable) data.

If all we know is that w minimizes the empirical loss L̂(w) (but w was not
necessarily obtained via Perceptron), what can we do?
Beyond hyperplanes with margin, what can we say about expected loss of
an empirical minimizer?
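The objective above can be minimized numerically. Below is a minimal sketch (not from the lecture) of full subgradient descent on the hinge-at-zero objective, on synthetic separable data with margin; the dataset, step size, and iteration count are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic separable data with margin (illustrative): labels given by a
# unit-norm w*, keeping only points with |<w*, x>| >= 0.5.
d, n = 5, 200
w_star = np.zeros(d)
w_star[0] = 1.0
X_all = rng.normal(size=(1000, d))
m = X_all @ w_star
keep = np.abs(m) >= 0.5
X, Y = X_all[keep][:n], np.sign(m[keep])[:n]

def hinge_at_zero(w):
    # empirical objective: (1/n) sum_t max{-Y_t <w, X_t>, 0}
    return np.mean(np.maximum(-Y * (X @ w), 0.0))

def subgradient(w):
    # one valid subgradient: average of -Y_t X_t over points with Y_t <w, X_t> <= 0
    active = (Y * (X @ w)) <= 0
    g = -(Y[:, None] * X)
    g[~active] = 0.0
    return g.mean(axis=0)

w = np.zeros(d)
for _ in range(2000):
    w -= 1.0 * subgradient(w)

train_01_error = np.mean(Y * (X @ w) <= 0)  # indicator loss of the minimizer
```

Note the subgradient choice at the kink (active set defined by ≤ 0 rather than < 0): with the strict inequality, the iteration would be stuck at w = 0.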
It is useful to consider the function

    w ↦ (1/n) ∑_{t=1}^n I{Yt⟨w, Xt⟩ ≤ 0}
as w varies (but fixing the data). For example, values of this function over
the sphere ∥w∥ = 1 look like [draw in class], while the expected error looks
like [draw in class].
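A quick 2-D simulation shows how jagged this empirical profile is compared to the smooth expected error. The setup below is an invented toy: rotationally symmetric Gaussian inputs, for which the expected error of a unit vector w is angle(w, w∗)/π.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
w_star = np.array([1.0, 0.0])
X = rng.normal(size=(n, 2))   # rotationally symmetric inputs
Y = np.sign(X @ w_star)       # noiseless labels from w*

# parameterize the sphere ||w|| = 1 by the angle theta
thetas = np.linspace(-np.pi, np.pi, 361)
W = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)

# empirical error profile: piecewise constant, jagged in theta
empirical = (Y[None, :] * (W @ X.T) <= 0).mean(axis=1)

# expected error profile: for symmetric X, it equals angle(w, w*)/pi -- smooth
expected = np.abs(thetas) / np.pi
```

The two arrays trace the "draw in class" picture: a jagged red curve (empirical) fluctuating around a smooth blue one (expected).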
If we unroll this picture of L(w) and L̂(w), it would look like this:

[Figure: expected error (blue) and empirical error (red) as curves over w;
the empirical curve is zero on a region of w's (due to margin in the data),
and the expected curve is zero on a region (due to margin in P).]
Blue curve is fixed (given P) while red curve changes according to a draw of
data.
Interpret our claim about last-step Perceptron as: there is a choice of
empirical minimizer (L̂(wT) = 0) such that in expectation (over the draw of
data) its out-of-sample performance (blue curve) is O(1/n).
Do we expect any minimizer of the empirical profile to have a small
expected error?
Let’s make things more general. Fix some loss function ℓ and a class F of
functions X → Y. The Empirical Risk Minimization (ERM) algorithm is

    f̂n = argmin_{f∈F} L̂(f)

In the linear case,

    F = {x ↦ ⟨w, x⟩ : ∥w∥ ≤ 1}
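As a concrete instance, here is ERM over a finite class of threshold functions on [0, 1] with the 0-1 loss. The class, grid, and data distribution are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Finite class F: threshold classifiers x -> 1{x >= t} over a grid of t's
thresholds = np.linspace(0.0, 1.0, 51)

# Sample: X ~ Unif[0,1], noiseless labels from the true threshold 0.3
n = 200
x = rng.uniform(size=n)
y = (x >= 0.3).astype(float)

# ERM: pick the f in F minimizing the empirical 0-1 risk
emp_risk = np.array([np.mean((x >= t).astype(float) != y) for t in thresholds])
f_hat_threshold = thresholds[np.argmin(emp_risk)]
```

With noiseless data and the true threshold on the grid, the ERM attains zero empirical risk and recovers a threshold near 0.3.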
Performance of ERM
If f̂n is an ERM,

    L(f̂n) − L(fF) = {L(f̂n) − L̂(f̂n)} + {L̂(f̂n) − L̂(fF)} + {L̂(fF) − L(fF)}
                  ≤ {L(f̂n) − L̂(f̂n)} + {L̂(fF) − L(fF)}

because the second term is nonpositive (f̂n minimizes L̂).

Now take expectations on both sides and observe that the second term is zero
in expectation:

    ES[L̂(fF)] − L(fF) = 0

So, the estimation error satisfies

    ES[L(f̂n)] − L(fF) ≤ ES[L(f̂n) − L̂(f̂n)]
Let’s do that last step slowly: for any fixed (data-independent) function f
(including fF),

    ES{L̂(f)} = ES{(1/n) ∑_{i=1}^n ℓ(f(Xi), Yi)}
             = (1/n) ∑_{i=1}^n ES{ℓ(f(Xi), Yi)}
             = (1/n) ∑_{i=1}^n E_{(X,Y)}{ℓ(f(X), Y)}
             = L(f)
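This unbiasedness is easy to check by Monte Carlo: fix a data-independent classifier, draw many fresh samples, and average the empirical risks. The distribution and classifier below are invented so that the true risk is exactly 0.2.

```python
import numpy as np

rng = np.random.default_rng(3)

# Fixed, data-independent classifier on a known distribution:
# X ~ Unif[0,1], Y = 1{X >= 0.3}; f thresholds at 0.5.
# True risk L(f) = P(f(X) != Y) = P(0.3 <= X < 0.5) = 0.2 exactly.
def f(x):
    return (x >= 0.5).astype(float)

n, trials = 50, 2000
emp_risks = np.empty(trials)
for t in range(trials):
    x = rng.uniform(size=n)          # a fresh sample S each trial
    y = (x >= 0.3).astype(float)
    emp_risks[t] = np.mean(f(x) != y)   # hat-L(f) on S

mean_emp_risk = emp_risks.mean()     # approximates E_S[hat-L(f)] = L(f) = 0.2
```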
Wait, Are We Done?
Can’t we also apply the above calculation to show that

    ES[L(f̂n) − L̂(f̂n)]

is zero? No, because f̂n is data-dependent:

    ES{L̂(f̂n)} = ES{(1/n) ∑_{i=1}^n ℓ(f̂n(Xi), Yi)}
              = (1/n) ∑_{i=1}^n ES{ℓ(f̂n(Xi), Yi)}
              = ES{ℓ(f̂n(Xi), Yi)}
              ≠ ES,(X,Y){ℓ(f̂n(X), Y)}

The next-to-last term is “in sample” while the expected loss is “out of sample.”
We say that ℓ(f̂n(Xi), Yi) is a biased estimate of E ℓ(f̂n(X), Y).

How bad can this bias be?
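The bias can be exhibited numerically: run ERM over threshold classifiers on noisy data and compare the in-sample risk of f̂n to its true risk. The setup is invented for illustration; with label noise η, the true risk of threshold t works out to η + (1 − 2η)|t − 0.3|.

```python
import numpy as np

rng = np.random.default_rng(4)

# X ~ Unif[0,1], clean label 1{x >= 0.3}, flipped with probability eta.
# True risk of threshold t: L(t) = eta + (1 - 2*eta) * |t - 0.3|.
eta = 0.3
thresholds = np.linspace(0.0, 1.0, 201)
n, trials = 30, 500

gaps = np.empty(trials)
for k in range(trials):
    x = rng.uniform(size=n)
    y = (x >= 0.3).astype(float)
    flip = rng.uniform(size=n) < eta
    y[flip] = 1.0 - y[flip]
    # ERM over the threshold class
    emp = np.array([np.mean((x >= t).astype(float) != y) for t in thresholds])
    i = np.argmin(emp)
    true_risk = eta + (1 - 2 * eta) * abs(thresholds[i] - 0.3)
    gaps[k] = true_risk - emp[i]     # L(f_hat) - hat-L(f_hat)

mean_gap = gaps.mean()   # positive on average: the in-sample estimate is biased down
```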
Example
▸ X = [0, 1], Y = {0, 1}
▸ ℓ(f(Xi), Yi) = I{f(Xi) ≠ Yi}
▸ distribution P = Px × P_{y|x} with Px = Unif[0, 1] and P_{y|x} = δ_{y=1}
▸ function class

    F = ∪_{n∈N} {f = fS : S ⊂ X, |S| = n, fS(x) = I{x ∈ S}}

The ERM f̂n memorizes (perfectly fits) the data, but has no ability to
generalize. Observe that

    0 = E ℓ(f̂n(Xi), Yi) ≠ E ℓ(f̂n(X), Y) = 1
This phenomenon is called overfitting (though the term is vague and used
in a variety of ways).
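The memorization example above can be run directly; a short sketch (the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)

# P_x = Unif[0,1], P_{y|x} puts all mass on y = 1
n = 100
x_train = rng.uniform(size=n)
y_train = np.ones(n)

memorized = set(x_train.tolist())    # ERM f_S: indicator of the training points

def f_hat(x):
    return np.array([1.0 if xi in memorized else 0.0 for xi in x])

train_error = np.mean(f_hat(x_train) != y_train)   # perfect fit on the sample
x_fresh = rng.uniform(size=1000)
test_error = np.mean(f_hat(x_fresh) != 1.0)        # fresh draws are never in S
```

With probability one, fresh uniform draws never coincide with memorized points, so the training error is 0 while the test error is 1.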
Example
NB: we will see later in the course that memorization methods can
generalize, so in the previous example it was not just the memorization part
that was problematic, but the overall definition of f̂n .
In fact, we already saw that perfectly fitting the data (at least in the sense
of perfectly separating the dataset) did not prevent Perceptron’s last output
from generalizing well.
Example
Where do we go from here? Two approaches:
1. uniform deviations (remove the hat altogether)
2. find properties of algorithms that limit in some way the bias of
ℓ(f̂n(Xi), Yi). Stability, differential privacy, and compression are such
approaches.
Removing the Hat
Recall: the difficulty in bounding the generalization gap was that f̂n
depends on the data. The key trick here is to remove the dependence by
“maxing out”:
    E[L(f̂n) − L̂(f̂n)] ≤ E max_{f∈F} [L(f) − L̂(f)]
For this inequality to hold, we only need to know that f̂n takes values in F
(does not have to be ERM).
If we have some extra knowledge on the location of f̂n , we should take
supremum over that (ideally, data-independent) subset of F. This is called
“localized analysis.”
Uniform Deviations
We first focus on understanding

    E max_{f∈F} {E_{X,Y} ℓ(f(X), Y) − (1/n) ∑_{i=1}^n ℓ(f(Xi), Yi)}

If F = {f0} consists of a single function, then the above is 0. However, if
F = {f0, f1}, the above is O(1/√n) as soon as f0 and f1 are “different
enough.”
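The two-function claim can be checked by Monte Carlo. Take g0(z) = 1{z < 1/2} and g1 = 1 − g0 on z ~ Unif[0,1] (maximally “different,” both with mean 1/2, an invented toy pair): the expected maximal deviation shrinks like 1/√n.

```python
import numpy as np

rng = np.random.default_rng(6)

# Loss class of two functions of z ~ Unif[0,1]:
# g0(z) = 1{z < 1/2}, g1 = 1 - g0; both have Eg = 1/2.
def expected_max_deviation(n, trials=4000):
    z = rng.uniform(size=(trials, n))
    g0_bar = (z < 0.5).mean(axis=1)   # empirical mean of g0
    g1_bar = 1.0 - g0_bar             # empirical mean of g1
    dev = np.maximum(0.5 - g0_bar, 0.5 - g1_bar)   # max over {g0, g1} of Eg - mean
    return dev.mean()

d_100 = expected_max_deviation(100)
d_400 = expected_max_deviation(400)
# quadrupling n should roughly halve the expected maximal deviation
```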
A bit of notation to simplify things...
To ease the notation,
▸ Let zi = (xi, yi), so that the training data is {z1, ..., zn}
▸ g(z) = ℓ(f(x), y) for z = (x, y)
▸ Loss class G = {g : g(z) = ℓ(f(x), y), f ∈ F} = ℓ ∘ F
▸ ĝn = ℓ(f̂n(·), ·), gG = ℓ(fF(·), ·)
▸ g∗ = argmin_g E g(z) = ℓ(f∗(·), ·) is the Bayes optimal (loss) function

We can now work with the set G, but keep in mind that each g ∈ G
corresponds to an f ∈ F:

    g ∈ G ←→ f ∈ F

Once again, the quantity of interest is

    max_{g∈G} {E g(z) − (1/n) ∑_{i=1}^n g(zi)}

On the next slide, we visualize the deviations E g(z) − (1/n) ∑_{i=1}^n g(zi)
for all possible functions g and discuss all the concepts introduced so far.
Empirical Process Viewpoint

[Figure, built up over several overlays: over an axis of “all functions,”
the expected loss E g and the empirical average (1/n) ∑_{i=1}^n g(zi) are
drawn as two curves above a zero line; marked points are the Bayes-optimal
g∗, the best-in-class gG, and the empirical minimizer ĝn, with the loss
class G a subset of all functions.]
Empirical Process Viewpoint
A stochastic process is a collection of random variables indexed by some set.
An empirical process is the stochastic process

    {E g(z) − (1/n) ∑_{i=1}^n g(zi)}_{g∈G}

indexed by a function class G.

Uniform Law of Large Numbers:

    sup_{g∈G} |E g − (1/n) ∑_{i=1}^n g(zi)| → 0

in probability.

Key question: how “big” can G be for the supremum of the empirical
process to still be manageable?
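For a class where the uniform LLN does hold, the supremum can be simulated. The sketch below uses threshold indicators g_t(z) = I{z ≤ t} on z ~ Unif[0,1] (the classical Glivenko–Cantelli setting, restricted to a grid of thresholds for simplicity):

```python
import numpy as np

rng = np.random.default_rng(7)
thresholds = np.linspace(0.0, 1.0, 101)

def sup_deviation(n):
    z = rng.uniform(size=n)
    # g_t(z) = 1{z <= t} has Eg_t = t under Unif[0,1]
    emp_means = (z[None, :] <= thresholds[:, None]).mean(axis=1)
    return np.abs(thresholds - emp_means).max()   # sup_g |Eg - empirical mean|

small_n = np.mean([sup_deviation(50) for _ in range(200)])
large_n = np.mean([sup_deviation(5000) for _ in range(200)])
# the supremum shrinks as n grows: the uniform LLN holds for this class
```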
Important distinction:
A take-away message is that the following two statements are worlds apart:

    with probability at least 1 − δ: for all g ∈ G, E g − (1/n) ∑_{i=1}^n g(zi) ≤ ...

vs

    for each g ∈ G: with probability at least 1 − δ, E g − (1/n) ∑_{i=1}^n g(zi) ≤ ...

The second statement follows from the CLT, while the first is often
difficult to obtain and only holds for some G.
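The memorization class from the earlier example separates the two statements numerically: for each fixed g the deviation vanishes, yet the supremum over the class stays at 1 on every sample. A sketch under the same invented setup (y = 1 always, g_S(z) = 1{x ∉ S}):

```python
import numpy as np

rng = np.random.default_rng(8)

# Loss class of the memorization example: for a finite set S,
# g_S(z) = 1{x not in S} (since y = 1 always), so E g_S = 1 for every fixed S.
n, trials = 100, 200

fixed_S = set(rng.uniform(size=n).tolist())   # one fixed g, chosen before the data

per_g_dev, sup_dev = [], []
for _ in range(trials):
    x = rng.uniform(size=n)
    # fixed g: fresh draws are never in fixed_S, so empirical mean is 1, deviation 0
    per_g_dev.append(1.0 - np.mean([xi not in fixed_S for xi in x]))
    # sup over the class: S = the sample itself gives empirical mean 0, deviation 1
    S_hat = set(x.tolist())
    sup_dev.append(1.0 - np.mean([xi not in S_hat for xi in x]))

fixed_dev = np.mean(per_g_dev)    # ~ 0: the "for each g" statement holds
uniform_dev = np.mean(sup_dev)    # = 1: no uniform bound is possible for this G
```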