
Understanding Empirical Risk Minimization

The document discusses the relationship between probability, learning, and hypothesis testing, emphasizing the importance of empirical risk minimization (ERM) in estimating functions. It outlines how the law of large numbers can help verify hypotheses, while also addressing challenges posed by multiple hypotheses and the need for confidence bounds. The text concludes with a discussion on the tradeoff between the richness of hypothesis sets and the guarantees of learning performance.

Uploaded by

Mark Davenport

Another approach

You watch me do this trick a couple times and notice I always hand out 5 cards

Suppose you instead consider

Now, can you learn a function $h$ such that $h(x)$ is a reliable predictor of $f(x)$?

Probability to the rescue!
Any $f$ agreeing with the training data may be possible,
but that does not mean that any $f$ is equally probable

A short digression
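• Suppose that Javier has a biased coin, which lands on heads with some unknown probability $p$
– $\mathbb{P}[\text{heads}] = p$
– $\mathbb{P}[\text{tails}] = 1 - p$
• Javier tosses the coin $n$ times
– $\hat{p} = \frac{\text{number of heads}}{n}$

Does $\hat{p}$ tell us anything about $p$?

What can we learn from $\hat{p}$?
Given enough tosses (large $n$), we expect that $\hat{p} \approx p$

Law of large numbers: $\hat{p} \to p$ as $n \to \infty$

Clearly, at least in a very limited sense, we can learn something about $p$ from observations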

There is always the possibility that we are totally wrong, but given enough data,
the probability should be very small
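
The coin-toss claim is easy to check numerically. The sketch below is only an illustration: the bias `p_true = 0.3`, the function name, and the seed are assumptions made for the example, not values from the slides.

```python
import random

def estimate_heads_probability(p, n, seed=0):
    """Toss a coin with heads-probability p a total of n times; return the fraction of heads."""
    rng = random.Random(seed)
    heads = sum(1 for _ in range(n) if rng.random() < p)
    return heads / n

p_true = 0.3  # hypothetical bias, unknown to the observer in the story
for n in (10, 100, 10_000, 100_000):
    p_hat = estimate_heads_probability(p_true, n)
    print(f"n = {n:>7}: p_hat = {p_hat:.4f}, |p_hat - p| = {abs(p_hat - p_true):.4f}")
```

For small `n` the estimate can be far off; for large `n` it is typically close to the true bias, which is the behavior the law of large numbers describes.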
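
Connection to learning
Coin tosses: We want to estimate $p$ (i.e., predict how likely a “heads” is)
Learning: We want to estimate a function $f : \mathcal{X} \to \mathcal{Y}$

Suppose we have a hypothesis $h$ and that $\mathcal{Y}$ is discrete

Think of the $(x_i, y_i)$ as a series of independent coin tosses, where the $x_i$ are drawn from a probability distribution
– heads: our hypothesis is correct, i.e., $h(x_i) = y_i$
– tails: our hypothesis is wrong, i.e., $h(x_i) \neq y_i$

Define
(Population) risk: $R(h) := \mathbb{P}[h(X) \neq Y]$

Empirical risk: $\hat{R}_n(h) := \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{h(x_i) \neq y_i\}$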
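
As a minimal sketch, the empirical risk is just the fraction of training examples a hypothesis gets wrong. The toy data and the hypothesis `sign_hypothesis` below are made up for illustration.

```python
def empirical_risk(h, data):
    """Fraction of labelled examples (x, y) in data on which h(x) != y."""
    mistakes = sum(1 for x, y in data if h(x) != y)
    return mistakes / len(data)

# Hypothetical toy data: x is a number, y is its label in {-1, +1}.
data = [(-2.0, -1), (-0.5, -1), (0.3, +1), (1.2, +1), (0.1, -1)]

def sign_hypothesis(x):
    """A hypothetical hypothesis: predict +1 for positive x, otherwise -1."""
    return 1 if x > 0 else -1

print(empirical_risk(sign_hypothesis, data))  # one mistake out of five examples: 0.2
```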

Trust, but verify
The law of large numbers guarantees that as long as we have enough data, we will have that $\hat{R}_n(h) \approx R(h)$

This means that we can use the empirical risk $\hat{R}_n(h)$ to verify whether $h$ was a good hypothesis

Unfortunately, verification is not learning

• Where did $h$ come from?

• What if $R(h)$ is large?
• How do we know if $h = f$, or at least, if $h \approx f$?
• Given many possible hypotheses, how can we pick a good one?

From coins to learning
Consider an ensemble of many hypotheses $\mathcal{H} = \{h_1, \ldots, h_m\}$

If we fix a hypothesis $h_j$ before drawing our data, then the law of large numbers tells us that $\hat{R}_n(h_j) \to R(h_j)$ as $n \to \infty$

However, it is also true that for a fixed $n$, if $m$ is large it can still be very likely that there is some hypothesis $h_k$ for which $\hat{R}_n(h_k)$ is still very far from $R(h_k)$

Example
Question 1: If I toss a fair coin 10 times, what is the probability that I get 10 heads?

Question 2: If I toss 1000 fair coins 10 times each, what is the probability that some coin will get 10 heads?

This illustrates the fundamental challenge of multiple hypothesis testing
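
Both questions can be answered with a short calculation, assuming the tosses and the coins are independent. The variable names are illustrative.

```python
p_single = 0.5 ** 10                  # one fair coin: 10 heads in 10 tosses
p_some = 1 - (1 - p_single) ** 1000   # at least one of 1000 independent coins does it

print(f"Question 1: {p_single:.6f}")  # 1/1024, roughly 0.001
print(f"Question 2: {p_some:.4f}")    # roughly 0.62
```

An outcome that is very unlikely for any single coin becomes more likely than not once there are enough coins.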

…and back to learning
If we have many hypotheses (large $m$), then even though for any fixed hypothesis $h_j$ it is likely that $\hat{R}_n(h_j) \approx R(h_j)$, it is also likely that there will be at least one hypothesis $h_k$ where $\hat{R}_n(h_k)$ is very different from $R(h_k)$

Can we adapt our approach to handle many hypotheses?

A first model of learning
Let’s restrict our attention to binary classification
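– our labels belong to $\{-1, +1\}$ (or $\{0, 1\}$)

We observe the data $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$

where each $x_i \in \mathcal{X}$ and $y_i \in \{-1, +1\}$

Suppose we are given a list of possible hypotheses $\mathcal{H} = \{h_1, \ldots, h_m\}$

From the training data $\mathcal{D}$, we would like to select the best possible hypothesis from $\mathcal{H}$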
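
Empirical risk
Recall our definition of risk and its empirical counterpart

Risk: $R(h) := \mathbb{P}[h(X) \neq Y]$

Empirical risk: $\hat{R}_n(h) := \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{h(x_i) \neq y_i\}$

The empirical risk $\hat{R}_n(h)$ gives us an estimate of the true risk $R(h)$, and from the law of large numbers we know that $\hat{R}_n(h) \to R(h)$ as $n \to \infty$

We should be able to use the empirical risk to choose a good hypothesis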

Empirical risk minimization (ERM)
We want to choose a hypothesis from $\mathcal{H}$ that achieves a small risk

Since $\hat{R}_n(h)$ is supposed to be a good estimate of $R(h)$, an incredibly natural (and common) strategy is to pick

$$h^* = \arg\min_{h \in \mathcal{H}} \hat{R}_n(h)$$
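
For a finite list of hypotheses, ERM amounts to computing each empirical risk and taking the smallest. The sketch below assumes a hypothetical family of threshold classifiers and made-up data; none of the names come from the slides.

```python
def empirical_risk(h, data):
    """Fraction of labelled examples (x, y) on which h(x) != y."""
    return sum(1 for x, y in data if h(x) != y) / len(data)

def make_threshold(t):
    """Hypothetical hypothesis h_t(x) = +1 if x > t, otherwise -1."""
    return lambda x: 1 if x > t else -1

def erm(hypotheses, data):
    """Return (index, empirical risk) of a hypothesis with the smallest empirical risk."""
    risks = [empirical_risk(h, data) for h in hypotheses]
    best = min(range(len(hypotheses)), key=risks.__getitem__)
    return best, risks[best]

thresholds = [-1.0, 0.0, 0.5, 1.0]
hypotheses = [make_threshold(t) for t in thresholds]
data = [(-2.0, -1), (-0.5, -1), (0.3, -1), (0.7, +1), (1.2, +1), (0.1, -1)]

j, risk = erm(hypotheses, data)
print(thresholds[j], risk)  # threshold 0.5 makes no mistakes on this data
```

Ties are broken in favor of the first minimizer in the list, which is an arbitrary choice made for this example.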

Aside:

The risk in ERM
As long as we have enough data, for any particular hypothesis $h_j$, we expect $\hat{R}_n(h_j) \approx R(h_j)$

However, if $m$ is very large, then we can also expect that there are some $h_k$ for which $\hat{R}_n(h_k) \ll R(h_k)$

Thus, what can we say about $h^*$?

• We know that $\hat{R}_n(h^*)$ is as small as it can be
– this could be because $R(h^*)$ is small
– or, it could be because $\hat{R}_n(h_k) \ll R(h_k)$ for some $h_k$
• Which explanation is more likely?
– it depends… just how large is $m$?
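
A small simulation shows why the size of the hypothesis set matters. It assumes a deliberately useless hypothesis set: every hypothesis guesses labels by fair coin flips, so each one has true risk exactly 0.5. The function name, `n = 20`, and the seed are illustrative choices.

```python
import random

def running_min_empirical_risk(m, n, seed=0):
    """Empirical risks of m hypotheses that guess labels by fair coin flips.

    Each hypothesis has true risk 0.5. Returns the smallest empirical risk
    seen after 1, 2, ..., m hypotheses.
    """
    rng = random.Random(seed)
    running_min, best = [], 1.0
    for _ in range(m):
        mistakes = sum(rng.random() < 0.5 for _ in range(n))
        best = min(best, mistakes / n)
        running_min.append(best)
    return running_min

n = 20
mins = running_min_empirical_risk(m=1000, n=n)
for m in (1, 10, 100, 1000):
    print(f"m = {m:>4}: smallest empirical risk = {mins[m - 1]:.2f} (true risk is 0.50)")
```

The smallest empirical risk can only go down as more hypotheses are added, even though no hypothesis is actually any better than chance.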
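
Confidence bounds
One way to provide guarantees for the ERM approach is to set $n$ and $m$ such that

$$|\hat{R}_n(h_j) - R(h_j)| \le \epsilon$$

for all $h_j \in \mathcal{H}$ (and for some suitably small choice of $\epsilon$)

Of course, we can never guarantee that this holds, so instead we will be concerned with the probability that

$$|\hat{R}_n(h_j) - R(h_j)| \le \epsilon$$

[Figure: distribution of $\hat{R}_n(h_j)$]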
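
Too much randomness?
Ultimately, we will want to show something like

$$\mathbb{P}\left[|\hat{R}_n(h_j) - R(h_j)| \le \epsilon\right] \ge 1 - \delta$$

for all $h_j \in \mathcal{H}$

What is random here?

– the training data $\mathcal{D}$
– $\hat{R}_n(h_1), \ldots, \hat{R}_n(h_m)$, because each $\hat{R}_n(h_j)$ depends on $\mathcal{D}$
– $h^*$, because it depends on $\hat{R}_n(h_1), \ldots, \hat{R}_n(h_m)$

In order to tease all of this apart, let’s begin by going back to just a single hypothesis $h_j$ and studying $\hat{R}_n(h_j)$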
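
Bounding the error
We want to calculate

$$\mathbb{P}\left[|\hat{R}_n(h_j) - R(h_j)| \le \epsilon\right]$$

Note that $\hat{R}_n(h_j)$ is a random variable

– we can write $n\hat{R}_n(h_j) = \sum_{i=1}^{n} Z_i$, where the $Z_i = \mathbf{1}\{h_j(x_i) \neq y_i\}$ are Bernoulli random variables
– thus, $n\hat{R}_n(h_j)$ is a Binomial random variable
– since $\mathbb{P}[Z_i = 1] = R(h_j)$, we have that $\mathbb{E}[n\hat{R}_n(h_j)] = nR(h_j)$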
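
Deviation from the mean
Thus, an equivalent way to think about our problem is that we would like to calculate

$$\mathbb{P}\left[|n\hat{R}_n(h_j) - nR(h_j)| \le n\epsilon\right]$$

and this is just asking about the probability that a Binomial random variable will be within $n\epsilon$ of its mean

If $F$ represents the cumulative distribution function (CDF) of our binomial random variable, then we can write

$$\mathbb{P}\left[|n\hat{R}_n(h_j) - nR(h_j)| \le n\epsilon\right] = F(nR(h_j) + n\epsilon) - F(nR(h_j) - n\epsilon)$$

(up to a possible point mass at the lower endpoint $nR(h_j) - n\epsilon$)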
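
For small $n$ this probability can be computed exactly by summing the binomial probability mass function. The sketch below assumes the true risk `r` is known, which it never is in practice; the function name and the example values are illustrative.

```python
from math import ceil, comb, floor

def prob_within(n, r, eps):
    """P[|S/n - r| <= eps] for S ~ Binomial(n, r), by summing the pmf."""
    # Small tolerances guard against floating-point error at the endpoints.
    lo = max(0, ceil(n * (r - eps) - 1e-12))
    hi = min(n, floor(n * (r + eps) + 1e-12))
    return sum(comb(n, k) * r**k * (1 - r) ** (n - k) for k in range(lo, hi + 1))

print(prob_within(n=100, r=0.3, eps=0.1))
```

The sum has no simple closed form and says little about how the answer scales with `n` and `eps`, which is the motivation for looking for a bound instead.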
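
Bounding the deviation
Unfortunately, the CDF we are interested in is given by

$$F(k) = \sum_{i=0}^{\lfloor k \rfloor} \binom{n}{i} R(h_j)^i \left(1 - R(h_j)\right)^{n-i}$$

This has no nice closed-form expression, is rather unwieldy to work with, and doesn’t give us much intuition

Instead of calculating the probability exactly, it is enough to get a good bound of the form

$$\mathbb{P}\left[|\hat{R}_n(h_j) - R(h_j)| \le \epsilon\right] \ge 1 - \delta$$

or equivalently

$$\mathbb{P}\left[|\hat{R}_n(h_j) - R(h_j)| > \epsilon\right] \le \delta$$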

Concentration inequalities
An inequality of the form

$$\mathbb{P}\left[|X - \mathbb{E}[X]| > \epsilon\right] \le \delta$$

tells us how a particular random variable (in this case $X = \hat{R}_n(h_j)$) concentrates around its mean

There are many different concentration inequalities that give us various bounds along these lines

We will start with a very simple one, and then build up to a stronger result
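
Markov’s inequality
The simplest of these results is Markov’s inequality

Let $X$ be any nonnegative random variable.

Then for any $t > 0$,

$$\mathbb{P}[X \ge t] \le \frac{\mathbb{E}[X]}{t}$$

This is cool on its own, but can be leveraged to say even more, since for any strictly monotonically increasing and nonnegative-valued function $\phi$ (with $\phi(t) > 0$)

$$\mathbb{P}[X \ge t] = \mathbb{P}[\phi(X) \ge \phi(t)] \le \frac{\mathbb{E}[\phi(X)]}{\phi(t)}$$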
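
Chebyshev’s inequality
As an example, Chebyshev’s inequality states that for any random variable $X$ with finite variance and any $t > 0$,

$$\mathbb{P}\left[|X - \mathbb{E}[X]| \ge t\right] \le \frac{\mathrm{var}(X)}{t^2}$$

Proof.
Note that $|X - \mathbb{E}[X]|$ is a nonnegative random variable. Squaring both sides of the event does not change it, so we can apply Markov’s inequality to the squared deviation to obtain

$$\mathbb{P}\left[|X - \mathbb{E}[X]| \ge t\right] = \mathbb{P}\left[|X - \mathbb{E}[X]|^2 \ge t^2\right] \le \frac{\mathbb{E}\left[|X - \mathbb{E}[X]|^2\right]}{t^2} = \frac{\mathrm{var}(X)}{t^2}$$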
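
Both inequalities can be checked exactly on a small example. The sketch below assumes $X$ is the outcome of a fair six-sided die, which is a choice made purely for illustration, and uses exact rational arithmetic.

```python
from fractions import Fraction

# Hypothetical example: X is the outcome of a fair six-sided die.
outcomes = range(1, 7)
prob = Fraction(1, 6)
mean = sum(prob * x for x in outcomes)               # 7/2
var = sum(prob * (x - mean) ** 2 for x in outcomes)  # 35/12

def tail(t):
    """Exact P[X >= t]."""
    return sum(prob for x in outcomes if x >= t)

def deviation_tail(t):
    """Exact P[|X - E[X]| >= t]."""
    return sum(prob for x in outcomes if abs(x - mean) >= t)

for t in (2, 4, 6):
    print(f"Markov:    P[X >= {t}] = {tail(t)} <= {mean / t}")
for t in (Fraction(3, 2), Fraction(5, 2)):
    print(f"Chebyshev: P[|X - EX| >= {t}] = {deviation_tail(t)} <= {var / t**2}")
```

Both bounds hold but are far from tight here, which is consistent with their being too loose for practical use.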
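
Proof of Markov (Part 1)
There is a simple proof of Markov if you know the (super useful!) fact that for any nonnegative random variable $X$

$$\mathbb{E}[X] = \int_0^\infty \mathbb{P}[X \ge t] \, dt$$

Proof. We can write

$$X = \int_0^\infty \mathbf{1}_t(X) \, dt$$

where

$$\mathbf{1}_t(X) = \begin{cases} 1 & \text{if } X \ge t \\ 0 & \text{otherwise} \end{cases}$$

Thus

$$\mathbb{E}[X] = \mathbb{E}\left[\int_0^\infty \mathbf{1}_t(X) \, dt\right] = \int_0^\infty \mathbb{E}[\mathbf{1}_t(X)] \, dt = \int_0^\infty \mathbb{P}[X \ge t] \, dt$$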
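
Proof of Markov (Part 2)
We can visualize this result as

[Figure: $\mathbb{E}[X]$ as the area under the curve $s \mapsto \mathbb{P}[X \ge s]$, with a rectangle of width $t$ and height $\mathbb{P}[X \ge t]$ sitting beneath the curve]

Since $\mathbb{P}[X \ge s]$ is nonincreasing in $s$, we can immediately see that we must have

$$t \, \mathbb{P}[X \ge t] \le \int_0^t \mathbb{P}[X \ge s] \, ds \le \int_0^\infty \mathbb{P}[X \ge s] \, ds = \mathbb{E}[X]$$

and hence

$$\mathbb{P}[X \ge t] \le \frac{\mathbb{E}[X]}{t}$$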
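
Hoeffding’s inequality
Chebyshev’s inequality gives us the kind of result we are after, but it is too loose to be of practical use

Hoeffding’s inequality assumes a bit more about our random variable beyond having finite variance, but gets us a much tighter and more useful result:

Let $X_1, \ldots, X_n$ be independent bounded random variables, i.e., random variables such that $\mathbb{P}[X_i \in [a_i, b_i]] = 1$ for all $i$

Let $S_n = \sum_{i=1}^{n} X_i$. Then for any $t > 0$, we have

$$\mathbb{P}\left[|S_n - \mathbb{E}[S_n]| \ge t\right] \le 2 \exp\left(-\frac{2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right)$$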

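
Chernoff’s bounding method
To prove this result, we will use a similar approach as in Chebyshev’s inequality
To begin, consider only the upper tail inequality. For any $s > 0$:

$$\mathbb{P}\left[S_n - \mathbb{E}[S_n] \ge t\right] = \mathbb{P}\left[e^{s(S_n - \mathbb{E}[S_n])} \ge e^{st}\right]$$

$$\le e^{-st} \, \mathbb{E}\left[e^{s(S_n - \mathbb{E}[S_n])}\right] \quad \text{(Markov)}$$

$$= e^{-st} \prod_{i=1}^{n} \mathbb{E}\left[e^{s(X_i - \mathbb{E}[X_i])}\right] \quad \text{(Independence)}$$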
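
Hoeffding’s Lemma
It is not obvious, but also not too hard to show, that

$$\mathbb{E}\left[e^{s(X_i - \mathbb{E}[X_i])}\right] \le e^{s^2 (b_i - a_i)^2 / 8}$$

(proof uses convexity and then gets a bound using a Taylor series expansion)

Plugging this in, we obtain that for any $s > 0$, we have

$$\mathbb{P}\left[S_n - \mathbb{E}[S_n] \ge t\right] \le \exp\left(-st + \frac{s^2}{8} \sum_{i=1}^{n} (b_i - a_i)^2\right)$$

By setting $s = 4t / \sum_{i=1}^{n} (b_i - a_i)^2$, we have

$$\mathbb{P}\left[S_n - \mathbb{E}[S_n] \ge t\right] \le \exp\left(-\frac{2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right)$$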
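
The particular choice of the free parameter is the one that makes the bound as small as possible. As a worked sketch, write $s$ for the free parameter in the exponential and $C$ for the sum of squared ranges; both symbols are notation assumed here for the example.

```latex
% Assumed notation: s > 0 is the free parameter, C = \sum_{i=1}^{n} (b_i - a_i)^2.
\begin{align*}
g(s)   &= -st + \frac{s^2 C}{8}
       && \text{(exponent of the bound)} \\
g'(s)  &= -t + \frac{sC}{4} = 0
       \;\Longrightarrow\; s^\star = \frac{4t}{C}
       && \text{(minimizer, since } g''(s) = C/4 > 0\text{)} \\
g(s^\star) &= -\frac{4t^2}{C} + \frac{16 t^2}{8C}
            = -\frac{4t^2}{C} + \frac{2t^2}{C}
            = -\frac{2t^2}{C}
\end{align*}
```

Any other positive value of $s$ still gives a valid bound, only a weaker one.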
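
Putting it all together
Thus, we have proven that

$$\mathbb{P}\left[S_n - \mathbb{E}[S_n] \ge t\right] \le \exp\left(-\frac{2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right)$$

An analogous argument proves

$$\mathbb{P}\left[\mathbb{E}[S_n] - S_n \ge t\right] \le \exp\left(-\frac{2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right)$$

Combined, these give (since the two-sided event is the union of the two one-sided events)

$$\mathbb{P}\left[|S_n - \mathbb{E}[S_n]| \ge t\right] \le 2 \exp\left(-\frac{2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right)$$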

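Special case: Binomials
If the $X_i$ are independent, identically distributed Bernoulli random variables, then $S_n$ is a Binomial random variable and Hoeffding’s inequality (with $a_i = 0$ and $b_i = 1$) becomes

$$\mathbb{P}\left[|S_n - \mathbb{E}[S_n]| \ge t\right] \le 2 e^{-2t^2 / n}$$

Finally going back to our original problem, taking $S_n = n\hat{R}_n(h_j)$ and $t = n\epsilon$, this means that Hoeffding yields the bound

$$\mathbb{P}\left[|\hat{R}_n(h_j) - R(h_j)| > \epsilon\right] \le \mathbb{P}\left[|\hat{R}_n(h_j) - R(h_j)| \ge \epsilon\right] \le 2 e^{-2n\epsilon^2}$$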
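
To get a feel for how tight the bound is, it can be compared with the exact binomial tail. The sketch below assumes a hypothetical true risk of 0.3 and deviation 0.1; the function names are illustrative.

```python
from math import comb, exp

def exact_two_sided_tail(n, r, eps):
    """Exact P[|S/n - r| > eps] for S ~ Binomial(n, r)."""
    return sum(
        comb(n, k) * r**k * (1 - r) ** (n - k)
        for k in range(n + 1)
        if abs(k / n - r) > eps + 1e-12  # tolerance for floating-point ties
    )

def hoeffding_bound(n, eps):
    """Hoeffding's bound 2 exp(-2 n eps^2), valid for any true risk."""
    return 2 * exp(-2 * n * eps**2)

for n in (50, 200, 500):
    exact = exact_two_sided_tail(n, 0.3, 0.1)
    print(f"n = {n:>3}: exact = {exact:.2e}, Hoeffding bound = {hoeffding_bound(n, 0.1):.2e}")
```

The bound is conservative, but it does not depend on the unknown risk and it decays exponentially in `n`, which is what the rest of the argument needs.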
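
Multiple hypotheses
Thus, after much effort, we have that for a particular hypothesis $h_j$,

$$\mathbb{P}\left[|\hat{R}_n(h_j) - R(h_j)| > \epsilon\right] \le 2 e^{-2n\epsilon^2}$$

However, we are ultimately interested in $h^*$, not just a single hypothesis

One way to argue that $\hat{R}_n(h^*) \approx R(h^*)$ is to ensure that

$$|\hat{R}_n(h_j) - R(h_j)| \le \epsilon$$

simultaneously for all $h_j \in \mathcal{H}$

Equivalently, we can try to bound the probability that any hypothesis has an empirical risk that deviates from its mean by more than $\epsilon$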
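
Formal statement
We can express this mathematically as

$$\mathbb{P}\left[\max_{1 \le j \le m} |\hat{R}_n(h_j) - R(h_j)| > \epsilon\right] = \mathbb{P}\left[\bigcup_{j=1}^{m} \left\{|\hat{R}_n(h_j) - R(h_j)| > \epsilon\right\}\right]$$

We can bound this using something called the union bound

Union bound
For any sequence of events $E_1, E_2, \ldots, E_m$

$$\mathbb{P}\left[\bigcup_{j=1}^{m} E_j\right] \le \sum_{j=1}^{m} \mathbb{P}[E_j]$$

The events in our case are given by

$$E_j = \left\{|\hat{R}_n(h_j) - R(h_j)| > \epsilon\right\}$$

Final result

$$\mathbb{P}\left[\max_{1 \le j \le m} |\hat{R}_n(h_j) - R(h_j)| > \epsilon\right] \le \sum_{j=1}^{m} \mathbb{P}[E_j] \le 2m e^{-2n\epsilon^2}$$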
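
Interpretation
We went through all of this work to show that

$$\mathbb{P}\left[\max_{1 \le j \le m} |\hat{R}_n(h_j) - R(h_j)| > \epsilon\right] \le 2m e^{-2n\epsilon^2}$$

The right-hand side is linearly increasing in $m$ and exponentially decreasing in $n$

This suggests that ERM is a reasonable approach as long as $m$ isn’t too big
(i.e., $\log m \ll n$)

Note that, setting $\delta = 2m e^{-2n\epsilon^2}$, the above is equivalent to the statement that with probability at least $1 - \delta$,

$$|\hat{R}_n(h_j) - R(h_j)| \le \sqrt{\frac{1}{2n} \log \frac{2m}{\delta}} \quad \text{for all } h_j \in \mathcal{H}$$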
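
The bound can be turned around to ask how much data is enough. The sketch below assumes the finite-class bound $2m e^{-2n\epsilon^2} \le \delta$ and solves it for $\epsilon$ or for $n$; the function names and the example values of `eps` and `delta` are illustrative.

```python
from math import ceil, log, sqrt

def uniform_deviation(n, m, delta):
    """epsilon for which 2 m exp(-2 n epsilon^2) equals delta."""
    return sqrt(log(2 * m / delta) / (2 * n))

def samples_needed(eps, m, delta):
    """Smallest integer n with 2 m exp(-2 n eps^2) <= delta."""
    return ceil(log(2 * m / delta) / (2 * eps**2))

for m in (10, 1_000, 1_000_000):
    n = samples_needed(eps=0.05, m=m, delta=0.05)
    print(f"m = {m:>9}: n = {n}")
```

Because `m` enters only through a logarithm, multiplying the number of hypotheses by a thousand adds a fixed number of samples rather than multiplying the requirement.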
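
Bounding the excess risk
Note that we would ideally actually like to choose

$$h^\sharp = \arg\min_{h \in \mathcal{H}} R(h)$$

We can also relate the performance of the empirical risk minimizer $h^*$ to $h^\sharp$:

We have already shown that with probability at least $1 - \delta$, $|\hat{R}_n(h^*) - R(h^*)| \le \epsilon$, where $\epsilon = \sqrt{\frac{1}{2n} \log \frac{2m}{\delta}}$

What about $|\hat{R}_n(h^*) - R(h^\sharp)|$?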
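
Bounding the excess risk
We will bound $|\hat{R}_n(h^*) - R(h^\sharp)|$ in two steps…
• $\hat{R}_n(h^*)$ cannot be too much bigger than $R(h^\sharp)$:
By the definition of $h^*$, $\hat{R}_n(h^*) \le \hat{R}_n(h^\sharp)$
From before, we have $\hat{R}_n(h^\sharp) \le R(h^\sharp) + \epsilon$
Thus $\hat{R}_n(h^*) \le R(h^\sharp) + \epsilon$

• $R(h^\sharp)$ cannot be too much bigger than $\hat{R}_n(h^*)$:

By the definition of $h^\sharp$, $R(h^\sharp) \le R(h^*)$
From before, we have $R(h^*) \le \hat{R}_n(h^*) + \epsilon$
Thus $R(h^\sharp) \le \hat{R}_n(h^*) + \epsilon$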
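
The upshot
Thus, with probability at least $1 - \delta$, $|\hat{R}_n(h^*) - R(h^\sharp)| \le \epsilon$, and hence

$$R(h^*) \le \hat{R}_n(h^*) + \epsilon \le R(h^\sharp) + 2\epsilon$$

Bottom line: As long as $m$ isn’t too big ($\log m \ll n$) then we can be reasonably confident that $R(h^*)$ isn’t too much larger than $R(h^\sharp)$

Of course, the trick in doing a good job of learning is to ensure that $R(h^\sharp)$ is actually small

To achieve this, we need a “rich” set of possible hypotheses…

unfortunately…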

Fundamental tradeoff
Adding more hypotheses ultimately sacrifices our guarantee that $\hat{R}_n(h^*) \approx R(h^*)$, which causes the whole argument to break

[Figure: “Error” plotted against “Richness” of hypothesis set, annotated “Richer set of hypotheses”]
