Another approach
You watch me do this trick a couple times and notice I always hand out 5 cards
Suppose you instead consider
Now, can you learn a function $h$ such that $h(x)$ is a reliable predictor of $f(x)$?
Probability to the rescue!
Any function agreeing with the training data may be possible,
but that does not mean that every such function is equally probable
A short digression
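• Suppose that Javier has a biased coin, which lands on heads with some unknown
probability $p$
– $\mathbb{P}[\text{heads}] = p$
– $\mathbb{P}[\text{tails}] = 1 - p$
• Javier tosses the coin $n$ times
– $\hat{p} = \frac{\text{number of heads}}{n}$
Does $\hat{p}$ tell us anything about $p$?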
What can we learn from $\hat{p}$, the observed fraction of heads?
Given enough tosses (large $n$), we expect that $\hat{p} \approx p$
Law of large numbers:
$\hat{p} \to p$ as $n \to \infty$
Clearly, at least in a very limited sense, we can learn something about $p$ from
observations
There is always the possibility that we are totally wrong, but given enough data,
the probability should be very small
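A minimal simulation sketch of the coin-toss digression. The bias `p_true = 0.3`, the seed, and the function name are illustrative assumptions, not part of the lecture; the point is only that the observed fraction of heads settles near the true bias as the number of tosses grows.

```python
import random

def estimate_heads_probability(p, n, seed=0):
    """Toss a coin with P[heads] = p a total of n times; return the fraction of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < p for _ in range(n))
    return heads / n

p_true = 0.3  # hypothetical bias; in practice this is the unknown quantity
for n in (10, 100, 1_000, 10_000, 100_000):
    p_hat = estimate_heads_probability(p_true, n)
    print(f"n = {n:>6}: p_hat = {p_hat:.4f}, |p_hat - p| = {abs(p_hat - p_true):.4f}")
```

For small `n` the estimate can be far off; the gap typically shrinks as `n` increases, which is what the law of large numbers predicts.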
Connection to learning
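Coin tosses: We want to estimate $p$ (i.e., predict how likely a “heads” is)
Learning: We want to estimate a function $f$
Suppose we have a hypothesis $h$ and that $f$ is discrete
Think of the $(x_i, y_i)$, with $y_i = f(x_i)$, as a series of independent coin tosses, where the $x_i$ are
drawn from a probability distribution
– heads: our hypothesis is correct, i.e., $h(x_i) = y_i$
– tails: our hypothesis is wrong, i.e., $h(x_i) \neq y_i$
Define
(Population) risk: $R(h) := \mathbb{P}[h(X) \neq Y]$, where $Y = f(X)$
Empirical risk: $\hat{R}_n(h) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{h(x_i) \neq y_i\}$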
Trust, but verify
The law of large numbers guarantees that as long as we have enough data, we will
have that $\hat{R}_n(h) \approx R(h)$
This means that we can use $\hat{R}_n(h)$ to verify whether $h$ was a good hypothesis
Unfortunately, verification is not learning
• Where did $h$ come from?
• What if $R(h)$ is large?
• How do we know if $h = f$, or at least, if $h \approx f$?
• Given many possible hypotheses, how can we pick a good one?
From coins to learning
Consider an ensemble of many hypotheses $\mathcal{H} = \{h_1, \ldots, h_m\}$
If we fix a hypothesis $h_j$ before drawing our data, then the law of large numbers
tells us that $\hat{R}_n(h_j) \to R(h_j)$ as $n \to \infty$
However, it is also true that for a fixed $n$, if $m$ is large it can still be very likely
that there is some hypothesis $h_k$ for which $\hat{R}_n(h_k)$ is still very far from $R(h_k)$
Example
Question 1: If I toss a fair coin 10 times, what is the
probability that I get 10 heads?
Question 2: If I toss 1000 fair coins 10 times each, what is
the probability that some coin will get 10 heads?
This illustrates the fundamental challenge of multiple hypothesis testing
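The two questions can be worked out directly, assuming the coins are tossed independently. A short sketch (the variable names are illustrative):

```python
# Question 1: one fair coin, 10 tosses, all heads.
p_single = 0.5 ** 10

# Question 2: 1000 independent fair coins, 10 tosses each.
# P[some coin gets 10 heads] = 1 - P[no coin gets 10 heads].
p_any = 1 - (1 - p_single) ** 1000

print(f"P[one coin gets 10 heads]        = {p_single:.6f}")
print(f"P[some coin of 1000 gets 10 heads] = {p_any:.4f}")
```

The first probability is 1/1024, roughly 0.001, while the second is roughly 0.62: an event that is very unlikely for any single coin becomes more likely than not once there are many coins.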
…and back to learning
If we have many hypotheses (large $m$), then
even though for any fixed hypothesis $h_j$ it is likely that $\hat{R}_n(h_j) \approx R(h_j)$,
it is also likely that there will be at least one hypothesis $h_k$ where $\hat{R}_n(h_k)$ is very
different from $R(h_k)$
Can we adapt our approach to handle many hypotheses?
A first model of learning
Let’s restrict our attention to binary classification
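– our labels belong to $\{0, 1\}$ (or $\{-1, +1\}$)
We observe the data $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$
where each $x_i \in \mathbb{R}^d$
Suppose we are given a list of possible hypotheses $\mathcal{H} = \{h_1, \ldots, h_m\}$
From the training data $\mathcal{D}$, we would like to select the best possible hypothesis
from $\mathcal{H}$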
Example
Empirical risk
Recall our definition of risk and its empirical counterpart
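Risk: $R(h_j) := \mathbb{P}[h_j(X) \neq Y]$
Empirical risk: $\hat{R}_n(h_j) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{h_j(x_i) \neq y_i\}$
The empirical risk $\hat{R}_n(h_j)$ gives us an estimate of the true risk $R(h_j)$, and from
the law of large numbers we know that $\hat{R}_n(h_j) \to R(h_j)$ as $n \to \infty$
We should be able to use the empirical risk to choose a good hypothesis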
Empirical risk minimization (ERM)
We want to choose a hypothesis from $\mathcal{H}$ that achieves a small risk
Since $\hat{R}_n(h)$ is supposed to be a good estimate of $R(h)$, an incredibly natural
(and common) strategy is to pick
$h^* = \arg\min_{h \in \mathcal{H}} \hat{R}_n(h)$
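A minimal sketch of empirical risk minimization over a finite list of hypotheses. The 1-D data, the true labeling rule `x > 0.35`, and the family of threshold classifiers are all hypothetical choices made only for illustration.

```python
import random

def empirical_risk(h, data):
    """Fraction of training examples (x, y) that hypothesis h misclassifies."""
    return sum(h(x) != y for x, y in data) / len(data)

def erm(hypotheses, data):
    """Return a hypothesis with the smallest empirical risk (first one on ties)."""
    return min(hypotheses, key=lambda h: empirical_risk(h, data))

# Hypothetical training set: x uniform on [0, 1], label 1 exactly when x > 0.35.
rng = random.Random(0)
data = [(x, int(x > 0.35)) for x in (rng.random() for _ in range(200))]

# Hypothetical finite hypothesis set: threshold classifiers at 0.0, 0.1, ..., 1.0.
thresholds = [j / 10 for j in range(11)]
hypotheses = [lambda x, t=t: int(x > t) for t in thresholds]

h_star = erm(hypotheses, data)
best_threshold = thresholds[hypotheses.index(h_star)]
print(f"selected threshold: {best_threshold}, empirical risk: {empirical_risk(h_star, data):.3f}")
```

Because the true threshold 0.35 is not in the list, the selected hypothesis cannot have zero risk; ERM simply returns whichever listed threshold makes the fewest training errors.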
Aside:
The risk in ERM
As long as we have enough data, for any particular hypothesis $h_j$, we expect
$\hat{R}_n(h_j) \approx R(h_j)$
However, if $m$ is very large, then we can also expect that there are some $h_j$ for
which $\hat{R}_n(h_j) \ll R(h_j)$
Thus, what can we say about $R(h^*)$?
• We know that $\hat{R}_n(h^*)$ is as small as it can be
– this could be because $R(h^*)$ is small
– or, it could be because $\hat{R}_n(h_j) \ll R(h_j)$ for some $h_j$
• Which explanation is more likely?
– it depends… just how large is $m$?
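A small simulation sketch of the second explanation. The setup is an assumption chosen to make the effect obvious: labels are fair coin flips and every "hypothesis" guesses each label with its own independent coin flips, so every hypothesis has true risk exactly 0.5. The values of `n` and `m` are illustrative.

```python
import random

rng = random.Random(1)
n, m = 20, 5000  # hypothetical sample size and number of hypotheses

# Training labels: fair coin flips.
labels = [rng.randint(0, 1) for _ in range(n)]

def empirical_risk_of_random_guesser():
    """Empirical risk of a hypothesis that guesses every label by a fair coin flip."""
    return sum(rng.randint(0, 1) != y for y in labels) / n

risks = [empirical_risk_of_random_guesser() for _ in range(m)]

print(f"average empirical risk over all hypotheses: {sum(risks) / m:.3f}")
print(f"empirical risk of the ERM choice:           {min(risks):.3f}")
```

On average the empirical risks sit near 0.5, yet the smallest one is far below 0.5 even though no hypothesis is actually better than chance. With this many hypotheses and this little data, a small empirical risk for the ERM choice says very little about its true risk.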
Confidence bounds
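One way to provide guarantees for the ERM approach is to set $n$ and $m$ such that
$|\hat{R}_n(h_j) - R(h_j)| \le \epsilon$
for all $h_j \in \mathcal{H}$ (and for some suitably small choice of $\epsilon$)
Of course, we can never guarantee that this holds, so instead we will be concerned
with the probability that $|\hat{R}_n(h_j) - R(h_j)| > \epsilon$
[Figure: distribution of $\hat{R}_n(h_j)$]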
Too much randomness?
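Ultimately, we will want to show something like
$\mathbb{P}\big[\, |\hat{R}_n(h_j) - R(h_j)| \le \epsilon$ for all $h_j \in \mathcal{H} \,\big] \ge 1 - \delta$
What is random here?
– the training data $\mathcal{D}$
– $\hat{R}_n(h_j)$, because each $\hat{R}_n(h_j)$ depends on $\mathcal{D}$
– $h^*$, because it depends on $\mathcal{D}$
In order to tease all of this apart, let’s begin by going back to just a single
hypothesis $h_j$ and studying $\hat{R}_n(h_j)$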
Bounding the error
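We want to calculate $\mathbb{P}\big[ |\hat{R}_n(h_j) - R(h_j)| > \epsilon \big]$
Note that $\hat{R}_n(h_j)$ is a random variable
– we can write $n \hat{R}_n(h_j) = \sum_{i=1}^{n} \mathbb{1}\{h_j(x_i) \neq y_i\}$ where the $\mathbb{1}\{h_j(x_i) \neq y_i\}$ are Bernoulli random variables
– thus, $n \hat{R}_n(h_j)$ is a Binomial random variable
– since $\mathbb{P}[h_j(x_i) \neq y_i] = R(h_j)$, we have that $\mathbb{E}[n \hat{R}_n(h_j)] = n R(h_j)$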
Deviation from the mean
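Thus, an equivalent way to think about our problem is that we would like to
calculate
$\mathbb{P}\big[ |n \hat{R}_n(h_j) - n R(h_j)| > n \epsilon \big]$
and this is just asking about the probability that a Binomial random variable will
deviate from its mean by more than $n \epsilon$
If $F(t) = \mathbb{P}[n \hat{R}_n(h_j) \le t]$ represents the cumulative distribution function (CDF) of our binomial
random variable, and $F(t^-) = \mathbb{P}[n \hat{R}_n(h_j) < t]$ its left limit, then we can write
$\mathbb{P}\big[ |n \hat{R}_n(h_j) - n R(h_j)| > n \epsilon \big] = 1 - F\big(n R(h_j) + n \epsilon\big) + F\big((n R(h_j) - n \epsilon)^-\big)$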
Bounding the deviation
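Unfortunately, the CDF we are interested in is given by
$F(t) = \sum_{k=0}^{\lfloor t \rfloor} \binom{n}{k} R(h_j)^k \big(1 - R(h_j)\big)^{n-k}$
This has no nice closed-form expression, is rather unwieldy to work with, and
doesn’t give us much intuition
Instead of calculating the probability exactly, it is enough to get a good bound of
the form
$\mathbb{P}\big[ |\hat{R}_n(h_j) - R(h_j)| > \epsilon \big] \le \delta$
or equivalently
$\mathbb{P}\big[ |\hat{R}_n(h_j) - R(h_j)| \le \epsilon \big] \ge 1 - \delta$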
Concentration inequalities
An inequality of the form
$\mathbb{P}\big[ |\hat{R}_n(h_j) - R(h_j)| > \epsilon \big] \le \delta$
tells us how a particular random variable (in this case $\hat{R}_n(h_j)$) concentrates
around its mean
There are many different concentration inequalities that give us various bounds
along these lines
We will start with a very simple one, and then build up to a stronger result
Markov’s inequality
The simplest of these results is Markov’s inequality
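Let $X$ be any nonnegative random variable.
Then for any $t > 0$,
$\mathbb{P}[X \ge t] \le \frac{\mathbb{E}[X]}{t}$
This is cool on its own, but can be leveraged to say even
more since for any strictly monotonically increasing and
nonnegative-valued function $\phi$ with $\phi(t) > 0$,
$\mathbb{P}[X \ge t] = \mathbb{P}[\phi(X) \ge \phi(t)] \le \frac{\mathbb{E}[\phi(X)]}{\phi(t)}$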
Chebyshev’s inequality
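As an example, Chebyshev’s inequality
states that for any random variable $X$ and any $t > 0$,
$\mathbb{P}\big[ |X - \mathbb{E}[X]| \ge t \big] \le \frac{\mathrm{Var}(X)}{t^2}$
Proof.
Note that $|X - \mathbb{E}[X]|$ is a nonnegative random
variable. Thus we can apply Markov’s inequality with $\phi(t) = t^2$ to obtain
$\mathbb{P}\big[ |X - \mathbb{E}[X]| \ge t \big] = \mathbb{P}\big[ |X - \mathbb{E}[X]|^2 \ge t^2 \big] \le \frac{\mathbb{E}\big[ |X - \mathbb{E}[X]|^2 \big]}{t^2} = \frac{\mathrm{Var}(X)}{t^2}$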
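To get a feel for how loose Chebyshev can be, here is a small sketch comparing the Chebyshev bound $\mathrm{Var}(S)/t^2$ with the exact deviation probability of a Binomial random variable $S$, computed by direct summation. The parameters `n = 100`, `p = 0.25` and the function names are illustrative assumptions.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P[S = k] for S ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def exact_deviation_probability(n, p, t):
    """P[|S - n p| >= t] for S ~ Binomial(n, p), by direct summation."""
    mean = n * p
    return sum(binomial_pmf(k, n, p) for k in range(n + 1) if abs(k - mean) >= t)

def chebyshev_bound(n, p, t):
    """Chebyshev: P[|S - E S| >= t] <= Var(S) / t^2, with Var(S) = n p (1 - p)."""
    return min(1.0, n * p * (1 - p) / t**2)

n, p = 100, 0.25  # hypothetical parameters (mean n p = 25 is exact in floating point)
for t in (5, 10, 15, 20):
    exact = exact_deviation_probability(n, p, t)
    bound = chebyshev_bound(n, p, t)
    print(f"t = {t:>2}: exact = {exact:.2e}, Chebyshev bound = {bound:.2e}")
```

The bound always holds, but it decays only like $1/t^2$ while the exact probability falls off much faster as $t$ grows.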
Proof of Markov (Part 1)
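There is a simple proof of Markov if you know the (super useful!) fact that for any
nonnegative random variable $X$,
$\mathbb{E}[X] = \int_0^\infty \mathbb{P}[X \ge t] \, dt$
Proof. We can write
$X = \int_0^\infty \mathbb{1}_t(X) \, dt$
where $\mathbb{1}_t(X) = 1$ if $X \ge t$ and $\mathbb{1}_t(X) = 0$ otherwise
Thus
$\mathbb{E}[X] = \mathbb{E}\left[ \int_0^\infty \mathbb{1}_t(X) \, dt \right] = \int_0^\infty \mathbb{E}[\mathbb{1}_t(X)] \, dt = \int_0^\infty \mathbb{P}[X \ge t] \, dt$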
Proof of Markov (Part 2)
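We can visualize this result as the area under the curve $s \mapsto \mathbb{P}[X \ge s]$: since this function is nonincreasing, the rectangle of width $t$ and height $\mathbb{P}[X \ge t]$ lies under the curve
Thus, we can immediately see that we must have
$t \, \mathbb{P}[X \ge t] \le \int_0^\infty \mathbb{P}[X \ge s] \, ds = \mathbb{E}[X]$
and hence
$\mathbb{P}[X \ge t] \le \frac{\mathbb{E}[X]}{t}$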
Hoeffding’s inequality
Chebyshev’s inequality gives us the kind of result we are after, but it is too loose
to be of practical use
Hoeffding’s inequality assumes a bit more about our random variable beyond
having finite variance, but gets us a much tighter and more useful result:
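Let $X_1, \ldots, X_n$ be independent bounded random variables, i.e., random variables
such that $\mathbb{P}[X_i \in [a_i, b_i]] = 1$ for all $i$
Let $S_n = \sum_{i=1}^{n} X_i$. Then for any $t > 0$, we have
$\mathbb{P}\big[ |S_n - \mathbb{E}[S_n]| \ge t \big] \le 2 \exp\left( -\frac{2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right)$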
Chernoff’s bounding method
To prove this result, we will use a similar approach as in Chebyshev’s inequality
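To begin, consider only the upper tail inequality, with $S_n = \sum_{i=1}^{n} X_i$. For any $s > 0$,
$\mathbb{P}\big[ S_n - \mathbb{E}[S_n] \ge t \big] = \mathbb{P}\big[ e^{s (S_n - \mathbb{E}[S_n])} \ge e^{s t} \big]$
$\le e^{-s t} \, \mathbb{E}\big[ e^{s (S_n - \mathbb{E}[S_n])} \big]$ (Markov)
$= e^{-s t} \prod_{i=1}^{n} \mathbb{E}\big[ e^{s (X_i - \mathbb{E}[X_i])} \big]$ (Independence)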
Hoeffding’s Lemma
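It is not obvious, but also not too hard to show, that
$\mathbb{E}\big[ e^{s (X_i - \mathbb{E}[X_i])} \big] \le e^{s^2 (b_i - a_i)^2 / 8}$
(proof uses convexity and then gets a bound using a Taylor series expansion)
Plugging this in, we obtain that for any $s > 0$, we have
$\mathbb{P}\big[ S_n - \mathbb{E}[S_n] \ge t \big] \le \exp\left( -s t + \frac{s^2}{8} \sum_{i=1}^{n} (b_i - a_i)^2 \right)$
By setting $s = \frac{4 t}{\sum_{i=1}^{n} (b_i - a_i)^2}$, we have
$\mathbb{P}\big[ S_n - \mathbb{E}[S_n] \ge t \big] \le \exp\left( -\frac{2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right)$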
Putting it all together
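Thus, we have proven that
$\mathbb{P}\big[ S_n - \mathbb{E}[S_n] \ge t \big] \le \exp\left( -\frac{2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right)$
An analogous argument proves
$\mathbb{P}\big[ \mathbb{E}[S_n] - S_n \ge t \big] \le \exp\left( -\frac{2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right)$
Combined, these give
$\mathbb{P}\big[ |S_n - \mathbb{E}[S_n]| \ge t \big] \le 2 \exp\left( -\frac{2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right)$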
Special case: Binomials
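If the $X_i$ are Bernoulli random variables (so $a_i = 0$ and $b_i = 1$), then $S_n$ is a Binomial random variable
and Hoeffding’s inequality becomes
$\mathbb{P}\big[ |S_n - \mathbb{E}[S_n]| \ge t \big] \le 2 e^{-2 t^2 / n}$
Finally going back to our original problem, this means that (taking $S_n = n \hat{R}_n(h_j)$ and $t = n \epsilon$) Hoeffding yields the
bound
$\mathbb{P}\big[ |\hat{R}_n(h_j) - R(h_j)| > \epsilon \big] \le 2 e^{-2 \epsilon^2 n}$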
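As a sanity check, the Hoeffding bound $2 e^{-2 \epsilon^2 n}$ for a single hypothesis can be compared against the exact Binomial probability, computed by direct summation. The true risk `r = 0.25`, tolerance `eps = 0.125`, and the sample sizes below are hypothetical values (chosen to be exactly representable in floating point); the function names are illustrative.

```python
from math import comb, exp

def exact_tail(n, r, eps):
    """P[|R_hat - r| > eps], where n * R_hat ~ Binomial(n, r)."""
    return sum(
        comb(n, k) * r**k * (1 - r)**(n - k)
        for k in range(n + 1)
        if abs(k / n - r) > eps
    )

def hoeffding_bound(n, eps):
    """Hoeffding's bound for the empirical risk of one fixed hypothesis."""
    return 2 * exp(-2 * eps**2 * n)

r, eps = 0.25, 0.125  # hypothetical true risk and tolerance
for n in (64, 128, 256):
    print(f"n = {n:>3}: exact = {exact_tail(n, r, eps):.2e}, "
          f"Hoeffding bound = {hoeffding_bound(n, eps):.2e}")
```

The bound is not tight, but unlike Chebyshev it decays exponentially in $n$, and it does not require knowing the true risk.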
Multiple hypotheses
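Thus, after much effort, we have that for a particular hypothesis $h_j$,
$\mathbb{P}\big[ |\hat{R}_n(h_j) - R(h_j)| > \epsilon \big] \le 2 e^{-2 \epsilon^2 n}$
However, we are ultimately interested in $h^*$, not just a single hypothesis
One way to argue that $\hat{R}_n(h^*) \approx R(h^*)$ is to ensure that
$|\hat{R}_n(h_j) - R(h_j)| \le \epsilon$
simultaneously for all $h_j \in \mathcal{H}$
Equivalently, we can try to bound the probability that any hypothesis has an
empirical risk that deviates from its mean by more than $\epsilon$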
Formal statement
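We can express this mathematically as
$\mathbb{P}\left[ \max_{j} |\hat{R}_n(h_j) - R(h_j)| > \epsilon \right] = \mathbb{P}\left[ \bigcup_{j=1}^{m} \big\{ |\hat{R}_n(h_j) - R(h_j)| > \epsilon \big\} \right]$
We can bound this using something called the union bound
Union bound
For any sequence of events $E_1, \ldots, E_m$,
$\mathbb{P}\left[ \bigcup_{j=1}^{m} E_j \right] \le \sum_{j=1}^{m} \mathbb{P}[E_j]$
The events in our case are given by
$E_j = \big\{ |\hat{R}_n(h_j) - R(h_j)| > \epsilon \big\}$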
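Final result
$\mathbb{P}\left[ \max_{j} |\hat{R}_n(h_j) - R(h_j)| > \epsilon \right] \le \sum_{j=1}^{m} \mathbb{P}\big[ |\hat{R}_n(h_j) - R(h_j)| > \epsilon \big] \le 2 m e^{-2 \epsilon^2 n}$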
Interpretation
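We went through all of this work to show that
$\mathbb{P}\left[ \max_{j} |\hat{R}_n(h_j) - R(h_j)| > \epsilon \right] \le 2 m e^{-2 \epsilon^2 n}$
The bound is linearly increasing in $m$ and exponentially decreasing in $n$
This suggests that ERM is a reasonable approach as long as $m$ isn’t too big
(i.e., $\log m \ll n$)
Note that the above is equivalent to the statement that with probability at
least $1 - 2 m e^{-2 \epsilon^2 n}$, $|\hat{R}_n(h_j) - R(h_j)| \le \epsilon$ for all $h_j \in \mathcal{H}$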
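The bound $2 m e^{-2 \epsilon^2 n}$ can be turned around to ask how much data is enough: requiring $2 m e^{-2 \epsilon^2 n} \le \delta$ gives $n \ge \log(2m/\delta) / (2 \epsilon^2)$. A minimal sketch, where the function names and the example values of $m$, $\epsilon$, and $\delta$ are illustrative assumptions:

```python
from math import ceil, exp, log

def uniform_deviation_bound(m, n, eps):
    """Union bound + Hoeffding: P[max_j |R_hat(h_j) - R(h_j)| > eps] <= 2 m exp(-2 eps^2 n)."""
    return min(1.0, 2 * m * exp(-2 * eps**2 * n))

def samples_needed(m, eps, delta):
    """Smallest integer n with 2 m exp(-2 eps^2 n) <= delta."""
    return ceil(log(2 * m / delta) / (2 * eps**2))

eps, delta = 0.05, 0.05  # hypothetical tolerance and failure probability
for m in (10, 1_000, 100_000, 10_000_000):
    n = samples_needed(m, eps, delta)
    print(f"m = {m:>10}: n >= {n}, bound at that n = {uniform_deviation_bound(m, n, eps):.4f}")
```

Multiplying the number of hypotheses by 100 only adds a fixed amount to the required sample size, which reflects the logarithmic dependence on $m$.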
Bounding the excess risk
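Note that we would ideally actually like to choose
$h^\sharp = \arg\min_{h \in \mathcal{H}} R(h)$
We can also relate the performance of $h^*$ to $h^\sharp$:
We have already shown that with probability at least $1 - 2 m e^{-2 \epsilon^2 n}$,
$|\hat{R}_n(h^*) - R(h^*)| \le \epsilon$
What about $|\hat{R}_n(h^*) - R(h^\sharp)|$?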
Bounding the excess risk
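We will bound $|\hat{R}_n(h^*) - R(h^\sharp)|$ in two steps…
• $R(h^\sharp)$ cannot be too much bigger than $\hat{R}_n(h^*)$:
By the definition of $h^\sharp$, $R(h^\sharp) \le R(h^*)$
From before, we have $R(h^*) \le \hat{R}_n(h^*) + \epsilon$
Thus $R(h^\sharp) \le \hat{R}_n(h^*) + \epsilon$
• $\hat{R}_n(h^*)$ cannot be too much bigger than $R(h^\sharp)$:
By the definition of $h^*$, $\hat{R}_n(h^*) \le \hat{R}_n(h^\sharp)$
From before, we have $\hat{R}_n(h^\sharp) \le R(h^\sharp) + \epsilon$
Thus $\hat{R}_n(h^*) \le R(h^\sharp) + \epsilon$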
The upshot
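Thus, $|\hat{R}_n(h^*) - R(h^\sharp)| \le \epsilon$, and combining this with $|\hat{R}_n(h^*) - R(h^*)| \le \epsilon$ gives
$R(h^*) \le R(h^\sharp) + 2 \epsilon$
Bottom line: As long as $m$ isn’t too big ($\log m \ll n$) then we can be reasonably
confident that $R(h^*)$ isn’t too much larger than $R(h^\sharp)$
Of course, the trick in doing a good job of learning is to ensure that $R(h^\sharp)$ is actually
small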
To achieve this, we need a “rich” set of possible hypotheses…
unfortunately…
Fundamental tradeoff
Adding more hypotheses ultimately sacrifices our guarantee that $\hat{R}_n(h^*) \approx R(h^*)$,
which causes the whole argument to break
[Figure: error versus “richness” of the hypothesis set]