DSAI 514 – Statistical Inference
Point Estimation
Instructor: Ş. Betül Özateş
Boğaziçi University
19/02/2025
Devore, Jay L., Kenneth N. Berk, and Matthew A. Carlton (2012). Modern Mathematical Statistics with Applications. Vol. 285. New York: Springer,
and https://medium.com/@hosamedwee/kullback-leibler-kl-divergence-with-examples-part-2-9123bff5dc10
Last Lecture: KL Divergence
• The KL Divergence of Q from P, denoted DKL(P||Q), is a measure of the information lost when Q is used to approximate P.
• P is the true distribution, and Q is the estimated distribution.
• P(x) is the probability of event x according to the true distribution.
• This term is used as a weighting factor, meaning events that are more
probable in the first distribution have a larger impact on the divergence.
• If an event is highly probable in P but not in Q, we want that to
contribute more to our divergence measure.
• If an event is not very likely in P, we don’t want it to contribute much to
our divergence measure, even if Q assigns it a high probability.
• This is because we’re measuring the divergence from P to Q, not the other way
around.
Last Lecture: KL Divergence
• log(P(x)/Q(x)) is the ratio of the probabilities assigned to event x by P and Q.
• If P and Q assign the same probability to x, then this ratio is 1, and the
logarithm of 1 is 0, so events that P and Q agree on don’t contribute to
the divergence.
• If P assigns more probability to x than Q does, then this ratio is greater
than 1, and the logarithm is positive, so this event contributes to the
divergence.
• If P assigns less probability to x than Q does, then this ratio is less than
1, and the logarithm is negative, but remember that we’re multiplying
this by P(x), so events that P assigns low probability to don’t contribute
much to the divergence.
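A minimal numeric sketch of these three cases in Python (the probabilities below are invented purely for illustration) shows the sign of the log ratio and how the P(x) weight damps the contribution of events that are unlikely under P:

import math

# Hypothetical per-event probabilities, chosen only to illustrate the three cases.
cases = [
    ("P and Q agree",    0.40, 0.40),  # ratio = 1  -> log ratio is 0
    ("P larger than Q",  0.40, 0.10),  # ratio > 1  -> log ratio is positive
    ("P smaller than Q", 0.05, 0.40),  # ratio < 1  -> log ratio is negative, but P(x) is small
]

for label, p, q in cases:
    log_ratio = math.log(p / q)   # sign of the per-event log ratio
    term = p * log_ratio          # contribution to the divergence after weighting by P(x)
    print(f"{label:18s}  log(P/Q) = {log_ratio:+.3f}   P(x)*log(P/Q) = {term:+.3f}")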
Last Lecture: KL Divergence
• P(x) log(P(x)/Q(x)):
For each outcome, we calculate how much probability P assigns to it,
and then multiply it by the log of the ratio of the probabilities P and Q
assign to it.
- This ratio tells us how much P and Q differ on this particular outcome.
• Finally, we sum over all possible outcomes. This gives us a single
number that represents the total difference between P and Q.
• This means that the KL Divergence is a weighted sum of the log difference in
probabilities, where the weights are the probabilities according to the first
distribution.
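A minimal Python sketch of this weighted sum, assuming two discrete distributions given as dictionaries over the same outcomes (with Q(x) > 0 wherever P(x) > 0), could look like this:

import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as {outcome: probability} dicts."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue  # by convention, 0 * log(0 / q) contributes nothing
        total += px * math.log(px / q[x])  # weight the log ratio by P(x)
    return total

# Example: a fair coin (P) approximated by a heavily biased coin (Q).
P = {"heads": 0.5, "tails": 0.5}
Q = {"heads": 0.9, "tails": 0.1}
print(kl_divergence(P, Q))  # ≈ 0.511

Using the natural log gives the divergence in nats; using log base 2 would give bits instead.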
KL Divergence: Movie Recommendation System – Toy Example
• Let’s say we have a user who has rated 5 movies in the past. The ratings are on a scale
of 1 to 5, with 5 being the highest. Here are the ratings:
• Movie A: 5
• Movie B: 4
• Movie C: 5
• Movie D: 1
• Movie E: 2
• We can consider these ratings as the “true” distribution of the user’s preferences.
• Now, let’s say our recommendation system has some features for each movie (like
genre, director, actors, etc.). Based on these features, it predicts the following ratings
for the user’s preferences:
• Movie A: 4
• Movie B: 3
• Movie C: 5
• Movie D: 2
• Movie E: 3
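One way to make these ratings comparable as distributions, purely as an assumption of this sketch, is to normalize each set of ratings so it sums to 1 and then compute the divergence of the predicted distribution from the true one:

import math

true_ratings      = {"A": 5, "B": 4, "C": 5, "D": 1, "E": 2}  # user's actual ratings
predicted_ratings = {"A": 4, "B": 3, "C": 5, "D": 2, "E": 3}  # system's predicted ratings

def normalize(ratings):
    """Turn raw ratings into a probability distribution over the movies."""
    total = sum(ratings.values())
    return {movie: r / total for movie, r in ratings.items()}

P = normalize(true_ratings)       # "true" preference distribution
Q = normalize(predicted_ratings)  # estimated preference distribution

d_kl = sum(px * math.log(px / Q[x]) for x, px in P.items())
print(f"D_KL(P || Q) = {d_kl:.3f} nats")  # ≈ 0.045 for this toy example

A small value here would indicate that the predicted ratings lose little information about the user's true preferences; a larger value would signal a worse approximation.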