DSAI 514 – Statistical Inference

Point Estimation

Instructor: Ş. Betül Özateş


Boğaziçi University
19/02/2025

Sources: Devore, Jay L., Kenneth N. Berk, and Matthew A. Carlton (2012). Modern Mathematical Statistics with Applications. Vol. 285. New York: Springer,
and https://medium.com/@hosamedwee/kullback-leibler-kl-divergence-with-examples-part-2-9123bff5dc10
Last Lecture: KL Divergence

• The KL Divergence of Q from P, denoted DKL(P||Q), is a measure of the information lost when Q is used to approximate P.
• P is the true distribution, and Q is the estimated distribution.

• P(x) is the probability of event x according to the true distribution.
• This term is used as a weighting factor, meaning events that are more probable in the true distribution have a larger impact on the divergence.
• If an event is highly probable in P but not in Q, we want that to contribute more to our divergence measure.
• If an event is not very likely in P, we don't want it to contribute much to our divergence measure, even if Q assigns it a high probability.
• This is because we're measuring the divergence from P to Q, not the other way around.
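Putting these pieces together, the full definition for discrete distributions (the standard formula, whose terms the following slides unpack one by one) is:

DKL(P||Q) = Σ_x P(x) log( P(x) / Q(x) )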
Last Lecture: KL Divergence

• log(P(x)/Q(x)) is the logarithm of the ratio of the probabilities assigned to event x by P and Q.

• If P and Q assign the same probability to x, then this ratio is 1, and the logarithm of 1 is 0, so events that P and Q agree on don't contribute to the divergence.
• If P assigns more probability to x than Q does, then this ratio is greater than 1, and the logarithm is positive, so this event contributes positively to the divergence.
• If P assigns less probability to x than Q does, then this ratio is less than 1, and the logarithm is negative, but remember that we're multiplying this by P(x), so events that P assigns low probability to don't contribute much to the divergence. (The three cases are illustrated in the sketch below.)
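A minimal Python sketch of the per-event term for these three cases; the numbers are illustrative and not from the slides, and math.log is the natural log (another base only rescales the values):

```python
import math

def kl_term(p_x, q_x):
    """Per-event contribution P(x) * log(P(x)/Q(x)) to the KL Divergence."""
    return p_x * math.log(p_x / q_x)

print(kl_term(0.25, 0.25))  # P = Q: ratio is 1, log is 0, contributes 0.0
print(kl_term(0.50, 0.25))  # P > Q: positive contribution (about 0.347)
print(kl_term(0.01, 0.50))  # P < Q: negative, but tiny since P(x) is small (about -0.039)
```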
Last Lecture: KL Divergence

• P(x) log(P(x)/Q(x)):
For each outcome, we calculate how much probability P assigns to it, and then multiply it by the log of the ratio of the probabilities P and Q assign to it.
- This ratio tells us how much P and Q differ on this particular outcome.

• Finally, we sum over all possible outcomes. This gives us a single number that represents the total difference between P and Q.
• This means that the KL Divergence is a weighted sum of the log differences in probabilities, where the weights are the probabilities according to the first distribution (see the sketch below).
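A minimal sketch of this weighted sum, assuming each discrete distribution is given as a dictionary mapping outcomes to probabilities:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum over x of P(x) * log(P(x)/Q(x)).

    Assumes q[x] > 0 wherever p[x] > 0 (otherwise the divergence is
    infinite). Terms with P(x) = 0 contribute nothing and are skipped.
    """
    return sum(p_x * math.log(p_x / q[x]) for x, p_x in p.items() if p_x > 0)

p = {"a": 0.5, "b": 0.3, "c": 0.2}  # true distribution
q = {"a": 0.4, "b": 0.4, "c": 0.2}  # estimated distribution
print(kl_divergence(p, q))  # about 0.025: small, since P and Q mostly agree
```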
KL Divergence: Movie Recommendation System – Toy Example
• Let’s say we have a user who has rated 5 movies in the past. The ratings are on a scale
of 1 to 5, with 5 being the highest. Here are the ratings:
• Movie A: 5
• Movie B: 4
• Movie C: 5
• Movie D: 1
• Movie E: 2
• We can consider these ratings as the “true” distribution of the user’s preferences.

• Now, let’s say our recommendation system has some features for each movie (like
genre, director, actors, etc.). Based on these features, it predicts the following ratings
for the user’s preferences:
• Movie A: 4
• Movie B: 3
• Movie C: 5
• Movie D: 2
• Movie E: 3
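To compare these two rating sets with KL Divergence, they first have to be turned into probability distributions. The slides don't specify a conversion, so the sketch below uses one simple assumed convention: normalize each set of ratings to sum to 1.

```python
import math

# Ratings from the slides
true_ratings = {"A": 5, "B": 4, "C": 5, "D": 1, "E": 2}  # the user's actual ratings (P)
pred_ratings = {"A": 4, "B": 3, "C": 5, "D": 2, "E": 3}  # the system's predictions (Q)

def normalize(ratings):
    """Divide each rating by the total so the values sum to 1."""
    total = sum(ratings.values())
    return {movie: r / total for movie, r in ratings.items()}

p = normalize(true_ratings)  # both rating sets happen to sum to 17
q = normalize(pred_ratings)

d_kl = sum(p[m] * math.log(p[m] / q[m]) for m in p)
print(f"D_KL(P || Q) = {d_kl:.4f}")  # about 0.045: the predictions track P fairly closely
```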
