Relative Entropy
Lecture 3
Lecturer: Anup Rao Scribe: Prasang Upadhyaya
1 Introduction
In the previous lecture we looked at the application of entropy to derive inequalities that involved counting.
In this lecture we step back and introduce relative entropy and mutual information, two quantities that measure,
respectively, the distance between two distributions and the correlation between two random variables.
2 Relative Entropy
The relative entropy, also known as the Kullback-Leibler divergence, between two probability distributions on
a random variable is a measure of the distance between them. Formally, given two probability distributions
p(x) and q(x) over a discrete random variable X, the relative entropy given by D(p||q) is defined as follows:
    D(p||q) = \sum_{x ∈ \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
In the extreme case, when the two laws p and q are identical, the divergence is 0.
We will henceforth refer to relative entropy (Kullback-Leibler divergence) simply as divergence.
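To make the definition concrete, here is a minimal Python sketch (the helper name kl_divergence and the choice of base-2 logarithms are ours, not part of the lecture); terms with p(x) = 0 are taken to contribute 0, and we assume q(x) > 0 wherever p(x) > 0.

    import math

    def kl_divergence(p, q):
        """D(p||q) = sum over x of p(x) * log2(p(x)/q(x)).

        p and q are dicts mapping outcomes to probabilities; terms with
        p(x) = 0 contribute 0, and q(x) must be positive wherever p(x) is.
        """
        return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

    # Two distributions on {0, 1}: note that divergence is not symmetric in general.
    p = {0: 0.5, 1: 0.5}
    q = {0: 0.9, 1: 0.1}
    print(kl_divergence(p, q))  # ~0.737
    print(kl_divergence(q, p))  # ~0.531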
2. Divergence is always non-negative. This is because of the following:
    D(p||q) = \sum_{x ∈ \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
            = − \sum_{x ∈ \mathcal{X}} p(x) \log \frac{q(x)}{p(x)}
            = − E\left[ \log \frac{q}{p} \right]
            ≥ − \log E\left[ \frac{q}{p} \right]
            = − \log \left( \sum_{x ∈ \mathcal{X}} p(x) \frac{q(x)}{p(x)} \right)
            = 0
The inequality follows from Jensen's inequality and the concavity of log; the expectations are taken with respect to p, and the last equality uses \sum_x q(x) = 1.
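As a quick numeric sanity check of non-negativity, reusing the kl_divergence sketch above (random_dist is a hypothetical helper):

    import random

    def random_dist(n):
        """A random probability distribution on {0, ..., n-1}."""
        w = [random.random() for _ in range(n)]
        total = sum(w)
        return {x: w[x] / total for x in range(n)}

    # D(p||q) >= 0 for every pair of distributions (small slack for rounding).
    for _ in range(1000):
        assert kl_divergence(random_dist(5), random_dist(5)) >= -1e-12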
3. Divergence is jointly convex in the pair of distributions. Formally,
Lemma 1 (Convexity of divergence). Let p_1, q_1 and p_2, q_2 be probability distributions over a random
variable X, and for λ ∈ (0, 1) define

    p = λ p_1 + (1 − λ) p_2
    q = λ q_1 + (1 − λ) q_2

Then, D(p||q) ≤ λ D(p_1||q_1) + (1 − λ) D(p_2||q_2).
To prove the lemma, we shall use the log-sum inequality [1], which can be proved by reducing to
Jensen’s inequality:
Proposition 2 (Log-sum Inequality). If a_1, . . . , a_n, b_1, . . . , b_n are non-negative numbers, then

    \left( \sum_{i=1}^{n} a_i \right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i} ≤ \sum_{i=1}^{n} a_i \log \frac{a_i}{b_i}
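A small numeric spot-check of the proposition, continuing the Python sketches above (any base of logarithm works):

    # (sum a_i) * log(sum a_i / sum b_i) <= sum a_i * log(a_i / b_i)
    for _ in range(1000):
        a = [random.random() for _ in range(4)]
        b = [random.random() for _ in range(4)]
        lhs = sum(a) * math.log2(sum(a) / sum(b))
        rhs = sum(x * math.log2(x / y) for x, y in zip(a, b))
        assert lhs <= rhs + 1e-9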
Proof [of Lemma 1] Let a_1(x) = λ p_1(x), a_2(x) = (1 − λ) p_2(x) and b_1(x) = λ q_1(x), b_2(x) = (1 − λ) q_2(x).
Then,

    D(p||q) = \sum_x (λ p_1(x) + (1 − λ) p_2(x)) \log \frac{λ p_1(x) + (1 − λ) p_2(x)}{λ q_1(x) + (1 − λ) q_2(x)}
            = \sum_x (a_1(x) + a_2(x)) \log \frac{a_1(x) + a_2(x)}{b_1(x) + b_2(x)}
            ≤ \sum_x a_1(x) \log \frac{a_1(x)}{b_1(x)} + a_2(x) \log \frac{a_2(x)}{b_2(x)}        (using the log-sum inequality)
            = \sum_x λ p_1(x) \log \frac{λ p_1(x)}{λ q_1(x)} + (1 − λ) p_2(x) \log \frac{(1 − λ) p_2(x)}{(1 − λ) q_2(x)}
            = λ D(p_1||q_1) + (1 − λ) D(p_2||q_2)
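Lemma 1 can likewise be spot-checked numerically, reusing kl_divergence and random_dist from the sketches above:

    lam = 0.3
    for _ in range(1000):
        p1, q1, p2, q2 = (random_dist(5) for _ in range(4))
        p = {x: lam * p1[x] + (1 - lam) * p2[x] for x in p1}
        q = {x: lam * q1[x] + (1 - lam) * q2[x] for x in q1}
        assert kl_divergence(p, q) <= (lam * kl_divergence(p1, q1)
                                       + (1 - lam) * kl_divergence(p2, q2) + 1e-9)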
2.2 Relationship of Divergence with Entropy
Intuitively, the entropy of a random variable X with a probability distribution p(x) is related to how much
p(x) diverges from the uniform distribution on the support of X: the more p(x) diverges, the smaller its
entropy, and vice versa. Formally,
    H(X) = \sum_{x ∈ \mathcal{X}} p(x) \log \frac{1}{p(x)}
         = \log |\mathcal{X}| − \sum_{x ∈ \mathcal{X}} p(x) \log \frac{p(x)}{1/|\mathcal{X}|}
         = \log |\mathcal{X}| − D(p||u),

where u denotes the uniform distribution on \mathcal{X}.
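Continuing the Python sketches above, this identity is easy to verify numerically (entropy is a hypothetical helper name; the check reuses kl_divergence and random_dist):

    def entropy(p):
        """H(X) = sum over x of p(x) * log2(1/p(x))."""
        return sum(px * math.log2(1 / px) for px in p.values() if px > 0)

    n = 8
    p = random_dist(n)
    uniform = {x: 1 / n for x in range(n)}
    # H(X) = log|X| - D(p || uniform)
    assert abs(entropy(p) - (math.log2(n) - kl_divergence(p, uniform))) < 1e-9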
Given the above definition, we can prove the following chain rule for the divergence of joint probability
distributions.
Lemma 3 (Chain Rule). D(p(x, y)||q(x, y)) = D(p(x)||q(x)) + D(p(y|x)||q(y|x)), where the conditional divergence is
D(p(y|x)||q(y|x)) = \sum_x p(x) \sum_y p(y|x) \log \frac{p(y|x)}{q(y|x)}.
Proof
    D(p(x, y)||q(x, y)) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{q(x, y)}
                        = \sum_x \sum_y p(x) p(y|x) \log \frac{p(x) p(y|x)}{q(x) q(y|x)}
                        = \sum_x \sum_y p(x) p(y|x) \log \frac{p(x)}{q(x)} + \sum_x \sum_y p(x) p(y|x) \log \frac{p(y|x)}{q(y|x)}
                        = \sum_x p(x) \log \frac{p(x)}{q(x)} \sum_y p(y|x) + \sum_x p(x) \sum_y p(y|x) \log \frac{p(y|x)}{q(y|x)}
                        = D(p(x)||q(x)) + D(p(y|x)||q(y|x))
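The chain rule can be checked on a small joint distribution; here joints are represented as dicts keyed by pairs (x, y), and marginal_x and cond_divergence are hypothetical helpers built on the sketches above:

    def marginal_x(pxy):
        """Marginal of x under a joint distribution keyed by (x, y)."""
        m = {}
        for (x, y), v in pxy.items():
            m[x] = m.get(x, 0.0) + v
        return m

    def cond_divergence(pxy, qxy):
        """D(p(y|x)||q(y|x)) = sum_x p(x) sum_y p(y|x) log2(p(y|x)/q(y|x))."""
        px, qx = marginal_x(pxy), marginal_x(qxy)
        return sum(v * math.log2((v / px[x]) / (qxy[(x, y)] / qx[x]))
                   for (x, y), v in pxy.items() if v > 0)

    pxy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
    qxy = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
    lhs = kl_divergence(pxy, qxy)
    rhs = kl_divergence(marginal_x(pxy), marginal_x(qxy)) + cond_divergence(pxy, qxy)
    assert abs(lhs - rhs) < 1e-9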
3 Mutual Information
Mutual information is a measure of how correlated two random variables X and Y are: the closer they are to
being independent, the smaller their mutual information. Formally,
    I(X ∧ Y) = D(p(x, y)||p(x)p(y))
             = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}
             = \sum_{x,y} p(x, y) \log p(x, y) − \sum_{x,y} p(x, y) \log p(x) − \sum_{x,y} p(x, y) \log p(y)
             = −H(X, Y) + H(X) + H(Y)
             = H(X) − H(X|Y)
             = H(Y) − H(Y|X)
Here I(X ∧ Y) is the mutual information between X and Y, p(x, y) is the joint probability distribution, and p(x)
and p(y) are the marginal distributions of X and Y.
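Continuing the sketches above, mutual information can be computed directly from its definition as a divergence (marginal_y and mutual_information are hypothetical helper names):

    def marginal_y(pxy):
        """Marginal of y under a joint distribution keyed by (x, y)."""
        m = {}
        for (x, y), v in pxy.items():
            m[y] = m.get(y, 0.0) + v
        return m

    def mutual_information(pxy):
        """I(X ∧ Y) = D(p(x, y) || p(x)p(y))."""
        px, py = marginal_x(pxy), marginal_y(pxy)
        product = {(x, y): px[x] * py[y] for x in px for y in py}
        return kl_divergence(pxy, product)

    # For the joint distribution pxy above: I(X ∧ Y) = H(X) + H(Y) - H(X, Y).
    assert abs(mutual_information(pxy)
               - (entropy(marginal_x(pxy)) + entropy(marginal_y(pxy)) - entropy(pxy))) < 1e-9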
As before we define the conditional mutual information when conditioned upon a third random variable
Z to be
    I(X ∧ Y|Z) = E_z[I(X ∧ Y|Z = z)]
               = H(X|Z) − H(X|Y, Z)
This leads us to the following chain rule.
Lemma 4 (Chain Rule). I(X, Z ∧ Y) = I(X ∧ Y) + I(Z ∧ Y|X)
Proof

    I(X, Z ∧ Y) = H(X, Z) − H(X, Z|Y)
                = H(X) + H(Z|X) − H(X|Y) − H(Z|X, Y)
                = (H(X) − H(X|Y)) + (H(Z|X) − H(Z|X, Y))
                = I(X ∧ Y) + I(Z ∧ Y|X)
3.1 An Example
We now look at the effect of conditioning on mutual information, using the following two examples.
Example 1. Let X, Y, Z be uniform random bits conditioned on having zero parity, i.e., X ⊕ Y ⊕ Z = 0. Now,

    I(X ∧ Y|Z) = H(X|Z) − H(X|Y, Z) = 1 − 0 = 1

Here H(X|Z) = 1 since, given Z, X is still uniform over {0, 1}, while given Y and Z, X = Y ⊕ Z is determined. Meanwhile,

    I(X ∧ Y) = H(X) − H(X|Y) = 1 − 1 = 0
Example 2. Let A, B, C be independent uniform random bits. Define X = (A, B), Y = (A, C) and Z = A. Now,

    I(X ∧ Y|Z) = H(X|Z) − H(X|Y, Z) = 1 − 1 = 0

while

    I(X ∧ Y) = H(X) − H(X|Y) = 2 − 1 = 1
Thus, unlike entropy, which never increases under conditioning, mutual information may either decrease or increase when we condition on a third variable.
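Both examples can be verified with the helpers from the sketches above; cond_mutual_information is a hypothetical helper that averages I(X ∧ Y | Z = z) over z, as in the definition.

    def cond_mutual_information(pxyz):
        """I(X ∧ Y | Z) = E_z[ I(X ∧ Y | Z = z) ] for a joint keyed by (x, y, z)."""
        pz = {}
        for (x, y, z), v in pxyz.items():
            pz[z] = pz.get(z, 0.0) + v
        total = 0.0
        for z0, w in pz.items():
            cond = {(x, y): v / w for (x, y, z), v in pxyz.items() if z == z0}
            total += w * mutual_information(cond)
        return total

    # Example 1: uniform bits with X ⊕ Y ⊕ Z = 0.
    parity = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}
    pxy1 = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
    print(mutual_information(pxy1), cond_mutual_information(parity))  # 0.0 1.0

    # Example 2: X = (A, B), Y = (A, C), Z = A for uniform bits A, B, C.
    triple = {((a, b), (a, c), a): 0.125
              for a in (0, 1) for b in (0, 1) for c in (0, 1)}
    pxy2 = {((a, b), (a, c)): 0.125
            for a in (0, 1) for b in (0, 1) for c in (0, 1)}
    print(mutual_information(pxy2), cond_mutual_information(triple))  # 1.0 0.0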
3.2 Properties of Mutual Information
Lemma 5. If X, Y are independent and Z has an arbitrary probability distribution then,
Proof
Lemma 6. Let (X, Y) ∼ p(x, y) be the joint probability distribution of X and Y. By the chain rule,
p(x, y) = p(x)p(y|x) = p(y)p(x|y). For clarity we represent p(x) (resp. p(y)) by α and p(y|x) (resp. p(x|y))
by π, so that p(x, y) = απ. The following holds:

Concavity in p(x): For i ∈ {1, 2}, let I_i(X ∧ Y) be the mutual information for (X, Y) ∼ α_i π. For
λ_1, λ_2 ∈ [0, 1] such that λ_1 + λ_2 = 1, let I(X ∧ Y) be the mutual information for (X, Y) ∼ \sum_i λ_i α_i π.
Then,

    I(X ∧ Y) ≥ λ_1 I_1(X ∧ Y) + λ_2 I_2(X ∧ Y)

Convexity in p(y|x): For i ∈ {1, 2}, let I_i(X ∧ Y) be the mutual information for (X, Y) ∼ α π_i. For
λ_1, λ_2 ∈ [0, 1] such that λ_1 + λ_2 = 1, let I(X ∧ Y) be the mutual information for (X, Y) ∼ \sum_i λ_i α π_i.
Then,

    I(X ∧ Y) ≤ λ_1 I_1(X ∧ Y) + λ_2 I_2(X ∧ Y)
Proof We first prove convexity in p(y|x): we will apply Lemma 1 and use the definition of mutual
information in terms of divergence. The joint distribution is λ_1 α π_1 + λ_2 α π_2; its marginal on x is
\sum_y (λ_1 α π_1 + λ_2 α π_2) and its marginal on y is \sum_x (λ_1 α π_1 + λ_2 α π_2). Thus,

    I(X ∧ Y) = D( λ_1 α π_1 + λ_2 α π_2 || (\sum_y λ_1 α π_1 + λ_2 α π_2)(\sum_x λ_1 α π_1 + λ_2 α π_2) )
             = D( λ_1 α π_1 + λ_2 α π_2 || (λ_1 α \sum_y π_1 + λ_2 α \sum_y π_2)(\sum_x λ_1 α π_1 + λ_2 α π_2) )
             = D( λ_1 α π_1 + λ_2 α π_2 || λ_1 α \sum_x α π_1 + λ_2 α \sum_x α π_2 )
             = D( λ_1 α π_1 + λ_2 α π_2 || λ_1 (\sum_y α π_1)(\sum_x α π_1) + λ_2 (\sum_y α π_2)(\sum_x α π_2) )
             ≤ λ_1 D( α π_1 || (\sum_y α π_1)(\sum_x α π_1) ) + λ_2 D( α π_2 || (\sum_y α π_2)(\sum_x α π_2) )
             = λ_1 I_1(X ∧ Y) + λ_2 I_2(X ∧ Y)

Here we used the fact that \sum_y π_i = 1 (so that \sum_y α π_i = α) and used Lemma 1 to introduce the inequality.
We now prove concavity in p(x). We first simplify the LHS and the RHS. The joint distribution is
λ_1 α_1 π + λ_2 α_2 π; its marginal on x is \sum_y (λ_1 α_1 π + λ_2 α_2 π) = λ_1 α_1 + λ_2 α_2 and its marginal on y is
\sum_x (λ_1 α_1 π + λ_2 α_2 π). Thus,

    I(X ∧ Y) = \sum_{x,y} (λ_1 α_1 π + λ_2 α_2 π) \log \frac{λ_1 α_1 π + λ_2 α_2 π}{(\sum_y λ_1 α_1 π + λ_2 α_2 π)(\sum_x λ_1 α_1 π + λ_2 α_2 π)}
             = \sum_{x,y} (λ_1 α_1 π + λ_2 α_2 π) \log \frac{π}{\sum_x (λ_1 α_1 π + λ_2 α_2 π)}
             = \sum_{x,y} (λ_1 α_1 π + λ_2 α_2 π) \log π − \sum_{x,y} \sum_{i ∈ \{1,2\}} λ_i α_i π \log \left( \sum_x (λ_1 α_1 π + λ_2 α_2 π) \right)

while

    λ_1 I_1(X ∧ Y) + λ_2 I_2(X ∧ Y) = \sum_{x,y} \sum_{i ∈ \{1,2\}} λ_i α_i π \log \frac{α_i π}{(\sum_y α_i π)(\sum_x α_i π)}
                                    = \sum_{x,y} (λ_1 α_1 π + λ_2 α_2 π) \log π − \sum_{x,y} \sum_{i ∈ \{1,2\}} λ_i α_i π \log \left( \sum_x α_i π \right)

The first terms of the two expressions are identical, so the claimed inequality I(X ∧ Y) ≥ λ_1 I_1(X ∧ Y) + λ_2 I_2(X ∧ Y)
reduces to comparing the second terms; this comparison follows directly from an application of the log-sum
inequality [1] (for each fixed y, apply it with a_i = λ_i \sum_x α_i π and b_i = λ_i).
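A numeric spot-check of Lemma 6, reusing random_dist and mutual_information from the sketches above (joint_from and random_channel are hypothetical helpers that build p(x, y) = α(x)π(y|x)):

    def joint_from(alpha, pi):
        """Joint distribution p(x, y) = alpha(x) * pi[x][y]."""
        return {(x, y): alpha[x] * pi[x][y] for x in alpha for y in pi[x]}

    def random_channel(nx, ny):
        """A random conditional distribution pi[x][y] = pi(y|x)."""
        return {x: random_dist(ny) for x in range(nx)}

    lam = 0.4
    for _ in range(200):
        alpha, alpha1, alpha2 = (random_dist(3) for _ in range(3))
        pi, pi1, pi2 = (random_channel(3, 3) for _ in range(3))

        # Concavity in p(x): mix the marginals, keep the channel fixed.
        mix_a = {x: lam * alpha1[x] + (1 - lam) * alpha2[x] for x in alpha1}
        assert mutual_information(joint_from(mix_a, pi)) >= (
            lam * mutual_information(joint_from(alpha1, pi))
            + (1 - lam) * mutual_information(joint_from(alpha2, pi)) - 1e-9)

        # Convexity in p(y|x): mix the channels, keep the marginal fixed.
        mix_pi = {x: {y: lam * pi1[x][y] + (1 - lam) * pi2[x][y] for y in pi1[x]}
                  for x in pi1}
        assert mutual_information(joint_from(alpha, mix_pi)) <= (
            lam * mutual_information(joint_from(alpha, pi1))
            + (1 - lam) * mutual_information(joint_from(alpha, pi2)) + 1e-9)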
References
[1] Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley-Interscience, New York,
NY, USA, 1991.