Relative Entropy
Lecture 3
Lecturer: Anup Rao Scribe: Prasang Upadhyaya
1 Introduction
In the previous lecture we looked at the application of entropy to derive inequalities that involved counting.
In this lecture we step back and introduce relative entropy and mutual information, two quantities that measure,
respectively, the distance between two distributions and the correlation between two random variables.
2 Relative Entropy
The relative entropy, also known as the Kullback-Leibler divergence, between two probability distributions on
a random variable is a measure of the distance between them. Formally, given two probability distributions
p(x) and q(x) over a discrete random variable X, the relative entropy given by D(p||q) is defined as follows:
    D(p||q) = \sum_{x ∈ \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
In the extreme case, when the two laws p and q are identical, the divergence is 0.
We will henceforth refer to relative entropy (Kullback-Leibler divergence) simply as divergence.
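To make the definition concrete, here is a minimal Python sketch (the helper name kl_divergence and the choice of base-2 logarithms are ours, not part of the lecture); terms with p(x) = 0 are taken to contribute 0, and we assume q(x) > 0 wherever p(x) > 0.

    import math

    def kl_divergence(p, q):
        """D(p||q) = sum over x of p(x) * log2(p(x)/q(x)).

        p and q are dicts mapping outcomes to probabilities; terms with
        p(x) = 0 contribute 0, and q(x) must be positive wherever p(x) is.
        """
        return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

    # Two distributions on {0, 1}: note that divergence is not symmetric in general.
    p = {0: 0.5, 1: 0.5}
    q = {0: 0.9, 1: 0.1}
    print(kl_divergence(p, q))  # ~0.737
    print(kl_divergence(q, p))  # ~0.531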
2. Divergence is always non-negative. This is because of the following:
    D(p||q) = \sum_{x ∈ \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
            = − \sum_{x ∈ \mathcal{X}} p(x) \log \frac{q(x)}{p(x)}
            = − E\left[ \log \frac{q}{p} \right]
            ≥ − \log E\left[ \frac{q}{p} \right]
            = − \log \left( \sum_{x ∈ \mathcal{X}} p(x) \frac{q(x)}{p(x)} \right)
            = 0
The inequality follows from Jensen's inequality and the concavity of log; the expectations are taken with respect to p, and the last equality uses \sum_x q(x) = 1.
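As a quick numeric sanity check of non-negativity, reusing the kl_divergence sketch above (random_dist is a hypothetical helper):

    import random

    def random_dist(n):
        """A random probability distribution on {0, ..., n-1}."""
        w = [random.random() for _ in range(n)]
        total = sum(w)
        return {x: w[x] / total for x in range(n)}

    # D(p||q) >= 0 for every pair of distributions (small slack for rounding).
    for _ in range(1000):
        assert kl_divergence(random_dist(5), random_dist(5)) >= -1e-12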
3. Divergence is jointly convex in the pair of distributions. Formally,
Lemma 1 (Convexity of divergence). Let p_1, q_1 and p_2, q_2 be probability distributions over a random
variable X, and for λ ∈ (0, 1) define

    p = λ p_1 + (1 − λ) p_2
    q = λ q_1 + (1 − λ) q_2

Then, D(p||q) ≤ λ D(p_1||q_1) + (1 − λ) D(p_2||q_2).
To prove the lemma, we shall use the log-sum inequality [1], which can be proved by reducing to
Jensen’s inequality:
Proposition 2 (Log-sum Inequality). If a_1, . . . , a_n, b_1, . . . , b_n are non-negative numbers, then

    \left( \sum_{i=1}^{n} a_i \right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i} ≤ \sum_{i=1}^{n} a_i \log \frac{a_i}{b_i}
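A small numeric spot-check of the proposition, continuing the Python sketches above (any base of logarithm works):

    # (sum a_i) * log(sum a_i / sum b_i) <= sum a_i * log(a_i / b_i)
    for _ in range(1000):
        a = [random.random() for _ in range(4)]
        b = [random.random() for _ in range(4)]
        lhs = sum(a) * math.log2(sum(a) / sum(b))
        rhs = sum(x * math.log2(x / y) for x, y in zip(a, b))
        assert lhs <= rhs + 1e-9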
Proof [of Lemma 1] Let a_1(x) = λ p_1(x), a_2(x) = (1 − λ) p_2(x) and b_1(x) = λ q_1(x), b_2(x) = (1 − λ) q_2(x).
Then,

    D(p||q) = \sum_x (λ p_1(x) + (1 − λ) p_2(x)) \log \frac{λ p_1(x) + (1 − λ) p_2(x)}{λ q_1(x) + (1 − λ) q_2(x)}
            = \sum_x (a_1(x) + a_2(x)) \log \frac{a_1(x) + a_2(x)}{b_1(x) + b_2(x)}
            ≤ \sum_x a_1(x) \log \frac{a_1(x)}{b_1(x)} + a_2(x) \log \frac{a_2(x)}{b_2(x)}        (using the log-sum inequality)
            = \sum_x λ p_1(x) \log \frac{λ p_1(x)}{λ q_1(x)} + (1 − λ) p_2(x) \log \frac{(1 − λ) p_2(x)}{(1 − λ) q_2(x)}
            = λ D(p_1||q_1) + (1 − λ) D(p_2||q_2)
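Lemma 1 can likewise be spot-checked numerically, reusing kl_divergence and random_dist from the sketches above:

    lam = 0.3
    for _ in range(1000):
        p1, q1, p2, q2 = (random_dist(5) for _ in range(4))
        p = {x: lam * p1[x] + (1 - lam) * p2[x] for x in p1}
        q = {x: lam * q1[x] + (1 - lam) * q2[x] for x in q1}
        assert kl_divergence(p, q) <= (lam * kl_divergence(p1, q1)
                                       + (1 - lam) * kl_divergence(p2, q2) + 1e-9)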
2.2 Relationship of Divergence with Entropy
Intuitively, the entropy of a random variable X with a probability distribution p(x) is related to how much
p(x) diverges from the uniform distribution on the support of X: the more p(x) diverges, the smaller its
entropy, and vice versa. Formally,
    H(X) = \sum_{x ∈ \mathcal{X}} p(x) \log \frac{1}{p(x)}
         = \log |\mathcal{X}| − \sum_{x ∈ \mathcal{X}} p(x) \log \frac{p(x)}{1/|\mathcal{X}|}
         = \log |\mathcal{X}| − D(p||u),

where u denotes the uniform distribution on \mathcal{X}.
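Continuing the Python sketches above, this identity is easy to verify numerically (entropy is a hypothetical helper name; the check reuses kl_divergence and random_dist):

    def entropy(p):
        """H(X) = sum over x of p(x) * log2(1/p(x))."""
        return sum(px * math.log2(1 / px) for px in p.values() if px > 0)

    n = 8
    p = random_dist(n)
    uniform = {x: 1 / n for x in range(n)}
    # H(X) = log|X| - D(p || uniform)
    assert abs(entropy(p) - (math.log2(n) - kl_divergence(p, uniform))) < 1e-9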
Given the above definition, we can prove the following chain rule for the divergence of joint probability
distributions.
Lemma 3 (Chain Rule). D(p(x, y)||q(x, y)) = D(p(x)||q(x)) + D(p(y|x)||q(y|x)), where the conditional divergence is
D(p(y|x)||q(y|x)) = \sum_x p(x) \sum_y p(y|x) \log \frac{p(y|x)}{q(y|x)}.
Proof
    D(p(x, y)||q(x, y)) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{q(x, y)}
                        = \sum_x \sum_y p(x) p(y|x) \log \frac{p(x) p(y|x)}{q(x) q(y|x)}
                        = \sum_x \sum_y p(x) p(y|x) \log \frac{p(x)}{q(x)} + \sum_x \sum_y p(x) p(y|x) \log \frac{p(y|x)}{q(y|x)}
                        = \sum_x p(x) \log \frac{p(x)}{q(x)} \sum_y p(y|x) + \sum_x p(x) \sum_y p(y|x) \log \frac{p(y|x)}{q(y|x)}
                        = D(p(x)||q(x)) + D(p(y|x)||q(y|x))
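The chain rule can be checked on a small joint distribution; here joints are represented as dicts keyed by pairs (x, y), and marginal_x and cond_divergence are hypothetical helpers built on the sketches above:

    def marginal_x(pxy):
        """Marginal of x under a joint distribution keyed by (x, y)."""
        m = {}
        for (x, y), v in pxy.items():
            m[x] = m.get(x, 0.0) + v
        return m

    def cond_divergence(pxy, qxy):
        """D(p(y|x)||q(y|x)) = sum_x p(x) sum_y p(y|x) log2(p(y|x)/q(y|x))."""
        px, qx = marginal_x(pxy), marginal_x(qxy)
        return sum(v * math.log2((v / px[x]) / (qxy[(x, y)] / qx[x]))
                   for (x, y), v in pxy.items() if v > 0)

    pxy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
    qxy = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
    lhs = kl_divergence(pxy, qxy)
    rhs = kl_divergence(marginal_x(pxy), marginal_x(qxy)) + cond_divergence(pxy, qxy)
    assert abs(lhs - rhs) < 1e-9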
3 Mutual Information
Mutual information is a measure of how correlated two random variables X and Y are: the closer they are to
being independent, the smaller their mutual information. Formally,
    I(X ∧ Y) = D(p(x, y)||p(x)p(y))
             = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}
             = \sum_{x,y} p(x, y) \log p(x, y) − \sum_{x,y} p(x, y) \log p(x) − \sum_{x,y} p(x, y) \log p(y)
             = −H(X, Y) + H(X) + H(Y)
             = H(X) − H(X|Y)
             = H(Y) − H(Y|X)
Here I(X ∧ Y) is the mutual information between X and Y, p(x, y) is the joint probability distribution, and p(x)
and p(y) are the marginal distributions of X and Y.
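Continuing the sketches above, mutual information can be computed directly from its definition as a divergence (marginal_y and mutual_information are hypothetical helper names):

    def marginal_y(pxy):
        """Marginal of y under a joint distribution keyed by (x, y)."""
        m = {}
        for (x, y), v in pxy.items():
            m[y] = m.get(y, 0.0) + v
        return m

    def mutual_information(pxy):
        """I(X ∧ Y) = D(p(x, y) || p(x)p(y))."""
        px, py = marginal_x(pxy), marginal_y(pxy)
        product = {(x, y): px[x] * py[y] for x in px for y in py}
        return kl_divergence(pxy, product)

    # For the joint distribution pxy above: I(X ∧ Y) = H(X) + H(Y) - H(X, Y).
    assert abs(mutual_information(pxy)
               - (entropy(marginal_x(pxy)) + entropy(marginal_y(pxy)) - entropy(pxy))) < 1e-9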
As before we define the conditional mutual information when conditioned upon a third random variable
Z to be
    I(X ∧ Y|Z) = E_z[I(X ∧ Y|Z = z)]
               = H(X|Z) − H(X|Y, Z)
This leads us to the following chain rule.
Lemma 4 (Chain Rule). I(X, Z ∧ Y) = I(X ∧ Y) + I(Z ∧ Y|X)
Proof

    I(X, Z ∧ Y) = H(X, Z) − H(X, Z|Y)
                = H(X) + H(Z|X) − H(X|Y) − H(Z|X, Y)
                = (H(X) − H(X|Y)) + (H(Z|X) − H(Z|X, Y))
                = I(X ∧ Y) + I(Z ∧ Y|X)
3.1 An Example
We now look at the effect of conditioning on mutual information, using the following two examples.
Example 1. Let X, Y, Z be uniform random bits conditioned on having zero parity, i.e., X ⊕ Y ⊕ Z = 0. Now,

    I(X ∧ Y|Z) = H(X|Z) − H(X|Y, Z) = 1 − 0 = 1

Here H(X|Z) = 1 since, given Z, X is still uniform over {0, 1}, while given Y and Z, X = Y ⊕ Z is determined. Meanwhile,

    I(X ∧ Y) = H(X) − H(X|Y) = 1 − 1 = 0
Example 2. Let A, B, C be independent uniform random bits. Define X = (A, B), Y = (A, C) and Z = A. Now,

    I(X ∧ Y|Z) = H(X|Z) − H(X|Y, Z) = 1 − 1 = 0

while

    I(X ∧ Y) = H(X) − H(X|Y) = 2 − 1 = 1
Thus, unlike entropy, which never increases under conditioning, mutual information may either decrease or increase when we condition on a third variable.
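Both examples can be verified with the helpers from the sketches above; cond_mutual_information is a hypothetical helper that averages I(X ∧ Y | Z = z) over z, as in the definition.

    def cond_mutual_information(pxyz):
        """I(X ∧ Y | Z) = E_z[ I(X ∧ Y | Z = z) ] for a joint keyed by (x, y, z)."""
        pz = {}
        for (x, y, z), v in pxyz.items():
            pz[z] = pz.get(z, 0.0) + v
        total = 0.0
        for z0, w in pz.items():
            cond = {(x, y): v / w for (x, y, z), v in pxyz.items() if z == z0}
            total += w * mutual_information(cond)
        return total

    # Example 1: uniform bits with X ⊕ Y ⊕ Z = 0.
    parity = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}
    pxy1 = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
    print(mutual_information(pxy1), cond_mutual_information(parity))  # 0.0 1.0

    # Example 2: X = (A, B), Y = (A, C), Z = A for uniform bits A, B, C.
    triple = {((a, b), (a, c), a): 0.125
              for a in (0, 1) for b in (0, 1) for c in (0, 1)}
    pxy2 = {((a, b), (a, c)): 0.125
            for a in (0, 1) for b in (0, 1) for c in (0, 1)}
    print(mutual_information(pxy2), cond_mutual_information(triple))  # 1.0 0.0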
3.2 Properties of Mutual Information
Lemma 5. If X, Y are independent and Z has an arbitrary probability distribution then,
Proof
Lemma 6. Let (X, Y) ∼ p(x, y) be the joint probability distribution of X and Y. By the chain rule,
p(x, y) = p(x)p(y|x) = p(y)p(x|y). For clarity we represent p(x) (resp. p(y)) by α and p(y|x) (resp. p(x|y))
by π, so that p(x, y) = απ. The following holds:

Concavity in p(x): For i ∈ {1, 2}, let I_i(X ∧ Y) be the mutual information for (X, Y) ∼ α_i π. For
λ_1, λ_2 ∈ [0, 1] such that λ_1 + λ_2 = 1, let I(X ∧ Y) be the mutual information for (X, Y) ∼ \sum_i λ_i α_i π.
Then,

    I(X ∧ Y) ≥ λ_1 I_1(X ∧ Y) + λ_2 I_2(X ∧ Y)

Convexity in p(y|x): For i ∈ {1, 2}, let I_i(X ∧ Y) be the mutual information for (X, Y) ∼ α π_i. For
λ_1, λ_2 ∈ [0, 1] such that λ_1 + λ_2 = 1, let I(X ∧ Y) be the mutual information for (X, Y) ∼ \sum_i λ_i α π_i.
Then,

    I(X ∧ Y) ≤ λ_1 I_1(X ∧ Y) + λ_2 I_2(X ∧ Y)
Proof We first prove convexity in p(y|x): we will apply Lemma 1 and use the definition of mutual
information in terms of divergence. The joint distribution is λ_1 α π_1 + λ_2 α π_2; its marginal on x is
\sum_y (λ_1 α π_1 + λ_2 α π_2) and its marginal on y is \sum_x (λ_1 α π_1 + λ_2 α π_2). Thus,

    I(X ∧ Y) = D( λ_1 α π_1 + λ_2 α π_2 || (\sum_y λ_1 α π_1 + λ_2 α π_2)(\sum_x λ_1 α π_1 + λ_2 α π_2) )
             = D( λ_1 α π_1 + λ_2 α π_2 || (λ_1 α \sum_y π_1 + λ_2 α \sum_y π_2)(\sum_x λ_1 α π_1 + λ_2 α π_2) )
             = D( λ_1 α π_1 + λ_2 α π_2 || λ_1 α \sum_x α π_1 + λ_2 α \sum_x α π_2 )
             = D( λ_1 α π_1 + λ_2 α π_2 || λ_1 (\sum_y α π_1)(\sum_x α π_1) + λ_2 (\sum_y α π_2)(\sum_x α π_2) )
             ≤ λ_1 D( α π_1 || (\sum_y α π_1)(\sum_x α π_1) ) + λ_2 D( α π_2 || (\sum_y α π_2)(\sum_x α π_2) )
             = λ_1 I_1(X ∧ Y) + λ_2 I_2(X ∧ Y)

Here we used the fact that \sum_y π_i = 1 (so that \sum_y α π_i = α) and used Lemma 1 to introduce the inequality.
We now prove concavity in p(x). We first simplify the LHS and the RHS. The joint distribution is
λ_1 α_1 π + λ_2 α_2 π; its marginal on x is \sum_y (λ_1 α_1 π + λ_2 α_2 π) = λ_1 α_1 + λ_2 α_2 and its marginal on y is
\sum_x (λ_1 α_1 π + λ_2 α_2 π). Thus,

    I(X ∧ Y) = \sum_{x,y} (λ_1 α_1 π + λ_2 α_2 π) \log \frac{λ_1 α_1 π + λ_2 α_2 π}{(\sum_y λ_1 α_1 π + λ_2 α_2 π)(\sum_x λ_1 α_1 π + λ_2 α_2 π)}
             = \sum_{x,y} (λ_1 α_1 π + λ_2 α_2 π) \log \frac{π}{\sum_x (λ_1 α_1 π + λ_2 α_2 π)}
             = \sum_{x,y} (λ_1 α_1 π + λ_2 α_2 π) \log π − \sum_{x,y} \sum_{i ∈ \{1,2\}} λ_i α_i π \log \left( \sum_x (λ_1 α_1 π + λ_2 α_2 π) \right)

while

    λ_1 I_1(X ∧ Y) + λ_2 I_2(X ∧ Y) = \sum_{x,y} \sum_{i ∈ \{1,2\}} λ_i α_i π \log \frac{α_i π}{(\sum_y α_i π)(\sum_x α_i π)}
                                    = \sum_{x,y} (λ_1 α_1 π + λ_2 α_2 π) \log π − \sum_{x,y} \sum_{i ∈ \{1,2\}} λ_i α_i π \log \left( \sum_x α_i π \right)

The first terms of the two expressions are identical, so the claimed inequality I(X ∧ Y) ≥ λ_1 I_1(X ∧ Y) + λ_2 I_2(X ∧ Y)
reduces to comparing the second terms; this comparison follows directly from an application of the log-sum
inequality [1] (for each fixed y, apply it with a_i = λ_i \sum_x α_i π and b_i = λ_i).
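A numeric spot-check of Lemma 6, reusing random_dist and mutual_information from the sketches above (joint_from and random_channel are hypothetical helpers that build p(x, y) = α(x)π(y|x)):

    def joint_from(alpha, pi):
        """Joint distribution p(x, y) = alpha(x) * pi[x][y]."""
        return {(x, y): alpha[x] * pi[x][y] for x in alpha for y in pi[x]}

    def random_channel(nx, ny):
        """A random conditional distribution pi[x][y] = pi(y|x)."""
        return {x: random_dist(ny) for x in range(nx)}

    lam = 0.4
    for _ in range(200):
        alpha, alpha1, alpha2 = (random_dist(3) for _ in range(3))
        pi, pi1, pi2 = (random_channel(3, 3) for _ in range(3))

        # Concavity in p(x): mix the marginals, keep the channel fixed.
        mix_a = {x: lam * alpha1[x] + (1 - lam) * alpha2[x] for x in alpha1}
        assert mutual_information(joint_from(mix_a, pi)) >= (
            lam * mutual_information(joint_from(alpha1, pi))
            + (1 - lam) * mutual_information(joint_from(alpha2, pi)) - 1e-9)

        # Convexity in p(y|x): mix the channels, keep the marginal fixed.
        mix_pi = {x: {y: lam * pi1[x][y] + (1 - lam) * pi2[x][y] for y in pi1[x]}
                  for x in pi1}
        assert mutual_information(joint_from(alpha, mix_pi)) <= (
            lam * mutual_information(joint_from(alpha, pi1))
            + (1 - lam) * mutual_information(joint_from(alpha, pi2)) + 1e-9)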
References
[1] Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley-Interscience, New York,
NY, USA, 1991.