
CSE533: Information Theory in Computer Science October 6, 2010

Lecture 3
Lecturer: Anup Rao Scribe: Prasang Upadhyaya

1 Introduction
In the previous lecture we looked at the application of entropy to derive inequalities that involved counting.
In this lecture we step back and introduce the concepts of relative entropy and mutual information, which measure two kinds of relationship: the distance between two probability distributions and the dependence between two random variables.

2 Relative Entropy
The relative entropy, also known as the Kullback-Leibler divergence, between two probability distributions on
a random variable is a measure of the distance between them. Formally, given two probability distributions
p(x) and q(x) over a discrete random variable X, the relative entropy D(p||q) is defined as follows:
\[
D(p\|q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
\]

In the definition above we use the conventions 0 log(0/0) = 0, 0 log(0/q) = 0 and p log(p/0) = ∞.
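As a concrete illustration of this definition and the conventions above, here is a minimal numerical sketch (the dictionary representation and the name kl_divergence are illustrative choices; logarithms are base 2, so the result is in bits):

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(p||q) for dicts mapping outcomes to probabilities."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue                 # convention: 0 log(0/q) = 0
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf          # convention: p log(p/0) = infinity
        total += px * math.log2(px / qx)
    return total

# Divergence is not symmetric: the two values below differ.
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}
print(kl_divergence(p, q), kl_divergence(q, p))
```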


As an example, consider a random variable X with the law q(x). We assume nothing about q(x). Now
consider a set E ⊆ X and define p(x) to be the law of X|X∈E . The divergence between p and q:
\begin{align*}
D(p\|q) &= \sum_{x \in \mathcal{X}} \Pr[X = x \mid X \in E] \log \frac{\Pr[X = x \mid X \in E]}{\Pr[X = x]} \\
&= \sum_{x \in E} \Pr[X = x \mid X \in E] \log \frac{\Pr[X = x \mid X \in E]}{\Pr[X = x]} && \text{(using $0 \log 0 = 0$)} \\
&= \sum_{x \in E} \Pr[X = x \mid X \in E] \log \frac{\Pr[X = x \mid X \in E]}{\Pr[X = x \mid X \in E]\,\Pr[X \in E]} && \text{(using the chain rule)} \\
&= \sum_{x \in E} \Pr[X = x \mid X \in E] \log \frac{1}{\Pr[X \in E]} \\
&= \log \frac{1}{\Pr[X \in E]}
\end{align*}

In the extreme case with E = X , the two laws p and q are identical with a divergence of 0.
We will henceforth refer to relative entropy (Kullback-Leibler divergence) simply as divergence.
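The conditioning example above is easy to check numerically. The sketch below assumes, purely for illustration, a uniform law q on six outcomes and the event E = {1, 2}; the divergence comes out to log(1/Pr[X ∈ E]) bits:

```python
import math

q = {x: 1 / 6 for x in range(1, 7)}      # an arbitrary illustrative law q(x)
E = {1, 2}
prob_E = sum(q[x] for x in E)
p = {x: (q[x] / prob_E if x in E else 0.0) for x in q}   # law of X | X in E

divergence = sum(p[x] * math.log2(p[x] / q[x]) for x in q if p[x] > 0)
print(divergence, math.log2(1 / prob_E))                 # both are log2(3) bits
```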

2.1 Properties of Divergence


1. Divergence is not symmetric. That is, D(p||q) = D(q||p) is not necessarily true. For example, in the example of the previous section, D(p||q) = log(1/Pr[X ∈ E]) is finite, while D(q||p) = ∞ if ∃x ∈ X \ E : q(x) > 0.

2. Divergence is always non-negative. This is because of the following:
\begin{align*}
D(p\|q) &= \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \\
&= -\sum_{x \in \mathcal{X}} p(x) \log \frac{q(x)}{p(x)} \\
&= -\mathbb{E}\left[\log \frac{q}{p}\right] \\
&\ge -\log \mathbb{E}\left[\frac{q}{p}\right] \\
&= -\log \left(\sum_{x \in \mathcal{X}} p(x)\,\frac{q(x)}{p(x)}\right) \\
&= 0
\end{align*}

The inequality is introduced due to the application of Jensen’s inequality and the concavity of log.
3. Divergence is a convex function on the domain of probability distributions. Formally,
Lemma 1 (Convexity of divergence). Let p1, q1 and p2, q2 be probability distributions over a random variable X and for every λ ∈ (0, 1) define
\[
p = \lambda p_1 + (1-\lambda) p_2, \qquad q = \lambda q_1 + (1-\lambda) q_2 .
\]
Then, D(p||q) ≤ λD(p1||q1) + (1 − λ)D(p2||q2).

To prove the lemma, we shall use the log-sum inequality [1], which can be proved by reducing to
Jensen’s inequality:
Proposition 2 (Log-sum inequality). If a1, . . . , an, b1, . . . , bn are non-negative numbers, then
\[
\left(\sum_{i=1}^n a_i\right) \log \frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i} \;\le\; \sum_{i=1}^n a_i \log \frac{a_i}{b_i} .
\]

Proof [of Lemma 1] Let a1 (x) = λp1 (x), a2 (x) = (1−λ)p2 (x) and b1 (x) = λq1 (x), b2 (x) = (1−λ)q2 (x).
Then,
\begin{align*}
D(p\|q) &= \sum_x \left(\lambda p_1(x) + (1-\lambda)p_2(x)\right) \log \frac{\lambda p_1(x) + (1-\lambda)p_2(x)}{\lambda q_1(x) + (1-\lambda)q_2(x)} \\
&= \sum_x \left(a_1(x) + a_2(x)\right) \log \frac{a_1(x) + a_2(x)}{b_1(x) + b_2(x)} \\
&\le \sum_x \left( a_1(x) \log \frac{a_1(x)}{b_1(x)} + a_2(x) \log \frac{a_2(x)}{b_2(x)} \right) && \text{(using the log-sum inequality)} \\
&= \sum_x \left( \lambda p_1(x) \log \frac{\lambda p_1(x)}{\lambda q_1(x)} + (1-\lambda)p_2(x) \log \frac{(1-\lambda)p_2(x)}{(1-\lambda)q_2(x)} \right) \\
&= \lambda D(p_1\|q_1) + (1-\lambda) D(p_2\|q_2)
\end{align*}
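Lemma 1 is easy to sanity-check numerically; in the sketch below the distributions and the weight λ are arbitrary randomly generated choices:

```python
import math, random

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rand_dist(n):
    w = [random.random() for _ in range(n)]
    return [wi / sum(w) for wi in w]

random.seed(0)
p1, q1, p2, q2 = (rand_dist(4) for _ in range(4))
lam = 0.3
p = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]
q = [lam * a + (1 - lam) * b for a, b in zip(q1, q2)]

# Convexity: the divergence of the mixtures is at most the mixture of divergences.
assert kl(p, q) <= lam * kl(p1, q1) + (1 - lam) * kl(p2, q2) + 1e-12
```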

2.2 Relationship of Divergence with Entropy
Intuitively, the entropy of a random variable X with a probability distribution p(x) is related to how much
p(x) diverges from the uniform distribution on the support of X: the more p(x) diverges from uniform, the smaller its entropy, and vice versa. Formally,
\begin{align*}
H(X) &= \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)} \\
&= \log |\mathcal{X}| - \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{1/|\mathcal{X}|} \\
&= \log |\mathcal{X}| - D(p\,\|\,\mathrm{uniform})
\end{align*}
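This identity can be verified directly; the distribution p below is an arbitrary illustrative choice:

```python
import math

p = [0.5, 0.25, 0.125, 0.125]            # arbitrary distribution on 4 outcomes
n = len(p)
entropy = -sum(px * math.log2(px) for px in p if px > 0)
d_uniform = sum(px * math.log2(px * n) for px in p if px > 0)   # D(p || uniform)
print(entropy, math.log2(n) - d_uniform)                        # both 1.75 bits
```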

2.3 Conditional Divergence


Given the joint probability distributions p(x, y) and q(x, y) of two discrete random variables X and Y, the conditional divergence between the conditional probability distributions p(y|x) and q(y|x) is obtained by computing the divergence between p(·|x) and q(·|x) for each value of x ∈ X and then averaging over these values of x with weight p(x). Formally,
\[
D(p(y|x)\,\|\,q(y|x)) = \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{p(y|x)}{q(y|x)}
\]

Given the above definition we can prove the following chain rule about divergence of joint probability
distribution functions.
Lemma 3 (Chain Rule).

D (p(x, y)||q(x, y)) = D (p(x)||q(x)) + D (p(y|x)||q(y|x))

Proof
\begin{align*}
D\left(p(x,y)\,\|\,q(x,y)\right) &= \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{q(x,y)} \\
&= \sum_x \sum_y p(x)p(y|x) \log \frac{p(x)p(y|x)}{q(x)q(y|x)} \\
&= \sum_x \sum_y p(x)p(y|x) \log \frac{p(x)}{q(x)} + \sum_x \sum_y p(x)p(y|x) \log \frac{p(y|x)}{q(y|x)} \\
&= \sum_x p(x) \log \frac{p(x)}{q(x)} \sum_y p(y|x) + \sum_x p(x) \sum_y p(y|x) \log \frac{p(y|x)}{q(y|x)} \\
&= D\left(p(x)\,\|\,q(x)\right) + D\left(p(y|x)\,\|\,q(y|x)\right)
\end{align*}
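The chain rule can be checked on a small example; the joint laws p and q below are arbitrary illustrative choices, and the conditional divergence is computed directly from its definition:

```python
import math
from collections import defaultdict

p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}     # arbitrary joint laws
q = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

def marginal_x(joint):
    m = defaultdict(float)
    for (x, _), v in joint.items():
        m[x] += v
    return m

px, qx = marginal_x(p), marginal_x(q)
d_joint = sum(v * math.log2(v / q[k]) for k, v in p.items() if v > 0)
d_marg = sum(v * math.log2(v / qx[x]) for x, v in px.items() if v > 0)
d_cond = sum(v * math.log2((v / px[x]) / (q[(x, y)] / qx[x]))  # D(p(y|x)||q(y|x))
             for (x, y), v in p.items() if v > 0)
print(d_joint, d_marg + d_cond)                                # the two agree
```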

3 Mutual Information
Mutual information is a measure of how correlated two random variables X and Y are: the closer the variables are to being independent, the smaller their mutual information. Formally,
\begin{align*}
I(X \wedge Y) &= D\left(p(x,y)\,\|\,p(x)p(y)\right) \\
&= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \\
&= \sum_{x,y} p(x,y) \log p(x,y) - \sum_{x,y} p(x,y) \log p(x) - \sum_{x,y} p(x,y) \log p(y) \\
&= -H(X,Y) + H(X) + H(Y) \\
&= H(X) - H(X|Y) \\
&= H(Y) - H(Y|X)
\end{align*}
Here I(X ∧ Y) is the mutual information between X and Y, p(x, y) is the joint probability distribution, and p(x) and p(y) are the marginal distributions of X and Y.
As before we define the conditional mutual information when conditioned upon a third random variable
Z to be
\begin{align*}
I(X \wedge Y \mid Z) &= \mathbb{E}_z\left[I(X \wedge Y \mid Z = z)\right] \\
&= H(X|Z) - H(X|Y,Z)
\end{align*}
This leads us to the following chain rule.
Lemma 4 (Chain Rule). I(X, Z ∧ Y ) = I(X ∧ Y ) + I(Z ∧ Y |X)
Proof
I(X, Z ∧ Y ) = H(X, Z) − H(X, Z|Y )
= H(X) + H(Z|X) − H(X|Y ) − H(Z|X, Y )
= (H(X) − H(X|Y )) + (H(Z|X) − H(Z|X, Y ))
= I(X ∧ Y ) + I(Z ∧ Y |X)
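Lemma 4 can be verified numerically by expanding every term into entropies, exactly as in the proof; the joint law below is an arbitrary randomly generated choice over three binary variables:

```python
import itertools, math, random

random.seed(0)
raw = {xyz: random.random() for xyz in itertools.product([0, 1], repeat=3)}
total = sum(raw.values())
pxyz = {k: v / total for k, v in raw.items()}        # joint law of (X, Y, Z)

def H(indices):
    """Entropy of the marginal on the given coordinates, in bits."""
    marg = {}
    for k, v in pxyz.items():
        key = tuple(k[i] for i in indices)
        marg[key] = marg.get(key, 0.0) + v
    return -sum(v * math.log2(v) for v in marg.values() if v > 0)

X, Y, Z = 0, 1, 2
lhs = H([X, Z]) + H([Y]) - H([X, Y, Z])                  # I(X, Z ^ Y)
i_xy = H([X]) + H([Y]) - H([X, Y])                       # I(X ^ Y)
i_zy_x = H([X, Z]) + H([X, Y]) - H([X, Y, Z]) - H([X])   # I(Z ^ Y | X)
print(lhs, i_xy + i_zy_x)                                # the two sides agree
```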

3.1 An Example
We now look at the effect of conditioning on mutual information, through the following two examples.
Example 1. Let X, Y, Z be uniform random bits subject to X ⊕ Y ⊕ Z = 0 (zero parity). Now,
I(X ∧ Y |Z) = H(X|Z) − H(X|Y, Z) = 1 − 0 = 1
H(X|Z) = 1 since, given Z, X could be either 0 or 1, while given Y and Z, X is already determined. Meanwhile,
I(X ∧ Y ) = H(X) − H(X|Y ) = 1 − 1 = 0
Example 2. Let A, B, C be uniform random bits. Define X = (A, B), Y = (A, C) and Z = A. Now,
I(X ∧ Y |Z) = H(X|Z) − H(X|Y, Z) = 1 − 1 = 0
while,
I(X ∧ Y ) = H(X) − H(X|Y ) = 2 − 1 = 1
Thus, unlike entropy, conditioning may decrease or increase the mutual information.
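Both examples can be reproduced directly. The sketch below handles Example 1 by enumerating the four equally likely zero-parity outcomes and evaluating the two quantities via entropies:

```python
import math

# The four equally likely outcomes (x, y, z) with x XOR y XOR z = 0.
p = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}

def H(indices):
    marg = {}
    for k, v in p.items():
        key = tuple(k[i] for i in indices)
        marg[key] = marg.get(key, 0.0) + v
    return -sum(v * math.log2(v) for v in marg.values() if v > 0)

X, Y, Z = 0, 1, 2
i_xy_given_z = H([X, Z]) + H([Y, Z]) - H([X, Y, Z]) - H([Z])
i_xy = H([X]) + H([Y]) - H([X, Y])
print(i_xy_given_z, i_xy)     # 1.0 and 0.0, as computed in Example 1
```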

3.2 Properties of Mutual Information
Lemma 5. If X and Y are independent and Z has an arbitrary probability distribution, then

I(X, Y ∧ Z) ≥ I(X ∧ Z) + I(Y ∧ Z)

Proof

I({X, Y } ∧ Z) = I(X ∧ Z) + I(Y ∧ Z|X) (Using the chain rule)


= I(X ∧ Z) + H(Y |X) − H(Y |X, Z)
= I(X ∧ Z) + H(Y ) − H(Y |X, Z) (X and Y are independent)
≥ I(X ∧ Z) + H(Y ) − H(Y |Z) (Conditioning cannot increase entropy)
= I(X ∧ Z) + I(Y ∧ Z)
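A small numerical illustration of Lemma 5, taking X and Y to be independent uniform bits and Z = X ⊕ Y (an arbitrary illustrative choice of Z):

```python
import math

p = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}   # (X, Y, Z = X xor Y)

def H(indices):
    marg = {}
    for k, v in p.items():
        key = tuple(k[i] for i in indices)
        marg[key] = marg.get(key, 0.0) + v
    return -sum(v * math.log2(v) for v in marg.values() if v > 0)

X, Y, Z = 0, 1, 2
i_xy_z = H([X, Y]) + H([Z]) - H([X, Y, Z])    # I({X, Y} ^ Z) = 1
i_x_z = H([X]) + H([Z]) - H([X, Z])           # I(X ^ Z) = 0
i_y_z = H([Y]) + H([Z]) - H([Y, Z])           # I(Y ^ Z) = 0
assert i_xy_z >= i_x_z + i_y_z - 1e-12
```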

Lemma 6. Let (X, Y ) ∼ p(x, y) be the joint probability distribution of X and Y . By the chain rule,
p(x, y) = p(x)p(y|x) = p(y)p(x|y). For clarity we represent p(x) (resp. p(y)) by α and p(y|x) (resp. p(x|y))
by π. The following holds:
Concavity in p(x): For i ∈ {1, 2}, let Ii(X ∧ Y) be the mutual information for (X, Y) ∼ αiπ. For λ1, λ2 ∈ [0, 1] such that λ1 + λ2 = 1, let I(X ∧ Y) be the mutual information for (X, Y) ∼ (λ1α1 + λ2α2)π. Then,

I(X ∧ Y) ≥ λ1I1(X ∧ Y) + λ2I2(X ∧ Y)

Convexity in p(y|x): For i ∈ {1, 2}, let Ii(X ∧ Y) be the mutual information for (X, Y) ∼ απi. For λ1, λ2 ∈ [0, 1] such that λ1 + λ2 = 1, let I(X ∧ Y) be the mutual information for (X, Y) ∼ α(λ1π1 + λ2π2). Then,

I(X ∧ Y) ≤ λ1I1(X ∧ Y) + λ2I2(X ∧ Y)

Proof We first prove convexity in p(y|x): we apply Lemma 1 together with the definition of mutual information in terms of divergence. Thus,
\begin{align*}
I(X \wedge Y) &= D\left(\lambda_1 \alpha\pi_1 + \lambda_2 \alpha\pi_2 \,\Big\|\, \Big(\sum_y \lambda_1 \alpha\pi_1 + \lambda_2 \alpha\pi_2\Big)\Big(\sum_x \lambda_1 \alpha\pi_1 + \lambda_2 \alpha\pi_2\Big)\right) \\
&= D\left(\lambda_1 \alpha\pi_1 + \lambda_2 \alpha\pi_2 \,\Big\|\, \Big(\lambda_1 \alpha\sum_y \pi_1 + \lambda_2 \alpha\sum_y \pi_2\Big)\Big(\sum_x \lambda_1 \alpha\pi_1 + \lambda_2 \alpha\pi_2\Big)\right) \\
&= D\left(\lambda_1 \alpha\pi_1 + \lambda_2 \alpha\pi_2 \,\Big\|\, \alpha\Big(\lambda_1 \sum_x \alpha\pi_1 + \lambda_2 \sum_x \alpha\pi_2\Big)\right) \\
&= D\left(\lambda_1 \alpha\pi_1 + \lambda_2 \alpha\pi_2 \,\Big\|\, \lambda_1 \Big(\sum_y \alpha\pi_1\Big)\Big(\sum_x \alpha\pi_1\Big) + \lambda_2 \Big(\sum_y \alpha\pi_2\Big)\Big(\sum_x \alpha\pi_2\Big)\right) \\
&\le \lambda_1 D\left(\alpha\pi_1 \,\Big\|\, \Big(\sum_y \alpha\pi_1\Big)\Big(\sum_x \alpha\pi_1\Big)\right) + \lambda_2 D\left(\alpha\pi_2 \,\Big\|\, \Big(\sum_y \alpha\pi_2\Big)\Big(\sum_x \alpha\pi_2\Big)\right) \\
&= \lambda_1 I_1(X \wedge Y) + \lambda_2 I_2(X \wedge Y)
\end{align*}
Here we used the fact that Σy πi = 1, and used Lemma 1 to introduce the inequality.

We now prove concavity in p(x). We first simplify the LHS and the RHS.
\begin{align*}
I(X \wedge Y) &= \sum_{x,y} (\lambda_1\alpha_1\pi + \lambda_2\alpha_2\pi) \log \frac{\lambda_1\alpha_1\pi + \lambda_2\alpha_2\pi}{\left(\sum_y \lambda_1\alpha_1\pi + \lambda_2\alpha_2\pi\right)\left(\sum_x \lambda_1\alpha_1\pi + \lambda_2\alpha_2\pi\right)} \\
&= \sum_{x,y} (\lambda_1\alpha_1\pi + \lambda_2\alpha_2\pi) \log \frac{\pi}{\sum_x \lambda_1\alpha_1\pi + \lambda_2\alpha_2\pi} \\
&= \sum_{x,y} (\lambda_1\alpha_1\pi + \lambda_2\alpha_2\pi) \log \pi - \sum_{x,y} \Big(\sum_{i\in\{1,2\}} \lambda_i\alpha_i\pi\Big) \log\Big(\sum_x \sum_{i\in\{1,2\}} \lambda_i\alpha_i\pi\Big)
\end{align*}
\begin{align*}
\lambda_1 I_1(X \wedge Y) + \lambda_2 I_2(X \wedge Y) &= \sum_{x,y} \sum_{i\in\{1,2\}} \lambda_i\alpha_i\pi \log \frac{\alpha_i\pi}{\left(\sum_y \alpha_i\pi\right)\left(\sum_x \alpha_i\pi\right)} \\
&= \sum_{x,y} (\lambda_1\alpha_1\pi + \lambda_2\alpha_2\pi) \log \pi - \sum_{x,y} \sum_{i\in\{1,2\}} \lambda_i\alpha_i\pi \log\Big(\sum_x \alpha_i\pi\Big)
\end{align*}

Thus, to prove that LHS ≥ RHS we need to prove that
\[
\Big(\sum_{i\in\{1,2\}} \lambda_i\alpha_i\pi\Big) \log\Big(\sum_x \sum_{i\in\{1,2\}} \lambda_i\alpha_i\pi\Big) \;\le\; \sum_{i\in\{1,2\}} \lambda_i\alpha_i\pi \log\Big(\sum_x \alpha_i\pi\Big),
\]
which follows directly from an application of the log-sum inequality [1].
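Both parts of Lemma 6 can be checked numerically; in the sketch below the input laws αi, the channels πi and the weight λ are arbitrary illustrative choices:

```python
import math

def mutual_info(alpha, channel):
    """I(X ^ Y) in bits for X ~ alpha and Y | X = x ~ channel[x]."""
    ny = len(channel[0])
    py = [sum(alpha[x] * channel[x][y] for x in range(len(alpha))) for y in range(ny)]
    return sum(alpha[x] * channel[x][y] * math.log2(channel[x][y] / py[y])
               for x in range(len(alpha)) for y in range(ny)
               if alpha[x] * channel[x][y] > 0)

lam = 0.4
a1, a2 = [0.9, 0.1], [0.2, 0.8]                    # two input laws alpha_i
pi1 = [[0.7, 0.3], [0.1, 0.9]]                     # two channels pi_i(y|x)
pi2 = [[0.5, 0.5], [0.6, 0.4]]
mix = lambda u, v: [lam * s + (1 - lam) * t for s, t in zip(u, v)]

# Concavity in p(x): mix the input laws, keep the channel fixed.
a_mix = mix(a1, a2)
assert mutual_info(a_mix, pi1) >= (lam * mutual_info(a1, pi1)
                                   + (1 - lam) * mutual_info(a2, pi1)) - 1e-12

# Convexity in p(y|x): keep the input law fixed, mix the channels.
pi_mix = [mix(r1, r2) for r1, r2 in zip(pi1, pi2)]
assert mutual_info(a1, pi_mix) <= (lam * mutual_info(a1, pi1)
                                   + (1 - lam) * mutual_info(a1, pi2)) + 1e-12
```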

References
[1] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, NY, USA, 1991.

