
An Introduction to Information Theory and Applications

F. Bavaud J.-C. Chappelier J. Kohlas

version 2.04 - 20050309 - UniFr course

Contents

1 Uncertainty and Information
  1.1 Entropy
    1.1.1 Choice and Uncertainty
    1.1.2 Choice with Known Probability
    1.1.3 Conditional Entropy
    1.1.4 Axiomatic Determination of Entropy
  1.2 Information And Its Measure
    1.2.1 Observations And Events
    1.2.2 Information and Questions
    1.2.3 Mutual Information and Kullback-Leibler Divergence
    1.2.4 Surprise, Entropy and Information
    1.2.5 Probability as Information

2 Efficient Coding of Information
  2.1 Coding a Single Random Variable
    2.1.1 Prefix-Free Codes
    2.1.2 n-ary Trees for Coding
    2.1.3 Kraft Inequality
  2.2 Efficient Coding
    2.2.1 What Are Efficient Codes?
    2.2.2 Probabilized n-ary Trees: Path Length and Uncertainty
    2.2.3 Noiseless Coding Theorem
    2.2.4 Huffman Codes

3 Stationary processes & Markov chains
  3.1 The entropy rate
  3.2 The AEP theorem
    3.2.1 The concept of typical set: redundancy and compressibility
  3.3 First-order Markov chains
    3.3.1 Transition matrix in n steps
    3.3.2 Flesh and skeleton. Classification of states
    3.3.3 Stationary distribution
    3.3.4 The entropy rate of a Markov chain
    3.3.5 Irreversibility
  3.4 Markov chains of general order
    3.4.1 Stationary distribution and entropy rate
  3.5 Reconstruction of Markov models from data
    3.5.1 Empirical and model distributions
    3.5.2 The formula of types for Markov chains
    3.5.3 Maximum likelihood and the curse of dimensionality
    3.5.4 Testing the order of a Markov chain
    3.5.5 Simulating a Markov process

4 Coding for Noisy Transmission
  4.1 Communication Channels
    4.1.1 Communication Channels
    4.1.2 Channel Capacity
    4.1.3 Input-Symmetric Channels
    4.1.4 Output-Symmetric Channels
    4.1.5 Symmetric Channels
    4.1.6 Transmission Rate
  4.2 A Few Lemmas
    4.2.1 Multiple Use Lemma
    4.2.2 Data Processing Lemma
    4.2.3 Fano's Lemma
  4.3 The Noisy Coding Theorem
    4.3.1 Repetition Codes
    4.3.2 The Converse to the Noisy Coding Theorem for a DMC without Feedback
    4.3.3 The Noisy Coding Theorem for a DMC

5 Complements to Efficient Coding of Information
  5.1 Variable-to-Fixed Length Coding: Tunstall's Code
    5.1.1 Introduction
    5.1.2 Proper Sets
    5.1.3 Tunstall message sets
    5.1.4 Tunstall Code Construction Algorithm
  5.2 Coding the Positive Integers
  5.3 Coding of Sources with Memory
    5.3.1 Huffman coding of slices
    5.3.2 Elias-Willems Source Coding Scheme
    5.3.3 Lempel-Ziv Codings
    5.3.4 gzip and bzip2

6 Error Correcting Codes
  6.1 The Basics of Error Correcting Codes
    6.1.1 Introduction
    6.1.2 Hamming Distance and Codeword Weight
    6.1.3 Minimum Distance Decoding and Maximum Likelihood
    6.1.4 Error Detection and Correction
  6.2 Linear Codes
    6.2.1 Definitions
    6.2.2 Some Properties of Linear Codes
    6.2.3 Encoding with Linear Codes
    6.2.4 Systematic Form of a Linear Code
    6.2.5 Decoding: Verification Matrix
    6.2.6 Dual Codes
    6.2.7 Syndromes
    6.2.8 Minimum Distance and Verification Matrix
    6.2.9 Binary Hamming Codes
  6.3 Cyclic Codes
    6.3.1 Introduction
    6.3.2 Cyclic Codes and Polynomials
    6.3.3 Decoding
  6.4 Convolutional Codes
    6.4.1 Introduction
    6.4.2 Encoding
    6.4.3 General Definition
    6.4.4 Lattice Representation
    6.4.5 Decoding
    6.4.6 Minimum Distance

7 Cryptography
  7.1 General Framework
    7.1.1 Cryptography Goals
    7.1.2 Historical Examples
  7.2 Perfect Secrecy
    7.2.1 Definition and Consequences
    7.2.2 One Example: One-Time Pad
    7.2.3 Imperfect Secrecy and Unicity Distance
    7.2.4 Increasing Unicity Distance: Homophonic Coding
  7.3 Practical Secrecy: Algorithmic Security
    7.3.1 Algorithmic Complexity
    7.3.2 One-Way Functions
    7.3.3 DES
  7.4 Public-Key Cryptography
    7.4.1 A bit of Mathematics
    7.4.2 The Diffie-Hellman Public Key Distribution System
    7.4.3 Trapdoor Functions
    7.4.4 RSA
  7.5 Authentication
    7.5.1 Diffie-Lamport Authentication
    7.5.2 Authentication with RSA
    7.5.3 Shared secrets

Notations
VX: set of values for random variable X.
ε: the empty string.
E[X]: expected value of X.
P(X = 3): general probability function.
pX(3): probability distribution of a given random variable (here X). Notice that pX(3) = P(X = 3).
an ≐ bn: asymptotic exponent equivalence: lim_{n→∞} (1/n) log(an/bn) = 0.

X := Y, and Y =: X: equal by definition. Both cases mean X is by definition equal to Y.


Chapter 1

Module C1: Uncertainty and Information

by Jürg Kohlas

Learning Objectives for Chapter 1
After studying this module you should understand
- why it is important to measure the amount of uncertainty in a situation of choice;
- why entropy is an appropriate measure of uncertainty;
- how information and uncertainty are related and why therefore entropy plays an important role in measuring information;
- that information is always relative to a precise question and to prior information.

Introduction
Welcome to this first step into the world of information theory. Clearly, in a world which is developing into an information society, the notion and concept of information should attract a lot of scientific attention. In fact, although pragmatic information processing in computers, in the Internet and other computer networks develops at an extremely fast pace, the theoretical and conceptual study of what information is, and how it should be treated, hardly keeps up with this frantic development. Information theory, in the technical sense in which it is used today, goes back to the work of Claude Shannon and was introduced as a means to study and solve problems of communication or transmission of signals over channels. Although it is quite a narrow view of information, focusing especially on the measurement of information content, it must be part of any larger theory of information. Therefore this basic module introduces the basic elements of information theory as laid down by Shannon and his successors.

But already in this first module we try to widen the view on information. We emphasize that information must always be considered with respect to precisely specified questions. The same piece of information may bear on different questions; with respect to each question its information content will be different. For some questions, the content may even be void. The amount contained in a piece of information with respect to a given question will be measured by the reduction, or more generally, the change of uncertainty regarding this question induced by the information. We follow Shannon by measuring uncertainty by the entropy, and our approach is in the spirit of Shannon also insofar as information is measured by the change of entropy. But by making explicit the question to which the information is applied, we go beyond Shannon. We also stress the importance of the prior information with respect to which the amount of information is to be measured. Although this is implicit in Shannon's approach, making it explicit clarifies the concept. In fact, with this view it becomes evident that probabilities are themselves information whose content can be measured by change of entropy. In the course of the discussions it becomes clear that information also has an algebraic structure: information can be combined or aggregated, and information must be focused on specified questions. This important aspect, however, is not treated in depth here; it is reserved to other modules. The same holds true for the application of classical information theory to coding, communication and other domains. We wish you a lot of pleasure in studying this module.

1.1 Entropy

Learning Objectives for Section 1.1
After studying this section you should
- know how entropy is defined and what its most important properties are;
- understand why it is an appropriate measure of uncertainty.

1.1.1 Choice and Uncertainty

Learning Objectives for Subsection 1.1.1
After studying this subsection you should
- know how a situation of choice is formally described;
- understand how a game of questions and answers leads to a measure of the amount of uncertainty in a situation of choice;
- understand why this measure is a logarithm and what its units are;
- know how possible choices can be coded using the game of questions and answers.


Any situation of uncertainty can be described as a situation where there are several possibilities, and it is unknown which one will be selected. A typical example related to computers is the question of what the next keystroke of a computer user will be. There are, depending on the keyboard, several dozens of possibilities, if combined strokes are allowed. A more complex situation of uncertainty arises if a whole sequence of keystrokes in a dialog session is considered. Of course there is an abundance of situations of uncertainty in daily life, scientific inquiry, medical diagnostics, statistical inference, criminal investigations, etc. So, there is clearly much interest in studying uncertainty and in measuring the amount of uncertainty in a given situation. In fact the latter is a basic issue in communication and information theory.

We start with a formal description of a situation of uncertainty. Suppose there is a case where one of, say, m different possibilities exists. We denote these possibilities by e1, e2, ..., em. Such a set of possibilities, denoted by S = {e1, e2, ..., em}, is called a (finite) scheme of choice. The idea is that somebody, or some process or some mechanism, etc., selects one of these possibilities. The uncertainty arises because we do not know which one of the m possibilities is selected.

How do we measure the amount of uncertainty in a scheme of choice S? Intuitively, the larger the cardinality |S| of S (the number of elements), the larger the uncertainty. So much seems clear. Then, why not simply take |S| as the measure of uncertainty? It is indeed a possibility. But we prefer another approach. Imagine the following game: I select a possibility of S and you can ask me questions about my choice. However, I accept only questions with yes-no answers. For all purposes we may assume that the possibilities are represented by the numbers 1, 2, ..., m = |S|. So you may ask questions like: is the number you selected odd? is it less than 10? greater than 13? etc. The more questions you need, the greater is the uncertainty. So the idea is to measure the uncertainty by the number of questions you need to find out my choice. Of course, you should ask clever questions. You may ask whether the number is 1; if no, whether it is 2, etc. In this way, you may need up to m questions to find out my choice. This is clearly not an optimal way to proceed. However, if you first ask whether the number is smaller than m/2, my answer allows you to limit your subsequent search to only half of the initial possibilities. And then you may proceed in a similar manner. So, this seems to be a clever way to find out my choice.

To get a bit more formal, assume first that m is a power of 2, m = 2^n. Then we may partition S with the first question (is your choice greater than 2^(n-1)?) into two halves of equal size: {1, ..., 2^(n-1)} and {2^(n-1) + 1, ..., 2^n}. Each half can be further halved by the second question. If the answer to the first question is no, then the next question determines either {1, ..., 2^(n-2)} or {2^(n-2) + 1, ..., 2^(n-1)}. If the answer to the first question is yes, then the next question distinguishes between {2^(n-1) + 1, ..., 2^(n-1) + 2^(n-2)} and {2^(n-1) + 2^(n-2) + 1, ..., 2^n}. This process of questions and answers is depicted in Figure 1.1. Each question is represented by a node, starting with the first question. A question node is labelled by the set of possibilities identified so far. So the first node is labelled by the whole set S. Nodes on the first level are labelled by the two half sets, on the second level by the four quarter sets, etc. Each possible answer is indicated by an arc leaving the node. We label a no answer by a 0 and a yes answer by a 1. The process of dividing sets of possibilities into equal halves ends eventually, after exactly n steps, with the actual choice. Now, the number n is nothing else than the logarithm of m = 2^n in base 2, that is, n = log2 m. And this is a lot smaller than m itself.


Figure 1.1: The question-answer tree for the binary search for an unknown choice among 2^n elements.

So it seems reasonable to express the amount of uncertainty in a choice system S with |S| = 2^n possibilities by the logarithm in base 2. Let's denote the amount of uncertainty of a choice system S by h(|S|). So we propose h(|S|) = log |S|, at least if |S| is a power of 2. But what about the general case, where the cardinality |S| is any number? We may try the same scheme of questions. The only difference we encounter is that some question nodes may represent sets of odd cardinality, say 2k + 1 for example. Then the questions partition this set into two slightly unequal sets, one with k + 1, the other with k elements. Figure 1.2 shows this situation schematically.

Figure 1.2: The typical node of a question-answer tree for the binary search for an unknown choice in the general case.

If the number of possibilities in the choice system S lies between the two powers of 2, 2^n ≤ |S| < 2^(n+1), then we may either remove some possibilities to get 2^n possibilities, or add some possibilities to get 2^(n+1) possibilities. In the first case we need n questions, in the latter n + 1, to find out the actual choice. So the amount of uncertainty of S must be somewhere between these two limits. Now, we have n ≤ log |S| < n + 1. So, we may again take h(|S|) = log |S| as a measure of the amount of uncertainty in the general case; however, this time log |S| is not necessarily an integer value. We finally adopt the following definition:


Definition 1.1 The Amount of Uncertainty of a Choice Scheme. For a choice system S with |S| possible choices, we define the amount of uncertainty h(|S|) by h(|S|) = log |S|.

Example 1.1 (Chess Board) As a first example take an empty chess board. There are exactly m = 64 = 2^6 possibilities to place a piece on it. Thus the system of choice can be represented by S = {1, 2, ..., 64}, where each number stands for a field. The amount of uncertainty in the position of a given piece on a chessboard is h(|S|) = log 64 = 6.

As any definition, this one is also arbitrary to a certain degree. Its justification will have to be proved by its usefulness and beauty in applications. And this will be attempted in what follows in this module, as well as in other ones. By the way, we may also play the game of questions and answers with questions having more than 2 possible answers. Suppose the questions have k > 2 possible answers. Then each question allows the partition of the set of, say, m possibilities into k subsets of approximately m/k elements. So, as above, if |S| = k^n, then we need exactly n = logk |S| questions. This time we use the logarithm with base k. So, we might also have defined h(|S|) = logk |S|. But we have logk |S| = logk(2) · log2 |S|. So, changing the base of the logarithm is like changing the unit of measurement, and thus not really essential. In the future, log without indication of the base means by default base 2.

If we have a choice system S and a corresponding question tree (like Figure 1.1), then we have at the same time a coding of the possibilities of the choice system. In fact, concatenate the 0s and 1s on the path from the root to the possibility in question. This is a code of this possibility. If we use binary questions, then we have a binary code for the choice system. Note that the length of the code of each possibility equals either the next smaller or the next greater integer of h(|S|) = log |S|. This is a first hint of the close relation between our measure of uncertainty and coding.

Example 1.2 (Binary Question Tree) A system of choice is given by S = {1, 2, 3, 4, 5}. Its amount of uncertainty is h(|S|) = log 5 ≈ 2.3219 bit. A possible corresponding binary question tree is depicted in Figure 1.3. It is easy to see that the code 001 represents the possibility {2} and its length 3 is the next greater integer of h(|S|). The possibility {3} has code length 2, {4} has length 2, etc.
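The relation between h(|S|) = log2 |S| and the binary code lengths of Examples 1.1 and 1.2 is easy to check numerically. The following small Python sketch is our own illustration (the function name uncertainty is not part of the module):

    from math import log2, floor, ceil

    def uncertainty(m):
        """Amount of uncertainty h(|S|) = log2(m) of a choice system with m possibilities."""
        return log2(m)

    for m in (64, 5):
        h = uncertainty(m)
        # In a binary question tree every possibility gets a code of length
        # floor(h) or ceil(h), the integers closest to h(|S|).
        print(f"|S| = {m}: h = {h:.4f} bit, code lengths {floor(h)} or {ceil(h)}")

    # |S| = 64: h = 6.0000 bit, code lengths 6 or 6
    # |S| = 5:  h = 2.3219 bit, code lengths 2 or 3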


Figure 1.3: Example: A binary question tree of a 5-possibility system.

Here are now a few very simple properties of our measure of uncertainty h(|S|):

1. If S1 and S2 are two choice systems and |S1| = |S2|, then h(|S1|) = h(|S2|). Only the number of possibilities in a choice system matters, not their nature.

2. If S1 and S2 are two choice systems and |S1| < |S2|, then h(|S1|) < h(|S2|), since the logarithm is a non-decreasing function. That is what we expect: the uncertainty increases with the number of possibilities of a choice.

3. If S1 and S2 are two choice systems and S2 has twice as many possibilities as S1 (|S2| = 2·|S1|), then, using base 2 for the logarithm, we obtain h(|S2|) = h(|S1|) + 1. This follows from the additivity of the logarithm and from log2 2 = 1: h(|S2|) = log2 |S2| = log2(2·|S1|) = log2 2 + log2 |S1| = 1 + h(|S1|).

4. If S is a choice system with only two possibilities, then, with the logarithm in base 2, h(|S|) = log2 2 = 1. This unit of measurement is called a bit (binary information unit). We shall see that uncertainty is closely related to information, and the latter is measured in the same units as uncertainty. Also here we enter the realm of computers. That is why binary questions are most popular.

If we have two choice systems S1 = {e1,1, e1,2, ..., e1,n} and S2 = {e2,1, e2,2, ..., e2,m}, then the corresponding two choice possibilities can be compounded into a combined system which contains all n·m pairwise combinations of possible choices, {(e1,1, e2,1), (e1,1, e2,2), ..., (e1,2, e2,1), ..., (e1,n, e2,m)}. Such a set of pairs is called the cartesian product of the two individual sets and is written as

S1 × S2 = {(e1,1, e2,1), (e1,1, e2,2), ..., (e1,2, e2,1), ..., (e1,n, e2,m)}.

We call this new choice system a system of independent choices. This expresses the idea that the choice in each of the two systems is made independently of the choice in the other system, to get the combined choice. How is the amount of uncertainty of such a system of independent choices related to the amount of uncertainty in each of the two choice systems? The simple answer is given in the following theorem.


Theorem 1.1 Additivity of Uncertainty. The uncertainty of the system of independent choices is the sum of the uncertainties of both simple systems, h(|S1 × S2|) = h(|S1|) + h(|S2|).

Proof The proof is simple, since this is essentially the additivity of the logarithm. In fact, h(|S1 × S2|) = log |S1 × S2| = log(|S1| · |S2|) = log |S1| + log |S2| = h(|S1|) + h(|S2|).

This theorem is a strong justification for our definition of the measure of uncertainty: one would expect that the uncertainties in two independent situations add when they are considered together.

Example 1.3 (Chess Board - Continuation) We return to the chess board situation of Example 1.1. We saw that the amount of uncertainty of the position of a piece on the whole board is h(|S|) = log 64 = 6. In the same way, we see that the amount of uncertainty of the position of a piece in a single row or column is log 8 = 3. So we get the expected result 6 = log 64 = log 8 + log 8.

Of course this can be generalized to the combination of more than two independent choices. Let S1, S2, ..., Sm be m choice systems. Then the cartesian product of m-tuples S1 × S2 × ... × Sm = {(e1,1, e2,1, ..., em,1), ...} is the corresponding system of independent choices.

Corollary 1.1 h(|S1 × S2 × ... × Sm|) = h(|S1|) + h(|S2|) + ... + h(|Sm|). (1.1)

Example 1.4 (Dice) We throw a die m times and assume that the throws are independent. This can be modelled by m independent systems of choices S1, ..., Sm, where each system contains 6 possibilities. From Corollary 1.1 we get that h(|S1 × S2 × ... × Sm|) = h(|S1|) + h(|S2|) + ... + h(|Sm|) = m · log 6.
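Theorem 1.1 and Corollary 1.1 can also be checked by explicitly enumerating cartesian products. The following sketch is our own illustration of the chess-board and dice examples (helper name h is ours):

    from itertools import product
    from math import log2, isclose

    def h(size):
        """Uncertainty h(|S|) = log2 |S| of a choice system with the given number of possibilities."""
        return log2(size)

    rows, cols = range(8), range(8)
    board = list(product(rows, cols))            # the 64 fields of the chess board
    assert isclose(h(len(board)), h(len(rows)) + h(len(cols)))   # 6 = 3 + 3

    m = 5
    throws = list(product(range(6), repeat=m))   # all sequences of m dice throws
    assert isclose(h(len(throws)), m * h(6))     # m * log2(6)
    print(h(len(board)), h(len(throws)))         # 6.0 and m*log2(6), about 12.925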

Summary for Section 1.1
- We formalized situations of uncertainty by choice systems S, where one of a finite number of possibilities will be selected, but it is unknown which one;

- The uncertainty associated with a choice system is measured by the (least) number of questions to be asked in order to find out the actual choice. This leads us to propose log |S| as a measure of the uncertainty. With binary questions the unit of measurement is called a bit;
- The question game defines a tree, which can be used to define codes for the possibilities of the choice system S. The lengths of these codes are approximately equal to the measure of uncertainty log |S|. If binary questions are used we get binary codes;
- We found that the uncertainties of systems of independent choices add together.

Control Question 1
For x > 0, the logarithm log2(x) is
1. always positive;
2. a non-decreasing function;
3. maximal in x = 10;
4. equal to 0 in x = 0;
5. equal to 0 in x = 1;
6. equal to 1 in x = 2.

Answer
1. That is not correct. The logarithm log2(x) takes negative values for all 0 < x < 1. This will play a significant role in the next subsection, where the so-called entropy will be defined as H(X) = −∑_x pX(x) log pX(x).
2. That is right. This is an important property of the logarithm.
3. The logarithm has no maximum, so that is wrong.
4. No, the logarithm is not even defined in x = 0.
5. That is correct and important. Take as an example the system of choice given by S = {1}. Since we have only one possibility, there is no uncertainty. In fact, we really obtain h(|S|) = h(1) = log 1 = 0.
6. Yes, that is true. Once more this property is significant. Take as an example a system of choice S with only two possibilities. Then we get h(|S|) = h(2) = 1. That is why we call the unit of uncertainty bit.

Control Question 2
If we have two choice systems S1 = {e1,1, e1,2, ..., e1,n} and S2 = {e2,1, e2,2, ..., e2,m}, then the system of independent choices S1 × S2 has
1. n + m elements;
2. n · m elements.

Answer
Since the system of independent choices S1 × S2 is given by the cartesian product of the two individual sets, that is S1 × S2 = {(e1,1, e2,1), (e1,1, e2,2), ..., (e1,n, e2,m)}, we get |S1 × S2| = n · m.

Control Question 3
Given two choice systems S1 and S2 with |S2| = 2·|S1|, then h(|S1 × S2|) equals
1. h(|S1|) + h(|S2|);
2. 1 + 2·h(|S1|);
3. log(|S1| · |S2|);
4. h(|S1|) · h(|S2|);
5. 1 + (1/2)·h(|S2|).

Answer
1. Due to the additivity of the logarithm, this is correct.
2. That is true; once more because of the additivity of the logarithm.
3. Since |S1 × S2| = |S1| · |S2|, this one is also correct.
4. The logarithm is additive, but not multiplicative, that is, in general log(x·y) is not equal to log x · log y; hence the assertion is incorrect.
5. That is wrong. But we have h(|S1 × S2|) = 1 + 2·h(|S1|).

Control Question 4
We are going back to the last control question. Which of the correct assertions will remain true for arbitrary S1 and S2 without the property |S2| = 2·|S1|?

Answer
1. Remains true.
2. Now the assertion becomes incorrect, because we cannot expect that h(|S2|) = h(|S1|) + 1.
3. Remains true.


1.1.2 Choice with Known Probability

Learning Objectives for Subsection 1.1.2
After studying this subsection you should
- understand how uncertainty is affected if probabilities for the possible choices are known;
- know the definition of entropy and some of its elementary properties;
- get a first appreciation of entropy as a measure of uncertainty.

There are situations where probabilities are known for the different possibilities which may arise. For example, if we know that a user is typing an English text on the keyboard, then we know that some letters occur more often than others and thus some keys are more likely to be struck. Or, on the level of words, if we know that the user is programming, then we know that some keywords like if, then, else, etc. are more likely to be typed in than most other combinations of letters. We shall see in this section how this additional knowledge of probabilities affects the amount of uncertainty in a choice system.

To start, we formally introduce probabilities into a choice system S = {e1, e2, ..., em} by assigning probabilities pi to the possibilities ei for i = 1, 2, ..., m. These probabilities have to satisfy the following conditions:

0 ≤ pi ≤ 1 for i = 1, 2, ..., m,    ∑_{i=1}^{m} pi = 1.    (1.2)

The second condition expresses the fact that exactly one of the m possibilities must be selected. The choice system S together with the set of probabilities P = {p1, p2, ..., pm} forms a probabilistic choice system. Here is the formal definition:

Definition 1.2 Probabilistic Choice System. If S is a choice system, and P a set of probabilities on S satisfying conditions (1.2), then the pair (S, P) is called a probabilistic choice system. If E ⊆ S is a subset of S, called an event in the language of probability theory, then its probability is given by

p(E) = ∑_{ei ∈ E} pi.

What is then the amount of uncertainty in a probabilistic choice system? We may attempt the same game of questions and answers as in the previous subsection. However, now it is no longer clever to partition the set of possibilities into subsets of equal size, because this neglects the probabilities. Suppose for example that one possibility, say e1, is much more likely than all the others. Then of course we should first ask whether this is the actual possibility. There is a big chance that we hit the actual possibility with just one question. Only if the answer is no do we have to continue. Let's look at an example.


Example 1.5 (Trees related to a Probabilistic Choice System) Given a probabilistic choice system (S, P) with S = {e1, e2, ..., e8} and P = {0.3, 0.2, 0.1, 0.05, 0.05, 0.1, 0.15, 0.05}. A corresponding binary tree and an alternative, better tree of (S, P) are depicted in Figure 1.4 and Figure 1.5. A simple computation shows that the expected word length is 3 in the binary tree and 2.75 in the better one.

Figure 1.4: The binary tree of our example with expected word length 3.

Figure 1.5: In this alternative tree the expected word length reduces to 2.75.

As the example shows, we should try to select our questions so as to minimize the expected, or average, number of questions needed. Now, this is not a trivial task. Nevertheless, the solution to this problem is known and very much used in coding theory. The key idea is to partition a set not into subsets of equal cardinality, but into subsets of equal probability. It is especially known from coding theory that the expected number of questions is then approximately

−∑_{i=1}^{m} pi log pi.


This quantity is called entropy and we propose to use it as a measure of the amount of uncertainty in a probabilistic choice system. We do not exclude that some of the probabilities pi vanish, pi = 0. For this case we adopt the convention that 0 log 0 = 0, which is reasonable since lim_{x→0} x log x = 0.

Definition 1.3 The Amount of Uncertainty of a Probabilistic Choice System. Let (S, P) be a probabilistic choice system. Then we define the amount of uncertainty in (S, P) to be the entropy

H(P) = −∑_{i=1}^{m} pi log pi.    (1.3)

Again there is some amount of arbitrariness in this definition. The base k of the logarithm in the entropy corresponds, as in the previous section, to the number k of possible answers for each question in the game of questions and answers. We leave this open, as a change of the base corresponds only to a change of the unit. As before, the binary case is most popular, with the unit bit. This is the base we usually use in examples.

Example 1.6 (Trees related to a Probabilistic Choice System - Continuation) We continue Example 1.5 by computing the amount of uncertainty in the given probabilistic choice system, that is, by computing the entropy:

H(P) = −∑_{i=1}^{8} pi log pi = −0.3 log 0.3 − 0.2 log 0.2 − 0.2 log 0.1 − 0.15 log 0.05 − 0.15 log 0.15 ≈ 2.7087 bit.

Thus H(P) is less than the expected word length in the better tree.

Entropy is a really fundamental notion of information and communication theory, as the rest of this course will demonstrate. Therefore, it is worthwhile studying its properties. Note that we sometimes write H(P) = H(p1, p2, ..., pm) if P = {p1, p2, ..., pm}. First we make the connection between the general notion of entropy as introduced here and the measure of uncertainty for choice systems without probabilities as defined in the previous subsection. We see that if P is the uniform distribution over n choices, then

H(1/n, ..., 1/n) = −∑_{i=1}^{n} (1/n) log(1/n) = −log(1/n) = log n = h(n).

Property 1.1 The entropy of a uniform probability distribution over n possibilities equals the measure of uncertainty of the corresponding choice system without probability.
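Definition 1.3 and Property 1.1 can be checked with a few lines of Python. This is our own sketch (the helper name entropy is not from the module):

    from math import log2

    def entropy(probs):
        """H(P) = -sum p_i * log2(p_i), with the convention 0 log 0 = 0."""
        return -sum(p * log2(p) for p in probs if p > 0)

    # Example 1.6: the probabilistic choice system of Example 1.5
    P = [0.3, 0.2, 0.1, 0.05, 0.05, 0.1, 0.15, 0.05]
    print(round(entropy(P), 4))            # 2.7087 bit

    # Property 1.1: a uniform distribution over n possibilities gives log2(n) = h(n)
    n = 8
    print(entropy([1 / n] * n), log2(n))   # 3.0 3.0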


This is an expression of Laplace's principle of insufficient reason, which says: if you know nothing else, assume equal probabilities. In this context this works nicely. Hence, it turns out that entropy also covers, in this sense, the particular case of choice systems without probabilities. For a given choice system S with |S| = n, intuitively, we have maximal uncertainty if we know no probabilities, or if we assume uniform (equal) probabilities over all possibilities: H(P) ≤ h(|S|) = log n. This is in fact true and can be proved. In order to prove this result, we need the following lemma.

Lemma 1.1 Let p1, p2, ..., pm and q1, q2, ..., qm be two probability distributions over the same number m of possibilities. Then

−∑_{i=1}^{m} pi log qi ≥ −∑_{i=1}^{m} pi log pi    (1.4)

and equality holds if, and only if, pi = qi.

Proof We have log x = log e · ln x, where ln denotes the natural logarithm to the base e. ln x is a concave function, that is, its graph lies below its tangent at every point (see Figure 1.6). Taking the tangent of ln x at the point x = 1 gives us

ln x ≤ x − 1,

with equality if, and only if, x = 1. Therefore, we have

ln(qi/pi) ≤ qi/pi − 1,

hence

∑_{i=1}^{m} pi ln(qi/pi) ≤ ∑_{i=1}^{m} qi − ∑_{i=1}^{m} pi = 1 − 1 = 0.

From this we conclude that

∑_{i=1}^{m} pi log qi − ∑_{i=1}^{m} pi log pi = log e · ∑_{i=1}^{m} pi ln(qi/pi) ≤ 0.

This shows that (1.4) holds, with equality if, and only if, qi/pi = 1 for all i, i.e. pi = qi.

We now apply this lemma with qi = 1/m:

H(P) − log m = −∑_{i=1}^{m} pi log pi + ∑_{i=1}^{m} pi log(1/m) ≤ 0,

with equality if, and only if, pi = 1/m. We have proved the following theorem:


Figure 1.6: The function y = ln x and its tangent at the point x = 1; the graph of ln x lies below the tangent.

Theorem 1.2

max H(P) = log m,    (1.5)

where the maximum is taken over all probability distributions P for m possibilities. This maximum is reached only for the equiprobable distribution.

We add further elementary properties of entropy.

1. If (S1, P1) and (S2, P2) are two probabilistic choice systems with |S1| = |S2| and P1 = P2, then H(P1) = H(P2). This says that the entropy depends only on the probability distribution, but not on the nature of the possibilities ei in the choice system. This follows directly from the definition of the entropy (1.3), which depends only on the probabilities pi.

2. We have H(p1, p2, ..., pn) = H(p1, p2, ..., pn, 0). This comes from the convention 0 log 0 = 0. It says that possibilities with vanishing probability are irrelevant to the amount of uncertainty. This is reasonable, since we may be sure that such possibilities are never selected.

3. Consider a two-stage scheme as depicted in Figure 1.7. In the first stage one of two possibilities is selected, with probabilities p and q = 1 − p. If in the first stage the first possibility is selected, then in the second stage one of n possibilities is selected with probabilities pi/p. If the second possibility is selected in the first stage, then one of m possibilities is selected in the second stage with probabilities qi/q. Here it is assumed that

∑_{i=1}^{n} pi = p,    ∑_{i=1}^{m} qi = q.

Note that this implies p1 + p2 + ... + pn + q1 + q2 + ... + qm = 1, i.e. {p1, ..., pn, q1, ..., qm} is a probability distribution on n + m elements. Then we have the following equality between the entropies of the two stages:

H(p1, p2, ..., pn, q1, q2, ..., qm) = H(p, q) + p·H(p1/p, p2/p, ..., pn/p) + q·H(q1/q, q2/q, ..., qm/q).

This can be verified from the definition of entropy.

Figure 1.7: A two-stage probabilistic choice where in the first stage one of two possibilities is selected with probabilities p and q = 1 − p, and in the second stage either one of n possibilities with probability pi/p or one of m possibilities with probability qi/q.
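Property 3 above (the grouping, or two-stage, property) is easy to verify numerically. Here is a small sketch with an arbitrary choice of numbers of our own (p1, p2 and q1, q2, q3 are just examples):

    from math import log2, isclose

    def entropy(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    p_part = [0.2, 0.3]          # first-stage branch, total p = 0.5
    q_part = [0.1, 0.15, 0.25]   # second branch, total q = 0.5
    p, q = sum(p_part), sum(q_part)

    lhs = entropy(p_part + q_part)
    rhs = (entropy([p, q])
           + p * entropy([x / p for x in p_part])
           + q * entropy([x / q for x in q_part]))
    assert isclose(lhs, rhs)     # the two-stage decomposition of the entropy
    print(lhs, rhs)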

We add a number of more technical properties of entropy.

Proposition 1.1
1. H(p1, p2, ..., pn) = H(pσ(1), pσ(2), ..., pσ(n)) for every permutation σ.
2. H(p1, p2, ..., pn) is continuous in all its variables.
3. We have the equation H(p1, ..., pn) = H(p1 + p2, p3, ..., pn) + (p1 + p2)·H(p1/(p1+p2), p2/(p1+p2)) for every probability distribution p1, ..., pn with n ≥ 2.
4. H(1/n, ..., 1/n) is monotone increasing with n.

These propositions are very easy to prove. Their importance resides in the fact that they are characterizing properties of entropy. That is, when we impose these four reasonable conditions on a measure of uncertainty, then we necessarily get the entropy as this measure. We shall return to this interesting point in Section 1.1.4.

Proof (1) Follows directly from the definition of entropy and the commutativity of addition. (2) Follows from the fact that the logarithm is a continuous function. (3) Needs some simple computations:

H(p1, p2, ..., pn)
= −p1 log p1 − p2 log p2 − ∑_{i=3}^{n} pi log pi
= −(p1 + p2) log(p1 + p2) − ∑_{i=3}^{n} pi log pi + p1 log(p1 + p2) − p1 log p1 + p2 log(p1 + p2) − p2 log p2
= H(p1 + p2, p3, ..., pn) − (p1 + p2) [ (p1/(p1+p2)) log(p1/(p1+p2)) + (p2/(p1+p2)) log(p2/(p1+p2)) ]
= H(p1 + p2, p3, ..., pn) + (p1 + p2)·H(p1/(p1+p2), p2/(p1+p2)).

(4) Follows since H(1/n, ..., 1/n) = log n and the logarithm is monotone increasing.

For further reference we introduce an alternative notion. A probabilistic choice scheme (S, P) may also be represented by a finite random variable X, which takes values ei from S = {e1, e2, ..., em}. The probability that X = ei is then pX(ei), and P is called the probability density of X. Inversely, each finitely-valued random variable gives rise to a probabilistic choice system. Formally, a random variable with values in S is a mapping of some sample space Ω into S. A probability distribution in S is then induced by

pX(x) = ∑_{ω : X(ω) = x} p(ω),

if {p(ω) : ω ∈ Ω} are the probabilities defined in the finite sample space Ω. The set {pX(x) : x ∈ S} then defines the probabilities on the choice space S. So we may as well speak of random variables instead of probabilistic choice systems, and define accordingly the entropy of a random variable X with values in S by

H(X) = −∑_{x∈S} pX(x) log pX(x).

This then measures the uncertainty associated with the random variable X. In what follows this will often be a more convenient way to look at things.

Example 1.7 (Bernoulli) Let X be a binomial random variable representing n Bernoulli trials, that is, with pX(x) = C(n, x) p^x (1−p)^(n−x), where C(n, x) denotes the binomial coefficient.

The entropy of X is given by

H(X) = −∑_{i=0}^{n} C(n, i) p^i (1−p)^(n−i) log( C(n, i) p^i (1−p)^(n−i) ).

Let us take n = 4 and p = q = 0.5. Hence

H(X) = −∑_{i=0}^{4} C(4, i) (0.5)^i (0.5)^(4−i) log( C(4, i) (0.5)^i (0.5)^(4−i) )
     = −0.1250 log 0.0625 − 0.5 log 0.25 − 0.375 log 0.375 ≈ 2.0306 bit.

To conclude this section, we consider a random variable X associated with a probabilistic choice situation (S, P). Assume that an event E ⊆ S is observed. How does this affect the uncertainty? This event induces a new probabilistic choice situation (E, PE). Here PE refers to the conditional probabilities

pX|E(x) = pX(x) / pX(E), for all x ∈ E.

The uncertainty related to this new situation is

H(X|E) = −∑_{x∈E} pX|E(x) log pX|E(x).

This is also called the conditional entropy of X given E.

Example 1.8 (Conditional Entropy) Let X be a random variable related to the probabilistic choice situation (S, P) given by S = {1, 2, 3, 4}, P = {0.5, 0.25, 0.125, 0.125}, and let E = {1, 3} be an event. Thus H(X) = −0.5 log 0.5 − 0.25 log 0.25 − 0.125 log 0.125 − 0.125 log 0.125 = 1.75 bit. With pX(E) = 0.625, pX|E(1) = pX(1)/pX(E) = 0.8 and pX|E(3) = pX(3)/pX(E) = 0.2, we obtain H(X|E) = −pX|E(1) log pX|E(1) − pX|E(3) log pX|E(3) = −0.8 log 0.8 − 0.2 log 0.2 ≈ 0.7219 bit.
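Examples 1.7 and 1.8 can be reproduced with a few lines of Python (a sketch of our own, with our own helper names):

    from math import comb, log2

    def entropy(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    # Example 1.7: binomial distribution with n = 4, p = 0.5
    n, p = 4, 0.5
    binom = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    print(round(entropy(binom), 4))        # 2.0306 bit

    # Example 1.8: conditional entropy given the event E = {1, 3}
    P = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}
    E = {1, 3}
    pE = sum(P[x] for x in E)
    cond = [P[x] / pE for x in E]
    print(round(entropy(cond), 4))         # 0.7219 bit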

Summary for Section 1.1
- We have seen that for a probabilistic choice, represented by a probabilistic choice system, a different strategy in the game of questions and answers must be used: rather than partitioning the set of possibilities into subsets of equal cardinality, we partition it into subsets of nearly equal probability. This leads to the entropy as approximately the expected number of questions and thus as an appropriate measure of uncertainty;
- We have seen that the uncertainty of a choice system equals the entropy of a probabilistic choice system with equal probabilities, that is, a uniform probability distribution. This corresponds to Laplace's principle of insufficient reason. So the concept of entropy also covers the case of non-probabilistic choice systems;
- In fact, equal probabilities, or choice systems without known probabilities, represent, for a set S of a given cardinality, the largest uncertainty;
- Entropy depends only on the probability distribution of a choice system, but not on the nature of the possibilities;
- We saw some simple properties of entropy which characterize the concept of entropy.

Control Question 5
Given a probabilistic choice system (S, P) with S = {e1, e2, ..., en} and P = {p1, p2, ..., pn}. Then, H(P)
1. = −∑_{i=1}^{n} pi log pi
2. = h(n)
3. ≤ log n
4. ≤ h(|S|)
5. > 0

Answer
1. This is correct. It is simply the definition of the amount of uncertainty of a probabilistic choice system, also known as the entropy.
2. In general, this assertion is wrong. But it becomes correct if P is the uniform probability distribution, that is, if p1 = ... = pn = 1/n.
3. We saw that if we assume uniform probabilities for all possibilities, we have maximal entropy, so this is correct.
4. Since h(|S|) = log n, this case equals the case above. Thus, this assertion is also true.
5. Take as an example the probability distribution p1 = 1, p2 = ... = pn = 0. The entropy is then H(P) = −log 1 = 0. This counter-example shows that the proposition is wrong. However, H(P) ≥ 0 holds for all probability distributions P.

Control Question 6
Given the binary tree depicted in Figure 1.8, compute
1. the expected word length;
2. the entropy.

Figure 1.8: Compute the expected word length and the entropy in this binary tree (the leaves, all at depth 3, carry the probabilities 0.1, 0.01, 0.05, 0.05, 0.03, 0.07, 0.4 and 0.29).

Answer
1. The expected word length is equal to 3 · (0.1 + 0.01 + 0.05 + 0.05 + 0.03 + 0.07 + 0.4 + 0.29) = 3. This is not a surprise, since the tree is equilibrated.
2. For the entropy we get −0.1 log 0.1 − 0.01 log 0.01 − 0.1 log 0.05 − 0.03 log 0.03 − 0.07 log 0.07 − 0.4 log 0.4 − 0.29 log 0.29 ≈ 2.2978.

Control Question 7
Given a probabilistic choice system (S, P) with S = {e1, e2, ..., en} and P = {p1, p2, ..., pn}. Then, H(p1, p2, ..., pn)
1. = H(0, p1, ..., pn);
2. = H(p1 + p2, p3, ..., pn) + (p1 + p2)·H(p1/(p1+p2), p2/(p1+p2));
3. = H(p1 + pn, p2, ..., pn−1) + (p1 + pn)·H(p1/(p1+pn), pn/(p1+pn));
4. = ∑_{i=1}^{n} pi log(1/pi).

Answer
1. Follows directly from the convention 0 log 0 = 0.
2. That is correct (see Proposition 1.1).
3. Same case as above (use the permutation property).
4. Since we have log(1/pi) = log 1 − log pi = −log pi, the assertion is correct.

Control Question 8
Let X be a random variable related to a probabilistic choice situation (S, P) and E an event E ⊆ S. Then H(X|E) ≤ H(X). Is this assertion correct?

Answer
No, the assertion is incorrect. Here is a counter-example: let S = {e1, e2, e3}, p1 = 0.99, p2 = 0.005, p3 = 0.005 and E = {e2, e3}. Hence we get H(X) = −0.99 log 0.99 − 0.01 log 0.005 ≈ 0.0908 and H(X|E) = −log 0.5 = 1.

1.1.3 Conditional Entropy

Learning Objectives for Subsection 1.1.3
After studying this section you should
- know how the entropy of compound choice systems or multidimensional variables is related to the entropy of the components or single variables;
- understand how knowledge of the choice in one component, or of the value of one variable, affects the uncertainty of the remaining components or variables.

We start by considering two choice systems S1 and S2 and the associated system of independent choices S1 × S2 = {(e1,1, e2,1), (e1,1, e2,2), ..., (e1,n, e2,m)}. By assigning probabilities pi,j to the compound choices (e1,i, e2,j), we extend the system of independent choices to a compound probabilistic choice system (S1 × S2, P), where P = {pi,j ; i = 1, 2, ..., n; j = 1, 2, ..., m}. We must have

0 ≤ pi,j,    ∑_{i=1}^{n} ∑_{j=1}^{m} pi,j = 1.

This is a two-dimensional probability distribution. We may compute the two marginal distributions P1 = {p1^(1), p2^(1), ..., pn^(1)} and P2 = {p1^(2), p2^(2), ..., pm^(2)}, defined by

pi^(1) = ∑_{j=1}^{m} pi,j,    pj^(2) = ∑_{i=1}^{n} pi,j.    (1.6)

This gives us the two associated probabilistic choice systems (S1, P1) and (S2, P2).


We shall introduce a random variable for each probabilistic choice system, as explained at the end of the previous subsection. So let X be associated with the system (S1, P1) and Y with the system (S2, P2). The pair of variables (X, Y) is then associated with the compound probabilistic system (S1 × S2, P). We have the two-dimensional probability distribution pX,Y(e1,i, e2,j) = pi,j for the pair of random variables (X, Y). Variable X has the marginal distribution pX(e1,i) = pi^(1) and Y the marginal distribution pY(e2,j) = pj^(2). We remind you that two probabilistic choice systems, or two random variables X and Y, are called independent if, and only if, pX,Y(x, y) = pX(x) · pY(y) for all pairs (x, y) ∈ S1 × S2.

We have three different entropies associated with the three probabilistic choice systems: the two single variables X and Y and the two-dimensional variable (X, Y),

H(X, Y) = −∑_{x∈S1} ∑_{y∈S2} pX,Y(x, y) log pX,Y(x, y),
H(X) = −∑_{x∈S1} pX(x) log pX(x),
H(Y) = −∑_{y∈S2} pY(y) log pY(y).

Example 1.9 (Compound Probabilistic Choice System) Given a compound system of independent choices

S1 × S2 = {(e1,1, e2,1), (e1,1, e2,2), (e1,2, e2,1), (e1,2, e2,2)},    P = {0.5, 0.1, 0.3, 0.1},

and two random variables (X, Y) associated with (S1 × S2, P). It is easy to identify the single choice systems S1 = {e1,1, e1,2}, S2 = {e2,1, e2,2}. Applying (1.6) gives us the two marginal distributions P1 = {0.6, 0.4} and P2 = {0.8, 0.2}. We are now able to compute the entropies

H(X, Y) = −∑_{x∈S1} ∑_{y∈S2} pX,Y(x, y) log pX,Y(x, y) = −0.5 log 0.5 − 0.1 log 0.1 − 0.3 log 0.3 − 0.1 log 0.1 ≈ 1.6855 bit,
H(X) = −∑_{x∈S1} pX(x) log pX(x) = −0.6 log 0.6 − 0.4 log 0.4 ≈ 0.9710 bit,
H(Y) = −∑_{y∈S2} pY(y) log pY(y) = −0.8 log 0.8 − 0.2 log 0.2 ≈ 0.7219 bit.
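The marginal distributions and the three entropies of Example 1.9 can be reproduced numerically; the check also shows that H(X, Y) < H(X) + H(Y) here, which anticipates the following theorem. A small sketch of our own (helper names are ours):

    from math import log2

    def entropy(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    # joint distribution of Example 1.9, indexed by (i, j)
    joint = {(1, 1): 0.5, (1, 2): 0.1, (2, 1): 0.3, (2, 2): 0.1}
    pX = {i: sum(p for (a, _), p in joint.items() if a == i) for i in (1, 2)}
    pY = {j: sum(p for (_, b), p in joint.items() if b == j) for j in (1, 2)}

    H_XY = entropy(joint.values())
    H_X, H_Y = entropy(pX.values()), entropy(pY.values())
    print(round(H_XY, 4), round(H_X, 4), round(H_Y, 4))   # 1.6855, 0.971, 0.7219
    print(H_XY < H_X + H_Y)                               # True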


The question arises how the three entropies above are related. The answer is contained in the following theorem.

Theorem 1.3 For any pair of random variables X and Y, we have

H(X, Y) ≤ H(X) + H(Y).    (1.7)

Equality holds if, and only if, X and Y are independent random variables.

Proof This theorem is proved by straightforward calculation using Lemma 1.1:

H(X) + H(Y) = −∑_x pX(x) log pX(x) − ∑_y pY(y) log pY(y)
= −∑_x ∑_y pX,Y(x, y) log pX(x) − ∑_x ∑_y pX,Y(x, y) log pY(y)
= −∑_x ∑_y pX,Y(x, y) log( pX(x) · pY(y) ).

Now, Lemma 1.1 gives us the following inequality:

−∑_x ∑_y pX,Y(x, y) log( pX(x) · pY(y) ) ≥ −∑_x ∑_y pX,Y(x, y) log pX,Y(x, y) = H(X, Y).

This proves the inequality (1.7). According to Lemma 1.1 we have equality in the last inequality if, and only if, pX,Y(x, y) = pX(x) · pY(y), which means that X and Y are independent.

This theorem tells us that the entropies of two variables add up to the entropy of the compound, two-dimensional variable only if the variables are independent. Otherwise there is less uncertainty in the compound situation than in the two simple choice systems. The reason is that the dependence between the variables (their correlation) accounts for some common part of the uncertainty in both single variables.

Example 1.10 (Compound Probabilistic Choice System - Continuation) In Example 1.9 we had the random variables X and Y. Check yourself that H(X, Y) < H(X) + H(Y).

Theorem 1.3 generalizes easily to more than two variables. Let X in a general setting denote the vector (X1, X2, ..., Xm) of m random variables Xi. This vector random variable has the probability distribution pX(x), where x = (x1, x2, ..., xm), and each variable Xi has the marginal distribution

pXi(xi) = ∑_{x1, ..., xi−1, xi+1, ..., xm} pX(x1, x2, ..., xi−1, xi, xi+1, ..., xm).

The random variables X1, X2, ..., Xm are called (mutually) independent if, and only if, pX(x) = pX1(x1) · pX2(x2) · ... · pXm(xm) holds. The common entropy of the multidimensional variable X is defined by

H(X) = −∑_x pX(x) log pX(x).

Then Theorem 1.3 can be generalized as in the following corollary.

Corollary 1.2 For any multidimensional random variable (X1, X2, ..., Xm) we have

H(X) ≤ ∑_{i=1}^{m} H(Xi).

Equality holds if, and only if, the variables X1, X2, ..., Xm are mutually independent.

Proof Goes by induction over m. The corollary holds for m = 2 according to Theorem 1.3. Suppose it holds for m. Then consider the pair of random variables Xm = (X1, X2, ..., Xm) and Xm+1, such that Xm+1 = (Xm, Xm+1). Again by Theorem 1.3 and by the induction assumption, we have

H(Xm+1) ≤ H(Xm) + H(Xm+1) ≤ ∑_{i=1}^{m} H(Xi) + H(Xm+1) = ∑_{i=1}^{m+1} H(Xi).

Example 1.11 (Independence) Let X1 , . . . , Xn be independent random variables supplying the result 0 with probability 0.5 and 1 with probability 0.5, which means pXi (0) = 0.5, pXi (1) = 0.5, for i = 1, . . . , n.

Hence H(X1, ..., Xn) = n · H(X1) = n.

We come back to the case of two variables X and Y. Suppose we observe the value of one variable, say Y = y. How does this affect the uncertainty concerning variable X? We remark that this observation changes the distribution pX(x) to the conditional distribution pX|y(x, y) defined as

pX|y(x, y) = pX,Y(x, y) / pY(y).

Therefore, we obtain the conditional entropy of X given Y = y,

H(X|Y = y) = −∑_x pX|y(x, y) log pX|y(x, y).

To simplify the notation we often abbreviate H (X |Y = y ) by H (X |y ). So, the observation that Y = y changes the uncertainty regarding X from H (X ) to H (X |y ).


As the following example shows, the new entropy or uncertainty may be smaller or larger than the old one. A particular observation may increase or decrease uncertainty. Note however that if the two random variables are independent, then we have pX|y(x, y) = pX(x) for every x and y. In this case we see that

H(X|y) = −∑_x pX(x) log pX(x) = H(X).

The uncertainty in X does not change when a variable Y which is independent of X is observed.

Example 1.12 (Conditional Entropy) Given pX,Y(0, 1) = pX,Y(1, 0) = pX,Y(0, 0) = 1/3, pX,Y(1, 1) = 0, pX(0) = pY(0) = 2/3 and pX(1) = pY(1) = 1/3. Hence

H(X|Y = 0) = −pX|y(0, 0) log pX|y(0, 0) − pX|y(1, 0) log pX|y(1, 0)
= −(pX,Y(0, 0)/pY(0)) log(pX,Y(0, 0)/pY(0)) − (pX,Y(1, 0)/pY(0)) log(pX,Y(1, 0)/pY(0))
= −0.5 log 0.5 − 0.5 log 0.5 = 1,
H(X|Y = 1) = −(pX,Y(0, 1)/pY(1)) log(pX,Y(0, 1)/pY(1)) − (pX,Y(1, 1)/pY(1)) log(pX,Y(1, 1)/pY(1)) = 0,
H(X) = H(Y) = −(2/3) log(2/3) − (1/3) log(1/3) ≈ 0.9183.

So we get that H(X|Y = 1) < H(X) < H(X|Y = 0).

In addition to the conditional entropy of X given a particular observation Y = y, we may consider the expected conditional entropy of X given Y, which is the expected value of H(X|y) relative to y,

H(X|Y) = ∑_y pY(y) H(X|y).

We emphasize the difference between H(X|y), H(X|E) and H(X|Y). In the former cases we mean the conditional entropy of the variable X given an observed event Y = y or E. In the latter case we denote the expected conditional entropy. Note that pX,Y(x, y) = pY(y) pX|y(x, y). Therefore, we can develop the expected conditional entropy as follows:

H(X|Y) = ∑_y pY(y) H(X|y)
= −∑_x ∑_y pY(y) pX|y(x, y) log pX|y(x, y)
= −∑_x ∑_y pX,Y(x, y) log( pX,Y(x, y) / pY(y) )
= −∑_x ∑_y pX,Y(x, y) (log pX,Y(x, y) − log pY(y))
= H(X, Y) − H(Y).

Here we have proven the following very important theorem.

Theorem 1.4 For a pair of random variables X and Y we always have H (X, Y ) = H (Y ) + H (X |Y ). (1.8)

This theorem tells us that we can always consider the uncertainty of a pair of variables as the result of a chaining, where we start with the uncertainty of one of the variables, say Y, and add the (expected) conditional uncertainty of the second one, given the first one. Of course, we may start with either of the two variables. Thus, H(X, Y) = H(X) + H(Y|X) also holds true.

Example 1.13 (Expected Conditional Entropy) We continue Example 1.12 by computing the expected conditional entropy of X given Y and the compound entropy of (X, Y):

H(X|Y) = pY(0) H(X|Y = 0) + pY(1) H(X|Y = 1) = 2/3 bit,
H(X, Y) = −log(1/3) = log 3 ≈ 1.5850 bit.
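Examples 1.12 and 1.13, together with the chaining rule H(X, Y) = H(Y) + H(X|Y), can be verified with a short sketch (our own code, not part of the module):

    from math import log2, isclose

    def entropy(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    # joint distribution of Example 1.12
    joint = {(0, 0): 1/3, (0, 1): 1/3, (1, 0): 1/3, (1, 1): 0.0}
    pY = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

    def H_X_given(y):
        return entropy([joint[(x, y)] / pY[y] for x in (0, 1)])

    H_XgY = sum(pY[y] * H_X_given(y) for y in (0, 1))   # expected conditional entropy
    H_XY, H_Y = entropy(joint.values()), entropy(pY.values())
    print(H_X_given(0), H_X_given(1))   # H(X|Y=0) = 1 bit, H(X|Y=1) = 0 bit
    print(round(H_XgY, 4))              # 0.6667, i.e. 2/3 bit
    assert isclose(H_XY, H_Y + H_XgY)   # chaining rule (1.8)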

As you can check, H(X, Y) = H(Y) + H(X|Y).

In communication theory, channels for the transmission of signals are considered. Suppose that at the input, random signs from a certain choice system I appear with known probabilities. This then defines a random variable X. During a transmission an input sign can be changed into some output sign from an output choice system O. Of course there must be some dependence between the sign at the input and the sign at the output. If we denote the output sign by the variable Y, then this dependence is described by the conditional probabilities pY|x(y, x), where

0 ≤ pY|x(y, x) for all x ∈ I, y ∈ O,    ∑_y pY|x(y, x) = 1 for all x ∈ I.

This is called the transmission matrix. Figure 1.9 shows this channel system. Then the equation H (X, Y ) = H (X ) + H (Y |X ) says that the whole uncertainty in the system is composed of the uncertainty of the input signal H (X ) and the transmission uncertainty over the channel H (Y |X ).

Figure 1.9: Transmission channel. Example 1.14 (Symmetric Binary Channel) A simple symmetric binary chan-

34

CHAPTER 1. UNCERTAINTY AND INFORMATION

nel with random variables X for the input and Y for the output is given by the following transmission matrix: P= pY |x (0, 0) pY |x (1, 0) pY |x (0, 1) pY |x (1, 1) = 1 1

Thus, the probability of a transmission error is ε. Let p_X(0) = p and p_X(1) = q = 1 - p. Hence H(X) = -p log p - (1-p) log(1-p) and
$$p_Y(0) = p_{Y|x}(0,0)\,p_X(0) + p_{Y|x}(0,1)\,p_X(1) = (1-\epsilon)p + \epsilon(1-p),$$
$$p_Y(1) = p_{Y|x}(1,0)\,p_X(0) + p_{Y|x}(1,1)\,p_X(1) = \epsilon p + (1-\epsilon)(1-p).$$
With
$$H(Y|0) = -p_{Y|x}(0,0)\log p_{Y|x}(0,0) - p_{Y|x}(1,0)\log p_{Y|x}(1,0) = -(1-\epsilon)\log(1-\epsilon) - \epsilon\log\epsilon,$$
$$H(Y|1) = -p_{Y|x}(0,1)\log p_{Y|x}(0,1) - p_{Y|x}(1,1)\log p_{Y|x}(1,1) = -\epsilon\log\epsilon - (1-\epsilon)\log(1-\epsilon) = H(Y|0),$$
we obtain
$$H(Y|X) = p_X(0)H(Y|0) + p_X(1)H(Y|1) = pH(Y|0) + (1-p)H(Y|0) = H(Y|0) = H(Y|1).$$
This is not a surprise, since the channel is symmetric. With a certain amount of effort you can show that H(X,Y) = H(X) + H(Y|X).

Let us now consider a numerical example. Given ε = 0.1, p_X(0) = p = 0.2 and p_X(1) = q = 0.8. Thus
$$H(X) = -0.2\log 0.2 - 0.8\log 0.8 \approx 0.7219\ \text{bit},$$
$$p_Y(0) = 0.9\cdot 0.2 + 0.1\cdot 0.8 = 0.26, \qquad p_Y(1) = 0.1\cdot 0.2 + 0.9\cdot 0.8 = 0.74,$$
and
$$H(Y|X) = H(Y|0) = -0.9\log 0.9 - 0.1\log 0.1 \approx 0.4690\ \text{bit}.$$
Since
$$p_{X,Y}(0,0) = p_{Y|x}(0,0)p_X(0) = 0.9\cdot 0.2 = 0.18, \qquad p_{X,Y}(1,0) = p_{Y|x}(0,1)p_X(1) = 0.1\cdot 0.8 = 0.08,$$
$$p_{X,Y}(0,1) = p_{Y|x}(1,0)p_X(0) = 0.1\cdot 0.2 = 0.02, \qquad p_{X,Y}(1,1) = p_{Y|x}(1,1)p_X(1) = 0.9\cdot 0.8 = 0.72,$$
we finally obtain
$$H(X,Y) = -0.18\log 0.18 - 0.08\log 0.08 - 0.02\log 0.02 - 0.72\log 0.72 \approx 1.1909\ \text{bit}.$$
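The numbers of this example are easy to recompute. The following sketch is our own illustration (variable names such as `eps` and `H` are ours), not code from the course.

```python
# Recomputing the quantities of example 1.14 for a symmetric binary channel.
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

eps, p0 = 0.1, 0.2                          # error probability and p_X(0)
p_x = [p0, 1 - p0]
p_y_given_x = [[1 - eps, eps], [eps, 1 - eps]]   # transmission matrix p_{Y|X}

p_xy = {(x, y): p_x[x] * p_y_given_x[x][y] for x in (0, 1) for y in (0, 1)}

H_X = H(p_x)                                              # ~0.7219 bit
H_Y_given_X = sum(p_x[x] * H(p_y_given_x[x]) for x in (0, 1))  # ~0.4690 bit
H_XY = H(p_xy.values())                                   # ~1.1909 bit
print(H_XY, H_X + H_Y_given_X)    # both ~1.1909: H(X,Y) = H(X) + H(Y|X)
```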

Theorem 1.4 again generalizes easily to a sequence of more than two variables. Contrary to the conditional entropy of X given an observation Y = y, the expected conditional entropy of X given Y is always less than or equal to the entropy of X. So, on average, an observation of Y never increases the uncertainty of X.


Corollary 1.3 For any pair of random variables X and Y, we have
$$H(X|Y) \le H(X). \tag{1.9}$$
Equality holds if, and only if, X and Y are independent.

Proof To prove inequality (1.9) we use the chaining rule and theorem 1.3:
$$H(X|Y) = H(X,Y) - H(Y) \le H(X) + H(Y) - H(Y) = H(X),$$
with equality if and only if X and Y are independent.

If X and Y are independent, then observing either one of these two variables does not change the uncertainty of the other one. Intuitively, in this case one variable cannot give information about the other one.

Corollary 1.4 Let X_1, X_2, ..., X_m be random variables. Then
$$H(X_1, X_2, \ldots, X_m) = H(X_1) + H(X_2|X_1) + \cdots + H(X_m|X_1, X_2, \ldots, X_{m-1}). \tag{1.10}$$

Proof The proof is by induction. It holds for m = 2 by theorem 1.4. Suppose it holds for some m, and write X^{(m)} for the compound variable (X_1, X_2, ..., X_m). From theorem 1.4 and the induction hypothesis, we obtain
$$H(X_1, \ldots, X_m, X_{m+1}) = H(X^{(m)}, X_{m+1}) = H(X^{(m)}) + H(X_{m+1}|X^{(m)}) = H(X_1, \ldots, X_m) + H(X_{m+1}|X_1, \ldots, X_m) = H(X_1) + H(X_2|X_1) + \cdots + H(X_m|X_1, \ldots, X_{m-1}) + H(X_{m+1}|X_1, \ldots, X_m).$$
So equation (1.10) holds for m + 1, hence for all m. (1.10) is called the (generalized) chaining rule. It is especially important for communication theory.

Summary for Section 1.1
We found that the joint entropy of several random variables is always less than or equal to the sum of the entropies of the individual variables. It equals the sum only if the variables are independent. The conditional entropy measures the uncertainty of a variable when the value of another variable is observed. This uncertainty may, depending on the observation, increase or decrease. The expected conditional entropy, however, is never larger than the original entropy. Conditional entropy equals unconditional entropy if the random variables are independent.

Control Question 9
Relate
1. H(X|y);
2. H(X|Y);
3. H(X|E);
with
a. expected conditional entropy;
b. entropy conditioned on an observed event.

Answer
H(X|y) and H(X|E) denote the entropy conditioned on an observed event, whereas H(X|Y) is the expected conditional entropy.

Control Question 10
H(X) may be
1. < H(X|Y);
2. < H(X|y);
3. > H(X|y);
4. = H(X|y);
5. = H(X|Y).

Answer
The first assertion is incorrect, since for any pair of random variables X and Y we have H(X) ≥ H(X|Y). The second and third propositions are indeed correct (see example 1.12). If X and Y are independent random variables, we always have H(X) = H(X|Y) = H(X|y), hence the fourth and fifth assertions are also true.

Control Question 11
Relate, if possible,
1. H(X,Y);
2. H(X|Y);
with
a. ≤ H(X);
b. ≤ H(Y);
c. ≤ H(X) + H(Y);
d. = H(Y) + H(X|Y);
e. = H(X) + H(Y|X);
f. = H(Y) + H(Y|X);
g. = -Σ_{x,y} p_{X|Y}(x,y) log p_{X|Y}(x,y);
h. = H(X,Y) - H(X).

Answer
For H(X,Y) we have
c. ≤ H(X) + H(Y);
d. = H(Y) + H(X|Y);
e. = H(X) + H(Y|X).
For H(X|Y) we have
a. ≤ H(X);
c. ≤ H(X) + H(Y).

1.1.4 Axiomatic Determination of Entropy

Learning Objectives for Subsection 1.1.4 After studying this section you should understand that entropy is characterized by some simple conditions; how entropy is derived from these conditions.

We introduce here four simple conditions, which should be satised by a measure of uncertainty related to a probability over a nite choice system. If S = {e1 , e2 , . . . , en } is a nite choice system and p1 = p(e1 ), p2 = p(e2 ), . . . , pn = p(en ) a probability distribution over it, then the measure of uncertainty of this system is assumed to be a function H (p1 , p2 , . . . , pn ) of p1 , p2 , . . . , pn only. But at this point the form this function should take is undened. In particular, H does not denote the entropy here, but some unknown function. Rather than to dene H by some formula, we impose the following conditions on H :


(H1) H(p_1, p_2, ..., p_n) = H(p_{σ(1)}, p_{σ(2)}, ..., p_{σ(n)}) for any permutation σ of 1, 2, ..., n.

(H2) H(p_1, p_2, ..., p_n) is a continuous function of all its variables.

(H3) The equation
$$H(p_1, p_2, \ldots, p_n) = H(p_1 + p_2, p_3, \ldots, p_n) + (p_1 + p_2)\,H\!\left(\frac{p_1}{p_1+p_2}, \frac{p_2}{p_1+p_2}\right)$$
holds for all probability distributions p_1, p_2, ..., p_n.

(H4) For the uniform probability distribution p_i = 1/n for i = 1, ..., n, H(p_1, ..., p_n) = H(1/n, ..., 1/n), as a function of n ≥ 1, is monotone increasing.

These are reasonable conditions to impose on a measure of uncertainty of a probability distribution. (H1) says that the measure does not depend on the numbering of the possible choices. (H2) requires that small changes in the probabilities should only provoke small changes in the measures of uncertainty. Condition (H3) is more technical, it is a simple form of a chaining formula like theorem 1.4. (H4) expresses the idea that with uniform distribution (equivalent to choice without probabilities), the uncertainty should increase with the number of possibilities. Entropy as dened in subsection 1.1.2 indeed satises these conditions as is stated in proposition 1.1. In this subsection we shall prove that H must be the entropy, if (H1) to (H4) are required.

Theorem 1.5 (H1), (H2), (H3) and (H4) are satisfied if, and only if,
$$H(p_1, p_2, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log p_i. \tag{1.11}$$

The logarithm may be taken to any base. The "if" part of the theorem has already been proved in proposition 1.1. The "only if" part remains to be proved. That is, we assume conditions (H1) to (H4) and derive (1.11). We do this in three steps. In each step we prove a lemma, each of which is also interesting in itself. We start by showing that (H3) essentially already contains a general form of chaining. We consider a probability distribution p_1, p_2, ..., p_n. But instead of selecting one of the possibilities directly according to these probabilities, we use a two-stage choice scheme as represented by a tree in Figure 1.10. In the first stage one of several arcs is selected with probability p_1 + ... + p_{i_1}, p_{i_1+1} + ... + p_{i_2}, .... In the second stage, depending on the choice in the first stage, a second arc is selected. For example, if in the first stage the leftmost arc has been selected, then the next selection is according to the probabilities
$$\frac{p_1}{p_1 + \cdots + p_{i_1}}, \frac{p_2}{p_1 + \cdots + p_{i_1}}, \ldots, \frac{p_{i_1}}{p_1 + \cdots + p_{i_1}}.$$
We see that with this two-stage scheme we finally select one of the n possibilities with the original probabilities p_1, p_2, ..., p_n. We now have the following lemma.


Figure 1.10: A two-stage probabilistic choice system.

Lemma 1.2 (H1) and (H3) imply that
$$
\begin{aligned}
&H(p_1, \ldots, p_{i_1}, p_{i_1+1}, \ldots, p_{i_2}, \ldots, p_{i_{s-1}+1}, \ldots, p_{i_s}, p_{i_s+1}, \ldots, p_n) \\
&\quad = H(p_1 + \cdots + p_{i_1},\ p_{i_1+1} + \cdots + p_{i_2},\ \ldots,\ p_{i_{s-1}+1} + \cdots + p_{i_s},\ p_{i_s+1} + \cdots + p_n) \\
&\qquad + (p_1 + \cdots + p_{i_1})\,H\!\left(\frac{p_1}{p_1 + \cdots + p_{i_1}}, \ldots, \frac{p_{i_1}}{p_1 + \cdots + p_{i_1}}\right) \\
&\qquad + (p_{i_1+1} + \cdots + p_{i_2})\,H\!\left(\frac{p_{i_1+1}}{p_{i_1+1} + \cdots + p_{i_2}}, \ldots, \frac{p_{i_2}}{p_{i_1+1} + \cdots + p_{i_2}}\right) + \cdots \\
&\qquad + (p_{i_{s-1}+1} + \cdots + p_{i_s})\,H\!\left(\frac{p_{i_{s-1}+1}}{p_{i_{s-1}+1} + \cdots + p_{i_s}}, \ldots, \frac{p_{i_s}}{p_{i_{s-1}+1} + \cdots + p_{i_s}}\right) \\
&\qquad + (p_{i_s+1} + \cdots + p_n)\,H\!\left(\frac{p_{i_s+1}}{p_{i_s+1} + \cdots + p_n}, \ldots, \frac{p_n}{p_{i_s+1} + \cdots + p_n}\right).
\end{aligned}
$$

Proof First we prove that
$$H(p_1, \ldots, p_i, p_{i+1}, \ldots, p_n) = H(p_1 + \cdots + p_i, p_{i+1}, \ldots, p_n) + (p_1 + \cdots + p_i)\,H\!\left(\frac{p_1}{p_1 + \cdots + p_i}, \ldots, \frac{p_i}{p_1 + \cdots + p_i}\right). \tag{1.12}$$

This is proved by induction over i. It holds for i = 2 by (H3). Suppose (1.12) holds for i. Then, applying the formula for i = 2 to (1.12), we obtain
$$
\begin{aligned}
H(p_1, \ldots, p_i, p_{i+1}, \ldots, p_n)
&= H(p_1 + \cdots + p_i, p_{i+1}, \ldots, p_n) + (p_1 + \cdots + p_i)\,H\!\left(\frac{p_1}{p_1 + \cdots + p_i}, \ldots, \frac{p_i}{p_1 + \cdots + p_i}\right) \\
&= \Big\{ H\big((p_1 + \cdots + p_i) + p_{i+1}, p_{i+2}, \ldots, p_n\big) \\
&\qquad + \big((p_1 + \cdots + p_i) + p_{i+1}\big)\,H\!\left(\frac{p_1 + \cdots + p_i}{p_1 + \cdots + p_{i+1}}, \frac{p_{i+1}}{p_1 + \cdots + p_{i+1}}\right) \Big\} \\
&\qquad + (p_1 + \cdots + p_i)\,H\!\left(\frac{p_1}{p_1 + \cdots + p_i}, \ldots, \frac{p_i}{p_1 + \cdots + p_i}\right).
\end{aligned}
$$
Since (1.12) holds for i, we conclude that
$$
\begin{aligned}
H\!\left(\frac{p_1}{p_1 + \cdots + p_{i+1}}, \ldots, \frac{p_i}{p_1 + \cdots + p_{i+1}}, \frac{p_{i+1}}{p_1 + \cdots + p_{i+1}}\right)
&= H\!\left(\frac{p_1 + \cdots + p_i}{p_1 + \cdots + p_{i+1}}, \frac{p_{i+1}}{p_1 + \cdots + p_{i+1}}\right) \\
&\quad + \frac{p_1 + \cdots + p_i}{p_1 + \cdots + p_{i+1}}\,H\!\left(\frac{p_1}{p_1 + \cdots + p_i}, \ldots, \frac{p_i}{p_1 + \cdots + p_i}\right).
\end{aligned}
$$

If we substitute this above, we obtain
$$H(p_1, \ldots, p_i, p_{i+1}, \ldots, p_n) = H(p_1 + \cdots + p_{i+1}, p_{i+2}, \ldots, p_n) + (p_1 + \cdots + p_{i+1})\,H\!\left(\frac{p_1}{p_1 + \cdots + p_{i+1}}, \ldots, \frac{p_{i+1}}{p_1 + \cdots + p_{i+1}}\right).$$

So (1.12) holds for every i = 2, ..., n. Now, by (H1) we then also have
$$H(p_1, \ldots, p_{i-1}, p_i, p_{i+1}, \ldots, p_{j-1}, p_j, p_{j+1}, \ldots, p_n) = H(p_1, \ldots, p_{i-1}, p_i + \cdots + p_j, p_{j+1}, \ldots, p_n) + (p_i + \cdots + p_j)\,H\!\left(\frac{p_i}{p_i + \cdots + p_j}, \ldots, \frac{p_j}{p_i + \cdots + p_j}\right),$$
and this for 1 ≤ i < j ≤ n. If we apply this successively to H(p_1, ..., p_{i_1}, p_{i_1+1}, ..., p_{i_2}, ..., p_{i_{s-1}+1}, ..., p_{i_s}, p_{i_s+1}, ..., p_n), the lemma follows.

In the next step of our overall proof, we consider the case of uniform probability distributions, or the case of choice without probabilities. That is, we put p_i = 1/n for i = 1, ..., n. We define
$$h(n) = H\!\left(\frac{1}{n}, \ldots, \frac{1}{n}\right).$$

Then we prove the next lemma:


Lemma 1.3 (H1), (H3) and (H4) imply h(n) = c log n for some constant c > 0 and all integers n.

Proof Let n = m·l for some integers m, l. By lemma 8 we have
$$h(m \cdot l) = H\!\left(\frac{1}{ml}, \ldots, \frac{1}{ml}\right) = H\!\left(\frac{1}{l}, \ldots, \frac{1}{l}\right) + l\cdot\frac{1}{l}\,H\!\left(\frac{1/ml}{1/l}, \ldots, \frac{1/ml}{1/l}\right) = h(l) + h(m). \tag{1.13}$$

This fundamental equation has the solution h(n) = c log n. We show that this is the only solution. If m and l are integers, select an integer N and determine n such that $l^n \le m^N < l^{n+1}$. Then, from (H4) we conclude that
$$h(l^n) \le h(m^N) < h(l^{n+1}).$$
But by the fundamental equation (1.13), $h(l^n) = n\,h(l)$ and $h(m^N) = N\,h(m)$, hence
$$n\,h(l) \le N\,h(m) < (n+1)\,h(l).$$
We note that for l = 1 we have $h(l^n) = h(l) = n\,h(l)$, hence h(l) = 0. For l > 1, (H4) implies that h(l) > 0. Thus suppose l > 1. Then
$$\frac{n}{N} \le \frac{h(m)}{h(l)} < \frac{n+1}{N}.$$
But we also have $\log l^n \le \log m^N < \log l^{n+1}$ by the monotonicity of the logarithm. Hence $n\log l \le N\log m < (n+1)\log l$, or
$$\frac{n}{N} \le \frac{\log m}{\log l} < \frac{n+1}{N}.$$
These inequalities imply
$$\left|\frac{h(m)}{h(l)} - \frac{\log m}{\log l}\right| < \frac{1}{N}.$$

Since this holds for all integers N, we conclude that
$$\frac{h(m)}{h(l)} = \frac{\log m}{\log l}.$$


This in turn is valid for all integers m, l. Therefore,
$$\frac{h(m)}{\log m} = \frac{h(l)}{\log l} = c,$$
for a certain constant c. Thus we have h(m) = c log m for m > 1. But for m = 1 we have both h(m) = 0 and c log 1 = 0. Thus h(m) = c log m for all m ≥ 1. Since by (H4) h(n) is monotone increasing, we must have c > 0.

In the third step we use the results obtained so far to prove (1.11) for rational probabilities. We formulate the corresponding lemma:

Lemma 1.4 (H1), (H3) and (H4) imply for rational probabilities p_1, ..., p_n that
$$H(p_1, \ldots, p_n) = -c\sum_{i=1}^{n} p_i \log p_i. \tag{1.14}$$

Proof Assume that
$$p_1 = \frac{q_1}{p}, \ \ldots, \ p_n = \frac{q_n}{p}$$
for some integers q_1, ..., q_n and p such that q_1 + ... + q_n = p. We have by definition
$$h(p) = H\!\left(\frac{1}{p}, \ldots, \frac{1}{p}\right) = H\big(\underbrace{\tfrac{1}{p}, \ldots, \tfrac{1}{p}}_{q_1}, \underbrace{\tfrac{1}{p}, \ldots, \tfrac{1}{p}}_{q_2}, \ldots, \underbrace{\tfrac{1}{p}, \ldots, \tfrac{1}{p}}_{q_n}\big),$$
where the first group contains q_1, the second q_2, and the last q_n arguments. From lemmas 8 and 1.3 we then obtain, using this grouping of variables, that
$$h(p) = H\!\left(\frac{q_1}{p}, \ldots, \frac{q_n}{p}\right) + \frac{q_1}{p}\,H\!\left(\frac{1}{q_1}, \ldots, \frac{1}{q_1}\right) + \cdots + \frac{q_n}{p}\,H\!\left(\frac{1}{q_n}, \ldots, \frac{1}{q_n}\right) = H(p_1, \ldots, p_n) + c\,p_1\log q_1 + \cdots + c\,p_n\log q_n.$$
This implies that
$$H(p_1, \ldots, p_n) = c\log p - c\,p_1\log q_1 - \cdots - c\,p_n\log q_n = c\big(p_1(\log p - \log q_1) + \cdots + p_n(\log p - \log q_n)\big) = -c\big(p_1\log p_1 + \cdots + p_n\log p_n\big).$$
This proves (1.14).


Now we are nearly at the end of the overall proof of theorem 1.5. If the p_i are arbitrary probabilities, not necessarily rational ones, then they can be approximated by a sequence of rational ones converging to p_i. Since (1.11) holds for all rational probability distributions, the required continuity (H2) of H(p_1, ..., p_n) and the continuity of the right hand side of (1.14) then imply that (1.11) holds for any probability distribution. This concludes the overall proof of theorem 1.5. This theorem tells us that we may select the base of the logarithm, as well as the constant c > 0, arbitrarily. These choices only determine the measurement unit for the measure of uncertainty.

Summary for Section 1.1
In this subsection we proved that the elementary requirements that the function H(p_1, ..., p_n) be continuous, not depend on the ordering of the probabilities, and satisfy a simple decomposition property, together with the requirement that h(n) = H(1/n, ..., 1/n) be monotone increasing with n, imply that H must be the entropy. The proof was done in three steps: in the first one a more general decomposition property was derived. In the second step, it was proved that h(n) is essentially a logarithm. In the third step it was shown that H is the entropy if the probabilities are rational numbers. The theorem then follows from the requirement of continuity.

Control Question 12
Which conditions do not characterize the entropy H?
1. H(p_1, p_2, ..., p_n) = H(p_{σ(1)}, p_{σ(2)}, ..., p_{σ(n)}) for exactly one permutation σ of 1, 2, ..., n;
2. H(p_1, p_2, ..., p_n) is a continuous function in all variables;
3. H(p_1, ..., p_n) = H(1/n, ..., 1/n), as a function of n ≥ 1, is monotone decreasing.

Answer
1. Since H(p_1, p_2, ..., p_n) = H(p_{σ(1)}, p_{σ(2)}, ..., p_{σ(n)}) must hold true for all permutations σ of 1, 2, ..., n, this condition does not characterize the entropy H; thus the answer is correct.
2. That is indeed a characterization of the entropy.
3. The uncertainty should increase with the number of possibilities, hence this is not a characterization of H.


1.2 Information And Its Measure

Learning Objectives for Section 1.2 After studying this section you should understand how information is measured; that the measure of information is always relative to a precise question and also relative to previous information; that information and questions have a natural algebraic structure; other quantities, related to the measure of information, like the mutual information, the divergence and the degree of surprise, together with their properties and the relations among them.

1.2.1 Observations And Events

Learning Objectives for Subsection 1.2.1 After studying this subsection you should understand that an observation of a random variable or an event related to a random variable is information; that the amount of information gained by observing the value of a variable or an event is measured by the resulting change of uncertainty; that therefore entropy and measure of information are intimately related.

What is information and how is it measured? We start studying this question in this subsection. The basic idea is that information is something that changes uncertainty, preferably decreasing it. Accordingly, we propose to measure the amount of information by the amount of change of uncertainty. This idea will be developed in this section step by step, progressing from very simple to more involved situations. The emphasis in this section will be on the measurement of information content, and less on the representation of information and other properties it may have besides quantity. To start, let's consider a probabilistic choice system (S, P) represented by a random variable X taking values x ∈ S with probabilities p_X(x) (see subsection 1.1.2). This random variable describes a certain experiment where the outcome is uncertain. The uncertainty of this situation is measured by the entropy
$$H(X) = -\sum_{x \in S} p_X(x) \log p_X(x). \tag{1.15}$$

When the experiment is carried out, a certain value x ∈ S of the random variable is observed. There is no uncertainty left. So the previous uncertainty H(X) is reduced to the posterior uncertainty 0. The difference H(X) - 0 = H(X) is the amount of information gained by performing the experiment. So the entropy of a random variable measures the amount of information gained by observing the actual value of the variable.
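To make this concrete, here is a minimal sketch (ours, not from the course text; the function name `entropy` is our choice) that computes the information gained by observing the exact value of a few simple variables.

```python
# Entropy of a distribution = information gained by observing the exact value.
from math import log2

def entropy(probs):
    """Entropy in bits; terms with probability 0 are skipped."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # fair coin: 1 bit gained by observing the throw
print(entropy([0.9, 0.1]))     # biased coin: ~0.469 bit
print(entropy([1/6] * 6))      # fair die: log2(6) ~ 2.585 bit
```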

This idea calls for two important remarks:

- Since information is a change of entropy, it is measured in the same unit as entropy, i.e. bits, if base 2 is selected for the logarithm.
- The amount of information gained by an observation is the same for all possible observations. In particular, it is the same whether the probability of the actual observation is small or large. We return to this point in subsection 1.2.4.

Example 1.15 (Binary Random Variable) If, for a binary variable X, the outcome 0 arises with probability p and 1 with probability q = 1 - p, then observing the outcome of this binary experiment results in a gain of information H(X) = -p log p - q log q. In particular, in the case of a fair coin, observing the outcome of a throw gives 1 bit of information.

Let's now slightly generalize the situation. We still consider a random variable X related to a probabilistic choice situation (S, P). The associated uncertainty is still H(X). But this time, we carry out the experiment only partially. We do not observe the exact value of X, but only some event E ⊆ S. Obviously this is also information. But what is its amount? The observation of the event E changes the random variable X to the conditional variable X|E related to the new probabilistic choice situation (E, P_E). Here P_E denotes the conditional probabilities
$$p_{X|E}(x) = \frac{p_X(x)}{p_X(E)}, \quad \text{for all } x \in E.$$

This new situation, created by the observation of the event E, has the uncertainty which corresponds to the conditional entropy H(X|E) (see subsection 1.1.2),
$$H(X|E) = -\sum_{x \in E} p_{X|E}(x) \log p_{X|E}(x).$$

So, the observation of E changes the uncertainty from H(X) to H(X|E). The amount of information gained is thus H(X) - H(X|E). We shall see in a moment that this is not always really a gain of information, since H(X|E) may be greater than H(X), so that the observation of the event E increases uncertainty, which corresponds, according to our definition, to negative information.

Example 1.16 (Murderer) Assume that we have n suspected murderers, but one of them (say number 1) is a lot more suspect than the other n-1 ones. We may represent the probability that suspect 1 is the murderer by p_X(1) = 1 - ε, which is nearly one. The probability that one of the other suspects could be the murderer is only p_X(i) = ε/(n-1). The entropy is then
$$H(X) = -(1-\epsilon)\log(1-\epsilon) - \epsilon\log\frac{\epsilon}{n-1}. \tag{1.16}$$
If ε is small, then this entropy will be very small. This reflects the fact that we are pretty sure that no. 1 is the murderer. But suppose now that all of a sudden no. 1 produces an alibi. We are then forced to exclude no. 1 from the list of suspects. This corresponds to the event E that X ∈ {2, ..., n}. The conditional distribution of X given E is then p_{X|E}(i) = 1/(n-1) for i = 2, ..., n. The corresponding new uncertainty is H(X|E) = log(n-1). This can be much larger than H(X). So, the new information, i.e. the alibi of no. 1, changes (unexpectedly) a clear and neat situation into a very uncertain, messy situation. The information is therefore negative. This example should convince you that negative information is a reality.

We now introduce some new notation. First we denote the conditional random variable X given an event E ⊆ S by X_E. It corresponds, as noted above, to the probabilistic choice situation (E, P_E), where P_E is the set of conditional probabilities p_{X|E}(x) for x ∈ E. Then, we denote the amount of information of the event E with respect to the random variable X by i(E/X). So we have
$$i(E/X) = H(X) - H(X_E). \tag{1.17}$$
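A numerical illustration may help here. The following sketch is our own (the values of ε and n are chosen by us for illustration) and simply evaluates example 1.16, showing that i(E/X) is negative.

```python
# Example 1.16 numerically: the alibi event E carries negative information.
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

eps, n = 0.01, 11                       # suspect 1 has probability 1 - eps
prior = [1 - eps] + [eps / (n - 1)] * (n - 1)
posterior = [1 / (n - 1)] * (n - 1)     # after the alibi, event E = {2, ..., n}

H_X, H_X_given_E = entropy(prior), entropy(posterior)
print(H_X)                  # small: we are almost sure suspect 1 is guilty
print(H_X_given_E)          # log2(10) ~ 3.32 bit
print(H_X - H_X_given_E)    # i(E/X) is clearly negative here
```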

If, in particular, the event E corresponds to the observation of a precise value x of the random variable, E = {x}, then we write the corresponding amount of information i(x/X). And we have H(X|x) = 0, hence, as already noted above, i(x/X) = H(X). In this sense, and only in this sense, entropy is a measure of information. If we are interested in whether a particular event E takes place or not, we are confronted with a new choice situation ({E, E^c}, P). Associated with it is a new random variable Y with the following probability distribution:
$$p_Y(E) = p_X(E) = \sum_{x \in E} p_X(x), \tag{1.18}$$
$$p_Y(E^c) = p_X(E^c) = \sum_{x \in E^c} p_X(x).$$

What will be the expected information when we learn whether E takes place or not? It is
$$I(X|Y) = p_Y(E)\,i(E/X) + p_Y(E^c)\,i(E^c/X) = H(X) - \big(p_Y(E)H(X|E) + p_Y(E^c)H(X|E^c)\big) = H(X) - H(X|Y). \tag{1.19}$$
But we know (see corollary 1.3) that H(X|Y) ≤ H(X). So the expected measure of information gained by observing whether some event takes place or not is never negative, i.e. I(X|Y) ≥ 0. We shall get back to the important notion of expected information in subsection 1.2.3.

Example 1.17 (Murderer - Continuation) Let's return to the murder example 1.16 above. Suppose somebody announces that he will produce proof of the guilt or innocence of no. 1 (by examining DNA, for example). With probability 1 - ε we expect that the guilt of no. 1 will be proved. This represents event E^c in the notation of example 1.16. The resulting uncertainty in this case will be 0 and the information obtained H(X) (see (1.16)). With probability ε we expect that the innocence of no. 1 will be proved (event E). The remaining uncertainty is then, as seen in example 1.16, log(n-1), and the information obtained H(X) - log(n-1). So, in this particular case, the expected information to be gained by this proof is equal to
$$(1-\epsilon)H(X) + \epsilon\big(H(X) - \log(n-1)\big) = H(X) - \epsilon\log(n-1) = -(1-\epsilon)\log(1-\epsilon) - \epsilon\log\epsilon \ge 0.$$


The last equation is obtained using (1.16). Note that this is exactly the amount of information when we learn whether suspect no. 1 is guilty or not. Suppose now that information, in the form of events observed, comes in successive steps. First we observe an event E_1 ⊆ S, and then we get more precise information from an event E_2 ⊆ E_1. Since event E_1 changes the random variable X to the conditional random variable X_{E_1}, the information gained by E_2 with respect to the former information E_1 is i(E_2/X_{E_1}) = H(X_{E_1}) - H(X_{E_2}). The next theorem then shows that we can add the information gained in each step to get the full information.

Theorem 1.6 Let X be a random variable associated with a probabilistic choice situation (S, P) and E_1, E_2 two events, E_2 ⊆ E_1 ⊆ S. Then
$$i(E_2/X) = i(E_1/X) + i(E_2/X_{E_1}). \tag{1.20}$$

Proof The proof is straightforward, using the definition of information:
$$i(E_2/X) = H(X) - H(X_{E_2}) = H(X) - H(X_{E_1}) + H(X_{E_1}) - H(X_{E_2}) = i(E_1/X) + i(E_2/X_{E_1}).$$
We want at this point to stress the following important aspect of information:

An amount of information is always relative to prior information. So, the amount of information of the event E_2 relative to the original variable X is generally not the same as its amount relative to the information given by the event E_1. That is, in general i(E_2/X) ≠ i(E_2/X_{E_1}). The notation we use underlines this: i(E_2/X) is the amount of information contained in the event E_2 relative to the prior information or prior probability distribution of X, whereas i(E_2/X_{E_1}) is the amount of information of the same event E_2 relative to the prior information or probability distribution of X_{E_1}.

Example 1.18 (Relativity to prior Information) This remark can be illustrated by the special case of choice without probabilities. Thus, assume S to be a deterministic choice system. If E ⊆ S is an observed event, then we may denote its amount of information relative to the prior information S by i(E/S). Then we have
$$i(E/S) = \log|S| - \log|E| = \log\frac{|S|}{|E|}.$$

If, as in theorem 1.6, we have E_2 ⊆ E_1 ⊆ S, then, once E_1 is observed, we have a new choice system E_1. If we next observe E_2, we gain information i(E_2/X_{E_1}) with respect to the former information E_1. Thus,
$$i(E_1/S) = \log|S| - \log|E_1|, \quad i(E_2/S) = \log|S| - \log|E_2|, \quad i(E_2/X_{E_1}) = \log|E_1| - \log|E_2|.$$
Of course, in this case we also have i(E_2/S) = i(E_1/S) + i(E_2/X_{E_1}). We note that we obtain exactly the same results if we do not assume a choice system S without probabilities, but a probabilistic choice system (S, P), where P is the uniform probability distribution over S. This discussion seems to indicate that not only events represent information, but that random variables, or rather their associated probability distributions, are information as well. This is indeed so. Subsection 1.2.5 will discuss probabilities as information. Of course theorem 1.6 generalizes to more than two events.

Corollary 1.5 If E_m ⊆ E_{m-1} ⊆ ... ⊆ E_1 ⊆ S, then
$$i(E_m/X) = i(E_1/X) + i(E_2/X_{E_1}) + \cdots + i(E_m/X_{E_{m-1}}). \tag{1.21}$$

Example 1.19 (Fair Die) Let X be the random variable associated with the toss of a fair die. Then we have H(X) = -log(1/6) = log 6 bit. Someone tells us that X ≠ 1. So let E_1 be the event X ≠ 1. Thus
$$i(E_1/X) = H(X) - H(X_{E_1}) = \log 6 - \log 5 = \log\frac{6}{5}\ \text{bit},$$
since H(X_{E_1}) = log 5 bit. A little while later, we receive the information that X ≠ 1 and X ≠ 2, and we associate the event E_2 with it. So
$$i(E_2/X) = H(X) - H(X_{E_2}) = \log 6 - \log 4 = \log\frac{3}{2}\ \text{bit},$$
since H(X_{E_2}) = log 4 bit. Finally we compute
$$i(E_2/X_{E_1}) = H(X_{E_1}) - H(X_{E_2}) = \log 5 - \log 4 = \log\frac{5}{4}\ \text{bit}.$$
We verify that indeed
$$i(E_2/X) = \log\frac{3}{2} = \log\frac{6}{5} + \log\frac{5}{4} = i(E_1/X) + i(E_2/X_{E_1}).$$
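The chaining of example 1.19 is easy to verify mechanically; the sketch below is our own illustration of (1.20) and not code from the course.

```python
# Chaining of information for the fair die of example 1.19.
from math import log2

H_X  = log2(6)      # uncertainty of the toss
H_E1 = log2(5)      # after learning X != 1
H_E2 = log2(4)      # after learning X != 1 and X != 2

i_E1_X  = H_X - H_E1      # log2(6/5)
i_E2_X  = H_X - H_E2      # log2(3/2)
i_E2_E1 = H_E1 - H_E2     # log2(5/4)
print(abs(i_E2_X - (i_E1_X + i_E2_E1)) < 1e-12)   # True: i(E2/X) = i(E1/X) + i(E2/X_E1)
```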

We may also have the situation where two different sources of information report two events E_1, E_2 ⊆ S relative to a probabilistic choice situation (S, P) and an associated random variable X. These two pieces of information can be combined into the event E_1 ∩ E_2. We assume that E_1 ∩ E_2 is not empty, since this would represent contradictory or incompatible information. The amount of the combined information is then i(E_1 ∩ E_2/X). By theorem 1.6 we see that
$$i(E_1 \cap E_2/X) = i(E_1/X) + i(E_1 \cap E_2/X_{E_1}) = i(E_2/X) + i(E_1 \cap E_2/X_{E_2}).$$


It does not matter in which sequence the two pieces of information are combined; in both cases we get the same result. Here we observe that information may come in pieces and can then be combined. This points to a certain algebraic structure of information, besides its quantitative aspect.

Example 1.20 (Fair Die - Continuation) Once again we toss a fair die (random variable X). As before, H(X) = log 6 bit. We observe that the result is an even number (event E_1). Since H(X_{E_1}) = log 3 bit, we get that
$$i(E_1/X) = H(X) - H(X_{E_1}) = \log 6 - \log 3 = \log\frac{6}{3} = \log 2 = 1\ \text{bit}.$$
We observe next that the result is smaller than 4 (event E_2). So, with H(X_{E_2}) = log 3 bit,
$$i(E_2/X) = H(X) - H(X_{E_2}) = \log 6 - \log 3 = 1\ \text{bit}.$$
Note that E_1 ∩ E_2 = {2}. Since H(X_{E_1 \cap E_2}) = 0 bit, we finally obtain
$$i(E_1 \cap E_2/X) = \log 6\ \text{bit}, \quad i(E_1 \cap E_2/X_{E_1}) = \log 3\ \text{bit}, \quad i(E_1 \cap E_2/X_{E_2}) = \log 3\ \text{bit}.$$
And we see that
$$i(E_1 \cap E_2/X) = \log 6 = i(E_1/X) + i(E_1 \cap E_2/X_{E_1}) = 1 + \log 3.$$

Summary for Section 1.2
In this subsection we have seen that events or, more particularly, observations of values of random variables are information. The amount of information gained by an event is measured by the change of uncertainty, that is, of the entropy. So information is measured in bits, like the entropy. The amount of information gained by observing an event may be negative, that is, uncertainty may be increased. In case the exact value of a random variable is observed, the amount of information gained equals the entropy of the variable, which is always non-negative. The amount of information gained is relative to the prior information. That is, an event does not have an absolute amount of information; the amount depends on what was known before - the prior probability distribution. If information, represented by events, comes in successive steps, then the total information gained is the sum of the information gained in each step relative to the previous step.

Control Question 13 What is the relation between entropy and the measure of information? 1. There is no relation. 2. Since a probability distribution P represents information, then, in this sense, and only in this sense, H (P ) measures at the same time entropy and information. 3. Information is dened by the change of entropy. 4. Entropy and Information are both measured in bits. Answer 1. That is wrong (see below). 2. It is correct that a probability distribution represents information. But this is not a relation between entropy and information and, additionally, the assertion is nonsense. 3. Yes, that is the main idea of an information measure. 4. That is indeed correct, but this is due to the relation between the entropy and information. Hence, the answer is wrong.

Control Question 14
i(E/X) is always positive, since
1. H(X|Y) ≤ H(X);
2. H(X|E) ≤ H(X);
3. H(X) ≥ 0;
4. information is always relative to a prior information;
5. the assertion is wrong; information can be negative.

Answer
1. It is correct that H(X|Y) ≤ H(X), but this has nothing to do with the assertion.
2. We can't expect that H(X|E) ≤ H(X), hence this is wrong.
3. It is correct that H(X) ≥ 0, but this has nothing to do with the assertion.
4. It is correct that information is always relative to a prior information, but this has nothing to do with the assertion.
5. Yes, information can also be negative, see example 1.16.

Control Question 15
Let X be a random variable related to the probabilistic choice situation (S, P) and E_2 ⊆ E_1 ⊆ S. Then i(E_2/X)
1. = 0, if E_2 corresponds to the observation of a precise value x of X;
2. = H(X) - H(X|E_2);
3. = i(E_1/X) + i(E_2/X_{E_1});
4. = -log(|E_2|/|S|), if X is uniformly distributed.

Answer
1. That is wrong. If E_2 corresponds to the observation of a precise value x of X, we get i(E_2/X) = H(X).
2. This is the definition of i(E_2/X), hence it is correct.
3. Since i(E_2/X) = H(X) - H(X_{E_1}) + H(X_{E_1}) - H(X_{E_2}), it is correct.
4. Correct. It follows directly from -log(|E_2|/|S|) = log(|S|/|E_2|) = log|S| - log|E_2|.

1.2.2 Information and Questions

Learning Objectives for Subsection 1.2.2 After studying this subsection you should understand that information may relate to dierent questions; that the amount of an information can only be measured relative to a precisely specied question, as well as relative to prior information; that information and questions exhibit an algebraic structure; what independent information means and that the total amount of independent information is the sum of the amounts of individual information.

We start by considering the simple case of a compound probabilistic choice system (S_1 × S_2, P) and the corresponding pair of random variables X and Y (we refer to subsection 1.1.3 for these notions). This simple situation will serve as a model to study how information pertains to different questions, and why, therefore, the content of information can only be measured relative to a specified question. This simple case also serves to exhibit the important fact that information and questions possess an inherent algebraic structure. Assume we observe the value of one of the two variables, say Y = y ∈ S_2. As we have seen in the previous section, this reduces the uncertainty regarding Y to zero, and the observation contains i(y/Y) = H(Y) bits of information relative to the prior information Y. But this observation also changes the prior probability distribution of X, which becomes a conditional random variable X|y with probability distribution
$$p_{X|y}(x,y) = \frac{p_{X,Y}(x,y)}{p_Y(y)}, \quad \text{for all } x \in S_1.$$

This means that the uncertainty regarding the variable X changes too. So the observation Y = y also contains information relative to X. But its amount is not the same as that relative to Y. In fact, the change in entropy of X, which measures the amount of information of y relative to X, is as follows: i(y/X) = H(X) - H(X|y). We saw in subsection 1.1.3, example 1.12, that the entropy H(X|y) may be smaller or greater than H(X). In the latter case, we get a negative amount i(y/X) of information. So this is another case where negative information may arise. The observation y also changes the common entropy of the two variables. That is, we have i(y/X,Y) = H(X,Y) - H(X,Y|y). We note that
$$p_{X,Y|y}(x,y') = \frac{p_{X,Y}(x,y)}{p_Y(y)}$$
for all x ∈ S_1, if y' = y, and p_{X,Y|y}(x,y') = 0 if y' ≠ y. But this shows that p_{X,Y|y}(x,y) = p_{X|y}(x). So we conclude that H(X,Y|y) = H(X|y) and hence
$$i(y/X,Y) = H(X,Y) - H(X|y). \tag{1.22}$$
Using theorem 1.4, we also find
$$i(y/X,Y) = H(X) - H(X|y) + H(Y|X) = i(y/X) + H(Y|X). \tag{1.23}$$

The information gained by y with respect to both variables X and Y equals the information gained by y with respect to variable X plus the expected uncertainty remaining in Y , when X is observed. So, the same simple observation Y = y is information, which has a dierent measure relative to dierent references or questions. i(y/Y ) is the amount of information with respect to the question of the value of Y , i(y/X ) the amount of information regarding the unknown value of X and nally i(y/X, Y ) the amount of information with respect to the unknown common value of the two variables X and Y together. So, we emphasize, the amount of an information is to be measured relative to the question to be considered. This is a second principle of relativity. The rst one was introduced in the previous subsection, and says that the amount of information is always measured relative to the prior information. Lets summarize the two basic principles of relativity of information:


Relativity regarding the question : The amount of information is to be measured relative to a specied question. Questions are so far represented by random variables or choice situations. The question is, what is the value of the random variable, or the element selected in a choice situation?

Relativity regarding the prior information : The amount of information is to be measured relative to the prior information. Prior information is represented by the prior probability distribution of the variable or the probabilistic choice situation.

If X and Y are independent random variables, then H(X|y) = H(X) and thus i(y/X) = 0. The information Y = y has no content relative to X; it does not bear on X. And i(y/X,Y) = i(y/Y) from (1.23), since H(Y|X) = H(Y) in this case.

Example 1.21 Assume p_{X,Y}(0,1) = p_{X,Y}(1,0) = p_{X,Y}(0,0) = 1/3 and p_{X,Y}(1,1) = 0, hence p_X(0) = p_Y(0) = 2/3 and p_X(1) = p_Y(1) = 1/3, like in examples 1.12 and 1.13. Since
$$p_{X,Y|Y=0}(0,0) = \frac{p_{X,Y}(0,0)}{p_Y(0)} = p_{X,Y|Y=0}(1,0) = \frac{p_{X,Y}(1,0)}{p_Y(0)} = \frac{1}{2}, \qquad p_{X,Y|Y=0}(0,1) = p_{X,Y|Y=0}(1,1) = 0,$$
it follows that $H(X,Y|Y=0) = -\frac{1}{2}\log\frac{1}{2} - \frac{1}{2}\log\frac{1}{2} = \log 2 = 1$ bit. With $H(X,Y) = -\frac{1}{3}\log\frac{1}{3} - \frac{1}{3}\log\frac{1}{3} - \frac{1}{3}\log\frac{1}{3} = \log 3$ bit, we obtain
$$i(Y=0/X,Y) = H(X,Y) - H(X,Y|Y=0) = \log 3 + \log\frac{1}{2} = \log\frac{3}{2}\ \text{bit}.$$

Example 1.22 (Compound Choice Situation without Probabilities) Consider a compound choice situation S_1 × S_2 without probabilities. Then any observation of y ∈ S_2 yields the amount of information
$$i(y/S_1 \times S_2) = \log|S_1 \times S_2| - \log|S_1| = \log|S_1| + \log|S_2| - \log|S_1| = \log|S_2|\ \text{bit}.$$
This is not a surprise, since there is no uncertainty left in S_2, and, as we have seen, the information gained is then the uncertainty in S_2.

Example 1.23 (Symmetric Binary Channel - Continuation) The communication channel is an important example (see example 1.14 in subsection 1.1.3). X refers to an uncertain input signal and Y to the output signal. In practice one observes the output signal Y = y and would like to infer the unknown input signal which was transformed into y. Then i(y/X) = H(X) - H(X|y) is the information content of the information y relative to this question. If there is no distortion of the input signal during the transmission, then X = Y and H(X|y) = 0 bit. In this case i(y/X) = H(X), that is, y contains all the information about X. But in general, we have H(X|y) > 0 and hence i(y/X) < H(X). There is a loss of information associated with the transmission. It is also of interest to look at the expected information at the output regarding the input,
$$I(X|Y) = \sum_{y \in S_2} p_Y(y)\,i(y/X).$$

This is a very important quantity in communication theory. We shall come back to it in subsection 1.2.3. We now generalize the discussion above by considering events related to the two variables X and Y. We notice that there may be events E related to the variable X, that is E ⊆ S_1, events related to Y, i.e. E ⊆ S_2, and finally events related to both variables, E ⊆ S_1 × S_2. We label an event E by d(E) to indicate to what domain it belongs. So d(E) = {x} means E ⊆ S_1, and d(E) = {x,y} means E ⊆ S_1 × S_2. If d(E) = {x,y}, then we define the projection of E to domain x by E^{x} = {x ∈ S_1 : there is a y ∈ S_2 such that (x,y) ∈ E}. The projection E^{y} is defined similarly. We refer to figure 1.11 for a geometric picture of projections.

Figure 1.11: The projection of an event to a smaller domain, illustrated in the case of two dimensions.

We start by considering an event E with d(E) = {x,y} (see figure 1.11). This is clearly information relative to X, to Y and to (X,Y). The corresponding measures are
$$i(E/X) = H(X) - H(X|E), \quad i(E/Y) = H(Y) - H(Y|E), \quad i(E/X,Y) = H(X,Y) - H(X,Y|E).$$
At this point we need to clarify conditional random variables such as X|E, if E is an event relative to S_1 × S_2, that is, E ⊆ S_1 × S_2. Clearly, the conditional random variable X,Y|E has the probability distribution
$$p_{X,Y|E}(x,y) = \frac{p_{X,Y}(x,y)}{p_{X,Y}(E)}, \quad \text{for } (x,y) \in E.$$

Otherwise, that is, if (x,y) ∉ E, we have p_{X,Y|E}(x,y) = 0. The conditional variable X|E now has the marginal probability distribution of p_{X,Y|E}(x,y), that is,
$$p_{X|E}(x) = \sum_{y \in S_2} p_{X,Y|E}(x,y) \quad \text{for all } x \in E^{x}.$$
For x ∉ E^{x}, we have p_{X|E}(x) = 0. Similarly, we have
$$p_{Y|E}(y) = \sum_{x \in S_1} p_{X,Y|E}(x,y) \quad \text{for all } y \in E^{y},$$
and p_{Y|E}(y) = 0 for y ∉ E^{y}. This clarifies how H(X|E) and H(Y|E) have to be computed:
$$H(X|E) = -\sum_{x} p_{X|E}(x)\log p_{X|E}(x), \qquad H(Y|E) = -\sum_{y} p_{Y|E}(y)\log p_{Y|E}(y).$$

Again, one or several of these information measures may be negative in the general case.

Example 1.24 (Fair Coin) Consider a fair coin which is thrown twice. X is the result of the first throw, Y the result of the second one. Since the coin is fair, we have p_{X,Y}(x,y) = 1/4 for all four possible results. Therefore H(X,Y) = log 4 = 2 bit. If we learn that the coin did not show heads both times, then we know that the event E = {(0,0), (0,1), (1,0)} took place, assuming that 0 indicates tails and 1 means heads. So we obtain that p_{X,Y|E}(x,y) = 1/3 for the three remaining possible results in E. Thus, H(X,Y|E) = log 3 and we gained the amount of information i(E/X,Y) = 2 - log 3 = log(4/3) bit. Regarding the first throw, we obtain
$$p_{X|E}(0) = p_{X,Y|E}(0,0) + p_{X,Y|E}(0,1) = \frac{2}{3}, \qquad p_{X|E}(1) = p_{X,Y|E}(1,0) = \frac{1}{3}.$$
So the remaining uncertainty is
$$H(X|E) = -\frac{2}{3}\log\frac{2}{3} - \frac{1}{3}\log\frac{1}{3} = \log 3 - \frac{2}{3}\ \text{bit}.$$
The information obtained relative to the first throw is therefore
$$i(E/X) = H(X) - H(X|E) = 1 - \left(\log 3 - \frac{2}{3}\right) = \log\frac{2}{3} + \frac{2}{3} \approx 0.0817\ \text{bit}.$$
Due to the symmetry of the situation, the same amount of information is also gained relative to the second throw.

Next, we look at the case where an event E relative to the second variable Y is observed, i.e. d(E) = {y}. As usual, we have in this case i(E/Y) = H(Y) - H(Y|E). But what can be said about the information relative to X,Y or to X alone? To answer these questions, we need to extend the event E to an event relative to X,Y, but without adding information. We define E^{{x,y}} = E × V_X. This is the so-called cylindric extension of E to the domain {x,y} (see figure 1.2.2). A look at fig. 1.12 shows that no information relative to the variable X has been added which is not already contained in E.
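For completeness, here is a small sketch of example 1.24 (our own code, with H(X) = 1 bit assumed for the fair coin and our own helper names).

```python
# Example 1.24: information carried by the event E = "not two heads".
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

E = [(0, 0), (0, 1), (1, 0)]          # 0 = tails, 1 = heads
p_E = {xy: 1/3 for xy in E}           # conditional joint distribution given E

i_E_XY = 2.0 - entropy(p_E.values())  # 2 - log2(3) = log2(4/3)

p_X_given_E = [sum(p for (x, y), p in p_E.items() if x == 0),
               sum(p for (x, y), p in p_E.items() if x == 1)]   # [2/3, 1/3]
i_E_X = 1.0 - entropy(p_X_given_E)    # ~0.0817 bit
print(i_E_XY, i_E_X)
```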

Figure 1.12: The cylindric extension of an event to a larger domain, illustrated in the case of two dimensions.

First, we define i(E/X,Y) = i(E^{{x,y}}/X,Y) = H(X,Y) - H(X,Y|E^{{x,y}}). That is, we consider that E and E^{{x,y}} represent the same information relative to X,Y. Note that
$$p_{X,Y|E^{\{x,y\}}}(x,y) = \frac{p_{X,Y}(x,y)}{p_{X,Y}(E^{\{x,y\}})} = \frac{p_{X,Y}(x,y)}{p_Y(E)}, \quad \text{for all } (x,y) \in E^{\{x,y\}},$$
because
$$p_{X,Y}(E^{\{x,y\}}) = \sum_{(x,y) \in E^{\{x,y\}}} p_{X,Y}(x,y) = \sum_{x \in S_1,\, y \in E} p_{X,Y}(x,y) = p_Y(E).$$

In the same way, we define i(E/X) = i(E^{{x,y}}/X) = H(X) - H(X|E^{{x,y}}). Remember that
$$p_{X|E^{\{x,y\}}}(x) = \sum_{y \in E} p_{X,Y|E^{\{x,y\}}}(x,y) \tag{1.24}$$
and
$$H(X|E^{\{x,y\}}) = -\sum_{x \in S_1} p_{X|E^{\{x,y\}}}(x)\log p_{X|E^{\{x,y\}}}(x). \tag{1.25}$$

We notice that this information is in general different from zero. Thus, even an event relating to variable Y carries information relative to variable X. This results from the correlation between the two variables X and Y. Indeed, assuming that X and Y are stochastically independent, i.e. p_{X,Y}(x,y) = p_X(x)p_Y(y), then
$$p_{X,Y|E^{\{x,y\}}}(x,y) = \frac{p_X(x)p_Y(y)}{p_Y(E)} = p_X(x)\,p_{Y|E}(y).$$
Thus in this case we obtain
$$p_{X|E^{\{x,y\}}}(x) = \sum_{y \in E} p_{X,Y|E^{\{x,y\}}}(x,y) = p_X(x).$$

Thus, we have that H(X|E^{{x,y}}) = H(X) and therefore i(E/X) = 0. Of course, by symmetry, a similar analysis can be carried out for an event related to variable Y.

Example 1.25 (Fair Coin - Continuation) We refer to example 1.24. But now we observe that the second throw resulted in heads. This is represented by the event E_Y = {1}. We easily see that p(E_Y) = 0.5. To compute i(E_Y/X,Y) we need the cylindric extension of E_Y, given by
$$E_Y^{\{x,y\}} = \{(0,1), (1,1)\}.$$
Regarding both throws, we obtain
$$p_{X,Y|E_Y^{\{x,y\}}}(0,1) = p_{X,Y|E_Y^{\{x,y\}}}(1,1) = \frac{1}{2}$$
and 0 otherwise. So the information obtained by E_Y relative to both throws is
$$i(E_Y/X,Y) = H(X,Y) - H(X,Y|E_Y^{\{x,y\}}) = 2 + \frac{1}{2}\log\frac{1}{2} + \frac{1}{2}\log\frac{1}{2} = 1\ \text{bit}.$$
Assume next that E = {(x,y)}, that is, an exact value is observed for both random variables X and Y. Then we obtain the following measures of information:
$$i(x,y/X) = i(x/X) = H(X), \quad i(x,y/Y) = i(y/Y) = H(Y), \quad i(x,y/X,Y) = H(X,Y) \le i(x/X) + i(y/Y). \tag{1.26}$$

The rst two equalities hold true since H (X |x, y ) = H (X |x) = 0 and H (Y |x, y ) = H (Y |y ) = 0. The last inequality is nothing else than (1.7) (subsection 1.1.3). In this case, clearly all three information measures are non-negative. Equality holds in the last inequality, when X and Y are independent. Then the individual pieces of information, bearing on the variables X and Y respectively, add to the total information bearing on both variables simultaneously. The condition for the additivity of information can be generalized as shown in the following theorem.

Theorem 1.7 Let X and Y be independent random variables relative to a probabilistic choice situation (S_1 × S_2, P). If E = E^{x} × E^{y}, then
$$i(E/X,Y) = i(E/X) + i(E/Y). \tag{1.27}$$

Proof Since X and Y are independent, we have H(X,Y) = H(X) + H(Y) (see (1.7) in subsection 1.1.3). Furthermore, the conditional variables X|E^{x} and Y|E^{y} are still independent, since, for x ∈ E^{x} and y ∈ E^{y}, we have
$$p_{X,Y|E}(x,y) = \frac{p_{X,Y}(x,y)}{p_{X,Y}(E)} = \frac{p_X(x)p_Y(y)}{p_X(E^{x})p_Y(E^{y})} = p_{X|E^{x}}(x)\,p_{Y|E^{y}}(y).$$
Therefore, again from (1.7), we also have H(X,Y|E) = H(X|E^{x}) + H(Y|E^{y}). Thus, we obtain
$$i(E/X,Y) = H(X,Y) - H(X,Y|E) = \big(H(X) - H(X|E^{x})\big) + \big(H(Y) - H(Y|E^{y})\big) = i(E^{x}/X) + i(E^{y}/Y).$$

Example 1.26 (Fair Coin - Continuation) We refer to example 1.25. Since we know the result of the second throw, we get i(E_Y^{{x,y}}/Y) = H(Y) = log 2 = 1 bit. The cylindric extension of the event E_Y does not add information relative to X, thus i(E_Y^{{x,y}}/X) = 0 bit. So we get the expected result
$$i(E_Y^{\{x,y\}}/X,Y) = i(E_Y^{\{x,y\}}/X) + i(E_Y^{\{x,y\}}/Y) = 1\ \text{bit}.$$

Two events E1 and E2 which each bear on one of the variables X and Y , i.e. d(E1 ) = {x} and d(E2 ) = {y }, are called independent. Theorem 1.7 says that the information content of their combination E1 E2 is the sum of the information contents of each event, provided that the random variables X and Y are independent. This carries the addition theorem from entropies over to information measures. In the previous sections, we showed that a choice situation without probabilities has the same entropy as the probabilistic choice situation with uniform probability distribution over the possible choices. This corresponds to Laplaces principle of insucient reason. In the next example, we want to draw attention to the danger of an unreected application of this principle. Example 1.27 (Fair Coin - Continuation) We refer back to example 1.24 where a fair coin is thrown twice and the event E reported that heads did not turn out twice. We now drop the assumption that the coin is fair. The coin can be anything. So we do not have a uniform distribution over the four possible outcomes. In fact we have no probability distribution at all. That is, we have a scheme of choice S = {(0, 0), (0, 1), (1, 0), (1, 1)} without probabilities. The uncertainty h(|S |) of this compound choice scheme is 2 bit as before; here Laplaces principle of insucient reason still applies. Also h(|E |) = log 3 bit as before. But what is the uncertainty regarding the rst throw? We insist that we have no


reason to assume a uniform distribution over the four possibilities before the event E is reported and over the three remaining possibilities, after event E is reported. We then simply have a choice scheme E = {(0, 0), (0, 1), (1, 0)} without probabilities. So, when we regard the rst throw, we have the two possibilities given by E {x} = {0, 1} and hence the corresponding choice situation without probabilities. So the remaining uncertainty is h(|E {x} |) = 1 bit as before the event E became known. Hence the information is null, i(E/X ) = 0 bit. The fact is, knowing that the coin is fair is information, represented by the uniform distribution. This additional information together with the event E yields information on the rst throw. If we do not know that the coin is fair, or if we have no reason to assume that it is so (like if the coin is ordinary), then we do not have this information and the inference would be biased, if we simply assumed it without reason. What this example shows, is that a probability distribution over a set of possible choices is information. And assuming a uniform distribution is replacing ignorance by information. Sometimes this does not matter, but in general it does! This reinforces the relativity principle of information stating that a measure of information is always relative to prior information. In particular, this example shows that we need to treat choice schemes without probabilities and those with probabilities dierently. In fact, it will be possible to join both cases in an appropriate unique formalism. But this goes beyond this introductory chapter and is postponed until chap. 1.2.5. Another thing, we saw with the model case of two variables, is that events or information may relate to dierent questions, but still carry information to other questions. In fact there is some order between questions: The question related to variable X or domain {x}, and represented by the set of possible values VX is somehow coarser, than the question related to both variables X, Y or domain {x, y }, and represented by the set VX VY . And an information related to some domain (say {x, y }) also carries information for the coarser domains {x} and {y }. So already this rst very simple model case of two variables hints to a lot of the structure of information and questions; structure which has to be formalized, if information in all its generality is to be understood. We shall treat these aspects in a more general case in the next subsection. Summary for Section 1.2 Events refer to one or the other or to both variables. In any case they represent information. But its amount is to be measured with respect to a specied variable or question. The second principle of relativity says that the amount of information in an observed event is also measured relative to a prior information. Prior information is represented by the prior probability distribution (before observation of the event) of the considered question, that is the probabilistic choice situation or random variable. Independent events with respect to independent random variables carry information which sum up to the total information represented by the two events

simultaneously. This is the addition theorem for independent information. Since probability distributions also represent information, choice situations without probabilities must be carefully distinguished from choice situations with probabilities.

Control Question 16 Relativity regarding the 1. posterior information; 2. question; 3. maximal gain; 4. expected change of entropy; is a basic principle of information. Answer 1. No, one of the basic principles of information is relativity regarding prior information. 2. That is correct; information is always relative to a question. 3. We have never seen a relativity principle called maximal gain. 4. We have never seen a relativity principle called expected change of entropy.

Control Question 17 Someone is telling you that the weather will be ne tomorrow. Is this information a gain for you? 1. Yes, of course. 2. No! Answer Both possibilities are wrong, because information is always measured relative to a specied question. Hence it depends on the question if the information is a gain or a loss.

Control Question 18 What are conditions on the random variables X and Y relative to a probabilistic choice system (S1 S2 , P ) and an event E , such that the additivity of information holds, that is i(E/X, Y ) = i(E/X ) + i(E/Y ):

1. X and Y are independent.
2. P is the uniform probability distribution.
3. E = E^{x} × E^{y}.
4. H(X,Y) = H(X) + H(Y).

Answer
1. That is correct. X and Y have to be independent.
2. No, there is no limitation on P.
3. That is indeed an important condition.

4. This is equivalent to the condition X and Y independent; since X and Y have to be independent, this condition is correct.

Control Question 19
Given a choice system (S_1 × S_2) with the corresponding random variables X, Y and an event E = {y} ⊆ S_2. Then
1. i(E/Y) = 0;
2. i(E/X) = 0;
3. i(E/X,Y) = 0.

Answer
1. Information is defined by the change of entropy; hence, here we have i(E/Y) = H(Y) - H(Y|E) = H(Y). So this assertion is wrong.
2. Since the observation of an event E ⊆ S_2 can also affect the random variable X, this proposition is wrong.
3. i(E/X,Y) = H(X,Y) - H(X,Y|E), and we can't expect that H(X,Y) = H(X,Y|E). It follows that this assertion is also untrue.

1.2.3 Mutual Information and Kullback-Leibler Divergence

Learning Objectives for Subsection 1.2.3 After studying this subsection you should understand the notion of mutual information and its relation to entropy and information; the notion of informational distance (Kullback-Leibler divergence) and its relation to mutual information.


We look once more at a compound probabilistic choice situation (S_1 × S_2, P) and the associated random variables X and Y. If a value y is observed for Y, then as seen in subsection 1.2.2, we get the amount of information i(y/X) = H(X) - H(X|y) with respect to X. But rather than looking at a particular observation y, we look at the expected amount of information relative to X gained by observing Y. This value is
$$I(X|Y) = \sum_y p_Y(y)\,i(y/X) = \sum_y p_Y(y)\big(H(X) - H(X|y)\big) = H(X) - H(X|Y). \tag{1.28}$$

I(X|Y) is called the mutual information between X and Y. It is an important notion in information theory, but it is not an information, strictly speaking. It is the expected or average amount of information obtained on X by observing Y. In corollary 1.3 in subsection 1.1.3, we have seen that always H(X|Y) ≤ H(X), and equality holds only if X and Y are independent. This implies that the following property holds.

Theorem 1.8
$$I(X|Y) \ge 0, \quad \text{and} \quad I(X|Y) = 0 \ \text{ if, and only if, } X \text{ and } Y \text{ are independent}. \tag{1.29}$$

So, although in a particular case the information i(y/X) may be negative, on average we expect a positive amount of information about X from observing Y.

Example 1.28 (Communication Channel) Consider a communication channel, with X the input source and Y the output. By observing the output Y we expect a positive amount of information regarding the input X. Although in particular transmissions uncertainty about the input may be increased, on average the observation of the output decreases uncertainty about the input. This is of course extremely important for reasonable communication.

Furthermore, from (1.8) we obtain I(X|Y) = H(X) + H(Y) - H(X,Y). Because this formula is symmetric in X and Y, we conclude that I(X|Y) = I(Y|X). We expect as much information on X by observing Y as on Y by observing X.

Theorem 1.9 I (X |Y ) = I (Y |X ).

That is a remarkable result. So the mutual information between X and Y is the same as between Y and X. This symmetry is also evident from the following formula for the mutual information, which we obtain from (1.28) if we introduce the definition of the entropies appearing there,
$$
\begin{aligned}
I(X|Y) &= -\sum_x p_X(x)\log p_X(x) + \sum_{x,y} p_Y(y)\,p_{X|Y}(x,y)\log p_{X|Y}(x,y) \\
&= \sum_{x,y} p_{X,Y}(x,y)\left(\log\frac{p_{X,Y}(x,y)}{p_Y(y)} - \log p_X(x)\right) \\
&= \sum_{x,y} p_{X,Y}(x,y)\log\frac{p_{X,Y}(x,y)}{p_X(x)\,p_Y(y)}.
\end{aligned} \tag{1.30}
$$
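The equivalence of (1.28) and (1.30) is easy to check numerically. The sketch below is our own (it reuses the joint distribution of example 1.12; function names are ours) and computes I(X|Y) in three equivalent ways.

```python
# Mutual information computed three equivalent ways: (1.28), H(X)+H(Y)-H(X,Y), (1.30).
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

p_xy = {(0, 0): 1/3, (0, 1): 1/3, (1, 0): 1/3, (1, 1): 0.0}
xs = ys = (0, 1)
p_x = {x: sum(p_xy[(x, y)] for y in ys) for x in xs}
p_y = {y: sum(p_xy[(x, y)] for x in xs) for y in ys}

H_X = entropy(p_x.values())
H_X_given_Y = sum(p_y[y] * entropy([p_xy[(x, y)] / p_y[y] for x in xs])
                  for y in ys if p_y[y] > 0)
I_1 = H_X - H_X_given_Y
I_2 = H_X + entropy(p_y.values()) - entropy(p_xy.values())
I_3 = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)
print(I_1, I_2, I_3)   # all three agree (~0.2516 bit)
```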

Finally, we conclude from I(X|Y) ≥ 0 and H(X|Y) ≥ 0 that I(X|Y) ≤ H(X) = i(x/X). By symmetry we also have I(Y|X) ≤ H(Y) = i(y/Y). The expected information on either variable obtained by observing the other one is always at most the information gained by directly observing that variable. Now, of course, in a transmission system it is not possible to observe the input directly. That is why a loss of information has to be expected when transmitting information.

Example 1.29 (Statistical Inference) Let X and Y be binary random variables representing two throws of a coin. The coin may be fair or not. This is represented by a choice system or a random variable Q. Its probability distribution is given by p_Q(fair) = p_Q(unfair) = 0.5. We know that if the coin is fair, we have p_{X|Q}(0|fair) = p_{X|Q}(1|fair) = p_{Y|Q}(0|fair) = p_{Y|Q}(1|fair) = 0.5, and if it is unfair, p_{X|Q}(0|unfair) = p_{Y|Q}(0|unfair) = 0.9, p_{X|Q}(1|unfair) = p_{Y|Q}(1|unfair) = 0.1. We further assume that X and Y are conditionally independent given Q; thus the conditional probability distribution of (X,Y|Q) is given by
$$p_{X,Y|Q}(0,0|\text{fair}) = p_{X,Y|Q}(0,1|\text{fair}) = p_{X,Y|Q}(1,0|\text{fair}) = p_{X,Y|Q}(1,1|\text{fair}) = 0.25,$$
$$p_{X,Y|Q}(0,0|\text{unfair}) = 0.81, \quad p_{X,Y|Q}(0,1|\text{unfair}) = p_{X,Y|Q}(1,0|\text{unfair}) = 0.09, \quad p_{X,Y|Q}(1,1|\text{unfair}) = 0.01.$$
Since
$$p_{X,Y}(0,0) = p_{X,Y|Q}(0,0|\text{fair})p_Q(\text{fair}) + p_{X,Y|Q}(0,0|\text{unfair})p_Q(\text{unfair}) = 0.25\cdot 0.5 + 0.81\cdot 0.5 = 0.53,$$
$$p_{X,Y}(0,1) = p_{X,Y}(1,0) = 0.25\cdot 0.5 + 0.09\cdot 0.5 = 0.17, \qquad p_{X,Y}(1,1) = 0.25\cdot 0.5 + 0.01\cdot 0.5 = 0.13,$$

64 we obtain

CHAPTER 1. UNCERTAINTY AND INFORMATION

H (X, Y ) = pX,Y (0, 0) log pX,Y (0, 0) pX,Y (0, 1) log pX,Y (0, 1) pX,Y (1, 0) log pX,Y (1, 0) pX,Y (1, 1) log pX,Y (1, 1) = 0.53 log 0.53 0.17 log 0.17 0.17 log 0.17 0.13 log 0.13 1.7373 bit. An nally, with H (X, Y |fair) =
x,y

pX,Y |Q (x, y |fair) log pX,Y |Q (x, y |fair)

= pX,Y |Q (0, 0|fair) log pX,Y |Q (0, 0|fair) pX,Y |Q (0, 1|fair) log pX,Y |Q (0, 1|fair) pX,Y |Q (1, 0|fair) log pX,Y |Q (1, 0|fair) pX,Y |Q (1, 1|fair) log pX,Y |Q (1, 1|fair) = log 4 = 2 bit, H (X, Y |unfair) = 0.81 log 0.81 0.09 log 0.09 0.09 log 0.09 0.01 log 0.01 0.938 bit, H (X, Y |Q) =
q

pQ (q )H (X, Y |q )

= pQ (fair)H (X, Y |fair) + pQ (unfair)H (X, Y |unfair) 0.5 2 + 0.5 0.938 1.469 bit, we have I (X, Y |Q) = H (X, Y ) H (X, Y |Q) 1.7373 1.469 0.2683 bit. This is the expected information about the question whether the coin is fair, if we observe two throws of the coin. Example 1.30 (Symmetric Binary Channel - Continuation) In a communication channel we must expect to get less information on the input by observing the output, than by directly observing the input. For example consider the symmetric binary channel from example 1.14 with = 0.1, pX (0) = 0.2 and pX (1) = 0.8. We saw, that pY (0) = 0.26 and pY (1) = 0.74, thus H (Y ) = 0.26 log 0.26 0.74 log 0.74 0.8267 bit, and we obtain I (X |Y ) = I (Y |X ) = H (Y ) H (Y |X ) = 0.26 log 0.26 0.74 log 0.74 + 0.9 log 0.9 + 0.1 log 0.1 0.3578 bit. But H (X ) = 0.2 log 0.2 0.8 log 0.8 0.7219 bit > I (X |Y ). That means that the expected loss of information is H (X ) I (X |Y ) 0.7219 0.3578 = 0.3641 bit.

1.2. INFORMATION AND ITS MEASURE

65

We introduce another important notion of information theory. Consider two probabilistic choice systems, with the same choice set S , but dierent probability distributions. If X and Y are the corresponding random variables, then we dene K (PX , PY ) =
xS

pX (x) log

pX (x) . pY (x)

(1.31)

This is called the Kullback-Leibler divergence between X and Y . It is a kind of distance between the probability distributions of X and Y , although d(PX , PY ) = d(PY , PX ). But Theorem 1.10 K (PX , PY ), K (PY , PX ) 0, and K (PX , PY ) = K (PY , PX ) = 0 if, and only if, PX = PY . Proof Indeed, we have K (PX , PY ) =
xS

pX (x) log pX (x)


xS

pX (x) log pY (x).

Lemma 1.1 tells us then that K (PX , PY ) 0 and equal to 0 only if pX (x) = pY (x).

Example 1.31 (Symmetric Binary Channel - Continuation) Once more we consider the symmetric binary channel from the examples 1.14 and 1.30. Here we have K (PX , PY ) = pX (0) log = K (PY , PX ) = = pX (0) pX (1) + pX (1) log pY (0) pY (1) 0.8 0.2 + 0.8 log 0.2 log 0.26 0.74 0.0143 bit, pY (0) pY (1) pY (0) log + pY (1) log pX (0) pX (1) 0.26 0.74 0.26 log + 0.74 log 0.2 0.8 0.0152 bit.

Consider a compound probabilistic situation (S S, P ) and the associated pair of random variables X and Y , which have both the same set of values S . Denote by (S, PX ) and (S, PY ) the probabilistic choice situations related to the two variables X and Y . That is, pX and pY are the marginal distributions of X and Y , given by pX (x) =
y

pX,Y (x, y ),

pY (y ) =
x

pX,Y (x, y ).

66

CHAPTER 1. UNCERTAINTY AND INFORMATION

Furthermore PX PY denotes the probability distribution with values pX (x) pY (y ). Then (1.30) shows that

I (X |Y ) = I (Y |X ) = K (P, PX PY ).

(1.32)

In this view, I (X |Y ) measures to what degree the common probability distribution of the pair (X, Y ) diverges from the case of independence. That is, why the mutual information is also sometimes considered as a measure of dependence between two random variables with the same set of values. Finally we have (see 1.30)

I (X |Y ) =
x,y

pX,Y (x, y ) log

pY |x (x, y ) pY (y ) pY |x (x, y ) pY (y )

=
x,y

pX (x)pY |x (x, y ) log pX (x)K (PY |x , PY )


x

= =
x

pX (x)i(x/Y ).

(1.33)

i(x/Y ) = H (Y ) H (Y |x) measures the information gained on Y by observing X = x. Thus, although in general K (PY |x , PY ) = i(x/Y ), there is equality on average over x between this information and the Kullback-Leibler divergence K (PY |x , PY ). By symmetry we also have

I (Y |X ) =
y

pY (y )K (PX |y , PX ) =
y

pY (y )i(y/X ).

Example 1.32 (Kullback-Leibler Divergence and Mutual Information) Consider S = (e1 , e2 ), S S = {(e1 , e1 ), (e1 , e2 ), (e2 , e1 ), (e2 , e2 )} and P = {0.5, 0.1, 0.3, 0.1}. Let X and Y be the random variables associated to the compound probabilistic situation (S S, P ). Thus, the marginal distributions of X and Y are given by pX (e1 ) = pX,Y (e1 , e1 ) + pX,Y (e1 , e2 ) = 0.5 + 0.1 = 0.6, pX (e2 ) = pX,Y (e2 , e1 ) + pX,Y (e2 , e2 ) = 0.3 + 0.1 = 0.4, pY (e1 ) = pX,Y (e1 , e1 ) + pX,Y (e2 , e1 ) = 0.5 + 0.3 = 0.8, pY (e2 ) = pX,Y (e1 , e2 ) + pX,Y (e2 , e2 ) = 0.1 + 0.1 = 0.2. We dene the distribution PX PY by p(x, y ) = pX (x)pY (y ). Let us now compute some entropies. With

1.2. INFORMATION AND ITS MEASURE

67

pX |Y =e1 (e1 , e1 ) = pX |Y =e1 (e2 , e1 ) = pX |Y =e2 (e1 , e2 ) = pX |Y =e2 (e2 , e2 ) = we obtain

pX,Y (e1 , e1 ) pY (e1 ) pX,Y (e2 , e1 ) pY (e1 ) pX,Y (e1 , e2 ) pY (e2 ) pX,Y (e2 , e2 ) pY (e2 )

0.5 0.8 0.3 = 0.8 0.1 = 0.2 0.1 = 0.2 =

= 0.625, = 0.375, = 0.5, = 0.5,

H (X ) = 0.6 log 0.6 0.4 log 0.4 0.9710 bit, H (X |e1 ) = pX |Y =e1 (e1 , e1 ) log pX |Y =e1 (e1 , e1 ) pX |Y =e1 (e2 , e1 ) log pX |Y =e1 (e2 , e1 ) = 0.625 log 0.625 0.375 log 0.375 0.9544 bit, H (X |e2 ) = pX |Y =e2 (e1 , e2 ) log pX |Y =e2 (e1 , e2 ) pX |Y =e2 (e2 , e2 ) log pX |Y =e2 (e2 , e2 ) = 0.5 log 0.5 0.5 log 0.5 = 1 bit, H (X |Y ) = pY (e1 )H (X |e1 ) + pY (e2 )H (X |e2 ) 0.8 0.9544 + 0.2 1 = 0.9635 bit. Hence I (X |Y ) = H (X ) H (X |Y ) 0.9710 0.9635 0.0074 bit. Since K (P, PX PY ) =
(x,y )S S

pX,Y (x, y ) log

pX,Y (x, y ) p(x, y )

= pX,Y (e1 , e1 ) log

pX,Y (e1 , e2 ) pX,Y (e1 , e1 ) + pX,Y (e1 , e2 ) log p(e1 , e1 ) p(e1 , e2 ) pX,Y (e2 , e2 ) pX,Y (e2 , e1 ) + pX,Y (e2 , e2 ) log +pX,Y (e2 , e1 ) log p(e2 , e1 ) p(e2 , e2 ) 0.5 0.1 0.3 0.1 = 0.5 log + 0.1 log + 0.3 log + 0.1 log 0.6 0.8 0.6 0.2 0.4 0.8 0.4 0.2 0.0074 bit,

we verify that I (X |Y ) = K (P, PX PY ). And nally, with

pY |X =e1 (e1 , e1 ) = pY |X =e1 (e2 , e1 ) = pY |X =e2 (e1 , e2 ) = pY |X =e2 (e2 , e2 ) =

pX,Y (e1 , e1 ) pX (e1 ) pX,Y (e1 , e2 ) pX (e1 ) pX,Y (e2 , e1 ) pX (e2 ) pX,Y (e2 , e2 ) pX (e2 )

0.5 0.6 0.1 = 0.6 0.3 = 0.4 0.1 = 0.4 =

5 = , 6 1 = , 6 3 = , 4 1 = , 4

68 and

CHAPTER 1. UNCERTAINTY AND INFORMATION

H (Y ) = 0.8 log 0.8 0.2 log 0.2 0.7219 bit, 5 5 i(e1 /Y ) = H (Y ) H (Y |e1 ) 0.7219 + log + 6 6 3 3 i(e2 /Y ) = H (Y ) H (Y |e2 ) 0.7219 + log + 4 4 we obtain I (X |Y ) =
x

1 log 6 1 log 4

1 0.0719 bit, 6 1 0.00894 bit, 4

pX (x)i(x/Y ) = pX (e1 )i(e1 /Y ) + pX (e2 )i(e2 /Y )

0.6 0.0719 0.4 0.0894 0.0074 bit. It is interesting to note that i(e2 /Y ) is negative.

Summary for Section 1.2 We have dened mutual information I (X |Y ) between two random variables X and Y as the expected gain in information on one variable, obtained if the other one is observed. Remarkably, this value is symmetric, I (X |Y ) = I (Y |X ). Mutual information is also the dierence between the sum of the individual entropies of the two variables and the actual entropy of the pair. This shows that mutual information is always non-negative and vanishes exactly, if the two variables X and Y are independent. Thus it also measures the reduction of uncertainty of the actual situation with respect to the case of independent variables. The Kullback-Leibler divergence K (PX , PY ) of two probability distributions PX and PY measures the distance from PX to PY . However, in general K (PX , PY ) = K (PY , PX ). Nevertheless we have K (PX , PY ) = K (PY , PX ) = 0 if PX = PY . The mutual information I (X |Y ) equals the Kullback-Leibler divergence from the common distribution of the pair (X, Y ) to the product of their marginal distributions. This is another measure of the distance of the actual situation and the assumed case of independence. The information gained on a variable Y by observing another variable X equals on average the Kullback-Leibler divergence between the conditional distribution of Y given X = x and the unconditional marginal distribution of Y .

Control Question 20 What can we say about the mutual Information between X and Y ? 1. The mutual information is the expected amount of information relative to X by observing Y . 2. In some cases the mutual information can be negative.

1.2. INFORMATION AND ITS MEASURE 3. The mutual information can be zero; even if X and Y are independent.

69

4. The mutual information between X and Y equals the mutual information between Y and X . Answer 1. That ts our denition of the mutual information exactly, hence it is correct. 2. Incorrect, since I (X |Y ) = H (X ) H (X |Y ) and H (X |Y ) H (X ) it follows, that I (X |Y ) 0. 3. The mutual information is equal to zero, if, and only if, X and Y are independent, hence the assertion is wrong. 4. We have I (X |Y ) = H (X ) + H (Y ) H (X, Y ) = H (X ) + H (Y ) H (Y, X ) = I (Y |X ), hence the proposition is correct.

Control Question 21

1. K (PX , PY ) = K (PY , PX ), since K is a kind of distance. 2. K (PX , PY ) = 0, if X and Y are independent. 3. I (X |Y ) = K (PX PY , P ), if X and Y are the random variables associated to the compound probabilistic situation (S S, P ). Answer 1. We said that K is a kind of distance, however, K (PX , PY ) = K (PY , PX ). 2. K (PX , PY ) = 0 if, and only if, PX = PY , hence the assertion is false. 3. It is true, that I (X |Y ) = K (P, PX PY ), but, since K (PX , PY ) = K (PY , PX ), the proposition is wrong.

1.2.4

Surprise, Entropy and Information

Learning Objectives for Subsection 1.2.4 After studying this subsection you should understand the notion of degree of surprise associated with an event; its relation to entropy, measures of information and mutual information.

70

CHAPTER 1. UNCERTAINTY AND INFORMATION

If E S is a rare event in a probabilistic choice situation (S, P ), then we may be very surprised when we actually observe it. So, it may be interesting to measure the degree of unexpectedness or of surprise of an event. Lets denote the degree of unexpectedness of the event E by s(E ). The following characteristics seem reasonable to assume for it: s(E ) should only depend on the probability p(E ) of the event E , i.e. s(E ) = f (p(E )). s(E ) should be a monotone decreasing function of its probability p(E ). The greater the probability, the smaller the unexpectedness of the event; the smaller the probability, the greater the surprise. If two events E1 and E2 are independent, then the degree of surprise of their common occurrence should be the sum of their individual degrees of surprise, s(E1 E2 ) = s(E1 ) + s(E2 ). The logarithm is the only function satisfying these requirements, s(E ) = log p(E ). If we (arbitrarily) require in addition that the degree of surprise of an event with probability 1/2 equals 1, f (1/2) = 1, then the logarithm must be taken to the base 2. So log2 p(E ) is considered as a measure of the degree of surprise or unexpectedness of an event E . Some authors dene log p(E ) to be the information contained in E or the selfinformation of E. We disagree with this view: An information is a, possibly partial, answer to a precise question and it is information in so far as it changes the uncertainty about the possible answer to this question. If we consider a single event E , then no specic question is associated with it. On the other hand, the degree of unexpectedness is associated with an event and not a specied question. Example 1.33 (Lottery) Suppose you win in a lottery where odds are 1 to say 1015 or something like that. You will be (agreeably) surprised. If you do not win, you will not be surprised at all. The amount of information that you win relative to the question win or not? is the same, as it is if you nd out that you do not win. The amount of the same information (i.e. the number drawn in the lottery) relative to the question which number is drawn is much larger. The degree of surprise associated with the number drawn on the other hand does not depend on the question asked. Example 1.34 (Swiss-Lotto) We are playing Swiss-Lotto, that means choosing 6 numbers out of 45. There are exactly 45 6 = 45! = 8145060 6! (45 6)!

possibilities, so, the degree of unexpectedness s(win) = log 8145060 22.9575. Let X denote the random variable whether we win or not. Thus 1 , 8145060 1 , p(X = not win) = 1 8145060 p(X = win) =

1.2. INFORMATION AND ITS MEASURE and hence H (X ) = 1 1 1 log 1 8145060 8145060 8145060 log 1 1 8145060

71

2.9957 106 bit. The uncertainty whether we win or not is really small, because it is almost sure that we will not win. To further clarify the dierence between surprise and information, consider a probabilistic choice situation where the possible choices S are only E or not E (that is E c ) with probabilities p and 1 p for the two possible cases. The uncertainty associated with this situation is p log p (1 p) log(1 p). In gure 1.13, the graph of this uncertainty and the degree of surprise log p as a function of p is drawn. Uncertainty is maximal for p = 1/2. This is also the maximum information obtained by observing E with respect to the question whether E happens or not. Uncertainty is null if either p = 0 or p = 1. Surprise however is maximal (innite in fact), if p = 0. In this case we do not at all expect E to happen. Surprise is null on the other hand, if p = 1. In this case we are sure that E will happen, and are not surprised at all, if E indeed occurs.

Figure 1.13: Information contained in event E relative to the question E or not E versus degree of surprise in E . If we look at the entropy H (X ) = pX (x1 ) log pX (x1 ) pX (x2 ) log pX (x2 ) pX (xm ) log pX (xm ) of a random variable X , then we see that it is the expected value of the unexpectedness log pX (xi ) of the events X = xi . So, surprise, entropy and information are closely related, but dierent concepts. The amount of information (of a value of a random variable observed) is the average or expected degree of surprise of the possible events. There is even another relation. Let X and Y be two random variables associated with the compound probabilistic choice situation (S1 S2 , P ). How does the unexpectedness

72

CHAPTER 1. UNCERTAINTY AND INFORMATION

of the event X = x change, if Y = y is observed? Originally, the degree of surprise of the event {X = x} is log pX (x). Once Y = y is observed it changes to log pX |y (x|y ) = log So, the change of surprise is log pX (x) + log pX,Y (x, y ) pX,Y (x, y ) = log . pY (y ) pX (x)pY (y ) pX,Y (x, y ) . pY (y )

From this we see that the mutual information I (X |Y ) =


x,y

pX,Y (x, y ) log

pX,Y (x, y ) pX (x)pY (y )

is the expected change of surprise for a value of X given that a value of Y is observed. Or, in other words, the expected amount of information gained with respect to X by observing Y equals the expected change of unexpectedness of observing X given that Y is observed. So, this is another link between amount of information and degree of surprise. Summary for Section 1.2 We dened the degree of unexpectedness or of surprise of an event as the negative of the logarithm of its probability. It grows as the probability of the event decreases, is zero for a probability of one, and innite for a probability of zero. Surprise is not to be confused with information. The former is simply associated with an event, the latter is measured with respect to a question. That is, an event always has the same unexpectedness. However the amount of information it carries depends on the question considered. However surprise and information are related: Entropy of a variable , hence information, equals the expected value of surprise of the possible values of the random variable. And mutual information, i.e. expected information with respect to one variable, given the observation of the other, equals the expected change of surprise in the rst variable, given the observation of the second one.

Control Question 22 Let E be an event with p(E ) = 0. Then 1. s(E ) = 0, since we are sure that E can not happen; 2. s(E ) = 1, since s(E ) is maximal if p(E ) = 0. Answer Surprise is innite (and hence maximal) if p(E ) = 0. So both answers are wrong.

1.2. INFORMATION AND ITS MEASURE

73

Control Question 23 The degree of surprise s(E ) 1. is monotone decreasing (as a function of p(E )); 2. is continuous; 3. can be zero; 4. is measured with respect to a specied question. Answer 1. The greater the probability, the smaller the surprise, hence it is correct. One can also argue that the logarithm is monotone increasing and log x 0 if 0 < x 1. 2. Follows directly from the fact that the logarithm is a continuous function. 3. If p(E ) = 1, there is no surprise left, and we indeed get s(E ) = 0. 4. The degree of surprise is associated with an event and not, like information, to a specied question. It is important to understand this dierence.

1.2.5

Probability as Information

Learning Objectives for Subsection 1.2.5 After studying this subsection you should understand that not only events or observations of values of random variables represent information, but that a probability distribution represents information too; how the Kullback-Leibler divergence measures in a special, but important case, the amount of information of a probability distribution.

We look at a choice situation S without probability. As we know, its uncertainty is measured by h(|S |) = log |S |. But now we also understand, that we may consider the probabilistic choice situation (S, pV ) too, where pV = 1/|S | is the uniform distribution over S . We have H (V ) = h(|S |) = log |S |. The set of possible choices is all we know in a given situation. But, now suppose, that some source of information tells us, that in fact, not all elements of S are equally probable, but that we rather have a probabilistic choice situation (S, P ), where P represents some non uniform probability distribution over S . Let X be the random variable associated with this choice situation. We know that H (X ) < H (V ). Thus, we really have new information with respect to the initial choice situation without probabilities, namely information in our sense, that changes the measure of

74

CHAPTER 1. UNCERTAINTY AND INFORMATION

uncertainty. The amount of information contained in the probability distribution P , relative to the prior uniform distribution, is expressed by the change of entropy, i(X/V ) = H (V ) H (X ) = log |S | +
x

pX (x) log pX (x)

=
x

pX (x) log(|S |pX (x))

= K (PX , PV ). So, the information content of P relative to the original uniform distribution equals the Kullback-Leibler divergence between the probability distribution P and the original uniform distribution. We notice that 0 i(X/V ) log |S |. The rst inequality holds since H (X ) H (V ) = log |S | and the second one because H (V ) H (X ) H (V ), since H (X ) 0. So, the amount of information i(X/V ) is always positive, and the maximal amount log |S | is obtained, when the event X = x is observed, that is H (X |x) = 0. In fact, the last case corresponds to a degenerate random variable Y with probability distribution pY (x) = 1, and pY (y ) = 1, if y = x. So, here we go one step further and consider not only events or values of random variables which are observed or reported as information, but more generally specications of probability distributions over choice sets as well. This is in fact more general, also in the sense that an observed event may be represented by the conditional probability distribution it induces. So, in fact, if we start with the uniform random variable V over S and observe event E S, then this amounts to considering the conditional random variable VE = V |E , which has a uniform distribution over E , pV |E (x) = Then we obtain i(E/V ) = i(VE /V ) = H (V ) H (VE ) = log |S | log |E | = log Note that |E |/|S | = p(E ) and according to the last result, i(E/V ) = log p(E ). Here we see that the surprise is a measure of information, but a very particular one. It is the amount of information gained by an event E relative the prior information represented by the uniform random variable V , that is, the choice without probability. We stress once more, that in this more general setting as well, the amount of an information is still to be measured relative to a) a precisely specied question and b) a specied prior information. In order to emphasize these remarks we consider the case of compound choice situations. So let S1 S2 be a compound choice situation. We associate with it two uniform random variables V1 over S1 and V2 over S2 . Then in the pair (V1 , V2 ) with the uniform distribution over S1 S2 , the two variables V1 and V2 are independent. Consider now |S | . |E |
1 |E |

if x E, otherwise.

1.2. INFORMATION AND ITS MEASURE

75

any pair of random variables (X, Y ), associated with a probabilistic choice situation (S1 S2 , P ). Then this represents an information relative to (V1 , V2 ) of amount i(X, Y /V1 , V2 ) = log |S1 | + log |S2 | H (X, Y ). We have (use theorem 1.3): i(X, Y /V1 , V2 ) (log |S1 | H (X )) + (log |S2 | H (Y )) = i(X/V1 ) + i(Y /V2 ). (1.34) Equality holds exactly, if X and Y are independent. The probability distribution P over S1 S2 of course contains also information with respect only to S1 (or S2 ). In fact, knowing the distribution P gives an uncertainty of H (X ) with respect to S1 . Hence, we have i(X, Y /V1 ) = H (V1 ) H (X ) = i(X/V1 ). Similarly we obtain i(X, Y /V2 ) = i(Y /V2 ). Example 1.35 (Symmetric Binary Channel - Continuation) Given a compound choice situation S1 S2 = {(0, 0), (0, 1), (1, 0), (1, 1)} (symmetric binary channel). With it we associate two uniform random variables V1 and V2 as described above. This means that we know nothing about the channel and its input/output. But now consider the pair (X, Y ) of random variables associated with the compound choice situation (S1 S2 , P ) given in the examples 1.14 and 1.30. This is new information describing how the channel works. We have seen, that H (X, Y ) 1.1909 bit, H (X ) = 0.7219 bit and H (Y ) = 0.8267 bit. Hence i(X, Y /V1 , V2 ) = log |S1 | + log |S2 | H (X, Y ) = 2 log 2 H (X, Y ) 0.8091 bit, i(X/V1 ) = log |S1 | H (X ) 0.2781 bit i(Y /V2 ) = log |S2 | H (Y ) 0.1733 bit. Note that 0.4514 bit i(X/V1 ) + i(Y /V2 ) < i(X, Y /V1 , V2 ) 0.8091 bit. A particular case arises, if one of the two random variables X or Y is degenerate. Suppose for example that Y = y . Then we have pX,Y (x, y ) = pX (x) if y = y and pX,Y (x, y ) = 0, if y = y . In such a case we have i(X, y/V ) = log |S1 | + log |S2 | H (X, y ). But H (X, y ) =
x,y

pX,Y (x, y ) log pX,Y (x, y ) =


x

pX (x) log pX (x) = H (X ).

Thus we see that, in correspondence to (1.34) i(X, y/V ) = (log |S1 | H (X )) + log |S2 | = i(X/V1 ) + i(y/V2 ) If both variables X and Y are degenerate, that is X = x and Y = y , then the last result implies that i(x, y/V ) = i(x/V1 ) + i(y/V2 ).

76

CHAPTER 1. UNCERTAINTY AND INFORMATION

This is of course the same information as observing the values x and y for the possible choices in S1 and S2 . Another particular case is that only a probabilistic information relative to S1 is known, whereas relative to S2 we have a choice without probabilities. This situation can be represented by a pair of random variables X and Y such that pX,Y (x, y ) = pX (x) , |S2 |

such that X has the marginal probability distribution pX (x) and Y the uniform distribution pY (y ) = 1/|S2 |, representing a choice without probabilities. Then we nd that i(X, Y /V ) = log |S1 | + log |S2 | H (X, Y ) pX (x) pX (x) = log |S1 | + log |S2 | log |S2 | |S2 | x,y = log |S1 |
x

pX (x) log pX (x)

= i(X/V1 ). So, if nothing is known about the choice in S2 , then the amount of information relative to the question V is the same as the information relative to V1 . Example 1.36 (Probability as Information) Consider S1 = {0, 1}, S2 = {0, 1} and the random variables X associated with S1 and Y associated with S2 . Let pY (0) = 0.1 and pY (1) = 0.9. Relative to S1 we have a choice without probabilities. Thus pX,Y (0, 0) = pX,Y (0, 1) = pX,Y (1, 0) = pX,Y (1, 1) = pY (0) |S1 | pY (1) |S1 | pY (0) |S1 | pY (1) |S1 | 0.1 2 0.9 = 2 0.1 = 2 0.9 = 2 = = 0.05, = 0.45, = 0.05, = 0.45,

and, with H (X, Y ) = 0.05 log 0.050.45 log 0.450.05 log 0.050.45 log 0.45 1.469 bit, we obtain i(X, Y /V ) = log |S1 | + log |S2 | H (X, Y ) = 2 log 2 H (X, Y ) 0.531 bit. Since we have a choice without probabilites relative to S1 we get that H (X ) = 0.5 log 0.5 0.5 log 0.5 = log 2 = 1 bit. Since H (Y ) = 0.1 log 0.1 0.9 log 0.9 0.469 bit we nally obtain i(X/V1 ) = H (V1 ) H (X ) = log |S1 | H (X ) = 1 1 = 0 bit, i(Y /V2 ) = H (V2 ) H (Y ) = log |S2 | H (Y ) 1 0.469 0.531 bit. Note that i(Y /V2 ) = i(X, Y /V ).

1.2. INFORMATION AND ITS MEASURE

77

To conclude we prove a theorem, which is to information what theorem 1.4 is to entropy. In order to formulate the theorem, we introduce a new concept. If X and Y are a pair of variables associated with the compound probabilistic choice situation (S1 S2 , P ), then let Yx denote the conditional random variable Y given that X = x. We may consider the amount of information carried by this variable, or rather its probability distribution, with respect to the question V2 , i(Yx /V2 ) = log |S2 | H (Y |x). It is the measure of the information contained in the pair of random variables (X, Y ) and the observation X = x with respect to V2 . We consider the expected value of this information I (Y |X/V2 ) =
x

pX (x)i(Yx /V2 ) = log |S2 | H (Y |X )

Now we have the following theorem.

Theorem 1.11 Let X and Y be a pair of random variables associated with the probabilistic choice situation (S1 S2 , P ). Then i(X, Y /V ) = i(X/V1 ) + I (Y |X/V2 ).

Proof The proof is straightforward, using theorem 1.4, i(X, Y /V ) = log |S1 | + log |S2 | H (X, Y ) = (log |S1 | H (X )) + (log |S2 | H (Y |X )) = i(X/V1 ) + I (Y |X/V2 ). This theorem says that the measure of information contained in X, Y relative to the choice situation S1 S2 without probabilities, is the sum of the information contained in X with respect to S1 and the expected information of Y given an observation of X with respect to S2 . Example 1.37 (Symmetric Binary Channel - Continuation) We continue example 1.35. Since H (X |Y ) 0.4690 (see example 1.14), we obtain I (Y |X/V2 ) = log |S2 | H (Y |X ) = log 2 0.4690 0.5310 bit. Thus 0.8091 bit i(X, Y /V ) = i(X/V1 ) + I (Y |X/V2 ). We have seen that both observations of random variables or events, as well as probability distributions represent information. In many cases it seemed that events could be replaced by probability distributions, i.e. the conditional distribution, to obtain the same amount of information. Also choice without probability turned out to be equivalent to probabilistic choice with a uniform probability distribution. So, nally, it seems

78

CHAPTER 1. UNCERTAINTY AND INFORMATION

that probability distributions are the ultimate form of information. This however is a premature conclusion, as the following example shows. Example 1.38 (Probability Distributions and Events as Information) Suppose a coin is thrown two times and it is reported that the result was not heads both times. How much information does this provide for the question whether the rst throw was heads or tails? If we know nothing about the coin, in particular, if we cannot assume that it is a fair coin, then we are faced with a choice situation without probability. In fact S = {(h, h), (h, t), (t, h), (t, t)}, if h stands for heads and t for tails. The observation is the event E = {(h, t), (t, h), (t, t)} and the amount of information gained with respect to the original choice situation is log 4/3 bit. The event E projected to the choice situation of the rst throw is {h, t}, that is both results are equally possible. Hence, the information gained by E with respect to this question is 0 bit. Assuming an uniform probability distribution over S means to assume that the coin is fair. Then, event E conditions the probability distribution to 1/3 for the three possible outcomes {(h, t), (t, h), (t, t)}. The marginal distribution with respect to the rst throw assigns probability 1/3 to heads and 2/3 to tails. Thus, this time the information gained with respect to the rst throw, is log 4 + 1/3 log 1/3 + 2/3 log 2/3 bit, which is dierent from 0. But note that in this case we have used the additional information that the coin is fair. So, probabilistic choice situations with uniform distributions and choice situations without probability are not always identical and equivalent. In general, we have to carefully distinguish between probability distributions as information and nonprobabilistic information such as events. This concludes our introductory discussion of the measure of information. Summary for Section 1.2 Probability distributions on choice situations represent information relative to the choice without probability. Its amount is measured, as usual, by the reduction of uncertainty that is achieved. Observed events or values also represent information which can be subsumed into the conditional probability distributions they induce, relative to the prior probability distribution. As before, in the case of compound choice situations, probability distributions carry information with respect to dierent questions, and in general the respective amounts are dierent. The relativity of information with respect to specied questions is still valid in this more general framework.

Control Question 24 Consider the probabilistic choice situation (S, pV ) with pV = 1/|S | and let X be a random variable associated with (S, P ). Then i(X/V )

1.2. INFORMATION AND ITS MEASURE 1. = H (V ) H (X ); 2. = K (PX , PV PX ); 3. can be negative. Answer 1. This is correct, since information is dened by the change of entropy.

79

2. The relation between the Kullback-Leibler divergence and the information content of P relative to the uniform distribution is given by K (PX , PV ), hence the assertion is false. 3. That is not possible. i(X/V ) 0 because H (X ) H (V ) if V is uniformly distributed.

Control Question 25 Let X and Y be a pair of random variables associated with (S1 S2 , P ). Then, i(X, Y /V ) = i(X/V1 ) + I (Y |X/V2 ), if, and only if, 1. X and Y are independent; 2. X and Y have a uniform probability distribution; 3. pX = pY . Answer All answers are wrong, since i(X, Y /V ) = i(X/V1 ) + I (Y |X/V2 ) holds for all random variables X and Y .

Summary for Chapter 1 We have seen that entropy is used to measure the uncertainty in a probabilistic choice situation or in a random variable. Entropy is related to the game of (binary) questions and answers in the sense, that it indicates the expected number of questions to nd out the actual element selected in the choice situation or the actual value of the random variable. Uncertainty is essentially measured in bits. Information is given by an observed event or an observed value of a random variable. A probability distribution specied on a choice situation also represents information relative to the situation without known probabilities. The amount of information is measured by the change in the uncertainty between the situation before the information is obtained and afterwards. Like uncertainty, information is measured in bits. The amount of information is relative to a specied question, which is indicated

80

CHAPTER 1. UNCERTAINTY AND INFORMATION or represented by a choice situation. It is also relative to the prior information which is given by the prior probability distribution. Information leads to a posterior probability distribution. The amount of information is the dierence between the entropies of these two distributions. Mutual information between two random variables is the expected amount of information gained on one variable, if the other one can be observed. It is symmetric : the expected gain of information is the same for both variables. The degree of surprise of an event can also be measured. It turns out, that entropy is the expected degree of surprise in a choice situation. So surprise, uncertainty and information are closely linked, but dierent concepts.

OutLook
This module contains an elementary introduction into the basics of information theory. However there are still two questions that need to be looked at briey: 1. What are the applications of this theory? 2. In what directions is this theory to be further developed? The main application of information theory is in coding theory. This theory studies how to code most economically the signals or the information coming from some sources of specied characteristics. It also addresses the question of coding signals or information for transmission over noisy channels, where information is lost. The problem is to introduce enough redundancy into the code such that information loss can be compensated. These problems are studied in modules C2, C3 and C4. But coding is only a relatively narrow aspect of information. There are many more questions related to information: 1. How is information processed, that is, how is information from dierent sources combined and focussed on the questions of interest? This relates to information processing in general, and on the computer in particular. 2. What does it mean that information is uncertain ? 3. How is information represented and how can consequences be deduced or inferred from it? This brings logic and general inference methods, for example from statistics into the game. These questions raise many issues of research. Subsequent modules treat these questions in some depth.

Chapter 2

Module C2: Ecient Coding of Information

by J.-C. Chappelier

Learning Objectives for Chapter 2 In this chapter we present: 1. the basics of coding a discrete information source for compressing its messages; 2. what conditions a code needs to meet in order to compress eciently; 3. the fundamental limit for the compression of data; 4. and methods for producing ecient compression codes.

Introduction
In the preceding chapters, we introduced Shannons measure of uncertainty and several of its properties. However, we have not yet experienced how useful this measure can be in practice. In this chapter, we exhibit the rst practical problem which benets from Shannons measure. This problem is the problem of coding a discrete information source into a sequence of symbols. We also develop some ecient methods for performing such codings, and study under which conditions it can be done. In this framework, entropy appears to be the fundamental limit to data compression; i.e. related to the minimal coding expected length. But what are the general reasons for coding after all? There are basically three dierent purposes: coding for compressing data, i.e. reducing (on average) the length of the messages. This means trying to remove as much redundancy in the messages as possible. 81

82

CHAPTER 2. EFFICIENT CODING OF INFORMATION

Source

Ut

Encoder

Zt

Figure 2.1: Basic schema for source coding: the source symbol Ut at time t is transformed into the corresponding codeword Zt .

coding for ensuring the quality of transmission in noisy conditions. This requires adding redundancy in order to be able to correct messages in the presence of noise. coding for secrecy: making the message impossible (or hard) to read for unauthorized readers. This requires making the access to the information content of the message hard. This chapter focuses on the rst aspect of coding: coding for eciency. The current chapter addresses the problem of coding a discrete information source. But what does all this mean? The following section answers that question. The next section then focuses on the eciency of coding for compressing data. Finally, the last section provides an algorithm to actually construct an ecient code.

2.1

Coding a Single Random Variable

Learning Objectives for Section 2.1 After studying this section you should know: 1. what coding a discrete memoryless information source means; 2. what the general properties of a code are; 3. what the relation between codes and trees is; 4. under which conditions certain codes may exist.

An information source is a message generator, i.e. a generator of sequences of symbols. A symbol is simply an element of a set, called an alphabet. In this course, only nite alphabets will be addressed. When the alphabet is nite, the information source is said to be discrete ; and the size of the alphabet is called the arity of the source. Finally, only messages of nite length will be considered. For instance, a newspaper can be regarded as a discrete information source whose messages are the texts contained in the newspaper, the symbols simply being the letters of the usual alphabet (including the whitespace and other punctuation marks). The messages from the source come then into a encoder which transforms them into a sequence of codewords . The basic schema of such a coding framework is given in gure 2.1. A codeword is simply a (non empty) sequence of symbols taken into the coding alphabet, another alphabet used by the encoder. Coding therefore maps source symbols to

2.1. CODING A SINGLE RANDOM VARIABLE codewords, each of which may consist of several code symbols.

83

More formally, an information source Ut , and more generally U when the time index is not pertinent, is a random process on a given alphabet VU ; i.e. a sequence of random variables on VU . Each symbol as a probability P (Ut = ui ) to be emitted by the source at time t. The source is said to be memoryless if the probability for a symbol ui to be emitted does not depend on the past emitted values; i.e. if t 1 ui V U P (Ut = ui |U1 ...Ut1 ) = P (Ut = ui ).

Furthermore, only stationary sources , i.e. sources for which P (Ut = ui ) does not depend on t, are addressed in this chapter. In such a case, and when no ambiguity is possible, P (U = ui ) will be denoted by pi . We assume that pi = 0 for all considered symbol ui in the source alphabet; that is, we do not bother with those symbols of the source whose probability to occur is zero. Regarding the coder, only the simplest case where one single codeword is associated to each source symbol, is considered. Technically speaking, the encoding process Z := f (U ) is a mapping from the source alphabet VU to the set of codewords VZ . Denoting Z the set of all nite length sequences of symbols from the code alphabet Z , the set of codewords VZ is a subset of Z not containing the empty string (i.e. the unique sequence of length 0). Furthermore, we focus on such codes where dierent source symbols map to dierent codewords. Technically speaking, the mapping f is injective. Such codes, where Z = f (U ) is an injective mapping, are called non-singular codes

Denition 2.1 (Non-singular Code) A code of a discrete information source is said to be non-singular when dierent source symbols maps to different codewords. Formally, denoting zi the codeword corresponding to source symbol ui , we have: ui = uj = zi = zj . All the codes considered in the rest of this chapter are non-singular. Since there is no reason to create codewords which are not used, i.e. which do not correspond to a source symbol, the mapping f from VU to VZ is surjective, and is therefore a one-to-one mapping. Example 2.1 (Finite Source Encoding) A very common example of code is given by the Morse code. This code is used for coding letters. It uses essentially of two symbols: a dot () and a dash ().1 For instance, the letters E, A and K are respectively coded , and . As we are interested in coding messages (i.e. sequences of symbols) and not only single symbols alone, we focus on codes such that each encoded message can be uniquely
Actually four symbols are used in the Morse code: a letter space and a word space code are also used.
1

84

CHAPTER 2. EFFICIENT CODING OF INFORMATION

decoded. Such codes are called non-ambiguous codes A code for which any sequence of codewords uniquely corresponds to a single source message. Denition 2.2 (Non-Ambiguous Codes) A code of a discrete source is said to be non-ambiguous if and only if each (nite length) sequence of codewords uniquely corresponds to a single source message. More formally, a code is said to be non-ambiguous if and only if the trivial extension of the coding mapping f to the set of messages V , taking its value into the set of f U (f : V V ), is a one-to-one mapping. nite length sequences of codewords VZ U Z Example 2.2 (Ambiguous Code) Consider the source consisting of the three symbols a, b and c. Its messages can be any sequence of these symbols; for instance aabca is a message of this source. Then, the following encoding of this source: a1 b 00 is ambiguous. c 11

For instance, there is no way, once encoded, to distinguish the message aaaa from the message cc. Indeed, both are encoded as 1111. Example 2.3 (Non-Ambiguous Code) Keeping the same source as in the previous example, consider now the following code: a1 b 00 c 10 It can be show that this code is non-ambiguous. For instance the sequence 10000 decodes into abb and the sequence 1000 into cb.

2.1.1

Prex-Free Codes

Among the class of non-ambiguous codes, certain are of some particular interest. These are the prex-free codes. Before dening what such a code is, we need to introduce the notion of prex . A sequence z of length n (n 1) is said to be a prex of another sequence z if and only if the rst n symbols of z form exactly the sequence z . For instance, abba is a prex of abbabc. Note that any sequence is trivially a prex of itself. Denition 2.3 (Prex-Free Code) A code of a discrete source is said to be prex-free when no codeword is the prex of another codeword. More formally, a code Z , the alphabet of which is Z and the set of codewords of which is VZ , is said to be prex-free if and only if z VZ y Z (zy VZ = y = )

( denoting the empty (i.e. 0-length) string).

2.1. CODING A SINGLE RANDOM VARIABLE


General codes Non-singular codes Non-ambiguous codes

85

Prefix-free codes = Instantaneously decodable codes

Figure 2.2: How the dierent types of codes are related. Example 2.4 (Prex-Free Code) Consider the source consisting of the three symbols a, b and c. The following encoding of this source: a0 b 10 is prex-free. c 11

On the other hand, the following encoding: a1 b 00 c 10 is not prex-free as 1 (the codeword for a) is a prex of 10 (the codeword for c). Why focusing on prex-free codes? The answer lies in the following next two properties (2.1 and 2.2), which emphasize their interest.

Property 2.1 Any prex-free code is non-ambiguous. It is important to notice however that there exists some non-ambiguous code which are not prex-free as the example 2.3 shows. Let us now come to the second interesting property of prex-free codes. Denition 2.4 A code is said to be instantaneously decodable if and only if each codeword in any string of codewords can be decoded as soon as its end is reached. Property 2.2 A code is instantaneously decodable if and only if it is prexfree. This denition ensures that there is no need to memorize the past codewords nor to wait for the following ones to achieve the decoding. Such a code saves both time and space in the decoding process of a encoded message. We up to now have encountered many dierent types of codes: non-singular, nonambiguous, prex-free. How these dierent types are related one to another is summarized in gure 2.2. Control Question 26

86

CHAPTER 2. EFFICIENT CODING OF INFORMATION

Consider some information source U , the symbols of which are u1 = 1, u2 = 2, u3 = 3, and u4 = 4, with the following probability distribution: ui P (U = ui ) 1 0.5 2 0.25 3 0.125 4 0.125

Consider then the following encoding of it (where zi is the codeword for ui ): z1 0 1. Is this code non ambiguous? 2. Encode the message 1423312. 3. Decode the sequence 1001101010. Answer 1) In this code no codeword is prex of another codeword. Therefore this code is prex-free. Since any prex-free code is non-ambiguous, this code is non-ambiguous. 2) 1423312 011110110110010 3) 1001101010 21322 z2 10 z3 110 z4 111

Control Question 27 Are these codes prex-free? non-ambiguous? Instantaneously decodable? a. z1 =00, z2 =10, z3 =01, z4 =11 b. z1 =0, z2 =1, z3 =01 c. z1 =1, z2 =101 Answer First of all, any prex-free code is instantaneously decodable and, conversely, any instantaneously decodable is prex-free. Therefore the answer to the rst question is the same to the answer to the last one. The code dened in a- is prex-free: indeed no codeword is prex of another one. It is therefore also a non-ambiguous code. The code dened in b- is not prex-free as z1 is prex of z3 . It is furthermore ambiguous as, for instance, 01 corresponds to both sequences z1 z2 and z3 . The code dened in c- is not prex-free since z1 is prex of z2 . However, this code is non ambiguous. Indeed, the only way to get 0 is with z2 . Therefore, when receiving a 1, waiting for one bit more (or the end of the message) can decode the message: if the next bit is 1

2.1. CODING A SINGLE RANDOM VARIABLE


interior nodes root

87

depth

leaves

Figure 2.3: Summary of the terms related to trees.

then we decode into u1 u1 ; and if we received a 0 instead, we decode as u2 , swallowing meanwhile the next digit which is surely a 1. For instance, 1011111011011 is to be understood as: z2 z1 z1 z1 z2 z2 z1 and decoded into u2 u1 u1 u1 u2 u2 u1 . Notice: This is an example of a code which is non-ambiguous although it is not prexfree.

2.1.2

n-ary Trees for Coding

In order to study more deeply the properties of prex-free codes, we now need to introduce some more denitions and formulate some theorems. Among them, the most useful tool for the study of prex-free codes are n-ary trees. Let us rst very briey summarize the concept of tree and related terms (cf gure 2.3). A tree is basically a graph (nodes and arcs) which begins at a root node (simply the root). Each node in the graph is either a leaf or an interior node.2 An interior node has one or more child nodes and is called the parent of its child nodes. The arity of a node is the number of its child nodes. A leaf node is a node without child, i.e. a node of arity 0. As opposed to real-life tree, the root is usually depicted at the top of the gure, and the leaves are depicted at the bottom. The depth of a node in a tree the number of arcs to go from the root to this node. By convention the depth of the root is null. The depth of a tree is the maximum depth of its leaves, i.e. the maximum number of arcs to go from the root to a leaf. Finally, a node n1 is said to cover another node n2 if the path from the root to n2 contains n1 . Notice that a node covers at least itself. Denition 2.5 (n-ary tree) A n-ary tree (n 1) is a tree in which each interior node has arity n, i.e. has exactly n child nodes. A full n-ary tree is a n-ary tree in which all the leafs have the same depth.
2

Notice that by this denition, the root is also an interior node.

88

CHAPTER 2. EFFICIENT CODING OF INFORMATION

Example 2.5 (n-ary tree)

Binary tree (n = 2)

Ternary tree (n = 3)

Full Ternary tree

Property 2.3 In the full n-ary tree of depth d 0, each node at depth (0 d) covers exactly nd leaves.

Denition 2.6 (Coding Tree) A coding tree is a n-ary tree, the arcs of which are labeled with letters of a given alphabet of size n, in such a way that each letter appears at most once out of a given node. The codewords dened by such a tree correspond to sequences of labels along paths from the root to a leaf.

Example 2.6 (Coding Tree)

a a b c 1 a

b b c

c 2

A ternary coding tree: the codeword represented by leaf 1 is ac and leaf 2 represents the codeword c.

Denition 2.7 (n-ary code) A code with an alphabet of size n is called a n-ary code.

Property 2.4 For every n-ary prex-free code, there exists at least one n-ary coding tree such that each codeword corresponds to the sequence of labels of an unique path from the root to a leaf. Conversely, every coding tree denes a prex-free code. The codewords of this prex-free code are dened as the sequences of labels of each path from the root to each leaf of the coding tree. Shortly speaking, prex-free codes and coding trees are equivalent. Example 2.7 The coding tree corresponding to the prex-free code of example 2.4

2.1. CODING A SINGLE RANDOM VARIABLE ({0, 10, 11} ) is

89

0 a

1 0 b 1 c

As a standard display convention, the leaves are labeled by the source symbol, the codeword of which is the path from the root. Notice that when representing a (prex-free) code by a tree, it can occur that some leaves do not correspond to any codeword. Such leaves are called unused leaves . For instance, the binary coding tree corresponding to the code { 0, 101 } has two unused leaves as shown on the right.
0 1 0 0 1 1

Indeed, neither 11 nor 100 correspond to codewords. They are useless. Denition 2.8 (Complete Code) When there is no unused leaf in the corresponding n-ary coding tree, the code is said to be a complete code .

Control Question 28 For all the trees below, with the convention that all the branches out of a same node are labeled with dierent labels (not displayed here), tell if it is a coding tree or not; and, in the case it is a coding tree, tell 1. what is its arity, 2. if the corresponding code is complete, 3. the length of the codeword associated to message c.

1)
a b c d e

2)

a b c d e

f h i j k g

3)

4)
a b c d e

5)
a b c

b c d e

f h i j k g
Answer

90

CHAPTER 2. EFFICIENT CODING OF INFORMATION

1) yes, arity=2, complete, length for codeword coding c is 3 2) no, this tree is not a coding tree since the arity of its nodes is not constant (sometimes 3, sometimes 4) 3) yes, arity=4, not complete, length for codeword coding c is 2 4) yes, arity=5, complete, length for codeword coding c is 1 5) yes, arity=3, not complete, length for codeword coding c is 3

2.1.3

Kraft Inequality

We are now looking at the conditions that must hold for a prex-free code to exist. The Kraft inequality appears as a necessary and sucient condition.

Theorem 2.1 (Kraft Inequality) There exists a D-ary prex-free code of N codewords and whose codeword lengths are the positive integers l1 , l2 , . . . , lN if and only if
N

D li 1.
i=1

(2.1)

When (2.1) is satised with equality, the corresponding prex-free code is complete.

Example 2.8 For the binary (D = 2) complete prex-free code of example 2.4
N

({0, 10, 11}), the sum equal to 1.


i=1

D li is 21 + 22 + 22 , i.e.

1 2

1 4

1 4

which is indeed

Similarly, Kraft inequality tells us that there exists at least one ternary prex-free code whose codeword lengths are 1, 2, 2 and 4. Indeed 31 + 32 + 32 + 34 = Such a code would not be complete. e-pendix: Kraft Inequality Warning! A classical pitfall to avoid with this theorem is the following: the theorem only tells us when a prex-free code may exists, but it does not at all answers the question if a given code (with such and such lengths for its codewords) is indeed prexfree. For instance, the rst code given in example 2.4 ({1, 00, 10}) is not prex-free. However the corresponding sum Dli is 21 + 22 + 22 = 1. The pitfall to avoid is that Theorem 2.1 does not tell us that this code is prex-free, but that there exists a prexfree code with the same codeword lengths. Indeed such a code is given in the second part of example 2.4 ({0, 10, 11}). Let us now proceed with the proof of Theorem 2.1. 46 81 0.57 < 1.

2.1. CODING A SINGLE RANDOM VARIABLE

91

Proof = Suppose rst that there does exist a D-ary prex-free code whose codeword lengths are l1 , l2 , . . . , lN . Let L := max li + 1. Consider constructing the corresponding coding tree Tcode by pruning3 the full D-ary tree of depth L, Tfull , at all nodes corresponding to codewords.
0 1 0 1 0 1 1 0 1
i

Tfull
0

1 1 0

Tcode

Because of the prex-free condition, no node corresponding to a codeword can be below another node corresponding to another codeword. Therefore each node corresponding to a codeword prunes its own subtree. Looking at the ith codeword and applying property 2.3 to li (which is < L), Tcode has, for this node only, D Lli leaves less than Tfull . Considering now the whole code, Tcode has than Tfull .
N Lli i=1 D

= DL

N li i=1 D

leaves less

However at most D L leaves can be removed since Tfull has precisely D L leaves. Therefore
N

DL
i=1

Dli

DL ,

i.e.

D li 1.
i=1

Furthermore, in the case where the considered code is complete, all the leaves correspond to a codeword; therefore all the corresponding subtrees in Tfull have been removed, and therefore all the D L leaves of Tfull have been removed. This means
N N

that
i=1

Lli

= D , i.e.
i=1

D li = 1.

= Conversely, suppose that l1 , l2 , . . . , lN are positive integers such that (2.1) is satised. Let L be the largest of these numbers L := max li , and nj be the number
i

of these li that are equal to j (1 j L).


L L1

Inequality (2.1) can then be written as


j =1

nj D j 1, i.e. nL D L
j =1 L2

nj D Lj .

Since nL 0, we have: D nL1 D L

nj D Lj ,
j =1

i.e. nL1 D L1

L2

nj D Lj 1 .
j =1

92

CHAPTER 2. EFFICIENT CODING OF INFORMATION

And since all the nj are non-negative, we successively get, for all 0 k L 1
Lk 1

nLk D

Lk

j =1

nj D Lj k .

These inequalities constitute the key point for constructing a code with codeword lengths l1 , l2 , . . . , lN : 1. start with a single node (the root) 2. for all k from 0 to L do (a) assign each codeword such that li = k to a node of the current depth (k). These nk nodes becomes therefore leaves of the coding tree. (b) extend all the remaining nodes of current depth with D child nodes. Doing so, the number of nodes which are extended at step (2b) is D k
j k

n j D k j

leading to D

k +1

j k

nj D

k +1j

new nodes for next step. Because of the former

inequalities, this number is greater than nk+1 , leaving therefore enough nodes for next step (2a). The algorithm can therefore always assign nodes to codewords; and nally construct the whole coding tree for the code. Therefore, if the li satisfy inequality (2.1), we are able to construct a prex-free code with the corresponding codeword lengths. Furthermore, in the case where step (2a) when j = L is DL
j L i

D li = 1, the number of nodes remaining after


N

nj DLj = D L
i=1

D Lli = D L (1
i

Dli ) = 0,

which means that all the nodes have been aected to a codeword, i.e. that the code is complete. Note that this proof of the Kraft inequality actually contains an eective algorithm for constructing a D-ary prex-free code given the codeword lengths (whenever such a code exists). Example 2.9 Does a binary prex-free code with codeword lengths l1 = 2, l2 = 2, l3 = 2, l4 = 3, and l5 = 4 exist?
5

The answer is yes since


i=1

2li = 1/4 + 1/4 + 1/4 + 1/8 + 1/16 = 15/16 < 1. An

example of such a code can be:

i.e. removing the whole subtree at given node

2.1. CODING A SINGLE RANDOM VARIABLE


0 1

93

0 1 0 1 u1 u2 u 30 1 u4 0 1 u5

U Z

u1 00

u2 01

u3 10

u4 110

u5 1110

Example 2.10 Does a binary prex-free code with codeword lengths 1, twice 2, 3, and 4 exist?
5

The answer is no, since


i=1

2li = 1/2 + 1/4 + 1/4 + 1/8 + 1/16 = 19/16 > 1.

Control Question 29 Does there exists a ternary prex-free code with codeword lengths 1, 2, 2, 3 and 4? Answer
5

Since
i=1

3li = 1/3 + 1/9 + 1/9 + 1/27 + 1/81 = 49/81 < 1, we know that such a

prex-free code exists. Here is one example, as a matter of illustration:


a b c b c a a b c b c

u1

u2 u3 u4 u5

U Z

u1 a

u2 ba

u3 bb

u4 bca

u5 bcba

Control Question 30 Which of the following trees (maybe several) has for its codeword lengths: 2, 2, 3, 3 and 4.

u1

u2 u3 u4 u5

u1 u2 u3 u4

u5

u1

u2

u3

u2 u2 u3 u3 u4

u1
Answer

u2 u3

u4 u5

94

CHAPTER 2. EFFICIENT CODING OF INFORMATION

Trees 2 and 5. Notice that tree 4 is even not a conding tree, since in a coding tree one message must correspond to one leaf and one leaf only.

Summary for Chapter 2 Codes: prex-free = non-ambiguous = non-singular Prex-Free Codes: no codeword is prex of another

equivalent to instantaneously decodable codes equivalent to coding trees Kraft Inequality: prex-free D-ary code
i

Dli 1

2.2. EFFICIENT CODING

95

2.2

Ecient Coding

Learning Objectives for Section 2.2 In this section, you should: 1. understand what ecient means for a compression code; 2. learn how prex-free codes and probabilized n-ary trees are related; 3. learn what the universal bound on eciency is for memoryless source coding; 4. see an example of ecient codes.

2.2.1

What Are Ecient Codes?

It is now time to start addressing the question of original interest, namely coding for eciency. Our goal is to code the information source so as to minimize the average code length; i.e. the average length of a sequence of codewords. Provided that the source has certain general properties (which almost often hold) minimizing the average code length is equivalent to minimizing the expected code length .

Denition 2.9 (Expected code length) Formally, recalling that source symbol ui (1 i N ) has a probability pi to be emitted, and denoting li the length of the corresponding codeword, the expected code length E [L] is the expected value of the length of a codeword, i.e.
N

E [L] =
i=1

pi li

(2.2)

When precision is required, the expected length of the code Z will be denoted by E [LZ ]. We are therefore looking for (prex-free) codes such that E [L] is as small as possible. From (2.2), it is obvious that we should assign the shorter codewords to the most probable values of U . Indeed, if pi > pj and l l then pi l + pj l pi l + pj l. But how do we know what codeword lengths to use? And what is the smallest E [L] that can be targeted? We will address these questions shortly, but we rst have to look more accurately at the properties of coding trees within this perspective. Control Question 31 Consider some information source U , the symbols of which are u1 = 1, u2 = 2, u3 = 3, and u4 = 4, with the following probability distribution:

96

CHAPTER 2. EFFICIENT CODING OF INFORMATION ui P (U = ui ) 1 0.125 2 0.3 3 0.125 4 0.25 5 0.2

Consider then the following encoding of it (where zi is the codeword for ui ): z1 1110 What is the expected code length? Answer By denition the expected code length is the expected value of the length of the codewords, i.e., denoting li the length of zi :
4 4

z2 110

z3 10

z4 1111

z5 0

E [L] =
i=1

p(Z = zi )li =
i=1

p(U = ui )li = 0.1254+0.33+0.1252+0.254+0.21 = 2.85

2.2.2

Probabilized n-ary Trees: Path Length and Uncertainty

Recall that a prex-free code denes a n-ary tree in which each codeword corresponds to a leaf (through its path from the root). On the other hand, the probability distribution of the source to be coded assigns probabilities to the codewords, and hence to the leaves of the corresponding n-ary tree. By convention, a probability 0 is assigned to any unused leaf (i.e. that does not correspond to a codeword). The probability assignment can further be extended to interior nodes by recursively assigning them, from the leaves to the root, a probability equal to the sum of the probabilities of the child nodes. Doing so we create a probabilized n-ary tree.

Denition 2.10 (Probabilized n-ary tree) A probabilized n-ary tree is a n-ary tree with nonnegative numbers between 0 and 1 (probabilities) assigned to each node (including the leaves) in such a way that: 1. the root is assigned with a probability 1, and 2. the probability of every node (including the root) is the sum of the probabilities of its child nodes.

Example 2.11 Taking p1 = p5 = 0.1, p2 = p4 = 0.2 and p3 = 0.4 for the binary prex-free code in example 2.9, page 92, results in the following binary tree with probabilities:

2.2. EFFICIENT CODING


1 0.3 0.7 0.3 0.1

97

u1 u2 u3
0.1 0.2 0.4

u4 0.2 u5
0.1

Notice that, in a probabilized n-ary tree, the sum of the probabilities of the leaves must be one.

Lemma 2.1 (Path Length Lemma) In a probabilized n-ary tree, the average depth of the leaves is equal to the sum of the probabilities of the interior nodes (i.e. excluding the leaves but including the root).

Proof The probability of each node is equal to the sum of the probabilities of the leaves of the subtree stemming out from that node. Therefore the sum of the probabilities of interior nodes is a sum over leaf probabilities. Furthermore, a leaf probability appears in this sum exactly as many times as the depth d of the corresponding leaf. Indeed, a leaf at depth d is covered by exactly d interior nodes: all these nodes that are on the path from the root to that leaf. Thus, the sum of the probabilities of all the interior nodes equals the sum of the products of each leaf probability and its depth. This latter sum is precisely the denition of the average depth of the leaves. More formally, let i , 1 i M be the M interior nodes and j , 1 j N be the N leaves. Let furthermore Pi be the probability of interior node i and pj the probability of leaf j . Finally, let (j ) be the depth of leaf j and let us denote by i j the fact that interior node i covers the leaf j . Then the the sum of the probabilities of interior nodes is equal to:
M M N N

Pi =

pj =

pj =

i=1

i=1 j :i j

j =1 i:i j

j =1

pj

i:i j

1 .

Moreover,
i:i j

1 in nothing but the number of interior nodes covering leaf j . There1 = (j )


i:i j

fore

and

Pi =
i=1 j =1

pj (j ) =: E [] .

Example 2.12 (Path Length Lemma) In the preceding example, the average

98

CHAPTER 2. EFFICIENT CODING OF INFORMATION

depth of the leaves is 1 + 0.3 + 0.7 + 0.3 + 0.1 = 2.4 by the Path Length Lemma. As a check, note that (denition of the expected code length) 2 0.1 + 2 0.2 + 2 0.4 + 3 0.2 + 4 0.1 = 2.4 . We now consider some entropy measures on a probabilized n-ary tree. Denition 2.11 (Leaf Entropy of a Probabilized n-ary Tree) Let N be the number of leaves of a probabilized n-ary tree and p1 , p2 , . . . , pN their probabilities. The leaf entropy of such a tree is dened as Hleaf =
i

pi log pi

(2.3)

Property 2.5 For the probabilized n-ary tree corresponding to the prex-free coding tree of an information source U , we have: Hleaf = H (U ) (2.4)

Proof Let Z be the prex-free code under consideration. By denition (of a coding tree), pi is the probability of the ith codeword, and therefore Hleaf = H (Z ). Furthermore, since the code is non-singular (Z = f (U ) is injective), H (Z ) = H (U ). Therefore Hleaf = H (U ). Denition 2.12 Let M be the number of interior nodes of a probabilized n-ary tree and P1 , P2 , . . . , PM their probabilities. Let furthermore qi1 , qi2 , . . . , qini be the probabilities of the ni child nodes (including leaves) of the interior node whose probability is Pi . The branching entropy Hi at this node is then dened as ni qij qij Hi = log , (2.5) Pi Pi
j =1

Notice that, because of the second property of the denition of a probabilized n-ary tree (denition 2.10, page 96), we have
ni

Pi =
j =1

qij .

Example 2.13 Suppose that the M = 5 nodes for the tree of examples 2.9, page 92, and 2.11, page 2.11, are numbered in such a way that P1 = 1, P2 = 0.3, P3 = 0.7, P4 = 0.3 and P5 = 0.1. Then Hleaf =
i=1 5

pi log pi

2.122 bit.

2.2. EFFICIENT CODING We have n1 = 2 and q11 = 0.3 and q12 = 0.7, thus H1 = 0.3 log 0.3 0.7 log 0.7 Similarly, n2 = 2 and q21 = 0.1, q22 = 0.2, thus H2 = 0.1 0.1 0.2 0.2 log log 0.3 0.3 0.3 0.3 0.918 bit. 0.918 bit, H5 = 0. 0.881 bit.

99

It is left as an exercise to show that H3

0.985 bit, H4

Theorem 2.2 (Leaf Entropy Theorem) The leaf entropy of a probabilized n-ary tree equals the sum over all interior nodes (including the root) of the branching entropy of that node weighted by its probability. Using the above dened notations:
M

Hleaf =
i=1

Pi Hi

(2.6)

Example 2.14 Continuing Example 2.13, we calculate Hleaf by (2.6) to obtain Hleaf = 1 H1 + 0.3 H2 + 0.7 H3 + 0.3 H4 + 0.1 H5 0.881 + 0.3 0.918 + 0.7 0.985 + 0.3 0.918 + 0 bit 2.122 bit. in agreement with the direct calculation made in example 2.13.

Theorem 2.3 For any two prex-free codes of the same information source, the code which has the shortest expected code length has the highest symbol entropy rate. Shortly speaking, compressing the data increases the symbol entropy.

2.2.3

Noiseless Coding Theorem

We now use the results of the previous sections to obtain a fundamental lower bound on the expected code length of a prex-free code of some information source. Lower Bound on the Expected Codeword Length for Prex-Free Codes Theorem 2.4 (Shannon Coding Theorem, Part 1) For any discrete memoryless information source of entropy H (U ), the expected code length E [L] of any D-ary prex-free code for this source satises: E [L] H (U ) , log D (2.7)

100

CHAPTER 2. EFFICIENT CODING OF INFORMATION

The bound (2.7) could possibly have been anticipated on intuitive grounds. It takes H (U ) bits of information to specify the value of U . But each D-ary digit of the codeword can, according to Theorem 1.2 and mutual information denition (equation (1.28)), provide at most log D bits of information about U . Thus, we surely will need at least H (U )/ log D code digits, on the average, to specify U . Control Question 32 Consider some information source U , the symbols of which are u1 = 1, u2 = 2, u3 = 3, and u4 = 4, with the following probability distribution: ui P (U = ui ) 1 0.5 2 0.25 3 0.125 4 0.125

Consider then the following encoding of it (where zi is the codeword for ui ): z1 0 z2 10 z3 110 z4 111

1. What is the expected code length? 2. Is the code considered an ecient code, i.e. optimal from the expected code length point of view? Answer By denition the expected code length is the expected value of the length of the codewords, i.e., denoting li the length of zi :
4 4

E [L] =
i=1

p(Z = zi ) li =
i=1

p(U = ui ) li = 0.5 1+ 0.25 2+ 0.125 3+ 0.125 3 = 1.75

Let us compare the expected code length of that code to the source entropy: it can easily be computed that H (U ) = 1.75 bit. We know, by the part 1 of Shannon Noiseless Coding Theorem, that for any prex-free code Z of U we must have H (U ) E [L ] (here log D = 1 since we have a binary code (D = 2)). Therefore, since we have E [L] = H (U ), for any prex-free code Z of U we have E [L] E [L ]. This means that, from the point of view of expected code length, the proposed code is optimal: there cannot exists another prex-free code with a strictly shorter expected code length. The above theorem is our rst instance where the answer to a technical question turns out to be naturally expressed in terms of Shannons entropy. However, this is not a full justication yet of the use of entropy, since only a lower bound has been specied. For instance, the value 1 is trivially another valid lower bound for the expected code length, but we would not claim this bound as a justication for anything! Only when the given lower bound is, in some sense, the best possible one, it can be used as a justication. To show that the bound expressed in the above theorem is indeed the best possible one, we need to show that there exist some codes whose expected code length can be arbitrarily close to it.

2.2. EFFICIENT CODING Shannon-Fano Prex-Free Codes

101

We now show how to construct ecient prex-free codes, although non-optimum in general, but close enough to the lower bound on the expected code length. The key idea is to use as codeword for ui , a code whose length is li = log pi , log D

where x denotes for any x the only integer such that x x < x + 1. Such a code is called a Shannon-Fano code since the technique is implicit in Shannons 1948 paper but was rst made explicit by Fano. But does such a prex-free code always exists? The answer is yes due to the Kraft inequality.
pi Indeed, since by denition li log log D , we have

Dli
i i

D log D =
i

log pi

D logD pi =
i

pi = 1.

Let us now measure how good such a code is in terms of its expected code length. By denition of li we have: log pi + 1. (2.8) li < log D Multiplying both side by pi and summing over i gives: pi li <
i

i pi log pi

log D

+
i

pi ,

(2.9)

i.e. E [L] <

H (U ) + 1. log D

(2.10)

We see that the Shannon-Fano code has an expected code length which is within one symbol of the lower bound (2.7) satised by all prex-free codes. This code is thus quite good. Indeed, we know, from the rst part of the coding theorem previously seen, that no prex-free code can beat the expected code length of the Shannon-Fano code by more than one symbol. Therefore, when the entropy of the encoded source H (U ) is large, Shannon-Fano coding is nearly optimal. But when H (U ) is small, we can generally do much better than Shannon-Fano coding, as discussed in the next section. Let us now foccus on the second part of Shannons First Coding Theorem.

Theorem 2.5 (Shannon Coding Theorem, Part 2) For any discrete memoryless information source of entropy H (U ), there exists at least one D-ary prex-free code of it whose expected code length E [L] satises: E [L] < H (U ) + 1. log D (2.11)

102

CHAPTER 2. EFFICIENT CODING OF INFORMATION

Example 2.15 Consider binary (D = 2) Shannon-Fano coding for the 4-ary source U for which p1 = 0.4, p2 = 0.3, p3 = 0.2 and p4 = 0.1. Such a coding will have as codeword lengths (since log2 (2) = 1) l1 = log2 0.4 = 2, l3 = log2 0.2 = 3, and l2 = log2 0.3 = 2, l4 = log2 0.1 = 4.

We then construct the code by the algorithm given in the proof of the Kraft inequality, page 92, to obtain the code whose binary tree is
0 1 1 1 1 0 1 0 u1 u2 u3 0 u4

By the Path Length Lemma, we have: E [L] = 1 + 0.7 + 0.3 + 0.3 + 0.1 = 2.4, and a direct calculation gives: H (U ) = 0.4 log 0.4 + 0.3 log 0.3 + 0.2 log 0.2 + 0.1 log 0.1 We see indeed that (2.11) is satised. Notice, however, that this code is clearly non-optimal. Had we simply used the 4 possible codewords of length 2, we would have had a shorter code (E [L] = 2). e-pendix: Shannon Noiseless Coding Theorem Control Question 33 Consider a source U , the entropy of which is 2.15 bit. For the following values (2.75, 2.05, 3.25, 2.15), do you think a binary binary prex-free codes of U with such a expected code length could exist? Do you think a better code, i.e. another binary prex-free codes of U with a shorter expected code length, can exist? (yes, no, or maybe) 1.8 bit.

expected code length 2.75 2.05 3.25 2.15

could exist?

does a better code exist? no maybe yes, for sure

Answer

2.2. EFFICIENT CODING expected code length 2.75 2.05 3.25 2.15 does a better code exist? no maybe yes, for sure

103

could exist? yes no yes yes

X X X X

From the rst part of Shannon Noiseless Coding Theorem, we know that no prex-free binary code of U can have an expected length smaller than 2.15. Therefore 2.05 is impossible, and there is no better prex-free code than a code having an expected length of 2.15. The second part of the theorem tells us that we are sure that there exists at least one code whose expected code length is smaller than H (U ) + 1 i.e. 3.15. Therefore we are sure that there exist a better code than a code having an expected length of 3.25. For the other aspect of the question, the theorem doesnt say anything, neither that such code exist nor that they cannot exist. So maybe such codes could exist, according to our knowledge up to now.

2.2.4

Human Codes

Human Coding Algorithm We now show how to construct an optimum D-ary prex-free code for a discrete memoryless information source U with n symbols. The algorithm for constructing such an optimal code is the following: 1. Start with n nodes (which will nally be the leaves of the coding tree) corresponding to the n symbols of the source u1 , u2 , . . . , un . Assign the probability pi to node ui for all 1 i n. Mark these n nodes as active. Compute the remainder r of 1 n divided by D 1. Notice that, although 1 n is negative, r is positive by denition of a remainder. Notice also that in the binary case (D = 2), r is always null. 2. Group together, as child nodes of a newly created node, the D r least probable active nodes together with r unused nodes (leaves):
new node ... ... ... ... D-r useful leaves r unused leaves

Mark the D r active nodes as not active and the newly created node as active. Assign the newly created node a probability equal to the sum of the probabilities of the D r nodes just deactivated.

104

CHAPTER 2. EFFICIENT CODING OF INFORMATION

3. If there is only one active node, then stop (this node is then the root of the coding tree). Otherwise, set r = 0 and go back to step 2.

The prex-free code resulting from such a coding algorithm is called a Human code , since the simple algorithm here described was discovered by D. Human in the fties. Example 2.16 (Binary Human Coding) Consider the information source U such that U pi u1 0.05 u2 0.1 u3 0.15 u4 0.27 u5 0.20 u6 0.23

One Human code for U is given by:


0 0 0 0 1 1 0 1 u4 u u 1 5 6

1 u3

z1 0000

z2 0001

z3 001

z4 01

z5 10

z6 11

u1 u2

The probabilities associated to the interior nodes are the following: v1 = u1 u2 0.15 v2 = v1 u3 0.30 v3 = u5 u6 0.43 v4 = v2 u4 0.57 v5 = v4 v3 1

Finally, notice that E [L] = 2 (0.2 + 0.23 + 0.27) + 3 (0.15) + 4 (0.1 + 0.05) = 2.45 (or by the Path Length Lemma: E [L] = 1 + 0.57 + 0.43 + 0.30 + 0.15 = 2.45), and
6

H (U ) =
i=1

pi log pi = 2.42 bit.

Example 2.17 (Ternary Human Coding) For the same source U as in the previous example and using a ternary code (D = 3), we have for the remainder of 1 n := 1 6 = 5 by D 1 := 2: r = 1. Indeed, 5 = 3 2 + 1. Therefore one unused leaf has to be introduced. The ternary Human code is in this case:
a a a b c

b c c

u6 u4

u3 u5
b

z1 aab

z2 aac

z3 ab

z4 c

z5 ac

z6 b

u1 u2

The probabilities associated to the interior nodes are the following: v1 = u1 u2 0.15 v2 = v1 u3 u5 0.50 v3 = v2 u6 u4 1

Finally, notice that E [L] = 1 + 0.5 + 0.15 = 1.65 (by the Path Length Lemma), and

2.2. EFFICIENT CODING


H (U ) log 3

105

2.42 1.59

= 1.52.

Control Question 34 Consider the unfair dice with the following probability distribution: 1 0.17 2 0.12 3 0.10 4 0.27 5 0.18 6 0.16

The purpose of this question is to build one binary Human code for this dice. For this code, we will use the convention to always give the label 0 to the least probable branch and the label 1 to the most probable branch. Furthermore, new nodes introduced will be named 7, 8, etc..., in that order. 1. What are the rst two nodes to be regrouped? What is the corresponding probability? 2. What are then the next two nodes that are regrouped? What is the corresponding probability? 3. Keep on giving the names of the two nodes to be regrouped and the corresponding probability. 4. Give the Human code found for this source: ui = zi = 1 2 3 4 5 6

Answer Human coding consists in iteratively regrouping the least two probable values. Let us then rst order the source messages by increasing probability: 3 0.10 2 0.12 6 0.16 1 0.17 5 0.18 4 0.27

1) The rst two values to be regrouped are then 3 and 2, leading to a new node 7 whose probability is 0.10 + 0.12 = 0.22. The corresponding subtree (useful for building the whole tree) is 0 convention dened in the question. 2) The next iteration of the algorithm now regroups 6 and 1: 0 sponding probability is 0.33. The new set of values is therefore: 5 0.18 7 0.22 4 0.27 8 0.33

1 using the

1 and the corre-

106

CHAPTER 2. EFFICIENT CODING OF INFORMATION

3) Here are the next iterations: 9 is made with 5 and 7 , its probability is 0.4 10 is made with 4 and 8 , its probability is 0.6 11 is made with 9 and 10 , its probability is 1.0 4) The corresponding Human code (i.e. fullling the convention) is therefore:
0 0 0 1 1 1 0 1 0 1

5 3

2 4 6 1

i.e. ui = zi = 1 111 2 011 3 010 4 10 5 00 6 110

e-pendix: Human Coding e-pendix: Ecient Coding

Optimality of Human Coding We now want to demonstrate that Human coding is an optimal coding in the sense that no other prex-free code can have an expected code length strictly shorter than the one resulting from the Human coding. It is however important to remember that there are many optimal codes: permuting the code symbols or exchanging two codewords of the same length will give another code with the same expected length. The Human algorithm constructs only one such optimal code. Before proving the optimality of Human codes, we rst need to give a few properties of optimal codes in general.
n

A code is optimal if
i=1

pi li is minimal among all possible prex-free code of the same

source, denoting li the length of the codeword corresponding to the symbol ui .

Lemma 2.2 For an optimal code of some information source with n possible symbols we have: i(1 i n) j (1 j n) pi > pj = li lj .

Proof Let Z be an optimal code of the considered source. For a given i and a given j , consider the code Y in which the codewords zi and zj are swapped, i.e. yj = zi ,

2.2. EFFICIENT CODING yi = zj and yk = zk for k = i, k = j . Then E [LY ] E [LZ ] = pj li + pi lj (pi li + pj lj ) = (pi pj ) (lj li ).

107

Because Z is optimal, E [LY ] E [LZ ]. Therefore, if pi > pj , lj li has to be non-negative. Lemma 2.3 (Node-Counting Lemma) The number of leaves in a D-ary tree is 1 + k (D 1) where k is the number of interior nodes (including the root).

Proof Each interior node has D child nodes, therefore the total number of nodes in the tree which are a child of another node is k D. The only node in the tree which is not a child of another nodes is the root. The total number of nodes in the tree is then k D + 1. But there are by denition k interior nodes, therefore the number of leaves (i.e. the number of nodes which are not interior nodes) is k D + 1 k = 1 + k (D 1).

Lemma 2.4 For a given information source U , in the coding tree of an optimal prex-free D-ary code of U , there are at most D 2 unused leaves and all these unused leaves are at maximum depth. Moreover, there is an optimal D-ary code for U in which all the unused leaves are child nodes of the same parent node.

Proof If there is at least one unused leaf which is not at the maximum length, the expected code length could be decreased by transferring one of the codewords at maximum length to this unused leaf. The original code would therefore not be optimal. Moreover, if there are more than D unused leaves at the maximum length, at least D of these unused nodes can be regrouped as child nodes of the same node and replaced by this only one unused node, which is at a depth shorten by 1. Therefore if there are more than D unused leaves, the code cannot be optimal. Finally, if there are precisely D 1 unused leaves at the maximum length, they can be regrouped as child nodes of the same parent node, which also has one used leaf as its last child node. But the code can then be shorten by simply removing this last useless digit. Indeed, this last digit is not discriminative as all its sibling nodes are useless nodes. Lemma 2.5 The number of unused leaves in the tree of an optimal D-ary prex-free code for a discrete information source U with n possible symbols, is the (positive) remainder of the division of 1 n by D 1.

108

CHAPTER 2. EFFICIENT CODING OF INFORMATION

Proof Let r be the number of unused leaves. Since U has n dierent symbols, we have: number of leaves in the n. r= D-ary coding tree It follows then from the node-counting lemma that r = [1 + k(D 1)] n, or 1 n = k(D 1) + r . Moreover, from lemma 2.4, we know that, if the code is optimal, 0 r < D 1 . It follows then, from Euclids division theorem for integers, that r is the remainder of the division of 1 n by D 1 (the quotient being k).

Lemma 2.6 There exists an optimal D-ary prex-free code for a discrete information source U with n dierent symbols (n 2) such that the D r least probable codewords dier only in their last digit, with r the remainder of the division of 1 n by D 1 (therefore D r 2).

Proof First notice that not all optimal codes are claimed to satisfy this property, but by rearranging an existing optimal code, we can nd at least one optimal code that satisfy the property. Let us consider an optimal D-ary prex-free code for U (this exists since the number of D-ary prex-free codes for U is nite). By lemma 2.5, we know that there are r unused leaves, which by lemma 2.4 are all at maximal depth. Let us consider the D r siblings of these unused leaves. They are among the longest length codewords (since they are at maximal depth). Let us now build the code where we exchange these D r longest codewords with the D r less probable ones. Due to lemma 2.2 this does not change the expected length (otherwise the considered code would not have been optimal). Therefore the resulting code is also an optimal code. But we are sure for this latter code that the D r least probable codewords dier only in their last digit. Due to lemma 2.6, it is sucient to look for an optimal code in the class of codes where the D r least probable codewords dier only in their last digit. It is now time to establish the optimality of Human coding.

Theorem 2.6 Human coding is optimal: if Z is one Human code of some information source U and X is another non-ambiguous code for U , then E [LX ] E [LZ ].

Proof We prove this theorem by induction on the number of codewords (i.e. the number of source symbols).

2.2. EFFICIENT CODING

109

It is trivial to verify that for any source with less than D symbols, the Human code is optimal. Suppose now that the Human coding procedure is optimal for any source with at most n 1 symbols and consider a source U with n symbols (n > D). Let r be the remainder of the division of 1 n by D 1: 1 n = q (D 1) + r . Without loss of generality, let un(Dr)+1 , ..., un be the D r least probable source symbols. By construction a Human code Z for U is made of an Human code Y for the source V whose n (D r ) + 1 dierent symbols are v1 = u1 , ..., vn(Dr) = un(Dr) and vn(Dr)+1 , with probabilities q1 = p1 , ..., qn(Dr) = pn(Dr) and qn(Dr)+1 = pn(Dr)+1 + + pn . Indeed the number of unused leaves introduced for Y is the remainder of the division of 1 [n (D r )+1] by D 1, which is 0 since 1 [n (D r )+1] = 1 n r + D 1 = q (D 1) + (D 1) = (q + 1)(D 1). This shows that Y indeed corresponds to the code built in the second and following steps of the building of Z . Z appears then as an extension of Y in the codeword yn(Dr)+1 : z1 = y1 , ..., zn(Dr) = yn(Dr) and yn(Dr)+1 is the prex of all the codewords zn(Dr)+1 ,..., zn which all dier only by the last symbol. Then, denoting by li the length of zi and by li the length of yi :
n nD +r n

E [LZ ] :=
i=1

pi li =
i=1 nD +r

pi li +
i=nD +r +1 n

pi li

=
i=1 n

pi li +
i=nD +r +1 n

pi (li + 1) pi

=
i=1

pi li +
i=nD +r +1 n

E [LY ] +
i=nD +r +1

pi

Since
i=nD +r +1

pi is independent of the coding process (it only depends on the source

U ), due to lemma 2.6 and to the fact that Y , by induction hypothesis, is optimal for V (which has less than n symbols), we then conclude that Z is optimal for U (i.e. that E [LZ ] is minimal).

110

CHAPTER 2. EFFICIENT CODING OF INFORMATION

2.2. EFFICIENT CODING

111

Summary for Chapter 2 Prex-Free Codes: no codeword is prex of another prex-free = non-ambiguous = non-singular prex-free instantaneously decodable Kraft Inequality: prex-free D-ary code Entropy bound on expected code length: E [L] =
i iD li

pi li

H (U ) log D

Shannon-Fano code: li = E [L] =


i

log pi log D H (U ) +1 log D

pi li

Human code: 1. introduce 1 n mod (D 1) unused leaves with probability 0 2. recursively regroup least probable nodes 3. is optimal (regarding the expected code length) among non-ambiguous codes

Historical Notes and Bibliography


Shannon Noiseless Coding Theorem Kraft Inequality Human Coding 1948 1949 1952

OutLook
For further details on coding for compression, refer to [9].

112

CHAPTER 2. EFFICIENT CODING OF INFORMATION

Chapter 3

Module C3: Entropy rate of stationary processes. Markov chains

by F. Bavaud

Introduction
Consider a doubly discrete system Xt which, at each time t Z Z := {. . . , 2, 1, 0, 1, 2, . . . }, is in some state xt := {1 , . . . , m }. In the description we adopt, the state space is nite with m = || states; also, the set of times t Z Z is also discrete but bi-innite at both extremities.

Denition 3.1 A stochastic process is a bi-innite sequence


X = . . . X2 X1 X0 X1 X2 . . . = {Xt }

of random variables Xt indexed by an integer t.

Example 3.1 Let Xt be the local temperature of the t-th day at noon. The sequence of successive Xt constitutes a stochastic process, which can be modelled by variously sophisticated models: for instance, the distribution of each Xt is independent of the other Xt , or depends on Xt1 only, etc. The observation of a numerical series of values . . . x2 x1 x0 x1 x2 . . . constitutes a realization of the stochastic processs. Let xt represent the observed value of variable Xt . The stochastic process is completely determined provided all the joint (= multivariate) probabilities of the form p(xl . . . , xn ) = p(xn l ) := P (Xl = xl , Xl+1 = xl+1 , . . . , Xn = xn ) 113

114

CHAPTER 3. STATIONARY PROCESSES & MARKOV CHAINS

are dened, for all l and n with l n. Note that sequence xn l contains n l + 1 elements. To be consistent, the probability measure must add up to one: p(xn l)=1
nl+1 xn l

Also, the probability of a subsequence contained in a sequence must obtain by summing over all the values of the states of the sequence not appearing in the subsequence, that is, for instance p(x1 x2 x3 x4 x5 ) = p(x1 x4 )
(x2 x3 x5 ) 3

The probabilistic dynamics of a stochastic process might or might not depend explicitely of the time i at which it is observed. In the second case, the process is said to be stationary:

Denition 3.2 A stochastic process is stationary if the joint probability of a subsequence is invariant with respect to an overall shift in time, i.e. P (Xl+T = xl , Xl+T +1 = xl+1 . . . Xn+T = xn ) = P (Xl = xl , Xl+1 = xl+1 . . . Xn = xn ) for any T Z Z.

3.1

The entropy rate

Learning Objectives for Section 3.1 After studying this section you should be familiar with the concept of entropy rate, in relation to the concepts of typical set, redundancy and compression be able to assess whether or not a given compression scheme (diminishing the length and/or the number of categories of the message) is feasible in principle

k, Consider a stationary process. The joint entropy of the sequence X1 , X2 , . . . , Xk = X1 measuring the total uncertainty in the joint outcome of the sequence, is k H (X1 )=
k xk 1

k p(xk 1 ) log p(x1 )

k ) is increasing in k : adding more arguments increases uncertainty. One denes H (X1 the entropy rate h of the process as the limit (if existing)

h := lim

1 k H (X1 ) k k

(3.1)

3.1. THE ENTROPY RATE

115

Also, consider the conditional entropy hk , for k = 1, 2, 3 . . . , measuring the uncertainty in Xk conditionnaly to X1 . . . , Xk1 , as well as its limit when k :
k 1 k k 1 hk := H (Xk |X1 ) = H (X1 ) H (X1 ) (a)

h := lim hk
k

(3.2)

where (a) follows from H (Y |Z ) = H (Y, Z ) H (Z ). The quantity h in 3.1 measures the uncertainty per variable in an innite sequence, and the quantity h in 3.2 measures the residual entropy on the last variable when all the past is known. It turns out that those two quantities coincide for stationary processes:

Theorem 3.1 For a stationary processes, the non-negative quantity hk dened in 3.2 is non-increasing in k. Its limit h := limk hk denes the entropy rate h of the process exists and can be computed either way as h = lim
k

1 k 1 k H (X1 ) = lim H (Xk |X1 ) k k

(3.3)

k ) = k H (X ), where H (X ) is the entropy Example 3.2 For an i.i.d. process, H (X1 1 for a single variable. As a result, h = limk k kH (X ) = H (X ). The behavior of hk for this and other processes is depicted in gure 3.10.

Theorem 3.2 h log m, where m := || be the alphabet size. Equality k holds i the process is independent (that is p(xk 1) = i=1 p(xi )) and uniform (that is p(xi ) = 1/m).

Proof
k 1 ) lim H (Xk ) log m h = lim H (Xk |X1 k k (a) (b)

where equality in (a) holds under independence, and equality in (b) holds under uniformity. The entropy rate h measures the conditional uncertainty associated to each single outcome of a process, knowing its whole past. For xed m, this uncertainty is maximal when the predictibility of the outcome is minimal, that is when the process is maximally random. Theorem 3.2 says that a maximally random process must be uniform and independent, exacly as a fair dice with m sides which must be unbaised (= uniform) and without memory (= independent successive outcomes). More generally, the following exact decomposition holds:

116

CHAPTER 3. STATIONARY PROCESSES & MARKOV CHAINS

Theorem 3.3 For a stationary process, hk = log m Kk (p||pIND ) K (p||pUNI ) where Kk (p||pIND ) 0 is the Kullback-Leibler departure from independence, namely Kk (p||pIND ) :=
xk 1

p(xk 1 ) log

p(xk 1) pIND (xk 1)

k 1 pIND (xk 1 ) := p(x1 ) p(xk )

(3.4)

and K (p||pUNI ) is the Kullback-Leibler departure from uniformity, namely K (p||pUNI ) :=


xk

p(xk ) log

p(xk ) 1/m

(3.5)

Note: observe K (p||pUNI ) to be independent of k by stationarity. Proof By construction


k 1 hk = H (Xk |X1 )=
1 xk 1

k 1 p(x1 ) xk

k 1 k 1 p(xk |x1 ) log p(xk |x1 )=

=
1 xk 1

k 1 p(x1 ) xk

k 1 p(xk |x1 ) log[

p(xk 1 ) p(xk ) 1/m ]= k 1 p(x ) 1/m p(x1 ) k p(xk ) ln p(xk ) + log m 1/m

=
xk 1

p(xk 1 ) log

p(xk 1) k 1 p(x1 ) p(xk )

xk

Remark: using information-theoretical quantities, the proof can alternatively be presented as


k k 1 )= hk = H (X1 ) H (X1 k 1 k = H (X1 ) H (X1 ) H (Xk ) + H (Xk ) log m + log m Kk (p||pIND ) K (p||pUNI)

3.2

The AEP theorem

Recall that a variable Zn is said to converge in probability towards the constant c P (noted Zn c) i
n

lim P (|Zn c| ) = 1)

> 0

(3.6)

Example 3.3 Let {Yi } represent a sequence of i.i.d. numerical variables with mean n 1 and nite variance 2 . The variable Sn := n t=1 Yt then converges in probability to . That is, for n suciently large, the probability to observe a deviation Sn larger than any nite quantity > 0 becomes negligible.

3.2. THE AEP THEOREM

117

Theorem 3.4 AEP (= asymptotic equipartition property) theorem: for a stationary ergoodic processes, the variable 1 n ) Zn := log p(X1 n
P

(3.7)

converges in probability towards the entropy rate h dened in 3.3: Zn h . Equivalently (see theorem 3.5 (a)), and using from now on natural logarithms for simplicity, theorem 3.4 tells that > 0 ,
n n lim P (exp[n(h + )] p(X1 ) exp[n(h )]) = 1

Denition 3.3 The (n, )-typical set Tn () is the set of all empirical sequences n xn 1 (called typical) whose probability p(x1 ) is close to exp(n h ) in the sense
n n Tn () := {xn 1 | exp[n(h + )] p(x1 ) exp[n(h )]} (3.8)

The probability for data to belong to Tn () is


n P (Tn ()) := P (X1 Tn ()) =
n xn 1

n p(xn 1 ) I (x1 Tn ())

(3.9)

and theorem 3.4 can be rewritten as > 0 ,


n n lim P (X1 Tn ()) = 1

(3.10)

That is, for increasingly large n, it becomes increasingly certain that, drawing a n-gram n turning out to be xn , one nds that p(xn ) is very close exp(n h ): most of the X1 1 1 observed empirical sequences xn have a probability increasingly close to exp(n h ): 1 for n large, almost all events are almost equally surprising. The following theorem makes this statement precise and rigorous: Theorem 3.5 1. xn 1 Tn () i |
1 n n ) h | . ln p(X1

2. P (Tn ()) > 1 for n large enough. 3. |Tn ()| exp(n (h + )). 4. |Tn ()| > (1 ) exp(n (h )) for n large enough.

Proof 1. apply denition 3.6. 2. as a consequence of equation 3.10, P (Tn ()) is arbitrarily close to 1 for n large enough.

118 3. 1=

CHAPTER 3. STATIONARY PROCESSES & MARKOV CHAINS

p(xn 1)
n xn 1

p(xn 1)
xn 1 Tn () xn 1 Tn ()

exp[n(h + )] = = exp[n(h + )] |Tn ()|

4. 1 < P (Tn ())


xn 1 Tn ()

exp(n(h )) = exp(n(h )) |Tn ()|

3.2.1

The concept of typical set: redundancy and compressibility

n The set n of all sequences xn 1 of length n with m = || categories contains m distinct elements. Supppose the probabilistic process producing them to be independent and n . uniform; then each sequence possesses the same probability, namely P (xn 1) = m In the other direction, suppose the dependence in the process to be so strong that a single sequence of length n can possibly be produced; by construction, this sequence has probability one, and the remaining mn 1 sequences have probability zero.

In general, that is inbetween those two extreme situations, the key quantity controlling the number of typically observable sequences as well as their probability is the entropy rate h (recall h = log m in the rst case above, and h = 0 in the second case). One has that the n of all sequences xn 1 of length n splits, for n large enough, into two subsets 1 : the set Tn () of typical sequences, containing essentially |Tn ()| = exp(nh ) sequences, each of them having probability exp(nh ). For n large, all the probability is concentrated in the typical set: P (Tn ()) = 1.
c () := n \ T (), containing mn exp(nh ) the set of non-typical sequences Tn n c ()) 0. sequences, each of them having negligible probability: P (Tn =

Thus, the higher the entropy rate h , that is the more uncertain the future given the whole past, the more numerous the typical sequences, the only ones to be actually observable (for n large). Closely associated to the concept of entropy rate is the notion of redundancy : Denition 3.4 (redundancy) The redundancy R of a stationary ergodic stochastic process on m states with entropy rate h is R := 1 h log m (3.11)

It is understood that the same logarithmic unit (e.g. bits or nats) is used in log m and in the dention of h , which makes R independent of the choice of units. It
1 The previous statements can be given a fully rigorous status, thanks to the asymptotic equipartition property (AEP) theorem; here Tn () denotes the set of sequences of length n whose probability stands between exp(n(h + )) and exp(n(h )).

3.2. THE AEP THEOREM

119

Figure 3.1: The AEP theorem: for 0 < h < log m and n large, almost all sequences of n are non-typical (sand grains) and their contribution to the total volume is negligible. By contrast, the relative number of typical sequences (peebles) is negligible, but their volume accounts for the quasi-totality of the total volume. follows from 0 h log m (theorem 3.2) that 0 R 1. By construction, h = (1 R) log m and exp(nh ) = m(1R)n . Thus, among all the mn sequences of n , a total of |Tn ()| = m(1R)n of them are typical; each of those typical sequences (1 R)n . In particular: has probability m a maximally random process (i.e. independent and uniform) is characterized by h = log m, or equivalently R = 0. The number of typical sequences is equal to mn , the total number of sequences; that is, each sequence is typical and has probability mn . a minimally random process is characterized by h = 0, or equivalently R = 1. The process is eventually deterministic: given a large enough piece of past . . . , xl , xl+1 , . . . , xn , the value xn+1 of the next state Xn+1 can be predicted with certainty. The number of typical sequences is equal to m0 = 1: that is, there is asymptotically an unique typical sequence with probability 1, namely the only sequence produced in a deterministic process. inbetween those extreme cases, a generic stochastic process obeying 0 < h < log m satises 0 < R < 1: the process is partly redundant and partly impredictible. The proportion of typical sequences is m(1R)n = m R n mn and vanishes for n . Example 3.4 The following metaphor, due to Hillman (1996), might help making the AEP concept intuitive (see gure 3.1). A beach of total volume mn represents the totality of the sequences. Most of the volume is made up of exp(nh ) peebles, each having a volume of exp(nh ). The beach also contains about mn sand grains, but so small that their total contribution to the volume of the beach is of order << 1. Thus, among the mn possible sequences, only m(1R) n can indeed occur, all with the same probability (for n large). That is, the total average amount of information n h carried by a sequence of length n with entropy rate h on an alphabet m can equivalently be obtained

120

CHAPTER 3. STATIONARY PROCESSES & MARKOV CHAINS

A) by the set of sequences of eective length ne = (1 R)n equiprobably distributed over m categories; the entropy rate of this new process reaches now its maximum log m. B) by the set of sequences of length n equiprobably distributed over m(1R) categories, with a corresponding entropy rate of log m(1R) = (1 R) log m. Modifying the original process does not modify the total information, which remains equal to nh : ne log m = n (1 R) log m = n h n log m
(1R)

(A) (B)

= n (1 R) log m = n h

However, the redundancy of the modied processes is now zero in both cases: application of 3.11 yields RA = 1 log m =0 log m
(A)

RB = 1

(1 R) log m =0 (1 R) log m

(B)

The precise, detailled ways in which the initial sequences can be compressed (from the point of view of their lengths (A) or their alphabet size (B)) constitutes a part of the coding theory (see module C2). Example 3.5 The entropy of a representative text in simplied english with m = 27 categories (no punctuation nor distinction of cases, that is 26 letters plus the blank) has been estimated to about h = 1.3 bits per letter, corresponding to an redundancy of about R = 1 1.3/ log 2 27 = 0.73 (Shannon 1950, cited in Cover and Thomas 1991). That is, a 300 pages novel could typically be reduced to a 300(1 0.73) = 81 pages novel on the same alphabet, or to a novel of same length with only 27(10.73) = 2.43 (i.e. at least 3) symbols. The aspect of a sample of the latter could be something like MUUMMXUUMMMMUMXXUMXMMXMMUXMUMXMMMMXXXUUXMXMUUXMUXMUXMU UMUXUMUUUXMMUUUMXMMMXXXXMUMXXMMXXMUMXUMUUXMUUXMMMXMUXX UXXXUXXUUMMUXUXMUUMXUUXMUXXUXUMUUUXXXMMXXUMXUUUMMUXMXM where the redundancy of the new text is now zero, meaning that the slightest modication of the latter will irreversibly alter its content. By contrast, the high redundancy (R = 0.73) of standard english permits to correct and recover an altered text, containing for instance misspellings. Note: the above example presupposes that written english is produced by a stationary stochastic process, which is naturally questionnable. Control Question 35 A stationary stochastic process produces a sequence of n consecutive symbols (n large) on an alphabet with m signs. Suppose the redundancy of the process to be R = 0.25. Then 1. it is possible to compress (without diminishing the total information) the sequence length from n to 0.75 n? Possible answers: yes - no - it depends on n.

3.2. THE AEP THEOREM

121

2. it is possible to compress (without diminishing the total information) the alphabet length from m to 0.75 m? Possible answers: yes - no - it depends on m.

Answer

1. yes: correct answer: compression scheme A) above permits to compress the sequence length (on the same alphabet) from n to ne = (1 0.75) n = 0.75 n (maximum compression, asymtotically attainable for n ).

2. it depends on m: correct answer: the alphabet size can be taken (with out diminishing the sequence length) to m1R = m0.75 . But one readily checks m0.75 > 0.75 m (unfeasible compression) if m = 2, and m0.75 < 0.75 m (feasible compression) if m = 3, 4, . . ..

Example 3.6 A highly simplied meteorological description assigns each day into one of the three categories nice (N), rainy (R) or snowy (S). For instance, NNRNSRN constitutes a possible a meteorological week. The are a maximum of 3n dierent sequences of n days; each of those sequence would be equiprobable (with probability 3n ) if the whether did follow a perfectly random process (with a maximum entropy rate of h = log 3, as given by theorem 3.2). However, real weather is certainly not a completely random process, that is its redundancy R is strictly positive: if R = 0.5 for instance, then, among all possible 3n dierent sequences of n days, 3(10.5) n = 1.73n are typical, that is likely to occur. if R = 0.75, only 3(10.75) n = 1.32n sequences of n days are typical. if R = 1, only 3(11) n = 30 = 1 sequence is typical, that is the only sequence generated by the deterministic (R = 1) process. In this example, the number of eective full possible types of weather for next day (as measured by the uncertainty conditional to the day before) passes from 3 to 1.73, 1.32 and even 1 as R increases from 0 to 1.

122

CHAPTER 3. STATIONARY PROCESSES & MARKOV CHAINS

3.3

First-order Markov chains

Learning Objectives for Section 3.3 After studying this section you should be familiar with the concept of (rst-order) Markov chain, its transition matrix and their iterates, and the concept of stationary distribution. be able to classify the states as recurrent, transient, absorbing and periodic. understand the theoretical meaning as well as the computation of the associated entropy. understand the nature of the irreversibility produced by temporal evolution.

Denition 3.5 A rst-order Markov chain, or simply Markov chain, is a discrete stochastic process whose memory is limited to the last state, that is: p(xt+1 |xt ) = p(xt+1 |xt ) t Z Z

Let := {1 , . . . , m } represent the m states of system. The Markov chain is entirely determined by the m m transition matrix pjk := P (Xt+1 = k |Xt = j ) = p(k |j ) obeying the consistency conditions
m

pjk 0
k =1

pjk = 1

Example 3.7 Consider a two-states process with state space = {a, b}. When the system is in state a, it remains in the same state with probability 0.7 (and moves to state b with probability 0.3). When the system is in state b, it remains in the same state with probability 0.6 (and moves to state a with probability 0.4). The conditional probabilities are p(a|a) = 0.7, p(b|a) = 0.3, p(b|b) = 0.6 and p(a|b) = 0.4. Numbering a as 1 and b as 2, the probabilities equivalently express as p11 = 0.7, p12 = 0.3, p21 = 0.4 and p22 = 0.6, or, in matrix form, P = Observe each row to sum to 1. 0.7 0.3 0.4 0.6 (3.12)

3.3.1

Transition matrix in n steps

Consider a process governed by a Markov chain which is in state j at time t. By denition, its probability to be in state k at time t + 1 is pjk . But what about its

3.3. FIRST-ORDER MARKOV CHAINS

123

Figure 3.2: The probability to reach state k at time t + 2 (say k = 3) starting from state j at time t obtains by summing over all possible intermediate states l at time t + 1. probability to be in state k at time t + 2 ? Figure 3.2 shows that the searched for probability obtains by summing over all possible intermediate states l at time t + 1 (here m = 3). That is
(2) pjk m

:= P (Xt+2 = k |Xt = j ) =
l=1

pjl plk

(3.13)

Denoting P := (pjk ) and P (2) := (pjk ) the m m one-step and two-steps transition matrices respectively, equation 3.13 shows the latter to obtain from the former by straighforward matrix multiplication, that is P (2) = P 2 := P P . The mechanism generalizes for higher-order lags and we have the result

(2)

Theorem 3.6 The n-step transition matrix P (n) whose components pjk := P (Xt+n = k |Xt = j ) give the probability to reach state k at time t + n given that the system was in state j at time t obtains from the (one-step) transition matrix P = (pjk ) as P (n) = P P P P = P n
n times

(n)

Example 3.8 (example 3.7, continued:) The two- and three-steps transition matrices giving the probabilitites to reach state k from state j in n = 2, respectively n = 3 steps, are P (2) and P (3) with P (2) := P P = 0.61 0.39 0.52 0.48 P (3) := P P P = 0.583 0.417 0.556 0.444

Note the property that each row of P sums to 1 to be automatically inherited by P (2) and P (3) .

124

CHAPTER 3. STATIONARY PROCESSES & MARKOV CHAINS

Figure 3.3: State a reaches itself and states b, c, d and e. State b reaches itself and states c, d and e. State c reaches itself and states d and e. State d reaches itself and states c and e, etc. The communication equivalence classes are {a}, {b}, {c, d, e}, {f }, {g} and {h}.

3.3.2

Flesh and skeleton. Classication of states

The concept of communication between states denes an equivalence relation among the set of m states involved in a nite Markov chain: Denition 3.6 state j reaches state k, written j k, if there is a path jl1 l2 ln1 k of length n 0 such that pjl1 pl1 l2 pln1 k > 0, i.e. if there is a n 0 such that pjk > 0. As pjj = 1 > 0, each state reaches itself by construction: j j for all j . states j and k commmunicate, noted j k, i j k and k j . Thus the relation is reexive (j j ) and transitive (j l and l k imply j k). The communicability relation is in addition symmetric (j k imply k j ). That is, the relation communicates with is an equivalence relation, partitioning states into groups of states, each state inside a group communicating with all the other states inside the same group. Note that the skeleton aspects (i.e whether a transition is possible or not) dominate the esh aspects (i.e. the question of the exact probability a possible transition) in the above classication. That is, j and k communicate i there are integers n and n (n) (n ) (n) (n ) with pjk > 0 and pkj > 0; the question of the exact values of pjk > 0 and pkj > 0 is of secondary importance relatively to the fact that those two quantities are stricly positive. Example 3.9 Consider a Markov chain whith skeleton given by gure 3.3. The arrows denote reachability in one step. State a reaches itself as well as states b, c, d and e. However, a can be reached from itself only. Thus the equivalence class of a (relatively to the relation ) contains a itself. Reasonning further, one nds the equivalence classes to be {a}, {b}, {c, d, e}, {f }, {g} and {h}.
(n) (0)

Denition 3.7 State j is recurrent (or persistent, or ergodic) if the probability that the process starting from j will eventually return to j is unity. State j is transient if it is not recurrent, that is if the probability of no return to j starting from j is non zero.

3.3. FIRST-ORDER MARKOV CHAINS

125

Figure 3.4: Example 3.10: the communication equivalence classes are {a}, {e} (recurrent classes) and {b, c, d} (transient class). States a et e are absorbing. One can show that states belonging to the same equivalence class are either all recurrent or all transient, which justies the following denition: Denition 3.8 Reccurent classes are equivalence classes whose elements are all recurrent. Transient classes are equivalence classes whose elements are transient. In example 3.9, the recurrent classes are {c, d, e} and {h}. All other classes are transient. Example 3.10 Consider the following Markov transition matrix 1 0 0 0 0 0.5 0 0.5 0 0 P = 0 0.5 0 0.5 0 0 0 0.5 0 0.5 0 0 0 0 1

(3.14)

whose skeleton is represented in gure 3.4. There are two recurrent classes, namely {a} and {e}, and one transient class, namely {b, c, d}. Recurrent states which are single members of their class, such as a and e, cannot be left once entered. Such states are said to be absorbing. A necessary and sucient condition for a state j to be absorbing is pjj = 1, as demonstrated in rows 1 and 5 of 3.14. It might happen that pjj = 0 for all n not divisible by d, and d is the greatest such integer. This means that if the chain is in state j at some time t, it can only return there at times of the form t + md where m is an integer. Then state j is said to have (n) period d. A state with period d = 1 is said to be aperiodic. If pjj = 0 for all n 1, state j has an innite period d = . One can show that states belonging to the same equivalence class have all the same period: for instance, in gure 3.4, states a and e are aperiodic, while b, c and d have period d = 2. Example 3.11 A tree is a graph containing no circuit. Figure 3.5 left depicts a Markov chain on a symmetric tree: there is a single recurrent equivalence class {a, b, c, d, e, f }, all states of which have period d = 2. Adding a single circuit such as in gure 3.5 middle or right still conserves the single equivalence class {a, b, c, d, e, f }, but all states are now aperiodic (d = 1).
(n)

3.3.3

Stationary distribution

From now on one considers regular Markov chains only, that is consisting of a single aperiodic recurrent class: equivalently, each state can be attained from each other after

126

CHAPTER 3. STATIONARY PROCESSES & MARKOV CHAINS

Figure 3.5: Example 3.11. Left: the underlying skeleton is a symmetric tree, and all states have period d = 2. The addition of a single circuit (middle or right) makes all states aperiodic (d = 1).

sucient lapse of time, i.e. there exist an integer N such that pjk 0 for all states j , k and all times n N .

(n)

Theorem 3.7 Let P = (pjk ) be the m m transition matrix of a regular Markov chain on m states. Then for n , the 1 2 1 2 P = 1 2 1 2
m

powers P n approach a transition matrix P of the form m1 m m m1 m with > 0 and j = 1 (3.15) j j =1 m1 m m1 m

the distribution = (1 , 2 , . . . , m ) is the only solution of the equation j pjk = k


j =1

i.e.
m j =1 j

P = = 1.

(3.16)

obeying the normalization condition

The distribution is referred to as the stationary or equilibrium distribution associated to the chain P . Proof *** classical proof remaining to be done *** Example 3.12 Consider the following transition 0.823 0.087 0.045 0.058 0.908 0.032 P = 0.030 0.032 0.937 0.044 0.002 0.001 matrix 0.044 0.001 0.001 0.952

(3.17)

3.3. FIRST-ORDER MARKOV CHAINS Some of its successive powers are 0.433 0 .175 P (5) = 0.107 0.143 0.204 0.205 = 0.191 0.202 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3

127

0.263 0.662 0.137 0.037 0.308 0.349 0.305 0.211

0.161 0.137 0.741 0.021 0.286 0.305 0.390 0.171

0.143 0.025 0.014 0.798 0.202 0.140 0.114 0.416 2 2 2 2 3 3 3 3 4 4 4 4

P (25)

P ()

and the corresponding equilibrium distribution is = (0.2, 0.3, 0.3, 0.2). One can verify m j =1 j pjk = k to hold for each k : indeed 0.2 0.823 + 0.3 0.087 + 0.3 0.045 + 0.2 0.0.044 = 0.2 2 0.058 + 0.3 0.908 + 0.3 0.0.032 + 0.2 0.001 = 0.3 2 0.030 + 0.3 0.0.032 + 0.3 0.937 + 0.2 0.001 = 0.3 2 0.044 + 0.0.3 0.002 + 0.3 0.001 + 0.2 0.952 = 0.2 (k = 1) (k = 2) (k = 3) (k = 4)

0.2 0.2 = 0.2 0.2

0.2 1 0.2 1 = 0.2 1 1 0.2

3.3.4

The entropy rate of a Markov chain

Theorem 3.8 The entropy rate of a rst-order regular Markov chain with transition matrix P = (pjk ) is h =
j

j
k

pjk log pjk

(3.18)

where is the stationary distribution associated with the transition matrix P .

Proof Theorem 3.1 yields


n1 h = lim H (Xn |X1 ) = lim H (Xn |Xn1 ) (b) m n j =1 (a) n m n

= lim

pj

(n1)

[
k =1

pjk log pjk ] =


j

(c)

j
k

pjk log pjk

where (a) follows from denition 3.5, (b) follows from the denition H (Xn |Xn1 ) (n1) involving pj , the probability for the system to be in state j at time n 1, and (c) follows from theorem 3.7, implying limn pj
(n1)

= j .

128

CHAPTER 3. STATIONARY PROCESSES & MARKOV CHAINS

Figure 3.6: A rst-order Markov chain on = {a, b}.

Example 3.13 Consider the Markov chain (of order 1) on 2 states = {a, b}, with p(a|a) = 2/3, p(b|a) = 1/3, p(b|b) = 2/3 and p(a|b) = 1/3 (gure 3.6). By symmetry, the corresponding stationary distribution is (a) = (b) = 0.5. In view of 3.18, its entropy rate is h = h2 = (a) [p(a|a) ln p(a|a) + p(b|a) ln p(b|a)] (b) [p(a|b) ln p(a|b) + p(b|b) ln p(b|b)] = 1 2 2 1 1 1 1 1 2 2 [ ln + ln ] [ ln + ln ] = 0.325 nats = 0.469 bits 2 3 3 3 3 2 3 3 3 3 Example 3.14 Consider the following mobility table N = (njk ), cross-classifying fathers occupation (rows j = 1, . . . , 5) by sons rst full time civilian occupation (columns k = 1, . . . , 5) for 19912 U.S. men in 1973 in ve categories: a = upper nonmanual; b = lower nonmanual; c = upper manual; d = lower manual; e = farm (source: Hout 1986, cited in Mirkin 1996). N = a b c a 1 414 521 302 724 524 254 b c 798 648 856 d 756 914 771 e 409 357 441 total 4 101 2 964 2 624 d e total 643 40 2 920 703 48 2 253 1 676 108 4 086 3 325 237 6 003 1 611 1 832 4 650 7 958 2 265 19 912 (3.19)

Dividing each cell njk by its row total nj results in a transition matrix P = (pjk ) with pjk := njk /nj , giving the conditional probabilities for an individual (whose father has occupation j ) to have rst full time civilian occupation k: P = a b c d e a 0.48 0.32 0.20 0.12 0.09 b 0.18 0.23 0.16 0.15 0.08 c 0.10 0.11 0.21 0.13 0.09 d 0.22 0.31 0.41 0.55 0.35 e 0.01 0.02 0.03 0.04 0.39 (3.20)

The components of the stationary solution associated to the transition matrix 3.20 are a = 0.26, b = 0.17, c = 0.13, d = 0.40 and e = 0.04. That is, under the ction of a constant transition matrix 3.19, one will observe in the long run 26% of people in category a, 17% in category b, etc. The conditional entropies H (Xn |Xn1 = j ) = k pjk log pjk , measuring the uncertainty on the sons occupation Xn (knowing

3.3. FIRST-ORDER MARKOV CHAINS the fathers occupation Xn1 = j ) are (in bits): H (Xn |Xn1 = a) = 1.85 H (Xn |Xn1 = c) = 2.02 H (Xn |Xn1 = b) = 2.01 H (Xn |Xn1 = d) = 1.83 H (Xn |Xn1 = e) = 1.95

129

Thus sons occupation is most uncertain when the father is upper manual (2.02 for Xn1 = c) and least uncertain when the father is lower manual (1.83 for Xn1 = d). On average, the uncertainty is
5

j H (Xn |Xn1 = j ) = 0.26 1.85 + 0.17 2.01 + 0.13 2.02 +


j =1

+0.40 1.83 + 0.04 1.95 = 1.90 = h which is nothing but the entropy rate of the process h in virtue of 3.18: as expected and by construction, the entropy rate of a Markov process measures the mean conditional uncertainty on the next state knowing the previous state. By contrast, the corresponding unconditional uncertainty is j j log j = 2.05 bits, which is larger than h = 1.90 but smaller than the uniform uncertainty log2 5 = 2.32 bits. Control Question 36 True or false? 1. the entropy rate h of a rst-order Markov chain can never exceed its unconditional entropy H (X ) Possible answers: true - false 2. the entropy rate h of a rst-order Markov chain can never equal its unconditional entropy H (X ) Possible answers: true - false 3. the entropy rate h of a rst-order Markov chain is not dened if the chain is not regular Possible answers: true - false 4. the entropy rate h associated to a chain with m categories can never exceed log m. Possible answers: true - false Answer 1. true, since, for rst-order Markov chains, h = H (X2 |X1 ) H (X2 ). 2. false, since rst-order Markov chains include the trivial case of independent sequences (= Markov chain of order 0), whose entropy rates h equal their unconditional entropies H (X ). 3. true, since in general the stationary distribution j is not dened if the chain is not regular. 4. true, since h H (X ) log m

130

CHAPTER 3. STATIONARY PROCESSES & MARKOV CHAINS

3.3.5

Irreversibility

Consider a (m m) transition matrix P = (pjk ) dening a regular Markov process, (0) (0) with associated stationary distribution . Let fj 0 (obeying m = 1) be the j =1 fj initial distribution (t = 0) of the system across its dierent possible states j = 1, . . . , m. (1) (0) m At time t = 1, the distribution becomes fk = pjk . More generally, the j =1 fj ( n ) distribution f at time t = n is given by
(n) fk m

=
j =1

fj

(0)

pjk

(n)

(3.21)
(n) (0)

As limn pjk = k from theorem 3.7, formula 3.21 shows limn fk = m j =1 fj k = k . That is, irrespectively of the prole of the initial distribution f (0) , the long run distribution f (n) converges in the long run n towards the stationary distribution . One speaks of equilibrium if f (0) = . In summary, a general non-equilibrium prole f (0) = evolves towards the equilibrium prole f () = where it remains then unchanged, since k = m j =1 j pjk by virtue of 3.16. Thus the Markov dynamics is irreversible: any initial distribution f (0) always evolves towards the (unique) stationary distribution , and never the other way round; also the dissimilarity between any two distributions fades out during evolution, as the following theorem shows: Theorem 3.9 Let f (n) and g(n) be any two distributions whose evolution 3.21 is governed by the same regular Markov process P = (pjk ). Then evolution makes the two distributions increasingly similar (and increasingly similar with the stationary distribution ) in the sense K (f (n+1) ||g(n+1) ) K (f (n) ||g(n) ) where f (n+1) and g(n+1) are the corresponding next time distributions, namely
(n+1) fk m

(n)

=
j =1

(n) fj

pjk

(n+1) gk

=
j =1

gj

(n)

pjk

(3.22)

Particular cases: K (f (n+1) || ) K (f (n) || ) (obtained with g(n) := , which implies g(n+1) = ): the relative entropy of f (n) with respect to decreases with n: again, limn f (n) = . K ( ||g(n+1) ) K ( ||g(n) ) (obtained with f (n) := , which implies f (n+1) = ): the relative entropy of with respect to g(n) decreases with n. K (f (n+1) ||f (n) ) K (f (n) ||f (n1) ) (obtained with g(n) := f (n1) , which implies g(n+1) = f (n) ): the dissimilarity between the actual distribution and the previous one (as measured by K (f (n) ||f (n1) )) decreases with n. Example 3.15 (example 3.13, continued:) suppose the initial distribution f (0)

3.3. FIRST-ORDER MARKOV CHAINS

131

to be f (0) (a) = 0.9 and f (0) (b) = 0.1. Using 3.21 and theorem 3.6, the successive distributions f (n) at time t = n and their divergence K (f (n) || ) (in nats) with respect to the stationary distribution = f () (with (a) = (b) = 0.5) are, in order, n 0 1 2 3 ... Control Question 37 Determine the unique correct answer: f (n) (a) 0.9 0.633 0.544 0.515 ... 0.5 f (n) (b) 0.1 0.367 0.456 0.485 ... 0.5 K (f (n) || ) 0.420 0.036 0.004 0.0005 ... 0

1. once leaving a state, the system will return to it with probability one if the state is a) transient; b) absorbing; c) recurrent; d) aperiodic. 2. the identity matrix I is the transition matrix of a Markov chain, all states of which are a) transient; b) absorbing; c) irreversible; d) aperiodic. 3. let P be a two-by-two transition matrix. The minimal number of non-zero components of P insuring the regularity of the associated Markov chain is a) 1; b) 2; c) 3; d) 4. 4. suppose P 5 = I , where P is a nite Markov transition matrix and I the identity matrix. Then P is a) undetermined; b) regular; c) I ; d) aperiodic.

Answer

1. c) recurrent 2. b) absorbing 3. c) 3 4. c) I

132

CHAPTER 3. STATIONARY PROCESSES & MARKOV CHAINS

3.4

Markov chains of general order

Learning Objectives for Section 3.4 After studying this section and the following one, you should be able to generalize the concepts of the previous section to Markov chains of arbitrary order r understand the theoretical foundations of the test of the order of the chain and be able to apply it be familiar with the concept of over-parameterization and of its consequences in text simulation

Denition 3.9 A Markov chain of order r > 0 is a discrete stochastic process whose memory is limited to the r past states, that is:
n1 n1 p(xn |x1 ) = p(xn |xn r )

n r + 1

A Markov chain of order r = 0, also called Bernoulli process, is a stochastic process without memory, that is
n1 p(xn |x1 ) = p(xn )

n 1

A zero-order Markov chain is then plainly an independent process. All Markov processes considered here are stationary, that is their transition probabilin1 ties p(xn |xn r ) are time independent. The latter are generically denoted as p( |) where is the present state and r species the r immediately anterior states. By construction, p( |) := p( )/p() 0 with p( |) = 1.

3.4.1

Stationary distribution and entropy rate

An r -th order Markov chain on , dened by the transitions p( |) where and r , can formally be considered as a rst-order Markov chain on r with a (mr mr ) r r r transition probability matrix Q = (q ) (where = r 1 and = 1 ) given by q = q ( |) := 1 2 2 3 . . . r1 r p(r |r 1) (3.23)

where ab := 1 if a = b and ab := 0 if a = b. Equation 3.23 tells that the transition is possible i the r 1 rst elementary states of correspond to the r 1 last elementary states of (see gure 3.7) Supposing in addition the chain Q = (q ) to be regular (i.e. each state of r communicates with each state of r and the chain is aperiodic), there is a unique stationary r r distribution () = (r 1 ) on satisfying 3.16 on , that is, using 3.23: (1 , 2 , . . . , r ) p(r+1 |1 , 2 , . . . , r ) = (2 , . . . , r , r+1 )
1

(3.24)

3.4. MARKOV CHAINS OF GENERAL ORDER

133

Figure 3.7: A Markov chain of order r (here k = 4) on is specied by the set of conditional probabilities of the form p(4 |1 2 3 4 ). The same chain can be considered as a rst-order Markov chain q on r where = (1 2 3 4 ) and = (1 2 3 4 ); as expressed by 3.23, the transition matrix q is zero unless 1 = 2 , 2 = 3 and 3 = 4 . Similarly, 3.18 shows the corresponding entropy rate h to be h = or, using 3.23 h=
r () q

log q ,

()

p( |) log p( |)

(3.25)

k 1 Recall in general the conditional entropy hk := H (Xk |X1 ) to be non-increasing in k. On the other hand, 3.25 shows the entropy rate to coincide with hr+1 . In conclusion:

Theorem 3.10 For a r -th order Markov chain, h = hr+1 . That is h1 h2 hr hr+1 = hr+2 = hr+3 = . . . = h (3.26)

The behavior 3.26 of hk is illustrated in gure 3.10 b) for r = 1 and in gure 3.10 c) for r = 3. Particular cases: r = 1: the entropy rate becomes h =

()

p( |) log p( |) =
j

j
k

pjk log pjk

which is the same expression as 3.18. r = 0: the entropy rate becomes h =


p( ) log p( ) =
k

k log k

which is the entropy of the corresponding stationary distribution. Example 3.16 Consider (gure 3.8) the Markov chain of order r = 2 on m = 2 states = {a, b}, with p(a|aa) = 0.3 p(a|ba) = 0.7 p(b|aa) = 0.7 p(b|ba) = 0.3 p(a|ab) = 0.6 p(a|bb) = 0.4 p(b|ab) = 0.4 p(b|bb) = 0.6

134

CHAPTER 3. STATIONARY PROCESSES & MARKOV CHAINS

Figure 3.8: A second-order Markov chain p( |) on = {a, b} (example 3.16) written as a rst order chain q on bigrams = 1 2 2 and = 1 2 2 . Transitions are forbiden if 2 = 1 : for instance, transition form = ab to = aa is impossible. By symmetry, the stationary distribution (1 , 2 ) obeying 3.24 turns out to be the uniform distribution on 2 , that is (aa) = (ab) = (ba) = p(bb) = 1 4 . For instance, one veries the following equality to hold (aa) p(a|aa) + (ba) p(a|ba) = 1 1 1 0.3 + 0.7 = = (aa) 4 4 4

as well as other equalities involved in 3.24. The entropy rate 3.25 is h = (aa) [p(a|aa) log p(a|aa) + p(b|aa) log p(b|aa)] (ab) [p(a|ab) log p(a|ab) + p(b|ab) log p(b|ab)] (ba) [p(a|ba) log p(a|ba) + p(b|ba) log p(b|ba)] (bb) [p(a|bb) log p(a|bb) + p(b|bb) log p(b|b)] = 1 [0.3 log 0.3 + 0.7 log 0.7 + 0.6 log 0.6 + 0.4 log 0.4] = 0.189 nats 2

3.5

Reconstruction of Markov models from data

Up to now, we have assumed the diverse models of interest (stationary, Markov of order r , Markov of of order 1, etc.) to be given. Very often, however, we only have at disposal an empirical realization of a process, i.e. only the data D are known, and models M must be inferred form data D. This kind of situation is paradigmatic of inferential statistics (see module S1). For clarity sake, empirical (respectively model) quantities will be indexed from now one by the letter D (respectively M ).

3.5.1

Empirical and model distributions

+r 1 A sequence of k consecutive states xl k is called a k-gram. Given the k-gram l k l and the l-gram , the k + l-gram obtained by concatenating to the right of is simply denoted by . The length of a subsequence is simply denoted as ||: for instance, k = | | and l = | |.

Data D consist of xn 1 , the complete observed sequence of size n. Let n( ) be the empirical count of k-gram k , that is the number of times subsequence occurs

3.5. RECONSTRUCTION OF MARKOV MODELS FROM DATA

135

Figure 3.9: The identity l n( ) = n( ) holds i there is no occurence of a symbol of in the l last symbols of xn 1. in xn 1 (for instance, the bigram = 11 is contained n( ) = 3 times in the sequence = 0111011). The number of all k-grams contained in a sequence of length n is (for x7 1 n k) n( ) = n k + 1
k

Also, one has n( ) n( )


l

where identity holds i data xn 1 do not contain occurences of elements of closer than l places from the right boundary (see gure 3.9). The empirical distribution and empirical conditional distribution are dened as f D ( ) := n( ) nk+1 f D ( | ) := n( ) l n( ) k l (3.27)

where the denominators insure proper normalization, namely k f D ( ) = 1 and D l f ( | ) = 1. Asymptotically, that is for n large, one has approximatively nk+1 = n( ), and thus = n and l n( ) n( ) f D ( ) = n n( ) f D ( | ) = n( ) k l (3.28)

To emphasize the contrast with empirical distributions, the corresponding theoretical distributions will from now on be denoted as f M ( ) and f M ( | ) with f M ( ) := p( ) f M ( | ) := p( ) p( )

where p(. . .) is the consistent probability measure dened in section 3. Example 3.17 The l-th Thue-Morse sequence D l is a binary sequence recursively constructed as follows: D0 = 1 l D l+1 = D l D

binary inversion, replacing each symbol of D (in where denotes concatenation and D order) by its complement, namely 1 = 0 and 0 = 1. The rst Thue-Morse sequences are

136 l 0 1 2 3 4 5 6

CHAPTER 3. STATIONARY PROCESSES & MARKOV CHAINS Thue-Morse sequence D l 1 10 1001 10010110 1001011001101001 10010110011010010110100110010110 1001011001101001011010011001011001101001100101101001011001101001

In general, the l-th sequence D l contains 2l binary symbols in equal proportion (for l 1). D l can also be obtained by applying l times the substitution rule 1 10 and 0 01 to the initial sequence D 0 = 1. Also, the odd entries of D l+1 reproduce D l . Although purely deterministic, the sequences D l can be used to dene empirical distributions f D ( ) and conditional empirical distributions f D ( | ). For instance, for D = D 6 , one nds symbol relative empirical frequency bigram relative empirical frequency trigram rel. emp. frequency 111 0 110 =1 6 11 = 10 = 1
1 2

0
1 2

total 1 01 = 00 = total 1 000 0 total 1 total 1 total 1

10 63

1 6

21 63

1 3

21 63

1 3

11 63

1 6

10 62

10 62

101 =1 6 1|11 0

11 62

100 1 =6 0|11 1 0|01


1 2

10 62

011 =1 6

10 62

010 =1 6

11 62

001 =1 6

conditional symbol relative empirical frequency conditional symbol relative empirical frequency

total 1 total 1

10 21

1|10 =1 2 1|00 1

11 21

0|10 =1 2

1|01
1 2

0|00 0

The behavior of hk for k 1 for the sequence D 14 (containing 16384 binary symbols) is depicted in gure 3.10. The values of the empirical distributions f D ( ) and conditional distributions f D ( | ) (such as those as found in example 3.17) can serve to dene model or theoretical dis can in turn be simulated tributions f M ( ) and f M ( | ). New stochastic sequences D M M from the Markov chains with parameters f ( ) and f ( | ). By construction, the will look similar to those of the trainstatistical properties of the resulting sequence D ing sequence D:

3.5. RECONSTRUCTION OF MARKOV MODELS FROM DATA

137

3.5.2

The formula of types for Markov chains

Consider a r -th order Markov chain dened by the conditional distribution f M ( |) where and r . The probability to observe data xn 1 is
n r P (xn 1 ) = p(x1 ) i=r +1 i1 r p(xi |xi r ) = p(x1 ) r

f M ( |)n()

For r xed and for n , the contribution of the term p(xr 1 ) becomes negligible relatively to the contribution of the product. Thus, asymptotically, that is for n large, the boundary eects caused by the nitude of n, and occuring at the very beginning or the very end of the sequence xn 1 , become negligible and one can write approximatively P (xn 1) =
r

f M ( |)n()

(3.29)

In the same approximation (compare with 3.27) one has n() f D () = n n( ) f D ( |) = n() r (3.30)

Intuitively, one expects that, for n large, the empirical distribution f D ( |) tends to f M ( |) with uctuations around this value. The next theorem (where entropies are expressed in nats for convenience) shows this to be indeed the case; morevover, the uctuation of the empirical values around the theoretical ones are controlled by the conditional Kullback-Leibler divergence Kr (f D ||f M ) of order r : Theorem 3.11 (formula of types for Markov chains) For Markov chains of order r , the probability to observe the conditional empirical distribution f D ( |) (for all and r ) is, asymptotically P (f D | f M ) where Kr (f D ||f M ) :=
r

exp(n Kr (f D ||f M ))

(3.31)

f D () K ([f D ||f M ]|) f D ( |) ln


(3.32) (3.33)

K ([f D ||f M ]|) :=

f D ( |) f M ( |)

In particular, for given counts n( ) (and thus given f D () and f D ( |)), the probM ( |) = f D ( |): as expected, the maximum-likelihood ability 3.31 is maximum i f estimate of the model M is simply given by the corresponding empirical quantity (see module S1 for a more detailed exposition for the independent case). Remark: K0 (f D ||f M ) is the ordinary (unconditional) divergence: K0 (f D ||f M ) = K (f D ||f M ) =

f D ( ) log

f D ( ) f M ( )

138

CHAPTER 3. STATIONARY PROCESSES & MARKOV CHAINS


1 n

Proof Recall that an bn means limn approximation is n! nn exp(n).

log(an /bn ) = 0. For instance, Stirlings

The value of f D ( |) is the same for all n()!/( n( )!) data xn 1 possessing the same counts n( ) but diering by the occurence order. Thus P (f D | f M ) =
r

n()!

1 f M ( |)n() n( )!

Taking the logarithm, using 3.30 and Stirlings approximation yields ln P (f D | f M ) =


r

ln = n
r

n()n() M f ( |)n() = n( )n() f D ()


=
r

n( ) ln

f M ( |) f D ( |)

f D ( |) ln

f D ( |) f M ( |)

Example 3.18 Consider a rst order Markov chain with two states a and b. When in a given state (a or b), the system remains in the same state with probability 0.9, and changes with probability 0.1. That is, f M (a|a) = 0.9, f M (b|a) = 0.1, f M (b|b) = 0.9 and f M (a|b) = 0.1. Suppose data to be of the form D aaaa aaaa
n times

Then f D (a) = 1, f D (b) = 0, f D (a|a) = 1 and f D (b|a) = 0. Then K ([f D ||f M ]|a) = f D (a|a) ln f D (a|a) f D (b|a) D + f ( b | a ) ln = f M (a|a) f M (b|a) 0 1 + 0 ln = 0.105 nats = 1 ln 0.9 0.1

On the other hand, neither f D (a|b) nor f D (b|b) are dened, since the system has never been observed in state b: equations 3.27 or 3.28 return the undetermined value 0/0 (assumed nite). Thus K ([f D ||f M ]|b) is not dened, but K1 (f D ||f M ) is: K1 (f D ||f M ) = f D (a) K ([f D ||f M ]|a) + f D (b) K ([f D ||f M ]|b) = = 1 2.30 + 0 K ([f D ||f M ]|b) = 0.105 nats Thus P (f D | f M ) exp(n 0.105) = (0.9)n

For instance, the probability of observing the sequence aaaaaaaaaa under the model (n = 10) is approximatively (0.9)10 = 0.35 (the formula is exact up to the initial term P (X1 = a), already neglected in 3.29); the probability of observing the sequence aaaaaaaaaaaaaaaaaaaa (n = 20) is (0.9)20 = 0.12, etc.

Example 3.19 (example 3.18, continued) By symmetry, the stationary proba-

3.5. RECONSTRUCTION OF MARKOV MODELS FROM DATA bility associated to the chain is (a) = (b) = 0.5, and the entropy rate is h = (a)[f M (a|a) ln f M (a|a) + f M (b|a) ln f M (b|a)] (b)[f M (a|b) ln f M (a|b) + f M (b|b) ln f M (b|b)] = 0.9 ln 0.9 0.1 ln 0.1 = 0.325 nats

139

Thus the size of the typical sequences set grows as |Tn ()| = exp(0.325 n) = (1.38)n , instead of 2n for a maximally random (= independent + uniform) process. Otherwise said, the dynamics of the system under investigation behaves as if only 1.38 eective choices were at disposal at each step, instead of 2 eective choices (namely a and b) for the maximally random dynamics.

3.5.3

Maximum likelihood and the curse of dimensionality

n Suppose one observes a sequence xn 1 of length n with m := || states, believed to be generated by a Markov chain of order r . The complete specication a the latter model necessitates to determine all the quantitites of the form f M ( |) for all and r , that is a total of mr (m 1) quantities (the quantities f M ( |) are not completely free, but must obey the mr constraints f M ( |) = 1 for all r , whence the factor m 1).

But, even for relatively modest values of m and r , the number of free parameters mr (m 1) grows very fast (for instance 48 free parameters for m = 4 and r = 2, or 54 free parameters for m = 3 and r = 3). In consequence, the amount of data D required to estimate all those parameters with a reasonably small error becomes very large! This phenomenon, sometimes referred to as the curse of dimensionality, constitutes a major drawback, severely restricting the use of Markov chain modelling for growing r , despite the generality and exibility of the latter. Concretely, consider the maximum likelihood estimation, consisting in estimating f M ( |) M ( |) maximizing P (f D |f M ) for given f D . The formula of types 3.31 as the value f M ( |) to be simply given by the correthen demonstrates the searched for estimate f M ( |) = f D ( |). But a sequence D = xn of length sponding empirical distribution f 1 n contains a maximum of n r distinct transitions , and if m or r are large enough, the majority of the theoretically observed transitions will simply not occur in D = xn 1 , and the corresponding maximum likelihood estimates will be given the value M ( |) = 0, even if f M ( |) > 0. f This problem of unobserved transitions occurs each time the number of possible states m as well as the order of the chain r are large in comparison of the sample size n. Dierent remedies have been proposed to allevy the problem, such as the trigram strategy consiting in estimating f M (3 |1 2 ) (for a Markov chain of order r = 2) as M (3 |1 2 ) = 0 f D (3 ) + 1 f D (3 |2 ) + 2 f D (3 |1 2 ) f where the optimal choice of the non-negative weights 0 , 1 and 2 , obeying 0 + 1 + 2 = 1, is typically determined by trial and error, aiming at maximizing some overall performance index relatively to a given application. Although sometimes satisfactory for a given practical application, such estimates lack theoretical foundation and formal justication. This situation is somewhat analogous to the problem of unobserved species, occuring each time the number of possible states m is so large in comparison to the sample size n that some states might have not

140

CHAPTER 3. STATIONARY PROCESSES & MARKOV CHAINS

been observed at all in the data D. Although well identied in textual and biological data for a while, this problem has nevertheless not received a simple and universally aggreed upon solution; see however module L1 for a strategy aimed at estimating the total number of possible (= observed + unobserved) states.

3.5.4

Testing the order of a Markov chain

Denote by f D (. . .) be the empirical distribution function determined from data D = n M xn 1 , and denote by f (. . .) p(. . .) its theoretical counterpart, relatively to some model M . The corresponding empirical, respectively theoretical conditional entropies hk introduced in section 3.1 are
k 1 k hM )= k = H (X1 ) H (X1 k1

f M ()

f M ( |) ln f M ( |) f D ( |) ln f D ( |)

hD k

:=
k1

f ()

Suppose the model M to be a Markov chain of order r . Then theorem 3.10 implies
M M M M M M M hM 1 h2 hr hr +1 = hr +2 = hr +3 = . . . = h = h

Euqivalently, the quantity dM k dened for k 1 by


k +1 k 1 k M M ) ) H (X1 dM k := hk hk +1 = 2H (X1 ) H (X1

obey, for a Markov chain of order r , the following: dM 1 0 dM 2 0 dM r 0


M M dM r +1 = dr +2 = . . . d = 0

Thus the greatest k for which dM k is strictly positive indicates the order r = k of the Markov chain M . Intuitively, dM k measures the uncertainty reduction when the last symbol of a sequence is predicted using a past of length k instead of k 1, whence dM k = 0 if k > r . As with any inferential problem in statistics, the diculty is that the conditional k entropies hM k are theoretical quantities dened by (m 1)m parameters, not directly n observable. But, if the data D = x1 are numerous enough, one then expects hD k to be M close to hk , for k small. Also, the ratio number of parameters to be estimated/number log n of data is small i k is small relatively to log m . Thus empirical estimates are good as long as k kmax , where kmax is of order of kmax := is a satisfactory pragmatic choice. Hence, for k kmax , large values of dD k constitute an evidence pointing towards a model of order k, as conrmed by the following maximum likelihood test: The order test Consider a Markov process on || = m states, observed n successive times, about which, for some k kmax , two hypotheses compete, namely:
log n log m .

Figures 3.10 and 3.11 suggest that (3.34)

1 log n 2 log m

3.5. RECONSTRUCTION OF MARKOV MODELS FROM DATA

141

M Figure 3.10: Observed values hD k (continuous line) and theoretical values hk (dotted line) in terms of k for dierent models. In a), b) and c), empirical maximum likelihood M estimates hD k coincide approximatively with theoretical values hk as far as k kmax = 1 log n 2 log m . Estimates with k > kmax are not reliable due to the proliferation of unobserved transitions. a): uniform and independent process (fair heads or tails) of length n = 1024 on m = 2 states. b): Markov chain of order r = 1 of length n = 1024 on m = 2 states (see example 3.13). c): Markov chain of order r = 3 of length n = 1024 on m = 2 states (see example 3.20). d): the gure depicts the empirical values hD k obtained from the 14-th Thue-Morse sequence D 14 , of length n = 214 on m = 2 states (see example 3.17).

k : : the process is governed by a Markov chain of order k 1 H0 k : the process is governed by a Markov chain of (strict) order k H1 k is rejected at level if Then H0 D D 2 2 k 1 2n dD ] k = 2n [hk hk +1 ] 1 [(m 1) m

(3.35)

where dD k is measured in nats.


1 ln n The test can be applied in succession for k = 1, 2, ... kmax := 2 ln m . Potential k -th D order candidate models are signalized by high values of dk . For instance, if all dk are small, an independence model can be considered (see gure 3.11). If all dk are large, each k + 1-th model beats the immediately inferior k-order model. Figure 3.11 shows the order r of the chain to be signalized by a peak at dD k (for k kmax ).

Example 3.20 Let M be a binary Markov chain of order 3 specied by

142

CHAPTER 3. STATIONARY PROCESSES & MARKOV CHAINS

Figure 3.11: Observed (line) and expected (dots) values of dk = hk hk+1 in terms of k for the models presented in gure 3.10. f M (4 |1 2 3 ) on = {a, b}. A sequence of length n = 1024 is generated form this model, from which empirical distributions f D are determined, and conditional entropies hD k are computed.
2 2 k 1 D The values of hD k and dk as well as the threshold 1 [df] with df = (m 1) m at the signicance level = 0.001 are:

k 1 2 3 4 5 = kmax

hD k 0.692 0.692 0.691 0.637 0.631

dD k 0 0.001 0.054 0.006 0.0088

2n dD k 0 2.05 110.59 12.29 18.022

df

1 2 4 8 16

2 0.999 [df] 10.8 13.8 16.3 18. 20.5

Proof Likelihood ratio strategies, shown by Neyman and Pearson to minimize errors of both kinds (see module S1), command to take as decision variable the ratio of probabilities corresponding to both models, or equivalently their logarithmic likelihood ratio LL P (f D |f k) P (f D |f k 1 )
LL := log

P (f D |f k) P (f D |f k 1 )

D by a Markov chain of order k , where f k is the best (in the ML-sense) modelling of f D and fk1 is the best model for f of order k 1. The former is more general than the

3.5. RECONSTRUCTION OF MARKOV MODELS FROM DATA

143

D latter and thus bound yield a better t, that is P (f D |f k ) P (f |fk 1 ), i.e. LL 0. D M On the other hand, formula 3.31 of types P (f | f ) exp(n Kr (f D ||f M )) yields D LL = n [Kk1 (f D ||f k 1 ) Kk (f ||fk )] 0

Finally, Kullback-Leibler conditional divergences Kr (f D ||f r ) are, for n large, well represented by their quadratic approximations, the computed chi-squares 1 2 D 2 n (f ||fr ). Remarkably, the latter have been shown to behave, under model fr , 2 as a chi-square distribution [df], where df is equal to the number of free parameters D assigned to models of order r . Also, the dierence 2n[Kk1 (f D ||f k 1 ) Kk (f ||fk )], caused by uctuations of f D accountable by f k but not by fk 1 , has been shown 2 to behave as [df] with df equal to the dierence in the number of free parameters between the two models. Gluing the pieces together yields the following inferential recipy reject model f k 1 D D 2 and accepts model fk at level if 2n [Kk1 (f ||fk1 ) Kk (f ||fk )] 1 [df] where df is here (m 1) mk (m 1) mk1 = (m 1)2 mk1 .
D The proof is complete if we nally show that Kk1 (f D ||f k 1 ) = hk (and that D D Kk (f ||f k ) = hk +1 ). But it follows from 3.2 that k k 1 hD )= k = H (X1 ) H (X1 k (a)

f D ( ) log f D ( ) +
k1

f D () log f D () f D () log f D ()
k1

=
k1

f D ( ) log f D ( ) +
(b)

=
k1

f D ()

f D ( |) log f D ( |)

3.5.5

Simulating a Markov process

Given the n observations xn 1 , and, under the hypothesis that the underlying process is a Markov chain of order r , one rst determines the order k of the chain (with k 3.35.
1 ln n 2 ln m )

by using the test

one then estimates the corresponding theoretical transitions f M ( |) (with ) and r ) by the empirical ones f D ( |) := n( (maximum likelihood estimation). At this stage, one is in position to simulate the Markov process, simply by running a k-th order process with transition matrix f D ( |) from some initial state r drawn with probability f D (). Example 3.21 Written with m = 27 states (the alphabet + the blank, without punctuation), the english version of the Universal declaration of Human Rights constitutes D a text xn 1 of length n = 8 149, from which conditional empirical distributions f ( |) can be computed. One can imagine the text to have been produced by a Markov chain of order r , dened by the set of theoretical conditional probabilities {f M ( |)} where

144

CHAPTER 3. STATIONARY PROCESSES & MARKOV CHAINS

is a r -gram. Those theoretical probabilities can be estimated (ML-estimation) by M ( |) := f D ( |), and, in virtue of the corresponding empirical estimates, that is f 1 log n 1 log 8 149 3.34 the estimate is guaranteed to reliable for r 2 log m = 2 log 27 = 1.36, which is M ( |) = f D ( |) with | | = r yield: rather small! Simulations based on f r = 0 (independent process) iahthire edr pynuecu d lae mrfa ssooueoilhnid nritshfssmo nise yye noa it eosc e lrc jdnca tyopaooieoegasrors c hel niooaahettnoos rnei s sosgnolaotd t atiet The relative frequencies of all m = 27 symbols are correctly sampled; in particular, the proportion of blanks (16.7%) is respected and words have about the correct length. However, original transitions between symbols are obviously not correctly reproduced. r = 1 (First-order Markov chain) erionjuminek in l ar hat arequbjus st d ase scin ero tubied pmed beetl equly shitoomandorio tathic wimof tal ats evash indimspre tel sone aw onere pene e ed uaconcol mo atimered First-order transitions are taken into account, which makes the sample more readable than the previous one (in particular, the consonants-vowels transitions are respected). r = 2 (Second-order Markov chain) mingthe rint son of the frentery and com andepent the halons hal to coupon efornitity the rit noratinsubject will the the in priente hareeducaresull ch infor aself and evell The sample begins to look like english, with a zest of latin.... r = 3 (Third-order Markov chain) law socience of social as the right or everyone held genuinely available sament of his no one may be enties the right in the cons as the right to equal co one soveryone The text denitely looks like english, with whole words correctly reproduced. However, we are beyond the safe limit kmax = 1.36: the simulated text betrays its origin, namely the Universal declaration of Human Right, and not another original english text of comparable size, such as a a cooking recipy or a text about mathM ( |) = f D ( |) with || = 3 are ematics. Equivalently, the model estimates f over-parameterized. r = 4 (Fourth-order Markov chain) are endowed with other means of full equality and to law no one is the right to choose of the detent to arbitrarily in science with pay for through freely choice work All simulated words are now genuine english words, which reects the high redundancy of english R = 0.73 (see example 3.5). Meanwhile, the over-parameterization problem is getting worse.... r = 9 (Markov chain of order 9) democratic society and is entitled without interference and to seek receive and impartial tribunals for acts violating the fundamental

3.5. RECONSTRUCTION OF MARKOV MODELS FROM DATA rights indispensable for his

145

The over-parameterization has reached outrageously scandalous levels by statistical standards: the set {f M ( |)} of nine-order transitions (|| = 9) contains (27 1) 279 = 6.1 1013 parameters estimated from a text of length n = 8.1 103 only! As a result, whole chunks of the original text are reproduced without alteration in the simulated text. Control Question 38 Determine the unique correct answer: 1. the curse of dimensionality alludes to a problem of a) visualization in highdimensional spaces; b) lack of regularity for Markov chains; c) unability to perform hypothesis testing; d) over-parameterization. 2. the number of arguments of the stationary distribution function associated to a Markov chain of order r is a) 1;; b) r 1; c) r ; d) variable. 3. the observation of a transition vorbiden by a Markov model a) is a rare event; b) is possible is the sample is small enough; c) should occur in proportion less than the signicance level; d) indicates the inadequacy of the model. 4. the conditional Kullback-Leibler divergence Kr (f D ||f M ) of order r a) is zero i a vorbidden transition occurs; b) is innite i a vorbidden transition occurs; c) is increasing in r ; d) increases with the probability P (f D |f M ). Answer 1. d) over-parameterization 2. c) r 3. d) indicates the inadequacy of the model 4. b) is innite i a vorbidden transition occurs

Historical Notes and Bibliography


Section 3.3.5. Irreversible behavior was rst clearly attested and formalized in Thermodynamics in the middle of the XIXth century under the name of the Second Principle of Thermodynamics, stating that the (physical) entropy of an isolated non-equilibrium physical system grows until it reaches equilibrium. The modern, purely information-theoretical formulation of the second principle is given (in the framework of Markov processes) by theorem 3.9. It shows in particular K (f (n) || ) to decrease to zero for n . If the stationary distribution is uniform (i.e. j = 1/m for all j = 1, . . . , m), then K (f (n) || ) = log m H (f (n) ) where H (f ) is Shannons entropy of distribution f : here theorem 3.9 conrms that H (f (n) ) is indeed increasing in n with limit log m. But in the general case

146

CHAPTER 3. STATIONARY PROCESSES & MARKOV CHAINS where is not uniform, the Second Principle should be more correctly stated as the relative entropy (with respect to ) of an isolated non-equilibrium physical system grows until it reaches equilibrium.

Section 3.5. The Universal Declaration of Human Rights was adopted by UNOs General Assembly (resolution 217 A (III)) of 10 December 1948.

OutLook
Cover,T.M. and Thomas,J.A. Elements of Information Theory, Wiley (1991) Hillman, C. An entropy primer, http://www.math.washington.edu/ hillman/PUB/primer.ps, (1996) Jelinek,F. Statistical Methods for Speech Recognition, The MIT Press, Cambridge, MA (1998) Mirkin,B. Mathematical Classication and Clustering, Kluwer, Dordrecht, (1996) Shields, P.C. The Ergodic Theory of Discrete Sample Paths, Graduate Studies in Mathematics, Volume 13, American Mathematical Society (1996) Xanthos, A. Entropizer 1.1: un outil informatique pour lanalyse s equentielle. Proceedings of the 5th International Conference on the Statistical Analysis of Textual Data (2000).

Chapter 4

Module C4: Coding for Noisy Transmission

by J.-C. Chappelier

Learning Objectives for Chapter 4 In this chapter, we present: 1. the basics of coding a discrete information source in order to be able to accurately transmit its messages even in the presence of noise; 2. how a noisy transmission can be formalized by the notion of a channel ; 3. the two fundamental notions ruling noisy transmissions: the channel capacity and the transmission rate ; 4. the fundamental limit for the transmission error in the case of a noisy transmission.

Introduction
When dealing with information, one of the basic goals is to transmit it reliably. In this context, transmit both means transmitting some information from one point to another, as we usually understand it, but also to transmit it through time; for instance to store it somewhere (to memorize it) and then retrieve it later on. In both cases however, transmission of information can, in real life, hardly be achieved in a fully reliable manner. There always exists a risk of distortion of the transmitted information: some noise on the line, some leak of the memory or the hard disk storing the information, etc. What eect does noise have on the transmission of messages? Several situations could be possible: 147

148

CHAPTER 4. CODING FOR NOISY TRANSMISSION it is never possible to transmit any messages reliably (too much noise); it is possible to transmit messages with a reasonable error probability; it is possible to transmit messages with with an error probability which is as small as we can wish for (using error correcting codes).

The purpose of the present chapter is to study how coding can help transmitting information in a reliable way, even in the presence of noise during the transmission. The basic idea of such codings is to try to add enough redundancy in the coded message so that transmitting it in reasonably noisy conditions leaves enough information undisturbed for the receiver to be able to reconstruct the original message without distortion. Of course, the notions of enough redundancy and reasonably noisy conditions need to be specied further; and even quantied and related. This will be done by rst formalizing a bit further the notion of noisy transmission, by introducing the notion of a communication channel, which is addressed in section 4.1. As we will see in section 4.3, the two fundamental notions ruling noisy transmissions are the channel capacity and the rate use for transmitting the symbols of the messages. These notions are rst introduced in section 4.1.

Noise Source Coding Channel Decoding Received message

Transmitted codeword

Zi

Received codeword

Figure 4.1: Error correcting communication over a noisy channel.

The general framework this chapter will focuses on is summarized in gure 4.1.

4.1

Communication Channels

Learning Objectives for Section 4.1 This section presents the notion of a communication channel and the main characterization of it: the capacity. After studying it, you should be able to formalize a communication (i.e. to give the corresponding channel) and to compute its capacity, at least in the most simple cases. You should also know what a symmetric channel is, and what this implies on its capacity We nally also introduce the notion of transmission rate.

{
Z

Ui

z1 i ...

zn i

z 1 ...

4.1. COMMUNICATION CHANNELS

149

4.1.1

Communication Channels

Roughly speaking, a communication channel (shorter channel) represents all that could happen to the transmitted messages between their emission and their reception. A message is a sequences of symbols. A symbol is simply an element of a set, called an alphabet. In this course, only nite alphabets will be addressed. The input sequence X1 , X2 , X3 , . . . (i.e. the message to be transmitted) is fully determined by the source alone; but the transmission determines the resulting conditional probabilities of the output sequence Y1 , Y2 , Y3 , . . . (i.e. the message received) knowing the input sequence. In mathematical terms, the channel species the conditional probabilities of the various messages that can be received, conditionally to the messages that have been emitted; i.e. P (Y1 = y1 , ..., Yn = yn |X1 = x1 , ..., Xn = xn ) for all possible n and values y1 , x1 , ..., yn , xn .

Denition 4.1 (Discrete Memoryless Channel) The discrete memoryless channel (DMC ) is the simplest kind of communication channel. Formally, DMC consists of three quantities: 1. a discrete input alphabet, VX , the elements of which represent the possible emitted symbols for all input messages (the source X ); 2. a discrete output alphabet, VY , the elements of which represent the possible received symbols (output sequence); and 3. for each x VX , the conditional probability distributions pY |X =x over VY which describe the channel behavior in the manner that, for all n = 1, 2, 3, . . . : P (Yn = yn |X1 = x1 , . . . , Xn = xn , Y1 = y1 , . . . , Yn1 = yn1 ) = P (Y = yn |X = xn ), These are called the transmission probabilities of the channel. Equation (4.1) is the mathematical statement that corresponds to the memoryless nature of the DMC. What happens to the signal sent on the n-th use of the channel is independent of what happens on the previous n 1 uses. Notice also that (4.1) implies that the DMC is time-invariant, since the probability distribution pYn |xn does not depend on n. When VX and VY are nite, a DMC is very often specied by a diagram where: 1. the nodes on the left indicate the input alphabet VX ; 2. the nodes on the right indicate the output alphabet VY ; and 3. the directed branch from xi to yj is labeled with the conditional probability pY |X =xi (yj ) (unless this probability is 0, in which case the branch is simply omitted.) (4.1)

150

CHAPTER 4. CODING FOR NOISY TRANSMISSION

Example 4.1 (Binary Symmetric Channel) The simplest (non trivial) case of DMC is the binary symmetric channel (BSC ), for which VX = VY = {0, 1} (binary) and pY |X =0 (1) = pY |X =1 (0) (symmetric). This value p = pY |X =0 (1) = pY |X =1 (0) is called the error rate and is the only parameter of the BSC. Indeed, pY |X =0 (0) = pY |X =1 (1) = 1 p. The BSC is represented by the following diagram:
1
1p p p

X
0

Y
1p

Example 4.2 (Noisy Transmission over a Binary Symmetric Channel) Suppose we want to transmit the 8 following messages: 000, 001, 010, 011, 100, 101, 110 and 111. Suppose that the channel used for transmission is noisy in such a way that it changes one symbol over ten, regardless of everything else; i.e. each symbol has a probability p = 0.1 to be ipped (0 into 1, and 1 into 0). Such a channel is a BSC with an error rate equal to p = 0.1, What is the probability to transmit one of our messages correctly? Regardless of which message is sent, this probability is (1 p)3 = 0.93 = 0.719 (corresponding to the probability to transmit 3 times one bit without error). Therefore, the probability to receive and erroneous message is 0.281, i.e 28%; which is quite high! Suppose now we decide to code each symbol of the message by twice itself: message code 000 000000 001 000011 010 001100 011 001111 100 110000 ... ... 111 111111

What is now the probability to have a message sent correctly? In the same way, this is (1 p)6 = 0.531 And the probability to receive and erroneous message is now 0.469... ...worse than previously, it seems! However, what is the probability to receive a erroneous message which seems to be valid; i.e. what is the probability to receive a erroneous message and to not detect it as wrong? Not detecting an erroneous message means that two corresponding symbol have both been changed. If for instance we sent 000000, but 110000 is received, there is not way to see that some errors occurred. However, if 010000 is received, clearly at least

4.1. COMMUNICATION CHANNELS one error occurred (and retransmission could for instance be required).

151

So, not detecting an error could come either from 2 changes (at the corresponding places) or 4 changes or the whole 6 symbols. What is the probability to change 2 2 4 2 4 symbols? Answer: 6 2 p (1 p) = 15 p (1 p) What is the probability to change 3 2 4 2 corresponding symbols? Only 1 p (1 p) = 3 p2 (1 p)4 . Similarly, the probability to change 4 corresponding symbols is 3 p4 (1 p)2 , and to change the whole six symbols is p6 . Therefore, the probability of not detecting an error is 3 p2 (1 p)4 + 3 p4 (1 p)2 + p6 = 0.020 which is much smaller. This means that the probability to make a error in the reception (i.e. to trust an erroneous message without being aware of) is only 0.02. Conclusion: some codings are better than other for the transmission of messages over a noisy channel. Finally, we wish to clearly identify the situation when a DMC is used without feedback , i.e. when the probability distribution of the inputs does not depend on the output. More formally, a DMC is said to be without feedback when:

pXn |x1 ,...,xn1 ,y1 ,...,yn1 = pXn |x1 ,...,xn1

(4.2)

for all n = 1, 2, 3, . . . . Notice that (4.2) does not imply that we choose each input digit independently of previous input digits, only that we are not using the past output digits in any way when we choose successive input digits (as we could if a feedback channel was available from the output to the input of the DMC). Let us now give a fundamental result about DMC without feedback.

Theorem 4.1 For a DMC without feedback, we have for all n N:


n

H (Y1 . . . Yn |X1 . . . Xn ) =
i=1

H (Yi |Xi )

where X1 ...Xn stands for an input sequence of length n and Y1 ...Yn for the corresponding output.

Proof From the chain rule for probabilities, we have pX1 ,...,Xn,Y1 ,...,Yn (x1 , ..., xn , y1 , ..., yn ) =
n

pX1 (x1 ) pY1 |x1 (y1 )


i=2

pXi |x1 ,...,xi1 ,y1 ,...,yi1 (xi )pYi |x1 ,...,xi ,y1 ,...,yi1 (yi )

152

CHAPTER 4. CODING FOR NOISY TRANSMISSION

Making use of (4.1) and (4.2), we get: pX1 ,...,Xn ,Y1 ,...Yn (x1 , . . . , xn , y1 , . . . yn )
n

= pX1 (x1 ) pY1 |x1 (y1 )


i=2 n

pXi |x1 ,...,xi1 (xi ) pY |xi (yi )


n

= pX1 (x1 )
i=2

pXi |x1 ,...,xi1 (xj )


i=1 n

pY |xi (yi )

= pX1 ,...,Xn (x1 , . . . , xn )


i=1

pY |xi (yi ).

Dividing now by pX1 ,...,Xn (x1 , . . . , xn ), we obtain:


n

pY1 ,...,Yn|x1 ,...,xn (y1 , ..., yn ) =


i=1

pY |xi (yi )

(4.3)

The relationship (4.3) is so fundamental that it is sometimes (erroneously) given as the denition of the DMC. When (4.3) is used instead of (4.1), the DMC is implicitly assumed to be used without feedback. e-pendix: Cascading Channels

4.1.2

Channel Capacity

The purpose of a channel is to transmit messages (information) from one point (the input) to another (the output). The channel capacity precisely measures this ability: it is the maximum average amount of information the output of the channel can bring on the input. Recall that a DMC if fully specied by the conditional probability distributions pY |X =x (where X stands for the input of the channel and Y for the output). The input probability distribution pX (x) is not part of the channel, but only of the input source used. The capacity of a channel is thus dened as the maximum mutual information I (X ; Y ) that can be obtained among all possible choice of pX (x). More formally:

Denition 4.2 (Channel Capacity) The capacity C of a Discrete Memoryless Channel is dened as C = max I (X ; Y ),
pX

(4.4)

where X stands for the input of the channel and Y for the output. We will shortly see that this denition is indeed very useful for studying noisy transmissions over channels, but let us rst give a rst example. Example 4.3 (Capacity of BSC) What is the capacity C of a BSC, dened in example 4.1?

4.1. COMMUNICATION CHANNELS First notice that, by denition of mutual information, C = max H (Y ) H (Y |X ) .
pX

153

Furthermore, since in the case of a BSC, P (Y = X ) = p and P (Y = X ) = 1 p, we have H (Y |X ) = p log(p) (1 p) log(1 p) =: h(p), which does not depend on pX . Therefore C = max H (Y ) h(p).
pX

Since Y is a binary random variable, we have (by theorem 1.2): H (Y ) log 2, i.e. H (Y ) 1 bit. Can this maximum be reached for some pX ? Indeed, yes: if X is uniformly distributed, we have pY (0) = p pX (1)+(1 p) pX (0) = 0.5 p +0.5 (1 p) = 0.5; which means that Y is also uniformly distributed, leading to H (Y ) = 1 bit. Therefore: maxX H (Y ) = 1 bit and C = 1 h(p) Control Question 39 What is the capacity of the binary erasure channel dened by the following graph:
0
1p p (lost) p

(in bits).

X
1

1
1p

Is it (in the most general case): 1. C = 1 h(p) = 1 + p log p + (1 p) log(1 p), 2. C = 1 p, 3. C = 1, 4. C = 1 + h(p) = 1 p log p (1 p) log(1 p), 5. or C = 1 + p? Answer Let r be r = P (X = 0) (thus P (X = 1) = 1 r ). By denition, the channel capacity is: C = max I (Y ; X ) = max [H (Y ) H (Y |X )]
P (X ) r

Let us rst compute H (Y |X ): H (Y |X ) =


x

P (X = x)
y

P (Y |X = x) log P (Y |X = x)

=
x

P (X = x) [p log p + (1 p) log(1 p) + 0]

= [p log p + (1 p) log(1 p)] = h(p)

154

CHAPTER 4. CODING FOR NOISY TRANSMISSION

Since H (Y |X ) = h(p) is independent of r C = max [H (Y )] h(p)


r

Let us now compute H (Y ). We rst need the probability distribution of Y : P (Y = 0) = r (1 p) P (Y = 1) = (1 r ) (1 p) P (Y = lost) = r p + (1 r ) p = p Thus H (Y ) = r (1 p) log(r (1 p)) (1 r ) (1 p) log((1 r ) (1 p)) p log p Since P (Y = lost) is independent of r , H (Y ) is maximal for P (Y = 0) = P (Y = 1) i.e. r (1 p) = (1 r ) (1 p) and thus r = 1 r = 0.5. Another way to nd the maximum of H (Y ) is to see that H (Y ) = r (1 p) log(r (1 p)) (1 r ) (1 p) log((1 r ) (1 p)) p log(4.5) p = r (1 p) log(r ) r (1 p) log(1 p) (1 r ) (1 p) log(1 r ) (1 r ) (1 p) log(1 p) p log p = (1 p) h(r ) (r + 1 r ) (1 p) log(1 p) p log p = (1 p) h(r ) + h(p) which is maximum for r = 0.5. Eventually, the maximum of H (Y ) is (1 p) + h(p), and at last we nd: C = 1 p + h(p) h(p) = 1 p. Thus, the right answer is 2).

4.1.3

Input-Symmetric Channels

We here consider only DMC with a nite alphabet, i.e. a nite number, K , of input symbols, and a nite number, J , of output symbols.

Denition 4.3 Such a DMC is said to be input-symmetric if the error probability distributions are all the samea for all input symbols; i.e. the sets pY |xi (y ) : y VY are independent of xi .
a

apart from a permutation

Example 4.4 (Input-Symmetric Channels)

4.1. COMMUNICATION CHANNELS

155

aa 0
1/2

1/2 1/4 1/4

0
1/2

ba
1/2 1/2

X
1

1/2 1/2

(lost)

b
1/4

Y
bc

c 1
1/2

1/4

cc
(a) A inputsymmetric DMC which is not outputsymmetric. (b) An outputsymmetric DMC which is not inputsymmetric.

A BSC is input-symmetric. That of Figure (a) above is also input-symmetric, but the channel of Figure (b) is not input-symmetric.

Lemma 4.1 For a input-symmetric DMC, H (Y |X ) is independent of the distribution pX , and H (Y |X ) = H (Y |xi ) for any given xi VX . Proof It follows from the denition of a input-symmetric channel that
J

xi V X

H (Y |X = xi ) = H0 =
j =1

pj log pj

(4.6)

where {p1 , p2 , . . . , pJ } the set of probabilities pY |xi (y ) (which is indepent of the input letter xi ). Therefore H (Y |X ) =
x

pX (x)H (Y |X = x) =
x

pX (x)H0 = H0

For a input symmetric DMC, nding the input probability distribution that achieves capacity (i.e. achieves the maximum of I (X ; Y )) reduces to simply nding the input distribution that maximizes the uncertainty of the output. Property 4.1 For a input-symmetric DMC, we have C = max [H (Y )] H0
PX

(4.7)

where H0 = H (Y |X = xi ) for any of the xi VX . Proof This property directly derives from denition 4.2 and lemma 4.1.

4.1.4

Output-Symmetric Channels

A input-symmetric channel is one in which the probabilities leaving each input symbol are the same (appart from a permutation). We now consider channels with the prop-

156

CHAPTER 4. CODING FOR NOISY TRANSMISSION

erty that the probabilities reaching each output symbol are the same (appart from a permutation). More formaly, a DMC is said to be output-symmetric when the sets pY |x (yi ) : x VX are independent of yi . Notice that in this case the sums x pY |x (yi ) are independent of yi .1 Example 4.5 (Output-Symmetric Channel) The BSC (see example 4.1) is output-symmetric. The channel of example 4.4, gure (b), is output-symmetric, but that of gure (a) is not.

Lemma 4.2 For a output-symmetric DMC, the uniform input probability distribution (i.e., pX (xi ) is the same for all xi ) results in the uniform output probability distribution (i.e., pY (yi ) is the same for all yi ).

Proof pY (yj ) =
xi

pY |xi (yj )pX (xi ) =

1 |VX |

pY |xi (yj )
xi

But, since the DMC is output-symmetric, the sum on the right in this last equation is independent of yj . Thus pY is independent of yj ; i.e. is the uniform probability distribution.

Property 4.2 For an output-symmetric DMC, the input of which is X and the output of which is Y , we have: max H (Y ) = log |VY |.
pX

(4.8)

Proof This property comes immediately from Lemma 4.2 and theorem 1.2 on the entropy upper bound.

4.1.5

Symmetric Channels

Denition 4.4 (Symmetric Channel) A DMC is symmetric when it is both input- and output-symmetric.

Theorem 4.2 The capacity of a symmetric channel (the input of which is X and the output of which is Y ) is given by: C = log |VY | H0 where H0 = H (Y |X = xi ) for any of the input symbol xi VX .
1

(4.9)

Some authors called this property week (output-)symmetry.

4.1. COMMUNICATION CHANNELS

157

Proof The above theorem is an immediate consequence of properties 4.1 and 4.2. Example 4.6 The BSC (see Example 4.1) is a symmetric channel for which H0 = h(p) = p log p (1 p) log(1 p). Thus, CBSC = log 2 h(p) = 1 h(p) (in bits)

4.1.6

Transmission Rate

Denition 4.5 (Transmission Rate) The transmission rate (in base b) of a code encoding a discrete source U of |VU | messages with codewords of xed length n is dened by: logb |VU | Rb = n Notice that |VU | is also the number of possible codewords (deterministic non-singular coding). In practice, the base b chosen for the computation of the transmission rate is the arity of the code. Example 4.7 The (binary) transmission rate of the code used in example 4.2 is 8 3 1 R = log 6 = 6 = 2 . This sounds sensible since this code repeats each message twice, i.e. uses twice as many symbols as originally emitted. Control Question 40 On a noisy channel, we plan to use a code consisting of tripling each symbol of the messages. For instance a will be transmitted as aaa. What is the transmission rate R of such a code? (If you are a bit puzzled by the base, choose for the base the arity of the source, i.e. the number of dierent symbols the source can emit) Answer Whatever the alphabet used for the messages, each symbol is repeated 3 times. Therefore whatever the message is, it will always be encoded into a coded message three 1 times longer. Therefore, the transmission rate is R = 3 . For people preferring the application of the denition, here it is: the xed length of codewords is n = 3 m, where m is the length of a message. If D is the arity of the source U (i.e. the size of its alphabet), the number of possible messages of length m is |VU | = D m . Thus, we have: RD = logD |VU | logD Dm m 1 = = = n 3m 3m 3

158

CHAPTER 4. CODING FOR NOISY TRANSMISSION

Summary for Section 4.1 Channel: input and output alphabets, transmission probabilities (pY |X =x ). DMC = Discrete Memoryless Channel. Channel Capacity: C = maxpX I (X ; Y ) Code/Transmission Rate: Rb =
logb |VU | n PX

Capacity of Input-symmetric channels: C = max [H (Y )] H (Y |X = xi ) for any of the xi VX . Capacity of Symmetric channels: C = log |VY | H (Y |X = xi ) for any of the xi V X .

4.2

A Few Lemmas

Learning Objectives for Section 4.2 In this section, we introduce several general results that we shall subsequently apply in our study of channels, but have also many other applications.

4.2.1

Multiple Use Lemma

We previously dene the capacity of a channel as the maximum amount of information the output of the channel can bring on the input. What can we say if we use the same channel several times (as it is expected to be the case in real life!)?

Lemma 4.3 If a DMC without feedback of capacity C is used n times, we have: I (X1 . . . Xn ; Y1 . . . Yn ) n C where X1 ...Xn stands for an input sequence of length n and Y1 ...Yn for the corresponding output.

Proof Using denition of mutual information and theorem 4.1, we have: I (X1 . . . Xn ; Y1 . . . Yn ) = H (Y1 . . . Yn ) H (Y1 . . . Yn |X1 . . . Xn )
n

= H (Y1 . . . Yn )
i=1

H (Yi |Xi )

(4.10)

Recalling that H (Y1 . . . Yn ) = H (Y1 ) +

H (Yi |Y1 ...Yi1 )


i=2

4.2. A FEW LEMMAS


X Processor no. 1 Y Processor no. 2 Z

159

Figure 4.2: The conceptual situation for the Data Processing Lemma.

and that H (Yi |Y1 ...Yi1 ) H (Yi ) we have: H (Y1 . . . Yn )


i=1 n

H (Yi )

which, after substitution into (4.10) gives


n

I (X1 . . . Xn ; Y1 . . . Yn )
i=1 n

[H (Yi ) H (Yi |Xi )] I (Xi ; Yi )


i=1

= nC by the denition of channel capacity.

(4.11)

4.2.2

Data Processing Lemma

We here consider the situation shown in Figure 4.2, where several processors are used successively. Processors considered here are completely arbitrary devices. They may be deterministic or even stochastic. They may even have anything at all inside. The only thing that Figure 4.2 asserts is that there is no hidden path by which X can aect Z , i.e., X can aect Z only indirectly through its eect on Y . In mathematical terms, this constraint can be expressed as pZ |x,y (z ) = pZ |y (z ) (4.12)

for all y such that pY (y ) = 0, which means simply that, when y is given, then z is not further inuenced by x. More formally the sequence X, Y, Z is a rst-order Markov chain. The Data Processing Lemma states essentially that information cannot be increased by any sort of processing (although it can perhaps be put into a more accessible form!).

Lemma 4.4 (Data Processing Lemma) When Markov chain, i.e. when (4.12) holds, we have: I (X ; Z ) I (X ; Y ) and I (X ; Z ) I (Y ; Z ).

X, Y, Z is a rst-order (4.13) (4.14)

160

CHAPTER 4. CODING FOR NOISY TRANSMISSION

In other words, the mutual information between any two states of a Markov chain is always less or equal to the mutual information between any two intermediate states. Proof (4.12) implies that H (Z |XY ) = H (Z |Y ) and hence I (Y ; Z ) = H (Z ) H (Z |Y ) = H (Z ) H (Z |XY ) H (Z ) H (Z |X ) = I (X ; Z ) which proves (4.14). To prove (4.13), we need only show that Z, Y, X (the reverse of the sequence X, Y, Z ) is also a Markov chain, i.e. that pX |yz (x) = pX |y (x) for all y such that pY (y ) = 0. Indeed, pZ |x,y (z ) = pZ |y (z ) = pZ |x,y (z ) pXY (x, y ) pX |y,z (x) = pY Z (y, z ) pZ |y (z ) pXY (x, y ) = pY Z (y, z ) pY Z (y, z ) pXY (x, y ) = pY (y ) pY Z (y, z ) pXY (x, y ) = pY (y ) = pX |y (x) (4.16) (4.15)

4.2.3

Fanos Lemma

Now, for the rst time in our study of information theory, we introduce the notion of errors. Suppose we think of the random variable U as being an estimate of the random variable U . For this to make sense, U needs to take on values from the same alphabet as U . Then an error is just the event that U = U and the probability of error, Pe , is thus Pe = P (U = U ). (4.17)

We now are ready for one of the most interesting and important results in information theory, one that relates Pe to the conditional uncertainty H (U |U ).

4.2. A FEW LEMMAS

161

Lemma 4.5 (Fanos Lemma) Let U and U be two D-ary random variables with the same values. Denoting by Pe the probability that U is dierent from U , we have: h(Pe ) + Pe log2 (D 1) H (U |U ) (4.18) where the uncertainty H (U |U ) is in bits, and h is is the entropy of a binary random variable: h(p) = p log(p) (1 p) log(1 p).

Proof We begin by dening the random variable Z as the random variable indicating a dierence between U and U : Z= 0 when U = U 1 when U = U,

Z is therefore a binary random variable with parameter Pe . So, we have: H (Z ) = h(Pe ). Furthermore: H (U Z |U ) = H (U |U ) + H (Z |U U ) = H (U |U ), since U and U uniquely determine Z (therefore H (Z |U U ) = 0. Thus H (U |U ) = H (U Z |U ) = H (Z |U ) + H (U |U Z ) H (Z ) + H (U |U Z ) But H (U |U , Z = 0) = 0 since in this case U is uniquely determined, and H (U |U , Z = 1) log2 (D 1) (4.22) (4.21) (4.20) (4.19)

since, for each value u of U , there are, when Z = 1, at most D 1 values with non-zero probability for U : u VU H (U |U = u, Z = 1) log2 (D 1) Equations (4.21) and (4.22) imply H (U |U Z ) = pZ (0)H (U |U , Z = 0) + pZ (1)H (U |U , Z = 1) = 0 + pZ (1)H (U |U , Z = 1) pZ (1) log 2 (D 1) = Pe log2 (D 1). (4.23)

Substituting (4.19) and (4.23) into (4.20), we obtain (4.18) as was to be shown.

162

CHAPTER 4. CODING FOR NOISY TRANSMISSION

~ h(Pe) + P e log(D 1) log D log(D 1)

D1 D

Pe

Figure 4.3: Evolution according to Pe of the bound on H (U |U ) given by Fanos Lemma.

We can provide an interpretation of Fanos Lemma. Notice rst that the function on the left of (4.18), sketched in gure 4.3, is concave in Pe ) and is positive for all Pe such that 0 Pe 1. Thus, when a positive value of H (U |U ) is given, (4.18) implicitly species a positive lower bound on Pe . Example 4.8 Suppose that U and U are binary-valued (i.e., D = 2) and that H (U |U ) = 1 2 bit. Then (4.18) gives h(Pe ) or, equivalently, .110 Pe .890. The fact that here (4.18) also gives a non-trivial upper bound on Pe is a consequence of the fact that the given H (U |U ) exceeds the value taken on by the left side of (4.18) when Pe = 1, i.e. log2 (D 1) = 0. This is not always the case. For instance, taking D = 3 and H (U |U ) = becomes 1 h(Pe ) + Pe 2 which leads to Pe .084,
1 2

1 2

bits, (4.18)

but does not provide any useful upper bound. Only the trivial bound Pe 1 can be asserted. A review of the proof of Fanos Lemma reveals that equality holds in (4.18) if and only if the probability of an error given that U = u is the same for all u and when there is an error (i.e., when U = U ), the n 1 erroneous values of U are always equally likely. It follows that (4.18) gives the strongest possible lower bound on Pe , given only H (U |U )

4.3. THE NOISY CODING THEOREM and the size, n, of the alphabet for U . Summary for Section 4.2 Data Processing Lemma: pZ |x,y (z ) I (X ; Y ) and I (X ; Z ) I (Y ; Z ) = pZ |y (z ) = I (X ; Z )

163

Fanos Lemma: h(Pe ) + Pe log2 (|VU | 1) H (U |U )

4.3

The Noisy Coding Theorem

Learning Objectives for Section 4.3 This section is the core of the chapter and gives the important Noisy Coding Theorem , which explains under what conditions reliable communication is possible or not. After studying this section, you should be able to decide: 1. what is the maximum transmission speed you can use on a noisy communication channel to be able to construct reliable communication on this channel using error correcting codes; 2. what is the minimum amount of errors you are sure to make if you transmit information faster than this maximum.

4.3.1

Repetition Codes

The role of this section is to give a concrete example of very simple (naive) errorcorrecting codes. As such, there is not much to be learned from this section, but the fact that naive coding is not very good and that other, more appropriate, codes should be considered. We here consider binary repetition codes ; i.e. codes Rk for which each (binary) input symbol is repeated n = 2 k + 1 times (k > 1). For instance the code R1 was used in example 4.2. We consider only odd number of repetitions because decoding such codes is done by the majority. We thus avoid the non-determinism that even number of repetitions could introduce (in cases where the received codeword contains as many 0s as 1s). With such codes, the probability of a wrong decision on decoding a symbol is the probability that at least k + 1 errors have occurred on the corresponding block. Let us consider the case where such a code is used over a BSC. The number of errors made in this case by the channel has then a binomial distribution with parameters (n, p). Therefore, the expected number of errors at the level of the transmission of codeword symbols is n p. For p < 0.5, this expected number is less than k + 0.5, therefore the probability that at least k + 1 errors have occurred on a codeword (i.e. one coding block), tends to

164

CHAPTER 4. CODING FOR NOISY TRANSMISSION

be negligible as k (hence n) tends to innity. In other words, the probability that we take a wrong decision when decoding becomes negligible as the number of repetition increases (hence the length of the codewords). Thus we are able to compensate the loss due to the noise on the channel to any desired degree by choosing a large enough number of repetitions. However, the price we pay for this is in this case quite huge in terms of eciency. 1 ... Indeed the transmission rate of such a code is n ...which also tends to 0 as n grows to innity! For large n this is likely to be an unacceptably low transmission rate. The importance of the Noisy Coding Theorem is precisely that it ensures that good codes can be achieved for any given transmission rate below the capacity. The transmission rate can be xed a priori and does not need to be ridiculously small as in the above example. It only requires to be below the channel capacity. e-pendix: Repetition Codes over a BSC Control Question 41 On a BSC with error probability p, we plan to use the repetition code R1 consisting of tripling each symbol of the messages. Decoding is done according to the majority in the received block. What is, as a function of p, the output bit error probability Pb for such a communication system? 1. p2 p3 2. p2 3. 3p2 2p3 4. 2p2 3p3 5. p3 Answer The correct answer is 3. When does an error occur in decoding? When 2 or 3 symbols of the block have been corrupted. What is the probability of this to occur? The probability of one erroneous block with exactly two symbols are wrong is p2 (1 p), and there are exactly 3 2 = 3 ways to have 2 wrong symbols among 3. Therefore the probability to get a wrong block with exactly 2 errors is 3 p2 (1 p). Similarly, the probability to get a wring block with exactly 3 errors is p3 . Therefore: Pb = 3 p2 (1 p) + p3 = 3 p2 2 p3

4.3. THE NOISY CODING THEOREM

165

4.3.2

The Converse to the Noisy Coding Theorem for a DMC without Feedback

In this section, we show that it is impossible to transmit information reliably through a DMC at a transmission rate above the capacity of this DMC. Without loss of essential generality, we suppose that the information to be transmitted is the output from a binary symmetric source (BSS ), which is a (memoryless) binary source with P (0) = P (1) = 1 2 . We here consider the case where the DMC is used without feedback (see Figure 4.1), for which we now give an important result on noisy transmissions (introduced by C. Shannon), known as the Converse Part of the Noisy Coding Theorem.

Theorem 4.3 (Converse Part of the Noisy Coding Theorem) If a BSS is used at rate R on a DMC without feedback of capacity C , and if R > C , then Pb , the bit error probability at the output, satises: Pb h where h
1 1

C R

(4.24)

(x) = min{p : p log(p) (1 p) log(1 p) = x}.

Here, we have written h to denote the inverse binary entropy function dened by 1 h (x) = min{p : p log(p) (1 p) log(1 p) = x}, where the minimum is selected in order to make the inverse unique (see gure 4.4).
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Figure 4.4: The function h , the inverse binary entropy function dened by h min{p : p log(p) (1 p) log(1 p) = x}.

(x) =

166

CHAPTER 4. CODING FOR NOISY TRANSMISSION

One important consequence of this fact is that, whenever R > C , (4.24) will specify a positive lower bound on the output bit error probability Pb that any coding system, as complex as it might be, can overcome. More clearly:. it is impossible to transmit any amount of information reliably at a rate bigger than the channel capacity. Before going to the proof of the theorem, let us rst give an example. Example 4.9 (Converse Part of the Noisy Coding Theorem) Suppose we 1 bit for transmitting message out of a BSS. If are using a DMC with capacity C = 4 1 we transmit at a rate R = 2 , then (4.24) gives us an error of at least 11%: Pb h
1

1 2

= .11.

This means that at least 11% of the binary symbols will be transmitted incorrectly (even after error correction decoding).

Proof We want to show that, when R > C , there is a positive lower bound on Pb that no manner of coding can overcome. As a rst step in this direction, we note that the output bit error probability is 1 Pb = m
m

P (Ui = Ui ).
i=1

(4.25)

where m is the length of the input message. By the use of Fanos Lemma, we have in the binary case considered here: h(P (Ui = Ui )) H (Ui |Ui ) thus:
m m

H (Ui |Ui )
i=1 i=1

h(P (Ui = Ui ))

(4.26)

To proceed further, we observe that


m

H (U1 . . . Um |U1 . . . Um ) = H (U1 |U1 ) +


i=2 m

H (Ui |U1 . . . Um U1 . . . Ui1 ) (4.27)

i=1

H (Ui |Ui )

since further conditioning can only reduce uncertainty. Furthermore, because h(x) is concave in x (for 0 x 1): 1 m
m

h(P (Ui = Ui )) h
i=1

1 m

P (Ui = Ui )
i=1

= h(Pb ).

(4.28)

4.3. THE NOISY CODING THEOREM Thus, up to this point we have: h(Pb ) 1 H (U1 ...Um |U1 ...Um ). m

167

Let us continue with H (U1 ...Um |U1 ...Um ): H (U1 . . . Um |U1 . . . Um ) = H (U1 . . . Um ) I (U1 . . . Um ; U1 . . . Um ) = m I (U1 . . . Um ; U1 . . . Um )

because in the case where the input source U is a BSS, H (U1 . . . Um ) = log 2m = m . What about I (U1 . . . Um ; U1 . . . Um )? If we consider the Coding and the Channel in Figure 4.1 respectively as Processor #1 and Processor #2 of Figure 4.2, using the Data Processing Lemma we have: I (U1 . . . Um ; U1 . . . Um ) I (z1 . . . zn ; U1 . . . Um ) (4.29) where zi is the code of Ui (see gure 4.1). Next, consider the Channel and the Decoder respectively as Processor #1 and Processor #2. Using the Data Processing Lemma again we conclude that: I (z1 . . . zn ; U1 . . . Um ) I (z1 . . . zn ; z1 . . . zn ). Combining (4.29) and (4.30), we obtain: I (U1 . . . Um ; U1 . . . Um ) I (z1 . . . zn ; z1 . . . zn ). which combined with lemma 4.3 leads to: I (U1 . . . Um ; U1 . . . Um ) n C So we nally can conclude that H (U1 ...Um |U1 ...Um ) m n C and thus h(Pb ) 1 i.e., by denition of the transmission rate R: h(Pb ) 1 C R n C m (4.31) (4.30)

You may wonder whether (4.24) still holds when feedback is permitted from the output of the DMC to the channel encoder. The answer is a somewhat surprising yes. To arrive at (4.11), we made strong use of the fact that the DMC was used without feedback in fact (4.11) may not hold when feedback is present. However, (4.31) can still be shown to hold so that the converse (4.24) still holds true when feedback is present. This fact is usually stated as saying that feedback does not increase the capacity of a DMC. This

168

CHAPTER 4. CODING FOR NOISY TRANSMISSION

does not mean, however, that feedback is of no value when transmitting information over a DMC. When R < C , feedback can often be used to simplify the encoder and decoder that would be required to achieve some specied output bit error probability. Control Question 42 On a DMC without feedback of capacity C , we want to transmit message with a bit 1 ). error probability Pb lower than a given value Pb max (0, 2 We do not now already which kind of code will be used. However, we wish to determine the maximal transmission rate Rmax we can use on that channel (and compatible with Pb Pb max ). What is Rmax in terms of C and Pb max ? 1. C 2. h
1

(Pb max )

3. C h(Pb max ) 4. 5. 6. h
1

C (1 h(Pb max )) 1 (C h
1

(Pb max ))

(1 C/Pb max ) Answer

The correct answser is 4. We are looking for the maximal transmission rate Rmax we can use. Let us see if this maximal rate could be above the capacity C . The converse part of the noisy coding theorem tells us that if we are transmitting at a rate R = Rmax > C we have: 1 C Pb h (1 ) R Since Pb max Pb , if R = Rmax > C we also have: Pb max h
1

(1

C ) R

1 And thus, h being non-decreasing on (0, 2 ):

C 1 h(Pb max )

Therefore, the maximal transmission rate we could use is Rmax = C 1 h(Pb max ) .

4.3. THE NOISY CODING THEOREM

169

r zi Z zj

zk

Figure 4.5: Example of decoding error in the random coding framework: zi was send, z was received and will here be decoded into zj .

Indeed, if R > Rmax then R > C (since


C R)

>h

C 1e h(Pb max )

> C ), and therefore Pb > h

(1

(1

C Rmax )

= Pb max .

So if R > Rmax , we are sure that Pb > Pb max . Notice that this does not mean that if R < Rmax we are sure to have Pb Pb max . We do not know if there exist such a code that R = Rmax and Pb Pb max . All what we now is that if we transmit too fast (R > Rmax ) we are sure to make too many errors ( Pb > Pb max ).

4.3.3

The Noisy Coding Theorem for a DMC

Up to now, we have seen that it is impossible to have a transmission error below a certain level if the transmission rate R is bigger than the capacity. We now study what is going on when we wish to transmit a BSS over a DMC at a rate below its capacity.

Theorem 4.4 (Noisy Coding Theorem for a DMC) Consider transmitting messages at rate R on a DMC without feedback of capacity C . For all R < C and all > 0, there exists some (error-correcting) code whose rate is R and with an output bit error probability Pb < . This important theorem claims that by choosing an appropriate way of encoding information over a channel, one can reach as small error probability as wished. Proof At the level of this introductory course, let us, for the sake of simplicity, prove the theorem only in the case of a BSC. The proof will in this case be constructive in the sense that we actually construct a code verifying the theorem. General proof of Shannons theorem extends to DMC with arbitrary input and output

170

CHAPTER 4. CODING FOR NOISY TRANSMISSION

alphabet, and arbitrary channel capacity. The main idea of the general proof is the same, namely to code messages randomly and to decode with nearest codeword decision. Diculties in the proof are caused mainly by the general form of the channel capacity when it is not a BSC. Interested readers can nd complete proofs for the general situation in [1] or [6]. Let us now proceed with the proof in the case of a BSC. The sketch of it is the following: 1. given R and , choose the appropriate codeword length n and number of codewords M = 2R n . 2. Then choose the M codewords z1 , ..., zM at random (as binary vectors of length n), without repetition. If VU , the number of possible binary messages to be transmitted, is bigger than M , then each input message will be split into pieces of at most n symbols which will be encoded separately (by one of the codewords zi ). 3. compute the decoding threshold r . This in fact corresponds to the maximum number of transmission errors the code is able to correct. 4. Use the following decoding procedure: if there exists one and only one codeword zi such that d(z, zi ) r , decode as zi . Here z denotes the message received after transmission, i.e. at the output of the channel, and d(x, y ) is the Hamming distance between two binary strings, i.e. the number of dierent symbols between x and y . otherwise decode as z1 . Such a coding scheme (called random coding) is enough to ensure Pb < (provide the right n have been chosen). Let us now be more precise and go in the details. First recall that the capacity of a BSC is C = 1 h(p), where p is the error probability of the channel and h(x) is the binary entropy function: h(x) = x log x(1x) log(1 x). Since 0 < R < C , we have 1 > 1 R > h(p), and therefore: , 0.5 > p, such that R < 1 h() as illustrated in this gure:

4.3. THE NOISY CODING THEOREM


h(x) 1R h(p) 0

171
~

x 0.5

h1(1R)

For such a , we have: lim p (1 p) e + 2n(R+h()1) = 0 n ( p)2

since R + h() 1 < 0. Therefore, for any given > 0, there exists n0 such that n n0 p (1 p) e + 2n (R+h()1) 2 n ( p)

For technical reasons that will appear clear later on, we also need to have n such that 1 n = max {q N : q n} > n p. This is the case provided that n > n1 = ( p) . To summarize up to this point, we get to the following result: p [0, 0.5) R, 0 < R < 1 h(p), > 0, n N (h and n > n p This is in fact true for all n > max {n0 , n1 } above dened. So we found one of the appropriate codeword length n. Let us proceed as explain in the beginning with M = 2n R codewords and r = n = max {m N : m n}. In this framework, for a given codeword zi , an error occurs when (see gure 4.5) 1 there have been more that r transmission errors: d(z, zi ) > r or 2 a) d(z, zi ) r and b) z C , z = zi : d(z, z ) r and c) i = 1 where C = {z1 , ..., zM } denotes the code. Therefore, the probability Perr (zi ) that a given codeword zi is incorrectly transmitted (including decoding) is bounded by Perr (zi ) = P (case 1) + P (case 2) P (d(z, zi ) > r ) + P (z C \ {zi } : d(z, z ) r )
1

(1 R), p) such that

p (1 p) e + 2n (R+h()1) 2 n ( p)

172

CHAPTER 4. CODING FOR NOISY TRANSMISSION

Let us now nd upper bounds for these two terms. d(z, zi ) is the number of transmission errors at the output of the BSC (i.e. before decoding). It is a binomial random variable with probability distribution P (d(z, zi ) = k nk . Thus, its average and variance respectively are: k) = n k p (1 p) E [d(z, zi )] = n p var(d(z, zi )) = n p (1 p)

Then, by Chebyshevs inequality we have (since r > n p)2 : P (|d(z, zi ) n p| r n p) Therefore: P (d(z, zi ) > r ) P (d(z, zi ) r ) = P (d(z, zi ) n p r n p) P (|d(z, zi ) n p| r n p) n p (1 p) (r n p)2 For the second term (P (z C \ {zi } : d(z, z ) r )), we have: P (z C \ {zi } : d(z, z ) r ) = P (d(z, z1 ) r or ... or d(z, zi1 ) r or d(z, zi+1 ) r or ... or d(z, zM ) r )
z C\{zi }

n p (1 p) (r n p)2

(4.32)

P (d(z, z ) r ) = (M 1) P (d(z, z ) r )

since there are M 1 codewords z such that z = zi . Moreover, for a given z C , the number of possible z such that d(z, z ) = k is equal to the number of binary strings of length n that dier from z in exactly k positions. n! Thus, this number is n k = k ! (n k ) ! Therefore, the total number of possible z such that d(z, z ) r is equal to
r k =0

n k 1 2n
r k =0

and thus P (d(z, z ) r ) = So we get:

n k
r k =0

P (z C \ {zi } : d(z, z ) r )

M 1 2n

n k

Moreover, as proven at the very end of this proof, we have for all r , 0 r n 2:
r k =0

n k

2n h( n ) ,

4.3. THE NOISY CODING THEOREM so (recall that M = 2nR and that r
n 2

173

since < 0.5) we nally found: 2nR 1 nh( r ) 2 n 2n e r 2n(R+h( n )1)

P (z C \ {zi } : d(z, z ) r )

(4.33)

Now, regrouping equations (4.32) and (4.33) together, we nd: Perr (zi ) n p (1 p) e r + 2n(R+h( n )1) 2 (r n p)

which, by the initial choices of r and n is smaller than . To conclude the proof we only have to notice that Pb = i).
1 n Perr (zi )

Perr (zi ) (for all

The only missing step in the above proof is the proof of the following technical result: n n N r N, 0 r 2 which is now given.
r k =0 r k =0

n k

2n h( n )

n k

=
k =0

n (k r ) k 1 if t 0 0 otherwise

where

(t) =

Thus, for all x, 0 < x 1 : (t) xt , and therefore:


r k =0

n k

k =0 r

n k r x k (1 + x)n xr

i.e.
k =0

n k

which is in particular true for x =


r k =0 r

r nr ,

with r n/2: (1 +
r n nr ) r r ( n r)
r r

n k n k

i.e.
k =0

2nlog(1+ nr )rlog( nr )

174 But:

CHAPTER 4. CODING FOR NOISY TRANSMISSION r r ) r log( ) nr nr r 1 r n = n log( log( ) r r) 1 n n 1 n r r r r = n log (1 ) log(1 ) n n n n r = n h( ) n n log(1 +

which concludes the proof. Control Question 43 Consider using a BSC with an error probability p (whose capacity is therefore C = 1 h(p)). In the following cases, tell if a code fullling the requirement could be constructed: channel p C code R Pb (in %) exists? 5% 0.801 3/4 1 2.2 10% 0.675 3/4 1 2.2

2/3 1 2.2

9/10 1 2.2

2/3 1 2.2

9/10 1 2.2

Answer If R < C we are sure that there exists a code with an output bit error probability Pb as small as we wish. On the other hand, if R > C we cannot have h(Pb ) < 1
C R.

Finally, if R > C and h(Pb ) 1 C R we cannot conclude since the situation could be possible (i.e. does not contradict the theorem) but we do not know enough about coding so as to be sure that such a code actually exists. So here are the conclusions: channel p C code R Pb (in %) R < C? 1 C/R h(Pb ) exists? p C code R Pb (in %) R < C? 1 C/R h(Pb ) exists? 5% 0.801 3/4 1 2.2 yes yes yes 10% 0.675 3/4 1 2.2 no 0.100 0.081 0.153 no maybe

2/3 2.2 yes yes yes 1

9/10 2.2 no 0.109 0.081 0.153 no maybe 1

channel

2/3 2.2 yes yes yes 1

9/10 2.2 no 0.250 0.081 0.153 no no 1

4.3. THE NOISY CODING THEOREM

175

Summary for Section 4.3 Shannons Noisy Coding Theorem: For all > 0 and all R < C , C being the capacity of a DMC, there exists a code, the transmission rate of which is R and the output bit error probability Pb of which is below . Conversely, all codes for which the transmission rate is above the channel ca1 pacity have a output bit error probability Pb greater than h 1 C R .

Summary for Chapter 4 Channel: input and output alphabets, transmission probabilities (pY |X =x ). DMC = Discrete Memoryless Channel. Channel Capacity: C = maxpX I (X ; Y ) Code/Transmission Rate: Rb =
logb |VU | n PX

Capacity of Input-symmetric channels: C = max [H (Y )] H (Y |X = xi ) for any of the xi VX . Capacity of Symmetric channels: C = log |VY | H (Y |X = xi ) for any of the xi VX . Data Processing Lemma: pZ |x,y (z ) I (X ; Y ) and I (X ; Z ) I (Y ; Z ) = pZ |y (z ) = I (X ; Z )

Fanos Lemma: h(Pe ) + Pe log2 (|VU | 1) H (U |U ) Shannons Noisy Coding Theorem: For all > 0 and all R < C , C being the capacity of a DMC, there exists a code, the transmission rate of which is R and the output bit error probability Pb of which is below . Conversely, all codes for which the transmission rate is above the channel ca1 pacity have a output bit error probability Pb greater than h 1 C R .

Historical Notes and Bibliography


This theorem was the bombshell in Shannons 1948 paper [10]. Prior to its publication, it was generally believed that in order to make communications more reliable it was necessary to reduce the rate of transmission (or, equivalently, to increase the signal-to-noise-ratio, as the 1947 engineers would have said). Shannon dispelled such myths, showing that, provided that the rate of transmission is below the channel capacity, increased reliability could be purchased entirely by increased complexity in the coding system, with no change in the signal-to-noise-ratio.

176

CHAPTER 4. CODING FOR NOISY TRANSMISSION

The rst rigorous proof of Shannons noisy coding theorem was due to Feinstein in 1954 [4]. A simpler proof using random coding was published by Gallager in 1965 [5]. The converse part of the theorem was proved by Fano in 1952, published in his class notes.

OutLook
The Noisy Coding Theorem lacks practical considerations, for it does not give concrete ways to construct good codes eciently. For instance, no indication of how large the codewords need to be for a given ; i.e. how complex the encoder and decoder need to be to achieve a given reliability. This is the reason why coding theory has become a so important eld: nding good error correcting codes, good in the sense that the error probability is low but the transmission rate is high, is indeed challenging.

Chapter 5

Module I1: Complements to Ecient Coding of Information

by J.-C. Chappelier

Learning Objectives for Chapter 5 In this chapter, we present several dierent complements to the basics of ecient coding, i.e. data compression. Studying this chapter, you should learn more about 1. how to perform optimal variable-to-xed length coding (Tunstall codes); 2. how to simply and eciently encode the integers with a prex-free binary code (Elias code); 3. some of the techniques used for coding sequences with inner dependencies (stationary source coding), as for instance the famous Lempel-Ziv coding.

177

178CHAPTER 5. COMPLEMENTS TO EFFICIENT CODING OF INFORMATION

5.1

Variable-to-Fixed Length Coding: Tunstalls Code

Learning Objectives for Section 5.1 In this section, you will learn: what variable-to-xed length coding is about; what proper message sets are and why are they useful; how the rst part of Shannon Noiseless Coding Theorem generalizes to variable-to-xed length coding of proper sets; what Tunstall message sets are; and what are they useful for: providing optimal variable-to-xed length coding; how to build such sets, i.e. Tunstall codes.

5.1.1

Introduction

The variable length codewords considered in chapter 2 are not always convenient in practice. If the codewords are, for instance, to be stored in memory, codewords the length of which is equal to the memory word length (e.g. 8, 16 or 32 bits) would certainly be preferred. However, it was precisely the variability of the codeword lengths that provided eciency to the codes presented in chapter 2! So the question is if it is possible to get similar coding eciency when all codewords are forced to have the same length? The answer is yes, provided that codewords are no longer assigned to xed length blocks of source symbols but rather to variable-length sequences of source symbols, i.e. variable-length segmentation of the source stream must be achieved. This is called Variable-to-Fixed Length coding: the D-ary codewords have all the same length n, but the length, LV , of the messages V to which the codewords are assigned, is a random variable. Since n/E [LV ] is the average number of D-ary code digits per source symbol, the optimality criterion of such a code becomes E [LV ], the average encoded message length; which should be made as large as possible.

5.1.2

Proper Sets

What properties should variable-to-xed length codes have? In order to be prex-free, the codewords should correspond to the leaves of a coding tree (see property 2.4). Furthermore, in order to be able to code any sequence from the source, the coding tree must be complete (see denition 2.8). If indeed the code is not complete, the sequence of symbols corresponding to unused leaves can not be encoded! A variable-to-xed length code is thus required to be a proper code ; i.e. its codewords should form a proper set.

5.1. VARIABLE-TO-FIXED LENGTH CODING: TUNSTALLS CODE

179

Denition 5.1 (Proper Set) A set of messages is a proper set if and only if it corresponds to the complete set of leaves of a coding tree.

Example 5.1 (Proper Set) The set The sets a, b, ca, cb

a, b, ca, cb, cc

is a proper set.

and aa, ac, b, cb, cc

are not proper sets.

Here are the corresponding coding trees:

a b

c a b c

a b

c a b c a b

a c

c a b c

Control Question 44 For each of the following sets, decide whether it is a proper set or not: 1. 010, 00, 1, 011 2. 000, 010, 001, 01, 1 3. 110, 010, 011, 10, 00 4. 110, 011, 00, 010, 111, 10 Answer 1. yes. 2. no: it is not prex free. 3. no: it is not complete. 4. yes.

Theorem 5.1 The uncertainty H (V ) of a proper set V for a D-ary discrete memoryless information source, the uncertainty of which is H (U ), satises: H (V ) = E [LV ] H (U ), where E [LV ] is the average encoded message length.

Proof The coding tree corresponding to a proper set is by denition a complete tree, and thus the entropy of each of its internal nodes is equal to the entropy of the source U .

180CHAPTER 5. COMPLEMENTS TO EFFICIENT CODING OF INFORMATION Then, from the leaf entropy theorem (theorem 2.2), we have: H (V ) = Pi H (U ) = Pi H (U )

et from the Path Length Lemma (lemma 2.1): H (V ) = E [LV ] H (U ) We now can see how Shannon Noiseless Coding Theorem (part 1) applies to proper sets of memoryless information sources:

Theorem 5.2 For any D-ary prex-free encoding Z of any proper message set V for a discrete memoryless information source U , the ratio of the average codeword length E [LZ ], to the average encoded message length, E [LV ], satises E [LZ ] H (U ) log D E [LV ] where H (U ) is the uncertainty of a single source symbol.

Proof From Theorem 5.1: H (V ) = E [LV ] H (U ) and from the Shannon Noiseless Coding Theorem (theorem 2.4): H (V ) E [LZ ] , log D thus: H (U ) E [ LV ] E [ LZ ] . log D

5.1.3

Tunstall message sets

Next section considers the eective procedure for building ecient variable-to-xed length codes, but we rst require another denition; which is the topic of this section.

Denition 5.2 (Tunstall message set) A set of messages is a Tunstall message set if and only if it is a proper set such that, in the corresponding coding tree, every node is at least as probable as every leaf.

5.1. VARIABLE-TO-FIXED LENGTH CODING: TUNSTALLS CODE Example 5.2 (Tunstall set)
0.7
0.7 0.49 0.21 0.21 0.3 0.09 0.063 0.027

181

0.3 0.21

0.49 0.343 0.240 0.103

0.147

This is not a Tunstall set since there is a leaf and an internal node such that the probability of that leaf (0.49) is bigger than the probability of that internal node (0.3).

This is a Tunstall tree since every internal node is more probable than every leaf.

Tunstall message sets are optimal variable length to block coding i.e. provide maximal average length of encoded messages as the following theorem states:

Theorem 5.3 A proper message set maximizes the average encoded message length (over all proper message sets) if and only if it is a Tunstall message set.

Proof Let us rst prove that is a proper set is not a Tunstall set then it cannot maximize the average encoded message length. Let W be a proper set that is not a Tunstall set. Therefore there exist in the corresponding coding tree a leaf w and an internal node n such that P (n ) < P (wi ) Consider then the coding tree consisting of moving the subtree below n to the leaf w (n thus becomes a leave and w an internal node). With the former example:

n*
0.7 0.49 0.3 0.49 0.21 0.21 0.09 0.063 0.027 0.343

0.7 0.3 0.09 0.21 0.21 0.147 0.063 0.027

0.103 0.044

The tree obtained by this operation still denes a proper set. Furthermore, since P (n ) < P (wi ) the probability of all the nodes of the moved subtree are greater in the new coding tree than in the original tree. Thus, the average encoded message length, which according to the Path Length Lemma is the sum of all the internal nodes probabilities, is greater for the new message set than for the former. Thus the former message set could not perform the maximum average encoded message length.

182CHAPTER 5. COMPLEMENTS TO EFFICIENT CODING OF INFORMATION Conversely, using a similar argument, a tree that is not maximal cannot be a Tunstall set: consider the upper node where it diers from a tree that is maximal.

5.1.4

Tunstall Code Construction Algorithm

Let n be the desired size of the codewords, and let DU and DZ respectively be the arity of the source U and the arity of the code Z . The maximum number of codewords is thus DZ n . We want to build a Tunstall set of size M for U (M DZ n ), i.e. we are looking (among other things) for a complete coding tree, i.e. with no unused leaves. Thus we must have M in the form of (see lemma 2.3) M = 1 + k (DU 1) for some k. To be optimal we are looking for maximal M (with M DZ n ). Thus, we must choose: k= This leads to the following algorithm: Tunstalls Algorithm for Optimum Variable-to-Fixed Length Coding 1. Check whether DZ n DU . If not, abort: variable-to-xed length coding is impossible. 2. Compute k =
DZ n 1 DU 1

DZ n 1 DU 1

3. Compute the encoded message set size M as M = 1 + k (DU 1) 4. Construct the Tunstall coding tree of size M (i.e. M leaves) by repeating k times (root included): extend (with DU branches) the most probable node It is easy to check that the obtained set is indeed a Tunstall set. 5. Assign a distinct DZ -ary codeword of length n to each leaf, i.e. to each message in the Tunstall message set. Example 5.3 DZ = 2, n = 3 (i.e. codeword = 3 bits) U ternary source (DU = 3) such that P (U = 1) = 0.3, P (U = 0) = 0.2 and P (U = 1) = 0.5. Then we have: k = (8 1)/2 = 3 and M =1+32=7 k loops: 1) extend the root

5.1. VARIABLE-TO-FIXED LENGTH CODING: TUNSTALLS CODE

183

2) extend the most probable node:

3) extend the most probable node:

Finally, aect codewords: V Z 1,1 000 1,0 001 1,-1 010 0 011 -1,1 100 -1,0 101 -1,-1 110

Control Question 45 We are interested in an optimal 4 bits binary variable-to-xed length code of the ternary source, the symbols probabilities of which are P (U = a) = 0.6, P (U = b) = 0.3 and P (U = c) = 0.1. 1. How many codewords has this code? 2. How many steps are required to build the tree? 3. How is the input message acaaabaaaabbaaabbbc segmented (i.e. split into parts to be encoded)? 4. How is acaaabaaaabbaaabbbc (same message) encoded, using the convention that the leaves are numbered according to increasing probabilities? Answer 1. M = 15 2. k = 7 3. ac,aaab,aaaa,bb,aaab,bb,c

184CHAPTER 5. COMPLEMENTS TO EFFICIENT CODING OF INFORMATION 4. 0111011000000101011001010100: 0111, 0110, 0000, 0101, 0110, 0101, 0100. Here is the corresponding coding tree:

a
.6000

b
.3000

c
.1000 0100

a
.3600

b
.1800

c
.0600 0111 .1800

a b

.0900 .0300 0101 1011

.2160 .1080 .0360 .1080 .0540 .0180 0001 1010 0010 1001 1101 .1296 .0648 .0216 0000 0110 1100

.1080 .0540 .0180 0011 1000 1110

Summary for Chapter 5 variable-to-xed length coding: encoding segment of variable length of the input information source with codewords that have all the same size proper set: a complete set of leaves of a coding tree (no useless leaf). Tunstall set: a proper set such that every node of the corresponding coding tree is at least as probable as any of its leaves. optimality of Tunstall sets: A proper set maximizes the average encoded message length if and only if it is a Tunstall set. Tunstall algorithm: k = probable node.
DZ n 1 DU 1

,M = 1 + k (DU 1), extend k times the most

5.2

Coding the Positive Integers

Learning Objectives for Section 5.2 In this section, we study how to simply and eciently encode the integers with a prex-free binary code (Elias code). Let us now turn to another, completely dierent, aspect of ecient coding: how to eciently represent integers with binary codes? In this section we describe a very clever binary prex-free code for the positive integers that was brought by Elias. Let us start from a usual code for integer the reader should be awared of: (most signicant bit) binary representation. Here is an example of this code, we call it here Z0 : n Z0 (n) 1 1 2 10 3 11 4 100 5 101 6 110 7 111 8 1000

5.2. CODING THE POSITIVE INTEGERS The length of these codewords is: |Z0 (n)| = log2 n + 1 which is pretty close to theoretical optimum in the most general case (log2 n).

185

However, this code suers a major drawback: it is far from being prex-free. In fact, every codeword Z0 (n) is the prex of innitely many other codewords! The rst idea brought by Elias was to add an encoding of the length |Z0 (n)| in front of Z0 (n) to make the code prex-free. The encoding proposed is to add |Z0 (n)| 1 zeros in front of Z0 (n). Here is an example of this code, we call Z1 : n Z1 (n) 1 1 2 010 3 011 4 00100 5 00101 6 00110 7 00111 8 0001000

Z1 is now a prex-free code. However, it length is far from being the desired one: it is twice as long: |Z1 (n)| = 2 log2 n + 1 The clever trick used by Elias to get rid of this drawback was to change the encoding of the length made by zeros for the Z1 encoding of this length. A codeword is thus made of the concatenation of Z1 (|Z0 (n)|) and Z0 (n). For instance, 7 is encoded into 111 with Z0 , having thus a length of 3. This leads to the prex Z1 (3) =011, and therefore 7 is encoded into 011111 (=011,111). Here are more examples: n Z2 (n) 1 11 2 01010 3 01011 4 011100 5 011101 6 011110 7 011111 8 001001000

Notice that Z0 (n) is always starting with a 1, which is now no longer required to avoid ambiguity. We thus can get ride of this useless 1. This leads us to the nal Elias encoding for integers, here denoted by Z2 : n Z2 (n) 1 1 2 0100 3 0101 4 01100 5 01101 6 01110 7 01111 8 00100000

This code is prex-free. What about its length? for the main part: |Z0 (n)| = log2 n + 1 and for the prex: |Z1 (|Z0 (n)|)| = 2 log2 ( log2 n + 1) + 1 Thus: |Z2 (n)| = 2 log2 ( log2 n + 1) + 1 + log2 n + 1 1 = log2 n + 2 log2 ( log2 n + 1) + 1 (5.1)

186CHAPTER 5. COMPLEMENTS TO EFFICIENT CODING OF INFORMATION It is remarkable enough that Elias found a binary prex-free code for integers, the length of which is quite close to the optimal log2 n, and which is furthermore easily implementable. Control Question 46 What is the Elias encoding for 255? Answer Z2 (255) =00010001111111. Indeed: Z0 (255) =11111111 and Z1 (|Z0 (255)|) = Z1 (8) =0001000

Summary for Chapter 5 Elias Codes are prex-free binary codes for integers, whose length is (asymptotically) close to optimal log2 (n). These codes result from the concatenation of a prex, made of the rst Elias code of the length of the usual binary representation of the number to be encoded, and a sux, made of the usual binary representation without its most signicant bit.

5.3

Coding of Sources with Memory

Learning Objectives for Section 5.3 After studying this session, you should know: 1. how rst part of Shannon Noiseless Coding Theorem generalizes to stationary sources; 2. several methods to perform compression of stationary sources message ows: (a) Elias-Willems Recency encoding; (b) Lempel-Ziv codes.

Let us nish this chapter by touching upon the dicult subject of coding sources with memory, i.e. with internal dependencies in the symbol sequences. We have see in chapter 2 that Human coding is optimal for coding a xed number of independent and identically distributed random variables (i.e. several independent occurrences of the same source). What about the case where the symbols are dependent? The sources considered in this section are thus stationary stochastic process (see denitions 3.1 and 3.2 in chapter 3); more precisely, sources that emit sequences U1 , U2 , U3 , . . . of D-ary random variables such that for every t 1 and every n 1, the

5.3. CODING OF SOURCES WITH MEMORY

187

random vectors (U1 , U2 , . . . , Un ) and (Ut+1 , Ut+2 , . . . , Un+n ) have the same probability distribution. Example 5.4 Let us consider as an example of a source with internal dependencies, the oscillatory source U consisting of a stationary binary source such that pU (0) = pU (1) = 0.5 and: P (Ui = 0|Ui1 = 0) = 0.01 P (Ui = 0|Ui1 = 1) = 0.99 P (Ui = 1|Ui1 = 0) = 0.99 P (Ui = 1|Ui1 = 1) = 0.01

(and no other longer term dependencies, i.e. P (Ui |Ui1 ...U1 ) = P (Ui |Ui1 ), otherwise at least one of the above equations could not hold). The entropy of a single symbol of this source is clearly H (U ) = 1 bit. What about the entropy rate?

= =
stationarity

i i

lim H (Ui |Ui1 Ui2 ...U1 )

lim H (Ui |Ui1 )

H (U2 |U1 ) P (U1 = 0) h(0.01) + P (U1 = 1) h(0.99) 2 0.5 h(0.01) 0.081 bit

= = =

where h(p) is the entropy of a binary random variable of parameter p: h(p) = p log(p) (1 p) log(1 p). For such discrete stationary sources, the rst part of the Shannon Noiseless Coding Theorem generalizes as:

Theorem 5.4 The average length E [LZ ] of a prex-free D-ary code Z for segments of length k of a stationary discrete source U veries: E [ LZ ] h (U ) log D k where h (U ) is the entropy rate of the source U as dened in theorem 3.1.

Proof The proof comes directly from the rst part of the Shannon Noiseless Coding Theorem and the properties of entropy rates. Indeed: H (U1 ...Uk ) k h (U ) E [LZ ] log D log D

188CHAPTER 5. COMPLEMENTS TO EFFICIENT CODING OF INFORMATION

5.3.1

Human coding of slices

One simple way to eciently encode stationary sources consists in segmenting the source symbol ow into sequences of xed length k and consider this new source, hereby designed as U (k) , for (memoryless) Human coding. For instance, if the original ow is 001010011010... and we take k = 3, we will consider the messages 001, 010, 011, etc... separately. The larger k, the more dependencies U (k) handles and the more ecient the corresponding Human coding will be. This solution is unfortunately requiring too much computation power to be used in practice for large k. Example 5.5 Here is the Human code of the oscillator source of example 5.4 with slices of size 3. U 000 001 010 011 100 101 110 111 P (U ) 0.5 0.01 0.01 = .00005 0.5 0.01 0.99 = .00495 0.5 0.99 0.99 = .49005 0.5 0.99 0.01 = .00495 0.5 0.99 0.01 = .00495 0.5 0.99 0.99 = .49005 0.5 0.01 0.99 = .00495 0.5 0.01 0.01 = .00005
0.50995 0.0199 0.01 0.00505 0.0001 000 111 0.0099 001 110 011 100 101 010

E [L] = 1 + 0.50995 + 0.0199 + 0.01 + 0.0099 + 0.00505 + 0.0001 = 1.5549 = 0.5183 per symbol

5.3.2

Elias-Willems Source Coding Scheme

The Elias-Willems scheme for coding source with memory is the following: 1. split the source symbol ow in blocks of size k (i.e. U (k) as in the former section) 2. code each block with the Elias code of its recency distance1 . The only piece in the above scheme we are still at this point lacking is the recency distance .
Elias-Willems is actually using recency rank, which is smaller than recency distance but does not aect the general results presented here.
1

5.3. CODING OF SOURCES WITH MEMORY

189

Denition 5.3 (Recency Distance) The Recency Distance of a symbol v at position n in a symbol sequence V is Rn (v ) = n Ln (v ), where Ln (v ) is the last index t (before n) such that Vt = v : Ln (v ) = max {t < n : Vt = v } Rn (v ) is thus the number of symbols received at time n since the last reception of v (before n). For recency distance to be dened for every possible n (even the rst ones), a convention needs to be chosen to give an initial index value to every possible symbols.

Denition 5.4 (Recency Distance Series) The recency distance series N associated with a random process V is the random process dened by Nn = Rn (Vn ).

Example 5.6 (Recency Distance) As rst example, consider the sequence 01011010 out of a binary source. The corresponding recency distances, with the convention that 0 has default initial index -1 and symbol 1 has 0, are then: vn 0 1 0 1 1 0 1 0 Rn (0) 2 1 2 1 2 3 1 2 Rn (1) 1 2 1 2 1 1 2 1 Nn 2 2 2 2 1 3 2 2

And the corresponding recency distance series is thus 2,2,2,2,1,3,2,2. For a second example, consider the source V the symbols of which are 2 bits words: 00, 01, ... and the convention that 00 has initial default index value -3, 01 -2, 10 -1 and 11 0. For the sequence 11,01,00,10,01,11,01,01,00, the recency distance series will thus be 1,4,6,5,3,5,2,1,6. Control Question 47 Considering a binary source with single bit symbols and the convention that 0 has initial default index -1 and 1 0, what is the recency distance series for the corresponding source sequence: 0001101000111? Answer 2,1,1,4,1,3,2,2,1,1,4,1,1.

190CHAPTER 5. COMPLEMENTS TO EFFICIENT CODING OF INFORMATION

Control Question 48 Considering a binary source with single bit symbols and the convention that 0 has initial default index -1 and 1 0, what is source sequence corresponding to the recency distance series: 1,3,1,3,1,3,2,2? Answer 10011010 Here comes now a property that will be useful to see how ecient the Elias-Willems scheme is. Property 5.1 If the source V is stationary and ergodica , then the symbol v appears on average every 1/pV (v ) times: E [Ni |Vi = v ] = E [Ri (v )] =
a

1 pV (v )

If you do not know what ergodic means, do not bother too much here, since almost all interesting real information sources are ergodic. If you want to know, look in any good book of probabilities, dealing with random process.

Proof Let kn (v ) be the number of occurrences of v in a sequence of length n from source V . The number of intervals between two consecutive repetitions of v is thus kn (v ) and the total length of these intervals is n. The average interval on a sequence of length n is thus n/kn (v ). When n grows to innity, because the source is ergodic, kn (v ) tends to innity as (n pV (v )) and thus E [Ri (v )] = lim 1 n = n kn (v ) pV (v )

Let us now prove that Elias coding of recency distance performs a coding that is asymptotically ecient, i.e. asymptotically to the theoretical optimum given by theorem 5.4.

Theorem 5.5 The expected length E [|Z2 (N )|] of a Elias-Willems encoding with blocks of size k of a stationary source U veries: hk (U ) 2 1 E [|Z2 (N )|] hk (U ) + log2 (k hk (U ) + 1) + k k k

Corollary 5.1 The expected length E [|Z2 (N )|] of a Elias-Willems encoding with blocks of size k of a stationary source U veries: E [|Z2 (N )|] = h (U ) k k lim

5.3. CODING OF SOURCES WITH MEMORY

191

Proof What is the expected length of a Elias-Willems code? E [|Z2 (N )|] =


vV

pV (v ) E [|Z2 (Ni (Vi ))||Vi = v ] =


vV

pV (v ) E [|Z2 (Ri (v ))|]

But, from equation 5.2 we have: E [|Z2 (Ri (v ))|] log2 (Ri (v )) + 2 log2 (log2 (Ri (v )) + 1) + 1 Using Jensen inequality we have: E [|Z2 (N )|]
vV

log2 (E [Ri (v )]) + 2 log2 (log2 (E [Ri (v )]) + 1) + 1

and from property 5.1 E [|Z2 (N )|]


vV

log2 (pV (v )) + 2
vV

log2 (1 log2 (pV (v ))) + 1

i.e., using Jensen inequality once again: E [|Z2 (N )|] H (V ) + 2 log2 (H (V ) + 1) + 1 Notice nally that H (V ) = H (U (k) ) = H (U1 , ..., Uk ) = k hk (U ), thus: 2 1 E [|Z2 (N )|] hk (U ) + log2 (k hk (U ) + 1) + k k k

Control Question 49 How is the sequence 100000101100 encoded using Elias-Willems scheme with k = 2 and the convention that 00 has initial default index value -3, 01 -2, 10 -1 and 11 0? Answer The answer is 01000110110101011010101. With k = 2, 100000101100 is split into 10,00,00,10,11,00 which corresponds to the recency distance series: 2,5,1,3,5,3 which is encoded, using Elias code for integers, into 0100, 01101, 1, 0101, 01101, 0101.

5.3.3

Lempel-Ziv Codings

The idea of the very popular Lempel-Ziv codings is quite similar to the former EliasWillems coding scheme: using ideas similar to the recency distance, it also aims at being universal, i.e. to work well for dierent kind of stationary source without precisely knowing all their statistical properties. There exist many variations of the basic Lempel-Ziv algorithm, using dictionaries, post-

192CHAPTER 5. COMPLEMENTS TO EFFICIENT CODING OF INFORMATION processing and many other improvements. Among the most famous variants we can cite: Name LZ77 Authors Lempel & Ziv (1977) Lempel & Ziv (1978) Storer & Szymanski (1982) Welch (1984) Method 1 character and 1 pair of xedsize pointers no dictionary same as LZ77 but with a dictionary (pointers in the dictionary) 1 xed-size pointer or 1 character (+ 1 indicator bit) no dictionary only xed-size pointers alphabet included in the dictionary

LZ78

LZSS

LZW

These algorithms are the most largely used compression algorithms in practice (e.g. in zip, compress, gzip, ...). The main reasons are because these algorithms perform good compression rate in a ecient manner. These algorithms are indeed linear time complex and do not require much memory. In this section, we focus on the core of these compression algorithms by presenting the simplest of them: LZ77. For this code, the codewords are tuples (i, j, u). i and j are integers and u is a source symbol. The codeword (i, j, u) represents a sequence of symbols that can be obtained from the current sequence by copying j symbols starting from i position back and adding the symbol u at the end. If i is null, j is ignored. Example 5.7 (LZ77 codeword) For instance, if the current decoded sequence is 10010, the codeword (3, 2, 1) represents the sequence 011: copying 2 symbols (01) starting from 3 positions back (10|010), and adding 1 at the end. If j is greater than i, the copying of character continues with the newly copied characters (i.e. the buer starting at i positions back is cyclic). For instance, if the current decoded sequence is 10010, the codeword (3, 5, 1) represents the sequence 010011: starting from three positions back (10|010) copying ve symbols: rst the thre symbols already existing at i = 3 positions back (010), leading to 10010|010, and going on with the next two characters, 01 from the newly added character, leading to 1001001001. The decoding nally ends adding the 1 (last element of the codeword) at the end, leading to 10010010011. In summary: 10010 + (3, 5, 1) = 10010010011.

5.3. CODING OF SOURCES WITH MEMORY Example 5.8 (LZ77 decoding) Here is an example of how the sequence (0,0,0) (0,0,1) (2,1,0) (3,2,1) (5,3,1) (1,6,0) is decoded: codeword (0,0,0) (0,0,1) (2,1,0) (3,2,1) (5,3,1) (1,6,0) cyclic buer 0 01 0101... 0 100 100100... 01 00101 00101... 0100101001 1 1111... added sequence 0 1 00 101 0011 1111110

193

complete decoded sequence 0 01 0100 0100101 01001010011 010010100111111110

The nal result is thus 010010100111111110. The corresponding coding algorithm, using a sliding window buer for remembering the past context, is the following: 1. look in the current context (i.e. the beginning of the sequence still to be encoded) for the shortest sequence that is not already in the buer; 2. remove the last character u of this unseen sequence and look for the closest corresponding sequence back in the buer; 3. emit the corresponding back position (i) and length (j ), followed by the removed last character (u) 4. update the buer (with the newly encoded sequence) and go back in 1 as long as there is some input. Here is an example of this encoding algorithm: Example 5.9 (LZ77 encoding) Consider the message 111111111101111011 to be encoded. At the beginning, since the buer is empty, the rst pointer pair must be (0,0) and the corresponding character is the rst character of the sequence. Thus the rst codeword is (0, 0, 1). Now the buer is updated into 1. With the convention used for j > i, we are now able to encode sequences of any length made only of 1. Thus the sequence considered for encoding, i.e. the shortest sequence starting after the last encoded sequence and not already in the (cyclic) buer is now: 1111111110. This sequence is encoded into the codeword (1, 9, 0): loop 1 step back in the buer, copy 9 characters from the buer (with repetitions since 9 > 1, and add the last part of the codeword, here 0. So the part of the message encoded so far (and the buer) is 11111111110, and the part still to be encoded is 1111011. Back to step 1: what is the shortest sequence to be encoded that is not contained in the buer? At this stage, it is 1111011. Removing the last char (1) we end up with 111101 which in the current buer correspond to i = 5 and j = 6 (using once again the cyclic aspect of the buer: i < j ).

194CHAPTER 5. COMPLEMENTS TO EFFICIENT CODING OF INFORMATION The corresponding codeword is thus (5, 6, 1). To summarize: 111111111101111011 is encoded into (0, 0, 1)(1, 9, 0)(5, 6, 1). Control Question 50 How is the sequence 100000101100 encoded using LZ77 algorithm? Answer The answer is (0,0,1) (0,0,0) (1,4,1) (2,2,1) (3,1,0)

Control Question 51 How is the sequence (0,0,0) (1,2,1) (2,5,0) decoded (assuming LZ77 encoding)? Answer 0001010100

5.3.4

gzip and bzip2

When I will have time, I will here add a few words on gzip and maybe on bzip2, compression algorithms which are very popular in GNU world. What I can now say in the only minute I have is that gzip is using a variation of the LZ77 algorithm combined with a post-processing of pointers using Human coding. Interested readers could refer to http://www.gzip.org/algorithm.txt for more details. Summary for Chapter 5 Shannon Noiseless Coding Theorem for prex-free codes of stationary source: E [LZ ] h (U ) log D k . Recency Distance Rn (v ) is the number of symbols received at time n till the last reception of v (before n). Elias-Willems Codes Elias coding of recency distance. Lempel-Ziv Algorithm LZ77: using a (cyclic) buer remembering past sequences, encode sequences with codewords consisting of one position back in the buer, one length and one character to be added at the end of the sequence encoded so far.

5.3. CODING OF SOURCES WITH MEMORY

195

Summary for Chapter 5 variable-to-xed length coding: encoding segment of variable length of the input information source with codewords that have all the same size proper set: a complete set of leaves of a coding tree (no useless leaf). Tunstall set: a proper set such that every node of the corresponding coding tree is at least as probable as any of its leaves. optimality of Tunstall sets: A proper set maximizes the average encoded message length if and only if it is a Tunstall set. Tunstall algorithm: k = probable node.
DZ n 1 DU 1

,M = 1 + k (DU 1), extend k times the most

Elias Codes are prex-free binary codes for integers, whose length is (asymptotically) close to optimal log2 (n). These codes result from the concatenation of a prex, made of the rst Elias code of the length of the usual binary representation of the number to be encoded, and a sux, made of the usual binary representation without its most signicant bit. Shannon Noiseless Coding Theorem for prex-free codes of stationary source: E [LZ ] h (U ) log D k . Recency Distance Rn (v ) is the number of symbols received at time n till the last reception of v (before n). Elias-Willems Codes Elias coding of recency distance. Lempel-Ziv Algorithm LZ77: using a (cyclic) buer remembering past sequence, encode sequences with codewords consisting of one position back in the buer, one length and one character to be added at the end of the sequence so far.

Historical Notes and Bibliography


In spite of its fundamental importance, Tunstalls work was never published in the open literature. Tunstalls doctoral thesis (A. Tunstall, Synthesis of Noiseless Compression Codes, Ph.D. thesis, Georgia Institute of Technology, Atlanta, GA, 1968), which contains this work, lay unnoticed for many years before it became familiar to information theorists. To be continued...

OutLook

196CHAPTER 5. COMPLEMENTS TO EFFICIENT CODING OF INFORMATION

Chapter 6

Module I2: Error Correcting Codes

by J.-C. Chappelier

Learning Objectives for Chapter 6 After studying this chapter, you should know more about error-correcting codes, and more precisely: 1. what is minimum distance of a code and how many errors a code of a given minimum distance can correct; 2. the basics of linear codes: how to construct such codes, how to encode and decode with linear code, ...; 3. what Hamming Codes are and how to use them; 4. the basics of cyclic codes; 5. and the basics of convolutional codes: encoder circuit, lattice representation, Viterbi algorithm for decoding.

Introduction
The fundamental Shannons Noisy Coding Theorem presented in chapter 4 provides theoretical bounds on the coding of messages for transmission over a noisy channel. Unfortunately, this important theorem (nor its proof) does not give any hint on how to actually build good error correcting codes in practice. This is the reason why a theory of error correcting codes has been developing for many years. This theory focuses mainly on the algebraic structure of codes. The basic idea is to give structure to the set of codewords in order to use this structure to provide hints for decoding messages in case of transmission error. 197

198

CHAPTER 6. ERROR CORRECTING CODES

Due to its strong mathematical grounding, algebraic coding theory is now well established as a full scientic eld on its own, with applications to many problems going beyond channel coding. The purpose of this chapter is certainly not to provide an exhaustive view of algebraic error correction codes (a whole book would hardly do it) but rather to introduce the key ideas of the domain. The reader interested in investing this topic further is referred to the rather vast literature of this eld. The study of this chapter requires a bit of mathematical background, especially in the eld of algebraic structures.

6.1

The Basics of Error Correcting Codes

Learning Objectives for Section 6.1 In this section, the following key points about error correcting codes are presented: 1. what is a block-code; 2. how distance between codewords is measured; 3. what is the weight of a codewords; 4. what are minimum distance and minimum weight of a code; 5. and how useful they are for determining how many errors a code can detect and/or correct.

6.1.1

Introduction

The context of the present chapter is noisy transmission as presented in the introduction of chapter 4. When a codeword zi is transmitted via a noisy channel and z is received, the transmission error corresponds to the dierence between z and zi : e = z zi . The key idea of algebraic coding is to add algebraic structure to the set of codewords such that the transmission error can easily be expressed in term of the operations dening this algebraic structure (starting from the above dierence operation). For instance, if we are dealing with binary codewords (e.g. 10011) the natural dierence on binary words is the bit by bit dierence (a.k.a. exclusive or for readers familiar with computer sciences), i.e. the dierence in each position such that there is a 0 whenever the two corresponding bits are the same and 1 when they are not: 0 - 1 = 1 and, as usual, 0 - 0 = 0, 1 - 1 = 0, 1 - 0 = 1. Example 6.1 (Binary Dierence) Here is an example of the dierence of two binary words: 11011 01101 = 10110 Technically speaking, the above dened operation actually corresponds to modulo 2



arithmetic, i.e. the Galois field GF(2) is considered and codewords are elements of the vector space GF(2)^n (where n is the length of the codewords). This framework easily extends to any D-ary codes using modulo D arithmetic.
Definition 6.1 (Block-Code) A D-ary block-code of length n is a nonempty subset of the vector space of n-tuples GF(D)^n (i.e. D-ary words of the same length n, considered as row vectors).
Example 6.2 (Block-Codes) The set {1101, 0110, 1110} is an example of a binary block-code. Another example of a block-code, considering ternary codes using the symbols 0, 1 and 2, such that 1 + 2 = 0 (i.e. GF(3) arithmetic), is given by the set {120, 201, 222, 010}. The set {011, 110, 10} is not a block-code since these words do not all have the same length.
Two important notions in the field of algebraic error correcting codes are now introduced: the Hamming distance between words and the weight of a word. How these notions relate to the problem of error correction is shown in the following section.
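To make the modulo-D difference concrete, here is a minimal Python sketch (not part of the original course material; the function name gf_diff is ours) of the component-wise difference over GF(D):

def gf_diff(a, b, D=2):
    # component-wise difference of two same-length words over GF(D)
    assert len(a) == len(b)
    return [(x - y) % D for x, y in zip(a, b)]

# Example 6.1 revisited: 11011 - 01101 = 10110 over GF(2)
print(gf_diff([1, 1, 0, 1, 1], [0, 1, 1, 0, 1]))       # [1, 0, 1, 1, 0]
# a ternary (GF(3)) difference: 120 - 201 = 222
print(gf_diff([1, 2, 0], [2, 0, 1], D=3))              # [2, 2, 2]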

6.1.2 Hamming Distance and Codeword Weight

Definition 6.2 (Hamming distance) The Hamming distance d(z1, z2) between two words z1 and z2 of the same length (i.e. two n-tuples in the most general case) is the number of symbols (i.e. positions) in which z1 and z2 differ.
Example 6.3 (Hamming distance) The Hamming distance between 101010 and 111010 is 1 since these two words differ only in the second position. The Hamming distance between 1010 and 0101 is 4 since these two words differ in all positions. The Hamming distance between 1010 and 111010 is not defined (and will not be considered).
The Hamming distance is indeed a metric (i.e. it satisfies the three axioms of a metric) for n-tuples (over a non-empty set!). The demonstration is left as an exercise.
Definition 6.3 (Codeword Weight) The weight of a word is the number of non-zero symbols in it.
Example 6.4 (Codeword Weight) The weight of 10110 is 3, whereas the weight of 00000000 is 0 and the weight of 001000 is 1.
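These two notions are straightforward to compute; the following small Python functions (illustrative sketches of ours, not from the course) implement Definitions 6.2 and 6.3 for words given as lists of symbols:

def weight(z):
    # number of non-zero symbols (Definition 6.3)
    return sum(1 for s in z if s != 0)

def hamming_distance(z1, z2):
    # number of positions in which two same-length words differ (Definition 6.2)
    assert len(z1) == len(z2)
    return sum(1 for a, b in zip(z1, z2) if a != b)

print(weight([1, 0, 1, 1, 0]))                                     # 3
print(hamming_distance([1, 0, 1, 0, 1, 0], [1, 1, 1, 0, 1, 0]))    # 1
print(hamming_distance([1, 0, 1, 0], [0, 1, 0, 1]))                # 4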



Property 6.1 The Hamming distance between two codewords is the weight of their difference: d(zi, zj) = w(zi − zj), denoting by d(·) the Hamming distance and by w(·) the weight.
Proof zi and zj differ in a position if and only if zi − zj is non-zero in that position.

Example 6.5 Here is an example of the equivalence between the Hamming distance of two binary words and the weight of their difference: d(10110, 11010) = 2 and w(10110 − 11010) = w(01100) = 2.
Here are now some useful properties of codeword weights.
Property 6.2 The weight of a codeword is always positive or null.
Trivial: by definition.
Definition 6.4 (null codeword) The null codeword is the codeword made only of zeros. It will be denoted by 0.
Property 6.3 The weight of a codeword is 0 if and only if the codeword is the null codeword 0.
Property 6.4 The weight is symmetric: for every codeword zi, w(zi) = w(−zi) (where −zi is the word in which each symbol is the opposite of the corresponding symbol in zi).
Example 6.6 (Weight symmetry) Considering ternary codes using the symbols 0, 1 and 2, such that 1 + 2 = 0 (i.e. GF(3) arithmetic), we have: w(−1202102) = w(2101201) = 5 = w(1202102).
Notice that in the binary case, the latter property is trivial since in that case −zi = zi for every zi (recall that in the binary case, −1 = 1).
Property 6.5 For every two codewords zi and zj, we have: w(zi) + w(zj) ≥ w(zi + zj).



Example 6.7 In the binary case, we have for instance: w(110101) + w(010101) = 4 + 3 = 7 ≥ 1 = w(100000) = w(110101 + 010101). Considering ternary codes using the symbols 0, 1 and 2 as above, we have for instance: w(01221021) + w(21002010) = 6 + 4 = 10 ≥ 5 = w(22220001) = w(01221021 + 21002010).

Proof These properties come directly from Property 6.2 and the fact that the Hamming distance is indeed a metric.
Control Question 52
1. What is the weight of 11101101?
2. What is the weight of 0?
3. What is the weight of 1?
4. What is the weight of 2?
5. What is the weight of 1221032?
6. What is the Hamming distance between 11 and 00?
7. What is the Hamming distance between 101 and 001?
8. What is the Hamming distance between 1234 and 3214?
Answer
Weights: 6; 0; 1; 1; 6. Hamming distances: 2; 1; 2.
Why are Hamming distance and weight of fundamental importance for algebraic error correction? In the framework defined in the previous section, where the error e that occurred in a transmission in which zi was emitted and z' received is defined as e = z' − zi, the number of errors that occurred in the transmission now appears to be simply the weight of e, i.e. w(z' − zi). Due to Property 6.1 this is the Hamming distance d(z', zi) between the emitted codeword and the one received. In this framework, detecting an error then simply means detecting a non-null weight. Correcting an error, however, implies being able to further compute the actual difference (without knowing zi, of course! Only z' is known at the reception).

6.1.3 Minimum Distance Decoding and Maximum Likelihood

How could a block-code decoder take its decision to decode a received word? A natural intuitive answer is to assume the smallest number of errors (i.e. w(e) minimum), which,



according to what has just been said, leads to taking the closest codeword (i.e. d(z', zi) minimum). For instance, if the only two possible codewords are 000 and 111 and 010 is received, we certainly would like (knowing nothing else) to have it decoded as 000.

Definition 6.5 (Minimum Distance Decoding) A code C is said to use minimum distance decoding whenever the decoding decision D consists, for any received word z', in choosing (one of) the closest codeword(s): D(z') = Argmin_{z ∈ C} d(z', z)

How is minimum distance decoding mathematically sound? In the case of the Binary Symmetric Channel (see example 4.1), this intuitively sensible minimum distance decoding procedure follows from Maximum Likelihood Decoding. Let us see the decoding procedure from a probabilistic (Bayesian) point of view. When a word z' is received, the most likely codeword to have been emitted (knowing that this word is received) is the codeword z which maximizes the probability P(X = z | Y = z') (where X is the input of the channel and Y its output). In practice this probability is not easy to cope with (or even to estimate, although this is possible) if the distribution P(X = zi) of the codewords at the emission (the so-called a priori probabilities) is not known. In this context, no further assumption is made but the least biased one that all the codewords are equally likely (maximum entropy assumption). It then comes:
Argmax_{z ∈ C} P(X = z | Y = z') = Argmax_{z ∈ C} P(Y = z' | X = z) · P(X = z) / P(Y = z')
                                 = Argmax_{z ∈ C} P(Y = z' | X = z) · P(X = z)
                                 = Argmax_{z ∈ C} P(Y = z' | X = z)

The last equality is obtained using the equally-likely-codewords hypothesis. The remaining term P(Y = z' | X = z) is the so-called likelihood, and it finally turns out that the decoder should decode z' by the most likely codeword, i.e. the codeword z that maximizes P(Y = z' | X = z). So what? Well, the last term P(Y = z' | X = z) appears to be much easier to handle than the first, P(X = z | Y = z'). For instance, in the case of a discrete memoryless channel (DMC, see 4.1) it turns into the product of the transmission probabilities for each symbol. In the further case of a BSC, where all symbols have the same error probability p, this probability simply becomes:
P(Y = z' | X = z) = p^d(z,z') · (1 − p)^(n − d(z,z'))



since there are exactly d(z, z') symbols which have been corrupted by the transmission and n − d(z, z') which have been transmitted correctly. It is then easy to see that the codeword z that maximizes P(Y = z' | X = z) is the one that minimizes the distance d(z, z'). This proves that for a BSC, minimum distance decoding and maximum likelihood decoding are equivalent.
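As an illustration, a brute-force minimum distance decoder can be sketched in a few lines of Python (an illustrative sketch of ours, not the reference implementation of the course):

def minimum_distance_decode(received, codewords):
    # choose (one of) the closest codeword(s); for a BSC this is also
    # the maximum likelihood decision
    def d(a, b):
        return sum(1 for x, y in zip(a, b) if x != y)
    return min(codewords, key=lambda c: d(received, c))

print(minimum_distance_decode("010", ["000", "111"]))   # '000'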

6.1.4 Error Detection and Correction

Is there a way to know a priori how many errors a given code can correct or detect? (By detecting an error we actually mean detecting but not correcting it, since correcting an error of course implies having detected it!) The answer is yes and relies mainly on an important characteristic of a block-code: its minimum distance.
Definition 6.6 (Minimum Distance of a Code) The minimum distance dmin(C) of a code C = {z1, ..., zi, ..., zM} is the minimum (non-null) Hamming distance between any two different codewords: dmin(C) = min_{i≠j} d(zi, zj).

The following results about error correction and detection illustrate why the minimum distance of a code is of central importance in error correcting coding theory.
Theorem 6.1 (Error-Correcting and Detection Capacity) A block-code of length n using minimum distance decoding can, for any two integers t and s such that 0 ≤ t ≤ n and 0 ≤ s ≤ n − t, correct all patterns of t or fewer errors and detect all patterns of t + 1, ..., t + s errors if and only if its minimum distance is strictly bigger than 2t + s.
Proof The implication is demonstrated ad absurdum: C cannot correct all patterns of t or fewer errors, or cannot detect all patterns of t + 1, ..., t + s errors, if and only if dmin(C) ≤ 2t + s.
If the code C cannot correct all patterns of t or fewer errors, this means there exists at least one codeword zi and one error pattern e of weight at most t such that the decoding D(zi + e) is not zi. Let us call zj the codeword decoded instead of zi in this case: zj = D(zi + e). Using the triangle inequality for the metric d, we have: d(zi, zj) ≤ d(zi, zi + e) + d(zi + e, zj). But d(zi, zi + e) = w(e) ≤ t and d(zi + e, zj) ≤ d(zi + e, zi) since the code is using minimum distance decoding. Thus d(zi, zj) ≤ t + t ≤ 2t + s



and therefore dmin(C), which is less than or equal to d(zi, zj), is also less than or equal to 2t + s.
If on the other hand the code can correct all patterns of t or fewer errors but cannot detect all patterns of t + 1, ..., t + s errors, there exists at least one codeword zi and one error pattern e of weight between t + 1 and t + s which is not detected but decoded into another codeword zj: D(zi + e) = zj. Introducing the error e' = zj − (zi + e), we also have D(zj + e') = zj, i.e. e' is an error that is corrected when applied to zj. Since w(e') = d(zj + e', zj) = d(zi + e, zj) ≤ d(zi + e, zi) (because of minimum distance decoding), we have both w(e') ≤ t + s and D(zj + e') = zj. Thus w(e') must be less than (or equal to) t. This allows us to conclude similarly to above: d(zi, zj) ≤ d(zi, zi + e) + d(zi + e, zj) ≤ (t + s) + t = 2t + s, and therefore dmin(C) ≤ 2t + s.
Thus if C cannot correct all patterns of t or fewer errors, or cannot detect all patterns of t + 1, ..., t + s errors, then dmin(C) ≤ 2t + s.
Conversely, if dmin(C) ≤ 2t + s, there exist two distinct codewords zi and zj such that d(zi, zj) ≤ 2t + s. This means that the weight of the vector z = zi − zj is also at most 2t + s. But any vector z of weight less than (or equal to) 2t + s can be written as the sum of two vectors e and f such that w(e) ≤ t and w(f) ≤ t + s: take the first (up to t) non-zero components as the components of e [or all the components if w(z) < t] and the remaining components (zero elsewhere) for f. For instance 011010 can be written as 010000 + 001010 (t = 1 and s = 1). Thus, there exist two errors e and e' (take e' = −f) such that w(e) ≤ t, w(e') ≤ t + s and zi − zj = e − e', i.e. zi + e' = zj + e. This means that two distinct codewords and two error patterns will be decoded the same way (since zi + e' = zj + e, D(zi + e') = D(zj + e)). This implies that (at least) either the error e applied to zj is not corrected (D(zj + e) ≠ zj), or the error e' applied to zi is not detected but wrongly decoded (D(zi + e') = zj). Thus not all patterns of t or fewer errors can be corrected, or not all patterns of t + 1, ..., t + s errors can be detected.
Property 6.6 (Maximum Error-Detecting Capacity) A block-code C using minimum distance decoding can be used to detect all error patterns of dmin(C) − 1 or fewer errors.
Proof Use t = 0 and s = dmin(C) − 1 in the above theorem.
Property 6.7 (Maximum Error-Correcting Capacity) A block-code C using minimum distance decoding can be used to correct all error patterns of (dmin(C) − 1)/2 (Euclidean, also called integer, division) or fewer errors, but cannot be used to correct all error patterns of 1 + (dmin(C) − 1)/2 errors.

Proof Use t = (dmin(C) − 1)/2 and s = 0 in the above theorem.


(dmin(C) − 1)/2 is furthermore the maximum t that can be used in the above theorem, since dmin(C) ≤ 2 · (1 + (dmin(C) − 1)/2) [Recall that / denotes the Euclidean division].
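The maximal correction capacity t, and the corresponding residual detection capacity s when t is chosen maximal, are easy to compute from dmin; here is a small Python sketch (the function name is ours) applying Theorem 6.1 and Property 6.7:

def capacity(d_min):
    # largest t (Property 6.7) and the residual detection capacity s
    # such that d_min > 2*t + s (Theorem 6.1)
    t = (d_min - 1) // 2
    s = d_min - 1 - 2 * t
    return t, s

print(capacity(8))   # (3, 1)
print(capacity(7))   # (3, 0)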

Example 6.8 A block-code having a minimum distance of 8 can be used to either:
- correct all error patterns of 3 or fewer errors and detect all patterns of 4 errors (t = 3, s = 1);
- correct all error patterns of 2 or fewer errors and detect all patterns of 3 to 5 errors (t = 2, s = 3);
- correct all error patterns of 1 error and detect all patterns of 2 to 6 errors (t = 1, s = 5);
- detect all patterns of 7 or fewer errors (t = 0, s = 7).
A block-code having a minimum distance of 7 can either:
- correct all error patterns of 3 or fewer errors (t = 3, s = 0);
- correct all error patterns of 2 or fewer errors and detect all patterns of 3 to 4 errors (t = 2, s = 2);
- correct all error patterns of 1 error and detect all patterns of 2 to 5 errors (t = 1, s = 4);
- detect all patterns of 6 or fewer errors (t = 0, s = 6).

Example 6.9 To be able to correct all patterns of 1 error (and that's it), a block-code must have a minimum distance at least equal to 3.
Control Question 53

1. A communication engineer wants to have a channel where all patterns of 3 or fewer errors are corrected. Can he use a block code with a minimum distance of 5?
2. How many errors can be corrected at most with a block code with a minimum distance of 6?
3. Can such a code furthermore detect errors? If yes, how many?
Answer
1. A block code with a minimum distance of 5 can at most correct all patterns of (5 − 1)/2 = 2 or fewer errors. The answer is thus: no.



2. A block code with a minimum distance of 6 can at most correct all patterns of (6 − 1)/2 = 2 or fewer errors.

3. In such a case 2t = 2 · 2 = 4, thus the code can furthermore detect all patterns of 3 errors (s = 1). Indeed: 6 > 2 · 2 + 1 (dmin(C) > 2t + s).

Summary for Chapter 6
block-code: a non-empty set of words of the same length, considered as row vectors.
weight: (of a word) the number of non-zero symbols.
Hamming distance: the number of coordinates in which two vectors differ. The Hamming distance between two words is the weight of their difference.
minimum distance decoding: error correction framework in which each received word is decoded into the closest (according to the Hamming distance) codeword.
maximum likelihood decoding: error correction framework in which each received word z' is decoded into (one of) the most likely codeword(s) z, i.e. a codeword such that P(Y = z' | X = z) is maximal (with X the input of the noisy channel and Y its output).
minimum distance of a code: the minimum (non-null) Hamming distance between any two (different) codewords.
error correcting and detecting capacity: a block-code C of length n using minimum distance decoding can, for any two integers t and s such that 0 ≤ t ≤ n and 0 ≤ s ≤ n − t, correct all patterns of t or fewer errors and detect all patterns of t + 1, ..., t + s errors if and only if its minimum distance dmin(C) is strictly bigger than 2t + s: dmin(C) > 2t + s if and only if C corrects t and detects t + s errors.

6.2 Linear Codes

Because of their properties and their simplicity, linear codes, which are studied in this section, are of major interest in error coding. One of their main advantages in practice is that they are easy to implement.

Learning Objectives for Section 6.2
In this section, the basics of linear codes are presented. What you should know about this topic is:
1. what an (n, m) D-ary linear code is;
2. that for such codes minimum distance and minimum weight are equivalent;
3. how to encode using a generator matrix;
4. what the systematic form generator matrix of a linear code is;
5. how to decode using the verification matrix and the syndrome table;
6. how to compute the minimum distance of a linear code;
7. what binary Hamming codes are and how to use them.


6.2.1 Definitions

Linear codes are block-codes on which an algebraic structure is added to help decoding: the vector space structure.

Definition 6.7 (Linear Code) An (n, m) D-ary linear code (1 ≤ m ≤ n) is an m-dimensional subspace of the vector space GF(D)^n of n-tuples over GF(D).

Looking at the denition of block-codes, the above denition could be rephrased: a linear code is a block-code which is a vector space. Example 6.10 (Linear Code) The code {1101, 0110, 1110} given in example 6.2 is not a linear code since 0000 is not part of it (it could therefore not be a vector space). The (binary) code {0000, 1101, 0110, 1011} is a linear code since any (binary) linear combination of codewords is also a codeword. It is furthermore a (4, 2) binary linear code since its dimension (i.e. the dimension of the vector subspace) is 2 and its length is 4. Notice further that the minimum distance of this code is 2 (easy to check) and therefore this code can only be used for single error detection (refer to theorem 6.1). Control Question 54 For each of the following binary codes, say whether or not this is a linear code. If yes, give the two numbers n and m of the denition: 1. C = {0000, 1000, 0001, 1001} 2. C = {1000, 0001, 1001}

3. C = {00, 01, 11}


4. C = {0000, 1000, 1100, 0100, 1101, 0001, 0101, 1001}
5. C = {0, 1, 10, 11}
6. C = {00, 11, 10, 01}
7. C = {0000, 0001, 0010, 0100}
Answer
1. Yes: any linear combination of codewords is also a codeword (it is enough to check that, for instance, the last codeword is the sum of the first two non-null ones). It is a (n = 4, m = 2) linear code.
2. No: this code does not contain the null codeword.
3. No (e.g. 01 + 11 = 10 is not a codeword).
4. Yes (the third column is always 0, the other three generate all 3-bit words). It is a (n = 4, m = 3) code.
5. No! This is not even a block-code since the codewords do not all have the same length.
6. Yes. n = 2 and m = 2.
7. No (e.g. 0001 + 0010 = 0011 is not a codeword).

6.2.2 Some Properties of Linear Codes

Property 6.8 Every linear code contains the null codeword 0.
This comes directly from the fact that a linear code is a vector (sub)space.
Let us now study these linear codes further. First of all, how many codewords does a (n, m) D-ary linear code contain?
Property 6.9 A (n, m) D-ary linear code contains D^m different codewords (including the null codeword).
Notice that this property can be used to quickly determine that codes with a wrong number of codewords (not a power of D) are not linear. For linear codes, it can also be used to quickly determine m (e.g. to guess the size of a basis).
Proof Since a linear code is a vector space of dimension m, every codeword is a linear combination of m basis codewords (one basis for this vector space).



For a (n, m) D-ary code, there are therefore exactly D^m different codewords: all the D^m linear combinations. What then is the transmission rate of such a code?

Property 6.10 The transmission rate of a (n, m) linear code is R = m/n.

Proof Recall that the transmission rate of a D-ary code encoding a source of M different messages with codewords of length n is R = log_D(M)/n. How many different messages could be encoded using a (n, m) linear code? As many as there are codewords, i.e. D^m. Thus, R = log_D(D^m)/n = m/n.

Let us now come to another useful property of linear code.

Theorem 6.2 (Minimum Weight and Minimum Distance Equivalence) For every linear code C, dmin(C) = wmin(C), where wmin(C) is the minimum weight of the code, i.e. the smallest weight of the non-zero codewords: wmin(C) = min_{z ∈ C, z ≠ 0} w(z)

This result is very important in practice since wmin(C) is much easier to compute than dmin(C).
Proof For a code C = {z1, ..., zi, ...}, we have by Definition 6.6 dmin(C) = min_{i≠j} d(zi, zj). Thus, using Property 6.1, dmin(C) = min_{i≠j} w(zi − zj). But if C is a linear code, for every two codewords zi and zj, zi − zj is also a codeword. It is furthermore the null codeword if and only if zi = zj (i.e. i = j). Thus dmin(C) ≥ wmin(C).
Conversely, for every codeword zi, w(zi) = d(zi, 0). Since 0 is always part of a linear code, we get: wmin(C) ≥ dmin(C), which concludes the proof.
Example 6.11 The (11, 3) binary code {00000000000, 10011110000, 01000111100, 00111001111, 11011001100, 10100111111, 01111110011, 11100000011} has a minimum weight of 5 and thus a minimum distance of 5.



This code can therefore correct all error patterns with 1 or 2 errors (cf. Property 6.7). [It is left as an exercise to check that this code is indeed a linear code.]
Control Question 55
What is the minimum distance of the following codes?
1. C = {0000, 1000, 0001, 1001}
2. C = {0000, 1000, 1100, 0100, 1101, 0001, 0101, 1001}
3. C = {00000000, 00001011, 00110000, 00111011, 11000000, 11001011, 11110000, 11111011}
4. C = {1000, 0001, 0010, 0100}
Answer
1. dmin(C) = wmin(C) = 1
2. dmin(C) = wmin(C) = 1
3. dmin(C) = wmin(C) = 2 (e.g. third codeword)
4. dmin(C) = 2!! Although wmin(C) = 1! This is a pitfall! This code is not a linear code, so the minimum weight theorem cannot be applied. Here the minimum distance is simply computed using its definition. It is easy to see that the distance between any two codewords is 2.
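Theorem 6.2 makes the computation of the minimum distance of a linear code very cheap; here is a short Python sketch (ours, for illustration only) applied to the binary code of Example 6.11:

def minimum_weight(code):
    # smallest weight of a non-zero codeword; equals d_min for a linear code
    return min(z.count('1') for z in code if '1' in z)

C = ["00000000000", "10011110000", "01000111100", "00111001111",
     "11011001100", "10100111111", "01111110011", "11100000011"]
print(minimum_weight(C))   # 5, hence d_min(C) = 5 and t = (5 - 1) // 2 = 2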

Control Question 56
How many errors can the following linear codes correct (at most)?
1. C = {0000, 1000, 1100, 0100, 1101, 0001, 0101, 1001}
2. C = {000000000, 000000111, 000111000, 000111111, 111000000, 111000111, 111111000, 111111111}
Answer
1. dmin(C) = wmin(C) = 1, thus this code can correct (1 − 1)/2 = 0 errors!! This is not a very useful code!
2. dmin(C) = wmin(C) = 3, thus this code can correct (3 − 1)/2 = 1 error.



6.2.3 Encoding with Linear Codes

The relationship linking messages to be encoded to corresponding codewords must now be made more explicit: it is time to see how to efficiently use linear codes to encode messages. If the m codewords chosen to be a basis of a (n, m) linear code (vector space) are denoted by z1, ..., zm, then any codeword zi can be written as
zi = Σ_{k=1}^{m} u_{i,k} · zk
where u_{i,k} is the component of zi on the basis vector zk. In a more compact way, using linear algebra, we have: zi = (u_{i,1}, ..., u_{i,m}) · G = ui · G, where ui is the row vector (u_{i,1}, ..., u_{i,m}) and G the matrix whose rows are z1, ..., zm. It is then very natural to choose to encode the m-symbol message ui by the codeword zi which results from the above multiplication by G. For this reason, the matrix G is called a generator matrix of the code.
Definition 6.8 (Generator Matrix) An m × n matrix G is said to be a generator matrix of a (n, m) linear code C if and only if its m row vectors are a basis of the vector space C. The encoding of a message u (of size m) is then done by z = u · G.

Example 6.12 (Generator Matrix) Let us go on with the (4, 2) binary linear code used in example 6.10: {0000, 1101, 0110, 1011}. This code, having four codewords, can encode four messages: the four binary words of two bits, u0 = 00, u1 = 10, u2 = 01, u3 = 11. Let us do this encoding using a generator matrix. One basis for this linear code could be z1 = 1101, z2 = 0110, which leads to
G = [ 1 1 0 1 ]
    [ 0 1 1 0 ]
u1 is then encoded into u1 · G = (1 0) · G = (1101) = z1

and similarly u2 into z2, u3 into 1011 and u0 into 0000. Notice that a linear code always encodes the null message with the null codeword 0. This is precisely due to the linear (i.e. vector space) nature of the code.
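Encoding with a generator matrix is just a matrix product over GF(D); here is a minimal Python sketch (the function name is ours) reproducing the computation of Example 6.12:

def encode(u, G, D=2):
    # codeword z = u . G over GF(D)
    n = len(G[0])
    return [sum(u[k] * G[k][j] for k in range(len(u))) % D for j in range(n)]

G = [[1, 1, 0, 1],    # z1
     [0, 1, 1, 0]]    # z2
print(encode([1, 0], G))   # [1, 1, 0, 1], i.e. the codeword 1101
print(encode([1, 1], G))   # [1, 0, 1, 1], i.e. the codeword 1011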
Remark: in most cases the choice of the basis is not unique; hence, for a given code, the generator matrix is not unique: it depends on the basis chosen for representing the code.



Using this way of encoding with a generator matrix, the actual encoding of messages is very easy to implement in practice. In the binary case for instance, only a few exclusive-or (XOR) gates (at most n · (m − 1), actually) can do the encoding.

6.2.4 Systematic Form of a Linear Code

Among all possible generator matrices, one is of special interest (if it exists): the systematic form.

Definition 6.9 (Systematic Form) A generator matrix G of a (n, m) linear code is said to be in systematic form when it is written as
G = [Im P] = [ 1 0 ... 0  p_{1,1} ... p_{1,n−m} ]
             [ 0 1 ... 0  p_{2,1} ... p_{2,n−m} ]
             [ ...                              ]
             [ 0 0 ... 1  p_{m,1} ... p_{m,n−m} ]
where Im is the identity matrix of size m and P an m × (n − m) matrix, often called the Parity Matrix.

Notice that, when it exists for a given code, the systematic generator matrix is unique.

Definition 6.10 (Systematic Linear Code) A linear code that uses a generator matrix in systematic form is called a systematic (linear) code.
When a (n, m) linear code uses a systematic form generator matrix, the first m of the n symbols of a codeword are exactly the symbols of the encoded message: zi = (u_{i,1} u_{i,2} ... u_{i,m} z_{i,m+1} ... z_{i,n}). In other words, systematic codes send first the message unencoded and then (n − m) encoding symbols used for error detection/correction.
Example 6.13 Recalling example 6.12, another choice for the basis vectors could have been z1 = 1011, z2 = 0110, leading to
G = [ 1 0 1 1 ]
    [ 0 1 1 0 ]

which is the systematic form generator matrix for this code. Example 6.14 (Parity-Check Bit) For binary messages, the Parity-Check Bit is the bit that corresponds to the parity of the message, i.e. the (binary) sum of its bits. For instance the parity-check bit for 01101 is 1 + 1 + 1 = 1 and the parity-check bit for 00101 is 1 + 1 = 0.



Parity-Check Bit encoding consists simply in sending rst the message as it is, follow by its parity-check bit. In terms of codes, this corresponds to the (m + 1, m) binary linear code, the generator matrix of which is 1 . G = Im . . 1 which is in systematic form. Notice that the minimum distance of this code is 2 (using Theorem 6.2), thus this code is only able to do single error detection (refer to Theorem 6.1). Control Question 57 For the following matrices 1. say if it could be a generator matrix. 2. if yes, say if it is in systematic form. 3. if the matrix is not in systematic form, give the systematic form matrix for the corresponding code. 4. (when it is a generator matrix) how will the message 1011 be encoded using the systematic form? 1 0 1. G = 0 1 1 0 2. G = 0 0 1 0 3. G = 1 1 1 1 0 0 0 1 0 0 1 1 0 0 0 0 1 1 0 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 1 0 1 1 1 1 1 0 1 0 1 0 0 1 1 1 0 0 1 1 0 1 0 0 0 1 0 0 1 1 0 0 0 1 0 1 0 1 Answer 1. This is not a generator matrix since the four rows are linearly dependent (their sum is zero) they cannot be a basis of a vector space. 2. Yes this matrix is a generator matrix (for a (7, 4) binary linear code). It is furthermore the generator matrix of this code since the four rst columns make the identity matrix. The encoding of 1011 with this code is 1011001.



3. Yes, this matrix is indeed a generator matrix. The four rows are four linearly independent vector: to have a zero in the next to last column you must have a zero coecient on the last line, then to have a 0 at the last column you need to annihilate the second row and then its easy to see that the rst and third row are linearly independent (using for instance the second column). This matrix is not in systematic form. To nd the systematic form generator matrix, we need to nd a basis of this linear code whose four rst coordinates form the identity matrix. There are several mathematical way to do this. Here is one rather basic: adding the fours rows, we get vector. 1 0 0 0 0 1 0 which can make the rst

Then adding this vector to the first row of the original matrix, we get 0 1 0 0 1 1 0, which is the second vector. Similarly, adding the first new vector to the last row of the original matrix leads to 0 0 1 0 1 0 1, which is the third vector we are looking for. And for the last one, we can add the second row of the original matrix, 0 1 0 1 0 0 1, to the new second vector, 0 1 0 0 1 1 0, to get 0 0 0 1 1 1 1. The systematic form generator matrix is therefore:
G = [ 1 0 0 0 0 1 0 ]
    [ 0 1 0 0 1 1 0 ]
    [ 0 0 1 0 1 0 1 ]
    [ 0 0 0 1 1 1 1 ]
The encoding of 1011 with this matrix is 1011000.

6.2.5 Decoding: the Verification Matrix

At this level, we know how to encode with linear codes. But what about decoding? How can errors be corrected? This is the whole story after all! Here is precisely where the linearity of linear codes will help. Suppose that a matrix F such that, for every codeword z, z · F = 0 has been found (as you will see in a while, this is not so difficult). Then, if an error e occurs during the transmission of z and z' = z + e is received, we have z' · F = (z + e) · F = z · F + e · F = 0 + e · F = e · F. This last result is very useful since z' · F is independent of the emitted codeword z and depends only on the error e. The result of this transmission error appears as a linear combination of the rows of F. In order to correct/detect the error, the vectors of the vector space generated by the rows of F simply need to be mapped to the corresponding correction (or detection message). It is this key idea that is formalized and studied a bit further now. For good mathematical reasons (orthogonality: G · H^T = 0), the above equation z · F = 0 is always given in the following form: z · H^T = 0

where ^T is the transpose operator and H = F^T.

Definition 6.11 (Verification Matrix) A (n − m) × n matrix H is a verification matrix for a (n, m) D-ary linear code C if and only if, for all z in GF(D)^n: z · H^T = 0 if and only if z ∈ C. In other words, a verification matrix for a code C is a matrix, the kernel of which is C. Notice that a given linear code might have several different verification matrices: any matrix, the rows of which are a basis of the vector space orthogonal to the linear code (recall: a linear code is a vector subspace), is a verification matrix for this code.
How to find verification matrices? In the case where the code is systematic, it is easy to find a verification matrix, as the following theorem shows:
Theorem 6.3 For a systematic (n, m) linear code, the systematic form generator matrix of which is G = [Im P], the matrix H = [−P^T I_{n−m}] is a verification matrix.
Proof For every message ui, the corresponding codeword is zi = ui · G = ui · [Im P]

i.e. (z_{i,1}, ..., z_{i,m}) = ui, (z_{i,m+1}, ..., z_{i,n}) = ui · P, and thus (z_{i,m+1}, ..., z_{i,n}) = (z_{i,1}, ..., z_{i,m}) · P.
Thus (z_{i,1}, ..., z_{i,m}) · (−P) + (z_{i,m+1}, ..., z_{i,n}) = 0, or in matrix form: zi · [−P^T I_{n−m}]^T = 0.
Therefore, we have found a matrix, [−P^T I_{n−m}], whose product with every codeword gives the null vector.


It is easy to see that the inverse construction leads to the result that every word x such that x · [−P^T I_{n−m}]^T = 0 verifies x = (x1, ..., xm) · G and appears therefore as a codeword. Notice that in the binary case (GF(2)), −P = P.
Example 6.15 Consider the systematic code C, the generator matrix of which is
G = [ 1 0 1 0 1 ]
    [ 0 1 1 1 1 ]
Then (n = 5 and m = 2)
H = [ −P^T I3 ] = [ 1 1 1 0 0 ]
                  [ 0 1 0 1 0 ]
                  [ 1 1 0 0 1 ]
is one possible verification matrix for C.
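The construction of Theorem 6.3 is easy to automate; the following Python sketch (our own illustration, assuming binary codes so that −P = P) rebuilds the verification matrix of Example 6.15 from its systematic generator matrix and checks that every row of G lies in its kernel:

def verification_matrix(G):
    # H = [P^T | I_(n-m)] from a binary systematic G = [I_m | P] (Theorem 6.3)
    m, n = len(G), len(G[0])
    P = [row[m:] for row in G]
    return [[P[i][j] for i in range(m)] + [1 if k == j else 0 for k in range(n - m)]
            for j in range(n - m)]

G = [[1, 0, 1, 0, 1],
     [0, 1, 1, 1, 1]]
H = verification_matrix(G)
print(H)   # [[1, 1, 1, 0, 0], [0, 1, 0, 1, 0], [1, 1, 0, 0, 1]]
# sanity check: every row of G is in the kernel of H (z . H^T = 0)
for z in G:
    assert all(sum(z[j] * h[j] for j in range(len(z))) % 2 == 0 for h in H)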

It has just been shown how easy it is to find the verification matrix when the systematic form generator matrix is known. What about the general case, when a generator matrix not in systematic form is used? A verification matrix H for a (n, m) D-ary linear code with generator matrix G, the rows of which are denoted by z1, ..., zm, can be constructed using the following procedure (a standard procedure in linear algebra used to construct orthogonal subspaces):
1. For i = m + 1 to n, choose zi as any GF(D)^n vector linearly independent of z1, ..., z_{i−1}.
2. Compute the inverse M^{−1} of the matrix M, the rows of which are z1, ..., zn.
3. Extract H^T as the last n − m columns of the matrix M^{−1}.
Example 6.16 Let us come back to example 6.12 and the (4, 2) code, one generator of which was given as:
G = [ 1 1 0 1 ]
    [ 0 1 1 0 ]
Let us first find two vectors linearly independent of the former. Choose for instance 1000 and 0100. Thus
M = [ 1 1 0 1 ]        M^{−1} = [ 0 0 1 0 ]
    [ 0 1 1 0 ]                 [ 0 0 0 1 ]
    [ 1 0 0 0 ]                 [ 0 1 0 1 ]
    [ 0 1 0 0 ]                 [ 1 0 1 1 ]
Finally, we have
H^T = [ 1 0 ]        i.e.   H = [ 1 0 0 1 ]
      [ 0 1 ]                   [ 0 1 1 1 ]
      [ 0 1 ]
      [ 1 1 ]
Control Question 58

Give one verication matrix for the linear code the systematic form encoding matrix is 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 1 0 1 G= 0 0 0 1 0 1 1 1 1 0 0 0 0 1 0 0 1 1 Answer 1 0 H= 1 0 Control Question 59 Is the word z =1001101 a codeword of the code, one 1 1 1 0 1 0 H= 0 1 1 0 0 1 0 0 1 1 0 0 Answer yes: z H T = 0 thus z is a codeword. verication matrix of which is 0 0 1 1 1 0 0 0 1 0 1 1 1 1 1 0 0 1 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1

6.2.6 Dual Codes

A verification matrix for a (n, m) D-ary linear code C is a (n − m) × n matrix H such that its kernel ker(H) is C: ker(H) = C. Furthermore, by the fundamental property of dimensions of vector spaces, dim(ker(H)) + rank(H) = n, i.e. dim(C) + rank(H) = n, or rank(H) = n − m. So the n − m rows of H generate an (n − m)-dimensional subspace of GF(D)^n, i.e. a (n, n − m) linear code. Thus H appears to be the generator matrix of a (n, n − m) linear code. This code is called the dual code of C (and vice versa).

Property 6.11 The dual code of a dual code of a code C is the code C itself.



Property 6.12 A generator matrix of a code is a verification matrix for its dual code, and conversely.

6.2.7 Syndromes

Let us now repeat the important key idea of linear codes. If z is the transmitted codeword and an error e occurs, the received word is then z' = z + e. If H is a verification matrix for the code used, then z' · H^T = (z + e) · H^T = z · H^T + e · H^T = 0 + e · H^T = e · H^T. This illustrates the important fact that z' · H^T depends only on the actual error pattern e and not at all on the transmitted codeword z. For this reason, this result z' · H^T is of particular importance for decoding. It is called the syndrome (of z' relative to H).

Denition 6.12 (Syndrome) The syndrome of a word z relative to a verication matrix H is the product z H T .

Property 6.13 The syndrome s = z H T of a received word z relative to the verication matrix H of a code C depends only on the transmission error e = z zi and not on the transmitted codeword zi ( zi C ).

Furthermore, if the error pattern e is decomposed into elementary errors ek (i.e. errors made of only one error on one single symbol), e = (e1, ..., en), then
s(z') = z' · H^T = e · H^T = Σ_k ek · hk
where hk is the k-th column of H: H = [h1, ..., hn]. To find the corrector (i.e. the opposite of the error), only the correctors corresponding to single errors need to be known and then summed up. Correcting can then simply be done by mapping the columns of H to correctors (stored in a memory) and adding those corresponding to the non-zero positions of the syndrome.
Example 6.17 (Syndrome-based Correction Table) Suppose that
H = [ 1 1 1 0 0 ]
    [ 0 1 0 1 0 ]
    [ 1 1 0 0 1 ]
is a verification matrix for the binary code used.

Then the following correctors can be derived from the columns of H :

Syndrome   Corrector
101        10000
111        01000
100        00100
010        00010
001        00001


This is simply obtained by listing the columns of H. Reordering the above table by syndromes in order to actually use it in practice (where only the syndrome is known), we get:
Syndrome   Corrector
000        00000
001        00001
010        00010
011        ?
100        00100
101        10000
110        ?
111        01000
Notice that:
1. The null syndrome always maps to no correction, due to Definition 6.11;
2. For 011 and 110, the corrector is not unique in this example: for instance 011 = 010 + 001 leads to 00011 (00001 + 00010), but 011 = 111 + 100 leads to another correction, 01100. This is due to the fact that the minimum distance of this code is 3 (see next section), and thus this code can only correct all the patterns with 1 error, but not all the patterns with 2 errors! These two syndromes actually correspond to two transmission errors.
In practice the correspondence table between syndromes and errors is stored in a memory, and the general mechanism for decoding (and correcting) a received message z' is the following:
1. Compute the syndrome s(z') = z' · H^T;
2. Get the correction c = −e (i.e. the opposite of the error) from the linear combination of correctors stored in the memory;
3. Decode by z = z' + c.
Example 6.18 (Decoding with a Linear Code) Let us go on with the last example (Example 6.17), the generator matrix of which is
G = [ 1 0 1 0 1 ]
    [ 0 1 1 1 1 ]



Suppose that u = 10 is to be transmitted. This message is encoded into z = 10101 with the above (5, 2) binary linear code. Suppose that z = 00101, i.e. that the rst bit has been corrupted. The syndrome computation gives s = z H T = 101 leading to the corrector e = 10000 (see correctors table given in Example 6.17). Thus the decoded codeword is z + e = 00101 + 10000 = 10101, which leads to the decoded message (the rst two bits, since a systematic code is being used): 10, which corresponds to the original message. Control Question 60 What was the original message matrix is 1 0 H= 1 0 The original message was 10001. Explanation: z H T = 0101 which correspond to the third column of H , thus one error occurred in the third position. Thus the emitted codeword was 100011001 (change the third bit), and since the code is systematic (see H ), the original message was 10001 (take the m = 9 4 = 5 rst bits). if you receive 101011001 and the code verication 1 1 0 0 0 1 0 1 1 1 1 1 0 0 1 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 ? 0 1

Answer
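The syndrome-based decoding mechanism described above can be sketched as follows in Python (an illustration of ours, using the single-error correctors of Example 6.17 and the transmission of Example 6.18):

def syndrome(z, H):
    # binary syndrome z . H^T (Definition 6.12)
    return tuple(sum(z[j] * h[j] for j in range(len(z))) % 2 for h in H)

H = [[1, 1, 1, 0, 0],
     [0, 1, 0, 1, 0],
     [1, 1, 0, 0, 1]]
# single-error correctors: the syndrome of an error in position k is column k of H
correctors = {tuple(h[k] for h in H): k for k in range(5)}

received = [0, 0, 1, 0, 1]          # codeword 10101 with its first bit flipped
s = syndrome(received, H)
if s != (0, 0, 0):
    received[correctors[s]] ^= 1    # flip the corrupted position back
print(received)                     # [1, 0, 1, 0, 1], decoded message: 10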

6.2.8 Minimum Distance and Verification Matrix

The general presentation of linear codes now ends with a useful result which allows one to compute the minimum distance of a code (and thus the maximum number of errors it can correct) directly from its verification matrix.

Theorem 6.4 (Verication Matrix and Minimum Distance) If H is a verication matrix for an (n, m) D-ary linear code C (with 1 m < n), then the minimum distance dmin (C ) of this code is equal to the smallest number of linearly dependent columns of H .

Proof For every vector z, z · H^T is a linear combination of w(z) columns of H. And by Definition 6.11, z ∈ C if and only if z · H^T = 0. Thus if z ∈ C then there exist w(z) columns of H that are linearly dependent; and conversely, if q columns of H are linearly dependent, there exists a codeword of weight q.


Thus wmin(C) is the minimum number of columns of H that are linearly dependent, and we conclude using Theorem 6.2.
For a binary linear code C with verification matrix H, this last result implies that: if H does not have a null column, dmin(C) > 1; if H furthermore does not have twice the same column, dmin(C) > 2.
Example 6.19 A binary linear code, one verification matrix of which is H = (1 1 1 0 1 0 1 1 0 0 0 1 0 0 1), has a minimum distance of 3.

Indeed, H has neither a null column nor twice the same column, so dmin(C) > 2. Furthermore, there is a set of three columns of H that is linearly dependent: for instance, h1, h3 and h5.

Property 6.14 (Singleton bound) For a (n, m) linear code C, dmin(C) ≤ n − m + 1.
Proof The columns of H are vectors in GF(D)^(n−m), so any set of n − m + 1 of these columns is linearly dependent. Therefore, using Theorem 6.4, dmin(C) ≤ n − m + 1.
Control Question 61
What is the minimum distance of a code, the verification matrix of which is H = 1 0 1 0 1 0 1 0 1 0 0 0 1 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0 1? How many errors can this code correct?
Answer
dmin(C) = 3, and thus t = 1.

6.2.9 Binary Hamming Codes

Let us now study further what good codes that can correct one error look like (in the binary case). Since we are looking for only one error, it is sufficient for the syndrome to indicate where this error occurred. The idea is to have a code such that the syndrome directly



indicates the position of the error; for instance as its binary representation (0...001 for the first position, 0...010 for the second, 0...011 for the third, etc.). Recall that a single error in position k leads to a syndrome which is the k-th column of H. The above idea thus leads, for the verification matrix, to constructing a matrix, the columns of which are the binary representations of their position (see example below). What is not clear yet is:
1. what dimensions should this verification matrix have?
2. does this construction actually lead to a verification matrix of a code?
3. can such a code actually correct all patterns of 1 error?
Regarding the first point, recall that the size of a syndrome of a (n, m) linear code is n − m. If the syndrome directly encodes the error position, it can represent 2^(n−m) − 1 positions. So no place is lost if the total number of positions to be represented (i.e. the length n of the codeword) is n = 2^(n−m) − 1. Besides the trivial case n = 3, m = 1, here are some possible sizes for such codes:
n          7    15    31    63    ...
m          4    11    26    57    ...
r = n − m  3     4     5     6    ...
and here are two examples of verification matrices (for n = 7 and n = 15):
H3 = [ 0 0 0 1 1 1 1 ]
     [ 0 1 1 0 0 1 1 ]
     [ 1 0 1 0 1 0 1 ]
H4 = [ 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 ]
     [ 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 ]
     [ 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 ]
     [ 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 ]

The second question above is easy to answer: yes, this construction actually leads to a linear code, since matrices constructed this way are of full rank (i.e. rank(H) = n − m): it is easy to construct the I_{n−m} identity matrix out of their columns (take the first, second, fourth, eighth, etc. columns). Thus the dimension of their kernel is m, leading to a (n, m) linear code.
Finally, to address the last question (can such a code actually correct all patterns of 1 error?), we have to compute its minimum distance. The verification matrices resulting from the above construction never have a null column nor the same column twice, so dmin(C) ≥ 3. Moreover, the first three columns (binary representations of 1, 2 and 3) are always linearly dependent. So the minimum distance of such codes is always 3. Thus such codes can correct all patterns of 1 error. Such codes are called (binary) Hamming codes.



Definition 6.13 (Hamming code) A Hamming code is a (2^r − 1, 2^r − r − 1) binary linear code (r ≥ 2), the verification matrix of which is
Hr = [ br(1)^T  br(2)^T  ...  br(n)^T ]
where bn(i) is the binary representation of i on n bits (e.g. b4(5) = (0101)).

Property 6.15 Every binary Hamming code can correct all patterns of one error.
Example 6.20 (Hamming code) Let us take r = 3 and construct the (7, 4) binary Hamming code. We have:
H3 = [ 0 0 0 1 1 1 1 ]
     [ 0 1 1 0 0 1 1 ]
     [ 1 0 1 0 1 0 1 ]

To nd one generator matrix, we look for 4 vectors z such that z H T = 0, for instance (easy to check, and there are many others): z1 = 1110000 z2 = 1101001 z3 = 1000011 z4 = 1111111

leading to
G = [ 1 1 1 0 0 0 0 ]
    [ 1 1 0 1 0 0 1 ]
    [ 1 0 0 0 0 1 1 ]
    [ 1 1 1 1 1 1 1 ]

Suppose now the message u = 1001 has to be sent. It is encoded into z = u · G = 0001111. Let us assume further that an error occurred on the third bit, so that z' = 0011111 is received. How will this be decoded? The syndrome is s(0011111) = z' · H^T = 011, i.e. 3 in binary, indicating that an error occurred on the third bit. So the result of the decoding is z' − 0010000 (error in the third position), which is 0001111, the codeword that was actually emitted.
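Since the syndrome of a Hamming code is the binary representation of the error position, decoding is particularly simple; here is a small Python sketch (ours, not from the course) reproducing the correction of Example 6.20:

def hamming74_decode(received):
    # correct a single error using the (7,4) Hamming code of Example 6.20:
    # the syndrome, read as a binary number, is the position of the error
    H = [[0, 0, 0, 1, 1, 1, 1],
         [0, 1, 1, 0, 0, 1, 1],
         [1, 0, 1, 0, 1, 0, 1]]
    s = [sum(received[j] * H[i][j] for j in range(7)) % 2 for i in range(3)]
    pos = 4 * s[0] + 2 * s[1] + s[2]
    z = list(received)
    if pos:
        z[pos - 1] ^= 1
    return z

print(hamming74_decode([0, 0, 1, 1, 1, 1, 1]))   # [0, 0, 0, 1, 1, 1, 1]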

Summary for Chapter 6 linear code: a block code which is a vector space (i.e. any linear combination of codewords is also a codeword).


An (n, m) D-ary linear code is an m-dimensional vector subspace of the n-dimensional vector space of D-ary words of length n.

minimum distance: for a linear code, the minimum distance of the code is equal to the minimum weight.
generator matrix: (of a (n, m) linear code) an m × n matrix, the rows of which are a basis of the code (they are thus linearly independent).
systematic form: an m × n generator matrix of a (n, m) linear code is said to be in systematic form if and only if its leftmost m × m submatrix is the identity matrix (of size m).
encoding: the encoding with linear codes is done by matrix multiplication: the word to be encoded u is multiplied by one chosen generator matrix G of the code, producing the codeword z = u · G. If the generator matrix is in systematic form, the first m symbols of the codeword are exactly the symbols of the message; thus only the n − m last symbols actually need to be computed.
verification matrix: a (n − m) × n matrix H is a verification matrix for a (n, m) linear code C if and only if, for all z in GF(D)^n: z · H^T = 0 if and only if z ∈ C. The verification matrix is very useful for decoding.
syndrome: the result of the product of a word by the verification matrix: s = z · H^T. The syndrome is used to determine the error to be corrected. It indeed corresponds to the linear combination of columns of H which precisely is the product of the error pattern by H^T.
binary Hamming codes: (2^r − 1, 2^r − r − 1) linear codes that can correct all patterns of 1 error; the verification matrix is given in the form of the binary enumeration of the columns.

6.3 Cyclic Codes

Learning Objectives for Section 6.3
After studying this section you should know:
1. what a cyclic code actually is;
2. how to (and why!) represent codewords using polynomials;
3. how to encode and decode cyclic codes using the generator polynomial.



6.3.1 Introduction

Although cyclic codes are the most widely used class of error correcting codes, only a very short introduction to this topic is presented here. Indeed, a detailed presentation of cyclic codes would be enough material for a whole course in itself, which is largely out of the scope of the present lectures. Interested readers who wish to study the subject more deeply should refer to the rather large literature on the domain.
Definition 6.14 A cyclic code is a linear code such that for every codeword zi with n symbols zi = z_{i,1} ... z_{i,n}, the word z_{i,2} ... z_{i,n} z_{i,1} resulting from a (left) cyclic permutation (also called shift) of the symbols of zi is also a codeword.

Notice that this denition implies that then any cyclic permutation of a codeword is also a codeword. Example 6.21 (Cyclic Code) The following (binary) linear code is a cyclic code: z1 = 000, z2 = 101, z3 = 011, z4 = 110 Conversely, the following code z1 = 000, z2 = 001, z3 = 010, z4 = 011 (which is linear) is not cyclic since, for instance, the cyclic permutation 100 of z3 is not a codeword. Cyclic codes are an important subclass of linear codes since they have many algebraic properties that simplify the encoding and decoding implementations. Control Question 62 For each of the following binary codes, say whether this is a cyclic code or not. 1. C = {0000, 1000, 0001, 1001} 2. C = {1000, 0001, 0100, 0010} 3. C = {000, 100, 010, 001, 110, 011, 111, 101} 4. C = {0000, 0001, 0010, 0011} 5. C = {00, 11, 10, 01} Answer 1. no. 2. no! This is not a linear code! A cyclic code must rst of all be a linear code. 3. yes.

4. no.
5. yes.


6.3.2 Cyclic Codes and Polynomials

In order to take this new constraint on codewords (cyclic permutations of codewords are also codewords) into account algebraically, a more complete algebraic structure than the vector space one (which copes with linearity) is required. The algebra of polynomials is precisely a good way to represent this new constraint. Indeed, suppose a codeword zi with n symbols zi = z_{i,1} ... z_{i,n} is represented by the polynomial zi(X) = z_{i,1} X^(n−1) + z_{i,2} X^(n−2) + ... + z_{i,n−1} X + z_{i,n}, i.e. the j-th symbol z_{i,j} of a codeword zi of size n is the coefficient of X^(n−j) in the corresponding polynomial zi(X). What is then X · zi(X) modulo (X^n − 1)? A bit of simple polynomial algebra directly shows that it indeed corresponds to the left-cyclic permutation of zi.
Proof Let us prove that the multiplication by X (a degree-one monomial) corresponds to the left-cyclic permutation.
X · zi(X) = X · Σ_{j=1}^{n} z_{i,j} X^(n−j)
          = Σ_{j=1}^{n} z_{i,j} X^(n−j+1)
          = Σ_{k=0}^{n−1} z_{i,k+1} X^(n−k)
          = z_{i,1} X^n + Σ_{k=1}^{n−1} z_{i,k+1} X^(n−k)
Working modulo (X^n − 1) simply means that X^n corresponds to 1, X^(n+1) corresponds to X, etc. Therefore z_{i,1} X^n mod (X^n − 1) equals z_{i,1}, and the above equation, modulo (X^n − 1), leads to
X · zi(X) = z_{i,1} + Σ_{k=1}^{n−1} z_{i,k+1} X^(n−k)
          = Σ_{k=1}^{n−1} z_{i,k+1} X^(n−k) + z_{i,1} X^(n−n)

which indeed corresponds to the codeword zi,2 ...zi,n zi,1 , the left-shift of zi . Since cyclic codes precisely deal with cyclic permutation of their codewords, polynomials seem to be a very appropriate way to represent them. This aspect will be emphasized further after a short example.



Example 6.22 (Modulo (X^n − 1) arithmetic) Here is a short example of a modulo (X^3 − 1) computation:
(X^2 + 1) · (X + 1) = X^3 + X^2 + X + 1
                    = 1 + X^2 + X + 1     mod (X^3 − 1)
                    = X^2 + X + (1 + 1)   mod (X^3 − 1)
                    = X^2 + X             mod (X^3 − 1)

since in binary 1 + 1 = 0. Example 6.23 (Polynomial Representation of Cyclic Code) Recall the binary cyclic code of the last example: z1 = 000z2 = 101z3 = 011z4 = 110 The polynomial representation of this code is: z1 (X ) = 0 z2 (X ) = 1 X 2 + 0 X + 1 = X 2 + 1 z3 (X ) = 0 X 2 + 1 X + 1 = X + 1 z4 (X ) = 1 X 2 + 1 X + 0 = X 2 + X Notice furthermore that X z2 (X ) = z3 (X ) mod (X 3 1), which express that z3 is the left-shift of z2 . Control Question 63 What is the polynomial representation of the following codewords: 1. 00000000 2. 10001 3. 0000001 4. 1111 Answer 1. 0 2. X 4 + 1 3. 1 4. X 3 + X 2 + X + 1
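The correspondence between multiplication by X modulo (X^n − 1) and the left cyclic shift is easy to check numerically; a tiny Python sketch (ours) on the codeword z2 = 101 of Example 6.23:

def times_X(z):
    # multiplication by X modulo (X^n - 1) = one left cyclic shift;
    # z is the list of coefficients, highest degree first
    return z[1:] + z[:1]

print(times_X([1, 0, 1]))   # [0, 1, 1]: X * (X^2 + 1) = X + 1 mod (X^3 - 1)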



Control Question 64 Considering the two codewords z1 and z2 of a cyclic code, what is z1 z2 in the following cases: 1. z1 = 010110, z2 = 000100 2. z1 = 1010, z2 = 0101 3. z1 = 11001, z2 = 01010 Answer 1. 011001, this one is easy since z2 corresponds to a 2-left shift. 2. 0 There are two ways to get this result. Either direct polynomial computation 1010 0101 = (X 3 + X ) (X 2 + 1) = X 5 + X 3 + X 3 + X = X 5 + X = X + X mod (X 4 1) = 0 or understanding that multiplying by 0101 means adding the 2-left shift (0100) with the message itself (0101 = 0100 + 0001), which in this case lead to 0 since the 2-left shift is the same as the message itself. 3. 11101

The condition dening cyclic codes can now be used to characterize further cyclic code using polynomial properties: Property 6.16 If z (X ) is the polynomial corresponding to a codeword z of a cyclic code of size n, then for any polynomial p(X ), p(X ) z (X ) mod (X n 1) is also a polynomial corresponding to a codeword of this code (left-shifts and linear combinations). Proof for any k, X k z (X ) mod (X n 1) is also a polynomial corresponding to a codeword of this code (left-shifts) a cyclic code is a linear code, so any linear combination of codewords is also a codeword. A cyclic code corresponds therefore to an ideal of the ring GF(X ). Theorem 6.5 For every (n, m) cyclic code C , there exist one polynomial gC (X ) of degree n m such that C = {gC (X ) p : p GF(X ), deg(p) < m} i.e. every codeword polynomial is a multiple of gC (X ) and conversely. In other words, the code C is generated by gC (X ). gC (X ) is actually called the generator of C .



Proof This simply comes from the fact that GF(X), like any polynomial ring in one variable over a field, is a principal ring: every ideal is principal, i.e. can be generated by a single element.
Coding a word u using a cyclic code C could then simply consist of sending z(X) = u(X) · gC(X). However, systematic form coding, i.e. coding in such a way that the first symbols correspond to the message itself, is often preferred. For a (n, m) cyclic code the procedure is then the following:
1. multiply the message polynomial u(X) by X^(n−m) (which in practice amounts to n − m left shifts of the message); notice that n − m is the degree of the generator;
2. divide X^(n−m) · u(X) by the generator g(X) and get the remainder r(X);
3. the encoding of u(X) is then z(X) = X^(n−m) · u(X) − r(X) (which is a multiple of g(X), the m higher-degree symbols of which correspond to the m symbols of u).
Example 6.24 (Systematic Coding with a Cyclic Code) Consider for instance the (7, 4) binary cyclic code z1 = 0000000, z2 = 0001011, z3 = 0010110, z4 = 0101100, z5 = 1011000, z6 = 0110001, z7 = 1100010, z8 = 1000101, z9 = 1010011, z10 = 0100111, z11 = 1001110, z12 = 0011101, z13 = 0111010, z14 = 1110100, z15 = 1101001, z16 = 1111111. This code has for generator g(X) = z2(X) = X^3 + X + 1. [It is left as an exercise to check that z2(X) is actually a generator of this code.]
Using this code, we want to transmit the message u = 1101, i.e. u(X) = X^3 + X^2 + 1. Let us first divide X^3 · u(X) = X^6 + X^5 + X^3 by g(X); the Euclidean division gives the quotient X^3 + X^2 + X + 1 and the remainder 1, i.e. X^3 · u(X) = (X^3 + X^2 + X + 1) · g(X) + 1. Thus the codeword is z(X) = X^3 · u(X) + 1 = X^6 + X^5 + X^3 + 1, which represents 1101001.
To summarize: the message 1101 is encoded as 1101001 by the above cyclic code. Notice that, as wanted, the first 4 bits are the bits of the original message u.
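The systematic encoding procedure (shift, divide, subtract the remainder) can be sketched in Python for the binary case as follows (our own illustration; polynomials are represented as coefficient lists, highest degree first), reproducing the encoding of Example 6.24:

def gf2_poly_mod(dividend, divisor):
    # remainder of a polynomial division over GF(2);
    # polynomials are coefficient lists, highest degree first
    r = list(dividend)
    for i in range(len(r) - len(divisor) + 1):
        if r[i]:
            for j, d in enumerate(divisor):
                r[i + j] ^= d
    return r[-(len(divisor) - 1):]

def cyclic_encode(u, g):
    # systematic encoding: z(X) = X^(n-m) u(X) - r(X)  (in GF(2), - is +)
    shifted = u + [0] * (len(g) - 1)
    return u + gf2_poly_mod(shifted, g)

g = [1, 0, 1, 1]                         # g(X) = X^3 + X + 1
print(cyclic_encode([1, 1, 0, 1], g))    # [1, 1, 0, 1, 0, 0, 1] -> 1101001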

Theorem 6.6 The generator of a cyclic code of length n is a factor of X n 1.



Proof Consider a cyclic code of length n, the generator of which is g(X), of degree r. Then X^(n−r) · g(X) is of degree n and can be written (Euclidean division by (X^n − 1)): X^(n−r) · g(X) = (X^n − 1) + q(X), with q(X) a polynomial of degree less than n. Since g(X) is a codeword of a cyclic code of length n, q(X) = X^(n−r) · g(X) mod (X^n − 1) is also a codeword of this code (Property 6.16). Thus (Theorem 6.5) there exists p(X) such that q(X) = p(X) · g(X). Coming back to the above equation, we have: X^(n−r) · g(X) = (X^n − 1) + p(X) · g(X), i.e. X^n − 1 = (X^(n−r) − p(X)) · g(X). Thus g(X) is a factor of X^n − 1.
Let us now see the converse:
Theorem 6.7 Every factor of X^n − 1 of degree r is the generator of a (n, n − r) cyclic code.
Proof Let g(X) be a factor of X^n − 1 of degree r. The n − r polynomials g(X), X·g(X), ..., X^(n−r−1)·g(X) are all of degree less than n. Furthermore, any linear combination of these n − r polynomials is also a polynomial of degree less than n, i.e. g(X), X·g(X), ..., X^(n−r−1)·g(X) is a basis of a vector subspace of GF(X). Thus g(X), X·g(X), ..., X^(n−r−1)·g(X) generates a (n, n − r) linear code. Is this code cyclic?
Let z(X) = z0 + z1·X + ... + z_{n−1}·X^(n−1) be one codeword of this code. Then X · z(X) = z0·X + z1·X^2 + ... + z_{n−1}·X^n = z_{n−1}·(X^n − 1) + z_{n−1} + z0·X + z1·X^2 + ... + z_{n−2}·X^(n−1) = z_{n−1}·(X^n − 1) + y(X), where y(X) is a left shift of z(X). But z(X) is a multiple of g(X) since it is a codeword of the linear code generated by g(X), X·g(X), ..., X^(n−r−1)·g(X), and X^n − 1 is also a multiple of g(X) by assumption. Thus y(X) = X · z(X) − z_{n−1}·(X^n − 1) appears also to be a multiple of g(X), i.e. is an element of the subspace generated by g(X), X·g(X), ..., X^(n−r−1)·g(X). Therefore any left shift of a codeword is also a codeword, i.e. the code is cyclic.
Control Question 65
How are the following messages encoded in systematic form with a code, the generator

of which is g(X) = X^6 + X^3 + 1:
1. 000
2. 111
3. 101
Answer
1. 000000000


The null word is always encoded into the null codeword. The only problem here is to know the size of the codewords: since the degree of the generator is 6 and the length of the input messages is 3, the size of the codewords is 6 + 3 = 9.
2. 111111111
(X^2 + X + 1) · X^6 = X^8 + X^7 + X^6, and X^8 + X^7 + X^6 = (X^2 + X + 1) · g(X) + X^5 + X^4 + X^3 + X^2 + X + 1. Thus X^8 + X^7 + X^6 − (X^5 + X^4 + X^3 + X^2 + X + 1) = X^8 + X^7 + X^6 + X^5 + X^4 + X^3 + X^2 + X + 1, which is the codeword.
3. 101101101
(X^2 + 1) · X^6 = X^8 + X^6, and X^8 + X^6 = (X^2 + 1) · g(X) + X^5 + X^3 + X^2 + 1. Thus X^8 + X^6 − (X^5 + X^3 + X^2 + 1) = X^8 + X^6 + X^5 + X^3 + X^2 + 1, which is the codeword.
There is actually a faster way to encode with this particular code: looking at the generator, you might see that this code simply consists in repeating the message three times!

6.3.3 Decoding

We know how to encode messages with cyclic codes. What about decoding then? The decoding process is similar to the framework used of linear codes in general: 1. rst compute a syndrome from the received word (which depends only on the error, not on the emitted codeword, and which is null when the received word is a codeword) 2. then deduce the corrector (i.e. the opposite of the error) 3. Finally, apply the corrector to the received codeword. The construction of the syndrome of a word z (X ) is simple: it is the remainder of the division of z (X ) by the generator g(X ) of the code.


Indeed, we know that every codeword z(X) is a multiple of g(X). Thus the remainder of z(X) + e(X) (w.r.t. g(X)) is the same as the one of e(X):
z(X) = a(X) g(X)
e(X) = b(X) g(X) + s(X)
z(X) + e(X) = [a(X) + b(X)] g(X) + s(X)
with deg(s(X)) < deg(g(X)).
It is also clear from the above construction that the syndrome s(X) is null if and only if z(X) is a codeword (i.e. a multiple of g(X)).
The correctors, corresponding to all the non-null syndromes, can be obtained by division by g(X). Notice that for a single error X^i of degree i less than n - m (the degree of g(X)), the syndrome is simply X^i; for the single error X^(n-m), the syndrome is X^(n-m) - g(X).

Example 6.25 (Decoding with cyclic code) Let us continue with the previous example: the message 1101 has been encoded into 1101001 and is now transmitted over a noisy channel. Suppose that the second symbol has been flipped, i.e. that we receive 1001001. What is the corresponding decoded word?
1001001 corresponds to z(X) = X^6 + X^3 + 1, the division of which by g(X) gives:
X^6 + X^3 + 1 = (X^3 + X) g(X) + X^2 + X + 1.
Thus the syndrome is here X^2 + X + 1. The corrector/syndrome table for g(X) = X^3 + X + 1 is the following:

syndrome          corrector
1                 1
X                 X
X^2               X^2
X + 1             X^3
X^2 + X           X^4
X^2 + X + 1       X^5
X^2 + 1           X^6

[The first four rows have been obtained using the above remarks, the last three by division of the error by g(X).]
Thus we find that the corrector has to be X^5, and the decoded word is finally z(X) = X^6 + X^3 + 1 + X^5 = X^6 + X^5 + X^3 + 1, i.e. 1101001. Since a systematic code is being used, the first 4 symbols of this codeword are the 4 bits of the original message: 1101.
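The syndrome computation and table lookup of this example can also be sketched in Python (reusing the gf2_mod helper shown earlier; the names and the single-error assumption are illustrative, not part of the text):

```python
def cyclic_syndrome(word_bits, g):
    """Syndrome = remainder of the received word, seen as a polynomial, divided by g."""
    w = int("".join(map(str, word_bits)), 2)
    return gf2_mod(w, g)

def cyclic_correct(word_bits, g, n):
    """Correct a single error, using a syndrome -> corrector table built from X^i mod g."""
    table = {gf2_mod(1 << i, g): 1 << i for i in range(n)}   # syndrome of each single error X^i
    s = cyclic_syndrome(word_bits, g)
    if s == 0:
        return word_bits                                     # already a codeword
    w = int("".join(map(str, word_bits)), 2) ^ table[s]      # add the corrector
    return [int(b) for b in format(w, f"0{n}b")]

# Example 6.25: received 1001001, g(X) = X^3 + X + 1 -> corrected to 1101001, message 1101
print(cyclic_correct([1, 0, 0, 1, 0, 0, 1], 0b1011, 7)[:4])  # [1, 1, 0, 1]
```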


Control Question 66
Consider the (7, 4) cyclic code, the generator of which is g(X) = X^3 + X^2 + 1. How will the following received words be decoded (provided that systematic form coding was used)?
1. 1001011
2. 1011001
3. 0000001

Answer
1. 1001. Indeed, 1001011 is a codeword: 1001011 = X^6 + X^3 + X + 1 = (X^3 + X^2 + X + 1)(X^3 + X^2 + 1). Since systematic form coding was used, the first four bits are the message.
2. 1010. 1011001 = X^6 + X^4 + X^3 + 1 = (X^3 + X^2) g(X) + X^2 + 1, but X^2 + 1 = g(X) + X^3; thus there is an error pattern of weight 1 (e = X^3) whose syndrome is X^2 + 1. Minimum distance decoding thus leads to X^6 + X^4 + X^3 + 1 + X^3 = X^6 + X^4 + 1 = 1010001, whose first four bits are 1010.
3. 0000 (from the corrected codeword 0000000).

Summary for Chapter 6
Cyclic Code: a linear code such that any (symbol) shift of any codeword is also a codeword.
Polynomial Representation: z = z_1...z_n is represented by the polynomial z(X) = z_1 X^(n-1) + z_2 X^(n-2) + ... + z_{n-1} X + z_n, i.e. the j-th symbol z_j of a codeword z of size n is the coefficient of X^(n-j) in the corresponding polynomial z(X). The multiplication by X (degree-one monomial) corresponds to the (one-position) left shift. All operations are made modulo X^n - 1.
Generator: for every cyclic code, there exists one polynomial such that every codeword's polynomial representation is a multiple of it, and conversely.
Systematic Form Encoding: the encoding method such that the m first symbols of a codeword are exactly the m symbols of the encoded message.


For cyclic codes, systematic form encoding is achieved through the following steps:
1. multiply the message polynomial u(X) by X^(n-m);
2. divide X^(n-m) u(X) by the generator g(X) and get the remainder r(X);
3. encode u(X) by z(X) = X^(n-m) u(X) - r(X).

Decoding: the decoding process is similar to the framework used for linear codes in general:
1. compute the syndrome of the received word: it is the remainder of the division of this word by the generator of the code;
2. then deduce the corrector from a precomputed mapping of syndromes to correctors (the division of the corrector by the generator gives the syndrome as the remainder);
3. finally, apply the corrector to the received word and decode the original message as the first m symbols of the corrected codeword (provided that systematic encoding has been used).

6.4

Convolutional Codes

Learning Objectives for Section 6.4
After studying this section, you should know:
1. what convolutional codes are;
2. how encoding of such codes is achieved;
3. what a state and a state diagram are;
4. what the lattice representation associated with a convolutional code is;
5. how to use the Viterbi algorithm on lattices to do minimum distance decoding;
6. how to compute the minimal distance of a convolutional code.

6.4.1

Introduction

In this section, a non-block error coding framework is considered: convolutional codes. Convolutional codes differ from block codes in that the coding mechanism keeps memory of the encoded symbols. In one sense, convolutional codes can appear as unbounded block codes, i.e. block codes with infinite block size. However, there is a significant difference in the design of these coding/decoding techniques. Furthermore, convolutional codes have been found much superior to block codes in many applications.



[Figure: shift-register circuit with input u_i, memory cells u_{i-1} and u_{i-2}, two modulo-2 adders producing z_{2i-1} and z_{2i}, and serialization of these two outputs.]

Figure 6.1: A first example of a convolutional encoder. Each message symbol u_i is encoded into two codeword symbols z_{2i-1} and z_{2i}.

6.4.2

Encoding

The starting point of a convolutional code is the encoder. Rather than beginning with precise definitions and a general analysis of convolutional codes, we prefer to start with a simple example that still contains the main features of convolutional coding. The encoder of the example chosen for this section is depicted in Figure 6.1.
At each time step i, one message symbol u_i enters the encoder and two codeword symbols z_{2i-1} z_{2i} are emitted; i.e. u = (u_1, ..., u_i, ...) is encoded into z = (z_1, z_2, ..., z_{2i-1}, z_{2i}, ...). The rate of this code is thus 1/2. The message symbols u_i and the codeword symbols z_j considered here are all binary digits. The additions shown in Figure 6.1 are binary additions (i.e. exclusive-or).
More formally, the encoder depicted in Figure 6.1 can be written as
z_{2i-1} = u_i + u_{i-2}   (6.5)
z_{2i} = u_i + u_{i-1} + u_{i-2}   (6.6)
i.e. u_i -> (u_{i-2} + u_i, u_{i-2} + u_{i-1} + u_i). These equations can be viewed as a discrete convolution of the input sequence with the sequences 1, 0, 1, 0, 0, ... and 1, 1, 1, 0, 0, ..., respectively. This explains the name "convolutional code".
However, neither the above equations nor Figure 6.1 fully determine the codewords, since the values of u_{i-2} and u_{i-1} are required. What are they at time i = 1, i.e. what is the initial state of the system? The convention is that they are always null, i.e. u_{-1} = u_0 = 0. To ensure that this is always the case, i.e. that whenever a new message has to be encoded the initial state of the encoder is 0 in all memories, the encoding of a former message must leave the encoder in this null state. Thus the encoding of every message must contain enough zeros at the end so as to ensure that all the memories of the system have returned to 0. In the case of the encoder presented in Figure 6.1, this means that the encoding of every message will finish by encoding two more zeros.

Example 6.26 (Coding with convolutional code) Suppose we want to encode the message u = 101 using the encoder depicted in Figure 6.1. How does it work?


Let us trace all the components of the encoding:

i    u_i    State (u_{i-1} u_{i-2})    z_{2i-1} z_{2i}
1    1      00                          11
2    0      10                          01
3    1      01                          00
4    (0)    10                          01
5    (0)    01                          11

The corresponding codeword is thus z = 1101000111. The last two lines correspond to the two zero bits that must be fed into the encoder at the end of every message so as to put the encoder back into its initial state.

Control Question 67
Consider the convolutional code, the encoder of which is described by the following diagram:

[Figure: encoder circuit with one input line u_i, three memory cells u_{i-1}, u_{i-2}, u_{i-3}, and four output lines z_{4i-3} = u_{i-1} + u_{i-2}, z_{4i-2} = u_{i-1} + u_{i-3}, z_{4i-1} = u_i + u_{i-3}, z_{4i} = u_i + u_{i-2} + u_{i-3}, serialized into z_{4i-3} z_{4i-2} z_{4i-1} z_{4i}.]

1. How many zeros must be added after each word to encode?
(a) 1  (b) 2  (c) 3  (d) 4  (e) 5
2. How is 10110 encoded?
(a) 0011110010101000100111100111
(b) 10110
(c) 00111100101010000101111001110000
(d) 0010010110011010110001001001
(e) 11110000111111110000000000000000
(f) 11000011010100011010011111100000

Answer

1. (c)
2. (c): 00111100101010000101111001110000

i:          1  2  3  4  5  6  7  8
u_i         1  0  1  1  0  0  0  0
u_{i-1}     0  1  0  1  1  0  0  0
u_{i-2}     0  0  1  0  1  1  0  0
u_{i-3}     0  0  0  1  0  1  1  0
z_{4i-3}    0  1  1  1  0  1  0  0
z_{4i-2}    0  1  0  0  1  1  1  0
z_{4i-1}    1  0  1  0  0  1  1  0
z_{4i}      1  0  0  0  1  0  1  0


Let us now wonder what, in general, is the code generated by the encoder depicted in Figure 6.1. Consider for instance a 3-bit message u = (u_1, u_2, u_3). As we have seen, what actually has to be encoded is (u_1, u_2, u_3, 0, 0), i.e. two zero bits are added at the end of the original message in order to put the memory of the encoder back into its initial state. The size of the corresponding codeword is thus 2 x 5 = 10. The bits of this codeword are given by equations (6.5) and (6.6), i.e. in matrix format:

(z_{2i-1}, z_{2i}) = (u_{i-2}, u_{i-1}, u_i) [ 1 1 ]
                                             [ 0 1 ]     (6.7)
                                             [ 1 1 ]

Thus the complete codeword z is obtained by multiplying (u_1, u_2, u_3, 0, 0) by the matrix

[ 1 1 0 1 1 1 0 0 0 0 ]
[ 0 0 1 1 0 1 1 1 0 0 ]
[ 0 0 0 0 1 1 0 1 1 1 ]
[ 0 0 0 0 0 0 1 1 0 1 ]
[ 0 0 0 0 0 0 0 0 1 1 ]

or more simply by multiplying u = (u_1, u_2, u_3) by

G3 = [ 1 1 0 1 1 1 0 0 0 0 ]
     [ 0 0 1 1 0 1 1 1 0 0 ]
     [ 0 0 0 0 1 1 0 1 1 1 ]

The encoded codeword is thus z = u G3. For a message of length m, this generalizes to z = u Gm, where Gm is the m x (2m + 4) matrix made of one-row, two-column shifts of the small block matrix of equation (6.7).

This result is true in general, independently of the length of the encoded message. This illustrates why convolutional codes are presented as unbounded linear block codes.
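To make the shift-register view of equations (6.5)-(6.6) concrete, here is a small Python sketch of the encoder of Figure 6.1 (the function name and the explicit register variables are illustrative choices; the generator-matrix multiplication above would produce the same codewords).

```python
def conv_encode(bits, r=2):
    """Encoder of Figure 6.1: z_{2i-1} = u_i + u_{i-2}, z_{2i} = u_i + u_{i-1} + u_{i-2} (mod 2)."""
    u1 = u2 = 0                       # register contents u_{i-1}, u_{i-2}, initially 0
    out = []
    for u in bits + [0] * r:          # r trailing zeros flush the register back to state 00
        out += [(u + u2) % 2, (u + u1 + u2) % 2]
        u1, u2 = u, u1                # shift the register
    return out

# Example 6.26: message 101 -> codeword 1101000111
print("".join(map(str, conv_encode([1, 0, 1]))))   # 1101000111
```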

6.4.3

General Definition

Let us now give a general definition of convolutional codes.


Definition 6.15 A (n, k, r) D-ary convolutional code is an unbounded linear code, the generator matrix of which is of the following (infinite) form:

G = [ F0   F1   F2   ...  Fr    [0]   [0]  ... ]
    [ [0]  F0   F1   ...  Fr-1  Fr    [0]  ... ]
    [ [0]  [0]  F0   ...  Fr-2  Fr-1  Fr   ... ]
    [  .    .    .    .    .     .     .   ... ]

with Fi a k x n matrix, and [0] the k x n null matrix; i.e. each set of k rows of G is the same as the previous set of k rows, but shifted n places right.
A message u of finite length m, u = (u_1, ..., u_m), is encoded by z = u' Gm' where u' is the vector of length m' = qk, with q = ceil(m/k), such that u' = (u_1, ..., u_m, 0, ..., 0), and Gm' is the top left submatrix of G of size qk x n(r + q). Notice that u' = u, i.e. m' = m, if m is a multiple of k (in particular when k = 1!).
In the above definition, k actually corresponds to the number of message symbols entering the encoder at each step (k input lines), n is the number of outgoing codeword symbols per input (n output lines), and r is the maximum number of memories (a.k.a. registers) on one input line.

Example 6.27 The example encoder of Figure 6.1 builds a (2, 1, 2) convolutional code: k = 1 input line with r = 2 memories, producing n = 2 codeword bits for each input bit. As seen in Section 6.4.2, for an input message of length 3, its generator matrix is the 3 x 10 matrix

G3 = [ 1 1 0 1 1 1 0 0 0 0 ]
     [ 0 0 1 1 0 1 1 1 0 0 ]
     [ 0 0 0 0 1 1 0 1 1 1 ]

where indeed each row (i.e. each set of k = 1 row(s)) is the row above it shifted n = 2 places to the right. Regarding the definition, we have for the above matrix G3:

F0 = [ 1 1 ], corresponding to the two coefficients of u_i in equations (6.5) and (6.6),
F1 = [ 0 1 ], corresponding to the two coefficients of u_{i-1}, and
F2 = [ 1 1 ], corresponding to the two coefficients of u_{i-2}.
Notice that convolutional codes are linear: any linear combination of codewords is also a codeword (with the convention that shorter codewords are padded with zeros at the end, so that the linear combination makes sense, i.e. all added words have the same length).

Control Question 68
1. What are (n, k, r) for the convolutional code given in the last question?

(a) (1, 3, 4)  (b) (7, 4, 1)  (c) (3, 1, 4)  (d) (4, 1, 3)  (e) (7, 1, 4)  (f) (1, 3, 7)
2. What is the generator matrix of this code?
(a) How many F blocks are there?
(b) What is the size of each F block?
(c) Give all the F blocks.

Answer
1. (d): (4, 1, 3).
2. (a) Since r = 3, there are r + 1 = 4 blocks, F0 to F3.
(b) Each block is of size k x n = 1 x 4.
(c) F0 = [ 0 0 1 1 ], F1 = [ 1 1 0 0 ], F2 = [ 1 0 0 1 ], F3 = [ 0 1 1 1 ].
Indeed, z = u G, with G built from these blocks as in Definition 6.15.

For instance, for a message of length 4:

G4 = [ 0 0 1 1  1 1 0 0  1 0 0 1  0 1 1 1  0 0 0 0  0 0 0 0  0 0 0 0 ]
     [ 0 0 0 0  0 0 1 1  1 1 0 0  1 0 0 1  0 1 1 1  0 0 0 0  0 0 0 0 ]
     [ 0 0 0 0  0 0 0 0  0 0 1 1  1 1 0 0  1 0 0 1  0 1 1 1  0 0 0 0 ]
     [ 0 0 0 0  0 0 0 0  0 0 0 0  0 0 1 1  1 1 0 0  1 0 0 1  0 1 1 1 ]

6.4.4

Lattice Representation

The state of a system is the set of internal parameters (memories, a.k.a. registers) that is required to compute the output corresponding to a given input block. For the encoder of Figure 6.1, for instance, the state at time i is the current content of the two memories, i.e. S_i = (u_{i-1}, u_{i-2}). The behavior of the encoder depends only on its state and the input: a convolutional code encoder is a state machine. All its behaviors are described by the state diagram.


Definition 6.16 (State Diagram) The state diagram of an encoder is a graph, the nodes of which are all possible internal states of the encoder. An arc between a node Si and a node Sj in this graph represents the fact that there exists an input that, when received in state Si, makes the encoder go to state Sj. These arcs are usually labeled with the input symbol(s) and the corresponding output symbols.

Example 6.28 (State Diagram) For instance, for the encoder of Figure 6.1, we have:
[State diagram of the encoder of Figure 6.1; each arc is labeled u_i / z_{2i-1} z_{2i}:
state 00: input 0 -> output 00, next state 00;  input 1 -> output 11, next state 10
state 10: input 0 -> output 01, next state 01;  input 1 -> output 10, next state 11
state 01: input 0 -> output 11, next state 00;  input 1 -> output 00, next state 10
state 11: input 0 -> output 10, next state 01;  input 1 -> output 01, next state 11]

where each node represents the state of the encoder, i.e. the contents of the two internal memories, the label before the slash is the input that produces the state change, and the label after the slash is the corresponding pair of output symbols. For instance, if in state 01 a 1 is received as input symbol, then the state becomes 10 and the two output symbols are 00.
The set of codewords of a (n, k, r) convolutional code corresponding to all possible messages of m bits thus corresponds to the set of all paths of length n(r + m/k) in the state diagram that start from the null (all-zero) state and return to this null state.
The unfolding in time of all the paths of the same length n(r + m/k) in the state diagram is called the lattice of size m of the (n, k, r) convolutional code.

Example 6.29 (Lattice) For the (2, 1, 2) code considered in the previous examples, the lattice of length m = 3, representing the encodings of all input messages of 3 bits, is:
[Lattice of length 3: five successive columns of arcs between the states 00, 10, 01, 11, each arc labeled with the emitted pair z_{2i-1} z_{2i}; figure not reproduced here.]
00

in which the uppermost arc out of a node corresponds to input bit 1 and the lowest arc to input bit 0.


The first three columns of arcs thus correspond to the encoding of the 3 message bits, and the last two columns of arcs correspond to the ending zeros, having thus only bottommost arcs.

Example 6.30 (Encoding in the Lattice) For instance, the encoding of the message u = 101 corresponds to the following path:
[Lattice of length 3 with the path 00 -> 10 -> 01 -> 10 -> 01 -> 00 highlighted, emitting 11, 01, 00, 01, 11; figure not reproduced here.]
00 00 00 00 00 00

i.e. to the codeword z = 1101000111.

Control Question 69
Consider the lattice corresponding to the encoding of a message of length 3 with the encoder of the former control question. How many columns of states does it have? For each column, how many states are there? Give the arc label for the following pairs of states (if no arc is present, answer "no arc"):
from 100 to 010:
from 101 to 110:
from 001 to 111:
from 111 to 011:

Answer
8 columns, with 1, 2, 4, 8, 8, 4, 2, 1 states respectively. (The answer "all 8" could also be considered as correct, i.e. all columns full, although this is suboptimal since many states are unreachable in such a lattice.)
from 100 to 010: 1100
from 101 to 110: 1000
from 001 to 111: no arc
from 111 to 011: 0010

Here is an example of the lattice (not all labels have been specified):



[Lattice for the (4, 1, 3) encoder, with only some arc labels shown; figure not reproduced here.]

000

and here are the codewords emitted in each possible situation (state S_i, input u_i):

S_i \ u_i    0       1
000          0000    0011
001          0111    0100
010          1001    1010
011          1110    1101
100          1100    1111
101          1011    1000
110          0101    0110
111          0010    0001

6.4.5

Decoding

As we have just seen, a codeword corresponds to one path from the starting node to the ending node of the lattice. Decoding thus consists in finding the most appropriate path corresponding to the received message. In the framework of minimum distance decoding, i.e. decoding to the codeword with the minimal number of errors, "the most appropriate path" means the path with the minimal Hamming distance to the message to be decoded.
Finding this closest path can be done using dynamic programming, i.e. the Viterbi algorithm. This algorithm decodes one codeword block after the other (i.e. n codeword symbols at a time), keeping at each stage only the locally optimal solutions: for each node in the column corresponding to the current decoded block, it keeps the best path reaching this node. At the end, the best path found for the last node is the decoded message.
What is important and useful for this algorithm is that the number of possible best paths kept at each time step is always less than or equal to the number of states of the encoder. The algorithm is thus of linear complexity (in the message length), instead of the exponential complexity of the naive algorithm, which consists in comparing all the paths with the message to be decoded.

Let us now give the algorithm more precisely.

Let us first introduce a bit of notation. For each state s of the encoder, let d_i(s) be the number of errors of the best (i.e. closest) decoding of length i ending in state s:

d_i(s) = min over the paths z_1...z_{2i} ending in s of d(z_1...z_{2i}, y_1...y_{2i})

where y = y_1 y_2 ... denotes the received message. The complete decoding thus corresponds to d_{|y|/2}(00), where |y| is the length of the message to be decoded (here |y|/2 = m + 2 blocks).

It is easy to see that, for every state s, we have:

d_i(s) = min over the states s' and arcs from s' to s of [ d(z_{2i-1} z_{2i}, y_{2i-1} y_{2i}) + d_{i-1}(s') ]

where z_{2i-1} z_{2i} is the output pair labeling the arc from s' to s.
This leads to the following algorithm (Viterbi algorithm):

d_0(00) = 0
for i from 1 to |y|/2 do
    for all s do
        d_i(s) = min over s' of ( d(z_{2i-1} z_{2i}, y_{2i-1} y_{2i}) + d_{i-1}(s') )
        mark the/one arc from s' to s that achieves the minimum
    end for
end for
reconstruct the optimal path backwards, from the ending null state to the beginning null state
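Here is a Python sketch of this algorithm for the (2, 1, 2) code of Figure 6.1. It is only an illustration: the next-state/output function is recomputed from equations (6.5)-(6.6), the data structures (a dictionary of best paths) are an implementation choice, and for simplicity the backward reconstruction is replaced by carrying the decoded bits along each surviving path.

```python
def viterbi_decode(received):
    """Minimum-distance decoding for the (2,1,2) code of Figure 6.1 (Viterbi algorithm)."""
    def step(state, u):               # next state and output pair for input bit u
        u1, u2 = state
        return (u, u1), ((u + u2) % 2, (u + u1 + u2) % 2)

    blocks = [tuple(received[i:i+2]) for i in range(0, len(received), 2)]
    # delta[s] = (number of errors of the best path reaching s, decoded bits of that path)
    delta = {(0, 0): (0, [])}
    for i, block in enumerate(blocks):
        last = i >= len(blocks) - 2   # the last two blocks encode the flushing zeros
        new = {}
        for state, (errs, bits) in delta.items():
            for u in ([0] if last else [0, 1]):
                nxt, out = step(state, u)
                cost = errs + sum(a != b for a, b in zip(out, block))
                if nxt not in new or cost < new[nxt][0]:
                    new[nxt] = (cost, bits + [u])
        delta = new
    errs, bits = delta[(0, 0)]
    return bits[:-2], errs            # drop the two flushing zeros

# Example 6.31: received 0101010111 -> decoded message 101, with 2 corrected errors
print(viterbi_decode([0, 1, 0, 1, 0, 1, 0, 1, 1, 1]))   # ([1, 0, 1], 2)
```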

Example 6.31 Suppose that the codeword z = 1101000111 is sent over a noisy channel and that y = 0101010111 is received, i.e. two errors occurred. The above algorithm then goes through the following steps, where the minimal number of errors d_i(s) at stage i is written above every state s in the lattice:
[Successive snapshots of the lattice, one per decoding step, with the minimal number of errors written above each state and, in the last snapshot, the optimal path (corresponding to the codeword 1101000111) highlighted; figures not reproduced here.]
01

01

In the first step, only two paths can be considered: either a 1 was encoded, leading from state 00 to state 10; if this occurred then 11 was emitted, and thus one error occurred during the transmission (since 01 has been received). Or a 0 was encoded, leading from state 00 to state 00; if this occurred then 00 was emitted and one error also occurred during the transmission.
At the second step there is still only one possible path to reach each state (i.e. 4 paths now). The minimal number of errors for each of these 4 states is the minimal number of errors of the former state plus the number of errors on the corresponding arc (i.e. the difference between the emitted and the received symbols). For instance, had the path going from state 10 to state 11 been used, two errors would have occurred, since in such a case 10 would have been emitted and 01 is received. This leads to a minimum number of errors for state 11 at step 2 of d_2(11) = 1 + 2 = 3.
At step three, we have two possible paths reaching each state. The algorithm keeps only the one with the minimum number of errors.
The algorithm goes on like this up to the final stage, where the null state is reached with a minimal number of errors of 2. The very last step is the backward reconstruction of the path. This path corresponds to the codeword 1101000111.
The final step of the decoding is to reconstruct the original message from the codeword, which is done by knowing which arc (0 or 1) has been followed (or simply by looking at the first bit of each state). In this case, we come to 10100, and suppressing the last 2 zeros, which are not part of the original message, we end up with 101.

Control Question 70
For the code used in the former control question, how is 1011101011111101101001100010 decoded?

Answer
The answer is 0110. Here is the corresponding trellis, with the minimum number of errors at each node:

[Trellis of the (4, 1, 3) code for the received message 1011 1010 1111 1101 1010 0110 0010, with the minimal number of errors at each node; figure not reproduced here.]

0110

0010

The received message 1011 1010 1111 1101 1010 0110 0010 thus contains 9 errors. The emitted codeword was 0000 0011 1111 0101 1110 0111 0000, which corresponds to the encoded message 0110000, i.e. the original message 0110.

6.4.6

Minimum Distance

As presented in section 6.4.3, a convolutional code is a linear code. Thus, by Theorem 6.2, its minimum distance is equal to its minimal weight.

Property 6.17 (Minimal Weight of a Convolutional Code) For a convolutional code, the minimal weight is the minimal number of non zero symbols on a path going from the null state to the null state.

Proof This comes directly from the definition of minimal weight and the fact that a codeword corresponds to a path in the state diagram going from the null state back to the null state.

Example 6.32 Consider the convolutional code we have been dealing with since the beginning of this section. The best way to enumerate paths in the state diagram is to use the lattice. For minimal weight computation, the lattice of size two is enough, since every arc in the lattice of size three that is not present in the lattice of size two comes back to a same state with a larger number of non-zero symbols, and so cannot be part of the minimal path. In our case, this gives us:


[Lattice of length 2: the path 00 -> 10 -> 11 -> 01 -> 00 has weight 6 ("Poids=6" in the figure) and the path 00 -> 10 -> 01 -> 00 -> 00 has weight 5 ("Poids=5"); figure not reproduced here.]

00

00

Thus dmin(C) = 5.

Control Question 71
What is the minimum distance of the code given in the former control questions?

Answer
dmin(C) = 9. Here is the codeword with this weight:

[Path 000 -> 100 -> 010 -> 001 -> 000 in the lattice, emitting 0011 1100 1001 0111, i.e. the weight-9 codeword obtained by encoding the single-bit message 1; figure not reproduced here.]

100 000

Summary for Chapter 6
convolutional code: A (n, k, r) D-ary convolutional code is an unbounded linear code, the generator (infinite) matrix of which is such that each set of k rows is the same as the previous set of k rows but shifted n places right. This corresponds to the matrix description of the encoder algorithm, which is often given in the form of a picture of a circuit with k input lines, n output lines and at most r memories on an input-to-output path.
encoding: A message u of finite length m, u = (u_1, ..., u_m), is encoded by z = u' Gm', where q = ceil(m/k), m' = qk, u' is the vector of length m' such that u' = (u_1, ..., u_m, 0, ..., 0), and Gm' is the top left submatrix of size qk x n(r + q) of the generator matrix.
encoder (internal) state: the set of states of the memories (or registers) of the encoder.


state diagram: The state diagram of an encoder is a graph, the nodes of which are all possible internal states of the encoder. An arc between a node Si and a node Sj in this graph represents the fact that there exists an input that, when received in state Si, makes the encoder go to state Sj. These arcs are usually labeled with the input symbol(s) and the corresponding output symbols.
lattice representation: The time-unfolded representation of all possible paths in the state diagram.
Viterbi decoding algorithm: The dynamic programming algorithm which, for a given message to be decoded, finds the shortest path (in terms of number of errors) in the lattice.


Summary for Chapter 6
block-code: a non-empty set of words of the same length, considered as row vectors.
weight: (of a word) the number of non-zero symbols.
Hamming distance: the number of coordinates in which two vectors differ. The Hamming distance between two words is the weight of their difference.
minimum distance decoding: error correction framework in which each received word is decoded into the closest (according to the Hamming distance) codeword.
maximum likelihood decoding: error correction framework in which each received word z is decoded into (one of) the most likely codeword(s) z', i.e. a codeword such that P(Y = z | X = z') is maximal (with X the input of the noisy channel and Y its output).
minimum distance of a code: the minimum (non-null) Hamming distance between any two (different) codewords.
error correcting and detecting capacity: a block-code C of length n using minimum distance decoding can, for any two integers t and s such that 0 <= t <= n and 0 <= s <= n - t, correct all patterns of t or fewer errors and detect all patterns of t + 1, ..., t + s errors if and only if its minimum distance dmin(C) is strictly bigger than 2t + s:
dmin(C) > 2t + s  <=>  C corrects t and detects t + s errors.
linear code: a block code which is a vector space (i.e. any linear combination of codewords is also a codeword). A (n, m) D-ary linear code is a dimension-m vector subspace of the dimension-n vector space of D-ary words.
minimum distance of a linear code: for a linear code, the minimum distance of the code is equal to its minimal weight.
generator matrix of a linear code: (of a (n, m) linear code) an m x n matrix, the rows of which are a basis of the code (they are thus linearly independent).
systematic form of the generator matrix of a linear code: an m x n generator matrix of a (n, m) linear code is said to be in systematic form if and only if its leftmost m x m submatrix is the identity matrix (of size m).
encoding with a linear code: the encoding with linear codes is done by matrix multiplication: the word to be encoded u is multiplied by one chosen generator matrix G of the code, producing the codeword z = u G. If the generator matrix is in systematic form, the first m symbols of the codeword are exactly the symbols of the message; thus only the n - m last symbols actually need to be computed.
verification matrix of a linear code: a (n - m) x n matrix H is a verification


matrix for a (n, m) linear code C if and only if, for all z in GF(D)^n: z H^T = 0 <=> z in C. The verification matrix is very useful for decoding.

syndrome of a word with respect to a linear code: the result of the product of a word by the verification matrix: s = z H^T. The syndrome is used to determine the error to be corrected: it indeed corresponds to the linear combination of columns of H which is precisely the product of the error pattern by H^T.
binary Hamming codes: (2^r - 1, 2^r - r - 1) linear codes that can correct all patterns of 1 error; the verification matrix is given in the form of the binary enumeration of its columns.
Cyclic code: a linear code such that any (symbol) shift of any codeword is also a codeword.
Polynomial representation of cyclic codes: z = z_1...z_n is represented by the polynomial z(X) = z_1 X^(n-1) + z_2 X^(n-2) + ... + z_{n-1} X + z_n, i.e. the j-th symbol z_j of a codeword z of size n is the coefficient of X^(n-j) in the corresponding polynomial z(X). The multiplication by X (degree-one monomial) corresponds to the (one-position) left shift. All operations are made modulo X^n - 1.
Generator of a cyclic code: for every cyclic code, there exists one polynomial such that every codeword's polynomial representation is a multiple of it, and conversely.
Systematic form cyclic code encoding: the encoding method such that the m first symbols of a codeword are exactly the m symbols of the encoded message. For cyclic codes, systematic form encoding is achieved through the following steps:
1. multiply the message polynomial u(X) by X^(n-m);
2. divide X^(n-m) u(X) by the generator g(X) and get the remainder r(X);
3. encode u(X) by z(X) = X^(n-m) u(X) - r(X).
Decoding with cyclic codes: the decoding process is similar to the framework used for linear codes in general:
1. compute the syndrome of the received word: it is the remainder of the division of this word by the generator of the code;
2. then deduce the corrector from a precomputed mapping of syndromes to correctors (the division of the corrector by the generator gives the syndrome as the remainder);
3. finally, apply the corrector to the received codeword and decode the original message as the first m symbols of the corrected codeword (provided that

systematic encoding has been used).


Convolutional code: a (n, k, r) D-ary convolutional code is an unbounded linear code, the generator (infinite) matrix of which is such that each set of k rows is the same as the previous set of k rows but shifted n places right. This corresponds to the matrix description of the encoder algorithm, which is often given in the form of a picture of a circuit with k input lines, n output lines and at most r memories on an input-to-output path.
Encoding with convolutional code: a message u of finite length m, u = (u_1, ..., u_m), is encoded by z = u' Gm', where q = ceil(m/k), m' = qk, u' is the vector of length m' such that u' = (u_1, ..., u_m, 0, ..., 0), and Gm' is the top left submatrix of size qk x n(r + q) of the generator matrix.
Convolutional code encoder (internal) state: the set of states of the memories (or registers) of the encoder.
State diagram: the state diagram of an encoder is a graph, the nodes of which are all possible internal states of the encoder. An arc between a node Si and a node Sj in this graph represents the fact that there exists an input that, when received in state Si, makes the encoder go to state Sj. These arcs are labeled with the input symbol(s) and the corresponding output symbols.
Lattice representation: the time-unfolded representation of all possible paths in the state diagram.
Viterbi decoding algorithm: the dynamic programming algorithm which, for a given message to be decoded, finds the shortest path (in terms of number of errors) in the lattice.

Historical Notes and Bibliography


This section still needs to be improved.
The work on error-correcting codes started of course from Shannon's pioneering work in 1948. The design of good and efficient codes started in the fifties with the works of Hamming, Slepian and many others. During the fifties, most of the work in this area was devoted to the development of a real theory of coding (linear codes, both block and convolutional).
Convolutional codes were first introduced in 1955 by Elias [3] as an alternative to block codes. Wozencraft later proposed an efficient sequential decoding method for such codes [14]. Then in 1967, Viterbi proposed a maximum-likelihood decoding algorithm [13], quite easy to implement, which led to several applications of convolutional codes, in particular deep-space satellite communications. A theory-to-practice shift was made during the seventies, with a rapid growth of military and space communication applications.


Outlook
See also [2], [6], [12] and [8].

Chapter 7

Module I3: Cryptography

by J.-C. Chappelier

Learning Objectives for Chapter 7
In this chapter, the basics of cryptography are presented. After studying them, you should know:
1. what perfect and practical security are,
2. how secure modern ciphering systems are,
3. why security and authentication are theoretically incompatible,
4. what RSA and DES are and how they work,
5. what unicity distance is and how to compute it.

Introduction
Cryptography, as its Greek root ("hidden writing") suggests, is concerned with the secrecy of information. But in the modern sense, this scientific domain is also concerned with the authenticity of information.
In the Information Age we are now living in, cryptography can no longer be avoided and has indeed become a standard tool for communications. As information can nowadays be extremely sensitive and have enormous economic value, its transmission over easily accessible channels, e.g. the Internet, sometimes requires confidentiality and authenticity to be ensured. The purpose of cryptography is to provide such guarantees.
This chapter introduces you to the basics of this rather modern field of computer science and studies more formally its two goals: secrecy and authenticity. Roughly speaking, the goal of secrecy is to ensure that the message is received only by authorized persons; the goal of authenticity is to ensure that the message has been sent by an authorized person.


7.1

General Framework

Learning Objectives for Section 7.1
After studying this section you should know:
1. what cryptography deals with;
2. how to formally describe the general framework cryptography focuses on;
3. and several historical (insecure) ciphering examples.

7.1.1

Cryptography Goals

The framework of cryptography is to encode messages so as to ensure either secrecy or authenticity. As described in chapter 2, a message M is a sequence of symbols out of an alphabet Σ. In cryptography, the encoding of messages is called encrypting or ciphering. In the framework considered in this chapter, encrypting will be done using a function e and a key K, which is itself a finite sequence of symbols out of an alphabet, usually but not necessarily the same as the message alphabet Σ. The encrypted message, or cryptogram, is thus C = e(M, K). The encryption function is here assumed to be deterministic: C is thus perfectly determined once M and K are given, i.e. H(C|M, K) = 0.
The decrypting (or deciphering) is done using a function d and the key K, such that (unsurprisingly!) d(e(M, K), K) = M. We also assume that decrypting is deterministic, i.e. H(M|C, K) = 0.
Notice that H(C|M, K) = 0 and H(M|C, K) = 0 do not imply H(K|M, C) = 0; several keys could indeed be possible for a given (M, C) pair. In practice however this is hardly the case (and a bad idea), and almost always H(K|M, C) is also 0.
The general framework cryptography focuses on can be summarized by the picture given in Figure 7.1.

[Figure 7.1: The general framework cryptography focuses on. The SENDER computes the cryptogram C = e(M, K) from the message M and the key K and sends it over a public channel; the key K is shared with the RECEIVER over a secure channel; the RECEIVER computes the decryption D = d(C, K); an unauthorized person has access only to the public channel.]

The goal of cryptography is to protect the message against
wrong receipt (secrecy): it should be impossible to get the message M out of the encrypted message C = e(M, K) without knowing K;


wrong emission (authentication): it should be impossible to substitute another message C' without knowing K.
Cryptanalysis is concerned with cracking the security/authentication of a communication channel. Cracking a security system means finding M or K knowing C = e(M, K). The hypotheses usually made are:
the encrypting and decrypting algorithms are known by everybody (Kerckhoffs' hypothesis), and even statistics about the messages (but not the message itself!) could be collected;
unauthorized persons do not know the key K;
everybody can get C = e(M, K) (but neither M nor K).
Thus, all the secrecy is due only to the fact that the enemies do not know the actual value of the secret key. It is indeed risky to hope that the design of the ciphering algorithm could be safeguarded from the enemies. Nonetheless, in many applications of cryptography, notably in military and diplomatic applications, the cryptographers try to keep the ciphering algorithm as secret as possible. Kerckhoffs' hypothesis does not forbid this, but only warns not to count too much on the success of such safekeeping. On the other hand, Kerckhoffs would certainly have admired the designers of the Data Encryption Standard (DES) (see section 7.3.3), who published a complete description of their encryption system, which is nonetheless perhaps the most widely used cipher today.

7.1.2

Historical Examples

Before studying further the fundamentals of cryptography with the modern tools of Information Theory, let us first give three historical (but insecure) examples of cryptosystems: the substitution, transposition and Vigenère ciphers.

Substitution
The substitution cipher simply consists in replacing every symbol of the message alphabet by another symbol of this alphabet, known in advance. The key of such a system is a permutation of the alphabet Σ, which defines the substitution for all the symbols.

Example 7.1 (Substitution cipher) Consider messages out of the usual alphabet made of 27 letters (including the whitespace!): Σ = {A, ..., Z, ' '}. A possible key k, i.e. a permutation of Σ, could be:
A -> R, B -> I, ..., Y -> B, Z -> E, ' ' -> L.

In this case e(A BAY, k) = RLIRB.
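A substitution cipher is trivial to implement; here is a minimal Python sketch using only the partial key listed in Example 7.1 (a real key would of course map every symbol of the alphabet):

```python
# Partial key of Example 7.1 (a full key would be a permutation of the whole alphabet)
key = {"A": "R", "B": "I", "Y": "B", "Z": "E", " ": "L"}

def substitute(message, key):
    """Encrypt by replacing each symbol according to the key permutation."""
    return "".join(key[c] for c in message)

print(substitute("A BAY", key))   # RLIRB
```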


Transposition
In the transposition cipher, the key consists of a permutation of the first d > 1 integers (d is also part of the definition of the key). The encrypting algorithm is then the following:
1. pad the message with (at most d - 1) whitespaces, so that the length of the message is a multiple of d;
2. split the message into blocks of length d;
3. permute the symbols in each block according to the permutation K.

Example 7.2 (Transposition Cipher) Let us take the permutation (2 4 3 1 5) as the key (thus d = 5). [Note on permutation notation: (2 4 3 1 5) means that the second letter of the original message becomes the first of the encrypted message, the fourth of the original message becomes the second, etc.]
Suppose now we want to encode the message TRANSPOSITION CIPHER IS SIMPLE. The length of the message is 29, which is not a multiple of d = 5. One extra whitespace thus needs to be added at the end. Then we split the message into six blocks of size 5 (whitespaces have here been marked by a dot to make them appear more clearly):
TRANS POSIT ION.C IPHER IS.SI MPLE.

And finally we apply the transposition to each block:
RNATS OISPT O.NIC PEHIR SS.II PELM.

The transmitted message is thus RNATSOISPTO NICPEHIRSS IIPELM (by convention ending whitespaces could be removed). The decoding is done exactly the same way but using the inverse permutation (which in this case is (4 1 3 2 5)).
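The block-wise permutation is easy to express in code. Below is a small Python sketch of the encryption step (the function name is an illustrative choice; decryption would use the inverse permutation, here (4 1 3 2 5), in exactly the same way):

```python
def transpose_encrypt(message, perm):
    """Transposition cipher: pad with spaces to a multiple of d, then permute each block."""
    d = len(perm)
    message += " " * (-len(message) % d)                      # padding
    blocks = [message[i:i+d] for i in range(0, len(message), d)]
    # perm = (2 4 3 1 5): position j of the encrypted block is symbol perm[j] of the clear block
    return "".join("".join(block[p - 1] for p in perm) for block in blocks)

print(transpose_encrypt("TRANS", (2, 4, 3, 1, 5)))   # RNATS, as in Example 7.2
```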

Vigenère Cipher
The last historical example we want to present is the Vigenère cipher. In this cryptographic system, the key is a sequence of symbols from the same alphabet as the messages. In practice, it is very often a usual word or a sentence of a few words. Using an order on Σ (e.g. the usual alphabetical order), this key is transformed into a sequence of integers, e.g. A = 1, B = 2, ..., Z = 26 and ' ' = 27.

More formally, if n is the size of Σ,


i(a) is the position of symbol a in Σ (according to the order chosen on Σ), 1 <= i(a) <= n,
σ(i) is the i-th symbol in Σ (for 1 <= i <= n; otherwise consider i mod n),
the key K is made of p symbols, K = k_1...k_p, and M of q symbols, M = m_1...m_q,
then
C = σ(i(m_1) + i(k_1)) σ(i(m_2) + i(k_2)) ... σ(i(m_p) + i(k_p)) σ(i(m_{p+1}) + i(k_1)) ... σ(i(m_q) + i(k_{q mod p})).

Example 7.3 (Vigenère Cipher) Let us once again consider messages made out of the 27 English letters (including the whitespace). The key is thus a sequence of characters, for instance k = INFORMATION.
How is the message VIGENERE CIPHER IS ALSO QUITE SIMPLE encoded? Assuming that letter A corresponds to 1 and the whitespace to 27, letter I then corresponds to 9, and thus the first letter of the message, V, is encoded by V+9=D, the second letter of the message, I, is encoded by I+N=I+14=W, the third letter G by G+F=G+6=M, etc. Here is the complete encoding:
VIGENERE CIPHER IS ALSO QUITE SIMPLE
INFORMATIONINFORMATIONINFORMATIONINF
DWMTERSYIRWYVKFRVTTJ FXNWI FFTAX YZK
i.e. the encoded message is DWMTERSYIRWYVKFRVTTJ FXNWI FFTAX YZK.
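The shift-and-wrap arithmetic of the Vigenère cipher can be sketched in a few lines of Python, using the numbering of the text (A = 1, ..., Z = 26, whitespace = 27); the function name is an illustrative choice:

```python
ALPHA = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "        # 27 symbols, whitespace last (value 27)

def vigenere_encrypt(message, key):
    """Vigenere cipher with the numbering of the text: A=1, ..., Z=26, space=27."""
    out = []
    for j, m in enumerate(message):
        k = key[j % len(key)]
        # values are 1-based, so the "+ 1" below carries the two offsets between index and value
        out.append(ALPHA[(ALPHA.index(m) + ALPHA.index(k) + 1) % len(ALPHA)])
    return "".join(out)

print(vigenere_encrypt("VIGENERE CIPHER IS ALSO QUITE SIMPLE", "INFORMATION"))
# DWMTERSYIRWYVKFRVTTJ FXNWI FFTAX YZK  (as in Example 7.3)
```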

Summary for Chapter 7
cryptography aims either at transmitting messages securely (only authorized persons can read them) or at authenticating messages (no unauthorized person could have sent them).
To do so, the clear message M is encoded using a key K and a deterministic function: C = e(M, K). Encrypted messages can be decoded deterministically using the decoding function d and the same key K, so that d(e(M, K), K) = M.
H(C|M, K) = 0.
H(M|C, K) = 0.


7.2

Perfect Secrecy

Learning Objectives for Section 7.2
After studying this section you should know:
1. what a perfectly secret cryptosystem is;
2. one example of such a cryptosystem;
3. and, for imperfectly secure systems, how to estimate the maximum message size that can be securely transmitted.

After the entertaining historical examples of the last section, let us now come to the modern science of cryptography. This begins with an information-theoretic definition of what a good ("perfect" is the word used by Shannon) cryptographic system is.

7.2.1

Definition and Consequences

In the framework depicted in Figure 7.1, where only encrypted messages can be captured by the unauthorized persons(1), the system will be safe if the encrypted message does not bring any information on the original message, i.e. if I(C; M) = 0, which also means that M and C are independent random variables.

Definition 7.1 (Perfect Secrecy) An encryption system is said to be perfect, i.e. to provide perfect secrecy, if and only if the mutual information of the clear message M with the encrypted message C is null: I(C; M) = 0.

Theorem 7.1 In a perfect ciphering system, there must be at least as many possible keys as possible messages.

Proof I(C; M) = 0 implies that for every message m, P(C|M = m) = P(C). Let us now consider a possible cryptogram, i.e. an encrypted message c such that P(C = c) ≠ 0. Then for every possible original message m, we have P(C = c|M = m) ≠ 0, which means that for every m there exists a key, denoted k(m), such that c = e(m, k(m)). Furthermore, m ≠ m' implies k(m) ≠ k(m'), otherwise deciphering would no longer be deterministic: we would have two different messages that, with the same key, give the same cryptogram c! There are thus at least as many keys as there are possible messages m.
(1) This kind of attack is called a ciphertext-only attack.


Theorem 7.2 In a perfect cryptographic system, the uncertainty on the key H(K) is at least as big as the uncertainty on the messages H(M): H(K) >= H(M).

Proof
H(M) = H(M|C) <= H(M, K|C) = H(K|C) + H(M|K, C) = H(K|C) <= H(K).
The consequence of these two theorems is that in a perfect system keys must be complex enough, at least as complex as the messages themselves.

7.2.2

One Example: One-Time Pad

Let us now present a well-known example of a perfect cryptographic system: the one-time pad, which is actually used by diplomats. Without any loss of generality, we here consider the binary one-time pad, i.e. messages, cryptograms and keys are binary sequences (Σ = {0, 1}).
In this system, the key is a random sequence of n independent bits, K = K_1 K_2...K_n:
p(K_i = 0) = p(K_i = 0|K_1, ..., K_{i-1}) = 0.5,
where n is the size of the longest message to be transmitted. The encryption is done simply by adding the symbols of the message and the symbols of the key(2): C_i = M_i + K_i.

Example 7.4 (One-Time Pad) Suppose the key is k = 11010101010010101001 and the message to be transmitted is m = 11110000111100001111; then the encrypted message is c = m + k = 00100101101110100110.
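Here is a small Python sketch of the one-time pad (the key generation with the standard secrets module and the function name are illustrative choices; the essential point is that a fresh uniform key of the message's length is drawn for every message):

```python
import secrets

def one_time_pad(message_bits):
    """One-time pad: draw a fresh uniform key of the same length and XOR it with the message."""
    key = [secrets.randbelow(2) for _ in message_bits]
    cipher = [m ^ k for m, k in zip(message_bits, key)]
    return key, cipher

# Reproducing Example 7.4 with the key fixed instead of random:
m = [int(b) for b in "11110000111100001111"]
k = [int(b) for b in "11010101010010101001"]
print("".join(str(mi ^ ki) for mi, ki in zip(m, k)))   # 00100101101110100110
```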

Theorem 7.3 One-time pad is a perfect cipher. Proof For 1 < i n: p(Ci = 0|C1 , ..., Ci1 , M1 , ...Mn ) = p(Mi = 0|C1 , ..., Ci1 , M1 , ...Mn ) p(Ki = 0) +p(Mi = 1|C1 , ..., Ci1 , M1 , ...Mn ) p(Ki = 1) = 0.5 [p(Mi = 0|C1 , ..., Ci1 , M1 , ...Mn ) + p(Mi = 1|C1 , ..., Ci1 , M1 , ...Mn )] = 0.5
(2) Binary addition (a.k.a. "exclusive or" for readers familiar with computer science) is the usual modulo-2 addition, without carry: 0 + 0 = 0, 0 + 1 = 1, 1 + 0 = 1, 1 + 1 = 0.


Similarly, p(C1 |M1 , ...Mn ) = 0.5, and p(Ci |C1 , ..., Ci1 ) = 0.5 for all i, 1 i n. Thus, P (C |M ) = P (C ), i.e. I (C ; M ) = 0.

7.2.3

Imperfect Secrecy and Unicity Distance

We have seen that for a cryptographic system to be perfect, the key must be complex enough. In practice, at least for wide-range usage, this is not very convenient. For a practical wide-range system (e.g. security on the Internet) the key must be small (at least smaller than the messages) and has to be used several times, i.e. the system has to be imperfect from a formal point of view. What can we thus say about imperfect (but more convenient) systems?
To determine when a ciphering system that does not offer perfect secrecy could in principle be broken, Shannon introduced the so-called key equivocation function, defined for integers n by a(n) = H(K|C_1...C_n).

It seems obvious that the more encrypted text has been seen, the less uncertainty remains about the key. More formally:
lim_{n -> infinity} a(n) = 0.
The unicity distance u is then defined as the smallest n such that a(n) ≈ 0.

Definition 7.2 (unicity distance) The unicity distance u of a cryptosystem is the smallest n such that H(K|C_1...C_n) ≈ 0.

Thus, u is the least amount of ciphertext from which unauthorized persons are able to determine the secret key almost uniquely. Roughly speaking, the unicity distance is the least amount of ciphertext from which the ciphering system can be broken. Let us now compute the unicity distance under certain circumstances.


Theorem 7.4 If
M and C are of the same length n and from the same alphabet Σ;
encrypted messages have roughly maximal uncertainty: H(C_n) ≈ n log|Σ| (which is something every cryptographer tries to reach);
the key and the messages are independent: H(M_n, K) = H(M_n) + H(K) (which is also very natural and usual);
then the unicity distance can be approximated by
u ≈ H(K) / (R(M) log|Σ|)     (7.1)
where R(M) is the redundancy of the unencrypted messages M, as defined in section 3.2.1 of chapter 3: R(M) = 1 - H(M)/log|Σ|.

Proof Assume n to be large enough so that H(M_n) ≈ n H(M), which is a sensible hypothesis (consider otherwise the maximum of such an n and the value of u obtained with the given formula).
H(K|C_n) = H(K, C_n) - H(C_n)
         = H(M_n, K, C_n) - H(M_n|K, C_n) - H(C_n)
         = H(M_n, K, C_n) - H(C_n)            [since H(M_n|K, C_n) = 0]
         = H(M_n, K) - H(C_n)                 [since H(C_n|M_n, K) = 0]
         = H(M_n) + H(K) - H(C_n)
         ≈ n H(M) + H(K) - n log|Σ|
The unicity distance u is defined by H(K|C_u) ≈ 0, i.e. u H(M) - u log|Σ| + H(K) ≈ 0, or:
u = H(K) / (log|Σ| - H(M)) = H(K) / (R(M) log|Σ|).

Example 7.5 (unicity distance) Let us consider English messages (made of the 27-letter alphabet, including whitespace) encrypted with a cipher using keys of 20 independent letters: H(K) = 20 log(27). Knowing that the entropy rate of English is roughly 2 bits per letter, the redundancy of the messages is R(M) = 1 - 2/log(27) ≈ 0.58, and the unicity distance of such a system

is:
u = H(K) / (R(M) log|Σ|) = 20 log(27) / (log(27) - 2) ≈ 35

i.e. cryptograms of about 35 characters will allow one to determine the key almost uniquely! Shannon was well aware that formula (7.1) was "valid in general and can be used to estimate equivocation characteristics and the unicity distance for the ordinary types of ciphers". Indeed, cryptographers routinely use this formula to estimate the unicity distance of almost all ciphers.
Notice also that u is, in principle, the required amount of ciphertext to determine the key almost uniquely. However, finding K from C_1, C_2, ..., C_u may very well be an intractable problem in practice. This formula only says that all the information is there, but does not say a word about how difficult it might be to extract it. We will come back to this aspect later on, in section 7.3.
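Since formula (7.1) is just arithmetic, it is easy to compute; the following sketch (function name illustrative) reproduces the number above and the two control questions that follow:

```python
from math import log2

def unicity_distance(H_K, H_rate, alphabet_size):
    """u ~ H(K) / (R(M) * log|Sigma|), with R(M) = 1 - H(M)/log|Sigma|   (formula 7.1)."""
    redundancy = 1 - H_rate / log2(alphabet_size)
    return H_K / (redundancy * log2(alphabet_size))

print(unicity_distance(20 * log2(27), 2, 27))  # ~34.5, i.e. about 35 (Example 7.5)
print(unicity_distance(33, 3, 96))             # ~9.2  (Control Question 72)
print(unicity_distance(16, 0.75, 2))           # 64.0  (Control Question 73: binary, 25% redundant)
```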

7.2.4

Increasing Unicity Distance: Homophonic Coding

It can be seen from (7.1) that a good way to increase the unicity distance (i.e. to tell less about the system) is to decrease the redundancy of the messages, i.e. to increase their entropy. It is for instance a good idea to compress the messages before encrypting them. Indeed, in the best compression cases, H(M_n) ≈ n log|Σ| and thus R(M) ≈ 0, so u tends to infinity.
Another possibility is to use an old cryptographic trick called homophonic substitution. In this process, several different "homophones" are used to represent each symbol of the original alphabet, with more homophones for the most frequent symbols, so that the homophones appear almost equally likely (whereas the original symbols do not).

Example 7.6 In English, the most probable symbol is the whitespace, with a probability of about .1859; the next most probable symbol is E, which has probability about .1031. The least likely is Z, with a probability of about .0005. If we want to convert such an English text using almost equally likely symbols, we need at least 1/0.0005 = 2000 symbols (so as to be able to have at least one for the Z).
Suppose we are thus using 2000 homophones to represent the 27 letters. The whitespace will be represented by any of the 372 (≈ .1859 x 2000) symbols we choose for it, E by any of the 206 (≈ .1031 x 2000) other symbols reserved for it, etc., and 1 (≈ .0005 x 2000) homophone symbol is used to represent the Z. The choice of a substitute for an English letter is then made by a uniform random choice from the set of homophone substitutes for that letter. The successive choices are made independently. After such a conversion, each homophone symbol in the converted text is essentially as likely as any of the others.
The decoding can be easily achieved by replacing each of the substitutes by the corresponding letter. There is no need to know in advance which substitutes were

randomly chosen in the pre-coding process.

Control Question 72


What is the unicity distance of a cryptosystem ciphering messages over an alphabet of 96 characters, having an entropy rate of 3 bits per character, with keys whose entropy is 33 bits?

Answer
u = H(K) / (R log|Σ|) = 33 / (log 96 - 3) ≈ 9.2 (if needed: R = 1 - 3/log 96 ≈ 54.4%).



Control Question 73 What is the unicity distance of a cryptosystem encoding binary messages which are 25% redundant, with uniformly distributed keys of 16 binary symbols? Answer u= Indeed entropy of uniformly distributed keys of 16 bits is... 16 bits! log || = log 2 = 1 16 = 64 0.25 1

Summary for Chapter 7 Perfect Secrecy: I (C ; M ) = 0 for a system to be perfectly secret there must be at least as many keys as messages and H (K ) must be greater than (or equal to) H (M ). One-Time Pad: for each encryption, a random key is chosen, whose length is equal to the message length and whose symbols are independent. The key is then simply added (symbol by symbol) to the message. One-Time Pad is a perfect cipher. unicity distance: the minimum number of encrypted text that must be known to determine the key almost surely: H (K |C1 ...Cu ) 0. under certain general assumptions, the unicity distance can be approximated by u H (K ) R(M ) log ||

264

CHAPTER 7. CRYPTOGRAPHY where R(M ) is the redundancy of the unencrypted messages M .

7.3

Practical Secrecy: Algorithmic Security

Learning Objectives for Section 7.3 After studying this section you should know: 1. how practical secrecy is achieved for unperfectly secure cryptosystems; 2. what dicult means for a computer (algorithmic complexity); 3. what a one-way function is; 4. how does DES work.

Up to this point, no particular attention has been paid to the computational power required to actually crack the system. The analysis of secrecy developed up to here applies independently of the time and computing power available for the attacks. Security against computationally unrestrained enemies is called unconditional security (or theoretical security as Shannon used to call it). As we have seen in theorems 7.1 and 7.2, achieving unconditional security usually requires enormous quantities of complex secret keys much more than what could be accepted in practice for wide scope cryptography applications. Most cryptographic systems used in practice thus do not rely not on the impossibility of being broken but rather on the diculty of such a breaking. In this framework, the goal is to ensure security against unauthorized persons who have a limited time and computing power available for their attacks. This is called computational security (or practical security as Shannon used to call it). The point is to change lack of information (unconditional security) for diculty to access to the information. But what does it actually mean to be dicult? How could we measure the diculty to crack a code? This is the aim of algorithmic complexity.

7.3.1

Algorithmic Complexity

It is not the purpose of this section to provide a complete course on algorithmic complexity, but we would rather present the basic concepts so that the rest of the lecture regarding computational security can be understood well enough. Algorithmic complexity aims at dening the complexity of decision problems. A decision problem is simply a yes/no question on a well dened input. For instance, given an integer number n (the input), is this number a prime number? If the answer to the decision problem could be found by some algorithm (on a Turing machine), we call the decision problem algorithmic.3
In the general algorithmic complexity theory, algorithmic decision problems are called Turing decidable problems, but this goes a bit beyond the scope of this chapter.
3

7.3. PRACTICAL SECRECY: ALGORITHMIC SECURITY

265

For algorithmic decision problems, the (time-)complexity4 is dened as the smallest number of time steps (on a Turing machine) of the algorithms that can answer the question.5 For good fundamental reasons, this complexity is not expressed exactly, but only in the way it depends on the size of the input: a problem is said to be linear, quadratic, exponential, ... this mean that its complexity grows linearly, quadratically, exponentially, ... with the size of the input. The complexity is thus expressed in terms of big O notation. Denition 7.3 (Big O notation) For two functions f and g over the real numbers, g is said to be O(f ) if and only if x0 R , c R , x x 0 |g(x)| c f (x)

Notice that if g is O(f ) and f is O(h), g is also O(h). For complexity measure, we are looking for the smallest and simplest f such that g is O(f ) (e.g. such that f is also O(|g|)). Example 7.7 (Big O notation) 3 n + log n + 4 is O(n). Notice this is also O(n + n3 ), O(n log n), O(n2 ), ... which are not pertinent when used for complexity measure. 5 x2 12 x7 + 5 x3 is O(x7 ). 1/x is O(1). The complexity of a linear problem is O(n), where n is the size of the input. A problem whose complexity is O(2n ) is an exponential problem.

Denition 7.4 (P and NP) P is the set of algorithmic decision problems, the complexity of which is polynomial. NP is the set of algorithmic decision problems, such that if a possible solution is given, it is possible to verify this solution in a polynomial time. A classical pitfall is to think that NP means not-P or non-P. This is wrong for several reasons: P and NP are not complementary: P is actually totally included in NP; there are problems which are neither in P nor in NP. What does the N of NP means then? It stands for Non-deterministic. NP problems are problems which are polynomial in a non-deterministic way: pick up a possible solution at random, then you can conclude (for that candidate solution only!), in polynomial time. Clearly P NP, but it is for the moment still an open question whether NP P or not.
4 5

only time-complexity is considered in this chapter. We do not go here into the details of problems and co-problems.


With respect to this question, there is a subset of NP which is of particular interest: the NP-Complete (or NP-C) problems.

Definition 7.5 (NP-Complete) A problem is said to be NP-Complete if it is in NP and is at least as difficult as any problem in NP.

This class is of particular importance because if someone manages to prove that one single NP-C problem is actually in P, then all of NP is included in P! There is finally a last class of problems (the difficult ones): NP-hard problems.

Definition 7.6 A problemᵃ is said to be NP-hard if it is at least as difficult as any problem in NP.

ᵃ In its most general definition, the NP-hard class also includes problems which are not decision-only problems.

NP-C and NP-hard problems are often confused. The difference between NP-C and NP-hard problems is that an NP-hard problem is not required to be in NP (either because you do not bother or do not want to spend time proving it, or, more fundamentally, because it is such a difficult problem that even testing a single candidate solution cannot be achieved in polynomial time).

Example 7.8 (NP-Complete problems)
Satisfiability (SAT): The input is a set of n boolean (i.e. true/false) variables x1, ..., xn. Decision: can (x_{i1} ∨ x_{j1} ∨ x_{k1}) ∧ (x_{i2} ∨ x_{j2} ∨ x_{k2}) ∧ (x_{i3} ∨ x_{j3} ∨ x_{k3}) ∧ ... be satisfied, i.e. be true for some values of the variables x_i?
Traveling Salesman Problem (TSP): The input is a graph G (a set of nodes and arcs) and a distance d. Decision: is there a circuit going through all nodes of G whose length is below d?

We now have all the required ingredients to make a code difficult to crack: get inspiration from NP-hard problems. Let us now focus more precisely on the usage of difficult problems for cryptography: one-way functions and, later on, trapdoor functions.
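To make the defining property of NP concrete, here is a minimal sketch (our own illustration; the names and the formula encoding are ours, not from the course) that checks a candidate assignment against a formula in clause form in time linear in the size of the formula, whereas no polynomial algorithm is known for finding such an assignment.

```python
# Minimal sketch: checking a candidate SAT solution is polynomial (here linear),
# even though finding one is believed to be hard.
# A formula is a list of clauses; a clause is a list of literals:
# literal +i stands for variable x_i, literal -i for its negation.

def verify_sat(formula, assignment):
    # Each clause must contain at least one literal made true by the assignment.
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in formula
    )

formula = [[1, 2, 3], [-1, 2, -3]]         # (x1 or x2 or x3) and (not x1 or x2 or not x3)
candidate = {1: True, 2: False, 3: False}
print(verify_sat(formula, candidate))      # True: this candidate satisfies the formula
```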

7.3.2 One-Way Functions

Definition 7.7 (One-Way Function) A one-way function is a function that is easy to compute but (computationally) difficult to invert.

How could a one-way function be useful for cryptography?


The key idea is that both the encoding e(M, K) = C and the decoding d(C, K) = M are easy to compute, but that their inversion is difficult (even if H(K|C, M) = 0). However, the most obvious application of one-way functions is certainly password-based systems. For each authorized user of the system, his password w is stored in the encrypted form e(w), where e is a one-way function. When someone wants to use the system (log in), he presents a password w' and the system computes (the easy way) e(w') and checks whether it corresponds to the stored information e(w). If so, the user is granted access; if not, he is denied. The virtue of this system is that the stored encrypted passwords do not need to be kept secret.⁶ If e is truly a one-way function, an attacker who somehow gets access to these encrypted passwords cannot do anything with them, as it is computationally infeasible for him to find a (pass)word x such that e(x) = e(w). Notice, interestingly, that this first example application of one-way functions, actually used in practice, provides authenticity rather than security, in the sense developed earlier in this chapter. An example of one-way function is given in the next section.

⁶ Although there is also no reason to make it public!
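As an illustration of the password application, here is a minimal sketch using a cryptographic hash from Python's standard hashlib as a stand-in for the one-way function e (the course does not prescribe a particular e; the names below are ours):

```python
import hashlib

def e(password: str) -> str:
    # Stand-in one-way function: easy to compute, no known efficient inversion.
    return hashlib.sha256(password.encode()).hexdigest()

stored = {"alice": e("correct horse battery staple")}   # only e(w) is stored, never w

def log_in(user: str, attempt: str) -> bool:
    # Compute e(attempt) (the easy direction) and compare with the stored value.
    return user in stored and e(attempt) == stored[user]

print(log_in("alice", "correct horse battery staple"))  # True: access granted
print(log_in("alice", "guess"))                         # False: access denied
```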

7.3.3 DES

Data Encryption Standard (DES for short) is one instance of a computationally secure cryptographic system (or at least thought to be!) that uses one-way functions. Let us present here the basic idea of DES. We only focus on the core of the system, as the actual standard contains several other practical tricks. DES uses an NP-Complete problem very similar to SAT for this: systems of equations in GF(2).

Example 7.9 Deciding whether the system
x1 x4 + x2 x3 x5 = 1
x2 x3 + x1 x3 x4 = 1
x1 x3 + x1 x2 x5 = 1
has a solution or not is NP-Complete with respect to the number of variables. The fact that the solution (x1, x2, x3, x4, x5) = (1, 0, 1, 1, 0) is easy to find here should not hide the fact that for a larger number of variables, finding a solution is indeed a difficult problem.

How is this used for a cryptographic system? Choose two integers n and m, and a non-linear function f from GF(2)^m × GF(2)^n to GF(2)^n:
f(x1, ..., xm, y1, ..., yn) = (p1, ..., pn)


Choose also a key K of (d−1)·m bits, and split it into (d−1) parts of m bits each: K = (K_1, ..., K_{d-1}). Suppose that the binary message M to be sent has 2n bits.⁷ M is split into two parts of length n: M = (M_0, M_1). The encryption is then done iteratively in d−1 steps (i = 2, ..., d):
M_i = M_{i-2} + f(K_{i-1}, M_{i-1})
Finally, the cryptogram sent is C = (M_{d-1}, M_d).
Decrypting is simply done the other way round (i = d, d−1, ..., 2):
M_{i-2} = M_i + f(K_{i-1}, M_{i-1})

Example 7.10 (DES) Let us consider the following non-linear function (with m = 3 and n = 3):
f(x1, x2, x3, y1, y2, y3) = (x1 x2 y1 y2, x2 x3 y1 y3, (x1 + x2) y1 y3)
and let us choose a key K = 101011 (d = 3): K_1 = 101, K_2 = 011.
How will the message 101111 be encoded?
M = 101111, so M_0 = 101, M_1 = 111.
Iterations:
M_2 = M_0 + f(K_1, M_1) = (1, 0, 1) + f((1, 0, 1), (1, 1, 1)) = (1, 0, 1) + (0, 0, 1) = (1, 0, 0)
M_3 = M_1 + f(K_2, M_2) = (1, 1, 1) + f((0, 1, 1), (1, 0, 0)) = (1, 1, 1) + (0, 0, 0) = (1, 1, 1)
C = (M_2, M_3) = (1, 0, 0, 1, 1, 1)
So finally, 100111 is sent.

⁷ Otherwise, pad and split the original message into smaller pieces of length 2n.
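The sketch below (our own illustration, not the actual DES standard; function and variable names are ours) implements this iterative scheme and reproduces Example 7.10. Addition in GF(2) is the bitwise XOR.

```python
# Toy version of the iterative scheme of Example 7.10 (not the real DES standard).
# Vectors over GF(2) are tuples of bits; "+" in GF(2) is XOR.

def xor(a, b):
    return tuple(x ^ y for x, y in zip(a, b))

def f(x, y):
    # Non-linear function of Example 7.10 (m = n = 3).
    x1, x2, x3 = x
    y1, y2, y3 = y
    return (x1 & x2 & y1 & y2,
            x2 & x3 & y1 & y3,
            (x1 ^ x2) & y1 & y3)

def encrypt(m0, m1, keys):
    # M_i = M_{i-2} + f(K_{i-1}, M_{i-1}) for i = 2, ..., d
    blocks = [m0, m1]
    for k in keys:                       # keys = (K_1, ..., K_{d-1})
        blocks.append(xor(blocks[-2], f(k, blocks[-1])))
    return blocks[-2], blocks[-1]        # C = (M_{d-1}, M_d)

def decrypt(c, keys):
    # Run the recursion backwards: M_{i-2} = M_i + f(K_{i-1}, M_{i-1})
    m_prev, m_last = c
    for k in reversed(keys):
        m_prev, m_last = xor(m_last, f(k, m_prev)), m_prev
    return m_prev, m_last

K1, K2 = (1, 0, 1), (0, 1, 1)
C = encrypt((1, 0, 1), (1, 1, 1), [K1, K2])
print(C)                        # ((1, 0, 0), (1, 1, 1))  ->  100111 is sent
print(decrypt(C, [K1, K2]))     # ((1, 0, 1), (1, 1, 1))  ->  the original 101111
```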

Security of DES

The security of DES is based on an NP-Complete problem. As such, there are at least three possible sources of insecurity:


- NP = P: if it indeed happens someday that polynomial solutions can be found for NP problems, then these difficult problems will no longer be difficult at all! However, this is nowadays considered very unlikely.
- The size of the key is not big enough (recall that the complexity grows with the input size, and thus only long enough inputs lead to computation times long enough to be out of reach). Actually, ever since DES was proposed, it has been much criticized for its short 56-bit key size.
- But the most serious criticism is certainly that, because the problem is in NP, any possible solution can, by definition, be tested in polynomial time; i.e. if by chance the attacker guesses the right key, it is easy for him to check that it is indeed the right key! The main conclusion is that the security is not guaranteed in all cases: the system might, by chance, be easily cracked in some special cases. The only security comes from the low chance for the attacker to guess the key.

Summary for Chapter 7
Algorithmic Complexity: how the running time of an algorithm grows with respect to its input size.
P and NP: A problem is said to be in P if it can be solved by an algorithm the complexity of which is polynomial. A problem is said to be in NP if a proposed solution of it can be checked in polynomial time (with respect to the size of this solution). Pitfall: NP does not mean "not P" but rather "non-deterministic P".
One-Way function: a function that is easy to compute but difficult to invert.
DES: a cryptosystem based on the difficulty of solving non-linear boolean systems.

7.4 Public-Key Cryptography

Learning Objectives for Section 7.4
After studying this section you should know:
1. what public-key cryptography means and how it is possible;
2. what a trapdoor function is;
3. what the Diffie-Hellman Key Distribution System is and how it works;
4. how RSA works and on what its security relies.

The main problem to be addressed in large-scale cryptographic systems is: how to transmit the keys in a secure manner?


The clever idea of Diffie and Hellman is not to transmit the key at all, but to use public keys. Each pair of users can, using a generic system for the distribution of keys, have its own key for their communication. We present in this section two different schemes for public-key distribution: the Diffie-Hellman scheme and RSA.
The paper Diffie and Hellman published in 1976 created a real shock in the cryptography community at the time. That paper suggested that it is possible to build computationally secure ciphering systems without any secure channel for the exchange of the keys! Indeed, it was well known that public-key systems cannot provide any unconditional security, since H(K) = 0 for such systems! The breakthrough came from their clever idea that if computational security has been accepted (as indeed in most practical applications), then the secure exchange of secret keys is no longer required. This rather counter-intuitive idea relies on the fundamental notions of one-way functions, already presented, and trapdoor functions, to be introduced in a few sections. But before going further on this topic, we need a bit more mathematics.

7.4.1 A bit of Mathematics

Modern computational cryptography is based on finite field algebra, and more precisely on multiplication modulo p (where p is a prime), i.e. on the multiplicative group GF*(p) = {1, ..., p−1} of the Galois field GF(p). Since this group has p−1 elements, the Euler-Fermat theorem ensures that
n^(p−1) = 1 mod p
for every n in GF*(p).

Example 7.11 (GF*(5)) Let us consider GF*(5) = {1, 2, 3, 4} (i.e. p = 5), where, for instance, 4·3 = 2, 2·4 = 3, 2·3 = 1 (indeed 4·3 = 12 = 2 mod 5). Regarding the Euler-Fermat theorem, e.g. for n = 2 we have 2^4 = 16 = 1 mod 5.

Definition 7.8 (Primitive Root) An integer n is a primitive root of a prime p if and only if its multiplicative order modulo p is p−1, i.e.:
n^i ≠ 1 mod p for all 0 < i < p−1, and n^(p−1) = 1 mod p.

Theorem 7.5 For every prime number p, there exists at least one primitive root in GF*(p).

Example 7.12 Let us consider GF*(5) once again.
n = 2: 2^2 = 4, 2^3 = 3, 2^4 = 1, thus 2 is a primitive root in GF*(5).
n = 4: 4^2 = 1, i.e. 4 is not a primitive root.

Discrete Exponentiation

Consider GF*(p) for some prime p, and let a be a primitive root. By discrete exponentiation to the base a in GF*(p) we mean the function exp_a: GF*(p) → GF*(p) such that exp_a(n) = a^n. Since a is a primitive root, the p−1 possible values of exp_a(n) (as n ranges over GF*(p)) are all distinct. Thus its inverse function exp_a^(−1) exists. This function is called the discrete logarithm to the base a and is denoted Log_a.

Example 7.13 In GF*(5), we have seen that 2 is a primitive root. Log_2(3) = 3 (in GF*(5)): indeed, as seen in Example 7.12, 2^3 = 3 mod 5. Notice that Log_a(a) = 1 and that Log_a(1) = p−1 in GF*(p).

Control Question 74
Which of the following numbers are primitive roots in GF*(11): 2, 4, 5, 6, 9, 10?
Answer
2: yes; 4: no (4^5 = 1); 5: no (5^5 = 1); 6: yes; 9: no (9^5 = 1); 10: no (10^2 = 1).

Control Question 75
In GF*(11), compute Log_7 2, Log_7 4, Log_7 5, Log_7 6 and Log_7 10.
Answer
Log_7 2 = 3. Indeed, 7^3 = 7^2 · 7 = 49 · 7 = 5 · 7 = 35 = 2 mod 11.
Log_7 4 = 6: 7^6 = (7^3)^2 = 2^2 = 4 mod 11.
Log_7 5 = 2: 7^2 = 5 mod 11.
Log_7 6 = 7: 7^7 = 7^6 · 7 = 4 · 7 = 28 = 6 mod 11.
Log_7 10 = 5: 7^5 = 7^3 · 7^2 = 2 · 5 = 10 mod 11.
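These small cases can be checked by brute force; the sketch below (function names are ours) tests whether a number is a primitive root of p and computes discrete logarithms by exhaustive search, which is only feasible because p is tiny:

```python
# Brute-force checks in GF*(p); only feasible for tiny p
# (illustration of Control Questions 74 and 75, not an efficient algorithm).

def is_primitive_root(a: int, p: int) -> bool:
    # a is a primitive root iff its powers a^1, ..., a^(p-1) cover all of GF*(p).
    return {pow(a, i, p) for i in range(1, p)} == set(range(1, p))

def discrete_log(a: int, m: int, p: int) -> int:
    # Smallest n with a^n = m (mod p), found by exhaustive search.
    for n in range(1, p):
        if pow(a, n, p) == m:
            return n
    raise ValueError("no discrete logarithm found")

print([a for a in (2, 4, 5, 6, 9, 10) if is_primitive_root(a, 11)])  # [2, 6]
print([discrete_log(7, m, 11) for m in (2, 4, 5, 6, 10)])            # [3, 6, 2, 7, 5]
```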


Conjecture 7.1 (Diffie-Hellman-Pohlig) The discrete exponentiation is a one-way function.

First of all, discrete exponentiation is always easy to compute, requiring at most 2·log2(n) multiplications in GF*(p) using the square-and-multiply algorithm. On the other hand, the fastest algorithm known today [7] for finding discrete logarithms is in O(exp(n^(1/3) (log n)^(2/3))). However, there is no proof that no algorithm can compute the general discrete logarithm in a shorter time than the above. Diffie, Hellman and Pohlig conjectured that discrete exponentiation in GF*(p) (when the base is a primitive root) is a one-way function, provided that p is a large number such that p−1 also has a large prime factor. No proof of the conjecture has been given yet. But neither has an algorithm been found for efficiently computing the discrete logarithm. In other words, the historical evidence in favor of the conjecture has been increasing, but no theoretical evidence has yet been produced.

7.4.2 The Diffie-Hellman Public Key Distribution System

In their 1976 paper, Diffie and Hellman suggested an ingenious scheme for creating a common secret key between sender and receiver in a network without the need for a secure channel to exchange secret keys; their scheme relies on the one-way nature of discrete exponentiation. Suppose f(x) = a^x is truly a one-way function and is known to all users of the network. Each person (say, user A) randomly chooses (in secret!) a private (or secret) key xA and then computes her public key yA = a^xA, which is publicly published. When another person (say, user B) wishes to communicate securely with A, each fetches the public number of the other and raises it to the power of his/her own private key; i.e. user A computes yB^xA and user B computes yA^xB. What is magic is that these two numbers are indeed the same:
yB^xA = (a^xB)^xA = a^(xA·xB) = (a^xA)^xB = yA^xB.
This number kAB = a^(xA·xB), which both users A and B can compute, is their common secret, which they can safely use as their secret key for communication using a conventional secret-key cryptosystem. What the Diffie-Hellman scheme provides is thus a public way to distribute secret keys.
If some unauthorized person wishes to crack the key, he should be able to take the discrete logarithm of either yA or yB (e.g. xA = Log_a yA) and then get the desired secret key as kAB = yB^xA. But if the discrete exponentiation used is truly one-way, this attack is computationally infeasible. Up to now (2003), nobody has produced an attack on the Diffie-Hellman public key-distribution scheme that is not computationally equivalent to computing the discrete logarithm. However, neither has it been proved that all attacks on this system are computationally equivalent to computing the discrete logarithm.

Example 7.14 (Diffie-Hellman public key) In a Diffie-Hellman scheme, with


p = 127 and a = 67, a user A chooses as private key xA = 111. He then publishes his public key yA = 67^111 = 102 mod 127. Another user, B, chooses xB = 97; thus yB = 67^97 = 92 mod 127. These two users can communicate using the key kAB = 92^111 = 102^97 = 77 mod 127.

Control Question 76
In a Diffie-Hellman public key scheme, with p = 19 and a = 3 (which is indeed a primitive root in GF*(19)):
- what is the public key corresponding to the private key 5?
- what key does a person whose private key is 7 use to communicate with a person whose public key is 14?
Same question with p = 101 and a = 51.
Answer
15: 3^5 = 15 mod 19
3: 14^7 = 3 mod 19
With p = 101 and a = 51:
60: 51^5 = 60 mod 101
6: 14^7 = 6 mod 101
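A minimal sketch of the exchange with the toy parameters of Example 7.14 (variable names are ours); Python's built-in pow(base, exponent, modulus) performs the modular exponentiation:

```python
# Toy Diffie-Hellman key exchange with the parameters of Example 7.14.
p, a = 127, 67           # public: prime modulus and primitive root

x_A, x_B = 111, 97       # private keys, chosen secretly by A and B
y_A = pow(a, x_A, p)     # public key of A (102)
y_B = pow(a, x_B, p)     # public key of B (92)

# Each side raises the other's public key to the power of its own private key.
k_A = pow(y_B, x_A, p)
k_B = pow(y_A, x_B, p)
assert k_A == k_B        # both obtain the same shared secret (77 here)
print(y_A, y_B, k_A)     # 102 92 77
```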

7.4.3 Trapdoor Functions

Trapdoor functions, the second crucial notion introduced by Diffie and Hellman for their public-key cryptography framework, are more subtle and more difficult than the first one (one-way functions).

Definition 7.9 A trapdoor function is actually a family of bijective functions f_t, indexed by a parameter t (the trapdoor key), such that each function is a one-way function, but when t is known, f_t^(−1) is also easy to compute.

The cryptographic utility of a trapdoor function is the following: each user randomly (and secretly) chooses a trapdoor key, let us say t, and publishes f_t (but not t itself!). Usually f_t is taken in a family of functions so that only some parameters need to be published. These parameters are called the public key.
If someone wants to communicate a message M to the person whose published trapdoor function is f_t, he simply sends f_t(M), which is easy to compute since f_t is one-way. To get the proper message, the receiver computes f_t^(−1), which is also easy for him since he possesses the trapdoor key t. This computation is however difficult for any person who does not have the trapdoor key. An example of trapdoor function is given in the next section.

7.4.4 RSA

The first trapdoor function was proposed in 1978 by R. L. Rivest, A. Shamir and L. Adleman (RSA for short). The RSA trapdoor function is based on the supposed difficulty of factorizing integers. In this framework, both the message and the cryptogram are represented as (huge!) integers.
Each user chooses two (large) prime numbers p and q (such that p−1 and q−1 also have large prime factors) so that for all possible messages M, M < pq (otherwise split M into several parts so that each part is less than pq, and consider each part as M in the following). Let n = pq and m = (p−1)(q−1). The user then chooses d < m which is prime with m (i.e. d and m do not have any divisor in common) and computes e such that e·d = 1 mod m. An algorithm that computes e knowing d and m is given in the appendix at the end of this chapter.
The public key (to be published) is then (e, n) and the private key (to be kept secret) is (d, p, q, m). The encryption function (which is public) is
C = M^e mod n
and the decryption (which is secret) is
D = C^d mod n

The RSA framework works properly if D = M, i.e. M^(ed) = M mod n. This is indeed the case: since ed = 1 mod m, there exists some λ > 0 such that M^(ed) = M · M^(λm). Recall furthermore that, since p and q are prime numbers, M^(p−1) = 1 mod p and M^(q−1) = 1 mod q. Thus
M^(ed) = M · M^(λm) = M · M^(λ(p−1)(q−1)) = M · (M^(p−1))^(λ(q−1)) = M · 1 = M mod p
and similarly M^(ed) = M mod q.

A simple result from basic arithmetic is now required:

Theorem 7.6 Given three integers m, n, x, if n and m do not have any divisor in common, and x = 0 mod m and x = 0 mod n, then x = 0 mod (mn).

The proof is really straightforward. So is the following corollary, which is what we are interested in:


Corollary 7.1 If p and q are two prime numbers and if x = y mod p and x = y mod q then x = y mod (pq ).

Thus we have M^(ed) = M mod n.

Example 7.15 (RSA) Suppose we want to use an RSA system to communicate, and we choose p = 47 and q = 59. (This is not really secure, but serves for illustration purposes! In practice p and q should have more than 150 digits.) Then we compute n = pq = 2773 and m = (p−1)(q−1) = 2668, and choose a d that is prime with m, e.g. d = 157. Finally we compute e such that 157·e = 1 mod 2668 using Euclid's extended greatest common divisor algorithm: e = 17. e and n are published: (17, 2773), but the other numbers are kept secret.
Assume now someone wants to send us the message "ITS ALL GREEK TO ME". By a convention agreed upon beforehand (and which is public: Kerckhoffs' hypothesis), she transforms it into numbers:
09 20 19 00 01 12 12 00 07 18 05 05 11 00 ...
Since M must be less than n, i.e. M < 2773, she splits the above stream into integers of at most 4 digits (indeed the maximum code will then be 2626, corresponding to "ZZ"):
920, 1900, 112, 1200, 718, 505, 1100, ...
Then she computes 920^17 mod 2773, 1900^17 mod 2773, 112^17 mod 2773, ... and sends us the corresponding integers, i.e.
948, 2342, 1084, 1444, 2663, 2390, 778, ...
This message is decrypted using our own private key: 948^157 = 920, 2342^157 = 1900, ... mod 2773, and the decoding ends by applying the convention backwards: 920, 1900, ... = 09, 20, 19, 00, ... = ITS ...
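A minimal sketch of this toy RSA example (our own code, with the textbook's small parameters; real keys are vastly larger). pow(x, -1, m) computes the modular inverse in Python 3.8+, and the round-trip assertion relies only on the property M^(ed) = M mod n proved above:

```python
# Toy RSA with the parameters of Example 7.15 (far too small to be secure).
p, q = 47, 59
n, m = p * q, (p - 1) * (q - 1)     # n = 2773, m = 2668
d = 157                             # private exponent, prime with m
e = pow(d, -1, m)                   # e = 17, since 157 * 17 = 1 mod 2668 (Python 3.8+)

def encrypt(block: int) -> int:     # public operation
    return pow(block, e, n)

def decrypt(block: int) -> int:     # private operation
    return pow(block, d, n)

blocks = [920, 1900, 112, 1200]     # "IT", "S ", "AL", "L " encoded as integers
cipher = [encrypt(b) for b in blocks]
print(cipher)                       # should reproduce 948, 2342, 1084, 1444 from the example
assert [decrypt(c) for c in cipher] == blocks   # the round trip recovers the message
```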

Now you may wonder how our correspondent computed 1900^17 mod 2773, or how we computed 2342^157 mod 2773... This is done by the square-and-multiply method, keeping in mind that a·b = α·β mod n for any a = α mod n and b = β mod n. For instance
2342^157 = (2342^128) · (2342^16) · (2342^8) · (2342^4) · 2342 = 1428 · 239 · 284 · 900 · 2342 = 1900 mod 2773
since
2342^2 = 431^2 = 2743 mod 2773
2342^4 = 2743^2 = 30^2 = 900 mod 2773
2342^8 = 900^2 = 284 mod 2773
2342^16 = 284^2 = 239 mod 2773
and 2342^128 = 1428 mod 2773.

Control Question 77
Consider a very simple RSA setup where two people have the following parameters:

      p    q    d    e
  A   3   19    5   29
  B   5   11    7   23

1. What is the public key of A?
2. What is the public key of B?
3. How does A send the message 43 to B?
4. How does B send the message 43 to A?
Answer
1. (29, 57)
2. (23, 55)
3. 32: 43^23 = 32 mod 55
4. 25: 43^29 = 25 mod 57
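A sketch of the square-and-multiply idea (our own implementation; Python's built-in pow(base, exponent, modulus) does the same job): the exponent is scanned bit by bit, squaring at each step and multiplying when the bit is set, so only about 2·log2(exponent) modular multiplications are needed.

```python
# Square-and-multiply: modular exponentiation in O(log exponent) multiplications.
def square_and_multiply(base: int, exponent: int, modulus: int) -> int:
    result = 1
    square = base % modulus
    while exponent > 0:
        if exponent & 1:                         # current binary digit is 1: multiply
            result = (result * square) % modulus
        square = (square * square) % modulus     # next power-of-two power of the base
        exponent >>= 1
    return result

print(square_and_multiply(2342, 157, 2773))      # 1900, as computed by hand above
assert square_and_multiply(2342, 157, 2773) == pow(2342, 157, 2773)
```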

Security of RSA

Breaking RSA by finding m = (p−1)(q−1) is computationally equivalent to factoring n. In fact, all attacks proposed so far against RSA have turned out to be computationally equivalent to factoring n, but no proof has been forthcoming that this must be the case. Furthermore, there is no hard evidence that the factoring of integers is inherently difficult, but there is much historical evidence in support of this second conjecture: anyone who has tried to improve upon known factoring algorithms soon convinces himself that this is a very hard problem.
In 1999, the factorization of the 155-digit (512-bit) RSA Challenge Number was completed. It required 3.7 (real-life) months and 35.7 CPU-years in total. This CPU effort was estimated to be equivalent to approximately 8000 MIPS-years.⁸

⁸ See http://www.rsasecurity.com/rsalabs/challenges/factoring/rsa155.html for further details.


Up to now (2003), the RSA system is considered safe from attacks made by factoring, at least with keys of more than 1024 bits... until revolutionary advances in factoring algorithms are made! [which is unlikely today]

Summary for Chapter 7
primitive root: a number n smaller than a prime p is said to be a primitive root in GF*(p) if and only if the only power i, 0 < i < p, such that n^i = 1 is i = p−1.
discrete logarithm: an integer n is the discrete logarithm to the base a of another integer m in GF*(p) if a^n = m mod p (where p is a prime number and a a primitive root in GF*(p)).
Diffie-Hellman Public Key Distribution Scheme: given a prime p and a primitive root a in GF*(p), each user chooses a private key x and publishes his public key y = a^x mod p. When two users wish to communicate, each one uses as key the public key of the other raised to the power of his own private key: kAB = yB^xA = yA^xB.
trapdoor functions: a family of one-way functions depending on a parameter, such that when this parameter is known, the inverse is no longer hard to compute.
RSA: each user chooses two prime numbers p and q, and a number d < (p−1)(q−1) which has no common divisor with (p−1)(q−1). The public key is then (e, pq) where e is such that ed = 1 mod (p−1)(q−1), and the private key is (d, p, q). A message M (which is an integer less than pq) is encrypted using the public key by C = M^e mod pq. The decrypting is done using the private key: D = C^d mod pq.

7.5 Authentication

Learning Objectives for Section 7.5
After studying this section you should know:
1. why authentication and security are theoretically incompatible;
2. how to ensure authentication in practice;
3. what the Diffie-Lamport and RSA authentication schemes consist of.

In this last section, we want to address the second aspect of cryptography: authentication, i.e. ways to ensure that the message has been sent by an authorized person. In other words, we wonder here whether the received cryptogram C is a valid (legal) one or whether it has been faked by an unauthorized person.

Authentication and Security

Authentication was for a long time not easy to distinguish clearly from security. In fact, cryptographers have discovered only recently that these two goals are indeed quite independent, and even incompatible from a purely theoretical point of view. This result is due to the following theorem.

Theorem 7.7 The probability PI that a cryptogram is falsified (i.e. that one can find a cryptogram that is accepted although it has not been emitted by an authorized person) is bounded by
PI ≥ 2^(−I(C;K))
where C is the random variable representing the possible cryptograms and K the possible keys. This bound is tight and can be reached in special cases.

Proof Let χ(C, K) be the authentication function:
χ(C, K) = 1 if there exists a message M such that C = e(M, K), and χ(C, K) = 0 otherwise.
The probability that a cryptogram C is accepted as correct is
P(acc(C) | C) = Σ_K χ(C, K) P(K),
thus the probability of a falsification of authenticity is
PI = P(acc(C)) = Σ_C P(C, acc(C))
   = Σ_C P(C) P(acc(C) | C)
   = Σ_{C,K} χ(C, K) P(C) P(K)
   ≥ Σ_{C,K : P(C,K) > 0} P(C) P(K)
   = Σ_{C,K} P(C, K) · P(C) P(K) / P(C, K)
   = E[ P(C) P(K) / P(C, K) ],
i.e.
log(PI) ≥ log E[ P(C) P(K) / P(C, K) ].
Using Jensen's inequality, we have
log E[ P(C) P(K) / P(C, K) ] ≥ E[ log( P(C) P(K) / P(C, K) ) ] = −I(C; K).
Thus
PI ≥ 2^(−I(C;K)),
with equality if and only if P(C)P(K)/P(C, K) is constant for all (C, K) such that P(C, K) > 0, e.g. when C and K are independent.

Thus, to guarantee authenticity, i.e. to have a small infraction probability, the mutual information between cryptograms and keys must be big! However, to ensure perfect confidentiality, we have seen in the former sections that I(C; M) = 0 is required! Thus, from a strict information content point of view, authentication and confidentiality appear to be somehow incompatible. Indeed,
I(C; K) = I(C; M) + H(K) − H(M) − H(K | M, C)
thus I(C; M) = 0 (with the sensible assumption that H(K | M, C) = 0) implies I(C; K) = H(K) − H(M), which implies PI ≥ 2^(H(M)−H(K)) > 0; i.e. a perfect cryptosystem should have very complex keys (H(K) ≫ H(M)) to also ensure authentication.
One solution, though, is to have I(C; K) ≫ 0 (to ensure authenticity) but to ensure confidentiality by algorithmic security. This is one of the reasons why cryptographic systems based on algorithmic complexity have become popular for authentication. Let us now see how this can be done. The basic idea is to attach a digital signature to messages. Such a signature ensures that the sender of the message is actually who he claims to be; conversely, in case of disagreement, the sender whose message has been signed cannot deny having sent it, since he is the only one able to produce this signature.

7.5.1 Diffie-Lamport Authentication

In the Diffie-Lamport authentication framework, each user chooses 2n secret keys k_1, ..., k_n and k'_1, ..., k'_n, and 2n sequences s_1, ..., s_n and t_1, ..., t_n, and then produces the parameters σ_i = e(s_i, k_i) and τ_i = e(t_i, k'_i). He then publishes (i.e. makes publicly available) σ_i, τ_i, s_i and t_i, for i = 1...n.
To sign the binary message M of length n, M = m_1, ..., m_n, he uses the signature c_1, ..., c_n, where:
c_i = k_i if m_i = 0, and c_i = k'_i if m_i = 1.
Notice that this is a huge signature, since the k_i and k'_i are not bits but keys, i.e. sequences of k bits, where k is the size of the key required by the encryption algorithm.


When the receiver receives the message and its signature, he can check that the message has been sent by the right person by verifying:
if m_i = 0, that e(s_i, c_i) = σ_i; if m_i = 1, that e(t_i, c_i) = τ_i.
Such an authentication scheme however presents several drawbacks: a lot of material needs to be communicated in advance, and a message of n bits requires a signature of k·n bits!
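The sketch below illustrates the idea with a hash playing the role of the one-way encryption e (a simplification of the course's scheme: here the one-way function is applied directly to the secret keys, so the public sequences s_i, t_i are omitted). One pair of secrets is consumed per message bit, so each key set must sign a single message only.

```python
# Simplified Lamport-style one-time signature: a hash stands in for the
# one-way function and is applied directly to the secret keys.
import hashlib, secrets

def H(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

n = 8                                               # message length in bits
sk = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(n)]
pk = [(H(k0), H(k1)) for k0, k1 in sk]              # published verification data

def sign(bits):
    # Reveal one secret of each pair, chosen by the corresponding message bit.
    return [sk[i][b] for i, b in enumerate(bits)]

def verify(bits, signature):
    return all(H(signature[i]) == pk[i][b] for i, b in enumerate(bits))

msg = [1, 0, 1, 1, 0, 0, 1, 0]
sig = sign(msg)
print(verify(msg, sig))               # True
print(verify([0] + msg[1:], sig))     # False: the signature does not match a changed bit
```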

7.5.2 Authentication with RSA

Diffie and Hellman also showed in 1976 that, provided the domain and the range of the trapdoor function f_t coincide for every t, f_t can also be used to ensure the authenticity of messages with digital signatures. Provided that legitimate messages have sufficient internal structure to be reliably distinguished from random sequences (which is again against what is required for security!), if user A wishes to sign a message M so that it is unmistakable that it came from him, user A applies to M the decryption algorithm with his private key to compute the signed message M' = d(M, k_A). Any other user, say user B, who obtains M' can use the encryption algorithm with the public key k_A of A to compute e(M', k_A) = M. However, only user A knows how to write the meaningful message M as the random-looking string M'; and it is computationally difficult for any other user to find M' such that e(M', k_A) = M.
Of course, this scheme does not provide any secrecy. If secrecy is also desired, user A can send the signed message M' using the usual encryption method. Let us illustrate this general scheme with RSA. User A sends M to B using the cryptogram C = e_B(M), and signs it with
S(M) = e_B(d_A(M)).
B can verify authenticity by computing e_A(d_B(S(M))), which must be equal to M. Notice that this presupposes that d_A(M) is in the domain of e_B, i.e. in the case of RSA that d_A(M) < n_B. There are several practical tricks to handle this, among which the easiest is to split M into smaller pieces so that d_A(M) is actually smaller than n_B.

Example 7.16 (RSA Authentication) To continue our last example (Example 7.15), where p = 47, q = 59, d = 157 and e = 17, how will we sign the message "OK"? Using the same convention as before, "OK" corresponds to the integer 1511. The signature is then d(1511) = 1511^157 = 1657 mod 2773. If we send this message to someone whose public key is (725, 2881), the encrypted message will be 1511^725 = 1369 mod 2881 and the encrypted signature 1657^725 = 2304 mod 2881; i.e. we send (1369, 2304). The receiver decodes the message using her private key d = 65: M = 1369^65 = 1511 mod 2881, and checks the signature: S = 2304^65 = 1657 mod 2881, e(S) = 1657^17 = 1511 mod 2773. e(S) = M: the message is valid.


Control Question 78
Consider again the very simple RSA setup where two people have the following parameters:

      p    q    d    e
  A   3   19    5   29
  B   5   11    7   23

1. What does A send to B to authenticate the message 43?
2. B receives the message 22 from A, signed 41. Is it really A who sent it?
Answer
1. A sends 07. The signature is 28 (43^5 = 28 mod 57) and is encrypted into 07 (28^23 = 7 mod 55).
2. No, this message does not come from A. The ciphered text 22 corresponds to the message 33 (22^7 = 33 mod 55). The decoded signature is 46 (41^7 = 46 mod 55), which after public encryption for A should give 33. But 46^29 = 31 mod 57 ≠ 33. The correct signature for 33 would be 42.
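A sketch reproducing question 1 of this control question (our own code): A signs with her private exponent, encrypts the signature with B's public key, and B reverses both steps, re-encrypting with A's public key to check the signature:

```python
# Sign-then-encrypt with the toy RSA keys of Control Question 78.
n_A, d_A, e_A = 3 * 19, 5, 29       # A: n = 57
n_B, d_B, e_B = 5 * 11, 7, 23       # B: n = 55

M = 43
S = pow(M, d_A, n_A)                # A signs with her private key: 28
C = pow(S, e_B, n_B)                # ... and encrypts the signature for B: 7
print(S, C)                         # 28 7

S_received = pow(C, d_B, n_B)       # B decrypts the signature with his private key
print(pow(S_received, e_A, n_A) == M)   # True: the signature checks out
```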

Signature security
A cannot deny having sent M: only d_A can produce S. B (or anyone else but A) cannot change M into M': S(M') ≠ S(M), and computing S(M') is not possible without d_A.
In practice, however, there are several drawbacks to this signature scheme. For instance, the sender may deliberately publish his private key, thus making doubtful all digital signatures attributed to him, or he can also deliberately "lose" his private key so that the messages he sent become unverifiable. To make up for the latter, trusted organizations where the keys have to be recorded before transactions could play the role of "private key banks".

7.5.3 Shared secrets

Let us finish with a quite different authentication scheme, where the access to some critical resources or information must be shared by several persons. The idea in such a system is that several persons together can reconstruct the complete key, but none can do it alone. This is the shared secret method. Examples of such situations include, e.g., opening a safe with two keys, or a missile launch requiring 3 authorizations.


Suppose that we have n authorized persons (key holders) and that k parts of the key are enough to open the door, i.e. to access the secret S. This means:
H(S | p_{i1}, ..., p_{ik}) = 0
for any subset {i1, ..., ik} of {1, ..., n}, where p_1, ..., p_n are the n parts of the key. However, fewer than k parts are not enough to get any information about this secret:
H(S | p_{i1}, ..., p_{i(k−1)}) = H(S)
To do so, let us choose for every secret a polynomial of order k−1 whose lowest coefficient is S:
P(X) = S + a_1 X + a_2 X^2 + ... + a_{k−1} X^{k−1}
The other coefficients a_i are chosen at random, and are different from one secret to the other. The authorized user i receives as his part of the secret the value of the polynomial at i: p_i = P(i). This fulfills the above conditions: k users can reconstruct the polynomial by interpolation and thus get S, but k−1 (or fewer) cannot.

Example 7.17 Secret S = 0105, shared by n = 5 persons, among which any two can access the secret (k = 2):
P(x) = 105 + 50x
Thus p_1 = 155, p_2 = 205, p_3 = 255, p_4 = 305, p_5 = 355.
Reconstruction by 2 participants (e.g. 2 and 5):
P(2) = S + 2·a_1 = 205
P(5) = S + 5·a_1 = 355

Multiplying the first equation by 5 and the second by 2 and subtracting:
3S + a_1·(10 − 10) = 1025 − 710 = 315, thus S = 105.
But the reconstruction by only 1 participant is not feasible.
Secret sharing can usefully be used to create access structures to the secret: there are fewer users than distributed parts, and some of them receive more parts than the others. For example, imagine that opening a bank safe requires 1 director, or 2 authorized representatives, or 1 authorized representative and 2 cashiers, or 5 cashiers. Then, with k = 10, the bank director receives 10 parts, each authorized representative 6 parts, and each cashier 2 parts only. Thus the director alone has the required 10 parts to open the safe, 2 authorized representatives have 12 parts, 1 representative and 2 cashiers have 10 parts, and 5 cashiers have 10 parts.
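A sketch of this (k = 2, n = 5) sharing and of the reconstruction by Lagrange interpolation at x = 0, using exact fractions (our own illustration; a real implementation would typically work in a finite field rather than over the integers):

```python
# Toy secret sharing as in Example 7.17: P(x) = 105 + 50x, k = 2, n = 5.
from fractions import Fraction

def make_shares(coeffs, n):
    # coeffs = [S, a1, ..., a_{k-1}]; the share of user i is P(i).
    return {i: sum(c * i**j for j, c in enumerate(coeffs)) for i in range(1, n + 1)}

def reconstruct(shares):
    # Lagrange interpolation of P at x = 0 from any k shares: recovers S = P(0).
    secret = Fraction(0)
    for i, y_i in shares.items():
        term = Fraction(y_i)
        for j in shares:
            if j != i:
                term *= Fraction(0 - j, i - j)
        secret += term
    return int(secret)

shares = make_shares([105, 50], 5)
print(shares)                                       # {1: 155, 2: 205, 3: 255, 4: 305, 5: 355}
print(reconstruct({2: shares[2], 5: shares[5]}))    # 105: any two shares recover the secret
```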


The only problem with such a solution is that, for complex situations, the number of parts may be large.

Summary for Chapter 7
Authentication (ensuring that the message has been sent by an authorized person) and secrecy (ensuring that the message is received by authorized persons) are somehow theoretically incompatible, since the former requires I(C; K) to be as large as possible and the latter I(C; M) to be as small as possible:
PI ≥ 2^(−I(C;K))
Diffie-Lamport Authentication scheme: can be used to sign binary messages. Choose 2n keys and 2n sequences, publish the encryption of the latter by the former, and sign by sending one or the other key depending on each message bit.
RSA Authentication scheme: the signature is the message raised to the power of the private key. Send it encrypted using the addressee's public key.
Shared secrets: the access to one common secret is spread among several key holders using a polynomial.


Summary for Chapter 7
Cryptography aims at either transmitting messages securely (only authorized persons can read them) or authenticating messages (no unauthorized person could have sent them). To do so, the clear messages M are encoded using a key K and a deterministic function: C = e(M, K). Encrypted messages can be decoded deterministically using the decoding function d and the same key K, so that d(e(M, K), K) = M.
Perfect Secrecy: I(C; M) = 0. For a system to be perfectly secret there must be at least as many keys as messages, and H(K) must be greater than (or equal to) H(M).
One-Time Pad: for each encryption, a random key is chosen, whose length is equal to the message length and whose symbols are independent. The key is then simply added (symbol by symbol) to the message. The One-Time Pad is a perfect cipher.
unicity distance: the minimum amount of encrypted text that must be known (in unperfect cryptosystems) to determine the key almost surely: H(K | C1...Cu) ≈ 0. Under certain general assumptions, the unicity distance can be approximated by
u ≈ H(K) / (R(M) log |Σ|)
where R(M) is the redundancy of the unencrypted messages M.
One-Way function: a function that is easy to compute but difficult to invert.
DES: a cryptosystem based on the difficulty of solving non-linear boolean systems.
discrete logarithm: an integer n is the discrete logarithm to the base a of another integer m in GF*(p) if a^n = m mod p (where p is a prime number and a a primitive root in GF*(p)).
Diffie-Hellman Public Key Distribution Scheme: given a prime p and a primitive root a in GF*(p), each user chooses a private key x and publishes his public key y = a^x mod p. When two users wish to communicate, each one uses as key the public key of the other raised to the power of his own private key: kAB = yB^xA = yA^xB.
trapdoor functions: a family of one-way functions depending on a parameter, such that when this parameter is known, the inverse is no longer hard to compute.
RSA: each user chooses two prime numbers p and q, and a number d < (p−1)(q−1) which has no common divisor with (p−1)(q−1). The public key is then (e, pq) where e is such that ed = 1 mod (p−1)(q−1), and the private key is (d, p, q). A message M (which is an integer less than pq) is encrypted using the public key by C = M^e mod pq. The decrypting is done using the private key: D = C^d mod pq.
Authentication (to ensure that the message has been sent by an authorized person) and secrecy (to ensure that the message is received by authorized persons) are somehow theoretically incompatible, since the former requires I(C; K) as large as possible and the latter I(C; M) as small as possible.
Diffie-Lamport Authentication scheme: can be used to sign binary messages. Choose 2n keys and 2n sequences, publish the encryption of the latter by the former, and sign by sending one or the other key depending on each message bit.
RSA Authentication scheme: the signature is the message raised to the power of the private key. Send it encrypted using the addressee's public key.

Historical Notes and Bibliography


Secrecy of messages has for a very long time been a subject of study. It is indeed claimed to date back to ancient Egypt (1900 BC) or ancient China. In Europe, although Greeks and Romans (e.g. the Caesar cipher) already used ciphers, cryptography and cryptanalysis really started in the second half of the thirteenth century and developed more seriously from the fifteenth century. Around 1560, the French diplomat Blaise de Vigenère (1523-1596) developed his cryptosystem from the work of several of his predecessors: Alberti (1404-1472), Trithème (1462-1516) and Porta (1535-1615). The Vigenère cipher remained unbreakable for 300 years.
The hypothesis that the security of the cipher should reside entirely in the secret key was first made in 1883 by Auguste Kerckhoffs (1835-1903); and cryptographic history has demonstrated his wisdom. A determined enemy is generally able to obtain a complete blueprint of the enciphering and deciphering machines, either by clever deduction or by outright stealing or by measures in-between these extremes.
The first really scientific treatment of secrecy was only provided by C. Shannon in 1949 [11]. Shannon's theory of secrecy is in fact a straightforward application of the information theory that he had formulated one year before. The ingenuity of the 1949 paper lies not in the methods used therein but rather in the new way of viewing and the intelligent formulation that Shannon made of the problem of secrecy. Although Shannon gave his theory of secrecy in 1949, it was not until 1984 that Simmons gave an analogous theory of authenticity, illustrating how much more difficult and subtle authentication is.
The foundations of cryptographic thinking were once again shaken in 1976, when two Stanford University researchers, Whitfield Diffie and Martin E. Hellman, published their paper entitled "New Directions in Cryptography". Diffie and Hellman suggested that it is possible to have computationally secure cryptographic systems that require no secure channel for the exchange of secret keys. Ralph Merkle, then a graduate


student at Berkeley University, independently formulated the basic ideas of such public-key cryptography and submitted a paper thereon at almost the same time as Diffie and Hellman, but his paper was published almost two years later than theirs and he unfortunately lost the due credit for his discovery. The fundamental contribution of the Diffie-Hellman paper consisted in two crucial definitions, that of a one-way function (which they borrowed from R. M. Needham's work on computer passwords) and that of a trapdoor function (which was completely new), together with suggestions as to how such functions could eliminate the need for the exchange of secret keys in computationally-secure cryptographic systems. Although Diffie and Hellman shrewdly defined trapdoor functions in their 1976 paper and clearly pointed out the cryptographic potential of such functions, the first proposal of such a function was made only two years later, in 1978, by the M.I.T. researchers R. L. Rivest, A. Shamir and L. Adleman (thus RSA!).
In the meantime, the basis of DES (1977) came out of the IBM Lucifer cryptosystem (first published by Feistel in 1973!). However, whereas the Lucifer scheme used a key of 128 bits, the US National Bureau of Standards (now known as the National Institute of Standards and Technology), which published the DES, retained only 56 bits for the key. Ever since DES was first proposed, it has been much criticized for its short key size. In 1998, the EFF (Electronic Frontier Foundation) built a dedicated machine, "Deep Crack", in order to show the world that DES is not (or at least no longer) a secure algorithm. Deep Crack, which cost $200000 and was built with 1536 dedicated chips, was able to recover a 56-bit key, using exhaustive search, in 4 days on average, checking 92 billion keys each second.⁹ Later on (January 18, 1999), with the help of distributed.net, an organization specialized in collecting and managing computers' idle time, they broke a DES key in 22 hours and 15 minutes! More than 100000 computers (from the slowest PC to the most powerful multiprocessor machines) received and did a little part of the work; this allowed a rate of 250,000,000,000 keys being checked every second.¹⁰
In November 2002, AES (Advanced Encryption Standard), the successor to DES, was published. AES uses another type of algorithm (the Rijndael algorithm, invented by Joan Daemen and Vincent Rijmen) and supports key sizes of 128 bits, 192 bits, and 256 bits, which nowadays seems more than enough.

⁹ This part is borrowed from http://lasecwww.epfl.ch/memo des.shtml
¹⁰ For more details, see http://www.rsasecurity.com/rsalabs/challenges/des3/index.html

OutLook
D. Welsh, Codes and Cryptography, Oxford University Press, 1988.
http://www.ars-cryptographica.com/

Appendix: Solving e·d = 1 mod m


Finding e and k such that e·d − k·m = 1 can be done using Euclid's greatest common divisor algorithm (since the greatest common divisor of d and m is precisely 1).


Let u, v and t be vectors of Q² (i.e. pairs of rational numbers). The initialization step of the algorithm consists of u = (0, m), v = (1, d). The stop condition is that the second component v₂ of v equals 1; in this case the first component is v₁ = e, i.e. at the end of the algorithm v = (e, 1).
After the initialization step, the algorithm loops until the stop condition is fulfilled:
t ← u − r·v,  u ← v,  v ← t,  where r = ⌊u₂ / v₂⌋

Example 7.18 Let us find e such that 7e = 1 mod 60, i.e. d = 7 and m = 60:

  u          v           r              t
  (0, 60)    (1, 7)      ⌊60/7⌋ = 8     (−8, 4)
  (1, 7)     (−8, 4)     ⌊7/4⌋ = 1      (9, 3)
  (−8, 4)    (9, 3)      ⌊4/3⌋ = 1      (−17, 1)
  (9, 3)     (−17, 1)    (stop)

thus e = −17 mod 60, i.e. e = 43.
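A compact sketch of the same computation (our own code, using the standard extended Euclidean algorithm rather than the vector formulation above); it reproduces e = 43 for d = 7, m = 60, and e = 17 for the RSA example's d = 157, m = 2668:

```python
# Extended Euclidean algorithm: find e with e*d = 1 (mod m), assuming gcd(d, m) = 1.
def mod_inverse(d: int, m: int) -> int:
    old_r, r = d, m              # remainders
    old_s, s = 1, 0              # Bezout coefficients of d
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
    if old_r != 1:
        raise ValueError("d and m are not coprime")
    return old_s % m             # normalize into {1, ..., m-1}

print(mod_inverse(7, 60))        # 43  (Example 7.18)
print(mod_inverse(157, 2668))    # 17  (Example 7.15)
```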

Bibliography
[1] R. Ash. Information Theory. Wiley, 1965.
[2] R. Blahut. Theory and Practice of Error Control Codes. Addison-Wesley, 1983.
[3] P. Elias. Coding for noisy channels. In IRE Conv. Rec., volume 4, pages 37–47, 1955.
[4] A. Feinstein. A new basic theorem of information theory. IRE Trans. Information Theory, IT-4:2–22, 1954.
[5] R. G. Gallager. A simple derivation of the coding theorem and some applications. IEEE Trans. Information Theory, IT-11:3–18, 1965.
[6] R. G. Gallager. Information Theory and Reliable Communication. Wiley, 1968.
[7] D. M. Gordon. Discrete logarithms in GF(p) using the number field sieve. SIAM Journal of Computing, 1(6):124–138, 1993.
[8] S. Lin and D. J. Costello. Error Control Coding: Fundamentals and Applications. Prentice-Hall, 1983.
[9] David Salomon. Data Compression: The Complete Reference. Springer, 1998.
[10] C. E. Shannon. A mathematical theory of communication. Bell Sys. Tech. Journal, 27:379–423, 623–656, 1948.
[11] C. E. Shannon. Communication theory of secrecy systems. Bell Sys. Tech. Journal, 28:656–715, 1949.
[12] J. H. van Lint. Introduction to Coding Theory. Springer-Verlag, 1982.
[13] A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13:260–269, 1967.
[14] J. M. Wozencraft and B. Reiffen. Sequential Decoding. MIT Press, 1961.


Glossary
BSC: Binary Symmetric Channel. 150
BSS: Binary Symmetric Source: a memoryless binary source with P(0) = P(1) = 1/2. 165
DMC: Discrete Memoryless Channel, see definition page 149. 149
Hamming distance: the number of coordinates in which two vectors differ. 199
Huffman code: an efficient coding framework for memoryless information sources. 104
Minimum Distance: the minimum (non-null) Hamming distance between any two (different) codewords. 203
arity: size of the alphabet. For an n-ary tree: n, the number of child nodes of each node. 82
block-code: a non-empty set of words of the same length, considered as row vectors. 199
channel capacity: the maximum average amount of information the output of the channel can bring on the input. 152
codeword: a non-empty sequence of symbols from the coding alphabet. 82
communication channel: formalization of what happens to the transmitted messages between their emission and their reception. 149
complete code: a prefix-free code, the coding tree of which does not have any unused leaf. 89
cyclic code: a linear code, every shift of the codewords of which is also a codeword. 225
expected code length: the expected value of the length of the codewords. 95
generator matrix: a matrix the rows of which are linearly independent codewords. Such a matrix is used to encode the messages. 211
information source: generator of sequences of symbols. 82
instantaneously decodable: a code is said to be instantaneously decodable if each codeword in any string of codewords can be decoded as soon as its end is reached. 85
lattice: the unfolding in time of all the paths in the state diagram of a convolutional code, corresponding to all the possible encodings of messages of different lengths. 240
linear code: a block-code with vector space structure. 207
message: sequence of symbols. 82, 149
minimum distance decoding: decoding framework in which each received word is decoded into the closest codeword. 202
non-singular codes: codes for which different source symbols map to different codewords. 83
null codeword: the codeword made only of zeros. 200
one-way function: a function that is easy to compute but difficult to computationally invert. 266
stationary sources: information sources, the statistics of which do not depend on time. 83
syndrome: product of the received word by a verification matrix. 218
transmission probabilities: conditional probability distributions of the output of a channel knowing the input. 149
trapdoor function: family of one-way functions depending on a parameter, such that, when the parameter is known, the inverse is no longer hard to compute. 273
weight: the number of non-zero symbols. 199
without feedback: a channel is said to be used without feedback when the probability distribution of the inputs does not depend on the output. 151

Index
block-code, 199
capacity, 152
code: block, 199; cyclic, 225; linear, 207
cyclic code, 225
decoding: minimum distance, 202
distance: Hamming, 199
DMC, 149
function: one-way, 266
generator matrix, 211
Hamming distance, 199
information source, 82
lattice, 240
linear code, 207
message, 82
minimum distance decoding, 202
minimum distance of a code, 203
null codeword, 200
one-way function, 266
syndrome, 218
transmission rate, 157
verification matrix, 215
weight, 199
