This textbook introduces the subject of information theory at a level suitable for advanced
undergraduate and graduate students. It develops both the classical Shannon theory and recent
applications in statistical learning. There are six parts covering foundations of information mea-
sures; data compression; hypothesis testing and large deviations theory; channel coding and
channel capacity; rate-distortion and metric entropy; and, finally, statistical applications. There are
over 210 exercises, helping the reader deepen their knowledge and learn about recent discoveries.
Yury Polyanskiy is a Professor of Electrical Engineering and Computer Science at MIT and a
member of the Laboratory of Information and Decision Systems (LIDS) and the Statistics and Data
Science Center (SDSC). He was elected an IEEE Fellow (2024), received the 2020 IEEE Information
Theory Society James Massey Award and co-authored papers receiving Best Paper Awards from
the IEEE Information Theory Society (2011), the IEEE International Symposium on Information
Theory (2008, 2010, 2022), and the Conference on Learning Theory (2021). At MIT he teaches
courses on information theory, probability, statistics, and machine learning.
Yihong Wu is a Professor in the Department of Statistics and Data Science at Yale University.
He was a recipient of the Marconi Society Paul Baran Young Scholar Award in 2011, the NSF
CAREER award in 2017, and the Sloan Research Fellowship in Mathematics in 2018, and was elected
an IMS Fellow in 2023. He has taught classes on probability, statistics, and information theory at Yale
University and the University of Illinois at Urbana-Champaign.
Information Theory
From Coding to Learning
FIRST EDITION
Yury Polyanskiy
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
Yihong Wu
Department of Statistics and Data Science
Yale University
www.cambridge.org
Information on this title: www.cambridge.org/XXX-X-XXX-XXXXX-X
DOI: 10.1017/XXX-X-XXX-XXXXX-X
© Author name XXXX
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published XXXX
Printed in <country> by <printer>
A catalogue record for this publication is available from the British Library.
Library of Congress Cataloging-in-Publication Data
ISBN XXX-X-XXX-XXXXX-X Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
Dedicated to
My family (Y.W.)
Contents
Preface page xv
Introduction xvii
2 Divergence 20
2.1 Divergence and Radon-Nikodym derivatives 20
2.2 Divergence: main inequality and equivalent expressions 24
2.3 Differential entropy 26
2.4 Markov kernels 29
2.5 Conditional divergence, chain rule, data-processing inequality 31
2.6* Local behavior of divergence and Fisher information 36
2.6.1* Local behavior of divergence for mixtures 36
2.6.2* Parametrized family 38
3 Mutual information 41
3.1 Mutual information 41
3.2 Mutual information as difference of entropies 44
3.3 Examples of computing mutual information 47
3.4 Conditional mutual information and conditional independence 50
3.5 Sufficient statistic and data processing 53
3.6 Probability of error and Fano’s inequality 55
3.7* Estimation error in Gaussian noise (I-MMSE) 58
3.8* Entropy-power inequality 63
7 f-divergences 115
7.1 Definition and basic properties of f-divergences 115
7.2 Data-processing inequality; approximation by finite partitions 118
7.3 Total variation and Hellinger distance in hypothesis testing 122
7.4 Inequalities between f-divergences and joint range 126
7.5 Examples of computing joint range 130
7.5.1 Hellinger distance versus total variation 131
7.5.2 KL divergence versus total variation 131
7.5.3 χ²-divergence versus total variation 132
7.6 A selection of inequalities between various divergences 132
7.7 Divergences between Gaussians 133
7.8 Mutual information based on f-divergence 134
7.9 Empirical distribution and χ²-information 136
7.10 Most f-divergences are locally χ²-like 138
Preface
This book is a modern introduction to information theory. In the last two decades, the subject has
evolved from a discipline primarily dealing with problems of information storage and transmission
(“coding”) to one focusing increasingly on information extraction and denoising (“learning”). This
transformation is reflected in the title and content of this book.
It took us more than a decade to complete this work. It started as a set of lecture notes accumu-
lated by the authors through teaching regular courses at MIT, University of Illinois, and Yale, as
well as topics courses at EPFL (Switzerland) and ENSAE (France). Consequently, the intended
usage of this manuscript is as a textbook for a first course on information theory for graduate and
senior undergraduate students, or for a second (topics) course delving deeper into specific areas.
There are two aspects that make this textbook unusual. The first is that, while written by
information-theoretic “insiders”, the material is very much outward looking. While we do cover
in depth the bread-and-butter results (coding theorems) of information theory, we also dedicate
much effort to ideas and methods which have found influential applications in statistical learning,
statistical physics, ergodic theory, computer science, probability theory and more. The second
aspect is that we cover the time-tested classical material (such as connections to combina-
torics, ergodicity, functional analysis) along with the latest developments of very recent years
(large alphabet distribution estimation, community detection, mixing of Markov chains, graphical
models, PAC-Bayes, generalization bounds).
It is hard to mention everyone who helped us start and finish this work, but some stand out
especially. We owe our debt to Sergio Verdú, whose course at Princeton is responsible for our life-
long admiration of the subject. His passion and pedagogy are reflected, if imperfectly, on these
pages. For an undistorted view see his forthcoming comprehensive monograph [436].
Next, we were fortunate to have many bright students contribute to the typing of the original lecture
notes (the precursor of this book), as well as to correcting and extending the content. Among them,
we especially thank Ganesh Ajjanagadde, Austin Collins, Yuzhou Gu, Richard Guo, Qingqing
Huang, Alexander Haberman, Matthew Ho, Yunus Inan, Reka Inovan, Jason Klusowski, Anuran
Makur, Pierre Quinton, Aolin Xu, Sheng Xu, Pengkun Yang, Andrew Yao, and Junhui Zhang.
We thank many colleagues who provided valuable feedback at various stages of the book draft
over the years, in particular, Lucien Birgé, Marco Dalai, Meir Feder, Bob Gallager, Bobak Nazer,
Or Ordentlich, Henry Pfister, Maxim Raginsky, Sasha Rakhlin, Philippe Rigollet, Mark Sellke,
and Nikita Zhivotovskiy. Rachel Cohen (MIT) has been very kind with her time and helped in a
myriad of different ways.
We are grateful for the support from our editors Chloe Mcloughlin, Elizabeth Horne, Sarah
Strange, and Julie Lancashire at Cambridge University Press (CUP), and to CUP for allowing us to keep
a free version online. Our special acknowledgement is to Julie for providing that initial push and
motivation in 2019, without which we would never even have considered going beyond the initial
set of chaotic online lecture notes. (Though, if we had known it would take 4 years...)
The cover art and artwork were contributed by the talented illustrator Nastya Mukhanova [311],
whom we cannot praise enough.
Y. P. would like to thank Olga for her unwavering patience and encouragement. Her loving sac-
rifice made the luxury of countless hours of extra time available to Y. P. to work on this book. Y. P.
would also like to extend a literary hug to Yana, Alina and Evan and thank them for brightening
up his life.
Y. W. would like to thank his parents and his wife, Nanxi.
Y. Polyanskiy <[email protected]>
MIT
Y. Wu <[email protected]>
Yale
Introduction
What is information?
The Oxford English Dictionary lists 18 definitions of the word information, while the Merriam-
Webster Dictionary lists 17. This emphasizes the diversity of meaning and contexts in which the
word information may appear. This book, however, is only concerned with a precise mathematical
understanding of information, independent of the application domain.
How can we measure something that we cannot even define well? Among the earliest attempts
at quantifying information we can list R.A. Fisher's works on the uncertainty of statistical esti-
mates (“confidence intervals”) and R. Hartley's definition of information as the logarithm of the
number of possibilities. Around the same time, Fisher [169] and others identified a connection
between information and thermodynamic entropy. This line of thinking culminated in Claude
Shannon's magnum opus [378], where he formalized the concept of (what we call today) Shan-
non information and forever changed human language by adopting John Tukey's word bit as
the unit of its measurement. In addition to possessing a number of elegant properties, Shannon
information turned out to also answer certain rigorous mathematical questions (such as the opti-
mal rate of data compression and data transmission). This singled out Shannon’s definition as the
right way of quantifying information. Classical information theory, as taught in [106, 111, 177],
focuses exclusively on this point of view.
In this book, however, we take a slightly more general point of view. To introduce it, let us
quote an eminent physicist L. Brillouin [76]:
We must start with a precise definition of the word “information”. We consider a problem involving a certain
number of possible answers, if we have no special information on the actual situation. When we happen to be
in possession of some information on the problem, the number of possible answers is reduced, and complete
information may even leave us with only one possible answer. Information is a function of the ratio of the
number of possible answers before and after, and we choose a logarithmic law in order to insure additivity of
the information contained in independent situations.
Note that only the last sentence specializes the more general term information to Shannon's
special version. In this book, we think of information without that last sentence. Namely, for us
information is a measure of difference between two beliefs about the system state. For example, it
could be the amount of change in our worldview following an observation or an event. Specifically,
suppose that initially the probability distribution P describes our understanding of the world (e.g.,
P allows us to answer questions such as how likely it is to rain today). Following an observation our
distribution changes to Q (e.g., upon observing clouds or a clear sky). The amount of information in
the observation is the dissimilarity between P and Q. How to quantify dissimilarity depends on the
particular context. As argued by Shannon, in many cases the right choice is the Kullback-Leibler
(KL) divergence D(Q‖P), see Definition 2.1. Indeed, if the prior belief is described by a probability
mass function P = (p_1, . . . , p_k) on the set of k possible outcomes, then the observation of the first
outcome results in the new (posterior) belief vector Q = (1, 0, . . . , 0), giving D(Q‖P) = log(1/p_1),
and similarly for other outcomes. Since the outcome i happens with probability p_i, we see that the
average dissimilarity between the prior and posterior beliefs is
$$\sum_{i=1}^{k} p_i \log \frac{1}{p_i},$$
¹ For Institute of Electrical and Electronics Engineers; pronounced “Eye-triple-E”.
entropy [387, 322]. In physics, the Landauer principle and other works on Maxwell's demon have
been heavily influenced by information theory [267, 42]. In natural language processing (NLP),
the idea of modeling text as a high-order Markov model has seen spectacular successes recently in
the form of GPT [320] and related models. Many more topics ranging from biology, neuroscience
and thermodynamics to pattern recognition, artificial intelligence and control theory all regularly
appear in information-theoretic conferences and journals.
It seems that objectively circumscribing the territory claimed by information theory is futile.
Instead, we would like to highlight what we believe to be the recent developments that fascinate
us and which motivated us to write this book.
First, information processing systems of today are much more varied compared to those of last
century. A modern controller (robot) is not just reacting to a few-dimensional vector of observa-
tions, modeled as a linear time-invariant system. Instead, it has million-dimensional inputs (e.g.,
a rasterized image), delayed and quantized, which also need to be communicated across noisy
links. The target of statistical inference is no longer a low-dimensional parameter, but rather a
high-dimensional (possibly discrete) object with structure (e.g. a sparse matrix, or a social net-
work between people with underlying community structure). Furthermore, observations arrive
at the statistician from spatially or temporally separated sources, which need to be transmitted
cognizant of rate limitations. Recognizing these new challenges, multiple communities simul-
taneously started re-investigating classical results (Chapter 29) on the optimality of maximum
likelihood and the (optimal) variance bounds given by the Fisher information. These developments
in high-dimensional statistics, computer science and statistical learning depend on the mastery of
the f-divergences (Chapter 7), the mutual-information method (Chapter 30), and the strong version
of the data-processing inequality (Chapter 33).
Second, since the 1990s technological advances have brought about a slew of new noisy channel
models. While classical theory addresses the so-called memoryless channels, the modern channels,
such as in flash storage, or urban wireless (multi-path, multi-antenna) communication, are far from
memoryless. In order to analyze these, the classical “asymptotic i.i.d.” theory is insufficient. The
resolution is the so-called “one-shot” approach to information theory, in which all main results
are developed while treating the channel inputs and outputs as abstract [211]. Only at the last step
are those inputs given the structure of long sequences and the asymptotic values calculated.
This new “one-shot” approach has additional relevance for quantum information theory, where it
is in fact necessary.
Third, following impressive empirical achievements in the 2010s there was an explosion of
interest in understanding the methods and limits of machine learning from data. Information-
theoretic principles were instrumental for several discoveries in this area. As examples, we recall
the concept of metric entropy (Chapter 27), which is a cornerstone of Vapnik's approach to supervised
learning (known as empirical risk minimization), non-linear regression and the theory of density esti-
mation (Chapter 32). In machine learning, density estimation is known as probabilistic generative
modeling, a prototypical problem in unsupervised learning. At present, the best algorithms are those
derived by applying information-theoretic ideas: the Gibbs variational principle for Kullback-Leibler
divergence (in variational auto-encoders (VAE), cf. Example 4.2) and variational characteriza-
tion of Jensen-Shannon divergence (in generative adversarial networks (GAN), cf. Example 7.5).
The textbook [111] spearheaded the combinatorial approach to information theory, known as
“the method of types”. While more mathematically demanding than [106], [111] manages to intro-
duce stronger results such as sharp estimates of error exponents and, especially, rate regions in
multi-terminal communication systems. However, both books are almost exclusively focused on
asymptotics, Shannon-type information measures and discrete (finite alphabet) cases.
Focused on specialized topics, several monographs are available. For a communication-oriented
reader, the classical [177] is still indispensable. The one-shot point of view is taken in [211]. Con-
nections to statistical learning theory and learning on graphs (belief propagation) are beautifully
covered in [287]. Ergodic theory is the central subject in [198]. Quantum information theory – a
burgeoning field – is treated in the recent [451]. The only textbook dedicated to the connection
between information theory and statistics is by Kullback [264], though restricted to large-sample
asymptotics in hypothesis testing. In nonparametric statistics, application of information-theoretic
methods is briefly (but elegantly) covered in [424].
Nevertheless, it is not possible to quilt this textbook from chapters of these excellent prede-
cessors. A number of important topics are treated exclusively here, such as those in Chapters 7
(f-divergences), 18 (one-shot coding theorems), 22 (finite blocklength), 27 (metric entropy), 30
(mutual information method), 32 (entropic bounds on estimation), and 33 (strong data-processing
inequalities). Furthermore, building up to these chapters requires numerous small innovations
across the rest of the textbook, which are not available elsewhere. In addition, the exercises explore
works of the last few years.
Turning to omissions, this book almost entirely skips the topic of multi-terminal information
theory (with the exception of Sections 11.7*, 16.5* and 25.3*). This difficult subject captivated much
of the effort in the post-Shannon “IEEE-style” theory. We refer to the classical [115] and the recent
excellent textbook [147] containing an encyclopedic coverage of this area.
Another unfortunate omission is the connection between information theory and functional
inequalities [106, Chapter 17]. This topic has seen a flurry of recent activity, especially in loga-
rithmic Sobolev inequalities, isoperimetry, concentration of measure, Brascamp-Lieb inequalities,
(Marton-Talagrand) information-transportation inequalities and others. We only briefly mention
these topics in Sections 3.7*, 3.8* and associated exercises (e.g. I.47 and I.65). For a fuller
treatment, see the monograph [353] and references there.
Finally, this book will not teach one how to construct practical error-correcting codes or design
modern wireless communication systems. Following our Part IV, which covers the basics, an
interested reader is advised to master the tools from coding theory via [360] and multiple-antenna
channels via [423].
A note to statisticians
The interplay between information theory and statistics is a constant theme in the development of
both fields. Since its inception, information theory has been indispensable for understanding the
fundamental limits of statistical estimation. The prominent role of information-theoretic quanti-
ties, such as mutual information, f-divergence, metric entropy, and capacity, in establishing the
minimax rates of estimation has long been recognized since the seminal work of Le Cam [272],
Ibragimov and Khas’minski [222], Pinsker [328], Birgé [53], Haussler and Opper [216], Yang
and Barron [464], among many others. In Part VI of this book we give an exposition of some of
the most influential information-theoretic ideas and their applications in statistics. This part is not
meant to be a thorough treatment of decision theory or mathematical statistics; for that purpose,
we refer to the classics [222, 276, 68, 424] and the more recent monographs [78, 190, 446] focus-
ing on high dimensions. Instead, we apply the theory developed in previous Parts I–V of this book
to several concrete and carefully chosen examples of determining the minimax risk in both classi-
cal (fixed-dimensional, large-sample asymptotic) and modern (high-dimensional, non-asymptotic)
settings.
At a high level, the connection between information theory (in particular, data transmission)
and statistical inference is that both problems are defined by a conditional distribution PY|X , which
is referred to as the channel for the former and the statistical model or experiment for the latter. In
both disciplines the ultimate goal is to estimate X with high fidelity based on its noisy observation Y
using computationally efficient algorithms. However, in data transmission the set of allowed values
of X is typically discrete and restricted to a carefully chosen subset of inputs (called codebook),
the design of which is considered to be the main difficulty. In statistics, however, the space or
the distribution of allowed values of X (the parameter) is constrained by the problem setup (for
example, requiring sparsity or low rank on X), not by the statistician. Despite this key difference,
both disciplines in the end are all about estimating X based on Y and information-theoretic ideas
are applicable in both settings.
Specifically, in Chapter 29 we show how the data processing inequality can be used to deduce
classical lower bounds in statistical estimation (Hammersley-Chapman-Robbins, Cramér-Rao,
van Trees). In Chapter 30 we introduce the mutual information method, based on the reasoning
in joint source-channel coding. Namely, by comparing the amount of information contained in
the data and the amount of information required for achieving a given estimation accuracy, both
measured in bits, this method allows us to apply the theory of capacity and rate-distortion func-
tion developed in Parts IV and V to lower bound the statistical risk. Besides being principled, this
approach also unifies the three popular methods for proving minimax lower bounds due to Le
Cam, Assouad, and Fano respectively (Chapter 31).
It is a common misconception that information theory only supplies techniques for proving
negative results in statistics. In Chapter 32 we present three upper bounds on statistical estimation
risk based on metric entropy: Yang-Barron’s construction inspired by universal compression, Le
Cam-Birgé’s tournament based on pairwise hypothesis testing, and Yatracos’ minimum-distance
approach. These powerful methods are responsible for some of the strongest and most general
results in statistics and are applicable to both high-dimensional and nonparametric problems. Finally,
in Chapter 33 we introduce the method based on strong data processing inequalities and apply
it to resolve an array of contemporary problems including community detection on graphs, dis-
tributed estimation with communication constraints, and generating random tree colorings. These
problems are increasingly captivating the minds of computer scientists as well.
• Part I: Chapters 1–3, Sections 4.1, 5.1–5.3, 6.1, and 3.6, focusing only on discrete prob-
ability space and ignoring Radon-Nikodym derivatives. Some mention of applications in
combinatorics and cryptography (Chapters 8, 9 and select exercises) is recommended.
• Part II: Chapter 10, Sections 11.1–11.5.
• Part III: Chapter 14, Sections 15.1–15.3, and 16.1.
• Part IV: Chapters 17–18, Sections 19.1–19.3, 19.7, 20.1–20.2, 23.1.
• Part V: Sections 24.1–24.3, 25.1, 26.1, and 26.3.
• Conclude with a few applications of information theory outside the classical domain (Chap-
ters 30 and 33).
General conventions
• h(·) is the binary entropy function, H(·) denotes general Shannon entropy.
• d(·‖·) is the binary divergence function, D(·‖·) denotes general Kullback-Leibler divergence.
• Standard channels: BSC_δ, BEC_δ, BIAWGN_{σ²}.
• Common divergences are χ²(·‖·) (chi-squared), D_α(·‖·) (Rényi divergence), D_f(·‖·) (general
f-divergence).
• The Lebesgue measure on Euclidean spaces is denoted by Leb and also by vol (volume).
• Throughout the book, all measurable spaces (X, E) are standard Borel spaces. Unless explicitly
needed, we suppress the underlying σ-algebra E.
• The collection of all probability measures on X is denoted by P(X). For finite spaces we
abbreviate P_k ≡ P([k]), a (k − 1)-dimensional simplex.
• For measures P and Q, their product measure is denoted by P × Q or P ⊗ Q. The n-fold product
of P is denoted by P^n or P^{⊗n}. Similarly, given a Markov kernel P_{Y|X}: X → Y, the kernel that
acts independently on each of the n coordinates is denoted by P_{Y|X}^{⊗n}: X^n → Y^n.
is referred to as the probability density function (pdf); if Q is the counting measure on a countable
X, dP/dQ is the probability mass function (pmf).
• Let P ⊥ Q denote their mutual singularity, namely, P(A) = 0 and Q(A) = 1 for some A.
• The support of a probability measure P, denoted by supp(P), is the smallest closed set C such
that P(C) = 1. An atom x of P is such that P({x}) > 0. A distribution P is discrete if supp(P)
is a countable set (consisting of its atoms).
• Let X be a random variable taking values on X , which is referred to as the alphabet of X. Its
realizations are labeled by lower case letters, e.g. x. Thus, upper case, lower case, and script case
are matched to random variables, realizations, and alphabets, respectively (as in X = x ∈ X ).
Oftentimes X and Y are automatically assumed to be the alphabets of X and Y, etc. We also write
X ∈ X to mean that random variable X is X -valued.
• Let PX denote the distribution (law) of the random variable X, PX,Y the joint distribution of X
and Y, and PY|X the conditional distribution of Y given X.
• A conditional distribution PY|X is also called a Markov kernel acting between spaces X and Y ,
written as PY|X : X → Y . Given a conditional distribution PY|X and a marginal we can form
a joint distribution, written as PX × PY|X , or simply PX PY|X . Its marginal PY is denoted by a
composition operation PY ≜ PY|X ◦ PX .
• The independence of random variables X and Y is denoted by X ⊥⊥ Y, in which case P_{X,Y} =
P_X × P_Y. Similarly, X ⊥⊥ Y|Z denotes their conditional independence given Z, in which case
P_{X,Y|Z} = P_{X|Z} × P_{Y|Z}.
• Throughout the book, X^n ≡ X_1^n ≜ (X_1, . . . , X_n) denotes an n-dimensional random vector. We
write X_1, . . . , X_n ∼ P i.i.d. if they are independently and identically distributed (iid) as P, in which
case P_{X^n} = P^n.
• The empirical distribution of a sequence x_1, . . . , x_n is denoted by P̂_{x^n}; the empirical distribution of
a random sample X_1, . . . , X_n is denoted by P̂_n ≡ P̂_{X^n}.
• →_{a.s.}, →_P, →_d denote convergence almost surely, in probability, and in distribution (law),
respectively. We define =_d to mean equality in distribution.
• Occasionally, for clarity we use a self-explanatory notation E_{Y∼Q}[·] to mean that the expectation
is taken with Y generated from distribution Q. We also use cues like E_C[·] to signify that the
expectation is taken over C.
• Some commonly used distributions are as follows:
– Ber(p): Bernoulli distribution with mean p.
– Bin(n, p): Binomial distribution with n trials and success probability p.
– Poisson(λ): Poisson distribution with mean λ.
– N(μ, σ²) is the Gaussian (normal) distribution on R with mean μ and variance σ². N(μ, Σ) is the
Gaussian distribution on R^d with mean μ and covariance matrix Σ. Denote the standard nor-
mal density by φ(x) = (1/√(2π)) e^{−x²/2}, and the CDF and complementary CDF by
Φ(t) = ∫_{−∞}^{t} φ(x) dx and Φ̄(t) = 1 − Φ(t).
– For a compact subset X of Rd with non-empty interior, Unif(X ) denotes the uniform distri-
bution on X , with Unif(a, b) ≡ Unif([a, b]) for interval [a, b]. We also use Unif(X ) to denote
the uniform (equiprobable) distribution on a finite set X .
• For an R^d-valued random variable X we denote by Cov(X) = E[(X − E[X])(X − E[X])^⊤] its covariance
matrix. A conditional version is denoted by Cov(X|Y) = E[(X − E[X|Y])(X − E[X|Y])^⊤ | Y].
• For a set E ⊂ Ω we denote by 1E (ω) the function equal to 1 iff ω ∈ E. Similarly,
1{boolean condition} denotes a random variable that is equal to 1 iff the “boolean condition”
is satisfied and otherwise equals zero. Thus, for example, P[X > 1] = E[1{X > 1}].
Part I
Information measures
Information measures form the backbone of information theory. The first part of this book is
devoted to an in-depth study of some of them, most notably, entropy, divergence, mutual informa-
tion, as well as their conditional versions (Chapters 1–3). In addition to basic definitions illustrated
through concrete examples, we will also study various aspects including chain rules, regularity,
tensorization, variational representation, local expansion, convexity and optimization properties,
as well as the data processing principle (Chapters 4–6). These information measures will be
imbued with operational meaning when we proceed to classical topics in information theory such
as data compression and transmission, in subsequent parts of the book. This Part also includes
topics connecting information theory to other subjects, such as the I-MMSE relation (estimation the-
ory), the entropy power inequality (probability), and PAC-Bayes bounds and the Gibbs variational principle
(machine learning).
In addition to the classical (Shannon) information measures, Chapter 7 provides a systematic
treatment of f-divergences, a generalization of (Shannon) measures introduced by Csiszár that
plays an important role in many statistical problems (see Parts III and VI). Finally, towards the
end of this part we will discuss two operational topics: random number generators in Chapter 9
and the application of the entropy method to combinatorics and geometry in Chapter 8.
Several contemporary topics are developed in exercises such as stochastic block model
(Exercise I.49), Gilmer’s method in combinatorics (Exercise I.61), Tao’s proof of Szemerédi’s reg-
ularity lemma (Exercise I.63), Eldan’s stochastic localization (Exercise I.66), Gross’ log-Sobolev
inequality (Exercise I.65) and others.
1 Entropy
This chapter introduces the first information measure – Shannon entropy. After studying its stan-
dard properties (chain rule, conditioning), we will briefly describe how one could arrive at its
definition. We discuss axiomatic characterization, the historical development in statistical mechan-
ics, as well as the underlying combinatorial foundation (“method of types”). We close the chapter
with Han’s and Shearer’s inequalities, that both exploit submodularity of entropy. After this Chap-
ter, the reader is welcome to consult the applications in combinatorics (Chapter 8) and random
number generation (Chapter 9), which are independent of the rest of this Part.
log₂ ↔ bits
log_e ↔ nats
log₂₅₆ ↔ bytes
log ↔ arbitrary units (base always matches exp)
Different units will be convenient in different cases and so most of the general results in this book
are stated with “baseless” log/exp.
Definition 1.2 (Joint entropy) The joint entropy of n discrete random variables X^n ≜
(X_1, X_2, . . . , X_n) is
$$H(X^n) = H(X_1, \dots, X_n) = \mathbb{E}\Big[\log \frac{1}{P_{X_1,\dots,X_n}(X_1,\dots,X_n)}\Big],$$
which can also be written explicitly as a summation over a joint probability mass function (PMF):
$$H(X^n) = \sum_{x_1}\cdots\sum_{x_n} P_{X_1,\dots,X_n}(x_1,\dots,x_n)\,\log\frac{1}{P_{X_1,\dots,X_n}(x_1,\dots,x_n)}.$$
Note that joint entropy is a special case of Definition 1.1 applied to the random vector Xn =
(X1 , X2 , . . . , Xn ) taking values in the product space.
Remark 1.1 The name “entropy” originates from thermodynamics – see Section 1.3, which
also provides combinatorial justification for this definition. Another common justification is to
derive H(X) as a consequence of natural axioms for any measure of “information content” – see
Section 1.2. There are also natural experiments suggesting that H(X) is indeed the amount of
“information content” in X. For example, one can measure the time it takes for ant scouts to describe
the location of food to worker ants. It was found that when the nest is placed at the root of a full
binary tree of depth d and the food at one of the leaves, the time was proportional to the entropy of a
random variable describing the food location [358]. (It was also estimated that ants communicate
at about 0.7–1 bit/min and that the communication time reduces if there are some regularities in
the path description: paths like “left,right,left,right,left,right” are described by scouts faster.)
Entropy measures the intrinsic randomness or uncertainty of a random variable. In the simple
setting where X takes values uniformly over a finite set X , the entropy is simply given by log-
cardinality: H(X) = log |X |. In general, the more spread out (resp. concentrated) a probability
mass function is, the higher (resp. lower) is its entropy, as demonstrated by the following example.
Example 1.1 (Bernoulli) Let X ∼ Ber(p), with P_X(1) = p and P_X(0) = p̄ ≜ 1 − p. Then
$$H(X) = h(p) \triangleq p\log\frac{1}{p} + \bar p\log\frac{1}{\bar p}.$$
Here h(·) is called the binary entropy function, which is continuous, concave on [0, 1], symmetric
around 1/2, and satisfies h′(p) = log(p̄/p), with infinite slope at 0 and 1. The highest entropy is
achieved at p = 1/2 (uniform), while the lowest entropy is achieved at p = 0 or 1 (deterministic).
It is instructive to compare the plot of the binary entropy function with the variance p(1 − p).
[Figure: the binary entropy function h(p) on [0, 1], equal to 0 at p = 0, 1 and peaking at log 2 at p = 1/2.]
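As a quick numerical companion to Example 1.1 (our own sketch, not part of the text), the following Python snippet evaluates the binary entropy function and checks the endpoint and midpoint behavior described above; the helper name binary_entropy is ours.

```python
import math

def binary_entropy(p, base=2.0):
    """Binary entropy h(p) = p log(1/p) + (1-p) log(1/(1-p)), with h(0) = h(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p, base) + (1 - p) * math.log(1 - p, base))

# h is maximized at p = 1/2 (value log 2, i.e. 1 bit) and vanishes at p = 0 and p = 1.
assert math.isclose(binary_entropy(0.5), 1.0)
assert binary_entropy(0.0) == 0.0 and binary_entropy(1.0) == 0.0

# Compare the shape of h(p) with that of the variance p(1-p), as suggested in Example 1.1.
for p in [0.01, 0.1, 0.25, 0.5]:
    print(f"p={p:5.2f}   h(p)={binary_entropy(p):.4f} bits   p(1-p)={p * (1 - p):.4f}")
```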
Definition 1.3 (Conditional entropy) Let X be a discrete random variable and Y arbitrary.
Denote by P_{X|Y=y}(·) or P_{X|Y}(·|y) the conditional distribution of X given Y = y. The conditional
entropy of X given Y is
$$H(X|Y) = \mathbb{E}_{y\sim P_Y}\big[H(P_{X|Y=y})\big] = \mathbb{E}\Big[\log\frac{1}{P_{X|Y}(X|Y)}\Big].$$
Note that if Y is also discrete we can write out the expression in terms of the joint PMF P_{X,Y} and
conditional PMF P_{X|Y} as
$$H(X|Y) = \sum_x\sum_y P_{X,Y}(x,y)\,\log\frac{1}{P_{X|Y}(x|y)}.$$
Similar to entropy, conditional entropy measures the remaining randomness of a random vari-
able when another is revealed. As such, H(X|Y) = H(X) whenever Y is independent of X. But
when Y depends on X, observing Y does lower the entropy of X. Before formalizing this in the
next theorem, here is a concrete example.
Example 1.4 (Conditional entropy and noisy channel) Let Y be a noisy observation
of X ∼ Ber(1/2) as follows.
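The body of Example 1.4 is not reproduced above, so as a stand-in here is a minimal numerical sketch in the spirit of Figure 1.1. It assumes the observation model Y = X + σZ with X ∼ Ber(1/2) and Z ∼ N(0, 1) — which may differ in details from the book's example — and computes H(X|Y) by averaging the binary entropy of the posterior P[X = 1 | Y = y].

```python
import numpy as np

def cond_entropy_bits(sigma):
    """H(X|Y) in bits for X ~ Ber(1/2) observed as Y = X + sigma*Z, Z ~ N(0,1)."""
    grid = np.linspace(-1 - 8 * sigma, 2 + 8 * sigma, 200001)
    phi0 = np.exp(-grid ** 2 / (2 * sigma ** 2))           # density of Y given X=0, up to a constant
    phi1 = np.exp(-(grid - 1) ** 2 / (2 * sigma ** 2))      # density of Y given X=1, up to a constant
    post = np.clip(phi1 / (phi0 + phi1), 1e-300, 1 - 1e-12)  # posterior P[X=1 | Y=y]
    p_y = 0.5 * (phi0 + phi1) / (sigma * np.sqrt(2 * np.pi))  # marginal density of Y
    h_post = -(post * np.log2(post) + (1 - post) * np.log2(1 - post))
    dy = grid[1] - grid[0]
    return float(np.sum(p_y * h_post) * dy)                 # Riemann-sum approximation of E_Y[h(post)]

for sigma in [0.1, 0.5, 1.0, 3.0]:
    print(f"sigma={sigma:4.1f}   H(X|Y) ~= {cond_entropy_bits(sigma):.4f} bits")
# As sigma -> 0 the observation reveals X, so H(X|Y) -> 0; as sigma grows, H(X|Y) -> H(X) = 1 bit.
```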
Before discussing various properties of entropy and conditional entropy, let us first review some
relevant facts from convex analysis, which will be used extensively throughout the book.
[Figure 1.1 Conditional entropy of a Bernoulli X given its Gaussian noisy observation.]
Review: Convexity
(f) (Entropy under deterministic transformation) H(X) = H(X, f(X)) ≥ H(f(X)) with equality iff
f is one-to-one on the support of PX .
(g) (Full chain rule)
$$H(X_1, \dots, X_n) = \sum_{i=1}^n H(X_i|X^{i-1}) \le \sum_{i=1}^n H(X_i). \tag{1.3}$$
Proof. (a) Since log(1/P_X(X)) is a non-negative random variable, its expectation H(X) is also non-negative,
with H(X) = 0 if and only if log(1/P_X(X)) = 0 almost surely, namely, P_X is a point mass.
(b) Apply Jensen’s inequality to the strictly concave function x 7→ log x:
1 1
H(X) = E log ≤ log E = log |X |.
PX (X) PX (X)
(c) H(X) as a summation only depends on the values of PX , not locations.
(d) Abbreviate P(x) ≡ P_X(x) and P(x|y) ≡ P_{X|Y}(x|y). Using P(x) = E_Y[P(x|Y)] and applying
Jensen's inequality to the strictly concave function x ↦ x log(1/x),
$$H(X|Y) = \mathbb{E}_Y\Big[\sum_{x\in X} P(x|Y)\log\frac{1}{P(x|Y)}\Big] \le \sum_{x\in X} P(x)\log\frac{1}{P(x)} = H(X).$$
Additionally, this also follows from (and is equivalent to) Corollary 3.5 in Chapter 3 or
Theorem 5.2 in Chapter 5.
(e) Telescoping P_{X,Y}(X, Y) = P_{Y|X}(Y|X) P_X(X) and noting that both sides are positive P_{X,Y}-almost
surely, we have
$$\mathbb{E}\Big[\log\frac{1}{P_{X,Y}(X,Y)}\Big] = \mathbb{E}\Big[\log\frac{1}{P_X(X)\cdot P_{Y|X}(Y|X)}\Big] = \underbrace{\mathbb{E}\Big[\log\frac{1}{P_X(X)}\Big]}_{H(X)} + \underbrace{\mathbb{E}\Big[\log\frac{1}{P_{Y|X}(Y|X)}\Big]}_{H(Y|X)}.$$
(f) The intuition is that (X, f(X)) contains the same amount of information as X. Indeed, x ↦
(x, f(x)) is one-to-one. Thus by (c) and (e):
$$H(X) = H(X, f(X)) = H(f(X)) + H(X|f(X)) \ge H(f(X)).$$
The bound is attained iff H(X|f(X)) = 0, which in turn happens iff X is a constant given f(X).
(g) Similar to (e), telescoping $P_{X^n}(X^n) = \prod_{i=1}^n P_{X_i|X^{i-1}}(X_i|X^{i-1})$ and taking the logarithm
proves the equality. The inequality follows from (d), with the case of equality occurring if and only if
$P_{X_i|X^{i-1}} = P_{X_i}$ for i = 1, . . . , n, namely, $P_{X^n} = \prod_{i=1}^n P_{X_i}$.
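To make the chain rule (1.3) concrete, here is a small self-contained numerical check (ours, not from the text): it draws a random joint PMF on a 3×3×4 alphabet and verifies both the equality and the inequality in (1.3).

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((3, 3, 4)); P /= P.sum()          # a random joint PMF of (X1, X2, X3)

def H(p):
    """Shannon entropy (in bits) of a PMF given as an array of nonnegative weights summing to 1."""
    p = np.asarray(p).ravel(); p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

H123 = H(P)
H1   = H(P.sum(axis=(1, 2)))
H12  = H(P.sum(axis=2))
# Conditional entropies via the identity H(A|B) = H(A,B) - H(B):
H2_g1  = H12 - H1
H3_g12 = H123 - H12
assert np.isclose(H123, H1 + H2_g1 + H3_g12)      # equality part of (1.3)
marginals = [H(P.sum(axis=tuple(j for j in range(3) if j != i))) for i in range(3)]
assert H123 <= sum(marginals) + 1e-12             # inequality part of (1.3)
print(f"H(X1,X2,X3) = {H123:.4f} bits  <=  sum of marginal entropies = {sum(marginals):.4f} bits")
```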
To give a preview of the operational meaning of entropy, let us play the game of 20 Questions.
We are allowed to make queries about some unknown discrete RV X by asking yes-no questions.
The objective of the game is to guess the realized value of the RV X. For example, X ∈ {a, b, c, d}
with P [X = a] = 1/2, P [X = b] = 1/4, and P [X = c] = P [X = d] = 1/8. In this case, we can
ask “X = a?”. If not, proceed by asking “X = b?”. If not, ask “X = c?”, after which we will know
for sure the realization of X. The resulting average number of questions is 1/2 + 1/4 × 2 + 1/8 ×
3 + 1/8 × 3 = 1.75, which equals H(X) in bits. An alternative strategy is to ask “X = a, b or c, d”
in the first round then proceeds to determine the value in the second round, which always requires
two questions and does worse on average.
It turns out (Section 10.3) that the minimal average number of yes-no questions to pin down
the value of X is always between H(X) bits and H(X) + 1 bits. In this special case the above
scheme is optimal because (intuitively) it always splits the probability in half.
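A small numerical companion to the 20 Questions discussion above (our own sketch): it computes H(X) for the distribution on {a, b, c, d} and the average number of questions used by the two strategies described in the text.

```python
import math

probs = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}

H = sum(p * math.log2(1 / p) for p in probs.values())        # entropy of X in bits

# Strategy 1: ask "X=a?", then "X=b?", then "X=c?"; a, b, c, d cost 1, 2, 3, 3 questions.
questions1 = {"a": 1, "b": 2, "c": 3, "d": 3}
avg1 = sum(probs[x] * questions1[x] for x in probs)

# Strategy 2: first split {a, b} vs {c, d}, then one more question; always 2 questions.
avg2 = 2.0

print(f"H(X) = {H} bits, strategy 1 average = {avg1}, strategy 2 average = {avg2}")
# Expected: H(X) = 1.75 bits, strategy 1 average = 1.75, strategy 2 average = 2.0
```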
(a) Permutation invariance: H_m(p_{π(1)}, . . . , p_{π(m)}) = H_m(p_1, . . . , p_m) for any permutation π on [m].
(b) Expansibility: H_m(p_1, . . . , p_{m−1}, 0) = H_{m−1}(p_1, . . . , p_{m−1}).
(c) Normalization: H_2(1/2, 1/2) = log 2.
(d) Subadditivity: H(X, Y) ≤ H(X) + H(Y). Equivalently, H_{mn}(r_{11}, . . . , r_{mn}) ≤ H_m(p_1, . . . , p_m) +
H_n(q_1, . . . , q_n) whenever $\sum_{j=1}^n r_{ij} = p_i$ and $\sum_{i=1}^m r_{ij} = q_j$.
(e) Additivity: H(X, Y) = H(X) + H(Y) if X ⊥⊥ Y. Equivalently, H_{mn}(p_1 q_1, . . . , p_m q_n) =
H_m(p_1, . . . , p_m) + H_n(q_1, . . . , q_n).
(f) Continuity: H_2(p, 1 − p) → 0 as p → 0.
then $H_m(p_1, \dots, p_m) = \sum_{i=1}^m p_i \log\frac{1}{p_i}$ is the only possibility. The interested reader is referred to
[115, Exercise 1.13] and the references therein.
We note that there are other meaningful measures of randomness, including, notably, the Rényi
entropy of order α introduced by Alfréd Rényi [356]:
$$H_\alpha(P) \triangleq \begin{cases}\dfrac{1}{1-\alpha}\log\displaystyle\sum_{i=1}^m p_i^\alpha & \alpha\in(0,1)\cup(1,\infty)\\[4pt] \min_i \log\dfrac{1}{p_i} & \alpha=\infty.\end{cases} \tag{1.4}$$
(The quantity H_∞ is also known as the min-entropy, or H_min, in the cryptography literature.) One
can check that
1 0 ≤ H_α(P) ≤ log m, where the lower (resp. upper) bound is achieved when P is a point mass
(resp. uniform);
2 H_α(P) is non-increasing in α and tends to the Shannon entropy H(P) as α → 1 (see the numerical sketch below);
3 Rényi entropy satisfies the above six axioms except for subadditivity.
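The following sketch (ours, not from the text) evaluates (1.4) for a fixed PMF and illustrates points 1 and 2: H_α(P) is non-increasing in α and approaches the Shannon entropy as α → 1.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (in bits) of a PMF p, following (1.4)."""
    p = np.asarray(p, dtype=float)
    if np.isinf(alpha):
        return float(-np.log2(p.max()))                 # min-entropy H_infinity
    if np.isclose(alpha, 1.0):
        q = p[p > 0]
        return float(-(q * np.log2(q)).sum())           # Shannon entropy as the alpha -> 1 limit
    return float(np.log2((p ** alpha).sum()) / (1 - alpha))

P = np.array([0.5, 0.25, 0.125, 0.125])
values = [renyi_entropy(P, a) for a in [0.5, 0.9, 0.999, 1.0, 1.001, 2.0, np.inf]]
print([round(v, 4) for v in values])
# The sequence is non-increasing, and the values near alpha = 1 match H(P) = 1.75 bits.
assert all(values[i] >= values[i + 1] - 1e-9 for i in range(len(values) - 1))
```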
Second law of thermodynamics: There does not exist a machine that operates in a cycle (i.e. returns to its original
state periodically), produces useful work and whose only other effect on the outside world is drawing heat from
a warm body. (That is, every such machine should expend some amount of heat to some cold body too!)¹
An equivalent formulation is as follows: “There does not exist a cyclic process that transfers heat
from a cold body to a warm body”. That is, every such process needs to be helped by expending
some amount of external work; for example, the air conditioners, sadly, will always need to use
some electricity.
Notice that there is something annoying about the second law as compared to the first law. In
the first law there is a quantity that is conserved, and this is somehow logically easy to accept. The
second law seems a bit harder to believe in (and some engineers did not, and only their recurrent
failures to circumvent it finally convinced them). So Clausius, building on an ingenious work of
S. Carnot, figured out that there is an “explanation” to why any cyclic machine should expend
heat. He proposed that there must be some hidden quantity associated to the machine, entropy
of it (initially described as “transformative content” or Verwandlungsinhalt in German), whose
value must return to its original state. Furthermore, under any reversible (i.e. quasi-stationary, or
“very slow”) process operated on this machine the change of entropy is proportional to the ratio
¹ Note that the reverse effect (that is, converting work into heat) is rather easy: friction is an example.
where k is the Boltzmann constant, and we assumed that each particle can only be in one of ℓ
molecular states (e.g. spin up/down, or if we quantize the phase volume into ℓ subcubes) and pj is
the fraction of particles in the j-th molecular state.
More explicitly, their innovation was two-fold. First, they separated the concept of a micro-
state (which in our example above corresponds to a tuple of n states, one for each particle) and the
macro-state (a list {pj } of proportions of particles in each state). Second, they postulated that for
experimental observations only the macro-state matters, but the multiplicity of the macro-state
(number of micro-states that correspond to a given macro-state) is precisely the (exponential
of the) entropy. The formula (1.6) then follows from the following explicit result connecting
combinatorics and entropy.
Proposition 1.5 (Method of types) Let n_1, . . . , n_k be non-negative integers with $\sum_{i=1}^k n_i = n$,
and denote the distribution P = (p_1, . . . , p_k), p_i = n_i/n. Then the multinomial coefficient
$\binom{n}{n_1,\dots,n_k} \triangleq \frac{n!}{n_1!\cdots n_k!}$ satisfies
$$\frac{1}{(n+1)^{k-1}}\exp\{nH(P)\} \le \binom{n}{n_1,\dots,n_k} \le \exp\{nH(P)\}.$$
Proof. For the upper bound, let X_1, . . . , X_n be i.i.d. ∼ P and let $N_i = \sum_{j=1}^n 1\{X_j = i\}$ denote the number
of occurrences of i. Then (N_1, . . . , N_k) has a multinomial distribution:
$$\mathbb{P}[N_1 = n'_1, \dots, N_k = n'_k] = \binom{n}{n'_1,\dots,n'_k}\prod_{i=1}^k p_i^{n'_i},$$
(n + 1)^{k−1} values, the lower bound follows if we can show that (n_1, . . . , n_k) is its mode. Indeed,
for any n′_i with n′_1 + · · · + n′_k = n, defining ∆_i = n′_i − n_i we have
For more on combinatorics and entropy, see Ex. I.1, I.3 and Chapter 8. For more on the intricate
relationship between statistical, mechanistic and information-theoretic description of the world
see Section 12.5* on Kolmogorov-Sinai entropy.
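Here is a small numerical check of Proposition 1.5 (our own sketch): for a few compositions (n_1, . . . , n_k) it compares the multinomial coefficient with the bounds exp{nH(P)}/(n + 1)^{k−1} and exp{nH(P)}, using natural logarithms so that exp matches the nats convention.

```python
import math

def multinomial(ns):
    """Multinomial coefficient n! / (n_1! ... n_k!) as an exact integer."""
    out = math.factorial(sum(ns))
    for m in ns:
        out //= math.factorial(m)
    return out

def nH(ns):
    """n * H(P) in nats, where P = (n_1/n, ..., n_k/n)."""
    n = sum(ns)
    return sum(m * math.log(n / m) for m in ns if m > 0)

for ns in [(3, 3, 4), (10, 0, 10), (1, 2, 3, 4), (50, 30, 20)]:
    n, k = sum(ns), len(ns)
    coeff = multinomial(ns)
    upper = math.exp(nH(ns))
    lower = upper / (1 + n) ** (k - 1)
    assert lower <= coeff <= upper, ns       # the two bounds of Proposition 1.5
    print(f"{ns}: {lower:.3e} <= {coeff:.3e} <= {upper:.3e}")
```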
1.4* Submodularity
Recall that [n] denotes the set {1, . . . , n}, $\binom{S}{k}$ denotes the collection of subsets of S of size k, and 2^S denotes all subsets
of S. A set function f: 2^S → R is called submodular if for any T_1, T_2 ⊂ S
$$f(T_1\cup T_2) + f(T_1\cap T_2) \le f(T_1) + f(T_2). \tag{1.8}$$
Submodularity is similar to concavity, in the sense that “adding elements gives diminishing
returns”. Indeed, consider T′ ⊂ T and b ∉ T. Then
$$f(T\cup b) - f(T) \le f(T'\cup b) - f(T').$$
Proof. Let A = X_{T_1∖T_2}, B = X_{T_1∩T_2}, C = X_{T_2∖T_1}. Then we need to show
$$H(A, B, C) + H(B) \le H(A, B) + H(B, C).$$
This follows from a simple chain
$$H(A, B, C) + H(B) = H(A, C|B) + 2H(B) \tag{1.9}$$
$$\le H(A|B) + H(C|B) + 2H(B) \tag{1.10}$$
$$= H(A, B) + H(B, C). \tag{1.11}$$
$$T_1 \subset T_2 \implies H(X_{T_1}) \le H(X_{T_2}).$$
So fixing n, let us denote by Γ_n the set of all non-negative, monotone, submodular set-functions on
[n]. Note that via an obvious enumeration of all non-empty subsets of [n], Γ_n is a closed convex cone
in $\mathbb{R}_+^{2^n-1}$. Similarly, let us denote by Γ*_n the set of all set-functions corresponding to distributions
on X^n. Let us also denote by Γ̄*_n the closure of Γ*_n. It is not hard to show, cf. [472], that Γ̄*_n is also a
closed convex cone and that
$$\Gamma^*_n \subset \bar\Gamma^*_n \subset \Gamma_n.$$
This follows from the fundamental new information inequality not implied by the submodularity
of entropy (and thus called a non-Shannon inequality). Namely, [473] showed that for any 4-tuple
of discrete random variables:
$$I(X_3; X_4) - I(X_3; X_4|X_1) - I(X_3; X_4|X_2) \le \frac{1}{2} I(X_1; X_2) + \frac{1}{4} I(X_1; X_3, X_4) + \frac{1}{4} I(X_2; X_3, X_4).$$
Here we have used mutual information and conditional mutual information – notions that we
will introduce later. However, the above inequality (with the help of Theorem 3.4) can be easily
rewritten as a rather cumbersome expression in terms of entropies of sets of variables X1 , X2 , X3 , X4 .
In conclusion, the work [473] demonstrated that the entropy set-function is more constrained than
a generic submodular non-negative set function even if one only considers linear constraints.
$$\frac{1}{n}\bar H_n \le \cdots \le \frac{1}{k}\bar H_k \le \cdots \le \bar H_1. \tag{1.15}$$
Furthermore, the sequence H̄_k is increasing and concave in the sense of decreasing slope:
$$\bar H_{k+1} - \bar H_k \le \bar H_k - \bar H_{k-1}. \tag{1.16}$$
Proof. Denote for convenience H̄_0 = 0. Note that H̄_m/m is an average of differences:
$$\frac{1}{m}\bar H_m = \frac{1}{m}\sum_{k=1}^m (\bar H_k - \bar H_{k-1}).$$
Thus, it is clear that (1.16) implies (1.15) since increasing m by one adds a smaller element to the
average. To prove (1.16) observe that from submodularity
$$H(X_1,\dots,X_{k+1}) + H(X_1,\dots,X_{k-1}) \le H(X_1,\dots,X_k) + H(X_1,\dots,X_{k-1}, X_{k+1}).$$
Now average this inequality over all n! permutations of indices {1, . . . , n} to get
$$\bar H_{k+1} + \bar H_{k-1} \le 2\bar H_k,$$
as claimed by (1.16).
Alternative proof: Notice that by “conditioning decreases entropy” we have
$$H(X_{k+1}\,|\,X_1,\dots,X_k) \le H(X_{k+1}\,|\,X_1,\dots,X_{k-1});$$
averaging this inequality over all permutations of indices again yields (1.16).
Theorem 1.8 (Shearer’s Lemma) Let Xn be discrete n-dimensional RV and let S ⊂ [n] be
a random variable independent of Xn and taking values in subsets of [n]. Then
Remark 1.2 In the special case where S is uniform over all subsets of cardinality k, (1.17)
reduces to Han's inequality (1/n) H(X^n) ≤ (1/k) H̄_k. The case of n = 3 and k = 2 can be used to give
an entropy proof of the following well-known geometric result that relates the size of a 3-D object
to those of its 2-D projections: place N points in R³ arbitrarily. Let N_1, N_2, N_3 denote the number
of distinct points projected onto the xy-, xz- and yz-planes, respectively. Then N_1 N_2 N_3 ≥ N². For
another application, see Section 8.2.
Proof. We will prove an equivalent (by taking a suitable limit) version: if C = (S_1, . . . , S_M) is a
list (possibly with repetitions) of subsets of [n] then
$$\sum_j H(X_{S_j}) \ge H(X^n)\cdot\min_i \deg(i), \tag{1.18}$$
where deg(i) ≜ #{j : i ∈ S_j}. Let us call C a chain if all subsets can be rearranged so that
S_1 ⊆ S_2 ⊆ · · · ⊆ S_M. For a chain, (1.18) is trivial, since the minimum on the right-hand side is either
zero (if S_M ≠ [n]) or equals the multiplicity of S_M in C,² in which case we have
$$\sum_j H(X_{S_j}) \ge H(X_{S_M})\,\#\{j : S_j = S_M\} = H(X^n)\cdot\min_i \deg(i).$$
For the case of C not a chain, consider a pair of sets S_1, S_2 that are not related by inclusion and
replace them in the collection with S_1 ∩ S_2, S_1 ∪ S_2. Submodularity (1.8) implies that the sum on the
left-hand side of (1.18) does not increase under this replacement, while the values deg(i) are not changed.
² Note that, consequently, for X^n without constant coordinates, and if C is a chain, (1.18) is only tight if C consists of only ∅
and [n] (with multiplicities). Thus if the degrees deg(i) are known and non-constant, then (1.18) can be improved, cf. [288].
Notice that the total number of pairs that are not related by inclusion strictly decreases under this
replacement: if T was related by inclusion to S_1 then it will also be related to at least one of S_1 ∪ S_2
or S_1 ∩ S_2; if T was related to both S_1, S_2 then it will be related to both of the new sets as well.
Therefore, by applying this operation we must eventually arrive at a chain, for which (1.18) has
already been shown.
Remark 1.3 Han’s inequality (1.16) holds for any submodular set-function. For Han’s inequal-
ity (1.15) we also need f(∅) = 0 (this can be achieved by adding a constant to all values of f).
Shearer’s lemma holds for any submodular set-function that is also non-negative.
Example 1.5 (Non-entropy submodular function) Another submodular set-function is
$$S \mapsto I(X_S; X_{S^c}).$$
Han's inequality for this one reads
$$0 = \frac{1}{n} I_n \le \cdots \le \frac{1}{k} I_k \le \cdots \le I_1,$$
where $I_k = \binom{n}{k}^{-1}\sum_{S:|S|=k} I(X_S; X_{S^c})$ measures the amount of k-subset coupling in the random vector X^n.
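As a numerical illustration of Han's inequality (ours, not from the text), the sketch below draws a random joint PMF on {0, 1}³ and verifies (1/3)H̄₃ ≤ (1/2)H̄₂ ≤ H̄₁, where H̄_k averages H(X_S) over all subsets S of size k.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((2, 2, 2)); P /= P.sum()            # random joint PMF of (X1, X2, X3)

def H(p):
    p = np.asarray(p).ravel(); p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def H_bar(k, n=3):
    """Average of H(X_S) over all subsets S of [n] with |S| = k."""
    subsets = list(itertools.combinations(range(n), k))
    vals = [H(P.sum(axis=tuple(i for i in range(n) if i not in S))) for S in subsets]
    return sum(vals) / len(subsets)

ratios = [H_bar(k) / k for k in (3, 2, 1)]
print([round(r, 4) for r in ratios])
assert ratios[0] <= ratios[1] + 1e-12 and ratios[1] <= ratios[2] + 1e-12   # Han's inequality (1.15)
```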
2 Divergence
In this chapter we study the divergence D(P‖Q) (also known as information divergence, Kullback-
Leibler (KL) divergence, relative entropy), which is the first example of a dissimilarity (information)
measure between a pair of distributions P and Q. As we will see later in Chapter 7, KL divergence is
a special case of f-divergences. Defining KL divergence and its conditional version in full general-
ity requires some measure-theoretic acrobatics (Radon-Nikodym derivatives and Markov kernels),
which we spend some time on. (We stress again that all these abstractions can be ignored if one is
willing to only work with finite or countably-infinite alphabets.)
Besides definitions we prove the “main inequality” showing that KL-divergence is non-negative.
Coupled with the chain rule for divergence, this inequality implies the data-processing inequality,
which is arguably the central pillar of information theory and this book. We conclude the chapter
by studying local behavior of divergence when P and Q are close. In the special case when P and
Q belong to a parametric family, we will see that divergence is locally quadratic with Hessian
being the Fisher information, explaining the fundamental role of the latter in classical statistics.
Review: Measurability
• All complete separable metric spaces, endowed with their Borel σ-algebras, are standard
Borel. In particular, countable alphabets and R^n and R^∞ (the space of sequences) are
standard Borel.
• If X_i, i = 1, . . ., are standard Borel, then so is $\prod_{i=1}^\infty X_i$.
• Singletons {x} are measurable sets.
• The diagonal {(x, x) : x ∈ X} is measurable in X × X.
We now need to define the second central concept of this book: the relative entropy, or Kullback-
Leibler divergence. Before giving the formal definition, we start with special cases. For that we
fix some alphabet A. The relative entropy between distributions P and Q on A is denoted by
D(P‖Q), defined as follows.
• Suppose A = R^k and P and Q have densities (pdfs) p and q with respect to the Lebesgue measure.
Then
$$D(P\|Q) = \begin{cases}\displaystyle\int_{\{p>0,\,q>0\}} p(x)\log\frac{p(x)}{q(x)}\,dx & \mathrm{Leb}\{p>0,\,q=0\}=0\\ +\infty & \text{otherwise.}\end{cases} \tag{2.2}$$
These two special cases cover a vast majority of all cases that we encounter in this book. However,
mathematically it is not very satisfying to restrict to these two special cases. For example, it
is not clear how to compute D(P‖Q) when P and Q are two measures on a manifold (such as a
unit sphere) embedded in R^k. Another problematic case is computing D(P‖Q) between measures
on the space of sequences (stochastic processes). To address these cases we need to recall the
concepts of Radon-Nikodym derivative and absolute continuity.
Recall that for two measures P and Q, we say P is absolutely continuous w.r.t. Q (denoted by
P ≪ Q) if Q(E) = 0 implies P(E) = 0 for all measurable E. If P ≪ Q, then the Radon-Nikodym
theorem shows that there exists a function f : X → R_+ such that for any measurable set E,
$$P(E) = \int_E f\,dQ. \qquad\text{[change of measure]} \tag{2.3}$$
Such an f is called a relative density or a Radon-Nikodym derivative of P w.r.t. Q, denoted by dP/dQ.
Note that dP/dQ may not be unique. In the simple cases above, dP/dQ is just the familiar likelihood ratio:
the ratio of probability mass functions in the discrete case and the ratio of densities p(x)/q(x) in the continuous case.
We can see that the two special cases of D(P‖Q) were both computing E_P[log dP/dQ]. This turns
out to be the most general definition that we are looking for. However, we will state it slightly
differently, following the tradition.
Below we will show (Lemma 2.5) that the expectation in (2.4) is well-defined (but possibly
infinite) and coincides with E_P[log dP/dQ] whenever P ≪ Q.
To demonstrate the general definition in the case not covered by the discrete/continuous special-
izations, consider the situation in which both P and Q are given as densities with respect to a
common dominating measure μ, written as dP = f_P dμ and dQ = f_Q dμ for some non-negative
f_P, f_Q. (In other words, P ≪ μ and f_P = dP/dμ.) For example, taking μ = P + Q always allows one to
specify P and Q in this form. In this case, we have the following expression for divergence:
$$D(P\|Q) = \begin{cases}\displaystyle\int_{\{f_P>0,\,f_Q>0\}} f_P\log\frac{f_P}{f_Q}\,d\mu & \mu(\{f_Q=0,\,f_P>0\})=0,\\ +\infty & \text{otherwise.}\end{cases} \tag{2.5}$$
Indeed, first note that, under the assumption of P ≪ μ and Q ≪ μ, we have P ≪ Q iff
μ({f_Q = 0, f_P > 0}) = 0. Furthermore, if P ≪ Q, then dP/dQ = f_P/f_Q Q-a.e., in which case apply-
ing (2.3) and (1.1) reduces (2.5) to (2.4). Namely, D(P‖Q) = E_Q[(dP/dQ) log(dP/dQ)] = E_Q[(f_P/f_Q) log(f_P/f_Q)] =
∫ dμ f_P log(f_P/f_Q) 1{f_Q > 0} = ∫ dμ f_P log(f_P/f_Q) 1{f_Q > 0, f_P > 0}.
Note that D(P‖Q) was defined to be +∞ if P ≪̸ Q. However, it can also be +∞ even when
P ≪ Q. For example, D(Cauchy‖Gaussian) = ∞. However, this does not mean that there are
somehow two different ways in which D can be infinite. Indeed, what can be shown is that in
both cases there exists a sequence of (finer and finer) finite partitions Π of the space A such that
the KL divergence between the induced discrete distributions P|_Π and Q|_Π grows without
bound. This will be the subject of Theorem 4.5 below.
Our next observation is that, generally, D(P‖Q) ≠ D(Q‖P) and, therefore, divergence is not a
distance. We will see later that this is natural in many cases; for example, it reflects the inherent
asymmetry of hypothesis testing (see Part III and, in particular, Section 14.5). Consider the exam-
ple of coin tossing where under P the coin is fair and under Q the coin always lands on heads.
Upon observing HHHHHHH, one tends to believe it is Q but can never be absolutely sure; upon
observing HHT, one knows for sure it is P. Indeed, D(P‖Q) = ∞, D(Q‖P) = 1 bit.
Having made these remarks we proceed to some examples. First, we show that D is unsurpris-
ingly a generalization of entropy.
Example 2.1 (Binary divergence) Consider P = Ber(p) and Q = Ber(q) on A = {0, 1}.
Then
$$D(P\|Q) = d(p\|q) \triangleq p\log\frac{p}{q} + \bar p\log\frac{\bar p}{\bar q}. \tag{2.6}$$
[Figure: the binary divergence d(p‖q) plotted as a function of p for fixed q (left; equal to log(1/q̄) at p = 0 and log(1/q) at p = 1, vanishing at p = q) and as a function of q for fixed p (right; vanishing at q = p).]
In fact, this is a special case of the famous Pinsker’s inequality (Theorem 7.10).
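A quick numerical sketch (ours) of the binary divergence (2.6), illustrating both that d(p‖q) vanishes iff p = q and the asymmetry of divergence discussed above, including the fair-coin vs. always-heads example.

```python
import math

def binary_kl(p, q, base=2.0):
    """Binary divergence d(p||q) of (2.6), with the conventions 0 log 0 = 0 and a log(a/0) = +inf for a > 0."""
    def term(a, b):
        if a == 0.0:
            return 0.0
        if b == 0.0:
            return math.inf
        return a * math.log(a / b, base)
    return term(p, q) + term(1 - p, 1 - q)

print(binary_kl(0.3, 0.3))                          # 0: the divergence vanishes iff p = q
print(binary_kl(0.1, 0.4), binary_kl(0.4, 0.1))     # asymmetric in general
print(binary_kl(0.5, 1.0), binary_kl(1.0, 0.5))     # fair coin vs. always-heads: +inf and 1 bit
```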
Example 2.2 (Real Gaussian) For two Gaussians on A = R,
$$D(\mathcal N(m_1,\sigma_1^2)\,\|\,\mathcal N(m_0,\sigma_0^2)) = \frac{\log e}{2}\,\frac{(m_1-m_0)^2}{\sigma_0^2} + \frac{1}{2}\Big[\log\frac{\sigma_0^2}{\sigma_1^2} + \Big(\frac{\sigma_1^2}{\sigma_0^2}-1\Big)\log e\Big]. \tag{2.7}$$
Here, the first and second terms compare the means and the variances, respectively.
Similarly, in the vector case of A = R^k and assuming det Σ_0 ≠ 0, we have
$$D(\mathcal N(m_1,\Sigma_1)\,\|\,\mathcal N(m_0,\Sigma_0)) = \frac{\log e}{2}\Big[(m_1-m_0)^\top\Sigma_0^{-1}(m_1-m_0) + \mathrm{Tr}\big(\Sigma_0^{-1}\Sigma_1 - I\big)\Big] + \frac{1}{2}\log\frac{\det\Sigma_0}{\det\Sigma_1}. \tag{2.8}$$
Then
$$D(\mathcal N_c(m_1,\sigma_1^2)\,\|\,\mathcal N_c(m_0,\sigma_0^2)) = \log e\,\frac{|m_1-m_0|^2}{\sigma_0^2} + \log\frac{\sigma_0^2}{\sigma_1^2} + \Big(\frac{\sigma_1^2}{\sigma_0^2}-1\Big)\log e, \tag{2.9}$$
which follows from (2.8). More generally, for complex Gaussian vectors on C^k, assuming det Σ_0 ≠ 0,
Proof. In view of the definition (2.4), it suffices to consider P ≪ Q. Let φ(x) ≜ x log x, which
is strictly convex on R_+. Applying Jensen's inequality:
$$D(P\|Q) = \mathbb{E}_Q\Big[\varphi\Big(\frac{dP}{dQ}\Big)\Big] \ge \varphi\Big(\mathbb{E}_Q\Big[\frac{dP}{dQ}\Big]\Big) = \varphi(1) = 0,$$
with equality iff dP/dQ = 1 Q-a.e., namely, P = Q.
Here is a typical application of the previous result (variations of it will be applied numerous
times in this book). This result is widely used in machine learning, as it shows that minimizing the
average cross-entropy loss ℓ(Q, x) ≜ log(1/Q(x)) recovers the true distribution (Exercise III.11).
Corollary 2.4 Let X be a discrete random variable with H(X) < ∞. Then
$$\min_Q \mathbb{E}\Big[\log\frac{1}{Q(X)}\Big] = H(X),$$
and the unique minimizer is Q = P_X.
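A small numerical illustration of Corollary 2.4 (ours, not from the book): for a fixed PMF P, the average cross-entropy loss E[log(1/Q(X))] is never below H(P) and is minimized at Q = P, where it equals H(P).

```python
import numpy as np

rng = np.random.default_rng(2)
P = np.array([0.5, 0.3, 0.15, 0.05])                   # distribution of X

def cross_entropy(P, Q):
    """E_{X~P}[log2(1/Q(X))]; for Q > 0 this equals H(P) + D(P||Q) >= H(P)."""
    return float(-(P * np.log2(Q)).sum())

H_P = cross_entropy(P, P)                               # equals H(P)
for _ in range(5):
    Q = rng.random(4) + 0.05; Q /= Q.sum()              # a random mismatched model Q
    assert cross_entropy(P, Q) >= H_P - 1e-12           # Corollary 2.4
print(f"H(P) = {H_P:.4f} bits; every mismatched Q pays a larger average loss.")
```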
Another implication of the proof of Theorem 2.3 is in bringing forward the reason for defining
D(P‖Q) = E_Q[(dP/dQ) log(dP/dQ)] as opposed to D(P‖Q) = E_P[log(dP/dQ)]. However, we still need to show
that the two definitions are equivalent, which is what we do next. In addition, we will also unify
the two cases (P ≪ Q vs. P ≪̸ Q) in Definition 2.1.
Lemma 2.5 Let P, Q, R ≪ μ and let f_P, f_Q, f_R denote their densities relative to μ. Define a bivariate
function Log(a/b): R_+ × R_+ → R ∪ {±∞} by
$$\mathrm{Log}\,\frac{a}{b} = \begin{cases}-\infty & a=0,\ b>0\\ +\infty & a>0,\ b=0\\ 0 & a=0,\ b=0\\ \log\frac{a}{b} & a>0,\ b>0.\end{cases} \tag{2.10}$$
Then the following results hold:
Remark 2.1 Note that, ignoring the issue of dividing by or taking a log of 0, the proof of (2.12)
is just the simple identity log(dR/dQ) = log((dR/dP)(dP/dQ)) = log(dP/dQ) − log(dP/dR). What permits us to handle zeros
is the Log function, which satisfies several natural properties of the log: for every a, b ∈ R_+
$$\mathrm{Log}\,\frac{a}{b} = -\mathrm{Log}\,\frac{b}{a},$$
and for every c > 0 we have
$$\mathrm{Log}\,\frac{a}{b} = \mathrm{Log}\,\frac{a}{c} + \mathrm{Log}\,\frac{c}{b} = \mathrm{Log}\,\frac{ac}{b} - \log(c),$$
except for the case a = b = 0.
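To make the convention (2.10) concrete, here is a small sketch (ours) implementing the Log function and using it to evaluate E_P[Log(P(x)/Q(x))] over a finite alphabet, which reproduces D(P‖Q) including the +∞ case.

```python
import math

def Log(a, b):
    """The extended logarithm Log(a/b) of (2.10)."""
    if a == 0.0 and b == 0.0:
        return 0.0
    if a == 0.0:
        return -math.inf
    if b == 0.0:
        return math.inf
    return math.log(a / b)

def kl_via_Log(P, Q):
    """E_P[Log(P(x)/Q(x))] over a finite alphabet (in nats)."""
    return sum(p * Log(p, q) for p, q in zip(P, Q) if p > 0)

print(kl_via_Log([0.5, 0.5, 0.0], [0.25, 0.25, 0.5]))   # finite: equals log 2 in nats
print(kl_via_Log([0.5, 0.5, 0.0], [1.0, 0.0, 0.0]))     # P is not << Q, so the value is +inf
```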
Proof. First, suppose D(P‖Q) = ∞ and D(P‖R) < ∞. Then P[f_R(Y) = 0] = 0, and hence in the
computation of the expectation in (2.11) only the second part of convention (2.10) can possibly
apply. Since also f_P > 0 P-almost surely, we have
$$\mathrm{Log}\,\frac{f_R}{f_Q} = \mathrm{Log}\,\frac{f_R}{f_P} + \mathrm{Log}\,\frac{f_P}{f_Q}, \tag{2.14}$$
with both logarithms evaluated according to (2.10). Taking expectation over P we see that the
first term, equal to −D(P‖R), is finite, whereas the second term is infinite. Thus, the expectation
in (2.11) is well-defined and equal to +∞, as is the LHS of (2.11).
Now consider D(P‖Q) < ∞. This implies that P[f_Q(Y) = 0] = 0 and this time in (2.11) only
the first part of convention (2.10) can apply. Thus, again we have the identity (2.14). Since the P-
expectation of the second term is finite, and of the first term non-negative, we again conclude that
the expectation in (2.11) is well-defined and equals the LHS of (2.11) (and both sides are possibly equal
to −∞).
For the second part, we first show that
fP log e
EP min(Log , 0) ≥ − . (2.15)
fQ e
Let g(x) = min(x log x, 0). It is clear − loge e ≤ g(x) ≤ 0 for all x. Since fP (Y) > 0 for P-almost
all Y, in convention (2.10) only the 10 case is possible, which is excluded by the min(·, 0) from the
expectation in (2.15). Thus, the LHS in (2.15) equals
Z Z
f P ( y) f P ( y) f P ( y)
fP (y) log dμ = f Q ( y) log dμ
{fP >fQ >0} f Q ( y ) {fP >fQ >0} f Q ( y ) f Q ( y)
Z
f P ( y)
= f Q ( y) g dμ
{fQ >0} f Q ( y)
log e
≥− .
e
h i h i
Since the negative part of EP Log ffQP is bounded, the expectation EP Log ffQP is well-defined. If
P[fQ = 0] > 0 then it is clearly +∞, as is D(PkQ) (since P 6 Q). Otherwise, let E = {fP >
0, fQ > 0}. Then P[E] = 1 and on E we have fP = fQ · ffQP . Thus, we obtain
Z Z
fP fP fP fP
EP Log = dμ fP log = dμfQ φ( ) = EQ 1E φ( ) .
fQ E fQ E fQ fQ
From here, we notice that Q[fQ > 0] = 1 and on {fP = 0, fQ > 0} we have φ( ffQP ) = 0. Thus, the
term 1E can be dropped and we obtain the desired (2.12).
The final statement of the Lemma follows from taking μ = Q and noticing that P-almost surely
we have
dP
dQ dP
Log = log .
1 dQ
i i
i i
i i
In particular, if X has probability density function (pdf) p, then h(X) = E log p(1X) ; otherwise
h(X) = −∞. The conditional differential entropy is h(X|Y) ≜ E log pX|Y (1X|Y) where pX|Y is a
conditional pdf.
Theorem 2.7 (Properties of differential entropy) Assume that all differential entropies
appearing below exist and are finite (in particular all RVs have pdfs and conditional pdfs).
1 n c −(−1)n n
For an example, consider a piecewise-constant pdf taking value e(−1) n on the n-th interval of width ∆n = n2
e .
i i
i i
i i
28
Proof. Parts (a), (c), and (d) follow from the similar argument in the proof (b), (d), and (g) of
Theorem 1.4. Part (b) is by a change of variable in the density. Finally, (e) and (f) are analogous
to Theorems 1.6 and 1.7.
Interestingly, the first property is robust to small additive perturbations, cf. Ex. I.6. Regard-
ing maximizing entropy under quadratic constraints, we have the following characterization of
Gaussians.
Theorem 2.8 Let Cov(X) = E[XX⊤ ] − E[X]E[X]⊤ denote the covariance matrix of a random
vector X. For any d × d positive definite matrix Σ,
1
max h(X) = h(N(0, Σ)) = log((2πe)d det Σ) (2.19)
PX :Cov(X)⪯Σ 2
Furthermore, for any a > 0,
a d 2πea
max h(X) = h N 0, Id = log . (2.20)
PX :E[∥X∥ ]≤a
2 d 2 d
Proof. To show (2.19), without loss of generality, assume that E[X] = 0. By comparing to
Gaussian, we have
where in the last step we apply E[X⊤ Σ−1 X] = Tr(E[XX⊤ ]Σ−1 ) ≤ Tr(I) due to the constraint
Cov(X) Σ and the formula (2.18). The inequality (2.20) follows analogously by choosing the
reference measure to be N(0, ad Id ).
Corollary 2.9 The map Σ 7→ log det Σ is concave on the space of real positive definite n × n
matrices.
i i
i i
i i
for m ∈ N, where b·c is taken componentwise. Rényi showed that [357, Theorem 1] provided
H(bXc) < ∞ and h(X) is defined, we have
To interpret this result, consider, for simplicity, d = 1, m = 2k and assume that X takes values
in the unit interval, in which case X2k is the k-bit uniform quantization of X. Then (2.21) suggests
that for large k, the quantized bits behave as independent fair coin flips. The underlying reason is
that for “nice” density functions, the restriction to small intervals is approximately uniform. For
more on quantization see Section 24.1 (notably Section 24.1.5) in Chapter 24.
The kernel K can be viewed as a random transformation acting from X to Y , which draws
Y from a distribution depending on the realization of X, including deterministic transformations
PY|X
as special cases. For this reason, we write PY|X : X → Y and also X −−→ Y. In information-
theoretic context, we also refer to PY|X as a channel, where X and Y are the channel input and
output respectively. There are two ways of obtaining Markov kernels. The first way is defining
them explicitly. Here are some examples of that:
i i
i i
i i
30
Note that above we have implicitly used the facts that the slices Ex of E are measurable subsets
of Y for each x and that the function x 7→ K(Ex |x) is measurable (cf. [84, Chapter I, Prop. 6.8 and
6.9], respectively). We also notice that one joint distribution PX,Y can have many different versions
of PY|X differing on a measure-zero set of x’s.
The operation of combining an input distribution on X and a kernel K : X → Y as we did
in (2.22) is going to appear extensively in this book. We will usually denote it as multiplication:
Given PX and kernel PY|X we can multiply them to obtain PX,Y ≜ PX PY|X , which in the discrete
case simply means that the joint PMF factorizes as product of marginal and conditional PMFs:
PX,Y (x, y) = PY|X (y|x)PX (x) ,
and more generally is given by (2.22) with K = PY|X .
Another useful operation will be that of composition (marginalization), which we denote by
PY|X ◦ PX ≜ PY . In words, this means forming a distribution PX,Y = PX PY|X and then computing
the marginal PY , or, explicitly,
Z
PY [E] = PX (dx)PY|X (E|x) .
X
To denote this (linear) relation between the input PX and the output PY we sometimes also write
PY|X
PX −−→ PY .
We must remark that technical assumptions such as restricting to standard Borel spaces are
really necessary for constructing any sensible theory of disintegration/conditioning and multipli-
cation. To emphasize this point we consider a (cautionary!) example involving a pathological
measurable space Y .
Example 2.5 (X ⊥ ⊥ Y but PY|X=x 6 PY for all x) Consider X a unit interval with
Borel σ -algebra and Y a unit interval with the σ -algebra σY consisting of all sets which are either
countable or have a countable complement. Clearly σY is a sub-σ -algebra of Borel one. We define
the following kernel K : X → Y :
K(A|x) ≜ 1{x ∈ A} .
This is simply saying that Y is produced from X by setting Y = X. It should be clear that for
every A ∈ σY the map x 7→ K(A|x) is measurable, and thus K is a valid Markov kernel. Letting
X ∼ Unif(0, 1) and using formula (2.22) we can define a joint distribution PX,Y . But what is the
conditional distribution PY|X ? On one hand, clearly we can set PY|X (A|x) = K(A|x), since this
was how PX,Y was constructed. On the other hand, we will show that PX,Y = PX PY , i.e. X ⊥ ⊥ Y
i i
i i
i i
and X = Y at the same time! Indeed, consider any set E = B × C ⊂ X × Y . We always have
PX,Y [B × C] = PX [B ∩ C]. Thus if C is countable then PX,Y [E] = 0 and so is PX PY [E] = 0. On the
other hand, if Cc is countable then PX [C] = PY [C] = 1 and PX,Y [E] = PX PY [E] again. Thus, both
PY|X = K and PY|X = PY are valid conditional distributions. But notice that since PY [{x}] = 0, we
have K(·|x) 6 PY for every x ∈ X . In particular, the value of D(PY|X=x kPY ) can either be 0 or
+∞ for every x depending on the choice of the version of PY|X . It is, thus, advisable to stay within
the realm of standard Borel spaces.
We will also need to use the following result extensively. We remind that a σ -algebra is called
separable if it is generated by a countable collection of sets. Any standard Borel space’s σ -algebra
is separable. The following is another useful result about Markov kernels, cf. [84, Chapter 5,
Theorem 4.44]:
dPY|X=x
The meaning of this theorem is that the Radon-Nikodym derivative dRY|X=x can be made jointly
measurable with respect to (x, y).
In order to extend the above definition to more general X , we need to first understand whether
the map x 7→ D(PY|X=x kQY|X=x ) is even measurable.
Lemma 2.13 Suppose that Y is standard Borel. The set A0 ≜ {x : PY|X=x QY|X=x } and the
function
x 7→ D(PY|X=x kQY|X=x )
are both measurable.
dPY|X=x dQY|X=x
Proof. Take RY|X = 1
2 PY|X + 12 QY|X and define fP (y|x) ≜ dRY|X=x (y) and fQ (y|x) ≜ dRY|X=x (y).
By Theorem 2.12 these can be chosen to be jointly measurable on X × Y . Let us define B0 ≜
i i
i i
i i
32
{(x, y) : fP (y|x) > 0, fQ (y|x) = 0} and its slice Bx0 = {y : (x, y) ∈ B0 }. Then note that PY|X=x
QY|X=x iff RY|X=x [Bx0 ] = 0. In other words, x ∈ A0 iff RY|X=x [Bx0 ] = 0. The measurability of B0
implies that of x 7→ RY|X=x [Bx0 ] and thus that of A0 . Finally, from (2.12) we get that
f P ( Y | x)
D(PY|X=x kQY|X=x ) = EY∼PY|X=x Log , (2.23)
f Q ( Y | x)
which is measurable, e.g. [84, Chapter 1, Prop. 6.9].
Theorem 2.15 (Chain rule) For any pair of measures PX,Y and QX,Y we have
D(PX,Y kQX,Y ) = D(PY|X kQY|X |PX ) + D(PX kQX ) , (2.24)
regardless of the versions of conditional distributions PY|X and QY|X one chooses.
Proof. First, let us consider the simplest case: X , Y are discrete and QX,Y (x, y) > 0 for all x, y.
Letting (X, Y) ∼ PX,Y we get
PX,Y (X, Y) PX (X)PY|X (Y|X)
D(PX,Y kQX,Y ) = E log = E log
QX,Y (X, Y) QX (X)QY|X (Y|X)
PY|X (Y|X) PX (X)
= E log + E log
QY|X (Y|X) QX (X)
completing the proof.
Next, let us address the general case. If PX 6 QX then PX,Y 6 QX,Y and both sides of (2.24) are
infinity. Thus, we assume PX QX and set λP (x) ≜ dQ dPX
X
(x). Define fP (y|x), fQ (y|x) and RY|X as in
the proof of Lemma 2.13. Then we have PX,Y , QX,Y RX,Y ≜ QX RY|X , and for any measurable E
Z Z
PX,Y [E] = λP (x)fP (y|x)RX,Y (dx dy) , QX,Y [E] = fQ (y|x)RX,Y (dx dy) .
E E
i i
i i
i i
unless a = b = 0. Now, since PX,Y [fP (Y|X) > 0, λP (X) > 0] = 1, we conclude that PX,Y -almost
surely
fP (Y|X)λP (X) fP ( Y | X )
Log = log λP (X) + Log .
fQ (Y|X) fQ (Y|X)
We aim to take the expectation of both sides over PX,Y and invoke linearity of expectation. To
ensure that the issue of ∞ − ∞ does not arise, we notice that the negative part of each term has
finite expectation by (2.15). Overall, continuing (2.25) and invoking linearity we obtain
fP ( Y | X )
D(PX,Y kQX,Y ) = EPX,Y [log λP (X)] + EPX,Y Log ,
fQ (Y|X)
where the first term equals D(PX kQX ) by (2.12) and the second D(PY|X kQY|X |PX ) by (2.23) and
the definition of conditional divergence.
The chain rule has a number of useful corollaries, which we summarize below.
Theorem 2.16 (Properties of divergence) Assume that X and Y are standard Borel.
Then
(e) (Conditioning increases divergence) Given PY|X , QY|X and PX , let PY = PY|X ◦ PX and QY =
QY|X ◦ PX , as represented by the diagram:
i i
i i
i i
34
PY |X PY
PX
QY |X QY
Then D(PY kQY ) ≤ D(PY|X kQY|X |PX ), with equality iff D(PX|Y kQX|Y |PY ) = 0.
We remark that as before without the standard Borel assumption even the first property can
fail. For example, Example 2.5 shows an example where PX PY|X = PX QY|X but PY|X 6= QY|X and
D(PY|X kQY|X |PX ) = ∞.
Proof. (a) This follows from the chain rule (2.24) since PX = QX .
(b) Apply (2.24), with X and Y interchanged and use the fact that conditional divergence is non-
negative.
Qn Qn
(c) By telescoping PXn = i=1 PXi |Xi−1 and QXn = i=1 QXi |Xi−1 .
(d) Apply (c).
(e) The inequality follows from (a) and (b). To get conditions for equality, notice that by the chain
rule for D:
• There is a nice interpretation of the full chain rule as a decomposition of the “distance” from
PXn to QXn as a sum of “distances” between intermediate distributions, cf. Ex. I.43.
• In general, D(PX,Y kQX,Y ) and D(PX kQX ) + D(PY kQY ) are incomparable. For example, if X = Y
under P and Q, then D(PX,Y kQX,Y ) = D(PX kQX ) < 2D(PX kQX ). Conversely, if PX = QX and
PY = QY but PX,Y 6= QX,Y we have D(PX,Y kQX,Y ) > 0 = D(PX kQX ) + D(PY kQY ).
The following result, known as the Data-Processing Inequality (DPI), is an important principle
in all of information theory. In many ways, it underpins the whole concept of information. The
intuitive interpretation is that it is easier to distinguish two distributions using clean (resp. full)
data as opposed to noisy (resp. partial) data. DPI is a recurring theme in this book, and later we
will study DPI for other information measures such as mutual information and f-divergences.
i i
i i
i i
PX PY
PY|X
QX QY
Then
Note that D(Pf(X) kQf(X) ) = D(PX kQX ) does not imply that f is one-to-one; as an example,
consider PX = Gaussian, QX = Laplace, Y = |X|. In fact, the equality happens precisely when
f(X) is a sufficient statistic for testing P against Q; in other words, there is no loss of information
in summarizing X into f(X) as far as testing these two hypotheses is concerned. See Example 3.9
for details.
A particular useful application of Corollary 2.18 is when we take f to be an indicator function:
This method will be highly useful in large deviations theory which studies rare events (Sec-
tion 14.5 and Section 15.2), where we apply Corollary 2.19 to an event E which is highly likely
under P but highly unlikely under Q.
i i
i i
i i
36
Proof.
1 1
D(λP + λ̄QkQ) = EQ (λf + λ̄) log(λf + λ̄)
λ λ
dP
where f = . As λ → 0 the function under expectation decreases to (f − 1) log e monotonically.
dQ
Indeed, the function
i i
i i
i i
1
log(λ̄ + λf) ≤ c1 f + c2 .
λ
Thus, by the dominated convergence theorem we get
Z Z
1 1 λ→0
D(QkλP + λ̄Q) = − dQ log(λ̄ + λf) −−−→ dQ(1 − f) = 0 .
λ λ
λ 7→ D(λP + λ̄QkQ) ,
Our second result about the local behavior of KL-divergence is the following (see Section 7.10
for generalizations):
i i
i i
i i
38
and hence
2
1 dP dP
0 ≤ 2 g λ̄ + λ ≤ − 1 log e.
λ dQ dQ
By the dominated convergence theorem (which is applicable since χ2 (PkQ) < ∞) we have
" 2 #
1 dP g′′ (1) dP log e 2
lim EQ g λ̄ + λ = EQ −1 = χ (PkQ) .
λ→0 λ2 dQ 2 dQ 2
where μ is some common dominating measure (e.g. Lebesgue or counting measure). If for each
fixed x, the density pθ (x) depends smoothly on θ, one can define the Fisher information matrix
with respect to the parameter θ as
JF (θ) ≜ Eθ VV⊤ , V ≜ ∇θ ln pθ (X) , (2.32)
i i
i i
i i
E θ [ V] = 0 (2.33)
JF (θ) = cov(V)
θ
Z p p
= 4 μ(dx)(∇θ pθ (x))(∇θ pθ (x))⊤
where the last identity is obtained by differentiating (2.33) with respect to each θj .
The significance of Fisher information matrix arises from the fact that it gauges the local behav-
ior of divergence for smooth parametric families. Namely, we have (again under suitable technical
conditions):2
log e ⊤
D(Pθ0 kPθ0 +ξ ) = ξ JF (θ0 )ξ + o(kξk2 ) , (2.34)
2
which is obtained by integrating the Taylor expansion:
1
ln pθ0 +ξ (x) = ln pθ0 (x) + ξ ⊤ ∇θ ln pθ0 (x) + ξ ⊤ Hessθ (ln pθ0 (x))ξ + o(kξk2 ) .
2
We will establish this fact rigorously later in Section 7.11. Property (2.34) is of paramount impor-
tance in statistics. We should remember it as: Divergence is locally quadratic on the parameter
space, with Hessian given by the Fisher information matrix. Note that for the Gaussian location
model Pθ = N (θ, Σ), (2.34) is in fact exact with JF (θ) ≡ Σ−1 – cf. Example 2.2.
As another example, note that Proposition 2.21 is a special case of (2.34) by considering Pλ =
λ̄Q + λP parametrized by λ ∈ [0, 1]. In this case, the Fisher information at λ = 0 is simply
χ2 (PkQ). Nevertheless, Proposition 2.21 is completely general while the asymptotic expansion
(2.34) is not without regularity conditions (see Section 7.11).
Remark 2.3 Some useful properties of Fisher information are as follows:
2
To illustrate the subtlety here, consider a scalar location family, i.e. pθ (x) = f0 (x − θ) for some density f0 . In this case
∫ (f′0 )2
Fisher information JF (θ0 ) = f0
does not depend on θ0 and is well-defined even for compactly supported f0 ,
provided f′0 vanishes at the endpoints sufficiently fast. But at the same time the left-hand side of (2.34) is infinite for any
ξ > 0. Thus, a more general interpretation for Fisher information is as the coefficient in expansion
ξ2
D(Pθ0 k 12 Pθ0 + 12 Pθ0 +ξ ) = J (θ )
8 F 0
+ o(ξ 2 ). We will discuss this in more detail in Section 7.11.
i i
i i
i i
40
where A = ddθθ̃ is the Jacobian of the map. So we can see that JF transforms similarly to the
metric tensor in Riemannian geometry. This idea can be used to define a Riemannian metric
on the parameter space Θ, called the Fisher-Rao metric. This is explored in a field known as
information geometry [85, 17].
i.i.d.
• Additivity: Suppose we are given a sample of n iid observations Xn ∼ Pθ . As such, consider the
parametrized family of product distributions {P⊗ θ : θ ∈ Θ}, whose Fisher information matrix
n
is denoted by J⊗ n
F (θ). In this case, the score is an iid sum. Applying (2.32) and (2.33) yields
J⊗ n
F (θ) = nJF (θ). (2.36)
Example 2.7 (Location family) In statistics and information theory it is common to talk
about Fisher information of a (single) random variable or a distribution without reference to a
parametric family. In such cases one is implicitly considering a location parameter. Specifically,
for any density p0 on Rd we define a location family of distributions on Rd by setting Pθ (dx) =
p0 (x − θ)dx, θ ∈ Rd . Note that JF (θ) here does not depend on θ. For this special case, we will
adopt the standard notation: Let X ∼ p0 then
J(X) ≡ J(p0 ) ≜ EX∼p0 [(∇ ln p0 (X))(∇ ln p0 (X))⊤ ] = − EX∼p0 [Hess(ln p0 (X))] , (2.40)
where the second equality requires applicability of integration by parts. (See also (7.96) for a
variational definition.)
i i
i i
i i
3 Mutual information
After technical preparations in previous chapters we define perhaps the most famous concept in the
entire field of information theory, the mutual information. It was originally defined by Shannon,
although the name was coined later by Robert Fano.1 It has two equivalent expressions (as a KL
divergence and as difference of entropies), both having their merits. In this chapter, we collect
some basic properties of mutual information (non-negativity, chain rule and the data-processing
inequality). While defining conditional information, we also introduce the language of directed
graphical models, and connect the equality case in the data-processing inequality with Fisher’s
concept of sufficient statistic.
So far in this book we have not yet attempted connecting information quantities to any opera-
tional concepts. The first time this will be done in Section 3.6 where we relate mutual information
to probability of error in the form of Fano’s inequality, which states that whenever I(X; Y) is small,
one should not be able to predict X on the basis of Y with a small probability of error. As such, this
inequality will be applied countless times in the rest of the book as a main workhorse for studying
fundamental limits of problems in both information theory and in statistics.
The connection between information and estimation is furthered in Section 3.7*, in which we
relate mutual information and minimum mean squared error in Gaussian noise (I-MMSE relation).
From the latter we also derive the entropy power inequality, which plays a central role in high-
dimensional probability and concentration of measure.
1
Professor of electrical engineering at MIT, who developed the first course on information theory and as part of it
formalized and rigorized much of Shannon’s ideas. Most famously, he showed the “converse part” of the noisy channel
coding theorem, see Section 17.4.
41
i i
i i
i i
42
Definition 3.1 (Mutual information) For a pair of random variables X and Y we define
I(X; Y) = D(PX,Y kPX PY ).
The intuitive interpretation of mutual information is that I(X; Y) measures the dependency
between X and Y by comparing their joint distribution to the product of the marginals in the KL
divergence, which, as we show next, is also equivalent to comparing the conditional distribution
to the unconditional.
The way we defined I(X; Y) it is a functional of the joint distribution PX,Y . However, it is also
rather fruitful to look at it as a functional of the pair (PX , PY|X ) – more on this in Section 5.1.
In general, the divergence D(PX,Y kPX PY ) should be evaluated using the general definition (2.4).
Note that PX,Y PX PY need not always hold. Let us consider the following examples, though.
Example 3.1 If X = Y ∼ N(0, 1) then PX,Y 6 PX PY and I(X; Y) = ∞. This reflects our
intuition that X contains an “infinite” amount of information requiring infinitely many bits to
describe. On the other hand, if even one of X or Y is discrete, then we always have PX,Y PX PY .
Indeed, consider any E ⊂ X × Y measurable in the product sigma algebra with PX,Y (E) > 0.
P
Since PX,Y (E) = x∈S P[(X, Y) ∈ S, X = x], there exists some x0 ∈ S such that PY (Ex0 ) ≥ P[X =
x0 , Y ∈ Ex0 ] > 0, where Ex0 ≜ {y : (x0 , y) ∈ E} is a section of E (measurable for every x0 ). But
then PX PY (E) ≥ PX PY ({x0 } × Ex0 ) = PX ({x0 })PY (Ex0 ) > 0, implying that PX,Y PX PY .
I(f(X); Y) ≤ I(X; Y) .
i i
i i
i i
K
(d) Consider a Markov kernel K sending (x, y) 7→ (f(x), y). This kernel sends measure PX,Y − →
K
Pf(X),Y and PX PY −
→ Pf(X) PY . Therefore, from the DPI Theorem 2.17 applied to this kernel we
get
It is clear that the two sides correspond to the two mutual informations. For bijective f, simply
apply the inequality to f and f−1 .
(e) Apply (d) with f(X1 , X2 ) = X1 .
Of the results above, the one we will use the most is (3.1). Note that it implies that
D(PX,Y kPX PY ) < ∞ if and only if
x 7→ D(PY|X=x kPY )
Proof. Suppose PX,Y PX PY . We need to prove that any version of the conditional probability
satisfies PY|X=x PY for almost every x. Note, however, that if we prove this for some version P̃Y|X
then the statement for any version follows, since PY|X=x = P̃Y|X=x for PX -a.e. x. (This measure-
theoretic fact can be derived from the chain rule (2.24): since PX P̃Y|X = PX,Y = PX PY|X we must
have 0 = D(PX,Y kPX,Y ) = D(P̃Y|X kPY|X |PX ) = Ex∼PX [D(P̃Y|X=x kPY|X=x )], implying the stated
dPX,Y R
fact.) So let g(x, y) = dP X PY
(x, y) and ρ(x) ≜ Y g(x, y)PY (dy). Fix any set E ⊂ X and notice
Z Z
PX [E] = 1E (x)g(x, y)PX (dx) PY (dy) = 1E (x)ρ(x)PX (dx) .
X ×Y X
R
On the other hand, we also have PX [E] = 1E dPX , which implies ρ(x) = 1 for PX -a.e. x. Now
define
(
g(x, y)PY (dy), ρ(x) = 1
P̃Y|X (dy|x) =
PY (dy), ρ(x) 6= 1 .
Directly plugging P̃Y|X into (2.22) shows that P̃Y|X does define a valid version of the conditional
probability of Y given X. Since by construction P̃Y|X=x PY for every x, the result follows.
Conversely, let PY|X be a kernel such that PX [E] = 1, where E = {x : PY|X=x PY } (recall that
E is measurable by Lemma 2.13). Define P̃Y|X=x = PY|X=x if x ∈ E and P̃Y|X=x = PY , otherwise.
By construction PX P̃Y|X = PX PY|X = PX,Y and P̃Y|X=x PY for every x. Thus, by Theorem 2.12
there exists a jointly measurable f(y|x) such that
i i
i i
i i
44
(d) Similarly, if X, Y are real-valued random vectors with a joint PDF, then
I(X; Y) = h(X) + h(Y) − h(X, Y)
provided that h(X, Y) < ∞. If X has a marginal PDF pX and a conditional PDF pX|Y (x|y),
then
I(X; Y) = h(X) − h(X|Y) ,
provided h(X|Y) < ∞.
2
This is indeed possible if one takes Y = 0 (constant) and X from Example 1.3, demonstrating that (3.3) does not always
hold.
i i
i i
i i
(e) If X or Y are discrete then I(X; Y) ≤ min (H(X), H(Y)), with equality iff H(X|Y) = 0 or
H(Y|X) = 0, or, equivalently, iff one is a deterministic function of the other.
Proof. (a) By Theorem 3.2.(a), I(X; X) = D(PX|X kPX |PX ) = Ex∼X D(δx kPX ). If PX is discrete,
then D(δx kPX ) = log PX1(x) and I(X; X) = H(X). If PX is not discrete, let A = {x : PX (x) > 0}
denote the set of atoms of PX . Let ∆ = {(x, x) : x 6∈ A} ⊂ X × X . (∆ is measurable since it’s
the intersection of Ac × Ac with the diagonal {(x, x) : x ∈ X }.) Then PX,X (∆) = PX (Ac ) > 0
but since
Z Z
(PX × PX )(E) ≜ PX (dx1 ) PX (dx2 )1{(x1 , x2 ) ∈ E}
X X
we have by taking E = ∆ that (PX × PX )(∆) = 0. Thus PX,X 6 PX × PX and thus by definition
(b) Since X is discrete there exists a countable set S such that P[X ∈ S] = 1, and for any x0 ∈ S we
have P[X = x0 ] > 0. Let λ be a counting measure on S and let μ = λ×PY , so that PX PY μ. As
shown in Example 3.1 we also have PX,Y μ. Furthermore, fP (x, y) ≜ dPdμX,Y (x, y) = pX|Y (x|y),
where the latter denotes conditional pmf of X given Y (which is a proper pmf for almost every
y, since P[X ∈ S|Y = y] = 1 for a.e. y). We also have fQ (x, y) = dPdμ
X PY
(x, y) = dP
dλ (x) = pX (x),
X
Note that PX,Y -almost surely both pX|Y (X|Y) > 0 and PX (x) > 0, so we can replace Log with
log in the above. On the other hand,
X 1
H(X|Y) = Ey∼PY pX|Y (x|y) log .
pX|Y (x|y)
x∈ S
i i
i i
i i
46
(d) These arguments are similar to discrete case, except that counting measure is replaced with
Lebesgue. We leave the details as an exercise.
(e) Follows from (b).
From (3.2) we deduce the following result, which was previously shown in Theorem 1.4(d).
Corollary 3.5 (Conditioning reduces entropy) For discrete X, H(X|Y) ≤ H(X), with
equality iff X ⊥
⊥ Y.
H(X, Y )
H(Y ) H(X)
As an example, we have
H(X1 |X2 , X3 ) = μ(E1 \ (E2 ∪ E3 )) , (3.6)
i i
i i
i i
I(X; Y )
ρ
-1 0 1
i i
i i
i i
48
1 1 1 1
= log(2πe) − log(2πe(1 − ρ2 )) = log .
2 2 2 1 − ρ2
where the second equality follows h(Y|X) = h(Y − X|X) = h(Z|X) = h(Z) applying the shift-
invariance of h and the independence between X and Z.
Similar to the role of mutual information, the correlation coefficient also measures the
dependency between random variables which are real-valued (or, more generally, valued in an
inner-product space) in a certain sense. In contrast, mutual information is invariant to bijections
and much more general as it can be defined not just for numerical but for arbitrary random
variables.
Example 3.3 (AWGN channel) The additive white Gaussian noise (AWGN) channel is one
of the main examples of Markov kernels that we will use in this book. This kernel acts from R to
R by taking an input x and setting K(·|x) ∼ N (x, σN2 ), or, in equation form we write Y = X + N,
with X ⊥⊥ N ∼ N (0, σN2 ). Pictorially, we can think of it as
X + Y
Now, suppose that X ∼ N (0, σX2 ), in which case Y ∼ N (0, σX2 + σN2 ). Then by invoking (2.17)
twice we obtain
1 σ2
I(X; Y) = h(Y) − h(Y|X) = h(X + N) − h(N) = log 1 + X2 ,
2 σN
2
where σσX2 is frequently referred to as the signal-to-noise ratio (SNR). See Figure 3.2 for an illus-
N
tration. Note that in engineering it is common to express SNR in decibels (dB), so that SNR in dB
equals 10 log10 (SNR). Later, we will define AWGN channel more formally in Definition 20.10.
Example 3.4 (BI-AWGN channel) In communication and statistical applications one also
often encounters a situation where AWGN channel’s input is restricted to X ∈ {±1}. This Markov
kernel is denoted BIAWGNσN2 : {±1} → R and acts by setting
Y = X + N, X⊥
⊥ N ∼ N (0, σN2 ) .
If we set X ∼ Ber(1/2) then in this case it is more convenient to calculate mutual information by
a decomposition different from the AWGN case. Indeed, we have
I(X; Y) = H(X) − H(X|Y) = log 2 − H(X|Y) .
To compute H(X|Y = y) we simply need to evaluate posterior distribution given observation Y = y.
y
2
e σN
In this case we have P[X = +1|Y = y] = −
y y . Thus, after some algebra we obtain the
σ2 2
e N +e σN
i i
i i
i i
Figure 3.2 Comparing mutual information for the AWGN and BI-AWGN channels (see Examples 3.3
and 3.4). It will be shown later in this book that these mutual informations coincide with the capacities of
respective channels.
following expression
Z ∞
1 z2
√ e− 2 log(1 + e− σ2 + σ z ) dz .
2 2
I(X; Y) = log 2 −
−∞ 2π
(One can verify that H(X|Y) here coincides with that in Example 1.4(2) with σ replaced by 2σ .)
For this channel, the SNR is given by EE[[NX2]] = σ12 . We compare mutual informations of AWGN
2
N
and BI-AWGN as a function of the SNR on Figure 3.2. Note that for the low SNR restricting to
binary input results in virtually no loss of information – a fact underpinning the role played by the
BI-AWGN channel in many real-world communication systems.
Example 3.5 (Gaussian vectors) Let X ∈ Rm and Y ∈ Rn be jointly Gaussian. Then
1 det ΣX det ΣY
I(X; Y) = log
2 det Σ[X,Y]
where ΣX ≜ E (X − EX)(X − EX)⊤ denotes the covariance matrix of X ∈ Rm , and Σ[X,Y]
denotes the covariance matrix of the random vector [X, Y] ∈ Rm+n .
In the special case of additive noise: Y = X + N for N ⊥
⊥ X, we have
1 det(ΣX + ΣN )
I ( X; X + N) = log
2 det ΣN
ΣX ΣX
why?
since det Σ[X,X+N] = det ΣX ΣX +ΣN = det ΣX det ΣN .
i i
i i
i i
50
Example 3.6 (Binary symmetric channel) Recall the setting in Example 1.4(1). Let X ∼
Ber( 21 ) and N ∼ Ber(δ) be independent. Let Y = X ⊕ N; or equivalently, Y is obtained by flipping
X with probability δ .
N
1− δ
0 0
X δ Y X + Y
1 1
1− δ
Denoting I(X; Y) as a functional I(PX,Y ) of the joint distribution PX,Y , we have I(X; Y|Z) =
Ez∼PZ [I(PX,Y|Z=z )]. As such, I(X; Y|Z) is a linear functional in PZ . Measurability of the map z 7→
I(PX,Y|Z=z ) is not obvious, but follows from Lemma 2.13.
To further discuss properties of the conditional mutual information, let us first introduce the
notation for conditional independence. A family of joint distributions can be represented by a
directed acyclic graph (DAG) encoding the dependency structure of the underlying random vari-
ables. We do not intend to introduce formal definitions here and refer to the standard monograph
for full details [271]. But in short, every problem consisting of finitely (or countably infinitely)
i i
i i
i i
many random variables can be depicted as a DAG. Nodes of the DAG correspond to random vari-
ables and incoming edges into the node U simply describe which variables need to be known in
order to generate U. A simple example is a Markov chain (path graph) X → Y → Z, which repre-
sents distributions that factor as {PX,Y,Z : PX,Y,Z = PX PY|X PZ|Y }. We have the following equivalent
descriptions:
There is a general method for obtaining these equivalences for general graphs, known as d-
separation, see [271]. We say that a variable V is a collider on some undirected path if it appears
on the path as
Theorem 3.7 (Further properties of mutual information) Suppose that all random
variables are valued in standard Borel spaces. Then:
3
Also known as “Kolmogorov identities”.
i i
i i
i i
52
(f) (Permutation invariance) If f and g are one-to-one (with measurable inverses), then
I(f(X); g(Y)) = I(X; Y).
Remark 3.2 In general, I(X; Y|Z) and I(X; Y) are incomparable. Indeed, consider the following
examples:
• I(X; Y|Z) > I(X; Y): We need to find an example of X, Y, Z, which do not form a Markov chain.
To that end notice that there is only one directed acyclic graph non-isomorphic to X → Y → Z,
i.i.d.
namely X → Y ← Z. With this idea in mind, we construct X, Z ∼ Ber( 12 ) and Y = X ⊕ Z. Then
I(X; Y) = 0 since X ⊥⊥ Y; however, I(X; Y|Z) = I(X; X ⊕ Z|Z) = H(X) = 1 bit.
i i
i i
i i
• I(X; Y|Z) < I(X; Y): Simply take X, Y, Z to be any random variables on finite alphabets and
Z = Y. Then I(X; Y|Z) = I(X; Y|Y) = H(Y|Y) − H(Y|X, Y) = 0 by a conditional version of (3.3).
Remark 3.3
Pn
(Chain rule for IP⇒ Chain rule for H) Set Y = Xn . Then H(Xn ) =
n k− 1 k− 1
), since H(Xk |Xn , Xk−1 ) = 0.
n
n n
I(X ; X ) = k=1 I(Xk ; X |X )= k=1 H(Xk |X
Remark 3.4 (DPI for divergence =⇒ DPI for mutual information) We proved
DPI for mutual information in Theorem 3.7 using Kolmogorov’s identity. In fact, DPI for mutual
information is implied by that for divergence in Theorem 2.17:
where ηKL < 1 and depends on the channel PY|X only. Similarly, this gives an improvement in the
data-processing inequality for mutual information in Theorem 3.7(c): For any PU,X we have
For example, for PY|X = BSCδ we have ηKL = (1 − 2δ)2 . Strong data-processing inequalities
(SDPIs) quantify the intuitive observation that noise intrinsic to the channel PY|X must reduce the
information that Y carries about the data U, regardless of how we optimize the encoding U 7→ X.
We explore SDPI further in Chapter 33 as well as their ramifications in statistics.
In addition to the case of strict inequality in DPI, the case of equality is also worth taking a closer
look. If U → X → Y and I(U; X) = I(U; Y), intuitively it means that, as far as U is concerned,
there is no loss of information in summarizing X into Y. In statistical parlance, we say that Y is a
sufficient statistic of X for U. This is the topic for the next section.
i i
i i
i i
54
• Let PT|X be some Markov kernel. Let PθT ≜ PT|X ◦ PθX be the induced distribution on T for each
θ.
Definition 3.8 (Sufficient statistic) We say that T is a sufficient statistic of X for θ if there
exists a transition probability kernel PX|T so that PθX PT|X = PθT PX|T , i.e., PX|T can be chosen to not
depend on θ.
The intuitive interpretation of T being sufficient is that, with T at hand, one can ignore X; in
other words, T contains all the relevant information to infer about θ. This is because X can be
simulated on the sole basis of T without knowing θ. As such, X provides no extra information
for identification of θ. Any one-to-one transformation of X is sufficient, however, this is not the
interesting case. In the interesting cases dimensionality of T will be much smaller (typically equal
to that of θ) than that of X. See examples below.
Observe also that the parameter θ need not be a random variable, as Definition 3.8 does not
involve any distribution (prior) on θ. This is a so-called frequentist point of view on the problem
of parameter estimation.
Theorem 3.9 Let θ, X, T be as in the setting above. Then the following are equivalent
Proof. We omit the details, which amount to either restating the conditions in terms of conditional
independence, or invoking equality cases in the properties stated in Theorem 3.7.
Theorem 3.10 (Fisher’s factorization theorem) For all θ ∈ Θ, let PθX have a density pθ
with respect to a common dominating measure μ. Let T = T(X) be a deterministic function of X.
Then T is a sufficient statistic of X for θ iff
pθ (x) = gθ (T(x))h(x)
Proof. We only give the proof in the discrete case where pθ represents the PMF. (The argument
P R
for the general case is similar replacing by dμ). Let t = T(x).
“⇒”: Suppose T is a sufficient statistic of X for θ. Then pθ (x) = Pθ (X = x) = Pθ (X = x, T =
t) = Pθ (X = x|T = t)Pθ (T = t) = P(X = x|T = T(x)) Pθ (T = T(x))
| {z }| {z }
h(x) gθ (T(x))
i i
i i
i i
i.i.d.
• Normal mean model. Let θ ∈ R and observations X1 , . . . , Xn ∼ N (θ, 1). Then the sample mean
Pn
X̄ = 1n j=1 Xj is a sufficient statistic of Xn for θ.
i.i.d. Pn
• Coin flips. Let Bi ∼ Ber(θ). Then i=1 Bi is a sufficient statistic of Bn for θ.
i.i.d.
• Uniform distribution. Let Ui ∼ Unif(0, θ). Then maxi∈[n] Ui is a sufficient statistic of Un for θ.
Example 3.9 (Sufficient statistic for hypothesis testing) Let Θ = {0, 1}. Given θ = 0
or 1, X ∼ PX or QX , respectively. Then Y – the output of PY|X – is a sufficient statistic of X for θ iff
D(PX|Y kQX|Y |PY ) = 0, i.e., PX|Y = QX|Y holds PY -a.s. Indeed, the latter means that for kernel QX|Y
we have
which is precisely the definition of sufficient statistic when θ ∈ {0, 1}. This example explains
the condition for equality in the data-processing for divergence in Theorem 2.17. Then assuming
D(PY kQY ) < ∞ we have:
with equality iff D(PX|Y kQX|Y |PY ) = 0, which is equivalent to Y being a sufficient statistic for
testing PX vs QX as desired.
i i
i i
i i
56
Our goal is to draw converse statements of the following type: If the uncertainty of W is too high
or if the information provided by the data is too scarce, then it is difficult to guess the value of W.
In this section we formalize these intuitions using (conditional) entropy and mutual information.
where
The function FM (·) is shown in Figure 3.3. Notice that due to its non-monotonicity the
statement (3.15) does not imply (3.13), even though P[X = X̂] ≤ Pmax .
FM (p)
log M
log(M − 1)
p
0 1/M 1
Figure 3.3 The function FM in (3.14) is concave with maximum log M at maximizer 1/M, but not monotone.
Proof. To show (3.13) consider an auxiliary (product) distribution QX,X̂ = UX PX̂ , where UX is
uniform on X . Then Q[X = X̂] = n1/M. Denoting
o P[X = X̂] ≜ PS , applying the DPI for divergence
to the data processor (X, X̂) 7→ 1 X = X̂ yields d(PS k1/M) ≤ D(PXX̂ kQXX̂ ) = log M − H(X).
To show the second part, suppose one is trying to guess the value of X without any side informa-
tion. Then the best bet is obviously the most likely outcome (mode) and the maximal probability
i i
i i
i i
of success is
Thus, applying (3.13) with X̂ being the mode yield (3.15). Finally, suppose that P =
(Pmax , P2 , . . . , PM ) and introduce Q = (Pmax , 1− Pmax 1−Pmax
M−1 , . . . , M−1 ). Then the difference of the right
and left side of (3.15) equals D(PkQ) ≥ 0, with equality iff P = Q.
Remark 3.6 Let us discuss the unusual proof technique. Instead of studying directly the prob-
ability space PX,X̂ given to us, we introduced an auxiliary one: QX,X̂ . We then drew conclusions
about the target metric ( probability of error) for the auxiliary problem (the probability of error
= 1 − M1 ). Finally, we used DPI to transport statement about Q to a statement about P: if D(PkQ)
is small, then the probabilities of the events (e.g., {X 6= X̂}) should be small as well. This is a
general method, known as meta-converse, that we develop in more detail later in this book for
channel coding (see Section 22.3). For the specific result (3.15), however, there are much more
explicit ways to derive it – see Ex. I.25.
Similar to the Shannon entropy H, Pmax is also a reasonable measure for randomness of P. In
fact, recall from (1.4) that
1
H∞ (P) = log (3.17)
Pmax
is the Rényi entropy of order ∞, cf. (1.4). In this regard, Theorem 3.11 can be thought of as our
first example of a comparison of information measures: it compares H and H∞ . We will study
such comparisons systematically in Section 7.4.
Next we proceed to the setting of Fano’s inequality where the estimate X̂ is made on the basis of
some observation Y correlated with X. We will see that the proof of the previous theorem trivially
generalizes to this new case of possibly randomized estimators. Though not needed in the proof,
it is worth mentioning that the best estimator minimizing the probability of error P[X 6= X̂] is the
maximum posterior (MAP) rule, i.e., the posterior mode: X̂(y) = argmaxx PX|Y (x|y).
Theorem 3.12 (Fano’s inequality) Let |X | = M < ∞ and X → Y → X̂. Let Pe = P[X 6= X̂],
then
Proof. To show (3.18) we apply data processing (for divergence) to PX,Y,X̂ = PX PY|X PX̂|Y vs
n o
QX,Y,X̂ = UX PY PX̂|Y and the data processor (kernel) (X, Y, X̂) 7→ 1 X 6= X̂ (note that PX̂|Y is
identical for both).
i i
i i
i i
58
To show (3.19) we apply data processing (for divergence) to PX,Y,X̂ = PX PY|X PX̂|Y vs QX,Y,X̂ =
n o
PX PY PX̂|Y and the data processor (kernel) (X, Y, X̂) 7→ 1 X 6= X̂ to obtain:
Proof. Apply Theorem 3.12 and the data processing for mutual information: I(W; Ŵ) ≤ I(X; Y).
where MMSE stands for minimum MSE (which follows from the fact that the best estimator of X
given Y is precisely E[X|Y]). Just like Fano’s inequality one can derive inequalities relating I(X; Y)
and mmse(X|Y). For example, from Tao’s inequality (see Corollary 7.11) one can easily get for
the case where X ∈ [−1, 1] that
2
0 ≤ Var(X) − mmse(X|Y) ≤ I ( X ; Y) ,
log e
which shows that the variance reduction of X due to Y is at most proportional to their mutual
information. (Simply notice that E[| E[X|Y] − E[X]|2 ] = Var(X) − mmse(X|Y)).
However, this section is not about such inequalities. Here we show a remarkable equality for
the special case when Y is an observation of X corrupted by Gaussian noise. As applications of
i i
i i
i i
this identity we will derive stochastic localization in Exercise I.66 and entropy power inequality
in Theorem 3.16.
As a simple example, for Gaussian X, one may verify (3.22) by combining the mutual
information in Example 3.3 with the MMSE in Example 28.1.
Before proving Theorem 3.14 we start with some notation and preliminary results. Let I ⊂ R
be an open interval, μ a (positive) measure on Rd , and K, L : Rd × I → R such that the following
R R R
conditions are met: a) K(x, θ) μ(dx) exists for all θ ∈ I; b) Rd μ(dx) I dθ|L(x, θ)| < ∞; c)
R
t 7→ Rd μ(dx)L(x, t) is continuous and d) we have
∂
K(x, θ) = L(x, θ)
∂θ
for all x, θ. Then
Z Z
∂
K(x, θ) dx = L(x, θ) dx . (3.24)
∂θ Rd Rd
Rθ
(To see this, take θ > θ0 ∈ I and write K(x, θ) = K(x, θ0 ) + θ0 dtL(x, t). Now we can integrate
R Rθ
this over x and interchange the order of integrals to get dxK(x, θ) = constant + θ0 g(t)dt, where
R
g(t) = dxL(x, t) is continuous). Note that in the case of finite interval I both conditions b) and
c) are implied by condition e) for all t ∈ I we have |L(x, t)| ≤ ℓ(x) and ℓ is μ-integrable.
Let ϕa (x) = (2πa1)d/2 e−∥x∥ /(2a) be the density of N (0, aId ). Suppose p is some probability
2
distribution, and f is a function then we denote by p ∗ f(x) = EX′ ∼p [f(x − X′ )], which coincides
with the usual convolution if p is a density. In particular, the Gaussian convolution p ∗ ϕa is known
k
as a Gaussian mixture with mixing distribution p. For any differential operator D = ∂ xi ∂···∂ xi we
1 k
have
i i
i i
i i
60
where ∆f = tr(Hess f) is the Laplacian. Notice that the second equality follows from (3.25) and
the easily checked identity
∂ 1 1
ϕa (x) = 2 (kxk2 − ad)ϕa (x) = ∆ϕa (x) .
∂a 2a 2
Thus, we only need to justify the first equality in (3.26). To that end, we use (3.24) with μ = p,
K(x, a) = ϕa (y − x) and L(x, a) = ∂∂a K(x, a). Note that by the previous calculation we have
supx | ∂∂a ϕa (x)| < ∞, and thus condition e) of (3.24) applies and so (3.24) implies (3.26).
The next lemma shows a special property of Gaussian convolution (derivatives of log-
convolution correspond to conditional moments).
√
Lemma 3.15 Let Y = X + aZ, where X ⊥
⊥ Z and X ∼ p and Z ∼ N (0, Id ). Then
where (a) follows from the fact that ∇ϕa (x) = − ax ϕa (x) and (b) from (3.25). The proof of (3.27)
is completed after noticing p∗ϕ1a (y) ∇ (p ∗ ϕa (x)) = ∇ ln(p ∗ ϕa )(y). Technical estimate (3.28) is
shown in [344, Proposition 2].
The identity (3.29) is shown entirely similarly.
Proof of Theorem 3.14. For simplicity, in this proof we compute all informations and entropies
with natural base, so log = ln. With these preparations we can show (3.22). First, let a = 1/γ and
notice
√ √ √ d
I(X; γ X + Z) = I(X; X + aZ) = h(X + aZ) − ln(2πea) ,
2
i i
i i
i i
where we computed differential entropy of the Gaussian via Theorem 2.8. Thus, the proof is
completed if we show
d √ d 1
h(X + aZ) = − mmse(X|Ya ) , (3.30)
da 2a 2a2
√
where we defined Ya = X + aZ. Let the law of X be p. Conceptually, the computation is just a
few lines:
Z
d √ d
− h(X + aZ) = (p ∗ ϕa )(x) ln(p ∗ ϕa )(x)dx
da da
Z
( a) ∂
= [(p ∗ ϕa )(x) ln(p ∗ ϕa )(x)]dx
∂a
Z
(b) 1
= (1 + ln p ∗ ϕa )∆(p ∗ ϕa )dx
2
Z
( c) 1
= (p ∗ ϕa )∆(ln p ∗ ϕa )dx
2
Z
(d) 1 1 d
= (p ∗ ϕa )(y)( 2 mmse(X|Ya = y) − )dy ,
2 a a
where (a) and (c) will require technical justifications, while (b) is just (3.26) and (d) is by taking
trace of (3.29). Note that (a) is just interchange of differentiation and integration, while (c) is
simply the “self-adjointness” of Laplacian.
We proceed to justifying (a). We will apply (3.24) with μ = Leb, I = (a1 , a2 ) some finite
interval, K(x, a) = (p ∗ ϕa )(x) ln(p ∗ ϕa )(x) and
∂ 1
L ( x, a ) = K(x, a) = (1 + ln(p ∗ ϕa )(x))(p ∗ ∆ϕa )(x) ,
∂a 2
where we again used (3.26).
Integrating (3.28) we get
3 4
| ln p ∗ ϕa (x) − ln p ∗ ϕa (0)| ≤ kxk2 + kxk E[kXk] .
2a a
Since p ∗ ϕa (0) ≤ ϕa (0) we get that for all a ∈ (a1 , a2 ) we have for some c > 0:
The integral of the right-hand side over x is simply c(1 + E[kYa k + kYa k2 ]) < ∞, which confirms
condition a) of (3.24).
Next, we notice that for any differential operator D we have Dϕa (x) = f(x)ϕa (x) where f is some
polynomial in x. Since for a < a2 we have supx f(ϕx)ϕ a (x)
a2 ( x )
< ∞ we have that for some constant c′
and all a < a2 and all x we have
i i
i i
i i
62
where we used (3.25) as well. Thus, for L(x, a) we can see that the first term is bounded by (3.31)
and the second by the previous display, so that overall
cc′
L ( x, a ) ≤
|2 + kxk + kxk2 |(p ∗ ϕa2 )(x) .
2
Since again the right-hand side is integrable, we see that condition e) of (3.24) applies and thus
R
the interchange of differentiation and in step (a) is valid.
Finally, we argue that step (c) is applicable. To that end we prove an auxiliary result first: If
R R
u, v are two univariate twice-differentiable functions with a) R |u′′ v| and R |v′′ u| both finite and
R ′ ′
b) |u v | < ∞ then
Z Z
u′′ v = v′′ u . (3.33)
R R
Indeed, from condition b) there must exist a sequence cn → +∞, bn → −∞ such that
|u′ (cn )v′ (cn )| + |u′ (bn )v′ (bn )| → 0. On the other hand, from condition a) we have
Z cn Z
lim u′′ v = u′′ v ,
n→∞ bn R
R
and similarly for v′′ u. Now applying integration by parts we have
Z cn Z cn
′′ ′ ′ ′ ′
u v = u (cn )v (cn ) − u (bn )v (bn ) + v′′ u ,
bn bn
R d | U 2
∂ xi
V| both finite and b) R d k∇ U k k∇ Vk < ∞ then
Z Z
V∆ U = U∆ V . (3.34)
Rd Rd
We write x = (x1 , xd2 )by grouping the last (d − 1) coordinates together. Fix xd2 and define
u(x1 ) = U(x1 , x2 , · · · , xd ) and v(x1 ) = V(x1 , x2 , · · · , xd ). For Lebesgue-a.e. xd2 we see that u, v
satisfy conditions for (3.33). Thus, we obtain that for such xd2 we have
Z Z
∂2 ∂2
V(x) 2 U(x) dx1 = U(x) 2 V(x) dx1 .
R ∂ x1 R ∂ x1
Integrating this over xd2 we get
Z Z
∂2 ∂2
V(x) 2 U(x) dx = U(x) 2 V(x) dx .
Rd ∂ x1 Rd ∂ x1
Now, to justify step (c) we have to verify that U(x) = 1 + ln(p ∗ ϕa )(x) and V(x) = p ∗ ϕa (x)
2
satisfy conditions of the previous result. To that end, notice that from (3.29) we have | ∂∂y2 U(y)| ≤
i
1
a2 E[X2i |Ya = y] + 1
a and thus
Z
∂2
|V U| = Oa (E[X2i ]) < ∞ .
Rd ∂ x2i
i i
i i
i i
2
On the other hand, from (3.25) and (3.32) we have | ∂∂y2 V(y)| ≤ c′ p ∗ ϕa2 (y). From (3.31) then we
i
obtain
Z Z
∂2
|U 2 V| ≤ cc (1 + kxk + kxk2 )p ∗ ϕa2 (x) = cc′ E[1 + kYa2 k + kYa2 k2 ] < ∞ .
′
Rd ∂ xi
R
Finally, for showing Rd k∇Ukk∇Vk < ∞ we apply (3.28) to estimate k∇Uk ≲a 1 + kyk and
use (3.32) to estimate k∇Vk ≲a p ∗ ϕa2 (x). Thus, we have
Z
k∇Ukk∇Vk ≲a E[1 + kYa2 k] < ∞ .
Rd
R R
This completes verification of conditions and we conclude U∆V = V∆U as required for step
(c).
√
The identity (3.23) is obtained by differentiating function γ 7→ mmse(X| γ + Z) using very
similar methods. We refer to [206] for full justification.
Similarly, notice that Ey∼j ∼PY∼j [mmse(Xj |Yj , Y∼j = y∼j )] = mmse(Xj |Y). Thus, applying the 1D-
version of (3.22) we get
∂ log e
I ( X ; Y) = mmse(Xj |Y) .
∂γj 2
P
Now since mmse(X|Y) = j mmse(Xj |Y) by summing (3.35) over j we obtain the d-dimensional
version of (3.22). Note that we computed the derivative in a scalar parameter γ by introducing a
vector one γ and then using the chain rule to simplify partial derivatives. This idea is the basis
of area theorem in information theory [360, Lemma 3] and Guerra interpolation in statistical
physics [410].
i i
i i
i i
64
Theorem 3.16 (Entropy power inequality (EPI) [399]) Suppose A1 ⊥ ⊥ A2 are inde-
pendent Rd -valued random variables with finite second moments E[kAi k2 ] < ∞, i ∈ {1, 2}.
Then
Proof. We present an elegant proof of [437]. First, an observation of Lieb [280] shows that EPI
is equivalent to proving: For all U1 ⊥
⊥ U2 and α ∈ [0, 2π ) we have
(To see that (3.36) implies EPI simply take cos2 α = N(A1N)+ ( A1 )
N(A2 ) and U1 = A1 / cos α, U2 =
A2 / sin α.)
Next, we claim that proving (3.36) for general Ui is equivalent to proving it for their “smoothed”
√
versions, i.e. Ũi = Ui + ϵZi , where Zi ∼ N (0, Id ) is independent of U1 , U2 . Indeed, this technical
continuity result follows, for example, from [344, Prop. 1], which shows that whenever E[kUi k2 ] <
√ √ √
∞ then function ϵ 7→ h(Ui + ϵZi ) is continuous and in fact h(Ui + ϵZ) = h(Ui ) + O( ϵ) as
ϵ → 0.
In other words, to prove Lieb’s EPI it is sufficient to prove for all ϵ > 0
√ √ √
h( X + ϵZ) ≥ h(U1 + ϵZ1 ) cos2 α + h(U2 + ϵZ2 ) sin2 (α) ,
where we also defined X ≜ U1 cos α+ U2 sin α, Z ≜ Z1 cos α+ Z2 sin α. Since the above inequality
is scale-invariant, we can equivalently show for all γ ≥ 0 the following:
√ √ √
h( γ X + Z) ≥ h( γ U1 + Z1 ) cos2 α + h( γ U2 + Z2 ) sin2 (α) .
4
Another deep manifestation of this phenomenon is in the context of CLT. Barron’s entropic CLT states that for iid Xi ’s
with zero mean and unit variance, the KL divergence D( X1 +...+X
√
n
n
kN (0, 1)), whenever finite, converges to zero. This
convergence is in fact monotonic as shown in [27, 102].
i i
i i
i i
On the other hand, for the right-hand side X is a sum of two conditionally independent terms and
thus
√ √ √ √
mmse(X| γ U1 +Z1 , γ U2 +Z2 ) = mmse(U1 | γ U1 +Z1 ) cos2 α+mmse(U2 | γ U2 +Z2 ) sin2 (α) .
In Corollary 2.9 we have already seen how properties of differential entropy can be translated
to properties of positive definite matrices. Here is another application:
i i
i i
i i
We will see in this chapter that divergence has two different sup characterizations (over partitions
and over functions). The mutual information is more special. In addition to inheriting the ones
from KL divergence, it possesses two extra: an inf-representation over (centroid) measures QY
and a sup-representation over Markov kernels.
As applications of these variational characterizations, we discuss the Gibbs variational prin-
ciple, which serves as the basis of many modern algorithms in machine learning, including the
EM algorithm and variational autoencoders; see Section 4.4. An important theoretical construct
in machine learning is the idea of PAC-Bayes bounds (Section 4.8*).
From information theoretic point of view variational characterizations are important because
they address the problem of continuity. We will discuss several types of continuity in this Chapter.
First, is the continuity in discretization. This is related to the issue of computation. For complicated
P and Q direct computation of D(PkQ) might be hard. Instead, one may want to discretize the
infinite alphabet and compute numerically the finite sum. Does this approximation work, i.e., as
the quantization becomes finer, are the resulting finite sums guaranteed to converge to the true
value of D(PkQ)? The answer is positive and this continuity with respect to discretization is the
content of Theorem 4.5.
Second, is the continuity under change of the distribution. For example, this arises in the prob-
lem of estimating information measures. In many statistical setups, oftentimes we do not know
P or Q, and we estimate the distribution by P̂n using n iid observations sampled from P (in dis-
crete cases we may set P̂n to be simply the empirical distribution). Does D(P̂n kQ) provide a good
estimator for D(PkQ)? Does D(P̂n kQ) → D(PkQ) if P̂n → P? The answer is delicate – see
Section 4.5.
Third, there is yet another kind of continuity: continuity “in the σ -algebra”. Despite the scary
name, this one is useful even in the most “discrete” situations. For example, imagine that θ ∼
66
i i
i i
i i
i.i.d.
Unif(0, 1) and Xi ∼ Ber(θ). Suppose that you observe a sequence of Xi ’s until the random moment
τ equal to the first occurrence of the pattern 0101. How much information about θ did you learn
by time τ ? We can encode these observations as
(
Xj , j ≤ τ ,
Zj = ,
?, j > τ
where ? designates the fact that we don’t know the value of Xj on those times. Then the question
we asked above is to compute I(θ; Z∞ ). We will show in this chapter that
X
∞
I(θ; Z∞ ) = lim I(θ; Zn ) = I(θ; Zn |Zn−1 ) (4.1)
n→∞
n=1
thus reducing computation to evaluating an infinite sum of simpler terms (not involving infinite-
dimensional vectors). Thus, even in this simple question about biased coin flips we have to
understand how to safely work with infinite-dimensional vectors.
Furthermore, it turns out that PY , similar to the center of gravity, minimizes this weighted distance
and thus can be thought as the best approximation for the “center” of the collection of distribu-
tions {PY|X=x : x ∈ X } with weights given by PX . We formalize these results in this section and
start with the proof of a “golden formula”. Its importance is in bridging the two points of view
on mutual information. Recall that on one hand we had the Fano’s Definition 3.1, on the other
hand for discrete cases we had the Shannon’s definition (3.3) as difference of entropies. Then
next result (4.3) presents MI as the difference of relative entropies in the style of Shannon, while
retaining applicability to continuous spaces in the style of Fano.
Proof. In the discrete case and ignoring the possibility of dividing by zero, the argument is really
simple. We just need to write
(3.1) PY|X PY|X QY
I(X; Y) = EPX,Y log = EPX,Y log
PY PY QY
i i
i i
i i
68
P Q P
and then expand log PYY|XQYY = log QY|YX − log QPYY . The argument below is a rigorous implementation
of this idea.
First, notice that by Theorem 2.16(e) we have D(PY|X kQY |PX ) ≥ D(PY kQY ) and thus if
D(PY kQY ) = ∞ then both sides of (4.2) are infinite. Thus, we assume D(PY kQY ) < ∞ and
in particular PY QY . Rewriting LHS of (4.2) via the chain rule (2.24) we see that Theorem
amounts to proving
D(PX,Y kPX QY ) = D(PX,Y kPX PY ) + D(PY kQY ) .
The case of D(PX,Y kPX QY ) = D(PX,Y kPX PY ) = ∞ is clear. Thus, we can assume at least one of
these divergences is finite, and, hence, also PX,Y PX QY .
dPY
Let λ(y) = dQ Y
(y). Since λ(Y) > 0 PY -a.s., applying the definition of Log in (2.10), we can
write
λ(Y)
EPY [log λ(Y)] = EPX,Y Log . (4.4)
1
dPX PY
Notice that the same λ(y) is also the density dPX QY
(x, y) of the product measure PX PY with respect
to PX QY . Therefore, the RHS of (4.4) by (2.11) applied with μ = PX QY coincides with
D(PX,Y kPX QY ) − D(PX,Y kPX PY ) ,
while the LHS of (4.4) by (2.13) equals D(PY kQY ). Thus, we have shown the required
D(PY kQY ) = D(PX,Y kPX QY ) − D(PX,Y kPX PY ) .
Remark 4.1 The variational representation (4.5) is useful for upper bounding mutual infor-
mation by choosing an appropriate QY . Indeed, often each distribution in the collection PY|X=x
is simple, but their mixture, PY , is very hard to work with. In these cases, choosing a suitable QY
in (4.5) provides a convenient upper bound. As an example, consider the AWGN channel Y = X+Z
in Example 3.3, where Var(X) = σ 2 , Z ∼ N (0, 1). Then, choosing the best possible Gaussian Q
and applying the above bound, we have:
1
I(X; Y) ≤ inf E[D(N (X, 1)kN ( μ, s))] = log(1 + σ 2 ),
μ∈R,s≥0 2
which is tight when X is Gaussian. For more examples and statistical applications, see Chapter 30.
i i
i i
i i
Proof. We only need to use the previous corollary and the chain rule (2.24):
(2.24)
D(PX,Y kQX QY ) = D(PY|X kQY |PX ) + D(PX kQX ) ≥ I(X; Y) .
Interestingly, the point of view in the previous result extends to conditional mutual information
as follows: We have
where the minimization is over all QX,Y,Z = QX QY|X QZ|Y , cf. Section 3.4. Showing this character-
ization is very similar to the previous theorem. By repeating the same argument as in (4.2) we
get
≥ I ( X ; Z| Y) .
Characterization (4.6) can be understood as follows. The most general directed graphical model
for the triplet (X, Y, Z) is a 3-clique (triangle).
Y X
What is the information flow on the dashed edge X → Z? To answer this, notice that removing
this edge restricts the joint distribution to a Markov chain X → Y → Z. Thus, it is natural to
ask what is the minimum (KL-divergence) distance between a given PX,Y,Z and the set of all
distributions QX,Y,Z satisfying the Markov chain constraint. By the above calculation, optimal
QX,Y,Z = PY PX|Y PZ|Y and hence the distance is I(X; Z|Y). For this reason, we may interpret I(X; Z|Y)
as the amount of information flowing through the X → Z edge.
In addition to inf-characterization, mutual information also has a sup-characterization.
i i
i i
i i
70
Theorem 4.4 For any Markov kernel QX|Y such that QX|Y=y PX for PY -a.e. y we have
dQX|Y
I(X; Y) ≥ EPX,Y log .
dPX
If I(X; Y) < ∞ then
dQX|Y
I(X; Y) = sup EPX,Y log , (4.7)
QX|Y dPX
where the supremum is over Markov kernels QX|Y as in the first sentence.
Remark 4.2 Similar to how (4.5) is used to upper-bound I(X; Y) by choosing a good approx-
imation to PY , this result is used to lower-bound I(X; Y) by selecting a good (but computable)
approximation QX|Y to usually a very complicated posterior PX|Y . See Section 5.6 for applications.
Proof. Since modifying QX|Y=y on a negligible set of y’s does not change the expectations, we
will assume that QX|Y=y PY for every y. If I(X; Y) = ∞ then there is nothing to prove. So we
assume I(X; Y) < ∞, which implies PX,Y PX PY . Then by Lemma 3.3 we have that PX|Y=y PX
dQX|Y=y /dPX
for almost every y. Choose any such y and apply (2.11) with μ = PX and noticing Log 1 =
dQX|Y=y
log dP X
we get
dQX|Y=y
EPX|Y=y log = D(PX|Y=y kPX ) − D(PX|Y=y kQX|Y=y ) ,
dPX
which is applicable since the first term is finite for a.e. y by (3.1). Taking expectation of the previous
identity over y we obtain
dQX|Y
EPX,Y log = I(X; Y) − D(PX|Y kQX|Y |PY ) ≤ I(X; Y) , (4.8)
dPX
implying the first part. The equality case in (4.7) follows by taking QX|Y = PX|Y , which satisfies
the conditions on Q when I(X; Y) < ∞.
i i
i i
i i
Remark 4.3 This theorem, in particular, allows us to prove all general identities and inequali-
ties for the cases of discrete random variables and then pass to the limit. In the case of mutual
information I(X; Y) = D(PX,Y kPX PY ), the partitions of X and Y can be chosen separately,
see (4.29).
“≤”: To show D(PkQ) is indeed achievable, first note that if P 6 Q, then by definition, there
exists B such that Q(B) = 0 < P(B). Choosing the partition E1 = B and E2 = Bc , we have
P2 P[Ei ]
D(PkQ) = ∞ = i=1 P[Ei ] log Q[Ei ] . In the sequel we assume that P Q and let X = dQ .
dP
Then D(PkQ) = EQ [X log X] = EQ [φ(X)] by (2.4). Note that φ(x) ≥ 0 if and only if x ≥ 1. By
monotone convergence theorem, we have EQ [φ(X)1 {X < c}] → D(PkQ) as c → ∞, regardless
of the finiteness of D(PkQ).
Next, we construct a finite partition. Let n = c/ϵ be an integer and for j = 0, . . . , n − 1, let
Ej = {jϵ ≤ X ≤ (j + 1)ϵ} and En = {X ≥ c}. Define Y = ϵbX/ϵc as the quantized version. Since φ
is uniformly continuous on [0, c], for any x, y ∈ [0, c] such |x−y| ≤ ϵ, we have |φ(x)−φ(y)| ≤ ϵ′ for
some ϵ′ = ϵ′ (ϵ, c) such as ϵ′ → 0 as ϵ → 0. Then EQ [φ(Y)1 {X < c}] ≥ EQ [φ(X)1 {X < c}] − ϵ′ .
Moreover,
X
n−1 n−1
X
P(Ej )
EQ [φ(Y)1 {X < c}] = φ(jϵ)Q(Ej ) ≤ ϵ′ + φ Q(Ej )
Q( E j )
j=0 j=0
X
n
P(Ej )
≤ ϵ′ + Q(X ≥ c) log e + P(Ej ) log ,
Q(Ej )
j=0
P(E )
where the first inequality applies the uniform continuity of φ since jϵ ≤ Q(Ejj ) < (j + 1)ϵ, and the
second applies φ ≥ − log e. As Q(X ≥ c) → 0 as c → ∞, the proof is completed by first sending
ϵ → 0 then c → ∞.
i i
i i
i i
72
In particular, if D(PkQ) < ∞ then EP [f(X)] is well-defined and < ∞ for every f ∈ CQ . The
identity (4.11) holds with CQ replaced by the class of all R-valued simple functions. If X is a
normal topological space (e.g., a metric space) with the Borel σ -algebra, then also
D(PkQ) = sup EP [f(X)] − log EQ [exp{f(X)}] , (4.12)
f∈Cb
Proof. “D ≥ supf∈CQ ”: We can assume for this part that D(PkQ) < ∞, since otherwise there is
nothing to prove. Then fix f ∈ CQ and define a probability measure Qf (tilted version of Q) via
Qf (dx) = exp{f(x) − ψf }Q(dx) , ψf ≜ log EQ [exp{f(X)}] . (4.13)
Then, Qf Q. We will apply (2.11) next with reference measure μ = Q. Note that according
exp{f(x)−ψf }
to (2.10) we always have Log 1 = f(x) − ψf even when f(x) = −∞. Thus, we get
from (2.11)
dQf /dQ
EP [f(X)] − ψf = EP Log = D(PkQ) − D(PkQf ) ≤ D(PkQ) .
1
Note that (2.11) also implies that if D(PkQ) < ∞ and f ∈ CQ the expectation EP [f] is well-defined.
“D ≤ supf ” with supremum over all simple functions: The idea is to just take f = log dQ dP
;
however to handle all cases we proceed more carefully. First, notice that if P 6 Q then for some
E with Q[E] = 0 < P[E] and c → ∞ taking f = c1E shows that both sides of (4.11) are infinite.
Pn P[ E ]
Thus, we assume P Q. For any partition of X = ∪nj=1 Ej we set f = j=1 1Ej log Q[Ejj ] . Then
the right-hand sides of (4.11) and (4.9) evaluate to the same value and hence by Theorem 4.5 we
obtain that supremum over simple functions (and thus over CQ ) is at least as large as D(PkQ).
Finally, to show (4.12), we show that for every simple function f there exists a continuous
bounded f′ such that EP [f′ ] − log EQ [exp{f′ }] is arbitrarily close to the same functional evaluated
at f. To that end we first show that for any a ∈ R and measurable A ⊂ X there exists a sequence
of continuous bounded fn such that
EP [fn ] → aP[A], and EQ [exp{fn }] → exp{a}Q[A] (4.14)
hold simultaneously, i.e. fn → a1A in the sense of approximating both expectations. We only
consider the case of a > 0 below. Let compact F and open U be such that F ⊂ A ⊂ U and
max(P[U] − P[F], Q[U] − Q[F]) ≤ ϵ. Such F and U exist whenever P and Q are so-called regular
measures. Without going into details, we just notice that finite measures on Polish spaces are
automatically regular. Then by Urysohn’s lemma there exists a continuous function fϵ : X → [0, a]
equal to a on F and 0 on Uc . Then we have
aP[F] ≤ EP [fϵ ] ≤ aP[U]
i i
i i
i i
Subtracting aP[A] and exp{a}Q[A] for each of these inequalities, respectively, we see that taking
ϵ → 0 indeed results in a sequence of functions satisfying (4.14).
Pn
Similarly, if we want to approximate a general simple function g = i=1 ai 1Ai (with Ai disjoint
and |ai | ≤ amax < ∞) we fix ϵ > 0 and define functions fi,ϵ approximating ai 1Ai as above with
sets Fi ⊂ Ai ⊂ Ui , so that S ≜ ∪i (Ui \ Fi ) satisfies max(P[S], Q[S]) ≤ nϵ. We also have
X X
| fi,ϵ − g| ≤ amax 1Ui \Fi ≤ namax 1S .
i i
P
We then clearly have | EP [ i fi,ϵ ] − EP [g]| ≤ amax n2 ϵ. On the other hand, we also have
X X
exp{ai }Q[Fi ] ≤ EQ [exp{ fi,ϵ }]
i i
Remark 4.4 1 What is the Donsker-Varadhan representation useful for? By setting f(x) =
ϵ · g(x) with ϵ 1 and linearizing exp and log we can see that when D(PkQ) is small, expec-
tations under P can be approximated by expectations over Q (change of measure): EP [g(X)] ≈
EQ [g(X)]. This holds for all functions g with finite exponential moment under Q. Total variation
distance provides a similar bound, but for a narrower class of bounded functions:
2 More formally, the inequality EP [f(X)] ≤ log EQ [exp f(X)] + D(PkQ) is useful in estimating
EP [f(X)] for complicated distribution P (e.g. over high-dimensional X with weakly dependent
coordinates) by making a smart choice of Q (e.g. with iid components).
3 In Chapter 5 we will show that D(PkQ) is convex in P (in fact, in the pair). A general method
of obtaining variational formulas like (4.11) is via the Young-Fenchel duality, which we review
below in (7.84). Indeed, (4.11) is exactly that inequality since the Fenchel-Legendre conjugate
of D(·kQ) is given by a convex map f 7→ ψf . For more details, see Section 7.13.
4 Donsker-Varadhan should also be seen as an “improved version” of the DPI. For example, one
of the main applications of the DPI in this book is in obtaining estimates like
1
P[A] log ≤ D(PkQ) + log 2 , (4.15)
Q[ A ]
which is the basis of the large deviations theory (Corollary 2.19 and Chapter 15) and Fano’s
inequality (Theorem 3.12). The same estimate can be obtained by applying (4.11) with f(x) =
1 {x ∈ A} log Q[1A] .
i i
i i
i i
74
where the supremum is taken over all P with D(PkQ) < ∞. If the left-hand side is finite then the
unique maximizer of the right-hand side is P = Qf , a tilted version of Q defined in (4.13).
Proof. Let ψf ≜ log EQ [exp{f(X)}]. First, if ψf = −∞, then Q-a.s. f = −∞ and hence P-a.s. also
f = −∞, so that both sides of the equality are −∞. Next, assume −∞ < ψf < ∞. Then by
Donsker-Varadhan (4.11) we get
ψf ≥ EP [f(X)] − D(PkQ) .
dQf
On the other hand, setting P = Qf we obtain an equality. To show uniqueness, notice that Log dQ
1 =
f − ψf even when f = −∞. Thus, from (2.11) we get whenever D(PkQ) < ∞ that
EP [f(X) − ψf ] = D(PkQ) − D(PkQf ) .
From here we conclude that EP [f(X)] < ∞ and hence can rewrite the above as
EP [f(X)] − D(PkQ) = ψf − D(PkQf ) ,
which shows that EP [f(X)] − D(PkQ) = ψf implies P = Qf .
Next, suppose ψf = ∞. Let us define fn = f ∧ n, n ≥ 1. Since ψfn < ∞ we have by the previous
characterization that there is a sequence Pn such that D(Pn kQ) < ∞ and as n → ∞
EPn [f(X) ∧ n] − D(Pn kQ) = ψfn % ψf = ∞ .
Since EPn [f(X) ∧ n] ≤ EPn [f(X)], we have
EPn [f(X)] − D(Pn kQ) ≥ ψfn → ∞ ,
concluding the proof.
We now briefly explore how Proposition 4.7 has been applied over the last century. We start
with the example from statistical physics and graphical models. Here the key idea is to replace
sup over all distributions P with a subset that is easier to handle. This idea is the basis of much of
variational inference [447].
i i
i i
i i
Example 4.1 (Mean-field approximation for Ising model) Suppose that we have a
complicated model for a distribution of a vector X̃ ∈ {0, 1}n+m given by an Ising model
1
PX̃ (x̃) = exp{x̃⊤ Ãx̃ + b̃⊤ x̃} ,
Z̃
where à ∈ R(n+m)×(n+m) is a symmetric interaction matrix with zero diagonal and b̃ is a vector
of external fields and Z̃ is a normalization constant. We note that often à is very sparse with non-
zero entries occurring only those few variables xi and xj that are considered to be interacting (or
adjacent in some graph). We decompose the vector X̃ = (X, Y) into two components: the last
m coordinates are observables and the first n coordinates are hidden (latent), whose values we
want to infer; in other words, our goal is to evaluate PX|Y=y upon observing y. It is clear that this
conditional distribution is still an Ising model, so that
1
PX|Y (x|y) = exp{x⊤ Ax + b⊤ x} , x ∈ {0, 1}n
Z
where A is the n × n leading minor of à and b and Z depend on y. Unfortunately, computing even
a single value P[X1 = 1] is known to be generally computationally infeasible [394, 175], since
evaluating Z requires summation over 2n values of x.
Let us denote f(x) = x⊤ Ax + b⊤ x and by Q the uniform distribution on {0, 1}n . Applying
Proposition 4.7 we obtain
As we said, exact computation of log Z, though, is not tractable. An influential idea is to instead
Qn
search the maximizer in the class of product distributions PXn = i=1 Ber(pi ). In this case, this
supremization can be solved almost in closed-form:
X
sup p⊤ Ap + b⊤ p + h(pi ) ,
p
i
where p = (p1 , . . . , pn ). Since the objective function is strongly concave (Exercise I.37), we only
need to solve the first order optimality conditions (or mean-field equations), which is a set of n
non-linear equations in n variables:
X
n 1
pi = σ bi + 2 ai,j pj , σ(x) ≜ .
1 + exp(−x)
j=1
These are solved by iterative message-passing algorithms [447]. Once the values of pi are obtained,
the mean-field approximation is to take
Y
n
PX|Y=y ≈ Ber(pi ) .
i=1
i i
i i
i i
76
We stress that the mean-field idea is not only to approximate the value of Z, but also to consider
the corresponding maximizer (over a restricted class of product distributions) as the approximate
posterior distribution.
To get another flavor of examples, let us consider a more general setting, where we have some
(θ)
parametric collection of distributions PX,Y indexed by θ ∈ Rd . Often, the joint distribution is such
(θ) (θ) (θ) (θ)
that PX and PY|X are both “simple”, but the PY and PX|Y are “complex” or even intractable (e.g.
in sparse linear regression and community detection Section 30.3). As in the previous example, X
is the latent (unobserved) and Y is the observed variable.
For a moment we will omit writing θ and consider the problem of evaluating PY (y) – a quantity
(known as evidence) showing how extreme the observed y is. Note that
Although by assumption PX and PY|X are both easy to compute, this marginalization may be
intractable. As a workaround, we invoke Proposition 4.7 with f(x) = log PY|X (y|x) and Q = PX to
get
PX,Y (X, y)
log PY (y) = sup ER [f(X)] − D(RkPX ) = sup EX∼R log , (4.16)
R R R(X)
where R is an arbitrary distribution. Note that the right-hand side only involves a simple quantity
PX,Y and hence all the complexity of computation is moved to optimization over R. Expres-
sion (4.16) is known as evidence lower bound (ELBO) since for any fixed value of R we get a
provable lower bound on log PY (y). Typically, one optimizes the choice of R over some convenient
set of distributions to get the best (tightest) lower bound in that class.
One such application leads to the famous iterative (EM) algorithm, see (5.33) below. Another
application is a modern density estimation algorithm, which we describe next.
where vector μ(·; θ) and diagonal matrix D(·; θ) are deep neural networks with input (·) and
(θ)
weights θ. (See [245, App. C.2] for a detailed description.) The resulting distribution PY is a
(complicated) location-scale Gaussian mixture. To find the best density [245] aims to maximize
the likelihood by solving
X (θ)
max log PY (yi ) .
θ
i
i i
i i
i i
Since the marginalization to obtain PY is intractable, we replace the objective (by an appeal to
ELBO (4.16)) with
" #
X (θ)
pX,Y (X, yi )
max sup EX∼RX|Y=yi log , (4.17)
θ RX|Y rX|Y (X|yi )
i
where we denoted the PDFs of PX,Y and RX|Y by lower-case letters. Now in this form the algorithm
is simply the EM algorithm (as we discuss below in Section 5.6). What brings the idea to the 21st
century is restricting the optimization to the set of RX|Y which are again defined via
where μ̃(Y; ϕ) and diagonal covariance matrix D̃(Y; ϕ) are output by some neural network with
parameter ϕ. The conditional distribution under this auxiliary model (recognition model), denoted
(ϕ) (θ)
by RX|Y , is Gaussian. Since the ELBO (4.16) is achieved by the posterior PX|Y , what this amounts
to is to approximate the true posterior under the generative model by a Gaussian. Replacing also
i.i.d.
the expectation over RX|Y=yi with its empirical version (by generating Z̃ij ∼ N (0, Id′ ) we obtain
the following
XX (θ)
pX,Y (xi,j , yi )
max max log , xi,j = μ̃(yi ; ϕ) + D̃(yi ; ϕ)Z̃ij . (4.18)
(ϕ)
θ ϕ
i j rX|Y (xi,j |yi )
(θ) (ϕ)
Now plugging the Gaussian form of the densities pX , pY|X and rX|Y one gets an expression whose
gradients ∇θ and ∇ϕ can be easily computed by the automatic differentiation software.1 In
(ϕ) (θ)
fact, since rX|Y and rY|X are both Gaussian, we can use less Monte Carlo approximation than
(θ)
pX,Y (X,yi )
(4.18), because the objective in (4.17) equals EX∼RX|Y=yi log (ϕ) = −D(RX|Y=yi kPX ) +
rX|Y (X|yi )
h i
EX∼RX|Y=yi log pY|X (yi |X) , where PX = N(0, Id′ ), RX|Y=yi = N(μ̃(yi ; ϕ); D̃2 (yi ; ϕ)), PY|X=x =
(θ) (θ)
N( μ(x; θ); D2 (x; θ)) so that the first Gaussian KL divergence is in close form (Example 2.2) and
we only need to apply Monte Carlo approximation to the second term. For both versions, the
optimization proceeds by (stochastic) gradient ascent over θ and ϕ until convergence to some
(θ ∗ ) (ϕ∗ )
(θ∗ , ϕ∗ ). Then PY can be used to generate new samples from the learned distribution, RX|Y to
(θ ∗ )
map (“encode”) samples to the latent space and PY|X to “decode” a latent representation into a
target sample. We refer the readers to Chapters 3 and 4 in the survey [246] for other encoder and
decoder architectures.
1
An important part of the contribution of [245] is the “reparametrization trick”. Namely, since [452] a standard way to
compute ∇ϕ EQ(ϕ) [f] in machine learning is to write ∇ϕ EQ(ϕ) [f] = EQ(ϕ) [f(X)∇ϕ ln q(ϕ) (X)] and replace the latter
expectation by its empirical approximation. However, in this case a much better idea is to write
EQ(ϕ) [f] = EZ∼N [f(g(Z; ϕ))] for some explicit g and then move gradient inside the expectation before computing the
empirical version.
i i
i i
i i
78
Proposition 4.8 Let X be finite. Fix a distribution Q on X with Q(x) > 0 for all x ∈ X . Then
the map P 7→ D(PkQ) is continuous. In particular, P 7→ H(P) is continuous.
Warning: Divergence is never continuous in the pair, even for finite alphabets. For example,
as n → ∞, d( 1n k2−n ) 6→ 0.
X P(x)
D(PkQ) = P(x) log
x
Q ( x)
Our next goal is to study continuity properties of divergence for general alphabets. We start
with a negative observation.
Remark 4.5 In general, D(PkQ) is not continuous in either P or Q. For example, let X1 , . . . , Xn
Pn d
be iid and equally likely to be {±1}. Then by central limit theorem, Sn = √1n i=1 Xi −
→N (0, 1)
as n → ∞. But D(PSn kN (0, 1)) = ∞ for all n because Sn is discrete. Note that this is also an
example for strict inequality in (4.19).
On a general space if Pn → P and Qn → Q pointwise3 (i.e. Pn [E] → P[E] and Qn [E] → Q[E] for
every measurable E) then (4.19) also holds.
Proof. This simply follows from (4.12) since EPn [f] → EP [f] and EQn [exp{f}] → EQ [exp{f}] for
every f ∈ Cb .
2
Recall that sequence of random variables Xn converges in distribution to X if and only if their laws PXn converge weakly
to PX .
3
Pointwise convergence is weaker than convergence in total variation and stronger than weak convergence.
i i
i i
i i
D(PF kQF ) .
Our main results are continuity under monotone limits of σ -algebras. Recall that a sequence of
nested σ -algebras, F1 ⊂ F2 · · · , is said to Fn % F when F ≜ σ (∪n Fn ) is the smallest σ -
algebra containing ∪n Fn (the union of σ -algebras may fail to be a σ -algebra and hence needs
completion). Similarly, a sequence of nested σ -algebras, F1 ⊃ F2 · · · , is said to Fn & F if
F = ∩n Fn (intersection of σ -algebras is always a σ -algebra). We will show in this section that we
always have:
For establishing the first result, it will be convenient to extend the definition of the divergence
D(PF kQF ) to (a) any algebra of sets F and (b) two positive additive (not necessarily σ -additive)
set-functions P, Q on F . We do so following the Gelfand-Yaglom-Perez variational representation
of divergence (Theorem 4.5).
Definition 4.10 (KL divergence over an algebra) Let P and Q be two positive, addi-
tive (not necessarily σ -additive) set-functions defined over an algebra F of subsets of X (not
necessarily a σ -algebra). We define
X
n
P[Ei ]
D(PF kQF ) ≜ sup P[Ei ] log ,
{E1 ,...,En } i=1 Q[Ei ]
Sn
where the supremum is over all finite F -measurable partitions: j=1 Ej = X , Ej ∩ Ei = ∅, and
0 log 01 = 0 and log 10 = ∞ per our usual convention.
Note that when F is not a σ -algebra or P, Q are not σ -additive, we do not have Radon-Nikodym
theorem and thus our original definition of KL-divergence is not applicable.
i i
i i
i i
80
S
• Let F1 ⊆ F2 . . . be an increasing sequence of algebras and let F = n Fn be their limit, then
• If F is (P + Q)-dense in G then4
and, in particular,
Proof. The first two items are straightforward applications of the definition. The third follows
from the following fact: if F is dense in G then any G -measurable partition {E1 , . . . , En } can
be approximated by a F -measurable partition {E′1 , . . . , E′n } with (P + Q)[Ei 4E′i ] ≤ ϵ. Indeed,
first we set E′1 to be an element of F with (P + Q)(E1 4E′1 ) ≤ 2n ϵ
. Then, we set E′2 to be
an 2nϵ
-approximation of E2 \ E′1 , etc. Finally, E′n = (∪j≤1 E′j )c . By taking ϵ → 0 we obtain
P ′ P[E′i ] P P[ Ei ]
i P[Ei ] log Q[E′i ] → i P[Ei ] log Q[Ei ] .
The last statement follows from the previous one and the fact that any algebra F is μ-dense in
the σ -algebra σ{F} it generates for any bounded μ on (X , H) (cf. [142, Lemma III.7.1].)
Finally, we address the continuity under the decreasing σ -algebra, i.e. (4.21).
The condition D(PF0 kQF0 ) < ∞ can not be dropped, cf. the example after (4.32).
h i
Proof. Let X−n = dP
dQ . Since X−n = EQ dP
dQ Fn , we have that (. . . , X−1 , X0 ) is a uniformly
Fn
integrable martingale. By the martingale convergence theorem in reversed time, cf. [84, Theorem
5.4.17], we have almost surely
dP
X−n → X−∞ ≜ . (4.25)
dQ F
4
Recall that F is μ-dense in G if ∀E ∈ G, ϵ > 0∃E′ ∈ F s.t. μ[E∆E′ ] ≤ ϵ.
i i
i i
i i
where log+ x = max(log x, 0) and log− x = min(log x, 0). Since x log− x is bounded, we have
from the bounded convergence theorem:
To prove a similar convergence for log+ we need to notice two things. First, the function
x 7→ x log+ x
is convex. Second, for any non-negative convex function ϕ s.t. E[ϕ(X0 )] < ∞ the collection
{Zn = ϕ(E[X0 |Fn ]), n ≥ 0} is uniformly integrable. Indeed, we have from Jensen’s inequality
1 E[ϕ(X0 )]
P[Zn > c] ≤ E[ϕ(E[X0 |Fn ])] ≤
c c
and thus P[Zn > c] → 0 as c → ∞. Therefore, we have again by Jensen’s
Finally, since X−n log+ X−n is uniformly integrable, we have from (4.25)
Proposition 4.13 (a) If X and Y are both finite alphabets, then PX,Y 7→ I(X; Y) is continuous.
(b) If X is finite, then PX 7→ I(X; Y) is continuous.
(c) Without any assumptions on X and Y , let PX range over the convex hull Π = co(P1 , . . . , Pn ) =
Pn Pn
{ i=1 αi Pi : i=1 αi = 1, αi ≥ 0}. If I(Pj , PY|X ) < ∞ (using the notation I(PX , PY|X ) =
I(X; Y)) for all j ∈ [n], then the map PX 7→ I(X; Y) is continuous.
5
Here we only assume that topology on the space of measures is compatible with the linear structure, so that all linear
operations on measures are continuous.
i i
i i
i i
82
1
P
For the second statement, take QY = |X | PY|X=x . Note that
x∈X
" !#
X
D(PY kQY ) = EQY f PX (x)hx (Y) ,
x
dPY|X=x
where f(t) = t log t and hx (y) = are bounded by |X | and non-negative. Thus, from the
dQY (y)
bounded convergence theorem we have that PX 7→ D(PY kQY ) is continuous. The proof is complete
since by the golden formula
Further properties of mutual information follow from I(X; Y) = D(PX,Y kPX PY ) and correspond-
ing properties of divergence, e.g.
where Ȳ is a copy of Y, independent of X and supremum can be taken over any of the classes of
(bivariate) functions as in Theorem 4.6. Notice, however, that for mutual information we can
also get a stronger characterization:6
from which (4.26) follows by moving the outer expectation inside the log. Both of these can
be used to show that E[f(X, Y)] ≈ E[f(X, Ȳ)] as long as the dependence between X and Y (as
measured by I(X; Y)) is weak, cf. Exercise I.55.
d
2 If (Xn , Yn ) → (X, Y) converge in distribution, then
d
• Example of strict inequality: Xn = Yn = n1 Z. In this case (Xn , Yn ) → (0, 0) but I(Xn ; Yn ) =
H(Z) > 0 = I(0; 0).
• An even more impressive example: Let (Xp , Yp ) be uniformly distributed on the unit ℓp -ball
d
on the plane: {(x, y) ∈ R2 : |x|p + |y|p ≤ 1}. Then as p → 0, (Xp , Yp ) → (0, 0), but
I(Xp ; Yp ) → ∞. (See Ex. I.16)
6
Just apply Donsker-Varadhan to D(PY|X=x0 kPY ) and average over x0 ∼ PX .
i i
i i
i i
4.8* PAC-Bayes 83
This implies that the full amount of mutual information between two processes X∞ and Y∞
is contained in their finite-dimensional projections, leaving nothing in the tail σ -algebra. Note
also that applying the (finite-n) chain rule to (4.30) recovers (4.1).
5 (Monotone convergence II): Recall that for any random process (X1 , . . .) we define its tail σ -
T
algebra as Ftail = ∩n≥1 σ(X∞ n ). Let Xtail be a random variable such that σ(Xtail ) =
∞
n≥1 σ(Xn ).
Then
whenever the right-hand side is finite. This is a consequence of Proposition 4.12. Without the
i.i.d.
finiteness assumption the statement is incorrect. Indeed, consider Xj ∼ Ber(1/2) and Y = X∞ 0 .
∞
Then each I(Xn ; Y) = ∞, but Xtail = constant a.e. by Kolmogorov’s 0-1 law, and thus the
left-hand side of (4.32) is zero.
4.8* PAC-Bayes
A deep implication of Donsker-Varadhan and Gibbs principle is a method, historically known as
PAC-Bayes,8 for bounding suprema of empirical processes. Here we present the key idea together
with two applications: one in high-dimensional probability and the other in statistical learning
theory.
But first, let us agree that in this section ρ and π will denote distributions on Θ and we will
write Eρ [·] and Eπ [·] to mean integration over only the θ variable over the respective prior, i.e.
Z
Eρ [fθ (x)] ≜ Eθ∼ρ [fθ (x)] = fθ (x)ρ(dθ)
Θ
denotes a function of x. Similarly, EPX [fθ (X)] will denote expectation only over X ∼ PX . The
following estimate is a workhorse of PAC-Bayes method.
7
To prove this from (4.9) one needs to notice that algebra of measurable rectangles is dense in the product σ-algebra. See
[129, Sec. 2.2].
8
For probably approximately correct (PAC), as developed by Shawe-Taylor and Williamson [382], McAllester [298],
Mauer [297], Catoni [83] and many others, see [16] for a survey.
i i
i i
i i
84
Proof. We will prove the following result, known in this area as an exponential inequality:
EPX sup exp{Eρ [fθ (X) − ψ(θ)] − D(ρkπ )} ≤ 1 , (4.35)
ρ
where the supremum inside is taken over all ρ such that D(ρkπ ) < ∞. Indeed, from it (4.33)
follows via the Markov inequality. Notice that this supremum is taken over uncountably many
values and hence it is not a priori clear whether the function of X under the outer expectation is
even measurable. We will show the latter together with the exponential inequality.
To that end, we apply the Gibbs principle (Proposition 4.7) to the function θ 7→ fθ (X) − ψ(θ)
and base measure π. Notice that this function may take value −∞, but nevertheless we obtain
where the right-hand side is a measurable function of X. Exponentiating and taking expectation
over X we obtain
EPX sup exp{Eρ [fθ (X) − ψ(θ)] − D(ρkπ )} = EPX [Eπ [exp{fθ (X) − ψ(θ)}]] .
ρ
We claim that the right-hand side equals π [ψ(θ) < ∞] ≤ 1, which completes the proof of (4.35).
Indeed, let E = {θ : ψ(θ) < ∞}. Then for any θ ∈ E we have EPX [exp{fθ (X) − ψ(θ)}] = 1, or in
other words for all θ:
Finally, notice that 1{θ ∈ E} can be omitted since for θ ∈ Ec we have exp{fθ (X) − ∞} = 0 by
agreement.
To show (4.34), for each x take ρ = Pθ|X=x in (4.35) to get
Ex∼PX exp{E[fθ (X) − ψ(θ)|X = x] − D(Pθ|X=x kπ )} ≤ 1 .
By Jensen’s we can move outer expectation inside the exponent and obtain the right-most
inequality in (4.34). To get the bound in terms of I(θ; X) take π = Pθ and recall (4.5).
i i
i i
i i
4.8* PAC-Bayes 85
But what about typical value of kXk, i.e. can we show an upper bound on kXk that holds with high
probability? In order to see how PAC-Bayes could be useful here, notice that kXk = sup∥v∥=1 v⊤ X.
Thus, we aim to use (4.33) to bound this supremum. For any v let ρv = N (v, β 2 Id ) and notice
v⊤ X = Eρv [θ⊤ X]. We also take π = N (0, β 2 Id ) and fθ (x) = λθ⊤ x, θ ∈ Rd , where β, λ > 0
are parameters to be optimized later. Taking base of log to be e, we compute explicitly ψ(θ) =
1 2 ⊤ λ2 ⊤ ∥ v∥ 2
2 λ θ Σθ , Eρv [ψ(θ)] = 2 (v Σv + β tr Σ) and D(ρv k π ) = 2β 2 via (2.8). Thus, using (4.33)
2
restricted to ρv with kvk = 1 we obtain that with probability ≥ 1 − δ we have for all v with
kvk = 1
λ2 ⊤ 1 1
λ v⊤ X ≤ (v Σv + β 2 tr Σ) + 2 + ln .
2 2β δ
√
Now, we can optimize right-hand side over β by choosing β 2 = 1/ λ2 tr Σ to get
λ2 ⊤ √ 1
λ v⊤ X ≤ (v Σv) + λ tr Σ + ln .
2 δ
Finally, estimating v⊤ Σv ≤ kΣkop and optimizing λ we obtain the resulting high-probability
bound:
r
√ 1
kXk ≤ tr Σ + 2kΣkop ln .
δ
Although this result can be obtained using the standard technique of Chernoff bound (Section 15.1)
– see [270, Lemma 1] for a stronger version or [69, Example 5.7] for general norms based on sophis-
ticated Gaussian concentration inequalities, the advantages of the PAC-Bayes proof are that (a) it is
2
not specific to Gaussian and holds for any X such that ψ(θ) ≤ λ2 θ⊤ Σθ (similar to subgaussian ran-
dom variables introduced below); and (b) its extensions can be used to analyze the concentration
of sample covariance matrices, cf. [475].
To present further applications, we need to introduce a new concept.
i i
i i
i i
86
• N (0, σ 2 ) is σ 2 -subgaussian. In fact it satisfies the condition with equality and explains the origin
of the name.
• If X ∈ [a, b], then X is (b−4a) -subgaussian. This is the well-known Hoeffding’s lemma (see
2
There are many equivalent ways to define subgaussianity [438, Prop. 2.5.2], including by requiring
t2
tails of X to satisfy P[|X − E[X]| > t] ≤ 2e− 2σ2 . However, for us the most important property is the
consequence of the two observations above: empirical average of independent bounded random
variables are O(1/n)-subgaussian.
The concept of subgaussianity is used in PAC-Bayes method as follows. Suppose we have a
i.i.d.
collection of functions F from X to R and an iid sample Xn ∼ PX . One of the main questions of
empirical process theory and uniform convergence is to get a high-probability bound on
1X
n
sup E[f(X)] − Ên [f(X)] , Ên [f(X)] ≜ f(Xi ) .
f∈F n
i=1
Suppose that we know each f is taking values in [0, 1]. Then (E −Ên )f is 4n
1
-subgaussian and
applying PAC-Bayes inequality to functions λ(E −Ên )f(X) we get that with probability ≥ 1 − δ
for any ρ on F we have
λ 1 1
Ef∼ρ [(E −Ên )f(X)] ≤ + D(ρkπ ) + ln ,
8n λ δ
where π is a fixed prior. This method can be used to get interesting bounds for countably-infinite
collections F (see Exercise I.55 and I.56). However, the real power of this method shows when
F is uncountable (as in the previous example for Gaussian norms).
We remark that bounding the supremum of a random process (e.g., empirical or Gaussian pro-
cess) indexed by a continuous parameter is a vast subject [139, 429, 431]. The usual method is
based on discretization and approximation (with more advanced version known as chaining; see
(27.22) and Exercise V.28 for the counterpart of Gaussian processes). The PAC-Bayes inequality
offers an alternative which often allows for sharper results and shorter proofs. There are also appli-
cations to high-dimensional probability (the small-ball probability and random matrix theory), see
[317, 309]. In those works, PAC-Bayes is applied with π being the uniform distribution on a small
ball.
i i
i i
i i
4.8* PAC-Bayes 87
q
O(M Cn ), see [239]. In fact they show that any set {F(ρ) : G(ρ) ≤ C} satisfies this bound as long
as G is strongly convex. Recall that ρ 7→ D(ρkπ ) is indeed strongly convex by Exercise I.37.
So how does one select a good estimate θ̂ given training sample Xn ? Here are two famous options:
However, many other choices exist and are invented daily for specific problems.
The main issue that is to be addressed by theory is the following: the choice θ̂ is guided by
the sample Xn and thus the value L̂n (θ̂) is probably not representative of the value L(θ̂) that the
estimator will attain on a fresh sample Xnew because of overfitting to the training sample. To gauge
the amount of overfitting one seeks to prove an estimate of the form:
L(θ̂) ≤ L̂n (θ̂) + small error terms
i i
i i
i i
88
with high probability over the sample Xn . Note that here θ̂ can be either deterministic (as in ERM)
or a randomized (as in Gibbs) function of the training sample. In either case, it is convenient to
think about the estimator as a θ drawn from a data-dependent distribution ρ, in other words, a
channel from Xn to θ, so that we always understand L(θ̂) as Eρ [L(θ)] and L̂n (θ̂) as Eρ [L̂n (θ)]. Note
that in either case the values L(θ̂) and L̂n (θ̂) are random quantities depending on the sample Xn .
A specific kind of bounds we will show is going to be a high-probability bound that holds
uniformly over all data-dependent ρ, specifically
h i
P ∀ρ : Eθ∼ρ [L(θ)] ≤ Eθ∼ρ [L̂n (θ)] + excess risk(ρ) ≥ 1 − δ,
for some excess risk depending on (n, ρ, δ). We emphasize that here the probability is with respect
i.i.d.
to Xn ∼ P and the quantifier “∀ρ” is inside the probability. Having a uniform bound like this
suggests selecting that ρ which minimizes the right-hand side of the inequality, thus making the
second term serve as a regularization term preventing overfitting.
The main theorem of this section is the following version of the generalization error bound of
McAllester [298]. Many other similar bounds exist, for example see Exercise I.54.
Theorem 4.16 Fix a reference distribution π on Θ and suppose that for all θ, x the loss ℓθ (x) ∈
[0, 1]. Then for any δ ≤ e−1
s
5 D(ρkπ ) + ln δ1 1
P ∀ρ : Eθ∼ρ [L(θ)] ≤ Eθ∼ρ [L̂n (θ)] + +√ ≥ 1 − δ. (4.37)
4 2n 10n
The same result holds if (instead of being bounded) ℓθ (X) is 14 -subgaussian for each θ.
Before proving the theorem, let us consider a finite class Θ and argue that
" r #
1 M
P ∀ρ : Eθ∼ρ [L(θ)] ≤ Eθ∼ρ [L̂n (θ)] + ln ≥ 1 − δ, (4.38)
2n δ
Indeed, by linearity, it suffices to restrict to point mass distributions ρ = δθ . For each θ the ran-
dom variable L̂n (θ) − L(θ) is zero-mean and 4n 1
-subgaussian (Hoeffding’s lemma). Thus, applying
Markov’s inequality to eλ(L(θ)−L̂n (θ)) we have for any λ > 0 and t > 0:
λ2
P[L(θ) − L̂n (θ) ≥ t] ≤ e 8n −λt .
Thus, setting t so that the right-hand side equals Mδ from the union bound we obtain that with
probability ≥ 1 − δ simultaneously for all θ we have
λ 1 M
L(θ) − L̂n (θ) ≤ + ln .
8n λ δ
q
Optimizing λ = 8n ln Mδ yields (4.38). On the other hand, if we apply (4.37) with π = Unif(Θ)
and observe that D(ρkπ ) ≤ log M, we recover (4.38) with only a slightly worse estimate of excess
risk.
i i
i i
i i
4.8* PAC-Bayes 89
We can see that just like in the previous subsection, the core problem in showing Theorem 4.16
is that union bound only applies to finite Θ and we need to work around that problem by leveraging
the PAC-Bayes inequality.
Proof. First, we fix λ and apply PAC-Bayes inequality to functions fθ (Xn ) = λ(L(θ) − L̂n (θ)).
2
By Hoeffding’s lemma we know L̂n (θ) is 8n1
-subgaussian and thus ψ(θ) ≤ λ8n . Thus, we have with
probability ≥ 1 − δ simultaneously for all ρ:
λ D(ρkπ ) + ln δ1
Eρ [L(θ) − L̂n (θ)] ≤ + (4.39)
8n λ
Let us denote for convenience b(ρ) = D(ρkπ ) + ln δ1 . Since
p δ < e−1 , we see that b(ρ) ≥ 1. We
would like to optimize λ in (4.39) by setting λ = λ∗ ≜ 8nb √
(ρ). However, of course, λ cannot
depend on ρ. Thus, instead we select a countable grid λi = 2n2i , i ≥ 1 and apply PAC-Bayes
inequality separately for each λi with probability chosen to be δi = δ 2−i . Then from the union
bound we have for all ρ and all i ≥ 1 simultaneously:
λi b(ρ) + i ln 2
Eρ [L(θ) − L̂n (θ)] ≤ + .
8n λ
Let
√
i∗ = i∗p
(ρ) be chosen so that λ∗ (ρ) ≤ λi∗ < 2λ∗ (ρ). From the latter inequality we have
i∗
2n2 < 2 8b(ρ)n and thus
1 1
i∗ ln 2 < ln 4 + ln(b(ρ)) ≤ ln 4 − 1/2 + b(ρ) .
2 2
Therefore, choosing i = i∗ in the bound, upper-bounding λi∗ ≤ 2λ∗ and λ1i∗ ≤ 1
λ∗ we get that for
all ρ
r
b(ρ) 3 ln 4 − 1/2
Eρ [L(θ) − L̂n (θ)] ≤ (2 + ) + p .
8n 2 8nb(ρ)
Finally, bounding the last term by √1
10n
(since b ≥ 1), we obtain the theorem.
We remark that although we proved an optimized (over λ) bound, the intermediate result (4.39)
is also quite important. Indeed, it suggests choosing ρ (the randomized estimator) based on min-
imizing the regularized empirical risk Eρ L̂n (θ) + λ1 D(ρkπ ). The minimizing ρ is just the Gibbs
sampler (4.36) due to Proposition 4.7, which justifies its popularity.
PAC-Bayes bounds are often criticized on the following grounds. Suppose we take a neural
network and train it (perhaps by using a version of gradient descent) until it finds some choice of
weight matrices θ̂ that results in an acceptably low value of L̂n (θ̂). We would like now to apply
PAC-Bayes Theorem 4.16 to convince ourselves that also the test loss L(θ̂) would be small. But
notice that the weights of the neural network are non-random, that is ρ = δθ̂ and hence for any
continuous prior π we will have D(ρkπ ) = ∞, resulting in a vacuous bound. For a while this was
considered to be an unavoidable limitation, until an elegant work of [476]. There authors argue
that in the end weights of neural networks are stored as finite-bit approximations (floating point)
and we can use π (θ) = 2−length(θ) as a prior. Here length(θ) represents the total number of bits in
a compressed representation of θ. As we will learn in Part II this indeed defines a valid probability
i i
i i
i i
90
distribution (for any choice of the lossless compressor). In this way, the idea of [476] bridges
the area of data compression and generalization bounds: if the trained neural network has highly
compressible θ (e.g. has many zero weights) then it has smaller excess risk and thus is less prone
to overfitting.
Before closing this section, let us also apply the “in expectation” version of the PAC-
Bayes (4.34). Namely, again suppose that losses ℓθ (x) are in [0, 1] and suppose the learning
algorithm (given Xn ) selects ρ and then samples θ̂ ∼ ρ. This creates a joint distribution Pθ̂,Xn .
From (4.34), as in the preceding proof, for every λ > 0 we get
λ
E[L(θ̂) − L̂n (θ̂)] ≤ + λI(θ̂; Xn ) .
8n
Optimizing over λ we obtain the bound
r
1
E[L(θ̂) − L̂n (θ̂)] ≤ I(θ̂; Xn ) .
2n
This version of McAllester’s result [369, 461] provides a useful intuition: the algorithm’s propen-
sity to overfit can be gauged by the amount of information leaking from Xn into θ̂. For applications,
though, a version with a flexible reference prior π, as in Theorem 4.16, appears more convenient.
i i
i i
i i
• Information projection (or I-projection): Given Q minimize D(PkQ) over convex class of P’s.
(See Chapter 15.)
• Maximum likelihood: Given P minimize D(PkQ) over some class of Q. (See Section 29.3.)
• Rate-Distortion: Given PX minimize I(X; Y) over a convex class of PY|X . (See Chapter 26.)
• Capacity: Given PY|X maximize I(X; Y) over a convex class of PX . (This chapter.)
In this chapter we show that all these problems have convex/concave objective functions,
discuss iterative algorithms for solving them, and study the capacity problem in more detail.
Specifically, we will find that the supremum over input distributions PX can also be written as infi-
mum over the output distributions PY and the resulting minimax problem has a saddle point. This
will lead to understanding of capacity as information radius of a set of conditional distributions
{PY|X=x , x ∈ X } measured in KL divergence.
Remark 5.1 The proof shows that for an arbitrary measure of similarity D(PkQ), the con-
vexity of (P, Q) 7→ D(PkQ) is equivalent to “conditioning increases divergence” property of D.
Convexity can also be understood as “mixing decreases divergence”.
91
i i
i i
i i
92
Remark 5.2 (Strict and strong convexity) There are a number of alternative arguments
possible. For example, (p, q) 7→ p log pq is convex on R2+ , which is a manifestation of a general
phenomenon: for a convex f(·) the perspective function (p, q) 7→ qf pq is convex too. Yet another
way is to invoke the Donsker-Varadhan variational representation Theorem 4.6 and notice that
supremum of convex functions is convex. Our proof, however, allows us to immediately notice
that the map (P, Q) 7→ D(PkQ) is not strictly convex. Indeed, the gap in the DPI that we used
in the proof is equal to D(PX|Y kQX|Y |PY ), which can be zero. For example, this happens if P0 , Q0
have common support, which is disjoint from the common support of P1 , Q1 . At the same time
the map P 7→ D(PkQ), whose convexity was so crucial in the previous Chapter, turns out to not
only be strictly convex but in fact strongly convex with respect to total variation, cf. Exercise I.37.
This strong convexity is crucial for the analysis of mirror descent algorithm, which is a first-order
method for optimization over probability measures (see [40, Examples 9.10 and 5.27].)
Theorem 5.2 The map PX 7→ H(X) is concave. Furthermore, if PY|X is any channel, then
PX 7→ H(X|Y) is concave. If X is finite, then PX 7→ H(X|Y) is continuous.
Proof. For the special case of the first claim, when PX is on a finite alphabet, the proof is complete
by H(X) = log |X | − D(PX kUX ). More generally, we prove the second claim as follows. Let
f(PX ) = H(X|Y). Introduce a random variable U ∼ Ber(λ) and define the transformation
P0 U=0
PX|U =
P1 U=1
Consider the probability space U → X → Y. Then we have f(λP1 + (1 − λ)P0 ) = H(X|Y) and
λf(P1 ) + (1 − λ)f(P0 ) = H(X|Y, U). Since H(X|Y, U) ≤ H(X|Y), the proof is complete. Continuity
follows from Proposition 4.13.
Recall that I(X; Y) is a function of PX,Y , or equivalently, (PX , PY|X ). Denote I(PX , PY|X ) =
I(X; Y).
Proof. There are several ways to prove the first statement, all having their merits.
• First proof : Introduce θ ∈ Ber(λ). Define PX|θ=0 = P0X and PX|θ=1 = P1X . Then θ → X → Y.
Then PX = λ̄P0X + λP1X . I(X; Y) = I(X, θ; Y) = I(θ; Y) + I(X; Y|θ) ≥ I(X; Y|θ), which is our
desired I(λ̄P0X + λP1X , PY|X ) ≥ λ̄I(P0X , PY|X ) + λI(P0X , PY|X ).
• Second proof : I(X; Y) = minQ D(PY|X kQ|PX ), which is a pointwise minimum of affine functions
in PX and hence concave.
i i
i i
i i
• Third proof : Pick a Q and use the golden formula: I(X; Y) = D(PY|X kQ|PX ) − D(PY kQ), where
PX 7→ D(PY kQ) is convex, as the composition of the PX 7→ PY (affine) and PY 7→ D(PY kQ)
(convex).
The argument PY is a linear function of PY|X and thus the statement follows from convexity of D
in the pair.
Suppose we have a bivariate function f. Then we always have the minimax inequality:
inf sup f(x, y) ≥ sup inf f(x, y).
y x x y
1 It turns out minimax equality is implied by the existence of a saddle point (x∗ , y∗ ),
i.e.,
f ( x, y∗ ) ≤ f ( x∗ , y∗ ) ≤ f ( x∗ , y) ∀ x, y
Furthermore, minimax equality also implies existence of saddle point if inf and sup
are achieved c.f. [49, Section 2.6]) for all x, y [Straightforward to check. See proof
of corollary below].
2 There are a number of known criteria establishing
inf sup f(x, y) = sup inf f(x, y)
y x x y
i i
i i
i i
94
Theorem 5.4 (Saddle point) Let P be a convex set of distributions on X . Suppose there
exists P∗X ∈ P , called a capacity-achieving input distribution, such that
Let P∗Y = PY|X ◦ P∗X , called a capacity-achieving output distribution. Then for all PX ∈ P and for
all QY , we have
D(PY|X kP∗Y |PX ) ≤ D(PY|X kP∗Y |P∗X ) ≤ D(PY|X kQY |P∗X ). (5.1)
Proof. Right inequality in (5.1) follows from C = I(P∗X , PY|X ) = minQY D(PY|X kQY |P∗X ), where
the latter is (4.5).
The left inequality in (5.1) is trivial when C = ∞. So assume that C < ∞, and hence
I(PX , PY|X ) < ∞ for all PX ∈ P . Let PXλ = λPX + λP∗X ∈ P and PYλ = PY|X ◦ PXλ . Clearly,
PYλ = λPY + λP∗Y , where PY = PY|X ◦ PX .
We have the following chain then:
where inequality is by the right part of (5.1) (already shown). Thus, subtracting λ̄C and dividing
by λ we get
and the proof is completed by taking lim infλ→0 and applying the lower semincontinuity of
divergence (Theorem 4.9).
Corollary 5.5 In addition to the assumptions of Theorem 5.4, suppose C < ∞. Then the
capacity-achieving output distribution P∗Y is unique. It satisfies the property that for any PY induced
by some PX ∈ P (i.e. PY = PY|X ◦ PX ) we have
i i
i i
i i
Statement (5.2) follows from the left inequality in (5.1) and “conditioning increases divergence”
property in Theorem 2.16.
Remark 5.3 • The finiteness of C is necessary for Corollary 5.5 to hold. For a counterexample,
consider the identity channel Y = X, where X takes values on integers. Then any distribution
with infinite entropy is a capacity-achieving input (and output) distribution.
• Unlike the output distribution, capacity-achieving input distribution need not be unique. For
example, consider Y1 = X1 ⊕ Z1 and Y2 = X2 where Z1 ∼ Ber( 12 ) is independent of X1 . Then
maxPX1 X2 I(X1 , X2 ; Y1 , Y2 ) = log 2, achieved by PX1 X2 = Ber(p) × Ber( 21 ) for any p. Note that
the capacity-achieving output distribution is unique: P∗Y1 Y2 = Ber( 12 ) × Ber( 21 ).
Proof. This follows from the standard property of saddle points: Maximizing/minimizing the
leftmost/rightmost sides of (5.1) gives
min sup D(PY|X kQY |PX ) ≤ max D(PY|X kP∗Y |PX ) = D(PY|X kP∗Y |P∗X )
QY PX ∈P PX ∈P
but by definition min max ≥ max min. Note that we were careful to only use max and min for the
cases where we know the optimum is achievable.
i i
i i
i i
96
1 Radius (aka Chebyshev radius) of A: the radius of the smallest ball that covers A,
i.e.,
rad(A) = inf sup d(x, y). (5.3)
y∈X x∈A
2 Diameter of A:
diam(A) = sup d(x, y). (5.4)
x, y∈ A
Note that the radius and the diameter both measure the massiveness/richness of a
set.
3 From definition and triangle inequality we have
1
diam(A) ≤ rad(A) ≤ diam(A). (5.5)
2
The lower and upper bounds are achieved when A is, for example, a Euclidean ball
and the Hamming space, respectively.
4 In many special cases, the upper bound in (5.5) can be improved:
• A result of Bohnenblust [67] shows that in Rn equipped with any norm we always
have rad(A) ≤ n+n 1 diam(A).
q
• For Rn with Euclidean distance Jung proved rad(A) ≤ n
2(n+1) diam(A),
attained by simplex. The best constant is sometimes called the Jung constant
of the space.
• For Rn with ℓ∞ -norm the situation is even simpler: rad(A) = 21 diam(A); such
spaces are called centrable.
Corollary 5.7 For any finite X and any kernel PY|X , the maximal mutual information over all
distributions PX on X satisfies
i i
i i
i i
The last corollary gives a geometric interpretation to capacity: It equals the radius of the smallest
divergence-“ball” that encompasses all distributions {PY|X=x : x ∈ X }. Moreover, the optimal
center P∗Y is a convex combination of some PY|X=x and is equidistant to those.
The following is the information-theoretic version of “radius ≤ diameter” (in KL divergence)
for arbitrary input space (see Theorem 32.4 for a related representation):
can be (a) interpreted as a saddle point; (b) written in the minimax form; and (c) that the capacity-
achieving output distribution P∗Y is unique. This was all done under the extra assumption that the
supremum over PX is attainable. It turns out, properties b) and c) can be shown without that extra
assumption.
Theorem 5.9 (Kemperman [243]) For any PY|X and a convex set of distributions P such
that
C = sup I(PX , PY|X ) < ∞, (5.6)
PX ∈P
Furthermore,
C = sup min D(PY|X kQY |PX ) (5.8)
PX ∈P QY
= min sup D(PY|X kQY |PX ) (5.9)
QY PX ∈P
i i
i i
i i
98
Note that Condition (5.6) is automatically satisfied if there exists a QY such that
C= sup I ( X ; X + Z) . (5.12)
E[X]=0,E[X2 ]=P
PX :
E[X4 ]=s
Without the constraint E[X4 ] = s, the capacity is uniquely achieved at the input distribution PX =
N (0, P); see Theorem 5.11. When s 6= 3P2 , such PX is no longer feasible. However, for s > 3P2
the maximum
1
C= log(1 + P)
2
is still attainable. Indeed, we can add a small “bump” to the Gaussian distribution as follows:
where p → 0 and x → ∞ such that px2 → 0 but px4 → s − 3P2 > 0. This shows that for the
problem (5.12) with s > 3P2 , the capacity-achieving input distribution does not exist, but the
capacity-achieving output distribution P∗Y = N (0, 1 + P) exists and is unique as Theorem 5.9
shows.
Proof of Theorem 5.9. Let P′Xn be a sequence of input distributions achieving C, i.e.,
I(P′Xn , PY|X ) → C. Let Pn be the convex hull of {P′X1 , . . . , P′Xn }. Since Pn is a finite-dimensional
simplex, the (concave) function PX 7→ I(PX , PY|X ) is continuous (Proposition 4.13) and attains its
maximum at some point PXn ∈ Pn , i.e.,
where in (5.14) we applied Theorem 5.4 to (Pn+k , PYn+k ). The crucial idea is to apply comparison
of KL divergence (which is not a distance) with a true distance known as total variation defined
in (7.3) below. Such comparisons are going to be the topic of Chapter 7. Here we assume for
granted validity of Pinsker’s inequality (see Theorem 7.10). According to that inequality and since
In % C, we conclude that the sequence PYn is Cauchy in total variation:
i i
i i
i i
Since the space of all probability distributions on a fixed alphabet is complete in total variation,
the sequence must have a limit point PYn → P∗Y . Convergence in TV implies weak convergence,
and thus by taking a limit as k → ∞ in (5.15) and applying the lower semicontinuity of divergence
(Theorem 4.9) we get
and therefore, PYn → P∗Y in the (stronger) sense of D(PYn kP∗Y ) → 0. By Theorem 4.1,
To prove that (5.18) holds for arbitrary PX ∈ P , we may repeat the argument above with Pn
replaced by P̃n = conv({PX } ∪ Pn ), denoting the resulting sequences by P̃Xn , P̃Yn and the limit
point by P̃∗Y , and obtain:
where (5.20) follows from (5.18) since PXn ∈ P̃n . Hence taking limit as n → ∞ we have P̃∗Y = P∗Y
and therefore (5.18) holds.
To see the uniqueness of P∗Y , assuming there exists Q∗Y that fulfills C = supPX ∈P D(PY|X kQ∗Y |PX ),
we show Q∗Y = P∗Y . Indeed,
C ≥ D(PY|X kQ∗Y |PXn ) = D(PY|X kPYn |PXn ) + D(PYn kQ∗Y ) = In + D(PYn kQ∗Y ).
Since In → C, we have D(PYn kQ∗Y ) → 0. Since we have already shown that D(PYn kP∗Y ) → 0,
we conclude P∗Y = Q∗Y (this can be seen, for example, from Pinsker’s inequality and the triangle
inequality TV(P∗Y , Q∗Y ) ≤ TV(PYn , Q∗Y ) + TV(PYn , P∗Y ) → 0).
Finally, to see (5.9), note that by definition capacity as a max-min is at most the min-max, i.e.,
C = sup min D(PY|X kQY |PX ) ≤ min sup D(PY|X kQY |PX ) ≤ sup D(PY|X kP∗Y |PX ) = C
PX ∈P QY QY PX ∈P PX ∈P
and the optimizer Q∗X exists and is unique. If Q∗X ∈ P , then it is also the unique maximizer of H(X).
i i
i i
i i
100
in Corollary 5.10. Distributions of this form are known as Gibbs distributions for the energy func-
tion f. This bound is often tight and achieved by PX (n) = Z(λ∗ )−1 exp{−λ∗ f(n)} with λ∗ being
the minimizer, see Exercise III.27. (Note that Proposition 4.7 discusses Lagrangian version of the
same problem.)
1. “Gaussian capacity”:
1 σ2
C = I(Xg ; Xg + Ng ) = log 1 + X2
2 σN
2. “Gaussian input is the best for Gaussian noise”: For all X ⊥
⊥ Ng and Var X ≤ σX2 ,
I(Xg ; Xg + N) ≥ I(Xg ; Xg + Ng ),
d
with equality iff N=Ng and independent of Xg .
This result encodes extremality properties of the normal distribution: for the AWGN channel,
Gaussian input is the most favorable (attains the maximum mutual information, or capacity), while
for a general additive noise channel the least favorable noise is Gaussian. For a vector version of
the former statement see Exercise I.9.
i i
i i
i i
Proof. WLOG, assume all random variables have zero mean. Let Yg = Xg + Ng . Define
1 σ 2 log e x2 − σX2
f(x) ≜ D(PYg |Xg =x kPYg ) = D(N (x, σN2 )kN (0, σX2 + σN2 )) = log 1 + X2 +
2 σN 2 σX2 + σN2
| {z }
=C
3. Let Y = Xg + N and let PY|Xg be the respective kernel. Note that here we only assume that N is
uncorrelated with Xg , i.e., E [NXg ] = 0, not necessarily independent. Then
dPXg |Yg (Xg |Y)
I(Xg ; Xg + N) ≥ E log (5.23)
dPXg (Xg )
dPYg |Xg (Y|Xg )
= E log (5.24)
dPYg (Y)
log e h Y2 N2 i
=C+ E 2 2
− 2 (5.25)
2 σX + σN σN
log e σX 2 EN2
=C+ 1 − (5.26)
2 σX2 + σN2 σN2
≥ C, (5.27)
where
• (5.23): follows from (4.7),
dPX |Y dPY |X
• (5.24): dPgX g = dPgY g
g g
i i
i i
i i
102
Note that there is a steady improvement at each step (the value F(sk , tk ) is decreasing), so it
can be often proven that the algorithm converges to a local minimum, or even a global minimum
under appropriate conditions (e.g. the convexity of f). Below we discuss several applications of
this idea, and refer to [113] for proofs of convergence. In general, this class of iterative meth-
ods for maximizing and minimizing mutual information are called Blahut-Arimoto algorithms for
their original discoverers [24, 62]. Unlike gradient ascent/descent that proceeds by small (“local”)
changes of the decision variable, algorithms in this section move by large (“global”) jumps and
hence converge much faster.
The basis of all these algorithms is the Gibbs variational principle (Proposition 4.7): for
any function c : Y → R and any QY on Y , under the integrability condition Z =
R
QY (dy) exp{−c(y)} < ∞, the minimum
is attained at P∗Y (dy) = Z1 QY (dy) exp{−c(y)}. For simplicity below we mostly consider the case
of discrete alphabets X , Y .
Maximizing mutual information (capacity). We have a fixed PY|X and the optimization
problem
QX|Y
C = max I(X; Y) = max max EPX,Y log ,
PX PX QX|Y PX
i i
i i
i i
where in the second equality we invoked (4.7). This results in the iterations:
1
QX|Y (x|y) ← PX (x)PY|X (y|x)
Z(y)
( )
1 X
′
PX (x) ← Q (x) ≜ exp PY|X (y|x) log QX|Y (x|y) ,
Z y
where Z(y) and Z are normalization constants. To derive this, notice that for a fixed PX the optimal
QX|Y = PX|Y . For a fixed QX|Y , we can see that
QX|Y
EPX,Y log = log Z − D(PX kQ′ ) ,
PX
and thus the optimal PX = Q′ .
Denoting Pn to be the value of PX at the nth iteration, we observe that
This is useful since at every iteration not only we get an estimate of the optimizer Pn , but also the
gap to optimality C − I(Pn , PY|X ) ≤ C − RHS. It can be shown, furthermore, that both RHS and
LHS in (5.29) monotonically converge to C as n → ∞, see [113] for details.
R = min I(X; Y) + E[d(X, Y)] = min D(PY|X kQY |PX ) + E[d(X, Y)] , (5.30)
P Y| X PY|X ,QY
where in the second equality we invoked (4.5). This minimization problem is the basis of lossy
compression and will be discussed extensively in Part V. Using (5.28) we derive the iterations:
1
PY|X (y|x) ← QY (y) exp{−d(x, y)}
Z ( x)
QY ← PY|X ◦ PX .
A sandwich bound similar to (5.29) holds here, see (5.32), so that one gets two computable
sequences converging to R from above and below, as well as PY|X converging to the argmin
in (5.30).
i i
i i
i i
104
where QX|Y is a given channel. This is a problem arising in the maximum likelihood estimation for
Pn
mixture models where QY is the unknown mixing distribution and PX = 1n i=1 δxi is the empirical
distribution of the sample (x1 , . . . , xn ). To derive an iterative algorithm for (5.31), we write
min D(PX kQX ) = min min D(PX,Y kQX,Y ) .
QY QY PY|X
dQ ( x| y)
(Note that taking d(x, y) = − log dPX|XY (x) shows that this problem is equivalent to (5.30).) By the
chain rule, thus, we find the iterations
1
PY|X ← QY (y)QX|Y (x|y)
Z ( x)
QY ← PY|X ◦ PX .
(n) (n) (n)
Denote by QY the value of QY at the nth iteration and QX = QX|Y ◦ QY . Notice that for any n
and all QY we have from Jensen’s inequality,
" #
( n) dQX|Y (n)
D(PX kQX ) − D(PX kQX ) = EX∼PX log EY∼QY (n)
≤ gap(QX ) ,
dQX
dQX|Y=y
where gap(QX ) ≜ log esssupy EX∼PX [ dQX ]. In all, we get the following sandwich bound:
(n) (n) (n)
D(PX kQX ) − gap(QX ) ≤ L ≤ D(PX kQX ) , (5.32)
and it can be shown that as n → ∞ both sides converge to L, see e.g. [112, Theorem 5.3].
EM algorithm (general case). The EM algorithm is also applicable more broadly than (5.31),
(θ) (θ)
in which the quantity QX|Y is fixed. In general, we consider the model where both QY and QX|Y
depend on the unknown parameter θ and the goal (see Section 29.3) is to maximize the total log
Pn (θ)
likelihood i=1 log QX (xi ) over θ. A canonical example (which was one of the original motiva-
(θ) Pk
tions for the EM algorithm) is the k-component Gaussian mixture QX = j=1 wj N ( μj , 1); in
other words, QY = (w1 , . . . , wk ), QX|Y=j = N ( μj , 1) and θ = (w1 , . . . , wk , μ1 , . . . , μk ). If the cen-
ters μj ’s are known and only the weights wj ’s are to be estimated, then we get the simple convex
case in (5.31). Otherwise, we need to jointly optimize the log-likelihood over the centers and the
weights, which is a non-convex problem.
Here, one way to approach the problem is to apply the ELBO (4.16) as follows:
" (θ)
#
(θ) QX,Y (xi , Y)
log QX (xi ) = sup EY∼P log .
P P(Y)
Thus the maximum likelihood can be written as a double maximization problem
X (θ)
sup log QX (xi ) = sup sup F(θ, PY|X ) ,
θ i θ PY|X
where
" #
X (θ)
QX,Y (xi , Y)
F(θ, PY|X ) = EY∼PY|X=xi log .
PY|X (Y|xi )
i
i i
i i
i i
Thus, the iterative algorithm is to start with some θ and update according to
(θ)
PY|X ← QY|X E-step
θ ← argmax F(θ, PY|X ) M-step. (5.33)
θ
In general, if the log likelihood function is non-convex in θ, EM iterations may not converge to
the global optimum even with infinite sample size (see [234] for an example for 3-component
(θ)
Gaussian mixtures). Furthermore, for certain problems in the E-step QY|X may be intractable to
compute. In those cases one performs approximate version of EM where the step maxPY|X F(θ, PY|X )
is solved over a restricted class of distributions, cf. Examples 4.1 and 4.2.
Sinkhorn’s algorithm. This algorithm [388] is very similar, but not exactly the same as the ones
above. We fix QX,Y , two marginals VX , VY and solve the problem
S = min{D(PX,Y kQX,Y ) : PX = VX , PY = VY )} .
From the results of Chapter 15 (see Theorem 15.16 and Example 15.2) it is clear that the optimal
distribution PX,Y is given by
P∗X,Y = A(x)QX,Y (x, y)B(y) ,
for some A, B ≥ 0. In order to find functions A, B we notice that under a fixed B the value of A
that makes PX = VX is given by
VX (x)QX,Y (x, y)B(y)
A ( x) ← P .
y QX,Y (x, y)B(y)
i i
i i
i i
In this chapter we start with explaining the important property of mutual information known
as tensorization (or single-letterization ), which allows one to maximize and minimize mutual
information between two high-dimensional vectors. Next, we extend the information measures
discussed in previous chapters for random variables to random processes by introducing the con-
cepts of entropy rate (for a stochastic process) and mutual information rate (for a pair of stochastic
processes). For the former, it is shown that two stochastic processes that can be coupled well
(i.e., have small Ornstein’s distance) have close entropy rates – a fact to be used later in the
discussion of ergodicity (see Section 12.5*). For the latter we give a simple expression for the
information rate between a pair of stationary Gaussian processes in terms of their joint spectral
density. This expression will be crucial much later, when we study Gaussian channels with colored
noise (Section 20.6*).
106
i i
i i
i i
Q
with equality iff PXn |Y = PXi |Y PY -almost surely.1 Consequently,
X
n
min I(Xn ; Yn ) = min I(Xi ; Yi ).
PYn |Xn P Yi | X i
i=1
P Q Q
Proof. (1) Use I(Xn ; Yn ) − I(Xi ; Yi ) = D(PYn |Xn k PYi |Xi |PXn ) − D(PYn k PYi )
P Q Q
(2) Reverse the role of X and Y: I(Xn ; Y) − I(Xi ; Y) = D(PXn |Y k PXi |Y |PY ) − D(PXn k PXi )
1 For a product channel, the input maximizing the mutual information is a product distribution.
2 For a product source, the channel minimizing the mutual information is a product channel.
Example 6.1 Let us complement Theorem 6.1 with the following examples.
1 ∏n
That is, if PXn ,Y = PY i=1 PXi |Y as joint distributions.
i i
i i
i i
108
We start with the maximization of mutual information (capacity) question. In the notation
of Theorem 5.11 we know that (for Z ∼ N (0, 1))
1
max I(X; X + Z) = log 1 + σX2 .
PX :E[X2 ]≤σX2 2
Note that from tensorization we also immediately get (for Zn ∼ N (0, In ))
n
max I(Xn ; Xn + Zn ) = log 1 + σX2 .
PXn :E[∥X ∥ ]≤nσX
n 2 2 2
Thus, the traditional way of solving n-dimensional problems is to solve a 1-dimensional version
by explicit (typically calculus of variations) computation and then apply tensorization. However,
it turns out that sometimes directly solving the n-dimensional problem is magically easier and that
is what we want to show in this section.
So, suppose that we are trying to directly solve
$$\max_{\mathbb{E}[\sum_k X_k^2]\le n\sigma_X^2} I(X^n; X^n+Z^n)$$
over the joint distribution $P_{X^n}$. By the tensorization property in Theorem 6.1(a) we get
$$\max_{\mathbb{E}[\sum_k X_k^2]\le n\sigma_X^2} I(X^n; X^n+Z^n) = \max_{\mathbb{E}[\sum_k X_k^2]\le n\sigma_X^2} \sum_{k=1}^n I(X_k; X_k+Z_k).$$
Given distributions $P_{X_1},\dots,P_{X_n}$ satisfying the constraint, form the "average of marginals" distribution $\bar P_X = \frac1n\sum_{k=1}^n P_{X_k}$, which also satisfies the single-letter constraint $\mathbb{E}[X^2] = \frac1n\sum_{k=1}^n\mathbb{E}[X_k^2]\le\sigma_X^2$. Then from the concavity in $P_X$ of $I(P_X, P_{Y|X})$,
$$I(\bar P_X, P_{Y|X}) \ge \frac1n\sum_{k=1}^n I(P_{X_k}, P_{Y|X}).$$
So $\bar P_X$ gives the same or better mutual information, which shows that the extremization above grows linearly with $n$.
Similarly to the “average of marginals” argument above, averaging over all orthogonal rotations U
of Xn can only make the mutual information larger. Therefore, the optimal input distribution PXn
can be chosen to be invariant under orthogonal transformations. Consequently, by Theorem 5.9,
the (unique!) capacity achieving output distribution P∗Yn must be rotationally invariant. Further-
more, from the conditions for equality in (6.1) we conclude that P∗Yn must have independent
components. Since the only product distribution satisfying the power constraints and having rota-
tional symmetry is an isotropic Gaussian, we conclude that PYn = (P∗Y )⊗n and P∗Y = N (0, 1 + σX2 ).
In turn, the only distribution PX such that PX+Z = P∗Y is PX = N (0, σX2 ) (this can be argued by
considering characteristic functions).
The last part of Theorem 5.11 can also be handled similarly. That is, we can show that the
minimizer in
$$\min_{P_N:\ \mathbb{E}[N^2]=1} I(X_G; X_G+N)$$
is necessarily Gaussian by going to a multidimensional problem and averaging over all orthogonal
rotations.
The idea of “going up in dimension” (i.e. solving an n = 1 problem by going to an n > 1
problem first) as presented here is from [333] and only re-derives something that we have already
shown directly in Theorem 5.11. But the idea can also be employed for solving various non-convex
differential entropy maximization problems, cf. [184].
A sufficient condition for the entropy rate to exist is stationarity, which essentially means invariance with respect to time shifts. Formally, $\mathbf{X}$ is stationary if $(X_{t_1},\ldots,X_{t_n}) \stackrel{d}{=} (X_{t_1+k},\ldots,X_{t_n+k})$ for any $t_1,\ldots,t_n,k\in\mathbb{N}$. This definition naturally extends to two-sided processes indexed by $\mathbb{Z}$.
Proof.
(a) Further conditioning + stationarity: $H(X_n|X^{n-1}) \le H(X_n|X_2^{n-1}) = H(X_{n-1}|X^{n-2})$.
(b) Using the chain rule: $\frac1n H(X^n) = \frac1n\sum H(X_i|X^{i-1}) \ge H(X_n|X^{n-1})$.
(c) $H(X^n) = H(X^{n-1}) + H(X_n|X^{n-1}) \le H(X^{n-1}) + \frac1n H(X^n)$.
(d) $n\mapsto\frac1n H(X^n)$ is a decreasing sequence and lower bounded by zero, hence has a limit $H(\mathbf{X})$. Moreover, by the chain rule, $\frac1n H(X^n) = \frac1n\sum_{i=1}^n H(X_i|X^{i-1})$. From here we claim that $H(X_n|X^{n-1})$ converges to the same limit $H(\mathbf{X})$. Indeed, from the monotonicity shown in part (a), $\lim_n H(X_n|X^{n-1}) = H'$ exists. Next, recall the following fact from calculus: if $a_n\to a$, then the Cesàro mean $\frac1n\sum_{i=1}^n a_i\to a$ as well. Thus, $H' = H(\mathbf{X})$.
(e) Assuming $H(X_1)<\infty$ we have from (4.30):
$$\lim_{n\to\infty} H(X_1) - H(X_1|X^0_{-n}) = \lim_{n\to\infty} I(X_1; X^0_{-n}) = I(X_1; X^0_{-\infty}) = H(X_1) - H(X_1|X^0_{-\infty}).$$
Example 6.2 (Stationary processes) Let us discuss some of the most standard examples
of stationary processes.
See Exercise I.31 for an example. This kind of process is what is called a first-order Markov process, since $X_n$ depends only on $X_{n-1}$. There is an extension of that idea, where a $k$-th order Markov process is defined by a kernel $P_{X_n|X^{n-1}_{n-k}}$. Shannon classically suggested that such a process is a good model for natural language (with sufficiently large $k$), and recent breakthroughs in large language models [320] have largely verified his vision.
Note that both of our characterizations of the entropy rate converge to the limit from above, and thus evaluating $H(X_n|X^{n-1})$ or $\frac1n H(X^n)$ for arbitrarily large $n$ does not give any guarantees on the true value of $H(\mathbf{X})$ beyond an upper bound (in particular, we cannot even rule out $H(\mathbf{X}) = 0$). However, for a certain class of stationary processes, widely used in speech and language modeling, we can have a sandwich bound.
Definition 6.4 (Hidden Markov model (HMM)) Given a stationary Markov chain
. . . , S−1 , S0 , S1 , . . . on state space S and a Markov kernel PX|S : S → X , we define HMM as
a stationary process $\ldots, X_{-1}, X_0, X_1, \ldots$ as follows. First a trajectory $S^\infty_{-\infty}$ is generated. Then,
conditionally on it, we generate each Xi ∼ PX|S=Si independently. In other words, X is just S but
observed over a stationary memoryless channel PX|S (called the emission channel).
One of the fundamental results in this area is due to Blackwell [60], who showed that the $\mathcal{P}(\mathcal{S})$-valued belief process $R_n = (R_{s,n}, s\in\mathcal{S})$ given by $R_{s,n}\triangleq\mathbb{P}[S_n = s\,|\,X^{n-1}_{-\infty}]$ is in fact a stationary first-order Markov process. The common law $\mu$ of $R_n$ (independent of $n$) is called the Blackwell measure. Although finding $\mu$ is very difficult even for the simplest processes (see the example below), we do have the following representation of the entropy rate in terms of $\mu$:
$$H(\mathbf{X}) = \int_{\mathcal{P}(\mathcal{S})}\mu(dr)\,\mathbb{E}_{s\sim r}[H(P_{X|S=s})].$$
That is, the entropy rate is an integral of the simple function $r\mapsto\sum_s r_s H(P_{X|S=s})$ over $\mu$.
Example 6.3 (Gilbert-Elliott HMM [187, 151]) This is an HMM with binary states and binary emissions. Let $\mathcal{S} = \{0,1\}$ and $\mathbb{P}[S_1\ne S_0|S_0] = \tau$, i.e. the transition matrix of the $S$-process is $\begin{pmatrix}1-\tau & \tau\\ \tau & 1-\tau\end{pmatrix}$. Set $X_i = \mathrm{BSC}_\delta(S_i)$. In this case the Blackwell measure $\mu$ is supported on $[\tau, 1-\tau]$ and is the law of the random variable $\mathbb{P}[S_1 = 1|X^0_{-\infty}]$, and the entropy rate can be expressed in terms of the binary entropy function $h$:
$$H(\mathbf{X}) = \int_0^1 \mu(dx)\, h(\delta\bar x + \bar\delta x),$$
where we remind $\bar x = 1-x$ etc. In fact, we can express integration over $\mu$ in terms of the limit $\int f\,d\mu = \lim_{n\to\infty} K^n f(1/2)$, where $K$ is the transition kernel of the belief process, which acts on functions $g:[0,1]\to\mathbb{R}$ as
$$Kg(x) = p(x)\, g\!\left(\frac{x\bar\tau\bar\delta + \bar x\tau\delta}{p(x)}\right) + \bar p(x)\, g\!\left(\frac{x\bar\tau\delta + \bar x\tau\bar\delta}{\bar p(x)}\right), \qquad p(x) = 1-\bar p(x) = \delta\bar x + \bar\delta x.$$
We can see that the belief process follows simple fractional-linear updates, but nevertheless the stationary measure $\mu$ is extremely complicated and can be either absolutely continuous or singular (fractal-like) [33, 32]. As such, understanding $H(\mathbf{X})$ as a function of $(\tau,\delta)$ is a major open problem in this area. We remark, however, that if instead of the BSC we used $X = \mathrm{BEC}_\delta(S)$ then the resulting entropy rate is much easier to compute, see Exercise I.32.
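A minimal numerical sketch of the limit $\lim_n K^n f(1/2)$ for the kernel above (entropy in bits; the particular values of $\tau$, $\delta$ and the number of iterations are illustrative assumptions): we push the point mass at $1/2$ through the two-branch kernel $n$ times and integrate $f(x) = h(\delta\bar x + \bar\delta x)$ against the resulting weighted support.

```python
import numpy as np

def binary_entropy(p):
    """Binary entropy h(p) in bits, with h(0)=h(1)=0."""
    p = np.clip(p, 1e-300, 1 - 1e-16)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def entropy_rate_gilbert_elliott(tau, delta, n_iter=18):
    """Approximate H(X) = lim_n K^n f(1/2) for the Gilbert-Elliott HMM."""
    pts = np.array([0.5])      # support points of delta_{1/2} pushed through K^n
    wts = np.array([1.0])      # their weights
    for _ in range(n_iter):
        p = delta * (1 - pts) + (1 - delta) * pts
        u1 = (pts * (1 - tau) * (1 - delta) + (1 - pts) * tau * delta) / p
        u2 = (pts * (1 - tau) * delta + (1 - pts) * tau * (1 - delta)) / (1 - p)
        pts = np.concatenate([u1, u2])
        wts = np.concatenate([wts * p, wts * (1 - p)])
    f = binary_entropy(delta * (1 - pts) + (1 - delta) * pts)
    return float(np.sum(wts * f))

print(entropy_rate_gilbert_elliott(tau=0.1, delta=0.2))   # entropy rate in bits
```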
Despite these complications, the entropy rate of an HMM has a nice property: it can be tightly sandwiched between a monotonically increasing and a monotonically decreasing sequence. As we remarked above, such a sandwich bound is not possible for general stationary processes.
Proof. The part about the upper bound we have already established. To show the lower bound,
notice that
$$H(\mathbf{X}) = H(X_n|X^{n-1}_{-\infty}) \ge H(X_n|X^{n-1}_{-\infty}, S_0) = H(X_n|X^{n-1}_1, S_0),$$
where in the last step we used the Markov property $X^0_{-\infty}\to S_0\to X^n_1$. Next, we show that $H(X_n|X^{n-1}_1, S_0)$ is increasing in $n$. Indeed,
$$H(X_{n+1}|X^n_1, S_0) = H(X_n|X^{n-1}_0, S_{-1}) \ge H(X_n|X^{n-1}_0, S_{-1}, S_0) = H(X_n|X^{n-1}_1, S_0),$$
where the first equality is by stationarity, the inequality is by adding conditioning (Theorem 1.4) and the last equality is due to the Markov property $(S_{-1}, X_0)\to S_0\to X^n_1$.
Finally, we show that
$$I(S_0; X^\infty_1) = \lim_{n\to\infty} I(S_0; X^n_1) \le H(S_0) < \infty,$$
and thus $I(S_0; X^n_1) - I(S_0; X^{n-1}_1)\to 0$. But we also have by the chain rule
$$I(S_0; X^n_1) - I(S_0; X^{n-1}_1) = I(S_0; X_n|X^{n-1}_1) = H(X_n|X^{n-1}_1) - H(X_n|X^{n-1}_1, S_0)\to 0.$$
Thus, we can see that the difference between the two sides of (6.4) vanishes with n.
$$\frac1n\sum_{j=1}^n \mathbb{P}[X_j\ne Y_j]\le\epsilon. \qquad (6.5)$$
(The minimal such ϵ over all possible couplings is called Ornstein’s distance between stochastic
processes.) For binary alphabet this quantity is known as the bit error rate, which is one of the
performance metrics we consider for reliable data transmission in Part IV (see Section 17.1 and
Section 19.6). Notice that if we define the Hamming distance as
$$d_H(x^n, y^n)\triangleq\sum_{j=1}^n 1\{x_j\ne y_j\} \qquad (6.6)$$
$$\delta = \frac1n\mathbb{E}[d_H(X^n, Y^n)] = \frac1n\sum_{j=1}^n\mathbb{P}[X_j\ne Y_j].$$
Proof. For each j ∈ [n], applying (3.18) to the Markov chain Xj → Yn → Yj yields
where we denoted M = |X |. Then, upper-bounding joint entropy by the sum of marginals, cf. (1.3),
and combining with (6.8), we get
$$H(X^n|Y^n) \le \sum_{j=1}^n H(X_j|Y^n) \qquad (6.9)$$
$$\le \sum_{j=1}^n F_M(\mathbb{P}[X_j = Y_j]) \qquad (6.10)$$
$$\le n F_M\!\left(\frac1n\sum_{j=1}^n\mathbb{P}[X_j = Y_j]\right) \qquad (6.11)$$
where in the last step we used the concavity of $F_M$ and Jensen's inequality.
Corollary 6.7 Consider two processes $\mathbf{X}$ and $\mathbf{Y}$ with entropy rates $H(\mathbf{X})$ and $H(\mathbf{Y})$. If $\mathbb{P}[X_j\ne Y_j]\le\epsilon$, then
$$H(\mathbf{X}) - H(\mathbf{Y}) \le F_M(1-\epsilon).$$
and apply (6.7). For the last statement just recall the expression for FM .
We provide an example in the context of Gaussian processes which will be useful in studying
Gaussian channels with correlated noise (Section 20.6*).
Example 6.4 (Gaussian processes) Consider X, N two stationary Gaussian processes,
independent of each other. Assume that their autocovariance functions are absolutely summable
and thus there exist continuous power spectral density functions fX and fN . Without loss of gener-
ality, assume all means are zero. Let $c_X(k) = \mathbb{E}[X_1 X_{k+1}]$. Then $f_X$ is the Fourier transform of the autocovariance function $c_X$, i.e., $f_X(\omega) = \sum_{k=-\infty}^{\infty} c_X(k) e^{i\omega k}$, $|\omega|\le\pi$. Finally, assume $f_N\ge\delta>0$.
Then recall from Example 3.5:
$$I(X^n; X^n+N^n) = \frac12\log\frac{\det(\Sigma_{X^n}+\Sigma_{N^n})}{\det\Sigma_{N^n}} = \frac12\sum_{i=1}^n\log\sigma_i - \frac12\sum_{i=1}^n\log\lambda_i,$$
where σj , λj are the eigenvalues of the covariance matrices ΣYn = ΣXn + ΣNn and ΣNn , which are
all Toeplitz matrices, e.g., (ΣXn )ij = E [Xi Xj ] = cX (i − j). By Szegö’s theorem [199, Sec. 5.2]:
$$\frac1n\sum_{i=1}^n\log\sigma_i \to \frac1{2\pi}\int_{-\pi}^{\pi}\log f_Y(\omega)\,d\omega \qquad (6.12)$$
Note that cY (k) = E [(X1 + N1 )(Xk+1 + Nk+1 )] = cX (k) + cN (k) and hence fY = fX + fN . Thus, we
have
$$\frac1n I(X^n; X^n+N^n) \to I(\mathbf{X}; \mathbf{X}+\mathbf{N}) = \frac1{4\pi}\int_{-\pi}^{\pi}\log\frac{f_X(\omega)+f_N(\omega)}{f_N(\omega)}\,d\omega.$$
Maximizing this over fX subject to a moment constraint leads to the famous water-filling solution
f∗X (ω) = |T − fN (ω)|+ – see Theorem 20.18.
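A quick numerical sketch of this limit (in nats): we compare the finite-$n$ Toeplitz-determinant formula for $\frac1n I(X^n;X^n+N^n)$ with the spectral integral above. The geometric autocovariances used here are illustrative assumptions only.

```python
import numpy as np

def autocov(rho, var, kmax):
    return var * rho ** np.abs(np.arange(-kmax, kmax + 1))

def spectral_density(c, omegas):
    k = np.arange(-(len(c) // 2), len(c) // 2 + 1)
    return np.real(np.sum(c[None, :] * np.exp(1j * np.outer(omegas, k)), axis=1))

n, kmax = 200, 400
cX = autocov(0.7, 1.0, kmax)           # hypothetical X autocovariance
cN = autocov(0.3, 0.5, kmax)           # hypothetical noise autocovariance
idx = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
SX, SN = cX[kmax + idx], cN[kmax + idx]   # Toeplitz covariance matrices

rate_n = 0.5 * (np.linalg.slogdet(SX + SN)[1] - np.linalg.slogdet(SN)[1]) / n

omegas = np.linspace(-np.pi, np.pi, 4001)
fX, fN = spectral_density(cX, omegas), spectral_density(cN, omegas)
rate_limit = np.mean(np.log((fX + fN) / fN)) / 2   # = (1/4pi) * integral over [-pi,pi]

print(rate_n, rate_limit)   # the two values should be close for large n
```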
7 f-divergences
In Chapter 2 we introduced the KL divergence that measures dissimilarity between two dis-
tributions. This turns out to be a special case of a whole family of such measures, known as
f-divergences, introduced by Csiszár [109]. Like the KL divergence, f-divergences satisfy a number of useful properties.
The purpose of this chapter is to establish these properties and prepare the ground for appli-
cations in subsequent chapters. The important highlight is a joint range Theorem of Harremoës
and Vajda [214], which gives the sharpest possible comparison inequality between arbitrary f-
divergences and puts an end to a long sequence of results starting from Pinsker’s inequality –
Theorem 7.10. This material is mandatory only for those interested in "non-classical" applications of information theory, such as the ones we will explore in Part VI. Others can skim through this chapter and refer back to it as needed.
where $\frac{dP}{dQ}$ is a Radon-Nikodym derivative and $f(0)\triangleq f(0+)$. More generally, let $f'(\infty)\triangleq\lim_{x\downarrow0} xf(1/x)$. Suppose that $Q(dx) = q(x)\mu(dx)$ and $P(dx) = p(x)\mu(dx)$ for some common dominating measure $\mu$. Then
$$D_f(P\|Q) = \int_{\{q>0\}} q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)\mu(dx) + f'(\infty)\, P[q=0], \qquad (7.2)$$
with the agreement that if P[q = 0] = 0 the last term is taken to be zero regardless of the value of
f′ (∞) (which could be infinite).
Remark 7.1 For the discrete case, with $Q(x)$ and $P(x)$ being the respective pmfs, we can also write
$$D_f(P\|Q) = \sum_x Q(x)\, f\!\left(\frac{P(x)}{Q(x)}\right)$$
with the conventions that
• $f(0) = f(0+)$,
• $0f(\frac00) = 0$, and
• $0f(\frac a0) = \lim_{x\downarrow0} xf(\frac ax) = a f'(\infty)$ for $a>0$.
Remark 7.2 A nice property of Df (PkQ) is that the definition is invariant to the choice of
the dominating measure μ in (7.2). This is not the case for other dissimilarity measures, e.g., the
squared L2 -distance between the densities kp − qk2L2 (dμ) which is a popular loss function for density
estimation in statistics literature (cf. Section 32.4).
The following are common f-divergences:
Note that we can also choose f(x) = x2 − 1. Indeed, f’s differing by a linear term lead to the
same f-divergence, cf. Proposition 7.2.
¹ In (7.3), $\int d(P\wedge Q)$ is the usual shorthand for $\int\left(\frac{dP}{d\mu}\wedge\frac{dQ}{d\mu}\right)d\mu$, where $\mu$ is any dominating measure. The expressions in (7.4) and (7.5) are understood in the similar sense.
• Squared Hellinger distance: $f(x) = (1-\sqrt{x})^2$,
$$H^2(P,Q) \triangleq \mathbb{E}_Q\!\left[\left(1-\sqrt{\frac{dP}{dQ}}\right)^2\right] = \int\left(\sqrt{dP}-\sqrt{dQ}\right)^2 = 2 - 2\int\sqrt{dP\,dQ}. \qquad (7.5)$$
Here the quantity $B(P,Q)\triangleq\int\sqrt{dP\,dQ}$ is known as the Bhattacharyya coefficient (or Hellinger affinity) [52]. Note that $H(P,Q) = \sqrt{H^2(P,Q)}$ defines a metric on the space of probability distributions: indeed, the triangle inequality follows from that of $L_2(\mu)$ for a common dominating measure $\mu$. Note, however, that
$$P\mapsto H(P,Q) \text{ is not convex.} \qquad (7.6)$$
(This is because the metric $H$ is not induced by a Banach norm on the space of measures.) For an explicit example, consider $p\mapsto H(\mathrm{Ber}(p), \mathrm{Ber}(0.1))$.
• Le Cam divergence (distance) [273, p. 47]: $f(x) = \frac{(1-x)^2}{2x+2}$,
$$\mathrm{LC}(P,Q) = \frac12\int\frac{(dP-dQ)^2}{dP+dQ}. \qquad (7.7)$$
Moreover, $\sqrt{\mathrm{LC}(P\|Q)}$ is a metric on the space of probability distributions [152], known as the Le Cam distance.
• Jensen-Shannon divergence: $f(x) = x\log\frac{2x}{x+1} + \log\frac{2}{x+1}$,
$$\mathrm{JS}(P,Q) = D\!\left(P\,\Big\|\,\frac{P+Q}{2}\right) + D\!\left(Q\,\Big\|\,\frac{P+Q}{2}\right). \qquad (7.8)$$
Moreover, $\sqrt{\mathrm{JS}(P\|Q)}$ is a metric on the space of probability distributions [152].
Remark 7.3 If Df (PkQ) is an f-divergence, then it is easy to verify that Df (λP + λ̄QkQ) and
Df (PkλP + λ̄Q) are f-divergences for all λ ∈ [0, 1]. In particular, Df (QkP) = Df̃ (PkQ) with
$\tilde f(x)\triangleq x f(1/x)$.
We start by summarizing some formal observations about f-divergences,
the latter referred to as the conditional f-divergence (similar to Definition 2.14 for conditional
KL divergence).
In particular,
$$D_f(P_X P_Y\,\|\,Q_X P_Y) = D_f(P_X\|Q_X). \qquad (7.11)$$
In particular, we can always assume that f ≥ 0 and (if f is differentiable at 1) that f′ (1) = 0.
Proof. The first and second are clear. For the third property, verify explicitly that Df (PkQ) = 0
for f = c(x − 1). Next consider general f and observe that for P ⊥ Q, by definition we have
which is well-defined (i.e., ∞ − ∞ is not possible) since by convexity f(0) > −∞ and f′ (∞) >
−∞. So all we need to verify is that f(0) + f′ (∞) = 0 if and only if f = c(x − 1) for some c ∈ R.
Indeed, since $f(1) = 0$, the convexity of $f$ implies that $x\mapsto g(x)\triangleq\frac{f(x)}{x-1}$ is non-decreasing. By
assumption, we have g(0+) = g(∞) and hence g(x) is a constant on x > 0, as desired.
For property 4, let $R_{Y|X} = \frac12 P_{Y|X} + \frac12 Q_{Y|X}$. By Theorem 2.12 there exist jointly measurable $p(y|x)$ and $q(y|x)$ such that $dP_{Y|X=x} = p(y|x)\,dR_{Y|X=x}$ and $dQ_{Y|X=x} = q(y|x)\,dR_{Y|X=x}$. We can then take $\mu$ in (7.2) to be $\mu = P_X R_{Y|X}$, which gives $dP_{X,Y} = p(y|x)\,d\mu$ and $dQ_{X,Y} = q(y|x)\,d\mu$ and thus
$$D_f(P_{X,Y}\|Q_{X,Y}) = \int_{\mathcal{X}\times\mathcal{Y}} d\mu\,1\{q(y|x)>0\}\,q(y|x)\,f\!\left(\frac{p(y|x)}{q(y|x)}\right) + f'(\infty)\int_{\mathcal{X}\times\mathcal{Y}} d\mu\,1\{q(y|x)=0\}\,p(y|x)$$
$$\stackrel{(7.2)}{=} \int_{\mathcal{X}} dP_X\,\underbrace{\left[\int_{\{y:q(y|x)>0\}} dR_{Y|X=x}\,q(y|x)\,f\!\left(\frac{p(y|x)}{q(y|x)}\right) + f'(\infty)\int_{\{y:q(y|x)=0\}} dR_{Y|X=x}\,p(y|x)\right]}_{D_f(P_{Y|X=x}\|Q_{Y|X=x})}$$
Proof. Note that in the case PX,Y QX,Y (and thus PX QX ), the proof is a simple application
of Jensen’s inequality to definition (7.1):
$$D_f(P_{X,Y}\|Q_{X,Y}) = \mathbb{E}_{X\sim Q_X}\mathbb{E}_{Y\sim Q_{Y|X}}\left[f\!\left(\frac{dP_{Y|X}}{dQ_{Y|X}}\frac{dP_X}{dQ_X}\right)\right]$$
$$\ge \mathbb{E}_{X\sim Q_X}\left[f\!\left(\mathbb{E}_{Y\sim Q_{Y|X}}\left[\frac{dP_{Y|X}}{dQ_{Y|X}}\right]\frac{dP_X}{dQ_X}\right)\right] = \mathbb{E}_{X\sim Q_X}\left[f\!\left(\frac{dP_X}{dQ_X}\right)\right].$$
To prove the general case we need to be more careful. Let $R_X = \frac12(P_X+Q_X)$ and $R_{Y|X} = \frac12 P_{Y|X} + \frac12 Q_{Y|X}$. It should be clear that $P_{X,Y}, Q_{X,Y}\ll R_{X,Y}\triangleq R_X R_{Y|X}$ and that for every $x$: $P_{Y|X=x}, Q_{Y|X=x}\ll R_{Y|X=x}$. By Theorem 2.12 there exist measurable functions $p_1, p_2, q_1, q_2$ so that $dP_X = p_1(x)\,dR_X$, $dQ_X = q_1(x)\,dR_X$ and $dP_{Y|X=x} = p_2(y|x)\,dR_{Y|X=x}$, $dQ_{Y|X=x} = q_2(y|x)\,dR_{Y|X=x}$. We also denote $p(x,y) = p_1(x)p_2(y|x)$, $q(x,y) = q_1(x)q_2(y|x)$.
Fix $t>0$ and consider a supporting line to $f$ at $t$ with slope $\mu$, so that $f(s)\ge f(t) + \mu(s-t)$ for all $s\ge0$. Since $f'(\infty)\ge\mu$, this yields
$$f(t\lambda) + t(1-\lambda)f'(\infty) \ge f(t), \qquad \forall t\ge0,\ \lambda\in(0,1]. \qquad (7.14)$$
Note that we added the $t=0$ case as well, since for $t=0$ the statement is obvious (recall, though, that $f(0)\triangleq f(0+)$ can be equal to $+\infty$).
Next, fix some x with q1 (x) > 0 and consider the chain
$$\int_{\{y:q_2(y|x)>0\}} dR_{Y|X=x}\,q_2(y|x)\,f\!\left(\frac{p_1(x)p_2(y|x)}{q_1(x)q_2(y|x)}\right) + P_{Y|X=x}[q_2(Y|x)=0]\,\frac{p_1(x)}{q_1(x)}\,f'(\infty)$$
$$\stackrel{(a)}{\ge} f\!\left(\frac{p_1(x)}{q_1(x)}\,P_{Y|X=x}[q_2(Y|x)>0]\right) + \frac{p_1(x)}{q_1(x)}\,P_{Y|X=x}[q_2(Y|x)=0]\,f'(\infty)$$
$$\stackrel{(b)}{\ge} f\!\left(\frac{p_1(x)}{q_1(x)}\right),$$
where (a) is by Jensen's inequality and the convexity of $f$, and (b) by taking $t = \frac{p_1(x)}{q_1(x)}$ and $\lambda = P_{Y|X=x}[q_2(Y|x)>0]$ in (7.14). Now multiplying the obtained inequality by $q_1(x)$ and integrating over $\{x: q_1(x)>0\}$ we get
$$\int_{\{q>0\}} dR_{X,Y}\,q(x,y)\,f\!\left(\frac{p(x,y)}{q(x,y)}\right) + f'(\infty)\,P_{X,Y}[q_1(X)>0, q_2(Y|X)=0] \ge \int_{\{q_1>0\}} dR_X\,q_1(x)\,f\!\left(\frac{p_1(x)}{q_1(x)}\right).$$
Adding f′ (∞)PX [q1 (X) = 0] to both sides we obtain (7.13) since both sides evaluate to
definition (7.2).
The following is the main result of this section.
Theorem 7.4 (Data processing) Consider a channel that produces Y given X based on the
conditional law PY|X (shown below).
(Figure: the channel $P_{Y|X}$ maps $P_X\mapsto P_Y$ and $Q_X\mapsto Q_Y$.)
Let $P_Y$ (resp. $Q_Y$) denote the distribution of $Y$ when $X$ is distributed as $P_X$ (resp. $Q_X$). For any f-divergence $D_f(\cdot\|\cdot)$,
$$D_f(P_Y\|Q_Y) \le D_f(P_X\|Q_X).$$
Next we discuss some of the more useful properties of f-divergence that parallel those of KL
divergence in Theorem 2.16:
(Figure: a single input $P_X$ is fed into two channels $P_{Y|X}$ and $Q_{Y|X}$, producing outputs $P_Y$ and $Q_Y$.)
Then
$$D_f(P_Y\|Q_Y) \le D_f(P_{Y|X}\|Q_{Y|X}\,|\,P_X).$$
Proof. (a) Non-negativity follows from monotonicity by taking X to be unary. To show strict
positivity, suppose for the sake of contradiction that $D_f(P\|Q) = 0$ for some $P\ne Q$. Then there exists some measurable $A$ such that $p = P(A)\ne q = Q(A) > 0$. Applying the data
² By strict convexity at 1, we mean that for all $s,t\in[0,\infty)$ and $\alpha\in(0,1)$ such that $\alpha s + \bar\alpha t = 1$, we have $\alpha f(s) + (1-\alpha)f(t) > f(1)$.
Remark 7.4 (Strict convexity) Just like for the KL divergence, f-divergences are never
strictly convex in the sense that (P, Q) 7→ Df (PkQ) can be linear on an interval connecting (P0 , Q0 )
to (P1 , Q1 ). As in Remark 5.2 this is the case when (P0 , Q0 ) have support disjoint from (P1 , Q1 ).
However, for f-divergences this can happen even with pairs with a common support. For example,
TV(Ber(p), Ber(q)) = |p − q| is piecewise linear. In turn, strict convexity of f is related to certain
desirable properties of f-information If (X; Y), see Ex. I.40.
Remark 7.5 (g-divergences) We note that, more generally, we may call functional D(PkQ)
a “g-divergence”, or a generalized dissimilarity measure, if it satisfies the following properties: pos-
itivity, monotonicity (as in (7.13)), data processing inequality (DPI, cf. (7.15)) and D(PkP) = 0
for any $P$. Note that the last three properties imply positivity by taking $X$ to be unary in the DPI. In many ways the g-divergence properties allow one to interpret it as a measure of information in the generic sense adopted in this book. We have seen that f-divergences satisfy two additional properties: conditioning increases divergence (CID) and convexity in the pair, the two being essentially equivalent (cf. the proof of Theorem 5.1). CID and convexity do not necessarily hold for a general g-divergence. Indeed, any monotone function of an f-divergence is a g-divergence, and of course such functions need not be convex (cf. (7.6) for an example). Interestingly, there exist g-divergences which are not monotone transformations of any f-divergence, cf. [338, Section V]; the example there is in fact $D(P\|Q) = \alpha - \beta_\alpha(P,Q)$ with $\beta$ defined in (14.3) later in the book. On the other hand, for finite alphabets, [325] shows that any $D(P\|Q) = \sum_i \phi(P_i, Q_i)$ is a g-divergence iff it is an f-divergence.
The following convenient property, a counterpart of Theorem 4.5, allows us to reduce any gen-
eral problem about f-divergences to the problem on finite alphabets. The proof is in Section 7.14*.
Theorem 7.6 Let $P, Q$ be two probability measures on $\mathcal{X}$ with $\sigma$-algebra $\mathcal{F}$. Given a finite $\mathcal{F}$-measurable partition $\mathcal{E} = \{E_1,\ldots,E_n\}$, define the distributions $P^{\mathcal{E}}$ and $Q^{\mathcal{E}}$ on $[n]$ by $P^{\mathcal{E}}(i) = P[E_i]$ and $Q^{\mathcal{E}}(i) = Q[E_i]$. Then
$$D_f(P\|Q) = \sup_{\mathcal{E}} D_f(P^{\mathcal{E}}\|Q^{\mathcal{E}}). \qquad (7.16)$$
is minimized.
In this section we first show that optimization over ϕ naturally leads to the concept of TV.
Subsequently, we will see that asymptotic considerations (when $P$ and $Q$ are replaced with $P^{\otimes n}$ and $Q^{\otimes n}$) lead to $H^2$. We start with the former case.
where minimization is over joint distributions PX,Y with the property PX = P and PY = Q,
which are called couplings of P and Q.
³ The extension of (7.19) from simple to composite hypothesis testing is in (32.28).
⁴ See Exercise I.36 for another inf-representation.
which establishes that the second supremum in (7.18) lower bounds TV, and hence (by taking
f(x) = 2 · 1E (x) − 1) so does the first. For the other direction, let E = {x : p(x) > q(x)} and notice
$$0 = \int (p(x)-q(x))\,d\mu = \left(\int_E + \int_{E^c}\right)(p(x)-q(x))\,d\mu,$$
implying that $\int_{E^c}(q(x)-p(x))\,d\mu = \int_E (p(x)-q(x))\,d\mu$. But the sum of these two integrals precisely equals $2\cdot\mathrm{TV}$, which implies that this choice of $E$ attains equality in (7.18).
For the inf-representation, we notice that given a coupling $P_{X,Y}$, for any $\|f\|_\infty\le1$ we have
$$\mathbb{E}_P[f(X)] - \mathbb{E}_Q[f(Y)] = \mathbb{E}[f(X)-f(Y)] \le 2\,P_{X,Y}[X\ne Y],$$
which, in view of (7.18), shows that the inf-representation is always an upper bound. To show that this bound is tight one constructs $X, Y$ as follows: with probability $\pi\triangleq\int\min(p(x),q(x))\,d\mu$ we take $X = Y = c$ with $c$ sampled from a distribution with density $r(x) = \frac1\pi\min(p(x),q(x))$, whereas with probability $1-\pi$ we take $X, Y$ sampled independently from the distributions $p_1(x) = \frac1{1-\pi}(p(x)-\min(p(x),q(x)))$ and $q_1(x) = \frac1{1-\pi}(q(x)-\min(p(x),q(x)))$, respectively. The result follows upon verifying that this $P_{X,Y}$ indeed defines a coupling of $P$ and $Q$ and applying the last identity of (7.3).
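A minimal sketch of this maximal-coupling construction for two discrete distributions (the particular $p$, $q$ below are illustrative assumptions, with $p\ne q$ so the residual components are well defined): the empirical disagreement frequency should match $\mathrm{TV}(p,q)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def maximal_coupling(p, q, n_samples):
    """Sample pairs (X, Y) with X ~ p, Y ~ q and P[X != Y] = TV(p, q)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    common = np.minimum(p, q)
    pi = common.sum()                      # overlap mass
    r = common / pi                        # shared component
    p1 = (p - common) / (1 - pi)           # residual of p
    q1 = (q - common) / (1 - pi)           # residual of q
    k = len(p)
    xs, ys = np.empty(n_samples, int), np.empty(n_samples, int)
    for i in range(n_samples):
        if rng.random() < pi:              # stay equal with probability pi
            xs[i] = ys[i] = rng.choice(k, p=r)
        else:                              # sample the residuals independently
            xs[i] = rng.choice(k, p=p1)
            ys[i] = rng.choice(k, p=q1)
    return xs, ys

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
xs, ys = maximal_coupling(p, q, 100000)
print(np.mean(xs != ys), 0.5 * np.abs(p - q).sum())   # both approx TV(p, q) = 0.3
```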
Remark 7.6 (Variational representation) The sup-representation (7.18) of the total vari-
ation will be extended to general f-divergences in Section 7.13. However, only the TV has the
representation of the form supf∈F | EP [f] − EQ [f]| over the class of functions. Distances of this
form (for different classes of F ) are sometimes known as integral probability metrics (IPMs).
And so TV is an example of an IPM for the class F of all bounded functions.
In turn, the inf-representation (7.20) has no analogs for other f-divergences, with the notable
exception of Marton’s d2 , see Remark 7.15. Distances defined via inf-representations over cou-
plings are often called Wasserstein distances, and hence we may think of TV as the Wasserstein
distance with respect to Hamming distance d(x, x′ ) = 1{x 6= x′ } on X . The benefit of variational
representations is that choosing a particular coupling in (7.20) gives an upper bound on TV(P, Q),
and choosing a particular f in (7.18) yields a lower bound.
Of particular relevance is the special case of testing with multiple observations, where the data
X = (X1 , . . . , Xn ) are i.i.d. drawn from either P or Q. In other words, the goal is to test
H0 : X ∼ P⊗n vs H1 : X ∼ Q⊗ n .
By Theorem 7.7, the optimal total probability of error is given by 1 − TV(P⊗n , Q⊗n ). By the data
processing inequality, TV(P⊗n , Q⊗n ) is a non-decreasing sequence in n (and bounded by 1 by
definition) and hence converges. One would expect that as n → ∞, TV(P⊗n , Q⊗n ) converges to 1
and consequently, the probability of error in the hypothesis test vanishes. It turns out that for fixed
distributions P 6= Q, large deviations theory (see Chapter 16) shows that TV(P⊗n , Q⊗n ) indeed
converges to one as n → ∞ and, in fact, exponentially fast:
TV(P⊗n , Q⊗n ) = 1 − exp(−nC(P, Q) + o(n)), (7.21)
where the exponent C(P, Q) > 0 is known as the Chernoff Information of P and Q given in (16.2).
However, as frequently encountered in high-dimensional statistical problems, if the distributions
• H2 (P, Q) = 2, if and only if TV(P, Q) = 1. In this case, the probability of error is zero since
essentially P and Q have disjoint supports.
• H2 (P, Q) = 0 if and only if TV(P, Q) = 0. In this case, the smallest total probability of error is
one, meaning the best test is random guessing.
• Hellinger consistency is equivalent to TV consistency.
Proof. For convenience, let $X_1, X_2, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} Q_n$. Then
$$H^2(P_n^{\otimes n}, Q_n^{\otimes n}) = 2 - 2\,\mathbb{E}\left[\sqrt{\prod_{i=1}^n\frac{P_n}{Q_n}(X_i)}\right] = 2 - 2\prod_{i=1}^n\mathbb{E}\left[\sqrt{\frac{P_n}{Q_n}(X_i)}\right] = 2 - 2\left(1-\frac12 H^2(P_n,Q_n)\right)^n. \qquad (7.25)$$
We now use (7.25) to conclude the proof. Recall from (7.23) that $\mathrm{TV}(P_n^{\otimes n}, Q_n^{\otimes n})\to0$ if and only if $H^2(P_n^{\otimes n}, Q_n^{\otimes n})\to0$, which happens precisely when $H^2(P_n, Q_n) = o(\frac1n)$. Similarly, by (7.24), $\mathrm{TV}(P_n^{\otimes n}, Q_n^{\otimes n})\to1$ if and only if $H^2(P_n^{\otimes n}, Q_n^{\otimes n})\to2$, which is further equivalent to $H^2(P_n, Q_n) = \omega(\frac1n)$.
While some other f-divergences also satisfy tensorization, see Section 7.12, the H2 has the advan-
tage of a sandwich bound (7.22) making it the most convenient tool for checking asymptotic
testability of hypotheses.
Remark 7.8 (Kakutani's dichotomy) Let $P = \prod_{i\ge1} P_i$ and $Q = \prod_{i\ge1} Q_i$, where $P_i\ll Q_i$. Kakutani's theorem shows the following dichotomy between these two distributions on the infinite sequence space:
• If $\sum_{i\ge1} H^2(P_i, Q_i) = \infty$, then $P$ and $Q$ are mutually singular (i.e. $P\perp Q$).
• If $\sum_{i\ge1} H^2(P_i, Q_i) < \infty$, then $P$ and $Q$ are equivalent (i.e. $P\ll Q$ and $Q\ll P$).
In the Gaussian case, say, $P_i = \mathcal{N}(\mu_i, 1)$ and $Q_i = \mathcal{N}(0,1)$, the equivalence condition simplifies to $\sum\mu_i^2 < \infty$.
To understand Kakutani's criterion, note that by the tensorization property (7.26), we have
$$H^2(P, Q) = 2 - 2\prod_{i\ge1}\left(1 - \frac{H^2(P_i, Q_i)}{2}\right).$$
Thus, if $\prod_{i\ge1}(1 - \frac{H^2(P_i,Q_i)}{2}) = 0$, or equivalently, $\sum_{i\ge1} H^2(P_i, Q_i) = \infty$, then $H^2(P,Q) = 2$, which, by (7.22), is equivalent to $\mathrm{TV}(P,Q) = 1$ and hence $P\perp Q$. If $\sum_{i\ge1} H^2(P_i, Q_i) < \infty$, then $H^2(P,Q) < 2$. To conclude the equivalence between $P$ and $Q$, note that the likelihood ratio $\frac{dP}{dQ} = \prod_{i\ge1}\frac{dP_i}{dQ_i}$ satisfies that either $Q(\frac{dP}{dQ} = 0) = 0$ or $1$ by Kolmogorov's 0-1 law. See [143, Theorem 5.3.5] for details.
We end this section by discussing the related concept of contiguity. Note that if two distributions $P_n$ and $Q_n$ have vanishing total variation, then $P_n(A) = Q_n(A) + o(1)$ uniformly over all events $A$. Sometimes, and especially for statistical applications, we are only interested in comparing those events with probability close to zero or one. This leads us to the following definition.
Definition 7.9 (Contiguity and asymptotic separatedness) Let {Pn } and {Qn } be
sequences of probability measures on some Ωn . We say Pn is contiguous with respect to Qn
(denoted by Pn ◁ Qn ) if for any sequence {An } of measurable sets, Qn (An ) → 0 implies that
Pn (An ) → 0. We say Pn and Qn are mutually contiguous (denoted by Pn ◁▷ Qn ) if Pn ◁ Qn
and Qn ◁ Pn . We say that Pn is asymptotically separated from Qn (denoted Pn △ Qn ) if
lim supn→∞ TV(Pn , Qn ) = 1.
0 or 1. In addition, if Pn ◁ Qn , then any test that succeeds with high Qn -probability must fail with
high Pn -probability; in other words, Pn and Qn cannot be distinguished perfectly so TV(Pn , Qn ) =
1 − Ω(1), in particular contiguity and separatedness are mutually exclusive. Furthermore, often
many interesting sequences of measures satisfy dichotomy similar to Kakutani’s: either Pn ◁▷ Qn
or Pn △ Qn , see [282].
Our interest in these notions arises from the fact that f-divergences are instrumental for
establishing contiguity and separatedness. For example, from (7.24) we conclude that
where Dα is Rényi divergence (Definition 7.24). This criterion can be weakened to the following
(commonly used) one: $P_n\lhd Q_n$ if $\chi^2(P_n\|Q_n) = O(1)$. Indeed, applying a change of measure and Cauchy-Schwarz, $P_n(A_n) = \mathbb{E}_{P_n}[1\{A_n\}] = \mathbb{E}_{Q_n}[\frac{dP_n}{dQ_n}1\{A_n\}] \le \sqrt{1+\chi^2(P_n\|Q_n)}\sqrt{Q_n(A_n)}$, which vanishes whenever $Q_n(A_n)$ vanishes. (See Exercise I.49 for a concrete example in the context of community detection and random graphs.) In particular, a sufficient condition for mutual contiguity is the boundedness of the likelihood ratio: $c\le\frac{dP_n}{dQ_n}\le C$ for some constants $c, C > 0$.
Proof. It suffices to consider the natural logarithm for the KL divergence. First we show that,
by the data processing inequality, it suffices to prove the result for Bernoulli distributions. For
any event E, let Y = 1 {X ∈ E} which is Bernoulli with parameter P(E) or Q(E). By the DPI,
D(PkQ) ≥ d(P(E)kQ(E)). If Pinsker’s inequality holds for all Bernoulli distributions, we have
$$\sqrt{\tfrac12 D(P\|Q)} \ge \mathrm{TV}(\mathrm{Ber}(P(E)), \mathrm{Ber}(Q(E))) = |P(E)-Q(E)|.$$
Taking the supremum over $E$ gives $\sqrt{\frac12 D(P\|Q)}\ge\sup_E|P(E)-Q(E)| = \mathrm{TV}(P,Q)$, in view of Theorem 7.7.
The binary case follows easily from a second-order Taylor expansion (with integral remainder form) of $p\mapsto d(p\|q)$:
$$d(p\|q) = \int_q^p\frac{p-t}{t(1-t)}\,dt \ge 4\int_q^p (p-t)\,dt = 2(p-q)^2.$$
Pinsker’s inequality has already been used multiple times in this book. Here is yet another
implication that is further explored in Exercise I.62 and I.63 (Szemerédi regularity).
Proof. If $Y_1$ and $Y_2$ are two random variables taking values in $[-1,1]$ then by (7.20) there exists a coupling such that $\mathbb{P}[Y_1\ne Y_2]\le\mathrm{TV}(P_{Y_1}, P_{Y_2})$. Thus, $|\mathbb{E}[Y_1]-\mathbb{E}[Y_2]|\le 2\,\mathrm{TV}(P_{Y_1}, P_{Y_2})$. Now, applying this to $P_{Y_1} = P_{Y|X=a}$ and $P_{Y_2} = P_{Y|X'=b}$ we obtain
$$|\mathbb{E}[Y|X=a]-\mathbb{E}[Y|X'=b]|^2 \le 4\,\mathrm{TV}^2(P_{Y|X=a}, P_{Y|X'=b}) \le \frac{2}{\log e}\,D(P_{Y|X=a}\|P_{Y|X'=b}),$$
where we applied Pinsker’s inequality in the last step. The proof is completed by averaging over
(a, b) ∼ PX,X′ and noticing that D(PY|X kPY|X′ |PX,X′ ) = I(Y; X|X′ ) due to PY|X,X′ = PY|X by
assumption.
Pinsker’s inequality and Tao’s inequality are both sharp in the sense that the constants can not
be improved. For example, for (7.27) we can take Pn = Ber( 21 + 1n ) and Qn = Ber( 12 ) and compute
D(Pn ∥Qn )
that TV 2 (P ,Q ) → 2 log e as n → ∞. (This is best seen by inspecting the local quadratic behavior
n n
in Proposition 2.21.) Nevertheless, this does not mean that the inequality (7.27) is not improvable,
as the RHS can be replaced by some other function of TV(P, Q) with additional higher-order
terms. Indeed, several such improvements of Pinsker’s inequality are known. But what is the best
inequality? In addition, another natural question is the reverse inequality: can we upper-bound
D(PkQ) in terms of TV(P, Q)?
Settling these questions rests on characterizing the joint range (the set of possible values) of a given pair of f-divergences. This systematic approach to comparing f-divergences (as opposed to the ad hoc proof of Theorem 7.10 presented above) is the subject of the rest of this section.
Definition 7.12 (Joint range) Consider two f-divergences Df (PkQ) and Dg (PkQ). Their
joint range is a subset of [0, ∞]2 defined by
As an example, Figure 7.1 gives the joint range R between the KL divergence and the total vari-
ation. By definition, the lower boundary of the region R gives the optimal refinement of Pinsker’s
inequality:
Also from Figure 7.1 we see that it is impossible to bound D(PkQ) from above in terms of TV(P, Q)
due to the lack of upper boundary.
Figure 7.1 Joint range of TV and KL divergence. The dashed line is the quadratic lower bound given by
Pinsker’s inequality (7.27).
The joint range R may appear difficult to characterize since we need to consider P, Q over
all measurable spaces; on the other hand, the region Rk for small k is easy to obtain (at least
numerically). Revisiting the proof of Pinsker’s inequality in Theorem 7.10, we see that the key
step is the reduction to Bernoulli distributions. It is natural to ask: to obtain full joint range is it
possible to reduce to the binary case? It turns out that it is always sufficient to consider quaternary
distributions, or the convex hull of that of binary distributions.
where co denotes the convex hull with a natural extension of convex operations to [0, ∞]2 .
We will rely on the following famous result from convex analysis (cf. e.g. [145, Chapter 2,
Theorem 18]).
• Claim 1: co(R2 ) ⊂ R4 ;
• Claim 2: Rk ⊂ co(R2 );
• Claim 3: R = R4 .
Note that Claims 1-2 prove the most interesting part: $\bigcup_{k=1}^\infty\mathcal{R}_k = \mathrm{co}(\mathcal{R}_2)$. Claim 3 is more technical and its proof can be found in [214]. However, the approximation result in Theorem 7.6 shows that $\mathcal{R}$ is the closure of $\bigcup_{k=1}^\infty\mathcal{R}_k$. Thus for the purpose of obtaining inequalities between $D_f$ and $D_g$, Claims 1-2 are sufficient.
We start with Claim 1. Given any two pairs of distributions (P0 , Q0 ) and (P1 , Q1 ) on some space
X and given any α ∈ [0, 1], define two joint distributions of the random variables (X, B) where
PB = QB = Ber(α), PX|B=i = Pi and QX|B=i = Qi for i = 0, 1. Then by (7.9) we get
R2 = R̃2 ∪ {(pf′ (∞), pg′ (∞)) : p ∈ (0, 1]} ∪ {(qf(0), qg(0)) : q ∈ (0, 1]} ,
Since $(0,0)\in\tilde{\mathcal{R}}_2$, we see that regardless of which of $f(0), f'(\infty), g(0), g'(\infty)$ are infinite, the set $\mathcal{R}_2\cap\mathbb{R}^2$ is connected. Thus, by Lemma 7.14 any point in $\mathrm{co}(\mathcal{R}_2\cap\mathbb{R}^2)$ is a combination of two points in $\mathcal{R}_2\cap\mathbb{R}^2$, which, by the argument above, is a subset of $\mathcal{R}_4$. Finally, it is not hard to see that $\mathrm{co}(\mathcal{R}_2)\setminus\mathbb{R}^2\subset\mathcal{R}_4$, which concludes the proof of $\mathrm{co}(\mathcal{R}_2)\subset\mathcal{R}_4$.
Next, we prove Claim 2. Fix $P, Q$ on $[k]$ and denote their PMFs $(p_j)$ and $(q_j)$, respectively. Note that without changing either $D_f(P\|Q)$ or $D_g(P\|Q)$ (but perhaps, by increasing $k$ by 1), we can make $q_j > 0$ for $j>1$ and $q_1 = 0$, which we thus assume. Denote $\phi_j = \frac{p_j}{q_j}$ for $j>1$ and consider the set
$$S = \left\{\tilde Q = (\tilde q_j)_{j\in[k]} : \tilde q_j\ge0,\ \sum_j\tilde q_j = 1,\ \tilde q_1 = 0,\ \sum_{j=2}^k\tilde q_j\phi_j\le1\right\}.$$
affinely maps $S$ to $[0,\infty]$ (note that $f(0)$ or $f'(\infty)$ can equal $\infty$). In particular, if we denote $\tilde P_i = \tilde P(\tilde Q_i)$ corresponding to $\tilde Q_i$ in the decomposition (7.29), we get
$$D_f(P\|Q) = \sum_{i=1}^m\alpha_i D_f(\tilde P_i\|\tilde Q_i),$$
and similarly for $D_g(P\|Q)$. We are left to show that each $(\tilde P_i, \tilde Q_i)$ is supported on at most two points, which verifies that any element of $\mathcal{R}_k$ is a convex combination of $k$ elements of $\mathcal{R}_2$. Indeed, for $\tilde Q\in S_e$ the set $\{j\in[k]: \tilde q_j>0 \text{ or } \tilde p_j>0\}$ has cardinality at most two (for the second type of extremal points we notice $\tilde p_{j_1}+\tilde p_{j_2} = 1$ implying $\tilde p_1 = 0$). This concludes the proof of Claim 2.
• the upper boundary is achieved by $P = \mathrm{Ber}(\frac{1+t}{2})$, $Q = \mathrm{Ber}(\frac{1-t}{2})$;
• the lower boundary is achieved by $P = (1-t, t, 0)$, $Q = (1-t, 0, t)$.
Figure 7.2 The joint range R of TV and H2 is characterized by (7.22), which is the convex hull of the grey
region R2 .
where we take the natural logarithm. Here is a corollary (a weaker bound) due to [427]:
$$D(P\|Q) \ge \log\frac{1+\mathrm{TV}(P,Q)}{1-\mathrm{TV}(P,Q)} - \frac{2\,\mathrm{TV}(P,Q)}{1+\mathrm{TV}(P,Q)}\log e. \qquad (7.31)$$
Both bounds are stronger than Pinsker’s inequality (7.27). Note the following consequences:
where the function f is a convex increasing bijection of [0, 1) onto [0, ∞). Furthermore, for every
s ≥ f(t) there exists a pair of distributions such that χ2 (PkQ) = s and TV(P, Q) = t.
$$\mathrm{TV}(\mathrm{Ber}(p), \mathrm{Ber}(q)) = |p-q|\triangleq t, \qquad \chi^2(\mathrm{Ber}(p)\|\mathrm{Ber}(q)) = \frac{(p-q)^2}{q(1-q)} = \frac{t^2}{q(1-q)}.$$
Given |p − q| = t, let us determine the possible range of q(1 − q). The smallest value of q(1 − q)
is always 0 by choosing p = t, q = 0. The largest value is 1/4 if t ≤ 1/2 (by choosing p = 1/2 − t,
q = 1/2). If t > 1/2 then we can at most get t(1 − t) (by setting p = 0 and q = t). Thus we
get χ2 (Ber(p)kBer(q)) ≥ f(|p − q|) as claimed. The convexity of f follows since its derivative is
monotonically increasing. Clearly, $f(t)\ge 4t^2$ because $t(1-t)\le\frac14$.
• KL vs TV: see (7.30). For discrete distributions there is partial comparison in the other direction
(“reverse Pinsker”, cf. [373, Section VI]):
$$D(P\|Q) \le \log\left(1+\frac{2}{Q_{\min}}\mathrm{TV}(P,Q)^2\right) \le \frac{2\log e}{Q_{\min}}\mathrm{TV}(P,Q)^2, \qquad Q_{\min} = \min_x Q(x).$$
• KL vs Hellinger:
$$D(P\|Q) \ge 2\log\frac{2}{2-H^2(P,Q)} \ge \log e\cdot H^2(P,Q). \qquad (7.33)$$
The first inequality gives the joint range and is attained at P = Ber(0), Q = Ber(q). For a fixed
H2 , in general D(P||Q) has no finite upper bound, as seen from P = Ber(p), Q = Ber(0).
There is a partial result in the opposite direction (log-Sobolev inequality for the Bonami-Beckner semigroup, cf. [122, Theorem A.1] and Exercise I.64):
$$D(P\|Q) \le \frac{\log(\frac{1}{Q_{\min}}-1)}{1-2Q_{\min}}\left(1-(1-H^2(P,Q))^2\right), \qquad Q_{\min} = \min_x Q(x).$$
• $\chi^2$ vs TV: The full joint range is given by (7.32). Two simple consequences are:
$$\mathrm{TV}(P,Q) \le \frac12\sqrt{\chi^2(P\|Q)}, \qquad (7.37)$$
$$\mathrm{TV}(P,Q) \le \max\left(\frac12,\ \frac{\chi^2(P\|Q)}{1+\chi^2(P\|Q)}\right), \qquad (7.38)$$
where the second is useful for bounding TV away from one.
• JS vs TV: The full joint region is given by
$$2\,d\!\left(\frac{1-\mathrm{TV}(P,Q)}{2}\,\Big\|\,\frac12\right) \le \mathrm{JS}(P,Q) \le \mathrm{TV}(P,Q)\cdot 2\log2. \qquad (7.39)$$
The lower bound is a consequence of Fano's inequality. For the upper bound notice that for $p,q\in[0,1]$ and $|p-q| = \tau$ the maximum of $d(p\|\frac{p+q}{2})$ is attained at $p=0$, $q=\tau$ (from the convexity of $d(\cdot\|\cdot)$) and, thus, the binary joint range is given by $\tau\mapsto d(\tau\|\tau/2) + d(1-\tau\|1-\tau/2)$. Since the latter is convex, its concave envelope is a straight line connecting the endpoints at $\tau=0$ and $\tau=1$.
1 Total variation:
$$\mathrm{TV}(\mathcal{N}(0,\sigma^2), \mathcal{N}(\mu,\sigma^2)) = 2\Phi\!\left(\frac{|\mu|}{2\sigma}\right)-1 = \int_{-\frac{|\mu|}{2\sigma}}^{\frac{|\mu|}{2\sigma}}\varphi(x)\,dx = \frac{|\mu|}{\sqrt{2\pi}\,\sigma} + O(\mu^2), \quad \mu\to0. \qquad (7.40)$$
2 Hellinger distance:
$$H^2(\mathcal{N}(0,\sigma^2)\|\mathcal{N}(\mu,\sigma^2)) = 2-2e^{-\frac{\mu^2}{8\sigma^2}} = \frac{\mu^2}{4\sigma^2} + O(\mu^3), \quad \mu\to0. \qquad (7.41)$$
More generally,
$$H^2(\mathcal{N}(\mu_1,\Sigma_1)\|\mathcal{N}(\mu_2,\Sigma_2)) = 2-2\,\frac{|\Sigma_1|^{1/4}|\Sigma_2|^{1/4}}{|\bar\Sigma|^{1/2}}\exp\left(-\frac18(\mu_1-\mu_2)'\bar\Sigma^{-1}(\mu_1-\mu_2)\right),$$
where $\bar\Sigma = \frac{\Sigma_1+\Sigma_2}{2}$.
3 KL divergence:
$$D(\mathcal{N}(\mu_1,\sigma_1^2)\|\mathcal{N}(\mu_2,\sigma_2^2)) = \frac12\log\frac{\sigma_2^2}{\sigma_1^2} + \frac12\left(\frac{(\mu_1-\mu_2)^2}{\sigma_2^2}+\frac{\sigma_1^2}{\sigma_2^2}-1\right)\log e. \qquad (7.42)$$
For a more general result see (2.8).
4 $\chi^2$-divergence:
$$\chi^2(\mathcal{N}(\mu,\sigma^2)\|\mathcal{N}(0,\sigma^2)) = e^{\frac{\mu^2}{\sigma^2}}-1 = \frac{\mu^2}{\sigma^2} + O(\mu^3), \quad \mu\to0, \qquad (7.43)$$
$$\chi^2(\mathcal{N}(\mu,\sigma^2)\|\mathcal{N}(0,1)) = \begin{cases}\dfrac{e^{\mu^2/(2-\sigma^2)}}{\sigma\sqrt{2-\sigma^2}}-1, & \sigma^2<2\\ \infty, & \sigma^2\ge2\end{cases} \qquad (7.44)$$
5 χ2 -divergence for Gaussian mixtures [225] (see also Exercise I.48 for the Ingster-Suslina
method applicable to general mixture distributions):
$$\chi^2(P*\mathcal{N}(0,\Sigma)\,\|\,\mathcal{N}(0,\Sigma)) = \mathbb{E}\left[e^{\langle\Sigma^{-1}X, X'\rangle}\right]-1, \qquad X\perp\!\!\!\perp X'\sim P. \qquad (7.45)$$
Proof. Note that If (U; X) = Df (PU,X kPU PX ) ≥ Df (PU,Y kPU PY ) = If (U; Y), where we
applied the data-processing Theorem 7.4 to the (possibly stochastic) map (U, X) 7→ (U, Y). See
also Remark 3.4.
is not generally subadditive. There are two special cases when $I_{\chi^2}$ is subadditive: if one of $I_{\chi^2}(X;A)$ or $I_{\chi^2}(X;B)$ is small [202, Lemma 26], or if $X\sim\mathrm{Ber}(1/2)$ and the channels $P_{A|X}$ and $P_{B|X}$ are BMS (Section 19.4*), cf. [1].
2 The f-information corresponding to total-variation ITV (X; Y) ≜ TV(PX,Y , PX PY ) is not subad-
ditive. Furthermore, it has a counter-intuitive behavior of “getting stuck”. For example, take
X ∼ Ber(1/2) and A = BSCδ (X), B = BSCδ (X) – two independent observations of X across
the BSC. A simple computation (Exercise I.35) shows:
In other words, an additional observation does not improve TV-information at all. This is the
main reason for the famous herding effect in economics [30].
3 The symmetric KL-information
the f-information corresponding to the symmetric KL divergence (also known as the Jeffreys
divergence)
Let us prove this in the discrete case. First notice the following equivalent expression for ISKL :
$$I_{SKL}(X;Y) = \sum_{x,x'} P_X(x)P_X(x')\,D(P_{Y|X=x}\|P_{Y|X=x'}). \qquad (7.52)$$
From (7.52) we get (7.51) by the additivity D(PA,B|X=x kPA,B|X=x′ ) = D(PA|X=x kPA|X=x′ ) +
D(PB|X=x kPB|X=x′ ). To prove (7.52) first consider the obvious identity:
$$\sum_{x,x'} P_X(x)P_X(x')\left[D(P_Y\|P_{Y|X=x'}) - D(P_Y\|P_{Y|X=x})\right] = 0,$$
which is rewritten as
$$\sum_{x,x'} P_X(x)P_X(x')\sum_y P_Y(y)\log\frac{P_{Y|X}(y|x)}{P_{Y|X}(y|x')} = 0. \qquad (7.53)$$
Next, by definition,
$$I_{SKL}(X;Y) = \sum_{x,y}\left[P_{X,Y}(x,y) - P_X(x)P_Y(y)\right]\log\frac{P_{X,Y}(x,y)}{P_X(x)P_Y(y)}.$$
Since the marginals of $P_{X,Y}$ and $P_X P_Y$ coincide, we can replace $\log\frac{P_{X,Y}(x,y)}{P_X(x)P_Y(y)}$ by $\log\frac{P_{Y|X}(y|x)}{f(y)}$ for any $f$. We choose $f(y) = P_{Y|X}(y|x')$ to get
$$I_{SKL}(X;Y) = \sum_{x,y}\left[P_{X,Y}(x,y) - P_X(x)P_Y(y)\right]\log\frac{P_{Y|X}(y|x)}{P_{Y|X}(y|x')}.$$
Now averaging this over PX (x′ ) and applying (7.53) to get rid of the second term in [· · · ], we
obtain (7.52). For another interesting property of ISKL , see Ex. I.54.
$$\hat P_n = \frac1n\sum_{i=1}^n\delta_{X_i}$$
denote the empirical distribution corresponding to this sample. Let PY = PY|X ◦ PX be the output
distribution corresponding to PX and PY|X ◦ P̂n be the output distribution corresponding to P̂n (a
random distribution). Note that when $P_{Y|X=x}(\cdot) = \varphi(\cdot-x)$, where $\varphi$ is a fixed density, we can think of $P_{Y|X}\circ\hat P_n$ as a kernel density estimator (KDE), whose density is $\hat p_n(x) = (\varphi*\hat P_n)(x) = \frac1n\sum_{i=1}^n\varphi(x-X_i)$. Furthermore, using the fact that $\mathbb{E}[P_{Y|X}\circ\hat P_n] = P_Y$, we have
where the first term represents the bias of the KDE due to convolution and increases with band-
width of ϕ, while the second term represents the variability of the KDE and decreases with the
bandwidth of $\varphi$. Surprisingly, the second term is sharply (within a factor of two) given by the $I_{\chi^2}$ information. More exactly, we prove the following result.
Proposition 7.17
$$\mathbb{E}[D(P_{Y|X}\circ\hat P_n\,\|\,P_Y)] \le \log\left(1+\frac1n I_{\chi^2}(X;Y)\right), \qquad (7.54)$$
In Section 25.4* we will discuss an extension of this simple bound, in particular showing that
in many cases about n = exp{I(X; Y)+ K} observations are sufficient to ensure D(PY|X ◦ P̂n kPY ) =
e−O(K) .
and apply the local expansion of KL divergence (Proposition 2.21) to get (7.55).
In the discrete case, by taking PY|X to be the identity channel (Y = X) we obtain the following
guarantee on the closeness between the empirical and the population distribution. This fact can be
used to test whether the sample was truly generated by the distribution PX .
Otherwise, we have
$$\mathbb{E}[D(\hat P_n\|P_X)] \le \log\left(1+\frac{|\mathcal{X}|-1}{n}\right) \le (|\mathcal{X}|-1)\frac{\log e}{n}. \qquad (7.58)$$
is always non-negative. For fixed PX , Ĥemp is known to be consistent even on countably infinite
alphabets [22], although the convergence rate can be arbitrarily slow, which aligns with the con-
clusion of (7.57). However, for a large alphabet of size $\Theta(n)$, the upper bound (7.58) does not vanish (this is tight for, e.g., the uniform distribution). In this case, one needs to de-bias the empirical entropy (e.g. on the basis of (7.59)) or employ different techniques in order to achieve consistent estimation.
See Section 29.4 for more details.
Theorem 7.19 Suppose that $D_f(P\|Q)<\infty$ and the derivative of $f$ at $x=1$ exists. Then
$$\lim_{\lambda\to0}\frac1\lambda D_f(\lambda P+\bar\lambda Q\|Q) = (1-P[\mathrm{supp}(Q)])\,f'(\infty),$$
where as usual we take $0\cdot\infty = 0$ on the right-hand side.
Remark 7.10 Note that we do not need a separate theorem for Df (QkλP + λ̄Q) since the
exchange of arguments leads to another f-divergence with f(x) replaced by xf(1/x).
Proof. Without loss of generality we may assume $f(1) = f'(1) = 0$ and $f\ge0$. Then, decomposing $P = \mu P_1 + \bar\mu P_0$ with $P_0\perp Q$ and $P_1\ll Q$, we have
$$\frac1\lambda D_f(\lambda P+\bar\lambda Q\|Q) = \bar\mu f'(\infty) + \frac1\lambda\int dQ\,f\!\left(1+\lambda\Big(\mu\frac{dP_1}{dQ}-1\Big)\right).$$
Note that $g(\lambda) = f(1+\lambda t)$ is positive and convex for every $t\in\mathbb{R}$ and hence $\frac1\lambda g(\lambda)$ is monotonically decreasing to $g'(0) = 0$ as $\lambda\searrow0$. Since for $\lambda=1$ the integrand is assumed to be $Q$-integrable, the dominated convergence theorem applies and we get the result.
If $\chi^2(P\|Q)<\infty$, then $D_f(\bar\lambda Q+\lambda P\|Q)<\infty$ for all $0\le\lambda<1$ and
$$\lim_{\lambda\to0}\frac1{\lambda^2} D_f(\bar\lambda Q+\lambda P\|Q) = \frac{f''(1)}{2}\chi^2(P\|Q). \qquad (7.60)$$
If $\chi^2(P\|Q) = \infty$ and $f''(1)>0$ then (7.60) also holds, i.e. $D_f(\bar\lambda Q+\lambda P\|Q) = \omega(\lambda^2)$.
Remark 7.11 Conditions of the theorem include $D$, $D_{SKL}$, $H^2$, $\mathrm{JS}$, $\mathrm{LC}$ and all Rényi divergences of orders $\lambda<2$ (with $f(x) = \frac{1}{\lambda-1}(x^\lambda-1)$; see Definition 7.24). A similar result holds also for the case when $f''(x)\to\infty$ as $x\to+\infty$ (e.g. Rényi divergences with $\lambda>2$), but then we need to make extra assumptions in order to guarantee applicability of the dominated convergence theorem (often just the finiteness of $D_f(P\|Q)$ is sufficient).
Proof. Assuming that $\chi^2(P\|Q)<\infty$ we must have $P\ll Q$ and hence we can use (7.1) as the definition of $D_f$. Note that under (7.1) without loss of generality we may assume $f'(1) = f(1) = 0$ (indeed, for that we can just add a multiple of $(x-1)$ to $f(x)$, which does not change the value of $D_f(P\|Q)$). From the Taylor expansion we have then
$$f(1+u) = u^2\int_0^1(1-t)\,f''(1+tu)\,dt.$$
decomposition $P = \mu P_1 + \bar\mu P_0$ with $P_1\ll Q$ and $P_0\perp Q$. From definition (7.2) we have (for $\lambda_1 = \frac{\lambda\mu}{1-\lambda\bar\mu}$)
$$D_f(\lambda P+\bar\lambda Q\|Q) = (1-\lambda\bar\mu)\,D_f(\lambda_1 P_1+\bar\lambda_1 Q\|Q) + \lambda\bar\mu\,D_f(P_0\|Q) \ge \lambda\bar\mu\,D_f(P_0\|Q).$$
Recall from Proposition 7.2 that $D_f(P_0\|Q)>0$ unless $f(x) = c(x-1)$ for some constant $c$, and the proof is complete.
Definition 7.21 (Regular single-parameter families) Fix τ > 0, space X and a family
Pt of distributions on X , t ∈ [0, τ ). We define the following types of conditions that we call
regularity at t = 0:
(a) Pt (dx) = pt (x) μ(dx), for some measurable (t, x) 7→ pt (x) ∈ R+ and a fixed measure μ on X ;
(b₀) There exists a measurable function $(s,x)\mapsto\dot p_s(x)$, $s\in[0,\tau)$, $x\in\mathcal{X}$, such that for $\mu$-almost every $x_0$ we have $\int_0^\tau|\dot p_s(x_0)|\,ds<\infty$ and
$$p_t(x_0) = p_0(x_0) + \int_0^t\dot p_s(x_0)\,ds. \qquad (7.63)$$
Furthermore, for $\mu$-almost every $x_0$ we have $\lim_{t\searrow0}\dot p_t(x_0) = \dot p_0(x_0)$.
(b₁) We have $\dot p_t(x) = 0$ whenever $p_0(x) = 0$ and, furthermore,
$$\int_{\mathcal{X}}\mu(dx)\sup_{0\le t<\tau}\frac{(\dot p_t(x))^2}{p_0(x)} < \infty. \qquad (7.64)$$
(c₀) There exists a measurable function $(s,x)\mapsto\dot h_s(x)$, $s\in[0,\tau)$, $x\in\mathcal{X}$, such that for $\mu$-almost every $x_0$ we have $\int_0^\tau|\dot h_s(x_0)|\,ds<\infty$ and
$$h_t(x_0) \triangleq \sqrt{p_t(x_0)} = \sqrt{p_0(x_0)} + \int_0^t\dot h_s(x_0)\,ds. \qquad (7.65)$$
Furthermore, for $\mu$-almost every $x_0$ we have $\lim_{t\searrow0}\dot h_t(x_0) = \dot h_0(x_0)$.
(c₁) The family of functions $\{(\dot h_t(x))^2 : t\in[0,\tau)\}$ is uniformly $\mu$-integrable.
Remark 7.12 Recall that the uniform integrability condition (c₁) is implied by the following stronger (but easier to verify) condition:
$$\int_{\mathcal{X}}\mu(dx)\sup_{0\le t<\tau}(\dot h_t(x))^2 < \infty. \qquad (7.66)$$
Impressively, if one also assumes the continuous differentiability of $h_t$ then the uniform integrability condition becomes equivalent to the continuity of the Fisher information
$$t\mapsto J_F(t) \triangleq 4\int\mu(dx)\,(\dot h_t(x))^2. \qquad (7.67)$$
Theorem 7.22 Let the family of distributions $\{P_t : t\in[0,\tau)\}$ satisfy conditions (a), (b₀) and (b₁) in Definition 7.21. Then we have
$$\chi^2(P_t\|P_0) = J_F(0)\,t^2 + o(t^2), \qquad (7.68)$$
$$D(P_t\|P_0) = \frac{\log e}{2}\,J_F(0)\,t^2 + o(t^2), \qquad (7.69)$$
where $J_F(0) \triangleq \int_{\mathcal{X}}\mu(dx)\frac{(\dot p_0(x))^2}{p_0(x)} < \infty$ is the Fisher information at $t=0$.
Proof. From assumption (b₁) we see that for any $x_0$ with $p_0(x_0) = 0$ we must have $\dot p_t(x_0) = 0$ and thus $p_t(x_0) = 0$ for all $t\in[0,\tau)$. Hence, we may restrict all integrals below to the subset $\{x: p_0(x)>0\}$, on which the ratio $\frac{(p_t(x)-p_0(x))^2}{p_0(x)}$ is well-defined. Consequently, we have by (7.63)
$$\frac1{t^2}\chi^2(P_t\|P_0) = \frac1{t^2}\int\mu(dx)\frac{(p_t(x)-p_0(x))^2}{p_0(x)} = \frac1{t^2}\int\mu(dx)\frac1{p_0(x)}\left(t\int_0^1 du\,\dot p_{tu}(x)\right)^2 \stackrel{(a)}{=} \int\mu(dx)\int_0^1 du_1\int_0^1 du_2\,\frac{\dot p_{tu_1}(x)\,\dot p_{tu_2}(x)}{p_0(x)}.$$
Note that by the continuity assumption in (b₀) we have $\dot p_{tu_1}(x)\dot p_{tu_2}(x)\to\dot p_0^2(x)$ for every $(u_1,u_2,x)$ as $t\to0$. Furthermore, we also have $\frac{|\dot p_{tu_1}(x)\dot p_{tu_2}(x)|}{p_0(x)}\le\sup_{0\le t<\tau}\frac{(\dot p_t(x))^2}{p_0(x)}$, which is integrable by (7.64). Consequently, application of the dominated convergence theorem to the integral in (a) concludes the proof of (7.68).
We next show that for any f-divergence with twice continuously differentiable $f$ (and in fact, without assuming (7.64)) we have
$$\liminf_{t\to0}\frac1{t^2} D_f(P_t\|P_0) \ge \frac{f''(1)}{2}J_F(0). \qquad (7.70)$$
Indeed, similar to (7.61) we get
$$D_f(P_t\|P_0) = \int_0^1 dz\,(1-z)\,\mathbb{E}_{X\sim P_0}\!\left[f''\!\left(1+z\frac{p_t(X)-p_0(X)}{p_0(X)}\right)\left(\frac{p_t(X)-p_0(X)}{p_0(X)}\right)^2\right]. \qquad (7.71)$$
The first fraction inside the bracket is between 0 and 1, and the second is bounded by $\sup_{0<t<\tau}\left(\frac{\dot p_t(X)}{p_0(X)}\right)^2$, which is $P_0$-integrable by (b₁). Thus, the dominated convergence theorem applies to the double integral in (7.71) and we obtain
$$\lim_{t\to0}\frac1{t^2}D(P_t\|P_0) = (\log e)\int_0^1 dz\,\mathbb{E}_{X\sim P_0}\!\left[(1-z)\left(\frac{\dot p_0(X)}{p_0(X)}\right)^2\right] = \frac{\log e}{2}J_F(0),$$
which is exactly (7.69).
Remark 7.13 Theorem 7.22 extends to the case of multi-dimensional parameters as follows. Define the Fisher information matrix at $\theta\in\mathbb{R}^d$:
$$J_F(\theta) \triangleq 4\int\mu(dx)\,\nabla_\theta\sqrt{p_\theta(x)}\,\nabla_\theta\sqrt{p_\theta(x)}^\top. \qquad (7.73)$$
Then (7.68) becomes $\chi^2(P_t\|P_0) = t^\top J_F(0)\,t + o(\|t\|^2)$ as $t\to0$, and similarly for (7.69), which has previously appeared in (2.34).
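A numerical sketch of the local quadratic behavior (7.68)-(7.69), for the assumed Gaussian location family $P_t = \mathcal{N}(t,\sigma^2)$ with $J_F(0) = 1/\sigma^2$ (divergences in nats, computed by quadrature; the grid and parameter values are illustration choices):

```python
import numpy as np
from scipy.stats import norm

sigma = 1.5
JF0 = 1 / sigma**2
x = np.linspace(-20, 20, 400001)
dx = x[1] - x[0]
p0 = norm.pdf(x, 0.0, sigma)

for t in [0.5, 0.1, 0.02]:
    pt = norm.pdf(x, t, sigma)
    chi2 = np.sum((pt - p0) ** 2 / p0) * dx
    kl = np.sum(pt * np.log(pt / p0)) * dx
    # ratios should approach 1 and 1/2 respectively as t -> 0
    print(t, chi2 / (JF0 * t**2), kl / (JF0 * t**2))
```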
Theorem 7.22 applies to many cases (e.g. to smooth subfamilies of exponential families, for
which one can take μ = P0 and p0 (x) ≡ 1), but it is not sufficiently general. To demonstrate the
issue, consider the following example.
Example 7.1 (Location families with compact support) We say that family Pt is a
(scalar) location family if X = R, μ = Leb and pt (x) = p0 (x − t). Consider the following
with $C_\alpha$ chosen from normalization. Clearly, here condition (7.64) is not satisfied and both $\chi^2(P_t\|P_0)$ and $D(P_t\|P_0)$ are infinite for $t>0$, since $P_t\not\ll P_0$. But $J_F(0)<\infty$ whenever $\alpha>1$ and thus one expects that a certain remedy should be possible. Indeed, one can compute those f-divergences that are finite for $P_t\not\ll P_0$ and find that for $\alpha>1$ they are quadratic in $t$. As an illustration, we have
$$H^2(P_t, P_0) = \begin{cases}\Theta(t^{1+\alpha}), & 0\le\alpha<1\\ \Theta(t^2\log\frac1t), & \alpha=1\\ \Theta(t^2), & \alpha>1\end{cases} \qquad (7.74)$$
as $t\to0$. This can be computed directly, or from more general results of [222, Theorem VI.1.1].⁵ For a relation between the Hellinger distance and Fisher information see also (VI.5).
The previous example suggests that quadratic behavior as $t\to0$ can hold even when $P_t\not\ll P_0$, which is the case handled by the next (more technical) result, whose proof we placed in Section 7.14*. One can verify that condition (c₁) is indeed satisfied for all $\alpha>1$ in Example 7.1, thus establishing the quadratic behavior. Also note that the stronger (7.66) only applies to $\alpha\ge2$.
Theorem 7.23 Given a family of distributions $\{P_t : t\in[0,\tau)\}$ satisfying the conditions (a), (c₀) and (c₁) of Definition 7.21, we have
$$\chi^2(P_t\|\bar\epsilon P_0 + \epsilon P_t) = t^2\bar\epsilon^2\left(J_F(0) + \frac{1-4\epsilon}{\epsilon}J^\#(0)\right) + o(t^2), \qquad \forall\epsilon\in(0,1), \qquad (7.75)$$
$$H^2(P_t, P_0) = \frac{t^2}{4}J_F(0) + o(t^2), \qquad (7.76)$$
where $J_F(0) = 4\int\dot h_0^2\,d\mu < \infty$ is the Fisher information and $J^\#(0) = \int\dot h_0^2\,1\{h_0=0\}\,d\mu$ can be called the Fisher defect at $t=0$.
Example 7.2 (On Fisher defect) Note that in most cases of interest we will have the situa-
tion that t 7→ ht (x) is actually differentiable for all t in some two-sided neighborhood (−τ, τ ) of
0. In such cases, $h_0(x) = 0$ implies that $t=0$ is a local minimum and thus $\dot h_0(x) = 0$, implying that
⁵ The statistical significance of this calculation is that if we were to estimate the location parameter $t$ from $n$ iid observations, then the precision $\delta_n^*$ of the optimal estimator is, up to constant factors, given by solving $H^2(P_{\delta_n^*}, P_0) \asymp \frac1n$, cf. [222, Chapter VI]. For $\alpha<1$ we have $\delta_n^* \asymp n^{-\frac{1}{1+\alpha}}$, which is notably better than the empirical mean estimator (attaining precision of only $n^{-\frac12}$). For $\alpha=1/2$ this fact was noted by D. Bernoulli in 1777 as a consequence of his (newly proposed) maximum likelihood estimation.
the defect $J^\#(0) = 0$. However, for other families this will not be so, sometimes even when $p_t(x)$ is smooth in $t\in(-\tau,\tau)$ (but $h_t$ is not). Here is such an example.
Consider Pt = Ber(t2 ). A straightforward calculation shows:
$$\chi^2(P_t\|\bar\epsilon P_0+\epsilon P_t) = \frac{\bar\epsilon^2}{\epsilon}t^2 + O(t^4), \qquad H^2(P_t, P_0) = 2\left(1-\sqrt{1-t^2}\right) = t^2 + O(t^4).$$
Taking $\mu(\{0\}) = \mu(\{1\}) = 1$ to be the counting measure, we get the following:
$$h_t(x) = \begin{cases}\sqrt{1-t^2}, & x=0\\ |t|, & x=1\end{cases}, \qquad \dot h_t(x) = \begin{cases}\dfrac{-t}{\sqrt{1-t^2}}, & x=0\\ \mathrm{sign}(t), & x=1,\ t\ne0\\ 1, & x=1,\ t=0 \text{ (just as an agreement)}\end{cases}.$$
Note that if we view Pt as a family on t ∈ [0, τ ) for small τ , then all conditions (a), (c0 ) and
(c1 ) are clearly satisfied (ḣt is bounded on t ∈ (−τ, τ )). We have JF (0) = 4 and J# (0) = 1 and
thus (7.75) recovers the correct expansion for χ2 and (7.76) for H2 .
Notice that the non-smoothness of ht only becomes visible if we extend the domain to t ∈
(−τ, τ ). In fact, this issue is not seen in terms of densities pt . Indeed, let us compute the density pt
and its derivative ṗt explicitly too:
$$p_t(x) = \begin{cases}1-t^2, & x=0\\ t^2, & x=1\end{cases}, \qquad \dot p_t(x) = \begin{cases}-2t, & x=0\\ 2t, & x=1\end{cases},$$
is discontinuous at t = 0. To make things worse, at t = 0 this expectation does not match our
definition of the Fisher information JF (0) in Theorem 7.23, and thus does not yield the correct
small-t behavior for either χ2 or H2 . In general, to avoid difficulties one should restrict to those
families with t 7→ ht (x) continuously differentiable in t ∈ (−τ, τ ).
Definition 7.24 For any $\lambda\in\mathbb{R}\setminus\{0,1\}$, the Rényi divergence of order $\lambda$ between probability distributions $P$ and $Q$ is defined as
$$D_\lambda(P\|Q) \triangleq \frac{1}{\lambda-1}\log\mathbb{E}_Q\left[\left(\frac{dP}{dQ}\right)^\lambda\right]$$
– see Definition 7.1. Extending Definition 2.14 of conditional KL divergence and assuming the same setup, the conditional Rényi divergence is defined as
Numerous properties of Rényi divergences are known, see [432]. Here we only notice a few:
which is a simple consequence of (7.77). Dλ ’s are the only divergences satisfying DPI and
tensorization [310]. The most well-known special cases of (7.79) are for Hellinger distance
(see (7.26)) and for χ2 :
$$1+\chi^2\!\left(\prod_{i=1}^n P_i\,\Big\|\,\prod_{i=1}^n Q_i\right) = \prod_{i=1}^n\left(1+\chi^2(P_i\|Q_i)\right).$$
We can also obtain additive bounds for non-product distributions, see Ex. I.42 and I.43.
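A quick numerical check of the $\chi^2$ tensorization identity above, for products of small discrete distributions (the alphabet sizes and random draws are illustration choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def chi2(p, q):
    return float(np.sum((p - q) ** 2 / q))

ps = [rng.dirichlet(np.ones(3)) for _ in range(4)]
qs = [rng.dirichlet(np.ones(3)) for _ in range(4)]

# build the product distributions explicitly
P, Q = np.array([1.0]), np.array([1.0])
for p, q in zip(ps, qs):
    P = np.outer(P, p).ravel()
    Q = np.outer(Q, q).ravel()

lhs = 1 + chi2(P, Q)
rhs = np.prod([1 + chi2(p, q) for p, q in zip(ps, qs)])
print(lhs, rhs)   # should coincide up to floating-point error
```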
The following consequence of the chain rule will be crucial in statistical applications later (see
Section 32.2, in particular, Theorem 32.8).
Proposition 7.25 Consider product channels $P_{Y^n|X^n} = \prod P_{Y_i|X_i}$ and $Q_{Y^n|X^n} = \prod Q_{Y_i|X_i}$. We have (with all optimizations over all possible distributions)
$$\inf_{P_{X^n},Q_{X^n}} D_\lambda(P_{Y^n}\|Q_{Y^n}) = \sum_{i=1}^n\inf_{P_{X_i},Q_{X_i}} D_\lambda(P_{Y_i}\|Q_{Y_i}), \qquad (7.80)$$
$$\sup_{P_{X^n},Q_{X^n}} D_\lambda(P_{Y^n}\|Q_{Y^n}) = \sum_{i=1}^n\sup_{P_{X_i},Q_{X_i}} D_\lambda(P_{Y_i}\|Q_{Y_i}) = \sum_{i=1}^n\sup_{x,x'} D_\lambda(P_{Y_i|X_i=x}\|Q_{Y_i|X_i=x'}). \qquad (7.81)$$
Remark 7.14 The mnemonic for (7.82)-(7.83) is that “mixtures of products are less distin-
guishable than products of mixtures". The former arise in statistical settings where iid observations are drawn from a single distribution whose parameter is drawn from a prior.
Proof. The second equality in (7.81) follows from the fact that Dλ is an increasing function
of an f-divergence, and thus maximization should be attained at an extreme point of the space
of probabilities, which are just the single-point masses. The main equalities (7.80)-(7.81) follow
from a) restricting optimizations to product distributions and invoking (7.79); and b) the chain rule
for Dλ . For example for n = 2, we fix PX2 and QX2 , which (via channels) induce joint distributions
PX2 ,Y2 and QX2 ,Y2 . Then we have
since PY1 |Y2 =y is a distribution induced by taking P̃X1 = PX1 |Y2 =y , and similarly for QY1 |Y2 =y′ . In
all, we get
$$D_\lambda(P_{Y^2}\|Q_{Y^2}) = D_\lambda(P_{Y_2}\|Q_{Y_2}) + D_\lambda(P_{Y_1|Y_2}\|Q_{Y_1|Y_2}\,|\,P^{(\lambda)}_{Y_2}) \ge \sum_{i=1}^2\inf_{P_{X_i},Q_{X_i}} D_\lambda(P_{Y_i}\|Q_{Y_i}),$$
Denote the domain of f∗ by dom(f∗ ) ≜ {y : f∗ (y) < ∞}. Two important properties of the convex
conjugates are
$$f(x) + f^*(y) \ge xy.$$
Similarly, we can define a convex conjugate for any convex functional $\Psi(P)$ defined on the space of measures, by setting
$$\Psi^*(g) = \sup_P\int g\,dP - \Psi(P). \qquad (7.85)$$
Under appropriate conditions (e.g. finite $\mathcal{X}$), biconjugation then yields the sought-after variational representation
$$\Psi(P) = \sup_g\int g\,dP - \Psi^*(g). \qquad (7.86)$$
We will now compute these conjugates for $\Psi(P) = D_f(P\|Q)$. It turns out to be convenient to first extend the definition of $D_f(P\|Q)$ to all finite signed measures $P$ and then compute the conjugate. To this end, let $f_{ext}:\mathbb{R}\to\mathbb{R}\cup\{+\infty\}$ be an extension of $f$, such that $f_{ext}(x) = f(x)$ for $x\ge0$ and $f_{ext}$ is convex on $\mathbb{R}$. In general, we can always choose $f_{ext}(x) = \infty$ for all $x<0$. In special cases, e.g. $f(x) = |x-1|/2$ or $f(x) = (x-1)^2$, we can directly take $f_{ext}(x) = f(x)$ for all $x$. Now we can define $D_f(P\|Q)$ for all signed measures $P$ in the same way as in Definition 7.1, using $f_{ext}$ in place of $f$.
For each choice of fext we have a variational representation of f-divergence:
Theorem 7.26 Let $P$ and $Q$ be probability measures on $\mathcal{X}$. Fix an extension $f_{ext}$ of $f$ and let $f^*_{ext}$ be the conjugate of $f_{ext}$, i.e., $f^*_{ext}(y) = \sup_{x\in\mathbb{R}} xy - f_{ext}(x)$. Denote $\mathrm{dom}(f^*_{ext}) \triangleq \{y : f^*_{ext}(y) < \infty\}$. Then
$$D_f(P\|Q) = \sup_{g:\mathcal{X}\to\mathrm{dom}(f^*_{ext})} \mathbb{E}_P[g(X)] - \mathbb{E}_Q[f^*_{ext}(g(X))], \qquad (7.87)$$
where the supremum can be taken either (a) over all simple $g$ or (b) over all $g$ satisfying $\mathbb{E}_Q[f^*_{ext}(g(X))] < \infty$.
We remark that when $P\ll Q$ then both results (a) and (b) also hold for the supremum over $g:\mathcal{X}\to\mathbb{R}$, i.e. without restricting $g(x)\in\mathrm{dom}(f^*_{ext})$.
As a consequence of the variational characterization, we get the following properties for f-
divergences:
1 Convexity: First of all, note that Df (PkQ) is expressed as a supremum of affine functions (since
the expectation is a linear operation). As a result, we get that (P, Q) 7→ Df (PkQ) is convex,
which was proved previously in Theorem 7.5 using a different method.
2 Weak lower semicontinuity: Recall the example in Remark 4.5, where $\{X_i\}$ are i.i.d. Rademachers ($\pm1$), and
$$\frac{\sum_{i=1}^n X_i}{\sqrt{n}} \xrightarrow{d} \mathcal{N}(0,1)$$
by the central limit theorem; however, by Proposition 7.2, for all $n$,
$$D_f\!\left(P_{\frac{X_1+X_2+\cdots+X_n}{\sqrt n}}\,\Big\|\,\mathcal{N}(0,1)\right) = f(0) + f'(\infty) > 0,$$
since the former distribution is discrete and the latter is continuous. Therefore, similarly to the KL divergence, the best we can hope for an f-divergence is semicontinuity. Indeed, if $\mathcal{X}$ is a nice space (e.g., Euclidean space), in (7.87) we can restrict the function $g$ to continuous bounded functions, in which case $D_f(P\|Q)$ is expressed as a supremum of weakly continuous functionals (note that $f^*\circ g$ is also continuous and bounded since $f^*$ is continuous) and is hence weakly lower semicontinuous, i.e., for any sequences of distributions $P_n$ and $Q_n$ such that $P_n\xrightarrow{w} P$ and $Q_n\xrightarrow{w} Q$, we have
Example 7.3 (Total variation and Hellinger) For total variation, we have f(x) = 12 |x − 1|.
Consider the extension fext (x) = 21 |x − 1| for x ∈ R. Then
∗ 1 +∞ if |y| > 1
fext (y) = sup xy − |x − 1| = 2 .
x 2 y if |y| ≤ 1
2
which previously appeared in (7.18). A calculation for squared Hellinger yields f∗ext (y) = y
1−y with
y ∈ (−∞, 1) and, thus, after changing from g to h = 1 − g in (7.87), we obtain
1
H2 (P, Q) = 2 − inf EP [h] + EQ [ ] .
h>0 h
As an application, consider f : X → [0, 1] and τ ∈ (0, 1), so that h = 1 − τ f satisfies 1
h ≤ 1 + 1−τ
τ
f.
Then the previous characterization implies
1 1
EP [f] ≤ EQ [f] + H2 (P, Q) ∀f : X → [0, 1], ∀τ ∈ (0, 1) .
1−τ τ
Example 7.4 (χ2 -divergence) For χ2 -divergence we have f(x) = (x − 1)2 . Take fext (x) =
y2
(x − 1)2 , whose conjugate is f∗ext (y) = y + 4.
Applying (7.87) yields
" #
2
g ( X )
χ2 (PkQ) = sup EP [g(X)] − EQ g(X) + (7.89)
g:X →R 4
= sup 2EP [g(X)] − EQ [g2 (X)] − 1 (7.90)
g:X →R
i i
i i
i i
150
This characterization is the basis of an influential modern method of density estimation, known
as generative adversarial networks (GANs) [193]. Here is its essence. Suppose that we are trying to
approximate a very complicated distribution P on Rd by representing it as (the law of) a generator
map G : Rm → Rd applied to a standard normal Z ∼ N (0, Im ). The idea of [193] is to search for
a good G by minimizing JS(P, PG(Z) ). Due to the variational characterization we can equivalently
formulate this problem as
(and in this context the test function h is called a discriminator or, less often, a critic). Since the
distribution P is only available to us through a sample of iid observations x1 , . . . , xn ∼ P, we
approximate this minimax problem by
1X
n
inf sup log h(xi ) + EZ∼N [log(1 − h(G(Z))] .
G h n
i=1
In order to be able to solve this problem another idea of [193] is to approximate the intractable
optimizations over the infinite-dimensional function spaces of G and h by an optimization over
neural networks. This is implemented via alternating gradient ascent/descent steps over the
(finite-dimensional) parameter spaces defining the neural networks of G and h. Following the
breakthrough of [193] variations on their idea resulted in finding G(Z)’s that yielded incredibly
realistic images, music, videos, 3D scenery and more.
Example 7.6 (KL-divergence) In this case we have f(x) = x log x. Consider the extension
fext (x) = ∞ for x < 0, whose convex conjugate is f∗ (y) = log e
e exp(y). Hence (7.87) yields
Note that in the last example, the variational representation (7.92) we obtained for the KL
divergence is not the same as the Donsker-Varadhan identity in Theorem 4.6, that is,
In fact, (7.92) is weaker than (7.93) in the sense that for each choice of g, the obtained lower bound
on D(PkQ) in the RHS is smaller. Furthermore, regardless of the choice of fext , the Donsker-
Varadhan representation can never be obtained from Theorem 7.26 because, unlike (7.93), the
second term in (7.87) is always linear in Q. It turns out if we define Df (PkQ) = ∞ for all non-
probability measure P, and compute its convex conjugate, we obtain in the next theorem a different
type of variational representation, which, specialized to KL divergence in Example 7.6, recovers
exactly the Donsker-Varadhan identity.
i i
i i
i i
Theorem 7.27 Consider the extension fext of f such that fext (x) = ∞ for x < 0. Let S = {x :
q(x) > 0} where q is as in (7.2). Then
Df (PkQ) = f′ (∞)P[Sc ] + sup EP [g1S ] − Ψ∗Q,P (g) , (7.94)
g
where
Ψ∗Q,P (g) ≜ inf EQ [f∗ext (g(X) − a)] + aP[S].
a∈R
′
In the special case f (∞) = ∞, we have
Df (PkQ) = sup EP [g] − Ψ∗Q (g), Ψ∗Q (g) ≜ inf EQ [f∗ext (g(X) − a)] + a. (7.95)
g a∈R
Remark 7.15 (Marton’s divergence) Recall that in Theorem 7.7 we have shown both the
sup and inf characterizations for the TV. Do other f-divergences also possess inf characterizations?
The only other known example (to us) is due to Marton. Let
Z 2
dP
Dm (PkQ) = dQ 1 − ,
dQ +
which is clearly an f-divergence with f(x) = (1 − x)2+ . We have the following [69, Lemma 8.3]:
Dm (PkQ) = inf{E[P[X 6= Y|Y]2 ] : X ∼ P, Y ∼ Q} ,
where the infimum is over all couplings of P and Q. See Ex. I.44.
Marton’s Dm divergence plays a crucial role in the theory of concentration of measure [69,
Chapter 8]. Note also that while Theorem 7.20 does not apply to Dm , due to the absence of twice
continuous differentiability, it does apply to the symmetrized Marton divergence Dsm (PkQ) ≜
Dm (PkQ) + Dm (QkP).
We end this section by describing some properties of Fisher information akin to those of f-
divergences. In view of its role in the local expansion, we expect the Fisher information to inherit
these properties such as monotonicity, data processing inequality, and the variational representa-
tion. Indeed the first two can be established directly; see Exercise I.46. In [220] Huber introduced
the following variational extension of the Fisher information (in the location family) (2.40) of a
density on R: for any P ∈ P(R), define
EP [h′ (X)]2
J(P) = sup (7.96)
h EP [h(X)2 ]
where the supremum is over all test functions h ∈ C1c that are continuously differentiable and
compactly supported such that EP [h(X)2 ] > 0. Huber showed that J(P) < ∞ if and only if P
R
has an absolutely continuous density p such that (p′ )2 /p < ∞, in which case (7.96) agrees
with the usual definition (2.40).6 This sup-representation can be anticipated by combining the
6
As an example in the reverse direction, J(Unif(0, 1)) = ∞ which follows from choosing test functions such as
h(x) = cos2 xπ
ϵ
1 {|x| ≤ ϵ/2} and ϵ → 0.
i i
i i
i i
152
variational representation (7.91) of χ2 -divergence and its local expansion (7.68) that involves the
Fisher information. Indeed, setting aside regularity conditions, by Taylor expansion we have
(E[h(X + t) − h(X)])2 E[h′ (X)]2 2
χ2 (Pt kP) = sup = sup · t + o(t2 ),
E[h2 (X)] E[h2 (X)]
which is also χ2 (Pt kP) = J(P)t2 + o(t2 ). A direct proof can be given applying integration by parts
R R R R
and Cauchy-Schwarz: ( ph′ )2 = ( p′ h)2 ≤ h2 p (p′ )2 /p, which also shows the optimal test
function is given by the score function h = p′ /p; for details, see [220, Theorem 4.2].
where we also used continuity ḣt (x) → ḣ0 (x) by assumption (c0 ).
Substituting the integral expression for g(t, x) into (7.97) we obtain
Z Z 1 Z 1
L(t) = μ(dx) du1 du2 ġ(tu1 , x)ġ(tu2 , x) . (7.100)
0 0
i i
i i
i i
7.14* Technical proofs: convexity, local expansions and variational representations 153
Since |ġ(s, x)| ≤ C|hs (x)| for some C = C(ϵ), we have from Cauchy-Schwarz
Z Z
μ(dx)|ġ(s1 , x)ġ(s2 , x)| ≤ C2 sup μ(dx)ḣt (x)2 < ∞ . (7.101)
t X
where the last inequality follows from the uniform integrability assumption (c1 ). This implies that
Fubini’s theorem applies in (7.100) and we obtain
Z 1 Z 1 Z
L(t) = du1 du2 G(tu1 , tu2 ) , G(s1 , s2 ) ≜ μ(dx)ġ(s1 , x)ġ(s2 , x) .
0 0
Notice that if a family of functions {fα (x) : α ∈ I} is uniformly square-integrable, then the family
{fα (x)fβ (x) : α ∈ I, β ∈ I} is uniformly integrable simply because apply |fα fβ | ≤ 12 (f2α + f2β ).
Consequently, from the assumption (c1 ) we see that the integral defining G(s1 , s2 ) allows passing
the limit over s1 , s2 inside the integral. From (7.99) we get as t → 0
Z
1 1 − 4ϵ #
G(tu1 , tu2 ) → G(0, 0) = μ(dx)ḣ0 (x) 4 · 1{h0 > 0} + 1{h0 = 0} = JF (0)+
2
J ( 0) .
ϵ ϵ
From (7.101) we see that G(s1 , s2 ) is bounded and thus, the bounded convergence theorem applies
and
Z 1 Z 1
lim du1 du2 G(tu1 , tu2 ) = G(0, 0) ,
t→0 0 0
which thus concludes the proof of L(t) → JF (0) and of (7.75) assuming facts about ϕ. Let us
verify those.
For simplicity, in the next paragraph we omit the argument x in h0 (x) and ϕ(·; x). A straightfor-
ward differentiation yields
h20 (1 − 2ϵ ) + 2ϵ h2
ϕ′ (h) = 2h .
(ϵ̄h20 + ϵh2 )3/2
h20 (1− ϵ2 )+ ϵ2 h2 1−ϵ/2
Since √ h
≤ √1
ϵ
and ϵ̄h20 +ϵh2
≤ 1−ϵ we obtain the finiteness of ϕ′ . For the continuity
ϵ̄h20 +ϵh2
of ϕ′ notice that if h0 > 0 then clearly the function is continuous, whereas for h0 = 0 we have
ϕ′ (h) = √1ϵ for all h.
We next proceed to the Hellinger distance. Just like in the argument above, we define
Z Z 1 Z 1
1
M(t) ≜ 2 H2 (Pt , P0 ) = μ(dx) du1 du2 ḣtu1 (x)ḣtu2 (x) .
t 0 0
R
Exactly as above from Cauchy-Schwarz and supt μ(dx)ḣt (x)2 < ∞ we conclude that Fubini
applies and hence
Z 1 Z 1 Z
M(t) = du1 du2 H(tu1 , tu2 ) , H(s1 , s2 ) ≜ μ(dx)ḣs1 (x)ḣs2 (x) .
0 0
Again, the family {ḣs1 ḣs2 : s1 ∈ [0, τ ), s2 ∈ [0, τ } is uniformly integrable and thus from (c0 ) we
conclude H(tu1 , tu2 ) → 14 JF (0). Furthermore, similar to (7.101) we see that H(s1 , s2 ) is bounded
i i
i i
i i
154
and thus
Z 1 Z 1
1
lim M(t) = du1 du2 lim H(tu1 , tu2 ) = JF ( 0 ) ,
t→0 0 0 t→0 4
concluding the proof of (7.76).
Proof of Theorem 7.6. The lower bound Df (PkQ) ≥ Df (PE kQE ) follows from the DPI. To prove
an upper bound, first we reduce to the case of f ≥ 0 by property 6 in Proposition 7.2. Then define
sets S = suppQ, F∞ = { dQdP
= 0} and for a fixed ϵ > 0 let
dP
Fm = ϵm ≤ f < ϵ(m + 1) , m = 0, 1, . . . .
dQ
We have
X Z X
dP
ϵ mQ[Fm ] ≤ dQf ≤ϵ (m + 1)Q[Fm ] + f(0)Q[F∞ ]
m S dQ m
X
≤ϵ mQ[Fm ] + f(0)Q[F∞ ] + ϵ . (7.102)
m
P[F±
m]
f ≥ ϵm .
Q[ F ±
m]
− −
0 , F0 , . . . , Fn , Fn , F∞ , S , ∪m>n Fm }. For this
Next, define the partition consisting of sets E = {F+ + c
We next show that with sufficiently large n and sufficiently small ϵ the RHS of (7.103)
approaches Df (PkQ). If f(0)Q[F∞ ] = ∞ (and hence Df (PkQ) = ∞) then clearly (7.103) is also
infinite. Thus,
assume
that f(0)Q[F∞ ] < ∞.
R
If S dQf dQ = ∞ then the sum over m on the RHS of (7.102) is also infinite, and hence
dP
P
for any N > 0 there exists some n such that m≤n mQ[Fm ] ≥ N, thus showing that RHS
R
for (7.103) can be made arbitrarily large. Thus, assume S dQf dQdP
< ∞. Considering LHS
P
of (7.102) we conclude that for some large n we have m>n mQ[Fm ] ≤ 12 . Then, we must have
again from (7.102)
X Z
dP 3
ϵ mQ[Fm ] + f(0)Q[F∞ ] ≥ dQf − ϵ.
S dQ 2
m≤n
i i
i i
i i
7.14* Technical proofs: convexity, local expansions and variational representations 155
Thus, we have shown that for arbitrary ϵ > 0 the RHS of (7.103) can be made greater than
Df (PkQ) − 32 ϵ.
Proof of Theorem 7.26. First, we show that for any g : X → dom(f∗ext ) we must have
EP [g(X)] ≤ Df (PkQ) + EQ [f∗ext (g(X))] . (7.104)
Let p(·) and q(·) be the densities of P and Q. Then, from the definition of f∗ext we have for every x
s.t. q(x) > 0:
p ( x) p ( x)
f∗ext (g(x)) + fext ( ) ≥ g ( x) .
q ( x) q ( x)
Integrating this over dQ = q dμ restricted to the set {q > 0} we get
Z
p ( x)
EQ [f∗ext (g(X))] + q(x)fext ( ) dμ ≥ EP [g(X)1{q(X) > 0}] . (7.105)
q>0 q ( x)
Now, notice that
fext (x)
sup{y : y ∈ dom(f∗ext )} = lim = f′ (∞) (7.106)
x→∞ x
Therefore, f′ (∞)P[q(X) = 0] ≥ EP [g(X)1{q(X) = 0}]. Summing the latter inequality with (7.105)
we obtain (7.104).
Next we prove that supremum in (7.87) over simple functions g does yield Df (PkQ), so that
inequality (7.104) is tight. Armed with Theorem 7.6, it suffices to show (7.87) for finite X . Indeed,
for general X , given a finite partition E = {E1 , . . . , En } of X , we say a function g : X → R is
E -compatible if g is constant on each Ei ∈ E . Taking the supremum over all finite partitions E we
get
Df (PkQ) = sup Df (PE kQE )
E
where the last step follows is because the two suprema combined is equivalent to the supremum
over all simple (finitely-valued) functions g.
Next, consider finite X . Let S = {x ∈ X : Q(x) > 0} denote the support of Q. We show the
following statement
Df (PkQ) = sup EP [g(X)] − EQ [f∗ext (g(X))] + f′ (∞)P(Sc ), (7.107)
g:S→dom(f∗
ext )
i i
i i
i i
156
Consider the functional Ψ(P) defined above where P takes values over all signed measures on S,
which can be identified with RS . The convex conjugate of Ψ(P) is as follows: for any g : S → R,
( )
X P ( x)
∗ ∗
Ψ (g) = sup P(x)g(x) − Q(x) sup h − fext (h)
P x h∈dom(f∗
ext )
Q ( x)
X
= sup inf ∗ P(x)(g(x) − h(x)) + Q(x)f∗ext (h(x))
P h:S→dom(fext ) x
( a) X
= inf sup P(x)(g(x) − h(x)) + EQ [f∗ext (h)]
h:S→dom(f∗
ext ) P
x
(
EQ [f∗ext (g(X))] g : S → dom(f∗ext )
= .
+∞ otherwise
where (a) follows from the minimax theorem (which applies due to finiteness of X ). Applying
the convex duality in (7.86) yields the proof of the desired (7.107).
Proof of Theorem 7.27. First we argue that the supremum in the right-hand side of (7.94) can
be taken over all simple functions g. Then thanks to Theorem 7.6, it will suffice to consider finite
alphabet X . To that end, fix any g. For any δ , there exists a such that EQ [f∗ext (g − a)] − aP[S] ≤
Ψ∗Q,P (g) + δ . Since EQ [f∗ext (g − an )] can be approximated arbitrarily well by simple functions we
conclude that there exists a simple function g̃ such that simultaneously EP [g̃1S ] ≥ EP [g1S ] − δ and
This implies that restricting to simple functions in the supremization in (7.94) does not change the
right-hand side.
Next consider finite X . We proceed to compute the conjugate of Ψ, where Ψ(P) ≜ Df (PkQ) if
P is a probability measure on X and +∞ otherwise. Then for any g : X → R, maximizing over
all probability measures P we have:
X
Ψ∗ (g) = sup P(x)g(x) − Df (PkQ)
P x∈X
X X X
P(x)
= sup P(x)g(x) − P(x)g(x) − Q ( x) f
P x∈X Q ( x)
x∈Sc x∈ S
X X X
= sup inf P(x)[g(x) − h(x)] + P(x)[g(x) − f′ (∞)] + Q(x)f∗ext (h(x))
P h:S→R x∈S x∈Sc x∈S
( ! )
( a) X X
= inf sup P(x)[g(x) − h(x)] + P(x)[g(x) − f′ (∞)] + EQ [f∗ext (h(X))]
h:S→R P x∈ S x∈Sc
(b) ′ ∗
= inf max max g(x) − h(x), maxc g(x) − f (∞) + EQ [fext (h(X))]
h:S→R x∈ S x∈ S
( c) ′ ∗
= inf max a, maxc g(x) − f (∞) + EQ [fext (g(X) − a)]
a∈ R x∈ S
i i
i i
i i
7.14* Technical proofs: convexity, local expansions and variational representations 157
where (a) follows from the minimax theorem; (b) is due to P being a probability measure; (c)
follows since we can restrict to h(x) = g(x) − a for x ∈ S, thanks to the fact that f∗ext is non-
decreasing (since dom(fext ) = R+ ).
From convex duality we have shown that Df (PkQ) = supg EP [g] − Ψ∗ (g). Notice that without
loss of generality we may take g(x) = f′ (∞) + b for x ∈ Sc . Interchanging the optimization over
b with that over a we find that
sup bP[Sc ] − max(a, b) = −aP[S] ,
b
which then recovers (7.94). To get (7.95) simply notice that if P[Sc ] > 0, then both sides of (7.95)
are infinite (since Ψ∗Q (g) does not depend on the values of g outside of S). Otherwise, (7.95)
coincides with (7.94).
i i
i i
i i
A commonly used method in combinatorics for bounding the number of certain objects from above
involves a smart application of Shannon entropy. This method typically proceeds as follows: in
order to count the cardinality of a given set C , we draw an element uniformly at random from
C , whose entropy is given by log |C|. To bound |C| from above, we describe this random object
by a random vector X = (X1 , . . . , Xn ) then proceed to compute or upper-bound the joint entropy
H(X1 , . . . , Xn ) via one of the following methods:
Pn
• Marginal bound: H(X1 , . . . , Xn ) ≤ i=1 H(Xi )
• Pairwise bound (Shearer’s lemma) and generalization cf. Theorem 1.8: H(X1 , . . . , Xn ) ≤
1
P
n−1 i<j H(Xi , Xj ) Pn
• Chain rule (exact calculation): H(X1 , . . . , Xn ) = i=1 H(Xi |X1 , . . . , Xi−1 )
We give three applications using the above three methods, respectively, in the order of increas-
ing difficulty: enumerating binary vectors of a given average weights, counting triangles and other
subgraphs, and Brégman’s theorem.
Finally, to demonstrate how entropy method can also be used for questions in Euclidean spaces,
we prove the Loomis-Whitney and Bollobás-Thomason theorems based on analogous properties
of differential entropy (Section 2.3).
where wH (x) is the Hamming weight (number of 1’s) of x ∈ {0, 1}n . Then |C| ≤ exp{nh(p)}.
158
i i
i i
i i
where pi = P [Xi = 1] is the fraction of vertices whose i-th bit is 1. Note that
1X
n
p= pi ,
n
i=1
since we can either first average over vectors in C or first average across different bits. By Jensen’s
inequality and the fact that x 7→ h(x) is concave,
!
Xn
1X
n
h(pi ) ≤ nh pi = nh(p).
n
i=1 i=1
As a consequence we obtain the following bound on the volume of the Hamming ball, which
will be instrumental much later when we talk about metric entropy (Chapter 27).
Theorem 8.2
k
X n
≤ exp{nh(k/n)}, k ≤ n/2.
j
j=0
Proof. We take C = {x ∈ {0, 1}n : wH (x) ≤ k} and invoke the previous lemma, which says that
k
X n
= |C| ≤ exp{nh(p)} ≤ exp{nh(k/n)},
j
j=0
where the last inequality follows from the fact that x 7→ h(x) is increasing for x ≤ 1/2.
For extensions to non-binary alphabets see Exercise I.1 and I.2. Note that, Theorem 8.2 also
follows from the large deviations theory in Part III:
LHS k 1 RHS
= P ( Bin ( n , 1 / 2) ≤ k ) ≤ exp − nd k = exp{−n(log 2 − h(k/n))} = n ,
2n n 2 2
where the inequality is the Chernoff bound on the binomial tail (see (15.19) in Example 15.1).
i i
i i
i i
160
For graphs H and G, define N(H, G) to be the number of copies of H in G.1 For example,
N( , ) = 4, N( , ) = 8.
If we know G has m edges, what is the maximal number of H that are contained in G? To study
this quantity, we define
N(H, m) = max N(H, G).
G:|E(G)|≤m
Theorem 8.3
∗ ∗
c0 ( H ) m ρ (H)
≤ N(H, m) ≤ c1 (H)mρ (H)
. (8.4)
1
To be precise, here N(H, G) is the number of subgraphs of G (subsets of edges) isomorphic to H. If we denote by inj(H, G)
the number of injective maps V(H) → V(G) mapping edges of H to edges of G, then N(H, G) = |Aut(H)| 1
inj(H, G).
2
If the “∈ [0, 1]” constraints in (8.3) and (8.5) are replaced by “∈ {0, 1}”, we obtain the covering number ρ(H) and the
independence number α(H) of H, respectively.
i i
i i
i i
For example, for triangles we have ρ∗ (K3 ) = 3/2 and Theorem 8.3 is consistent with (8.1).
Proof. Upper bound: Let V(H) = [n] and let w∗ (e) be the solution for ρ∗ (H). For any G with m
edges, draw a subgraph of G, uniformly at random from all those that are isomorphic to H. Given
such a random subgraph set Xi ∈ V(G) to be the vertex corresponding to an i-th vertex of H, i ∈ [n].
∗
Now define a random 2-subset S of [n] by sampling an edge e from E(H) with probability ρw∗ ((He)) .
By the definition of ρ∗ (H) we have for any i ∈ [n] that P[i ∈ S] ≥ ρ∗1(H) . We are now ready to
apply Theorem 1.8:
log N(H, G) = H(X) ≤ H(XS |S)ρ∗ (H) ≤ log(2m)ρ∗ (H) ,
where the last inequality is as before: if S = {v, w} then XS = (Xv , Xw ) takes one of 2m values.
∗
Overall, we get3 N(H, G) ≤ (2m)ρ (H) .
Lower bound: It amounts to construct a graph G with m edges for which N(H, G) ≥
∗
c(H)|e(G)|ρ (H) . Consider the dual LP of (8.3)
X
α∗ (H) = max ψ(v) : ψ(v) + ψ(w) ≤ 1, ∀(vw) ∈ E, ψ(v) ∈ [0, 1] (8.5)
ψ
v∈V(H)
i.e., the fractional packing number. By the duality theorem of LP, we have α∗ (H) = ρ∗ (H). The
graph G is constructed as follows: for each vertex v of H, replicate it for m(v) times. For each edge
e = (vw) of H, replace it by a complete bipartite graph Km(v),m(w) . Then the total number of edges
of G is
X
|E(G)| = m(v)m(w).
(vw)∈E(H)
Q
Furthermore, N(G, H) ≥ v∈V(H) m(v). To minimize the exponent log N(G,H)
log |E(G)| , fix a large number
ψ(v)
M and let m(v) = M , where ψ is the maximizer in (8.5). Then
X
|E(G)| ≤ 4Mψ(v)+ψ(w) ≤ 4M|E(H)|
(vw)∈E(H)
Y ∗
N(G, H) ≥ Mψ(v) = Mα (H)
v∈V(H)
3
Note that for H = K3 this gives a bound weaker than (8.2). To recover (8.2) we need to take X = (X1 , . . . , Xn ) be
uniform on all injective homomorphisms H → G.
i i
i i
i i
162
where Sn denotes the group of all permutations of [n]. For a bipartite graph G with n vertices on
the left and right respectively, the number of perfect matchings in G is given by perm(A), where
A is the adjacency matrix. For example,
perm = 1, perm =2
Theorem 8.4 (Brégman’s Theorem) For any n × n bipartite graph with adjacency matrix
A,
Y
n
1
perm(A) ≤ (di !) di ,
i=1
where di is the degree of left vertex i (i.e. sum of the ith row of A).
As an example, consider G = Kn,n . Then perm(G) = n!, which coincides with the RHS
[(n!)1/n ]n = n!. More generally, if G consists of n/d copies of Kd,d , then Brégman’s bound is
tight and perm = (d!)n/d .
Proof. If perm(A) = 0 then there is nothing to prove, so instead we assume perm(A) > 0 and
some perfect matchings exist. As a first attempt of proving Theorem 8.4 using the entropy method,
we select a perfect matching uniformly at random which matches the ith left vertex to the Xi th right
one. Let X = (X1 , . . . , Xn ). Then
X
n X
n
log perm(A) = H(X) = H(X1 , . . . , Xn ) ≤ H( X i ) ≤ log(di ).
i=1 i=1
Q
Hence perm(A) ≤ i di . This is worse than Brégman’s bound by an exponential factor, since by
Stirling’s formula (I.2)
!
Yn
1 Y
n
(di !) di ∼ d i e− n .
i=1 i=1
Here is our second attempt. The hope is to use the chain rule to expand the joint entropy and
bound the conditional entropy more carefully. Let us write
X
n X
n
H(X1 , . . . , Xn ) = H(Xi |X1 , . . . , Xi−1 ) ≤ E[log Ni ].
i=1 i=1
i i
i i
i i
where Ni , as a random variable, denotes the number of possible values Xi can take conditioned
on X1 , . . . , Xi−1 , i.e., how many possible matchings for left vertex i given the outcome of where
1, . . . , i − 1 are matched to. However, it is hard to proceed from this point as we only know the
degree information, not the graph itself. In fact, since we do not know the relative positions of the
vertices, there is no reason why we should order from 1 to n. The key idea is to label the vertices
randomly, apply chain rule in this random order and average.
To this end, pick π uniformly at random from Sn and independent of X. Then
where Nk denotes the number of possible matchings for vertex k given the outcomes of {Xj :
π −1 (j) < π −1 (k)} and the expectation is with respect to (X, π ). The key observation is:
1 X
dk
1
E(X,π ) log Nk = log i = log(di !) di
dk
i=1
and hence
X
n
1 Y
n
1
log perm(A) ≤ log(di !) di = log (di !) di .
k=1 i=1
Proof of Lemma 8.5. In fact, we will show that even conditioned on Xn the distribution of Nk
is uniform. Indeed, if d = dk is the degree of k-th (right) node then let J1 , . . . , Jd be those right
nodes that match with neighbors of k under the fixed perfect matching (one of Ji ’s, say J1 , equals
i i
i i
i i
164
k). Random permutation π rearranges Ji ’s in the order in which corresponding right nodes are
revealed. Clearly the induced order of Ji ’s is uniform on d! possible choices. Note that if J1 occurs
in position ℓ ∈ {1, . . . , d} then Nk = d − ℓ + 1. Clearly ℓ and thus Nk are uniform on [d] = [dk ].
Leb(AS ) ≤ Leb(KS )
Thus, rectangles are extremal objects from the point of view of maximizing volumes of
coordinate projections.
Proof. Let Xn be uniformly distributed on K. Then h(Xn ) = log Leb(K). Let A be a rectangle of
size a1 × · · · × an where
On the other hand, by the chain rule and the fact that conditioning reduces differential entropy
(recall Theorem 2.7(a) and (c)),
X
n
h(XS ) = 1{i ∈ S}h(Xi |X[i−1]∩S )
i=1
X
≥ h(Xi |Xi−1 )
i∈S
Y
= log ai
i∈S
= log Leb(AS ).
The following result is a continuous counterpart of Shearer’s lemma (see Theorem 1.8 and
Remark 1.2):
4
Note that since K is compact, its projection and slices are all compact and hence measurable.
i i
i i
i i
Corollary 8.7 (Loomis-Whitney) Let K be a compact subset of Rn and let Kjc denote the
projection of K onto coordinates in [n] \ j. Then
Y
n
1
Leb(K) ≤ Leb(Kjc ) n−1 . (8.6)
j=1
Y
n
Leb(K) ≥ wj ,
j=1
i.e. that volume of K is greater than that of the rectangle of average widths.
i i
i i
i i
In this chapter we consider (a toy version of) the problem of creating high-quality random number
generators. Given a stream of independent Ber(p) bits, with unknown p, we want to turn them into
pure random bits, i.e., independent Ber(1/2) bits. Our goal is to find a way of extracting as many
fair coin flips as possible from possibly biased coin flips, without knowing the actual bias p.
In 1951 von Neumann [442] proposed the following scheme: Divide the stream into pairs of
bits, output 0 if 10, output 1 if 01, otherwise do nothing and move to the next pair. Since both 01
and 10 occur with probability pp̄ (where, we remind p̄ = 1 − p), regardless of the value of p, we
obtain fair coin flips at the output. To measure the efficiency of von Neumann’s scheme, note that,
on average, we have 2n bits in and 2pp̄n bits out. So the efficiency (rate) is pp̄. The question is:
Can we do better?
There are several choices to be made in the problem formulation. Universal vs non-universal:
the source distribution can be unknown or partially known, respectively. Exact vs approximately
fair coin flips: whether the generated coin flips are exactly fair or approximately, as measured by
one of the f-divergences studied in Chapter 7 (e.g., the total variation or KL divergence). In this
chapter, we only focus on the universal generation of exactly fair coins. On the other extreme,
in Part II we will see that optimal data compressors’ output consists of almost purely random
bits, however those compressors are non-universal (need to know source statistics, e.g. bias p) and
approximate.
For convenience, in this chapter we consider entropies measured in bits, i.e. log = log2 in this
chapter.
9.1 Setup
Let {0, 1}∗ = ∪k≥0 {0, 1}k = {∅, 0, 1, 00, 01, . . . } denote the set of all finite-length binary strings,
where ∅ denotes the empty string. For any x ∈ {0, 1}∗ , l(x) denotes the length of x.
Let us first introduce the definition of random number generator formally. If the input vector is
X, denote the output (variable-length) vector by Y ∈ {0, 1}∗ . Then the desired property of Y is the
following: Conditioned on the length of Y being k, Y is uniformly distributed on {0, 1}k .
Definition 9.1 (Randomness extractor) We say Ψ : {0, 1}∗ → {0, 1}∗ is an extractor if
166
i i
i i
i i
i.i.d.
2 For any n and any p ∈ (0, 1), if Xn ∼ Ber(p), then Ψ(Xn ) ∼ Ber(1/2)k conditioned on
l(Ψ(Xn )) = k for each k ≥ 1.
9.2 Converse
We show that no extractor has a rate higher than the binary entropy function h(p), even if the
extractor is allowed to be non-universal (depending on p). The intuition is that the “information
content” contained in each Ber(p) variable is h(p) bits; as such, it is impossible to extract more
than that. This is easily made precise by the data processing inequality for entropy (since extractors
are deterministic functions).
nh(p) = H(Xn ) ≥ H(Ψ(Xn )) = H(Ψ(Xn )|L) + H(L) ≥ H(Ψ(Xn )|L) = E [L] bits,
where the last step follows from the assumption on Ψ that Ψ(Xn ) is uniform over {0, 1}k
conditioned on L = k.
The rate of von Neumann’s extractor and the entropy bound are plotted in Figure 9.1. Next
we present two extractors, due to Elias [149] and Peres [327] respectively, that attain the binary
entropy function. (More precisely, both construct a sequence of extractors whose rate approaches
the entropy bound).
i i
i i
i i
168
rate
1 bit
rvN
p
0 1 1
2
Figure 9.1 Rate function of von Neumann’s extractor and the binary entropy function.
1 For iid Xn , the probability of each string only depends on its type, i.e., the number of 1’s, cf.
method of types in Exercise I.1. Therefore conditioned on the number of 1’s to be qn, Xn is
uniformly distributed over the type class Tq . This observation holds universally for any value
of the actual bias p.
2 Given a uniformly distributed random variable on some finite set, we can easily turn it into
variable-length string of fair coin flips. For example:
• If U is uniform over {1, 2, 3}, we can map 1 7→ ∅, 2 7→ 0 and 3 7→ 1.
• If U is uniform over {1, 2, . . . , 11}, we can map 1 7→ ∅, 2 7→ 0, 3 7→ 1, and the remaining
eight numbers 4, . . . , 11 are assigned to 3-bit strings.
We will study properties of these kind of variable-length encoders later in Chapter 10.
Lemma 9.3 Given U uniformly distributed on [M], there exists f : [M] → {0, 1}∗ such that
conditioned on l(f(U)) = k, f(U) is uniformly over {0, 1}k . Moreover,
log2 M − 4 ≤ E[l(f(U))] ≤ log2 M bits.
Proof. We defined f by partitioning [M] into subsets whose cardinalities are powers of two, and
assign elements in each subset to binary strings of that length. Formally, denote the binary expan-
Pn
sion of M by M = i=0 mi 2i , where the most significant bit mn = 1 and n = blog2 Mc + 1. Taking
non-zero mi ’s we can write M = 2i0 + · · · 2it as a sum of distinct powers of twos and thus define
a partition [M] = ∪tj=0 Mj , where |Mj | = 2ij . We map the elements of Mj to {0, 1}ij . Finally, notice
that uniform distribution conditioned on any subset is still uniform.
To prove the bound on the expected length, the upper bound follows from the same entropy
argument log2 M = H(U) ≥ H(f(U)) ≥ H(f(U)|l(f(U))) = E[l(f(U))], and the lower bound
follows from
1 X 1 X 2n X i−n
n n n
2n+1
E[l(f(U))] = mi 2i · i = n − mi 2i (n − i) ≥ n − 2 ( n − i) ≥ n − ≥ n − 4,
M M M M
i=0 i=0 i=0
i i
i i
i i
Elias’ extractor Fix n ≥ 1. Let wH (xn ) define the Hamming weight (number of ones) of a
binary string xn . Let Tk = {xn ∈ {0, 1}n : wH (xn ) = k} define the Hamming sphere of radius k.
For each 0 ≤ k ≤ n, we apply the function f from Lemma 9.3 to each Tk . This defines a mapping
ΨE : {0, 1}n → {0, 1}∗ and then we extend it to ΨE : {0, 1}∗ → {0, 1}∗ by applying the mapping
per n-bit block and discard the last incomplete block. Then it is clear that the rate is given by
n E[l(ΨE (X ))]. By Lemma 9.3, we have
1 n
n n
E log − 4 ≤ E[l(ΨE (X ))] ≤ E log
n
wH (Xn ) wH (Xn )
Using Stirling’s approximation (cf. Exercise I.1) we can show
n
log = nh(wH (Xn )/n) + O(log n) .
wH (Xn )
Pn wH ( X n )
Since n1 wH (Xn ) = 1n i=1 1{Xi = 1}, from the law of large numbers we conclude n →p
and since h is a continuous bounded function, we also have
1
E[l(ΨE (Xn ))] = h(p) + O(log n/n).
n
Therefore the extraction rate of ΨE approaches the optimum h(p) as n → ∞.
• Let 1 ≤ m1 < . . . < mk ≤ n denote the locations such that x2mj 6= x2mj −1 .
• Let 1 ≤ i1 < . . . < in−k ≤ n denote the locations such that x2ij = x2ij −1 .
• yj = x2mj , vj = x2ij , uj = x2j ⊕ x2j+1 .
Here yk are the bits that von Neumann’s scheme outputs and both vn−k and un are discarded. Note
that un is important because it encodes the location of the yk and contains a lot of information.
Therefore von Neumann’s scheme can be improved if we can extract the randomness out of both
vn−k and un .
i i
i i
i i
170
Next we (a) verify Ψt is a valid extractor; (b) evaluate its efficiency (rate). Note that the bits that
enter into the iteration are no longer i.i.d. To compute the rate of Ψt , it is convenient to introduce
the notion of exchangeability. We say Xn are exchangeable if the joint distribution is invariant
under permutation, that is, PX1 ,...,Xn = PXπ (1) ,...,Xπ (n) for any permutation π on [n]. In particular, if
Xi ’s are binary, then Xn are exchangeable if and only if the joint distribution only depends on the
Hamming weight, i.e., PXn (xn ) = f(wH (xn )) for some function f. Examples: Xn is iid Ber(p); Xn is
uniform over the Hamming sphere Tk .
As an example, if X2n are i.i.d. Ber(p), then conditioned on L = k, Vn−k is iid Ber(p2 /(p2 + p̄2 )),
since L ∼ Binom(n, 2pp̄) and
pk+2m p̄n−k−2m
P[Yk = y, Un = u, Vn−k = v|L = k] =
n
k(p2 + p̄2 )n−k (2pp̄)k
− 1
n p2 m p̄2 n−k−m
= 2− k · · 2
k p + p̄2 p2 + p̄2
= P[Yk = y|L = k]P[Un = u|L = k]P[Vn−k = v|L = k],
where m = wH (v). In general, when X2n are only exchangeable, we have the following:
Lemma 9.4 (Ψt preserves exchangeability) Let X2n be exchangeable and L = Ψ1 (X2n ).
Then conditioned on L = k, Yk , Un and Vn−k are independent, each having an exchangeable
i.i.d.
distribution. Furthermore, Yk ∼ Ber( 21 ) and Un is uniform over Tk .
Proof. If suffices to show that ∀y, y′ ∈ {0, 1}k , u, u′ ∈ Tk and v, v′ ∈ {0, 1}n−k such that wH (v) =
wH (v′ ), we have
which implies that P[Yk = y, Un = u, Vn−k = v|L = k] = f(wH (v)) for some function f. Note that
the string X2n and the triple (Yk , Un , Vn−k ) are in one-to-one correspondence of each other. Indeed,
to reconstruct X2n , simply read the k distinct pairs from Y and fill them according to the locations of
ones in U and fill the remaining equal pairs from V. [Examples: (y, u, v) = (01, 1100, 01) ⇒ x =
(10010011), (y, u, v) = (11, 1010, 10) ⇒ x′ = (01110100).] Finally, note that u, y, v and u′ , y′ , v′
correspond to two input strings x and x′ of identical Hamming weight (wH (x) = k + 2wH (v)) and
hence of identical probability due to the exchangeability of X2n .
i i
i i
i i
i.i.d.
Lemma 9.5 (Ψt is an extractor) Let X2n be exchangeable. Then Ψt (X2n ) ∼ Ber(1/2)
conditioned on l(Ψt (X2n )) = m.
Proof. Note that Ψt (X2n ) ∈ {0, 1}∗ . It is equivalent to show that for all sm ∈ {0, 1}m ,
Proceed by induction on t. The base case of t = 1 follows from Lemma 9.4 (the distribution of
the Y part). Assume Ψt−1 is an extractor. Recall that Ψt (X2n ) = (Ψ1 (X2n ), Ψt−1 (Un ), Ψt−1 (Vn−k ))
and write the length as L = L1 + L2 + L3 , where L2 ⊥ ⊥ L3 |L1 by Lemma 9.4. Then
P[Ψt (X2n ) = sm ]
Xm
= P[Ψt (X2n ) = sm |L1 = k]P[L1 = k]
k=0
X
m X
m−k
Lemma 9.4 n−k
= P[L1 = k]P[Yk = sk |L1 = k]P[Ψt−1 (Un ) = skk+1 |L1 = k]P[Ψt−1 (V
+r
k+r+1 |L1 = k]
) = sm
k=0 r=0
X
m X
m−k
P[L1 = k]2−k 2−r P[L2 = r|L1 = k]2−(m−k−r) P[L3 = m − k − r|L1 = k]
induction
=
k=0 r=0
i.i.d.
Next we compute the rate of Ψt . Let X2n ∼ Ber(p). Then by the Strong Law of Large
Numbers (SLLN), 2n 1
l(Ψ1 (X2n )) ≜ 2n Ln
converges a.s. to pp̄. Assume, again by induction, that
a. s .
1
2n l (Ψ t−1 ( X 2n
)) −
− → rt− 1 ( p ) , with r1 ( p ) = pq. Then
1 Ln 1 1
l(Ψt (X2n )) = + l(Ψt−1 (Un )) + l(Ψt−1 (Vn−Ln )).
2n 2n 2n 2n
i.i.d. i.i.d. a. s .
Note that Un ∼ Ber(2pp̄), Vn−Ln |Ln ∼ Ber(p2 /(p2 +p̄2 )) and Ln −−→∞. Then the induction hypoth-
a. s . a. s .
esis implies that 1n l(Ψt−1 (Un ))−−→rt−1 (2pp̄) and 2(n−1 Ln ) l(Ψt−1 (Vn−Ln ))−−→rt−1 (p2 /(p2 +p̄2 )). We
obtain the recursion:
1 p2 + p̄2 p2
rt (p) = pp̄ + rt−1 (2pp̄) + rt−1 ≜ (Trt−1 )(p), (9.1)
2 2 p2 + p̄2
where the operator T maps a continuous function on [0, 1] to another. Furthermore, T is mono-
tone in the sense that f ≤ g pointwise then Tf ≤ Tg. Then it can be shown that rt converges
monotonically from below to the fixed point of T, which turns out to be exactly the binary
entropy function h. Instead of directly verifying Th = h, here is a simple proof: Consider
i.i.d.
X1 , X2 ∼ Ber(p). Then 2h(p) = H(X1 , X2 ) = H(X1 ⊕ X2 , X1 ) = H(X1 ⊕ X2 ) + H(X1 |X1 ⊕ X2 ) =
2
h(2pp̄) + 2pp̄h( 12 ) + (p2 + p̄2 )h( p2p+p̄2 ).
The convergence of rt to h are shown in Figure 9.2.
i i
i i
i i
172
1.0
0.8
0.6
0.4
0.2
Figure 9.2 The rate function rt of Ψt (by iterating von Neumann’s extractor t times) versus the binary entropy
function, for t = 1, 4, 10.
1 Finite-state machine (FSM): initial state (red), intermediate states (white) and final states (blue,
output 0 or 1 then reset to initial state).
2 Block simulation: let A0 , A1 be disjoint subsets of {0, 1}k . For each k-bit segment, output 0 if
falling in A0 or 1 if falling in A1 . If neither, discard and move to the next segment. The block
size is at most the degree of the denominator polynomial of f.
The next table gives some examples of f that can be realized with these two architectures. (Exercise:
How to generate f(p) = 1/3?)
It turns out that the only type of f that can be simulated using either FSM or block simulation
√
is rational function. For f(p) = p, which satisfies Keane-O’Brien’s characterization, it cannot
be simulated by FSM or block simulation, but it can be simulated by the so-called pushdown
automata, which is a FSM operating with a stack (infinite memory) [308].
It is unknown how to find the optimal Bernoulli factory with the best rate. Clearly, a converse
is the entropy bound h(hf((pp))) , which can be trivial (bigger than one).
i i
i i
i i
1
1
0
0
f(p) = 1/2 A0 = 10; A1 = 01
1
1 0
0
0 0
1
f(p) = 2pq A0 = 00, 11; A1 = 01, 10 0 1
0
1 1
0 0
0
0
1
1
p3
f(p) = p3 +p̄3
A0 = 000; A1 = 111
0
0
1
1
1 1
i i
i i
i i
where h(·) is the binary entropy. Conclude that for all 0 ≤ k ≤ n we have
(c*) More generally, let X be a finite alphabet, P̂, Q distributions on X , and TP̂ a set of all strings
in X n with composition P̂. If TP̂ is non-empty (i.e. if nP̂(·) is integral) then
and furthermore, both O(log n) terms can be bounded as |O(log n)| ≤ |X | log(n + 1). (Hint:
show that number of non-empty TP̂ is ≤ (n + 1)|X | .)
I.2 (Refined method of types) The following refines Proposition 1.5. Let n1 , . . . , be non-negative
P
integers with i ni = n and let k+ be the number of non-zero ni ’s. Then
n k+ − 1 1 X
log = nH(P̂) − log(2πn) − log P̂i − Ck+ ,
n1 , n2 , . . . 2 2
i:ni >0
i i
i i
i i
where we define xn+1 = x1 (cyclic continuation). Show that 1n Nxn (·, ·) defines a probability
distribution PA,B on X ×X with equal marginals PA = PB . Conclude that H(A|B) = H(B|A).
Is PA|B = PB|A ?
(2)
(b) Let Txn (Markov type-class of xn ) be defined as
(2)
Txn = {x̃n ∈ X n : Nx̃n = Nxn } .
(2)
Show that elements of Txn can be identified with cycles in the complete directed graph G
on X , such that for each (a, b) ∈ X × X the cycle passes Nxn (a, b) times through edge
( a, b) .
(c) Show that each such cycle can be uniquely specified by indentifying the first node and by
choosing at each vertex of the graph the order in which the outgoing edges are taken. From
this and Stirling’s approximation conclude that
(2)
log |Txn | = nH(xT+1 |xT ) + O(log n) , T ∼ Unif([n]) .
I.4 (Maximum entropy) Prove that for any X taking values on N = {1, 2, . . .} such that E[X] < ∞,
1
H(X) ≤ E[X]h ,
E [ X]
maximized uniquely by the geometric distribution. Hint: Find an appropriate Q such that RHS
- LHS = D(PX kQ).
I.5 (Finiteness of entropy) In Exercise I.4 we have shown that the entropy of any N-valued random
variable with finite expectation is finite. Next let us improve this result.
(a) Show that E[log X] < ∞ ⇒ H(X) < ∞.
Moreover, show that the condition of X being integer-valued is not superfluous by giving a
counterexample.
(b) Show that if k 7→ PX (k) is a decreasing sequence, then H(X) < ∞ ⇒ E[log X] < ∞.
Moreover, show that the monotonicity assumption is not superfluous by giving a counterex-
ample.
I.6 (Robust version of the maximal entropy) The maximal differential entropy among all densities
supported on [−b, b] is attained by the uniform distribution. Prove that as ϵ → 0+ we have
where supremization is over all (not necessarily independent) random variables M, Z such that
M + Z possesses a density. (Hint: [162, Appendix C] proves o(1) = O(ϵ1/3 log 1ϵ ) bound.)
I.7 (Maximum entropy under Hamming weight constraint) For any α ≤ 1/2 and d ∈ N,
i i
i i
i i
achieved by the product distribution Y ∼ Ber(α)⊗d . Hint: Find an appropriate Q such that RHS
- LHS = D(PY kQ).
I.8 (Gaussian divergence)
(a) Under what conditions on m0 , Σ0 , m1 , Σ1 is D( N (m1 , Σ1 ) k N (m0 , Σ0 ) ) < ∞?
(b) Compute D(N (m, Σ)kN (0, In )), where In is the n × n identity matrix.
(c) Compute D( N (m1 , Σ1 ) k N (m0 , Σ0 ) ) for non-singular Σ0 . (Hint: think how Gaussian
distribution changes under shifts x 7→ x+a and non-singular linear transformations x 7→ Ax.
Apply data-processing to reduce to previous case.)
I.9 (Water-filling solution) Let M ∈ Rk×n be a fixed matrix, X ⊥⊥ Z ∼ N (0, In ).
(a) Let M = UΛVT be an SVD decomposition, so that U, V are orthogonal matrices and Λ =
diag(λ1 , . . . , λn ) (with rank(M) non-zero λj ’s). Show that
1X + 2
n
max I(X; MX + Z) = log (λi t) ,
PX :E[∥X∥2 ]≤s2 2
i=1
Pn
where log+ x = max(0, log x) and t is determined from solving i=1 |t − λ− i |+ = s .
2 2
1 h i 1X h i n
s2 s2
max I(X; MX + Z|M) = E log det(I + MT M) = E log(1 + σi2 (M)) ,
PX :E[∥X∥2 ]≤s2 2 n 2 n
i=1
where σi (M) are the singular values of M. (Hint: average over rotations as in Section 6.2*)
i.i.d.
Note: In communication theory Mi,j ∼ N (0, 1) (Ginibre ensemble) models a multi-input, multi-
output (MIMO) channel with n transmit and k receive antennas. The matrix MMT is a Wishart
matrix and its spectrum, when n and k grow proportionally, approaches the Marchenko-Pastur
distribution. The important practical consequence is that the capacity of a MIMO channel grows
for high-SNR as 21 min(n, k) log SNR. This famous observation [418] is the reason modern WiFi
and cellular systems employ multiple antennas.
I.11 (Conditional capacity) Consider a Markov kernel PB,C|A : A → B × C , which we will also
(a)
understand as a collection of distributions PB,C ≜ PB,C|A=a . Prove
(a) (a)
inf sup D(PC|B kQC|B |PB ) = sup I(A; C|B) ,
QC|B a∈A PA
whenever supremum on the right-hand side is finite and achievable by some distribution P∗A . In
R
this case, optimal QC|B = P∗C|B is found by disintegrating P∗B,C = PA∗ (da)PB,C . (Hint: follow
(a)
i i
i i
i i
I.15 The Hewitt-Savage 0-1 law states that certain symmetric events have no randomness. Let
{Xi }i≥1 be a sequence be iid random variables. Let E be an event determined by this sequence.
We say E is exchangeable if it is invariant under permutation of finitely many indices in
the sequence of {Xi }’s, e.g., the occurance of E is unchanged if we permute the values of
(X1 , X4 , X7 ), etc.
Let us prove the Hewitt-Savage 0-1 law information-theoretically in the following steps:
P Pn
(a) (Warm-up) Verify that E = { i≥1 Xi converges} and E = {limn→∞ n1 i=1 Xi = E[X1 ]}
are exchangeable events.
(b) Let E be an exchangeable event and W = 1E is its indicator random variable. Show that
for any k, I(W; X1 , . . . , Xk ) = 0. (Hint: Use tensorization (6.2) to show that for arbitrary n,
nI(W; X1 , . . . , Xk ) ≤ 1 bit.)
(c) Since E is determined by the sequence {Xi }i≥1 , we have by continuity of mutual informa-
tion:
i i
i i
i i
Pn
I.17 Suppose Z1 , . . . Zn are independent Poisson random variables with mean λ. Show that i=1 Zi
is a sufficient statistic of (Z1 , . . . Zn ) for λ.
I.18 Suppose Z1 , . . . Zn are independent uniformly distributed on the interval [0, λ]. Show that
max1≤i≤n Zi is a sufficient statistic of (Z1 , . . . Zn ) for λ.
I.19 (Divergence of order statistic) Given xn = (x1 , . . . , xn ) ∈ Rn , let x(1) ≤ . . . ≤ x(n) denote the
ordered entries. Let P, Q be distributions on R and PXn = Pn , QXn = Qn .
(a) Prove that
D(PX(1) ,...,X(n) kQX(1) ,...,X(n) ) = nD(PkQ). (I.3)
(b) Show that
D(Bin(n, p)kBin(n, q)) = nd(pkq).
I.20 (Continuity of entropy on finite alphabet) We have shown that on a finite alphabet entropy is a
continuous function of the distribution. Quantify this continuity by explicitly showing
|H(P) − H(Q)| ≤ h(TV(P, Q)) + TV(P, Q) log(|X | − 1)
for any P and Q supported on X .
Hint: Use Fano’s inequaility and the inf-representation (over coupling) of total variation in
Theorem 7.7(a).
I.21 (a) For any X such that E [|X|] < ∞, show that
(E[X])2
D(PX kN (0, 1)) ≥ nats.
2
(b) For a > 0, find the minimum and minimizer of
min D(PX kN (0, 1)).
PX :EX≥a
C = inf ϵ2 + log N(ϵ) . (I.5)
ϵ≥0
1
N(ϵ) is the minimum number of radius-ϵ (in divergence) balls that cover the set {PY|X=x : x ∈ X }. Thus, log N(ϵ) is a
metric entropy – see Chapter 27.
i i
i i
i i
Comments: These estimates are useful because N(ϵ) for small ϵ roughly speaking depends on
local (differential) properties of the map x 7→ PY|X=x , unlike C which is global.
I.24 Consider the channel PYm |X : [0, 1] 7→ {0, 1}m , where given x ∈ [0, 1], Ym is i.i.d. Ber(x).
(a) Using the upper bound from Exercise I.23 prove
1
C(m) ≜ max I(X; Ym ) ≤ log m + O(1) , m → ∞.
PX 2
Hint: Find a covering of the input space.
(b) Show a lower bound to establish
1
C(m) ≥ log m + o(log m) , m → ∞.
2
Hint: Show that for any ϵ > 0 there exists K(ϵ) such that for all m ≥ 1 and all p ∈ [ϵ, 1 − ϵ]
we have |H(Bin(m, p)) − 12 log m| ≤ K(ϵ).
I.25 This exercise shows other ways of proving Fano’s inequality in its various forms.
(a) Prove (3.15) as follows. Given any P = (Pmax , P2 , . . . , PM ), apply a random permutation π
to the last M − 1 atoms to obtain the distribution Pπ . By comparing H(P) and H(Q), where
Q is the average of Pπ over all permutations, complete the proof.
(b) Prove (3.15) by directly solving the convex optimization max{H(P) : 0 ≤ pi ≤ Pmax , i =
P
1, . . . , M, i pi = 1}.
(c) Prove (3.19) as follows. Let Pe = P[X 6= X̂]. First show that
Notice that the minimum is non-zero unless Pe = Pmax . Second, solve the stated convex
optimization problem. (Hint: look for invariants that the matrix PZ|X must satisfy under
permutations (X, Z) 7→ (π (X), π (Z)) then apply the convexity of I(PX , ·)).
Qn
I.26 Show that PY1 ···Yn |X1 ···Xn = i=1 PYi |Xi if and only if the Markov chain Yi → Xi → (X\i , Y\i )
holds for all i = 1, . . . , n, where X\i = {Xj , j 6= i}.
I.27 (Distributions and graphical models)
(a) Draw all possible directed acyclic graphs (DAGs, or directed graphical models) compatible
with the following distribution on X, Y, Z ∈ {0, 1}:
(
1/6, x = 0, z ∈ {0, 1} ,
PX,Z (x, z) = (I.7)
1/3, x = 1, z ∈ {0, 1}
Y=X+Z (mod2) (I.8)
i i
i i
i i
You may include only the minimal DAGs (recall: the DAG is minimal for a given
distribution if removal of any edge leads to a graphical model incompatible with the
distribution).2
Qn
(b) Draw the DAG describing the set of distributions PXn Yn satisfying PYn |Xn = i=1 PYi |Xi .
(c) Recall that two DAGs G1 and G2 are called equivalent if they have the same vertex sets and
each distribution factorizes w.r.t. G1 if and only if it does so w.r.t. G2 . For example, it is
well known
X→Y→Z ⇐⇒ X←Y←Z ⇐⇒ X ← Y → Z.
X1 → X2 → · · · → Xn → · · ·
X1 ← X2 ← · · · ← Xn ← · · ·
A→B→C
=⇒ A ⊥
⊥ (B, C)
A→C→B
I( A; C ) = I( B; C ) = 0 =⇒ I(A, B; C) = 0 . (I.9)
2
Note: {X → Y}, {X ← Y} and {X Y} are the three possible directed graphical modelss for two random variables. For
example, the third graph describes the set of distributions for which X and Y are independent: PXY = PX PY . In fact, PX PY
factorizes according to any of the three DAGs, but {X Y} is the unique minimal DAG.
i i
i i
i i
I.31 Find the entropy rate of a stationary ergodic Markov chain with transition probability matrix
1 1 1
2 4 4
P= 0 1
2
1
2
1 0 0
I.32 (Solvable HMM) Similar to the Gilbert-Elliott process (Example 6.3) let Sj ∈ {±1} be a
stationary two-state Markov chain with P[Sj = −Sj−1 |Sj−1 ] = 1 − P[Sj = Sj−1 |Sj−1 ] = τ . Let
iid
Ej ∼ Ber(δ), with Ej ∈ {0, 1} and let Xj = BECδ (Sj ) be the observation of Sj through the binary
erasure channel (BEC) with erasure probability δ , i.e. Xj = Sj Ej . Find entropy rate of Xj (you
can give answer in the form of a convergent series). Evaluate at τ = 0.11, δ = 1/2 and compare
with H(X1 ).
I.33 Consider a binary symmetric random walk Xn on Z that starts at zero. In other words, Xn =
Pn
j=1 Bj , where (B1 , B2 , . . .) are independent and equally likely to be ±1.
(a) When n 1 does knowing X2n provide any information about Xn ? More exactly, prove
i i
i i
i i
I.37 Show that map P 7→ D(PkQ) is strongly convex, i.e. for all λ ∈ [0, 1] and all P0 , P1 , Q we have
λD(P1 kQ) + λ̄D(P0 kQ) − D(λP1 + λ̄P0 kQ) ≥ 2λλ̄TV(P0 , P1 )2 log e .
Whenever the LHS is finite, derive the explicit form of a unique minimizer R.
I.40 For an f-divergence, consider the following statements:
(i) If If (X; Y) = 0, then X ⊥
⊥ Y.
(ii) If X − Y − Z and If (X; Y) = If (X; Z) < ∞, then X − Z − Y.
Recall that f : (0, ∞) → R is a convex function with f(1) = 0.
(a) Choose an f-divergence which is not a multiple of the KL divergence (i.e., f cannot be of
form c1 x log x + c2 (x − 1) for any c1 , c2 ∈ R). Prove both statements for If .
(b) Choose an f-divergence which is non-linear (i.e., f cannot be of form c(x − 1) for any c ∈ R)
and provide examples that violate (i) and (ii).
(c) Choose an f-divergence. Prove that (i) holds, and provide an example that violates (ii).
I.41 (Hellinger and interactive protocols [31]) In the area of interactive communication Alice has
access to X and outputs bits Ai , i ≥ 1, whereas Bob has access to Y and outputs bits Bi , i ≥ 1.
The communication proceeds in rounds, so that at i-th round Alice and Bob see the previous
messages of each other. This means that conditional distribution of the protocol is given by
Y
n
PAn ,Bn |X,Y = PAi |Ai−1 ,Bi−1 ,X PBi |Ai−1 ,Bi−1 ,Y .
i=1
i i
i i
i i
(b) H2 (Πx,y , Πx′ ,y ) + H2 (Πx,y′ , Πx′ ,y′ ) ≤ 2H2 (Πx,y , Πx′ ,y′ )
I.42 (Chain rules I)
(a) Show using (I.11) and the chain rule for KL that
X
n
(1 − α)Dα (PXn kQXn ) ≥ inf(1 − α)Dα (PXi |Xi−1 =a kQXi |Xi−1 =a )
a
i=1
where Pi = PXi QXni+1 |Xi , with Pn = PXn and P0 = QXn . The identity above shows how
KL-distance from PXn to QXn can be traversed by summing distances between intermediate
Pi ’s.
(b) Using the same path and triangle inequality show that
X
n
TV(PXn , QXn ) ≤ EPXi−1 TV(PXi |Xi−1 , QXi |Xi−1 )
i=1
See also [230, Theorem 7] for a deeper result, where for a universal C > 0 it is shown that
X
n
H2 (PXn , QXn ) ≤ C EPXi−1 H2 (PXi |Xi−1 , QXi |Xi−1 ) .
i=1
Prove that
Dm (PkQ) = inf{E[P[X 6= Y|Y]2 ] : PX = P, PY = Q}
PXY
where the infimum is over all couplings. (Hint: For one direction use the same coupling
achieving TV. For the other direction notice that P[X 6= Y|Y] ≥ 1 − QP((YY)) .)
i i
i i
i i
Prove that
I.45 (Center of gravity under f-divergences) Recall from Corollary 4.2 the fact that
minQY D(PY|X kQY |PX ) = I(X; Y) achieved at QY = PY . Prove the following versions for other
f-divergences:
(a) Suppose that for PX -a.e. x, PY|X=x μ with density p(y|x).3 Then
Z q 2
inf χ2 (PY|X kQY |PX ) = μ(dy) E[pY|X (y|X)2 ] − 1. (I.12)
QY
p
If the right-hand side is finite, the minimum is achieved at QY (dy) ∝ E[p(y|X)2 ] μ(dy).
(b) Show that
Z − 1
1
inf χ (QY kPY|X |PX ) =
2
μ(dy) − 1, (I.13)
QY g ( y)
where g(y) ≜ E[pY|X (y|X)−1 ] and we use agreement 1/0 = ∞ for all reciprocals. If the right-
hand side is finite, then the minimum is achieved by QY (dy) ∝ g(1y) 1{g(y) < ∞} μ(dy).
(c) Show that
Z
inf D(QY kPY|X |PX ) = − log μ(dy) exp(E[log p(y|X)]). (I.14)
QY
If the right-hand side is finite, the minimum is achieved at QY (dy) ∝ exp(E[log p(y|X)]) μ(dy).
Note: This exercise shows that the center of gravity with respect to other f-divergences need not
be PY but its reweighted version. For statistical applications, see Exercises VI.6, VI.9, and VI.10,
where (I.12) and (I.13) are used to determine the form of the Bayes estimator.
I.46 (DPI for Fisher information) Let pθ (x, y) be a smoothly parametrized family of densities on
X ⊗ Y (with respect to some reference measure μX ⊗ μY ) where θ ∈ Rd . Let JXF,Y (θ) denote
the Fisher information matrix of the joint distribution and similarly JXF (θ), JYF (θ) those of the
marginals.
(a) (Monotonicity) Assume the interchangeability of derivative and integral, namely,
R
∇θ pθ (y) = μX (dx)∇θ pθ (x, y) for every θ, y. Show that JYF (θ) JXF,Y (θ).
(b) (Data processing inequality) Suppose, in addition, that θ → X → Y. (In other words, pθ (y|x)
does not depend on θ.) Then JYF (θ) JXF (θ), with equality if Y is a sufficient statistic of X
for θ.
3
Note that the results do not depend on the choice of μ, so we can take for example μ = PY , in view of Lemma 3.3.
i i
i i
i i
I.47 (Fisher information inequality) Consider real-valued A ⊥ ⊥ B with differentiable densities and
finite (location) Fisher informations J(A), J(B). Then Stam’s inequality [399] shows
1 1 1
≥ + . (I.15)
J(A + B) J(A) J(B)
(a) Show that Stam’s inequality is equivalent to (a + b)2 J(A + B) ≤ a2 J(A) + b2 J(B) for all
a, b > 0.
(b) Let X1 = aθ+ A, X2 = bθ+ B. This defines a family of distributions of (X1 , X2 ) parametrized
by θ ∈ R. Show that its Fisher information is given by JF (θ) = a2 J(A) + b2 J(B).
(c) Let Y = X1 + X2 and assume that conditions for the applicability of the DPI for Fisher
information (Exercise I.46) hold. Conclude the proof of (I.15).
Note: A simple sufficient condition that implies (I.15) is that densities of A and B are everywhere
strictly positive on R. For a direct proof in this case, see [58].
I.48 The Ingster-Suslina formula [225] computes the χ2 -divergence between a mixture and a sim-
ple distribution, exploiting the second moment nature of χ2 . Let Pθ be a family of probability
distributions on X parameterized by θ ∈ Θ. Each distribution (prior) π on Θ induces a mixture
R
Pπ ≜ Pθ π (dθ). Assume that Pθ ’s have a common dominating distribution Q.
(a) Show that
i i
i i
i i
(b) Assume that n is even and σ is uniformly distributed over the set of bisections {z ∈ {±1}n :
Pn
i=1 zi = 0}, so that the two communities are equally sized. Show that (I.16) continues to
hold.
Note: As a consequence of (I.16), we have the contiguity SBM(n, p, q) ◁ ER(n, p+2 q ) whenever
τ < 1. In fact, they are mutually contiguous if and only if τ < 1. This much more difficult
result can be shown using the method of small subgraph conditioning developed by [364, 227];
cf. [307, Section 5].
I.50 (Sampling without replacement I [400]) Consider two ways of generating a random vector
Xn = (X1 , . . . , Xn ): Under P, Xn are sampled from the set [n] = {1, . . . , n} without replacement;
under Q, Xn are sampled from [n] with replacement. Let’s compare the joint distribution of the
first k draws X1 , . . . , Xk for some 1 ≤ k ≤ n.
(a) Show that
k! n
TV(PXk , QXk ) = 1 −
nk k
k! n
D(PXk kQXk ) = − log k .
n k
√
Conclude that D and TV are o(1) iff k = o( n).
√
(b) Explain the specialness of n by find an explicit test that distinguishes P and Q with high
√
probability when k n. Hint: birthday paradox.
I.51 (Sampling without replacement II [400]) Let X1 , . . . , Xk be a random sample of balls without
Pq
replacement from an urn containing ai balls of color i ∈ [q], i=1 ai = n. Let QX (i) = ani . Show
that
k2 (q − 1) log e
D(PXk kQkX ) ≤ c , c= .
(n − 1)(n − k + 1) 2
Let Rm,b0 ,b1 be the distribution of the number of 1’s in the first m ≤ b0 + b1 coordinates of a
randomly permuted binary strings with b0 zeros and b1 ones.
(a) Show that
X
q
ai − V i ai − V i
D(PXm+1 |Xm kQX |PXm ) = E[ log ],
N−m pi (N − m)
i=1
i i
i i
i i
Pm
where i=1 λi = 1, λi ≥ 0 and Qi are some distributions on X and c > 0 is a universal constant.
Follow the steps:
(a) Show the identity (here PXk is arbitrary)
Y
k Xk− 1
D PXk PXj = I(Xj ; Xj+1 ).
j=1 j=1
(b) Show that there must exist some t ∈ {k, k + 1, . . . , n} such that
H(Xk−1 )
I(Xk−1 ; Xk |Xnt+1 ) ≤ .
n−k+1
(Hint: Expand I(Xk−1 ; Xnk ) via chain rule.)
(c) Show from 1 and 2 that
Y kH(Xk−1 )
D PXk |T PXj |T |PT ≤
n−k+1
n
where T = Xt+1 .
(d) By Pinsker’s inequality
h i r
Y kH(Xk−1 )|X | 1
ET TV PXk |T , PXj |T ≤ c , c= p .
n−k+1 2 log e
Conclude (I.17) by the convexity of total variation.
Note: Another estimate [400, 123] is easy to deduce from Exercise I.51 and Exercise I.50: there
exists a mixture of iid QXk such that
k
min(2|X |, k − 1) .
TV(QXk , PXk ) ≤
n
The bound (I.17) improves the above only when H(X1 ) ≲ 1.
I.53 (Wringing lemma [140, 419]) Prove that for any δ > 0 and any (Un , Vn ) there exists an index
n n
set I ⊂ [n] of size |I| ≤ I(U δ;V ) such that
I(Ut ; Vt |UI , VI ) ≤ δ ∀ t ∈ [ n] .
When I(Un ; Vn ) n, this shows that conditioning on a (relatively few) entries, one can make
individual coordinates almost independent. (Hint: Show I(A, B; C, D) ≥ I(A; C) + I(B; D|A, C)
first. Then start with I = ∅ and if there is any index t s.t. I(Ut ; Vt |UI , VI ) > δ then add it to I and
repeat.)
I.54 (Generalization gap = ISKL , [18]) A learning algorithm selects a parameter W based on observing
(not necessarily independent) S1 , . . . , Sn , where all Si have a common marginal law PS , with the
goal of minimizing the loss on a fresh sample = E[ℓ(W, S)], where Sn ⊥ ⊥ S ∼ PS and ℓ is an
4
arbitrary loss function . Consider a Gibbs sampler (see Section 4.8.2) which chooses
αX
n
1
W ∼ PW|Sn (w|sn ) = n
π (w) exp{− ℓ(w, si )} ,
Z( s ) n
i=1
4
For example, if S = (X, Y) we may have ℓ(w, (x, y)) = 1{fw (x) 6= y} where fw denotes a neural network with weights w.
i i
i i
i i
where π (·) is a fixed prior on weights and Z(·) the normalization constant. Show that general-
ization gap of this algorithm is given by
1X
n
1
E[ℓ(W, S)] − E[ ℓ(W, Si )] = ISKL (W; Sn ) ,
n α
i=1
P
I.57 (Divergence for mixtures [216, 249]) Let Q̄ = i π i Qi be a mixture distribution.
(a) Prove
!
X
D(PkQ̄) ≤ − log π i exp(−D(PkQi )) ,
i
P
improving over the simple convexity estimate D(PkQ̄) ≤ i π i D(PkQi ). (Hint: Prove that
the function Q 7→ exp{−aD(PkQ)} is concave for every a ≤ 1.)
(b) Furthermore, for any distribution {π̃ j }, any λ ∈ [0, 1] we have
X X X
π̃ j D(Qj kQ̄) + D(π kπ̃ ) ≥ − π i log π̃ j e−(1−λ)Dλ (Pi ∥Pj )
j i j
X
≥ − log π i π̃ j e−(1−λ)Dλ (Pi ∥Pj )
i,j
′
(Hint: Prove D(PA|B=b kQA ) ≥ − EA|B=b [log EA′ ∼QA gg((AA,,bb)) ] via Donsker-Varadhan. Plug in
g(a, b) = PB|A (b|a)1−λ , average over B and use Jensen to bring outer EB|A inside the log.)
I.58 (Mutual information and pairwise distances [216]) Suppose we have knowledge of pairwise
distances dλ (x, x′ ) ≜ Dλ (PY|X=x kPY|X=x′ ), where Dλ is the Rényi divergence of order λ. What
i.i.d.
can be said about I(X; Y)? Let X, X′ ∼ PX . Using Exercise I.57, prove that
I(X; Y) ≤ − E[log E[exp(−d1 (X, X′ ))|X]]
and for every λ ∈ [0, 1]
I(X; Y) ≥ − E[log E[exp(−(1 − λ)dλ (X, X′ ))|X]].
See Theorem 32.5 for an application.
i i
i i
i i
I.59 (D ≲ H2 log H12 trick) Show that for any P, U, R, λ > 1, and 0 < ϵ < 2−5 λ−1 we have
λ
λ 1
D(PkϵU + ϵ̄R) ≤ 8(H (P, R) + 2ϵ)
2
log + Dλ (PkU) .
λ−1 ϵ
Thus, a Hellinger ϵ-net for a set of P’s can be converted into a KL (ϵ2 log 1ϵ )-net; see
Theorem 32.6 in Section 32.2.4.
−1
(a) Start by proving the tail estimate for the divergence: For any λ > 1 and b > e(λ−1)
dP dP log b
EP log · 1{ > b} ≤ λ−1 exp{(λ − 1)Dλ (PkQ)}
dQ dQ b
(b) Show that for any b > 1 we have
b log b dP dP
D(PkQ) ≤ H2 (P, Q) √ + EP log · 1{ > b}
( b − 1)2 dQ dQ
h(x)
(Hint: Write D(PkQ) = EP [h( dQ
dP )] for h(x) = − log x + x − 1 and notice that
√
( x−1)2
is
monotonically decreasing on R+ .)
(c) Set Q = ϵU + ϵ̄R and show that for every δ < e− λ−1 ∧ 14
1
1
D(PkQ) ≤ 4H2 (P, R) + 8ϵ + cλ ϵ1−λ δ λ−1 log ,
δ
where cλ = exp{(λ − 1)Dλ (PkU). (Notice H2 (P, Q) ≤ H2 (P, R) + 2ϵ, Dλ (PkQ) ≤
Dλ (PkU) + log 1ϵ and set b = 1/δ .)
2
(d) Complete the proof by setting δ λ−1 = 4H c(λPϵ,λ−
R)+2ϵ
1 .
I.60 Let G = (V, E) be a finite directed graph. Let
4 = (x, y, z) ∈ V3 : (x, y), (y, z), (z, x) ∈ E ,
∧ = (x, y, z) ∈ V3 : (x, y), (x, z) ∈ E .
Prove that 4 ≤ ∧.
Hint: Prove H(X, Y, Z) ≤ H(X) + 2H(Y|X) for random variables (X, Y, Z) distributed uniformly
over the set of directed 3-cycles, i.e. subsets X → Y → Z → X.
I.61 (Union-closed sets conjecture (UCSC)) Let X and Y be independent vectors in {0, 1}n .
Show [88]
p̄
H(X OR Y) ≥ (H(X) + H(Y)) , p̄ ≜ min min(P[Xi = 0], P[Yi = 0]) ,
2ϕ i
√
where OR denotes coordinatewise logical-OR and ϕ = 52−1 . (Hint: set Z = X OR Y, use chain
P
rule H(Z) ≥ i H(Zi |Xi−1 , Yi−1 ), and the inequality for binary-entropy h(ab) ≥ h(a)b2+ϕh(b)a ).
Comment: F ⊂ {0, 1}n is called a union-closed set if x, y ∈ F =⇒ (x OR y) ∈ F . The UCSC
states that p = maxi P[Xi = 1] ≥ 1/2, where X is uniform over F . Gilmer’s method [189]
applies the inequality above to Y taken to be an independent copy of X (so that H(X OR Y) ≤
H(X) = H(Y) = log |F|) to prove that p ≥ 1 − ϕ ≈ 0.382.
i i
i i
i i
I.62 (Compression for regression) Let Y ∈ [−1; 1] and X ∈ X with X being finite (for simplic-
ity). Auxiliary variables U, U′ in this exercise are assumed to be deterministic functions of X.
For simplicity assume X is finite (but giant). Let cmp(U) be a complexity measure satisfying
cmp(U, U′ ) ≤ cmp(U) + cmp(U′ ), cmp(constant) = 0 and cmp(Ber(p)) ≤ log 2 for any p
(think of H(U) or log |U|). Choose U to be a maximizer of I(U; Y) − δ cmp(U).
(a) Show that cmp(U) ≤ I(Xδ;Y)
(b) For any U′ show I(Y; U′ |U) ≤ δ cmp(U′ ) (Hint: check U′′ = (U, U′ ))
(c) For any event S = {X ∈ A} show
√
|E[(Y − E[Y|U])1S ]| ≤ 2δ ln 2 (I.19)
(Hint: by Cauchy-Schwarz only need to show E[| E[Y|U, 1S ]− E[Y|U]|2 ] ≲ δ , which follows
by taking U′ = 1S in b) and applying Tao’s inequality (7.28))
(d) By choosing a proper S and applying above to S and Sc conclude that
√
E[| E[Y|X] − E[Y|U]|] ≤ 2 2δ ln 2 .
(So any high-dimensional complex feature vector X can be compressed down to U whose car-
dinality is of order I(Y; X) (and independent of |X |) but which, nevertheless, is essentially as
good as X for regression; see [51] for other results on information distillation.)
I.63 (IT version of Szemerédi regularity [414]) Fix ϵ, m > 0 and consider random variables Y, X =
(X1 , X2 ) with Y ∈ [−1, 1], X = X1 × X2 finite (but giant) and I(X; Y) ≤ m. In this excercise,
all auxiliary random variables U have structure U = (f1 (X1 ), f2 (X2 )) for some deterministic
functions f1 , f2 . Thus U partitions X into product blocks and we call block U = u0 ϵ-regular if
|E[(Y − E[Y|U])1S |U = u0 ]| ≤ ϵ ∀S = {X1 ∈ A1 , X2 ∈ A2 } .
We will show there is J = J(ϵ, m) such that there exists a U with |U| ≤ J and such that
P[block U is not ϵ-regular] ≤ ϵ . (I.20)
(a) Suppose that we found random variables Y → X → U′ → U such that (i) I(Y; U′ |U) ≤ ϵ4
4
and (ii) for all S as above I(Y; 1S |U′ ) ≤ |Uϵ |2 . Then (I.20) holds with ϵ replaced by O(ϵ).
(Hint: define g(u0 ) = E[| E[Y|U′ ] − E[Y|U]| |U = u0 ] and show via (7.28) that E[g(U)] ≲ ϵ2 .
ϵ2
As in (I.19) argue that E[(Y − E[Y|U′ ])1S ] ≲ |U | . From triangle inequality any u0 -block is
O(ϵ)-regular whenever g(u0 ) < ϵ and P[U = u0 ] > |Uϵ | . Finally, apply Markov inequality
twice to show that the last condition is violated with O(ϵ) probability.)
(b) Show that such U′ , U indeed exist. (Hint: Construct a sequence Y → X → · · · Uj → Uj−1 →
· · · U0 = 0 sequentially by taking Uj+1 to be maximizer of I(Y; U) − δj+1 log |U| among all
4
Y → X → U → Uj (compare Exercise I.62) and δj+1 = |Uϵj |2 . We take U′ , U = Un+1 , Un
for the first pair that has I(Y; Un+1 |Un ) ≤ ϵ4 . Show n ≤ ϵm4 and |Un | is bounded by the n-th
iterate of map h → h exp{mh2 /ϵ4 } started from h = 1.)
Remark: The point is that J does not depend on PX,Y or |X |. For Szemerédi’s regularity lemma
one takes X1 , X2 to be uniformly sampled vertices of a bipartite graph and Y = 1{X1 ∼ X2 } is
the incidence relation. An ϵ-regular block corresponds to an ϵ-regular bipartite subgraph, and
lemma decomposes arbitrary graph into finitely many pairwise (almost) regular subgraphs.
i i
i i
i i
I.64 (Entropy and binary convolution) Binary convolution is defined for (a, b) ∈ [0, 1]2 by a ∗ b =
a(1 − b) + (1 − a)b and describes the law of Ber(a) ⊕ Ber(b) where ⊕ denotes modulo-2
addition.
(a) (Mrs. Gerber’s lemma, MGL5 ) Let (U, X) ⊥
⊥ Z ∼ Ber(δ) with X ∈ {0, 1}. Show that
H(X|U)H(Z)
h(h−1 (H(X|U)) ∗ δ) ≤ H(X ⊕ Z|U) ≤ H(X|U) + H(Z) − .
log 2
(Hint: equivalently [457], need to show that a parametric curve (h(p), h(p ∗ δ)), p ∈ [0, 1/2]
is convex.)
(b) Show that for any p, q the parametric curve ((1 − 2r)2 , d(p ∗ rkq ∗ r)), r ∈ [0, 1/2] is convex.
(Hint: see [367, Appendix A])
MGL has various applications (Example 16.1 and Exercise VI.21), it tensorizes (see Exer-
cise III.32) and its infinitesimal version (derivative in δ = 0+) is exactly the log-Sobolev
inequality for the hypercube [122, Section 4.1].
I.65 (log-Sobolev inequality, LSI) Let X be a Rd -valued random variable, E[kXk2 ] < ∞, and
X ⊥⊥ Z ∼ γ , where γ = N (0, Id ) is the standard Gaussian measure. Recall the notation for
Fisher information matrix J(·) from (2.40).
(a) Show de Bruijn’s identity:
d √ log e √
h(X + aZ) = tr J(X + aZ)
da 2
(Hint: inspect the proof of Theorem 3.14)
(b) Show that EPI implies
d √
exp{2h(X + aZ)/d} ≥ 2πe .
da
(c) Conclude that Gaussians minimize the differential entropy among all X with bounded Fisher
information J(X), namely [399]
n 2πen
h(X) ≥
log .
2 tr J(X)
R
(d) Show the LSI of Gross [200]: For any f with f2 dγ = 1, we have
Z Z
f2 ln(f2 )dγ ≤ 2 · k∇fk2 dγ .
R
(Hint: PX (dx) = f2 (x)γ(dx), prove 2 (xT ∇f)fγ(dx) = E[kXk2 ] − d and use ln(1 + y) ≤ y.)
I.66 (Stochastic localization [148, 146]) Consider a discrete X ∼ μ taking values in Rn and let ρ =
Pn
E[kX − E[X]k2 ] = i=1 Var[Xi ]. We will show that for any ϵ > 0 there exists a decomposition
of μ = Eθ μθ as a mixture of measures μθ , which have similar entropy ( Eθ [H( μθ )] = H( μ) −
O(ρ/ϵ)) but have almost no pairwise correlations (Eθ [Cov( μθ )] ϵIn and Eθ [k Cov( μθ )k2F ] =
O(ϵρ)). This has useful applications in statistical physics of Ising models.
5
Apparently, named after a landlady renting a house to Wyner and Ziv [457] at the time.
i i
i i
i i
√ √
(a) Let Yt = tX + ϵZ, where X ⊥ ⊥ Z ∼ N (0, Id ) and t, ϵ > 0. Show that Cov(X|Yt ) ϵt In
√
(Hint: consider the suboptimal estimator X̂(Yt ) = Yt / t).
(b) Show that 0 ≤ H(X) − H(X|Yt ) ≤ n2 log(1 + ϵtn ρ) ≤ t log e
2Rϵ ρ. (Hint: use (5.22))
2
(c) Show that ρ ≥ mmse(X|Y1 ) − mmse(X|Y2 ) = 1ϵ 1 E[kΣt (Yt )k2F ]dt, where Σt (y) =
Cov[X|Yt = y]. (Hint: use (3.23)).
Thus we conclude that for some t ∈ [1, 2] decomposing μ = EYt PX|Yt satisfies the stated claims.
i i
i i
i i
Part II
i i
i i
i i
i i
i i
i i
195
• Variable-length lossless compression. Here we require P[X 6= X̂] = 0, where X̂ is the decoded
version. To make the question interesting, we compress X into a variable-length binary string. It
will turn out that optimal compression length is H(X) − O(log(1 + H(X))). If we further restrict
attention to so-called prefix-free or uniquely decodable codes, then the optimal compression
length is H(X) + O(1). Applying these results to n-letter variables X = Sn we see that optimal
compression length normalized by n converges to the entropy rate (Section 6.3) of the process
{Sj }.
6
Of course, one should not take these “laws” too far. In regards to language modeling, (finite-state) Markov assumption is
too simplistic to truly generate all proper sentences, cf. Chomsky [94]. However, astounding success of modern high-order
Markov models, such as GPT-4 [320], shows that such models are very difficult to distinguish from true language.
i i
i i
i i
196
• Fixed-length, almost lossless compression. Here, we allow some very small (or vanishing with
n → ∞ when X = Sn ) probability of error, i.e. P[X 6= X̂] ≤ ϵ. It turns out that under mild
assumptions on the process {Sj }, here again we can compress to entropy rate but no more.
This mode of compression permits various beautiful results in the presence of side-information
(Slepian-Wolf, etc).
• Lossy compression. Here we require only E[d(X, X̂)] ≤ ϵ where d(·, ·) is some loss function.
This type of compression problems is the topic of Part V.
We also note that thinking of the X = Sn , it would be more correct to call the first two com-
pression types above as “fixed-to-variable” and “fixed-to-fixed”, because they take fixed number
of input letters and produce variable or fixed number of output bits. There exists other types of
compression algorithms, which we do not discuss, e.g. a beautiful variable-to-fixed compressor
of Tunstall [425].
i i
i i
i i
10 Variable-length compression
In this chapter we consider a basic question: how does one describe a discrete random variable
X ∈ X in terms of a variable-length bit string so that the description is the shortest possible. The
basic idea, already used in the telegraph’s Morse code, is completely obvious: shorter descriptions
(bit strings) should correspond to more probable symbols. Later, however, we will see that this
basic idea becomes a lot more subtle once we take X to mean a group of symbols. The discovery
of Shannon was that compressing groups of symbols together (even if they are iid!) can lead
to impressive savings in compressed length. That is, coding English text by first grouping 10
consecutive characters together is much better than doing so on a character-by-character basis. One
should appreciate boldness of Shannon’s proposition since sorting all possible 2610 realizations of
the 10-letter English chunks in the order of their decreasing frequency appears quite difficult. It is
only later, with the invention of Huffman coding, arithmetic coding and Lempel-Ziv compressors
(decades after) that these methods became practical and ubiquitous.
In this Chapter we discover that the minimal compression length of X is essentially equal to the
entropy H(X) for both the single-shot, uniquely-decodable and prefix-free codes. These results
are the first examples of coding theorems in our book, that is results connecting an operational
problem and an information measure. (For this reason, compression is also called source coding
in information theory.) In addition, we also discuss the so called Zipf law and how its widespread
occurrence can be described information-theoretically.
X Compressor
{0, 1}∗ Decompressor X
f: X →{0,1}∗ g: {0,1}∗ →X
1 It maps each symbol x ∈ X into a variable-length string f(x) in {0, 1}∗ ≜ ∪k≥0 {0, 1}k =
{∅, 0, 1, 00, 01, . . . }. Each f(x) is referred to as a codeword and the collection of codewords the
codebook.
197
i i
i i
i i
198
PX (i)
i
1 2 3 4 5 6 7 ···
∗
f
∅ 0 1 00 01 10 11 ···
2 It is lossless for X: there exists a decompressor g : {0, 1}∗ → X such that P [X = g(f(X))] = 1.
In other words, f is injective on the support of PX .
Notice that since {0, 1}∗ is countable, lossless compression is only possible for discrete X. Also,
since the structure of X is not important, we can relabel X such that X = N = {1, 2, . . . } and
sort the PMF decreasingly: PX (1) ≥ PX (2) ≥ · · · . In a single-shot compression setting, cf. [251],
we do not impose any additional constraints on the map f. Later in Section 10.3 we will introduce
conditions such as prefix-freeness and unique-decodability.
To quantify how good a compressor f is, we introduce the length function l : {0, 1}∗ → Z+ , e.g.,
l(∅) = 0, l(01001) = 5. We could consider different objectives for selecting the best compressor f,
for example, minimizing any of E[l(f(X))], esssup l(f(X)), median[l(f(X))] appears reasonable. It
turns out that there is a compressor f∗ that minimizes all objectives simultaneously. As mentioned
in the preface to this chapter, the main idea is to assign longer codewords to less likely symbols,
and reserve the shorter codewords for more probable symbols. To make precise of the optimality
of f∗ , let us recall the concept of stochastic dominance.
st.
By definition, X ≤ Y if and only if the CDF of X is larger than that of Y pointwise; in other words,
the distribution of X assigns more probability to lower values than that of Y does. In particular, if
X is dominated by Y stochastically, so are their means, medians, supremum, etc.
Theorem 10.2 (Optimal f∗ ) Consider the compressor f∗ defined (for a down-sorted PMF
PX ) by f∗ (1) = ∅, f∗ (2) = 0, f∗ (3) = 1, f∗ (4) = 00, etc, assigning strings with increasing lengths
to symbols i ∈ X . (See Figure 10.1 for an illustration.) Then
i i
i i
i i
1 Length of codeword:
2 l(f∗ (X)) is stochastically the smallest: For any lossless compressor f : X → {0, 1}∗ ,
st.
l(f∗ (X)) ≤ l(f(X))
i.e., for any k, P[l(f(X)) ≤ k] ≤ P[l(f∗ (X)) ≤ k]. As a result, E[l(f∗ (X))] ≤ E[l(f(X))].
Here the inequality is because f is lossless so that |Ak | can at most be the total number of binary
strings of length up to k. Then
X X
P[l(f(X)) ≤ k] = PX (x) ≤ PX (x) = P[l(f∗ (X)) ≤ k], (10.1)
x∈Ak x∈A∗
k
since |Ak | ≤ |A∗k | and A∗k contains all 2k+1 − 1 most likely symbols.
Having identified the optimal compressor the next question is to understand its average com-
pression length E[ℓ(f∗ (X))]. It turns out that one can in fact compute it exactly as an infinite series,
see Exercise II.1. However, much more importantly, it turns out to be essentially equal to H(X).
Specifically, we have the following result.
Remark 10.1 (Source coding theorem) Theorem 10.3 is the first example of a coding
theorem in this book, which relates the fundamental limit E[l(f∗ (X))] (an operational quantity) to
the entropy H(X) (an information measure).
Proof. Define L(X) = l(f∗ (X))). For the upper bound, observe that since the PMF are ordered
decreasingly by assumption, PX (m) ≤ 1/m, so L(m) ≤ log2 m ≤ log2 (1/PX (m)). Taking
expectation yields E[L(X)] ≤ H(X).
For the lower bound,
( a)
H(X) = H(X, L) = H(X|L) + H(L) ≤ E[L] + H(L)
(b) 1
≤ E [ L] + h (1 + E[L])
1 + E[L]
i i
i i
i i
200
1
= E[L] + log2 (1 + E[L]) + E[L] log2 1 + (10.2)
E [ L]
( c)
≤ E[L] + log2 (1 + E[L]) + log2 e
(d)
≤ E[L] + log2 (e(1 + H(X)))
where in (a) we have used the fact that H(X|L = k) ≤ k bits, because f∗ is lossless, so that given
f∗ (X) ∈ {0, 1}k , X can take at most 2k values; (b) follows by Exercise I.4; (c) is via x log(1 + 1/x) ≤
log e, ∀x > 0; and (d) is by the previously shown upper bound H(X) ≤ E[L].
To give an illustration, we need to introduce an important method of going from a single-letter
i.i.d.
source to a multi-letter one, already alluded to in the preface. Suppose that Sj ∼ PS (this is called a
memoryless source). We can group n letters of Sj together and consider X = Sn as one super-letter.
Applying our results to random variable X we obtain:
nH(S) ≥ E[l(f∗ (Sn ))] ≥ nH(S) − log2 n + O(1).
In fact for memoryless sources, the exact asymptotic behavior is found in [408, Theorem 4]:
(
∗ n nH(S) + O(1) , PS = Unif
E[l(f (S ))] = .
nH(S) − 2 log2 n + O(1) , PS 6= Unif
1
1
For the case of sources for which log2 PS has non-lattice distribution, it is further shown in [408,
Theorem 3]:
1
E[l(f∗ (Sn ))] = nH(S) − log2 (8πeV(S)n) + o(1) , (10.3)
2
where V(S) is the varentropy of the source S:
1
V(S) ≜ Var log2 . (10.4)
PS (S)
Theorem 10.3 relates the mean of l(f∗ (X)) to that of log2 PX1(X) (entropy). It turns out that
distributions of these random variables are also closely related.
Proof. Lower bound (achievability): Use PX (m) ≤ 1/m. Then similarly as in Theorem 10.3,
L(m) = blog2 mc ≤ log2 m ≤ log2 PX 1(m) . Hence L(X) ≤ log2 PX1(X) a.s.
Upper bound (converse): Consider, the following chain
1 1
P [L ≤ k] = P L ≤ k, log2 ≤ k + τ + P L ≤ k, log2 >k+τ
PX (X) PX (X)
i i
i i
i i
X
1
≤ P log2 ≤k+τ + PX (x)1 {l(f∗ (x)) ≤ k}1 PX (x) ≤ 2−k−τ
PX (X)
x∈X
1
≤ P log2 ≤ k + τ + (2k+1 − 1) · 2−k−τ
PX (X)
Corollary 10.5 Let (S1 , S2 , . . .) be a random process and U, V real-valued random variable.
Then
1 1 d 1 ∗ n d
log2 →U
− ⇔ l(f (S ))−
→U (10.5)
n PSn (Sn ) n
and
1 1 1
√ (l(f∗ (Sn )) − H(Sn ))−
d d
√ log2 n
− H( S ) →
n
−V ⇔ →V (10.6)
n PS (S )
n n
Proof. First recall that convergence in distribution is equivalent to convergence of CDF at all
d
→U ⇔ P [Un ≤ u] → P [U ≤ u] for all u at which point the CDF of U is
continuity point, i.e., Un −
continuous (i.e., not an atom of U).
√
To get (10.5), apply Theorem 10.4 with k = un and τ = n:
1 1 1 ∗ 1 1 1 √
P log2 ≤ u ≤ P l(f (X)) ≤ u ≤ P log2 ≤ u + √ + 2− n+1 .
n PX (X) n n PX (X) n
√
To get (10.6), apply Theorem 10.4 with k = H(Sn ) + nu and τ = n1/4 :
∗ n
1 1 l(f (S )) − H(Sn )
P √ log − H( S ) ≤ u ≤ P
n
√ ≤u
n PSn (Sn ) n
1 1 −1/4
+ 2−n +1 .
1/ 4
≤P √ log n
− H ( S n
) ≤ u + n
n PSn (S )
(10.7)
Now let us particularize the preceding theorem to memoryless sources of i.i.d. Sj ’s. The
important observation is that the log likelihood becomes an i.i.d. sum:
1 X n
1
log n
= log .
PSn (S ) PS (Si )
i=1 | {z }
i.i.d.
i i
i i
i i
202
P
1 By the weak law of large numbers (WLLN), we know that n1 log PSn 1(Sn ) −→E log PS1(S) = H(S).
Therefore in (10.5) the limiting distribution U is degenerate, i.e., U = H(S), and we have the
following result of fundamental importance:1
1 ∗ n P
l(f (S ))−
→H(S) .
n
That is, the optimal compression rate of an iid process converges to its entropy rate. This is
a version of Shannon’s source coding theorem, which we will also discuss in the subsequent
chapter.
2 By the Central Limit Theorem (CLT), if varentropy V(S) < ∞, then we know that V in (10.6)
is Gaussian, i.e.,
1 1 d
p log n)
− nH(S) −→N (0, 1).
nV(S) PSn ( S
Consequently, we have the following Gaussian approximation for the probability law of the
optimal code length
1
(l(f∗ (Sn )) − nH(S))−
d
p →N (0, 1),
nV(S)
or, in shorthand,
p
l(f∗ (Sn )) ≈ nH(S) + nV(S)N (0, 1) in distribution.
Gaussian approximation tells us the speed of convergence 1n l(f∗ (Sn )) → H(S) and also gives us
a good approximation of the distribution of length at finite n.
Example 10.1 (Ternary source) Next we apply our bounds to approximate the distribu-
tion of l(f∗ (Sn )) in a concrete example. Consider a memoryless ternary source outputting i.i.d. n
symbols from the distribution PS = [0.445, 0.445, 0.11]. We first compare different results on the
minimal expected length E[l(f∗ (Sn ))] in the following table:
Blocklength Lower bound (10.3) E[l(f∗ (Sn ))] H(Sn ) (upper bound) asymptotics (10.3)
n = 20 21.5 24.3 27.8 23.3 + o(1)
n = 100 130.4 134.4 139.0 133.3 + o(1)
n = 500 684.1 689.2 695.0 688.1 + o(1)
In all cases above E[l(f∗ (S))] is close to a midpoint between the bounds.
1
Convergence to a constant in distribution is equivalent to that in probability.
i i
i i
i i
Optimal compression: CDF, n = 200, PS = [0.445 0.445 0.110] Optimal compression: PMF, n = 200, P S = [0.445 0.445 0.110]
1 0.06
True PMF
Gaussian approximation
Gaussian approximation (mean adjusted)
0.9
0.05
0.8
0.7
0.04
0.6
0.5
P
0.03
P
0.4
0.02
0.3
0.2
0.01
True CDF
0.1 Lower bound
Upper bound
Gaussian approximation
Gaussian approximation (mean adjusted)
0 0
1.25 1.3 1.35 1.4 1.45 1.5 1.25 1.3 1.35 1.4 1.45 1.5
Rate Rate
Figure 10.2 Left plot: Comparison of the true CDF of l(f∗ (Sn )), bounds of Theorem 10.4 (optimized over τ ),
and the Gaussian approximations in (10.8) and (10.9). Right plot: PMF of the optimal compression length
l(f∗ (Sn )) and the two Gaussian approximations.
Next we consider the distribution of l(f∗ (Sn ). Its Gaussian approximation is defined as
p
nH(S) + nV(S)Z , Z ∼ N ( 0, 1) . (10.8)
i i
i i
i i
204
Figure 10.3 The log-log frequency-rank plots of the most used words in various languages exhibit a power
law tail with exponent close to 1, as popularized by Zipf [477]. Data from [398].
where optimization is over lossless encoders and probability distributions PX = {pj : j = 1, . . .}.
Theorem 10.3 (or more precisely, the intermediate result (10.2)) shows that
It turns out that the upper bound is in fact tight. Furthermore, among all distributions the optimal
tradeoff between entropy and minimal compression length is attained at power law distributions.
To show that, notice that in computing H(Λ), we can restrict attention to sorted PMFs p1 ≥ p2 ≥
· · · (call this class P ↓ ), for which the optimal encoder is such that l(f(j)) = blog2 jc (Theorem 10.2).
Thus, we have shown
X
H(Λ) = sup {H(P) : pj blog2 jc ≤ Λ} .
P∈P ↓ j
i i
i i
i i
Next, let us fix the base of the logarithm of H to be 2, for convenience. (We will convert to arbitrary
base at the end). Applying Example 5.2 we obtain:
H(Λ) ≤ inf λΛ + log2 Z(λ) , (10.10)
λ>0
P∞ P∞
where Z(λ) = n=1 2−λ⌊log2 n⌋ = m=0 2(1−λ)m = 1−211−λ if λ > 1 and Z(λ) = ∞ otherwise.
Clearly, the infimum over λ > 0 is a minimum attained at a value λ∗ > 1 satisfying
d
Λ=− log2 Z(λ) .
dλ λ=λ∗
The argument of Mandelbrot [291] The above derivation shows a special (extremality) prop-
erty of the power law, but falls short of explaining its empirical ubiquity. Here is a way to connect
the optimization problem H(Λ) to the evolution of the natural language. Suppose that there is a
countable set S of elementary concepts that are used by the brain as building blocks of perception
and communication with the outside world. As an approximation we can think that concepts are
in one-to-one correspondence with language words. Now every concept x is represented internally
by the brain as a certain pattern, in the simplest case – a sequence of zeros and ones of length l(f(x))
([291] considers more general representations). Now we have seen that the number of sequences
of concepts with a composition P grows exponentially (in length) with the exponent given by
H(P), see Proposition 1.5. Thus in the long run the probability distribution P over the concepts
results in the rate of information transfer equal to EP [Hl((fP(X) ))] . Mandelbrot concludes that in order
to transfer maximal information per unit, language and brain representation co-evolve in such a
way as to maximize this ratio. Note that
H(P) H(Λ)
sup = sup .
P,f EP [l(f(X))] Λ Λ
It is not hard to show that H(Λ) is concave and thus the supremum is achieved at Λ = 0+ and
equals infinity. This appears to have not been observed by Mandelbrot. To fix this issue, we can
i i
i i
i i
206
postulate that for some unknown physiological reason there is a requirement of also having a
certain minimal entropy H(P) ≥ h0 . In this case
H(P) h0
sup = −1
P,f:H(P)≥h0 EP [l(f(X))] H ( h0 )
and the supremum is achieved at a power law distribution P. Thus, the implication is that the fre-
quency of word usage in human languages evolves until a power law is attained, at which point it
maximizes information transfer within the brain. That’s the gist of the argument of [291]. It is clear
that this does not explain appearance of the power law in other domains, for which other explana-
tions such as preferential attachment models are more plausible, see [305]. Finally, we mention
that the Pλ distributions take discrete values 2−λm−log2 Z(λ) , m = 0, 1, 2, . . . with multiplicities 2m .
Thus Pλ appears as a rather unsightly staircase on frequency-rank plots such as Figure 10.3. This
artifact can be alleviated by considering non-binary brain representations with unequal lengths of
signals.
Definition 10.8 (Prefix codes) f : A → {0, 1}∗ is a prefix code2 if no codeword is a prefix
of another (e.g., 010 is a prefix of 0101).
2
Also known as prefix-free/comma-free/self-punctuating/instantaneous code.
i i
i i
i i
10.3 Uniquely decodable codes, prefix codes and Huffman codes 207
• f(a) = 0, f(b) = 1, f(c) = 10. Not uniquely decodable, since f(ba) = f(c) = 10.
• f(a) = 0, f(b) = 10, f(c) = 11. Uniquely decodable and a prefix code.
• f(a) = 0, f(b) = 01, f(c) = 011, f(d) = 0111 Uniquely decodable but not a prefix code, since
as long as 0 appears, we know that the previous codeword has terminated.3
Remark 10.3
1 Prefix codes are uniquely decodable and hence lossless, as illustrated in the following picture:
prefix codes
Huffman
code
2 Similar to prefix-free codes, one can define suffix-free codes. Those are also uniquely decodable
(one should start decoding in reverse direction).
3 By definition, any uniquely decodable code does not have the empty string as a codeword. Hence
f : X → {0, 1}+ in both Definition 10.7 and Definition 10.8.
4 Unique decodability means that one can decode from a stream of bits without ambiguity, but
one might need to look ahead in order to decide the termination of a codeword. (Think of the
last example). In contrast, prefix codes allow the decoder to decode instantaneously without
looking ahead.
5 Prefix codes are in one-to-one correspondence with binary trees (with codewords at leaves). It
is also equivalent to strategies to ask “yes/no” questions previously mentioned at the end of
Section 1.1.
3
In this example, if 0 is placed at the very end of each codeword, the code is uniquely decodable, known as the unary code.
i i
i i
i i
208
1 Let f : A → {0, 1}∗ be uniquely decodable. Set la = l(f(a)). Then f satisfies the Kraft inequality
X
2−la ≤ 1. (10.11)
a∈A
2 Conversely, for any set of code length {la : a ∈ A} satisfying (10.11), there exists a prefix code
f, such that la = l(f(a)). Moreover, such an f can be computed efficiently.
Remark 10.4 The consequence of Theorem 10.9 is that as far as compression efficiency is
concerned, we can ignore those uniquely decodable codes that are not prefix codes.
Proof. We prove the Kraft inequality for prefix codes and uniquely decodable codes separately.
The proof for the former is probabilistic, following ideas in [15, Exercise 1.8, p. 12]. Let f be a
prefix code. Let us construct a probability space such that the LHS of (10.11) is the probability
of some event, which cannot exceed one. To this end, consider the following scenario: Generate
independent Ber( 12 ) bits. Stop if a codeword has been written, otherwise continue. This process
P
terminates with probability a∈A 2−la . The summation makes sense because the events that a
given codeword is written are mutually exclusive, thanks to the prefix condition.
Now let f be a uniquely decodable code. The proof uses generating function as a device for
counting. (The analogy in coding theory is the weight enumerator function.) First assume A is
P PL
finite. Then L = maxa∈A la is finite. Let Gf (z) = a∈A zla = l=0 Al (f)zl , where Al (f) denotes
the number of codewords of length l in f. For k ≥ 1, define fk : Ak → {0, 1}+ as the symbol-
P k k P P
by-symbol extension of f. Then Gfk (z) = ak ∈Ak zl(f (a )) = a1 · · · ak zla1 +···+lak = [Gf (z)]k =
PkL k l
l=0 Al (f )z . By the unique decodability of f, fk is lossless. Hence Al (fk ) ≤ 2l . Therefore we have
P
Gf (1/2)k = Gfk (1/2) ≤ kL for all k. Then a∈A 2−la = Gf (1/2) ≤ limk→∞ (kL)1/k = 1. If A is
P
countably infinite, for any finite subset A′ ⊂ A, repeating the same argument gives a∈A′ 2−la ≤
1. The proof is complete by the arbitrariness of A′ .
P
Conversely, given a set of code lengths {la : a ∈ A} s.t. a∈A 2−la ≤ 1, construct a prefix
code f as follows: First relabel A to N and assume that 1 ≤ l1 ≤ l2 ≤ . . .. For each i, define
X
i−1
ai ≜ 2−lk
k=1
with a1 = 0. Then ai < 1 by Kraft inequality. Thus we define the codeword f(i) ∈ {0, 1}+ as the
first li bits in the binary expansion of ai . Finally, we prove that f is a prefix code by contradiction:
Suppose for some j > i, f(i) is the prefix of f(j), since lj ≥ li . Then aj − ai ≤ 2−li , since they agree
on the most significant li bits. But aj − ai = 2−li + 2−li+1 +. . . > 2−li , which is a contradiction.
Remark
P
10.5 A conjecture of Ahlswede et al [7] states that for any set of lengths for which
2−la ≤ 43 there exists a fix-free code (i.e. one which is simultaneously prefix-free and suffix-
free). So far, existence has only been shown when the Kraft sum is ≤ 58 , cf. [466].
i i
i i
i i
10.3 Uniquely decodable codes, prefix codes and Huffman codes 209
In view of Theorem 10.9, the optimal average code length among all prefix (or uniquely
decodable) codes is given by the following optimization problem
X
L∗ (X) ≜ min PX (a)la (10.12)
a∈A
X
s.t. 2− l a ≤ 1
a∈A
la ∈ N
This is an integer programming (IP) problem, which, in general, is computationally hard to solve.
It is remarkable that this particular IP can be solved in near-linear time, thanks to the Huffman
algorithm. Before describing the construction of Huffman codes, let us give bounds to L∗ (X) in
terms of entropy:
Theorem 10.10
H(X) ≤ L∗ (X) ≤ H(X) + 1 bit. (10.13)
Proof. Right inequality: Consider the following length assignment la = log2 PX1(a) ,4 which
P P
satisfies Kraft since l a∈A 2−la m≤ a∈A PX (a) = 1. By Theorem 10.9, there exists a prefix code
f such that l(f(a)) = log2 PX1(a) and El(f(X)) ≤ H(X) + 1.
Light inequality: We give two proofs for this converse. One of the commonly used ideas to deal
with combinatorial optimization is relaxation. Our first idea is to drop the integer constraints in
(10.12) and relax it into the following optimization problem, which obviously provides a lower
bound
X
L∗ (X) ≜ min PX (a)la (10.14)
a∈A
X
s.t. 2− l a ≤ 1
a∈A
This is a nice convex optimization problem, with an affine objective function and a convex feasible
set. Solving (10.14) by Lagrange multipliers (Exercise!) yields that the minimum is equal to H(X)
(achieved at la = log2 PX1(a) ).
Another proof is the following: For any f whose codelengths {la } satisfying the Kraft inequality,
− la
define a probability measure Q(a) = P 2 2−la . Then
a∈A
X
El(f(X)) − H(X) = D(PkQ) − log 2−la ≥ 0.
a∈A
4
Such a code is called a Shannon code.
i i
i i
i i
210
Next we describe the Huffman code, which achieves the optimum in (10.12). In view of the fact
that prefix codes and binary trees are one-to-one, the main idea of the Huffman code is to build
the binary tree from the bottom up: Given a PMF {PX (a) : a ∈ A},
The algorithm terminates in |A| − 1 steps. Given the binary tree, the code assignment can be
obtained by assigning 0/1 to the branches. Therefore the time complexity is O(|A|) (sorted PMF)
or O(|A| log |A|) (unsorted PMF).
d e
Theorem 10.11 (Optimality of Huffman codes) The Huffman code achieves the minimal
average code length (10.12) among all prefix (or uniquely decodable) codes.
1 Constructing the Huffman code requires knowing the source distribution. This brings us the
question: Is it possible to design universal compressor which achieves entropy for a class of
source distributions? And what is the price to pay? These questions are the topic of universal
compression and will be addressed in Chapter 13.
2 To understand the main limitation of Huffman coding, we recall that (as Shannon pointed out),
while Morse code already exploits the nonequiprobability of English letters, working with
pairs (or more generally, n-grams) of letters achieves even more compression, since letters in
a pair are not independent. In other words, to compress a block of symbols (S1 , . . . , Sn ) by
applying Huffman code on a symbol-by-symbol basis one can achieve an average length of
Pn
i=1 H(Si ) + n bits. But applying Huffman codes on a whole block (S1 , . . . , Sn ), that is the
code designed for PS1 ,...,Sn , allows to exploit the memory in the source and achieve compres-
P
sion length H(S1 , . . . , Sn ) + O(1). Due to (1.3) the joint entropy is smaller than i H(Si ) (and
i i
i i
i i
10.3 Uniquely decodable codes, prefix codes and Huffman codes 211
usually much smaller). However, the drawback of this idea is that constructing the Huffman
code has complexity |A|n – exponential in the blocklength.
To resolve these problems we will later study other methods:
1 Arithmetic coding has a sequential encoding algorithm with complexity linear in the block-
length, while still attaining H(Sn1 ) length – Section 13.1.
2 Lempel-Ziv algorithm also has low-complexity and is even universal, provably optimal for all
ergodic sources – Section 13.8.
As a summary of this chapter, we learned the following relationship between entropy and
compression length of various codes:
H(X) − log2 [e(H(X) + 1)] ≤ E[l(f∗ (X))] ≤ H(X) ≤ E[l(fHuffman (X))] ≤ H(X) + 1.
i i
i i
i i
In previous chapter we introduced the concept of variable-length compression and studied its
fundamental limits with and without prefix-free condition. In some situations, however, one may
desire that the output of the compressor always be of a fixed length, say, k bits. Unless k is unrea-
sonably large, then, this will require relaxing the losslessness condition. This is the focus of this
chapter: compression in the presence of (typically vanishingly small) probability of error. It turns
out allowing even very small error enables several beautiful effects:
• The possibility to compress data via matrix multiplication over finite fields (linear compression
or hashing).
• The possibility to reduce compression length from H(X) to H(X|Y) if side information Y is
available at the decompressor (Slepian-Wolf).
• The possibility to reduce compression length below H(X) if access to a compressed representa-
tion of side-information Y is available at the decompressor (Ahlswede-Körner-Wyner).
All of these effects are ultimately based on the fundamental property of many high-dimensional
probability distributions, the asymptotic equipartition (AEP), which we study in the context of iid
distributions. Later we will extend this property to all ergodic processes in Chapter 12.
Note that if we insist like in Chapter 10 that g(f(X)) = X with probability one, then k ≥
log2 |supp(PX )| and no meaningful compression can be achieved. It turns out that by tolerating
a small error probability, we can gain a lot in terms of code length! So, instead of requiring
g(f(x)) = x for all x ∈ X , consider only lossless decompression for a subset S ⊂ X :
(
x x∈S
g(f(x)) =
e x 6∈ S
212
i i
i i
i i
P [g(f(X)) 6= X] = P [g(f(X)) = e] = P [X ∈
/ S] .
f : X → {0, 1}k
g : {0, 1}k → X ∪ {e}
such that g(f(x)) ∈ {x, e} for all x ∈ X and P [g(f(X)) = e] ≤ ϵ. The fundamental limit of fixed-
length compression is simply the minimum probability of error and is defined as
The following result connects the respective fundamental limits of fixed-length almost lossless
compression and variable-length lossless compression (Section 10.1):
Proof. Note that because of the assumption X = N compressor must reserve one k-bit string for
the error message even if PX (1) = 1. The proof is essentially tautological. Note 1 + 2 +· · ·+ 2k−1 =
2k − 1. Let S be the set of top 2k − 1 most likely (as measured by PX (x)) elements x ∈ X . Then
i i
i i
i i
214
freedom in designing codes. It turns out that we do not gain much by this relaxation. Indeed, if
we define
Corollary 11.3 (Shannon’s source coding theorem) Let Sn be i.i.d. discrete random
variables. Then for any R > 0 and γ ∈ R asymptotically in blocklength n we have
∗ n 0 R > H( S)
lim ϵ (S , nR) =
n→∞ 1 R < H( S)
This result demonstrates that if we are to compress an iid string Sn down to k = k(n) bits
then minimal possible k enabling vanishing error satisfies nk = H(S), that is we can compress to
entropy rate of the iid process S and no more. Furthermore, if we allow a non-vanishing error ϵ
then compression is possible down to
p
k = nH(S) + nV(S)Q−1 (ϵ)
bits. In the language of modern information theory, Corollary 11.3 derives both the asymptotic
fundamental limit (minimal k/n) and the normal approximation under non-vanishing error.
The next desired step after understanding asymptotics is to derive finite blocklength guarantees,
that is bounds on ϵ∗ (X, k) in terms of the information quantities. As we mentioned above, the
upper and lower bounds are typically called achievability and converse bounds. In the case of
lossless compression such bounds are rather trivial corollaries of Theorem 11.2, but we present
them for completeness next. For other problems in this Part and other Parts obtaining good finite
blocklength bounds is much more challenging.
Theorem 11.4 (Finite blocklength bounds) For all τ > 0 and all k ∈ Z+ we have
1 −τ ∗ ∗ 1
P log2 > k + τ − 2 ≤ ϵ̃ (X, k) ≤ ϵ (X, k) ≤ P log2 ≥k .
PX (X) PX (X)
i i
i i
i i
Proof. The argument for the lower (converse) bound is identical to the converse of Theorem 10.4.
Indeed, considering the optimal (undetectable error) code let S = {x : g(f(x)) = x} and note
∗ 1 1
1 − ϵ̃ (X, k) = P [X ∈ S] ≤ P log2 ≤ k + τ + P X ∈ S, log2 >k+τ
PX (X) PX (X)
where we used the fact that |S| ≤ 2k . Combining the two inequalities yields the lower bound.
For the upper bound, without loss of generality we assume PX (1) ≥ PX (2) ≥ · · · . Then by
Theorem 11.2 we have
X X 1
1
ϵ∗ (X, k) = PX (m) ≤ 1 ≥ 2k PX (m) = P log2 ≥k ,
P X ( m) PX (X)
k m≥2
where ≤ follows from the fact that mth largest mass PX (m) ≤ 1
m.
We now will do something strange. We will prove an upper bound that is weaker than that of
Theorem 11.4 and furthermore, the proof is much longer. However, this will be our first exposition
to the technique of random coding (also known as probabilistic method outside of information
theory).1 We will quickly find out that outside of the simplest setting of lossless compression,
where the optimal encoder f∗ was easy to describe, good encoders are very hard to find and thus
random coding becomes indispensable. In particular, Slepian-Wolf theorem (Section 11.5 below)
all of data transmission (Part IV) and lossy data compression (Part V) will be based on the method.
Theorem 11.5 (Random coding achievability) For any k ∈ Z+ and any τ > 0 we have
1
ϵ̃∗ (X, k) ≤ P log2 > k − τ + 2−τ , ∀τ > 0 , (11.1)
PX (X)
that is there exists a compressor-decompressor pair with the (possibly undetectable) error bounded
by the right-hand side.
Proof. We first start with constructing a suboptimal decompressor g for a given f. Indeed, for a
given compressor f, the optimal decompressor which minimizes the error probability is simply the
maximum a posteriori (MAP) decoder, i.e.,
1
These methods were discovered simultaneously by Shannon [378] and Erdös [153], respectively.
i i
i i
i i
216
However, this decoder’s performance is a little hard to analyze, so instead, we consider the
following (suboptimal) decompressor g:
x, ∃! x ∈ X s.t. f(x) = w and log2 PX1(x) ≤ k − τ,
g(w) = (exists unique high-probability x that is mapped to w)
e, otherwise
Note that log2 PX1(x) ≤ k − τ ⇐⇒ PX (x) ≥ 2−(k−τ ) . We call those x “high-probability”. (In the
language of [106] and [115] these would be called “typical” realizations).
Denote f(x) = cx and call the long vector C = [cx : x ∈ X ] a codebook. It is instructive to think
of C as a hashing table: it takes an object x ∈ X and assigns to it a k-bit hash value.
To analyze the error probability let us define
′ ′ 1
J(x, C) ≜ x ∈ X : cx′ = cx , x 6= x, log2 ≤k−τ
PX (x′ )
to be the set of high-probability inputs whose hashes collide with that of x. Then we have the
following estimate for probability of error:
1
P [g(f(X)) = e] = P log2 > k − τ ∪ {J(X, C) 6= ∅}
PX (X)
1
≤ P log2 > k − τ + P [J(X, C) 6= ϕ]
PX (X)
The first term does not depend on the codebook C , while the second term does. The idea now
is to randomize over C and show that when we average over all possible choices of codebook,
the second term is smaller than 2−τ . Therefore there exists at least one codebook that achieves
i.i.d.
the desired bound. Specifically, let us consider C generated by setting each cx ∼ Unif[{0, 1}k ] and
independently of X. Equivalently, since C can be represented by an |X | × k binary matrix, whose
rows correspond to codewords, we choose each entry to be independent fair coin flip. Averaging
the error probability (over C and over X), we have
′ 1
EC [P [J(X, C) 6= ϕ]] = EC,X 1 ∃x 6= X : log2 ≤ k − τ, cx = cX
′
PX (x′ )
X 1
≤ EC,X 1 log2 ≤ k − τ 1 {cx′ = cX } (union bound)
PX ( x′ )
x′ ̸=X
X
= 2− k E X 1 PX (x′ ) ≥ 2−k+τ
x′ ̸=X
X
≤ 2− k 1 PX (x′ ) ≥ 2−k+τ
x′ ∈X
−k k−τ
≤2 2 = 2−τ ,
where the crucial penultimate step uses the fact that there can be at most 2k−τ values of x′ with
PX (x′ ) > 2−k+τ .
i i
i i
i i
Remark 11.2 (Why random coding works) The compressor f(x) = cx can be thought as
hashing x ∈ X to a random k-bit string cx ∈ {0, 1}k , as illustrated below:
Here, x has high probability ⇔ log2 PX1(x) ≤ k − τ ⇔ PX (x) ≥ 2−k+τ . Therefore the number of
those high-probability x’s is at most 2k−τ , which is far smaller than 2k , the total number of k-bit
codewords. Hence the chance of collision among high-probability x’s is small.
Let us again emphasize that the essence of the random coding argument is the following. To
prove the existence of an object with certain property, we construct a probability distribution
(randomize) and show that on average the property is satisfied. Hence there exists at least one
realization with the desired property. The downside of this argument is that it is not constructive,
i.e., does not give us an algorithm to find the object. One may wonder whether we can practically
simply generate a large random hashing table and use it for compression. The problem is that
generating such a table requires a lot of randomness and a lot of storage space (both are impor-
tant resources). We will address this issue in Section 11.3, but for now let us make the following
remark.
Remark 11.3 (Pairwise independence of codewords) In the proof we choose the
i.i.d.
random codebook to be uniform over all possible codebooks: cx ∼ Unif. But a careful inspec-
tion (exercise!) shows that we only used pairwise independence, i.e., cx ⊥ ⊥ cx′ for any x 6= x′ .
This suggests that perhaps in generating the table we can use a lot fewer than k|X | random bits.
Indeed, given 2 independent random bits B1 , B2 we can generate 3 bits that are pairwise indepen-
dent: B1 , B2 , B1 ⊕ B2 . This observation will lead us to the idea of linear compression studied in
Section 11.3, where the codewords generated not iid, but as elements of a random linear subspace.
Proposition 11.6 (Asymptotic equipartition (AEP)) Consider iid Sn and for any δ > 0,
define the so-called entropy δ -typical set
1 1
Tδn ≜ sn : log − H ( S ) ≤ δ .
n PSn (sn )
Then the following properties hold:
i i
i i
i i
218
1 P Sn ∈ Tδn → 1 as n → ∞.
2 |Tδn | ≤ exp{(H(S) + δ)n}.
i.i.d.
For example if Sn ∼ Ber(p), then PSn (sn ) = pwH (s ) p̄n−wH (s ) , where wH (sn ) is the Hamming
n n
weight of the string (number of 1’s). Thus the typical set corresponds to those sequences whose
Hamming weight 1n wH (sn ) is close to the expected value of p + Op (δ).
Thus, P[Sn ∈ Tδn ] → 1. On the other hand, since for every sn ∈ Tδn we have PSn (sn ) > exp{−(H(S)+
δ)n} there can be at most exp{(H(S) + δ)n} elements in Tδn .
To understand the meaning of the AEP, notice that it shows that the gigantic space S n has
almost all of probability PSn concentrated on an exponentially smaller subset Tδn . Furthermore, on
this subset the measure PSn is approximately uniform: PSn (sn ) = exp{−nH(S) ± nδ}.
To see how AEP is related to compression, let us give a third proof of Shannon’s result:
∗ n 0 R > H( S)
lim ϵ (S , nR) =
n→∞ 1 R < H( S)
Indeed, let us consider an encoder f that enumerates (by strings in {0, 1}nR ) elements of Tδn . Then
if R > H(S) + δ the decoding error happens with probability P[Sn 6∈ Tδn ] → 0. Hence any rate
R > H(S) results in a vanishing error. On the other hand, if R < H(S) then it is clear that 2nR -bits
cannot describe any significant portion of |Tδn | and since on the latter the measure PSn is almost
uniform, the probability of error necessarily converges to 1 (in fact exponentially fast). There is a
certain conceptual beauty in this way of proving source coding theorem. For example, it explains
why optimal compressor’s output should look almost like iid Ber(1/2):2 after all it enumerates
over an almost uniform set Tδn .
2
This is the intuitive basis why compressors can be used as random number generators; cf. Section 9.3.
i i
i i
i i
In this section we assume that the source takes the form X = Sn , where each coordinate is an
element of a finite field (Galois field), i.e., Si ∈ Fq , where q is the cardinality of Fq . (This is only
possible if q = pk for some prime number p and k ∈ N.)
Definition 11.7 (Galois field) F is a finite set with operations (+, ·) where
A linear compressor is a linear function H : Fnq → Fkq (represented by a matrix H ∈ Fqk×n ) that
maps each x ∈ Fnq to its codeword w = Hx, namely
w1 h11 ... h1n x1
.. .. .. ..
. = . . .
wk hk1 ... hkn xn
Compression is achieved if k < n, i.e., H is a fat matrix, which, again, is only possible in the
almost lossless sense.
Theorem 11.8 (Achievability via linear codes) Let X ∈ Fnq be a random vector. For all
τ > 0, there exists a linear compressor H ∈ Fnq×k and decompressor g : Fkq → Fnq ∪ {e}, s.t. its
undetectable error probability is bounded by
1
P [g(HX) 6= X] ≤ P logq > k − τ + q−τ .
PX (X)
Remark 11.4 Consider the Hamming space q = 2. In comparison with Shannon’s random
coding achievability, which uses k2n bits to construct a completely random codebook, here for lin-
ear codes we need kn bits to randomly generate the matrix H, and the codebook is a k-dimensional
linear subspace of the Hamming space.
Proof. Fix τ . As pointed in the proof of Shannon’s random coding theorem (Theorem 11.5),
given the compressor H, the optimal decompressor is the MAP decoder, i.e., g(w) =
argmaxx:Hx=w PX (x), which outputs the most likely symbol that is compatible with the codeword
i i
i i
i i
220
where, as in the proof of Theorem 11.5, we denoted x to be “h.p.” (high probability) whenever
logq PX1(x) ≤ k − τ .
Note that this decoder is the same as in the proof of Theorem 11.5. The proof is also mostly the
same, except now hash collisions occur under the linear map H. Specifically, we have by applying
the union bound twice:
Now we use random coding to average the second term over all possible choices of H. Specif-
ically, choose H as a matrix independent of X where each entry is iid and uniform on Fq . For
distinct x0 and x1 , the collision probability is
where H1 is the first row of the matrix H, and each row of H is independent. This is the probability
that Hi is in the orthogonal complement of x2 . On Fnq , the orthogonal complement of a given
non-zero vector has cardinality qn−1 . So the probability for the first row to lie in this subspace is
qn−1 /qn = 1/q, hence the collision probability 1/qk . Averaging over H gives
X
EH 1{Hx′ = Hx} = |{x′ : x′ h.p., x′ 6= x}|q−k ≤ qk−τ q−k = q−τ .
′
x h.p.,x′ ̸=x
We remark that the bounds in Theorems 11.5 and 11.8 produce compressors with undetectable
errors. However, the non-linear construction in the former is easy to modify to make all errors
detectable (e.g. by increasing k by 1 and making sure the first bit is 1 for all x = sn with low
probability). For the linear compressors, however, the errors cannot be made detectable.
Note that we restricted our theorem to inputs over Fq . Can we loosen the requirements and
produce compressors over an arbitrary commutative ring? In general, the answer is negative due
to existences of zero divisors in the commutative ring. The latter ruin the key proof ingredient of
low collision probability in the random hashing. Indeed, consider the following computation over
i i
i i
i i
11.4 Compression with side information at both compressor and decompressor 221
Z/6Z
1 2
0 0
P H .. = 0 = 6− k but P H .. = 0 = 3− k ,
. .
0 0
since 0 · 2 = 3 · 2 = 0 in Z/6Z.
Note that here unlike the source X, the side information Y need not be discrete. Conditioned on
Y = y, the problem reduces to compression without side information studied in Section 11.1, but
with the source X distributed according to PX|Y=y . Since Y is known to both the compressor and
decompressor, they can use the best code tailored for this distribution. Recall ϵ∗ (X, k) defined in
Definition 11.1, the optimal probability of error for compressing X using k bits, which can also be
denoted by ϵ∗ (PX , k). Then we have the following relationship
which allows us to apply various bounds developed before. In particular, we clearly have the
following result.
i i
i i
i i
222
Theorem 11.10
1 1
P log > k + τ − 2−τ ≤ ϵ∗ (X|Y, k) ≤ P log2 > k − τ + 2−τ , ∀τ > 0
PX|Y (X|Y) PX|Y (X|Y)
Corollary 11.11 Let (X, Y) = (Sn , Tn ) where the pairs (Si , Ti )i.i.d.
∼ PS,T . Then
(
∗ 0 R > H(S|T)
lim ϵ (S |T , nR) =
n n
n→∞ 1 R < H(S|T)
1X
n
1 1 1 P
log = log −
→H(S|T)
n PSn |Tn (S |T )
n n n PS|T (Si |Ti )
i=1
as n → ∞. Thus, the result follows from setting (X, Y) = (Sn , Tn ) in the previous theorem.
i i
i i
i i
Here is the very surprising result of Slepian and Wolf3 , which shows that unavailability of the
side information at compressor does not hinder the compression rate at all.
From this theorem we will get by the WLLN the asymptotic result:
And we remark that the side-information (T-process) is not even required to be discrete, see
Exercise II.9.
Proof of the Theorem. LHS is obvious, since side information at the compressor and decompres-
sor is better than only at the decompressor.
For the RHS, first generate a random codebook with iid uniform codewords: C = {cx ∈ {0, 1}k :
x ∈ X } independently of (X, Y), then define the compressor and decoder as
f(x) = cx
(
x ∃!x : cx = w, x h.p.|y
g(w, y) =
0 otherwise
where we used the shorthand x h.p.|y ⇔ log2 PX|Y1(x|y) < k −τ . The error probability of this scheme,
as a function of the code book C , is
1
P[X 6= g(f(X))|C] = P log ≥ k − τ or J(X, C|Y) 6= ∅|C
PX|Y (X|Y)
1
≤ P log ≥ k − τ + P [J(X, C|Y) 6= ∅|C]
PX|Y (X|Y)
X
1
= P log ≥k−τ + PX,Y (x, y)1 {J(x, C|y) 6= ∅}.
PX|Y (X|Y) x, y
3
This result is often informally referred to as “the most surprising result post-Shannon”.
i i
i i
i i
224
where (a) is a union bound, (b) follows from the fact that |{x′ : x′ h.p.|y}| ≤ 2k−τ , and (c) is from
P[cx′ = cx ] = 2−k for any x 6= x′ .
X {0, 1}k1
Compressor f1
Decompressor g
(X̂, Ŷ)
Y {0, 1}k2
Compressor f2
i i
i i
i i
R2
Achievable
H(T )
Region
H(T |S)
R1
H(S|T ) H(S)
Since H(T) − H(T|S) = H(S) − H(S|T) = I(S; T), the slope of the skewed line is −1.
Proof. Converse: Take (R1 , R2 ) 6∈ RSW . Then one of three cases must occur:
1 R1 < H(S|T). In this case, even if f1 encoder and decoder had access to full Tn , we still can not
achieve vanishing error (Corollary 11.11).
2 R2 < H(T|S) (same).
3 R1 + R2 < H(S, T). If this were possible, then we would be compressing the joint (Sn , Tn ) at
rate lower than H(S, T), violating Corollary 11.3.
i i
i i
i i
226
Achievability: First note that we can achieve the two corner points. The point (H(S), H(T|S))
can be approached by almost lossless compressing S at entropy and compressing T with side infor-
mation S at the decoder. To make this rigorous, let k1 = n(H(S) + δ) and k2 = n(H(T|S) + δ). By
Corollary 11.3, there exist f1 : S n → {0, 1}k1 and g1 : {0, 1}k1 → S n s.t. P [g1 (f1 (Sn )) 6= Sn ] ≤
ϵn → 0. By Theorem 11.13, there exist f2 : T n → {0, 1}k2 and g2 : {0, 1}k1 × S n → T n
s.t. P [g2 (f2 (Tn ), Sn ) 6= Tn ] ≤ ϵn → 0. Now that Sn is not available, feed the S.W. decompres-
sor with g1 (f1 (Sn )) and define the joint decompressor by g(w1 , w2 ) = (g1 (w1 ), g2 (w2 , g1 (w1 )))
(see below):
Sn Ŝn
f1 g1
Tn T̂n
f2 g2
i i
i i
i i
X {0, 1}nR1
Compressor f1
Decompressor g
X̂
Y {0, 1}nR2
Compressor f2
Note also that unlike the previous section decompressor is only required to produce an esti-
mate of X (not of Y), hence the name of this problem: compression with a (rate-limited) helper.
The difficulty this time is that what needs to be communicated over this link from Y to decom-
pressor is not the information about Y but only that information in Y that is maximally useful
for decompressing X. Despite similarity with the previous sections, this task is completely new
and, consequently, characterization of rate pairs R1 , R2 is much more subtle in this case. It was
completed independently in two works [9, 459].
Furthermore, for every such random variable U the rate pair (H(X|U), I(Y; U)) is achievable with
vanishing error.
In other words, this time the set of achievable pairs (R1 , R2 ) belongs to a region of R2+ described
as ∪{[H(X|U), +∞)×[I(Y; U), +∞)} with the union taken over all possible PU|Y : Y → U , where
|U| = |Y| + 1. The boundary of the optimal (R1 , R2 )-region is traced by an FI -curve, a concept
we will define later (Definition 16.5).
Proof. First, note that iterating over all possible random variables U (without cardinality con-
straint) the set of pairs (R1 , R2 ) satisfying (11.3) is convex. Next, consider a compressor W1 =
f1 (Xn ) and W2 = f2 (Yn ). Then from Fano’s inequality (3.19) assuming P[Xn 6= X̂n ] = o(1) we
have
i i
i i
i i
228
Thus, from chain rule and the fact that conditioning decreases entropy, we get
nR1 ≥ I(Xn ; W1 |W2 ) ≥ H(Xn |W2 ) − o(n) (11.4)
Xn
= H(Xk |W2 , Xk−1 ) − o(n) (11.5)
k=1
Xn
≥ H(Xk | W2 , Xk−1 , Yk−1 ) − o(n) (11.6)
| {z }
k=1
≜Uk
where (11.8) follows from I(W2 , Xk−1 ; Yk |Yk−1 ) = I(W2 ; Yk |Yk−1 ) + I(Xk−1 ; Yk |W2 , Yk−1 ) and the
⊥ Xk−1 |Yk−1 ; and (11.9) from Yk−1 ⊥
fact that (W2 , Yk ) ⊥ ⊥ Yk . Comparing (11.6) and (11.9) we
k− 1 k− 1
notice that denoting Uk = (W2 , X , Y ) we have both Xk → Yk → Uk and
1X
n
(R1 , R2 ) ≥ (H(Xk |Uk ), I(Uk ; Yk ))
n
k=1
and thus (from convexity) the rate pair must belong to the region spanned by all pairs
(H(X|U), I(U; Y)).
To show that without loss of generality the auxiliary random variable U can be chosen to take
at most |Y| + 1 values, one can invoke Carathéodory’s theorem (see Lemma 7.14). We omit the
details.
Next, we show that for each U the mentioned rate-pair is achievable. To that end, we first
notice that if there were side information at the decompressor in the form of the i.i.d. sequence
Un correlated to Xn , then Slepian-Wolf theorem implies that only rate R1 = H(X|U) would be
sufficient to reconstruct Xn . Thus, the question boils down to creating a correlated sequence Un at
the decompressor by using the minimal rate R2 . One way to do it is to communicate Un exactly by
spending nH(U) bits. However, it turns out that with nI(U; X) bits we can communicate a “fake”
Ûn which nevertheless has conditional distribution PXn |Ûn ≈ PXn |Un (such Ûn is known as “jointly
typical” with Xn ). Possibility of producing such Ûn is a result of independent prominence known
as covering lemma, which we will study much later – see Corollary 25.6. Here we show how to
apply covering lemma in this case.
By Corollary 25.6 and by Proposition 25.7 we know that for every δ > 0 there exists a
sufficiently large m and Ûm = f2 (Ym ) ∈ {0, 1}mR2 such that
Xm → Ym → Ûm
i i
i i
i i
and I(Xm ; Ûm ) ≥ m(I(X; U)−δ). This means that H(Xm |Ûm ) ≤ mH(X|U)+ mδ . We can now apply
Slepian-Wolf theorem to the block-symbols (Xm , Ûm ). Namely, we define a new compression prob-
lem with X̃ = Xm and Ũ = Ûm . These still take values on finite alphabets and thus there must exist
(for sufficiently large ℓ) a compressor W1 = f1 (X̃ℓ ) ∈ {0, 1}ℓR̃1 and a decompressor g(W1 , Ũℓ )
with a low probability of error and R̃1 ≤ H(X̃|Ũ) + mδ ≤ mH(X|U) + 2mδ). Now since the actual
blocklength is n = ℓm we get that the effective rate of this scheme is R1 = R̃m1 ≤ H(X|U) + 2δ .
Since δ > 0 is arbitrary, the proof is completed.
i i
i i
i i
So far we studied compression of i.i.d. sequence {Si }, for which we demonstrated that the average
compression length (for variable length compressors) converges to the entropy H(S) and that the
probability of error (for fixed-length compressor) converges to zero or one depending on whether
compression rate R ≶ H(S). In this chapter, we shall examine similar results for a large class of
processes with memory, known as ergodic processes. We start this chapter with a quick review of
main concepts of ergodic theory, then state our main results (Shannon-McMillan theorem, com-
pression limit and AEP). Subsequent sections are dedicated to proofs of Shannon-McMillan and
ergodic theorems. Finally, in the last section we introduce Kolmogorov-Sinai entropy, which asso-
ciates to a fully deterministic transformation the measure of how “chaotic” it is. This concept
plays a very important role in formalizing an apparent paradox: large mechanical systems (such
as collections of gas particles) are on one hand fully deterministic (described by Newton’s laws
of motion) and on the other hand have a lot of probabilistic properties (Maxwell distribution of
velocities, fluctuations etc). Kolmogorov-Sinai entropy shows how these two notions can co-exist.
In addition it was used to resolve a long-standing open problem in dynamical systems regarding
isomorphism of Bernoulli shifts [387, 322].
Remark 12.1
or, equivalently, E = τ −1 E. Thus τ -invariant events are also called shift-invariant, when τ is
interpreted as (12.1).
3 Some examples of shift-invariant events are {∃n : xi = 0, ∀i ≥ n}, {lim sup xi < 1} etc. A non
shift-invariant event is A = {x_0 = x_1 = · · · = 0}, since τ(1, 0, 0, . . .) ∈ A but (1, 0, . . .) ∉ A.
4 Also recall that the tail σ-algebra is defined as
F_tail ≜ ⋂_{n≥1} σ{S_n, S_{n+1}, . . .} .
It is easy to check that all shift-invariant events belong to Ftail . The inclusion is strict, as for
example the event {∃n : xi = 0, ∀ odd i ≥ n} is in Ftail but not shift-invariant.
Proposition 12.3 (Poincaré recurrence) Let τ be measure-preserving for (Ω, F, P). Then
for any measurable A with P[A] > 0 we have
" #
[
P τ −k A A = P[τ k (ω) ∈ A occurs infinitely often|A] = 1 .
k≥ 1
S
Proof. Let B = k≥ 1 τ −k A. It is sufficient to show that P[A ∩ B] = P[A] or equivalently
(with each gas occupying roughly its half of the cylinder). Of course, the “paradox” is resolved
by observing that it will take unphysically long for this to happen.
P[A ∩ τ^{−n} B] → P[A]P[B] .
Strong mixing implies weak mixing, which implies ergodicity (Exercise II.12).
• {Si }: finite irreducible Markov chain with recurrent states is ergodic (in fact strong mixing),
regardless of initial distribution.
As a toy example, consider the kernel P(0|1) = P(1|0) = 1 with initial distribution P(S_0 = 0) = 0.5. This process only has two sample paths: P[S_1^∞ = (010101 . . .)] = P[S_1^∞ = (101010 . . .)] = 1/2. It is easy to verify this process is ergodic (in the sense of Definition 12.4). Note, however, that in the Markov-chain literature a chain is called ergodic if it is
irreducible, aperiodic and recurrent. This example does not satisfy this definition (this clash of
terminology is a frequent source of confusion).
• {Si }: stationary zero-mean Gaussian process with autocovariance function c(n) = E[S0 S∗n ].
lim_{n→∞} (1/(n+1)) ∑_{t=0}^n c(t) = 0  ⇔  {S_i} ergodic  ⇔  {S_i} weakly mixing
Intuitively speaking, an ergodic process can have infinite memory in general, but the memory
is weak. Indeed, we see that for a stationary Gaussian process ergodicity means the correlation
dies (in the Cesáro-mean sense).
The spectral measure is defined as the (discrete time) Fourier transform of the autocovariance
sequence {c(n)}, in the sense that there exists a unique positive measure μ on [−π , π ] such that
c(n) = (1/2π) ∫ exp(inx) μ(dx). The spectral criterion can be formulated as follows. A detailed exposition on stationary Gaussian processes can be found in [135, Theorem 9.3.2, p. 474, and Theorem 9.7.1, pp. 493–494].
Corollary 12.6 Let {S1 , S2 , . . . } be a discrete stationary and ergodic process with entropy rate
H (in bits). Denote by f∗n the optimal variable-length compressor for Sn and ϵ∗ (Sn , nR) the optimal
probability of error of its fixed-length compressor with R bits per symbol (Definition 11.1). Then
we have
(1/n) l(f*_n(S^n)) → H in probability,   and   lim_{n→∞} ϵ*(S^n, nR) = 0 if R > H, = 1 if R < H.        (12.4)
Proof. By Corollary 10.5, the asymptotic distributions of (1/n) l(f*_n(S^n)) and (1/n) log 1/P_{S^n}(S^n) coincide.
By the Shannon-McMillan-Breiman theorem (we only need convergence in probability) the latter
converges to a constant H.
In Chapter 11 we learned the asymptotic equipartition property (AEP) for iid sources. Thanks to the Shannon-McMillan-Breiman theorem, the same proof that we gave for iid processes works for a general ergodic process.
Corollary 12.7 (AEP for stationary ergodic sources) Let {S1 , S2 , . . . } be a stationary
and ergodic discrete process. For any δ > 0, define the set
T_n^δ = { s^n : | (1/n) log 1/P_{S^n}(s^n) − H | ≤ δ } .
Then
1 P[S^n ∈ T_n^δ] → 1 as n → ∞.
2 exp{n(H − δ)}(1 + o(1)) ≤ |T_n^δ| ≤ exp{(H + δ)n}(1 + o(1)).
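A quick simulation (ours) of the first claim for the simplest ergodic process, an i.i.d. Ber(0.3) source: we estimate P[S^n ∈ T_n^δ], i.e. the probability that the normalized log-likelihood is within δ of H = h(0.3). The parameters are arbitrary.

import math, random

p, delta, trials = 0.3, 0.05, 500
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # entropy in bits

def in_typical_set(n):
    k = sum(random.random() < p for _ in range(n))     # number of ones in S^n
    loglik = k * math.log2(p) + (n - k) * math.log2(1 - p)
    return abs(-loglik / n - H) <= delta

for n in [100, 1000, 4000]:
    freq = sum(in_typical_set(n) for _ in range(trials)) / trials
    print(n, freq)    # tends to 1 as n grows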
Some historical notes are in order. Convergence in probability for stationary ergodic Markov
chains was already shown in [378]. The extension to convergence in L1 for all stationary ergodic processes is due to McMillan [301], and the extension to almost sure convergence to Breiman [75].1 A modern proof is in [11]. Note also that for a Markov chain, existence of typical sequences and the AEP can be anticipated by thinking of a Markov process as a sequence of independent decisions regarding which transitions to take at each state. It is then clear that a Markov process's trajectory is simply a transformation of the trajectory of an iid process, and hence must concentrate similarly.
lim_{n→∞} (1/n) ∑_{k=1}^n f(S_k, . . . ) = E f(S_1, . . . )   a.s. and in L1 .
In the special case where f depends on finitely many coordinates, say, f = f(S_1, . . . , S_m),
lim_{n→∞} (1/n) ∑_{k=1}^n f(S_k, . . . , S_{k+m−1}) = E f(S_1, . . . , S_m)   a.s. and in L1 .
1 Curiously, both McMillan and Breiman left the field after these contributions. McMillan went on to head the US satellite reconnaissance program, and Breiman became a pioneer and advocate of the machine-learning approach to statistical inference.
Example 12.1 Consider f = f(S_1). Then for an iid process Theorem 12.8 is simply the strong law of large numbers. At the other extreme, if {S_i} has constant trajectories, i.e. S_i = S_1 for all i ≥ 1, then such a process is non-ergodic and the conclusion of Theorem 12.8 fails (unless S_1 is an a.s. constant).
We introduce an extension of the idea of the Markov chain.
Definition 12.9 (Finite order Markov chain) {S_i : i ∈ N} is an mth order Markov chain if P_{S_{t+1}|S_1^t} = P_{S_{t+1}|S_{t−m+1}^t} for all t ≥ m. It is called time homogeneous if P_{S_{t+1}|S_{t−m+1}^t} = P_{S_{m+1}|S_1^m}.
Remark 12.2 Showing (12.3) for an mth order time-homogeneous Markov chain {Si } is a
direct application of Birkhoff-Khintchine. Indeed, we have
(1/n) log 1/P_{S^n}(S^n) = (1/n) ∑_{t=1}^n log 1/P_{S_t|S^{t−1}}(S_t | S^{t−1})
   = (1/n) log 1/P_{S^m}(S^m) + (1/n) ∑_{t=m+1}^n log 1/P_{S_t|S_{t−m}^{t−1}}(S_t | S_{t−m}^{t−1})
   = (1/n) log 1/P_{S^m}(S^m) + (1/n) ∑_{t=m+1}^n log 1/P_{S_{m+1}|S_1^m}(S_t | S_{t−m}^{t−1}) ,        (12.5)
where the first term → 0 and the second term → H(S_{m+1}|S_1^m) by Birkhoff-Khintchine; here we applied Theorem 12.8 with f(s_1, s_2, . . .) = log 1/P_{S_{m+1}|S_1^m}(s_{m+1} | s_1^m).
Now let us prove (12.3) for a general stationary ergodic process {Si } which might have infinite
memory. The idea is to first approximate the distribution of that ergodic process by an m-th order
Markov chain (finite memory) and make use of (12.5), then let m → ∞ to make the approximation
accurate. This is a highly influential contribution of Shannon to the theory of stochastic processes,
known as Markov approximation.
Proof of Theorem 12.5 in L1 . To show that (12.3) converges in L1 , we want to show that
E | (1/n) log 1/P_{S^n}(S^n) − H | → 0 ,   n → ∞ .
To this end, fix an m ∈ N. Define the following auxiliary distribution for the process:
Q^{(m)}(S_1^∞) ≜ P_{S_1^m}(S_1^m) ∏_{t=m+1}^∞ P_{S_t|S_{t−m}^{t−1}}(S_t | S_{t−m}^{t−1})
             = P_{S_1^m}(S_1^m) ∏_{t=m+1}^∞ P_{S_{m+1}|S_1^m}(S_t | S_{t−m}^{t−1}) ,
where the second line applies stationarity. Note that under Q(m) , {Si } is an mth -order time-
homogeneous Markov chain.
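As a rough numerical sketch (ours) of Shannon's Markov approximation: from one long sample path we estimate the m-th order conditional law and compute H_m = H(S_{m+1}|S_1^m), which for an ergodic source decreases to the entropy rate as m grows. The source below is a binary symmetric Markov chain with flip probability 0.2, so the true entropy rate is h(0.2) ≈ 0.722 bits; the sample size is arbitrary.

import math, random
from collections import Counter

def sample_path(n, flip=0.2):
    s = [random.randrange(2)]
    for _ in range(n - 1):
        s.append(s[-1] ^ (random.random() < flip))
    return s

def H_m(path, m):
    # empirical estimate of H(S_{m+1} | S_1^m) from one sample path
    ctx, joint = Counter(), Counter()
    for i in range(m, len(path)):
        c = tuple(path[i - m:i])
        ctx[c] += 1
        joint[c + (path[i],)] += 1
    n = sum(joint.values())
    return sum(-cnt / n * math.log2((cnt / n) / (ctx[cj[:-1]] / n))
               for cj, cnt in joint.items())

path = sample_path(200000)
for m in range(4):
    print(m, H_m(path, m))   # H_0 ~ 1 bit, H_m ~ h(0.2) for m >= 1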
By triangle inequality,
E | (1/n) log 1/P_{S^n}(S^n) − H |
  ≤ E | (1/n) log 1/P_{S^n}(S^n) − (1/n) log 1/Q^{(m)}_{S^n}(S^n) |   (≜ A)
  + E | (1/n) log 1/Q^{(m)}_{S^n}(S^n) − H_m |                        (≜ B)
  + |H_m − H|                                                          (≜ C)
where
(1/n) D(P_{S^n} ‖ Q^{(m)}_{S^n}) = (1/n) E[ log ( P_{S^n}(S^n) / ( P_{S^m}(S^m) ∏_{t=m+1}^n P_{S_{m+1}|S_1^m}(S_t | S_{t−m}^{t−1}) ) ) ]
  = (1/n) ( −H(S^n) + H(S^m) + (n − m) H_m )
  → H_m − H   as n → ∞,
lim sup_{n→∞} E | (1/n) log 1/P_{S^n}(S^n) − H | ≤ 2(H_m − H).
Lemma 12.10
E_P | log dP/dQ | ≤ D(P‖Q) + (2 log e)/e .
Intuitively:
A_n = (1/n) ∑_{k=1}^n T^k = (1/n) (I − T^n)(I − T)^{−1}
Then, if f ⊥ ker(I − T) we should have An f → 0, since only components in the kernel can blow
up. This intuition is formalized in the proof below.
Let us further decompose f into two parts f = f1 + f2 , where f1 ∈ ker(I − T) and f2 ∈ ker(I − T)⊥ .
We make the following observations:
• if g ∈ ker(I − T), g must be a constant function. This is due to the ergodicity. Consider the
indicator function 1_A: if 1_A = 1_A ◦ τ = 1_{τ^{−1}A}, then P[A] = 0 or 1. For a general case, suppose
g = Tg and g is not constant, then at least some set {g ∈ (a, b)} will be shift-invariant and have
non-trivial measure, violating ergodicity.
• ker(I − T) = ker(I − T∗ ). This is due to the fact that T is unitary:
With these observations, we know that f_1 = m is a constant. Also, f_2 ∈ [Im(I − T)], so we further approximate it by f_2 = f_0 + h_1, where f_0 ∈ Im(I − T), namely f_0 = g − g ◦ τ for some function g ∈ L_2, and ‖h_1‖_1 ≤ ‖h_1‖_2 < ϵ. Therefore we have
A_n f_1 = f_1 = E[f]
A_n f_0 = (1/n)(g − g ◦ τ^n) → 0   a.s. and in L1 ,
since E[ ∑_{n≥1} ((g ◦ τ^n)/n)^2 ] = E[g^2] ∑_{n≥1} 1/n^2 < ∞ and hence (1/n) g ◦ τ^n → 0 a.s. by Borel-Cantelli.
The proof of (12.6) makes use of the Maximal Ergodic Lemma stated as follows:
Theorem 12.11 (Maximal Ergodic Lemma) Let (P, τ ) be a probability measure and a
measure-preserving transformation. Then for any f ∈ L1 (P) we have
P[ sup_{n≥1} A_n f > a ] ≤ E[ f 1{sup_{n≥1} A_n f > a} ] / a ≤ ‖f‖_1 / a
where A_n f = (1/n) ∑_{k=0}^{n−1} f ◦ τ^k.
This is a so-called “weak L1 ” estimate for a sublinear operator supn An (·). In fact, this theorem
is exactly equivalent to the following result:
Proof. The argument for this Lemma has originally been quite involved, until a dramatically
simple proof (below) was found by A. Garsia [180, Theorem 2.2.2]. Define
S_n = ∑_{k=1}^n Z_k ,
L_n = max{0, Z_1, . . . , Z_1 + · · · + Z_n } ,
M_n = max{0, Z_2, Z_2 + Z_3, . . . , Z_2 + · · · + Z_n } ,
Z* = sup_{n≥1} S_n / n .
from which the Lemma follows by upper-bounding the left-hand side with E[|Z_1|].
In order to show (12.7) we notice that
Z1 + Mn = max{S1 , . . . , Sn }
and furthermore
Z1 + M n = Ln on {Ln > 0}
Thus, we have
where we do not need the indicator in the first term since L_n = 0 on {L_n > 0}^c. Taking expectations we get
where we used M_n ≥ 0, the fact that M_n has the same distribution as L_{n−1}, and L_n ≥ L_{n−1}, respectively. Taking the limit as n → ∞ in (12.8) and noticing that {L_n > 0} ↗ {Z* > 0}, we obtain (12.7).
where the supremum is taken over all finitely-valued random variables X_0 : Ω → X measurable with respect to F.
Note that every random variable X0 generates a stationary process adapted to τ , that is
Xk ≜ X0 ◦ τ k .
In this way, Kolmogorov-Sinai entropy of τ equals the maximal entropy rate among all stationary
processes adapted to τ . This quantity may be extremely hard to evaluate, however. One help comes
in the form of the famous criterion of Y. Sinai. We need to elaborate on some more concepts before:
P[E∆E′ ] = 0 .
σ{Y, Y ◦ τ, . . . , Y ◦ τ n , . . .} = F mod P
Proof. Notice that since H(Y) is finite, we must have H(Yn0 ) < ∞ and thus H(Y) < ∞. First, we
argue that H(τ ) ≥ H(Y). If Y has finite alphabet, then it is simply from the definition. Otherwise
let Y be Z+ -valued. Define a truncated version Ỹm = min(Y, m), then since Ỹm → Y as m → ∞
we have from lower semicontinuity of mutual information, cf. (4.28), that
H(Y|Ỹ) ≤ ϵ ,
≤ H(Ỹ_0^n) + ∑_{i=0}^n H(Y_i | Ỹ_i)
H(X_0) = I(X_0; Y_0^∞) = lim_{n→∞} I(X_0; Y_0^n) ,
where we used the continuity-in-σ -algebra property of mutual information, cf. (4.30). Rewriting
the latter limit differently, we have
lim_{n→∞} H(X_0 | Y_0^n) = 0 .
(This is just another way to say that ⋃_n σ(Y_0^n) is P-dense in σ(Y_0^∞).) Define a stationary process X̃ as
X̃_j ≜ f_ϵ(Y_j^{m+j}) .
Notice that since X̃_0^n is a function of Y_0^{n+m} we have
H(X̃_0^n) ≤ H(Y_0^{n+m}) .
Dividing by n and passing to the limit we conclude that the entropy rates satisfy
H(X̃) ≤ H(Y) .
P[X̃_j ≠ X_j] ≤ ϵ .
Since both processes take values on a fixed finite alphabet, from Corollary 6.7 we infer that
Altogether, we have shown that H(X) ≤ H(Y) + ϵ log |X| + h(ϵ). Taking ϵ → 0 concludes the proof.
It is easy to show that Y(ω) = 1{ω < 1/2} is a generator and that Y is an i.i.d. Bernoulli(1/2)
process. Thus, we get that Kolmogorov-Sinai entropy is H(τ ) = log 2.
Let us understand the significance of this example and Sinai's result. If we have a full "microscopic" description of the initial state of the system ω, then the future states of the system are completely deterministic: τ(ω), τ(τ(ω)), · · ·. However, in practice we cannot possibly have a complete description of the initial state, and should be satisfied with some discrete (i.e. finite or countably-infinite) measurement outcomes Y(ω), Y(τ(ω)), etc. What we infer from the previous result is that no matter how fine our discrete measurements are, they will still generate a process that has finite entropy rate (equal to log 2 bits per measurement). This reconciles the apparent paradox between the Newtonian (dynamical) and Gibbsian (statistical) points of view on large mechanical systems. In more mundane terms, we may notice that Sinai's theorem tells us that a much more complicated stochastic process (e.g. the one generated by a ternary-valued measurement Y′(ω) = 1{ω > 1/3} + 1{ω > 2/3}) would still have the same entropy rate as the simple iid Bernoulli(1/2) process.
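A quick simulation illustrating this point (our own sketch; we take the underlying p.p.t. to be the doubling map τ(ω) = 2ω mod 1 on [0,1) with Lebesgue measure, the standard example for which Y(ω) = 1{ω < 1/2} is a generator). The measured bits Y(ω), Y(τ(ω)), . . . behave like i.i.d. fair coin flips, matching the entropy rate log 2.

import random

def doubling_bits(n, prec_extra=16):
    # represent w = m / 2**prec exactly; tau(w) = 2w mod 1 becomes m -> (2m) mod 2**prec
    prec = n + prec_extra
    m = random.getrandbits(prec)              # w approximately uniform on [0, 1)
    bits = []
    for _ in range(n):
        bits.append(1 if m < 2 ** (prec - 1) else 0)   # Y = 1{w < 1/2}
        m = (2 * m) % (2 ** prec)
    return bits

bits = doubling_bits(10000)
ones = sum(bits)
pairs = sum(1 for a, b in zip(bits, bits[1:]) if (a, b) == (1, 1))
print("freq of 1:", ones / len(bits))           # approx 1/2
print("freq of 11:", pairs / (len(bits) - 1))   # approx 1/4 (consistent with independence)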
• Let Ω be the unit circle S^1, F the Borel σ-algebra, and P the normalized length, and
τ(ω) = ω + γ ,
i.e. τ is a rotation by the angle γ. (When γ/(2π) is irrational, this is known to be an ergodic p.p.t.) Here Y = 1{|ω| < 2πϵ} is a generator for arbitrarily small ϵ and hence
Remark 12.3 Two p.p.t.'s (Ω_1, τ_1, P_1) and (Ω_0, τ_0, P_0) are called isomorphic if there exist f_i : Ω_i → Ω_{1−i} defined P_i-almost everywhere and such that 1) τ_{1−i} ◦ f_i = f_i ◦ τ_i; 2) f_i ◦ f_{1−i} is the identity on Ω_i (a.e.); 3) P_i[f_i^{−1} E] = P_{1−i}[E]. It is easy to see that the Kolmogorov-Sinai entropies of isomorphic p.p.t.s are equal. This observation was made by Kolmogorov in 1958. It was revolutionary, since it
of a unitary operator
U_τ : L^2(Ω, P) → L^2(Ω, P)                                (12.9)
      ϕ(x) ↦ ϕ(τ(x)) .                                     (12.10)
However, the spectrum of U_τ corresponding to any non-constant i.i.d. process consists of the entire unit circle, and thus is unable to distinguish Ber(1/2) from Ber(1/3).2
2 To see the statement about the spectrum, let X_i be iid with zero mean and unit variance. Then consider ϕ(x_1^∞) defined as (1/√m) ∑_{k=1}^m e^{iωk} x_k. This ϕ has unit energy and as m → ∞ we have ‖U_τ ϕ − e^{iω} ϕ‖_{L2} → 0. Hence every e^{iω} belongs to the spectrum of U_τ.
13 Universal compression
Unfortunately, the theory developed so far is not very helpful for anyone tasked with actually compressing a file of English text. Indeed, since the probability law governing text generation is not given to us, one cannot apply the compression results that we discussed so far. In this chapter we
will discuss how to produce compression schemes that do not require a priori knowledge of the
distribution. For example, an n-letter input compressor maps X n → {0, 1}∗ . There is no one fixed
probability distribution PXn on X n , but rather a whole class of distributions. Thus, the problem of
compression becomes intertwined with the problem of distribution (density) estimation and we
will see that optimal algorithms for both problems are equivalent.
The plan for this chapter is as follows:
1 We will start by discussing the earliest example of a universal compression algorithm (of
Fitingof). It does not talk about probability distributions at all. However, it turns out to be asymp-
totically optimal simultaneously for all i.i.d. distributions and with small modifications for all
finite-order Markov chains.
2 Next class of universal compressors is based on assuming that the true distribution PXn belongs
to a given class. These methods proceed by choosing a good model distribution QXn serving as
the minimax approximation to each distribution in the class. The compression algorithm for a
single distribution QXn is then designed as in previous chapters.
3 Finally, an entirely different idea underlies the algorithms of Lempel-Ziv type. These automatically adapt
to the distribution of the source, without any prior assumptions required.
Throughout this chapter, all logarithms are binary. Instead of describing each compres-
sion algorithm, we will merely specify some distribution QXn and apply one of the following
constructions:
• Sort all x^n in the order of decreasing Q_{X^n}(x^n) and assign values from {0, 1}* as in Theorem 10.2; this compressor has lengths satisfying
ℓ(f(x^n)) ≤ log 1/Q_{X^n}(x^n) .
• Set lengths to be
ℓ(f(x^n)) ≜ ⌈log 1/Q_{X^n}(x^n)⌉
and apply Kraft's inequality (Theorem 10.9) to construct a prefix code.
Associate to each xn an interval Ixn = [Fn (xn ), Fn (xn ) + QXn (xn )). These intervals are disjoint
subintervals of [0, 1). As such, each xn can be represented uniquely by any point in the interval Ixn .
A specific choice is as follows. Encode
x^n ↦ largest dyadic interval D_{x^n} contained in I_{x^n}        (13.1)
and we agree to select the left-most dyadic interval when there are two possibilities. Recall that
dyadic intervals are intervals of the type [m2−k , (m + 1)2−k ) where m is an integer. We encode
such interval by the k-bit (zero-padded) binary expansion of the fractional number m2−k =
0.b_1 b_2 . . . b_k = ∑_{i=1}^k b_i 2^{−i}. For example, [3/4, 7/8) ↦ 110, [3/4, 13/16) ↦ 1100. We set the codeword f(x^n) to be that string. The resulting code is a prefix code satisfying
log_2 1/Q_{X^n}(x^n) ≤ ℓ(f(x^n)) ≤ log_2 1/Q_{X^n}(x^n) + 2 .        (13.2)
(This is an exercise, see Ex. II.13.)
Observe that
F_n(x^n) = F_{n−1}(x^{n−1}) + Q_{X^{n−1}}(x^{n−1}) ∑_{y < x_n} Q_{X_n|X^{n−1}}(y | x^{n−1})
and thus Fn (xn ) can be computed sequentially if QXn−1 and QXn |Xn−1 are easy to compute. This
method is the method of choice in many modern compression algorithms because it allows to
dynamically incorporate the learned information about the data stream, in the form of updating
QXn |Xn−1 (e.g. if the algorithm detects that an executable file contains a long chunk of English text,
it may temporarily switch to QXn |Xn−1 modeling the English language).
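A self-contained sketch of the interval code just described (our own illustration; the alphabet, ordering, and probabilities are made up). We compute I_{x^n} = [F_n(x^n), F_n(x^n) + Q_{X^n}(x^n)) with exact rational arithmetic for an i.i.d. model Q, and emit the left-most largest dyadic interval inside it.

from fractions import Fraction

Q = {'e': Fraction(1, 2), 'o': Fraction(3, 10), 't': Fraction(1, 5)}
symbols = sorted(Q)                       # fixed symbol order used to define F_n

def interval(xs):
    lo, width = Fraction(0), Fraction(1)
    for x in xs:                          # sequential update of F_n and Q_{X^n}
        lo += width * sum(Q[y] for y in symbols if y < x)
        width *= Q[x]
    return lo, width

def encode(xs):
    lo, width = interval(xs)
    k = 0
    while True:                           # smallest k such that a dyadic interval
        k += 1                            # of length 2^-k fits inside [lo, lo+width)
        m = -((-lo * 2**k) // 1)          # ceil(lo * 2^k): left-most candidate
        if Fraction(m + 1, 2**k) <= lo + width:
            return format(m, 'b').zfill(k)

print(encode('etoo'))                     # codeword length <= log2(1/Q_{X^n}) + 2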
We note that efficient implementation of arithmetic encoder and decoder is a continuing
research area. Indeed, performance depends on number-theoretic properties of denominators of
distributions QXt |Xt−1 , because as encoder/decoder progress along the string, they need to peri-
odically renormalize the current interval Ixt to be [0, 1) but this requires carefully realigning the
dyadic boundaries. A recent idea of J. Duda, known as the asymmetric numeral system (ANS) [138], led to such impressive computational gains that in less than a decade it was adopted by most
compression libraries handling diverse data streams (e.g., the Linux kernel images, Dropbox and
Facebook traffic, etc).
Then Fitingof argues that it should be possible to produce a prefix code with
ℓ(f(xn )) = Φ0 (xn ) + O(log n) . (13.6)
This can be done in many ways. In the spirit of what comes next, let us define
QXn (xn ) ≜ exp{−Φ0 (xn )}cn , (13.7)
c_n = O(n^{−(|X|−1)}) ,
and thus, by Kraft inequality, there must exist a prefix code with lengths satisfying (13.6).1 Now taking expectation over X^n i.i.d.∼ P_X we get
Universal compressor for all finite-order Markov chains. Fitingof’s idea can be extended as
follows. Define now the first-order information content Φ1 (xn ) to be the log of the number of all
sequences, obtainable by permuting xn with the extra restriction that the new sequence should have
the same statistics on digrams. Asymptotically, Φ1 is just the conditional entropy
where T − 1 is understood in the sense of modulo n. Again, it can be shown that there exists a code
such that lengths
This implies that for every first-order stationary Markov chain X1 → X2 → · · · → Xn we have
This can be further continued to define Φ2 (xn ) leading to a universal code that is asymptotically
optimal for all second-order Markov chains, and so on and so forth.
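A small illustration (ours) of the zeroth-order information content: following the description of Φ_1 above, Φ_0(x^n) is the log of the number of strings sharing the empirical distribution (type) of x^n, i.e. the log of a multinomial coefficient, and it is within O(log n) of the empirical entropy nH(P̂_{x^n}). The example string is arbitrary.

import math
from collections import Counter

def phi0_bits(xs):
    # log2 of the multinomial coefficient n! / prod_a (count of a)!
    counts = Counter(xs)
    n = len(xs)
    log_multinom = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts.values())
    return log_multinom / math.log(2)

def empirical_entropy_bits(xs):
    n = len(xs)
    return n * sum(-c / n * math.log2(c / n) for c in Counter(xs).values())

x = "abracadabra" * 50
print(phi0_bits(x), empirical_entropy_bits(x))   # differ by O(log n)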
simultaneously for all i.i.d. sources (or even all r-th order Markov chains). What should we
do next? Krichevsky [259] suggested that the next barrier should be to minimize the regret, or
redundancy:
1
Explicitly, we can do a two-part encoding: first describe the type class of xn (takes (|X | − 1) log n bits) and then describe
the element of the class (takes Φ0 (xn ) bits).
Replacing code lengths with log 1/Q_{X^n}, we define the redundancy of the distribution Q_{X^n} as
Thus, the question of designing the best universal compressor (in the sense of optimizing worst-
case deviation of the average length from the entropy) becomes the question of finding solution
of:
Assuming the finiteness of R∗n , Theorem 5.9 gives the maximin and capacity representation
where optimization is over priors π ∈ P(Θ) on θ. Thus redundancy is simply the capacity of
the channel θ → Xn . This result, obvious in hindsight, was rather surprising in the early days of
universal compression. It is known as capacity-redundancy theorem.
Finding the exact Q_{X^n}-minimizer in (13.8) is a daunting task even for the simple class of all i.i.d. Bernoulli sources (i.e. Θ = [0, 1], P_{X^n|θ} = Ber(θ)^n). In fact, for smooth parametric families the capacity-achieving input distribution is rather cumbersome: it is a discrete distribution with k_n atoms, k_n slowly growing as n → ∞. A provocative conjecture was put forward by physicists [296, 2] that there is a certain universality relation:
R*_n = (3/4) log k_n + o(log k_n)
satisfied for all parametric families simultaneously. For the Bernoulli example this implies k_n ≍ n^{2/3}, but even this is open. However, as we will see below it turns out that these unwieldy capacity-achieving input distributions converge as n → ∞ to a beautiful limiting law, known as the Jeffreys prior.
Remark 13.1 (Shtarkov, Fitingof and individual sequence approach) There is a connection
between the combinatorial method of Fitingof and the method of optimality for a class. Indeed,
following Shtarkov we may want to choose distribution QXn so as to minimize the worst-case
redundancy for each realization xn (not average!):
R**_n(Θ) ≜ min_{Q_{X^n}} max_{x^n} sup_{θ_0∈Θ} log [ P_{X^n|θ}(x^n|θ_0) / Q_{X^n}(x^n) ]        (13.11)
This minimization is attained at the Shtarkov distribution (also known as the normalized maximal
likelihood (NML) code):
Q^{(S)}_{X^n}(x^n) = (1/Z) sup_{θ_0∈Θ} P_{X^n|θ}(x^n|θ_0) ,        (13.12)
where the normalization constant
Z = ∑_{x^n∈X^n} sup_{θ_0∈Θ} P_{X^n|θ}(x^n|θ_0) ,                   (13.13)
is called the Shtarkov sum. If the class {PXn |θ : θ ∈ Θ} is chosen to be all product distributions on
X then
(i.i.d.)   Q^{(S)}_{X^n}(x^n) = exp{−nH(P̂_{x^n})} / ∑_{x̃^n} exp{−nH(P̂_{x̃^n})} ,        (13.14)
where H(P̂_{x^n}) is the empirical entropy of x^n. As such, compressing with respect to Q^{(S)}_{X^n} recovers
Fitingof’s construction Φ0 up to O(log n) differences between nH(P̂xn ) and Φ0 (xn ). If we take
PXn |θ to be all first-order Markov chains, then we get construction Φ1 etc. Note also, that the
problem (13.11) can also be written as a minimization of the regret for each individual sequence
(under the log-loss, with respect to a parameter class PXn |θ ):
min_{Q_{X^n}} max_{x^n} [ log 1/Q_{X^n}(x^n) − inf_{θ_0∈Θ} log 1/P_{X^n|θ}(x^n|θ_0) ] .        (13.15)
In summary, using Shtarkov’s distribution (minimizer of (13.15)) makes sure that any individual
realization of xn (whether it was or was not generated by PXn |θ=θ0 for some θ0 ) is compressed
almost as well as by the best compressor tailored for the class of P_{X^n|θ}. Hence, if our model class P_{X^n|θ} approximates the generative process of x^n well, we achieve nearly optimal compression. In Section 13.7 below we will also learn that Q_{X_n|X^{n−1}} can be interpreted as an online estimator of the distribution of the x_j's.
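A quick computation (ours) of the Shtarkov sum (13.13) for the class of all i.i.d. Bernoulli distributions: the maximum-likelihood value of a string with k ones is (k/n)^k ((n−k)/n)^{n−k}, so Z reduces to a single sum over k, and log_2 Z (the worst-case regret) grows like (1/2) log_2 n plus a constant. The values of n are arbitrary.

import math

def log2_shtarkov_sum(n):
    # sum over k of C(n,k) * (k/n)^k * ((n-k)/n)^(n-k), computed in the log domain
    logs = []
    for k in range(n + 1):
        log_ml = (k * math.log(k / n) if k > 0 else 0.0) + \
                 ((n - k) * math.log((n - k) / n) if k < n else 0.0)
        logs.append(math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1) + log_ml)
    m = max(logs)
    return (m + math.log(sum(math.exp(L - m) for L in logs))) / math.log(2)

for n in [10, 100, 1000]:
    print(n, log2_shtarkov_sum(n), 0.5 * math.log2(n))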
Remark 13.2 (Two redundancies) In the literature of universal compression, the quantity R**_n is known as the worst-case or pointwise minimax redundancy, in comparison with the average-case minimax redundancy R*_n in (13.8), which replaces max_{x^n} in (13.11) by E_{x^n ∼ P_{X^n|θ_0}}. It is known that for many model classes, such as iid and finite-order Markov sources, R*_n and R**_n agree in the leading term as n → ∞.2 As R*_n ≤ R**_n, typically the way one bounds the redundancies is to upper bound R**_n by bounding the pointwise redundancy (via combinatorial means) for a specific probability assignment and to lower bound R*_n by applying (13.10) and bounding the mutual information for a specific prior; see Exercises II.15 and II.16 for an example and [112, Chap. 6-7] for more.
2 This, however, is not true in general. See Exercise II.21 for an example where R*_n < ∞ but R**_n = ∞.
Remark 13.3 (Redundancy for single-shot codes) We note that any prefix code f : X^n → {0, 1}* defines a distribution Q_{X^n}(x^n) = 2^{−ℓ(f(x^n))}. (We assume the code's binary tree is full so that the Kraft sum equals one.) Therefore, our definition of redundancy in (13.8) assesses the excess of the expected length E[ℓ(f(X^n))] over H(X^n) for prefix codes. For single-shot codes (Section 10.1) without prefix constraints the optimal answers are slightly different, however. For example, the optimal universal code for all i.i.d. sources satisfies E[ℓ(f(X^n))] ≈ H(X^n) + ((|X|−3)/2) log n, in contrast with ((|X|−1)/2) log n for prefix codes, cf. [41, 256].
c(α_0, . . . , α_d) ∏_{j=0}^d θ_j^{α_j − 1}        (13.16)
and θ_0 = 1 − ∑_{j=1}^d θ_j, where c(α_0, . . . , α_d) = Γ(α_0 + . . . + α_d) / ∏_{j=0}^d Γ(α_j) is the normalizing constant.
First, we give the formal setting as follows:
θ_0 ≜ 1 − ∑_{j=1}^d θ_j .
In order to find the (near) optimal QXn , we need to guess an (almost) optimal prior π ∈ P(Θ)
in (13.10) and take QXn to be the mixture of PXn |θ ’s. We will search for π in the class of smooth
Before proceeding further, we recall the Laplace method of approximating exponential inte-
grals. Suppose that f(θ) has a unique minimum at an interior point θ̂ of Θ and that the Hessian Hess f is uniformly lower-bounded by a multiple of identity (in particular, f(θ) is strongly convex). Then, taking Taylor expansions of π and f we get
∫_Θ π(θ) e^{−nf(θ)} dθ = ∫ (π(θ̂) + O(‖t‖)) e^{−n(f(θ̂) + (1/2) t^T Hess f(θ̂) t + o(‖t‖^2))} dt        (13.18)
  = π(θ̂) e^{−nf(θ̂)} ∫_{R^d} e^{−x^T Hess f(θ̂) x} dx/√(n^d) · (1 + O(n^{−1/2}))                      (13.19)
  = π(θ̂) e^{−nf(θ̂)} (2π/n)^{d/2} (1/√(det Hess f(θ̂))) (1 + O(n^{−1/2}))                             (13.20)
θ̂(x^n) ≜ P̂_{x^n}
where we used the fact that Hess_{θ′} D(P̂ ‖ P_{X|θ=θ′}) |_{θ′=θ̂} = (1/log e) J_F(θ̂), with J_F being the Fisher information matrix introduced previously in (2.34). From here, using the fact that under X^n ∼ P_{X^n|θ=θ′} the random variable θ̂ = θ′ + O(n^{−1/2}), we get by approximating J_F(θ̂) and P_θ(θ̂)
D(P_{X^n|θ=θ′} ‖ Q_{X^n}) = n(E[H(θ̂)] − H(X|θ = θ′)) + (d/2) log n − log [ P_θ(θ′) / √(det J_F(θ′)) ] + C + O(n^{−1/2}) ,        (13.21)
where C is some constant (independent of the prior Pθ or θ′ ). The first term is handled by the next
result, refining Corollary 7.18.
Proof. By the Central Limit Theorem, √n (P̂ − P) converges in distribution to N(0, Σ), where Σ = diag(P) − PP^T and P is an |X|-by-1 column vector. Thus, computing the second-order Taylor expansion of D(·‖P), cf. (2.34) and (2.37), we get the result. (To interchange the limit and the expectation, more formally we need to condition on the event P̂_n(x) ∈ (ϵ, 1 − ϵ) for all x ∈ X to make the integrand function bounded. We leave these technical details as an exercise.)
D(P_{X^n|θ=θ′} ‖ Q_{X^n}) = (d/2) log n − log [ π(θ′) / √(det J_F(θ′)) ] + constant + O(n^{−1/2})        (13.22)
under the assumption of smoothness of the prior π and that θ′ is not on the boundary of Θ. Consequently, we can see that in order for the prior π to be the saddle point solution, we should have
π(θ′) ∝ √(det J_F(θ′)) ,
provided that the right side is integrable. The prior proportional to the square root of the determinant of the Fisher information matrix is known as the Jeffreys prior. In our case, using the explicit expression for Fisher information (2.39), the Jeffreys prior π* is found to be Dirichlet(1/2, 1/2, · · · , 1/2), with density
π*(θ) = c_d / √( ∏_{j=0}^d θ_j ) ,        (13.23)
where c_d = Γ((d+1)/2) / Γ(1/2)^{d+1} is the normalization constant. The corresponding redundancy is then
R*_n = (d/2) log (n/(2πe)) − log [ Γ((d+1)/2) / Γ(1/2)^{d+1} ] + o(1) .        (13.24)
Making the above derivation rigorous is far from trivial and was completed in [460]. (In Exercises II.15 and II.16 we analyze the d = 1 case and show R*_n = (1/2) log n + O(1).)
Overall, we see that the Jeffreys prior asymptotically maximizes (within o(1)) the value sup_π I(θ; X^n) and for this reason is called the asymptotically maximin solution. Surprisingly [405], the corresponding mixture Q_{X^n}, which we denote Q^{(KT)}_{X^n} (and study in detail in the next section), turns out, however, not to give the asymptotically optimal redundancy. That is, we have for some c_1 > c_2 > 0 the inequalities
That is, Q^{(KT)}_{X^n} is not asymptotically minimax (but it does achieve the optimal redundancy up to an O(1) term). However, it turns out that patching the Jeffreys prior near the boundary of the simplex (or using a mixture of Dirichlet distributions) does result in asymptotically minimax universal probability assignments [460].
Extension to general smooth parametric families. The fact that Jeffreys prior θ ∼ π maxi-
mizes the value of mutual information I(θ; Xn ) for general parametric families was conjectured
in [46] in the context of selecting priors in Bayesian inference. This result was proved rigorously
in [95, 96]. We briefly summarize the results of the latter.
Let {Pθ : θ ∈ Θ0 } be a smooth parametric family admitting a continuous and bounded Fisher
information matrix JF (θ) everywhere on the interior of Θ0 ⊂ Rd . Then for every compact Θ
contained in the interior of Θ0 we have
R*_n(Θ) = (d/2) log (n/(2πe)) + log ∫_Θ √(det J_F(θ)) dθ + o(1) .        (13.25)
Although Jeffreys prior on Θ achieves (up to o(1)) the optimal value of supπ I(θ; Xn ), to produce
an approximate capacity-achieving output distribution QXn , however, one needs to take a mixture
with respect to a Jeffreys prior on a slightly larger set Θϵ = {θ : d(θ, Θ) ≤ ϵ} and take ϵ → 0
slowly with n → ∞. This sequence of QXn ’s does achieve the optimal redundancy up to o(1).
Remark 13.4 (Laplace's law of succession) In statistics the Jeffreys prior is justified as being invariant to smooth reparametrization, as evidenced by (2.35). For example, in answering "will the sun rise tomorrow"3, Laplace proposed to estimate the probability by modeling sunrises as an i.i.d. Bernoulli process with a uniform prior on θ ∈ [0, 1]. However, this is clearly not very logical, as one may equally well postulate uniformity of α = θ^{10} or β = √θ. The Jeffreys prior θ ∼ 1/√(θ(1−θ)) is invariant to reparametrization in the sense that if one computed √(det J_F(α)) under the α-parametrization the result would be exactly the pushforward of 1/√(θ(1−θ)) along the map θ ↦ θ^{10}.
3
Interested readers should check Laplace’s rule of succession and the sunrise problem; see [229, Chap. 18] for a historical
and philosophical account.
4 We remind that (2a − 1)!! = 1 · 3 · · · (2a − 1). The expression for Q_{X^n} is obtained from the identity ∫_0^1 θ^a (1−θ)^b / √(θ(1−θ)) dθ = π (2a−1)!!(2b−1)!! / (2^{a+b} (a+b)!) for integer a, b ≥ 0, which in turn is derived by the change of variable z = θ/(1−θ) and using the standard keyhole contour on the complex plane.
Note that Q^{(KT)}_{X^{n−1}} coincides with the marginalization of Q^{(KT)}_{X^n} to the first n − 1 coordinates. This property is not specific to the KT distribution and holds for any Q_{X^n} that is given in the form ∫ P_θ(dθ) P_{X^n|θ} with P_θ not depending on n. What is remarkable, however, is that the conditional distribution Q^{(KT)}_{X_n|X^{n−1}} has a rather elegant form:
Q^{(KT)}_{X_n|X^{n−1}}(1|x^{n−1}) = (t_1 + 1/2)/n ,   t_1 = #{j ≤ n − 1 : x_j = 1}        (13.28)
Q^{(KT)}_{X_n|X^{n−1}}(0|x^{n−1}) = (t_0 + 1/2)/n ,   t_0 = #{j ≤ n − 1 : x_j = 0}        (13.29)
This is the famous “add-1/2” rule of Krichevsky and Trofimov [260]. As mentioned in Section 13.1,
this sequential assignment is very convenient for implementing an arithmetic
coder.
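A tiny implementation (ours) of the add-1/2 rule (13.28)-(13.29) for binary strings, together with the cumulative log-loss it incurs; on an i.i.d. Ber(θ) string the excess of this loss over nh(θ) stays around (1/2) log_2 n + O(1). The choice θ = 0.3 and the blocklength are arbitrary.

import math, random

def kt_log_loss_bits(bits):
    # cumulative -log2 of the KT sequential assignment; equals log2(1/Q^{KT}_{X^n}(x^n))
    loss, counts = 0.0, [0, 0]           # counts of 0s and 1s seen so far
    for t, x in enumerate(bits):
        q = (counts[x] + 0.5) / (t + 1)  # add-1/2 rule
        loss += -math.log2(q)
        counts[x] += 1
    return loss

theta, n = 0.3, 4096
bits = [int(random.random() < theta) for _ in range(n)]
h = -(theta * math.log2(theta) + (1 - theta) * math.log2(1 - theta))
print(kt_log_loss_bits(bits) - n * h, 0.5 * math.log2(n))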
Let f_KT : {0, 1}^n → {0, 1}* be the encoder assigning length l(f_KT(x^n)) = log_2 1/Q^{(KT)}_{X^n}(x^n). Now from (13.24) we know that
sup_{0≤θ≤1} { E[ l(f_KT(S^n_θ)) ] − nh(θ) } = (1/2) log n + O(1) .
Since (13.24) was not shown rigorously, in Exercise II.15 we prove the upper bound of this claim independently.
Remark 13.5 (Laplace "add-1" rule) A slightly less optimal choice of Q_{X^n} results from the Laplace prior (recall that Laplace advised taking P_θ to be uniform on [0, 1]). Then, in the case of the binary (Bernoulli) alphabet we get
Q^{(Lap)}_{X^n}(x^n) = 1 / [ (n + 1) \binom{n}{w} ] ,   w = #{j : x_j = 1} ,        (13.30)
Q^{(Lap)}_{X_n|X^{n−1}}(1|x^{n−1}) = (t_1 + 1)/(n + 1) ,   t_1 = #{j ≤ n − 1 : x_j = 1} .
We notice two things. First, the distribution (13.30) is exactly the same as Fitingof's (13.7). Second, this distribution "almost" attains the optimal first-order term in (13.24). Indeed, when X^n is iid Ber(θ) we have for the redundancy:
E[ log 1/Q^{(Lap)}_{X^n}(X^n) ] − H(X^n) = log(n + 1) + E[ log \binom{n}{W} ] − nh(θ) ,   W ∼ Bin(n, θ) .        (13.31)
From Stirling's expansion we know that as n → ∞ this redundancy evaluates to (1/2) log n + O(1), uniformly in θ over compact subsets of (0, 1). However, for θ = 0 or θ = 1 the Laplace redundancy (13.31) clearly equals log(n + 1). Thus, the supremum over θ ∈ [0, 1] is achieved close to the endpoints and results in the suboptimal redundancy log n + O(1). The Jeffreys prior (13.26) and the resulting KT compressor fix the problem at the endpoints.
The intimate connection between this problem of online prediction and universal compression is revealed by the following almost trivial observation. Notice that the distribution Q_t(·) output by the learner at time t is in fact a function of the observations x_1, . . . , x_{t−1}. Therefore, we should write it more explicitly as Q_t(·; x^{t−1}) to emphasize the dependence on the (training) data. But then we can also think of Q_t(x_t; x^{t−1}) as a Markov kernel Q_{X_t|X^{t−1}}(x_t|x^{t−1}) and compute a joint probability distribution
Q_{X^n}(x^n) ≜ ∏_{t=1}^n Q_t(x_t; x^{t−1}) .        (13.33)
Conversely, any Q_{X^n} can be factorized sequentially in the form (13.33) to obtain an online predictor. It turns out that the choice of the optimal Q_{X^n} is precisely the same problem of universal probability assignment that is solved by universal compression in (13.8). But before that we need to explain how to define optimality in the online prediction game.
Since ℓ_n({Q_t}, θ*) depends on the value of θ* that governs the stochastics of the input
However, this turns out to be a bad way to pose the problem. For example, if one of the P^{(θ)}_{X^n} = Unif[X^n], then no predictor can achieve ℓ_n^{(a)} below n log |X|. Furthermore, a trivial predictor that always outputs Q_t = Unif[X] achieves this upper bound exactly. Thus in the minimax setting predicting Unif[X] turns out to be optimal.
To understand how to work around this issue, let us first recall from Corollary 2.4 that if we had oracle knowledge of the true θ* generating the X_j's, then our choice would be to set Q_t to be the factorization of P^{(θ*)}_{X^n}. This achieves the loss
ℓ_n^{(a)}({P*_θ}, θ*) = H(P^{(θ*)}_{X^n}) .
Thus, even given the oracle information we cannot avoid the H(P^{(θ*)}_{X^n}) loss (this would also be called the Bayes loss in machine learning). Note that for the iid model class we have H(P^{(θ*)}_{X^n}) = nH(P^{(θ*)}_{X_1}) and the oracle loss is of order n. Consequently, the quality of the learning algorithm should be measured by the amount of excess loss above the oracle loss. This quantity is known as the average regret and is defined as
AvgReg_n({Q_t}, θ*) ≜ E_{P^{(θ*)}_{X^n}} [ ∑_{t=1}^n log 1/Q_t(X_t|X^{t−1}) ] − H(P^{(θ*)}_{X^n}) .
Hence to design an optimal algorithm we want to minimize the worst regret, or in other words to
solve the minimax problem
This turns out to be completely equivalent to the universal compression problem, as we state next.
Theorem 13.3 Recall the definition of the compression redundancy R*_n in (13.8). Then we have
AvgReg*_n(Θ) = R*_n(Θ) ≜ min_{Q_{X^n}} sup_{θ*∈Θ} D(P^{(θ*)}_{X^n} ‖ Q_{X^n}) ,
where the minimum in the RHS is achieved at a unique distribution Q*_{X^n}. The optimal predictor is given by setting Q_t(·) = Q*_{X_t|X^{t−1}=x^{t−1}}(·). Furthermore, let θ ∈ Θ have a prior distribution π ∈ P(Θ). Then
If there exists a maximizer π* of the right-hand side maximization then the optimal estimator is found by factorizing Q*_{X^n} = ∫ π*(dθ) ∏_{i=1}^n P_θ(x_i).
Proof. There is almost nothing to prove. We only need to rewrite the definition of average regret in terms of Q_{X^n} as follows:
AvgReg_n({Q_t}, θ*) = E_{P^{(θ*)}_{X^n}} [ log ( P^{(θ*)}_{X^n} / Q_{X^n} ) ] = D(P^{(θ*)}_{X^n} ‖ Q_{X^n}) .
The rest of the claims follow from Theorem 5.9 (recall that I(θ; X^n) ≤ n log |X| < ∞) and Theorem 5.4.
As an application of this result, we see that the Krichevsky-Trofimov estimator achieves for any iid string X^n i.i.d.∼ P a log-loss
∑_{t=1}^n E [ log 1/Q^{(KT)}_{X_t|X^{t−1}}(X_t|X^{t−1}) ] ≤ nH(P) + ((|X| − 1)/2) log n + c_X ,
where c_X < ∞ is a constant independent of P or n. This excess above nH(P) is optimal among all possible online estimators, except possibly for the constant c_X.
The problem we discussed may appear at first to be somewhat contrived, especially to someone who is used to supervised learning/prediction tasks. Indeed, our prediction problem does not have any features to predict from! Nevertheless, modern large language models are solving precisely this task: they are trained to predict a sequence of words by minimizing the log-loss (cross-entropy loss), cf. [320]. In those instances the learning task is made feasible by the non-iid nature of the sequence. The iid setting, however, is also quite interesting and practically relevant. But one needs to introduce a supervised-learning version for that, where the prediction task is to estimate an unknown label or quantity Y_t given a correlated feature vector X_t. There is an analog of Theorem 13.3 for that case as well – see Exercises II.20 and II.22.
Batch regret. In machine learning what we have defined above is known as cumulative (or
online) regret, because we insisted on the estimator to produce some prediction at every time step
t. However, a much more common setting is that of prediction, where the first n − 1 observations
are available as the training data and we only assess the loss on the new unseen sample (test data).
This is called the batch loss and the corresponding minimax regret is
BatchReg*_n(Θ) ≜ inf_{Q_n(·;x^{n−1})} sup_{θ*∈Θ} { E_{P^{(θ*)}_{X^n}} [ log 1/Q_n(X_n; X^{n−1}) ] − H(P^{(θ*)}_{X_n|X^{n−1}}) } ,        (13.35)
where the quantity in braces equals D(P^{(θ*)}_{X_n|X^{n−1}} ‖ Q_n(·; X^{n−1})). In other words, this is the optimal KL loss of predicting the next symbol by estimating its conditional distribution given the past, a central task in language models such as GPT [320]. Similar to Theorem 13.3 we can give a max-information formula for the batch regret (Exercise II.19). However, it turns out that there is also a connection to universal compression. Indeed, we have the following inequalities
AvgReg*_n(Θ) − AvgReg*_{n−1}(Θ) ≤ BatchReg*_n(Θ) ≤ (1/n) AvgReg*_n(Θ) ,        (13.36)
where the upper bound is only guaranteed to hold for iid models.5 The inequality (13.36) is known
as online-to-batch conversion or estimation-compression inequality [159, 240]; see Lemma 32.3
and Proposition 32.7 for a justification. The estimator that achieves the above upper bound is very
simple: it takes a probability assignment Q*_{X^n} and sets its predictor to
Q_n(x_n; x^{n−1}) ≜ (1/n) ∑_{t=1}^n Q_{X_t|X^{t−1}}(x_n | x^{t−1}) .        (13.37)
However, unlike the cumulative regret, minimizers of the batch regret are distinct from those in universal compression. For example, for the model class of all iid distributions over k symbols, we know that (asymptotically) the "add-1/2" estimator of Krichevsky-Trofimov is optimal. However, for the batch loss it is not so (see the note at the end of Exercise VI.10). We also note that the optimal batch regret in this case is O((k−1)/n), but the online-to-batch rule only yields O(((k−1) log n)/n). On the other hand, for first-order Markov chains with k ≥ 3 states, the online-to-batch upper bound turns out to be order optimal, as we have BatchReg*_n ≍ (1/n) AvgReg*_n ≍ (k^2/n) log (n/k^2) provided that n ≳ k^2 [213]; however, proving this result, especially the lower bound, requires arguments native to Markov chains.
Density estimation. Consider now the following problem. Given a collection of (single-letter) distributions P^{(θ)}_X on X and X_1, . . . , X_{n−1} i.i.d.∼ P^{(θ)}_X, we want to produce a density estimate P̂ which minimizes the worst-case error as measured by KL divergence, i.e. we seek to minimize
sup_{θ*∈Θ} E_{X^{n−1} i.i.d.∼ P^{(θ*)}_X} [ D(P̂ ‖ P^{(θ*)}_X) ] .
To connect to the previous discussion, we only need to notice that P̂(·) can be interpreted as Q_n in the batch regret problem and we have an exact equality
inf_{P̂} sup_{θ*∈Θ} E_{X^{n−1} i.i.d.∼ P^{(θ*)}_X} [ D(P̂ ‖ P^{(θ*)}_X) ] = BatchReg*_n(Θ) ≤ (1/n) sup_π I(θ; X^n) .
Thus, we bound the minimax (KL-divergence) density estimation rate by the capacity of a certain channel. The estimator achieving this bound is improper (i.e. P̂ ≠ P^{(θ)}_X for any θ) and is given by (13.37). This is the basis of the Yang-Barron approach to density estimation; see Section 32.1 for more.
5 For stationary m-th order Markov models, the upper bound in (13.36) holds with n − m in the denominator [213, Lemma 6].
This is clearly hopeless. Indeed, at any step t the distribution Q_t must have at least one atom with weight at most 1/|X|, and hence for any predictor
max_{x^n} ℓ({Q_t}, x^n) ≥ n log |X| ,
which is clearly achieved iff Q_t(·) ≡ 1/|X|, i.e. if the predictor simply makes uniform random guesses. This triviality is not surprising: in the absence of any prior information on x^n it is impossible to predict anything.
The exciting idea, originated by Feder, Merhav and Gutman, cf. [161, 303], is to replace loss
with regret, i.e. the gap to the best possible static oracle. More precisely, suppose a non-causal
oracle can examine the entire string xn and output a constant Qt ≡ Q. From the non-negativity of
divergence this non-causal oracle achieves:
ℓ_oracle(x^n) = min_Q ∑_{t=1}^n log 1/Q(x_t) = nH(P̂_{x^n}) .
t=1
Can a causal (but time-varying) predictor come close to this performance? In other words, we define the regret of a sequential predictor as the excess risk over the static oracle
reg({Qt }, xn ) ≜ ℓ({Qt }, xn ) − nH(P̂xn )
and ask to minimize the worst-case regret:
Reg*_n ≜ min_{{Q_t}} max_{x^n} reg({Q_t}, x^n) .        (13.38)
Excitingly, non-trivial predictors emerge as solutions to the above problem, which furthermore do
not rely on any assumptions on xn .
We next consider the case X = {0, 1} for simplicity. To solve (13.38), first notice that designing a sequence {Q_t(·|x^{t−1})} is equivalent to defining one joint distribution Q_{X^n} and then factorizing the latter as Q_{X^n}(x^n) = ∏_t Q_t(x_t|x^{t−1}). Then the problem (13.38) becomes simply
Reg*_n = min_{Q_{X^n}} max_{x^n} [ log 1/Q_{X^n}(x^n) − nH(P̂_{x^n}) ] .
First, we notice that in general the optimal Q_{X^n} is the Shtarkov distribution (13.12), which implies that the regret coincides with the log of the Shtarkov sum (13.13). In the iid case we are considering, from (13.14) we get
Reg*_n = log ∑_{x^n} max_Q ∏_{i=1}^n Q(x_i) = log ∑_{x^n} exp{−nH(P̂_{x^n})} .
This expression is, however, frequently not very convenient to analyze, so instead we consider upper and lower bounds. We may lower-bound the max over x^n by the average over X^n ∼ Ber(θ)^n and obtain (also applying Lemma 13.2):
Reg*_n ≥ R*_n + ((|X| − 1)/2) log e + o(1) ,
where R*_n is the universal compression redundancy defined in (13.8), whose asymptotics we derived in (13.24).
On the other hand, taking Q^{(KT)}_{X^n} from Krichevsky-Trofimov (13.27) we find after some algebra and Stirling's expansion:
max_{x^n} [ log 1/Q^{(KT)}_{X^n}(x^n) − nH(P̂_{x^n}) ] = (1/2) log n + O(1) .
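For tiny blocklengths both quantities can be checked by brute force; the sketch below (ours) computes the exact worst-case regret log_2 of the binary i.i.d. Shtarkov sum and the worst-case regret of the KT assignment, which exceeds it only by O(1), both growing like (1/2) log_2 n.

import math
from itertools import product

def emp_entropy_bits(x):
    n, k = len(x), sum(x)
    if k in (0, n):
        return 0.0
    p = k / n
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def kt_log2_prob(x):
    logp, ones, zeros = 0.0, 0, 0
    for b in x:
        t = ones + zeros
        logp += math.log2(((ones if b else zeros) + 0.5) / (t + 1))
        ones += b; zeros += 1 - b
    return logp

for n in [4, 8, 12]:
    shtarkov = math.log2(sum(2 ** (-n * emp_entropy_bits(x))
                             for x in product([0, 1], repeat=n)))
    kt_worst = max(-kt_log2_prob(x) - n * emp_entropy_bits(x)
                   for x in product([0, 1], repeat=n))
    print(n, shtarkov, kt_worst)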
In machine learning terms, we say that R∗n (Θ) in (13.8) is a cumulative sequential prediction
regret under the well-specified setting (i.e. data Xn is generated by a distribution inside the model
class Θ), while here Reg∗n (Θ) corresponds to a fully mis-specified setting (i.e. data is completely
arbitrary). There are also interesting settings in between these extremes, e.g. when data is iid but
not from a model class Θ, cf. [162].
the basis for arithmetic coding. As long as P̂ converges to the actual conditional probability, we will attain the entropy rate H(X_n | X_{n−r}^{n−1}). Note that the Krichevsky-Trofimov assignment (13.29) is clearly learning the distribution too: as n grows, the estimator Q_{X_n|X^{n−1}} converges to the true P_X (provided that the sequence is i.i.d.). So in some sense the converse is also true: any good universal compression scheme is inherently learning the true distribution.
The main drawback of the learn-then-compress approach is the following. Once we extend the class of sources to include those with memory, we are invariably led to the problem of learning the joint distribution P_{X_0^{r−1}} of r-blocks. However, the sample size required to obtain a good estimate of P_{X_0^{r−1}} is exponential in r. Thus learning may proceed rather slowly. The Lempel-Ziv family of algorithms works around this in an ingeniously elegant way:
• First, estimating probabilities of rare substrings takes longest, but it is also the least useful, as these substrings almost never appear at the input.
• Second, and most crucial, an unbiased estimate of P_{X^r}(x^r) is given by the reciprocal of the time since the last observation of x^r in the data stream (a small numerical sketch of this idea is given after this list).
• Third, there is a prefix code6 mapping any integer n to a binary string of length roughly log_2 n, cf. (13.39).
There are a number of variations of these basic ideas, so we will only attempt to give a rough
explanation of why it works, without analyzing any particular algorithm.
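A rough numerical check (ours) of the recurrence-time idea in the second bullet above: for a stationary ergodic source the mean return time to a block x^r equals 1/P[X^r = x^r] (Kac's lemma below), so its reciprocal estimates the block probability and log of the recurrence time is roughly the ideal codelength. The source here is i.i.d. Ber(0.3) and the block is chosen arbitrarily.

import random

p, r, N = 0.3, 4, 200000
stream = [int(random.random() < p) for _ in range(N)]

def mean_recurrence_time(block):
    # average gap between consecutive occurrences of the block in the stream
    times, last = [], None
    for i in range(len(stream) - r + 1):
        if tuple(stream[i:i + r]) == block:
            if last is not None:
                times.append(i - last)
            last = i
    return sum(times) / len(times)

block = (1, 0, 0, 1)
true_prob = p**2 * (1 - p)**2
print(mean_recurrence_time(block), 1 / true_prob)   # both about 22.7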
We proceed to formal details. First, we need to establish Kac’s lemma.
6 For this just notice that ∑_{k≥1} 2^{−log_2 k − 2 log_2 log(k+1)} < ∞ and use Kraft's inequality. See also Ex. II.18.
However, the last event is shift-invariant and thus must have probability zero or one by the ergodicity assumption. But since P[X_0 = u] > 0 it cannot be zero. So we conclude
P[∃t ≥ 0 : X_t = u] = 1 .        (13.40)
Next, we have
E[L | X_0 = u] = ∑_{t≥1} P[L ≥ t | X_0 = u]                                        (13.41)
   = (1/P[X_0 = u]) ∑_{t≥1} P[L ≥ t, X_0 = u]                                       (13.42)
   = (1/P[X_0 = u]) ∑_{t≥1} P[X_{−t+1} ≠ u, . . . , X_{−1} ≠ u, X_0 = u]            (13.43)
   = (1/P[X_0 = u]) ∑_{t≥1} P[X_0 ≠ u, . . . , X_{t−2} ≠ u, X_{t−1} = u]            (13.44)
   = (1/P[X_0 = u]) P[∃t ≥ 0 : X_t = u]                                             (13.45)
   = 1/P[X_0 = u] ,                                                                  (13.46)
where (13.41) is the standard expression for the expectation of a Z_+-valued random variable, (13.44) is from stationarity, (13.45) is because the events corresponding to different t are disjoint, and (13.46) is from (13.40).
The following result serves to explain the basic principle behind operation of Lempel-Ziv
methods.
(1/n) E[ ℓ(f_n(X_0^{n−1}, X_{−∞}^{−1})) ] → H ,
Proof. Let L_n be the last occurrence of the block x_0^{n−1} in the string x_{−∞}^{−1} (recall that the latter is known to the decoder), namely
L_n = inf{ t > 0 : x_{−t}^{−t+n−1} = x_0^{n−1} } .
Then, by Kac's lemma applied to the process Y_t^{(n)} = X_t^{t+n−1} we have
E[L_n | X_0^{n−1} = x_0^{n−1}] = 1/P[X_0^{n−1} = x_0^{n−1}] .
We now encode L_n using the code (13.39). Note that there is a crucial subtlety: even if L_n < n, and thus [−t, −t + n − 1] and [0, n − 1] overlap, the substring x_0^{n−1} can be decoded from the knowledge of L_n.
We have, by applying Jensen's inequality twice and noticing that (1/n) H(X_0^{n−1}) ↘ H and (1/n) log H(X_0^{n−1}) → 0, that
(1/n) E[ ℓ(f_int(L_n)) ] ≤ (1/n) E[ log 1/P_{X_0^{n−1}}(X_0^{n−1}) ] + o(1) → H .
From Kraft's inequality we know that for any prefix code we must have
(1/n) E[ ℓ(f_int(L_n)) ] ≥ (1/n) H(X_0^{n−1} | X_{−∞}^{−1}) = H .
The result shown above demonstrates that the LZ algorithm has an asymptotically optimal compression rate for every stationary ergodic process. Recall, however, that previously discussed compressors also enjoyed non-stochastic (individual sequence) guarantees. For example, we have seen in Section 13.7 that the Krichevsky-Trofimov compressor achieves on every input sequence a compression ratio that is at most O((log n)/n) worse than the arithmetic encoder built with the best possible (for this sequence!) static probability assignment. It turns out that the LZ algorithm is also special from this point of view. In [331] (see also [160, Theorem 4]) it was shown that the LZ compression rate on every input sequence is better than that achieved by any finite state machine (FSM) up to correction terms O((log log n)/(log n)). Consequently, investing via LZ achieves capital growth that is competitive against any possible FSM investor [160].
Altogether we can see that LZ compression enjoys certain optimality guarantees in both the stochastic and individual sequence senses.
II.1 (Exact value of minimal compression length) Suppose X ∈ N and PX (1) ≥ PX (2) ≥ . . .. Show
that the optimal compressor f∗ ’s length satisfies
X
∞
E[l(f∗ (X))] = P[X ≥ 2k ].
k=1
II.4 Consider a three-state Markov chain S1 , S2 , . . . with the following transition probability matrix
      ⎡ 1/2  1/4  1/4 ⎤
P =   ⎢  0   1/2  1/2 ⎥ .
      ⎣  1    0    0  ⎦
Compute the limit of (1/n) E[l(f*(S^n))] when n → ∞. Does your answer depend on the distribution
of the initial state S1 ?
II.5 (a) Let X take values on a finite alphabet X . Prove that
ϵ*(X, k) ≥ (H(X) − k − 1) / log(|X| − 1) .
(b) Deduce the following converse result: For a stationary process {Sk : k ≥ 1} on a finite
alphabet S ,
lim inf_{n→∞} ϵ*(S^n, nR) ≥ (H − R) / log |S| ,
where H = lim_{n→∞} H(S^n)/n is the entropy rate of the process.
II.6 Run-length encoding is a popular variable-length lossless compressor used in fax machines,
image compression, etc. Consider compression of S^n, an i.i.d. Ber(δ) source with very small δ = 1/128, using run-length encoding: a chunk of consecutive r ≤ 255 zeros (resp. ones) is encoded into a zero (resp. one) followed by an 8-bit binary encoding of r (if there are > 255 consecutive zeros then two or more 9-bit blocks will be output). Compute the average achieved compression rate
lim_{n→∞} (1/n) E[ ℓ(f(S^n)) ] .
How does it compare with the optimal lossless compressor?
Hint: Compute the expected number of 9-bit blocks output per chunk of consecutive zeros/ones;
normalize by the expected length of the chunk.
II.7 Draw n random points independently and uniformly from the vertices of the following square.
Denote the coordinates by (X1 , Y1 ), . . . , (Xn , Yn ). Suppose Alice only observes Xn and Bob only
observes Yn . They want to encode their observation using RX and RY bits per symbol respectively
and send the codewords to Charlie who will be able to reconstruct the sequence of pairs.
(a) Find the optimal rate region for (RX , RY ).
(b) What if the square is rotated by 45◦ ?
II.8 Consider a particle walking randomly on the graph of Exercise II.7 (each edge is taken with
equal probability; particle does not stay in the same node). Alice observes the X coordinate
and Bob observes the Y coordinate. How many bits per step (in the long run) does Bob need to
send to Alice so that Alice will be able to reconstruct the particle’s trajectory with vanishing
probability of error? (Hint: you need to extend certain theorem from Chapter 11 to the case of
an ergodic Markov chain)
II.9 Recall from Theorem 11.13 the upper bound on the probability of error for the Slepian-Wolf
compression to k bits:
ϵ*_SW(k) ≤ min_{τ>0} { P[ log_{|A|} 1/P_{X^n|Y}(X^n|Y) > k − τ ] + |A|^{−τ} }        (II.1)
Consider the following case, where Xn = (X1 , . . . , Xn ) is uniform on {0, 1}n and
Y = (X1 , . . . , Xn ) + (N1 , . . . , Nn ) ,
where Ni are iid Gaussian with zero mean and variance 0.1. Let n = 10. Propose a method to
numerically compute or approximate the bound (II.1) as a function of k = 1, . . . 10. Plot the
results.
II.10 (Mismatched compression) Let P, Q be distributions on some discrete alphabet A.
(a) Let f*_P : A → {0, 1}* denote the optimal variable-length lossless compressor for X ∼ P. Show that under Q,
(b) The Shannon code for X ∼ P is a prefix code f_P with the code length l(f_P(a)) = ⌈log_2 1/P(a)⌉, a ∈ A. Show that if X is distributed according to Q instead, then
Comments: This can be interpreted as a robustness result for compression with model misspecification: when a compressor designed for P is applied to a source whose distribution is in fact Q, the suboptimality incurred by this mismatch can be related to the divergence D(Q‖P).
II.11 Consider a ternary fixed-length (almost lossless) compression X → {0, 1, 2}^k with an additional requirement that the string w^k ∈ {0, 1, 2}^k should satisfy
∑_{j=1}^k w_j ≤ k/2 .        (II.2)
For example, (0, 0, 0, 0), (0, 0, 0, 2) and (1, 1, 0, 0) satisfy the constraint but (0, 0, 1, 2) does not.
Let ϵ∗ (Sn , k) denote the minimum probability of error among all possible compressors of Sn =
{Sj , j = 1, . . . , n} with i.i.d. entries of finite entropy H(S) < ∞. Compute
as a function of R ≥ 0.
Hint: Relate to P[ℓ(f∗ (Sn )) ≥ γ n] and use Stirling’s formula (I.2) to find γ .
(1/n) ∑_{k=0}^{n−1} P[A ∩ τ^{−k} B] → P[A]P[B] .
defines a prefix code. (Warning: This is not about checking Kraft’s inequality.)
(d) Decoding. Upon receipt of the codeword, we can reconstruct the interval Dxn . Divide the
unit interval according to the distribution P, i.e., partition [0, 1) into disjoint subintervals
Ia , . . . , Iz . Output the index that contains Dxn . Show that this gives the first symbol x1 . Con-
tinue in this fashion by dividing Ix1 into Ix1 ,a , . . . , Ix1 ,z and etc. Argue that xn can be decoded
losslessly. How many steps are needed?
(e) Suppose PX (e) = 0.5, PX (o) = 0.3, PX (t) = 0.2. Encode etoo (write the binary codewords)
and describe how to decode.
(f) Show that the average length of this code satisfies
nH(P) ≤ E[l(f(Xn ))] ≤ nH(P) + 2 bits.
(g) Assume that X = (X1 , . . . , Xn ) is not iid but PX1 , PX2 |X1 , . . . , PXn |Xn−1 are known. How would
you modify the scheme so that we have
H(Xn ) ≤ E[l(f(Xn ))] ≤ H(Xn ) + 2 bits.
II.14 (Enumerative Codes) Consider the following simple universal compressor for binary sequences: given x^n ∈ {0, 1}^n, denote by n_1 = ∑_{i=1}^n x_i and n_0 = n − n_1 the number of ones and zeros in x^n. First encode n_1 ∈ {0, 1, . . . , n} using ⌈log_2(n + 1)⌉ bits, then encode the index of x^n in the set of all strings with n_1 ones using ⌈log_2 \binom{n}{n_1}⌉ bits. Concatenating the two binary strings, we obtain the codeword of x^n. This defines a lossless compressor f : {0, 1}^n → {0, 1}*.
(a) Verify that f is a prefix code.
(b) Let E_θ be taken over X^n i.i.d.∼ Ber(θ). Show that for any θ ∈ [0, 1],
E_θ[l(f(X^n))] ≤ nh(θ) + log n + O(1) ,
[Optional: Explain why enumerative coding fails to achieve the optimal redundancy.]
Hint: Stirling’s approximation (I.2) might be useful.
II.15 (Krichevsky-Trofimov codes). Consider the K-T probability assignment for the binary alphabet (13.27) and its sequential form (13.29). Let f_KT be the encoder with length assignment l(f_KT(x^n)) = log_2 1/Q^{(KT)}_{X^n}(x^n) for all x^n.
(f) Now choose π to be the Beta(1/2, 1/2) prior and redo the previous part. Do you get a better constant c_1?
Comments: We followed the strategy, introduced in [118], of lower bounding I(θ; X^n) by guessing a good estimator θ̂ = θ̂(X_1, . . . , X_n) and bounding I(θ; θ̂) on the basis of the estimation error. The rationale is that if θ can be estimated well, then X^n needs to provide a large amount of information. We will further explore this idea in Chapter 30.
II.17 Consider the following collection of stationary ergodic Markov processes depending on a parameter θ ∈ [0, 1]: X_1 ∼ Ber(1/2) and after that X_t = X_{t−1} with probability 1 − θ and X_t = 1 − X_{t−1} with probability θ. Denote the resulting Markov kernel as P_{X^n|θ}.
(a) Compute J_F(θ).
(b) Prove that the minimax redundancy R*_n = (1/2 + o(1)) log n.
II.18 (Elias coding) In this problem we construct universal codes for integers. Namely, they compress
any integer-valued (infinite alphabet!) random variable almost to its entropy.
(a) Consider the following universal compressor for natural numbers: For x ∈ N = {1, 2, . . .},
let k(x) denote the length of its binary representation. Define its codeword c(x) to be k(x)
zeros followed by the binary representation of x. Compute c(10). Show that c is a prefix
code and describe how to decode a stream of codewords.
(b) Next we construct another code using the one above: Define the codeword c′ (x) to be c(k(x))
followed by the binary representation of x. Compute c′ (10). Show that c′ is a prefix code
and describe how to decode a stream of codewords.
(c) Let X be a random variable on N whose probability mass function is decreasing. Show that
E[log(X)] ≤ H(X).
(d) Show that the average code length of c satisfies E[ℓ(c(X))] ≤ 2H(X) + 2 bit.
(e) Show that the average code length of c′ satisfies E[ℓ(c′ (X))] ≤ H(X) + 2 log(H(X) + 1) + 3
bit.
Comments: The two coding schemes are known as Elias γ-codes and δ-codes.
II.19 (Batch loss) Recall the definition of batch regret in online prediction in Section 13.6. Show that
whenever maximizer π ∗ exists we have
with optimization over π ∈ P(Θ) of priors on θ. We assume the maximum is attained at some
π∗.
(θ) ⊗n
(a) Let Dn ≜ infQYn |Xn supθ∈Θ D(PY|X kQYn |Xn |PXn ), where the infimum is over all conditional
kernels QYn |Xn : X n → Y n . Show AvgReg∗n (Θ) ≤ Dn .
R Qn
(b) Show that Cn = Dn and that optimal Q∗Yn |Xn (yn |xn ) = π ∗ (dθ) t=1 PY|X (yt |xt ) (Hint:
(θ)
Exercise I.11.)
Qn
(c) Show that we can always factorize Q∗Yn |Xn = t=1 QYt |Xt ,Yt−1 .
(d) Conclude that Q∗Yn |Xn defines an optimal learner, who also operates without any knowledge
of PXn .
Note: this characterization is mostly useful for upper-bounding regret (Exercise II.22). Indeed,
the optimal learner requires knowledge of π ∗ which in turn often depends on PXn , which is
not available to the learner. This shows why supervised learning is quite a bit more deli-
cate than universal compression. Nevertheless, taking a “natural” prior π and factorizing the
R (θ) ⊗n
mixture π (dθ)PY|X often gives very interesting and often almost optimal algorithms (e.g.
exponential-weights update algorithm [445]).
II.21 (Average-case and worst-case redundancies are incomparable.) This exercise provides an exam-
ple where the worst-case minimax redundancy (13.11) is infinite but the average-case one (13.8)
is finite. Take n = 1 and consider the class of distributions P1 = {P ∈ P(Z+ ) : EP [X] ≤ 1}.
Define
P ( x)
R∗ = min sup D(PkQ), R∗∗ = min max sup log .
Q P∈P1 Q x∈Z+ P∈P1 Q ( x)
(a) Applying the capacity-redundancy theorem, show that R∗ ≤ 2 log 2. (Hint: use Exercise I.4
to bound the mutual information.)
(b) Prove that R∗∗ = ∞ if and only if the Shtarkov sum (13.13) is infinite, namely,
P
x∈Z+ supP∈P1 P(x) = ∞
(c) Verify that
(
1 x=0
sup P(x) =
P∈P1 1/x x ≥ 1
and conclude R∗∗ = ∞. (Hint: Markov’s inequality.)
i.i.d. (d)
II.22 (Linear regression) Let Xi ∼ PX on Rp with PX being rotationally invariant (i.e. X = UX for
any orthogonal matrix U). Fix θ ∈ Rp with kθk ≤ s and given Xi generate Yi ∼ N (θ⊤ Xi , σ 2 ).
Having observed Yt−1 , Xt (but not θ) the learner outputs a prediction Ŷt of Yt .
i i
i i
i i
1
Pn ⊤
where Σ̂X = n i=1 Xi Xi
is the sample covariance matrix. (Hint: Interpret LHS as regret
under log-loss and solve maxπ θ I(θ; Yn |Xn ) s.t. E[kθk2 ] ≤ s2 via Exercise I.10.)
(b) Show that
s2 n s2 n
AvgRegn ≤ σ 2 ln det Ip + ΣX ≤ pσ 2 ln 1 + 2 E[kXk2 ] .
p p
(Hint: Jensen’s inequality.)
i.i.d.
Remark: Note that if Xi ∼ N (0, BIp ) the RHS is pσ 2 ln n + O(1). At the same time prediction
error of an ordinary least-square estimate Ŷt = θ̂⊤ Xt for n ≥ p + 2 is known to be exactly7
E[(Ŷt − Yt )2 ] = σ 2 (1 + n−pp−1 ) and hence achieves the optimal pσ 2 ln n + O(1) cumulative
regret.
7
This can be shown by applying Exercise VI.3b and evaluating the expected trace using [444, Theorem 3.1].
i i
i i
i i
Part III
i i
i i
i i
i i
i i
i i
275
In this part we study the topic of binary hypothesis testing (BHT) which we first encountered
in Section 7.3. This is an important area of statistics, with a definitive treatment given in [277].
Historically, there has been two schools of thought on how to approach this question. One is the
so-called significance testing of Karl Pearson and Ronald Fisher. This is perhaps the most widely
used approach in modern biomedical and social sciences. The concepts of null hypothesis, p-value,
χ2 -test, goodness-of-fit belong to this world. We will not be discussing these.
The other school was pioneered by Jerzy Neyman and Egon Pearson, and is our topic in this part.
The concepts of Type-I and Type-II errors, likelihood-ratio tests, Chernoff exponent are from this
domain. This is, arguably, a more popular way of looking at the problem among the engineering
disciplines (perhaps explained by its foundational role in radar and electronic signal detection).
The conceptual difference between the two is that in the first approach the full probabilistic
i.i.d.
model is specified only under the null hypothesis. (It still could be very specific like Xi ∼ N (0, 1),
i.i.d.
contain unknown parameters, like Xi ∼ N (θ, 1) with θ ∈ R arbitrary, or be nonparametric, like
i.i.d.
(Xi , Yi ) ∼ PX,Y = PX PY denoting that observables X and Y are statistically independent). The main
goal of the statistician in this setting is inventing a testing process that is able to find statistically
significant deviations from the postulated null behavior. If such deviation is found then the null is
rejected and (in scientific fields) a discovery is announced. The role of the alternative hypothesis
(if one is specified at all) is to roughly suggest what feature of the null are most likely to be violated
i.i.d.
and motivates the choice of test procedures. For example, if under the null Xi ∼ N (0, 1), then both
of the following are reasonable tests:
1X ? 1X 2 ?
n n
Xi ≈ 0 Xi ≈ 1 .
n n
i=1 i=1
However, the first one would be preferred if, under the alternative, “data has non-zero mean”, and
the second if “data has zero mean but variance not equal to one”. Whichever of the alternatives is
selected does not imply in any way the validity of the alternative. In addition, theoretical properties
of the test are mostly studied under the null rather than the alternative. For this approach the null
hypothesis (out of the two) plays a very special role.
The second approach treats hypotheses in complete symmetry. Exact specifications of proba-
bility distributions are required for both hypotheses and the precision of a proposed test is to be
analyzed under both. This is the setting that is most useful for our treatment of forthcoming topics
of channel coding (Part IV) and statistical estimation (Part VI).
The outline of this part is the following. First, we define the performance metric R(P, Q) giving
a full description of the BHT problem. A key result in this theory, the Neyman-Pearson lemma
determines the form of the optimal test and, at the same time, characterizes R(P, Q). We then
specialize to the setting of iid observations and consider two types of asymptotics (as the sample
size n goes to infinity): Stein’s regime (where type-I error is held constant) and Chernoff’s regime
(where errors of both types are required to decay exponentially). The fundamental limit in the
former regime is simply a scalar (given by D(PkQ)), while in the latter it is a region. To describe
this region (as we do in Chapter 16) we will first need to dive deep into another foundational topic:
theory of large deviations and the information projection (Chapter 15).
i i
i i
i i
14 Neyman-Pearson lemma
In this Chapter we formally define the problem of binary hypothesis testing between two sim-
ple hypotheses. We introduce the fundamental limit for this problem in the form of a region
R(P, Q) ⊂ [0, 1]2 , whose boundary is known as the received operating characteristic (ROC)
curve. We will show how to compute this region/curve exactly (Neyman-Pearson lemma) and
show optimality of the likelihood-ratio tests in the process. However, for high-dimensional situa-
tions exact computation of the region is still too complex and we will also derive upper and lower
bounds (as usual, we call them achievability and converse, respectively). Finally, we will conclude
by introducing two different asymptotic settings: the Stein regime and the Chernoff regime. The
answer in the former will be given completely (for iid distributions), while the answer for the latter
will require further developments in the subsequent chapters.
Let Z = 0 denote that the test chooses P (accepting the null) and Z = 1 that the test chooses Q
(rejecting the null).
This setting is called “testing simple hypothesis against simple hypothesis”. Here “simple”
refers to the fact that under each hypothesis there is only one distribution that could have gen-
erated the data. In comparison, composite hypothesis postulates that X ∼ P for some P is a given
class of distributions; see Sections 16.4 and 32.2.1.
276
i i
i i
i i
type-I error, significance, size, false alarm rate, false positive 1−α
specificity, selectivity, true negative α
power, recall, sensitivity, true positive 1−β
type-II error, missed detection, false negative β
accuracy π 1 (1 − β) + (1 − π 1 )α
2π 1 (1−β)
F1 -score 1+π 1 (1−β)−(1−π 1 )α
Bayesian error π 1 β + (1 − π 1 )(1 − α)
π 1 (1−β)
positive predictive value (PPV), precision 1−π 1 β−(1−π 1 )α
Entries involving π 1 = P[H1 ] correspond to Bayesian setting where a prior probability on occurrence of H1 is
postulated.
In order to quantify performance of a test, we focus on two metrics. Let π i|j denote the proba-
bility of the test choosing i when the correct hypothesis is j, with i, j ∈ {0, 1}. For every test PZ|X
we associate a pair of numbers:
• Bayesian: Assuming the prior distribution P[H0 ] = π 0 and P[H1 ] = π 1 , we minimize the
average probability of error:
• Minimax: Assuming there is an unknown prior distribution, we choose the test that preforms
the best for the worst-case prior
• Neyman-Pearson: Minimize the type-II error β subject to that the success probability under the
null is at least α.
In this Part the Neyman-Pearson formulation is our choice. We formalize the fundamental
performance limit as follows.
i i
i i
i i
278
Definition 14.1 Given (P, Q), the Neyman-Pearson region consists of achievable points for
all randomized tests
R(P, Q) = (P[Z = 0], Q[Z = 0]) : PZ|X : X → {0, 1} ⊂ [0, 1]2 . (14.2)
In particular, its lower boundary is defined as (see Figure 14.1 for an illustration)
R(P, Q)
βα (P, Q)
P = Q ⇔ R(P, Q) = P ⊥ Q ⇔ R(P, Q) =
Figure 14.1 Illustration of the Neyman-Pearson regions: typical (top plot) and two extremal cases (bottom
row). Recall that P is mutually singular w.r.t. Q, denoted by P ⊥ Q, if P[E] = 0 and Q[E] = 1 for some E.
The Neyman-Pearson region encodes much useful information about the relationship between
P and Q. In particular, the mutual singularity (see Figure 14.1) can be detected. Furthermore, every
f-divergence can be computed from the R(P, Q). For example, TV(P, Q) coincides with half the
length of the longest vertical segment contained in R(P, Q) (Exercise III.7). In machine learning
some of the most popular metric used to characterized quality of a R(P, Q) is area under the curve
(AUC)
Z 1
AUC(P, Q) ≜ 1 − βα (P, Q)dα .
0
i i
i i
i i
Proof. (a) For convexity, suppose that (α0 , β0 ), (α1 , β1 ) ∈ R(P, Q), corresponding to tests
PZ0 |X , PZ1 |X , respectively. Randomizing between these two tests, we obtain the test λPZ0 |X +
λ̄PZ1 |X for λ ∈ [0, 1], which achieves the point (λα0 + λ̄α1 , λβ0 + λ̄β1 ) ∈ R(P, Q).
The closedness of R(P, Q) will follow from the explicit determination of all boundary
points via the Neyman-Pearson lemma – see Remark 14.1. In more complicated situations
(e.g. in testing against composite hypothesis) simple explicit solutions similar to Neyman-
Pearson Lemma are not available but closedness of the region can frequently be argued
still. The basic reason is that the collection of bounded functions {g : X → [0, 1]} (with
g(x) = PZ|X (0|x)) forms a weakly compact set and hence its image under the linear functional
R R
g 7→ ( gdP, gdQ) is closed.
(b) Testing by random guessing, i.e., Z ∼ Ber(1 − α) ⊥ ⊥ X, achieves the point (α, α).
(c) If (α, β) ∈ R(P, Q) is achieved by PZ|X , P1−Z|X achieves (1 − α, 1 − β).
The region R(P, Q) consists of the operating points of all randomized tests, which include as
special cases those of deterministic tests, namely
As the next result shows, the former is in fact the closed convex hull of the latter. Recall that
cl(E) (resp. co(E)) denote the closure and convex hull of a set E, namely, the smallest closed
(resp. convex) set containing E. A useful example: For a subset E of an Euclidean space, and
measurable functions f, g : R → E, we have (E [f(X)] , E [g(X)]) ∈ cl(co(E)) for any real-valued
random variable X.
Consequently, if P and Q are on a finite alphabet X , then R(P, Q) is a polygon of at most 2|X |
vertices.
Proof. “⊃”: Comparing (14.2) and (14.4), by definition, R(P, Q) ⊃ Rdet (P, Q)), the former of
which is closed and convex , by Theorem 14.2.
“⊂”: Given any randomized test PZ|X , define a measurable function g : X → [0, 1] by g(x) =
PZ|X (0|x). Then
X Z 1
P [ Z = 0] = g(x)P(x) = EP [g(X)] = P[g(X) ≥ t]dt
x 0
i i
i i
i i
280
X Z 1
Q[Z = 0] = g(x)Q(x) = EQ [g(X)] = Q[g(X) ≥ t]dt
x 0
R
where we applied the “area rule” that E[U] = R+ P [U ≥ t] dt for any non-negative random
variable U. Therefore the point (P[Z = 0], Q[Z = 0]) ∈ R is a mixture of points (P[g(X) ≥
t], Q[g(X) ≥ t]) ∈ Rdet , averaged according to t uniformly distributed on the unit interval. Hence
R ⊂ cl(co(Rdet )).
The last claim follows because there are at most 2|X | subsets in (14.4).
Example 14.1 (Testing Ber(p) versus Ber(q)) Assume that p < < q. Using Theo- 1
2
rem 14.3, note that there are 2 = 4 events E = ∅, {0}, {1}, {0, 1}. Then R(Ber(p), Ber(q)) is
2
given by
1
)
q)
(p, q)
r(
Be
),
(p
er
(B
R
(p̄, q̄)
α
0 1
Definition 14.4 (Extended log likelihood ratio) Assume that dP = p(x)dμ and
dQ = q(x)dμ for some dominating measure μ (e.g. μ = P + Q.) Recalling the definition of
Log from (2.10) we define the extended LLR as
log qp((xx)) , p ( x) > 0 , q ( x) > 0
p(x) +∞, p ( x) > 0 , q ( x) = 0
T(x) ≜ Log =
q ( x) −∞, p ( x) = 0 , q ( x) > 0
0, p ( x) = 0 , q ( x) = 0 ,
i i
i i
i i
Definition 14.5 (Likelihood ratio test (LRT)) Given a binary hypothesis testing H0 : X ∼
P vs H1 : X ∼ Q the likelihood ratio test (LRT) with threshold τ ∈ R ∪ {±∞} is 1{x : T(x) ≤ τ },
in other words it decides
(
declare H0 , T(x) > τ
LRTτ (x) = .
declare H1 , T(x) ≤ τ
We see that taking expectation over P and over Q are equivalent upon multiplying the expectant
by exp(±T). The next result gives precise details in the general case.
i i
i i
i i
282
Z
( c)
= dμ p(x) exp(−T(x))h(x) = EP [exp(−T)g(T)] ,
{−∞<T(x)≤∞}
where in (a) we used (14.8) to justify restriction to finite values of T; in (b) we used exp(−T(x)) =
q(x)
p(x) for p, q > 0; and (c) follows from the fact that exp(−T(x)) = 0 whenever T = ∞. Exchanging
the roles of P and Q proves (14.6).
The last part follows upon taking h(x) = f(x)1{T(x) ≥ τ } and h(x) = f(x)1{T(x) ≤ τ } in (14.5)
and (14.6), respectively.
The importance of the LLR is that it is a sufficient statistic for testing the two hypotheses (recall
Section 3.5 and in particular Example 3.9), as the following result shows.
Proof. For part 2, sufficiency of T would be implied by PX|T = QX|T . For the case of X being
discrete we have:
From Theorem 14.3 we know that to obtain the achievable region R(P, Q), one can iterate over
all decision regions and compute the region Rdet (P, Q) first, then take its closed convex hull. But
this is a formidable task if the alphabet is large or infinite. On the other hand, we know that the
LLR is a sufficient statistic. Next we give bounds to the region R(P, Q) in terms of the statistics
of the LLR. As usual, there are two types of statements:
• Converse (outer bounds): any point in R(P, Q) must satisfy certain constraints;
• Achievability (inner bounds): points satisfying certain constraints belong to R(P, Q).
Proof. Use the data processing inequality for KL divergence with PZ|X ; cf. Corollary 2.19.
We will strengthen this bound with the aid of the following result.
i i
i i
i i
Note that we do not need to assume P Q precisely because ±∞ are admissible values for
the (extended) LLR.
Proof. Defining τ = log γ and g(x) = PZ|X (0|x) we get from (14.7):
P[Z = 0, T ≤ τ ] − γ Q[Z = 0, T ≤ τ ] ≤ 0 .
Decomposing P[Z = 0] = P[Z = 0, T ≤ τ ] + P[Z = 0, T > τ ] and similarly for Q we obtain then
P[Z = 0] − γ Q[Z = 0] ≤= P [T > log γ, Z = 0] − γ Q [T > log γ, Z = 0] ≤ P [T > log γ]
i i
i i
i i
284
which is equivalent to minimizing the average probability of error in (14.1), with t = ππ 01 . This can
be solved without much effort. For simplicity, consider the discrete case. Then
X X
α∗ − tβ ∗ = max (α − tβ) = max (P(x) − tQ(x))PZ|X (0|x) = |P(x) − tQ(x)|+
(α,β)∈R PZ|X
x∈X x∈X
where the last equality follows from the fact that we are free to choose PZ|X (0|x), and the best
choice is obvious:
P(x)
PZ|X (0|x) = 1 log ≥ log t .
Q ( x)
Thus, we have shown that all supporting hyperplanes are parameterized by LRT. This completely
recovers the region R(P, Q) except for the points corresponding to the faces (flat pieces) of the
region. The precise result is stated as follows:
Proof of Theorem 14.11. Let t = exp(τ ). Given any test PZ|X , let g(x) = PZ|X (0|x) ∈ [0, 1]. We
want to show that
h dP i h dP i
α = P[Z = 0] = EP [g(X)] = P > t + λP =t (14.12)
dQ dQ
goal h dP i h dP i
⇒ β = Q[Z = 0] = EQ [g(X)] ≥ Q > t + λQ =t (14.13)
dQ dQ
n o n o
Using the simple fact that EQ [f(X)1 dQ dP
≤ t ] ≥ t−1 EP [f(X)1 dQ dP
≤ t ] for any f ≥ 0 twice, we
have
dP dP
β = EQ [g(X)1 ≤ t ] + EQ [g(X)1 >t ]
dQ dQ
1 dP dP
≥ EP [g(X)1 ≤ t ] + E Q [ g( X ) 1 >t ]
t dQ dQ
| {z }
h dP i
(14.12) 1 dP dP
= EP [(1 − g(X))1 > t ] + λP = t + E Q [ g( X ) 1 >t ]
t dQ dQ dQ
| {z }
h dP i
dP dP
≥ EQ [(1 − g(X))1 > t ] + λQ = t + E Q [ g( X ) 1 >t ]
dQ dQ dQ
h dP i h dP i
=Q > t + λQ =t .
dQ dQ
i i
i i
i i
Remark 14.1 As a consequence of the Neyman-Pearson lemma, all the points on the boundary
of the region R(P, Q) are attainable. Therefore
R(P, Q) = {(α, β) : βα ≤ β ≤ 1 − β1−α }.
Since α 7→ βα is convex on [0, 1], hence continuous, the region R(P, Q) is a closed convex set, as
previously stated in Theorem 14.2. Consequently, the infimum in the definition of βα is in fact a
minimum.
Furthermore, the lower half of the region R(P, Q) is the convex hull of the union of the
following two sets:
(
dP
α = P log dQ >τ
dP
τ ∈ R ∪ {±∞}.
β = Q log dQ >τ
and
(
dP
α = P log dQ ≥τ
τ ∈ R ∪ {±∞}.
dP
β = Q log dQ ≥τ
Therefore it does not lose optimality to restrict our attention on tests of the form 1{log dQ
dP
≥ τ}
or 1{log dQ > τ }. The convex combination (randomization) of the above two styles of tests lead
dP
dP dP
P [log dQ > t] P [log dQ > t]
1 1
α α
t t
τ τ
1
Note that it so happens that in Definition 14.4 the LRT is defined with an ≤ instead of <.
i i
i i
i i
286
h dP i
β ≤ exp(−τ )P log > τ ≤ exp(−τ )
dQ
where P and Q do not depend on n; this is a particular case of our general setting with P and Q
replaced by their n-fold product distributions. We are interested in the asymptotics of the error
probabilities π 0|1 and π 1|0 as n → ∞ in the following two regimes:
• Stein’s regime: When π 1|0 is constrained to be at most ϵ, what is the best exponential rate of
convergence for π 0|1 ?
• Chernoff’s regime: When both π 1|0 and π 0|1 are required to vanish exponentially, what is the
optimal tradeoff between their exponents?
Recall that we are in the iid setting (14.14) and are interested in tests satisfying 1 −α = π 1|0 ≤ ϵ
and β = π 0|1 ≤ exp(−nE) for some exponent E > 0. Motivation of this asymmetric objective
is that often a “missed detection” (π 0|1 ) is far more disastrous than a “false alarm” (π 1|0 ). For
example, a false alarm could simply result in extra computations (attempting to decode a packet
when there is in fact only noise has been received), while missed detection results in a complete
loss of the packet. The formal definition of the best exponent is as follows.
i i
i i
i i
Theorem 14.14 (Stein’s lemma) Consider the iid setting (14.14) where PXn = Pn and
QXn = Qn . Then
Vϵ = D(PkQ), ∀ϵ ∈ (0, 1).
Consequently, V = D(PkQ).
The way to use this result in practice is the following. Suppose it is required that α ≥ 0.999,
and β ≤ 10−40 , what is the required sample size? Stein’s lemma provides a rule of thumb: n ≥
10−40
− log
D(P∥Q) .
dPXn X n
dP
Fn = log = log (Xi ), (14.15)
dQXn dQ
i=1
then pick n large enough (depends on ϵ, δ ) such that α ≥ 1 − ϵ, we have the exponent E = D − δ
achievable, Vϵ ≥ E. Sending δ → 0 yields Vϵ ≥ D. Finally, if D = ∞, the above argument holds
for arbitrary τ > 0, proving that Vϵ = ∞.
(Converse) We show that Vϵ ≤ D for any ϵ < 1, to which end it suffices to consider D < ∞. As
a warm-up, we first show a weak converse by applying Theorem 14.8 based on data processing
inequality. For any (α, β) ∈ R(PXn , QXn ), we have
1
−h(α) + α log ≤ d(αkβ) ≤ D(PXn kQXn ) (14.18)
β
i i
i i
i i
288
For any achievable exponent E < Vϵ , by definition, there exists a sequence of tests such that
αn ≥ 1 − ϵ and βn ≤ exp(−nE). Plugging this into (14.18) and using h ≤ log 2, we have E ≤
D(P∥Q) log 2
1−ϵ + n(1−ϵ) . Sending n → ∞ yields
D(PkQ)
Vϵ ≤ ,
1−ϵ
which is weaker than what we set out to prove; nevertheless, this weak converse is tight for ϵ → 0,
so that for Stein’s exponent we have succeeded in proving the desired result of V = limϵ→0 Vϵ ≥
D(PkQ). So the question remains: if we allow the type-I error to be ϵ = 0.999, is it possible for
the type-II error to decay faster? This is shown impossible by the strong converse next.
To this end, note that, in proving the weak converse, we only made use of the expectation
of Fn in (14.18), we need to make use of the entire distribution (CDF) in order to obtain better
results. Applying the strong converse Theorem 14.10 to testing PXn versus QXn and α = 1 − ϵ and
β = exp(−nE), we have
Pick γ = exp(n(D + δ)) for δ > 0, by WLLN (14.16) the probability on the right side goes to 0,
which implies that for any fixed ϵ < 1, we have E ≤ D + δ and hence Vϵ ≤ D + δ . Sending δ → 0
complete the proof.
Finally, let us address the case of P 6 Q, in which case D(PkQ) = ∞. By definition, there
exists a subset A such that Q(A) = 0 but P(A) > 0. Consider the test that selects P if Xi ∈ A for
some i ∈ [n]. It is clear that this test achieves β = 0 and 1 − α = (1 − P(A))n , which can be made
less than any ϵ for large n. This shows Vϵ = ∞, as desired.
Remark 14.3 (Non-iid data) Just like in Chapter 12 on data compression, Theorem 14.14
can be extended to stationary ergodic processes. Specifically, one can show that the Stein’s
exponent corresponds to relative entropy rate, i.e.
1
Vϵ = lim D(PXn kQXn )
n→∞ n
where {Xi } is stationary and ergodic under both P and Q. Indeed, the counterpart of (14.16) based
on WLLN, which is the key for choosing the appropriate threshold τ , for ergodic processes is the
Birkhoff-Khintchine convergence theorem (cf. Theorem 12.8).
Thus knowledge of Stein’s exponent Vϵ allows one to prove exponential bounds on probabilities
of arbitrary sets; this technique is known as “change of measure”, which will be applied in large
deviations analysis in Chapter 15.
i i
i i
i i
H0 : Xn ∼ Pn versus H1 : Xn ∼ Qn ,
but the objective in the Chernoff regime is to achieve exponentially small error probability of both
types simultaneously. We say a pair of exponents (E0 , E1 ) is achievable if there exists a sequence
of tests such that
1 − α = π 1|0 ≤ exp(−nE0 )
β = π 0|1 ≤ exp(−nE1 ).
Intuitively, one exponent can made large at the expense of making the other small. So the interest-
ing question is to find their optimal tradeoff by characterizing the achievable region of (E0 , E1 ).
This problem was solved by [218, 61] and is the topic of Chapter 16. (See Figure 16.2 for an
illustration of the optimal (E0 , E1 )-tradeoff.)
Let us explain what we already know about the region of achievable pairs of exponents (E0 , E1 ).
First, Stein’s regime corresponds to corner points of this achievable region. Indeed, Theo-
rem 14.14 tells us that when fixing αn = 1 − ϵ, namely E0 = 0, picking τ = D(PkQ) − δ
(δ → 0) gives the exponential convergence rate of βn as E1 = D(PkQ). Similarly, exchanging the
role of P and Q, we can achieves the point (E0 , E1 ) = (D(QkP), 0).
Second, we have shown in Section 7.3 that the minimum total error probabilities over all tests
satisfies
min 1 − α + β = 1 − TV(Pn , Qn ) .
(α,β)∈R(Pn ,Qn )
where we denoted
1
EH ≜ log 1 − H2 (P, Q) .
2
EH ≤ E ≤ 2EH .
This characterization is valid even if P and Q depends on the sample size n which will prove
useful later when we study composite hypothesis testing in Section 32.2.1. However, for fixed P
and Q this is not precise enough. In order to determine the full set of achievable pairs, we need
i i
i i
i i
290
to make a detour into the topic of large deviations next. To see how this connection arises, notice
that the (optimal) likelihood ratio tests give us explicit expressions for both error probabilities:
1 1
1 − αn = P Fn ≤ τ , βn = Q Fn > τ
n n
where Fn is the LLR in (14.15). When τ falls in the range of (−D(QkP), D(PkQ)), both proba-
bilities are vanishing thanks to WLLN – see (14.16) and (14.17), and we are interested in their
exponential convergence rate. This falls under the purview of large deviations theory.
i i
i i
i i
In this chapter we develop tools needed for the analysis of the error-exponents in hypothesis test-
ing (Chernoff regime). We will start by introducing the concepts of large deviations theory ( log
moment generating function (MGF) ψX , its convex conjugate ψX∗ , known as rate function, and
revisit the idea of tilting). Then, we show that probability of deviation of an empirical mean is
governed by the solution of an information projection (also known as I-projection) problem:
Equipped with the information projection we will prove a tight version of the Chernoff bound.
Specifically, for iid copies X1 , . . . , Xn of X, we show
" n #
1X
P Xk ≥ γ = exp (−nψ ∗ (γ) + o(n)) .
n
k=1
In the remaining sections we extend the simple information projection problem to a general
minimization over convex sets of measures and connect it to empirical process theory (Sanov’s the-
orem) and also show how to solve the problem under finitely many linear constraints (exponential
families).
In the next chapter, we apply these results to characterize the achievable (E0 , E1 )-region (as
defined in Section 14.6) to get
The full account of such theory requires delicate consideration of topological properties of E , and
is the subject of classical treatments e.g. [120]. We focus here on a simple special case which,
however, suffices for the purpose of establishing the Chernoff exponents in hypothesis testing,
291
i i
i i
i i
292
and also showcases all the relevant information-theoretic ideas. Our initial goal is to show the
following result:
Theorem 15.1 Consider a random variable X whose log MGF ψX (λ) = log E[exp(λX)] is
finite for all λ ∈ R. Let B = esssup X and let E[X] < γ < B. Then
" #
X
n
P Xi ≥ nγ = exp{−nE(γ) + o(n)} ,
i=1
where E(γ) = supλ≥0 λγ − ψX (λ) = ψX∗ (γ), known as the rate function.
The concepts of log MGF and the rate function will be elaborated in subsequent sections. We
provide the proof below that should be revisited after reading the rest of the chapter.
Proof. Let us recall the usual Chernoff bound: For iid Xn , for any λ ≥ 0, applying Markov’s
inequality yields
" # " ! #
X
n X
n
P Xi ≥ nγ = P exp λ Xi ≥ exp(nλγ)
i=1 i=1
" !#
X
n
≤ exp(−nλγ)E exp λ Xi
i=1
Optimizing over λ ≥ 0 gives the following non-asymptotic upper bound (concentration inequality)
which holds for any n:
" #
X
n n o
P Xi ≥ nγ ≤ exp − n sup(λγ − ψX (λ)) . (15.1)
i=1 λ≥0
i i
i i
i i
2
As an example, for a standard Gaussian Z ∼ N (0, 1), we have ψZ (λ) = λ2 . Taking X = Z3
yields a random variable such that ψX (λ) is infinite for all non-zero λ.
In the remaining of the chapter, we shall make the following simplifying assumption, known
as Cramér’s condition.
Assumption 15.1 The random variable X is such that ψX (λ) < ∞ for all λ ∈ R.
Most of the results we discuss in this chapter hold under a much weaker assumption of ψX
having domain with non-empty interior. But proofs in this generality significantly obscure the
elegance of the main ideas and we decided to avoid them. We note that Assumption 15.1 implies
that all moments of X is finite.
(a) ψX is convex;
(b) ψX is continuous;
(c) ψX is infinitely differentiable and
E[X exp{λX}]
ψX′ (λ) = = exp{−ψX (λ)}E[X exp{λX}].
E[exp{λX}]
then A ≤ X ≤ B a.s.;
(f) If X is not a constant, then ψX is strictly convex, and consequently, ψX′ is strictly increasing.
(g) Chernoff bound:
Remark 15.1 The slope of log MGF encodes the range of X. Indeed, Theorem 15.3(d) and
(e) together show that the smallest closed interval containing the support of PX equals (closure of)
i i
i i
i i
294
the range of ψX′ . In other words, A and B coincide with the essential infimum and supremum (min
and max of RV in the probabilistic sense) of X respectively,
A = essinf X ≜ sup{a : X ≥ a a.s.}
B = esssup X ≜ inf{b : X ≤ b a.s.}
See Figure 15.1 for an illustration.
ψX (λ)
slope A
slope B
0
λ
slope E[X]
Figure 15.1 Example of a log MGF ψX (γ) with PX supported on [A, B]. The limiting maximal and minimal
slope is A and B respectively. The slope at γ = 0 is ψX′ (0) = E[X]. Here we plot for X = ±1 with
P [X = 1] = 1/3.
Proof. For the proof we assume that base of log and exp is e. Note that (g) is already proved
in (15.1). The proof of (e)–(f) relies on Theorem 15.8 and can be revisited later.
i i
i i
i i
B ≥ EPλ [X] = EPλ [X1 {X < B − ϵ}] + EPλ [X1 {B − ϵ ≤ X ≤ B + ϵ}] + EPλ [X1 {X > B + ϵ}]
≥ EPλ [X1 {X < B − ϵ}] + EPλ [X1 {X > B + ϵ}]
≥ − EPλ [|X|1 {X < B − ϵ}] + (B + ϵ) Pλ [X > B + ϵ] . (15.3)
| {z }
→1
Therefore we will obtain a contradiction if we can show that EPλ [|X|1 {X < B − ϵ}] → 0 as
λ → ∞. To that end, notice that the convexity of ψX implies that ψX′ % B. Thus, for all λ ≥ λ0
we have ψX′ (λ) ≥ B − 2ϵ . Thus, we have for all λ ≥ λ0
ϵ ϵ
ψX (λ) ≥ ψX (λ0 ) + (λ − λ0 )(B − ) = c + λ(B − ) , (15.4)
2 2
for some constant c. Then,
≤ E[|X|eλ(B−ϵ)−c−λ(B− 2 ) ]
ϵ
= E[|X|]e−λ 2 −c → 0
ϵ
λ→∞
where the first inequality is from (15.4) and the second from X < B − ϵ. Thus, the first term
in (15.3) goes to 0 implying the desired contradiction.
(f) Suppose ψX is not strictly convex. Since ψX is convex from part (a), ψX must be “flat” (affine)
near some point. That is, there exists a small neighborhood of some λ0 such that ψX (λ0 + u) =
i i
i i
i i
296
ψX (λ0 ) + ur for some r ∈ R. Then ψPλ (u) = ur for all u in small neighborhood of zero, or
equivalently EPλ [eu(X−r) ] = 1 for u small. The following Lemma 15.4 implies Pλ [X = r] = 1,
but then P[X = r] = 1, contradicting the assumption X 6= constant.
Definition 15.5 (Rate function) The rate function ψX∗ : R → R ∪ {+∞} is given by the
Fenchel-Legendre conjugate (convex conjugate) of the log MGF:
Note that the maximization (15.5) is a convex optimization problem since ψX is strictly convex,
so we can find the maximum by taking the derivative and finding the stationary point. In fact, ψX∗
is the precisely the convex conjugate of ψX ; cf. (7.84).
The next result describes useful properties of the rate function. See Figure 15.2 for an
illustration.
Theorem 15.6 (Properties of ψX∗ ) Assume that X is non-constant and satisfies Assump-
tion 15.1.
(b) ψX∗ is strictly convex and strictly positive except ψX∗ (E[X]) = 0.
(c) ψX∗ is decreasing when γ ∈ (A, E[X]), and increasing when γ ∈ [E[X], B)
Proof. By Theorem 15.3(d), since A ≤ X ≤ B a.s., we have A ≤ ψX′ ≤ B. When γ ∈ (A, B), the
strictly concave function λ 7→ λγ − ψX (λ) has a single stationary point which achieves the unique
maximum. When γ > B (resp. < A), λ 7→ λγ − ψX (λ) increases (resp. decreases) without bounds.
1
More precisely, if we only know that E[eλS ] is finite for |λ| ≤ 1 then the function z 7→ E[ezS ] is holomorphic in the
vertical strip {z : |Rez| < 1}.
i i
i i
i i
ψX (λ)
slope γ
0
λ
ψX∗ (γ)
ψX∗ (γ)
+∞ +∞
γ
A E[X] 0 B
Figure 15.2 Log MGF ψX and its conjugate (rate function) ψX∗ for X taking values in [A, B], continuing the
example in Figure 15.1.
i i
i i
i i
298
In the above examples we see that Pλ shifts the mean of P to the right (resp. left) when λ > 0
(resp. < 0). Indeed, this is a general property of tilting.
Proof. Again for the proof we assume base e for exp and log.
(a) By definition.
(b) EPλ [X] = EE[X[exp
exp(λX)] ′ ′
(λX)] = ψX (λ), which is strictly increasing in λ, with ψX (0) = EP [X].
exp(λX) ′
D(Pλ kP) = EPλ log dP dP = EPλ log E[exp(λX)] = λEPλ [X] − ψX (λ) = λψX (λ) − ψX (λ) =
λ
∗ ′
ψX (ψX (λ)), where the last equality follows from Theorem 15.6(a).
(c) ψX′′ (λ) = E[X exp(λ X)]−(E[X exp(λX)])2
2
(E[exp(λX)])2
= VarPλ (X).
i i
i i
i i
(d)
1 1
lim log 1 Pn = inf D(QkP) (15.10)
n→∞ n P n k=1 Xk ≥ γ Q : EQ [X]≥γ
Remark 15.2 (Subadditivity) One can argue from first principles that the limits
(15.9) and (15.10) exist without computing their values. Indeed, note that the sequence
pn ≜ log P 1 Pn 1 X ≥γ satisfies pn+m ≥ pn pm and hence log p1n is subadditive. As such,
[ n k=1 k ]
limn→∞ 1n log p1n = infn log p1n by Fekete’s lemma.
i i
i i
i i
300
Proof. First note that if the events have zero probability, then both sides coincide with infinity.
Pn
Indeed, if P 1n k=1 Xk > γ = 0, then P [X > γ] = 0. Then EQ [X] > γ ⇒ Q[X > γ] > 0 ⇒
Q 6 P ⇒ D(QkP) = ∞ and hence (15.9) holds trivially. The case for (15.10) is similar.
In the sequel we assume both probabilities are nonzero. We start by proving (15.9). Set P[En ] =
Pn
P 1n k=1 Xk > γ .
Lower Bound on P[En ]: Fix a Q such that EQ [X] > γ . Let Xn be iid. Then by WLLN,
" n #
X
Q[En ] = Q Xk > nγ = 1 − o(1).
k=1
Upper Bound on P[En ]: The key observation is that given any X and any event E, PX (E) > 0
can be expressed via the divergence between the conditional and unconditional distribution as:
P
log PX1(E) = D(PX|X∈E kPX ). Define P̃Xn = PXn | P Xi >nγ , under which Xi > nγ holds a.s. Then
1
log = D(P̃Xn kPXn ) ≥ inf
P D(QXn kPXn ) (15.13)
P[En ] QXn :EQ [ Xi ]>nγ
We now show that the last problem “single-letterizes”, i.e., reduces n = 1. Note that this is a
special case of a more general phenomena – see Ex. III.12. Consider the following two steps:
X
n
D(QXn kPXn ) ≥ D(QXj kP)
j=1
1X
n
≥ nD(Q̄kP) , Q̄ ≜ QXj , (15.14)
n
j=1
where the first step follows from (2.27) in Theorem 2.16, after noticing that PXn = Pn , and the
second step is by convexity of divergence (Theorem 5.1). From this argument we conclude that
inf
P D(QXn kPXn ) = n · inf D(QkP) (15.15)
QXn :EQ [ Xi ]>nγ Q:EQ [X]>γ
i i
i i
i i
inf
P D(QXn kPXn ) = n · inf D(QkP) (15.16)
QXn :EQ [ Xi ]≥nγ Q:EQ [X]≥γ
In particular, (15.13) and (15.15) imply the required lower bound in (15.9).
Next we prove (15.10). First, notice that the lower bound argument (15.13) applies equally well,
so that for each n we have
1 1
log 1 Pn ≥ inf D(QkP) .
n P n k=1 Xk ≥ γ Q : EQ [X]≥γ
• Case I: P[X > γ] = 0. If P[X ≥ γ] = 0, then both sides of (15.10) are +∞. If P[X = γ] > 0,
P
then P[ Xk ≥ nγ] = P[X1 = . . . = Xn = γ] = P[X = γ]n . For the right-hand side, since
D(QkP) < ∞ =⇒ Q P =⇒ Q(X ≤ γ) = 1, the only possibility for EQ [X] ≥ γ is that
Q(X = γ) = 1, i.e., Q = δγ . Then infEQ [X]≥γ D(QkP) = log P(X1=γ) .
P P
• Case II: P[X > γ] > 0. Since P[ Xk ≥ γ] ≥ P[ Xk > γ] from (15.9) we know that
1 1
lim sup log 1 Pn ≤ inf D(QkP) .
n→∞ n P n k=1 Xk ≥ γ Q : EQ [X]>γ
Indeed, let P̃ = PX|X>γ which is well defined since P[X > γ] > 0. For any Q such that EQ [X] ≥
γ , set Q̃ = ϵ̄Q + ϵP̃ satisfies EQ̃ [X] > γ . Then by convexity, D(Q̃kP) ≤ ϵ̄D(QkP) + ϵD(P̃kP) =
ϵ̄D(QkP) + ϵ log P[X1>γ] . Sending ϵ → 0, we conclude the proof of (15.17).
Remark 15.3 Note that the upper bound (15.11) also holds for independent non-identically
distributed Xi . Indeed, we only need to replace the step (15.14) with D(QXn kPXn ) ≥
Pn Pn
i=1 D(QXi kPXi ) ≥ nD(Q̄kP̄) where P̄ = n
1
i=1 PXi . This yields a bound (15.11) with P
replaced by P̄ in the right-hand side.
i i
i i
i i
302
where f(u) ≜ u log u − (u − 1) log e ≥ 0. These follow from (15.18)-(15.19) via the following
useful estimate:
These simply follow from the inequality between KL divergence and Hellinger distance
√ √ √
( np+t)2
in (7.33). Indeed, we get d(xkp) ≥ H2 (Ber(x), Ber(p)) ≥ ( x − p)2 . Plugging x = n
into (15.18)-(15.19) we obtain the result. We note that [316, Theorem 3] shows a stronger bound
of e−2t in (15.21).
2
Remarkably, the bounds in (15.21) and (15.22) do not depend on n or p. This is due to the
√
variance-stabilizing effect of the square-root transformation for binomials: Var( X) is at most a
√ √
constant for all n, p. In addition, X − np = √XX− np
√ is of a self-normalizing form: the denomi-
+ np
nator is on par with the standard deviation of the numerator. For more on self-normalizing sums,
see [69, Problem 12.2].
inf D(QkP)
Q∈E
i i
i i
i i
Denote the minimizing distribution Q by Q∗ . The next result shows that intuitively the “line”
between P and optimal Q∗ is “orthogonal” to E (cf. Figure 15.3).
Q∗
Distributions on X
Figure 15.3 Illustration of information projection and the Pythagorean theorem.
Theorem 15.10 Let E be a convex set of distributions. If there exists Q∗ ∈ E such that
∗
D(Q kP) = minQ∈E D(QkP) < ∞, then ∀Q ∈ E
D(QkP) ≥ D(QkQ∗ ) + D(Q∗ kP)
Proof. If D(QkP) = ∞, then there is nothing to prove. So we assume that D(QkP) < ∞, which
also implies that D(Q∗ kP) < ∞. For λ ∈ [0, 1], form the convex combination Q(λ) = λ̄Q∗ +λQ ∈
E . Since Q∗ is the minimizer of D(QkP), then
d
0≤ D(Q(λ) kP) = D(QkP) − D(QkQ∗ ) − D(Q∗ kP)
dλ λ=0
The rigorous analysis requires an argument for interchanging derivatives and integrals (via domi-
nated convergence theorem) and is similar to the proof of Proposition 2.20. The details are in [114,
Theorem 2.2].
Remark 15.4 If we view the picture above in the Euclidean setting, the “triangle” formed by
P, Q∗ and Q (for Q∗ , Q in a convex set, P outside the set) is always obtuse, and is a right triangle
only when the convex set has a “flat face”. In this sense, the divergence is similar to the squared
Euclidean distance, and the above theorem is sometimes known as a “Pythagorean” theorem.
The relevant set E of Q’s that we will focus next is the “half-space” of distributions E = {Q :
EQ [X] ≥ γ}, where X : Ω → R is some fixed function (random variable). This is justified by rela-
tion with the large-deviations exponent in Theorem 15.9. First, we solve this I-projection problem
explicitly.
i i
i i
i i
304
2 Whenever the minimum is finite, the minimizing distribution is unique and equal to the tilting
of P along X, namely2
Remark 15.5 Both Theorem 15.9 and Theorem 15.11 are stated for the right tail where the
sample mean exceeds the population mean. For the left tail, simply these results to −Xi to obtain
for γ < E[X],
1 1
lim log 1 Pn = inf D(QkP) = ψX∗ (γ).
n→∞ n P n k=1 Xk < γ Q : EQ [X]<γ
In other words, the large deviations exponent is still given by the rate function (15.5) except that
the optimal tilting parameter λ is negative.
2
Note that unlike the setting of Theorems 15.1 and 15.9 here P and Pλ are measures on an abstract space Ω, not necessarily
on the real line.
i i
i i
i i
mean from EP X to γ , in particular λ ≥ 0. Moreover, ψX∗ (γ) = λγ − ψX (λ). Take any Q such
that EQ [X] ≥ γ , then
dQdPλ
D(QkP) = EQ log (15.28)
dPdPλ
dPλ
= D(QkPλ ) + EQ [log ]
dP
= D(QkPλ ) + EQ [λX − ψX (λ)]
≥ D(QkPλ ) + λγ − ψX (λ)
= D(QkPλ ) + ψX∗ (γ)
≥ ψX∗ (γ), (15.29)
where the last inequality holds with equality if and only if Q = Pλ . In addition, this shows
the minimizer is unique, proving the second claim. Note that even in the corner case of γ = B
(assuming P(X = B) > 0) the minimizer is a point mass Q = δB , which is also a tilted measure
(P∞ ), since Pλ → δB as λ → ∞, cf. Theorem 15.8(d).
An alternative version of the solution, given by expression (15.26), follows from Theorem 15.6.
For the third claim, notice that there is nothing to prove for γ < EP [X], while for γ ≥ EP [X] we
have just shown
The final step is to notice that ψX∗ is increasing and continuous by Theorem 15.6, and hence the
right-hand side infimum equals ψX∗ (γ). The case of minQ:EQ [X]=γ is handled similarly.
Corollary 15.12 For any Q with EQ [X] ∈ (A, B), there exists a unique λ ∈ R such that the
tilted distribution Pλ satisfies
and furthermore the gap in the last inequality equals D(QkPλ ) = D(QkP) − D(Pλ kP).
Proof. Proceed as in the proof of Theorem 15.11, and find the unique λ s.t. EPλ [X] = ψX′ (λ) =
EQ [X]. Then D(Pλ kP) = ψX∗ (EQ [X]) = λEQ [X] − ψX (λ). Repeat the steps (15.28)-(15.29)
obtaining D(QkP) = D(QkPλ ) + D(Pλ kP).
For any Q the previous result allows us to find a tilted measure Pλ that has the same mean as
Q yet smaller (or equal) divergence distance to P. We will see that the same can be done under
multiple linear constraints (Section 15.6*).
i i
i i
i i
306
Q 6≪ P
One Parameter Family
γ=A
P b
D(Pλ ||P ) EQ [X] = γ
λ=0 = ψ ∗ (γ)
b
λ>0 Q
b
γ=B
Q∗
=Pλ
Q 6≪ P
Space of distributions on R
The key observation here is that the curve of this one-parameter family {Pλ : λ ∈ R} intersects
each γ -slice E = {Q : EQ [X] = γ} “orthogonally” at the minimizing Q∗ ∈ E , and the distance
from P to Q∗ is given by ψ ∗ (λ). To see this, note that applying Theorem 15.10 to the convex set
E gives us D(QkP) ≥ D(QkQ∗ ) + D(Q∗ kP). Now thanks to Corollary 15.12, we in fact have an
equality D(QkP) = D(QkQ∗ ) + D(Q∗ kP) and Q∗ = Pλ for some tilted measure.
Let us give an intuitive (non-rigorous) justification for calling the curve {Pμ , μ ∈ [0, λ]}
geodesic connecting P = P0 to Pλ . Suppose there existed another curve {Qμ } connecting P to
Pλ and minimizing KL distance. Then the expectation EQμ [X] should continously change from
EP [X] to EPλ [X]. Now take any intermediate value γ ′ of the expectation EQμ [X]. We know that on
the slice {Q : EQ [X] = γ ′ } the closest to P element is Pμ′ for some μ′ ∈ [0, λ]. Thus, we could
shorten the distance by connecting P to Pμ′ instead of Qμ .
i i
i i
i i
Our treatment above is specific to distributions on R. How do we find a geodesic between two
arbitrary distributions P̃ and Q̃ on an abstract measurable space X ? To find the answer we notice
that the “intrinsic” definition of the geodesic between P and Pλ above can be given as follows:
μ
dPμ 1 dPλ λ
= ,
dP Z( μ) dP
where Z( μ) = exp{ψ( μ)} is a normalization constant. Correspondingly, we define the geodesic
between P̃ and Q̃ as a parametric family {P̃μ , μ ∈ [0, 1]} given by
μ
dP̃μ 1 dQ̃
≜ , (15.30)
dP̃ Z̃( μ) dP̃
where the normalizing constant Z̃( μ) = exp{( μ − 1)Dμ (Q̃kP̃)} is given in terms of Rényi
divergence. See also Exercise III.25.
Formal justification of (15.30) as a true geodesic in the sense of differential geometry was
given by Cencov in [85, Theorem 12.3] for the case of finite underlying space X . His argument
was the following. To enable discussion of geodesics one needs to equip the space P([k]) with a
connection (or parallel transport map). It is natural to require the connection to be equivariant (or
commute) with respect to some maps P([k]) → P([k′ ]). Cencov lists (a) permutations of elements
(k = k′ ); (b) embedding of a distribution P ∈ P([k]) into a larger space by splitting atoms of [k]
(with specified ratio) into multiple atoms of [k′ ], so that k < k′ ; and (c) conditioning on an event
(k > k′ ). It turns out there is one-parameter family of connections satisfying (a)-(b), including
the Riemannian (Levi-Civitta) connection corresponding to a Fisher-Rao metric (2.35). However,
there is only a unique connection satisfying all (a)-(c). It is different from the Fisher-Rao and its
geodesics are exactly given by (15.30). Geodesically complete submanifolds in this metric are
simply the exponential families (Section 15.6*). For more on this exciting area, see Cencov [85]
and Amari [17].
Examples of regularity conditions in the above theorem include: (a) X is finite, P is fully sup-
ported on X , and E is closed with non-empty interior: see Exercise III.23 for a full proof in this
i i
i i
i i
308
case; (b) X is a Polish space and infQ∈int(E) D(QkP) = infQ∈cl(E) D(QkP): this is the content
of [120, Theorem 6.2.10]. The reference [120] contains full details about various other versions
and extensions of Sanov’s theorem to infinite-dimensional settings.
This problem arises in statistical physics, Gibbs variational principle, exponential family, and
many other fields. Note that taking P uniform correspond to the max-entropy problems.
In the case of d = 1 we have seen that whenever the value of minimization is finite solution
Q∗ can be sought inside a single-parameter family of tilted measures P, cf. (15.27). For this more
general case of d > 1 we define tilted measures as
In order to discuss the solution of (15.31) we first make a simple observation analogous to
Corollary 15.12:
Proposition 15.14 If there exists λ such that ψ(λ) < ∞ and EX∼Pλ [ϕ(X)] = γ , then the
unique minimizer of (15.31) is Pλ and for any Q with EQ [ϕ(X)] = γ we have
⊤ ⊤
Proof. Since log dPdP = λ ϕ(x) − ψ(λ) is finite everywhere we have that D(Pλ kP) = λ γ −
λ
ψ(λ) < ∞ and hence the solution of (15.31) is finite. The fact that Pλ is the unique minimizer
follows from the identity (15.33) that we are to prove next. Take Q as in the statement and suppose
that either D(QkP) or D(QkPλ ) finite (otherwise there is nothing to prove). Since Pλ P this
implies that Q P and so let us denote by fQ = dQ dP . From (2.11) we see that
exp{λ⊤ ϕ(X) − ψ(λ)}
D(QkP) − D(QkPλ ) = EQ Log = EQ log exp{λ⊤ ϕ(X) − ψ(λ)}
1
= λ⊤ γ − ψ(λ) = D(Pλ kP) ,
i i
i i
i i
Unfortunately, Proposition 15.14 is far from being able to completely resolve the prob-
lem (15.31) since it does not explain for what values γ ∈ Rd of the constraints it is possible
to find a required tilting Pλ . For d = 1 we had a very simple characterization of the set of values
that the means of Pλ ’s can achieve. Specifically, Theorem 15.8 showed (under Assumption 15.1)
where A, B are the boundaries of the support of ϕ. To obtain a similar characterization for the
case of d > 1, we let P̃ be the probability distribution on Rd of ϕ(X) when X ∼ P, i.e. P̃ is the
push-forward of P along ϕ. The analog of (A, B) is then played by the following concept:
It is clear that csupp is itself a closed convex set. Furthermore, it can be obtained by taking the
convex hull of supp(P̃) followed by the topological closure cl(·), i.e.
(Indeed, csupp(P̃) ⊂ cl(co(suppP̃)) since the set on the right is convex and closed. On the other
hand, for any closed half-space H ⊃ csupp(P̃) of full measure, i.e. P̃[H] = 1 we must have
supp(P̃) ⊂ H. Taking convex hull and then closure of both sides yields cl(co(supp(P̃))) ⊂ H.
Taking the intersection over all such H shows that cl(co(suppP̃)) ⊂ csupp(P̃) as well.)
We are now ready to state the characterization of when I-projection is solved by a tilted measure.
Theorem 15.16 (I-projection on hyperplane) Suppose P and ϕ satisfy the following two
assumptions: (a) The d + 1 functions (1, ϕ1 , . . . , ϕd ) are linearly independent P-a.s. and (b) the
log MGF ψ(λ) is finite for all λ ∈ Rd . Then
1 If there exist any Q such that EQ [ϕ] = γ and D(QkP) < ∞, we must have γ ∈ csupp(P̃).
2 There is a solution λ to EPλ [ϕ] = γ if and only if γ ∈ int(csupp(P̃)).
Remark 15.6 Assumption (b) of Theorem 15.16 can be relaxed to requiring only the domain
of the log MGF to be an open set (see [85, Theorem 23.1] or [77, Theorem 3.6].) Applying Theo-
rem 15.16, whenever γ ∈ int csupp(P̃) the I-projection can be sought in the tilted family Pλ and
only in such case. If γ ∈/ csupp(P̃) then the I-projection is trivially impossible and every Q with
the given expectation yields D(QkP) = ∞. When γ ∈ ∂ csupp(P̃) it could be that I-projection
(i.e. the minimizer of (15.31)) exist, unique and yields a finite divergence, but the minimizer is
not given by the λ-tilting of P. It could also be that every feasible Q yields D(QkP) = ∞.
i i
i i
i i
310
• γ2 > γ12 : the optimal Q is N (γ1 , γ2 − γ12 ), which is a tilted version of P along ϕ.
• γ2 = γ12 : the only feasible Q is δγ1 , which results in D(QkP) = ∞.
• γ2 < γ12 : there is no feasible Q.
Before giving the proof of the theorem we remind some of the standard and easy facts about
exponential families of which Pλ is an example. In this context ϕ is called a vector of statistics
and λ is the natural parameter. Note that all Pλ ∼ P are mutually absolutely continuous and hence
we have from the linear independence assumption:
CovX∼Pλ [ϕ(X)] 0 (15.35)
i.e. the covariance matrix is (strictly) positive definite. Similar to Theorem 15.3 we can show that
λ 7→ ψ(λ) is a convex, infinitely differentiable function [77]. We want to study the map from
natural parameter λ to the mean parameter μ:
λ 7→ μ(λ) ≜ EX∼Pλ [ϕ(X)] ,
Specifically, we will show that the image μ(Rd ) = int csupp(P̃). To that end note that, similar to
Theorem 15.8(b) and (c), the first two derivatives give moments of ϕ as follows:
EX∼Pλ [ϕ(X)] = ∇ψ(λ) , CovX∼Pλ [ϕ(X)] = Hess ψ(λ) log e . (15.36)
Together with (15.35) we see that then ψ is strictly convex and hence for any λ1 , λ2 we have the
strict monotonicity of ∇ψ , i.e.
(λ1 − λ2 )T (∇ψ(λ1 ) − ∇ψ(λ2 )) > 0 . (15.37)
Additionally, from (15.35) we obtain that Jacobian of the map λ 7→ μ(λ) equals det Hess ψ >
0. Thus by the inverse function theorem the image μ(Rd ) is an open set in Rd and there is an
infinitely differentiable inverse μ 7→ λ = μ−1 ( μ) defined on this set. Hence, the family Pλ can be
equivalently reparameterized by μ’s. What is non-trivial is that the image μ(Rd ) is convex and in
fact coincides with int csupp(P̃).
Proof of Theorem 15.16. Throughout the proof we denote C = csupp(P̃), Co = int(csupp(P̃)).
Suppose there is a Q P with t = EQ [ϕ(X)] 6∈ C. Then there is a (separating hyperplane)
b ∈ Rd and c ∈ R such that b⊤ t < c ≤ b⊤ p for any p ∈ C. Since P[ϕ(X) ∈ C] = 1 we conclude
that Q[b⊤ ϕ(X) ≥ c] = 1. But then this contradicts the fact that EQ [b⊤ ϕ(X)] < c. This shows the
first claim.
Next, we show that for any λ we have μ(λ) = EPλ [ϕ] ∈ Co . Indeed, by the previous paragraph
we know μ(λ) ∈ C. On the other hand, as we discussed the map λ → μ(λ) is smooth, one-to-one,
with smooth inverse. Hence the image of a small ball around λ is open and hence μ(λ) ∈ int(C) =
Co .
i i
i i
i i
Finally, we prove the main implication that for any γ ∈ Co there must exist a λ such that
μ(λ) = γ . To that end, consider the unconstrained minimization problem
If we can show that the minimum is achieved at some λ∗ , then from the first-order optimality
conditions we conclude the desired ∇ψ(λ∗ ) = γ . Since the objective function is continuous, it is
sufficient to show that the minimization without loss of generality can be restricted to a compact
ball {kλk ≤ R} for some large R.
To that end, we first notice that if γ ∈ Co then for some ϵ > 0 we must have
Indeed, suppose this is not the case. Then for any ϵ > 0 there is a sequence vk s.t.
P[v⊤
k (ϕ(X) − γ) > ϵ] → 0 .
Now by compactness of the sphere, vk → ṽϵ without loss of generality and thus we have for every
ϵ some ṽϵ such that
P[ṽ⊤
ϵ (ϕ(X) − γ) > ϵ] = 0 .
Again, by compactness there must exist convergent subsequence ṽϵ → v∗ and ϵ → 0 such that
P[v⊤
∗ (ϕ(X) − γ) > 0] = 0 .
Thus, supp(P̃) ⊂ {x : v⊤ ⊤
∗ ϕ(x) ≤ v∗ γ} and hence γ cannot be an interior point of C = csupp(P̃).
λ
Given (15.39) we obtain the following estimate, where we denote v = ∥λ∥ :
Thus, returning to the minimization problem (15.38) we see that the objective function satisfies
a lower bound
Then it is clear that restricting the minimization to a sufficiently large ball {kλk ≤ R} is without
loss of generality. As we explained this completes the proof.
min{D(QX,Y kPX,Y ) : QX = VX , QY = VY } ,
i i
i i
i i
312
where the marginals VX and VY are given. As we discussed in Section 5.6, Sinkhorn identified
an elegant iterative algorithm that converges to the minimizer. Here, we can apply our general
I-projection theory to show that minimizer has the form
Q∗X,Y (x, y) = A(x)PX,Y (x, y)B(y) . (15.40)
Specifically, let us assume PX,Y (x, y) > 0 and consider |X | + |Y| functions ϕa (x, y) = 1{x = a}
and ϕb (x, y) = 1{y = b}, a ∈ X , b ∈ Y . They are linearly independent. The set csupp(P̃) =
P(X ) × P(Y) in this case corresponds to all marginal distributions. Thus, whenever VX , VY have
no zeros they belong to int(csupp(P̃)) and the solution to the I-projection problem is a tilted version
of PX,Y which is precisely of the form (15.40). In this case, it turns out that I-projection exists also
on the boundary and even when PX,Y is allowed to have zeros but these cases are outside the scope
of Theorem 15.16 and need to be treated differently, see [114].
i i
i i
i i
In this chapter our goal is to determine the achievable region of the exponent pairs (E0 , E1 ) for
the Type-I and Type-II error probabilities in Chernoff’s regime when both exponents are strictly
positive. Our strategy is to apply the achievability and (strong) converse bounds from Chapter 14
in conjunction with the large deviations theory developed in Chapter 15. After characterizing the
full tradeoff we will discuss an adaptive setting of hypothesis testing where instead of committing
ahead of time to testing on the basis of n samples, one can decide adaptively whether to request
more samples or stop. We will find out that adaptivity greatly increases the region of achievable
error-exponents and will learn about the sequential probability ratio test (SPRT) of Wald. In the
closing sections we will discuss relation to more complicated settings in hypothesis testing: one
with composite hypotheses and one with communication constraints.
313
i i
i i
i i
314
ψP (λ)
0 1
λ
E0 = ψP∗ (θ)
E1 = ψP∗ (θ) − θ
slope θ
Figure 16.1 Geometric interpretation of Theorem 16.1 relies on the properties of ψP (λ) and ψP∗ (θ). Note that
ψP (0) = ψP (1) = 0. Moreover, by Theorem 15.6, θ 7→ E0 (θ) is increasing, θ 7→ E1 (θ) is decreasing.
P
For discrete distributions, we have ψP (λ) = log x P(x)1−λ Q(x)λ ; in general, ψP (λ) =
R 1−λ dQ λ
log dμ( dPdμ ) ( dμ ) for some dominating measure μ.
Note that since ψP (0) = ψP (1) = 0, from the convexity of ψP (Theorem 15.3) we conclude
that ψP (λ) is finite on 0 ≤ λ ≤ 1. Furthermore, assuming P Q and Q P we also have that
λ 7→ ψP (λ) continuous everywhere on [0, 1]. (The continuity on (0, 1) follows from convexity,
but for the boundary points we need more detailed arguments.) Although all results in this section
apply under the (milder) conditions of P Q and Q P, we will only present proofs under
the (stronger) condition that log MGF exists for all λ, following the convention of the previous
chapter. The following result determines the optimal (E0 , E1 )-tradeoff in a parametric form. For a
concrete example, see Exercise III.19 for testing two Gaussians.
parametrized by −D(PkQ) ≤ θ ≤ D(QkP), characterizes the upper boundary of the region of all
achievable (E0 , E1 )-pairs. (See Figure 16.1 for an illustration.)
i i
i i
i i
Corollary 16.2 (Bayesian criterion) Fix a prior (π 0 , π 1 ) such that π 0 + π 1 = 1 and 0 <
π 0 < 1. Denote the optimal Bayesian (average) error probability by
P∗e (n) ≜ inf π 0 π 1|0 + π 1 π 0|1
P Z| X n
with exponent
1 1
E ≜ lim log ∗ .
n→∞ n P e ( n)
Then
E = max min(E0 (θ), E1 (θ)) = ψP∗ (0)
θ
If |X | = 2 and if the compositions (types) of xn and x̃n are equal (!), the expression is invariant
under λ ↔ 1 − λ and thus from the convexity in λ we conclude that λ = 12 is optimal,2 yielding
E = 1n dB (xn , x̃n ), where
X
n Xq
dB (xn , x̃n ) = − log PY|X (y|xt )PY|X (y|x̃t ) (16.3)
t=1 y∈Y
1
In short, this is because the optimal tilting parameter λ does not need to be chosen differently for different values of
(xt , x̃t ).
2 1
For another example where λ = 2
achieves the optimal in the Chernoff information, see Exercise III.30.
i i
i i
i i
316
is known as the Bhattacharyya distance between codewords xn and x̃n . (Compare with the Bhat-
tacharyya coefficient defined after (7.5).) Without the two assumptions stated, dB (·, ·) does not
necessarily give the optimal error exponent. We do, however, always have the bounds, see (14.19):
1
exp (−2dB (xn , x̃n )) ≤ P∗e (xn , x̃n ) ≤ exp (−dB (xn , x̃n )) ,
4
where the upper bound becomes tighter when the joint composition of (xn , x̃n ) and that of (x̃n , xn )
are closer.
Pn
Proof of Theorem 16.1. The idea is to apply the large deviations theory to the iid sum k=1 Tk .
Specifically, let’s rewrite the achievability and converse bounds from Chapter 14 in terms of T:
• Achievability (Neyman-Pearson): Applying Theorem 14.11 with τ = −nθ, the LRT achieves
the following
" n # " n #
X X
π 1|0 = P T k ≥ nθ π 0|1 = Q T k < nθ (16.4)
k=1 k=1
• Converse (strong): Applying Theorem 14.10 with γ = exp (−nθ), any achievable π 1|0 and π 0|1
satisfy
" n #
X
π 1|0 + exp (−nθ) π 0|1 ≥ P T k ≥ nθ . (16.5)
k=1
For achievability, applying the nonasymptotic large deviations upper bound in Theorem 15.9
(and Theorem 15.11) to (16.4), we obtain that for any n,
" n #
X
π 1| 0 = P Tk ≥ nθ ≤ exp (−nψP∗ (θ)) , for θ ≥ EP T = −D(PkQ)
k=1
" #
Xn
π 0|1 = Q Tk < nθ ≤ exp −nψQ∗ (θ) , for θ ≤ EQ T = D(QkP)
k=1
exp (−nE0 ) + exp (−nθ) exp (−nE1 ) ≥ exp (−nψP∗ (θ) + o(n))
⇒ min(E0 , E1 + θ) ≤ ψP∗ (θ)
i i
i i
i i
Theorem 16.3 (a) The optimal exponents are given (parametrically) in terms of λ ∈ [0, 1] as
E0 = D(Pλ kP), E1 = D(Pλ kQ) (16.6)
where the distribution Pλ 3 is tilting of P along T given in (15.27), which moves from P0 = P
to P1 = Q as λ ranges from 0 to 1:
dPλ = (dP)1−λ (dQ)λ exp{−ψP (λ)}.
(b) Yet another characterization of the boundary is
E∗1 (E0 ) = min D(Q′ kQ) , 0 ≤ E0 ≤ D(QkP) (16.7)
Q′ :D(Q′ ∥P)≤E0
Remark 16.3 The interesting consequence of this point of view is that it also suggests how
typical error event looks like. Namely, consider an optimal hypothesis test achieving the pair of
exponents (E0 , E1 ). Then conditioned on the error event (under either P or Q) we have that the
empirical distribution of the sample will be close to Pλ . For example, if P = Bin(m, p) and Q =
Bin(m, q), then the typical error event will correspond to a sample whose empirical distribution
P̂n is approximately Bin(m, r) for some r = r(p, q, λ) ∈ (p, q), and not any other distribution on
{0, . . . , m}.
Proof. The first part is verified trivially. Indeed, if we fix λ and let θ(λ) ≜ EPλ [T], then
from (15.8) we have
D(Pλ kP) = ψP∗ (θ) ,
whereas
dPλ dPλ dP
D(Pλ kQ) = EPλ log = EPλ log = D(Pλ kP) − EPλ [T] = ψP∗ (θ) − θ .
dQ dP dQ
Also from (15.7) we know that as λ ranges in [0, 1] the mean θ = EPλ [T] ranges from −D(PkQ)
to D(QkP).
To prove the second claim (16.7), the key observation is the following: Since Q is itself a tilting
of P along T (with λ = 1), the following two families of distributions
dPλ = exp{λT − ψP (λ)} · dP
3
This is called a geometric mixture of P and Q.
i i
i i
i i
318
Therefore,
dQ∗ dQ
E [T] = E
Q∗ Q∗ log = D(Q∗ kP) − D(Q∗ kQ) ∈ [−D(PkQ), D(QkP)] . (16.8)
dP dQ∗
Next, we have from Corollary 15.12 that there exists a unique Pλ with the following three
properties:4
Remark 16.4 A geometric interpretation of (16.7) is given in Figure 16.2: As λ increases from
0 to 1, or equivalently, θ increases from −D(PkQ) to D(QkP), the optimal distribution Pλ traverses
down the dotted path from P and Q. Note that there are many ways to interpolate between P and
Q, e.g., by taking their (arithmetic) mixture (1 − λ)P + λQ. In contrast, Pλ is a geometric mixture
of P and Q, and this special path is in essence a geodesic connecting P to Q and the exponents
E0 and E1 measures its respective distances to P and Q. Unlike Riemannian geometry, though,
here the sum of distances to the two endpoints from an intermediate Pλ actually varies along the
geodesic.
4
A subtlety: In Corollary 15.12 we ask EQ∗ [T] ∈ (A, B). But A, B – the essential range of T – depend on the distribution
under which the essential range is computed, cf. (15.23). Fortunately, we have Q P and P Q, so the essential range
is the same under both P and Q. And furthermore (16.8) implies that EQ∗ [T] ∈ (A, B).
i i
i i
i i
E1
P
D(PkQ) Pλ
space of distributions
D(Pλ kQ)
E0
0 D(Pλ kP) D(QkP)
Figure 16.2 Geometric interpretation of (16.7). Here the shaded circle represents {Q′ : D(Q′ kP) ≤ E0 }, the
KL divergence “ball” of radius E0 centered at P. The optimal E∗1 (E0 ) in (16.7) is given by the divergence from
Q to the closest element of this ball, attained by some tilted distribution Pλ . The tilted family Pλ is the
geodesic traversing from P to Q as λ increases from 0 to 1.
i i
i i
i i
320
So far we have always been working with a fixed number of observations n. However, different
realizations of Xn are informative to different levels, i.e. under some realizations we are very certain
about declaring the true hypothesis, whereas some other realizations leave us more doubtful. In
the fixed n setting, the tester is forced to take a guess in the latter case. In the sequential setting,
pioneered by Wald [448], the tester is allowed to request more observations. We show in this
section that the optimal test in this setting is something known as sequential probability ratio test
(SPRT) [450]. It will also be shown that the resulting tradeoff between the exponents E0 and E1 is
much improved in the sequential setting.
We start with the concept of a sequential test. Informally, at each time t, upon receiving the
observation Xt , a sequential test either declares H0 , declares H1 , or requests one more observation.
The rigorous definition is as follows: a sequential hypothesis test consists of (a) a stopping time
τ with respect to the filtration {Fk , k ∈ Z+ }, where Fk ≜ σ{X1 , . . . , Xn } is generated by the first
n observations; and (b) a random variable (decision) Z ∈ {0, 1} measurable with respect to Fτ .
Each sequential test is associated with the following performance metrics:
α = P[Z = 0], β = Q [ Z = 0] (16.9)
l0 = EP [τ ], l1 = EQ [τ ] (16.10)
The easiest way to see why sequential tests may be dramatically superior to fixed-sample-size
tests is the following example: Consider P = 12 δ0 + 12 δ1 and Q = 12 δ0 + 21 δ−1 . Since P 6⊥ Q, we also
have Pn 6⊥ Qn . Consequently, no finite-sample-size test can achieve zero error under both hypothe-
ses. However, an obvious sequential test (waiting for the first appearance of ±1) achieves zero error
probability with finite number of (two) observations in expectation under both hypotheses. This
advantage is also clear in terms of the achievable error exponents shown in Figure 16.3.
The following result is essentially due to [450], though there it was shown only for the special
case of E0 = D(QkP) and E1 = D(PkQ). The version below is from [339].
5
This assumption is satisfied for example for a pair of fully supported discrete distributions on finite alphabets.
i i
i i
i i
E1
Sequential test
D(PkQ)
E0
0 D(QkP)
Figure 16.3 Tradeoff between Type-I and Type-II error exponents. The bottom curve corresponds to optimal
tests with fixed sample size (Theorem 16.1) and the upper curve to optimal sequential tests (Theorem 16.4).
0, if Sτ ≥ B
Z=
1, if Sτ < −A
where
X
n
P(Xk )
Sn = log
Q( X k )
k=1
Remark 16.5 (Interpretation of SPRT) Under the usual setup of hypothesis testing, we
collect a sample of n iid observations, evaluate the LLR Sn , and compare it to the threshold to give
the optimal test. Under the sequential setup, {Sn : n ≥ 1} is a random walk, which has positive
(resp. negative) drift D(PkQ) (resp. −D(QkP)) under the null (resp. alternative)! SPRT simply
declares P if the random walk crosses the upper boundary B, or Q if the random walk crosses the
upper boundary −A. See Figure 16.4 for an illustration.
i i
i i
i i
322
Sn
0 n
τ
−A
Figure 16.4 Illustration of the SPRT(A, B) test. Here, at the stopping time τ , the LLR process Sn reaches B
before reaching −A and the decision is Z = 1.
EQ [Sτ ] = − EQ [τ ]D(QkP) .
Mn = Sn − nD(PkQ)
M̃n ≜ Mmin(τ,n)
E[M̃n ] = E[M̃0 ] = 0 ,
or, equivalently,
This holds for every n ≥ 0. From the boundedness assumption we have |Sn | ≤ nc0 and thus
|Smin(n,τ ) | ≤ τ c0 , implying that collection {Smin(n,τ ) : n ≥ 0} is uniformly integrable. Thus, we
can take n → ∞ in (16.12) and interchange expectation and limit safely to conclude (16.11).
i i
i i
i i
By monotone convergence theorem applied to the both sides of (16.13) it is then sufficient to
verify that for every n
Next, we denote τ0 = inf{n : Sn ≥ B} and observe that τ ≤ τ0 , whereas the expectation of τ0 can
be bounded using (16.11) as:
E P [ Sτ 0 ] B + c0
EP [τ ] ≤ EP [τ0 ] = ≤ ,
D(PkQ) D(PkQ)
where in the last step we used the boundedness assumption to infer Sτ0 ≤ B + c0 . Overall,
B + c0
l0 = EP [τ ] ≤ EP [τ0 ] ≤ .
D(PkQ)
i i
i i
i i
324
Notice that for l0 E0 and l1 E1 large, we have d(P(Z = 1)kQ(Z = 1)) = l1 E1 (1 + o(1)), therefore
l1 E1 ≤ (1 + o(1))l0 D(PkQ). Similarly we can show that l0 E0 ≤ (1 + o(1))l1 D(QkP). Thus taking
ℓ0 , ℓ1 → ∞ we conclude
E0 E1 ≤ D(PkQ)D(QkP) .
where P and Q are two families of distributions. In this case for a given test Z = Z(X1 , . . . , Xn ) ∈
{0, 1} we define the two types of error as before, but taking worst-case choices over the
distribution:
Unlike testing simple hypotheses for which Neyman-Pearson’s test is optimal (Theorem 14.11), in
general there is no explicit description for the optimal test of composite hypotheses (cf. (32.28)).
The popular choice is a generalized likelihood-ratio test (GLRT) that proposes to threshold the
GLR
supP∈P P⊗n (Xn )
T(Xn ) = .
supQ∈Q Q⊗n (Xn )
For examples and counterexamples of the optimality of GLRT in terms of error exponents, see,
e.g. [469].
Sometimes the families P and Q are small balls (in some metric) surrounding the center dis-
tributions P and Q, respectively. In this case, testing P against Q is known as robust hypothesis
testing (since the test is robust to small deviations of the data distribution). There is a notable finite-
sample optimality result in this case due to Huber [221] – see Exercise III.31. Asymptotically, it
turns out that if P and Q are separated in the Hellinger distance, then the probability of error can
be made exponentially small: see Theorem 32.8.
Sometimes in the setting of composite testing the distance between P and Q is zero. This is
the case, for example, for the most famous setting of a Student t-test: P = {N (0, σ 2 ) : σ 2 > 0},
Q = {N ( μ, σ 2 ) : μ 6= 0, σ 2 > 0}. It is clear that in this case there is no way to construct a test with
α + β < 1, since the data distribution under H1 can be arbitrarily close to P0 . Here, thus, instead
of minimizing worst-case β , one tries to find a test statistic T(X1 , . . . , Xn ) which is a) pivotal in the
sense that its distribution under the H0 is (asymptotically) independent of the choice P0 ∈ P ; and
i i
i i
i i
b) consistent, in the sense that T → ∞ as n → ∞ under any fixed Q ∈ Q. Optimality questions are
studied by minimizing β as a function of Q ∈ Q (known as the power curve). The uniform most
powerful tests are the gold standard in this area [277, Chapter 3], although besides a few classical
settings (such as the one above) their existence is unknown.
In other settings, known as the goodness-of-fit testing [277, Chapter 14], instead of relatively
low-complexity parametric families P and Q one is interested in a giant set Q of alternatives. For
i.i.d. i.i.d.
example, the simplest setting is to distinguish H0 : Xi ∼ P0 vs H1 : Xi ∼ Q, TV(P0 , Q) > δ . If
δ = 0, then in this case again the worst case α + β = 1 for any test and one may only ask for a
statistic T(Xn ) with a known distribution under H0 and T → ∞ for any Q in the alternative. For
δ > 0 the problem is known as nonparametric detection [225, 226] and related to that of property
testing [192].
Definition 16.5 (FI -curve6 ) Given pair of random variables (X, Y) we define their FI curve
as
FI (t; PX,Y ) ≜ sup{I(U; Y) : I(U; X) ≤ t, U → X → Y} ,
6
This concept was introduced in [453], see also [136] and [345, Section 2.2] for the “PX -independent” version.
i i
i i
i i
326
Xn W ∈ {0, 1}nR
Compressor f
Z ∈ {0, 1}
Tester
Yn
Figure 16.5 Illustration to the problem of hypothesis testing with communication constraints.
I(Y; U)
ηKL
I(X; Y)
I(X; U)
0 H(X)
Figure 16.6 A typical FI -curve whose slope at zero is the SDPI constant ηKL .
Example 16.1 A typical FI -curve is shown in Figure 16.6. In general, computing FI -curves is
hard. An exception is the case of X ∼ Ber(1/2) and Y = BSCδ (X). In this case, applying MGL in
Exercise I.64 we get, in the notation of that exercise, that
achieved by taking U ∼ Ber(1/2) and X = BSCp (U) with p chosen such that h(p) = log 2 − t.
From the DPI (3.12) we know that FI (t) ≤ t and the FI -curve strengthens the DPI to I(U; X) ≤
FI (I(U; Y)) whenever U → X → Y. (Note that the roles of X and Y are not symmetric.) In general,
it is not clear how to compute this function; nevertheless, in Exercise III.32 we show that if X
takes values over a finite alphabet then it is sufficient to consider |U| ≤ |X | + 1, and hence FI is
a value of a finite-dimensional convex program. Other properties of the FI -curve and applications
are found in Exercise III.32 and III.33.
The main result of this section is the following.
i i
i i
i i
Theorem 16.6 (Ahslwede-Csiszár [8]) Suppose X, Y are ranging over the finite alphabets
and QX,Y = PX PY (independence testing problem). Then Vϵ (R) = FI (R) for all ϵ ∈ (0, 1).
The setting describes the problem of detecting correlation between two sequences. When R = 0
the testing problem is impossible since the marginal distribution of Y is the same under both
hypotheses. If only a very small communication rate is available then the sample size required
will be very large (Stein exponent small).
Proof. Let us start with an upper bound. Fix a compressor W and notice that for any ϵ-achievable
exponent E by Theorem 14.8 we have
But under conditions of the theorem we have QW,Yn = PW PYn and thus we obtain as in (14.18):
Now, from looking at Figure 16.5 we see that W → Xn → Yn and the from Exercise III.32
(tensorization) we know that
1
I(W; Yn ) ≤ nFI ( I(W; Xn )) ≤ nFI (R) .
n
Thus, we have shown that for all sufficiently large n
FI (R) log 2
E≤ + .
1−ϵ n
This demonstrates that lim supϵ→0 Vϵ ≤ FI (R). For a stronger claim of Vϵ ≤ FI (R), i.e. the strong
converse, see [8].
Now, for the constructive part, consider any n1 and any compressed random variable W1 =
f1 (Xn1 ) with W1 ∈ {0, 1}n1 R . Given blocklength n we can repeatedly send W1 by compress-
ing each n1 chunk independently (for a total of n/n1 “frames”). Then the decompressor will
observe n/n1 iid copies of W1 and also of Yn1 vector-observations. Note that D(PW1 ,Yn1 kQW1 ,Yn1 ) =
I(W1 ; Yn1 ) as above. Thus, by Theorem 14.14 we should be able to achieve α ≥ 1 − ϵ and
β ≤ exp{−n/n1 I(W1 ; Yn1 )}.
Therefore, we obtained the following lower bound (after optimizing over the choice of W1 and
blocklength n1 that we replace by more convenient n again):
1
Vϵ ≥ F̃I (R) ≜ sup { I(W1 ; Yn ) : W1 → Xn → Yn , W1 ∈ {0, 1}nR } . (16.16)
n,W1 n
This looks very similar to the definition of (tensorized) FI -curve except that the constraint is on
the cardinality of W1 instead of the I(W1 ; Xn ). It turns out that the two quantities coincide, i.e.
F̃I (R) = FI (R). We only need to show F̃I (R) ≥ FI (R) for that.
To that end, consider any U → X → Y and R > I(U; X). We apply covering lemma (Corol-
lary 25.6) and Markov lemma (Proposition 25.7), where we set An = Xn , Bn = Un , Xn = Yn .
i i
i i
i i
328
Overall, we get that as n → ∞ there exist encoder W1 = f1 (Xn ), W1 ∈ {0, 1}nR such that
W1 → Xn → Yn and
I(W1 ; Yn ) ≥ nI(U; Y) + o(n) .
By optimizing the choice of U this proves F̃I (R+) ≥ FI (R). Since (as we shown above) F̃I (R+) ≤
FI (R+) and FI (R+) = FI (R) (Exercise III.32), we conclude that F̃I (R) = FI (R).
Theorem shown above has interesting implications for certain task in modern machine learn-
ing. Consider the situation where the sample size n is gigantic but the communication budget (or
memory bandwidth) is constrained so that we can at most deliver k bits from X terminal to the
tester. Then the rate R = nk 1 and the error probability β of an optimal test is roughly given as
′
β ≈ 2−nFI (k/n) ≈ 2−kFI (0) ,
where we used the fact that FI (k/n) ≈ nk F′I (0). We see that the error probability is decaying expo-
nentially with the number of communicated bits not the sample size. In many ways, Theorem 16.6
foreshadowed various results in the last decade on distributed inference. We will get back to this
topic in Chapter 33 dedicated to strong data processing inequalities (SDPI). There is a simple
relation that connects the classical results (this Section) with the modern approach via SDPIs (in
Chapter 33): the slope F′I (0) is precisely the SDPI constant:
F′I (0) = ηKL (PX , PY|X ) ,
see (33.14) for the definition of ηKL . In essence, SDPIs are just linearized versions of FI -curves as
illustrated in Figure 16.6.
i i
i i
i i
III.1 Let P0 and P1 be distributions on X . Recall that the region of achievable pairs (P0 [Z =
0], P1 [Z = 0]) via randomized tests PZ|X : X → {0, 1} is denoted
[
R(P0 , P1 ) ≜ (P0 [Z = 0], P1 [Z = 0]) ⊆ [0, 1]2 .
PZ|X
PY|X
(a) Let PY|X : X → Y be a Markov kernel, which maps Pj to Qj according to Pj −−→ Qj , j =
0, 1. Compare the regions R(P0 , P1 ) and R(Q0 , Q1 ). What does this say about βα (P0 , P1 )
vs βα (Q0 , Q1 )?
(b*) Prove that R(P0 , P1 ) ⊃ R(Q0 , Q1 ) implies existence of some PY|X mapping P0 to Q1 and
P1 to Q1 . In other words, inclusion of R is equivalent to degradation or Blackwell order
(see Definition 33.15).
Comment: This is the most general form of data processing inequality, of which all the other
ones (divergence, mutual information, f-divergence, total-variation, Rényi-divergence, etc) are
corollaries.
III.2 Consider the following binary hypothesis testing (BHT) problem. Under both hypotheses X and
Y are uniform on {0, 1}. However, under H0 , X and Y are independent, while under H1 :
P1 [X 6= Y] = δ < 1/2 .
P [ H1 ] = 1 − P [ H0 ] = π 1 .
where the min is over the tests and the max is between the two numbers in the braces.
Identify the corresponding point on R(P0 , P1 ).
III.3 Consider distributions P and Q on [0, 3] with densities given in Figure 16.7.
i i
i i
i i
p q
1 1
3 3
3 3
i i
i i
i i
(c) Find the optimal test for general prior π (not necessarily equiprobable).
(d) Show that it is sufficient to focus on deteministic test in order to minimize the Bayesian
error probability.
III.8 The function α 7→ βα (P, Q) is monotone and thus by Lebesgue’s theorem possesses a derivative
d
βα′ ≜ βα (P, Q) .
dα
almost everywhere on [0, 1]. Prove
Z 1
D(PkQ) = − log βα′ dα . (III.1)
0
βα (P, Q) ≜ min Q[ Z = 0] = α 2 .
PZ|X :P[Z=0]≥α
which is equivalent to Stein’s lemma (Theorem 14.14). Show furthermore that assuming
V(PkQ) < ∞ we have
p √
log β1−ϵ (Pn , Qn ) = −nD(PkQ) + nV(PkQ)Q−1 (ϵ) + o( n) , (III.2)
R ∞
where Q−1 (·) is the functional inverse of Q(x) = x √12π e−t /2 dt and
2
dP
V(PkQ) ≜ VarP log .
dQ
III.11 (Likelihood-ratio trick) Given two distributions P and Q on X let us generate iid samples (Xi , Yi )
as follows: first Yi ∼ Ber(1/2) and then if Yi = 0 we sample Xi ∼ Q and otherwise Xi ∼ P. We
next train a classifier to minimize the cross-entropy loss:
1X
n
∗ 1 1
p̂ = argmin Yi log + (1 − Yi ) log .
p̂:X →[0,1] n p̂ ( Xi ) 1 − p̂(Xi )
i=1
1−p̂∗ (x)
Show that → dQ
p̂∗ (x)
dP
(x) as n → ∞. This trick is used in machine learning to approximate
dP
dQ for complicated high-dimensional distributions.
III.12 Prove
Yn Xn
min D(QYn k PYj ) = min D(QYj kPYj )
QYn ∈F
j=1 j=1
i i
i i
i i
for some F ′ .
Conclude that in the case when PYj = P and
X
n
F = QYn : EQ f(Yj ) ≥ nγ
j=1
i i
i i
i i
√
(a) Using the Chernoff bound, show that for all ϵ > d,
2 d/2
eϵ
e−ϵ /2 .
2
P [kZk2 ≤ ϵ] ≤
d
(b) Prove the lower bound
d/2
ϵ2
e−ϵ /2 .
2
P [kZk2 ≤ ϵ] ≥
2πd
(c) Extend the results to Z ∼ N (0, Σ).
See Exercise V.30 for an example in infinite dimensions.
III.17 Consider the hypothesis testing problem:
i.i.d.
H0 : X1 , . . . , Xn ∼ P = Ber(1/3) ,
i.i.d.
H1 : X1 , . . . , Xn ∼ Q = Ber(2/3) .
Questions:
(a) Compute the Stein exponent
(b) For a = 3 draw the tradeoff region E of achievable error-exponent pairs (E0 , E1 ).
(c) Identify the divergence-minimizing geodesic Pλ running from P0 to P1 , λ ∈ [0, 1].
Hint: To simplify calculations try differentiating in u the following identity
Z ∞
xu e−x dx = Γ(u + 1) .
0
i i
i i
i i
converges to zero exponentially fast as n → ∞. If it does then find the exponent. Repeat with
γ = 0.5.
III.21 Let Xj be i.i.d. exponential with unit mean. Since the log MGF ψX (λ) ≜ log E[exp{λX}] does
not exist for λ > 1, the large deviations result in Theorem 15.1
X
n
P[ Xj ≥ nγ] = exp{−nψX∗ (γ) + o(n)} (III.6)
j=1
does not apply. Show (III.6) directly via the following steps:
(a) Apply Chernoff argument directly to prove an upper bound.
(b) Fix an arbitrary c > 0 and prove
X
n X
n
P[ Xj ≥ nγ] ≥ P[ (Xj ∧ c) ≥ nγ] . (III.7)
j=1 j=1
(c) Apply the results shown in Chapter 15 to investigate the asymptotics of the right-hand side
of (III.7).
(d) Conclude the proof of (III.6) by taking c → ∞.
III.22 (Hoeffding’s lemma) In this exercise we prove Hoeffding’s lemma (stated after Definition 4.15)
and derive Hoeffding’s concentration inequality. Let X ∈ [−1, 1] with E[X] = 0.
(a) Show that the log MGF ψX (λ) satisfies ψX (0) = ψX′ (0) = 0 and 0 ≤ ψX′′ (λ) ≤ 1. (Hint:
Apply Theorem 15.8(c) and the fact that the variance of any distribution supported on
[−1, 1] is at most 1.)
(b) By Taylor expansion, show that ψX (λ) ≤ λ2 /2.
(c) Applying Theorem 15.1, prove Hoeffding’s inequality: Let Xi ’s be iid copies of X. For any
Pn
γ > 0, P i=1 Xi ≥ nγ ≤ exp(−nγ /2).
2
III.23 (Sanov’s theorem for discrete X ) Let X be a finite set. Let E be a set of probability distributions
on X with non-empty interior. Let Xn = (X1 , . . . , Xn ) be iid drawn from some distribution P
Pn
fully supported on X and let π n denote the empirical distribution, i.e., π n = 1n i=1 δXi . Our
goal is to show that
1 1
E ≜ lim log = inf D(QkP). (III.8)
n→∞ n P(π n ∈ E) Q∈E
i i
i i
i i
(a) We first assume that E is convex. Define the following set of joint distributions En ≜ {QXn :
QXi ∈ E, i = 1, . . . , n}. Show that
inf D(QXn kPXn ) = n inf D(QkP),
QXn ∈En Q∈E
n
where PXn = P .
(b) Consider the conditional distribution P̃Xn = PXn |π n ∈E . Show that P̃Xn ∈ En .
(c) Prove the following nonasymptotic upper bound: for any convex E ,
P(π n ∈ E) ≤ exp − n inf D(QkP) , ∀n.
Q∈E
(Hint: For each ϵ > 0, cover E by N TV balls of radius ϵ where N = N(ϵ) is finite;
cf. Theorem 27.3. Applying the previous part and the union bound.)
(e) For any Q in the interior of E , show that
P(π n ∈ E) ≥ exp(−nD(QkP) + o(n)), n → ∞.
(Hint: Use data processing as in the proof of the large deviations Theorem 15.9.)
(f) Conclude (III.8) by applying the continuity of divergence on finite alphabet (Proposi-
tion 4.8).
III.24 (Error exponents of data compression) Let Xn be iid according to P on a finite alphabet X .
Let ϵ∗ (Xn , nR) denote the minimal probability of error achieved by fixed-length compressors
and decompressors for Xn of compression rate R (cf. Definition 11.1). We know that if R <
H(P) then ϵ∗ (Xn , nR) → 0. Here we show it converges to zero exponentially fast and find the
expopnent.
(a) For any sequence xn , denote by P̂xn its empirical distribution and by H(P̂xn ) its empirical
entropy, i.e., the entropy of the empirical distribution. For example, for the binary sequence
xn = (010110), the empirical distribution is Ber(1/2) and the empirical entropy is 1 bit.
For each R > 0, define the set T = {xn : H(P̂xn ) <R}. Show that |T| ≤ exp( nR)(n + 1)|X | .
∗
(b) Show that for any R > H(P), ϵ (X , nR) ≤ exp − n infQ:H(Q)>R D(QkP) . Specify the
n
i i
i i
i i
(a) D(Pλ kP) = −ψP (λ) + λψP′ (λ), where ψP (λ) = log EP [exp(λ log dQ dP )].
(b) State the appropriate conditions to conclude D(Pλ kP) = 12 λ2 ψP′′ (0)+ o(λ2 ), where ψP′′ (0) =
dP ], which is clearly different from χ (QkP) = VarP [ dP ].
VarP [log dQ 2 dQ
III.26 Denote by N ( μ, σ ) the one-dimensional Gaussian distribution with mean μ and variance σ 2 .
2
distribution. Express P [X1 + · · · + Xn ≥ na] in terms of the Φ̄ function. Using the fact that
Φ̄(x) = e−x /2+o(x ) as x → ∞ (cf. Exercise V.25), reprove (III.9).
2 2
(d) (reverse I-projection) Let Y be a continuous random variable with zero mean and unit
variance. Show that
min D(PY kN ( μ, σ 2 )) = D(PY kN(0, 1)).
μ,σ
III.27 (Why temperatures equalize?) Let X be finite alphabet and f : X → R an arbitrary function.
Let Emin = min f(x).
(a) Using I-projection show that for any E ≥ Emin the solution of
H∗ (E) = max{H(X) : E[f(X)] ≤ E}
1
is given by a Gibbs distribution (cf. (5.21)) PX (x) = Z(β) e−β f(x) for some β = β(E).
Comment: In statistical physics x is state of the system (e.g. locations and velocities of all
molecules), f(x) is energy of the system in state x, PX is the Gibbs distribution and β = T1 is
the inverse temperature of the system. In thermodynamic equilibrium, PX (x) gives fraction
of time system spends in state x.
∗
(b) Show that dHdE(E) = β(E).
(c) Next consider two functions f0 , f1 (i.e. two types of molecules with different state-energy
relations). Show that for E ≥ minx0 f0 (x0 ) + minx1 f1 (x1 ) we have
max H(X0 , X1 ) = max H∗0 (E0 ) + H∗1 (E1 ) (III.10)
E[f0 (X0 )+f1 (X1 )]≤E E0 +E1 ≤E
i i
i i
i i
Remark: (III.12) also just follows from part (a) by taking f(x0 , x1 ) = f0 (x0 ) + f1 (x1 ). The
point here is relation (III.11): when two thermodynamical systems are brought in contact with
each other, the energy distributes among them in such a way that β parameters (temperatures)
equalize.
III.28 (Importance Sampling [90]) Let μ and ν be two probability measures on set X . Assume that
ν μ. Let L = D(νk μ) and ρ = ddμν be the Radon-Nikodym derivative. Let f : X → R be a
measurable function. We would like to estimate Eν f using data from μ.
i.i.d. P
Let X1 , . . . , Xn ∼ μ and In (f) = n1 1≤i≤n f(Xi )ρ(Xi ). Prove the following.
(a) For n ≥ exp(L + t) with t ≥ 0, we have
q
E |In (f) − Eν f| ≤ kfkL2 (ν) exp(−t/4) + 2 Pμ (log ρ > L + t/2) .
Hint: Let h = f1{ρ ≤ exp(L + t/2)}. Use triangle inequality and bound E |In (h) − Eν h|,
E |In (h) − In (f)|, | Eν f − Eν h| separately.
(b) On the other hand, for n ≤ exp(L − t) with t ≥ 0, we have
Pμ (log ρ ≤ L − t/2)
P(In (1) ≥ 1 − δ)| ≤ exp(−t/2) + ,
1−δ
for all δ ∈ (0, 1), where 1 is the constant-1 function.
Hint: Divide into two cases depending on whether max1≤i≤n ρ(Xi ) ≤ exp(L − t/2).
This shows that a sample of size exp(D(νk μ) + Θ(1)) is both necessary and sufficient for
accurate estimation by importance sampling.
III.29 (M-ary hypothesis testing)7 The following result [274] generalizes Corollary 16.2 on the best
average probability of error for testing two hypotheses to multiple hypotheses.
Fix a collection of distributions {P1 , . . . , PM }. Conditioned on θ, which takes value i with prob-
i.i.d.
ability π i > 0 for i = 1, . . . , M, let X1 , . . . , Xn ∼ Pθ . Denote the optimal average probability of
∗
error by pn = inf P[θ̂ 6= θ], where the infimum is taken over all decision rules θ̂ = θ̂(X1 , . . . , Xn ).
(a) Show that
1 1
lim log ∗ = min C(Pi , Pj ), (III.13)
n→∞ n pn 1≤i<j≤M
7
Not to be confused with multiple testing in the statistics literature, which refers to testing multiple pairs of binary
hypotheses simultaneously.
i i
i i
i i
i i
i i
i i
Part IV
Channel coding
i i
i i
i i
i i
i i
i i
341
In this Part we study a new type of problem. The goal of channel coding is to communicate
digital information across a noisy channel. Historically, this was the first area of information the-
ory that lead to immediately and widely deployed applications. Shannon’s discovery [378] of the
possibility of transmitting information with vanishing error and positive (i.e. bigger than zero) rate
of bits per second was also theoretically quite surprising and unexpected. Our goal in this Part is
to understand these arguments.
To explain the relation of this Part to others, let us revisit what problems we have studied so
far. In Part I we introduced various information measures and studied their properties irrespec-
tive of engineering applications. Then, in Part II our objective was data compression. The main
object there was a single distribution PX and the fundamental limit E[ℓ(f∗ (X))] is the minimal
compression length. The main result (the “coding theorem”) established connection between the
fundamental limit and an information quantity, that we can summarize as
E[ℓ(f∗ (X))] ≈ H(X)
Next, in Part III we studied binary hypothesis testing. There the main object was a pair of distri-
butions (P, Q), the fundamental limit was the Neyman-Pearson curve β1−ϵ (Pn , Qn ) and the main
result
β1−ϵ (Pn , Qn ) ≈ exp{−nD(PkQ)} ,
again connecting an operational quantity to an information measure.
In channel coding – the topic of this Part – the main object is going to be a channel PY|X .
The fundamental limit is M∗ (ϵ), the maximum number of messages that can be transmitted with
probability of error at most ϵ, which we rigorously define in this chapter. Our main result in this
part is to show the celebrated Shannon’s noisy channel coding theorem:
log M∗ (ϵ) ≈ max I(X; Y) .
PX
We will demonstrate the possibility of sending information with high reliability and also will
rigorously derive the asymptotically (and non-asymptotically!) highest achievable rates. However,
we entirely omit a giant and beautiful field of coding theory that deals with the question of how
to construct transmitters and receivers with low computational complexity. This area of science,
though deeply related to the content of our book, deserves a separate dedicated treatment. We
recommend reading [360] for the sparse-graph based codes and [372] for introduction to more
modern polar codes.
The practical implications of this chapter are profound even without giving explicit construc-
tions of codes. First, in the process of finding channel capacity one needs to maximize mutual
information and the maximizing distributions reveal properties of optimal codes (e.g. water-filling
solution dictates how to optimally allocate power between frequency bands, Figure 3.2 suggests
when to use binary modulation, etc). Second, the non-asymptotic (finite blocklength) bounds that
we develop in this Part are routinely used for benchmarking performance of all newly developed
codes. Other implications tell how to exploit memory in the channel, or leverage multiple antennas
(Exercise I.10), and many more. In all, the contents of this Part have had by far the most real-world
impact of all (at least as of the writing of this book).
i i
i i
i i
In this chapter we introduce the concept of an error correcting code (ECC). We will spend time
discussing what it means for a code to have low probability of error, what is the optimum (ML or
MAP) decoder. On the special case of coding for the BSC we showcase evolution of our under-
standing of fundamental limits from pre-Shannon’s to modern finite blocklength. We also briefly
review the history of ECCs. We conclude with a conceptually important proof of a weak converse
(impossibility) bound for the performance of ECCs.
• encoder f : [M] → X
• decoder g : Y → [M] ∪ {e}
In most cases f and g are deterministic functions, in which case we think of them, equivalently,
in terms of codewords, codebooks, and decoding regions (see Figure 17.1 for an illustration)
Given an M-code we can define a probability space, underlying all the subsequent developments
in this Part. For that we chain the three objects – message W, the encoder and the decoder – together
into the following Markov chain:
f P Y| X g
W −→ X −→ Y −→ Ŵ (17.1)
1
For randomized encoder/decoders, we identify f and g as probability transition kernels PX|W and PŴ|Y .
342
i i
i i
i i
c1 b
b
b
D1 b
b b
b cM
b b
b
DM
Figure 17.1 When X = Y, the decoding regions can be pictured as a partition of the space, each containing
one codeword.
where we set W ∼ Unif([M]). In the case of discrete spaces, we can explicitly write out the joint
distribution of these variables as follows:
1
(general) PW,X,Y,Ŵ (m, a, b, m̂) = PX|W (a|m)PY|X (b|a)PŴ|Y (m̂|b)
M
1
(deterministic f, g) PW,X,Y,Ŵ (m, cm , b, m̂) = PY|X (b|cm )1{b ∈ Dm̂ }
M
Throughout these sections, these random variables will be referred to by their traditional names:
W – original (true) message, X - (induced) channel input, Y - channel output and Ŵ - decoded
message.
Although any pair (f, g) is called an M-code, in reality we are only interested in those that satisfy
certain “error correcting” properties. To assess their quality we define the following performance
metrics:
Definition 17.2 A code (f, g) is an (M, ϵ)-code for PY|X if Pe (f, g) ≤ ϵ. Similarly, an (M, ϵ)max -
code must satisfy Pe,max ≤ ϵ. The fundamental limits of channel coding are defined as
M∗ (ϵ; PY|X ) = max{M : ∃(M, ϵ)-code}
M∗max (ϵ; PY|X ) = max{M : ∃(M, ϵ)max -code}
ϵ∗ (M; PY|X ) = inf{ϵ : ∃(M, ϵ)-code}
ϵ∗max (M; PY|X ) = inf{ϵ : ∃(M, ϵ)max -code}
i i
i i
i i
344
The argument PY|X will be omitted when PY|X is clear from the context.
In other words, the quantity log2 M∗ (ϵ) gives the maximum number of bits that we can
push through a noisy transformation PY|X , while still guaranteeing the error probability in the
appropriate sense to be at most ϵ.
Yn = X n ⊕ Zn .
0 1 0 0 1 1 0 0 1 1
PY n |X n
1 1 0 1 0 1 0 0 0 1
In the next section we discuss coding for the BSC channel in more detail.
i i
i i
i i
0 0 1 0
Decoding can be done by taking a majority vote inside each ℓ-block. Thus, each data bit is decoded
with probability of bit error Pb = P[Binom(l, δ) > l/2]. However, the probability of block error of
this scheme is Pe ≤ kP[Binom(l, δ) > l/2]. (This bound is essentially tight in the current regime).
Consequently, to satisfy Pe ≤ 10−3 we must solve for k and ℓ satisfying kl ≤ n = 1000 and also
This gives l = 21, k = 47 bits. So we can see that using repetition coding we can send 47 data
bits by using 1000 channel uses.
Repetition coding is a natural idea. It also has a very natural tradeoff: if you want better reliabil-
ity, then the number ℓ needs to increase and hence the ratio nk = 1ℓ should drop. Before Shannon’s
groundbreaking work, it was almost universally accepted that this is fundamentally unavoidable:
vanishing error probability should imply vanishing communication rate nk .
Before delving into optimal codes let us offer a glimpse of more sophisticated ways of injecting
redundancy into the channel input n-sequence than simple repetition. For that, consider the so-
called first-order Reed-Muller codes (1, r). We interpret a sequence of r data bits a0 , . . . , ar−1 ∈ Fr2
as a degree-1 polynomial in (r − 1) variables:
X
r− 1
a = (a0 , . . . , ar−1 ) 7→ fa (x) ≜ a i xi + a 0 .
i=1
In order to transmit these r bits of data we simply evaluate fa (·) at all possible values of the variables
xr−1 ∈ Fr2−1 . This code, which maps r bits to 2r−1 bits, has minimum distance dmin = 2r−2 . That
is, for two distinct a 6= a′ the number of positions in which fa and fa′ disagree is at least 2r−2 . In
coding theory notation [n, k, dmin ] we say that the first-order Reed-Muller code (1, 7) is a [64, 7, 32]
code. It can be shown that the optimal decoder for this code achieves over the BSC⊗ 64
0.11 channel a
probability of error at most 6 · 10−6 . Thus, we can use 16 such blocks (each carrying 7 data bits and
occupying 64 bits on the channel) over the BSC⊗ δ
1024
, and still have (by the union bound) overall
−4 −3
probability of block error Pe ≲ 10 < 10 . Thus, with the help of Reed-Muller codes we can
send 7 · 16 = 112 bits in 1024 channel uses, more than doubling that of the repetition code.
Shannon’s noisy channel coding theorem (Theorem 19.9) – a crown jewel of information theory
– tells us that over memoryless channel PYn |Xn = (PY|X )n of blocklength n the fundamental limit
satisfies
i i
i i
i i
346
as n → ∞ and for arbitrary ϵ ∈ (0, 1). Here C = maxPX1 I(X1 ; Y1 ) is the capacity of the single-letter
channel. In our case of BSC we have
1
C = log 2 − h(δ) ≈ bit ,
2
since the optimal input distribution is uniform (from symmetry) – see Section 19.3. Shannon’s
expansion (17.2) can be used to predict (not completely rigorously, of course, because of the
o(n) residual) that it should be possible to send around 500 bits reliably. As it turns out, for the
blocklength n = 1000 this is not quite possible.
Note that computing M∗ exactly requires iterating over all possible encoders and decoder –
an impossible task even for small values of n. However, there exist rigorous and computation-
ally tractable finite blocklength bounds [334] that demonstrate for our choice of n = 1000, δ =
0.11, ϵ = 10−3 :
Thus we can see that Shannon’s prediction is about 20% too optimistic. We will see below
some of such finite-length bounds. Notice, however, that while the bounds guarantee existence
of an encoder-decoder pair achieving a prescribed performance, building an actual f and g
implementable with a modern software/hardware is a different story.
It took about 60 years after Shannon’s discovery of (17.2) to construct practically imple-
mentable codes achieving that performance. The first codes that approach the bounds on log M∗
are calledturbo codes [47] (after the turbocharger engine, where the exhaust is fed back in to
power the engine). This class of codes is known as sparse-graph codes, of which the low-density
parity check (LDPC) codes invented by Gallager are particularly well studied [360]. As a rule of
thumb, these codes typically approach 80 . . . 90% of log M∗ when n ≈ 103 . . . 104 . For shorter
blocklengths in the range of n = 100 . . . 1000 there is an exciting alternative to LDPC codes: the
polar codes of Arıkan [23], which are most typically used together with the list-decoding idea
of Tal and Vardy [409]. And of course, the story is still evolving today as new channel models
become relevant and new hardware possibilities open up.
We wanted to point out a subtle but very important conceptual paradigm shift introduced by
Shannon’s insistence on coding over many (information) bits together. Indeed, consider the sit-
uation discussed above, where we constructed a powerful code with M ≈ 2400 codewords and
n = 1000. Now, one might imagine this code as a constellation of 2400 points carefully arranged
inside a hypercube {0, 1}1000 to guarantee some degree of separation between them, cf. (17.6).
Next, suppose one was using this code every second for the lifetime of the universe (≈ 1018 sec).
Yet, even after this laborious process she will have explored at most 260 different codewords from
among an overwhelmingly large codebook 2400 . So a natural question arises: why did we need
to carefully place all these many codewords if majority of them will never be used by anyone?
The answer is at the heart of the concept of information: to transmit information is to convey a
selection of one element (W) from a collection of possibilities ([M]). The fact that we do not know
which W will be selected forces us to a priori prepare for every one of the possibilities. This simple
idea, proposed in the first paragraph of [378], is now tacitly assumed by everyone, but was one of
i i
i i
i i
the subtle ways in which Shannon revolutionized scientific approach to the study of information
exchange.
Notice that the optimal decoder is deterministic. For the special case of deterministic encoder,
where we can identify the encoder with its image C the minimal (MAP) probability of error for
the codebook C can be written as
1 X
Pe,MAP (C) = 1 − max PY|X (y|x) , (17.5)
M x∈C
y∈Y
Consequently, the optimal decoding regions – see Figure 17.1 – become the Voronoi cells tesse-
lating the Hamming space {0, 1}n . Similarly, the MAP decoder for the AWGN channel induces a
Voronoi tesselation of Rn – see Section 20.3.
So we have seen that the optimal decoder is without loss of generality can be assumed to be
deterministic. Similarly, we can represent any randomized encoder f as a function of two argu-
ments: the true message W and an external randomness U ⊥ ⊥ W, so that X = f(W, U) where this
time f is a deterministic function. Then we have
which implies that if P[W 6= Ŵ] ≤ ϵ then there must exist some choice u0 such that P[W 6= Ŵ|U =
u0 ] ≤ ϵ. In other words, the fundamental limit M∗ (ϵ) is unchanged if we restrict our attention to
deterministic encoders and decoders only.
Note, however, that neither of the above considerations apply to the maximal probability of
error Pe,max . Indeed, the fundamental limit M∗max (ϵ) does indeed require considering randomized
encoders and decoders. For example, when M = 2 from the decoding point of view we are back to
i i
i i
i i
348
the setting of binary hypotheses testing in Part III. The optimal decoder (test) that minimizes the
maximal Type-I and II error probability, i.e., max{1 − α, β}, will not be deterministic if max{1 −
α, β} is not achieved at a vertex of the Neyman-Pearson region R(PY|W=1 , PY|W=2 ).
Theorem 17.3 (Weak converse) Any (M, ϵ)-code for PY|X satisfies
supPX I(X; Y) + h(ϵ)
log M ≤ ,
1−ϵ
where h(x) = H(Ber(x)) is the binary entropy function.
Proof. This can be derived as a one-line application of Fano’s inequality (Theorem 3.12), but we
proceed slightly differently with an eye towards future extensions in meta-converse (Section 22.3).
Consider an M-code with probability of error Pe and its corresponding probability space: W →
X → Y → Ŵ. We want to show that this code can be used as a hypothesis test between distributions
PX,Y and PX PY . Indeed, given a pair (X, Y) we can sample (W, Ŵ) from PW,Ŵ|X,Y = PW|X PŴ|Y and
compute the binary value Z = 1{W 6= Ŵ}. (Note that in the most interesting cases when encoder
and decoder are deterministic and the encoder is injective, the value Z is a deterministic function
of (X, Y).) Let us compute performance of this binary hypothesis test under two hypotheses. First,
when (X, Y) ∼ PX PY we have that Ŵ ⊥ ⊥ W ∼ Unif([M]) and therefore:
1
PX PY [Z = 1] = .
M
Second, when (X, Y) ∼ PX,Y then by definition we have
PX,Y [Z = 1] = 1 − Pe .
Thus, we can now apply the data-processing inequality for divergence to conclude: Since W →
X → Y → Ŵ, we have the following chain of inequalities (cf. Fano’s inequality Theorem 3.12):
DPI 1
D(PX,Y kPX PY ) ≥ d(1 − Pe k )
M
≥ −h(P[W 6= Ŵ]) + (1 − Pe ) log M
i i
i i
i i
Remark 17.2 The bound can be significantly improved by considering other divergence mea-
sures in the data-processing step. In particular, we will see below how one can get “strong”
converse (explaining the term “weak” converse here as well) in Section 22.1. The proof technique
is known as meta-converse; see Section 22.3.
i i
i i
i i
So far our discussion of channel coding was mostly following the same lines as the M-ary hypothe-
sis testing (HT) in statistics. In this chapter we introduce the key departure: the principal and most
interesting goal in information theory is the design of the encoder f : [M] → X or the codebook
{ci ≜ f(i), i ∈ [M]}. Once the codebook is chosen, the problem indeed becomes that of M-ary HT
and can be tackled by standard statistical tools. However, the task of choosing the encoder f has no
exact analogs in statistical theory (the closest being design of experiments). Each f gives rise to a
different HT problem and the goal is to choose these M hypotheses PY|X=c1 , . . . , PY|X=cM to ensure
maximal testability. It turns out that the problem of choosing a good f will be much simplified
if we adopt a suboptimal way of testing M-ary HT. Roughly speaking we will run M binary HTs
testing PY|X=cm against PY , which tries to distinguish the channel output induced by the message m
from an “average background noise” PY . An optimal such test, as we know from Neyman-Pearson
(Theorem 14.11), thresholds the following quantity
PY|X=x
log .
PY
This explains the central role played by the information density (defined next) in this chapter.
After introducing the latter we will present several results demonstrating existence of good codes.
We start with the original bound of Shannon (expressed in modern language), followed by its
tightenings (DT, RCU and Gallager’s bounds). These belong to the class of random coding bounds.
An entirely different approach was developed by Feinstein and is called maximal coding. We
will see that the two result in eerily similar results. Why two of these rather different methods
yield similar results, which are also quite close to the best possible (i.e. “achieve capacity and
dispersion”)? It turns out that the answer lies in a certain submodularity property of the channel
coding task. Finally, we will also discuss a more structured class of codes based on linear algebraic
constructions. Similar to the case of compression it will be shown that linear codes are no worse
than general codes, explaining why virtually all practical codes are linear.
While reading this Chapter, we recommend also consulting Figure 22.2, in which various
achievability bounds are compared for the BSC.
In this chapter it will be convenient to introduce the following independent pairs (X, Y) ⊥
⊥ (X, Y)
with their joint distribution given by:
We will often call X the sent codeword and X̄ the unsent codeword.
350
i i
i i
i i
Definition 18.1 (Information density1 ) Let PX,Y μ and PX PY μ for some dominat-
ing measure μ, and denote by f(x, y) = dPdμX,Y and f̄(x, y) = dPdμ X PY
the Radon-Nikodym derivatives
of PX,Y and PX PY with respect to μ, respectively. Then recalling the Log definition (2.10) we set
log ff̄((xx,,yy)) , f(x, y) > 0, f̄(x, y) > 0
f(x, y) +∞, f(x, y) > 0, f̄(x, y) = 0
iPX,Y (x; y) ≜ Log = (18.2)
f̄(x, y) −∞, f(x, y) = 0, f̄(x, y) > 0
0, f(x, y) = f̄(x, y) = 0 .
Proposition 18.2 The expectation E[i(X; Y)] is well-defined and non-negative (but possibly
infinite). In any case, we have I(X; Y) = E[i(X; Y)].
Proof. This is follows from (2.12) and the definition of i(x; y) as log-ratio.
Being defined as log-likelihood, information density possesses the standard properties of the
latter, cf. Theorem 14.6. However, because its defined in terms of two variables (X, Y), there are
1
We remark that in machine learning (especially natural language processing) information density is also called pointwise
mutual information (PMI) [292].
i i
i i
i i
352
also very useful conditional expectation versions. To illustrate the meaning of the next proposition,
let us consider the case of discrete X, Y and PX,Y PX PY . Then we have for every x:
X X
f(x, y)PX (x)PY (y) = f(x, y) exp{−i(x; y)}PX,Y (x, y) .
y y
E[f+ (X̄, Y)1{i(X̄; Y) > −∞}|X̄ = x] = E[f+ (X, Y) exp{−i(X; Y)}|X = x] (18.4)
Proof. The first part (18.3) is simply a restatement of (14.5). For the second part, let us define
a(x) ≜ E[f+ (X̄, Y)1{i(X̄; Y) > −∞}|X̄ = x], b(x) ≜ E[f+ (X, Y) exp{−i(X; Y)}|X = x]
We first additionally assume that f is bounded. Fix ϵ > 0 and denote Sϵ = {x : a(x) ≥ b(x) + ϵ}.
As ϵ → 0 we have Sϵ % {x : a(x) > b(x)} and thus if we show PX [Sϵ ] = 0 this will imply that
a(x) ≤ b(x) for PX -a.e. x. The symmetric argument shows b(x) ≤ a(x) and completes the proof
of the equality.
To show PX [Sϵ ] = 0 let us apply (18.3) to the function f(x, y) = f+ (x, y)1{x ∈ Sϵ }. Then we get
E[f+ (X, Y)1{X ∈ Sϵ } exp{−i(X; Y)}] = E[f+ (X̄, Y)1{i(X̄; Y) > −∞}1{X ∈ Sϵ }] .
Let us re-express both sides of this equality by taking the conditional expectations over Y to get:
E[b(X)1{X ∈ Sϵ }] = E[a(X̄)1{X̄ ∈ Sϵ }] .
Since f+ (and therefore b) was assumed to be bounded we can cancel the common term from both
sides and conclude PX [Sϵ ] = 0 as required.
Finally, to show (18.4) in full generality, given an unbounded f+ we define fn (x, y) =
min(f+ (x, y), n). Since (18.4) holds for fn we can take limit as n → ∞ on both sides of it:
lim E[fn (X̄, Y)1{i(X̄; Y) > −∞}|X̄ = x] = lim E[fn (X, Y) exp{−i(X; Y)}|X = x]
n→∞ n→∞
i i
i i
i i
By the monotone convergence theorem (for conditional expectations!) we can take the limits inside
the expectations to conclude the proof.
i i
i i
i i
354
Note that (18.8) holds regardless of the input distribution PX used for the definition of i(x; y), in
PM
particular we do not have to use the code-induced distribution PX = M1 i=1 δci . However, if we
are to threshold information density, different choices of PX will result in different decoders, so
we need to justify the choice of PX .
To that end, recall that to distinguish between two codewords ci and cj , one can apply (as we
P
learned in Part III for binary HT) the likelihood ratio test, namely thresholding the LLR log PYY||XX=
=c
ci
.
j
As we explained at the beginning of this Chapter, a (possibly suboptimal) approach in M-ary HT
is to run binary tests by thresholding each information density i(ci ; y). This, loosely speaking,
evaluates the likelihood of ci against the average distribution of the other M − 1 codewords, which
1
P
we approximate by PY (as opposed to the more precise form M− 1 j̸=i PY|X=cj ). Putting these ideas
together we can propose the decoder as
where γ is a threshold and PX is judiciously chosen (to maximize I(X; Y) as we will see soon).
A yet another way to see why thresholding decoder (as opposed to an ML one) is a natural idea
is to simply believe the fact that for good error correcting codes the most likely (ML) codeword
has likelihood (information density) so much higher than the rest of the candidates that instead
of looking for the maximum we simply can select the one (and only) codeword that exceeds a
pre-specified threshold.
With these initial justifications we proceed to the main result of this section.
Theorem 18.5 (Shannon’s achievability bound) Fix a channel PY|X and an arbitrary
input distribution PX . Then for every τ > 0 there exists an (M, ϵ)-code with
Proof. Recall that for a given codebook {c1 , . . . , cM }, the optimal decoder is MAP and is equiv-
alent to maximizing information density, cf. (18.8). The step of maximizing the i(cm ; Y) makes
analyzing the error probability difficult. Similar to what we did in almost loss compression, cf. The-
orem 11.5, the first important step for showing the achievability bound is to consider a suboptimal
decoder. In Shannon’s bound, we consider a threshold-based suboptimal decoder g(y) as follows:
m, ∃! cm s.t. i(cm ; y) ≥ log M + τ
g ( y) = (18.10)
e, otherwise
In words, decoder g reports m as decoded message if and only if codeword cm is a unique one
with information density exceeding the threshold log M + τ . If there are multiple or none such
codewords, then decoder outputs a special value of e, which always results in error since W 6= e
ever. (We could have decreased probability of error slightly by allowing the decoder to instead
output a random message, or to choose any one of the messages exceeding the threshold, or any
other clever ideas. The point, however, is that even the simplistic resolution of outputting e already
achieves all qualitative goals, while simplifying the analysis considerably.)
i i
i i
i i
E[Pe (c1 , . . . , cM )]
= E[Pe (c1 , . . . , cM )|W = 1]
= P[{i(c1 ; Y) ≤ log M + τ } ∪ {∃m 6= 1, i(cm , Y) > log M + τ }|W = 1]
X
M
≤ P[i(c1 ; Y) ≤ log M + τ |W = 1] + P[i(cm̄ ; Y) > log M + τ |W = 1] (union bound)
m̄=2
( a)
= P [i(X; Y) ≤ log M + τ ] + (M − 1)P i(X; Y) > log M + τ
≤ P [i(X; Y) ≤ log M + τ ] + (M − 1) exp(−(log M + τ )) (by Corollary 18.4)
≤ P [i(X; Y) ≤ log M + τ ] + exp(−τ ) ,
where the crucial step (a) follows from the fact that given W = 1 and m̄ 6= 1 we have
d
(c1 , cm̄ , Y) = (X, X̄, Y)
Remark 18.2 (Joint typicality) Shortly in Chapter 19, we will apply this theorem for the
case of PX = P⊗ n ⊗n
X1 (the iid input) and PY|X = PY1 |X1 (the memoryless channel). Traditionally,
cf. [111], decoders in such settings were defined with the help of so called “joint typicality”. Those
decoders given y = yn search for the codeword xn (both of which are an n-letter vectors) such that
the empirical joint distribution is close to the true joint distribution, i.e., P̂xn ,yn ≈ PX1 ,Y1 , where
1
P̂xn ,yn (a, b) = · |{j ∈ [n] : xj = a, yj = b}|
n
is the joint empirical distribution of (xn , yn ). This definition is used for the case when random
coding is done with cj ∼ uniform on the type class {xn : P̂xn ≈ PX }. Another alternative, “entropic
Pn
typicality”, cf. [106], is to search for a codeword with j=1 log PX ,Y 1(xj ,yj ) ≈ H(X, Y). We think
1 1
of our requirement, {i(xn ; yn ) ≥ nγ1 }, as a version of “joint typicality” that is applicable to much
wider generality of channels (not necessarily over product alphabets, or memoryless).
i i
i i
i i
356
Theorem 18.6 (DT bound) Fix a channel PY|X and an arbitrary input distribution PX . Then
for every τ > 0 there exists an (M, ϵ)-code with
M − 1 +
ϵ ≤ E exp − i(X; Y) − log (18.11)
2
where x+ ≜ max(x, 0).
Setting Ŵ = g(Y) we note that given a codebook {c1 , . . . , cM }, we have by union bound
P[Ŵ 6= j|W = j] = P[i(cj ; Y) ≤ γ|W = j] + P[i(cj ; Y) > γ, ∃k ∈ [j − 1], s.t. i(ck ; Y) > γ]
j− 1
X
≤ P[i(cj ; Y) ≤ γ|W = j] + P[i(ck ; Y) > γ|W = j].
k=1
Averaging over the randomly generated codebook, the expected error probability is upper bounded
by:
1 X
M
E[Pe (c1 , . . . , cM )] = P[Ŵ 6= j|W = j]
M
j=1
1 X
j−1
M X
≤ P[i(X; Y) ≤ γ] + P[i(X; Y) > γ]
M
j=1 k=1
M−1
= P[i(X; Y) ≤ γ] + P[i(X; Y) > γ]
2
M−1
= P[i(X; Y) ≤ γ] + E[exp(−i(X; Y))1 {i(X; Y) > γ}] (by (18.3))
2
h M−1 i
= E 1 {i(X; Y) ≤ γ} + exp(−i(X; Y))1 {i(X, Y) > γ}
2
To optimize over γ , note the simple observation that U1E + V1Ec ≥ min{U, V}. Therefore for any
x, y, 1{i(x; y) ≤ γ} + M− M−1
2 exp{−i(x; y)}1{i(x; y) > γ} > min(1, 2 exp{−i(x; y)}), achieved
1
M−1
by γ = log 2 regardless of x, y. Thus, we continue the bounding as follows
h M−1 i
inf E[Pe (c1 , . . . , cM )] ≤ inf E 1 {i(X; Y) ≤ γ} + exp(−i(X; Y))1 {i(X, Y) > γ}
γ γ 2
h M−1 i
= E min 1, exp(−i(X; Y))
2
i i
i i
i i
M − 1 +
= E exp − i(X; Y) − log .
2
multiple of the minimum error probability of the following Bayesian hypothesis testing problem:
H0 : X, Y ∼ PX,Y versus H1 : X, Y ∼ PX PY
2 M−1
prior prob.: π 0 = , π1 = .
M+1 M+1
Note that X, Y ∼ PX,Y and X, Y ∼ PX PY , where X is the sent codeword and X is the unsent code-
word. As we know from binary hypothesis testing, the best threshold for the likelihood ratio test
(minimizing the weighted probability of error) is log ππ 10 , as we indeed found out.
One of the immediate benefits of Theorem 18.6 compared to Theorem 18.5 is precisely the fact
that we do not need to perform a cumbersome minimization over τ in (18.9) to get the minimum
upper bound in Theorem 18.5. Nevertheless, it can be shown that the DT bound is stronger than
Shannon’s bound with optimized τ . See also Exercise IV.5.
Finally, we remark (and will develop this below in our treatment of linear codes) that DT bound
and Shannon’s bound both hold without change if we generate {ci } by any other (non-iid) pro-
cedure with a prescribed marginal and pairwise independent codewords – see Theorem 18.13
below.
Theorem 18.7 (Feinstein’s lemma) Fix a channel PY|X and an arbitrary input distribution
PX . Then for every γ > 0 and for every ϵ ∈ (0, 1) there exists an (M, ϵ)max -code with
Remark 18.4 (Comparison with Shannon’s bound) We can also interpret (18.12) differ-
ently: for any fixed M, there exists an (M, ϵ)max -code that achieves the maximal error probability
bounded as follows:
M
ϵ ≤ P[i(X; Y) < log γ] +
γ
2
Nevertheless, we should point out that this is not a serious advantage: from any (M, ϵ) code we can extract an
(M′ , ϵ′ )max -subcode with a smaller M′ and larger ϵ′ – see Theorem 19.4.
i i
i i
i i
358
If we take log γ = log M + τ , this gives the bound of exactly the same form as Shannon’s (18.9). It
is rather surprising that two such different methods of proof produced essentially the same bound
(modulo the difference between maximal and average probability of error). We will discuss the
reason for this phenomenon in Section 18.7.
Proof. From the definition of (M, ϵ)max -code, we recall that our goal is to find codewords
c1 , . . . , cM ∈ X and disjoint subsets (decoding regions) D1 , . . . , DM ⊂ Y , s.t.
Ex ≜ {y ∈ Y : i(x; y) ≥ log γ}
Notice that the preliminary decoding regions {Ex } may be overlapping, and we will trim them
into final decoding regions {Dx }, which will be disjoint. Next, we apply Corollary 18.4 and find
out that there is a set F ⊂ X with two properties: a) PX [F] = 1 and b) for every x ∈ F we have
1
PY (Ex ) ≤ . (18.13)
γ
We can assume that P[i(X; Y) < log γ] ≤ ϵ, for otherwise the RHS of (18.12) is negative and
there is nothing to prove. We first claim that there exists some c ∈ F such that P[Y ∈ Ec |X =
c] = PY|X (Ec |c) ≥ 1 − ϵ. Indeed, assume (for the sake of contradiction) that ∀c ∈ F, P[i(c; Y) ≥
log γ|X = c] < 1 − ϵ. Note that since PX (F) = 1 we can average this inequality over c ∼ PX . Then
we arrive at P[i(X; Y) ≥ log γ] < 1 − ϵ, which is a contradiction.
With these preparations we construct the codebook in the following way:
1 Pick c1 to be any codeword in F such that PY|X (Ec1 |c1 ) ≥ 1 − ϵ, and set D1 = Ec1 ;
2 Pick c2 to be any codeword in F such that PY|X (Ec2 \D1 |c2 ) ≥ 1 − ϵ, and set D2 = Ec2 \D1 ;
...
−1
3 Pick cM to be any codeword in F such that PY|X (EcM \ ∪M j=1 Dj |cM ] ≥ 1 − ϵ, and set DM =
M− 1
EcM \ ∪j=1 Dj .
We stop if cM+1 codeword satisfying the requirement cannot be found. Thus, M is determined by
the stopping condition:
∀c ∈ F, PY|X (Ec \ ∪M
j=1 Dj |c) < 1 − ϵ
Averaging the stopping condition over c ∼ PX (which is permissible due to PX (F) = 1), we
obtain
[
M
P i(X; Y) ≥ log γ and Y 6∈ Dj < 1 − ϵ,
j=1
i i
i i
i i
or, equivalently,
[
M
ϵ < P i(X; Y) < log γ or Y ∈ Dj .
j=1
X
M
≤ P[i(X; Y) < log γ] + PY (Ecj )
j=1
M
≤ P[i(X; Y) < log γ] +
γ
where the last step makes use of (18.13).Evidently, this completes the proof.
Theorem 18.8 (RCU bound) Fix a channel PY|X and an arbitrary input distribution PX . Then
for every integer M ≥ 1 there exists an (M, ϵ)-code with
ϵ ≤ E min 1, (M − 1)P i(X̄; Y) ≥ i(X; Y) X, Y , (18.14)
where the joint distribution of (X, X̄, Y) is as in (18.1).
Proof. For a given codebook (c1 , . . . cM ) the average probability of error for the maximum
likelihood decoder, cf. (18.8), is upper bounded by
1 X [
M M
ϵ≤ P {i(cj ; Y) ≥ i(cm ; Y)} |X = cm .
M
m=1 j=1;j̸=m
Note that we do not necessarily have equality here, since the maximum likelihood decoder will
resolves ties (i.e. the cases when multiple codewords maximize information density) in favor of
the correct codeword, whereas in the expression above we pessimistically assume that all ties are
resolved incorrectly. Now, similar to Shannon’s bound in Theorem 18.5 we prove existence of a
i.i.d.
good code by averaging the last expression over cj ∼ PX .
i i
i i
i i
360
To that end, notice that expectations of each term in the sum coincide (by symmetry). To evalu-
ate this expectation, let us take m = 1 condition on W = 1 and observe that under this conditioning
we have
Y
M
(c1 , Y, c2 , . . . , cM ) ∼ PX,Y PX .
j=2
j=2
(b)
≤ E min{1, (M − 1)P i(X̄; Y) ≥ i(X; Y) X, Y }
where (a) is just expressing the probability by first conditioning on the values of (c1 , Y); and (b)
corresponds to applying the union bound but capping the result by 1. This completes the proof
of the bound. We note that the step (b) is the essence of the RCU bound and corresponds to the
self-evident fact that for any collection of events Ej we have
X
P[∪Ej ] ≤ min{1, P[Ej ]} .
j
What is makes its application clever is that we first conditioned on (c1 , Y). If we applied the union
bound right from the start without conditioning, the resulting estimate on ϵ would have been much
weaker (in particular, would not have lead to achieving capacity).
It turns out that Shannon’s bound Theorem 18.5 is just a weakening of (18.14) obtained by
splitting the expectation according to whether or not i(X; Y) ≤ log M + τ and upper bounding
min{x, 1} by 1 when i(X; Y) ≤ log M + τ and by x otherwise. Another such weakening is a
famous Gallager’s bound [176], which in fact gives tight estimate of the exponent in the decay of
error probability over memoryless channels (Section 22.4*).
Theorem 18.9 (Gallager’s bound) Fix a channel PY|X , an arbitrary input distribution PX
and ρ ∈ [0, 1]. Then there exists an (M, ϵ) code such that
" 1+ρ #
i ( X̄; Y )
ϵ ≤ Mρ E E exp Y (18.15)
1+ρ
i i
i i
i i
Proof. We first notice that by Proposition 18.3 applied with f+ (x, y) = exp{ i1(+ρ
x;y)
} and
interchanged X and Y we have for PY -almost every y
ρ 1 1
E[exp{−i(X; Y) }|Y = y] = E[exp{i(X; Ȳ) }|Ȳ = y] = E[exp{i(X̄; Y) }|Y = y] ,
1+ρ 1+ρ 1+ρ
(18.16)
d
where we also used the fact that (X, Ȳ) = (X̄, Y) under (18.1).
Now, consider the bound (18.14) and replace the min via the bound
min{t, 1} ≤ tρ ∀t ≥ 0 . (18.17)
this results in
ϵ ≤ Mρ E P[i(X̄; Y) > i(X; Y)|X, Y]ρ . (18.18)
The key innovation of Gallager, namely the step (18.17), which became known as the ρ-trick,
corresponds to the following version of the union bound: For any events Ej and 0 ≤ ρ ≤ 1 we
have
ρ
X X
P[∪Ej ] ≤ min 1, P [ Ej ] ≤ P[Ej ] .
j j
Now to understand properly the significance of Gallager’s bound we need to first define the concept
of the memoryless channels (see (19.1) below). For such channels and using the iid inputs, the
expression (18.15) turns, after optimization over ρ, into
ϵ ≤ exp{−nEr (R)} ,
where R = logn M is the rate and Er (R) is the Gallager’s random coding exponent. This shows that
not only the error probability at a fixed rate can be made to vanish, but in fact it can be made to
vanish exponentially fast in the blocklength. We will discuss such exponential estimates in more
detail in Section 22.4*.
i i
i i
i i
362
Definition 18.10 (Linear codes) Let Fq denote the finite field of cardinality q (cf. Defini-
tion 11.7). Let the input and output space of the channel be X = Y = Fnq . We say a codebook
C = {cu : u ∈ Fkq } of size M = qk is a linear code if C is a k-dimensional linear subspace of Fnq .
• Generator matrix G ∈ Fkq×n , so that the codeword for each u ∈ Fkq is given by cu = uG
(row-vector convention) and the codebook C is the row-span of G, denoted by Im(G);
(n−k)×n
• Parity-check matrix H ∈ Fq , so that each codeword c ∈ C satisfies Hc⊤ = 0. Thus C is
the nullspace of H, denoted by Ker(H). We have HG⊤ = 0.
Example 18.1 (Hamming code) The [7, 4, 3]2 Hamming code over F2 is a linear code with
the following generator and parity check matrices:
1 0 0 0 1 1 0
0 1 1 0 1 1 0 0
1 0 0 1 0 1
G=
0 , H= 1 0 1 1 0 1 0
0 1 0 0 1 1
0 1 1 1 0 0 1
0 0 0 1 1 1 1
In particular, G and H are of the form G = [I; P] and H = [−P⊤ ; I] (systematic codes) so that
HG⊤ = 0. The following picture helps to visualize the parity check operation:
x5
x2 x1
x4
x7 x6
x3
i i
i i
i i
Note that all four bits in each circle (corresponding to a row of H) sum up to zero. One can verify
that the minimum distance of this code is 3 bits. As such, it can correct 1 bit of error and detect 2
bits of error.
Linear codes are almost always examined with channels of additive noise, a precise definition
of which is given below:
Definition 18.11 (Additive noise) A channel PY|X with input and output space Fnq is called
additive-noise if
PY|X (y|x) = PZ (y − x)
for some random vector Z taking values in Fnq . In other words, Y = X + Z, where Z ⊥
⊥ X.
Given a linear code and an additive-noise channel PY|X , it turns out that there is a special
“syndrome decoder” that is optimal.
Theorem 18.12 Any [k, n]Fq linear code over an additive-noise PY|X has a maximum likelihood
(ML) decoder g : Fnq → Fkq such that:
1 g(y) = y − gsynd (Hy⊤ ), i.e., the decoder is a function of the “syndrome” Hy⊤ only. Here gsynd :
Fnq−k → Fnq , defined by gsynd (s) ≜ argmaxz:Hz⊤ =s PZ (z), is called the “syndrome decoder”,
which decodes the most likely realization of the noise.
2 (Geometric uniformity) Decoding regions are translates of D0 = Im(gsynd ): Du = cu + D0 for
any u ∈ Fkq .
3 Pe,max = Pe .
In other words, syndrome is a sufficient statistic (Definition 3.8) for decoding a linear code.
Proof. 1 The maximum likelihood decoder for a linear code is
g(y) = argmax PY|X (y|c) = argmax PZ (y − c) = y − argmax PZ (z) = y − gsynd (Hy⊤ ).
c∈C c:Hc⊤ =0 z:Hz⊤ =Hy⊤
Remark 18.5 (BSC example) As a concrete example, consider the binary symmetric channel
BSC⊗ n
δ previously considered in Example 17.1 and Section 17.2. This is an additive-noise channel
i.i.d.
over Fn2 , where Y = X + Z and Z = (Z1 , . . . , Zn ) ∼ Ber(δ). Assuming δ < 1/2, the syndrome
i i
i i
i i
364
decoder aims to find the noise realization with the fewest number of flips that is compatible with
the received codeword, namely gsynd (s) = argminz:Hz⊤ =s wH (z), where wH denotes the Hamming
weight. In this case elements of the image of gsynd , which we denoted by D0 , are known as “minimal
weight coset leaders”. Counting how many of them occur at each Hamming weight is a difficult
open problem even for the most well-studied codes such as Reed-Muller ones. In Hamming space
D0 looks like a Voronoi region of a lattice and Du ’s constitute a Voronoi tesselation of Fnq .
Overwhelming majority of practically used codes are in fact linear codes. Early in the history
of coding, linearity was viewed as a way towards fast and low-complexity encoding (just binary
matrix multiplication) and slightly lower complexity of the maximum-likelihood decoding (via the
syndrome decoder). As codes became longer and longer, though, the syndrome decoding became
impractical and today only those codes are used in practice for which there are fast and low-
complexity (suboptimal) decoders.
Theorem 18.13 (DT bound for linear codes) Let PY|X be an additive noise channel over
Fnq . For all integers k ≥ 1 there exists a linear code f : Fkq → Fnq with error probability:
+
− n−k−logq 1
Pe,max = Pe ≤ E q .
P Z ( Z)
(18.19)
Remark 18.6 The bound above is the same as Theorem 18.6 evaluated with PX = Unif(Fnq ).
The analogy between Theorems 18.6 and 18.13 is the same as that between Theorems 11.5 and
11.8 (full random coding vs random linear codes).
Proof. Recall that in proving the DT bound (Theorem 18.6), we selected the codewords
i.i.d.
c1 , . . . , cM ∼ PX and showed that
M−1
E[Pe (c1 , . . . , cM )] ≤ P[i(X; Y) ≤ γ] + P[i(X; Y) ≥ γ]
2
Here we will adopt the same approach and take PX = Unif(Fnq ) and M = qk .
By Theorem 18.12 the optimal decoding regions are translational invariant, i.e. Du = cu +
D0 , ∀u, and therefore:
cu = uG + h, ∀u ∈ Fkq
where random G and h are drawn as follows: the k × n entries of G and the 1 × n entries
of h are i.i.d. uniform over Fq . We add the dithering to eliminate the special role that the
all-zero codeword plays (since it is contained in any linear codebook).
i i
i i
i i
Step 2: We claim that the codewords are pairwise independent and uniform, i.e. ∀u 6= u′ ,
(cu , cu′ ) ∼ (X, X), where PX,X (x, x) = 1/q2n . To see this note that
cu ∼ uniform on Fnq
cu′ = u′ G + h = uG + h + (u′ − u)G = cu + (u′ − u)G
Step 5: Remove dithering h. We claim that there exists a linear code without dithering such
that (18.20) is satisfied. The intuition is that shifting a codebook has no effect on its
performance. Indeed,
• Before, with dithering, the encoder maps u to uG + h, the channel adds noise to produce
Y = uG + h + Z, and the decoder g outputs g(Y).
• Now, without dithering, we encode u to uG, the channel adds noise to produce Y =
uG + Z, and we apply decode g′ defined by g′ (Y) = g(Y + h).
By doing so, we “simulate” dithering at the decoder end and the probability of error
remains the same as before. Note that this is possible thanks to the additivity of the noisy
channel.
We see that random coding can be done with different ensembles of codebooks. For example,
we have
i.i.d.
• Shannon ensemble: C = {c1 , . . . , cM } ∼ PX – fully random ensemble.
• Elias ensemble [150]: C = {uG : u ∈ Fkq }, with the k × n generator matrix G drawn uniformly
at random from the set of all matrices. (This ensemble is used in the proof of Theorem 18.13.)
• Gallager ensemble: C = {c : Hc⊤ = 0}, with the (n − k) × n parity-check matrix H drawn
uniformly at random. Note this is not the same as the Elias ensemble.
i i
i i
i i
366
• One issue with Elias ensemble is that with some non-zero probability G may fail to be full rank.
(It is a good exercise to find P [rank(G) < k] as a function of n, k, q.) If G is not full rank, then
there are two identical codewords and hence Pe,max ≥ 1/2. To fix this issue, one may let the
generator matrix G be uniform on the set of all k × n matrices of full (row) rank.
• Similarly, we may modify Gallager’s ensemble by taking the parity-check matrix H to be
uniform on all n × (n − k) full rank matrices.
For the modified Elias and Gallager’s ensembles, we could still do the analysis of random coding.
A small modification would be to note that this time (X, X̄) would have distribution
1
PX,X̄ = 1 {X 6= X′ }
q2n − qn
uniform on all pairs of distinct codewords and are not pairwise independent.
Finally, we note that the Elias ensemble with dithering, cu = uG + h, has pairwise independence
property and its joint entropy H(c1 , . . . , cM ) = H(G) + H(h) = (nk + n) log q. This is significantly
smaller than for Shannon’s fully random ensemble that we used in Theorem 18.5. Indeed, when
i.i.d.
cj ∼ Unif(Fnq ) we have H(c1 , . . . , cM ) = qk n log q. An interesting question, thus, is to find
min H(c1 , . . . , cM )
where the minimum is over all distributions with P[ci = a, cj = b] = q−2n when i 6= j (pairwise
independent, uniform codewords). Note that H(c1 , . . . , cM ) ≥ H(c1 , c2 ) = 2n log q. Similarly, we
may ask for (ci , cj ) to be uniform over all pairs of distinct elements. In this case, the Wozencraft
ensemble (see Exercise IV.7) for the case of n = 2k achieves H(c1 , . . . , cqk ) ≈ 2n log q, which is
essentially our lower bound.
In short, we will see that the answer is that both of these methods are well-known to be (almost)
optimal for submodular function maximization, and this is exactly what channel coding is about.
i i
i i
i i
Before proceeding, we notice that in the second question it is important to qualify that PX
is simple, since taking PX to be supported on the optimal M∗ (ϵ)-achieving codebook would of
course result in very good performance. However, instead we will see that choosing rather simple
PX already achieves a rather good lower bound on M∗ (ϵ). More explicitly, by simple we mean a
product distribution for the memoryless channel. Or, as an even better example to have in mind,
consider an additive-noise vector channel:
Yn = Xn + Zn
with addition over a product abelian group and arbitrary (even non-memoryless) noise Zn . In this
case the choice of uniform PX in random coding bound works, and is definitely “simple”.
The key observation of [36] is submodularity of the function mapping a codebook C ⊂ X to
the |C|(1 − Pe,MAP (C)), where Pe,MAP (C) is the probability of error under the MAP decoder (17.5).
(Recall (1.8) for the definition of submodularity.) More explicitly, consider a discrete Y and define
X
S(C) ≜ max PY|X (y|x) , S(∅) = 0
x∈C
y∈Y
and set
Ct+1 = Ct ∪ {xt+1 } .
In other words, the probability of successful (MAP) decoding for the greedily constructed code-
book is at most a factor (1 − 1/e) away from the largest possible probability of success among all
codebooks of the same cardinality. Since we are mostly interested in success probabilities very
close to 1, this result may not appear very exciting. However, a small modification of the argument
yields the following (see [257, Theorem 1.5] for the proof):
i i
i i
i i
368
Theorem 18.14 ([313]) For any non-negative submodular set-function f and a greedy
sequence Ct we have for all ℓ, t:
Applying this to the special case of f(·) = S(·) we obtain the result of [36]: The greedily
constructed codebook C ′ with M′ codewords satisfies
M ′
1 − Pe,MAP (C ′ ) ≥ (1 − e−M /M )(1 − ϵ∗ (M)) .
M′
In particular, the greedily constructed code with M′ = M2−10 achieves probability of success that
is ≥ 0.9995(1 −ϵ∗ (M)). In other words, compared to the best possible code a greedy code carrying
10 bits fewer of data suffers at most 5 · 10−4 worse probability of error. This is a very compelling
evidence for why greedy construction is so good. We do note, however, that Feinstein’s bound
does greedy construction not with the MAP decoder, but with a suboptimal one.
Next we address the question of random coding. Recall that our goal is to explain how can
selecting codewords uniformly at random from a “simple” distribution PX be any good. The key
idea is again contained in [313]. The set-function S(C) can also be understood as a function with
domain {0, 1}|X | . Here is a natural extension of this function to the entire solid hypercube [0, 1]|X | :
X X
SLP (π ) = sup{ PY|X (y|x)rx,y : 0 ≤ rx,y ≤ π x , rx,y ≤ 1} . (18.21)
x, y x
Indeed, it is easy to see that SLP (1C ) = S(C) and that SLP is a concave function.3
Since SLP is an extension of S it is clear that
X
S∗ (M) ≤ S∗LP (M) ≜ max{SLP (π ) : 0 ≤ π x ≤ 1, π x ≤ M} . (18.22)
x
In fact, we will see later in Section 22.3 that this bound coincides with the bound known as
meta-converse. Surprisingly, [313] showed that the greedy construction not only achieves a large
multiple of S∗ (M) but also of S∗LP (M):
3
There are a number of standard extensions of a submodular function f to a hypercube. The largest convex interpolant f+ ,
also known as Lovász extension, the least concave interpolant f− , and multi-linear extension [80]. However, SLP does not
coincide with any of these and in particular strictly larger (in general) than f− .
i i
i i
i i
To connect to the concept of random coding, though, we need the following result of [36]:4
P
Theorem 18.15 Fix π ∈ [0, 1]|X | and let M = x∈X π x . Let C = {c1 , . . . , cM′ } with
i.i.d.
cj ∼ PX (x) = πx
M. Then we have
′
E[S(C)] ≥ (1 − e−M /M )SLP (π ) .
The proof of this result trivially follows from applying the following lemma with g(x) =
PY|X (y|x), summing over y and recalling the definition of SLP in (18.21).
Proof. Without loss of generality we take X = [m] and g(1) ≥ g(2) ≥ · · · ≥ g(m) ≥ g(m + 1) ≜
′ ′
0. Denote for convenience a = 1 − (1 − M1 )M ≥ 1 − e−M /M , b(j) ≜ P[{1, . . . , j} ∩ C 6= ∅]. Then
P[max g(x) = g(j)] = b(j) − b(j − 1) ,
x∈C
b(j) ≥ ac(j) .
Plugging this into (18.24) we conclude the proof by noticing that rj = c(j) − c(j − 1) attains the
maximum in the definition of T(π , g).
Theorem 18.15 completes this section’s goal and shows that the random coding (as well as the
greedy/maximal coding) attains an almost optimal value of S∗ (M). Notice also that the random
coding distribution that we should be using is the one that attains the definition of S∗LP (M). For input
symmetric channels (such as additive noise ones) it is easy to show that the optimal π ∈ [0, 1]X is
a constant vector, and hence the codewords are to be generated iid uniformly on X .
4
There are other ways of doing “random coding” to produce an integer solution from a fractional one. For example, see the
multi-linear extension based one in [80].
i i
i i
i i
19 Channel capacity
In this chapter we apply methods developed in the previous chapters (namely the weak converse
and the random/maximal coding achievability) to compute the channel capacity. This latter notion
quantifies the maximal amount of (data) bits that can be reliably communicated per single channel
use in the limit of using the channel many times. Formalizing the latter statement will require
introducing the concept of a communication channel. Then for special kinds of channels (the
memoryless and the information stable ones) we will show that computing the channel capacity
reduces to maximizing the (sequence of the) mutual informations. This result, known as Shannon’s
noisy channel coding theorem, is the third example of a coding theorem in this book. It connects the
value of an operationally defined (discrete, combinatorial) optimization problem over codebooks
to that of a (convex) optimization problem over information measures. It builds a bridge between
the abstraction of information measures (Part I) and a practical engineering problem of channel
coding.
Information theory as a subject is sometimes accused of “asymptopia”, or the obsession with
asymptotic results and computing various limits. Although in this book we attempt to refrain from
asymptopia, the topic of this chapter requires committing this sin ipso facto. After proving capacity
theorems in various settings, we conclude the Chapter with Shannon’s separation theorem, that
shows that any (stationary ergodic) source can be communicated over an (information stable)
channel if and only its entropy rate is smaller than the channel capacity. Furthermore, doing so
can be done by first compressing a source to pure bits and then using channel code to match those
bits to channel inputs. The fact that no performance is lost in the process of this conversion to bits
had important historical consequence in cementing bits as the universal currency of the digital
age.
370
i i
i i
i i
Definition 19.1 Fix an input alphabet A and an output alphabet B. A sequence of Markov
kernels PYn |Xn : An → B n indexed by the integer n = 1, 2 . . . is called a channel. The length of
the input n is known as blocklength.
To give this abstract notion more concrete form one should recall Section 17.2, in which we
described the BSC channel. Note, however, that despite this definition, it is customary to use the
term channel to refer to a single Markov kernel (as we did before in this book). An even worse,
yet popular, abuse of terminology is to refer to n-th element of the sequence, the kernel PYn |Xn , as
the n-letter channel.
Although we have not imposed any requirements on the sequence of kernels PYn |Xn , one is never
interested in channels at this level of generality. Most of the time the elements of the channel input
Xn = (X1 , . . . , Xn ) are thought as indexed by time. That is the Xt corresponds to the letter that is
transmitted at time t inside an overall block of length n, while Yt is the letter received at time t.
The channel’s action is that of “adding noise” to Xt and outputting Yt . However, the generality of
the previous definition allows to model situations where the channel has internal state, so that the
amount and type of noise added to Xt depends on the previous inputs and in principle even on the
future inputs. The interpretation of t as time, however, is not exclusive. In storage (magnetic, non-
volatile or flash) t indexes space. In those applications, the noise may have a rather complicated
structure with transformation Xt → Yt depending on both the “past” X<t and the “future” X>t .
Almost all channels of interest satisfy one or more of the restrictions that we list next:
• A channel is called non-anticipatory if it has the following extension property. Under the n-letter
kernel PYn |Xn , the conditional distribution of the first k output symbols Yk only depends on Xk
(and not on Xnk+1 ) and coincides with the kernel PYk |Xk (the k-th element of the channel sequence)
the k-th channel transition kernel in the sequence. This requirement models the scenario wherein
channel outputs depend causally on the inputs.
• A channel is discrete if A and B are finite.
• A channel is additive-noise if A = B are abelian group and Yn = Xn + Zn for some Zn
independent of Xn (see Definition 18.11). Thus
Y
n
PYn |Xn = PYk |Xk . (19.1)
k=1
where each PYk |Xk : A → B ; in particular, PYn |Xn are compatible at different blocklengths n.
• A channel is stationary memoryless if (19.1) is satisfied with PYk |Xk not depending on k, denoted
commonly by PY|X . In other words,
i i
i i
i i
372
δ̄
1 1
δ [ ]
δ̄ δ
BSCδ
δ δ δ̄
0 0
δ̄
δ̄
1 1
δ
[ ]
? δ̄ δ 0
BECδ
δ 0 δ δ̄
0 0
δ̄
1 1
δ
[ ]
1 0
Z-channel
0 0 δ δ̄
δ̄
The interpretation is that each coordinate of the transmitted codeword Xn is corrupted by noise
independently with the same noise statistic.
• Discrete memoryless stationary channel (DMC): A DMC is a channel that is both discrete and
stationary memoryless. It can be specified in two ways:
– an |A| × |B|-dimensional (row-stochastic) matrix PY|X where elements specify the transition
probabilities;
– a bipartite graph with edge weight specifying the non-zero transition probabilities.
Table 19.1 lists some common binary-input binary-output DMCs: the binary symmetric channel
(BSC), the binary symmetric channel (BEC), and the Z-channel.
As another example, let us recall the AWGN channel in Example 3.3: the alphabets A = B =
R and Yn = Xn + Zn , with Xn ⊥ ⊥ Zn ∼ N (0, σ 2 In ). This channel is a non-discrete, stationary
memoryless, additive-noise channel.
Having defined the notion of the channel, we can define next the operational problem that the
communication engineer faces when tasked with establishing a data link across the channel. Since
the channel is noisy, the data is not going to pass unperturbed and the error correcting codes are
i i
i i
i i
naturally to be employed. To send one of M = 2k messages (or k data bits) with low probabil-
ity of error, it is often desirable to use the shortest possible length of the input sequence. This
desire explains the following definitions, which extend the fundamental limits in Definition 17.2
to involve the blocklength n.
• An (n, M, ϵ)-code is an (M, ϵ)-code for PYn |Xn , consisting of an encoder f : [M] → An and a
decoder g : B n → [M] ∪ {e}.
• An (n, M, ϵ)max -code is analogously defined for the maximum probability of error.
We will mostly focus on understanding M∗ (n, ϵ) and a relate quantity known as rate. Recall that
blocklength n measures the amount of time or space resource used by the code. Thus, it is natural
to maximize the ratio of the data transmitted to the resource used, and that leads us to the notion of
log M
the transmission rate defined as R = n2 and equal to the number of bits transmitted per channel
use. Consequently, instead of studying M∗ (n, ϵ) one is lead to the study of 1n log M∗ (n, ϵ). A natural
first question is to determine the first-order asymptotics of this quantity and this motivates the final
definition of the Section.
Definition 19.3 (Channel capacity) The ϵ-capacity Cϵ and the Shannon capacity C are
defined as follows
1
Cϵ ≜ lim inf log M∗ (n, ϵ);
n→∞ n
C = lim Cϵ .
ϵ→0+
Channel capacity is measured in information units per channel use, e.g. “bit/ch.use”.
The operational meaning of Cϵ (resp. C) is the maximum achievable rate at which one can
communicate through a noisy channel with probability of error at most ϵ (resp. o(1)). In other
words, for any R < C, there exists an (n, exp(nR), ϵn )-code, such that ϵn → 0. In this vein, Cϵ and
C can be equivalently defined as follows:
i i
i i
i i
374
The reason that capacity is defined as a large-n limit (as opposed to a supremum over n) is because
we are concerned with rate limit of transmitting large amounts of data without errors (such as in
communication and storage).
The case of zero-error (ϵ = 0) is so different from ϵ > 0 that the topic of ϵ = 0 constitutes a
separate subfield of its own (cf. the survey [252]). Introduced by Shannon in 1956 [379], the value
1
C0 ≜ lim inf log M∗ (n, 0) (19.6)
n→∞ n
is known as the zero-error capacity and represents the maximal achievable rate with no error
whatsoever. Characterizing the value of C0 is often a hard combinatorial problem. However, for
many practically relevant channels it is quite trivial to show C0 = 0. This is the case, for example,
for the DMCs we considered before: the BSC or BEC. Indeed, for them we have log M∗ (n, 0) = 0
for all n, meaning transmitting any amount of information across these channels requires accepting
some (perhaps vanishingly small) probability of error. Nevertheless, there are certain interesting
and important channels for which C0 is positive, cf. Section 23.3.1 for more.
As a function of ϵ the Cϵ could (most generally) behave like the plot below on the left-hand
side below. It may have a discontinuity at ϵ = 0 and may be monotonically increasing (possibly
even with jump discontinuities) in ϵ. Typically, however, Cϵ is zero at ϵ = 0 and stays constant for
all 0 < ϵ < 1 and, hence, coincides with C (see the plot on the right-hand side). In such cases we
say that the strong converse holds (more on this later in Section 22.1).
Cǫ Cǫ
strong converse
holds
Zero error b
C0
Capacity
ǫ ǫ
0 1 0 1
In Definition 19.3, the capacities Cϵ and C are defined with respect to the average probabil-
ity of error. By replacing M∗ with M∗max , we can define, analogously, the capacities Cϵ
(max)
and
(max)
C with respect to the maximal probability of error. It turns out that these two definitions are
equivalent, as the next theorem shows.
Proof. The second inequality is obvious, since any code that achieves a maximum error
probability ϵ also achieves an average error probability of ϵ.
i i
i i
i i
For the first inequality, take an (n, M, ϵ(1 − τ ))-code, and define the error probability for the jth
codeword as
λj ≜ P[Ŵ 6= j|W = j]
Then
X X X
M(1 − τ )ϵ ≥ λj = λj 1 {λj ≤ ϵ} + λj 1 {λj > ϵ} ≥ ϵ|{j : λj > ϵ}|.
Hence |{j : λj > ϵ}| ≤ (1 − τ )M. (Note that this is exactly Markov inequality.) Now by removing
those codewords1 whose λj exceeds ϵ, we can extract an (n, τ M, ϵ)max -code. Finally, take M =
M∗ (n, ϵ(1 − τ )) to finish the proof.
Corollary 19.5 (Capacity under maximal probability of error) C(ϵmax) = Cϵ for all
ϵ > 0 such that Cϵ = Cϵ− . In particular, C(max) = C.
Proof. Using the definition of M∗ and the previous theorem, for any fixed τ > 0
1
Cϵ ≥ C(ϵmax) ≥ lim inf log τ M∗ (n, ϵ(1 − τ )) ≥ Cϵ(1−τ )
n→∞ n
(max)
Sending τ → 0 yields Cϵ ≥ Cϵ ≥ Cϵ− .
Note that information capacity C(I) so defined is not the same as the Shannon capacity C in Def-
inition 19.3; as such, from first principles it has no direct interpretation as an operational quantity
related to coding. Nevertheless, they are related by the following coding theorems. We start with
a converse result:
C(I)
Theorem 19.7 (Upper Bound for Cϵ ) For any channel, ∀ϵ ∈ [0, 1), Cϵ ≤ 1−ϵ and C ≤ C(I) .
1
This operation is usually referred to as expurgation which yields a smaller code by killing part of the codebook to reach a
desired property.
i i
i i
i i
376
Proof. Applying the general weak converse bound in Theorem 17.3 to PYn |Xn yields
supPXn I(Xn ; Yn ) + h(ϵ)
log M∗ (n, ϵ) ≤
1−ϵ
Normalizing this by n and taking the lim inf as n → ∞, we have
1 1 supPXn I(Xn ; Yn ) + h(ϵ) C(I)
Cϵ = lim inf log M∗ (n, ϵ) ≤ lim inf = .
n→∞ n n→∞ n 1−ϵ 1−ϵ
X
n
dPX,Y Xn
i(Xn ; Yn ) = log (Xk , Yk ) = i(Xk ; Yk ),
dPX PY
k=1 k=1
n n n n
where i(x; y) = iPX,Y (x; y) and i(x ; y ) = iPXn ,Yn (x ; y ). What is important is that under PXn ,Yn the
random variable i(Xn ; Yn ) is a sum of iid random variables with mean I(X; Y). Thus, by the weak
law of large numbers we have
P[i(Xn ; Yn ) < n(I(X; Y) − δ)] → 0
for any δ > 0.
With this in mind, let us set log M = n(I(X; Y) − 2δ) for some δ > 0, and take τ = δ n in
Shannon’s bound. Then for the error bound we have
" n #
X n→∞
ϵn ≤ P i(Xk ; Yk ) ≤ nI(X; Y) − δ n + exp(−δ n) −−−→ 0, (19.7)
k=1
Since the bound converges to 0, we have shown that there exists a sequence of (n, Mn , ϵn )-codes
with ϵn → 0 and log Mn = n(I(X; Y) − 2δ). Hence, for all n such that ϵn ≤ ϵ
log M∗ (n, ϵ) ≥ n(I(X; Y) − 2δ)
And so
1
Cϵ = lim inf log M∗ (n, ϵ) ≥ I(X; Y) − 2δ
n→∞
n
Since this holds for all PX and all δ > 0, we conclude Cϵ ≥ supPX I(X; Y).
i i
i i
i i
The following result follows from pairing the upper and lower bounds on Cϵ .
Theorem 19.9 (Shannon’s channel coding theorem [378]) For a stationary memory-
less channel,
C = C(I) = sup I(X; Y). (19.8)
PX
As we mentioned several times already this result is among the most significant results in
information theory. From the engineering point of view, the major surprise was that C > 0,
i.e. communication over a channel is possible with strictly positive rate for any arbitrarily small
probability of error. The way to achieve this is to encode the input data jointly (i.e. over many
input bits together). This is drastically different from the pre-1948 methods, which operated on
a letter-by-letter bases (such as Morse code). This theoretical result gave impetus (and still gives
guidance) to the evolution of practical communication systems – quite a rare achievement for an
asymptotic mathematical fact.
Proof. Statement (19.8) contains two equalities. The first one follows automatically from the
second and Theorems 19.7 and 19.8. To show the second equality C(I) = supPX I(X; Y), we note
that for stationary memoryless channels C(I) is in fact easy to compute. Indeed, rather than solving
a sequence of optimization problems (one for each n) and taking the limit of n → ∞, memory-
lessness of the channel implies that only the n = 1 problem needs to be solved. This type of result
is known as single-letterization (or tensorization) in information theory and we show it formally
in the following proposition, which concludes the proof.
Q
Proof. Recall that from Theorem 6.1 we know that for product kernels PYn |Xn = PYi |Xi , mutual
P n
information satisfies I(Xn ; Yn ) ≤ k=1 I(Xk ; Yk ) with equality whenever Xi ’s are independent.
Then
1
C(I) = lim inf sup I(Xn ; Yn ) = lim inf sup I(X; Y) = sup I(X; Y).
n→∞ n P n n→∞ PX PX
X
Shannon’s noisy channel theorem shows that by employing codes of large blocklength, we can
approach the channel capacity arbitrarily close. Given the asymptotic nature of this result (or any
i i
i i
i i
378
other asymptotic result), a natural question is understanding the price to pay for reaching capacity.
This can be understood in two ways:
The main tool in the proof of Theorem 19.8 was the law of large numbers. The lower bound
Cϵ ≥ C(I) in Theorem 19.8 shows that log M∗ (n, ϵ) ≥ nC + o(n) (this just restates the fact
that normalizing by n and taking the lim inf must result in something ≥ C). If instead we apply
a more careful analysis using the central limit theorem (CLT), we obtain the following refined
achievability result.
where Q(·) is the complementary Gaussian CDF and Q−1 (·) is its functional inverse.
Proof. Writing the little-o notation in terms of lim inf, our goal is
log M∗ (n, ϵ) − nC
lim inf √ ≥ −Q−1 (ϵ) = Φ−1 (ϵ),
n→∞ nV
where Φ(t) is the standard normal CDF.
Recall Feinstein’s bound
i i
i i
i i
√
Take log β = nC + nVt, then applying the CLT gives
√ hX √ i
log M ≥ nC + nVt + log ϵ − P i(Xk ; Yk ) ≤ nC + nVt
√
=⇒ log M ≥ nC + nVt + log (ϵ − Φ(t)) ∀Φ(t) < ϵ
log M − nC log(ϵ − Φ(t))
=⇒ √ ≥t+ √ ,
nV nV
where Φ(t) is the standard normal CDF. Taking the liminf of both sides
log M∗ (n, ϵ) − nC
lim inf √ ≥ t,
n→∞ nV
for all t such that Φ(t) < ϵ. Finally, taking t % Φ−1 (ϵ), and writing the liminf in little-oh notation
completes the proof
√ √
log M∗ (n, ϵ) ≥ nC − nVQ−1 (ϵ) + o( n).
Remark 19.1 Theorem 19.9 implies that for any R < C, there exists a sequence of
(n, exp(nR), ϵn )-codes such that the probability of error ϵn vanishes as n → ∞. Examining the
upper bound (19.7), we see that the probability of error actually vanishes exponentially fast, since
the event in the first term is of large-deviations type (recall Chapter 15) so that both terms are
exponentially small. Finding the value of the optimal exponent (or even the existence thereof) has
a long history (but remains generally open) in information theory, see Section 22.4*. Recently,
however, it was understood that a practically more relevant, and also much easier to analyze, is
the regime of fixed (non-vanishing) error ϵ, in which case the main question is to bound the speed
of convergence of R → Cϵ = C. Previous theorem shows one bound on this speed of convergence.
The optimal √1n coefficient is known as channel dispersion, see Sections 22.5 and 22.6 for more.
√
In particular, we will show that the bound on the n term in Theorem 19.11 is often tight.
Y = X + Z mod 2, Z ∼ Ber(δ) ⊥
⊥ X.
i i
i i
i i
380
C
C C
1 bit
1 bit 1 bit
δ
0 1 1 δ δ
2 0 1 0 1
BSCδ BECδ Z-channel
More generally, recalling Example 3.7, for any additive-noise channel over a finite abelian
group G, we have C = supPX I(X; X + Z) = log |G| − H(Z), achieved by X ∼ Unif(G). Similarly,
for a group-noise channel acting over a non-abelian group G by x 7→ x ◦ Z, Z ∼ PZ we also have
capacity equal log |G| − H(Z) and achieved by X ∼ Unif(G).
Next we consider the BECδ . This is a multiplicative channel. Indeed, if we equivalently redefine
the input X ∈ {±1} and output Y ∈ {±1, 0}, then BEC relation can be written as
Y = XZ, Z ∼ Ber(δ) ⊥
⊥ X.
To compute the capacity, we first notice that even without evaluating Shannon’s formula, it is clear
that C ≤ 1 −δ (bit), because for a large blocklength n about δ -fraction of the message is completely
lost (even if the encoder knows a priori where the erasures are going to occur, the rate still cannot
exceed 1 − δ ). More formally, we notice that P[X = 1|Y = 0] = P[X= δ
1]δ
= P[X = 1] and therefore
Finally, the Z-channel can also be thought of as a multiplicative channel with transition law
Y = XZ, X ∈ { 0, 1} ⊥
⊥ Z ∼ Ber(1 − δ) ,
i i
i i
i i
thus, we get
for all measurable E ⊂ Y and x ∈ X . Two symmetries f and g can be composed to produce another
symmetry as
( gi , go ) ◦ ( fi , fo ) ≜ ( gi ◦ fi , fo ◦ go ) . (19.9)
Note that both components of an automorphism f = (fi , fo ) are bimeasurable bijections, that is
fi , f− 1 −1
i , fo , fo are all measurable and well-defined functions.
Naturally, every symmetry group G possesses a canonical left action on X × Y defined as
Since the action on X × Y splits into actions on X and Y , we will abuse notation slightly and write
g · ( x, y) ≜ ( g x , g y ) .
Let us assume in addition that our group G can be equipped with a σ -algebra σ(G) such that
the maps h 7→ hg and h 7→ gh are measurable for each g ∈ G. We say that a probability measure μ
on (G, σ(G)) is a left-invariant Haar measure if when H ∼ μ we also have gH ∼ μ for any g ∈ G.
(See also Exercise V.23.) Existence of Haar measure is trivial for finite (and compact) groups, but
in general is a difficult subject. To proceed we need to make an assumption about the symmetry
group G that we call regularity. (This condition is trivially satisfied whenever X and Y are finite,
thus all the sophistication in these few paragraphs is only relevant to non-discrete channels.)
G×X ×Y →X ×Y
is measurable.
Note that under the regularity assumption the action (19.10) also defines a left action of G on
P(X ) and P(Y) according to
i i
i i
i i
382
or, in words, if X ∼ PX then gX ∼ gPX , and similarly for Y and gY. For every distribution PX we
define an averaged distribution P̄X as
Z
P̄X [E] ≜ PX [g−1 E]ν(dg) , (19.13)
G
which is the distribution of random variable gX when g ∼ ν and X ∼ PX . The measure P̄X is G-
invariant, in the sense that gP̄X = P̄X . Indeed, by left-invariance of ν we have for every bounded
function f
Z Z
f(g)ν(dg) = f(hg)ν(dg) ∀h ∈ G ,
G G
and therefore
Z
−1
P̄X [h E] = PX [(hg)−1 E]ν(dg) = P̄X [E] .
G
In other words, if the pair (X, Y) is generated by taking X ∼ PX and applying PY|X , then the pair
(gX, gY) has marginal distribution gPX but conditional kernel is still PY|X . For finite X , Y this is
equivalent to
which may also be taken as the definition of the automorphism. In terms of the G-action on P(Y)
we may also say:
It is not hard to show that for any channel and a regular group of symmetries G the capacity-
achieving output distribution must be G-invariant, and capacity-achieving input distribution can
be chosen to be G-invariant. That is, the saddle point equation
inf sup D(PY|X kQY |PX ) = sup inf D(PY|X kQY |PX ) ,
PX QY QY PX
i i
i i
i i
can be solved in the class of G-invariant distribution. Often, the action of G is transitive on X (Y ),
in which case the capacity-achieving input (output) distribution can be taken to be uniform.
Below we systematize many popular notions of channel symmetry and explain relationship
between them.
• Note that it is an easy consequence of the definitions that any input-symmetric (resp. output-
symmetric) channel, all rows of the channel matrix PY|X (resp. columns) are permutations of
the first row (resp. column). Hence,
input-symmetric, output-symmetric =⇒ Dobrushin (19.18)
• Group-noise channels satisfy all other definitions of symmetry:
i i
i i
i i
384
to [390] the latin squares that are Cayley tables are precisely the ones in which composition of
two rows (as permutations) gives another row. An example of the latin square which is not a
Cayley table is the following:
1 2 3 4 5
2 5 4 1 3
3 1 2 5 4 . (19.21)
4 3 5 2 1
5 4 1 3 2
1
Thus, by multiplying this matrix by 15 we obtain a counterexample:
Dobrushin, square 6=⇒ group-noise
In fact, this channel is not even input-symmetric. Indeed, suppose there is g ∈ G such that
g4 = 1 (on X ). Then, applying (19.16) with x = 4 we figure out that on Y the action of g must
be:
1 7→ 4, 2 7→ 3, 3 7→ 5, 4 7→ 2, 5 7→ 1 .
But then we have
1
gPY|X=1 = 5 4 2 1 3 · ,
15
which by a simple inspection does not match any of the rows in (19.21). Thus, (19.17) cannot
hold for x = 1. We conclude:
Dobrushin, square 6=⇒ input-symmetric
Similarly, if there were g ∈ G such that g2 = 1 (on Y ), then on X it would act as
1 7→ 2, 2 7→ 5, 3 7→ 1, 4 7→ 3, 5 7→ 4 ,
which implies via (19.16) that PY|X (g1|x) is not a column of (19.21). Thus:
Dobrushin, square 6=⇒ output-symmetric
• Clearly, not every input-symmetric channel is Dobrushin (e.g., BEC). One may even find a
counterexample in the class of square channels:
1 2 3 4
1 3 2 4 1
4 2 3 1 · 10 (19.22)
4 3 2 1
This shows:
input-symmetric, square 6=⇒ Dobrushin
• Channel (19.22) also demonstrates:
Gallager-symmetric, square 6=⇒ Dobrushin .
i i
i i
i i
• Example (19.22) naturally raises the question of whether every input-symmetric channel is
Gallager symmetric. The answer is positive: by splitting Y into the orbits of G we see that a
subchannel X → {orbit} is input and output symmetric. Thus by (19.18) we have:
Since det W 6= 0, the capacity achieving input distribution is unique. Since H(Y|X = x) is
independent of x and PX = [1/4, 1/4, 3/8, 1/8] achieves uniform P∗Y it must be the unique
optimum. Clearly any permutation Tx fixes a uniform P∗Y and thus the channel is weakly input-
symmetric. At the same time it is not Gallager-symmetric since no row is a permutation of
another.
• For more on the properties of weakly input-symmetric channels see [333, Section 3.4.5].
Gallager
1010
1111111
0000000
0000000
1111111 0
1 Dobrushin
0000000
1111111
0000000
1111111 101111
0000000000
1111111111
0000
0000000
1111111 101111
0000000000
1111111111
0000
101111
0000
1111
0000000
1111111 0000000000
1111111111
0000
000
111
0
10000
1111 000
111
0000
1111
0000000
1111111 0000000000
1111111111
0000
1111
000
111
000
111
0
10000
1111
0000000000
1111111111
0000
1111
000
111
000
111 000
111
0000
1111
0000000
1111111
0000
1111 0000
1111
000
111
0000
1111 000
111
000
111
0000000
1111111
0000
1111 0000
1111 000input−symmetric
111
output−symmetric group−noise
i i
i i
i i
386
Definition 19.14 A channel is called information stable if there exists a sequence of input
distributions {PXn , n = 1, 2, . . .} such that
1 n n P (I)
i( X ; Y ) −
→C .
n
For example, we can pick PXn = (P∗X )n for stationary memoryless channels. Therefore
stationary memoryless channels are information stable.
The purpose for defining information stability is the following theorem.
Proof. Like the stationary, memoryless case, the upper bound comes from the general con-
verse Theorem 17.3, and the lower bound uses a similar strategy as Theorem 19.8, except utilizing
the definition of information stability in place of WLLN.
The next theorem gives conditions to check for information stability in memoryless channels
which are not necessarily stationary.
Theorem 19.16 A memoryless channel is information stable if there exists {X∗k : k ≥ 1} such
that both of the following hold:
1X ∗ ∗
n
I(Xk ; Yk ) → C(I) (19.25)
n
k=1
X
∞
1
Var[i(X∗n ; Y∗n )] < ∞ . (19.26)
n2
n=1
i i
i i
i i
where convergence to 0 follows from Kronecker lemma (Lemma 19.17 to follow) applied with
bn = n2 , xn = Var[i(X∗n ; Y∗n )]/n2 .
The second part follows from the first. Indeed, notice that
1X
n
C(I) = lim inf sup I(Xk ; Yk ) .
n→∞ n PXk
k=1
(Note that each supPX I(Xk ; Yk ) ≤ log min{|A|, |B|} < ∞.) Then, we have
k
X
n X
n
I(X∗k ; Y∗k ) ≥ sup I(Xk ; Yk ) − 1 ,
PXk
k=1 k=1
and hence normalizing by n we get (19.25). We next show that for any joint distribution PX,Y we
have
The argument is symmetric in X and Y, so assume for concreteness that |B| < ∞. Then
where (19.29) is because 2 log PY|X (y|x) · log PY (y) is always non-negative, and (19.30) follows
because each term in square-brackets can be upper-bounded using the following optimization
problem:
X
n
g( n) ≜ sup
Pn
aj log2 aj . (19.31)
aj ≥0: j= 1 aj =1 j=1
Since the x log2 x has unbounded derivative at the origin, the solution of (19.31) is always in the
interior of [0, 1]n . Then it is straightforward to show that for n > e the solution is actually aj = 1n .
i i
i i
i i
388
For n = 2 it can be found directly that g(2) = 0.5629 log2 2 < log2 2. In any case,
Finally, because of the symmetry, a similar argument can be made with |B| replaced by |A|.
1 X
n
bj xj → 0
bn
j=1
Proof. Since bn ’s are strictly increasing, we can split up the summation and bound them from
above
X
n X
m X
n
bk xk ≤ bm xk + b k xk
k=1 k=1 k=m+1
1 X bm X X bm X X
n ∞ n ∞ ∞
bk
=⇒ b k xk ≤ xk + xk ≤ xk + xk
bn bn bn bn
k=1 k=1 k=m+1 k=1 k=m+1
1 X X
n ∞
=⇒ lim bk xk ≤ xk → 0
n→∞ bn
k=1 k=m+1
Since this holds for any m, we can make the last term arbitrarily small.
How to show information stability? One important class of channels with memory for which
information stability can be shown easily are Gaussian channels. The complete details will be
shown below (see Sections 20.5* and 20.6*), but here we demonstrate a crucial fact.
For jointly Gaussian (X, Y) we always have bounded variance:
cov[X, Y]
Var[i(X; Y)] = ρ2 (X, Y) log2 e ≤ log2 e , ρ(X, Y) = p . (19.32)
Var[X] Var[Y]
From here by using Var[·] = Var[E[·|X̃]] + Var[·|X̃] we need to compute two terms separately:
σX̃2
log e X̃ 2
− 2
E[i(X̃; Y)|X̃] =
σZ
,
2 σY2
i i
i i
i i
and hence
2 log2 e 4
Var[E[i(X̃; Y)|X̃]] = σ .
4σY4 X̃
On the other hand,
2 log2 e 2 2
Var[i(X̃; Y)|X̃] = [4σX̃ σZ + 2σX̃4 ] .
4σY4
Putting it all together we get (19.32). Inequality (19.32) justifies information stability of all sorts
of Gaussian channels (memoryless and with memory), as we will see shortly.
1X
k
1
Pb ≜ P[Sj 6= Ŝj ] = E[dH (Sk , Ŝk )] , (19.33)
k k
j=1
1X X
k k
1{Si 6= Ŝi } ≤ 1{Sk 6= Ŝk } ≤ 1{Si 6= Ŝi },
k
i=1 i=1
where the first inequality is obvious and the second follow from the union bound. Taking
expectation of the above expression gives the theorem.
Next, the following pair of results is often useful for lower bounding Pb for some specific codes.
i i
i i
i i
390
Proof. Let ei be a length k vector that is 1 in the i-th position, and zero everywhere else. Then
X
k X
k
1{Si =
6 Ŝi } ≥ 1{Sk = Ŝk + ei }
i=1 i=1
1X
k
Pb ≥ P[Sk = Ŝk + ei ]
k
i=1
Theorem 19.20 If A, B ∈ {0, 1}k (with arbitrary marginals!) then for every r ≥ 1 we have
1 k−1
Pb = E[dH (A, B)] ≥ Pr,min (19.34)
k r−1
Pr,min ≜ min{P[B = c′ |A = c] : c, c′ ∈ {0, 1}k , dH (c, c′ ) = r} (19.35)
Next, notice
In statistics, Assouad’s Lemma is a useful tool for obtaining lower bounds on the minimax risk
of an estimator (Section 31.2).
The following is a converse bound for channel coding under BER constraint.
Theorem 19.21 (Converse under BER) Any M-code with M = 2k and bit-error rate Pb
satisfies
supPX I(X; Y)
log M ≤ .
log 2 − h(Pb )
i i
i i
i i
i.i.d.
Proof. Note that Sk → X → Y → Ŝk , where Sk ∼ Ber( 12 ). Recall from Theorem 6.1 that for iid
P
Sn , I(Si ; Ŝi ) ≤ I(Sk ; Ŝk ). This gives us
X
k
sup I(X; Y) ≥ I(X; Y) ≥ I(Si ; Ŝi )
PX
i=1
1X 1
≥k d P[Si = Ŝi ]
k 2
!
1X
k
1
≥ kd P[Si = Ŝi ]
k 2
i=1
1
= kd 1 − Pb = k(log 2 − h(Pb ))
2
where the second line used Fano’s inequality (Theorem 3.12) for binary random variables (or data
processing inequality for divergence), and the third line used the convexity of divergence. One
should note that this last chain of inequalities is similar to the proof of Proposition 6.6.
Pairing this bound with Proposition 19.10 shows that any sequence of codes with Pb → 0 (for
a memoryless channel) must have rate R < C. In other words, relaxing the constraint from Pe to
Pb does not yield any higher rates.
Later in Section 26.3 we will see that channel coding under BER constraint is a special case
of a more general paradigm known as lossy joint source channel coding so that Theorem 19.21
follows from Theorem 26.5. Furthermore, this converse bound is in fact achievable asymptotically
for stationary memoryless channels.
Sk Encoder Xn Yn Decoder Ŝ k
Source (JSCC) Channel (JSCC)
In channel coding we are interested in transmitting M messages and all messages are born equal.
Here we want to convey the source realizations which might not be equiprobable (has redundancy).
Indeed, if Sk is uniformly distributed on, say, {0, 1}k , then we are back to the channel coding setup
i i
i i
i i
392
with M = 2k under average probability of error, and ϵ∗JSCC (k, n) coincides with ϵ∗ (n, 2k ) defined
in Section 22.1.
Here, we look for a clever scheme to directly encode k symbols from A into a length n channel
input such that we achieve a small probability of error over the channel. This feels like a mix
of two problems we have seen: compressing a source and coding over a channel. The following
theorem shows that compressing and channel coding separately is optimal. This is a relief, since
it implies that we do not need to develop any new theory or architectures to solve the Joint Source
Channel Coding problem. As far as the leading term in the asymptotics is concerned, the following
two-stage scheme is optimal: First use the optimal compressor to eliminate all the redundancy in
the source, then use the optimal channel code to add redundancy to combat the noise in the data
transmission.
The result is known as separation theorem since it separates the jobs of compressor and channel
code, with the two blocks interfacing in terms of bits. Note that the source can generate symbols
over very different alphabet than the channel’s input alphabet. Nevertheless, the bit stream pro-
duced by the source code (compressor) is matched to the channel by the channel code. There is
an even more general version of this result (Section Section 26.3).
Theorem 19.22 (Shannon separation theorem) Let the source {Sk } be stationary mem-
oryless on a finite alphabet with entropy H. Let the channel be stationary memoryless with finite
capacity C. Then
(
∗ → 0 R < C/H
ϵJSCC (nR, n) n → ∞.
6→ 0 R > C/H
The interpretation of this result is as follows: Each source symbol has information content
(entropy) H bits. Each channel use can convey C bits. Therefore to reliably transmit k symbols
over n channel uses, we need kH ≤ nC.
Proof. (Achievability.) The idea is to separately compress our source and code it for transmission.
Since this is a feasible way to solve the JSCC problem, it gives an achievability bound. This
separated architecture is
f1 f2 P Yn | X n g2 g1
Sk −→ W −→ Xn −→ Yn −→ Ŵ −→ Ŝk
Where we use the optimal compressor (f1 , g1 ) and optimal channel code (maximum probability of
error) (f2 , g2 ). Let W denote the output of the compressor which takes at most Mk values. Then
from Corollary 11.3 and Theorem 19.9 we get:
1
(From optimal compressor) log Mk > H + δ =⇒ P[Ŝk 6= Sk (W)] ≤ ϵ ∀k ≥ k0
k
1
(From optimal channel code) log Mk < C − δ =⇒ P[Ŵ 6= m|W = m] ≤ ϵ ∀m, ∀k ≥ k0
n
Using both of these,
i i
i i
i i
And therefore if R(H + δ) < C − δ , then ϵ∗ → 0. By the arbitrariness of δ > 0, we conclude the
weak converse for any R > C/H.
(Converse.) To prove the converse notice that any JSCC encoder/decoder induces a Markov
chain
Sk → Xn → Yn → Ŝk .
Applying data processing for mutual information
I(Sk ; Ŝk ) ≤ I(Xn ; Yn ) ≤ sup I(Xn ; Yn ) = nC.
PXn
On the other hand, since P[Sk 6= Ŝk ] ≤ ϵn , Fano’s inequality (Theorem 3.12) yields
I(Sk ; Ŝk ) = H(Sk ) − H(Sk |Ŝk ) ≥ kH − ϵn log |A|k − log 2.
Combining the two gives
nC ≥ kH − ϵn log |A|k − log 2. (19.36)
Since R = nk , dividing both sides by n and sending n → ∞ yields
RH − C
lim inf ϵn ≥ .
n→∞ R log |A|
Therefore ϵn does not vanish if R > C/H.
We remark that instead of using Fano’s inequality we could have lower bounded I(Sk ; Ŝk ) as in
the proof of Theorem 17.3 by defining QSk Ŝk = USk PŜk (with USk = Unif({0, 1}k ) and applying the
data processing inequality to the map (Sk , Ŝk ) 7→ 1{Sk = Ŝk }:
D(PSk Ŝk kQSk Ŝk ) = D(PSk kUSk ) + D(PŜ|Sk kPŜ |PSk ) ≥ d(1 − ϵn k|A|−k )
Rearranging terms yields (19.36). As we discussed in Remark 17.2, replacing D with other f-
divergences can be very fruitful.
In a very similar manner, by invoking Corollary 12.6 and Theorem 19.15 we obtain:
Theorem 19.23 Let source {Sk } be ergodic on a finite alphabet, and have entropy rate H. Let
the channel have capacity C and be information stable. Then
(
= 0 R > H/C
lim ϵ∗JSCC (nR, n)
n→∞ > 0 R < H/C
i i
i i
i i
In this chapter we study data transmission with constraints on the channel input. Namely, in pre-
vious chapter the encoder for blocklength n code was permitted to produce arbitrary sequences
of channel inputs (i.e. the range of the encoder could be all of An ). However, in many practical
problem only a subset of An is allowed to be used. The main such example is the AWGN chan-
nel Example 3.3. If encoder is allowed to produce arbitrary elements of Rn as input, the channel
capacity is infinite: supPX I(X; X + Z) = ∞ (for example, take X ∼ N (0, P) and P → ∞). That
is, one can transmit arbitrarily many messages with arbitrarily small error probability by choos-
ing elements of Rn with giant pairwise distance. In reality, however, allowed channel inputs are
limited by the available1 power and the encoder is only capable of using xn ∈ Rn are satisfying
1X 2
n
xt ≤ P ,
n
t=1
where P > 0 is known as the power constraint. How many bits per channel use can we transmit
under this constraint on the codewords? To answer this question in general, we need to extend
the setup and coding theorems to channels with input constraints. After doing that we will apply
these results to compute capacities of various Gaussian channels (memoryless, with inter-symbol
interference and subject to fading).
An
b b
b
b Fn b
b b
b b b
b b
b
1
or allowed by regulatory bodies, such as the FCC in the US.
394
i i
i i
i i
What type of constraint sets Fn are of practical interest? In the context of Gaussian channels,
we have A = R. Then one often talks about the following constraints:
1X 2 √
n
| xi | ≤ P ⇔ kxn k2 ≤ nP.
n
i=1
√
In other words, codewords must lie in a ball of radius nP.
• Peak power constraint :
Notice that the second type of constraint does not introduce any new problems: we can simply
restrict the input space from A = R to A = [−A, A] and be back into the setting of input-
unconstrained coding. The first type of the constraint is known as a separable cost-constraint.
We will restrict our attention from now on to it exclusively.
Definition 20.2 • A P
code is an (n, M, ϵ, P)-code if it is an (n, M, ϵ)-code satisfying input
n
constraint Fn ≜ {x : n k=1 c(xk ) ≤ P}
n 1
i i
i i
i i
396
• Information capacity
1
C(I) (P) = lim inf sup I(Xn ; Yn )
n→∞ n PXn :E[Pnk=1 c(Xk )]≤nP
• Information stability: Channel is information stable if for all (admissible) P, there exists a
sequence of channel input distributions PXn such that the following two properties hold:
1 P
iP n n (Xn ; Yn )−
→C(I) (P) (20.1)
n X ,Y
P[c(Xn ) > P + δ] → 0 ∀δ > 0 . (20.2)
These definitions clearly parallel those of Definitions 19.3 and 19.6 for channels without input
constraints. A notable and crucial exception is the definition of the information capacity C(I) (P).
Indeed, under input constraints instead of maximizing I(Xn ; Yn ) over distributions supported on
Fn we extend maximization to a richer set of distributions, namely, those satisfying
" n #
X
E c(Xk ) ≤ nP .
k=1
Clearly, if P ∈
/ Dc , then there is no code (even a useless one, with one codeword) satisfying the
input constraint. So in the remaining we always assume P ∈ Dc .
Proof. In the first part all statements are obvious, except for concavity, which follows from the
concavity of PX 7→ I(X; Y). For any PXi such that E [c(Xi )] ≤ Pi , i = 0, 1, let X ∼ λ̄PX0 + λPX1 .
Then E [c(X)] ≤ λ̄P0 + λP1 and I(X; Y) ≥ λ̄I(X0 ; Y0 ) + λI(X1 ; Y1 ). Hence ϕ(λ̄P0 + λP1 ) ≥
λ̄ϕ(P0 ) + λϕ(P1 ). The second claim follows from concavity of ϕ(·).
To extend these results to C(I) (P) observe that for every n
1
P 7→ sup I(Xn ; Yn )
n PXn :E[c(Xn )]≤P
i i
i i
i i
is concave. Then taking lim infn→∞ the same holds for C(I) (P).
An immediate consequence is that memoryless input is optimal for memoryless channel with
separable cost, which gives us the single-letter formula of the information capacity:
Proof. C(I) (P) ≥ ϕ(P) is obvious by using PXn = (PX )⊗n . For “≤”, fix any PXn satisfying the
cost constraint. Consider the chain
( a) X (b) X X
n n ( c)
n
1
I(Xn ; Yn ) ≤ I(Xj ; Yj ) ≤ ϕ(E[c(Xj )]) ≤ nϕ E[c(Xj )] ≤ nϕ(P) ,
n
j=1 j=1 j=1
where (a) follows from Theorem 6.1; (b) from the definition of ϕ; and (c) from Jensen’s inequality
and concavity of ϕ.
Proof. The argument is the same as we used in Theorem 17.3. Take any (n, M, ϵ, P)-code, W →
Xn → Yn → Ŵ. Applying Fano’s inequality and the data-processing, we get
Normalizing both sides by n and taking lim infn→∞ we obtain the result.
Next we need to extend one of the coding theorems to the case of input constraints. We do so for
the Feinstein’s lemma (Theorem 18.7). Note that when F = X , it reduces to the original version.
Theorem 20.7 (Extended Feinstein’s lemma) Fix a Markov kernel PY|X and an arbitrary
PX . Then for any measurable subset F ⊂ X , everyγ > 0 and any integer M ≥ 1, there exists an
(M, ϵ)max -code such that
i i
i i
i i
398
Proof. Similar to the proof of the original Feinstein’s lemma, define the preliminary decoding
regions Ec = {y : i(c; y) ≥ log γ} for all c ∈ X . Next, we apply Corollary 18.4 and find out
that there is a set F0 ⊂ X with two properties: a) PX [F0 ] = 1 and b) for every x ∈ F0 we have
PY (Ex ) ≤ γ1 . We now let F′ = F ∩ F0 and notice that PX [F′ ] = PX [F].
We sequentially pick codewords {c1 , . . . , cM } from the set F′ (!) and define the decoding regions
{D1 , . . . , DM } as Dj ≜ Ecj \ ∪jk− 1
=1 Dk . The stopping criterion is that M is maximal, i.e.,
∀x0 ∈ F′ , PY [Ex0 \ ∪M
j=1 Dj X = x0 ] < 1 − ϵ
′ ′c
⇔ ∀ x0 ∈ X , P Y [ E x 0 \ ∪ M
j=1 Dj X = x0 ] < (1 − ϵ)1[x0 ∈ F ] + 1[x0 ∈ F ]
From here, we complete the proof by following the same steps as in the proof of original Feinstein’s
lemma (Theorem 18.7).
Theorem 20.8 (Achievability bound) For any information stable channel with input
constraints and P > P0 we have
Proof. Let us consider a special case of the stationary memoryless channel (the proof for general
information stable channel follows similarly). Thus, we assume PYn |Xn = (PY|X )⊗n .
Fix n ≥ 1. Choose a PX such that E[c(X)] < P, Pick log M = n(I(X; Y) − 2δ) and log γ =
n(I(X; Y) − δ).
P
With the input constraint set Fn = {xn : 1n c(xk ) ≤ P}, and iid input distribution PXn = P⊗ n
X ,
we apply the extended Feinstein’s lemma. This shows existence of an (n, M, ϵn , P)max -code with
the encoder satisfying input constraint Fn and vanishing (maximal) error probability
Indeed, the first term is vanishing by the weak law of large numbers: since E[c(X)] < P, we have
P
PXn (Fn ) = P[ 1n c(xk ) ≤ P] → 1. Since ϵn → 0 this implies that for every ϵ > 0 we have
1
Cϵ (P) ≥ log M = I(X; Y) − 2δ, ∀δ > 0, ∀PX s.t. E[c(X)] < P
n
⇒ Cϵ (P) ≥ sup lim (I(X; Y) − 2δ)
PX :E[c(X)]<P δ→0
i i
i i
i i
where the last equality is from the continuity of C(I) on (P0 , ∞) by Proposition 20.4.
For a general information stable channel, we just need to use the definition to show that
P[i(Xn ; Yn ) ≤ n(C(I) − δ)] → 0, and the rest of the proof follows similarly.
Theorem 20.9 (Channel capacity under cost constraint) For an information stable
channel with cost constraint and for any admissible constraint P we have
C(P) = C(I) (P).
Proof. The boundary case of P = P0 is treated in Ex. IV.23, which shows that C(P0 ) = C(I) (P0 )
even though C(I) (P) may be discontinuous at P0 . So assume P > P0 next. Theorem 20.6 shows
(I)
Cϵ (P) ≤ C1−ϵ (P)
, thus C(P) ≤ C(I) (P). On the other hand, from Theorem 20.8 we have C(P) ≥
( I)
C ( P) .
Z ∼ N (0, σ 2 )
X + Y
Definition 20.10 (The stationary AWGN channel) The Additive White Gaussian Noise
(AWGN) channel is a stationary memoryless additive-noise channel with separable cost constraint:
A = B = R, c(x) = x2 , and a single-letter kernel PY|X given by Y = X + Z, where Z ∼ N (0, σ 2 ) ⊥⊥
X. The n-letter kernel is given by a product extension, i.e. Yn = Xn + Zn with Zn ∼ N (0, In ). When
the power constraint is E[c(X)] ≤ P we say that the signal-to-noise ratio (SNR) equals σP2 . Note
that our informal definition early on (Example 3.3) lacked specification of the cost constraint
function, without which it was not complete.
The terminology white noise refers to the fact that the noise variables are uncorrelated across
time. This makes the power spectral density of the process {Zj } constant in frequency (or “white”).
We often drop the word stationary when referring to this channel. The definition we gave above is
more correctly should be called the real AWGN, or R-AWGN, channel. The complex AWGN, or
C-channel is defined similarly: A = B = C, c(x) = |x|2 , and Yn = Xn + Zn , with Zn ∼ Nc (0, In )
being the circularly symmetric complex gaussian.
i i
i i
i i
400
Theorem 20.11 For the stationary AWGN channel, the channel capacity is equal to informa-
tion capacity, and is given by:
1 P
( I)
C(P) = C (P) = log 1 + 2 for R-AWGN (20.4)
2 σ
P
( I)
C(P) = C (P) = log 1 + 2 for C-AWGN
σ
Then using Theorem 5.11 (the Gaussian saddle point) to conclude X ∼ N (0, P) (or Nc (0, P)) is
the unique capacity-achieving input distribution.
At this point it is also instructive to revisit Section 6.2* which shows that Gaussian capacity
can in fact be derived essentially without solving the maximization of mutual information: the
Euclidean rotational symmetry implies the optimal input should be Gaussian.
There is a great deal of deep knowledge embedded in the simple looking formula of Shan-
non (20.4). First, from the engineering point of view we immediately see that to transmit
information faster (per unit time) one needs to pay with radiating at higher power, but the payoff
in transmission speed is only logarithmic. The waveforms of good error correcting codes should
look like samples of the white Gaussian process.
Second, the amount of energy spent per transmitted information bit is minimized by solving
P log 2
inf = 2σ 2 loge 2 (20.5)
P>0 C(P)
and is achieved by taking P → 0. (We will discuss the notion of energy-per-bit more in Sec-
tion 21.1.) Thus, we see that in order to maximize communication rate we need to send powerful,
high-power waveforms. But in order to minimize energy-per-bit we need to send in very quiet
“whisper” and at very low communication rate.2 In addition, when signaling with very low power
√
(and hence low rate), by inspecting Figure 3.2 we can see that one can restrict to just binary ± P
symbols (so called BPSK modulation). This results in virtually no loss of capacity.
Third, from the mathematical point of view, formula (20.4) reveals certain properties of high-
dimensional Euclidean geometry
√ as follows. Since Zn ∼ N (0, σ 2 ), then with high probability,
kZ k2 concentrates around nσ . Similarly, due the power constraint and the fact that Zn ⊥
n 2 ⊥ Xn , we
n 2 n 2 n 2
have E kY k = E kY p k + E kZ k ≤ n(P + σ 2 ) and the received vector Yn lies in an ℓ√ 2 -ball
of radius approximately n(P + σ 2 ). Since the noise √ can at most perturb the codeword p by nσ 2
in Euclidean distance, if we can pack M balls of radius nσ 2 into the ℓ2 -ball of radius n(P + σ 2 )
centered at the origin, this yields a good codebook and decoding regions – see Figure 20.1 for an
illustration. So how large can M be? Note that the volume of an ℓ2 -ball of radius r in Rn is given by
2
This explains why, for example, the deep space probes communicate with earth via very low-rate codes and very long
blocklengths.
i i
i i
i i
c3
c4
p n
√ c2
nσ 2
(P
c1
+
σ
2
)
c5
c8
···
c6
c7
cM
2 n/ 2 n/2
cn rn for some constant cn . Then cn (cnn((Pn+σ ))
= 1 + σP2 . Taking the log and dividing by n, we
σ 2 ) n/ 2
∗
get n log M ≈ 2 log 1 + σ2 . This tantalizingly convincing reasoning, however, is flawed in at
1 1 P
least two ways. (a) Computing the volume ratio only gives an upper bound on the maximal number
of disjoint balls (See Section 27.2 for an extensive discussion on this topic.) (b) Codewords need
not correspond to centers of disjoint ℓ2 -balls. √ Indeed, the fact that we allow some vanishing (but
non-zero) probability of error means that the nσ 2 -balls are slightly overlapping and Shannon’s
formula establishes the maximal number of such partially overlapping balls that we can pack so
that they are (mostly) inside a larger ball.
Since Theorem 20.11 applies to Gaussian noise, it is natural to ask: What if the noise is non-
Gaussian and how sensitive is the capacity formula 21 log(1 + SNR) to the Gaussian assumption?
Recall the Gaussian saddle point result in Theorem 5.11 which shows that for the same variance,
Gaussian noise is the worst which shows that the capacity of any non-Gaussian noise is at least
1
2 log(1 + SNR). Conversely, it turns out the increase of the capacity can be controlled by how
non-Gaussian the noise is (in terms of KL divergence). The following result is due to Ihara [223].
Remark 20.1 The quantity D(PZ kN (EZ, σ 2 )) is sometimes called the non-Gaussianness of Z,
where N (EZ, σ 2 ) is a Gaussian with the same mean and variance as Z. So if Z has a non-Gaussian
density, say, Z is uniform on [0, 1], then the capacity can only differ by a constant compared to
i i
i i
i i
402
AWGN, which still scales as 12 log SNR in the high-SNR regime. On the other hand, if Z is discrete,
then D(PZ kN (EZ, σ 2 )) = ∞ and indeed in this case one can show that the capacity is infinite
because the noise is “too weak”.
Proof.
X
L
≤ P
sup sup I(Xk ; Yk )
Pk ≤P,Pk ≥0 k=1 E[X2k ]≤Pk
X
L
1 Pk
= P
sup log(1 + )
Pk ≤P,Pk ≥0 k=1 2 σk2
with equality if Xk ∼ N (0, Pk ) are independent. So the question boils down to the last maximiza-
tion problem, known as problem of optimal power allocation. Denote the Lagrangian multipliers
P
for the constraint Pk ≤ P by λ and for the constraint Pk ≥ 0 by μk . We want to solve
P1 P
max 2 log(1 + σPk2 ) − μk Pk + λ(P − Pk ). First-order condition on Pk gives that
k
1 1
= λ − μ k , μ k Pk = 0
2 σk2 + Pk
therefore the optimal solution is
X
L
Pk = |T − σk2 |+ , T is chosen such that P = |T − σk2 |+
k=1
i i
i i
i i
T
P1 P3
σ22
σ12 σ32
Figure 20.2 Power allocation via water-filling across three parallel channels. Here, the second branch is too
noisy (σ2 too big) for the amount of available power P and the optimal coding should discard (input zeros to)
this branch altogether.
Figure 20.2 illustrates the water-filling solution. It has a number of practically important con-
clusions. First, it gives a precise recipe for how much power to allocate to different frequency
bands. This solution, simple and elegant, was actually pivotal for bringing high-speed internet
to many homes (via cable modems, or ADSL): initially, before information theorists had a say,
power allocations were chosen on the basis of costly and imprecise simulations of real codes. The
simplicity of the water-filling scheme makes power allocation dynamic and enables instantaneous
reaction to changing noise environments.
Second, there is a very important consequence for multiple-antenna (MIMO) communication.
Given nr receive antennas and nt transmit antennas, very often one gets as a result a parallel AWGN
with L = min(nr , nt ) branches (see Exercise I.9 and I.10). For a single-antenna system the capacity
then scales as 12 log P with increasing power (Theorem 20.11), while the capacity for a MIMO
AWGN channel is approximately L2 log( PL ) ≈ L2 log P for large P. This results in an L-fold increase
in capacity at high SNR. This is the basis of a powerful technique of spatial multiplexing in MIMO,
largely behind much of advance in 4G, 5G cellular (3GPP) and post-802.11n WiFi systems.
Notice that spatial diversity (requiring both receive and transmit antennas) is different from a
so-called multipath diversity (which works even if antennas are added on just one side). Indeed,
if a single stream of data is sent through every parallel channel simultaneously, then the sufficient
statistic would be to the average of all received vectors, resulting in a the effective noise level
reduced by L1 factor. This results in capacity increasing from 12 log P to 21 log(LP) – a far cry
from the L-fold increase of spatial multiplexing. These exciting topics are explored in excellent
textbooks [423, 268].
i i
i i
i i
404
Theorem 20.16 Assume that for every T > 0 the following limits exist:
1X1
n
T
(I)
C̃ (T) = lim log+ 2
n→∞ n 2 σj
j=1
1X
n
P̃(T) = lim |T − σj2 |+ .
n→∞ n
j=1
Then the capacity of the non-stationary AWGN channel is given by the parameterized form:
C(T) = C̃(I) (T) with input power constraint P̃(T).
Proof. Fix T > 0. Then it is clear from the water-filling solution in Theorem 20.14 that
X
n
1 T
sup I(Xn ; Yn ) = log+ , (20.7)
2 σj2
j=1
1X
n
E[c(Xn )] ≤ |T − σj2 |+ . (20.8)
n
j=1
Now, by assumption, the LHS of (20.8) converges to P̃(T). Thus, we have that for every δ > 0
Taking δ → 0 and invoking the continuity of P 7→ C(I) (P), we get from Theorem 19.15 that the
information capacity satisfies
log2 e Pj log2 e
Var(i(Xj ; Yj )) = 2
≤
2 Pj + σj 2
and thus
X
n
1
Var(i(Xj ; Yj )) < ∞ .
n2
j=1
Non-stationary AWGN is primarily of interest due to its relationship to the additive colored
Gaussian noise channel in the following section.
i i
i i
i i
Zn : Cov(Zn ) = Σ
multiply by
X̃ U−1 X + Y multiply by
U Ỹ
fZ (ω)
T
ω
−π π
power allocation
Figure 20.3 The ACGN channel: the “whitening” process used in the capacity proof and the water-filling
power allocation across spectrum.
Theorem 20.18 The capacity of the ACGN channel with fZ (ω) > 0 for almost every ω ∈
[−π , π ] is given by the following parametric form:
Z π
1 1 T
C ( T) = log+ dω,
2π −π 2 fZ (ω)
Z π
1
P ( T) = |T − fZ (ω)|+ dω.
2π −π
en = X
Y en + UZn ,
i i
i i
i i
406
e
Cov(UZn ) = U · Cov(Zn ) · U⊤ = Σ
Therefore we have the equivalent channel as follows:
en = X
Y en + Z
en , ej ∼ N (0, σj2 ) independent across j.
Z
Note that since U and U⊤ is orthogonal the maps X̃n = UXn and Xn = U⊤ X̃n preserve the norm
kX̃n k = kXn k. Therefore, capacities of both channels are equal: C = C̃. But the latter follows from
Theorem 20.16. Indeed, we have that
Z π
1X + T
n
e 1 1 T
C = lim log 2
= log+ dω. (Szegö’s theorem, see (6.12))
n→∞ n σj 2π −π 2 f Z (ω)
j=1
1X
n
lim |T − σj2 |+ = P(T).
n→∞ n
j=1
The idea used in the proof as well as the water-filling power allocation are illustrated on Fig-
ure 20.3. Note that most of the time the noise that impacts real-world systems is actually “born”
white (because it is a thermal noise). However, between the place of its injection and the process-
ing there are usually multiple circuit elements. If we model them linearly then their action can
equivalently be described as the ACGN channel, since the effective noise added becomes colored.
In fact, this filtering can be inserted deliberately in order to convert the actual channel into an
additive noise one. This is the content of the next section.
Definition 20.19 (AWGN with ISI) An AWGN channel with ISI is a channel with memory
that is defined as follows: the alphabets are A = B = R, and the separable cost is c(x) = x2 . The
channel law PYn |Xn is given by
X
n
Yk = hk−j Xj + Zk , k = 1, . . . , n
j=1
i.i.d.
where Zk ∼ N (0, σ 2 ) is white Gaussian noise, {hk , k ∈ Z} are coefficients of a discrete-time
channel filter.
The coefficients {hk } describe the action of the environment. They are often learned by the
receiver during the “handshake” process of establishing a communication link.
i i
i i
i i
Theorem 20.20 Suppose that the sequence {hk } is the inverse Fourier transform of a
frequency response H(ω):
Z π
1
hk = eiωk H(ω)dω .
2π −π
Assume also that H(ω) is a continuous function on [0, 2π ]. Then the capacity of the AWGN channel
with ISI is given by
Z π
1 1
C ( T) = log+ (T|H(ω)|2 )dω
2π −π 2
Z π +
1 1
P ( T) = T− dω
2π −π |H(ω)|2
Proof sketch. At the decoder apply the inverse filter with frequency response ω 7→ 1
H(ω) . The
equivalent channel then becomes a stationary colored-noise Gaussian channel:
Ỹj = Xj + Z̃j ,
The capacity achieving input distribution P∗X is discrete, with finitely many atoms on [−A, A]. The
number of atoms is Ω(A) and O(A2 ) as A → ∞. Moreover,
1 2A2 1
log 1 + ≤ C(A) ≤ log 1 + A2
2 eπ 2
i i
i i
i i
408
Capacity achieving input distribution P∗X is discrete, with finitely many atoms on [−A, A]. Moreover,
the convergence speed of limA→∞ C(A, P) = 21 log(1 + P) is of the order e−O(A ) .
2
For details, see [396], [343, Section III] and [144, 348] for the O(A2 ) bound on the number of
atoms.
Hi Zi
E[X2i ] ≤ P
Xi × + Yi receiver
There are two drastically different cases of fading channels, depending on the presence or
absence of the dashed link on Figure 20.4. In the first case, known as the coherent case or the
CSIR case (for channel state information at the receiver), the receiver is assumed to have perfect
estimate of the channel state information Hi at every time i. In other words, the channel output
is effectively (Yi , Hi ). This situation occurs, for example, when there are pilot signals sent peri-
odically and are used at the receiver to estimate the channel. In some cases, the index i refers to
different frequencies or sub-channels of an OFDM frame.
Whenever Hj is a stationary ergodic process, we have the channel capacity given by:
1 P | H| 2
C(P) = E log 1 +
2 σ2
i i
i i
i i
and the capacity achieving input distribution is the usual PX = N (0, P). Note that the capacity
C(P) is in the order of log P and we call the channel “energy efficient”.
In the second case, known as non-coherent or no-CSIR, the receiver does not have any knowl-
edge of the coefficients Hi ’s. In this case, there is no simple expression for the channel capacity.
Most of the known results were shown for the case of iid Hi according to a Rayleigh distribution.
In this case, the capacity achieving input distribution is discrete [3], and the capacity scales as
[415, 269]
C(P) = O(log log P), P→∞ (20.9)
This channel is said to be “energy inefficient” since increasing the communication rate requires
dramatic expenditures in power.
Further generalization of the Gaussian channel models requires introducing multiple input and
output antennas (known as MIMO channel). In this case, the single-letter input Xi ∈ Cnt and the
output Yi ∈ Cnr are related by
Yi = Hi Xi + Zi , (20.10)
i.i.d.
where Zi ∼ CN (0, σ 2 Inr ), nt and nr are the number of transmit and receive antennas, and Hi ∈
Cnt ×nr is a matrix-valued channel gain process. For the capacity of this channel under CSIR,
see Exercise I.10. An incredible effort in the 1990s and 2000s was invested by the information-
theoretic and communication-theoretic researchers to understand this channel model. Some of the
highlights include:
It is not possible to do any justice to these and many other fundamental results in MIMO communi-
cation here, unfortunately. We suggest textbook [423] as an introduction to this deep and exciting
field.
i i
i i
i i
In this chapter we will consider an interesting variation of the channel coding problem. Instead
of constraining the blocklength (i.e. the number of channel uses), we will constrain the total cost
incurred by the codewords. The most important special case of this problem is that of the AWGN
channel and quadratic (energy) cost constraint. The standard motivation in this setting is the fol-
lowing. Consider a deep space probe which has a k bit message that needs to be delivered to Earth
(or a satellite orbiting it). The duration of transmission is of little worry for the probe, but what is
really limited is the amount of energy it has stored in its battery. In this chapter we will learn how
to study this question abstractly, how coding over large number of bits k → ∞ reduces the energy
spent (per bit), and how this fundamental limit is related to communication over continuous-time
channels.
21.1 Energy-per-bit
In this chapter we will consider Markov kernels PY∞ |X∞ acting between two spaces of infinite
sequences. The prototypical example is again the AWGN channel:
Note that in this chapter we have denoted the noise level for Zi to be N20 . There is a long tradition for
such a notation. Indeed, most of the noise in communication systems is a white thermal noise at the
receiver. The power spectral density of that noise is flat and denoted by N0 (in Joules per second
per Hz). However, recall that received signal is complex-valued and, thus, each real component
has power N20 . Note also that thermodynamics suggests that N0 = kT, where k = 1.38 × 10−23 is
the Boltzmann constant, and T is the absolute temperature in Kelvins.
In previous chapter, we analyzed the maximum number of information messages (M∗ (n, ϵ, P))
that can be sent through this channel for a given n number of channel uses and under the power
constraint P. We have also hinted that in (20.5) that there is a fundamental minimal required cost
to send each (data) bit. Here we develop this question more rigorously. Everywhere in this chapter
for v ∈ R∞
X
∞
kvk22 ≜ v2j .
j=1
410
i i
i i
i i
Definition 21.1 ((E, 2k , ϵ)-code) Given a Markov kernel with input space R∞ we define
an (E, 2k , ϵ)-code to be an encoder-decoder pair, f : [2k ] → R∞ and g : R∞ → [2k ] (or similar
randomized versions), such that for all messages m ∈ [2k ] we have kf(m)k22 ≤ E and
P[g(Y∞ ) 6= W] ≤ ϵ ,
The operational meaning of E∗ (k, ϵ) should be apparent: it is the minimal amount of energy the
space probe needs to draw from the battery in order to send k bits of data.
Theorem 21.2 ((Eb /N0 )min = −1.6dB) For the AWGN channel we have
E∗ (k, ϵ) N0
lim lim sup = . (21.2)
ϵ→0 k→∞ k log2 e
Remark 21.1 This result, first obtained by Shannon [378], is colloquially referred to as: min-
imal Eb /N0 (pronounced “eebee over enzero” or “ebno”) is −1.6 dB. The latter value is simply
10 log10 ( log1 e ) ≈ −1.592. Achieving this value of the ebno was an ultimate quest for coding the-
2
ory, first resolved by the turbo codes [47]. See [101] for a review of this long conquest. We also
remark that the fundamental limit is unchanged if instead of real-valued AWGN channel we used
a C-AWGN channel
Yi = Xi + Zi , Zi ∼ CN (0, N0 )
P∞
and energy constraint i=1 |Xi |2 ≤ E. Indeed, this channel’s single input can be simply converted
into a pair of inputs for the R-AWGN channel. This double the blocklength, but it is anyway
considered to be infinite.
Proof. We start with a lower bound (or the “converse” part). As usual, we have the working
probability space
W → X∞ → Y∞ → Ŵ .
i i
i i
i i
412
X
∞
1 EX2i
≤ log 1 + Gaussian capacity, Theorem 5.11
2 N0 /2
i=1
log e X EX2i
∞
≤ linearization of log
2 N0 /2
i=1
E
≤ log e.
N0
Thus, we have shown
E∗ (k, ϵ) N0 h(ϵ)
≥ ϵ−
k log e k
and taking the double limit in n → ∞ then in ϵ → 0 completes the proof.
Next, for the upper bound (the “achievability” part). We first give a traditional existential proof.
Notice that a (n, 2k , ϵ, P)-code for the AWGN channel is also a (nP, 2k , ϵ)-code for the energy
problem without time constraint. Therefore,
P
= 1 P
,
2 log(1 + N0 /2 )
where in the last step we applied Theorem 20.11. Now the above statement holds for every P > 0,
so let us optimize it to get the best bound:
E∗ (kn , ϵ) P
lim sup ≤ inf 1 P
n→∞ kn P≥0
2 log(1 + N0 / 2 )
P
= lim
P→0 1 log(1 + P
2 N0 / 2 )
N0
= (21.3)
log2 e
Note that the fact that minimal energy per bit is attained at P → 0 implies that in order to send
information reliably at the Shannon limit of −1.6dB, infinitely many time slots are needed. In
other words, the information rate (also known as spectral efficiency) should be vanishingly small.
Conversely, in order to have non-zero spectral efficiency, one necessarily has to step away from
the −1.6 dB. This tradeoff is known as spectral efficiency vs energy-per-bit.
We next can give a simpler and more explicit construction of the code, not relying on the random
coding implicit in Theorem 20.11. Let M = 2k and consider the following code, known as the
i i
i i
i i
It is not hard to derive an upper bound on the probability of error that this code achieves [337,
Theorem 2]:
" ( r ! )#
2E
ϵ ≤ E min MQ + Z ,1 , Z ∼ N (0, 1) . (21.5)
N0
Indeed, our orthogonal codebook under a maximum likelihood decoder has probability of error
equal to
Z " r !#M−1
∞ √
(z− E)2
1 2 − N
Pe = 1 − √ 1−Q z e 0 dz , (21.6)
πN0 −∞ N0
which is obtained by observing that conditioned on (W = j,q Zj ) the events {||cj + z||2 ≤ ||cj +
z − ci ||2 }, i 6= j are independent. A change of variables x = N20 z and application of the bound
1 − (1 − y)M−1 ≤ min{My, 1} weakens (21.6) to (21.5).
To see that (21.5) implies (21.3), fix c > 0 and condition on |Z| ≤ c in (21.5) to relax it to
r
2E
ϵ ≤ MQ( − c) + 2Q(c) .
N0
x2 log e 1
log Q(x) = − − log x − log 2π + o(1) , x→∞ (21.7)
2 2
r
2E
2k Q( − c) → 0
N0
as k → ∞. Thus choosing c > 0 sufficiently large we obtain that lim supk→∞ E∗ (k, ϵ) ≤ (1 +
τ ) logN0 e for every τ > 0. Taking τ → 0 implies (21.3).
2
Remark 21.2 (Simplex conjecture) The code (21.4) in fact achieves the first three terms
in the large-k expansion of E∗ (k, ϵ), cf. [337, Theorem 3]. In fact, the code can be further slightly
√ √
optimized by subtracting the common center of gravity (2−k E, . . . , 2−k E . . .) and rescaling
each codeword to satisfy the power constraint. The resulting constellation is known as the simplex
code. It is conjectured to be the actual optimal code achieving E∗ (k, ϵ) for a fixed k and ϵ; see [105,
Section 3.16] and [401] for more.
i i
i i
i i
414
P[g(Y∞ ) 6= W] ≤ ϵ ,
Let C(P) be the capacity-cost function of the channel (in the usual sense of capacity, as defined
in (20.1)). Assuming P0 = 0 and C(0) = 0 it is not hard to show (basically following the steps of
Theorem 21.2) that:
C(P) C(P) d
Cpuc = sup = lim = C(P) .
P P P→ 0 P dP P=0
The surprising discovery of Verdú [434] is that one can avoid computing C(P) and derive the Cpuc
directly. This is a significant help, as for many practical channels C(P) is unknown. Additionally,
this gives a yet another fundamental meaning to the KL divergence.
Q
Theorem 21.4 For a stationary memoryless channel PY∞ |X∞ = PY|X with P0 = c(x0 ) = 0
(i.e. there is a symbol of zero cost), we have
D(PY|X=x kPY|X=x0 )
Cpuc = sup .
x̸=x0 c(x)
Proof. Let
D(PY|X=x kPY|X=x0 )
CV = sup .
x̸=x0 c(x)
i i
i i
i i
where we denoted for convenience d(x) ≜ D(PY|X=x kPY|X=x0 ). By the definition of CV we have
d(x) ≤ c(x)CV .
where the last step is by the cost constraint (21.8). Thus, dividing by E and taking limits we get
Cpuc ≤ CV .
Achievability: We generalize the PPM code (21.4). For each x1 ∈ X and n ∈ Z+ we define the
encoder f as follows:
f(1) = (x1 , x1 , . . . , x1 , x0 , . . . , x0 )
| {z } | {z }
n-times n(M−1)-times
f(2) = (x0 , x0 , . . . , x0 , x1 , . . . , x1 , x0 , . . . , x0 )
| {z } | {z } | {z }
n-times n-times n(M−2)-times
···
f ( M ) = ( x0 , . . . , x0 , x1 , x1 , . . . , x1 )
| {z } | {z }
n(M−1)-times n-times
Now, by Stein’s lemma (Theorem 14.14) there exists a subset S ⊂ Y n with the property that
i i
i i
i i
416
Yn ∈ S =⇒ Ŵ = 1
Y2n
n+1 ∈S =⇒ Ŵ = 2
···
From the union bound we find that the overall probability of error is bounded by
At the same time the total cost of each codeword is given by nc(x1 ). Thus, taking n → ∞ and
after straightforward manipulations, we conclude that
D(PY|X=x1 kPY|X=x0 )
Cpuc ≥ .
c(x1 )
This holds for any symbol x1 ∈ X , and so we are free to take supremum over x1 to obtain Cpuc ≥
CV , as required.
Yj = Hj Xj + Zj , Hj ∼ Nc (0, 1) ⊥
⊥ Zj ∼ Nc (0, N0 ).
(We use here a more convenient C-valued fading channel with Hj ∼ Nc , known as the Rayleigh
fading). The cost function is the usual quadratic one c(x) = |x|2 . As we discussed previously,
cf. (20.9), the capacity-cost function C(P) is unknown in closed form, but is known to behave
drastically different from the case of non-fading AWGN (i.e. when Hj = 1). So here Theorem 21.4
comes handy. Let us perform a simple computation required, cf. (2.9):
D(Nc (0, |x|2 + N0 )kNc (0, N0 ))
Cpuc = sup
x̸=0 | x| 2
log(1 + |Nx|0 )
2
1
= sup log e − |x|2
(21.11)
N0 x̸=0
N0
log e
=
N0
Comparing with Theorem 21.2 we discover that surprisingly, the capacity-per-unit-cost is unaf-
fected by the presence of fading. In other words, the random multiplicative noise which is so
i i
i i
i i
detrimental at high SNR, appears to be much more benign at low SNR (recall that Cpuc = C′ (0)
and thus computing Cpuc corresponds to computing C(P) at P → 0). There is one important differ-
ence: the supremization over x in (21.11) is solved at x = ∞. Following the proof of the converse
bound, we conclude that any code hoping to achieve optimal Cpuc must satisfy a strange constraint:
X X
|xt |2 1{|xt | ≥ A} ≈ | xt | 2 ∀A > 0
t t
i.e. the total energy expended by each codeword must be almost entirely concentrated in very
large spikes. Such a coding method is called “flash signaling”. Thus, we can see that unlike the
non-fading AWGN (for which due to rotational symmetry all codewords can be made relatively
non-spiky), the only hope of achieving full Cpuc in the presence of fading is by signaling in short
bursts of energy. Thus, while the ultimate minimal energy-per-bit is the same for the AWGN or
the fading channel, the nature of optimal coding schemes is rather different.
Another fundamental difference between the two channels is revealed in the finite blocklength
behavior of E∗ (k, ϵ). Specifically, we have the following different asymptotic expansions for the
∗
energy-per-bit E (kk,ϵ) :
r
E∗ (k, ϵ) constant −1
= (−1.59 dB) + Q (ϵ) (AWGN)
k k
r
E∗ (k, ϵ) 3 log k 2
= (−1.59 dB) + (Q−1 (ϵ)) (non-coherent fading.)
k k
That is we see that the speed of convergence to Shannon limit is much slower under fading. Fig-
ure 21.1 shows this effect numerically by plotting evaluation of (the upper and lower bounds for)
E∗ (k, ϵ) for the fading and non-fading channel. We see that the number of data bits k that need
to be coded over for the fading channel is about factor of 103 larger than for the AWGN channel.
See [463] for further details.
i i
i i
i i
418
14
12
10
Achievability
8
Converse
dB
2
fading+CSIR, non-fading AWGN
0
−1.59 dB
−2
100 101 102 103 104 105 106 107 108
Information bits, k
Figure 21.1 Comparing the energy-per-bit required to send a packet of k-bits for different channel models
∗
(curves represent upper and lower bounds on the unknown optimal value E (k,ϵ) k
). As a comparison: to get to
−1.5 dB one has to code over 6 · 104 data bits when the channel is non-fading AWGN or fading AWGN with
Hj known perfectly at the receiver. For fading AWGN without knowledge of Hj (no CSI), one has to code over
at least 7 · 107 data bits to get to the same −1.5 dB. Plot generated using [397].
channel by introducing the standard Wiener process (Brownian motion) Wt and setting
Z t r
N0
Yint (t) = X(τ )dτ + Wt ,
0 2
where Wt is the zero-mean Gaussian process with covariance function
E[Ws Wt ] = min(s, t) .
Denote by L2 ([0, T]) the space of all square-integrable functions on [0, T]. Let M∗ (T, ϵ, P) the
maximum number of messages that can be sent through this channel such that given an encoder
f : [M] → L2 [0, T] for each m ∈ [M] the waveform x(t) ≜ f(m)
Theorem 21.5 The maximal reliable rate of communication across the continuous-time AWGN
P
channel is N0 log e (per unit of time). More formally, we have
1 P
lim lim inf log M∗ (T, ϵ, P) = log e (21.12)
ϵ→0 T→∞ T N0
i i
i i
i i
Proof. Note that the space L2 [0, T] has a countable basis (e.g. sinusoids). Thus, by expanding our
input and output waveforms in that basis we obtain an equivalent channel model:
N
0
Ỹj = X̃j + Z̃j , Z̃j ∼ N 0, ,
2
and energy constraint (dependent upon duration T):
X
∞
X̃2j ≤ PT .
j=1
But then the problem is equivalent to the energy-per-bit for the (discrete-time) AWGN channel
(see Theorem 21.2) and hence
Thus,
1 P P
lim lim inf log2 M∗ (T, ϵ, P) = E∗ (k,ϵ)
= log2 e ,
ϵ→0 n→∞ T limϵ→0 lim supk→∞ N0
k
1
Here we already encounter a major issue: the waveform x(t) supported on a finite interval (0, T] cannot have spectrum
supported on a compact. The requirements of finite duration and finite spectrum are only satisfied by the zero waveform.
Rigorously, one should relax the bandwidth constraint to requiring that the signal have a vanishing out-of-band energy as
T → ∞. As we said, rigorous treatment of this issue lead to the theory of prolate spheroidal functions [391].
i i
i i
i i
420
In other words, the capacity of this channel is B log(1 + NP0 B ). To understand the idea of the proof,
we need to recall the concept of modulation first. Every signal X(t) that is required to live in
[fc − B/2, fc + B/2] frequency band can be obtained by starting with a complex-valued signal XB (t)
with frequency content in [−B/2, B/2] and mapping it to X(t) via the modulation:
√
X(t) = Re(XB (t) 2ejωc t ) ,
where ωc = 2πfc . Upon receiving the sum Y(t) = X(t) + N(t) of the signal and the white noise
N(t) we may demodulate Y by computing
√
YB (t) = 2LPF(Y(t)ejωc t ), ,
where the LPF is a low-pass filter removing all frequencies beyond [−B/2, B/2]. The important
fact is that converting from Y(t) to YB (t) does not lose information.
Overall we have the following input-output relation:
e ( t) ,
YB (t) = XB (t) + N
e (t) is a complex Gaussian white noise and
where all processes are C-valued and N
e ( t) N
E[ N e (s)∗ ] = N0 δ(t − s).
where sincB (x) = sin(xBx) and Xi = XB (i/B). After the Nyquist sampling on XB and YB we get the
following equivalent input-output relation:
Yi = Xi + Zi , Zi ∼ Nc (0, N0 ) (21.14)
R∞
where the noise Zi = t=−∞ N e (t)sincB (t − i )dt. Finally, given that XB (t) is only non-zero for
B
t ∈ (0, T] we see that the C-AWGN channel (21.14) is only allowed to be used for i = 1, . . . , TB.
This fact is known in communication theory as “bandwidth B and duration T signal has BT complex
degrees of freedom”.
Let us summarize what we obtained so far:
i i
i i
i i
i i
i i
i i
In previous chapters our main object of study was the fundamental limit of blocklength-n coding:
Finally, the finite blocklength information theory strives to prove the sharpest possible computa-
tional bounds on log M∗ (n, ϵ) at finite n, which allows evaluating real-world codes’ performance
taking their latency n into account. These results are surveyed in this chapter.
Theorem 22.1 For any stationary memoryless channel with either |A| < ∞ or |B| < ∞ we
have Cϵ = C for 0 < ϵ < 1. Equivalently, for every 0 < ϵ < 1 we have
422
i i
i i
i i
Pe
1
10−1
10−2
10−3
10−4
10−5
SNR
In other words, below a certain critical SNR, the probability of error quickly approaches 1, so
that the receiver cannot decode anything meaningful. Above the critical SNR the probability of
error quickly approaches 0 (unless there is an effect known as the error floor, in which case prob-
ability of error decreases reaches that floor value and stays there regardless of the further SNR
increase). Thus, long-blocklength codes have a threshold-like behavior of probability of error sim-
ilar to (22.1). Besides changing SNR instead of rate, there is another important difference between
a waterfall plot and (22.1). The former applies to only a single (perhaps rather suboptimal) code,
while the latter is a statement about the best possible code for each (n, R) pair.
Proof. We will improve the method used in the proof of Theorem 17.3. Take an (n, M, ϵ)-code
and consider the usual probability space
W → Xn → Yn → Ŵ ,
where W ∼ Unif([M]). Note that PXn is the empirical distribution induced by the encoder at the
channel input. We denote the joint measure on (W, Xn , Yn , Ŵ) induced in this way by P. Our goal
is to replace this probability space with a different one where the true channel PYn |Xn = P⊗ n
Y|X is
replaced with an auxiliary channel (which is a “dummy” one in this case):
i i
i i
i i
424
We will denote the measure on (W, Xn , Yn , Ŵ) induced by this new channel by Q. Note that for
communication purposes, QYn |Xn is a useless channel since it ignores the input and randomly picks
i.i.d.
a member of the output space according to Yi ∼ QY , so that Xn and Yn are independent (under Q).
Therefore, for the probability of success under each channel we have
1
Q[Ŵ = W] =
M
P[Ŵ = W] ≥ 1 − ϵ
n o
Therefore, the random variable 1 Ŵ = W is likely to be 1 under P and likely to be 0 under Q.
It thus looks like a rather good choice for a binary hypothesis test statistic distinguishing the two
distributions, PW,Xn ,Yn ,Ŵ and QW,Xn ,Yn ,Ŵ . Since no hypothesis test can beat the optimal (Neyman-
Pearson) test, we get the upper bound
1
β1−ϵ (PW,Xn ,Yn ,Ŵ , QW,Xn ,Yn ,Ŵ ) ≤ (22.2)
M
(Recall the definition of β from (14.3).) The likelihood ratio is a sufficient statistic for this
hypothesis test, so let us compute it:
PW,Xn ,Yn ,Ŵ PW PXn |W PYn |Xn PŴ|Yn PW|Xn PXn ,Yn PŴ|Yn PXn ,Yn
= ⊗
= =
QW,Xn ,Yn ,Ŵ n
PW PXn |W (QY ) PŴ|Yn PW|Xn PXn (QY )⊗n PŴ|Yn PXn (QY )⊗n
Therefore, inequality above becomes
1
β1−ϵ (PXn ,Yn , PXn (QY )⊗n ) ≤ (22.3)
M
Computing the LHS of this bound may appear to be impossible because the distribution PXn
depends on the unknown code. However, it will turn out that a judicious choice of QY will make
knowledge of PXn unnecessary. Before presenting a formal argument, let us consider a special case
of the BSCδ channel. It will show that for symmetric channels we can select QY to be the capacity
achieving output distribution (recall, that it is unique by Corollary 5.5). To treat the general case
later we will (essentially) decompose the channel into symmetric subchannels (corresponding to
“composition” of the input).
Special case: BSCδ . So let us take PYn |Xn = BSC⊗ δ
n
and for QY we will take the capacity
achieving output distribution which is simply QY = Ber(1/2).
PYn |Xn (yn |xn ) = PnZ (yn − xn ), Zn ∼ Ber(δ)n
( Q Y ) ⊗ n ( yn ) = 2 − n
From the Neyman-Pearson lemma, the optimal HT takes the form
⊗n PXn Yn PXn Yn
βα (PXn Yn , PXn (QY ) ) = Q log ≥ γ where α = P log ≥γ
| {z } | {z } PXn (QY )⊗n PXn (QY )⊗n
P Q
i i
i i
i i
Notice that the effect of unknown PXn completely disappeared, and so we can compute βα :
1
βα (PXn Yn , PXn (QY )⊗n ) = βα (Ber(δ)⊗n , Ber( )⊗n ) (22.4)
2
1
= exp{−nD(Ber(δ)kBer( )) + o(n)} (by Stein’s Lemma: Theorem 14.14)
2
Putting this together with our main bound (22.3), we see that any (n, M, ϵ) code for the BSC
satisfies
1
log M ≤ nD(Ber(δ)kBer( )) + o(n) = nC + o(n) .
2
Clearly, this implies the strong converse for the BSC. (For a slightly different, but equivalent, proof
see Exercise IV.32 and for the AWGN channel see Exercise IV.33).
For the general channel, let us denote by P∗Y the capacity achieving output distribution. Recall
that by Corollary 5.5 it is unique and by (5.1) we have for every x ∈ A:
This property will be very useful. We next consider two cases separately:
1 If |B| < ∞ we take QY = P∗Y and note that from (19.31) we have
X
PY|X (y|x0 ) log2 PY|X (y|x0 ) ≤ log2 |B| ∀ x0 ∈ A
y
and since miny P∗Y (y)> 0 (without loss of generality), we conclude that for some constant
K > 0 and for all x0 ∈ A we have
PY|X (Y|X = x0 )
Var log | X = x0 ≤ K < ∞ .
QY (Y)
Thus, if we let
X
n
PY|X (Yi |Xi )
Sn = log ,
P∗Y (Yi )
i=1
then we have
i i
i i
i i
426
By simple counting1 it is clear that from any (n, M, ϵ) code, it is possible to select an (n, M′ , ϵ)
subcode, such that a) all codeword have the same composition P0 ; and b) M′ > (n+1M )|A|−1
. Note
′ ′
that, log M = log M + O(log n) and thus we may replace M with M and focus on the analysis of
the chosen subcode. Then we set QY = PY|X ◦ P0 . From now on we also assume that P0 (x) > 0
for all x ∈ A (otherwise just reduce A). Let i(x; y) denote the information density with respect
to P0 PY|X . If X ∼ P0 then I(X; Y) = D(PY|X kQY |P0 ) ≤ log |A| < ∞ and we conclude that
PY|X=x QY for each x and thus
dPY|X=x
i(x; y) = log ( y) .
dQY
From (19.28) we have
So if we define
X
n
dPY|X=Xi (Yi |Xi ) X n
Sn = log ( Yi ) = i(Xi ; Yi ) ,
dQY
i=1 i=1
1
This kind of reduction from a general code to a constant-composition subcode is the essence of the method of types [115].
i i
i i
i i
the Theorem, the strong converse for it does hold as well (see Ex. IV.33). Third, this method of
proof is also known as “sphere-packing”, for the reason that becomes clear if we do the example
of the BSC slightly differently (see Ex. IV.32).
A = {(j, m) : j, m ∈ Z+ , 0 ≤ j ≤ m} .
sup I(X; Y) = C .
PX
Thus by Theorem 19.9 the capacity of the corresponding stationary memoryless channel is C. We
next show that nevertheless the ϵ-capacity can be strictly greater than C.
Indeed, fix blocklength n and consider a single letter distribution PX assigning equal weights
to all atoms (j, m) with m = exp{2nC}. It can be shown that in this case, the distribution of a
i i
i i
i i
428
22.3 Meta-converse
We have seen various ways in which one can derive upper (impossibility or converse) bounds on
the fundamental limits such as log M∗ (n, ϵ). In Theorem 17.3 we used data-processing and Fano’s
inequalities. In the proof of Theorem 22.1 we reduced the problem to that of hypothesis testing.
There are many other converse bounds that were developed over the years. It turns out that there
is a very general approach that encompasses all of them. For its versatility it is sometimes referred
to as the “meta-converse”.
To describe it, let us fix a Markov kernel PY|X (usually, it will be the n-letter channel PYn |Xn ,
but in the spirit of “one-shot” approach, we avoid introducing blocklength). We are also given a
certain (M, ϵ) code and the goal is to show that there is an upper bound on M in terms of PY|X and
ϵ. The essence of the meta-converse is described by the following diagram:
PY |X
W Xn Yn Ŵ
QY |X
Here the W → X and Y → Ŵ represent encoder and decoder of our fixed (M, ϵ) code. The upper
arrow X → Y corresponds to the actual channel, whose fundamental limits we are analyzing. The
lower arrow is an auxiliary channel that we are free to select.
The PY|X or QY|X together with PX (distribution induced by the code) define two distribu-
tions: PX,Y and QX,Y . Consider a map (X, Y) 7→ Z ≜ 1{W 6= Ŵ} defined by the encoder and
decoder pair (if decoders are randomized or W → X is not injective, we consider a Markov kernel
PZ|X,Y (1|x, y) = P[Z = 1|X = x, Y = y] instead). We have
PX,Y [Z = 0] = 1 − ϵ, QX,Y [Z = 0] = 1 − ϵ′ ,
i i
i i
i i
where ϵ and ϵ′ are the average probabilities of error of the given code under the PY|X and QY|X
respectively. This implies the following relation for the binary HT problem of testing PX,Y vs
QX,Y :
The high-level idea of the meta-converse is to select a convenient QY|X , bound 1 − ϵ′ from above
(i.e. prove a converse result for the QY|X ), and then use the Neyman-Pearson β -function to lift the
Q-channel converse to P-channel.
How one chooses QY|X is a matter of art. For example, in the proof of Case 2 of Theorem 22.1
we used the trick of reducing to the constant-composition subcode. This can instead be done by
taking QYn |Xn =c = (PY|X ◦ P̂c )⊗n . Since there are at most (n + 1)|A|−1 different output distributions,
we can see that
(n + 1)∥A|−1
1 − ϵ′ ≤ ,
M
and bounding of β can be done similar to Case 2 proof of Theorem 22.1. For channels with
|A| = ∞ the technique of reducing to constant-composition codes is not available, but the meta-
converse can still be applied. Examples include proof of parallel AWGN channel’s dispersion [333,
Theorem 78] and the study of the properties of good codes [340, Theorem 21].
However, the most common way of using meta-converse is to apply it with the trivial channel
QY|X = QY . We have already seen this idea in Section 22.1. Indeed, with this choice the proof
of the converse for the Q-channel is trivial, because we always have: 1 − ϵ′ = M1 . Therefore, we
conclude that any (M, ϵ) code must satisfy
1
≥ β1−ϵ (PX,Y , PX QY ) . (22.12)
M
Or, after optimization we obtain
1
≥ inf sup β1−ϵ (PX,Y , PX QY ) .
M∗ (ϵ) PX QY
This is a special case of the meta-converse known as the minimax meta-converse. It has a number
of interesting properties. First, the minimax problem in question possesses a saddle-point and is of
convex-concave type [341]. It, thus, can be seen as a stronger version of the capacity saddle-point
result for divergence in Theorem 5.4.
Second, the bound given by the minimax meta-converse coincides with the bound we obtained
before via linear programming relaxation (18.22), as discovered by [295]. To see this connection,
instead of writing the meta-converse as an upper bound M (for a given ϵ) let us think of it as an
upper bound on 1 − ϵ (for a given M).
We have seen that existence of an (M, ϵ)-code for PY|X implies existence of the (stochastic) map
(X, Y) 7→ Z ∈ {0, 1}, denoted by PZ|X,Y , with the following property:
1
PX,Y [Z = 0] ≥ 1 − ϵ, and P X QY [ Z = 0] ≤ ∀ QY .
M
i i
i i
i i
430
That is PZ|X,Y is a test of a simple null hypothesis (X, Y) ∼ PX,Y against a composite alternative
(X, Y) ∼ PX QY for an arbitrary QY . In other words every (M, ϵ) code must satisfy
1 − ϵ ≤ α̃(M; PX ) ,
Let us now replace PX with a π x ≜ MPX (x), x ∈ X . It is clear that π ∈ [0, 1]X . Let us also
replace the optimization variable with rx,y ≜ MPZ|X,Y (0|x, y)PX (x). With these notational changes
we obtain
1 X X
α̃(M; PX ) = sup{ PY|X (y|x)rx,y : 0 ≤ rx,y ≤ π x , rx,y ≤ 1} .
M x, y x
It is now obvious that α̃(M; PX ) = SLP (π ) defined in (18.21). Optimizing over the choice of PX
P
(or equivalently π with x π x ≤ M) we obtain
1 1 X S∗ (M)
1−ϵ≤ SLP (π ) ≤ sup{SLP (π ) : π x ≤ M} = LP .
M M x
M
Now recall that in (18.23) we showed that a greedy procedure (essentially, the same as the one we
used in the Feinstein’s bound Theorem 18.7) produces a code with probability of success
1 S∗ (M)
1 − ϵ ≥ (1 − ) LP .
e M
This indicates that in the regime of a fixed ϵ the bound based on minimax meta-converse should
be very sharp. This, of course, provided we can select the best QY in applying it. Fortunately, for
symmetric channels optimal QY can be guessed fairly easily, cf. [341] for more.
i i
i i
i i
We motivate the question by trying to understand the speed of convergence in the strong con-
verse (22.1). If we return to the proof of Theorem 19.9, namely the step (19.7), we see that by
applying large-deviations Theorem 15.9 we can prove that for some Ẽ(R) and any R < C we have
ϵ∗ (n, exp{nR}) ≤ exp{−nẼ(R)} .
What is the best value of Ẽ(R) for each R? This is perhaps the most famous open question in all
of channel coding. Let us proceed in more details.
We will treat both regimes R < C and R > C. The reliability function E(R) of a channel is
defined as follows:
(
limn→∞ − 1n log ϵ∗ (n, exp{nR}) R<C
E(R) =
limn→∞ − 1n log(1 − ϵ∗ (n, exp{nR})) R > C .
We leave E(R) as undefined if the limit does not exist. Unfortunately, there is no general argument
showing that this limit exist. The only way to show its existence is to prove an achievability bound
1
lim inf − log ϵ∗ (n, exp{nR}) ≥ Elower (R) ,
n→∞ n
a converse bound
1
lim sup − log ϵ∗ (n, exp{nR}) ≤ Eupper (R) ,
n→∞ n
and conclude that the limit exist whenever Elower = Eupper . It is common to abuse notation and
write such pair of bounds as
Elower (R) ≤ E(R) ≤ Eupper (R) ,
even though, as we said, the E(R) is not known to exist unless the two bounds match unless the
two bounds match.
From now on we restrict our discussion to the case of a DMC. An important object to
define is the Gallager’s E0 function, which is nothing else than the right-hand side of Gallager’s
bound (18.15). For the DMC it has the following expression:
!1+ρ
X X 1
E0 (ρ, PX ) = − log PX (x)PY|X (y|x)
1+ρ
y∈B x∈A
This expression is defined in terms of the single-letter channel PY|X . It is not hard to see that E0
function for the n-letter extension evaluated with P⊗ n
X just equals nE0 (ρ, PX ), i.e. it tensorizes
2
similar to mutual information. From this observation we can apply Gallager’s random coding
2
There is one more very pleasant analogy with mutual information: the optimization problems in the definition of E0 (ρ)
also tensorize. That is, the optimal distribution for the n-letter channel is just P⊗n
X , where PX is optimal for a single-letter
one.
i i
i i
i i
432
Optimizing the choice of PX we obtain our first estimate on the reliability function
An analysis, e.g. [177, Section 5.6], shows that the function Er (R) is a convex, decreasing and
strictly positive on 0 ≤ R < C. Therefore, Gallager’s bound provides a non-trivial estimate of
the reliability function for the entire range of rates below capacity. At rates R → C the optimal
choice of ρ → 0. As R departs further away from the capacity the optimal ρ reaches 1 at a certain
rate R = Rcr known as the critical rate, so that for R < Rcr we have Er (R) = E0 (1) − R behaving
linearly. The Er (R) bound is shown on Figure 22.1 by a curve labeled “Random code ensemble”.
Going to the upper bounds, taking QY to be the iid product distribution in (22.12) and optimizing
yields the bound [381] known as the sphere-packing bound:
Comparing the definitions of Esp and Er we can see that for Rcr < R < C we have
thus establishing reliability function value for high rates. However, for R < Rcr we have Esp (R) >
Er (R), so that E(R) remains unknown. The Esp (R) bound is shown on Figure 22.1 by a curve
labeled “Sphere-packing (volume)”.
Both upper and lower bounds have classical improvements. The random coding bound can be
improved via technique known as expurgation showing
and Eex (R) > Er (R) for rates R < Rx where Rx ≤ Rcr is the second critical rate; see Exercise IV.31.
The Eex (R) bound is shown on Figure 22.1 by a curve labeled “Typical random linear code (aka
expurgation)”. (See below for the explanation of the naming.)
The sphere packing bound can also be improved at low rates by analyzing a combinatorial
packing problem by showing that any code must have a pair of codewords which are close (in terms
of Hellinger distance between the induced output distributions) and concluding that confusing
these two leads to a lower bound on probability of error (via (16.3)). This class of bounds is
known as “minimum distance” based bounds and several of them are shown on Figure 22.1 with
the strongest labeled “MRRW + mindist”, corresponding to the currently the best known minimum
distance upper bound due to [299]. (This bound also known as a linear programming or JPL bound
has not seen improvements in 60 years and it is a long-standing open problem in combinatorics
to do so.)
The straight-line bound [177, Theorem 5.8.2] allows to interpolate between any minimum dis-
tance bound and the Esp (R). Unfortunately, these (classical) improvements tightly bound E(R) at
i i
i i
i i
This state of affairs remains unchanged (for a general DMC) since the foundational work of Shan-
non, Gallager and Berlekamp in 1967. As far as we know, the common belief is that Eex (R) is in
fact the true value of E(R) for all rates. As we mentioned above this is perhaps one of the most
famous open problems in classical information theory.
We demonstrate these bounds (with exception of the straight-line bound) on the reliability func-
tion on Figure 22.1 for the case of the BSCδ . For this channel, there is an interesting interpretation
of the expurgated bound. To explain it, let us recall the different ensembles of random codes
that we discussed in Section 18.6. In particular, we had the Shannon ensemble (as used in Theo-
rem 18.5) and the random linear code (either Elias or Gallager ensembles, we do not need to make
a distinction here).
For either ensemble, it it is known [178] that Er (R) is not just an estimate, but in fact the exact
value of the exponent of the average probability of error (averaged over a code in the ensemble).
For either ensemble, however, for low rates the average is dominated by few bad codes, whereas
a typical (high probability) realization of the code has a much better error exponent. For Shannon
ensemble this happens at R < 12 Rx and for the linear ensemble it happens at R < Rx . Furthermore,
the typical linear code in fact has error exponent exactly equal to the expurgated exponent Eex (R),
see [34].
There is a famous conjecture in combinatorics stating that the best possible minimum pairwise
Hamming distance of a code with rate R is given by the Gilbert-Varshamov bound (Theorem 27.5).
If true, this would imply that E(R) = Eex (R) for R < Rx , see e.g. [283].
The most outstanding development in the error exponents since 1967 was a sequence of papers
starting from [283], which proposed a new technique for bounding E(R) from above. Litsyn’s
idea was to first prove a geometric result (that any code of a given rate has a large number of
pairs of codewords at a given distance) and then use de Caen’s inequality to convert it into a lower
bound on the probability of error. The resulting bound was very cumbersome. Thus, it was rather
surprising when Barg and MacGregor [35] were able to show that the new upper bound on E(R)
matched Er (R) for Rcr − ϵ < R < Rcr for some small ϵ > 0. This, for the first time since [381]
extended the range of knowledge of the reliability function. Their amazing result (together with
Gilbert-Varshamov conjecture) reinforced the belief that the typical linear codes achieve optimal
error exponent in the whole range 0 ≤ R ≤ C.
Regarding E(R) for R > C the situation is much simpler. We have
The lower (achievability) bound here is due to Dueck [141] (see also [319]), while the harder
(converse) part is by Arimoto [25]. It was later discovered that Arimoto’s converse bound can
be derived by a simple modification of the weak converse (Theorem 17.3): instead of applying
1
data-processing to the KL divergence, one uses Rényi divergence of order α = 1+ρ ; see [338]
i i
i i
i i
434
1.4
1.2
Err.Exp. (log2)
0.8 Rx = 0.24
0.4
C = 0.86
0.2
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
Rate
Figure 22.1 Comparison of bounds on the error exponent of the BSC. The MRRW stands for the upper
bound on the minimum distance of a code [299] and Gilbert-Varshamov is a lower bound (cf. Theorem 27.5).
for details. This suggests a general conjecture that replacing Shannon information measures with
Rényi ones upgrades the (weak) converse proofs to a strong converse.
Pe = exp{−nE(R) + o(n)} .
Therefore, for a while the question of non-asymptotic characterization of log M∗ (n, ϵ) and ϵ∗ (n, M)
was equated with establishing the sharp value of the error exponent E(R). However, as codes
became better and started having rates approaching the channel capacity, the question has changed
to that of understanding behavior of log M∗ (n, ϵ) in the regime of fixed ϵ and large n (and, thus, rates
R → C). It was soon discovered by [334] that the next-order terms in the asymptotic expansion of
log M∗ give surprisingly sharp estimates on the true value of the log M∗ . Since then, the work on
channel coding focused on establishing sharp upper and lower bounds on log M∗ (n, ϵ) for finite n
(the topic of Section 22.6) and refining the classical results on the asymptotic expansions, which
we discuss here.
i i
i i
i i
We have already seen that the strong converse (Theorem 22.1) can be stated in the asymptotic
expansion form as: for every fixed ϵ ∈ (0, 1),
log M∗ (n, ϵ) = nC + o(n), n → ∞.
Intuitively, though, the smaller values of ϵ should make convergence to capacity slower. This
suggests that the term o(n) hides some interesting dependence on ϵ. What is it?
This topic has been investigated since the 1960s, see [130, 402, 334, 333] , and resulted in
the concept of channel dispersion. We first present the rigorous statement of the result and then
explain its practical uses.
1 DMC
2 DMC with cost constraint
3 AWGN
4 Parallel AWGN
Let (X∗ , Y∗ ) be the input-output of the channel under the capacity achieving input distribution, and
i(x; y) be the corresponding (single-letter) information density. The following expansion holds for
a fixed 0 < ϵ < 1/2 and n → ∞
√
log M∗ (n, ϵ) = nC − nVQ−1 (ϵ) + O(log n) (22.15)
where Q−1 is the inverse of the complementary standard normal CDF, the channel capacity is
C = I(X∗ ; Y∗ ) = E[i(X∗ ; Y∗ )], and the channel dispersion3 is V = Var[i(X∗ ; Y∗ )|X∗ ].
Proof. The full proofs of these results are somewhat technical, even for the DMC.4 Here we only
sketch the details.
First, in the absence of cost constraints the achievability (lower bound on log M∗ ) part has
already been done by us in Theorem 19.11, where we have shown that log M∗ (n, ϵ) ≥ nC −
√ √
nVQ−1 (ϵ) + o( n) by refining the proof of the noisy channel coding theorem and using the
CLT. Replacing the CLT with its non-asymptotic version (Berry-Esseen inequality [165, Theorem
√
2, Chapter XVI.5]) improves o( n) to O(log n). In the presence of cost constraints, one is inclined
to attempt to use an appropriate version of the achievability bound such as Theorem 20.7. However,
for the AWGN this would require using input distribution that is uniform on the sphere. Since this
distribution is non-product, the information density ceases to be a sum of iid, and CLT is harder
to justify. Instead, there is a different achievability bound known as the κ-β bound [334, Theorem
25] that has become the workhorse of achievability proofs for cost-constrained channels with
continuous input spaces.
3
There could be multiple capacity-achieving input distributions, in which case PX∗ should be chosen as the one that
minimizes Var[i(X∗ ; Y∗ )|X∗ ]. See [334] for more details.
4
Recently, subtle gaps in [402] and [334] in the treatment of DMCs with non-unique capacity-achieving input distributions
were found and corrected in [81].
i i
i i
i i
436
The upper (converse) bound requires various special methods depending on the channel. How-
ever, the high-level idea is to always apply the meta-converse bound from (22.12) with an
appropriate choice of QY . Most often, QY is taken as the n-th power of the capacity achieving out-
put distribution for the channel. We illustrate the details for the special case of the BSC. In (22.4)
we have shown that
1
log M∗ (n, ϵ) ≤ − log βα (Ber(δ)⊗n , Ber( )⊗n ) . (22.16)
2
On the other hand, Exercise III.10 shows that
1 1 √ √
− log β1−ϵ (Ber(δ)⊗n , Ber( )⊗n ) = nd(δk ) + nvQ−1 (ϵ) + o( n) ,
2 2
where v is just the variance of the (single-letter) log-likelihood ratio:
" #
δ 1−δ δ δ
v = VarZ∼Ber(δ) Z log 1 + (1 − Z) log 1 = Var[Z log ] = δ(1 − δ) log2 .
2 2
1 − δ 1 − δ
Upon inspection we notice that v = V – the channel dispersion of the BSC, which completes the
proof of the upper bound:
√ √
log M∗ (n, ϵ) ≤ nC − nVQ−1 (ϵ) + o( n)
√
Improving the o( n) to O(log n) is done by applying the Berry-Esseen inequality in place of the
CLT, similar to the upper bound. Many more details on these proofs are contained in [333].
Remark 22.1 (Zero dispersion) We notice that V = 0 is entirely possible. For example,
consider an additive-noise channel Y = X + Z over some abelian group G with Z being uniform
on some subset of G, e.g. channel in Exercise IV.7. Among the zero-dispersion channels there is
a class of exotic channels [334], which for ϵ > 1/2 have asymptotic expansions of the form [333,
Theorem 51]:
log M∗ (n, ϵ) = nC + Θϵ (n 3 ) .
1
Existence of this special case is why we restricted the theorem above to ϵ < 21 .
Remark 22.2 The expansion (22.15) only applies to certain channels (as described in the
theorem). If, for example, Var[i(X∗ ; Y∗ )] = ∞, then the theorem need not hold and there might
be other stable (non-Gaussian) distributions that the n-letter information density will converge to.
Also notice that in the absence of cost constraints we have
since, by capacity saddle-point (Corollary 5.7), E[i(X∗ ; Y∗ )|X∗ = x] = C for PX∗ -almost all x.
As an example, we have the following dispersion formulas for the common channels that we
discussed so far:
i i
i i
i i
δ̄
BSCδ : V(δ) = δ δ̄ log2
δ
P ( P + 2)
AWGN (real): VAWGN (P) = log2 e
2( P + 1) 2
P ( P + 2)
AWGN (complex): V(P) = log2 e
( P + 1) 2
√
BI-AWGN: V(P) = Var[log(1 + e−2P+2 PZ
)] , Z ∼ N ( 0, 1)
where for the AWGN and BI-AWGN P is the SNR. √ We also remind that, cf. Example 3.4, for the BI-
AWGN we have C(P) = log 2 − E[log(1 + e−2P+2 PZ )]. For the Parallel AWGN, cf. Section 20.4,
we have
!2 +
log2 e X
L
σj2
Parallel AWGN: V(P, {σj , j ∈ [L]}) =
2
1− ,
2 T
j=1
PL
where T is the optimal threshold in the water-filling solution, i.e. j=1 |T − σj2 |+ = P. We remark
that the expression for the parallel AWGN channel can be guessed by noticing that it equals
PL Pj
j=1 VAWGN ( σ 2 ) with Pj = |T − σj | – the optimal power allocation.
2 +
j
What about channel dispersion for other channels? Discrete channels with memory have seen
some limited success in [335], which expresses dispersion in terms of the Fourier spectrum of the
information density process. The compound DMC (Ex. IV.19) has a much more delicate dispersion
formula (and the remainder term is not O(log n), but O(n1/4 )), see [342]. For non-discrete channels
(other than the AWGN and Poisson) new difficulties appear in the proof of the converse part. For
example, the dispersion of a (coherent) fading channel is known only if one additionally restricts
the input codewords to have limited peak values, cf. [98, Remark 1]. In particular, dispersion of
the following Gaussian erasure channel is unknown:
Y i = Hi ( X i + Z i ) ,
Pn
where we have N (0, 1) ∼ Zi ⊥ ⊥ Hi ∼ Ber(1/2) and the usual quadratic cost constraint i=1 x2i ≤
nP.
Multi-antenna (MIMO) channels (20.10) present interesting new challenges as well. For exam-
ple, for coherent channels the capacity achieving input distribution is non-unique [98]. The
quasi-static channels are similar to fading channels but the H1 = H2 = · · · , i.e. the channel
gain matrix in (20.10) is not changing with time. This channel model is often used to model cellu-
lar networks. By leveraging an unexpected amount of differential geometry, it was shown in [462]
that they have zero-dispersion, or more specifically:
where the ϵ-capacity Cϵ is known as outage capacity in this case (and depends on ϵ). The main
implication is that Cϵ is a good predictor of the ultimate performance limits for these practically-
relevant channels (better than C is for the AWGN channel, for example). But some caution must
be taken in approximating log M∗ (n, ϵ) ≈ nCϵ , nevertheless. For example, in the case where H
i i
i i
i i
438
0.5
0.4
Rate, bit/ch.use
0.3
0.2
0.1 Capacity
Converse
RCU
DT
Gallager
Feinstein
0
0 200 400 600 800 1000 1200 1400 1600 1800 2000
Blocklength, n
0.5
0.4
Rate, bit/ch.use
0.3
0.2
Capacity
Converse
0.1 Normal approximation + 1/2 log n
Normal approximation
Achievability
0
0 200 400 600 800 1000 1200 1400 1600 1800 2000
Blocklength, n
Figure 22.3 Comparing the normal approximation against the best upper and lower bounds on 1
n
log M∗ (n, ϵ)
for the BSCδ channel (δ = 0.11, ϵ = 10−3 ).
matrix is known at the transmitter, the same paper demonstrated that the standard water-filling
power allocation (Theorem 20.14) that maximizes Cϵ is rather sub-optimal at finite n.
i i
i i
i i
(The log n term in (22.15) is known to be equal to O(1) for the BEC, and 12 log n for the BSC,
AWGN and binary-input AWGN. For these latter channels, normal approximation is typically
defined with + 12 log n added to the previous display.)
For example, considering the BEC1/2 channel we can easily compute the capacity and disper-
sion to be C = (1 − δ) and V = δ(1 − δ) (in bits and bits2 , resp.). Detailed calculation in Ex. IV.4
lead to the following rigorous estimates:
i i
i i
i i
440
Pe k1 → n1 Pe k2 → n2
10−4 10−4
P∗ SNR P∗ SNR
After inspecting these plots, one may believe that the k1 → n1 code is better, since it requires a
smaller SNR to achieve the same error probability. However, this ignores the fact that the rate of
this code nk11 might be much smaller as well. The concept of normalized rate allows us to compare
the codes of different blocklengths and coding rates.
Specifically, suppose that a k → n code is given. Fix ϵ > 0 and find the value of the SNR P for
which this code attains probability of error ϵ (for example, by taking a horizontal intercept at level
ϵ on the waterfall plot). The normalized rate is defined as
k k
Rnorm (ϵ) = ≈ p ,
log2 M∗ (n, ϵ, P) nC(P) − nV(P)Q−1 (ϵ)
where log M∗ , capacity and dispersion correspond to the channel over which evaluation is being
made (most often the AWGN, BI-AWGN or the fading channel). We also notice that, of course,
the value of log M∗ is not possible to compute exactly and thus, in practice, we use the normal
approximation to evaluate it.
This idea allows us to clearly see how much different ideas in coding theory over the decades
were driving the value of normalized rate upward to 1. This comparison is show on Figure 22.4.
A short summary is that at blocklengths corresponding to “data stream” channels in cellular net-
works (n ∼ 104 ) the LDPC codes and non-binary LDPC codes are already achieving 95% of the
information-theoretic limit. At blocklengths corresponding to “control plane” (n ≲ 103 ) the polar
codes and LDPC codes are at similar performance and at 90% of the fundamental limits.
i i
i i
i i
0.95
0.9
Galileo HGA
Turbo R=1/2
0.75 Cassini/Pathfinder
Galileo LGA
Hermitian curve [64,32] (SDD)
0.7 Reed−Solomon (SDD)
BCH (Koetter−Vardy)
Polar+CRC R=1/2 (List dec.)
0.65 ME LDPC R=1/2 (BP)
0.6
0.55
0.5 2 3 4 5
10 10 10 10
Blocklength, n
Normalized rates of code families over BIAWGN, Pe=0.0001
1
0.95
0.9
Turbo R=1/3
Turbo R=1/6
Turbo R=1/4
0.85
Voyager
Normalized rate
Galileo HGA
Turbo R=1/2
Cassini/Pathfinder
0.8
Galileo LGA
BCH (Koetter−Vardy)
Polar+CRC R=1/2 (L=32)
Polar+CRC R=1/2 (L=256)
0.75
Huawei NB−LDPC
Huawei hybrid−Cyclic
ME LDPC R=1/2 (BP)
0.7
0.65
0.6 2 3 4 5
10 10 10 10
Blocklength, n
Figure 22.4 Normalized rates for various codes. Plots generated using [397] (color version recommended)
i i
i i
i i
So far we have been focusing on the paradigm for one-way communication: data are mapped to
codewords and transmitted, and later decoded based on the received noisy observations. In most
practical settings (except for storage), frequently the communication goes in both ways so that the
receiver can provide certain feedback to the transmitter. As a motivating example, consider the
communication channel of the downlink transmission from a satellite to earth. Downlink transmis-
sion is very expensive (power constraint at the satellite), but the uplink from earth to the satellite
is cheap which makes virtually noiseless feedback readily available at the transmitter (satellite).
In general, channel with noiseless feedback is interesting when such asymmetry exists between
uplink and downlink. Even in less ideal settings, noisy or partial feedbacks are commonly available
that can potentially improve the reliability or complexity of communication.
In the first half of our discussion, we shall follow Shannon to show that even with noiseless
feedback nothing (in terms of capacity) can be gained in the conventional setup. In the process, we
will also introduce the concept of Massey’s directed information. In the second half of the Chapter
we examine situations where feedback is extremely helpful: low probability of error, variable
transmission length and variable transmission power.
f1 : [ M ] → A
f2 : [ M ] × B → A
..
.
fn : [M] × B n−1 → A
• Decoder:
g : B n → [M]
442
i i
i i
i i
23.1 Feedback does not increase capacity for stationary memoryless channels 443
Here the symbol transmitted at time t depends on both the message and the history of received
symbols (causality constraint):
Xt = ft (W, Yt1−1 ).
W ∼ uniform on [M]
PY|X
X1 = f1 (W) −→ Y1
.. −→ Ŵ = g(Yn )
.
PY|X
Xn = fn (W, Yn1−1 ) −→ Yn
Figure 23.1 compares the settings of channel coding without feedback and with ideal full feedback:
W Xn channel Yn Ŵ
W Xk channel Yk Ŵ
delay
Figure 23.1 Schematic representation of coding without feedback (left) and with full noiseless feedback
(right).
Proof. Achievability: Although it is obvious that Cfb ≥ C, we wanted to demonstrate that in fact
constructing codes achieving capacity with full feedback can be done directly, without appealing
i i
i i
i i
444
to a (much harder) problem of non-feedback codes. Let π t (·) ≜ PW|Yt (·|Yt ) with the (random) pos-
terior distribution after t steps. It is clear that due to the knowledge of Yt on both ends, transmitter
and receiver have perfectly synchronized knowledge of π t . Now consider how the transmission
progresses:
1 Initialize π 0 (·) = M1 .
2 At (t + 1)-th step, encoder sends Xt+1 = ft+1 (W, Yt ). Note that selection of ft+1 is equivalent to
the task of partitioning message space [M] into classes Pa , i.e.
Notice that (this is the crucial part!) the random multiplier satisfies:
XX PY|X (y|a)
E[log Bt+1 (W)|Yt ] = π t (Pa ) log P = I(π̃ t+1 , PY|X ) (23.1)
a∈A y∈B a∈A π t (Pa )a
where π̃ t+1 (a) ≜ π t (Pa ) is a (random) distribution on A, induced by the encoder at the channel
input in round (t + 1). Note that while π t is decided before the (t + 1)-st step, design of partition
Pa (and hence π̃ t+1 ) is in the hands of the encoder.
The goal of the code designer is to come up with such a partitioning {Pa : a ∈ A} that the speed
of growth of π t (W) is maximal. Now, analyzing the speed of growth of a random-multiplicative
process is best done by taking logs:
X
t
log π t (j) = log Bs + log π 0 (j) .
s=1
Intuitively, we expect that the process log π t (W) resembles a random walk starting from − log M
and having a positive drift. Thus to estimate the time it takes for this process to reach value 0
we need to estimate the upward drift. Appealing to intuition and the law of large numbers (more
exactly to the theory of martingales) we approximate
X
t
log π t (W) − log π 0 (W) ≈ E[log Bs ] .
s=1
Finally, from (23.1) we conclude that the best idea is to select partitioning at each step in such a
way that π̃ t+1 ≈ P∗X (capacity-achieving input distribution) and this obtains
implying that the transmission terminates in time ≈ logCM . The important lesson here is the follow-
ing: The optimal transmission scheme should map messages to channel inputs in such a way that
i i
i i
i i
23.1 Feedback does not increase capacity for stationary memoryless channels 445
the induced input distribution PXt+1 |Yt is approximately equal to the one maximizing I(X; Y). This
idea is called posterior matching and explored in detail in [384].1
Although our argument above is not rigorous, it is not hard to make it such by an appeal to
theory of martingale converges, very similar to the way we used it in Section 16.3* to analyze
SPRT. We omit those details (see [384]), since the result is in principle not needed for the proof
of the Theorem.
Converse: We are left to show that Cfb ≤ C(I) . Recall the key in proving weak converse for
channel coding without feedback: Fano’s inequality plus the graphical model
W → Xn → Yn → Ŵ. (23.2)
Then
With feedback the probabilistic picture becomes more complicated as the following Figure 23.2
demonstrates for n = 3 (dependence introduced by the extra squiggly arrows):
X1 Y1 X1 Y1
W X2 Y2 Ŵ W X2 Y2 Ŵ
X3 Y3 X3 Y3
without feedback with feedback
Figure 23.2 Graphical model for channel coding and n = 3 with and without feedback. Double arrows ⇒
correspond to the channel links.
Notice that the d-separation criterion shows we no longer have Markov chain (23.2), i.e. given
X the W and Yn are not independent.2 Furthermore, the input-output relation is also no longer
n
memoryless
Y
n
PYn |Xn (yn |xn ) 6= PY|X (yj |xj )
j=1
1
This simple (but capacity-achieving) feedback coding scheme also helps us appreciate more fully the magic of Shannon’s
(non-feedback) coding theorem, which demonstrated that the (almost) optimal partitioning can be done in a way that is
totally blind to actual channel outputs. That is, we can preselect partitions Pa that are independent of π t (but dependent on
t) and so that π t (Pa ) ≈ P∗X (a) with overwhelming probability and for almost all t ∈ [n].
2
For example, suppose we are transmitting W ∼ Ber(1/2) over the BSC and set X1 = 0, X2 = W ⊕ Y1 . Then given X1 , X2
we see that Y2 and W can be exactly determined from one another.
i i
i i
i i
446
(as an example, let X2 = Y1 and thus PY1 |X1 X2 = δX1 is a point mass). Nevertheless, there is still a
large degree of independence in the channel. Namely, we have
(Yt−1 , W) →Xt → Yt , t = 1, . . . , n (23.3)
W → Y → Ŵ
n
(23.4)
Then
−h(ϵ) + ϵ̄ log M ≤ I(W; Ŵ) (Fano)
≤ I(W; Y ) n
(Data processing applied to (23.4))
X
n
= I(W; Yt |Yt−1 ) (Chain rule)
t=1
Xn
≤ I(W, Yt−1 ; Yt ) (I(W; Yt |Yt−1 ) = I(W, Yt−1 ; Yt ) − I(Yt−1 ; Yt ))
t=1
X
n
≤ I(Xt ; Yt ) (Data processing applied to (23.3))
t=1
≤ nC
In comparison with Theorem 22.2, the following result shows that, with fixed-length block cod-
ing, feedback does not even improve the speed of approaching capacity and can at most improve
the third-order log n terms.
Theorem 23.4 (Dispersion with feedback [131, 336]) For weakly input-symmetric
DMC (e.g. additive noise, BSC, BEC) we have:
√
log M∗fb (n, ϵ) = nC − nVQ−1 (ϵ) + O(log n)
i i
i i
i i
PW,Xn ,Yn ,Ŵ = PW,Xn PYn |Xn PŴ|Yn , QW,Xn ,Yn ,Ŵ = QW,Xn QYn QŴ|Yn .
We are free to choose factors QW,Xn , QYn and QŴ|Yn . However, as we will see soon, it is best
to choose them to minimize D(PkQ) which gives us (see the discussion of information flow
after (4.6))
and achieved by taking the factors equal to their values under P, namely QW,Xn = PW,Xn , QYn =
PYn and QŴ|Yn = PŴ|Yn . (It is a good exercise to show this by writing out the chain rule for
divergence (2.26).) As we know this minimal value of D(PkQ) measures the information flow
through the links excluded in the graphical model, i.e. through Xn → Yn .
From here we proceed via the data-processing inequality and tensorization of capacity for
memoryless channels as follows:
(∗) X
n
1 DPI
−h(ϵ) + ϵ̄ log M = d(1 − ϵk ) ≤ D(PkQ) = I(Xn ; Yn ) ≤ I(Xi ; Yi ) ≤ nC(I) (23.8)
M
i=1
where the (∗) followed from the fact that the Xn → Yn is a memoryless channel ((6.1)).
Let us now go back to the case of channels with feedback. There are several problems with
adapting the previous argument. First, when feedback is present, Xn → Yn is not memoryless due
to the influence of the transmission protocol (for example, knowing both X1 and X2 affects the law
Qn
of Y1 , that is PY1 |X1 6= PY1 |X1 ,X2 and also PYn |Xn 6= j=1 PYj |Xj even for the DMC). However, an
even bigger problem is revealed by attempting to replicate the previous proof.
Suppose we again try to induce an auxiliary probability space Q as in (23.6). Then due to lack
of Markov chain under P (i.e. (23.5)) solution of the problem (23.7) can be shown to equal this
time
This value can be quite a bit higher than capacity. For example, consider an extremely noisy
(in fact, useless) channel BSC1/2 and a feedback transmission scheme Xt+1 = Yt . We see that
I(W, Xn ; Yn ) ≥ H(Yn−1 ) = (n − 1) log 2, whereas capacity C = 0. What went wrong in this case?
For the explanation, we should revisit the graphical model under P as shown on Figure 23.2
(right graph). When Q is defined as in (23.6) the value min D(PkQ) = I(W, Xn ; Yn ) measures the
information flow through both the ⇒ and ⇝ links.
This motivates us to find a graphical model for Q such that min D(PkQ) only captured the
information flow through only the ⇒ links {Xi → Yi : i = 1, . . . , n} (and so that min D(PkQ) ≤
⊥ Ŵ, so that Q[W = Ŵ] = M1 .
nC(I) ), while still guaranteeing that W ⊥
i i
i i
i i
448
X1 Y1 X1 Y1
W X2 Y2 Ŵ W X2 Y2 Ŵ
X3 Y3 X3 Y3
without feedback with feedback
Figure 23.3 Graphical model for n = 3 under the auxiliary distribution Q. Compare with Figure 23.2 under
the actual distribution P.
Such a graphical model is depicted on Figure 23.3 (right graph).3 Formally, we shall restrict
QW,Xn ,Yn ,Ŵ ∈ Q, where Q is the set of distributions that can be factorized as follows:
QW,Xn ,Yn ,Ŵ = QW QX1 |W QY1 QX2 |W,Y1 QY2 |Y1 · · · QXk |W,Yk−1 QYk |Yk−1 · · · QXn |W,Yn−1 QYn |Yn−1 QŴ|Yn
Using the d-separation criterion (see (3.11)) we can verify that for any Q ∈ Q we have W ⊥ ⊥ W:
n
W and Ŵ are d-separated by X . (More directly, one can clearly see that conditioning on any fixed
value of W = w does affect distributions of X1 , . . . , Xn but leaves Yn and Ŵ unaffected.)
Notice that in the graphical model for Q, when removing ⇒ we also added the directional links
between the Yi ’s, these links serve to maximally preserve the dependence relationships between
variables when ⇒ are removed, so that Q could be made closer to P, while still maintaining
W⊥ ⊥ W. We note that these links were also implicitly present in the non-feedback case (see model
for Q in that case on the left graph in Figure 23.3).
Now since as we agreed under Q we still have Q[W = Ŵ] = M1 we can use our usual data-
processing for divergence to conclude d(1 − ϵk M1 ) ≤ D(PkQ).
Assuming the crucial fact about this Q-graphical model that will be shown in a Lemma 23.6
(to follow), we then have the following chain:
1
d(1 − ϵk ) ≤ inf D(PW,Xn ,Yn ,Ŵ kQW,Xn ,Yn ,Ŵ )
M Q∈Q
Xn
= I(Xt ; Yt |Yt−1 ) (Lemma 23.6)
t=1
X
n
= EYt−1 [I(PXt |Yt−1 , PY|X )]
t=1
3
This kind of removal of one-directional links is known as causal conditioning.
i i
i i
i i
X
n
≤ I(EYt−1 [PXt |Yt−1 ], PY|X ) (concavity of I(·, PY|X ))
t=1
Xn
= I(PXt , PY|X )
t=1
≤nC . ( I)
We now proceed to showing the crucial omitted step in the above proof. Before that let us define
an interesting new kind of information.
Note that directed information is not symmetric. As [294] (and subsequent work, e.g. [412])
shows ⃗I(Xn ; Yn ) quantifies the amount of causal information transfer from X-process to Y-process.
In context of feedback communication a formal justification for introduction of this concept is the
following result.
Lemma 23.6 Consider communication with feedback over a non-anticipatory channel given
by a sequence of Markov kernels PYt |Xt ,Yt−1 , t ∈ [n], i.e. we have a probability distribution P on
(W, Xn , Yn , Ŵ) described by factorization
Y
n
PW,Xn ,Yn ,Ŵ = PW PXt |W,Xt−1 ,Yt−1 PYt |Xt ,Yt−1 . (23.9)
t=1
Denote by Q all distributions factorizing with respect to the graphical models on Figure 23.3
(right graph), that is those satisfying
Y
n
QW,Xn ,Yn ,Ŵ = QW QXk |W,Yk−1 QYk |Yk−1 (23.10)
t=1
Then we have
i i
i i
i i
450
In addition, if the channel is memoryless, i.e. PYt |Xt ,Yt−1 = PYt |Xt for all t ∈ [n], then we have
X
n
⃗I(Xn ; Yn ) = I(Xt ; Yt |Yt−1 ) .
t=1
Proof. By comparing factorizations (23.9) and (23.10) and applying the chain rule (2.26) we can
immediately optimize several terms (we leave this justification as an exercise):
QX,W = PX,W ,
QXt |W,Yt−1 = PXt |W,Yt−1
QŴ|Yn = PW|Yn .
= inf D(PY1 |X1 kQY1 |X1 ) + D(PY2 |X2 ,Y1 kQY2 |Y1 |X2 , Y1 ) + · · · + D(PYn |Xn ,Yn−1 kQYn |Yn−1 |Xn , Yn−1 )
Q∈Q
where in the last step we simply applied (conditional) versions of Corollary 4.2.
To prove the claim for the memoryless channels, we only need to notice that
and that the last term is zero. The latter can be justified via d-separation criterion. Indeed, in the
absence of channel memory every undirected path from Xt−1 to Yt must pass through Xt , which is
a non-collider and is conditioned on.
To summarize, we can see that Shannon’s result for feedback communication can be best
understood as a simple modification of the standard weak converse in channel coding: instead
of using
i i
i i
i i
where
denotes the set of input symbols that can lead to the output symbol y.
where (a) and (b) are by definitions, (c) follows from Theorem 23.3, and (d) is due to Theorem 19.9.
All capacity quantities above are defined with (fixed-length) block codes.
Remark 23.2 1 In DMC for both zero-error capacities (C0 and Cfb,0 ) only the support of the
transition matrix PY|X , i.e., whether PY|X (b|a) > 0 or not, matters; the values of those non-zero
PY|X (b|a) are irrelevant. That is, C0 and Cfb,0 are determined by the bipartite graph represen-
tation between the input alphabet A and the output alphabet B . Furthermore, the C0 (but not
Cfb,0 !) is a function of the confusability graph – a simple undirected graph on A with a 6= a′
connected by an edge iff ∃b ∈ B s.t. PY|X (b|a)PY|X (b|a′ ) > 0.
2 That Cfb,0 is not a function of the confusability graph alone is easily seen from comparing the
polygon channel (next example) with L = 3 (for which Cfb,0 = log 32 ) and the useless channel
with A = {1, 2, 3} and B = {1} (for which Cfb,0 = 0). Clearly in both cases the confusability
graph is the same – a triangle.
3 Oftentimes C0 is very hard to compute, but Cfb,0 can be obtained in closed form as in (23.12).
As an example, consider the following polygon channel (named after its confusability graph):
1 1
1
2 2 5
. .
2
. .
. .
4
L L 3
Bipartite graph Confusability graph (L = 5)
The following are known about the zero-error capacity C0 of the polygon channel:
• L = 3: C0 = 0.
• L = 5: C0 = 12 log 5. This is a famous “capacity of a pentagon” problem. For achievability,
with blocklength one, one can use {1, 3} to achieve rate 1 bit; with blocklength two, the code-
book {(1, 1), (2, 3), (3, 5), (4, 2), (5, 4)} achieves rate 12 log 5 bits, as pointed out by Shannon
i i
i i
i i
452
in his original 1956 paper [379]. More than two decades later this was shown optimal by
Lovász using a technique now known as semidefinite programming relaxation [286].
• Even L: C0 = log L2 (Exercise IV.36).
• L = 7: 3/5 log 7 ≤ C0 ≤ log 3.32. Finding the exact value for any odd L ≥ 7 is a famous
open problem in combinatorics.
• Asymptotics of odd L: Despite being unknown in general C0 has a known asymptotic
expansion: For odd L, C0 = log L2 + o(1) [66].
In comparison, the zero-error capacity with feedback (Exercise IV.36) equals Cfb,0 = log L2 for
any L, which, thus, can strictly exceed C0 .
4 Notice that Cfb,0 is not necessarily equal to Cfb = limϵ→0 Cfb,ϵ = C. Here is an example with
1 1
2 2
3 3
4 4
C0 = log 2
2
Cfb,0 = max − log max δ, 1 − δ (P∗X = (δ/3, δ/3, δ/3, δ̄))
δ 3
5 3
= log > C0 (δ ∗ = )
2 5
On the other hand, the Shannon capacity C = Cfb can be made arbitrarily close to log 4 by
picking the cross-over probabilities arbitrarily close to zero, while the confusability graph stays
the same.
Proof of Theorem 23.7. 1 Fix any (n, M, 0)-code. For each t = 0, 1, . . . , n, denote the confusabil-
ity set of all possible messages that could have produced the received signal yt = (y1 , . . . , yt )
by:
i i
i i
i i
Notice that in general the minimizer P∗X is not the capacity-achieving input distribution in the
usual sense (recall Theorem 5.4). This definition also sheds light on how the encoding and
decoding should proceed and serves to lower bound the uncertainty reduction at each stage of
the decoding scheme.
3 “≤” (converse): Let PXn be the joint distribution of the codewords. Denote by E0 = [M] the
original message set.
t = 1: For PX1 , by (23.14), ∃y∗1 such that:
|{m : f1 (m) ∈ Sy∗1 }| |E1 (y∗1 )|
PX1 (Sy∗1 ) = = ≥ θfb .
|{m ∈ [M]}| | E0 |
t = 2: For PX2 |X1 ∈Sy∗ , by (23.14), ∃y∗2 such that:
1
i i
i i
i i
454
encoder f1 channel
MP∗
X (a1 ) messages
a1 y1
MP∗
X (a2 ) messages
a2 y2
MP∗
X (a3 ) messages
a3 y3
By similar arguments, each interaction reduces the uncertainty by a factor of at least θfb . After n
n
iterations, the size of “confusability set” is upper bounded by Mθfb n
, if Mθfb ≤ 1,4 then zero error
probability is achieved. This is guaranteed by choosing log M = −n log θfb . Therefore we have
shown that −n log θfb bits can be reliably delivered with n + O(1) channel uses with feedback,
thus Cfb,0 ≥ − log θfb .
Theorem above shows possible advantages of feedback for zero-error communication. How-
ever, the zero-error capacity for a generic DMC (e.g. BSCδ with δ ∈ (0, 1)) we have C0 =
Cfb,0 = 0. Can we show any advantage of feedback for such channels? Clearly for that we need to
understand behavior of log M∗fb (n, ϵ) for ϵ > 0. It turns out that [336] for weakly-input symmetric
channels (Section 19.4*) we have
√
log M∗fb (n, ϵ) = nC − nVQ−1 (ϵ) + O(log n) ,
and thus at least up to second order the behavior of fundamental limits is unchanged in the presence
of feedback. Let us next discuss the error-exponent asymptotics (Section 22.4*) by defining
1
Efb (R) ≜ lim − log ϵ∗fb (n, exp{nR}) ,
n→∞ n
provided the limit exists and having denoted by ϵ∗fb (n, M) the smallest possible error of a feedback
code of blocklength n.
First, it is known that the sphere-packing bound (22.14) continues to hold in the presence of
feedback [312], that is
4
Some rounding-off errors need to be corrected in a few final steps (because P∗X may not be closely approximable when
very few messages are remaining). This does not change the asymptotics though.
i i
i i
i i
and thus for rates R ∈ (Rcr , C) the error-exponent Efb (R) = E(R) showing no change due to
availability of the feedback. So what is the advantage of feedback then? It turns out that the error-
exponents do improve at rates below critical. For example, for the BECδ a simple transmission
scheme where each bit is retransmitted until it is successfully received achieves error exponent
Esp (R) at all rates (since the probability of error here is given by P[Bin(n, δ) > n(1 − R/ log 2)]):
BEC : Efb (R) = Esp (R) 0 < R < C,
which is strictly higher than E(R) for R < Rcr .
For the BSCδ a beautiful result of Berlekamp shows that
1
Efb (0+) = − log((1 − δ)1/3 δ 2/3 + (1 − δ)2/3 δ 1/3 ) > E(0+) = Eex (0) = − log(4δ(1 − δ)) .
4
In other words, the error probability of optimal codes of size M = exp{o(n)} does significantly
improve in the presence of feedback (at least over the BSC and BEC).
Definition 23.8 An (ℓ, M, ϵ) variable-length feedback (VLF) code, where ℓ is a positive real,
M is a positive integer and 0 ≤ ϵ ≤ 1, is defined by:
5
It can be shown that without loss of generality we can assume |U | ≤ 3, see [336, Appendix].
i i
i i
i i
456
E[τ ] ≤ ℓ . (23.16)
Ŵ = gτ (U, Yτ ) , (23.17)
P[Ŵ 6= W] ≤ ϵ . (23.18)
The fundamental limit of channel coding with feedback is given by the following quantity:
In this language, our example above showed that for the BECδ we have
Notice that compared to the scheme without feedback, there is a significant improvement since
√
the term nVQ−1 (ϵ) in the expansion for log M∗ (n, ϵ) is now dropped. For this reason, results like
this are known as zero-dispersion results.
It turns out that this effect is general for all DMC as long as we allow some ϵ > 0 error.
Theorem 23.9 (VLF zero dispersion[336]) For any DMC with capacity C we have
lC
log M∗VLF (l, ϵ) = + O(log l) (23.20)
1−ϵ
We omit the proof of this result, only mentioning that the achievability part relies on ideas
similar to SPRT from Section 16.3*: the message keeps being transmitted until the information
density i(cj ; Yn ) of one of the codewords exceeds log M. See [336] for details. We also mention
that there is another variant of the VLF coding known as VLFT coding in which the stopping time
τ instead of being determined by the receiver is allowed to be determined by the transmitter (see
Exercise IV.35(d)). The expansion (23.20) continues to hold for VLFT codes as well [336].
Example 23.1 For the channel BSC0.11 without feedback the minimal is n = 3000 needed
to achieve 90% of the capacity C, while there exists a VLF code with ℓ = E[n] = 200 achieving
that [336]. This showcases how much feedback can improve the latency and decoding complexity.
VLF codes not only kill the dispersion term, but also dramatically improves error-exponents.
We have already discussed them in the context of fixed-length codes in Section 22.4* (without
feedback) and the end of last Section (with feedback). Here we mention a deep result of Burna-
shev [79], who showed that the optimal probability of error for VLF codes of rate R (i.e. with
log M = ℓR) satisfies for every DMC
i i
i i
i i
Simplicity of this expression when compared to the complicated (and still largely open!) situation
with respect to non-feedback or fixed-length feedback error-exponents is striking.
where expectation here is both over the channel noise and the potential randomness employed
by the transmitter in determination of Xj on the basis of the message W and Y1 , . . . , Yj−1 . In the
following, we demonstrate how to leverage this new freedom effectively.
Elias’ scheme Consider sending a standard Gaussian random variable A over the following set
of AWGN channels:
Yk = X k + Zk , Zk ∼ N (0, σ 2 ) i.i.d.
E[X2k ] ≤ P.
We assume that full noiseless feedback is available as in Figure 23.1. Note that, crucially, the
power constraint is imposed in expectation, which does not increase the channel capacity (recall
the converse in Theorem 20.6) but enables simple algorithms such as Elias’ scheme below. In
contrast, if we insist as in Section 20.1 that each codeword satisfies the power constraint almost
Pn
surely instead in expectation, i.e., k=1 X2k ≤ nP a.s., then Elias’ scheme does not work.
Using only linear processing, Elias’ scheme proceeds according to illustration on Figure 23.5.
According to the orthogonality principle, at the receiver side we have for all t = 1, . . . , n,
A = Ât + Nt , Nt ⊥
⊥ Yt .
Moreover, since all operations are linear, all random variables are jointly Gaussian and hence the
residual error satisfies Nt ⊥
⊥ Yt . Since Xt ∝ Nt−1 ⊥⊥ Yt−1 , the codeword we are sending at each
time slot is independent of the history of the channel output (“innovation”), in order to maximize
the information transfer.
i i
i i
i i
458
Encoder Decoder
X1 = c 1 A Y1 = c1 A + Z1 √
Â1 = E[A|Y1 ] = P
Y1
P+σ 2
X2 = c2 (A − Â1 ) Y2 = c2 (A − Â1 ) + Z2
Â2 = E[A|Y1 , Y2 ] = linear combination of Y1 , Y2
. .
. .
. .
Xn = cn (A − Ân−1 ) Yn = cn (A − Ân−1 ) + Zn
Ân = E[A|Yn ] = linear combination of Yn
Figure 23.5 Elias’ scheme for the AWGN channel with variable power. Here, each coefficient ct is chosen
such that E[X2t ] = P.
Note that Yn → Ân → A, and the optimal estimator Ân (a linear combination of Yn ) is a sufficient
statistic of Yn for A thanks to Gaussianity. Then
i i
i i
i i
where the key step applies Xt ⊥ ⊥ Yt−1 for all t. Therefore, with Elias’ scheme of sending A ∼
N (0, 1), after the n-th use of the AWGN channel with feedback and expected power P, we have
P n
Var Nn = Var(Ân − A) = 2−2nC Var A = ,
P + σ2
which says that the reduction of uncertainty in the estimation is exponential fast in n.
Schalkwijk-Kailath scheme Elias’ scheme can also be used to send digital data. Let W ∼ be
uniform on the M-PAM (Pulse-amplitude modulation) constellation in [−1, 1], i.e., {−1, −1 +
M , · · · , −1 + M , · · · , 1}. In the very first step, W is sent (after scaling to satisfy the power
2 2k
constraint):
√
X0 = PW, Y0 = X0 + Z0
Since Y0 and X0 are both known at the encoder, it can compute Z0 . Hence, to describe W it is
sufficient for the encoder to describe the noise realization Z0 . This is done by employing the Elias’
scheme (n − 1 times). After n − 1 channel uses, and the MSE estimation, the equivalent channel
output:
e0 = X0 + Z
Y e0 , e0 ) = 2−2(n−1)C
Var(Z
e0 to the nearest PAM point. Notice that
Finally, the decoder quantizes Y
√ (n−1)C √
e 1 −(n−1)C P 2 P
ϵ ≤ P |Z0 | > =P 2 | Z| > = 2Q
2M 2M 2M
so that
√
P ϵ
log M ≥ (n − 1)C + log − log Q−1 ( ) = nC + O(1).
2 2
Hence if the rate is strictly less than capacity, the error probability decays doubly exponentially as
√
n increases. More importantly, we gained an n term in terms of log M, since for the case without
feedback we have (by Theorem 22.2)
√
log M∗ (n, ϵ) = nC − nVQ−1 (ϵ) + O(log n) .
As an example, consider P = 1 and (n−thenchannel capacity is C = 0.5 bit per channel use. To
e(n−1)C
−3
achieve error probability 10 , 2Q 2 1) C
2M ≈ 10 , so 2M ≈ 3, and logn M ≈ n−n 1 C − logn 8 .
−3
Notice that the capacity is achieved to within 99% in as few as n = 50 channel uses, whereas the
best possible block codes without feedback require n ≈ 2800 to achieve 90% of capacity.
The take-away message of this chapter is as follows: Feedback is best harnessed with adaptive
strategies. Although it does not increase capacity under block coding, feedback can greatly boost
reliability as well as reduce coding complexity.
i i
i i
i i
IV.1 Consider the AWGN channel Yn = Xn + Zn , where Zi is iid N (0, 1) and Xi ∈ [−1, 1] (amplitude
constraint). Recall that ϵ∗ (n, 2) denotes the optimal average probability of error of transmitting
1 bit of information over this channel.
R∞
(a) Express the value of ϵ∗ (n, 2) in terms of Q(x) = x √12π e−t /2 dt.
2
(b) Compute the exponent r = limn→∞ 1n log ϵ∗ (1n,2) . (Hint: Q(x) = e−x (1/2+o(1)) when x → ∞,
2
(c) By applying the DT bound with uniform PX show that there exist codes with
X n
n t
δ (1 − δ)n−t 2−|n−t−k+1| .
+
ϵ≤ (IV.2)
t
t=0
(d) Fix n = 500, δ = 1/2. Compute the smallest k for which the right-hand side of (IV.1) is
greater than 10−3 .
i i
i i
i i
(e) Fix n = 500, δ = 1/2. Find the largest k for which the right-hand side of (IV.2) is smaller
than 10−3 .
(f) Express your results in terms of lower and upper bounds on log M∗ (500, 10−3 ).
IV.5 Recall that in the proof of the DT bound (Theorem 18.6) we used the decoder that outputs (for
a given channel output y) the first cm that satisfies
One may consider the following generalization. Fix E ⊂ X × Y and let the decoder output the
first cm which satisfies
( cm , y) ∈ E
By repeating the random coding proof steps (as in the DT bound) show that the average
probability of error satisfies
M−1
E[Pe ] ≤ P[(X, Y) 6∈ E] + P[(X̄, Y) ∈ E] ,
2
where
IV.6 In Section 18.6 we showed that for additive noise, random linear codes achieves the same per-
formance as Shannon’s ensemble (fully random coding). The total number of possible generator
matrices is qnk , which is significant smaller than double exponential, but still quite large. Now
we show that without degrading the performance, we can reduce this number to qn by restricting
to Toeplitz generator matrix G, i.e., Gij = Gi−1,j−1 for all i, j > 1.
Prove the following strengthening of Theorem 18.13: Let PY|X be additive noise over Fnq . For
any 1 ≤ k ≤ n, there exists a linear code f : Fkq → Fnq with Toeplitz generator matrix, such that
+
h − n−k−log 1 n i
Pe,max = Pe ≤ E q
q P Zn ( Z )
i i
i i
i i
for the channel Y = X + Z where X is uniform on F2q , noise Z ∈ F2q has distribution PZ and
P Z ( b − a)
i(a; b) ≜ log .
q− 2
(a) Show that probability of error of the code a 7→ (av, au) + h is the same as that of a 7→
(a, auv−1 ).
(b) Let {Xa , a ∈ Fq } be a random codebook defined as
Xa = (aV, aU) + H ,
with V, U uniform over non-zero elements of Fq and H uniform over F2q , the three being
jointly independent. Show that for a 6= a′ we have
1
PXa ,X′a (x21 , x̃21 ) = 1{x1 6= x̃1 , x2 6= x̃2 }.
q2 ( q − 1) 2
(c) Show that for a 6= a′
q2 1
P[i(X′a ; Xa + Z) > log β] ≤ P[i(X̄; Y) > log β] − P[i(X; Y) > log β]
( q − 1) 2 (q − 1)2
q2
≤ P[i(X̄; Y) > log β] ,
( q − 1) 2
i i
i i
i i
The quantity I(P̂X , P̂Y|X ), sometimes written as I(xn ∧ yn ), is an empirical mutual informa-
tion.6 Hint:
PY|X (Y|X)
EQXY log = D(QY|X kQY |QX ) + D(QY kPY ) − D(QY|X kPY|X |QX ).
PY (Y)
IV.10 (Fitingof-Goppa universal codes) Consider a finite abelian group X . Define the Fitingof norm
as
Conclude that dΦ (xn , yn ) ≜ kxn − yn kΦ is a translation invariant (Fitingof) metric on the set
of equivalence classes in X n , with equivalence xn ∼ yn ⇐⇒ kxn − yn kΦ = 0.
(b) Define the Fitingof ball Br (xn ) ≜ {yn : dΦ (xn , yn ) ≤ r}. Show that
(d) Conclude that a code C ⊂ X n with Fitingof minimal distance dmin,Φ (C) ≜
minc̸=c′ ∈C dΦ (c, c′ ) ≥ 2λn is decodable with vanishing probability of error on any
additive-noise channel Y = X + Z, as long as H(Z) < λ.
Comment: By Feinstein-lemma like argument it can be shown that there exist codes of size
X n(1−λ) , such that balls of radius λn centered at codewords are almost disjoint. Such codes are
universally capacity-achieving for all memoryless additive-noise channels on X . Extension to
general (non-additive) channels is done via introducing dΦ (xn , yn ) = nH(xT |yT ), while exten-
sion to channels with Markov memory is done by introducing Markov-type norm kxn kΦ1 =
nH(xT |xT−1 ). See [196, Chapter 3].
IV.11 A magician is performing card tricks on stage. In each round he takes a shuffled deck of 52
cards and asks someone to pick a random card N from the deck, which is then revealed to the
audience. Assume the magician can prepare an arbitrary ordering of cards in the deck (before
each round) and that N is distributed binomially on {0, . . . , 51} with mean 51
2 .
(a) What is the maximal number of bits per round that he can send over to his companion in
the room (in the limit of infinitely many rounds)?
6
Invented by V. Goppa for his maximal mutual information (MMI) decoder [195]: Ŵ = argmaxi I(ci ∧ yn ).
i i
i i
i i
(b) Is communication possible if N were uniform on {0, . . . , 51}? (In practice, however, nobody
ever picks the top or the bottom ones.)
IV.12 (Channel with memory) Consider the additive noise channel with A = B = F2 (Galois field
of order 2) and PYn |Xn : Fn2 → Fn2 specified by
Yn = Xn + Zn ,
where Zn = (Z1 , . . . , Zn ) is a stationary Markov chain with PZ2 |Z1 (0|1) = PZ2 |Z1 (1|0) = τ .
Show information stability and find the capacity. (Hint: your proof should work for an arbitrary
stationary ergodic noise process Z∞ = (Z1 , . . .)). Can the capacity be achieved by linear codes?
IV.13 Consider a DMC PYn |Xn = PnY|X , where a single-letter PY|X : A → B is given by A = B =
{0, 1}7 , and
1−p y=x
PY|X (y|x) =
p/7 dH ( y, x) = 1
where dH stands for Hamming distance.In other words, for each 7-bit string, the channel either
leaves it intact, or randomly flips exactly one bit.
(a) Compute the Shannon capacity C as a function of p and plot.
(b) Consider the special case of p = 78 . Show that the zero-error capacity C0 coincides with C.
Moreover, C0 can be achieved with blocklength n = 1 and give a capacity-achieving code.
IV.14 Find the capacity of the erasure-error channel (Figure 23.6) with channel matrix
1 − 2δ δ δ
W=
δ δ 1 − 2δ
where 0 ≤ δ ≤ 1/2.
1 − 2δ
1 δ 1
δ
0 0
1 − 2δ
IV.15 (Capacity of reordering) Routers A and B are setting up a covert communication channel in
which the data is encoded in the ordering of packets. Formally, router A receives n packets,
each having one of two types, Ack or Data, with probabilities p and 1 − p, respectively (and
iid). It encodes k bits of secret data by reordering these packets. The network between A and B
delivers packets in-order with loss rate δ . (Note: packets have sequence numbers, so each loss
is detected by B).
i i
i i
i i
The dispersion of the compound DMC is, however, more delicate [342].
IV.20 Consider the following (memoryless) channel. It has a side switch U that can be in positions
ON and OFF. If U is on then the channel from X to Y is BSCδ and if U is off then Y is Bernoulli
(1/2) regardless of X. The receiving party sees Y but not U. A design constraint is that U should
be in the ON position no more than the fraction s of all channel uses, 0 ≤ s ≤ 1.
(a) One strategy is to put U into ON over the first sn time units and ignore the rest of the (1 − s)n
readings of Y. What is the maximal rate in bits per channel use achievable with this strategy?
(b) Can we increase the communication rate if the encoder is allowed to modulate the U switch
together with the input X (while still satisfying the s-constraint on U)?
(c) Now assume nobody has access to U, which is iid Ber(s) independent of X. Find the
capacity.
i i
i i
i i
IV.21 Alice has n oranges and great many (essentially, infinite) number of empty trays. She wants to
communicate a message to Bob by placing the oranges in (sequentially numbered) trays with
at most one orange per tray. Unfortunately, before Bob gets to see the trays Eve inspects them
and eats each orange independently with probability 0 < δ < 1. In the limit of n → ∞ show
that an arbitrary high rate (in bits per orange) is achievable.
Show that capacity changes to log2 δ1 bits per orange if Eve never eats any oranges but places
an orange into each empty tray with probability δ (iid).
IV.22 (Non-stationary channel [106, Problem 9.12]) A train pulls out of the station at constant velocity.
The received signal energy thus falls off with time as 1/i2 . The total received signal at time i is
1
Yi = X i + Zi ,
i
i.i.d. Pn
where Z1 , Z2 , . . . ∼ N(0, σ 2 ). The transmitter cost constraint for block length n is i |x2i | ≤ nP.
Show that the capacity C is equal to zero for this channel.
IV.23 (Capacity-cost function at the boundary.) Recall from Corollary 20.5 that we have shown that
for stationary memoryless channels and P > P0 capacity equals f(P):
where
Show:
(a) If P0 is not admissible, i.e., c(x) > P0 for all x ∈ A, then C(P0 ) is undefined (even M = 1
is not possible)
(b) If there exists a unique x0 such that c(x0 ) = P0 then
C(P0 ) = f(P0 ) = 0 .
(c) If there are more than one x with c(x) = P0 then we still have
C(P0 ) = f(P0 ) .
(d) Give example of a channel with discontinuity of C(P) at P = P0 . (Hint: select a suitable
cost function for the channel Y = (−1)Z · sign(X), where Z is Bernoulli and sign : R →
{−1, 0, 1}.)
IV.24 Consider a stationary memoryless additive non-Gaussian noise channel:
Yi = Xi + Zi , E [ Z i ] = 0, Var[Zi ] = 1
Pn 2
with the standard input constraint i=1 xi ≤ nP.
(a) Prove that capacity C(P) of this channel satisfies (20.6). (Hint: Gaussian saddle point
Theorem 5.11 and the golden formula I(X; Y) ≤ D(PY|X kQY |PX ).)
i i
i i
i i
(b) If D(PZ kN (0, 1)) = ∞ (Z is very non-Gaussian), then it is possible that the capacity is
infinite. Consider Z is ±1 with equal probability. Show that the capacity is infinite by a)
proving the maximal mutual information is infinite; b) giving an explicit scheme to achieve
infinite capacity.
IV.25 (Input-output cost) Let PY|X : X → Y be a DMC and consider a cost function c : X × Y → R
(note that c(x, y) ≤ L < ∞ for some L). Consider a problem of channel coding, where the
error-event is defined as
( n )
X
{error} ≜ {Ŵ 6= W} ∪ c(Xk , Yk ) > nP ,
k=1
where P is a fixed parameter. Define operational capacity C(P) and show it is given by
for all P > P0 ≜ minx0 E[c(X, Y)|X = x0 ]. Give a counterexample for P = P0 . (Hint: do a
converse directly, and for achievability reduce to an appropriately chosen cost-function c′ (x)).
IV.26 (Gauss-Markov noise) Let {Zj , j = 1, 2, . . .} be a stationary ergodic Gaussian process with
variance 1 such that Zj form an Markov chain Z1 → . . . → Zn → . . . Consider an additive
channel
Yn = X n + Zn
Pn
with power constraint j=1 |xj |2 ≤ nP. Suppose that I(Z1 ; Z2 ) = ϵ 1, then capacity-cost
function
1
C(P) = log(1 + P) + Bϵ + o(ϵ)
2
as ϵ → 0. Compute B and interpret your answer.
How does the frequency spectrum of optimal signal change with increasing ϵ?
IV.27 A semiconductor company offers a random number generator that outputs a block of random n
bits Y1 , . . . , Yn . The company wants to secretly embed a signature in every chip. To that end, it
decides to encode a k-bit signature in n real numbers Xj ∈ [0, 1]. Given an individual signature a
chip is manufactured such that it produces the outputs Yj ∼ Ber(Xj ). In order for the embedding
to be inconspicuous the average bias p should be small:
1X
n
1
Xj − ≤ p.
n 2
j=1
As a function of p how many signature bits per output (k/n) can be reliably embedded in this
fashion? Is there a simple coding scheme achieving this performance?
IV.28 (Capacity of sneezing) A sick student once every minute with probability p (iid) wants to sneeze.
He decides to send k bits to a friend by modulating the sneezes. For that, every time he realizes
he is about to sneeze he chooses to suppress a sneeze or not. A friend listens for n minutes and
then tries to decode k bits.
i i
i i
i i
(a) Find the capacity in bits per minute. (Hint: Think how to define the channel so that channel
input at time t were not dependent on the arrival of the sneeze at time t. To rule out strategies
that depend on arrivals of past sneezes, you may invoke Exercise IV.34.)
(b) Suppose the sender can suppress at most E sneezes and listener can wait indefinitely (n =
∞). Show that the sender can transmit Cpuc E + o(E) bits reliably as E → ∞ and find the
capacity per unit cost Cpuc . Curiously, Cpuc ≥ 1.44 bits/sneeze regardless of p. (Hint: This
is similar to Exercise IV.25.)
(d*) Redo (a) and (b) for the case of a clairvoyant student who knows exactly when sneezes
will happen in the future. (This is a simple example of a so-called Gelfand-Pinsker
problem [183].)
IV.29 A data storage company is considering two options for sending its 100 Tb archive from Boston
to NYC: via (physical) mail or via wireless transmission. Let us analyze these options:
(a) Given the radiated power Pt the received power Pr at distance r for communicating at fre-
2
c
quency f is given by Pr = G 4πrf Pt , where G is antenna-to-antenna coupling gain and c
– a speed of light7 . Assuming transmitting between Boston and NYC compute the energy
transfer coefficient η (take G = 15 dB and f = 4 GHz).
(b) The receiver’s amplifier adds white Gaussian noise of power spectral density N0 (W/Hz
or J). On the basis of required energy per bit, compute the minimal N0 which still makes
transmission over the radio economically justified assuming optimal channel coding is done
(assume 0.07$ per kWh and $20 per shipment).
(c) Compare this N0 with the thermal noise power N0,thermal = kT, where k – Boltzmann
constant and T – temperature in Kelvins. Conclude that T ≤ 103 K should work.
(d) Codes that achieve Shannon’s minimal Eb /N0 in principle do not put restrictions on the
receiver SNR (signal-to-noise ratio in one channel sample), however synchronization and
other issues constrain this SNR to be ≥ −10 dB. Assuming communication bandwidth
B = 20 Mhz compute the minimal power (in W) required for transmitter radio station.
Pr
(Hint: received SNR = BN 0
, the answer should be a few watts).
(e) How long will it take to send archive at this bandwidth and SNR? (Hint: the answer is
between a few days and a few years).
IV.30 (Optimal ϵ under ARQ) A packet of k bits is to be delivered over an AWGN channel with a
given SNR. To that end, a k-to-n error correcting code is used, whose probability of error is ϵ.
The system employs automatic repeat request (ARQ) to resend the packet whenever an error
occured.8 Suppose that the optimal k-to-n codes achieving
√ 1
k ≈ nC − nVQ−1 (ϵ) + log n
2
7
This formula is known as Friis transmission equation and it simply reflects the fact that the receiving antenna captures a
2
plane wave at the effective area of λ
4π
.
8
Assuming there is a way for receiver to verify whether his decoder produced the correct packet contents or not (e.g. by
finding HTML tags).
i i
i i
i i
are available. The goal is to optimize ϵ to get the highest average throughput: ϵ too small requires
excessive redundancy, ϵ too large leads to lots of retransmissions. Compute the optimal ϵ and
optimal block length n for the following four cases: SNR=0 dB or 20 dB; k = 103 or 104 bits.
(This gives an idea of what ϵ you should aim for in practice.)
IV.31 (Expurgated random coding bound)
(a) For any code C show the following bound on probability of error
1 X −dB (c,c′ )
Pe (C) ≤ 2 ,
M ′ c̸=c
Pn
where recall from (16.3) the Bhattacharya distance dB (xn , x̃n ) = j=1 dB (xj , x̃j ) and
Xp
dB (x, x̃) = − log2 W(y|x)W(y|x̃) .
y∈Y
− ρ1 dB (X,X′ )
(b) Fix PX and let E0,x (ρ, PX ) ≜ −ρ log2 E[2 ⊥ X′ ∼ PX . Show by random
], where X ⊥
coding that there always exists a code C of rate R with
(c) We improve the previous bound as follows. We still generate C by random coding. But this
time we expurgate all codewords with f(c, C) > med(f(c, C)), where med(·) denotes the
P ′
median and f(c) = c′ ̸=c 2−dB (c,c ) . Using the bound
med(V) ≤ 2ρ E[V1/ρ ]ρ ∀ρ ≥ 1
show that
(d) Conclude that there must exist a code with rate R − O(1/n) and Pe (C) ≤ 2−nEex (R) , where
IV.32 (Strong converse for BSC) In this exercise we give a combinatorial proof of the strong converse
for the binary symmetric channel. For BSCδ with 0 < δ < 21 ,
(a) Given any (n, M, ϵ)max -code with deterministic encoder f and decoder g, recall that the
decoding regions {Di = g−1 (i)}M i=1 form a partition of the output space. Prove that for
all i ∈ [M],
L
X n
| Di | ≥
j
j=0
M ≤ 2n(1−h(δ))+o(n) . (IV.9)
i i
i i
i i
(c) Show that (IV.9) holds for average probability of error. (Hint: how to go from maximal to
averange probability of error?)
(d) Conclude that strong converse holds for BSC. (Hint: argue that requiring deterministic
encoder/decoder does not change the asymptotics.)
IV.33 (Strong converse for AWGN) Recall that the AWGN channel is specified by
1 n 2
Yn = X n + Zn , Zn ∼ N (0, In ) , c(xn ) = kx k
n
Prove the strong converse for the AWGN via the following steps:
(a) Let ci = f(i) and Di = g−1 (i), i = 1, . . . , M be the codewords and the decoding regions of
an (n, M, P, ϵ)max code. Let
QYn = N (0, (1 + P)In ) .
Show that there must exist a codeword c and a decoding region D such that
PYn |Xn =c [D] ≥ 1 − ϵ (IV.10)
1
QYn [D] ≤ . (IV.11)
M
(b) Show that then
1
β1−ϵ (PYn |Xn =c , QYn ) ≤ . (IV.12)
M
(c) Show that hypothesis testing problem
PYn |Xn =c vs QYn
is equivalent to
PYn |Xn =Uc vs QYn
where U ∈ Rn×n is an orthogonal matrix. (Hint: use spherical symmetry of white Gaussian
distributions.)
(d) Choose U such that
PYn |Xn =Uc = Pn ,
where Pn is an iid Gaussian distribution of mean that depends on kck2 .
(e) Apply Stein’s lemma (Theorem 14.14) to show that for a certain value of E = E(P) > 0
we have
β1−ϵ (PYn |Xn =c , QYn ) = exp{−nE + o(n)}
i i
i i
i i
(a) Show that capacity is not increased in general (even when Y 6= U).
(b) Suppose now that there is a cost function c and c(x0 ) = 0. Show that capacity per unit cost
(with U being fed back) is still given by
D(PY|X=x kPY|X=x0 )
CV = max
x̸=x0 c(x)
IV.35 Consider a binary symmetric channel with crossover probability δ ∈ (0, 1):
Y = X + Z mod 2 , Z ∼ Ber(δ) .
Suppose that in addition to Y the receiver also gets to observe noise Z through a binary erasure
channel with erasure probability δe ∈ (0, 1). Compute:
(a) Capacity C of the channel.
(b) Zero-error capacity C0 of the channel.
(c) Zero-error capacity in the presence of feedback Cfb,0 .
(d*) Now consider the setup when in addition to feedback also the variable-length communica-
tion with feedback and termination (VLFT) is allowed. What is the zero-error capacity (in
bits per average number of channel uses) in this case? (In VLFT model, the transmitter can
send a special symbol T that is received without error, but the channel dies after T has been
sent; cf. Section 23.3.2)
IV.36 Consider the polygon channel discussed in Remark 23.2, where the input and output alphabet
are both {1, . . . , L}, and PY|X (b|a) > 0 if and only if b = a or b = (a mod L) + 1. The
confusability graph is a cycle of L vertices. Rigorously prove the following:
(a) For all L, The zero-error capacity with feedback is Cfb,0 = log L2 .
(b) For even L, the zero-error capacity without feedback C0 = log L2 .
(c) Now consider the following channel, where the input and output alphabet are both
{1, . . . , L}, and PY|X (b|a) > 0 if and only if b = a or b = a + 1. In this case the confusability
graph is a path of L vertices. Show that the zero-error capacity is given by
L
C0 = log
2
What is Cfb,0 ?
IV.37 (BEC with feedback) Consider the stationary memoryless binary erasure channel with erasure
probability δ and noiseless feedback. Design a fixed-blocklength coding scheme achieving the
capacity, i.e., find a scheme that sends k bits over n channel uses with noiseless feedback, such
that the rate nk approaches the capacity 1 − δ when n → ∞ and the maximal probability of
error vanishes. Show also that for any rate R < (1 − δ) bit the error-exponent matches the
sphere-packing bound.
i i
i i
i i
i i
i i
i i
Part V
i i
i i
i i
i i
i i
i i
475
In Part II we studied lossless data compression (source coding), where the goal is to compress
a random variable (source) X into a minimal number of bits on average (resp. exactly) so that
X can be reconstructed exactly (resp. with high probability) using these bits. In both cases, the
fundamental limit is given by the entropy of the source X. Clearly, this paradigm is confined to
discrete random variables.
In this part we will tackle the problem of compressing continuous random variables, known as
lossy data compression. Given a random variable X, we need to encode it into a minimal number
of bits, such that the decoded version X̂ is a faithful a reconstruction of X, which is rigorously
understood as distortion metric between X and X̂ being bounded by a prescribed fidelity either on
average or with high probability.
The motivations for study lossy compression are at least two-fold:
1 Many natural signals (e.g. audio, images, or video) are continuously valued. As such, there is
a need to represent these real-valued random variables or processes using finitely many bits,
which can be fed to downstream digital processing; see Figure 23.7 for an illustration.
Domain Range
Continuous Analog
Sampling time Quantization
Signal
Discrete Digital
time
2 There is a lot to be gained in compression if we allow some reconstruction errors. This is espe-
cially important in applications where certain errors (such as high-frequency components in
natural audio and visual signals) are imperceptible to humans. This observation is the basis of
many important compression algorithms and standards that are widely deployed in practice,
including JPEG for images, MPEG for videos, and MP3 for audios.
The operation of mapping (naturally occurring) continuous time/analog signals into
(electronics-friendly) discrete/digital signals is known as quantization, which is an important sub-
ject in signal processing in its own right (cf. the encyclopedic survey [197]). In information theory,
the study of optimal quantization is called rate-distortion theory, introduced by Shannon in 1959
[380]. To start, we will take a closer look at quantization next in Section 24.1, followed by the
information-theoretic formulation in Section 24.2. A simple (and tight) converse bound is given
in Section 24.3, with the matching achievability bound deferred to the next chapter.
In Chapter 25 we present the hard direction of the rate-distortion theorem: the random coding
construction of a quantizer. This method is extended to the development of a covering lemma and
soft-covering lemma, which lead to sharp result of Cuff showing that the fundamental limit of
channel simulation is given by Wyner’s common information. We also derive (strengthened form
of) Han-Verdú’s results on approximating output distributions in KL.
Chapter 26 evaluates rate-distortion function for Gaussian and Hamming sources. We also dis-
cuss the important foundational implication that optimal (lossy) compressor paired with an optimal
i i
i i
i i
476
error correcting code together form an optimal end-to-end communication scheme (known as joint
source-channel coding separation principle). This principle explains why “bits” are the natural
currency of the digital age.
Finally, in Chapter 27 we study Kolmogorov’s metric entropy, which is a non-probabilistic
theory of quantization for sets in metric spaces. While traditional rate-distortion tries to compress
samples from a fixed distribution, metric entropy tries to compress any element of the metric
space. What links the two topics is that metric entropy can be viewed as a rate-distortion theory
applied to the “worst-case” distribution on the space (an idea further expanded in Section 27.6). In
addition to connections to the probabilistic theory of quantization in the preceding chapters, this
concept has far-reaching consequences in both probability (e.g. empirical processes, small-ball
probability) and statistical learning (e.g. entropic upper and lower bounds for estimation) that will
be explored further in Part VI. Exercises explore applications to Brownian motion (Exercise V.30),
random matrices (Exercise V.29) and more.
i i
i i
i i
24 Rate-distortion theory
In this chapter we introduce the theory of optimal quantization. In Section 24.1 we examine the
classical theory for quantization for fixed dimension and high rate, discussing various aspects such
as uniform versus non-uniform quantization, fixed versus variable rate, quantization algorithm (of
Lloyd) versus clustering, and the asymptotics of optimal quantization error. In Section 24.2 we turn
to the information-theoretic formulation of quantization, known as the rate-distortion theory, that is
targeted at high dimension and fixed rate and the regime where the number of reconstruction points
grows exponentially with dimension. Section 24.3 introduces the rate-distortion function and the
main converse bound. Finally, in Section 24.4* we discuss how to relate the average distortion
(which we focus) to excess distortion that targets a reconstruction error in high probability as
opposed to in expectation.
−A A
2 2
477
i i
i i
i i
478
where D denotes the average distortion. Often R = log2 N is used instead of N, so that we think
about the number of bits we can use for quantization instead of the number of points. To analyze
this scalar uniform quantizer, we’ll look at the high-rate regime (R 1). The key idea in the high
rate regime is that (assuming a smooth density PX ), each quantization interval ∆j looks nearly flat,
so conditioned on ∆j , the distribution is accurately approximately by a uniform distribution.
∆j
Let cj be the j-th quantization point, and ∆j be the j-th quantization interval. Here we have
X
N
DU (R) = E|X − qU (X)|2 = E[|X − cj |2 |X ∈ ∆j ]P[X ∈ ∆j ] (24.1)
j=1
X
N
|∆j |2
(high rate approximation) ≈ P[ X ∈ ∆ j ] (24.2)
12
j=1
( NA )2 A2 −2R
= = 2 , (24.3)
12 12
where we used the fact that the variance of Unif(−a, a) is a2 /3.
How much do we gain per bit?
Var(X)
10 log10 SNR = 10 log10
E|X − qU (X)|2
12Var(X)
= 10 log10 + (20 log10 2)R
A2
= constant + (6.02dB)R
For example, when X is uniform on [− A2 , A2 ], the constant is 0. Every engineer knows the rule of
thumb “6dB per bit”; adding one more quantization bit gets you 6 dB improvement in SNR. How-
ever, here we can see that this rule of thumb is valid only in the high rate regime. (Consequently,
widely articulated claims such as “16-bit PCM (CD-quality) provides 96 dB of SNR” should be
taken with a grain of salt.)
The above discussion deals with X with a bounded support. When X is unbounded, it is wise to
allocate the quantization points to those values that are more likely and saturate the large values at
i i
i i
i i
the dynamic range of the quantizer, resulting in two types of contributions to the quantization error,
known as the granular distortion and overload distortion. This leads us to the question: Perhaps
uniform quantization is not optimal?
Often the way such quantizers are implemented is to take a monotone transformation of the source
f(X), perform uniform quantization, then take the inverse function:
f
X U
q qU (24.4)
X̂ qU ( U)
f−1
i.e., q(X) = f−1 (qU (f(X))). The function f is usually called the compander (compressor+expander).
One of the choice of f is the CDF of X, which maps X into uniform on [0, 1]. In fact, this compander
architecture is optimal in the high-rate regime (fine quantization) but the optimal f is not the CDF
(!). We defer this discussion till Section 24.1.4.
In terms of practical considerations, for example, the human ear can detect sounds with volume
as small as 0 dB, and a painful, ear-damaging sound occurs around 140 dB. Achieving this is
possible because the human ear inherently uses logarithmic comp anding function. Furthermore,
many natural signals (such as differences of consecutive samples in speech or music (but not
samples themselves!)) have an approximately Laplace distribution. Due to these two factors, a
very popular and sensible choice for f is the μ-companding function
i i
i i
i i
480
which compresses the dynamic range, uses more bits for smaller |X|’s, e.g. |X|’s in the range of
human hearing, and less quantization bits outside this region. This results in the so-called μ-law
which is used in the digital telecommunication systems in the US, while in Europe a slightly
different compander called the A-law is used.
Intuitively, we would think that the optimal quantization regions should be contiguous; otherwise,
given a point cj , our reconstruction error will be larger. Therefore in one dimension quantizers are
piecewise constant:
1 Draw the Voronoi regions around the chosen quantization points (aka minimum distance
tessellation, or set of points closest to cj ), which forms a partition of the space.
2 Update the quantization points by the centroids E[X|X ∈ D] of each Voronoi region D.
b b
b b
b b
b b
b b
1
This work at Bell Labs remained unpublished until 1982 [284].
i i
i i
i i
Lloyd’s clever observation is that the centroid of each Voronoi region is (in general) different than
the original quantization points. Therefore, iterating through this procedure gives the Centroidal
Voronoi Tessellation (CVT - which are very beautiful objects in their own right), which can be
viewed as the fixed point of this iterative mapping. The following theorem gives the results on
Lloyd’s algorithm.
Remark 24.1 The third point tells us that Lloyd’s algorithm is not always guaranteed to give
the optimal quantization strategy.2 One sufficient condition for uniqueness of a CVT is the log-
concavity of the density of X [171], e.g., Gaussians. On the other hand, even for the Gaussian PX
and N > 3, the optimal quantization points are not known in closed form. So it may seem to be
very hard to have any meaningful theory of optimal quantizers. However, as we shall see next,
when N becomes very large, locations of optimal quantization points can be characterized. In this
section we will do so in the case of fixed dimension, while for the rest of this Part we will consider
the regime of taking N to grow exponentially with dimension.
Remark 24.2 (k-means) A popular clustering method called k-means is the following: Given
n data points x1 , . . . , xn ∈ Rd , the goal is to find k centers μ1 , . . . , μk ∈ Rd to minimize the objective
function
X
n
min kxi − μj k2 .
j∈[k]
i=1
This is equivalent to solving the optimal vector quantization problem analogous to (24.5):
2
As a simple example one may consider PX = 13 ϕ(x − 1) + 31 f(x) + 13 f(x + 1) where f(·) is a very narrow pdf, symmetric
around 0. Here the CVT with centers ± 23 is not optimal among binary quantizers (just compare to any quantizer that
quantizes two adjacent spikes to same value).
i i
i i
i i
482
in a given interval and allows us to approximate summations by integrals.3 Then the number of
Rb
quantization points in any interval [a, b] is ≈ N a λ(x)dx. For any point x, denote the size of the
quantization interval that contains x by ∆(x). Then
Z x+∆(x)
1
N λ(t)dt ≈ Nλ(x)∆(x) ≈ 1 =⇒ ∆(x) ≈ .
x Nλ(x)
With this approximation, the quality of reconstruction is
X
N
E|X − q(X)|2 = E[|X − cj |2 |X ∈ ∆j ]P[X ∈ ∆j ]
j=1
X
N Z
|∆j |2 ∆ 2 ( x)
≈ P[ X ∈ ∆ j ] ≈ p ( x) dx
12 12
j=1
Z
1
= p(x)λ−2 (x)dx ,
12N2
To find the optimal density λ that gives the best reconstruction (minimum MSE) when X has den-
R R R R R
sity p, we use Hölder’s inequality: p1/3 ≤ ( pλ−2 )1/3 ( λ)2/3 . Therefore pλ−2 ≥ ( p1/3 )3 ,
1/ 3
with equality iff pλ−2 ∝ λ. Hence the optimizer is λ∗ (x) = Rp (x) .
p1/3 dx
Therefore when N = 2R ,4
Z 3
1 −2R
Dscalar (R) ≈ 2 p1/3 (x)dx
12
So our optimal quantizer density in the high rate regime is proportional to the cubic root of the
density of our source. This approximation is called the Panter-Dite approximation. For example,
• When X ∈ [− A2 , A2 ], using Hölder’s inequality again h1, p1/3 i ≤ k1k 3 kp1/3 k3 = A2/3 , we have
2
1 −2R 2
Dscalar (R) ≤ 2 A = DU (R)
12
where the RHS is the uniform quantization error given in (24.1). Therefore as long as the
source distribution is not uniform, there is strict improvement. For uniform distribution, uniform
quantization is, unsurprisingly, optimal.
• When X ∼ N (0, σ 2 ), this gives
√
2 −2R π 3
Dscalar (R) ≈ σ 2 (24.6)
2
Remark 24.3 In fact, in scalar case the optimal non-uniform quantizer can be realized using
the compander architecture (24.4) that we discussed in Section 24.1.2: As an exercise, use Taylor
3
This argument is easy to make rigorous. We only need to define reconstruction points cj as the solution of
∫ cj j
−∞ λ(x) dx = N (quantile).
4
In fact when R → ∞, “≈” can be replaced by “= 1 + o(1)” as shown by Zador [467, 468].
i i
i i
i i
∆2 22h(X)
D= ≈ 2−2R .
12 12
On the other hand, any quantizer with unnormalized point density function Λ(x) (i.e. smooth
R cj
function such that −∞ Λ(x)dx = j) can be shown to achieve (assuming Λ → ∞ pointwise)
Z
1 1
D≈ pX (x) 2 dx
12 Λ ( x)
Z
Λ(x)
H(q(X)) ≈ pX (x) log dx.
p X ( x)
Now, from Jensen’s inequality we have
Z Z
1 1 1 22h(X)
pX (x) 2 dx ≥ exp{−2 pX (x) log Λ(x) dx} ≈ 2−2H(q(X)) ,
12 Λ ( x) 12 12
concluding that uniform quantizer is asymptotically optimal.
Furthermore, it turns out that for any source, even the optimal vector quantizers (to be con-
2h(X)
sidered next) can not achieve distortion better that 2−2R 22πe . That is, the maximal improvement
they can gain for any i.i.d. source is 1.53 dB (or 0.255 bit/sample). This is one reason why scalar
uniform quantizers followed by lossless compression is an overwhelmingly popular solution in
practice.
i i
i i
i i
484
Hamming Game. Given 100 unbiased bits, we are asked to inspect them and scribble something
down on a piece of paper that can store 50 bits at most. Later we will be asked to guess the original
100 bits, with the goal of maximizing the number of correctly guessed bits. What is the best
strategy? Intuitively, it seems the optimal strategy would be to store half of the bits then randomly
guess on the rest, which gives 25% bit error rate (BER). However, as we will show in this chapter
(Theorem 26.1), the optimal strategy amazingly achieves a BER of 11%. How is this possible?
After all we are guessing independent bits and the loss function (BER) treats all bits equally.
Gaussian example. Given (X1 , . . . , Xn ) drawn independently from N (0, σ 2 ), we are given a
budget of one bit per symbol to compress, so that the decoded version (X̂1 , . . . , X̂n ) has a small
Pn
mean-squared error 1n i=1 E[(Xi − X̂i )2 ].
To this end, a simple strategy is to quantize each coordinate into 1 bit. As worked out in Exam-
ple 24.1, the optimal one-bit quantization error is (1 − π2 )σ 2 ≈ 0.36σ 2 . In comparison, we will
2
show later (Theorem 26.2) that there is a scheme that achieves an MSE of σ4 per coordinate
for large n; furthermore, this is optimal. More generally, given R bits per symbol, by doing opti-
mal vector quantization in high dimensions (namely, compressing (X1 , . . . , Xn ) jointly to nR bits),
rate-distortion theory will tell us that when n is large, we can achieve the per-coordinate MSE:
1 Applying scalar quantization componentwise results in quantization region that are hypercubes,
which may not suboptimal for covering in high dimensions.
2 Concentration of measures effectively removes many atypical source realizations. For example,
when quantizing a single Gaussian X, we need to cover large portion of R in order to deal with
those significant deviations of X from 0. However, when we are quantizing many (X1 , . . . , Xn )
together, the law of large numbers makes sure that many Xj ’s cannot conspire together and all
produce large values. Indeed, (X1 , . . . , Xn ) concentrates near a sphere. As such, we may exclude
large portions of the space Rn from consideration.
where X ∈ X is refereed to as the source, W = f(X) is the compressed discrete data, and X̂ = g(W)
is the reconstruction which takes values in some alphabet X̂ that needs not be the same as X .
A distortion metric (or loss function) is a measurable function d : X × X̂ → R ∪ {+∞}. There
are various formulations of the lossy compression problem:
i i
i i
i i
1 Fixed length (fixed rate), average distortion: W ∈ [M], minimize E[d(X, X̂)].
2 Fixed length, excess distortion: W ∈ [M], minimize P[d(X, X̂) > D].
3 Variable length, max distortion: W ∈ {0, 1}∗ , d(X, X̂) ≤ D a.s., minimize the average length
E[l(W)] or entropy H(W).
In this book we focus on lossy compression with fixed length and are chiefly concerned with
average distortion (with the exception of joint source-channel coding in Section 26.3 where excess
distortion will be needed). The difference between average and excess distortion is analogous
to average and high-probability risk bound in statistics and machine learning. It turns out that
under mild assumptions these two formulations lead to the same asymptotic fundamental limit
(cf. Remark 25.2). However, the speed of convergence to that limit is very different: the excess
distortion version converges as O( √1n ) has a rich dispersion theory [255], which we do not discuss.
The convergence under excess distortion is much faster as O( logn n ); see Exercise V.3.
As usual, of particular interest is when the source takes the form of a random vector Sn =
(S1 , . . . , Sn ) ∈ S n and the reconstruction is Ŝn = (S1 , . . . , Sn ) ∈ Ŝ n . We will be focusing on the
so called separable distortion metric defined for n-letter vectors by averaging the single-letter
distortions:
1X
n
d(sn , ŝn ) ≜ d(si , ŝi ). (24.8)
n
i=1
Note that, for stationary memoryless (iid) source, the large-blocklength limit in (24.10) in fact
exists and coincides with the infimum over all blocklengths. This is a consequence of the average
distortion criterion and the separability of the distortion metric – see Exercise V.2.
i i
i i
i i
486
Proof.
where the last inequality follows from the fact that PX̂|X is a feasible solution (by assumption).
Then ϕX (D) = 0 for all D > Dmax . If D0 > Dmax then also ϕX (Dmax ) = 0.
Remark 24.4 (The role of D0 and Dmax ) By definition, Dmax is the distortion attainable
without any information. Indeed, if Dmax = Ed(X, x̂) for some fixed x̂, then this x̂ is the “default”
reconstruction of X, i.e., the best estimate when we have no information about X. Therefore D ≥
Dmax can be achieved for free. This is the reason for the notation Dmax despite that it is defined as
an infimum. On the other hand, D0 should be understood as the minimum distortion one can hope
to attain. Indeed, suppose that X̂ = X and d is a metric on X . In this case, we have D0 = 0, since
we can choose Y to be a finitely-valued approximation of X.
As an example, consider the Gaussian source with MSE distortion, namely, X ∼ N (0, σ 2 ) and
2
d(x, x̂) = (x −x̂)2 . We will show later that ϕX (D) = 12 log+ σD . In this case D0 = 0 and the infimum
defining it is not attained; Dmax = σ 2 and if D ≥ σ 2 , we can simply output 0 as the reconstruction
which requires zero bits.
Proof.
(a) Convexity follows from the convexity of PY|X 7→ I(PX , PY|X ) (Theorem 5.3).
(b) Continuity in the interior of the domain follows from convexity, since D0 =
infPX̂|X E[d(X, X̂)] = inf{D : ϕS (D) < ∞}.
(c) The only way to satisfy the constraint is to take X = Y.
(d) Clearly, D0 = d(x, x) = 0. We also clearly have ϕX (D0 ) ≥ ϕX (D0 +). Consider a sequence
of Yn such that E[d(X, Yn )] ≤ 2−n and I(X; Yn ) → ϕX (D0 +). By Borel-Cantelli we have with
probability 1 d(X, Yn ) → 0 and hence (X, Yn ) → (X, X). Then from lower-semicontinuity of
mutual information (4.28) we get I(X; X) ≤ lim I(X; Yn ) = ϕX (D0 +).
(e) For any D > Dmax we can set X̂ = x̂ deterministically. Thus I(X; x̂) = 0. The second claim
follows from continuity.
i i
i i
i i
In channel coding, the main result relates the Shannon capacity, an operational quantity, to the
information capacity. Here we introduce the information rate-distortion function in an analogous
way, which by itself is not an operational quantity.
The reason for defining R(I) (D) is because from Theorem 24.3 we immediately get:
Naturally, the information rate-distortion function inherits the properties of ϕ from Theo-
rem 24.4:
Proof. All properties follow directly from corresponding properties in Theorem 24.4 applied to
ϕSn .
Next we show that R(I) (D) can be easily calculated for stationary memoryless (iid) source
without going through the multi-letter optimization problem. This parallels Corollary 20.5 for
channel capacity (with separable cost function).
i.i.d.
Theorem 24.8 (Single-letterization) For stationary memoryless source Si ∼ PS and
separable distortion d in the sense of (24.8), we have for every n,
Thus
i i
i i
i i
488
Proof. By definition we have that ϕSn (D) ≤ nϕS (D) by choosing a product channel: PŜn |Sn = P⊗ n
Ŝ|S
.
Thus R(I) (D) ≤ ϕS (D).
For the converse, for any PŜn |Sn satisfying the constraint E[d(Sn , Ŝn )] ≤ D, we have
X
n
I(Sn ; Ŝn ) ≥ I(Sj , Ŝj ) (Sn independent)
j=1
X
n
≥ ϕS (E[d(Sj , Ŝj )])
j=1
1X
n
≥ nϕ S E[d(Sj , Ŝj )] (convexity of ϕS )
n
j=1
≥ nϕ S ( D) (ϕS non-increasing)
In the first step we used the crucial super-additivity property of mutual information (6.2).
For generalization to a memoryless but non-stationary sources see Exercise V.10.
Theorem 24.9 (Excess-to-Average) Suppose that there exists (f, g) such that W = f(X) ∈
[M] and P[d(X, g(W)) > D] ≤ ϵ. Assume for some p ≥ 1 and x̂0 ∈ X̂ that (E[d(X, x̂0 )p ])1/p =
Dp < ∞. Then there exists (f′ , g′ ) such that W′ = f′ (X) ∈ [M + 1] and
E[d(X, g(W′ ))] ≤ D(1 − ϵ) + Dp ϵ1−1/p . (24.11)
Remark 24.5 This result is only useful for p > 1, since for p = 1 the right-hand side of (24.11)
does not converge to D as ϵ → 0. However, a different method (as we will see in the proof of
Theorem 25.1) implies that under just Dmax = D1 < ∞ the analog of the second term in (24.11)
is vanishing as ϵ → 0, albeit at an unspecified rate.
Proof. We transform the first code into the second by adding one codeword:
(
′ f ( x) d(x, g(f(x))) ≤ D
f ( x) =
M + 1 otherwise
(
g( j) j ≤ M
g′ ( j) =
x̂0 j=M+1
Then by Hölder’s inequality,
E[d(X, g′ (W′ )) ≤ E[d(X, g(W))|Ŵ 6= M + 1](1 − ϵ) + E[d(X, x̂0 )1{Ŵ = M + 1}]
i i
i i
i i
≤ D(1 − ϵ) + Dp ϵ1−1/p
i i
i i
i i
In this chapter, we prove an achievability bound and (together with the converse from the previous
chapter) establish the identity R(D) = infŜ:E[d(S,Ŝ)]≤D I(S; Ŝ) for stationary memoryless sources.
The key idea is again random coding, which is a probabilistic construction of quantizers by gener-
ating the reconstruction points independently from a carefully chosen distribution. Before proving
this result rigorously, we first convey the main intuition in the case of Bernoulli sources by making
connections to large deviations theory (Chapter 15) and explaining how the constrained minimiza-
tion of mutual information is related to optimization of the random coding ensemble. Later in
Sections 25.2*–25.4*, we extend this random coding construction to establish covering lemma
and soft-covering lemma, which are at the heart of the problem of channel simulation.
We start by recalling the key concepts introduced in the last chapter:
1
R(D) = lim sup log M∗ (n, D), (rate-distortion function)
n→∞ n
1
R(I) (D) = lim sup ϕSn (D), (information rate-distortion function)
n→∞ n
where
ϕ S ( D) ≜ inf I(S; Ŝ) (25.1)
PŜ|S :E[d(S,Ŝ)]≤D
490
i i
i i
i i
Then
Remark 25.1
• Note that Dmax < ∞ does not require that d(·, ·) only takes values in R. That is, Theorem 25.1
permits d(s, ŝ) = ∞.
• When Dmax = ∞, typically we have R(D) = ∞ for all D. Indeed, suppose that d(·, ·) is a metric
(i.e. real-valued and satisfies triangle inequality). Then, for any x0 ∈ An we have
Thus, for any finite codebook {c1 , . . . , cM } we have maxj d(x0 , cj ) < ∞ and therefore
So that R(D) = ∞ for any finite D. This observation, however, should not be interpreted as
the absolute impossibility of compressing such sources; it is just not possible with fixed-length
codes. As an example, for quadratic distortion and Cauchy-distributed S, Dmax = ∞ since S
has infinite second moment. But it is easy to see that1 the information rate-distortion function
R(I) (D) < ∞ for any D ∈ (0, ∞). In fact, in this case R(I) (D) is a hyperbola-like curve that
never touches either axis. Using variable-length codes, Sn can be compressed non-trivially into
W with bounded entropy (but unbounded cardinality) H(W). An open question: Is H(W) =
nR(I) (D) + o(n) attainable?
• We restricted theorem to D > D0 because it is possible that R(D0 ) 6= R(I) (D0 ). For exam-
ple, consider an iid non-uniform source {Sj } with A = Â being a finite metric space with
metric d(·, ·). Then D0 = 0 and from Exercise V.5 we have R(D0 +) < R(D0 ). At the same
time, from Theorem 24.4(d) we know that R(I) is continuous at D0 : R(I) (D0 +) = ϕS (D0 +) =
ϕS (D0 ) = R(I) (D0 ).
1
Indeed, if we take W to be a quantized version of S with small quantization error D and notice that differential entropy of
the Cauchy S is finite, we get from (24.7) that R(I) (D) ≤ H(W) < ∞.
i i
i i
i i
492
• Techniques for proving (25.4) for memoryless sources can be extended to stationary ergodic
sources by making changes to the proof similar to those we have discussed in lossless
compression (Chapter 12).
Before giving a formal proof, we give a heuristic derivation emphasizing the connection to large
deviations estimates from Chapter 15.
25.1.1 Intuition
Let us throw M random points C = {c1 , . . . , cM } into the space Ân by generating them indepen-
dently according to a product distribution QnŜ , where QŜ is some distribution on  to be optimized.
Consider the following simple coding strategy:
The basic idea is the following: Since the codewords are generated independently of the source,
the probability that a given codeword is close to the source realization is (exponentially) small, say,
ϵ. However, since we have many codewords, the chance that there exists a good one can be of high
probability. More precisely, the probability that no good codewords exist is approximately (1 −ϵ)M ,
which can be made close to zero provided M 1ϵ .
i.i.d.
To explain this intuition further, consider a discrete memoryless source Sn ∼ PS and let us eval-
uate the excess distortion of this random code: P[d(Sn , f(Sn )) > D], where the probability is over
all random codewords c1 , . . . , cM and the source Sn . Define
where the last equality follows from the assumption that c1 , . . . , cM are iid and independent of Sn .
i.i.d.
To simplify notation, let Ŝn ∼ QnŜ independently of Sn , so that PSn ,Ŝn = PnS QnŜ . Then
To evaluate the failure probability, let us consider the special case of PS = Ber( 12 ) and also
choose QŜ = Ber( 12 ) to generate the random codewords, aiming to achieve a normalized Hamming
P P
distortion at most D < 12 . Since nd(Sn , Ŝn ) = i:si =1 (1 − Ŝi ) + i:si =0 Ŝi ∼ Bin(n, 21 ) for any sn ,
the conditional probability (25.7) does not depend on Sn and is given by
1
P[d(S , Ŝ ) > D|S ] = P Bin n,
n n n
≥ nD ≈ 1 − 2−n(1−h(D))+o(n) , (25.8)
2
where in the last step we applied large-deviations estimates from Theorem 15.9 and Example 15.1.
(Note that here we actually need lower estimates on these exponentially small probabilities.) Thus,
i i
i i
i i
Pfailure = (1 − 2−n(1−h(D))+o(n) )M , which vanishes if M = 2n(1−h(D)+δ) for any δ > 0.2 As we will
compute in Theorem 26.1, the rate-distortion function for PS = Ber( 12 ) is precisely ϕS (D) =
1 − h(D), so we have a rigorous proof of the optimal achievability in this special case.
For general distribution PS (or even for PS = Ber(p) for which it is suboptimal to choose
QŜ as Ber( 12 )), the situation is more complicated as the conditional probability (25.7) depends
on the source realization Sn through its empirical distribution (type). Let Tn be the set of typical
realizations whose empirical distribution is close to PS . We have
−nE(QŜ ) M
≈(1 − 2 ) ,
where it can be shown (using large deviations analysis similar to information projection in
Chapter 15) that
Thus we conclude that for any choice of QŜ (from which the random codewords were drawn) and
any δ > 0, the above code with M = 2n(E(QŜ )+δ) achieves vanishing excess distortion
= ϕ S ( D)
where the third equality follows from the variational representation of mutual information (Corol-
lary 4.2). This heuristic derivation explains how the constrained mutual information minimization
arises. Below we make it rigorous using a different approach, again via random coding.
2
In fact, this argument shows that M = 2n(1−h(D))+o(n) codewords suffice to cover the entire Hamming space within
distance Dn. See (27.9) and Exercise V.26.
i i
i i
i i
494
W ∈ [M + 1], such d(X, X̂) ≤ d(X, y0 ) almost surely and for any γ > 0,
E[d(X, X̂)] ≤ E[d(X, Y)] + E[d(X, y0 )]e−M/γ + E[d(X, y0 )1 {i(X; Y) > log γ}].
Here the first and the third expectations are over (X, Y) ∼ PX,Y = PX PY|X and the information
density i(·; ·) is defined with respect to this joint distribution (cf. Definition 18.1).
• Theorem 25.2 says that from an arbitrary PY|X such that E[d(X, Y)] ≤ D, we can extract a good
code with average distortion D plus some extra terms which will vanish in the asymptotic regime
for memoryless sources.
• The proof uses the random coding argument with codewords drawn independently from PY , the
marginal distribution induced by the source distribution PX and the auxiliary channel PY|X . As
such, PY|X plays no role in the code construction and is used only in analysis (by defining a
coupling between PX and PY ).
• The role of the deterministic y0 is a “fail-safe” codeword (think of y0 as the default reconstruc-
tion with Dmax = E[d(X, y0 )]). We add y0 to the random codebook for “damage control”, to
hedge against the (highly unlikely) event that we end up with a terrible codebook.
Proof. Similar to the intuitive argument sketched in Section 25.1.1, we apply random coding and
generate the codewords randomly and independently of the source:
i.i.d.
C = {c1 , . . . , cM } ∼ PY ⊥
⊥X
and add the “fail-safe” codeword cM+1 = y0 . We adopt the same encoder-decoder pair (25.5) –
(25.6) and let X̂ = g(f(X)). Then by definition,
To simplify notation, let Y be an independent copy of Y (similar to the idea of introducing unsent
codeword X in channel coding – see Chapter 18):
PX,Y,Y = PX,Y PY
where PY = PY . Recall the formula for computing the expectation of a random variable U ∈ [0, a]:
Ra
E[U] = 0 P[U ≥ u]du. Then the average distortion is
i i
i i
i i
Z d(X,y0 )
= EX P[d(X, Y) > u|X]M du
0
Z d(X,y0 )
= EX (1 − P[d(X, Y) ≤ u|X])M du
0
Z d(X,y0 )
≤ EX (1 − P[d(X, Y) ≤ u, i(X, Y) > −∞|X])M du. (25.11)
0 | {z }
≜δ(X,u)
• (25.12) uses the following trick in dealing with (1 − δ)M for δ 1 and M 1. First, recall the
standard rule of thumb:
(
0, δ M 1
(1 − δ) ≈
M
1, δ M 1
In order to obtain firm bounds of a similar flavor, we apply, for any γ > 0,
(1 − δ)M ≤ e−δM ≤ e−M/γ + (1 − γδ)+ .
• (25.13) is simply a change of measure argument of Proposition 18.3. Namely we apply (18.4)
with f(x, y) = 1 {d(x, y) ≤ u}.
• For (25.14) consider the chain:
1 − γ E[exp{−i(X; Y)}1 {d(X, Y) ≤ u}|X] ≤ 1 − γ E[exp{−i(X; Y)}1 {d(X, Y) ≤ u, i(X; Y) ≤ log γ}|X]
≤ 1 − E[1 {d(X, Y) ≤ u, i(X; Y) ≤ log γ}|X]
= P[d(X, Y) > u or i(X; Y) > log γ|X]
≤ P[d(X, Y) > u|X] + P[i(X; Y) > log γ|X]
As a side product, we have the following achievability result for excess distortion.
i i
i i
i i
496
Theorem 25.3 (Random coding bound of excess distortion) For any PY|X , there
exists a code X → W → X̂ with W ∈ [M], such that for any γ > 0,
P[d(X, X̂) > D] ≤ e−M/γ + P[{d(X, Y) > D} ∪ {i(X; Y) > log γ}]
Proof. Proceed exactly as in the proof of Theorem 25.2 (without using the extra codeword y0 ),
replace (25.11) by P[d(X, X̂) > D] = P[∀j ∈ [M], d(X, cj ) > D] = EX [(1 − P[d(X, Y) ≤ D|X])M ],
and continue similarly.
Finally, we give a rigorous proof of Theorem 25.1 by applying Theorem 25.2 to the iid source
i.i.d.
X = Sn ∼ PS and n → ∞:
Proof of Theorem 25.1. Our goal is the achievability: R(D) ≤ R(I) (D) = ϕS (D).
WLOG we can assume that Dmax = E[d(S, ŝ0 )] is achieved at some fixed ŝ0 – this is our default
reconstruction; otherwise just take any other fixed symbol so that the expectation is finite. The
default reconstruction for Sn is ŝn0 = (ŝ0 , . . . , ŝ0 ) and E[d(Sn , ŝn0 )] = Dmax < ∞ since the distortion
is separable.
Fix some small δ > 0. Take any PŜ|S such that E[d(S, Ŝ)] ≤ D − δ ; such PŜ|S since D > D0 by
assumption. Apply Theorem 25.2 to (X, Y) = (Sn , Ŝn ) with
PX = PSn
PY|X = PŜn |Sn = (PŜ|S )n
log M = n(I(S; Ŝ) + 2δ)
log γ = n(I(S; Ŝ) + δ)
1X
n
d( X , Y ) = d(Sj , Ŝj )
n
j=1
y0 = ŝn0
we conclude that there exists a compressor f : An → [M + 1] and g : [M + 1] → Ân , such that
n o
E[d(Sn , g(f(Sn )))] ≤ E[d(Sn , Ŝn )] + E[d(Sn , ŝn0 )]e−M/γ + E[d(Sn , ŝn0 )1 i(Sn ; Ŝn ) > log γ ]
≤ D − δ + Dmax e− exp(nδ) + E[d(Sn , ŝn0 )1En ], (25.15)
| {z } | {z }
→0 →0 (later)
where
1 X
n
WLLN
En = {i(Sn ; Ŝn ) > log γ} = i(Sj ; Ŝj ) > I(S; Ŝ) + δ ====⇒ P[En ] → 0
n
j=1
If we can show the expectation in (25.15) vanishes, then there exists an (n, M, D)-code with:
M = 2n(I(S;Ŝ)+2δ) , D = D − δ + o( 1) ≤ D.
To summarize, ∀PŜ|S such that E[d(S, Ŝ)] ≤ D −δ we have shown that R(D) ≤ I(S; Ŝ). Sending δ ↓
0, we have, by continuity of ϕS (D) in (D0 ∞) (recall Theorem 24.4), R(D) ≤ ϕS (D−) = ϕS (D).
i i
i i
i i
It remains to show the expectation in (25.15) vanishes. This is a simple consequence of the
uniform integrability of the sequence {d(Sn , ŝn0 )}. We need the following lemma.
Lemma 25.4 For any positive random variable U, define g(δ) = supH:P[H]≤δ E[U1H ], where
δ→0
the supremum is over all events measurable with respect to U. Then3 EU < ∞ ⇒ g(δ) −−−→ 0.
b→∞
Proof. For any b > 0, E[U1H ] ≤ E[U1 {U > b}] + bδ , where E[U1 {U > b√}] −−−→ 0 by
dominated convergence theorem. Then the proof is completed by setting b = 1/ δ .
Pn
Now d(Sn , ŝn0 ) = 1n j=1 Uj , where Uj are iid copies of U ≜ d(S, ŝ0 ). Since E[U] = Dmax < ∞
P
by assumption, applying Lemma 25.4 yields E[d(Sn , ŝn0 )1En ] = 1n E[Uj 1En ] ≤ g(P[En ]) → 0,
since P[En ] → 0. This proves the theorem.
Remark 25.2 (Fundamental limit for excess distortion) Although Theorem 25.1 is
stated for the average distortion, under certain mild extra conditions, it also holds for excess distor-
tion where the goal is to achieve d(Sn , Ŝn ) ≤ D with probability arbitrarily close to one as opposed
to in expectation. Indeed, the achievability proof of Theorem 25.1 is already stated in high proba-
bility. For converse, assume in addition to (25.3) that Dp ≜ E[d(S, ŝ)p ]1/p < ∞ for some ŝ ∈ Ŝ and
Pn
p > 1. Applying Rosenthal’s inequality [368, 235], we have E[d(S, ŝn )p ] = E[( i=1 d(Si , ŝ))p ] ≤
CDpp for some constant C = C(p). Then we can apply Theorem 24.9 to convert a code for excess
distortion to one for average distortion and invoke the converse for the latter.
To end this section, we note that in Section 25.1.1 and in Theorem 25.1 it seems we applied
different proof techniques. How come they both turn out to yield the same tight asymptotic result?
This is because the key to both proofs is to estimate the exponent (large deviations) of the under-
lined probabilities in (25.9) and (25.11), respectively. To obtain the right exponent, as we know,
the key is to apply tilting (change of measure) to the distribution solving the information projec-
tion problem (25.10). When PY = (QŜ )n = (PŜ )n with PŜ chosen as the output distribution in the
solution to rate-distortion optimization (25.1), the resulting exponent is precisely given by 2−i(X;Y) .
3
In fact, ⇒ is ⇔.
i i
i i
i i
498
A1 B1 A1 B1
A2 B2 A2 W B2
. . . .
. . . .
. . . .
An Bn An Bn
P Q
Figure 25.1 Description of channel simulation game. The distribution P (left) is to be simulated via the
distribution Q (right) at minimal rate R. Depending on the exact formulation we either require R = I(A; B)
(covering lemma) or R = C(A; B) (soft-covering lemma).
i.i.d.
(Ai , Bi ) ∼ PA,B is declared whenever (An , Bn ) ∈ F) then this is precisely the setting in which
covering lemma operates. In the next section we show that a higher rate R = C(A; B) is required
if F is not known ahead of time. We leave out the celebrated theorem of Bennett and Shor [43]
which shows that rate R = I(A; B) is also attainable even if F is not known, but if encoder and
decoder are given access to a source of common random bits (independent of An , of course).
Before proceeding, we note some simple corner cases:
1 If R = H(A), we can compress An and send it to “B side”, who can reconstruct An perfectly and
use that information to produce Bn through PBn |An .
2 If R = H(B), “A side” can generate Bn according to PnA,B and send that Bn sequence to the “B
side”.
3 If A ⊥
⊥ B, we know that R = 0, as “B side” can generate Bn independently.
Our previous argument for achieving the rate-distortion turns out to give a sharp answer (that
R = I(A; B) is sufficient) for the F-known case as follows.
Theorem 25.5 (Covering Lemma) Fix PA,B and let (Aj , Bj )i.i.d.
∼ PA,B , R > I(A; B). We gener-
ate a random codebook C = {c1 , . . . , cM }, log M = nR, with each codeword cj drawn i.i.d. from
distribution PnB . Then we have for all sets F
Remark 25.3 The origin of the name “covering” is from the application to a proof of Theo-
rem 25.1. In that context we set A = S and B = Ŝ to be the source and optimal reconstruction (in
i i
i i
i i
the sense of minimizing R(I) (D)). Then taking F = {(an , bn ) : d(an , bn ) ≤ D + δ} we see that both
terms in the right-hand side of the inequality are o(1). Thus, sampling 2nR reconstruction points
we covered the space of source realizations in such a way that with high probability we can always
find a reconstruction with low distortion.
Proof. Set γ > M and following similar arguments of the proof for Theorem 25.2, we have
P[∀c ∈ C : (An , c) 6∈ F] ≤ e−M/γ + P[{(An , Bn ) 6∈ F} ∪ {i(An ; Bn ) > log γ}]
= P[(An , Bn ) 6∈ F] + o(1)
⇒ P[∃c ∈ C : (An , c) ∈ F] ≥ P[(An , Bn ) ∈ F] + o(1)
As we explained, the version of covering lemma that we stated shows how to “fool the tester”
applying only one fixed test set F. However, if both A and B take values on finite alphabets then
something stronger can be stated. This original version of the covering lemma [111] is what is
used in sophisticated distributed compression arguments, e.g. Theorem 11.17. Before stating the
result we remind that for two sequences an , bn we denote their joint empirical distribution by
1X
n
P̂an ,bn (α, β) ≜ 1{ai = α, bi = β} , α ∈ A, β ∈ B .
n
i=1
It is also useful to review joint typicality discussion in Remark 18.2. In this section we say that a
sequence of pairs of vectors (an , bn ) is jointly typical with respect to PA,B if
TV(P̂an ,bn , PA,B ) = o(1) .
Fix a distribution PA,B and any codebook C = {c1 , . . . , cM } consisting of elements cj ∈ B n . For
any fixed input string an we define
W = argmin TV(P̂an ,cj , PA,B ) , B̂n = cW . (25.17)
j∈[M]
The next corollary says that in order for us to produce a jointly typical pair (An , B̂n ) a codebook
must have the rate R > I(A; B) and this is optimal.
Corollary 25.6 Fix PA,B on a pair of finite alphabets A, B. For any R > I(A; B) we generate
a random codebook C = {c1 , . . . , cM }, log M = nR, where each codeword cj is drawn i.i.d. from
distribution PnB . With B̂n defined as in (25.17) we have that pair (An , B̂n ) is jointly typical with high
probability
E[TV(P̂An ,B̂n , PA,B )] = oR (1) . (25.18)
Furthermore, no codebook with rate R < I(A, B) can achieve (25.18).
Proof. First, in this case i(An ; Bn ) is a sum of bounded iid terms and thus the oR (1) term in (25.16)
is in fact e−Ω(n) . Fix arbitrary ϵ > 0 and apply Theorem to
n o
F = (an , bn ) : P̂an ,bn (α, β) − PA,B (α, β) ≤ ϵ
i i
i i
i i
500
By the Markov inequality (and assuming WLOG that PB (β) > 0 for all β ) we get that
P [ T = 1] = o( 1) .
Thus, we have
The first two terms are o(n) so we focus on the last term. Since Q̂ is a random variable with
polynomially many possible values, cf. Exercise I.2, we have
i i
i i
i i
Let there be nβ positions with B̂i = β . Conditioned on Q̂, the random variable An ranges over
nβ n
nβ Q(α1 |β)···nβ Q(α|A| |β)
. Since under T = 0 we have Q̂ → PA|B and nβ → PB (β) as n → ∞ we
conclude from Proposition 1.5 and the continuity of entropy Proposition 4.8 that
H(An |Q̂, B̂n , T = 0) ≤ n(H(A|B) + δ)
for some δ = δ(ϵ) > 0 that vanishes as ϵ → 0.
Applications of Corollary 25.6 include distributed compression (Theorem 11.17) and hypoth-
esis testing (Theorem 16.6). Now, those applications use the rate-constrained B̂n to create a
required correlation (joint typicality) not only with An but also with other random variables. Those
applications will require the following simple observation.
Proposition 25.7 Fix some PX,A,B on finite alphabets and consider a pair of random variables
(An , B̂n ) which are jointly typical on average (specifically, (25.18) holds as n → ∞). Given An , B̂n
suppose that Xn is generated ∼ P⊗ n
X|A,B . Then we have
Remark 25.4 (Markov lemma) This result is known as Markov lemma, e.g. [106, Lemma
15.8.1] because in the standard application setting one considers a joint distribution PX,A,B =
i.i.d.
PX,A PB|A , i.e. X → A → B. In this application, one further has (Xn , An ) ∼ PX,A generated by
nature with only An being observed. Given An one constructs a jointly typical vector B̂n (e.g. via
covering lemma Corollary 25.6). Now, since with high probability (Xn , An ) is jointly typical, it
is tempting to automatically conclude that (Xn , B̂n ) would also be jointly typical. Unfortunately,
joint typicality relation is generally not transitive.4 In the above result, however, what resolves
this issue is the fact that Xn can be viewed as generated after (An , B̂n ) were already selected. Thus,
viewing the process in this order we can even allow Xn to depend on B̂n , which is what we did. For
stronger results under the classical setting of PX|A,B = PX|A see [147, Lemma 12.1].
Proof. Note that from condition (25.18) and Markov inequality we get that TV(P̂An ,B̂n , PA,B ) =
o(1) with probability 1 − o(1). Fix any a, b, x ∈ A × B × X and consider m = nP̂An ,B̂n (a, b)
coordinates i ∈ [n] with Ai = a, B̂i = b. Among these there are m′ ≤ m coordinates i that
also satisfy Xi = x. Standard concentration estimate shows that |m′ − mPX|A,B (x|a, b)| = o(m)
with probability 1 − o(1). Hence, normalizing by m we obtain (from the union bound) that with
probability 1 − o(1) we have
|P̂Xn ,An ,B̂n (x, a, b) − PX,A,B (x, a, b)| = o(1) .
4
Let PX,A,B = PX PA PB with PX = PA = PB = Ber(1/2). Take an to be any binary string in {0, 1}n with n/2 ones. Set
xj = bj = aj for j ≤ n/2 and xj = bj = 1 − aj , otherwise. Then (xn , an ) and (an , bn ) are jointly typical, but (xn , bn ) is
not.
i i
i i
i i
502
This implies the first statement. Note that by summing out a ∈ A we obtain that
But then repeating the steps of the second part of Corollary 25.6 we obtain I(Xn ; B̂n ) ≥ nI(X; B) +
o(n), as required.
Remark 25.5 Although in (25.19) we only proved a lower bound (which is sufficient for
applications in this book), it is known that under the Markov assumption X → A → B the inequal-
ity (25.19) holds with equality [111, Chapter 15]. This follows as a by-product of a deep entropy
characterization problem for which we recommend the mentioned reference.
Let us go back to the discussion in the beginning of this section. We have learned how to “fool”
the tester that uses one fixed test set F (Theorem 25.5). Then for finite alphabets we have shown
that we can also “fool” the tester that computes empirical averages since
1X
n
f(Aj , B̂j ) ≈ EA,B∼PA,B [f(A, B)] ,
n
j=1
for any bounded function f. A stronger requirement would be to demand that the joint distribution
PAn ,B̂n fools any permutation invariant tester, i.e.
where the supremum is taken over all permutation invariant subsets F ⊂ An × B n . This is not
guaranteed by Corollary 25.6. Indeed, note that a sufficient statistic for a permutation invariant
tester is a joint type P̂An ,B̂n , and Corollary does show that P̂An ,B̂n ≈ PA,B (in the sense of L1 distance
of vectors). However, it still might happen that P̂An ,B̂n although close to PA,B takes highly different
values compared to those of P̂An ,Bn . For example, if we restrict all c ∈ C to have a fixed composition
P0 , the tester can easily detect the problem since PnB -measure of all strings of composition P0
√
cannot exceed O(1/ n). Formally, to fool permutation invariant tester we need to have small
total variation between the distribution of P̂An ,B̂n and P̂An ,Bn .
We conjecture, however, that nevertheless the rate R = I(A; B) should be sufficient to achieve
also this stronger requirement. In the next section we show that if one removes the permutation-
invariance constraint, then a larger rate R = C(A; B) is needed.
between the simulated and the true output (see Figure 25.1).
i i
i i
i i
Theorem 25.8 (Cuff [116]) Let PA,B be an arbitrary distribution on the finite space A × B.
i.i.d.
Consider a coding scheme where Alice observes An ∼ PnA , sends a message W ∈ [2nR ] to Bob,
who given W generates a (possibly random) sequence B̂n . If (25.20) is satisfied for all ϵ > 0 and
sufficiently large n, then we must have
where C(A; B) is known as the Wyner’s common information [458]. Furthermore, for any R >
C(A; B) and ϵ > 0 there exists n0 (ϵ) such that for all n ≥ n0 (ϵ) there exists a scheme
satisfying (25.20).
Note that condition (25.20) guarantees that any tester (permutation invariant or not) is fooled to
believe he sees the truly iid (An , Bn ) with probability ≥ 1 −ϵ. However, compared to Theorem 25.5,
this requires a higher communication rate since C(A; B) ≥ I(A; B), clearly.
Proof. Showing that Wyner’s common information is a lower-bound is not hard. First, since
PAn ,B̂n ≈ PnA,B (in TV) we have
(Here one needs to use finiteness of the alphabet of A and B and the bounds relating H(P) − H(Q)
with TV(P, Q), cf. (7.20) and Corollary 6.7). Next, we have
≳ nC(A; B) (25.25)
At → W → B̂t
and that Wyner’s common information PA,B 7→ C(A; B) should be continuous in the total variation
distance on PA,B .
To show achievability, let us notice that the problem is equivalent to constructing three random
variables (Ân , W, B̂n ) such that a) W ∈ [2nR ], b) the Markov relation
holds and c) TV(PÂn ,B̂n , PnA,B ) ≤ ϵ/2. Indeed, given such a triple we can use coupling charac-
terization of TV (7.20) and the fact that TV(PÂn , PnA ) ≤ ϵ/2 to extend the probability space
to
An → Ân → W → B̂n
i i
i i
i i
504
and P[An = Ân ] ≥ 1 − ϵ/2. Again by (7.20) we conclude that TV(PAn ,B̂n , PÂn ,B̂n ) ≤ ϵ/2 and by
triangle inequality we conclude that (25.20) holds.
Finally, construction of the triple satisfying a)-c) follows from the soft-covering lemma
(Corollary 25.10) applied with V = (A, B) and W being uniform on the set of xi ’s there.
A natural questions is how large n should be in order for the approximation PY|X ◦ P̂n ≈ PY to
hold. A remarkable fact that we establish in this section is that the answer is n ≈ 2I(X;Y) , assum-
ing I(X; Y) 1 and there is certain concentration properties of i(X; Y) around I(X; Y). This fact
originated from Wyner [458] and was significantly strengthened in [212].
Here, we show a new variation of such results by strengthening our simple χ2 -information
bound of Proposition 7.17 (corresponding to λ = 2).
Theorem 25.9 Fix PX,Y and for any λ ∈ R define the Rényi mutual information of order λ
Iλ (X; Y) ≜ Dλ (PX,Y kPX PY ) ,
where Dλ is the Rényi-divergence, cf. Definition 7.24. We have for every 1 < λ ≤ 2
1
E[D(PY|X ◦ P̂n kPY )] ≤ log(1 + exp{(λ − 1)(Iλ (X; Y) − log n)}) . (25.27)
λ−1
i i
i i
i i
Note that conditioned on Y we get to analyze a λ-th moment of a sum of iid random variables.
This puts us into a well-known setting of Rosenthal-type inequalities. In particular, we have that5
for any iid non-negative Bj we have, provided 1 ≤ λ ≤ 2,
!λ
X n
E Bi ≤ n E[Bλ ] + (n E[B])λ . (25.30)
i=1
which implies
1
Iλ (Xn ; Ȳ) ≤ log 1 + n1−λ exp{(λ − 1)Iλ (X; Y)} ,
λ−1
which together with (25.28) recovers the main result (25.27).
Remark 25.6 Hayashi [217] upper bounds the LHS of (25.27) with
λ λ−1
log(1 + exp{ (Kλ (X; Y) − log n)}) ,
λ−1 λ
where Kλ (X; Y) = infQY Dλ (PX,Y kPX QY ) is the so-called Sibson-Csiszár information, cf. [338].
This bound, however, does not have the right rate of convergence as n → ∞, at least for λ = 1 as
comparison with Proposition 7.17 reveals.
We note that [217, 212] also contain direct bounds on
E[TV(PY|X ◦ P̂n , PY )]
P
which do not assume existence of λ-th moment of PYY|X for λ > 1 and instead rely on the distribution
of i(X; Y). We do not discuss these bounds here, however, since for the purpose of discussing finite
alphabets the next corollary is sufficient.
5
The inequality (25.30), which is known to be essentially tight [374], can be shown by applying
∑
(a + b)λ−1 ≤ aλ−1 + bλ−1 and Jensen’s to get E Bi (Bi + j̸=i Bj )λ−1 ≤ E[Bλ ] + E[B]((n − 1) E[B])λ−1 . Summing
the left side over i and bounding (n − 1) ≤ n we get (25.30).
i i
i i
i i
506
Remark 25.7 The origin of the name “soft-covering” is due to the fact that unlike the covering
lemma (Theorem 25.5) which selects one xi (trying to make PY|X=xi as close to PY as possible) here
we mix over n choices uniformly.
Proof. By tensorization of Rényi divergence, cf. Section 7.12, we have
Iλ (X; Y) = dIλ (U; V) .
For every 1 < λ < λ0 we have that λ 7→ Iλ (U; V) is continuous and converging to I(U; V) as
λ → 1. Thus, we can find λ sufficiently small so that R > Iλ (U; V). Applying Theorem 25.9 with
this λ completes the proof.
i i
i i
i i
In previous chapters we established the main coding theorem for lossy data compression: For
stationary memoryless (iid) sources and separable distortion, under the assumption that Dmax < ∞,
the operational and information rate-distortion functions coincide, namely,
In addition, we have shown various properties about the rate-distortion function (cf. Theorem 24.4).
In this chapter we compute the rate-distortion function for several important source distributions
by evaluating this constrained minimization of mutual information. The common technique we
apply to evaluate these special cases in Section 26.1 is then formalized in Section 26.2* as a saddle
point property akin to those in Sections 5.2 and 5.4* for mutual information maximization (capac-
ity). Next we extend the paradigm of joint source-channel coding in Section 19.7 to the lossy
setting; this reasoning will later be found useful in statistical applications in Part VI (cf. Chap-
ter 30). Finally, in Section 26.4 we discuss several limitations, both theoretical and practical, of
the classical theory for lossy compression and joint source-channel coding.
Theorem 26.1
R(D) = (h(p) − h(D))+ . (26.1)
For example, when p = 1/2, D = .11, we have R(D) ≈ 1/2 bits. In the Hamming game
described in Section 24.2 where we aim to compress 100 bits down to 50, we indeed can do this
while achieving 11% average distortion, compared to the naive scheme of storing half the string
and guessing on the other half, which achieves 25% average distortion. Note that we can also get
very tight non-asymptotic bounds, cf. Exercise V.3.
507
i i
i i
i i
508
Proof. Since Dmax = p, in the sequel we can assume D < p for otherwise there is nothing to
show.
For the converse, consider any PŜ|S such that P[S 6= Ŝ] ≤ D ≤ p ≤ 21 . Then
In order to achieve this bound, we need to saturate the above chain of inequalities, in particular,
choose PŜ|S so that the difference S + Ŝ is independent of Ŝ. Let S = Ŝ + Z, where Ŝ ∼ Ber(p′ ) ⊥
⊥
Z ∼ Ber(D), and p′ is such that the convolution gives exactly Ber(p), namely,
p′ ∗ D = p′ (1 − D) + (1 − p′ )D = p,
p−D
i.e., p′ = 1−2D . In other words, the backward channel PS|Ŝ is exactly BSC(D) and the resulting
PŜ|S is our choice of the forward channel PŜ|S . Then, I(S; Ŝ) = H(S) − H(S|Ŝ) = H(S) − H(Z) =
h(p) − h(D), yielding the upper bound R(D) ≤ h(p) − h(D).
Remark 26.1 Here is a more general strategy (which we will later implement in the Gaussian
case.) Denote the optimal forward channel from the achievability proof by P∗Ŝ|S and P∗S|Ŝ the asso-
ciated backward channel (which is BSC(D)). We need to show that there is no better PŜ|S with
P[S 6= Ŝ] ≤ D and a smaller mutual information. Then
Remark 26.2 By WLLN, the distribution PnS = Ber(p)n concentrates near the Hamming sphere
of radius np as n grows large. Recall that in proving Shannon’s rate distortion theorem, the optimal
codebook are drawn independently from PnŜ = Ber(p′ )n with p′ = 1p−−2D D
. Note that p′ = 1/2 if
p = 1/2 but p′ < p if p < 1/2. In the latter case, the reconstruction points concentrate on a smaller
sphere of radius np′ and none of them are typical source realizations, as illustrated in Figure 26.1.
For a generalization of this result to m-ary uniform source, see Exercise V.6.
i i
i i
i i
S(0, np)
S(0, np′ )
Hamming Spheres
Figure 26.1 Source realizations (solid sphere) versus codewords (dashed sphere) in compressing Hamming
sources.
Theorem 26.2 Let S ∼ N (0, σ 2 ) and d(s, ŝ) = (s − ŝ)2 for s, ŝ ∈ R. Then
1 σ2
R ( D) = log+ . (26.2)
2 D
In the vector case of S ∼ N (0, σ 2 Id ) and d(s, ŝ) = ks − ŝk22 ,
d dσ 2
R ( D) = log+ . (26.3)
2 D
Proof. Since Dmax = σ 2 , in the sequel we can assume D < σ 2 for otherwise there is nothing to
show.
(Achievability) Choose S = Ŝ + Z , where Ŝ ∼ N (0, σ 2 − D) ⊥
⊥ Z ∼ N (0, D). In other words,
the backward channel PS|Ŝ is AWGN with noise power D, and the forward channel can be easily
found to be PŜ|S = N ( σ σ−2 D S, σ σ−2 D D). Then
2 2
1 σ2 1 σ2
I(S; Ŝ) =
log =⇒ R(D) ≤ log
2 D 2 D
(Converse) Formally, we can mimic the proof of Theorem 26.1 replacing Shannon entropy by
the differential entropy and applying the maximal entropy result from Theorem 2.8; the caveat is
that for Ŝ (which may be discrete) the differential entropy may not be well-defined. As such, we
follow the alternative proof given in Remark 26.1. Let PŜ|S be any conditional distribution such
that EP [(S − Ŝ)2 ] ≤ D. Denote the forward channel in the above achievability by P∗Ŝ|S . Then
" #
P∗S|Ŝ
∗
I(PS , PŜ|S ) = D(PS|Ŝ kPS|Ŝ |PŜ ) + EP log
PS
i i
i i
i i
510
" #
P∗S|Ŝ
≥ EP log
PS
(S−Ŝ)2
√ 1 e− 2D
= EP log 2πD
S2
√ 1
2π σ 2
e− 2 σ 2
" #
1 σ2 log e S2 (S − Ŝ)2
= log + EP −
2 D 2 σ2 D
1 σ2
≥ log .
2 D
Finally, for the vector case, (26.3) follows from (26.2) and the same single-letterization argu-
ment in Theorem 24.8 using the convexity of the rate-distortion function in Theorem 24.4(a).
The interpretation of the optimal reconstruction points in the Gaussian case is analogous to that
of the Hamming source previously
√ discussed in Remark 26.2: As n grows, the Gaussian random
2
vector concentrates on S(0, nσ ) (n-sphere in Euclidean p space rather than Hamming), but each
reconstruction point drawn from (P∗Ŝ )n is close to S(0, n(σ 2 − D)). So again the picture is similar
to Figure 26.1 of two nested spheres.
We can also understand geometry of errors of optimal compressors. Indeed, suppose we have
a sequence of quantizers Xn → W → X̂n with n1 log M → R(D). As we know, without loss of
generality we may assume X̂n = E[Xn |W]. Let us denote by Σ = Cov[Xn |W] be the covariance
matrix of the reconstruction errors. We know that 1n tr Σ ≤ D by the distortion constraint. Now let
us express mutual information in terms of differential entropy to obtain
Applying maximum entropy principle (2.19) to the second term (and taking expectation over W
inside log det via Jensen’s and Corollary 2.9) we obtain
n 1
log M ≥ log σ 2 − log det Σ .
2 2
Let {λj , j ∈ [n]} denote the spectrum of Σ. Dividing by n and recalling that quantizer is optimal
we get
1X1
n
1 σ2 σ2
log + o( 1) ≥ log .
2 D n 2 λj
j=1
2
From strict convexity of λ 7→ 12 log σλ we conclude that empirical distributions of eigenvalues, i.e.
P
j δλj , must converge to a point, i.e. to δD . In this sense Σ ≈ D · In and the uncertainty regions
1
n
(given the message) should be approximately spherical.
Note that the exact expression in Theorem 26.2 relies on the Gaussianity assumption of the
source. How sensitive is the rate-distortion formula to this assumption? The following comparison
result is a counterpart of Theorem 20.12 for channel capacity:
i i
i i
i i
Theorem 26.3 Assume that ES = 0 and Var S = σ 2 . Consider the MSE distortion. Then
1 σ2 1 σ2
log+ − D(PS kN (0, σ 2 )) ≤ R(D) = inf I(S; Ŝ) ≤ log+ .
2 D PŜ|S :E(Ŝ−S)2 ≤D 2 D
Remark 26.3 A simple consequence of Theorem 26.3 is that for source distributions with a
density, the rate-distortion function grows according to 12 log D1 in the low-distortion regime as
long as D(PS kN (0, σ 2 )) is finite. In fact, the first inequality, known as the Shannon lower bound
(SLB), is asymptotically tight, in the sense that
1 σ2
R(D) = log − D(PS kN (0, σ 2 )) + o(1), D → 0 (26.4)
2 D
under appropriate conditions on PS [281, 247]. Therefore, by comparing (2.21) and (26.4), we
see that, for small distortion, uniform scalar quantization (Section 24.1) is in fact asymptotically
optimal within 12 log(2πe) ≈ 2.05 bits.
Later in Section 30.1 we will apply SLB to derive lower bounds for statistical estimation. For
this we need the following general version of SLB (see Exercise V.22 for a proof): Let k · k be an
arbitrary norm on Rd and r > 0. Let X be a d-dimensional continuous random vector with finite
differential entropy h(X). Then
d d d
inf I(X; X̂) ≥ h(X) + log − log Γ +1 V , (26.5)
PX̂|X :E[∥X̂−X∥r ]≤D r Dre r
distortion function:
R(D) ≤ I(PS , P∗Ŝ|S )
σ2 − D σ2 − D
= I ( S; S + W) W ∼ N ( 0, D)
σ2 σ2
σ2 − D
≤ I ( SG ; SG + W) by Gaussian saddle point (Theorem 5.11)
σ2
1 σ2
= log .
2 D
“≥”: For any PŜ|S such that E(Ŝ − S)2 ≤ D. Let P∗S|Ŝ = N (Ŝ, D) denote the AWGN channel
with noise power D. Then
I(S; Ŝ) = D(PS|Ŝ kPS |PŜ )
" #
P∗S|Ŝ
= D(PS|Ŝ kP∗S|Ŝ |PŜ ) + EP log − D(PS kPSG )
PSG
(S−Ŝ)2
√ 1 e− 2D
≥ EP log 2πD
S2
− D(PS kPSG )
√ 1
2π σ 2
e− 2 σ 2
i i
i i
i i
512
1 σ2
≥ log − D(PS kPSG ).
2 D
In fact we have discussed in Section 5.6 iterative algorithms (Blahut-Arimoto) that computes R(D).
However, for the peace of mind it is good to know there are some general reasons why tricks like
we used in the Hamming or Gaussian case actually are guaranteed to work.
Theorem 26.4
1 Suppose PY∗ and PX|Y∗ PX are such that E[d(X, Y∗ )] ≤ D and for any PX,Y with E[d(X, Y)] ≤
D we have
dPX|Y∗
E log (X|Y) ≥ I(X; Y∗ ) . (26.6)
dPX
Then R(D) = I(X; Y∗ ).
2 Suppose that I(X; Y∗ ) = R(D). Then for any regular branch of conditional probability PX|Y∗
and for any PX,Y satisfying
• E[d(X, Y)] ≤ D and
• PY PY∗ and
• I ( X ; Y) < ∞ ,
the inequality (26.6) holds.
1 The first part is a sufficient condition for optimality of a given PXY∗ . The second part gives a
necessary condition that is convenient to narrow down the search. Indeed, typically the set of
PX,Y satisfying those conditions is rich enough to infer from (26.6):
dPX|Y∗
log (x|y) = R(D) − θ[d(x, y) − D] ,
dPX
for a positive θ > 0.
i i
i i
i i
2 Note that the second part is not valid without assuming PY PY∗ . A counterexample to this
and various other erroneous (but frequently encountered) generalizations is the following: A =
{0, 1}, PX = Ber(1/2), Â = {0, 1, 0′ , 1′ } and
From here we will conclude, similar to Proposition 2.20, that the first term is o(λ) and thus for
sufficiently small λ we should have I(X; Yλ ) < R(D), contradicting optimality of coupling PX,Y∗ .
We proceed to details. For every λ ∈ [0, 1] define
dPY
ρ 1 ( y) ≜ ( y)
dPY∗
λρ1 (y)
λ(y) ≜
λρ1 (y) + λ̄
(λ)
PX|Y=y = λ(y)PX|Y=y + λ̄(y)PX|Y∗ =y
dPYλ = λdPY + λ̄dPY∗ = (λρ1 (y) + λ̄)dPY∗
D(y) = D(PX|Y=y kPX|Y∗ =y )
(λ)
Dλ (y) = D(PX|Y=y kPX|Y∗ =y ) .
i i
i i
i i
514
Notice:
Dλ (y) ≤ λ(y)D(y)
and therefore
1
Dλ (y)1{ρ1 (y) > 0} ≤ D(y)1{ρ1 (y) > 0} .
λ(y)
Notice that by (26.7) the function ρ1 (y)D(y) is non-negative and PY∗ -integrable. Then, applying
dominated convergence theorem we get
Z Z
1 1
lim dPY∗ Dλ (y)ρ1 (y) = dPY∗ ρ1 (y) lim D λ ( y) = 0 (26.8)
λ→0 {ρ >0}
1
λ( y ) {ρ1 >0} λ→ 0 λ( y)
where in the last step we applied the result from Chapter 5
Finally, since
(λ)
PX|Y ◦ PYλ = PX ,
we have
(λ) dPX|Y∗ dPX|Y∗ ∗
I ( X ; Yλ ) = D(PX|Y kPX|Y∗ |PYλ ) + λ E log (X|Y) + λ̄ E log (X|Y )
dPX dPX
= I∗ + λ(I1 − I∗ ) + o(λ) ,
I ( X ; Y λ ) ≥ I ∗ = R ( D) .
i i
i i
i i
Such a pair (f, g) is called a (k, n, D)-JSCC, which transmits k symbols over n channel uses such
that the end-to-end distortion is at most D in expectation. Our goal is to optimize the encoder/de-
coder pair so as to maximize the transmission rate (number of symbols per channel use) R = nk .1
As such, we define the asymptotic fundamental limit as
1
RJSCC (D) ≜ lim inf max {k : ∃(k, n, D)-JSCC} .
n→∞ n
To simplify the exposition, we will focus on JSCC for a stationary memoryless source Sk ∼ P⊗S
k
⊗n
transmitted over a stationary memoryless channel PYn |Xn = PY|X subject to a separable distortion
Pk
function d(sk , ŝk ) = 1k i=1 d(si , ŝi ).
26.3.1 Converse
The converse for the JSCC is quite simple, based on data processing inequality and following the
weak converse of lossless JSCC using Fano’s inequality.
where C = supPX I(X; Y) is the capacity of the channel and R(D) = infP :E[d(S,Ŝ)]≤D I(S; Ŝ) is the
Ŝ|S
rate-distortion function of the source.
The interpretation of this result is clear: Since we need at least R(D) bits per symbol to recon-
struct the source up to a distortion D and we can transmit at most C bits per channel use, the overall
transmission rate cannot exceeds C/R(D). Note that the above theorem clearly holds for channels
with cost constraint with the corresponding capacity (Chapter 20).
1
Or equivalently, minimize the bandwidth expansion factor ρ = nk .
i i
i i
i i
516
Proof. Consider a (k, n, D)-code which induces the Markov chain Sk → Xn → Yn → Ŝk such
Pk
that E[d(Sk , Ŝk )] = 1k i=1 E[d(Si , Ŝi )] ≤ D. Then
( a) (b) ( c)
kR(D) = inf I(Sk ; Ŝk ) ≤ I(Sk ; Ŝk ) ≤ I(Xn ; Yn ) ≤ sup I(Xn ; Yn ) = nC
PŜk |Sk :E[d(Sk ,Ŝk )]≤D P Xn
where (b) applies data processing inequality for mutual information, (a) and (c) follow from the
respective single-letterization result for lossy compression and channel coding (Theorem 24.8 and
Proposition 19.10).
Remark 26.4 Consider the case where the source is Ber(1/2) with Hamming distortion. Then
Theorem 26.5 coincides with the converse for channel coding under bit error rate Pb in (19.33):
k C
R= ≤
n 1 − h(Pb )
which was previously given in Theorem 19.21 and proved using ad hoc techniques. In the case of
channel with cost constraints, e.g., the AWGN channel with C(SNR) = 12 log(1 + SNR), we have
−1 C(SNR)
Pb ≥ h 1−
R
This is often referred to as the Shannon limit in plots comparing the bit-error rate of practical
codes. (See, e.g., Fig. 2 from [359] for BIAWGN (binary-input) channel.) This is erroneous, since
the pb above refers to the bit error rate of data bits (or systematic bits), not all of the codeword bits.
The latter quantity is what typically called BER (see (19.33)) in the coding-theoretic literature.
Theorem 26.6 For any stationary memoryless source (PS , S, Ŝ, d) with rate-distortion func-
tion R(D) satisfying Assumption 26.1 (below), and for any stationary memoryless channel PY|X
with capacity C,
C
RJSCC (D) = .
R(D)
Assumption 26.1 on the source (which is rather technical and can be skipped in the first reading)
is to control the distortion incurred by the channel decoder making an error. Despite this being a
low-probability event, without any assumption on the distortion metric, we cannot say much about
its contribution to the end-to-end average distortion. (Note that this issue does not arise in lossless
i i
i i
i i
JSCC). Assumption 26.1 is trivially satisfied by bounded distortion (e.g., Hamming), and can be
shown to hold more generally such as for Gaussian sources and MSE distortion.
Proof. In view of Theorem 26.5, we only prove achievability. We constructed a separated
compression/channel coding scheme as follows:
• Let (fs , gs ) be a (k, 2kR(D)+o(k) , D)-code for compressing Sk such that E[d(Sk , gs (fs (Sk )] ≤ D.
By Lemma 26.8 (below), we may assume that all reconstruction points are not too far from
some fixed string, namely,
d(sk0 , gs (i)) ≤ L (26.9)
for all i and some constant L, where sk0 = (s0 , . . . , s0 ) is from Assumption 26.1 below.
• Let (fc , gc ) be a (n, 2nC+o(n) , ϵn )max -code for channel PYn |Xn such that kR(D) + o(k) ≤ nC +
o(n) and the maximal probability of error ϵn → 0 as n → ∞. Such as code exists thanks to
Theorem 19.9 and Corollary 19.5.
Let the JSCC encoder and decoder be f = fc ◦ fs and g = gs ◦ gc . So the overall system is
fs fc gc gs
Sk −
→W−
→ Xn −→ Yn −
→ Ŵ −
→ Ŝk .
Note that here we need to control the maximal probability of error of the channel code since
when we concatenate these two schemes, W at the input of the channel is the output of the source
compressor, which need not be uniform.
To analyze the average distortion, we consider two cases depending on whether the channel
decoding is successful or not:
E[d(Sk , Ŝk )] = E[d(Sk , gs (W))1{W = Ŵ}] + E[d(Sk , gs (Ŵ)))1{W 6= Ŵ}].
By assumption on our lossy code, the first term is at most D. For the second term, we have P[W 6=
Ŵ] ≤ ϵn = o(1) by assumption on our channel code. Then
( a)
E[d(Sk , gs (Ŵ))1{W 6= Ŵ}] ≤ E[1{W 6= Ŵ}λ(d(Sk , ŝk0 ) + d(sk0 , gs (Ŵ)))]
(b)
≤ λ · E[1{W 6= Ŵ}d(Sk , ŝk0 )] + λL · P[W 6= Ŵ]
( c)
= o(1),
where (a) follows from the generalized triangle inequality from Assumption 26.1(a) below; (b)
follows from (26.9); in (c) we apply Lemma 25.4 that were used to show the vanishing of the
expectation in (25.15) before.
In all, our scheme meets the average distortion constraint. Hence we conclude that for all R >
C/R(D), there exists a sequence of (k, n, D + o(1))-JSCC codes.
The following assumption is needed by the previous theorem:
Assumption 26.1 Fix D. For a source (PS , S, Ŝ, d), there exists λ ≥ 0, s0 ∈ S, ŝ0 ∈ Ŝ such
that
i i
i i
i i
518
(a) Generalized triangle inequality: d(s, ŝ) ≤ λ(d(s, ŝ0 ) + d(s0 , â)) ∀a, â.
(b) E[d(S, ŝ0 )] < ∞ (so that Dmax < ∞ too).
(c) E[d(s0 , Ŝ)] < ∞ for any output distribution PŜ achieving the rate-distortion function R(D).
(d) d(s0 , ŝ0 ) < ∞.
The interpretation of this assumption is that the spaces S and Ŝ have “nice centers” s0 and ŝ0 ,
in the sense that the distance between any two points is upper bounded by a constant times the
distance from the centers to each point (see figure below).
b
b
s ŝ
b b
s0 ŝ0
S Ŝ
Note that Assumption 26.1 is not straightforward to verify. Next we give some more convenient
sufficient conditions. First of all, Assumption 26.1 holds automatically for bounded distortion
function. In other words, for a discrete source on a finite alphabet S , a finite reconstruction alphabet
Ŝ , and a finite distortion function d(s, ŝ) < ∞, Assumption 26.1 is fulfilled. More generally, we
have the following criterion.
Theorem 26.7 If S = Ŝ and d(s, ŝ) = ρ(s, ŝ)q for some metric ρ and q ≥ 1, and Dmax ≜
infŝ0 E[d(S, ŝ0 )] < ∞, then Assumption 26.1 holds.
Proof. Take s0 = ŝ0 that achieves a finite Dmax = E[d(S, ŝ0 )]. (In fact, any points can serve as
centers in a metric space). Applying triangle inequality and Jensen’s inequality, we have
q q
1 1 1 1 1
ρ(s, ŝ) ≤ ρ(s, s0 ) + ρ(s0 , ŝ) ≤ ρq (s, s0 ) + ρq (s0 , ŝ).
2 2 2 2 2
Thus d(s, ŝ) ≤ 2q−1 (d(s, s0 ) + d(s0 , ŝ)). Taking λ = 2q−1 verifies (a) and (b) in Assumption 26.1.
To verify (c), we can apply this generalized triangle inequality to get d(s0 , Ŝ) ≤ 2q−1 (d(s0 , S) +
d(S, Ŝ)). Then taking the expectation of both sides gives
So we see that metrics raised to powers (e.g. squared norms) satisfy Assumption 26.1. Finally,
we give the lemma used in the proof of Theorem 26.6.
i i
i i
i i
Lemma 26.8 Fix a source satisfying Assumption 26.1 and an arbitrary PŜ|S . Let R > I(S; Ŝ),
L > max{E[d(s0 , Ŝ)], d(s0 , ŝ0 )} and D > E[d(S, Ŝ)]. Then, there exists a (k, 2kR , D)-code such that
d(sk0 , ŝk ) ≤ L for every reconstruction point ŝk , where sk0 = (s0 , . . . , s0 ).
For any D′ ∈ (E[d(S, Ŝ)], D), there exist M = 2kR reconstruction points (c1 , . . . , cM ) such that
P min d(S , cj ) > D ≤ P[d1 (Sk , Ŝk ) > D′ ] + o(1),
k ′
j∈[M]
P[d1 (S, Ŝ) > D′ ] ≤ P[d(Sk , Ŝk ) > D′ ] + P[d(sk0 , Ŝk ) > L] → 0
as k → ∞ (since E[d(S, Ŝ)] < D′ and E[d(s0 , Ŝ)] < L). Thus we have
P min d(Sk , cj ) > D′ → 0
j∈[M]
and d(sk0 , cj ) ≤ L. Finally, by adding another reconstruction point cM+1 = ŝk0 = (ŝ0 , . . . , ŝ0 ) we
get
h i h i
′ ′
E min d(S , cj ) ≤ D + E d(S , ŝ0 )1 min d(S , cj ) > D
k k k k
= D′ + o(1) ,
j∈[M+1] j∈[M]
where the last estimate follows from the same argument that shows the vanishing of the expectation
in (25.15). Thus, for sufficiently large n the expected distortion is at most D, as required.
i i
i i
i i
520
spatial correlation compared to 1D signals. (For example, the first sentence and the last in Tol-
stoy’s novel are pretty uncorrelated. But the regions in the upper-left and bottom-right corners of
one image can be strongly correlated. At the same time, the uncompressed size of the novel and
the image could be easily equal.) Thus, for practicing the lossy compression of videos and images
the key problem is that of coming up with a good “whitening” bases, which is an art still being
refined.
For the joint-source-channel coding, the separation principle has definitely been a guiding light
for the entire development of digital information technology. But this now ubiquitous solution
that Shannon’s separation has professed led to a rather undesirable feature of dropped cellular
calls (as opposed to slowly degraded quality of the old analog telephones) or “snow screen” on
TV whenever the SNR falls below a certain threshold. That is, the separated systems can be very
unstable, or lacks graceful degradation. To sketch this effect consider an example of JSCC, where
the source distribution is Ber( 12 ), with rate-distortion function R(D) = 1 − h(D), and the channel
is BSCδ with capacity C(δ) = 1 − h(δ). Consider two solutions:
1 a separated scheme: targeting a certain acceptable distortion level D∗ we compress the source
at rate R(D∗ ). Then we can use a channel code of rate R(D∗ ) which would achieve vanishing
error as long as R(D∗ ) < C(δ), i.e. δ < D∗ . Overall, this scheme has a bandwidth expansion
factor ρ = nk = 1. Note that there exists channel codes (Exercises IV.8 and IV.10) that work
simultaneously for all δ < δ ∗ = D∗ .
2 a simple JSCC with ρ = 1 which transmits “uncoded” data, i.e. sets Xi = Si .
For large blocklengths, the achieved distortion are shown in Figure 26.2 as a function of δ .
We can now see why separated solution, though in some sense optimal, is not ideal. First, below
distortion
separated
1
2
uncoded
D∗ = δ ∗
δ
0 δ∗ 1
2
Figure 26.2 No graceful degradation of separately designed source channel code (black solid), as compared
with uncoded transmission (blue dashed).
δ < δ ∗ the separated solution does achieve acceptable distortion D∗ , but it does not improve if the
channel improves, i.e. the distortion stays constant at D∗ , unlike the uncoded system. Second, and
much more importantly, is a problem with δ > δ ∗ . In this regime, separated scheme undergoes a
i i
i i
i i
catastrophic failure and distortion becomes 1/2 (that is, we observe pure noise, or “snow” in TV-
speak). At the same time, the distortion of the simple “uncoded” JSCC is also deteriorating but
gracefully so. Unfortunately, such graceful schemes are only known for very few cases, requiring
ρ = 1 and certain “perfect match” conditions between channel noise and source (distortion met-
ric)2 . It is a long-standing (practical and theoretical) open problem in information theory to find
schemes that exhibit non-catastrophic degradation for general source-channel pairs and general ρ.
Even purely theoretically the problem of JSCC still contains many mysteries. For example, in
Section 22.5 we described refined expansion of the channel coding rate as a function of block-
length. In particular, we have seen that convergence to channel capacity happens at the rate √1n ,
which is rather slow. At the same time, convergence to the rate-distortion function is almost at
the rate of 1n (see Exercises V.3 and V.4). Thus, it is not clear what the convergence rate of the
JSCC may be. Unfortunately, sharp results here are still at a nascent stage. In fact, even for the
most canonical setting of a binary source and BSCδ channel it was only very recently shown [248]
√
that the optimal rate nk converges to the ultimate limit of R(CD) at the speed of Θ(1/ n) unless
the Gastpar condition R(D) = C(δ) is met. Analyzing other source-channel pairs or any general
results of this kind is another important open problem.
2
Often informally called “Gastpar conditions” after [181].
i i
i i
i i
27 Metric entropy
In the previous chapters of this part we discussed optimal quantization of random vectors in both
fixed and high dimensions. Complementing this average-case perspective, the topic of this chapter
is on the deterministic (worst-case) theory of quantization. The main object of interest is the metric
entropy of a set, which allows us to answer two key questions (a) covering number: the minimum
number of points to cover a set up to a given accuracy; (b) packing number: the maximal number
of elements of a given set with a prescribed minimum pairwise distance.
The foundational theory of metric entropy were put forth by Kolmogorov, who, together with
his students, also determined the behavior of metric entropy in a variety of problems for both finite
and infinite dimensions. Kolmogorov’s original interest in this subject stems from Hilbert’s 13th
problem, which concerns the possibility or impossibility of representing multi-variable functions
as compositions of functions of fewer variables. It turns out that the theory of metric entropy can
provide a surprisingly simple and powerful resolution to such problems. Over the years, metric
entropy has found numerous connections to and applications in other fields such as approximation
theory, empirical processes, small-ball probability, mathematical statistics, and machine learning.
In particular, metric entropy will be featured prominently in Part VI of this book, wherein we
discuss its applications to proving both lower and upper bounds for statistical estimation.
This chapter is organized as follows. Section 27.1 provides basic definitions and explains the
fundamental connections between covering and packing numbers. These foundations are laid out
by Kolmogorov and Tikhomirov in [250], which remains the definitive reference on this subject.
In Section 27.2 we study metric entropy in finite-dimensional spaces and a popular approach for
bounding the metric entropy known as the volume bound. To demonstrate the limitations of the
volume method and the associated high-dimensional phenomenon, in Section 27.3 we discuss
a few other approaches through concrete examples. Infinite-dimensional spaces are treated next
for smooth functions in Section 27.4 (wherein we also discuss the application to Hilbert’s 13th
problem) and Hilbert spaces in Section 27.3.2 (wherein we also discuss the application to empirical
processes). Section 27.5 gives an exposition of the connections between metric entropy and the
small-ball problem in probability theory. Finally, in Section 27.6 we circle back to rate-distortion
theory and discuss how it is related to metric entropy and how information-theoretic methods can
be useful for the latter.
522
i i
i i
i i
• We say {v1 , ..., vN } ⊂ V is an ϵ-covering (or ϵ-net) of Θ if Θ ⊂ ∪Ni=1 B(vi , ϵ), where B(v, ϵ) ≜
{u ∈ V : d(u, v) ≤ ϵ} is the (closed) ball of radius ϵ centered at v; or equivalently, ∀θ ∈ Θ,
∃i ∈ [N] such that d(θ, vi ) ≤ ϵ.
• We say {θ1 , ..., θM } ⊂ Θ is an ϵ-packing of Θ if mini̸=j kθi − θj k > ϵ;1 or equivalently, the balls
{B(θi , ϵ/2) : j ∈ [M]} are disjoint.
ϵ
≥ϵ
Θ Θ
Upon defining ϵ-covering and ϵ-packing, a natural question concerns the size of the optimal
covering and packing, leading to the definition of covering and packing numbers:
with min ∅ understood as ∞; we will sometimes abbreviate these as N(ϵ) and M(ϵ) for brevity.
Similar to volume and width, covering and packing numbers provide a meaningful measure for
the “massiveness” of a set. The major focus of this chapter is to understanding their behavior in
both finite and infinite-dimensional spaces as well as their statistical applications.
Some remarks are in order.
1
Notice we imposed strict inequality for convenience.
i i
i i
i i
524
Remark 27.1 Unlike the packing number M(Θ, d, ϵ), the covering number N(Θ, d, ϵ) defined
in (27.1) depends implicitly on the ambient space V ⊃ Θ, since, per Definition 27.1), an ϵ-covering
is required to be a subset of V rather than Θ. Nevertheless, as the next Theorem 27.2 shows, this
dependency on V has almost no effect on the behavior of the covering number.
As an alternative to (27.1), we can define N′ (Θ, d, ϵ) as the size of the minimal ϵ-covering of Θ
that is also a subset of Θ, which is closely related to the original definition as
Here, the left inequality is obvious. To see the right inequality,2 let {θ1 , . . . , θN } be an 2ϵ -covering
of Θ. We can project each θi to Θ by defining θi′ = argminu∈Θ d(θi , u). Then {θ1′ , . . . , θN′ } ⊂ Θ
constitutes an ϵ-covering. Indeed, for any θ ∈ Θ, we have d(θ, θi ) ≤ ϵ/2 for some θi . Then
d(θ, θi′ ) ≤ d(θ, θi ) + d(θi , θi′ ) ≤ 2d(θ, θi ) ≤ ϵ. On the other hand, the N′ covering numbers need
not be monotone with respect to set inclusion.
The relation between the covering and packing numbers is described by the following funda-
mental result.
Proof. To prove the right inequality, fix a maximal packing E = {θ1 , ..., θM }. Then ∀θ ∈ Θ\E,
∃i ∈ [M], such that d(θ, θi ) ≤ ϵ (for otherwise we can obtain a bigger packing by adding θ). Hence
E must an ϵ-covering (which is also a subset of Θ). Since N(Θ, d, ϵ) is the minimal size of all
possible coverings, we have M(Θ, d, ϵ) ≥ N(Θ, d, ϵ).
We next prove the left inequality by contradiction. Suppose there exists a 2ϵ-packing
{θ1 , ..., θM } and an ϵ-covering {x1 , ..., xN } such that M ≥ N + 1. Then by the pigeonhole prin-
ciple, there exist distinct θi and θj belonging to the same ϵ-ball B(xk , ϵ). By triangle inequality,
d(θi , θj ) ≤ 2ϵ, which is a contradiction since d(θi , θj ) > 2ϵ for a 2ϵ-packing. Hence the size of any
2ϵ-packing is at most that of any ϵ-covering, that is, M(Θ, d, 2ϵ) ≤ N(Θ, d, ϵ).
The significance of (27.4) is that it shows that the small-ϵ behavior of the covering and packing
numbers are essentially the same. In addition, the right inequality therein, namely, N(ϵ) ≤ M(ϵ),
deserves some special mention. As we will see next, it is oftentimes easier to prove negative
results (lower bound on the minimal covering or upper bound on the maximal packing) than pos-
itive results which require explicit construction. When used in conjunction with the inequality
N(ϵ) ≤ M(ϵ), these converses turn into achievability statements,3 leading to many useful bounds
on metric entropy (e.g. the volume bound in Theorem 27.3 and the Gilbert-Varshamov bound
2
Another way to see this is from Theorem 27.2: Note that the right inequality in (27.4) yields a ϵ-covering that is included
in Θ. Together with the left inequality, we get N′ (ϵ) ≤ M(ϵ) ≤ N(ϵ/2).
3
This is reminiscent of duality-based argument in optimization: To bound a minimization problem from above, instead of
constructing an explicit feasible solution, a fruitful approach is to equate it with the dual problem (maximization) and
bound this maximum from above.
i i
i i
i i
Theorem 27.5 in the next section). Revisiting the proof of Theorem 27.2, we see that this logic
actually corresponds to a greedy construction (greedily increase the packing until no points can
be added).
Proof. To prove (a), consider an ϵ-covering Θ ⊂ ∪Ni=1 B(θi , ϵ). Applying the union bound yields
XN
vol(Θ) ≤ vol ∪Ni=1 B(θi , ϵ) ≤ vol(B(θi , ϵ)) = Nϵd vol(B),
i=1
where the last step follows from the translation-invariance and scaling property of volume.
To prove (b), consider an ϵ-packing {θ1 , . . . , θM } ⊂ Θ such that the balls B(θi , ϵ/2) are disjoint.
M(ϵ)
Since ∪i=1 B(θi , ϵ/2) ⊂ Θ + 2ϵ B, taking the volume on both sides yields
ϵ ϵ
vol Θ + B ≥ vol ∪M i=1 B(θi , ϵ/2) = Mvol B .
2 2
This proves (b).
Finally, (c) follows from the following two statements: (1) if ϵB ⊂ Θ, then Θ + 2ϵ B ⊂ Θ + 21 Θ;
and (2) if Θ is convex, then Θ+ 12 Θ = 32 Θ. We only prove (2). First, ∀θ ∈ 32 Θ, we have θ = 13 θ+ 32 θ,
where 13 θ ∈ 12 Θ and 32 θ ∈ Θ. Thus 32 Θ ⊂ Θ + 12 Θ. On the other hand, for any x ∈ Θ + 12 Θ, we
have x = y + 21 z with y, z ∈ Θ. By the convexity of Θ, 23 x = 23 y + 31 z ∈ Θ. Hence x ∈ 23 Θ, implying
Θ + 21 Θ ⊂ 32 Θ.
Remark 27.2 Similar to the proof of (a) in Theorem 27.3, we can start from Θ + 2ϵ B ⊂
∪Ni=1 B(θi , 32ϵ ) to conclude that
N(Θ, k · k, ϵ)
(2/3)d ≤ ≤ 2d .
vol(Θ + 2ϵ B)/vol(ϵB)
In other words, the volume of the fattened set Θ + 2ϵ determines the metric entropy up to constants
that only depend on the dimension. We will revisit this reasoning in Section 27.5 to adapt the
volumetric estimates to infinite dimensions where this fattening step becomes necessary.
i i
i i
i i
526
Corollary 27.4 (Metric entropy of balls and spheres) Let k · k be an arbitrary norm on
Rd . Let B ≡ B∥·∥ = {x ∈ Rd : kxk ≤ 1} and S ≡ S∥·∥ = {x ∈ Rd : kxk ≤ 1} be the corresponding
unit ball and unit sphere. Then for ϵ < 1,
d d
1 2
≤ N(B, k · k, ϵ) ≤ 1 + (27.5)
ϵ ϵ
d−1 d−1
1 1
≤ N(S, k · k, ϵ) ≤ 2d 1 + (27.6)
2ϵ ϵ
where the left inequality in (27.6) holds under the extra assumption that k · k is an absolute norm
(invariant to sign changes of coordinates).
Proof. For balls, the estimate (27.5) directly follows from Theorem 27.3 since B + 2ϵ B = (1 + 2ϵ )B.
Next we consider the spheres. Applying (b) in Theorem 27.3 yields
vol(S + ϵB) vol((1 + ϵ)B) − vol((1 − ϵ)B)
N(S, k · k, ϵ) ≤ M(S, k · k, ϵ) ≤ ≤
vol(ϵB) vol(ϵB)
Z ϵ d−1
(1 + ϵ) − (1 − ϵ)
d d
d d−1 1
= = d (1 + x) dx ≤ 2d 1 + .
ϵd ϵ −ϵ ϵ
where the third inequality applies S + ϵB ⊂ ((1 + ϵ)B)\((1 − ϵ)B) by triangle inequality.
Finally, we prove the lower bound in (27.6) for an absolute norm k · k. To this end one cannot
directly invoke the lower bound in Theorem 27.3 as the sphere has zero volume. Note that k · k′ ≜
k(·, 0)k defines a norm on Rd−1 . We claim that every ϵ-packing in k · k′ for the unit k · k′ -ball
induces an ϵ-packing in k · k for the unit k · k-sphere. Fix x ∈ Rd−1 such that k(x, 0)k ≤ 1 and
define f : R+ → R+ by f(y) = k(x, y)k. Using the fact that k · k is an absolute norm, it is easy to
verify that f is a continuous increasing function with f(0) ≤ 1 and f(∞) = ∞. By the mean value
theorem, there exists yx , such that k(x, yx )k = 1. Finally, for any ϵ-packing {x′1 , . . . , x′M } of the unit
ball B∥·∥′ with respect to k·k′ , setting x′i = (xi , yxi ) we have kx′i −x′j k ≥ k(xi −xj , 0)k = kxi −xj k′ ≥ ϵ.
This proves
Then the left inequality of (27.6) follows from those of (27.4) and (27.5).
(a) Using (27.5), we see that for any compact Θ with nonempty interior, we have
1
N(Θ, k · k, ϵ) M(Θ, k · k, ϵ) (27.7)
ϵd
for small ϵ, with proportionality constants depending on both Θ and the norm. In fact, the
sharp constant is also known to exist. It is shown in [250, Theorem IX] that there exists a
i i
i i
i i
Next we switch our attention to the discrete case of Hamming space. The following theorem
bounds its packing number M(Fd2 , dH , r) ≡ M(Fd2 , r), namely, the maximal number of binary code-
words of length d with a prescribed minimum distance r + 1.5 This is a central question in coding
theory, wherein the lower and upper bounds below are known as the Gilbert-Varshamov bound
and the Hamming bound, respectively.
Proof. Both inequalities in (27.8) follow from the same argument as that in Theorem 27.3, with
Rd replaced by Fd2 and volume by the counting measure (which is translation invariant).
Of particular interest to coding theory is the asymptotic regime of d → ∞ and r = ρd for some
constant ρ ∈ (0, 1). Using the asymptotics of the binomial coefficients (cf. Proposition 1.5), the
4
For example, it is easy to show that τ = 1 for both ℓ∞ and ℓ1 balls in any dimension since cubes can be subdivided into
smaller cubes; for ℓ2 -ball in d = 2, τ = √π is the famous result of L. Fejes Tóth on the optimality of hexagonal
12
arrangement for circle packing [365].
5
Recall that the packing number in Definition 27.1 is defined with a strict inequality.
i i
i i
i i
528
Finding the exact exponent is one of the most significant open questions in coding theory. The best
upper bound to date is due to McEliece, Rodemich, Rumsey and Welch [299] using the technique
of linear programming relaxation.
In contrast, the corresponding covering problem in Hamming space is much simpler, as we
have the following tight result
where R(ρ) = (1 − h(ρ))+ is the rate-distortion function of Ber( 12 ) from Theorem 26.1. Although
this does not automatically follow from the rate-distortion theory, it can be shown using similar
argument – see Exercise V.26.
Finally, we state a lower bound on the packing number of Hamming spheres, which is needed
for subsequent application in sparse estimation (Exercise VI.12) and useful as basic building blocks
for computing metric entropy in more complicated settings (Theorem 27.7).
In particular,
k d
log M(Sdk , k/2) ≥ log . (27.12)
2 2ek
Proof. Again (27.11) follows from the volume argument. To verify (27.12), note that for r ≤ d/2,
Pr
we have i=0 di ≤ exp(dh( dr )) (see Theorem 8.2 or (15.19) with p = 1/2). Using h(x) ≤ x log xe
and dk ≥ ( dk )k , we conclude (27.12) from (27.11).
i i
i i
i i
As a case in point, consider the maximum number of ℓ2 -balls of radius ϵ packed into the unit
ℓ1 -ball, namely, M(B1 , k · k2 , ϵ). (Recall that Bp denotes the unit ℓp -ball in Rd with 1 ≤ p ≤ ∞.)
We have studied the metric entropy of arbitrary norm balls under the same norm in Corollary 27.4,
where the specific value of the volume was canceled from the √
volume ratio. Here, although ℓ1 and
ℓ2 norms are equivalent in the sense that kxk2 ≤ kxk1 ≤ dkxk2 , this relationship is too loose
when d is large.
Let us start by applying the volume method in Theorem 27.3:
vol(B1 ) vol(B1 + 2ϵ B2 )
≤ N(B1 , k · k2 , ϵ) ≤ M(B1 , k · k2 , ϵ) ≤ .
vol(ϵB2 ) vol( 2ϵ B2 )
Applying the formula for the volume of a unit ℓq -ball in Rd :
h id
2Γ 1 + 1q
vol(Bq ) = , (27.13)
Γ 1 + qd
πd
we get6 vol(B1 ) = 2d /d! and vol(B2 ) = Γ(1+d/2) , which yield, by Stirling approximation,
1 1
vol(B1 )1/d , vol(B2 )1/d √ . (27.14)
d d
Then for some absolute constant C,
√ d
vol(B1 + 2ϵ B2 ) vol((1 + ϵ 2 d )B1 ) 1
M(B1 , k · k2 , ϵ) ≤ ≤ ≤ C 1 + √ , (27.15)
vol( 2ϵ B2 ) vol( 2ϵ B2 ) ϵ d
√
where the second inequality follows from B2 ⊂ dB1 by Cauchy-Schwarz inequality. (This step
is tight in the sense that vol(B1 + 2ϵ B2 )1/d ≳ max{vol(B1 )1/d , 2ϵ vol(B2 )1/d } max{ d1 , √ϵd }.) On
the other hand, for some absolute constant c,
d d
vol(B1 ) 1 vol(B1 ) c
M(B1 , k · k2 , ϵ) ≥ = = √ . (27.16)
vol(ϵB2 ) ϵ vol(B2 ) ϵ d
Overall, for ϵ ≤ √1d , we have M(B1 , k · k2 , ϵ)1/d ϵ√1 d ; however, the lower bound trivializes and
the upper bound (which is exponential in d) is loose in the regime of ϵ √1d , which requires
different methods than volume calculation. The following result describes the complete behavior
of this metric entropy. In view of Theorem 27.2, we will go back and forth between the covering
and packing numbers in the argument.
6
For B1 this can be proved directly by noting that B1 consists 2d disjoint “copies” of the simplex whose volume is 1/d! by
induction on d.
i i
i i
i i
530
Proof. The case of ϵ ≤ √1d follows from earlier volume calculation (27.15)–(27.16). Next we
focus on √1d ≤ ϵ < 1.
For the upper bound, we construct an ϵ-covering in ℓ2 by quantizing each coordinate. Without
loss of generality, assume that ϵ < 1/4. Fix some δ < 1. For each θ ∈ B1 , there exists x ∈
(δ Zd ) ∩ B1 such that kx − θk∞ ≤ δ . Then kx − θk22 ≤ kx − θk1 kx − θk∞ ≤ 2δ . Furthermore, x/δ
belongs to the set
( )
X d
Z= z∈Z : d
|zi | ≤ k (27.17)
i=1
with k = b1/δc. Note that each z ∈ Z has at most k nonzeros. By enumerating the number of non-
negative solutions (stars and bars calculation) and the sign pattern, we have7 |Z| ≤ 2k∧d d−k1+k .
Finally, picking δ = ϵ2 /2, we conclude that N(B1 , k · k2 , ϵ) ≤ |Z| ≤ ( 2e(dk+k) )k as desired. (Note
that this method also recovers the volume bound for ϵ ≤ √1d , in which case k ≤ d.)
√
For the lower bound, note that M(B1 , k · k2 , 2) ≥ 2d by considering ±e1 , . . . , ±ed . So it
suffices to consider d ≥ 8. We construct a packing of B1 based on a packing of the Hamming
sphere. Without loss of generality, assume that ϵ > 4√1 d . Fix some 1 ≤ k ≤ d. Applying
the Gilbert-Varshamov bound in Theorem 27.6, in particular, (27.12), there exists a k/2-packing
Pd
{x1 , . . . , xM } ⊂ Sdk = {x ∈ {0, 1}d : i=1 xi = k} and log M ≥ 2k log 2ek d
. Scale the Hamming
sphere to fit the ℓ1 -ball by setting θi = xi /k. Then θi ∈ B1 and kθi − θj k2 = k2 dH (xi , xj ) ≥ 2k
2 1 1
for all
1
i 6= j. Choosing k = ϵ2 which satisfies k ≤ d/8, we conclude that {θ1 , . . . , θM } is a 2 -packing
ϵ
of B1 in k · k2 as desired.
The above elementary proof can be adapted to give the following more general result (see
Exercise V.27): Let 1 ≤ p < q ≤ ∞. For all 0 < ϵ < 1 and d ∈ N,
(
d log ϵes d ϵ ≤ d−1/s 1 1 1
log M(Bp , k · kq , ϵ) p,q 1 , ≜ − . (27.18)
−1/s
s log(eϵ d)
ϵ
s
ϵ≥d s p q
In the remainder of this section, we discuss a few generic results in connection to Theorem 27.7,
in particular, metric entropy upper bounds via the Sudakov minorization and Maurey’s empirical
method, as well as the duality of metric entropy in Euclidean spaces.
7 ∑d (d)( k )
By enumerating the support and counting positive solutions, it is easy to show that |Z| = i=0 2d−i i d−i
.
8
To avoid measurability difficulty, w(Θ) should be understood as supT⊂Θ,|T|<∞ E maxθ∈T hθ, Zi.
i i
i i
i i
For any Θ ⊂ Rd ,
p
w(Θ) ≳ sup ϵ log M(Θ, k · k2 , ϵ). (27.20)
ϵ>0
As a quick corollary, applying the volume lower bound on the packing number in Theorem 27.3
to (27.20) and optimizing over ϵ, we obtain Urysohn’s inequality:9
1/d
√ vol(Θ) (27.14)
w(Θ) ≳ d d · vol(Θ)1/d . (27.21)
vol(B2 )
Sudakov’s theorem relates the Gaussian width to the metric entropy, both of which are meaning-
ful measures of the massiveness of a set. The important point is that the proportionality constant
in (27.20) is independent of the dimension. It turns out that Sudakov’s lower bound is tight up to
a log(d) factor [438, Theorem 8.1.13]. The following complementary result is known as Dudley’s
chaining inequality (see Exercise V.28 for a proof.)
Z ∞p
w(Θ) ≲ log M(Θ, k · k2 , ϵ)dϵ. (27.22)
0
Understanding the maximum of Gaussian processes is a field on its own; see the monograph [411].
In this section we focus on the lower bound (27.20) in order to develop upper bound for metric
entropy using the Gaussian width.
The proof of Theorem 27.8 relies on the following Gaussian comparison lemma of Slepian
(whom we have encountered earlier in Theorem 11.13). For a self-contained proof see [89]. See
also [329, Lemma 5.7, p. 70] for a simpler proof of a weaker version E max Xi ≤ 2E max Yi , which
suffices for our purposes.
We also need the result bounding the expectation of the maximum of n Gaussian random
variables.
i.i.d.
In addition, if Z1 , . . . , Zn ∼ N (0, 1), then
h i p
E max Zi = 2 log n(1 + o(1)). (27.24)
i∈[n]
9 vol(Θ)
For a sharp form, see [329, Corollary 1.4], which states that for all symmetric convex Θ, w(Θ) ≥ E[kZk2 ]( vol(B ) )1/d ;
2
in other words, balls minimize the Gaussian width among all symmetric convex bodies of the same volume.
i i
i i
i i
532
Proof. First, let T = argmaxj Zj . Since Zj are 1-subgaussian (recall Definition 4.15), from
Exercise I.56 we have
p p p
| E[max Zi ]| = | E[ZT ]| ≤ 2I(Zn ; T) = 2H(T) ≤ 2 log n .
i
E[max Zi ] ≥ t P[max Zi ≥ t] + E[max Zi 1 {Z1 < 0}1 {Z2 < 0} . . . 1 {Zn < 0}]
i i i
≥ t(1 − (1 − Φc (t))n ) + E[Z1 1 {Z1 < 0}1 {Z2 < 0} . . . 1 {Zn < 0}].
where Φc (t) = P[Z1 ≥ t] is the normal tail probability. The second term equals
2−(n−1) E[Z1 1 {Z1 < 0}] = o(1). For the first term, recall that Φc (t) ≥ 1+t t2 φ(t) (Exercise V.25).
p
Choosing
p t = (2 − ϵ) log n for small ϵ > 0 so that Φc (t) = ω( 1n ) and hence E[maxi Zi ] ≥
(2 − ϵ) log n(1 + o(1)). By the arbitrariness of ϵ > 0, the lower bound part of (27.24)
follows.
Proof of Theorem 27.8. Let {θ1 , . . . , θM } be an optimal ϵ-packing of Θ. Let Xi = hθi , Zi for
i.i.d.
i ∈ [M], where Z ∼ N (0, Id ). Let Yi ∼ N (0, ϵ2 /2). Then
Then
p
E sup hθ, Zi ≥ E max Xi ≥ E max Yi ϵ log M
θ∈Θ 1≤i≤M 1≤i≤M
where the second and third step follows from Lemma 27.9 and Lemma 27.10 respectively.
1
27.3.2 Hilbert ball has metric entropy ϵ2
P
We consider a Hilbert ball B2 = {x ∈ R∞ : i x2i ≤ 1}. Under the usual metric ℓ2 (R∞ ) this
set is not compact and cannot have finite ϵ-nets for all ϵ. However, the metric of interest in many
i i
i i
i i
applications is often different. Specifically, let us fix some probability distribution P on B2 s.t.
EX∼P [kXk22 ] ≤ 1 and define
p
dP (θ, θ′ ) ≜ EX∼P [| hθ − θ′ , Xi |2 ]
for θ, θ′ ∈ B2 . The importance of this metric is that it allows to analyze complexity of a class
of linear functions θ 7→ hθ, Xi for any random variable X of unit norm and has applications in
learning theory [302, 471].
Theorem 27.11 For some universal constant c we have for all ϵ > 0
c
log N(B2 , dP , ϵ) ≤ .
ϵ2
Proof. First, we show that without loss of generality we may assume that X has all coordinates
P 2
other than the first n zero. Indeed, take n so large that E[ j>n X2j ] ≤ ϵ4 . Let us denote by θ̃ the
vector obtained from θ by zeroing out all coordinates j > n. Then from Cauchy-Schwartz we see
that dP (θ, θ̃) ≤ 2ϵ and therefore any 2ϵ -covering of B̃2 = {θ̃ : θ ∈ B2 } will be an ϵ-covering of B2 .
Hence, from now on we assume that the ball B2 is in fact finite-dimensional.
We can redefine distance dP in a more explicit way as follows
dP (θ, θ′ )2 = (θ − θ′ )⊤ Σ(θ − θ′ ) ,
To see one simple implication of the result, recall the standard bound on empirical processes: By
endowing any collection of functions {fθ , θ ∈ Θ} with a metric dP̂n (θ, θ′ )2 = EP̂n [(fθ (X)− fθ′ (X))2 ]
we have
" Z ∞r #
log N(Θ, dP̂n , ϵ)
E sup E[fθ (X)] − Ên [fθ (X)] ≲ E inf δ + dϵ .
θ δ>0 δ n
It can be seen that when entropy behaves as ϵ−p we get rate n− min(1/p,1/2) except for p = 2
for which the upper bound yields n− 2 log n. The significance of the previous theorem is that the
1
Hilbert ball is precisely “at the phase transition” from parametric to nonparametric rate.
i i
i i
i i
534
As a sanity check, let us take any PX over the unit (possibly infinite dimensional) ball B with
E[X] = 0 and let Θ = B. We have
" # r
1X
n
log n
E[kX̄n k] = E sup hθ, Xi i ≲ ,
θ n i=1 n
Pn
where
p X̄n = 1n i=1 Xi is the empirical mean. In this special case it is easy to bound E[kX̄n k] ≤
E[kX̄n k2 ] ≤ √1n by an explicit calculation.
Proof. Let T = {t1 , t2 , . . . , tm } and denote the Chebyshev center of T by c ∈ H, such that r =
maxi∈[m] kc − ti k. For n ∈ Z+ , let
( ! )
1 Xm Xm
Z= c+ ni ti : ni ∈ Z+ , ni = n .
n+1
i=1 i=1
Pm P
For any x = i=1 xi ti ∈ co(T) where xi ≥ 0 and xi = 1, let Z be a discrete random variable
such that Z = ti with probability xi . Then E[Z] = x. Let Z0 = c and Z1 , . . . , Zn be i.i.d. copies of
Pm
Z. Let Z̄ = n+1 1 i=0 Zi , which takes values in the set Z . Since
2
1 X n
1 Xn X
EkZ̄ − xk22 = E (Zi − x) = E kZi − xk2 + EhZi − x, Zj − xi
(n + 1)2 ( n + 1) 2
i=0 i=0 i̸=j
1 X
n
1 r2
= E kZi − xk2 = kc − xk2 + nE[kZ − xk2 ] ≤ ,
(n + 1)2 ( n + 1) 2 n+1
i=0
Pm
where the last inequality follows from that kc − xk ≤ i=1 xi kc − ti k ≤ r (in other words, rad(T) =
rad(co(T)) and E[kZ − xk2 ] ≤ E[kZ − ck2 ] ≤ r2 . Set n = r2 /ϵ2 − 1 so that r2 /(n + 1) ≤ ϵ2 .
There exists some z ∈ N such that kz − xk ≤ ϵ. Therefore Z is an ϵ-covering of co(T). Similar to
(27.17), we have
n+m−1 m + r2 /ϵ2 − 2
|Z| ≤ = .
n dr2 /ϵ2 e − 1
i i
i i
i i
We now apply Theorem 27.12 to recover the result for the unit ℓ1 -ball B1 in Rd in Theorem 27.7:
Note that B1 = co(T), where T = {±e1 , . . . , ±ed , 0} satisfies rad(T) = 1. Then
2d + d ϵ12 e − 1
N(B1 , k · k2 , ϵ) ≤ , (27.27)
d ϵ12 e − 1
which recovers the optimal upper bound in Theorem 27.7 at both small and big scales.
Then the usual covering number in Definition 27.1 satisfies N(K, k · k, ϵ) = N(K, ϵB), where B is
the corresponding unit norm ball.
A deep result of Artstein, Milman, and Szarek [28] establishes the following duality for metric
entropy: There exist absolute constants α and β such that for any symmetric convex body K,10
1 ϵ
log N B2 , K◦ ≤ log N(K, ϵB2 ) ≤ log N(B2 , αϵK◦ ), (27.28)
β α
where B2 is the usual unit ℓ2 -ball, and K◦ = {y : supx∈K hx, yi ≤ 1} is the polar body of K.
As an example, consider p < 2 < q and p1 + 1q = 1. By duality, B◦p = Bq . Then (27.28) shows
that N(Bp , k · k2 , ϵ) and N(B2 , k · kq , ϵ) have essentially the same behavior, as verified by (27.18).
10
A convex body K is a compact convex set with non-empty interior. We say K is symmetric if K = −K.
i i
i i
i i
536
Theorem 27.13 Assume that L, A > 0 and p ∈ [1, ∞] are constants. Then
1
log N(F(A, L), k · kp , ϵ) = Θ . (27.29)
ϵ
Thus, it is sufficient to consider F(A, 1) ≜ F(A), the collection of 1-Lipschitz densities on [0, A].
Next, observe that any such density function f is bounded from above. Indeed, since f(x) ≥ (f(0) −
RA
x)+ and 0 f = 1, we conclude that f(0) ≤ max{A, A2 + A1 } ≜ m.
To show (27.29), it suffices to prove the upper bound for p = ∞ and the lower bound for p = 1.
Specifically, we aim to show, by explicit construction,
C Aϵ
N(F(A), k · k∞ , ϵ) ≤ 2 (27.32)
ϵ
c
M(F(A), k · k1 , ϵ) ≥ 2 ϵ (27.33)
which imply the desired (27.29) in view of Theorem 27.2. Here and below, c, C are constants
depending on A. We start with the easier (27.33). We construct a packing by perturbing the uniform
density. Define a function T by T(x) = x1 {x ≤ ϵ} + (2ϵ − x)1 {x ≥ ϵ} + A1 on [0, 2ϵ] and zero
elsewhere. Let n = 4Aϵ and a = 2nϵ. For each y ∈ {0, 1}n , define a density fy on [0, A] such that
X
n
f y ( x) = yi T(x − 2(i − 1)ϵ), x ∈ [0, a],
i=1
RA
and we linearly extend fy to [a, A] so that 0 fy = 1; see Figure 27.2. For sufficiently small ϵ, the
Ra
resulting fy is 1-Lipschitz since 0 fy = 12 + O(ϵ) so that the slope of the linear extension is O(ϵ).
1/A
x
0 ϵ 2ϵ 2nϵ A
Figure 27.2 Packing that achieves (27.33). The solid line represent one such density fy (x) with
y = (1, 0, 1, 1). The dotted line is the density of Unif(0, A).
i i
i i
i i
Thus we conclude that each fy is a valid member of F(A). Furthermore, for y, z ∈ {0, 1}n , we
have kfy −fz k1 = dH (y, z)kTk1 = ϵ2 dH (y, z). Invoking the Gilbert-Varshamov bound Theorem 27.5,
we obtain an n2 -packing Y of the Hamming space {0, 1}n with |Y| ≥ 2cn for some absolute constant
2
c. Thus {fy : y ∈ Y} constitutes an n2ϵ -packing of F(A) with respect to the L1 -norm. This is the
2
desired (27.33) since n2ϵ = Θ(ϵ).
To construct a covering, set J = mϵ , n = Aϵ , and xk = kϵ for k = 0, . . . , n. Let G be the
collection of all lattice paths (with grid size ϵ) of n steps starting from the coordinate (0, jϵ) for
some j ∈ {0, . . . , J}. In other words, each element g of G is a continuous piecewise linear function
on each subinterval Ik = [xk , xk+1 ) with slope being either +1 or −1. Evidently, the number of
such paths is at most (J + 1)2n = O( 1ϵ 2A/ϵ ). To show that G is an ϵ-covering, for each f ∈ F (A),
we show that there exists g ∈ G such that |f(x) − g(x)| ≤ ϵ for all x ∈ [0, A]. This can be shown
by a simple induction. Suppose that there exists g such that |f(x) − g(x)| ≤ ϵ for all x ∈ [0, xk ],
which clearly holds for the base case of k = 0. We show that g can be extended to Ik so that this
holds for k + 1. Since |f(xk ) − g(xk )| ≤ ϵ and f is 1-Lipschitz, either f(xk+1 ) ∈ [g(xk ), g(xk ) + 2ϵ]
or [g(xk ) − 2ϵ, g(xk )], in which case we extend g upward or downward, respectively. The resulting
g satisfies |f(x) − g(x)| ≤ ϵ on Ik , completing the induction.
b′ + ϵ1/3
b′
x
0 a′ A
Figure 27.3 Improved packing for (27.34). Here the solid and dashed lines are two lattice paths on a grid of
size ϵ starting from (0, b′ ) and staying in the range of [b′ , b′ + ϵ1/3 ], followed by their respective linear
extensions.
Finally, we prove the sharp bound (27.30) for p = ∞. The upper bound readily follows from
(27.32) plus the scaling relation (27.31). For the lower bound, we apply Theorem 27.2 converting
the problem to the construction of 2ϵ-packing. Following the same idea of lattice paths, next we
give an improved packing construction such that
a
M(F(A), k · k∞ , 2ϵ) ≥ Ω(ϵ3/2 2 ϵ ). (27.34)
a b
for any a < A. Choose any b such that A1 < b < A1 + (A−
2
a) ′ ′
2A . Let a = ϵ ϵ and b = ϵ ϵ . Consider
a density f on [0, A] of the following form (cf. Figure 27.3): on [0, a ], f is a lattice path from (0, b′ )
′
i i
i i
i i
538
to (a′ , b′ ) that stays in the vertical range of [b′ , b′ +ϵ1/3 ]; on [a′ , A], f is a linear extension chosen so
RA
that 0 f = 1. This is possible because by the 1-Lipschitz constraint we can linearly extend f so that
RA ′ 2 ′ 2 R a′
a′
f takes any value in the interval [b′ (A−a′ )− (A−2a ) , b′ (A−a′ )+ (A−2a ) ]. Since 0 f = ab+o(1),
RA R a′
we need a′ f = 1 − 0 f = 1 − ab + o(1), which is feasible due to the choice of b. The collection
G of all such functions constitute a 2ϵ-packing in the sup norm (for two distinct paths consider the
first subinterval where they differ). Finally, we bound the cardinality of this packing by counting
the number of such paths. This can be accomplished by standard estimates on random walks (see
e.g. [164, Chap. III]). For any constant c > 0, the probability that a symmetric random walk on
Z returns to zero in n (even) steps and stays in the range of [0, n1+c ] is Θ(n−3/2 ); this implies the
desired (27.34). Finally, since a < A is arbitrary, the lower bound part of (27.30) follows in view
of Theorem 27.2.
The following result, due to Birman and Solomjak [57] (cf. [285, Sec. 15.6] for an exposition),
is an extension of Theorem 27.13 to the more general Hölder class.
Theorem 27.14 Fix positive constants A, L and d ∈ N. Let β > 0 and write β = ℓ + α,
where ℓ ∈ Z+ and α ∈ (0, 1]. Let Fβ (A, L) denote the collection of ℓ-times continuously
differentiable densities f on [0, A]d whose ℓth derivative is (L, α)-Hölder continuous, namely,
kD(ℓ) f(x) − D(ℓ) f(y)k∞ ≤ Lkx − ykα ∞ for all x, y ∈ [0, A] . Then for any 1 ≤ p ≤ ∞,
d
d
log N(Fβ (A, L), k · kp , ϵ) = Θ ϵ− β . (27.35)
The main message of the preceding theorem is that is the entropy of the function class grows
more slowly if the dimension decreases or the smoothness increases. As such, the metric entropy
for very smooth functions can grow subpolynomially in 1ϵ . For example, Vitushkin (cf. [250,
Eq. (129)]) showed that for the class of analytic functions on the unit complex disk D having
analytic extension to a bigger disk rD for r > 1, the metric entropy (with respect to the sup-norm
on D) is Θ((log 1ϵ )2 ); see [250, Sec. 7 and 8] for more such results.
As mentioned at the beginning of this chapter, the conception and development of the subject
on metric entropy, in particular, Theorem 27.14, are motivated by and plays an important role
in the study of Hilbert’s 13th problem. In 1900, Hilbert conjectured that there exist functions of
several variables which cannot be represented as a superposition (composition) of finitely many
functions of fewer variables. This was disproved by Kolmogorov and Arnold in 1950s who showed
that every continuous function of d variables can be represented by sums and superpositions of
single-variable functions; however, their construction does not work if one requires the constituent
functions to have specific smoothness. Subsequently, Hilbert’s conjecture for smooth functions
was positively resolved by Vitushkin [439], who showed that there exist functions of d variables
in the β -Hölder class (in the sense of Theorem 27.14) that cannot be expressed as finitely many
superpositions of functions of d′ variables in the β ′ -Hölder class, provided d/β > d′ /β ′ . The
original proof of Vitushkin is highly involved. Later, Kolmogorov gave a much simplified proof
by proving and applying the k · k∞ -version of Theorem 27.14. As evident in (27.35), the index
d/β provides a complexity measure for the function class; this allows an proof of impossibility
i i
i i
i i
of superposition by an entropy comparison argument. For concreteness, let us prove the follow-
ing simpler version: There exists a 1-Lipschitz function f(x, y, z) of three variables on [0, 1]3 that
cannot be written as g(h1 (x, y), h2 (y, z)) where g, h1 , h2 are 1-Lipschitz functions of two variables
on [0, 1]2 . Suppose, for the sake of contradiction, that this is possible. Fixing an ϵ-covering of
cardinality exp(O( ϵ12 )) for 1-Lipschitz functions on [0, 1]2 and using it to approximate the func-
tions g, h1 , h2 , we obtain by superposition g(h1 , h2 ) an O(ϵ)-covering of cardinality exp(O( ϵ12 )) of
1-Lipschitz functions on [0, 1]3 ; however, this is a contradiction as any such covering must be of
size exp(Ω( ϵ13 )). For stronger and more general results along this line, see [250, Appendix I].
i i
i i
i i
540
H. We refer the reader to, e.g., [262, Sec. 2] and [314, III.3.2], for the precise definition of this
object.11 For the purpose of this section, it is enough to consider the following examples (for more
see [262]):
The following fundamental result due to Kuelbs and Li [263] (see also the earlier work of
Goodman [194]) describes a precise connection between the small-ball probability function ϕ(ϵ)
and the metric entropy of the unit Hilbert ball N(K, k · k, ϵ) ≡ N(ϵ).
λ2
ϕ(2ϵ) + log Φ(λ + Φ−1 (e−ϕ(ϵ) )) ≤ log N(λK, ϵ) ≤ log M(λK, ϵ) ≤ + ϕ(ϵ/2) (27.41)
2
p
To deduce (27.40), choose λ = 2ϕ(ϵ/2) and note that by scaling N(λK, ϵ) = N(K, ϵ/λ).
) = Φc (t) ≤ e−t /2 (Exercise V.25) yields Φ−1 (e−ϕ(ϵ) ) ≥
2
Applying
p the normal tail bound Φ(− t
− 2ϕ(ϵ) ≥ −λ so that Φ(Φ−1 (e−ϕ(ϵ) ) + λ) ≥ Φ(0) = 1/2.
We only give the proof in finite dimensions as the results are dimension-free and extend natu-
rally to infinite-dimensional spaces. Let Z ∼ γ = N (0, Σ) on Rd so that K = Σ1/2 B2 is given in
(27.38). Applying (27.37) to λK and noting that γ is a probability measure, we have
γ (λK + B (0, ϵ)) 1
≤ N(λK, ϵ) ≤ M(λK, ϵ) ≤ . (27.42)
maxθ∈Rd γ (B (θ, 2ϵ)) minθ∈λK γ (B (θ, ϵ/2))
Next we further bound (27.42) using properties native to the Gaussian measure.
11
In particular, if γ is the law of a Gaussian process X on C([0, 1]) with E[kXk22 ] < ∞, the kernel K(s, t) = E[X(s)X(t)]
∑
admits the eigendecomposition K(s, t) = λk ψk (s)ψk (t) (Mercer’s theorem), where {ϕk } is an orthonormal basis for
∑
L2 ([0, 1]) and λk > 0. Then H is the closure of the span of {ϕk } with the inner product hx, yiH = k hx, ψk ihy, ψk i/λk .
i i
i i
i i
• For the upper bound, for any symmetric set A = −A and any θ ∈ λK, by a change of measure
γ(θ + A) = P [Z − θ ∈ A]
1 ⊤ −1
h −1 i
= e− 2 θ Σ θ E e⟨Σ θ,Z⟩ 1 {Z ∈ A}
≥ e−λ
2
/2
P [Z ∈ A] ,
h −1 i
where the last step follows from θ⊤ Σ−1 θ ≤ λ2 and by Jensen’s inequality E e⟨Σ θ,Z⟩ |Z ∈ A ≥
−1
e⟨Σ θ,E[Z|Z∈A]⟩ = 1, using crucially that E [Z|Z ∈ A] = 0 by symmetry. Applying the above to
A = B(0, ϵ/2) yields the right inequality in (27.41).
• For the lower bound, recall Anderson’s lemma (Lemma 28.10) stating that the Gaussian measure
of a ball is maximized when centered at zero, so γ(B(θ, 2ϵ)) ≤ γ(B(0, 2ϵ)) for all θ. To bound
the numerator, recall the Gaussian isoperimetric inequality (see e.g. [69, Theorem 10.15]):12
Applying this with A = B(0, ϵ) proves the left inequality in (27.41) and the theorem.
The implication of Theorem 27.15 is the following. Provided that ϕ(ϵ) ϕ(ϵ/2), then we
should expect that approximately
!
ϵ
log N p ϕ(ϵ)
ϕ(ϵ)
With more effort this can be made precise unconditionally (see e.g. [279, Theorem 3.3], incorporat-
ing the later improvement by [278]), leading to very precise connections between metric entropy
and small-ball probability, for example: for fixed α > 0, β ∈ R,
β 2+α
2β
−α 1 − 2+α
2α 1
ϕ(ϵ) ϵ log ⇐⇒ log N(ϵ) ϵ log (27.44)
ϵ ϵ
As a concrete example, consider the unit ball (27.39) in the RKHS generated by the standard
Brownian motion, which is similar to a Sobolev ball.13 Using (27.36) and (27.44), we conclude
that log N(ϵ) 1ϵ , recovering the metric entropy of Sobolev ball determined in [420]. This result
also coincides with the metric entropy of Lipschitz ball in Theorem 27.14 which requires the
derivative to be bounded everywhere as opposed to on average in L2 . For more applications of
small-ball probability on metric entropy (and vice versa), see [263, 278].
12
The connection between (27.43) and isoperimetry is that if we interpret limλ→0 (γ(A + λK) − γ(A))/λ as the surface
measure of A, then among all sets with the same Gaussian measure, the half space has maximal surface measure.
13
The Sobolev norm is kfkW1,2 ≜ kfk2 + kf′ k2 . Nevertheless, it is simple to verify a priori that the metric entropy of
(27.39) and that of the Sobolev ball share the same behavior (see [263, p. 152]).
i i
i i
i i
542
its rate-distortion function (recall Section 24.3). Denote the worst-case rate-distortion function on
X by
The next theorem relates ϕX to the covering and packing number of X . The lower bound simply
follows from a “Bayesian” argument, which bounds the worst case from below by the average case,
akin to the relationship between minimax and Bayes risk (see Section 28.3). The upper bound was
shown in [241] using the dual representation of rate-distortion functions; here we give a simpler
proof via Fano’s inequality.
Proof. Fix an ϵ-covering of X in d of size N. Let X̂ denote the closest element in the covering to
X. Then d(X, X̂) ≤ ϵ almost surely. Thus ϕX (ϵ) ≤ I(X; X̂) ≤ log N. Optimizing over PX proves the
left inequality.
For the right inequality, let X be uniformly distributed over a maximal ϵ-packing of X . For
any PX̂|X such that E[d(X, X̂)] ≤ cϵ. Let X̃ denote the closest point in the packing to X̂. Then we
have the Markov chain X → X̂ → X̃. By definition, d(X, X̃) ≤ d(X̂, X̃) + d(X̂, X) ≤ 2d(X̂, X) so
E[d(X, X̃)] ≤ 2cϵ. Since either X = X̃ or d(X, X̃) > ϵ, we have P[X 6= X̃] ≤ 2c. On the other
hand, Fano’s inequality (Corollary 3.13) yields P[X 6= X̃] ≥ 1 − I(X;log X̂)+log 2
M . In all, I(X; X̂) ≥
(1 − 2c) log M − log 2, proving the upper bound.
Remark 27.4 (a) Clearly, Theorem 27.16 can be extended to the case where the distortion
function equals a power of the metric, namely, replacing (27.45) with
ϕX,r (ϵ) ≜ inf I(X; X̂).
PX̂|X :E[d(X,X̂)r ]≤ϵr
Then (27.47) continues to hold with 1 − 2c replaced by 1 − (2c)r . This will be useful, for
example, in the forthcoming applications where second moment constraint is easier to work
with.
i i
i i
i i
(b) In the earlier literature a variant of the rate-distortion function is also considered, known as
the ϵ-entropy of X, where the constraint is d(X, X̂) ≤ ϵ with probability one as opposed to
in expectation (cf. e.g. [250, Appendix II] and [349]). With this definition, it is natural to
conjecture that the maximal ϵ-entropy over all distributions on X coincides with the metric
entropy log N(X , ϵ); nevertheless, this need not be true (see [300, Remark, p. 1708] for a
counterexample).
Theorem 27.16 points out an information-theoretic route to bound the metric entropy by the
worst-case rate-distortion function (27.46).14 Solving this maximization, however, is not easy as
PX 7→ ϕX (D) is in general neither convex nor concave [6].15 Fortunately, for certain spaces, one
can show via a symmetry argument that the “uniform” distribution maximizes the rate-distortion
function at every distortion level; see Exercise V.24 for a formal statement. As a consequence, we
have:
• For Hamming space X = {0, 1}d and Hamming distortion, ϕX (D) is attained by Ber( 12 )d . (We
already knew this from Theorem 26.1 and Theorem 24.8.)
• For the unit sphere X = Sd−1 and distortion function defined by the Euclidean distance, ϕX (D)
is attained by Unif(Sd−1 ).
• For the orthogonal group X = O(d) or unitary group U(d) and distortion function defined by
the Frobenius norm, ϕX (D) is attained by the Haar measure. Similar statements also hold for
the Grassmann manifold (collection of linear subspaces).
Theorem 27.17 Let θ be uniformly distributed over the unit sphere Sd−1 . Then for all 0 <
ϵ < 1,
1 1
(d − 1) log − C ≤ inf I(θ; θ̂) ≤ (d − 1) log 1 + + log(2d)
ϵ Pθ̂|θ :E[∥θ̂−θ∥22 ]≤ϵ2 ϵ
Note that the random vector θ have dependent entries so we cannot invoke the single-
d
letterization technique in Theorem 24.8. Nevertheless, we have the representation θ=Z/kZk2 for
Z ∼ N (0, Id ), which allows us to relate the rate-distortion function of θ to that of the Gaussian
found in Theorem 26.2. The resulting lower bound agree with the metric entropy for spheres in
Corollary 27.4, which scales as (d − 1) log 1ϵ . Using similar reduction arguments (see [275, The-
orem VIII.18]), one can obtain tight lower bound for the metric entropy of the orthogonal group
O(d) and the unitary group U(d), which scales as d(d2−1) log 1ϵ and d2 log 1ϵ , with pre-log factors
14
A striking parallelism between the metric entropy of Sobolev balls and the rate-distortion function of smooth Gaussian
processes has been observed by Donoho in [133]. However, we cannot apply Theorem 27.16 to formally relate one to the
other since it is unclear whether the Gaussian rate-distortion function is maximal.
15
As a counterexample, consider Theorem 26.1 for the binary source.
i i
i i
i i
544
commensurate with their respective degrees of freedoms. As mentioned in Remark 27.3(b), these
results were obtained by Szarek in [406] using a volume argument with Haar measures; in compar-
ison, the information-theoretic approach is more elementary as we can again reduce to Gaussian
rate-distortion computation.
Proof. The upper bound follows from Theorem 27.16 and Remark 27.4(a), applying the metric
entropy bound for spheres in Corollary 27.4.
To prove the lower bound, let Z ∼ N (0, Id ). Define θ = ∥ZZ∥ and A = kZk, where k · k ≡ k · k2
henceforth. Then θ ∼ Unif(Sd−1 ) and A ∼ χd are independent. Fix Pθ̂|θ such that E[kθ̂ − θk2 ] ≤
ϵ2 . Since Var(A) ≤ 1, the Shannon lower bound (Theorem 26.3) shows that the rate-distortion
function of A is majorized by that of the standard Gaussian. So for each δ ∈ (0, 1), there exists
PÂ|A such that E[(Â − A)2 ] ≤ δ 2 , I(A, Â) ≤ log δ1 , and E[A] = E[Â]. Set Ẑ = Âθ̂. Then
I(Z; Ẑ) = I(θ, A; Ẑ) ≤ I(θ, A; θ̂, Â) = I(θ; θ̂) + I(A, Â).
Furthermore, E[Â2 ] = E[(Â − A)2 ] + E[A2 ] + 2E[(Â − A)(A − E[A])] ≤ d + δ 2 + 2δ ≤ d + 3δ .
Similarly, |E[Â(Â − A)]| ≤ 2δ and E[kZ − Ẑk2 ] ≤ dϵ2 + 7δϵ + δ . Choosing δ = ϵ, we have
E[kZ − Ẑk2 ] ≤ (d + 8)ϵ2 . Combining Theorem 24.8 with the Gaussian rate-distortion function in
Theorem 26.2, we have I(Z; Ẑ) ≥ d2 log (d+d8)ϵ2 , so applying log(1 + x) ≤ x yields
1
I(θ; θ̂) ≥ (d − 1) log − 4 log e.
ϵ2
i i
i i
i i
V.1 Let S = Ŝ = {0, 1}. Consider the source X10 consisting of fair coin flips. Construct a simple
1
(suboptimal) compressor achieving average Hamming distortion 20 with 512 codewords.
V.2 Assume a separable distortion loss. Show that the minimal number of codewords M∗ (n, D)
required to represent memoryless source Xn with average distortion D (recall (24.9)) satisfies
Conclude that
1 1
lim log M∗ (n, D) = inf log M∗ (n, D) . (V.1)
n→∞ n n n
That is, one can always achieve a better compression rate by using a longer blocklength. Neither
claim holds for log M∗ (n, ϵ) in channel coding as defined in (19.4). Explain why this different
behavior arises.
V.3 (Non-asymptotic rate-distortion) Our goal is to show that the convergence to R(D) happens
much faster than that to capacity in channel coding. Consider binary uniform X ∼ Ber(1/2)
with Hamming distortion.
(a) Show that there exists a lossy code Xn → W → X̂n with M codewords and
where
s
X n
p(s) = 2−n .
j
j=0
(b) Show that there exists a lossy code with M codewords and
1X
n−1
M
E[d(Xn , X̂n )] ≤ (1 − p(s)) . (V.2)
n
s=0
(c) Show that there exists a lossy code with M codewords and
1 X −Mp(s)
n−1
E[d(Xn , X̂n )] ≤ e . (V.3)
n
s=0
(Note: For M ≈ 2nR , numerical evaluation of (V.2) for large n is challenging. At the same
time (V.3) is only slightly looser.)
i i
i i
i i
(d) For n = 10, 50, 100 and 200 compute the upper bound on log M∗ (n, 0.11) via (V.3).
Compare with the lower bound
V.4 Continuing Exercise V.3 use Stirling formula and (V.3)-(V.4) to show
Note: Thus, optimal compression rate converges to its asymptotic fundamental limit R(D) at
√
a fast rate of O(log n/n) as opposed to O(1/ n) for channel coding (cf. Theorem 22.2). This
result holds for most memoryless sources.
i.i.d.
V.5 Let Sj ∼ PS be an iid source on a finite alphabet A and PS (a) > 0 for all a ∈ A. Suppose
the distortion metric satisfies d(x, y) = D0 =⇒ x = y. Show that R(D0 ) = log |A|, while
R(D0 +) = H(X).
V.6 Consider a memoryless source X uniform on X = X̂ = [m] with Hamming distortion: d(x, x̂) =
1{x 6= x̂}. Show that
(
log m − h(D) − D log(m − 1) D ≤ mm−1
R(D) =
0 otherwise.
and that for any distortion level optimal vector quantizer is only taking values ±(1 − 2p) (Hint:
you may find Exercise I.64(b) useful). Compare with the case of X̂ ∈ {±1}, for which we have
shown R(D) = log 2 − h(D/4), D ∈ [0, 2].
V.10 (Product source) Consider two independent stationary memoryless sources X ∈ X and Y ∈ Y
with reproduction alphabets X̂ and Ŷ , distortion measures d1 : X × X̂ → R+ and d2 : Y ×
Ŷ → R+ , and rate-distortion functions RX and RY , respectively. Now consider the stationary
memoryless product source Z = (X, Y) with reproduction alphabet X̂ ×Ŷ and distortion measure
d(z, ẑ) = d1 (x, x̂) + d2 (y, ŷ).
i i
i i
i i
(c) How do you build an optimal lossy compressor for Z using optimal lossy compressors for
X and Y?
V.11 (Compression with output constraints) Compute the rate-distortion function R(D; a, p) of a
Ber(p) source, Hamming distortion under an extra constraint that reconctruction points X̂n
should have average Hamming weight E[wH (X̂n )] ≤ an, where 0 < a, p ≤ 21 . (Hint: Show a
more general result that given two distortion metrics d1 , d2 we have R(D1 , D2 ) = min{I(S; Ŝ) :
E[di (S, Ŝ)] ≤ Di , i ∈ {1, 2}}.)
V.12 Commercial (mono) FM radio modulates a bandlimited (15kHz) audio signal into a radio-
frequency signal of bandwidth 200kHz. This system roughly achieves
SNRaudio ≈ 40 dB + SNRchannel
over the AWGN channel whenever SNRchannel ≳ 12 dB. Thus for the 12 dB channel, we get
that FM radio has distortion of 52 dB. Show that information-theoretic limit is about 160 dB.
Hint: assume that input signal is low-pass filtered to 15kHz white, zero-mean Gaussian and use
the optimal joint source channel code (JSSC) for the given bandwidth expansion ratio and fixed
SNRchannel . Also recall that the SNR of the reconstruction Ŝn expressed in dB is defined as
Pk
j=1 E[Sj ]
2
SNRaudio ≜ 10 log10 Pk .
j=1 E[(Sj − Ŝj ) ]
2
V.13 Consider a memoryless Gaussian source X ∼ N (0, 1), reconstruction alphabet  = {±1} and
quadratic distortion d(a, â) = (a − â)2 . Compute D0 , R(D0 +), Dmax . Then obtain a parametric
formula for R(D).
V.14 (Erokhin’s rate-distortion [155]) Let d(ak , bk ) = 1{ak 6= bk } be a (non-separable) distortion
metric for k-strings on a finite alphabet S = Ŝ . Prove that for any source Sk we have
φSk (ϵ) ≜ min I(Sk ; Ŝk ) ≥ H(Sk ) − ϵk log |S| − h(ϵ) , (V.5)
P[Sk ̸=Ŝk ]≤ϵ
and that the bound is tight only for Sk uniform on S k . Next, suppose that Sk is i.i.d. source. Prove
r
kV(S) − (Q−12(ϵ))2
ϕSk (ϵ) = (1 − ϵ)kH(S) − e + O(log k) ,
2π
where V(S) is the varentropy (10.4). (Hint: Let T = P̂Sk be the empirical distribution (type) of the
realization Sk . Then I(Sk ; Ŝk ) = I(Sk , T; Ŝk ) = I(Sk ; Ŝk |T) + O(log k). Denote ϵT ≜ P[Sk 6= Ŝk |T]
i i
i i
i i
and given ϵT optimize the first term via (V.5). Then optimize the assignment ϵT over all E[ϵT ] ≤ ϵ.
Also use E[Z1{Z > c}] = √12π e−c /2 for Z ∼ N (0, 1). See [254, Lemma 1] for full details).
2
i.i.d.
V.15 Consider a source Sn ∼ Ber( 21 ). Answer the following questions when n is large.
(a) Suppose the goal is to compress Sn into k bits so that one can reconstruct Sn with at most
one bit of error. That is, the decoded version Ŝn satisfies E[dH (Ŝn , Sn )] ≤ 1. Show that this
can be done (if possible, with an explicit algorithm) with k = n − C log n bits for some
constant C. Is it optimal?
(b) Suppose we are required to compress Sn into only 1 bit. Show that one can achieve (if
√
possible, with an explicit algorithm) a reconstruction error E[dH (Ŝn , Sn )] ≤ n2 − C n for
some constant C. Is it optimal?
Warning: We cannot blindly apply the asymptotic rate-distortion theory to show achievability
since here the distortion level changes with n. The converse, however, directly applies.
V.16 Consider a standard Gaussian vector Sn and quadratic distortion metric. We will discuss zero-
rate quantization.
√
(a) Let Smax =√max1≤i≤n Si . Show that E[(Smax − 2 ln n)2 ] → 0 when n → ∞. Show that
E[(Smax − 2 ln n)2 ] → 0 when n → ∞.
(b) Suppose you are given a budget of log2 n bits. Consider the following scheme: Let i∗ denote
the index of the largest coordinate. The compressor √
stores the index i∗ which costs log2 n
bits and the decompressor outputs Ŝ where Ŝi = 2 ln n for i = i∗ and Si = 0 otherwise.
n
Show that distortion in terms of mean-square error satisfies E[kŜn − Sn k22 ] = n − 2 ln n + o(1)
when n → ∞.
(c) Show that for any compressor (using log2 n bits) we must have E[kŜn − Sn k22 ] ≥ n − 2 ln n +
o( 1) .
V.17 (Noisy source-coding; also remote source-coding [126]) Consider the problem of compressing
i.i.d. sequence Xn under separable distortion metric d. Now, however, compressor does not have
direct access to Xn but only to its noisy version Yn obtained over a stationary memoryless channel
i.i.d.
PY|X (i.e. (Xi , Yi ) ∼ PX,Y for a fixed PX,Y and encoder is a map f : Y n → [M]). Show that the
rate-distortion function is
n o
R(D) = min I(Y; X̂) : E[d(X, X̂)] ≤ D, X → Y → X̂ ,
where minimization is over all PX̂|Y . (Hint: define d̃(y, x̂) ≜ E[d(X, x̂)|Y = y].)
i.i.d.
V.18 (Noisy/remote source coding; special case) Let Zn ∼ Ber( 12 ) and Xn = BECδ (Zn ). Compressor
is to encode Xn at rate R so that we can reconstruct Zn with bit-error rate D. Let R(D) denote
the optimal rate.
(a) Suppose that locations of erasures in Xn are provided as a side information to decompressor.
Show that R(δ/2) = δ̄2 (Hint: compressor is very simple).
(b) Surprisingly, the same rate is achievable without knowledge of erasures. Use Exercise V.17
to prove R(D) = H(δ̄/2, δ̄/2, δ) − H(1 − D − δ2 , D − δ2 , δ) for D ∈ [ δ2 , 12 ].
V.19 (Log-loss) Consider the rate-distortion problem where the reconstruction alphabet X̂n = P(X n )
is the space of all probability mass functions on X n . We define two loss functions. The first one
i i
i i
i i
is non-separable (!)
1 1
dnonsep (xn , P̂) = log (V.6)
n P̂(xn )
and the second one is separable:
1X
n
1
dsep (xn , P̂Xn ) = log .
n P̂Xj (xj )
j=1
Show that
D(R) = H(X) − FI (R) .
i i
i i
i i
Q
(Hint: for achievability, restrict reconstruction to P̂Xn = P̂Xi , this makes distortion
additive and then apply Ex. V.17; for converse use tensorization property of FI -curve
from Exercise III.32)
V.21 (a) Let 0 ≺ ∆ Σ be positive definite matrices. For S ∼ N (0, Σ), show that
1 det Σ
inf I(S; Ŝ) = log .
PŜ|S :E[(S−Ŝ)(S−Ŝ)⊤ ]⪯∆ 2 det ∆
1 X + σi2
d
inf I(S; Ŝ) = log
PŜ|S :E[∥S−Ŝ∥22 ]≤D 2 λ
i=1
Pd
where λ > 0 is such that i=1 min{σi2 , λ} = D. This is the counterpart of the solution in
Theorem 20.14.
(Hint: First, using the orthogonal invariance of distortion metric we can assume that
Σ is diagonal. Next, apply the same single-letterization argument for (26.3) and solve
Pd σ2
minP Di =D 12 i=1 log+ Dii .)
V.22 (Shannon lower bound) Let k · k be an arbitrary norm on Rd and r > 0. Let X be a Rd -valued
random vector with a probability density function pX . Denote the rate-distortion function
(Hint: Define an auxiliary backward channel QX|X̂ (dx|x̂) = qs (x − x̂)dx, where qs (w) =
QX|X̂
1
Z(s) exp(−skwkr ). Then I(X; X̂) = EP [log PX ] + D(PX|X̂ kQX|X̂ kPX̂ ).)
i i
i i
i i
and this entropy maximization can be solved following the argument in Example 5.2.
V.23 (Uniform distribution minimizes convex symmetric functional.) Let G be a group acting on a
set X such that each g ∈ G sends x ∈ X to gx ∈ X . Suppose G acts transitively, i.e., for each
x, x′ ∈ X there exists g ∈ G such that gx = x′ . Let g be a random element of G with an invariant
d
distribution, namely hg=g for any h ∈ G. (Such a distribution, known as the Haar measure,
exists for compact topological groups.)
(a) Show that for any x ∈ X , gx has the same law, denoted by Unif(X ), the uniform distribution
on X .
(b) Let f : P(X ) → R be convex and G-invariant, i.e., f(PgX ) = f(PX ) for any X -valued random
variable X and any g ∈ G. Show that minPX ∈P(X ) f(PX ) = f(Unif(X )).
V.24 (Uniform distribution maximizes rate-distortion function.) Continuing the setup of Exer-
cise V.23, let d : X × X → R be a G-invariant distortion function, i.e., d(gx, gx′ ) =
d(x, x′ ) for any g ∈ G. Denote the rate-distortion function of an X -valued X by ϕX (D) =
infP :E[d(X,X̂)]≤D I(X; X̂). Suppose that ϕX (D) < ∞ for all X and all D > 0.
X̂|X
(a) Let ϕ∗X (λ) = supD {λD − ϕX (D)} denote the conjugate of ϕX . Applying Theorem 24.4 and
Fenchel-Moreau’s biconjugation theorem to conclude that ϕX (D) = supλ {λD − ϕ∗X (λ)}.
(b) Show that
As such, for each λ, PX 7→ ϕ∗X (λ) is convex and G-invariant. (Hint: Theorem 5.3.)
(c) Applying Exercise V.23 to conclude that ϕ∗U (λ) ≤ ϕ∗X (λ) for U ∼ Unif(X ) and that
V.25 (Normal tail bound.) Denote the standard normal density and tail probability by φ(x) =
R∞
√1 e−x /2 and Φc (t) =
2
2π t
φ(x)dx. Show that for all t > 0,
t φ(t) −t2 /2
φ(t) ≤ Φ (t) ≤ min
c
,e . (V.8)
1 + t2 t
(Hint: For Φc (t) ≤ e−t /2 apply the Chernoff bound (15.2); for the rest, note that by integration
2
R∞
by parts Φc (t) = φ(t t) − t φ(x2x) dx.)
V.26 (Covering radius in Hamming space) In this exercise we prove (27.9), namely, for any fixed
0 ≤ D ≤ 1, as n → ∞,
i i
i i
i i
(a) Prove the lower bound by invoking the volume bound in Theorem 27.3 and the large-
deviations estimate in Example 15.1.
(b) Prove the upper bound using probabilistic construction and a similar argument to (25.8).
(c) Show that for D ≥ 12 , N(Fn2 , dH , Dn) ≤ 2 – cf. Ex. V.15a.
V.27 (Covering ℓp -ball with ℓq -balls)
(a) For 1 ≤ p < q ≤ ∞, prove the bound (27.18) on the metric entropy of the unit ℓp -ball with
respect to the ℓq -norm (Hint: for small ϵ, apply the volume calculation in (27.15)–(27.16)
and the formula in (27.13); for large ϵ, proceed as in the proof of Theorem 27.7 by applying
the quantization argument and the Gilbert-Varshamov bound of Hamming spheres.)
(b) What happens when p > q?
V.28 In this exercise we prove Dudley’s chaining inequality (27.22). In view of Theorem 27.2, it is
equivalent to show the following version with covering numbers:
Z ∞p
w(Θ) ≲ log N(ϵ)dϵ. (V.9)
0
where Mi ≜ max{hZ, si − si−1 i : si ∈ Ti , si−1 ∈ Ti−1 }. (Hint: For any θ ∈ Θ, let θi denote its
P
nearest neighbor in Ti . Then hZ, θi = hZ, θ0 i + i≥1 hZ, θi − θi−1 i.)
p
c Show that E[Mi ] ≲ ϵi log N(ϵi ) (Hint: kθi − θi−1 k ≤ kθi − θk + kθi−1 − θk ≤ 3ϵi . Then
apply Lemma 27.10 and the bounded convergence theorem.)
d Conclude that
X p
w(Θ) ≲ ϵi log N(ϵi )
i≥0
(b) Let U = {u1 , . . . , uM } and V = {v1 , . . . , vM } be an ϵ-net for the spheres Sm−1 and Sn−1
respectively. Show that
1
kAkop ≤ max hA, uv′ i .
(1 − ϵ)2 u∈U ,v∈V
i i
i i
i i
16
Using the large deviations theory developed by Donsker-Varadhan, the sharp constant can be found to be
2
limϵ→0 ϵ2 ϕ(ϵ) = π8 ; see for example [279, Sec. 6.2].
i i
i i
i i
i i
i i
i i
Part VI
Statistical applications
i i
i i
i i
i i
i i
i i
557
This part gives an exposition on the application of information-theoretic principles and meth-
ods in mathematical statistics; we do so by discussing a selection of topics. To start, Chapter 28
introduces the basic decision-theoretic framework of statistical estimation and the Bayes risk
and the minimax risk as the fundamental limits. Chapter 29 gives an exposition of the classi-
cal large-sample asymptotics for smooth parametric models in fixed dimensions, highlighting the
role of Fisher information introduced in Chapter 2. Notably, we discuss how to deduce classi-
cal lower bounds (Hammersley-Chapman-Robbins, Cramér-Rao, van Trees) from the variational
characterization and the data processing inequality (DPI) of χ2 -divergence in Chapter 7.
Moving into high dimensions, Chapter 30 introduces the mutual information method for sta-
tistical lower bound, based on the DPI for mutual information as well as the theory of capacity
and rate-distortion function from Parts IV and V. This principled approach includes three popular
methods for proving minimax lower bounds (Le Cam, Assouad, and Fano) as special cases, which
are discussed at length in Chapter 31 drawing results from metric entropy in Chapter 27 also.
Complementing the exposition on lower bounds in Chapters 30 and 31, in Chapter 32 we present
three upper bounds on statistical estimation based on metric entropy. These bounds appear strik-
ingly similar but follow from completely different methodologies. Application to nonparametric
density estimation is used as a primary example.
Chapter 33 introduces strong data processing inequalities (SDPI), which are quantitative
strengthening of DPIs in Part I. As applications we show how to apply SDPI to deduce lower
bounds for various estimation problems on graphs or in distributed settings.
i i
i i
i i
In this chapter, we discuss the decision-theoretic framework of statistical estimation and introduce
several important examples. Section 28.1 presents the basic elements of statistical experiment and
statistical estimation. Section 28.3 introduces the Bayes risk (average-case) and the minimax risk
(worst-case) as the respective fundamental limit of statistical estimation in Bayesian and frequen-
tist setting, with the latter being our primary focus in this part. We discuss several version of the
minimax theorem (and prove a simple one) that equates the minimax risk with the worst-case
Bayes risk. Two variants are introduced next that extend a basic statistical experiment to either
large sample size or large dimension: Section 28.4 on independent observations and Section 28.5
on tensorization of experiments. Throughout this part the Gaussian location model (GLM), intro-
duced in Section 28.2, serves as a running example, with different focus at different places (such
as the role of loss functions, parameter spaces, low versus high dimensions, etc). In Section 28.6,
we discuss a key result known as the Anderson’s lemma for determining the exact minimax risk
of (unconstrained) GLM in any dimension for a broad class of loss functions, which provides a
benchmark for various more general techniques introduced in later chapters.
558
i i
i i
i i
transition kernel) PT̂|X , or a channel in the language of Part I. For all practical purposes, we can
write T̂ = T̂(X, U), where U denotes external randomness uniform on [0, 1] and independent of X.
To measure the quality of an estimator T̂, we introduce a loss function ℓ : Y × Ŷ → R such
that ℓ(T, T̂) is the risk of T̂ for estimating T. Since we are dealing with loss (as opposed to reward),
all the negative (converse) results are lower bounds and all the positive (achievable) results are
upper bounds. Note that X is a random variable, so are T̂ and ℓ(T, T̂). Therefore, to make sense of
“minimizing the loss”, we consider the expected risk:
Z
Rθ (T̂) = Eθ [ℓ(T, T̂)] = Pθ (dx)PT̂|X (dt̂|x)ℓ(T(θ), t̂), (28.2)
which we refer to as the risk of T̂ at θ. The subscript in Eθ indicates the distribution with respect
to which the expectation is taken. Note that the expected risk depends on the estimator as well as
the ground truth.
Remark 28.1 We note that the problem of hypothesis testing and inference can be encom-
passed as special cases of the estimation paradigm. As previously discussed in Section 16.4, there
are three formulations for testing:
H0 : θ = θ 0 vs H1 : θ = θ 1 , θ0 6= θ1
H0 : θ = θ 0 vs H1 : θ ∈ Θ 1 , θ0 ∈
/ Θ1
H0 : θ ∈ Θ 0 vs H1 : θ ∈ Θ 1 , Θ0 ∩ Θ1 = ∅.
For each case one can introduce the appropriate parameter space and loss function. For example,
in the last (most general) case, we may take
(
0 θ ∈ Θ0
Θ = Θ0 ∪ Θ1 , T(θ) = , T̂ ∈ {0, 1}
1 θ ∈ Θ1
n o
and use the zero-one loss ℓ(T, T̂) = 1 T 6= T̂ so that the expected risk Rθ (T̂) = Pθ {θ ∈ / ΘT̂ } is
the probability of error.
For the problem of inference, the goal is to output a confidence interval (or region) which covers
the true parameter with high
n probability.
o In this case T̂ is a subset of Θ and we may choose the
loss function ℓ(θ, T̂) = 1 θ ∈/ T̂ + λ · length(T̂) for some λ > 0, in order to balance the coverage
and the size of the confidence interval.
Remark 28.2 (Randomized versus deterministic estimators) Although most of the
estimators used in practice are deterministic, there are a number of reasons to consider randomized
estimators:
i i
i i
i i
560
• For certain formulations, such as the minimizing worst-case risk (minimax approach), deter-
ministic estimators are suboptimal and it is necessary to randomize. On the other hand, if the
objective is to minimize the average risk (Bayes approach), then it does not lose generality to
restrict to deterministic estimators.
• The space of randomized estimators (viewed as Markov kernels) is convex which is the convex
hull of deterministic estimators. This convexification is needed for example for the treatment
of minimax theorems.
P = {N (θ, σ 2 Id ) : θ ∈ Θ}
where Id is the d-dimensional identity matrix and the parameter space Θ ⊂ Rd . Equivalently, we
can express the data as a noisy observation of the unknown vector θ as:
X = θ + Z, Z ∼ N (0, σ 2 Id ).
The case of d = 1 and d > 1 refers to the univariate (scalar) and multivariate (vector) case,
respectively. (Also of interest is the case where θ is a d1 × d2 matrix, which can be vectorized into
a d = d1 d2 -dimensional vector.)
The choice of the parameter space Θ represents our prior knowledges of the unknown parameter
θ, for example,
i i
i i
i i
28.3 Bayes risk, minimax risk, and the minimax theorem 561
By definition, more structure (smaller parameter space) always makes the estimation task easier
(smaller worst-case risk), but not necessarily so in terms of computation.
For estimating θ itself (denoising), it is customary to use a loss function defined by certain
P p 1
p for some 1 ≤ p ≤ ∞ and α > 0, where kθkp ≜ (
norms, e.g., ℓ(θ, θ̂) = kθ − θ̂kα |θi | ) p , with
p = α = 2 corresponding to the commonly used quadratic loss (squared error). Some well-known
estimators include the Maximum Likelihood Estimator (MLE)
θ̂ML = X (28.3)
and the James-Stein estimator based on shrinkage
(d − 2)σ 2
θ̂JS = 1 − X. (28.4)
kXk22
The choice of the estimator depends on both the objective and the parameter space. For instance,
if θ is known to be sparse, it makes sense to set the smaller entries in the observed X to zero
(thresholding) in order to better denoise θ (cf. Section 30.2).
In addition to estimating the vector θ itself, it is also of interest to estimate certain functionals
T(θ) thereof, e.g., T(θ) = kθkp , max{θ1 , . . . , θd }, or eigenvalues in the matrix case. In addition,
the hypothesis testing problem in the GLM has been well-studied. For example, one can consider
detecting the presence of a signal by testing H0 : θ = 0 against H1 : kθk ≥ ϵ, or testing weak signal
H0 : kθk ≤ ϵ0 versus strong signal H1 : kθk ≥ ϵ1 , with or without further structural assumptions
on θ. We refer the reader to the monograph [225] devoted to these problems.
i i
i i
i i
562
Given a prior π, its Bayes risk is the minimal average risk, namely
An estimator θ̂∗ is called a Bayes estimator if it attains the Bayes risk, namely, R∗π = Eθ∼π [Rθ (θ̂∗ )].
Remark 28.3 Bayes estimator is always deterministic, a fact that holds for any loss function.
To see this, note that for any randomized estimator, say θ̂ = θ̂(X, U), where U is some external
randomness independent of X and θ, its risk is lower bounded by
Rπ (θ̂) = Eθ,X,U [ℓ(θ, θ̂(X, U))] = EU [Rπ (θ̂(·, U))] ≥ inf Rπ (θ̂(·, u)).
u
Note that for any u, θ̂(·, u) is a deterministic estimator. This shows that we can find a deterministic
estimator whose average risk is no worse than that of the randomized estimator.
An alternative explanation of this fact is the following: Note that the average risk Rπ (θ̂) defined
in (28.5) is an affine function of the randomized estimator (understood as a Markov kernel Pθ̂|X )
is affine, whose minimum is achieved at the extremal points. In this case the extremal points of
Markov kernels are simply delta measures, which corresponds to deterministic estimators.
In certain settings the Bayes estimator can be found explicitly. Consider the problem of estimat-
ing θ ∈ Rd drawn from a prior π. Under the quadratic loss ℓ(θ, θ̂) = kθ̂ − θk22 , the Bayes estimator
is the conditional mean θ̂(X) = E[θ|X] and the Bayes risk is the minimum mean-square error
(MMSE), which we previously encountered in Section 3.7* in the context of I-MMSE relationship:
i i
i i
i i
28.3 Bayes risk, minimax risk, and the minimax theorem 563
As a concrete example, let us consider the Gaussian Location Model in Section 28.2 with a
Gaussian prior.
Example 28.1 (Bayes risk in GLM) Consider the scalar case, where X = θ + Z and Z ∼
N (0, σ 2 ) is independent of θ. Consider a Gaussian prior θ ∼ π = N (0, s). One can verify that the
sσ 2
posterior distribution Pθ|X=x is N ( s+σ 2 x, s+σ 2 ). As such, the Bayes estimator is E[θ|X] = s+σ 2 X
s s
sσ 2
R∗π = d. (28.7)
s + σ2
If there exists θ̂ s.t. supθ∈Θ Rθ (θ̂) = R∗ , then the estimator θ̂ is minimax (minimax optimal).
Finding the value of the minimax risk R∗ entails proving two things, namely,
• a minimax upper bound, by exhibiting an estimator θ̂∗ such that Rθ (θ̂∗ ) ≤ R∗ + ϵ for all θ ∈ Θ;
• a minimax lower bound, by proving that for any estimator θ̂, there exists some θ ∈ Θ, such that
Rθ (θ̂) ≥ R∗ − ϵ,
where ϵ > 0 is arbitrary. This task is frequently difficult especially in high dimensions. Instead of
the exact minimax risk, it is often useful to find a constant-factor approximation Ψ, which we call
minimax rate, such that
R∗ Ψ, (28.9)
that is, cΨ ≤ R∗ ≤ CΨ for some universal constants c, C ≥ 0. Establishing Ψ is the minimax rate
still entails proving the minimax upper and lower bounds, albeit within multiplicative constant
factors.
In practice, minimax lower bounds are rarely established according to the original definition.
The next result shows that the Bayes risk is always lower than the minimax risk. Throughout
this book, all lower bound techniques essentially boil down to evaluating the Bayes risk with a
sagaciously chosen prior.
i i
i i
i i
564
Theorem 28.1 Let P(Θ) denote the collection of probability distributions on Θ. Then
(If the supremum is attained for some prior, we say it is least favorable.)
1 “max ≥ mean”: For any θ̂, Rπ (θ̂) = Eθ∼π Rθ (θ̂) ≤ supθ∈Θ Rθ (θ̂). Taking the infimum over θ̂
completes the proof;
2 “min max ≥ max min”:
R∗ = inf sup Rθ (θ̂) = inf sup Rπ (θ̂) ≥ sup inf Rπ (θ̂) = sup R∗π ,
θ̂ θ∈Θ θ̂ π ∈P(Θ) π ∈P(Θ) θ̂ π
where the inequality follows from the generic fact that minx maxy f(x, y) ≥ maxy minx f(x, y).
Remark 28.4 Unlike Bayes estimators which, as shown in Remark 28.3, are always deter-
ministic, to minimize the worst-case risk it is sometimes necessary to randomize for example in
the context of hypotheses testing (Chapter 14). Specifically, consider a trivial experiment where
θ ∈ {0, 1} and nX is absent,
o so that we are forced to guess the value of θ under the zero-one
loss ℓ(θ, θ̂) = 1 θ 6= θ̂ . It is clear that in this case the minimax risk is 12 , achieved by random
guessing θ̂ ∼ Ber( 21 ) but not by any deterministic θ̂.
As an application of Theorem 28.1, let us determine the minimax risk of the Gaussian location
model under the quadratic loss function.
Example 28.2 (Minimax quadratic risk of GLM) Consider the Gaussian location model
without structural assumptions, where X ∼ N (θ, σ 2 Id ) with θ ∈ Rd . We show that
By scaling, it suffices to consider σ = 1. For the upper bound, we consider θ̂ML = X which
achieves Rθ (θ̂ML ) = d for all θ. To get a matching minimax lower bound, we consider the prior
θ ∼ N (0, s). Using the Bayes risk previously computed in (28.6), we have R∗ ≥ R∗π = s+ sd
1.
∗
Sending s → ∞ yields R ≥ d.
i i
i i
i i
28.3 Bayes risk, minimax risk, and the minimax theorem 565
3.0
2.8
2.6
2.4
2.2
2 4 6 8
Figure 28.2 Risk of the James-Stein estimator (28.4) in dimension d = 3 and σ = 1 as a function of kθk.
For most of the statistical models, Theorem 28.1 in fact holds with equality; such a result is
known as a minimax theorem. Before discussing this important topic, here is an example where
minimax risk is strictly bigger than the worst-case Bayes risk.
n o
Example 28.3 Let θ, θ̂ ∈ N ≜ {1, 2, ...} and ℓ(θ, θ̂) = 1 θ̂ < θ , i.e., the statistician loses
one dollar if the Nature’s choice exceeds the statistician’s guess and loses nothing if otherwise.
Consider the extreme case of blind guessing (i.e., no data is available, say, X = 0). Then for any
θ̂ possibly randomized, we have Rθ (θ̂) = P(θ̂ < θ). Thus R∗ ≥ limθ→∞ P(θ̂ < θ) = 1, which is
clearly achievable. On the other hand, for any prior π on N, Rπ (θ̂) = P(θ̂ < θ), which vanishes as
θ̂ → ∞. Therefore, we have R∗π = 0. Therefore in this case R∗ = 1 > R∗Bayes = 0.
As an exercise, one can show that the minimax quadratic risk of the GLM X ∼ N (θ, 1) with
parameter space θ ≥ 0 is the same as the unconstrained case. (This might be a bit surprising
because the thresholded estimator X+ = max(X, 0) achieves a better risk pointwise at every θ ≥ 0;
nevertheless, just like the James-Stein estimator (cf. Figure 28.2), in the worst case the gain is
asymptotically diminishing.)
R∗ ≥ R∗Bayes .
This result can be interpreted from an optimization perspective. More precisely, R∗ is the value
of a convex optimization problem (primal) and R∗Bayes is precisely the value of its dual program.
Thus the inequality (28.10) is simply weak duality. If strong duality holds, then (28.10) is in fact
an equality, in which case the minimax theorem holds.
For simplicity, we consider the case where Θ is a finite set. Then
i i
i i
i i
566
This is a convex optimization problem. Indeed, Pθ̂|X 7→ Eθ [ℓ(θ, θ̂)] is affine and the pointwise
supremum of affine functions is convex. To write down its dual problem, first let us rewrite (28.12)
in an augmented form
R∗ = min t (28.13)
Pθ̂|X ,t
Let π θ ≥ 0 denote the Lagrange multiplier (dual variable) for each inequality constraint. The
Lagrangian of (28.13) is
!
X X X
L(Pθ̂|X , t, π ) = t + π θ Eθ [ℓ(θ, θ̂)] − t = 1 − πθ t + π θ Eθ [ℓ(θ, θ̂)].
θ∈Θ θ∈Θ θ∈Θ
P
By definition, we have R∗ ≥ mint,Pθ̂|X L(θ̂, t, π ). Note that unless θ∈Θ π θ = 1, mint∈R L(θ̂, t, π )
is −∞. Thus π = (π θ : θ ∈ Θ) must be a probability measure and the dual problem is
Hence, R∗ ≥ R∗Bayes .
In summary, the minimax risk and the worst-case Bayes risk are related by convex duality,
where the primal variables are (randomized) estimators and the dual variables are priors. This
view can in fact be operationalized. For example, [238, 346] showed that for certain problems
dualizing Le Cam’s two-point lower bound (Theorem 31.1) leads to optimal minimax upper bound;
see Exercise VI.17.
This result shows that for virtually all problems encountered in practice, the minimax risk coin-
cides with the least favorable Bayes risk. At the heart of any minimax theorem, there is an
i i
i i
i i
application of the separating hyperplane theorem. Below we give a proof of a special case
illustrating this type of argument.
Proof. The first case directly follows from the duality interpretation in Section 28.3.3 and the
fact that strong duality holds for finite-dimensional linear programming (see for example [376,
Sec. 7.4].
For the second case, we start by showing that if R∗ = ∞, then R∗Bayes = ∞. To see this, consider
the uniform prior π on Θ. Then for any estimator θ̂, there exists θ ∈ Θ such that R(θ, θ̂) = ∞.
Then Rπ (θ̂) ≥ |Θ|
1
R(θ, θ̂) = ∞.
Next we assume that R∗ < ∞. Then R∗ ∈ R since ℓ is bounded from below (say, by a) by
assumption. Given an estimator θ̂, denote its risk vector R(θ̂) = (Rθ (θ̂))θ∈Θ . Then its average risk
P
with respect to a prior π is given by the inner product hR(θ̂), π i = θ∈Θ π θ Rθ (θ̂). Define
S = {R(θ̂) ∈ RΘ : θ̂ is a randomized estimator} = set of all possible risk vectors,
T = {t ∈ RΘ : tθ < R∗ , θ ∈ Θ}.
Note that both S and T are convex (why?) subsets of Euclidean space RΘ and S∩T = ∅ by definition
of R∗ . By the separation hyperplane theorem, there exists a non-zero π ∈ RΘ and c ∈ R, such
that infs∈S hπ , si ≥ c ≥ supt∈T hπ , ti. Obviously, π must be componentwise positive, for otherwise
supt∈T hπ , ti = ∞. Therefore by normalization we may assume that π is a probability vector, i.e.,
a prior on Θ. Then R∗Bayes ≥ R∗π = infs∈S hπ , si ≥ supt∈T hπ , ti ≥ R∗ , completing the proof.
Clearly, n 7→ R∗n (Θ) is non-increasing since we can always discard the extra observations.
Typically, when Θ is a fixed subset of Rd , R∗n (Θ) vanishes as n → ∞. Thus a natural question is
i i
i i
i i
568
at what rate R∗n converges to zero. Equivalently, one can consider the sample complexity, namely,
the minimum sample size to attain a prescribed error ϵ even in the worst case:
In the classical large-sample asymptotics (Chapter 29), the rate of convergence for the quadratic
risk is usually Θ( 1n ), which is commonly referred to as the “parametric rate“. In comparison, in this
book we focus on understanding the dependency on the dimension and other structural parameters
nonasymptotically.
As a concrete example, let us revisit the GLM in Section 28.2 with sample size n, in which case
i.i.d.
we observe X = (X1 , . . . , Xn ) ∼ N (0, σ 2 Id ), θ ∈ Rd . In this case, the minimax quadratic risk is1
dσ 2
R∗n = . (28.17)
n
To see this, note that in this case X̄ = n1 (X1 + . . . + Xn ) is a sufficient statistic (cf. Section 3.5) of X
2
for θ. Therefore the model reduces to X̄ ∼ N (θ, σn Id ) and (28.17) follows from the minimax risk
(28.11) for a single observation.
2
From (28.17), we conclude that the sample complexity is n∗ (ϵ) = d dσϵ e, which grows linearly
with the dimension d. This is the common wisdom that “sample complexity scales proportionally
to the number of parameters”, also known as “counting the degrees of freedom”. Indeed in high
dimensions we typically expect the sample complexity to grow with the ambient dimension; how-
ever, the exact dependency need not be linear as it depends on the loss function and the objective
of estimation. For example, consider the matrix case θ ∈ Rd×d with n independent observations
in Gaussian noise. Let ϵ be a small constant. Then we have
2
• For quadratic loss, namely, kθ − θ̂k2F , we have R∗n = dn and hence n∗ (ϵ) = Θ(d2 );
• If the loss function is kθ − θ̂k2op , then R∗n dn and hence n∗ (ϵ) = Θ(d) (Example 28.4);
• As opposed to θ itself, suppose we are content with p estimating only the scalar functional θmax =
∗
max{θ1 , . . . , θd } up to accuracy ϵ, then n (ϵ) = Θ( log d) (Exercise VI.14).
In the last two examples, the sample complexity scales sublinearly with the dimension.
1
See Exercise VI.11 for an extension of this result to nonparametric location models.
i i
i i
i i
X
d
ℓ(θ, θ̂) ≜ ℓi (θi , θ̂i ), ∀θ, θ̂ ∈ Θ.
i=1
In this model, the observation X = (X1 , . . . , Xd ) consists of independent (not identically dis-
ind
tributed) Xi ∼ Pθi and the loss function takes a separable form, which is reminiscent of separable
distortion function in (24.8). This should be contrasted with the multiple-observation model in
(28.14), in which n iid observations drawn from the same distribution are given.
The minimax risk of the tensorized experiment is related to the minimax risk R∗ (Pi ) and worst-
case Bayes risks R∗Bayes (Pi ) ≜ supπ i ∈P(Θi ) Rπ i (Pi ) of each individual experiment as follows:
Consequently, if minimax theorem holds for each experiment, i.e., R∗ (Pi ) = R∗Bayes (Pi ), then it
also holds for the product experiment and, in particular,
X
d
∗
R (P) = R∗ (Pi ). (28.19)
i=1
Proof. The right inequality of (28.18) simply follows by separately estimating θi on the basis
of Xi , namely, θ̂ = (θ̂1 , . . . , θ̂d ), where θ̂i depends only on Xi . For the left inequality, consider
Qd
a product prior π = i=1 π i , under which θi ’s are independent and so are Xi ’s. Consider any
randomized estimator θ̂i = θ̂i (X, Ui ) of θi based on X, where Ui is some auxiliary randomness
independent of X. We can rewrite it as θ̂i = θ̂i (Xi , Ũi ), where Ũi = (X\i , Ui ) ⊥ ⊥ Xi . Thus θ̂i can
be viewed as it a randomized estimator based on Xi alone and its the average risk must satisfy
Rπ i (θ̂i ) = E[ℓ(θi , θ̂i )] ≥ R∗π i . Summing over i and taking the suprema over priors π i ’s yields the
left inequality of (28.18).
As an example, we note that the unstructured d-dimensional GLM {N (θ, σ 2 Id ) : θ ∈ Rd } with
quadratic loss is simply the d-fold tensor product of the one-dimensional GLM. Since minimax
theorem holds for the GLM (cf. Section 28.3.4), Theorem 28.3 shows the minimax risks sum up to
σ 2 d, which agrees with Example 28.2. In general, however, it is possible that the minimax risk of
the tensorized experiment is less than the sum of individual minimax risks and the right inequality
of (28.19) can be strict. This might appear surprising since Xi only carries information about θi
and it makes sense intuitively to estimate θi based solely on Xi . Nevertheless, the following is a
counterexample:
Remark 28.6 Consider X = θZ, where θ n∈ N, Zo∼ Ber( 21 ). The estimator θ̂ takes values in
N as well and the loss function is ℓ(θ, θ̂) = 1 θ̂ < θ , i.e., whoever guesses the greater number
wins. The minimax risk for this experiment is equal to P [Z = 0] = 21 . To see this, note that if
Z = 0, then all information about θ is erased. Therefore for any (randomized) estimator Pθ̂|X , the
risk is lower bounded by Rθ (θ̂) = P[θ̂ < θ] ≥ P[θ̂ < θ, Z = 0] = 21 P[θ̂ < θ|X = 0]. Therefore
i i
i i
i i
570
sending θ → ∞ yields supθ Rθ (θ̂) ≥ 12 . This is achievable by θ̂ = X. Clearly, this is a case where
minimax theorem does not hold, which is very similar to the previous Example 28.3.
nNext consider
o the tensoroproduct of two copies of this experiment with loss function ℓ(θ, θ̂) =
n
1 θ̂1 < θ1 + 1 θ̂2 < θ2 . We show that the minimax risk is strictly less than one. For i = 1, 2,
i.i.d.
let Xi = θi Zi , where Z1 , Z2 ∼ Ber( 21 ). Consider the following estimator
(
X1 ∨ X2 X1 > 0 or X2 > 0
θ̂1 = θ̂2 =
1 otherwise.
Proof. Note that N (θ, Id ) is a product distribution and the loss function is separable: kθ − θ̂kqq =
Pd
i=1 |θi − θ̂i | . Thus the experiment is a d-fold tensor product of the one-dimensional version.
q
By Theorem 28.3, it suffices to consider d = 1. The upper bound is achieved by the sample mean
Pn
X = 1n i=1 Xi ∼ N (θ, 1n ), which is a sufficient statistic.
For the lower bound, following Example 28.2, consider a Gaussian prior θ ∼ π = N (0, s).
Then the posterior distribution is also Gaussian: Pθ|X = N (E[θ|X], 1+ssn ). The following lemma
shows that the Bayes estimator is simply the conditional mean:
Lemma 28.5 Let Z ∼ N (0, 1). Then miny∈R E[|y + Z|q ] = E[|Z|q ].
Thus the Bayes risk is
s q/2
R∗π = E[|θ − E[θ|X]|q ] = E | Z| q .
1 + sn
Sending s → ∞ proves the matching lower bound.
where the inequality follows from the simple observation that for any a > 0, P [|y + Z| ≤ a] ≤
P [|Z| ≤ a], due to the symmetry and unimodality of the normal density.
i i
i i
i i
28.6 Log-concavity, Anderson’s lemma and exact minimax risk in GLM 571
Theorem 28.7 Consider the d-dimensional GLM where X1 , . . . , Xn ∼ N (0, Id ) are observed.
Let the loss function be ℓ(θ, θ̂) = ρ(θ − θ̂), where ρ : Rd → R+ is bowl-shaped and lower-
semicontinuous. Then the minimax risk is given by
Z
R∗ ≜ inf sup Eθ [ρ(θ − θ̂)] = Eρ √ , Z ∼ N (0, Id ).
θ̂ θ∈Rd n
Pn
Furthermore, the upper bound is attained by X̄ = 1n i=1 Xi .
Corollary 28.8 Let ρ(·) = k · kq for some q > 0, where k · k is an arbitrary norm on Rd . Then
EkZkq
R∗ = . (28.20)
nq/2
R∗ √d .
n
We can also phrase the result of Corollary 28.8 in terms of the sample complexity n∗ (ϵ) as
defined in (28.16). For example, for q = 2 we have n∗ (ϵ) = E[kZk2 ]/ϵ . The above examples
2
Another example is the multinomial model with squared error; cf. Exercises VI.7 and VI.9.
i i
i i
i i
572
show that the scaling of n∗ (ϵ) with dimension depends on the loss function and the “rule of thumb”
that the sampling complexity is proportional to the number of parameters need not always hold.
Finally, for the sake of high-probability (as opposed to average) risk bound, consider ρ(θ − θ̂) =
1{kθ − θ̂k > ϵ}, which is lower semicontinuous and bowl-shaped. Then the exact expression
√
R∗ = P kZk ≥ ϵ n . This result is stronger since the sample mean is optimal simultaneously for
all ϵ, so that integrating over ϵ recovers (28.20).
Proof of Theorem 28.7. We only prove the lower bound. We bound the minimax risk R∗ from
below by the Bayes risk R∗π with the prior π = N (0, sId ):
R∗ ≥ R∗π = inf Eπ [ρ(θ − θ̂)]
θ̂
= E inf E[ρ(θ − θ̂)|X]
θ̂
( a)
= E[E[ρ(θ − E[θ|X])|X]]
r
(b) s
=E ρ Z .
1 + sn
where (a) follows from the crucial Lemma 28.9 below; (b) uses the fact that θ − E[θ|X] ∼
N (0, 1+ssn Id ) under the Gaussian prior. Since ρ(·) is lower semicontinuous, sending s → ∞ and
applying Fatou’s lemma, we obtain the matching lower bound:
r
s Z
R∗ ≥ lim E ρ Z ≥E ρ √ .
s→∞ 1 + sn n
The following lemma establishes the conditional mean as the Bayes estimator under the
Gaussian prior for all bowl-shaped losses, extending the previous Lemma 28.5 in one dimension:
In order to prove Lemma 28.9, it suffices to consider ρ being indicator functions. This is done
in the next lemma, which we prove later.
Lemma 28.10 Let K ∈ Rd be a symmetric convex set and X ∼ N (0, Σ). Then maxy∈Rd P(X +
y ∈ K) = P(X ∈ K).
Proof of Lemma 28.9. Denote the sublevel set set Kc = {x ∈ Rd : ρ(x) ≤ c}. Since ρ is bowl-
shaped, Kc is convex and symmetric, which satisfies the conditions of Lemma 28.10. So,
Z ∞
E[ρ(y + x)] = P(ρ(y + x) > c)dc,
Z ∞
0
= (1 − P(y + x ∈ Kc ))dc,
0
i i
i i
i i
28.6 Log-concavity, Anderson’s lemma and exact minimax risk in GLM 573
Z ∞
≥ (1 − P(x ∈ Kc ))dc,
Z 0
∞
= P(ρ(x) ≥ c)dc,
0
= E[ρ(x)].
Before going into the proof of Lemma 28.10, we need the following definition.
The following result, due to Prékopa [350], characterizes the log-concavity of measures in terms
of that of its density function; see also [361] (or [179, Theorem 4.2]) for a proof.
Theorem 28.12 Suppose that μ has a density f with respect to the Lebesgue measure on Rd .
Then μ is log-concave if and only if f is log-concave.
• Lebesgue measure: Let μ = vol be the Lebesgue measure on Rd , which satisfies Theorem 28.12
(f ≡ 1). Then
• Gaussian distribution: Let μ = N (0, Σ), with a log-concave density f since log f(x) =
− p2 log(2π ) − 12 log det(Σ) − 21 x⊤ Σ−1 x is concave in x.
3
Applying (28.21) to A′ = vol(A)−1/d A, B′ = vol(B)−1/d B (both of which have unit volume), and
λ = vol(A)1/d /(vol(A)1/d + vol(B)1/d ) yields (28.22).
i i
i i
i i
574
i i
i i
i i
In this chapter we give an overview of the classical large-sample theory in the setting of iid obser-
vations in Section 28.4 focusing again on the minimax risk (28.15). These results pertain to smooth
parametric models in fixed dimensions, with the sole asymptotics being the sample size going to
infinity. The main result is that, under suitable conditions, the minimax squared error of estimating
i.i.d.
θ based on X1 , . . . , Xn ∼ Pθ satisfies
1 + o(1)
inf sup Eθ [kθ̂ − θk22 ] = sup TrJ− 1
F (θ). (29.1)
θ̂ θ∈Θ n θ∈Θ
where JF (θ) is the Fisher information matrix introduced in (2.32) in Chapter 2. This is asymptotic
characterization of the minimax risk with sharp constant. In later chapters, we will proceed to high
dimensions where such precise results are difficult and rare.
Throughout this chapter, we focus on the quadratic risk and assume that Θ is an open set of
the Euclidean space Rd . While reading this chapter, a reader is advised to consult Exercise VI.7 to
understand the minimax risk in a simple setting of estimating parameter of a Bernoulli model.
575
i i
i i
i i
576
Theorem 29.1 (HCR lower bound) The quadratic loss of any estimator θ̂ at θ ∈ Θ ⊂ Rd
satisfies
(Eθ [θ̂] − Eθ′ [θ̂])2
Rθ (θ̂) = Eθ [(θ̂ − θ)2 ] ≥ Varθ (θ̂) ≥ sup . (29.2)
θ ′ ̸=θ χ2 (Pθ′ kPθ )
• Note that the HCR lower bound Theorem 29.1 is based on the χ2 -divergence. For a version
based on Hellinger distance which also implies the CR lower bound, see Exercise VI.5.
• Both the HCR and the CR lower bounds extend to the multivariate case as follows. Let θ̂ be
an unbiased estimator of θ ∈ Θ ⊂ Rd . Assume that its covariance matrix Covθ (θ̂) = Eθ [(θ̂ −
θ)(θ̂ − θ)⊤ ] is positive definite. Fix a ∈ Rd . Applying Theorem 29.1 to ha, θ̂i, we get
h a, θ − θ ′ i 2
χ2 (Pθ kPθ′ ) ≥ .
a⊤ Covθ (θ̂)a
Optimizing over a yields1
χ2 (Pθ kPθ′ ) ≥ (θ − θ′ )⊤ Covθ (θ̂)−1 (θ − θ′ ).
Sending θ′ → θ and applying the asymptotic expansion χ2 (Pθ kPθ′ ) = (θ − θ′ )⊤ JF (θ)(θ −
θ′ )(1 + o(1)) (see Remark 7.13), we get the multivariate version of CR lower bound:
Covθ (θ̂) J− 1
F (θ). (29.5)
1 ⟨x,y⟩2
For Σ 0, supx̸=0 x⊤ Σx
= y⊤ Σ−1 y, attained at x = Σ−1 y.
i i
i i
i i
• For a sample of n iid observations, by the additivity property (2.36), the Fisher information
matrix is equal to nJF (θ). Taking the trace on both sides, we conclude that the squared error of
any unbiased estimators satisfies
1
Tr(J−
Eθ [kθ̂ − θk22 ] ≥
1
F (θ)).
n
This is already very close to (29.1), except for the fundamental restriction of unbiased
estimators.
Similar to (29.3), applying data processing and variational representation of χ2 -divergence yields
(EP [θ − θ̂] − EQ [θ − θ̂])2
χ2 (PθX kQθX ) ≥ χ2 (Pθθ̂ kQθθ̂ ) ≥ χ2 (Pθ−θ̂ kQθ−θ̂ ) ≥ .
VarQ (θ̂ − θ)
Note that by design, PX = QX and thus EP [θ̂] = EQ [θ̂]; on the other hand, EP [θ] = EQ [θ] + δ .
Furthermore, Eπ [(θ̂ − θ)2 ] ≥ VarQ (θ̂ − θ). Since this applies to any estimators, we conclude that
the Bayes risk R∗π (and hence the minimax risk) satisfies
δ2
R∗π ≜ inf Eπ [(θ̂ − θ)2 ] ≥ sup , (29.6)
θ̂ δ̸=0 χ2 (PXθ kQXθ )
which is referred to as the Bayesian HCR lower bound in comparison with (29.2).
Similar to the deduction of CR lower bound from the HCR, we can further lower bound
this supremum by evaluating the small-δ limit. First note the following chain rule for the
χ2 -divergence:
" 2 #
dPθ
χ (PXθ kQXθ ) = χ (Pθ kQθ ) + EQ χ (PX|θ kQX|θ ) ·
2 2 2
.
dQθ
i i
i i
i i
578
Under suitable regularity conditions in Theorem 7.22, again applying the local expansion of χ2 -
divergence yields
R π ′2
• χ2 (Pθ kQθ ) = χ2 (Tδ π kπ ) = (J(π ) + o(1))δ 2 , where J(π ) ≜ π is the Fisher information of
the prior (see Example 2.7);
• χ2 (PX|θ kQX|θ ) = [JF (θ) + o(1)]δ 2 .
δ2 δ2 s
R∗π ≥ sup δ 2 (n+ 1s )
= lim
δ 2 (n+ 1s )
= .
δ̸=0 e −1 δ→0 e −1 sn + 1
In view of the Bayes risk found in Example 28.1, we see that in this case the Bayesian HCR and
Bayesian Cramér-Rao lower bounds are exact.
Theorem 29.2 (BCR lower bound) Let π be a differentiable prior density on the interval
[θ0 , θ1 ] such that π (θ0 ) = π (θ1 ) = 0 and
Z θ1
π ′ (θ)2
J( π ) ≜ dθ < ∞. (29.8)
θ0 π (θ)
i i
i i
i i
Let Pθ (dx) = pθ (x) μ(dx), where the density pθ (x) is differentiable in θ for μ-almost every x.
Assume that for π-almost every θ,
Z
μ(dx)∂θ pθ (x) = 0. (29.9)
Then the Bayes quadratic risk R∗π ≜ infθ̂ E[(θ − θ̂)2 ] satisfies
1
R∗π ≥ . (29.10)
Eθ∼π [JF (θ)] + J(π )
Proof. In view of Remark 28.3, it loses no generality to assume that the estimator θ̂ = θ̂(X) is
deterministic. For each x, integration by parts yields
Z θ1 Z θ1
dθ(θ̂(x) − θ)∂θ (pθ (x)π (θ)) = pθ (x)π (θ)dθ.
θ0 θ0
Then
R∗π ≜ inf Eπ [kθ̂ − θk22 ] ≥ Tr((Eθ∼π [JF (θ)] + J(π ))−1 ), (29.12)
θ̂
where the Fisher information matrices are given by JF (θ) = Eθ [∇θ log pθ (X)∇θ log pθ (X)⊤ ] and
J(π ) = diag(J(π 1 ), . . . , J(π d )).
i i
i i
i i
580
where ek denotes the kth standard basis. Applying Cauchy-Schwarz and optimizing over u yield
h u , ek i 2
E[(θ̂k (X) − θk )2 ] ≥ sup = Σ− 1
kk ,
u̸=0 u⊤ Σ u
where Σ ≡ E[∇ log(pθ (X)π (θ))∇ log(pθ (X)π (θ))⊤ ] = Eθ∼π [JF (θ)] + J(π ), thanks to (29.11).
Summing over k completes the proof of (29.12).
• The above versions of the BCR bound assume a prior density that vanishes at the boundary.
If we choose a uniform prior, the same derivation leads to a similar lower bound known as
the Chernoff-Rubin-Stein inequality (see Ex. VI.4), which also suffices for proving the optimal
minimax lower bound in (29.1).
• For the purpose of the lower bound, it is advantageous to choose a prior density with the mini-
mum Fisher information. The optimal density with a compact support is known to be a squared
cosine density [219, 426]:
min J( g ) = π 2 ,
g on [−1,1]
attained by
πu
g(u) = cos2 . (29.13)
2
• Suppose the goal is to estimate a smooth functional T(θ) of the unknown parameter θ, where
T : Rd → Rs is differentiable with ∇T(θ) = ( ∂ T∂θi (θ)
j
) its s × d Jacobian matrix. Then under the
same condition of Theorem 29.3, we have the following Bayesian Cramér-Rao lower bound for
functional estimation:
As a consequence of the BCR bound, we prove the lower bound part for the asymptotic minimax
risk in (29.1).
Theorem 29.4 Assume that θ 7→ JF (θ) is continuous. Denote the minimax squared error
i.i.d.
R∗n ≜ infθ̂ supθ∈Θ Eθ [kθ̂ − θk22 ], where Eθ is taken over X1 , . . . , Xn ∼ Pθ . Then as n → ∞,
1 + o( 1)
R∗n ≥ sup TrJ− 1
F (θ). (29.15)
n θ∈Θ
Proof. Fix θ ∈ Θ. Then for all sufficiently small δ , B∞ (θ, δ) = θ + [−δ, δ]d ⊂ Θ. Let π i (θi ) =
1 θ−θi Qd
δ g( δ ), where g is the prior density in (29.13). Then the product distribution π = i=1 π i
satisfies the assumption of Theorem 29.3. By the scaling rule of Fisher information (see (2.35)),
2 2
J(π i ) = δ12 J(g) = δπ2 . Thus J(π ) = δπ2 Id .
i i
i i
i i
It is known that (see [68, Theorem 2, Appendix V]) the continuity of θ 7→ JF (θ) implies (29.11).
So we are ready to apply the BCR bound in Theorem 29.3. Lower bounding the minimax by the
Bayes risk and also applying the additivity property (2.36) of Fisher information, we obtain
− 1 !
∗ 1 π2
Rn ≥ · Tr Eθ∼π [JF (θ)] + 2 Id .
n nδ
Finally, choosing δ = n−1/4 and applying the continuity of JF (θ) in θ, the desired (29.15) follows.
Similarly, for estimating a smooth functional T(θ), applying (29.14) with the same argument
yields
1 + o( 1)
inf sup Eθ [kT̂ − T(θ)k22 ] ≥ sup Tr(∇T(θ)J− 1 ⊤
F (θ)∇T(θ) ). (29.16)
T̂ θ∈Θ n θ∈Θ
where
X
n
Lθ (Xn ) = log pθ (Xi )
i=1
is the total log-likelihood and pθ (x) = dP dμ (x) is the density of Pθ with respect to some com-
θ
mon dominating measure μ. For discrete distribution Pθ , the MLE can also be written as the KL
projection2 of the empirical distribution P̂n to the model class: θ̂MLE ∈ arg minθ∈Θ D(P̂n kPθ ).
2
Note that this is the reverse of the information projection studied in Section 15.3.
i i
i i
i i
582
The main intuition why MLE works is as follows. Assume that the model is identifiable, namely,
θ 7→ Pθ is injective. Then for any θ 6= θ0 , we have by positivity of the KL divergence (Theorem 2.3)
" n #
X pθ (Xi )
Eθ0 [Lθ − Lθ0 ] = Eθ0 log = −nD(Pθ0 ||Pθ ) < 0.
pθ0 (Xi )
i=1
In other words, Lθ − Lθ0 is an iid sum with a negative mean and thus negative with high probability
for large n. From here the consistency of MLE follows upon assuming appropriate regularity
conditions, among which is Wald’s integrability condition Eθ0 [sup∥θ−θ0 ∥≤ϵ log ppθθ (X)] < ∞ [449,
0
454].
Assuming more conditions one can obtain the asymptotic normality and efficiency of the
MLE. This follows from the local quadratic approximation of the log-likelihood function. Define
V(θ, x) ≜ ∇θ log pθ (x) (score) and H(θ, x) ≜ ∇2θ log pθ (x). By Taylor expansion,
! !
Xn
1 Xn
⊤ ⊤
Lθ =Lθ0 + (θ − θ0 ) V(θ0 , Xi ) + (θ − θ0 ) H(θ0 , Xi ) (θ − θ0 )
2
i=1 i=1
+ o(n(θ − θ0 ) ).
2
(29.18)
Recall from Section 2.6.2* that, under suitable regularity conditions, we have
Eθ0 [V(θ0 , X)] = 0, Eθ0 [V(θ0 , X)V(θ0 , X)⊤ ] = −Eθ0 [H(θ0 , X)] = JF (θ0 ).
Thus, by the Central Limit Theorem and the Weak Law of Large Numbers, we have
1 X 1X
n n
d P
√ V(θ0 , Xi )−
→N (0, JF (θ0 )), H(θ0 , Xi )−
→ − JF (θ0 ).
n n
i=1 i=1
Substituting these quantities into (29.18), we obtain the following stochastic approximation of the
log-likelihood:
p n
Lθ ≈ Lθ0 + h nJF (θ0 )Z, θ − θ0 i − (θ − θ0 )⊤ JF (θ0 )(θ − θ0 ),
2
where Z ∼ N (0, Id ). Maximizing the right-hand side yields:
1
θ̂MLE ≈ θ0 + √ JF (θ0 )−1/2 Z.
n
From this asymptotic normality, we can obtain Eθ0 [kθ̂MLE − θ0 k22 ] ≤ n1 (TrJF (θ0 )−1 + o(1)), and
for smooth functionals by Taylor expanding T at θ0 (delta method), Eθ0 [kT(θ̂MLE ) − T(θ0 )k22 ] ≤
−1 ⊤
n (Tr(∇T(θ0 )JF (θ0 ) ∇T(θ0 ) ) + o(1)), matching the information bounds (29.15) and (29.16).
1
Of course, the above heuristic derivation requires additional assumptions to justify (for example,
Cramér’s condition, cf. [168, Theorem 18] and [375, Theorem 7.63]). Even stronger assumptions
are needed to ensure the error is uniform in θ in order to achieve the minimax lower bound in
Theorem 29.4; see, e.g., Theorem 34.4 (and also Chapters 36-37) of [68] for the exact conditions
and statements. A more general and abstract theory of MLE and the attainment of information
bound were developed by Hájek and Le Cam; see [209, 273].
Despite its wide applicability and strong optimality properties, the methodology of MLE is not
without limitations. We conclude this section with some remarks along this line.
i i
i i
i i
• MLE may not exist even for simple parametric models. For example, consider X1 , . . . , Xn
drawn iid from the location-scale mixture of two Gaussians 21 N ( μ1 , σ12 ) + 12 N ( μ2 , σ22 ), where
( μ1 , μ2 , σ1 , σ2 ) are unknown parameters. Then the likelihood can be made arbitrarily large by
setting for example μ1 = X1 and σ1 → 0.
• MLE may be inconsistent; see [375, Example 7.61] and [167] for examples, both in one-
dimensional parametric family.
• In high dimensions, it is possible that MLE fails to achieve the minimax rate (Exercise VI.15).
Theorem 29.5 Fox fixed k, the minimax squared error of estimating P satisfies
b − Pk22 ] = 1 k−1
R∗sq (k, n) ≜ inf sup E[kP + o( 1) , n → ∞. (29.19)
b
P P∈Pk n k
diag(θ) − θθ⊤ −Pk θ
∇T(θ)J− 1
F (θ)∇T(θ)
⊤
=
−Pk θ⊤ Pk (1 − Pk ).
Pk Pk
So Tr(∇T(θ)J− 1 ⊤
F (θ)∇T(θ) ) = i=1 Pi (1 − Pi ) = 1 −
2
i=1 Pi , which achieves its maximum
1 − 1k at the uniform distribution. Applying the functional form of the BCR bound in (29.16), we
conclude R∗sq (k, n) ≥ 1n (1 − 1k + o(1)).
For the upper bound, consider the MLE, which in this case coincides with the empirical distri-
Pn
bution P̂ = (P̂i ) (Exercise VI.8). Note that nP̂i = j=1 1 {Xj = i} ∼ Bin(n, Pi ). Then for any P,
Pk
E[kP̂ − Pk22 ] = n1 i=1 Pi (1 − Pi ) ≤ n1 (1 − 1k ).
i i
i i
i i
584
−1/k
• In fact, for any k, n, we have the precise result: R∗sq (k, n) = (11+√ 2 – see Ex. VI.7h. This can be
n)
shown by considering a Dirichlet prior (13.16) and applying the corresponding Bayes estimator,
which is an additively-smoothed empirical distribution (Section 13.5).
• Note that R∗sq (k, n) does not grow with the alphabet size k; this is because squared loss is
too weak for estimating probability vectors. More meaningful loss functions include the f-
divergences in Chapter 7, such as the total variation, KL divergence, χ2 -divergence. These
minimax rates are worked out in Exercise VI.8 and Exercise VI.10, for both small and large
alphabets, and they indeed depend on the alphabet size k. For example, the minimax KL risk
satisfies Θ( nk ) for k ≤ n and grows as Θ(log nk ) for k n. This agrees with the rule of thumb
that consistent estimation requires the sample size to scale faster than the dimension.
As a final application, let us consider the classical problem of entropy estimation in information
theory and statistics [304, 128, 215], where the goal is to estimate the Shannon entropy, a non-
linear functional of P. The following result follows from the functional BCR lower bound (29.16)
and analyzing the MLE (in this case the empirical entropy) [39].
Theorem 29.6 For fixed k, the minimax quadratic risk of entropy estimation satisfies
b (X1 , . . . , Xn ) − H(P))2 ] = 1
R∗ent (k, n) ≜ inf sup E[(H max V(P) + o(1) , n→∞
b P∈Pk
H n P∈Pk
Pk
where H(P) = i=1 Pi log P1i = E[log P(1X) ] and V(P) = Var[log P(1X) ] are the Shannon entropy
and varentropy (cf. (10.4)) of P.
Let us analyze the result of Theorem 29.6 and see how it extends to large alphabets. It can be
2
shown that3 maxP∈Pk V(P) log2 k, which suggests that R∗ent ≡ R∗ent (k, n) may satisfy R∗ent logn k
even when the alphabet size k grows with n; however, this result only holds for sufficiently small
alphabet. In fact, back in Lemma 13.2 we have shown that for the empirical entropy which achieves
the bound in Theorem 29.6, its bias is on the order of nk , which is no longer negligible on large
alphabets. Using techniques of polynomial approximation [456, 233], one can reduce this bias to
n log k and further show that consistent entropy estimation is only possible if and only if n log k
k k
3
Indeed, maxP∈Pk V(P) ≤ log2 k for all k ≥ 3 [334, Eq. (464)]. For the lower bound, consider
P = ( 12 , 2(k−1)
1 1
, . . . 2(k−1) ).
i i
i i
i i
In this chapter we describe a strategy for proving statistical lower bound we call the Mutual Infor-
mation Method (MIM), which entails comparing the amount of information data provides with
the minimum amount of information needed to achieve a certain estimation accuracy. Similar to
Section 29.2, the main information-theoretical ingredient is the data-processing inequality, this
time for mutual information as opposed to f-divergences.
Here is the main idea of the MIM: Fix some prior π on Θ and we aim to lower bound the Bayes
risk R∗π of estimating θ ∼ π on the basis of X with respect to some loss function ℓ. Let θ̂ be an
estimator such that E[ℓ(θ, θ̂)] ≤ D. Then we have the Markov chain θ → X → θ̂. Applying the
data processing inequality (Theorem 3.7), we have
Note that
• The leftmost quantity can be interpreted as the minimum amount of information required to
achieve a given estimation accuracy. This is precisely the rate-distortion function ϕ(D) ≡ ϕθ (D)
(recall Section 24.3).
• The rightmost quantity can be interpreted as the amount of information provided by the data
about the latent parameter. Sometimes it suffices to further upper-bound it by the capacity of
the channel PX|θ by maximizing over all priors (Chapter 5):
Therefore, we arrive at the following lower bound on the Bayes and hence the minimax risks
The reasoning of the mutual information method is reminiscent of the converse proof for joint-
source channel coding in Section 26.3. As such, the argument here retains the flavor of “source-
channel separation”, in that the lower bound in (30.1) depends only on the prior (source) and
the loss function, while the capacity upper bound (30.2) depends only on the statistical model
(channel).
In the next few sections, we discuss a sequence of examples to illustrate the MIM and its
execution:
585
i i
i i
i i
586
• Denoising a vector in Gaussian noise, where we will compute the exact minimax risk;
• Denoising a sparse vector, where we determine the sharp minimax rate;
• Community detection, where the goal is to recover a dense subgraph planted in a bigger Erdös-
Rényi graph.
In the next chapter we will discuss three popular approaches for, namely, Le Cam’s method,
Assouad’s lemma, and Fano’s method. As illustrated in Figure 30.1, all three follow from the
Figure 30.1 The three lower bound techniques as consequences of the Mutual Information Method.
mutual information method, corresponding to different choice of prior π for θ, namely, the uni-
form distribution over a two-point set {θ0 , θ1 }, the hypercube {0, 1}d , and a packing (recall
Section 27.1). While these methods are highly useful in determining the minimax rate for many
problems, they are often loose with constant factors compared to the MIM. In the last section
of this chapter, we discuss the problem of how and when is non-trivial estimation achievable by
applying the MIM; for this purpose, none of the three methods in the next chapter works.
i i
i i
i i
Using the sufficiency of X̄ and the formula of Gaussian channel capacity (cf. Theorem 5.11 or
Theorem 20.11), the mutual information between the parameter and the data can be computed as
d
I(θ; X) = I(θ; X̄) = log(1 + sn).
2
It then follows from (30.3) that R∗π ≥ 1+sdsn , which in fact matches the exact Bayes risk in (28.7).
Sending s → ∞ we recover the result in (28.17), namely
d
R∗ ( R d ) = . (30.4)
n
In the above unconstrained GLM, we are able to compute everything in closed form when
applying the mutual information method. Such exact expressions are rarely available in more
complicated models in which case various bounds on the mutual information will prove useful.
Next, let us consider the GLM with bounded means, where the parameter space Θ = B(ρ) =
{θ : kθk2 ≤ ρ} is the ℓ2 -ball of radius ρ centered at zero. In this case there is no known closed-
form formula for the minimax quadratic risk even in one dimension.1 Nevertheless, the next result
determines the sharp minimax rate, which characterizes the minimax risk up to universal constant
factors.
Proof. The upper bound R∗ (B(ρ)) ≤ dn ∧ ρ2 follows from considering the estimator θ̂ = X̄ and
θ̂ = 0. To prove the lower bound, we apply the mutual information method with a uniform prior
θ ∼ Unif(B(r)), where r ∈ [0, ρ] is to be optimized. The mutual information can be upper bound
using the AWGN capacity as follows:
1 d nr2 nr2
I(θ; X) = I(θ; X̄) ≤ sup I(θ; θ + √ Z) = log 1 + ≤ , (30.6)
Pθ :E[∥θ∥2 ]≤r n 2 d 2
2
where Z ∼ N (0, Id ). Alternatively, we can use Corollary 5.8 to bound the capacity (as information
radius) by the KL diameter, which yields the same bound within constant factors:
1
I(θ; X) ≤ sup I(θ; θ + √ Z) ≤ max D(N (θ, Id /n)kN (θ, Id /n)k) = 2nr2 . (30.7)
Pθ :∥θ∥≤r n θ,θ ′ ∈ B( r)
1
It is known that there exists some ρ0 depending on d/n such that for all ρ ≤ ρ0 , the uniform prior over the sphere of
radius ρ is exactly least favorable (see [82] for d = 1 and [48] for d > 1.)
i i
i i
i i
588
For the lower bound, due to the lack of closed-form formula for the rate-distortion function
for uniform distribution over Euclidean balls, we apply the Shannon lower bound (SLB) from
Section 26.1. Since θ has an isotropic distribution, applying Theorem 26.3 yields
d 2πed d cr2
inf I(θ; θ̂) ≥ h(θ) + log ≥ log ,
Pθ̂|θ :E∥θ−θ̂∥2 ≤D 2 D 2 D
for some universal constant c, where the last inequality is because for θ ∼ Unif(B(r)), h(θ) =
log vol(B(r)) = d log r + log vol(B(1)) and the volume of a unit Euclidean ball in d dimensions
satisfies (recall (27.14)) vol(B(1))1/d √1d .
2 2
∗ 2 −nr /d 2
R∗ ≤ 2 , i.e., R ≥ cr e
Finally, applying (30.3) yields 12 log cr nr
. Optimizing over r and
−ax −a
using the fact that sup0<x<1 xe = ea if a ≥ 1 and e if a < 1, we have
1
d
R∗ ≥ sup cr2 e−nr /d
2
∧ ρ2 .
r∈[0,ρ] n
As a final example, let us consider a non-quadratic loss ℓ(θ, θ̂) = kθ − θ̂kr , the rth power of an
arbitrary norm on Rd . Recall that we have determined in Corollary 28.8 the exact minimax risk
using Anderson’s lemma, namely,
inf sup Eθ [kθ̂ − θkr ] = n−r/2 E[kZkr ], Z ∼ N (0, Id ). (30.8)
θ̂ θ∈Rd
In order to apply the mutual information method, consider again a Gaussian prior θ ∼ N (0, sId ).
Suppose that E[kθ̂ − θkr ] ≤ D. By the data processing inequality,
( d )
d d Dre r d
log(1 + ns) ≥ I(θ; X) ≥ I(θ; θ̂) ≥ log(2πes) − log V∥·∥ Γ 1+ ,
2 2 d r
where the last inequality follows from the general SLB (26.5). Rearranging terms and sending
s → ∞ yields
r/2 −r/d r
d 2πe d − r/ 2 − r/ d d
inf sup Eθ [kθ̂ − θk ] ≥
r
V∥·∥ Γ 1 + n V∥·∥ ≳ √ ,
θ̂ θ∈Rd re n r nE[kZk∗ ]
(30.9)
where the middle inequality applies Stirling’s approximation Γ(x)1/x x for x → ∞, and the
right applies Urysohn’s volume inequality (27.21), with kxk∗ = sup{hx, yi : kyk ≤ 1} denoting
the dual norm of k · k.
To evaluate the tightness of the lower bound from SLB in comparison with the exact result
P 1/q
d
(30.8), consider the example of r = 2 and the ℓq -norm kxkq = i=1 | x i | q
with 1 ≤ q ≤ ∞.
Recall the volume of a unit ℓq -ball given in (27.13). In the special case of q = 2, the (first) lower
bound in (30.9) is in fact exact and coincides with (30.4). For general q ∈ [1, ∞), (30.9) gives the
2/q
tight minimax rate d n ; however, for q = ∞, the minimax lower bound we get is 1/n, independent p
of the dimension d. In comparison, from (30.8) we get the sharp rate logn d , since EkZk∞ log d
(cf. Lemma 27.10). We will revisit this example in Section 31.4 and show how to obtain the optimal
dependency on the dimension.
i i
i i
i i
Remark 30.2 (SLB versus the volume method) Recall the connection between the rate-
distortion function and the metric entropy in Section 27.6. As we have seen in Section 27.2, a
common lower bound for metric entropy is via the volume bound. In fact, the SLB can be inter-
preted as a volume-based lower bound to the rate-distortion function. To see this, consider r = 1
and let θ be uniformly distributed over some compact set Θ, so that h(θ) = log vol(Θ) (Theo-
rem 2.7(a)). Applying Stirling’s approximation, the lower bound in (26.5) becomes log vol(vol (Θ)
B∥·∥ (cϵ))
for some constant c, which has the same form as the volume ratio in Theorem 27.3 for metric
entropy. We will see later in Section 31.4 that in statistical applications, applying SLB yields basi-
cally the same lower bound as applying Fano’s method to a packing obtained from the volume
bound, although SLB does not rely explicitly on a packing.
where kθk0 = |{i ∈ [d] : θi 6= 0}| is the number of nonzero entries of θ, indicating the sparsity of
θ. Our goal is to characterize the minimax quadratic risk
Next we prove an optimal lower bound applying MIM. (For a different proof using Fano’s method
in Section 31.4, see Exercise VI.12.)
Theorem 30.2
k ed
R∗n (B0 (k)) ≳ log . (30.10)
n k
which is equivalent to keeping the k entries from X̄ with the largest magnitude and setting the
rest to zero, or the following hard-thresholding estimator θ̂τ with an appropriately chosen τ (see
Exercise VI.13):
i i
i i
i i
590
• Sharp asymptotics: For sublinear sparsity k = o(d), we have R∗n (B0 (k)) = (2 + o(1)) nk log dk
(Exercise VI.13); for linear sparsity k = (η + o(1))d with η ∈ (0, 1), R∗n (B0 (k)) = (β(η) +
o(1))d for some constant β(η). For the latter and more refined results, we refer the reader to the
monograph [236, Chapter 8].
Proof. First, note that B0 (k) is a union of linear subspace of Rd and thus homogeneous. Therefore
by scaling, we have
1 ∗ 1
R∗n (B0 (k)) =
R (B0 (k)) ≜ R∗ (k, d). (30.13)
n 1 n
Thus it suffices to consider n = 1. Denote the observation by X = θ + Z.
Next, note that the following oracle lower bound:
R∗ (k, d) ≥ k,
which is the optimal risk given the extra information of the support of θ, in view of (30.4). Thus
to show (30.10), below it suffices to consider k ≤ d/4.
We now apply the mutual information method. Recall from (27.10) that Sdk denotes the
Hamming sphere, namely,
Sdk = {b ∈ {0, 1}d : wH (b) = k},
d
where wH (b) denotes
qthe Hamming weights of b. Let b be uniformly distributed over Sk and let
θ = τ b, where τ = log dk . Given any estimator θ̂ = θ̂(X), define an estimator b̂ ∈ {0, 1}d for b
by
(
0 θ̂i ≤ τ /2
b̂i = , i ∈ [d].
1 θ̂i > τ /2
i i
i i
i i
d d δk
≥ log − max H(b ⊕ b̂) = log − dh , (30.15)
k EwH (b⊕b̂)≤δ k k d
where the last step follows from Exercise I.7.
Combining the lower and upper bound on the mutual information and using dk ≥ ( dk )k , we
get dh( δdk ) ≥ 2k k log dk . Since h(p) ≤ −p log p + p for p ∈ [0, 1] and k/d ≤ 14 by assumption, we
conclude that δ ≥ ck/d for some absolute constant c, completing the proof of (30.10) in view of
(30.14).
Theorem 30.3 Assume that k/n is bounded away from one. If almost exact recovery is possible,
then
2 + o(1) n
d(pkq) ≥ log . (30.16)
k−1 k
Proof. Suppose Ĉ achieves almost exact recovery of C∗ . Let ξ ∗ , ξˆ ∈ {0, 1}n denote their indicator
vectors, respectively, for example, ξi∗ = 1 {i ∈ C∗ } for each i ∈ [n]. Then E[dH (ξ, ξ)]
ˆ = ϵn k for
some ϵn → 0. Applying the mutual information method as before, we have
( a) n ϵn k (b) n
∗ ˆ ∗
I(G; ξ ) ≥ I(ξ; ξ ) ≥ log − nh ≥ k log (1 + o(1)),
k n k
i i
i i
i i
592
where (a) follows in the same manner as (30.15) did from Exercise I.7; (b) is due to the assumption
that k/n ≤ 1 − c for some constant c.
On the other hand, the mutual information between the hidden community and the graph can
be upper bounded as:
(b) k
∗ ( a) ⊗(n2) ( c)
I(G; ξ ) = min D(PG|ξ∗ kQ|Pξ∗ ) ≤ D(PG|ξ∗ kBer(q) | Pξ ∗ ) = d(pkq),
Q 2
where (a) is by the variational representation of mutual information in Corollary 4.2; (b) follows
from choosing Q to be the distribution of the Erdös-Rényi graph ER(n, q); (c) is by the tensoriza-
tion property of KL divergence for product distributions (Theorem 2.16(d)). Combining the last
two displays completes the proof.
i.i.d.
Theorem 30.4 (Bounded GLM continued) Suppose X1 , . . . , Xn ∼ N (θ, Id ), where θ
belongs to B, the unit ℓ2 -ball in Rd . Then for some universal constant C0 ,
n+C0 d
e− d−1 ≤ inf sup Eθ [kθ̂ − θk2 ] ≤ .
θ̂ θ∈B d+n
Proof. Without loss of generality, assume that the observation is X = θ+ √Zn , where Z ∼ N (0, Id ).
For the upper bound, applying the shrinkage estimator2 θ̂ = 1+1d/n X yields E[kθ̂ − θk2 ] ≤ n+d d .
For the lower bound, we apply MIM as in the proof of Theorem 30.1 with the prior θ ∼
Unif(Sd−1 ). We still apply the AWGN capacity in (30.6) to get I(θ; X) ≤ n/2. (Here the
2
This corresponds to the Bayes estimator (Example 28.1) when we choose θ ∼ N (0, 1d Id ), which is approximately
concentrated on the unit sphere for large d.
i i
i i
i i
constant 1/2 is important and so the diameter-based bound (30.7) is too loose.) For the rate-
distortion function of spherical uniform distribution, applying Theorem 27.17 yields I(θ; θ̂) ≥
d−1
2 log E[∥θ̂−θ∥2 ] − C. Thus the lower bound on E[kθ̂ − θk ] follows from the data processing
1 2
inequality.
A similar phenomenon also occurs in the problem of estimating a discrete distribution P on k
elements based on n iid observations, which has been studied in Section 29.4 for small alphabet in
the large-sample asymptotics and extended in Exercise VI.7–VI.10 to large alphabets. In particular,
consider the total variation loss, which is at most one. Ex. VI.10f shows that the TV error of any
estimator is 1 − o(1) if n k; conversely, Ex. VI.10b demonstrates an estimator P̂ such that
E[χ2 (PkP̂)] ≤ nk− 1 2
+1 . Applying the joint range (7.32) between TV and χ and Jensen’s inequality,
we have
q
1 k− 1 n ≥ k − 2
E[TV(P, P̂)] ≤ 2 n+1
k− 1 n≤k−2
k+n
which is bounded away from one whenever n = Ω(k). In summary, non-trivial estimation in total
variation is possible if and only if n scales at least proportionally with k.
Finally, let us mention the problem of correlated recovery in the stochastic block model
(cf. Exercise I.49), which refers to estimating the community labels better than chance. The
optimal information-theoretic threshold of this problem can be established by bounding the
appropriate mutual information; see Section 33.9 for the Gaussian version (spiked Wigner model).
i i
i i
i i
In this chapter we study three commonly used techniques for proving minimax lower bounds,
namely, Le Cam’s method, Assouad’s lemma, and Fano’s method. Compared to the results in
Chapter 29 geared towards large-sample asymptotics in smooth parametric models, the approach
here is more generic, less tied to mean-squared error, and applicable in nonasymptotic settings
such as nonparametric or high-dimensional problems.
The common rationale of all three methods is reducing statistical estimation to hypothesis test-
ing. Specifically, to lower bound the minimax risk R∗ (Θ) for the parameter space Θ, the first step
is to notice that R∗ (Θ) ≥ R∗ (Θ′ ) for any subcollection Θ′ ⊂ Θ, and Le Cam, Assouad, and Fano’s
methods amount to choosing Θ′ to be a two-point set, a hypercube, or a packing, respectively. In
particular, Le Cam’s method reduces the estimation problem to binary hypothesis testing. This
method is perhaps the easiest to evaluate; however, the disadvantage is that it is frequently loose
in estimating high-dimensional parameters. To capture the correct dependency on the dimension,
both Assouad’s and Fano’s method rely on reduction to testing multiple hypotheses.
As illustrated in Figure 30.1, all three methods in fact follow from the common principle of
the mutual information method (MIM) in Chapter 30, corresponding to different choice of priors.
The limitation of these methods, compared to the MIM, is that, due to the looseness in constant
factors, they are ineffective for certain problems such as estimation better than chance discussed
in Section 30.4.
Then
ℓ(θ0 , θ1 )
inf sup Eθ ℓ(θ, θ̂) ≥ sup (1 − TV(Pθ0 , Pθ1 )) (31.2)
θ̂ θ∈Θ θ0 ,θ1 ∈Θ 2α
594
i i
i i
i i
Proof. Fix θ0 , θ1 ∈ Θ. Given any estimator θ̂, let us convert it into the following (randomized)
test:
θ0 with probability ℓ(θ1 ,θ̂)
,
ℓ(θ0 ,θ̂)+ℓ(θ1 ,θ̂)
θ̃ =
θ1 with probability ℓ(θ 0 , θ̂)
.
ℓ(θ ,θ̂)+ℓ(θ ,θ̂) 0 1
and similarly for θ1 . Consider the prior π = 21 (δθ0 + δθ1 ) and let θ ∼ π. Taking expectation on
both sides yields the following lower bound on the Bayes risk:
ℓ(θ0 , θ1 ) ℓ(θ0 , θ1 )
Eπ [ℓ(θ̂, θ)] ≥ P θ̃ 6= θ ≥ (1 − TV(Pθ0 , Pθ1 ))
α 2α
where the last step follows from the minimum average probability of error in binary hypothesis
testing (Theorem 7.7).
Remark 31.1 As an example where the bound (31.2) is tight (up to constants), consider a
binary hypothesis testing problem with Θ = {θ0 , θ1 } and the Hamming loss ℓ(θ, θ̂) = 1{θ 6= θ̂},
where θ, θ̂ ∈ {θ0 , θ1 } and α = 1. Then the left side is the minimax probability of error, and the
right side is the optimal average probability of error (cf. (7.19)). These two quantities can coincide
(for example for Gaussian location model).
Another special case of interest is the quadratic loss ℓ(θ, θ̂) = kθ − θ̂k22 , where θ, θ̂ ∈ Rd , which
satisfies the α-triangle inequality with α = 2. In this case, the leading constant 41 in (31.2) makes
sense, because in the extreme case of TV = 0 where Pθ0 and Pθ1 cannot be distinguished, the best
estimate is simply θ0 +θ2 . In addition, the inequality (31.2) can be deduced based on properties of
1
f-divergences and their joint range (Chapter 7). To this end, abbreviate Pθi as Pi for i = 0, 1 and
consider the prior π = 21 (δθ0 + δθ1 ). Then the Bayes estimator (posterior mean) is θ0 dP 0 +θ1 dP1
dP0 +dP1 and
the Bayes risk is given by
Z
kθ0 − θ1 k2 dP0 dP1
R∗π =
2 dP0 + dP1
kθ0 − θ1 k2 kθ0 − θ1 k2
= (1 − LC(P0 , P1 )) ≥ (1 − TV(P0 , P1 )),
4 4
R 0 −dP1 )
2
where LC(P0 , P1 ) = (dP dP0 +dP1 is the Le Cam divergence defined in (7.7) and satisfies LC ≤ TV.
i i
i i
i i
596
where (a) follows from the shift and scale invariance of the total variation; in (b) c ≈ 0.083 is
some absolute constant, obtained by applying the formula TV(N (0, 1), N (s, 1)) = 2Φ( 2s ) − 1
from (7.40). On the other hand, we know from Example 28.2 that the minimax risk equals 1n , so
the two-point method is rate-optimal in this case.
In the above example, for two points separated by Θ( √1n ), the corresponding hypothesis cannot
be tested with vanishing probability of error so that the resulting estimation risk (say in squared
error) cannot be smaller than 1n . This convergence rate is commonly known as the “parametric
rate”, which we have studied in Chapter 29 for smooth parametric families focusing on the Fisher
information as the sharp constant. More generally, the 1n rate is not improvable for models with
locally quadratic behavior
(Recall that Theorem 7.23 gives a sufficient condition for this behavior.) Indeed, pick θ0 in the
interior of the parameter space and set θ1 = θ0 + √1n , so that H2 (Pθ0 , Pθ1 ) = Θ( 1n ) thanks to (31.4).
By Theorem 7.8, we have TV(P⊗ ⊗n
θ0 , Pθ1 ) ≤ 1 − c for some constant c and hence Theorem 31.1
n
yields the lower bound Ω(1/n) for the squared error. Furthermore, later we will show that the same
locally quadratic behavior in fact guarantees the achievability of the 1/n rate; see Corollary 32.12.
Example 31.2 As a different example, consider the family Unif(0, θ). Note that as opposed
to the quadratic behavior (31.4), we have
√
H2 (Unif(0, 1), Unif(0, 1 + t)) = 2(1 − 1/ 1 + t) t.
Thus an application of Theorem 31.1 yields an Ω(1/n2 ) lower bound. This rate is not achieved
by the empirical mean estimator (which only achieves 1/n rate), but by the maximum likelihood
estimator θ̂ = max{X1 , . . . , Xn }. Other types of behavior in t, and hence the rates of convergence,
can occur even in compactly supported location families – see Example 7.1.
The limitation of Le Cam’s two-point method is that it does not capture the correct dependency
on the dimensionality. To see this, let us revisit Example 31.1 for d dimensions.
Example 31.3 Consider the d-dimensional GLM in Corollary 28.8. Again, it is equivalent
to consider the reduced model {N (θ, 1n ) : θ ∈ Rd }. We know from Example 28.2 (see also
Theorem 28.4) that for quadratic risk ℓ(θ, θ̂) = kθ − θ̂k22 , the exact minimax risk is R∗ = nd for any
d and n. Let us compare this with the best two-point lower bound. Applying Theorem 31.1 with
α = 2,
1 1 1
R∗ ≥ sup kθ0 − θ1 k22 1 − TV N θ0 , Id , N θ1 , Id
θ0 ,θ1 ∈Rd 4 n n
1
= sup kθk22 {1 − TV (N (0, Id ) , N (θ, Id ))}
θ∈Rd 4n
1
= sup s2 (1 − TV(N (0, 1), N (s, 1))),
4n s>0
i i
i i
i i
where the second step applies the shift and scale invariance of the total variation; in the last step,
by rotational invariance of isotropic Gaussians, we can rotate the vector θ align with a coordinate
vector (say, e1 = (1, 0 . . . , 0)) which reduces the problem to one dimension, namely,
TV(N (0, Id ), N (θ, Id )) = TV(N (0, Id ), N (kθke1 , Id )
= TV(N (0, 1), N (kθk, 1)).
Comparing the above display with (31.3), we see that the best Le Cam two-point lower bound in
d dimensions coincide with that in one dimension.
Let us mention in passing that although Le Cam’s two-point method is typically suboptimal for
estimating a high-dimensional parameter θ, for functional estimation in high dimensions (e.g. esti-
mating a scalar functional T(θ)), Le Cam’s method is much more effective and sometimes even
optimal. The subtlety is that is that as opposed to testing a pair of simple hypotheses H0 : θ = θ0
versus H1 : θ = θ1 , we need to test H0 : T(θ) = t0 versus H1 : T(θ) = t1 , both of which are
composite hypotheses and require a sagacious choice of priors. See Exercise VI.14 for an example.
Theorem 31.2 (Assouad’s lemma) Assume that the loss function ℓ satisfies the α-triangle
inequality (31.1). Suppose Θ contains a subset Θ′ = {θb : b ∈ {0, 1}d } indexed by the hypercube,
such that ℓ(θb , θb′ ) ≥ β · dH (b, b′ ) for all b, b′ and some β > 0. Then
βd
inf sup Eθ ℓ(θ, θ̂) ≥ 1 − max TV(Pθb , Pθb′ ) (31.5)
θ̂ θ∈Θ 4α dH (b,b′ )=1
Proof. We lower bound the Bayes risk with respect to the uniform prior over Θ′ . Given any
estimator θ̂ = θ̂(X), define b̂ ∈ argmin ℓ(θ̂, θb ). Then for any b ∈ {0, 1}d ,
β dH (b̂, b) ≤ ℓ(θb̂ , θb ) ≤ α(ℓ(θb̂ , θ̂b ) + ℓ(θ̂, θb )) ≤ 2αℓ(θ̂, θb ).
β X
d
≥ (1 − TV(PX|bi =0 , PX|bi =1 )),
4α
i=1
i i
i i
i i
598
where the last step is again by Theorem 7.7, just like in the proof of Theorem 31.1. Each total
variation can be upper bounded as follows:
!
( a) 1 X 1 X (b)
TV(PX|bi =0 , PX|bi =1 ) = TV d − 1
Pθb , d−1 Pθb ≤ max TV(Pθb , Pθb′ )
2 2 dH (b,b′ )=1
b:bi =1 b:bi =0
where (a) follows from the Bayes rule, and (b) follows from the convexity of total variation
(Theorem 7.5). This completes the proof.
Example 31.4 Let us continue the discussion of the d-dimensional GLM in Example 31.3.
Consider the quadratic loss first. To apply Theorem 31.2, consider the hypercube θb = ϵb, where
b ∈ {0, 1}d . Then kθb − θb′ k22 = ϵ2 dH (b, b′ ). Applying Theorem 31.2 yields
∗ ϵ2 d 1 ′ 1
R ≥ 1− max TV N ϵb, Id , N ϵb , Id
4 b,b′ ∈{0,1}d ,dH (b,b′ )=1 n n
2
ϵ d 1 1
= 1 − TV N 0, , N ϵ, ,
4 n n
where the last step applies (7.11) for f-divergence between product distributions that only differ
in one coordinate. Setting ϵ = √1n and by the scale-invariance of TV, we get the desired R∗ ≳ dn .
Next, let’s consider the loss function kθb − θb′ k∞ . In the same setup, we only kθb − θb′ k∞ ≥
′ ∗ √1 , which does not depend on d. In fact, R∗
d dH (b, b ). Then Assouad’s lemma yields R ≳
ϵ
q n
log d
n as shown in Corollary 28.8. In the next section, we will discuss Fano’s method which can
resolve this deficiency.
Here τ ′ is related to τ by τ log 2 = h(τ ′ ). Thus, using the same “hypercube embedding b → θb ”,
the bound similar to (31.5) will follow once we can bound I(bd ; X) away from d log 2.
Can we use the pairwise total variation bound in (31.5) to do that? Yes! Notice that thanks to
the independence of bi ’s we have1
1
Equivalently, this also follows from the convexity of the mutual information in the channel (cf. Theorem 5.3).
i i
i i
i i
where in the last step we used the fact that whenever B ∼ Ber(1/2),
I(B; X) ≤ TV(PX|B=0 , PX|B=1 ) log 2 , (31.8)
which follows from (7.39) by noting that the mutual information is expressed as the Jensen-
Shannon divergence as 2I(B; X) = JS(PX|B=0 , PX|B=1 ). Combining (31.6) and (31.7), the mutual
information method implies the following version of the Assouad’s lemma: Under the assumption
of Theorem 31.2,
βd −1 (1 − t) log 2
inf sup Eθ ℓ(θ, θ̂) ≥ ·f max TV(Pθ , Pθ′ ) , f(t) ≜ h (31.9)
θ̂ θ∈Θ 4α dH (θ,θ ′ )=1 2
where h−1 : [0, log 2] → [0, 1/2] is the inverse of the binary entropy function. Note that (31.9) is
slightly weaker than (31.5). Nevertheless, as seen in Example 31.4, Assouad’s lemma is typically
applied when the pairwise total variation is bounded away from one by a constant, in which case
(31.9) and (31.5) differ by only a constant factor.
In all, we may summarize Assouad’s lemma as a convenient method for bounding I(bd ; X) away
from the full entropy (d bits) on the basis of distances between PX|bd corresponding to adjacent
bd ’s.
Theorem 31.3 Let d be a metric on Θ. Fix an estimator θ̂. For any T ⊂ Θ and ϵ > 0,
h ϵi C(T) + log 2
P d(θ, θ̂) ≥ ≥1− , (31.10)
2 log M(T, d, ϵ)
where C(T) ≜ sup I(θ; X) is the capacity of the channel from θ to X with input space T, with the
supremum taken over all distributions (priors) on T. Consequently,
ϵ r C(T) + log 2
inf sup Eθ [d(θ, θ̂) ] ≥ sup
r
1− , (31.11)
θ̂ θ∈Θ T⊂Θ,ϵ>0 2 log M(T, d, ϵ)
i i
i i
i i
600
I(θ; X) + log 2
P[θ 6= θ̃] ≥ 1 − .
log M
In applying Fano’s method, since it is often difficult to evaluate the capacity C(T), it is useful
to recall from Theorem 5.9 that C(T) coincides with the KL radius of the set of distributions
{Pθ : θ ∈ T}, namely, C(T) ≜ infQ supθ∈T D(Pθ kQ). As such, choosing any Q leads to an upper
bound on the capacity. As an application, we revisit the d-dimensional GLM in Corollary 28.8
under the ℓq -loss (1 ≤ q ≤ ∞), with the particular focus on the dependency on the dimension.
(For a different application in sparse setting see Exercise VI.12.)
Example 31.5 Consider GLM with sample size n, where Pθ = N (θ, Id )⊗n . Taking natural
logs here and below, we have
n
D(Pθ kPθ′ ) = kθ − θ′ k22 ;
2
in other words, KL-neighborhoods are ℓ2 -balls. As such, let us apply Theorem 31.3 to T = B2 (ρ)
2
for some ρ > 0 to be specified. Then C(T) ≤ supθ∈T D(Pθ kP0 ) = nρ2 . To bound the packing
number from below, we applying the volume bound in Theorem 27.3,
d
ρd vol(B2 ) cq ρd1/q
M(B2 (ρ), k · kq , ϵ) ≥ d ≥ √
ϵ vol(Bq ) ϵ d
for some
p constant cq ,c where the last step follows the volume formula (27.13) for ℓq -balls. Choosing
ρ = d/n and ϵ = eq2 ρd1/q−1/2 , an application of Theorem 31.3 yields the minimax lower bound
d1/q
Rq ≡ inf sup Eθ [kθ̂ − θkq ] ≥ Cq √ (31.12)
θ̂ θ∈Rd n
for some constant Cq depending on q. This is the same lower bound as that in (30.9) obtained via
the mutual information method plus the Shannon lower bound (which is also volume-based).
For any q ≥ 1, (31.12) is rate-optimal since we can apply the MLE θ̂ = X̄. (Note that at q = ∞,
pq = ∞, (31.12)
the constant Cq is still finite since vol(B∞ ) = 2d .) However, for the special case of
does not depend on the dimension at all, as opposed to the correct dependency log d shown in
Corollary 28.8. In fact, previously in Example 31.4 the application of Assouad’s lemma yields
the same suboptimal result. So is it possible to fix this looseness with Fano’s method? It turns out
that the answer is yes and the suboptimality is due to the volume bound on the metric entropy,
which, as we have seen in Section 27.3, can be ineffective if ϵ scales with dimension. Indeed, if
i i
i i
i i
q q
c log d
we apply the tight bound of M(B2 , k · k∞ , ϵ) in (27.18),2 with ϵ = and ρ = c′ logn d for
q n
• It is sometimes convenient to further bound the KL radius by the KL diameter, since C(T) ≤
diamKL (T) ≜ supθ,θ′ ∈T D(Pθ′ kPθ ) (cf. Corollary 5.8). This suffices for Example 31.5.
• In Theorem 31.3 we actually lower bound the global minimax risk by that restricted on a param-
eter subspace T ⊂ Θ for the purpose of controlling the mutual information, which is often
difficult to compute. For the GLM considered in Example 31.5, the KL divergence is propor-
tional to squared ℓ2 -distance and T is naturally chosen to be a Euclidean ball. For other models
such as the covariance model (Exercise VI.16) wherein the KL divergence is more complicated,
the KL neighborhood T needs to be chosen carefully. Later in Section 32.4 we will apply the
same Fano’s method to the infinite-dimensional problem of estimating smooth density.
2
In fact, in this case we can also choose the explicit packing {ϵe1 , . . . , ϵed }.
i i
i i
i i
So far our discussion on information-theoretic methods have been mostly focused on statistical
lower bounds (impossibility results), with matching upper bounds obtained on a case-by-case basis.
In this chapter, we will discuss three information-theoretic upper bounds for statistical estimation.
These three results apply to different loss functions and are obtained using completely different
means. However, they take on exactly the same form involving the appropriate metric entropy of
the model.
Specifically, suppose that we observe X1 , . . . , Xn drawn independently from a distribution Pθ for
some unknown parameter θ ∈ Θ, and the goal is to produce an estimate P̂ for the true distribution
Pθ . We have the following entropic minimax upper bounds:
Here N(P, ϵ) refers to the metric entropy (cf. Chapter 27) of the model class P = {Pθ : θ ∈ Θ}
under various distances, which we will formalize along the way.
In particular, we will see that these methods achieve minimax optimal rates for the classical
problem of density estimation under smoothness constraints. To place these results in the bigger
context, we remind that we have already discussed modern methods of density estimation based
on machine learning ideas (Examples 4.2 and 7.5). However, those methods, beautiful and empir-
ically successful, are not known to achieve optimality over any reasonable classes. The metric
entropy methods as above, though, could and should be used to derive fundamental limits for the
classes which are targeted by the machine learning methods. Thus, there is a rich field of modern
applications, which this chapter will hopefully welcome the reader to explore.
We note that there are other entropic upper bound for statistical estimation, notably, MLE and
other M-estimators. This require different type of metric entropy (bracketing entropy, which is
602
i i
i i
i i
akin to metric entropy under the sup norm) and the style of analysis is more related in spirit to the
theory of empirical processes (e.g. Dudley’s entropy integral (27.22)). We refer the readers to the
monographs [332, 429, 431] for details. In this chapter we focus on more information-theoretic
style results.
If the family has a common dominating measure μ, the problem is equivalent to estimate the
density pθ = dP dμ , commonly referred to as the problem of density estimation in the statistics
θ
literature.
Our objective is to prove the upper bound (32.1) for the minimax KL risk
R∗KL (n) ≜ inf sup Eθ D(Pθ kP̂), (32.4)
P̂ θ∈Θ
where the infimum is taken over all estimators P̂ = P̂(·|Xn ) which is a distribution on X ; in
other words, we allow improper estimates in the sense that P̂ can step outside the model class P .
Indeed, the construction we will use in this section (such as predictive density estimators (Bayes)
or their mixtures) need not be a member of P . Later we will see in Sections 32.2 and 32.3 that for
total variation and Hellinger loss we can always restrict to proper estimators;2 however these loss
functions are weaker than the KL divergence.
The main result of this section is the following.
1
Note the asymmetry in this loss function. Alternatively the loss D(P̂kP) is typically infinite in nonparametric settings,
because it is impossible to estimate the support of the true density exactly.
2
This is in fact a generic observation: Whenever the loss function satisfies an approximate triangle inequality, any improper
estimate can be converted to a proper one by its project on the model class whose risk is inflated by no more than a
constant factor.
i i
i i
i i
604
1
≤ inf ϵ2 + log NKL (P, ϵ) . (32.8)
ϵ>0 n+1
Conversely,
X
n
R∗KL (t) ≥ Cn+1 . (32.9)
t=0
Note that the capacity Cn is precisely the redundancy (13.10) which governs the minimax regret
in universal compression; the fact that it bounds the KL risk can be attributed to a generic relation
between individual and cumulative risks which we explain later in Section 32.1.4. As explained in
Chapter 13, it is in general difficult to compute the exact value of Cn even for models as simple as
Bernoulli (Pθ = Ber(θ)). This is where (32.8) comes in: one can use metric entropy and tools from
Chapter 27 to bound this capacity, leading to useful (and even optimal) risk bounds. We discuss
two types of applications of this result.
Infinite-dimensional models Similar to the results in Section 27.4, for nonparametric models
NKL (ϵ) typically grows super-polynomially in 1ϵ and, in turn, the capacity Cn grows super-
logarithmically. In fact, whenever we have Cn = nα polylog(n) for some α > 0 where (log n)c0 ≤
polylog(n) ≤ (log n)c1 for some absolute c0 , c1 , Theorem 32.1 shows the minimax KL rate satisfies
i i
i i
i i
which easily follows from combining (32.7) and (32.8) – see (32.27) for details. For concrete
examples, see Section 32.4 for the application of estimating smooth densities.
Next, we explain the intuition behind and the proof of Theorem 32.1.
i.i.d.
where θ ∼ π and (X1 , . . . , Xn+1 ) ∼ Pθ conditioned on θ. The Bayes estimator achieving this infi-
mum is given by P̂Bayes (·|xn ) = PXn+1 |Xn =xn . If each Pθ has a density pθ with respect to some
common dominating measure μ, the Bayes estimator has density:
R Qn+1
π (dθ) i=1 pθ (xi )
p̂Bayes (xn+1 |x ) = R
n
Qn . (32.12)
π (dθ) i=1 pθ (xi )
( a)
= EXn D(PXn+1 |θ kPXn+1 |Xn |Pθ|Xn )
= D(PXn+1 |θ kPXn+1 |Xn |Pθ,Xn )
(b)
= I(θ; Xn+1 |Xn ).
where (a) follows from the variational representation of mutual information (Theorem 4.1 and
Corollary 4.2); (b) invokes the definition of the conditional mutual information (Section 3.4) and
3
Throughout this chapter, we continue to use the conventional notation Pθ for a parametric family of distributions and use
π to stand for the distribution of θ.
i i
i i
i i
606
the fact that Xn → θ → Xn+1 forms a Markov chain, so that PXn+1 |θ,Xn = PXn+1 |θ . In addition, the
Bayes optimal estimator is given by PXn+1 |Xn .
Note that the operational meaning of I(θ; Xn+1 |Xn ) is the information provided by one extra
observation about θ having already obtained n observations. In most situations, since Xn will have
already allowed θ to be consistently estimated as n → ∞, the additional usefulness of Xn+1 is
vanishing. This is made precisely by the following result.
Proof. In view of the chain rule for mutual information (Theorem 3.7): I(θ; Xn+1 ) =
Pn+1 i−1
i=1 I(θ; Xi |X ), (32.13) follows from the monotonicity. To show the latter, let us consider
a “sampling channel” where the input is θ and the output is X sampled from Pθ . Let I(π )
denote the mutual information when the input distribution is π, which is a concave function in
π (Theorem 5.3). Then
where the inequality follows from Jensen’s inequality, since Pθ|Xn−1 is a mixture of Pθ|Xn .
Lemma 32.3 allows us to prove the converse bound (32.9): Fix any prior π. Since the minimax
risk dominates any Bayes risk (Theorem 28.1), in view of Lemma 32.2, we have
X
n X
n
R∗KL (t) ≥ I(θ; Xt+1 |Xt ) = I(θ; Xn+1 ).
t=0 t=0
Recall from (32.5) that Cn+1 = supπ ∈P(Θ) I(θ; Xn+1 ). Optimizing over the prior π yields (32.9).
Now suppose that the minimax theorem holds for (32.4), so that R∗KL = supπ ∈P(Θ) R∗KL,Bayes (π ).
Lemma 32.2 then allows us to express the minimax risk as the conditional mutual information
maximized over the prior π:
i i
i i
i i
taking the sample Xi of size i as the input. Taking their Cesàro mean results in the following
estimator operating on the full sample Xn :
1 X
n+1
P̂(·|Xn ) ≜ QXi |Xi−1 . (32.14)
n+1
i=1
Let us bound the worst-case KL risk of this estimator. Fix θ ∈ Θ and let Xn+1 be drawn
⊗(n+1)
independently from Pθ so that PXn+1 = Pθ . Taking expectations with this law, we have
" !#
1 X
n+1
Eθ [D(Pθ kP̂(·|Xn ))] = E D Pθ QXi |Xi−1
n+1
i=1
( a) 1 X
n+1
≤ D(Pθ kQXi |Xi−1 |PXi−1 )
n+1
i=1
(b) 1 ⊗(n+1)
= D(Pθ kQXn+1 ),
n+1
where (a) and (b) follows from the convexity (Theorem 5.1) and the chain rule for KL divergence
(Theorem 2.16(c)). Taking the supremum over θ ∈ Θ bounds the worst-case risk as
1 ⊗(n+1)
R∗KL (n) ≤ sup D(Pθ kQXn+1 ).
n + 1 θ∈Θ
Optimizing over the choice of QXn+1 , we obtain
1 ⊗(n+1) Cn+1
R∗KL (n) ≤ inf sup D(Pθ kQXn+1 ) = ,
n + 1 QXn+1 θ∈Θ n+1
where the last identity applies Theorem 5.9 of Kemperman, completing the proof of (32.7).
Furthermore, Theorem 5.9 asserts that the optimal QXn+1 exists and given uniquely by the capacity-
achieving output distribution P∗Xn+1 . Thus the above minimax upper bound can be attained by
taking the Cesàro average of P∗X1 , P∗X2 |X1 , . . . , P∗Xn+1 |Xn , namely,
1 X ∗
n+1
P̂∗ (·|Xn ) = PXi |Xi−1 . (32.15)
n+1
i=1
Note that in general this is an improper estimate as it steps outside the class P .
In the special case where the capacity-achieving input distribution π ∗ exists, the capacity-
achieving output distribution can be expressed as a mixture over product distributions as P∗Xn+1 =
R ∗ ⊗(n+1)
π (dθ)Pθ . Thus the estimator P̂∗ (·|Xn ) is in fact the average of Bayes estimators (32.12)
∗
under prior π for sample sizes ranging from 0 to n.
Finally, as will be made clear in the next section, in order to achieve the further upper bound
(32.8) in terms of the KL covering numbers, namely R∗KL (n) ≤ ϵ2 + n+1 1 log NKL (P, ϵ), it suffices to
choose the following QXn+1 as opposed to the exact capacity-achieving output distribution: Pick an
ϵ-KL cover Q1 , . . . , QN for P of size N = NKL (P, ϵ) and choose π to be the uniform distribution
PN ⊗(n+1)
and define QXn+1 = N1 j=1 Qj – this was the original construction in [464]. In this case,
i i
i i
i i
608
applying the Bayes rule (32.12), we see that the estimator is in fact a convex combination P̂(·|Xn ) =
PN
j=1 wj Qj of the centers Q1 , . . . , QN , with data-driven weights given by
Qi−1
1 X
n+1
t=1 Qj (Xt )
wj = PN Qi−1 .
n+1 Qj ( X t )
i=1 j=1 t=1
Again, except for the extraordinary case where P is convex and the centers Qj belong to P , the
estimate P̂(·|Xn ) is improper.
Proof. Fix ϵ and let N = NKL (Q, ϵ). Then there exist Q1 , . . . , QN that form an ϵ-KL cover, such
that for any a ∈ A there exists i(a) ∈ [N] such that D(PB|A=a kQi(a) ) ≤ ϵ2 . Fix any PA . Then
where the last inequality follows from that i(A) takes at most N values and, by applying
Theorem 4.1,
I(A; B|i(A)) ≤ D PB|A kQi(A) |Pi(A) ≤ ϵ2 .
For the lower bound, note that if C = ∞, then in view of the upper bound above, NKL (Q, ϵ) = ∞
for any ϵ and (32.16) holds with equality. If C < ∞, Theorem 5.9 shows that C is the KL radius of
Q, namely, there exists P∗B , such that C = supPA ∈P(A) D(PB|A kP∗B |PA ) = supx∈A D(PB|A kP∗B |PA ).
√
In other words, NKL (Q, C + δ) = 1 for any δ > 0. Sending δ → 0 proves the equality of
(32.16).
Next we specialize Theorem 32.4 to our statistical setting (32.5) where the input A is θ and the
output B is Xn ∼ Pθ . Recall that P = {Pθ : θ ∈ Θ}. Let Pn ≜ {P⊗
i.i.d.
θ : θ ∈ Θ}. By tensorization of
n
⊗n ⊗n
KL divergence (Theorem 2.16(d)), D(Pθ kPθ′ ) = nD(Pθ kPθ′ ). Thus
ϵ
NKL (Pn , ϵ) ≤ NKL P, √ .
n
i i
i i
i i
Combining this with Theorem 32.4, we obtain the following upper bound on the capacity Cn in
terms of the KL metric entropy of the (single-letter) family P :
Cn ≤ inf nϵ2 + log NKL (P, ϵ) . (32.17)
ϵ>0
Theorem 32.5 Let P = {Pθ : θ ∈ Θ} and MH (ϵ) ≡ M(P, H, ϵ) the Hellinger packing number
of the set P , cf. (27.2). Then Cn defined in (32.5) satisfies
log e 2
Cn ≥ min nϵ , log MH (ϵ) − log 2 (32.18)
2
Proof. The idea of the proof is simple. Given a packing θ1 , . . . , θM ∈ Θ with pairwise distances
2
H2 (Qi , Qj ) ≥ ϵ2 for i 6= j, where Qi ≡ Pθi , we know that one can test Q⊗ n ⊗n
i vs Qj with error e
− nϵ2
,
nϵ 2
cf. Theorem 7.8 and Theorem 32.8. Then by the union bound, if Me− 2 < 12 , we can distinguish
these M hypotheses with error < 12 . Let θ ∼ Unif(θ1 , . . . , θM ). Then from Fano’s inequality we
get I(θ; Xn ) ≳ log M.
To get sharper constants, though, we will proceed via the inequality shown in Ex. I.58. In the
notation of that exercise we take λ = 1/2 and from Definition 7.24 we get that
1
D1/2 (Qi , Qj ) = −2 log(1 − H2 (Qi , Qj )) ≥ H2 (Qi , Qj ) log e ≥ ϵ2 log e i 6= j .
2
By the tensorization property (7.79) for Rényi divergence, D1/2 (Q⊗ n ⊗n
i , Qj ) = nD1/2 (Qi , Qj ) and
we get by Ex. I.58
X
M
1 X
M
1 n n o
I(θ; Xn ) ≥ − log exp − D1/2 (Qi , Qj )
M M 2
i=1 j=1
X
M
( a) 1M − 1 − nϵ22 1
≥− log e +
M M M
i=1
XM
1 − nϵ2
2 1 − nϵ2
2 1
≥− log e + = − log e + ,
M M M
i=1
i i
i i
i i
610
where in (a) we used the fact that pairwise distances are all ≥ nϵ2 except when i = j. Finally, since
A + B ≤ min(A,B) we conclude the result.
1 1 2
Note that, from the joint range (7.33) that D(PkQ) ≥ H2 (P, Q), a different (weaker) lower
bound on the KL risk also follows from Section 32.2.4 below.
Next we proceed to the converse of Theorem 32.5. The KL and Hellinger covering numbers
always satisfy
We next show that, assuming that the class P has a finite radius in Rényi divergence, (32.19)
and hence the capacity bound in Theorem 32.5 are tight up to logarithmic factors. Later in Sec-
tion 32.4 we will apply these results to the class of smooth densities, which has a finite χ2 -radius
(by choosing the uniform distribution as the center).
Theorem 32.6 Suppose that the family P has a finite Dλ radius for some λ > 1, i.e.
Rλ (P) ≜ inf sup Dλ (PkU) < ∞ , (32.20)
U P∈P
where Dλ is the Rényi divergence of order λ (see Definition 7.24). Then there exist ϵ0 and c
depending only on λ and Rλ , such that for all ϵ ≤ ϵ0 ,
r !
1
NKL P, cϵ log ≤ NH (ϵ) (32.21)
ϵ
and, consequently,
1
Cn ≤ inf 2
cnϵ log + log NH (ϵ) . (32.22)
ϵ≤ϵ0 ϵ
Proof. Let Q1 , . . . , QM be an ϵ-covering of P such that for any P ∈ P , there exists i ∈ [M] such
that H2 (P, Qi ) ≤ ϵ2 . Fix an arbitrary U and let Pi = ϵ2 U + (1 − ϵ2 )Qi . Applying Exercise I.59
yields
2λ 1
D(PkPi ) ≤ 24ϵ 2
log + Dλ (PkU) .
λ−1 ϵ
Optimizing over U to approach (32.20) proves (32.21). Finally, (32.22) follows from applying
(32.21) to (32.17).
• Instead of directly studying the risk R∗KL (n), (32.7) relates it to a cumulative risk Cn
• The cumulative risk turns out to be equal to a capacity, which can be conveniently bounded in
terms of covering numbers.
i i
i i
i i
In this subsection we want to point out that while the second step is very special to KL (log-loss),
the first idea is generic. Namely, we have the following relationship between individual risk (also
known as batch loss) and cumulative risk (also known as online loss), which were previously
introduced in Section 13.6 in the context of universal compression.
where both infima are over measurable (possibly randomized) estimators P̂t : X t−1 → P(X ), and
i.i.d.
the expectations are over Xi ∼ P and the randomness of the estimators. Then we have
X
n
nR∗n ≤ Cn ≤ Cn−1 + R∗n ≤ R∗t . (32.25)
t=1
Pn−1
Thus, if the sequence {R∗n } satisfies R∗n 1n t=0 R∗t then Cn nR∗n . Conversely, if nα− ≲ Cn ≲
nα+ for all n and some α+ ≥ α− > 0, then
α
(α− −1) α+
n − ≲ R∗n ≲ nα+ −1 . (32.26)
(In other words, Cn is bounded by using the Cn−1 -optimal online learner for first n − 1 rounds and
the R∗n -optimal batch learner for the last round.) The third inequality in (32.25) follows from the
second by induction and C1 = R∗1 .
To derive (32.26) notice that the upper bound on R∗n follows from (32.25). For the lower bound,
notice that the sequence R∗n is non-increasing and hence we have for any n < m
X
m−1 X
n−1
Ct
Cm ≤ R∗t ≤ + (m − n)R∗n . (32.27)
t
t=0 t=0
4
Note that for KL loss, Cn and R∗n coincide with AvgReg∗n and BatchReg∗n defined in (13.34) and (13.35), respectively.
i i
i i
i i
612
α+
For Hellinger loss, the answer is yes, although the metric entropy involved is with respect to
the Hellinger distance not KL divergence. The basic construction is due to Le Cam and further
developed by Birgé. The main idea is as follows: Fix an ϵ-covering {P1 , . . . , PN } of the set of
distributions P . Given n observations drawn from P ∈ P , let us test which ball P belongs to;
this allows us to estimate P up to Hellinger loss ϵ. This can be realized by a pairwise comparison
argument of testing the (composite) hypothesis P ∈ B(Pi , ϵ) versus P ∈ B(Pj , ϵ). This program
can be further refined to involve on the local entropy of the model.
The optimal test that achieves (32.28) is the likelihood ratio given by the worst-case mixtures, that
is, the closest5 pair of mixture (P∗n , Q∗n ) such that TV(P∗n , Q∗n ) = TV(co(P ⊗n ), co(Q⊗n )).
5
In case the closest pair does not exist, we can replace it by an infimizing sequence.
i i
i i
i i
The exact result (32.28) is unwieldy as the RHS involves finding the least favorable priors over
the n-fold product space. However, there are several known examples where much simpler and
explicit results are available. In the case when P and Q are TV-balls around P0 and Q0 , Huber [221]
showed that the minimax optimal test has the form
( n )
X dP0
n ′ ′′
ϕ(x ) = 1 min c , max c , log ( Xi ) >t .
dQ0
i=1
(See also Ex. III.31.) However, there are few other examples where minimax optimal tests are
known explicitly. Fortunately, as was shown by Le Cam, there is a general “single-letter” upper
bound in terms of the Hellinger separation between P and Q. It is the consequence of the more
general tensorization property of Rényi divergence in Proposition 7.25 (of which Hellinger is a
special case).
Theorem 32.8
≤ e− 2 H
n 2
(co(P),co(Q))
min sup P(ϕ = 1) + sup Q(ϕ = 0) , (32.29)
ϕ P∈P Q∈Q
Remark 32.2 For the case when P and Q are Hellinger balls of radius r around P0 and
Q0 (which arises in the proof of Theorem 32.9 below), respectively, Birgé [56] constructed an
explicit test.
nPNamely, under the assumption
o H(P0 , Q0 ) > 2.01r, there q is a test of the form
n n α+βψ(Xi ) −Ω(nr2 ) dP0
ϕ(x ) = 1 i=1 log β+αψ(Xi ) > t attaining error e , where ψ(x) = dQ 0
(x) and α, β > 0
depend only on H(P0 , Q0 ).
Remark 32.3 Here is an example where Theorem 32.8 is (very) loose. Consider P =
{Ber(1/2)} and Q = {Ber(0), Ber(1)}. Then co(P) ⊂ co(Q)and so the upper bound in (32.29) is
trivial. On the other hand, the test that declares P ∈ Q if we see all 0’s or all 0’s has exponentially
small probability of error.
In the sequel we will apply Theorem 32.8 to two disjoint Hellinger balls (both are convex).
i i
i i
i i
614
Theorem 32.9 (Le Cam-Birgé) Denote by NH (P, ϵ) the ϵ-covering number of the set P
under the Hellinger distance (cf. (27.1)). Let ϵn be such that
Then there exists an estimator P̂ = P̂(X1 , . . . , Xn ) taking values in P such that for any t ≥ 1,
and, consequently,
Proof of Theorem 32.9. It suffices to prove the high-probability bound (32.31). Abbreviate ϵ =
ϵn and N = NH (P, ϵn ). Let P1 , · · · , PN be a maximal ϵ-packing of P under the Hellinger distance,
which also serves as an ϵ-covering (cf. Theorem 27.2). Thus, ∀i 6= j,
H(Pi , Pj ) ≥ ϵ,
H(P, Pi ) ≤ ϵ,
Next, consider the following pairwise comparison problem, where we test two Hellinger balls
(composite hypothesis) against each other:
Hi : P ∈ B(Pi , ϵ) vs Hj : P ∈ B(Pj , ϵ)
i i
i i
i i
Since both B(Pi , ϵ) and B(Pj , ϵ) are convex, applying Theorem 32.8 yields a test ψij =
ψij (X1 , . . . , Xn ), with ψij = 0 corresponding to declaring P ∈ B(Pi , ϵ), and ψij = 1 corresponding
to declaring P ∈ B(Pj , ϵ), such that ψij = 1 − ψji and the following large deviations bound holds:
for all i, j, s.t. H(Pi , Pj ) ≥ δ ,
where we used the triangle inequality of Hellinger distance: for any P ∈ B(Pi , ϵ) and any Q ∈
B(Pj , ϵ),
Basically, Ti records the maximum distance from Pi to those Pj outside the δ -neighborhood of Pi
that is confusable with Pi given the present sample. Our density estimator is defined as
Now for the proof of correctness, assume that P ∈ B(P1 , ϵ). The intuition is that, we should
expect, typically, that T1 = 0, and furthermore, Tj ≥ δ 2 for all j such that H(P1 , Pj ) ≥ δ . Note
that by the definition of Ti and the symmetry of the Hellinger distance, for any pair i, j such that
H(Pi , Pj ) ≥ δ , we have
max{Ti , Tj } ≥ H(Pi , Pj ).
Consequently,
n o
H(P̂, P1 )1 H(P̂, P1 ) ≥ δ = H(Pi∗ , P1 )1 {H(Pi∗ , P1 ) ≥ δ}
≤ max{Ti∗ , T1 }1 {max{Ti∗ , T1 } ≥ δ} = T1 1 {T1 ≥ δ},
where the last equality follows from the definition of i∗ as a global minimizer in (32.34). Thus, for
any t ≥ 1,
i i
i i
i i
616
Theorem 32.10 (Le Cam-Birgé: local entropy version) Let ϵn be such that
nϵ2n ≥ log Nloc (P, ϵn ) ∨ 1. (32.38)
Then there exists an estimator P̂ = P̂(X1 , . . . , Xn ) taking values in P such that for any t ≥ 2,
sup P[H(P, P̂) > 4tϵn ] ≤ e−t
2
(32.39)
P∈P
and hence
sup EP [H2 (P, P̂)] ≲ ϵ2n (32.40)
P∈P
Remark 32.4 (Doubling dimension) Suppose that for some d > 0, log Nloc (P, ϵ) ≤ d log 1ϵ
holds for all sufficiently large small ϵ; this is the case for finite-dimensional models where the
Hellinger distance is comparable with the vector norm by the usual volume argument (Theo-
rem 27.3). Then we say the doubling dimension (also known as the Le Cam dimensionLe Cam
dimension|see doubling dimension [430]) of P is at most d; this terminology comes from the
fact that the local entropy concerns covering Hellinger balls using balls of half the radius. Then
Theorem 32.10 shows that it is possible to achieve the “parametric rate” O( dn ). In this sense, the
doubling dimension serves as the effective dimension of the model P .
Proof. We proceed by induction on k. The base case of k = 0 follows from the definition (32.37).
For k ≥ 1, assume that (32.41) holds for k − 1 for all P ∈ P . To prove it for k, we construct a cover
of B(P, 2k η) ∩ P as follows: first cover it with 2k−1 η -balls, then cover each ball with η/2-balls. By
the induction hypothesis, the total number of balls is at most
NH (B(P, 2k η) ∩ P, 2k−1 η) · sup NH (B(P′ , 2k−1 η) ∩ P, η/2) ≤ Nloc (ϵ) · Nloc (ϵ)k−1
P′ ∈P
i i
i i
i i
where (a) follows from from (32.33); (c) follows from the assumption that log Nloc ≤ nϵ2 and
k ≥ ℓ ≥ log2 t ≥ 1; (b) follows from the following reasoning: since {P1 , . . . , PN } is an ϵ-packing,
we have
|Gk | ≤ M(Ak , ϵ) ≤ N(Ak , ϵ/2) ≤ N(B(P1 , 2k+1 δ) ∩ P, ϵ/2) ≤ Nloc (ϵ)k+3
where the first and the last inequalities follow from Theorem 27.2 and Lemma 32.11 respectively.
As an application of Theorem 32.10, we show that parametric rate (namely, dimension divided
by the sample size) is achievable for models with locally quadratic behavior, such as those smooth
parametric models (cf. Section 7.11 and in particular Theorem 7.23).
Proof. It suffices to bound the local entropy Nloc (P, ϵ) in (32.37). Fix θ0 ∈ Θ. Indeed, for any
η > t0 , we have NH (B(Pθ0 , η) ∩ P, η/2) ≤ NH (P, t0 ) ≲ 1. For ϵ ≤ η ≤ t0 ,
( a)
NH (B(Pθ0 , η) ∩ P, η/2) ≤ N∥·∥ (B∥·∥ (θ0 , η/c), η/(2C))
d
vol(B∥·∥ (θ0 , η/c + η/(2C)))
(b) 2C
≤ = 1+
vol(B∥·∥ (θ0 , η/(2C))) c
i i
i i
i i
618
where (a) and (b) follow from (32.42) and Theorem 27.3 respectively. This shows that
log Nloc (P, ϵ) ≲ d, completing the proof by applying Theorem 32.10.
Lemma 32.13
M(ϵ/2)
≤ Mloc (ϵ) ≤ M(ϵ)
M(ϵ)
Proof. The upper bound is obvious. For the lower bound, Let P1 , . . . , PM be a maximal ϵ-packing
for P with M = M(ϵ). Let Q1 , . . . , QM′ be a maximal ϵ/2-packing for P with M′ = M(ϵ/2).
Partition E = {P1 , . . . , PM } into the Voronoi cells centered at each Qi , namely, Ei ≜ {Pj :
H(Pj , Qi ) = mink∈[M] H(Pk , Qi )} (with ties broken arbitrarily), so that E1 , . . . , EM′ are disjoint
and E = ∪i∈[M′ ] Ei . Thus max |Ei | ≥ M/M′ . Finally, note that each Ei ⊂ B(Qi , ϵ) because E is also
an ϵ-covering.
Note that unlike the definition of Nloc in (32.37) we are not taking the supremum over the scale
η ≥ ϵ. For this reason, we cannot generally apply Theorem 27.2 to conclude that Nloc (ϵ) ≥ Mloc (ϵ).
In all instances known to us we have log Nloc log Mloc , in which case the following general result
provides a minimax lower bound that matches the upper bound in Theorem 32.10 up to logarithmic
factors.
Theorem 32.14 Suppose that the Dλ radius Rλ (P) of the family P is finite for some λ > 1;
cf. (32.20). There exists constants c = c(λ) and ϵ < ϵ0 (λ) such that whenever n and ϵ < ϵ0 are
such that
2 1
c(λ)nϵ log 2 + Rλ (P) + 2 log 2 < log Mloc (ϵ), (32.43)
ϵ
i i
i i
i i
Proof. Let M = Mloc (P, ϵ). From the definition there exists an ϵ/2-packing P1 , . . . , PM in some
Hellinger ball B(R, ϵ).
i.i.d.
Let θ ∼ Unif([M]) and Xn ∼ Pθ conditioned on θ. Then from Fano’s inequality in the form
of Theorem 31.3 we get
ϵ 2 I(θ; Xn ) + log 2
sup E[H (P, P̂)] ≥
2
1−
P∈P 4 log M
It remains to show that
I(θ; Xn ) + log 2 1
≤ . (32.44)
log M 2
To that end for an arbitrary distribution U define
Q = ϵ2 U + ( 1 − ϵ2 )R .
We first notice that from Ex. I.59 we have that for all i ∈ [M]
λ 1
D(Pi kQ) ≤ 8(H (Pi , R) + 2ϵ )
2 2
log 2 + Dλ (Pi kU)
λ−1 ϵ
provided that ϵ < 2− 2(λ−1) ≜ ϵ0 . Since H2 (Pi , R) ≤ ϵ2 , by optimizing U (as the Dλ -center of P )
5λ
we obtain
λ 1 c(λ) 2 1
inf max D(Pi kQ) ≤ 24ϵ 2
log 2 + Rλ ≤ ϵ log 2 + Rλ .
U i∈[M] λ−1 ϵ 2 ϵ
By Theorem 4.1 we have
nc(λ) 2 1
I(θ; X ) ≤ n
max D(P⊗ n ⊗n
i kQ ) ≤ ϵ log 2 + Rλ .
i∈[M] 2 ϵ
This final bound and condition (32.43) then imply (32.44) and the statement of the theorem.
Finally, we mention that for sufficiently regular models wherein the KL divergence and the
squared Hellinger distances are comparable, the upper bound in Theorem 32.10 based on local
entropy gives the exact minimax rate. Models of this type include GLM and more generally
Gaussian mixture models with bounded centers in arbitrary dimensions [232].
Then
i i
i i
i i
620
Theorem 32.16 (Yatracos [465]) There exists a universal constant C such that the following
i.i.d.
holds. Let X1 , . . . , Xn ∼ P ∈ P , where P is a collection of distributions on a common measurable
space (X , E). For any ϵ > 0, there exists a proper estimator P̂ = P̂(X1 , . . . , Xn ) ∈ P , such that
1
sup EP [TV(P̂, P)2 ] ≤ C ϵ2 + log N(P, TV, ϵ) (32.45)
P∈P n
For loss function that is a distance, a natural idea for obtaining proper estimator is the minimum
distance estimator. In the current context, we compute the minimum-distance projection of the
empirical distribution on the model class P :6
Pmin-dist = argmin TV(P̂n , P)
P∈P
1
Pn
where P̂n = ni=1 δXi is the empirical distribution. However, since the empirical distribution is
discrete, this strategy does not make sense if elements of P have densities. The reason for this
degeneracy is because the total variation distance is too strong. The key idea is to replace TV,
which compares two distributions over all measurable sets, by a proxy, which only inspects a
“low-complexity” family of sets.
To this end, let A ⊂ E be a finite collection of measurable sets to be specified later. Define a
pseudo-distance
dist(P, Q) ≜ sup |P(A) − Q(A)|. (32.46)
A∈A
(Note that if A = E , then this is just TV.) One can verify that dist satisfies the triangle inequality.
As a result, the estimator
P̃ ≜ argmin dist(P, P̂n ), (32.47)
P∈P
as a minimizer, satisfies
dist(P̃, P) ≤ dist(P̃, P̂n ) + dist(P, P̂n ) ≤ 2dist(P, P̂n ). (32.48)
6
Here and below, if the minimizer does not exist, we can replace it by an infimizing sequence.
i i
i i
i i
In addition, applying the binomial tail bound and the union bound, we have
C0 log |A|
E[dist(P, P̂n )2 ] ≤ . (32.49)
n
for some absolute constant C0 .
The main idea of Yatracos [465] boils down to the following choice of A: Consider an
ϵ-covering {Q1 , . . . , QN } of P in TV. Define the set
dQi dQj
Aij ≜ x : ( x) ≥ ( x)
d( Qi + Qj ) d(Qi + Qj )
and the collection (known as the Yatracos class)
To see this, we only need to justify the upper bound. For any P, Q ∈ P , there exists i, j ∈ [N], such
that TV(P, Pi ) ≤ ϵ and TV(Q, Qj ) ≤ ϵ. By the key observation that dist(Qi , Qj ) = TV(Qi , Qj ), we
have
Finally, we analyze the estimator (32.47) with A given in (32.50). Applying (32.51) and (32.48)
yields
i i
i i
i i
622
Since 3TV(P, Q∗ ) ≤ 3ϵ + 3 minP′ ∈P TV(P, P′ ) we can see that the estimator also works for
“misspecified case”. Surprisingly, the multiplier 3 is not improvable if the estimator is required to
be proper (inside P ), cf. [70].
Capitalizing on the metric entropy of smooth densities studied in Section 27.4, we will prove
this result by applying the entropic upper bound in Theorem 32.1 and the minimax lower bound
based on Fano’s inequality in Theorem 31.3. However, Theorem 32.17 pertains to the L2 rather
than KL risk. This can be fixed by a simple reduction.
Lemma 32.18 Let F ′ denote the collection of f ∈ F which is bounded from below by 1/2.
Then
Proof. The left inequality follows because F ′ ⊂ F . For the right inequality, we apply a sim-
i.i.d.
ulation argument. Fix some f ∈ F and we observe X1 , . . . , Xn ∼ f. Let us sample U1 , . . . , Un
independently and uniformly from [0, 1]d . Define
(
Ui w.p. 12 ,
Zi =
Xi w.p. 12 .
i.i.d.
Then Z1 , . . . , Zn ∼ g = 12 (1 + f) ∈ F ′ . Let ĝ be an estimator that achieves the minimax risk
R∗L2 (n; F ′ ) on F ′ . Consider the estimator f̂ = 2ĝ − 1. Then kf − f̂k22 = 4kg − ĝk22 . Taking the
supremum over f ∈ F proves R∗L2 (n; F) ≤ 4R∗L2 (n; F ′ ).
i i
i i
i i
Lemma 32.18 allows us to focus on the subcollection F ′ , where each density is lower bounded
by 1/2. In addition, each β -smooth density in F is also upper bounded by an absolute constant.
Therefore, the KL divergence and squared L2 distance are in fact equivalent on F ′ , i.e.,
dQ
Lemma 32.19 Suppose both f = dP
dμ and g = dμ are upper and lower bounded by absolute
constants c and C respectively. Then
Z Z
1 1
dμ(f − g) ≤ H (P, Q) ≤ D(PkQ) ≤ χ (PkQ) ≤
2 2 2
dμ(f − g)2 .
4C c
R R
Proof. For the upper bound, applying (7.34), D(PkQ) ≤ χ2 (PkQ) = dμ (f−gg) ≤ dμ (f−gg) .
2 2
1
c
R R
For the lower bound, applying (7.33), D(PkQ) ≥ H2 (P, Q) = dμ √(f−g√) 2 ≥
2
1
4C dμ(f −
( f+ g)
g) 2 .
Proof. In view of Lemma 32.18, it suffices to consider R∗L2 (n; F ′ ). For the upper bound, we have
( a)
R∗L2 (n; F ′ ) R∗KL (n; F ′ )
(b)
1 ′
≲ inf ϵ + log NKL (F , ϵ)
2
ϵ>0 n
( c) 1 ′
inf ϵ + log N(F , k · k2 , ϵ)
2
ϵ>0 n
(d) 1 2β
inf ϵ2 + d/β n− 2β+d .
ϵ>0 nϵ
where both (a) and (c) apply (32.53), so that both the risk and the metric entropy are equivalent
for KL and L2 distance; (b) follows from Theorem 32.1; (d) applies the metric entropy (under L2 )
of the Lipschitz class from Theorem 27.14 and the fact that the metric entropy of the subclass F ′
is at most that of the full class F .
For the lower bound, we apply Fano’s inequality. Applying Theorem 27.14 and the rela-
tion between covering and packing numbers in Theorem 27.2, we have log N(F, k · k2 , ϵ)
log M(F, k · k2 , ϵ) ϵ−d/β . Fix ϵ to be specified and let f1 , . . . , fM be an ϵ-packing in F , where
M ≥ exp(Cϵ−d/β ). Then g1 , . . . , gM is an 2ϵ -packing in F ′ , with gi = (fi + 1)/2. Applying Fano’s
inequality in Theorem 31.3, we have
∗ ′ C′n
RL2 (n; F ) ≳ ϵ 1 −
2
(32.54)
log M
i i
i i
i i
624
i.i.d.
where C′n is the capacity (or KL radius) from f ∈ F ′ to X1 , . . . , Xn ∼ f. Using (32.17) and
Lemma 32.19, we have
C′n ≤ inf (nϵ2 + log NKL (F ′ , ϵ)) inf (nϵ2 + log N(F ′ , k · k2 , ϵ)) ≲ inf (nϵ2 + ϵ−d/β ) nd/(2β+d) .
ϵ>0 ϵ>0 ϵ>0
β
− 2β+
Thus choosing ϵ = cn d for sufficiently small c ensures C′n ≤ 1
2 log M and hence R∗L2 (n; F ′ ) ≳
2β
− 2β+
ϵ2 n d .
Remark 32.6 The above lower bound proof, based on Fano’s inequality and the intuition that
small mutual information implies large estimation error, requires us to upper bound the capacity
C′n of the subcollection F ′ . On the other hand, as hinted earlier in (32.11) (and shown next), the
C′
risk is expected to be proportional to nn , which suggests one should lower bound the capacity
using metric entropy. Indeed, this is possible: Applying Theorem 32.5,
C′n ≳ min{nϵ2 , log M(F ′ , H, ϵ)} − 2
min{nϵ2 , log M(F ′ , k · k2 , ϵ)} − 2
min{nϵ2 , ϵ−d/β } − 2 nd/(2β+d) ,
where we picked the same ϵ as in the previous proof. So C′n nd/(2β+d) . Finally, applying the
online-to-batch conversion (32.26) in Proposition 32.7 (or equivalently, combining (32.7) and
C′ 2β
(32.9)) yields R∗KL (n; F ′ ) nn n− 2β+d .
Remark 32.7 Note that the above proof of Theorem 32.17 relies on the entropic risk bound
(32.1), which, though rate-optimal, is not attained by a computationally efficient estimator. (The
same criticism also applies to (32.2) and (32.3) for Hellinger and total variation.) To remedy this,
for the squared loss, a classical idea is to apply the kernel density estimator (KDE) – cf. Section 7.9.
Pn
Specifically, one compute the convolution of the empirical distribution P̂n = 1n i=1 δXi with a
kernel function K(·) whose shape and bandwidth are chosen according to the smooth constraint.
For example, for Lipschitz densities, the optimal rate in Theorem 32.17 can be attained by a box
kernel K(·) = 2h1
1 {| · | ≤ h} with bandwidth h = n−1/3 (cf. e.g. [424, Sec. 1.2]).
The classical literature of density estimation is predominantly concerned with the L2 loss,
mainly due to the convenient quadratic nature of the loss function that allows bias-variance decom-
position and facilitates the analysis of KDE. However, L2 -distance between densities has no clear
operational meaning. Next we consider the three f-divergence losses introduced at the beginning
of this chapter. Paralleling Theorem 32.17, we have
β
R∗TV (n; F) ≜ inf sup E TV(f, f̂) n− 2β+d (32.55)
f̂ f∈F
β
R∗H2 (n; F) ≜ inf sup E H2 (f, f̂) n− β+d (32.56)
f̂ f∈F
β β
n− β+d ≲ R∗KL (n; F) ≜ inf sup E D(fkf̂) n− β+d (log n) β+d
d
(32.57)
f̂ f∈F
For TV loss, the upper bound follows from the L2 -rates in Theorem 32.17 and kf − f̂k1 ≤ kf − f̂k2
by Cauchy-Schwarz; alternatively, we can also apply Yatracos’ estimator from Theorem 32.16.
i i
i i
i i
The matching lower bound can be shown using the same argument based on Fano’s method as the
metric entropy under L1 -distance behaves the same (Theorem 27.14).
Recall that for L2 /L1 the rate is derived by considering a subclass F ′ , which has the same
estimation rate, but on which Lp H KL, cf. Lemma 32.18. It was thus, surprising, when
Birgé [54] found the Hellinger rate on the full family F to be different.
To derive the H2 result (32.56), first note that neither upper or a lower bound follow from the
2
generic comparison inequality H2 ≤ TV ≤ H in (7.22). Instead, what works is comparing entropy
numbers via the first of these inequalities. Specifically, we have
log N(F, H, ϵ) ≤ log N(F, TV, ϵ2 /2) ≲ ϵ− β ,
2d
(32.58)
where in the last step we invoked Theorem 27.14. Combining this with Le Cam-Birgé method
(Theorem 32.9) proves the upper bound part of (32.56).7
The lower bound follows from a similar argument as in the proof of Theorem 32.17, although
the construction is more involved. Below c0 , c1 , . . . are absolute constants. Fix a small ϵ and con-
sider the subcollection F ′ = {f ∈ F : f ≥ ϵ} of densities lower bounded by ϵ. We first construct a
Hellinger packing of F ′ . Applying the same argument in Lemma 32.13 yields an L2 -packing in an
L∞ -local ball: there exist f0 , f1 , . . . , fM ∈ F , such that kfi − fj k2 ≥ c0 ϵ for all i 6= j, kfi − f0 k∞ ≤ ϵ
for all i, and M ≥ M(F, k · k2 , c0 ϵ)/M(F, k · k∞ , ϵ) ≥ exp(c1 ϵ−d/β ), the last step applying The-
orem 32.17 for sufficiently small c0 . Let hi = fi − f0 and define fi by fi (x) = ϵ + hi (2x) for
x ∈ [0, 12 ]d and extend fi smoothly elsewhere so that it is a valid density in F ′ . Then f1 , . . . , fM
√ R (f −fj )2
form a Hellinger Ω( ϵ)-packing of F ′ , since H2 (fi , fj ) ≥ [0, 1 ]d √ i √ 2
≥ c2 ϵ. (This construc-
2 ( fi + fj )
tion also shows that the metric entropy bound (32.58) is tight.) It remains to bound the capacity
C′n of F ′ as a function of n and ϵ. Note that for any f, g ∈ F ′ , we have as in Lemma 32.19
D(fkg) ≤ χ2 (fkg) ≤ kf − gk22 /ϵ. Thus NKL (F ′ , δ 2 /ϵ) ≤ N(F ′ , k · k2 , δ). Applying (32.17) and
Lemma 32.19, C′n ≲ infδ>0 (nδ 2 /ϵ + δ −d/β ) (n/ϵ)d/(2β+d) . Applying Fano’s inequality as in
(32.54) yields an Ω(ϵ) lower bound in squared Hellinger, provided log M ≥ 2C′n . This is achieved
β
by choosing ϵ = c3 n− β+d , completing the proof of (32.56).
For KL loss, the lower bound of (32.57) follows from (32.56) because D ≥ H2 . For the upper
bound, applying (32.7) in Theorem 32.1, we have R∗KL (n; F) ≤ Cn+ n+1
1 , where Cn is the capacity
i.i.d.
(32.5) of the channel between f and Xn ∼ f ∈ F . This capacity can be bounded, in turn, using
Theorem 32.6 via the Hellinger entropy. Applying (32.58) in conjunction with (32.22), we obtain
Cn ≤ infϵ (nϵ2 log 1ϵ + ϵ−2d/β ) (n log n)d/(d+β) , proving the upper bound (32.57).8 To the best
of our knowledge, resolving the logarithmic gap in (32.57) remains open.
7
Comparing (32.56) with (32.52), we see that the Hellinger rate coincides with the L2 rate upon replacing the smoothness
parameter β by β/2. Note that Hellinger distance is the L2 between root-densities. For β ≤ 1, one can indeed show that
√
f is β/2-Hölder continuous, which explains the result in (32.56). However, this interpretation fails for general β. For
√
example, Glaeser [191] constructs an infinitely differentiable f such that f has points with arbitrarily large second
derivative.
8
This capacity bound is tight up to logarithmic factors. Note that the construction in the proof of the lower bound in (32.56)
shows that log M(F , H, ϵ) ≳ ϵ−2d/β , which, via Theorem 32.5, implies that Cn ≥ nd/(d+β) .
i i
i i
i i
In this chapter we explore statistical implications of the following effect. For any Markov chain
U→X→Y→V (33.1)
we know from the data-processing inequality (DPI, Theorem 3.7) that
I(U; Y) ≤ I(U; X), I(X; V) ≤ I(Y; V) .
However, something stronger can often be said. Namely, if the Markov chain (33.1) factor through
a known noisy channel PY|X : X → Y , then oftentimes we can prove strong data processing
inequalities (SDPI):
I(U; Y) ≤ η I(U; X), I(X; V) ≤ η (p) I(Y; V) ,
where coefficients η = η(PY|X ), η (p) = η (p) (PY|X ) < 1 only depend on the channel and not the
(generally unknown or very complex) PU,X or PY,V . The coefficients η and η (p) approach 0 for
channels that are very noisy (for example, η is always up to a constant factor equal to the Hellinger-
squared diameter of the channel).
The purpose of this chapter is twofold. First, we want to introduce general properties of the
SDPI coefficients. Second, we want to show how SDPIs help prove sharp lower (impossibility)
bounds on statistical estimation questions. The flavor of the statistical problems in this chapter
is different from the rest of the book in that here the information about unknown parameter θ
is either more “thinly spread” across a high dimensional vector of observations than in classical
X = θ + Z type of tasks (cf., spiked Wigner and tree-coloring examples), or distributed across
different terminals (as in correlation and mean estimation examples).
We point out that SDPIs are an area of current research and multiple topics are not covered by
our brief exposition here. For more, we recommend surveys [345] and [352], of which the latter
explores the functional-theoretic side of SDPIs and their close relation to logarithmic Sobolev
inequalities – a topic we entirely omit in our book.
626
i i
i i
i i
a a
OR a∨b AND a∧b a NOT a′
b b
Z Z Z
a a
OR ⊕ Y AND ⊕ Y a NOT ⊕ Y
b b
Figure 33.1 Basic building blocks of any boolean circuit. Top row shows the classical (Shannon) noiseless
gates. Bottom row shows noisy (von Neumann) gates. Here Z ∼ Ber(δ) is assumed to be independent of the
inputs.
the groundwork for the digital computers, and he was bothered by the following question. Since
real physical (and biological) networks are composed of imperfect elements, can we compute any
boolean function f if the constituent basis gates are in fact noisy? His model of the δ -noisy gate
(bottom row of Figure 33.1) is to take a primitive noiseless gate and apply a (mod 2) additive noise
to the output.
In this case, we have a network of the noisy gates, and such network necessarily has noisy (non-
deterministic) output. Therefore, when we say that a noisy gate circuit C computes f we require
the existence of some ϵ0 = ϵ0 (δ) (that cannot depend on f) such that
1
P[C(x1 , . . . , xn ) 6= f(x1 , . . . , xn ) ≤ − ϵ0 (33.2)
2
where C(x1 , . . . , xn ) is the output of the noisy circuit with inputs x1 , . . . , xn . If we build the circuit
according to the classical (Shannon) methods, we would obviously have catastrophic error accumu-
lation so that deep circuits necessarily have ϵ0 → 0. At the same time, von Neumann was bothered
by the fact that evidently our brains operate with very noisy gates and yet are able to carry very
long computations without mistakes. His thoughts culminated in the following ground-breaking
result.
Theorem 33.1 (von Neumann [443]) There exists δ ∗ > 0 such that for all δ < δ ∗ it is
possible to compute every boolean function f via δ -noisy 3-majority gates.
von Neumann’s original estimate δ ∗ ≈ 0.087 was subsequently improved by Pippenger. The
main (still open) question of this area is to find the largest δ ∗ for which the above theorem holds.
Condition in (33.2) implies the output should be correlated with the inputs. This requires the
mutual information between the inputs (if they are random) and the output to be greater than
zero. We now give a theorem of Evans and Schulman that gives an upper bound to the mutual
information between any of the inputs and the output. We will prove the theorem in Section 33.3
as a consequence of the more general directed information percolation theory.
Theorem 33.2 ([158]) Suppose an n-input noisy boolean circuit composed of gates with at
most K inputs and with noise components having at most δ probability of error. Then, the mutual
i i
i i
i i
628
X1 X2 X3 X4 X5 X6 X7 X8 X9
G1 G2 G3
G4 G5
G6
where di is the minimum length between Xi and Y (i.e, the minimum number of gates required to
be passed through until reaching Y).
Theorem 33.2 implies that noisy computation is only possible for δ < 12 − 2√1 K . This is the best
known threshold. To illustrate this result consider the circuit given on Figure 33.2. That circuit has
9 inputs and composed of gates with at most 3 inputs. The 3-input gates are G4 , G5 and G6 . The
minimum distance between X3 and Y is d3 = 2, and the minimum distance between X5 and Y is
d5 = 3. If Gi ’s are δ -noisy gates, we can invoke Theorem 33.2 between any input and the output.
Now, the main conceptual implication of Theorem 33.2 is in demonstrating that some cir-
cuits are not computable with δ -noisy gates unless δ is sufficiently small. For instance, take
f(X1 , . . . , Xn ) = XOR(X1 , . . . , Xn ). Note that function f depends essentially on every input Xi , since
XOR(X1 , . . . , Xn ) = XOR XOR(X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ), Xi . Thus, any circuit that ignores
any one of the inputs Xi will not be able to satisfy requirement (33.2). Since we are composing
log n
our circuit via K-input gates, this implies that there must exist at least one input Xi with di ≥ log K
(indeed, going from Y up we are to make K-ary choice at each gate and thus at height d we can at
most reach dK inputs). Now as n → ∞ we see from Theorem 33.2 that I(Xi ; Y) → 0 unless
∗ 1 1
δ ≤ δES = − √ .
2 2 K
∗
As we argued I(Xi ; Y) → 0 is incompatible with satisfying (33.2). Hence the value of δES gives
a (at present, the tightest) upper bound for the noise limit under which reliable computation with
K-input gates is possible.
Computation with formulas Note that the graph structure given in Figure 33.2 contains some
undirected loops. A formula is a type of boolean circuits that does not contain any undirected
loops unlike the case in Figure 33.2. In other words, for a formula the underlying graph structure
forms a tree. For example, removing one of the outputs of G2 of Figure 33.2 we obtain a formula
as given on Figure 33.3.
i i
i i
i i
X1 X2 X3 X4 X5 X6 X7 X8 X9
G1 G2 G3
G4 G5
G6
For computation with formulas much stronger results are available. For example, for any odd K,
the threshold is exactly known from [157, Theorem 1]. Specifically, it is shown there that we can
compute reliably any boolean function f that is represented with a formula compose of K-input
δ -noisy gates (with K odd) if δ < δf∗ , and no such computation is possible for δ > δf∗ , where
1 2K−1
δf∗ = − K−1
2 K K− 1
2
Since every formula is also a circuit, we of course have δf∗ < δES
∗
, so that there is no contradiction
with Theorem 33.2. However, comparing the thresholds gives us ability to appreciate tightness of
Theorem 33.2 for general boolean circuits. Indeed, for large K we have an approximation
p
∗ 1 π /2
δf ≈ − √ , K 1 ,
2 2 K
∗
whereas the estimate of Evans-Schulman δES ≈ 1
2 − 1
√
2 K
. We can thus see that it has at least the
right rate of convergence to 1/2 for large K.
Recall that the data-processing inequality (DPI) in Theorem 7.4 states that Df (PX kQX ) ≥
Df (PY kQY ). The concept of the Strong DPI introduced above quantifies the multiplicative decrease
between the two f-divergences.
i i
i i
i i
630
We note that in general ηf (PY|X ) is hard to compute. However, total variation is an exception.
This case is obvious. Take PX = δx0 and QX = δx′0 .1 Then from the definition of ηTV , we
have ηTV ≥ TV(PY|X=x0 , PY|X=x′0 ) for any x0 and x′0 , x0 6= x′0 .
• ηTV ≤ supx0 ̸=x′ TV(PY|X=x0 , PY|X=x′0 ):
0
Define η̃ ≜ supx0 ̸=x′ TV(PY|X=x0 , PY|X=x′0 ). We consider the discrete alphabet case for simplicity.
0
Fix any PX , QX and PY = PX ◦ PY|X , QY = QX ◦ PY|X . Observe that for any E ⊆ Y
Now suppose there are random variables X0 and X′0 having some marginals PX and QX respec-
tively. Consider any coupling π X0 ,X′0 with marginals PX and QX respectively. Then averaging
(33.4) and taking the supremum over E, we obtain
Now the left-hand side equals TV(PY , QY ) by Theorem 7.7(a). Taking the infimum over
couplings π the right-hand side evaluates to TV(PX , QX ) by Theorem 7.7(b).
Example 33.1 (ηTV of a Binary Symmetric Channel) The ηTV of the BSCδ is given by
Theorem 33.5
If (U; Y)
ηf (PY|X ) = sup .
PUX : U→X→Y f (U; X)
I
1
δx0 is the probability distribution with P(X = x0 ) = 1
i i
i i
i i
Recall that for any Markov chain U → X → Y, DPI states that If (U; Y) ≤ If (U; X) and Theorem
33.5 gives the stronger bound
Proof. First, notice that for any u0 , we have Df (PY|U=u0 kPY ) ≤ ηf Df (PX|U=u0 kPX ). Averaging the
above expression over any PU , we obtain
If (U; Y) ≤ ηf If (U; X)
Second, fix P̃X , Q̃X and let U ∼ Bern(λ) for some λ ∈ [0, 1]. Define the conditional distribution
PX|U as PX|U=1 = P̃X , PX|U=0 = Q̃X . Take λ → 0, then (see [345] for technical subtleties)
Theorem 33.6 In the statements below ηf (and others) corresponds to ηf (PY|X ) for some fixed
PY|X from X to Y .
where (recall β̄ ≜ 1 − β )
( 1 − x) 2
LCβ (PkQ) = Df (PkQ), f(x) = β̄β
β̄ x + β
is the Le Cam divergence of order β (recall (7.7) for β = 1/2).
(e) Consequently,
1 2 H4 (P0 , P1 )
H (P0 , P1 ) ≤ ηKL ≤ H2 (P0 , P1 ) − . (33.6)
2 4
(f) If a binary-input channel PY|X is also input-symmetric (or BMS, see Section 19.4*), then
ηKL (PY|X ) = Iχ2 (X; Y) for X ∼ Bern(1/2).
i i
i i
i i
632
(g) For any channel PY|X , the supremum in (33.3) can be restricted to PX , QX with a common
binary support. In other words, ηf (PY|X ) coincides with that of the least contractive binary
subchannel. Consequently, from (e) we conclude
1 diamH2
diamH2 ≤ ηKL (PY|X ) = diamLCmax ≤ diamH2 − ,
2 4
(in particular ηKL diamH2 ), where diamH2 (PY|X ) = supx,x′ ∈X H2 (PY|X=x , PY|X=x′ ),
diamLCmax = supx,x′ LCmax (PY|X=x , PY|x=x′ ) are the squared Hellinger and Le Cam diameters
of the channel.
Proof. Most proofs in full generality can be found in [345]. For (a) one first shows that ηf ≤ ηTV
for the so-called Eγ divergences corresponding to f(x) = |x − γ|+ − |1 − γ|+ , which is not hard to
believe since Eγ is piecewise linear. Then the general result follows from the fact that any convex
function f can be approximated (as N → ∞) in the form
X
N
aj |x − cj |+ + a0 x + c0 .
j=1
For (b) see [93, Theorem 1] and [97, Proposition II.6.13 and Corollary II.6.16]. The idea of this
proof is as follows:
• ηKL ≥ ηχ2 by restricting to local perturbations. Recall that KL divergence behaves locally as
χ2 – Proposition 2.21.
R∞
• Using the identity D(PkQ) = 0 χ2 (PkQt )dt where Qt = tP1+ Q
+t , we have
Z ∞ Z ∞
D(PY kQY ) = χ (PY kQY t )dt ≤ ηχ2
2
χ2 (PX kQX t )dt = ηχ2 D(PX kQX ).
0 0
I(U; Y) = I(U; Y|∆) ≤ Eδ∼P∆ [(1 − 2δ)2 I(U; X|∆ = δ) = E[(1 − 2∆)2 ]I(U; X),
i i
i i
i i
where we used the fact that I(U; X|∆ = δ) = I(U; X) and Example 33.2 below.
For (g) see Ex. VI.20.
Example 33.2 (Computing ηKL (BSCδ )) Consider p the BSCδ channel. In Example 33.1
we computed ηTV . Here we have diamH2 = 2 − 4 δ(1 − δ) and thus the bound (33.6) we get
ηKL ≤ (1 − 2δ)2 . On the other hand taking U = Ber(1/2) and PX|U = Ber(α) we get
I(U; Y) log 2 − h(α + (1 − 2α)δ) 1
ηKL ≥ = → (1 − 2δ)2 α→ .
I(U; X) log 2 − h(α) 2
Thus we have shown:
This example has the following consequence for the KL-divergence geometry.
Proposition 33.7 Consider any distributions P0 and P1 on X and let us consider the interval
in P(X ): Pλ = λP1 + (1 − λ)P0 for λ ∈ [0, 1]. Then divergence (with respect to the midpoint)
behaves subquadratically:
The same statement holds with D replaced by χ2 (and any other Df satisfying Theorem 33.6(b)).
Notice that for any metric d(P, Q) on P(X ) that is induced from the norm on the vector space
M(X ) of all signed measures (such as TV), we must necessarily have d(Pλ , P1−λ ) = |1 −
2λ|d(P0 , P1 ). Thus, the ηKL (BSCλ ) = (1 − 2λ)2 which yields the inequality is rather natural.
i i
i i
i i
634
whose configuration allows us to factorize the joint distribution over XV by Throughout the section,
we consider Shannon mutual information, i.e., f = x log x. Let us give a detailed example below.
Example 33.3 Suppose we have a graph G = (V, E) as follows.
B
X0 W
A
Then every node has a channel from its parents to itself, for example W corresponds to a noisy
channel PW|A,B , and we can define η ≜ ηKL (PW|A,B ). Now, prepend another random variable U ∼
Bern(λ) at the beginning, the new graph G′ = (V′ , E′ ) is shown below:
B
U X0 W
A
Recall that from chain rule we have I(U; B, W) = I(U; B) + I(U; W|B) ≥ I(U; B). Hence, if (33.7)
is correct, then η → 0 implies I(U; B, W) ≈ I(U; B) and symmetrically I(U; A, W) ≈ I(U; A).
Therefore for small δ , observing W, A or W, B does not give advantage over observing solely A or
B, respectively.
Observe that G′ forms a Markov chain U → X0 → (A, B) → W, which allows us to factorize
the joint distribution over E′ as
Now consider the joint distribution conditioned on B = b, i.e., PU,X0 ,A,W|B . We claim that the
conditional Markov chain U → X0 → A → W|B = b holds. Indeed, given B and A, X0 is
independent of W, that is PX0 |A,B PW|A,B = PX0 ,W|AB , from which follows the mentioned conditional
Markov chain. Using the conditional Markov chain, SDPI gives us for any b,
i i
i i
i i
B
R
X W
A
Under this model, for two subsets T, S ⊂ V we define perc[T → S] = P[∃ open path T → S].
Note that PXv |XPa(v) describe the stochastic recipe for producing Xv based on its parent variables.
We assume that in addition to a DAG we also have been given all these constituent channels (or
at least bounds on their ηKL coefficients).
Theorem 33.8 ([345]) Let G = (V, E) be a DAG and let 0 be a node with in-degree equal to
zero (i.e. a source node). Note that for any 0 63 S ⊂ V we can inductively stitch together constituent
channels PXv |XPa(v) and obtain PXS |X0 . Then we have
ηKL (PXS |X0 ) ≤ perc(0 → S). (33.10)
Proof. For convenience let us denote η(T) = ηKL (PXT |X0 ) and ηv = ηKL (PXv |XPa(v) ). The proof
follows from an induction on the size of G. The statement is clear for the |V(G)| = 1 since
S = ∅ or S = {X0 }. Now suppose the statement is already shown for all graphs smaller than
G. Let v be the node with out-degree 0 in G. If v 6∈ S then we can exclude it from G and the
statement follows from induction hypothesis. Otherwise, define SA = Pa(v) \ S and SB = S \ {v},
i i
i i
i i
636
A = XSA , B = XSB , W = Xv . (If 0 ∈ A then we can create a fake 0′ with X0′ = X0 and retain
0′ ∈ A while moving 0 out of A. So without loss of generality, 0 6∈ A.) Prepending arbitrary U to
the graph as U → X0 , the joint DAG of random variables (X0 , A, B, W) is then given by precisely
the graph in (33.7). Thus, we obtain from (33.8) the estimate
From induction hypothesis η(SA ∪ SB ) ≤ perc(0 → SA ) and η(SB ) ≤ perc(0 → SB ) (they live
on a graph G \ {v}). Thus, from computation (33.9) we see that the right-hand side of (33.11) is
precisely perc(0 → S) and thus η(S) ≤ perc(S) as claimed.
Proof of Theorem 33.2. First observe the noisy boolean circuit is a form of DAG. Since the gates
are δ -noisy contraction coefficients of constituent channels ηv in the DAG can be bounded by
(1 − 2δ)2 . Thus, in the percolation question all vertices are open with probability (1 − 2δ)2
From SDPI, for each i, we have I(Xi ; Y) ≤ ηKL (PY|Xi )H(Xi ). From Theorem 33.8, we know
ηKL (PY|Xi ) ≤ perc(Xi → Y). We now want to upper bound perc(Xi → Y). Recall that the minimum
distance between Xi and Y is di . For any path π of length ℓ(π ) from Xi to Y, therefore, the probability
that it will be open is ≤ (1 − 2δ)2ℓ(π ) . We can thus bound
X
perc(Xi → Y) ≤ (1 − 2δ)2ℓ(π ) . (33.12)
π :Xi →Y
Let us now build paths backward starting from Y, which allows us to represent paths X → Yi
as vertices of a K-ary tree with root Yi . By labeling all vertices on a K-ary tree corresponding
to paths X → Yi we observe two facts: the labeled set V is prefix-free (two labeled vertices are
never in ancestral relation) and the depth of each labeled set is at least di . It is easy to see that
P
u∈V c
depth(u)
≤ (Kc)di provided Kc ≤ 1 and attained by taking V to be set of all vertices in the
tree at depth di . We conclude that whenever K(1 − 2δ)2 ≤ 1 the right-hand side of (33.12) is
bounded by (K(1 − 2δ)2 )di , which concludes the proof by upper bounding H(Xi ) ≤ log 2 as
Corollary 33.9 Consider a channel PY|X and its n-letter memoryless extension P⊗ n
Y|X . Then we
have
ηKL (P⊗
Y|X ) ≤ 1 − (1 − ηKL (PY|X )) ≤ nηKL (PY|X ) .
n n
The first inequality can be sharp for some channels. For example, it is sharp when PY|X is a
binary or q-ary erasure channel (defined below in Example 33.6). This fact is proven in [345,
Theorem 17].
i i
i i
i i
Proof. The graph here consists of n parallel lines Xi → Yi . Theorem 33.8 shows that ηKL (P⊗ Y|X ) ≤
n
We conclude the section with a more sophisticated application of Theorem 33.8, emphasizing
how it can yield stronger bounds when compared to Theorem 33.2.
Example 33.5 Suppose we have the topological restriction on the placement of gates (namely
that the inputs to each gets should be from nearest neighbors to the left), resulting in the following
circuit of 2-input δ -noisy gates.
Note that each gate may be a simple passthrough (i.e. serve as router) or a constant output. Theorem
33.2 states that if (1 − 2δ)2 < 12 , then noisy computation within arbitrary topology is not possible.
Theorem 33.8 improves this to (1 − 2δ)2 < pc , where pc is the oriented site percolation threshold
for the particular graph we have. Namely, if each vertex is open with probability p < pc then
with probability 1 the connected component emanating from any given node (and extending to
the right) is finite. For the example above the site percolation threshold is estimated as pc ≈ 0.705
(so-called Stavskaya automata).
i i
i i
i i
638
We refer to ηf (PX , PY|X ) as the input-dependent contraction coefficient, in contrast with the
input-independent contraction coefficient ηf (PY|X ). It is obvious that
but as we will see below the inequality is often strict and the difference can lead to significant
improvements in the applications (Example 33.10). In Theorem 33.6 we have seen that for most
interesting f’s we have ηf (PY|X ) = ηχ2 (PY|X ). Unfortunately, for the input-dependent version this is
not true: we only have a one-sided comparison, namely for any twice continuously differentiable
f with f′′ (1) > 0 (in particular for KL-divergence) it holds that [345, Theorem 2]
For example, for jointly Gaussian X, Y, we in fact have ηχ2 = ηKL (see Example 33.7 next);
however, in general we only have ηχ2 < ηKL (see [19] for an example). Thus, unlike the input-
independent case, here the choice of f is very important. A general rule is that ηχ2 (PX , PY|X ) is the
easiest to bound and by (33.13) it contracts the fastest. However, for various reasons other f are
more useful in applications. Consequently, theory of input-dependent contraction coefficients is
much more intricate (see [201] for many recent results and references). In this section we try to
summarize some important similarities and distinctions between the ηf (PX , PY|X ) and ηf (PY|X ).
First, just as in Theorem 33.5 we can similarly prove a mutual information characterization of
ηKL (PX , PY|X ) as follows [352, Theorem V.2]:
I(U; Y)
ηKL (PX , PY|X ) = sup .
PU|X :U→X→Y I(U; X)
In particular, we see that ηKL (PX , PY|X ) is also a slope of the FI -curve (cf. Definition 16.5):
d
ηKL (PX , PY|X ) = FI (t; PX,Y ) . (33.14)
dt t=0+
(Indeed, from Exercise III.32 we know FI (t) is concave and thus supt≥0 FI t(t) = F′I (0).)
The next property of input-dependent SDPIs emphasizes the key difference compared to its
input-independent counterpart. Recall that Corollary 33.9 (and the discussion thereafter) show
that generally ηKL (P⊗ ⊗n ⊗n
Y|X ) → 1 exponentially fast. At the same time, ηKL (PX , PY|X ) stays constant.
n
i.i.d.
In particular, if (Xi , Yi ) ∼ PX,Y , then ∀PU|Xn
Note that not all ηf satisfy tensorization. We will show below (Theorem 33.12) that ηχ2 does
satisfy it. On the other hand, ηTV (P⊗ ⊗n
X , PY|X ) → 1 exponentially fast (which follows from (7.21)).
n
i i
i i
i i
X1 Y1
X2 Y2
Figure 33.5 Illustration for the probability space in the proof of Proposition 33.11.
Proof. This result is implied by (33.14) and Exercise III.32, but a simple direct proof is useful.
Without loss of generality (by induction) it is sufficient to prove the proposition for n = 2. It is
always useful to keep in mind the diagram in Figure 33.5 Let η = ηKL (PX , PY|X )
I(U; Y1 , Y2 ) = I(U; Y1 ) + I(U; Y2 |Y1 )
≤ η [I(U; X1 ) + I(U; X2 |Y1 )] (33.15)
= η [I(U; X1 ) + I(U; X2 |X1 ) + I(U; X1 |Y1 ) − I(U; X1 |Y1 , X2 )] (33.16)
≤ η [I(U; X1 ) + I(U; X2 |Y1 )] (33.17)
= η I(U; X1 , X2 )
where (33.15) is due to the fact that conditioned on Y1 , U → X2 → Y2 is still a Markov chain,
(33.16) is because U → X1 → Y1 is a Markov chain and (33.17) follows from the fact that
X2 → U → X1 is a Markov chain even when conditioned on Y1 .
As an example, let us analyze the erasure channel.
Example 33.6 (ηKL (PX , PY|X ) for erasure channel) We define ECτ as the following channel
(
X w.p. 1 − τ
Y=
? w.p. τ.
Consider an arbitrary U → X → Y and define an auxiliary random variable B = 1{Y =?}. We
have
I(U; Y) = I(U; Y, B) = I(U; B) +I(U; Y|B) = (1 − τ )I(U; X),
| {z }
=0, since B⊥
⊥U
where the last equality is due to the fact that I(U; Y|B = 1) = 0 and I(U; Y|B = 0) = I(U; X).
By the mutual information characterization of ηKL (PX , PY|X ), we have ηKL (PX , ECτ ) = 1 − τ .
⊗n
Note that by tensorization we also have ηKL (P⊗X , ECτ ) = 1 − τ . However, for non-product input
n
i i
i i
i i
640
Among the input-dependent ηf the most elegant is the theory of ηχ2 . The properties hold for
general PX,Y but we only state it for the finite case for simplicity.
Theorem 33.12 (Properties of ηχ2 (PX , PY|X )) Consider finite X and Y . Then, we have
(a) (Spectral characterization) Let Mx,y = √PX,Y (x,y) be an |X | × |Y| matrix. Let 1 = σ1 (M) ≥
PX (x)PY (y)
p
σ2 (M) ≥ · · · ≥ 0 be the singular values of M, i.e. σj (M) = λj (MT M). Then ηχ2 (PX , PY|X ) =
σ22 (M).
(b) (Symmetry) ηχ2 (PX , PY|X ) = ηχ2 (PY , PX|Y ).
(c) (Maximal correlation) ηχ2 (PX , PY|X ) = supg1 ,g2 ρ2 (g1 (X), g2 (Y)), where the supremum is over
all functions g1 : X → R and g2 : Y → R.
(d) (Tensorization) ηχ2 (P⊗ n ⊗n
X , PY|X ) = ηχ2 (PX , PY|X )
Proof. We focus on the spectral characterization which implies the rest. Denote by EX|Y a linear
P
operator that acts on function g as EX|Y g(y) = x PX|Y (x|y)g(x). For any QX let g(x) = QPXX((xx)) then
QY (y)
we have PY (y) = EX|Y g. Therefore, we have
VarPY [EX|Y g]
ηχ2 (PX , PY|X ) = sup
g VarPX [g]
with supremum over all g ≥ 0 and EPX [g] = 1. We claim that this supremum is also equal to
EPY [E2X|Y h]
ηχ2 (PX , PY|X ) = sup ,
h EPX [h2 ]
taken over all h with EPX h = 0. Indeed, for any such h we can take g = 1 + ϵh for some suffi-
p g ≥ 0) and conversely, for any g we can set h = g − 1. Finally, let us
ciently small ϵ (to satisfy
reparameterize ϕx ≜ PX (x)h(x) in which case we get
ϕ T MT Mϕ
ηχ2 (PX , PY|X ) = sup ,
ϕ ϕT ϕ
p
where ϕ ranges over all vectors in RX that are orthogonal to the vector ψ with ψx = PX (x).
Finally, we notice that top singular value of M corresponds to singular vector ψ and thus restricting
ϕ ⊥ ψ results in recovering the second-largest singular vector.
Symmetry follows from noticing that matrix M is replaced by MT when we interchange X and
Y. The maximal correlation characterization follows from the fact that supg2 E√[g1 (X)g2 (Y)] is attained
Var[g2 (Y)]
at g2 = EX|Y g1 . Tensorization follows from the fact that singular values of the Kronecker product
M⊗n are just products of (all possible) n-tuples of singular values of M.
Example 33.7 (SDPI constants of joint Gaussian) Let X, Y be jointly Gaussian with
correlation coefficient ρ. Then
i i
i i
i i
Indeed, it is well-known that the maximal correlation of X and Y is simply |ρ|. (This can be shown
by finding the eigenvalues of the (Mehler) kernel defined in Theorem 33.12(a); see e.g. [266].)
Applying Theorem 33.12(c) yields ηχ2 (PX , PY|X ) = ρ2 .
Next, in view of (33.13), it suffices to show ηKL ≤ ρ2 , which is a simple consequence of EPI.
Without loss of generality, let us consider Y = X + Z, where X ∼ PX = N (0, 1) and Z ∼ N (0, σ 2 ).
Then PY = N (0, 1 + σ 2 ) and ρ2 = 1+σ 1
2 . Let X̃ have finite second moment and finite differential
1
entropy and set Ỹ = X̃ + Z. Applying Lieb’s EPI (3.36) with U1 = X̃, U2 = Z/σ and cos2 α = 1+σ 2,
we obtain
1 σ2 1
h(Ỹ) ≥ 2
h(X̃) + 2
log(2πe) + log(1 + σ 2 )
1+σ 2( 1 + σ ) 2
Example 33.8 (Mixing of Markov chains) One area in which input-dependent contrac-
tion coefficients have found a lot of use is in estimating mixing time (time to convergence to
equilibrium) of Markov chains. Indeed, suppose K = PY|X is a kernel for a time-homogeneous
Markov chain X0 → X1 → · · · with stationary distribution π (i.e., K = PXt+1 |Xt ). Then for any
initial distribution q, SDPI gives the following bound:
showing exponential decrease of Df provided that ηf (π , K) < 1. For most interesting chains the
TV version is useless, but χ2 and KL is rather effective (the two known as the spectral gap and
modified log-Sobolev inequality methods). For example, for reversible Markov chains, we have
[124, Prop. 3]
where γ∗ is the absolute spectral gap of P. See Exercise VI.19. The most efficient modern method
for bounding ηKL is known as spectral independence, see Exercise VI.26.
i i
i i
i i
642
′ |X X3 ……
PX
P X′ |X
X1
PX′ |X X5 ……
Xρ
PX′ |X X6 ……
PX ′
|X X2
PX ′
|X X4 ……
To simplify our discussion, we will assume that π is a reversible measure on kernel PX′ |X , i.e.,
By standard result on Markov chains, this also implies that π is a stationary distribution of the
reversed Markov kernel PX|X′ .
This model, known as broadcasting on trees, turns out to be rather deep. It first arose in sta-
tistical physics as a simplification of Ising model on lattices (trees are called Bethe lattices in
physics) [63]. Then, it was found to be closely related to a problem of phylogenetic reconstruc-
tion in computational biology [306] and almost simultaneously appeared in random constraint
satisfaction problems [261] and sparse-graph coding theory. Our own interest was triggered by
a discovery of a certain equivalence between reconstruction on trees and community detection in
stochastic block model [307, 119].
We make the following observations:
• We can think of this model as a broadcasting scenario, where the root broadcasts its message
Xρ to the leaves through noisy channels PX′ |X . The condition (33.19) here is only made to avoid
defining the reverse channel. In general, one only requires that π is a stationary distribution of
PX′ |X , in which case the (33.21) should be replaced with ηKL (π , PX|X′ )b < 1.
• Under the assumption (33.19), the joint distribution of this tree can also be written as a Gibbs
distribution
1 X X
PXall = exp f(Xp , Xc ) + g( X v ) , (33.20)
Z
(p,c)∈E v∈V
where Z is the normalization constant, f(xp , xc ) = f(xc , xp ) is symmetric. When X = {0, 1}, this
model is known as the Ising model (on a tree). Note, however, that not every measure factorizing
as (33.20) (with symmetric f) can be written as a broadcasting process for some P and π.
The broadcasting on trees is an inference problem in which we want to reconstruct the root
variable Xρ given the observations XLd = {Xv : v ∈ Ld }, with Ld = {v : v ∈ V, depth(v) = d}.
A natural question is to upper bound the performance of any inference algorithm on this problem.
The following theorem shows that there exists a phase transition depending on the branching factor
b and the contraction coefficient of the kernel PX′ |X .
i i
i i
i i
Theorem 33.13 Consider the broadcasting problem on infinite b-ary tree (b > 1), with root
distribution π and edge kernel PX′ |X . If π is a reversible measure of PX′ |X such that
ηKL (π , PX′ |X )b < 1, (33.21)
then I(Xρ ; XLd ) → 0 as d → 0.
Proof. For every v ∈ L1 , we define the set Ld,v = {u : u ∈ Ld , v ∈ ancestor(u)}. We can upper
bound the mutual information between the root vertex and leaves at depth d
X
I(Xρ ; XLd ) ≤ I(Xρ ; XLd,v ).
v∈L1
i i
i i
i i
644
This problem can be modeled as a broadcasting problem on tree where the root distribution π
is given by the uniform distribution on k colors, and the edge kernel PX′ |X is defined as
(
1
a 6= b
PX′ |X (a|b) = k−1
0, a = b.
It can be shown (see Ex. VI.24) that ηKL (Unif, PX′ |X ) = k log k(11+o(1)) for large k. By Theorem
33.13, this implies that if b < k log k(1 + o(1)) then reliable reconstruction of the root node is not
possible. This result is originally proved in [393] and [50].
The other direction b > k log k(1 + o(1)) can be shown by observing that if b > k log k(1 + o(1))
then the probability of the children of a node taking all available colors (except its own) is close to
1. Thus, an inference algorithm can always determine the color of a node by finding a color that
is not assigned to any of its children. Similarly, when b > (1 + ϵ)k log k even observing (1 − ϵ)-
fraction of the node’s children is sufficient to reconstruct its color exactly. Proceeding recursively
from bottom up, such a reconstruction algorithm will succeed with high probability. In this regime
with positive probability (over the leaf variables) the posterior distribution of the root color is a
point mass (deterministic). This effect is known as “freezing” of the root given the boundary.
We may also consider another reconstruction method which simply computes majority of the
leaves, i.e. X̂ρ = j for the color j that appears the most among the leaves. This method gives
success probability strictly above 1k if and only if d > (k − 1)2 , by a famous result of Kesten and
Stigum [244]. While the threshold is suboptimal, the method is quite robust in the sense that it
also works if we only have access to a small fraction ϵ of the leaves (and the rest are replaced by
erasures).
Let us now consider ηχ2 (Unif, PX′ |X ). The transition matrix is symmetric with eigenvalues
{1, − k−1 1 } and thus from Theorem 33.12 we have that
1 1
ηχ2 (Unif, PX′ |X ) = ηKL (Unif, PX′ |X ) = .
( k − 1) 2 k log k(1 + o(1))
Thus if Theorem 33.13 could be shown with Iχ2 instead of IKL we would be able to show non-
reconstruction for d < (k − 1)2 , contradicting the result of the previous paragraph. What goes
wrong is that Iχ2 fails to be subadditive, cf. (7.47). However, it is locally subadditive (when e.g.
Iχ2 (X; A) 1) by [202, Lemma 26]. Thus, an argument in Theorem 33.13 can be repeated for the
case where the leaves are observed through a very noisy channel (for example, an erasure channel
leaving only ϵ-fraction of the leaves). Consequently, robust reconstruction threshold for coloring
exactly equals d = (k − 1)2 . See [228] for more on robust reconstruction thresholds.
i i
i i
i i
Notice that in this problem we are not sample-limited (each party has infinitely many observations),
but communication-limited (only B bits can be exchanged).
Here is a trivial attempt to solve it. Notice that if Bob sends W = (Y1 , . . . , YB ) then the optimal
PB
estimator is ρ̂(X∞ , W) = 1n i=1 Xi Yi which has minimax error B1 , hence R∗ (B) ≤ B1 . Surprisingly,
this can be improved.
X1 Y1
.. ..
. .
W Xi Yi
.. ..
. .
Note that once the messages W are fixed we have a parameter estimation problem {Qρ , ρ ∈
[−1, 1]} where Qρ is a distribution of (X∞ , W) when A∞ , B∞ are ρ-correlated. Since we mini-
mize mean-squared error, we know from the van Trees inequality (Theorem 29.2)2 that R∗ (B) ≥
1+o(1) 1+o(1)
minρ JF (ρ) ≥ JF (0) where JF (ρ) is the Fisher information of the family {Qρ }.
Recall, that we also know from the local approximation that
ρ2 log e
D(Qρ kQ0 ) = JF (0) + o(ρ2 )
2
Furthermore, notice that under ρ = 0 we have X∞ and W independent and thus
hence JF (0) ≤ (2 ln 2)B + o(1) which in turns implies the theorem. For full details and the
extension to interactive communication between Alice and Bob, see [207].
2
This requires some technical justification about smoothness of the Fisher information JF (ρ).
i i
i i
i i
646
We turn to the upper bound next. First, notice that by taking blocks of m → ∞ consecutive bits
Pim−1
and setting X̃i = √1m j=(i−1)m Xj and similarly for Ỹi , Alice and Bob can replace ρ-correlated
i.i.d. 1 ρ
bits with ρ-correlated standard Gaussians (X̃i , Ỹi ) ∼ N (0, ). Next, fix some very large N
ρ 1
and let
W = argmax Yj .
1≤j≤N
Definition 33.15 (Partial orders on channels) Let PY|X and PZ|X be two channels.
• We say that PY|X is a degradation of PZ|X , denoted by PY|X ≤deg PZ|X , if there exists PY|Z such
that PY|X = PY|Z ◦ PZ|X .
• We say that PZ|X is less noisy than PY|X , denoted by PY|X ≤ln PZ|X , if for every PU,X on the
following Markov chain
U X
We make some remarks on these definitions and refer to [345] for proofs:
i i
i i
i i
• PY|X ≤deg PZ|X =⇒ PY|X ≤ln PZ|X =⇒ PY|X ≤mc PZ|X . Counter examples for reverse
implications can be found in [111, Problem 15.11].
• The less noisy relation can be defined equivalently in terms of the divergence, namely PY|X ≤ln
PZ|X if and only if for all PX , QX we have D(QY kPY ) ≤ D(QZ kPZ ). We refer to [290, Sections
I.B, II.A] and [345, Section 6] for alternative useful characterizations of the less-noisy order.
• For BMS channels (see Section 19.4*) it turns out that among all channels with a given
Iχ2 (X; Y) = η (with X ∼ Ber(1/2)) the BSC and BEC are the minimal and maximal elements
in the poset of ≤ln ; see Ex. VI.21 for details.
Proposition 33.16 ηKL (PY|X ) ≤ 1 − τ if and only if PY|X ≤ln ECτ , where ECτ was defined in
Example 33.6.
Proof. By induction it is sufficient to consider n = 2 only. Consider the following Markov chain:
Y1
X1
Z1
U
Y2
X2
Z2
3 ∏
We remind that ⊗PYi |Xi refers to the product (memoryless) channel with xn 7→ Yn ∼ i PYi |Xi =xi .
i i
i i
i i
648
Hence I(U; Y1 , Y2 ) ≤ I(U; Y1 , Z2 ) for any PX1 ,X2 ,U . Applying the same argument we can replace
Y1 with Z1 to get I(U; Y1 , Z2 ) ≤ I(U; Z1 , Z2 ), completing the proof.
For the second claim, notice that
where equalities are just applications of the chain rule (and in (a) and (b) we also notice that
conditioned on X2 the Y2 or Z2 are non-informative) and both inequalities are applications of
the most capable relation to the conditional distributions. For example, for every y we have
I(X2 ; Y2 |Y1 = y) ≤ I(X2 ; Z2 |Y1 = y) and hence we can average over y ∼ PY1 .
X2 X6
Y2
Y6
6
2
Y5
Y1
Y35
X1 X3 X5 X7
Y5
Y1
9
4
Y7
Y3
9
4
X4 X9
i i
i i
i i
to one of the m communities. We assume that Xv is sampled uniformly from [m] and independent
of the other vertices. The observation Yu,v is defined as
(
Ber(a/n) Xu = Xv
Yuv ∼
Ber(b/n) Xu 6= Xv .
Example 33.12 (Z2 synchronization) For any graph G, we sample Xv uniformly from
{−1, +1} and Ye = BSCδ (Xu Xv ).
Example 33.13 (Spiked Wigner model) We consider the inference problem of estimating
the value of vector (Xi )i∈[n] given the observation (Yij )i,j∈[n],i≤j . The Xi ’s and Yij ’s are related by
r
λ
Yij = Xi Xj + Wij ,
n
i.i.d.
where X = (X1 , . . . , Xn )⊤ is sampled uniformly from {±1}n and Wi,j = Wj,i ∼ N (0, 1), so that
W forms a Wigner matrix (symmetric Gaussian matrix). This model can also be written in matrix
form as
r
λ
Y= XX⊤ + W
n
as a rank-one perturbation of a Wigner matrix W, hence the name of the model. It is used as a
probabilistic model for principal component analysis.
This problem can also be treated as a problem of inference on undirected graph. In this case,
the underlying graph is a complete graph, and we assign Xi to the ith vertex. Under this model, the
edge observations is given by Yij = BIAWGNλ/n (Xi Xj ), cf. Example 3.4.
Although seemingly different, these problems share the following common characteristics,
namely:
(Xu , Xv ) → B → Ye .
In other words, the observation on each edge only depends on whether the random variables on
its endpoints are similar.
i i
i i
i i
650
contraction coefficient, the percolation probability is used to directly control the conditional
mutual information between any subsets of vertices in the graph.
Before stating our main theorem, we will need to define the corresponding percolation model
for inference on undirected graph. For any undirected graph G = (V, E) we define a percolation
model on this graph as follows :
• Every edge e ∈ E is open with the probability ηKL (PYe |Xe ), independent of the other edges,
• For any v ∈ V and S ⊂ V , we define the v ↔ S as the event that there exists an open path from
v to any vertex in S,
• For any S1 , S2 ⊂ V , we define the function percu (S1 , S2 ) as
X
percu (S1 , S2 ) ≜ P(v ↔ S2 ).
v∈S1
Notice that this function is different from the percolation function for information percolation
in DAG. Most importantly, this function is not equivalent to the exact percolation probability.
Instead, it is an upper bound on the percolation probability by union bounding with respect to
S1 . Hence, it is natural that this function is not symmetric, i.e. percu (S1 , S2 ) 6= percu (S2 , S1 ).
Instead of proving Theorem 33.18 in its full generality, we will prove the theorem under
Assumption 33.1. The main step of the proof utilizes the fact we can upper bound the mutual
information of any channel by its less noisy upper bound.
Theorem 33.19 Consider the problem of inference on undirected graph G = (V, E) with
X1 , ..., Xn not necessarily independent. If PYe |Xe ≤LN PZe |Xe , then for any S1 , S2 ⊂ V and E ⊂ E
Proof. From our assumption and the tensorization property of less noisy ordering (Proposi-
tion 33.17), we have PYE |XS1 ,XS2 ≤ln PZE |XS1 ,XS2 . This implies that for σ as a valid realization of
XS2 we will have
I(XS1 ; YE |XS2 = σ) = I(XS1 , XS2 ; YE |XS2 = σ) ≤ I(XS1 , XS2 ; ZE |XS2 = σ) = I(XS1 ; ZE |XS2 = σ).
As this inequality holds for all realization of XS2 , then the following inequality also holds
i i
i i
i i
Proof of Theorem 33.18. We only give a proof under Assumption 33.1 and only for the case
S1 = {i}. For the full proof (that proceeds by induction and does not leverage the less noisy idea),
see [347]. We have the following equalities
I(Xi ; XS2 |YE ) = I(Xi ; XS2 , YE ) = I(Xi ; YE |XS2 ) (33.22)
where the first inequality is due to the fact BE ⊥⊥ Xi under S.C, and the second inequality is due to
Xi ⊥
⊥ XS2 under Assumption 33.1.
Due to our previous result, if ηKL (PYe |Xe ) = 1 − τ then PYe |Xe ≤LN PZe |Xe where PZe |Xe = ECτ .
By tensorization property, this ordering also holds for the channel PYE |XE , thus we have
I(Xi ; YE |XS2 ) ≤ I(Xj ; ZE |XS2 ).
Let us define another auxiliary random variable D = 1{i ↔ S2 }, namely it is the indicator that
there is an open path from i to S2 . Notice that D is fully determined by ZE . By the same argument
as in (33.22), we have
I(Xi ; ZE |XS2 ) = I(Xi ; XS2 |ZE )
= I(Xi ; XS2 |ZE , D)
= (1 − P[i ↔ S2 ]) I(Xi ; XS2 |ZE , D = 0) +P[i ↔ S2 ] I(Xi ; XS2 |ZE , D = 1)
| {z } | {z }
0 ≤log |X |
≤ P[i ↔ S2 ] log |X |
= percu (i, S2 )
as n → ∞. Now, it turns out that in problems like this there is a so-called BBP phase transition, first
discovered in [29, 326]. Specifically, the eigenvalues of √1n W are well-known to follow Wigner’s
i i
i i
i i
652
√
semicircle law supported on the interval√(−2, 2). At the same time the rank-one matrix nλ XXT has
only one non-zero eigenvalue equal to λ. It turns out that for λ < 1 the effect of this “spike” is
√
undetectable and the spectrum of Y/ n is unaffected. For λ > 1 it turns out that the top eigenvalue
√
of Y/ n moves above the edge of the semicircle law to λ + λ1 > 2. Furthermore, computing the
top eigenvector and taking the sign of its coordinates achieves a correlated recovery of the true X
in the sense of (33.23). Note, however, that inability to change the spectrum (when λ < 1) does
not imply that reconstruction of X is not possible by other means. In this section, however, we will
show that indeed for λ ≤ 1 no method can achieve (33.23). Thus, together with the mentioned
spectral algorithm for λ > 1 we may conclude that λ∗ = 1 is the critical threshold separating the
two phases of the problem.
Theorem 33.20 Consider the spiked Wigner model. If λ ≤ 1, then for any sequence of
estimators Xˆn (Y),
" #
1 X
n
E Xi X̂i →0 (33.24)
n
i=1
as n → ∞.
Next, it is clear that we can simplify the task of maximizing (over X̂n ) by allowing to separately
estimate each Xi Xj by T̂i,j , i.e.
X X
max E[Xi Xj X̂i X̂j ] ≤ max E[Xi Xj T̂i,j ] .
X̂n T̂i,j
i̸=j i̸=j
(For example, we may notice I(Xi ; Xj |Y) = I(Xi , Xj ; Y) ≥ I(Xi Xj ; Y) and apply Fano’s inequality).
Thus, from the symmetry of the problem it is sufficient to prove I(X1 ; X2 |Y) → 0 as n → ∞.
By using the undirected information percolation theorem, we have
Now, for computation of perc we need to compute the probability of having an open edge, which in
our case simply equals ηKL (BIAWGNλ/n ). From Theorem 33.6 we know the latter equals Iχ2 (X; Y)
i i
i i
i i
λ
ηKL (BIAWGNλ/n ) = (1 + o(1)) .
n
′
Suppose that λ < 1. In this case, we can overbound λ+no(1) by λn with λ′ < 1. The percolation
random graph then is equivalent to the Erdös-Rényi random graph with n vertices and λ′ /n edge
probability, i.e., ER(n, λ′ /n). Using this observation, the inequality can be rewritten as
A classical result in random graph theory is that the largest connected component in ER(n, λ′ /n)
contains O(log n) vertices if λ′ < 1 [154]. This implies that the probability that two specific
vertices are connected is o(1), hence I(X2 ; X1 |Y) → 0 as n → ∞.
To treat the case of λ = 1 we need a slightly more refined information about ηKL (BIAWGNλ/n )
and about the behavior of the giant component of ER(n, 1+on(1) ) graph; see [347] for full details.
Remark 33.2 (Dense-sparse correspondence) The proof above changes the underlying
structure of the graph. Namely, instead of dealing with a complete graph, the information percola-
tion method replaced it with an Erdös-Rényi random graph. Moreover, if ηKL is small enough, then
the underlying percolation graph tends to be very sparse and has a locally tree-like structure. This
demonstrates a ubiquitous and actively studied effect in modern statistics: dense inference (such
as spiked Wigner model, sparse regression, sparse PCA, etc) with very weak signals (ηKL ≈ 1)
is similar to sparse inference (broadcasting on trees) with moderate signals (ηKL ∈ (ϵ, 1 − ϵ)).
The information percolation method provides a certain bridge between these two worlds, perhaps
partially explaining why the results in these two worlds often parallel one another. (E.g. results on
optimality and phase transitions for belief propagation (sparse inference) often parallel those for
approximate message passing (AMP, dense inference)). We do want to caution, however, that the
reduction given by the information percolation method is not generally tight (spiked Wigner being
a lucky exception). For example [347], for correlated recovery in the SBM √
with k communities
√
and edge probability a/n and b/n it yields an impossibility result ( a − b)2 ≤ 2k , weaker than
the best known upper bounds of [203].
Definition 33.21 (Post-SDPI constant) Given a conditional measure PY|X , define the input-
dependent and input-free contraction coefficients as
(p) I(U; X)
ηKL (PX , PY|X ) = sup :X→Y→U
PU|Y I(U; Y)
i i
i i
i i
654
X Y U
ε̄ 0 τ̄ 0 0
τ
?
τ
ε 1 τ̄ 1 1
(p) I(U; X)
ηKL (PY|X ) = sup :X→Y→U
PX ,PU|Y I(U; Y)
where PY = PY|X ◦ PX and PX|Y is the conditional measure corresponding to PX PY|X . From (33.25)
and Prop. 33.11 we also get tensorization property for input-dependent post-SDPI:
ηKL (P⊗ ⊗n
(p) n (p)
X , (PY|X ) ) = ηKL (PX , PY|X ). (33.27)
(p)
It is easy to see that by the data processing inequality, ηKL (PY|X ) ≤ 1. Unlike the ηKL coefficient
(p)
the ηKL can equal to 1 even for a noisy channel PY|X .
(p)
Example 33.14 (ηKL = 1 for erasure channels) Let PY|X = BECτ and X → Y → U
be defined as on Figure 33.6. Then we can compute I(Y; U) = H(U) = h(ετ̄ ) and I(X; U) =
H(U) − H(U|X) = h(ετ̄ ) − εh(τ ) hence
(p) I(X; U)
ηKL (PY|X ) ≥
I(Y; U)
ε
= 1 − h(τ )
h(ετ̄ )
This last term tends to 1 when ε tends to 0 hence
(p)
ηKL (BECτ ) = 1
i i
i i
i i
Theorem 33.22
(p)
ηKL (BSCδ ) = (1 − 2δ)2 .
Theorem 33.24 (Post-SDPI for BI-AWGN) Let 0 ≤ ϵ ≤ 1 and consider the channel PY|X
with X ∈ {±1} given by
Y = ϵ X + Z, Z ∼ N (0, 1) .
Then for any π ∈ (0, 1) taking PX = Ber(π ) we have for some absolute constant K the estimate
(p) ϵ2
ηKL (PX , PY|X ) ≤ K .
π (1 − π )
Proof. In this proof we assume all information measures are used to base-e. First, notice that
1
v(y) ≜ P[X = 1|Y = y] = 1−π −2yϵ
.
1+ π e
( p)
Then, the optimization defining ηKL can be written as
(p) d(EQY [v(Y)]kπ )
ηKL (PX , PY|X ) ≤ sup . (33.28)
QY D(QY kPY )
From (7.34) we have
(p) 1 (EQY [v(Y)] − π )2
ηKL (PX , PY|X ) ≤ sup . (33.29)
π (1 − π ) QY D(QY kPY )
i i
i i
i i
656
To proceed, we need to introduce a new concept. The T1 -transportation inequality, first intro-
duced by K. Marton, for the measure PY states the following: For every QY we have for some
c = c(PY )
p
W1 (QY , PY ) ≤ 2cD(QY kPY ) , (33.30)
where W1 (QY , PY ) is the 1-Wasserstein distance defined as
W1 (QY , PY ) = sup{EQY [f] − EPY [f] : f 1-Lipschitz} (33.31)
= inf{E[|A − B|] : A ∼ QY , B ∼ PY } .
The constant c(PY ) in (33.30) has been characterized in [64, 125] in terms of properties of PY . One
such estimate is the following:
!1/k
2 G(δ)
c(PY ) ≤ sup 2k
,
δ k≥ 1 k
′ 2 i.i.d.
where G(δ) = E[eδ(Y−Y ) ] where Y, Y′ ∼ PY . Using the estimate 2k
k ≥ √ 4k
and the fact
π (k+1/2)
that 1k ln(k + 1/2) ≤ 1
2 we get a further bound
√
2 π e 6G(δ)
c(PY ) ≤ G(δ) ≤ .
δ 4 δ
d √
Next notice that Y − Y′ = Bϵ + 2Z where Bϵ ⊥ ⊥ Z ∼ N (0, 1) and Bϵ is symmetric and |Bϵ | ≤ 2ϵ.
Thus, we conclude that for any δ < 1/4 we have c̄ ≜ δ6 supϵ≤1 G(δ) < ∞. In the end, we have
inequality (33.30) with constant c = c̄ that holds uniformly for all 0 ≤ ϵ ≤ 1.
Now, notice that dyd
v(y) ≤ 2ϵ and therefore v is 2ϵ -Lipschitz. From (33.30) and (33.31) we
obtain then
ϵp
|EQY [v(Y)] − EPY [v(Y)]| ≤ 2c̄D(QY kPY ) .
2
Squaring this inequality and plugging back into (33.29) completes the proof.
(p)
Remark 33.3 Notice that we can also compute the exact value of ηKL (PX , PY|X ) by noticing the
following. From (33.28) it is evident that among all measures QY with a given value of EQY [v(Y)]
we are interested in the one minimizing D(QY kPY ). From Theorem 15.11 we know that such QY
is given by dQY = ebv(y)−ψV (b) dPY , where ψV (b) ≜ ln EPY [ebv(Y) ]. Thus, by defining the convex
dual ψV∗ (λ) we can get the exact value in terms of the following single-variable optimization:
(p) d(λkπ )
ηKL (PX , PY|X ) = sup ∗ .
λ∈[0,1] ψV (λ)
Numerically, for π = 1/2 it turns out that the optimal value is λ → 12 , justifying our overbounding
of d by χ2 , and surprisingly giving
(p)
ηKL (Ber(1/2), PY|X ) = 4 EPY [tanh2 (ϵY)] = ηKL (PY|X ) ,
i i
i i
i i
Y1 U1
θ .. ..
. . θ̂
Ym Um
• Without the constraint θ ∈ [−1, 1]d , we could take θ ∼ N (0, bId ) and from rate-distortion
quickly conclude that estimating θ within risk R requires communicating at least d2 log bd R bits,
which diverges as b → ∞. Thus, restricting the magnitude of θ is necessary in order for it to be
estimable with finitely many bits communicated.
• Without
h P communication
i constraint, it is easy to establish that R∗ (m, d, σ 2 , ∞) =
2 2 P
E mσ i Zi = dmσ by taking Ui = Yi and θ̂ = m1 i Ui , which matches the minimax
risk (28.17) in non-distributed setting.
• In order to achieve a risk of order md we can apply a crude quantizer as follows. Let Ui = sign(Yi )
(coordinate-wise sign). This yields B = md and it is easy to show that the achievable risk is
Pm
Oσ ( md ). Indeed, notice that by taking V = m1 i=1 Ui we see that each coordinate Vj , j ∈ [d],
estimates (within Op ( √1m )) quantities Φ(θj /σ) with Φ denoting the CDF of N (0, 1). Since Φ
has derivative bounded away from 0 on [−1/σ, 1/σ], we get that the estimate θ̂j ≜ σΦ−1 (Vj )
will have mean square error of O(1/m) (with a poor dependency on σ , though), which gives
overall error O(d/m) as claimed.
Our main result below shows that the previous simple strategy is order-optimal in terms of
communicated bits. This simplifies the proofs of [137, 73].
• We remark that these results are also implicitly contained in the long line of work in the
information theoretic literature on the so-called Gaussian CEO problem. We recommend con-
sulting [156]; in particular, Theorem 3 there implies the B ≳ dm lower bound we show below.
However, the Gaussian CEO work uses a lot more sophisticated machinery (the entropy power
inequality and related results), while our SDPI proof is more elementary.
i i
i i
i i
658
dϵ2
Theorem 33.25 There exists a constant c1 > 0 such that if R∗ (m, d, σ 2 , B) ≤ 9 then B ≥ c1 d
ϵ2
.
Proof. Let X ∼ Unif({±1}d ) and set θ = ϵX. Given an estimate θ̂ we can convert it into an
estimator of X via X̂ = sign(θ̂) (coordinatewise). Then, clearly
ϵ2 dϵ 2
E[dH (X, X̂)] ≤ E[kθ̂ − θk2 ] ≤ .
4 9
Thus, we have an estimator of X within Hamming distance 49 d. From Rate-Distortion (Theo-
rem 26.1) we conclude that I(X; X̂) ≥ cd for some constant c > 0. On the other hand, from
the standard DPI we have
X
m
cd ≤ I(X; X̂) ≤ I(X; U1 , . . . , Um ) ≤ I(X; Uj ) , (33.32)
j=1
where we also applied Theorem 6.1. Next we estimate I(X; Uj ) via I(Yj ; Uj ) by applying the Post-
SDPI. To do this we need to notice that the channel X → Yj for each j is just a memoryless
extension of the binary-input AWGN channel with SNR ϵ. Since each coordinate of X is uniform,
we can apply Theorem 33.24 (with π = 1/2) together with tensorization (33.27) to conclude that
We notice that in this short section we only considered a non-interactive setting in the sense that
the message Ui is produced by machine i independently and without consulting anything except
its private measurement Yi . More generally, we could allow machines to communicate their bits
over a public broadcast channel, so that each communicated bit is seen by all other machines. We
could still restrict the total number of bits sent by all machines to be B and ask for the best possible
interactive estimation rate. While [137, 73] claim lower bounds that apply to this setting, those
bounds contain subtle errors (see [5, 4] for details). There are lower bounds applicable to non-
interactive settings but they are weaker by certain logarithmic terms. For example, [5, Theorem 4]
shows that to achieve risk ≲ dϵ2 one needs B ≳ ϵ2 logd(dm) in the limited interactive setting where
Ui may depend on Ui1−1 but there are no other interactions (i.e. the i-th machine sends its entire
i i
i i
i i
message at once instead of sending part of it and waiting for others to broadcast theirs before
completing its own transmission, as permitted by the fully interactive protocol).
i i
i i
i i
i.i.d.
VI.1 Let X1 , . . . , Xn ∼ Exp(exp(θ)), where θ follows the Cauchy distribution π with parameter s,
whose pdf is given by p(θ) = 1
θ2
for θ ∈ R. Show that the Bayes risk
πs(1+ )
s2
Learning parameters of dynamical systems is known as “system identification”. Denote the law
of (X1 , . . . , Xn ) corresponding to θ by Pθ .
1. Compute D(Pθ kPθ0 ). (Hint: chain rule saves a lot of effort.)
2. Show that Fisher information
X
JF (θ) = θ2t−2 (n − t).
1≤t≤n−1
3. Argue that the hardest regime for system identification is when θ ≈ 0, and that instability
(|θ| > 1) is in fact helpful.
VI.3 (Linear regression) Consider the model
Y = Xβ + Z
where the design matrix X ∈ Rn×d is known and Z ∼ N (0, In ). Define the minimax mean-square
error of estimating the regression coefficient β ∈ Rd based on X and Y as follows:
i i
i i
i i
Redo (a) and (b) by finding the value of R∗pred and identify the minimax estimator. Explain
intuitively why R∗pred is always finite even when d exceeds n.
i.i.d.
VI.4 (Chernoff-Rubin-Stein lower bound.) Let X1 , . . . , Xn ∼ Pθ and θ ∈ [−a, a].
(a) State the appropriate regularity conditions and prove the following minimax lower bound:
(1 − ϵ)2
inf sup Eθ [(θ − θ̂)2 ] ≥ min max ϵ2 a2 , ,
θ̂ θ∈[−a,a] 0<ϵ<1 nJ̄F
1
Ra
where J̄F = 2a J (θ)dθ is the average Fisher information. (Hint: Consider the uniform
−a F
prior on [−a, a] and proceed as in the proof of Theorem 29.2 by applying integration by
parts.)
(b) Simplify the above bound and show that
1
inf sup Eθ [(θ − θ̂)2 ] ≥ p . (VI.3)
θ̂ θ∈[−a,a] (a−1 + nJ̄F )2
(c) Assuming the continuity of θ 7→ JF (θ), show that the above result also leads to the optimal
local minimax lower bound in Theorem 29.4 obtained from Bayesian Cramér-Rao:
1 + o( 1)
inf sup Eθ [(θ − θ̂)2 ] ≥ .
θ̂ θ∈[θ0 ±n−1/4 ] nJF (θ0 )
Note: (VI.3) is an improvement of the inequality given in [92, Lemma 1] without proof and
credited to Rubin and Stein.
VI.5 In this exercise we give a Hellinger-based lower bound analogous to the χ2 -based HCR lower
bound in Theorem 29.1. Let θ̂ be an unbiased estimator for θ ∈ Θ ⊂ R.
(a) For any θ, θ′ ∈ Θ, show that [386]
1 (θ − θ′ )2 1
(Varθ (θ̂) + Varθ′ (θ̂)) ≥ −1 . (VI.4)
2 4 H2 (Pθ , Pθ′ )
R √ √ √ √
(Hint: For any c, θ − θ′ = (θ̂ − c)( pθ + pθ′ )( pθ − pθ′ ). Apply Cauchy-Schwarz
and optimize over c.)
(b) Show that
1
H2 (Pθ , Pθ′ ) ≤ (θ − θ′ )2 J̄F (VI.5)
4
R θ′
where J̄F = θ′ 1−θ θ JF (u)du is the average Fisher information.
(c) State the needed regularity conditions and deduce the Cramér-Rao lower bound from (VI.4)
and (VI.5) with θ′ → θ.
(d) Extend the previous parts to the multivariate case.
VI.6 (Bayesian distribution estimation.) Let {Pθ : θ ∈ Θ} be a family of distributions on X
with a common dominating measure μ and density pθ (x) = dP n
dμ (x). Given a sample X =
θ
i.i.d.
(X1 , . . . , Xn ) ∼ Pθ for some θ ∈ Θ, the goal is to estimate the data-generating distribution Pθ by
some estimator P̂(·) = P̂(·; Xn ) with respect to some loss function ℓ(P, P̂). Suppose we are in
i i
i i
i i
a Bayesian setting where θ is drawn from a prior π. Let’s find the form of the Bayes estimator
and the Bayes risk.
(a) For convenience, let Xn+1 denote a test data point (unseen) drawn from Pθ and independent
of the observed data Xn . Convince yourself that every estimator P̂ can be formally identified
as a conditional distribution QXn+1 |Xn .
(b) Consider the KL loss ℓ(P, P̂) = D(PkP̂). Using Corollary 4.2, show that the Bayes estimator
minimizing the average KL risk is the posterior (conditional mean), i.e. its μ-density is given
by
Qn+1
Eθ∼π [ i=1 pθ (xi )]
qXn+1 |Xn (xn+1 |x ) =
n
Qn . (VI.6)
Eθ∼π [ i=1 pθ (xi )]
(c) Conclude that the Bayes KL risk equals I(θ; Xn+1 |Xn ). Compare with the conclusion of
Exercise II.19 and the KL risk interpretation of batch regret in (13.35).
(d) Now, consider the χ^2 loss ℓ(P, P̂) = χ^2(P‖P̂). Using (I.12) in Exercise I.45 show that the
    optimal risk is given by

        inf_{P̂} E_{θ,X^n}[χ^2(P_θ‖P̂)] = E_{X^n}[ ( ∫ μ(dx_{n+1}) √(E_θ[p_θ(x_{n+1})^2 | X^n]) )^2 ] − 1,   (VI.7)

    attained by

        q_{X_{n+1}|X^n}(x_{n+1}|x^n) ∝ √(E_θ[p_θ(x_{n+1})^2 | X^n = x^n]).   (VI.8)
(e) Now, consider the reverse-χ^2 loss ℓ(P, P̂) = χ^2(P̂‖P), a weighted quadratic loss. Using
    (I.13) in Exercise I.45 show that the optimal risk is attained by

        q_{X_{n+1}|X^n}(x_{n+1}|x^n) ∝ (E_θ[p_θ(x_{n+1})^{−1} | X^n = x^n])^{−1}.   (VI.9)
(f) Consider the discrete alphabet [k] and X^n i.i.d. ∼ P, where P = (P_1, . . . , P_k) is drawn from
    the Dirichlet(α, . . . , α) prior. Applying the previous results (with μ the counting measure),
    show that the Bayes estimator for the KL loss and the reverse-χ^2 loss is given by the Krichevsky-
    Trofimov add-β estimator (Section 13.5)

        P̂(j) = (n_j + β)/(n + kβ),    n_j ≜ ∑_{i=1}^{n} 1{X_i = j},   (VI.10)

    where β = α for KL and β = α − 1 for reverse-χ^2 (assuming α ≥ 1). Hint: The posterior is
    (P_1, . . . , P_k)|X^n ∼ Dirichlet(n_1 + α, . . . , n_k + α) and P_j|X^n ∼ Beta(n_j + α, n − n_j + (k − 1)α).
(g) For the χ^2 loss, show that the Bayes estimator is

        P̂(j) = √((n_j + α)(n_j + α + 1)) / ∑_{j′=1}^{k} √((n_{j′} + α)(n_{j′} + α + 1)).   (VI.11)
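A small numerical illustration of parts (f) and (g) (an addition, with an arbitrary toy count vector of my choosing): under a Dirichlet(α, . . . , α) prior the three Bayes estimators can be computed directly from the counts.

    import numpy as np

    # Bayes estimators from Exercise VI.6 (f)-(g) under a Dirichlet(alpha,...,alpha) prior,
    # evaluated on a toy count vector (k = 4 symbols, n = 10 observations).
    alpha = 1.0
    counts = np.array([5, 3, 0, 2])
    n, k = counts.sum(), len(counts)

    def add_beta(beta):
        # add-beta estimator (VI.10)
        return (counts + beta) / (n + k * beta)

    p_kl     = add_beta(alpha)        # Bayes for KL loss: beta = alpha
    p_revchi = add_beta(alpha - 1)    # Bayes for reverse-chi^2 loss: beta = alpha - 1 (alpha >= 1)
    w        = np.sqrt((counts + alpha) * (counts + alpha + 1))
    p_chi    = w / w.sum()            # Bayes for chi^2 loss, (VI.11)
    print(p_kl, p_revchi, p_chi, sep="\n")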
VI.7 (Coin flips) Given X_1, . . . , X_n i.i.d. ∼ Ber(θ) with θ ∈ Θ = [0, 1], we aim to estimate θ with respect
to the quadratic loss function ℓ(θ, θ̂) = (θ − θ̂)^2. Denote the minimax risk by R*_n.
(a) Use the empirical frequency θ̂_emp = X̄ to estimate θ. Compute and plot the risk R_θ(θ̂_emp) and
    show that

        R*_n ≤ 1/(4n).
(b) Compute the Fisher information of Pθ = Ber(θ)⊗n and Qθ = Bin(n, θ). Explain why they
are equal.
(c) Invoke the Bayesian Cramér-Rao lower bound (Theorem 29.2) to show that

        R*_n = (1 + o(1))/(4n).
(d) Notice that the risk of θ̂_emp is maximized at θ = 1/2 (fair coin), which suggests that it might be
    possible to hedge against this situation by the following randomized estimator:

        θ̂_rand = { θ̂_emp  with probability δ;   1/2  with probability 1 − δ }.   (VI.12)

    Find the worst-case risk of θ̂_rand as a function of δ. Optimizing over δ, show the improved
    upper bound:

        R*_n ≤ 1/(4(n + 1)).
(e) As discussed in Remark 28.3, a randomized estimator can always be improved if the loss is
    convex; so we should average out the randomness in (VI.12) by considering the estimator

        θ̂* = E[θ̂_rand | X] = δ X̄ + (1 − δ)/2.   (VI.13)

    Optimizing over δ to minimize the worst-case risk, find the resulting estimator θ̂* and its
    risk, show that the risk is constant (independent of θ), and conclude

        R*_n ≤ 1/(4(1 + √n)^2).
(f) Next we show that the estimator θ̂* found in part (e) is exactly minimax and hence

        R*_n = 1/(4(1 + √n)^2).

    Consider the Beta(a, b) prior with density

        π(θ) = (Γ(a + b)/(Γ(a)Γ(b))) θ^{a−1}(1 − θ)^{b−1},   θ ∈ [0, 1],

    where Γ(a) ≜ ∫_0^∞ x^{a−1} e^{−x} dx. Show that if a = b = √n/2, then θ̂* coincides with the Bayes
    estimator for this prior, which is therefore least favorable. (Hint: work with the sufficient
    statistic S = X_1 + . . . + X_n.)
(g) Show that the least favorable prior is not unique; in fact, there is a continuum of them. (Hint:
consider the Bayes estimator E[θ|X] and show that it only depends on the first n + 1 moments
of π.)
(h) (k-ary alphabet) Suppose X_1, . . . , X_n are i.i.d. ∼ P on [k]. Show that for any k, n, the minimax squared
    risk of estimating P in Theorem 29.5 is exactly

        R*_sq(k, n) = inf_{P̂} sup_{P∈P_k} E[‖P̂ − P‖_2^2] = (1/(√n + 1)^2) · (k − 1)/k,   (VI.14)

    achieved by the add-(√n/k) estimator. (Hint: For the lower bound, show that the Bayes estimators
    for the squared loss and the KL loss coincide, then apply (VI.10) in Exercise VI.6.)
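An illustrative exact risk computation for parts (a), (e) and (f) (an addition, not part of the exercise): for n = 25 the worst-case risk of X̄ is 1/(4n) = 0.01, whereas θ̂* = (S + √n/2)/(n + √n) has constant risk 1/(4(1 + √n)^2) ≈ 0.0069.

    import numpy as np

    # Exact risk curves of X-bar and of theta* = (S + sqrt(n)/2) / (n + sqrt(n)).
    n = 25
    theta = np.linspace(0, 1, 1001)
    risk_emp = theta * (1 - theta) / n                          # risk of the empirical frequency
    var  = n * theta * (1 - theta) / (n + np.sqrt(n)) ** 2      # variance of theta*
    bias = np.sqrt(n) * (0.5 - theta) / (n + np.sqrt(n))        # bias of theta*
    risk_star = var + bias ** 2                                 # constant in theta
    print(risk_emp.max(), risk_star.max(), 1 / (4 * (1 + np.sqrt(n)) ** 2))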
VI.8 (Distribution estimation in TV) Continuing (VI.14), we show that the minimax rate for
estimating P with respect to the total variation loss is

        R*_TV(k, n) ≜ inf_{P̂} sup_{P∈P_k} E_P[TV(P̂, P)] ≍ √(k/n) ∧ 1,   ∀ k ≥ 2, n ≥ 1.   (VI.15)
(a) Show that the MLE coincides with the empirical distribution.
(b) Show that the MLE achieves the RHS of (VI.15) within constant factors. (Hint: either apply
    (7.58) plus Pinsker's inequality, or directly use the variance of the empirical frequencies.)
(c) Establish the minimax lower bound. (Hint: apply Assouad's lemma, or Fano's inequality
    (with the volume method or an explicit construction of a packing), or the mutual information
    method directly.)
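A Monte Carlo illustration of the √(k/n) rate in (VI.15) for the empirical distribution (an addition; the uniform choice of P is mine):

    import numpy as np

    # E[TV(P_emp, P)] versus sqrt(k/n) for a uniform P on k = 20 symbols.
    rng = np.random.default_rng(1)
    k = 20
    P = np.full(k, 1 / k)
    for n in [50, 200, 800, 3200]:
        tv = [0.5 * np.abs(rng.multinomial(n, P) / n - P).sum() for _ in range(500)]
        print(n, np.mean(tv), np.sqrt(k / n))   # same order, up to constant factors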
VI.9 (Distribution estimation under reverse-χ^2) Consider estimating a discrete distribution P on [k] in
reverse-χ^2 divergence from X^n i.i.d. ∼ P, which is a weighted version of the quadratic loss in (VI.14).
We show that the minimax risk is given by

        R*_revχ²(k, n) ≜ inf_{P̂} sup_{P∈P([k])} E_P[χ^2(P̂‖P)] = (k − 1)/n.
(a) Show that taking P̂(j) = (1/n) ∑_{i=1}^{n} 1{X_i = j} to be the empirical distribution we always have
        R*_χ²(k, n) ≍ k/n.   (VI.17)
(a) Show that the empirical distribution, optimal for the TV loss (Exercise VI.8), implies the
claimed upper bound for the reverse KL loss (Hint: (7.58)). Show, on the other hand, that
for KL and χ2 it results in infinite loss.
(b) To show the upper bound for χ^2, consider the add-α estimator P̂ in (VI.10) with α = 1.
    Show that

        E[χ^2(P‖P̂)] ≤ (k − 1)/(n + 1).

    Using (7.34) conclude the upper bound part of (VI.16). (Hint: E_{N∼Bin(n,p)}[1/(N + 1)] =
    (1 − p̄^{n+1})/((n + 1)p), where p̄ = 1 − p.) (A Monte Carlo sanity check of this bound is
    sketched after part (e) below.)
(c) Show that in the small alphabet regime of k ≲ n, all lower bounds follow from (VI.15).
(d) Next assume k ≥ 4n. Consider a Dirichlet(α, . . . , α) prior in (13.16). Applying (VI.11) and
    (VI.7) for the Bayes χ^2 risk and choosing α = n/k, show the lower bound R*_χ²(k, n) ≳ k/n.
(e) Consider the prior under which P is uniform over a support set S chosen uniformly at ran-
    dom from all s-subsets of [k], where s < k is to be specified. Applying (VI.6), show that for
    this prior the Bayes estimator for the KL loss takes a natural form:

        P̂_j = { 1/s  if j ∈ Ŝ;   (1 − ŝ/s)/(k − ŝ)  if j ∉ Ŝ },

    where Ŝ denotes the set of symbols observed in X^n and ŝ = |Ŝ|.
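The Monte Carlo sanity check of part (b) above (an addition; the "true" P is drawn once from a flat Dirichlet simply to fix a test case): for the add-1 estimator the average of χ^2(P‖P̂) stays below (k − 1)/(n + 1).

    import numpy as np

    # Average chi^2(P || P_hat) of the add-1 estimator versus the bound (k-1)/(n+1).
    rng = np.random.default_rng(2)
    k, n, reps = 10, 40, 20000
    P = rng.dirichlet(np.ones(k))          # a fixed test distribution
    vals = []
    for _ in range(reps):
        counts = rng.multinomial(n, P)
        P_hat = (counts + 1) / (n + k)     # add-1 estimator, (VI.10) with beta = 1
        vals.append(np.sum(P ** 2 / P_hat) - 1.0)   # chi^2(P || P_hat)
    print(np.mean(vals), (k - 1) / (n + 1))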
(a) Let P be the class of distributions (which need not have a density) on the real line with
    variance at most σ^2. Show that R*_n = σ^2/n.
(b) Let P = P([0, 1]), the collection of all probability distributions on [0, 1]. Show that
    R*_n = 1/(4(1 + √n)^2). (Hint: For the upper bound, using the fact that, for any [0, 1]-valued
    random variable Z, Var(Z) ≤ E[Z](1 − E[Z]), mimic the analysis of the estimator (VI.13) in
    Ex. VI.7e.)
VI.12 Prove Theorem 30.2 using Fano's method. (Hint: apply Theorem 31.3 with T = ϵ · S^d_k, where
S^d_k denotes the Hamming sphere of radius k in d dimensions. Choose ϵ appropriately and apply
the Gilbert-Varshamov bound for the packing number of S^d_k in Theorem 27.6.)
VI.13 (Sharp minimax rate in sparse denoising) Continuing Theorem 30.2, in this exercise we deter-
mine the sharp minimax risk for denoising a high-dimensional sparse vector. In the notation of
(30.13), we show that, for the d-dimensional GLM model X ∼ N(θ, I_d), the following minimax
risk satisfies, as d → ∞ and k/d → 0,

        R*(k, d) ≜ inf_{θ̂} sup_{‖θ‖_0 ≤ k} E_θ[‖θ̂ − θ‖_2^2] = (2 + o(1)) k log(d/k).   (VI.18)
(a) For the lower bound, consider the prior π under which θ is uniformly distributed over
    {τe_1, . . . , τe_d} (the 1-sparse case), where the e_i denote the standard basis vectors, and let
    τ = √((2 − ϵ) log d). Show that for any ϵ > 0 the Bayes risk is (1 + o(1))τ^2, and conclude the
    lower bound part of

        R*(1, d) = (2 + o(1)) log d,   d → ∞.   (VI.19)

    (Hint: either apply the mutual information method, or directly compute the Bayes risk by
    evaluating the conditional mean and conditional variance.)
(b) Demonstrate an estimator θ̂ that achieves the RHS of (VI.19) asymptotically. (Hint: consider
the hard-thresholding estimator (30.13) or the MLE (30.11).)
(c) To prove the lower bound part of (VI.18), prove the following generic result

        R*(k, d) ≥ k R*(1, d/k),

    and then apply (VI.19). (Hint: consider a prior of d/k blocks each of which is 1-sparse.)
(d) Similar to the 1-sparse case, demonstrate an estimator θ̂ that achieves the RHS of (VI.18)
asymptotically.
Note: For both the upper and lower bound, the normal tail bound in Exercise V.25 is helpful.
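A Monte Carlo illustration of the k log(d/k) scale in (VI.18) (an addition, not part of the exercise): hard thresholding at t = √(2 log(d/k)) applied to a k-sparse θ whose nonzero entries sit at the threshold. The simulated risk is of order k log(d/k); the sharp factor 2 + o(1) emerges only in the limit and with the more careful choices made in the exercise.

    import numpy as np

    # Risk of hard thresholding at t = sqrt(2 log(d/k)) for a k-sparse theta.
    rng = np.random.default_rng(3)
    d, k, reps = 100_000, 10, 50
    t = np.sqrt(2 * np.log(d / k))
    theta = np.zeros(d)
    theta[:k] = t                                 # nonzero entries placed at the threshold
    risks = []
    for _ in range(reps):
        x = theta + rng.standard_normal(d)
        est = np.where(np.abs(x) > t, x, 0.0)     # hard-thresholding estimator
        risks.append(np.sum((est - theta) ** 2))
    print(np.mean(risks), k * np.log(d / k), 2 * k * np.log(d / k))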
VI.14 Consider the following functional estimation problem in GLM. Observing X ∼ N(θ, I_d), we
intend to estimate the maximal coordinate of θ: T(θ) = θ_max ≜ max{θ_1, . . . , θ_d}. Prove the
minimax rate

        inf_{T̂} sup_{θ∈R^d} E_θ[(T̂ − θ_max)^2] ≍ log d.   (VI.20)

(a) Prove the upper bound by considering T̂ = X_max, the plug-in estimator based on the MLE.
(b) For the lower bound, consider testing

        H_0 : θ = 0   versus   H_1 : θ ∼ Unif{τe_1, τe_2, . . . , τe_d},

    where the e_i are the standard basis vectors and τ > 0. Then under H_0, X ∼ P_0 = N(0, I_d); under H_1,
    X ∼ P_1 = (1/d) ∑_{i=1}^{d} N(τe_i, I_d). Show that χ^2(P_1‖P_0) = (e^{τ^2} − 1)/d. (Hint: Exercise I.48.)
(c) Applying the joint range (7.32) (or (7.38)) to bound TV(P0 , P1 ), conclude the lower bound
part of (VI.20) via Le Cam’s method (Theorem 31.1).
(d) By improving both the upper and lower bound prove the sharp version:

        inf_{T̂} sup_{θ∈R^d} E_θ[(T̂ − θ_max)^2] = (1/2 + o(1)) log d,   d → ∞.   (VI.21)
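Part (a) and the sharp version in part (d) both hinge on the fact that the maximum of d independent standard normals concentrates around √(2 log d); a quick Monte Carlo of that driving quantity (an addition, for illustration only):

    import numpy as np

    # E[max of d standard normals] versus sqrt(2 log d).
    rng = np.random.default_rng(4)
    for d in [10**2, 10**4, 10**6]:
        m = np.mean([rng.standard_normal(d).max() for _ in range(100)])
        print(d, m, np.sqrt(2 * np.log(d)))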
VI.15 (Suboptimality of MLE in high dimensions [55]) Consider the d-dimensional GLM: X ∼
N(θ, I_d), where θ belongs to the parameter space

        Θ = { θ ∈ R^d : |θ_1| ≤ d^{1/4}, ‖θ_{\1}‖_2 ≤ 2(1 − d^{−1/4}|θ_1|) }

with θ_{\1} ≡ (θ_2, . . . , θ_d). For the square loss, prove the following for sufficiently large d.
(a) The minimax risk is bounded:

        inf_{θ̂} sup_{θ∈Θ} E_θ[‖θ̂ − θ‖_2^2] ≲ 1.

(b) The worst-case risk of the MLE is unbounded:

        sup_{θ∈Θ} E_θ[‖θ̂_MLE − θ‖_2^2] ≳ √d.
VI.16 (Covariance model) Let X_1, . . . , X_n be i.i.d. ∼ N(0, Σ), where Σ is a d × d covariance matrix. Let us
show that the minimax quadratic risk of estimating Σ using X_1, . . . , X_n satisfies

        inf_{Σ̂} sup_{‖Σ‖_F ≤ r} E[‖Σ̂ − Σ‖_F^2] ≍ (d/n ∧ 1) r^2,   ∀ r > 0, d, n ∈ N,   (VI.22)

where ‖Σ̂ − Σ‖_F^2 = ∑_{ij} (Σ̂_ij − Σ_ij)^2.
(a) Show that, unlike in the location model, without restricting to a compact parameter space for Σ
    the minimax risk in (VI.22) is infinite.
(b) Consider the sample covariance matrix Σ̂ = (1/n) ∑_{i=1}^{n} X_i X_i^⊤. Show that

        E[‖Σ̂ − Σ‖_F^2] = (1/n)(‖Σ‖_F^2 + Tr(Σ)^2)

    and use this to deduce the minimax upper bound in (VI.22). (A Monte Carlo check of this
    identity is sketched after this exercise.)
(c) To prove the minimax lower bound, we can proceed in several steps. Show that for any
    positive semidefinite (PSD) Σ_0, Σ_1 ⪰ 0, the KL divergence satisfies

        D(N(0, I_d + Σ_0) ‖ N(0, I_d + Σ_1)) ≤ (1/2) ‖Σ_0^{1/2} − Σ_1^{1/2}‖_F^2,   (VI.23)

    where I_d is the d × d identity matrix. (Hint: apply (2.8).)
(d) Let B(δ) denote the Frobenius ball of radius δ centered at the zero matrix. Let PSD = {X :
    X ⪰ 0} denote the collection of d × d PSD matrices. Show that

        vol(B(δ) ∩ PSD) / vol(B(δ)) = P[Z ⪰ 0],   (VI.24)

    where Z is a random matrix distributed according to the Gaussian Orthogonal Ensemble
    (GOE), that is, Z is symmetric with independent diagonal entries Z_ii i.i.d. ∼ N(0, 2) and
    off-diagonal entries Z_ij i.i.d. ∼ N(0, 1).
(e) Show that P[Z ⪰ 0] ≥ c^{d^2} for some absolute constant c.⁴
(f) Prove the following lower bound on the packing number of the set of PSD matrices:

        M(B(δ) ∩ PSD, ‖·‖_F, ϵ) ≥ (c′δ/ϵ)^{d^2/2}   (VI.25)

    for some absolute constant c′. (Hint: Use the volume bound and the results of Parts (d) and
    (e).)
(g) Complete the proof of the lower bound of (VI.22). (Hint: WLOG, we can consider r ≍ √d and
    aim for the lower bound Ω(d^2/n ∧ d). Take the packing from (VI.25) and shift it by the identity
    matrix I. Then apply Fano's method and use (VI.23).)
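The Monte Carlo check of part (b) above (an addition; Σ below is an arbitrary PSD matrix generated only to fix a test case):

    import numpy as np

    # E||Sigma_hat - Sigma||_F^2 = (||Sigma||_F^2 + Tr(Sigma)^2) / n for the sample covariance.
    rng = np.random.default_rng(5)
    d, n, reps = 5, 20, 5000
    A = rng.standard_normal((d, d))
    Sigma = A @ A.T / d                              # an arbitrary PSD covariance
    L = np.linalg.cholesky(Sigma)
    vals = []
    for _ in range(reps):
        X = rng.standard_normal((n, d)) @ L.T        # rows are i.i.d. N(0, Sigma)
        S_hat = X.T @ X / n
        vals.append(np.sum((S_hat - Sigma) ** 2))
    print(np.mean(vals), (np.sum(Sigma ** 2) + np.trace(Sigma) ** 2) / n)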
VI.17 For a family of probability distributions P and a functional T : P → R define its χ^2-modulus
of continuity δ_χ²(t). When the functional T is affine and continuous, and P is compact⁵, it can be
shown [346] that

        (1/7) δ_χ²(1/n)^2 ≤ inf_{T̂_n} sup_{P∈P} E_{X_i i.i.d.∼P}[(T(P) − T̂_n(X_1, . . . , X_n))^2] ≤ δ_χ²(1/n)^2.   (VI.26)
Consider the following problem (interval censored model): A lab conducts experiments with
n mice. In the i-th mouse a tumour develops at time A_i ∈ [0, 1], with the A_i i.i.d. ∼ π where π is a pdf
on [0, 1] bounded pointwise by 1/2 ≤ π ≤ 2. For each i the existence of a tumour is checked at
another random time B_i i.i.d. ∼ Unif(0, 1) with B_i ⊥⊥ A_i. Given observations X_i = (1{A_i ≤ B_i}, B_i)
one is trying to estimate T(π) = π[A ≤ 1/2]. Show that
⁴ Getting the exact exponent is a difficult result (cf. [26]). Here we only need some crude estimate.
⁵ Both under the same, but otherwise arbitrary, topology on P.
VI.21 (BMS channel comparison [371, 367]) Below X ∼ Ber(1/2) and P_{Y|X} is an input-symmetric
channel (BMS). It turns out that BSC and BEC are extremal for various partial orders. Prove
the following statements.
(a) If I_TV(X; Y) = (1/2)(1 − 2δ), then

        BSC_δ ≤_deg P_{Y|X} ≤_deg BEC_{2δ}.
Note: This bound is tight up to the first order: there exists a function g(q) = (1 + o(1))q log q
such that for all d > g(q), BOT with coloring channel on a d-ary tree has reconstruction.
VI.25 ([203]) Fix an integer q ≥ 2 and let X = [q]. Let λ ∈ [−1/(q − 1), 1] be a real number. Let us define
a special kind of q-ary symmetric channel, known as the Potts channel, by taking P_λ : X → X
as

        P_λ(y|x) = λ + (1 − λ)/q  if y = x,   and   P_λ(y|x) = (1 − λ)/q  if y ≠ x.

Prove that

        η_KL(P_λ) = qλ^2 / ((q − 2)λ + 2).
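A quick consistency check (an addition, not part of the exercise): for q = 2 the formula gives η_KL(P_λ) = 2λ^2/2 = λ^2, and writing λ = 1 − 2δ this recovers the familiar contraction coefficient η_KL(BSC_δ) = (1 − 2δ)^2 of the binary symmetric channel.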
VI.26 (Spectral Independence [20]) Say a probability distribution μ = μ_{X^n} supported on [q]^n is c-
pairwise independent if for every T ⊂ [n], σ_T ∈ [q]^T, the conditional measure μ^{(σ_T)} ≜ μ_{X_{T^c}|X_T=σ_T}
satisfies, for every ν_{X_{T^c}},

        ∑_{i≠j∈T^c} D(ν_{X_i,X_j} ‖ μ^{(σ_T)}_{X_i,X_j}) ≥ (2 − c/(n − |T| − 1)) ∑_{i∈T^c} D(ν_{X_i} ‖ μ^{(σ_T)}_{X_i}).
where EC_τ is the erasure channel, cf. Example 33.6. (Hint: Define f(τ) = D(EC_τ^{⊗n} ∘ ν ‖ EC_τ^{⊗n} ∘ μ)
and prove f″(τ) ≥ (c/τ) f′(τ).)
Remark: Applying the above with τ = 1/n shows that a Markov chain G_τ, known as (small-block)
Glauber dynamics for μ, mixes in O(n^{c+1} log n) time. Indeed, G_τ consists of first applying
EC_τ^{⊗n} and then "imputing" the erasures in the set S from the conditional distribution μ_{X_S|X_{S^c}}. It is
also known that c-pairwise independence is implied (under some additional conditions on μ
and for q = 2) by the uniform boundedness of the operator norms of the covariance matrices of
all μ^{(σ_T)} (see [91] for details). Thus the hard question of bounding η_KL(μ, G_τ) is first reduced to
η_KL(μ, EC_τ^{⊗n}) and then to the study of the spectra of covariance matrices.
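A minimal sketch of the "erase, then impute" dynamics described in the remark (an addition; the Ising-chain target μ, the coupling β, and the sequential single-site imputation below are my illustrative choices, not part of the exercise):

    import numpy as np

    # One step: erase each coordinate independently with probability tau (EC_tau),
    # then resample erased sites from their conditionals given the rest. For a cycle
    # Ising model we resample erased sites one at a time from the exact single-site
    # conditional; this matches the block conditional whenever the erased sites are
    # non-adjacent, which is the typical situation for tau = 1/n.
    rng = np.random.default_rng(6)
    n_sites, beta, tau = 30, 0.4, 1 / 30
    x = rng.choice([-1, 1], size=n_sites)

    def glauber_step(x):
        erased = np.nonzero(rng.random(n_sites) < tau)[0]
        for i in erased:
            h = beta * (x[(i - 1) % n_sites] + x[(i + 1) % n_sites])  # local field
            p_plus = 1 / (1 + np.exp(-2 * h))                         # P(X_i = +1 | rest)
            x[i] = 1 if rng.random() < p_plus else -1
        return x

    for _ in range(1000):
        x = glauber_step(x)
    print(x.mean())   # magnetization of the final state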
References
[1] E. Abbe and E. B. Adserà, “Subadditivity vol. 32, no. 4, pp. 533–542, 1986. (pp. 325
beyond trees and the Chi-squared mutual and 327)
information,” in 2019 IEEE International [9] R. Ahlswede and J. Körner, “Source cod-
Symposium on Information Theory (ISIT). ing with side information and a converse for
IEEE, 2019, pp. 697–701. (p. 135) degraded broadcast channels,” IEEE Trans-
[2] M. C. Abbott and B. B. Machta, “A scaling actions on Information Theory, vol. 21,
law from discrete to continuous solutions no. 6, pp. 629–637, 1975. (p. 227)
of channel capacity problems in the low- [10] S. M. Alamouti, “A simple transmit diver-
noise limit,” Journal of Statistical Physics, sity technique for wireless communica-
vol. 176, no. 1, pp. 214–227, 2019. (p. 248) tions,” IEEE Journal on selected areas in
[3] I. Abou-Faycal, M. Trott, and S. Shamai, communications, vol. 16, no. 8, pp. 1451–
“The capacity of discrete-time memoryless 1458, 1998. (p. 409)
rayleigh-fading channels,” IEEE Transac- [11] P. H. Algoet and T. M. Cover, “A sandwich
tion Information Theory, vol. 47, no. 4, pp. proof of the Shannon-Mcmillan-Breiman
1290 – 1301, 2001. (p. 409) theorem,” The annals of probability, pp.
[4] J. Acharya, C. L. Canonne, Y. Liu, Z. Sun, 899–909, 1988. (p. 234)
and H. Tyagi, “Interactive inference under [12] C. D. Aliprantis and K. C. Border, Infi-
information constraints,” IEEE Transac- nite Dimensional Analysis: a Hitchhiker’s
tions on Information Theory, vol. 68, no. 1, Guide, 3rd ed. Berlin: Springer-Verlag,
pp. 502–516, 2021. (p. 658) 2006. (p. 130)
[5] J. Acharya, C. L. Canonne, Z. Sun, and [13] N. Alon, “On the number of subgraphs of
H. Tyagi, “Unified lower bounds for inter- prescribed type of graphs with a given num-
active high-dimensional estimation under ber of edges,” Israel J. Math., vol. 38, no.
information constraints,” arXiv preprint 1-2, pp. 116–130, 1981. (p. 160)
arXiv:2010.06562, 2020. (p. 658) [14] N. Alon and A. Orlitsky, “A lower bound
[6] R. Ahlswede, “Extremal properties of rate on the expected length of one-to-one codes,”
distortion functions,” IEEE transactions on IEEE Transactions on Information The-
information theory, vol. 36, no. 1, pp. 166– ory, vol. 40, no. 5, pp. 1670–1672, 1994.
171, 1990. (p. 543) (p. 199)
[7] R. Ahlswede, B. Balkenhol, and L. Khacha- [15] N. Alon and J. H. Spencer, The Probabilis-
trian, “Some properties of fix free codes,” tic Method, 3rd ed. John Wiley & Sons,
in Proceedings First INTAS International 2008. (pp. 208 and 353)
Seminar on Coding Theory and Combina- [16] P. Alquier, “User-friendly introduction
torics, Thahkadzor, Armenia, 1996, pp. 20– to PAC-Bayes bounds,” arXiv preprint
33. (p. 208) arXiv:2110.11216, 2021. (p. 83)
[8] R. Ahlswede and I. Csiszár, “Hypothesis [17] S.-I. Amari and H. Nagaoka, Methods of
testing with communication constraints,” information geometry. American Math-
IEEE transactions on information theory, ematical Soc., 2007, vol. 191. (pp. 40
and 307)
[18] G. Aminian, Y. Bu, L. Toni, M. R. Theory and Related fields, vol. 108, no. 4,
Rodrigues, and G. Wornell, “Characteriz- pp. 517–542, 1997. (p. 668)
ing the generalization error of Gibbs algo- [27] S. Artstein, K. Ball, F. Barthe, and A. Naor,
rithm with symmetrized KL information,” “Solution of Shannon’s problem on the
arXiv preprint arXiv:2107.13656, 2021. monotonicity of entropy,” Journal of the
(p. 187) American Mathematical Society, pp. 975–
[19] V. Anantharam, A. Gohari, S. Kamath, 982, 2004. (p. 64)
and C. Nair, “On maximal correlation, [28] S. Artstein, V. Milman, and S. J. Szarek,
hypercontractivity, and the data processing “Duality of metric entropy,” Annals of math-
inequality studied by Erkip and Cover,” ematics, pp. 1313–1328, 2004. (p. 535)
arXiv preprint arXiv:1304.6133, 2013. [29] J. Baik, G. Ben Arous, and S. Péché, “Phase
(p. 638) transition of the largest eigenvalue for non-
[20] N. Anari, K. Liu, and S. O. Gharan, “Spec- null complex sample covariance matrices,”
tral independence in high-dimensional The Annals of Probability, vol. 33, no. 5, pp.
expanders and applications to the hardcore 1643–1697, 2005. (p. 651)
model,” SIAM Journal on Computing, [30] A. V. Banerjee, “A simple model of herd
no. 0, pp. FOCS20–1, 2021. (p. 671) behavior,” The Quarterly Journal of Eco-
[21] T. W. Anderson, “The integral of a symmet- nomics, vol. 107, no. 3, pp. 797–817, 1992.
ric unimodal function over a symmetric con- (pp. 135 and 181)
vex set and some probability inequalities,” [31] Z. Bar-Yossef, T. S. Jayram, R. Kumar, and
Proceedings of the American Mathematical D. Sivakumar, “An information statistics
Society, vol. 6, no. 2, pp. 170–176, 1955. approach to data stream and communica-
(p. 572) tion complexity,” Journal of Computer and
[22] A. Antos and I. Kontoyiannis, “Conver- System Sciences, vol. 68, no. 4, pp. 702–
gence properties of functional estimates for 732, 2004. (p. 182)
discrete distributions,” Random Structures [32] B. Bárány and I. Kolossváry, “On the abso-
& Algorithms, vol. 19, no. 3-4, pp. 163–193, lute continuity of the Blackwell measure,”
2001. (p. 138) Journal of Statistical Physics, vol. 159, pp.
[23] E. Arıkan, “Channel polarization: A 158–171, 2015. (p. 111)
method for constructing capacity-achieving [33] B. Bárány, M. Pollicott, and K. Simon,
codes for symmetric binary-input memo- “Stationary measures for projective trans-
ryless channels,” IEEE Transactions on formations: the Blackwell and Furstenberg
information Theory, vol. 55, no. 7, pp. measures,” Journal of Statistical Physics,
3051–3073, 2009. (p. 346) vol. 148, pp. 393–421, 2012. (p. 111)
[24] S. Arimoto, “An algorithm for computing [34] A. Barg and G. D. Forney, “Random codes:
the capacity of arbitrary discrete memory- Minimum distances and error exponents,”
less channels,” IEEE Transactions on Infor- IEEE Transactions on Information The-
mation Theory, vol. 18, no. 1, pp. 14–20, ory, vol. 48, no. 9, pp. 2568–2573, 2002.
1972. (p. 102) (p. 433)
[25] ——, “On the converse to the coding theo- [35] A. Barg and A. McGregor, “Distance distri-
rem for discrete memoryless channels (cor- bution of binary codes and the error proba-
resp.),” IEEE Transactions on Information bility of decoding,” IEEE Transactions on
Theory, vol. 19, no. 3, pp. 357–359, 1973. Information Theory, vol. 51, no. 12, pp.
(p. 433) 4237–4246, 2005. (p. 433)
[26] G. B. Arous and A. Guionnet, “Large devi- [36] S. Barman and O. Fawzi, “Algorithmic
ations for Wigner’s law and Voiculescu’s aspects of optimal channel coding,” IEEE
non-commutative entropy,” Probability Transactions on Information Theory,
vol. 64, no. 2, pp. 1038–1045, 2017. [47] C. Berrou, A. Glavieux, and P. Thiti-
(pp. 366, 367, 368, and 369) majshima, “Near Shannon limit
[37] A. R. Barron, “Universal approximation error-correcting coding and decoding:
bounds for superpositions of a sigmoidal Turbo-codes. 1,” in Proceedings of
function,” IEEE Trans. Inf. Theory, vol. 39, ICC’93-IEEE International Conference on
no. 3, pp. 930–945, 1993. (p. 534) Communications, vol. 2. IEEE, 1993, pp.
[38] P. L. Bartlett and S. Mendelson, 1064–1070. (pp. 346 and 411)
“Rademacher and Gaussian complexi- [48] J. C. Berry, “Minimax estimation of a
ties: Risk bounds and structural results,” bounded normal mean vector,” Journal of
Journal of Machine Learning Research, Multivariate Analysis, vol. 35, no. 1, pp.
vol. 3, no. Nov, pp. 463–482, 2002. (p. 86) 130–139, 1990. (p. 587)
[39] G. Basharin, “On a statistical estimate for [49] D. P. Bertsekas, A. Nedi�, and A. E.
the entropy of a sequence of independent Ozdaglar, Convex analysis and optimiza-
random variables,” Theory of Probability & tion. Belmont, MA, USA: Athena Scien-
Its Applications, vol. 4, no. 3, pp. 333–336, tific, 2003. (p. 93)
1959. (p. 584) [50] N. Bhatnagar, J. Vera, E. Vigoda, and
[40] A. Beck, First-order methods in optimiza- D. Weitz, “Reconstruction for colorings on
tion. SIAM, 2017. (p. 92) trees,” SIAM Journal on Discrete Mathe-
[41] A. Beirami and F. Fekri, “Fundamental lim- matics, vol. 25, no. 2, pp. 809–826, 2011.
its of universal lossless one-to-one com- (p. 644)
pression of parametric sources,” in Informa- [51] A. Bhatt, B. Nazer, O. Ordentlich, and
tion Theory Workshop (ITW). IEEE, 2014, Y. Polyanskiy, “Information-distilling
pp. 212–216. (p. 250) quantizers,” IEEE Transactions on
[42] C. H. Bennett, “Notes on Landauer’s princi- Information Theory, vol. 67, no. 4, pp.
ple, reversible computation, and Maxwell’s 2472–2487, 2021. (p. 190)
Demon,” Studies In History and Philoso- [52] A. Bhattacharyya, “On a measure of diver-
phy of Science Part B: Studies In History gence between two statistical populations
and Philosophy of Modern Physics, vol. 34, defined by their probability distributions,”
no. 3, pp. 501–510, 2003. (p. xix) Bull. Calcutta Math. Soc., vol. 35, pp. 99–
[43] C. H. Bennett, P. W. Shor, J. A. Smolin, and 109, 1943. (p. 117)
A. V. Thapliyal, “Entanglement-assisted [53] L. Birgé, “Approximation dans les espaces
classical capacity of noisy quantum chan- métriques et théorie de l’estimation,”
nels,” Physical Review Letters, vol. 83, Zeitschrift für Wahrscheinlichkeitstheorie
no. 15, p. 3081, 1999. (p. 498) und Verwandte Gebiete, vol. 65, no. 2, pp.
[44] W. R. Bennett, “Spectra of quantized sig- 181–237, 1983. (pp. xxii, 602, and 614)
nals,” The Bell System Technical Journal, [54] ——, “On estimating a density using
vol. 27, no. 3, pp. 446–472, 1948. (p. 483) Hellinger distance and some other strange
[45] P. Bergmans, “A simple converse for broad- facts,” Probability theory and related fields,
cast channels with additive white Gaus- vol. 71, no. 2, pp. 271–291, 1986. (p. 625)
sian noise (corresp.),” IEEE Transactions [55] L. Birgé, “Model selection via testing : an
on Information Theory, vol. 20, no. 2, pp. alternative to (penalized) maximum likeli-
279–280, 1974. (p. 65) hood estimators,” Annales de l’I.H.P. Prob-
[46] J. M. Bernardo, “Reference posterior dis- abilités et statistiques, vol. 42, no. 3, pp.
tributions for Bayesian inference,” Journal 273–325, 2006. (p. 667)
of the Royal Statistical Society: Series B [56] L. Birgé, “Robust tests for model selection,”
(Methodological), vol. 41, no. 2, pp. 113– From probability to statistics and back:
128, 1979. (p. 253) high-dimensional models and processes–A
Festschrift in honor of Jon A. Wellner, IMS [67] H. F. Bohnenblust, “Convex regions and
Collections, Volume 9, pp. 47–64, 2013. projections in Minkowski spaces,” Ann.
(p. 613) Math., vol. 39, no. 2, pp. 301–308, 1938.
[57] M. Š. Birman and M. Solomjak, “Piecewise- (p. 96)
polynomial approximations of functions of [68] A. Borovkov, Mathematical Statistics.
the classes,” Mathematics of the USSR- CRC Press, 1999. (pp. xxii, 141, 581,
Sbornik, vol. 2, no. 3, p. 295, 1967. (p. 538) and 582)
[58] N. Blachman, “The convolution inequal- [69] S. Boucheron, G. Lugosi, and P. Massart,
ity for entropy powers,” IEEE Transactions Concentration Inequalities: A Nonasymp-
on Information theory, vol. 11, no. 2, pp. totic Theory of Independence. OUP
267–271, 1965. (p. 185) Oxford, 2013. (pp. 85, 151, 302, and 541)
[59] D. Blackwell, L. Breiman, and [70] O. Bousquet, D. Kane, and S. Moran,
A. Thomasian, “The capacity of a class of “The optimal approximation factor in den-
channels,” The Annals of Mathematical sity estimation,” in Conference on Learn-
Statistics, pp. 1229–1241, 1959. (p. 465) ing Theory. PMLR, 2019, pp. 318–341.
[60] D. H. Blackwell, “The entropy of functions (p. 622)
of finite-state Markov chains,” Transactions [71] D. Braess and T. Sauer, “Bernstein poly-
of the first Prague conference on infor- nomials and learning theory,” Journal of
mation theory, statistical decision func- Approximation Theory, vol. 128, no. 2, pp.
tions, random processes, pp. 13–20, 1956. 187–206, 2004. (p. 665)
(p. 111) [72] D. Braess, J. Forster, T. Sauer, and
[61] R. E. Blahut, “Hypothesis testing and infor- H. U. Simon, “How to achieve minimax
mation theory,” IEEE Trans. Inf. Theory, expected Kullback-Leibler distance from
vol. 20, no. 4, pp. 405–417, 1974. (p. 289) an unknown finite distribution,” in Algo-
[62] R. Blahut, “Computation of channel rithmic Learning Theory. Springer, 2002,
capacity and rate-distortion functions,” pp. 380–394. (p. 665)
IEEE transactions on Information Theory, [73] M. Braverman, A. Garg, T. Ma, H. L.
vol. 18, no. 4, pp. 460–473, 1972. (p. 102) Nguyen, and D. P. Woodruff, “Communica-
[63] P. M. Bleher, J. Ruiz, and V. A. Zagrebnov, tion lower bounds for statistical estimation
“On the purity of the limiting gibbs state for problems via a distributed data processing
the Ising model on the Bethe lattice,” Jour- inequality,” in Proceedings of the forty-
nal of Statistical Physics, vol. 79, no. 1, pp. eighth annual ACM symposium on Theory
473–482, Apr 1995. (pp. 642 and 643) of Computing. ACM, 2016, pp. 1011–
[64] S. G. Bobkov and F. Götze, “Exponential 1020. (pp. 657 and 658)
integrability and transportation cost related [74] L. M. Bregman, “Some properties of non-
to logarithmic Sobolev inequalities,” Jour- negative matrices and their permanents,”
nal of Functional Analysis, vol. 163, no. 1, Soviet Math. Dokl., vol. 14, no. 4, pp. 945–
pp. 1–28, 1999. (p. 656) 949, 1973. (p. 161)
[65] S. Bobkov and G. P. Chistyakov, “Entropy [75] L. Breiman, “The individual ergodic the-
power inequality for the Rényi entropy.” orem of information theory,” Ann. Math.
IEEE Transactions on Information Theory, Stat., vol. 28, no. 3, pp. 809–811, 1957.
vol. 61, no. 2, pp. 708–714, 2015. (p. 27) (p. 234)
[66] T. Bohman, “A limit theorem for the Shan- [76] L. Brillouin, Science and information the-
non capacities of odd cycles I,” Proceedings ory, 2nd Ed. Academic Press, 1962.
of the American Mathematical Society, vol. (p. xvii)
131, no. 11, pp. 3559–3569, 2003. (p. 452) [77] L. D. Brown, “Fundamentals of statisti-
cal exponential families with applications
Information Theory, vol. 65, no. 1, pp. 380– [110] I. Csiszár and J. Körner, “Graph decomposi-
405, 2018. (p. 437) tion: a new key to coding theorems,” IEEE
[99] J. H. Conway and N. J. A. Sloane, Sphere Trans. Inf. Theory, vol. 27, no. 1, pp. 5–12,
packings, lattices and groups. Springer 1981. (p. 47)
Science & Business Media, 1999, vol. 290. [111] ——, Information Theory: Coding The-
(p. 527) orems for Discrete Memoryless Systems.
[100] M. Costa, “A new entropy power inequal- New York: Academic, 1981. (pp. xvii, xx,
ity,” IEEE Transactions on Information xxi, 355, 499, 500, 502, and 647)
Theory, vol. 31, no. 6, pp. 751–760, 1985. [112] I. Csiszár and P. C. Shields, “Information
(p. 64) theory and statistics: A tutorial,” Founda-
[101] D. J. Costello and G. D. Forney, “Channel tions and Trends in Communications and
coding: The road to channel capacity,” Pro- Information Theory, vol. 1, no. 4, pp. 417–
ceedings of the IEEE, vol. 95, no. 6, pp. 528, 2004. (pp. 104 and 250)
1150–1177, 2007. (p. 411) [113] I. Csiszár and G. Tusnády, “Informa-
[102] T. A. Courtade, “Monotonicity of entropy tion geometry and alternating minimiza-
and Fisher information: a quick proof via tion problems,” Statistics & Decision, Sup-
maximal correlation,” Communications in plement Issue No, vol. 1, 1984. (pp. 102
Information and Systems, vol. 16, no. 2, pp. and 103)
111–115, 2017. (p. 64) [114] I. Csiszár, “I-divergence geometry of prob-
[103] ——, “A strong entropy power inequality,” ability distributions and minimization prob-
IEEE Transactions on Information Theory, lems,” The Annals of Probability, pp. 146–
vol. 64, no. 4, pp. 2173–2192, 2017. (p. 64) 158, 1975. (pp. 303 and 312)
[104] T. M. Cover, “Universal data compression [115] I. Csiszár and J. Körner, Information The-
and portfolio selection,” in Proceedings of ory: Coding Theorems for Discrete Memo-
37th Conference on Foundations of Com- ryless Systems, 2nd ed. Cambridge Uni-
puter Science. IEEE, 1996, pp. 534–538. versity Press, 2011. (pp. xx, xxi, 13, 216,
(p. xx) and 426)
[105] T. M. Cover and B. Gopinath, Open prob- [116] P. Cuff, “Distributed channel synthesis,”
lems in communication and computation. IEEE Transactions on Information The-
Springer Science & Business Media, 2012. ory, vol. 59, no. 11, pp. 7071–7096, 2013.
(p. 413) (p. 503)
[106] T. M. Cover and J. A. Thomas, Elements [117] M. Cuturi, “Sinkhorn distances: Light-
of information theory, 2nd Ed. New speed computation of optimal transport,”
York, NY, USA: Wiley-Interscience, 2006. Advances in neural information process-
(pp. xvii, xx, xxi, 65, 210, 216, 355, 466, ing systems, vol. 26, pp. 2292–2300, 2013.
and 501) (p. 105)
[107] H. Cramér, “Über eine eigenschaft der [118] L. Davisson, R. McEliece, M. Pursley, and
normalen verteilungsfunktion,” Mathema- M. Wallace, “Efficient universal noiseless
tische Zeitschrift, vol. 41, no. 1, pp. 405– source codes,” IEEE Transactions on Infor-
414, 1936. (p. 101) mation Theory, vol. 27, no. 3, pp. 269–279,
[108] ——, Mathematical methods of statistics. 1981. (p. 270)
Princeton university press, 1946. (p. 576) [119] A. Decelle, F. Krzakala, C. Moore, and
[109] I. Csiszár, “Information-type measures of L. Zdeborová, “Asymptotic analysis of the
difference of probability distributions and stochastic block model for modular net-
indirect observation,” Studia Sci. Math. works and its algorithmic applications,”
Hungar., vol. 2, pp. 229–318, 1967. (p. 115) Physical review E, vol. 84, no. 6, p. 066106,
2011. (p. 642)
i i
i i
i i
[120] A. Dembo and O. Zeitouni, Large devia- on Differential Equations, Two on Informa-
tions techniques and applications. New tion Theory, American Mathematical Soci-
York: Springer Verlag, 2009. (pp. 291 ety Translations: Series 2, Volume 33, 1963.
and 308) (p. 83)
[121] A. P. Dempster, N. M. Laird, and D. B. [130] ——, “Mathematical problems in the Shan-
Rubin, “Maximum likelihood from incom- non theory of optimal coding of informa-
plete data via the EM algorithm,” Journal tion,” in Proc. 4th Berkeley Symp. Mathe-
of the royal statistical society. Series B matics, Statistics, and Probability, vol. 1,
(methodological), pp. 1–38, 1977. (p. 103) Berkeley, CA, USA, 1961, pp. 211–252.
[122] P. Diaconis and L. Saloff-Coste, “Logarith- (p. 435)
mic Sobolev inequalities for finite Markov [131] ——, “Asymptotic bounds on error prob-
chains,” Ann. Probab., vol. 6, no. 3, pp. ability for transmission over DMC with
695–750, 1996. (pp. 133 and 191) symmetric transition probabilities,” Theor.
[123] P. Diaconis and D. Freedman, “Finite Probability Appl., vol. 7, pp. 283–311,
exchangeable sequences,” The Annals of 1962. (pp. 383 and 446)
Probability, vol. 8, no. 4, pp. 745–764, [132] J. Dong, A. Roth, and W. J. Su, “Gaus-
1980. (p. 187) sian differential privacy,” Journal of the
[124] P. Diaconis and D. Stroock, “Geometric Royal Statistical Society Series B: Statisti-
bounds for eigenvalues of Markov chains,” cal Methodology, vol. 84, no. 1, pp. 3–37,
The Annals of Applied Probability, vol. 1, 2022. (p. 182)
no. 1, pp. 36–61, 1991. (pp. 641 and 669) [133] D. L. Donoho, “Wald lecture I: Counting
[125] H. Djellout, A. Guillin, and L. Wu, “Trans- bits with Kolmogorov and Shannon,” Note
portation cost-information inequalities and for the Wald Lectures, IMS Annual Meeting,
applications to random dynamical systems July 1997. (p. 543)
and diffusions,” The Annals of Probabil- [134] M. D. Donsker and S. S. Varadhan,
ity, vol. 32, no. 3B, pp. 2702–2732, 2004. “Asymptotic evaluation of certain Markov
(p. 656) process expectations for large time. IV,”
[126] R. Dobrushin and B. Tsybakov, “Informa- Communications on Pure and Applied
tion transmission with additional noise,” Mathematics, vol. 36, no. 2, pp. 183–212,
IRE Transactions on Information Theory, 1983. (p. 72)
vol. 8, no. 5, pp. 293–304, 1962. (p. 548) [135] J. L. Doob, Stochastic Processes. New
[127] R. L. Dobrushin, “Central limit theorem York Wiley, 1953. (p. 233)
for nonstationary Markov chains, I,” The- [136] F. du Pin Calmon, Y. Polyanskiy, and
ory Probab. Appl., vol. 1, no. 1, pp. 65–80, Y. Wu, “Strong data processing inequalities
1956. (p. 630) for input constrained additive noise chan-
[128] ——, “A simplified method of experimen- nels,” IEEE Transactions on Information
tally evaluating the entropy of a station- Theory, vol. 64, no. 3, pp. 1879–1892, 2017.
ary sequence,” Theory of Probability & Its (p. 325)
Applications, vol. 3, no. 4, pp. 428–430, [137] J. C. Duchi, M. I. Jordan, M. J. Wainwright,
1958. (p. 584) and Y. Zhang, “Optimality guarantees for
[129] ——, “A general formulation of the funda- distributed statistical estimation,” arXiv
mental theorem of Shannon in the theory of preprint arXiv:1405.0782, 2014. (pp. 657
information,” Uspekhi Mat. Nauk, vol. 14, and 658)
no. 6, pp. 3–104, 1959, English translation [138] J. Duda, “Asymmetric numeral systems:
in Eleven Papers in Analysis: Nine Papers entropy coding combining speed of
Huffman coding with compression rate
of arithmetic coding,” arXiv preprint Mathematical Statistics, vol. 43, no. 3, pp.
arXiv:1311.2540, 2013. (p. 246) 865–870, 1972. (p. 167)
[139] R. M. Dudley, Uniform central limit theo- [150] ——, “Coding for noisy channels,” IRE
rems. Cambridge university press, 1999, Convention Record, vol. 3, pp. 37–46, 1955.
no. 63. (pp. 86 and 535) (p. 365)
[140] G. Dueck, “The strong converse to the cod- [151] E. O. Elliott, “Estimates of error rates for
ing theorem for the multiple-access chan- codes on burst-noise channels,” Bell Syst.
nel,” J. Comb. Inform. Syst. Sci, vol. 6, no. 3, Tech. J., vol. 42, pp. 1977–1997, Sep. 1963.
pp. 187–196, 1981. (p. 187) (p. 111)
[141] G. Dueck and J. Körner, “Reliability [152] D. M. Endres and J. E. Schindelin, “A new
function of a discrete memoryless chan- metric for probability distributions,” IEEE
nel at rates above capacity (corresp.),” Transactions on Information theory, vol. 49,
IEEE Transactions on Information Theory, no. 7, pp. 1858–1860, 2003. (p. 117)
vol. 25, no. 1, pp. 82–85, 1979. (p. 433) [153] P. Erdös, “Some remarks on the theory of
[142] N. Dunford and J. T. Schwartz, Linear oper- graphs,” Bulletin of the American Mathe-
ators, part 1: general theory. John Wiley matical Society, vol. 53, no. 4, pp. 292–294,
& Sons, 1988, vol. 10. (p. 80) 1947. (p. 215)
[143] R. Durrett, Probability: Theory and Exam- [154] P. Erdös and A. Rényi, “On random graphs,
ples, 4th ed. Cambridge University Press, I,” Publicationes Mathematicae (Debre-
2010. (p. 125) cen), vol. 6, pp. 290–297, 1959. (p. 653)
[144] A. Dytso, S. Yagli, H. V. Poor, and S. S. [155] V. Erokhin, “ε-entropy of a discrete random
Shitz, “The capacity achieving distribu- variable,” Theory of Probability & Its Appli-
tion for the amplitude constrained additive cations, vol. 3, no. 1, pp. 97–100, 1958.
Gaussian channel: An upper bound on the (p. 547)
number of mass points,” IEEE Transactions [156] K. Eswaran and M. Gastpar, “Remote
on Information Theory, vol. 66, no. 4, pp. source coding under Gaussian noise: Duel-
2006–2022, 2019. (p. 408) ing roles of power and entropy power,”
[145] H. G. Eggleston, Convexity, ser. Tracts in IEEE Transactions on Information Theory,
Math and Math. Phys. Cambridge Univer- 2019. (p. 657)
sity Press, 1958, vol. 47. (p. 129) [157] W. Evans and N. Pippenger, “On the maxi-
[146] A. El Alaoui and A. Montanari, “An mum tolerable noise for reliable computa-
information-theoretic view of stochastic tion by formulas,” IEEE Transactions on
localization,” IEEE Transactions on Infor- Information Theory, vol. 44, no. 3, pp.
mation Theory, vol. 68, no. 11, pp. 7423– 1299–1305, May 1998. (p. 629)
7426, 2022. (p. 191) [158] W. S. Evans and L. J. Schulman, “Signal
[147] A. El Gamal and Y.-H. Kim, Network infor- propagation and noisy circuits,” IEEE
mation theory. Cambridge University Transactions on Information Theory,
Press, 2011. (pp. xxi and 501) vol. 45, no. 7, pp. 2367–2373, Nov 1999.
[148] R. Eldan, “Taming correlations through (p. 627)
entropy-efficient measure decompositions [159] M. Falahatgar, A. Orlitsky, V. Pichapati,
with applications to mean-field approxi- and A. Suresh, “Learning Markov distri-
mation,” Probability Theory and Related butions: Does estimation trump compres-
Fields, vol. 176, no. 3-4, pp. 737–755, 2020. sion?” in 2016 IEEE International Sympo-
(p. 191) sium on Information Theory (ISIT). IEEE,
[149] P. Elias, “The efficient construction of July 2016, pp. 2689–2693. (p. 258)
an unbiased random sequence,” Annals of
[160] M. Feder, “Gambling using a finite state [172] G. D. Forney, “Concatenated codes,”
machine,” IEEE Transactions on Informa- MIT RLE Technical Rep., vol. 440, 1965.
tion Theory, vol. 37, no. 5, pp. 1459–1465, (p. 378)
1991. (p. 264) [173] E. Friedgut and J. Kahn, “On the number
[161] M. Feder, N. Merhav, and M. Gut- of copies of one hypergraph in another,”
man, “Universal prediction of individual Israel J. Math., vol. 105, pp. 251–256, 1998.
sequences,” IEEE Trans. Inf. Theory, (p. 160)
vol. 38, no. 4, pp. 1258–1270, 1992. [174] P. Gács and J. Körner, “Common infor-
(p. 260) mation is far less than mutual informa-
[162] M. Feder and Y. Polyanskiy, “Sequential tion,” Problems of Control and Information
prediction under log-loss and misspecifica- Theory, vol. 2, no. 2, pp. 149–162, 1973.
tion,” in Conference on Learning Theory. (p. 338)
PMLR, 2021, pp. 1937–1964. (pp. 175 [175] A. Galanis, D. Štefankovi�, and E. Vigoda,
and 261) “Inapproximability of the partition function
[163] A. A. Fedotov, P. Harremoës, and F. Top- for the antiferromagnetic Ising and hard-
søe, “Refinements of Pinsker’s inequality,” core models,” Combinatorics, Probability
Information Theory, IEEE Transactions on, and Computing, vol. 25, no. 4, pp. 500–559,
vol. 49, no. 6, pp. 1491–1498, Jun. 2003. 2016. (p. 75)
(p. 131) [176] R. G. Gallager, “A simple derivation of
[164] W. Feller, An Introduction to Probability the coding theorem and some applications,”
Theory and Its Applications, 3rd ed. New IEEE Trans. Inf. Theory, vol. 11, no. 1, pp.
York: Wiley, 1970, vol. I. (p. 538) 3–18, 1965. (p. 360)
[165] ——, An Introduction to Probability The- [177] ——, Information Theory and Reliable
ory and Its Applications, 2nd ed. New Communication. New York: Wiley, 1968.
York: Wiley, 1971, vol. II. (p. 435) (pp. xvii, xxi, 383, and 432)
[166] T. S. Ferguson, Mathematical Statistics: A [178] R. Gallager, “The random coding bound
Decision Theoretic Approach. New York, is tight for the average code (corresp.),”
NY: Academic Press, 1967. (p. 558) IEEE Transactions on Information Theory,
[167] ——, “An inconsistent maximum likeli- vol. 19, no. 2, pp. 244–246, 1973. (p. 433)
hood estimate,” Journal of the American [179] R. Gardner, “The Brunn-Minkowski
Statistical Association, vol. 77, no. 380, pp. inequality,” Bulletin of the American
831–834, 1982. (p. 583) Mathematical Society, vol. 39, no. 3, pp.
[168] ——, A course in large sample theory. 355–405, 2002. (p. 573)
CRC Press, 1996. (p. 582) [180] A. M. Garsia, Topics in almost everywhere
[169] R. A. Fisher, “The logic of inductive infer- convergence. Chicago: Markham Publish-
ence,” Journal of the royal statistical soci- ing Company, 1970. (p. 238)
ety, vol. 98, no. 1, pp. 39–82, 1935. (p. xvii) [181] M. Gastpar, B. Rimoldi, and M. Vet-
[170] B. M. Fitingof, “The compression of dis- terli, “To code, or not to code: Lossy
crete information,” Problemy Peredachi source-channel communication revisited,”
Informatsii, vol. 3, no. 3, pp. 28–36, 1967. IEEE Transactions on Information The-
(p. 246) ory, vol. 49, no. 5, pp. 1147–1158, 2003.
[171] P. Fleisher, “Sufficient conditions for (p. 521)
achieving minimum distortion in a quan- [182] I. M. Gel’fand, A. N. Kolmogorov, and
tizer,” IEEE Int. Conv. Rec., pp. 104–111, A. M. Yaglom, “On the general definition
1964. (p. 481) of the amount of information,” Dokl. Akad.
Nauk. SSSR, vol. 11, pp. 745–748, 1956.
(p. 70)
[183] S. I. Gelfand and M. Pinsker, “Coding for vol. 16, no. 3, pp. 1281–1290, 1988.
channels with random parameters,” Probl. (p. 540)
Contr. Inform. Theory, vol. 9, no. 1, pp. [195] V. D. Goppa, “Nonprobabilistic mutual
19–31, 1980. (p. 468) information with memory,” Probl. Contr.
[184] Y. Geng and C. Nair, “The capacity region Inf. Theory, vol. 4, pp. 97–102, 1975.
of the two-receiver Gaussian vector broad- (p. 463)
cast channel with private and common mes- [196] ——, “Codes and information,” Russian
sages,” IEEE Transactions on Information Mathematical Surveys, vol. 39, no. 1, p. 87,
Theory, vol. 60, no. 4, pp. 2087–2104, 2014. 1984. (p. 463)
(p. 109) [197] R. M. Gray and D. L. Neuhoff, “Quanti-
[185] G. L. Gilardoni, “On a Gel’fand-Yaglom- zation,” IEEE Trans. Inf. Theory, vol. 44,
Peres theorem for f-divergences,” arXiv no. 6, pp. 2325–2383, 1998. (p. 475)
preprint arXiv:0911.1934, 2009. (p. 154) [198] R. M. Gray, Entropy and Information The-
[186] ——, “On Pinsker’s and Vajda’s type ory. New York, NY: Springer-Verlag,
inequalities for Csiszár’s-divergences,” 1990. (p. xxi)
Information Theory, IEEE Transactions [199] U. Grenander and G. Szegö, Toeplitz forms
on, vol. 56, no. 11, pp. 5377–5386, 2010. and their applications, 2nd ed. New
(p. 133) York: Chelsea Publishing Company, 1984.
[187] E. N. Gilbert, “Capacity of burst-noise (p. 114)
channels,” Bell Syst. Tech. J., vol. 39, pp. [200] L. Gross, “Logarithmic sobolev inequali-
1253–1265, Sep. 1960. (p. 111) ties,” American Journal of Mathematics,
[188] R. D. Gill and B. Y. Levit, “Applications vol. 97, no. 4, pp. 1061–1083, 1975.
of the van Trees inequality: a Bayesian (pp. 107 and 191)
Cramér-Rao bound,” Bernoulli, vol. 1, no. [201] Y. Gu, “Channel comparison methods and
1–2, pp. 59–79, 1995. (p. 577) statistical problems on graphs,” Ph.D. dis-
[189] J. Gilmer, “A constant lower bound for sertation, MIT, Cambridge, MA, 02139,
the union-closed sets conjecture,” arXiv USA, 2023. (p. 638)
preprint arXiv:2211.09055, 2022. (p. 189) [202] Y. Gu and Y. Polyanskiy, “Uniqueness of
[190] C. Giraud, Introduction to High- BP fixed point for the Potts model and appli-
Dimensional Statistics. Chapman cations to community detection,” in Con-
and Hall/CRC, 2014. (p. xxii) ference on Learning Theory (COLT), 2023.
[191] G. Glaeser, “Racine carrée d’une fonc- (pp. 135 and 644)
tion différentiable,” Annales de l’institut [203] ——, “Non-linear log-Sobolev inequalities
Fourier, vol. 13, no. 2, pp. 203–210, 1963. for the Potts semigroup and appli-
(p. 625) cations to reconstruction problems,”
[192] O. Goldreich, Introduction to property test- Comm. Math. Physics, (to appear), also
ing. Cambridge University Press, 2017. arXiv:2005.05444. (pp. 643, 653, 670,
(p. 325) and 671)
[193] I. Goodfellow, J. Pouget-Abadie, M. Mirza, [204] Y. Gu, H. Roozbehani, and Y. Polyanskiy,
B. Xu, D. Warde-Farley, S. Ozair, “Broadcasting on trees near criticality,” in
A. Courville, and Y. Bengio, “Genera- 2020 IEEE International Symposium on
tive adversarial nets,” Advances in neural Information Theory (ISIT). IEEE, 2020,
information processing systems, vol. 27, pp. 1504–1509. (p. 670)
2014. (p. 150) [205] D. Guo, S. Shamai (Shitz), and S. Verdú,
[194] V. Goodman, “Characteristics of normal “Mutual information and minimum mean-
samples,” The Annals of Probability, square error in Gaussian channels,” IEEE
Trans. Inf. Theory, vol. 51, no. 4, pp. 1261 P. Elias, Eds. Springer Netherlands, 1975,
– 1283, Apr. 2005. (p. 59) vol. 16, pp. 323–355. (p. 584)
[206] D. Guo, Y. Wu, S. S. Shamai, and S. Verdú, [216] D. Haussler and M. Opper, “Mutual infor-
“Estimation in Gaussian noise: Proper- mation, metric entropy and cumulative rel-
ties of the minimum mean-square error,” ative entropy risk,” The Annals of Statis-
IEEE Transactions on Information Theory, tics, vol. 25, no. 6, pp. 2451–2492, 1997.
vol. 57, no. 4, pp. 2371–2385, 2011. (p. 63) (pp. xxii and 188)
[207] U. Hadar, J. Liu, Y. Polyanskiy, and [217] M. Hayashi, “General nonasymptotic and
O. Shayevitz, “Communication complexity asymptotic formulas in channel resolv-
of estimating correlations,” in Proceedings ability and identification capacity and
of the 51st Annual ACM SIGACT Sympo- their application to the wiretap channel,”
sium on Theory of Computing. ACM, IEEE Transactions on Information The-
2019, pp. 792–803. (p. 645) ory, vol. 52, no. 4, pp. 1562–1575, 2006.
[208] B. Hajek, Y. Wu, and J. Xu, “Information (p. 505)
limits for recovering a hidden community,” [218] W. Hoeffding, “Asymptotically optimal
IEEE Trans. on Information Theory, vol. 63, tests for multinomial distributions,” The
no. 8, pp. 4729 – 4745, 2017. (p. 591) Annals of Mathematical Statistics, pp. 369–
[209] J. Hájek, “Local asymptotic minimax and 401, 1965. (p. 289)
admissibility in estimation,” in Proceedings [219] P. J. Huber, “Fisher information and spline
of the sixth Berkeley symposium on math- interpolation,” Annals of Statistics, pp.
ematical statistics and probability, vol. 1, 1029–1033, 1974. (p. 580)
1972, pp. 175–194. (p. 582) [220] ——, Robust Statistics. New York, NY:
[210] J. M. Hammersley, “On estimating Wiley-Interscience, 1981. (pp. 151 and 152)
restricted parameters,” Journal of the Royal [221] ——, “A robust version of the probabil-
Statistical Society. Series B (Methodolog- ity ratio test,” The Annals of Mathematical
ical), vol. 12, no. 2, pp. 192–240, 1950. Statistics, pp. 1753–1758, 1965. (pp. 324,
(p. 575) 338, and 613)
[211] T. S. Han, Information-spectrum methods [222] I. A. Ibragimov and R. Z. Khas’minsk�,
in information theory. Springer Science Statistical Estimation: Asymptotic Theory.
& Business Media, 2003. (pp. xix and xxi) Springer, 1981. (pp. xxii and 143)
[212] T. S. Han and S. Verdú, “Approximation [223] S. Ihara, “On the capacity of channels with
theory of output statistics,” IEEE Transac- additive non-Gaussian noise,” Information
tions on Information Theory, vol. 39, no. 3, and Control, vol. 37, no. 1, pp. 34–39, 1978.
pp. 752–772, 1993. (pp. 504 and 505) (p. 401)
[213] Y. Han, S. Jana, and Y. Wu, “Optimal pre- [224] ——, Information theory for continuous
diction of Markov chains with and without systems. World Scientific, 1993, vol. 2.
spectral gap,” IEEE Transactions on Infor- (p. 419)
mation Theory, vol. 69, no. 6, pp. 3920– [225] Y. I. Ingster and I. A. Suslina, Nonparamet-
3959, 2023. (p. 258) ric goodness-of-fit testing under Gaussian
[214] P. Harremoës and I. Vajda, “On pairs of models. New York, NY: Springer, 2003.
f-divergences and their joint range,” IEEE (pp. 134, 185, 325, and 561)
Trans. Inf. Theory, vol. 57, no. 6, pp. 3230– [226] Y. I. Ingster, “Minimax testing of nonpara-
3235, Jun. 2011. (pp. 115, 128, and 129) metric hypotheses on a distribution density
[215] B. Harris, “The statistical estimation of in the Lp metrics,” Theory of Probability &
entropy in the non-parametric case,” in Top- Its Applications, vol. 31, no. 2, pp. 333–337,
ics in Information Theory, I. Csiszár and 1987. (p. 325)
[227] S. Janson, “Random regular graphs: asymp- [236] I. Johnstone, Gaussian estimation:
totic distributions and contiguity,” Combi- Sequence and wavelet models, 2011, avail-
natorics, Probability and Computing, vol. 4, able at http://www-stat.stanford.edu/~imj/.
no. 4, pp. 369–405, 1995. (p. 186) (p. 590)
[228] S. Janson and E. Mossel, “Robust recon- [237] L. K. Jones, “A simple lemma on greedy
struction on trees is determined by the approximation in Hilbert space and conver-
second eigenvalue,” Ann. Probab., vol. 32, gence rates for projection pursuit regression
no. 1A, pp. 2630–2649, 2004. (p. 644) and neural network training,” The Annals of
[229] E. T. Jaynes, Probability theory: The logic Statistics, pp. 608–613, 1992. (p. 534)
of science. Cambridge university press, [238] A. B. Juditsky and A. S. Nemirovski, “Non-
2003. (p. 253) parametric estimation by convex program-
[230] T. S. Jayram, “Hellinger strikes back: ming,” The Annals of Statistics, vol. 37,
A note on the multi-party information no. 5A, pp. 2278–2300, 2009. (p. 566)
complexity of AND,” in International [239] S. M. Kakade, K. Sridharan, and A. Tewari,
Workshop on Approximation Algorithms “On the complexity of linear prediction:
for Combinatorial Optimization, 2009, pp. Risk bounds, margin bounds, and regular-
562–573. (p. 183) ization,” Advances in neural information
[231] I. Jensen and A. J. Guttmann, “Series expan- processing systems, vol. 21, 2008. (p. 87)
sions of the percolation probability for [240] S. Kamath, A. Orlitsky, D. Pichapati, and
directed square and honeycomb lattices,” A. Suresh, “On learning distributions from
Journal of Physics A: Mathematical and their samples,” in Conference on Learning
General, vol. 28, no. 17, p. 4813, 1995. Theory, June 2015, pp. 1066–1100. (p. 258)
(p. 670) [241] T. Kawabata and A. Dembo, “The rate-
[232] Z. Jia, Y. Polyanskiy, and Y. Wu, “Entropic distortion dimension of sets and measures,”
characterization of optimal rates for learn- IEEE Trans. Inf. Theory, vol. 40, no. 5, pp.
ing Gaussian mixtures,” in Conference on 1564 – 1572, Sep. 1994. (p. 542)
Learning Theory (COLT). PMLR, 2023. [242] M. Keane and G. O’Brien, “A Bernoulli
(p. 619) factory,” ACM Transactions on Modeling
[233] J. Jiao, K. Venkat, Y. Han, and T. Weiss- and Computer Simulation, vol. 4, no. 2, pp.
man, “Minimax estimation of functionals of 213–219, 1994. (p. 172)
discrete distributions,” IEEE Transactions [243] J. Kemperman, “On the Shannon capacity
on Information Theory, vol. 61, no. 5, pp. of an arbitrary channel,” in Indagationes
2835–2885, 2015. (p. 584) Mathematicae (Proceedings), vol. 77, no. 2.
[234] C. Jin, Y. Zhang, S. Balakrishnan, M. J. North-Holland, 1974, pp. 101–115. (p. 97)
Wainwright, and M. I. Jordan, “Local max- [244] H. Kesten and B. P. Stigum, “Additional
ima in the likelihood of Gaussian mixture limit theorems for indecomposable multi-
models: Structural results and algorithmic dimensional galton-watson processes,” The
consequences,” in Advances in neural infor- Annals of Mathematical Statistics, vol. 37,
mation processing systems, 2016, pp. 4116– no. 6, pp. 1463–1481, 1966. (pp. 644
4124. (p. 105) and 670)
[235] W. B. Johnson, G. Schechtman, and J. Zinn, [245] D. P. Kingma and M. Welling, “Auto-
“Best constants in moment inequalities encoding variational Bayes,” arXiv preprint
for linear combinations of independent arXiv:1312.6114, 2013. (pp. 76 and 77)
and exchangeable random variables,” The [246] D. P. Kingma, M. Welling et al., “An intro-
Annals of Probability, vol. 13, no. 1, pp. duction to variational autoencoders,” Foun-
234–253, 1985. (p. 497) dations and Trends® in Machine Learning,
vol. 12, no. 4, pp. 307–392, 2019. (p. 77)
[247] T. Koch, “The Shannon lower bound is [256] O. Kosut and L. Sankar, “Asymptotics
asymptotically tight,” IEEE Transactions and non-asymptotics for universal fixed-to-
on Information Theory, vol. 62, no. 11, pp. variable source coding,” IEEE Transactions
6155–6161, 2016. (p. 511) on Information Theory, vol. 63, no. 6, pp.
[248] Y. Kochman, O. Ordentlich, and Y. Polyan- 3757–3772, 2017. (p. 250)
skiy, “A lower bound on the expected [257] A. Krause and D. Golovin, “Submodular
distortion of joint source-channel coding,” function maximization,” Tractability, vol. 3,
IEEE Transactions on Information The- pp. 71–104, 2014. (p. 367)
ory, vol. 66, no. 8, pp. 4722–4741, 2020. [258] R. Krichevskiy, “Laplace’s law of succes-
(p. 521) sion and universal encoding,” IEEE Trans-
[249] A. Kolchinsky and B. D. Tracey, “Esti- actions on Information Theory, vol. 44,
mating mixture entropy with pairwise dis- no. 1, pp. 296–303, Jan. 1998. (p. 665)
tances,” Entropy, vol. 19, no. 7, p. 361, [259] R. Krichevsky, “A relation between the
2017. (p. 188) plausibility of information about a source
[250] A. N. Kolmogorov and V. M. Tikhomirov, and encoding redundancy,” Problems
“ε-entropy and ε-capacity of sets in function Inform. Transmission, vol. 4, no. 3, pp.
spaces,” Uspekhi Matematicheskikh Nauk, 48–57, 1968. (p. 247)
vol. 14, no. 2, pp. 3–86, 1959, reprinted [260] R. Krichevsky and V. Trofimov, “The per-
in Shiryayev, A. N., ed. Selected Works formance of universal encoding,” IEEE
of AN Kolmogorov: Volume III: Informa- Trans. Inf. Theory, vol. 27, no. 2, pp. 199–
tion Theory and the Theory of Algorithms, 207, 1981. (p. 254)
Vol. 27, Springer Netherlands, 1993, pp 86– [261] F. Krzakała, A. Montanari, F. Ricci-
170. (pp. 522, 523, 524, 526, 535, 538, 539, Tersenghi, G. Semerjian, and L. Zdeborová,
and 543) “Gibbs states and the set of solutions
[251] I. Kontoyiannis and S. Verdú, “Optimal of random constraint satisfaction prob-
lossless data compression: Non- lems,” Proceedings of the National
asymptotics and asymptotics,” IEEE Academy of Sciences, vol. 104, no. 25, pp.
Trans. Inf. Theory, vol. 60, no. 2, pp. 10 318–10 323, 2007. (p. 642)
777–795, 2014. (p. 198) [262] J. Kuelbs, “A strong convergence theorem
[252] J. Körner and A. Orlitsky, “Zero-error for Banach space valued random variables,”
information theory,” IEEE Transactions on The Annals of Probability, vol. 4, no. 5, pp.
Information Theory, vol. 44, no. 6, pp. 744–771, 1976. (p. 540)
2207–2229, 1998. (p. 374) [263] J. Kuelbs and W. V. Li, “Metric entropy and
[253] V. Koshelev, “Quantization with minimal the small ball problem for Gaussian mea-
entropy,” Probl. Pered. Inform, vol. 14, pp. sures,” Journal of Functional Analysis, vol.
151–156, 1963. (p. 483) 116, no. 1, pp. 133–157, 1993. (pp. 540
[254] V. Kostina, Y. Polyanskiy, and S. Verdú, and 541)
“Variable-length compression allowing [264] S. Kullback, Information theory and statis-
errors,” IEEE Transactions on Information tics. Mineola, NY: Dover publications,
Theory, vol. 61, no. 8, pp. 4316–4330, 1968. (p. xxi)
2015. (p. 548) [265] C. Külske and M. Formentin, “A symmetric
[255] V. Kostina and S. Verdú, “Fixed-length entropy bound on the non-reconstruction
lossy compression in the finite blocklength regime of Markov chains on Galton-
regime,” IEEE Transactions on Information Watson trees,” Electronic Communications
Theory, vol. 58, no. 6, pp. 3309–3338, 2012. in Probability, vol. 14, pp. 587–596, 2009.
(p. 485) (p. 135)
[266] H. O. Lancaster, “Some properties of [277] E. Lehmann and J. Romano, Testing Statis-
the bivariate normal distribution consid- tical Hypotheses, 3rd ed. Springer, 2005.
ered in the form of a contingency table,” (pp. 275 and 325)
Biometrika, vol. 44, no. 1/2, pp. 289–292, [278] W. V. Li and W. Linde, “Approximation,
1957. (p. 641) metric entropy and small ball estimates for
[267] R. Landauer, “Irreversibility and heat gen- Gaussian measures,” The Annals of Proba-
eration in the computing process,” IBM bility, vol. 27, no. 3, pp. 1556–1578, 1999.
journal of research and development, vol. 5, (p. 541)
no. 3, pp. 183–191, 1961. (p. xix) [279] W. V. Li and Q.-M. Shao, “Gaussian pro-
[268] A. Lapidoth, A foundation in digital com- cesses: inequalities, small ball probabilities
munication. Cambridge University Press, and applications,” Handbook of Statistics,
2017. (p. 403) vol. 19, pp. 533–597, 2001. (pp. 539, 541,
[269] A. Lapidoth and S. M. Moser, “Capac- and 553)
ity bounds via duality with applications [280] E. H. Lieb, “Proof of an entropy conjecture
to multiple-antenna systems on flat-fading of Wehrl,” Communications in Mathemati-
channels,” IEEE Transactions on Informa- cal Physics, vol. 62, no. 1, pp. 35–41, 1978.
tion Theory, vol. 49, no. 10, pp. 2426–2467, (p. 64)
2003. (p. 409) [281] T. Linder and R. Zamir, “On the asymp-
[270] B. Laurent and P. Massart, “Adaptive esti- totic tightness of the Shannon lower bound,”
mation of a quadratic functional by model IEEE Transactions on Information The-
selection,” The Annals of Statistics, vol. 28, ory, vol. 40, no. 6, pp. 2026–2031, 1994.
no. 5, pp. 1302–1338, 2000. (p. 85) (p. 511)
[271] S. L. Lauritzen, Graphical models. Claren- [282] R. S. Liptser, F. Pukel’sheim, and A. N.
don Press, 1996, vol. 17. (pp. 50 and 51) Shiryaev, “Necessary and sufficient condi-
[272] L. Le Cam, “Convergence of estimates tions for contiguity and entire asymptotic
under dimensionality restrictions,” Annals separation of probability measures,” Rus-
of Statistics, vol. 1, no. 1, pp. 38 – 53, 1973. sian Mathematical Surveys, vol. 37, no. 6,
(p. xxii) p. 107, 1982. (p. 126)
[273] ——, Asymptotic methods in statistical [283] S. Litsyn, “New upper bounds on error
decision theory. New York, NY: Springer- exponents,” IEEE Transactions on Informa-
Verlag, 1986. (pp. 117, 133, 558, 582, 602, tion Theory, vol. 45, no. 2, pp. 385–398,
and 614) 1999. (p. 433)
[274] C. C. Leang and D. H. Johnson, “On [284] S. Lloyd, “Least squares quantization in
the asymptotics of m-hypothesis Bayesian PCM,” IEEE transactions on information
detection,” IEEE Transactions on Informa- theory, vol. 28, no. 2, pp. 129–137, 1982.
tion Theory, vol. 43, no. 1, pp. 280–282, (p. 480)
1997. (p. 337) [285] G. G. Lorentz, M. v. Golitschek, and
[275] K. Lee, Y. Wu, and Y. Bresler, “Near opti- Y. Makovoz, Constructive approximation:
mal compressed sensing of sparse rank-one advanced problems. Springer, 1996, vol.
matrices via sparse power factorization,” 304. (pp. 523 and 538)
IEEE Transactions on Information Theory, [286] L. Lovász, “On the Shannon capacity of
vol. 64, no. 3, pp. 1666–1698, Mar. 2018. a graph,” IEEE Transactions on Informa-
(p. 543) tion theory, vol. 25, no. 1, pp. 1–7, 1979.
[276] E. L. Lehmann and G. Casella, Theory of (p. 452)
Point Estimation, 2nd ed. New York, NY: [287] D. J. MacKay, Information theory, infer-
Springer, 1998. (pp. xxii and 564) ence and learning algorithms. Cambridge
university press, 2003. (p. xxi)
i i
i i
i i
i i
i i
i i
References 687
i i
i i
i i
[330] J. Pitman, “Probabilistic bounds on the [340] Y. Polyanskiy and S. Verdú, “Empirical dis-
coefficients of polynomials with only real tribution of good channel codes with non-
zeros,” Journal of Combinatorial Theory, vanishing error probability,” IEEE Trans.
Series A, vol. 77, no. 2, pp. 279–303, 1997. Inf. Theory, vol. 60, no. 1, pp. 5–21, Jan.
(p. 301) 2014. (p. 429)
[331] E. Plotnik, M. J. Weinberger, and J. Ziv, [341] Y. Polyanskiy, “Saddle point in the mini-
“Upper bounds on the probability of max converse for channel coding,” IEEE
sequences emitted by finite-state sources Transactions on Information Theory,
and on the redundancy of the Lempel-Ziv vol. 59, no. 5, pp. 2576–2595, 2012.
algorithm,” IEEE transactions on informa- (pp. 429 and 430)
tion theory, vol. 38, no. 1, pp. 66–72, 1992. [342] ——, “On dispersion of compound DMCs,”
(p. 264) in 2013 51st Annual Allerton Conference
[332] D. Pollard, “Empirical processes: Theory on Communication, Control, and Comput-
and applications,” NSF-CBMS Regional ing (Allerton). IEEE, 2013, pp. 26–32.
Conference Series in Probability and Statis- (pp. 437 and 465)
tics, vol. 2, pp. i–86, 1990. (p. 603) [343] Y. Polyanskiy and Y. Wu, “Peak-to-average
[333] Y. Polyanskiy, “Channel coding: non- power ratio of good codes for Gaussian
asymptotic fundamental limits,” Ph.D. channel,” IEEE Trans. Inf. Theory, vol. 60,
dissertation, Princeton Univ., Princeton, no. 12, pp. 7655–7660, Dec. 2014. (p. 408)
NJ, USA, 2010. (pp. 109, 383, 385, 429, [344] ——, “Wasserstein continuity of entropy
435, and 436) and outer bounds for interference channels,”
[334] Y. Polyanskiy, H. V. Poor, and S. Verdú, IEEE Transactions on Information Theory,
“Channel coding rate in the finite block- vol. 62, no. 7, pp. 3992–4002, 2016. (pp. 60
length regime,” IEEE Trans. Inf. Theory, and 64)
vol. 56, no. 5, pp. 2307–2359, May 2010. [345] ——, “Strong data-processing inequalities
(pp. 346, 353, 434, 435, 436, and 584) for channels and Bayesian networks,” in
[335] ——, “Dispersion of the Gilbert-Elliott Convexity and Concentration. The IMA Vol-
channel,” IEEE Trans. Inf. Theory, vol. 57, umes in Mathematics and its Applications,
no. 4, pp. 1829–1848, Apr. 2011. (p. 437) vol 161, E. Carlen, M. Madiman, and E. M.
[336] ——, “Feedback in the non-asymptotic Werner, Eds. New York, NY: Springer,
regime,” IEEE Trans. Inf. Theory, vol. 57, 2017, pp. 211–249. (pp. 325, 626, 631, 632,
no. 4, pp. 4903 – 4925, Apr. 2011. (pp. 446, 635, 636, 638, 646, and 647)
454, 455, and 456) [346] ——, “Dualizing Le Cam’s method for
[337] ——, “Minimum energy to send k bits with functional estimation, with applications to
and without feedback,” IEEE Trans. Inf. estimating the unseens,” arXiv preprint
Theory, vol. 57, no. 8, pp. 4880–4902, Aug. arXiv:1902.05616, 2019. (pp. 566 and 668)
2011. (pp. 413 and 449) [347] ——, “Application of the information-
[338] Y. Polyanskiy and S. Verdú, “Arimoto chan- percolation method to reconstruction prob-
nel coding converse and Rényi divergence,” lems on graphs,” Mathematical Statistics
in Proceedings of the Forty-eighth Annual and Learning, vol. 2, no. 1, pp. 1–24, 2020.
Allerton Conference on Communication, (pp. 650, 651, and 653)
Control, and Computing, Sep. 2010, pp. [348] ——, “Self-regularizing property of non-
1327–1333. (pp. 121, 433, and 505) parametric maximum likelihood estima-
[339] Y. Polyanskiy and S. Verdu, “Binary tor in mixture models,” arXiv preprint
hypothesis testing with feedback,” in Infor- arXiv:2008.08244, 2020. (p. 408)
mation Theory and Applications Workshop [349] E. C. Posner and E. R. Rodemich, “Epsilon
(ITA), 2011. (p. 320) entropy and data compression,” Annals of
i i
i i
i i
References 689
Mathematical Statistics, vol. 42, no. 6, pp. parity-check codes,” IEEE Transac-
2079–2125, Dec. 1971. (p. 543) tions on Information Theory, vol. 47, no. 2,
[350] A. Prékopa, “Logarithmic concave mea- pp. 619–637, 2001. (p. 516)
sures with application to stochastic pro- [360] T. Richardson and R. Urbanke, Modern
gramming,” Acta Scientiarum Mathemati- Coding Theory. Cambridge University
carum, vol. 32, pp. 301–316, 1971. (p. 573) Press, 2008. (pp. xxi, 63, 341, 346, 383,
[351] J. Radhakrishnan, “An entropy proof of and 632)
Bregman’s theorem,” J. Combin. Theory [361] Y. Rinott, “On convexity of measures,”
Ser. A, vol. 77, no. 1, pp. 161–164, 1997. Annals of Probability, vol. 4, no. 6, pp.
(p. 161) 1020–1026, 1976. (p. 573)
[352] M. Raginsky, “Strong data processing [362] J. J. Rissanen, “Fisher information and
inequalities and ϕ-Sobolev inequalities for stochastic complexity,” IEEE transactions
discrete channels,” IEEE Transactions on on information theory, vol. 42, no. 1, pp.
Information Theory, vol. 62, no. 6, pp. 40–47, 1996. (p. 261)
3355–3389, 2016. (pp. 626 and 638) [363] H. Robbins, “An empirical Bayes approach
[353] M. Raginsky and I. Sason, “Concentration to statistics,” in Proceedings of the Third
of measure inequalities in information the- Berkeley Symposium on Mathematical
ory, communications, and coding,” Founda- Statistics and Probability, Volume 1: Con-
tions and Trends® in Communications and tributions to the Theory of Statistics. The
Information Theory, vol. 10, no. 1-2, pp. Regents of the University of California,
1–246, 2013. (p. xxi) 1956. (p. 563)
[354] C. R. Rao, “Information and the accuracy [364] R. W. Robinson and N. C. Wormald,
attainable in the estimation of statistical “Almost all cubic graphs are Hamiltonian,”
parameters,” Bull. Calc. Math. Soc., vol. 37, Random Structures & Algorithms, vol. 3,
pp. 81–91, 1945. (p. 576) no. 2, pp. 117–125, 1992. (p. 186)
[355] A. H. Reeves, “The past present and future [365] C. Rogers, Packing and Covering, ser. Cam-
of PCM,” IEEE Spectrum, vol. 2, no. 5, pp. bridge tracts in mathematics and mathemat-
58–62, 1965. (p. 477) ical physics. Cambridge University Press,
[356] A. Rényi, “On measures of entropy and 1964. (p. 527)
information,” in Proc. 4th Berkeley Symp. [366] H. Roozbehani and Y. Polyanskiy, “Alge-
Mathematics, Statistics, and Probability, braic methods of classifying directed
vol. 1, Berkeley, CA, USA, 1961, pp. 547– graphical models,” arXiv preprint
561. (p. 13) arXiv:1401.5551, 2014. (p. 180)
[357] ——, “On the dimension and entropy of [367] ——, “Low density majority codes and
probability distributions,” Acta Mathemat- the problem of graceful degradation,” arXiv
ica Hungarica, vol. 10, no. 1 – 2, Mar. 1959. preprint arXiv:1911.12263, 2019. (pp. 191
(p. 29) and 669)
[358] Z. Reznikova and B. Ryabko, “Anal- [368] H. P. Rosenthal, “On the subspaces of
ysis of the language of ants by Lp (p > 2) spanned by sequences of inde-
information-theoretical methods,” Prob- pendent random variables,” Israel Journal
lemi Peredachi Informatsii, vol. 22, no. 3, of Mathematics, vol. 8, no. 3, pp. 273–303,
pp. 103–108, 1986, english translation: 1970. (p. 497)
http://reznikova.net/R-R-entropy-09.pdf. [369] D. Russo and J. Zou, “Controlling bias in
(p. 9) adaptive data analysis using information
[359] T. J. Richardson, M. A. Shokrollahi, and theory,” in Artificial Intelligence and Statis-
R. L. Urbanke, “Design of capacity- tics. PMLR, 2016, pp. 1232–1240. (pp. 90
approaching irregular low-density and 188)
i i
i i
i i
[370] I. N. Sanov, “On the probability of large channels i,” Inf. Contr., vol. 10, pp. 65–103,
deviations of random magnitudes,” Matem- 1967. (pp. 432 and 433)
aticheskii Sbornik, vol. 84, no. 1, pp. 11–44, [382] J. Shawe-Taylor and R. C. Williamson, “A
1957. (p. 307) PAC analysis of a Bayesian estimator,” in
[371] E. �a�o�lu, “Polar coding theorems for dis- Proceedings of the tenth annual conference
crete systems,” EPFL, Tech. Rep., 2011. on Computational learning theory, 1997,
(p. 669) pp. 2–9. (p. 83)
[372] ——, “Polarization and polar codes,” Foun- [383] O. Shayevitz, “On Rényi measures and
dations and Trends® in Communications hypothesis testing,” in 2011 IEEE Interna-
and Information Theory, vol. 8, no. 4, pp. tional Symposium on Information Theory
259–381, 2012. (p. 341) Proceedings. IEEE, 2011, pp. 894–898.
[373] I. Sason and S. Verdú, “f-divergence (p. 182)
inequalities,” IEEE Transactions on Infor- [384] O. Shayevitz and M. Feder, “Optimal feed-
mation Theory, vol. 62, no. 11, pp. 5973– back communication via posterior match-
6006, 2016. (p. 132) ing,” IEEE Trans. Inf. Theory, vol. 57, no. 3,
[374] G. Schechtman, “Extremal configurations pp. 1186–1222, 2011. (p. 445)
for moments of sums of independent pos- [385] A. N. Shiryaev, Probability-1. Springer,
itive random variables,” in Banach Spaces 2016, vol. 95. (p. 126)
and their Applications in Analysis. De [386] G. Simons and M. Woodroofe, “The
Gruyter, 2011, pp. 183–192. (p. 505) Cramér-Rao inequality holds almost every-
[375] M. J. Schervish, Theory of statistics. where,” in Recent Advances in Statistics:
Springer-Verlag New York, 1995. (pp. 582 Papers in Honor of Herman Chernoff on his
and 583) Sixtieth Birthday. Academic, New York,
[376] A. Schrijver, Theory of linear and integer 1983, pp. 69–93. (p. 661)
programming. John Wiley & Sons, 1998. [387] Y. G. Sinai, “On the notion of entropy of a
(p. 567) dynamical system,” in Doklady of Russian
[377] C. E. Shannon, “A symbolic analysis of Academy of Sciences, vol. 124, no. 3, 1959,
relay and switching circuits,” Electrical pp. 768–771. (pp. xix and 230)
Engineering, vol. 57, no. 12, pp. 713–723, [388] R. Sinkhorn, “A relationship between arbi-
Dec 1938. (p. 626) trary positive matrices and doubly stochas-
[378] C. E. Shannon, “A mathematical theory of tic matrices,” Ann. Math. Stat., vol. 35,
communication,” Bell Syst. Tech. J., vol. 27, no. 2, pp. 876–879, 1964. (p. 105)
pp. 379–423 and 623–656, Jul./Oct. 1948. [389] M. Sion, “On general minimax theorems,”
(pp. xvii, 41, 195, 215, 234, 341, 346, 377, Pacific J. Math, vol. 8, no. 1, pp. 171–176,
and 411) 1958. (p. 93)
[379] ——, “The zero error capacity of a noisy [390] M.-K. Siu, “Which Latin squares are Cay-
channel,” IRE Transactions on Informa- ley tables?” Amer. Math. Monthly, vol. 98,
tion Theory, vol. 2, no. 3, pp. 8–19, 1956. no. 7, pp. 625–627, Aug. 1991. (p. 384)
(pp. 374, 450, and 452) [391] D. Slepian and H. O. Pollak, “Prolate
[380] ——, “Coding theorems for a discrete spheroidal wave functions, Fourier analysis
source with a fidelity criterion,” IRE Nat. and uncertainty–I,” Bell System Technical
Conv. Rec, vol. 4, no. 142-163, p. 1, 1959. Journal, vol. 40, no. 1, pp. 43–63, 1961.
(pp. 475 and 490) (p. 419)
[381] C. E. Shannon, R. G. Gallager, and E. R. [392] D. Slepian and J. Wolf, “Noiseless cod-
Berlekamp, “Lower bounds to error prob- ing of correlated information sources,”
ability for coding on discrete memoryless IEEE Transactions on information Theory,
vol. 19, no. 4, pp. 471–480, 1973. (p. 223)
i i
i i
i i
References 691
[393] A. Sly, “Reconstruction of random colour- decision theory. Berlin, Germany: Walter
ings,” Communications in Mathematical de Gruyter, 1985. (pp. 558 and 566)
Physics, vol. 288, no. 3, pp. 943–961, Jun [405] J. Suzuki, “Some notes on universal noise-
2009. (p. 644) less coding,” IEICE transactions on fun-
[394] A. Sly and N. Sun, “Counting in two-spin damentals of electronics, communications
models on d-regular graphs,” The Annals of and computer sciences, vol. 78, no. 12, pp.
Probability, vol. 42, no. 6, pp. 2383–2416, 1840–1847, 1995. (p. 252)
2014. (p. 75) [406] S. Szarek, “Nets of Grassmann manifold
[395] B. Smith, “Instantaneous companding of and orthogonal groups,” in Proceedings of
quantized signals,” Bell System Technical Banach Space Workshop. University of
Journal, vol. 36, no. 3, pp. 653–709, 1957. Iowa Press, 1982, pp. 169–185. (pp. 527
(p. 483) and 544)
[396] J. G. Smith, “The information capacity of [407] ——, “Metric entropy of homogeneous
amplitude and variance-constrained scalar spaces,” Banach Center Publications,
Gaussian channels,” Information and Con- vol. 43, no. 1, pp. 395–410, 1998. (p. 527)
trol, vol. 18, pp. 203 – 219, 1971. (p. 408) [408] W. Szpankowski and S. Verdú, “Mini-
[397] Spectre, “SPECTRE: Short packet com- mum expected length of fixed-to-variable
munication toolbox,” https://github.com/ lossless compression without prefix con-
yp-mit/spectre, 2015, GitHub repository. straints,” IEEE Trans. Inf. Theory, vol. 57,
(pp. 418 and 441) no. 7, pp. 4017–4025, 2011. (p. 200)
[398] R. Speer, J. Chin, A. Lin, S. Jewett, and [409] I. Tal and A. Vardy, “List decoding of polar
L. Nathan, “Luminosoinsight/wordfreq: codes,” IEEE Transactions on Information
v2.2,” Oct. 2018. [Online]. Available: https: Theory, vol. 61, no. 5, pp. 2213–2226, 2015.
//doi.org/10.5281/zenodo.1443582 (p. 204) (p. 346)
[399] A. J. Stam, “Some inequalities satisfied by [410] M. Talagrand, “The Parisi formula,” Annals
the quantities of information of Fisher and of mathematics, pp. 221–263, 2006. (p. 63)
Shannon,” Information and Control, vol. 2, [411] ——, Upper and lower bounds for stochas-
no. 2, pp. 101–112, 1959. (pp. 64, 185, tic processes. Springer, 2014. (p. 531)
and 191) [412] T. Tanaka, P. M. Esfahani, and S. K. Mitter,
[400] ——, “Distance between sampling with “LQG control with minimum directed
and without replacement,” Statistica Neer- information: Semidefinite programming
landica, vol. 32, no. 2, pp. 81–91, 1978. approach,” IEEE Transactions on Auto-
(pp. 186 and 187) matic Control, vol. 63, no. 1, pp. 37–52,
[401] M. Steiner, “The strong simplex conjecture 2017. (p. 449)
is false,” IEEE Transactions on Information [413] W. Tang and F. Tang, “The Poisson bino-
Theory, vol. 40, no. 3, pp. 721–731, 1994. mial distribution – old & new,” Statistical
(p. 413) Science, vol. 38, no. 1, pp. 108–119, 2023.
[402] V. Strassen, “Asymptotische Abschätzun- (p. 301)
gen in Shannon’s Informationstheorie,” in [414] T. Tao, “Szemerédi’s regularity lemma
Trans. 3d Prague Conf. Inf. Theory, Prague, revisited,” Contributions to Discrete Math-
1962, pp. 689–723. (p. 435) ematics, vol. 1, no. 1, pp. 8–28, 2006.
[403] ——, “The existence of probability mea- (pp. 127 and 190)
sures with given marginals,” Annals of [415] G. Taricco and M. Elia, “Capacity of fading
Mathematical Statistics, vol. 36, no. 2, pp. channel with no side information,” Elec-
423–439, 1965. (p. 122) tronics Letters, vol. 33, no. 16, pp. 1368–
[404] H. Strasser, Mathematical theory of statis- 1370, 1997. (p. 409)
tics: Statistical experiments and asymptotic
i i
i i
i i
[416] V. Tarokh, H. Jafarkhani, and A. R. Calder- [427] I. Vajda, “Note on discrimination informa-
bank, “Space-time block codes from orthog- tion and variation (corresp.),” IEEE Trans-
onal designs,” IEEE Transactions on Infor- actions on Information Theory, vol. 16,
mation theory, vol. 45, no. 5, pp. 1456– no. 6, pp. 771–773, 1970. (p. 131)
1467, 1999. (p. 409) [428] G. Valiant and P. Valiant, “Estimating the
[417] V. Tarokh, N. Seshadri, and A. R. Calder- unseen: an n/ log(n)-sample estimator for
bank, “Space-time codes for high data rate entropy and support size, shown optimal
wireless communication: Performance cri- via new CLTs,” in Proceedings of the 43rd
terion and code construction,” IEEE trans- annual ACM symposium on Theory of com-
actions on information theory, vol. 44, puting, 2011, pp. 685–694. (p. 584)
no. 2, pp. 744–765, 1998. (p. 409) [429] S. van de Geer, Empirical Processes in M-
[418] E. Telatar, “Capacity of multi-antenna Estimation. Cambridge University Press,
Gaussian channels,” European trans. tele- 2000. (pp. 86 and 603)
com., vol. 10, no. 6, pp. 585–595, 1999. [430] A. van der Vaart, “The statistical work of
(pp. 176 and 409) Lucien Le Cam,” Annals of Statistics, pp.
[419] ——, “Wringing lemmas and multiple 631–682, 2002. (pp. 614 and 616)
descriptions,” 2016, unpublished draft. [431] A. W. van der Vaart and J. A. Well-
(p. 187) ner, Weak Convergence and Empirical Pro-
[420] V. N. Temlyakov, “On estimates of ϵ- cesses. Springer Verlag New York, Inc.,
entropy and widths of classes of functions 1996. (pp. 86 and 603)
with a bounded mixed derivative or differ- [432] T. Van Erven and P. Harremoës, “Rényi
ence,” Doklady Akademii Nauk, vol. 301, divergence and Kullback-Leibler diver-
no. 2, pp. 288–291, 1988. (p. 541) gence,” IEEE Trans. Inf. Theory, vol. 60,
[421] N. Tishby, F. C. Pereira, and W. Bialek, no. 7, pp. 3797–3820, 2014. (p. 145)
“The information bottleneck method,” [433] H. L. Van Trees, Detection, Estimation, and
arXiv preprint physics/0004057, 2000. Modulation Theory. Wiley, New York,
(p. 549) 1968. (p. 577)
[422] F. Topsøe, “Some inequalities for informa- [434] S. Verdú, “On channel capacity per unit
tion divergence and related measures of dis- cost,” IEEE Trans. Inf. Theory, vol. 36,
crimination,” IEEE Transactions on Infor- no. 5, pp. 1019–1030, Sep. 1990. (p. 414)
mation Theory, vol. 46, no. 4, pp. 1602– [435] ——, Multiuser Detection. Cambridge,
1609, 2000. (p. 133) UK: Cambridge Univ. Press, 1998. (p. 413)
[423] D. Tse and P. Viswanath, Fundamentals of [436] ——, “Information theory, part I,” draft
wireless communication. Cambridge Uni- (personal communication), 2017. (p. xv)
versity Press, 2005. (pp. xxi, 403, and 409) [437] S. Verdú and D. Guo, “A simple proof of the
[424] A. B. Tsybakov, Introduction to Nonpara- entropy-power inequality,” IEEE Transac-
metric Estimation. New York, NY: tions on Information Theory, vol. 52, no. 5,
Springer Verlag, 2009. (pp. xxi, xxii, 132, pp. 2165–2166, 2006. (p. 64)
and 624) [438] R. Vershynin, High-dimensional probabil-
[425] B. P. Tunstall, “Synthesis of noiseless com- ity: An introduction with applications in
pression codes,” Ph.D. dissertation, Geor- data science. Cambridge university press,
gia Institute of Technology, 1967. (p. 196) 2018, vol. 47. (pp. 86 and 531)
[426] E. Uhrmann-Klingen, “Minimal Fisher [439] A. G. Vitushkin, “On the 13th problem of
information distributions with compact- Hilbert,” Dokl. Akad. Nauk SSSR, vol. 95,
supports,” Sankhy�: The Indian Journal no. 4, pp. 701–704, 1954. (p. 538)
of Statistics, Series A, pp. 360–374, 1995.
(p. 580)
i i
i i
i i
References 693
[440] ——, “On Hilbert’s thirteenth problem and [452] R. J. Williams, “Simple statistical gradient-
related questions,” Russian Mathematical following algorithms for connectionist rein-
Surveys, vol. 59, no. 1, p. 11, 2004. (p. xviii) forcement learning,” Machine learning,
[441] ——, Theory of the Transmission and Pro- vol. 8, pp. 229–256, 1992. (p. 77)
cessing of Information. Pergamon Press, [453] H. Witsenhausen and A. Wyner, “A condi-
1961. (p. 535) tional entropy bound for a pair of discrete
[442] J. von Neumann, “Various techniques used random variables,” IEEE Transactions on
in connection with random digits,” Monte Information Theory, vol. 21, no. 5, pp. 493–
Carlo Method, National Bureau of Stan- 501, 1975. (p. 325)
dards, Applied Math Series, no. 12, pp. [454] J. Wolfowitz, “On Wald’s proof of the con-
36–38, 1951. (p. 166) sistency of the maximum likelihood esti-
[443] ——, “Probabilistic logics and the synthe- mate,” The Annals of Mathematical Statis-
sis of reliable organisms from unreliable tics, vol. 20, no. 4, pp. 601–602, 1949.
components,” in Automata Studies.(AM- (p. 582)
34), Volume 34, C. E. Shannon and [455] Y. Wu and J. Xu, “Statistical problems
J. McCarthy, Eds. Princeton University with planted structures: Information-
Press, 1956, pp. 43–98. (p. 627) theoretical and computational limits,” in
[444] D. Von Rosen, “Moments for the inverted Information-Theoretic Methods in Data
Wishart distribution,” Scandinavian Jour- Science, Y. Eldar and M. Rodrigues,
nal of Statistics, pp. 97–109, 1988. (p. 272) Eds. Cambridge University Press, 2020,
[445] V. G. Vovk, “Aggregating strategies,” Proc. arXiv:1806.00118. (p. 338)
of Computational Learning Theory, 1990, [456] Y. Wu and P. Yang, “Minimax rates
1990. (pp. xx and 271) of entropy estimation on large alpha-
[446] M. J. Wainwright, High-dimensional statis- bets via best polynomial approximation,”
tics: A non-asymptotic viewpoint. Cam- IEEE Transactions on Information The-
bridge University Press, 2019, vol. 48. ory, vol. 62, no. 6, pp. 3702–3720, 2016.
(p. xxii) (p. 584)
[447] M. J. Wainwright and M. I. Jordan, “Graph- [457] A. Wyner and J. Ziv, “A theorem on
ical models, exponential families, and varia- the entropy of certain binary sequences
tional inference,” Foundations and Trends® and applications–I,” IEEE Transactions on
in Machine Learning, vol. 1, no. 1–2, pp. Information Theory, vol. 19, no. 6, pp. 769–
1–305, 2008. (pp. 74 and 75) 772, 1973. (p. 191)
[448] A. Wald, “Sequential tests of statistical [458] A. Wyner, “The common information of
hypotheses,” The Annals of Mathematical two dependent random variables,” IEEE
Statistics, vol. 16, no. 2, pp. 117–186, 1945. Transactions on Information Theory,
(p. 320) vol. 21, no. 2, pp. 163–179, 1975. (pp. 503
[449] ——, “Note on the consistency of the max- and 504)
imum likelihood estimate,” The Annals of [459] ——, “On source coding with side infor-
Mathematical Statistics, vol. 20, no. 4, pp. mation at the decoder,” IEEE Transactions
595–601, 1949. (p. 582) on Information Theory, vol. 21, no. 3, pp.
[450] A. Wald and J. Wolfowitz, “Optimum char- 294–300, 1975. (p. 227)
acter of the sequential probability ratio test,” [460] Q. Xie and A. R. Barron, “Minimax redun-
The Annals of Mathematical Statistics, pp. dancy for the class of memoryless sources,”
326–339, 1948. (p. 320) IEEE Transactions on Information Theory,
[451] M. M. Wilde, Quantum information theory. vol. 43, no. 2, pp. 646–657, 1997. (p. 252)
Cambridge University Press, 2013. (p. xxi)
i i
i i
i i
[461] A. Xu and M. Raginsky, “Information- Theory, vol. 38, no. 5, pp. 1597–1602, 1992.
theoretic analysis of generalization capa- (p. 324)
bility of learning algorithms,” Advances [470] C.-H. Zhang, “Compound decision theory
in Neural Information Processing Systems, and empirical Bayes methods,” The Annals
vol. 30, 2017. (p. 90) of Statistics, vol. 31, no. 2, pp. 379–390,
[462] W. Yang, G. Durisi, T. Koch, and Y. Polyan- 2003. (p. 563)
skiy, “Quasi-static multiple-antenna fading [471] T. Zhang, “Covering number bounds of
channels at finite blocklength,” IEEE Trans- certain regularized linear function classes,”
actions on Information Theory, vol. 60, Journal of Machine Learning Research,
no. 7, pp. 4232–4265, 2014. (p. 437) vol. 2, no. Mar, pp. 527–550, 2002. (p. 533)
[463] W. Yang, G. Durisi, and Y. Polyan- [472] Z. Zhang and R. W. Yeung, “A non-
skiy, “Minimum energy to send k bits Shannon-type conditional inequality of
over multiple-antenna fading channels,” information quantities,” IEEE Trans. Inf.
IEEE Transactions on Information The- Theory, vol. 43, no. 6, pp. 1982–1986, 1997.
ory, vol. 62, no. 12, pp. 6831–6853, 2016. (p. 17)
(p. 417) [473] ——, “On characterization of entropy func-
[464] Y. Yang and A. R. Barron, “Information- tion via information inequalities,” IEEE
theoretic determination of minimax rates of Trans. Inf. Theory, vol. 44, no. 4, pp. 1440–
convergence,” Annals of Statistics, vol. 27, 1452, 1998. (p. 17)
no. 5, pp. 1564–1599, 1999. (pp. xxii, 602, [474] L. Zheng and D. N. C. Tse, “Communica-
606, and 607) tion on the Grassmann manifold: A geomet-
[465] Y. G. Yatracos, “Rates of convergence ric approach to the noncoherent multiple-
of minimum distance estimators and Kol- antenna channel,” IEEE transactions on
mogorov’s entropy,” The Annals of Statis- Information Theory, vol. 48, no. 2, pp. 359–
tics, pp. 768–774, 1985. (pp. 602, 620, 383, 2002. (p. 409)
and 621) [475] N. Zhivotovskiy, “Dimension-free bounds
[466] S. Yekhanin, “Improved upper bound for for sums of independent matrices and sim-
the redundancy of fix-free codes,” IEEE ple tensors via the variational principle,”
Trans. Inf. Theory, vol. 50, no. 11, pp. 2815– arXiv preprint arXiv:2108.08198, 2021.
2818, 2004. (p. 208) (p. 85)
[467] P. L. Zador, “Development and evaluation [476] W. Zhou, V. Veitch, M. Austern, R. P.
of procedures for quantizing multivariate Adams, and P. Orbanz, “Non-vacuous gen-
distributions,” Ph.D. dissertation, Stanford eralization bounds at the ImageNet scale:
University, Department of Statistics, 1963. a PAC-Bayesian compression approach,” in
(p. 482) International Conference on Learning Rep-
[468] ——, “Asymptotic quantization error of resentations (ICLR), 2018. (pp. 89 and 90)
continuous signals and the quantization [477] G. Zipf, Selective Studies and the Principle
dimension,” IEEE Transactions on Informa- of Relative Frequency in Language. Cam-
tion Theory, vol. 28, no. 2, pp. 139–149, bridge MA: Harvard University Press, 1932.
1982. (p. 482) (pp. 203 and 204)
[469] O. Zeitouni, J. Ziv, and N. Merhav, “When
is the generalized likelihood ratio test opti-
mal?” IEEE Transactions on Information
i i
i i
i i
Index
FI -curve, 325, 338, 549, 638 Alon, N., 160 BEC, 181, 372, 380, 439, 455, 460,
I-projection, see information alternating minimization algorithm, 471, 639, 654, 669
projection 102 belief propagation, 653
Log function, 25 Amari, S.-I., 307 Bell Labs, 480
ϵ-covering, 523 Anderson’s lemma, 541, 572 Berlekamp, E., 432
ϵ-net, see ϵ-covering Anderson, T. W., 572 Bernoulli factory, 172
ϵ-packing, 523 approximate message passing (AMP), Bernoulli shifts, 230
Z2 synchronization, 649 653 Bernoulli, D., 143
σ-algebra, 79 area theorem, 63 Berry-Esseen inequality, 435
denseness, 240 area under the curve (AUC), 278 Bhattacharyya distance, 315
monotone limits, 79 Arimoto, S., 102, 433 binary divergence, 1, 22, 56
f-divergence, 115, 631 arithmetic coding, 245, 268 binary entropy function, 1, 9
inf-characterization, 151 Artstein, S., 535 binary symmetric channel, see BSC
sup-characterization, 121, 147 Assouad’s lemma, 389, 597, 664 binomial tail, 159
comparison, 127, 132 via Mutual information method, 598 bipartite graph, 161
conditional, 117 asymmetric numeral system (ANS), Birgé, L., 613, 614, 625
convexity, 120 246 Birkhoff-Khintchine theorem, 234
data processing, 119 asymptotic efficiency, 581 Birman, M. Š, 538
finite partitions, 121 asymptotic equipartition property birthday paradox, 186
local behavior, 138 (AEP), 217, 234 bit error rate (BER), 389
lower semicontinuity, 148 asymptotic separatedness, 125 Blackwell measure, 111
monotonicity, 118 autocovariance function, 114 Blackwell order, 182, 329
operational meaning, 122 automatic repeat request (ARQ), 468 Blackwell, D., 111, 182
SDPI, 629 auxiliary channel, 423, 428 Blahut, R., 102
f-information, 134, 182, 630, 631 auxiliary random variable, 227 Blahut-Arimoto algorithm, 102
χ2 , 136 blocklength, 370
additivity, 135 BMS, 383, 631, 669
definition, 134 B-process, 232 mixture representation, 632
subadditivity, 135, 644 balls and urn, 186 Bollobás, B., 164
symmetric KL, 135, 187 Barg, A., 433 Boltzmann, 15
g-divergence, 121 Barron, A. R., 64, 606 Boltzmann constant, 410
k-means, 481 batch loss, 270, 611 Bonami-Beckner semigroup, 132
3GPP, 403 Bayes risk, 662 boolean function, 626
GLM, 563 bowl-shaped, 571
Bayesian Cramér-Rao, 661 box-counting dimension, see
absolute continuity, 21, 42, 43 Bayesian Cramér-Rao (BCR) lower Minkowski dimension
absolute norm, 526 bound, 577 Brégman’s theorem, 162
achievability, 201 Bayesian Cramér-Rao lower bound , Breiman, L., 234
additive set-functions, 79 663 broadcasting
ADSL, 403 Bayesian networks, 633 on a grid, 670
Ahlswede, R., 208, 227, 326 BCR lower bound, 578 on trees, 642
Ahlswede-Csisár, 326 functional estimation, 580 Brownian motion, 417
Alamouti code, 409 multivariate, 579
BSC, 49, 53, 111, 344, 347, 363, 372, non-existence, 98 finite blocklength, 346
379, 436, 439, 455, 469, 471, 630, capacity-achieving output distribution, fundamental limit, 373, 395
643, 655, 669, 670 94, 96, 253 Gallager’s bound, 360, 431
channel coding, 344 uniqueness, 94, 97 information density, 351
contraction coefficient, 633 capacity-cost function, 396, 466, 467 linear code, 362, 461
SDPI, 633 capacity-redundancy theorem, 248, normal approximation, 439
strong converse, 424 269, 270, 604 normalized rate, 439
Burnashev’s error-exponent, 456 Carnot’s cycle, 14 optimal decoder, 347
carrier frequency, 419 posterior matching, 444
Catoni, O., 83 power constraint, 394
capacity, 49, 91, 94, 96, 102, 178, 179,
causal conditioning, 448 probability of error, 343
256, 345, 348
causal inference, 446 randomized encoder, 460
ϵ-capacity, 373, 395
Cencov, N. N., 307 RCU bound, 359, 439
Z-channel, 380
center of gravity, 67, 184 real-world codes, 440
ACGN, 405
central limit theorem, 78, 148, 181, reliability function, 431
additive non-gaussian noise, 401
202 Schalkwijk-Kailath scheme, 459
amplitude-constrained AWGN, 407
Centroidal Voronoi Tessellation sent codeword, 350
AWGN, 399
(CVT), 481 Shannon’s random coding, 354
BEC, 380
chain rule sphere-packing bound, 427, 432,
bit error rate, 390
χ2 , 183 454, 471
BSC, 379
differential entropy, 27 straight-line bound, 432
compound DMC, 465
divergence, 32, 33, 183 strong converse, 422, 465, 469, 470
continuous-time AWGN, 418
entropy, 12, 158 submodularity, 367
erasure-error channel, 464
Hellinger, 183 threshold decoder, 353
Gaussian channel, 100, 107
mutual information, 52, 63, 187 transmission rate, 373
group channel, 379
Rényi divergence, 183 universal, 463
information capacity, 375, 395
total variation, 183 unsent codeword, 350
information stable channels, 386,
chaining, 86 variable-length, 455, 471
399
channel, 29 weak converse, 348, 397
maximal probability of error, 375
channel automorophism, 381 zero-rate, 432
memoryless channels, 377
channel capacity, see capacity channel comparison, 646, 669, 670
MIMO channel, 176
channel coding channel dispersion, 434
mixture DMC, 465
(M, ϵ)-code, 343 channel filter, 406
non-stationary AWGN, 403
κ-β bound, 435 channel state information, 408
parallel AWGN, 402
admissible constraint, 396 channel symmetry group, 381
per unit cost, 414, 470
BSC, 344 channel symmetry relations, 385
product channel, 465
capacity, 345 channel, OR-channel, 189
Shannon capacity, 373, 395
capacity per unit cost, 414 channel, Z-channel, 372
sum of DMCs, 465
capacity-cost, 396, 466 channel, q-ary erasure, 639
with feedback, 443, 471
cost function, 395 channel, q-ary symmetric, 670, 671
zero-error, 374, 464, 471
cost-constrained code, 395, 467 channel, Z-channel, 380
zero-error with feedback, 450
degrees-of-freedom, 409 channel, ACGN, 405
capacity achieving output distribution,
dispersion, see dispersion channel, additive noise, 50, 363, 464
425
DT bound, 356, 460, 461 channel, additive non-Gaussian noise,
Capacity and Hellinger entropy
DT bound, linear codes, 364 466
lower bound, 609
Elias’ scheme, 457 channel, additive-noise, 371, 467
upper bound, 610
energy-per-bit, see energy-per-bit channel, AWGN, 48, 98, 100, 372,
Capacity and KL covering numbers,
error-exponents, 430, 460, 469, 471 399, 436, 457, 460, 470
603, 608
error-exponents with feedback, 454 channel, AWGN with ISI, 406
capacity-achieving input distribution,
expurgated random coding, 469 channel, bandlimited AWGN, 419
94, 444
feedback code, 442, 471 channel, BI-AWGN, 48, 436, 655
discrete, 407
correlation coefficient, maximal, 640 with feedback, 446 KL, 36, 37, 116, 131, 133, 150, 633
cost function, 395 zero-dispersion channel, 436, 437, Le Cam, 117, 133, 631
Costa, M., 64 455 local behavior, 36, 39, 137, 335
coupling, 122, 151, 178, 183 distortion metric, 484 lower semicontinuity, 78, 99
covariance matrix, 49, 59, 114, 510 separable, 485 Marton’s, 123, 151, 183
covariance matrix estimation, 85, 667 distributed estimation, 325, 644, 657 measure-theoretic properties, 80
covering lemma, 228, 327, 499 distribution estimation over an algebra of sets, 79
CR lower bound, 576, 661 χ2 risk, 664 parametric family, 38, 140
multivariate, 576 binary alphabet, 662 Rényi, see Rényi divergence, 314
Cramér’s condition, 293 KL risk, 664 real Gaussians, 23
Cramér, H., 576 quadratic risk, 583, 664 strong convexity, 92, 182
Cramér-Rao lower bound, see CR TV risk, 664 symmetric KL, 135
lower bound distribution, Bernoulli, 9, 48, 269, 333 total variation, see total variation
cryptography, 14 distribution, binomial, 174 divergence for mixtures, 36
Csisár, I., 326 distribution, Dirichlet, 250, 252 DMC, 372, 456
Cuff, P., 503 distribution, discrete, 40 Dobrushin’s contraction, 629
cumulant generating function, see log distribution, exponential, 177, 330 dominated convergence theorem, 138,
MGF distribution, Gamma, 333 141
cumulative risk, 611 distribution, Gaussian, 47, 49, 59, 64, Donsker, M. D., 71
98, 133, 330, 333, 336 Donsker-Varadhan, 71, 83, 147, 150,
distribution, geometric, 9, 175 297
data-processing inequality, see DPI
distribution, Marchenko-Pastur, 176 Doob, 31
de Bruijn’s identity, 60, 191
distribution, mixture of products, 146 doubling dimension, 616
de Finetti’s theorem, 186
distribution, mixtures, 36 DPI, 42, 154, 426
decibels (dB), 48
distribution, Poisson, 178 χ2 , 148
decoder
distribution, Poisson-Binomial, 301 f-divergence, 119
maximal mutual information
distribution, product, 55 f-information, 134
(MMI), 463
distribution, product of mixtures, 146 divergence, 34, 36, 53, 56, 57, 73,
maximum a posteriori (MAP), 347
distribution, subgaussian, see 348
maximum likelihood (ML), 347,
subgaussian Fisher information, 184
353
distribution, uniform, 11, 27, 175, 178 mutual information, 51, 53
decoding region, 342, 400
distribution, Zipf, 203 Neyman-Pearson region, 329
deconvolution filter, 407
Dite, W., 481 Rényi divergence, 433
degradation of channels, 182, 329, 646
divergence, 20 Duda, J., 246
density estimation, 244, 602, 661, 664
χ2 , 184, 185, 668, 669 Dudley’s entropy integral, 531, 552
Bayes χ2 risk, 662
χ2 , 36, 116, 122, 126, 132, 133, Dudley, R., 531
Bayes KL risk, 605, 662
136, 145, 148, 149, 631, 638, 640, Dueck, G., 187, 433
discrete case, 137
641, 644 dynamical system, 230
derivative of divergence, 36
inf-representation, 123
derivative of mutual information, 59
sup-characterization, 70, 71
diameter of a set, 96 ebno, see energy-per-bit
conditional, 42
differential entropy, 26, 47, 61, 158, ECC, 342
continuity, 78
164, 175, 191 eigenvalues, 114
continuity in σ-algebra, 80
directed acyclic graph (DAG), 50, 633 Elias ensemble, 365
convex duality, 73
directed acyclic graphs (DAGs), 179, Elias’ extractor, 168
convexity, 91
180 Elias, P., 167, 270
finite partitions, 70
directed graph, 189 Elliott, E. O., 111
geodesic, 306, 333, 335
directed information, 446 EM algorithm, 77, 104
Hellinger, see Hellinger
Dirichlet prior, 662, 665 convex, 103
distance117
disintegration of probability measure, empirical Bayes, 563
Jeffreys, 135
29 empirical distribution, 137
Jensen-Shannon, 117, 133, 149
dispersion, 379 empirical mutual information, 462
empirical process, 86 ergodic theorem, 268 finite blocklength, 214, 341, 346, 417,
empirical risk, 87 Birkhoff-Khintchine, 234 460
Empirical risk minimization (ERM), maximal, 238 finite groups, 50
87 ergodicity, 232, 393, 467 finite-state machine (FSM), 172, 264
energy-per-bit, 400, 410 error correcting code Fisher defect, 143
fading channel, 416 see ECC, 342 Fisher information, 38, 140, 252, 576,
finite blocklength, 417 error floor, 423 645, 660, 661
entropic CLT, 64 error-exponents, 123, 144, 286, 335, continuity, 141
Entropic risk bound, 602 430, 454, 456, 460, 469 data processing inequality, 184
Hellinger loss, 602, 614 estimand, 558 matrix, 38, 142, 184
Hellinger loss, parametric rate, 617 functional, 580 minimum, compactly supported,
Hellinger lower bound, 618 estimation 580
KL loss, 602, 603 entropy, 138 monotonicity, 184
local Hellinger entropy, 616 Estimation better than chance of a density, 40, 151, 191, 578
sharp rate, 619 Bounded GLM, 592 variational representation, 151
TV loss, 602, 620 distribution estimation, 593 Fisher information inequality, 185
TV loss, misspecified, 621 estimation in Gaussian noise, 58 Fisher’s factorization theorem, 54
entropy, 8 estimation, discrete parameter, 55 Fisher, R., 54, 275
ant scouts, 9 estimation, information measures, 66 Fisher-Rao metric, 40, 307
as signed measure, 46 estimation-compression inequality, Fitingof, B. M., 246
axioms, 13 see online-to-batch conversion Fitingof, B. M., 463
concavity, 92 estimator, 559 Fitingof-Goppa code, 463
conditional, 10, 46, 57 Bayes, 562 flash signaling, 417
continuity, 78, 178 deterministic, 558 Fourier spectrum, 419
differential, see differential entropy, improper, 603 Fourier transform, 114, 406
48 proper, 603 fractional covering number, 160
empirical, 138 randomized, 558 fractional packing number, 161
hidden Markov model, 111 Evans and Schulman, theorem of, 627 frequentist statistics, 54
inf-representation, 24 evidence lower bound (ELBO), 76 Friedgut, E., 160
infinite, 10 exchangeable distribution, 170, 186 Friis transmission equation, 468
Kolmogorov-Sinai, 16 exchangeable event, 177 Fubini theorem, 45
Markov chains, 110 expectation maximization, see EM functional estimation, 580, 666, 668
max entropy, 99, 175 algorithm
Rényi, 13, 57 exponential family, 310
thermodynamic, 14 natural parameter, 310
Gács-Körner information, 338
Venn diagram, 46 standard (one-parameter), 298, 306
Gallager ensemble, 365
entropy characterization, 502 exponential-weights update algorithm,
Gallager’s bound, 360
entropy estimation, 584 271
Gallager, R., 360, 431
large alphabet, 584
Galois field, 219
small alphabet, 584
Fano’s inequality, 41, 57, 179, 664 game of 20 Questions, 13
entropy method, 158
tensorization, 112 game of guessing, 55
entropy power, 64
Fano, R., 41 Gastpar conditions, 521
entropy power inequality, 41, 64
Fatou’s lemma, 38, 99, 142 Gastpar, M., 521
Costa’s, 64
Feder, M., 260 Gaussian CEO problem, 657
Lieb’s, 64
Feinstein’s lemma, 357, 397 Gaussian comparison, 531
entropy rate, 109, 181, 265
Feinstein, A., 357 Gaussian distribution, 23
relative, 288
Fekete’s lemma, 299 complex, 23
entropy vs conditional entropy, 46
Fenchel-Eggleston-Carathéodory Gaussian isoperimetric inequality, 541
entropy-power inequality, 191
theorem, 129 Gaussian location model, see GLM
Erdös, P., 215
Fenchel-Legendre conjugate, 73, 296 Gaussian mixture, 59, 76, 104, 134,
Erdös-Rényi graph, 185
filtration, 319 185, 619
Gaussian Orthogonal Ensemble Hamming sphere, 169, 170 weak converse, 282
(GOE), 668 Hamming weight, 158, 169, 170, 175
Gaussian width, 530 Han, T. S., 475
I-MMSE, 59
Gelfand, I.M., 70 Harremoës, P., 128
identity
Gelfand-Pinsker problem, 468 Haussler-Opper estimate, 188
de Bruijn’s, 191
Gelfand-Yaglom-Perez HCR lower bound
Ihara, S., 401, 419
characterization, 70, 121 Hellinger-based, 661
independence, 50, 55
generalization bounds, 87 multivariate, 576
individual (one-step) risk, 611
generalization error, 88, 187 Hellinger distance, 116, 117, 124, 131,
individual sequence, 248, 259
generalization risk, 87 133, 153, 182, 289, 302, 315, 631,
inequality
generative adversarial networks 661
Bennett’s, 301
(GANs), 149, 602 sup-characterization, 148
Bernstein’s, 301
Gibbs, 15 location family, 142
Brunn-Minkowski, 573
Gibbs distribution, 100, 242, 336 tensorization, 124
de Caen’s, 433
Gibbs sampler, 87, 187 Hellinger entropy
entropy-power, see entropy power
Gibbs variational principle, 74, 83, bounds on KL covering number,
inequality, 191
178 189, 610
Fano’s, see Fano’s inequality
Gilbert, E. N., 111, 527 covering number, 614
Han’s, 17, 28, 160
Gilbert-Elliott HMM, 111, 181 local covering number, 616
Hoeffding’s, 86, 334
Gilbert-Varshamov bound, 433, 527, local packing number, 618
Jensen’s, 11
666 packing number, 609
log-Sobolev, see log-Sobolev
Gilmer’s method, 189 Hessian, 60, 251
inequality (LSI), 191
Ginibre ensemble, 176 Hewitt-Savage 0-1 law, 177
Loomis-Whitney, 164
GLM, 140, 560, 666, 667 hidden Markov model (HMM), 110
non-Shannon, 17
golden formula, 67, 97, 466 high-dimensional probability, 83
Okamoto’s, 301
Goppa, V., 463 Hilbert’s 13th problem, 522, 538
Pinsker’s, see Pinsker’s inequality
graph coloring, 643, 670 Hoeffding’s lemma, 86, 88, 334
Shearer’s, 18
graph partitioning, 190 Huber, P. J., 151, 324, 613
Stam’s, 185
graphical model Huffman algorithm, 209
Tao’s, 58, 127, 190
directed, 447 hypothesis testing, 122
transportation, 656
graphical models accuracy, precision, recall, 277
van Trees, 645
d-connected, 51 asymptotics, 286
Young-Fenchel, 73
d-separation, 51, 445 Bayesian, 277, 314, 330
inf-convolution, 547
collider, 51 Chernoff’s regime, 314
information bottleneck, 177, 549
directed, 41, 50, 69, 179, 180, 633 communication constraints, 325
information density, 351
non-collider, 51 composite, 289, see composite
conditioning-unconditioning, 352
undirected, 74, 648 hypothesis testing
information distillation, 190
Gross, L., 65 error-exponent, 123, 144
information flow, 69, 447
Guerra interpolation, 63 error-exponents, 289, 314
information geometry, 40, 307
Gutman, M., 260 goodness-of-fit, 275, 325
information inequality, 24
independence testing, 326
information percolation
likelihood ratio test (LRT), 280, 424
Haar measure, 381, 543, 551 directed, 627, 635
null hypothesis, 275
Hamiltonian dynamical system, 231 undirected, 650
power, 277
Hammersley, J. M., 575 information projection, 91, 178
robust, 324, 338
Hammersley-Chapman-Robbins lower definition, 302
ROC curve, 276
bound, see HCR lower bound Gaussian, 336
sequential, 319
Hamming ball, 159 marginals, 331
SPRT, 320
Hamming bound, 527 Pythagorean theorem, 303
Stein’s exponent, 286
Hamming code, 362 information radius, 91, 96
strong converse, 283
Hamming distance, 112, 123, 389, 464 information stability, 386, 393, 396,
type-I, type-II error, 277
Hamming space, 219, 347 399, 464
information tails, 285, 353 Kraft inequality, 207 compact support, 142
Ingster-Suslina formula, 185 Krein-Milman theorem, 130 location parameter, 40
integer programming, 209 Krichevsky, R. E., 253 log MGF, 293
integral probability metrics, 123 Krichevsky-Trofimov algorithm, 253, properties, 293
interactive communication, 182, 658 269 log-concave distribution, 573
intersymbol interference (ISI), 406 Krichevsky-Trofimov estimator, 662 log-likelihood ratio, 280
interval censoring, 668 Kronecker Lemma, 388 log-Sobolev inequality (LSI), 65, 132,
Ising model, 74, 191, 642 191
log-Sobolev inequality, modified
Laplace method, 251
James-Stein estimator, 561 (MLSI), 641
Laplace’s law of succession, 253
Jeffreys prior, 252 Loomis, L. H., 164
Laplacian, 59
Joint entropy, 9 loss function, 559
large deviations, 35, 290, 291, 299
joint range, 115, 127 batch, 611
Gaussian, 332
χ2 vs TV, 665 cross-entropy, see log-loss
multiplicative deviation, 301
χ2 vs TV, 132 cumulative, 259
non-iid, 332
Harremoës-Vajda theorem, 128 log-loss, 24, 259, 548, 664
on the boundary, 332
Hellinger and TV, 124 quadratic, 561, 575
rate function, 296
Hellinger vs TV, 131 separable, 569
large deviations theory, 159
Jensen-Shannon vs TV, 133 test, 87
large language models, 110, 257
KL vs χ2 , 133 loss-functions
law of large numbers, 202
KL vs Hellinger, 132, 189, 302 log-loss, 331
strong, 235
KL vs TV, 131, 665 low-density parity check (LDPC), 346
Le Cam distance, 117
Le Cam and Hellinger, 133 lower semicontinuity, 148
Le Cam lemma, 146
Le Cam and Jensen-Shannon, 133 Le Cam’s method, 666
joint source-channel coding, see JSCC Le Cam’s two-point method, 594 Mandelbrot, B., 203
joint type, 462 looseness in high dimensions, 596 Markov approximation, 235
joint typicality, 228, 355, 499 Le Cam, L., 614 Markov chain, 179–181, 232, 265
JSCC, 391 Le Cam-Birgé’s estimator, 614 finite order, 235
ergodic source, 393 least favorable pair, 338 Markov chains, 110, 464
graceful degradation, 520 least favorable prior, 564 k-th order, 110
lossless, 392 non-uniqueness, 663 ergodic, 266
lossy, 515 Lempel-Ziv algorithm, 263 finite order, 247
lossy, achievability, 516 less noisy channel, 646 mixing, 641, 669, 671
lossy, converse, 515 Lieb, E., 64 Markov kernel, 29, 42
source-channel separation, 392, 516 likelihood-ratio trick, 331 composition, 30
statistical application, 585 linear code, 218, 362 Markov lemma, 228, 327, 501
coset leaders, 364 Markov types, 174
Körner, J., 227 error-exponent, 433 martingale convergence theorem, 80
Kac’s lemma, 262 generator matrix, 362 Marton’s transportation inequality,
Kahn, J., 160 geometric uniformity, 363 656
Kakutani’s dichotomy, 125 parity-check matrix, 362 Massey’s directed information, 449
Kelvin, 14 syndrome decoder, 363 see directed information, 446
kernel density estimator (KDE), 136, linear programming, 368 Massey, J., 158, 446
624 linear programming duality, 161 matrix inversion lemma, 40
Kesten-Stigum bound, 670 linear regression, 271, 660 Mauer, A., 83
KL covering numbers, 603, 608 Liouville’s theorem, 231 Maurey’s empirical method, 534
Kolmogorov identities, 51 list decoding, 346 Maurey, B., 534
Kolmogorov’s 0-1 law, 83, 125, 232 Litsyn, S., 433 maximal coding, 357, 367, 397
Kolmogorov, A. N., 239, 522, 524 Lloyd’s algorithm, 480 maximal correlation, 631, 640
Kolmogorov-Sinai entropy, 239 Lloyd, S., 480 maximal sphere packing density, 527
Koshelev, V., 483 location family, 40 circle packing, 527
average, 562 simplex conjecture, 413 power spectral density, 114, 233,
Bayes, 562 Sinai’s generator theorem, 240 405
minimax, 563 Sinai, Y., 239 spectral measure, 233
Robbins, H., 575 single-letterization, 106 stationary process, 109, 230
run-length encoding, 266 singular value decomposition (SVD), statistical experiment, see statistical
176 model
singular values, 640 statistical learning theory, 83, 87
saddle point, 94, 176
Sinkhorn’s algorithm, 105, 311 statistical model, 558
Gaussian, 100, 107, 466
site percolation threshold, 637 nonparametric, 560
sample complexity, 568
SLB, see Shannon lower bound parametric, 560
sample covariance matrix, 85, 667
Slepian, D., 223, 531 Stavskaya automata, 637
sampling without replacement, 186
Slepian-Wolf theorem, 223, 225, 228 Stein’s lemma, 287, 415, 470
Sanov’s theorem, 307, 334
small subgraph conditioning, 186 Stirling’s approximation, 16, 162, 174,
Sanov, I. N., 307
small-ball probability, 539 260, 588
score function, 152
Brownian motion, 553 stochastic block model, see
parametrized family, 39
finite dimensions, 332 community detection, 642
SDPI, 53, 328, 626, 629, 669, 671
Smooth density estimation, 622 stochastic domination, 338
χ2 , 640
L2 loss, 622 stochastic localization, 58, 191
BSC, 633
Hellinger loss, 625 stopping time of a filtration, 319, 455
erasure channels, 639
KL loss, 625 strong converse, 283, 374, 422, 470
joint Gaussian, 640
TV loss, 624 failure, 427, 465
post-processing, 653
soft-covering lemma, 137, 505 strong data-processing inequality, see
tensorization, 638
Solomjak, M., 538 SDPI
self-normalizing sums, 302
source subadditivity of information, 135
separable cost-constraint, 395
Markov, 110 subgaussian, 85, 188
sequential prediction, 245
memoryless, see memoryless source subgraph counts, 160
sequential probability ratio test
mixed, 110, 265 submodular function, 16, 27, 367
(SPRT), 320
source coding Sudakov’s minoration, 530
Shannon
see compression, 197 sufficient statistic, 41, 54, 178, 180,
boolean circuits, construction of,
source-coding 282, 363
626
noisy, 548 supervised learning, 257, 270
Shannon entropy, 8
remote, 548 support, 3, 309
Shannon lower bound, 511, 586, 588
space-time coding, 409 symmetric KL-information
arbitrary norm, 511, 550
sparse estimation, 589, 666 see f-information
quadratic distortion, 511
sparse-graph codes, 341, 346 symmetric KL, 135
Shannon’s channel coding theorem,
sparsity, 666 symmetry group, 381
345, 377
spatial diversity, 403 system identification, 660
Shannon’s rate-distortion theorem,
spectral gap, 641, 669 Szarek, S. J., 527, 535
491
spectral independence, 641, 671 Szegö’s theorem, 114, 406
Shannon’s source coding theorem,
spectral measure, 233 Szemerédi regularity lemma, 127, 190
202, 214
spiked Wigner model, 649
Shannon, C. E., 1–672
squared error, 561
Shannon-McMillan-Breiman theorem, tail σ-algebra, 83, 231
Stam, A. J., 64, 186
233 Telatar, E., 187, 409
standard Borel space, 20, 29, 30, 42,
Shawe-Taylor, J., 83 temperature, 336
43, 51
Shearer’s lemma, 18, 158, 160 Tensor product of experiments, 568
stationary Gaussian processes, 114,
shift-invariant event, 231 minimax risk, 569
233
shrinkage estimator, 561 tensorization, 33, 55, 63, 106, 107,
autocovariance function, 233
Shtarkov distribution, 245, 248, 260 112, 124, 145, 636, 638, 640, 647,
B-process, 233
Shtarkov sum, 249, 260 670
ergodic, 233, 467
signal-to-noise ratio (SNR), 48 FI -curve, 338
significance testing, 275 I-projection, 331
χ2 , 145 uniquely decodable codes, 206 von Neumann, J., 166, 626
capacity, 377 unitary operator, 243 Voronoi cells, 347
capacity-cost, 397 Universal codes, 462 Vovk, V. G., 271
Hellinger, 124 universal compression, 179, 210, 270
minimax risk, 569 universal prediction, 255
test error, 87 universal probability assignment, 245, Wald, A., 319
thermodynamics, 9, 14, 410 255 Wasserstein distance, 105, 123, 151,
Thomason, A. G., 164 Urysohn’s lemma, 72 656
thresholding, 561, 589, 590 Urysohn, P. S., 72 water-filling solution, 114, 176, 341,
Tikhomirov, V. M., 522, 524 402, 406, 437, 438
tilting, 72, 297 Vajda, I., 128 waterfall plot, 423, 439
time sharing, 226 van Trees inequality, see Bayesian weak converse, 282, 348, 397
Toeplitz matrices, 114 Cramér-Rao (BCR) lower bound Whitney, H., 164
total variation, 98, 116, 122, 131, 132, van Trees, H. L., 577 Wiener process, 417
330, 629 Varadhan, S. S., 71 WiFi, 403
inf-representation, 181 varentropy, 200, 547, 584 Wigner’s semicircle law, 652
inf-representation, 122 variable-length codes, 168, 455 Williamson, R. C., 83
sup-characterization, 148 variational autoencoder (VAE), 76, Wishart matrix, 176
sup-representation, 122 602 Wolf, J., 223
training error, 87 variational inference, 74 Wozencraft ensemble, 366
training sample, 87 variational representation, 70, 71, 123, Wringing lemma, 187
147, 154 Wyner’s common information, 502
transition probability kernel, see
Markov kernel χ2 , 149 Wyner, A., 227, 502
transmit-diversity, 409 Fisher information, 151
Trofimov, V. K., 253 Hellinger distance, 148 Yaglom, A. M., 70
Tunstall code, 196 total variation, 122, 148 Yang, Y., 606
turbo codes, 346 Varshamov, R. R., 527 Yang-Barron’s estimator, 607
types, see method of types, 174 Venn diagrams, 46 Yatracos class, 621
Verdú, S., 414 Yatracos’ estimator, 620
undetectable errors, 213, 224 Verdú, S., 475 Yatracos, Y. G., 620
uniform convergence, 85, 86 Verwandlungsinhalt, 14 Young-Fenchel duality, 73
uniform integrability, 153 Vitushkin, A. G., 538
uniform quantization, 29 VLF codes, 455
uniformly integrable martingale, 80 VLFT codes, 471 Zador, P. L., 482
union-closed sets conjecture, 189 volume ratio, 525 Zipf’s law, 203