This textbook introduces the subject of information theory at a level suitable for advanced
undergraduate and graduate students. It develops both the classical Shannon theory and recent
applications in statistical learning. There are six parts covering foundations of information mea-
sures; data compression; hypothesis testing and large deviations theory; channel coding and
channel capacity; rate-distortion and metric entropy; and, finally, statistical applications. There are
over 210 exercises, helping the reader deepen their knowledge and learn about recent discoveries.
Yury Polyanskiy is a Professor of Electrical Engineering and Computer Science at MIT and a
member of the Laboratory of Information and Decision Systems (LIDS) and the Statistics and Data
Science Center (SDSC). He was elected an IEEE Fellow (2024), received the 2020 IEEE Information
Theory Society James Massey Award and co-authored papers receiving Best Paper Awards from
the IEEE Information Theory Society (2011), the IEEE International Symposium on Information
Theory (2008, 2010, 2022), and the Conference on Learning Theory (2021). At MIT he teaches
courses on information theory, probability, statistics, and machine learning.
Yihong Wu is a Professor in the Department of Statistics and Data Science at Yale University.
He was a recipient of the Marconi Society Paul Baran Young Scholar Award in 2011, the NSF
CAREER award in 2017, and the Sloan Research Fellowship in Mathematics in 2018, and was elected
an IMS Fellow in 2023. He has taught classes on probability, statistics, and information theory at Yale
University and the University of Illinois at Urbana-Champaign.
Information Theory
From Coding to Learning
FIRST EDITION
Yury Polyanskiy
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
Yihong Wu
Department of Statistics and Data Science
Yale University
www.cambridge.org
Information on this title: www.cambridge.org/XXX-X-XXX-XXXXX-X
DOI: 10.1017/XXX-X-XXX-XXXXX-X
© Author name XXXX
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published XXXX
Printed in <country> by <printer>
A catalogue record for this publication is available from the British Library.
Library of Congress Cataloging-in-Publication Data
ISBN XXX-X-XXX-XXXXX-X Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
Dedicated to
My family (Y.W.)
Contents
Preface page xv
Introduction xvii
2 Divergence 20
2.1 Divergence and Radon-Nikodym derivatives 20
2.2 Divergence: main inequality and equivalent expressions 24
2.3 Differential entropy 26
2.4 Markov kernels 29
2.5 Conditional divergence, chain rule, data-processing inequality 31
2.6* Local behavior of divergence and Fisher information 36
2.6.1* Local behavior of divergence for mixtures 36
2.6.2* Parametrized family 38
3 Mutual information 41
3.1 Mutual information 41
3.2 Mutual information as difference of entropies 44
3.3 Examples of computing mutual information 47
3.4 Conditional mutual information and conditional independence 50
3.5 Sufficient statistic and data processing 53
3.6 Probability of error and Fano’s inequality 55
3.7* Estimation error in Gaussian noise (I-MMSE) 58
3.8* Entropy-power inequality 63
7 f-divergences 115
7.1 Definition and basic properties of f-divergences 115
7.2 Data-processing inequality; approximation by finite partitions 118
7.3 Total variation and Hellinger distance in hypothesis testing 122
7.4 Inequalities between f-divergences and joint range 126
7.5 Examples of computing joint range 130
7.5.1 Hellinger distance versus total variation 131
7.5.2 KL divergence versus total variation 131
7.5.3 χ²-divergence versus total variation 132
7.6 A selection of inequalities between various divergences 132
7.7 Divergences between Gaussians 133
7.8 Mutual information based on f-divergence 134
7.9 Empirical distribution and χ²-information 136
7.10 Most f-divergences are locally χ²-like 138
Preface
This book is a modern introduction to information theory. In the last two decades, the subject has
evolved from a discipline primarily dealing with problems of information storage and transmission
(“coding”) to one focusing increasingly on information extraction and denoising (“learning”). This
transformation is reflected in the title and content of this book.
It took us more than a decade to complete this work. It started as a set of lecture notes accumu-
lated by the authors through teaching regular courses at MIT, University of Illinois, and Yale, as
well as topics courses at EPFL (Switzerland) and ENSAE (France). Consequently, the intended
usage of this manuscript is as a textbook for a first course on information theory for graduate and
senior undergraduate students, or for a second (topics) course delving deeper into specific areas.
There are two aspects that make this textbook unusual. The first is that, while written by
information-theoretic “insiders”, the material is very much outward looking. While we do cover
in depth the bread-and-butter results (coding theorems) of information theory, we also dedicate
much effort to ideas and methods which have found influential applications in statistical learning,
statistical physics, ergodic theory, computer science, probability theory and more. The second
aspect is that we cover the time-tested classical material (such as connections to combina-
torics, ergodicity, functional analysis) along with the latest developments of very recent years
(large alphabet distribution estimation, community detection, mixing of Markov chains, graphical
models, PAC-Bayes, generalization bounds).
It is hard to mention everyone who helped us start and finish this work, but some stand out
especially. We owe our debt to Sergio Verdú, whose course at Princeton is responsible for our life-
long admiration of the subject. His passion and pedagogy are reflected, if imperfectly, on these
pages. For an undistorted view see his forthcoming comprehensive monograph [436].
Next, we were fortunate to have many bright students contribute to the typing of the original lecture
notes (the precursor of this book), as well as to correcting and extending the content. Among them,
we especially thank Ganesh Ajjanagadde, Austin Collins, Yuzhou Gu, Richard Guo, Qingqing
Huang, Alexander Haberman, Matthew Ho, Yunus Inan, Reka Inovan, Jason Klusowski, Anuran
Makur, Pierre Quinton, Aolin Xu, Sheng Xu, Pengkun Yang, Andrew Yao, and Junhui Zhang.
We thank many colleagues who provided valuable feedback at various stages of the book draft
over the years, in particular, Lucien Birgé, Marco Dalai, Meir Feder, Bob Gallager, Bobak Nazer,
Or Ordentlich, Henry Pfister, Maxim Raginsky, Sasha Rakhlin, Philippe Rigollet, Mark Sellke,
and Nikita Zhivotovskiy. Rachel Cohen (MIT) has been very kind with her time and helped in a
myriad of different ways.
We are grateful for the support from our editors Chloe Mcloughlin, Elizabeth Horne, Sarah
Strange, and Julie Lancashire at Cambridge University Press (CUP), and to CUP for allowing us to keep
a free version online. Our special acknowledgement is to Julie for providing that initial push and
motivation in 2019, without which we would never even have considered going beyond the initial
set of chaotic online lecture notes. (Though, if we had known it would take 4 years...)
The cover art and artwork were contributed by the talented illustrator Nastya Mukhanova [311],
whom we cannot praise enough.
Y. P. would like to thank Olga for her unwavering patience and encouragement. Her loving sac-
rifice made the luxury of countless hours of extra time available to Y. P. to work on this book. Y. P.
would also like to extend a literary hug to Yana, Alina and Evan and thank them for brightening
up his life.
Y. W. would like to thank his parents and his wife, Nanxi.
Y. Polyanskiy <[email protected]>
MIT
Y. Wu <[email protected]>
Yale
Introduction
What is information?
The Oxford English Dictionary lists 18 definitions of the word information, while the Merriam-
Webster Dictionary lists 17. This emphasizes the diversity of meaning and contexts in which the
word information may appear. This book, however, is only concerned with a precise mathematical
understanding of information, independent of the application domain.
How can we measure something that we cannot even define well? Among the earliest attempts
at quantifying information we can list R.A. Fisher's works on the uncertainty of statistical esti-
mates (“confidence intervals”) and R. Hartley's definition of information as the logarithm of the
number of possibilities. Around the same time, Fisher [169] and others identified a connection
between information and thermodynamic entropy. This line of thinking culminated in Claude
Shannon's magnum opus [378], where he formalized the concept of (what we call today) Shan-
non information and forever changed human language by adopting John Tukey's word bit as
the unit of its measurement. In addition to possessing a number of elegant properties, Shannon
information turned out to also answer certain rigorous mathematical questions (such as the opti-
mal rate of data compression and data transmission). This singled out Shannon’s definition as the
right way of quantifying information. Classical information theory, as taught in [106, 111, 177],
focuses exclusively on this point of view.
In this book, however, we take a slightly more general point of view. To introduce it, let us
quote an eminent physicist L. Brillouin [76]:
We must start with a precise definition of the word “information”. We consider a problem involving a certain
number of possible answers, if we have no special information on the actual situation. When we happen to be
in possession of some information on the problem, the number of possible answers is reduced, and complete
information may even leave us with only one possible answer. Information is a function of the ratio of the
number of possible answers before and after, and we choose a logarithmic law in order to insure additivity of
the information contained in independent situations.
Note that only the last sentence specializes the more general term information to Shannon's
special version. In this book, we think of information without that last sentence. Namely, for us
information is a measure of difference between two beliefs about the system state. For example, it
could be the amount of change in our worldview following an observation or an event. Specifically,
suppose that initially the probability distribution P describes our understanding of the world (e.g.,
P allows us to answer questions such as how likely it is to rain today). Following an observation our
distribution changes to Q (e.g., upon observing clouds or a clear sky). The amount of information in
the observation is the dissimilarity between P and Q. How to quantify dissimilarity depends on the
particular context. As argued by Shannon, in many cases the right choice is the Kullback-Leibler
(KL) divergence D(Q‖P), see Definition 2.1. Indeed, if the prior belief is described by a probability
mass function P = (p_1, . . . , p_k) on the set of k possible outcomes, then the observation of the first
outcome results in the new (posterior) belief vector Q = (1, 0, . . . , 0), giving D(Q‖P) = log(1/p_1),
and similarly for other outcomes. Since the outcome i happens with probability p_i, we see that the
average dissimilarity between the prior and posterior beliefs is
$$\sum_{i=1}^{k} p_i \log \frac{1}{p_i},$$
¹ For Institute of Electrical and Electronics Engineers; pronounced “Eye-triple-E”.
entropy [387, 322]. In physics, the Landauer principle and other works on Maxwell's demon have
been heavily influenced by information theory [267, 42]. In natural language processing (NLP),
the idea of modeling text as a high-order Markov model has seen spectacular successes recently in
the form of GPT [320] and related models. Many more topics ranging from biology, neuroscience
and thermodynamics to pattern recognition, artificial intelligence and control theory all regularly
appear in information-theoretic conferences and journals.
It seems that objectively circumscribing the territory claimed by information theory is futile.
Instead, we would like to highlight what we believe to be the recent developments that fascinate
us and which motivated us to write this book.
First, information processing systems of today are much more varied compared to those of last
century. A modern controller (robot) is not just reacting to a few-dimensional vector of observa-
tions, modeled as a linear time-invariant system. Instead, it has million-dimensional inputs (e.g.,
a rasterized image), delayed and quantized, which also need to be communicated across noisy
links. The target of statistical inference is no longer a low-dimensional parameter, but rather a
high-dimensional (possibly discrete) object with structure (e.g. a sparse matrix, or a social net-
work between people with underlying community structure). Furthermore, observations arrive
at the statistician from spatially or temporally separated sources, which need to be transmitted
cognizant of rate limitations. Recognizing these new challenges, multiple communities simul-
taneously started re-investigating classical results (Chapter 29) on the optimality of maximum
likelihood and the (optimal) variance bounds given by the Fisher information. These developments
in high-dimensional statistics, computer science and statistical learning depend on the mastery of
the f-divergences (Chapter 7), the mutual-information method (Chapter 30), and the strong version
of the data-processing inequality (Chapter 33).
Second, since the 1990s technological advances have brought about a slew of new noisy channel
models. While classical theory addresses the so-called memoryless channels, the modern channels,
such as in flash storage, or urban wireless (multi-path, multi-antenna) communication, are far from
memoryless. In order to analyze these, the classical “asymptotic i.i.d.” theory is insufficient. The
resolution is the so-called “one-shot” approach to information theory, in which all main results
are developed while treating the channel inputs and outputs as abstract [211]. Only at the last step
are those inputs given the structure of long sequences and the asymptotic values calculated.
This new “one-shot” approach has additional relevance for quantum information theory, where it
is in fact necessary.
Third, following impressive empirical achievements in the 2010s there was an explosion of
interest in understanding the methods and limits of machine learning from data. Information-
theoretic principles were instrumental for several discoveries in this area. As examples, we recall
the concept of metric entropy (Chapter 27), which is a cornerstone of Vapnik's approach to supervised
learning (known as empirical risk minimization), non-linear regression and the theory of density esti-
mation (Chapter 32). In machine learning, density estimation is known as probabilistic generative
modeling, a prototypical problem in unsupervised learning. At present, the best algorithms are those
derived by applying information-theoretic ideas: the Gibbs variational principle for Kullback-Leibler
divergence (in variational auto-encoders (VAE), cf. Example 4.2) and variational characteriza-
tion of Jensen-Shannon divergence (in generative adversarial networks (GAN), cf. Example 7.5).
The textbook [111] spearheaded the combinatorial approach to information theory, known as
“the method of types”. While more mathematically demanding than [106], [111] manages to intro-
duce stronger results such as sharp estimates of error exponents and, especially, rate regions in
multi-terminal communication systems. However, both books are almost exclusively focused on
asymptotics, Shannon-type information measures and discrete (finite alphabet) cases.
Focused on specialized topics, several monographs are available. For a communication-oriented
reader, the classical [177] is still indispensable. The one-shot point of view is taken in [211]. Con-
nections to statistical learning theory and learning on graphs (belief propagation) are beautifully
covered in [287]. Ergodic theory is the central subject in [198]. Quantum information theory – a
burgeoning field – is treated in the recent [451]. The only textbook dedicated to the connection
between information theory and statistics is by Kullback [264], though restricted to large-sample
asymptotics in hypothesis testing. In nonparametric statistics, application of information-theoretic
methods is briefly (but elegantly) covered in [424].
Nevertheless, it is not possible to quilt this textbook from chapters of these excellent prede-
cessors. A number of important topics are treated exclusively here, such as those in Chapters 7
(f-divergences), 18 (one-shot coding theorems), 22 (finite blocklength), 27 (metric entropy), 30
(mutual information method), 32 (entropic bounds on estimation), and 33 (strong data-processing
inequalities). Furthermore, building up to these chapters requires numerous small innovations
across the rest of the textbook, which are not available elsewhere. In addition, the exercises explore
works of the last few years.
Turning to omissions, this book almost entirely skips the topic of multi-terminal information
theory (with the exception of Sections 11.7*, 16.5* and 25.3*). This difficult subject captivated much
of the effort in the post-Shannon “IEEE-style” theory. We refer to the classical [115] and the recent
excellent textbook [147] containing an encyclopedic coverage of this area.
Another unfortunate omission is the connection between information theory and functional
inequalities [106, Chapter 17]. This topic has seen a flurry of recent activity, especially in loga-
rithmic Sobolev inequalities, isoperimetry, concentration of measure, Brascamp-Lieb inequalities,
(Marton-Talagrand) information-transportation inequalities and others. We only briefly mention
these topics in Sections 3.7*, 3.8* and associated exercises (e.g. I.47 and I.65). For a fuller
treatment, see the monograph [353] and references there.
Finally, this book will not teach one how to construct practical error-correcting codes or design
modern wireless communication systems. Following our Part IV, which covers the basics, an
interested reader is advised to master the tools from coding theory via [360] and multiple-antenna
channels via [423].
A note to statisticians
The interplay between information theory and statistics is a constant theme in the development of
both fields. Since its inception, information theory has been indispensable for understanding the
fundamental limits of statistical estimation. The prominent role of information-theoretic quanti-
ties, such as mutual information, f-divergence, metric entropy, and capacity, in establishing the
minimax rates of estimation has long been recognized since the seminal work of Le Cam [272],
Ibragimov and Khas’minski [222], Pinsker [328], Birgé [53], Haussler and Opper [216], Yang
and Barron [464], among many others. In Part VI of this book we give an exposition of some of
the most influential information-theoretic ideas and their applications in statistics. This part is not
meant to be a thorough treatment of decision theory or mathematical statistics; for that purpose,
we refer to the classics [222, 276, 68, 424] and the more recent monographs [78, 190, 446] focus-
ing on high dimensions. Instead, we apply the theory developed in previous Parts I–V of this book
to several concrete and carefully chosen examples of determining the minimax risk in both classi-
cal (fixed-dimensional, large-sample asymptotic) and modern (high-dimensional, non-asymptotic)
settings.
At a high level, the connection between information theory (in particular, data transmission)
and statistical inference is that both problems are defined by a conditional distribution PY|X , which
is referred to as the channel for the former and the statistical model or experiment for the latter. In
both disciplines the ultimate goal is to estimate X with high fidelity based on its noisy observation Y
using computationally efficient algorithms. However, in data transmission the set of allowed values
of X is typically discrete and restricted to a carefully chosen subset of inputs (called codebook),
the design of which is considered to be the main difficulty. In statistics, however, the space or
the distribution of allowed values of X (the parameter) is constrained by the problem setup (for
example, requiring sparsity or low rank on X), not by the statistician. Despite this key difference,
both disciplines in the end are all about estimating X based on Y and information-theoretic ideas
are applicable in both settings.
Specifically, in Chapter 29 we show how the data processing inequality can be used to deduce
classical lower bounds in statistical estimation (Hammersley-Chapman-Robbins, Cramér-Rao,
van Trees). In Chapter 30 we introduce the mutual information method, based on the reasoning
in joint source-channel coding. Namely, by comparing the amount of information contained in
the data and the amount of information required for achieving a given estimation accuracy, both
measured in bits, this method allows us to apply the theory of capacity and rate-distortion func-
tion developed in Parts IV and V to lower bound the statistical risk. Besides being principled, this
approach also unifies the three popular methods for proving minimax lower bounds due to Le
Cam, Assouad, and Fano respectively (Chapter 31).
It is a common misconception that information theory only supplies techniques for proving
negative results in statistics. In Chapter 32 we present three upper bounds on statistical estimation
risk based on metric entropy: Yang-Barron’s construction inspired by universal compression, Le
Cam-Birgé’s tournament based on pairwise hypothesis testing, and Yatracos’ minimum-distance
approach. These powerful methods are responsible for some of the strongest and most general
results in statistics and are applicable to both high-dimensional and nonparametric problems. Finally,
in Chapter 33 we introduce the method based on strong data processing inequalities and apply
it to resolve an array of contemporary problems including community detection on graphs, dis-
tributed estimation with communication constraints, and generating random tree colorings. These
problems are increasingly captivating the minds of computer scientists as well.
• Part I: Chapters 1–3, Sections 4.1, 5.1–5.3, 6.1, and 3.6, focusing only on discrete prob-
ability space and ignoring Radon-Nikodym derivatives. Some mention of applications in
combinatorics and cryptography (Chapters 8, 9 and select exercises) is recommended.
• Part II: Chapter 10, Sections 11.1–11.5.
• Part III: Chapter 14, Sections 15.1–15.3, and 16.1.
• Part IV: Chapters 17–18, Sections 19.1–19.3, 19.7, 20.1–20.2, 23.1.
• Part V: Sections 24.1–24.3, 25.1, 26.1, and 26.3.
• Conclude with a few applications of information theory outside the classical domain (Chap-
ters 30 and 33).
General conventions
• h(·) is the binary entropy function, H(·) denotes general Shannon entropy.
• d(·‖·) is the binary divergence function, D(·‖·) denotes general Kullback-Leibler divergence.
• Standard channels: BSC_δ, BEC_δ, BIAWGN_{σ²}.
• Common divergences are χ²(·‖·) (chi-squared), D_α(·‖·) (Rényi divergence), D_f(·‖·) (general
f-divergence).
• The Lebesgue measure on Euclidean spaces is denoted by Leb and also by vol (volume).
• Throughout the book, all measurable spaces (X, E) are standard Borel spaces. Unless explicitly
needed, we suppress the underlying σ-algebra E.
• The collection of all probability measures on X is denoted by P(X). For finite spaces we
abbreviate P_k ≡ P([k]), a (k − 1)-dimensional simplex.
• For measures P and Q, their product measure is denoted by P × Q or P ⊗ Q. The n-fold product
of P is denoted by P^n or P^{⊗n}. Similarly, given a Markov kernel P_{Y|X}: X → Y, the kernel that
acts independently on each of the n coordinates is denoted by P_{Y|X}^{⊗n}: X^n → Y^n.
is referred to as the probability density function (pdf); if Q is the counting measure on a countable
X, dP/dQ is the probability mass function (pmf).
• Let P ⊥ Q denote their mutual singularity, namely, P(A) = 0 and Q(A) = 1 for some A.
• The support of a probability measure P, denoted by supp(P), is the smallest closed set C such
that P(C) = 1. An atom x of P is such that P({x}) > 0. A distribution P is discrete if supp(P)
is a countable set (consisting of its atoms).
• Let X be a random variable taking values on X , which is referred to as the alphabet of X. Its
realizations are labeled by lower case letters, e.g. x. Thus, upper case, lower case, and script case
are matched to random variables, realizations, and alphabets, respectively (as in X = x ∈ X ).
Oftentimes X and Y are automatically assumed to be the alphabets of X and Y, etc. We also write
X ∈ X to mean that random variable X is X -valued.
• Let PX denote the distribution (law) of the random variable X, PX,Y the joint distribution of X
and Y, and PY|X the conditional distribution of Y given X.
• A conditional distribution PY|X is also called a Markov kernel acting between spaces X and Y ,
written as PY|X : X → Y . Given a conditional distribution PY|X and a marginal we can form
a joint distribution, written as PX × PY|X , or simply PX PY|X . Its marginal PY is denoted by a
composition operation PY ≜ PY|X ◦ PX .
• The independence of random variables X and Y is denoted by X ⊥⊥ Y, in which case P_{X,Y} =
P_X × P_Y. Similarly, X ⊥⊥ Y|Z denotes their conditional independence given Z, in which case
P_{X,Y|Z} = P_{X|Z} × P_{Y|Z}.
• Throughout the book, X^n ≡ X_1^n ≜ (X_1, . . . , X_n) denotes an n-dimensional random vector. We
write X_1, . . . , X_n ∼ P i.i.d. if they are independently and identically distributed (iid) as P, in which
case P_{X^n} = P^n.
• The empirical distribution of a sequence x_1, . . . , x_n is denoted by P̂_{x^n}; the empirical distribution of
a random sample X_1, . . . , X_n is denoted by P̂_n ≡ P̂_{X^n}.
• →_{a.s.}, →_P, →_d denote convergence almost surely, in probability, and in distribution (law),
respectively. We define =_d to mean equality in distribution.
• Occasionally, for clarity we use a self-explanatory notation E_{Y∼Q}[·] to mean that the expectation
is taken with Y generated from distribution Q. We also use cues like E_C[·] to signify that the
expectation is taken over C.
• Some commonly used distributions are as follows:
– Ber(p): Bernoulli distribution with mean p.
– Bin(n, p): Binomial distribution with n trials and success probability p.
– Poisson(λ): Poisson distribution with mean λ.
– N(μ, σ²) is the Gaussian (normal) distribution on R with mean μ and variance σ². N(μ, Σ) is the
Gaussian distribution on R^d with mean μ and covariance matrix Σ. Denote the standard nor-
mal density by φ(x) = (1/√(2π)) e^{−x²/2}, and the CDF and complementary CDF by
Φ(t) = ∫_{−∞}^{t} φ(x) dx and Φ̄(t) = 1 − Φ(t).
– For a compact subset X of Rd with non-empty interior, Unif(X ) denotes the uniform distri-
bution on X , with Unif(a, b) ≡ Unif([a, b]) for interval [a, b]. We also use Unif(X ) to denote
the uniform (equiprobable) distribution on a finite set X .
• For an R^d-valued random variable X we denote by Cov(X) = E[(X − E[X])(X − E[X])^⊤] its covariance
matrix. A conditional version is denoted by Cov(X|Y) = E[(X − E[X|Y])(X − E[X|Y])^⊤ | Y].
• For a set E ⊂ Ω we denote by 1E (ω) the function equal to 1 iff ω ∈ E. Similarly,
1{boolean condition} denotes a random variable that is equal to 1 iff the “boolean condition”
is satisfied and otherwise equals zero. Thus, for example, P[X > 1] = E[1{X > 1}].
Part I
Information measures
Information measures form the backbone of information theory. The first part of this book is
devoted to an in-depth study of some of them, most notably, entropy, divergence, mutual informa-
tion, as well as their conditional versions (Chapters 1–3). In addition to basic definitions illustrated
through concrete examples, we will also study various aspects including chain rules, regularity,
tensorization, variational representation, local expansion, convexity and optimization properties,
as well as the data processing principle (Chapters 4–6). These information measures will be
imbued with operational meaning when we proceed to classical topics in information theory such
as data compression and transmission, in subsequent parts of the book. This Part also includes
topics connecting information theory to other subjects, such as the I-MMSE relation (estimation the-
ory), the entropy power inequality (probability), and PAC-Bayes bounds and the Gibbs variational principle
(machine learning).
In addition to the classical (Shannon) information measures, Chapter 7 provides a systematic
treatment of f-divergences, a generalization of (Shannon) measures introduced by Csiszár that
plays an important role in many statistical problems (see Parts III and VI). Finally, towards the
end of this part we will discuss two operational topics: random number generators in Chapter 9
and the application of the entropy method to combinatorics and geometry in Chapter 8.
Several contemporary topics are developed in exercises such as stochastic block model
(Exercise I.49), Gilmer’s method in combinatorics (Exercise I.61), Tao’s proof of Szemerédi’s reg-
ularity lemma (Exercise I.63), Eldan’s stochastic localization (Exercise I.66), Gross’ log-Sobolev
inequality (Exercise I.65) and others.
1 Entropy
This chapter introduces the first information measure – Shannon entropy. After studying its stan-
dard properties (chain rule, conditioning), we will briefly describe how one could arrive at its
definition. We discuss axiomatic characterization, the historical development in statistical mechan-
ics, as well as the underlying combinatorial foundation (“method of types”). We close the chapter
with Han’s and Shearer’s inequalities, that both exploit submodularity of entropy. After this Chap-
ter, the reader is welcome to consult the applications in combinatorics (Chapter 8) and random
number generation (Chapter 9), which are independent of the rest of this Part.
log₂ ↔ bits
log_e ↔ nats
log₂₅₆ ↔ bytes
log ↔ arbitrary units (base always matches exp)
Different units will be convenient in different cases and so most of the general results in this book
are stated with “baseless” log/exp.
Definition 1.2 (Joint entropy) The joint entropy of n discrete random variables X^n ≜
(X_1, X_2, . . . , X_n) is
$$H(X^n) = H(X_1, \dots, X_n) = \mathbb{E}\Big[\log \frac{1}{P_{X_1,\dots,X_n}(X_1,\dots,X_n)}\Big],$$
which can also be written explicitly as a summation over a joint probability mass function (PMF):
$$H(X^n) = \sum_{x_1}\cdots\sum_{x_n} P_{X_1,\dots,X_n}(x_1,\dots,x_n)\,\log\frac{1}{P_{X_1,\dots,X_n}(x_1,\dots,x_n)}.$$
Note that joint entropy is a special case of Definition 1.1 applied to the random vector Xn =
(X1 , X2 , . . . , Xn ) taking values in the product space.
Remark 1.1 The name “entropy” originates from thermodynamics – see Section 1.3, which
also provides combinatorial justification for this definition. Another common justification is to
derive H(X) as a consequence of natural axioms for any measure of “information content” – see
Section 1.2. There are also natural experiments suggesting that H(X) is indeed the amount of
“information content” in X. For example, one can measure the time it takes for ant scouts to describe
the location of food to worker ants. It was found that when the nest is placed at the root of a full
binary tree of depth d and the food at one of the leaves, the time was proportional to the entropy of a
random variable describing the food location [358]. (It was also estimated that ants communicate
at about 0.7–1 bit/min and that the communication time reduces if there are some regularities in
the path description: paths like “left,right,left,right,left,right” are described by scouts faster.)
Entropy measures the intrinsic randomness or uncertainty of a random variable. In the simple
setting where X takes values uniformly over a finite set X , the entropy is simply given by log-
cardinality: H(X) = log |X |. In general, the more spread out (resp. concentrated) a probability
mass function is, the higher (resp. lower) is its entropy, as demonstrated by the following example.
Example 1.1 (Bernoulli) Let X ∼ Ber(p), with P_X(1) = p and P_X(0) = p̄ ≜ 1 − p. Then
$$H(X) = h(p) \triangleq p\log\frac{1}{p} + \bar p\log\frac{1}{\bar p}.$$
Here h(·) is called the binary entropy function, which is continuous, concave on [0, 1], symmetric
around 1/2, and satisfies h′(p) = log(p̄/p), with infinite slope at 0 and 1. The highest entropy is
achieved at p = 1/2 (uniform), while the lowest entropy is achieved at p = 0 or 1 (deterministic).
It is instructive to compare the plot of the binary entropy function with the variance p(1 − p).
[Figure: the binary entropy function h(p) on [0, 1], equal to 0 at p = 0, 1 and peaking at log 2 at p = 1/2.]
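As a quick numerical companion to Example 1.1 (our own sketch, not part of the text), the following Python snippet evaluates the binary entropy function and checks the endpoint and midpoint behavior described above; the helper name binary_entropy is ours.

```python
import math

def binary_entropy(p, base=2.0):
    """Binary entropy h(p) = p log(1/p) + (1-p) log(1/(1-p)), with h(0) = h(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p, base) + (1 - p) * math.log(1 - p, base))

# h is maximized at p = 1/2 (value log 2, i.e. 1 bit) and vanishes at p = 0 and p = 1.
assert math.isclose(binary_entropy(0.5), 1.0)
assert binary_entropy(0.0) == 0.0 and binary_entropy(1.0) == 0.0

# Compare the shape of h(p) with that of the variance p(1-p), as suggested in Example 1.1.
for p in [0.01, 0.1, 0.25, 0.5]:
    print(f"p={p:5.2f}   h(p)={binary_entropy(p):.4f} bits   p(1-p)={p * (1 - p):.4f}")
```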
Definition 1.3 (Conditional entropy) Let X be a discrete random variable and Y arbitrary.
Denote by P_{X|Y=y}(·) or P_{X|Y}(·|y) the conditional distribution of X given Y = y. The conditional
entropy of X given Y is
$$H(X|Y) = \mathbb{E}_{y\sim P_Y}\big[H(P_{X|Y=y})\big] = \mathbb{E}\Big[\log\frac{1}{P_{X|Y}(X|Y)}\Big].$$
Note that if Y is also discrete we can write out the expression in terms of the joint PMF P_{X,Y} and
conditional PMF P_{X|Y} as
$$H(X|Y) = \sum_x\sum_y P_{X,Y}(x,y)\,\log\frac{1}{P_{X|Y}(x|y)}.$$
Similar to entropy, conditional entropy measures the remaining randomness of a random vari-
able when another is revealed. As such, H(X|Y) = H(X) whenever Y is independent of X. But
when Y depends on X, observing Y does lower the entropy of X. Before formalizing this in the
next theorem, here is a concrete example.
Example 1.4 (Conditional entropy and noisy channel) Let Y be a noisy observation
of X ∼ Ber(1/2) as follows.
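The body of Example 1.4 is not reproduced above, so as a stand-in here is a minimal numerical sketch in the spirit of Figure 1.1. It assumes the observation model Y = X + σZ with X ∼ Ber(1/2) and Z ∼ N(0, 1) — which may differ in details from the book's example — and computes H(X|Y) by averaging the binary entropy of the posterior P[X = 1 | Y = y].

```python
import numpy as np

def cond_entropy_bits(sigma):
    """H(X|Y) in bits for X ~ Ber(1/2) observed as Y = X + sigma*Z, Z ~ N(0,1)."""
    grid = np.linspace(-1 - 8 * sigma, 2 + 8 * sigma, 200001)
    phi0 = np.exp(-grid ** 2 / (2 * sigma ** 2))           # density of Y given X=0, up to a constant
    phi1 = np.exp(-(grid - 1) ** 2 / (2 * sigma ** 2))      # density of Y given X=1, up to a constant
    post = np.clip(phi1 / (phi0 + phi1), 1e-300, 1 - 1e-12)  # posterior P[X=1 | Y=y]
    p_y = 0.5 * (phi0 + phi1) / (sigma * np.sqrt(2 * np.pi))  # marginal density of Y
    h_post = -(post * np.log2(post) + (1 - post) * np.log2(1 - post))
    dy = grid[1] - grid[0]
    return float(np.sum(p_y * h_post) * dy)                 # Riemann-sum approximation of E_Y[h(post)]

for sigma in [0.1, 0.5, 1.0, 3.0]:
    print(f"sigma={sigma:4.1f}   H(X|Y) ~= {cond_entropy_bits(sigma):.4f} bits")
# As sigma -> 0 the observation reveals X, so H(X|Y) -> 0; as sigma grows, H(X|Y) -> H(X) = 1 bit.
```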
Before discussing various properties of entropy and conditional entropy, let us first review some
relevant facts from convex analysis, which will be used extensively throughout the book.
[Figure 1.1 Conditional entropy of a Bernoulli X given its Gaussian noisy observation.]
Review: Convexity
(f) (Entropy under deterministic transformation) H(X) = H(X, f(X)) ≥ H(f(X)) with equality iff
f is one-to-one on the support of PX .
(g) (Full chain rule)
$$H(X_1, \dots, X_n) = \sum_{i=1}^n H(X_i|X^{i-1}) \le \sum_{i=1}^n H(X_i). \tag{1.3}$$
Proof. (a) Since log(1/P_X(X)) is a non-negative random variable, its expectation H(X) is also non-negative,
with H(X) = 0 if and only if log(1/P_X(X)) = 0 almost surely, namely, P_X is a point mass.
(b) Apply Jensen’s inequality to the strictly concave function x 7→ log x:
1 1
H(X) = E log ≤ log E = log |X |.
PX (X) PX (X)
(c) H(X) as a summation only depends on the values of PX , not locations.
(d) Abbreviate P(x) ≡ P_X(x) and P(x|y) ≡ P_{X|Y}(x|y). Using P(x) = E_Y[P(x|Y)] and applying
Jensen's inequality to the strictly concave function x ↦ x log(1/x),
$$H(X|Y) = \mathbb{E}_Y\Big[\sum_{x\in X} P(x|Y)\log\frac{1}{P(x|Y)}\Big] \le \sum_{x\in X} P(x)\log\frac{1}{P(x)} = H(X).$$
Additionally, this also follows from (and is equivalent to) Corollary 3.5 in Chapter 3 or
Theorem 5.2 in Chapter 5.
(e) Telescoping P_{X,Y}(X, Y) = P_{Y|X}(Y|X) P_X(X) and noting that both sides are positive P_{X,Y}-almost
surely, we have
$$\mathbb{E}\Big[\log\frac{1}{P_{X,Y}(X,Y)}\Big] = \mathbb{E}\Big[\log\frac{1}{P_X(X)\cdot P_{Y|X}(Y|X)}\Big] = \underbrace{\mathbb{E}\Big[\log\frac{1}{P_X(X)}\Big]}_{H(X)} + \underbrace{\mathbb{E}\Big[\log\frac{1}{P_{Y|X}(Y|X)}\Big]}_{H(Y|X)}.$$
(f) The intuition is that (X, f(X)) contains the same amount of information as X. Indeed, x ↦
(x, f(x)) is one-to-one. Thus by (c) and (e):
$$H(X) = H(X, f(X)) = H(f(X)) + H(X|f(X)) \ge H(f(X)).$$
The bound is attained iff H(X|f(X)) = 0, which in turn happens iff X is a constant given f(X).
(g) Similar to (e), telescoping $P_{X^n}(X^n) = \prod_{i=1}^n P_{X_i|X^{i-1}}(X_i|X^{i-1})$ and taking the logarithm
proves the equality. The inequality follows from (d), with the case of equality occurring if and only if
$P_{X_i|X^{i-1}} = P_{X_i}$ for i = 1, . . . , n, namely, $P_{X^n} = \prod_{i=1}^n P_{X_i}$.
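To make the chain rule (1.3) concrete, here is a small self-contained numerical check (ours, not from the text): it draws a random joint PMF on a 3×3×4 alphabet and verifies both the equality and the inequality in (1.3).

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((3, 3, 4)); P /= P.sum()          # a random joint PMF of (X1, X2, X3)

def H(p):
    """Shannon entropy (in bits) of a PMF given as an array of nonnegative weights summing to 1."""
    p = np.asarray(p).ravel(); p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

H123 = H(P)
H1   = H(P.sum(axis=(1, 2)))
H12  = H(P.sum(axis=2))
# Conditional entropies via the identity H(A|B) = H(A,B) - H(B):
H2_g1  = H12 - H1
H3_g12 = H123 - H12
assert np.isclose(H123, H1 + H2_g1 + H3_g12)      # equality part of (1.3)
marginals = [H(P.sum(axis=tuple(j for j in range(3) if j != i))) for i in range(3)]
assert H123 <= sum(marginals) + 1e-12             # inequality part of (1.3)
print(f"H(X1,X2,X3) = {H123:.4f} bits  <=  sum of marginal entropies = {sum(marginals):.4f} bits")
```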
To give a preview of the operational meaning of entropy, let us play the game of 20 Questions.
We are allowed to make queries about some unknown discrete RV X by asking yes-no questions.
The objective of the game is to guess the realized value of the RV X. For example, X ∈ {a, b, c, d}
with P [X = a] = 1/2, P [X = b] = 1/4, and P [X = c] = P [X = d] = 1/8. In this case, we can
ask “X = a?”. If not, proceed by asking “X = b?”. If not, ask “X = c?”, after which we will know
for sure the realization of X. The resulting average number of questions is 1/2 + 1/4 × 2 + 1/8 ×
3 + 1/8 × 3 = 1.75, which equals H(X) in bits. An alternative strategy is to ask “X = a, b or c, d”
in the first round then proceeds to determine the value in the second round, which always requires
two questions and does worse on average.
It turns out (Section 10.3) that the minimal average number of yes-no questions to pin down
the value of X is always between H(X) bits and H(X) + 1 bits. In this special case the above
scheme is optimal because (intuitively) it always splits the probability in half.
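A small numerical companion to the 20 Questions discussion above (our own sketch): it computes H(X) for the distribution on {a, b, c, d} and the average number of questions used by the two strategies described in the text.

```python
import math

probs = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}

H = sum(p * math.log2(1 / p) for p in probs.values())        # entropy of X in bits

# Strategy 1: ask "X=a?", then "X=b?", then "X=c?"; a, b, c, d cost 1, 2, 3, 3 questions.
questions1 = {"a": 1, "b": 2, "c": 3, "d": 3}
avg1 = sum(probs[x] * questions1[x] for x in probs)

# Strategy 2: first split {a, b} vs {c, d}, then one more question; always 2 questions.
avg2 = 2.0

print(f"H(X) = {H} bits, strategy 1 average = {avg1}, strategy 2 average = {avg2}")
# Expected: H(X) = 1.75 bits, strategy 1 average = 1.75, strategy 2 average = 2.0
```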
(a) Permutation invariance: H_m(p_{π(1)}, . . . , p_{π(m)}) = H_m(p_1, . . . , p_m) for any permutation π on [m].
(b) Expansibility: H_m(p_1, . . . , p_{m−1}, 0) = H_{m−1}(p_1, . . . , p_{m−1}).
(c) Normalization: H_2(1/2, 1/2) = log 2.
(d) Subadditivity: H(X, Y) ≤ H(X) + H(Y). Equivalently, H_{mn}(r_{11}, . . . , r_{mn}) ≤ H_m(p_1, . . . , p_m) +
H_n(q_1, . . . , q_n) whenever $\sum_{j=1}^n r_{ij} = p_i$ and $\sum_{i=1}^m r_{ij} = q_j$.
(e) Additivity: H(X, Y) = H(X) + H(Y) if X ⊥⊥ Y. Equivalently, H_{mn}(p_1 q_1, . . . , p_m q_n) =
H_m(p_1, . . . , p_m) + H_n(q_1, . . . , q_n).
(f) Continuity: H_2(p, 1 − p) → 0 as p → 0.
then $H_m(p_1, \dots, p_m) = \sum_{i=1}^m p_i \log\frac{1}{p_i}$ is the only possibility. The interested reader is referred to
[115, Exercise 1.13] and the references therein.
We note that there are other meaningful measures of randomness, including, notably, the Rényi
entropy of order α introduced by Alfréd Rényi [356]:
$$H_\alpha(P) \triangleq \begin{cases}\dfrac{1}{1-\alpha}\log\displaystyle\sum_{i=1}^m p_i^\alpha & \alpha\in(0,1)\cup(1,\infty)\\[4pt] \min_i \log\dfrac{1}{p_i} & \alpha=\infty.\end{cases} \tag{1.4}$$
(The quantity H_∞ is also known as the min-entropy, or H_min, in the cryptography literature.) One
can check that
1 0 ≤ H_α(P) ≤ log m, where the lower (resp. upper) bound is achieved when P is a point mass
(resp. uniform);
2 H_α(P) is non-increasing in α and tends to the Shannon entropy H(P) as α → 1 (see the numerical sketch below);
3 Rényi entropy satisfies the above six axioms except for subadditivity.
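The following sketch (ours, not from the text) evaluates (1.4) for a fixed PMF and illustrates points 1 and 2: H_α(P) is non-increasing in α and approaches the Shannon entropy as α → 1.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (in bits) of a PMF p, following (1.4)."""
    p = np.asarray(p, dtype=float)
    if np.isinf(alpha):
        return float(-np.log2(p.max()))                 # min-entropy H_infinity
    if np.isclose(alpha, 1.0):
        q = p[p > 0]
        return float(-(q * np.log2(q)).sum())           # Shannon entropy as the alpha -> 1 limit
    return float(np.log2((p ** alpha).sum()) / (1 - alpha))

P = np.array([0.5, 0.25, 0.125, 0.125])
values = [renyi_entropy(P, a) for a in [0.5, 0.9, 0.999, 1.0, 1.001, 2.0, np.inf]]
print([round(v, 4) for v in values])
# The sequence is non-increasing, and the values near alpha = 1 match H(P) = 1.75 bits.
assert all(values[i] >= values[i + 1] - 1e-9 for i in range(len(values) - 1))
```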
Second law of thermodynamics: There does not exist a machine that operates in a cycle (i.e. returns to its original
state periodically), produces useful work and whose only other effect on the outside world is drawing heat from
a warm body. (That is, every such machine should expend some amount of heat to some cold body too!)¹
An equivalent formulation is as follows: “There does not exist a cyclic process that transfers heat
from a cold body to a warm body”. That is, every such process needs to be helped by expending
some amount of external work; for example, the air conditioners, sadly, will always need to use
some electricity.
Notice that there is something annoying about the second law as compared to the first law. In
the first law there is a quantity that is conserved, and this is somehow logically easy to accept. The
second law seems a bit harder to believe in (and some engineers did not, and only their recurrent
failures to circumvent it finally convinced them). So Clausius, building on an ingenious work of
S. Carnot, figured out that there is an “explanation” to why any cyclic machine should expend
heat. He proposed that there must be some hidden quantity associated to the machine, entropy
of it (initially described as “transformative content” or Verwandlungsinhalt in German), whose
value must return to its original state. Furthermore, under any reversible (i.e. quasi-stationary, or
“very slow”) process operated on this machine the change of entropy is proportional to the ratio
¹ Note that the reverse effect (that is, converting work into heat) is rather easy: friction is an example.
where k is the Boltzmann constant, and we assumed that each particle can only be in one of ℓ
molecular states (e.g. spin up/down, or if we quantize the phase volume into ℓ subcubes) and pj is
the fraction of particles in the j-th molecular state.
More explicitly, their innovation was two-fold. First, they separated the concept of a micro-
state (which in our example above corresponds to a tuple of n states, one for each particle) and the
macro-state (a list {pj } of proportions of particles in each state). Second, they postulated that for
experimental observations only the macro-state matters, but the multiplicity of the macro-state
(number of micro-states that correspond to a given macro-state) is precisely the (exponential
of the) entropy. The formula (1.6) then follows from the following explicit result connecting
combinatorics and entropy.
Proposition 1.5 (Method of types) Let n_1, . . . , n_k be non-negative integers with $\sum_{i=1}^k n_i = n$,
and denote the distribution P = (p_1, . . . , p_k), p_i = n_i/n. Then the multinomial coefficient
$\binom{n}{n_1,\dots,n_k} \triangleq \frac{n!}{n_1!\cdots n_k!}$ satisfies
$$\frac{1}{(n+1)^{k-1}}\exp\{nH(P)\} \le \binom{n}{n_1,\dots,n_k} \le \exp\{nH(P)\}.$$
Proof. For the upper bound, let X_1, . . . , X_n be i.i.d. ∼ P and let $N_i = \sum_{j=1}^n 1\{X_j = i\}$ denote the number
of occurrences of i. Then (N_1, . . . , N_k) has a multinomial distribution:
$$\mathbb{P}[N_1 = n'_1, \dots, N_k = n'_k] = \binom{n}{n'_1,\dots,n'_k}\prod_{i=1}^k p_i^{n'_i},$$
(n + 1)^{k−1} values, the lower bound follows if we can show that (n_1, . . . , n_k) is its mode. Indeed,
for any n′_i with n′_1 + · · · + n′_k = n, defining ∆_i = n′_i − n_i we have
For more on combinatorics and entropy, see Ex. I.1, I.3 and Chapter 8. For more on the intricate
relationship between statistical, mechanistic and information-theoretic description of the world
see Section 12.5* on Kolmogorov-Sinai entropy.
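Here is a small numerical check of Proposition 1.5 (our own sketch): for a few compositions (n_1, . . . , n_k) it compares the multinomial coefficient with the bounds exp{nH(P)}/(n + 1)^{k−1} and exp{nH(P)}, using natural logarithms so that exp matches the nats convention.

```python
import math

def multinomial(ns):
    """Multinomial coefficient n! / (n_1! ... n_k!) as an exact integer."""
    out = math.factorial(sum(ns))
    for m in ns:
        out //= math.factorial(m)
    return out

def nH(ns):
    """n * H(P) in nats, where P = (n_1/n, ..., n_k/n)."""
    n = sum(ns)
    return sum(m * math.log(n / m) for m in ns if m > 0)

for ns in [(3, 3, 4), (10, 0, 10), (1, 2, 3, 4), (50, 30, 20)]:
    n, k = sum(ns), len(ns)
    coeff = multinomial(ns)
    upper = math.exp(nH(ns))
    lower = upper / (1 + n) ** (k - 1)
    assert lower <= coeff <= upper, ns       # the two bounds of Proposition 1.5
    print(f"{ns}: {lower:.3e} <= {coeff:.3e} <= {upper:.3e}")
```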
1.4* Submodularity
Recall that [n] denotes the set {1, . . . , n}, $\binom{S}{k}$ denotes the collection of subsets of S of size k, and 2^S denotes all subsets
of S. A set function f: 2^S → R is called submodular if for any T_1, T_2 ⊂ S
$$f(T_1\cup T_2) + f(T_1\cap T_2) \le f(T_1) + f(T_2). \tag{1.8}$$
Submodularity is similar to concavity, in the sense that “adding elements gives diminishing
returns”. Indeed, consider T′ ⊂ T and b ∉ T. Then
$$f(T\cup b) - f(T) \le f(T'\cup b) - f(T').$$
Proof. Let A = X_{T_1∖T_2}, B = X_{T_1∩T_2}, C = X_{T_2∖T_1}. Then we need to show
$$H(A, B, C) + H(B) \le H(A, B) + H(B, C).$$
This follows from a simple chain
$$H(A, B, C) + H(B) = H(A, C|B) + 2H(B) \tag{1.9}$$
$$\le H(A|B) + H(C|B) + 2H(B) \tag{1.10}$$
$$= H(A, B) + H(B, C). \tag{1.11}$$
$$T_1 \subset T_2 \implies H(X_{T_1}) \le H(X_{T_2}).$$
So fixing n, let us denote by Γ_n the set of all non-negative, monotone, submodular set-functions on
[n]. Note that via an obvious enumeration of all non-empty subsets of [n], Γ_n is a closed convex cone
in $\mathbb{R}_+^{2^n-1}$. Similarly, let us denote by Γ*_n the set of all set-functions corresponding to distributions
on X^n. Let us also denote by Γ̄*_n the closure of Γ*_n. It is not hard to show, cf. [472], that Γ̄*_n is also a
closed convex cone and that
$$\Gamma^*_n \subset \bar\Gamma^*_n \subset \Gamma_n.$$
This follows from the fundamental new information inequality not implied by the submodularity
of entropy (and thus called a non-Shannon inequality). Namely, [473] showed that for any 4-tuple
of discrete random variables:
$$I(X_3; X_4) - I(X_3; X_4|X_1) - I(X_3; X_4|X_2) \le \frac{1}{2} I(X_1; X_2) + \frac{1}{4} I(X_1; X_3, X_4) + \frac{1}{4} I(X_2; X_3, X_4).$$
Here we have used mutual information and conditional mutual information – notions that we
will introduce later. However, the above inequality (with the help of Theorem 3.4) can be easily
rewritten as a rather cumbersome expression in terms of entropies of sets of variables X1 , X2 , X3 , X4 .
In conclusion, the work [473] demonstrated that the entropy set-function is more constrained than
a generic submodular non-negative set function even if one only considers linear constraints.
$$\frac{1}{n}\bar H_n \le \cdots \le \frac{1}{k}\bar H_k \le \cdots \le \bar H_1. \tag{1.15}$$
Furthermore, the sequence H̄_k is increasing and concave in the sense of decreasing slope:
$$\bar H_{k+1} - \bar H_k \le \bar H_k - \bar H_{k-1}. \tag{1.16}$$
Proof. Denote for convenience H̄_0 = 0. Note that H̄_m/m is an average of differences:
$$\frac{1}{m}\bar H_m = \frac{1}{m}\sum_{k=1}^m (\bar H_k - \bar H_{k-1}).$$
Thus, it is clear that (1.16) implies (1.15) since increasing m by one adds a smaller element to the
average. To prove (1.16) observe that from submodularity
$$H(X_1,\dots,X_{k+1}) + H(X_1,\dots,X_{k-1}) \le H(X_1,\dots,X_k) + H(X_1,\dots,X_{k-1}, X_{k+1}).$$
Now average this inequality over all n! permutations of indices {1, . . . , n} to get
$$\bar H_{k+1} + \bar H_{k-1} \le 2\bar H_k,$$
as claimed by (1.16).
Alternative proof: Notice that by “conditioning decreases entropy” we have
$$H(X_{k+1}\,|\,X_1,\dots,X_k) \le H(X_{k+1}\,|\,X_1,\dots,X_{k-1});$$
averaging this inequality over all permutations of indices again yields (1.16).
Theorem 1.8 (Shearer’s Lemma) Let Xn be discrete n-dimensional RV and let S ⊂ [n] be
a random variable independent of Xn and taking values in subsets of [n]. Then
Remark 1.2 In the special case where S is uniform over all subsets of cardinality k, (1.17)
reduces to Han's inequality (1/n) H(X^n) ≤ (1/k) H̄_k. The case of n = 3 and k = 2 can be used to give
an entropy proof of the following well-known geometric result that relates the size of a 3-D object
to those of its 2-D projections: place N points in R³ arbitrarily. Let N_1, N_2, N_3 denote the number
of distinct points projected onto the xy-, xz- and yz-planes, respectively. Then N_1 N_2 N_3 ≥ N². For
another application, see Section 8.2.
Proof. We will prove an equivalent (by taking a suitable limit) version: if C = (S_1, . . . , S_M) is a
list (possibly with repetitions) of subsets of [n] then
$$\sum_j H(X_{S_j}) \ge H(X^n)\cdot\min_i \deg(i), \tag{1.18}$$
where deg(i) ≜ #{j : i ∈ S_j}. Let us call C a chain if all subsets can be rearranged so that
S_1 ⊆ S_2 ⊆ · · · ⊆ S_M. For a chain, (1.18) is trivial, since the minimum on the right-hand side is either
zero (if S_M ≠ [n]) or equals the multiplicity of S_M in C,² in which case we have
$$\sum_j H(X_{S_j}) \ge H(X_{S_M})\,\#\{j : S_j = S_M\} = H(X^n)\cdot\min_i \deg(i).$$
For the case of C not a chain, consider a pair of sets S_1, S_2 that are not related by inclusion and
replace them in the collection with S_1 ∩ S_2, S_1 ∪ S_2. Submodularity (1.8) implies that the sum on the
left-hand side of (1.18) does not increase under this replacement, while the values deg(i) are not changed.
² Note that, consequently, for X^n without constant coordinates, and if C is a chain, (1.18) is only tight if C consists of only ∅
and [n] (with multiplicities). Thus if the degrees deg(i) are known and non-constant, then (1.18) can be improved, cf. [288].
Notice that the total number of pairs that are not related by inclusion strictly decreases under this
replacement: if T was related by inclusion to S_1 then it will also be related to at least one of S_1 ∪ S_2
or S_1 ∩ S_2; if T was related to both S_1, S_2 then it will be related to both of the new sets as well.
Therefore, by applying this operation we must eventually arrive at a chain, for which (1.18) has
already been shown.
Remark 1.3 Han’s inequality (1.16) holds for any submodular set-function. For Han’s inequal-
ity (1.15) we also need f(∅) = 0 (this can be achieved by adding a constant to all values of f).
Shearer’s lemma holds for any submodular set-function that is also non-negative.
Example 1.5 (Non-entropy submodular function) Another submodular set-function is
$$S \mapsto I(X_S; X_{S^c}).$$
Han's inequality for this one reads
$$0 = \frac{1}{n} I_n \le \cdots \le \frac{1}{k} I_k \le \cdots \le I_1,$$
where $I_k = \binom{n}{k}^{-1}\sum_{S:|S|=k} I(X_S; X_{S^c})$ measures the amount of k-subset coupling in the random vector X^n.
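As a numerical illustration of Han's inequality (ours, not from the text), the sketch below draws a random joint PMF on {0, 1}³ and verifies (1/3)H̄₃ ≤ (1/2)H̄₂ ≤ H̄₁, where H̄_k averages H(X_S) over all subsets S of size k.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((2, 2, 2)); P /= P.sum()            # random joint PMF of (X1, X2, X3)

def H(p):
    p = np.asarray(p).ravel(); p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def H_bar(k, n=3):
    """Average of H(X_S) over all subsets S of [n] with |S| = k."""
    subsets = list(itertools.combinations(range(n), k))
    vals = [H(P.sum(axis=tuple(i for i in range(n) if i not in S))) for S in subsets]
    return sum(vals) / len(subsets)

ratios = [H_bar(k) / k for k in (3, 2, 1)]
print([round(r, 4) for r in ratios])
assert ratios[0] <= ratios[1] + 1e-12 and ratios[1] <= ratios[2] + 1e-12   # Han's inequality (1.15)
```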
2 Divergence
In this chapter we study the divergence D(P‖Q) (also known as information divergence, Kullback-
Leibler (KL) divergence, relative entropy), which is the first example of a dissimilarity (information)
measure between a pair of distributions P and Q. As we will see later in Chapter 7, KL divergence is
a special case of f-divergences. Defining KL divergence and its conditional version in full general-
ity requires some measure-theoretic acrobatics (Radon-Nikodym derivatives and Markov kernels),
which we spend some time on. (We stress again that all these abstractions can be ignored if one is
willing to only work with finite or countably-infinite alphabets.)
Besides definitions we prove the “main inequality” showing that KL-divergence is non-negative.
Coupled with the chain rule for divergence, this inequality implies the data-processing inequality,
which is arguably the central pillar of information theory and this book. We conclude the chapter
by studying local behavior of divergence when P and Q are close. In the special case when P and
Q belong to a parametric family, we will see that divergence is locally quadratic with Hessian
being the Fisher information, explaining the fundamental role of the latter in classical statistics.
Review: Measurability
• All complete separable metric spaces, endowed with their Borel σ-algebras, are standard
Borel. In particular, countable alphabets and R^n and R^∞ (the space of sequences) are
standard Borel.
• If X_i, i = 1, . . ., are standard Borel, then so is $\prod_{i=1}^\infty X_i$.
• Singletons {x} are measurable sets.
• The diagonal {(x, x) : x ∈ X} is measurable in X × X.
We now need to define the second central concept of this book: the relative entropy, or Kullback-
Leibler divergence. Before giving the formal definition, we start with special cases. For that we
fix some alphabet A. The relative entropy between distributions P and Q on A is denoted by
D(P‖Q), defined as follows.
• Suppose A = R^k and P and Q have densities (pdfs) p and q with respect to the Lebesgue measure.
Then
$$D(P\|Q) = \begin{cases}\displaystyle\int_{\{p>0,\,q>0\}} p(x)\log\frac{p(x)}{q(x)}\,dx & \mathrm{Leb}\{p>0,\,q=0\}=0\\ +\infty & \text{otherwise.}\end{cases} \tag{2.2}$$
These two special cases cover a vast majority of all cases that we encounter in this book. However,
mathematically it is not very satisfying to restrict to these two special cases. For example, it
is not clear how to compute D(P‖Q) when P and Q are two measures on a manifold (such as a
unit sphere) embedded in R^k. Another problematic case is computing D(P‖Q) between measures
on the space of sequences (stochastic processes). To address these cases we need to recall the
concepts of Radon-Nikodym derivative and absolute continuity.
Recall that for two measures P and Q, we say P is absolutely continuous w.r.t. Q (denoted by
P ≪ Q) if Q(E) = 0 implies P(E) = 0 for all measurable E. If P ≪ Q, then the Radon-Nikodym
theorem shows that there exists a function f : X → R_+ such that for any measurable set E,
$$P(E) = \int_E f\,dQ. \qquad\text{[change of measure]} \tag{2.3}$$
Such an f is called a relative density or a Radon-Nikodym derivative of P w.r.t. Q, denoted by dP/dQ.
Note that dP/dQ may not be unique. In the simple cases above, dP/dQ is just the familiar likelihood ratio:
the ratio of probability mass functions in the discrete case and the ratio of densities p(x)/q(x) in the continuous case.
We can see that the two special cases of D(P‖Q) were both computing E_P[log dP/dQ]. This turns
out to be the most general definition that we are looking for. However, we will state it slightly
differently, following the tradition.
Below we will show (Lemma 2.5) that the expectation in (2.4) is well-defined (but possibly
infinite) and coincides with E_P[log dP/dQ] whenever P ≪ Q.
To demonstrate the general definition in the case not covered by the discrete/continuous special-
izations, consider the situation in which both P and Q are given as densities with respect to a
common dominating measure μ, written as dP = f_P dμ and dQ = f_Q dμ for some non-negative
f_P, f_Q. (In other words, P ≪ μ and f_P = dP/dμ.) For example, taking μ = P + Q always allows one to
specify P and Q in this form. In this case, we have the following expression for divergence:
$$D(P\|Q) = \begin{cases}\displaystyle\int_{\{f_P>0,\,f_Q>0\}} f_P\log\frac{f_P}{f_Q}\,d\mu & \mu(\{f_Q=0,\,f_P>0\})=0,\\ +\infty & \text{otherwise.}\end{cases} \tag{2.5}$$
Indeed, first note that, under the assumption of P ≪ μ and Q ≪ μ, we have P ≪ Q iff
μ({f_Q = 0, f_P > 0}) = 0. Furthermore, if P ≪ Q, then dP/dQ = f_P/f_Q Q-a.e., in which case apply-
ing (2.3) and (1.1) reduces (2.5) to (2.4). Namely, D(P‖Q) = E_Q[(dP/dQ) log(dP/dQ)] = E_Q[(f_P/f_Q) log(f_P/f_Q)] =
∫ dμ f_P log(f_P/f_Q) 1{f_Q > 0} = ∫ dμ f_P log(f_P/f_Q) 1{f_Q > 0, f_P > 0}.
Note that D(P‖Q) was defined to be +∞ if P ≪̸ Q. However, it can also be +∞ even when
P ≪ Q. For example, D(Cauchy‖Gaussian) = ∞. However, this does not mean that there are
somehow two different ways in which D can be infinite. Indeed, what can be shown is that in
both cases there exists a sequence of (finer and finer) finite partitions Π of the space A such that
the KL divergence between the induced discrete distributions P|_Π and Q|_Π grows without
bound. This will be the subject of Theorem 4.5 below.
Our next observation is that, generally, D(P‖Q) ≠ D(Q‖P) and, therefore, divergence is not a
distance. We will see later that this is natural in many cases; for example, it reflects the inherent
asymmetry of hypothesis testing (see Part III and, in particular, Section 14.5). Consider the exam-
ple of coin tossing where under P the coin is fair and under Q the coin always lands on heads.
Upon observing HHHHHHH, one tends to believe it is Q but can never be absolutely sure; upon
observing HHT, one knows for sure it is P. Indeed, D(P‖Q) = ∞, D(Q‖P) = 1 bit.
Having made these remarks we proceed to some examples. First, we show that D is unsurpris-
ingly a generalization of entropy.
Example 2.1 (Binary divergence) Consider P = Ber(p) and Q = Ber(q) on A = {0, 1}.
Then
$$D(P\|Q) = d(p\|q) \triangleq p\log\frac{p}{q} + \bar p\log\frac{\bar p}{\bar q}. \tag{2.6}$$
[Figure: the binary divergence d(p‖q) plotted as a function of p for fixed q (left; equal to log(1/q̄) at p = 0 and log(1/q) at p = 1, vanishing at p = q) and as a function of q for fixed p (right; vanishing at q = p).]
In fact, this is a special case of the famous Pinsker’s inequality (Theorem 7.10).
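A quick numerical sketch (ours) of the binary divergence (2.6), illustrating both that d(p‖q) vanishes iff p = q and the asymmetry of divergence discussed above, including the fair-coin vs. always-heads example.

```python
import math

def binary_kl(p, q, base=2.0):
    """Binary divergence d(p||q) of (2.6), with the conventions 0 log 0 = 0 and a log(a/0) = +inf for a > 0."""
    def term(a, b):
        if a == 0.0:
            return 0.0
        if b == 0.0:
            return math.inf
        return a * math.log(a / b, base)
    return term(p, q) + term(1 - p, 1 - q)

print(binary_kl(0.3, 0.3))                          # 0: the divergence vanishes iff p = q
print(binary_kl(0.1, 0.4), binary_kl(0.4, 0.1))     # asymmetric in general
print(binary_kl(0.5, 1.0), binary_kl(1.0, 0.5))     # fair coin vs. always-heads: +inf and 1 bit
```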
Example 2.2 (Real Gaussian) For two Gaussians on A = R,
$$D(\mathcal N(m_1,\sigma_1^2)\,\|\,\mathcal N(m_0,\sigma_0^2)) = \frac{\log e}{2}\,\frac{(m_1-m_0)^2}{\sigma_0^2} + \frac{1}{2}\Big[\log\frac{\sigma_0^2}{\sigma_1^2} + \Big(\frac{\sigma_1^2}{\sigma_0^2}-1\Big)\log e\Big]. \tag{2.7}$$
Here, the first and second terms compare the means and the variances, respectively.
Similarly, in the vector case of A = R^k and assuming det Σ_0 ≠ 0, we have
$$D(\mathcal N(m_1,\Sigma_1)\,\|\,\mathcal N(m_0,\Sigma_0)) = \frac{\log e}{2}\Big[(m_1-m_0)^\top\Sigma_0^{-1}(m_1-m_0) + \mathrm{Tr}\big(\Sigma_0^{-1}\Sigma_1 - I\big)\Big] + \frac{1}{2}\log\frac{\det\Sigma_0}{\det\Sigma_1}. \tag{2.8}$$
Then
$$D(\mathcal N_c(m_1,\sigma_1^2)\,\|\,\mathcal N_c(m_0,\sigma_0^2)) = \log e\,\frac{|m_1-m_0|^2}{\sigma_0^2} + \log\frac{\sigma_0^2}{\sigma_1^2} + \Big(\frac{\sigma_1^2}{\sigma_0^2}-1\Big)\log e, \tag{2.9}$$
which follows from (2.8). More generally, for complex Gaussian vectors on C^k, assuming det Σ_0 ≠ 0,
Proof. In view of the definition (2.4), it suffices to consider P ≪ Q. Let φ(x) ≜ x log x, which
is strictly convex on R_+. Applying Jensen's inequality:
$$D(P\|Q) = \mathbb{E}_Q\Big[\varphi\Big(\frac{dP}{dQ}\Big)\Big] \ge \varphi\Big(\mathbb{E}_Q\Big[\frac{dP}{dQ}\Big]\Big) = \varphi(1) = 0,$$
with equality iff dP/dQ = 1 Q-a.e., namely, P = Q.
Here is a typical application of the previous result (variations of it will be applied numerous
times in this book). This result is widely used in machine learning, as it shows that minimizing the
average cross-entropy loss ℓ(Q, x) ≜ log(1/Q(x)) recovers the true distribution (Exercise III.11).
Corollary 2.4 Let X be a discrete random variable with H(X) < ∞. Then
$$\min_Q \mathbb{E}\Big[\log\frac{1}{Q(X)}\Big] = H(X),$$
and the unique minimizer is Q = P_X.
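A small numerical illustration of Corollary 2.4 (ours, not from the book): for a fixed PMF P, the average cross-entropy loss E[log(1/Q(X))] is never below H(P) and is minimized at Q = P, where it equals H(P).

```python
import numpy as np

rng = np.random.default_rng(2)
P = np.array([0.5, 0.3, 0.15, 0.05])                   # distribution of X

def cross_entropy(P, Q):
    """E_{X~P}[log2(1/Q(X))]; for Q > 0 this equals H(P) + D(P||Q) >= H(P)."""
    return float(-(P * np.log2(Q)).sum())

H_P = cross_entropy(P, P)                               # equals H(P)
for _ in range(5):
    Q = rng.random(4) + 0.05; Q /= Q.sum()              # a random mismatched model Q
    assert cross_entropy(P, Q) >= H_P - 1e-12           # Corollary 2.4
print(f"H(P) = {H_P:.4f} bits; every mismatched Q pays a larger average loss.")
```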
Another implication of the proof of Theorem 2.3 is in bringing forward the reason for defining
D(P‖Q) = E_Q[(dP/dQ) log(dP/dQ)] as opposed to D(P‖Q) = E_P[log(dP/dQ)]. However, we still need to show
that the two definitions are equivalent, which is what we do next. In addition, we will also unify
the two cases (P ≪ Q vs. P ≪̸ Q) in Definition 2.1.
Lemma 2.5 Let P, Q, R ≪ μ and let f_P, f_Q, f_R denote their densities relative to μ. Define a bivariate
function Log(a/b): R_+ × R_+ → R ∪ {±∞} by
$$\mathrm{Log}\,\frac{a}{b} = \begin{cases}-\infty & a=0,\ b>0\\ +\infty & a>0,\ b=0\\ 0 & a=0,\ b=0\\ \log\frac{a}{b} & a>0,\ b>0.\end{cases} \tag{2.10}$$
Then the following results hold:
Remark 2.1 Note that, ignoring the issue of dividing by or taking a log of 0, the proof of (2.12)
is just the simple identity log(dR/dQ) = log((dR/dP)(dP/dQ)) = log(dP/dQ) − log(dP/dR). What permits us to handle zeros
is the Log function, which satisfies several natural properties of the log: for every a, b ∈ R_+
$$\mathrm{Log}\,\frac{a}{b} = -\mathrm{Log}\,\frac{b}{a},$$
and for every c > 0 we have
$$\mathrm{Log}\,\frac{a}{b} = \mathrm{Log}\,\frac{a}{c} + \mathrm{Log}\,\frac{c}{b} = \mathrm{Log}\,\frac{ac}{b} - \log(c),$$
except for the case a = b = 0.
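To make the convention (2.10) concrete, here is a small sketch (ours) implementing the Log function and using it to evaluate E_P[Log(P(x)/Q(x))] over a finite alphabet, which reproduces D(P‖Q) including the +∞ case.

```python
import math

def Log(a, b):
    """The extended logarithm Log(a/b) of (2.10)."""
    if a == 0.0 and b == 0.0:
        return 0.0
    if a == 0.0:
        return -math.inf
    if b == 0.0:
        return math.inf
    return math.log(a / b)

def kl_via_Log(P, Q):
    """E_P[Log(P(x)/Q(x))] over a finite alphabet (in nats)."""
    return sum(p * Log(p, q) for p, q in zip(P, Q) if p > 0)

print(kl_via_Log([0.5, 0.5, 0.0], [0.25, 0.25, 0.5]))   # finite: equals log 2 in nats
print(kl_via_Log([0.5, 0.5, 0.0], [1.0, 0.0, 0.0]))     # P is not << Q, so the value is +inf
```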
Proof. First, suppose D(P‖Q) = ∞ and D(P‖R) < ∞. Then P[f_R(Y) = 0] = 0, and hence in the
computation of the expectation in (2.11) only the second part of convention (2.10) can possibly
apply. Since also f_P > 0 P-almost surely, we have
$$\mathrm{Log}\,\frac{f_R}{f_Q} = \mathrm{Log}\,\frac{f_R}{f_P} + \mathrm{Log}\,\frac{f_P}{f_Q}, \tag{2.14}$$
with both logarithms evaluated according to (2.10). Taking expectation over P we see that the
first term, equal to −D(P‖R), is finite, whereas the second term is infinite. Thus, the expectation
in (2.11) is well-defined and equal to +∞, as is the LHS of (2.11).
Now consider D(P‖Q) < ∞. This implies that P[f_Q(Y) = 0] = 0 and this time in (2.11) only
the first part of convention (2.10) can apply. Thus, again we have the identity (2.14). Since the P-
expectation of the second term is finite, and of the first term non-negative, we again conclude that
the expectation in (2.11) is well-defined and equals the LHS of (2.11) (and both sides are possibly equal
to −∞).
For the second part, we first show that
fP log e
EP min(Log , 0) ≥ − . (2.15)
fQ e
Let g(x) = min(x log x, 0). It is clear − loge e ≤ g(x) ≤ 0 for all x. Since fP (Y) > 0 for P-almost
all Y, in convention (2.10) only the 10 case is possible, which is excluded by the min(·, 0) from the
expectation in (2.15). Thus, the LHS in (2.15) equals
Z Z
f P ( y) f P ( y) f P ( y)
fP (y) log dμ = f Q ( y) log dμ
{fP >fQ >0} f Q ( y ) {fP >fQ >0} f Q ( y ) f Q ( y)
Z
f P ( y)
= f Q ( y) g dμ
{fQ >0} f Q ( y)
log e
≥− .
e
h i h i
Since the negative part of EP Log ffQP is bounded, the expectation EP Log ffQP is well-defined. If
P[fQ = 0] > 0 then it is clearly +∞, as is D(PkQ) (since P 6 Q). Otherwise, let E = {fP >
0, fQ > 0}. Then P[E] = 1 and on E we have fP = fQ · ffQP . Thus, we obtain
Z Z
fP fP fP fP
EP Log = dμ fP log = dμfQ φ( ) = EQ 1E φ( ) .
fQ E fQ E fQ fQ
From here, we notice that Q[fQ > 0] = 1 and on {fP = 0, fQ > 0} we have φ( ffQP ) = 0. Thus, the
term 1E can be dropped and we obtain the desired (2.12).
The final statement of the Lemma follows from taking μ = Q and noticing that P-almost surely
we have
dP
dQ dP
Log = log .
1 dQ
i i
i i
i i
In particular, if X has probability density function (pdf) p, then h(X) = E log p(1X) ; otherwise
h(X) = −∞. The conditional differential entropy is h(X|Y) ≜ E log pX|Y (1X|Y) where pX|Y is a
conditional pdf.
Theorem 2.7 (Properties of differential entropy) Assume that all differential entropies
appearing below exist and are finite (in particular all RVs have pdfs and conditional pdfs).
1 n c −(−1)n n
For an example, consider a piecewise-constant pdf taking value e(−1) n on the n-th interval of width ∆n = n2
e .
i i
i i
i i
28
Proof. Parts (a), (c), and (d) follow from the similar argument in the proof (b), (d), and (g) of
Theorem 1.4. Part (b) is by a change of variable in the density. Finally, (e) and (f) are analogous
to Theorems 1.6 and 1.7.
Interestingly, the first property is robust to small additive perturbations, cf. Ex. I.6. Regard-
ing maximizing entropy under quadratic constraints, we have the following characterization of
Gaussians.
Theorem 2.8 Let Cov(X) = E[XX⊤ ] − E[X]E[X]⊤ denote the covariance matrix of a random
vector X. For any d × d positive definite matrix Σ,
1
max h(X) = h(N(0, Σ)) = log((2πe)d det Σ) (2.19)
PX :Cov(X)⪯Σ 2
Furthermore, for any a > 0,
a d 2πea
max h(X) = h N 0, Id = log . (2.20)
PX :E[∥X∥ ]≤a
2 d 2 d
Proof. To show (2.19), without loss of generality, assume that E[X] = 0. By comparing to
Gaussian, we have
where in the last step we apply E[X⊤ Σ−1 X] = Tr(E[XX⊤ ]Σ−1 ) ≤ Tr(I) due to the constraint
Cov(X) Σ and the formula (2.18). The inequality (2.20) follows analogously by choosing the
reference measure to be N(0, ad Id ).
Corollary 2.9 The map Σ 7→ log det Σ is concave on the space of real positive definite n × n
matrices.
i i
i i
i i
for m ∈ N, where b·c is taken componentwise. Rényi showed that [357, Theorem 1] provided
H(bXc) < ∞ and h(X) is defined, we have
To interpret this result, consider, for simplicity, d = 1, m = 2k and assume that X takes values
in the unit interval, in which case X2k is the k-bit uniform quantization of X. Then (2.21) suggests
that for large k, the quantized bits behave as independent fair coin flips. The underlying reason is
that for “nice” density functions, the restriction to small intervals is approximately uniform. For
more on quantization see Section 24.1 (notably Section 24.1.5) in Chapter 24.
The kernel K can be viewed as a random transformation acting from X to Y , which draws
Y from a distribution depending on the realization of X, including deterministic transformations
PY|X
as special cases. For this reason, we write PY|X : X → Y and also X −−→ Y. In information-
theoretic context, we also refer to PY|X as a channel, where X and Y are the channel input and
output respectively. There are two ways of obtaining Markov kernels. The first way is defining
them explicitly. Here are some examples of that:
i i
i i
i i
30
Note that above we have implicitly used the facts that the slices Ex of E are measurable subsets
of Y for each x and that the function x 7→ K(Ex |x) is measurable (cf. [84, Chapter I, Prop. 6.8 and
6.9], respectively). We also notice that one joint distribution PX,Y can have many different versions
of PY|X differing on a measure-zero set of x’s.
The operation of combining an input distribution on X and a kernel K : X → Y as we did
in (2.22) is going to appear extensively in this book. We will usually denote it as multiplication:
Given PX and kernel PY|X we can multiply them to obtain PX,Y ≜ PX PY|X , which in the discrete
case simply means that the joint PMF factorizes as product of marginal and conditional PMFs:
PX,Y (x, y) = PY|X (y|x)PX (x) ,
and more generally is given by (2.22) with K = PY|X .
Another useful operation will be that of composition (marginalization), which we denote by
PY|X ◦ PX ≜ PY . In words, this means forming a distribution PX,Y = PX PY|X and then computing
the marginal PY , or, explicitly,
Z
PY [E] = PX (dx)PY|X (E|x) .
X
To denote this (linear) relation between the input PX and the output PY we sometimes also write
PY|X
PX −−→ PY .
We must remark that technical assumptions such as restricting to standard Borel spaces are
really necessary for constructing any sensible theory of disintegration/conditioning and multipli-
cation. To emphasize this point we consider a (cautionary!) example involving a pathological
measurable space Y .
Example 2.5 (X ⊥ ⊥ Y but PY|X=x 6 PY for all x) Consider X a unit interval with
Borel σ -algebra and Y a unit interval with the σ -algebra σY consisting of all sets which are either
countable or have a countable complement. Clearly σY is a sub-σ -algebra of Borel one. We define
the following kernel K : X → Y :
K(A|x) ≜ 1{x ∈ A} .
This is simply saying that Y is produced from X by setting Y = X. It should be clear that for
every A ∈ σY the map x 7→ K(A|x) is measurable, and thus K is a valid Markov kernel. Letting
X ∼ Unif(0, 1) and using formula (2.22) we can define a joint distribution PX,Y . But what is the
conditional distribution PY|X ? On one hand, clearly we can set PY|X (A|x) = K(A|x), since this
was how PX,Y was constructed. On the other hand, we will show that PX,Y = PX PY , i.e. X ⊥ ⊥ Y
i i
i i
i i
and X = Y at the same time! Indeed, consider any set E = B × C ⊂ X × Y . We always have
PX,Y [B × C] = PX [B ∩ C]. Thus if C is countable then PX,Y [E] = 0 and so is PX PY [E] = 0. On the
other hand, if Cc is countable then PX [C] = PY [C] = 1 and PX,Y [E] = PX PY [E] again. Thus, both
PY|X = K and PY|X = PY are valid conditional distributions. But notice that since PY [{x}] = 0, we
have K(·|x) 6 PY for every x ∈ X . In particular, the value of D(PY|X=x kPY ) can either be 0 or
+∞ for every x depending on the choice of the version of PY|X . It is, thus, advisable to stay within
the realm of standard Borel spaces.
We will also need to use the following result extensively. We remind that a σ -algebra is called
separable if it is generated by a countable collection of sets. Any standard Borel space’s σ -algebra
is separable. The following is another useful result about Markov kernels, cf. [84, Chapter 5,
Theorem 4.44]:
dPY|X=x
The meaning of this theorem is that the Radon-Nikodym derivative dRY|X=x can be made jointly
measurable with respect to (x, y).
In order to extend the above definition to more general X , we need to first understand whether
the map x 7→ D(PY|X=x kQY|X=x ) is even measurable.
Lemma 2.13 Suppose that Y is standard Borel. The set A0 ≜ {x : PY|X=x QY|X=x } and the
function
x 7→ D(PY|X=x kQY|X=x )
are both measurable.
dPY|X=x dQY|X=x
Proof. Take RY|X = 1
2 PY|X + 12 QY|X and define fP (y|x) ≜ dRY|X=x (y) and fQ (y|x) ≜ dRY|X=x (y).
By Theorem 2.12 these can be chosen to be jointly measurable on X × Y . Let us define B0 ≜
i i
i i
i i
32
{(x, y) : fP (y|x) > 0, fQ (y|x) = 0} and its slice Bx0 = {y : (x, y) ∈ B0 }. Then note that PY|X=x
QY|X=x iff RY|X=x [Bx0 ] = 0. In other words, x ∈ A0 iff RY|X=x [Bx0 ] = 0. The measurability of B0
implies that of x 7→ RY|X=x [Bx0 ] and thus that of A0 . Finally, from (2.12) we get that
f P ( Y | x)
D(PY|X=x kQY|X=x ) = EY∼PY|X=x Log , (2.23)
f Q ( Y | x)
which is measurable, e.g. [84, Chapter 1, Prop. 6.9].
Theorem 2.15 (Chain rule) For any pair of measures PX,Y and QX,Y we have
D(PX,Y kQX,Y ) = D(PY|X kQY|X |PX ) + D(PX kQX ) , (2.24)
regardless of the versions of conditional distributions PY|X and QY|X one chooses.
Proof. First, let us consider the simplest case: X , Y are discrete and QX,Y (x, y) > 0 for all x, y.
Letting (X, Y) ∼ PX,Y we get
PX,Y (X, Y) PX (X)PY|X (Y|X)
D(PX,Y kQX,Y ) = E log = E log
QX,Y (X, Y) QX (X)QY|X (Y|X)
PY|X (Y|X) PX (X)
= E log + E log
QY|X (Y|X) QX (X)
completing the proof.
Next, let us address the general case. If PX 6 QX then PX,Y 6 QX,Y and both sides of (2.24) are
infinity. Thus, we assume PX QX and set λP (x) ≜ dQ dPX
X
(x). Define fP (y|x), fQ (y|x) and RY|X as in
the proof of Lemma 2.13. Then we have PX,Y , QX,Y RX,Y ≜ QX RY|X , and for any measurable E
Z Z
PX,Y [E] = λP (x)fP (y|x)RX,Y (dx dy) , QX,Y [E] = fQ (y|x)RX,Y (dx dy) .
E E
i i
i i
i i
unless a = b = 0. Now, since PX,Y [fP (Y|X) > 0, λP (X) > 0] = 1, we conclude that PX,Y -almost
surely
fP (Y|X)λP (X) fP ( Y | X )
Log = log λP (X) + Log .
fQ (Y|X) fQ (Y|X)
We aim to take the expectation of both sides over PX,Y and invoke linearity of expectation. To
ensure that the issue of ∞ − ∞ does not arise, we notice that the negative part of each term has
finite expectation by (2.15). Overall, continuing (2.25) and invoking linearity we obtain
fP ( Y | X )
D(PX,Y kQX,Y ) = EPX,Y [log λP (X)] + EPX,Y Log ,
fQ (Y|X)
where the first term equals D(PX kQX ) by (2.12) and the second D(PY|X kQY|X |PX ) by (2.23) and
the definition of conditional divergence.
The chain rule has a number of useful corollaries, which we summarize below.
Theorem 2.16 (Properties of divergence) Assume that X and Y are standard Borel.
Then
(e) (Conditioning increases divergence) Given PY|X , QY|X and PX , let PY = PY|X ◦ PX and QY =
QY|X ◦ PX , as represented by the diagram:
i i
i i
i i
34
PY |X PY
PX
QY |X QY
Then D(PY kQY ) ≤ D(PY|X kQY|X |PX ), with equality iff D(PX|Y kQX|Y |PY ) = 0.
We remark that as before without the standard Borel assumption even the first property can
fail. For example, Example 2.5 shows an example where PX PY|X = PX QY|X but PY|X 6= QY|X and
D(PY|X kQY|X |PX ) = ∞.
Proof. (a) This follows from the chain rule (2.24) since PX = QX .
(b) Apply (2.24), with X and Y interchanged and use the fact that conditional divergence is non-
negative.
Qn Qn
(c) By telescoping PXn = i=1 PXi |Xi−1 and QXn = i=1 QXi |Xi−1 .
(d) Apply (c).
(e) The inequality follows from (a) and (b). To get conditions for equality, notice that by the chain
rule for D:
• There is a nice interpretation of the full chain rule as a decomposition of the “distance” from
PXn to QXn as a sum of “distances” between intermediate distributions, cf. Ex. I.43.
• In general, D(PX,Y kQX,Y ) and D(PX kQX ) + D(PY kQY ) are incomparable. For example, if X = Y
under P and Q, then D(PX,Y kQX,Y ) = D(PX kQX ) < 2D(PX kQX ). Conversely, if PX = QX and
PY = QY but PX,Y 6= QX,Y we have D(PX,Y kQX,Y ) > 0 = D(PX kQX ) + D(PY kQY ).
The following result, known as the Data-Processing Inequality (DPI), is an important principle
in all of information theory. In many ways, it underpins the whole concept of information. The
intuitive interpretation is that it is easier to distinguish two distributions using clean (resp. full)
data as opposed to noisy (resp. partial) data. DPI is a recurring theme in this book, and later we
will study DPI for other information measures such as mutual information and f-divergences.
i i
i i
i i
PX PY
PY|X
QX QY
Then
Note that D(Pf(X) kQf(X) ) = D(PX kQX ) does not imply that f is one-to-one; as an example,
consider PX = Gaussian, QX = Laplace, Y = |X|. In fact, the equality happens precisely when
f(X) is a sufficient statistic for testing P against Q; in other words, there is no loss of information
in summarizing X into f(X) as far as testing these two hypotheses is concerned. See Example 3.9
for details.
A particular useful application of Corollary 2.18 is when we take f to be an indicator function:
This method will be highly useful in large deviations theory which studies rare events (Sec-
tion 14.5 and Section 15.2), where we apply Corollary 2.19 to an event E which is highly likely
under P but highly unlikely under Q.
i i
i i
i i
36
Proof.
1 1
D(λP + λ̄QkQ) = EQ (λf + λ̄) log(λf + λ̄)
λ λ
dP
where f = . As λ → 0 the function under expectation decreases to (f − 1) log e monotonically.
dQ
Indeed, the function
i i
i i
i i
1
log(λ̄ + λf) ≤ c1 f + c2 .
λ
Thus, by the dominated convergence theorem we get
Z Z
1 1 λ→0
D(QkλP + λ̄Q) = − dQ log(λ̄ + λf) −−−→ dQ(1 − f) = 0 .
λ λ
λ 7→ D(λP + λ̄QkQ) ,
Our second result about the local behavior of KL-divergence is the following (see Section 7.10
for generalizations):
i i
i i
i i
38
and hence
2
1 dP dP
0 ≤ 2 g λ̄ + λ ≤ − 1 log e.
λ dQ dQ
By the dominated convergence theorem (which is applicable since χ2 (PkQ) < ∞) we have
" 2 #
1 dP g′′ (1) dP log e 2
lim EQ g λ̄ + λ = EQ −1 = χ (PkQ) .
λ→0 λ2 dQ 2 dQ 2
where μ is some common dominating measure (e.g. Lebesgue or counting measure). If for each
fixed x, the density pθ (x) depends smoothly on θ, one can define the Fisher information matrix
with respect to the parameter θ as
JF (θ) ≜ Eθ VV⊤ , V ≜ ∇θ ln pθ (X) , (2.32)
i i
i i
i i
E θ [ V] = 0 (2.33)
JF (θ) = cov(V)
θ
Z p p
= 4 μ(dx)(∇θ pθ (x))(∇θ pθ (x))⊤
where the last identity is obtained by differentiating (2.33) with respect to each θj .
The significance of Fisher information matrix arises from the fact that it gauges the local behav-
ior of divergence for smooth parametric families. Namely, we have (again under suitable technical
conditions):2
log e ⊤
D(Pθ0 kPθ0 +ξ ) = ξ JF (θ0 )ξ + o(kξk2 ) , (2.34)
2
which is obtained by integrating the Taylor expansion:
1
ln pθ0 +ξ (x) = ln pθ0 (x) + ξ ⊤ ∇θ ln pθ0 (x) + ξ ⊤ Hessθ (ln pθ0 (x))ξ + o(kξk2 ) .
2
We will establish this fact rigorously later in Section 7.11. Property (2.34) is of paramount impor-
tance in statistics. We should remember it as: Divergence is locally quadratic on the parameter
space, with Hessian given by the Fisher information matrix. Note that for the Gaussian location
model Pθ = N (θ, Σ), (2.34) is in fact exact with JF (θ) ≡ Σ−1 – cf. Example 2.2.
As another example, note that Proposition 2.21 is a special case of (2.34) by considering Pλ =
λ̄Q + λP parametrized by λ ∈ [0, 1]. In this case, the Fisher information at λ = 0 is simply
χ2 (PkQ). Nevertheless, Proposition 2.21 is completely general while the asymptotic expansion
(2.34) is not without regularity conditions (see Section 7.11).
Remark 2.3 Some useful properties of Fisher information are as follows:
2
To illustrate the subtlety here, consider a scalar location family, i.e. pθ (x) = f0 (x − θ) for some density f0 . In this case
∫ (f′0 )2
Fisher information JF (θ0 ) = f0
does not depend on θ0 and is well-defined even for compactly supported f0 ,
provided f′0 vanishes at the endpoints sufficiently fast. But at the same time the left-hand side of (2.34) is infinite for any
ξ > 0. Thus, a more general interpretation for Fisher information is as the coefficient in expansion
ξ2
D(Pθ0 k 12 Pθ0 + 12 Pθ0 +ξ ) = J (θ )
8 F 0
+ o(ξ 2 ). We will discuss this in more detail in Section 7.11.
i i
i i
i i
40
where A = ddθθ̃ is the Jacobian of the map. So we can see that JF transforms similarly to the
metric tensor in Riemannian geometry. This idea can be used to define a Riemannian metric
on the parameter space Θ, called the Fisher-Rao metric. This is explored in a field known as
information geometry [85, 17].
i.i.d.
• Additivity: Suppose we are given a sample of n iid observations Xn ∼ Pθ . As such, consider the
parametrized family of product distributions {P⊗ θ : θ ∈ Θ}, whose Fisher information matrix
n
is denoted by J⊗ n
F (θ). In this case, the score is an iid sum. Applying (2.32) and (2.33) yields
J⊗ n
F (θ) = nJF (θ). (2.36)
Example 2.7 (Location family) In statistics and information theory it is common to talk
about Fisher information of a (single) random variable or a distribution without reference to a
parametric family. In such cases one is implicitly considering a location parameter. Specifically,
for any density p0 on Rd we define a location family of distributions on Rd by setting Pθ (dx) =
p0 (x − θ)dx, θ ∈ Rd . Note that JF (θ) here does not depend on θ. For this special case, we will
adopt the standard notation: Let X ∼ p0 then
J(X) ≡ J(p0 ) ≜ EX∼p0 [(∇ ln p0 (X))(∇ ln p0 (X))⊤ ] = − EX∼p0 [Hess(ln p0 (X))] , (2.40)
where the second equality requires applicability of integration by parts. (See also (7.96) for a
variational definition.)
i i
i i
i i
3 Mutual information
After technical preparations in previous chapters we define perhaps the most famous concept in the
entire field of information theory, the mutual information. It was originally defined by Shannon,
although the name was coined later by Robert Fano.1 It has two equivalent expressions (as a KL
divergence and as difference of entropies), both having their merits. In this chapter, we collect
some basic properties of mutual information (non-negativity, chain rule and the data-processing
inequality). While defining conditional information, we also introduce the language of directed
graphical models, and connect the equality case in the data-processing inequality with Fisher’s
concept of sufficient statistic.
So far in this book we have not yet attempted connecting information quantities to any opera-
tional concepts. The first time this will be done in Section 3.6 where we relate mutual information
to probability of error in the form of Fano’s inequality, which states that whenever I(X; Y) is small,
one should not be able to predict X on the basis of Y with a small probability of error. As such, this
inequality will be applied countless times in the rest of the book as a main workhorse for studying
fundamental limits of problems in both information theory and in statistics.
The connection between information and estimation is furthered in Section 3.7*, in which we
relate mutual information and minimum mean squared error in Gaussian noise (I-MMSE relation).
From the latter we also derive the entropy power inequality, which plays a central role in high-
dimensional probability and concentration of measure.
1
Professor of electrical engineering at MIT, who developed the first course on information theory and as part of it
formalized and rigorized much of Shannon’s ideas. Most famously, he showed the “converse part” of the noisy channel
coding theorem, see Section 17.4.
41
i i
i i
i i
42
Definition 3.1 (Mutual information) For a pair of random variables X and Y we define
I(X; Y) = D(PX,Y kPX PY ).
The intuitive interpretation of mutual information is that I(X; Y) measures the dependency
between X and Y by comparing their joint distribution to the product of the marginals in the KL
divergence, which, as we show next, is also equivalent to comparing the conditional distribution
to the unconditional.
The way we defined I(X; Y) it is a functional of the joint distribution PX,Y . However, it is also
rather fruitful to look at it as a functional of the pair (PX , PY|X ) – more on this in Section 5.1.
In general, the divergence D(PX,Y kPX PY ) should be evaluated using the general definition (2.4).
Note that PX,Y PX PY need not always hold. Let us consider the following examples, though.
Example 3.1 If X = Y ∼ N(0, 1) then PX,Y 6 PX PY and I(X; Y) = ∞. This reflects our
intuition that X contains an “infinite” amount of information requiring infinitely many bits to
describe. On the other hand, if even one of X or Y is discrete, then we always have PX,Y PX PY .
Indeed, consider any E ⊂ X × Y measurable in the product sigma algebra with PX,Y (E) > 0.
P
Since PX,Y (E) = x∈S P[(X, Y) ∈ S, X = x], there exists some x0 ∈ S such that PY (Ex0 ) ≥ P[X =
x0 , Y ∈ Ex0 ] > 0, where Ex0 ≜ {y : (x0 , y) ∈ E} is a section of E (measurable for every x0 ). But
then PX PY (E) ≥ PX PY ({x0 } × Ex0 ) = PX ({x0 })PY (Ex0 ) > 0, implying that PX,Y PX PY .
I(f(X); Y) ≤ I(X; Y) .
i i
i i
i i
K
(d) Consider a Markov kernel K sending (x, y) 7→ (f(x), y). This kernel sends measure PX,Y − →
K
Pf(X),Y and PX PY −
→ Pf(X) PY . Therefore, from the DPI Theorem 2.17 applied to this kernel we
get
It is clear that the two sides correspond to the two mutual informations. For bijective f, simply
apply the inequality to f and f−1 .
(e) Apply (d) with f(X1 , X2 ) = X1 .
Of the results above, the one we will use the most is (3.1). Note that it implies that
D(PX,Y kPX PY ) < ∞ if and only if
x 7→ D(PY|X=x kPY )
Proof. Suppose PX,Y PX PY . We need to prove that any version of the conditional probability
satisfies PY|X=x PY for almost every x. Note, however, that if we prove this for some version P̃Y|X
then the statement for any version follows, since PY|X=x = P̃Y|X=x for PX -a.e. x. (This measure-
theoretic fact can be derived from the chain rule (2.24): since PX P̃Y|X = PX,Y = PX PY|X we must
have 0 = D(PX,Y kPX,Y ) = D(P̃Y|X kPY|X |PX ) = Ex∼PX [D(P̃Y|X=x kPY|X=x )], implying the stated
dPX,Y R
fact.) So let g(x, y) = dP X PY
(x, y) and ρ(x) ≜ Y g(x, y)PY (dy). Fix any set E ⊂ X and notice
Z Z
PX [E] = 1E (x)g(x, y)PX (dx) PY (dy) = 1E (x)ρ(x)PX (dx) .
X ×Y X
R
On the other hand, we also have PX [E] = 1E dPX , which implies ρ(x) = 1 for PX -a.e. x. Now
define
(
g(x, y)PY (dy), ρ(x) = 1
P̃Y|X (dy|x) =
PY (dy), ρ(x) 6= 1 .
Directly plugging P̃Y|X into (2.22) shows that P̃Y|X does define a valid version of the conditional
probability of Y given X. Since by construction P̃Y|X=x PY for every x, the result follows.
Conversely, let PY|X be a kernel such that PX [E] = 1, where E = {x : PY|X=x PY } (recall that
E is measurable by Lemma 2.13). Define P̃Y|X=x = PY|X=x if x ∈ E and P̃Y|X=x = PY , otherwise.
By construction PX P̃Y|X = PX PY|X = PX,Y and P̃Y|X=x PY for every x. Thus, by Theorem 2.12
there exists a jointly measurable f(y|x) such that
i i
i i
i i
44
(d) Similarly, if X, Y are real-valued random vectors with a joint PDF, then
I(X; Y) = h(X) + h(Y) − h(X, Y)
provided that h(X, Y) < ∞. If X has a marginal PDF pX and a conditional PDF pX|Y (x|y),
then
I(X; Y) = h(X) − h(X|Y) ,
provided h(X|Y) < ∞.
2
This is indeed possible if one takes Y = 0 (constant) and X from Example 1.3, demonstrating that (3.3) does not always
hold.
i i
i i
i i
(e) If X or Y are discrete then I(X; Y) ≤ min (H(X), H(Y)), with equality iff H(X|Y) = 0 or
H(Y|X) = 0, or, equivalently, iff one is a deterministic function of the other.
Proof. (a) By Theorem 3.2.(a), I(X; X) = D(PX|X kPX |PX ) = Ex∼X D(δx kPX ). If PX is discrete,
then D(δx kPX ) = log PX1(x) and I(X; X) = H(X). If PX is not discrete, let A = {x : PX (x) > 0}
denote the set of atoms of PX . Let ∆ = {(x, x) : x 6∈ A} ⊂ X × X . (∆ is measurable since it’s
the intersection of Ac × Ac with the diagonal {(x, x) : x ∈ X }.) Then PX,X (∆) = PX (Ac ) > 0
but since
Z Z
(PX × PX )(E) ≜ PX (dx1 ) PX (dx2 )1{(x1 , x2 ) ∈ E}
X X
we have by taking E = ∆ that (PX × PX )(∆) = 0. Thus PX,X 6 PX × PX and thus by definition
(b) Since X is discrete there exists a countable set S such that P[X ∈ S] = 1, and for any x0 ∈ S we
have P[X = x0 ] > 0. Let λ be a counting measure on S and let μ = λ×PY , so that PX PY μ. As
shown in Example 3.1 we also have PX,Y μ. Furthermore, fP (x, y) ≜ dPdμX,Y (x, y) = pX|Y (x|y),
where the latter denotes conditional pmf of X given Y (which is a proper pmf for almost every
y, since P[X ∈ S|Y = y] = 1 for a.e. y). We also have fQ (x, y) = dPdμ
X PY
(x, y) = dP
dλ (x) = pX (x),
X
Note that PX,Y -almost surely both pX|Y (X|Y) > 0 and PX (x) > 0, so we can replace Log with
log in the above. On the other hand,
X 1
H(X|Y) = Ey∼PY pX|Y (x|y) log .
pX|Y (x|y)
x∈ S
i i
i i
i i
46
(d) These arguments are similar to discrete case, except that counting measure is replaced with
Lebesgue. We leave the details as an exercise.
(e) Follows from (b).
From (3.2) we deduce the following result, which was previously shown in Theorem 1.4(d).
Corollary 3.5 (Conditioning reduces entropy) For discrete X, H(X|Y) ≤ H(X), with
equality iff X ⊥
⊥ Y.
H(X, Y )
H(Y ) H(X)
As an example, we have
H(X1 |X2 , X3 ) = μ(E1 \ (E2 ∪ E3 )) , (3.6)
i i
i i
i i
I(X; Y )
ρ
-1 0 1
i i
i i
i i
48
1 1 1 1
= log(2πe) − log(2πe(1 − ρ2 )) = log .
2 2 2 1 − ρ2
where the second equality follows h(Y|X) = h(Y − X|X) = h(Z|X) = h(Z) applying the shift-
invariance of h and the independence between X and Z.
Similar to the role of mutual information, the correlation coefficient also measures the
dependency between random variables which are real-valued (or, more generally, valued in an
inner-product space) in a certain sense. In contrast, mutual information is invariant to bijections
and much more general as it can be defined not just for numerical but for arbitrary random
variables.
Example 3.3 (AWGN channel) The additive white Gaussian noise (AWGN) channel is one
of the main examples of Markov kernels that we will use in this book. This kernel acts from R to
R by taking an input x and setting K(·|x) ∼ N (x, σN2 ), or, in equation form we write Y = X + N,
with X ⊥⊥ N ∼ N (0, σN2 ). Pictorially, we can think of it as
X + Y
Now, suppose that X ∼ N (0, σX2 ), in which case Y ∼ N (0, σX2 + σN2 ). Then by invoking (2.17)
twice we obtain
1 σ2
I(X; Y) = h(Y) − h(Y|X) = h(X + N) − h(N) = log 1 + X2 ,
2 σN
2
where σσX2 is frequently referred to as the signal-to-noise ratio (SNR). See Figure 3.2 for an illus-
N
tration. Note that in engineering it is common to express SNR in decibels (dB), so that SNR in dB
equals 10 log10 (SNR). Later, we will define AWGN channel more formally in Definition 20.10.
Example 3.4 (BI-AWGN channel) In communication and statistical applications one also
often encounters a situation where AWGN channel’s input is restricted to X ∈ {±1}. This Markov
kernel is denoted BIAWGNσN2 : {±1} → R and acts by setting
Y = X + N, X⊥
⊥ N ∼ N (0, σN2 ) .
If we set X ∼ Ber(1/2) then in this case it is more convenient to calculate mutual information by
a decomposition different from the AWGN case. Indeed, we have
I(X; Y) = H(X) − H(X|Y) = log 2 − H(X|Y) .
To compute H(X|Y = y) we simply need to evaluate posterior distribution given observation Y = y.
y
2
e σN
In this case we have P[X = +1|Y = y] = −
y y . Thus, after some algebra we obtain the
σ2 2
e N +e σN
i i
i i
i i
Figure 3.2 Comparing mutual information for the AWGN and BI-AWGN channels (see Examples 3.3
and 3.4). It will be shown later in this book that these mutual informations coincide with the capacities of
respective channels.
following expression
Z ∞
1 z2
√ e− 2 log(1 + e− σ2 + σ z ) dz .
2 2
I(X; Y) = log 2 −
−∞ 2π
(One can verify that H(X|Y) here coincides with that in Example 1.4(2) with σ replaced by 2σ .)
For this channel, the SNR is given by EE[[NX2]] = σ12 . We compare mutual informations of AWGN
2
N
and BI-AWGN as a function of the SNR on Figure 3.2. Note that for the low SNR restricting to
binary input results in virtually no loss of information – a fact underpinning the role played by the
BI-AWGN channel in many real-world communication systems.
Example 3.5 (Gaussian vectors) Let X ∈ Rm and Y ∈ Rn be jointly Gaussian. Then
1 det ΣX det ΣY
I(X; Y) = log
2 det Σ[X,Y]
where ΣX ≜ E (X − EX)(X − EX)⊤ denotes the covariance matrix of X ∈ Rm , and Σ[X,Y]
denotes the covariance matrix of the random vector [X, Y] ∈ Rm+n .
In the special case of additive noise: Y = X + N for N ⊥
⊥ X, we have
1 det(ΣX + ΣN )
I ( X; X + N) = log
2 det ΣN
ΣX ΣX
why?
since det Σ[X,X+N] = det ΣX ΣX +ΣN = det ΣX det ΣN .
i i
i i
i i
50
Example 3.6 (Binary symmetric channel) Recall the setting in Example 1.4(1). Let X ∼
Ber( 21 ) and N ∼ Ber(δ) be independent. Let Y = X ⊕ N; or equivalently, Y is obtained by flipping
X with probability δ .
N
1− δ
0 0
X δ Y X + Y
1 1
1− δ
Denoting I(X; Y) as a functional I(PX,Y ) of the joint distribution PX,Y , we have I(X; Y|Z) =
Ez∼PZ [I(PX,Y|Z=z )]. As such, I(X; Y|Z) is a linear functional in PZ . Measurability of the map z 7→
I(PX,Y|Z=z ) is not obvious, but follows from Lemma 2.13.
To further discuss properties of the conditional mutual information, let us first introduce the
notation for conditional independence. A family of joint distributions can be represented by a
directed acyclic graph (DAG) encoding the dependency structure of the underlying random vari-
ables. We do not intend to introduce formal definitions here and refer to the standard monograph
for full details [271]. But in short, every problem consisting of finitely (or countably infinitely)
i i
i i
i i
many random variables can be depicted as a DAG. Nodes of the DAG correspond to random vari-
ables and incoming edges into the node U simply describe which variables need to be known in
order to generate U. A simple example is a Markov chain (path graph) X → Y → Z, which repre-
sents distributions that factor as {PX,Y,Z : PX,Y,Z = PX PY|X PZ|Y }. We have the following equivalent
descriptions:
There is a general method for obtaining these equivalences for general graphs, known as d-
separation, see [271]. We say that a variable V is a collider on some undirected path if it appears
on the path as
Theorem 3.7 (Further properties of mutual information) Suppose that all random
variables are valued in standard Borel spaces. Then:
3
Also known as “Kolmogorov identities”.
i i
i i
i i
52
(f) (Permutation invariance) If f and g are one-to-one (with measurable inverses), then
I(f(X); g(Y)) = I(X; Y).
Remark 3.2 In general, I(X; Y|Z) and I(X; Y) are incomparable. Indeed, consider the following
examples:
• I(X; Y|Z) > I(X; Y): We need to find an example of X, Y, Z, which do not form a Markov chain.
To that end notice that there is only one directed acyclic graph non-isomorphic to X → Y → Z,
i.i.d.
namely X → Y ← Z. With this idea in mind, we construct X, Z ∼ Ber( 12 ) and Y = X ⊕ Z. Then
I(X; Y) = 0 since X ⊥⊥ Y; however, I(X; Y|Z) = I(X; X ⊕ Z|Z) = H(X) = 1 bit.
i i
i i
i i
• I(X; Y|Z) < I(X; Y): Simply take X, Y, Z to be any random variables on finite alphabets and
Z = Y. Then I(X; Y|Z) = I(X; Y|Y) = H(Y|Y) − H(Y|X, Y) = 0 by a conditional version of (3.3).
Remark 3.3
Pn
(Chain rule for IP⇒ Chain rule for H) Set Y = Xn . Then H(Xn ) =
n k− 1 k− 1
), since H(Xk |Xn , Xk−1 ) = 0.
n
n n
I(X ; X ) = k=1 I(Xk ; X |X )= k=1 H(Xk |X
Remark 3.4 (DPI for divergence =⇒ DPI for mutual information) We proved
DPI for mutual information in Theorem 3.7 using Kolmogorov’s identity. In fact, DPI for mutual
information is implied by that for divergence in Theorem 2.17:
where ηKL < 1 and depends on the channel PY|X only. Similarly, this gives an improvement in the
data-processing inequality for mutual information in Theorem 3.7(c): For any PU,X we have
For example, for PY|X = BSCδ we have ηKL = (1 − 2δ)2 . Strong data-processing inequalities
(SDPIs) quantify the intuitive observation that noise intrinsic to the channel PY|X must reduce the
information that Y carries about the data U, regardless of how we optimize the encoding U 7→ X.
We explore SDPI further in Chapter 33 as well as their ramifications in statistics.
In addition to the case of strict inequality in DPI, the case of equality is also worth taking a closer
look. If U → X → Y and I(U; X) = I(U; Y), intuitively it means that, as far as U is concerned,
there is no loss of information in summarizing X into Y. In statistical parlance, we say that Y is a
sufficient statistic of X for U. This is the topic for the next section.
i i
i i
i i
54
• Let PT|X be some Markov kernel. Let PθT ≜ PT|X ◦ PθX be the induced distribution on T for each
θ.
Definition 3.8 (Sufficient statistic) We say that T is a sufficient statistic of X for θ if there
exists a transition probability kernel PX|T so that PθX PT|X = PθT PX|T , i.e., PX|T can be chosen to not
depend on θ.
The intuitive interpretation of T being sufficient is that, with T at hand, one can ignore X; in
other words, T contains all the relevant information to infer about θ. This is because X can be
simulated on the sole basis of T without knowing θ. As such, X provides no extra information
for identification of θ. Any one-to-one transformation of X is sufficient, however, this is not the
interesting case. In the interesting cases dimensionality of T will be much smaller (typically equal
to that of θ) than that of X. See examples below.
Observe also that the parameter θ need not be a random variable, as Definition 3.8 does not
involve any distribution (prior) on θ. This is a so-called frequentist point of view on the problem
of parameter estimation.
Theorem 3.9 Let θ, X, T be as in the setting above. Then the following are equivalent
Proof. We omit the details, which amount to either restating the conditions in terms of conditional
independence, or invoking equality cases in the properties stated in Theorem 3.7.
Theorem 3.10 (Fisher’s factorization theorem) For all θ ∈ Θ, let PθX have a density pθ
with respect to a common dominating measure μ. Let T = T(X) be a deterministic function of X.
Then T is a sufficient statistic of X for θ iff
pθ (x) = gθ (T(x))h(x)
Proof. We only give the proof in the discrete case where pθ represents the PMF. (The argument
P R
for the general case is similar replacing by dμ). Let t = T(x).
“⇒”: Suppose T is a sufficient statistic of X for θ. Then pθ (x) = Pθ (X = x) = Pθ (X = x, T =
t) = Pθ (X = x|T = t)Pθ (T = t) = P(X = x|T = T(x)) Pθ (T = T(x))
| {z }| {z }
h(x) gθ (T(x))
i i
i i
i i
i.i.d.
• Normal mean model. Let θ ∈ R and observations X1 , . . . , Xn ∼ N (θ, 1). Then the sample mean
Pn
X̄ = 1n j=1 Xj is a sufficient statistic of Xn for θ.
i.i.d. Pn
• Coin flips. Let Bi ∼ Ber(θ). Then i=1 Bi is a sufficient statistic of Bn for θ.
i.i.d.
• Uniform distribution. Let Ui ∼ Unif(0, θ). Then maxi∈[n] Ui is a sufficient statistic of Un for θ.
Example 3.9 (Sufficient statistic for hypothesis testing) Let Θ = {0, 1}. Given θ = 0
or 1, X ∼ PX or QX , respectively. Then Y – the output of PY|X – is a sufficient statistic of X for θ iff
D(PX|Y kQX|Y |PY ) = 0, i.e., PX|Y = QX|Y holds PY -a.s. Indeed, the latter means that for kernel QX|Y
we have
which is precisely the definition of sufficient statistic when θ ∈ {0, 1}. This example explains
the condition for equality in the data-processing for divergence in Theorem 2.17. Then assuming
D(PY kQY ) < ∞ we have:
with equality iff D(PX|Y kQX|Y |PY ) = 0, which is equivalent to Y being a sufficient statistic for
testing PX vs QX as desired.
i i
i i
i i
56
Our goal is to draw converse statements of the following type: If the uncertainty of W is too high
or if the information provided by the data is too scarce, then it is difficult to guess the value of W.
In this section we formalize these intuitions using (conditional) entropy and mutual information.
where
The function FM (·) is shown in Figure 3.3. Notice that due to its non-monotonicity the
statement (3.15) does not imply (3.13), even though P[X = X̂] ≤ Pmax .
FM (p)
log M
log(M − 1)
p
0 1/M 1
Figure 3.3 The function FM in (3.14) is concave with maximum log M at maximizer 1/M, but not monotone.
Proof. To show (3.13) consider an auxiliary (product) distribution QX,X̂ = UX PX̂ , where UX is
uniform on X . Then Q[X = X̂] = n1/M. Denoting
o P[X = X̂] ≜ PS , applying the DPI for divergence
to the data processor (X, X̂) 7→ 1 X = X̂ yields d(PS k1/M) ≤ D(PXX̂ kQXX̂ ) = log M − H(X).
To show the second part, suppose one is trying to guess the value of X without any side informa-
tion. Then the best bet is obviously the most likely outcome (mode) and the maximal probability
i i
i i
i i
of success is
Thus, applying (3.13) with X̂ being the mode yield (3.15). Finally, suppose that P =
(Pmax , P2 , . . . , PM ) and introduce Q = (Pmax , 1− Pmax 1−Pmax
M−1 , . . . , M−1 ). Then the difference of the right
and left side of (3.15) equals D(PkQ) ≥ 0, with equality iff P = Q.
Remark 3.6 Let us discuss the unusual proof technique. Instead of studying directly the prob-
ability space PX,X̂ given to us, we introduced an auxiliary one: QX,X̂ . We then drew conclusions
about the target metric ( probability of error) for the auxiliary problem (the probability of error
= 1 − M1 ). Finally, we used DPI to transport statement about Q to a statement about P: if D(PkQ)
is small, then the probabilities of the events (e.g., {X 6= X̂}) should be small as well. This is a
general method, known as meta-converse, that we develop in more detail later in this book for
channel coding (see Section 22.3). For the specific result (3.15), however, there are much more
explicit ways to derive it – see Ex. I.25.
Similar to the Shannon entropy H, Pmax is also a reasonable measure for randomness of P. In
fact, recall from (1.4) that
1
H∞ (P) = log (3.17)
Pmax
is the Rényi entropy of order ∞, cf. (1.4). In this regard, Theorem 3.11 can be thought of as our
first example of a comparison of information measures: it compares H and H∞ . We will study
such comparisons systematically in Section 7.4.
Next we proceed to the setting of Fano’s inequality where the estimate X̂ is made on the basis of
some observation Y correlated with X. We will see that the proof of the previous theorem trivially
generalizes to this new case of possibly randomized estimators. Though not needed in the proof,
it is worth mentioning that the best estimator minimizing the probability of error P[X 6= X̂] is the
maximum posterior (MAP) rule, i.e., the posterior mode: X̂(y) = argmaxx PX|Y (x|y).
Theorem 3.12 (Fano’s inequality) Let |X | = M < ∞ and X → Y → X̂. Let Pe = P[X 6= X̂],
then
Proof. To show (3.18) we apply data processing (for divergence) to PX,Y,X̂ = PX PY|X PX̂|Y vs
n o
QX,Y,X̂ = UX PY PX̂|Y and the data processor (kernel) (X, Y, X̂) 7→ 1 X 6= X̂ (note that PX̂|Y is
identical for both).
i i
i i
i i
58
To show (3.19) we apply data processing (for divergence) to PX,Y,X̂ = PX PY|X PX̂|Y vs QX,Y,X̂ =
n o
PX PY PX̂|Y and the data processor (kernel) (X, Y, X̂) 7→ 1 X 6= X̂ to obtain:
Proof. Apply Theorem 3.12 and the data processing for mutual information: I(W; Ŵ) ≤ I(X; Y).
where MMSE stands for minimum MSE (which follows from the fact that the best estimator of X
given Y is precisely E[X|Y]). Just like Fano’s inequality one can derive inequalities relating I(X; Y)
and mmse(X|Y). For example, from Tao’s inequality (see Corollary 7.11) one can easily get for
the case where X ∈ [−1, 1] that
2
0 ≤ Var(X) − mmse(X|Y) ≤ I ( X ; Y) ,
log e
which shows that the variance reduction of X due to Y is at most proportional to their mutual
information. (Simply notice that E[| E[X|Y] − E[X]|2 ] = Var(X) − mmse(X|Y)).
However, this section is not about such inequalities. Here we show a remarkable equality for
the special case when Y is an observation of X corrupted by Gaussian noise. As applications of
i i
i i
i i
this identity we will derive stochastic localization in Exercise I.66 and entropy power inequality
in Theorem 3.16.
As a simple example, for Gaussian X, one may verify (3.22) by combining the mutual
information in Example 3.3 with the MMSE in Example 28.1.
Before proving Theorem 3.14 we start with some notation and preliminary results. Let I ⊂ R
be an open interval, μ a (positive) measure on Rd , and K, L : Rd × I → R such that the following
R R R
conditions are met: a) K(x, θ) μ(dx) exists for all θ ∈ I; b) Rd μ(dx) I dθ|L(x, θ)| < ∞; c)
R
t 7→ Rd μ(dx)L(x, t) is continuous and d) we have
∂
K(x, θ) = L(x, θ)
∂θ
for all x, θ. Then
Z Z
∂
K(x, θ) dx = L(x, θ) dx . (3.24)
∂θ Rd Rd
Rθ
(To see this, take θ > θ0 ∈ I and write K(x, θ) = K(x, θ0 ) + θ0 dtL(x, t). Now we can integrate
R Rθ
this over x and interchange the order of integrals to get dxK(x, θ) = constant + θ0 g(t)dt, where
R
g(t) = dxL(x, t) is continuous). Note that in the case of finite interval I both conditions b) and
c) are implied by condition e) for all t ∈ I we have |L(x, t)| ≤ ℓ(x) and ℓ is μ-integrable.
Let ϕa (x) = (2πa1)d/2 e−∥x∥ /(2a) be the density of N (0, aId ). Suppose p is some probability
2
distribution, and f is a function then we denote by p ∗ f(x) = EX′ ∼p [f(x − X′ )], which coincides
with the usual convolution if p is a density. In particular, the Gaussian convolution p ∗ ϕa is known
k
as a Gaussian mixture with mixing distribution p. For any differential operator D = ∂ xi ∂···∂ xi we
1 k
have
i i
i i
i i
60
where ∆f = tr(Hess f) is the Laplacian. Notice that the second equality follows from (3.25) and
the easily checked identity
∂ 1 1
ϕa (x) = 2 (kxk2 − ad)ϕa (x) = ∆ϕa (x) .
∂a 2a 2
Thus, we only need to justify the first equality in (3.26). To that end, we use (3.24) with μ = p,
K(x, a) = ϕa (y − x) and L(x, a) = ∂∂a K(x, a). Note that by the previous calculation we have
supx | ∂∂a ϕa (x)| < ∞, and thus condition e) of (3.24) applies and so (3.24) implies (3.26).
The next lemma shows a special property of Gaussian convolution (derivatives of log-
convolution correspond to conditional moments).
√
Lemma 3.15 Let Y = X + aZ, where X ⊥
⊥ Z and X ∼ p and Z ∼ N (0, Id ). Then
where (a) follows from the fact that ∇ϕa (x) = − ax ϕa (x) and (b) from (3.25). The proof of (3.27)
is completed after noticing p∗ϕ1a (y) ∇ (p ∗ ϕa (x)) = ∇ ln(p ∗ ϕa )(y). Technical estimate (3.28) is
shown in [344, Proposition 2].
The identity (3.29) is shown entirely similarly.
Proof of Theorem 3.14. For simplicity, in this proof we compute all informations and entropies
with natural base, so log = ln. With these preparations we can show (3.22). First, let a = 1/γ and
notice
√ √ √ d
I(X; γ X + Z) = I(X; X + aZ) = h(X + aZ) − ln(2πea) ,
2
i i
i i
i i
where we computed differential entropy of the Gaussian via Theorem 2.8. Thus, the proof is
completed if we show
d √ d 1
h(X + aZ) = − mmse(X|Ya ) , (3.30)
da 2a 2a2
√
where we defined Ya = X + aZ. Let the law of X be p. Conceptually, the computation is just a
few lines:
Z
d √ d
− h(X + aZ) = (p ∗ ϕa )(x) ln(p ∗ ϕa )(x)dx
da da
Z
( a) ∂
= [(p ∗ ϕa )(x) ln(p ∗ ϕa )(x)]dx
∂a
Z
(b) 1
= (1 + ln p ∗ ϕa )∆(p ∗ ϕa )dx
2
Z
( c) 1
= (p ∗ ϕa )∆(ln p ∗ ϕa )dx
2
Z
(d) 1 1 d
= (p ∗ ϕa )(y)( 2 mmse(X|Ya = y) − )dy ,
2 a a
where (a) and (c) will require technical justifications, while (b) is just (3.26) and (d) is by taking
trace of (3.29). Note that (a) is just interchange of differentiation and integration, while (c) is
simply the “self-adjointness” of Laplacian.
We proceed to justifying (a). We will apply (3.24) with μ = Leb, I = (a1 , a2 ) some finite
interval, K(x, a) = (p ∗ ϕa )(x) ln(p ∗ ϕa )(x) and
∂ 1
L ( x, a ) = K(x, a) = (1 + ln(p ∗ ϕa )(x))(p ∗ ∆ϕa )(x) ,
∂a 2
where we again used (3.26).
Integrating (3.28) we get
3 4
| ln p ∗ ϕa (x) − ln p ∗ ϕa (0)| ≤ kxk2 + kxk E[kXk] .
2a a
Since p ∗ ϕa (0) ≤ ϕa (0) we get that for all a ∈ (a1 , a2 ) we have for some c > 0:
The integral of the right-hand side over x is simply c(1 + E[kYa k + kYa k2 ]) < ∞, which confirms
condition a) of (3.24).
Next, we notice that for any differential operator D we have Dϕa (x) = f(x)ϕa (x) where f is some
polynomial in x. Since for a < a2 we have supx f(ϕx)ϕ a (x)
a2 ( x )
< ∞ we have that for some constant c′
and all a < a2 and all x we have
i i
i i
i i
62
where we used (3.25) as well. Thus, for L(x, a) we can see that the first term is bounded by (3.31)
and the second by the previous display, so that overall
cc′
L ( x, a ) ≤
|2 + kxk + kxk2 |(p ∗ ϕa2 )(x) .
2
Since again the right-hand side is integrable, we see that condition e) of (3.24) applies and thus
R
the interchange of differentiation and in step (a) is valid.
Finally, we argue that step (c) is applicable. To that end we prove an auxiliary result first: If
R R
u, v are two univariate twice-differentiable functions with a) R |u′′ v| and R |v′′ u| both finite and
R ′ ′
b) |u v | < ∞ then
Z Z
u′′ v = v′′ u . (3.33)
R R
Indeed, from condition b) there must exist a sequence cn → +∞, bn → −∞ such that
|u′ (cn )v′ (cn )| + |u′ (bn )v′ (bn )| → 0. On the other hand, from condition a) we have
Z cn Z
lim u′′ v = u′′ v ,
n→∞ bn R
R
and similarly for v′′ u. Now applying integration by parts we have
Z cn Z cn
′′ ′ ′ ′ ′
u v = u (cn )v (cn ) − u (bn )v (bn ) + v′′ u ,
bn bn
R d | U 2
∂ xi
V| both finite and b) R d k∇ U k k∇ Vk < ∞ then
Z Z
V∆ U = U∆ V . (3.34)
Rd Rd
We write x = (x1 , xd2 )by grouping the last (d − 1) coordinates together. Fix xd2 and define
u(x1 ) = U(x1 , x2 , · · · , xd ) and v(x1 ) = V(x1 , x2 , · · · , xd ). For Lebesgue-a.e. xd2 we see that u, v
satisfy conditions for (3.33). Thus, we obtain that for such xd2 we have
Z Z
∂2 ∂2
V(x) 2 U(x) dx1 = U(x) 2 V(x) dx1 .
R ∂ x1 R ∂ x1
Integrating this over xd2 we get
Z Z
∂2 ∂2
V(x) 2 U(x) dx = U(x) 2 V(x) dx .
Rd ∂ x1 Rd ∂ x1
Now, to justify step (c) we have to verify that U(x) = 1 + ln(p ∗ ϕa )(x) and V(x) = p ∗ ϕa (x)
2
satisfy conditions of the previous result. To that end, notice that from (3.29) we have | ∂∂y2 U(y)| ≤
i
1
a2 E[X2i |Ya = y] + 1
a and thus
Z
∂2
|V U| = Oa (E[X2i ]) < ∞ .
Rd ∂ x2i
i i
i i
i i
2
On the other hand, from (3.25) and (3.32) we have | ∂∂y2 V(y)| ≤ c′ p ∗ ϕa2 (y). From (3.31) then we
i
obtain
Z Z
∂2
|U 2 V| ≤ cc (1 + kxk + kxk2 )p ∗ ϕa2 (x) = cc′ E[1 + kYa2 k + kYa2 k2 ] < ∞ .
′
Rd ∂ xi
R
Finally, for showing Rd k∇Ukk∇Vk < ∞ we apply (3.28) to estimate k∇Uk ≲a 1 + kyk and
use (3.32) to estimate k∇Vk ≲a p ∗ ϕa2 (x). Thus, we have
Z
k∇Ukk∇Vk ≲a E[1 + kYa2 k] < ∞ .
Rd
R R
This completes verification of conditions and we conclude U∆V = V∆U as required for step
(c).
√
The identity (3.23) is obtained by differentiating function γ 7→ mmse(X| γ + Z) using very
similar methods. We refer to [206] for full justification.
Similarly, notice that Ey∼j ∼PY∼j [mmse(Xj |Yj , Y∼j = y∼j )] = mmse(Xj |Y). Thus, applying the 1D-
version of (3.22) we get
∂ log e
I ( X ; Y) = mmse(Xj |Y) .
∂γj 2
P
Now since mmse(X|Y) = j mmse(Xj |Y) by summing (3.35) over j we obtain the d-dimensional
version of (3.22). Note that we computed the derivative in a scalar parameter γ by introducing a
vector one γ and then using the chain rule to simplify partial derivatives. This idea is the basis
of area theorem in information theory [360, Lemma 3] and Guerra interpolation in statistical
physics [410].
i i
i i
i i
64
Theorem 3.16 (Entropy power inequality (EPI) [399]) Suppose A1 ⊥ ⊥ A2 are inde-
pendent Rd -valued random variables with finite second moments E[kAi k2 ] < ∞, i ∈ {1, 2}.
Then
Proof. We present an elegant proof of [437]. First, an observation of Lieb [280] shows that EPI
is equivalent to proving: For all U1 ⊥
⊥ U2 and α ∈ [0, 2π ) we have
(To see that (3.36) implies EPI simply take cos2 α = N(A1N)+ ( A1 )
N(A2 ) and U1 = A1 / cos α, U2 =
A2 / sin α.)
Next, we claim that proving (3.36) for general Ui is equivalent to proving it for their “smoothed”
√
versions, i.e. Ũi = Ui + ϵZi , where Zi ∼ N (0, Id ) is independent of U1 , U2 . Indeed, this technical
continuity result follows, for example, from [344, Prop. 1], which shows that whenever E[kUi k2 ] <
√ √ √
∞ then function ϵ 7→ h(Ui + ϵZi ) is continuous and in fact h(Ui + ϵZ) = h(Ui ) + O( ϵ) as
ϵ → 0.
In other words, to prove Lieb’s EPI it is sufficient to prove for all ϵ > 0
√ √ √
h( X + ϵZ) ≥ h(U1 + ϵZ1 ) cos2 α + h(U2 + ϵZ2 ) sin2 (α) ,
where we also defined X ≜ U1 cos α+ U2 sin α, Z ≜ Z1 cos α+ Z2 sin α. Since the above inequality
is scale-invariant, we can equivalently show for all γ ≥ 0 the following:
√ √ √
h( γ X + Z) ≥ h( γ U1 + Z1 ) cos2 α + h( γ U2 + Z2 ) sin2 (α) .
4
Another deep manifestation of this phenomenon is in the context of CLT. Barron’s entropic CLT states that for iid Xi ’s
with zero mean and unit variance, the KL divergence D( X1 +...+X
√
n
n
kN (0, 1)), whenever finite, converges to zero. This
convergence is in fact monotonic as shown in [27, 102].
i i
i i
i i
On the other hand, for the right-hand side X is a sum of two conditionally independent terms and
thus
√ √ √ √
mmse(X| γ U1 +Z1 , γ U2 +Z2 ) = mmse(U1 | γ U1 +Z1 ) cos2 α+mmse(U2 | γ U2 +Z2 ) sin2 (α) .
In Corollary 2.9 we have already seen how properties of differential entropy can be translated
to properties of positive definite matrices. Here is another application:
i i
i i
i i
We will see in this chapter that divergence has two different sup characterizations (over partitions
and over functions). The mutual information is more special. In addition to inheriting the ones
from KL divergence, it possesses two extra: an inf-representation over (centroid) measures QY
and a sup-representation over Markov kernels.
As applications of these variational characterizations, we discuss the Gibbs variational prin-
ciple, which serves as the basis of many modern algorithms in machine learning, including the
EM algorithm and variational autoencoders; see Section 4.4. An important theoretical construct
in machine learning is the idea of PAC-Bayes bounds (Section 4.8*).
From information theoretic point of view variational characterizations are important because
they address the problem of continuity. We will discuss several types of continuity in this Chapter.
First, is the continuity in discretization. This is related to the issue of computation. For complicated
P and Q direct computation of D(PkQ) might be hard. Instead, one may want to discretize the
infinite alphabet and compute numerically the finite sum. Does this approximation work, i.e., as
the quantization becomes finer, are the resulting finite sums guaranteed to converge to the true
value of D(PkQ)? The answer is positive and this continuity with respect to discretization is the
content of Theorem 4.5.
Second, is the continuity under change of the distribution. For example, this arises in the prob-
lem of estimating information measures. In many statistical setups, oftentimes we do not know
P or Q, and we estimate the distribution by P̂n using n iid observations sampled from P (in dis-
crete cases we may set P̂n to be simply the empirical distribution). Does D(P̂n kQ) provide a good
estimator for D(PkQ)? Does D(P̂n kQ) → D(PkQ) if P̂n → P? The answer is delicate – see
Section 4.5.
Third, there is yet another kind of continuity: continuity “in the σ -algebra”. Despite the scary
name, this one is useful even in the most “discrete” situations. For example, imagine that θ ∼
66
i i
i i
i i
i.i.d.
Unif(0, 1) and Xi ∼ Ber(θ). Suppose that you observe a sequence of Xi ’s until the random moment
τ equal to the first occurrence of the pattern 0101. How much information about θ did you learn
by time τ ? We can encode these observations as
(
Xj , j ≤ τ ,
Zj = ,
?, j > τ
where ? designates the fact that we don’t know the value of Xj on those times. Then the question
we asked above is to compute I(θ; Z∞ ). We will show in this chapter that
X
∞
I(θ; Z∞ ) = lim I(θ; Zn ) = I(θ; Zn |Zn−1 ) (4.1)
n→∞
n=1
thus reducing computation to evaluating an infinite sum of simpler terms (not involving infinite-
dimensional vectors). Thus, even in this simple question about biased coin flips we have to
understand how to safely work with infinite-dimensional vectors.
Furthermore, it turns out that PY , similar to the center of gravity, minimizes this weighted distance
and thus can be thought as the best approximation for the “center” of the collection of distribu-
tions {PY|X=x : x ∈ X } with weights given by PX . We formalize these results in this section and
start with the proof of a “golden formula”. Its importance is in bridging the two points of view
on mutual information. Recall that on one hand we had the Fano’s Definition 3.1, on the other
hand for discrete cases we had the Shannon’s definition (3.3) as difference of entropies. Then
next result (4.3) presents MI as the difference of relative entropies in the style of Shannon, while
retaining applicability to continuous spaces in the style of Fano.
Proof. In the discrete case and ignoring the possibility of dividing by zero, the argument is really
simple. We just need to write
(3.1) PY|X PY|X QY
I(X; Y) = EPX,Y log = EPX,Y log
PY PY QY
i i
i i
i i
68
P Q P
and then expand log PYY|XQYY = log QY|YX − log QPYY . The argument below is a rigorous implementation
of this idea.
First, notice that by Theorem 2.16(e) we have D(PY|X kQY |PX ) ≥ D(PY kQY ) and thus if
D(PY kQY ) = ∞ then both sides of (4.2) are infinite. Thus, we assume D(PY kQY ) < ∞ and
in particular PY QY . Rewriting LHS of (4.2) via the chain rule (2.24) we see that Theorem
amounts to proving
D(PX,Y kPX QY ) = D(PX,Y kPX PY ) + D(PY kQY ) .
The case of D(PX,Y kPX QY ) = D(PX,Y kPX PY ) = ∞ is clear. Thus, we can assume at least one of
these divergences is finite, and, hence, also PX,Y PX QY .
dPY
Let λ(y) = dQ Y
(y). Since λ(Y) > 0 PY -a.s., applying the definition of Log in (2.10), we can
write
λ(Y)
EPY [log λ(Y)] = EPX,Y Log . (4.4)
1
dPX PY
Notice that the same λ(y) is also the density dPX QY
(x, y) of the product measure PX PY with respect
to PX QY . Therefore, the RHS of (4.4) by (2.11) applied with μ = PX QY coincides with
D(PX,Y kPX QY ) − D(PX,Y kPX PY ) ,
while the LHS of (4.4) by (2.13) equals D(PY kQY ). Thus, we have shown the required
D(PY kQY ) = D(PX,Y kPX QY ) − D(PX,Y kPX PY ) .
Remark 4.1 The variational representation (4.5) is useful for upper bounding mutual infor-
mation by choosing an appropriate QY . Indeed, often each distribution in the collection PY|X=x
is simple, but their mixture, PY , is very hard to work with. In these cases, choosing a suitable QY
in (4.5) provides a convenient upper bound. As an example, consider the AWGN channel Y = X+Z
in Example 3.3, where Var(X) = σ 2 , Z ∼ N (0, 1). Then, choosing the best possible Gaussian Q
and applying the above bound, we have:
1
I(X; Y) ≤ inf E[D(N (X, 1)kN ( μ, s))] = log(1 + σ 2 ),
μ∈R,s≥0 2
which is tight when X is Gaussian. For more examples and statistical applications, see Chapter 30.
i i
i i
i i
Proof. We only need to use the previous corollary and the chain rule (2.24):
(2.24)
D(PX,Y kQX QY ) = D(PY|X kQY |PX ) + D(PX kQX ) ≥ I(X; Y) .
Interestingly, the point of view in the previous result extends to conditional mutual information
as follows: We have
where the minimization is over all QX,Y,Z = QX QY|X QZ|Y , cf. Section 3.4. Showing this character-
ization is very similar to the previous theorem. By repeating the same argument as in (4.2) we
get
≥ I ( X ; Z| Y) .
Characterization (4.6) can be understood as follows. The most general directed graphical model
for the triplet (X, Y, Z) is a 3-clique (triangle).
Y X
What is the information flow on the dashed edge X → Z? To answer this, notice that removing
this edge restricts the joint distribution to a Markov chain X → Y → Z. Thus, it is natural to
ask what is the minimum (KL-divergence) distance between a given PX,Y,Z and the set of all
distributions QX,Y,Z satisfying the Markov chain constraint. By the above calculation, optimal
QX,Y,Z = PY PX|Y PZ|Y and hence the distance is I(X; Z|Y). For this reason, we may interpret I(X; Z|Y)
as the amount of information flowing through the X → Z edge.
In addition to inf-characterization, mutual information also has a sup-characterization.
i i
i i
i i
70
Theorem 4.4 For any Markov kernel QX|Y such that QX|Y=y PX for PY -a.e. y we have
dQX|Y
I(X; Y) ≥ EPX,Y log .
dPX
If I(X; Y) < ∞ then
dQX|Y
I(X; Y) = sup EPX,Y log , (4.7)
QX|Y dPX
where the supremum is over Markov kernels QX|Y as in the first sentence.
Remark 4.2 Similar to how (4.5) is used to upper-bound I(X; Y) by choosing a good approx-
imation to PY , this result is used to lower-bound I(X; Y) by selecting a good (but computable)
approximation QX|Y to usually a very complicated posterior PX|Y . See Section 5.6 for applications.
Proof. Since modifying QX|Y=y on a negligible set of y’s does not change the expectations, we
will assume that QX|Y=y PY for every y. If I(X; Y) = ∞ then there is nothing to prove. So we
assume I(X; Y) < ∞, which implies PX,Y PX PY . Then by Lemma 3.3 we have that PX|Y=y PX
dQX|Y=y /dPX
for almost every y. Choose any such y and apply (2.11) with μ = PX and noticing Log 1 =
dQX|Y=y
log dP X
we get
dQX|Y=y
EPX|Y=y log = D(PX|Y=y kPX ) − D(PX|Y=y kQX|Y=y ) ,
dPX
which is applicable since the first term is finite for a.e. y by (3.1). Taking expectation of the previous
identity over y we obtain
dQX|Y
EPX,Y log = I(X; Y) − D(PX|Y kQX|Y |PY ) ≤ I(X; Y) , (4.8)
dPX
implying the first part. The equality case in (4.7) follows by taking QX|Y = PX|Y , which satisfies
the conditions on Q when I(X; Y) < ∞.
i i
i i
i i
Remark 4.3 This theorem, in particular, allows us to prove all general identities and inequali-
ties for the cases of discrete random variables and then pass to the limit. In the case of mutual
information I(X; Y) = D(PX,Y kPX PY ), the partitions of X and Y can be chosen separately,
see (4.29).
“≤”: To show D(PkQ) is indeed achievable, first note that if P 6 Q, then by definition, there
exists B such that Q(B) = 0 < P(B). Choosing the partition E1 = B and E2 = Bc , we have
P2 P[Ei ]
D(PkQ) = ∞ = i=1 P[Ei ] log Q[Ei ] . In the sequel we assume that P Q and let X = dQ .
dP
Then D(PkQ) = EQ [X log X] = EQ [φ(X)] by (2.4). Note that φ(x) ≥ 0 if and only if x ≥ 1. By
monotone convergence theorem, we have EQ [φ(X)1 {X < c}] → D(PkQ) as c → ∞, regardless
of the finiteness of D(PkQ).
Next, we construct a finite partition. Let n = c/ϵ be an integer and for j = 0, . . . , n − 1, let
Ej = {jϵ ≤ X ≤ (j + 1)ϵ} and En = {X ≥ c}. Define Y = ϵbX/ϵc as the quantized version. Since φ
is uniformly continuous on [0, c], for any x, y ∈ [0, c] such |x−y| ≤ ϵ, we have |φ(x)−φ(y)| ≤ ϵ′ for
some ϵ′ = ϵ′ (ϵ, c) such as ϵ′ → 0 as ϵ → 0. Then EQ [φ(Y)1 {X < c}] ≥ EQ [φ(X)1 {X < c}] − ϵ′ .
Moreover,
X
n−1 n−1
X
P(Ej )
EQ [φ(Y)1 {X < c}] = φ(jϵ)Q(Ej ) ≤ ϵ′ + φ Q(Ej )
Q( E j )
j=0 j=0
X
n
P(Ej )
≤ ϵ′ + Q(X ≥ c) log e + P(Ej ) log ,
Q(Ej )
j=0
P(E )
where the first inequality applies the uniform continuity of φ since jϵ ≤ Q(Ejj ) < (j + 1)ϵ, and the
second applies φ ≥ − log e. As Q(X ≥ c) → 0 as c → ∞, the proof is completed by first sending
ϵ → 0 then c → ∞.
i i
i i
i i
72
In particular, if D(PkQ) < ∞ then EP [f(X)] is well-defined and < ∞ for every f ∈ CQ . The
identity (4.11) holds with CQ replaced by the class of all R-valued simple functions. If X is a
normal topological space (e.g., a metric space) with the Borel σ -algebra, then also
D(PkQ) = sup EP [f(X)] − log EQ [exp{f(X)}] , (4.12)
f∈Cb
Proof. “D ≥ supf∈CQ ”: We can assume for this part that D(PkQ) < ∞, since otherwise there is
nothing to prove. Then fix f ∈ CQ and define a probability measure Qf (tilted version of Q) via
Qf (dx) = exp{f(x) − ψf }Q(dx) , ψf ≜ log EQ [exp{f(X)}] . (4.13)
Then, Qf Q. We will apply (2.11) next with reference measure μ = Q. Note that according
exp{f(x)−ψf }
to (2.10) we always have Log 1 = f(x) − ψf even when f(x) = −∞. Thus, we get
from (2.11)
dQf /dQ
EP [f(X)] − ψf = EP Log = D(PkQ) − D(PkQf ) ≤ D(PkQ) .
1
Note that (2.11) also implies that if D(PkQ) < ∞ and f ∈ CQ the expectation EP [f] is well-defined.
“D ≤ supf ” with supremum over all simple functions: The idea is to just take f = log dQ dP
;
however to handle all cases we proceed more carefully. First, notice that if P 6 Q then for some
E with Q[E] = 0 < P[E] and c → ∞ taking f = c1E shows that both sides of (4.11) are infinite.
Pn P[ E ]
Thus, we assume P Q. For any partition of X = ∪nj=1 Ej we set f = j=1 1Ej log Q[Ejj ] . Then
the right-hand sides of (4.11) and (4.9) evaluate to the same value and hence by Theorem 4.5 we
obtain that supremum over simple functions (and thus over CQ ) is at least as large as D(PkQ).
Finally, to show (4.12), we show that for every simple function f there exists a continuous
bounded f′ such that EP [f′ ] − log EQ [exp{f′ }] is arbitrarily close to the same functional evaluated
at f. To that end we first show that for any a ∈ R and measurable A ⊂ X there exists a sequence
of continuous bounded fn such that
EP [fn ] → aP[A], and EQ [exp{fn }] → exp{a}Q[A] (4.14)
hold simultaneously, i.e. fn → a1A in the sense of approximating both expectations. We only
consider the case of a > 0 below. Let compact F and open U be such that F ⊂ A ⊂ U and
max(P[U] − P[F], Q[U] − Q[F]) ≤ ϵ. Such F and U exist whenever P and Q are so-called regular
measures. Without going into details, we just notice that finite measures on Polish spaces are
automatically regular. Then by Urysohn’s lemma there exists a continuous function fϵ : X → [0, a]
equal to a on F and 0 on Uc . Then we have
aP[F] ≤ EP [fϵ ] ≤ aP[U]
i i
i i
i i
Subtracting aP[A] and exp{a}Q[A] for each of these inequalities, respectively, we see that taking
ϵ → 0 indeed results in a sequence of functions satisfying (4.14).
Pn
Similarly, if we want to approximate a general simple function g = i=1 ai 1Ai (with Ai disjoint
and |ai | ≤ amax < ∞) we fix ϵ > 0 and define functions fi,ϵ approximating ai 1Ai as above with
sets Fi ⊂ Ai ⊂ Ui , so that S ≜ ∪i (Ui \ Fi ) satisfies max(P[S], Q[S]) ≤ nϵ. We also have
X X
| fi,ϵ − g| ≤ amax 1Ui \Fi ≤ namax 1S .
i i
P
We then clearly have | EP [ i fi,ϵ ] − EP [g]| ≤ amax n2 ϵ. On the other hand, we also have
X X
exp{ai }Q[Fi ] ≤ EQ [exp{ fi,ϵ }]
i i
Remark 4.4 1 What is the Donsker-Varadhan representation useful for? By setting f(x) =
ϵ · g(x) with ϵ 1 and linearizing exp and log we can see that when D(PkQ) is small, expec-
tations under P can be approximated by expectations over Q (change of measure): EP [g(X)] ≈
EQ [g(X)]. This holds for all functions g with finite exponential moment under Q. Total variation
distance provides a similar bound, but for a narrower class of bounded functions:
2 More formally, the inequality EP [f(X)] ≤ log EQ [exp f(X)] + D(PkQ) is useful in estimating
EP [f(X)] for complicated distribution P (e.g. over high-dimensional X with weakly dependent
coordinates) by making a smart choice of Q (e.g. with iid components).
3 In Chapter 5 we will show that D(PkQ) is convex in P (in fact, in the pair). A general method
of obtaining variational formulas like (4.11) is via the Young-Fenchel duality, which we review
below in (7.84). Indeed, (4.11) is exactly that inequality since the Fenchel-Legendre conjugate
of D(·kQ) is given by a convex map f 7→ ψf . For more details, see Section 7.13.
4 Donsker-Varadhan should also be seen as an “improved version” of the DPI. For example, one
of the main applications of the DPI in this book is in obtaining estimates like
1
P[A] log ≤ D(PkQ) + log 2 , (4.15)
Q[ A ]
which is the basis of the large deviations theory (Corollary 2.19 and Chapter 15) and Fano’s
inequality (Theorem 3.12). The same estimate can be obtained by applying (4.11) with f(x) =
1 {x ∈ A} log Q[1A] .
i i
i i
i i
74
where the supremum is taken over all P with D(PkQ) < ∞. If the left-hand side is finite then the
unique maximizer of the right-hand side is P = Qf , a tilted version of Q defined in (4.13).
Proof. Let ψf ≜ log EQ [exp{f(X)}]. First, if ψf = −∞, then Q-a.s. f = −∞ and hence P-a.s. also
f = −∞, so that both sides of the equality are −∞. Next, assume −∞ < ψf < ∞. Then by
Donsker-Varadhan (4.11) we get
ψf ≥ EP [f(X)] − D(PkQ) .
dQf
On the other hand, setting P = Qf we obtain an equality. To show uniqueness, notice that Log dQ
1 =
f − ψf even when f = −∞. Thus, from (2.11) we get whenever D(PkQ) < ∞ that
EP [f(X) − ψf ] = D(PkQ) − D(PkQf ) .
From here we conclude that EP [f(X)] < ∞ and hence can rewrite the above as
EP [f(X)] − D(PkQ) = ψf − D(PkQf ) ,
which shows that EP [f(X)] − D(PkQ) = ψf implies P = Qf .
Next, suppose ψf = ∞. Let us define fn = f ∧ n, n ≥ 1. Since ψfn < ∞ we have by the previous
characterization that there is a sequence Pn such that D(Pn kQ) < ∞ and as n → ∞
EPn [f(X) ∧ n] − D(Pn kQ) = ψfn % ψf = ∞ .
Since EPn [f(X) ∧ n] ≤ EPn [f(X)], we have
EPn [f(X)] − D(Pn kQ) ≥ ψfn → ∞ ,
concluding the proof.
We now briefly explore how Proposition 4.7 has been applied over the last century. We start
with the example from statistical physics and graphical models. Here the key idea is to replace
sup over all distributions P with a subset that is easier to handle. This idea is the basis of much of
variational inference [447].
i i
i i
i i
Example 4.1 (Mean-field approximation for Ising model) Suppose that we have a
complicated model for a distribution of a vector X̃ ∈ {0, 1}n+m given by an Ising model
1
PX̃ (x̃) = exp{x̃⊤ Ãx̃ + b̃⊤ x̃} ,
Z̃
where à ∈ R(n+m)×(n+m) is a symmetric interaction matrix with zero diagonal and b̃ is a vector
of external fields and Z̃ is a normalization constant. We note that often à is very sparse with non-
zero entries occurring only those few variables xi and xj that are considered to be interacting (or
adjacent in some graph). We decompose the vector X̃ = (X, Y) into two components: the last
m coordinates are observables and the first n coordinates are hidden (latent), whose values we
want to infer; in other words, our goal is to evaluate PX|Y=y upon observing y. It is clear that this
conditional distribution is still an Ising model, so that
1
PX|Y (x|y) = exp{x⊤ Ax + b⊤ x} , x ∈ {0, 1}n
Z
where A is the n × n leading minor of à and b and Z depend on y. Unfortunately, computing even
a single value P[X1 = 1] is known to be generally computationally infeasible [394, 175], since
evaluating Z requires summation over 2n values of x.
Let us denote f(x) = x⊤ Ax + b⊤ x and by Q the uniform distribution on {0, 1}n . Applying
Proposition 4.7 we obtain
As we said, exact computation of log Z, though, is not tractable. An influential idea is to instead
Qn
search the maximizer in the class of product distributions PXn = i=1 Ber(pi ). In this case, this
supremization can be solved almost in closed-form:
X
sup p⊤ Ap + b⊤ p + h(pi ) ,
p
i
where p = (p1 , . . . , pn ). Since the objective function is strongly concave (Exercise I.37), we only
need to solve the first order optimality conditions (or mean-field equations), which is a set of n
non-linear equations in n variables:
X
n 1
pi = σ bi + 2 ai,j pj , σ(x) ≜ .
1 + exp(−x)
j=1
These are solved by iterative message-passing algorithms [447]. Once the values of pi are obtained,
the mean-field approximation is to take
Y
n
PX|Y=y ≈ Ber(pi ) .
i=1
i i
i i
i i
76
We stress that the mean-field idea is not only to approximate the value of Z, but also to consider
the corresponding maximizer (over a restricted class of product distributions) as the approximate
posterior distribution.
To get another flavor of examples, let us consider a more general setting, where we have some
(θ)
parametric collection of distributions PX,Y indexed by θ ∈ Rd . Often, the joint distribution is such
(θ) (θ) (θ) (θ)
that PX and PY|X are both “simple”, but the PY and PX|Y are “complex” or even intractable (e.g.
in sparse linear regression and community detection Section 30.3). As in the previous example, X
is the latent (unobserved) and Y is the observed variable.
For a moment we will omit writing θ and consider the problem of evaluating PY (y) – a quantity
(known as evidence) showing how extreme the observed y is. Note that
Although by assumption PX and PY|X are both easy to compute, this marginalization may be
intractable. As a workaround, we invoke Proposition 4.7 with f(x) = log PY|X (y|x) and Q = PX to
get
PX,Y (X, y)
log PY (y) = sup ER [f(X)] − D(RkPX ) = sup EX∼R log , (4.16)
R R R(X)
where R is an arbitrary distribution. Note that the right-hand side only involves a simple quantity
PX,Y and hence all the complexity of computation is moved to optimization over R. Expres-
sion (4.16) is known as evidence lower bound (ELBO) since for any fixed value of R we get a
provable lower bound on log PY (y). Typically, one optimizes the choice of R over some convenient
set of distributions to get the best (tightest) lower bound in that class.
One such application leads to the famous iterative (EM) algorithm, see (5.33) below. Another
application is a modern density estimation algorithm, which we describe next.
where vector μ(·; θ) and diagonal matrix D(·; θ) are deep neural networks with input (·) and
(θ)
weights θ. (See [245, App. C.2] for a detailed description.) The resulting distribution PY is a
(complicated) location-scale Gaussian mixture. To find the best density [245] aims to maximize
the likelihood by solving
X (θ)
max log PY (yi ) .
θ
i
i i
i i
i i
Since the marginalization to obtain PY is intractable, we replace the objective (by an appeal to
ELBO (4.16)) with
" #
X (θ)
pX,Y (X, yi )
max sup EX∼RX|Y=yi log , (4.17)
θ RX|Y rX|Y (X|yi )
i
where we denoted the PDFs of PX,Y and RX|Y by lower-case letters. Now in this form the algorithm
is simply the EM algorithm (as we discuss below in Section 5.6). What brings the idea to the 21st
century is restricting the optimization to the set of RX|Y which are again defined via
where μ̃(Y; ϕ) and diagonal covariance matrix D̃(Y; ϕ) are output by some neural network with
parameter ϕ. The conditional distribution under this auxiliary model (recognition model), denoted
(ϕ) (θ)
by RX|Y , is Gaussian. Since the ELBO (4.16) is achieved by the posterior PX|Y , what this amounts
to is to approximate the true posterior under the generative model by a Gaussian. Replacing also
i.i.d.
the expectation over RX|Y=yi with its empirical version (by generating Z̃ij ∼ N (0, Id′ ) we obtain
the following
XX (θ)
pX,Y (xi,j , yi )
max max log , xi,j = μ̃(yi ; ϕ) + D̃(yi ; ϕ)Z̃ij . (4.18)
(ϕ)
θ ϕ
i j rX|Y (xi,j |yi )
(θ) (ϕ)
Now plugging the Gaussian form of the densities pX , pY|X and rX|Y one gets an expression whose
gradients ∇θ and ∇ϕ can be easily computed by the automatic differentiation software.1 In
(ϕ) (θ)
fact, since rX|Y and rY|X are both Gaussian, we can use less Monte Carlo approximation than
(θ)
pX,Y (X,yi )
(4.18), because the objective in (4.17) equals EX∼RX|Y=yi log (ϕ) = −D(RX|Y=yi kPX ) +
rX|Y (X|yi )
h i
EX∼RX|Y=yi log pY|X (yi |X) , where PX = N(0, Id′ ), RX|Y=yi = N(μ̃(yi ; ϕ); D̃2 (yi ; ϕ)), PY|X=x =
(θ) (θ)
N( μ(x; θ); D2 (x; θ)) so that the first Gaussian KL divergence is in close form (Example 2.2) and
we only need to apply Monte Carlo approximation to the second term. For both versions, the
optimization proceeds by (stochastic) gradient ascent over θ and ϕ until convergence to some
(θ ∗ ) (ϕ∗ )
(θ∗ , ϕ∗ ). Then PY can be used to generate new samples from the learned distribution, RX|Y to
(θ ∗ )
map (“encode”) samples to the latent space and PY|X to “decode” a latent representation into a
target sample. We refer the readers to Chapters 3 and 4 in the survey [246] for other encoder and
decoder architectures.
1
An important part of the contribution of [245] is the “reparametrization trick”. Namely, since [452] a standard way to
compute ∇ϕ EQ(ϕ) [f] in machine learning is to write ∇ϕ EQ(ϕ) [f] = EQ(ϕ) [f(X)∇ϕ ln q(ϕ) (X)] and replace the latter
expectation by its empirical approximation. However, in this case a much better idea is to write
EQ(ϕ) [f] = EZ∼N [f(g(Z; ϕ))] for some explicit g and then move gradient inside the expectation before computing the
empirical version.
i i
i i
i i
78
Proposition 4.8 Let X be finite. Fix a distribution Q on X with Q(x) > 0 for all x ∈ X . Then
the map P 7→ D(PkQ) is continuous. In particular, P 7→ H(P) is continuous.
Warning: Divergence is never continuous in the pair, even for finite alphabets. For example,
as n → ∞, d( 1n k2−n ) 6→ 0.
X P(x)
D(PkQ) = P(x) log
x
Q ( x)
Our next goal is to study continuity properties of divergence for general alphabets. We start
with a negative observation.
Remark 4.5 In general, D(PkQ) is not continuous in either P or Q. For example, let X1 , . . . , Xn
Pn d
be iid and equally likely to be {±1}. Then by central limit theorem, Sn = √1n i=1 Xi −
→N (0, 1)
as n → ∞. But D(PSn kN (0, 1)) = ∞ for all n because Sn is discrete. Note that this is also an
example for strict inequality in (4.19).
On a general space if Pn → P and Qn → Q pointwise3 (i.e. Pn [E] → P[E] and Qn [E] → Q[E] for
every measurable E) then (4.19) also holds.
Proof. This simply follows from (4.12) since EPn [f] → EP [f] and EQn [exp{f}] → EQ [exp{f}] for
every f ∈ Cb .
2
Recall that sequence of random variables Xn converges in distribution to X if and only if their laws PXn converge weakly
to PX .
3
Pointwise convergence is weaker than convergence in total variation and stronger than weak convergence.
i i
i i
i i
D(PF kQF ) .
Our main results are continuity under monotone limits of σ -algebras. Recall that a sequence of
nested σ -algebras, F1 ⊂ F2 · · · , is said to Fn % F when F ≜ σ (∪n Fn ) is the smallest σ -
algebra containing ∪n Fn (the union of σ -algebras may fail to be a σ -algebra and hence needs
completion). Similarly, a sequence of nested σ -algebras, F1 ⊃ F2 · · · , is said to Fn & F if
F = ∩n Fn (intersection of σ -algebras is always a σ -algebra). We will show in this section that we
always have:
For establishing the first result, it will be convenient to extend the definition of the divergence
D(PF kQF ) to (a) any algebra of sets F and (b) two positive additive (not necessarily σ -additive)
set-functions P, Q on F . We do so following the Gelfand-Yaglom-Perez variational representation
of divergence (Theorem 4.5).
Definition 4.10 (KL divergence over an algebra) Let P and Q be two positive, addi-
tive (not necessarily σ -additive) set-functions defined over an algebra F of subsets of X (not
necessarily a σ -algebra). We define
X
n
P[Ei ]
D(PF kQF ) ≜ sup P[Ei ] log ,
{E1 ,...,En } i=1 Q[Ei ]
Sn
where the supremum is over all finite F -measurable partitions: j=1 Ej = X , Ej ∩ Ei = ∅, and
0 log 01 = 0 and log 10 = ∞ per our usual convention.
Note that when F is not a σ -algebra or P, Q are not σ -additive, we do not have Radon-Nikodym
theorem and thus our original definition of KL-divergence is not applicable.
i i
i i
i i
80
S
• Let F1 ⊆ F2 . . . be an increasing sequence of algebras and let F = n Fn be their limit, then
• If F is (P + Q)-dense in G then4
and, in particular,
Proof. The first two items are straightforward applications of the definition. The third follows
from the following fact: if F is dense in G then any G -measurable partition {E1 , . . . , En } can
be approximated by a F -measurable partition {E′1 , . . . , E′n } with (P + Q)[Ei 4E′i ] ≤ ϵ. Indeed,
first we set E′1 to be an element of F with (P + Q)(E1 4E′1 ) ≤ 2n ϵ
. Then, we set E′2 to be
an 2nϵ
-approximation of E2 \ E′1 , etc. Finally, E′n = (∪j≤1 E′j )c . By taking ϵ → 0 we obtain
P ′ P[E′i ] P P[ Ei ]
i P[Ei ] log Q[E′i ] → i P[Ei ] log Q[Ei ] .
The last statement follows from the previous one and the fact that any algebra F is μ-dense in
the σ -algebra σ{F} it generates for any bounded μ on (X , H) (cf. [142, Lemma III.7.1].)
Finally, we address the continuity under the decreasing σ -algebra, i.e. (4.21).
The condition D(PF0 kQF0 ) < ∞ can not be dropped, cf. the example after (4.32).
h i
Proof. Let X−n = dP
dQ . Since X−n = EQ dP
dQ Fn , we have that (. . . , X−1 , X0 ) is a uniformly
Fn
integrable martingale. By the martingale convergence theorem in reversed time, cf. [84, Theorem
5.4.17], we have almost surely
dP
X−n → X−∞ ≜ . (4.25)
dQ F
4
Recall that F is μ-dense in G if ∀E ∈ G, ϵ > 0∃E′ ∈ F s.t. μ[E∆E′ ] ≤ ϵ.
i i
i i
i i
where log+ x = max(log x, 0) and log− x = min(log x, 0). Since x log− x is bounded, we have
from the bounded convergence theorem:
To prove a similar convergence for log+ we need to notice two things. First, the function
x 7→ x log+ x
is convex. Second, for any non-negative convex function ϕ s.t. E[ϕ(X0 )] < ∞ the collection
{Zn = ϕ(E[X0 |Fn ]), n ≥ 0} is uniformly integrable. Indeed, we have from Jensen’s inequality
1 E[ϕ(X0 )]
P[Zn > c] ≤ E[ϕ(E[X0 |Fn ])] ≤
c c
and thus P[Zn > c] → 0 as c → ∞. Therefore, we have again by Jensen’s
Finally, since X−n log+ X−n is uniformly integrable, we have from (4.25)
Proposition 4.13 (a) If X and Y are both finite alphabets, then PX,Y 7→ I(X; Y) is continuous.
(b) If X is finite, then PX 7→ I(X; Y) is continuous.
(c) Without any assumptions on X and Y , let PX range over the convex hull Π = co(P1 , . . . , Pn ) =
Pn Pn
{ i=1 αi Pi : i=1 αi = 1, αi ≥ 0}. If I(Pj , PY|X ) < ∞ (using the notation I(PX , PY|X ) =
I(X; Y)) for all j ∈ [n], then the map PX 7→ I(X; Y) is continuous.
5
Here we only assume that topology on the space of measures is compatible with the linear structure, so that all linear
operations on measures are continuous.
i i
i i
i i
82
1
P
For the second statement, take QY = |X | PY|X=x . Note that
x∈X
" !#
X
D(PY kQY ) = EQY f PX (x)hx (Y) ,
x
dPY|X=x
where f(t) = t log t and hx (y) = are bounded by |X | and non-negative. Thus, from the
dQY (y)
bounded convergence theorem we have that PX 7→ D(PY kQY ) is continuous. The proof is complete
since by the golden formula
Further properties of mutual information follow from I(X; Y) = D(PX,Y kPX PY ) and correspond-
ing properties of divergence, e.g.
where Ȳ is a copy of Y, independent of X and supremum can be taken over any of the classes of
(bivariate) functions as in Theorem 4.6. Notice, however, that for mutual information we can
also get a stronger characterization:6
from which (4.26) follows by moving the outer expectation inside the log. Both of these can
be used to show that E[f(X, Y)] ≈ E[f(X, Ȳ)] as long as the dependence between X and Y (as
measured by I(X; Y)) is weak, cf. Exercise I.55.
d
2 If (Xn , Yn ) → (X, Y) converge in distribution, then
d
• Example of strict inequality: Xn = Yn = n1 Z. In this case (Xn , Yn ) → (0, 0) but I(Xn ; Yn ) =
H(Z) > 0 = I(0; 0).
• An even more impressive example: Let (Xp , Yp ) be uniformly distributed on the unit ℓp -ball
d
on the plane: {(x, y) ∈ R2 : |x|p + |y|p ≤ 1}. Then as p → 0, (Xp , Yp ) → (0, 0), but
I(Xp ; Yp ) → ∞. (See Ex. I.16)
6
Just apply Donsker-Varadhan to D(PY|X=x0 kPY ) and average over x0 ∼ PX .
i i
i i
i i
4.8* PAC-Bayes 83
This implies that the full amount of mutual information between two processes X∞ and Y∞
is contained in their finite-dimensional projections, leaving nothing in the tail σ -algebra. Note
also that applying the (finite-n) chain rule to (4.30) recovers (4.1).
5 (Monotone convergence II): Recall that for any random process (X1 , . . .) we define its tail σ -
T
algebra as Ftail = ∩n≥1 σ(X∞ n ). Let Xtail be a random variable such that σ(Xtail ) =
∞
n≥1 σ(Xn ).
Then
whenever the right-hand side is finite. This is a consequence of Proposition 4.12. Without the
i.i.d.
finiteness assumption the statement is incorrect. Indeed, consider Xj ∼ Ber(1/2) and Y = X∞ 0 .
∞
Then each I(Xn ; Y) = ∞, but Xtail = constant a.e. by Kolmogorov’s 0-1 law, and thus the
left-hand side of (4.32) is zero.
4.8* PAC-Bayes
A deep implication of Donsker-Varadhan and Gibbs principle is a method, historically known as
PAC-Bayes,8 for bounding suprema of empirical processes. Here we present the key idea together
with two applications: one in high-dimensional probability and the other in statistical learning
theory.
But first, let us agree that in this section ρ and π will denote distributions on Θ and we will
write Eρ [·] and Eπ [·] to mean integration over only the θ variable over the respective prior, i.e.
Z
Eρ [fθ (x)] ≜ Eθ∼ρ [fθ (x)] = fθ (x)ρ(dθ)
Θ
denotes a function of x. Similarly, EPX [fθ (X)] will denote expectation only over X ∼ PX . The
following estimate is a workhorse of PAC-Bayes method.
7
To prove this from (4.9) one needs to notice that algebra of measurable rectangles is dense in the product σ-algebra. See
[129, Sec. 2.2].
8
For probably approximately correct (PAC), as developed by Shawe-Taylor and Williamson [382], McAllester [298],
Mauer [297], Catoni [83] and many others, see [16] for a survey.
i i
i i
i i
84
Proof. We will prove the following result, known in this area as an exponential inequality:
EPX sup exp{Eρ [fθ (X) − ψ(θ)] − D(ρkπ )} ≤ 1 , (4.35)
ρ
where the supremum inside is taken over all ρ such that D(ρkπ ) < ∞. Indeed, from it (4.33)
follows via the Markov inequality. Notice that this supremum is taken over uncountably many
values and hence it is not a priori clear whether the function of X under the outer expectation is
even measurable. We will show the latter together with the exponential inequality.
To that end, we apply the Gibbs principle (Proposition 4.7) to the function θ 7→ fθ (X) − ψ(θ)
and base measure π. Notice that this function may take value −∞, but nevertheless we obtain
where the right-hand side is a measurable function of X. Exponentiating and taking expectation
over X we obtain
EPX sup exp{Eρ [fθ (X) − ψ(θ)] − D(ρkπ )} = EPX [Eπ [exp{fθ (X) − ψ(θ)}]] .
ρ
We claim that the right-hand side equals π [ψ(θ) < ∞] ≤ 1, which completes the proof of (4.35).
Indeed, let E = {θ : ψ(θ) < ∞}. Then for any θ ∈ E we have EPX [exp{fθ (X) − ψ(θ)}] = 1, or in
other words for all θ:
Finally, notice that 1{θ ∈ E} can be omitted since for θ ∈ Ec we have exp{fθ (X) − ∞} = 0 by
agreement.
To show (4.34), for each x take ρ = Pθ|X=x in (4.35) to get
Ex∼PX exp{E[fθ (X) − ψ(θ)|X = x] − D(Pθ|X=x kπ )} ≤ 1 .
By Jensen’s we can move outer expectation inside the exponent and obtain the right-most
inequality in (4.34). To get the bound in terms of I(θ; X) take π = Pθ and recall (4.5).
i i
i i
i i
4.8* PAC-Bayes 85
But what about typical value of kXk, i.e. can we show an upper bound on kXk that holds with high
probability? In order to see how PAC-Bayes could be useful here, notice that kXk = sup∥v∥=1 v⊤ X.
Thus, we aim to use (4.33) to bound this supremum. For any v let ρv = N (v, β 2 Id ) and notice
v⊤ X = Eρv [θ⊤ X]. We also take π = N (0, β 2 Id ) and fθ (x) = λθ⊤ x, θ ∈ Rd , where β, λ > 0
are parameters to be optimized later. Taking base of log to be e, we compute explicitly ψ(θ) =
1 2 ⊤ λ2 ⊤ ∥ v∥ 2
2 λ θ Σθ , Eρv [ψ(θ)] = 2 (v Σv + β tr Σ) and D(ρv k π ) = 2β 2 via (2.8). Thus, using (4.33)
2
restricted to ρv with kvk = 1 we obtain that with probability ≥ 1 − δ we have for all v with
kvk = 1
λ2 ⊤ 1 1
λ v⊤ X ≤ (v Σv + β 2 tr Σ) + 2 + ln .
2 2β δ
√
Now, we can optimize right-hand side over β by choosing β 2 = 1/ λ2 tr Σ to get
λ2 ⊤ √ 1
λ v⊤ X ≤ (v Σv) + λ tr Σ + ln .
2 δ
Finally, estimating v⊤ Σv ≤ kΣkop and optimizing λ we obtain the resulting high-probability
bound:
r
√ 1
kXk ≤ tr Σ + 2kΣkop ln .
δ
Although this result can be obtained using the standard technique of Chernoff bound (Section 15.1)
– see [270, Lemma 1] for a stronger version or [69, Example 5.7] for general norms based on sophis-
ticated Gaussian concentration inequalities, the advantages of the PAC-Bayes proof are that (a) it is
2
not specific to Gaussian and holds for any X such that ψ(θ) ≤ λ2 θ⊤ Σθ (similar to subgaussian ran-
dom variables introduced below); and (b) its extensions can be used to analyze the concentration
of sample covariance matrices, cf. [475].
To present further applications, we need to introduce a new concept.
i i
i i
i i
86
• N (0, σ 2 ) is σ 2 -subgaussian. In fact it satisfies the condition with equality and explains the origin
of the name.
• If X ∈ [a, b], then X is (b−4a) -subgaussian. This is the well-known Hoeffding’s lemma (see
2
There are many equivalent ways to define subgaussianity [438, Prop. 2.5.2], including by requiring
t2
tails of X to satisfy P[|X − E[X]| > t] ≤ 2e− 2σ2 . However, for us the most important property is the
consequence of the two observations above: empirical average of independent bounded random
variables are O(1/n)-subgaussian.
The concept of subgaussianity is used in PAC-Bayes method as follows. Suppose we have a
i.i.d.
collection of functions F from X to R and an iid sample Xn ∼ PX . One of the main questions of
empirical process theory and uniform convergence is to get a high-probability bound on
1X
n
sup E[f(X)] − Ên [f(X)] , Ên [f(X)] ≜ f(Xi ) .
f∈F n
i=1
Suppose that we know each f is taking values in [0, 1]. Then (E −Ên )f is 4n
1
-subgaussian and
applying PAC-Bayes inequality to functions λ(E −Ên )f(X) we get that with probability ≥ 1 − δ
for any ρ on F we have
λ 1 1
Ef∼ρ [(E −Ên )f(X)] ≤ + D(ρkπ ) + ln ,
8n λ δ
where π is a fixed prior. This method can be used to get interesting bounds for countably-infinite
collections F (see Exercise I.55 and I.56). However, the real power of this method shows when
F is uncountable (as in the previous example for Gaussian norms).
We remark that bounding the supremum of a random process (e.g., empirical or Gaussian pro-
cess) indexed by a continuous parameter is a vast subject [139, 429, 431]. The usual method is
based on discretization and approximation (with more advanced version known as chaining; see
(27.22) and Exercise V.28 for the counterpart of Gaussian processes). The PAC-Bayes inequality
offers an alternative which often allows for sharper results and shorter proofs. There are also appli-
cations to high-dimensional probability (the small-ball probability and random matrix theory), see
[317, 309]. In those works, PAC-Bayes is applied with π being the uniform distribution on a small
ball.
i i
i i
i i
4.8* PAC-Bayes 87
q
O(M Cn ), see [239]. In fact they show that any set {F(ρ) : G(ρ) ≤ C} satisfies this bound as long
as G is strongly convex. Recall that ρ 7→ D(ρkπ ) is indeed strongly convex by Exercise I.37.
So how does one select a good estimate θ̂ given training sample Xn ? Here are two famous options:
However, many other choices exist and are invented daily for specific problems.
The main issue that is to be addressed by theory is the following: the choice θ̂ is guided by
the sample Xn and thus the value L̂n (θ̂) is probably not representative of the value L(θ̂) that the
estimator will attain on a fresh sample Xnew because of overfitting to the training sample. To gauge
the amount of overfitting one seeks to prove an estimate of the form:
L(θ̂) ≤ L̂n (θ̂) + small error terms
i i
i i
i i
88
with high probability over the sample Xn . Note that here θ̂ can be either deterministic (as in ERM)
or a randomized (as in Gibbs) function of the training sample. In either case, it is convenient to
think about the estimator as a θ drawn from a data-dependent distribution ρ, in other words, a
channel from Xn to θ, so that we always understand L(θ̂) as Eρ [L(θ)] and L̂n (θ̂) as Eρ [L̂n (θ)]. Note
that in either case the values L(θ̂) and L̂n (θ̂) are random quantities depending on the sample Xn .
A specific kind of bounds we will show is going to be a high-probability bound that holds
uniformly over all data-dependent ρ, specifically
h i
P ∀ρ : Eθ∼ρ [L(θ)] ≤ Eθ∼ρ [L̂n (θ)] + excess risk(ρ) ≥ 1 − δ,
for some excess risk depending on (n, ρ, δ). We emphasize that here the probability is with respect
i.i.d.
to Xn ∼ P and the quantifier “∀ρ” is inside the probability. Having a uniform bound like this
suggests selecting that ρ which minimizes the right-hand side of the inequality, thus making the
second term serve as a regularization term preventing overfitting.
The main theorem of this section is the following version of the generalization error bound of
McAllester [298]. Many other similar bounds exist, for example see Exercise I.54.
Theorem 4.16 Fix a reference distribution π on Θ and suppose that for all θ, x the loss ℓθ (x) ∈
[0, 1]. Then for any δ ≤ e−1
s
5 D(ρkπ ) + ln δ1 1
P ∀ρ : Eθ∼ρ [L(θ)] ≤ Eθ∼ρ [L̂n (θ)] + +√ ≥ 1 − δ. (4.37)
4 2n 10n
The same result holds if (instead of being bounded) ℓθ (X) is 14 -subgaussian for each θ.
Before proving the theorem, let us consider a finite class Θ and argue that
" r #
1 M
P ∀ρ : Eθ∼ρ [L(θ)] ≤ Eθ∼ρ [L̂n (θ)] + ln ≥ 1 − δ, (4.38)
2n δ
Indeed, by linearity, it suffices to restrict to point mass distributions ρ = δθ . For each θ the ran-
dom variable L̂n (θ) − L(θ) is zero-mean and 4n 1
-subgaussian (Hoeffding’s lemma). Thus, applying
Markov’s inequality to eλ(L(θ)−L̂n (θ)) we have for any λ > 0 and t > 0:
λ2
P[L(θ) − L̂n (θ) ≥ t] ≤ e 8n −λt .
Thus, setting t so that the right-hand side equals Mδ from the union bound we obtain that with
probability ≥ 1 − δ simultaneously for all θ we have
λ 1 M
L(θ) − L̂n (θ) ≤ + ln .
8n λ δ
q
Optimizing λ = 8n ln Mδ yields (4.38). On the other hand, if we apply (4.37) with π = Unif(Θ)
and observe that D(ρkπ ) ≤ log M, we recover (4.38) with only a slightly worse estimate of excess
risk.
i i
i i
i i
4.8* PAC-Bayes 89
We can see that just like in the previous subsection, the core problem in showing Theorem 4.16
is that union bound only applies to finite Θ and we need to work around that problem by leveraging
the PAC-Bayes inequality.
Proof. First, we fix λ and apply PAC-Bayes inequality to functions fθ (Xn ) = λ(L(θ) − L̂n (θ)).
2
By Hoeffding’s lemma we know L̂n (θ) is 8n1
-subgaussian and thus ψ(θ) ≤ λ8n . Thus, we have with
probability ≥ 1 − δ simultaneously for all ρ:
λ D(ρkπ ) + ln δ1
Eρ [L(θ) − L̂n (θ)] ≤ + (4.39)
8n λ
Let us denote for convenience b(ρ) = D(ρkπ ) + ln δ1 . Since
p δ < e−1 , we see that b(ρ) ≥ 1. We
would like to optimize λ in (4.39) by setting λ = λ∗ ≜ 8nb √
(ρ). However, of course, λ cannot
depend on ρ. Thus, instead we select a countable grid λi = 2n2i , i ≥ 1 and apply PAC-Bayes
inequality separately for each λi with probability chosen to be δi = δ 2−i . Then from the union
bound we have for all ρ and all i ≥ 1 simultaneously:
λi b(ρ) + i ln 2
Eρ [L(θ) − L̂n (θ)] ≤ + .
8n λ
Let
√
i∗ = i∗p
(ρ) be chosen so that λ∗ (ρ) ≤ λi∗ < 2λ∗ (ρ). From the latter inequality we have
i∗
2n2 < 2 8b(ρ)n and thus
1 1
i∗ ln 2 < ln 4 + ln(b(ρ)) ≤ ln 4 − 1/2 + b(ρ) .
2 2
Therefore, choosing i = i∗ in the bound, upper-bounding λi∗ ≤ 2λ∗ and λ1i∗ ≤ 1
λ∗ we get that for
all ρ
r
b(ρ) 3 ln 4 − 1/2
Eρ [L(θ) − L̂n (θ)] ≤ (2 + ) + p .
8n 2 8nb(ρ)
Finally, bounding the last term by √1
10n
(since b ≥ 1), we obtain the theorem.
We remark that although we proved an optimized (over λ) bound, the intermediate result (4.39)
is also quite important. Indeed, it suggests choosing ρ (the randomized estimator) based on min-
imizing the regularized empirical risk Eρ L̂n (θ) + λ1 D(ρkπ ). The minimizing ρ is just the Gibbs
sampler (4.36) due to Proposition 4.7, which justifies its popularity.
PAC-Bayes bounds are often criticized on the following grounds. Suppose we take a neural
network and train it (perhaps by using a version of gradient descent) until it finds some choice of
weight matrices θ̂ that results in an acceptably low value of L̂n (θ̂). We would like now to apply
PAC-Bayes Theorem 4.16 to convince ourselves that also the test loss L(θ̂) would be small. But
notice that the weights of the neural network are non-random, that is ρ = δθ̂ and hence for any
continuous prior π we will have D(ρkπ ) = ∞, resulting in a vacuous bound. For a while this was
considered to be an unavoidable limitation, until an elegant work of [476]. There authors argue
that in the end weights of neural networks are stored as finite-bit approximations (floating point)
and we can use π (θ) = 2−length(θ) as a prior. Here length(θ) represents the total number of bits in
a compressed representation of θ. As we will learn in Part II this indeed defines a valid probability
i i
i i
i i
90
distribution (for any choice of the lossless compressor). In this way, the idea of [476] bridges
the area of data compression and generalization bounds: if the trained neural network has highly
compressible θ (e.g. has many zero weights) then it has smaller excess risk and thus is less prone
to overfitting.
Before closing this section, let us also apply the “in expectation” version of the PAC-
Bayes (4.34). Namely, again suppose that losses ℓθ (x) are in [0, 1] and suppose the learning
algorithm (given Xn ) selects ρ and then samples θ̂ ∼ ρ. This creates a joint distribution Pθ̂,Xn .
From (4.34), as in the preceding proof, for every λ > 0 we get
λ
E[L(θ̂) − L̂n (θ̂)] ≤ + λI(θ̂; Xn ) .
8n
Optimizing over λ we obtain the bound
r
1
E[L(θ̂) − L̂n (θ̂)] ≤ I(θ̂; Xn ) .
2n
This version of McAllester’s result [369, 461] provides a useful intuition: the algorithm’s propen-
sity to overfit can be gauged by the amount of information leaking from Xn into θ̂. For applications,
though, a version with a flexible reference prior π, as in Theorem 4.16, appears more convenient.
i i
i i
i i
• Information projection (or I-projection): Given Q minimize D(PkQ) over convex class of P’s.
(See Chapter 15.)
• Maximum likelihood: Given P minimize D(PkQ) over some class of Q. (See Section 29.3.)
• Rate-Distortion: Given PX minimize I(X; Y) over a convex class of PY|X . (See Chapter 26.)
• Capacity: Given PY|X maximize I(X; Y) over a convex class of PX . (This chapter.)
In this chapter we show that all these problems have convex/concave objective functions,
discuss iterative algorithms for solving them, and study the capacity problem in more detail.
Specifically, we will find that the supremum over input distributions PX can also be written as infi-
mum over the output distributions PY and the resulting minimax problem has a saddle point. This
will lead to understanding of capacity as information radius of a set of conditional distributions
{PY|X=x , x ∈ X } measured in KL divergence.
Remark 5.1 The proof shows that for an arbitrary measure of similarity D(PkQ), the con-
vexity of (P, Q) 7→ D(PkQ) is equivalent to “conditioning increases divergence” property of D.
Convexity can also be understood as “mixing decreases divergence”.
91
i i
i i
i i
92
Remark 5.2 (Strict and strong convexity) There are a number of alternative arguments
possible. For example, (p, q) 7→ p log pq is convex on R2+ , which is a manifestation of a general
phenomenon: for a convex f(·) the perspective function (p, q) 7→ qf pq is convex too. Yet another
way is to invoke the Donsker-Varadhan variational representation Theorem 4.6 and notice that
supremum of convex functions is convex. Our proof, however, allows us to immediately notice
that the map (P, Q) 7→ D(PkQ) is not strictly convex. Indeed, the gap in the DPI that we used
in the proof is equal to D(PX|Y kQX|Y |PY ), which can be zero. For example, this happens if P0 , Q0
have common support, which is disjoint from the common support of P1 , Q1 . At the same time
the map P 7→ D(PkQ), whose convexity was so crucial in the previous Chapter, turns out to not
only be strictly convex but in fact strongly convex with respect to total variation, cf. Exercise I.37.
This strong convexity is crucial for the analysis of mirror descent algorithm, which is a first-order
method for optimization over probability measures (see [40, Examples 9.10 and 5.27].)
Theorem 5.2 The map PX 7→ H(X) is concave. Furthermore, if PY|X is any channel, then
PX 7→ H(X|Y) is concave. If X is finite, then PX 7→ H(X|Y) is continuous.
Proof. For the special case of the first claim, when PX is on a finite alphabet, the proof is complete
by H(X) = log |X | − D(PX kUX ). More generally, we prove the second claim as follows. Let
f(PX ) = H(X|Y). Introduce a random variable U ∼ Ber(λ) and define the transformation
P0 U=0
PX|U =
P1 U=1
Consider the probability space U → X → Y. Then we have f(λP1 + (1 − λ)P0 ) = H(X|Y) and
λf(P1 ) + (1 − λ)f(P0 ) = H(X|Y, U). Since H(X|Y, U) ≤ H(X|Y), the proof is complete. Continuity
follows from Proposition 4.13.
Recall that I(X; Y) is a function of PX,Y , or equivalently, (PX , PY|X ). Denote I(PX , PY|X ) =
I(X; Y).
Proof. There are several ways to prove the first statement, all having their merits.
• First proof : Introduce θ ∈ Ber(λ). Define PX|θ=0 = P0X and PX|θ=1 = P1X . Then θ → X → Y.
Then PX = λ̄P0X + λP1X . I(X; Y) = I(X, θ; Y) = I(θ; Y) + I(X; Y|θ) ≥ I(X; Y|θ), which is our
desired I(λ̄P0X + λP1X , PY|X ) ≥ λ̄I(P0X , PY|X ) + λI(P0X , PY|X ).
• Second proof : I(X; Y) = minQ D(PY|X kQ|PX ), which is a pointwise minimum of affine functions
in PX and hence concave.
i i
i i
i i
• Third proof : Pick a Q and use the golden formula: I(X; Y) = D(PY|X kQ|PX ) − D(PY kQ), where
PX 7→ D(PY kQ) is convex, as the composition of the PX 7→ PY (affine) and PY 7→ D(PY kQ)
(convex).
The argument PY is a linear function of PY|X and thus the statement follows from convexity of D
in the pair.
Suppose we have a bivariate function f. Then we always have the minimax inequality:
inf sup f(x, y) ≥ sup inf f(x, y).
y x x y
1 It turns out minimax equality is implied by the existence of a saddle point (x∗ , y∗ ),
i.e.,
f ( x, y∗ ) ≤ f ( x∗ , y∗ ) ≤ f ( x∗ , y) ∀ x, y
Furthermore, minimax equality also implies existence of saddle point if inf and sup
are achieved c.f. [49, Section 2.6]) for all x, y [Straightforward to check. See proof
of corollary below].
2 There are a number of known criteria establishing
inf sup f(x, y) = sup inf f(x, y)
y x x y
i i
i i
i i
94
Theorem 5.4 (Saddle point) Let P be a convex set of distributions on X . Suppose there
exists P∗X ∈ P , called a capacity-achieving input distribution, such that
Let P∗Y = PY|X ◦ P∗X , called a capacity-achieving output distribution. Then for all PX ∈ P and for
all QY , we have
D(PY|X kP∗Y |PX ) ≤ D(PY|X kP∗Y |P∗X ) ≤ D(PY|X kQY |P∗X ). (5.1)
Proof. Right inequality in (5.1) follows from C = I(P∗X , PY|X ) = minQY D(PY|X kQY |P∗X ), where
the latter is (4.5).
The left inequality in (5.1) is trivial when C = ∞. So assume that C < ∞, and hence
I(PX , PY|X ) < ∞ for all PX ∈ P . Let PXλ = λPX + λP∗X ∈ P and PYλ = PY|X ◦ PXλ . Clearly,
PYλ = λPY + λP∗Y , where PY = PY|X ◦ PX .
We have the following chain then:
where inequality is by the right part of (5.1) (already shown). Thus, subtracting λ̄C and dividing
by λ we get
and the proof is completed by taking lim infλ→0 and applying the lower semincontinuity of
divergence (Theorem 4.9).
Corollary 5.5 In addition to the assumptions of Theorem 5.4, suppose C < ∞. Then the
capacity-achieving output distribution P∗Y is unique. It satisfies the property that for any PY induced
by some PX ∈ P (i.e. PY = PY|X ◦ PX ) we have
i i
i i
i i
Statement (5.2) follows from the left inequality in (5.1) and “conditioning increases divergence”
property in Theorem 2.16.
Remark 5.3 • The finiteness of C is necessary for Corollary 5.5 to hold. For a counterexample,
consider the identity channel Y = X, where X takes values on integers. Then any distribution
with infinite entropy is a capacity-achieving input (and output) distribution.
• Unlike the output distribution, capacity-achieving input distribution need not be unique. For
example, consider Y1 = X1 ⊕ Z1 and Y2 = X2 where Z1 ∼ Ber( 12 ) is independent of X1 . Then
maxPX1 X2 I(X1 , X2 ; Y1 , Y2 ) = log 2, achieved by PX1 X2 = Ber(p) × Ber( 21 ) for any p. Note that
the capacity-achieving output distribution is unique: P∗Y1 Y2 = Ber( 12 ) × Ber( 21 ).
Proof. This follows from the standard property of saddle points: Maximizing/minimizing the
leftmost/rightmost sides of (5.1) gives
min sup D(PY|X kQY |PX ) ≤ max D(PY|X kP∗Y |PX ) = D(PY|X kP∗Y |P∗X )
QY PX ∈P PX ∈P
but by definition min max ≥ max min. Note that we were careful to only use max and min for the
cases where we know the optimum is achievable.
i i
i i
i i
96
1 Radius (aka Chebyshev radius) of A: the radius of the smallest ball that covers A,
i.e.,
rad(A) = inf sup d(x, y). (5.3)
y∈X x∈A
2 Diameter of A:
diam(A) = sup d(x, y). (5.4)
x, y∈ A
Note that the radius and the diameter both measure the massiveness/richness of a
set.
3 From definition and triangle inequality we have
1
diam(A) ≤ rad(A) ≤ diam(A). (5.5)
2
The lower and upper bounds are achieved when A is, for example, a Euclidean ball
and the Hamming space, respectively.
4 In many special cases, the upper bound in (5.5) can be improved:
• A result of Bohnenblust [67] shows that in Rn equipped with any norm we always
have rad(A) ≤ n+n 1 diam(A).
q
• For Rn with Euclidean distance Jung proved rad(A) ≤ n
2(n+1) diam(A),
attained by simplex. The best constant is sometimes called the Jung constant
of the space.
• For Rn with ℓ∞ -norm the situation is even simpler: rad(A) = 21 diam(A); such
spaces are called centrable.
Corollary 5.7 For any finite X and any kernel PY|X , the maximal mutual information over all
distributions PX on X satisfies
i i
i i
i i
The last corollary gives a geometric interpretation to capacity: It equals the radius of the smallest
divergence-“ball” that encompasses all distributions {PY|X=x : x ∈ X }. Moreover, the optimal
center P∗Y is a convex combination of some PY|X=x and is equidistant to those.
The following is the information-theoretic version of “radius ≤ diameter” (in KL divergence)
for arbitrary input space (see Theorem 32.4 for a related representation):
can be (a) interpreted as a saddle point; (b) written in the minimax form; and (c) that the capacity-
achieving output distribution P∗Y is unique. This was all done under the extra assumption that the
supremum over PX is attainable. It turns out, properties b) and c) can be shown without that extra
assumption.
Theorem 5.9 (Kemperman [243]) For any PY|X and a convex set of distributions P such
that
C = sup I(PX , PY|X ) < ∞, (5.6)
PX ∈P
Furthermore,
C = sup min D(PY|X kQY |PX ) (5.8)
PX ∈P QY
= min sup D(PY|X kQY |PX ) (5.9)
QY PX ∈P
i i
i i
i i
98
Note that Condition (5.6) is automatically satisfied if there exists a QY such that
C= sup I ( X ; X + Z) . (5.12)
E[X]=0,E[X2 ]=P
PX :
E[X4 ]=s
Without the constraint E[X4 ] = s, the capacity is uniquely achieved at the input distribution PX =
N (0, P); see Theorem 5.11. When s 6= 3P2 , such PX is no longer feasible. However, for s > 3P2
the maximum
1
C= log(1 + P)
2
is still attainable. Indeed, we can add a small “bump” to the Gaussian distribution as follows:
where p → 0 and x → ∞ such that px2 → 0 but px4 → s − 3P2 > 0. This shows that for the
problem (5.12) with s > 3P2 , the capacity-achieving input distribution does not exist, but the
capacity-achieving output distribution P∗Y = N (0, 1 + P) exists and is unique as Theorem 5.9
shows.
Proof of Theorem 5.9. Let P′Xn be a sequence of input distributions achieving C, i.e.,
I(P′Xn , PY|X ) → C. Let Pn be the convex hull of {P′X1 , . . . , P′Xn }. Since Pn is a finite-dimensional
simplex, the (concave) function PX 7→ I(PX , PY|X ) is continuous (Proposition 4.13) and attains its
maximum at some point PXn ∈ Pn , i.e.,
where in (5.14) we applied Theorem 5.4 to (Pn+k , PYn+k ). The crucial idea is to apply comparison
of KL divergence (which is not a distance) with a true distance known as total variation defined
in (7.3) below. Such comparisons are going to be the topic of Chapter 7. Here we assume for
granted validity of Pinsker’s inequality (see Theorem 7.10). According to that inequality and since
In % C, we conclude that the sequence PYn is Cauchy in total variation:
i i
i i
i i
Since the space of all probability distributions on a fixed alphabet is complete in total variation,
the sequence must have a limit point PYn → P∗Y . Convergence in TV implies weak convergence,
and thus by taking a limit as k → ∞ in (5.15) and applying the lower semicontinuity of divergence
(Theorem 4.9) we get
and therefore, PYn → P∗Y in the (stronger) sense of D(PYn kP∗Y ) → 0. By Theorem 4.1,
To prove that (5.18) holds for arbitrary PX ∈ P , we may repeat the argument above with Pn
replaced by P̃n = conv({PX } ∪ Pn ), denoting the resulting sequences by P̃Xn , P̃Yn and the limit
point by P̃∗Y , and obtain:
where (5.20) follows from (5.18) since PXn ∈ P̃n . Hence taking limit as n → ∞ we have P̃∗Y = P∗Y
and therefore (5.18) holds.
To see the uniqueness of P∗Y , assuming there exists Q∗Y that fulfills C = supPX ∈P D(PY|X kQ∗Y |PX ),
we show Q∗Y = P∗Y . Indeed,
C ≥ D(PY|X kQ∗Y |PXn ) = D(PY|X kPYn |PXn ) + D(PYn kQ∗Y ) = In + D(PYn kQ∗Y ).
Since In → C, we have D(PYn kQ∗Y ) → 0. Since we have already shown that D(PYn kP∗Y ) → 0,
we conclude P∗Y = Q∗Y (this can be seen, for example, from Pinsker’s inequality and the triangle
inequality TV(P∗Y , Q∗Y ) ≤ TV(PYn , Q∗Y ) + TV(PYn , P∗Y ) → 0).
Finally, to see (5.9), note that by definition capacity as a max-min is at most the min-max, i.e.,
C = sup min D(PY|X kQY |PX ) ≤ min sup D(PY|X kQY |PX ) ≤ sup D(PY|X kP∗Y |PX ) = C
PX ∈P QY QY PX ∈P PX ∈P
and the optimizer Q∗X exists and is unique. If Q∗X ∈ P , then it is also the unique maximizer of H(X).
i i
i i
i i
100
in Corollary 5.10. Distributions of this form are known as Gibbs distributions for the energy func-
tion f. This bound is often tight and achieved by PX (n) = Z(λ∗ )−1 exp{−λ∗ f(n)} with λ∗ being
the minimizer, see Exercise III.27. (Note that Proposition 4.7 discusses Lagrangian version of the
same problem.)
1. “Gaussian capacity”:
1 σ2
C = I(Xg ; Xg + Ng ) = log 1 + X2
2 σN
2. “Gaussian input is the best for Gaussian noise”: For all X ⊥
⊥ Ng and Var X ≤ σX2 ,
I(Xg ; Xg + N) ≥ I(Xg ; Xg + Ng ),
d
with equality iff N=Ng and independent of Xg .
This result encodes extremality properties of the normal distribution: for the AWGN channel,
Gaussian input is the most favorable (attains the maximum mutual information, or capacity), while
for a general additive noise channel the least favorable noise is Gaussian. For a vector version of
the former statement see Exercise I.9.
i i
i i
i i
Proof. WLOG, assume all random variables have zero mean. Let Yg = Xg + Ng . Define
1 σ 2 log e x2 − σX2
f(x) ≜ D(PYg |Xg =x kPYg ) = D(N (x, σN2 )kN (0, σX2 + σN2 )) = log 1 + X2 +
2 σN 2 σX2 + σN2
| {z }
=C
3. Let Y = Xg + N and let PY|Xg be the respective kernel. Note that here we only assume that N is
uncorrelated with Xg , i.e., E [NXg ] = 0, not necessarily independent. Then
dPXg |Yg (Xg |Y)
I(Xg ; Xg + N) ≥ E log (5.23)
dPXg (Xg )
dPYg |Xg (Y|Xg )
= E log (5.24)
dPYg (Y)
log e h Y2 N2 i
=C+ E 2 2
− 2 (5.25)
2 σX + σN σN
log e σX 2 EN2
=C+ 1 − (5.26)
2 σX2 + σN2 σN2
≥ C, (5.27)
where
• (5.23): follows from (4.7),
dPX |Y dPY |X
• (5.24): dPgX g = dPgY g
g g
i i
i i
i i
102
Note that there is a steady improvement at each step (the value F(sk , tk ) is decreasing), so it
can be often proven that the algorithm converges to a local minimum, or even a global minimum
under appropriate conditions (e.g. the convexity of f). Below we discuss several applications of
this idea, and refer to [113] for proofs of convergence. In general, this class of iterative meth-
ods for maximizing and minimizing mutual information are called Blahut-Arimoto algorithms for
their original discoverers [24, 62]. Unlike gradient ascent/descent that proceeds by small (“local”)
changes of the decision variable, algorithms in this section move by large (“global”) jumps and
hence converge much faster.
The basis of all these algorithms is the Gibbs variational principle (Proposition 4.7): for
any function c : Y → R and any QY on Y , under the integrability condition Z =
R
QY (dy) exp{−c(y)} < ∞, the minimum
is attained at P∗Y (dy) = Z1 QY (dy) exp{−c(y)}. For simplicity below we mostly consider the case
of discrete alphabets X , Y .
Maximizing mutual information (capacity). We have a fixed PY|X and the optimization
problem
QX|Y
C = max I(X; Y) = max max EPX,Y log ,
PX PX QX|Y PX
i i
i i
i i
where in the second equality we invoked (4.7). This results in the iterations:
1
QX|Y (x|y) ← PX (x)PY|X (y|x)
Z(y)
( )
1 X
′
PX (x) ← Q (x) ≜ exp PY|X (y|x) log QX|Y (x|y) ,
Z y
where Z(y) and Z are normalization constants. To derive this, notice that for a fixed PX the optimal
QX|Y = PX|Y . For a fixed QX|Y , we can see that
QX|Y
EPX,Y log = log Z − D(PX kQ′ ) ,
PX
and thus the optimal PX = Q′ .
Denoting Pn to be the value of PX at the nth iteration, we observe that
This is useful since at every iteration not only we get an estimate of the optimizer Pn , but also the
gap to optimality C − I(Pn , PY|X ) ≤ C − RHS. It can be shown, furthermore, that both RHS and
LHS in (5.29) monotonically converge to C as n → ∞, see [113] for details.
R = min I(X; Y) + E[d(X, Y)] = min D(PY|X kQY |PX ) + E[d(X, Y)] , (5.30)
P Y| X PY|X ,QY
where in the second equality we invoked (4.5). This minimization problem is the basis of lossy
compression and will be discussed extensively in Part V. Using (5.28) we derive the iterations:
1
PY|X (y|x) ← QY (y) exp{−d(x, y)}
Z ( x)
QY ← PY|X ◦ PX .
A sandwich bound similar to (5.29) holds here, see (5.32), so that one gets two computable
sequences converging to R from above and below, as well as PY|X converging to the argmin
in (5.30).
i i
i i
i i
104
where QX|Y is a given channel. This is a problem arising in the maximum likelihood estimation for
Pn
mixture models where QY is the unknown mixing distribution and PX = 1n i=1 δxi is the empirical
distribution of the sample (x1 , . . . , xn ). To derive an iterative algorithm for (5.31), we write
min D(PX kQX ) = min min D(PX,Y kQX,Y ) .
QY QY PY|X
dQ ( x| y)
(Note that taking d(x, y) = − log dPX|XY (x) shows that this problem is equivalent to (5.30).) By the
chain rule, thus, we find the iterations
1
PY|X ← QY (y)QX|Y (x|y)
Z ( x)
QY ← PY|X ◦ PX .
(n) (n) (n)
Denote by QY the value of QY at the nth iteration and QX = QX|Y ◦ QY . Notice that for any n
and all QY we have from Jensen’s inequality,
" #
( n) dQX|Y (n)
D(PX kQX ) − D(PX kQX ) = EX∼PX log EY∼QY (n)
≤ gap(QX ) ,
dQX
dQX|Y=y
where gap(QX ) ≜ log esssupy EX∼PX [ dQX ]. In all, we get the following sandwich bound:
(n) (n) (n)
D(PX kQX ) − gap(QX ) ≤ L ≤ D(PX kQX ) , (5.32)
and it can be shown that as n → ∞ both sides converge to L, see e.g. [112, Theorem 5.3].
EM algorithm (general case). The EM algorithm is also applicable more broadly than (5.31),
(θ) (θ)
in which the quantity QX|Y is fixed. In general, we consider the model where both QY and QX|Y
depend on the unknown parameter θ and the goal (see Section 29.3) is to maximize the total log
Pn (θ)
likelihood i=1 log QX (xi ) over θ. A canonical example (which was one of the original motiva-
(θ) Pk
tions for the EM algorithm) is the k-component Gaussian mixture QX = j=1 wj N ( μj , 1); in
other words, QY = (w1 , . . . , wk ), QX|Y=j = N ( μj , 1) and θ = (w1 , . . . , wk , μ1 , . . . , μk ). If the cen-
ters μj ’s are known and only the weights wj ’s are to be estimated, then we get the simple convex
case in (5.31). Otherwise, we need to jointly optimize the log-likelihood over the centers and the
weights, which is a non-convex problem.
Here, one way to approach the problem is to apply the ELBO (4.16) as follows:
" (θ)
#
(θ) QX,Y (xi , Y)
log QX (xi ) = sup EY∼P log .
P P(Y)
Thus the maximum likelihood can be written as a double maximization problem
X (θ)
sup log QX (xi ) = sup sup F(θ, PY|X ) ,
θ i θ PY|X
where
" #
X (θ)
QX,Y (xi , Y)
F(θ, PY|X ) = EY∼PY|X=xi log .
PY|X (Y|xi )
i
i i
i i
i i
Thus, the iterative algorithm is to start with some θ and update according to
(θ)
PY|X ← QY|X E-step
θ ← argmax F(θ, PY|X ) M-step. (5.33)
θ
In general, if the log likelihood function is non-convex in θ, EM iterations may not converge to
the global optimum even with infinite sample size (see [234] for an example for 3-component
(θ)
Gaussian mixtures). Furthermore, for certain problems in the E-step QY|X may be intractable to
compute. In those cases one performs approximate version of EM where the step maxPY|X F(θ, PY|X )
is solved over a restricted class of distributions, cf. Examples 4.1 and 4.2.
Sinkhorn’s algorithm. This algorithm [388] is very similar, but not exactly the same as the ones
above. We fix QX,Y , two marginals VX , VY and solve the problem
S = min{D(PX,Y kQX,Y ) : PX = VX , PY = VY )} .
From the results of Chapter 15 (see Theorem 15.16 and Example 15.2) it is clear that the optimal
distribution PX,Y is given by
P∗X,Y = A(x)QX,Y (x, y)B(y) ,
for some A, B ≥ 0. In order to find functions A, B we notice that under a fixed B the value of A
that makes PX = VX is given by
VX (x)QX,Y (x, y)B(y)
A ( x) ← P .
y QX,Y (x, y)B(y)
i i
i i
i i
In this chapter we start with explaining the important property of mutual information known
as tensorization (or single-letterization ), which allows one to maximize and minimize mutual
information between two high-dimensional vectors. Next, we extend the information measures
discussed in previous chapters for random variables to random processes by introducing the con-
cepts of entropy rate (for a stochastic process) and mutual information rate (for a pair of stochastic
processes). For the former, it is shown that two stochastic processes that can be coupled well
(i.e., have small Ornstein’s distance) have close entropy rates – a fact to be used later in the
discussion of ergodicity (see Section 12.5*). For the latter we give a simple expression for the
information rate between a pair of stationary Gaussian processes in terms of their joint spectral
density. This expression will be crucial much later, when we study Gaussian channels with colored
noise (Section 20.6*).
106
i i
i i
i i
Q
with equality iff PXn |Y = PXi |Y PY -almost surely.1 Consequently,
X
n
min I(Xn ; Yn ) = min I(Xi ; Yi ).
PYn |Xn P Yi | X i
i=1
P Q Q
Proof. (1) Use I(Xn ; Yn ) − I(Xi ; Yi ) = D(PYn |Xn k PYi |Xi |PXn ) − D(PYn k PYi )
P Q Q
(2) Reverse the role of X and Y: I(Xn ; Y) − I(Xi ; Y) = D(PXn |Y k PXi |Y |PY ) − D(PXn k PXi )
1 For a product channel, the input maximizing the mutual information is a product distribution.
2 For a product source, the channel minimizing the mutual information is a product channel.
Example 6.1 Let us complement Theorem 6.1 with the following examples.
1 ∏n
That is, if PXn ,Y = PY i=1 PXi |Y as joint distributions.
i i
i i
i i
108
We start with the maximization of mutual information (capacity) question. In the notation
of Theorem 5.11 we know that (for Z ∼ N (0, 1))
1
max I(X; X + Z) = log 1 + σX2 .
PX :E[X2 ]≤σX2 2
Note that from tensorization we also immediately get (for Zn ∼ N (0, In ))
n
max I(Xn ; Xn + Zn ) = log 1 + σX2 .
PXn :E[∥X ∥ ]≤nσX
n 2 2 2
Thus, the traditional way of solving n-dimensional problems is to solve a 1-dimensional version
by explicit (typically calculus of variations) computation and then apply tensorization. However,
it turns out that sometimes directly solving the n-dimensional problem is magically easier and that
is what we want to show in this section.
So, suppose that we are trying to directly solve
$$\max_{\mathbb{E}[\sum_k X_k^2]\le n\sigma_X^2} I(X^n; X^n+Z^n)$$
over the joint distribution $P_{X^n}$. By the tensorization property in Theorem 6.1(a) we get
$$\max_{\mathbb{E}[\sum_k X_k^2]\le n\sigma_X^2} I(X^n; X^n+Z^n) = \max_{\mathbb{E}[\sum_k X_k^2]\le n\sigma_X^2} \sum_{k=1}^n I(X_k; X_k+Z_k).$$
Given distributions $P_{X_1},\dots,P_{X_n}$ satisfying the constraint, form the "average of marginals" distribution $\bar P_X = \frac1n\sum_{k=1}^n P_{X_k}$, which also satisfies the single-letter constraint $\mathbb{E}[X^2] = \frac1n\sum_{k=1}^n\mathbb{E}[X_k^2]\le\sigma_X^2$. Then from the concavity in $P_X$ of $I(P_X, P_{Y|X})$,
$$I(\bar P_X, P_{Y|X}) \ge \frac1n\sum_{k=1}^n I(P_{X_k}, P_{Y|X}).$$
So $\bar P_X$ gives the same or better mutual information, which shows that the extremization above grows linearly with $n$.
Similarly to the “average of marginals” argument above, averaging over all orthogonal rotations U
of Xn can only make the mutual information larger. Therefore, the optimal input distribution PXn
can be chosen to be invariant under orthogonal transformations. Consequently, by Theorem 5.9,
the (unique!) capacity achieving output distribution P∗Yn must be rotationally invariant. Further-
more, from the conditions for equality in (6.1) we conclude that P∗Yn must have independent
components. Since the only product distribution satisfying the power constraints and having rota-
tional symmetry is an isotropic Gaussian, we conclude that PYn = (P∗Y )⊗n and P∗Y = N (0, 1 + σX2 ).
In turn, the only distribution PX such that PX+Z = P∗Y is PX = N (0, σX2 ) (this can be argued by
considering characteristic functions).
The last part of Theorem 5.11 can also be handled similarly. That is, we can show that the
minimizer in
$$\min_{P_N:\ \mathbb{E}[N^2]=1} I(X_G; X_G+N)$$
is necessarily Gaussian by going to a multidimensional problem and averaging over all orthogonal
rotations.
The idea of “going up in dimension” (i.e. solving an n = 1 problem by going to an n > 1
problem first) as presented here is from [333] and only re-derives something that we have already
shown directly in Theorem 5.11. But the idea can also be employed for solving various non-convex
differential entropy maximization problems, cf. [184].
A sufficient condition for the entropy rate to exist is stationarity, which essentially means invariance with respect to time shifts. Formally, $\mathbf{X}$ is stationary if $(X_{t_1},\ldots,X_{t_n}) \stackrel{d}{=} (X_{t_1+k},\ldots,X_{t_n+k})$ for any $t_1,\ldots,t_n,k\in\mathbb{N}$. This definition naturally extends to two-sided processes indexed by $\mathbb{Z}$.
Proof.
(a) Further conditioning + stationarity: $H(X_n|X^{n-1}) \le H(X_n|X_2^{n-1}) = H(X_{n-1}|X^{n-2})$.
(b) Using the chain rule: $\frac1n H(X^n) = \frac1n\sum H(X_i|X^{i-1}) \ge H(X_n|X^{n-1})$.
(c) $H(X^n) = H(X^{n-1}) + H(X_n|X^{n-1}) \le H(X^{n-1}) + \frac1n H(X^n)$.
(d) $n\mapsto\frac1n H(X^n)$ is a decreasing sequence and lower bounded by zero, hence has a limit $H(\mathbf{X})$. Moreover, by the chain rule, $\frac1n H(X^n) = \frac1n\sum_{i=1}^n H(X_i|X^{i-1})$. From here we claim that $H(X_n|X^{n-1})$ converges to the same limit $H(\mathbf{X})$. Indeed, from the monotonicity shown in part (a), $\lim_n H(X_n|X^{n-1}) = H'$ exists. Next, recall the following fact from calculus: if $a_n\to a$, then the Cesàro mean $\frac1n\sum_{i=1}^n a_i\to a$ as well. Thus, $H' = H(\mathbf{X})$.
(e) Assuming $H(X_1)<\infty$ we have from (4.30):
$$\lim_{n\to\infty} H(X_1) - H(X_1|X^0_{-n}) = \lim_{n\to\infty} I(X_1; X^0_{-n}) = I(X_1; X^0_{-\infty}) = H(X_1) - H(X_1|X^0_{-\infty}).$$
Example 6.2 (Stationary processes) Let us discuss some of the most standard examples
of stationary processes.
See Exercise I.31 for an example. This kind of process is what is called a first-order Markov process, since $X_n$ depends only on $X_{n-1}$. There is an extension of that idea, where a $k$-th order Markov process is defined by a kernel $P_{X_n|X^{n-1}_{n-k}}$. Shannon classically suggested that such a process is a good model for natural language (with sufficiently large $k$), and recent breakthroughs in large language models [320] have largely verified his vision.
Note that both of our characterizations of the entropy rate converge to the limit from above, and thus evaluating $H(X_n|X^{n-1})$ or $\frac1n H(X^n)$ for arbitrarily large $n$ does not give any guarantees on the true value of $H(\mathbf{X})$ beyond an upper bound (in particular, we cannot even rule out $H(\mathbf{X}) = 0$). However, for a certain class of stationary processes, widely used in speech and language modeling, we can have a sandwich bound.
Definition 6.4 (Hidden Markov model (HMM)) Given a stationary Markov chain
. . . , S−1 , S0 , S1 , . . . on state space S and a Markov kernel PX|S : S → X , we define HMM as
a stationary process $\ldots, X_{-1}, X_0, X_1, \ldots$ as follows. First a trajectory $S^\infty_{-\infty}$ is generated. Then,
conditionally on it, we generate each Xi ∼ PX|S=Si independently. In other words, X is just S but
observed over a stationary memoryless channel PX|S (called the emission channel).
One of the fundamental results in this area is due to Blackwell [60], who showed that the $\mathcal{P}(\mathcal{S})$-valued belief process $R_n = (R_{s,n}, s\in\mathcal{S})$ given by $R_{s,n}\triangleq\mathbb{P}[S_n = s\,|\,X^{n-1}_{-\infty}]$ is in fact a stationary first-order Markov process. The common law $\mu$ of $R_n$ (independent of $n$) is called the Blackwell measure. Although finding $\mu$ is very difficult even for the simplest processes (see the example below), we do have the following representation of the entropy rate in terms of $\mu$:
$$H(\mathbf{X}) = \int_{\mathcal{P}(\mathcal{S})}\mu(dr)\,\mathbb{E}_{s\sim r}[H(P_{X|S=s})].$$
That is, the entropy rate is an integral of the simple function $r\mapsto\sum_s r_s H(P_{X|S=s})$ over $\mu$.
Example 6.3 (Gilbert-Elliott HMM [187, 151]) This is an HMM with binary states and binary emissions. Let $\mathcal{S} = \{0,1\}$ and $\mathbb{P}[S_1\ne S_0|S_0] = \tau$, i.e. the transition matrix of the $S$-process is $\begin{pmatrix}1-\tau & \tau\\ \tau & 1-\tau\end{pmatrix}$. Set $X_i = \mathrm{BSC}_\delta(S_i)$. In this case the Blackwell measure $\mu$ is supported on $[\tau, 1-\tau]$ and is the law of the random variable $\mathbb{P}[S_1 = 1|X^0_{-\infty}]$, and the entropy rate can be expressed in terms of the binary entropy function $h$:
$$H(\mathbf{X}) = \int_0^1 \mu(dx)\, h(\delta\bar x + \bar\delta x),$$
where we remind $\bar x = 1-x$ etc. In fact, we can express integration over $\mu$ in terms of the limit $\int f\,d\mu = \lim_{n\to\infty} K^n f(1/2)$, where $K$ is the transition kernel of the belief process, which acts on functions $g:[0,1]\to\mathbb{R}$ as
$$Kg(x) = p(x)\, g\!\left(\frac{x\bar\tau\bar\delta + \bar x\tau\delta}{p(x)}\right) + \bar p(x)\, g\!\left(\frac{x\bar\tau\delta + \bar x\tau\bar\delta}{\bar p(x)}\right), \qquad p(x) = 1-\bar p(x) = \delta\bar x + \bar\delta x.$$
We can see that the belief process follows simple fractional-linear updates, but nevertheless the stationary measure $\mu$ is extremely complicated and can be either absolutely continuous or singular (fractal-like) [33, 32]. As such, understanding $H(\mathbf{X})$ as a function of $(\tau,\delta)$ is a major open problem in this area. We remark, however, that if instead of the BSC we used $X = \mathrm{BEC}_\delta(S)$ then the resulting entropy rate is much easier to compute, see Exercise I.32.
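A minimal numerical sketch of the limit $\lim_n K^n f(1/2)$ for the kernel above (entropy in bits; the particular values of $\tau$, $\delta$ and the number of iterations are illustrative assumptions): we push the point mass at $1/2$ through the two-branch kernel $n$ times and integrate $f(x) = h(\delta\bar x + \bar\delta x)$ against the resulting weighted support.

```python
import numpy as np

def binary_entropy(p):
    """Binary entropy h(p) in bits, with h(0)=h(1)=0."""
    p = np.clip(p, 1e-300, 1 - 1e-16)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def entropy_rate_gilbert_elliott(tau, delta, n_iter=18):
    """Approximate H(X) = lim_n K^n f(1/2) for the Gilbert-Elliott HMM."""
    pts = np.array([0.5])      # support points of delta_{1/2} pushed through K^n
    wts = np.array([1.0])      # their weights
    for _ in range(n_iter):
        p = delta * (1 - pts) + (1 - delta) * pts
        u1 = (pts * (1 - tau) * (1 - delta) + (1 - pts) * tau * delta) / p
        u2 = (pts * (1 - tau) * delta + (1 - pts) * tau * (1 - delta)) / (1 - p)
        pts = np.concatenate([u1, u2])
        wts = np.concatenate([wts * p, wts * (1 - p)])
    f = binary_entropy(delta * (1 - pts) + (1 - delta) * pts)
    return float(np.sum(wts * f))

print(entropy_rate_gilbert_elliott(tau=0.1, delta=0.2))   # entropy rate in bits
```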
Despite these complications, the entropy rate of an HMM has a nice property: it can be tightly sandwiched between a monotonically increasing and a monotonically decreasing sequence. As we remarked above, such a sandwich bound is not possible for general stationary processes.
Proof. The part about the upper bound we have already established. To show the lower bound,
notice that
$$H(\mathbf{X}) = H(X_n|X^{n-1}_{-\infty}) \ge H(X_n|X^{n-1}_{-\infty}, S_0) = H(X_n|X^{n-1}_1, S_0),$$
where in the last step we used the Markov property $X^0_{-\infty}\to S_0\to X^n_1$. Next, we show that $H(X_n|X^{n-1}_1, S_0)$ is increasing in $n$. Indeed,
$$H(X_{n+1}|X^n_1, S_0) = H(X_n|X^{n-1}_0, S_{-1}) \ge H(X_n|X^{n-1}_0, S_{-1}, S_0) = H(X_n|X^{n-1}_1, S_0),$$
where the first equality is by stationarity, the inequality is by adding conditioning (Theorem 1.4) and the last equality is due to the Markov property $(S_{-1}, X_0)\to S_0\to X^n_1$.
Finally, we show that
$$I(S_0; X^\infty_1) = \lim_{n\to\infty} I(S_0; X^n_1) \le H(S_0) < \infty,$$
and thus $I(S_0; X^n_1) - I(S_0; X^{n-1}_1)\to 0$. But we also have by the chain rule
$$I(S_0; X^n_1) - I(S_0; X^{n-1}_1) = I(S_0; X_n|X^{n-1}_1) = H(X_n|X^{n-1}_1) - H(X_n|X^{n-1}_1, S_0)\to 0.$$
Thus, we can see that the difference between the two sides of (6.4) vanishes with n.
$$\frac1n\sum_{j=1}^n \mathbb{P}[X_j\ne Y_j]\le\epsilon. \qquad (6.5)$$
(The minimal such ϵ over all possible couplings is called Ornstein’s distance between stochastic
processes.) For binary alphabet this quantity is known as the bit error rate, which is one of the
performance metrics we consider for reliable data transmission in Part IV (see Section 17.1 and
Section 19.6). Notice that if we define the Hamming distance as
$$d_H(x^n, y^n)\triangleq\sum_{j=1}^n 1\{x_j\ne y_j\} \qquad (6.6)$$
$$\delta = \frac1n\mathbb{E}[d_H(X^n, Y^n)] = \frac1n\sum_{j=1}^n\mathbb{P}[X_j\ne Y_j].$$
Proof. For each j ∈ [n], applying (3.18) to the Markov chain Xj → Yn → Yj yields
where we denoted M = |X |. Then, upper-bounding joint entropy by the sum of marginals, cf. (1.3),
and combining with (6.8), we get
$$H(X^n|Y^n) \le \sum_{j=1}^n H(X_j|Y^n) \qquad (6.9)$$
$$\le \sum_{j=1}^n F_M(\mathbb{P}[X_j = Y_j]) \qquad (6.10)$$
$$\le n F_M\!\left(\frac1n\sum_{j=1}^n\mathbb{P}[X_j = Y_j]\right) \qquad (6.11)$$
where in the last step we used the concavity of $F_M$ and Jensen's inequality.
Corollary 6.7 Consider two processes $\mathbf{X}$ and $\mathbf{Y}$ with entropy rates $H(\mathbf{X})$ and $H(\mathbf{Y})$. If $\mathbb{P}[X_j\ne Y_j]\le\epsilon$, then
$$H(\mathbf{X}) - H(\mathbf{Y}) \le F_M(1-\epsilon).$$
and apply (6.7). For the last statement just recall the expression for FM .
We provide an example in the context of Gaussian processes which will be useful in studying
Gaussian channels with correlated noise (Section 20.6*).
Example 6.4 (Gaussian processes) Consider X, N two stationary Gaussian processes,
independent of each other. Assume that their autocovariance functions are absolutely summable
and thus there exist continuous power spectral density functions fX and fN . Without loss of gener-
ality, assume all means are zero. Let $c_X(k) = \mathbb{E}[X_1 X_{k+1}]$. Then $f_X$ is the Fourier transform of the autocovariance function $c_X$, i.e., $f_X(\omega) = \sum_{k=-\infty}^{\infty} c_X(k) e^{i\omega k}$, $|\omega|\le\pi$. Finally, assume $f_N\ge\delta>0$.
Then recall from Example 3.5:
$$I(X^n; X^n+N^n) = \frac12\log\frac{\det(\Sigma_{X^n}+\Sigma_{N^n})}{\det\Sigma_{N^n}} = \frac12\sum_{i=1}^n\log\sigma_i - \frac12\sum_{i=1}^n\log\lambda_i,$$
where σj , λj are the eigenvalues of the covariance matrices ΣYn = ΣXn + ΣNn and ΣNn , which are
all Toeplitz matrices, e.g., (ΣXn )ij = E [Xi Xj ] = cX (i − j). By Szegö’s theorem [199, Sec. 5.2]:
$$\frac1n\sum_{i=1}^n\log\sigma_i \to \frac1{2\pi}\int_{-\pi}^{\pi}\log f_Y(\omega)\,d\omega \qquad (6.12)$$
Note that cY (k) = E [(X1 + N1 )(Xk+1 + Nk+1 )] = cX (k) + cN (k) and hence fY = fX + fN . Thus, we
have
$$\frac1n I(X^n; X^n+N^n) \to I(\mathbf{X}; \mathbf{X}+\mathbf{N}) = \frac1{4\pi}\int_{-\pi}^{\pi}\log\frac{f_X(\omega)+f_N(\omega)}{f_N(\omega)}\,d\omega.$$
Maximizing this over fX subject to a moment constraint leads to the famous water-filling solution
f∗X (ω) = |T − fN (ω)|+ – see Theorem 20.18.
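A quick numerical sketch of this limit (in nats): we compare the finite-$n$ Toeplitz-determinant formula for $\frac1n I(X^n;X^n+N^n)$ with the spectral integral above. The geometric autocovariances used here are illustrative assumptions only.

```python
import numpy as np

def autocov(rho, var, kmax):
    return var * rho ** np.abs(np.arange(-kmax, kmax + 1))

def spectral_density(c, omegas):
    k = np.arange(-(len(c) // 2), len(c) // 2 + 1)
    return np.real(np.sum(c[None, :] * np.exp(1j * np.outer(omegas, k)), axis=1))

n, kmax = 200, 400
cX = autocov(0.7, 1.0, kmax)           # hypothetical X autocovariance
cN = autocov(0.3, 0.5, kmax)           # hypothetical noise autocovariance
idx = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
SX, SN = cX[kmax + idx], cN[kmax + idx]   # Toeplitz covariance matrices

rate_n = 0.5 * (np.linalg.slogdet(SX + SN)[1] - np.linalg.slogdet(SN)[1]) / n

omegas = np.linspace(-np.pi, np.pi, 4001)
fX, fN = spectral_density(cX, omegas), spectral_density(cN, omegas)
rate_limit = np.mean(np.log((fX + fN) / fN)) / 2   # = (1/4pi) * integral over [-pi,pi]

print(rate_n, rate_limit)   # the two values should be close for large n
```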
7 f-divergences
In Chapter 2 we introduced the KL divergence that measures dissimilarity between two dis-
tributions. This turns out to be a special case of a whole family of such measures, known as
f-divergences, introduced by Csiszár [109]. Like the KL divergence, f-divergences satisfy a number of useful properties.
The purpose of this chapter is to establish these properties and prepare the ground for appli-
cations in subsequent chapters. The important highlight is a joint range Theorem of Harremoës
and Vajda [214], which gives the sharpest possible comparison inequality between arbitrary f-
divergences and puts an end to a long sequence of results starting from Pinsker’s inequality –
Theorem 7.10. This material is mandatory only for those interested in "non-classical" applications of information theory, such as the ones we will explore in Part VI. Others can skim through this chapter and refer back to it as needed.
where $\frac{dP}{dQ}$ is a Radon-Nikodym derivative and $f(0)\triangleq f(0+)$. More generally, let $f'(\infty)\triangleq\lim_{x\downarrow0} xf(1/x)$. Suppose that $Q(dx) = q(x)\mu(dx)$ and $P(dx) = p(x)\mu(dx)$ for some common dominating measure $\mu$. Then
$$D_f(P\|Q) = \int_{\{q>0\}} q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)\mu(dx) + f'(\infty)\, P[q=0], \qquad (7.2)$$
with the agreement that if P[q = 0] = 0 the last term is taken to be zero regardless of the value of
f′ (∞) (which could be infinite).
Remark 7.1 For the discrete case, with $Q(x)$ and $P(x)$ being the respective pmfs, we can also write
$$D_f(P\|Q) = \sum_x Q(x)\, f\!\left(\frac{P(x)}{Q(x)}\right)$$
with the conventions that
• $f(0) = f(0+)$,
• $0f(\frac00) = 0$, and
• $0f(\frac a0) = \lim_{x\downarrow0} xf(\frac ax) = a f'(\infty)$ for $a>0$.
Remark 7.2 A nice property of Df (PkQ) is that the definition is invariant to the choice of
the dominating measure μ in (7.2). This is not the case for other dissimilarity measures, e.g., the
squared L2 -distance between the densities kp − qk2L2 (dμ) which is a popular loss function for density
estimation in statistics literature (cf. Section 32.4).
The following are common f-divergences:
Note that we can also choose f(x) = x2 − 1. Indeed, f’s differing by a linear term lead to the
same f-divergence, cf. Proposition 7.2.
¹ In (7.3), $\int d(P\wedge Q)$ is the usual shorthand for $\int\left(\frac{dP}{d\mu}\wedge\frac{dQ}{d\mu}\right)d\mu$, where $\mu$ is any dominating measure. The expressions in (7.4) and (7.5) are understood in the similar sense.
• Squared Hellinger distance: $f(x) = (1-\sqrt{x})^2$,
$$H^2(P,Q) \triangleq \mathbb{E}_Q\!\left[\left(1-\sqrt{\frac{dP}{dQ}}\right)^2\right] = \int\left(\sqrt{dP}-\sqrt{dQ}\right)^2 = 2 - 2\int\sqrt{dP\,dQ}. \qquad (7.5)$$
Here the quantity $B(P,Q)\triangleq\int\sqrt{dP\,dQ}$ is known as the Bhattacharyya coefficient (or Hellinger affinity) [52]. Note that $H(P,Q) = \sqrt{H^2(P,Q)}$ defines a metric on the space of probability distributions: indeed, the triangle inequality follows from that of $L_2(\mu)$ for a common dominating measure $\mu$. Note, however, that
$$P\mapsto H(P,Q) \text{ is not convex.} \qquad (7.6)$$
(This is because the metric $H$ is not induced by a Banach norm on the space of measures.) For an explicit example, consider $p\mapsto H(\mathrm{Ber}(p), \mathrm{Ber}(0.1))$.
• Le Cam divergence (distance) [273, p. 47]: $f(x) = \frac{(1-x)^2}{2x+2}$,
$$\mathrm{LC}(P,Q) = \frac12\int\frac{(dP-dQ)^2}{dP+dQ}. \qquad (7.7)$$
Moreover, $\sqrt{\mathrm{LC}(P\|Q)}$ is a metric on the space of probability distributions [152], known as the Le Cam distance.
• Jensen-Shannon divergence: $f(x) = x\log\frac{2x}{x+1} + \log\frac{2}{x+1}$,
$$\mathrm{JS}(P,Q) = D\!\left(P\,\Big\|\,\frac{P+Q}{2}\right) + D\!\left(Q\,\Big\|\,\frac{P+Q}{2}\right). \qquad (7.8)$$
Moreover, $\sqrt{\mathrm{JS}(P\|Q)}$ is a metric on the space of probability distributions [152].
Remark 7.3 If Df (PkQ) is an f-divergence, then it is easy to verify that Df (λP + λ̄QkQ) and
Df (PkλP + λ̄Q) are f-divergences for all λ ∈ [0, 1]. In particular, Df (QkP) = Df̃ (PkQ) with
$\tilde f(x)\triangleq x f(1/x)$.
We start by summarizing some formal observations about f-divergences,
the latter referred to as the conditional f-divergence (similar to Definition 2.14 for conditional
KL divergence).
In particular,
$$D_f(P_X P_Y\,\|\,Q_X P_Y) = D_f(P_X\|Q_X). \qquad (7.11)$$
In particular, we can always assume that f ≥ 0 and (if f is differentiable at 1) that f′ (1) = 0.
Proof. The first and second are clear. For the third property, verify explicitly that Df (PkQ) = 0
for f = c(x − 1). Next consider general f and observe that for P ⊥ Q, by definition we have
which is well-defined (i.e., ∞ − ∞ is not possible) since by convexity f(0) > −∞ and f′ (∞) >
−∞. So all we need to verify is that f(0) + f′ (∞) = 0 if and only if f = c(x − 1) for some c ∈ R.
Indeed, since $f(1) = 0$, the convexity of $f$ implies that $x\mapsto g(x)\triangleq\frac{f(x)}{x-1}$ is non-decreasing. By
assumption, we have g(0+) = g(∞) and hence g(x) is a constant on x > 0, as desired.
For property 4, let $R_{Y|X} = \frac12 P_{Y|X} + \frac12 Q_{Y|X}$. By Theorem 2.12 there exist jointly measurable $p(y|x)$ and $q(y|x)$ such that $dP_{Y|X=x} = p(y|x)\,dR_{Y|X=x}$ and $dQ_{Y|X=x} = q(y|x)\,dR_{Y|X=x}$. We can then take $\mu$ in (7.2) to be $\mu = P_X R_{Y|X}$, which gives $dP_{X,Y} = p(y|x)\,d\mu$ and $dQ_{X,Y} = q(y|x)\,d\mu$ and thus
$$D_f(P_{X,Y}\|Q_{X,Y}) = \int_{\mathcal{X}\times\mathcal{Y}} d\mu\,1\{q(y|x)>0\}\,q(y|x)\,f\!\left(\frac{p(y|x)}{q(y|x)}\right) + f'(\infty)\int_{\mathcal{X}\times\mathcal{Y}} d\mu\,1\{q(y|x)=0\}\,p(y|x)$$
$$\stackrel{(7.2)}{=} \int_{\mathcal{X}} dP_X\,\underbrace{\left[\int_{\{y:q(y|x)>0\}} dR_{Y|X=x}\,q(y|x)\,f\!\left(\frac{p(y|x)}{q(y|x)}\right) + f'(\infty)\int_{\{y:q(y|x)=0\}} dR_{Y|X=x}\,p(y|x)\right]}_{D_f(P_{Y|X=x}\|Q_{Y|X=x})}$$
Proof. Note that in the case PX,Y QX,Y (and thus PX QX ), the proof is a simple application
of Jensen’s inequality to definition (7.1):
$$D_f(P_{X,Y}\|Q_{X,Y}) = \mathbb{E}_{X\sim Q_X}\mathbb{E}_{Y\sim Q_{Y|X}}\left[f\!\left(\frac{dP_{Y|X}}{dQ_{Y|X}}\frac{dP_X}{dQ_X}\right)\right]$$
$$\ge \mathbb{E}_{X\sim Q_X}\left[f\!\left(\mathbb{E}_{Y\sim Q_{Y|X}}\left[\frac{dP_{Y|X}}{dQ_{Y|X}}\right]\frac{dP_X}{dQ_X}\right)\right] = \mathbb{E}_{X\sim Q_X}\left[f\!\left(\frac{dP_X}{dQ_X}\right)\right].$$
To prove the general case we need to be more careful. Let $R_X = \frac12(P_X+Q_X)$ and $R_{Y|X} = \frac12 P_{Y|X} + \frac12 Q_{Y|X}$. It should be clear that $P_{X,Y}, Q_{X,Y}\ll R_{X,Y}\triangleq R_X R_{Y|X}$ and that for every $x$: $P_{Y|X=x}, Q_{Y|X=x}\ll R_{Y|X=x}$. By Theorem 2.12 there exist measurable functions $p_1, p_2, q_1, q_2$ so that $dP_X = p_1(x)\,dR_X$, $dQ_X = q_1(x)\,dR_X$ and $dP_{Y|X=x} = p_2(y|x)\,dR_{Y|X=x}$, $dQ_{Y|X=x} = q_2(y|x)\,dR_{Y|X=x}$. We also denote $p(x,y) = p_1(x)p_2(y|x)$, $q(x,y) = q_1(x)q_2(y|x)$.
Fix $t>0$ and consider a supporting line to $f$ at $t$ with slope $\mu$, so that $f(s)\ge f(t) + \mu(s-t)$ for all $s\ge0$. Since $f'(\infty)\ge\mu$, this yields
$$f(t\lambda) + t(1-\lambda)f'(\infty) \ge f(t), \qquad \forall t\ge0,\ \lambda\in(0,1]. \qquad (7.14)$$
Note that we added the $t=0$ case as well, since for $t=0$ the statement is obvious (recall, though, that $f(0)\triangleq f(0+)$ can be equal to $+\infty$).
Next, fix some x with q1 (x) > 0 and consider the chain
$$\int_{\{y:q_2(y|x)>0\}} dR_{Y|X=x}\,q_2(y|x)\,f\!\left(\frac{p_1(x)p_2(y|x)}{q_1(x)q_2(y|x)}\right) + P_{Y|X=x}[q_2(Y|x)=0]\,\frac{p_1(x)}{q_1(x)}\,f'(\infty)$$
$$\stackrel{(a)}{\ge} f\!\left(\frac{p_1(x)}{q_1(x)}\,P_{Y|X=x}[q_2(Y|x)>0]\right) + \frac{p_1(x)}{q_1(x)}\,P_{Y|X=x}[q_2(Y|x)=0]\,f'(\infty)$$
$$\stackrel{(b)}{\ge} f\!\left(\frac{p_1(x)}{q_1(x)}\right),$$
where (a) is by Jensen's inequality and the convexity of $f$, and (b) by taking $t = \frac{p_1(x)}{q_1(x)}$ and $\lambda = P_{Y|X=x}[q_2(Y|x)>0]$ in (7.14). Now multiplying the obtained inequality by $q_1(x)$ and integrating over $\{x: q_1(x)>0\}$ we get
$$\int_{\{q>0\}} dR_{X,Y}\,q(x,y)\,f\!\left(\frac{p(x,y)}{q(x,y)}\right) + f'(\infty)\,P_{X,Y}[q_1(X)>0, q_2(Y|X)=0] \ge \int_{\{q_1>0\}} dR_X\,q_1(x)\,f\!\left(\frac{p_1(x)}{q_1(x)}\right).$$
Adding f′ (∞)PX [q1 (X) = 0] to both sides we obtain (7.13) since both sides evaluate to
definition (7.2).
The following is the main result of this section.
Theorem 7.4 (Data processing) Consider a channel that produces Y given X based on the
conditional law PY|X (shown below).
(Figure: the channel $P_{Y|X}$ maps $P_X\mapsto P_Y$ and $Q_X\mapsto Q_Y$.)
Let $P_Y$ (resp. $Q_Y$) denote the distribution of $Y$ when $X$ is distributed as $P_X$ (resp. $Q_X$). For any f-divergence $D_f(\cdot\|\cdot)$,
$$D_f(P_Y\|Q_Y) \le D_f(P_X\|Q_X).$$
Next we discuss some of the more useful properties of f-divergence that parallel those of KL
divergence in Theorem 2.16:
(Figure: a single input $P_X$ is fed into two channels $P_{Y|X}$ and $Q_{Y|X}$, producing outputs $P_Y$ and $Q_Y$.)
Then
$$D_f(P_Y\|Q_Y) \le D_f(P_{Y|X}\|Q_{Y|X}\,|\,P_X).$$
Proof. (a) Non-negativity follows from monotonicity by taking X to be unary. To show strict
positivity, suppose for the sake of contradiction that $D_f(P\|Q) = 0$ for some $P\ne Q$. Then there exists some measurable $A$ such that $p = P(A)\ne q = Q(A) > 0$. Applying the data
² By strict convexity at 1, we mean that for all $s,t\in[0,\infty)$ and $\alpha\in(0,1)$ such that $\alpha s + \bar\alpha t = 1$, we have $\alpha f(s) + (1-\alpha)f(t) > f(1)$.
Remark 7.4 (Strict convexity) Just like for the KL divergence, f-divergences are never
strictly convex in the sense that (P, Q) 7→ Df (PkQ) can be linear on an interval connecting (P0 , Q0 )
to (P1 , Q1 ). As in Remark 5.2 this is the case when (P0 , Q0 ) have support disjoint from (P1 , Q1 ).
However, for f-divergences this can happen even with pairs with a common support. For example,
TV(Ber(p), Ber(q)) = |p − q| is piecewise linear. In turn, strict convexity of f is related to certain
desirable properties of f-information If (X; Y), see Ex. I.40.
Remark 7.5 (g-divergences) We note that, more generally, we may call functional D(PkQ)
a “g-divergence”, or a generalized dissimilarity measure, if it satisfies the following properties: pos-
itivity, monotonicity (as in (7.13)), data processing inequality (DPI, cf. (7.15)) and D(PkP) = 0
for any $P$. Note that the last three properties imply positivity by taking $X$ to be unary in the DPI. In many ways the g-divergence properties allow one to interpret it as a measure of information in the generic sense adopted in this book. We have seen that f-divergences satisfy two additional properties: conditioning increases divergence (CID) and convexity in the pair, the two being essentially equivalent (cf. the proof of Theorem 5.1). CID and convexity do not necessarily hold for a general g-divergence. Indeed, any monotone function of an f-divergence is a g-divergence, and of course such functions need not be convex (cf. (7.6) for an example). Interestingly, there exist g-divergences which are not monotone transformations of any f-divergence, cf. [338, Section V]; the example there is in fact $D(P\|Q) = \alpha - \beta_\alpha(P,Q)$ with $\beta$ defined in (14.3) later in the book. On the other hand, for finite alphabets, [325] shows that any $D(P\|Q) = \sum_i \phi(P_i, Q_i)$ is a g-divergence iff it is an f-divergence.
The following convenient property, a counterpart of Theorem 4.5, allows us to reduce any gen-
eral problem about f-divergences to the problem on finite alphabets. The proof is in Section 7.14*.
Theorem 7.6 Let $P, Q$ be two probability measures on $\mathcal{X}$ with $\sigma$-algebra $\mathcal{F}$. Given a finite $\mathcal{F}$-measurable partition $\mathcal{E} = \{E_1,\ldots,E_n\}$, define the distributions $P^{\mathcal{E}}$ and $Q^{\mathcal{E}}$ on $[n]$ by $P^{\mathcal{E}}(i) = P[E_i]$ and $Q^{\mathcal{E}}(i) = Q[E_i]$. Then
$$D_f(P\|Q) = \sup_{\mathcal{E}} D_f(P^{\mathcal{E}}\|Q^{\mathcal{E}}). \qquad (7.16)$$
is minimized.
In this section we first show that optimization over ϕ naturally leads to the concept of TV.
Subsequently, we will see that asymptotic considerations (when $P$ and $Q$ are replaced with $P^{\otimes n}$ and $Q^{\otimes n}$) lead to $H^2$. We start with the former case.
where minimization is over joint distributions PX,Y with the property PX = P and PY = Q,
which are called couplings of P and Q.
³ The extension of (7.19) from simple to composite hypothesis testing is in (32.28).
⁴ See Exercise I.36 for another inf-representation.
which establishes that the second supremum in (7.18) lower bounds TV, and hence (by taking
f(x) = 2 · 1E (x) − 1) so does the first. For the other direction, let E = {x : p(x) > q(x)} and notice
$$0 = \int (p(x)-q(x))\,d\mu = \left(\int_E + \int_{E^c}\right)(p(x)-q(x))\,d\mu,$$
implying that $\int_{E^c}(q(x)-p(x))\,d\mu = \int_E (p(x)-q(x))\,d\mu$. But the sum of these two integrals precisely equals $2\cdot\mathrm{TV}$, which implies that this choice of $E$ attains equality in (7.18).
For the inf-representation, we notice that given a coupling $P_{X,Y}$, for any $\|f\|_\infty\le1$ we have
$$\mathbb{E}_P[f(X)] - \mathbb{E}_Q[f(Y)] = \mathbb{E}[f(X)-f(Y)] \le 2\,P_{X,Y}[X\ne Y],$$
which, in view of (7.18), shows that the inf-representation is always an upper bound. To show that this bound is tight one constructs $X, Y$ as follows: with probability $\pi\triangleq\int\min(p(x),q(x))\,d\mu$ we take $X = Y = c$ with $c$ sampled from a distribution with density $r(x) = \frac1\pi\min(p(x),q(x))$, whereas with probability $1-\pi$ we take $X, Y$ sampled independently from the distributions $p_1(x) = \frac1{1-\pi}(p(x)-\min(p(x),q(x)))$ and $q_1(x) = \frac1{1-\pi}(q(x)-\min(p(x),q(x)))$, respectively. The result follows upon verifying that this $P_{X,Y}$ indeed defines a coupling of $P$ and $Q$ and applying the last identity of (7.3).
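A minimal sketch of this maximal-coupling construction for two discrete distributions (the particular $p$, $q$ below are illustrative assumptions, with $p\ne q$ so the residual components are well defined): the empirical disagreement frequency should match $\mathrm{TV}(p,q)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def maximal_coupling(p, q, n_samples):
    """Sample pairs (X, Y) with X ~ p, Y ~ q and P[X != Y] = TV(p, q)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    common = np.minimum(p, q)
    pi = common.sum()                      # overlap mass
    r = common / pi                        # shared component
    p1 = (p - common) / (1 - pi)           # residual of p
    q1 = (q - common) / (1 - pi)           # residual of q
    k = len(p)
    xs, ys = np.empty(n_samples, int), np.empty(n_samples, int)
    for i in range(n_samples):
        if rng.random() < pi:              # stay equal with probability pi
            xs[i] = ys[i] = rng.choice(k, p=r)
        else:                              # sample the residuals independently
            xs[i] = rng.choice(k, p=p1)
            ys[i] = rng.choice(k, p=q1)
    return xs, ys

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
xs, ys = maximal_coupling(p, q, 100000)
print(np.mean(xs != ys), 0.5 * np.abs(p - q).sum())   # both approx TV(p, q) = 0.3
```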
Remark 7.6 (Variational representation) The sup-representation (7.18) of the total vari-
ation will be extended to general f-divergences in Section 7.13. However, only the TV has the
representation of the form supf∈F | EP [f] − EQ [f]| over the class of functions. Distances of this
form (for different classes of F ) are sometimes known as integral probability metrics (IPMs).
And so TV is an example of an IPM for the class F of all bounded functions.
In turn, the inf-representation (7.20) has no analogs for other f-divergences, with the notable
exception of Marton’s d2 , see Remark 7.15. Distances defined via inf-representations over cou-
plings are often called Wasserstein distances, and hence we may think of TV as the Wasserstein
distance with respect to Hamming distance d(x, x′ ) = 1{x 6= x′ } on X . The benefit of variational
representations is that choosing a particular coupling in (7.20) gives an upper bound on TV(P, Q),
and choosing a particular f in (7.18) yields a lower bound.
Of particular relevance is the special case of testing with multiple observations, where the data
X = (X1 , . . . , Xn ) are i.i.d. drawn from either P or Q. In other words, the goal is to test
H0 : X ∼ P⊗n vs H1 : X ∼ Q⊗ n .
By Theorem 7.7, the optimal total probability of error is given by 1 − TV(P⊗n , Q⊗n ). By the data
processing inequality, TV(P⊗n , Q⊗n ) is a non-decreasing sequence in n (and bounded by 1 by
definition) and hence converges. One would expect that as n → ∞, TV(P⊗n , Q⊗n ) converges to 1
and consequently, the probability of error in the hypothesis test vanishes. It turns out that for fixed
distributions P 6= Q, large deviations theory (see Chapter 16) shows that TV(P⊗n , Q⊗n ) indeed
converges to one as n → ∞ and, in fact, exponentially fast:
TV(P⊗n , Q⊗n ) = 1 − exp(−nC(P, Q) + o(n)), (7.21)
where the exponent C(P, Q) > 0 is known as the Chernoff Information of P and Q given in (16.2).
However, as frequently encountered in high-dimensional statistical problems, if the distributions
• H2 (P, Q) = 2, if and only if TV(P, Q) = 1. In this case, the probability of error is zero since
essentially P and Q have disjoint supports.
• H2 (P, Q) = 0 if and only if TV(P, Q) = 0. In this case, the smallest total probability of error is
one, meaning the best test is random guessing.
• Hellinger consistency is equivalent to TV consistency.
Proof. For convenience, let $X_1, X_2, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} Q_n$. Then
$$H^2(P_n^{\otimes n}, Q_n^{\otimes n}) = 2 - 2\,\mathbb{E}\left[\sqrt{\prod_{i=1}^n\frac{P_n}{Q_n}(X_i)}\right] = 2 - 2\prod_{i=1}^n\mathbb{E}\left[\sqrt{\frac{P_n}{Q_n}(X_i)}\right] = 2 - 2\left(1-\frac12 H^2(P_n,Q_n)\right)^n. \qquad (7.25)$$
We now use (7.25) to conclude the proof. Recall from (7.23) that $\mathrm{TV}(P_n^{\otimes n}, Q_n^{\otimes n})\to0$ if and only if $H^2(P_n^{\otimes n}, Q_n^{\otimes n})\to0$, which happens precisely when $H^2(P_n, Q_n) = o(\frac1n)$. Similarly, by (7.24), $\mathrm{TV}(P_n^{\otimes n}, Q_n^{\otimes n})\to1$ if and only if $H^2(P_n^{\otimes n}, Q_n^{\otimes n})\to2$, which is further equivalent to $H^2(P_n, Q_n) = \omega(\frac1n)$.
While some other f-divergences also satisfy tensorization, see Section 7.12, the H2 has the advan-
tage of a sandwich bound (7.22) making it the most convenient tool for checking asymptotic
testability of hypotheses.
Remark 7.8 (Kakutani's dichotomy) Let $P = \prod_{i\ge1} P_i$ and $Q = \prod_{i\ge1} Q_i$, where $P_i\ll Q_i$. Kakutani's theorem shows the following dichotomy between these two distributions on the infinite sequence space:
• If $\sum_{i\ge1} H^2(P_i, Q_i) = \infty$, then $P$ and $Q$ are mutually singular (i.e. $P\perp Q$).
• If $\sum_{i\ge1} H^2(P_i, Q_i) < \infty$, then $P$ and $Q$ are equivalent (i.e. $P\ll Q$ and $Q\ll P$).
In the Gaussian case, say, $P_i = \mathcal{N}(\mu_i, 1)$ and $Q_i = \mathcal{N}(0,1)$, the equivalence condition simplifies to $\sum\mu_i^2 < \infty$.
To understand Kakutani's criterion, note that by the tensorization property (7.26), we have
$$H^2(P, Q) = 2 - 2\prod_{i\ge1}\left(1 - \frac{H^2(P_i, Q_i)}{2}\right).$$
Thus, if $\prod_{i\ge1}(1 - \frac{H^2(P_i,Q_i)}{2}) = 0$, or equivalently, $\sum_{i\ge1} H^2(P_i, Q_i) = \infty$, then $H^2(P,Q) = 2$, which, by (7.22), is equivalent to $\mathrm{TV}(P,Q) = 1$ and hence $P\perp Q$. If $\sum_{i\ge1} H^2(P_i, Q_i) < \infty$, then $H^2(P,Q) < 2$. To conclude the equivalence between $P$ and $Q$, note that the likelihood ratio $\frac{dP}{dQ} = \prod_{i\ge1}\frac{dP_i}{dQ_i}$ satisfies that either $Q(\frac{dP}{dQ} = 0) = 0$ or $1$ by Kolmogorov's 0-1 law. See [143, Theorem 5.3.5] for details.
We end this section by discussing the related concept of contiguity. Note that if two distributions $P_n$ and $Q_n$ have vanishing total variation, then $P_n(A) = Q_n(A) + o(1)$ uniformly over all events $A$. Sometimes, and especially for statistical applications, we are only interested in comparing those events with probability close to zero or one. This leads us to the following definition.
Definition 7.9 (Contiguity and asymptotic separatedness) Let {Pn } and {Qn } be
sequences of probability measures on some Ωn . We say Pn is contiguous with respect to Qn
(denoted by Pn ◁ Qn ) if for any sequence {An } of measurable sets, Qn (An ) → 0 implies that
Pn (An ) → 0. We say Pn and Qn are mutually contiguous (denoted by Pn ◁▷ Qn ) if Pn ◁ Qn
and Qn ◁ Pn . We say that Pn is asymptotically separated from Qn (denoted Pn △ Qn ) if
lim supn→∞ TV(Pn , Qn ) = 1.
0 or 1. In addition, if Pn ◁ Qn , then any test that succeeds with high Qn -probability must fail with
high Pn -probability; in other words, Pn and Qn cannot be distinguished perfectly so TV(Pn , Qn ) =
1 − Ω(1), in particular contiguity and separatedness are mutually exclusive. Furthermore, often
many interesting sequences of measures satisfy dichotomy similar to Kakutani’s: either Pn ◁▷ Qn
or Pn △ Qn , see [282].
Our interest in these notions arises from the fact that f-divergences are instrumental for
establishing contiguity and separatedness. For example, from (7.24) we conclude that
where Dα is Rényi divergence (Definition 7.24). This criterion can be weakened to the following
(commonly used) one: $P_n\lhd Q_n$ if $\chi^2(P_n\|Q_n) = O(1)$. Indeed, applying a change of measure and Cauchy-Schwarz, $P_n(A_n) = \mathbb{E}_{P_n}[1\{A_n\}] = \mathbb{E}_{Q_n}[\frac{dP_n}{dQ_n}1\{A_n\}] \le \sqrt{1+\chi^2(P_n\|Q_n)}\sqrt{Q_n(A_n)}$, which vanishes whenever $Q_n(A_n)$ vanishes. (See Exercise I.49 for a concrete example in the context of community detection and random graphs.) In particular, a sufficient condition for mutual contiguity is the boundedness of the likelihood ratio: $c\le\frac{dP_n}{dQ_n}\le C$ for some constants $c, C > 0$.
Proof. It suffices to consider the natural logarithm for the KL divergence. First we show that,
by the data processing inequality, it suffices to prove the result for Bernoulli distributions. For
any event E, let Y = 1 {X ∈ E} which is Bernoulli with parameter P(E) or Q(E). By the DPI,
D(PkQ) ≥ d(P(E)kQ(E)). If Pinsker’s inequality holds for all Bernoulli distributions, we have
$$\sqrt{\tfrac12 D(P\|Q)} \ge \mathrm{TV}(\mathrm{Ber}(P(E)), \mathrm{Ber}(Q(E))) = |P(E)-Q(E)|.$$
Taking the supremum over $E$ gives $\sqrt{\frac12 D(P\|Q)}\ge\sup_E|P(E)-Q(E)| = \mathrm{TV}(P,Q)$, in view of Theorem 7.7.
The binary case follows easily from a second-order Taylor expansion (with integral remainder form) of $p\mapsto d(p\|q)$:
$$d(p\|q) = \int_q^p\frac{p-t}{t(1-t)}\,dt \ge 4\int_q^p (p-t)\,dt = 2(p-q)^2.$$
Pinsker’s inequality has already been used multiple times in this book. Here is yet another
implication that is further explored in Exercise I.62 and I.63 (Szemerédi regularity).
Proof. If $Y_1$ and $Y_2$ are two random variables taking values in $[-1,1]$ then by (7.20) there exists a coupling such that $\mathbb{P}[Y_1\ne Y_2]\le\mathrm{TV}(P_{Y_1}, P_{Y_2})$. Thus, $|\mathbb{E}[Y_1]-\mathbb{E}[Y_2]|\le 2\,\mathrm{TV}(P_{Y_1}, P_{Y_2})$. Now, applying this to $P_{Y_1} = P_{Y|X=a}$ and $P_{Y_2} = P_{Y|X'=b}$ we obtain
$$|\mathbb{E}[Y|X=a]-\mathbb{E}[Y|X'=b]|^2 \le 4\,\mathrm{TV}^2(P_{Y|X=a}, P_{Y|X'=b}) \le \frac{2}{\log e}\,D(P_{Y|X=a}\|P_{Y|X'=b}),$$
where we applied Pinsker’s inequality in the last step. The proof is completed by averaging over
(a, b) ∼ PX,X′ and noticing that D(PY|X kPY|X′ |PX,X′ ) = I(Y; X|X′ ) due to PY|X,X′ = PY|X by
assumption.
Pinsker’s inequality and Tao’s inequality are both sharp in the sense that the constants can not
be improved. For example, for (7.27) we can take Pn = Ber( 21 + 1n ) and Qn = Ber( 12 ) and compute
D(Pn ∥Qn )
that TV 2 (P ,Q ) → 2 log e as n → ∞. (This is best seen by inspecting the local quadratic behavior
n n
in Proposition 2.21.) Nevertheless, this does not mean that the inequality (7.27) is not improvable,
as the RHS can be replaced by some other function of TV(P, Q) with additional higher-order
terms. Indeed, several such improvements of Pinsker’s inequality are known. But what is the best
inequality? In addition, another natural question is the reverse inequality: can we upper-bound
D(PkQ) in terms of TV(P, Q)?
Settling these questions rests on characterizing the joint range (the set of possible values) of a given pair of f-divergences. This systematic approach to comparing f-divergences (as opposed to the ad hoc proof of Theorem 7.10 presented above) is the subject of the rest of this section.
Definition 7.12 (Joint range) Consider two f-divergences Df (PkQ) and Dg (PkQ). Their
joint range is a subset of [0, ∞]2 defined by
As an example, Figure 7.1 gives the joint range R between the KL divergence and the total vari-
ation. By definition, the lower boundary of the region R gives the optimal refinement of Pinsker’s
inequality:
Also from Figure 7.1 we see that it is impossible to bound D(PkQ) from above in terms of TV(P, Q)
due to the lack of upper boundary.
Figure 7.1 Joint range of TV and KL divergence. The dashed line is the quadratic lower bound given by
Pinsker’s inequality (7.27).
The joint range R may appear difficult to characterize since we need to consider P, Q over
all measurable spaces; on the other hand, the region Rk for small k is easy to obtain (at least
numerically). Revisiting the proof of Pinsker’s inequality in Theorem 7.10, we see that the key
step is the reduction to Bernoulli distributions. It is natural to ask: to obtain full joint range is it
possible to reduce to the binary case? It turns out that it is always sufficient to consider quaternary
distributions, or the convex hull of that of binary distributions.
where co denotes the convex hull with a natural extension of convex operations to [0, ∞]2 .
We will rely on the following famous result from convex analysis (cf. e.g. [145, Chapter 2,
Theorem 18]).
• Claim 1: co(R2 ) ⊂ R4 ;
• Claim 2: Rk ⊂ co(R2 );
• Claim 3: R = R4 .
Note that Claims 1-2 prove the most interesting part: $\bigcup_{k=1}^\infty\mathcal{R}_k = \mathrm{co}(\mathcal{R}_2)$. Claim 3 is more technical and its proof can be found in [214]. However, the approximation result in Theorem 7.6 shows that $\mathcal{R}$ is the closure of $\bigcup_{k=1}^\infty\mathcal{R}_k$. Thus for the purpose of obtaining inequalities between $D_f$ and $D_g$, Claims 1-2 are sufficient.
We start with Claim 1. Given any two pairs of distributions (P0 , Q0 ) and (P1 , Q1 ) on some space
X and given any α ∈ [0, 1], define two joint distributions of the random variables (X, B) where
PB = QB = Ber(α), PX|B=i = Pi and QX|B=i = Qi for i = 0, 1. Then by (7.9) we get
R2 = R̃2 ∪ {(pf′ (∞), pg′ (∞)) : p ∈ (0, 1]} ∪ {(qf(0), qg(0)) : q ∈ (0, 1]} ,
Since $(0,0)\in\tilde{\mathcal{R}}_2$, we see that regardless of which of $f(0), f'(\infty), g(0), g'(\infty)$ are infinite, the set $\mathcal{R}_2\cap\mathbb{R}^2$ is connected. Thus, by Lemma 7.14 any point in $\mathrm{co}(\mathcal{R}_2\cap\mathbb{R}^2)$ is a combination of two points in $\mathcal{R}_2\cap\mathbb{R}^2$, which, by the argument above, is a subset of $\mathcal{R}_4$. Finally, it is not hard to see that $\mathrm{co}(\mathcal{R}_2)\setminus\mathbb{R}^2\subset\mathcal{R}_4$, which concludes the proof of $\mathrm{co}(\mathcal{R}_2)\subset\mathcal{R}_4$.
Next, we prove Claim 2. Fix $P, Q$ on $[k]$ and denote their PMFs $(p_j)$ and $(q_j)$, respectively. Note that without changing either $D_f(P\|Q)$ or $D_g(P\|Q)$ (but perhaps, by increasing $k$ by 1), we can make $q_j > 0$ for $j>1$ and $q_1 = 0$, which we thus assume. Denote $\phi_j = \frac{p_j}{q_j}$ for $j>1$ and consider the set
$$S = \left\{\tilde Q = (\tilde q_j)_{j\in[k]} : \tilde q_j\ge0,\ \sum_j\tilde q_j = 1,\ \tilde q_1 = 0,\ \sum_{j=2}^k\tilde q_j\phi_j\le1\right\}.$$
affinely maps $S$ to $[0,\infty]$ (note that $f(0)$ or $f'(\infty)$ can equal $\infty$). In particular, if we denote $\tilde P_i = \tilde P(\tilde Q_i)$ corresponding to $\tilde Q_i$ in the decomposition (7.29), we get
$$D_f(P\|Q) = \sum_{i=1}^m\alpha_i D_f(\tilde P_i\|\tilde Q_i),$$
and similarly for $D_g(P\|Q)$. We are left to show that each $(\tilde P_i, \tilde Q_i)$ is supported on at most two points, which verifies that any element of $\mathcal{R}_k$ is a convex combination of $k$ elements of $\mathcal{R}_2$. Indeed, for $\tilde Q\in S_e$ the set $\{j\in[k]: \tilde q_j>0 \text{ or } \tilde p_j>0\}$ has cardinality at most two (for the second type of extremal points we notice $\tilde p_{j_1}+\tilde p_{j_2} = 1$ implying $\tilde p_1 = 0$). This concludes the proof of Claim 2.
• the upper boundary is achieved by $P = \mathrm{Ber}(\frac{1+t}{2})$, $Q = \mathrm{Ber}(\frac{1-t}{2})$;
• the lower boundary is achieved by $P = (1-t, t, 0)$, $Q = (1-t, 0, t)$.
Figure 7.2 The joint range R of TV and H2 is characterized by (7.22), which is the convex hull of the grey
region R2 .
where we take the natural logarithm. Here is a corollary (a weaker bound) due to [427]:
$$D(P\|Q) \ge \log\frac{1+\mathrm{TV}(P,Q)}{1-\mathrm{TV}(P,Q)} - \frac{2\,\mathrm{TV}(P,Q)}{1+\mathrm{TV}(P,Q)}\log e. \qquad (7.31)$$
Both bounds are stronger than Pinsker’s inequality (7.27). Note the following consequences:
where the function f is a convex increasing bijection of [0, 1) onto [0, ∞). Furthermore, for every
s ≥ f(t) there exists a pair of distributions such that χ2 (PkQ) = s and TV(P, Q) = t.
$$\mathrm{TV}(\mathrm{Ber}(p), \mathrm{Ber}(q)) = |p-q|\triangleq t, \qquad \chi^2(\mathrm{Ber}(p)\|\mathrm{Ber}(q)) = \frac{(p-q)^2}{q(1-q)} = \frac{t^2}{q(1-q)}.$$
Given |p − q| = t, let us determine the possible range of q(1 − q). The smallest value of q(1 − q)
is always 0 by choosing p = t, q = 0. The largest value is 1/4 if t ≤ 1/2 (by choosing p = 1/2 − t,
q = 1/2). If t > 1/2 then we can at most get t(1 − t) (by setting p = 0 and q = t). Thus we
get χ2 (Ber(p)kBer(q)) ≥ f(|p − q|) as claimed. The convexity of f follows since its derivative is
monotonically increasing. Clearly, $f(t)\ge 4t^2$ because $t(1-t)\le\frac14$.
• KL vs TV: see (7.30). For discrete distributions there is partial comparison in the other direction
(“reverse Pinsker”, cf. [373, Section VI]):
$$D(P\|Q) \le \log\left(1+\frac{2}{Q_{\min}}\mathrm{TV}(P,Q)^2\right) \le \frac{2\log e}{Q_{\min}}\mathrm{TV}(P,Q)^2, \qquad Q_{\min} = \min_x Q(x).$$
• KL vs Hellinger:
$$D(P\|Q) \ge 2\log\frac{2}{2-H^2(P,Q)} \ge \log e\cdot H^2(P,Q). \qquad (7.33)$$
The first inequality gives the joint range and is attained at P = Ber(0), Q = Ber(q). For a fixed
H2 , in general D(P||Q) has no finite upper bound, as seen from P = Ber(p), Q = Ber(0).
There is a partial result in the opposite direction (log-Sobolev inequality for the Bonami-Beckner semigroup, cf. [122, Theorem A.1] and Exercise I.64):
$$D(P\|Q) \le \frac{\log(\frac{1}{Q_{\min}}-1)}{1-2Q_{\min}}\left(1-(1-H^2(P,Q))^2\right), \qquad Q_{\min} = \min_x Q(x).$$
• $\chi^2$ vs TV: The full joint range is given by (7.32). Two simple consequences are:
$$\mathrm{TV}(P,Q) \le \frac12\sqrt{\chi^2(P\|Q)}, \qquad (7.37)$$
$$\mathrm{TV}(P,Q) \le \max\left(\frac12,\ \frac{\chi^2(P\|Q)}{1+\chi^2(P\|Q)}\right), \qquad (7.38)$$
where the second is useful for bounding TV away from one.
• JS vs TV: The full joint region is given by
$$2\,d\!\left(\frac{1-\mathrm{TV}(P,Q)}{2}\,\Big\|\,\frac12\right) \le \mathrm{JS}(P,Q) \le \mathrm{TV}(P,Q)\cdot 2\log2. \qquad (7.39)$$
The lower bound is a consequence of Fano's inequality. For the upper bound notice that for $p,q\in[0,1]$ and $|p-q| = \tau$ the maximum of $d(p\|\frac{p+q}{2})$ is attained at $p=0$, $q=\tau$ (from the convexity of $d(\cdot\|\cdot)$) and, thus, the binary joint range is given by $\tau\mapsto d(\tau\|\tau/2) + d(1-\tau\|1-\tau/2)$. Since the latter is convex, its concave envelope is a straight line connecting the endpoints at $\tau=0$ and $\tau=1$.
1 Total variation:
$$\mathrm{TV}(\mathcal{N}(0,\sigma^2), \mathcal{N}(\mu,\sigma^2)) = 2\Phi\!\left(\frac{|\mu|}{2\sigma}\right)-1 = \int_{-\frac{|\mu|}{2\sigma}}^{\frac{|\mu|}{2\sigma}}\varphi(x)\,dx = \frac{|\mu|}{\sqrt{2\pi}\,\sigma} + O(\mu^2), \quad \mu\to0. \qquad (7.40)$$
2 Hellinger distance:
$$H^2(\mathcal{N}(0,\sigma^2)\|\mathcal{N}(\mu,\sigma^2)) = 2-2e^{-\frac{\mu^2}{8\sigma^2}} = \frac{\mu^2}{4\sigma^2} + O(\mu^3), \quad \mu\to0. \qquad (7.41)$$
More generally,
$$H^2(\mathcal{N}(\mu_1,\Sigma_1)\|\mathcal{N}(\mu_2,\Sigma_2)) = 2-2\,\frac{|\Sigma_1|^{1/4}|\Sigma_2|^{1/4}}{|\bar\Sigma|^{1/2}}\exp\left(-\frac18(\mu_1-\mu_2)'\bar\Sigma^{-1}(\mu_1-\mu_2)\right),$$
where $\bar\Sigma = \frac{\Sigma_1+\Sigma_2}{2}$.
3 KL divergence:
$$D(\mathcal{N}(\mu_1,\sigma_1^2)\|\mathcal{N}(\mu_2,\sigma_2^2)) = \frac12\log\frac{\sigma_2^2}{\sigma_1^2} + \frac12\left(\frac{(\mu_1-\mu_2)^2}{\sigma_2^2}+\frac{\sigma_1^2}{\sigma_2^2}-1\right)\log e. \qquad (7.42)$$
For a more general result see (2.8).
4 $\chi^2$-divergence:
$$\chi^2(\mathcal{N}(\mu,\sigma^2)\|\mathcal{N}(0,\sigma^2)) = e^{\frac{\mu^2}{\sigma^2}}-1 = \frac{\mu^2}{\sigma^2} + O(\mu^3), \quad \mu\to0, \qquad (7.43)$$
$$\chi^2(\mathcal{N}(\mu,\sigma^2)\|\mathcal{N}(0,1)) = \begin{cases}\dfrac{e^{\mu^2/(2-\sigma^2)}}{\sigma\sqrt{2-\sigma^2}}-1, & \sigma^2<2\\ \infty, & \sigma^2\ge2\end{cases} \qquad (7.44)$$
5 χ2 -divergence for Gaussian mixtures [225] (see also Exercise I.48 for the Ingster-Suslina
method applicable to general mixture distributions):
$$\chi^2(P*\mathcal{N}(0,\Sigma)\,\|\,\mathcal{N}(0,\Sigma)) = \mathbb{E}\left[e^{\langle\Sigma^{-1}X, X'\rangle}\right]-1, \qquad X\perp\!\!\!\perp X'\sim P. \qquad (7.45)$$
Proof. Note that If (U; X) = Df (PU,X kPU PX ) ≥ Df (PU,Y kPU PY ) = If (U; Y), where we
applied the data-processing Theorem 7.4 to the (possibly stochastic) map (U, X) 7→ (U, Y). See
also Remark 3.4.
is not generally subadditive. There are two special cases when $I_{\chi^2}$ is subadditive: if one of $I_{\chi^2}(X;A)$ or $I_{\chi^2}(X;B)$ is small [202, Lemma 26], or if $X\sim\mathrm{Ber}(1/2)$ and the channels $P_{A|X}$ and $P_{B|X}$ are BMS (Section 19.4*), cf. [1].
2 The f-information corresponding to total-variation ITV (X; Y) ≜ TV(PX,Y , PX PY ) is not subad-
ditive. Furthermore, it has a counter-intuitive behavior of “getting stuck”. For example, take
X ∼ Ber(1/2) and A = BSCδ (X), B = BSCδ (X) – two independent observations of X across
the BSC. A simple computation (Exercise I.35) shows:
In other words, an additional observation does not improve TV-information at all. This is the
main reason for the famous herding effect in economics [30].
3 The symmetric KL-information
the f-information corresponding to the symmetric KL divergence (also known as the Jeffreys
divergence)
Let us prove this in the discrete case. First notice the following equivalent expression for ISKL :
$$I_{SKL}(X;Y) = \sum_{x,x'} P_X(x)P_X(x')\,D(P_{Y|X=x}\|P_{Y|X=x'}). \qquad (7.52)$$
From (7.52) we get (7.51) by the additivity D(PA,B|X=x kPA,B|X=x′ ) = D(PA|X=x kPA|X=x′ ) +
D(PB|X=x kPB|X=x′ ). To prove (7.52) first consider the obvious identity:
$$\sum_{x,x'} P_X(x)P_X(x')\left[D(P_Y\|P_{Y|X=x'}) - D(P_Y\|P_{Y|X=x})\right] = 0,$$
which is rewritten as
$$\sum_{x,x'} P_X(x)P_X(x')\sum_y P_Y(y)\log\frac{P_{Y|X}(y|x)}{P_{Y|X}(y|x')} = 0. \qquad (7.53)$$
Next, by definition,
$$I_{SKL}(X;Y) = \sum_{x,y}\left[P_{X,Y}(x,y) - P_X(x)P_Y(y)\right]\log\frac{P_{X,Y}(x,y)}{P_X(x)P_Y(y)}.$$
Since the marginals of $P_{X,Y}$ and $P_X P_Y$ coincide, we can replace $\log\frac{P_{X,Y}(x,y)}{P_X(x)P_Y(y)}$ by $\log\frac{P_{Y|X}(y|x)}{f(y)}$ for any $f$. We choose $f(y) = P_{Y|X}(y|x')$ to get
$$I_{SKL}(X;Y) = \sum_{x,y}\left[P_{X,Y}(x,y) - P_X(x)P_Y(y)\right]\log\frac{P_{Y|X}(y|x)}{P_{Y|X}(y|x')}.$$
Now averaging this over PX (x′ ) and applying (7.53) to get rid of the second term in [· · · ], we
obtain (7.52). For another interesting property of ISKL , see Ex. I.54.
$$\hat P_n = \frac1n\sum_{i=1}^n\delta_{X_i}$$
denote the empirical distribution corresponding to this sample. Let PY = PY|X ◦ PX be the output
distribution corresponding to PX and PY|X ◦ P̂n be the output distribution corresponding to P̂n (a
random distribution). Note that when $P_{Y|X=x}(\cdot) = \varphi(\cdot-x)$, where $\varphi$ is a fixed density, we can think of $P_{Y|X}\circ\hat P_n$ as a kernel density estimator (KDE), whose density is $\hat p_n(x) = (\varphi*\hat P_n)(x) = \frac1n\sum_{i=1}^n\varphi(x-X_i)$. Furthermore, using the fact that $\mathbb{E}[P_{Y|X}\circ\hat P_n] = P_Y$, we have
where the first term represents the bias of the KDE due to convolution and increases with band-
width of ϕ, while the second term represents the variability of the KDE and decreases with the
bandwidth of $\varphi$. Surprisingly, the second term is sharply (within a factor of two) given by the $I_{\chi^2}$ information. More exactly, we prove the following result.
Proposition 7.17
$$\mathbb{E}[D(P_{Y|X}\circ\hat P_n\,\|\,P_Y)] \le \log\left(1+\frac1n I_{\chi^2}(X;Y)\right), \qquad (7.54)$$
In Section 25.4* we will discuss an extension of this simple bound, in particular showing that
in many cases about n = exp{I(X; Y)+ K} observations are sufficient to ensure D(PY|X ◦ P̂n kPY ) =
e−O(K) .
and apply the local expansion of KL divergence (Proposition 2.21) to get (7.55).
In the discrete case, by taking PY|X to be the identity channel (Y = X) we obtain the following
guarantee on the closeness between the empirical and the population distribution. This fact can be
used to test whether the sample was truly generated by the distribution PX .
Otherwise, we have
$$\mathbb{E}[D(\hat P_n\|P_X)] \le \log\left(1+\frac{|\mathcal{X}|-1}{n}\right) \le (|\mathcal{X}|-1)\frac{\log e}{n}. \qquad (7.58)$$
is always non-negative. For fixed PX , Ĥemp is known to be consistent even on countably infinite
alphabets [22], although the convergence rate can be arbitrarily slow, which aligns with the con-
clusion of (7.57). However, for a large alphabet of size $\Theta(n)$, the upper bound (7.58) does not vanish (this is tight for, e.g., the uniform distribution). In this case, one needs to de-bias the empirical entropy (e.g. on the basis of (7.59)) or employ different techniques in order to achieve consistent estimation.
See Section 29.4 for more details.
Theorem 7.19 Suppose that $D_f(P\|Q)<\infty$ and the derivative of $f$ at $x=1$ exists. Then
$$\lim_{\lambda\to0}\frac1\lambda D_f(\lambda P+\bar\lambda Q\|Q) = (1-P[\mathrm{supp}(Q)])\,f'(\infty),$$
where as usual we take $0\cdot\infty = 0$ on the right-hand side.
Remark 7.10 Note that we do not need a separate theorem for Df (QkλP + λ̄Q) since the
exchange of arguments leads to another f-divergence with f(x) replaced by xf(1/x).
Proof. Without loss of generality we may assume $f(1) = f'(1) = 0$ and $f\ge0$. Then, decomposing $P = \mu P_1 + \bar\mu P_0$ with $P_0\perp Q$ and $P_1\ll Q$, we have
$$\frac1\lambda D_f(\lambda P+\bar\lambda Q\|Q) = \bar\mu f'(\infty) + \frac1\lambda\int dQ\,f\!\left(1+\lambda\Big(\mu\frac{dP_1}{dQ}-1\Big)\right).$$
Note that $g(\lambda) = f(1+\lambda t)$ is positive and convex for every $t\in\mathbb{R}$ and hence $\frac1\lambda g(\lambda)$ is monotonically decreasing to $g'(0) = 0$ as $\lambda\searrow0$. Since for $\lambda=1$ the integrand is assumed to be $Q$-integrable, the dominated convergence theorem applies and we get the result.
If $\chi^2(P\|Q)<\infty$, then $D_f(\bar\lambda Q+\lambda P\|Q)<\infty$ for all $0\le\lambda<1$ and
$$\lim_{\lambda\to0}\frac1{\lambda^2} D_f(\bar\lambda Q+\lambda P\|Q) = \frac{f''(1)}{2}\chi^2(P\|Q). \qquad (7.60)$$
If $\chi^2(P\|Q) = \infty$ and $f''(1)>0$ then (7.60) also holds, i.e. $D_f(\bar\lambda Q+\lambda P\|Q) = \omega(\lambda^2)$.
Remark 7.11 Conditions of the theorem include $D$, $D_{SKL}$, $H^2$, $\mathrm{JS}$, $\mathrm{LC}$ and all Rényi divergences of orders $\lambda<2$ (with $f(x) = \frac{1}{\lambda-1}(x^\lambda-1)$; see Definition 7.24). A similar result holds also for the case when $f''(x)\to\infty$ as $x\to+\infty$ (e.g. Rényi divergences with $\lambda>2$), but then we need to make extra assumptions in order to guarantee applicability of the dominated convergence theorem (often just the finiteness of $D_f(P\|Q)$ is sufficient).
Proof. Assuming that $\chi^2(P\|Q)<\infty$ we must have $P\ll Q$ and hence we can use (7.1) as the definition of $D_f$. Note that under (7.1) without loss of generality we may assume $f'(1) = f(1) = 0$ (indeed, for that we can just add a multiple of $(x-1)$ to $f(x)$, which does not change the value of $D_f(P\|Q)$). From the Taylor expansion we have then
$$f(1+u) = u^2\int_0^1(1-t)\,f''(1+tu)\,dt.$$
decomposition $P = \mu P_1 + \bar\mu P_0$ with $P_1\ll Q$ and $P_0\perp Q$. From definition (7.2) we have (for $\lambda_1 = \frac{\lambda\mu}{1-\lambda\bar\mu}$)
$$D_f(\lambda P+\bar\lambda Q\|Q) = (1-\lambda\bar\mu)\,D_f(\lambda_1 P_1+\bar\lambda_1 Q\|Q) + \lambda\bar\mu\,D_f(P_0\|Q) \ge \lambda\bar\mu\,D_f(P_0\|Q).$$
Recall from Proposition 7.2 that $D_f(P_0\|Q)>0$ unless $f(x) = c(x-1)$ for some constant $c$, and the proof is complete.
Definition 7.21 (Regular single-parameter families) Fix τ > 0, space X and a family
Pt of distributions on X , t ∈ [0, τ ). We define the following types of conditions that we call
regularity at t = 0:
(a) Pt (dx) = pt (x) μ(dx), for some measurable (t, x) 7→ pt (x) ∈ R+ and a fixed measure μ on X ;
(b₀) There exists a measurable function $(s,x)\mapsto\dot p_s(x)$, $s\in[0,\tau)$, $x\in\mathcal{X}$, such that for $\mu$-almost every $x_0$ we have $\int_0^\tau|\dot p_s(x_0)|\,ds<\infty$ and
$$p_t(x_0) = p_0(x_0) + \int_0^t\dot p_s(x_0)\,ds. \qquad (7.63)$$
Furthermore, for $\mu$-almost every $x_0$ we have $\lim_{t\searrow0}\dot p_t(x_0) = \dot p_0(x_0)$.
(b₁) We have $\dot p_t(x) = 0$ whenever $p_0(x) = 0$ and, furthermore,
$$\int_{\mathcal{X}}\mu(dx)\sup_{0\le t<\tau}\frac{(\dot p_t(x))^2}{p_0(x)} < \infty. \qquad (7.64)$$
(c₀) There exists a measurable function $(s,x)\mapsto\dot h_s(x)$, $s\in[0,\tau)$, $x\in\mathcal{X}$, such that for $\mu$-almost every $x_0$ we have $\int_0^\tau|\dot h_s(x_0)|\,ds<\infty$ and
$$h_t(x_0) \triangleq \sqrt{p_t(x_0)} = \sqrt{p_0(x_0)} + \int_0^t\dot h_s(x_0)\,ds. \qquad (7.65)$$
Furthermore, for $\mu$-almost every $x_0$ we have $\lim_{t\searrow0}\dot h_t(x_0) = \dot h_0(x_0)$.
(c₁) The family of functions $\{(\dot h_t(x))^2 : t\in[0,\tau)\}$ is uniformly $\mu$-integrable.
Remark 7.12 Recall that the uniform integrability condition (c₁) is implied by the following stronger (but easier to verify) condition:
$$\int_{\mathcal{X}}\mu(dx)\sup_{0\le t<\tau}(\dot h_t(x))^2 < \infty. \qquad (7.66)$$
Impressively, if one also assumes the continuous differentiability of $h_t$ then the uniform integrability condition becomes equivalent to the continuity of the Fisher information
$$t\mapsto J_F(t) \triangleq 4\int\mu(dx)\,(\dot h_t(x))^2. \qquad (7.67)$$
Theorem 7.22 Let the family of distributions $\{P_t : t\in[0,\tau)\}$ satisfy conditions (a), (b₀) and (b₁) in Definition 7.21. Then we have
$$\chi^2(P_t\|P_0) = J_F(0)\,t^2 + o(t^2), \qquad (7.68)$$
$$D(P_t\|P_0) = \frac{\log e}{2}\,J_F(0)\,t^2 + o(t^2), \qquad (7.69)$$
where $J_F(0) \triangleq \int_{\mathcal{X}}\mu(dx)\frac{(\dot p_0(x))^2}{p_0(x)} < \infty$ is the Fisher information at $t=0$.
Proof. From assumption (b₁) we see that for any $x_0$ with $p_0(x_0) = 0$ we must have $\dot p_t(x_0) = 0$ and thus $p_t(x_0) = 0$ for all $t\in[0,\tau)$. Hence, we may restrict all integrals below to the subset $\{x: p_0(x)>0\}$, on which the ratio $\frac{(p_t(x)-p_0(x))^2}{p_0(x)}$ is well-defined. Consequently, we have by (7.63)
$$\frac1{t^2}\chi^2(P_t\|P_0) = \frac1{t^2}\int\mu(dx)\frac{(p_t(x)-p_0(x))^2}{p_0(x)} = \frac1{t^2}\int\mu(dx)\frac1{p_0(x)}\left(t\int_0^1 du\,\dot p_{tu}(x)\right)^2 \stackrel{(a)}{=} \int\mu(dx)\int_0^1 du_1\int_0^1 du_2\,\frac{\dot p_{tu_1}(x)\,\dot p_{tu_2}(x)}{p_0(x)}.$$
Note that by the continuity assumption in (b₀) we have $\dot p_{tu_1}(x)\dot p_{tu_2}(x)\to\dot p_0^2(x)$ for every $(u_1,u_2,x)$ as $t\to0$. Furthermore, we also have $\frac{|\dot p_{tu_1}(x)\dot p_{tu_2}(x)|}{p_0(x)}\le\sup_{0\le t<\tau}\frac{(\dot p_t(x))^2}{p_0(x)}$, which is integrable by (7.64). Consequently, application of the dominated convergence theorem to the integral in (a) concludes the proof of (7.68).
We next show that for any f-divergence with twice continuously differentiable $f$ (and in fact, without assuming (7.64)) we have
$$\liminf_{t\to0}\frac1{t^2} D_f(P_t\|P_0) \ge \frac{f''(1)}{2}J_F(0). \qquad (7.70)$$
Indeed, similar to (7.61) we get
$$D_f(P_t\|P_0) = \int_0^1 dz\,(1-z)\,\mathbb{E}_{X\sim P_0}\!\left[f''\!\left(1+z\frac{p_t(X)-p_0(X)}{p_0(X)}\right)\left(\frac{p_t(X)-p_0(X)}{p_0(X)}\right)^2\right]. \qquad (7.71)$$
The first fraction inside the bracket is between 0 and 1, and the second is bounded by $\sup_{0<t<\tau}\left(\frac{\dot p_t(X)}{p_0(X)}\right)^2$, which is $P_0$-integrable by (b₁). Thus, the dominated convergence theorem applies to the double integral in (7.71) and we obtain
$$\lim_{t\to0}\frac1{t^2}D(P_t\|P_0) = (\log e)\int_0^1 dz\,\mathbb{E}_{X\sim P_0}\!\left[(1-z)\left(\frac{\dot p_0(X)}{p_0(X)}\right)^2\right] = \frac{\log e}{2}J_F(0),$$
which is exactly (7.69).
Remark 7.13 Theorem 7.22 extends to the case of multi-dimensional parameters as follows. Define the Fisher information matrix at $\theta\in\mathbb{R}^d$:
$$J_F(\theta) \triangleq 4\int\mu(dx)\,\nabla_\theta\sqrt{p_\theta(x)}\,\nabla_\theta\sqrt{p_\theta(x)}^\top. \qquad (7.73)$$
Then (7.68) becomes $\chi^2(P_t\|P_0) = t^\top J_F(0)\,t + o(\|t\|^2)$ as $t\to0$, and similarly for (7.69), which has previously appeared in (2.34).
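A numerical sketch of the local quadratic behavior (7.68)-(7.69), for the assumed Gaussian location family $P_t = \mathcal{N}(t,\sigma^2)$ with $J_F(0) = 1/\sigma^2$ (divergences in nats, computed by quadrature; the grid and parameter values are illustration choices):

```python
import numpy as np
from scipy.stats import norm

sigma = 1.5
JF0 = 1 / sigma**2
x = np.linspace(-20, 20, 400001)
dx = x[1] - x[0]
p0 = norm.pdf(x, 0.0, sigma)

for t in [0.5, 0.1, 0.02]:
    pt = norm.pdf(x, t, sigma)
    chi2 = np.sum((pt - p0) ** 2 / p0) * dx
    kl = np.sum(pt * np.log(pt / p0)) * dx
    # ratios should approach 1 and 1/2 respectively as t -> 0
    print(t, chi2 / (JF0 * t**2), kl / (JF0 * t**2))
```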
Theorem 7.22 applies to many cases (e.g. to smooth subfamilies of exponential families, for
which one can take μ = P0 and p0 (x) ≡ 1), but it is not sufficiently general. To demonstrate the
issue, consider the following example.
Example 7.1 (Location families with compact support) We say that family Pt is a
(scalar) location family if X = R, μ = Leb and pt (x) = p0 (x − t). Consider the following
with $C_\alpha$ chosen from normalization. Clearly, here condition (7.64) is not satisfied and both $\chi^2(P_t\|P_0)$ and $D(P_t\|P_0)$ are infinite for $t>0$, since $P_t\not\ll P_0$. But $J_F(0)<\infty$ whenever $\alpha>1$ and thus one expects that a certain remedy should be possible. Indeed, one can compute those f-divergences that are finite for $P_t\not\ll P_0$ and find that for $\alpha>1$ they are quadratic in $t$. As an illustration, we have
$$H^2(P_t, P_0) = \begin{cases}\Theta(t^{1+\alpha}), & 0\le\alpha<1\\ \Theta(t^2\log\frac1t), & \alpha=1\\ \Theta(t^2), & \alpha>1\end{cases} \qquad (7.74)$$
as $t\to0$. This can be computed directly, or from more general results of [222, Theorem VI.1.1].⁵ For a relation between the Hellinger distance and Fisher information see also (VI.5).
The previous example suggests that quadratic behavior as $t\to0$ can hold even when $P_t\not\ll P_0$, which is the case handled by the next (more technical) result, whose proof we placed in Section 7.14*. One can verify that condition (c₁) is indeed satisfied for all $\alpha>1$ in Example 7.1, thus establishing the quadratic behavior. Also note that the stronger (7.66) only applies to $\alpha\ge2$.
Theorem 7.23 Given a family of distributions $\{P_t : t\in[0,\tau)\}$ satisfying the conditions (a), (c₀) and (c₁) of Definition 7.21, we have
$$\chi^2(P_t\|\bar\epsilon P_0 + \epsilon P_t) = t^2\bar\epsilon^2\left(J_F(0) + \frac{1-4\epsilon}{\epsilon}J^\#(0)\right) + o(t^2), \qquad \forall\epsilon\in(0,1), \qquad (7.75)$$
$$H^2(P_t, P_0) = \frac{t^2}{4}J_F(0) + o(t^2), \qquad (7.76)$$
where $J_F(0) = 4\int\dot h_0^2\,d\mu < \infty$ is the Fisher information and $J^\#(0) = \int\dot h_0^2\,1\{h_0=0\}\,d\mu$ can be called the Fisher defect at $t=0$.
Example 7.2 (On Fisher defect) Note that in most cases of interest we will have the situa-
tion that t 7→ ht (x) is actually differentiable for all t in some two-sided neighborhood (−τ, τ ) of
0. In such cases, $h_0(x) = 0$ implies that $t=0$ is a local minimum and thus $\dot h_0(x) = 0$, implying that
⁵ The statistical significance of this calculation is that if we were to estimate the location parameter $t$ from $n$ iid observations, then the precision $\delta_n^*$ of the optimal estimator is, up to constant factors, given by solving $H^2(P_{\delta_n^*}, P_0) \asymp \frac1n$, cf. [222, Chapter VI]. For $\alpha<1$ we have $\delta_n^* \asymp n^{-\frac{1}{1+\alpha}}$, which is notably better than the empirical mean estimator (attaining precision of only $n^{-\frac12}$). For $\alpha=1/2$ this fact was noted by D. Bernoulli in 1777 as a consequence of his (newly proposed) maximum likelihood estimation.
the defect $J^\#(0) = 0$. However, for other families this will not be so, sometimes even when $p_t(x)$ is smooth in $t\in(-\tau,\tau)$ (but $h_t$ is not). Here is such an example.
Consider Pt = Ber(t2 ). A straightforward calculation shows:
$$\chi^2(P_t\|\bar\epsilon P_0+\epsilon P_t) = \frac{\bar\epsilon^2}{\epsilon}t^2 + O(t^4), \qquad H^2(P_t, P_0) = 2\left(1-\sqrt{1-t^2}\right) = t^2 + O(t^4).$$
Taking $\mu(\{0\}) = \mu(\{1\}) = 1$ to be the counting measure, we get the following:
$$h_t(x) = \begin{cases}\sqrt{1-t^2}, & x=0\\ |t|, & x=1\end{cases}, \qquad \dot h_t(x) = \begin{cases}\dfrac{-t}{\sqrt{1-t^2}}, & x=0\\ \mathrm{sign}(t), & x=1,\ t\ne0\\ 1, & x=1,\ t=0 \text{ (just as an agreement)}\end{cases}.$$
Note that if we view Pt as a family on t ∈ [0, τ ) for small τ , then all conditions (a), (c0 ) and
(c1 ) are clearly satisfied (ḣt is bounded on t ∈ (−τ, τ )). We have JF (0) = 4 and J# (0) = 1 and
thus (7.75) recovers the correct expansion for χ2 and (7.76) for H2 .
Notice that the non-smoothness of ht only becomes visible if we extend the domain to t ∈
(−τ, τ ). In fact, this issue is not seen in terms of densities pt . Indeed, let us compute the density pt
and its derivative ṗt explicitly too:
$$p_t(x) = \begin{cases}1-t^2, & x=0\\ t^2, & x=1\end{cases}, \qquad \dot p_t(x) = \begin{cases}-2t, & x=0\\ 2t, & x=1\end{cases},$$
is discontinuous at t = 0. To make things worse, at t = 0 this expectation does not match our
definition of the Fisher information JF (0) in Theorem 7.23, and thus does not yield the correct
small-t behavior for either χ2 or H2 . In general, to avoid difficulties one should restrict to those
families with t 7→ ht (x) continuously differentiable in t ∈ (−τ, τ ).
Definition 7.24 For any $\lambda\in\mathbb{R}\setminus\{0,1\}$, the Rényi divergence of order $\lambda$ between probability distributions $P$ and $Q$ is defined as
$$D_\lambda(P\|Q) \triangleq \frac{1}{\lambda-1}\log\mathbb{E}_Q\left[\left(\frac{dP}{dQ}\right)^\lambda\right]$$
– see Definition 7.1. Extending Definition 2.14 of conditional KL divergence and assuming the same setup, the conditional Rényi divergence is defined as
Numerous properties of Rényi divergences are known, see [432]. Here we only notice a few:
which is a simple consequence of (7.77). Dλ ’s are the only divergences satisfying DPI and
tensorization [310]. The most well-known special cases of (7.79) are for Hellinger distance
(see (7.26)) and for χ2 :
$$1+\chi^2\!\left(\prod_{i=1}^n P_i\,\Big\|\,\prod_{i=1}^n Q_i\right) = \prod_{i=1}^n\left(1+\chi^2(P_i\|Q_i)\right).$$
We can also obtain additive bounds for non-product distributions, see Ex. I.42 and I.43.
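A quick numerical check of the $\chi^2$ tensorization identity above, for products of small discrete distributions (the alphabet sizes and random draws are illustration choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def chi2(p, q):
    return float(np.sum((p - q) ** 2 / q))

ps = [rng.dirichlet(np.ones(3)) for _ in range(4)]
qs = [rng.dirichlet(np.ones(3)) for _ in range(4)]

# build the product distributions explicitly
P, Q = np.array([1.0]), np.array([1.0])
for p, q in zip(ps, qs):
    P = np.outer(P, p).ravel()
    Q = np.outer(Q, q).ravel()

lhs = 1 + chi2(P, Q)
rhs = np.prod([1 + chi2(p, q) for p, q in zip(ps, qs)])
print(lhs, rhs)   # should coincide up to floating-point error
```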
The following consequence of the chain rule will be crucial in statistical applications later (see
Section 32.2, in particular, Theorem 32.8).
Proposition 7.25 Consider product channels $P_{Y^n|X^n} = \prod P_{Y_i|X_i}$ and $Q_{Y^n|X^n} = \prod Q_{Y_i|X_i}$. We have (with all optimizations over all possible distributions)
$$\inf_{P_{X^n},Q_{X^n}} D_\lambda(P_{Y^n}\|Q_{Y^n}) = \sum_{i=1}^n\inf_{P_{X_i},Q_{X_i}} D_\lambda(P_{Y_i}\|Q_{Y_i}), \qquad (7.80)$$
$$\sup_{P_{X^n},Q_{X^n}} D_\lambda(P_{Y^n}\|Q_{Y^n}) = \sum_{i=1}^n\sup_{P_{X_i},Q_{X_i}} D_\lambda(P_{Y_i}\|Q_{Y_i}) = \sum_{i=1}^n\sup_{x,x'} D_\lambda(P_{Y_i|X_i=x}\|Q_{Y_i|X_i=x'}). \qquad (7.81)$$
Remark 7.14 The mnemonic for (7.82)-(7.83) is that “mixtures of products are less distin-
guishable than products of mixtures". The former arise in statistical settings where iid observations are drawn from a single distribution whose parameter is drawn from a prior.
Proof. The second equality in (7.81) follows from the fact that Dλ is an increasing function
of an f-divergence, and thus maximization should be attained at an extreme point of the space
of probabilities, which are just the single-point masses. The main equalities (7.80)-(7.81) follow
from a) restricting optimizations to product distributions and invoking (7.79); and b) the chain rule
for Dλ . For example for n = 2, we fix PX2 and QX2 , which (via channels) induce joint distributions
PX2 ,Y2 and QX2 ,Y2 . Then we have
since PY1 |Y2 =y is a distribution induced by taking P̃X1 = PX1 |Y2 =y , and similarly for QY1 |Y2 =y′ . In
all, we get
$$D_\lambda(P_{Y^2}\|Q_{Y^2}) = D_\lambda(P_{Y_2}\|Q_{Y_2}) + D_\lambda(P_{Y_1|Y_2}\|Q_{Y_1|Y_2}\,|\,P^{(\lambda)}_{Y_2}) \ge \sum_{i=1}^2\inf_{P_{X_i},Q_{X_i}} D_\lambda(P_{Y_i}\|Q_{Y_i}),$$
Denote the domain of f∗ by dom(f∗ ) ≜ {y : f∗ (y) < ∞}. Two important properties of the convex
conjugates are
$$f(x) + f^*(y) \ge xy.$$
Similarly, we can define a convex conjugate for any convex functional $\Psi(P)$ defined on the space of measures, by setting
$$\Psi^*(g) = \sup_P\int g\,dP - \Psi(P). \qquad (7.85)$$
Under appropriate conditions (e.g. finite $\mathcal{X}$), biconjugation then yields the sought-after variational representation
$$\Psi(P) = \sup_g\int g\,dP - \Psi^*(g). \qquad (7.86)$$
We will now compute these conjugates for $\Psi(P) = D_f(P\|Q)$. It turns out to be convenient to first extend the definition of $D_f(P\|Q)$ to all finite signed measures $P$ and then compute the conjugate. To this end, let $f_{ext}:\mathbb{R}\to\mathbb{R}\cup\{+\infty\}$ be an extension of $f$, such that $f_{ext}(x) = f(x)$ for $x\ge0$ and $f_{ext}$ is convex on $\mathbb{R}$. In general, we can always choose $f_{ext}(x) = \infty$ for all $x<0$. In special cases, e.g. $f(x) = |x-1|/2$ or $f(x) = (x-1)^2$, we can directly take $f_{ext}(x) = f(x)$ for all $x$. Now we can define $D_f(P\|Q)$ for all signed measures $P$ in the same way as in Definition 7.1, using $f_{ext}$ in place of $f$.
For each choice of fext we have a variational representation of f-divergence:
Theorem 7.26 Let $P$ and $Q$ be probability measures on $\mathcal{X}$. Fix an extension $f_{ext}$ of $f$ and let $f^*_{ext}$ be the conjugate of $f_{ext}$, i.e., $f^*_{ext}(y) = \sup_{x\in\mathbb{R}} xy - f_{ext}(x)$. Denote $\mathrm{dom}(f^*_{ext}) \triangleq \{y : f^*_{ext}(y) < \infty\}$. Then
$$D_f(P\|Q) = \sup_{g:\mathcal{X}\to\mathrm{dom}(f^*_{ext})} \mathbb{E}_P[g(X)] - \mathbb{E}_Q[f^*_{ext}(g(X))], \qquad (7.87)$$
where the supremum can be taken either (a) over all simple $g$ or (b) over all $g$ satisfying $\mathbb{E}_Q[f^*_{ext}(g(X))] < \infty$.
We remark that when $P\ll Q$ then both results (a) and (b) also hold for the supremum over $g:\mathcal{X}\to\mathbb{R}$, i.e. without restricting $g(x)\in\mathrm{dom}(f^*_{ext})$.
As a consequence of the variational characterization, we get the following properties for f-
divergences:
1 Convexity: First of all, note that Df (PkQ) is expressed as a supremum of affine functions (since
the expectation is a linear operation). As a result, we get that (P, Q) 7→ Df (PkQ) is convex,
which was proved previously in Theorem 7.5 using a different method.
2 Weak lower semicontinuity: Recall the example in Remark 4.5, where $\{X_i\}$ are i.i.d. Rademachers ($\pm1$), and
$$\frac{\sum_{i=1}^n X_i}{\sqrt{n}} \xrightarrow{d} \mathcal{N}(0,1)$$
by the central limit theorem; however, by Proposition 7.2, for all $n$,
$$D_f\!\left(P_{\frac{X_1+X_2+\cdots+X_n}{\sqrt n}}\,\Big\|\,\mathcal{N}(0,1)\right) = f(0) + f'(\infty) > 0,$$
since the former distribution is discrete and the latter is continuous. Therefore, similarly to the KL divergence, the best we can hope for an f-divergence is semicontinuity. Indeed, if $\mathcal{X}$ is a nice space (e.g., Euclidean space), in (7.87) we can restrict the function $g$ to continuous bounded functions, in which case $D_f(P\|Q)$ is expressed as a supremum of weakly continuous functionals (note that $f^*\circ g$ is also continuous and bounded since $f^*$ is continuous) and is hence weakly lower semicontinuous, i.e., for any sequences of distributions $P_n$ and $Q_n$ such that $P_n\xrightarrow{w} P$ and $Q_n\xrightarrow{w} Q$, we have
Example 7.3 (Total variation and Hellinger) For total variation, we have f(x) = 12 |x − 1|.
Consider the extension fext (x) = 21 |x − 1| for x ∈ R. Then
∗ 1 +∞ if |y| > 1
fext (y) = sup xy − |x − 1| = 2 .
x 2 y if |y| ≤ 1
2
which previously appeared in (7.18). A calculation for squared Hellinger yields f∗ext (y) = y
1−y with
y ∈ (−∞, 1) and, thus, after changing from g to h = 1 − g in (7.87), we obtain
1
H2 (P, Q) = 2 − inf EP [h] + EQ [ ] .
h>0 h
As an application, consider f : X → [0, 1] and τ ∈ (0, 1), so that h = 1 − τ f satisfies 1
h ≤ 1 + 1−τ
τ
f.
Then the previous characterization implies
1 1
EP [f] ≤ EQ [f] + H2 (P, Q) ∀f : X → [0, 1], ∀τ ∈ (0, 1) .
1−τ τ
Example 7.4 (χ2 -divergence) For χ2 -divergence we have f(x) = (x − 1)2 . Take fext (x) =
y2
(x − 1)2 , whose conjugate is f∗ext (y) = y + 4.
Applying (7.87) yields
" #
2
g ( X )
χ2 (PkQ) = sup EP [g(X)] − EQ g(X) + (7.89)
g:X →R 4
= sup 2EP [g(X)] − EQ [g2 (X)] − 1 (7.90)
g:X →R
i i
i i
i i
150
This characterization is the basis of an influential modern method of density estimation, known
as generative adversarial networks (GANs) [193]. Here is its essence. Suppose that we are trying to
approximate a very complicated distribution P on Rd by representing it as (the law of) a generator
map G : Rm → Rd applied to a standard normal Z ∼ N (0, Im ). The idea of [193] is to search for
a good G by minimizing JS(P, PG(Z) ). Due to the variational characterization we can equivalently
formulate this problem as
(and in this context the test function h is called a discriminator or, less often, a critic). Since the
distribution P is only available to us through a sample of iid observations x1 , . . . , xn ∼ P, we
approximate this minimax problem by
1X
n
inf sup log h(xi ) + EZ∼N [log(1 − h(G(Z))] .
G h n
i=1
In order to be able to solve this problem another idea of [193] is to approximate the intractable
optimizations over the infinite-dimensional function spaces of G and h by an optimization over
neural networks. This is implemented via alternating gradient ascent/descent steps over the
(finite-dimensional) parameter spaces defining the neural networks of G and h. Following the
breakthrough of [193] variations on their idea resulted in finding G(Z)’s that yielded incredibly
realistic images, music, videos, 3D scenery and more.
Example 7.6 (KL-divergence) In this case we have f(x) = x log x. Consider the extension
fext (x) = ∞ for x < 0, whose convex conjugate is f∗ (y) = log e
e exp(y). Hence (7.87) yields
Note that in the last example, the variational representation (7.92) we obtained for the KL
divergence is not the same as the Donsker-Varadhan identity in Theorem 4.6, that is,
In fact, (7.92) is weaker than (7.93) in the sense that for each choice of g, the obtained lower bound
on D(PkQ) in the RHS is smaller. Furthermore, regardless of the choice of fext , the Donsker-
Varadhan representation can never be obtained from Theorem 7.26 because, unlike (7.93), the
second term in (7.87) is always linear in Q. It turns out if we define Df (PkQ) = ∞ for all non-
probability measure P, and compute its convex conjugate, we obtain in the next theorem a different
type of variational representation, which, specialized to KL divergence in Example 7.6, recovers
exactly the Donsker-Varadhan identity.
i i
i i
i i
Theorem 7.27 Consider the extension fext of f such that fext (x) = ∞ for x < 0. Let S = {x :
q(x) > 0} where q is as in (7.2). Then
Df (PkQ) = f′ (∞)P[Sc ] + sup EP [g1S ] − Ψ∗Q,P (g) , (7.94)
g
where
Ψ∗Q,P (g) ≜ inf EQ [f∗ext (g(X) − a)] + aP[S].
a∈R
′
In the special case f (∞) = ∞, we have
Df (PkQ) = sup EP [g] − Ψ∗Q (g), Ψ∗Q (g) ≜ inf EQ [f∗ext (g(X) − a)] + a. (7.95)
g a∈R
Remark 7.15 (Marton’s divergence) Recall that in Theorem 7.7 we have shown both the
sup and inf characterizations for the TV. Do other f-divergences also possess inf characterizations?
The only other known example (to us) is due to Marton. Let
Z 2
dP
Dm (PkQ) = dQ 1 − ,
dQ +
which is clearly an f-divergence with f(x) = (1 − x)2+ . We have the following [69, Lemma 8.3]:
Dm (PkQ) = inf{E[P[X 6= Y|Y]2 ] : X ∼ P, Y ∼ Q} ,
where the infimum is over all couplings of P and Q. See Ex. I.44.
Marton’s Dm divergence plays a crucial role in the theory of concentration of measure [69,
Chapter 8]. Note also that while Theorem 7.20 does not apply to Dm , due to the absence of twice
continuous differentiability, it does apply to the symmetrized Marton divergence Dsm (PkQ) ≜
Dm (PkQ) + Dm (QkP).
We end this section by describing some properties of Fisher information akin to those of f-
divergences. In view of its role in the local expansion, we expect the Fisher information to inherit
these properties such as monotonicity, data processing inequality, and the variational representa-
tion. Indeed the first two can be established directly; see Exercise I.46. In [220] Huber introduced
the following variational extension of the Fisher information (in the location family) (2.40) of a
density on R: for any P ∈ P(R), define
EP [h′ (X)]2
J(P) = sup (7.96)
h EP [h(X)2 ]
where the supremum is over all test functions h ∈ C1c that are continuously differentiable and
compactly supported such that EP [h(X)2 ] > 0. Huber showed that J(P) < ∞ if and only if P
R
has an absolutely continuous density p such that (p′ )2 /p < ∞, in which case (7.96) agrees
with the usual definition (2.40).6 This sup-representation can be anticipated by combining the
6
As an example in the reverse direction, J(Unif(0, 1)) = ∞ which follows from choosing test functions such as
h(x) = cos2 xπ
ϵ
1 {|x| ≤ ϵ/2} and ϵ → 0.
i i
i i
i i
152
variational representation (7.91) of χ2 -divergence and its local expansion (7.68) that involves the
Fisher information. Indeed, setting aside regularity conditions, by Taylor expansion we have
(E[h(X + t) − h(X)])2 E[h′ (X)]2 2
χ2 (Pt kP) = sup = sup · t + o(t2 ),
E[h2 (X)] E[h2 (X)]
which is also χ2 (Pt kP) = J(P)t2 + o(t2 ). A direct proof can be given applying integration by parts
R R R R
and Cauchy-Schwarz: ( ph′ )2 = ( p′ h)2 ≤ h2 p (p′ )2 /p, which also shows the optimal test
function is given by the score function h = p′ /p; for details, see [220, Theorem 4.2].
where we also used continuity ḣt (x) → ḣ0 (x) by assumption (c0 ).
Substituting the integral expression for g(t, x) into (7.97) we obtain
Z Z 1 Z 1
L(t) = μ(dx) du1 du2 ġ(tu1 , x)ġ(tu2 , x) . (7.100)
0 0
i i
i i
i i
7.14* Technical proofs: convexity, local expansions and variational representations 153
Since |ġ(s, x)| ≤ C|hs (x)| for some C = C(ϵ), we have from Cauchy-Schwarz
Z Z
μ(dx)|ġ(s1 , x)ġ(s2 , x)| ≤ C2 sup μ(dx)ḣt (x)2 < ∞ . (7.101)
t X
where the last inequality follows from the uniform integrability assumption (c1 ). This implies that
Fubini’s theorem applies in (7.100) and we obtain
Z 1 Z 1 Z
L(t) = du1 du2 G(tu1 , tu2 ) , G(s1 , s2 ) ≜ μ(dx)ġ(s1 , x)ġ(s2 , x) .
0 0
Notice that if a family of functions {fα (x) : α ∈ I} is uniformly square-integrable, then the family
{fα (x)fβ (x) : α ∈ I, β ∈ I} is uniformly integrable simply because apply |fα fβ | ≤ 12 (f2α + f2β ).
Consequently, from the assumption (c1 ) we see that the integral defining G(s1 , s2 ) allows passing
the limit over s1 , s2 inside the integral. From (7.99) we get as t → 0
Z
1 1 − 4ϵ #
G(tu1 , tu2 ) → G(0, 0) = μ(dx)ḣ0 (x) 4 · 1{h0 > 0} + 1{h0 = 0} = JF (0)+
2
J ( 0) .
ϵ ϵ
From (7.101) we see that G(s1 , s2 ) is bounded and thus, the bounded convergence theorem applies
and
Z 1 Z 1
lim du1 du2 G(tu1 , tu2 ) = G(0, 0) ,
t→0 0 0
which thus concludes the proof of L(t) → JF (0) and of (7.75) assuming facts about ϕ. Let us
verify those.
For simplicity, in the next paragraph we omit the argument x in h0 (x) and ϕ(·; x). A straightfor-
ward differentiation yields
h20 (1 − 2ϵ ) + 2ϵ h2
ϕ′ (h) = 2h .
(ϵ̄h20 + ϵh2 )3/2
h20 (1− ϵ2 )+ ϵ2 h2 1−ϵ/2
Since √ h
≤ √1
ϵ
and ϵ̄h20 +ϵh2
≤ 1−ϵ we obtain the finiteness of ϕ′ . For the continuity
ϵ̄h20 +ϵh2
of ϕ′ notice that if h0 > 0 then clearly the function is continuous, whereas for h0 = 0 we have
ϕ′ (h) = √1ϵ for all h.
We next proceed to the Hellinger distance. Just like in the argument above, we define
Z Z 1 Z 1
1
M(t) ≜ 2 H2 (Pt , P0 ) = μ(dx) du1 du2 ḣtu1 (x)ḣtu2 (x) .
t 0 0
R
Exactly as above from Cauchy-Schwarz and supt μ(dx)ḣt (x)2 < ∞ we conclude that Fubini
applies and hence
Z 1 Z 1 Z
M(t) = du1 du2 H(tu1 , tu2 ) , H(s1 , s2 ) ≜ μ(dx)ḣs1 (x)ḣs2 (x) .
0 0
Again, the family {ḣs1 ḣs2 : s1 ∈ [0, τ ), s2 ∈ [0, τ } is uniformly integrable and thus from (c0 ) we
conclude H(tu1 , tu2 ) → 14 JF (0). Furthermore, similar to (7.101) we see that H(s1 , s2 ) is bounded
i i
i i
i i
154
and thus
Z 1 Z 1
1
lim M(t) = du1 du2 lim H(tu1 , tu2 ) = JF ( 0 ) ,
t→0 0 0 t→0 4
concluding the proof of (7.76).
Proof of Theorem 7.6. The lower bound Df (PkQ) ≥ Df (PE kQE ) follows from the DPI. To prove
an upper bound, first we reduce to the case of f ≥ 0 by property 6 in Proposition 7.2. Then define
sets S = suppQ, F∞ = { dQdP
= 0} and for a fixed ϵ > 0 let
dP
Fm = ϵm ≤ f < ϵ(m + 1) , m = 0, 1, . . . .
dQ
We have
X Z X
dP
ϵ mQ[Fm ] ≤ dQf ≤ϵ (m + 1)Q[Fm ] + f(0)Q[F∞ ]
m S dQ m
X
≤ϵ mQ[Fm ] + f(0)Q[F∞ ] + ϵ . (7.102)
m
P[F±
m]
f ≥ ϵm .
Q[ F ±
m]
− −
0 , F0 , . . . , Fn , Fn , F∞ , S , ∪m>n Fm }. For this
Next, define the partition consisting of sets E = {F+ + c
We next show that with sufficiently large n and sufficiently small ϵ the RHS of (7.103)
approaches Df (PkQ). If f(0)Q[F∞ ] = ∞ (and hence Df (PkQ) = ∞) then clearly (7.103) is also
infinite. Thus,
assume
that f(0)Q[F∞ ] < ∞.
R
If S dQf dQ = ∞ then the sum over m on the RHS of (7.102) is also infinite, and hence
dP
P
for any N > 0 there exists some n such that m≤n mQ[Fm ] ≥ N, thus showing that RHS
R
for (7.103) can be made arbitrarily large. Thus, assume S dQf dQdP
< ∞. Considering LHS
P
of (7.102) we conclude that for some large n we have m>n mQ[Fm ] ≤ 12 . Then, we must have
again from (7.102)
X Z
dP 3
ϵ mQ[Fm ] + f(0)Q[F∞ ] ≥ dQf − ϵ.
S dQ 2
m≤n
i i
i i
i i
7.14* Technical proofs: convexity, local expansions and variational representations 155
Thus, we have shown that for arbitrary ϵ > 0 the RHS of (7.103) can be made greater than
Df (PkQ) − 32 ϵ.
Proof of Theorem 7.26. First, we show that for any g : X → dom(f∗ext ) we must have
EP [g(X)] ≤ Df (PkQ) + EQ [f∗ext (g(X))] . (7.104)
Let p(·) and q(·) be the densities of P and Q. Then, from the definition of f∗ext we have for every x
s.t. q(x) > 0:
p ( x) p ( x)
f∗ext (g(x)) + fext ( ) ≥ g ( x) .
q ( x) q ( x)
Integrating this over dQ = q dμ restricted to the set {q > 0} we get
Z
p ( x)
EQ [f∗ext (g(X))] + q(x)fext ( ) dμ ≥ EP [g(X)1{q(X) > 0}] . (7.105)
q>0 q ( x)
Now, notice that
fext (x)
sup{y : y ∈ dom(f∗ext )} = lim = f′ (∞) (7.106)
x→∞ x
Therefore, f′ (∞)P[q(X) = 0] ≥ EP [g(X)1{q(X) = 0}]. Summing the latter inequality with (7.105)
we obtain (7.104).
Next we prove that supremum in (7.87) over simple functions g does yield Df (PkQ), so that
inequality (7.104) is tight. Armed with Theorem 7.6, it suffices to show (7.87) for finite X . Indeed,
for general X , given a finite partition E = {E1 , . . . , En } of X , we say a function g : X → R is
E -compatible if g is constant on each Ei ∈ E . Taking the supremum over all finite partitions E we
get
Df (PkQ) = sup Df (PE kQE )
E
where the last step follows is because the two suprema combined is equivalent to the supremum
over all simple (finitely-valued) functions g.
Next, consider finite X . Let S = {x ∈ X : Q(x) > 0} denote the support of Q. We show the
following statement
Df (PkQ) = sup EP [g(X)] − EQ [f∗ext (g(X))] + f′ (∞)P(Sc ), (7.107)
g:S→dom(f∗
ext )
i i
i i
i i
156
Consider the functional Ψ(P) defined above where P takes values over all signed measures on S,
which can be identified with RS . The convex conjugate of Ψ(P) is as follows: for any g : S → R,
( )
X P ( x)
∗ ∗
Ψ (g) = sup P(x)g(x) − Q(x) sup h − fext (h)
P x h∈dom(f∗
ext )
Q ( x)
X
= sup inf ∗ P(x)(g(x) − h(x)) + Q(x)f∗ext (h(x))
P h:S→dom(fext ) x
( a) X
= inf sup P(x)(g(x) − h(x)) + EQ [f∗ext (h)]
h:S→dom(f∗
ext ) P
x
(
EQ [f∗ext (g(X))] g : S → dom(f∗ext )
= .
+∞ otherwise
where (a) follows from the minimax theorem (which applies due to finiteness of X ). Applying
the convex duality in (7.86) yields the proof of the desired (7.107).
Proof of Theorem 7.27. First we argue that the supremum in the right-hand side of (7.94) can
be taken over all simple functions g. Then thanks to Theorem 7.6, it will suffice to consider finite
alphabet X . To that end, fix any g. For any δ , there exists a such that EQ [f∗ext (g − a)] − aP[S] ≤
Ψ∗Q,P (g) + δ . Since EQ [f∗ext (g − an )] can be approximated arbitrarily well by simple functions we
conclude that there exists a simple function g̃ such that simultaneously EP [g̃1S ] ≥ EP [g1S ] − δ and
This implies that restricting to simple functions in the supremization in (7.94) does not change the
right-hand side.
Next consider finite X . We proceed to compute the conjugate of Ψ, where Ψ(P) ≜ Df (PkQ) if
P is a probability measure on X and +∞ otherwise. Then for any g : X → R, maximizing over
all probability measures P we have:
X
Ψ∗ (g) = sup P(x)g(x) − Df (PkQ)
P x∈X
X X X
P(x)
= sup P(x)g(x) − P(x)g(x) − Q ( x) f
P x∈X Q ( x)
x∈Sc x∈ S
X X X
= sup inf P(x)[g(x) − h(x)] + P(x)[g(x) − f′ (∞)] + Q(x)f∗ext (h(x))
P h:S→R x∈S x∈Sc x∈S
( ! )
( a) X X
= inf sup P(x)[g(x) − h(x)] + P(x)[g(x) − f′ (∞)] + EQ [f∗ext (h(X))]
h:S→R P x∈ S x∈Sc
(b) ′ ∗
= inf max max g(x) − h(x), maxc g(x) − f (∞) + EQ [fext (h(X))]
h:S→R x∈ S x∈ S
( c) ′ ∗
= inf max a, maxc g(x) − f (∞) + EQ [fext (g(X) − a)]
a∈ R x∈ S
i i
i i
i i
7.14* Technical proofs: convexity, local expansions and variational representations 157
where (a) follows from the minimax theorem; (b) is due to P being a probability measure; (c)
follows since we can restrict to h(x) = g(x) − a for x ∈ S, thanks to the fact that f∗ext is non-
decreasing (since dom(fext ) = R+ ).
From convex duality we have shown that Df (PkQ) = supg EP [g] − Ψ∗ (g). Notice that without
loss of generality we may take g(x) = f′ (∞) + b for x ∈ Sc . Interchanging the optimization over
b with that over a we find that
sup bP[Sc ] − max(a, b) = −aP[S] ,
b
which then recovers (7.94). To get (7.95) simply notice that if P[Sc ] > 0, then both sides of (7.95)
are infinite (since Ψ∗Q (g) does not depend on the values of g outside of S). Otherwise, (7.95)
coincides with (7.94).
i i
i i
i i
A commonly used method in combinatorics for bounding the number of certain objects from above
involves a smart application of Shannon entropy. This method typically proceeds as follows: in
order to count the cardinality of a given set C , we draw an element uniformly at random from
C , whose entropy is given by log |C|. To bound |C| from above, we describe this random object
by a random vector X = (X1 , . . . , Xn ) then proceed to compute or upper-bound the joint entropy
H(X1 , . . . , Xn ) via one of the following methods:
Pn
• Marginal bound: H(X1 , . . . , Xn ) ≤ i=1 H(Xi )
• Pairwise bound (Shearer’s lemma) and generalization cf. Theorem 1.8: H(X1 , . . . , Xn ) ≤
1
P
n−1 i<j H(Xi , Xj ) Pn
• Chain rule (exact calculation): H(X1 , . . . , Xn ) = i=1 H(Xi |X1 , . . . , Xi−1 )
We give three applications using the above three methods, respectively, in the order of increas-
ing difficulty: enumerating binary vectors of a given average weights, counting triangles and other
subgraphs, and Brégman’s theorem.
Finally, to demonstrate how entropy method can also be used for questions in Euclidean spaces,
we prove the Loomis-Whitney and Bollobás-Thomason theorems based on analogous properties
of differential entropy (Section 2.3).
where wH (x) is the Hamming weight (number of 1’s) of x ∈ {0, 1}n . Then |C| ≤ exp{nh(p)}.
158
i i
i i
i i
where pi = P [Xi = 1] is the fraction of vertices whose i-th bit is 1. Note that
1X
n
p= pi ,
n
i=1
since we can either first average over vectors in C or first average across different bits. By Jensen’s
inequality and the fact that x 7→ h(x) is concave,
!
Xn
1X
n
h(pi ) ≤ nh pi = nh(p).
n
i=1 i=1
As a consequence we obtain the following bound on the volume of the Hamming ball, which
will be instrumental much later when we talk about metric entropy (Chapter 27).
Theorem 8.2
k
X n
≤ exp{nh(k/n)}, k ≤ n/2.
j
j=0
Proof. We take C = {x ∈ {0, 1}n : wH (x) ≤ k} and invoke the previous lemma, which says that
k
X n
= |C| ≤ exp{nh(p)} ≤ exp{nh(k/n)},
j
j=0
where the last inequality follows from the fact that x 7→ h(x) is increasing for x ≤ 1/2.
For extensions to non-binary alphabets see Exercise I.1 and I.2. Note that, Theorem 8.2 also
follows from the large deviations theory in Part III:
LHS k 1 RHS
= P ( Bin ( n , 1 / 2) ≤ k ) ≤ exp − nd k = exp{−n(log 2 − h(k/n))} = n ,
2n n 2 2
where the inequality is the Chernoff bound on the binomial tail (see (15.19) in Example 15.1).
i i
i i
i i
160
For graphs H and G, define N(H, G) to be the number of copies of H in G.1 For example,
N( , ) = 4, N( , ) = 8.
If we know G has m edges, what is the maximal number of H that are contained in G? To study
this quantity, we define
N(H, m) = max N(H, G).
G:|E(G)|≤m
Theorem 8.3
∗ ∗
c0 ( H ) m ρ (H)
≤ N(H, m) ≤ c1 (H)mρ (H)
. (8.4)
1
To be precise, here N(H, G) is the number of subgraphs of G (subsets of edges) isomorphic to H. If we denote by inj(H, G)
the number of injective maps V(H) → V(G) mapping edges of H to edges of G, then N(H, G) = |Aut(H)| 1
inj(H, G).
2
If the “∈ [0, 1]” constraints in (8.3) and (8.5) are replaced by “∈ {0, 1}”, we obtain the covering number ρ(H) and the
independence number α(H) of H, respectively.
i i
i i
i i
For example, for triangles we have ρ∗ (K3 ) = 3/2 and Theorem 8.3 is consistent with (8.1).
Proof. Upper bound: Let V(H) = [n] and let w∗ (e) be the solution for ρ∗ (H). For any G with m
edges, draw a subgraph of G, uniformly at random from all those that are isomorphic to H. Given
such a random subgraph set Xi ∈ V(G) to be the vertex corresponding to an i-th vertex of H, i ∈ [n].
∗
Now define a random 2-subset S of [n] by sampling an edge e from E(H) with probability ρw∗ ((He)) .
By the definition of ρ∗ (H) we have for any i ∈ [n] that P[i ∈ S] ≥ ρ∗1(H) . We are now ready to
apply Theorem 1.8:
log N(H, G) = H(X) ≤ H(XS |S)ρ∗ (H) ≤ log(2m)ρ∗ (H) ,
where the last inequality is as before: if S = {v, w} then XS = (Xv , Xw ) takes one of 2m values.
∗
Overall, we get3 N(H, G) ≤ (2m)ρ (H) .
Lower bound: It amounts to construct a graph G with m edges for which N(H, G) ≥
∗
c(H)|e(G)|ρ (H) . Consider the dual LP of (8.3)
X
α∗ (H) = max ψ(v) : ψ(v) + ψ(w) ≤ 1, ∀(vw) ∈ E, ψ(v) ∈ [0, 1] (8.5)
ψ
v∈V(H)
i.e., the fractional packing number. By the duality theorem of LP, we have α∗ (H) = ρ∗ (H). The
graph G is constructed as follows: for each vertex v of H, replicate it for m(v) times. For each edge
e = (vw) of H, replace it by a complete bipartite graph Km(v),m(w) . Then the total number of edges
of G is
X
|E(G)| = m(v)m(w).
(vw)∈E(H)
Q
Furthermore, N(G, H) ≥ v∈V(H) m(v). To minimize the exponent log N(G,H)
log |E(G)| , fix a large number
ψ(v)
M and let m(v) = M , where ψ is the maximizer in (8.5). Then
X
|E(G)| ≤ 4Mψ(v)+ψ(w) ≤ 4M|E(H)|
(vw)∈E(H)
Y ∗
N(G, H) ≥ Mψ(v) = Mα (H)
v∈V(H)
3
Note that for H = K3 this gives a bound weaker than (8.2). To recover (8.2) we need to take X = (X1 , . . . , Xn ) be
uniform on all injective homomorphisms H → G.
i i
i i
i i
162
where Sn denotes the group of all permutations of [n]. For a bipartite graph G with n vertices on
the left and right respectively, the number of perfect matchings in G is given by perm(A), where
A is the adjacency matrix. For example,
perm = 1, perm =2
Theorem 8.4 (Brégman’s Theorem) For any n × n bipartite graph with adjacency matrix
A,
Y
n
1
perm(A) ≤ (di !) di ,
i=1
where di is the degree of left vertex i (i.e. sum of the ith row of A).
As an example, consider G = Kn,n . Then perm(G) = n!, which coincides with the RHS
[(n!)1/n ]n = n!. More generally, if G consists of n/d copies of Kd,d , then Brégman’s bound is
tight and perm = (d!)n/d .
Proof. If perm(A) = 0 then there is nothing to prove, so instead we assume perm(A) > 0 and
some perfect matchings exist. As a first attempt of proving Theorem 8.4 using the entropy method,
we select a perfect matching uniformly at random which matches the ith left vertex to the Xi th right
one. Let X = (X1 , . . . , Xn ). Then
X
n X
n
log perm(A) = H(X) = H(X1 , . . . , Xn ) ≤ H( X i ) ≤ log(di ).
i=1 i=1
Q
Hence perm(A) ≤ i di . This is worse than Brégman’s bound by an exponential factor, since by
Stirling’s formula (I.2)
!
Yn
1 Y
n
(di !) di ∼ d i e− n .
i=1 i=1
Here is our second attempt. The hope is to use the chain rule to expand the joint entropy and
bound the conditional entropy more carefully. Let us write
X
n X
n
H(X1 , . . . , Xn ) = H(Xi |X1 , . . . , Xi−1 ) ≤ E[log Ni ].
i=1 i=1
i i
i i
i i
where Ni , as a random variable, denotes the number of possible values Xi can take conditioned
on X1 , . . . , Xi−1 , i.e., how many possible matchings for left vertex i given the outcome of where
1, . . . , i − 1 are matched to. However, it is hard to proceed from this point as we only know the
degree information, not the graph itself. In fact, since we do not know the relative positions of the
vertices, there is no reason why we should order from 1 to n. The key idea is to label the vertices
randomly, apply chain rule in this random order and average.
To this end, pick π uniformly at random from Sn and independent of X. Then
where Nk denotes the number of possible matchings for vertex k given the outcomes of {Xj :
π −1 (j) < π −1 (k)} and the expectation is with respect to (X, π ). The key observation is:
1 X
dk
1
E(X,π ) log Nk = log i = log(di !) di
dk
i=1
and hence
X
n
1 Y
n
1
log perm(A) ≤ log(di !) di = log (di !) di .
k=1 i=1
Proof of Lemma 8.5. In fact, we will show that even conditioned on Xn the distribution of Nk
is uniform. Indeed, if d = dk is the degree of k-th (right) node then let J1 , . . . , Jd be those right
nodes that match with neighbors of k under the fixed perfect matching (one of Ji ’s, say J1 , equals
i i
i i
i i
164
k). Random permutation π rearranges Ji ’s in the order in which corresponding right nodes are
revealed. Clearly the induced order of Ji ’s is uniform on d! possible choices. Note that if J1 occurs
in position ℓ ∈ {1, . . . , d} then Nk = d − ℓ + 1. Clearly ℓ and thus Nk are uniform on [d] = [dk ].
Leb(AS ) ≤ Leb(KS )
Thus, rectangles are extremal objects from the point of view of maximizing volumes of
coordinate projections.
Proof. Let Xn be uniformly distributed on K. Then h(Xn ) = log Leb(K). Let A be a rectangle of
size a1 × · · · × an where
On the other hand, by the chain rule and the fact that conditioning reduces differential entropy
(recall Theorem 2.7(a) and (c)),
X
n
h(XS ) = 1{i ∈ S}h(Xi |X[i−1]∩S )
i=1
X
≥ h(Xi |Xi−1 )
i∈S
Y
= log ai
i∈S
= log Leb(AS ).
The following result is a continuous counterpart of Shearer’s lemma (see Theorem 1.8 and
Remark 1.2):
4
Note that since K is compact, its projection and slices are all compact and hence measurable.
i i
i i
i i
Corollary 8.7 (Loomis-Whitney) Let K be a compact subset of Rn and let Kjc denote the
projection of K onto coordinates in [n] \ j. Then
Y
n
1
Leb(K) ≤ Leb(Kjc ) n−1 . (8.6)
j=1
Y
n
Leb(K) ≥ wj ,
j=1
i.e. that volume of K is greater than that of the rectangle of average widths.
i i
i i
i i
In this chapter we consider (a toy version of) the problem of creating high-quality random number
generators. Given a stream of independent Ber(p) bits, with unknown p, we want to turn them into
pure random bits, i.e., independent Ber(1/2) bits. Our goal is to find a way of extracting as many
fair coin flips as possible from possibly biased coin flips, without knowing the actual bias p.
In 1951 von Neumann [442] proposed the following scheme: Divide the stream into pairs of
bits, output 0 if 10, output 1 if 01, otherwise do nothing and move to the next pair. Since both 01
and 10 occur with probability pp̄ (where, we remind p̄ = 1 − p), regardless of the value of p, we
obtain fair coin flips at the output. To measure the efficiency of von Neumann’s scheme, note that,
on average, we have 2n bits in and 2pp̄n bits out. So the efficiency (rate) is pp̄. The question is:
Can we do better?
There are several choices to be made in the problem formulation. Universal vs non-universal:
the source distribution can be unknown or partially known, respectively. Exact vs approximately
fair coin flips: whether the generated coin flips are exactly fair or approximately, as measured by
one of the f-divergences studied in Chapter 7 (e.g., the total variation or KL divergence). In this
chapter, we only focus on the universal generation of exactly fair coins. On the other extreme,
in Part II we will see that optimal data compressors’ output consists of almost purely random
bits, however those compressors are non-universal (need to know source statistics, e.g. bias p) and
approximate.
For convenience, in this chapter we consider entropies measured in bits, i.e. log = log2 in this
chapter.
9.1 Setup
Let {0, 1}∗ = ∪k≥0 {0, 1}k = {∅, 0, 1, 00, 01, . . . } denote the set of all finite-length binary strings,
where ∅ denotes the empty string. For any x ∈ {0, 1}∗ , l(x) denotes the length of x.
Let us first introduce the definition of random number generator formally. If the input vector is
X, denote the output (variable-length) vector by Y ∈ {0, 1}∗ . Then the desired property of Y is the
following: Conditioned on the length of Y being k, Y is uniformly distributed on {0, 1}k .
Definition 9.1 (Randomness extractor) We say Ψ : {0, 1}∗ → {0, 1}∗ is an extractor if
166
i i
i i
i i
i.i.d.
2 For any n and any p ∈ (0, 1), if Xn ∼ Ber(p), then Ψ(Xn ) ∼ Ber(1/2)k conditioned on
l(Ψ(Xn )) = k for each k ≥ 1.
9.2 Converse
We show that no extractor has a rate higher than the binary entropy function h(p), even if the
extractor is allowed to be non-universal (depending on p). The intuition is that the “information
content” contained in each Ber(p) variable is h(p) bits; as such, it is impossible to extract more
than that. This is easily made precise by the data processing inequality for entropy (since extractors
are deterministic functions).
nh(p) = H(Xn ) ≥ H(Ψ(Xn )) = H(Ψ(Xn )|L) + H(L) ≥ H(Ψ(Xn )|L) = E [L] bits,
where the last step follows from the assumption on Ψ that Ψ(Xn ) is uniform over {0, 1}k
conditioned on L = k.
The rate of von Neumann’s extractor and the entropy bound are plotted in Figure 9.1. Next
we present two extractors, due to Elias [149] and Peres [327] respectively, that attain the binary
entropy function. (More precisely, both construct a sequence of extractors whose rate approaches
the entropy bound).
i i
i i
i i
168
rate
1 bit
rvN
p
0 1 1
2
Figure 9.1 Rate function of von Neumann’s extractor and the binary entropy function.
1 For iid Xn , the probability of each string only depends on its type, i.e., the number of 1’s, cf.
method of types in Exercise I.1. Therefore conditioned on the number of 1’s to be qn, Xn is
uniformly distributed over the type class Tq . This observation holds universally for any value
of the actual bias p.
2 Given a uniformly distributed random variable on some finite set, we can easily turn it into
variable-length string of fair coin flips. For example:
• If U is uniform over {1, 2, 3}, we can map 1 7→ ∅, 2 7→ 0 and 3 7→ 1.
• If U is uniform over {1, 2, . . . , 11}, we can map 1 7→ ∅, 2 7→ 0, 3 7→ 1, and the remaining
eight numbers 4, . . . , 11 are assigned to 3-bit strings.
We will study properties of these kind of variable-length encoders later in Chapter 10.
Lemma 9.3 Given U uniformly distributed on [M], there exists f : [M] → {0, 1}∗ such that
conditioned on l(f(U)) = k, f(U) is uniformly over {0, 1}k . Moreover,
log2 M − 4 ≤ E[l(f(U))] ≤ log2 M bits.
Proof. We defined f by partitioning [M] into subsets whose cardinalities are powers of two, and
assign elements in each subset to binary strings of that length. Formally, denote the binary expan-
Pn
sion of M by M = i=0 mi 2i , where the most significant bit mn = 1 and n = blog2 Mc + 1. Taking
non-zero mi ’s we can write M = 2i0 + · · · 2it as a sum of distinct powers of twos and thus define
a partition [M] = ∪tj=0 Mj , where |Mj | = 2ij . We map the elements of Mj to {0, 1}ij . Finally, notice
that uniform distribution conditioned on any subset is still uniform.
To prove the bound on the expected length, the upper bound follows from the same entropy
argument log2 M = H(U) ≥ H(f(U)) ≥ H(f(U)|l(f(U))) = E[l(f(U))], and the lower bound
follows from
1 X 1 X 2n X i−n
n n n
2n+1
E[l(f(U))] = mi 2i · i = n − mi 2i (n − i) ≥ n − 2 ( n − i) ≥ n − ≥ n − 4,
M M M M
i=0 i=0 i=0
i i
i i
i i
Elias’ extractor Fix n ≥ 1. Let wH (xn ) define the Hamming weight (number of ones) of a
binary string xn . Let Tk = {xn ∈ {0, 1}n : wH (xn ) = k} define the Hamming sphere of radius k.
For each 0 ≤ k ≤ n, we apply the function f from Lemma 9.3 to each Tk . This defines a mapping
ΨE : {0, 1}n → {0, 1}∗ and then we extend it to ΨE : {0, 1}∗ → {0, 1}∗ by applying the mapping
per n-bit block and discard the last incomplete block. Then it is clear that the rate is given by
n E[l(ΨE (X ))]. By Lemma 9.3, we have
1 n
n n
E log − 4 ≤ E[l(ΨE (X ))] ≤ E log
n
wH (Xn ) wH (Xn )
Using Stirling’s approximation (cf. Exercise I.1) we can show
n
log = nh(wH (Xn )/n) + O(log n) .
wH (Xn )
Pn wH ( X n )
Since n1 wH (Xn ) = 1n i=1 1{Xi = 1}, from the law of large numbers we conclude n →p
and since h is a continuous bounded function, we also have
1
E[l(ΨE (Xn ))] = h(p) + O(log n/n).
n
Therefore the extraction rate of ΨE approaches the optimum h(p) as n → ∞.
• Let 1 ≤ m1 < . . . < mk ≤ n denote the locations such that x2mj 6= x2mj −1 .
• Let 1 ≤ i1 < . . . < in−k ≤ n denote the locations such that x2ij = x2ij −1 .
• yj = x2mj , vj = x2ij , uj = x2j ⊕ x2j+1 .
Here yk are the bits that von Neumann’s scheme outputs and both vn−k and un are discarded. Note
that un is important because it encodes the location of the yk and contains a lot of information.
Therefore von Neumann’s scheme can be improved if we can extract the randomness out of both
vn−k and un .
i i
i i
i i
170
Next we (a) verify Ψt is a valid extractor; (b) evaluate its efficiency (rate). Note that the bits that
enter into the iteration are no longer i.i.d. To compute the rate of Ψt , it is convenient to introduce
the notion of exchangeability. We say Xn are exchangeable if the joint distribution is invariant
under permutation, that is, PX1 ,...,Xn = PXπ (1) ,...,Xπ (n) for any permutation π on [n]. In particular, if
Xi ’s are binary, then Xn are exchangeable if and only if the joint distribution only depends on the
Hamming weight, i.e., PXn (xn ) = f(wH (xn )) for some function f. Examples: Xn is iid Ber(p); Xn is
uniform over the Hamming sphere Tk .
As an example, if X2n are i.i.d. Ber(p), then conditioned on L = k, Vn−k is iid Ber(p2 /(p2 + p̄2 )),
since L ∼ Binom(n, 2pp̄) and
pk+2m p̄n−k−2m
P[Yk = y, Un = u, Vn−k = v|L = k] =
n
k(p2 + p̄2 )n−k (2pp̄)k
− 1
n p2 m p̄2 n−k−m
= 2− k · · 2
k p + p̄2 p2 + p̄2
= P[Yk = y|L = k]P[Un = u|L = k]P[Vn−k = v|L = k],
where m = wH (v). In general, when X2n are only exchangeable, we have the following:
Lemma 9.4 (Ψt preserves exchangeability) Let X2n be exchangeable and L = Ψ1 (X2n ).
Then conditioned on L = k, Yk , Un and Vn−k are independent, each having an exchangeable
i.i.d.
distribution. Furthermore, Yk ∼ Ber( 21 ) and Un is uniform over Tk .
Proof. If suffices to show that ∀y, y′ ∈ {0, 1}k , u, u′ ∈ Tk and v, v′ ∈ {0, 1}n−k such that wH (v) =
wH (v′ ), we have
which implies that P[Yk = y, Un = u, Vn−k = v|L = k] = f(wH (v)) for some function f. Note that
the string X2n and the triple (Yk , Un , Vn−k ) are in one-to-one correspondence of each other. Indeed,
to reconstruct X2n , simply read the k distinct pairs from Y and fill them according to the locations of
ones in U and fill the remaining equal pairs from V. [Examples: (y, u, v) = (01, 1100, 01) ⇒ x =
(10010011), (y, u, v) = (11, 1010, 10) ⇒ x′ = (01110100).] Finally, note that u, y, v and u′ , y′ , v′
correspond to two input strings x and x′ of identical Hamming weight (wH (x) = k + 2wH (v)) and
hence of identical probability due to the exchangeability of X2n .
i i
i i
i i
i.i.d.
Lemma 9.5 (Ψt is an extractor) Let X2n be exchangeable. Then Ψt (X2n ) ∼ Ber(1/2)
conditioned on l(Ψt (X2n )) = m.
Proof. Note that Ψt (X2n ) ∈ {0, 1}∗ . It is equivalent to show that for all sm ∈ {0, 1}m ,
Proceed by induction on t. The base case of t = 1 follows from Lemma 9.4 (the distribution of
the Y part). Assume Ψt−1 is an extractor. Recall that Ψt (X2n ) = (Ψ1 (X2n ), Ψt−1 (Un ), Ψt−1 (Vn−k ))
and write the length as L = L1 + L2 + L3 , where L2 ⊥ ⊥ L3 |L1 by Lemma 9.4. Then
P[Ψt (X2n ) = sm ]
Xm
= P[Ψt (X2n ) = sm |L1 = k]P[L1 = k]
k=0
X
m X
m−k
Lemma 9.4 n−k
= P[L1 = k]P[Yk = sk |L1 = k]P[Ψt−1 (Un ) = skk+1 |L1 = k]P[Ψt−1 (V
+r
k+r+1 |L1 = k]
) = sm
k=0 r=0
X
m X
m−k
P[L1 = k]2−k 2−r P[L2 = r|L1 = k]2−(m−k−r) P[L3 = m − k − r|L1 = k]
induction
=
k=0 r=0
i.i.d.
Next we compute the rate of Ψt . Let X2n ∼ Ber(p). Then by the Strong Law of Large
Numbers (SLLN), 2n 1
l(Ψ1 (X2n )) ≜ 2n Ln
converges a.s. to pp̄. Assume, again by induction, that
a. s .
1
2n l (Ψ t−1 ( X 2n
)) −
− → rt− 1 ( p ) , with r1 ( p ) = pq. Then
1 Ln 1 1
l(Ψt (X2n )) = + l(Ψt−1 (Un )) + l(Ψt−1 (Vn−Ln )).
2n 2n 2n 2n
i.i.d. i.i.d. a. s .
Note that Un ∼ Ber(2pp̄), Vn−Ln |Ln ∼ Ber(p2 /(p2 +p̄2 )) and Ln −−→∞. Then the induction hypoth-
a. s . a. s .
esis implies that 1n l(Ψt−1 (Un ))−−→rt−1 (2pp̄) and 2(n−1 Ln ) l(Ψt−1 (Vn−Ln ))−−→rt−1 (p2 /(p2 +p̄2 )). We
obtain the recursion:
1 p2 + p̄2 p2
rt (p) = pp̄ + rt−1 (2pp̄) + rt−1 ≜ (Trt−1 )(p), (9.1)
2 2 p2 + p̄2
where the operator T maps a continuous function on [0, 1] to another. Furthermore, T is mono-
tone in the sense that f ≤ g pointwise then Tf ≤ Tg. Then it can be shown that rt converges
monotonically from below to the fixed point of T, which turns out to be exactly the binary
entropy function h. Instead of directly verifying Th = h, here is a simple proof: Consider
i.i.d.
X1 , X2 ∼ Ber(p). Then 2h(p) = H(X1 , X2 ) = H(X1 ⊕ X2 , X1 ) = H(X1 ⊕ X2 ) + H(X1 |X1 ⊕ X2 ) =
2
h(2pp̄) + 2pp̄h( 12 ) + (p2 + p̄2 )h( p2p+p̄2 ).
The convergence of rt to h are shown in Figure 9.2.
i i
i i
i i
172
1.0
0.8
0.6
0.4
0.2
Figure 9.2 The rate function rt of Ψt (by iterating von Neumann’s extractor t times) versus the binary entropy
function, for t = 1, 4, 10.
1 Finite-state machine (FSM): initial state (red), intermediate states (white) and final states (blue,
output 0 or 1 then reset to initial state).
2 Block simulation: let A0 , A1 be disjoint subsets of {0, 1}k . For each k-bit segment, output 0 if
falling in A0 or 1 if falling in A1 . If neither, discard and move to the next segment. The block
size is at most the degree of the denominator polynomial of f.
The next table gives some examples of f that can be realized with these two architectures. (Exercise:
How to generate f(p) = 1/3?)
It turns out that the only type of f that can be simulated using either FSM or block simulation
√
is rational function. For f(p) = p, which satisfies Keane-O’Brien’s characterization, it cannot
be simulated by FSM or block simulation, but it can be simulated by the so-called pushdown
automata, which is a FSM operating with a stack (infinite memory) [308].
It is unknown how to find the optimal Bernoulli factory with the best rate. Clearly, a converse
is the entropy bound h(hf((pp))) , which can be trivial (bigger than one).
i i
i i
i i
1
1
0
0
f(p) = 1/2 A0 = 10; A1 = 01
1
1 0
0
0 0
1
f(p) = 2pq A0 = 00, 11; A1 = 01, 10 0 1
0
1 1
0 0
0
0
1
1
p3
f(p) = p3 +p̄3
A0 = 000; A1 = 111
0
0
1
1
1 1
i i
i i
i i
where h(·) is the binary entropy. Conclude that for all 0 ≤ k ≤ n we have
(c*) More generally, let X be a finite alphabet, P̂, Q distributions on X , and TP̂ a set of all strings
in X n with composition P̂. If TP̂ is non-empty (i.e. if nP̂(·) is integral) then
and furthermore, both O(log n) terms can be bounded as |O(log n)| ≤ |X | log(n + 1). (Hint:
show that number of non-empty TP̂ is ≤ (n + 1)|X | .)
I.2 (Refined method of types) The following refines Proposition 1.5. Let n1 , . . . , be non-negative
P
integers with i ni = n and let k+ be the number of non-zero ni ’s. Then
n k+ − 1 1 X
log = nH(P̂) − log(2πn) − log P̂i − Ck+ ,
n1 , n2 , . . . 2 2
i:ni >0
i i
i i
i i
where we define xn+1 = x1 (cyclic continuation). Show that 1n Nxn (·, ·) defines a probability
distribution PA,B on X ×X with equal marginals PA = PB . Conclude that H(A|B) = H(B|A).
Is PA|B = PB|A ?
(2)
(b) Let Txn (Markov type-class of xn ) be defined as
(2)
Txn = {x̃n ∈ X n : Nx̃n = Nxn } .
(2)
Show that elements of Txn can be identified with cycles in the complete directed graph G
on X , such that for each (a, b) ∈ X × X the cycle passes Nxn (a, b) times through edge
( a, b) .
(c) Show that each such cycle can be uniquely specified by indentifying the first node and by
choosing at each vertex of the graph the order in which the outgoing edges are taken. From
this and Stirling’s approximation conclude that
(2)
log |Txn | = nH(xT+1 |xT ) + O(log n) , T ∼ Unif([n]) .
I.4 (Maximum entropy) Prove that for any X taking values on N = {1, 2, . . .} such that E[X] < ∞,
1
H(X) ≤ E[X]h ,
E [ X]
maximized uniquely by the geometric distribution. Hint: Find an appropriate Q such that RHS
- LHS = D(PX kQ).
I.5 (Finiteness of entropy) In Exercise I.4 we have shown that the entropy of any N-valued random
variable with finite expectation is finite. Next let us improve this result.
(a) Show that E[log X] < ∞ ⇒ H(X) < ∞.
Moreover, show that the condition of X being integer-valued is not superfluous by giving a
counterexample.
(b) Show that if k 7→ PX (k) is a decreasing sequence, then H(X) < ∞ ⇒ E[log X] < ∞.
Moreover, show that the monotonicity assumption is not superfluous by giving a counterex-
ample.
I.6 (Robust version of the maximal entropy) The maximal differential entropy among all densities
supported on [−b, b] is attained by the uniform distribution. Prove that as ϵ → 0+ we have
where supremization is over all (not necessarily independent) random variables M, Z such that
M + Z possesses a density. (Hint: [162, Appendix C] proves o(1) = O(ϵ1/3 log 1ϵ ) bound.)
I.7 (Maximum entropy under Hamming weight constraint) For any α ≤ 1/2 and d ∈ N,
i i
i i
i i
achieved by the product distribution Y ∼ Ber(α)⊗d . Hint: Find an appropriate Q such that RHS
- LHS = D(PY kQ).
I.8 (Gaussian divergence)
(a) Under what conditions on m0 , Σ0 , m1 , Σ1 is D( N (m1 , Σ1 ) k N (m0 , Σ0 ) ) < ∞?
(b) Compute D(N (m, Σ)kN (0, In )), where In is the n × n identity matrix.
(c) Compute D( N (m1 , Σ1 ) k N (m0 , Σ0 ) ) for non-singular Σ0 . (Hint: think how Gaussian
distribution changes under shifts x 7→ x+a and non-singular linear transformations x 7→ Ax.
Apply data-processing to reduce to previous case.)
I.9 (Water-filling solution) Let M ∈ Rk×n be a fixed matrix, X ⊥⊥ Z ∼ N (0, In ).
(a) Let M = UΛVT be an SVD decomposition, so that U, V are orthogonal matrices and Λ =
diag(λ1 , . . . , λn ) (with rank(M) non-zero λj ’s). Show that
1X + 2
n
max I(X; MX + Z) = log (λi t) ,
PX :E[∥X∥2 ]≤s2 2
i=1
Pn
where log+ x = max(0, log x) and t is determined from solving i=1 |t − λ− i |+ = s .
2 2
1 h i 1X h i n
s2 s2
max I(X; MX + Z|M) = E log det(I + MT M) = E log(1 + σi2 (M)) ,
PX :E[∥X∥2 ]≤s2 2 n 2 n
i=1
where σi (M) are the singular values of M. (Hint: average over rotations as in Section 6.2*)
i.i.d.
Note: In communication theory Mi,j ∼ N (0, 1) (Ginibre ensemble) models a multi-input, multi-
output (MIMO) channel with n transmit and k receive antennas. The matrix MMT is a Wishart
matrix and its spectrum, when n and k grow proportionally, approaches the Marchenko-Pastur
distribution. The important practical consequence is that the capacity of a MIMO channel grows
for high-SNR as 21 min(n, k) log SNR. This famous observation [418] is the reason modern WiFi
and cellular systems employ multiple antennas.
I.11 (Conditional capacity) Consider a Markov kernel PB,C|A : A → B × C , which we will also
(a)
understand as a collection of distributions PB,C ≜ PB,C|A=a . Prove
(a) (a)
inf sup D(PC|B kQC|B |PB ) = sup I(A; C|B) ,
QC|B a∈A PA
whenever supremum on the right-hand side is finite and achievable by some distribution P∗A . In
R
this case, optimal QC|B = P∗C|B is found by disintegrating P∗B,C = PA∗ (da)PB,C . (Hint: follow
(a)
i i
i i
i i
I.15 The Hewitt-Savage 0-1 law states that certain symmetric events have no randomness. Let
{Xi }i≥1 be a sequence be iid random variables. Let E be an event determined by this sequence.
We say E is exchangeable if it is invariant under permutation of finitely many indices in
the sequence of {Xi }’s, e.g., the occurance of E is unchanged if we permute the values of
(X1 , X4 , X7 ), etc.
Let us prove the Hewitt-Savage 0-1 law information-theoretically in the following steps:
P Pn
(a) (Warm-up) Verify that E = { i≥1 Xi converges} and E = {limn→∞ n1 i=1 Xi = E[X1 ]}
are exchangeable events.
(b) Let E be an exchangeable event and W = 1E is its indicator random variable. Show that
for any k, I(W; X1 , . . . , Xk ) = 0. (Hint: Use tensorization (6.2) to show that for arbitrary n,
nI(W; X1 , . . . , Xk ) ≤ 1 bit.)
(c) Since E is determined by the sequence {Xi }i≥1 , we have by continuity of mutual informa-
tion:
i i
i i
i i
Pn
I.17 Suppose Z1 , . . . Zn are independent Poisson random variables with mean λ. Show that i=1 Zi
is a sufficient statistic of (Z1 , . . . Zn ) for λ.
I.18 Suppose Z1 , . . . Zn are independent uniformly distributed on the interval [0, λ]. Show that
max1≤i≤n Zi is a sufficient statistic of (Z1 , . . . Zn ) for λ.
I.19 (Divergence of order statistic) Given xn = (x1 , . . . , xn ) ∈ Rn , let x(1) ≤ . . . ≤ x(n) denote the
ordered entries. Let P, Q be distributions on R and PXn = Pn , QXn = Qn .
(a) Prove that
D(PX(1) ,...,X(n) kQX(1) ,...,X(n) ) = nD(PkQ). (I.3)
(b) Show that
D(Bin(n, p)kBin(n, q)) = nd(pkq).
I.20 (Continuity of entropy on finite alphabet) We have shown that on a finite alphabet entropy is a
continuous function of the distribution. Quantify this continuity by explicitly showing
|H(P) − H(Q)| ≤ h(TV(P, Q)) + TV(P, Q) log(|X | − 1)
for any P and Q supported on X .
Hint: Use Fano’s inequaility and the inf-representation (over coupling) of total variation in
Theorem 7.7(a).
I.21 (a) For any X such that E [|X|] < ∞, show that
(E[X])2
D(PX kN (0, 1)) ≥ nats.
2
(b) For a > 0, find the minimum and minimizer of
min D(PX kN (0, 1)).
PX :EX≥a
C = inf ϵ2 + log N(ϵ) . (I.5)
ϵ≥0
1
N(ϵ) is the minimum number of radius-ϵ (in divergence) balls that cover the set {PY|X=x : x ∈ X }. Thus, log N(ϵ) is a
metric entropy – see Chapter 27.
i i
i i
i i
Comments: These estimates are useful because N(ϵ) for small ϵ roughly speaking depends on
local (differential) properties of the map x 7→ PY|X=x , unlike C which is global.
I.24 Consider the channel PYm |X : [0, 1] 7→ {0, 1}m , where given x ∈ [0, 1], Ym is i.i.d. Ber(x).
(a) Using the upper bound from Exercise I.23 prove
1
C(m) ≜ max I(X; Ym ) ≤ log m + O(1) , m → ∞.
PX 2
Hint: Find a covering of the input space.
(b) Show a lower bound to establish
1
C(m) ≥ log m + o(log m) , m → ∞.
2
Hint: Show that for any ϵ > 0 there exists K(ϵ) such that for all m ≥ 1 and all p ∈ [ϵ, 1 − ϵ]
we have |H(Bin(m, p)) − 12 log m| ≤ K(ϵ).
I.25 This exercise shows other ways of proving Fano’s inequality in its various forms.
(a) Prove (3.15) as follows. Given any P = (Pmax , P2 , . . . , PM ), apply a random permutation π
to the last M − 1 atoms to obtain the distribution Pπ . By comparing H(P) and H(Q), where
Q is the average of Pπ over all permutations, complete the proof.
(b) Prove (3.15) by directly solving the convex optimization max{H(P) : 0 ≤ pi ≤ Pmax , i =
P
1, . . . , M, i pi = 1}.
(c) Prove (3.19) as follows. Let Pe = P[X 6= X̂]. First show that
Notice that the minimum is non-zero unless Pe = Pmax . Second, solve the stated convex
optimization problem. (Hint: look for invariants that the matrix PZ|X must satisfy under
permutations (X, Z) 7→ (π (X), π (Z)) then apply the convexity of I(PX , ·)).
Qn
I.26 Show that PY1 ···Yn |X1 ···Xn = i=1 PYi |Xi if and only if the Markov chain Yi → Xi → (X\i , Y\i )
holds for all i = 1, . . . , n, where X\i = {Xj , j 6= i}.
I.27 (Distributions and graphical models)
(a) Draw all possible directed acyclic graphs (DAGs, or directed graphical models) compatible
with the following distribution on X, Y, Z ∈ {0, 1}:
(
1/6, x = 0, z ∈ {0, 1} ,
PX,Z (x, z) = (I.7)
1/3, x = 1, z ∈ {0, 1}
Y=X+Z (mod2) (I.8)
i i
i i
i i
You may include only the minimal DAGs (recall: the DAG is minimal for a given
distribution if removal of any edge leads to a graphical model incompatible with the
distribution).2
Qn
(b) Draw the DAG describing the set of distributions PXn Yn satisfying PYn |Xn = i=1 PYi |Xi .
(c) Recall that two DAGs G1 and G2 are called equivalent if they have the same vertex sets and
each distribution factorizes w.r.t. G1 if and only if it does so w.r.t. G2 . For example, it is
well known
X→Y→Z ⇐⇒ X←Y←Z ⇐⇒ X ← Y → Z.
X1 → X2 → · · · → Xn → · · ·
X1 ← X2 ← · · · ← Xn ← · · ·
A→B→C
=⇒ A ⊥
⊥ (B, C)
A→C→B
I( A; C ) = I( B; C ) = 0 =⇒ I(A, B; C) = 0 . (I.9)
2
Note: {X → Y}, {X ← Y} and {X Y} are the three possible directed graphical modelss for two random variables. For
example, the third graph describes the set of distributions for which X and Y are independent: PXY = PX PY . In fact, PX PY
factorizes according to any of the three DAGs, but {X Y} is the unique minimal DAG.
i i
i i
i i
I.31 Find the entropy rate of a stationary ergodic Markov chain with transition probability matrix
1 1 1
2 4 4
P= 0 1
2
1
2
1 0 0
I.32 (Solvable HMM) Similar to the Gilbert-Elliott process (Example 6.3) let Sj ∈ {±1} be a
stationary two-state Markov chain with P[Sj = −Sj−1 |Sj−1 ] = 1 − P[Sj = Sj−1 |Sj−1 ] = τ . Let
iid
Ej ∼ Ber(δ), with Ej ∈ {0, 1} and let Xj = BECδ (Sj ) be the observation of Sj through the binary
erasure channel (BEC) with erasure probability δ , i.e. Xj = Sj Ej . Find entropy rate of Xj (you
can give answer in the form of a convergent series). Evaluate at τ = 0.11, δ = 1/2 and compare
with H(X1 ).
I.33 Consider a binary symmetric random walk Xn on Z that starts at zero. In other words, Xn =
Pn
j=1 Bj , where (B1 , B2 , . . .) are independent and equally likely to be ±1.
(a) When n 1 does knowing X2n provide any information about Xn ? More exactly, prove
i i
i i
i i
I.37 Show that map P 7→ D(PkQ) is strongly convex, i.e. for all λ ∈ [0, 1] and all P0 , P1 , Q we have
λD(P1 kQ) + λ̄D(P0 kQ) − D(λP1 + λ̄P0 kQ) ≥ 2λλ̄TV(P0 , P1 )2 log e .
Whenever the LHS is finite, derive the explicit form of a unique minimizer R.
I.40 For an f-divergence, consider the following statements:
(i) If If (X; Y) = 0, then X ⊥
⊥ Y.
(ii) If X − Y − Z and If (X; Y) = If (X; Z) < ∞, then X − Z − Y.
Recall that f : (0, ∞) → R is a convex function with f(1) = 0.
(a) Choose an f-divergence which is not a multiple of the KL divergence (i.e., f cannot be of
form c1 x log x + c2 (x − 1) for any c1 , c2 ∈ R). Prove both statements for If .
(b) Choose an f-divergence which is non-linear (i.e., f cannot be of form c(x − 1) for any c ∈ R)
and provide examples that violate (i) and (ii).
(c) Choose an f-divergence. Prove that (i) holds, and provide an example that violates (ii).
I.41 (Hellinger and interactive protocols [31]) In the area of interactive communication Alice has
access to X and outputs bits Ai , i ≥ 1, whereas Bob has access to Y and outputs bits Bi , i ≥ 1.
The communication proceeds in rounds, so that at i-th round Alice and Bob see the previous
messages of each other. This means that conditional distribution of the protocol is given by
Y
n
PAn ,Bn |X,Y = PAi |Ai−1 ,Bi−1 ,X PBi |Ai−1 ,Bi−1 ,Y .
i=1
i i
i i
i i
(b) H2 (Πx,y , Πx′ ,y ) + H2 (Πx,y′ , Πx′ ,y′ ) ≤ 2H2 (Πx,y , Πx′ ,y′ )
I.42 (Chain rules I)
(a) Show using (I.11) and the chain rule for KL that
X
n
(1 − α)Dα (PXn kQXn ) ≥ inf(1 − α)Dα (PXi |Xi−1 =a kQXi |Xi−1 =a )
a
i=1
where Pi = PXi QXni+1 |Xi , with Pn = PXn and P0 = QXn . The identity above shows how
KL-distance from PXn to QXn can be traversed by summing distances between intermediate
Pi ’s.
(b) Using the same path and triangle inequality show that
X
n
TV(PXn , QXn ) ≤ EPXi−1 TV(PXi |Xi−1 , QXi |Xi−1 )
i=1
See also [230, Theorem 7] for a deeper result, where for a universal C > 0 it is shown that
X
n
H2 (PXn , QXn ) ≤ C EPXi−1 H2 (PXi |Xi−1 , QXi |Xi−1 ) .
i=1
Prove that
Dm (PkQ) = inf{E[P[X 6= Y|Y]2 ] : PX = P, PY = Q}
PXY
where the infimum is over all couplings. (Hint: For one direction use the same coupling
achieving TV. For the other direction notice that P[X 6= Y|Y] ≥ 1 − QP((YY)) .)
i i
i i
i i
Prove that
I.45 (Center of gravity under f-divergences) Recall from Corollary 4.2 the fact that
minQY D(PY|X kQY |PX ) = I(X; Y) achieved at QY = PY . Prove the following versions for other
f-divergences:
(a) Suppose that for PX -a.e. x, PY|X=x μ with density p(y|x).3 Then
Z q 2
inf χ2 (PY|X kQY |PX ) = μ(dy) E[pY|X (y|X)2 ] − 1. (I.12)
QY
p
If the right-hand side is finite, the minimum is achieved at QY (dy) ∝ E[p(y|X)2 ] μ(dy).
(b) Show that
Z − 1
1
inf χ (QY kPY|X |PX ) =
2
μ(dy) − 1, (I.13)
QY g ( y)
where g(y) ≜ E[pY|X (y|X)−1 ] and we use agreement 1/0 = ∞ for all reciprocals. If the right-
hand side is finite, then the minimum is achieved by QY (dy) ∝ g(1y) 1{g(y) < ∞} μ(dy).
(c) Show that
Z
inf D(QY kPY|X |PX ) = − log μ(dy) exp(E[log p(y|X)]). (I.14)
QY
If the right-hand side is finite, the minimum is achieved at QY (dy) ∝ exp(E[log p(y|X)]) μ(dy).
Note: This exercise shows that the center of gravity with respect to other f-divergences need not
be PY but its reweighted version. For statistical applications, see Exercises VI.6, VI.9, and VI.10,
where (I.12) and (I.13) are used to determine the form of the Bayes estimator.
I.46 (DPI for Fisher information) Let pθ (x, y) be a smoothly parametrized family of densities on
X ⊗ Y (with respect to some reference measure μX ⊗ μY ) where θ ∈ Rd . Let JXF,Y (θ) denote
the Fisher information matrix of the joint distribution and similarly JXF (θ), JYF (θ) those of the
marginals.
(a) (Monotonicity) Assume the interchangeability of derivative and integral, namely,
R
∇θ pθ (y) = μX (dx)∇θ pθ (x, y) for every θ, y. Show that JYF (θ) JXF,Y (θ).
(b) (Data processing inequality) Suppose, in addition, that θ → X → Y. (In other words, pθ (y|x)
does not depend on θ.) Then JYF (θ) JXF (θ), with equality if Y is a sufficient statistic of X
for θ.
3
Note that the results do not depend on the choice of μ, so we can take for example μ = PY , in view of Lemma 3.3.
i i
i i
i i
I.47 (Fisher information inequality) Consider real-valued A ⊥ ⊥ B with differentiable densities and
finite (location) Fisher informations J(A), J(B). Then Stam’s inequality [399] shows
1 1 1
≥ + . (I.15)
J(A + B) J(A) J(B)
(a) Show that Stam’s inequality is equivalent to (a + b)2 J(A + B) ≤ a2 J(A) + b2 J(B) for all
a, b > 0.
(b) Let X1 = aθ+ A, X2 = bθ+ B. This defines a family of distributions of (X1 , X2 ) parametrized
by θ ∈ R. Show that its Fisher information is given by JF (θ) = a2 J(A) + b2 J(B).
(c) Let Y = X1 + X2 and assume that conditions for the applicability of the DPI for Fisher
information (Exercise I.46) hold. Conclude the proof of (I.15).
Note: A simple sufficient condition that implies (I.15) is that densities of A and B are everywhere
strictly positive on R. For a direct proof in this case, see [58].
I.48 The Ingster-Suslina formula [225] computes the χ2 -divergence between a mixture and a sim-
ple distribution, exploiting the second moment nature of χ2 . Let Pθ be a family of probability
distributions on X parameterized by θ ∈ Θ. Each distribution (prior) π on Θ induces a mixture
R
Pπ ≜ Pθ π (dθ). Assume that Pθ ’s have a common dominating distribution Q.
(a) Show that
i i
i i
i i
(b) Assume that n is even and σ is uniformly distributed over the set of bisections {z ∈ {±1}n :
Pn
i=1 zi = 0}, so that the two communities are equally sized. Show that (I.16) continues to
hold.
Note: As a consequence of (I.16), we have the contiguity SBM(n, p, q) ◁ ER(n, p+2 q ) whenever
τ < 1. In fact, they are mutually contiguous if and only if τ < 1. This much more difficult
result can be shown using the method of small subgraph conditioning developed by [364, 227];
cf. [307, Section 5].
I.50 (Sampling without replacement I [400]) Consider two ways of generating a random vector
Xn = (X1 , . . . , Xn ): Under P, Xn are sampled from the set [n] = {1, . . . , n} without replacement;
under Q, Xn are sampled from [n] with replacement. Let’s compare the joint distribution of the
first k draws X1 , . . . , Xk for some 1 ≤ k ≤ n.
(a) Show that
k! n
TV(PXk , QXk ) = 1 −
nk k
k! n
D(PXk kQXk ) = − log k .
n k
√
Conclude that D and TV are o(1) iff k = o( n).
√
(b) Explain the specialness of n by find an explicit test that distinguishes P and Q with high
√
probability when k n. Hint: birthday paradox.
I.51 (Sampling without replacement II [400]) Let X1 , . . . , Xk be a random sample of balls without
Pq
replacement from an urn containing ai balls of color i ∈ [q], i=1 ai = n. Let QX (i) = ani . Show
that
k2 (q − 1) log e
D(PXk kQkX ) ≤ c , c= .
(n − 1)(n − k + 1) 2
Let Rm,b0 ,b1 be the distribution of the number of 1’s in the first m ≤ b0 + b1 coordinates of a
randomly permuted binary strings with b0 zeros and b1 ones.
(a) Show that
X
q
ai − V i ai − V i
D(PXm+1 |Xm kQX |PXm ) = E[ log ],
N−m pi (N − m)
i=1
i i
i i
i i
Pm
where i=1 λi = 1, λi ≥ 0 and Qi are some distributions on X and c > 0 is a universal constant.
Follow the steps:
(a) Show the identity (here PXk is arbitrary)
Y
k Xk− 1
D PXk PXj = I(Xj ; Xj+1 ).
j=1 j=1
(b) Show that there must exist some t ∈ {k, k + 1, . . . , n} such that
H(Xk−1 )
I(Xk−1 ; Xk |Xnt+1 ) ≤ .
n−k+1
(Hint: Expand I(Xk−1 ; Xnk ) via chain rule.)
(c) Show from 1 and 2 that
Y kH(Xk−1 )
D PXk |T PXj |T |PT ≤
n−k+1
n
where T = Xt+1 .
(d) By Pinsker’s inequality
h i r
Y kH(Xk−1 )|X | 1
ET TV PXk |T , PXj |T ≤ c , c= p .
n−k+1 2 log e
Conclude (I.17) by the convexity of total variation.
Note: Another estimate [400, 123] is easy to deduce from Exercise I.51 and Exercise I.50: there
exists a mixture of iid QXk such that
k
min(2|X |, k − 1) .
TV(QXk , PXk ) ≤
n
The bound (I.17) improves the above only when H(X1 ) ≲ 1.
I.53 (Wringing lemma [140, 419]) Prove that for any δ > 0 and any (Un , Vn ) there exists an index
n n
set I ⊂ [n] of size |I| ≤ I(U δ;V ) such that
I(Ut ; Vt |UI , VI ) ≤ δ ∀ t ∈ [ n] .
When I(Un ; Vn ) n, this shows that conditioning on a (relatively few) entries, one can make
individual coordinates almost independent. (Hint: Show I(A, B; C, D) ≥ I(A; C) + I(B; D|A, C)
first. Then start with I = ∅ and if there is any index t s.t. I(Ut ; Vt |UI , VI ) > δ then add it to I and
repeat.)
I.54 (Generalization gap = ISKL , [18]) A learning algorithm selects a parameter W based on observing
(not necessarily independent) S1 , . . . , Sn , where all Si have a common marginal law PS , with the
goal of minimizing the loss on a fresh sample = E[ℓ(W, S)], where Sn ⊥ ⊥ S ∼ PS and ℓ is an
4
arbitrary loss function . Consider a Gibbs sampler (see Section 4.8.2) which chooses
αX
n
1
W ∼ PW|Sn (w|sn ) = n
π (w) exp{− ℓ(w, si )} ,
Z( s ) n
i=1
4
For example, if S = (X, Y) we may have ℓ(w, (x, y)) = 1{fw (x) 6= y} where fw denotes a neural network with weights w.
i i
i i
i i
where π (·) is a fixed prior on weights and Z(·) the normalization constant. Show that general-
ization gap of this algorithm is given by
1X
n
1
E[ℓ(W, S)] − E[ ℓ(W, Si )] = ISKL (W; Sn ) ,
n α
i=1
P
I.57 (Divergence for mixtures [216, 249]) Let Q̄ = i π i Qi be a mixture distribution.
(a) Prove
!
X
D(PkQ̄) ≤ − log π i exp(−D(PkQi )) ,
i
P
improving over the simple convexity estimate D(PkQ̄) ≤ i π i D(PkQi ). (Hint: Prove that
the function Q 7→ exp{−aD(PkQ)} is concave for every a ≤ 1.)
(b) Furthermore, for any distribution {π̃ j }, any λ ∈ [0, 1] we have
X X X
π̃ j D(Qj kQ̄) + D(π kπ̃ ) ≥ − π i log π̃ j e−(1−λ)Dλ (Pi ∥Pj )
j i j
X
≥ − log π i π̃ j e−(1−λ)Dλ (Pi ∥Pj )
i,j
′
(Hint: Prove D(PA|B=b kQA ) ≥ − EA|B=b [log EA′ ∼QA gg((AA,,bb)) ] via Donsker-Varadhan. Plug in
g(a, b) = PB|A (b|a)1−λ , average over B and use Jensen to bring outer EB|A inside the log.)
I.58 (Mutual information and pairwise distances [216]) Suppose we have knowledge of pairwise
distances dλ (x, x′ ) ≜ Dλ (PY|X=x kPY|X=x′ ), where Dλ is the Rényi divergence of order λ. What
i.i.d.
can be said about I(X; Y)? Let X, X′ ∼ PX . Using Exercise I.57, prove that
I(X; Y) ≤ − E[log E[exp(−d1 (X, X′ ))|X]]
and for every λ ∈ [0, 1]
I(X; Y) ≥ − E[log E[exp(−(1 − λ)dλ (X, X′ ))|X]].
See Theorem 32.5 for an application.
i i
i i
i i
I.59 (D ≲ H2 log H12 trick) Show that for any P, U, R, λ > 1, and 0 < ϵ < 2−5 λ−1 we have
λ
λ 1
D(PkϵU + ϵ̄R) ≤ 8(H (P, R) + 2ϵ)
2
log + Dλ (PkU) .
λ−1 ϵ
Thus, a Hellinger ϵ-net for a set of P’s can be converted into a KL (ϵ2 log 1ϵ )-net; see
Theorem 32.6 in Section 32.2.4.
−1
(a) Start by proving the tail estimate for the divergence: For any λ > 1 and b > e(λ−1)
dP dP log b
EP log · 1{ > b} ≤ λ−1 exp{(λ − 1)Dλ (PkQ)}
dQ dQ b
(b) Show that for any b > 1 we have
b log b dP dP
D(PkQ) ≤ H2 (P, Q) √ + EP log · 1{ > b}
( b − 1)2 dQ dQ
h(x)
(Hint: Write D(PkQ) = EP [h( dQ
dP )] for h(x) = − log x + x − 1 and notice that
√
( x−1)2
is
monotonically decreasing on R+ .)
(c) Set Q = ϵU + ϵ̄R and show that for every δ < e− λ−1 ∧ 14
1
1
D(PkQ) ≤ 4H2 (P, R) + 8ϵ + cλ ϵ1−λ δ λ−1 log ,
δ
where cλ = exp{(λ − 1)Dλ (PkU). (Notice H2 (P, Q) ≤ H2 (P, R) + 2ϵ, Dλ (PkQ) ≤
Dλ (PkU) + log 1ϵ and set b = 1/δ .)
2
(d) Complete the proof by setting δ λ−1 = 4H c(λPϵ,λ−
R)+2ϵ
1 .
I.60 Let G = (V, E) be a finite directed graph. Let
4 = (x, y, z) ∈ V3 : (x, y), (y, z), (z, x) ∈ E ,
∧ = (x, y, z) ∈ V3 : (x, y), (x, z) ∈ E .
Prove that 4 ≤ ∧.
Hint: Prove H(X, Y, Z) ≤ H(X) + 2H(Y|X) for random variables (X, Y, Z) distributed uniformly
over the set of directed 3-cycles, i.e. subsets X → Y → Z → X.
I.61 (Union-closed sets conjecture (UCSC)) Let X and Y be independent vectors in {0, 1}n .
Show [88]
p̄
H(X OR Y) ≥ (H(X) + H(Y)) , p̄ ≜ min min(P[Xi = 0], P[Yi = 0]) ,
2ϕ i
√
where OR denotes coordinatewise logical-OR and ϕ = 52−1 . (Hint: set Z = X OR Y, use chain
P
rule H(Z) ≥ i H(Zi |Xi−1 , Yi−1 ), and the inequality for binary-entropy h(ab) ≥ h(a)b2+ϕh(b)a ).
Comment: F ⊂ {0, 1}n is called a union-closed set if x, y ∈ F =⇒ (x OR y) ∈ F . The UCSC
states that p = maxi P[Xi = 1] ≥ 1/2, where X is uniform over F . Gilmer’s method [189]
applies the inequality above to Y taken to be an independent copy of X (so that H(X OR Y) ≤
H(X) = H(Y) = log |F|) to prove that p ≥ 1 − ϕ ≈ 0.382.
i i
i i
i i
I.62 (Compression for regression) Let Y ∈ [−1; 1] and X ∈ X with X being finite (for simplic-
ity). Auxiliary variables U, U′ in this exercise are assumed to be deterministic functions of X.
For simplicity assume X is finite (but giant). Let cmp(U) be a complexity measure satisfying
cmp(U, U′ ) ≤ cmp(U) + cmp(U′ ), cmp(constant) = 0 and cmp(Ber(p)) ≤ log 2 for any p
(think of H(U) or log |U|). Choose U to be a maximizer of I(U; Y) − δ cmp(U).
(a) Show that cmp(U) ≤ I(Xδ;Y)
(b) For any U′ show I(Y; U′ |U) ≤ δ cmp(U′ ) (Hint: check U′′ = (U, U′ ))
(c) For any event S = {X ∈ A} show
√
|E[(Y − E[Y|U])1S ]| ≤ 2δ ln 2 (I.19)
(Hint: by Cauchy-Schwarz only need to show E[| E[Y|U, 1S ]− E[Y|U]|2 ] ≲ δ , which follows
by taking U′ = 1S in b) and applying Tao’s inequality (7.28))
(d) By choosing a proper S and applying above to S and Sc conclude that
√
E[| E[Y|X] − E[Y|U]|] ≤ 2 2δ ln 2 .
(So any high-dimensional complex feature vector X can be compressed down to U whose car-
dinality is of order I(Y; X) (and independent of |X |) but which, nevertheless, is essentially as
good as X for regression; see [51] for other results on information distillation.)
I.63 (IT version of Szemerédi regularity [414]) Fix ϵ, m > 0 and consider random variables Y, X =
(X1 , X2 ) with Y ∈ [−1, 1], X = X1 × X2 finite (but giant) and I(X; Y) ≤ m. In this excercise,
all auxiliary random variables U have structure U = (f1 (X1 ), f2 (X2 )) for some deterministic
functions f1 , f2 . Thus U partitions X into product blocks and we call block U = u0 ϵ-regular if
|E[(Y − E[Y|U])1S |U = u0 ]| ≤ ϵ ∀S = {X1 ∈ A1 , X2 ∈ A2 } .
We will show there is J = J(ϵ, m) such that there exists a U with |U| ≤ J and such that
P[block U is not ϵ-regular] ≤ ϵ . (I.20)
(a) Suppose that we found random variables Y → X → U′ → U such that (i) I(Y; U′ |U) ≤ ϵ4
4
and (ii) for all S as above I(Y; 1S |U′ ) ≤ |Uϵ |2 . Then (I.20) holds with ϵ replaced by O(ϵ).
(Hint: define g(u0 ) = E[| E[Y|U′ ] − E[Y|U]| |U = u0 ] and show via (7.28) that E[g(U)] ≲ ϵ2 .
ϵ2
As in (I.19) argue that E[(Y − E[Y|U′ ])1S ] ≲ |U | . From triangle inequality any u0 -block is
O(ϵ)-regular whenever g(u0 ) < ϵ and P[U = u0 ] > |Uϵ | . Finally, apply Markov inequality
twice to show that the last condition is violated with O(ϵ) probability.)
(b) Show that such U′ , U indeed exist. (Hint: Construct a sequence Y → X → · · · Uj → Uj−1 →
· · · U0 = 0 sequentially by taking Uj+1 to be maximizer of I(Y; U) − δj+1 log |U| among all
4
Y → X → U → Uj (compare Exercise I.62) and δj+1 = |Uϵj |2 . We take U′ , U = Un+1 , Un
for the first pair that has I(Y; Un+1 |Un ) ≤ ϵ4 . Show n ≤ ϵm4 and |Un | is bounded by the n-th
iterate of map h → h exp{mh2 /ϵ4 } started from h = 1.)
Remark: The point is that J does not depend on PX,Y or |X |. For Szemerédi’s regularity lemma
one takes X1 , X2 to be uniformly sampled vertices of a bipartite graph and Y = 1{X1 ∼ X2 } is
the incidence relation. An ϵ-regular block corresponds to an ϵ-regular bipartite subgraph, and
lemma decomposes arbitrary graph into finitely many pairwise (almost) regular subgraphs.
i i
i i
i i
I.64 (Entropy and binary convolution) Binary convolution is defined for (a, b) ∈ [0, 1]2 by a ∗ b =
a(1 − b) + (1 − a)b and describes the law of Ber(a) ⊕ Ber(b) where ⊕ denotes modulo-2
addition.
(a) (Mrs. Gerber’s lemma, MGL5 ) Let (U, X) ⊥
⊥ Z ∼ Ber(δ) with X ∈ {0, 1}. Show that
H(X|U)H(Z)
h(h−1 (H(X|U)) ∗ δ) ≤ H(X ⊕ Z|U) ≤ H(X|U) + H(Z) − .
log 2
(Hint: equivalently [457], need to show that a parametric curve (h(p), h(p ∗ δ)), p ∈ [0, 1/2]
is convex.)
(b) Show that for any p, q the parametric curve ((1 − 2r)2 , d(p ∗ rkq ∗ r)), r ∈ [0, 1/2] is convex.
(Hint: see [367, Appendix A])
MGL has various applications (Example 16.1 and Exercise VI.21), it tensorizes (see Exer-
cise III.32) and its infinitesimal version (derivative in δ = 0+) is exactly the log-Sobolev
inequality for the hypercube [122, Section 4.1].
I.65 (log-Sobolev inequality, LSI) Let X be a Rd -valued random variable, E[kXk2 ] < ∞, and
X ⊥⊥ Z ∼ γ , where γ = N (0, Id ) is the standard Gaussian measure. Recall the notation for
Fisher information matrix J(·) from (2.40).
(a) Show de Bruijn’s identity:
d √ log e √
h(X + aZ) = tr J(X + aZ)
da 2
(Hint: inspect the proof of Theorem 3.14)
(b) Show that EPI implies
d √
exp{2h(X + aZ)/d} ≥ 2πe .
da
(c) Conclude that Gaussians minimize the differential entropy among all X with bounded Fisher
information J(X), namely [399]
n 2πen
h(X) ≥
log .
2 tr J(X)
R
(d) Show the LSI of Gross [200]: For any f with f2 dγ = 1, we have
Z Z
f2 ln(f2 )dγ ≤ 2 · k∇fk2 dγ .
R
(Hint: PX (dx) = f2 (x)γ(dx), prove 2 (xT ∇f)fγ(dx) = E[kXk2 ] − d and use ln(1 + y) ≤ y.)
I.66 (Stochastic localization [148, 146]) Consider a discrete X ∼ μ taking values in Rn and let ρ =
Pn
E[kX − E[X]k2 ] = i=1 Var[Xi ]. We will show that for any ϵ > 0 there exists a decomposition
of μ = Eθ μθ as a mixture of measures μθ , which have similar entropy ( Eθ [H( μθ )] = H( μ) −
O(ρ/ϵ)) but have almost no pairwise correlations (Eθ [Cov( μθ )] ϵIn and Eθ [k Cov( μθ )k2F ] =
O(ϵρ)). This has useful applications in statistical physics of Ising models.
5
Apparently, named after a landlady renting a house to Wyner and Ziv [457] at the time.
i i
i i
i i
√ √
(a) Let Yt = tX + ϵZ, where X ⊥ ⊥ Z ∼ N (0, Id ) and t, ϵ > 0. Show that Cov(X|Yt ) ϵt In
√
(Hint: consider the suboptimal estimator X̂(Yt ) = Yt / t).
(b) Show that 0 ≤ H(X) − H(X|Yt ) ≤ n2 log(1 + ϵtn ρ) ≤ t log e
2Rϵ ρ. (Hint: use (5.22))
2
(c) Show that ρ ≥ mmse(X|Y1 ) − mmse(X|Y2 ) = 1ϵ 1 E[kΣt (Yt )k2F ]dt, where Σt (y) =
Cov[X|Yt = y]. (Hint: use (3.23)).
Thus we conclude that for some t ∈ [1, 2] decomposing μ = EYt PX|Yt satisfies the stated claims.
i i
i i
i i
Part II
i i
i i
i i
i i
i i
i i
195
• Variable-length lossless compression. Here we require P[X 6= X̂] = 0, where X̂ is the decoded
version. To make the question interesting, we compress X into a variable-length binary string. It
will turn out that optimal compression length is H(X) − O(log(1 + H(X))). If we further restrict
attention to so-called prefix-free or uniquely decodable codes, then the optimal compression
length is H(X) + O(1). Applying these results to n-letter variables X = Sn we see that optimal
compression length normalized by n converges to the entropy rate (Section 6.3) of the process
{Sj }.
6
Of course, one should not take these “laws” too far. In regards to language modeling, (finite-state) Markov assumption is
too simplistic to truly generate all proper sentences, cf. Chomsky [94]. However, astounding success of modern high-order
Markov models, such as GPT-4 [320], shows that such models are very difficult to distinguish from true language.
i i
i i
i i
196
• Fixed-length, almost lossless compression. Here, we allow some very small (or vanishing with
n → ∞ when X = Sn ) probability of error, i.e. P[X 6= X̂] ≤ ϵ. It turns out that under mild
assumptions on the process {Sj }, here again we can compress to entropy rate but no more.
This mode of compression permits various beautiful results in the presence of side-information
(Slepian-Wolf, etc).
• Lossy compression. Here we require only E[d(X, X̂)] ≤ ϵ where d(·, ·) is some loss function.
This type of compression problems is the topic of Part V.
We also note that thinking of the X = Sn , it would be more correct to call the first two com-
pression types above as “fixed-to-variable” and “fixed-to-fixed”, because they take fixed number
of input letters and produce variable or fixed number of output bits. There exists other types of
compression algorithms, which we do not discuss, e.g. a beautiful variable-to-fixed compressor
of Tunstall [425].
i i
i i
i i
10 Variable-length compression
In this chapter we consider a basic question: how does one describe a discrete random variable
X ∈ X in terms of a variable-length bit string so that the description is the shortest possible. The
basic idea, already used in the telegraph’s Morse code, is completely obvious: shorter descriptions
(bit strings) should correspond to more probable symbols. Later, however, we will see that this
basic idea becomes a lot more subtle once we take X to mean a group of symbols. The discovery
of Shannon was that compressing groups of symbols together (even if they are iid!) can lead
to impressive savings in compressed length. That is, coding English text by first grouping 10
consecutive characters together is much better than doing so on a character-by-character basis. One
should appreciate boldness of Shannon’s proposition since sorting all possible 2610 realizations of
the 10-letter English chunks in the order of their decreasing frequency appears quite difficult. It is
only later, with the invention of Huffman coding, arithmetic coding and Lempel-Ziv compressors
(decades after) that these methods became practical and ubiquitous.
In this Chapter we discover that the minimal compression length of X is essentially equal to the
entropy H(X) for both the single-shot, uniquely-decodable and prefix-free codes. These results
are the first examples of coding theorems in our book, that is results connecting an operational
problem and an information measure. (For this reason, compression is also called source coding
in information theory.) In addition, we also discuss the so called Zipf law and how its widespread
occurrence can be described information-theoretically.
X Compressor
{0, 1}∗ Decompressor X
f: X →{0,1}∗ g: {0,1}∗ →X
1 It maps each symbol x ∈ X into a variable-length string f(x) in {0, 1}∗ ≜ ∪k≥0 {0, 1}k =
{∅, 0, 1, 00, 01, . . . }. Each f(x) is referred to as a codeword and the collection of codewords the
codebook.
197
i i
i i
i i
198
PX (i)
i
1 2 3 4 5 6 7 ···
∗
f
∅ 0 1 00 01 10 11 ···
2 It is lossless for X: there exists a decompressor g : {0, 1}∗ → X such that P [X = g(f(X))] = 1.
In other words, f is injective on the support of PX .
Notice that since {0, 1}∗ is countable, lossless compression is only possible for discrete X. Also,
since the structure of X is not important, we can relabel X such that X = N = {1, 2, . . . } and
sort the PMF decreasingly: PX (1) ≥ PX (2) ≥ · · · . In a single-shot compression setting, cf. [251],
we do not impose any additional constraints on the map f. Later in Section 10.3 we will introduce
conditions such as prefix-freeness and unique-decodability.
To quantify how good a compressor f is, we introduce the length function l : {0, 1}∗ → Z+ , e.g.,
l(∅) = 0, l(01001) = 5. We could consider different objectives for selecting the best compressor f,
for example, minimizing any of E[l(f(X))], esssup l(f(X)), median[l(f(X))] appears reasonable. It
turns out that there is a compressor f∗ that minimizes all objectives simultaneously. As mentioned
in the preface to this chapter, the main idea is to assign longer codewords to less likely symbols,
and reserve the shorter codewords for more probable symbols. To make precise of the optimality
of f∗ , let us recall the concept of stochastic dominance.
st.
By definition, X ≤ Y if and only if the CDF of X is larger than that of Y pointwise; in other words,
the distribution of X assigns more probability to lower values than that of Y does. In particular, if
X is dominated by Y stochastically, so are their means, medians, supremum, etc.
Theorem 10.2 (Optimal f∗ ) Consider the compressor f∗ defined (for a down-sorted PMF
PX ) by f∗ (1) = ∅, f∗ (2) = 0, f∗ (3) = 1, f∗ (4) = 00, etc, assigning strings with increasing lengths
to symbols i ∈ X . (See Figure 10.1 for an illustration.) Then
i i
i i
i i
1 Length of codeword:
2 l(f∗ (X)) is stochastically the smallest: For any lossless compressor f : X → {0, 1}∗ ,
st.
l(f∗ (X)) ≤ l(f(X))
i.e., for any k, P[l(f(X)) ≤ k] ≤ P[l(f∗ (X)) ≤ k]. As a result, E[l(f∗ (X))] ≤ E[l(f(X))].
Here the inequality is because f is lossless so that |Ak | can at most be the total number of binary
strings of length up to k. Then
X X
P[l(f(X)) ≤ k] = PX (x) ≤ PX (x) = P[l(f∗ (X)) ≤ k], (10.1)
x∈Ak x∈A∗
k
since |Ak | ≤ |A∗k | and A∗k contains all 2k+1 − 1 most likely symbols.
Having identified the optimal compressor the next question is to understand its average com-
pression length E[ℓ(f∗ (X))]. It turns out that one can in fact compute it exactly as an infinite series,
see Exercise II.1. However, much more importantly, it turns out to be essentially equal to H(X).
Specifically, we have the following result.
Remark 10.1 (Source coding theorem) Theorem 10.3 is the first example of a coding
theorem in this book, which relates the fundamental limit E[l(f∗ (X))] (an operational quantity) to
the entropy H(X) (an information measure).
Proof. Define L(X) = l(f∗ (X))). For the upper bound, observe that since the PMF are ordered
decreasingly by assumption, PX (m) ≤ 1/m, so L(m) ≤ log2 m ≤ log2 (1/PX (m)). Taking
expectation yields E[L(X)] ≤ H(X).
For the lower bound,
( a)
H(X) = H(X, L) = H(X|L) + H(L) ≤ E[L] + H(L)
(b) 1
≤ E [ L] + h (1 + E[L])
1 + E[L]
i i
i i
i i
200
1
= E[L] + log2 (1 + E[L]) + E[L] log2 1 + (10.2)
E [ L]
( c)
≤ E[L] + log2 (1 + E[L]) + log2 e
(d)
≤ E[L] + log2 (e(1 + H(X)))
where in (a) we have used the fact that H(X|L = k) ≤ k bits, because f∗ is lossless, so that given
f∗ (X) ∈ {0, 1}k , X can take at most 2k values; (b) follows by Exercise I.4; (c) is via x log(1 + 1/x) ≤
log e, ∀x > 0; and (d) is by the previously shown upper bound H(X) ≤ E[L].
To give an illustration, we need to introduce an important method of going from a single-letter
i.i.d.
source to a multi-letter one, already alluded to in the preface. Suppose that Sj ∼ PS (this is called a
memoryless source). We can group n letters of Sj together and consider X = Sn as one super-letter.
Applying our results to random variable X we obtain:
nH(S) ≥ E[l(f∗ (Sn ))] ≥ nH(S) − log2 n + O(1).
In fact for memoryless sources, the exact asymptotic behavior is found in [408, Theorem 4]:
(
∗ n nH(S) + O(1) , PS = Unif
E[l(f (S ))] = .
nH(S) − 2 log2 n + O(1) , PS 6= Unif
1
1
For the case of sources for which log2 PS has non-lattice distribution, it is further shown in [408,
Theorem 3]:
1
E[l(f∗ (Sn ))] = nH(S) − log2 (8πeV(S)n) + o(1) , (10.3)
2
where V(S) is the varentropy of the source S:
1
V(S) ≜ Var log2 . (10.4)
PS (S)
Theorem 10.3 relates the mean of l(f∗ (X)) to that of log2 PX1(X) (entropy). It turns out that
distributions of these random variables are also closely related.
Proof. Lower bound (achievability): Use PX (m) ≤ 1/m. Then similarly as in Theorem 10.3,
L(m) = blog2 mc ≤ log2 m ≤ log2 PX 1(m) . Hence L(X) ≤ log2 PX1(X) a.s.
Upper bound (converse): Consider, the following chain
1 1
P [L ≤ k] = P L ≤ k, log2 ≤ k + τ + P L ≤ k, log2 >k+τ
PX (X) PX (X)
i i
i i
i i
X
1
≤ P log2 ≤k+τ + PX (x)1 {l(f∗ (x)) ≤ k}1 PX (x) ≤ 2−k−τ
PX (X)
x∈X
1
≤ P log2 ≤ k + τ + (2k+1 − 1) · 2−k−τ
PX (X)
Corollary 10.5 Let (S1 , S2 , . . .) be a random process and U, V real-valued random variable.
Then
1 1 d 1 ∗ n d
log2 →U
− ⇔ l(f (S ))−
→U (10.5)
n PSn (Sn ) n
and
1 1 1
√ (l(f∗ (Sn )) − H(Sn ))−
d d
√ log2 n
− H( S ) →
n
−V ⇔ →V (10.6)
n PS (S )
n n
Proof. First recall that convergence in distribution is equivalent to convergence of CDF at all
d
→U ⇔ P [Un ≤ u] → P [U ≤ u] for all u at which point the CDF of U is
continuity point, i.e., Un −
continuous (i.e., not an atom of U).
√
To get (10.5), apply Theorem 10.4 with k = un and τ = n:
1 1 1 ∗ 1 1 1 √
P log2 ≤ u ≤ P l(f (X)) ≤ u ≤ P log2 ≤ u + √ + 2− n+1 .
n PX (X) n n PX (X) n
√
To get (10.6), apply Theorem 10.4 with k = H(Sn ) + nu and τ = n1/4 :
∗ n
1 1 l(f (S )) − H(Sn )
P √ log − H( S ) ≤ u ≤ P
n
√ ≤u
n PSn (Sn ) n
1 1 −1/4
+ 2−n +1 .
1/ 4
≤P √ log n
− H ( S n
) ≤ u + n
n PSn (S )
(10.7)
Now let us particularize the preceding theorem to memoryless sources of i.i.d. Sj ’s. The
important observation is that the log likelihood becomes an i.i.d. sum:
1 X n
1
log n
= log .
PSn (S ) PS (Si )
i=1 | {z }
i.i.d.
i i
i i
i i
202
P
1 By the weak law of large numbers (WLLN), we know that n1 log PSn 1(Sn ) −→E log PS1(S) = H(S).
Therefore in (10.5) the limiting distribution U is degenerate, i.e., U = H(S), and we have the
following result of fundamental importance:1
1 ∗ n P
l(f (S ))−
→H(S) .
n
That is, the optimal compression rate of an iid process converges to its entropy rate. This is
a version of Shannon’s source coding theorem, which we will also discuss in the subsequent
chapter.
2 By the Central Limit Theorem (CLT), if varentropy V(S) < ∞, then we know that V in (10.6)
is Gaussian, i.e.,
1 1 d
p log n)
− nH(S) −→N (0, 1).
nV(S) PSn ( S
Consequently, we have the following Gaussian approximation for the probability law of the
optimal code length
1
(l(f∗ (Sn )) − nH(S))−
d
p →N (0, 1),
nV(S)
or, in shorthand,
p
l(f∗ (Sn )) ≈ nH(S) + nV(S)N (0, 1) in distribution.
Gaussian approximation tells us the speed of convergence 1n l(f∗ (Sn )) → H(S) and also gives us
a good approximation of the distribution of length at finite n.
Example 10.1 (Ternary source) Next we apply our bounds to approximate the distribu-
tion of l(f∗ (Sn )) in a concrete example. Consider a memoryless ternary source outputting i.i.d. n
symbols from the distribution PS = [0.445, 0.445, 0.11]. We first compare different results on the
minimal expected length E[l(f∗ (Sn ))] in the following table:
Blocklength Lower bound (10.3) E[l(f∗ (Sn ))] H(Sn ) (upper bound) asymptotics (10.3)
n = 20 21.5 24.3 27.8 23.3 + o(1)
n = 100 130.4 134.4 139.0 133.3 + o(1)
n = 500 684.1 689.2 695.0 688.1 + o(1)
In all cases above E[l(f∗ (S))] is close to a midpoint between the bounds.
1
Convergence to a constant in distribution is equivalent to that in probability.
i i
i i
i i
Optimal compression: CDF, n = 200, PS = [0.445 0.445 0.110] Optimal compression: PMF, n = 200, P S = [0.445 0.445 0.110]
1 0.06
True PMF
Gaussian approximation
Gaussian approximation (mean adjusted)
0.9
0.05
0.8
0.7
0.04
0.6
0.5
P
0.03
P
0.4
0.02
0.3
0.2
0.01
True CDF
0.1 Lower bound
Upper bound
Gaussian approximation
Gaussian approximation (mean adjusted)
0 0
1.25 1.3 1.35 1.4 1.45 1.5 1.25 1.3 1.35 1.4 1.45 1.5
Rate Rate
Figure 10.2 Left plot: Comparison of the true CDF of l(f∗ (Sn )), bounds of Theorem 10.4 (optimized over τ ),
and the Gaussian approximations in (10.8) and (10.9). Right plot: PMF of the optimal compression length
l(f∗ (Sn )) and the two Gaussian approximations.
Next we consider the distribution of l(f∗ (Sn ). Its Gaussian approximation is defined as
p
nH(S) + nV(S)Z , Z ∼ N ( 0, 1) . (10.8)
i i
i i
i i
204
Figure 10.3 The log-log frequency-rank plots of the most used words in various languages exhibit a power
law tail with exponent close to 1, as popularized by Zipf [477]. Data from [398].
where optimization is over lossless encoders and probability distributions PX = {pj : j = 1, . . .}.
Theorem 10.3 (or more precisely, the intermediate result (10.2)) shows that
It turns out that the upper bound is in fact tight. Furthermore, among all distributions the optimal
tradeoff between entropy and minimal compression length is attained at power law distributions.
To show that, notice that in computing H(Λ), we can restrict attention to sorted PMFs p1 ≥ p2 ≥
· · · (call this class P ↓ ), for which the optimal encoder is such that l(f(j)) = blog2 jc (Theorem 10.2).
Thus, we have shown
X
H(Λ) = sup {H(P) : pj blog2 jc ≤ Λ} .
P∈P ↓ j
i i
i i
i i
Next, let us fix the base of the logarithm of H to be 2, for convenience. (We will convert to arbitrary
base at the end). Applying Example 5.2 we obtain:
H(Λ) ≤ inf λΛ + log2 Z(λ) , (10.10)
λ>0
P∞ P∞
where Z(λ) = n=1 2−λ⌊log2 n⌋ = m=0 2(1−λ)m = 1−211−λ if λ > 1 and Z(λ) = ∞ otherwise.
Clearly, the infimum over λ > 0 is a minimum attained at a value λ∗ > 1 satisfying
d
Λ=− log2 Z(λ) .
dλ λ=λ∗
The argument of Mandelbrot [291] The above derivation shows a special (extremality) prop-
erty of the power law, but falls short of explaining its empirical ubiquity. Here is a way to connect
the optimization problem H(Λ) to the evolution of the natural language. Suppose that there is a
countable set S of elementary concepts that are used by the brain as building blocks of perception
and communication with the outside world. As an approximation we can think that concepts are
in one-to-one correspondence with language words. Now every concept x is represented internally
by the brain as a certain pattern, in the simplest case – a sequence of zeros and ones of length l(f(x))
([291] considers more general representations). Now we have seen that the number of sequences
of concepts with a composition P grows exponentially (in length) with the exponent given by
H(P), see Proposition 1.5. Thus in the long run the probability distribution P over the concepts
results in the rate of information transfer equal to EP [Hl((fP(X) ))] . Mandelbrot concludes that in order
to transfer maximal information per unit, language and brain representation co-evolve in such a
way as to maximize this ratio. Note that
H(P) H(Λ)
sup = sup .
P,f EP [l(f(X))] Λ Λ
It is not hard to show that H(Λ) is concave and thus the supremum is achieved at Λ = 0+ and
equals infinity. This appears to have not been observed by Mandelbrot. To fix this issue, we can
i i
i i
i i
206
postulate that for some unknown physiological reason there is a requirement of also having a
certain minimal entropy H(P) ≥ h0 . In this case
H(P) h0
sup = −1
P,f:H(P)≥h0 EP [l(f(X))] H ( h0 )
and the supremum is achieved at a power law distribution P. Thus, the implication is that the fre-
quency of word usage in human languages evolves until a power law is attained, at which point it
maximizes information transfer within the brain. That’s the gist of the argument of [291]. It is clear
that this does not explain appearance of the power law in other domains, for which other explana-
tions such as preferential attachment models are more plausible, see [305]. Finally, we mention
that the Pλ distributions take discrete values 2−λm−log2 Z(λ) , m = 0, 1, 2, . . . with multiplicities 2m .
Thus Pλ appears as a rather unsightly staircase on frequency-rank plots such as Figure 10.3. This
artifact can be alleviated by considering non-binary brain representations with unequal lengths of
signals.
Definition 10.8 (Prefix codes) f : A → {0, 1}∗ is a prefix code2 if no codeword is a prefix
of another (e.g., 010 is a prefix of 0101).
2
Also known as prefix-free/comma-free/self-punctuating/instantaneous code.
i i
i i
i i
10.3 Uniquely decodable codes, prefix codes and Huffman codes 207
• f(a) = 0, f(b) = 1, f(c) = 10. Not uniquely decodable, since f(ba) = f(c) = 10.
• f(a) = 0, f(b) = 10, f(c) = 11. Uniquely decodable and a prefix code.
• f(a) = 0, f(b) = 01, f(c) = 011, f(d) = 0111 Uniquely decodable but not a prefix code, since
as long as 0 appears, we know that the previous codeword has terminated.3
Remark 10.3
1 Prefix codes are uniquely decodable and hence lossless, as illustrated in the following picture:
prefix codes
Huffman
code
2 Similar to prefix-free codes, one can define suffix-free codes. Those are also uniquely decodable
(one should start decoding in reverse direction).
3 By definition, any uniquely decodable code does not have the empty string as a codeword. Hence
f : X → {0, 1}+ in both Definition 10.7 and Definition 10.8.
4 Unique decodability means that one can decode from a stream of bits without ambiguity, but
one might need to look ahead in order to decide the termination of a codeword. (Think of the
last example). In contrast, prefix codes allow the decoder to decode instantaneously without
looking ahead.
5 Prefix codes are in one-to-one correspondence with binary trees (with codewords at leaves). It
is also equivalent to strategies to ask “yes/no” questions previously mentioned at the end of
Section 1.1.
3
In this example, if 0 is placed at the very end of each codeword, the code is uniquely decodable, known as the unary code.
i i
i i
i i
208
1 Let f : A → {0, 1}∗ be uniquely decodable. Set la = l(f(a)). Then f satisfies the Kraft inequality
X
2−la ≤ 1. (10.11)
a∈A
2 Conversely, for any set of code length {la : a ∈ A} satisfying (10.11), there exists a prefix code
f, such that la = l(f(a)). Moreover, such an f can be computed efficiently.
Remark 10.4 The consequence of Theorem 10.9 is that as far as compression efficiency is
concerned, we can ignore those uniquely decodable codes that are not prefix codes.
Proof. We prove the Kraft inequality for prefix codes and uniquely decodable codes separately.
The proof for the former is probabilistic, following ideas in [15, Exercise 1.8, p. 12]. Let f be a
prefix code. Let us construct a probability space such that the LHS of (10.11) is the probability
of some event, which cannot exceed one. To this end, consider the following scenario: Generate
independent Ber( 12 ) bits. Stop if a codeword has been written, otherwise continue. This process
P
terminates with probability a∈A 2−la . The summation makes sense because the events that a
given codeword is written are mutually exclusive, thanks to the prefix condition.
Now let f be a uniquely decodable code. The proof uses generating function as a device for
counting. (The analogy in coding theory is the weight enumerator function.) First assume A is
P PL
finite. Then L = maxa∈A la is finite. Let Gf (z) = a∈A zla = l=0 Al (f)zl , where Al (f) denotes
the number of codewords of length l in f. For k ≥ 1, define fk : Ak → {0, 1}+ as the symbol-
P k k P P
by-symbol extension of f. Then Gfk (z) = ak ∈Ak zl(f (a )) = a1 · · · ak zla1 +···+lak = [Gf (z)]k =
PkL k l
l=0 Al (f )z . By the unique decodability of f, fk is lossless. Hence Al (fk ) ≤ 2l . Therefore we have
P
Gf (1/2)k = Gfk (1/2) ≤ kL for all k. Then a∈A 2−la = Gf (1/2) ≤ limk→∞ (kL)1/k = 1. If A is
P
countably infinite, for any finite subset A′ ⊂ A, repeating the same argument gives a∈A′ 2−la ≤
1. The proof is complete by the arbitrariness of A′ .
P
Conversely, given a set of code lengths {la : a ∈ A} s.t. a∈A 2−la ≤ 1, construct a prefix
code f as follows: First relabel A to N and assume that 1 ≤ l1 ≤ l2 ≤ . . .. For each i, define
X
i−1
ai ≜ 2−lk
k=1
with a1 = 0. Then ai < 1 by Kraft inequality. Thus we define the codeword f(i) ∈ {0, 1}+ as the
first li bits in the binary expansion of ai . Finally, we prove that f is a prefix code by contradiction:
Suppose for some j > i, f(i) is the prefix of f(j), since lj ≥ li . Then aj − ai ≤ 2−li , since they agree
on the most significant li bits. But aj − ai = 2−li + 2−li+1 +. . . > 2−li , which is a contradiction.
Remark
P
10.5 A conjecture of Ahlswede et al [7] states that for any set of lengths for which
2−la ≤ 43 there exists a fix-free code (i.e. one which is simultaneously prefix-free and suffix-
free). So far, existence has only been shown when the Kraft sum is ≤ 58 , cf. [466].
i i
i i
i i
10.3 Uniquely decodable codes, prefix codes and Huffman codes 209
In view of Theorem 10.9, the optimal average code length among all prefix (or uniquely
decodable) codes is given by the following optimization problem
X
L∗ (X) ≜ min PX (a)la (10.12)
a∈A
X
s.t. 2− l a ≤ 1
a∈A
la ∈ N
This is an integer programming (IP) problem, which, in general, is computationally hard to solve.
It is remarkable that this particular IP can be solved in near-linear time, thanks to the Huffman
algorithm. Before describing the construction of Huffman codes, let us give bounds to L∗ (X) in
terms of entropy:
Theorem 10.10
H(X) ≤ L∗ (X) ≤ H(X) + 1 bit. (10.13)
Proof. Right inequality: Consider the following length assignment la = log2 PX1(a) ,4 which
P P
satisfies Kraft since l a∈A 2−la m≤ a∈A PX (a) = 1. By Theorem 10.9, there exists a prefix code
f such that l(f(a)) = log2 PX1(a) and El(f(X)) ≤ H(X) + 1.
Light inequality: We give two proofs for this converse. One of the commonly used ideas to deal
with combinatorial optimization is relaxation. Our first idea is to drop the integer constraints in
(10.12) and relax it into the following optimization problem, which obviously provides a lower
bound
X
L∗ (X) ≜ min PX (a)la (10.14)
a∈A
X
s.t. 2− l a ≤ 1
a∈A
This is a nice convex optimization problem, with an affine objective function and a convex feasible
set. Solving (10.14) by Lagrange multipliers (Exercise!) yields that the minimum is equal to H(X)
(achieved at la = log2 PX1(a) ).
Another proof is the following: For any f whose codelengths {la } satisfying the Kraft inequality,
− la
define a probability measure Q(a) = P 2 2−la . Then
a∈A
X
El(f(X)) − H(X) = D(PkQ) − log 2−la ≥ 0.
a∈A
4
Such a code is called a Shannon code.
i i
i i
i i
210
Next we describe the Huffman code, which achieves the optimum in (10.12). In view of the fact
that prefix codes and binary trees are one-to-one, the main idea of the Huffman code is to build
the binary tree from the bottom up: Given a PMF {PX (a) : a ∈ A},
The algorithm terminates in |A| − 1 steps. Given the binary tree, the code assignment can be
obtained by assigning 0/1 to the branches. Therefore the time complexity is O(|A|) (sorted PMF)
or O(|A| log |A|) (unsorted PMF).
d e
Theorem 10.11 (Optimality of Huffman codes) The Huffman code achieves the minimal
average code length (10.12) among all prefix (or uniquely decodable) codes.
1 Constructing the Huffman code requires knowing the source distribution. This brings us the
question: Is it possible to design universal compressor which achieves entropy for a class of
source distributions? And what is the price to pay? These questions are the topic of universal
compression and will be addressed in Chapter 13.
2 To understand the main limitation of Huffman coding, we recall that (as Shannon pointed out),
while Morse code already exploits the nonequiprobability of English letters, working with
pairs (or more generally, n-grams) of letters achieves even more compression, since letters in
a pair are not independent. In other words, to compress a block of symbols (S1 , . . . , Sn ) by
applying Huffman code on a symbol-by-symbol basis one can achieve an average length of
Pn
i=1 H(Si ) + n bits. But applying Huffman codes on a whole block (S1 , . . . , Sn ), that is the
code designed for PS1 ,...,Sn , allows to exploit the memory in the source and achieve compres-
P
sion length H(S1 , . . . , Sn ) + O(1). Due to (1.3) the joint entropy is smaller than i H(Si ) (and
i i
i i
i i
10.3 Uniquely decodable codes, prefix codes and Huffman codes 211
usually much smaller). However, the drawback of this idea is that constructing the Huffman
code has complexity |A|n – exponential in the blocklength.
To resolve these problems we will later study other methods:
1 Arithmetic coding has a sequential encoding algorithm with complexity linear in the block-
length, while still attaining H(Sn1 ) length – Section 13.1.
2 Lempel-Ziv algorithm also has low-complexity and is even universal, provably optimal for all
ergodic sources – Section 13.8.
As a summary of this chapter, we learned the following relationship between entropy and
compression length of various codes:
H(X) − log2 [e(H(X) + 1)] ≤ E[l(f∗ (X))] ≤ H(X) ≤ E[l(fHuffman (X))] ≤ H(X) + 1.
i i
i i
i i
In previous chapter we introduced the concept of variable-length compression and studied its
fundamental limits with and without prefix-free condition. In some situations, however, one may
desire that the output of the compressor always be of a fixed length, say, k bits. Unless k is unrea-
sonably large, then, this will require relaxing the losslessness condition. This is the focus of this
chapter: compression in the presence of (typically vanishingly small) probability of error. It turns
out allowing even very small error enables several beautiful effects:
• The possibility to compress data via matrix multiplication over finite fields (linear compression
or hashing).
• The possibility to reduce compression length from H(X) to H(X|Y) if side information Y is
available at the decompressor (Slepian-Wolf).
• The possibility to reduce compression length below H(X) if access to a compressed representa-
tion of side-information Y is available at the decompressor (Ahlswede-Körner-Wyner).
All of these effects are ultimately based on the fundamental property of many high-dimensional
probability distributions, the asymptotic equipartition (AEP), which we study in the context of iid
distributions. Later we will extend this property to all ergodic processes in Chapter 12.
Note that if we insist like in Chapter 10 that g(f(X)) = X with probability one, then k ≥
log2 |supp(PX )| and no meaningful compression can be achieved. It turns out that by tolerating
a small error probability, we can gain a lot in terms of code length! So, instead of requiring
g(f(x)) = x for all x ∈ X , consider only lossless decompression for a subset S ⊂ X :
(
x x∈S
g(f(x)) =
e x 6∈ S
212
i i
i i
i i
P [g(f(X)) 6= X] = P [g(f(X)) = e] = P [X ∈
/ S] .
f : X → {0, 1}k
g : {0, 1}k → X ∪ {e}
such that g(f(x)) ∈ {x, e} for all x ∈ X and P [g(f(X)) = e] ≤ ϵ. The fundamental limit of fixed-
length compression is simply the minimum probability of error and is defined as
The following result connects the respective fundamental limits of fixed-length almost lossless
compression and variable-length lossless compression (Section 10.1):
Proof. Note that because of the assumption X = N compressor must reserve one k-bit string for
the error message even if PX (1) = 1. The proof is essentially tautological. Note 1 + 2 +· · ·+ 2k−1 =
2k − 1. Let S be the set of top 2k − 1 most likely (as measured by PX (x)) elements x ∈ X . Then
i i
i i
i i
214
freedom in designing codes. It turns out that we do not gain much by this relaxation. Indeed, if
we define
Corollary 11.3 (Shannon’s source coding theorem) Let Sn be i.i.d. discrete random
variables. Then for any R > 0 and γ ∈ R asymptotically in blocklength n we have
∗ n 0 R > H( S)
lim ϵ (S , nR) =
n→∞ 1 R < H( S)
This result demonstrates that if we are to compress an iid string Sn down to k = k(n) bits
then minimal possible k enabling vanishing error satisfies nk = H(S), that is we can compress to
entropy rate of the iid process S and no more. Furthermore, if we allow a non-vanishing error ϵ
then compression is possible down to
p
k = nH(S) + nV(S)Q−1 (ϵ)
bits. In the language of modern information theory, Corollary 11.3 derives both the asymptotic
fundamental limit (minimal k/n) and the normal approximation under non-vanishing error.
The next desired step after understanding asymptotics is to derive finite blocklength guarantees,
that is bounds on ϵ∗ (X, k) in terms of the information quantities. As we mentioned above, the
upper and lower bounds are typically called achievability and converse bounds. In the case of
lossless compression such bounds are rather trivial corollaries of Theorem 11.2, but we present
them for completeness next. For other problems in this Part and other Parts obtaining good finite
blocklength bounds is much more challenging.
Theorem 11.4 (Finite blocklength bounds) For all τ > 0 and all k ∈ Z+ we have
1 −τ ∗ ∗ 1
P log2 > k + τ − 2 ≤ ϵ̃ (X, k) ≤ ϵ (X, k) ≤ P log2 ≥k .
PX (X) PX (X)
i i
i i
i i
Proof. The argument for the lower (converse) bound is identical to the converse of Theorem 10.4.
Indeed, considering the optimal (undetectable error) code let S = {x : g(f(x)) = x} and note
∗ 1 1
1 − ϵ̃ (X, k) = P [X ∈ S] ≤ P log2 ≤ k + τ + P X ∈ S, log2 >k+τ
PX (X) PX (X)
where we used the fact that |S| ≤ 2k . Combining the two inequalities yields the lower bound.
For the upper bound, without loss of generality we assume PX (1) ≥ PX (2) ≥ · · · . Then by
Theorem 11.2 we have
X X 1
1
ϵ∗ (X, k) = PX (m) ≤ 1 ≥ 2k PX (m) = P log2 ≥k ,
P X ( m) PX (X)
k m≥2
where ≤ follows from the fact that mth largest mass PX (m) ≤ 1
m.
We now will do something strange. We will prove an upper bound that is weaker than that of
Theorem 11.4 and furthermore, the proof is much longer. However, this will be our first exposition
to the technique of random coding (also known as probabilistic method outside of information
theory).1 We will quickly find out that outside of the simplest setting of lossless compression,
where the optimal encoder f∗ was easy to describe, good encoders are very hard to find and thus
random coding becomes indispensable. In particular, Slepian-Wolf theorem (Section 11.5 below)
all of data transmission (Part IV) and lossy data compression (Part V) will be based on the method.
Theorem 11.5 (Random coding achievability) For any k ∈ Z+ and any τ > 0 we have
1
ϵ̃∗ (X, k) ≤ P log2 > k − τ + 2−τ , ∀τ > 0 , (11.1)
PX (X)
that is there exists a compressor-decompressor pair with the (possibly undetectable) error bounded
by the right-hand side.
Proof. We first start with constructing a suboptimal decompressor g for a given f. Indeed, for a
given compressor f, the optimal decompressor which minimizes the error probability is simply the
maximum a posteriori (MAP) decoder, i.e.,
1
These methods were discovered simultaneously by Shannon [378] and Erdös [153], respectively.
i i
i i
i i
216
However, this decoder’s performance is a little hard to analyze, so instead, we consider the
following (suboptimal) decompressor g:
x, ∃! x ∈ X s.t. f(x) = w and log2 PX1(x) ≤ k − τ,
g(w) = (exists unique high-probability x that is mapped to w)
e, otherwise
Note that log2 PX1(x) ≤ k − τ ⇐⇒ PX (x) ≥ 2−(k−τ ) . We call those x “high-probability”. (In the
language of [106] and [115] these would be called “typical” realizations).
Denote f(x) = cx and call the long vector C = [cx : x ∈ X ] a codebook. It is instructive to think
of C as a hashing table: it takes an object x ∈ X and assigns to it a k-bit hash value.
To analyze the error probability let us define
′ ′ 1
J(x, C) ≜ x ∈ X : cx′ = cx , x 6= x, log2 ≤k−τ
PX (x′ )
to be the set of high-probability inputs whose hashes collide with that of x. Then we have the
following estimate for probability of error:
1
P [g(f(X)) = e] = P log2 > k − τ ∪ {J(X, C) 6= ∅}
PX (X)
1
≤ P log2 > k − τ + P [J(X, C) 6= ϕ]
PX (X)
The first term does not depend on the codebook C , while the second term does. The idea now
is to randomize over C and show that when we average over all possible choices of codebook,
the second term is smaller than 2−τ . Therefore there exists at least one codebook that achieves
i.i.d.
the desired bound. Specifically, let us consider C generated by setting each cx ∼ Unif[{0, 1}k ] and
independently of X. Equivalently, since C can be represented by an |X | × k binary matrix, whose
rows correspond to codewords, we choose each entry to be independent fair coin flip. Averaging
the error probability (over C and over X), we have
′ 1
EC [P [J(X, C) 6= ϕ]] = EC,X 1 ∃x 6= X : log2 ≤ k − τ, cx = cX
′
PX (x′ )
X 1
≤ EC,X 1 log2 ≤ k − τ 1 {cx′ = cX } (union bound)
PX ( x′ )
x′ ̸=X
X
= 2− k E X 1 PX (x′ ) ≥ 2−k+τ
x′ ̸=X
X
≤ 2− k 1 PX (x′ ) ≥ 2−k+τ
x′ ∈X
−k k−τ
≤2 2 = 2−τ ,
where the crucial penultimate step uses the fact that there can be at most 2k−τ values of x′ with
PX (x′ ) > 2−k+τ .
i i
i i
i i
Remark 11.2 (Why random coding works) The compressor f(x) = cx can be thought as
hashing x ∈ X to a random k-bit string cx ∈ {0, 1}k , as illustrated below:
Here, x has high probability ⇔ log2 PX1(x) ≤ k − τ ⇔ PX (x) ≥ 2−k+τ . Therefore the number of
those high-probability x’s is at most 2k−τ , which is far smaller than 2k , the total number of k-bit
codewords. Hence the chance of collision among high-probability x’s is small.
Let us again emphasize that the essence of the random coding argument is the following. To
prove the existence of an object with certain property, we construct a probability distribution
(randomize) and show that on average the property is satisfied. Hence there exists at least one
realization with the desired property. The downside of this argument is that it is not constructive,
i.e., does not give us an algorithm to find the object. One may wonder whether we can practically
simply generate a large random hashing table and use it for compression. The problem is that
generating such a table requires a lot of randomness and a lot of storage space (both are impor-
tant resources). We will address this issue in Section 11.3, but for now let us make the following
remark.
Remark 11.3 (Pairwise independence of codewords) In the proof we choose the
i.i.d.
random codebook to be uniform over all possible codebooks: cx ∼ Unif. But a careful inspec-
tion (exercise!) shows that we only used pairwise independence, i.e., cx ⊥ ⊥ cx′ for any x 6= x′ .
This suggests that perhaps in generating the table we can use a lot fewer than k|X | random bits.
Indeed, given 2 independent random bits B1 , B2 we can generate 3 bits that are pairwise indepen-
dent: B1 , B2 , B1 ⊕ B2 . This observation will lead us to the idea of linear compression studied in
Section 11.3, where the codewords generated not iid, but as elements of a random linear subspace.
Proposition 11.6 (Asymptotic equipartition (AEP)) Consider iid Sn and for any δ > 0,
define the so-called entropy δ -typical set
1 1
Tδn ≜ sn : log − H ( S ) ≤ δ .
n PSn (sn )
Then the following properties hold:
i i
i i
i i
218
1 P Sn ∈ Tδn → 1 as n → ∞.
2 |Tδn | ≤ exp{(H(S) + δ)n}.
i.i.d.
For example if Sn ∼ Ber(p), then PSn (sn ) = pwH (s ) p̄n−wH (s ) , where wH (sn ) is the Hamming
n n
weight of the string (number of 1’s). Thus the typical set corresponds to those sequences whose
Hamming weight 1n wH (sn ) is close to the expected value of p + Op (δ).
Thus, P[Sn ∈ Tδn ] → 1. On the other hand, since for every sn ∈ Tδn we have PSn (sn ) > exp{−(H(S)+
δ)n} there can be at most exp{(H(S) + δ)n} elements in Tδn .
To understand the meaning of the AEP, notice that it shows that the gigantic space S n has
almost all of probability PSn concentrated on an exponentially smaller subset Tδn . Furthermore, on
this subset the measure PSn is approximately uniform: PSn (sn ) = exp{−nH(S) ± nδ}.
To see how AEP is related to compression, let us give a third proof of Shannon’s result:
∗ n 0 R > H( S)
lim ϵ (S , nR) =
n→∞ 1 R < H( S)
Indeed, let us consider an encoder f that enumerates (by strings in {0, 1}nR ) elements of Tδn . Then
if R > H(S) + δ the decoding error happens with probability P[Sn 6∈ Tδn ] → 0. Hence any rate
R > H(S) results in a vanishing error. On the other hand, if R < H(S) then it is clear that 2nR -bits
cannot describe any significant portion of |Tδn | and since on the latter the measure PSn is almost
uniform, the probability of error necessarily converges to 1 (in fact exponentially fast). There is a
certain conceptual beauty in this way of proving source coding theorem. For example, it explains
why optimal compressor’s output should look almost like iid Ber(1/2):2 after all it enumerates
over an almost uniform set Tδn .
2
This is the intuitive basis why compressors can be used as random number generators; cf. Section 9.3.
i i
i i
i i
In this section we assume that the source takes the form X = Sn , where each coordinate is an
element of a finite field (Galois field), i.e., Si ∈ Fq , where q is the cardinality of Fq . (This is only
possible if q = pk for some prime number p and k ∈ N.)
Definition 11.7 (Galois field) F is a finite set with operations (+, ·) where
A linear compressor is a linear function H : Fnq → Fkq (represented by a matrix H ∈ Fqk×n ) that
maps each x ∈ Fnq to its codeword w = Hx, namely
w1 h11 ... h1n x1
.. .. .. ..
. = . . .
wk hk1 ... hkn xn
Compression is achieved if k < n, i.e., H is a fat matrix, which, again, is only possible in the
almost lossless sense.
Theorem 11.8 (Achievability via linear codes) Let X ∈ Fnq be a random vector. For all
τ > 0, there exists a linear compressor H ∈ Fnq×k and decompressor g : Fkq → Fnq ∪ {e}, s.t. its
undetectable error probability is bounded by
1
P [g(HX) 6= X] ≤ P logq > k − τ + q−τ .
PX (X)
Remark 11.4 Consider the Hamming space q = 2. In comparison with Shannon’s random
coding achievability, which uses k2n bits to construct a completely random codebook, here for lin-
ear codes we need kn bits to randomly generate the matrix H, and the codebook is a k-dimensional
linear subspace of the Hamming space.
Proof. Fix τ . As pointed in the proof of Shannon’s random coding theorem (Theorem 11.5),
given the compressor H, the optimal decompressor is the MAP decoder, i.e., g(w) =
argmaxx:Hx=w PX (x), which outputs the most likely symbol that is compatible with the codeword
i i
i i
i i
220
where, as in the proof of Theorem 11.5, we denoted x to be “h.p.” (high probability) whenever
logq PX1(x) ≤ k − τ .
Note that this decoder is the same as in the proof of Theorem 11.5. The proof is also mostly the
same, except now hash collisions occur under the linear map H. Specifically, we have by applying
the union bound twice:
Now we use random coding to average the second term over all possible choices of H. Specif-
ically, choose H as a matrix independent of X where each entry is iid and uniform on Fq . For
distinct x0 and x1 , the collision probability is
where H1 is the first row of the matrix H, and each row of H is independent. This is the probability
that Hi is in the orthogonal complement of x2 . On Fnq , the orthogonal complement of a given
non-zero vector has cardinality qn−1 . So the probability for the first row to lie in this subspace is
qn−1 /qn = 1/q, hence the collision probability 1/qk . Averaging over H gives
X
EH 1{Hx′ = Hx} = |{x′ : x′ h.p., x′ 6= x}|q−k ≤ qk−τ q−k = q−τ .
′
x h.p.,x′ ̸=x
We remark that the bounds in Theorems 11.5 and 11.8 produce compressors with undetectable
errors. However, the non-linear construction in the former is easy to modify to make all errors
detectable (e.g. by increasing k by 1 and making sure the first bit is 1 for all x = sn with low
probability). For the linear compressors, however, the errors cannot be made detectable.
Note that we restricted our theorem to inputs over Fq . Can we loosen the requirements and
produce compressors over an arbitrary commutative ring? In general, the answer is negative due
to existences of zero divisors in the commutative ring. The latter ruin the key proof ingredient of
low collision probability in the random hashing. Indeed, consider the following computation over
i i
i i
i i
11.4 Compression with side information at both compressor and decompressor 221
Z/6Z
1 2
0 0
P H .. = 0 = 6− k but P H .. = 0 = 3− k ,
. .
0 0
since 0 · 2 = 3 · 2 = 0 in Z/6Z.
Note that here unlike the source X, the side information Y need not be discrete. Conditioned on
Y = y, the problem reduces to compression without side information studied in Section 11.1, but
with the source X distributed according to PX|Y=y . Since Y is known to both the compressor and
decompressor, they can use the best code tailored for this distribution. Recall ϵ∗ (X, k) defined in
Definition 11.1, the optimal probability of error for compressing X using k bits, which can also be
denoted by ϵ∗ (PX , k). Then we have the following relationship
which allows us to apply various bounds developed before. In particular, we clearly have the
following result.
i i
i i
i i
222
Theorem 11.10
1 1
P log > k + τ − 2−τ ≤ ϵ∗ (X|Y, k) ≤ P log2 > k − τ + 2−τ , ∀τ > 0
PX|Y (X|Y) PX|Y (X|Y)
Corollary 11.11 Let (X, Y) = (Sn , Tn ) where the pairs (Si , Ti )i.i.d.
∼ PS,T . Then
(
∗ 0 R > H(S|T)
lim ϵ (S |T , nR) =
n n
n→∞ 1 R < H(S|T)
1X
n
1 1 1 P
log = log −
→H(S|T)
n PSn |Tn (S |T )
n n n PS|T (Si |Ti )
i=1
as n → ∞. Thus, the result follows from setting (X, Y) = (Sn , Tn ) in the previous theorem.
i i
i i
i i
Here is the very surprising result of Slepian and Wolf3 , which shows that unavailability of the
side information at compressor does not hinder the compression rate at all.
From this theorem we will get by the WLLN the asymptotic result:
And we remark that the side-information (T-process) is not even required to be discrete, see
Exercise II.9.
Proof of the Theorem. LHS is obvious, since side information at the compressor and decompres-
sor is better than only at the decompressor.
For the RHS, first generate a random codebook with iid uniform codewords: C = {cx ∈ {0, 1}k :
x ∈ X } independently of (X, Y), then define the compressor and decoder as
f(x) = cx
(
x ∃!x : cx = w, x h.p.|y
g(w, y) =
0 otherwise
where we used the shorthand x h.p.|y ⇔ log2 PX|Y1(x|y) < k −τ . The error probability of this scheme,
as a function of the code book C , is
1
P[X 6= g(f(X))|C] = P log ≥ k − τ or J(X, C|Y) 6= ∅|C
PX|Y (X|Y)
1
≤ P log ≥ k − τ + P [J(X, C|Y) 6= ∅|C]
PX|Y (X|Y)
X
1
= P log ≥k−τ + PX,Y (x, y)1 {J(x, C|y) 6= ∅}.
PX|Y (X|Y) x, y
3
This result is often informally referred to as “the most surprising result post-Shannon”.
i i
i i
i i
224
where (a) is a union bound, (b) follows from the fact that |{x′ : x′ h.p.|y}| ≤ 2k−τ , and (c) is from
P[cx′ = cx ] = 2−k for any x 6= x′ .
X {0, 1}k1
Compressor f1
Decompressor g
(X̂, Ŷ)
Y {0, 1}k2
Compressor f2
i i
i i
i i
R2
Achievable
H(T )
Region
H(T |S)
R1
H(S|T ) H(S)
Since H(T) − H(T|S) = H(S) − H(S|T) = I(S; T), the slope of the skewed line is −1.
Proof. Converse: Take (R1 , R2 ) 6∈ RSW . Then one of three cases must occur:
1 R1 < H(S|T). In this case, even if f1 encoder and decoder had access to full Tn , we still can not
achieve vanishing error (Corollary 11.11).
2 R2 < H(T|S) (same).
3 R1 + R2 < H(S, T). If this were possible, then we would be compressing the joint (Sn , Tn ) at
rate lower than H(S, T), violating Corollary 11.3.
i i
i i
i i
226
Achievability: First note that we can achieve the two corner points. The point (H(S), H(T|S))
can be approached by almost lossless compressing S at entropy and compressing T with side infor-
mation S at the decoder. To make this rigorous, let k1 = n(H(S) + δ) and k2 = n(H(T|S) + δ). By
Corollary 11.3, there exist f1 : S n → {0, 1}k1 and g1 : {0, 1}k1 → S n s.t. P [g1 (f1 (Sn )) 6= Sn ] ≤
ϵn → 0. By Theorem 11.13, there exist f2 : T n → {0, 1}k2 and g2 : {0, 1}k1 × S n → T n
s.t. P [g2 (f2 (Tn ), Sn ) 6= Tn ] ≤ ϵn → 0. Now that Sn is not available, feed the S.W. decompres-
sor with g1 (f1 (Sn )) and define the joint decompressor by g(w1 , w2 ) = (g1 (w1 ), g2 (w2 , g1 (w1 )))
(see below):
Sn Ŝn
f1 g1
Tn T̂n
f2 g2
i i
i i
i i
X {0, 1}nR1
Compressor f1
Decompressor g
X̂
Y {0, 1}nR2
Compressor f2
Note also that unlike the previous section decompressor is only required to produce an esti-
mate of X (not of Y), hence the name of this problem: compression with a (rate-limited) helper.
The difficulty this time is that what needs to be communicated over this link from Y to decom-
pressor is not the information about Y but only that information in Y that is maximally useful
for decompressing X. Despite similarity with the previous sections, this task is completely new
and, consequently, characterization of rate pairs R1 , R2 is much more subtle in this case. It was
completed independently in two works [9, 459].
Furthermore, for every such random variable U the rate pair (H(X|U), I(Y; U)) is achievable with
vanishing error.
In other words, this time the set of achievable pairs (R1 , R2 ) belongs to a region of R2+ described
as ∪{[H(X|U), +∞)×[I(Y; U), +∞)} with the union taken over all possible PU|Y : Y → U , where
|U| = |Y| + 1. The boundary of the optimal (R1 , R2 )-region is traced by an FI -curve, a concept
we will define later (Definition 16.5).
Proof. First, note that iterating over all possible random variables U (without cardinality con-
straint) the set of pairs (R1 , R2 ) satisfying (11.3) is convex. Next, consider a compressor W1 =
f1 (Xn ) and W2 = f2 (Yn ). Then from Fano’s inequality (3.19) assuming P[Xn 6= X̂n ] = o(1) we
have
i i
i i
i i
228
Thus, from chain rule and the fact that conditioning decreases entropy, we get
nR1 ≥ I(Xn ; W1 |W2 ) ≥ H(Xn |W2 ) − o(n) (11.4)
Xn
= H(Xk |W2 , Xk−1 ) − o(n) (11.5)
k=1
Xn
≥ H(Xk | W2 , Xk−1 , Yk−1 ) − o(n) (11.6)
| {z }
k=1
≜Uk
where (11.8) follows from I(W2 , Xk−1 ; Yk |Yk−1 ) = I(W2 ; Yk |Yk−1 ) + I(Xk−1 ; Yk |W2 , Yk−1 ) and the
⊥ Xk−1 |Yk−1 ; and (11.9) from Yk−1 ⊥
fact that (W2 , Yk ) ⊥ ⊥ Yk . Comparing (11.6) and (11.9) we
k− 1 k− 1
notice that denoting Uk = (W2 , X , Y ) we have both Xk → Yk → Uk and
1X
n
(R1 , R2 ) ≥ (H(Xk |Uk ), I(Uk ; Yk ))
n
k=1
and thus (from convexity) the rate pair must belong to the region spanned by all pairs
(H(X|U), I(U; Y)).
To show that without loss of generality the auxiliary random variable U can be chosen to take
at most |Y| + 1 values, one can invoke Carathéodory’s theorem (see Lemma 7.14). We omit the
details.
Next, we show that for each U the mentioned rate-pair is achievable. To that end, we first
notice that if there were side information at the decompressor in the form of the i.i.d. sequence
Un correlated to Xn , then Slepian-Wolf theorem implies that only rate R1 = H(X|U) would be
sufficient to reconstruct Xn . Thus, the question boils down to creating a correlated sequence Un at
the decompressor by using the minimal rate R2 . One way to do it is to communicate Un exactly by
spending nH(U) bits. However, it turns out that with nI(U; X) bits we can communicate a “fake”
Ûn which nevertheless has conditional distribution PXn |Ûn ≈ PXn |Un (such Ûn is known as “jointly
typical” with Xn ). Possibility of producing such Ûn is a result of independent prominence known
as covering lemma, which we will study much later – see Corollary 25.6. Here we show how to
apply covering lemma in this case.
By Corollary 25.6 and by Proposition 25.7 we know that for every δ > 0 there exists a
sufficiently large m and Ûm = f2 (Ym ) ∈ {0, 1}mR2 such that
Xm → Ym → Ûm
i i
i i
i i
and I(Xm ; Ûm ) ≥ m(I(X; U)−δ). This means that H(Xm |Ûm ) ≤ mH(X|U)+ mδ . We can now apply
Slepian-Wolf theorem to the block-symbols (Xm , Ûm ). Namely, we define a new compression prob-
lem with X̃ = Xm and Ũ = Ûm . These still take values on finite alphabets and thus there must exist
(for sufficiently large ℓ) a compressor W1 = f1 (X̃ℓ ) ∈ {0, 1}ℓR̃1 and a decompressor g(W1 , Ũℓ )
with a low probability of error and R̃1 ≤ H(X̃|Ũ) + mδ ≤ mH(X|U) + 2mδ). Now since the actual
blocklength is n = ℓm we get that the effective rate of this scheme is R1 = R̃m1 ≤ H(X|U) + 2δ .
Since δ > 0 is arbitrary, the proof is completed.
i i
i i
i i
So far we studied compression of i.i.d. sequence {Si }, for which we demonstrated that the average
compression length (for variable length compressors) converges to the entropy H(S) and that the
probability of error (for fixed-length compressor) converges to zero or one depending on whether
compression rate R ≶ H(S). In this chapter, we shall examine similar results for a large class of
processes with memory, known as ergodic processes. We start this chapter with a quick review of
main concepts of ergodic theory, then state our main results (Shannon-McMillan theorem, com-
pression limit and AEP). Subsequent sections are dedicated to proofs of Shannon-McMillan and
ergodic theorems. Finally, in the last section we introduce Kolmogorov-Sinai entropy, which asso-
ciates to a fully deterministic transformation the measure of how “chaotic” it is. This concept
plays a very important role in formalizing an apparent paradox: large mechanical systems (such
as collections of gas particles) are on one hand fully deterministic (described by Newton’s laws
of motion) and on the other hand have a lot of probabilistic properties (Maxwell distribution of
velocities, fluctuations etc). Kolmogorov-Sinai entropy shows how these two notions can co-exist.
In addition it was used to resolve a long-standing open problem in dynamical systems regarding
isomorphism of Bernoulli shifts [387, 322].
Remark 12.1
or, equivalently, E = τ −1 E. Thus τ -invariant events are also called shift-invariant, when τ is
interpreted as (12.1).
3 Some examples of shift-invariant events are {∃n : xi = 0, ∀i ≥ n}, {lim sup xi < 1} etc. A non
shift-invariant event is A = {x_0 = x_1 = · · · = 0}, since τ(1, 0, 0, . . .) ∈ A but (1, 0, . . .) ∉ A.
4 Also recall that the tail σ-algebra is defined as
F_tail ≜ ⋂_{n≥1} σ{S_n, S_{n+1}, . . .} .
It is easy to check that all shift-invariant events belong to Ftail . The inclusion is strict, as for
example the event {∃n : xi = 0, ∀ odd i ≥ n} is in Ftail but not shift-invariant.
Proposition 12.3 (Poincaré recurrence) Let τ be measure-preserving for (Ω, F, P). Then
for any measurable A with P[A] > 0 we have
" #
[
P τ −k A A = P[τ k (ω) ∈ A occurs infinitely often|A] = 1 .
k≥ 1
S
Proof. Let B = k≥ 1 τ −k A. It is sufficient to show that P[A ∩ B] = P[A] or equivalently
(with each gas occupying roughly its half of the cylinder). Of course, the “paradox” is resolved
by observing that it will take unphysically long for this to happen.
P[A ∩ τ^{−n} B] → P[A]P[B] .
Strong mixing implies weak mixing, which implies ergodicity (Exercise II.12).
• {Si }: finite irreducible Markov chain with recurrent states is ergodic (in fact strong mixing),
regardless of initial distribution.
As a toy example, consider the kernel P(0|1) = P(1|0) = 1 with initial distribution P(S_0 = 0) = 0.5. This process only has two sample paths: P[S_1^∞ = (010101 . . .)] = P[S_1^∞ = (101010 . . .)] = 1/2. It is easy to verify this process is ergodic (in the sense of Definition 12.4). Note, however, that in the Markov-chain literature a chain is called ergodic if it is
irreducible, aperiodic and recurrent. This example does not satisfy this definition (this clash of
terminology is a frequent source of confusion).
• {Si }: stationary zero-mean Gaussian process with autocovariance function c(n) = E[S0 S∗n ].
lim_{n→∞} (1/(n+1)) ∑_{t=0}^n c(t) = 0  ⇔  {S_i} ergodic  ⇔  {S_i} weakly mixing
Intuitively speaking, an ergodic process can have infinite memory in general, but the memory
is weak. Indeed, we see that for a stationary Gaussian process ergodicity means the correlation
dies (in the Cesáro-mean sense).
The spectral measure is defined as the (discrete time) Fourier transform of the autocovariance
sequence {c(n)}, in the sense that there exists a unique positive measure μ on [−π , π ] such that
c(n) = (1/2π) ∫ exp(inx) μ(dx). The spectral criterion can be formulated as follows. A detailed exposition on stationary Gaussian processes can be found in [135, Theorem 9.3.2, p. 474, and Theorem 9.7.1, pp. 493–494].
Corollary 12.6 Let {S1 , S2 , . . . } be a discrete stationary and ergodic process with entropy rate
H (in bits). Denote by f∗n the optimal variable-length compressor for Sn and ϵ∗ (Sn , nR) the optimal
probability of error of its fixed-length compressor with R bits per symbol (Definition 11.1). Then
we have
(1/n) l(f*_n(S^n)) → H in probability,   and   lim_{n→∞} ϵ*(S^n, nR) = 0 if R > H, = 1 if R < H.        (12.4)
Proof. By Corollary 10.5, the asymptotic distributions of (1/n) l(f*_n(S^n)) and (1/n) log 1/P_{S^n}(S^n) coincide.
By the Shannon-McMillan-Breiman theorem (we only need convergence in probability) the latter
converges to a constant H.
In Chapter 11 we learned the asymptotic equipartition property (AEP) for iid sources. Thanks to the Shannon-McMillan-Breiman theorem, the same proof that we gave for iid processes works for a general ergodic process.
Corollary 12.7 (AEP for stationary ergodic sources) Let {S1 , S2 , . . . } be a stationary
and ergodic discrete process. For any δ > 0, define the set
T_n^δ = { s^n : | (1/n) log 1/P_{S^n}(s^n) − H | ≤ δ } .
Then
1 P[S^n ∈ T_n^δ] → 1 as n → ∞.
2 exp{n(H − δ)}(1 + o(1)) ≤ |T_n^δ| ≤ exp{(H + δ)n}(1 + o(1)).
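A quick simulation (ours) of the first claim for the simplest ergodic process, an i.i.d. Ber(0.3) source: we estimate P[S^n ∈ T_n^δ], i.e. the probability that the normalized log-likelihood is within δ of H = h(0.3). The parameters are arbitrary.

import math, random

p, delta, trials = 0.3, 0.05, 500
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # entropy in bits

def in_typical_set(n):
    k = sum(random.random() < p for _ in range(n))     # number of ones in S^n
    loglik = k * math.log2(p) + (n - k) * math.log2(1 - p)
    return abs(-loglik / n - H) <= delta

for n in [100, 1000, 4000]:
    freq = sum(in_typical_set(n) for _ in range(trials)) / trials
    print(n, freq)    # tends to 1 as n grows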
Some historical notes are in order. Convergence in probability for stationary ergodic Markov
chains was already shown in [378]. The extension to convergence in L1 for all stationary ergodic processes is due to McMillan [301], and the extension to almost sure convergence to Breiman [75].1 A modern proof is in [11]. Note also that for a Markov chain, existence of typical sequences and the AEP can be anticipated by thinking of a Markov process as a sequence of independent decisions regarding which transitions to take at each state. It is then clear that a Markov process's trajectory is simply a transformation of the trajectory of an iid process, and hence must concentrate similarly.
lim_{n→∞} (1/n) ∑_{k=1}^n f(S_k, . . . ) = E f(S_1, . . . )   a.s. and in L1 .
In the special case where f depends on finitely many coordinates, say, f = f(S_1, . . . , S_m),
lim_{n→∞} (1/n) ∑_{k=1}^n f(S_k, . . . , S_{k+m−1}) = E f(S_1, . . . , S_m)   a.s. and in L1 .
1 Curiously, both McMillan and Breiman left the field after these contributions. McMillan went on to head the US satellite reconnaissance program, and Breiman became a pioneer and advocate of the machine-learning approach to statistical inference.
Example 12.1 Consider f = f(S_1). Then for an iid process Theorem 12.8 is simply the strong law of large numbers. At the other extreme, if {S_i} has constant trajectories, i.e. S_i = S_1 for all i ≥ 1, then such a process is non-ergodic and the conclusion of Theorem 12.8 fails (unless S_1 is an a.s. constant).
We introduce an extension of the idea of the Markov chain.
Definition 12.9 (Finite order Markov chain) {S_i : i ∈ N} is an mth order Markov chain if P_{S_{t+1}|S_1^t} = P_{S_{t+1}|S_{t−m+1}^t} for all t ≥ m. It is called time homogeneous if P_{S_{t+1}|S_{t−m+1}^t} = P_{S_{m+1}|S_1^m}.
Remark 12.2 Showing (12.3) for an mth order time-homogeneous Markov chain {Si } is a
direct application of Birkhoff-Khintchine. Indeed, we have
(1/n) log 1/P_{S^n}(S^n) = (1/n) ∑_{t=1}^n log 1/P_{S_t|S^{t−1}}(S_t | S^{t−1})
   = (1/n) log 1/P_{S^m}(S^m) + (1/n) ∑_{t=m+1}^n log 1/P_{S_t|S_{t−m}^{t−1}}(S_t | S_{t−m}^{t−1})
   = (1/n) log 1/P_{S^m}(S^m) + (1/n) ∑_{t=m+1}^n log 1/P_{S_{m+1}|S_1^m}(S_t | S_{t−m}^{t−1}) ,        (12.5)
where the first term → 0 and the second term → H(S_{m+1}|S_1^m) by Birkhoff-Khintchine; here we applied Theorem 12.8 with f(s_1, s_2, . . .) = log 1/P_{S_{m+1}|S_1^m}(s_{m+1} | s_1^m).
Now let us prove (12.3) for a general stationary ergodic process {Si } which might have infinite
memory. The idea is to first approximate the distribution of that ergodic process by an m-th order
Markov chain (finite memory) and make use of (12.5), then let m → ∞ to make the approximation
accurate. This is a highly influential contribution of Shannon to the theory of stochastic processes,
known as Markov approximation.
Proof of Theorem 12.5 in L1 . To show that (12.3) converges in L1 , we want to show that
E | (1/n) log 1/P_{S^n}(S^n) − H | → 0 ,   n → ∞ .
To this end, fix an m ∈ N. Define the following auxiliary distribution for the process:
Q^{(m)}(S_1^∞) ≜ P_{S_1^m}(S_1^m) ∏_{t=m+1}^∞ P_{S_t|S_{t−m}^{t−1}}(S_t | S_{t−m}^{t−1})
             = P_{S_1^m}(S_1^m) ∏_{t=m+1}^∞ P_{S_{m+1}|S_1^m}(S_t | S_{t−m}^{t−1}) ,
where the second line applies stationarity. Note that under Q(m) , {Si } is an mth -order time-
homogeneous Markov chain.
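As a rough numerical sketch (ours) of Shannon's Markov approximation: from one long sample path we estimate the m-th order conditional law and compute H_m = H(S_{m+1}|S_1^m), which for an ergodic source decreases to the entropy rate as m grows. The source below is a binary symmetric Markov chain with flip probability 0.2, so the true entropy rate is h(0.2) ≈ 0.722 bits; the sample size is arbitrary.

import math, random
from collections import Counter

def sample_path(n, flip=0.2):
    s = [random.randrange(2)]
    for _ in range(n - 1):
        s.append(s[-1] ^ (random.random() < flip))
    return s

def H_m(path, m):
    # empirical estimate of H(S_{m+1} | S_1^m) from one sample path
    ctx, joint = Counter(), Counter()
    for i in range(m, len(path)):
        c = tuple(path[i - m:i])
        ctx[c] += 1
        joint[c + (path[i],)] += 1
    n = sum(joint.values())
    return sum(-cnt / n * math.log2((cnt / n) / (ctx[cj[:-1]] / n))
               for cj, cnt in joint.items())

path = sample_path(200000)
for m in range(4):
    print(m, H_m(path, m))   # H_0 ~ 1 bit, H_m ~ h(0.2) for m >= 1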
By triangle inequality,
E | (1/n) log 1/P_{S^n}(S^n) − H |
  ≤ E | (1/n) log 1/P_{S^n}(S^n) − (1/n) log 1/Q^{(m)}_{S^n}(S^n) |   (≜ A)
  + E | (1/n) log 1/Q^{(m)}_{S^n}(S^n) − H_m |                        (≜ B)
  + |H_m − H|                                                          (≜ C)
where
(1/n) D(P_{S^n} ‖ Q^{(m)}_{S^n}) = (1/n) E[ log ( P_{S^n}(S^n) / ( P_{S^m}(S^m) ∏_{t=m+1}^n P_{S_{m+1}|S_1^m}(S_t | S_{t−m}^{t−1}) ) ) ]
  = (1/n) ( −H(S^n) + H(S^m) + (n − m) H_m )
  → H_m − H   as n → ∞,
lim sup_{n→∞} E | (1/n) log 1/P_{S^n}(S^n) − H | ≤ 2(H_m − H).
Lemma 12.10
E_P | log dP/dQ | ≤ D(P‖Q) + (2 log e)/e .
Intuitively:
A_n = (1/n) ∑_{k=1}^n T^k = (1/n) (I − T^n)(I − T)^{−1}
Then, if f ⊥ ker(I − T) we should have An f → 0, since only components in the kernel can blow
up. This intuition is formalized in the proof below.
Let us further decompose f into two parts f = f1 + f2 , where f1 ∈ ker(I − T) and f2 ∈ ker(I − T)⊥ .
We make the following observations:
• if g ∈ ker(I − T), g must be a constant function. This is due to the ergodicity. Consider the
indicator function 1_A: if 1_A = 1_A ◦ τ = 1_{τ^{−1}A}, then P[A] = 0 or 1. For a general case, suppose
g = Tg and g is not constant, then at least some set {g ∈ (a, b)} will be shift-invariant and have
non-trivial measure, violating ergodicity.
• ker(I − T) = ker(I − T∗ ). This is due to the fact that T is unitary:
With these observations, we know that f_1 = m is a constant. Also, f_2 ∈ [Im(I − T)], so we further approximate it by f_2 = f_0 + h_1, where f_0 ∈ Im(I − T), namely f_0 = g − g ◦ τ for some function g ∈ L_2, and ‖h_1‖_1 ≤ ‖h_1‖_2 < ϵ. Therefore we have
A_n f_1 = f_1 = E[f]
A_n f_0 = (1/n)(g − g ◦ τ^n) → 0   a.s. and in L1 ,
since E[ ∑_{n≥1} ((g ◦ τ^n)/n)^2 ] = E[g^2] ∑_{n≥1} 1/n^2 < ∞ and hence (1/n) g ◦ τ^n → 0 a.s. by Borel-Cantelli.
The proof of (12.6) makes use of the Maximal Ergodic Lemma stated as follows:
Theorem 12.11 (Maximal Ergodic Lemma) Let (P, τ ) be a probability measure and a
measure-preserving transformation. Then for any f ∈ L1 (P) we have
P[ sup_{n≥1} A_n f > a ] ≤ E[ f 1{sup_{n≥1} A_n f > a} ] / a ≤ ‖f‖_1 / a
where A_n f = (1/n) ∑_{k=0}^{n−1} f ◦ τ^k.
This is a so-called “weak L1 ” estimate for a sublinear operator supn An (·). In fact, this theorem
is exactly equivalent to the following result:
Proof. The argument for this Lemma has originally been quite involved, until a dramatically
simple proof (below) was found by A. Garsia [180, Theorem 2.2.2]. Define
S_n = ∑_{k=1}^n Z_k ,
L_n = max{0, Z_1, . . . , Z_1 + · · · + Z_n } ,
M_n = max{0, Z_2, Z_2 + Z_3, . . . , Z_2 + · · · + Z_n } ,
Z* = sup_{n≥1} S_n / n .
from which the Lemma follows by upper-bounding the left-hand side with E[|Z_1|].
In order to show (12.7) we notice that
Z1 + Mn = max{S1 , . . . , Sn }
and furthermore
Z1 + M n = Ln on {Ln > 0}
Thus, we have
where we do not need the indicator in the first term since L_n = 0 on {L_n > 0}^c. Taking expectations we get
where we used M_n ≥ 0, the fact that M_n has the same distribution as L_{n−1}, and L_n ≥ L_{n−1}, respectively. Taking the limit as n → ∞ in (12.8) and noticing that {L_n > 0} ↗ {Z* > 0}, we obtain (12.7).
where the supremum is taken over all finitely-valued random variables X_0 : Ω → X measurable with respect to F.
Note that every random variable X0 generates a stationary process adapted to τ , that is
Xk ≜ X0 ◦ τ k .
In this way, Kolmogorov-Sinai entropy of τ equals the maximal entropy rate among all stationary
processes adapted to τ . This quantity may be extremely hard to evaluate, however. One help comes
in the form of the famous criterion of Y. Sinai. We need to elaborate on some more concepts before:
P[E∆E′ ] = 0 .
σ{Y, Y ◦ τ, . . . , Y ◦ τ n , . . .} = F mod P
Proof. Notice that since H(Y) is finite, we must have H(Yn0 ) < ∞ and thus H(Y) < ∞. First, we
argue that H(τ ) ≥ H(Y). If Y has finite alphabet, then it is simply from the definition. Otherwise
let Y be Z+ -valued. Define a truncated version Ỹm = min(Y, m), then since Ỹm → Y as m → ∞
we have from lower semicontinuity of mutual information, cf. (4.28), that
H(Y|Ỹ) ≤ ϵ ,
≤ H(Ỹ_0^n) + ∑_{i=0}^n H(Y_i | Ỹ_i)
H(X_0) = I(X_0; Y_0^∞) = lim_{n→∞} I(X_0; Y_0^n) ,
where we used the continuity-in-σ -algebra property of mutual information, cf. (4.30). Rewriting
the latter limit differently, we have
lim_{n→∞} H(X_0 | Y_0^n) = 0 .
(This is just another way to say that ⋃_n σ(Y_0^n) is P-dense in σ(Y_0^∞).) Define a stationary process X̃ as
X̃_j ≜ f_ϵ(Y_j^{m+j}) .
Notice that since X̃_0^n is a function of Y_0^{n+m} we have
H(X̃_0^n) ≤ H(Y_0^{n+m}) .
Dividing by n and passing to the limit we conclude that the entropy rates satisfy
H(X̃) ≤ H(Y) .
P[X̃_j ≠ X_j] ≤ ϵ .
Since both processes take values on a fixed finite alphabet, from Corollary 6.7 we infer that
Altogether, we have shown that H(X) ≤ H(Y) + ϵ log |X| + h(ϵ). Taking ϵ → 0 concludes the proof.
It is easy to show that Y(ω) = 1{ω < 1/2} is a generator and that Y is an i.i.d. Bernoulli(1/2)
process. Thus, we get that Kolmogorov-Sinai entropy is H(τ ) = log 2.
Let us understand the significance of this example and Sinai's result. If we have a full "microscopic" description of the initial state of the system ω, then the future states of the system are completely deterministic: τ(ω), τ(τ(ω)), · · ·. However, in practice we cannot possibly have a complete description of the initial state, and should be satisfied with some discrete (i.e. finite or countably-infinite) measurement outcomes Y(ω), Y(τ(ω)), etc. What we infer from the previous result is that no matter how fine our discrete measurements are, they will still generate a process that has finite entropy rate (equal to log 2 bits per measurement). This reconciles the apparent paradox between the Newtonian (dynamical) and Gibbsian (statistical) points of view on large mechanical systems. In more mundane terms, we may notice that Sinai's theorem tells us that a much more complicated stochastic process (e.g. the one generated by a ternary-valued measurement Y′(ω) = 1{ω > 1/3} + 1{ω > 2/3}) would still have the same entropy rate as the simple iid Bernoulli(1/2) process.
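A quick simulation illustrating this point (our own sketch; we take the underlying p.p.t. to be the doubling map τ(ω) = 2ω mod 1 on [0,1) with Lebesgue measure, the standard example for which Y(ω) = 1{ω < 1/2} is a generator). The measured bits Y(ω), Y(τ(ω)), . . . behave like i.i.d. fair coin flips, matching the entropy rate log 2.

import random

def doubling_bits(n, prec_extra=16):
    # represent w = m / 2**prec exactly; tau(w) = 2w mod 1 becomes m -> (2m) mod 2**prec
    prec = n + prec_extra
    m = random.getrandbits(prec)              # w approximately uniform on [0, 1)
    bits = []
    for _ in range(n):
        bits.append(1 if m < 2 ** (prec - 1) else 0)   # Y = 1{w < 1/2}
        m = (2 * m) % (2 ** prec)
    return bits

bits = doubling_bits(10000)
ones = sum(bits)
pairs = sum(1 for a, b in zip(bits, bits[1:]) if (a, b) == (1, 1))
print("freq of 1:", ones / len(bits))           # approx 1/2
print("freq of 11:", pairs / (len(bits) - 1))   # approx 1/4 (consistent with independence)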
• Let Ω be the unit circle S^1, F the Borel σ-algebra, and P the normalized length, and
τ(ω) = ω + γ ,
i.e. τ is a rotation by the angle γ. (When γ/(2π) is irrational, this is known to be an ergodic p.p.t.) Here Y = 1{|ω| < 2πϵ} is a generator for arbitrarily small ϵ and hence
Remark 12.3 Two p.p.t.'s (Ω_1, τ_1, P_1) and (Ω_0, τ_0, P_0) are called isomorphic if there exist f_i : Ω_i → Ω_{1−i} defined P_i-almost everywhere and such that 1) τ_{1−i} ◦ f_i = f_i ◦ τ_i; 2) f_i ◦ f_{1−i} is the identity on Ω_i (a.e.); 3) P_i[f_i^{−1} E] = P_{1−i}[E]. It is easy to see that the Kolmogorov-Sinai entropies of isomorphic p.p.t.s are equal. This observation was made by Kolmogorov in 1958. It was revolutionary, since it
of a unitary operator
U_τ : L^2(Ω, P) → L^2(Ω, P)                                (12.9)
      ϕ(x) ↦ ϕ(τ(x)) .                                     (12.10)
However, the spectrum of U_τ corresponding to any non-constant i.i.d. process consists of the entire unit circle, and thus is unable to distinguish Ber(1/2) from Ber(1/3).2
2 To see the statement about the spectrum, let X_i be iid with zero mean and unit variance. Then consider ϕ(x_1^∞) defined as (1/√m) ∑_{k=1}^m e^{iωk} x_k. This ϕ has unit energy and as m → ∞ we have ‖U_τ ϕ − e^{iω} ϕ‖_{L2} → 0. Hence every e^{iω} belongs to the spectrum of U_τ.
13 Universal compression
Unfortunately, the theory developed so far is not very helpful for anyone tasked with actually compressing a file of English text. Indeed, since the probability law governing text generation is not given to us, one cannot apply the compression results that we discussed so far. In this chapter we
will discuss how to produce compression schemes that do not require a priori knowledge of the
distribution. For example, an n-letter input compressor maps X n → {0, 1}∗ . There is no one fixed
probability distribution PXn on X n , but rather a whole class of distributions. Thus, the problem of
compression becomes intertwined with the problem of distribution (density) estimation and we
will see that optimal algorithms for both problems are equivalent.
The plan for this chapter is as follows:
1 We will start by discussing the earliest example of a universal compression algorithm (of
Fitingof). It does not talk about probability distributions at all. However, it turns out to be asymp-
totically optimal simultaneously for all i.i.d. distributions and with small modifications for all
finite-order Markov chains.
2 Next class of universal compressors is based on assuming that the true distribution PXn belongs
to a given class. These methods proceed by choosing a good model distribution QXn serving as
the minimax approximation to each distribution in the class. The compression algorithm for a
single distribution QXn is then designed as in previous chapters.
3 Finally, an entirely different idea underlies the algorithms of Lempel-Ziv type. These automatically adapt
to the distribution of the source, without any prior assumptions required.
Throughout this chapter, all logarithms are binary. Instead of describing each compres-
sion algorithm, we will merely specify some distribution QXn and apply one of the following
constructions:
• Sort all x^n in the order of decreasing Q_{X^n}(x^n) and assign values from {0, 1}* as in Theorem 10.2; this compressor has lengths satisfying
ℓ(f(x^n)) ≤ log 1/Q_{X^n}(x^n) .
• Set lengths to be
ℓ(f(x^n)) ≜ ⌈log 1/Q_{X^n}(x^n)⌉
and apply Kraft's inequality (Theorem 10.9) to construct a prefix code.
Associate to each xn an interval Ixn = [Fn (xn ), Fn (xn ) + QXn (xn )). These intervals are disjoint
subintervals of [0, 1). As such, each xn can be represented uniquely by any point in the interval Ixn .
A specific choice is as follows. Encode
x^n ↦ largest dyadic interval D_{x^n} contained in I_{x^n}        (13.1)
and we agree to select the left-most dyadic interval when there are two possibilities. Recall that
dyadic intervals are intervals of the type [m2−k , (m + 1)2−k ) where m is an integer. We encode
such interval by the k-bit (zero-padded) binary expansion of the fractional number m2−k =
0.b_1 b_2 . . . b_k = ∑_{i=1}^k b_i 2^{−i}. For example, [3/4, 7/8) ↦ 110, [3/4, 13/16) ↦ 1100. We set the codeword f(x^n) to be that string. The resulting code is a prefix code satisfying
log_2 1/Q_{X^n}(x^n) ≤ ℓ(f(x^n)) ≤ log_2 1/Q_{X^n}(x^n) + 2 .        (13.2)
(This is an exercise, see Ex. II.13.)
Observe that
F_n(x^n) = F_{n−1}(x^{n−1}) + Q_{X^{n−1}}(x^{n−1}) ∑_{y < x_n} Q_{X_n|X^{n−1}}(y | x^{n−1})
and thus Fn (xn ) can be computed sequentially if QXn−1 and QXn |Xn−1 are easy to compute. This
method is the method of choice in many modern compression algorithms because it allows to
dynamically incorporate the learned information about the data stream, in the form of updating
QXn |Xn−1 (e.g. if the algorithm detects that an executable file contains a long chunk of English text,
it may temporarily switch to QXn |Xn−1 modeling the English language).
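A self-contained sketch of the interval code just described (our own illustration; the alphabet, ordering, and probabilities are made up). We compute I_{x^n} = [F_n(x^n), F_n(x^n) + Q_{X^n}(x^n)) with exact rational arithmetic for an i.i.d. model Q, and emit the left-most largest dyadic interval inside it.

from fractions import Fraction

Q = {'e': Fraction(1, 2), 'o': Fraction(3, 10), 't': Fraction(1, 5)}
symbols = sorted(Q)                       # fixed symbol order used to define F_n

def interval(xs):
    lo, width = Fraction(0), Fraction(1)
    for x in xs:                          # sequential update of F_n and Q_{X^n}
        lo += width * sum(Q[y] for y in symbols if y < x)
        width *= Q[x]
    return lo, width

def encode(xs):
    lo, width = interval(xs)
    k = 0
    while True:                           # smallest k such that a dyadic interval
        k += 1                            # of length 2^-k fits inside [lo, lo+width)
        m = -((-lo * 2**k) // 1)          # ceil(lo * 2^k): left-most candidate
        if Fraction(m + 1, 2**k) <= lo + width:
            return format(m, 'b').zfill(k)

print(encode('etoo'))                     # codeword length <= log2(1/Q_{X^n}) + 2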
We note that efficient implementation of arithmetic encoder and decoder is a continuing
research area. Indeed, performance depends on number-theoretic properties of denominators of
distributions QXt |Xt−1 , because as encoder/decoder progress along the string, they need to peri-
odically renormalize the current interval Ixt to be [0, 1) but this requires carefully realigning the
dyadic boundaries. A recent idea of J. Duda, known as the asymmetric numeral system (ANS) [138], led to such impressive computational gains that in less than a decade it was adopted by most
compression libraries handling diverse data streams (e.g., the Linux kernel images, Dropbox and
Facebook traffic, etc).
Then Fitingof argues that it should be possible to produce a prefix code with
ℓ(f(xn )) = Φ0 (xn ) + O(log n) . (13.6)
This can be done in many ways. In the spirit of what comes next, let us define
QXn (xn ) ≜ exp{−Φ0 (xn )}cn , (13.7)
c_n = O(n^{−(|X|−1)}) ,
and thus, by Kraft inequality, there must exist a prefix code with lengths satisfying (13.6).1 Now taking expectation over X^n i.i.d.∼ P_X we get
Universal compressor for all finite-order Markov chains. Fitingof’s idea can be extended as
follows. Define now the first-order information content Φ1 (xn ) to be the log of the number of all
sequences, obtainable by permuting xn with the extra restriction that the new sequence should have
the same statistics on digrams. Asymptotically, Φ1 is just the conditional entropy
where T − 1 is understood in the sense of modulo n. Again, it can be shown that there exists a code
such that lengths
This implies that for every first-order stationary Markov chain X1 → X2 → · · · → Xn we have
This can be further continued to define Φ2 (xn ) leading to a universal code that is asymptotically
optimal for all second-order Markov chains, and so on and so forth.
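A small illustration (ours) of the zeroth-order information content: following the description of Φ_1 above, Φ_0(x^n) is the log of the number of strings sharing the empirical distribution (type) of x^n, i.e. the log of a multinomial coefficient, and it is within O(log n) of the empirical entropy nH(P̂_{x^n}). The example string is arbitrary.

import math
from collections import Counter

def phi0_bits(xs):
    # log2 of the multinomial coefficient n! / prod_a (count of a)!
    counts = Counter(xs)
    n = len(xs)
    log_multinom = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts.values())
    return log_multinom / math.log(2)

def empirical_entropy_bits(xs):
    n = len(xs)
    return n * sum(-c / n * math.log2(c / n) for c in Counter(xs).values())

x = "abracadabra" * 50
print(phi0_bits(x), empirical_entropy_bits(x))   # differ by O(log n)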
simultaneously for all i.i.d. sources (or even all r-th order Markov chains). What should we
do next? Krichevsky [259] suggested that the next barrier should be to minimize the regret, or
redundancy:
1
Explicitly, we can do a two-part encoding: first describe the type class of xn (takes (|X | − 1) log n bits) and then describe
the element of the class (takes Φ0 (xn ) bits).
Replacing code lengths with log 1/Q_{X^n}, we define the redundancy of the distribution Q_{X^n} as
Thus, the question of designing the best universal compressor (in the sense of optimizing worst-
case deviation of the average length from the entropy) becomes the question of finding solution
of:
Assuming the finiteness of R∗n , Theorem 5.9 gives the maximin and capacity representation
where optimization is over priors π ∈ P(Θ) on θ. Thus redundancy is simply the capacity of
the channel θ → Xn . This result, obvious in hindsight, was rather surprising in the early days of
universal compression. It is known as capacity-redundancy theorem.
Finding the exact Q_{X^n}-minimizer in (13.8) is a daunting task even for the simple class of all i.i.d. Bernoulli sources (i.e. Θ = [0, 1], P_{X^n|θ} = Ber(θ)^n). In fact, for smooth parametric families the capacity-achieving input distribution is rather cumbersome: it is a discrete distribution with k_n atoms, k_n slowly growing as n → ∞. A provocative conjecture was put forward by physicists [296, 2] that there is a certain universality relation:
R*_n = (3/4) log k_n + o(log k_n)
satisfied for all parametric families simultaneously. For the Bernoulli example this implies k_n ≍ n^{2/3}, but even this is open. However, as we will see below it turns out that these unwieldy capacity-achieving input distributions converge as n → ∞ to a beautiful limiting law, known as the Jeffreys prior.
Remark 13.1 (Shtarkov, Fitingof and individual sequence approach) There is a connection
between the combinatorial method of Fitingof and the method of optimality for a class. Indeed,
following Shtarkov we may want to choose distribution QXn so as to minimize the worst-case
redundancy for each realization xn (not average!):
R**_n(Θ) ≜ min_{Q_{X^n}} max_{x^n} sup_{θ_0∈Θ} log [ P_{X^n|θ}(x^n|θ_0) / Q_{X^n}(x^n) ]        (13.11)
This minimization is attained at the Shtarkov distribution (also known as the normalized maximal
likelihood (NML) code):
Q^{(S)}_{X^n}(x^n) = (1/Z) sup_{θ_0∈Θ} P_{X^n|θ}(x^n|θ_0) ,        (13.12)
where the normalization constant
Z = ∑_{x^n∈X^n} sup_{θ_0∈Θ} P_{X^n|θ}(x^n|θ_0) ,                   (13.13)
is called the Shtarkov sum. If the class {PXn |θ : θ ∈ Θ} is chosen to be all product distributions on
X then
(i.i.d.)   Q^{(S)}_{X^n}(x^n) = exp{−nH(P̂_{x^n})} / ∑_{x̃^n} exp{−nH(P̂_{x̃^n})} ,        (13.14)
where H(P̂_{x^n}) is the empirical entropy of x^n. As such, compressing with respect to Q^{(S)}_{X^n} recovers
Fitingof’s construction Φ0 up to O(log n) differences between nH(P̂xn ) and Φ0 (xn ). If we take
PXn |θ to be all first-order Markov chains, then we get construction Φ1 etc. Note also, that the
problem (13.11) can also be written as a minimization of the regret for each individual sequence
(under the log-loss, with respect to a parameter class PXn |θ ):
min_{Q_{X^n}} max_{x^n} [ log 1/Q_{X^n}(x^n) − inf_{θ_0∈Θ} log 1/P_{X^n|θ}(x^n|θ_0) ] .        (13.15)
In summary, using Shtarkov’s distribution (minimizer of (13.15)) makes sure that any individual
realization of xn (whether it was or was not generated by PXn |θ=θ0 for some θ0 ) is compressed
almost as well as by the best compressor tailored for the class of P_{X^n|θ}. Hence, if our model class P_{X^n|θ} approximates the generative process of x^n well, we achieve nearly optimal compression. In Section 13.7 below we will also learn that Q_{X_n|X^{n−1}} can be interpreted as an online estimator of the distribution of the x_j's.
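A quick computation (ours) of the Shtarkov sum (13.13) for the class of all i.i.d. Bernoulli distributions: the maximum-likelihood value of a string with k ones is (k/n)^k ((n−k)/n)^{n−k}, so Z reduces to a single sum over k, and log_2 Z (the worst-case regret) grows like (1/2) log_2 n plus a constant. The values of n are arbitrary.

import math

def log2_shtarkov_sum(n):
    # sum over k of C(n,k) * (k/n)^k * ((n-k)/n)^(n-k), computed in the log domain
    logs = []
    for k in range(n + 1):
        log_ml = (k * math.log(k / n) if k > 0 else 0.0) + \
                 ((n - k) * math.log((n - k) / n) if k < n else 0.0)
        logs.append(math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1) + log_ml)
    m = max(logs)
    return (m + math.log(sum(math.exp(L - m) for L in logs))) / math.log(2)

for n in [10, 100, 1000]:
    print(n, log2_shtarkov_sum(n), 0.5 * math.log2(n))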
Remark 13.2 (Two redundancies) In the literature of universal compression, the quantity R**_n is known as the worst-case or pointwise minimax redundancy, in comparison with the average-case minimax redundancy R*_n in (13.8), which replaces max_{x^n} in (13.11) by E_{x^n ∼ P_{X^n|θ_0}}. It is known that for many model classes, such as iid and finite-order Markov sources, R*_n and R**_n agree in the leading term as n → ∞.2 As R*_n ≤ R**_n, typically the way one bounds the redundancies is to upper bound R**_n by bounding the pointwise redundancy (via combinatorial means) for a specific probability assignment and to lower bound R*_n by applying (13.10) and bounding the mutual information for a specific prior; see Exercises II.15 and II.16 for an example and [112, Chap. 6-7] for more.
2 This, however, is not true in general. See Exercise II.21 for an example where R*_n < ∞ but R**_n = ∞.
Remark 13.3 (Redundancy for single-shot codes) We note that any prefix code f : X^n → {0, 1}* defines a distribution Q_{X^n}(x^n) = 2^{−ℓ(f(x^n))}. (We assume the code's binary tree is full so that the Kraft sum equals one.) Therefore, our definition of redundancy in (13.8) assesses the excess of the expected length E[ℓ(f(X^n))] over H(X^n) for prefix codes. For single-shot codes (Section 10.1) without prefix constraints the optimal answers are slightly different, however. For example, the optimal universal code for all i.i.d. sources satisfies E[ℓ(f(X^n))] ≈ H(X^n) + ((|X|−3)/2) log n, in contrast with ((|X|−1)/2) log n for prefix codes, cf. [41, 256].
c(α_0, . . . , α_d) ∏_{j=0}^d θ_j^{α_j − 1}        (13.16)
and θ_0 = 1 − ∑_{j=1}^d θ_j, where c(α_0, . . . , α_d) = Γ(α_0 + . . . + α_d) / ∏_{j=0}^d Γ(α_j) is the normalizing constant.
First, we give the formal setting as follows:
θ_0 ≜ 1 − ∑_{j=1}^d θ_j .
In order to find the (near) optimal QXn , we need to guess an (almost) optimal prior π ∈ P(Θ)
in (13.10) and take QXn to be the mixture of PXn |θ ’s. We will search for π in the class of smooth
Before proceeding further, we recall the Laplace method of approximating exponential inte-
grals. Suppose that f(θ) has a unique minimum at an interior point θ̂ of Θ and that the Hessian Hess f is uniformly lower-bounded by a multiple of identity (in particular, f(θ) is strongly convex). Then, taking Taylor expansions of π and f we get
∫_Θ π(θ) e^{−nf(θ)} dθ = ∫ (π(θ̂) + O(‖t‖)) e^{−n(f(θ̂) + (1/2) t^T Hess f(θ̂) t + o(‖t‖^2))} dt        (13.18)
  = π(θ̂) e^{−nf(θ̂)} ∫_{R^d} e^{−x^T Hess f(θ̂) x} dx/√(n^d) · (1 + O(n^{−1/2}))                      (13.19)
  = π(θ̂) e^{−nf(θ̂)} (2π/n)^{d/2} (1/√(det Hess f(θ̂))) (1 + O(n^{−1/2}))                             (13.20)
θ̂(x^n) ≜ P̂_{x^n}
where we used the fact that Hess_{θ′} D(P̂ ‖ P_{X|θ=θ′}) |_{θ′=θ̂} = (1/log e) J_F(θ̂), with J_F being the Fisher information matrix introduced previously in (2.34). From here, using the fact that under X^n ∼ P_{X^n|θ=θ′} the random variable θ̂ = θ′ + O(n^{−1/2}), we get by approximating J_F(θ̂) and P_θ(θ̂)
D(P_{X^n|θ=θ′} ‖ Q_{X^n}) = n(E[H(θ̂)] − H(X|θ = θ′)) + (d/2) log n − log [ P_θ(θ′) / √(det J_F(θ′)) ] + C + O(n^{−1/2}) ,        (13.21)
where C is some constant (independent of the prior Pθ or θ′ ). The first term is handled by the next
result, refining Corollary 7.18.
Proof. By the Central Limit Theorem, √n (P̂ − P) converges in distribution to N(0, Σ), where Σ = diag(P) − PP^T and P is an |X|-by-1 column vector. Thus, computing the second-order Taylor expansion of D(·‖P), cf. (2.34) and (2.37), we get the result. (To interchange the limit and the expectation, more formally we need to condition on the event P̂_n(x) ∈ (ϵ, 1 − ϵ) for all x ∈ X to make the integrand function bounded. We leave these technical details as an exercise.)
D(P_{X^n|θ=θ′} ‖ Q_{X^n}) = (d/2) log n − log [ π(θ′) / √(det J_F(θ′)) ] + constant + O(n^{−1/2})        (13.22)
under the assumption of smoothness of the prior π and that θ′ is not on the boundary of Θ. Consequently, we can see that in order for the prior π to be the saddle point solution, we should have
π(θ′) ∝ √(det J_F(θ′)) ,
provided that the right side is integrable. The prior proportional to the square root of the determinant of the Fisher information matrix is known as the Jeffreys prior. In our case, using the explicit expression for Fisher information (2.39), the Jeffreys prior π* is found to be Dirichlet(1/2, 1/2, · · · , 1/2), with density
π*(θ) = c_d / √( ∏_{j=0}^d θ_j ) ,        (13.23)
where c_d = Γ((d+1)/2) / Γ(1/2)^{d+1} is the normalization constant. The corresponding redundancy is then
R*_n = (d/2) log (n/(2πe)) − log [ Γ((d+1)/2) / Γ(1/2)^{d+1} ] + o(1) .        (13.24)
Making the above derivation rigorous is far from trivial and was completed in [460]. (In Exercises II.15 and II.16 we analyze the d = 1 case and show R*_n = (1/2) log n + O(1).)
Overall, we see that the Jeffreys prior asymptotically maximizes (within o(1)) the value sup_π I(θ; X^n) and for this reason is called the asymptotically maximin solution. Surprisingly [405], the corresponding mixture Q_{X^n}, which we denote Q^{(KT)}_{X^n} (and study in detail in the next section), turns out, however, not to give the asymptotically optimal redundancy. That is, we have for some c_1 > c_2 > 0 the inequalities
That is, Q^{(KT)}_{X^n} is not asymptotically minimax (but it does achieve the optimal redundancy up to an O(1) term). However, it turns out that patching the Jeffreys prior near the boundary of the simplex (or using a mixture of Dirichlet distributions) does result in asymptotically minimax universal probability assignments [460].
Extension to general smooth parametric families. The fact that Jeffreys prior θ ∼ π maxi-
mizes the value of mutual information I(θ; Xn ) for general parametric families was conjectured
in [46] in the context of selecting priors in Bayesian inference. This result was proved rigorously
in [95, 96]. We briefly summarize the results of the latter.
Let {Pθ : θ ∈ Θ0 } be a smooth parametric family admitting a continuous and bounded Fisher
information matrix JF (θ) everywhere on the interior of Θ0 ⊂ Rd . Then for every compact Θ
contained in the interior of Θ0 we have
R*_n(Θ) = (d/2) log (n/(2πe)) + log ∫_Θ √(det J_F(θ)) dθ + o(1) .        (13.25)
Although Jeffreys prior on Θ achieves (up to o(1)) the optimal value of supπ I(θ; Xn ), to produce
an approximate capacity-achieving output distribution QXn , however, one needs to take a mixture
with respect to a Jeffreys prior on a slightly larger set Θϵ = {θ : d(θ, Θ) ≤ ϵ} and take ϵ → 0
slowly with n → ∞. This sequence of QXn ’s does achieve the optimal redundancy up to o(1).
Remark 13.4 (Laplace's law of succession) In statistics the Jeffreys prior is justified as being invariant to smooth reparametrization, as evidenced by (2.35). For example, in answering "will the sun rise tomorrow"3, Laplace proposed to estimate the probability by modeling sunrises as an i.i.d. Bernoulli process with a uniform prior on θ ∈ [0, 1]. However, this is clearly not very logical, as one may equally well postulate uniformity of α = θ^{10} or β = √θ. The Jeffreys prior θ ∼ 1/√(θ(1−θ)) is invariant to reparametrization in the sense that if one computed √(det J_F(α)) under the α-parametrization the result would be exactly the pushforward of 1/√(θ(1−θ)) along the map θ ↦ θ^{10}.
3
Interested readers should check Laplace’s rule of succession and the sunrise problem; see [229, Chap. 18] for a historical
and philosophical account.
4 We remind that (2a − 1)!! = 1 · 3 · · · (2a − 1). The expression for Q_{X^n} is obtained from the identity ∫_0^1 θ^a (1−θ)^b / √(θ(1−θ)) dθ = π (2a−1)!!(2b−1)!! / (2^{a+b} (a+b)!) for integer a, b ≥ 0, which in turn is derived by the change of variable z = θ/(1−θ) and using the standard keyhole contour on the complex plane.
Note that Q^{(KT)}_{X^{n−1}} coincides with the marginalization of Q^{(KT)}_{X^n} to the first n − 1 coordinates. This property is not specific to the KT distribution and holds for any Q_{X^n} that is given in the form ∫ P_θ(dθ) P_{X^n|θ} with P_θ not depending on n. What is remarkable, however, is that the conditional distribution Q^{(KT)}_{X_n|X^{n−1}} has a rather elegant form:
Q^{(KT)}_{X_n|X^{n−1}}(1|x^{n−1}) = (t_1 + 1/2)/n ,   t_1 = #{j ≤ n − 1 : x_j = 1}        (13.28)
Q^{(KT)}_{X_n|X^{n−1}}(0|x^{n−1}) = (t_0 + 1/2)/n ,   t_0 = #{j ≤ n − 1 : x_j = 0}        (13.29)
This is the famous “add-1/2” rule of Krichevsky and Trofimov [260]. As mentioned in Section 13.1,
this sequential assignment is very convenient for implementing an arithmetic
coder.
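A tiny implementation (ours) of the add-1/2 rule (13.28)-(13.29) for binary strings, together with the cumulative log-loss it incurs; on an i.i.d. Ber(θ) string the excess of this loss over nh(θ) stays around (1/2) log_2 n + O(1). The choice θ = 0.3 and the blocklength are arbitrary.

import math, random

def kt_log_loss_bits(bits):
    # cumulative -log2 of the KT sequential assignment; equals log2(1/Q^{KT}_{X^n}(x^n))
    loss, counts = 0.0, [0, 0]           # counts of 0s and 1s seen so far
    for t, x in enumerate(bits):
        q = (counts[x] + 0.5) / (t + 1)  # add-1/2 rule
        loss += -math.log2(q)
        counts[x] += 1
    return loss

theta, n = 0.3, 4096
bits = [int(random.random() < theta) for _ in range(n)]
h = -(theta * math.log2(theta) + (1 - theta) * math.log2(1 - theta))
print(kt_log_loss_bits(bits) - n * h, 0.5 * math.log2(n))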
Let f_KT : {0, 1}^n → {0, 1}* be the encoder assigning length l(f_KT(x^n)) = log_2 1/Q^{(KT)}_{X^n}(x^n). Now from (13.24) we know that
sup_{0≤θ≤1} { E[ l(f_KT(S^n_θ)) ] − nh(θ) } = (1/2) log n + O(1) .
Since (13.24) was not shown rigorously, in Exercise II.15 we prove the upper bound of this claim independently.
Remark 13.5 (Laplace "add-1" rule) A slightly less optimal choice of Q_{X^n} results from the Laplace prior (recall that Laplace advised taking P_θ to be uniform on [0, 1]). Then, in the case of the binary (Bernoulli) alphabet we get
Q^{(Lap)}_{X^n}(x^n) = 1 / [ (n + 1) \binom{n}{w} ] ,   w = #{j : x_j = 1} ,        (13.30)
Q^{(Lap)}_{X_n|X^{n−1}}(1|x^{n−1}) = (t_1 + 1)/(n + 1) ,   t_1 = #{j ≤ n − 1 : x_j = 1} .
We notice two things. First, the distribution (13.30) is exactly the same as Fitingof's (13.7). Second, this distribution "almost" attains the optimal first-order term in (13.24). Indeed, when X^n is iid Ber(θ) we have for the redundancy:
E[ log 1/Q^{(Lap)}_{X^n}(X^n) ] − H(X^n) = log(n + 1) + E[ log \binom{n}{W} ] − nh(θ) ,   W ∼ Bin(n, θ) .        (13.31)
From Stirling's expansion we know that as n → ∞ this redundancy evaluates to (1/2) log n + O(1), uniformly in θ over compact subsets of (0, 1). However, for θ = 0 or θ = 1 the Laplace redundancy (13.31) clearly equals log(n + 1). Thus, the supremum over θ ∈ [0, 1] is achieved close to the endpoints and results in the suboptimal redundancy log n + O(1). The Jeffreys prior (13.26) and the resulting KT compressor fix the problem at the endpoints.
The intimate connection between this problem of online prediction and universal compression is revealed by the following almost trivial observation. Notice that the distribution Q_t(·) output by the learner at time t is in fact a function of the observations x_1, . . . , x_{t−1}. Therefore, we should write it more explicitly as Q_t(·; x^{t−1}) to emphasize the dependence on the (training) data. But then we can also think of Q_t(x_t; x^{t−1}) as a Markov kernel Q_{X_t|X^{t−1}}(x_t|x^{t−1}) and compute a joint probability distribution
Q_{X^n}(x^n) ≜ ∏_{t=1}^n Q_t(x_t; x^{t−1}) .        (13.33)
Conversely, any Q_{X^n} can be factorized sequentially in the form (13.33) to obtain an online predictor. It turns out that the choice of the optimal Q_{X^n} is precisely the same problem of universal probability assignment that is solved by universal compression in (13.8). But before that we need to explain how to define optimality in the online prediction game.
Since ℓ_n({Q_t}, θ*) depends on the value of θ* that governs the stochastics of the input
However, this turns out to be a bad way to pose the problem. For example, if one of the P^{(θ)}_{X^n} = Unif[X^n], then no predictor can achieve ℓ_n^{(a)} below n log |X|. Furthermore, a trivial predictor that always outputs Q_t = Unif[X] achieves this upper bound exactly. Thus in the minimax setting predicting Unif[X] turns out to be optimal.
To understand how to work around this issue, let us first recall from Corollary 2.4 that if we had oracle knowledge of the true θ* generating the X_j's, then our choice would be to set Q_t to be the factorization of P^{(θ*)}_{X^n}. This achieves the loss
ℓ_n^{(a)}({P*_θ}, θ*) = H(P^{(θ*)}_{X^n}) .
Thus, even given the oracle information we cannot avoid the H(P^{(θ*)}_{X^n}) loss (this would also be called the Bayes loss in machine learning). Note that for the iid model class we have H(P^{(θ*)}_{X^n}) = nH(P^{(θ*)}_{X_1}) and the oracle loss is of order n. Consequently, the quality of the learning algorithm should be measured by the amount of excess loss above the oracle loss. This quantity is known as the average regret and is defined as
AvgReg_n({Q_t}, θ*) ≜ E_{P^{(θ*)}_{X^n}} [ ∑_{t=1}^n log 1/Q_t(X_t|X^{t−1}) ] − H(P^{(θ*)}_{X^n}) .
Hence to design an optimal algorithm we want to minimize the worst regret, or in other words to
solve the minimax problem
This turns out to be completely equivalent to the universal compression problem, as we state next.
Theorem 13.3 Recall the definition of the compression redundancy R*_n in (13.8). Then we have
AvgReg*_n(Θ) = R*_n(Θ) ≜ min_{Q_{X^n}} sup_{θ*∈Θ} D(P^{(θ*)}_{X^n} ‖ Q_{X^n}) ,
where the minimum in the RHS is achieved at a unique distribution Q*_{X^n}. The optimal predictor is given by setting Q_t(·) = Q*_{X_t|X^{t−1}=x^{t−1}}(·). Furthermore, let θ ∈ Θ have a prior distribution π ∈ P(Θ). Then
If there exists a maximizer π* of the right-hand side maximization then the optimal estimator is found by factorizing Q*_{X^n} = ∫ π*(dθ) ∏_{i=1}^n P_θ(x_i).
Proof. There is almost nothing to prove. We only need to rewrite the definition of average regret in terms of Q_{X^n} as follows:
AvgReg_n({Q_t}, θ*) = E_{P^{(θ*)}_{X^n}} [ log ( P^{(θ*)}_{X^n} / Q_{X^n} ) ] = D(P^{(θ*)}_{X^n} ‖ Q_{X^n}) .
The rest of the claims follow from Theorem 5.9 (recall that I(θ; X^n) ≤ n log |X| < ∞) and Theorem 5.4.
As an application of this result, we see that the Krichevsky-Trofimov estimator achieves for any iid string X^n i.i.d.∼ P a log-loss
∑_{t=1}^n E [ log 1/Q^{(KT)}_{X_t|X^{t−1}}(X_t|X^{t−1}) ] ≤ nH(P) + ((|X| − 1)/2) log n + c_X ,
where c_X < ∞ is a constant independent of P or n. This excess above nH(P) is optimal among all possible online estimators, except possibly for the constant c_X.
The problem we discussed may appear at first to be somewhat contrived, especially to someone who is used to supervised learning/prediction tasks. Indeed, our prediction problem does not have any features to predict from! Nevertheless, modern large language models are solving precisely this task: they are trained to predict a sequence of words by minimizing the log-loss (cross-entropy loss), cf. [320]. In those instances the learning task is made feasible by the non-iid nature of the sequence. The iid setting, however, is also quite interesting and practically relevant. But one needs to introduce a supervised-learning version for that, where the prediction task is to estimate an unknown label or quantity Y_t given a correlated feature vector X_t. There is an analog of Theorem 13.3 for that case as well – see Exercises II.20 and II.22.
Batch regret. In machine learning what we have defined above is known as cumulative (or
online) regret, because we insisted on the estimator to produce some prediction at every time step
t. However, a much more common setting is that of prediction, where the first n − 1 observations
are available as the training data and we only assess the loss on the new unseen sample (test data).
This is called the batch loss and the corresponding minimax regret is
BatchReg*_n(Θ) ≜ inf_{Q_n(·;x^{n−1})} sup_{θ*∈Θ} { E_{P^{(θ*)}_{X^n}} [ log 1/Q_n(X_n; X^{n−1}) ] − H(P^{(θ*)}_{X_n|X^{n−1}}) } ,        (13.35)
where the quantity in braces equals D(P^{(θ*)}_{X_n|X^{n−1}} ‖ Q_n(·; X^{n−1})). In other words, this is the optimal KL loss of predicting the next symbol by estimating its conditional distribution given the past, a central task in language models such as GPT [320]. Similar to Theorem 13.3 we can give a max-information formula for the batch regret (Exercise II.19). However, it turns out that there is also a connection to universal compression. Indeed, we have the following inequalities
AvgReg*_n(Θ) − AvgReg*_{n−1}(Θ) ≤ BatchReg*_n(Θ) ≤ (1/n) AvgReg*_n(Θ) ,        (13.36)
where the upper bound is only guaranteed to hold for iid models.5 The inequality (13.36) is known
as online-to-batch conversion or estimation-compression inequality [159, 240]; see Lemma 32.3
and Proposition 32.7 for a justification. The estimator that achieves the above upper bound is very
simple: it takes a probability assignment Q*_{X^n} and sets its predictor to
Q_n(x_n; x^{n−1}) ≜ (1/n) ∑_{t=1}^n Q_{X_t|X^{t−1}}(x_n | x^{t−1}) .        (13.37)
However, unlike the cumulative regret, minimizers of the batch regret are distinct from those in universal compression. For example, for the model class of all iid distributions over k symbols, we know that (asymptotically) the "add-1/2" estimator of Krichevsky-Trofimov is optimal. However, for the batch loss it is not so (see the note at the end of Exercise VI.10). We also note that the optimal batch regret in this case is O((k−1)/n), but the online-to-batch rule only yields O(((k−1) log n)/n). On the other hand, for first-order Markov chains with k ≥ 3 states, the online-to-batch upper bound turns out to be order optimal, as we have BatchReg*_n ≍ (1/n) AvgReg*_n ≍ (k^2/n) log (n/k^2) provided that n ≳ k^2 [213]; however, proving this result, especially the lower bound, requires arguments native to Markov chains.
Density estimation. Consider now the following problem. Given a collection of (single-letter) distributions P^{(θ)}_X on X and X_1, . . . , X_{n−1} i.i.d.∼ P^{(θ)}_X, we want to produce a density estimate P̂ which minimizes the worst-case error as measured by KL divergence, i.e. we seek to minimize
sup_{θ*∈Θ} E_{X^{n−1} i.i.d.∼ P^{(θ*)}_X} [ D(P̂ ‖ P^{(θ*)}_X) ] .
To connect to the previous discussion, we only need to notice that P̂(·) can be interpreted as Q_n in the batch regret problem and we have an exact equality
inf_{P̂} sup_{θ*∈Θ} E_{X^{n−1} i.i.d.∼ P^{(θ*)}_X} [ D(P̂ ‖ P^{(θ*)}_X) ] = BatchReg*_n(Θ) ≤ (1/n) sup_π I(θ; X^n) .
Thus, we bound the minimax (KL-divergence) density estimation rate by the capacity of a certain channel. The estimator achieving this bound is improper (i.e. P̂ ≠ P^{(θ)}_X for any θ) and is given by (13.37). This is the basis of the Yang-Barron approach to density estimation; see Section 32.1 for more.
5 For stationary m-th order Markov models, the upper bound in (13.36) holds with n − m in the denominator [213, Lemma 6].
This is clearly hopeless. Indeed, at any step t the distribution Q_t must have at least one atom with weight at most 1/|X|, and hence for any predictor
max_{x^n} ℓ({Q_t}, x^n) ≥ n log |X| ,
which is clearly achieved iff Q_t(·) ≡ 1/|X|, i.e. if the predictor simply makes uniform random guesses. This triviality is not surprising: in the absence of any prior information on x^n it is impossible to predict anything.
The exciting idea, originated by Feder, Merhav and Gutman, cf. [161, 303], is to replace loss
with regret, i.e. the gap to the best possible static oracle. More precisely, suppose a non-causal
oracle can examine the entire string xn and output a constant Qt ≡ Q. From the non-negativity of
divergence this non-causal oracle achieves:
ℓ_oracle(x^n) = min_Q ∑_{t=1}^n log 1/Q(x_t) = nH(P̂_{x^n}) .
t=1
Can a causal (but time-varying) predictor come close to this performance? In other words, we define the regret of a sequential predictor as the excess risk over the static oracle
reg({Qt }, xn ) ≜ ℓ({Qt }, xn ) − nH(P̂xn )
and ask to minimize the worst-case regret:
Reg*_n ≜ min_{{Q_t}} max_{x^n} reg({Q_t}, x^n) .        (13.38)
Excitingly, non-trivial predictors emerge as solutions to the above problem, which furthermore do
not rely on any assumptions on xn .
We next consider the case X = {0, 1} for simplicity. To solve (13.38), first notice that designing a sequence {Q_t(·|x^{t−1})} is equivalent to defining one joint distribution Q_{X^n} and then factorizing the latter as Q_{X^n}(x^n) = ∏_t Q_t(x_t|x^{t−1}). Then the problem (13.38) becomes simply
Reg*_n = min_{Q_{X^n}} max_{x^n} [ log 1/Q_{X^n}(x^n) − nH(P̂_{x^n}) ] .
First, we notice that in general the optimal Q_{X^n} is the Shtarkov distribution (13.12), which implies that the regret coincides with the log of the Shtarkov sum (13.13). In the iid case we are considering, from (13.14) we get
Reg*_n = log ∑_{x^n} max_Q ∏_{i=1}^n Q(x_i) = log ∑_{x^n} exp{−nH(P̂_{x^n})} .
This expression is, however, frequently not very convenient to analyze, so instead we consider upper and lower bounds. We may lower-bound the max over x^n by the average over X^n ∼ Ber(θ)^n and obtain (also applying Lemma 13.2):
Reg*_n ≥ R*_n + ((|X| − 1)/2) log e + o(1) ,
where R*_n is the universal compression redundancy defined in (13.8), whose asymptotics we derived in (13.24).
On the other hand, taking Q^{(KT)}_{X^n} from Krichevsky-Trofimov (13.27) we find after some algebra and Stirling's expansion:
max_{x^n} [ log 1/Q^{(KT)}_{X^n}(x^n) − nH(P̂_{x^n}) ] = (1/2) log n + O(1) .
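For tiny blocklengths both quantities can be checked by brute force; the sketch below (ours) computes the exact worst-case regret log_2 of the binary i.i.d. Shtarkov sum and the worst-case regret of the KT assignment, which exceeds it only by O(1), both growing like (1/2) log_2 n.

import math
from itertools import product

def emp_entropy_bits(x):
    n, k = len(x), sum(x)
    if k in (0, n):
        return 0.0
    p = k / n
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def kt_log2_prob(x):
    logp, ones, zeros = 0.0, 0, 0
    for b in x:
        t = ones + zeros
        logp += math.log2(((ones if b else zeros) + 0.5) / (t + 1))
        ones += b; zeros += 1 - b
    return logp

for n in [4, 8, 12]:
    shtarkov = math.log2(sum(2 ** (-n * emp_entropy_bits(x))
                             for x in product([0, 1], repeat=n)))
    kt_worst = max(-kt_log2_prob(x) - n * emp_entropy_bits(x)
                   for x in product([0, 1], repeat=n))
    print(n, shtarkov, kt_worst)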
In machine learning terms, we say that R∗n (Θ) in (13.8) is a cumulative sequential prediction
regret under the well-specified setting (i.e. data Xn is generated by a distribution inside the model
class Θ), while here Reg∗n (Θ) corresponds to a fully mis-specified setting (i.e. data is completely
arbitrary). There are also interesting settings in between these extremes, e.g. when data is iid but
not from a model class Θ, cf. [162].
the basis for arithmetic coding. As long as P̂ converges to the actual conditional probability, we will attain the entropy rate H(X_n | X_{n−r}^{n−1}). Note that the Krichevsky-Trofimov assignment (13.29) is clearly learning the distribution too: as n grows, the estimator Q_{X_n|X^{n−1}} converges to the true P_X (provided that the sequence is i.i.d.). So in some sense the converse is also true: any good universal compression scheme is inherently learning the true distribution.
The main drawback of the learn-then-compress approach is the following. Once we extend the class of sources to include those with memory, we are invariably led to the problem of learning the joint distribution P_{X_0^{r−1}} of r-blocks. However, the sample size required to obtain a good estimate of P_{X_0^{r−1}} is exponential in r. Thus learning may proceed rather slowly. The Lempel-Ziv family of algorithms works around this in an ingeniously elegant way:
• First, estimating probabilities of rare substrings takes longest, but it is also the least useful, as these substrings almost never appear at the input.
• Second, and most crucial, an unbiased estimate of P_{X^r}(x^r) is given by the reciprocal of the time since the last observation of x^r in the data stream (a small numerical sketch of this idea is given after this list).
• Third, there is a prefix code6 mapping any integer n to a binary string of length roughly log_2 n, cf. (13.39).
There are a number of variations of these basic ideas, so we will only attempt to give a rough
explanation of why it works, without analyzing any particular algorithm.
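A rough numerical check (ours) of the recurrence-time idea in the second bullet above: for a stationary ergodic source the mean return time to a block x^r equals 1/P[X^r = x^r] (Kac's lemma below), so its reciprocal estimates the block probability and log of the recurrence time is roughly the ideal codelength. The source here is i.i.d. Ber(0.3) and the block is chosen arbitrarily.

import random

p, r, N = 0.3, 4, 200000
stream = [int(random.random() < p) for _ in range(N)]

def mean_recurrence_time(block):
    # average gap between consecutive occurrences of the block in the stream
    times, last = [], None
    for i in range(len(stream) - r + 1):
        if tuple(stream[i:i + r]) == block:
            if last is not None:
                times.append(i - last)
            last = i
    return sum(times) / len(times)

block = (1, 0, 0, 1)
true_prob = p**2 * (1 - p)**2
print(mean_recurrence_time(block), 1 / true_prob)   # both about 22.7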
We proceed to formal details. First, we need to establish Kac’s lemma.
6 For this just notice that ∑_{k≥1} 2^{−log_2 k − 2 log_2 log(k+1)} < ∞ and use Kraft's inequality. See also Ex. II.18.
However, the last event is shift-invariant and thus must have probability zero or one by the ergodicity assumption. But since P[X_0 = u] > 0 it cannot be zero. So we conclude
P[∃t ≥ 0 : X_t = u] = 1 .        (13.40)
Next, we have
E[L | X_0 = u] = ∑_{t≥1} P[L ≥ t | X_0 = u]                                        (13.41)
   = (1/P[X_0 = u]) ∑_{t≥1} P[L ≥ t, X_0 = u]                                       (13.42)
   = (1/P[X_0 = u]) ∑_{t≥1} P[X_{−t+1} ≠ u, . . . , X_{−1} ≠ u, X_0 = u]            (13.43)
   = (1/P[X_0 = u]) ∑_{t≥1} P[X_0 ≠ u, . . . , X_{t−2} ≠ u, X_{t−1} = u]            (13.44)
   = (1/P[X_0 = u]) P[∃t ≥ 0 : X_t = u]                                             (13.45)
   = 1/P[X_0 = u] ,                                                                  (13.46)
where (13.41) is the standard expression for the expectation of a Z_+-valued random variable, (13.44) is from stationarity, (13.45) is because the events corresponding to different t are disjoint, and (13.46) is from (13.40).
The following result serves to explain the basic principle behind operation of Lempel-Ziv
methods.
(1/n) E[ ℓ(f_n(X_0^{n−1}, X_{−∞}^{−1})) ] → H ,
Proof. Let L_n be the last occurrence of the block x_0^{n−1} in the string x_{−∞}^{−1} (recall that the latter is known to the decoder), namely
L_n = inf{ t > 0 : x_{−t}^{−t+n−1} = x_0^{n−1} } .
Then, by Kac's lemma applied to the process Y_t^{(n)} = X_t^{t+n−1} we have
E[L_n | X_0^{n−1} = x_0^{n−1}] = 1/P[X_0^{n−1} = x_0^{n−1}] .
We now encode L_n using the code (13.39). Note that there is a crucial subtlety: even if L_n < n, and thus [−t, −t + n − 1] and [0, n − 1] overlap, the substring x_0^{n−1} can be decoded from the knowledge of L_n.
We have, by applying Jensen's inequality twice and noticing that (1/n) H(X_0^{n−1}) ↘ H and (1/n) log H(X_0^{n−1}) → 0, that
(1/n) E[ ℓ(f_int(L_n)) ] ≤ (1/n) E[ log 1/P_{X_0^{n−1}}(X_0^{n−1}) ] + o(1) → H .
From Kraft's inequality we know that for any prefix code we must have
(1/n) E[ ℓ(f_int(L_n)) ] ≥ (1/n) H(X_0^{n−1} | X_{−∞}^{−1}) = H .
The result shown above demonstrates that the LZ algorithm has an asymptotically optimal compression rate for every stationary ergodic process. Recall, however, that previously discussed compressors also enjoyed non-stochastic (individual sequence) guarantees. For example, we have seen in Section 13.7 that the Krichevsky-Trofimov compressor achieves on every input sequence a compression ratio that is at most O((log n)/n) worse than the arithmetic encoder built with the best possible (for this sequence!) static probability assignment. It turns out that the LZ algorithm is also special from this point of view. In [331] (see also [160, Theorem 4]) it was shown that the LZ compression rate on every input sequence is better than that achieved by any finite state machine (FSM) up to correction terms O((log log n)/(log n)). Consequently, investing via LZ achieves capital growth that is competitive against any possible FSM investor [160].
Altogether we can see that LZ compression enjoys certain optimality guarantees in both the stochastic and individual sequence senses.
II.1 (Exact value of minimal compression length) Suppose X ∈ N and PX (1) ≥ PX (2) ≥ . . .. Show
that the optimal compressor f∗ ’s length satisfies
X
∞
E[l(f∗ (X))] = P[X ≥ 2k ].
k=1
II.4 Consider a three-state Markov chain S1 , S2 , . . . with the following transition probability matrix
      ⎡ 1/2  1/4  1/4 ⎤
P =   ⎢  0   1/2  1/2 ⎥ .
      ⎣  1    0    0  ⎦
Compute the limit of (1/n) E[l(f*(S^n))] when n → ∞. Does your answer depend on the distribution
of the initial state S1 ?
II.5 (a) Let X take values on a finite alphabet X . Prove that
ϵ*(X, k) ≥ (H(X) − k − 1) / log(|X| − 1) .
(b) Deduce the following converse result: For a stationary process {Sk : k ≥ 1} on a finite
alphabet S ,
lim inf_{n→∞} ϵ*(S^n, nR) ≥ (H − R) / log |S| ,
where H = lim_{n→∞} H(S^n)/n is the entropy rate of the process.
II.6 Run-length encoding is a popular variable-length lossless compressor used in fax machines,
image compression, etc. Consider compression of S^n, an i.i.d. Ber(δ) source with very small δ = 1/128, using run-length encoding: a chunk of consecutive r ≤ 255 zeros (resp. ones) is encoded into a zero (resp. one) followed by an 8-bit binary encoding of r (if there are > 255 consecutive zeros then two or more 9-bit blocks will be output). Compute the average achieved compression rate
lim_{n→∞} (1/n) E[ ℓ(f(S^n)) ] .
How does it compare with the optimal lossless compressor?
Hint: Compute the expected number of 9-bit blocks output per chunk of consecutive zeros/ones;
normalize by the expected length of the chunk.
II.7 Draw n random points independently and uniformly from the vertices of the following square.
Denote the coordinates by (X1 , Y1 ), . . . , (Xn , Yn ). Suppose Alice only observes Xn and Bob only
observes Yn . They want to encode their observation using RX and RY bits per symbol respectively
and send the codewords to Charlie who will be able to reconstruct the sequence of pairs.
(a) Find the optimal rate region for (RX , RY ).
(b) What if the square is rotated by 45◦ ?
II.8 Consider a particle walking randomly on the graph of Exercise II.7 (each edge is taken with
equal probability; particle does not stay in the same node). Alice observes the X coordinate
and Bob observes the Y coordinate. How many bits per step (in the long run) does Bob need to
send to Alice so that Alice will be able to reconstruct the particle’s trajectory with vanishing
probability of error? (Hint: you need to extend certain theorem from Chapter 11 to the case of
an ergodic Markov chain)
II.9 Recall from Theorem 11.13 the upper bound on the probability of error for the Slepian-Wolf
compression to k bits:
ϵ*_SW(k) ≤ min_{τ>0} { P[ log_{|A|} 1/P_{X^n|Y}(X^n|Y) > k − τ ] + |A|^{−τ} }        (II.1)
Consider the following case, where Xn = (X1 , . . . , Xn ) is uniform on {0, 1}n and
Y = (X1 , . . . , Xn ) + (N1 , . . . , Nn ) ,
where Ni are iid Gaussian with zero mean and variance 0.1. Let n = 10. Propose a method to
numerically compute or approximate the bound (II.1) as a function of k = 1, . . . 10. Plot the
results.
II.10 (Mismatched compression) Let P, Q be distributions on some discrete alphabet A.
(a) Let f*_P : A → {0, 1}* denote the optimal variable-length lossless compressor for X ∼ P. Show that under Q,
(b) The Shannon code for X ∼ P is a prefix code f_P with the code length l(f_P(a)) = ⌈log_2 1/P(a)⌉, a ∈ A. Show that if X is distributed according to Q instead, then
Comments: This can be interpreted as a robustness result for compression with model misspecification: when a compressor designed for P is applied to a source whose distribution is in fact Q, the suboptimality incurred by this mismatch can be related to the divergence D(Q‖P).
II.11 Consider a ternary fixed-length (almost lossless) compression X → {0, 1, 2}^k with an additional requirement that the string w^k ∈ {0, 1, 2}^k should satisfy
∑_{j=1}^k w_j ≤ k/2 .        (II.2)
For example, (0, 0, 0, 0), (0, 0, 0, 2) and (1, 1, 0, 0) satisfy the constraint but (0, 0, 1, 2) does not.
Let ϵ∗ (Sn , k) denote the minimum probability of error among all possible compressors of Sn =
{Sj , j = 1, . . . , n} with i.i.d. entries of finite entropy H(S) < ∞. Compute
as a function of R ≥ 0.
Hint: Relate to P[ℓ(f∗ (Sn )) ≥ γ n] and use Stirling’s formula (I.2) to find γ .
(1/n) ∑_{k=0}^{n−1} P[A ∩ τ^{−k} B] → P[A]P[B] .
defines a prefix code. (Warning: This is not about checking Kraft’s inequality.)
(d) Decoding. Upon receipt of the codeword, we can reconstruct the interval Dxn . Divide the
unit interval according to the distribution P, i.e., partition [0, 1) into disjoint subintervals
Ia , . . . , Iz . Output the index that contains Dxn . Show that this gives the first symbol x1 . Con-
tinue in this fashion by dividing Ix1 into Ix1 ,a , . . . , Ix1 ,z and etc. Argue that xn can be decoded
losslessly. How many steps are needed?
(e) Suppose PX (e) = 0.5, PX (o) = 0.3, PX (t) = 0.2. Encode etoo (write the binary codewords)
and describe how to decode.
(f) Show that the average length of this code satisfies
nH(P) ≤ E[l(f(Xn ))] ≤ nH(P) + 2 bits.
(g) Assume that X = (X1 , . . . , Xn ) is not iid but PX1 , PX2 |X1 , . . . , PXn |Xn−1 are known. How would
you modify the scheme so that we have
H(Xn ) ≤ E[l(f(Xn ))] ≤ H(Xn ) + 2 bits.
II.14 (Enumerative Codes) Consider the following simple universal compressor for binary sequences: given x^n ∈ {0, 1}^n, denote by n_1 = ∑_{i=1}^n x_i and n_0 = n − n_1 the number of ones and zeros in x^n. First encode n_1 ∈ {0, 1, . . . , n} using ⌈log_2(n + 1)⌉ bits, then encode the index of x^n in the set of all strings with n_1 ones using ⌈log_2 \binom{n}{n_1}⌉ bits. Concatenating the two binary strings, we obtain the codeword of x^n. This defines a lossless compressor f : {0, 1}^n → {0, 1}*.
(a) Verify that f is a prefix code.
(b) Let E_θ be taken over X^n i.i.d.∼ Ber(θ). Show that for any θ ∈ [0, 1],
E_θ[l(f(X^n))] ≤ nh(θ) + log n + O(1) ,
[Optional: Explain why enumerative coding fails to achieve the optimal redundancy.]
Hint: Stirling’s approximation (I.2) might be useful.
II.15 (Krichevsky-Trofimov codes). Consider the K-T probability assignment for the binary alphabet (13.27) and its sequential form (13.29). Let f_KT be the encoder with length assignment l(f_KT(x^n)) = log_2 1/Q^{(KT)}_{X^n}(x^n) for all x^n.
(f) Now choose π to be the Beta(1/2, 1/2) prior and redo the previous part. Do you get a better constant c_1?
Comments: We followed the strategy, introduced in [118], of lower bounding I(θ; X^n) by guessing a good estimator θ̂ = θ̂(X_1, . . . , X_n) and bounding I(θ; θ̂) on the basis of the estimation error. The rationale is that if θ can be estimated well, then X^n needs to provide a large amount of information. We will further explore this idea in Chapter 30.
II.17 Consider the following collection of stationary ergodic Markov processes depending on a parameter θ ∈ [0, 1]: X_1 ∼ Ber(1/2) and after that X_t = X_{t−1} with probability 1 − θ and X_t = 1 − X_{t−1} with probability θ. Denote the resulting Markov kernel as P_{X^n|θ}.
(a) Compute J_F(θ).
(b) Prove that the minimax redundancy R*_n = (1/2 + o(1)) log n.
II.18 (Elias coding) In this problem we construct universal codes for integers. Namely, they compress
any integer-valued (infinite alphabet!) random variable almost to its entropy.
(a) Consider the following universal compressor for natural numbers: For x ∈ N = {1, 2, . . .},
let k(x) denote the length of its binary representation. Define its codeword c(x) to be k(x)
zeros followed by the binary representation of x. Compute c(10). Show that c is a prefix
code and describe how to decode a stream of codewords.
(b) Next we construct another code using the one above: Define the codeword c′ (x) to be c(k(x))
followed by the binary representation of x. Compute c′ (10). Show that c′ is a prefix code
and describe how to decode a stream of codewords.
(c) Let X be a random variable on N whose probability mass function is decreasing. Show that
E[log(X)] ≤ H(X).
(d) Show that the average code length of c satisfies E[ℓ(c(X))] ≤ 2H(X) + 2 bit.
(e) Show that the average code length of c′ satisfies E[ℓ(c′ (X))] ≤ H(X) + 2 log(H(X) + 1) + 3
bit.
Comments: The two coding schemes are known as Elias γ-codes and δ-codes.
II.19 (Batch loss) Recall the definition of batch regret in online prediction in Section 13.6. Show that
whenever maximizer π ∗ exists we have
with optimization over π ∈ P(Θ) of priors on θ. We assume the maximum is attained at some
π∗.
(θ) ⊗n
(a) Let Dn ≜ infQYn |Xn supθ∈Θ D(PY|X kQYn |Xn |PXn ), where the infimum is over all conditional
kernels QYn |Xn : X n → Y n . Show AvgReg∗n (Θ) ≤ Dn .
R Qn
(b) Show that Cn = Dn and that optimal Q∗Yn |Xn (yn |xn ) = π ∗ (dθ) t=1 PY|X (yt |xt ) (Hint:
(θ)
Exercise I.11.)
Qn
(c) Show that we can always factorize Q∗Yn |Xn = t=1 QYt |Xt ,Yt−1 .
(d) Conclude that Q∗Yn |Xn defines an optimal learner, who also operates without any knowledge
of PXn .
Note: this characterization is mostly useful for upper-bounding regret (Exercise II.22). Indeed,
the optimal learner requires knowledge of π ∗ which in turn often depends on PXn , which is
not available to the learner. This shows why supervised learning is quite a bit more deli-
cate than universal compression. Nevertheless, taking a “natural” prior π and factorizing the
R (θ) ⊗n
mixture π (dθ)PY|X often gives very interesting and often almost optimal algorithms (e.g.
exponential-weights update algorithm [445]).
II.21 (Average-case and worst-case redundancies are incomparable.) This exercise provides an exam-
ple where the worst-case minimax redundancy (13.11) is infinite but the average-case one (13.8)
is finite. Take n = 1 and consider the class of distributions P1 = {P ∈ P(Z+ ) : EP [X] ≤ 1}.
Define
P ( x)
R∗ = min sup D(PkQ), R∗∗ = min max sup log .
Q P∈P1 Q x∈Z+ P∈P1 Q ( x)
(a) Applying the capacity-redundancy theorem, show that R∗ ≤ 2 log 2. (Hint: use Exercise I.4
to bound the mutual information.)
(b) Prove that R∗∗ = ∞ if and only if the Shtarkov sum (13.13) is infinite, namely,
P
x∈Z+ supP∈P1 P(x) = ∞
(c) Verify that
(
1 x=0
sup P(x) =
P∈P1 1/x x ≥ 1
and conclude R∗∗ = ∞. (Hint: Markov’s inequality.)
i.i.d. (d)
II.22 (Linear regression) Let Xi ∼ PX on Rp with PX being rotationally invariant (i.e. X = UX for
any orthogonal matrix U). Fix θ ∈ Rp with kθk ≤ s and given Xi generate Yi ∼ N (θ⊤ Xi , σ 2 ).
Having observed Yt−1 , Xt (but not θ) the learner outputs a prediction Ŷt of Yt .
i i
i i
i i
1
Pn ⊤
where Σ̂X = n i=1 Xi Xi
is the sample covariance matrix. (Hint: Interpret LHS as regret
under log-loss and solve maxπ θ I(θ; Yn |Xn ) s.t. E[kθk2 ] ≤ s2 via Exercise I.10.)
(b) Show that
s2 n s2 n
AvgRegn ≤ σ 2 ln det Ip + ΣX ≤ pσ 2 ln 1 + 2 E[kXk2 ] .
p p
(Hint: Jensen’s inequality.)
i.i.d.
Remark: Note that if Xi ∼ N (0, BIp ) the RHS is pσ 2 ln n + O(1). At the same time prediction
error of an ordinary least-square estimate Ŷt = θ̂⊤ Xt for n ≥ p + 2 is known to be exactly7
E[(Ŷt − Yt )2 ] = σ 2 (1 + n−pp−1 ) and hence achieves the optimal pσ 2 ln n + O(1) cumulative
regret.
7
This can be shown by applying Exercise VI.3b and evaluating the expected trace using [444, Theorem 3.1].
i i
i i
i i
Part III
i i
i i
i i
i i
i i
i i
275
In this part we study the topic of binary hypothesis testing (BHT) which we first encountered
in Section 7.3. This is an important area of statistics, with a definitive treatment given in [277].
Historically, there has been two schools of thought on how to approach this question. One is the
so-called significance testing of Karl Pearson and Ronald Fisher. This is perhaps the most widely
used approach in modern biomedical and social sciences. The concepts of null hypothesis, p-value,
χ2 -test, goodness-of-fit belong to this world. We will not be discussing these.
The other school was pioneered by Jerzy Neyman and Egon Pearson, and is our topic in this part.
The concepts of Type-I and Type-II errors, likelihood-ratio tests, Chernoff exponent are from this
domain. This is, arguably, a more popular way of looking at the problem among the engineering
disciplines (perhaps explained by its foundational role in radar and electronic signal detection).
The conceptual difference between the two is that in the first approach the full probabilistic
i.i.d.
model is specified only under the null hypothesis. (It still could be very specific like Xi ∼ N (0, 1),
i.i.d.
contain unknown parameters, like Xi ∼ N (θ, 1) with θ ∈ R arbitrary, or be nonparametric, like
i.i.d.
(Xi , Yi ) ∼ PX,Y = PX PY denoting that observables X and Y are statistically independent). The main
goal of the statistician in this setting is inventing a testing process that is able to find statistically
significant deviations from the postulated null behavior. If such deviation is found then the null is
rejected and (in scientific fields) a discovery is announced. The role of the alternative hypothesis
(if one is specified at all) is to roughly suggest what feature of the null are most likely to be violated
i.i.d.
and motivates the choice of test procedures. For example, if under the null Xi ∼ N (0, 1), then both
of the following are reasonable tests:
1X ? 1X 2 ?
n n
Xi ≈ 0 Xi ≈ 1 .
n n
i=1 i=1
However, the first one would be preferred if, under the alternative, “data has non-zero mean”, and
the second if “data has zero mean but variance not equal to one”. Whichever of the alternatives is
selected does not imply in any way the validity of the alternative. In addition, theoretical properties
of the test are mostly studied under the null rather than the alternative. For this approach the null
hypothesis (out of the two) plays a very special role.
The second approach treats hypotheses in complete symmetry. Exact specifications of proba-
bility distributions are required for both hypotheses and the precision of a proposed test is to be
analyzed under both. This is the setting that is most useful for our treatment of forthcoming topics
of channel coding (Part IV) and statistical estimation (Part VI).
The outline of this part is the following. First, we define the performance metric R(P, Q) giving
a full description of the BHT problem. A key result in this theory, the Neyman-Pearson lemma
determines the form of the optimal test and, at the same time, characterizes R(P, Q). We then
specialize to the setting of iid observations and consider two types of asymptotics (as the sample
size n goes to infinity): Stein’s regime (where type-I error is held constant) and Chernoff’s regime
(where errors of both types are required to decay exponentially). The fundamental limit in the
former regime is simply a scalar (given by D(PkQ)), while in the latter it is a region. To describe
this region (as we do in Chapter 16) we will first need to dive deep into another foundational topic:
theory of large deviations and the information projection (Chapter 15).
i i
i i
i i
14 Neyman-Pearson lemma
In this Chapter we formally define the problem of binary hypothesis testing between two sim-
ple hypotheses. We introduce the fundamental limit for this problem in the form of a region
R(P, Q) ⊂ [0, 1]2 , whose boundary is known as the received operating characteristic (ROC)
curve. We will show how to compute this region/curve exactly (Neyman-Pearson lemma) and
show optimality of the likelihood-ratio tests in the process. However, for high-dimensional situa-
tions exact computation of the region is still too complex and we will also derive upper and lower
bounds (as usual, we call them achievability and converse, respectively). Finally, we will conclude
by introducing two different asymptotic settings: the Stein regime and the Chernoff regime. The
answer in the former will be given completely (for iid distributions), while the answer for the latter
will require further developments in the subsequent chapters.
Let Z = 0 denote that the test chooses P (accepting the null) and Z = 1 that the test chooses Q
(rejecting the null).
This setting is called “testing simple hypothesis against simple hypothesis”. Here “simple”
refers to the fact that under each hypothesis there is only one distribution that could have gen-
erated the data. In comparison, composite hypothesis postulates that X ∼ P for some P is a given
class of distributions; see Sections 16.4 and 32.2.1.
276
i i
i i
i i
type-I error, significance, size, false alarm rate, false positive 1−α
specificity, selectivity, true negative α
power, recall, sensitivity, true positive 1−β
type-II error, missed detection, false negative β
accuracy π 1 (1 − β) + (1 − π 1 )α
2π 1 (1−β)
F1 -score 1+π 1 (1−β)−(1−π 1 )α
Bayesian error π 1 β + (1 − π 1 )(1 − α)
π 1 (1−β)
positive predictive value (PPV), precision 1−π 1 β−(1−π 1 )α
Entries involving π 1 = P[H1 ] correspond to Bayesian setting where a prior probability on occurrence of H1 is
postulated.
In order to quantify performance of a test, we focus on two metrics. Let π i|j denote the proba-
bility of the test choosing i when the correct hypothesis is j, with i, j ∈ {0, 1}. For every test PZ|X
we associate a pair of numbers:
• Bayesian: Assuming the prior distribution P[H0 ] = π 0 and P[H1 ] = π 1 , we minimize the
average probability of error:
• Minimax: Assuming there is an unknown prior distribution, we choose the test that preforms
the best for the worst-case prior
• Neyman-Pearson: Minimize the type-II error β subject to that the success probability under the
null is at least α.
In this Part the Neyman-Pearson formulation is our choice. We formalize the fundamental
performance limit as follows.
i i
i i
i i
278
Definition 14.1 Given (P, Q), the Neyman-Pearson region consists of achievable points for
all randomized tests
R(P, Q) = (P[Z = 0], Q[Z = 0]) : PZ|X : X → {0, 1} ⊂ [0, 1]2 . (14.2)
In particular, its lower boundary is defined as (see Figure 14.1 for an illustration)
R(P, Q)
βα (P, Q)
P = Q ⇔ R(P, Q) = P ⊥ Q ⇔ R(P, Q) =
Figure 14.1 Illustration of the Neyman-Pearson regions: typical (top plot) and two extremal cases (bottom
row). Recall that P is mutually singular w.r.t. Q, denoted by P ⊥ Q, if P[E] = 0 and Q[E] = 1 for some E.
The Neyman-Pearson region encodes much useful information about the relationship between
P and Q. In particular, the mutual singularity (see Figure 14.1) can be detected. Furthermore, every
f-divergence can be computed from the R(P, Q). For example, TV(P, Q) coincides with half the
length of the longest vertical segment contained in R(P, Q) (Exercise III.7). In machine learning
some of the most popular metric used to characterized quality of a R(P, Q) is area under the curve
(AUC)
Z 1
AUC(P, Q) ≜ 1 − βα (P, Q)dα .
0
i i
i i
i i
Proof. (a) For convexity, suppose that (α0 , β0 ), (α1 , β1 ) ∈ R(P, Q), corresponding to tests
PZ0 |X , PZ1 |X , respectively. Randomizing between these two tests, we obtain the test λPZ0 |X +
λ̄PZ1 |X for λ ∈ [0, 1], which achieves the point (λα0 + λ̄α1 , λβ0 + λ̄β1 ) ∈ R(P, Q).
The closedness of R(P, Q) will follow from the explicit determination of all boundary
points via the Neyman-Pearson lemma – see Remark 14.1. In more complicated situations
(e.g. in testing against composite hypothesis) simple explicit solutions similar to Neyman-
Pearson Lemma are not available but closedness of the region can frequently be argued
still. The basic reason is that the collection of bounded functions {g : X → [0, 1]} (with
g(x) = PZ|X (0|x)) forms a weakly compact set and hence its image under the linear functional
R R
g 7→ ( gdP, gdQ) is closed.
(b) Testing by random guessing, i.e., Z ∼ Ber(1 − α) ⊥ ⊥ X, achieves the point (α, α).
(c) If (α, β) ∈ R(P, Q) is achieved by PZ|X , P1−Z|X achieves (1 − α, 1 − β).
The region R(P, Q) consists of the operating points of all randomized tests, which include as
special cases those of deterministic tests, namely
As the next result shows, the former is in fact the closed convex hull of the latter. Recall that
cl(E) (resp. co(E)) denote the closure and convex hull of a set E, namely, the smallest closed
(resp. convex) set containing E. A useful example: For a subset E of an Euclidean space, and
measurable functions f, g : R → E, we have (E [f(X)] , E [g(X)]) ∈ cl(co(E)) for any real-valued
random variable X.
Consequently, if P and Q are on a finite alphabet X , then R(P, Q) is a polygon of at most 2|X |
vertices.
Proof. “⊃”: Comparing (14.2) and (14.4), by definition, R(P, Q) ⊃ Rdet (P, Q)), the former of
which is closed and convex , by Theorem 14.2.
“⊂”: Given any randomized test PZ|X , define a measurable function g : X → [0, 1] by g(x) =
PZ|X (0|x). Then
X Z 1
P [ Z = 0] = g(x)P(x) = EP [g(X)] = P[g(X) ≥ t]dt
x 0
i i
i i
i i
280
X Z 1
Q[Z = 0] = g(x)Q(x) = EQ [g(X)] = Q[g(X) ≥ t]dt
x 0
R
where we applied the “area rule” that E[U] = R+ P [U ≥ t] dt for any non-negative random
variable U. Therefore the point (P[Z = 0], Q[Z = 0]) ∈ R is a mixture of points (P[g(X) ≥
t], Q[g(X) ≥ t]) ∈ Rdet , averaged according to t uniformly distributed on the unit interval. Hence
R ⊂ cl(co(Rdet )).
The last claim follows because there are at most 2|X | subsets in (14.4).
Example 14.1 (Testing Ber(p) versus Ber(q)) Assume that p < < q. Using Theo- 1
2
rem 14.3, note that there are 2 = 4 events E = ∅, {0}, {1}, {0, 1}. Then R(Ber(p), Ber(q)) is
2
given by
1
)
q)
(p, q)
r(
Be
),
(p
er
(B
R
(p̄, q̄)
α
0 1
Definition 14.4 (Extended log likelihood ratio) Assume that dP = p(x)dμ and
dQ = q(x)dμ for some dominating measure μ (e.g. μ = P + Q.) Recalling the definition of
Log from (2.10) we define the extended LLR as
log qp((xx)) , p ( x) > 0 , q ( x) > 0
p(x) +∞, p ( x) > 0 , q ( x) = 0
T(x) ≜ Log =
q ( x) −∞, p ( x) = 0 , q ( x) > 0
0, p ( x) = 0 , q ( x) = 0 ,
i i
i i
i i
Definition 14.5 (Likelihood ratio test (LRT)) Given a binary hypothesis testing H0 : X ∼
P vs H1 : X ∼ Q the likelihood ratio test (LRT) with threshold τ ∈ R ∪ {±∞} is 1{x : T(x) ≤ τ },
in other words it decides
(
declare H0 , T(x) > τ
LRTτ (x) = .
declare H1 , T(x) ≤ τ
We see that taking expectation over P and over Q are equivalent upon multiplying the expectant
by exp(±T). The next result gives precise details in the general case.
i i
i i
i i
282
Z
( c)
= dμ p(x) exp(−T(x))h(x) = EP [exp(−T)g(T)] ,
{−∞<T(x)≤∞}
where in (a) we used (14.8) to justify restriction to finite values of T; in (b) we used exp(−T(x)) =
q(x)
p(x) for p, q > 0; and (c) follows from the fact that exp(−T(x)) = 0 whenever T = ∞. Exchanging
the roles of P and Q proves (14.6).
The last part follows upon taking h(x) = f(x)1{T(x) ≥ τ } and h(x) = f(x)1{T(x) ≤ τ } in (14.5)
and (14.6), respectively.
The importance of the LLR is that it is a sufficient statistic for testing the two hypotheses (recall
Section 3.5 and in particular Example 3.9), as the following result shows.
Proof. For part 2, sufficiency of T would be implied by PX|T = QX|T . For the case of X being
discrete we have:
From Theorem 14.3 we know that to obtain the achievable region R(P, Q), one can iterate over
all decision regions and compute the region Rdet (P, Q) first, then take its closed convex hull. But
this is a formidable task if the alphabet is large or infinite. On the other hand, we know that the
LLR is a sufficient statistic. Next we give bounds to the region R(P, Q) in terms of the statistics
of the LLR. As usual, there are two types of statements:
• Converse (outer bounds): any point in R(P, Q) must satisfy certain constraints;
• Achievability (inner bounds): points satisfying certain constraints belong to R(P, Q).
Proof. Use the data processing inequality for KL divergence with PZ|X ; cf. Corollary 2.19.
We will strengthen this bound with the aid of the following result.
i i
i i
i i
Note that we do not need to assume P Q precisely because ±∞ are admissible values for
the (extended) LLR.
Proof. Defining τ = log γ and g(x) = PZ|X (0|x) we get from (14.7):
P[Z = 0, T ≤ τ ] − γ Q[Z = 0, T ≤ τ ] ≤ 0 .
Decomposing P[Z = 0] = P[Z = 0, T ≤ τ ] + P[Z = 0, T > τ ] and similarly for Q we obtain then
P[Z = 0] − γ Q[Z = 0] ≤= P [T > log γ, Z = 0] − γ Q [T > log γ, Z = 0] ≤ P [T > log γ]
i i
i i
i i
284
which is equivalent to minimizing the average probability of error in (14.1), with t = ππ 01 . This can
be solved without much effort. For simplicity, consider the discrete case. Then
X X
α∗ − tβ ∗ = max (α − tβ) = max (P(x) − tQ(x))PZ|X (0|x) = |P(x) − tQ(x)|+
(α,β)∈R PZ|X
x∈X x∈X
where the last equality follows from the fact that we are free to choose PZ|X (0|x), and the best
choice is obvious:
P(x)
PZ|X (0|x) = 1 log ≥ log t .
Q ( x)
Thus, we have shown that all supporting hyperplanes are parameterized by LRT. This completely
recovers the region R(P, Q) except for the points corresponding to the faces (flat pieces) of the
region. The precise result is stated as follows:
Proof of Theorem 14.11. Let t = exp(τ ). Given any test PZ|X , let g(x) = PZ|X (0|x) ∈ [0, 1]. We
want to show that
h dP i h dP i
α = P[Z = 0] = EP [g(X)] = P > t + λP =t (14.12)
dQ dQ
goal h dP i h dP i
⇒ β = Q[Z = 0] = EQ [g(X)] ≥ Q > t + λQ =t (14.13)
dQ dQ
n o n o
Using the simple fact that EQ [f(X)1 dQ dP
≤ t ] ≥ t−1 EP [f(X)1 dQ dP
≤ t ] for any f ≥ 0 twice, we
have
dP dP
β = EQ [g(X)1 ≤ t ] + EQ [g(X)1 >t ]
dQ dQ
1 dP dP
≥ EP [g(X)1 ≤ t ] + E Q [ g( X ) 1 >t ]
t dQ dQ
| {z }
h dP i
(14.12) 1 dP dP
= EP [(1 − g(X))1 > t ] + λP = t + E Q [ g( X ) 1 >t ]
t dQ dQ dQ
| {z }
h dP i
dP dP
≥ EQ [(1 − g(X))1 > t ] + λQ = t + E Q [ g( X ) 1 >t ]
dQ dQ dQ
h dP i h dP i
=Q > t + λQ =t .
dQ dQ
i i
i i
i i
Remark 14.1 As a consequence of the Neyman-Pearson lemma, all the points on the boundary
of the region R(P, Q) are attainable. Therefore
R(P, Q) = {(α, β) : βα ≤ β ≤ 1 − β1−α }.
Since α 7→ βα is convex on [0, 1], hence continuous, the region R(P, Q) is a closed convex set, as
previously stated in Theorem 14.2. Consequently, the infimum in the definition of βα is in fact a
minimum.
Furthermore, the lower half of the region R(P, Q) is the convex hull of the union of the
following two sets:
(
dP
α = P log dQ >τ
dP
τ ∈ R ∪ {±∞}.
β = Q log dQ >τ
and
(
dP
α = P log dQ ≥τ
τ ∈ R ∪ {±∞}.
dP
β = Q log dQ ≥τ
Therefore it does not lose optimality to restrict our attention on tests of the form 1{log dQ
dP
≥ τ}
or 1{log dQ > τ }. The convex combination (randomization) of the above two styles of tests lead
dP
dP dP
P [log dQ > t] P [log dQ > t]
1 1
α α
t t
τ τ
1
Note that it so happens that in Definition 14.4 the LRT is defined with an ≤ instead of <.
i i
i i
i i
286
h dP i
β ≤ exp(−τ )P log > τ ≤ exp(−τ )
dQ
where P and Q do not depend on n; this is a particular case of our general setting with P and Q
replaced by their n-fold product distributions. We are interested in the asymptotics of the error
probabilities π 0|1 and π 1|0 as n → ∞ in the following two regimes:
• Stein’s regime: When π 1|0 is constrained to be at most ϵ, what is the best exponential rate of
convergence for π 0|1 ?
• Chernoff’s regime: When both π 1|0 and π 0|1 are required to vanish exponentially, what is the
optimal tradeoff between their exponents?
Recall that we are in the iid setting (14.14) and are interested in tests satisfying 1 −α = π 1|0 ≤ ϵ
and β = π 0|1 ≤ exp(−nE) for some exponent E > 0. Motivation of this asymmetric objective
is that often a “missed detection” (π 0|1 ) is far more disastrous than a “false alarm” (π 1|0 ). For
example, a false alarm could simply result in extra computations (attempting to decode a packet
when there is in fact only noise has been received), while missed detection results in a complete
loss of the packet. The formal definition of the best exponent is as follows.
i i
i i
i i
Theorem 14.14 (Stein’s lemma) Consider the iid setting (14.14) where PXn = Pn and
QXn = Qn . Then
Vϵ = D(PkQ), ∀ϵ ∈ (0, 1).
Consequently, V = D(PkQ).
The way to use this result in practice is the following. Suppose it is required that α ≥ 0.999,
and β ≤ 10−40 , what is the required sample size? Stein’s lemma provides a rule of thumb: n ≥
10−40
− log
D(P∥Q) .
dPXn X n
dP
Fn = log = log (Xi ), (14.15)
dQXn dQ
i=1
then pick n large enough (depends on ϵ, δ ) such that α ≥ 1 − ϵ, we have the exponent E = D − δ
achievable, Vϵ ≥ E. Sending δ → 0 yields Vϵ ≥ D. Finally, if D = ∞, the above argument holds
for arbitrary τ > 0, proving that Vϵ = ∞.
(Converse) We show that Vϵ ≤ D for any ϵ < 1, to which end it suffices to consider D < ∞. As
a warm-up, we first show a weak converse by applying Theorem 14.8 based on data processing
inequality. For any (α, β) ∈ R(PXn , QXn ), we have
1
−h(α) + α log ≤ d(αkβ) ≤ D(PXn kQXn ) (14.18)
β
i i
i i
i i
288
For any achievable exponent E < Vϵ , by definition, there exists a sequence of tests such that
αn ≥ 1 − ϵ and βn ≤ exp(−nE). Plugging this into (14.18) and using h ≤ log 2, we have E ≤
D(P∥Q) log 2
1−ϵ + n(1−ϵ) . Sending n → ∞ yields
D(PkQ)
Vϵ ≤ ,
1−ϵ
which is weaker than what we set out to prove; nevertheless, this weak converse is tight for ϵ → 0,
so that for Stein’s exponent we have succeeded in proving the desired result of V = limϵ→0 Vϵ ≥
D(PkQ). So the question remains: if we allow the type-I error to be ϵ = 0.999, is it possible for
the type-II error to decay faster? This is shown impossible by the strong converse next.
To this end, note that, in proving the weak converse, we only made use of the expectation
of Fn in (14.18), we need to make use of the entire distribution (CDF) in order to obtain better
results. Applying the strong converse Theorem 14.10 to testing PXn versus QXn and α = 1 − ϵ and
β = exp(−nE), we have
Pick γ = exp(n(D + δ)) for δ > 0, by WLLN (14.16) the probability on the right side goes to 0,
which implies that for any fixed ϵ < 1, we have E ≤ D + δ and hence Vϵ ≤ D + δ . Sending δ → 0
complete the proof.
Finally, let us address the case of P 6 Q, in which case D(PkQ) = ∞. By definition, there
exists a subset A such that Q(A) = 0 but P(A) > 0. Consider the test that selects P if Xi ∈ A for
some i ∈ [n]. It is clear that this test achieves β = 0 and 1 − α = (1 − P(A))n , which can be made
less than any ϵ for large n. This shows Vϵ = ∞, as desired.
Remark 14.3 (Non-iid data) Just like in Chapter 12 on data compression, Theorem 14.14
can be extended to stationary ergodic processes. Specifically, one can show that the Stein’s
exponent corresponds to relative entropy rate, i.e.
1
Vϵ = lim D(PXn kQXn )
n→∞ n
where {Xi } is stationary and ergodic under both P and Q. Indeed, the counterpart of (14.16) based
on WLLN, which is the key for choosing the appropriate threshold τ , for ergodic processes is the
Birkhoff-Khintchine convergence theorem (cf. Theorem 12.8).
Thus knowledge of Stein’s exponent Vϵ allows one to prove exponential bounds on probabilities
of arbitrary sets; this technique is known as “change of measure”, which will be applied in large
deviations analysis in Chapter 15.
i i
i i
i i
H0 : Xn ∼ Pn versus H1 : Xn ∼ Qn ,
but the objective in the Chernoff regime is to achieve exponentially small error probability of both
types simultaneously. We say a pair of exponents (E0 , E1 ) is achievable if there exists a sequence
of tests such that
1 − α = π 1|0 ≤ exp(−nE0 )
β = π 0|1 ≤ exp(−nE1 ).
Intuitively, one exponent can made large at the expense of making the other small. So the interest-
ing question is to find their optimal tradeoff by characterizing the achievable region of (E0 , E1 ).
This problem was solved by [218, 61] and is the topic of Chapter 16. (See Figure 16.2 for an
illustration of the optimal (E0 , E1 )-tradeoff.)
Let us explain what we already know about the region of achievable pairs of exponents (E0 , E1 ).
First, Stein’s regime corresponds to corner points of this achievable region. Indeed, Theo-
rem 14.14 tells us that when fixing αn = 1 − ϵ, namely E0 = 0, picking τ = D(PkQ) − δ
(δ → 0) gives the exponential convergence rate of βn as E1 = D(PkQ). Similarly, exchanging the
role of P and Q, we can achieves the point (E0 , E1 ) = (D(QkP), 0).
Second, we have shown in Section 7.3 that the minimum total error probabilities over all tests
satisfies
min 1 − α + β = 1 − TV(Pn , Qn ) .
(α,β)∈R(Pn ,Qn )
where we denoted
1
EH ≜ log 1 − H2 (P, Q) .
2
EH ≤ E ≤ 2EH .
This characterization is valid even if P and Q depends on the sample size n which will prove
useful later when we study composite hypothesis testing in Section 32.2.1. However, for fixed P
and Q this is not precise enough. In order to determine the full set of achievable pairs, we need
i i
i i
i i
290
to make a detour into the topic of large deviations next. To see how this connection arises, notice
that the (optimal) likelihood ratio tests give us explicit expressions for both error probabilities:
1 1
1 − αn = P Fn ≤ τ , βn = Q Fn > τ
n n
where Fn is the LLR in (14.15). When τ falls in the range of (−D(QkP), D(PkQ)), both proba-
bilities are vanishing thanks to WLLN – see (14.16) and (14.17), and we are interested in their
exponential convergence rate. This falls under the purview of large deviations theory.
i i
i i
i i
In this chapter we develop tools needed for the analysis of the error-exponents in hypothesis test-
ing (Chernoff regime). We will start by introducing the concepts of large deviations theory ( log
moment generating function (MGF) ψX , its convex conjugate ψX∗ , known as rate function, and
revisit the idea of tilting). Then, we show that probability of deviation of an empirical mean is
governed by the solution of an information projection (also known as I-projection) problem:
Equipped with the information projection we will prove a tight version of the Chernoff bound.
Specifically, for iid copies X1 , . . . , Xn of X, we show
" n #
1X
P Xk ≥ γ = exp (−nψ ∗ (γ) + o(n)) .
n
k=1
In the remaining sections we extend the simple information projection problem to a general
minimization over convex sets of measures and connect it to empirical process theory (Sanov’s the-
orem) and also show how to solve the problem under finitely many linear constraints (exponential
families).
In the next chapter, we apply these results to characterize the achievable (E0 , E1 )-region (as
defined in Section 14.6) to get
The full account of such theory requires delicate consideration of topological properties of E , and
is the subject of classical treatments e.g. [120]. We focus here on a simple special case which,
however, suffices for the purpose of establishing the Chernoff exponents in hypothesis testing,
291
i i
i i
i i
292
and also showcases all the relevant information-theoretic ideas. Our initial goal is to show the
following result:
Theorem 15.1 Consider a random variable X whose log MGF ψX (λ) = log E[exp(λX)] is
finite for all λ ∈ R. Let B = esssup X and let E[X] < γ < B. Then
" #
X
n
P Xi ≥ nγ = exp{−nE(γ) + o(n)} ,
i=1
where E(γ) = supλ≥0 λγ − ψX (λ) = ψX∗ (γ), known as the rate function.
The concepts of log MGF and the rate function will be elaborated in subsequent sections. We
provide the proof below that should be revisited after reading the rest of the chapter.
Proof. Let us recall the usual Chernoff bound: For iid Xn , for any λ ≥ 0, applying Markov’s
inequality yields
" # " ! #
X
n X
n
P Xi ≥ nγ = P exp λ Xi ≥ exp(nλγ)
i=1 i=1
" !#
X
n
≤ exp(−nλγ)E exp λ Xi
i=1
Optimizing over λ ≥ 0 gives the following non-asymptotic upper bound (concentration inequality)
which holds for any n:
" #
X
n n o
P Xi ≥ nγ ≤ exp − n sup(λγ − ψX (λ)) . (15.1)
i=1 λ≥0
i i
i i
i i
2
As an example, for a standard Gaussian Z ∼ N (0, 1), we have ψZ (λ) = λ2 . Taking X = Z3
yields a random variable such that ψX (λ) is infinite for all non-zero λ.
In the remaining of the chapter, we shall make the following simplifying assumption, known
as Cramér’s condition.
Assumption 15.1 The random variable X is such that ψX (λ) < ∞ for all λ ∈ R.
Most of the results we discuss in this chapter hold under a much weaker assumption of ψX
having domain with non-empty interior. But proofs in this generality significantly obscure the
elegance of the main ideas and we decided to avoid them. We note that Assumption 15.1 implies
that all moments of X is finite.
(a) ψX is convex;
(b) ψX is continuous;
(c) ψX is infinitely differentiable and
E[X exp{λX}]
ψX′ (λ) = = exp{−ψX (λ)}E[X exp{λX}].
E[exp{λX}]
then A ≤ X ≤ B a.s.;
(f) If X is not a constant, then ψX is strictly convex, and consequently, ψX′ is strictly increasing.
(g) Chernoff bound:
Remark 15.1 The slope of log MGF encodes the range of X. Indeed, Theorem 15.3(d) and
(e) together show that the smallest closed interval containing the support of PX equals (closure of)
i i
i i
i i
294
the range of ψX′ . In other words, A and B coincide with the essential infimum and supremum (min
and max of RV in the probabilistic sense) of X respectively,
A = essinf X ≜ sup{a : X ≥ a a.s.}
B = esssup X ≜ inf{b : X ≤ b a.s.}
See Figure 15.1 for an illustration.
ψX (λ)
slope A
slope B
0
λ
slope E[X]
Figure 15.1 Example of a log MGF ψX (γ) with PX supported on [A, B]. The limiting maximal and minimal
slope is A and B respectively. The slope at γ = 0 is ψX′ (0) = E[X]. Here we plot for X = ±1 with
P [X = 1] = 1/3.
Proof. For the proof we assume that base of log and exp is e. Note that (g) is already proved
in (15.1). The proof of (e)–(f) relies on Theorem 15.8 and can be revisited later.
i i
i i
i i
B ≥ EPλ [X] = EPλ [X1 {X < B − ϵ}] + EPλ [X1 {B − ϵ ≤ X ≤ B + ϵ}] + EPλ [X1 {X > B + ϵ}]
≥ EPλ [X1 {X < B − ϵ}] + EPλ [X1 {X > B + ϵ}]
≥ − EPλ [|X|1 {X < B − ϵ}] + (B + ϵ) Pλ [X > B + ϵ] . (15.3)
| {z }
→1
Therefore we will obtain a contradiction if we can show that EPλ [|X|1 {X < B − ϵ}] → 0 as
λ → ∞. To that end, notice that the convexity of ψX implies that ψX′ % B. Thus, for all λ ≥ λ0
we have ψX′ (λ) ≥ B − 2ϵ . Thus, we have for all λ ≥ λ0
ϵ ϵ
ψX (λ) ≥ ψX (λ0 ) + (λ − λ0 )(B − ) = c + λ(B − ) , (15.4)
2 2
for some constant c. Then,
≤ E[|X|eλ(B−ϵ)−c−λ(B− 2 ) ]
ϵ
= E[|X|]e−λ 2 −c → 0
ϵ
λ→∞
where the first inequality is from (15.4) and the second from X < B − ϵ. Thus, the first term
in (15.3) goes to 0 implying the desired contradiction.
(f) Suppose ψX is not strictly convex. Since ψX is convex from part (a), ψX must be “flat” (affine)
near some point. That is, there exists a small neighborhood of some λ0 such that ψX (λ0 + u) =
i i
i i
i i
296
ψX (λ0 ) + ur for some r ∈ R. Then ψPλ (u) = ur for all u in small neighborhood of zero, or
equivalently EPλ [eu(X−r) ] = 1 for u small. The following Lemma 15.4 implies Pλ [X = r] = 1,
but then P[X = r] = 1, contradicting the assumption X 6= constant.
Definition 15.5 (Rate function) The rate function ψX∗ : R → R ∪ {+∞} is given by the
Fenchel-Legendre conjugate (convex conjugate) of the log MGF:
Note that the maximization (15.5) is a convex optimization problem since ψX is strictly convex,
so we can find the maximum by taking the derivative and finding the stationary point. In fact, ψX∗
is the precisely the convex conjugate of ψX ; cf. (7.84).
The next result describes useful properties of the rate function. See Figure 15.2 for an
illustration.
Theorem 15.6 (Properties of ψX∗ ) Assume that X is non-constant and satisfies Assump-
tion 15.1.
(b) ψX∗ is strictly convex and strictly positive except ψX∗ (E[X]) = 0.
(c) ψX∗ is decreasing when γ ∈ (A, E[X]), and increasing when γ ∈ [E[X], B)
Proof. By Theorem 15.3(d), since A ≤ X ≤ B a.s., we have A ≤ ψX′ ≤ B. When γ ∈ (A, B), the
strictly concave function λ 7→ λγ − ψX (λ) has a single stationary point which achieves the unique
maximum. When γ > B (resp. < A), λ 7→ λγ − ψX (λ) increases (resp. decreases) without bounds.
1
More precisely, if we only know that E[eλS ] is finite for |λ| ≤ 1 then the function z 7→ E[ezS ] is holomorphic in the
vertical strip {z : |Rez| < 1}.
i i
i i
i i
ψX (λ)
slope γ
0
λ
ψX∗ (γ)
ψX∗ (γ)
+∞ +∞
γ
A E[X] 0 B
Figure 15.2 Log MGF ψX and its conjugate (rate function) ψX∗ for X taking values in [A, B], continuing the
example in Figure 15.1.
i i
i i
i i
298
In the above examples we see that Pλ shifts the mean of P to the right (resp. left) when λ > 0
(resp. < 0). Indeed, this is a general property of tilting.
Proof. Again for the proof we assume base e for exp and log.
(a) By definition.
(b) EPλ [X] = EE[X[exp
exp(λX)] ′ ′
(λX)] = ψX (λ), which is strictly increasing in λ, with ψX (0) = EP [X].
exp(λX) ′
D(Pλ kP) = EPλ log dP dP = EPλ log E[exp(λX)] = λEPλ [X] − ψX (λ) = λψX (λ) − ψX (λ) =
λ
∗ ′
ψX (ψX (λ)), where the last equality follows from Theorem 15.6(a).
(c) ψX′′ (λ) = E[X exp(λ X)]−(E[X exp(λX)])2
2
(E[exp(λX)])2
= VarPλ (X).
i i
i i
i i
(d)
1 1
lim log 1 Pn = inf D(QkP) (15.10)
n→∞ n P n k=1 Xk ≥ γ Q : EQ [X]≥γ
Remark 15.2 (Subadditivity) One can argue from first principles that the limits
(15.9) and (15.10) exist without computing their values. Indeed, note that the sequence
pn ≜ log P 1 Pn 1 X ≥γ satisfies pn+m ≥ pn pm and hence log p1n is subadditive. As such,
[ n k=1 k ]
limn→∞ 1n log p1n = infn log p1n by Fekete’s lemma.
i i
i i
i i
300
Proof. First note that if the events have zero probability, then both sides coincide with infinity.
Pn
Indeed, if P 1n k=1 Xk > γ = 0, then P [X > γ] = 0. Then EQ [X] > γ ⇒ Q[X > γ] > 0 ⇒
Q 6 P ⇒ D(QkP) = ∞ and hence (15.9) holds trivially. The case for (15.10) is similar.
In the sequel we assume both probabilities are nonzero. We start by proving (15.9). Set P[En ] =
Pn
P 1n k=1 Xk > γ .
Lower Bound on P[En ]: Fix a Q such that EQ [X] > γ . Let Xn be iid. Then by WLLN,
" n #
X
Q[En ] = Q Xk > nγ = 1 − o(1).
k=1
Upper Bound on P[En ]: The key observation is that given any X and any event E, PX (E) > 0
can be expressed via the divergence between the conditional and unconditional distribution as:
P
log PX1(E) = D(PX|X∈E kPX ). Define P̃Xn = PXn | P Xi >nγ , under which Xi > nγ holds a.s. Then
1
log = D(P̃Xn kPXn ) ≥ inf
P D(QXn kPXn ) (15.13)
P[En ] QXn :EQ [ Xi ]>nγ
We now show that the last problem “single-letterizes”, i.e., reduces n = 1. Note that this is a
special case of a more general phenomena – see Ex. III.12. Consider the following two steps:
X
n
D(QXn kPXn ) ≥ D(QXj kP)
j=1
1X
n
≥ nD(Q̄kP) , Q̄ ≜ QXj , (15.14)
n
j=1
where the first step follows from (2.27) in Theorem 2.16, after noticing that PXn = Pn , and the
second step is by convexity of divergence (Theorem 5.1). From this argument we conclude that
inf
P D(QXn kPXn ) = n · inf D(QkP) (15.15)
QXn :EQ [ Xi ]>nγ Q:EQ [X]>γ
i i
i i
i i
inf
P D(QXn kPXn ) = n · inf D(QkP) (15.16)
QXn :EQ [ Xi ]≥nγ Q:EQ [X]≥γ
In particular, (15.13) and (15.15) imply the required lower bound in (15.9).
Next we prove (15.10). First, notice that the lower bound argument (15.13) applies equally well,
so that for each n we have
1 1
log 1 Pn ≥ inf D(QkP) .
n P n k=1 Xk ≥ γ Q : EQ [X]≥γ
• Case I: P[X > γ] = 0. If P[X ≥ γ] = 0, then both sides of (15.10) are +∞. If P[X = γ] > 0,
P
then P[ Xk ≥ nγ] = P[X1 = . . . = Xn = γ] = P[X = γ]n . For the right-hand side, since
D(QkP) < ∞ =⇒ Q P =⇒ Q(X ≤ γ) = 1, the only possibility for EQ [X] ≥ γ is that
Q(X = γ) = 1, i.e., Q = δγ . Then infEQ [X]≥γ D(QkP) = log P(X1=γ) .
P P
• Case II: P[X > γ] > 0. Since P[ Xk ≥ γ] ≥ P[ Xk > γ] from (15.9) we know that
1 1
lim sup log 1 Pn ≤ inf D(QkP) .
n→∞ n P n k=1 Xk ≥ γ Q : EQ [X]>γ
Indeed, let P̃ = PX|X>γ which is well defined since P[X > γ] > 0. For any Q such that EQ [X] ≥
γ , set Q̃ = ϵ̄Q + ϵP̃ satisfies EQ̃ [X] > γ . Then by convexity, D(Q̃kP) ≤ ϵ̄D(QkP) + ϵD(P̃kP) =
ϵ̄D(QkP) + ϵ log P[X1>γ] . Sending ϵ → 0, we conclude the proof of (15.17).
Remark 15.3 Note that the upper bound (15.11) also holds for independent non-identically
distributed Xi . Indeed, we only need to replace the step (15.14) with D(QXn kPXn ) ≥
Pn Pn
i=1 D(QXi kPXi ) ≥ nD(Q̄kP̄) where P̄ = n
1
i=1 PXi . This yields a bound (15.11) with P
replaced by P̄ in the right-hand side.
i i
i i
i i
302
where f(u) ≜ u log u − (u − 1) log e ≥ 0. These follow from (15.18)-(15.19) via the following
useful estimate:
These simply follow from the inequality between KL divergence and Hellinger distance
√ √ √
( np+t)2
in (7.33). Indeed, we get d(xkp) ≥ H2 (Ber(x), Ber(p)) ≥ ( x − p)2 . Plugging x = n
into (15.18)-(15.19) we obtain the result. We note that [316, Theorem 3] shows a stronger bound
of e−2t in (15.21).
2
Remarkably, the bounds in (15.21) and (15.22) do not depend on n or p. This is due to the
√
variance-stabilizing effect of the square-root transformation for binomials: Var( X) is at most a
√ √
constant for all n, p. In addition, X − np = √XX− np
√ is of a self-normalizing form: the denomi-
+ np
nator is on par with the standard deviation of the numerator. For more on self-normalizing sums,
see [69, Problem 12.2].
inf D(QkP)
Q∈E
i i
i i
i i
Denote the minimizing distribution Q by Q∗ . The next result shows that intuitively the “line”
between P and optimal Q∗ is “orthogonal” to E (cf. Figure 15.3).
Q∗
Distributions on X
Figure 15.3 Illustration of information projection and the Pythagorean theorem.
Theorem 15.10 Let E be a convex set of distributions. If there exists Q∗ ∈ E such that
∗
D(Q kP) = minQ∈E D(QkP) < ∞, then ∀Q ∈ E
D(QkP) ≥ D(QkQ∗ ) + D(Q∗ kP)
Proof. If D(QkP) = ∞, then there is nothing to prove. So we assume that D(QkP) < ∞, which
also implies that D(Q∗ kP) < ∞. For λ ∈ [0, 1], form the convex combination Q(λ) = λ̄Q∗ +λQ ∈
E . Since Q∗ is the minimizer of D(QkP), then
d
0≤ D(Q(λ) kP) = D(QkP) − D(QkQ∗ ) − D(Q∗ kP)
dλ λ=0
The rigorous analysis requires an argument for interchanging derivatives and integrals (via domi-
nated convergence theorem) and is similar to the proof of Proposition 2.20. The details are in [114,
Theorem 2.2].
Remark 15.4 If we view the picture above in the Euclidean setting, the “triangle” formed by
P, Q∗ and Q (for Q∗ , Q in a convex set, P outside the set) is always obtuse, and is a right triangle
only when the convex set has a “flat face”. In this sense, the divergence is similar to the squared
Euclidean distance, and the above theorem is sometimes known as a “Pythagorean” theorem.
The relevant set E of Q’s that we will focus next is the “half-space” of distributions E = {Q :
EQ [X] ≥ γ}, where X : Ω → R is some fixed function (random variable). This is justified by rela-
tion with the large-deviations exponent in Theorem 15.9. First, we solve this I-projection problem
explicitly.
i i
i i
i i
304
2 Whenever the minimum is finite, the minimizing distribution is unique and equal to the tilting
of P along X, namely2
Remark 15.5 Both Theorem 15.9 and Theorem 15.11 are stated for the right tail where the
sample mean exceeds the population mean. For the left tail, simply these results to −Xi to obtain
for γ < E[X],
1 1
lim log 1 Pn = inf D(QkP) = ψX∗ (γ).
n→∞ n P n k=1 Xk < γ Q : EQ [X]<γ
In other words, the large deviations exponent is still given by the rate function (15.5) except that
the optimal tilting parameter λ is negative.
2
Note that unlike the setting of Theorems 15.1 and 15.9 here P and Pλ are measures on an abstract space Ω, not necessarily
on the real line.
i i
i i
i i
mean from EP X to γ , in particular λ ≥ 0. Moreover, ψX∗ (γ) = λγ − ψX (λ). Take any Q such
that EQ [X] ≥ γ , then
dQdPλ
D(QkP) = EQ log (15.28)
dPdPλ
dPλ
= D(QkPλ ) + EQ [log ]
dP
= D(QkPλ ) + EQ [λX − ψX (λ)]
≥ D(QkPλ ) + λγ − ψX (λ)
= D(QkPλ ) + ψX∗ (γ)
≥ ψX∗ (γ), (15.29)
where the last inequality holds with equality if and only if Q = Pλ . In addition, this shows
the minimizer is unique, proving the second claim. Note that even in the corner case of γ = B
(assuming P(X = B) > 0) the minimizer is a point mass Q = δB , which is also a tilted measure
(P∞ ), since Pλ → δB as λ → ∞, cf. Theorem 15.8(d).
An alternative version of the solution, given by expression (15.26), follows from Theorem 15.6.
For the third claim, notice that there is nothing to prove for γ < EP [X], while for γ ≥ EP [X] we
have just shown
The final step is to notice that ψX∗ is increasing and continuous by Theorem 15.6, and hence the
right-hand side infimum equals ψX∗ (γ). The case of minQ:EQ [X]=γ is handled similarly.
Corollary 15.12 For any Q with EQ [X] ∈ (A, B), there exists a unique λ ∈ R such that the
tilted distribution Pλ satisfies
and furthermore the gap in the last inequality equals D(QkPλ ) = D(QkP) − D(Pλ kP).
Proof. Proceed as in the proof of Theorem 15.11, and find the unique λ s.t. EPλ [X] = ψX′ (λ) =
EQ [X]. Then D(Pλ kP) = ψX∗ (EQ [X]) = λEQ [X] − ψX (λ). Repeat the steps (15.28)-(15.29)
obtaining D(QkP) = D(QkPλ ) + D(Pλ kP).
For any Q the previous result allows us to find a tilted measure Pλ that has the same mean as
Q yet smaller (or equal) divergence distance to P. We will see that the same can be done under
multiple linear constraints (Section 15.6*).
i i
i i
i i
306
Q 6≪ P
One Parameter Family
γ=A
P b
D(Pλ ||P ) EQ [X] = γ
λ=0 = ψ ∗ (γ)
b
λ>0 Q
b
γ=B
Q∗
=Pλ
Q 6≪ P
Space of distributions on R
The key observation here is that the curve of this one-parameter family {Pλ : λ ∈ R} intersects
each γ -slice E = {Q : EQ [X] = γ} “orthogonally” at the minimizing Q∗ ∈ E , and the distance
from P to Q∗ is given by ψ ∗ (λ). To see this, note that applying Theorem 15.10 to the convex set
E gives us D(QkP) ≥ D(QkQ∗ ) + D(Q∗ kP). Now thanks to Corollary 15.12, we in fact have an
equality D(QkP) = D(QkQ∗ ) + D(Q∗ kP) and Q∗ = Pλ for some tilted measure.
Let us give an intuitive (non-rigorous) justification for calling the curve {Pμ , μ ∈ [0, λ]}
geodesic connecting P = P0 to Pλ . Suppose there existed another curve {Qμ } connecting P to
Pλ and minimizing KL distance. Then the expectation EQμ [X] should continously change from
EP [X] to EPλ [X]. Now take any intermediate value γ ′ of the expectation EQμ [X]. We know that on
the slice {Q : EQ [X] = γ ′ } the closest to P element is Pμ′ for some μ′ ∈ [0, λ]. Thus, we could
shorten the distance by connecting P to Pμ′ instead of Qμ .
i i
i i
i i
Our treatment above is specific to distributions on R. How do we find a geodesic between two
arbitrary distributions P̃ and Q̃ on an abstract measurable space X ? To find the answer we notice
that the “intrinsic” definition of the geodesic between P and Pλ above can be given as follows:
μ
dPμ 1 dPλ λ
= ,
dP Z( μ) dP
where Z( μ) = exp{ψ( μ)} is a normalization constant. Correspondingly, we define the geodesic
between P̃ and Q̃ as a parametric family {P̃μ , μ ∈ [0, 1]} given by
μ
dP̃μ 1 dQ̃
≜ , (15.30)
dP̃ Z̃( μ) dP̃
where the normalizing constant Z̃( μ) = exp{( μ − 1)Dμ (Q̃kP̃)} is given in terms of Rényi
divergence. See also Exercise III.25.
Formal justification of (15.30) as a true geodesic in the sense of differential geometry was
given by Cencov in [85, Theorem 12.3] for the case of finite underlying space X . His argument
was the following. To enable discussion of geodesics one needs to equip the space P([k]) with a
connection (or parallel transport map). It is natural to require the connection to be equivariant (or
commute) with respect to some maps P([k]) → P([k′ ]). Cencov lists (a) permutations of elements
(k = k′ ); (b) embedding of a distribution P ∈ P([k]) into a larger space by splitting atoms of [k]
(with specified ratio) into multiple atoms of [k′ ], so that k < k′ ; and (c) conditioning on an event
(k > k′ ). It turns out there is one-parameter family of connections satisfying (a)-(b), including
the Riemannian (Levi-Civitta) connection corresponding to a Fisher-Rao metric (2.35). However,
there is only a unique connection satisfying all (a)-(c). It is different from the Fisher-Rao and its
geodesics are exactly given by (15.30). Geodesically complete submanifolds in this metric are
simply the exponential families (Section 15.6*). For more on this exciting area, see Cencov [85]
and Amari [17].
Examples of regularity conditions in the above theorem include: (a) X is finite, P is fully sup-
ported on X , and E is closed with non-empty interior: see Exercise III.23 for a full proof in this
i i
i i
i i
308
case; (b) X is a Polish space and infQ∈int(E) D(QkP) = infQ∈cl(E) D(QkP): this is the content
of [120, Theorem 6.2.10]. The reference [120] contains full details about various other versions
and extensions of Sanov’s theorem to infinite-dimensional settings.
This problem arises in statistical physics, Gibbs variational principle, exponential family, and
many other fields. Note that taking P uniform correspond to the max-entropy problems.
In the case of d = 1 we have seen that whenever the value of minimization is finite solution
Q∗ can be sought inside a single-parameter family of tilted measures P, cf. (15.27). For this more
general case of d > 1 we define tilted measures as
In order to discuss the solution of (15.31) we first make a simple observation analogous to
Corollary 15.12:
Proposition 15.14 If there exists λ such that ψ(λ) < ∞ and EX∼Pλ [ϕ(X)] = γ , then the
unique minimizer of (15.31) is Pλ and for any Q with EQ [ϕ(X)] = γ we have
⊤ ⊤
Proof. Since log dPdP = λ ϕ(x) − ψ(λ) is finite everywhere we have that D(Pλ kP) = λ γ −
λ
ψ(λ) < ∞ and hence the solution of (15.31) is finite. The fact that Pλ is the unique minimizer
follows from the identity (15.33) that we are to prove next. Take Q as in the statement and suppose
that either D(QkP) or D(QkPλ ) finite (otherwise there is nothing to prove). Since Pλ P this
implies that Q P and so let us denote by fQ = dQ dP . From (2.11) we see that
exp{λ⊤ ϕ(X) − ψ(λ)}
D(QkP) − D(QkPλ ) = EQ Log = EQ log exp{λ⊤ ϕ(X) − ψ(λ)}
1
= λ⊤ γ − ψ(λ) = D(Pλ kP) ,
i i
i i
i i
Unfortunately, Proposition 15.14 is far from being able to completely resolve the prob-
lem (15.31) since it does not explain for what values γ ∈ Rd of the constraints it is possible
to find a required tilting Pλ . For d = 1 we had a very simple characterization of the set of values
that the means of Pλ ’s can achieve. Specifically, Theorem 15.8 showed (under Assumption 15.1)
where A, B are the boundaries of the support of ϕ. To obtain a similar characterization for the
case of d > 1, we let P̃ be the probability distribution on Rd of ϕ(X) when X ∼ P, i.e. P̃ is the
push-forward of P along ϕ. The analog of (A, B) is then played by the following concept:
It is clear that csupp is itself a closed convex set. Furthermore, it can be obtained by taking the
convex hull of supp(P̃) followed by the topological closure cl(·), i.e.
(Indeed, csupp(P̃) ⊂ cl(co(suppP̃)) since the set on the right is convex and closed. On the other
hand, for any closed half-space H ⊃ csupp(P̃) of full measure, i.e. P̃[H] = 1 we must have
supp(P̃) ⊂ H. Taking convex hull and then closure of both sides yields cl(co(supp(P̃))) ⊂ H.
Taking the intersection over all such H shows that cl(co(suppP̃)) ⊂ csupp(P̃) as well.)
We are now ready to state the characterization of when I-projection is solved by a tilted measure.
Theorem 15.16 (I-projection on hyperplane) Suppose P and ϕ satisfy the following two
assumptions: (a) The d + 1 functions (1, ϕ1 , . . . , ϕd ) are linearly independent P-a.s. and (b) the
log MGF ψ(λ) is finite for all λ ∈ Rd . Then
1 If there exist any Q such that EQ [ϕ] = γ and D(QkP) < ∞, we must have γ ∈ csupp(P̃).
2 There is a solution λ to EPλ [ϕ] = γ if and only if γ ∈ int(csupp(P̃)).
Remark 15.6 Assumption (b) of Theorem 15.16 can be relaxed to requiring only the domain
of the log MGF to be an open set (see [85, Theorem 23.1] or [77, Theorem 3.6].) Applying Theo-
rem 15.16, whenever γ ∈ int csupp(P̃) the I-projection can be sought in the tilted family Pλ and
only in such case. If γ ∈/ csupp(P̃) then the I-projection is trivially impossible and every Q with
the given expectation yields D(QkP) = ∞. When γ ∈ ∂ csupp(P̃) it could be that I-projection
(i.e. the minimizer of (15.31)) exist, unique and yields a finite divergence, but the minimizer is
not given by the λ-tilting of P. It could also be that every feasible Q yields D(QkP) = ∞.
i i
i i
i i
310
• γ2 > γ12 : the optimal Q is N (γ1 , γ2 − γ12 ), which is a tilted version of P along ϕ.
• γ2 = γ12 : the only feasible Q is δγ1 , which results in D(QkP) = ∞.
• γ2 < γ12 : there is no feasible Q.
Before giving the proof of the theorem we remind some of the standard and easy facts about
exponential families of which Pλ is an example. In this context ϕ is called a vector of statistics
and λ is the natural parameter. Note that all Pλ ∼ P are mutually absolutely continuous and hence
we have from the linear independence assumption:
CovX∼Pλ [ϕ(X)] 0 (15.35)
i.e. the covariance matrix is (strictly) positive definite. Similar to Theorem 15.3 we can show that
λ 7→ ψ(λ) is a convex, infinitely differentiable function [77]. We want to study the map from
natural parameter λ to the mean parameter μ:
λ 7→ μ(λ) ≜ EX∼Pλ [ϕ(X)] ,
Specifically, we will show that the image μ(Rd ) = int csupp(P̃). To that end note that, similar to
Theorem 15.8(b) and (c), the first two derivatives give moments of ϕ as follows:
EX∼Pλ [ϕ(X)] = ∇ψ(λ) , CovX∼Pλ [ϕ(X)] = Hess ψ(λ) log e . (15.36)
Together with (15.35) we see that then ψ is strictly convex and hence for any λ1 , λ2 we have the
strict monotonicity of ∇ψ , i.e.
(λ1 − λ2 )T (∇ψ(λ1 ) − ∇ψ(λ2 )) > 0 . (15.37)
Additionally, from (15.35) we obtain that Jacobian of the map λ 7→ μ(λ) equals det Hess ψ >
0. Thus by the inverse function theorem the image μ(Rd ) is an open set in Rd and there is an
infinitely differentiable inverse μ 7→ λ = μ−1 ( μ) defined on this set. Hence, the family Pλ can be
equivalently reparameterized by μ’s. What is non-trivial is that the image μ(Rd ) is convex and in
fact coincides with int csupp(P̃).
Proof of Theorem 15.16. Throughout the proof we denote C = csupp(P̃), Co = int(csupp(P̃)).
Suppose there is a Q P with t = EQ [ϕ(X)] 6∈ C. Then there is a (separating hyperplane)
b ∈ Rd and c ∈ R such that b⊤ t < c ≤ b⊤ p for any p ∈ C. Since P[ϕ(X) ∈ C] = 1 we conclude
that Q[b⊤ ϕ(X) ≥ c] = 1. But then this contradicts the fact that EQ [b⊤ ϕ(X)] < c. This shows the
first claim.
Next, we show that for any λ we have μ(λ) = EPλ [ϕ] ∈ Co . Indeed, by the previous paragraph
we know μ(λ) ∈ C. On the other hand, as we discussed the map λ → μ(λ) is smooth, one-to-one,
with smooth inverse. Hence the image of a small ball around λ is open and hence μ(λ) ∈ int(C) =
Co .
i i
i i
i i
Finally, we prove the main implication that for any γ ∈ Co there must exist a λ such that
μ(λ) = γ . To that end, consider the unconstrained minimization problem
If we can show that the minimum is achieved at some λ∗ , then from the first-order optimality
conditions we conclude the desired ∇ψ(λ∗ ) = γ . Since the objective function is continuous, it is
sufficient to show that the minimization without loss of generality can be restricted to a compact
ball {kλk ≤ R} for some large R.
To that end, we first notice that if γ ∈ Co then for some ϵ > 0 we must have
Indeed, suppose this is not the case. Then for any ϵ > 0 there is a sequence vk s.t.
P[v⊤
k (ϕ(X) − γ) > ϵ] → 0 .
Now by compactness of the sphere, vk → ṽϵ without loss of generality and thus we have for every
ϵ some ṽϵ such that
P[ṽ⊤
ϵ (ϕ(X) − γ) > ϵ] = 0 .
Again, by compactness there must exist convergent subsequence ṽϵ → v∗ and ϵ → 0 such that
P[v⊤
∗ (ϕ(X) − γ) > 0] = 0 .
Thus, supp(P̃) ⊂ {x : v⊤ ⊤
∗ ϕ(x) ≤ v∗ γ} and hence γ cannot be an interior point of C = csupp(P̃).
λ
Given (15.39) we obtain the following estimate, where we denote v = ∥λ∥ :
Thus, returning to the minimization problem (15.38) we see that the objective function satisfies
a lower bound
Then it is clear that restricting the minimization to a sufficiently large ball {kλk ≤ R} is without
loss of generality. As we explained this completes the proof.
min{D(QX,Y kPX,Y ) : QX = VX , QY = VY } ,
i i
i i
i i
312
where the marginals VX and VY are given. As we discussed in Section 5.6, Sinkhorn identified
an elegant iterative algorithm that converges to the minimizer. Here, we can apply our general
I-projection theory to show that minimizer has the form
Q∗X,Y (x, y) = A(x)PX,Y (x, y)B(y) . (15.40)
Specifically, let us assume PX,Y (x, y) > 0 and consider |X | + |Y| functions ϕa (x, y) = 1{x = a}
and ϕb (x, y) = 1{y = b}, a ∈ X , b ∈ Y . They are linearly independent. The set csupp(P̃) =
P(X ) × P(Y) in this case corresponds to all marginal distributions. Thus, whenever VX , VY have
no zeros they belong to int(csupp(P̃)) and the solution to the I-projection problem is a tilted version
of PX,Y which is precisely of the form (15.40). In this case, it turns out that I-projection exists also
on the boundary and even when PX,Y is allowed to have zeros but these cases are outside the scope
of Theorem 15.16 and need to be treated differently, see [114].
i i
i i
i i
In this chapter our goal is to determine the achievable region of the exponent pairs (E0 , E1 ) for
the Type-I and Type-II error probabilities in Chernoff’s regime when both exponents are strictly
positive. Our strategy is to apply the achievability and (strong) converse bounds from Chapter 14
in conjunction with the large deviations theory developed in Chapter 15. After characterizing the
full tradeoff we will discuss an adaptive setting of hypothesis testing where instead of committing
ahead of time to testing on the basis of n samples, one can decide adaptively whether to request
more samples or stop. We will find out that adaptivity greatly increases the region of achievable
error-exponents and will learn about the sequential probability ratio test (SPRT) of Wald. In the
closing sections we will discuss relation to more complicated settings in hypothesis testing: one
with composite hypotheses and one with communication constraints.
313
i i
i i
i i
314
ψP (λ)
0 1
λ
E0 = ψP∗ (θ)
E1 = ψP∗ (θ) − θ
slope θ
Figure 16.1 Geometric interpretation of Theorem 16.1 relies on the properties of ψP (λ) and ψP∗ (θ). Note that
ψP (0) = ψP (1) = 0. Moreover, by Theorem 15.6, θ 7→ E0 (θ) is increasing, θ 7→ E1 (θ) is decreasing.
P
For discrete distributions, we have ψP (λ) = log x P(x)1−λ Q(x)λ ; in general, ψP (λ) =
R 1−λ dQ λ
log dμ( dPdμ ) ( dμ ) for some dominating measure μ.
Note that since ψP (0) = ψP (1) = 0, from the convexity of ψP (Theorem 15.3) we conclude
that ψP (λ) is finite on 0 ≤ λ ≤ 1. Furthermore, assuming P Q and Q P we also have that
λ 7→ ψP (λ) continuous everywhere on [0, 1]. (The continuity on (0, 1) follows from convexity,
but for the boundary points we need more detailed arguments.) Although all results in this section
apply under the (milder) conditions of P Q and Q P, we will only present proofs under
the (stronger) condition that log MGF exists for all λ, following the convention of the previous
chapter. The following result determines the optimal (E0 , E1 )-tradeoff in a parametric form. For a
concrete example, see Exercise III.19 for testing two Gaussians.
parametrized by −D(PkQ) ≤ θ ≤ D(QkP), characterizes the upper boundary of the region of all
achievable (E0 , E1 )-pairs. (See Figure 16.1 for an illustration.)
i i
i i
i i
Corollary 16.2 (Bayesian criterion) Fix a prior (π 0 , π 1 ) such that π 0 + π 1 = 1 and 0 <
π 0 < 1. Denote the optimal Bayesian (average) error probability by
P∗e (n) ≜ inf π 0 π 1|0 + π 1 π 0|1
P Z| X n
with exponent
1 1
E ≜ lim log ∗ .
n→∞ n P e ( n)
Then
E = max min(E0 (θ), E1 (θ)) = ψP∗ (0)
θ
If |X | = 2 and if the compositions (types) of xn and x̃n are equal (!), the expression is invariant
under λ ↔ 1 − λ and thus from the convexity in λ we conclude that λ = 12 is optimal,2 yielding
E = 1n dB (xn , x̃n ), where
X
n Xq
dB (xn , x̃n ) = − log PY|X (y|xt )PY|X (y|x̃t ) (16.3)
t=1 y∈Y
1
In short, this is because the optimal tilting parameter λ does not need to be chosen differently for different values of
(xt , x̃t ).
2 1
For another example where λ = 2
achieves the optimal in the Chernoff information, see Exercise III.30.
i i
i i
i i
316
is known as the Bhattacharyya distance between codewords xn and x̃n . (Compare with the Bhat-
tacharyya coefficient defined after (7.5).) Without the two assumptions stated, dB (·, ·) does not
necessarily give the optimal error exponent. We do, however, always have the bounds, see (14.19):
1
exp (−2dB (xn , x̃n )) ≤ P∗e (xn , x̃n ) ≤ exp (−dB (xn , x̃n )) ,
4
where the upper bound becomes tighter when the joint composition of (xn , x̃n ) and that of (x̃n , xn )
are closer.
Pn
Proof of Theorem 16.1. The idea is to apply the large deviations theory to the iid sum k=1 Tk .
Specifically, let’s rewrite the achievability and converse bounds from Chapter 14 in terms of T:
• Achievability (Neyman-Pearson): Applying Theorem 14.11 with τ = −nθ, the LRT achieves
the following
" n # " n #
X X
π 1|0 = P T k ≥ nθ π 0|1 = Q T k < nθ (16.4)
k=1 k=1
• Converse (strong): Applying Theorem 14.10 with γ = exp (−nθ), any achievable π 1|0 and π 0|1
satisfy
" n #
X
π 1|0 + exp (−nθ) π 0|1 ≥ P T k ≥ nθ . (16.5)
k=1
For achievability, applying the nonasymptotic large deviations upper bound in Theorem 15.9
(and Theorem 15.11) to (16.4), we obtain that for any n,
" n #
X
π 1| 0 = P Tk ≥ nθ ≤ exp (−nψP∗ (θ)) , for θ ≥ EP T = −D(PkQ)
k=1
" #
Xn
π 0|1 = Q Tk < nθ ≤ exp −nψQ∗ (θ) , for θ ≤ EQ T = D(QkP)
k=1
exp (−nE0 ) + exp (−nθ) exp (−nE1 ) ≥ exp (−nψP∗ (θ) + o(n))
⇒ min(E0 , E1 + θ) ≤ ψP∗ (θ)
i i
i i
i i
Theorem 16.3 (a) The optimal exponents are given (parametrically) in terms of λ ∈ [0, 1] as
E0 = D(Pλ kP), E1 = D(Pλ kQ) (16.6)
where the distribution Pλ 3 is tilting of P along T given in (15.27), which moves from P0 = P
to P1 = Q as λ ranges from 0 to 1:
dPλ = (dP)1−λ (dQ)λ exp{−ψP (λ)}.
(b) Yet another characterization of the boundary is
E∗1 (E0 ) = min D(Q′ kQ) , 0 ≤ E0 ≤ D(QkP) (16.7)
Q′ :D(Q′ ∥P)≤E0
Remark 16.3 The interesting consequence of this point of view is that it also suggests how
typical error event looks like. Namely, consider an optimal hypothesis test achieving the pair of
exponents (E0 , E1 ). Then conditioned on the error event (under either P or Q) we have that the
empirical distribution of the sample will be close to Pλ . For example, if P = Bin(m, p) and Q =
Bin(m, q), then the typical error event will correspond to a sample whose empirical distribution
P̂n is approximately Bin(m, r) for some r = r(p, q, λ) ∈ (p, q), and not any other distribution on
{0, . . . , m}.
Proof. The first part is verified trivially. Indeed, if we fix λ and let θ(λ) ≜ EPλ [T], then
from (15.8) we have
D(Pλ kP) = ψP∗ (θ) ,
whereas
dPλ dPλ dP
D(Pλ kQ) = EPλ log = EPλ log = D(Pλ kP) − EPλ [T] = ψP∗ (θ) − θ .
dQ dP dQ
Also from (15.7) we know that as λ ranges in [0, 1] the mean θ = EPλ [T] ranges from −D(PkQ)
to D(QkP).
To prove the second claim (16.7), the key observation is the following: Since Q is itself a tilting
of P along T (with λ = 1), the following two families of distributions
dPλ = exp{λT − ψP (λ)} · dP
3
This is called a geometric mixture of P and Q.
i i
i i
i i
318
Therefore,
dQ∗ dQ
E [T] = E
Q∗ Q∗ log = D(Q∗ kP) − D(Q∗ kQ) ∈ [−D(PkQ), D(QkP)] . (16.8)
dP dQ∗
Next, we have from Corollary 15.12 that there exists a unique Pλ with the following three
properties:4
Remark 16.4 A geometric interpretation of (16.7) is given in Figure 16.2: As λ increases from
0 to 1, or equivalently, θ increases from −D(PkQ) to D(QkP), the optimal distribution Pλ traverses
down the dotted path from P and Q. Note that there are many ways to interpolate between P and
Q, e.g., by taking their (arithmetic) mixture (1 − λ)P + λQ. In contrast, Pλ is a geometric mixture
of P and Q, and this special path is in essence a geodesic connecting P to Q and the exponents
E0 and E1 measures its respective distances to P and Q. Unlike Riemannian geometry, though,
here the sum of distances to the two endpoints from an intermediate Pλ actually varies along the
geodesic.
4
A subtlety: In Corollary 15.12 we ask EQ∗ [T] ∈ (A, B). But A, B – the essential range of T – depend on the distribution
under which the essential range is computed, cf. (15.23). Fortunately, we have Q P and P Q, so the essential range
is the same under both P and Q. And furthermore (16.8) implies that EQ∗ [T] ∈ (A, B).
i i
i i
i i
E1
P
D(PkQ) Pλ
space of distributions
D(Pλ kQ)
E0
0 D(Pλ kP) D(QkP)
Figure 16.2 Geometric interpretation of (16.7). Here the shaded circle represents {Q′ : D(Q′ kP) ≤ E0 }, the
KL divergence “ball” of radius E0 centered at P. The optimal E∗1 (E0 ) in (16.7) is given by the divergence from
Q to the closest element of this ball, attained by some tilted distribution Pλ . The tilted family Pλ is the
geodesic traversing from P to Q as λ increases from 0 to 1.
i i
i i
i i
320
So far we have always been working with a fixed number of observations n. However, different
realizations of Xn are informative to different levels, i.e. under some realizations we are very certain
about declaring the true hypothesis, whereas some other realizations leave us more doubtful. In
the fixed n setting, the tester is forced to take a guess in the latter case. In the sequential setting,
pioneered by Wald [448], the tester is allowed to request more observations. We show in this
section that the optimal test in this setting is something known as sequential probability ratio test
(SPRT) [450]. It will also be shown that the resulting tradeoff between the exponents E0 and E1 is
much improved in the sequential setting.
We start with the concept of a sequential test. Informally, at each time t, upon receiving the
observation Xt , a sequential test either declares H0 , declares H1 , or requests one more observation.
The rigorous definition is as follows: a sequential hypothesis test consists of (a) a stopping time
τ with respect to the filtration {Fk , k ∈ Z+ }, where Fk ≜ σ{X1 , . . . , Xn } is generated by the first
n observations; and (b) a random variable (decision) Z ∈ {0, 1} measurable with respect to Fτ .
Each sequential test is associated with the following performance metrics:
α = P[Z = 0], β = Q [ Z = 0] (16.9)
l0 = EP [τ ], l1 = EQ [τ ] (16.10)
The easiest way to see why sequential tests may be dramatically superior to fixed-sample-size
tests is the following example: Consider P = 12 δ0 + 12 δ1 and Q = 12 δ0 + 21 δ−1 . Since P 6⊥ Q, we also
have Pn 6⊥ Qn . Consequently, no finite-sample-size test can achieve zero error under both hypothe-
ses. However, an obvious sequential test (waiting for the first appearance of ±1) achieves zero error
probability with finite number of (two) observations in expectation under both hypotheses. This
advantage is also clear in terms of the achievable error exponents shown in Figure 16.3.
The following result is essentially due to [450], though there it was shown only for the special
case of E0 = D(QkP) and E1 = D(PkQ). The version below is from [339].
5
This assumption is satisfied for example for a pair of fully supported discrete distributions on finite alphabets.
i i
i i
i i
E1
Sequential test
D(PkQ)
E0
0 D(QkP)
Figure 16.3 Tradeoff between Type-I and Type-II error exponents. The bottom curve corresponds to optimal
tests with fixed sample size (Theorem 16.1) and the upper curve to optimal sequential tests (Theorem 16.4).
0, if Sτ ≥ B
Z=
1, if Sτ < −A
where
X
n
P(Xk )
Sn = log
Q( X k )
k=1
Remark 16.5 (Interpretation of SPRT) Under the usual setup of hypothesis testing, we
collect a sample of n iid observations, evaluate the LLR Sn , and compare it to the threshold to give
the optimal test. Under the sequential setup, {Sn : n ≥ 1} is a random walk, which has positive
(resp. negative) drift D(PkQ) (resp. −D(QkP)) under the null (resp. alternative)! SPRT simply
declares P if the random walk crosses the upper boundary B, or Q if the random walk crosses the
upper boundary −A. See Figure 16.4 for an illustration.
i i
i i
i i
322
Sn
0 n
τ
−A
Figure 16.4 Illustration of the SPRT(A, B) test. Here, at the stopping time τ , the LLR process Sn reaches B
before reaching −A and the decision is Z = 1.
EQ [Sτ ] = − EQ [τ ]D(QkP) .
Mn = Sn − nD(PkQ)
M̃n ≜ Mmin(τ,n)
E[M̃n ] = E[M̃0 ] = 0 ,
or, equivalently,
This holds for every n ≥ 0. From the boundedness assumption we have |Sn | ≤ nc0 and thus
|Smin(n,τ ) | ≤ τ c0 , implying that collection {Smin(n,τ ) : n ≥ 0} is uniformly integrable. Thus, we
can take n → ∞ in (16.12) and interchange expectation and limit safely to conclude (16.11).
i i
i i
i i
By monotone convergence theorem applied to the both sides of (16.13) it is then sufficient to
verify that for every n
Next, we denote τ0 = inf{n : Sn ≥ B} and observe that τ ≤ τ0 , whereas the expectation of τ0 can
be bounded using (16.11) as:
E P [ Sτ 0 ] B + c0
EP [τ ] ≤ EP [τ0 ] = ≤ ,
D(PkQ) D(PkQ)
where in the last step we used the boundedness assumption to infer Sτ0 ≤ B + c0 . Overall,
B + c0
l0 = EP [τ ] ≤ EP [τ0 ] ≤ .
D(PkQ)
i i
i i
i i
324
Notice that for l0 E0 and l1 E1 large, we have d(P(Z = 1)kQ(Z = 1)) = l1 E1 (1 + o(1)), therefore
l1 E1 ≤ (1 + o(1))l0 D(PkQ). Similarly we can show that l0 E0 ≤ (1 + o(1))l1 D(QkP). Thus taking
ℓ0 , ℓ1 → ∞ we conclude
E0 E1 ≤ D(PkQ)D(QkP) .
where P and Q are two families of distributions. In this case for a given test Z = Z(X1 , . . . , Xn ) ∈
{0, 1} we define the two types of error as before, but taking worst-case choices over the
distribution:
Unlike testing simple hypotheses for which Neyman-Pearson’s test is optimal (Theorem 14.11), in
general there is no explicit description for the optimal test of composite hypotheses (cf. (32.28)).
The popular choice is a generalized likelihood-ratio test (GLRT) that proposes to threshold the
GLR
supP∈P P⊗n (Xn )
T(Xn ) = .
supQ∈Q Q⊗n (Xn )
For examples and counterexamples of the optimality of GLRT in terms of error exponents, see,
e.g. [469].
Sometimes the families P and Q are small balls (in some metric) surrounding the center dis-
tributions P and Q, respectively. In this case, testing P against Q is known as robust hypothesis
testing (since the test is robust to small deviations of the data distribution). There is a notable finite-
sample optimality result in this case due to Huber [221] – see Exercise III.31. Asymptotically, it
turns out that if P and Q are separated in the Hellinger distance, then the probability of error can
be made exponentially small: see Theorem 32.8.
Sometimes in the setting of composite testing the distance between P and Q is zero. This is
the case, for example, for the most famous setting of a Student t-test: P = {N (0, σ 2 ) : σ 2 > 0},
Q = {N ( μ, σ 2 ) : μ 6= 0, σ 2 > 0}. It is clear that in this case there is no way to construct a test with
α + β < 1, since the data distribution under H1 can be arbitrarily close to P0 . Here, thus, instead
of minimizing worst-case β , one tries to find a test statistic T(X1 , . . . , Xn ) which is a) pivotal in the
sense that its distribution under the H0 is (asymptotically) independent of the choice P0 ∈ P ; and
i i
i i
i i
b) consistent, in the sense that T → ∞ as n → ∞ under any fixed Q ∈ Q. Optimality questions are
studied by minimizing β as a function of Q ∈ Q (known as the power curve). The uniform most
powerful tests are the gold standard in this area [277, Chapter 3], although besides a few classical
settings (such as the one above) their existence is unknown.
In other settings, known as the goodness-of-fit testing [277, Chapter 14], instead of relatively
low-complexity parametric families P and Q one is interested in a giant set Q of alternatives. For
i.i.d. i.i.d.
example, the simplest setting is to distinguish H0 : Xi ∼ P0 vs H1 : Xi ∼ Q, TV(P0 , Q) > δ . If
δ = 0, then in this case again the worst case α + β = 1 for any test and one may only ask for a
statistic T(Xn ) with a known distribution under H0 and T → ∞ for any Q in the alternative. For
δ > 0 the problem is known as nonparametric detection [225, 226] and related to that of property
testing [192].
Definition 16.5 (FI -curve6 ) Given pair of random variables (X, Y) we define their FI curve
as
FI (t; PX,Y ) ≜ sup{I(U; Y) : I(U; X) ≤ t, U → X → Y} ,
6
This concept was introduced in [453], see also [136] and [345, Section 2.2] for the “PX -independent” version.
i i
i i
i i
326
Xn W ∈ {0, 1}nR
Compressor f
Z ∈ {0, 1}
Tester
Yn
Figure 16.5 Illustration to the problem of hypothesis testing with communication constraints.
I(Y; U)
ηKL
I(X; Y)
I(X; U)
0 H(X)
Figure 16.6 A typical FI -curve whose slope at zero is the SDPI constant ηKL .
Example 16.1 A typical FI -curve is shown in Figure 16.6. In general, computing FI -curves is
hard. An exception is the case of X ∼ Ber(1/2) and Y = BSCδ (X). In this case, applying MGL in
Exercise I.64 we get, in the notation of that exercise, that
achieved by taking U ∼ Ber(1/2) and X = BSCp (U) with p chosen such that h(p) = log 2 − t.
From the DPI (3.12) we know that FI (t) ≤ t and the FI -curve strengthens the DPI to I(U; X) ≤
FI (I(U; Y)) whenever U → X → Y. (Note that the roles of X and Y are not symmetric.) In general,
it is not clear how to compute this function; nevertheless, in Exercise III.32 we show that if X
takes values over a finite alphabet then it is sufficient to consider |U| ≤ |X | + 1, and hence FI is
a value of a finite-dimensional convex program. Other properties of the FI -curve and applications
are found in Exercise III.32 and III.33.
The main result of this section is the following.
i i
i i
i i
Theorem 16.6 (Ahslwede-Csiszár [8]) Suppose X, Y are ranging over the finite alphabets
and QX,Y = PX PY (independence testing problem). Then Vϵ (R) = FI (R) for all ϵ ∈ (0, 1).
The setting describes the problem of detecting correlation between two sequences. When R = 0
the testing problem is impossible since the marginal distribution of Y is the same under both
hypotheses. If only a very small communication rate is available then the sample size required
will be very large (Stein exponent small).
Proof. Let us start with an upper bound. Fix a compressor W and notice that for any ϵ-achievable
exponent E by Theorem 14.8 we have
But under conditions of the theorem we have QW,Yn = PW PYn and thus we obtain as in (14.18):
Now, from looking at Figure 16.5 we see that W → Xn → Yn and the from Exercise III.32
(tensorization) we know that
1
I(W; Yn ) ≤ nFI ( I(W; Xn )) ≤ nFI (R) .
n
Thus, we have shown that for all sufficiently large n
FI (R) log 2
E≤ + .
1−ϵ n
This demonstrates that lim supϵ→0 Vϵ ≤ FI (R). For a stronger claim of Vϵ ≤ FI (R), i.e. the strong
converse, see [8].
Now, for the constructive part, consider any n1 and any compressed random variable W1 =
f1 (Xn1 ) with W1 ∈ {0, 1}n1 R . Given blocklength n we can repeatedly send W1 by compress-
ing each n1 chunk independently (for a total of n/n1 “frames”). Then the decompressor will
observe n/n1 iid copies of W1 and also of Yn1 vector-observations. Note that D(PW1 ,Yn1 kQW1 ,Yn1 ) =
I(W1 ; Yn1 ) as above. Thus, by Theorem 14.14 we should be able to achieve α ≥ 1 − ϵ and
β ≤ exp{−n/n1 I(W1 ; Yn1 )}.
Therefore, we obtained the following lower bound (after optimizing over the choice of W1 and
blocklength n1 that we replace by more convenient n again):
1
Vϵ ≥ F̃I (R) ≜ sup { I(W1 ; Yn ) : W1 → Xn → Yn , W1 ∈ {0, 1}nR } . (16.16)
n,W1 n
This looks very similar to the definition of (tensorized) FI -curve except that the constraint is on
the cardinality of W1 instead of the I(W1 ; Xn ). It turns out that the two quantities coincide, i.e.
F̃I (R) = FI (R). We only need to show F̃I (R) ≥ FI (R) for that.
To that end, consider any U → X → Y and R > I(U; X). We apply covering lemma (Corol-
lary 25.6) and Markov lemma (Proposition 25.7), where we set An = Xn , Bn = Un , Xn = Yn .
i i
i i
i i
328
Overall, we get that as n → ∞ there exist encoder W1 = f1 (Xn ), W1 ∈ {0, 1}nR such that
W1 → Xn → Yn and
I(W1 ; Yn ) ≥ nI(U; Y) + o(n) .
By optimizing the choice of U this proves F̃I (R+) ≥ FI (R). Since (as we shown above) F̃I (R+) ≤
FI (R+) and FI (R+) = FI (R) (Exercise III.32), we conclude that F̃I (R) = FI (R).
Theorem shown above has interesting implications for certain task in modern machine learn-
ing. Consider the situation where the sample size n is gigantic but the communication budget (or
memory bandwidth) is constrained so that we can at most deliver k bits from X terminal to the
tester. Then the rate R = nk 1 and the error probability β of an optimal test is roughly given as
′
β ≈ 2−nFI (k/n) ≈ 2−kFI (0) ,
where we used the fact that FI (k/n) ≈ nk F′I (0). We see that the error probability is decaying expo-
nentially with the number of communicated bits not the sample size. In many ways, Theorem 16.6
foreshadowed various results in the last decade on distributed inference. We will get back to this
topic in Chapter 33 dedicated to strong data processing inequalities (SDPI). There is a simple
relation that connects the classical results (this Section) with the modern approach via SDPIs (in
Chapter 33): the slope F′I (0) is precisely the SDPI constant:
F′I (0) = ηKL (PX , PY|X ) ,
see (33.14) for the definition of ηKL . In essence, SDPIs are just linearized versions of FI -curves as
illustrated in Figure 16.6.
i i
i i
i i
III.1 Let P0 and P1 be distributions on X . Recall that the region of achievable pairs (P0 [Z =
0], P1 [Z = 0]) via randomized tests PZ|X : X → {0, 1} is denoted
[
R(P0 , P1 ) ≜ (P0 [Z = 0], P1 [Z = 0]) ⊆ [0, 1]2 .
PZ|X
PY|X
(a) Let PY|X : X → Y be a Markov kernel, which maps Pj to Qj according to Pj −−→ Qj , j =
0, 1. Compare the regions R(P0 , P1 ) and R(Q0 , Q1 ). What does this say about βα (P0 , P1 )
vs βα (Q0 , Q1 )?
(b*) Prove that R(P0 , P1 ) ⊃ R(Q0 , Q1 ) implies existence of some PY|X mapping P0 to Q1 and
P1 to Q1 . In other words, inclusion of R is equivalent to degradation or Blackwell order
(see Definition 33.15).
Comment: This is the most general form of data processing inequality, of which all the other
ones (divergence, mutual information, f-divergence, total-variation, Rényi-divergence, etc) are
corollaries.
III.2 Consider the following binary hypothesis testing (BHT) problem. Under both hypotheses X and
Y are uniform on {0, 1}. However, under H0 , X and Y are independent, while under H1 :
P1 [X 6= Y] = δ < 1/2 .
P [ H1 ] = 1 − P [ H0 ] = π 1 .
where the min is over the tests and the max is between the two numbers in the braces.
Identify the corresponding point on R(P0 , P1 ).
III.3 Consider distributions P and Q on [0, 3] with densities given in Figure 16.7.
i i
i i
i i
p q
1 1
3 3
3 3
i i
i i
i i
(c) Find the optimal test for general prior π (not necessarily equiprobable).
(d) Show that it is sufficient to focus on deteministic test in order to minimize the Bayesian
error probability.
III.8 The function α 7→ βα (P, Q) is monotone and thus by Lebesgue’s theorem possesses a derivative
d
βα′ ≜ βα (P, Q) .
dα
almost everywhere on [0, 1]. Prove
Z 1
D(PkQ) = − log βα′ dα . (III.1)
0
βα (P, Q) ≜ min Q[ Z = 0] = α 2 .
PZ|X :P[Z=0]≥α
which is equivalent to Stein’s lemma (Theorem 14.14). Show furthermore that assuming
V(PkQ) < ∞ we have
p √
log β1−ϵ (Pn , Qn ) = −nD(PkQ) + nV(PkQ)Q−1 (ϵ) + o( n) , (III.2)
R ∞
where Q−1 (·) is the functional inverse of Q(x) = x √12π e−t /2 dt and
2
dP
V(PkQ) ≜ VarP log .
dQ
III.11 (Likelihood-ratio trick) Given two distributions P and Q on X let us generate iid samples (Xi , Yi )
as follows: first Yi ∼ Ber(1/2) and then if Yi = 0 we sample Xi ∼ Q and otherwise Xi ∼ P. We
next train a classifier to minimize the cross-entropy loss:
1X
n
∗ 1 1
p̂ = argmin Yi log + (1 − Yi ) log .
p̂:X →[0,1] n p̂ ( Xi ) 1 − p̂(Xi )
i=1
1−p̂∗ (x)
Show that → dQ
p̂∗ (x)
dP
(x) as n → ∞. This trick is used in machine learning to approximate
dP
dQ for complicated high-dimensional distributions.
III.12 Prove
Yn Xn
min D(QYn k PYj ) = min D(QYj kPYj )
QYn ∈F
j=1 j=1
i i
i i
i i
for some F ′ .
Conclude that in the case when PYj = P and
X
n
F = QYn : EQ f(Yj ) ≥ nγ
j=1
i i
i i
i i
√
(a) Using the Chernoff bound, show that for all ϵ > d,
2 d/2
eϵ
e−ϵ /2 .
2
P [kZk2 ≤ ϵ] ≤
d
(b) Prove the lower bound
d/2
ϵ2
e−ϵ /2 .
2
P [kZk2 ≤ ϵ] ≥
2πd
(c) Extend the results to Z ∼ N (0, Σ).
See Exercise V.30 for an example in infinite dimensions.
III.17 Consider the hypothesis testing problem:
i.i.d.
H0 : X1 , . . . , Xn ∼ P = Ber(1/3) ,
i.i.d.
H1 : X1 , . . . , Xn ∼ Q = Ber(2/3) .
Questions:
(a) Compute the Stein exponent
(b) For a = 3 draw the tradeoff region E of achievable error-exponent pairs (E0 , E1 ).
(c) Identify the divergence-minimizing geodesic Pλ running from P0 to P1 , λ ∈ [0, 1].
Hint: To simplify calculations try differentiating in u the following identity
Z ∞
xu e−x dx = Γ(u + 1) .
0
i i
i i
i i
converges to zero exponentially fast as n → ∞. If it does then find the exponent. Repeat with
γ = 0.5.
III.21 Let Xj be i.i.d. exponential with unit mean. Since the log MGF ψX (λ) ≜ log E[exp{λX}] does
not exist for λ > 1, the large deviations result in Theorem 15.1
X
n
P[ Xj ≥ nγ] = exp{−nψX∗ (γ) + o(n)} (III.6)
j=1
does not apply. Show (III.6) directly via the following steps:
(a) Apply Chernoff argument directly to prove an upper bound.
(b) Fix an arbitrary c > 0 and prove
X
n X
n
P[ Xj ≥ nγ] ≥ P[ (Xj ∧ c) ≥ nγ] . (III.7)
j=1 j=1
(c) Apply the results shown in Chapter 15 to investigate the asymptotics of the right-hand side
of (III.7).
(d) Conclude the proof of (III.6) by taking c → ∞.
III.22 (Hoeffding’s lemma) In this exercise we prove Hoeffding’s lemma (stated after Definition 4.15)
and derive Hoeffding’s concentration inequality. Let X ∈ [−1, 1] with E[X] = 0.
(a) Show that the log MGF ψX (λ) satisfies ψX (0) = ψX′ (0) = 0 and 0 ≤ ψX′′ (λ) ≤ 1. (Hint:
Apply Theorem 15.8(c) and the fact that the variance of any distribution supported on
[−1, 1] is at most 1.)
(b) By Taylor expansion, show that ψX (λ) ≤ λ2 /2.
(c) Applying Theorem 15.1, prove Hoeffding’s inequality: Let Xi ’s be iid copies of X. For any
Pn
γ > 0, P i=1 Xi ≥ nγ ≤ exp(−nγ /2).
2
III.23 (Sanov’s theorem for discrete X ) Let X be a finite set. Let E be a set of probability distributions
on X with non-empty interior. Let Xn = (X1 , . . . , Xn ) be iid drawn from some distribution P
Pn
fully supported on X and let π n denote the empirical distribution, i.e., π n = 1n i=1 δXi . Our
goal is to show that
1 1
E ≜ lim log = inf D(QkP). (III.8)
n→∞ n P(π n ∈ E) Q∈E
i i
i i
i i
(a) We first assume that E is convex. Define the following set of joint distributions En ≜ {QXn :
QXi ∈ E, i = 1, . . . , n}. Show that
inf D(QXn kPXn ) = n inf D(QkP),
QXn ∈En Q∈E
n
where PXn = P .
(b) Consider the conditional distribution P̃Xn = PXn |π n ∈E . Show that P̃Xn ∈ En .
(c) Prove the following nonasymptotic upper bound: for any convex E ,
P(π n ∈ E) ≤ exp − n inf D(QkP) , ∀n.
Q∈E
(Hint: For each ϵ > 0, cover E by N TV balls of radius ϵ where N = N(ϵ) is finite;
cf. Theorem 27.3. Applying the previous part and the union bound.)
(e) For any Q in the interior of E , show that
P(π n ∈ E) ≥ exp(−nD(QkP) + o(n)), n → ∞.
(Hint: Use data processing as in the proof of the large deviations Theorem 15.9.)
(f) Conclude (III.8) by applying the continuity of divergence on finite alphabet (Proposi-
tion 4.8).
III.24 (Error exponents of data compression) Let Xn be iid according to P on a finite alphabet X .
Let ϵ∗ (Xn , nR) denote the minimal probability of error achieved by fixed-length compressors
and decompressors for Xn of compression rate R (cf. Definition 11.1). We know that if R <
H(P) then ϵ∗ (Xn , nR) → 0. Here we show it converges to zero exponentially fast and find the
expopnent.
(a) For any sequence xn , denote by P̂xn its empirical distribution and by H(P̂xn ) its empirical
entropy, i.e., the entropy of the empirical distribution. For example, for the binary sequence
xn = (010110), the empirical distribution is Ber(1/2) and the empirical entropy is 1 bit.
For each R > 0, define the set T = {xn : H(P̂xn ) <R}. Show that |T| ≤ exp( nR)(n + 1)|X | .
∗
(b) Show that for any R > H(P), ϵ (X , nR) ≤ exp − n infQ:H(Q)>R D(QkP) . Specify the
n
i i
i i
i i
(a) D(Pλ kP) = −ψP (λ) + λψP′ (λ), where ψP (λ) = log EP [exp(λ log dQ dP )].
(b) State the appropriate conditions to conclude D(Pλ kP) = 12 λ2 ψP′′ (0)+ o(λ2 ), where ψP′′ (0) =
dP ], which is clearly different from χ (QkP) = VarP [ dP ].
VarP [log dQ 2 dQ
III.26 Denote by N ( μ, σ ) the one-dimensional Gaussian distribution with mean μ and variance σ 2 .
2
distribution. Express P [X1 + · · · + Xn ≥ na] in terms of the Φ̄ function. Using the fact that
Φ̄(x) = e−x /2+o(x ) as x → ∞ (cf. Exercise V.25), reprove (III.9).
2 2
(d) (reverse I-projection) Let Y be a continuous random variable with zero mean and unit
variance. Show that
min D(PY kN ( μ, σ 2 )) = D(PY kN(0, 1)).
μ,σ
III.27 (Why temperatures equalize?) Let X be finite alphabet and f : X → R an arbitrary function.
Let Emin = min f(x).
(a) Using I-projection show that for any E ≥ Emin the solution of
H∗ (E) = max{H(X) : E[f(X)] ≤ E}
1
is given by a Gibbs distribution (cf. (5.21)) PX (x) = Z(β) e−β f(x) for some β = β(E).
Comment: In statistical physics x is state of the system (e.g. locations and velocities of all
molecules), f(x) is energy of the system in state x, PX is the Gibbs distribution and β = T1 is
the inverse temperature of the system. In thermodynamic equilibrium, PX (x) gives fraction
of time system spends in state x.
∗
(b) Show that dHdE(E) = β(E).
(c) Next consider two functions f0 , f1 (i.e. two types of molecules with different state-energy
relations). Show that for E ≥ minx0 f0 (x0 ) + minx1 f1 (x1 ) we have
max H(X0 , X1 ) = max H∗0 (E0 ) + H∗1 (E1 ) (III.10)
E[f0 (X0 )+f1 (X1 )]≤E E0 +E1 ≤E
i i
i i
i i
Remark: (III.12) also just follows from part (a) by taking f(x0 , x1 ) = f0 (x0 ) + f1 (x1 ). The
point here is relation (III.11): when two thermodynamical systems are brought in contact with
each other, the energy distributes among them in such a way that β parameters (temperatures)
equalize.
III.28 (Importance Sampling [90]) Let μ and ν be two probability measures on set X . Assume that
ν μ. Let L = D(νk μ) and ρ = ddμν be the Radon-Nikodym derivative. Let f : X → R be a
measurable function. We would like to estimate Eν f using data from μ.
i.i.d. P
Let X1 , . . . , Xn ∼ μ and In (f) = n1 1≤i≤n f(Xi )ρ(Xi ). Prove the following.
(a) For n ≥ exp(L + t) with t ≥ 0, we have
q
E |In (f) − Eν f| ≤ kfkL2 (ν) exp(−t/4) + 2 Pμ (log ρ > L + t/2) .
Hint: Let h = f1{ρ ≤ exp(L + t/2)}. Use triangle inequality and bound E |In (h) − Eν h|,
E |In (h) − In (f)|, | Eν f − Eν h| separately.
(b) On the other hand, for n ≤ exp(L − t) with t ≥ 0, we have
Pμ (log ρ ≤ L − t/2)
P(In (1) ≥ 1 − δ)| ≤ exp(−t/2) + ,
1−δ
for all δ ∈ (0, 1), where 1 is the constant-1 function.
Hint: Divide into two cases depending on whether max1≤i≤n ρ(Xi ) ≤ exp(L − t/2).
This shows that a sample of size exp(D(νk μ) + Θ(1)) is both necessary and sufficient for
accurate estimation by importance sampling.
III.29 (M-ary hypothesis testing)7 The following result [274] generalizes Corollary 16.2 on the best
average probability of error for testing two hypotheses to multiple hypotheses.
Fix a collection of distributions {P1 , . . . , PM }. Conditioned on θ, which takes value i with prob-
i.i.d.
ability π i > 0 for i = 1, . . . , M, let X1 , . . . , Xn ∼ Pθ . Denote the optimal average probability of
∗
error by pn = inf P[θ̂ 6= θ], where the infimum is taken over all decision rules θ̂ = θ̂(X1 , . . . , Xn ).
(a) Show that
1 1
lim log ∗ = min C(Pi , Pj ), (III.13)
n→∞ n pn 1≤i<j≤M
7
Not to be confused with multiple testing in the statistics literature, which refers to testing multiple pairs of binary
hypotheses simultaneously.
i i
i i
i i
i i
i i
i i
Part IV
Channel coding
i i
i i
i i
i i
i i
i i
341
In this Part we study a new type of problem. The goal of channel coding is to communicate
digital information across a noisy channel. Historically, this was the first area of information the-
ory that lead to immediately and widely deployed applications. Shannon’s discovery [378] of the
possibility of transmitting information with vanishing error and positive (i.e. bigger than zero) rate
of bits per second was also theoretically quite surprising and unexpected. Our goal in this Part is
to understand these arguments.
To explain the relation of this Part to others, let us revisit what problems we have studied so
far. In Part I we introduced various information measures and studied their properties irrespec-
tive of engineering applications. Then, in Part II our objective was data compression. The main
object there was a single distribution PX and the fundamental limit E[ℓ(f∗ (X))] is the minimal
compression length. The main result (the “coding theorem”) established connection between the
fundamental limit and an information quantity, that we can summarize as
E[ℓ(f∗ (X))] ≈ H(X)
Next, in Part III we studied binary hypothesis testing. There the main object was a pair of distri-
butions (P, Q), the fundamental limit was the Neyman-Pearson curve β1−ϵ (Pn , Qn ) and the main
result
β1−ϵ (Pn , Qn ) ≈ exp{−nD(PkQ)} ,
again connecting an operational quantity to an information measure.
In channel coding – the topic of this Part – the main object is going to be a channel PY|X .
The fundamental limit is M∗ (ϵ), the maximum number of messages that can be transmitted with
probability of error at most ϵ, which we rigorously define in this chapter. Our main result in this
part is to show the celebrated Shannon’s noisy channel coding theorem:
log M∗ (ϵ) ≈ max I(X; Y) .
PX
We will demonstrate the possibility of sending information with high reliability and also will
rigorously derive the asymptotically (and non-asymptotically!) highest achievable rates. However,
we entirely omit a giant and beautiful field of coding theory that deals with the question of how
to construct transmitters and receivers with low computational complexity. This area of science,
though deeply related to the content of our book, deserves a separate dedicated treatment. We
recommend reading [360] for the sparse-graph based codes and [372] for introduction to more
modern polar codes.
The practical implications of this chapter are profound even without giving explicit construc-
tions of codes. First, in the process of finding channel capacity one needs to maximize mutual
information and the maximizing distributions reveal properties of optimal codes (e.g. water-filling
solution dictates how to optimally allocate power between frequency bands, Figure 3.2 suggests
when to use binary modulation, etc). Second, the non-asymptotic (finite blocklength) bounds that
we develop in this Part are routinely used for benchmarking performance of all newly developed
codes. Other implications tell how to exploit memory in the channel, or leverage multiple antennas
(Exercise I.10), and many more. In all, the contents of this Part have had by far the most real-world
impact of all (at least as of the writing of this book).
i i
i i
i i
In this chapter we introduce the concept of an error correcting code (ECC). We will spend time
discussing what it means for a code to have low probability of error, what is the optimum (ML or
MAP) decoder. On the special case of coding for the BSC we showcase evolution of our under-
standing of fundamental limits from pre-Shannon’s to modern finite blocklength. We also briefly
review the history of ECCs. We conclude with a conceptually important proof of a weak converse
(impossibility) bound for the performance of ECCs.
• encoder f : [M] → X
• decoder g : Y → [M] ∪ {e}
In most cases f and g are deterministic functions, in which case we think of them, equivalently,
in terms of codewords, codebooks, and decoding regions (see Figure 17.1 for an illustration)
Given an M-code we can define a probability space, underlying all the subsequent developments
in this Part. For that we chain the three objects – message W, the encoder and the decoder – together
into the following Markov chain:
f P Y| X g
W −→ X −→ Y −→ Ŵ (17.1)
1
For randomized encoder/decoders, we identify f and g as probability transition kernels PX|W and PŴ|Y .
342
i i
i i
i i
c1 b
b
b
D1 b
b b
b cM
b b
b
DM
Figure 17.1 When X = Y, the decoding regions can be pictured as a partition of the space, each containing
one codeword.
where we set W ∼ Unif([M]). In the case of discrete spaces, we can explicitly write out the joint
distribution of these variables as follows:
1
(general) PW,X,Y,Ŵ (m, a, b, m̂) = PX|W (a|m)PY|X (b|a)PŴ|Y (m̂|b)
M
1
(deterministic f, g) PW,X,Y,Ŵ (m, cm , b, m̂) = PY|X (b|cm )1{b ∈ Dm̂ }
M
Throughout these sections, these random variables will be referred to by their traditional names:
W – original (true) message, X - (induced) channel input, Y - channel output and Ŵ - decoded
message.
Although any pair (f, g) is called an M-code, in reality we are only interested in those that satisfy
certain “error correcting” properties. To assess their quality we define the following performance
metrics:
Definition 17.2 A code (f, g) is an (M, ϵ)-code for PY|X if Pe (f, g) ≤ ϵ. Similarly, an (M, ϵ)max -
code must satisfy Pe,max ≤ ϵ. The fundamental limits of channel coding are defined as
M∗ (ϵ; PY|X ) = max{M : ∃(M, ϵ)-code}
M∗max (ϵ; PY|X ) = max{M : ∃(M, ϵ)max -code}
ϵ∗ (M; PY|X ) = inf{ϵ : ∃(M, ϵ)-code}
ϵ∗max (M; PY|X ) = inf{ϵ : ∃(M, ϵ)max -code}
i i
i i
i i
344
The argument PY|X will be omitted when PY|X is clear from the context.
In other words, the quantity log2 M∗ (ϵ) gives the maximum number of bits that we can
push through a noisy transformation PY|X , while still guaranteeing the error probability in the
appropriate sense to be at most ϵ.
Yn = X n ⊕ Zn .
0 1 0 0 1 1 0 0 1 1
PY n |X n
1 1 0 1 0 1 0 0 0 1
In the next section we discuss coding for the BSC channel in more detail.
i i
i i
i i
0 0 1 0
Decoding can be done by taking a majority vote inside each ℓ-block. Thus, each data bit is decoded
with probability of bit error Pb = P[Binom(l, δ) > l/2]. However, the probability of block error of
this scheme is Pe ≤ kP[Binom(l, δ) > l/2]. (This bound is essentially tight in the current regime).
Consequently, to satisfy Pe ≤ 10−3 we must solve for k and ℓ satisfying kl ≤ n = 1000 and also
This gives l = 21, k = 47 bits. So we can see that using repetition coding we can send 47 data
bits by using 1000 channel uses.
Repetition coding is a natural idea. It also has a very natural tradeoff: if you want better reliabil-
ity, then the number ℓ needs to increase and hence the ratio nk = 1ℓ should drop. Before Shannon’s
groundbreaking work, it was almost universally accepted that this is fundamentally unavoidable:
vanishing error probability should imply vanishing communication rate nk .
Before delving into optimal codes let us offer a glimpse of more sophisticated ways of injecting
redundancy into the channel input n-sequence than simple repetition. For that, consider the so-
called first-order Reed-Muller codes (1, r). We interpret a sequence of r data bits a0 , . . . , ar−1 ∈ Fr2
as a degree-1 polynomial in (r − 1) variables:
X
r− 1
a = (a0 , . . . , ar−1 ) 7→ fa (x) ≜ a i xi + a 0 .
i=1
In order to transmit these r bits of data we simply evaluate fa (·) at all possible values of the variables
xr−1 ∈ Fr2−1 . This code, which maps r bits to 2r−1 bits, has minimum distance dmin = 2r−2 . That
is, for two distinct a 6= a′ the number of positions in which fa and fa′ disagree is at least 2r−2 . In
coding theory notation [n, k, dmin ] we say that the first-order Reed-Muller code (1, 7) is a [64, 7, 32]
code. It can be shown that the optimal decoder for this code achieves over the BSC⊗ 64
0.11 channel a
probability of error at most 6 · 10−6 . Thus, we can use 16 such blocks (each carrying 7 data bits and
occupying 64 bits on the channel) over the BSC⊗ δ
1024
, and still have (by the union bound) overall
−4 −3
probability of block error Pe ≲ 10 < 10 . Thus, with the help of Reed-Muller codes we can
send 7 · 16 = 112 bits in 1024 channel uses, more than doubling that of the repetition code.
Shannon’s noisy channel coding theorem (Theorem 19.9) – a crown jewel of information theory
– tells us that over memoryless channel PYn |Xn = (PY|X )n of blocklength n the fundamental limit
satisfies
i i
i i
i i
346
as n → ∞ and for arbitrary ϵ ∈ (0, 1). Here C = maxPX1 I(X1 ; Y1 ) is the capacity of the single-letter
channel. In our case of BSC we have
1
C = log 2 − h(δ) ≈ bit ,
2
since the optimal input distribution is uniform (from symmetry) – see Section 19.3. Shannon’s
expansion (17.2) can be used to predict (not completely rigorously, of course, because of the
o(n) residual) that it should be possible to send around 500 bits reliably. As it turns out, for the
blocklength n = 1000 this is not quite possible.
Note that computing M∗ exactly requires iterating over all possible encoders and decoder –
an impossible task even for small values of n. However, there exist rigorous and computation-
ally tractable finite blocklength bounds [334] that demonstrate for our choice of n = 1000, δ =
0.11, ϵ = 10−3 :
Thus we can see that Shannon’s prediction is about 20% too optimistic. We will see below
some of such finite-length bounds. Notice, however, that while the bounds guarantee existence
of an encoder-decoder pair achieving a prescribed performance, building an actual f and g
implementable with a modern software/hardware is a different story.
It took about 60 years after Shannon’s discovery of (17.2) to construct practically imple-
mentable codes achieving that performance. The first codes that approach the bounds on log M∗
are calledturbo codes [47] (after the turbocharger engine, where the exhaust is fed back in to
power the engine). This class of codes is known as sparse-graph codes, of which the low-density
parity check (LDPC) codes invented by Gallager are particularly well studied [360]. As a rule of
thumb, these codes typically approach 80 . . . 90% of log M∗ when n ≈ 103 . . . 104 . For shorter
blocklengths in the range of n = 100 . . . 1000 there is an exciting alternative to LDPC codes: the
polar codes of Arıkan [23], which are most typically used together with the list-decoding idea
of Tal and Vardy [409]. And of course, the story is still evolving today as new channel models
become relevant and new hardware possibilities open up.
We wanted to point out a subtle but very important conceptual paradigm shift introduced by
Shannon’s insistence on coding over many (information) bits together. Indeed, consider the sit-
uation discussed above, where we constructed a powerful code with M ≈ 2400 codewords and
n = 1000. Now, one might imagine this code as a constellation of 2400 points carefully arranged
inside a hypercube {0, 1}1000 to guarantee some degree of separation between them, cf. (17.6).
Next, suppose one was using this code every second for the lifetime of the universe (≈ 1018 sec).
Yet, even after this laborious process she will have explored at most 260 different codewords from
among an overwhelmingly large codebook 2400 . So a natural question arises: why did we need
to carefully place all these many codewords if majority of them will never be used by anyone?
The answer is at the heart of the concept of information: to transmit information is to convey a
selection of one element (W) from a collection of possibilities ([M]). The fact that we do not know
which W will be selected forces us to a priori prepare for every one of the possibilities. This simple
idea, proposed in the first paragraph of [378], is now tacitly assumed by everyone, but was one of
i i
i i
i i
the subtle ways in which Shannon revolutionized scientific approach to the study of information
exchange.
Notice that the optimal decoder is deterministic. For the special case of deterministic encoder,
where we can identify the encoder with its image C the minimal (MAP) probability of error for
the codebook C can be written as
1 X
Pe,MAP (C) = 1 − max PY|X (y|x) , (17.5)
M x∈C
y∈Y
Consequently, the optimal decoding regions – see Figure 17.1 – become the Voronoi cells tesse-
lating the Hamming space {0, 1}n . Similarly, the MAP decoder for the AWGN channel induces a
Voronoi tesselation of Rn – see Section 20.3.
So we have seen that the optimal decoder is without loss of generality can be assumed to be
deterministic. Similarly, we can represent any randomized encoder f as a function of two argu-
ments: the true message W and an external randomness U ⊥ ⊥ W, so that X = f(W, U) where this
time f is a deterministic function. Then we have
which implies that if P[W 6= Ŵ] ≤ ϵ then there must exist some choice u0 such that P[W 6= Ŵ|U =
u0 ] ≤ ϵ. In other words, the fundamental limit M∗ (ϵ) is unchanged if we restrict our attention to
deterministic encoders and decoders only.
Note, however, that neither of the above considerations apply to the maximal probability of
error Pe,max . Indeed, the fundamental limit M∗max (ϵ) does indeed require considering randomized
encoders and decoders. For example, when M = 2 from the decoding point of view we are back to
i i
i i
i i
348
the setting of binary hypotheses testing in Part III. The optimal decoder (test) that minimizes the
maximal Type-I and II error probability, i.e., max{1 − α, β}, will not be deterministic if max{1 −
α, β} is not achieved at a vertex of the Neyman-Pearson region R(PY|W=1 , PY|W=2 ).
Theorem 17.3 (Weak converse) Any (M, ϵ)-code for PY|X satisfies
supPX I(X; Y) + h(ϵ)
log M ≤ ,
1−ϵ
where h(x) = H(Ber(x)) is the binary entropy function.
Proof. This can be derived as a one-line application of Fano’s inequality (Theorem 3.12), but we
proceed slightly differently with an eye towards future extensions in meta-converse (Section 22.3).
Consider an M-code with probability of error Pe and its corresponding probability space: W →
X → Y → Ŵ. We want to show that this code can be used as a hypothesis test between distributions
PX,Y and PX PY . Indeed, given a pair (X, Y) we can sample (W, Ŵ) from PW,Ŵ|X,Y = PW|X PŴ|Y and
compute the binary value Z = 1{W 6= Ŵ}. (Note that in the most interesting cases when encoder
and decoder are deterministic and the encoder is injective, the value Z is a deterministic function
of (X, Y).) Let us compute performance of this binary hypothesis test under two hypotheses. First,
when (X, Y) ∼ PX PY we have that Ŵ ⊥ ⊥ W ∼ Unif([M]) and therefore:
1
PX PY [Z = 1] = .
M
Second, when (X, Y) ∼ PX,Y then by definition we have
PX,Y [Z = 1] = 1 − Pe .
Thus, we can now apply the data-processing inequality for divergence to conclude: Since W →
X → Y → Ŵ, we have the following chain of inequalities (cf. Fano’s inequality Theorem 3.12):
DPI 1
D(PX,Y kPX PY ) ≥ d(1 − Pe k )
M
≥ −h(P[W 6= Ŵ]) + (1 − Pe ) log M
i i
i i
i i
Remark 17.2 The bound can be significantly improved by considering other divergence mea-
sures in the data-processing step. In particular, we will see below how one can get “strong”
converse (explaining the term “weak” converse here as well) in Section 22.1. The proof technique
is known as meta-converse; see Section 22.3.
i i
i i
i i
So far our discussion of channel coding was mostly following the same lines as the M-ary hypothe-
sis testing (HT) in statistics. In this chapter we introduce the key departure: the principal and most
interesting goal in information theory is the design of the encoder f : [M] → X or the codebook
{ci ≜ f(i), i ∈ [M]}. Once the codebook is chosen, the problem indeed becomes that of M-ary HT
and can be tackled by standard statistical tools. However, the task of choosing the encoder f has no
exact analogs in statistical theory (the closest being design of experiments). Each f gives rise to a
different HT problem and the goal is to choose these M hypotheses PY|X=c1 , . . . , PY|X=cM to ensure
maximal testability. It turns out that the problem of choosing a good f will be much simplified
if we adopt a suboptimal way of testing M-ary HT. Roughly speaking we will run M binary HTs
testing PY|X=cm against PY , which tries to distinguish the channel output induced by the message m
from an “average background noise” PY . An optimal such test, as we know from Neyman-Pearson
(Theorem 14.11), thresholds the following quantity
PY|X=x
log .
PY
This explains the central role played by the information density (defined next) in this chapter.
After introducing the latter we will present several results demonstrating existence of good codes.
We start with the original bound of Shannon (expressed in modern language), followed by its
tightenings (DT, RCU and Gallager’s bounds). These belong to the class of random coding bounds.
An entirely different approach was developed by Feinstein and is called maximal coding. We
will see that the two result in eerily similar results. Why two of these rather different methods
yield similar results, which are also quite close to the best possible (i.e. “achieve capacity and
dispersion”)? It turns out that the answer lies in a certain submodularity property of the channel
coding task. Finally, we will also discuss a more structured class of codes based on linear algebraic
constructions. Similar to the case of compression it will be shown that linear codes are no worse
than general codes, explaining why virtually all practical codes are linear.
While reading this Chapter, we recommend also consulting Figure 22.2, in which various
achievability bounds are compared for the BSC.
In this chapter it will be convenient to introduce the following independent pairs (X, Y) ⊥
⊥ (X, Y)
with their joint distribution given by:
We will often call X the sent codeword and X̄ the unsent codeword.
350
i i
i i
i i
Definition 18.1 (Information density1 ) Let PX,Y μ and PX PY μ for some dominat-
ing measure μ, and denote by f(x, y) = dPdμX,Y and f̄(x, y) = dPdμ X PY
the Radon-Nikodym derivatives
of PX,Y and PX PY with respect to μ, respectively. Then recalling the Log definition (2.10) we set
log ff̄((xx,,yy)) , f(x, y) > 0, f̄(x, y) > 0
f(x, y) +∞, f(x, y) > 0, f̄(x, y) = 0
iPX,Y (x; y) ≜ Log = (18.2)
f̄(x, y) −∞, f(x, y) = 0, f̄(x, y) > 0
0, f(x, y) = f̄(x, y) = 0 .
Proposition 18.2 The expectation E[i(X; Y)] is well-defined and non-negative (but possibly
infinite). In any case, we have I(X; Y) = E[i(X; Y)].
Proof. This is follows from (2.12) and the definition of i(x; y) as log-ratio.
Being defined as log-likelihood, information density possesses the standard properties of the
latter, cf. Theorem 14.6. However, because its defined in terms of two variables (X, Y), there are
1
We remark that in machine learning (especially natural language processing) information density is also called pointwise
mutual information (PMI) [292].
i i
i i
i i
352
also very useful conditional expectation versions. To illustrate the meaning of the next proposition,
let us consider the case of discrete X, Y and PX,Y PX PY . Then we have for every x:
X X
f(x, y)PX (x)PY (y) = f(x, y) exp{−i(x; y)}PX,Y (x, y) .
y y
E[f+ (X̄, Y)1{i(X̄; Y) > −∞}|X̄ = x] = E[f+ (X, Y) exp{−i(X; Y)}|X = x] (18.4)
Proof. The first part (18.3) is simply a restatement of (14.5). For the second part, let us define
a(x) ≜ E[f+ (X̄, Y)1{i(X̄; Y) > −∞}|X̄ = x], b(x) ≜ E[f+ (X, Y) exp{−i(X; Y)}|X = x]
We first additionally assume that f is bounded. Fix ϵ > 0 and denote Sϵ = {x : a(x) ≥ b(x) + ϵ}.
As ϵ → 0 we have Sϵ % {x : a(x) > b(x)} and thus if we show PX [Sϵ ] = 0 this will imply that
a(x) ≤ b(x) for PX -a.e. x. The symmetric argument shows b(x) ≤ a(x) and completes the proof
of the equality.
To show PX [Sϵ ] = 0 let us apply (18.3) to the function f(x, y) = f+ (x, y)1{x ∈ Sϵ }. Then we get
E[f+ (X, Y)1{X ∈ Sϵ } exp{−i(X; Y)}] = E[f+ (X̄, Y)1{i(X̄; Y) > −∞}1{X ∈ Sϵ }] .
Let us re-express both sides of this equality by taking the conditional expectations over Y to get:
E[b(X)1{X ∈ Sϵ }] = E[a(X̄)1{X̄ ∈ Sϵ }] .
Since f+ (and therefore b) was assumed to be bounded we can cancel the common term from both
sides and conclude PX [Sϵ ] = 0 as required.
Finally, to show (18.4) in full generality, given an unbounded f+ we define fn (x, y) =
min(f+ (x, y), n). Since (18.4) holds for fn we can take limit as n → ∞ on both sides of it:
lim E[fn (X̄, Y)1{i(X̄; Y) > −∞}|X̄ = x] = lim E[fn (X, Y) exp{−i(X; Y)}|X = x]
n→∞ n→∞
i i
i i
i i
By the monotone convergence theorem (for conditional expectations!) we can take the limits inside
the expectations to conclude the proof.
i i
i i
i i
354
Note that (18.8) holds regardless of the input distribution PX used for the definition of i(x; y), in
PM
particular we do not have to use the code-induced distribution PX = M1 i=1 δci . However, if we
are to threshold information density, different choices of PX will result in different decoders, so
we need to justify the choice of PX .
To that end, recall that to distinguish between two codewords ci and cj , one can apply (as we
P
learned in Part III for binary HT) the likelihood ratio test, namely thresholding the LLR log PYY||XX=
=c
ci
.
j
As we explained at the beginning of this Chapter, a (possibly suboptimal) approach in M-ary HT
is to run binary tests by thresholding each information density i(ci ; y). This, loosely speaking,
evaluates the likelihood of ci against the average distribution of the other M − 1 codewords, which
1
P
we approximate by PY (as opposed to the more precise form M− 1 j̸=i PY|X=cj ). Putting these ideas
together we can propose the decoder as
where γ is a threshold and PX is judiciously chosen (to maximize I(X; Y) as we will see soon).
A yet another way to see why thresholding decoder (as opposed to an ML one) is a natural idea
is to simply believe the fact that for good error correcting codes the most likely (ML) codeword
has likelihood (information density) so much higher than the rest of the candidates that instead
of looking for the maximum we simply can select the one (and only) codeword that exceeds a
pre-specified threshold.
With these initial justifications we proceed to the main result of this section.
Theorem 18.5 (Shannon’s achievability bound) Fix a channel PY|X and an arbitrary
input distribution PX . Then for every τ > 0 there exists an (M, ϵ)-code with
Proof. Recall that for a given codebook {c1 , . . . , cM }, the optimal decoder is MAP and is equiv-
alent to maximizing information density, cf. (18.8). The step of maximizing the i(cm ; Y) makes
analyzing the error probability difficult. Similar to what we did in almost loss compression, cf. The-
orem 11.5, the first important step for showing the achievability bound is to consider a suboptimal
decoder. In Shannon’s bound, we consider a threshold-based suboptimal decoder g(y) as follows:
m, ∃! cm s.t. i(cm ; y) ≥ log M + τ
g ( y) = (18.10)
e, otherwise
In words, decoder g reports m as decoded message if and only if codeword cm is a unique one
with information density exceeding the threshold log M + τ . If there are multiple or none such
codewords, then decoder outputs a special value of e, which always results in error since W 6= e
ever. (We could have decreased probability of error slightly by allowing the decoder to instead
output a random message, or to choose any one of the messages exceeding the threshold, or any
other clever ideas. The point, however, is that even the simplistic resolution of outputting e already
achieves all qualitative goals, while simplifying the analysis considerably.)
i i
i i
i i
E[Pe (c1 , . . . , cM )]
= E[Pe (c1 , . . . , cM )|W = 1]
= P[{i(c1 ; Y) ≤ log M + τ } ∪ {∃m 6= 1, i(cm , Y) > log M + τ }|W = 1]
X
M
≤ P[i(c1 ; Y) ≤ log M + τ |W = 1] + P[i(cm̄ ; Y) > log M + τ |W = 1] (union bound)
m̄=2
( a)
= P [i(X; Y) ≤ log M + τ ] + (M − 1)P i(X; Y) > log M + τ
≤ P [i(X; Y) ≤ log M + τ ] + (M − 1) exp(−(log M + τ )) (by Corollary 18.4)
≤ P [i(X; Y) ≤ log M + τ ] + exp(−τ ) ,
where the crucial step (a) follows from the fact that given W = 1 and m̄ 6= 1 we have
d
(c1 , cm̄ , Y) = (X, X̄, Y)
Remark 18.2 (Joint typicality) Shortly in Chapter 19, we will apply this theorem for the
case of PX = P⊗ n ⊗n
X1 (the iid input) and PY|X = PY1 |X1 (the memoryless channel). Traditionally,
cf. [111], decoders in such settings were defined with the help of so called “joint typicality”. Those
decoders given y = yn search for the codeword xn (both of which are an n-letter vectors) such that
the empirical joint distribution is close to the true joint distribution, i.e., P̂xn ,yn ≈ PX1 ,Y1 , where
1
P̂xn ,yn (a, b) = · |{j ∈ [n] : xj = a, yj = b}|
n
is the joint empirical distribution of (xn , yn ). This definition is used for the case when random
coding is done with cj ∼ uniform on the type class {xn : P̂xn ≈ PX }. Another alternative, “entropic
Pn
typicality”, cf. [106], is to search for a codeword with j=1 log PX ,Y 1(xj ,yj ) ≈ H(X, Y). We think
1 1
of our requirement, {i(xn ; yn ) ≥ nγ1 }, as a version of “joint typicality” that is applicable to much
wider generality of channels (not necessarily over product alphabets, or memoryless).
i i
i i
i i
356
Theorem 18.6 (DT bound) Fix a channel PY|X and an arbitrary input distribution PX . Then
for every τ > 0 there exists an (M, ϵ)-code with
M − 1 +
ϵ ≤ E exp − i(X; Y) − log (18.11)
2
where x+ ≜ max(x, 0).
Setting Ŵ = g(Y) we note that given a codebook {c1 , . . . , cM }, we have by union bound
P[Ŵ 6= j|W = j] = P[i(cj ; Y) ≤ γ|W = j] + P[i(cj ; Y) > γ, ∃k ∈ [j − 1], s.t. i(ck ; Y) > γ]
j− 1
X
≤ P[i(cj ; Y) ≤ γ|W = j] + P[i(ck ; Y) > γ|W = j].
k=1
Averaging over the randomly generated codebook, the expected error probability is upper bounded
by:
1 X
M
E[Pe (c1 , . . . , cM )] = P[Ŵ 6= j|W = j]
M
j=1
1 X
j−1
M X
≤ P[i(X; Y) ≤ γ] + P[i(X; Y) > γ]
M
j=1 k=1
M−1
= P[i(X; Y) ≤ γ] + P[i(X; Y) > γ]
2
M−1
= P[i(X; Y) ≤ γ] + E[exp(−i(X; Y))1 {i(X; Y) > γ}] (by (18.3))
2
h M−1 i
= E 1 {i(X; Y) ≤ γ} + exp(−i(X; Y))1 {i(X, Y) > γ}
2
To optimize over γ , note the simple observation that U1E + V1Ec ≥ min{U, V}. Therefore for any
x, y, 1{i(x; y) ≤ γ} + M− M−1
2 exp{−i(x; y)}1{i(x; y) > γ} > min(1, 2 exp{−i(x; y)}), achieved
1
M−1
by γ = log 2 regardless of x, y. Thus, we continue the bounding as follows
h M−1 i
inf E[Pe (c1 , . . . , cM )] ≤ inf E 1 {i(X; Y) ≤ γ} + exp(−i(X; Y))1 {i(X, Y) > γ}
γ γ 2
h M−1 i
= E min 1, exp(−i(X; Y))
2
i i
i i
i i
M − 1 +
= E exp − i(X; Y) − log .
2
multiple of the minimum error probability of the following Bayesian hypothesis testing problem:
H0 : X, Y ∼ PX,Y versus H1 : X, Y ∼ PX PY
2 M−1
prior prob.: π 0 = , π1 = .
M+1 M+1
Note that X, Y ∼ PX,Y and X, Y ∼ PX PY , where X is the sent codeword and X is the unsent code-
word. As we know from binary hypothesis testing, the best threshold for the likelihood ratio test
(minimizing the weighted probability of error) is log ππ 10 , as we indeed found out.
One of the immediate benefits of Theorem 18.6 compared to Theorem 18.5 is precisely the fact
that we do not need to perform a cumbersome minimization over τ in (18.9) to get the minimum
upper bound in Theorem 18.5. Nevertheless, it can be shown that the DT bound is stronger than
Shannon’s bound with optimized τ . See also Exercise IV.5.
Finally, we remark (and will develop this below in our treatment of linear codes) that DT bound
and Shannon’s bound both hold without change if we generate {ci } by any other (non-iid) pro-
cedure with a prescribed marginal and pairwise independent codewords – see Theorem 18.13
below.
Theorem 18.7 (Feinstein’s lemma) Fix a channel PY|X and an arbitrary input distribution
PX . Then for every γ > 0 and for every ϵ ∈ (0, 1) there exists an (M, ϵ)max -code with
Remark 18.4 (Comparison with Shannon’s bound) We can also interpret (18.12) differ-
ently: for any fixed M, there exists an (M, ϵ)max -code that achieves the maximal error probability
bounded as follows:
M
ϵ ≤ P[i(X; Y) < log γ] +
γ
2
Nevertheless, we should point out that this is not a serious advantage: from any (M, ϵ) code we can extract an
(M′ , ϵ′ )max -subcode with a smaller M′ and larger ϵ′ – see Theorem 19.4.
i i
i i
i i
358
If we take log γ = log M + τ , this gives the bound of exactly the same form as Shannon’s (18.9). It
is rather surprising that two such different methods of proof produced essentially the same bound
(modulo the difference between maximal and average probability of error). We will discuss the
reason for this phenomenon in Section 18.7.
Proof. From the definition of (M, ϵ)max -code, we recall that our goal is to find codewords
c1 , . . . , cM ∈ X and disjoint subsets (decoding regions) D1 , . . . , DM ⊂ Y , s.t.
Ex ≜ {y ∈ Y : i(x; y) ≥ log γ}
Notice that the preliminary decoding regions {Ex } may be overlapping, and we will trim them
into final decoding regions {Dx }, which will be disjoint. Next, we apply Corollary 18.4 and find
out that there is a set F ⊂ X with two properties: a) PX [F] = 1 and b) for every x ∈ F we have
1
PY (Ex ) ≤ . (18.13)
γ
We can assume that P[i(X; Y) < log γ] ≤ ϵ, for otherwise the RHS of (18.12) is negative and
there is nothing to prove. We first claim that there exists some c ∈ F such that P[Y ∈ Ec |X =
c] = PY|X (Ec |c) ≥ 1 − ϵ. Indeed, assume (for the sake of contradiction) that ∀c ∈ F, P[i(c; Y) ≥
log γ|X = c] < 1 − ϵ. Note that since PX (F) = 1 we can average this inequality over c ∼ PX . Then
we arrive at P[i(X; Y) ≥ log γ] < 1 − ϵ, which is a contradiction.
With these preparations we construct the codebook in the following way:
1 Pick c1 to be any codeword in F such that PY|X (Ec1 |c1 ) ≥ 1 − ϵ, and set D1 = Ec1 ;
2 Pick c2 to be any codeword in F such that PY|X (Ec2 \D1 |c2 ) ≥ 1 − ϵ, and set D2 = Ec2 \D1 ;
...
−1
3 Pick cM to be any codeword in F such that PY|X (EcM \ ∪M j=1 Dj |cM ] ≥ 1 − ϵ, and set DM =
M− 1
EcM \ ∪j=1 Dj .
We stop if cM+1 codeword satisfying the requirement cannot be found. Thus, M is determined by
the stopping condition:
∀c ∈ F, PY|X (Ec \ ∪M
j=1 Dj |c) < 1 − ϵ
Averaging the stopping condition over c ∼ PX (which is permissible due to PX (F) = 1), we
obtain
[
M
P i(X; Y) ≥ log γ and Y 6∈ Dj < 1 − ϵ,
j=1
i i
i i
i i
or, equivalently,
[
M
ϵ < P i(X; Y) < log γ or Y ∈ Dj .
j=1
X
M
≤ P[i(X; Y) < log γ] + PY (Ecj )
j=1
M
≤ P[i(X; Y) < log γ] +
γ
where the last step makes use of (18.13).Evidently, this completes the proof.
Theorem 18.8 (RCU bound) Fix a channel PY|X and an arbitrary input distribution PX . Then
for every integer M ≥ 1 there exists an (M, ϵ)-code with
ϵ ≤ E min 1, (M − 1)P i(X̄; Y) ≥ i(X; Y) X, Y , (18.14)
where the joint distribution of (X, X̄, Y) is as in (18.1).
Proof. For a given codebook (c1 , . . . cM ) the average probability of error for the maximum
likelihood decoder, cf. (18.8), is upper bounded by
1 X [
M M
ϵ≤ P {i(cj ; Y) ≥ i(cm ; Y)} |X = cm .
M
m=1 j=1;j̸=m
Note that we do not necessarily have equality here, since the maximum likelihood decoder will
resolves ties (i.e. the cases when multiple codewords maximize information density) in favor of
the correct codeword, whereas in the expression above we pessimistically assume that all ties are
resolved incorrectly. Now, similar to Shannon’s bound in Theorem 18.5 we prove existence of a
i.i.d.
good code by averaging the last expression over cj ∼ PX .
i i
i i
i i
360
To that end, notice that expectations of each term in the sum coincide (by symmetry). To evalu-
ate this expectation, let us take m = 1 condition on W = 1 and observe that under this conditioning
we have
Y
M
(c1 , Y, c2 , . . . , cM ) ∼ PX,Y PX .
j=2
j=2
(b)
≤ E min{1, (M − 1)P i(X̄; Y) ≥ i(X; Y) X, Y }
where (a) is just expressing the probability by first conditioning on the values of (c1 , Y); and (b)
corresponds to applying the union bound but capping the result by 1. This completes the proof
of the bound. We note that the step (b) is the essence of the RCU bound and corresponds to the
self-evident fact that for any collection of events Ej we have
X
P[∪Ej ] ≤ min{1, P[Ej ]} .
j
What is makes its application clever is that we first conditioned on (c1 , Y). If we applied the union
bound right from the start without conditioning, the resulting estimate on ϵ would have been much
weaker (in particular, would not have lead to achieving capacity).
It turns out that Shannon’s bound Theorem 18.5 is just a weakening of (18.14) obtained by
splitting the expectation according to whether or not i(X; Y) ≤ log M + τ and upper bounding
min{x, 1} by 1 when i(X; Y) ≤ log M + τ and by x otherwise. Another such weakening is a
famous Gallager’s bound [176], which in fact gives tight estimate of the exponent in the decay of
error probability over memoryless channels (Section 22.4*).
Theorem 18.9 (Gallager’s bound) Fix a channel PY|X , an arbitrary input distribution PX
and ρ ∈ [0, 1]. Then there exists an (M, ϵ) code such that
" 1+ρ #
i ( X̄; Y )
ϵ ≤ Mρ E E exp Y (18.15)
1+ρ
i i
i i
i i
Proof. We first notice that by Proposition 18.3 applied with f+ (x, y) = exp{ i1(+ρ
x;y)
} and
interchanged X and Y we have for PY -almost every y
ρ 1 1
E[exp{−i(X; Y) }|Y = y] = E[exp{i(X; Ȳ) }|Ȳ = y] = E[exp{i(X̄; Y) }|Y = y] ,
1+ρ 1+ρ 1+ρ
(18.16)
d
where we also used the fact that (X, Ȳ) = (X̄, Y) under (18.1).
Now, consider the bound (18.14) and replace the min via the bound
min{t, 1} ≤ tρ ∀t ≥ 0 . (18.17)
this results in
ϵ ≤ Mρ E P[i(X̄; Y) > i(X; Y)|X, Y]ρ . (18.18)
The key innovation of Gallager, namely the step (18.17), which became known as the ρ-trick,
corresponds to the following version of the union bound: For any events Ej and 0 ≤ ρ ≤ 1 we
have
ρ
X X
P[∪Ej ] ≤ min 1, P [ Ej ] ≤ P[Ej ] .
j j
Now to understand properly the significance of Gallager’s bound we need to first define the concept
of the memoryless channels (see (19.1) below). For such channels and using the iid inputs, the
expression (18.15) turns, after optimization over ρ, into
ϵ ≤ exp{−nEr (R)} ,
where R = logn M is the rate and Er (R) is the Gallager’s random coding exponent. This shows that
not only the error probability at a fixed rate can be made to vanish, but in fact it can be made to
vanish exponentially fast in the blocklength. We will discuss such exponential estimates in more
detail in Section 22.4*.
i i
i i
i i
362
Definition 18.10 (Linear codes) Let Fq denote the finite field of cardinality q (cf. Defini-
tion 11.7). Let the input and output space of the channel be X = Y = Fnq . We say a codebook
C = {cu : u ∈ Fkq } of size M = qk is a linear code if C is a k-dimensional linear subspace of Fnq .
• Generator matrix G ∈ Fkq×n , so that the codeword for each u ∈ Fkq is given by cu = uG
(row-vector convention) and the codebook C is the row-span of G, denoted by Im(G);
(n−k)×n
• Parity-check matrix H ∈ Fq , so that each codeword c ∈ C satisfies Hc⊤ = 0. Thus C is
the nullspace of H, denoted by Ker(H). We have HG⊤ = 0.
Example 18.1 (Hamming code) The [7, 4, 3]2 Hamming code over F2 is a linear code with
the following generator and parity check matrices:
1 0 0 0 1 1 0
0 1 1 0 1 1 0 0
1 0 0 1 0 1
G=
0 , H= 1 0 1 1 0 1 0
0 1 0 0 1 1
0 1 1 1 0 0 1
0 0 0 1 1 1 1
In particular, G and H are of the form G = [I; P] and H = [−P⊤ ; I] (systematic codes) so that
HG⊤ = 0. The following picture helps to visualize the parity check operation:
x5
x2 x1
x4
x7 x6
x3
i i
i i
i i
Note that all four bits in each circle (corresponding to a row of H) sum up to zero. One can verify
that the minimum distance of this code is 3 bits. As such, it can correct 1 bit of error and detect 2
bits of error.
Linear codes are almost always examined with channels of additive noise, a precise definition
of which is given below:
Definition 18.11 (Additive noise) A channel PY|X with input and output space Fnq is called
additive-noise if
PY|X (y|x) = PZ (y − x)
for some random vector Z taking values in Fnq . In other words, Y = X + Z, where Z ⊥
⊥ X.
Given a linear code and an additive-noise channel PY|X , it turns out that there is a special
“syndrome decoder” that is optimal.
Theorem 18.12 Any [k, n]Fq linear code over an additive-noise PY|X has a maximum likelihood
(ML) decoder g : Fnq → Fkq such that:
1 g(y) = y − gsynd (Hy⊤ ), i.e., the decoder is a function of the “syndrome” Hy⊤ only. Here gsynd :
Fnq−k → Fnq , defined by gsynd (s) ≜ argmaxz:Hz⊤ =s PZ (z), is called the “syndrome decoder”,
which decodes the most likely realization of the noise.
2 (Geometric uniformity) Decoding regions are translates of D0 = Im(gsynd ): Du = cu + D0 for
any u ∈ Fkq .
3 Pe,max = Pe .
In other words, syndrome is a sufficient statistic (Definition 3.8) for decoding a linear code.
Proof. 1 The maximum likelihood decoder for a linear code is
g(y) = argmax PY|X (y|c) = argmax PZ (y − c) = y − argmax PZ (z) = y − gsynd (Hy⊤ ).
c∈C c:Hc⊤ =0 z:Hz⊤ =Hy⊤
Remark 18.5 (BSC example) As a concrete example, consider the binary symmetric channel
BSC⊗ n
δ previously considered in Example 17.1 and Section 17.2. This is an additive-noise channel
i.i.d.
over Fn2 , where Y = X + Z and Z = (Z1 , . . . , Zn ) ∼ Ber(δ). Assuming δ < 1/2, the syndrome
i i
i i
i i
364
decoder aims to find the noise realization with the fewest number of flips that is compatible with
the received codeword, namely gsynd (s) = argminz:Hz⊤ =s wH (z), where wH denotes the Hamming
weight. In this case elements of the image of gsynd , which we denoted by D0 , are known as “minimal
weight coset leaders”. Counting how many of them occur at each Hamming weight is a difficult
open problem even for the most well-studied codes such as Reed-Muller ones. In Hamming space
D0 looks like a Voronoi region of a lattice and Du ’s constitute a Voronoi tesselation of Fnq .
Overwhelming majority of practically used codes are in fact linear codes. Early in the history
of coding, linearity was viewed as a way towards fast and low-complexity encoding (just binary
matrix multiplication) and slightly lower complexity of the maximum-likelihood decoding (via the
syndrome decoder). As codes became longer and longer, though, the syndrome decoding became
impractical and today only those codes are used in practice for which there are fast and low-
complexity (suboptimal) decoders.
Theorem 18.13 (DT bound for linear codes) Let PY|X be an additive noise channel over
Fnq . For all integers k ≥ 1 there exists a linear code f : Fkq → Fnq with error probability:
+
− n−k−logq 1
Pe,max = Pe ≤ E q .
P Z ( Z)
(18.19)
Remark 18.6 The bound above is the same as Theorem 18.6 evaluated with PX = Unif(Fnq ).
The analogy between Theorems 18.6 and 18.13 is the same as that between Theorems 11.5 and
11.8 (full random coding vs random linear codes).
Proof. Recall that in proving the DT bound (Theorem 18.6), we selected the codewords
i.i.d.
c1 , . . . , cM ∼ PX and showed that
M−1
E[Pe (c1 , . . . , cM )] ≤ P[i(X; Y) ≤ γ] + P[i(X; Y) ≥ γ]
2
Here we will adopt the same approach and take PX = Unif(Fnq ) and M = qk .
By Theorem 18.12 the optimal decoding regions are translational invariant, i.e. Du = cu +
D0 , ∀u, and therefore:
cu = uG + h, ∀u ∈ Fkq
where random G and h are drawn as follows: the k × n entries of G and the 1 × n entries
of h are i.i.d. uniform over Fq . We add the dithering to eliminate the special role that the
all-zero codeword plays (since it is contained in any linear codebook).
i i
i i
i i
Step 2: We claim that the codewords are pairwise independent and uniform, i.e. ∀u 6= u′ ,
(cu , cu′ ) ∼ (X, X), where PX,X (x, x) = 1/q2n . To see this note that
cu ∼ uniform on Fnq
cu′ = u′ G + h = uG + h + (u′ − u)G = cu + (u′ − u)G
Step 5: Remove dithering h. We claim that there exists a linear code without dithering such
that (18.20) is satisfied. The intuition is that shifting a codebook has no effect on its
performance. Indeed,
• Before, with dithering, the encoder maps u to uG + h, the channel adds noise to produce
Y = uG + h + Z, and the decoder g outputs g(Y).
• Now, without dithering, we encode u to uG, the channel adds noise to produce Y =
uG + Z, and we apply decode g′ defined by g′ (Y) = g(Y + h).
By doing so, we “simulate” dithering at the decoder end and the probability of error
remains the same as before. Note that this is possible thanks to the additivity of the noisy
channel.
We see that random coding can be done with different ensembles of codebooks. For example,
we have
i.i.d.
• Shannon ensemble: C = {c1 , . . . , cM } ∼ PX – fully random ensemble.
• Elias ensemble [150]: C = {uG : u ∈ Fkq }, with the k × n generator matrix G drawn uniformly
at random from the set of all matrices. (This ensemble is used in the proof of Theorem 18.13.)
• Gallager ensemble: C = {c : Hc⊤ = 0}, with the (n − k) × n parity-check matrix H drawn
uniformly at random. Note this is not the same as the Elias ensemble.
i i
i i
i i
366
• One issue with Elias ensemble is that with some non-zero probability G may fail to be full rank.
(It is a good exercise to find P [rank(G) < k] as a function of n, k, q.) If G is not full rank, then
there are two identical codewords and hence Pe,max ≥ 1/2. To fix this issue, one may let the
generator matrix G be uniform on the set of all k × n matrices of full (row) rank.
• Similarly, we may modify Gallager’s ensemble by taking the parity-check matrix H to be
uniform on all n × (n − k) full rank matrices.
For the modified Elias and Gallager’s ensembles, we could still do the analysis of random coding.
A small modification would be to note that this time (X, X̄) would have distribution
1
PX,X̄ = 1 {X 6= X′ }
q2n − qn
uniform on all pairs of distinct codewords and are not pairwise independent.
Finally, we note that the Elias ensemble with dithering, cu = uG + h, has pairwise independence
property and its joint entropy H(c1 , . . . , cM ) = H(G) + H(h) = (nk + n) log q. This is significantly
smaller than for Shannon’s fully random ensemble that we used in Theorem 18.5. Indeed, when
i.i.d.
cj ∼ Unif(Fnq ) we have H(c1 , . . . , cM ) = qk n log q. An interesting question, thus, is to find
min H(c1 , . . . , cM )
where the minimum is over all distributions with P[ci = a, cj = b] = q−2n when i 6= j (pairwise
independent, uniform codewords). Note that H(c1 , . . . , cM ) ≥ H(c1 , c2 ) = 2n log q. Similarly, we
may ask for (ci , cj ) to be uniform over all pairs of distinct elements. In this case, the Wozencraft
ensemble (see Exercise IV.7) for the case of n = 2k achieves H(c1 , . . . , cqk ) ≈ 2n log q, which is
essentially our lower bound.
In short, we will see that the answer is that both of these methods are well-known to be (almost)
optimal for submodular function maximization, and this is exactly what channel coding is about.
i i
i i
i i
Before proceeding, we notice that in the second question it is important to qualify that PX
is simple, since taking PX to be supported on the optimal M∗ (ϵ)-achieving codebook would of
course result in very good performance. However, instead we will see that choosing rather simple
PX already achieves a rather good lower bound on M∗ (ϵ). More explicitly, by simple we mean a
product distribution for the memoryless channel. Or, as an even better example to have in mind,
consider an additive-noise vector channel:
Yn = Xn + Zn
with addition over a product abelian group and arbitrary (even non-memoryless) noise Zn . In this
case the choice of uniform PX in random coding bound works, and is definitely “simple”.
The key observation of [36] is submodularity of the function mapping a codebook C ⊂ X to
the |C|(1 − Pe,MAP (C)), where Pe,MAP (C) is the probability of error under the MAP decoder (17.5).
(Recall (1.8) for the definition of submodularity.) More explicitly, consider a discrete Y and define
X
S(C) ≜ max PY|X (y|x) , S(∅) = 0
x∈C
y∈Y
and set
Ct+1 = Ct ∪ {xt+1 } .
In other words, the probability of successful (MAP) decoding for the greedily constructed code-
book is at most a factor (1 − 1/e) away from the largest possible probability of success among all
codebooks of the same cardinality. Since we are mostly interested in success probabilities very
close to 1, this result may not appear very exciting. However, a small modification of the argument
yields the following (see [257, Theorem 1.5] for the proof):
i i
i i
i i
368
Theorem 18.14 ([313]) For any non-negative submodular set-function f and a greedy
sequence Ct we have for all ℓ, t:
Applying this to the special case of f(·) = S(·) we obtain the result of [36]: The greedily
constructed codebook C ′ with M′ codewords satisfies
M ′
1 − Pe,MAP (C ′ ) ≥ (1 − e−M /M )(1 − ϵ∗ (M)) .
M′
In particular, the greedily constructed code with M′ = M2−10 achieves probability of success that
is ≥ 0.9995(1 −ϵ∗ (M)). In other words, compared to the best possible code a greedy code carrying
10 bits fewer of data suffers at most 5 · 10−4 worse probability of error. This is a very compelling
evidence for why greedy construction is so good. We do note, however, that Feinstein’s bound
does greedy construction not with the MAP decoder, but with a suboptimal one.
Next we address the question of random coding. Recall that our goal is to explain how can
selecting codewords uniformly at random from a “simple” distribution PX be any good. The key
idea is again contained in [313]. The set-function S(C) can also be understood as a function with
domain {0, 1}|X | . Here is a natural extension of this function to the entire solid hypercube [0, 1]|X | :
X X
SLP (π ) = sup{ PY|X (y|x)rx,y : 0 ≤ rx,y ≤ π x , rx,y ≤ 1} . (18.21)
x, y x
Indeed, it is easy to see that SLP (1C ) = S(C) and that SLP is a concave function.3
Since SLP is an extension of S it is clear that
X
S∗ (M) ≤ S∗LP (M) ≜ max{SLP (π ) : 0 ≤ π x ≤ 1, π x ≤ M} . (18.22)
x
In fact, we will see later in Section 22.3 that this bound coincides with the bound known as
meta-converse. Surprisingly, [313] showed that the greedy construction not only achieves a large
multiple of S∗ (M) but also of S∗LP (M):
3
There are a number of standard extensions of a submodular function f to a hypercube. The largest convex interpolant f+ ,
also known as Lovász extension, the least concave interpolant f− , and multi-linear extension [80]. However, SLP does not
coincide with any of these and in particular strictly larger (in general) than f− .
i i
i i
i i
To connect to the concept of random coding, though, we need the following result of [36]:4
P
Theorem 18.15 Fix π ∈ [0, 1]|X | and let M = x∈X π x . Let C = {c1 , . . . , cM′ } with
i.i.d.
cj ∼ PX (x) = πx
M. Then we have
′
E[S(C)] ≥ (1 − e−M /M )SLP (π ) .
The proof of this result trivially follows from applying the following lemma with g(x) =
PY|X (y|x), summing over y and recalling the definition of SLP in (18.21).
Proof. Without loss of generality we take X = [m] and g(1) ≥ g(2) ≥ · · · ≥ g(m) ≥ g(m + 1) ≜
′ ′
0. Denote for convenience a = 1 − (1 − M1 )M ≥ 1 − e−M /M , b(j) ≜ P[{1, . . . , j} ∩ C 6= ∅]. Then
P[max g(x) = g(j)] = b(j) − b(j − 1) ,
x∈C
b(j) ≥ ac(j) .
Plugging this into (18.24) we conclude the proof by noticing that rj = c(j) − c(j − 1) attains the
maximum in the definition of T(π , g).
Theorem 18.15 completes this section’s goal and shows that the random coding (as well as the
greedy/maximal coding) attains an almost optimal value of S∗ (M). Notice also that the random
coding distribution that we should be using is the one that attains the definition of S∗LP (M). For input
symmetric channels (such as additive noise ones) it is easy to show that the optimal π ∈ [0, 1]X is
a constant vector, and hence the codewords are to be generated iid uniformly on X .
4
There are other ways of doing “random coding” to produce an integer solution from a fractional one. For example, see the
multi-linear extension based one in [80].
i i
i i
i i
19 Channel capacity
In this chapter we apply methods developed in the previous chapters (namely the weak converse
and the random/maximal coding achievability) to compute the channel capacity. This latter notion
quantifies the maximal amount of (data) bits that can be reliably communicated per single channel
use in the limit of using the channel many times. Formalizing the latter statement will require
introducing the concept of a communication channel. Then for special kinds of channels (the
memoryless and the information stable ones) we will show that computing the channel capacity
reduces to maximizing the (sequence of the) mutual informations. This result, known as Shannon’s
noisy channel coding theorem, is the third example of a coding theorem in this book. It connects the
value of an operationally defined (discrete, combinatorial) optimization problem over codebooks
to that of a (convex) optimization problem over information measures. It builds a bridge between
the abstraction of information measures (Part I) and a practical engineering problem of channel
coding.
Information theory as a subject is sometimes accused of “asymptopia”, or the obsession with
asymptotic results and computing various limits. Although in this book we attempt to refrain from
asymptopia, the topic of this chapter requires committing this sin ipso facto. After proving capacity
theorems in various settings, we conclude the Chapter with Shannon’s separation theorem, that
shows that any (stationary ergodic) source can be communicated over an (information stable)
channel if and only its entropy rate is smaller than the channel capacity. Furthermore, doing so
can be done by first compressing a source to pure bits and then using channel code to match those
bits to channel inputs. The fact that no performance is lost in the process of this conversion to bits
had important historical consequence in cementing bits as the universal currency of the digital
age.
370
i i
i i
i i
Definition 19.1 Fix an input alphabet A and an output alphabet B. A sequence of Markov
kernels PYn |Xn : An → B n indexed by the integer n = 1, 2 . . . is called a channel. The length of
the input n is known as blocklength.
To give this abstract notion more concrete form one should recall Section 17.2, in which we
described the BSC channel. Note, however, that despite this definition, it is customary to use the
term channel to refer to a single Markov kernel (as we did before in this book). An even worse,
yet popular, abuse of terminology is to refer to n-th element of the sequence, the kernel PYn |Xn , as
the n-letter channel.
Although we have not imposed any requirements on the sequence of kernels PYn |Xn , one is never
interested in channels at this level of generality. Most of the time the elements of the channel input
Xn = (X1 , . . . , Xn ) are thought as indexed by time. That is the Xt corresponds to the letter that is
transmitted at time t inside an overall block of length n, while Yt is the letter received at time t.
The channel’s action is that of “adding noise” to Xt and outputting Yt . However, the generality of
the previous definition allows to model situations where the channel has internal state, so that the
amount and type of noise added to Xt depends on the previous inputs and in principle even on the
future inputs. The interpretation of t as time, however, is not exclusive. In storage (magnetic, non-
volatile or flash) t indexes space. In those applications, the noise may have a rather complicated
structure with transformation Xt → Yt depending on both the “past” X<t and the “future” X>t .
Almost all channels of interest satisfy one or more of the restrictions that we list next:
• A channel is called non-anticipatory if it has the following extension property. Under the n-letter
kernel PYn |Xn , the conditional distribution of the first k output symbols Yk only depends on Xk
(and not on Xnk+1 ) and coincides with the kernel PYk |Xk (the k-th element of the channel sequence)
the k-th channel transition kernel in the sequence. This requirement models the scenario wherein
channel outputs depend causally on the inputs.
• A channel is discrete if A and B are finite.
• A channel is additive-noise if A = B are abelian group and Yn = Xn + Zn for some Zn
independent of Xn (see Definition 18.11). Thus
Y
n
PYn |Xn = PYk |Xk . (19.1)
k=1
where each PYk |Xk : A → B ; in particular, PYn |Xn are compatible at different blocklengths n.
• A channel is stationary memoryless if (19.1) is satisfied with PYk |Xk not depending on k, denoted
commonly by PY|X . In other words,
i i
i i
i i
372
δ̄
1 1
δ [ ]
δ̄ δ
BSCδ
δ δ δ̄
0 0
δ̄
δ̄
1 1
δ
[ ]
? δ̄ δ 0
BECδ
δ 0 δ δ̄
0 0
δ̄
1 1
δ
[ ]
1 0
Z-channel
0 0 δ δ̄
δ̄
The interpretation is that each coordinate of the transmitted codeword Xn is corrupted by noise
independently with the same noise statistic.
• Discrete memoryless stationary channel (DMC): A DMC is a channel that is both discrete and
stationary memoryless. It can be specified in two ways:
– an |A| × |B|-dimensional (row-stochastic) matrix PY|X where elements specify the transition
probabilities;
– a bipartite graph with edge weight specifying the non-zero transition probabilities.
Table 19.1 lists some common binary-input binary-output DMCs: the binary symmetric channel
(BSC), the binary symmetric channel (BEC), and the Z-channel.
As another example, let us recall the AWGN channel in Example 3.3: the alphabets A = B =
R and Yn = Xn + Zn , with Xn ⊥ ⊥ Zn ∼ N (0, σ 2 In ). This channel is a non-discrete, stationary
memoryless, additive-noise channel.
Having defined the notion of the channel, we can define next the operational problem that the
communication engineer faces when tasked with establishing a data link across the channel. Since
the channel is noisy, the data is not going to pass unperturbed and the error correcting codes are
i i
i i
i i
naturally to be employed. To send one of M = 2k messages (or k data bits) with low probabil-
ity of error, it is often desirable to use the shortest possible length of the input sequence. This
desire explains the following definitions, which extend the fundamental limits in Definition 17.2
to involve the blocklength n.
• An (n, M, ϵ)-code is an (M, ϵ)-code for PYn |Xn , consisting of an encoder f : [M] → An and a
decoder g : B n → [M] ∪ {e}.
• An (n, M, ϵ)max -code is analogously defined for the maximum probability of error.
We will mostly focus on understanding M∗ (n, ϵ) and a relate quantity known as rate. Recall that
blocklength n measures the amount of time or space resource used by the code. Thus, it is natural
to maximize the ratio of the data transmitted to the resource used, and that leads us to the notion of
log M
the transmission rate defined as R = n2 and equal to the number of bits transmitted per channel
use. Consequently, instead of studying M∗ (n, ϵ) one is lead to the study of 1n log M∗ (n, ϵ). A natural
first question is to determine the first-order asymptotics of this quantity and this motivates the final
definition of the Section.
Definition 19.3 (Channel capacity) The ϵ-capacity Cϵ and the Shannon capacity C are
defined as follows
1
Cϵ ≜ lim inf log M∗ (n, ϵ);
n→∞ n
C = lim Cϵ .
ϵ→0+
Channel capacity is measured in information units per channel use, e.g. “bit/ch.use”.
The operational meaning of Cϵ (resp. C) is the maximum achievable rate at which one can
communicate through a noisy channel with probability of error at most ϵ (resp. o(1)). In other
words, for any R < C, there exists an (n, exp(nR), ϵn )-code, such that ϵn → 0. In this vein, Cϵ and
C can be equivalently defined as follows:
i i
i i
i i
374
The reason that capacity is defined as a large-n limit (as opposed to a supremum over n) is because
we are concerned with rate limit of transmitting large amounts of data without errors (such as in
communication and storage).
The case of zero-error (ϵ = 0) is so different from ϵ > 0 that the topic of ϵ = 0 constitutes a
separate subfield of its own (cf. the survey [252]). Introduced by Shannon in 1956 [379], the value
1
C0 ≜ lim inf log M∗ (n, 0) (19.6)
n→∞ n
is known as the zero-error capacity and represents the maximal achievable rate with no error
whatsoever. Characterizing the value of C0 is often a hard combinatorial problem. However, for
many practically relevant channels it is quite trivial to show C0 = 0. This is the case, for example,
for the DMCs we considered before: the BSC or BEC. Indeed, for them we have log M∗ (n, 0) = 0
for all n, meaning transmitting any amount of information across these channels requires accepting
some (perhaps vanishingly small) probability of error. Nevertheless, there are certain interesting
and important channels for which C0 is positive, cf. Section 23.3.1 for more.
As a function of ϵ the Cϵ could (most generally) behave like the plot below on the left-hand
side below. It may have a discontinuity at ϵ = 0 and may be monotonically increasing (possibly
even with jump discontinuities) in ϵ. Typically, however, Cϵ is zero at ϵ = 0 and stays constant for
all 0 < ϵ < 1 and, hence, coincides with C (see the plot on the right-hand side). In such cases we
say that the strong converse holds (more on this later in Section 22.1).
Cǫ Cǫ
strong converse
holds
Zero error b
C0
Capacity
ǫ ǫ
0 1 0 1
In Definition 19.3, the capacities Cϵ and C are defined with respect to the average probabil-
ity of error. By replacing M∗ with M∗max , we can define, analogously, the capacities Cϵ
(max)
and
(max)
C with respect to the maximal probability of error. It turns out that these two definitions are
equivalent, as the next theorem shows.
Proof. The second inequality is obvious, since any code that achieves a maximum error
probability ϵ also achieves an average error probability of ϵ.
i i
i i
i i
For the first inequality, take an (n, M, ϵ(1 − τ ))-code, and define the error probability for the jth
codeword as
λj ≜ P[Ŵ 6= j|W = j]
Then
X X X
M(1 − τ )ϵ ≥ λj = λj 1 {λj ≤ ϵ} + λj 1 {λj > ϵ} ≥ ϵ|{j : λj > ϵ}|.
Hence |{j : λj > ϵ}| ≤ (1 − τ )M. (Note that this is exactly Markov inequality.) Now by removing
those codewords1 whose λj exceeds ϵ, we can extract an (n, τ M, ϵ)max -code. Finally, take M =
M∗ (n, ϵ(1 − τ )) to finish the proof.
Corollary 19.5 (Capacity under maximal probability of error) C(ϵmax) = Cϵ for all
ϵ > 0 such that Cϵ = Cϵ− . In particular, C(max) = C.
Proof. Using the definition of M∗ and the previous theorem, for any fixed τ > 0
1
Cϵ ≥ C(ϵmax) ≥ lim inf log τ M∗ (n, ϵ(1 − τ )) ≥ Cϵ(1−τ )
n→∞ n
(max)
Sending τ → 0 yields Cϵ ≥ Cϵ ≥ Cϵ− .
Note that information capacity C(I) so defined is not the same as the Shannon capacity C in Def-
inition 19.3; as such, from first principles it has no direct interpretation as an operational quantity
related to coding. Nevertheless, they are related by the following coding theorems. We start with
a converse result:
C(I)
Theorem 19.7 (Upper Bound for Cϵ ) For any channel, ∀ϵ ∈ [0, 1), Cϵ ≤ 1−ϵ and C ≤ C(I) .
1
This operation is usually referred to as expurgation which yields a smaller code by killing part of the codebook to reach a
desired property.
i i
i i
i i
376
Proof. Applying the general weak converse bound in Theorem 17.3 to PYn |Xn yields
supPXn I(Xn ; Yn ) + h(ϵ)
log M∗ (n, ϵ) ≤
1−ϵ
Normalizing this by n and taking the lim inf as n → ∞, we have
1 1 supPXn I(Xn ; Yn ) + h(ϵ) C(I)
Cϵ = lim inf log M∗ (n, ϵ) ≤ lim inf = .
n→∞ n n→∞ n 1−ϵ 1−ϵ
X
n
dPX,Y Xn
i(Xn ; Yn ) = log (Xk , Yk ) = i(Xk ; Yk ),
dPX PY
k=1 k=1
n n n n
where i(x; y) = iPX,Y (x; y) and i(x ; y ) = iPXn ,Yn (x ; y ). What is important is that under PXn ,Yn the
random variable i(Xn ; Yn ) is a sum of iid random variables with mean I(X; Y). Thus, by the weak
law of large numbers we have
P[i(Xn ; Yn ) < n(I(X; Y) − δ)] → 0
for any δ > 0.
With this in mind, let us set log M = n(I(X; Y) − 2δ) for some δ > 0, and take τ = δ n in
Shannon’s bound. Then for the error bound we have
" n #
X n→∞
ϵn ≤ P i(Xk ; Yk ) ≤ nI(X; Y) − δ n + exp(−δ n) −−−→ 0, (19.7)
k=1
Since the bound converges to 0, we have shown that there exists a sequence of (n, Mn , ϵn )-codes
with ϵn → 0 and log Mn = n(I(X; Y) − 2δ). Hence, for all n such that ϵn ≤ ϵ
log M∗ (n, ϵ) ≥ n(I(X; Y) − 2δ)
And so
1
Cϵ = lim inf log M∗ (n, ϵ) ≥ I(X; Y) − 2δ
n→∞
n
Since this holds for all PX and all δ > 0, we conclude Cϵ ≥ supPX I(X; Y).
i i
i i
i i
The following result follows from pairing the upper and lower bounds on Cϵ .
Theorem 19.9 (Shannon’s channel coding theorem [378]) For a stationary memory-
less channel,
C = C(I) = sup I(X; Y). (19.8)
PX
As we mentioned several times already this result is among the most significant results in
information theory. From the engineering point of view, the major surprise was that C > 0,
i.e. communication over a channel is possible with strictly positive rate for any arbitrarily small
probability of error. The way to achieve this is to encode the input data jointly (i.e. over many
input bits together). This is drastically different from the pre-1948 methods, which operated on
a letter-by-letter bases (such as Morse code). This theoretical result gave impetus (and still gives
guidance) to the evolution of practical communication systems – quite a rare achievement for an
asymptotic mathematical fact.
Proof. Statement (19.8) contains two equalities. The first one follows automatically from the
second and Theorems 19.7 and 19.8. To show the second equality C(I) = supPX I(X; Y), we note
that for stationary memoryless channels C(I) is in fact easy to compute. Indeed, rather than solving
a sequence of optimization problems (one for each n) and taking the limit of n → ∞, memory-
lessness of the channel implies that only the n = 1 problem needs to be solved. This type of result
is known as single-letterization (or tensorization) in information theory and we show it formally
in the following proposition, which concludes the proof.
Q
Proof. Recall that from Theorem 6.1 we know that for product kernels PYn |Xn = PYi |Xi , mutual
P n
information satisfies I(Xn ; Yn ) ≤ k=1 I(Xk ; Yk ) with equality whenever Xi ’s are independent.
Then
1
C(I) = lim inf sup I(Xn ; Yn ) = lim inf sup I(X; Y) = sup I(X; Y).
n→∞ n P n n→∞ PX PX
X
Shannon’s noisy channel theorem shows that by employing codes of large blocklength, we can
approach the channel capacity arbitrarily close. Given the asymptotic nature of this result (or any
i i
i i
i i
378
other asymptotic result), a natural question is understanding the price to pay for reaching capacity.
This can be understood in two ways:
The main tool in the proof of Theorem 19.8 was the law of large numbers. The lower bound
Cϵ ≥ C(I) in Theorem 19.8 shows that log M∗ (n, ϵ) ≥ nC + o(n) (this just restates the fact
that normalizing by n and taking the lim inf must result in something ≥ C). If instead we apply
a more careful analysis using the central limit theorem (CLT), we obtain the following refined
achievability result.
where Q(·) is the complementary Gaussian CDF and Q−1 (·) is its functional inverse.
Proof. Writing the little-o notation in terms of lim inf, our goal is
log M∗ (n, ϵ) − nC
lim inf √ ≥ −Q−1 (ϵ) = Φ−1 (ϵ),
n→∞ nV
where Φ(t) is the standard normal CDF.
Recall Feinstein’s bound
i i
i i
i i
√
Take log β = nC + nVt, then applying the CLT gives
√ hX √ i
log M ≥ nC + nVt + log ϵ − P i(Xk ; Yk ) ≤ nC + nVt
√
=⇒ log M ≥ nC + nVt + log (ϵ − Φ(t)) ∀Φ(t) < ϵ
log M − nC log(ϵ − Φ(t))
=⇒ √ ≥t+ √ ,
nV nV
where Φ(t) is the standard normal CDF. Taking the liminf of both sides
log M∗ (n, ϵ) − nC
lim inf √ ≥ t,
n→∞ nV
for all t such that Φ(t) < ϵ. Finally, taking t % Φ−1 (ϵ), and writing the liminf in little-oh notation
completes the proof
√ √
log M∗ (n, ϵ) ≥ nC − nVQ−1 (ϵ) + o( n).
Remark 19.1 Theorem 19.9 implies that for any R < C, there exists a sequence of
(n, exp(nR), ϵn )-codes such that the probability of error ϵn vanishes as n → ∞. Examining the
upper bound (19.7), we see that the probability of error actually vanishes exponentially fast, since
the event in the first term is of large-deviations type (recall Chapter 15) so that both terms are
exponentially small. Finding the value of the optimal exponent (or even the existence thereof) has
a long history (but remains generally open) in information theory, see Section 22.4*. Recently,
however, it was understood that a practically more relevant, and also much easier to analyze, is
the regime of fixed (non-vanishing) error ϵ, in which case the main question is to bound the speed
of convergence of R → Cϵ = C. Previous theorem shows one bound on this speed of convergence.
The optimal √1n coefficient is known as channel dispersion, see Sections 22.5 and 22.6 for more.
√
In particular, we will show that the bound on the n term in Theorem 19.11 is often tight.
Y = X + Z mod 2, Z ∼ Ber(δ) ⊥
⊥ X.
i i
i i
i i
380
C
C C
1 bit
1 bit 1 bit
δ
0 1 1 δ δ
2 0 1 0 1
BSCδ BECδ Z-channel
More generally, recalling Example 3.7, for any additive-noise channel over a finite abelian
group G, we have C = supPX I(X; X + Z) = log |G| − H(Z), achieved by X ∼ Unif(G). Similarly,
for a group-noise channel acting over a non-abelian group G by x 7→ x ◦ Z, Z ∼ PZ we also have
capacity equal log |G| − H(Z) and achieved by X ∼ Unif(G).
Next we consider the BECδ . This is a multiplicative channel. Indeed, if we equivalently redefine
the input X ∈ {±1} and output Y ∈ {±1, 0}, then BEC relation can be written as
Y = XZ, Z ∼ Ber(δ) ⊥
⊥ X.
To compute the capacity, we first notice that even without evaluating Shannon’s formula, it is clear
that C ≤ 1 −δ (bit), because for a large blocklength n about δ -fraction of the message is completely
lost (even if the encoder knows a priori where the erasures are going to occur, the rate still cannot
exceed 1 − δ ). More formally, we notice that P[X = 1|Y = 0] = P[X= δ
1]δ
= P[X = 1] and therefore
Finally, the Z-channel can also be thought of as a multiplicative channel with transition law
Y = XZ, X ∈ { 0, 1} ⊥
⊥ Z ∼ Ber(1 − δ) ,
i i
i i
i i
thus, we get
for all measurable E ⊂ Y and x ∈ X . Two symmetries f and g can be composed to produce another
symmetry as
( gi , go ) ◦ ( fi , fo ) ≜ ( gi ◦ fi , fo ◦ go ) . (19.9)
Note that both components of an automorphism f = (fi , fo ) are bimeasurable bijections, that is
fi , f− 1 −1
i , fo , fo are all measurable and well-defined functions.
Naturally, every symmetry group G possesses a canonical left action on X × Y defined as
Since the action on X × Y splits into actions on X and Y , we will abuse notation slightly and write
g · ( x, y) ≜ ( g x , g y ) .
Let us assume in addition that our group G can be equipped with a σ -algebra σ(G) such that
the maps h 7→ hg and h 7→ gh are measurable for each g ∈ G. We say that a probability measure μ
on (G, σ(G)) is a left-invariant Haar measure if when H ∼ μ we also have gH ∼ μ for any g ∈ G.
(See also Exercise V.23.) Existence of Haar measure is trivial for finite (and compact) groups, but
in general is a difficult subject. To proceed we need to make an assumption about the symmetry
group G that we call regularity. (This condition is trivially satisfied whenever X and Y are finite,
thus all the sophistication in these few paragraphs is only relevant to non-discrete channels.)
G×X ×Y →X ×Y
is measurable.
Note that under the regularity assumption the action (19.10) also defines a left action of G on
P(X ) and P(Y) according to
i i
i i
i i
382
or, in words, if X ∼ PX then gX ∼ gPX , and similarly for Y and gY. For every distribution PX we
define an averaged distribution P̄X as
Z
P̄X [E] ≜ PX [g−1 E]ν(dg) , (19.13)
G
which is the distribution of random variable gX when g ∼ ν and X ∼ PX . The measure P̄X is G-
invariant, in the sense that gP̄X = P̄X . Indeed, by left-invariance of ν we have for every bounded
function f
Z Z
f(g)ν(dg) = f(hg)ν(dg) ∀h ∈ G ,
G G
and therefore
Z
−1
P̄X [h E] = PX [(hg)−1 E]ν(dg) = P̄X [E] .
G
In other words, if the pair (X, Y) is generated by taking X ∼ PX and applying PY|X , then the pair
(gX, gY) has marginal distribution gPX but conditional kernel is still PY|X . For finite X , Y this is
equivalent to
which may also be taken as the definition of the automorphism. In terms of the G-action on P(Y)
we may also say:
It is not hard to show that for any channel and a regular group of symmetries G the capacity-
achieving output distribution must be G-invariant, and capacity-achieving input distribution can
be chosen to be G-invariant. That is, the saddle point equation
inf sup D(PY|X kQY |PX ) = sup inf D(PY|X kQY |PX ) ,
PX QY QY PX
i i
i i
i i
can be solved in the class of G-invariant distribution. Often, the action of G is transitive on X (Y ),
in which case the capacity-achieving input (output) distribution can be taken to be uniform.
Below we systematize many popular notions of channel symmetry and explain relationship
between them.
• Note that it is an easy consequence of the definitions that any input-symmetric (resp. output-
symmetric) channel, all rows of the channel matrix PY|X (resp. columns) are permutations of
the first row (resp. column). Hence,
input-symmetric, output-symmetric =⇒ Dobrushin (19.18)
• Group-noise channels satisfy all other definitions of symmetry:
i i
i i
i i
384
to [390] the latin squares that are Cayley tables are precisely the ones in which composition of
two rows (as permutations) gives another row. An example of the latin square which is not a
Cayley table is the following:
1 2 3 4 5
2 5 4 1 3
3 1 2 5 4 . (19.21)
4 3 5 2 1
5 4 1 3 2
1
Thus, by multiplying this matrix by 15 we obtain a counterexample:
Dobrushin, square 6=⇒ group-noise
In fact, this channel is not even input-symmetric. Indeed, suppose there is g ∈ G such that
g4 = 1 (on X ). Then, applying (19.16) with x = 4 we figure out that on Y the action of g must
be:
1 7→ 4, 2 7→ 3, 3 7→ 5, 4 7→ 2, 5 7→ 1 .
But then we have
1
gPY|X=1 = 5 4 2 1 3 · ,
15
which by a simple inspection does not match any of the rows in (19.21). Thus, (19.17) cannot
hold for x = 1. We conclude:
Dobrushin, square 6=⇒ input-symmetric
Similarly, if there were g ∈ G such that g2 = 1 (on Y ), then on X it would act as
1 7→ 2, 2 7→ 5, 3 7→ 1, 4 7→ 3, 5 7→ 4 ,
which implies via (19.16) that PY|X (g1|x) is not a column of (19.21). Thus:
Dobrushin, square 6=⇒ output-symmetric
• Clearly, not every input-symmetric channel is Dobrushin (e.g., BEC). One may even find a
counterexample in the class of square channels:
1 2 3 4
1 3 2 4 1
4 2 3 1 · 10 (19.22)
4 3 2 1
This shows:
input-symmetric, square 6=⇒ Dobrushin
• Channel (19.22) also demonstrates:
Gallager-symmetric, square 6=⇒ Dobrushin .
i i
i i
i i
• Example (19.22) naturally raises the question of whether every input-symmetric channel is
Gallager symmetric. The answer is positive: by splitting Y into the orbits of G we see that a
subchannel X → {orbit} is input and output symmetric. Thus by (19.18) we have:
Since det W 6= 0, the capacity achieving input distribution is unique. Since H(Y|X = x) is
independent of x and PX = [1/4, 1/4, 3/8, 1/8] achieves uniform P∗Y it must be the unique
optimum. Clearly any permutation Tx fixes a uniform P∗Y and thus the channel is weakly input-
symmetric. At the same time it is not Gallager-symmetric since no row is a permutation of
another.
• For more on the properties of weakly input-symmetric channels see [333, Section 3.4.5].
Gallager
1010
1111111
0000000
0000000
1111111 0
1 Dobrushin
0000000
1111111
0000000
1111111 101111
0000000000
1111111111
0000
0000000
1111111 101111
0000000000
1111111111
0000
101111
0000
1111
0000000
1111111 0000000000
1111111111
0000
000
111
0
10000
1111 000
111
0000
1111
0000000
1111111 0000000000
1111111111
0000
1111
000
111
000
111
0
10000
1111
0000000000
1111111111
0000
1111
000
111
000
111 000
111
0000
1111
0000000
1111111
0000
1111 0000
1111
000
111
0000
1111 000
111
000
111
0000000
1111111
0000
1111 0000
1111 000input−symmetric
111
output−symmetric group−noise
i i
i i
i i
386
Definition 19.14 A channel is called information stable if there exists a sequence of input
distributions {PXn , n = 1, 2, . . .} such that
1 n n P (I)
i( X ; Y ) −
→C .
n
For example, we can pick PXn = (P∗X )n for stationary memoryless channels. Therefore
stationary memoryless channels are information stable.
The purpose for defining information stability is the following theorem.
Proof. Like the stationary, memoryless case, the upper bound comes from the general con-
verse Theorem 17.3, and the lower bound uses a similar strategy as Theorem 19.8, except utilizing
the definition of information stability in place of WLLN.
The next theorem gives conditions to check for information stability in memoryless channels
which are not necessarily stationary.
Theorem 19.16 A memoryless channel is information stable if there exists {X∗k : k ≥ 1} such
that both of the following hold:
1X ∗ ∗
n
I(Xk ; Yk ) → C(I) (19.25)
n
k=1
X
∞
1
Var[i(X∗n ; Y∗n )] < ∞ . (19.26)
n2
n=1
i i
i i
i i
where convergence to 0 follows from Kronecker lemma (Lemma 19.17 to follow) applied with
bn = n2 , xn = Var[i(X∗n ; Y∗n )]/n2 .
The second part follows from the first. Indeed, notice that
1X
n
C(I) = lim inf sup I(Xk ; Yk ) .
n→∞ n PXk
k=1
(Note that each supPX I(Xk ; Yk ) ≤ log min{|A|, |B|} < ∞.) Then, we have
k
X
n X
n
I(X∗k ; Y∗k ) ≥ sup I(Xk ; Yk ) − 1 ,
PXk
k=1 k=1
and hence normalizing by n we get (19.25). We next show that for any joint distribution PX,Y we
have
The argument is symmetric in X and Y, so assume for concreteness that |B| < ∞. Then
where (19.29) is because 2 log PY|X (y|x) · log PY (y) is always non-negative, and (19.30) follows
because each term in square-brackets can be upper-bounded using the following optimization
problem:
X
n
g( n) ≜ sup
Pn
aj log2 aj . (19.31)
aj ≥0: j= 1 aj =1 j=1
Since the x log2 x has unbounded derivative at the origin, the solution of (19.31) is always in the
interior of [0, 1]n . Then it is straightforward to show that for n > e the solution is actually aj = 1n .
i i
i i
i i
388
For n = 2 it can be found directly that g(2) = 0.5629 log2 2 < log2 2. In any case,
Finally, because of the symmetry, a similar argument can be made with |B| replaced by |A|.
1 X
n
bj xj → 0
bn
j=1
Proof. Since bn ’s are strictly increasing, we can split up the summation and bound them from
above
X
n X
m X
n
bk xk ≤ bm xk + b k xk
k=1 k=1 k=m+1
1 X bm X X bm X X
n ∞ n ∞ ∞
bk
=⇒ b k xk ≤ xk + xk ≤ xk + xk
bn bn bn bn
k=1 k=1 k=m+1 k=1 k=m+1
1 X X
n ∞
=⇒ lim bk xk ≤ xk → 0
n→∞ bn
k=1 k=m+1
Since this holds for any m, we can make the last term arbitrarily small.
How to show information stability? One important class of channels with memory for which
information stability can be shown easily are Gaussian channels. The complete details will be
shown below (see Sections 20.5* and 20.6*), but here we demonstrate a crucial fact.
For jointly Gaussian (X, Y) we always have bounded variance:
cov[X, Y]
Var[i(X; Y)] = ρ2 (X, Y) log2 e ≤ log2 e , ρ(X, Y) = p . (19.32)
Var[X] Var[Y]
From here by using Var[·] = Var[E[·|X̃]] + Var[·|X̃] we need to compute two terms separately:
σX̃2
log e X̃ 2
− 2
E[i(X̃; Y)|X̃] =
σZ
,
2 σY2
i i
i i
i i
and hence
2 log2 e 4
Var[E[i(X̃; Y)|X̃]] = σ .
4σY4 X̃
On the other hand,
2 log2 e 2 2
Var[i(X̃; Y)|X̃] = [4σX̃ σZ + 2σX̃4 ] .
4σY4
Putting it all together we get (19.32). Inequality (19.32) justifies information stability of all sorts
of Gaussian channels (memoryless and with memory), as we will see shortly.
1X
k
1
Pb ≜ P[Sj 6= Ŝj ] = E[dH (Sk , Ŝk )] , (19.33)
k k
j=1
1X X
k k
1{Si 6= Ŝi } ≤ 1{Sk 6= Ŝk } ≤ 1{Si 6= Ŝi },
k
i=1 i=1
where the first inequality is obvious and the second follow from the union bound. Taking
expectation of the above expression gives the theorem.
Next, the following pair of results is often useful for lower bounding Pb for some specific codes.
i i
i i
i i
390
Proof. Let ei be a length k vector that is 1 in the i-th position, and zero everywhere else. Then
X
k X
k
1{Si =
6 Ŝi } ≥ 1{Sk = Ŝk + ei }
i=1 i=1
1X
k
Pb ≥ P[Sk = Ŝk + ei ]
k
i=1
Theorem 19.20 If A, B ∈ {0, 1}k (with arbitrary marginals!) then for every r ≥ 1 we have
1 k−1
Pb = E[dH (A, B)] ≥ Pr,min (19.34)
k r−1
Pr,min ≜ min{P[B = c′ |A = c] : c, c′ ∈ {0, 1}k , dH (c, c′ ) = r} (19.35)
Next, notice
In statistics, Assouad’s Lemma is a useful tool for obtaining lower bounds on the minimax risk
of an estimator (Section 31.2).
The following is a converse bound for channel coding under BER constraint.
Theorem 19.21 (Converse under BER) Any M-code with M = 2k and bit-error rate Pb
satisfies
supPX I(X; Y)
log M ≤ .
log 2 − h(Pb )
i i
i i
i i
i.i.d.
Proof. Note that Sk → X → Y → Ŝk , where Sk ∼ Ber( 12 ). Recall from Theorem 6.1 that for iid
P
Sn , I(Si ; Ŝi ) ≤ I(Sk ; Ŝk ). This gives us
X
k
sup I(X; Y) ≥ I(X; Y) ≥ I(Si ; Ŝi )
PX
i=1
1X 1
≥k d P[Si = Ŝi ]
k 2
!
1X
k
1
≥ kd P[Si = Ŝi ]
k 2
i=1
1
= kd 1 − Pb = k(log 2 − h(Pb ))
2
where the second line used Fano’s inequality (Theorem 3.12) for binary random variables (or data
processing inequality for divergence), and the third line used the convexity of divergence. One
should note that this last chain of inequalities is similar to the proof of Proposition 6.6.
Pairing this bound with Proposition 19.10 shows that any sequence of codes with Pb → 0 (for
a memoryless channel) must have rate R < C. In other words, relaxing the constraint from Pe to
Pb does not yield any higher rates.
Later in Section 26.3 we will see that channel coding under BER constraint is a special case
of a more general paradigm known as lossy joint source channel coding so that Theorem 19.21
follows from Theorem 26.5. Furthermore, this converse bound is in fact achievable asymptotically
for stationary memoryless channels.
Sk Encoder Xn Yn Decoder Ŝ k
Source (JSCC) Channel (JSCC)
In channel coding we are interested in transmitting M messages and all messages are born equal.
Here we want to convey the source realizations which might not be equiprobable (has redundancy).
Indeed, if Sk is uniformly distributed on, say, {0, 1}k , then we are back to the channel coding setup
i i
i i
i i
392
with M = 2k under average probability of error, and ϵ∗JSCC (k, n) coincides with ϵ∗ (n, 2k ) defined
in Section 22.1.
Here, we look for a clever scheme to directly encode k symbols from A into a length n channel
input such that we achieve a small probability of error over the channel. This feels like a mix
of two problems we have seen: compressing a source and coding over a channel. The following
theorem shows that compressing and channel coding separately is optimal. This is a relief, since
it implies that we do not need to develop any new theory or architectures to solve the Joint Source
Channel Coding problem. As far as the leading term in the asymptotics is concerned, the following
two-stage scheme is optimal: First use the optimal compressor to eliminate all the redundancy in
the source, then use the optimal channel code to add redundancy to combat the noise in the data
transmission.
The result is known as separation theorem since it separates the jobs of compressor and channel
code, with the two blocks interfacing in terms of bits. Note that the source can generate symbols
over very different alphabet than the channel’s input alphabet. Nevertheless, the bit stream pro-
duced by the source code (compressor) is matched to the channel by the channel code. There is
an even more general version of this result (Section Section 26.3).
Theorem 19.22 (Shannon separation theorem) Let the source {Sk } be stationary mem-
oryless on a finite alphabet with entropy H. Let the channel be stationary memoryless with finite
capacity C. Then
(
∗ → 0 R < C/H
ϵJSCC (nR, n) n → ∞.
6→ 0 R > C/H
The interpretation of this result is as follows: Each source symbol has information content
(entropy) H bits. Each channel use can convey C bits. Therefore to reliably transmit k symbols
over n channel uses, we need kH ≤ nC.
Proof. (Achievability.) The idea is to separately compress our source and code it for transmission.
Since this is a feasible way to solve the JSCC problem, it gives an achievability bound. This
separated architecture is
f1 f2 P Yn | X n g2 g1
Sk −→ W −→ Xn −→ Yn −→ Ŵ −→ Ŝk
Where we use the optimal compressor (f1 , g1 ) and optimal channel code (maximum probability of
error) (f2 , g2 ). Let W denote the output of the compressor which takes at most Mk values. Then
from Corollary 11.3 and Theorem 19.9 we get:
1
(From optimal compressor) log Mk > H + δ =⇒ P[Ŝk 6= Sk (W)] ≤ ϵ ∀k ≥ k0
k
1
(From optimal channel code) log Mk < C − δ =⇒ P[Ŵ 6= m|W = m] ≤ ϵ ∀m, ∀k ≥ k0
n
Using both of these,
i i
i i
i i
And therefore if R(H + δ) < C − δ , then ϵ∗ → 0. By the arbitrariness of δ > 0, we conclude the
weak converse for any R > C/H.
(Converse.) To prove the converse notice that any JSCC encoder/decoder induces a Markov
chain
Sk → Xn → Yn → Ŝk .
Applying data processing for mutual information
I(Sk ; Ŝk ) ≤ I(Xn ; Yn ) ≤ sup I(Xn ; Yn ) = nC.
PXn
On the other hand, since P[Sk 6= Ŝk ] ≤ ϵn , Fano’s inequality (Theorem 3.12) yields
I(Sk ; Ŝk ) = H(Sk ) − H(Sk |Ŝk ) ≥ kH − ϵn log |A|k − log 2.
Combining the two gives
nC ≥ kH − ϵn log |A|k − log 2. (19.36)
Since R = nk , dividing both sides by n and sending n → ∞ yields
RH − C
lim inf ϵn ≥ .
n→∞ R log |A|
Therefore ϵn does not vanish if R > C/H.
We remark that instead of using Fano’s inequality we could have lower bounded I(Sk ; Ŝk ) as in
the proof of Theorem 17.3 by defining QSk Ŝk = USk PŜk (with USk = Unif({0, 1}k ) and applying the
data processing inequality to the map (Sk , Ŝk ) 7→ 1{Sk = Ŝk }:
D(PSk Ŝk kQSk Ŝk ) = D(PSk kUSk ) + D(PŜ|Sk kPŜ |PSk ) ≥ d(1 − ϵn k|A|−k )
Rearranging terms yields (19.36). As we discussed in Remark 17.2, replacing D with other f-
divergences can be very fruitful.
In a very similar manner, by invoking Corollary 12.6 and Theorem 19.15 we obtain:
Theorem 19.23 Let source {Sk } be ergodic on a finite alphabet, and have entropy rate H. Let
the channel have capacity C and be information stable. Then
(
= 0 R > H/C
lim ϵ∗JSCC (nR, n)
n→∞ > 0 R < H/C
i i
i i
i i
In this chapter we study data transmission with constraints on the channel input. Namely, in pre-
vious chapter the encoder for blocklength n code was permitted to produce arbitrary sequences
of channel inputs (i.e. the range of the encoder could be all of An ). However, in many practical
problem only a subset of An is allowed to be used. The main such example is the AWGN chan-
nel Example 3.3. If encoder is allowed to produce arbitrary elements of Rn as input, the channel
capacity is infinite: supPX I(X; X + Z) = ∞ (for example, take X ∼ N (0, P) and P → ∞). That
is, one can transmit arbitrarily many messages with arbitrarily small error probability by choos-
ing elements of Rn with giant pairwise distance. In reality, however, allowed channel inputs are
limited by the available1 power and the encoder is only capable of using xn ∈ Rn are satisfying
1X 2
n
xt ≤ P ,
n
t=1
where P > 0 is known as the power constraint. How many bits per channel use can we transmit
under this constraint on the codewords? To answer this question in general, we need to extend
the setup and coding theorems to channels with input constraints. After doing that we will apply
these results to compute capacities of various Gaussian channels (memoryless, with inter-symbol
interference and subject to fading).
An
b b
b
b Fn b
b b
b b b
b b
b
1
or allowed by regulatory bodies, such as the FCC in the US.
394
i i
i i
i i
What type of constraint sets Fn are of practical interest? In the context of Gaussian channels,
we have A = R. Then one often talks about the following constraints:
1X 2 √
n
| xi | ≤ P ⇔ kxn k2 ≤ nP.
n
i=1
√
In other words, codewords must lie in a ball of radius nP.
• Peak power constraint :
Notice that the second type of constraint does not introduce any new problems: we can simply
restrict the input space from A = R to A = [−A, A] and be back into the setting of input-
unconstrained coding. The first type of the constraint is known as a separable cost-constraint.
We will restrict our attention from now on to it exclusively.
Definition 20.2 • A P
code is an (n, M, ϵ, P)-code if it is an (n, M, ϵ)-code satisfying input
n
constraint Fn ≜ {x : n k=1 c(xk ) ≤ P}
n 1
i i
i i
i i
396
• Information capacity
1
C(I) (P) = lim inf sup I(Xn ; Yn )
n→∞ n PXn :E[Pnk=1 c(Xk )]≤nP
• Information stability: Channel is information stable if for all (admissible) P, there exists a
sequence of channel input distributions PXn such that the following two properties hold:
1 P
iP n n (Xn ; Yn )−
→C(I) (P) (20.1)
n X ,Y
P[c(Xn ) > P + δ] → 0 ∀δ > 0 . (20.2)
These definitions clearly parallel those of Definitions 19.3 and 19.6 for channels without input
constraints. A notable and crucial exception is the definition of the information capacity C(I) (P).
Indeed, under input constraints instead of maximizing I(Xn ; Yn ) over distributions supported on
Fn we extend maximization to a richer set of distributions, namely, those satisfying
" n #
X
E c(Xk ) ≤ nP .
k=1
Clearly, if P ∈
/ Dc , then there is no code (even a useless one, with one codeword) satisfying the
input constraint. So in the remaining we always assume P ∈ Dc .
Proof. In the first part all statements are obvious, except for concavity, which follows from the
concavity of PX 7→ I(X; Y). For any PXi such that E [c(Xi )] ≤ Pi , i = 0, 1, let X ∼ λ̄PX0 + λPX1 .
Then E [c(X)] ≤ λ̄P0 + λP1 and I(X; Y) ≥ λ̄I(X0 ; Y0 ) + λI(X1 ; Y1 ). Hence ϕ(λ̄P0 + λP1 ) ≥
λ̄ϕ(P0 ) + λϕ(P1 ). The second claim follows from concavity of ϕ(·).
To extend these results to C(I) (P) observe that for every n
1
P 7→ sup I(Xn ; Yn )
n PXn :E[c(Xn )]≤P
i i
i i
i i
is concave. Then taking lim infn→∞ the same holds for C(I) (P).
An immediate consequence is that memoryless input is optimal for memoryless channel with
separable cost, which gives us the single-letter formula of the information capacity:
Proof. C(I) (P) ≥ ϕ(P) is obvious by using PXn = (PX )⊗n . For “≤”, fix any PXn satisfying the
cost constraint. Consider the chain
( a) X (b) X X
n n ( c)
n
1
I(Xn ; Yn ) ≤ I(Xj ; Yj ) ≤ ϕ(E[c(Xj )]) ≤ nϕ E[c(Xj )] ≤ nϕ(P) ,
n
j=1 j=1 j=1
where (a) follows from Theorem 6.1; (b) from the definition of ϕ; and (c) from Jensen’s inequality
and concavity of ϕ.
Proof. The argument is the same as we used in Theorem 17.3. Take any (n, M, ϵ, P)-code, W →
Xn → Yn → Ŵ. Applying Fano’s inequality and the data-processing, we get
Normalizing both sides by n and taking lim infn→∞ we obtain the result.
Next we need to extend one of the coding theorems to the case of input constraints. We do so for
the Feinstein’s lemma (Theorem 18.7). Note that when F = X , it reduces to the original version.
Theorem 20.7 (Extended Feinstein’s lemma) Fix a Markov kernel PY|X and an arbitrary
PX . Then for any measurable subset F ⊂ X , everyγ > 0 and any integer M ≥ 1, there exists an
(M, ϵ)max -code such that
i i
i i
i i
398
Proof. Similar to the proof of the original Feinstein’s lemma, define the preliminary decoding
regions Ec = {y : i(c; y) ≥ log γ} for all c ∈ X . Next, we apply Corollary 18.4 and find out
that there is a set F0 ⊂ X with two properties: a) PX [F0 ] = 1 and b) for every x ∈ F0 we have
PY (Ex ) ≤ γ1 . We now let F′ = F ∩ F0 and notice that PX [F′ ] = PX [F].
We sequentially pick codewords {c1 , . . . , cM } from the set F′ (!) and define the decoding regions
{D1 , . . . , DM } as Dj ≜ Ecj \ ∪jk− 1
=1 Dk . The stopping criterion is that M is maximal, i.e.,
∀x0 ∈ F′ , PY [Ex0 \ ∪M
j=1 Dj X = x0 ] < 1 − ϵ
′ ′c
⇔ ∀ x0 ∈ X , P Y [ E x 0 \ ∪ M
j=1 Dj X = x0 ] < (1 − ϵ)1[x0 ∈ F ] + 1[x0 ∈ F ]
From here, we complete the proof by following the same steps as in the proof of original Feinstein’s
lemma (Theorem 18.7).
Theorem 20.8 (Achievability bound) For any information stable channel with input
constraints and P > P0 we have
Proof. Let us consider a special case of the stationary memoryless channel (the proof for general
information stable channel follows similarly). Thus, we assume PYn |Xn = (PY|X )⊗n .
Fix n ≥ 1. Choose a PX such that E[c(X)] < P, Pick log M = n(I(X; Y) − 2δ) and log γ =
n(I(X; Y) − δ).
P
With the input constraint set Fn = {xn : 1n c(xk ) ≤ P}, and iid input distribution PXn = P⊗ n
X ,
we apply the extended Feinstein’s lemma. This shows existence of an (n, M, ϵn , P)max -code with
the encoder satisfying input constraint Fn and vanishing (maximal) error probability
Indeed, the first term is vanishing by the weak law of large numbers: since E[c(X)] < P, we have
P
PXn (Fn ) = P[ 1n c(xk ) ≤ P] → 1. Since ϵn → 0 this implies that for every ϵ > 0 we have
1
Cϵ (P) ≥ log M = I(X; Y) − 2δ, ∀δ > 0, ∀PX s.t. E[c(X)] < P
n
⇒ Cϵ (P) ≥ sup lim (I(X; Y) − 2δ)
PX :E[c(X)]<P δ→0
i i
i i
i i
where the last equality is from the continuity of C(I) on (P0 , ∞) by Proposition 20.4.
For a general information stable channel, we just need to use the definition to show that
P[i(Xn ; Yn ) ≤ n(C(I) − δ)] → 0, and the rest of the proof follows similarly.
Theorem 20.9 (Channel capacity under cost constraint) For an information stable
channel with cost constraint and for any admissible constraint P we have
C(P) = C(I) (P).
Proof. The boundary case of P = P0 is treated in Ex. IV.23, which shows that C(P0 ) = C(I) (P0 )
even though C(I) (P) may be discontinuous at P0 . So assume P > P0 next. Theorem 20.6 shows
(I)
Cϵ (P) ≤ C1−ϵ (P)
, thus C(P) ≤ C(I) (P). On the other hand, from Theorem 20.8 we have C(P) ≥
( I)
C ( P) .
Z ∼ N (0, σ 2 )
X + Y
Definition 20.10 (The stationary AWGN channel) The Additive White Gaussian Noise
(AWGN) channel is a stationary memoryless additive-noise channel with separable cost constraint:
A = B = R, c(x) = x2 , and a single-letter kernel PY|X given by Y = X + Z, where Z ∼ N (0, σ 2 ) ⊥⊥
X. The n-letter kernel is given by a product extension, i.e. Yn = Xn + Zn with Zn ∼ N (0, In ). When
the power constraint is E[c(X)] ≤ P we say that the signal-to-noise ratio (SNR) equals σP2 . Note
that our informal definition early on (Example 3.3) lacked specification of the cost constraint
function, without which it was not complete.
The terminology white noise refers to the fact that the noise variables are uncorrelated across
time. This makes the power spectral density of the process {Zj } constant in frequency (or “white”).
We often drop the word stationary when referring to this channel. The definition we gave above is
more correctly should be called the real AWGN, or R-AWGN, channel. The complex AWGN, or
C-channel is defined similarly: A = B = C, c(x) = |x|2 , and Yn = Xn + Zn , with Zn ∼ Nc (0, In )
being the circularly symmetric complex gaussian.
i i
i i
i i
400
Theorem 20.11 For the stationary AWGN channel, the channel capacity is equal to informa-
tion capacity, and is given by:
1 P
( I)
C(P) = C (P) = log 1 + 2 for R-AWGN (20.4)
2 σ
P
( I)
C(P) = C (P) = log 1 + 2 for C-AWGN
σ
Then using Theorem 5.11 (the Gaussian saddle point) to conclude X ∼ N (0, P) (or Nc (0, P)) is
the unique capacity-achieving input distribution.
At this point it is also instructive to revisit Section 6.2* which shows that Gaussian capacity
can in fact be derived essentially without solving the maximization of mutual information: the
Euclidean rotational symmetry implies the optimal input should be Gaussian.
There is a great deal of deep knowledge embedded in the simple looking formula of Shan-
non (20.4). First, from the engineering point of view we immediately see that to transmit
information faster (per unit time) one needs to pay with radiating at higher power, but the payoff
in transmission speed is only logarithmic. The waveforms of good error correcting codes should
look like samples of the white Gaussian process.
Second, the amount of energy spent per transmitted information bit is minimized by solving
P log 2
inf = 2σ 2 loge 2 (20.5)
P>0 C(P)
and is achieved by taking P → 0. (We will discuss the notion of energy-per-bit more in Sec-
tion 21.1.) Thus, we see that in order to maximize communication rate we need to send powerful,
high-power waveforms. But in order to minimize energy-per-bit we need to send in very quiet
“whisper” and at very low communication rate.2 In addition, when signaling with very low power
√
(and hence low rate), by inspecting Figure 3.2 we can see that one can restrict to just binary ± P
symbols (so called BPSK modulation). This results in virtually no loss of capacity.
Third, from the mathematical point of view, formula (20.4) reveals certain properties of high-
dimensional Euclidean geometry
√ as follows. Since Zn ∼ N (0, σ 2 ), then with high probability,
kZ k2 concentrates around nσ . Similarly, due the power constraint and the fact that Zn ⊥
n 2 ⊥ Xn , we
n 2 n 2 n 2
have E kY k = E kY p k + E kZ k ≤ n(P + σ 2 ) and the received vector Yn lies in an ℓ√ 2 -ball
of radius approximately n(P + σ 2 ). Since the noise √ can at most perturb the codeword p by nσ 2
in Euclidean distance, if we can pack M balls of radius nσ 2 into the ℓ2 -ball of radius n(P + σ 2 )
centered at the origin, this yields a good codebook and decoding regions – see Figure 20.1 for an
illustration. So how large can M be? Note that the volume of an ℓ2 -ball of radius r in Rn is given by
2
This explains why, for example, the deep space probes communicate with earth via very low-rate codes and very long
blocklengths.
i i
i i
i i
c3
c4
p n
√ c2
nσ 2
(P
c1
+
σ
2
)
c5
c8
···
c6
c7
cM
2 n/ 2 n/2
cn rn for some constant cn . Then cn (cnn((Pn+σ ))
= 1 + σP2 . Taking the log and dividing by n, we
σ 2 ) n/ 2
∗
get n log M ≈ 2 log 1 + σ2 . This tantalizingly convincing reasoning, however, is flawed in at
1 1 P
least two ways. (a) Computing the volume ratio only gives an upper bound on the maximal number
of disjoint balls (See Section 27.2 for an extensive discussion on this topic.) (b) Codewords need
not correspond to centers of disjoint ℓ2 -balls. √ Indeed, the fact that we allow some vanishing (but
non-zero) probability of error means that the nσ 2 -balls are slightly overlapping and Shannon’s
formula establishes the maximal number of such partially overlapping balls that we can pack so
that they are (mostly) inside a larger ball.
Since Theorem 20.11 applies to Gaussian noise, it is natural to ask: What if the noise is non-
Gaussian and how sensitive is the capacity formula 21 log(1 + SNR) to the Gaussian assumption?
Recall the Gaussian saddle point result in Theorem 5.11 which shows that for the same variance,
Gaussian noise is the worst which shows that the capacity of any non-Gaussian noise is at least
1
2 log(1 + SNR). Conversely, it turns out the increase of the capacity can be controlled by how
non-Gaussian the noise is (in terms of KL divergence). The following result is due to Ihara [223].
Remark 20.1 The quantity D(PZ kN (EZ, σ 2 )) is sometimes called the non-Gaussianness of Z,
where N (EZ, σ 2 ) is a Gaussian with the same mean and variance as Z. So if Z has a non-Gaussian
density, say, Z is uniform on [0, 1], then the capacity can only differ by a constant compared to
i i
i i
i i
402
AWGN, which still scales as 12 log SNR in the high-SNR regime. On the other hand, if Z is discrete,
then D(PZ kN (EZ, σ 2 )) = ∞ and indeed in this case one can show that the capacity is infinite
because the noise is “too weak”.
Proof.
X
L
≤ P
sup sup I(Xk ; Yk )
Pk ≤P,Pk ≥0 k=1 E[X2k ]≤Pk
X
L
1 Pk
= P
sup log(1 + )
Pk ≤P,Pk ≥0 k=1 2 σk2
with equality if Xk ∼ N (0, Pk ) are independent. So the question boils down to the last maximiza-
tion problem, known as problem of optimal power allocation. Denote the Lagrangian multipliers
P
for the constraint Pk ≤ P by λ and for the constraint Pk ≥ 0 by μk . We want to solve
P1 P
max 2 log(1 + σPk2 ) − μk Pk + λ(P − Pk ). First-order condition on Pk gives that
k
1 1
= λ − μ k , μ k Pk = 0
2 σk2 + Pk
therefore the optimal solution is
X
L
Pk = |T − σk2 |+ , T is chosen such that P = |T − σk2 |+
k=1
i i
i i
i i
T
P1 P3
σ22
σ12 σ32
Figure 20.2 Power allocation via water-filling across three parallel channels. Here, the second branch is too
noisy (σ2 too big) for the amount of available power P and the optimal coding should discard (input zeros to)
this branch altogether.
Figure 20.2 illustrates the water-filling solution. It has a number of practically important con-
clusions. First, it gives a precise recipe for how much power to allocate to different frequency
bands. This solution, simple and elegant, was actually pivotal for bringing high-speed internet
to many homes (via cable modems, or ADSL): initially, before information theorists had a say,
power allocations were chosen on the basis of costly and imprecise simulations of real codes. The
simplicity of the water-filling scheme makes power allocation dynamic and enables instantaneous
reaction to changing noise environments.
Second, there is a very important consequence for multiple-antenna (MIMO) communication.
Given nr receive antennas and nt transmit antennas, very often one gets as a result a parallel AWGN
with L = min(nr , nt ) branches (see Exercise I.9 and I.10). For a single-antenna system the capacity
then scales as 12 log P with increasing power (Theorem 20.11), while the capacity for a MIMO
AWGN channel is approximately L2 log( PL ) ≈ L2 log P for large P. This results in an L-fold increase
in capacity at high SNR. This is the basis of a powerful technique of spatial multiplexing in MIMO,
largely behind much of advance in 4G, 5G cellular (3GPP) and post-802.11n WiFi systems.
Notice that spatial diversity (requiring both receive and transmit antennas) is different from a
so-called multipath diversity (which works even if antennas are added on just one side). Indeed,
if a single stream of data is sent through every parallel channel simultaneously, then the sufficient
statistic would be to the average of all received vectors, resulting in a the effective noise level
reduced by L1 factor. This results in capacity increasing from 12 log P to 21 log(LP) – a far cry
from the L-fold increase of spatial multiplexing. These exciting topics are explored in excellent
textbooks [423, 268].
i i
i i
i i
404
Theorem 20.16 Assume that for every T > 0 the following limits exist:
1X1
n
T
(I)
C̃ (T) = lim log+ 2
n→∞ n 2 σj
j=1
1X
n
P̃(T) = lim |T − σj2 |+ .
n→∞ n
j=1
Then the capacity of the non-stationary AWGN channel is given by the parameterized form:
C(T) = C̃(I) (T) with input power constraint P̃(T).
Proof. Fix T > 0. Then it is clear from the water-filling solution in Theorem 20.14 that
X
n
1 T
sup I(Xn ; Yn ) = log+ , (20.7)
2 σj2
j=1
1X
n
E[c(Xn )] ≤ |T − σj2 |+ . (20.8)
n
j=1
Now, by assumption, the LHS of (20.8) converges to P̃(T). Thus, we have that for every δ > 0
Taking δ → 0 and invoking the continuity of P 7→ C(I) (P), we get from Theorem 19.15 that the
information capacity satisfies
log2 e Pj log2 e
Var(i(Xj ; Yj )) = 2
≤
2 Pj + σj 2
and thus
X
n
1
Var(i(Xj ; Yj )) < ∞ .
n2
j=1
Non-stationary AWGN is primarily of interest due to its relationship to the additive colored
Gaussian noise channel in the following section.
i i
i i
i i
Zn : Cov(Zn ) = Σ
multiply by
X̃ U−1 X + Y multiply by
U Ỹ
fZ (ω)
T
ω
−π π
power allocation
Figure 20.3 The ACGN channel: the “whitening” process used in the capacity proof and the water-filling
power allocation across spectrum.
Theorem 20.18 The capacity of the ACGN channel with fZ (ω) > 0 for almost every ω ∈
[−π , π ] is given by the following parametric form:
Z π
1 1 T
C ( T) = log+ dω,
2π −π 2 fZ (ω)
Z π
1
P ( T) = |T − fZ (ω)|+ dω.
2π −π
en = X
Y en + UZn ,
i i
i i
i i
406
e
Cov(UZn ) = U · Cov(Zn ) · U⊤ = Σ
Therefore we have the equivalent channel as follows:
en = X
Y en + Z
en , ej ∼ N (0, σj2 ) independent across j.
Z
Note that since U and U⊤ is orthogonal the maps X̃n = UXn and Xn = U⊤ X̃n preserve the norm
kX̃n k = kXn k. Therefore, capacities of both channels are equal: C = C̃. But the latter follows from
Theorem 20.16. Indeed, we have that
Z π
1X + T
n
e 1 1 T
C = lim log 2
= log+ dω. (Szegö’s theorem, see (6.12))
n→∞ n σj 2π −π 2 f Z (ω)
j=1
1X
n
lim |T − σj2 |+ = P(T).
n→∞ n
j=1
The idea used in the proof as well as the water-filling power allocation are illustrated on Fig-
ure 20.3. Note that most of the time the noise that impacts real-world systems is actually “born”
white (because it is a thermal noise). However, between the place of its injection and the process-
ing there are usually multiple circuit elements. If we model them linearly then their action can
equivalently be described as the ACGN channel, since the effective noise added becomes colored.
In fact, this filtering can be inserted deliberately in order to convert the actual channel into an
additive noise one. This is the content of the next section.
Definition 20.19 (AWGN with ISI) An AWGN channel with ISI is a channel with memory
that is defined as follows: the alphabets are A = B = R, and the separable cost is c(x) = x2 . The
channel law PYn |Xn is given by
X
n
Yk = hk−j Xj + Zk , k = 1, . . . , n
j=1
i.i.d.
where Zk ∼ N (0, σ 2 ) is white Gaussian noise, {hk , k ∈ Z} are coefficients of a discrete-time
channel filter.
The coefficients {hk } describe the action of the environment. They are often learned by the
receiver during the “handshake” process of establishing a communication link.
i i
i i
i i
Theorem 20.20 Suppose that the sequence {hk } is the inverse Fourier transform of a
frequency response H(ω):
Z π
1
hk = eiωk H(ω)dω .
2π −π
Assume also that H(ω) is a continuous function on [0, 2π ]. Then the capacity of the AWGN channel
with ISI is given by
Z π
1 1
C ( T) = log+ (T|H(ω)|2 )dω
2π −π 2
Z π +
1 1
P ( T) = T− dω
2π −π |H(ω)|2
Proof sketch. At the decoder apply the inverse filter with frequency response ω 7→ 1
H(ω) . The
equivalent channel then becomes a stationary colored-noise Gaussian channel:
Ỹj = Xj + Z̃j ,
The capacity achieving input distribution P∗X is discrete, with finitely many atoms on [−A, A]. The
number of atoms is Ω(A) and O(A2 ) as A → ∞. Moreover,
1 2A2 1
log 1 + ≤ C(A) ≤ log 1 + A2
2 eπ 2
i i
i i
i i
408
Capacity achieving input distribution P∗X is discrete, with finitely many atoms on [−A, A]. Moreover,
the convergence speed of limA→∞ C(A, P) = 21 log(1 + P) is of the order e−O(A ) .
2
For details, see [396], [343, Section III] and [144, 348] for the O(A2 ) bound on the number of
atoms.
Hi Zi
E[X2i ] ≤ P
Xi × + Yi receiver
There are two drastically different cases of fading channels, depending on the presence or
absence of the dashed link on Figure 20.4. In the first case, known as the coherent case or the
CSIR case (for channel state information at the receiver), the receiver is assumed to have perfect
estimate of the channel state information Hi at every time i. In other words, the channel output
is effectively (Yi , Hi ). This situation occurs, for example, when there are pilot signals sent peri-
odically and are used at the receiver to estimate the channel. In some cases, the index i refers to
different frequencies or sub-channels of an OFDM frame.
Whenever Hj is a stationary ergodic process, we have the channel capacity given by:
1 P | H| 2
C(P) = E log 1 +
2 σ2
i i
i i
i i
and the capacity achieving input distribution is the usual PX = N (0, P). Note that the capacity
C(P) is in the order of log P and we call the channel “energy efficient”.
In the second case, known as non-coherent or no-CSIR, the receiver does not have any knowl-
edge of the coefficients Hi ’s. In this case, there is no simple expression for the channel capacity.
Most of the known results were shown for the case of iid Hi according to a Rayleigh distribution.
In this case, the capacity achieving input distribution is discrete [3], and the capacity scales as
[415, 269]
C(P) = O(log log P), P→∞ (20.9)
This channel is said to be “energy inefficient” since increasing the communication rate requires
dramatic expenditures in power.
Further generalization of the Gaussian channel models requires introducing multiple input and
output antennas (known as MIMO channel). In this case, the single-letter input Xi ∈ Cnt and the
output Yi ∈ Cnr are related by
Yi = Hi Xi + Zi , (20.10)
i.i.d.
where Zi ∼ CN (0, σ 2 Inr ), nt and nr are the number of transmit and receive antennas, and Hi ∈
Cnt ×nr is a matrix-valued channel gain process. For the capacity of this channel under CSIR,
see Exercise I.10. An incredible effort in the 1990s and 2000s was invested by the information-
theoretic and communication-theoretic researchers to understand this channel model. Some of the
highlights include:
It is not possible to do any justice to these and many other fundamental results in MIMO communi-
cation here, unfortunately. We suggest textbook [423] as an introduction to this deep and exciting
field.
i i
i i
i i
In this chapter we will consider an interesting variation of the channel coding problem. Instead
of constraining the blocklength (i.e. the number of channel uses), we will constrain the total cost
incurred by the codewords. The most important special case of this problem is that of the AWGN
channel and quadratic (energy) cost constraint. The standard motivation in this setting is the fol-
lowing. Consider a deep space probe which has a k bit message that needs to be delivered to Earth
(or a satellite orbiting it). The duration of transmission is of little worry for the probe, but what is
really limited is the amount of energy it has stored in its battery. In this chapter we will learn how
to study this question abstractly, how coding over large number of bits k → ∞ reduces the energy
spent (per bit), and how this fundamental limit is related to communication over continuous-time
channels.
21.1 Energy-per-bit
In this chapter we will consider Markov kernels PY∞ |X∞ acting between two spaces of infinite
sequences. The prototypical example is again the AWGN channel:
Note that in this chapter we have denoted the noise level for Zi to be N20 . There is a long tradition for
such a notation. Indeed, most of the noise in communication systems is a white thermal noise at the
receiver. The power spectral density of that noise is flat and denoted by N0 (in Joules per second
per Hz). However, recall that received signal is complex-valued and, thus, each real component
has power N20 . Note also that thermodynamics suggests that N0 = kT, where k = 1.38 × 10−23 is
the Boltzmann constant, and T is the absolute temperature in Kelvins.
In previous chapter, we analyzed the maximum number of information messages (M∗ (n, ϵ, P))
that can be sent through this channel for a given n number of channel uses and under the power
constraint P. We have also hinted that in (20.5) that there is a fundamental minimal required cost
to send each (data) bit. Here we develop this question more rigorously. Everywhere in this chapter
for v ∈ R∞
X
∞
kvk22 ≜ v2j .
j=1
410
i i
i i
i i
Definition 21.1 ((E, 2k , ϵ)-code) Given a Markov kernel with input space R∞ we define
an (E, 2k , ϵ)-code to be an encoder-decoder pair, f : [2k ] → R∞ and g : R∞ → [2k ] (or similar
randomized versions), such that for all messages m ∈ [2k ] we have kf(m)k22 ≤ E and
P[g(Y∞ ) 6= W] ≤ ϵ ,
The operational meaning of E∗ (k, ϵ) should be apparent: it is the minimal amount of energy the
space probe needs to draw from the battery in order to send k bits of data.
Theorem 21.2 ((Eb /N0 )min = −1.6dB) For the AWGN channel we have
E∗ (k, ϵ) N0
lim lim sup = . (21.2)
ϵ→0 k→∞ k log2 e
Remark 21.1 This result, first obtained by Shannon [378], is colloquially referred to as: min-
imal Eb /N0 (pronounced “eebee over enzero” or “ebno”) is −1.6 dB. The latter value is simply
10 log10 ( log1 e ) ≈ −1.592. Achieving this value of the ebno was an ultimate quest for coding the-
2
ory, first resolved by the turbo codes [47]. See [101] for a review of this long conquest. We also
remark that the fundamental limit is unchanged if instead of real-valued AWGN channel we used
a C-AWGN channel
Yi = Xi + Zi , Zi ∼ CN (0, N0 )
P∞
and energy constraint i=1 |Xi |2 ≤ E. Indeed, this channel’s single input can be simply converted
into a pair of inputs for the R-AWGN channel. This double the blocklength, but it is anyway
considered to be infinite.
Proof. We start with a lower bound (or the “converse” part). As usual, we have the working
probability space
W → X∞ → Y∞ → Ŵ .
i i
i i
i i
412
X
∞
1 EX2i
≤ log 1 + Gaussian capacity, Theorem 5.11
2 N0 /2
i=1
log e X EX2i
∞
≤ linearization of log
2 N0 /2
i=1
E
≤ log e.
N0
Thus, we have shown
E∗ (k, ϵ) N0 h(ϵ)
≥ ϵ−
k log e k
and taking the double limit in n → ∞ then in ϵ → 0 completes the proof.
Next, for the upper bound (the “achievability” part). We first give a traditional existential proof.
Notice that a (n, 2k , ϵ, P)-code for the AWGN channel is also a (nP, 2k , ϵ)-code for the energy
problem without time constraint. Therefore,
P
= 1 P
,
2 log(1 + N0 /2 )
where in the last step we applied Theorem 20.11. Now the above statement holds for every P > 0,
so let us optimize it to get the best bound:
E∗ (kn , ϵ) P
lim sup ≤ inf 1 P
n→∞ kn P≥0
2 log(1 + N0 / 2 )
P
= lim
P→0 1 log(1 + P
2 N0 / 2 )
N0
= (21.3)
log2 e
Note that the fact that minimal energy per bit is attained at P → 0 implies that in order to send
information reliably at the Shannon limit of −1.6dB, infinitely many time slots are needed. In
other words, the information rate (also known as spectral efficiency) should be vanishingly small.
Conversely, in order to have non-zero spectral efficiency, one necessarily has to step away from
the −1.6 dB. This tradeoff is known as spectral efficiency vs energy-per-bit.
We next can give a simpler and more explicit construction of the code, not relying on the random
coding implicit in Theorem 20.11. Let M = 2k and consider the following code, known as the
i i
i i
i i
It is not hard to derive an upper bound on the probability of error that this code achieves [337,
Theorem 2]:
" ( r ! )#
2E
ϵ ≤ E min MQ + Z ,1 , Z ∼ N (0, 1) . (21.5)
N0
Indeed, our orthogonal codebook under a maximum likelihood decoder has probability of error
equal to
Z " r !#M−1
∞ √
(z− E)2
1 2 − N
Pe = 1 − √ 1−Q z e 0 dz , (21.6)
πN0 −∞ N0
which is obtained by observing that conditioned on (W = j,q Zj ) the events {||cj + z||2 ≤ ||cj +
z − ci ||2 }, i 6= j are independent. A change of variables x = N20 z and application of the bound
1 − (1 − y)M−1 ≤ min{My, 1} weakens (21.6) to (21.5).
To see that (21.5) implies (21.3), fix c > 0 and condition on |Z| ≤ c in (21.5) to relax it to
r
2E
ϵ ≤ MQ( − c) + 2Q(c) .
N0
x2 log e 1
log Q(x) = − − log x − log 2π + o(1) , x→∞ (21.7)
2 2
r
2E
2k Q( − c) → 0
N0
as k → ∞. Thus choosing c > 0 sufficiently large we obtain that lim supk→∞ E∗ (k, ϵ) ≤ (1 +
τ ) logN0 e for every τ > 0. Taking τ → 0 implies (21.3).
2
Remark 21.2 (Simplex conjecture) The code (21.4) in fact achieves the first three terms
in the large-k expansion of E∗ (k, ϵ), cf. [337, Theorem 3]. In fact, the code can be further slightly
√ √
optimized by subtracting the common center of gravity (2−k E, . . . , 2−k E . . .) and rescaling
each codeword to satisfy the power constraint. The resulting constellation is known as the simplex
code. It is conjectured to be the actual optimal code achieving E∗ (k, ϵ) for a fixed k and ϵ; see [105,
Section 3.16] and [401] for more.
i i
i i
i i
414
P[g(Y∞ ) 6= W] ≤ ϵ ,
Let C(P) be the capacity-cost function of the channel (in the usual sense of capacity, as defined
in (20.1)). Assuming P0 = 0 and C(0) = 0 it is not hard to show (basically following the steps of
Theorem 21.2) that:
C(P) C(P) d
Cpuc = sup = lim = C(P) .
P P P→ 0 P dP P=0
The surprising discovery of Verdú [434] is that one can avoid computing C(P) and derive the Cpuc
directly. This is a significant help, as for many practical channels C(P) is unknown. Additionally,
this gives a yet another fundamental meaning to the KL divergence.
Q
Theorem 21.4 For a stationary memoryless channel PY∞ |X∞ = PY|X with P0 = c(x0 ) = 0
(i.e. there is a symbol of zero cost), we have
D(PY|X=x kPY|X=x0 )
Cpuc = sup .
x̸=x0 c(x)
Proof. Let
D(PY|X=x kPY|X=x0 )
CV = sup .
x̸=x0 c(x)
i i
i i
i i
where we denoted for convenience d(x) ≜ D(PY|X=x kPY|X=x0 ). By the definition of CV we have
d(x) ≤ c(x)CV .
where the last step is by the cost constraint (21.8). Thus, dividing by E and taking limits we get
Cpuc ≤ CV .
Achievability: We generalize the PPM code (21.4). For each x1 ∈ X and n ∈ Z+ we define the
encoder f as follows:
f(1) = (x1 , x1 , . . . , x1 , x0 , . . . , x0 )
| {z } | {z }
n-times n(M−1)-times
f(2) = (x0 , x0 , . . . , x0 , x1 , . . . , x1 , x0 , . . . , x0 )
| {z } | {z } | {z }
n-times n-times n(M−2)-times
···
f ( M ) = ( x0 , . . . , x0 , x1 , x1 , . . . , x1 )
| {z } | {z }
n(M−1)-times n-times
Now, by Stein’s lemma (Theorem 14.14) there exists a subset S ⊂ Y n with the property that
i i
i i
i i
416
Yn ∈ S =⇒ Ŵ = 1
Y2n
n+1 ∈S =⇒ Ŵ = 2
···
From the union bound we find that the overall probability of error is bounded by
At the same time the total cost of each codeword is given by nc(x1 ). Thus, taking n → ∞ and
after straightforward manipulations, we conclude that
D(PY|X=x1 kPY|X=x0 )
Cpuc ≥ .
c(x1 )
This holds for any symbol x1 ∈ X , and so we are free to take supremum over x1 to obtain Cpuc ≥
CV , as required.
Yj = Hj Xj + Zj , Hj ∼ Nc (0, 1) ⊥
⊥ Zj ∼ Nc (0, N0 ).
(We use here a more convenient C-valued fading channel with Hj ∼ Nc , known as the Rayleigh
fading). The cost function is the usual quadratic one c(x) = |x|2 . As we discussed previously,
cf. (20.9), the capacity-cost function C(P) is unknown in closed form, but is known to behave
drastically different from the case of non-fading AWGN (i.e. when Hj = 1). So here Theorem 21.4
comes handy. Let us perform a simple computation required, cf. (2.9):
D(Nc (0, |x|2 + N0 )kNc (0, N0 ))
Cpuc = sup
x̸=0 | x| 2
log(1 + |Nx|0 )
2
1
= sup log e − |x|2
(21.11)
N0 x̸=0
N0
log e
=
N0
Comparing with Theorem 21.2 we discover that surprisingly, the capacity-per-unit-cost is unaf-
fected by the presence of fading. In other words, the random multiplicative noise which is so
i i
i i
i i
detrimental at high SNR, appears to be much more benign at low SNR (recall that Cpuc = C′ (0)
and thus computing Cpuc corresponds to computing C(P) at P → 0). There is one important differ-
ence: the supremization over x in (21.11) is solved at x = ∞. Following the proof of the converse
bound, we conclude that any code hoping to achieve optimal Cpuc must satisfy a strange constraint:
X X
|xt |2 1{|xt | ≥ A} ≈ | xt | 2 ∀A > 0
t t
i.e. the total energy expended by each codeword must be almost entirely concentrated in very
large spikes. Such a coding method is called “flash signaling”. Thus, we can see that unlike the
non-fading AWGN (for which due to rotational symmetry all codewords can be made relatively
non-spiky), the only hope of achieving full Cpuc in the presence of fading is by signaling in short
bursts of energy. Thus, while the ultimate minimal energy-per-bit is the same for the AWGN or
the fading channel, the nature of optimal coding schemes is rather different.
Another fundamental difference between the two channels is revealed in the finite blocklength
behavior of E∗ (k, ϵ). Specifically, we have the following different asymptotic expansions for the
∗
energy-per-bit E (kk,ϵ) :
r
E∗ (k, ϵ) constant −1
= (−1.59 dB) + Q (ϵ) (AWGN)
k k
r
E∗ (k, ϵ) 3 log k 2
= (−1.59 dB) + (Q−1 (ϵ)) (non-coherent fading.)
k k
That is we see that the speed of convergence to Shannon limit is much slower under fading. Fig-
ure 21.1 shows this effect numerically by plotting evaluation of (the upper and lower bounds for)
E∗ (k, ϵ) for the fading and non-fading channel. We see that the number of data bits k that need
to be coded over for the fading channel is about factor of 103 larger than for the AWGN channel.
See [463] for further details.
i i
i i
i i
418
14
12
10
Achievability
8
Converse
dB
2
fading+CSIR, non-fading AWGN
0
−1.59 dB
−2
100 101 102 103 104 105 106 107 108
Information bits, k
Figure 21.1 Comparing the energy-per-bit required to send a packet of k-bits for different channel models
∗
(curves represent upper and lower bounds on the unknown optimal value E (k,ϵ) k
). As a comparison: to get to
−1.5 dB one has to code over 6 · 104 data bits when the channel is non-fading AWGN or fading AWGN with
Hj known perfectly at the receiver. For fading AWGN without knowledge of Hj (no CSI), one has to code over
at least 7 · 107 data bits to get to the same −1.5 dB. Plot generated using [397].
channel by introducing the standard Wiener process (Brownian motion) Wt and setting
Z t r
N0
Yint (t) = X(τ )dτ + Wt ,
0 2
where Wt is the zero-mean Gaussian process with covariance function
E[Ws Wt ] = min(s, t) .
Denote by L2 ([0, T]) the space of all square-integrable functions on [0, T]. Let M∗ (T, ϵ, P) the
maximum number of messages that can be sent through this channel such that given an encoder
f : [M] → L2 [0, T] for each m ∈ [M] the waveform x(t) ≜ f(m)
Theorem 21.5 The maximal reliable rate of communication across the continuous-time AWGN
P
channel is N0 log e (per unit of time). More formally, we have
1 P
lim lim inf log M∗ (T, ϵ, P) = log e (21.12)
ϵ→0 T→∞ T N0
i i
i i
i i
Proof. Note that the space L2 [0, T] has a countable basis (e.g. sinusoids). Thus, by expanding our
input and output waveforms in that basis we obtain an equivalent channel model:
N
0
Ỹj = X̃j + Z̃j , Z̃j ∼ N 0, ,
2
and energy constraint (dependent upon duration T):
X
∞
X̃2j ≤ PT .
j=1
But then the problem is equivalent to the energy-per-bit for the (discrete-time) AWGN channel
(see Theorem 21.2) and hence
Thus,
1 P P
lim lim inf log2 M∗ (T, ϵ, P) = E∗ (k,ϵ)
= log2 e ,
ϵ→0 n→∞ T limϵ→0 lim supk→∞ N0
k
1
Here we already encounter a major issue: the waveform x(t) supported on a finite interval (0, T] cannot have spectrum
supported on a compact. The requirements of finite duration and finite spectrum are only satisfied by the zero waveform.
Rigorously, one should relax the bandwidth constraint to requiring that the signal have a vanishing out-of-band energy as
T → ∞. As we said, rigorous treatment of this issue lead to the theory of prolate spheroidal functions [391].
i i
i i
i i
420
In other words, the capacity of this channel is B log(1 + NP0 B ). To understand the idea of the proof,
we need to recall the concept of modulation first. Every signal X(t) that is required to live in
[fc − B/2, fc + B/2] frequency band can be obtained by starting with a complex-valued signal XB (t)
with frequency content in [−B/2, B/2] and mapping it to X(t) via the modulation:
√
X(t) = Re(XB (t) 2ejωc t ) ,
where ωc = 2πfc . Upon receiving the sum Y(t) = X(t) + N(t) of the signal and the white noise
N(t) we may demodulate Y by computing
√
YB (t) = 2LPF(Y(t)ejωc t ), ,
where the LPF is a low-pass filter removing all frequencies beyond [−B/2, B/2]. The important
fact is that converting from Y(t) to YB (t) does not lose information.
Overall we have the following input-output relation:
e ( t) ,
YB (t) = XB (t) + N
e (t) is a complex Gaussian white noise and
where all processes are C-valued and N
e ( t) N
E[ N e (s)∗ ] = N0 δ(t − s).
where sincB (x) = sin(xBx) and Xi = XB (i/B). After the Nyquist sampling on XB and YB we get the
following equivalent input-output relation:
Yi = Xi + Zi , Zi ∼ Nc (0, N0 ) (21.14)
R∞
where the noise Zi = t=−∞ N e (t)sincB (t − i )dt. Finally, given that XB (t) is only non-zero for
B
t ∈ (0, T] we see that the C-AWGN channel (21.14) is only allowed to be used for i = 1, . . . , TB.
This fact is known in communication theory as “bandwidth B and duration T signal has BT complex
degrees of freedom”.
Let us summarize what we obtained so far:
i i
i i
i i
i i
i i
i i
In previous chapters our main object of study was the fundamental limit of blocklength-n coding:
Finally, the finite blocklength information theory strives to prove the sharpest possible computa-
tional bounds on log M∗ (n, ϵ) at finite n, which allows evaluating real-world codes’ performance
taking their latency n into account. These results are surveyed in this chapter.
Theorem 22.1 For any stationary memoryless channel with either |A| < ∞ or |B| < ∞ we
have Cϵ = C for 0 < ϵ < 1. Equivalently, for every 0 < ϵ < 1 we have
422
i i
i i
i i
Pe
1
10−1
10−2
10−3
10−4
10−5
SNR
In other words, below a certain critical SNR, the probability of error quickly approaches 1, so
that the receiver cannot decode anything meaningful. Above the critical SNR the probability of
error quickly approaches 0 (unless there is an effect known as the error floor, in which case prob-
ability of error decreases reaches that floor value and stays there regardless of the further SNR
increase). Thus, long-blocklength codes have a threshold-like behavior of probability of error sim-
ilar to (22.1). Besides changing SNR instead of rate, there is another important difference between
a waterfall plot and (22.1). The former applies to only a single (perhaps rather suboptimal) code,
while the latter is a statement about the best possible code for each (n, R) pair.
Proof. We will improve the method used in the proof of Theorem 17.3. Take an (n, M, ϵ)-code
and consider the usual probability space
W → Xn → Yn → Ŵ ,
where W ∼ Unif([M]). Note that PXn is the empirical distribution induced by the encoder at the
channel input. We denote the joint measure on (W, Xn , Yn , Ŵ) induced in this way by P. Our goal
is to replace this probability space with a different one where the true channel PYn |Xn = P⊗ n
Y|X is
replaced with an auxiliary channel (which is a “dummy” one in this case):
i i
i i
i i
424
We will denote the measure on (W, Xn , Yn , Ŵ) induced by this new channel by Q. Note that for
communication purposes, QYn |Xn is a useless channel since it ignores the input and randomly picks
i.i.d.
a member of the output space according to Yi ∼ QY , so that Xn and Yn are independent (under Q).
Therefore, for the probability of success under each channel we have
1
Q[Ŵ = W] =
M
P[Ŵ = W] ≥ 1 − ϵ
n o
Therefore, the random variable 1 Ŵ = W is likely to be 1 under P and likely to be 0 under Q.
It thus looks like a rather good choice for a binary hypothesis test statistic distinguishing the two
distributions, PW,Xn ,Yn ,Ŵ and QW,Xn ,Yn ,Ŵ . Since no hypothesis test can beat the optimal (Neyman-
Pearson) test, we get the upper bound
1
β1−ϵ (PW,Xn ,Yn ,Ŵ , QW,Xn ,Yn ,Ŵ ) ≤ (22.2)
M
(Recall the definition of β from (14.3).) The likelihood ratio is a sufficient statistic for this
hypothesis test, so let us compute it:
PW,Xn ,Yn ,Ŵ PW PXn |W PYn |Xn PŴ|Yn PW|Xn PXn ,Yn PŴ|Yn PXn ,Yn
= ⊗
= =
QW,Xn ,Yn ,Ŵ n
PW PXn |W (QY ) PŴ|Yn PW|Xn PXn (QY )⊗n PŴ|Yn PXn (QY )⊗n
Therefore, inequality above becomes
1
β1−ϵ (PXn ,Yn , PXn (QY )⊗n ) ≤ (22.3)
M
Computing the LHS of this bound may appear to be impossible because the distribution PXn
depends on the unknown code. However, it will turn out that a judicious choice of QY will make
knowledge of PXn unnecessary. Before presenting a formal argument, let us consider a special case
of the BSCδ channel. It will show that for symmetric channels we can select QY to be the capacity
achieving output distribution (recall, that it is unique by Corollary 5.5). To treat the general case
later we will (essentially) decompose the channel into symmetric subchannels (corresponding to
“composition” of the input).
Special case: BSCδ . So let us take PYn |Xn = BSC⊗ δ
n
and for QY we will take the capacity
achieving output distribution which is simply QY = Ber(1/2).
PYn |Xn (yn |xn ) = PnZ (yn − xn ), Zn ∼ Ber(δ)n
( Q Y ) ⊗ n ( yn ) = 2 − n
From the Neyman-Pearson lemma, the optimal HT takes the form
⊗n PXn Yn PXn Yn
βα (PXn Yn , PXn (QY ) ) = Q log ≥ γ where α = P log ≥γ
| {z } | {z } PXn (QY )⊗n PXn (QY )⊗n
P Q
i i
i i
i i
Notice that the effect of unknown PXn completely disappeared, and so we can compute βα :
1
βα (PXn Yn , PXn (QY )⊗n ) = βα (Ber(δ)⊗n , Ber( )⊗n ) (22.4)
2
1
= exp{−nD(Ber(δ)kBer( )) + o(n)} (by Stein’s Lemma: Theorem 14.14)
2
Putting this together with our main bound (22.3), we see that any (n, M, ϵ) code for the BSC
satisfies
1
log M ≤ nD(Ber(δ)kBer( )) + o(n) = nC + o(n) .
2
Clearly, this implies the strong converse for the BSC. (For a slightly different, but equivalent, proof
see Exercise IV.32 and for the AWGN channel see Exercise IV.33).
For the general channel, let us denote by P∗Y the capacity achieving output distribution. Recall
that by Corollary 5.5 it is unique and by (5.1) we have for every x ∈ A:
This property will be very useful. We next consider two cases separately:
1 If |B| < ∞ we take QY = P∗Y and note that from (19.31) we have
X
PY|X (y|x0 ) log2 PY|X (y|x0 ) ≤ log2 |B| ∀ x0 ∈ A
y
and since miny P∗Y (y)> 0 (without loss of generality), we conclude that for some constant
K > 0 and for all x0 ∈ A we have
PY|X (Y|X = x0 )
Var log | X = x0 ≤ K < ∞ .
QY (Y)
Thus, if we let
X
n
PY|X (Yi |Xi )
Sn = log ,
P∗Y (Yi )
i=1
then we have
i i
i i
i i
426
By simple counting1 it is clear that from any (n, M, ϵ) code, it is possible to select an (n, M′ , ϵ)
subcode, such that a) all codeword have the same composition P0 ; and b) M′ > (n+1M )|A|−1
. Note
′ ′
that, log M = log M + O(log n) and thus we may replace M with M and focus on the analysis of
the chosen subcode. Then we set QY = PY|X ◦ P0 . From now on we also assume that P0 (x) > 0
for all x ∈ A (otherwise just reduce A). Let i(x; y) denote the information density with respect
to P0 PY|X . If X ∼ P0 then I(X; Y) = D(PY|X kQY |P0 ) ≤ log |A| < ∞ and we conclude that
PY|X=x QY for each x and thus
dPY|X=x
i(x; y) = log ( y) .
dQY
From (19.28) we have
So if we define
X
n
dPY|X=Xi (Yi |Xi ) X n
Sn = log ( Yi ) = i(Xi ; Yi ) ,
dQY
i=1 i=1
1
This kind of reduction from a general code to a constant-composition subcode is the essence of the method of types [115].
i i
i i
i i
the Theorem, the strong converse for it does hold as well (see Ex. IV.33). Third, this method of
proof is also known as “sphere-packing”, for the reason that becomes clear if we do the example
of the BSC slightly differently (see Ex. IV.32).
A = {(j, m) : j, m ∈ Z+ , 0 ≤ j ≤ m} .
sup I(X; Y) = C .
PX
Thus by Theorem 19.9 the capacity of the corresponding stationary memoryless channel is C. We
next show that nevertheless the ϵ-capacity can be strictly greater than C.
Indeed, fix blocklength n and consider a single letter distribution PX assigning equal weights
to all atoms (j, m) with m = exp{2nC}. It can be shown that in this case, the distribution of a
i i
i i
i i
428
22.3 Meta-converse
We have seen various ways in which one can derive upper (impossibility or converse) bounds on
the fundamental limits such as log M∗ (n, ϵ). In Theorem 17.3 we used data-processing and Fano’s
inequalities. In the proof of Theorem 22.1 we reduced the problem to that of hypothesis testing.
There are many other converse bounds that were developed over the years. It turns out that there
is a very general approach that encompasses all of them. For its versatility it is sometimes referred
to as the “meta-converse”.
To describe it, let us fix a Markov kernel PY|X (usually, it will be the n-letter channel PYn |Xn ,
but in the spirit of “one-shot” approach, we avoid introducing blocklength). We are also given a
certain (M, ϵ) code and the goal is to show that there is an upper bound on M in terms of PY|X and
ϵ. The essence of the meta-converse is described by the following diagram:
PY |X
W Xn Yn Ŵ
QY |X
Here the W → X and Y → Ŵ represent encoder and decoder of our fixed (M, ϵ) code. The upper
arrow X → Y corresponds to the actual channel, whose fundamental limits we are analyzing. The
lower arrow is an auxiliary channel that we are free to select.
The PY|X or QY|X together with PX (distribution induced by the code) define two distribu-
tions: PX,Y and QX,Y . Consider a map (X, Y) 7→ Z ≜ 1{W 6= Ŵ} defined by the encoder and
decoder pair (if decoders are randomized or W → X is not injective, we consider a Markov kernel
PZ|X,Y (1|x, y) = P[Z = 1|X = x, Y = y] instead). We have
PX,Y [Z = 0] = 1 − ϵ, QX,Y [Z = 0] = 1 − ϵ′ ,
i i
i i
i i
where ϵ and ϵ′ are the average probabilities of error of the given code under the PY|X and QY|X
respectively. This implies the following relation for the binary HT problem of testing PX,Y vs
QX,Y :
The high-level idea of the meta-converse is to select a convenient QY|X , bound 1 − ϵ′ from above
(i.e. prove a converse result for the QY|X ), and then use the Neyman-Pearson β -function to lift the
Q-channel converse to P-channel.
How one chooses QY|X is a matter of art. For example, in the proof of Case 2 of Theorem 22.1
we used the trick of reducing to the constant-composition subcode. This can instead be done by
taking QYn |Xn =c = (PY|X ◦ P̂c )⊗n . Since there are at most (n + 1)|A|−1 different output distributions,
we can see that
(n + 1)∥A|−1
1 − ϵ′ ≤ ,
M
and bounding of β can be done similar to Case 2 proof of Theorem 22.1. For channels with
|A| = ∞ the technique of reducing to constant-composition codes is not available, but the meta-
converse can still be applied. Examples include proof of parallel AWGN channel’s dispersion [333,
Theorem 78] and the study of the properties of good codes [340, Theorem 21].
However, the most common way of using meta-converse is to apply it with the trivial channel
QY|X = QY . We have already seen this idea in Section 22.1. Indeed, with this choice the proof
of the converse for the Q-channel is trivial, because we always have: 1 − ϵ′ = M1 . Therefore, we
conclude that any (M, ϵ) code must satisfy
1
≥ β1−ϵ (PX,Y , PX QY ) . (22.12)
M
Or, after optimization we obtain
1
≥ inf sup β1−ϵ (PX,Y , PX QY ) .
M∗ (ϵ) PX QY
This is a special case of the meta-converse known as the minimax meta-converse. It has a number
of interesting properties. First, the minimax problem in question possesses a saddle-point and is of
convex-concave type [341]. It, thus, can be seen as a stronger version of the capacity saddle-point
result for divergence in Theorem 5.4.
Second, the bound given by the minimax meta-converse coincides with the bound we obtained
before via linear programming relaxation (18.22), as discovered by [295]. To see this connection,
instead of writing the meta-converse as an upper bound M (for a given ϵ) let us think of it as an
upper bound on 1 − ϵ (for a given M).
We have seen that existence of an (M, ϵ)-code for PY|X implies existence of the (stochastic) map
(X, Y) 7→ Z ∈ {0, 1}, denoted by PZ|X,Y , with the following property:
1
PX,Y [Z = 0] ≥ 1 − ϵ, and P X QY [ Z = 0] ≤ ∀ QY .
M
i i
i i
i i
430
That is PZ|X,Y is a test of a simple null hypothesis (X, Y) ∼ PX,Y against a composite alternative
(X, Y) ∼ PX QY for an arbitrary QY . In other words every (M, ϵ) code must satisfy
1 − ϵ ≤ α̃(M; PX ) ,
Let us now replace PX with a π x ≜ MPX (x), x ∈ X . It is clear that π ∈ [0, 1]X . Let us also
replace the optimization variable with rx,y ≜ MPZ|X,Y (0|x, y)PX (x). With these notational changes
we obtain
1 X X
α̃(M; PX ) = sup{ PY|X (y|x)rx,y : 0 ≤ rx,y ≤ π x , rx,y ≤ 1} .
M x, y x
It is now obvious that α̃(M; PX ) = SLP (π ) defined in (18.21). Optimizing over the choice of PX
P
(or equivalently π with x π x ≤ M) we obtain
1 1 X S∗ (M)
1−ϵ≤ SLP (π ) ≤ sup{SLP (π ) : π x ≤ M} = LP .
M M x
M
Now recall that in (18.23) we showed that a greedy procedure (essentially, the same as the one we
used in the Feinstein’s bound Theorem 18.7) produces a code with probability of success
1 S∗ (M)
1 − ϵ ≥ (1 − ) LP .
e M
This indicates that in the regime of a fixed ϵ the bound based on minimax meta-converse should
be very sharp. This, of course, provided we can select the best QY in applying it. Fortunately, for
symmetric channels optimal QY can be guessed fairly easily, cf. [341] for more.
i i
i i
i i
We motivate the question by trying to understand the speed of convergence in the strong con-
verse (22.1). If we return to the proof of Theorem 19.9, namely the step (19.7), we see that by
applying large-deviations Theorem 15.9 we can prove that for some Ẽ(R) and any R < C we have
ϵ∗ (n, exp{nR}) ≤ exp{−nẼ(R)} .
What is the best value of Ẽ(R) for each R? This is perhaps the most famous open question in all
of channel coding. Let us proceed in more details.
We will treat both regimes R < C and R > C. The reliability function E(R) of a channel is
defined as follows:
(
limn→∞ − 1n log ϵ∗ (n, exp{nR}) R<C
E(R) =
limn→∞ − 1n log(1 − ϵ∗ (n, exp{nR})) R > C .
We leave E(R) as undefined if the limit does not exist. Unfortunately, there is no general argument
showing that this limit exist. The only way to show its existence is to prove an achievability bound
1
lim inf − log ϵ∗ (n, exp{nR}) ≥ Elower (R) ,
n→∞ n
a converse bound
1
lim sup − log ϵ∗ (n, exp{nR}) ≤ Eupper (R) ,
n→∞ n
and conclude that the limit exist whenever Elower = Eupper . It is common to abuse notation and
write such pair of bounds as
Elower (R) ≤ E(R) ≤ Eupper (R) ,
even though, as we said, the E(R) is not known to exist unless the two bounds match unless the
two bounds match.
From now on we restrict our discussion to the case of a DMC. An important object to
define is the Gallager’s E0 function, which is nothing else than the right-hand side of Gallager’s
bound (18.15). For the DMC it has the following expression:
!1+ρ
X X 1
E0 (ρ, PX ) = − log PX (x)PY|X (y|x)
1+ρ
y∈B x∈A
This expression is defined in terms of the single-letter channel PY|X . It is not hard to see that E0
function for the n-letter extension evaluated with P⊗ n
X just equals nE0 (ρ, PX ), i.e. it tensorizes
2
similar to mutual information. From this observation we can apply Gallager’s random coding
2
There is one more very pleasant analogy with mutual information: the optimization problems in the definition of E0 (ρ)
also tensorize. That is, the optimal distribution for the n-letter channel is just P⊗n
X , where PX is optimal for a single-letter
one.
i i
i i
i i
432
Optimizing the choice of PX we obtain our first estimate on the reliability function
An analysis, e.g. [177, Section 5.6], shows that the function Er (R) is a convex, decreasing and
strictly positive on 0 ≤ R < C. Therefore, Gallager’s bound provides a non-trivial estimate of
the reliability function for the entire range of rates below capacity. At rates R → C the optimal
choice of ρ → 0. As R departs further away from the capacity the optimal ρ reaches 1 at a certain
rate R = Rcr known as the critical rate, so that for R < Rcr we have Er (R) = E0 (1) − R behaving
linearly. The Er (R) bound is shown on Figure 22.1 by a curve labeled “Random code ensemble”.
Going to the upper bounds, taking QY to be the iid product distribution in (22.12) and optimizing
yields the bound [381] known as the sphere-packing bound:
Comparing the definitions of Esp and Er we can see that for Rcr < R < C we have
thus establishing reliability function value for high rates. However, for R < Rcr we have Esp (R) >
Er (R), so that E(R) remains unknown. The Esp (R) bound is shown on Figure 22.1 by a curve
labeled “Sphere-packing (volume)”.
Both upper and lower bounds have classical improvements. The random coding bound can be
improved via technique known as expurgation showing
and Eex (R) > Er (R) for rates R < Rx where Rx ≤ Rcr is the second critical rate; see Exercise IV.31.
The Eex (R) bound is shown on Figure 22.1 by a curve labeled “Typical random linear code (aka
expurgation)”. (See below for the explanation of the naming.)
The sphere packing bound can also be improved at low rates by analyzing a combinatorial
packing problem by showing that any code must have a pair of codewords which are close (in terms
of Hellinger distance between the induced output distributions) and concluding that confusing
these two leads to a lower bound on probability of error (via (16.3)). This class of bounds is
known as “minimum distance” based bounds and several of them are shown on Figure 22.1 with
the strongest labeled “MRRW + mindist”, corresponding to the currently the best known minimum
distance upper bound due to [299]. (This bound also known as a linear programming or JPL bound
has not seen improvements in 60 years and it is a long-standing open problem in combinatorics
to do so.)
The straight-line bound [177, Theorem 5.8.2] allows to interpolate between any minimum dis-
tance bound and the Esp (R). Unfortunately, these (classical) improvements tightly bound E(R) at
i i
i i
i i
This state of affairs remains unchanged (for a general DMC) since the foundational work of Shan-
non, Gallager and Berlekamp in 1967. As far as we know, the common belief is that Eex (R) is in
fact the true value of E(R) for all rates. As we mentioned above this is perhaps one of the most
famous open problems in classical information theory.
We demonstrate these bounds (with exception of the straight-line bound) on the reliability func-
tion on Figure 22.1 for the case of the BSCδ . For this channel, there is an interesting interpretation
of the expurgated bound. To explain it, let us recall the different ensembles of random codes
that we discussed in Section 18.6. In particular, we had the Shannon ensemble (as used in Theo-
rem 18.5) and the random linear code (either Elias or Gallager ensembles, we do not need to make
a distinction here).
For either ensemble, it it is known [178] that Er (R) is not just an estimate, but in fact the exact
value of the exponent of the average probability of error (averaged over a code in the ensemble).
For either ensemble, however, for low rates the average is dominated by few bad codes, whereas
a typical (high probability) realization of the code has a much better error exponent. For Shannon
ensemble this happens at R < 12 Rx and for the linear ensemble it happens at R < Rx . Furthermore,
the typical linear code in fact has error exponent exactly equal to the expurgated exponent Eex (R),
see [34].
There is a famous conjecture in combinatorics stating that the best possible minimum pairwise
Hamming distance of a code with rate R is given by the Gilbert-Varshamov bound (Theorem 27.5).
If true, this would imply that E(R) = Eex (R) for R < Rx , see e.g. [283].
The most outstanding development in the error exponents since 1967 was a sequence of papers
starting from [283], which proposed a new technique for bounding E(R) from above. Litsyn’s
idea was to first prove a geometric result (that any code of a given rate has a large number of
pairs of codewords at a given distance) and then use de Caen’s inequality to convert it into a lower
bound on the probability of error. The resulting bound was very cumbersome. Thus, it was rather
surprising when Barg and MacGregor [35] were able to show that the new upper bound on E(R)
matched Er (R) for Rcr − ϵ < R < Rcr for some small ϵ > 0. This, for the first time since [381]
extended the range of knowledge of the reliability function. Their amazing result (together with
Gilbert-Varshamov conjecture) reinforced the belief that the typical linear codes achieve optimal
error exponent in the whole range 0 ≤ R ≤ C.
Regarding E(R) for R > C the situation is much simpler. We have
The lower (achievability) bound here is due to Dueck [141] (see also [319]), while the harder
(converse) part is by Arimoto [25]. It was later discovered that Arimoto’s converse bound can
be derived by a simple modification of the weak converse (Theorem 17.3): instead of applying
1
data-processing to the KL divergence, one uses Rényi divergence of order α = 1+ρ ; see [338]
i i
i i
i i
434
1.4
1.2
Err.Exp. (log2)
0.8 Rx = 0.24
0.4
C = 0.86
0.2
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
Rate
Figure 22.1 Comparison of bounds on the error exponent of the BSC. The MRRW stands for the upper
bound on the minimum distance of a code [299] and Gilbert-Varshamov is a lower bound (cf. Theorem 27.5).
for details. This suggests a general conjecture that replacing Shannon information measures with
Rényi ones upgrades the (weak) converse proofs to a strong converse.
Pe = exp{−nE(R) + o(n)} .
Therefore, for a while the question of non-asymptotic characterization of log M∗ (n, ϵ) and ϵ∗ (n, M)
was equated with establishing the sharp value of the error exponent E(R). However, as codes
became better and started having rates approaching the channel capacity, the question has changed
to that of understanding behavior of log M∗ (n, ϵ) in the regime of fixed ϵ and large n (and, thus, rates
R → C). It was soon discovered by [334] that the next-order terms in the asymptotic expansion of
log M∗ give surprisingly sharp estimates on the true value of the log M∗ . Since then, the work on
channel coding focused on establishing sharp upper and lower bounds on log M∗ (n, ϵ) for finite n
(the topic of Section 22.6) and refining the classical results on the asymptotic expansions, which
we discuss here.
i i
i i
i i
We have already seen that the strong converse (Theorem 22.1) can be stated in the asymptotic
expansion form as: for every fixed ϵ ∈ (0, 1),
log M∗ (n, ϵ) = nC + o(n), n → ∞.
Intuitively, though, the smaller values of ϵ should make convergence to capacity slower. This
suggests that the term o(n) hides some interesting dependence on ϵ. What is it?
This topic has been investigated since the 1960s, see [130, 402, 334, 333] , and resulted in
the concept of channel dispersion. We first present the rigorous statement of the result and then
explain its practical uses.
1 DMC
2 DMC with cost constraint
3 AWGN
4 Parallel AWGN
Let (X∗ , Y∗ ) be the input-output of the channel under the capacity achieving input distribution, and
i(x; y) be the corresponding (single-letter) information density. The following expansion holds for
a fixed 0 < ϵ < 1/2 and n → ∞
√
log M∗ (n, ϵ) = nC − nVQ−1 (ϵ) + O(log n) (22.15)
where Q−1 is the inverse of the complementary standard normal CDF, the channel capacity is
C = I(X∗ ; Y∗ ) = E[i(X∗ ; Y∗ )], and the channel dispersion3 is V = Var[i(X∗ ; Y∗ )|X∗ ].
Proof. The full proofs of these results are somewhat technical, even for the DMC.4 Here we only
sketch the details.
First, in the absence of cost constraints the achievability (lower bound on log M∗ ) part has
already been done by us in Theorem 19.11, where we have shown that log M∗ (n, ϵ) ≥ nC −
√ √
nVQ−1 (ϵ) + o( n) by refining the proof of the noisy channel coding theorem and using the
CLT. Replacing the CLT with its non-asymptotic version (Berry-Esseen inequality [165, Theorem
√
2, Chapter XVI.5]) improves o( n) to O(log n). In the presence of cost constraints, one is inclined
to attempt to use an appropriate version of the achievability bound such as Theorem 20.7. However,
for the AWGN this would require using input distribution that is uniform on the sphere. Since this
distribution is non-product, the information density ceases to be a sum of iid, and CLT is harder
to justify. Instead, there is a different achievability bound known as the κ-β bound [334, Theorem
25] that has become the workhorse of achievability proofs for cost-constrained channels with
continuous input spaces.
3
There could be multiple capacity-achieving input distributions, in which case PX∗ should be chosen as the one that
minimizes Var[i(X∗ ; Y∗ )|X∗ ]. See [334] for more details.
4
Recently, subtle gaps in [402] and [334] in the treatment of DMCs with non-unique capacity-achieving input distributions
were found and corrected in [81].
i i
i i
i i
436
The upper (converse) bound requires various special methods depending on the channel. How-
ever, the high-level idea is to always apply the meta-converse bound from (22.12) with an
appropriate choice of QY . Most often, QY is taken as the n-th power of the capacity achieving out-
put distribution for the channel. We illustrate the details for the special case of the BSC. In (22.4)
we have shown that
1
log M∗ (n, ϵ) ≤ − log βα (Ber(δ)⊗n , Ber( )⊗n ) . (22.16)
2
On the other hand, Exercise III.10 shows that
1 1 √ √
− log β1−ϵ (Ber(δ)⊗n , Ber( )⊗n ) = nd(δk ) + nvQ−1 (ϵ) + o( n) ,
2 2
where v is just the variance of the (single-letter) log-likelihood ratio:
" #
δ 1−δ δ δ
v = VarZ∼Ber(δ) Z log 1 + (1 − Z) log 1 = Var[Z log ] = δ(1 − δ) log2 .
2 2
1 − δ 1 − δ
Upon inspection we notice that v = V – the channel dispersion of the BSC, which completes the
proof of the upper bound:
√ √
log M∗ (n, ϵ) ≤ nC − nVQ−1 (ϵ) + o( n)
√
Improving the o( n) to O(log n) is done by applying the Berry-Esseen inequality in place of the
CLT, similar to the upper bound. Many more details on these proofs are contained in [333].
Remark 22.1 (Zero dispersion) We notice that V = 0 is entirely possible. For example,
consider an additive-noise channel Y = X + Z over some abelian group G with Z being uniform
on some subset of G, e.g. channel in Exercise IV.7. Among the zero-dispersion channels there is
a class of exotic channels [334], which for ϵ > 1/2 have asymptotic expansions of the form [333,
Theorem 51]:
log M∗ (n, ϵ) = nC + Θϵ (n 3 ) .
1
Existence of this special case is why we restricted the theorem above to ϵ < 21 .
Remark 22.2 The expansion (22.15) only applies to certain channels (as described in the
theorem). If, for example, Var[i(X∗ ; Y∗ )] = ∞, then the theorem need not hold and there might
be other stable (non-Gaussian) distributions that the n-letter information density will converge to.
Also notice that in the absence of cost constraints we have
since, by capacity saddle-point (Corollary 5.7), E[i(X∗ ; Y∗ )|X∗ = x] = C for PX∗ -almost all x.
As an example, we have the following dispersion formulas for the common channels that we
discussed so far:
i i
i i
i i
δ̄
BSCδ : V(δ) = δ δ̄ log2
δ
P ( P + 2)
AWGN (real): VAWGN (P) = log2 e
2( P + 1) 2
P ( P + 2)
AWGN (complex): V(P) = log2 e
( P + 1) 2
√
BI-AWGN: V(P) = Var[log(1 + e−2P+2 PZ
)] , Z ∼ N ( 0, 1)
where for the AWGN and BI-AWGN P is the SNR. √ We also remind that, cf. Example 3.4, for the BI-
AWGN we have C(P) = log 2 − E[log(1 + e−2P+2 PZ )]. For the Parallel AWGN, cf. Section 20.4,
we have
!2 +
log2 e X
L
σj2
Parallel AWGN: V(P, {σj , j ∈ [L]}) =
2
1− ,
2 T
j=1
PL
where T is the optimal threshold in the water-filling solution, i.e. j=1 |T − σj2 |+ = P. We remark
that the expression for the parallel AWGN channel can be guessed by noticing that it equals
PL Pj
j=1 VAWGN ( σ 2 ) with Pj = |T − σj | – the optimal power allocation.
2 +
j
What about channel dispersion for other channels? Discrete channels with memory have seen
some limited success in [335], which expresses dispersion in terms of the Fourier spectrum of the
information density process. The compound DMC (Ex. IV.19) has a much more delicate dispersion
formula (and the remainder term is not O(log n), but O(n1/4 )), see [342]. For non-discrete channels
(other than the AWGN and Poisson) new difficulties appear in the proof of the converse part. For
example, the dispersion of a (coherent) fading channel is known only if one additionally restricts
the input codewords to have limited peak values, cf. [98, Remark 1]. In particular, dispersion of
the following Gaussian erasure channel is unknown:
Y i = Hi ( X i + Z i ) ,
Pn
where we have N (0, 1) ∼ Zi ⊥ ⊥ Hi ∼ Ber(1/2) and the usual quadratic cost constraint i=1 x2i ≤
nP.
Multi-antenna (MIMO) channels (20.10) present interesting new challenges as well. For exam-
ple, for coherent channels the capacity achieving input distribution is non-unique [98]. The
quasi-static channels are similar to fading channels but the H1 = H2 = · · · , i.e. the channel
gain matrix in (20.10) is not changing with time. This channel model is often used to model cellu-
lar networks. By leveraging an unexpected amount of differential geometry, it was shown in [462]
that they have zero-dispersion, or more specifically:
where the ϵ-capacity Cϵ is known as outage capacity in this case (and depends on ϵ). The main
implication is that Cϵ is a good predictor of the ultimate performance limits for these practically-
relevant channels (better than C is for the AWGN channel, for example). But some caution must
be taken in approximating log M∗ (n, ϵ) ≈ nCϵ , nevertheless. For example, in the case where H
i i
i i
i i
438
0.5
0.4
Rate, bit/ch.use
0.3
0.2
0.1 Capacity
Converse
RCU
DT
Gallager
Feinstein
0
0 200 400 600 800 1000 1200 1400 1600 1800 2000
Blocklength, n
0.5
0.4
Rate, bit/ch.use
0.3
0.2
Capacity
Converse
0.1 Normal approximation + 1/2 log n
Normal approximation
Achievability
0
0 200 400 600 800 1000 1200 1400 1600 1800 2000
Blocklength, n
Figure 22.3 Comparing the normal approximation against the best upper and lower bounds on 1
n
log M∗ (n, ϵ)
for the BSCδ channel (δ = 0.11, ϵ = 10−3 ).
matrix is known at the transmitter, the same paper demonstrated that the standard water-filling
power allocation (Theorem 20.14) that maximizes Cϵ is rather sub-optimal at finite n.
i i
i i
i i
(The log n term in (22.15) is known to be equal to O(1) for the BEC, and 12 log n for the BSC,
AWGN and binary-input AWGN. For these latter channels, normal approximation is typically
defined with + 12 log n added to the previous display.)
For example, considering the BEC1/2 channel we can easily compute the capacity and disper-
sion to be C = (1 − δ) and V = δ(1 − δ) (in bits and bits2 , resp.). Detailed calculation in Ex. IV.4
lead to the following rigorous estimates:
i i
i i
i i
440
Pe k1 → n1 Pe k2 → n2
10−4 10−4
P∗ SNR P∗ SNR
After inspecting these plots, one may believe that the k1 → n1 code is better, since it requires a
smaller SNR to achieve the same error probability. However, this ignores the fact that the rate of
this code nk11 might be much smaller as well. The concept of normalized rate allows us to compare
the codes of different blocklengths and coding rates.
Specifically, suppose that a k → n code is given. Fix ϵ > 0 and find the value of the SNR P for
which this code attains probability of error ϵ (for example, by taking a horizontal intercept at level
ϵ on the waterfall plot). The normalized rate is defined as
k k
Rnorm (ϵ) = ≈ p ,
log2 M∗ (n, ϵ, P) nC(P) − nV(P)Q−1 (ϵ)
where log M∗ , capacity and dispersion correspond to the channel over which evaluation is being
made (most often the AWGN, BI-AWGN or the fading channel). We also notice that, of course,
the value of log M∗ is not possible to compute exactly and thus, in practice, we use the normal
approximation to evaluate it.
This idea allows us to clearly see how much different ideas in coding theory over the decades
were driving the value of normalized rate upward to 1. This comparison is show on Figure 22.4.
A short summary is that at blocklengths corresponding to “data stream” channels in cellular net-
works (n ∼ 104 ) the LDPC codes and non-binary LDPC codes are already achieving 95% of the
information-theoretic limit. At blocklengths corresponding to “control plane” (n ≲ 103 ) the polar
codes and LDPC codes are at similar performance and at 90% of the fundamental limits.
i i
i i
i i
0.95
0.9
Galileo HGA
Turbo R=1/2
0.75 Cassini/Pathfinder
Galileo LGA
Hermitian curve [64,32] (SDD)
0.7 Reed−Solomon (SDD)
BCH (Koetter−Vardy)
Polar+CRC R=1/2 (List dec.)
0.65 ME LDPC R=1/2 (BP)
0.6
0.55
0.5 2 3 4 5
10 10 10 10
Blocklength, n
Normalized rates of code families over BIAWGN, Pe=0.0001
1
0.95
0.9
Turbo R=1/3
Turbo R=1/6
Turbo R=1/4
0.85
Voyager
Normalized rate
Galileo HGA
Turbo R=1/2
Cassini/Pathfinder
0.8
Galileo LGA
BCH (Koetter−Vardy)
Polar+CRC R=1/2 (L=32)
Polar+CRC R=1/2 (L=256)
0.75
Huawei NB−LDPC
Huawei hybrid−Cyclic
ME LDPC R=1/2 (BP)
0.7
0.65
0.6 2 3 4 5
10 10 10 10
Blocklength, n
Figure 22.4 Normalized rates for various codes. Plots generated using [397] (color version recommended)
i i
i i
i i
So far we have been focusing on the paradigm for one-way communication: data are mapped to
codewords and transmitted, and later decoded based on the received noisy observations. In most
practical settings (except for storage), frequently the communication goes in both ways so that the
receiver can provide certain feedback to the transmitter. As a motivating example, consider the
communication channel of the downlink transmission from a satellite to earth. Downlink transmis-
sion is very expensive (power constraint at the satellite), but the uplink from earth to the satellite
is cheap which makes virtually noiseless feedback readily available at the transmitter (satellite).
In general, channel with noiseless feedback is interesting when such asymmetry exists between
uplink and downlink. Even in less ideal settings, noisy or partial feedbacks are commonly available
that can potentially improve the reliability or complexity of communication.
In the first half of our discussion, we shall follow Shannon to show that even with noiseless
feedback nothing (in terms of capacity) can be gained in the conventional setup. In the process, we
will also introduce the concept of Massey’s directed information. In the second half of the Chapter
we examine situations where feedback is extremely helpful: low probability of error, variable
transmission length and variable transmission power.
f1 : [ M ] → A
f2 : [ M ] × B → A
..
.
fn : [M] × B n−1 → A
• Decoder:
g : B n → [M]
442
i i
i i
i i
23.1 Feedback does not increase capacity for stationary memoryless channels 443
Here the symbol transmitted at time t depends on both the message and the history of received
symbols (causality constraint):
Xt = ft (W, Yt1−1 ).
W ∼ uniform on [M]
PY|X
X1 = f1 (W) −→ Y1
.. −→ Ŵ = g(Yn )
.
PY|X
Xn = fn (W, Yn1−1 ) −→ Yn
Figure 23.1 compares the settings of channel coding without feedback and with ideal full feedback:
W Xn channel Yn Ŵ
W Xk channel Yk Ŵ
delay
Figure 23.1 Schematic representation of coding without feedback (left) and with full noiseless feedback
(right).
Proof. Achievability: Although it is obvious that Cfb ≥ C, we wanted to demonstrate that in fact
constructing codes achieving capacity with full feedback can be done directly, without appealing
i i
i i
i i
444
to a (much harder) problem of non-feedback codes. Let π t (·) ≜ PW|Yt (·|Yt ) with the (random) pos-
terior distribution after t steps. It is clear that due to the knowledge of Yt on both ends, transmitter
and receiver have perfectly synchronized knowledge of π t . Now consider how the transmission
progresses:
1 Initialize π 0 (·) = M1 .
2 At (t + 1)-th step, encoder sends Xt+1 = ft+1 (W, Yt ). Note that selection of ft+1 is equivalent to
the task of partitioning message space [M] into classes Pa , i.e.
Notice that (this is the crucial part!) the random multiplier satisfies:
XX PY|X (y|a)
E[log Bt+1 (W)|Yt ] = π t (Pa ) log P = I(π̃ t+1 , PY|X ) (23.1)
a∈A y∈B a∈A π t (Pa )a
where π̃ t+1 (a) ≜ π t (Pa ) is a (random) distribution on A, induced by the encoder at the channel
input in round (t + 1). Note that while π t is decided before the (t + 1)-st step, design of partition
Pa (and hence π̃ t+1 ) is in the hands of the encoder.
The goal of the code designer is to come up with such a partitioning {Pa : a ∈ A} that the speed
of growth of π t (W) is maximal. Now, analyzing the speed of growth of a random-multiplicative
process is best done by taking logs:
X
t
log π t (j) = log Bs + log π 0 (j) .
s=1
Intuitively, we expect that the process log π t (W) resembles a random walk starting from − log M
and having a positive drift. Thus to estimate the time it takes for this process to reach value 0
we need to estimate the upward drift. Appealing to intuition and the law of large numbers (more
exactly to the theory of martingales) we approximate
X
t
log π t (W) − log π 0 (W) ≈ E[log Bs ] .
s=1
Finally, from (23.1) we conclude that the best idea is to select partitioning at each step in such a
way that π̃ t+1 ≈ P∗X (capacity-achieving input distribution) and this obtains
implying that the transmission terminates in time ≈ logCM . The important lesson here is the follow-
ing: The optimal transmission scheme should map messages to channel inputs in such a way that
i i
i i
i i
23.1 Feedback does not increase capacity for stationary memoryless channels 445
the induced input distribution PXt+1 |Yt is approximately equal to the one maximizing I(X; Y). This
idea is called posterior matching and explored in detail in [384].1
Although our argument above is not rigorous, it is not hard to make it such by an appeal to
theory of martingale converges, very similar to the way we used it in Section 16.3* to analyze
SPRT. We omit those details (see [384]), since the result is in principle not needed for the proof
of the Theorem.
Converse: We are left to show that Cfb ≤ C(I) . Recall the key in proving weak converse for
channel coding without feedback: Fano’s inequality plus the graphical model
W → Xn → Yn → Ŵ. (23.2)
Then
With feedback the probabilistic picture becomes more complicated as the following Figure 23.2
demonstrates for n = 3 (dependence introduced by the extra squiggly arrows):
X1 Y1 X1 Y1
W X2 Y2 Ŵ W X2 Y2 Ŵ
X3 Y3 X3 Y3
without feedback with feedback
Figure 23.2 Graphical model for channel coding and n = 3 with and without feedback. Double arrows ⇒
correspond to the channel links.
Notice that the d-separation criterion shows we no longer have Markov chain (23.2), i.e. given
X the W and Yn are not independent.2 Furthermore, the input-output relation is also no longer
n
memoryless
Y
n
PYn |Xn (yn |xn ) 6= PY|X (yj |xj )
j=1
1
This simple (but capacity-achieving) feedback coding scheme also helps us appreciate more fully the magic of Shannon’s
(non-feedback) coding theorem, which demonstrated that the (almost) optimal partitioning can be done in a way that is
totally blind to actual channel outputs. That is, we can preselect partitions Pa that are independent of π t (but dependent on
t) and so that π t (Pa ) ≈ P∗X (a) with overwhelming probability and for almost all t ∈ [n].
2
For example, suppose we are transmitting W ∼ Ber(1/2) over the BSC and set X1 = 0, X2 = W ⊕ Y1 . Then given X1 , X2
we see that Y2 and W can be exactly determined from one another.
i i
i i
i i
446
(as an example, let X2 = Y1 and thus PY1 |X1 X2 = δX1 is a point mass). Nevertheless, there is still a
large degree of independence in the channel. Namely, we have
(Yt−1 , W) →Xt → Yt , t = 1, . . . , n (23.3)
W → Y → Ŵ
n
(23.4)
Then
−h(ϵ) + ϵ̄ log M ≤ I(W; Ŵ) (Fano)
≤ I(W; Y ) n
(Data processing applied to (23.4))
X
n
= I(W; Yt |Yt−1 ) (Chain rule)
t=1
Xn
≤ I(W, Yt−1 ; Yt ) (I(W; Yt |Yt−1 ) = I(W, Yt−1 ; Yt ) − I(Yt−1 ; Yt ))
t=1
X
n
≤ I(Xt ; Yt ) (Data processing applied to (23.3))
t=1
≤ nC
In comparison with Theorem 22.2, the following result shows that, with fixed-length block cod-
ing, feedback does not even improve the speed of approaching capacity and can at most improve
the third-order log n terms.
Theorem 23.4 (Dispersion with feedback [131, 336]) For weakly input-symmetric
DMC (e.g. additive noise, BSC, BEC) we have:
√
log M∗fb (n, ϵ) = nC − nVQ−1 (ϵ) + O(log n)
i i
i i
i i
PW,Xn ,Yn ,Ŵ = PW,Xn PYn |Xn PŴ|Yn , QW,Xn ,Yn ,Ŵ = QW,Xn QYn QŴ|Yn .
We are free to choose factors QW,Xn , QYn and QŴ|Yn . However, as we will see soon, it is best
to choose them to minimize D(PkQ) which gives us (see the discussion of information flow
after (4.6))
and achieved by taking the factors equal to their values under P, namely QW,Xn = PW,Xn , QYn =
PYn and QŴ|Yn = PŴ|Yn . (It is a good exercise to show this by writing out the chain rule for
divergence (2.26).) As we know this minimal value of D(PkQ) measures the information flow
through the links excluded in the graphical model, i.e. through Xn → Yn .
From here we proceed via the data-processing inequality and tensorization of capacity for
memoryless channels as follows:
(∗) X
n
1 DPI
−h(ϵ) + ϵ̄ log M = d(1 − ϵk ) ≤ D(PkQ) = I(Xn ; Yn ) ≤ I(Xi ; Yi ) ≤ nC(I) (23.8)
M
i=1
where the (∗) followed from the fact that the Xn → Yn is a memoryless channel ((6.1)).
Let us now go back to the case of channels with feedback. There are several problems with
adapting the previous argument. First, when feedback is present, Xn → Yn is not memoryless due
to the influence of the transmission protocol (for example, knowing both X1 and X2 affects the law
Qn
of Y1 , that is PY1 |X1 6= PY1 |X1 ,X2 and also PYn |Xn 6= j=1 PYj |Xj even for the DMC). However, an
even bigger problem is revealed by attempting to replicate the previous proof.
Suppose we again try to induce an auxiliary probability space Q as in (23.6). Then due to lack
of Markov chain under P (i.e. (23.5)) solution of the problem (23.7) can be shown to equal this
time
This value can be quite a bit higher than capacity. For example, consider an extremely noisy
(in fact, useless) channel BSC1/2 and a feedback transmission scheme Xt+1 = Yt . We see that
I(W, Xn ; Yn ) ≥ H(Yn−1 ) = (n − 1) log 2, whereas capacity C = 0. What went wrong in this case?
For the explanation, we should revisit the graphical model under P as shown on Figure 23.2
(right graph). When Q is defined as in (23.6) the value min D(PkQ) = I(W, Xn ; Yn ) measures the
information flow through both the ⇒ and ⇝ links.
This motivates us to find a graphical model for Q such that min D(PkQ) only captured the
information flow through only the ⇒ links {Xi → Yi : i = 1, . . . , n} (and so that min D(PkQ) ≤
⊥ Ŵ, so that Q[W = Ŵ] = M1 .
nC(I) ), while still guaranteeing that W ⊥
i i
i i
i i
448
X1 Y1 X1 Y1
W X2 Y2 Ŵ W X2 Y2 Ŵ
X3 Y3 X3 Y3
without feedback with feedback
Figure 23.3 Graphical model for n = 3 under the auxiliary distribution Q. Compare with Figure 23.2 under
the actual distribution P.
Such a graphical model is depicted on Figure 23.3 (right graph).3 Formally, we shall restrict
QW,Xn ,Yn ,Ŵ ∈ Q, where Q is the set of distributions that can be factorized as follows:
QW,Xn ,Yn ,Ŵ = QW QX1 |W QY1 QX2 |W,Y1 QY2 |Y1 · · · QXk |W,Yk−1 QYk |Yk−1 · · · QXn |W,Yn−1 QYn |Yn−1 QŴ|Yn
Using the d-separation criterion (see (3.11)) we can verify that for any Q ∈ Q we have W ⊥ ⊥ W:
n
W and Ŵ are d-separated by X . (More directly, one can clearly see that conditioning on any fixed
value of W = w does affect distributions of X1 , . . . , Xn but leaves Yn and Ŵ unaffected.)
Notice that in the graphical model for Q, when removing ⇒ we also added the directional links
between the Yi ’s, these links serve to maximally preserve the dependence relationships between
variables when ⇒ are removed, so that Q could be made closer to P, while still maintaining
W⊥ ⊥ W. We note that these links were also implicitly present in the non-feedback case (see model
for Q in that case on the left graph in Figure 23.3).
Now since as we agreed under Q we still have Q[W = Ŵ] = M1 we can use our usual data-
processing for divergence to conclude d(1 − ϵk M1 ) ≤ D(PkQ).
Assuming the crucial fact about this Q-graphical model that will be shown in a Lemma 23.6
(to follow), we then have the following chain:
1
d(1 − ϵk ) ≤ inf D(PW,Xn ,Yn ,Ŵ kQW,Xn ,Yn ,Ŵ )
M Q∈Q
Xn
= I(Xt ; Yt |Yt−1 ) (Lemma 23.6)
t=1
X
n
= EYt−1 [I(PXt |Yt−1 , PY|X )]
t=1
3
This kind of removal of one-directional links is known as causal conditioning.
i i
i i
i i
X
n
≤ I(EYt−1 [PXt |Yt−1 ], PY|X ) (concavity of I(·, PY|X ))
t=1
Xn
= I(PXt , PY|X )
t=1
≤nC . ( I)
We now proceed to showing the crucial omitted step in the above proof. Before that let us define
an interesting new kind of information.
Note that directed information is not symmetric. As [294] (and subsequent work, e.g. [412])
shows ⃗I(Xn ; Yn ) quantifies the amount of causal information transfer from X-process to Y-process.
In context of feedback communication a formal justification for introduction of this concept is the
following result.
Lemma 23.6 Consider communication with feedback over a non-anticipatory channel given
by a sequence of Markov kernels PYt |Xt ,Yt−1 , t ∈ [n], i.e. we have a probability distribution P on
(W, Xn , Yn , Ŵ) described by factorization
Y
n
PW,Xn ,Yn ,Ŵ = PW PXt |W,Xt−1 ,Yt−1 PYt |Xt ,Yt−1 . (23.9)
t=1
Denote by Q all distributions factorizing with respect to the graphical models on Figure 23.3
(right graph), that is those satisfying
Y
n
QW,Xn ,Yn ,Ŵ = QW QXk |W,Yk−1 QYk |Yk−1 (23.10)
t=1
Then we have
i i
i i
i i
450
In addition, if the channel is memoryless, i.e. PYt |Xt ,Yt−1 = PYt |Xt for all t ∈ [n], then we have
X
n
⃗I(Xn ; Yn ) = I(Xt ; Yt |Yt−1 ) .
t=1
Proof. By comparing factorizations (23.9) and (23.10) and applying the chain rule (2.26) we can
immediately optimize several terms (we leave this justification as an exercise):
QX,W = PX,W ,
QXt |W,Yt−1 = PXt |W,Yt−1
QŴ|Yn = PW|Yn .
= inf D(PY1 |X1 kQY1 |X1 ) + D(PY2 |X2 ,Y1 kQY2 |Y1 |X2 , Y1 ) + · · · + D(PYn |Xn ,Yn−1 kQYn |Yn−1 |Xn , Yn−1 )
Q∈Q
where in the last step we simply applied (conditional) versions of Corollary 4.2.
To prove the claim for the memoryless channels, we only need to notice that
and that the last term is zero. The latter can be justified via d-separation criterion. Indeed, in the
absence of channel memory every undirected path from Xt−1 to Yt must pass through Xt , which is
a non-collider and is conditioned on.
To summarize, we can see that Shannon’s result for feedback communication can be best
understood as a simple modification of the standard weak converse in channel coding: instead
of using
i i
i i
i i
where
denotes the set of input symbols that can lead to the output symbol y.
where (a) and (b) are by definitions, (c) follows from Theorem 23.3, and (d) is due to Theorem 19.9.
All capacity quantities above are defined with (fixed-length) block codes.
Remark 23.2 1 In DMC for both zero-error capacities (C0 and Cfb,0 ) only the support of the
transition matrix PY|X , i.e., whether PY|X (b|a) > 0 or not, matters; the values of those non-zero
PY|X (b|a) are irrelevant. That is, C0 and Cfb,0 are determined by the bipartite graph represen-
tation between the input alphabet A and the output alphabet B . Furthermore, the C0 (but not
Cfb,0 !) is a function of the confusability graph – a simple undirected graph on A with a 6= a′
connected by an edge iff ∃b ∈ B s.t. PY|X (b|a)PY|X (b|a′ ) > 0.
2 That Cfb,0 is not a function of the confusability graph alone is easily seen from comparing the
polygon channel (next example) with L = 3 (for which Cfb,0 = log 32 ) and the useless channel
with A = {1, 2, 3} and B = {1} (for which Cfb,0 = 0). Clearly in both cases the confusability
graph is the same – a triangle.
3 Oftentimes C0 is very hard to compute, but Cfb,0 can be obtained in closed form as in (23.12).
As an example, consider the following polygon channel (named after its confusability graph):
1 1
1
2 2 5
. .
2
. .
. .
4
L L 3
Bipartite graph Confusability graph (L = 5)
The following are known about the zero-error capacity C0 of the polygon channel:
• L = 3: C0 = 0.
• L = 5: C0 = 12 log 5. This is a famous “capacity of a pentagon” problem. For achievability,
with blocklength one, one can use {1, 3} to achieve rate 1 bit; with blocklength two, the code-
book {(1, 1), (2, 3), (3, 5), (4, 2), (5, 4)} achieves rate 12 log 5 bits, as pointed out by Shannon
i i
i i
i i
452
in his original 1956 paper [379]. More than two decades later this was shown optimal by
Lovász using a technique now known as semidefinite programming relaxation [286].
• Even L: C0 = log L2 (Exercise IV.36).
• L = 7: 3/5 log 7 ≤ C0 ≤ log 3.32. Finding the exact value for any odd L ≥ 7 is a famous
open problem in combinatorics.
• Asymptotics of odd L: Despite being unknown in general C0 has a known asymptotic
expansion: For odd L, C0 = log L2 + o(1) [66].
In comparison, the zero-error capacity with feedback (Exercise IV.36) equals Cfb,0 = log L2 for
any L, which, thus, can strictly exceed C0 .
4 Notice that Cfb,0 is not necessarily equal to Cfb = limϵ→0 Cfb,ϵ = C. Here is an example with
1 1
2 2
3 3
4 4
C0 = log 2
2
Cfb,0 = max − log max δ, 1 − δ (P∗X = (δ/3, δ/3, δ/3, δ̄))
δ 3
5 3
= log > C0 (δ ∗ = )
2 5
On the other hand, the Shannon capacity C = Cfb can be made arbitrarily close to log 4 by
picking the cross-over probabilities arbitrarily close to zero, while the confusability graph stays
the same.
Proof of Theorem 23.7. 1 Fix any (n, M, 0)-code. For each t = 0, 1, . . . , n, denote the confusabil-
ity set of all possible messages that could have produced the received signal yt = (y1 , . . . , yt )
by:
i i
i i
i i
Notice that in general the minimizer P∗X is not the capacity-achieving input distribution in the
usual sense (recall Theorem 5.4). This definition also sheds light on how the encoding and
decoding should proceed and serves to lower bound the uncertainty reduction at each stage of
the decoding scheme.
3 “≤” (converse): Let PXn be the joint distribution of the codewords. Denote by E0 = [M] the
original message set.
t = 1: For PX1 , by (23.14), ∃y∗1 such that:
|{m : f1 (m) ∈ Sy∗1 }| |E1 (y∗1 )|
PX1 (Sy∗1 ) = = ≥ θfb .
|{m ∈ [M]}| | E0 |
t = 2: For PX2 |X1 ∈Sy∗ , by (23.14), ∃y∗2 such that:
1
i i
i i
i i
454
encoder f1 channel
MP∗
X (a1 ) messages
a1 y1
MP∗
X (a2 ) messages
a2 y2
MP∗
X (a3 ) messages
a3 y3
By similar arguments, each interaction reduces the uncertainty by a factor of at least θfb . After n
n
iterations, the size of “confusability set” is upper bounded by Mθfb n
, if Mθfb ≤ 1,4 then zero error
probability is achieved. This is guaranteed by choosing log M = −n log θfb . Therefore we have
shown that −n log θfb bits can be reliably delivered with n + O(1) channel uses with feedback,
thus Cfb,0 ≥ − log θfb .
Theorem above shows possible advantages of feedback for zero-error communication. How-
ever, the zero-error capacity for a generic DMC (e.g. BSCδ with δ ∈ (0, 1)) we have C0 =
Cfb,0 = 0. Can we show any advantage of feedback for such channels? Clearly for that we need to
understand behavior of log M∗fb (n, ϵ) for ϵ > 0. It turns out that [336] for weakly-input symmetric
channels (Section 19.4*) we have
√
log M∗fb (n, ϵ) = nC − nVQ−1 (ϵ) + O(log n) ,
and thus at least up to second order the behavior of fundamental limits is unchanged in the presence
of feedback. Let us next discuss the error-exponent asymptotics (Section 22.4*) by defining
1
Efb (R) ≜ lim − log ϵ∗fb (n, exp{nR}) ,
n→∞ n
provided the limit exists and having denoted by ϵ∗fb (n, M) the smallest possible error of a feedback
code of blocklength n.
First, it is known that the sphere-packing bound (22.14) continues to hold in the presence of
feedback [312], that is
4
Some rounding-off errors need to be corrected in a few final steps (because P∗X may not be closely approximable when
very few messages are remaining). This does not change the asymptotics though.
i i
i i
i i
and thus for rates R ∈ (Rcr , C) the error-exponent Efb (R) = E(R) showing no change due to
availability of the feedback. So what is the advantage of feedback then? It turns out that the error-
exponents do improve at rates below critical. For example, for the BECδ a simple transmission
scheme where each bit is retransmitted until it is successfully received achieves error exponent
Esp (R) at all rates (since the probability of error here is given by P[Bin(n, δ) > n(1 − R/ log 2)]):
BEC : Efb (R) = Esp (R) 0 < R < C,
which is strictly higher than E(R) for R < Rcr .
For the BSCδ a beautiful result of Berlekamp shows that
1
Efb (0+) = − log((1 − δ)1/3 δ 2/3 + (1 − δ)2/3 δ 1/3 ) > E(0+) = Eex (0) = − log(4δ(1 − δ)) .
4
In other words, the error probability of optimal codes of size M = exp{o(n)} does significantly
improve in the presence of feedback (at least over the BSC and BEC).
Definition 23.8 An (ℓ, M, ϵ) variable-length feedback (VLF) code, where ℓ is a positive real,
M is a positive integer and 0 ≤ ϵ ≤ 1, is defined by:
5
It can be shown that without loss of generality we can assume |U | ≤ 3, see [336, Appendix].
i i
i i
i i
456
E[τ ] ≤ ℓ . (23.16)
Ŵ = gτ (U, Yτ ) , (23.17)
P[Ŵ 6= W] ≤ ϵ . (23.18)
The fundamental limit of channel coding with feedback is given by the following quantity:
In this language, our example above showed that for the BECδ we have
Notice that compared to the scheme without feedback, there is a significant improvement since
√
the term nVQ−1 (ϵ) in the expansion for log M∗ (n, ϵ) is now dropped. For this reason, results like
this are known as zero-dispersion results.
It turns out that this effect is general for all DMC as long as we allow some ϵ > 0 error.
Theorem 23.9 (VLF zero dispersion[336]) For any DMC with capacity C we have
lC
log M∗VLF (l, ϵ) = + O(log l) (23.20)
1−ϵ
We omit the proof of this result, only mentioning that the achievability part relies on ideas
similar to SPRT from Section 16.3*: the message keeps being transmitted until the information
density i(cj ; Yn ) of one of the codewords exceeds log M. See [336] for details. We also mention
that there is another variant of the VLF coding known as VLFT coding in which the stopping time
τ instead of being determined by the receiver is allowed to be determined by the transmitter (see
Exercise IV.35(d)). The expansion (23.20) continues to hold for VLFT codes as well [336].
Example 23.1 For the channel BSC0.11 without feedback the minimal is n = 3000 needed
to achieve 90% of the capacity C, while there exists a VLF code with ℓ = E[n] = 200 achieving
that [336]. This showcases how much feedback can improve the latency and decoding complexity.
VLF codes not only kill the dispersion term, but also dramatically improves error-exponents.
We have already discussed them in the context of fixed-length codes in Section 22.4* (without
feedback) and the end of last Section (with feedback). Here we mention a deep result of Burna-
shev [79], who showed that the optimal probability of error for VLF codes of rate R (i.e. with
log M = ℓR) satisfies for every DMC
i i
i i
i i
Simplicity of this expression when compared to the complicated (and still largely open!) situation
with respect to non-feedback or fixed-length feedback error-exponents is striking.
where expectation here is both over the channel noise and the potential randomness employed
by the transmitter in determination of Xj on the basis of the message W and Y1 , . . . , Yj−1 . In the
following, we demonstrate how to leverage this new freedom effectively.
Elias’ scheme Consider sending a standard Gaussian random variable A over the following set
of AWGN channels:
Yk = X k + Zk , Zk ∼ N (0, σ 2 ) i.i.d.
E[X2k ] ≤ P.
We assume that full noiseless feedback is available as in Figure 23.1. Note that, crucially, the
power constraint is imposed in expectation, which does not increase the channel capacity (recall
the converse in Theorem 20.6) but enables simple algorithms such as Elias’ scheme below. In
contrast, if we insist as in Section 20.1 that each codeword satisfies the power constraint almost
Pn
surely instead in expectation, i.e., k=1 X2k ≤ nP a.s., then Elias’ scheme does not work.
Using only linear processing, Elias’ scheme proceeds according to illustration on Figure 23.5.
According to the orthogonality principle, at the receiver side we have for all t = 1, . . . , n,
A = Ât + Nt , Nt ⊥
⊥ Yt .
Moreover, since all operations are linear, all random variables are jointly Gaussian and hence the
residual error satisfies Nt ⊥
⊥ Yt . Since Xt ∝ Nt−1 ⊥⊥ Yt−1 , the codeword we are sending at each
time slot is independent of the history of the channel output (“innovation”), in order to maximize
the information transfer.
i i
i i
i i
458
Encoder Decoder
X1 = c 1 A Y1 = c1 A + Z1 √
Â1 = E[A|Y1 ] = P
Y1
P+σ 2
X2 = c2 (A − Â1 ) Y2 = c2 (A − Â1 ) + Z2
Â2 = E[A|Y1 , Y2 ] = linear combination of Y1 , Y2
. .
. .
. .
Xn = cn (A − Ân−1 ) Yn = cn (A − Ân−1 ) + Zn
Ân = E[A|Yn ] = linear combination of Yn
Figure 23.5 Elias’ scheme for the AWGN channel with variable power. Here, each coefficient ct is chosen
such that E[X2t ] = P.
Note that Yn → Ân → A, and the optimal estimator Ân (a linear combination of Yn ) is a sufficient
statistic of Yn for A thanks to Gaussianity. Then
i i
i i
i i
where the key step applies Xt ⊥ ⊥ Yt−1 for all t. Therefore, with Elias’ scheme of sending A ∼
N (0, 1), after the n-th use of the AWGN channel with feedback and expected power P, we have
P n
Var Nn = Var(Ân − A) = 2−2nC Var A = ,
P + σ2
which says that the reduction of uncertainty in the estimation is exponential fast in n.
Schalkwijk-Kailath scheme Elias’ scheme can also be used to send digital data. Let W ∼ be
uniform on the M-PAM (Pulse-amplitude modulation) constellation in [−1, 1], i.e., {−1, −1 +
M , · · · , −1 + M , · · · , 1}. In the very first step, W is sent (after scaling to satisfy the power
2 2k
constraint):
√
X0 = PW, Y0 = X0 + Z0
Since Y0 and X0 are both known at the encoder, it can compute Z0 . Hence, to describe W it is
sufficient for the encoder to describe the noise realization Z0 . This is done by employing the Elias’
scheme (n − 1 times). After n − 1 channel uses, and the MSE estimation, the equivalent channel
output:
e0 = X0 + Z
Y e0 , e0 ) = 2−2(n−1)C
Var(Z
e0 to the nearest PAM point. Notice that
Finally, the decoder quantizes Y
√ (n−1)C √
e 1 −(n−1)C P 2 P
ϵ ≤ P |Z0 | > =P 2 | Z| > = 2Q
2M 2M 2M
so that
√
P ϵ
log M ≥ (n − 1)C + log − log Q−1 ( ) = nC + O(1).
2 2
Hence if the rate is strictly less than capacity, the error probability decays doubly exponentially as
√
n increases. More importantly, we gained an n term in terms of log M, since for the case without
feedback we have (by Theorem 22.2)
√
log M∗ (n, ϵ) = nC − nVQ−1 (ϵ) + O(log n) .
As an example, consider P = 1 and (n−thenchannel capacity is C = 0.5 bit per channel use. To
e(n−1)C
−3
achieve error probability 10 , 2Q 2 1) C
2M ≈ 10 , so 2M ≈ 3, and logn M ≈ n−n 1 C − logn 8 .
−3
Notice that the capacity is achieved to within 99% in as few as n = 50 channel uses, whereas the
best possible block codes without feedback require n ≈ 2800 to achieve 90% of capacity.
The take-away message of this chapter is as follows: Feedback is best harnessed with adaptive
strategies. Although it does not increase capacity under block coding, feedback can greatly boost
reliability as well as reduce coding complexity.
i i
i i
i i
IV.1 Consider the AWGN channel Yn = Xn + Zn , where Zi is iid N (0, 1) and Xi ∈ [−1, 1] (amplitude
constraint). Recall that ϵ∗ (n, 2) denotes the optimal average probability of error of transmitting
1 bit of information over this channel.
R∞
(a) Express the value of ϵ∗ (n, 2) in terms of Q(x) = x √12π e−t /2 dt.
2
(b) Compute the exponent r = limn→∞ 1n log ϵ∗ (1n,2) . (Hint: Q(x) = e−x (1/2+o(1)) when x → ∞,
2
(c) By applying the DT bound with uniform PX show that there exist codes with
X n
n t
δ (1 − δ)n−t 2−|n−t−k+1| .
+
ϵ≤ (IV.2)
t
t=0
(d) Fix n = 500, δ = 1/2. Compute the smallest k for which the right-hand side of (IV.1) is
greater than 10−3 .
i i
i i
i i
(e) Fix n = 500, δ = 1/2. Find the largest k for which the right-hand side of (IV.2) is smaller
than 10−3 .
(f) Express your results in terms of lower and upper bounds on log M∗ (500, 10−3 ).
IV.5 Recall that in the proof of the DT bound (Theorem 18.6) we used the decoder that outputs (for
a given channel output y) the first cm that satisfies
One may consider the following generalization. Fix E ⊂ X × Y and let the decoder output the
first cm which satisfies
( cm , y) ∈ E
By repeating the random coding proof steps (as in the DT bound) show that the average
probability of error satisfies
M−1
E[Pe ] ≤ P[(X, Y) 6∈ E] + P[(X̄, Y) ∈ E] ,
2
where
IV.6 In Section 18.6 we showed that for additive noise, random linear codes achieves the same per-
formance as Shannon’s ensemble (fully random coding). The total number of possible generator
matrices is qnk , which is significant smaller than double exponential, but still quite large. Now
we show that without degrading the performance, we can reduce this number to qn by restricting
to Toeplitz generator matrix G, i.e., Gij = Gi−1,j−1 for all i, j > 1.
Prove the following strengthening of Theorem 18.13: Let PY|X be additive noise over Fnq . For
any 1 ≤ k ≤ n, there exists a linear code f : Fkq → Fnq with Toeplitz generator matrix, such that
+
h − n−k−log 1 n i
Pe,max = Pe ≤ E q
q P Zn ( Z )
i i
i i
i i
for the channel Y = X + Z where X is uniform on F2q , noise Z ∈ F2q has distribution PZ and
P Z ( b − a)
i(a; b) ≜ log .
q− 2
(a) Show that probability of error of the code a 7→ (av, au) + h is the same as that of a 7→
(a, auv−1 ).
(b) Let {Xa , a ∈ Fq } be a random codebook defined as
Xa = (aV, aU) + H ,
with V, U uniform over non-zero elements of Fq and H uniform over F2q , the three being
jointly independent. Show that for a 6= a′ we have
1
PXa ,X′a (x21 , x̃21 ) = 1{x1 6= x̃1 , x2 6= x̃2 }.
q2 ( q − 1) 2
(c) Show that for a 6= a′
q2 1
P[i(X′a ; Xa + Z) > log β] ≤ P[i(X̄; Y) > log β] − P[i(X; Y) > log β]
( q − 1) 2 (q − 1)2
q2
≤ P[i(X̄; Y) > log β] ,
( q − 1) 2
i i
i i
i i
The quantity I(P̂X , P̂Y|X ), sometimes written as I(xn ∧ yn ), is an empirical mutual informa-
tion.6 Hint:
PY|X (Y|X)
EQXY log = D(QY|X kQY |QX ) + D(QY kPY ) − D(QY|X kPY|X |QX ).
PY (Y)
IV.10 (Fitingof-Goppa universal codes) Consider a finite abelian group X . Define the Fitingof norm
as
Conclude that dΦ (xn , yn ) ≜ kxn − yn kΦ is a translation invariant (Fitingof) metric on the set
of equivalence classes in X n , with equivalence xn ∼ yn ⇐⇒ kxn − yn kΦ = 0.
(b) Define the Fitingof ball Br (xn ) ≜ {yn : dΦ (xn , yn ) ≤ r}. Show that
(d) Conclude that a code C ⊂ X n with Fitingof minimal distance dmin,Φ (C) ≜
minc̸=c′ ∈C dΦ (c, c′ ) ≥ 2λn is decodable with vanishing probability of error on any
additive-noise channel Y = X + Z, as long as H(Z) < λ.
Comment: By Feinstein-lemma like argument it can be shown that there exist codes of size
X n(1−λ) , such that balls of radius λn centered at codewords are almost disjoint. Such codes are
universally capacity-achieving for all memoryless additive-noise channels on X . Extension to
general (non-additive) channels is done via introducing dΦ (xn , yn ) = nH(xT |yT ), while exten-
sion to channels with Markov memory is done by introducing Markov-type norm kxn kΦ1 =
nH(xT |xT−1 ). See [196, Chapter 3].
IV.11 A magician is performing card tricks on stage. In each round he takes a shuffled deck of 52
cards and asks someone to pick a random card N from the deck, which is then revealed to the
audience. Assume the magician can prepare an arbitrary ordering of cards in the deck (before
each round) and that N is distributed binomially on {0, . . . , 51} with mean 51
2 .
(a) What is the maximal number of bits per round that he can send over to his companion in
the room (in the limit of infinitely many rounds)?
6
Invented by V. Goppa for his maximal mutual information (MMI) decoder [195]: Ŵ = argmaxi I(ci ∧ yn ).
i i
i i
i i
(b) Is communication possible if N were uniform on {0, . . . , 51}? (In practice, however, nobody
ever picks the top or the bottom ones.)
IV.12 (Channel with memory) Consider the additive noise channel with A = B = F2 (Galois field
of order 2) and PYn |Xn : Fn2 → Fn2 specified by
Yn = Xn + Zn ,
where Zn = (Z1 , . . . , Zn ) is a stationary Markov chain with PZ2 |Z1 (0|1) = PZ2 |Z1 (1|0) = τ .
Show information stability and find the capacity. (Hint: your proof should work for an arbitrary
stationary ergodic noise process Z∞ = (Z1 , . . .)). Can the capacity be achieved by linear codes?
IV.13 Consider a DMC PYn |Xn = PnY|X , where a single-letter PY|X : A → B is given by A = B =
{0, 1}7 , and
1−p y=x
PY|X (y|x) =
p/7 dH ( y, x) = 1
where dH stands for Hamming distance.In other words, for each 7-bit string, the channel either
leaves it intact, or randomly flips exactly one bit.
(a) Compute the Shannon capacity C as a function of p and plot.
(b) Consider the special case of p = 78 . Show that the zero-error capacity C0 coincides with C.
Moreover, C0 can be achieved with blocklength n = 1 and give a capacity-achieving code.
IV.14 Find the capacity of the erasure-error channel (Figure 23.6) with channel matrix
1 − 2δ δ δ
W=
δ δ 1 − 2δ
where 0 ≤ δ ≤ 1/2.
1 − 2δ
1 δ 1
δ
0 0
1 − 2δ
IV.15 (Capacity of reordering) Routers A and B are setting up a covert communication channel in
which the data is encoded in the ordering of packets. Formally, router A receives n packets,
each having one of two types, Ack or Data, with probabilities p and 1 − p, respectively (and
iid). It encodes k bits of secret data by reordering these packets. The network between A and B
delivers packets in-order with loss rate δ . (Note: packets have sequence numbers, so each loss
is detected by B).
i i
i i
i i
The dispersion of the compound DMC is, however, more delicate [342].
IV.20 Consider the following (memoryless) channel. It has a side switch U that can be in positions
ON and OFF. If U is on then the channel from X to Y is BSCδ and if U is off then Y is Bernoulli
(1/2) regardless of X. The receiving party sees Y but not U. A design constraint is that U should
be in the ON position no more than the fraction s of all channel uses, 0 ≤ s ≤ 1.
(a) One strategy is to put U into ON over the first sn time units and ignore the rest of the (1 − s)n
readings of Y. What is the maximal rate in bits per channel use achievable with this strategy?
(b) Can we increase the communication rate if the encoder is allowed to modulate the U switch
together with the input X (while still satisfying the s-constraint on U)?
(c) Now assume nobody has access to U, which is iid Ber(s) independent of X. Find the
capacity.
i i
i i
i i
IV.21 Alice has n oranges and great many (essentially, infinite) number of empty trays. She wants to
communicate a message to Bob by placing the oranges in (sequentially numbered) trays with
at most one orange per tray. Unfortunately, before Bob gets to see the trays Eve inspects them
and eats each orange independently with probability 0 < δ < 1. In the limit of n → ∞ show
that an arbitrary high rate (in bits per orange) is achievable.
Show that capacity changes to log2 δ1 bits per orange if Eve never eats any oranges but places
an orange into each empty tray with probability δ (iid).
IV.22 (Non-stationary channel [106, Problem 9.12]) A train pulls out of the station at constant velocity.
The received signal energy thus falls off with time as 1/i2 . The total received signal at time i is
1
Yi = X i + Zi ,
i
i.i.d. Pn
where Z1 , Z2 , . . . ∼ N(0, σ 2 ). The transmitter cost constraint for block length n is i |x2i | ≤ nP.
Show that the capacity C is equal to zero for this channel.
IV.23 (Capacity-cost function at the boundary.) Recall from Corollary 20.5 that we have shown that
for stationary memoryless channels and P > P0 capacity equals f(P):
where
Show:
(a) If P0 is not admissible, i.e., c(x) > P0 for all x ∈ A, then C(P0 ) is undefined (even M = 1
is not possible)
(b) If there exists a unique x0 such that c(x0 ) = P0 then
C(P0 ) = f(P0 ) = 0 .
(c) If there are more than one x with c(x) = P0 then we still have
C(P0 ) = f(P0 ) .
(d) Give example of a channel with discontinuity of C(P) at P = P0 . (Hint: select a suitable
cost function for the channel Y = (−1)Z · sign(X), where Z is Bernoulli and sign : R →
{−1, 0, 1}.)
IV.24 Consider a stationary memoryless additive non-Gaussian noise channel:
Yi = Xi + Zi , E [ Z i ] = 0, Var[Zi ] = 1
Pn 2
with the standard input constraint i=1 xi ≤ nP.
(a) Prove that capacity C(P) of this channel satisfies (20.6). (Hint: Gaussian saddle point
Theorem 5.11 and the golden formula I(X; Y) ≤ D(PY|X kQY |PX ).)
i i
i i
i i
(b) If D(PZ kN (0, 1)) = ∞ (Z is very non-Gaussian), then it is possible that the capacity is
infinite. Consider Z is ±1 with equal probability. Show that the capacity is infinite by a)
proving the maximal mutual information is infinite; b) giving an explicit scheme to achieve
infinite capacity.
IV.25 (Input-output cost) Let PY|X : X → Y be a DMC and consider a cost function c : X × Y → R
(note that c(x, y) ≤ L < ∞ for some L). Consider a problem of channel coding, where the
error-event is defined as
( n )
X
{error} ≜ {Ŵ 6= W} ∪ c(Xk , Yk ) > nP ,
k=1
where P is a fixed parameter. Define operational capacity C(P) and show it is given by
for all P > P0 ≜ minx0 E[c(X, Y)|X = x0 ]. Give a counterexample for P = P0 . (Hint: do a
converse directly, and for achievability reduce to an appropriately chosen cost-function c′ (x)).
IV.26 (Gauss-Markov noise) Let {Zj , j = 1, 2, . . .} be a stationary ergodic Gaussian process with
variance 1 such that Zj form an Markov chain Z1 → . . . → Zn → . . . Consider an additive
channel
Yn = X n + Zn
Pn
with power constraint j=1 |xj |2 ≤ nP. Suppose that I(Z1 ; Z2 ) = ϵ 1, then capacity-cost
function
1
C(P) = log(1 + P) + Bϵ + o(ϵ)
2
as ϵ → 0. Compute B and interpret your answer.
How does the frequency spectrum of optimal signal change with increasing ϵ?
IV.27 A semiconductor company offers a random number generator that outputs a block of random n
bits Y1 , . . . , Yn . The company wants to secretly embed a signature in every chip. To that end, it
decides to encode a k-bit signature in n real numbers Xj ∈ [0, 1]. Given an individual signature a
chip is manufactured such that it produces the outputs Yj ∼ Ber(Xj ). In order for the embedding
to be inconspicuous the average bias p should be small:
1X
n
1
Xj − ≤ p.
n 2
j=1
As a function of p how many signature bits per output (k/n) can be reliably embedded in this
fashion? Is there a simple coding scheme achieving this performance?
IV.28 (Capacity of sneezing) A sick student once every minute with probability p (iid) wants to sneeze.
He decides to send k bits to a friend by modulating the sneezes. For that, every time he realizes
he is about to sneeze he chooses to suppress a sneeze or not. A friend listens for n minutes and
then tries to decode k bits.
i i
i i
i i
(a) Find the capacity in bits per minute. (Hint: Think how to define the channel so that channel
input at time t were not dependent on the arrival of the sneeze at time t. To rule out strategies
that depend on arrivals of past sneezes, you may invoke Exercise IV.34.)
(b) Suppose the sender can suppress at most E sneezes and listener can wait indefinitely (n =
∞). Show that the sender can transmit Cpuc E + o(E) bits reliably as E → ∞ and find the
capacity per unit cost Cpuc . Curiously, Cpuc ≥ 1.44 bits/sneeze regardless of p. (Hint: This
is similar to Exercise IV.25.)
(d*) Redo (a) and (b) for the case of a clairvoyant student who knows exactly when sneezes
will happen in the future. (This is a simple example of a so-called Gelfand-Pinsker
problem [183].)
IV.29 A data storage company is considering two options for sending its 100 Tb archive from Boston
to NYC: via (physical) mail or via wireless transmission. Let us analyze these options:
(a) Given the radiated power Pt the received power Pr at distance r for communicating at fre-
2
c
quency f is given by Pr = G 4πrf Pt , where G is antenna-to-antenna coupling gain and c
– a speed of light7 . Assuming transmitting between Boston and NYC compute the energy
transfer coefficient η (take G = 15 dB and f = 4 GHz).
(b) The receiver’s amplifier adds white Gaussian noise of power spectral density N0 (W/Hz
or J). On the basis of required energy per bit, compute the minimal N0 which still makes
transmission over the radio economically justified assuming optimal channel coding is done
(assume 0.07$ per kWh and $20 per shipment).
(c) Compare this N0 with the thermal noise power N0,thermal = kT, where k – Boltzmann
constant and T – temperature in Kelvins. Conclude that T ≤ 103 K should work.
(d) Codes that achieve Shannon’s minimal Eb /N0 in principle do not put restrictions on the
receiver SNR (signal-to-noise ratio in one channel sample), however synchronization and
other issues constrain this SNR to be ≥ −10 dB. Assuming communication bandwidth
B = 20 Mhz compute the minimal power (in W) required for transmitter radio station.
Pr
(Hint: received SNR = BN 0
, the answer should be a few watts).
(e) How long will it take to send archive at this bandwidth and SNR? (Hint: the answer is
between a few days and a few years).
IV.30 (Optimal ϵ under ARQ) A packet of k bits is to be delivered over an AWGN channel with a
given SNR. To that end, a k-to-n error correcting code is used, whose probability of error is ϵ.
The system employs automatic repeat request (ARQ) to resend the packet whenever an error
occured.8 Suppose that the optimal k-to-n codes achieving
√ 1
k ≈ nC − nVQ−1 (ϵ) + log n
2
7
This formula is known as Friis transmission equation and it simply reflects the fact that the receiving antenna captures a
2
plane wave at the effective area of λ
4π
.
8
Assuming there is a way for receiver to verify whether his decoder produced the correct packet contents or not (e.g. by
finding HTML tags).
i i
i i
i i
are available. The goal is to optimize ϵ to get the highest average throughput: ϵ too small requires
excessive redundancy, ϵ too large leads to lots of retransmissions. Compute the optimal ϵ and
optimal block length n for the following four cases: SNR=0 dB or 20 dB; k = 103 or 104 bits.
(This gives an idea of what ϵ you should aim for in practice.)
IV.31 (Expurgated random coding bound)
(a) For any code C show the following bound on probability of error
1 X −dB (c,c′ )
Pe (C) ≤ 2 ,
M ′ c̸=c
Pn
where recall from (16.3) the Bhattacharya distance dB (xn , x̃n ) = j=1 dB (xj , x̃j ) and
Xp
dB (x, x̃) = − log2 W(y|x)W(y|x̃) .
y∈Y
− ρ1 dB (X,X′ )
(b) Fix PX and let E0,x (ρ, PX ) ≜ −ρ log2 E[2 ⊥ X′ ∼ PX . Show by random
], where X ⊥
coding that there always exists a code C of rate R with
(c) We improve the previous bound as follows. We still generate C by random coding. But this
time we expurgate all codewords with f(c, C) > med(f(c, C)), where med(·) denotes the
P ′
median and f(c) = c′ ̸=c 2−dB (c,c ) . Using the bound
med(V) ≤ 2ρ E[V1/ρ ]ρ ∀ρ ≥ 1
show that
(d) Conclude that there must exist a code with rate R − O(1/n) and Pe (C) ≤ 2−nEex (R) , where
IV.32 (Strong converse for BSC) In this exercise we give a combinatorial proof of the strong converse
for the binary symmetric channel. For BSCδ with 0 < δ < 21 ,
(a) Given any (n, M, ϵ)max -code with deterministic encoder f and decoder g, recall that the
decoding regions {Di = g−1 (i)}M i=1 form a partition of the output space. Prove that for
all i ∈ [M],
L
X n
| Di | ≥
j
j=0
M ≤ 2n(1−h(δ))+o(n) . (IV.9)
i i
i i
i i
(c) Show that (IV.9) holds for average probability of error. (Hint: how to go from maximal to
averange probability of error?)
(d) Conclude that strong converse holds for BSC. (Hint: argue that requiring deterministic
encoder/decoder does not change the asymptotics.)
IV.33 (Strong converse for AWGN) Recall that the AWGN channel is specified by
1 n 2
Yn = X n + Zn , Zn ∼ N (0, In ) , c(xn ) = kx k
n
Prove the strong converse for the AWGN via the following steps:
(a) Let ci = f(i) and Di = g−1 (i), i = 1, . . . , M be the codewords and the decoding regions of
an (n, M, P, ϵ)max code. Let
QYn = N (0, (1 + P)In ) .
Show that there must exist a codeword c and a decoding region D such that
PYn |Xn =c [D] ≥ 1 − ϵ (IV.10)
1
QYn [D] ≤ . (IV.11)
M
(b) Show that then
1
β1−ϵ (PYn |Xn =c , QYn ) ≤ . (IV.12)
M
(c) Show that hypothesis testing problem
PYn |Xn =c vs QYn
is equivalent to
PYn |Xn =Uc vs QYn
where U ∈ Rn×n is an orthogonal matrix. (Hint: use spherical symmetry of white Gaussian
distributions.)
(d) Choose U such that
PYn |Xn =Uc = Pn ,
where Pn is an iid Gaussian distribution of mean that depends on kck2 .
(e) Apply Stein’s lemma (Theorem 14.14) to show that for a certain value of E = E(P) > 0
we have
β1−ϵ (PYn |Xn =c , QYn ) = exp{−nE + o(n)}
i i
i i
i i
(a) Show that capacity is not increased in general (even when Y 6= U).
(b) Suppose now that there is a cost function c and c(x0 ) = 0. Show that capacity per unit cost
(with U being fed back) is still given by
D(PY|X=x kPY|X=x0 )
CV = max
x̸=x0 c(x)
IV.35 Consider a binary symmetric channel with crossover probability δ ∈ (0, 1):
Y = X + Z mod 2 , Z ∼ Ber(δ) .
Suppose that in addition to Y the receiver also gets to observe noise Z through a binary erasure
channel with erasure probability δe ∈ (0, 1). Compute:
(a) Capacity C of the channel.
(b) Zero-error capacity C0 of the channel.
(c) Zero-error capacity in the presence of feedback Cfb,0 .
(d*) Now consider the setup when in addition to feedback also the variable-length communica-
tion with feedback and termination (VLFT) is allowed. What is the zero-error capacity (in
bits per average number of channel uses) in this case? (In VLFT model, the transmitter can
send a special symbol T that is received without error, but the channel dies after T has been
sent; cf. Section 23.3.2)
IV.36 Consider the polygon channel discussed in Remark 23.2, where the input and output alphabet
are both {1, . . . , L}, and PY|X (b|a) > 0 if and only if b = a or b = (a mod L) + 1. The
confusability graph is a cycle of L vertices. Rigorously prove the following:
(a) For all L, The zero-error capacity with feedback is Cfb,0 = log L2 .
(b) For even L, the zero-error capacity without feedback C0 = log L2 .
(c) Now consider the following channel, where the input and output alphabet are both
{1, . . . , L}, and PY|X (b|a) > 0 if and only if b = a or b = a + 1. In this case the confusability
graph is a path of L vertices. Show that the zero-error capacity is given by
L
C0 = log
2
What is Cfb,0 ?
IV.37 (BEC with feedback) Consider the stationary memoryless binary erasure channel with erasure
probability δ and noiseless feedback. Design a fixed-blocklength coding scheme achieving the
capacity, i.e., find a scheme that sends k bits over n channel uses with noiseless feedback, such
that the rate nk approaches the capacity 1 − δ when n → ∞ and the maximal probability of
error vanishes. Show also that for any rate R < (1 − δ) bit the error-exponent matches the
sphere-packing bound.
i i
i i
i i
i i
i i
i i
Part V
i i
i i
i i
i i
i i
i i
475
In Part II we studied lossless data compression (source coding), where the goal is to compress
a random variable (source) X into a minimal number of bits on average (resp. exactly) so that
X can be reconstructed exactly (resp. with high probability) using these bits. In both cases, the
fundamental limit is given by the entropy of the source X. Clearly, this paradigm is confined to
discrete random variables.
In this part we will tackle the problem of compressing continuous random variables, known as
lossy data compression. Given a random variable X, we need to encode it into a minimal number
of bits, such that the decoded version X̂ is a faithful a reconstruction of X, which is rigorously
understood as distortion metric between X and X̂ being bounded by a prescribed fidelity either on
average or with high probability.
The motivations for study lossy compression are at least two-fold:
1 Many natural signals (e.g. audio, images, or video) are continuously valued. As such, there is
a need to represent these real-valued random variables or processes using finitely many bits,
which can be fed to downstream digital processing; see Figure 23.7 for an illustration.
Domain Range
Continuous Analog
Sampling time Quantization
Signal
Discrete Digital
time
2 There is a lot to be gained in compression if we allow some reconstruction errors. This is espe-
cially important in applications where certain errors (such as high-frequency components in
natural audio and visual signals) are imperceptible to humans. This observation is the basis of
many important compression algorithms and standards that are widely deployed in practice,
including JPEG for images, MPEG for videos, and MP3 for audios.
The operation of mapping (naturally occurring) continuous time/analog signals into
(electronics-friendly) discrete/digital signals is known as quantization, which is an important sub-
ject in signal processing in its own right (cf. the encyclopedic survey [197]). In information theory,
the study of optimal quantization is called rate-distortion theory, introduced by Shannon in 1959
[380]. To start, we will take a closer look at quantization next in Section 24.1, followed by the
information-theoretic formulation in Section 24.2. A simple (and tight) converse bound is given
in Section 24.3, with the matching achievability bound deferred to the next chapter.
In Chapter 25 we present the hard direction of the rate-distortion theorem: the random coding
construction of a quantizer. This method is extended to the development of a covering lemma and
soft-covering lemma, which lead to sharp result of Cuff showing that the fundamental limit of
channel simulation is given by Wyner’s common information. We also derive (strengthened form
of) Han-Verdú’s results on approximating output distributions in KL.
Chapter 26 evaluates rate-distortion function for Gaussian and Hamming sources. We also dis-
cuss the important foundational implication that optimal (lossy) compressor paired with an optimal
i i
i i
i i
476
error correcting code together form an optimal end-to-end communication scheme (known as joint
source-channel coding separation principle). This principle explains why “bits” are the natural
currency of the digital age.
Finally, in Chapter 27 we study Kolmogorov’s metric entropy, which is a non-probabilistic
theory of quantization for sets in metric spaces. While traditional rate-distortion tries to compress
samples from a fixed distribution, metric entropy tries to compress any element of the metric
space. What links the two topics is that metric entropy can be viewed as a rate-distortion theory
applied to the “worst-case” distribution on the space (an idea further expanded in Section 27.6). In
addition to connections to the probabilistic theory of quantization in the preceding chapters, this
concept has far-reaching consequences in both probability (e.g. empirical processes, small-ball
probability) and statistical learning (e.g. entropic upper and lower bounds for estimation) that will
be explored further in Part VI. Exercises explore applications to Brownian motion (Exercise V.30),
random matrices (Exercise V.29) and more.
i i
i i
i i
24 Rate-distortion theory
In this chapter we introduce the theory of optimal quantization. In Section 24.1 we examine the
classical theory for quantization for fixed dimension and high rate, discussing various aspects such
as uniform versus non-uniform quantization, fixed versus variable rate, quantization algorithm (of
Lloyd) versus clustering, and the asymptotics of optimal quantization error. In Section 24.2 we turn
to the information-theoretic formulation of quantization, known as the rate-distortion theory, that is
targeted at high dimension and fixed rate and the regime where the number of reconstruction points
grows exponentially with dimension. Section 24.3 introduces the rate-distortion function and the
main converse bound. Finally, in Section 24.4* we discuss how to relate the average distortion
(which we focus) to excess distortion that targets a reconstruction error in high probability as
opposed to in expectation.
−A A
2 2
477
i i
i i
i i
478
where D denotes the average distortion. Often R = log2 N is used instead of N, so that we think
about the number of bits we can use for quantization instead of the number of points. To analyze
this scalar uniform quantizer, we’ll look at the high-rate regime (R 1). The key idea in the high
rate regime is that (assuming a smooth density PX ), each quantization interval ∆j looks nearly flat,
so conditioned on ∆j , the distribution is accurately approximately by a uniform distribution.
∆j
Let cj be the j-th quantization point, and ∆j be the j-th quantization interval. Here we have
X
N
DU (R) = E|X − qU (X)|2 = E[|X − cj |2 |X ∈ ∆j ]P[X ∈ ∆j ] (24.1)
j=1
X
N
|∆j |2
(high rate approximation) ≈ P[ X ∈ ∆ j ] (24.2)
12
j=1
( NA )2 A2 −2R
= = 2 , (24.3)
12 12
where we used the fact that the variance of Unif(−a, a) is a2 /3.
How much do we gain per bit?
Var(X)
10 log10 SNR = 10 log10
E|X − qU (X)|2
12Var(X)
= 10 log10 + (20 log10 2)R
A2
= constant + (6.02dB)R
For example, when X is uniform on [− A2 , A2 ], the constant is 0. Every engineer knows the rule of
thumb “6dB per bit”; adding one more quantization bit gets you 6 dB improvement in SNR. How-
ever, here we can see that this rule of thumb is valid only in the high rate regime. (Consequently,
widely articulated claims such as “16-bit PCM (CD-quality) provides 96 dB of SNR” should be
taken with a grain of salt.)
The above discussion deals with X with a bounded support. When X is unbounded, it is wise to
allocate the quantization points to those values that are more likely and saturate the large values at
i i
i i
i i
the dynamic range of the quantizer, resulting in two types of contributions to the quantization error,
known as the granular distortion and overload distortion. This leads us to the question: Perhaps
uniform quantization is not optimal?
Often the way such quantizers are implemented is to take a monotone transformation of the source
f(X), perform uniform quantization, then take the inverse function:
f
X U
q qU (24.4)
X̂ qU ( U)
f−1
i.e., q(X) = f−1 (qU (f(X))). The function f is usually called the compander (compressor+expander).
One of the choice of f is the CDF of X, which maps X into uniform on [0, 1]. In fact, this compander
architecture is optimal in the high-rate regime (fine quantization) but the optimal f is not the CDF
(!). We defer this discussion till Section 24.1.4.
In terms of practical considerations, for example, the human ear can detect sounds with volume
as small as 0 dB, and a painful, ear-damaging sound occurs around 140 dB. Achieving this is
possible because the human ear inherently uses logarithmic comp anding function. Furthermore,
many natural signals (such as differences of consecutive samples in speech or music (but not
samples themselves!)) have an approximately Laplace distribution. Due to these two factors, a
very popular and sensible choice for f is the μ-companding function
i i
i i
i i
480
which compresses the dynamic range, uses more bits for smaller |X|’s, e.g. |X|’s in the range of
human hearing, and less quantization bits outside this region. This results in the so-called μ-law
which is used in the digital telecommunication systems in the US, while in Europe a slightly
different compander called the A-law is used.
Intuitively, we would think that the optimal quantization regions should be contiguous; otherwise,
given a point cj , our reconstruction error will be larger. Therefore in one dimension quantizers are
piecewise constant:
1 Draw the Voronoi regions around the chosen quantization points (aka minimum distance
tessellation, or set of points closest to cj ), which forms a partition of the space.
2 Update the quantization points by the centroids E[X|X ∈ D] of each Voronoi region D.
b b
b b
b b
b b
b b
1
This work at Bell Labs remained unpublished until 1982 [284].
i i
i i
i i
Lloyd’s clever observation is that the centroid of each Voronoi region is (in general) different than
the original quantization points. Therefore, iterating through this procedure gives the Centroidal
Voronoi Tessellation (CVT - which are very beautiful objects in their own right), which can be
viewed as the fixed point of this iterative mapping. The following theorem gives the results on
Lloyd’s algorithm.
Remark 24.1 The third point tells us that Lloyd’s algorithm is not always guaranteed to give
the optimal quantization strategy.2 One sufficient condition for uniqueness of a CVT is the log-
concavity of the density of X [171], e.g., Gaussians. On the other hand, even for the Gaussian PX
and N > 3, the optimal quantization points are not known in closed form. So it may seem to be
very hard to have any meaningful theory of optimal quantizers. However, as we shall see next,
when N becomes very large, locations of optimal quantization points can be characterized. In this
section we will do so in the case of fixed dimension, while for the rest of this Part we will consider
the regime of taking N to grow exponentially with dimension.
Remark 24.2 (k-means) A popular clustering method called k-means is the following: Given
n data points x1 , . . . , xn ∈ Rd , the goal is to find k centers μ1 , . . . , μk ∈ Rd to minimize the objective
function
X
n
min kxi − μj k2 .
j∈[k]
i=1
This is equivalent to solving the optimal vector quantization problem analogous to (24.5):
2
As a simple example one may consider PX = 13 ϕ(x − 1) + 31 f(x) + 13 f(x + 1) where f(·) is a very narrow pdf, symmetric
around 0. Here the CVT with centers ± 23 is not optimal among binary quantizers (just compare to any quantizer that
quantizes two adjacent spikes to same value).
i i
i i
i i
482
in a given interval and allows us to approximate summations by integrals.3 Then the number of
Rb
quantization points in any interval [a, b] is ≈ N a λ(x)dx. For any point x, denote the size of the
quantization interval that contains x by ∆(x). Then
Z x+∆(x)
1
N λ(t)dt ≈ Nλ(x)∆(x) ≈ 1 =⇒ ∆(x) ≈ .
x Nλ(x)
With this approximation, the quality of reconstruction is
X
N
E|X − q(X)|2 = E[|X − cj |2 |X ∈ ∆j ]P[X ∈ ∆j ]
j=1
X
N Z
|∆j |2 ∆ 2 ( x)
≈ P[ X ∈ ∆ j ] ≈ p ( x) dx
12 12
j=1
Z
1
= p(x)λ−2 (x)dx ,
12N2
To find the optimal density λ that gives the best reconstruction (minimum MSE) when X has den-
R R R R R
sity p, we use Hölder’s inequality: p1/3 ≤ ( pλ−2 )1/3 ( λ)2/3 . Therefore pλ−2 ≥ ( p1/3 )3 ,
1/ 3
with equality iff pλ−2 ∝ λ. Hence the optimizer is λ∗ (x) = Rp (x) .
p1/3 dx
Therefore when N = 2R ,4
Z 3
1 −2R
Dscalar (R) ≈ 2 p1/3 (x)dx
12
So our optimal quantizer density in the high rate regime is proportional to the cubic root of the
density of our source. This approximation is called the Panter-Dite approximation. For example,
• When X ∈ [− A2 , A2 ], using Hölder’s inequality again h1, p1/3 i ≤ k1k 3 kp1/3 k3 = A2/3 , we have
2
1 −2R 2
Dscalar (R) ≤ 2 A = DU (R)
12
where the RHS is the uniform quantization error given in (24.1). Therefore as long as the
source distribution is not uniform, there is strict improvement. For uniform distribution, uniform
quantization is, unsurprisingly, optimal.
• When X ∼ N (0, σ 2 ), this gives
√
2 −2R π 3
Dscalar (R) ≈ σ 2 (24.6)
2
Remark 24.3 In fact, in scalar case the optimal non-uniform quantizer can be realized using
the compander architecture (24.4) that we discussed in Section 24.1.2: As an exercise, use Taylor
3
This argument is easy to make rigorous. We only need to define reconstruction points cj as the solution of
∫ cj j
−∞ λ(x) dx = N (quantile).
4
In fact when R → ∞, “≈” can be replaced by “= 1 + o(1)” as shown by Zador [467, 468].
i i
i i
i i
∆2 22h(X)
D= ≈ 2−2R .
12 12
On the other hand, any quantizer with unnormalized point density function Λ(x) (i.e. smooth
R cj
function such that −∞ Λ(x)dx = j) can be shown to achieve (assuming Λ → ∞ pointwise)
Z
1 1
D≈ pX (x) 2 dx
12 Λ ( x)
Z
Λ(x)
H(q(X)) ≈ pX (x) log dx.
p X ( x)
Now, from Jensen’s inequality we have
Z Z
1 1 1 22h(X)
pX (x) 2 dx ≥ exp{−2 pX (x) log Λ(x) dx} ≈ 2−2H(q(X)) ,
12 Λ ( x) 12 12
concluding that uniform quantizer is asymptotically optimal.
Furthermore, it turns out that for any source, even the optimal vector quantizers (to be con-
2h(X)
sidered next) can not achieve distortion better that 2−2R 22πe . That is, the maximal improvement
they can gain for any i.i.d. source is 1.53 dB (or 0.255 bit/sample). This is one reason why scalar
uniform quantizers followed by lossless compression is an overwhelmingly popular solution in
practice.
i i
i i
i i
484
Hamming Game. Given 100 unbiased bits, we are asked to inspect them and scribble something
down on a piece of paper that can store 50 bits at most. Later we will be asked to guess the original
100 bits, with the goal of maximizing the number of correctly guessed bits. What is the best
strategy? Intuitively, it seems the optimal strategy would be to store half of the bits then randomly
guess on the rest, which gives 25% bit error rate (BER). However, as we will show in this chapter
(Theorem 26.1), the optimal strategy amazingly achieves a BER of 11%. How is this possible?
After all we are guessing independent bits and the loss function (BER) treats all bits equally.
Gaussian example. Given (X1 , . . . , Xn ) drawn independently from N (0, σ 2 ), we are given a
budget of one bit per symbol to compress, so that the decoded version (X̂1 , . . . , X̂n ) has a small
Pn
mean-squared error 1n i=1 E[(Xi − X̂i )2 ].
To this end, a simple strategy is to quantize each coordinate into 1 bit. As worked out in Exam-
ple 24.1, the optimal one-bit quantization error is (1 − π2 )σ 2 ≈ 0.36σ 2 . In comparison, we will
2
show later (Theorem 26.2) that there is a scheme that achieves an MSE of σ4 per coordinate
for large n; furthermore, this is optimal. More generally, given R bits per symbol, by doing opti-
mal vector quantization in high dimensions (namely, compressing (X1 , . . . , Xn ) jointly to nR bits),
rate-distortion theory will tell us that when n is large, we can achieve the per-coordinate MSE:
1 Applying scalar quantization componentwise results in quantization region that are hypercubes,
which may not suboptimal for covering in high dimensions.
2 Concentration of measures effectively removes many atypical source realizations. For example,
when quantizing a single Gaussian X, we need to cover large portion of R in order to deal with
those significant deviations of X from 0. However, when we are quantizing many (X1 , . . . , Xn )
together, the law of large numbers makes sure that many Xj ’s cannot conspire together and all
produce large values. Indeed, (X1 , . . . , Xn ) concentrates near a sphere. As such, we may exclude
large portions of the space Rn from consideration.
where X ∈ X is refereed to as the source, W = f(X) is the compressed discrete data, and X̂ = g(W)
is the reconstruction which takes values in some alphabet X̂ that needs not be the same as X .
A distortion metric (or loss function) is a measurable function d : X × X̂ → R ∪ {+∞}. There
are various formulations of the lossy compression problem:
i i
i i
i i
1 Fixed length (fixed rate), average distortion: W ∈ [M], minimize E[d(X, X̂)].
2 Fixed length, excess distortion: W ∈ [M], minimize P[d(X, X̂) > D].
3 Variable length, max distortion: W ∈ {0, 1}∗ , d(X, X̂) ≤ D a.s., minimize the average length
E[l(W)] or entropy H(W).
In this book we focus on lossy compression with fixed length and are chiefly concerned with
average distortion (with the exception of joint source-channel coding in Section 26.3 where excess
distortion will be needed). The difference between average and excess distortion is analogous
to average and high-probability risk bound in statistics and machine learning. It turns out that
under mild assumptions these two formulations lead to the same asymptotic fundamental limit
(cf. Remark 25.2). However, the speed of convergence to that limit is very different: the excess
distortion version converges as O( √1n ) has a rich dispersion theory [255], which we do not discuss.
The convergence under excess distortion is much faster as O( logn n ); see Exercise V.3.
As usual, of particular interest is when the source takes the form of a random vector Sn =
(S1 , . . . , Sn ) ∈ S n and the reconstruction is Ŝn = (S1 , . . . , Sn ) ∈ Ŝ n . We will be focusing on the
so called separable distortion metric defined for n-letter vectors by averaging the single-letter
distortions:
1X
n
d(sn , ŝn ) ≜ d(si , ŝi ). (24.8)
n
i=1
Note that, for stationary memoryless (iid) source, the large-blocklength limit in (24.10) in fact
exists and coincides with the infimum over all blocklengths. This is a consequence of the average
distortion criterion and the separability of the distortion metric – see Exercise V.2.
i i
i i
i i
486
Proof.
where the last inequality follows from the fact that PX̂|X is a feasible solution (by assumption).
Then ϕX (D) = 0 for all D > Dmax . If D0 > Dmax then also ϕX (Dmax ) = 0.
Remark 24.4 (The role of D0 and Dmax ) By definition, Dmax is the distortion attainable
without any information. Indeed, if Dmax = Ed(X, x̂) for some fixed x̂, then this x̂ is the “default”
reconstruction of X, i.e., the best estimate when we have no information about X. Therefore D ≥
Dmax can be achieved for free. This is the reason for the notation Dmax despite that it is defined as
an infimum. On the other hand, D0 should be understood as the minimum distortion one can hope
to attain. Indeed, suppose that X̂ = X and d is a metric on X . In this case, we have D0 = 0, since
we can choose Y to be a finitely-valued approximation of X.
As an example, consider the Gaussian source with MSE distortion, namely, X ∼ N (0, σ 2 ) and
2
d(x, x̂) = (x −x̂)2 . We will show later that ϕX (D) = 12 log+ σD . In this case D0 = 0 and the infimum
defining it is not attained; Dmax = σ 2 and if D ≥ σ 2 , we can simply output 0 as the reconstruction
which requires zero bits.
Proof.
(a) Convexity follows from the convexity of PY|X 7→ I(PX , PY|X ) (Theorem 5.3).
(b) Continuity in the interior of the domain follows from convexity, since D0 =
infPX̂|X E[d(X, X̂)] = inf{D : ϕS (D) < ∞}.
(c) The only way to satisfy the constraint is to take X = Y.
(d) Clearly, D0 = d(x, x) = 0. We also clearly have ϕX (D0 ) ≥ ϕX (D0 +). Consider a sequence
of Yn such that E[d(X, Yn )] ≤ 2−n and I(X; Yn ) → ϕX (D0 +). By Borel-Cantelli we have with
probability 1 d(X, Yn ) → 0 and hence (X, Yn ) → (X, X). Then from lower-semicontinuity of
mutual information (4.28) we get I(X; X) ≤ lim I(X; Yn ) = ϕX (D0 +).
(e) For any D > Dmax we can set X̂ = x̂ deterministically. Thus I(X; x̂) = 0. The second claim
follows from continuity.
i i
i i
i i
In channel coding, the main result relates the Shannon capacity, an operational quantity, to the
information capacity. Here we introduce the information rate-distortion function in an analogous
way, which by itself is not an operational quantity.
The reason for defining R(I) (D) is because from Theorem 24.3 we immediately get:
Naturally, the information rate-distortion function inherits the properties of ϕ from Theo-
rem 24.4:
Proof. All properties follow directly from corresponding properties in Theorem 24.4 applied to
ϕSn .
Next we show that R(I) (D) can be easily calculated for stationary memoryless (iid) source
without going through the multi-letter optimization problem. This parallels Corollary 20.5 for
channel capacity (with separable cost function).
i.i.d.
Theorem 24.8 (Single-letterization) For stationary memoryless source Si ∼ PS and
separable distortion d in the sense of (24.8), we have for every n,
Thus
i i
i i
i i
488
Proof. By definition we have that ϕSn (D) ≤ nϕS (D) by choosing a product channel: PŜn |Sn = P⊗ n
Ŝ|S
.
Thus R(I) (D) ≤ ϕS (D).
For the converse, for any PŜn |Sn satisfying the constraint E[d(Sn , Ŝn )] ≤ D, we have
X
n
I(Sn ; Ŝn ) ≥ I(Sj , Ŝj ) (Sn independent)
j=1
X
n
≥ ϕS (E[d(Sj , Ŝj )])
j=1
1X
n
≥ nϕ S E[d(Sj , Ŝj )] (convexity of ϕS )
n
j=1
≥ nϕ S ( D) (ϕS non-increasing)
In the first step we used the crucial super-additivity property of mutual information (6.2).
For generalization to a memoryless but non-stationary sources see Exercise V.10.
Theorem 24.9 (Excess-to-Average) Suppose that there exists (f, g) such that W = f(X) ∈
[M] and P[d(X, g(W)) > D] ≤ ϵ. Assume for some p ≥ 1 and x̂0 ∈ X̂ that (E[d(X, x̂0 )p ])1/p =
Dp < ∞. Then there exists (f′ , g′ ) such that W′ = f′ (X) ∈ [M + 1] and
E[d(X, g(W′ ))] ≤ D(1 − ϵ) + Dp ϵ1−1/p . (24.11)
Remark 24.5 This result is only useful for p > 1, since for p = 1 the right-hand side of (24.11)
does not converge to D as ϵ → 0. However, a different method (as we will see in the proof of
Theorem 25.1) implies that under just Dmax = D1 < ∞ the analog of the second term in (24.11)
is vanishing as ϵ → 0, albeit at an unspecified rate.
Proof. We transform the first code into the second by adding one codeword:
(
′ f ( x) d(x, g(f(x))) ≤ D
f ( x) =
M + 1 otherwise
(
g( j) j ≤ M
g′ ( j) =
x̂0 j=M+1
Then by Hölder’s inequality,
E[d(X, g′ (W′ )) ≤ E[d(X, g(W))|Ŵ 6= M + 1](1 − ϵ) + E[d(X, x̂0 )1{Ŵ = M + 1}]
i i
i i
i i
≤ D(1 − ϵ) + Dp ϵ1−1/p
i i
i i
i i
In this chapter, we prove an achievability bound and (together with the converse from the previous
chapter) establish the identity R(D) = infŜ:E[d(S,Ŝ)]≤D I(S; Ŝ) for stationary memoryless sources.
The key idea is again random coding, which is a probabilistic construction of quantizers by gener-
ating the reconstruction points independently from a carefully chosen distribution. Before proving
this result rigorously, we first convey the main intuition in the case of Bernoulli sources by making
connections to large deviations theory (Chapter 15) and explaining how the constrained minimiza-
tion of mutual information is related to optimization of the random coding ensemble. Later in
Sections 25.2*–25.4*, we extend this random coding construction to establish covering lemma
and soft-covering lemma, which are at the heart of the problem of channel simulation.
We start by recalling the key concepts introduced in the last chapter:
1
R(D) = lim sup log M∗ (n, D), (rate-distortion function)
n→∞ n
1
R(I) (D) = lim sup ϕSn (D), (information rate-distortion function)
n→∞ n
where
ϕ S ( D) ≜ inf I(S; Ŝ) (25.1)
PŜ|S :E[d(S,Ŝ)]≤D
490
i i
i i
i i
Then
Remark 25.1
• Note that Dmax < ∞ does not require that d(·, ·) only takes values in R. That is, Theorem 25.1
permits d(s, ŝ) = ∞.
• When Dmax = ∞, typically we have R(D) = ∞ for all D. Indeed, suppose that d(·, ·) is a metric
(i.e. real-valued and satisfies triangle inequality). Then, for any x0 ∈ An we have
Thus, for any finite codebook {c1 , . . . , cM } we have maxj d(x0 , cj ) < ∞ and therefore
So that R(D) = ∞ for any finite D. This observation, however, should not be interpreted as
the absolute impossibility of compressing such sources; it is just not possible with fixed-length
codes. As an example, for quadratic distortion and Cauchy-distributed S, Dmax = ∞ since S
has infinite second moment. But it is easy to see that1 the information rate-distortion function
R(I) (D) < ∞ for any D ∈ (0, ∞). In fact, in this case R(I) (D) is a hyperbola-like curve that
never touches either axis. Using variable-length codes, Sn can be compressed non-trivially into
W with bounded entropy (but unbounded cardinality) H(W). An open question: Is H(W) =
nR(I) (D) + o(n) attainable?
• We restricted theorem to D > D0 because it is possible that R(D0 ) 6= R(I) (D0 ). For exam-
ple, consider an iid non-uniform source {Sj } with A = Â being a finite metric space with
metric d(·, ·). Then D0 = 0 and from Exercise V.5 we have R(D0 +) < R(D0 ). At the same
time, from Theorem 24.4(d) we know that R(I) is continuous at D0 : R(I) (D0 +) = ϕS (D0 +) =
ϕS (D0 ) = R(I) (D0 ).
1
Indeed, if we take W to be a quantized version of S with small quantization error D and notice that differential entropy of
the Cauchy S is finite, we get from (24.7) that R(I) (D) ≤ H(W) < ∞.
i i
i i
i i
492
• Techniques for proving (25.4) for memoryless sources can be extended to stationary ergodic
sources by making changes to the proof similar to those we have discussed in lossless
compression (Chapter 12).
Before giving a formal proof, we give a heuristic derivation emphasizing the connection to large
deviations estimates from Chapter 15.
25.1.1 Intuition
Let us throw M random points C = {c1 , . . . , cM } into the space Ân by generating them indepen-
dently according to a product distribution QnŜ , where QŜ is some distribution on  to be optimized.
Consider the following simple coding strategy:
The basic idea is the following: Since the codewords are generated independently of the source,
the probability that a given codeword is close to the source realization is (exponentially) small, say,
ϵ. However, since we have many codewords, the chance that there exists a good one can be of high
probability. More precisely, the probability that no good codewords exist is approximately (1 −ϵ)M ,
which can be made close to zero provided M 1ϵ .
i.i.d.
To explain this intuition further, consider a discrete memoryless source Sn ∼ PS and let us eval-
uate the excess distortion of this random code: P[d(Sn , f(Sn )) > D], where the probability is over
all random codewords c1 , . . . , cM and the source Sn . Define
where the last equality follows from the assumption that c1 , . . . , cM are iid and independent of Sn .
i.i.d.
To simplify notation, let Ŝn ∼ QnŜ independently of Sn , so that PSn ,Ŝn = PnS QnŜ . Then
To evaluate the failure probability, let us consider the special case of PS = Ber( 12 ) and also
choose QŜ = Ber( 12 ) to generate the random codewords, aiming to achieve a normalized Hamming
P P
distortion at most D < 12 . Since nd(Sn , Ŝn ) = i:si =1 (1 − Ŝi ) + i:si =0 Ŝi ∼ Bin(n, 21 ) for any sn ,
the conditional probability (25.7) does not depend on Sn and is given by
1
P[d(S , Ŝ ) > D|S ] = P Bin n,
n n n
≥ nD ≈ 1 − 2−n(1−h(D))+o(n) , (25.8)
2
where in the last step we applied large-deviations estimates from Theorem 15.9 and Example 15.1.
(Note that here we actually need lower estimates on these exponentially small probabilities.) Thus,
i i
i i
i i
Pfailure = (1 − 2−n(1−h(D))+o(n) )M , which vanishes if M = 2n(1−h(D)+δ) for any δ > 0.2 As we will
compute in Theorem 26.1, the rate-distortion function for PS = Ber( 12 ) is precisely ϕS (D) =
1 − h(D), so we have a rigorous proof of the optimal achievability in this special case.
For general distribution PS (or even for PS = Ber(p) for which it is suboptimal to choose
QŜ as Ber( 12 )), the situation is more complicated as the conditional probability (25.7) depends
on the source realization Sn through its empirical distribution (type). Let Tn be the set of typical
realizations whose empirical distribution is close to PS . We have
−nE(QŜ ) M
≈(1 − 2 ) ,
where it can be shown (using large deviations analysis similar to information projection in
Chapter 15) that
Thus we conclude that for any choice of QŜ (from which the random codewords were drawn) and
any δ > 0, the above code with M = 2n(E(QŜ )+δ) achieves vanishing excess distortion
= ϕ S ( D)
where the third equality follows from the variational representation of mutual information (Corol-
lary 4.2). This heuristic derivation explains how the constrained mutual information minimization
arises. Below we make it rigorous using a different approach, again via random coding.
2
In fact, this argument shows that M = 2n(1−h(D))+o(n) codewords suffice to cover the entire Hamming space within
distance Dn. See (27.9) and Exercise V.26.
i i
i i
i i
494
W ∈ [M + 1], such d(X, X̂) ≤ d(X, y0 ) almost surely and for any γ > 0,
E[d(X, X̂)] ≤ E[d(X, Y)] + E[d(X, y0 )]e−M/γ + E[d(X, y0 )1 {i(X; Y) > log γ}].
Here the first and the third expectations are over (X, Y) ∼ PX,Y = PX PY|X and the information
density i(·; ·) is defined with respect to this joint distribution (cf. Definition 18.1).
• Theorem 25.2 says that from an arbitrary PY|X such that E[d(X, Y)] ≤ D, we can extract a good
code with average distortion D plus some extra terms which will vanish in the asymptotic regime
for memoryless sources.
• The proof uses the random coding argument with codewords drawn independently from PY , the
marginal distribution induced by the source distribution PX and the auxiliary channel PY|X . As
such, PY|X plays no role in the code construction and is used only in analysis (by defining a
coupling between PX and PY ).
• The role of the deterministic y0 is a “fail-safe” codeword (think of y0 as the default reconstruc-
tion with Dmax = E[d(X, y0 )]). We add y0 to the random codebook for “damage control”, to
hedge against the (highly unlikely) event that we end up with a terrible codebook.
Proof. Similar to the intuitive argument sketched in Section 25.1.1, we apply random coding and
generate the codewords randomly and independently of the source:
i.i.d.
C = {c1 , . . . , cM } ∼ PY ⊥
⊥X
and add the “fail-safe” codeword cM+1 = y0 . We adopt the same encoder-decoder pair (25.5) –
(25.6) and let X̂ = g(f(X)). Then by definition,
To simplify notation, let Y be an independent copy of Y (similar to the idea of introducing unsent
codeword X in channel coding – see Chapter 18):
PX,Y,Y = PX,Y PY
where PY = PY . Recall the formula for computing the expectation of a random variable U ∈ [0, a]:
Ra
E[U] = 0 P[U ≥ u]du. Then the average distortion is
i i
i i
i i
Z d(X,y0 )
= EX P[d(X, Y) > u|X]M du
0
Z d(X,y0 )
= EX (1 − P[d(X, Y) ≤ u|X])M du
0
Z d(X,y0 )
≤ EX (1 − P[d(X, Y) ≤ u, i(X, Y) > −∞|X])M du. (25.11)
0 | {z }
≜δ(X,u)
• (25.12) uses the following trick in dealing with (1 − δ)M for δ 1 and M 1. First, recall the
standard rule of thumb:
(
0, δ M 1
(1 − δ) ≈
M
1, δ M 1
In order to obtain firm bounds of a similar flavor, we apply, for any γ > 0,
(1 − δ)M ≤ e−δM ≤ e−M/γ + (1 − γδ)+ .
• (25.13) is simply a change of measure argument of Proposition 18.3. Namely we apply (18.4)
with f(x, y) = 1 {d(x, y) ≤ u}.
• For (25.14) consider the chain:
1 − γ E[exp{−i(X; Y)}1 {d(X, Y) ≤ u}|X] ≤ 1 − γ E[exp{−i(X; Y)}1 {d(X, Y) ≤ u, i(X; Y) ≤ log γ}|X]
≤ 1 − E[1 {d(X, Y) ≤ u, i(X; Y) ≤ log γ}|X]
= P[d(X, Y) > u or i(X; Y) > log γ|X]
≤ P[d(X, Y) > u|X] + P[i(X; Y) > log γ|X]
As a side product, we have the following achievability result for excess distortion.
i i
i i
i i
496
Theorem 25.3 (Random coding bound of excess distortion) For any PY|X , there
exists a code X → W → X̂ with W ∈ [M], such that for any γ > 0,
P[d(X, X̂) > D] ≤ e−M/γ + P[{d(X, Y) > D} ∪ {i(X; Y) > log γ}]
Proof. Proceed exactly as in the proof of Theorem 25.2 (without using the extra codeword y0 ),
replace (25.11) by P[d(X, X̂) > D] = P[∀j ∈ [M], d(X, cj ) > D] = EX [(1 − P[d(X, Y) ≤ D|X])M ],
and continue similarly.
Finally, we give a rigorous proof of Theorem 25.1 by applying Theorem 25.2 to the iid source
i.i.d.
X = Sn ∼ PS and n → ∞:
Proof of Theorem 25.1. Our goal is the achievability: R(D) ≤ R(I) (D) = ϕS (D).
WLOG we can assume that Dmax = E[d(S, ŝ0 )] is achieved at some fixed ŝ0 – this is our default
reconstruction; otherwise just take any other fixed symbol so that the expectation is finite. The
default reconstruction for Sn is ŝn0 = (ŝ0 , . . . , ŝ0 ) and E[d(Sn , ŝn0 )] = Dmax < ∞ since the distortion
is separable.
Fix some small δ > 0. Take any PŜ|S such that E[d(S, Ŝ)] ≤ D − δ ; such PŜ|S since D > D0 by
assumption. Apply Theorem 25.2 to (X, Y) = (Sn , Ŝn ) with
PX = PSn
PY|X = PŜn |Sn = (PŜ|S )n
log M = n(I(S; Ŝ) + 2δ)
log γ = n(I(S; Ŝ) + δ)
1X
n
d( X , Y ) = d(Sj , Ŝj )
n
j=1
y0 = ŝn0
we conclude that there exists a compressor f : An → [M + 1] and g : [M + 1] → Ân , such that
n o
E[d(Sn , g(f(Sn )))] ≤ E[d(Sn , Ŝn )] + E[d(Sn , ŝn0 )]e−M/γ + E[d(Sn , ŝn0 )1 i(Sn ; Ŝn ) > log γ ]
≤ D − δ + Dmax e− exp(nδ) + E[d(Sn , ŝn0 )1En ], (25.15)
| {z } | {z }
→0 →0 (later)
where
1 X
n
WLLN
En = {i(Sn ; Ŝn ) > log γ} = i(Sj ; Ŝj ) > I(S; Ŝ) + δ ====⇒ P[En ] → 0
n
j=1
If we can show the expectation in (25.15) vanishes, then there exists an (n, M, D)-code with:
M = 2n(I(S;Ŝ)+2δ) , D = D − δ + o( 1) ≤ D.
To summarize, ∀PŜ|S such that E[d(S, Ŝ)] ≤ D −δ we have shown that R(D) ≤ I(S; Ŝ). Sending δ ↓
0, we have, by continuity of ϕS (D) in (D0 ∞) (recall Theorem 24.4), R(D) ≤ ϕS (D−) = ϕS (D).
i i
i i
i i
It remains to show the expectation in (25.15) vanishes. This is a simple consequence of the
uniform integrability of the sequence {d(Sn , ŝn0 )}. We need the following lemma.
Lemma 25.4 For any positive random variable U, define g(δ) = supH:P[H]≤δ E[U1H ], where
δ→0
the supremum is over all events measurable with respect to U. Then3 EU < ∞ ⇒ g(δ) −−−→ 0.
b→∞
Proof. For any b > 0, E[U1H ] ≤ E[U1 {U > b}] + bδ , where E[U1 {U > b√}] −−−→ 0 by
dominated convergence theorem. Then the proof is completed by setting b = 1/ δ .
Pn
Now d(Sn , ŝn0 ) = 1n j=1 Uj , where Uj are iid copies of U ≜ d(S, ŝ0 ). Since E[U] = Dmax < ∞
P
by assumption, applying Lemma 25.4 yields E[d(Sn , ŝn0 )1En ] = 1n E[Uj 1En ] ≤ g(P[En ]) → 0,
since P[En ] → 0. This proves the theorem.
Remark 25.2 (Fundamental limit for excess distortion) Although Theorem 25.1 is
stated for the average distortion, under certain mild extra conditions, it also holds for excess distor-
tion where the goal is to achieve d(Sn , Ŝn ) ≤ D with probability arbitrarily close to one as opposed
to in expectation. Indeed, the achievability proof of Theorem 25.1 is already stated in high proba-
bility. For converse, assume in addition to (25.3) that Dp ≜ E[d(S, ŝ)p ]1/p < ∞ for some ŝ ∈ Ŝ and
Pn
p > 1. Applying Rosenthal’s inequality [368, 235], we have E[d(S, ŝn )p ] = E[( i=1 d(Si , ŝ))p ] ≤
CDpp for some constant C = C(p). Then we can apply Theorem 24.9 to convert a code for excess
distortion to one for average distortion and invoke the converse for the latter.
To end this section, we note that in Section 25.1.1 and in Theorem 25.1 it seems we applied
different proof techniques. How come they both turn out to yield the same tight asymptotic result?
This is because the key to both proofs is to estimate the exponent (large deviations) of the under-
lined probabilities in (25.9) and (25.11), respectively. To obtain the right exponent, as we know,
the key is to apply tilting (change of measure) to the distribution solving the information projec-
tion problem (25.10). When PY = (QŜ )n = (PŜ )n with PŜ chosen as the output distribution in the
solution to rate-distortion optimization (25.1), the resulting exponent is precisely given by 2−i(X;Y) .
3
In fact, ⇒ is ⇔.
i i
i i
i i
498
A1 B1 A1 B1
A2 B2 A2 W B2
. . . .
. . . .
. . . .
An Bn An Bn
P Q
Figure 25.1 Description of channel simulation game. The distribution P (left) is to be simulated via the
distribution Q (right) at minimal rate R. Depending on the exact formulation we either require R = I(A; B)
(covering lemma) or R = C(A; B) (soft-covering lemma).
i.i.d.
(Ai , Bi ) ∼ PA,B is declared whenever (An , Bn ) ∈ F) then this is precisely the setting in which
covering lemma operates. In the next section we show that a higher rate R = C(A; B) is required
if F is not known ahead of time. We leave out the celebrated theorem of Bennett and Shor [43]
which shows that rate R = I(A; B) is also attainable even if F is not known, but if encoder and
decoder are given access to a source of common random bits (independent of An , of course).
Before proceeding, we note some simple corner cases:
1 If R = H(A), we can compress An and send it to “B side”, who can reconstruct An perfectly and
use that information to produce Bn through PBn |An .
2 If R = H(B), “A side” can generate Bn according to PnA,B and send that Bn sequence to the “B
side”.
3 If A ⊥
⊥ B, we know that R = 0, as “B side” can generate Bn independently.
Our previous argument for achieving the rate-distortion turns out to give a sharp answer (that
R = I(A; B) is sufficient) for the F-known case as follows.
Theorem 25.5 (Covering Lemma) Fix PA,B and let (Aj , Bj )i.i.d.
∼ PA,B , R > I(A; B). We gener-
ate a random codebook C = {c1 , . . . , cM }, log M = nR, with each codeword cj drawn i.i.d. from
distribution PnB . Then we have for all sets F
Remark 25.3 The origin of the name “covering” is from the application to a proof of Theo-
rem 25.1. In that context we set A = S and B = Ŝ to be the source and optimal reconstruction (in
i i
i i
i i
the sense of minimizing R(I) (D)). Then taking F = {(an , bn ) : d(an , bn ) ≤ D + δ} we see that both
terms in the right-hand side of the inequality are o(1). Thus, sampling 2nR reconstruction points
we covered the space of source realizations in such a way that with high probability we can always
find a reconstruction with low distortion.
Proof. Set γ > M and following similar arguments of the proof for Theorem 25.2, we have
P[∀c ∈ C : (An , c) 6∈ F] ≤ e−M/γ + P[{(An , Bn ) 6∈ F} ∪ {i(An ; Bn ) > log γ}]
= P[(An , Bn ) 6∈ F] + o(1)
⇒ P[∃c ∈ C : (An , c) ∈ F] ≥ P[(An , Bn ) ∈ F] + o(1)
As we explained, the version of covering lemma that we stated shows how to “fool the tester”
applying only one fixed test set F. However, if both A and B take values on finite alphabets then
something stronger can be stated. This original version of the covering lemma [111] is what is
used in sophisticated distributed compression arguments, e.g. Theorem 11.17. Before stating the
result we remind that for two sequences an , bn we denote their joint empirical distribution by
1X
n
P̂an ,bn (α, β) ≜ 1{ai = α, bi = β} , α ∈ A, β ∈ B .
n
i=1
It is also useful to review joint typicality discussion in Remark 18.2. In this section we say that a
sequence of pairs of vectors (an , bn ) is jointly typical with respect to PA,B if
TV(P̂an ,bn , PA,B ) = o(1) .
Fix a distribution PA,B and any codebook C = {c1 , . . . , cM } consisting of elements cj ∈ B n . For
any fixed input string an we define
W = argmin TV(P̂an ,cj , PA,B ) , B̂n = cW . (25.17)
j∈[M]
The next corollary says that in order for us to produce a jointly typical pair (An , B̂n ) a codebook
must have the rate R > I(A; B) and this is optimal.
Corollary 25.6 Fix PA,B on a pair of finite alphabets A, B. For any R > I(A; B) we generate
a random codebook C = {c1 , . . . , cM }, log M = nR, where each codeword cj is drawn i.i.d. from
distribution PnB . With B̂n defined as in (25.17) we have that pair (An , B̂n ) is jointly typical with high
probability
E[TV(P̂An ,B̂n , PA,B )] = oR (1) . (25.18)
Furthermore, no codebook with rate R < I(A, B) can achieve (25.18).
Proof. First, in this case i(An ; Bn ) is a sum of bounded iid terms and thus the oR (1) term in (25.16)
is in fact e−Ω(n) . Fix arbitrary ϵ > 0 and apply Theorem to
n o
F = (an , bn ) : P̂an ,bn (α, β) − PA,B (α, β) ≤ ϵ
i i
i i
i i
500
By the Markov inequality (and assuming WLOG that PB (β) > 0 for all β ) we get that
P [ T = 1] = o( 1) .
Thus, we have
The first two terms are o(n) so we focus on the last term. Since Q̂ is a random variable with
polynomially many possible values, cf. Exercise I.2, we have
i i
i i
i i
Let there be nβ positions with B̂i = β . Conditioned on Q̂, the random variable An ranges over
nβ n
nβ Q(α1 |β)···nβ Q(α|A| |β)
. Since under T = 0 we have Q̂ → PA|B and nβ → PB (β) as n → ∞ we
conclude from Proposition 1.5 and the continuity of entropy Proposition 4.8 that
H(An |Q̂, B̂n , T = 0) ≤ n(H(A|B) + δ)
for some δ = δ(ϵ) > 0 that vanishes as ϵ → 0.
Applications of Corollary 25.6 include distributed compression (Theorem 11.17) and hypoth-
esis testing (Theorem 16.6). Now, those applications use the rate-constrained B̂n to create a
required correlation (joint typicality) not only with An but also with other random variables. Those
applications will require the following simple observation.
Proposition 25.7 Fix some PX,A,B on finite alphabets and consider a pair of random variables
(An , B̂n ) which are jointly typical on average (specifically, (25.18) holds as n → ∞). Given An , B̂n
suppose that Xn is generated ∼ P⊗ n
X|A,B . Then we have
Remark 25.4 (Markov lemma) This result is known as Markov lemma, e.g. [106, Lemma
15.8.1] because in the standard application setting one considers a joint distribution PX,A,B =
i.i.d.
PX,A PB|A , i.e. X → A → B. In this application, one further has (Xn , An ) ∼ PX,A generated by
nature with only An being observed. Given An one constructs a jointly typical vector B̂n (e.g. via
covering lemma Corollary 25.6). Now, since with high probability (Xn , An ) is jointly typical, it
is tempting to automatically conclude that (Xn , B̂n ) would also be jointly typical. Unfortunately,
joint typicality relation is generally not transitive.4 In the above result, however, what resolves
this issue is the fact that Xn can be viewed as generated after (An , B̂n ) were already selected. Thus,
viewing the process in this order we can even allow Xn to depend on B̂n , which is what we did. For
stronger results under the classical setting of PX|A,B = PX|A see [147, Lemma 12.1].
Proof. Note that from condition (25.18) and Markov inequality we get that TV(P̂An ,B̂n , PA,B ) =
o(1) with probability 1 − o(1). Fix any a, b, x ∈ A × B × X and consider m = nP̂An ,B̂n (a, b)
coordinates i ∈ [n] with Ai = a, B̂i = b. Among these there are m′ ≤ m coordinates i that
also satisfy Xi = x. Standard concentration estimate shows that |m′ − mPX|A,B (x|a, b)| = o(m)
with probability 1 − o(1). Hence, normalizing by m we obtain (from the union bound) that with
probability 1 − o(1) we have
|P̂Xn ,An ,B̂n (x, a, b) − PX,A,B (x, a, b)| = o(1) .
4
Let PX,A,B = PX PA PB with PX = PA = PB = Ber(1/2). Take an to be any binary string in {0, 1}n with n/2 ones. Set
xj = bj = aj for j ≤ n/2 and xj = bj = 1 − aj , otherwise. Then (xn , an ) and (an , bn ) are jointly typical, but (xn , bn ) is
not.
i i
i i
i i
502
This implies the first statement. Note that by summing out a ∈ A we obtain that
But then repeating the steps of the second part of Corollary 25.6 we obtain I(Xn ; B̂n ) ≥ nI(X; B) +
o(n), as required.
Remark 25.5 Although in (25.19) we only proved a lower bound (which is sufficient for
applications in this book), it is known that under the Markov assumption X → A → B the inequal-
ity (25.19) holds with equality [111, Chapter 15]. This follows as a by-product of a deep entropy
characterization problem for which we recommend the mentioned reference.
Let us go back to the discussion in the beginning of this section. We have learned how to “fool”
the tester that uses one fixed test set F (Theorem 25.5). Then for finite alphabets we have shown
that we can also “fool” the tester that computes empirical averages since
1X
n
f(Aj , B̂j ) ≈ EA,B∼PA,B [f(A, B)] ,
n
j=1
for any bounded function f. A stronger requirement would be to demand that the joint distribution
PAn ,B̂n fools any permutation invariant tester, i.e.
where the supremum is taken over all permutation invariant subsets F ⊂ An × B n . This is not
guaranteed by Corollary 25.6. Indeed, note that a sufficient statistic for a permutation invariant
tester is a joint type P̂An ,B̂n , and Corollary does show that P̂An ,B̂n ≈ PA,B (in the sense of L1 distance
of vectors). However, it still might happen that P̂An ,B̂n although close to PA,B takes highly different
values compared to those of P̂An ,Bn . For example, if we restrict all c ∈ C to have a fixed composition
P0 , the tester can easily detect the problem since PnB -measure of all strings of composition P0
√
cannot exceed O(1/ n). Formally, to fool permutation invariant tester we need to have small
total variation between the distribution of P̂An ,B̂n and P̂An ,Bn .
We conjecture, however, that nevertheless the rate R = I(A; B) should be sufficient to achieve
also this stronger requirement. In the next section we show that if one removes the permutation-
invariance constraint, then a larger rate R = C(A; B) is needed.
between the simulated and the true output (see Figure 25.1).
i i
i i
i i
Theorem 25.8 (Cuff [116]) Let PA,B be an arbitrary distribution on the finite space A × B.
i.i.d.
Consider a coding scheme where Alice observes An ∼ PnA , sends a message W ∈ [2nR ] to Bob,
who given W generates a (possibly random) sequence B̂n . If (25.20) is satisfied for all ϵ > 0 and
sufficiently large n, then we must have
where C(A; B) is known as the Wyner’s common information [458]. Furthermore, for any R >
C(A; B) and ϵ > 0 there exists n0 (ϵ) such that for all n ≥ n0 (ϵ) there exists a scheme
satisfying (25.20).
Note that condition (25.20) guarantees that any tester (permutation invariant or not) is fooled to
believe he sees the truly iid (An , Bn ) with probability ≥ 1 −ϵ. However, compared to Theorem 25.5,
this requires a higher communication rate since C(A; B) ≥ I(A; B), clearly.
Proof. Showing that Wyner’s common information is a lower-bound is not hard. First, since
PAn ,B̂n ≈ PnA,B (in TV) we have
(Here one needs to use finiteness of the alphabet of A and B and the bounds relating H(P) − H(Q)
with TV(P, Q), cf. (7.20) and Corollary 6.7). Next, we have
≳ nC(A; B) (25.25)
At → W → B̂t
and that Wyner’s common information PA,B 7→ C(A; B) should be continuous in the total variation
distance on PA,B .
To show achievability, let us notice that the problem is equivalent to constructing three random
variables (Ân , W, B̂n ) such that a) W ∈ [2nR ], b) the Markov relation
holds and c) TV(PÂn ,B̂n , PnA,B ) ≤ ϵ/2. Indeed, given such a triple we can use coupling charac-
terization of TV (7.20) and the fact that TV(PÂn , PnA ) ≤ ϵ/2 to extend the probability space
to
An → Ân → W → B̂n
i i
i i
i i
504
and P[An = Ân ] ≥ 1 − ϵ/2. Again by (7.20) we conclude that TV(PAn ,B̂n , PÂn ,B̂n ) ≤ ϵ/2 and by
triangle inequality we conclude that (25.20) holds.
Finally, construction of the triple satisfying a)-c) follows from the soft-covering lemma
(Corollary 25.10) applied with V = (A, B) and W being uniform on the set of xi ’s there.
A natural questions is how large n should be in order for the approximation PY|X ◦ P̂n ≈ PY to
hold. A remarkable fact that we establish in this section is that the answer is n ≈ 2I(X;Y) , assum-
ing I(X; Y) 1 and there is certain concentration properties of i(X; Y) around I(X; Y). This fact
originated from Wyner [458] and was significantly strengthened in [212].
Here, we show a new variation of such results by strengthening our simple χ2 -information
bound of Proposition 7.17 (corresponding to λ = 2).
Theorem 25.9 Fix PX,Y and for any λ ∈ R define the Rényi mutual information of order λ
Iλ (X; Y) ≜ Dλ (PX,Y kPX PY ) ,
where Dλ is the Rényi-divergence, cf. Definition 7.24. We have for every 1 < λ ≤ 2
1
E[D(PY|X ◦ P̂n kPY )] ≤ log(1 + exp{(λ − 1)(Iλ (X; Y) − log n)}) . (25.27)
λ−1
i i
i i
i i
Note that conditioned on Y we get to analyze a λ-th moment of a sum of iid random variables.
This puts us into a well-known setting of Rosenthal-type inequalities. In particular, we have that5
for any iid non-negative Bj we have, provided 1 ≤ λ ≤ 2,
!λ
X n
E Bi ≤ n E[Bλ ] + (n E[B])λ . (25.30)
i=1
which implies
1
Iλ (Xn ; Ȳ) ≤ log 1 + n1−λ exp{(λ − 1)Iλ (X; Y)} ,
λ−1
which together with (25.28) recovers the main result (25.27).
Remark 25.6 Hayashi [217] upper bounds the LHS of (25.27) with
λ λ−1
log(1 + exp{ (Kλ (X; Y) − log n)}) ,
λ−1 λ
where Kλ (X; Y) = infQY Dλ (PX,Y kPX QY ) is the so-called Sibson-Csiszár information, cf. [338].
This bound, however, does not have the right rate of convergence as n → ∞, at least for λ = 1 as
comparison with Proposition 7.17 reveals.
We note that [217, 212] also contain direct bounds on
E[TV(PY|X ◦ P̂n , PY )]
P
which do not assume existence of λ-th moment of PYY|X for λ > 1 and instead rely on the distribution
of i(X; Y). We do not discuss these bounds here, however, since for the purpose of discussing finite
alphabets the next corollary is sufficient.
5
The inequality (25.30), which is known to be essentially tight [374], can be shown by applying
∑
(a + b)λ−1 ≤ aλ−1 + bλ−1 and Jensen’s to get E Bi (Bi + j̸=i Bj )λ−1 ≤ E[Bλ ] + E[B]((n − 1) E[B])λ−1 . Summing
the left side over i and bounding (n − 1) ≤ n we get (25.30).
i i
i i
i i
506
Remark 25.7 The origin of the name “soft-covering” is due to the fact that unlike the covering
lemma (Theorem 25.5) which selects one xi (trying to make PY|X=xi as close to PY as possible) here
we mix over n choices uniformly.
Proof. By tensorization of Rényi divergence, cf. Section 7.12, we have
Iλ (X; Y) = dIλ (U; V) .
For every 1 < λ < λ0 we have that λ 7→ Iλ (U; V) is continuous and converging to I(U; V) as
λ → 1. Thus, we can find λ sufficiently small so that R > Iλ (U; V). Applying Theorem 25.9 with
this λ completes the proof.
i i
i i
i i
In previous chapters we established the main coding theorem for lossy data compression: For
stationary memoryless (iid) sources and separable distortion, under the assumption that Dmax < ∞,
the operational and information rate-distortion functions coincide, namely,
In addition, we have shown various properties about the rate-distortion function (cf. Theorem 24.4).
In this chapter we compute the rate-distortion function for several important source distributions
by evaluating this constrained minimization of mutual information. The common technique we
apply to evaluate these special cases in Section 26.1 is then formalized in Section 26.2* as a saddle
point property akin to those in Sections 5.2 and 5.4* for mutual information maximization (capac-
ity). Next we extend the paradigm of joint source-channel coding in Section 19.7 to the lossy
setting; this reasoning will later be found useful in statistical applications in Part VI (cf. Chap-
ter 30). Finally, in Section 26.4 we discuss several limitations, both theoretical and practical, of
the classical theory for lossy compression and joint source-channel coding.
Theorem 26.1
R(D) = (h(p) − h(D))+ . (26.1)
For example, when p = 1/2, D = .11, we have R(D) ≈ 1/2 bits. In the Hamming game
described in Section 24.2 where we aim to compress 100 bits down to 50, we indeed can do this
while achieving 11% average distortion, compared to the naive scheme of storing half the string
and guessing on the other half, which achieves 25% average distortion. Note that we can also get
very tight non-asymptotic bounds, cf. Exercise V.3.
507
i i
i i
i i
508
Proof. Since Dmax = p, in the sequel we can assume D < p for otherwise there is nothing to
show.
For the converse, consider any PŜ|S such that P[S 6= Ŝ] ≤ D ≤ p ≤ 21 . Then
In order to achieve this bound, we need to saturate the above chain of inequalities, in particular,
choose PŜ|S so that the difference S + Ŝ is independent of Ŝ. Let S = Ŝ + Z, where Ŝ ∼ Ber(p′ ) ⊥
⊥
Z ∼ Ber(D), and p′ is such that the convolution gives exactly Ber(p), namely,
p′ ∗ D = p′ (1 − D) + (1 − p′ )D = p,
p−D
i.e., p′ = 1−2D . In other words, the backward channel PS|Ŝ is exactly BSC(D) and the resulting
PŜ|S is our choice of the forward channel PŜ|S . Then, I(S; Ŝ) = H(S) − H(S|Ŝ) = H(S) − H(Z) =
h(p) − h(D), yielding the upper bound R(D) ≤ h(p) − h(D).
Remark 26.1 Here is a more general strategy (which we will later implement in the Gaussian
case.) Denote the optimal forward channel from the achievability proof by P∗Ŝ|S and P∗S|Ŝ the asso-
ciated backward channel (which is BSC(D)). We need to show that there is no better PŜ|S with
P[S 6= Ŝ] ≤ D and a smaller mutual information. Then
Remark 26.2 By WLLN, the distribution PnS = Ber(p)n concentrates near the Hamming sphere
of radius np as n grows large. Recall that in proving Shannon’s rate distortion theorem, the optimal
codebook are drawn independently from PnŜ = Ber(p′ )n with p′ = 1p−−2D D
. Note that p′ = 1/2 if
p = 1/2 but p′ < p if p < 1/2. In the latter case, the reconstruction points concentrate on a smaller
sphere of radius np′ and none of them are typical source realizations, as illustrated in Figure 26.1.
For a generalization of this result to m-ary uniform source, see Exercise V.6.
i i
i i
i i
S(0, np)
S(0, np′ )
Hamming Spheres
Figure 26.1 Source realizations (solid sphere) versus codewords (dashed sphere) in compressing Hamming
sources.
Theorem 26.2 Let S ∼ N (0, σ 2 ) and d(s, ŝ) = (s − ŝ)2 for s, ŝ ∈ R. Then
1 σ2
R ( D) = log+ . (26.2)
2 D
In the vector case of S ∼ N (0, σ 2 Id ) and d(s, ŝ) = ks − ŝk22 ,
d dσ 2
R ( D) = log+ . (26.3)
2 D
Proof. Since Dmax = σ 2 , in the sequel we can assume D < σ 2 for otherwise there is nothing to
show.
(Achievability) Choose S = Ŝ + Z , where Ŝ ∼ N (0, σ 2 − D) ⊥
⊥ Z ∼ N (0, D). In other words,
the backward channel PS|Ŝ is AWGN with noise power D, and the forward channel can be easily
found to be PŜ|S = N ( σ σ−2 D S, σ σ−2 D D). Then
2 2
1 σ2 1 σ2
I(S; Ŝ) =
log =⇒ R(D) ≤ log
2 D 2 D
(Converse) Formally, we can mimic the proof of Theorem 26.1 replacing Shannon entropy by
the differential entropy and applying the maximal entropy result from Theorem 2.8; the caveat is
that for Ŝ (which may be discrete) the differential entropy may not be well-defined. As such, we
follow the alternative proof given in Remark 26.1. Let PŜ|S be any conditional distribution such
that EP [(S − Ŝ)2 ] ≤ D. Denote the forward channel in the above achievability by P∗Ŝ|S . Then
" #
P∗S|Ŝ
∗
I(PS , PŜ|S ) = D(PS|Ŝ kPS|Ŝ |PŜ ) + EP log
PS
i i
i i
i i
510
" #
P∗S|Ŝ
≥ EP log
PS
(S−Ŝ)2
√ 1 e− 2D
= EP log 2πD
S2
√ 1
2π σ 2
e− 2 σ 2
" #
1 σ2 log e S2 (S − Ŝ)2
= log + EP −
2 D 2 σ2 D
1 σ2
≥ log .
2 D
Finally, for the vector case, (26.3) follows from (26.2) and the same single-letterization argu-
ment in Theorem 24.8 using the convexity of the rate-distortion function in Theorem 24.4(a).
The interpretation of the optimal reconstruction points in the Gaussian case is analogous to that
of the Hamming source previously
√ discussed in Remark 26.2: As n grows, the Gaussian random
2
vector concentrates on S(0, nσ ) (n-sphere in Euclidean p space rather than Hamming), but each
reconstruction point drawn from (P∗Ŝ )n is close to S(0, n(σ 2 − D)). So again the picture is similar
to Figure 26.1 of two nested spheres.
We can also understand geometry of errors of optimal compressors. Indeed, suppose we have
a sequence of quantizers Xn → W → X̂n with n1 log M → R(D). As we know, without loss of
generality we may assume X̂n = E[Xn |W]. Let us denote by Σ = Cov[Xn |W] be the covariance
matrix of the reconstruction errors. We know that 1n tr Σ ≤ D by the distortion constraint. Now let
us express mutual information in terms of differential entropy to obtain
Applying maximum entropy principle (2.19) to the second term (and taking expectation over W
inside log det via Jensen’s and Corollary 2.9) we obtain
n 1
log M ≥ log σ 2 − log det Σ .
2 2
Let {λj , j ∈ [n]} denote the spectrum of Σ. Dividing by n and recalling that quantizer is optimal
we get
1X1
n
1 σ2 σ2
log + o( 1) ≥ log .
2 D n 2 λj
j=1
2
From strict convexity of λ 7→ 12 log σλ we conclude that empirical distributions of eigenvalues, i.e.
P
j δλj , must converge to a point, i.e. to δD . In this sense Σ ≈ D · In and the uncertainty regions
1
n
(given the message) should be approximately spherical.
Note that the exact expression in Theorem 26.2 relies on the Gaussianity assumption of the
source. How sensitive is the rate-distortion formula to this assumption? The following comparison
result is a counterpart of Theorem 20.12 for channel capacity:
i i
i i
i i
Theorem 26.3 Assume that ES = 0 and Var S = σ 2 . Consider the MSE distortion. Then
1 σ2 1 σ2
log+ − D(PS kN (0, σ 2 )) ≤ R(D) = inf I(S; Ŝ) ≤ log+ .
2 D PŜ|S :E(Ŝ−S)2 ≤D 2 D
Remark 26.3 A simple consequence of Theorem 26.3 is that for source distributions with a
density, the rate-distortion function grows according to 12 log D1 in the low-distortion regime as
long as D(PS kN (0, σ 2 )) is finite. In fact, the first inequality, known as the Shannon lower bound
(SLB), is asymptotically tight, in the sense that
1 σ2
R(D) = log − D(PS kN (0, σ 2 )) + o(1), D → 0 (26.4)
2 D
under appropriate conditions on PS [281, 247]. Therefore, by comparing (2.21) and (26.4), we
see that, for small distortion, uniform scalar quantization (Section 24.1) is in fact asymptotically
optimal within 12 log(2πe) ≈ 2.05 bits.
Later in Section 30.1 we will apply SLB to derive lower bounds for statistical estimation. For
this we need the following general version of SLB (see Exercise V.22 for a proof): Let k · k be an
arbitrary norm on Rd and r > 0. Let X be a d-dimensional continuous random vector with finite
differential entropy h(X). Then
d d d
inf I(X; X̂) ≥ h(X) + log − log Γ +1 V , (26.5)
PX̂|X :E[∥X̂−X∥r ]≤D r Dre r
distortion function:
R(D) ≤ I(PS , P∗Ŝ|S )
σ2 − D σ2 − D
= I ( S; S + W) W ∼ N ( 0, D)
σ2 σ2
σ2 − D
≤ I ( SG ; SG + W) by Gaussian saddle point (Theorem 5.11)
σ2
1 σ2
= log .
2 D
“≥”: For any PŜ|S such that E(Ŝ − S)2 ≤ D. Let P∗S|Ŝ = N (Ŝ, D) denote the AWGN channel
with noise power D. Then
I(S; Ŝ) = D(PS|Ŝ kPS |PŜ )
" #
P∗S|Ŝ
= D(PS|Ŝ kP∗S|Ŝ |PŜ ) + EP log − D(PS kPSG )
PSG
(S−Ŝ)2
√ 1 e− 2D
≥ EP log 2πD
S2
− D(PS kPSG )
√ 1
2π σ 2
e− 2 σ 2
i i
i i
i i
512
1 σ2
≥ log − D(PS kPSG ).
2 D
In fact we have discussed in Section 5.6 iterative algorithms (Blahut-Arimoto) that computes R(D).
However, for the peace of mind it is good to know there are some general reasons why tricks like
we used in the Hamming or Gaussian case actually are guaranteed to work.
Theorem 26.4
1 Suppose PY∗ and PX|Y∗ PX are such that E[d(X, Y∗ )] ≤ D and for any PX,Y with E[d(X, Y)] ≤
D we have
dPX|Y∗
E log (X|Y) ≥ I(X; Y∗ ) . (26.6)
dPX
Then R(D) = I(X; Y∗ ).
2 Suppose that I(X; Y∗ ) = R(D). Then for any regular branch of conditional probability PX|Y∗
and for any PX,Y satisfying
• E[d(X, Y)] ≤ D and
• PY PY∗ and
• I ( X ; Y) < ∞ ,
the inequality (26.6) holds.
1 The first part is a sufficient condition for optimality of a given PXY∗ . The second part gives a
necessary condition that is convenient to narrow down the search. Indeed, typically the set of
PX,Y satisfying those conditions is rich enough to infer from (26.6):
dPX|Y∗
log (x|y) = R(D) − θ[d(x, y) − D] ,
dPX
for a positive θ > 0.
i i
i i
i i
2 Note that the second part is not valid without assuming PY PY∗ . A counterexample to this
and various other erroneous (but frequently encountered) generalizations is the following: A =
{0, 1}, PX = Ber(1/2), Â = {0, 1, 0′ , 1′ } and
From here we will conclude, similar to Proposition 2.20, that the first term is o(λ) and thus for
sufficiently small λ we should have I(X; Yλ ) < R(D), contradicting optimality of coupling PX,Y∗ .
We proceed to details. For every λ ∈ [0, 1] define
dPY
ρ 1 ( y) ≜ ( y)
dPY∗
λρ1 (y)
λ(y) ≜
λρ1 (y) + λ̄
(λ)
PX|Y=y = λ(y)PX|Y=y + λ̄(y)PX|Y∗ =y
dPYλ = λdPY + λ̄dPY∗ = (λρ1 (y) + λ̄)dPY∗
D(y) = D(PX|Y=y kPX|Y∗ =y )
(λ)
Dλ (y) = D(PX|Y=y kPX|Y∗ =y ) .
i i
i i
i i
514
Notice:
Dλ (y) ≤ λ(y)D(y)
and therefore
1
Dλ (y)1{ρ1 (y) > 0} ≤ D(y)1{ρ1 (y) > 0} .
λ(y)
Notice that by (26.7) the function ρ1 (y)D(y) is non-negative and PY∗ -integrable. Then, applying
dominated convergence theorem we get
Z Z
1 1
lim dPY∗ Dλ (y)ρ1 (y) = dPY∗ ρ1 (y) lim D λ ( y) = 0 (26.8)
λ→0 {ρ >0}
1
λ( y ) {ρ1 >0} λ→ 0 λ( y)
where in the last step we applied the result from Chapter 5
Finally, since
(λ)
PX|Y ◦ PYλ = PX ,
we have
(λ) dPX|Y∗ dPX|Y∗ ∗
I ( X ; Yλ ) = D(PX|Y kPX|Y∗ |PYλ ) + λ E log (X|Y) + λ̄ E log (X|Y )
dPX dPX
= I∗ + λ(I1 − I∗ ) + o(λ) ,
I ( X ; Y λ ) ≥ I ∗ = R ( D) .
i i
i i
i i
Such a pair (f, g) is called a (k, n, D)-JSCC, which transmits k symbols over n channel uses such
that the end-to-end distortion is at most D in expectation. Our goal is to optimize the encoder/de-
coder pair so as to maximize the transmission rate (number of symbols per channel use) R = nk .1
As such, we define the asymptotic fundamental limit as
1
RJSCC (D) ≜ lim inf max {k : ∃(k, n, D)-JSCC} .
n→∞ n
To simplify the exposition, we will focus on JSCC for a stationary memoryless source Sk ∼ P⊗S
k
⊗n
transmitted over a stationary memoryless channel PYn |Xn = PY|X subject to a separable distortion
Pk
function d(sk , ŝk ) = 1k i=1 d(si , ŝi ).
26.3.1 Converse
The converse for the JSCC is quite simple, based on data processing inequality and following the
weak converse of lossless JSCC using Fano’s inequality.
where C = supPX I(X; Y) is the capacity of the channel and R(D) = infP :E[d(S,Ŝ)]≤D I(S; Ŝ) is the
Ŝ|S
rate-distortion function of the source.
The interpretation of this result is clear: Since we need at least R(D) bits per symbol to recon-
struct the source up to a distortion D and we can transmit at most C bits per channel use, the overall
transmission rate cannot exceeds C/R(D). Note that the above theorem clearly holds for channels
with cost constraint with the corresponding capacity (Chapter 20).
1
Or equivalently, minimize the bandwidth expansion factor ρ = nk .
i i
i i
i i
516
Proof. Consider a (k, n, D)-code which induces the Markov chain Sk → Xn → Yn → Ŝk such
Pk
that E[d(Sk , Ŝk )] = 1k i=1 E[d(Si , Ŝi )] ≤ D. Then
( a) (b) ( c)
kR(D) = inf I(Sk ; Ŝk ) ≤ I(Sk ; Ŝk ) ≤ I(Xn ; Yn ) ≤ sup I(Xn ; Yn ) = nC
PŜk |Sk :E[d(Sk ,Ŝk )]≤D P Xn
where (b) applies data processing inequality for mutual information, (a) and (c) follow from the
respective single-letterization result for lossy compression and channel coding (Theorem 24.8 and
Proposition 19.10).
Remark 26.4 Consider the case where the source is Ber(1/2) with Hamming distortion. Then
Theorem 26.5 coincides with the converse for channel coding under bit error rate Pb in (19.33):
k C
R= ≤
n 1 − h(Pb )
which was previously given in Theorem 19.21 and proved using ad hoc techniques. In the case of
channel with cost constraints, e.g., the AWGN channel with C(SNR) = 12 log(1 + SNR), we have
−1 C(SNR)
Pb ≥ h 1−
R
This is often referred to as the Shannon limit in plots comparing the bit-error rate of practical
codes. (See, e.g., Fig. 2 from [359] for BIAWGN (binary-input) channel.) This is erroneous, since
the pb above refers to the bit error rate of data bits (or systematic bits), not all of the codeword bits.
The latter quantity is what typically called BER (see (19.33)) in the coding-theoretic literature.
Theorem 26.6 For any stationary memoryless source (PS , S, Ŝ, d) with rate-distortion func-
tion R(D) satisfying Assumption 26.1 (below), and for any stationary memoryless channel PY|X
with capacity C,
C
RJSCC (D) = .
R(D)
Assumption 26.1 on the source (which is rather technical and can be skipped in the first reading)
is to control the distortion incurred by the channel decoder making an error. Despite this being a
low-probability event, without any assumption on the distortion metric, we cannot say much about
its contribution to the end-to-end average distortion. (Note that this issue does not arise in lossless
i i
i i
i i
JSCC). Assumption 26.1 is trivially satisfied by bounded distortion (e.g., Hamming), and can be
shown to hold more generally such as for Gaussian sources and MSE distortion.
Proof. In view of Theorem 26.5, we only prove achievability. We constructed a separated
compression/channel coding scheme as follows:
• Let (fs , gs ) be a (k, 2kR(D)+o(k) , D)-code for compressing Sk such that E[d(Sk , gs (fs (Sk )] ≤ D.
By Lemma 26.8 (below), we may assume that all reconstruction points are not too far from
some fixed string, namely,
d(sk0 , gs (i)) ≤ L (26.9)
for all i and some constant L, where sk0 = (s0 , . . . , s0 ) is from Assumption 26.1 below.
• Let (fc , gc ) be a (n, 2nC+o(n) , ϵn )max -code for channel PYn |Xn such that kR(D) + o(k) ≤ nC +
o(n) and the maximal probability of error ϵn → 0 as n → ∞. Such as code exists thanks to
Theorem 19.9 and Corollary 19.5.
Let the JSCC encoder and decoder be f = fc ◦ fs and g = gs ◦ gc . So the overall system is
fs fc gc gs
Sk −
→W−
→ Xn −→ Yn −
→ Ŵ −
→ Ŝk .
Note that here we need to control the maximal probability of error of the channel code since
when we concatenate these two schemes, W at the input of the channel is the output of the source
compressor, which need not be uniform.
To analyze the average distortion, we consider two cases depending on whether the channel
decoding is successful or not:
E[d(Sk , Ŝk )] = E[d(Sk , gs (W))1{W = Ŵ}] + E[d(Sk , gs (Ŵ)))1{W 6= Ŵ}].
By assumption on our lossy code, the first term is at most D. For the second term, we have P[W 6=
Ŵ] ≤ ϵn = o(1) by assumption on our channel code. Then
( a)
E[d(Sk , gs (Ŵ))1{W 6= Ŵ}] ≤ E[1{W 6= Ŵ}λ(d(Sk , ŝk0 ) + d(sk0 , gs (Ŵ)))]
(b)
≤ λ · E[1{W 6= Ŵ}d(Sk , ŝk0 )] + λL · P[W 6= Ŵ]
( c)
= o(1),
where (a) follows from the generalized triangle inequality from Assumption 26.1(a) below; (b)
follows from (26.9); in (c) we apply Lemma 25.4 that were used to show the vanishing of the
expectation in (25.15) before.
In all, our scheme meets the average distortion constraint. Hence we conclude that for all R >
C/R(D), there exists a sequence of (k, n, D + o(1))-JSCC codes.
The following assumption is needed by the previous theorem:
Assumption 26.1 Fix D. For a source (PS , S, Ŝ, d), there exists λ ≥ 0, s0 ∈ S, ŝ0 ∈ Ŝ such
that
i i
i i
i i
518
(a) Generalized triangle inequality: d(s, ŝ) ≤ λ(d(s, ŝ0 ) + d(s0 , â)) ∀a, â.
(b) E[d(S, ŝ0 )] < ∞ (so that Dmax < ∞ too).
(c) E[d(s0 , Ŝ)] < ∞ for any output distribution PŜ achieving the rate-distortion function R(D).
(d) d(s0 , ŝ0 ) < ∞.
The interpretation of this assumption is that the spaces S and Ŝ have “nice centers” s0 and ŝ0 ,
in the sense that the distance between any two points is upper bounded by a constant times the
distance from the centers to each point (see figure below).
b
b
s ŝ
b b
s0 ŝ0
S Ŝ
Note that Assumption 26.1 is not straightforward to verify. Next we give some more convenient
sufficient conditions. First of all, Assumption 26.1 holds automatically for bounded distortion
function. In other words, for a discrete source on a finite alphabet S , a finite reconstruction alphabet
Ŝ , and a finite distortion function d(s, ŝ) < ∞, Assumption 26.1 is fulfilled. More generally, we
have the following criterion.
Theorem 26.7 If S = Ŝ and d(s, ŝ) = ρ(s, ŝ)q for some metric ρ and q ≥ 1, and Dmax ≜
infŝ0 E[d(S, ŝ0 )] < ∞, then Assumption 26.1 holds.
Proof. Take s0 = ŝ0 that achieves a finite Dmax = E[d(S, ŝ0 )]. (In fact, any points can serve as
centers in a metric space). Applying triangle inequality and Jensen’s inequality, we have
q q
1 1 1 1 1
ρ(s, ŝ) ≤ ρ(s, s0 ) + ρ(s0 , ŝ) ≤ ρq (s, s0 ) + ρq (s0 , ŝ).
2 2 2 2 2
Thus d(s, ŝ) ≤ 2q−1 (d(s, s0 ) + d(s0 , ŝ)). Taking λ = 2q−1 verifies (a) and (b) in Assumption 26.1.
To verify (c), we can apply this generalized triangle inequality to get d(s0 , Ŝ) ≤ 2q−1 (d(s0 , S) +
d(S, Ŝ)). Then taking the expectation of both sides gives
So we see that metrics raised to powers (e.g. squared norms) satisfy Assumption 26.1. Finally,
we give the lemma used in the proof of Theorem 26.6.
i i
i i
i i
Lemma 26.8 Fix a source satisfying Assumption 26.1 and an arbitrary PŜ|S . Let R > I(S; Ŝ),
L > max{E[d(s0 , Ŝ)], d(s0 , ŝ0 )} and D > E[d(S, Ŝ)]. Then, there exists a (k, 2kR , D)-code such that
d(sk0 , ŝk ) ≤ L for every reconstruction point ŝk , where sk0 = (s0 , . . . , s0 ).
For any D′ ∈ (E[d(S, Ŝ)], D), there exist M = 2kR reconstruction points (c1 , . . . , cM ) such that
P min d(S , cj ) > D ≤ P[d1 (Sk , Ŝk ) > D′ ] + o(1),
k ′
j∈[M]
P[d1 (S, Ŝ) > D′ ] ≤ P[d(Sk , Ŝk ) > D′ ] + P[d(sk0 , Ŝk ) > L] → 0
as k → ∞ (since E[d(S, Ŝ)] < D′ and E[d(s0 , Ŝ)] < L). Thus we have
P min d(Sk , cj ) > D′ → 0
j∈[M]
and d(sk0 , cj ) ≤ L. Finally, by adding another reconstruction point cM+1 = ŝk0 = (ŝ0 , . . . , ŝ0 ) we
get
h i h i
′ ′
E min d(S , cj ) ≤ D + E d(S , ŝ0 )1 min d(S , cj ) > D
k k k k
= D′ + o(1) ,
j∈[M+1] j∈[M]
where the last estimate follows from the same argument that shows the vanishing of the expectation
in (25.15). Thus, for sufficiently large n the expected distortion is at most D, as required.
i i
i i
i i
520
spatial correlation compared to 1D signals. (For example, the first sentence and the last in Tol-
stoy’s novel are pretty uncorrelated. But the regions in the upper-left and bottom-right corners of
one image can be strongly correlated. At the same time, the uncompressed size of the novel and
the image could be easily equal.) Thus, for practicing the lossy compression of videos and images
the key problem is that of coming up with a good “whitening” bases, which is an art still being
refined.
For the joint-source-channel coding, the separation principle has definitely been a guiding light
for the entire development of digital information technology. But this now ubiquitous solution
that Shannon’s separation has professed led to a rather undesirable feature of dropped cellular
calls (as opposed to slowly degraded quality of the old analog telephones) or “snow screen” on
TV whenever the SNR falls below a certain threshold. That is, the separated systems can be very
unstable, or lacks graceful degradation. To sketch this effect consider an example of JSCC, where
the source distribution is Ber( 12 ), with rate-distortion function R(D) = 1 − h(D), and the channel
is BSCδ with capacity C(δ) = 1 − h(δ). Consider two solutions:
1 a separated scheme: targeting a certain acceptable distortion level D∗ we compress the source
at rate R(D∗ ). Then we can use a channel code of rate R(D∗ ) which would achieve vanishing
error as long as R(D∗ ) < C(δ), i.e. δ < D∗ . Overall, this scheme has a bandwidth expansion
factor ρ = nk = 1. Note that there exists channel codes (Exercises IV.8 and IV.10) that work
simultaneously for all δ < δ ∗ = D∗ .
2 a simple JSCC with ρ = 1 which transmits “uncoded” data, i.e. sets Xi = Si .
For large blocklengths, the achieved distortion are shown in Figure 26.2 as a function of δ .
We can now see why separated solution, though in some sense optimal, is not ideal. First, below
distortion
separated
1
2
uncoded
D∗ = δ ∗
δ
0 δ∗ 1
2
Figure 26.2 No graceful degradation of separately designed source channel code (black solid), as compared
with uncoded transmission (blue dashed).
δ < δ ∗ the separated solution does achieve acceptable distortion D∗ , but it does not improve if the
channel improves, i.e. the distortion stays constant at D∗ , unlike the uncoded system. Second, and
much more importantly, is a problem with δ > δ ∗ . In this regime, separated scheme undergoes a
i i
i i
i i
catastrophic failure and distortion becomes 1/2 (that is, we observe pure noise, or “snow” in TV-
speak). At the same time, the distortion of the simple “uncoded” JSCC is also deteriorating but
gracefully so. Unfortunately, such graceful schemes are only known for very few cases, requiring
ρ = 1 and certain “perfect match” conditions between channel noise and source (distortion met-
ric)2 . It is a long-standing (practical and theoretical) open problem in information theory to find
schemes that exhibit non-catastrophic degradation for general source-channel pairs and general ρ.
Even purely theoretically the problem of JSCC still contains many mysteries. For example, in
Section 22.5 we described refined expansion of the channel coding rate as a function of block-
length. In particular, we have seen that convergence to channel capacity happens at the rate √1n ,
which is rather slow. At the same time, convergence to the rate-distortion function is almost at
the rate of 1n (see Exercises V.3 and V.4). Thus, it is not clear what the convergence rate of the
JSCC may be. Unfortunately, sharp results here are still at a nascent stage. In fact, even for the
most canonical setting of a binary source and BSCδ channel it was only very recently shown [248]
√
that the optimal rate nk converges to the ultimate limit of R(CD) at the speed of Θ(1/ n) unless
the Gastpar condition R(D) = C(δ) is met. Analyzing other source-channel pairs or any general
results of this kind is another important open problem.
2
Often informally called “Gastpar conditions” after [181].
i i
i i
i i
27 Metric entropy
In the previous chapters of this part we discussed optimal quantization of random vectors in both
fixed and high dimensions. Complementing this average-case perspective, the topic of this chapter
is on the deterministic (worst-case) theory of quantization. The main object of interest is the metric
entropy of a set, which allows us to answer two key questions (a) covering number: the minimum
number of points to cover a set up to a given accuracy; (b) packing number: the maximal number
of elements of a given set with a prescribed minimum pairwise distance.
The foundational theory of metric entropy were put forth by Kolmogorov, who, together with
his students, also determined the behavior of metric entropy in a variety of problems for both finite
and infinite dimensions. Kolmogorov’s original interest in this subject stems from Hilbert’s 13th
problem, which concerns the possibility or impossibility of representing multi-variable functions
as compositions of functions of fewer variables. It turns out that the theory of metric entropy can
provide a surprisingly simple and powerful resolution to such problems. Over the years, metric
entropy has found numerous connections to and applications in other fields such as approximation
theory, empirical processes, small-ball probability, mathematical statistics, and machine learning.
In particular, metric entropy will be featured prominently in Part VI of this book, wherein we
discuss its applications to proving both lower and upper bounds for statistical estimation.
This chapter is organized as follows. Section 27.1 provides basic definitions and explains the
fundamental connections between covering and packing numbers. These foundations are laid out
by Kolmogorov and Tikhomirov in [250], which remains the definitive reference on this subject.
In Section 27.2 we study metric entropy in finite-dimensional spaces and a popular approach for
bounding the metric entropy known as the volume bound. To demonstrate the limitations of the
volume method and the associated high-dimensional phenomenon, in Section 27.3 we discuss
a few other approaches through concrete examples. Infinite-dimensional spaces are treated next
for smooth functions in Section 27.4 (wherein we also discuss the application to Hilbert’s 13th
problem) and Hilbert spaces in Section 27.3.2 (wherein we also discuss the application to empirical
processes). Section 27.5 gives an exposition of the connections between metric entropy and the
small-ball problem in probability theory. Finally, in Section 27.6 we circle back to rate-distortion
theory and discuss how it is related to metric entropy and how information-theoretic methods can
be useful for the latter.
522
i i
i i
i i
• We say {v1 , ..., vN } ⊂ V is an ϵ-covering (or ϵ-net) of Θ if Θ ⊂ ∪Ni=1 B(vi , ϵ), where B(v, ϵ) ≜
{u ∈ V : d(u, v) ≤ ϵ} is the (closed) ball of radius ϵ centered at v; or equivalently, ∀θ ∈ Θ,
∃i ∈ [N] such that d(θ, vi ) ≤ ϵ.
• We say {θ1 , ..., θM } ⊂ Θ is an ϵ-packing of Θ if mini̸=j kθi − θj k > ϵ;1 or equivalently, the balls
{B(θi , ϵ/2) : j ∈ [M]} are disjoint.
ϵ
≥ϵ
Θ Θ
Upon defining ϵ-covering and ϵ-packing, a natural question concerns the size of the optimal
covering and packing, leading to the definition of covering and packing numbers:
with min ∅ understood as ∞; we will sometimes abbreviate these as N(ϵ) and M(ϵ) for brevity.
Similar to volume and width, covering and packing numbers provide a meaningful measure for
the “massiveness” of a set. The major focus of this chapter is to understanding their behavior in
both finite and infinite-dimensional spaces as well as their statistical applications.
Some remarks are in order.
1
Notice we imposed strict inequality for convenience.
i i
i i
i i
524
Remark 27.1 Unlike the packing number M(Θ, d, ϵ), the covering number N(Θ, d, ϵ) defined
in (27.1) depends implicitly on the ambient space V ⊃ Θ, since, per Definition 27.1), an ϵ-covering
is required to be a subset of V rather than Θ. Nevertheless, as the next Theorem 27.2 shows, this
dependency on V has almost no effect on the behavior of the covering number.
As an alternative to (27.1), we can define N′ (Θ, d, ϵ) as the size of the minimal ϵ-covering of Θ
that is also a subset of Θ, which is closely related to the original definition as
Here, the left inequality is obvious. To see the right inequality,2 let {θ1 , . . . , θN } be an 2ϵ -covering
of Θ. We can project each θi to Θ by defining θi′ = argminu∈Θ d(θi , u). Then {θ1′ , . . . , θN′ } ⊂ Θ
constitutes an ϵ-covering. Indeed, for any θ ∈ Θ, we have d(θ, θi ) ≤ ϵ/2 for some θi . Then
d(θ, θi′ ) ≤ d(θ, θi ) + d(θi , θi′ ) ≤ 2d(θ, θi ) ≤ ϵ. On the other hand, the N′ covering numbers need
not be monotone with respect to set inclusion.
The relation between the covering and packing numbers is described by the following funda-
mental result.
Proof. To prove the right inequality, fix a maximal packing E = {θ1 , ..., θM }. Then ∀θ ∈ Θ\E,
∃i ∈ [M], such that d(θ, θi ) ≤ ϵ (for otherwise we can obtain a bigger packing by adding θ). Hence
E must an ϵ-covering (which is also a subset of Θ). Since N(Θ, d, ϵ) is the minimal size of all
possible coverings, we have M(Θ, d, ϵ) ≥ N(Θ, d, ϵ).
We next prove the left inequality by contradiction. Suppose there exists a 2ϵ-packing
{θ1 , ..., θM } and an ϵ-covering {x1 , ..., xN } such that M ≥ N + 1. Then by the pigeonhole prin-
ciple, there exist distinct θi and θj belonging to the same ϵ-ball B(xk , ϵ). By triangle inequality,
d(θi , θj ) ≤ 2ϵ, which is a contradiction since d(θi , θj ) > 2ϵ for a 2ϵ-packing. Hence the size of any
2ϵ-packing is at most that of any ϵ-covering, that is, M(Θ, d, 2ϵ) ≤ N(Θ, d, ϵ).
The significance of (27.4) is that it shows that the small-ϵ behavior of the covering and packing
numbers are essentially the same. In addition, the right inequality therein, namely, N(ϵ) ≤ M(ϵ),
deserves some special mention. As we will see next, it is oftentimes easier to prove negative
results (lower bound on the minimal covering or upper bound on the maximal packing) than pos-
itive results which require explicit construction. When used in conjunction with the inequality
N(ϵ) ≤ M(ϵ), these converses turn into achievability statements,3 leading to many useful bounds
on metric entropy (e.g. the volume bound in Theorem 27.3 and the Gilbert-Varshamov bound
2
Another way to see this is from Theorem 27.2: Note that the right inequality in (27.4) yields a ϵ-covering that is included
in Θ. Together with the left inequality, we get N′ (ϵ) ≤ M(ϵ) ≤ N(ϵ/2).
3
This is reminiscent of duality-based argument in optimization: To bound a minimization problem from above, instead of
constructing an explicit feasible solution, a fruitful approach is to equate it with the dual problem (maximization) and
bound this maximum from above.
i i
i i
i i
Theorem 27.5 in the next section). Revisiting the proof of Theorem 27.2, we see that this logic
actually corresponds to a greedy construction (greedily increase the packing until no points can
be added).
Proof. To prove (a), consider an ϵ-covering Θ ⊂ ∪Ni=1 B(θi , ϵ). Applying the union bound yields
XN
vol(Θ) ≤ vol ∪Ni=1 B(θi , ϵ) ≤ vol(B(θi , ϵ)) = Nϵd vol(B),
i=1
where the last step follows from the translation-invariance and scaling property of volume.
To prove (b), consider an ϵ-packing {θ1 , . . . , θM } ⊂ Θ such that the balls B(θi , ϵ/2) are disjoint.
M(ϵ)
Since ∪i=1 B(θi , ϵ/2) ⊂ Θ + 2ϵ B, taking the volume on both sides yields
ϵ ϵ
vol Θ + B ≥ vol ∪M i=1 B(θi , ϵ/2) = Mvol B .
2 2
This proves (b).
Finally, (c) follows from the following two statements: (1) if ϵB ⊂ Θ, then Θ + 2ϵ B ⊂ Θ + 21 Θ;
and (2) if Θ is convex, then Θ+ 12 Θ = 32 Θ. We only prove (2). First, ∀θ ∈ 32 Θ, we have θ = 13 θ+ 32 θ,
where 13 θ ∈ 12 Θ and 32 θ ∈ Θ. Thus 32 Θ ⊂ Θ + 12 Θ. On the other hand, for any x ∈ Θ + 12 Θ, we
have x = y + 21 z with y, z ∈ Θ. By the convexity of Θ, 23 x = 23 y + 31 z ∈ Θ. Hence x ∈ 23 Θ, implying
Θ + 21 Θ ⊂ 32 Θ.
Remark 27.2 Similar to the proof of (a) in Theorem 27.3, we can start from Θ + 2ϵ B ⊂
∪Ni=1 B(θi , 32ϵ ) to conclude that
N(Θ, k · k, ϵ)
(2/3)d ≤ ≤ 2d .
vol(Θ + 2ϵ B)/vol(ϵB)
In other words, the volume of the fattened set Θ + 2ϵ determines the metric entropy up to constants
that only depend on the dimension. We will revisit this reasoning in Section 27.5 to adapt the
volumetric estimates to infinite dimensions where this fattening step becomes necessary.
i i
i i
i i
526
Corollary 27.4 (Metric entropy of balls and spheres) Let k · k be an arbitrary norm on
Rd . Let B ≡ B∥·∥ = {x ∈ Rd : kxk ≤ 1} and S ≡ S∥·∥ = {x ∈ Rd : kxk ≤ 1} be the corresponding
unit ball and unit sphere. Then for ϵ < 1,
d d
1 2
≤ N(B, k · k, ϵ) ≤ 1 + (27.5)
ϵ ϵ
d−1 d−1
1 1
≤ N(S, k · k, ϵ) ≤ 2d 1 + (27.6)
2ϵ ϵ
where the left inequality in (27.6) holds under the extra assumption that k · k is an absolute norm
(invariant to sign changes of coordinates).
Proof. For balls, the estimate (27.5) directly follows from Theorem 27.3 since B + 2ϵ B = (1 + 2ϵ )B.
Next we consider the spheres. Applying (b) in Theorem 27.3 yields
vol(S + ϵB) vol((1 + ϵ)B) − vol((1 − ϵ)B)
N(S, k · k, ϵ) ≤ M(S, k · k, ϵ) ≤ ≤
vol(ϵB) vol(ϵB)
Z ϵ d−1
(1 + ϵ) − (1 − ϵ)
d d
d d−1 1
= = d (1 + x) dx ≤ 2d 1 + .
ϵd ϵ −ϵ ϵ
where the third inequality applies S + ϵB ⊂ ((1 + ϵ)B)\((1 − ϵ)B) by triangle inequality.
Finally, we prove the lower bound in (27.6) for an absolute norm k · k. To this end one cannot
directly invoke the lower bound in Theorem 27.3 as the sphere has zero volume. Note that k · k′ ≜
k(·, 0)k defines a norm on Rd−1 . We claim that every ϵ-packing in k · k′ for the unit k · k′ -ball
induces an ϵ-packing in k · k for the unit k · k-sphere. Fix x ∈ Rd−1 such that k(x, 0)k ≤ 1 and
define f : R+ → R+ by f(y) = k(x, y)k. Using the fact that k · k is an absolute norm, it is easy to
verify that f is a continuous increasing function with f(0) ≤ 1 and f(∞) = ∞. By the mean value
theorem, there exists yx , such that k(x, yx )k = 1. Finally, for any ϵ-packing {x′1 , . . . , x′M } of the unit
ball B∥·∥′ with respect to k·k′ , setting x′i = (xi , yxi ) we have kx′i −x′j k ≥ k(xi −xj , 0)k = kxi −xj k′ ≥ ϵ.
This proves
Then the left inequality of (27.6) follows from those of (27.4) and (27.5).
(a) Using (27.5), we see that for any compact Θ with nonempty interior, we have
1
N(Θ, k · k, ϵ) M(Θ, k · k, ϵ) (27.7)
ϵd
for small ϵ, with proportionality constants depending on both Θ and the norm. In fact, the
sharp constant is also known to exist. It is shown in [250, Theorem IX] that there exists a
i i
i i
i i
Next we switch our attention to the discrete case of Hamming space. The following theorem
bounds its packing number M(Fd2 , dH , r) ≡ M(Fd2 , r), namely, the maximal number of binary code-
words of length d with a prescribed minimum distance r + 1.5 This is a central question in coding
theory, wherein the lower and upper bounds below are known as the Gilbert-Varshamov bound
and the Hamming bound, respectively.
Proof. Both inequalities in (27.8) follow from the same argument as that in Theorem 27.3, with
Rd replaced by Fd2 and volume by the counting measure (which is translation invariant).
Of particular interest to coding theory is the asymptotic regime of d → ∞ and r = ρd for some
constant ρ ∈ (0, 1). Using the asymptotics of the binomial coefficients (cf. Proposition 1.5), the
4
For example, it is easy to show that τ = 1 for both ℓ∞ and ℓ1 balls in any dimension since cubes can be subdivided into
smaller cubes; for ℓ2 -ball in d = 2, τ = √π is the famous result of L. Fejes Tóth on the optimality of hexagonal
12
arrangement for circle packing [365].
5
Recall that the packing number in Definition 27.1 is defined with a strict inequality.
i i
i i
i i
528
Finding the exact exponent is one of the most significant open questions in coding theory. The best
upper bound to date is due to McEliece, Rodemich, Rumsey and Welch [299] using the technique
of linear programming relaxation.
In contrast, the corresponding covering problem in Hamming space is much simpler, as we
have the following tight result
where R(ρ) = (1 − h(ρ))+ is the rate-distortion function of Ber( 12 ) from Theorem 26.1. Although
this does not automatically follow from the rate-distortion theory, it can be shown using similar
argument – see Exercise V.26.
Finally, we state a lower bound on the packing number of Hamming spheres, which is needed
for subsequent application in sparse estimation (Exercise VI.12) and useful as basic building blocks
for computing metric entropy in more complicated settings (Theorem 27.7).
In particular,
k d
log M(Sdk , k/2) ≥ log . (27.12)
2 2ek
Proof. Again (27.11) follows from the volume argument. To verify (27.12), note that for r ≤ d/2,
Pr
we have i=0 di ≤ exp(dh( dr )) (see Theorem 8.2 or (15.19) with p = 1/2). Using h(x) ≤ x log xe
and dk ≥ ( dk )k , we conclude (27.12) from (27.11).
i i
i i
i i
As a case in point, consider the maximum number of ℓ2 -balls of radius ϵ packed into the unit
ℓ1 -ball, namely, M(B1 , k · k2 , ϵ). (Recall that Bp denotes the unit ℓp -ball in Rd with 1 ≤ p ≤ ∞.)
We have studied the metric entropy of arbitrary norm balls under the same norm in Corollary 27.4,
where the specific value of the volume was canceled from the √
volume ratio. Here, although ℓ1 and
ℓ2 norms are equivalent in the sense that kxk2 ≤ kxk1 ≤ dkxk2 , this relationship is too loose
when d is large.
Let us start by applying the volume method in Theorem 27.3:
vol(B1 ) vol(B1 + 2ϵ B2 )
≤ N(B1 , k · k2 , ϵ) ≤ M(B1 , k · k2 , ϵ) ≤ .
vol(ϵB2 ) vol( 2ϵ B2 )
Applying the formula for the volume of a unit ℓq -ball in Rd :
h id
2Γ 1 + 1q
vol(Bq ) = , (27.13)
Γ 1 + qd
πd
we get6 vol(B1 ) = 2d /d! and vol(B2 ) = Γ(1+d/2) , which yield, by Stirling approximation,
1 1
vol(B1 )1/d , vol(B2 )1/d √ . (27.14)
d d
Then for some absolute constant C,
√ d
vol(B1 + 2ϵ B2 ) vol((1 + ϵ 2 d )B1 ) 1
M(B1 , k · k2 , ϵ) ≤ ≤ ≤ C 1 + √ , (27.15)
vol( 2ϵ B2 ) vol( 2ϵ B2 ) ϵ d
√
where the second inequality follows from B2 ⊂ dB1 by Cauchy-Schwarz inequality. (This step
is tight in the sense that vol(B1 + 2ϵ B2 )1/d ≳ max{vol(B1 )1/d , 2ϵ vol(B2 )1/d } max{ d1 , √ϵd }.) On
the other hand, for some absolute constant c,
d d
vol(B1 ) 1 vol(B1 ) c
M(B1 , k · k2 , ϵ) ≥ = = √ . (27.16)
vol(ϵB2 ) ϵ vol(B2 ) ϵ d
Overall, for ϵ ≤ √1d , we have M(B1 , k · k2 , ϵ)1/d ϵ√1 d ; however, the lower bound trivializes and
the upper bound (which is exponential in d) is loose in the regime of ϵ √1d , which requires
different methods than volume calculation. The following result describes the complete behavior
of this metric entropy. In view of Theorem 27.2, we will go back and forth between the covering
and packing numbers in the argument.
6
For B1 this can be proved directly by noting that B1 consists 2d disjoint “copies” of the simplex whose volume is 1/d! by
induction on d.
i i
i i
i i
530
Proof. The case of ϵ ≤ √1d follows from earlier volume calculation (27.15)–(27.16). Next we
focus on √1d ≤ ϵ < 1.
For the upper bound, we construct an ϵ-covering in ℓ2 by quantizing each coordinate. Without
loss of generality, assume that ϵ < 1/4. Fix some δ < 1. For each θ ∈ B1 , there exists x ∈
(δ Zd ) ∩ B1 such that kx − θk∞ ≤ δ . Then kx − θk22 ≤ kx − θk1 kx − θk∞ ≤ 2δ . Furthermore, x/δ
belongs to the set
( )
X d
Z= z∈Z : d
|zi | ≤ k (27.17)
i=1
with k = b1/δc. Note that each z ∈ Z has at most k nonzeros. By enumerating the number of non-
negative solutions (stars and bars calculation) and the sign pattern, we have7 |Z| ≤ 2k∧d d−k1+k .
Finally, picking δ = ϵ2 /2, we conclude that N(B1 , k · k2 , ϵ) ≤ |Z| ≤ ( 2e(dk+k) )k as desired. (Note
that this method also recovers the volume bound for ϵ ≤ √1d , in which case k ≤ d.)
√
For the lower bound, note that M(B1 , k · k2 , 2) ≥ 2d by considering ±e1 , . . . , ±ed . So it
suffices to consider d ≥ 8. We construct a packing of B1 based on a packing of the Hamming
sphere. Without loss of generality, assume that ϵ > 4√1 d . Fix some 1 ≤ k ≤ d. Applying
the Gilbert-Varshamov bound in Theorem 27.6, in particular, (27.12), there exists a k/2-packing
Pd
{x1 , . . . , xM } ⊂ Sdk = {x ∈ {0, 1}d : i=1 xi = k} and log M ≥ 2k log 2ek d
. Scale the Hamming
sphere to fit the ℓ1 -ball by setting θi = xi /k. Then θi ∈ B1 and kθi − θj k2 = k2 dH (xi , xj ) ≥ 2k
2 1 1
for all
1
i 6= j. Choosing k = ϵ2 which satisfies k ≤ d/8, we conclude that {θ1 , . . . , θM } is a 2 -packing
ϵ
of B1 in k · k2 as desired.
The above elementary proof can be adapted to give the following more general result (see
Exercise V.27): Let 1 ≤ p < q ≤ ∞. For all 0 < ϵ < 1 and d ∈ N,
(
d log ϵes d ϵ ≤ d−1/s 1 1 1
log M(Bp , k · kq , ϵ) p,q 1 , ≜ − . (27.18)
−1/s
s log(eϵ d)
ϵ
s
ϵ≥d s p q
In the remainder of this section, we discuss a few generic results in connection to Theorem 27.7,
in particular, metric entropy upper bounds via the Sudakov minorization and Maurey’s empirical
method, as well as the duality of metric entropy in Euclidean spaces.
7 ∑d (d)( k )
By enumerating the support and counting positive solutions, it is easy to show that |Z| = i=0 2d−i i d−i
.
8
To avoid measurability difficulty, w(Θ) should be understood as supT⊂Θ,|T|<∞ E maxθ∈T hθ, Zi.
i i
i i
i i
For any Θ ⊂ Rd ,
p
w(Θ) ≳ sup ϵ log M(Θ, k · k2 , ϵ). (27.20)
ϵ>0
As a quick corollary, applying the volume lower bound on the packing number in Theorem 27.3
to (27.20) and optimizing over ϵ, we obtain Urysohn’s inequality:9
1/d
√ vol(Θ) (27.14)
w(Θ) ≳ d d · vol(Θ)1/d . (27.21)
vol(B2 )
Sudakov’s theorem relates the Gaussian width to the metric entropy, both of which are meaning-
ful measures of the massiveness of a set. The important point is that the proportionality constant
in (27.20) is independent of the dimension. It turns out that Sudakov’s lower bound is tight up to
a log(d) factor [438, Theorem 8.1.13]. The following complementary result is known as Dudley’s
chaining inequality (see Exercise V.28 for a proof.)
Z ∞p
w(Θ) ≲ log M(Θ, k · k2 , ϵ)dϵ. (27.22)
0
Understanding the maximum of Gaussian processes is a field on its own; see the monograph [411].
In this section we focus on the lower bound (27.20) in order to develop upper bound for metric
entropy using the Gaussian width.
The proof of Theorem 27.8 relies on the following Gaussian comparison lemma of Slepian
(whom we have encountered earlier in Theorem 11.13). For a self-contained proof see [89]. See
also [329, Lemma 5.7, p. 70] for a simpler proof of a weaker version E max Xi ≤ 2E max Yi , which
suffices for our purposes.
We also need the result bounding the expectation of the maximum of n Gaussian random
variables.
i.i.d.
In addition, if Z1 , . . . , Zn ∼ N (0, 1), then
h i p
E max Zi = 2 log n(1 + o(1)). (27.24)
i∈[n]
9 vol(Θ)
For a sharp form, see [329, Corollary 1.4], which states that for all symmetric convex Θ, w(Θ) ≥ E[kZk2 ]( vol(B ) )1/d ;
2
in other words, balls minimize the Gaussian width among all symmetric convex bodies of the same volume.
i i
i i
i i
532
Proof. First, let T = argmaxj Zj . Since Zj are 1-subgaussian (recall Definition 4.15), from
Exercise I.56 we have
p p p
| E[max Zi ]| = | E[ZT ]| ≤ 2I(Zn ; T) = 2H(T) ≤ 2 log n .
i
E[max Zi ] ≥ t P[max Zi ≥ t] + E[max Zi 1 {Z1 < 0}1 {Z2 < 0} . . . 1 {Zn < 0}]
i i i
≥ t(1 − (1 − Φc (t))n ) + E[Z1 1 {Z1 < 0}1 {Z2 < 0} . . . 1 {Zn < 0}].
where Φc (t) = P[Z1 ≥ t] is the normal tail probability. The second term equals
2−(n−1) E[Z1 1 {Z1 < 0}] = o(1). For the first term, recall that Φc (t) ≥ 1+t t2 φ(t) (Exercise V.25).
p
Choosing
p t = (2 − ϵ) log n for small ϵ > 0 so that Φc (t) = ω( 1n ) and hence E[maxi Zi ] ≥
(2 − ϵ) log n(1 + o(1)). By the arbitrariness of ϵ > 0, the lower bound part of (27.24)
follows.
Proof of Theorem 27.8. Let {θ1 , . . . , θM } be an optimal ϵ-packing of Θ. Let Xi = hθi , Zi for
i.i.d.
i ∈ [M], where Z ∼ N (0, Id ). Let Yi ∼ N (0, ϵ2 /2). Then
Then
p
E sup hθ, Zi ≥ E max Xi ≥ E max Yi ϵ log M
θ∈Θ 1≤i≤M 1≤i≤M
where the second and third step follows from Lemma 27.9 and Lemma 27.10 respectively.
1
27.3.2 Hilbert ball has metric entropy ϵ2
P
We consider a Hilbert ball B2 = {x ∈ R∞ : i x2i ≤ 1}. Under the usual metric ℓ2 (R∞ ) this
set is not compact and cannot have finite ϵ-nets for all ϵ. However, the metric of interest in many
i i
i i
i i
applications is often different. Specifically, let us fix some probability distribution P on B2 s.t.
EX∼P [kXk22 ] ≤ 1 and define
p
dP (θ, θ′ ) ≜ EX∼P [| hθ − θ′ , Xi |2 ]
for θ, θ′ ∈ B2 . The importance of this metric is that it allows to analyze complexity of a class
of linear functions θ 7→ hθ, Xi for any random variable X of unit norm and has applications in
learning theory [302, 471].
Theorem 27.11 For some universal constant c we have for all ϵ > 0
c
log N(B2 , dP , ϵ) ≤ .
ϵ2
Proof. First, we show that without loss of generality we may assume that X has all coordinates
P 2
other than the first n zero. Indeed, take n so large that E[ j>n X2j ] ≤ ϵ4 . Let us denote by θ̃ the
vector obtained from θ by zeroing out all coordinates j > n. Then from Cauchy-Schwartz we see
that dP (θ, θ̃) ≤ 2ϵ and therefore any 2ϵ -covering of B̃2 = {θ̃ : θ ∈ B2 } will be an ϵ-covering of B2 .
Hence, from now on we assume that the ball B2 is in fact finite-dimensional.
We can redefine distance dP in a more explicit way as follows
dP (θ, θ′ )2 = (θ − θ′ )⊤ Σ(θ − θ′ ) ,
To see one simple implication of the result, recall the standard bound on empirical processes: By
endowing any collection of functions {fθ , θ ∈ Θ} with a metric dP̂n (θ, θ′ )2 = EP̂n [(fθ (X)− fθ′ (X))2 ]
we have
" Z ∞r #
log N(Θ, dP̂n , ϵ)
E sup E[fθ (X)] − Ên [fθ (X)] ≲ E inf δ + dϵ .
θ δ>0 δ n
It can be seen that when entropy behaves as ϵ−p we get rate n− min(1/p,1/2) except for p = 2
for which the upper bound yields n− 2 log n. The significance of the previous theorem is that the
1
Hilbert ball is precisely “at the phase transition” from parametric to nonparametric rate.
i i
i i
i i
534
As a sanity check, let us take any PX over the unit (possibly infinite dimensional) ball B with
E[X] = 0 and let Θ = B. We have
" # r
1X
n
log n
E[kX̄n k] = E sup hθ, Xi i ≲ ,
θ n i=1 n
Pn
where
p X̄n = 1n i=1 Xi is the empirical mean. In this special case it is easy to bound E[kX̄n k] ≤
E[kX̄n k2 ] ≤ √1n by an explicit calculation.
Proof. Let T = {t1 , t2 , . . . , tm } and denote the Chebyshev center of T by c ∈ H, such that r =
maxi∈[m] kc − ti k. For n ∈ Z+ , let
( ! )
1 Xm Xm
Z= c+ ni ti : ni ∈ Z+ , ni = n .
n+1
i=1 i=1
Pm P
For any x = i=1 xi ti ∈ co(T) where xi ≥ 0 and xi = 1, let Z be a discrete random variable
such that Z = ti with probability xi . Then E[Z] = x. Let Z0 = c and Z1 , . . . , Zn be i.i.d. copies of
Pm
Z. Let Z̄ = n+1 1 i=0 Zi , which takes values in the set Z . Since
2
1 X n
1 Xn X
EkZ̄ − xk22 = E (Zi − x) = E kZi − xk2 + EhZi − x, Zj − xi
(n + 1)2 ( n + 1) 2
i=0 i=0 i̸=j
1 X
n
1 r2
= E kZi − xk2 = kc − xk2 + nE[kZ − xk2 ] ≤ ,
(n + 1)2 ( n + 1) 2 n+1
i=0
Pm
where the last inequality follows from that kc − xk ≤ i=1 xi kc − ti k ≤ r (in other words, rad(T) =
rad(co(T)) and E[kZ − xk2 ] ≤ E[kZ − ck2 ] ≤ r2 . Set n = r2 /ϵ2 − 1 so that r2 /(n + 1) ≤ ϵ2 .
There exists some z ∈ N such that kz − xk ≤ ϵ. Therefore Z is an ϵ-covering of co(T). Similar to
(27.17), we have
n+m−1 m + r2 /ϵ2 − 2
|Z| ≤ = .
n dr2 /ϵ2 e − 1
i i
i i
i i
We now apply Theorem 27.12 to recover the result for the unit ℓ1 -ball B1 in Rd in Theorem 27.7:
Note that B1 = co(T), where T = {±e1 , . . . , ±ed , 0} satisfies rad(T) = 1. Then
2d + d ϵ12 e − 1
N(B1 , k · k2 , ϵ) ≤ , (27.27)
d ϵ12 e − 1
which recovers the optimal upper bound in Theorem 27.7 at both small and big scales.
Then the usual covering number in Definition 27.1 satisfies N(K, k · k, ϵ) = N(K, ϵB), where B is
the corresponding unit norm ball.
A deep result of Artstein, Milman, and Szarek [28] establishes the following duality for metric
entropy: There exist absolute constants α and β such that for any symmetric convex body K,10
1 ϵ
log N B2 , K◦ ≤ log N(K, ϵB2 ) ≤ log N(B2 , αϵK◦ ), (27.28)
β α
where B2 is the usual unit ℓ2 -ball, and K◦ = {y : supx∈K hx, yi ≤ 1} is the polar body of K.
As an example, consider p < 2 < q and p1 + 1q = 1. By duality, B◦p = Bq . Then (27.28) shows
that N(Bp , k · k2 , ϵ) and N(B2 , k · kq , ϵ) have essentially the same behavior, as verified by (27.18).
10
A convex body K is a compact convex set with non-empty interior. We say K is symmetric if K = −K.
i i
i i
i i
536
Theorem 27.13 Assume that L, A > 0 and p ∈ [1, ∞] are constants. Then
1
log N(F(A, L), k · kp , ϵ) = Θ . (27.29)
ϵ
Thus, it is sufficient to consider F(A, 1) ≜ F(A), the collection of 1-Lipschitz densities on [0, A].
Next, observe that any such density function f is bounded from above. Indeed, since f(x) ≥ (f(0) −
RA
x)+ and 0 f = 1, we conclude that f(0) ≤ max{A, A2 + A1 } ≜ m.
To show (27.29), it suffices to prove the upper bound for p = ∞ and the lower bound for p = 1.
Specifically, we aim to show, by explicit construction,
C Aϵ
N(F(A), k · k∞ , ϵ) ≤ 2 (27.32)
ϵ
c
M(F(A), k · k1 , ϵ) ≥ 2 ϵ (27.33)
which imply the desired (27.29) in view of Theorem 27.2. Here and below, c, C are constants
depending on A. We start with the easier (27.33). We construct a packing by perturbing the uniform
density. Define a function T by T(x) = x1 {x ≤ ϵ} + (2ϵ − x)1 {x ≥ ϵ} + A1 on [0, 2ϵ] and zero
elsewhere. Let n = 4Aϵ and a = 2nϵ. For each y ∈ {0, 1}n , define a density fy on [0, A] such that
X
n
f y ( x) = yi T(x − 2(i − 1)ϵ), x ∈ [0, a],
i=1
RA
and we linearly extend fy to [a, A] so that 0 fy = 1; see Figure 27.2. For sufficiently small ϵ, the
Ra
resulting fy is 1-Lipschitz since 0 fy = 12 + O(ϵ) so that the slope of the linear extension is O(ϵ).
1/A
x
0 ϵ 2ϵ 2nϵ A
Figure 27.2 Packing that achieves (27.33). The solid line represent one such density fy (x) with
y = (1, 0, 1, 1). The dotted line is the density of Unif(0, A).
i i
i i
i i
Thus we conclude that each fy is a valid member of F(A). Furthermore, for y, z ∈ {0, 1}n , we
have kfy −fz k1 = dH (y, z)kTk1 = ϵ2 dH (y, z). Invoking the Gilbert-Varshamov bound Theorem 27.5,
we obtain an n2 -packing Y of the Hamming space {0, 1}n with |Y| ≥ 2cn for some absolute constant
2
c. Thus {fy : y ∈ Y} constitutes an n2ϵ -packing of F(A) with respect to the L1 -norm. This is the
2
desired (27.33) since n2ϵ = Θ(ϵ).
To construct a covering, set J = mϵ , n = Aϵ , and xk = kϵ for k = 0, . . . , n. Let G be the
collection of all lattice paths (with grid size ϵ) of n steps starting from the coordinate (0, jϵ) for
some j ∈ {0, . . . , J}. In other words, each element g of G is a continuous piecewise linear function
on each subinterval Ik = [xk , xk+1 ) with slope being either +1 or −1. Evidently, the number of
such paths is at most (J + 1)2n = O( 1ϵ 2A/ϵ ). To show that G is an ϵ-covering, for each f ∈ F (A),
we show that there exists g ∈ G such that |f(x) − g(x)| ≤ ϵ for all x ∈ [0, A]. This can be shown
by a simple induction. Suppose that there exists g such that |f(x) − g(x)| ≤ ϵ for all x ∈ [0, xk ],
which clearly holds for the base case of k = 0. We show that g can be extended to Ik so that this
holds for k + 1. Since |f(xk ) − g(xk )| ≤ ϵ and f is 1-Lipschitz, either f(xk+1 ) ∈ [g(xk ), g(xk ) + 2ϵ]
or [g(xk ) − 2ϵ, g(xk )], in which case we extend g upward or downward, respectively. The resulting
g satisfies |f(x) − g(x)| ≤ ϵ on Ik , completing the induction.
b′ + ϵ1/3
b′
x
0 a′ A
Figure 27.3 Improved packing for (27.34). Here the solid and dashed lines are two lattice paths on a grid of
size ϵ starting from (0, b′ ) and staying in the range of [b′ , b′ + ϵ1/3 ], followed by their respective linear
extensions.
Finally, we prove the sharp bound (27.30) for p = ∞. The upper bound readily follows from
(27.32) plus the scaling relation (27.31). For the lower bound, we apply Theorem 27.2 converting
the problem to the construction of 2ϵ-packing. Following the same idea of lattice paths, next we
give an improved packing construction such that
a
M(F(A), k · k∞ , 2ϵ) ≥ Ω(ϵ3/2 2 ϵ ). (27.34)
a b
for any a < A. Choose any b such that A1 < b < A1 + (A−
2
a) ′ ′
2A . Let a = ϵ ϵ and b = ϵ ϵ . Consider
a density f on [0, A] of the following form (cf. Figure 27.3): on [0, a ], f is a lattice path from (0, b′ )
′
i i
i i
i i
538
to (a′ , b′ ) that stays in the vertical range of [b′ , b′ +ϵ1/3 ]; on [a′ , A], f is a linear extension chosen so
RA
that 0 f = 1. This is possible because by the 1-Lipschitz constraint we can linearly extend f so that
RA ′ 2 ′ 2 R a′
a′
f takes any value in the interval [b′ (A−a′ )− (A−2a ) , b′ (A−a′ )+ (A−2a ) ]. Since 0 f = ab+o(1),
RA R a′
we need a′ f = 1 − 0 f = 1 − ab + o(1), which is feasible due to the choice of b. The collection
G of all such functions constitute a 2ϵ-packing in the sup norm (for two distinct paths consider the
first subinterval where they differ). Finally, we bound the cardinality of this packing by counting
the number of such paths. This can be accomplished by standard estimates on random walks (see
e.g. [164, Chap. III]). For any constant c > 0, the probability that a symmetric random walk on
Z returns to zero in n (even) steps and stays in the range of [0, n1+c ] is Θ(n−3/2 ); this implies the
desired (27.34). Finally, since a < A is arbitrary, the lower bound part of (27.30) follows in view
of Theorem 27.2.
The following result, due to Birman and Solomjak [57] (cf. [285, Sec. 15.6] for an exposition),
is an extension of Theorem 27.13 to the more general Hölder class.
Theorem 27.14 Fix positive constants A, L and d ∈ N. Let β > 0 and write β = ℓ + α,
where ℓ ∈ Z+ and α ∈ (0, 1]. Let Fβ (A, L) denote the collection of ℓ-times continuously
differentiable densities f on [0, A]d whose ℓth derivative is (L, α)-Hölder continuous, namely,
kD(ℓ) f(x) − D(ℓ) f(y)k∞ ≤ Lkx − ykα ∞ for all x, y ∈ [0, A] . Then for any 1 ≤ p ≤ ∞,
d
d
log N(Fβ (A, L), k · kp , ϵ) = Θ ϵ− β . (27.35)
The main message of the preceding theorem is that is the entropy of the function class grows
more slowly if the dimension decreases or the smoothness increases. As such, the metric entropy
for very smooth functions can grow subpolynomially in 1ϵ . For example, Vitushkin (cf. [250,
Eq. (129)]) showed that for the class of analytic functions on the unit complex disk D having
analytic extension to a bigger disk rD for r > 1, the metric entropy (with respect to the sup-norm
on D) is Θ((log 1ϵ )2 ); see [250, Sec. 7 and 8] for more such results.
As mentioned at the beginning of this chapter, the conception and development of the subject
on metric entropy, in particular, Theorem 27.14, are motivated by and plays an important role
in the study of Hilbert’s 13th problem. In 1900, Hilbert conjectured that there exist functions of
several variables which cannot be represented as a superposition (composition) of finitely many
functions of fewer variables. This was disproved by Kolmogorov and Arnold in 1950s who showed
that every continuous function of d variables can be represented by sums and superpositions of
single-variable functions; however, their construction does not work if one requires the constituent
functions to have specific smoothness. Subsequently, Hilbert’s conjecture for smooth functions
was positively resolved by Vitushkin [439], who showed that there exist functions of d variables
in the β -Hölder class (in the sense of Theorem 27.14) that cannot be expressed as finitely many
superpositions of functions of d′ variables in the β ′ -Hölder class, provided d/β > d′ /β ′ . The
original proof of Vitushkin is highly involved. Later, Kolmogorov gave a much simplified proof
by proving and applying the k · k∞ -version of Theorem 27.14. As evident in (27.35), the index
d/β provides a complexity measure for the function class; this allows an proof of impossibility
i i
i i
i i
of superposition by an entropy comparison argument. For concreteness, let us prove the follow-
ing simpler version: There exists a 1-Lipschitz function f(x, y, z) of three variables on [0, 1]3 that
cannot be written as g(h1 (x, y), h2 (y, z)) where g, h1 , h2 are 1-Lipschitz functions of two variables
on [0, 1]2 . Suppose, for the sake of contradiction, that this is possible. Fixing an ϵ-covering of
cardinality exp(O( ϵ12 )) for 1-Lipschitz functions on [0, 1]2 and using it to approximate the func-
tions g, h1 , h2 , we obtain by superposition g(h1 , h2 ) an O(ϵ)-covering of cardinality exp(O( ϵ12 )) of
1-Lipschitz functions on [0, 1]3 ; however, this is a contradiction as any such covering must be of
size exp(Ω( ϵ13 )). For stronger and more general results along this line, see [250, Appendix I].
i i
i i
i i
540
H. We refer the reader to, e.g., [262, Sec. 2] and [314, III.3.2], for the precise definition of this
object.11 For the purpose of this section, it is enough to consider the following examples (for more
see [262]):
The following fundamental result due to Kuelbs and Li [263] (see also the earlier work of
Goodman [194]) describes a precise connection between the small-ball probability function ϕ(ϵ)
and the metric entropy of the unit Hilbert ball N(K, k · k, ϵ) ≡ N(ϵ).
λ2
ϕ(2ϵ) + log Φ(λ + Φ−1 (e−ϕ(ϵ) )) ≤ log N(λK, ϵ) ≤ log M(λK, ϵ) ≤ + ϕ(ϵ/2) (27.41)
2
p
To deduce (27.40), choose λ = 2ϕ(ϵ/2) and note that by scaling N(λK, ϵ) = N(K, ϵ/λ).
) = Φc (t) ≤ e−t /2 (Exercise V.25) yields Φ−1 (e−ϕ(ϵ) ) ≥
2
Applying
p the normal tail bound Φ(− t
− 2ϕ(ϵ) ≥ −λ so that Φ(Φ−1 (e−ϕ(ϵ) ) + λ) ≥ Φ(0) = 1/2.
We only give the proof in finite dimensions as the results are dimension-free and extend natu-
rally to infinite-dimensional spaces. Let Z ∼ γ = N (0, Σ) on Rd so that K = Σ1/2 B2 is given in
(27.38). Applying (27.37) to λK and noting that γ is a probability measure, we have
γ (λK + B (0, ϵ)) 1
≤ N(λK, ϵ) ≤ M(λK, ϵ) ≤ . (27.42)
maxθ∈Rd γ (B (θ, 2ϵ)) minθ∈λK γ (B (θ, ϵ/2))
Next we further bound (27.42) using properties native to the Gaussian measure.
11
In particular, if γ is the law of a Gaussian process X on C([0, 1]) with E[kXk22 ] < ∞, the kernel K(s, t) = E[X(s)X(t)]
∑
admits the eigendecomposition K(s, t) = λk ψk (s)ψk (t) (Mercer’s theorem), where {ϕk } is an orthonormal basis for
∑
L2 ([0, 1]) and λk > 0. Then H is the closure of the span of {ϕk } with the inner product hx, yiH = k hx, ψk ihy, ψk i/λk .
i i
i i
i i
• For the upper bound, for any symmetric set A = −A and any θ ∈ λK, by a change of measure
γ(θ + A) = P [Z − θ ∈ A]
1 ⊤ −1
h −1 i
= e− 2 θ Σ θ E e⟨Σ θ,Z⟩ 1 {Z ∈ A}
≥ e−λ
2
/2
P [Z ∈ A] ,
h −1 i
where the last step follows from θ⊤ Σ−1 θ ≤ λ2 and by Jensen’s inequality E e⟨Σ θ,Z⟩ |Z ∈ A ≥
−1
e⟨Σ θ,E[Z|Z∈A]⟩ = 1, using crucially that E [Z|Z ∈ A] = 0 by symmetry. Applying the above to
A = B(0, ϵ/2) yields the right inequality in (27.41).
• For the lower bound, recall Anderson’s lemma (Lemma 28.10) stating that the Gaussian measure
of a ball is maximized when centered at zero, so γ(B(θ, 2ϵ)) ≤ γ(B(0, 2ϵ)) for all θ. To bound
the numerator, recall the Gaussian isoperimetric inequality (see e.g. [69, Theorem 10.15]):12
Applying this with A = B(0, ϵ) proves the left inequality in (27.41) and the theorem.
The implication of Theorem 27.15 is the following. Provided that ϕ(ϵ) ϕ(ϵ/2), then we
should expect that approximately
!
ϵ
log N p ϕ(ϵ)
ϕ(ϵ)
With more effort this can be made precise unconditionally (see e.g. [279, Theorem 3.3], incorporat-
ing the later improvement by [278]), leading to very precise connections between metric entropy
and small-ball probability, for example: for fixed α > 0, β ∈ R,
β 2+α
2β
−α 1 − 2+α
2α 1
ϕ(ϵ) ϵ log ⇐⇒ log N(ϵ) ϵ log (27.44)
ϵ ϵ
As a concrete example, consider the unit ball (27.39) in the RKHS generated by the standard
Brownian motion, which is similar to a Sobolev ball.13 Using (27.36) and (27.44), we conclude
that log N(ϵ) 1ϵ , recovering the metric entropy of Sobolev ball determined in [420]. This result
also coincides with the metric entropy of Lipschitz ball in Theorem 27.14 which requires the
derivative to be bounded everywhere as opposed to on average in L2 . For more applications of
small-ball probability on metric entropy (and vice versa), see [263, 278].
12
The connection between (27.43) and isoperimetry is that if we interpret limλ→0 (γ(A + λK) − γ(A))/λ as the surface
measure of A, then among all sets with the same Gaussian measure, the half space has maximal surface measure.
13
The Sobolev norm is kfkW1,2 ≜ kfk2 + kf′ k2 . Nevertheless, it is simple to verify a priori that the metric entropy of
(27.39) and that of the Sobolev ball share the same behavior (see [263, p. 152]).
i i
i i
i i
542
its rate-distortion function (recall Section 24.3). Denote the worst-case rate-distortion function on
X by
The next theorem relates ϕX to the covering and packing number of X . The lower bound simply
follows from a “Bayesian” argument, which bounds the worst case from below by the average case,
akin to the relationship between minimax and Bayes risk (see Section 28.3). The upper bound was
shown in [241] using the dual representation of rate-distortion functions; here we give a simpler
proof via Fano’s inequality.
Proof. Fix an ϵ-covering of X in d of size N. Let X̂ denote the closest element in the covering to
X. Then d(X, X̂) ≤ ϵ almost surely. Thus ϕX (ϵ) ≤ I(X; X̂) ≤ log N. Optimizing over PX proves the
left inequality.
For the right inequality, let X be uniformly distributed over a maximal ϵ-packing of X . For
any PX̂|X such that E[d(X, X̂)] ≤ cϵ. Let X̃ denote the closest point in the packing to X̂. Then we
have the Markov chain X → X̂ → X̃. By definition, d(X, X̃) ≤ d(X̂, X̃) + d(X̂, X) ≤ 2d(X̂, X) so
E[d(X, X̃)] ≤ 2cϵ. Since either X = X̃ or d(X, X̃) > ϵ, we have P[X 6= X̃] ≤ 2c. On the other
hand, Fano’s inequality (Corollary 3.13) yields P[X 6= X̃] ≥ 1 − I(X;log X̂)+log 2
M . In all, I(X; X̂) ≥
(1 − 2c) log M − log 2, proving the upper bound.
Remark 27.4 (a) Clearly, Theorem 27.16 can be extended to the case where the distortion
function equals a power of the metric, namely, replacing (27.45) with
ϕX,r (ϵ) ≜ inf I(X; X̂).
PX̂|X :E[d(X,X̂)r ]≤ϵr
Then (27.47) continues to hold with 1 − 2c replaced by 1 − (2c)r . This will be useful, for
example, in the forthcoming applications where second moment constraint is easier to work
with.
i i
i i
i i
(b) In the earlier literature a variant of the rate-distortion function is also considered, known as
the ϵ-entropy of X, where the constraint is d(X, X̂) ≤ ϵ with probability one as opposed to
in expectation (cf. e.g. [250, Appendix II] and [349]). With this definition, it is natural to
conjecture that the maximal ϵ-entropy over all distributions on X coincides with the metric
entropy log N(X , ϵ); nevertheless, this need not be true (see [300, Remark, p. 1708] for a
counterexample).
Theorem 27.16 points out an information-theoretic route to bound the metric entropy by the
worst-case rate-distortion function (27.46).14 Solving this maximization, however, is not easy as
PX 7→ ϕX (D) is in general neither convex nor concave [6].15 Fortunately, for certain spaces, one
can show via a symmetry argument that the “uniform” distribution maximizes the rate-distortion
function at every distortion level; see Exercise V.24 for a formal statement. As a consequence, we
have:
• For Hamming space X = {0, 1}d and Hamming distortion, ϕX (D) is attained by Ber( 12 )d . (We
already knew this from Theorem 26.1 and Theorem 24.8.)
• For the unit sphere X = Sd−1 and distortion function defined by the Euclidean distance, ϕX (D)
is attained by Unif(Sd−1 ).
• For the orthogonal group X = O(d) or unitary group U(d) and distortion function defined by
the Frobenius norm, ϕX (D) is attained by the Haar measure. Similar statements also hold for
the Grassmann manifold (collection of linear subspaces).
Theorem 27.17 Let θ be uniformly distributed over the unit sphere Sd−1 . Then for all 0 <
ϵ < 1,
1 1
(d − 1) log − C ≤ inf I(θ; θ̂) ≤ (d − 1) log 1 + + log(2d)
ϵ Pθ̂|θ :E[∥θ̂−θ∥22 ]≤ϵ2 ϵ
Note that the random vector θ have dependent entries so we cannot invoke the single-
d
letterization technique in Theorem 24.8. Nevertheless, we have the representation θ=Z/kZk2 for
Z ∼ N (0, Id ), which allows us to relate the rate-distortion function of θ to that of the Gaussian
found in Theorem 26.2. The resulting lower bound agree with the metric entropy for spheres in
Corollary 27.4, which scales as (d − 1) log 1ϵ . Using similar reduction arguments (see [275, The-
orem VIII.18]), one can obtain tight lower bound for the metric entropy of the orthogonal group
O(d) and the unitary group U(d), which scales as d(d2−1) log 1ϵ and d2 log 1ϵ , with pre-log factors
14
A striking parallelism between the metric entropy of Sobolev balls and the rate-distortion function of smooth Gaussian
processes has been observed by Donoho in [133]. However, we cannot apply Theorem 27.16 to formally relate one to the
other since it is unclear whether the Gaussian rate-distortion function is maximal.
15
As a counterexample, consider Theorem 26.1 for the binary source.
i i
i i
i i
544
commensurate with their respective degrees of freedoms. As mentioned in Remark 27.3(b), these
results were obtained by Szarek in [406] using a volume argument with Haar measures; in compar-
ison, the information-theoretic approach is more elementary as we can again reduce to Gaussian
rate-distortion computation.
Proof. The upper bound follows from Theorem 27.16 and Remark 27.4(a), applying the metric
entropy bound for spheres in Corollary 27.4.
To prove the lower bound, let Z ∼ N (0, Id ). Define θ = ∥ZZ∥ and A = kZk, where k · k ≡ k · k2
henceforth. Then θ ∼ Unif(Sd−1 ) and A ∼ χd are independent. Fix Pθ̂|θ such that E[kθ̂ − θk2 ] ≤
ϵ2 . Since Var(A) ≤ 1, the Shannon lower bound (Theorem 26.3) shows that the rate-distortion
function of A is majorized by that of the standard Gaussian. So for each δ ∈ (0, 1), there exists
PÂ|A such that E[(Â − A)2 ] ≤ δ 2 , I(A, Â) ≤ log δ1 , and E[A] = E[Â]. Set Ẑ = Âθ̂. Then
I(Z; Ẑ) = I(θ, A; Ẑ) ≤ I(θ, A; θ̂, Â) = I(θ; θ̂) + I(A, Â).
Furthermore, E[Â2 ] = E[(Â − A)2 ] + E[A2 ] + 2E[(Â − A)(A − E[A])] ≤ d + δ 2 + 2δ ≤ d + 3δ .
Similarly, |E[Â(Â − A)]| ≤ 2δ and E[kZ − Ẑk2 ] ≤ dϵ2 + 7δϵ + δ . Choosing δ = ϵ, we have
E[kZ − Ẑk2 ] ≤ (d + 8)ϵ2 . Combining Theorem 24.8 with the Gaussian rate-distortion function in
Theorem 26.2, we have I(Z; Ẑ) ≥ d2 log (d+d8)ϵ2 , so applying log(1 + x) ≤ x yields
1
I(θ; θ̂) ≥ (d − 1) log − 4 log e.
ϵ2
i i
i i
i i
V.1 Let S = Ŝ = {0, 1}. Consider the source X10 consisting of fair coin flips. Construct a simple
1
(suboptimal) compressor achieving average Hamming distortion 20 with 512 codewords.
V.2 Assume a separable distortion loss. Show that the minimal number of codewords M∗ (n, D)
required to represent memoryless source Xn with average distortion D (recall (24.9)) satisfies
Conclude that
1 1
lim log M∗ (n, D) = inf log M∗ (n, D) . (V.1)
n→∞ n n n
That is, one can always achieve a better compression rate by using a longer blocklength. Neither
claim holds for log M∗ (n, ϵ) in channel coding as defined in (19.4). Explain why this different
behavior arises.
V.3 (Non-asymptotic rate-distortion) Our goal is to show that the convergence to R(D) happens
much faster than that to capacity in channel coding. Consider binary uniform X ∼ Ber(1/2)
with Hamming distortion.
(a) Show that there exists a lossy code Xn → W → X̂n with M codewords and
where
s
X n
p(s) = 2−n .
j
j=0
(b) Show that there exists a lossy code with M codewords and
1X
n−1
M
E[d(Xn , X̂n )] ≤ (1 − p(s)) . (V.2)
n
s=0
(c) Show that there exists a lossy code with M codewords and
1 X −Mp(s)
n−1
E[d(Xn , X̂n )] ≤ e . (V.3)
n
s=0
(Note: For M ≈ 2nR , numerical evaluation of (V.2) for large n is challenging. At the same
time (V.3) is only slightly looser.)
i i
i i
i i
(d) For n = 10, 50, 100 and 200 compute the upper bound on log M∗ (n, 0.11) via (V.3).
Compare with the lower bound
V.4 Continuing Exercise V.3 use Stirling formula and (V.3)-(V.4) to show
Note: Thus, optimal compression rate converges to its asymptotic fundamental limit R(D) at
√
a fast rate of O(log n/n) as opposed to O(1/ n) for channel coding (cf. Theorem 22.2). This
result holds for most memoryless sources.
i.i.d.
V.5 Let Sj ∼ PS be an iid source on a finite alphabet A and PS (a) > 0 for all a ∈ A. Suppose
the distortion metric satisfies d(x, y) = D0 =⇒ x = y. Show that R(D0 ) = log |A|, while
R(D0 +) = H(X).
V.6 Consider a memoryless source X uniform on X = X̂ = [m] with Hamming distortion: d(x, x̂) =
1{x 6= x̂}. Show that
(
log m − h(D) − D log(m − 1) D ≤ mm−1
R(D) =
0 otherwise.
and that for any distortion level optimal vector quantizer is only taking values ±(1 − 2p) (Hint:
you may find Exercise I.64(b) useful). Compare with the case of X̂ ∈ {±1}, for which we have
shown R(D) = log 2 − h(D/4), D ∈ [0, 2].
V.10 (Product source) Consider two independent stationary memoryless sources X ∈ X and Y ∈ Y
with reproduction alphabets X̂ and Ŷ , distortion measures d1 : X × X̂ → R+ and d2 : Y ×
Ŷ → R+ , and rate-distortion functions RX and RY , respectively. Now consider the stationary
memoryless product source Z = (X, Y) with reproduction alphabet X̂ ×Ŷ and distortion measure
d(z, ẑ) = d1 (x, x̂) + d2 (y, ŷ).
i i
i i
i i
(c) How do you build an optimal lossy compressor for Z using optimal lossy compressors for
X and Y?
V.11 (Compression with output constraints) Compute the rate-distortion function R(D; a, p) of a
Ber(p) source, Hamming distortion under an extra constraint that reconctruction points X̂n
should have average Hamming weight E[wH (X̂n )] ≤ an, where 0 < a, p ≤ 21 . (Hint: Show a
more general result that given two distortion metrics d1 , d2 we have R(D1 , D2 ) = min{I(S; Ŝ) :
E[di (S, Ŝ)] ≤ Di , i ∈ {1, 2}}.)
V.12 Commercial (mono) FM radio modulates a bandlimited (15kHz) audio signal into a radio-
frequency signal of bandwidth 200kHz. This system roughly achieves
SNRaudio ≈ 40 dB + SNRchannel
over the AWGN channel whenever SNRchannel ≳ 12 dB. Thus for the 12 dB channel, we get
that FM radio has distortion of 52 dB. Show that information-theoretic limit is about 160 dB.
Hint: assume that input signal is low-pass filtered to 15kHz white, zero-mean Gaussian and use
the optimal joint source channel code (JSSC) for the given bandwidth expansion ratio and fixed
SNRchannel . Also recall that the SNR of the reconstruction Ŝn expressed in dB is defined as
Pk
j=1 E[Sj ]
2
SNRaudio ≜ 10 log10 Pk .
j=1 E[(Sj − Ŝj ) ]
2
V.13 Consider a memoryless Gaussian source X ∼ N (0, 1), reconstruction alphabet  = {±1} and
quadratic distortion d(a, â) = (a − â)2 . Compute D0 , R(D0 +), Dmax . Then obtain a parametric
formula for R(D).
V.14 (Erokhin’s rate-distortion [155]) Let d(ak , bk ) = 1{ak 6= bk } be a (non-separable) distortion
metric for k-strings on a finite alphabet S = Ŝ . Prove that for any source Sk we have
φSk (ϵ) ≜ min I(Sk ; Ŝk ) ≥ H(Sk ) − ϵk log |S| − h(ϵ) , (V.5)
P[Sk ̸=Ŝk ]≤ϵ
and that the bound is tight only for Sk uniform on S k . Next, suppose that Sk is i.i.d. source. Prove
r
kV(S) − (Q−12(ϵ))2
ϕSk (ϵ) = (1 − ϵ)kH(S) − e + O(log k) ,
2π
where V(S) is the varentropy (10.4). (Hint: Let T = P̂Sk be the empirical distribution (type) of the
realization Sk . Then I(Sk ; Ŝk ) = I(Sk , T; Ŝk ) = I(Sk ; Ŝk |T) + O(log k). Denote ϵT ≜ P[Sk 6= Ŝk |T]
i i
i i
i i
and given ϵT optimize the first term via (V.5). Then optimize the assignment ϵT over all E[ϵT ] ≤ ϵ.
Also use E[Z1{Z > c}] = √12π e−c /2 for Z ∼ N (0, 1). See [254, Lemma 1] for full details).
2
i.i.d.
V.15 Consider a source Sn ∼ Ber( 21 ). Answer the following questions when n is large.
(a) Suppose the goal is to compress Sn into k bits so that one can reconstruct Sn with at most
one bit of error. That is, the decoded version Ŝn satisfies E[dH (Ŝn , Sn )] ≤ 1. Show that this
can be done (if possible, with an explicit algorithm) with k = n − C log n bits for some
constant C. Is it optimal?
(b) Suppose we are required to compress Sn into only 1 bit. Show that one can achieve (if
√
possible, with an explicit algorithm) a reconstruction error E[dH (Ŝn , Sn )] ≤ n2 − C n for
some constant C. Is it optimal?
Warning: We cannot blindly apply the asymptotic rate-distortion theory to show achievability
since here the distortion level changes with n. The converse, however, directly applies.
V.16 Consider a standard Gaussian vector Sn and quadratic distortion metric. We will discuss zero-
rate quantization.
√
(a) Let Smax =√max1≤i≤n Si . Show that E[(Smax − 2 ln n)2 ] → 0 when n → ∞. Show that
E[(Smax − 2 ln n)2 ] → 0 when n → ∞.
(b) Suppose you are given a budget of log2 n bits. Consider the following scheme: Let i∗ denote
the index of the largest coordinate. The compressor √
stores the index i∗ which costs log2 n
bits and the decompressor outputs Ŝ where Ŝi = 2 ln n for i = i∗ and Si = 0 otherwise.
n
Show that distortion in terms of mean-square error satisfies E[kŜn − Sn k22 ] = n − 2 ln n + o(1)
when n → ∞.
(c) Show that for any compressor (using log2 n bits) we must have E[kŜn − Sn k22 ] ≥ n − 2 ln n +
o( 1) .
V.17 (Noisy source-coding; also remote source-coding [126]) Consider the problem of compressing
i.i.d. sequence Xn under separable distortion metric d. Now, however, compressor does not have
direct access to Xn but only to its noisy version Yn obtained over a stationary memoryless channel
i.i.d.
PY|X (i.e. (Xi , Yi ) ∼ PX,Y for a fixed PX,Y and encoder is a map f : Y n → [M]). Show that the
rate-distortion function is
n o
R(D) = min I(Y; X̂) : E[d(X, X̂)] ≤ D, X → Y → X̂ ,
where minimization is over all PX̂|Y . (Hint: define d̃(y, x̂) ≜ E[d(X, x̂)|Y = y].)
i.i.d.
V.18 (Noisy/remote source coding; special case) Let Zn ∼ Ber( 12 ) and Xn = BECδ (Zn ). Compressor
is to encode Xn at rate R so that we can reconstruct Zn with bit-error rate D. Let R(D) denote
the optimal rate.
(a) Suppose that locations of erasures in Xn are provided as a side information to decompressor.
Show that R(δ/2) = δ̄2 (Hint: compressor is very simple).
(b) Surprisingly, the same rate is achievable without knowledge of erasures. Use Exercise V.17
to prove R(D) = H(δ̄/2, δ̄/2, δ) − H(1 − D − δ2 , D − δ2 , δ) for D ∈ [ δ2 , 12 ].
V.19 (Log-loss) Consider the rate-distortion problem where the reconstruction alphabet X̂n = P(X n )
is the space of all probability mass functions on X n . We define two loss functions. The first one
i i
i i
i i
is non-separable (!)
1 1
dnonsep (xn , P̂) = log (V.6)
n P̂(xn )
and the second one is separable:
1X
n
1
dsep (xn , P̂Xn ) = log .
n P̂Xj (xj )
j=1
Show that
D(R) = H(X) − FI (R) .
i i
i i
i i
Q
(Hint: for achievability, restrict reconstruction to P̂Xn = P̂Xi , this makes distortion
additive and then apply Ex. V.17; for converse use tensorization property of FI -curve
from Exercise III.32)
V.21 (a) Let 0 ≺ ∆ Σ be positive definite matrices. For S ∼ N (0, Σ), show that
1 det Σ
inf I(S; Ŝ) = log .
PŜ|S :E[(S−Ŝ)(S−Ŝ)⊤ ]⪯∆ 2 det ∆
1 X + σi2
d
inf I(S; Ŝ) = log
PŜ|S :E[∥S−Ŝ∥22 ]≤D 2 λ
i=1
Pd
where λ > 0 is such that i=1 min{σi2 , λ} = D. This is the counterpart of the solution in
Theorem 20.14.
(Hint: First, using the orthogonal invariance of distortion metric we can assume that
Σ is diagonal. Next, apply the same single-letterization argument for (26.3) and solve
Pd σ2
minP Di =D 12 i=1 log+ Dii .)
V.22 (Shannon lower bound) Let k · k be an arbitrary norm on Rd and r > 0. Let X be a Rd -valued
random vector with a probability density function pX . Denote the rate-distortion function
(Hint: Define an auxiliary backward channel QX|X̂ (dx|x̂) = qs (x − x̂)dx, where qs (w) =
QX|X̂
1
Z(s) exp(−skwkr ). Then I(X; X̂) = EP [log PX ] + D(PX|X̂ kQX|X̂ kPX̂ ).)
i i
i i
i i
and this entropy maximization can be solved following the argument in Example 5.2.
V.23 (Uniform distribution minimizes convex symmetric functional.) Let G be a group acting on a
set X such that each g ∈ G sends x ∈ X to gx ∈ X . Suppose G acts transitively, i.e., for each
x, x′ ∈ X there exists g ∈ G such that gx = x′ . Let g be a random element of G with an invariant
d
distribution, namely hg=g for any h ∈ G. (Such a distribution, known as the Haar measure,
exists for compact topological groups.)
(a) Show that for any x ∈ X , gx has the same law, denoted by Unif(X ), the uniform distribution
on X .
(b) Let f : P(X ) → R be convex and G-invariant, i.e., f(PgX ) = f(PX ) for any X -valued random
variable X and any g ∈ G. Show that minPX ∈P(X ) f(PX ) = f(Unif(X )).
V.24 (Uniform distribution maximizes rate-distortion function.) Continuing the setup of Exer-
cise V.23, let d : X × X → R be a G-invariant distortion function, i.e., d(gx, gx′ ) =
d(x, x′ ) for any g ∈ G. Denote the rate-distortion function of an X -valued X by ϕX (D) =
infP :E[d(X,X̂)]≤D I(X; X̂). Suppose that ϕX (D) < ∞ for all X and all D > 0.
X̂|X
(a) Let ϕ∗X (λ) = supD {λD − ϕX (D)} denote the conjugate of ϕX . Applying Theorem 24.4 and
Fenchel-Moreau’s biconjugation theorem to conclude that ϕX (D) = supλ {λD − ϕ∗X (λ)}.
(b) Show that
As such, for each λ, PX 7→ ϕ∗X (λ) is convex and G-invariant. (Hint: Theorem 5.3.)
(c) Applying Exercise V.23 to conclude that ϕ∗U (λ) ≤ ϕ∗X (λ) for U ∼ Unif(X ) and that
V.25 (Normal tail bound.) Denote the standard normal density and tail probability by φ(x) =
R∞
√1 e−x /2 and Φc (t) =
2
2π t
φ(x)dx. Show that for all t > 0,
t φ(t) −t2 /2
φ(t) ≤ Φ (t) ≤ min
c
,e . (V.8)
1 + t2 t
(Hint: For Φc (t) ≤ e−t /2 apply the Chernoff bound (15.2); for the rest, note that by integration
2
R∞
by parts Φc (t) = φ(t t) − t φ(x2x) dx.)
V.26 (Covering radius in Hamming space) In this exercise we prove (27.9), namely, for any fixed
0 ≤ D ≤ 1, as n → ∞,
i i
i i
i i
(a) Prove the lower bound by invoking the volume bound in Theorem 27.3 and the large-
deviations estimate in Example 15.1.
(b) Prove the upper bound using probabilistic construction and a similar argument to (25.8).
(c) Show that for D ≥ 12 , N(Fn2 , dH , Dn) ≤ 2 – cf. Ex. V.15a.
V.27 (Covering ℓp -ball with ℓq -balls)
(a) For 1 ≤ p < q ≤ ∞, prove the bound (27.18) on the metric entropy of the unit ℓp -ball with
respect to the ℓq -norm (Hint: for small ϵ, apply the volume calculation in (27.15)–(27.16)
and the formula in (27.13); for large ϵ, proceed as in the proof of Theorem 27.7 by applying
the quantization argument and the Gilbert-Varshamov bound of Hamming spheres.)
(b) What happens when p > q?
V.28 In this exercise we prove Dudley’s chaining inequality (27.22). In view of Theorem 27.2, it is
equivalent to show the following version with covering numbers:
Z ∞p
w(Θ) ≲ log N(ϵ)dϵ. (V.9)
0
where Mi ≜ max{hZ, si − si−1 i : si ∈ Ti , si−1 ∈ Ti−1 }. (Hint: For any θ ∈ Θ, let θi denote its
P
nearest neighbor in Ti . Then hZ, θi = hZ, θ0 i + i≥1 hZ, θi − θi−1 i.)
p
c Show that E[Mi ] ≲ ϵi log N(ϵi ) (Hint: kθi − θi−1 k ≤ kθi − θk + kθi−1 − θk ≤ 3ϵi . Then
apply Lemma 27.10 and the bounded convergence theorem.)
d Conclude that
X p
w(Θ) ≲ ϵi log N(ϵi )
i≥0
(b) Let U = {u1 , . . . , uM } and V = {v1 , . . . , vM } be an ϵ-net for the spheres Sm−1 and Sn−1
respectively. Show that
1
kAkop ≤ max hA, uv′ i .
(1 − ϵ)2 u∈U ,v∈V
i i
i i
i i
16
Using the large deviations theory developed by Donsker-Varadhan, the sharp constant can be found to be
2
limϵ→0 ϵ2 ϕ(ϵ) = π8 ; see for example [279, Sec. 6.2].
i i
i i
i i
i i
i i
i i
Part VI
Statistical applications
i i
i i
i i
i i
i i
i i
557
This part gives an exposition on the application of information-theoretic principles and meth-
ods in mathematical statistics; we do so by discussing a selection of topics. To start, Chapter 28
introduces the basic decision-theoretic framework of statistical estimation and the Bayes risk
and the minimax risk as the fundamental limits. Chapter 29 gives an exposition of the classi-
cal large-sample asymptotics for smooth parametric models in fixed dimensions, highlighting the
role of Fisher information introduced in Chapter 2. Notably, we discuss how to deduce classi-
cal lower bounds (Hammersley-Chapman-Robbins, Cramér-Rao, van Trees) from the variational
characterization and the data processing inequality (DPI) of χ2 -divergence in Chapter 7.
Moving into high dimensions, Chapter 30 introduces the mutual information method for sta-
tistical lower bound, based on the DPI for mutual information as well as the theory of capacity
and rate-distortion function from Parts IV and V. This principled approach includes three popular
methods for proving minimax lower bounds (Le Cam, Assouad, and Fano) as special cases, which
are discussed at length in Chapter 31 drawing results from metric entropy in Chapter 27 also.
Complementing the exposition on lower bounds in Chapters 30 and 31, in Chapter 32 we present
three upper bounds on statistical estimation based on metric entropy. These bounds appear strik-
ingly similar but follow from completely different methodologies. Application to nonparametric
density estimation is used as a primary example.
Chapter 33 introduces strong data processing inequalities (SDPI), which are quantitative
strengthening of DPIs in Part I. As applications we show how to apply SDPI to deduce lower
bounds for various estimation problems on graphs or in distributed settings.
i i
i i
i i
In this chapter, we discuss the decision-theoretic framework of statistical estimation and introduce
several important examples. Section 28.1 presents the basic elements of statistical experiment and
statistical estimation. Section 28.3 introduces the Bayes risk (average-case) and the minimax risk
(worst-case) as the respective fundamental limit of statistical estimation in Bayesian and frequen-
tist setting, with the latter being our primary focus in this part. We discuss several version of the
minimax theorem (and prove a simple one) that equates the minimax risk with the worst-case
Bayes risk. Two variants are introduced next that extend a basic statistical experiment to either
large sample size or large dimension: Section 28.4 on independent observations and Section 28.5
on tensorization of experiments. Throughout this part the Gaussian location model (GLM), intro-
duced in Section 28.2, serves as a running example, with different focus at different places (such
as the role of loss functions, parameter spaces, low versus high dimensions, etc). In Section 28.6,
we discuss a key result known as the Anderson’s lemma for determining the exact minimax risk
of (unconstrained) GLM in any dimension for a broad class of loss functions, which provides a
benchmark for various more general techniques introduced in later chapters.
558
i i
i i
i i
transition kernel) PT̂|X , or a channel in the language of Part I. For all practical purposes, we can
write T̂ = T̂(X, U), where U denotes external randomness uniform on [0, 1] and independent of X.
To measure the quality of an estimator T̂, we introduce a loss function ℓ : Y × Ŷ → R such
that ℓ(T, T̂) is the risk of T̂ for estimating T. Since we are dealing with loss (as opposed to reward),
all the negative (converse) results are lower bounds and all the positive (achievable) results are
upper bounds. Note that X is a random variable, so are T̂ and ℓ(T, T̂). Therefore, to make sense of
“minimizing the loss”, we consider the expected risk:
Z
Rθ (T̂) = Eθ [ℓ(T, T̂)] = Pθ (dx)PT̂|X (dt̂|x)ℓ(T(θ), t̂), (28.2)
which we refer to as the risk of T̂ at θ. The subscript in Eθ indicates the distribution with respect
to which the expectation is taken. Note that the expected risk depends on the estimator as well as
the ground truth.
Remark 28.1 We note that the problem of hypothesis testing and inference can be encom-
passed as special cases of the estimation paradigm. As previously discussed in Section 16.4, there
are three formulations for testing:
H0 : θ = θ 0 vs H1 : θ = θ 1 , θ0 6= θ1
H0 : θ = θ 0 vs H1 : θ ∈ Θ 1 , θ0 ∈
/ Θ1
H0 : θ ∈ Θ 0 vs H1 : θ ∈ Θ 1 , Θ0 ∩ Θ1 = ∅.
For each case one can introduce the appropriate parameter space and loss function. For example,
in the last (most general) case, we may take
(
0 θ ∈ Θ0
Θ = Θ0 ∪ Θ1 , T(θ) = , T̂ ∈ {0, 1}
1 θ ∈ Θ1
n o
and use the zero-one loss ℓ(T, T̂) = 1 T 6= T̂ so that the expected risk Rθ (T̂) = Pθ {θ ∈ / ΘT̂ } is
the probability of error.
For the problem of inference, the goal is to output a confidence interval (or region) which covers
the true parameter with high
n probability.
o In this case T̂ is a subset of Θ and we may choose the
loss function ℓ(θ, T̂) = 1 θ ∈/ T̂ + λ · length(T̂) for some λ > 0, in order to balance the coverage
and the size of the confidence interval.
Remark 28.2 (Randomized versus deterministic estimators) Although most of the
estimators used in practice are deterministic, there are a number of reasons to consider randomized
estimators:
i i
i i
i i
560
• For certain formulations, such as the minimizing worst-case risk (minimax approach), deter-
ministic estimators are suboptimal and it is necessary to randomize. On the other hand, if the
objective is to minimize the average risk (Bayes approach), then it does not lose generality to
restrict to deterministic estimators.
• The space of randomized estimators (viewed as Markov kernels) is convex which is the convex
hull of deterministic estimators. This convexification is needed for example for the treatment
of minimax theorems.
P = {N (θ, σ 2 Id ) : θ ∈ Θ}
where Id is the d-dimensional identity matrix and the parameter space Θ ⊂ Rd . Equivalently, we
can express the data as a noisy observation of the unknown vector θ as:
X = θ + Z, Z ∼ N (0, σ 2 Id ).
The case of d = 1 and d > 1 refers to the univariate (scalar) and multivariate (vector) case,
respectively. (Also of interest is the case where θ is a d1 × d2 matrix, which can be vectorized into
a d = d1 d2 -dimensional vector.)
The choice of the parameter space Θ represents our prior knowledges of the unknown parameter
θ, for example,
i i
i i
i i
28.3 Bayes risk, minimax risk, and the minimax theorem 561
By definition, more structure (smaller parameter space) always makes the estimation task easier
(smaller worst-case risk), but not necessarily so in terms of computation.
For estimating θ itself (denoising), it is customary to use a loss function defined by certain
P p 1
p for some 1 ≤ p ≤ ∞ and α > 0, where kθkp ≜ (
norms, e.g., ℓ(θ, θ̂) = kθ − θ̂kα |θi | ) p , with
p = α = 2 corresponding to the commonly used quadratic loss (squared error). Some well-known
estimators include the Maximum Likelihood Estimator (MLE)
θ̂ML = X (28.3)
and the James-Stein estimator based on shrinkage
(d − 2)σ 2
θ̂JS = 1 − X. (28.4)
kXk22
The choice of the estimator depends on both the objective and the parameter space. For instance,
if θ is known to be sparse, it makes sense to set the smaller entries in the observed X to zero
(thresholding) in order to better denoise θ (cf. Section 30.2).
In addition to estimating the vector θ itself, it is also of interest to estimate certain functionals
T(θ) thereof, e.g., T(θ) = kθkp , max{θ1 , . . . , θd }, or eigenvalues in the matrix case. In addition,
the hypothesis testing problem in the GLM has been well-studied. For example, one can consider
detecting the presence of a signal by testing H0 : θ = 0 against H1 : kθk ≥ ϵ, or testing weak signal
H0 : kθk ≤ ϵ0 versus strong signal H1 : kθk ≥ ϵ1 , with or without further structural assumptions
on θ. We refer the reader to the monograph [225] devoted to these problems.
i i
i i
i i
562
Given a prior π, its Bayes risk is the minimal average risk, namely
An estimator θ̂∗ is called a Bayes estimator if it attains the Bayes risk, namely, R∗π = Eθ∼π [Rθ (θ̂∗ )].
Remark 28.3 Bayes estimator is always deterministic, a fact that holds for any loss function.
To see this, note that for any randomized estimator, say θ̂ = θ̂(X, U), where U is some external
randomness independent of X and θ, its risk is lower bounded by
Rπ (θ̂) = Eθ,X,U [ℓ(θ, θ̂(X, U))] = EU [Rπ (θ̂(·, U))] ≥ inf Rπ (θ̂(·, u)).
u
Note that for any u, θ̂(·, u) is a deterministic estimator. This shows that we can find a deterministic
estimator whose average risk is no worse than that of the randomized estimator.
An alternative explanation of this fact is the following: Note that the average risk Rπ (θ̂) defined
in (28.5) is an affine function of the randomized estimator (understood as a Markov kernel Pθ̂|X )
is affine, whose minimum is achieved at the extremal points. In this case the extremal points of
Markov kernels are simply delta measures, which corresponds to deterministic estimators.
In certain settings the Bayes estimator can be found explicitly. Consider the problem of estimat-
ing θ ∈ Rd drawn from a prior π. Under the quadratic loss ℓ(θ, θ̂) = kθ̂ − θk22 , the Bayes estimator
is the conditional mean θ̂(X) = E[θ|X] and the Bayes risk is the minimum mean-square error
(MMSE), which we previously encountered in Section 3.7* in the context of I-MMSE relationship:
i i
i i
i i
28.3 Bayes risk, minimax risk, and the minimax theorem 563
As a concrete example, let us consider the Gaussian Location Model in Section 28.2 with a
Gaussian prior.
Example 28.1 (Bayes risk in GLM) Consider the scalar case, where X = θ + Z and Z ∼
N (0, σ 2 ) is independent of θ. Consider a Gaussian prior θ ∼ π = N (0, s). One can verify that the
sσ 2
posterior distribution Pθ|X=x is N ( s+σ 2 x, s+σ 2 ). As such, the Bayes estimator is E[θ|X] = s+σ 2 X
s s
sσ 2
R∗π = d. (28.7)
s + σ2
If there exists θ̂ s.t. supθ∈Θ Rθ (θ̂) = R∗ , then the estimator θ̂ is minimax (minimax optimal).
Finding the value of the minimax risk R∗ entails proving two things, namely,
• a minimax upper bound, by exhibiting an estimator θ̂∗ such that Rθ (θ̂∗ ) ≤ R∗ + ϵ for all θ ∈ Θ;
• a minimax lower bound, by proving that for any estimator θ̂, there exists some θ ∈ Θ, such that
Rθ (θ̂) ≥ R∗ − ϵ,
where ϵ > 0 is arbitrary. This task is frequently difficult especially in high dimensions. Instead of
the exact minimax risk, it is often useful to find a constant-factor approximation Ψ, which we call
minimax rate, such that
R∗ Ψ, (28.9)
that is, cΨ ≤ R∗ ≤ CΨ for some universal constants c, C ≥ 0. Establishing Ψ is the minimax rate
still entails proving the minimax upper and lower bounds, albeit within multiplicative constant
factors.
In practice, minimax lower bounds are rarely established according to the original definition.
The next result shows that the Bayes risk is always lower than the minimax risk. Throughout
this book, all lower bound techniques essentially boil down to evaluating the Bayes risk with a
sagaciously chosen prior.
i i
i i
i i
564
Theorem 28.1 Let P(Θ) denote the collection of probability distributions on Θ. Then
(If the supremum is attained for some prior, we say it is least favorable.)
1 “max ≥ mean”: For any θ̂, Rπ (θ̂) = Eθ∼π Rθ (θ̂) ≤ supθ∈Θ Rθ (θ̂). Taking the infimum over θ̂
completes the proof;
2 “min max ≥ max min”:
R∗ = inf sup Rθ (θ̂) = inf sup Rπ (θ̂) ≥ sup inf Rπ (θ̂) = sup R∗π ,
θ̂ θ∈Θ θ̂ π ∈P(Θ) π ∈P(Θ) θ̂ π
where the inequality follows from the generic fact that minx maxy f(x, y) ≥ maxy minx f(x, y).
Remark 28.4 Unlike Bayes estimators which, as shown in Remark 28.3, are always deter-
ministic, to minimize the worst-case risk it is sometimes necessary to randomize for example in
the context of hypotheses testing (Chapter 14). Specifically, consider a trivial experiment where
θ ∈ {0, 1} and nX is absent,
o so that we are forced to guess the value of θ under the zero-one
loss ℓ(θ, θ̂) = 1 θ 6= θ̂ . It is clear that in this case the minimax risk is 12 , achieved by random
guessing θ̂ ∼ Ber( 21 ) but not by any deterministic θ̂.
As an application of Theorem 28.1, let us determine the minimax risk of the Gaussian location
model under the quadratic loss function.
Example 28.2 (Minimax quadratic risk of GLM) Consider the Gaussian location model
without structural assumptions, where X ∼ N (θ, σ 2 Id ) with θ ∈ Rd . We show that
By scaling, it suffices to consider σ = 1. For the upper bound, we consider θ̂ML = X which
achieves Rθ (θ̂ML ) = d for all θ. To get a matching minimax lower bound, we consider the prior
θ ∼ N (0, s). Using the Bayes risk previously computed in (28.6), we have R∗ ≥ R∗π = s+ sd
1.
∗
Sending s → ∞ yields R ≥ d.
i i
i i
i i
28.3 Bayes risk, minimax risk, and the minimax theorem 565
3.0
2.8
2.6
2.4
2.2
2 4 6 8
Figure 28.2 Risk of the James-Stein estimator (28.4) in dimension d = 3 and σ = 1 as a function of kθk.
For most of the statistical models, Theorem 28.1 in fact holds with equality; such a result is
known as a minimax theorem. Before discussing this important topic, here is an example where
minimax risk is strictly bigger than the worst-case Bayes risk.
n o
Example 28.3 Let θ, θ̂ ∈ N ≜ {1, 2, ...} and ℓ(θ, θ̂) = 1 θ̂ < θ , i.e., the statistician loses
one dollar if the Nature’s choice exceeds the statistician’s guess and loses nothing if otherwise.
Consider the extreme case of blind guessing (i.e., no data is available, say, X = 0). Then for any
θ̂ possibly randomized, we have Rθ (θ̂) = P(θ̂ < θ). Thus R∗ ≥ limθ→∞ P(θ̂ < θ) = 1, which is
clearly achievable. On the other hand, for any prior π on N, Rπ (θ̂) = P(θ̂ < θ), which vanishes as
θ̂ → ∞. Therefore, we have R∗π = 0. Therefore in this case R∗ = 1 > R∗Bayes = 0.
As an exercise, one can show that the minimax quadratic risk of the GLM X ∼ N (θ, 1) with
parameter space θ ≥ 0 is the same as the unconstrained case. (This might be a bit surprising
because the thresholded estimator X+ = max(X, 0) achieves a better risk pointwise at every θ ≥ 0;
nevertheless, just like the James-Stein estimator (cf. Figure 28.2), in the worst case the gain is
asymptotically diminishing.)
R∗ ≥ R∗Bayes .
This result can be interpreted from an optimization perspective. More precisely, R∗ is the value
of a convex optimization problem (primal) and R∗Bayes is precisely the value of its dual program.
Thus the inequality (28.10) is simply weak duality. If strong duality holds, then (28.10) is in fact
an equality, in which case the minimax theorem holds.
For simplicity, we consider the case where Θ is a finite set. Then
i i
i i
i i
566
This is a convex optimization problem. Indeed, Pθ̂|X 7→ Eθ [ℓ(θ, θ̂)] is affine and the pointwise
supremum of affine functions is convex. To write down its dual problem, first let us rewrite (28.12)
in an augmented form
R∗ = min t (28.13)
Pθ̂|X ,t
Let π θ ≥ 0 denote the Lagrange multiplier (dual variable) for each inequality constraint. The
Lagrangian of (28.13) is
!
X X X
L(Pθ̂|X , t, π ) = t + π θ Eθ [ℓ(θ, θ̂)] − t = 1 − πθ t + π θ Eθ [ℓ(θ, θ̂)].
θ∈Θ θ∈Θ θ∈Θ
P
By definition, we have R∗ ≥ mint,Pθ̂|X L(θ̂, t, π ). Note that unless θ∈Θ π θ = 1, mint∈R L(θ̂, t, π )
is −∞. Thus π = (π θ : θ ∈ Θ) must be a probability measure and the dual problem is
Hence, R∗ ≥ R∗Bayes .
In summary, the minimax risk and the worst-case Bayes risk are related by convex duality,
where the primal variables are (randomized) estimators and the dual variables are priors. This
view can in fact be operationalized. For example, [238, 346] showed that for certain problems
dualizing Le Cam’s two-point lower bound (Theorem 31.1) leads to optimal minimax upper bound;
see Exercise VI.17.
This result shows that for virtually all problems encountered in practice, the minimax risk coin-
cides with the least favorable Bayes risk. At the heart of any minimax theorem, there is an
i i
i i
i i
application of the separating hyperplane theorem. Below we give a proof of a special case
illustrating this type of argument.
Proof. The first case directly follows from the duality interpretation in Section 28.3.3 and the
fact that strong duality holds for finite-dimensional linear programming (see for example [376,
Sec. 7.4].
For the second case, we start by showing that if R∗ = ∞, then R∗Bayes = ∞. To see this, consider
the uniform prior π on Θ. Then for any estimator θ̂, there exists θ ∈ Θ such that R(θ, θ̂) = ∞.
Then Rπ (θ̂) ≥ |Θ|
1
R(θ, θ̂) = ∞.
Next we assume that R∗ < ∞. Then R∗ ∈ R since ℓ is bounded from below (say, by a) by
assumption. Given an estimator θ̂, denote its risk vector R(θ̂) = (Rθ (θ̂))θ∈Θ . Then its average risk
P
with respect to a prior π is given by the inner product hR(θ̂), π i = θ∈Θ π θ Rθ (θ̂). Define
S = {R(θ̂) ∈ RΘ : θ̂ is a randomized estimator} = set of all possible risk vectors,
T = {t ∈ RΘ : tθ < R∗ , θ ∈ Θ}.
Note that both S and T are convex (why?) subsets of Euclidean space RΘ and S∩T = ∅ by definition
of R∗ . By the separation hyperplane theorem, there exists a non-zero π ∈ RΘ and c ∈ R, such
that infs∈S hπ , si ≥ c ≥ supt∈T hπ , ti. Obviously, π must be componentwise positive, for otherwise
supt∈T hπ , ti = ∞. Therefore by normalization we may assume that π is a probability vector, i.e.,
a prior on Θ. Then R∗Bayes ≥ R∗π = infs∈S hπ , si ≥ supt∈T hπ , ti ≥ R∗ , completing the proof.
Clearly, n 7→ R∗n (Θ) is non-increasing since we can always discard the extra observations.
Typically, when Θ is a fixed subset of Rd , R∗n (Θ) vanishes as n → ∞. Thus a natural question is
i i
i i
i i
568
at what rate R∗n converges to zero. Equivalently, one can consider the sample complexity, namely,
the minimum sample size to attain a prescribed error ϵ even in the worst case:
In the classical large-sample asymptotics (Chapter 29), the rate of convergence for the quadratic
risk is usually Θ( 1n ), which is commonly referred to as the “parametric rate“. In comparison, in this
book we focus on understanding the dependency on the dimension and other structural parameters
nonasymptotically.
As a concrete example, let us revisit the GLM in Section 28.2 with sample size n, in which case
i.i.d.
we observe X = (X1 , . . . , Xn ) ∼ N (0, σ 2 Id ), θ ∈ Rd . In this case, the minimax quadratic risk is1
dσ 2
R∗n = . (28.17)
n
To see this, note that in this case X̄ = n1 (X1 + . . . + Xn ) is a sufficient statistic (cf. Section 3.5) of X
2
for θ. Therefore the model reduces to X̄ ∼ N (θ, σn Id ) and (28.17) follows from the minimax risk
(28.11) for a single observation.
2
From (28.17), we conclude that the sample complexity is n∗ (ϵ) = d dσϵ e, which grows linearly
with the dimension d. This is the common wisdom that “sample complexity scales proportionally
to the number of parameters”, also known as “counting the degrees of freedom”. Indeed in high
dimensions we typically expect the sample complexity to grow with the ambient dimension; how-
ever, the exact dependency need not be linear as it depends on the loss function and the objective
of estimation. For example, consider the matrix case θ ∈ Rd×d with n independent observations
in Gaussian noise. Let ϵ be a small constant. Then we have
2
• For quadratic loss, namely, kθ − θ̂k2F , we have R∗n = dn and hence n∗ (ϵ) = Θ(d2 );
• If the loss function is kθ − θ̂k2op , then R∗n dn and hence n∗ (ϵ) = Θ(d) (Example 28.4);
• As opposed to θ itself, suppose we are content with p estimating only the scalar functional θmax =
∗
max{θ1 , . . . , θd } up to accuracy ϵ, then n (ϵ) = Θ( log d) (Exercise VI.14).
In the last two examples, the sample complexity scales sublinearly with the dimension.
1
See Exercise VI.11 for an extension of this result to nonparametric location models.
i i
i i
i i
X
d
ℓ(θ, θ̂) ≜ ℓi (θi , θ̂i ), ∀θ, θ̂ ∈ Θ.
i=1
In this model, the observation X = (X1 , . . . , Xd ) consists of independent (not identically dis-
ind
tributed) Xi ∼ Pθi and the loss function takes a separable form, which is reminiscent of separable
distortion function in (24.8). This should be contrasted with the multiple-observation model in
(28.14), in which n iid observations drawn from the same distribution are given.
The minimax risk of the tensorized experiment is related to the minimax risk R∗ (Pi ) and worst-
case Bayes risks R∗Bayes (Pi ) ≜ supπ i ∈P(Θi ) Rπ i (Pi ) of each individual experiment as follows:
Consequently, if minimax theorem holds for each experiment, i.e., R∗ (Pi ) = R∗Bayes (Pi ), then it
also holds for the product experiment and, in particular,
X
d
∗
R (P) = R∗ (Pi ). (28.19)
i=1
Proof. The right inequality of (28.18) simply follows by separately estimating θi on the basis
of Xi , namely, θ̂ = (θ̂1 , . . . , θ̂d ), where θ̂i depends only on Xi . For the left inequality, consider
Qd
a product prior π = i=1 π i , under which θi ’s are independent and so are Xi ’s. Consider any
randomized estimator θ̂i = θ̂i (X, Ui ) of θi based on X, where Ui is some auxiliary randomness
independent of X. We can rewrite it as θ̂i = θ̂i (Xi , Ũi ), where Ũi = (X\i , Ui ) ⊥ ⊥ Xi . Thus θ̂i can
be viewed as it a randomized estimator based on Xi alone and its the average risk must satisfy
Rπ i (θ̂i ) = E[ℓ(θi , θ̂i )] ≥ R∗π i . Summing over i and taking the suprema over priors π i ’s yields the
left inequality of (28.18).
As an example, we note that the unstructured d-dimensional GLM {N (θ, σ 2 Id ) : θ ∈ Rd } with
quadratic loss is simply the d-fold tensor product of the one-dimensional GLM. Since minimax
theorem holds for the GLM (cf. Section 28.3.4), Theorem 28.3 shows the minimax risks sum up to
σ 2 d, which agrees with Example 28.2. In general, however, it is possible that the minimax risk of
the tensorized experiment is less than the sum of individual minimax risks and the right inequality
of (28.19) can be strict. This might appear surprising since Xi only carries information about θi
and it makes sense intuitively to estimate θi based solely on Xi . Nevertheless, the following is a
counterexample:
Remark 28.6 Consider X = θZ, where θ n∈ N, Zo∼ Ber( 21 ). The estimator θ̂ takes values in
N as well and the loss function is ℓ(θ, θ̂) = 1 θ̂ < θ , i.e., whoever guesses the greater number
wins. The minimax risk for this experiment is equal to P [Z = 0] = 21 . To see this, note that if
Z = 0, then all information about θ is erased. Therefore for any (randomized) estimator Pθ̂|X , the
risk is lower bounded by Rθ (θ̂) = P[θ̂ < θ] ≥ P[θ̂ < θ, Z = 0] = 21 P[θ̂ < θ|X = 0]. Therefore
i i
i i
i i
570
sending θ → ∞ yields supθ Rθ (θ̂) ≥ 12 . This is achievable by θ̂ = X. Clearly, this is a case where
minimax theorem does not hold, which is very similar to the previous Example 28.3.
nNext consider
o the tensoroproduct of two copies of this experiment with loss function ℓ(θ, θ̂) =
n
1 θ̂1 < θ1 + 1 θ̂2 < θ2 . We show that the minimax risk is strictly less than one. For i = 1, 2,
i.i.d.
let Xi = θi Zi , where Z1 , Z2 ∼ Ber( 21 ). Consider the following estimator
(
X1 ∨ X2 X1 > 0 or X2 > 0
θ̂1 = θ̂2 =
1 otherwise.
Proof. Note that N (θ, Id ) is a product distribution and the loss function is separable: kθ − θ̂kqq =
Pd
i=1 |θi − θ̂i | . Thus the experiment is a d-fold tensor product of the one-dimensional version.
q
By Theorem 28.3, it suffices to consider d = 1. The upper bound is achieved by the sample mean
Pn
X = 1n i=1 Xi ∼ N (θ, 1n ), which is a sufficient statistic.
For the lower bound, following Example 28.2, consider a Gaussian prior θ ∼ π = N (0, s).
Then the posterior distribution is also Gaussian: Pθ|X = N (E[θ|X], 1+ssn ). The following lemma
shows that the Bayes estimator is simply the conditional mean:
Lemma 28.5 Let Z ∼ N (0, 1). Then miny∈R E[|y + Z|q ] = E[|Z|q ].
Thus the Bayes risk is
s q/2
R∗π = E[|θ − E[θ|X]|q ] = E | Z| q .
1 + sn
Sending s → ∞ proves the matching lower bound.
where the inequality follows from the simple observation that for any a > 0, P [|y + Z| ≤ a] ≤
P [|Z| ≤ a], due to the symmetry and unimodality of the normal density.
i i
i i
i i
28.6 Log-concavity, Anderson’s lemma and exact minimax risk in GLM 571
Theorem 28.7 Consider the d-dimensional GLM where X1 , . . . , Xn ∼ N (0, Id ) are observed.
Let the loss function be ℓ(θ, θ̂) = ρ(θ − θ̂), where ρ : Rd → R+ is bowl-shaped and lower-
semicontinuous. Then the minimax risk is given by
Z
R∗ ≜ inf sup Eθ [ρ(θ − θ̂)] = Eρ √ , Z ∼ N (0, Id ).
θ̂ θ∈Rd n
Pn
Furthermore, the upper bound is attained by X̄ = 1n i=1 Xi .
Corollary 28.8 Let ρ(·) = k · kq for some q > 0, where k · k is an arbitrary norm on Rd . Then
EkZkq
R∗ = . (28.20)
nq/2
R∗ √d .
n
We can also phrase the result of Corollary 28.8 in terms of the sample complexity n∗ (ϵ) as
defined in (28.16). For example, for q = 2 we have n∗ (ϵ) = E[kZk2 ]/ϵ . The above examples
2
Another example is the multinomial model with squared error; cf. Exercises VI.7 and VI.9.
i i
i i
i i
572
show that the scaling of n∗ (ϵ) with dimension depends on the loss function and the “rule of thumb”
that the sampling complexity is proportional to the number of parameters need not always hold.
Finally, for the sake of high-probability (as opposed to average) risk bound, consider ρ(θ − θ̂) =
1{kθ − θ̂k > ϵ}, which is lower semicontinuous and bowl-shaped. Then the exact expression
√
R∗ = P kZk ≥ ϵ n . This result is stronger since the sample mean is optimal simultaneously for
all ϵ, so that integrating over ϵ recovers (28.20).
Proof of Theorem 28.7. We only prove the lower bound. We bound the minimax risk R∗ from
below by the Bayes risk R∗π with the prior π = N (0, sId ):
R∗ ≥ R∗π = inf Eπ [ρ(θ − θ̂)]
θ̂
= E inf E[ρ(θ − θ̂)|X]
θ̂
( a)
= E[E[ρ(θ − E[θ|X])|X]]
r
(b) s
=E ρ Z .
1 + sn
where (a) follows from the crucial Lemma 28.9 below; (b) uses the fact that θ − E[θ|X] ∼
N (0, 1+ssn Id ) under the Gaussian prior. Since ρ(·) is lower semicontinuous, sending s → ∞ and
applying Fatou’s lemma, we obtain the matching lower bound:
r
s Z
R∗ ≥ lim E ρ Z ≥E ρ √ .
s→∞ 1 + sn n
The following lemma establishes the conditional mean as the Bayes estimator under the
Gaussian prior for all bowl-shaped losses, extending the previous Lemma 28.5 in one dimension:
In order to prove Lemma 28.9, it suffices to consider ρ being indicator functions. This is done
in the next lemma, which we prove later.
Lemma 28.10 Let K ∈ Rd be a symmetric convex set and X ∼ N (0, Σ). Then maxy∈Rd P(X +
y ∈ K) = P(X ∈ K).
Proof of Lemma 28.9. Denote the sublevel set set Kc = {x ∈ Rd : ρ(x) ≤ c}. Since ρ is bowl-
shaped, Kc is convex and symmetric, which satisfies the conditions of Lemma 28.10. So,
Z ∞
E[ρ(y + x)] = P(ρ(y + x) > c)dc,
Z ∞
0
= (1 − P(y + x ∈ Kc ))dc,
0
i i
i i
i i
28.6 Log-concavity, Anderson’s lemma and exact minimax risk in GLM 573
Z ∞
≥ (1 − P(x ∈ Kc ))dc,
Z 0
∞
= P(ρ(x) ≥ c)dc,
0
= E[ρ(x)].
Before going into the proof of Lemma 28.10, we need the following definition.
The following result, due to Prékopa [350], characterizes the log-concavity of measures in terms
of that of its density function; see also [361] (or [179, Theorem 4.2]) for a proof.
Theorem 28.12 Suppose that μ has a density f with respect to the Lebesgue measure on Rd .
Then μ is log-concave if and only if f is log-concave.
• Lebesgue measure: Let μ = vol be the Lebesgue measure on Rd , which satisfies Theorem 28.12
(f ≡ 1). Then
• Gaussian distribution: Let μ = N (0, Σ), with a log-concave density f since log f(x) =
− p2 log(2π ) − 12 log det(Σ) − 21 x⊤ Σ−1 x is concave in x.
3
Applying (28.21) to A′ = vol(A)−1/d A, B′ = vol(B)−1/d B (both of which have unit volume), and
λ = vol(A)1/d /(vol(A)1/d + vol(B)1/d ) yields (28.22).
i i
i i
i i
574
i i
i i
i i
In this chapter we give an overview of the classical large-sample theory in the setting of iid obser-
vations in Section 28.4 focusing again on the minimax risk (28.15). These results pertain to smooth
parametric models in fixed dimensions, with the sole asymptotics being the sample size going to
infinity. The main result is that, under suitable conditions, the minimax squared error of estimating
i.i.d.
θ based on X1 , . . . , Xn ∼ Pθ satisfies
1 + o(1)
inf sup Eθ [kθ̂ − θk22 ] = sup TrJ− 1
F (θ). (29.1)
θ̂ θ∈Θ n θ∈Θ
where JF (θ) is the Fisher information matrix introduced in (2.32) in Chapter 2. This is asymptotic
characterization of the minimax risk with sharp constant. In later chapters, we will proceed to high
dimensions where such precise results are difficult and rare.
Throughout this chapter, we focus on the quadratic risk and assume that Θ is an open set of
the Euclidean space Rd . While reading this chapter, a reader is advised to consult Exercise VI.7 to
understand the minimax risk in a simple setting of estimating parameter of a Bernoulli model.
575
i i
i i
i i
576
Theorem 29.1 (HCR lower bound) The quadratic loss of any estimator θ̂ at θ ∈ Θ ⊂ Rd
satisfies
(Eθ [θ̂] − Eθ′ [θ̂])2
Rθ (θ̂) = Eθ [(θ̂ − θ)2 ] ≥ Varθ (θ̂) ≥ sup . (29.2)
θ ′ ̸=θ χ2 (Pθ′ kPθ )
• Note that the HCR lower bound Theorem 29.1 is based on the χ2 -divergence. For a version
based on Hellinger distance which also implies the CR lower bound, see Exercise VI.5.
• Both the HCR and the CR lower bounds extend to the multivariate case as follows. Let θ̂ be
an unbiased estimator of θ ∈ Θ ⊂ Rd . Assume that its covariance matrix Covθ (θ̂) = Eθ [(θ̂ −
θ)(θ̂ − θ)⊤ ] is positive definite. Fix a ∈ Rd . Applying Theorem 29.1 to ha, θ̂i, we get
h a, θ − θ ′ i 2
χ2 (Pθ kPθ′ ) ≥ .
a⊤ Covθ (θ̂)a
Optimizing over a yields1
χ2 (Pθ kPθ′ ) ≥ (θ − θ′ )⊤ Covθ (θ̂)−1 (θ − θ′ ).
Sending θ′ → θ and applying the asymptotic expansion χ2 (Pθ kPθ′ ) = (θ − θ′ )⊤ JF (θ)(θ −
θ′ )(1 + o(1)) (see Remark 7.13), we get the multivariate version of CR lower bound:
Covθ (θ̂) J− 1
F (θ). (29.5)
1 ⟨x,y⟩2
For Σ 0, supx̸=0 x⊤ Σx
= y⊤ Σ−1 y, attained at x = Σ−1 y.
i i
i i
i i
• For a sample of n iid observations, by the additivity property (2.36), the Fisher information
matrix is equal to nJF (θ). Taking the trace on both sides, we conclude that the squared error of
any unbiased estimators satisfies
1
Tr(J−
Eθ [kθ̂ − θk22 ] ≥
1
F (θ)).
n
This is already very close to (29.1), except for the fundamental restriction of unbiased
estimators.
Similar to (29.3), applying data processing and variational representation of χ2 -divergence yields
(EP [θ − θ̂] − EQ [θ − θ̂])2
χ2 (PθX kQθX ) ≥ χ2 (Pθθ̂ kQθθ̂ ) ≥ χ2 (Pθ−θ̂ kQθ−θ̂ ) ≥ .
VarQ (θ̂ − θ)
Note that by design, PX = QX and thus EP [θ̂] = EQ [θ̂]; on the other hand, EP [θ] = EQ [θ] + δ .
Furthermore, Eπ [(θ̂ − θ)2 ] ≥ VarQ (θ̂ − θ). Since this applies to any estimators, we conclude that
the Bayes risk R∗π (and hence the minimax risk) satisfies
δ2
R∗π ≜ inf Eπ [(θ̂ − θ)2 ] ≥ sup , (29.6)
θ̂ δ̸=0 χ2 (PXθ kQXθ )
which is referred to as the Bayesian HCR lower bound in comparison with (29.2).
Similar to the deduction of CR lower bound from the HCR, we can further lower bound
this supremum by evaluating the small-δ limit. First note the following chain rule for the
χ2 -divergence:
" 2 #
dPθ
χ (PXθ kQXθ ) = χ (Pθ kQθ ) + EQ χ (PX|θ kQX|θ ) ·
2 2 2
.
dQθ
i i
i i
i i
578
Under suitable regularity conditions in Theorem 7.22, again applying the local expansion of χ2 -
divergence yields
R π ′2
• χ2 (Pθ kQθ ) = χ2 (Tδ π kπ ) = (J(π ) + o(1))δ 2 , where J(π ) ≜ π is the Fisher information of
the prior (see Example 2.7);
• χ2 (PX|θ kQX|θ ) = [JF (θ) + o(1)]δ 2 .
δ2 δ2 s
R∗π ≥ sup δ 2 (n+ 1s )
= lim
δ 2 (n+ 1s )
= .
δ̸=0 e −1 δ→0 e −1 sn + 1
In view of the Bayes risk found in Example 28.1, we see that in this case the Bayesian HCR and
Bayesian Cramér-Rao lower bounds are exact.
Theorem 29.2 (BCR lower bound) Let π be a differentiable prior density on the interval
[θ0 , θ1 ] such that π (θ0 ) = π (θ1 ) = 0 and
Z θ1
π ′ (θ)2
J( π ) ≜ dθ < ∞. (29.8)
θ0 π (θ)
i i
i i
i i
Let Pθ (dx) = pθ (x) μ(dx), where the density pθ (x) is differentiable in θ for μ-almost every x.
Assume that for π-almost every θ,
Z
μ(dx)∂θ pθ (x) = 0. (29.9)
Then the Bayes quadratic risk R∗π ≜ infθ̂ E[(θ − θ̂)2 ] satisfies
1
R∗π ≥ . (29.10)
Eθ∼π [JF (θ)] + J(π )
Proof. In view of Remark 28.3, it loses no generality to assume that the estimator θ̂ = θ̂(X) is
deterministic. For each x, integration by parts yields
Z θ1 Z θ1
dθ(θ̂(x) − θ)∂θ (pθ (x)π (θ)) = pθ (x)π (θ)dθ.
θ0 θ0
Then
R∗π ≜ inf Eπ [kθ̂ − θk22 ] ≥ Tr((Eθ∼π [JF (θ)] + J(π ))−1 ), (29.12)
θ̂
where the Fisher information matrices are given by JF (θ) = Eθ [∇θ log pθ (X)∇θ log pθ (X)⊤ ] and
J(π ) = diag(J(π 1 ), . . . , J(π d )).
i i
i i
i i
580
where ek denotes the kth standard basis. Applying Cauchy-Schwarz and optimizing over u yield
h u , ek i 2
E[(θ̂k (X) − θk )2 ] ≥ sup = Σ− 1
kk ,
u̸=0 u⊤ Σ u
where Σ ≡ E[∇ log(pθ (X)π (θ))∇ log(pθ (X)π (θ))⊤ ] = Eθ∼π [JF (θ)] + J(π ), thanks to (29.11).
Summing over k completes the proof of (29.12).
• The above versions of the BCR bound assume a prior density that vanishes at the boundary.
If we choose a uniform prior, the same derivation leads to a similar lower bound known as
the Chernoff-Rubin-Stein inequality (see Ex. VI.4), which also suffices for proving the optimal
minimax lower bound in (29.1).
• For the purpose of the lower bound, it is advantageous to choose a prior density with the mini-
mum Fisher information. The optimal density with a compact support is known to be a squared
cosine density [219, 426]:
min J( g ) = π 2 ,
g on [−1,1]
attained by
πu
g(u) = cos2 . (29.13)
2
• Suppose the goal is to estimate a smooth functional T(θ) of the unknown parameter θ, where
T : Rd → Rs is differentiable with ∇T(θ) = ( ∂ T∂θi (θ)
j
) its s × d Jacobian matrix. Then under the
same condition of Theorem 29.3, we have the following Bayesian Cramér-Rao lower bound for
functional estimation:
As a consequence of the BCR bound, we prove the lower bound part for the asymptotic minimax
risk in (29.1).
Theorem 29.4 Assume that θ 7→ JF (θ) is continuous. Denote the minimax squared error
i.i.d.
R∗n ≜ infθ̂ supθ∈Θ Eθ [kθ̂ − θk22 ], where Eθ is taken over X1 , . . . , Xn ∼ Pθ . Then as n → ∞,
1 + o( 1)
R∗n ≥ sup TrJ− 1
F (θ). (29.15)
n θ∈Θ
Proof. Fix θ ∈ Θ. Then for all sufficiently small δ , B∞ (θ, δ) = θ + [−δ, δ]d ⊂ Θ. Let π i (θi ) =
1 θ−θi Qd
δ g( δ ), where g is the prior density in (29.13). Then the product distribution π = i=1 π i
satisfies the assumption of Theorem 29.3. By the scaling rule of Fisher information (see (2.35)),
2 2
J(π i ) = δ12 J(g) = δπ2 . Thus J(π ) = δπ2 Id .
i i
i i
i i
It is known that (see [68, Theorem 2, Appendix V]) the continuity of θ 7→ JF (θ) implies (29.11).
So we are ready to apply the BCR bound in Theorem 29.3. Lower bounding the minimax by the
Bayes risk and also applying the additivity property (2.36) of Fisher information, we obtain
− 1 !
∗ 1 π2
Rn ≥ · Tr Eθ∼π [JF (θ)] + 2 Id .
n nδ
Finally, choosing δ = n−1/4 and applying the continuity of JF (θ) in θ, the desired (29.15) follows.
Similarly, for estimating a smooth functional T(θ), applying (29.14) with the same argument
yields
1 + o( 1)
inf sup Eθ [kT̂ − T(θ)k22 ] ≥ sup Tr(∇T(θ)J− 1 ⊤
F (θ)∇T(θ) ). (29.16)
T̂ θ∈Θ n θ∈Θ
where
X
n
Lθ (Xn ) = log pθ (Xi )
i=1
is the total log-likelihood and pθ (x) = dP dμ (x) is the density of Pθ with respect to some com-
θ
mon dominating measure μ. For discrete distribution Pθ , the MLE can also be written as the KL
projection2 of the empirical distribution P̂n to the model class: θ̂MLE ∈ arg minθ∈Θ D(P̂n kPθ ).
2
Note that this is the reverse of the information projection studied in Section 15.3.
i i
i i
i i
582
The main intuition why MLE works is as follows. Assume that the model is identifiable, namely,
θ 7→ Pθ is injective. Then for any θ 6= θ0 , we have by positivity of the KL divergence (Theorem 2.3)
" n #
X pθ (Xi )
Eθ0 [Lθ − Lθ0 ] = Eθ0 log = −nD(Pθ0 ||Pθ ) < 0.
pθ0 (Xi )
i=1
In other words, Lθ − Lθ0 is an iid sum with a negative mean and thus negative with high probability
for large n. From here the consistency of MLE follows upon assuming appropriate regularity
conditions, among which is Wald’s integrability condition Eθ0 [sup∥θ−θ0 ∥≤ϵ log ppθθ (X)] < ∞ [449,
0
454].
Assuming more conditions one can obtain the asymptotic normality and efficiency of the
MLE. This follows from the local quadratic approximation of the log-likelihood function. Define
V(θ, x) ≜ ∇θ log pθ (x) (score) and H(θ, x) ≜ ∇2θ log pθ (x). By Taylor expansion,
! !
Xn
1 Xn
⊤ ⊤
Lθ =Lθ0 + (θ − θ0 ) V(θ0 , Xi ) + (θ − θ0 ) H(θ0 , Xi ) (θ − θ0 )
2
i=1 i=1
+ o(n(θ − θ0 ) ).
2
(29.18)
Recall from Section 2.6.2* that, under suitable regularity conditions, we have
Eθ0 [V(θ0 , X)] = 0, Eθ0 [V(θ0 , X)V(θ0 , X)⊤ ] = −Eθ0 [H(θ0 , X)] = JF (θ0 ).
Thus, by the Central Limit Theorem and the Weak Law of Large Numbers, we have
1 X 1X
n n
d P
√ V(θ0 , Xi )−
→N (0, JF (θ0 )), H(θ0 , Xi )−
→ − JF (θ0 ).
n n
i=1 i=1
Substituting these quantities into (29.18), we obtain the following stochastic approximation of the
log-likelihood:
p n
Lθ ≈ Lθ0 + h nJF (θ0 )Z, θ − θ0 i − (θ − θ0 )⊤ JF (θ0 )(θ − θ0 ),
2
where Z ∼ N (0, Id ). Maximizing the right-hand side yields:
1
θ̂MLE ≈ θ0 + √ JF (θ0 )−1/2 Z.
n
From this asymptotic normality, we can obtain Eθ0 [kθ̂MLE − θ0 k22 ] ≤ n1 (TrJF (θ0 )−1 + o(1)), and
for smooth functionals by Taylor expanding T at θ0 (delta method), Eθ0 [kT(θ̂MLE ) − T(θ0 )k22 ] ≤
−1 ⊤
n (Tr(∇T(θ0 )JF (θ0 ) ∇T(θ0 ) ) + o(1)), matching the information bounds (29.15) and (29.16).
1
Of course, the above heuristic derivation requires additional assumptions to justify (for example,
Cramér’s condition, cf. [168, Theorem 18] and [375, Theorem 7.63]). Even stronger assumptions
are needed to ensure the error is uniform in θ in order to achieve the minimax lower bound in
Theorem 29.4; see, e.g., Theorem 34.4 (and also Chapters 36-37) of [68] for the exact conditions
and statements. A more general and abstract theory of MLE and the attainment of information
bound were developed by Hájek and Le Cam; see [209, 273].
Despite its wide applicability and strong optimality properties, the methodology of MLE is not
without limitations. We conclude this section with some remarks along this line.
i i
i i
i i
• MLE may not exist even for simple parametric models. For example, consider X1 , . . . , Xn
drawn iid from the location-scale mixture of two Gaussians 21 N ( μ1 , σ12 ) + 12 N ( μ2 , σ22 ), where
( μ1 , μ2 , σ1 , σ2 ) are unknown parameters. Then the likelihood can be made arbitrarily large by
setting for example μ1 = X1 and σ1 → 0.
• MLE may be inconsistent; see [375, Example 7.61] and [167] for examples, both in one-
dimensional parametric family.
• In high dimensions, it is possible that MLE fails to achieve the minimax rate (Exercise VI.15).
Theorem 29.5 Fox fixed k, the minimax squared error of estimating P satisfies
b − Pk22 ] = 1 k−1
R∗sq (k, n) ≜ inf sup E[kP + o( 1) , n → ∞. (29.19)
b
P P∈Pk n k
diag(θ) − θθ⊤ −Pk θ
∇T(θ)J− 1
F (θ)∇T(θ)
⊤
=
−Pk θ⊤ Pk (1 − Pk ).
Pk Pk
So Tr(∇T(θ)J− 1 ⊤
F (θ)∇T(θ) ) = i=1 Pi (1 − Pi ) = 1 −
2
i=1 Pi , which achieves its maximum
1 − 1k at the uniform distribution. Applying the functional form of the BCR bound in (29.16), we
conclude R∗sq (k, n) ≥ 1n (1 − 1k + o(1)).
For the upper bound, consider the MLE, which in this case coincides with the empirical distri-
Pn
bution P̂ = (P̂i ) (Exercise VI.8). Note that nP̂i = j=1 1 {Xj = i} ∼ Bin(n, Pi ). Then for any P,
Pk
E[kP̂ − Pk22 ] = n1 i=1 Pi (1 − Pi ) ≤ n1 (1 − 1k ).
i i
i i
i i
584
−1/k
• In fact, for any k, n, we have the precise result: R∗sq (k, n) = (11+√ 2 – see Ex. VI.7h. This can be
n)
shown by considering a Dirichlet prior (13.16) and applying the corresponding Bayes estimator,
which is an additively-smoothed empirical distribution (Section 13.5).
• Note that R∗sq (k, n) does not grow with the alphabet size k; this is because squared loss is
too weak for estimating probability vectors. More meaningful loss functions include the f-
divergences in Chapter 7, such as the total variation, KL divergence, χ2 -divergence. These
minimax rates are worked out in Exercise VI.8 and Exercise VI.10, for both small and large
alphabets, and they indeed depend on the alphabet size k. For example, the minimax KL risk
satisfies Θ( nk ) for k ≤ n and grows as Θ(log nk ) for k n. This agrees with the rule of thumb
that consistent estimation requires the sample size to scale faster than the dimension.
As a final application, let us consider the classical problem of entropy estimation in information
theory and statistics [304, 128, 215], where the goal is to estimate the Shannon entropy, a non-
linear functional of P. The following result follows from the functional BCR lower bound (29.16)
and analyzing the MLE (in this case the empirical entropy) [39].
Theorem 29.6 For fixed k, the minimax quadratic risk of entropy estimation satisfies
b (X1 , . . . , Xn ) − H(P))2 ] = 1
R∗ent (k, n) ≜ inf sup E[(H max V(P) + o(1) , n→∞
b P∈Pk
H n P∈Pk
Pk
where H(P) = i=1 Pi log P1i = E[log P(1X) ] and V(P) = Var[log P(1X) ] are the Shannon entropy
and varentropy (cf. (10.4)) of P.
Let us analyze the result of Theorem 29.6 and see how it extends to large alphabets. It can be
2
shown that3 maxP∈Pk V(P) log2 k, which suggests that R∗ent ≡ R∗ent (k, n) may satisfy R∗ent logn k
even when the alphabet size k grows with n; however, this result only holds for sufficiently small
alphabet. In fact, back in Lemma 13.2 we have shown that for the empirical entropy which achieves
the bound in Theorem 29.6, its bias is on the order of nk , which is no longer negligible on large
alphabets. Using techniques of polynomial approximation [456, 233], one can reduce this bias to
n log k and further show that consistent entropy estimation is only possible if and only if n log k
k k
3
Indeed, maxP∈Pk V(P) ≤ log2 k for all k ≥ 3 [334, Eq. (464)]. For the lower bound, consider
P = ( 12 , 2(k−1)
1 1
, . . . 2(k−1) ).
i i
i i
i i
In this chapter we describe a strategy for proving statistical lower bound we call the Mutual Infor-
mation Method (MIM), which entails comparing the amount of information data provides with
the minimum amount of information needed to achieve a certain estimation accuracy. Similar to
Section 29.2, the main information-theoretical ingredient is the data-processing inequality, this
time for mutual information as opposed to f-divergences.
Here is the main idea of the MIM: Fix some prior π on Θ and we aim to lower bound the Bayes
risk R∗π of estimating θ ∼ π on the basis of X with respect to some loss function ℓ. Let θ̂ be an
estimator such that E[ℓ(θ, θ̂)] ≤ D. Then we have the Markov chain θ → X → θ̂. Applying the
data processing inequality (Theorem 3.7), we have
Note that
• The leftmost quantity can be interpreted as the minimum amount of information required to
achieve a given estimation accuracy. This is precisely the rate-distortion function ϕ(D) ≡ ϕθ (D)
(recall Section 24.3).
• The rightmost quantity can be interpreted as the amount of information provided by the data
about the latent parameter. Sometimes it suffices to further upper-bound it by the capacity of
the channel PX|θ by maximizing over all priors (Chapter 5):
Therefore, we arrive at the following lower bound on the Bayes and hence the minimax risks
The reasoning of the mutual information method is reminiscent of the converse proof for joint-
source channel coding in Section 26.3. As such, the argument here retains the flavor of “source-
channel separation”, in that the lower bound in (30.1) depends only on the prior (source) and
the loss function, while the capacity upper bound (30.2) depends only on the statistical model
(channel).
In the next few sections, we discuss a sequence of examples to illustrate the MIM and its
execution:
585
i i
i i
i i
586
• Denoising a vector in Gaussian noise, where we will compute the exact minimax risk;
• Denoising a sparse vector, where we determine the sharp minimax rate;
• Community detection, where the goal is to recover a dense subgraph planted in a bigger Erdös-
Rényi graph.
In the next chapter we will discuss three popular approaches for, namely, Le Cam’s method,
Assouad’s lemma, and Fano’s method. As illustrated in Figure 30.1, all three follow from the
Figure 30.1 The three lower bound techniques as consequences of the Mutual Information Method.
mutual information method, corresponding to different choice of prior π for θ, namely, the uni-
form distribution over a two-point set {θ0 , θ1 }, the hypercube {0, 1}d , and a packing (recall
Section 27.1). While these methods are highly useful in determining the minimax rate for many
problems, they are often loose with constant factors compared to the MIM. In the last section
of this chapter, we discuss the problem of how and when is non-trivial estimation achievable by
applying the MIM; for this purpose, none of the three methods in the next chapter works.
i i
i i
i i
Using the sufficiency of X̄ and the formula of Gaussian channel capacity (cf. Theorem 5.11 or
Theorem 20.11), the mutual information between the parameter and the data can be computed as
d
I(θ; X) = I(θ; X̄) = log(1 + sn).
2
It then follows from (30.3) that R∗π ≥ 1+sdsn , which in fact matches the exact Bayes risk in (28.7).
Sending s → ∞ we recover the result in (28.17), namely
d
R∗ ( R d ) = . (30.4)
n
In the above unconstrained GLM, we are able to compute everything in closed form when
applying the mutual information method. Such exact expressions are rarely available in more
complicated models in which case various bounds on the mutual information will prove useful.
Next, let us consider the GLM with bounded means, where the parameter space Θ = B(ρ) =
{θ : kθk2 ≤ ρ} is the ℓ2 -ball of radius ρ centered at zero. In this case there is no known closed-
form formula for the minimax quadratic risk even in one dimension.1 Nevertheless, the next result
determines the sharp minimax rate, which characterizes the minimax risk up to universal constant
factors.
Proof. The upper bound R∗ (B(ρ)) ≤ dn ∧ ρ2 follows from considering the estimator θ̂ = X̄ and
θ̂ = 0. To prove the lower bound, we apply the mutual information method with a uniform prior
θ ∼ Unif(B(r)), where r ∈ [0, ρ] is to be optimized. The mutual information can be upper bound
using the AWGN capacity as follows:
1 d nr2 nr2
I(θ; X) = I(θ; X̄) ≤ sup I(θ; θ + √ Z) = log 1 + ≤ , (30.6)
Pθ :E[∥θ∥2 ]≤r n 2 d 2
2
where Z ∼ N (0, Id ). Alternatively, we can use Corollary 5.8 to bound the capacity (as information
radius) by the KL diameter, which yields the same bound within constant factors:
1
I(θ; X) ≤ sup I(θ; θ + √ Z) ≤ max D(N (θ, Id /n)kN (θ, Id /n)k) = 2nr2 . (30.7)
Pθ :∥θ∥≤r n θ,θ ′ ∈ B( r)
1
It is known that there exists some ρ0 depending on d/n such that for all ρ ≤ ρ0 , the uniform prior over the sphere of
radius ρ is exactly least favorable (see [82] for d = 1 and [48] for d > 1.)
i i
i i
i i
588
For the lower bound, due to the lack of closed-form formula for the rate-distortion function
for uniform distribution over Euclidean balls, we apply the Shannon lower bound (SLB) from
Section 26.1. Since θ has an isotropic distribution, applying Theorem 26.3 yields
d 2πed d cr2
inf I(θ; θ̂) ≥ h(θ) + log ≥ log ,
Pθ̂|θ :E∥θ−θ̂∥2 ≤D 2 D 2 D
for some universal constant c, where the last inequality is because for θ ∼ Unif(B(r)), h(θ) =
log vol(B(r)) = d log r + log vol(B(1)) and the volume of a unit Euclidean ball in d dimensions
satisfies (recall (27.14)) vol(B(1))1/d √1d .
2 2
∗ 2 −nr /d 2
R∗ ≤ 2 , i.e., R ≥ cr e
Finally, applying (30.3) yields 12 log cr nr
. Optimizing over r and
−ax −a
using the fact that sup0<x<1 xe = ea if a ≥ 1 and e if a < 1, we have
1
d
R∗ ≥ sup cr2 e−nr /d
2
∧ ρ2 .
r∈[0,ρ] n
As a final example, let us consider a non-quadratic loss ℓ(θ, θ̂) = kθ − θ̂kr , the rth power of an
arbitrary norm on Rd . Recall that we have determined in Corollary 28.8 the exact minimax risk
using Anderson’s lemma, namely,
inf sup Eθ [kθ̂ − θkr ] = n−r/2 E[kZkr ], Z ∼ N (0, Id ). (30.8)
θ̂ θ∈Rd
In order to apply the mutual information method, consider again a Gaussian prior θ ∼ N (0, sId ).
Suppose that E[kθ̂ − θkr ] ≤ D. By the data processing inequality,
( d )
d d Dre r d
log(1 + ns) ≥ I(θ; X) ≥ I(θ; θ̂) ≥ log(2πes) − log V∥·∥ Γ 1+ ,
2 2 d r
where the last inequality follows from the general SLB (26.5). Rearranging terms and sending
s → ∞ yields
r/2 −r/d r
d 2πe d − r/ 2 − r/ d d
inf sup Eθ [kθ̂ − θk ] ≥
r
V∥·∥ Γ 1 + n V∥·∥ ≳ √ ,
θ̂ θ∈Rd re n r nE[kZk∗ ]
(30.9)
where the middle inequality applies Stirling’s approximation Γ(x)1/x x for x → ∞, and the
right applies Urysohn’s volume inequality (27.21), with kxk∗ = sup{hx, yi : kyk ≤ 1} denoting
the dual norm of k · k.
To evaluate the tightness of the lower bound from SLB in comparison with the exact result
P 1/q
d
(30.8), consider the example of r = 2 and the ℓq -norm kxkq = i=1 | x i | q
with 1 ≤ q ≤ ∞.
Recall the volume of a unit ℓq -ball given in (27.13). In the special case of q = 2, the (first) lower
bound in (30.9) is in fact exact and coincides with (30.4). For general q ∈ [1, ∞), (30.9) gives the
2/q
tight minimax rate d n ; however, for q = ∞, the minimax lower bound we get is 1/n, independent p
of the dimension d. In comparison, from (30.8) we get the sharp rate logn d , since EkZk∞ log d
(cf. Lemma 27.10). We will revisit this example in Section 31.4 and show how to obtain the optimal
dependency on the dimension.
i i
i i
i i
Remark 30.2 (SLB versus the volume method) Recall the connection between the rate-
distortion function and the metric entropy in Section 27.6. As we have seen in Section 27.2, a
common lower bound for metric entropy is via the volume bound. In fact, the SLB can be inter-
preted as a volume-based lower bound to the rate-distortion function. To see this, consider r = 1
and let θ be uniformly distributed over some compact set Θ, so that h(θ) = log vol(Θ) (Theo-
rem 2.7(a)). Applying Stirling’s approximation, the lower bound in (26.5) becomes log vol(vol (Θ)
B∥·∥ (cϵ))
for some constant c, which has the same form as the volume ratio in Theorem 27.3 for metric
entropy. We will see later in Section 31.4 that in statistical applications, applying SLB yields basi-
cally the same lower bound as applying Fano’s method to a packing obtained from the volume
bound, although SLB does not rely explicitly on a packing.
where kθk0 = |{i ∈ [d] : θi 6= 0}| is the number of nonzero entries of θ, indicating the sparsity of
θ. Our goal is to characterize the minimax quadratic risk
Next we prove an optimal lower bound applying MIM. (For a different proof using Fano’s method
in Section 31.4, see Exercise VI.12.)
Theorem 30.2
k ed
R∗n (B0 (k)) ≳ log . (30.10)
n k
which is equivalent to keeping the k entries from X̄ with the largest magnitude and setting the
rest to zero, or the following hard-thresholding estimator θ̂τ with an appropriately chosen τ (see
Exercise VI.13):
i i
i i
i i
590
• Sharp asymptotics: For sublinear sparsity k = o(d), we have R∗n (B0 (k)) = (2 + o(1)) nk log dk
(Exercise VI.13); for linear sparsity k = (η + o(1))d with η ∈ (0, 1), R∗n (B0 (k)) = (β(η) +
o(1))d for some constant β(η). For the latter and more refined results, we refer the reader to the
monograph [236, Chapter 8].
Proof. First, note that B0 (k) is a union of linear subspace of Rd and thus homogeneous. Therefore
by scaling, we have
1 ∗ 1
R∗n (B0 (k)) =
R (B0 (k)) ≜ R∗ (k, d). (30.13)
n 1 n
Thus it suffices to consider n = 1. Denote the observation by X = θ + Z.
Next, note that the following oracle lower bound:
R∗ (k, d) ≥ k,
which is the optimal risk given the extra information of the support of θ, in view of (30.4). Thus
to show (30.10), below it suffices to consider k ≤ d/4.
We now apply the mutual information method. Recall from (27.10) that Sdk denotes the
Hamming sphere, namely,
Sdk = {b ∈ {0, 1}d : wH (b) = k},
d
where wH (b) denotes
qthe Hamming weights of b. Let b be uniformly distributed over Sk and let
θ = τ b, where τ = log dk . Given any estimator θ̂ = θ̂(X), define an estimator b̂ ∈ {0, 1}d for b
by
(
0 θ̂i ≤ τ /2
b̂i = , i ∈ [d].
1 θ̂i > τ /2
i i
i i
i i
d d δk
≥ log − max H(b ⊕ b̂) = log − dh , (30.15)
k EwH (b⊕b̂)≤δ k k d
where the last step follows from Exercise I.7.
Combining the lower and upper bound on the mutual information and using dk ≥ ( dk )k , we
get dh( δdk ) ≥ 2k k log dk . Since h(p) ≤ −p log p + p for p ∈ [0, 1] and k/d ≤ 14 by assumption, we
conclude that δ ≥ ck/d for some absolute constant c, completing the proof of (30.10) in view of
(30.14).
Theorem 30.3 Assume that k/n is bounded away from one. If almost exact recovery is possible,
then
2 + o(1) n
d(pkq) ≥ log . (30.16)
k−1 k
Proof. Suppose Ĉ achieves almost exact recovery of C∗ . Let ξ ∗ , ξˆ ∈ {0, 1}n denote their indicator
vectors, respectively, for example, ξi∗ = 1 {i ∈ C∗ } for each i ∈ [n]. Then E[dH (ξ, ξ)]
ˆ = ϵn k for
some ϵn → 0. Applying the mutual information method as before, we have
( a) n ϵn k (b) n
∗ ˆ ∗
I(G; ξ ) ≥ I(ξ; ξ ) ≥ log − nh ≥ k log (1 + o(1)),
k n k
i i
i i
i i
592
where (a) follows in the same manner as (30.15) did from Exercise I.7; (b) is due to the assumption
that k/n ≤ 1 − c for some constant c.
On the other hand, the mutual information between the hidden community and the graph can
be upper bounded as:
(b) k
∗ ( a) ⊗(n2) ( c)
I(G; ξ ) = min D(PG|ξ∗ kQ|Pξ∗ ) ≤ D(PG|ξ∗ kBer(q) | Pξ ∗ ) = d(pkq),
Q 2
where (a) is by the variational representation of mutual information in Corollary 4.2; (b) follows
from choosing Q to be the distribution of the Erdös-Rényi graph ER(n, q); (c) is by the tensoriza-
tion property of KL divergence for product distributions (Theorem 2.16(d)). Combining the last
two displays completes the proof.
i.i.d.
Theorem 30.4 (Bounded GLM continued) Suppose X1 , . . . , Xn ∼ N (θ, Id ), where θ
belongs to B, the unit ℓ2 -ball in Rd . Then for some universal constant C0 ,
n+C0 d
e− d−1 ≤ inf sup Eθ [kθ̂ − θk2 ] ≤ .
θ̂ θ∈B d+n
Proof. Without loss of generality, assume that the observation is X = θ+ √Zn , where Z ∼ N (0, Id ).
For the upper bound, applying the shrinkage estimator2 θ̂ = 1+1d/n X yields E[kθ̂ − θk2 ] ≤ n+d d .
For the lower bound, we apply MIM as in the proof of Theorem 30.1 with the prior θ ∼
Unif(Sd−1 ). We still apply the AWGN capacity in (30.6) to get I(θ; X) ≤ n/2. (Here the
2
This corresponds to the Bayes estimator (Example 28.1) when we choose θ ∼ N (0, 1d Id ), which is approximately
concentrated on the unit sphere for large d.
i i
i i
i i
constant 1/2 is important and so the diameter-based bound (30.7) is too loose.) For the rate-
distortion function of spherical uniform distribution, applying Theorem 27.17 yields I(θ; θ̂) ≥
d−1
2 log E[∥θ̂−θ∥2 ] − C. Thus the lower bound on E[kθ̂ − θk ] follows from the data processing
1 2
inequality.
A similar phenomenon also occurs in the problem of estimating a discrete distribution P on k
elements based on n iid observations, which has been studied in Section 29.4 for small alphabet in
the large-sample asymptotics and extended in Exercise VI.7–VI.10 to large alphabets. In particular,
consider the total variation loss, which is at most one. Ex. VI.10f shows that the TV error of any
estimator is 1 − o(1) if n k; conversely, Ex. VI.10b demonstrates an estimator P̂ such that
E[χ2 (PkP̂)] ≤ nk− 1 2
+1 . Applying the joint range (7.32) between TV and χ and Jensen’s inequality,
we have
q
1 k− 1 n ≥ k − 2
E[TV(P, P̂)] ≤ 2 n+1
k− 1 n≤k−2
k+n
which is bounded away from one whenever n = Ω(k). In summary, non-trivial estimation in total
variation is possible if and only if n scales at least proportionally with k.
Finally, let us mention the problem of correlated recovery in the stochastic block model
(cf. Exercise I.49), which refers to estimating the community labels better than chance. The
optimal information-theoretic threshold of this problem can be established by bounding the
appropriate mutual information; see Section 33.9 for the Gaussian version (spiked Wigner model).
i i
i i
i i
In this chapter we study three commonly used techniques for proving minimax lower bounds,
namely, Le Cam’s method, Assouad’s lemma, and Fano’s method. Compared to the results in
Chapter 29 geared towards large-sample asymptotics in smooth parametric models, the approach
here is more generic, less tied to mean-squared error, and applicable in nonasymptotic settings
such as nonparametric or high-dimensional problems.
The common rationale of all three methods is reducing statistical estimation to hypothesis test-
ing. Specifically, to lower bound the minimax risk R∗ (Θ) for the parameter space Θ, the first step
is to notice that R∗ (Θ) ≥ R∗ (Θ′ ) for any subcollection Θ′ ⊂ Θ, and Le Cam, Assouad, and Fano’s
methods amount to choosing Θ′ to be a two-point set, a hypercube, or a packing, respectively. In
particular, Le Cam’s method reduces the estimation problem to binary hypothesis testing. This
method is perhaps the easiest to evaluate; however, the disadvantage is that it is frequently loose
in estimating high-dimensional parameters. To capture the correct dependency on the dimension,
both Assouad’s and Fano’s method rely on reduction to testing multiple hypotheses.
As illustrated in Figure 30.1, all three methods in fact follow from the common principle of
the mutual information method (MIM) in Chapter 30, corresponding to different choice of priors.
The limitation of these methods, compared to the MIM, is that, due to the looseness in constant
factors, they are ineffective for certain problems such as estimation better than chance discussed
in Section 30.4.
Then
ℓ(θ0 , θ1 )
inf sup Eθ ℓ(θ, θ̂) ≥ sup (1 − TV(Pθ0 , Pθ1 )) (31.2)
θ̂ θ∈Θ θ0 ,θ1 ∈Θ 2α
594
i i
i i
i i
Proof. Fix θ0 , θ1 ∈ Θ. Given any estimator θ̂, let us convert it into the following (randomized)
test:
θ0 with probability ℓ(θ1 ,θ̂)
,
ℓ(θ0 ,θ̂)+ℓ(θ1 ,θ̂)
θ̃ =
θ1 with probability ℓ(θ 0 , θ̂)
.
ℓ(θ ,θ̂)+ℓ(θ ,θ̂) 0 1
and similarly for θ1 . Consider the prior π = 21 (δθ0 + δθ1 ) and let θ ∼ π. Taking expectation on
both sides yields the following lower bound on the Bayes risk:
ℓ(θ0 , θ1 ) ℓ(θ0 , θ1 )
Eπ [ℓ(θ̂, θ)] ≥ P θ̃ 6= θ ≥ (1 − TV(Pθ0 , Pθ1 ))
α 2α
where the last step follows from the minimum average probability of error in binary hypothesis
testing (Theorem 7.7).
Remark 31.1 As an example where the bound (31.2) is tight (up to constants), consider a
binary hypothesis testing problem with Θ = {θ0 , θ1 } and the Hamming loss ℓ(θ, θ̂) = 1{θ 6= θ̂},
where θ, θ̂ ∈ {θ0 , θ1 } and α = 1. Then the left side is the minimax probability of error, and the
right side is the optimal average probability of error (cf. (7.19)). These two quantities can coincide
(for example for Gaussian location model).
Another special case of interest is the quadratic loss ℓ(θ, θ̂) = kθ − θ̂k22 , where θ, θ̂ ∈ Rd , which
satisfies the α-triangle inequality with α = 2. In this case, the leading constant 41 in (31.2) makes
sense, because in the extreme case of TV = 0 where Pθ0 and Pθ1 cannot be distinguished, the best
estimate is simply θ0 +θ2 . In addition, the inequality (31.2) can be deduced based on properties of
1
f-divergences and their joint range (Chapter 7). To this end, abbreviate Pθi as Pi for i = 0, 1 and
consider the prior π = 21 (δθ0 + δθ1 ). Then the Bayes estimator (posterior mean) is θ0 dP 0 +θ1 dP1
dP0 +dP1 and
the Bayes risk is given by
Z
kθ0 − θ1 k2 dP0 dP1
R∗π =
2 dP0 + dP1
kθ0 − θ1 k2 kθ0 − θ1 k2
= (1 − LC(P0 , P1 )) ≥ (1 − TV(P0 , P1 )),
4 4
R 0 −dP1 )
2
where LC(P0 , P1 ) = (dP dP0 +dP1 is the Le Cam divergence defined in (7.7) and satisfies LC ≤ TV.
i i
i i
i i
596
where (a) follows from the shift and scale invariance of the total variation; in (b) c ≈ 0.083 is
some absolute constant, obtained by applying the formula TV(N (0, 1), N (s, 1)) = 2Φ( 2s ) − 1
from (7.40). On the other hand, we know from Example 28.2 that the minimax risk equals 1n , so
the two-point method is rate-optimal in this case.
In the above example, for two points separated by Θ( √1n ), the corresponding hypothesis cannot
be tested with vanishing probability of error so that the resulting estimation risk (say in squared
error) cannot be smaller than 1n . This convergence rate is commonly known as the “parametric
rate”, which we have studied in Chapter 29 for smooth parametric families focusing on the Fisher
information as the sharp constant. More generally, the 1n rate is not improvable for models with
locally quadratic behavior
(Recall that Theorem 7.23 gives a sufficient condition for this behavior.) Indeed, pick θ0 in the
interior of the parameter space and set θ1 = θ0 + √1n , so that H2 (Pθ0 , Pθ1 ) = Θ( 1n ) thanks to (31.4).
By Theorem 7.8, we have TV(P⊗ ⊗n
θ0 , Pθ1 ) ≤ 1 − c for some constant c and hence Theorem 31.1
n
yields the lower bound Ω(1/n) for the squared error. Furthermore, later we will show that the same
locally quadratic behavior in fact guarantees the achievability of the 1/n rate; see Corollary 32.12.
Example 31.2 As a different example, consider the family Unif(0, θ). Note that as opposed
to the quadratic behavior (31.4), we have
√
H2 (Unif(0, 1), Unif(0, 1 + t)) = 2(1 − 1/ 1 + t) t.
Thus an application of Theorem 31.1 yields an Ω(1/n2 ) lower bound. This rate is not achieved
by the empirical mean estimator (which only achieves 1/n rate), but by the maximum likelihood
estimator θ̂ = max{X1 , . . . , Xn }. Other types of behavior in t, and hence the rates of convergence,
can occur even in compactly supported location families – see Example 7.1.
The limitation of Le Cam’s two-point method is that it does not capture the correct dependency
on the dimensionality. To see this, let us revisit Example 31.1 for d dimensions.
Example 31.3 Consider the d-dimensional GLM in Corollary 28.8. Again, it is equivalent
to consider the reduced model {N (θ, 1n ) : θ ∈ Rd }. We know from Example 28.2 (see also
Theorem 28.4) that for quadratic risk ℓ(θ, θ̂) = kθ − θ̂k22 , the exact minimax risk is R∗ = nd for any
d and n. Let us compare this with the best two-point lower bound. Applying Theorem 31.1 with
α = 2,
1 1 1
R∗ ≥ sup kθ0 − θ1 k22 1 − TV N θ0 , Id , N θ1 , Id
θ0 ,θ1 ∈Rd 4 n n
1
= sup kθk22 {1 − TV (N (0, Id ) , N (θ, Id ))}
θ∈Rd 4n
1
= sup s2 (1 − TV(N (0, 1), N (s, 1))),
4n s>0
i i
i i
i i
where the second step applies the shift and scale invariance of the total variation; in the last step,
by rotational invariance of isotropic Gaussians, we can rotate the vector θ align with a coordinate
vector (say, e1 = (1, 0 . . . , 0)) which reduces the problem to one dimension, namely,
TV(N (0, Id ), N (θ, Id )) = TV(N (0, Id ), N (kθke1 , Id )
= TV(N (0, 1), N (kθk, 1)).
Comparing the above display with (31.3), we see that the best Le Cam two-point lower bound in
d dimensions coincide with that in one dimension.
Let us mention in passing that although Le Cam’s two-point method is typically suboptimal for
estimating a high-dimensional parameter θ, for functional estimation in high dimensions (e.g. esti-
mating a scalar functional T(θ)), Le Cam’s method is much more effective and sometimes even
optimal. The subtlety is that is that as opposed to testing a pair of simple hypotheses H0 : θ = θ0
versus H1 : θ = θ1 , we need to test H0 : T(θ) = t0 versus H1 : T(θ) = t1 , both of which are
composite hypotheses and require a sagacious choice of priors. See Exercise VI.14 for an example.
Theorem 31.2 (Assouad’s lemma) Assume that the loss function ℓ satisfies the α-triangle
inequality (31.1). Suppose Θ contains a subset Θ′ = {θb : b ∈ {0, 1}d } indexed by the hypercube,
such that ℓ(θb , θb′ ) ≥ β · dH (b, b′ ) for all b, b′ and some β > 0. Then
βd
inf sup Eθ ℓ(θ, θ̂) ≥ 1 − max TV(Pθb , Pθb′ ) (31.5)
θ̂ θ∈Θ 4α dH (b,b′ )=1
Proof. We lower bound the Bayes risk with respect to the uniform prior over Θ′ . Given any
estimator θ̂ = θ̂(X), define b̂ ∈ argmin ℓ(θ̂, θb ). Then for any b ∈ {0, 1}d ,
β dH (b̂, b) ≤ ℓ(θb̂ , θb ) ≤ α(ℓ(θb̂ , θ̂b ) + ℓ(θ̂, θb )) ≤ 2αℓ(θ̂, θb ).
β X
d
≥ (1 − TV(PX|bi =0 , PX|bi =1 )),
4α
i=1
i i
i i
i i
598
where the last step is again by Theorem 7.7, just like in the proof of Theorem 31.1. Each total
variation can be upper bounded as follows:
!
( a) 1 X 1 X (b)
TV(PX|bi =0 , PX|bi =1 ) = TV d − 1
Pθb , d−1 Pθb ≤ max TV(Pθb , Pθb′ )
2 2 dH (b,b′ )=1
b:bi =1 b:bi =0
where (a) follows from the Bayes rule, and (b) follows from the convexity of total variation
(Theorem 7.5). This completes the proof.
Example 31.4 Let us continue the discussion of the d-dimensional GLM in Example 31.3.
Consider the quadratic loss first. To apply Theorem 31.2, consider the hypercube θb = ϵb, where
b ∈ {0, 1}d . Then kθb − θb′ k22 = ϵ2 dH (b, b′ ). Applying Theorem 31.2 yields
∗ ϵ2 d 1 ′ 1
R ≥ 1− max TV N ϵb, Id , N ϵb , Id
4 b,b′ ∈{0,1}d ,dH (b,b′ )=1 n n
2
ϵ d 1 1
= 1 − TV N 0, , N ϵ, ,
4 n n
where the last step applies (7.11) for f-divergence between product distributions that only differ
in one coordinate. Setting ϵ = √1n and by the scale-invariance of TV, we get the desired R∗ ≳ dn .
Next, let’s consider the loss function kθb − θb′ k∞ . In the same setup, we only kθb − θb′ k∞ ≥
′ ∗ √1 , which does not depend on d. In fact, R∗
d dH (b, b ). Then Assouad’s lemma yields R ≳
ϵ
q n
log d
n as shown in Corollary 28.8. In the next section, we will discuss Fano’s method which can
resolve this deficiency.
Here τ ′ is related to τ by τ log 2 = h(τ ′ ). Thus, using the same “hypercube embedding b → θb ”,
the bound similar to (31.5) will follow once we can bound I(bd ; X) away from d log 2.
Can we use the pairwise total variation bound in (31.5) to do that? Yes! Notice that thanks to
the independence of bi ’s we have1
1
Equivalently, this also follows from the convexity of the mutual information in the channel (cf. Theorem 5.3).
i i
i i
i i
where in the last step we used the fact that whenever B ∼ Ber(1/2),
I(B; X) ≤ TV(PX|B=0 , PX|B=1 ) log 2 , (31.8)
which follows from (7.39) by noting that the mutual information is expressed as the Jensen-
Shannon divergence as 2I(B; X) = JS(PX|B=0 , PX|B=1 ). Combining (31.6) and (31.7), the mutual
information method implies the following version of the Assouad’s lemma: Under the assumption
of Theorem 31.2,
βd −1 (1 − t) log 2
inf sup Eθ ℓ(θ, θ̂) ≥ ·f max TV(Pθ , Pθ′ ) , f(t) ≜ h (31.9)
θ̂ θ∈Θ 4α dH (θ,θ ′ )=1 2
where h−1 : [0, log 2] → [0, 1/2] is the inverse of the binary entropy function. Note that (31.9) is
slightly weaker than (31.5). Nevertheless, as seen in Example 31.4, Assouad’s lemma is typically
applied when the pairwise total variation is bounded away from one by a constant, in which case
(31.9) and (31.5) differ by only a constant factor.
In all, we may summarize Assouad’s lemma as a convenient method for bounding I(bd ; X) away
from the full entropy (d bits) on the basis of distances between PX|bd corresponding to adjacent
bd ’s.
Theorem 31.3 Let d be a metric on Θ. Fix an estimator θ̂. For any T ⊂ Θ and ϵ > 0,
h ϵi C(T) + log 2
P d(θ, θ̂) ≥ ≥1− , (31.10)
2 log M(T, d, ϵ)
where C(T) ≜ sup I(θ; X) is the capacity of the channel from θ to X with input space T, with the
supremum taken over all distributions (priors) on T. Consequently,
ϵ r C(T) + log 2
inf sup Eθ [d(θ, θ̂) ] ≥ sup
r
1− , (31.11)
θ̂ θ∈Θ T⊂Θ,ϵ>0 2 log M(T, d, ϵ)
i i
i i
i i
600
I(θ; X) + log 2
P[θ 6= θ̃] ≥ 1 − .
log M
In applying Fano’s method, since it is often difficult to evaluate the capacity C(T), it is useful
to recall from Theorem 5.9 that C(T) coincides with the KL radius of the set of distributions
{Pθ : θ ∈ T}, namely, C(T) ≜ infQ supθ∈T D(Pθ kQ). As such, choosing any Q leads to an upper
bound on the capacity. As an application, we revisit the d-dimensional GLM in Corollary 28.8
under the ℓq -loss (1 ≤ q ≤ ∞), with the particular focus on the dependency on the dimension.
(For a different application in sparse setting see Exercise VI.12.)
Example 31.5 Consider GLM with sample size n, where Pθ = N (θ, Id )⊗n . Taking natural
logs here and below, we have
n
D(Pθ kPθ′ ) = kθ − θ′ k22 ;
2
in other words, KL-neighborhoods are ℓ2 -balls. As such, let us apply Theorem 31.3 to T = B2 (ρ)
2
for some ρ > 0 to be specified. Then C(T) ≤ supθ∈T D(Pθ kP0 ) = nρ2 . To bound the packing
number from below, we applying the volume bound in Theorem 27.3,
d
ρd vol(B2 ) cq ρd1/q
M(B2 (ρ), k · kq , ϵ) ≥ d ≥ √
ϵ vol(Bq ) ϵ d
for some
p constant cq ,c where the last step follows the volume formula (27.13) for ℓq -balls. Choosing
ρ = d/n and ϵ = eq2 ρd1/q−1/2 , an application of Theorem 31.3 yields the minimax lower bound
d1/q
Rq ≡ inf sup Eθ [kθ̂ − θkq ] ≥ Cq √ (31.12)
θ̂ θ∈Rd n
for some constant Cq depending on q. This is the same lower bound as that in (30.9) obtained via
the mutual information method plus the Shannon lower bound (which is also volume-based).
For any q ≥ 1, (31.12) is rate-optimal since we can apply the MLE θ̂ = X̄. (Note that at q = ∞,
pq = ∞, (31.12)
the constant Cq is still finite since vol(B∞ ) = 2d .) However, for the special case of
does not depend on the dimension at all, as opposed to the correct dependency log d shown in
Corollary 28.8. In fact, previously in Example 31.4 the application of Assouad’s lemma yields
the same suboptimal result. So is it possible to fix this looseness with Fano’s method? It turns out
that the answer is yes and the suboptimality is due to the volume bound on the metric entropy,
which, as we have seen in Section 27.3, can be ineffective if ϵ scales with dimension. Indeed, if
i i
i i
i i
q q
c log d
we apply the tight bound of M(B2 , k · k∞ , ϵ) in (27.18),2 with ϵ = and ρ = c′ logn d for
q n
• It is sometimes convenient to further bound the KL radius by the KL diameter, since C(T) ≤
diamKL (T) ≜ supθ,θ′ ∈T D(Pθ′ kPθ ) (cf. Corollary 5.8). This suffices for Example 31.5.
• In Theorem 31.3 we actually lower bound the global minimax risk by that restricted on a param-
eter subspace T ⊂ Θ for the purpose of controlling the mutual information, which is often
difficult to compute. For the GLM considered in Example 31.5, the KL divergence is propor-
tional to squared ℓ2 -distance and T is naturally chosen to be a Euclidean ball. For other models
such as the covariance model (Exercise VI.16) wherein the KL divergence is more complicated,
the KL neighborhood T needs to be chosen carefully. Later in Section 32.4 we will apply the
same Fano’s method to the infinite-dimensional problem of estimating smooth density.
2
In fact, in this case we can also choose the explicit packing {ϵe1 , . . . , ϵed }.
i i
i i
i i
So far our discussion on information-theoretic methods have been mostly focused on statistical
lower bounds (impossibility results), with matching upper bounds obtained on a case-by-case basis.
In this chapter, we will discuss three information-theoretic upper bounds for statistical estimation.
These three results apply to different loss functions and are obtained using completely different
means. However, they take on exactly the same form involving the appropriate metric entropy of
the model.
Specifically, suppose that we observe X1 , . . . , Xn drawn independently from a distribution Pθ for
some unknown parameter θ ∈ Θ, and the goal is to produce an estimate P̂ for the true distribution
Pθ . We have the following entropic minimax upper bounds:
Here N(P, ϵ) refers to the metric entropy (cf. Chapter 27) of the model class P = {Pθ : θ ∈ Θ}
under various distances, which we will formalize along the way.
In particular, we will see that these methods achieve minimax optimal rates for the classical
problem of density estimation under smoothness constraints. To place these results in the bigger
context, we remind that we have already discussed modern methods of density estimation based
on machine learning ideas (Examples 4.2 and 7.5). However, those methods, beautiful and empir-
ically successful, are not known to achieve optimality over any reasonable classes. The metric
entropy methods as above, though, could and should be used to derive fundamental limits for the
classes which are targeted by the machine learning methods. Thus, there is a rich field of modern
applications, which this chapter will hopefully welcome the reader to explore.
We note that there are other entropic upper bound for statistical estimation, notably, MLE and
other M-estimators. This require different type of metric entropy (bracketing entropy, which is
602
i i
i i
i i
akin to metric entropy under the sup norm) and the style of analysis is more related in spirit to the
theory of empirical processes (e.g. Dudley’s entropy integral (27.22)). We refer the readers to the
monographs [332, 429, 431] for details. In this chapter we focus on more information-theoretic
style results.
If the family has a common dominating measure μ, the problem is equivalent to estimate the
density pθ = dP dμ , commonly referred to as the problem of density estimation in the statistics
θ
literature.
Our objective is to prove the upper bound (32.1) for the minimax KL risk
R∗KL (n) ≜ inf sup Eθ D(Pθ kP̂), (32.4)
P̂ θ∈Θ
where the infimum is taken over all estimators P̂ = P̂(·|Xn ) which is a distribution on X ; in
other words, we allow improper estimates in the sense that P̂ can step outside the model class P .
Indeed, the construction we will use in this section (such as predictive density estimators (Bayes)
or their mixtures) need not be a member of P . Later we will see in Sections 32.2 and 32.3 that for
total variation and Hellinger loss we can always restrict to proper estimators;2 however these loss
functions are weaker than the KL divergence.
The main result of this section is the following.
1
Note the asymmetry in this loss function. Alternatively the loss D(P̂kP) is typically infinite in nonparametric settings,
because it is impossible to estimate the support of the true density exactly.
2
This is in fact a generic observation: Whenever the loss function satisfies an approximate triangle inequality, any improper
estimate can be converted to a proper one by its project on the model class whose risk is inflated by no more than a
constant factor.
i i
i i
i i
604
1
≤ inf ϵ2 + log NKL (P, ϵ) . (32.8)
ϵ>0 n+1
Conversely,
X
n
R∗KL (t) ≥ Cn+1 . (32.9)
t=0
Note that the capacity Cn is precisely the redundancy (13.10) which governs the minimax regret
in universal compression; the fact that it bounds the KL risk can be attributed to a generic relation
between individual and cumulative risks which we explain later in Section 32.1.4. As explained in
Chapter 13, it is in general difficult to compute the exact value of Cn even for models as simple as
Bernoulli (Pθ = Ber(θ)). This is where (32.8) comes in: one can use metric entropy and tools from
Chapter 27 to bound this capacity, leading to useful (and even optimal) risk bounds. We discuss
two types of applications of this result.
Infinite-dimensional models Similar to the results in Section 27.4, for nonparametric models
NKL (ϵ) typically grows super-polynomially in 1ϵ and, in turn, the capacity Cn grows super-
logarithmically. In fact, whenever we have Cn = nα polylog(n) for some α > 0 where (log n)c0 ≤
polylog(n) ≤ (log n)c1 for some absolute c0 , c1 , Theorem 32.1 shows the minimax KL rate satisfies
i i
i i
i i
which easily follows from combining (32.7) and (32.8) – see (32.27) for details. For concrete
examples, see Section 32.4 for the application of estimating smooth densities.
Next, we explain the intuition behind and the proof of Theorem 32.1.
i.i.d.
where θ ∼ π and (X1 , . . . , Xn+1 ) ∼ Pθ conditioned on θ. The Bayes estimator achieving this infi-
mum is given by P̂Bayes (·|xn ) = PXn+1 |Xn =xn . If each Pθ has a density pθ with respect to some
common dominating measure μ, the Bayes estimator has density:
R Qn+1
π (dθ) i=1 pθ (xi )
p̂Bayes (xn+1 |x ) = R
n
Qn . (32.12)
π (dθ) i=1 pθ (xi )
( a)
= EXn D(PXn+1 |θ kPXn+1 |Xn |Pθ|Xn )
= D(PXn+1 |θ kPXn+1 |Xn |Pθ,Xn )
(b)
= I(θ; Xn+1 |Xn ).
where (a) follows from the variational representation of mutual information (Theorem 4.1 and
Corollary 4.2); (b) invokes the definition of the conditional mutual information (Section 3.4) and
3
Throughout this chapter, we continue to use the conventional notation Pθ for a parametric family of distributions and use
π to stand for the distribution of θ.
i i
i i
i i
606
the fact that Xn → θ → Xn+1 forms a Markov chain, so that PXn+1 |θ,Xn = PXn+1 |θ . In addition, the
Bayes optimal estimator is given by PXn+1 |Xn .
Note that the operational meaning of I(θ; Xn+1 |Xn ) is the information provided by one extra
observation about θ having already obtained n observations. In most situations, since Xn will have
already allowed θ to be consistently estimated as n → ∞, the additional usefulness of Xn+1 is
vanishing. This is made precisely by the following result.
Proof. In view of the chain rule for mutual information (Theorem 3.7): I(θ; Xn+1 ) =
Pn+1 i−1
i=1 I(θ; Xi |X ), (32.13) follows from the monotonicity. To show the latter, let us consider
a “sampling channel” where the input is θ and the output is X sampled from Pθ . Let I(π )
denote the mutual information when the input distribution is π, which is a concave function in
π (Theorem 5.3). Then
where the inequality follows from Jensen’s inequality, since Pθ|Xn−1 is a mixture of Pθ|Xn .
Lemma 32.3 allows us to prove the converse bound (32.9): Fix any prior π. Since the minimax
risk dominates any Bayes risk (Theorem 28.1), in view of Lemma 32.2, we have
X
n X
n
R∗KL (t) ≥ I(θ; Xt+1 |Xt ) = I(θ; Xn+1 ).
t=0 t=0
Recall from (32.5) that Cn+1 = supπ ∈P(Θ) I(θ; Xn+1 ). Optimizing over the prior π yields (32.9).
Now suppose that the minimax theorem holds for (32.4), so that R∗KL = supπ ∈P(Θ) R∗KL,Bayes (π ).
Lemma 32.2 then allows us to express the minimax risk as the conditional mutual information
maximized over the prior π:
i i
i i
i i
taking the sample Xi of size i as the input. Taking their Cesàro mean results in the following
estimator operating on the full sample Xn :
1 X
n+1
P̂(·|Xn ) ≜ QXi |Xi−1 . (32.14)
n+1
i=1
Let us bound the worst-case KL risk of this estimator. Fix θ ∈ Θ and let Xn+1 be drawn
⊗(n+1)
independently from Pθ so that PXn+1 = Pθ . Taking expectations with this law, we have
" !#
1 X
n+1
Eθ [D(Pθ kP̂(·|Xn ))] = E D Pθ QXi |Xi−1
n+1
i=1
( a) 1 X
n+1
≤ D(Pθ kQXi |Xi−1 |PXi−1 )
n+1
i=1
(b) 1 ⊗(n+1)
= D(Pθ kQXn+1 ),
n+1
where (a) and (b) follows from the convexity (Theorem 5.1) and the chain rule for KL divergence
(Theorem 2.16(c)). Taking the supremum over θ ∈ Θ bounds the worst-case risk as
1 ⊗(n+1)
R∗KL (n) ≤ sup D(Pθ kQXn+1 ).
n + 1 θ∈Θ
Optimizing over the choice of QXn+1 , we obtain
1 ⊗(n+1) Cn+1
R∗KL (n) ≤ inf sup D(Pθ kQXn+1 ) = ,
n + 1 QXn+1 θ∈Θ n+1
where the last identity applies Theorem 5.9 of Kemperman, completing the proof of (32.7).
Furthermore, Theorem 5.9 asserts that the optimal QXn+1 exists and given uniquely by the capacity-
achieving output distribution P∗Xn+1 . Thus the above minimax upper bound can be attained by
taking the Cesàro average of P∗X1 , P∗X2 |X1 , . . . , P∗Xn+1 |Xn , namely,
1 X ∗
n+1
P̂∗ (·|Xn ) = PXi |Xi−1 . (32.15)
n+1
i=1
Note that in general this is an improper estimate as it steps outside the class P .
In the special case where the capacity-achieving input distribution π ∗ exists, the capacity-
achieving output distribution can be expressed as a mixture over product distributions as P∗Xn+1 =
R ∗ ⊗(n+1)
π (dθ)Pθ . Thus the estimator P̂∗ (·|Xn ) is in fact the average of Bayes estimators (32.12)
∗
under prior π for sample sizes ranging from 0 to n.
Finally, as will be made clear in the next section, in order to achieve the further upper bound
(32.8) in terms of the KL covering numbers, namely R∗KL (n) ≤ ϵ2 + n+1 1 log NKL (P, ϵ), it suffices to
choose the following QXn+1 as opposed to the exact capacity-achieving output distribution: Pick an
ϵ-KL cover Q1 , . . . , QN for P of size N = NKL (P, ϵ) and choose π to be the uniform distribution
PN ⊗(n+1)
and define QXn+1 = N1 j=1 Qj – this was the original construction in [464]. In this case,
i i
i i
i i
608
applying the Bayes rule (32.12), we see that the estimator is in fact a convex combination P̂(·|Xn ) =
PN
j=1 wj Qj of the centers Q1 , . . . , QN , with data-driven weights given by
Qi−1
1 X
n+1
t=1 Qj (Xt )
wj = PN Qi−1 .
n+1 Qj ( X t )
i=1 j=1 t=1
Again, except for the extraordinary case where P is convex and the centers Qj belong to P , the
estimate P̂(·|Xn ) is improper.
Proof. Fix ϵ and let N = NKL (Q, ϵ). Then there exist Q1 , . . . , QN that form an ϵ-KL cover, such
that for any a ∈ A there exists i(a) ∈ [N] such that D(PB|A=a kQi(a) ) ≤ ϵ2 . Fix any PA . Then
where the last inequality follows from that i(A) takes at most N values and, by applying
Theorem 4.1,
I(A; B|i(A)) ≤ D PB|A kQi(A) |Pi(A) ≤ ϵ2 .
For the lower bound, note that if C = ∞, then in view of the upper bound above, NKL (Q, ϵ) = ∞
for any ϵ and (32.16) holds with equality. If C < ∞, Theorem 5.9 shows that C is the KL radius of
Q, namely, there exists P∗B , such that C = supPA ∈P(A) D(PB|A kP∗B |PA ) = supx∈A D(PB|A kP∗B |PA ).
√
In other words, NKL (Q, C + δ) = 1 for any δ > 0. Sending δ → 0 proves the equality of
(32.16).
Next we specialize Theorem 32.4 to our statistical setting (32.5) where the input A is θ and the
output B is Xn ∼ Pθ . Recall that P = {Pθ : θ ∈ Θ}. Let Pn ≜ {P⊗
i.i.d.
θ : θ ∈ Θ}. By tensorization of
n
⊗n ⊗n
KL divergence (Theorem 2.16(d)), D(Pθ kPθ′ ) = nD(Pθ kPθ′ ). Thus
ϵ
NKL (Pn , ϵ) ≤ NKL P, √ .
n
i i
i i
i i
Combining this with Theorem 32.4, we obtain the following upper bound on the capacity Cn in
terms of the KL metric entropy of the (single-letter) family P :
Cn ≤ inf nϵ2 + log NKL (P, ϵ) . (32.17)
ϵ>0
Theorem 32.5 Let P = {Pθ : θ ∈ Θ} and MH (ϵ) ≡ M(P, H, ϵ) the Hellinger packing number
of the set P , cf. (27.2). Then Cn defined in (32.5) satisfies
log e 2
Cn ≥ min nϵ , log MH (ϵ) − log 2 (32.18)
2
Proof. The idea of the proof is simple. Given a packing θ1 , . . . , θM ∈ Θ with pairwise distances
2
H2 (Qi , Qj ) ≥ ϵ2 for i 6= j, where Qi ≡ Pθi , we know that one can test Q⊗ n ⊗n
i vs Qj with error e
− nϵ2
,
nϵ 2
cf. Theorem 7.8 and Theorem 32.8. Then by the union bound, if Me− 2 < 12 , we can distinguish
these M hypotheses with error < 12 . Let θ ∼ Unif(θ1 , . . . , θM ). Then from Fano’s inequality we
get I(θ; Xn ) ≳ log M.
To get sharper constants, though, we will proceed via the inequality shown in Ex. I.58. In the
notation of that exercise we take λ = 1/2 and from Definition 7.24 we get that
1
D1/2 (Qi , Qj ) = −2 log(1 − H2 (Qi , Qj )) ≥ H2 (Qi , Qj ) log e ≥ ϵ2 log e i 6= j .
2
By the tensorization property (7.79) for Rényi divergence, D1/2 (Q⊗ n ⊗n
i , Qj ) = nD1/2 (Qi , Qj ) and
we get by Ex. I.58
X
M
1 X
M
1 n n o
I(θ; Xn ) ≥ − log exp − D1/2 (Qi , Qj )
M M 2
i=1 j=1
X
M
( a) 1M − 1 − nϵ22 1
≥− log e +
M M M
i=1
XM
1 − nϵ2
2 1 − nϵ2
2 1
≥− log e + = − log e + ,
M M M
i=1
i i
i i
i i
610
where in (a) we used the fact that pairwise distances are all ≥ nϵ2 except when i = j. Finally, since
A + B ≤ min(A,B) we conclude the result.
1 1 2
Note that, from the joint range (7.33) that D(PkQ) ≥ H2 (P, Q), a different (weaker) lower
bound on the KL risk also follows from Section 32.2.4 below.
Next we proceed to the converse of Theorem 32.5. The KL and Hellinger covering numbers
always satisfy
We next show that, assuming that the class P has a finite radius in Rényi divergence, (32.19)
and hence the capacity bound in Theorem 32.5 are tight up to logarithmic factors. Later in Sec-
tion 32.4 we will apply these results to the class of smooth densities, which has a finite χ2 -radius
(by choosing the uniform distribution as the center).
Theorem 32.6 Suppose that the family P has a finite Dλ radius for some λ > 1, i.e.
Rλ (P) ≜ inf sup Dλ (PkU) < ∞ , (32.20)
U P∈P
where Dλ is the Rényi divergence of order λ (see Definition 7.24). Then there exist ϵ0 and c
depending only on λ and Rλ , such that for all ϵ ≤ ϵ0 ,
r !
1
NKL P, cϵ log ≤ NH (ϵ) (32.21)
ϵ
and, consequently,
1
Cn ≤ inf 2
cnϵ log + log NH (ϵ) . (32.22)
ϵ≤ϵ0 ϵ
Proof. Let Q1 , . . . , QM be an ϵ-covering of P such that for any P ∈ P , there exists i ∈ [M] such
that H2 (P, Qi ) ≤ ϵ2 . Fix an arbitrary U and let Pi = ϵ2 U + (1 − ϵ2 )Qi . Applying Exercise I.59
yields
2λ 1
D(PkPi ) ≤ 24ϵ 2
log + Dλ (PkU) .
λ−1 ϵ
Optimizing over U to approach (32.20) proves (32.21). Finally, (32.22) follows from applying
(32.21) to (32.17).
• Instead of directly studying the risk R∗KL (n), (32.7) relates it to a cumulative risk Cn
• The cumulative risk turns out to be equal to a capacity, which can be conveniently bounded in
terms of covering numbers.
i i
i i
i i
In this subsection we want to point out that while the second step is very special to KL (log-loss),
the first idea is generic. Namely, we have the following relationship between individual risk (also
known as batch loss) and cumulative risk (also known as online loss), which were previously
introduced in Section 13.6 in the context of universal compression.
where both infima are over measurable (possibly randomized) estimators P̂t : X t−1 → P(X ), and
i.i.d.
the expectations are over Xi ∼ P and the randomness of the estimators. Then we have
X
n
nR∗n ≤ Cn ≤ Cn−1 + R∗n ≤ R∗t . (32.25)
t=1
Pn−1
Thus, if the sequence {R∗n } satisfies R∗n 1n t=0 R∗t then Cn nR∗n . Conversely, if nα− ≲ Cn ≲
nα+ for all n and some α+ ≥ α− > 0, then
α
(α− −1) α+
n − ≲ R∗n ≲ nα+ −1 . (32.26)
(In other words, Cn is bounded by using the Cn−1 -optimal online learner for first n − 1 rounds and
the R∗n -optimal batch learner for the last round.) The third inequality in (32.25) follows from the
second by induction and C1 = R∗1 .
To derive (32.26) notice that the upper bound on R∗n follows from (32.25). For the lower bound,
notice that the sequence R∗n is non-increasing and hence we have for any n < m
X
m−1 X
n−1
Ct
Cm ≤ R∗t ≤ + (m − n)R∗n . (32.27)
t
t=0 t=0
4
Note that for KL loss, Cn and R∗n coincide with AvgReg∗n and BatchReg∗n defined in (13.34) and (13.35), respectively.
i i
i i
i i
612
α+
For Hellinger loss, the answer is yes, although the metric entropy involved is with respect to
the Hellinger distance not KL divergence. The basic construction is due to Le Cam and further
developed by Birgé. The main idea is as follows: Fix an ϵ-covering {P1 , . . . , PN } of the set of
distributions P . Given n observations drawn from P ∈ P , let us test which ball P belongs to;
this allows us to estimate P up to Hellinger loss ϵ. This can be realized by a pairwise comparison
argument of testing the (composite) hypothesis P ∈ B(Pi , ϵ) versus P ∈ B(Pj , ϵ). This program
can be further refined to involve on the local entropy of the model.
The optimal test that achieves (32.28) is the likelihood ratio given by the worst-case mixtures, that
is, the closest5 pair of mixture (P∗n , Q∗n ) such that TV(P∗n , Q∗n ) = TV(co(P ⊗n ), co(Q⊗n )).
5
In case the closest pair does not exist, we can replace it by an infimizing sequence.
i i
i i
i i
The exact result (32.28) is unwieldy as the RHS involves finding the least favorable priors over
the n-fold product space. However, there are several known examples where much simpler and
explicit results are available. In the case when P and Q are TV-balls around P0 and Q0 , Huber [221]
showed that the minimax optimal test has the form
( n )
X dP0
n ′ ′′
ϕ(x ) = 1 min c , max c , log ( Xi ) >t .
dQ0
i=1
(See also Ex. III.31.) However, there are few other examples where minimax optimal tests are
known explicitly. Fortunately, as was shown by Le Cam, there is a general “single-letter” upper
bound in terms of the Hellinger separation between P and Q. It is the consequence of the more
general tensorization property of Rényi divergence in Proposition 7.25 (of which Hellinger is a
special case).
Theorem 32.8
≤ e− 2 H
n 2
(co(P),co(Q))
min sup P(ϕ = 1) + sup Q(ϕ = 0) , (32.29)
ϕ P∈P Q∈Q
Remark 32.2 For the case when P and Q are Hellinger balls of radius r around P0 and
Q0 (which arises in the proof of Theorem 32.9 below), respectively, Birgé [56] constructed an
explicit test.
nPNamely, under the assumption
o H(P0 , Q0 ) > 2.01r, there q is a test of the form
n n α+βψ(Xi ) −Ω(nr2 ) dP0
ϕ(x ) = 1 i=1 log β+αψ(Xi ) > t attaining error e , where ψ(x) = dQ 0
(x) and α, β > 0
depend only on H(P0 , Q0 ).
Remark 32.3 Here is an example where Theorem 32.8 is (very) loose. Consider P =
{Ber(1/2)} and Q = {Ber(0), Ber(1)}. Then co(P) ⊂ co(Q)and so the upper bound in (32.29) is
trivial. On the other hand, the test that declares P ∈ Q if we see all 0’s or all 0’s has exponentially
small probability of error.
In the sequel we will apply Theorem 32.8 to two disjoint Hellinger balls (both are convex).
i i
i i
i i
614
Theorem 32.9 (Le Cam-Birgé) Denote by NH (P, ϵ) the ϵ-covering number of the set P
under the Hellinger distance (cf. (27.1)). Let ϵn be such that
Then there exists an estimator P̂ = P̂(X1 , . . . , Xn ) taking values in P such that for any t ≥ 1,
and, consequently,
Proof of Theorem 32.9. It suffices to prove the high-probability bound (32.31). Abbreviate ϵ =
ϵn and N = NH (P, ϵn ). Let P1 , · · · , PN be a maximal ϵ-packing of P under the Hellinger distance,
which also serves as an ϵ-covering (cf. Theorem 27.2). Thus, ∀i 6= j,
H(Pi , Pj ) ≥ ϵ,
H(P, Pi ) ≤ ϵ,
Next, consider the following pairwise comparison problem, where we test two Hellinger balls
(composite hypothesis) against each other:
Hi : P ∈ B(Pi , ϵ) vs Hj : P ∈ B(Pj , ϵ)
i i
i i
i i
Since both B(Pi , ϵ) and B(Pj , ϵ) are convex, applying Theorem 32.8 yields a test ψij =
ψij (X1 , . . . , Xn ), with ψij = 0 corresponding to declaring P ∈ B(Pi , ϵ), and ψij = 1 corresponding
to declaring P ∈ B(Pj , ϵ), such that ψij = 1 − ψji and the following large deviations bound holds:
for all i, j, s.t. H(Pi , Pj ) ≥ δ ,
where we used the triangle inequality of Hellinger distance: for any P ∈ B(Pi , ϵ) and any Q ∈
B(Pj , ϵ),
Basically, Ti records the maximum distance from Pi to those Pj outside the δ -neighborhood of Pi
that is confusable with Pi given the present sample. Our density estimator is defined as
Now for the proof of correctness, assume that P ∈ B(P1 , ϵ). The intuition is that, we should
expect, typically, that T1 = 0, and furthermore, Tj ≥ δ 2 for all j such that H(P1 , Pj ) ≥ δ . Note
that by the definition of Ti and the symmetry of the Hellinger distance, for any pair i, j such that
H(Pi , Pj ) ≥ δ , we have
max{Ti , Tj } ≥ H(Pi , Pj ).
Consequently,
n o
H(P̂, P1 )1 H(P̂, P1 ) ≥ δ = H(Pi∗ , P1 )1 {H(Pi∗ , P1 ) ≥ δ}
≤ max{Ti∗ , T1 }1 {max{Ti∗ , T1 } ≥ δ} = T1 1 {T1 ≥ δ},
where the last equality follows from the definition of i∗ as a global minimizer in (32.34). Thus, for
any t ≥ 1,
i i
i i
i i
616
Theorem 32.10 (Le Cam-Birgé: local entropy version) Let ϵn be such that
nϵ2n ≥ log Nloc (P, ϵn ) ∨ 1. (32.38)
Then there exists an estimator P̂ = P̂(X1 , . . . , Xn ) taking values in P such that for any t ≥ 2,
sup P[H(P, P̂) > 4tϵn ] ≤ e−t
2
(32.39)
P∈P
and hence
sup EP [H2 (P, P̂)] ≲ ϵ2n (32.40)
P∈P
Remark 32.4 (Doubling dimension) Suppose that for some d > 0, log Nloc (P, ϵ) ≤ d log 1ϵ
holds for all sufficiently large small ϵ; this is the case for finite-dimensional models where the
Hellinger distance is comparable with the vector norm by the usual volume argument (Theo-
rem 27.3). Then we say the doubling dimension (also known as the Le Cam dimensionLe Cam
dimension|see doubling dimension [430]) of P is at most d; this terminology comes from the
fact that the local entropy concerns covering Hellinger balls using balls of half the radius. Then
Theorem 32.10 shows that it is possible to achieve the “parametric rate” O( dn ). In this sense, the
doubling dimension serves as the effective dimension of the model P .
Proof. We proceed by induction on k. The base case of k = 0 follows from the definition (32.37).
For k ≥ 1, assume that (32.41) holds for k − 1 for all P ∈ P . To prove it for k, we construct a cover
of B(P, 2k η) ∩ P as follows: first cover it with 2k−1 η -balls, then cover each ball with η/2-balls. By
the induction hypothesis, the total number of balls is at most
NH (B(P, 2k η) ∩ P, 2k−1 η) · sup NH (B(P′ , 2k−1 η) ∩ P, η/2) ≤ Nloc (ϵ) · Nloc (ϵ)k−1
P′ ∈P
i i
i i
i i
where (a) follows from from (32.33); (c) follows from the assumption that log Nloc ≤ nϵ2 and
k ≥ ℓ ≥ log2 t ≥ 1; (b) follows from the following reasoning: since {P1 , . . . , PN } is an ϵ-packing,
we have
|Gk | ≤ M(Ak , ϵ) ≤ N(Ak , ϵ/2) ≤ N(B(P1 , 2k+1 δ) ∩ P, ϵ/2) ≤ Nloc (ϵ)k+3
where the first and the last inequalities follow from Theorem 27.2 and Lemma 32.11 respectively.
As an application of Theorem 32.10, we show that parametric rate (namely, dimension divided
by the sample size) is achievable for models with locally quadratic behavior, such as those smooth
parametric models (cf. Section 7.11 and in particular Theorem 7.23).
Proof. It suffices to bound the local entropy Nloc (P, ϵ) in (32.37). Fix θ0 ∈ Θ. Indeed, for any
η > t0 , we have NH (B(Pθ0 , η) ∩ P, η/2) ≤ NH (P, t0 ) ≲ 1. For ϵ ≤ η ≤ t0 ,
( a)
NH (B(Pθ0 , η) ∩ P, η/2) ≤ N∥·∥ (B∥·∥ (θ0 , η/c), η/(2C))
d
vol(B∥·∥ (θ0 , η/c + η/(2C)))
(b) 2C
≤ = 1+
vol(B∥·∥ (θ0 , η/(2C))) c
i i
i i
i i
618
where (a) and (b) follow from (32.42) and Theorem 27.3 respectively. This shows that
log Nloc (P, ϵ) ≲ d, completing the proof by applying Theorem 32.10.
Lemma 32.13
M(ϵ/2)
≤ Mloc (ϵ) ≤ M(ϵ)
M(ϵ)
Proof. The upper bound is obvious. For the lower bound, Let P1 , . . . , PM be a maximal ϵ-packing
for P with M = M(ϵ). Let Q1 , . . . , QM′ be a maximal ϵ/2-packing for P with M′ = M(ϵ/2).
Partition E = {P1 , . . . , PM } into the Voronoi cells centered at each Qi , namely, Ei ≜ {Pj :
H(Pj , Qi ) = mink∈[M] H(Pk , Qi )} (with ties broken arbitrarily), so that E1 , . . . , EM′ are disjoint
and E = ∪i∈[M′ ] Ei . Thus max |Ei | ≥ M/M′ . Finally, note that each Ei ⊂ B(Qi , ϵ) because E is also
an ϵ-covering.
Note that unlike the definition of Nloc in (32.37) we are not taking the supremum over the scale
η ≥ ϵ. For this reason, we cannot generally apply Theorem 27.2 to conclude that Nloc (ϵ) ≥ Mloc (ϵ).
In all instances known to us we have log Nloc log Mloc , in which case the following general result
provides a minimax lower bound that matches the upper bound in Theorem 32.10 up to logarithmic
factors.
Theorem 32.14 Suppose that the Dλ radius Rλ (P) of the family P is finite for some λ > 1;
cf. (32.20). There exists constants c = c(λ) and ϵ < ϵ0 (λ) such that whenever n and ϵ < ϵ0 are
such that
2 1
c(λ)nϵ log 2 + Rλ (P) + 2 log 2 < log Mloc (ϵ), (32.43)
ϵ
i i
i i
i i
Proof. Let M = Mloc (P, ϵ). From the definition there exists an ϵ/2-packing P1 , . . . , PM in some
Hellinger ball B(R, ϵ).
i.i.d.
Let θ ∼ Unif([M]) and Xn ∼ Pθ conditioned on θ. Then from Fano’s inequality in the form
of Theorem 31.3 we get
ϵ 2 I(θ; Xn ) + log 2
sup E[H (P, P̂)] ≥
2
1−
P∈P 4 log M
It remains to show that
I(θ; Xn ) + log 2 1
≤ . (32.44)
log M 2
To that end for an arbitrary distribution U define
Q = ϵ2 U + ( 1 − ϵ2 )R .
We first notice that from Ex. I.59 we have that for all i ∈ [M]
λ 1
D(Pi kQ) ≤ 8(H (Pi , R) + 2ϵ )
2 2
log 2 + Dλ (Pi kU)
λ−1 ϵ
provided that ϵ < 2− 2(λ−1) ≜ ϵ0 . Since H2 (Pi , R) ≤ ϵ2 , by optimizing U (as the Dλ -center of P )
5λ
we obtain
λ 1 c(λ) 2 1
inf max D(Pi kQ) ≤ 24ϵ 2
log 2 + Rλ ≤ ϵ log 2 + Rλ .
U i∈[M] λ−1 ϵ 2 ϵ
By Theorem 4.1 we have
nc(λ) 2 1
I(θ; X ) ≤ n
max D(P⊗ n ⊗n
i kQ ) ≤ ϵ log 2 + Rλ .
i∈[M] 2 ϵ
This final bound and condition (32.43) then imply (32.44) and the statement of the theorem.
Finally, we mention that for sufficiently regular models wherein the KL divergence and the
squared Hellinger distances are comparable, the upper bound in Theorem 32.10 based on local
entropy gives the exact minimax rate. Models of this type include GLM and more generally
Gaussian mixture models with bounded centers in arbitrary dimensions [232].
Then
i i
i i
i i
620
Theorem 32.16 (Yatracos [465]) There exists a universal constant C such that the following
i.i.d.
holds. Let X1 , . . . , Xn ∼ P ∈ P , where P is a collection of distributions on a common measurable
space (X , E). For any ϵ > 0, there exists a proper estimator P̂ = P̂(X1 , . . . , Xn ) ∈ P , such that
1
sup EP [TV(P̂, P)2 ] ≤ C ϵ2 + log N(P, TV, ϵ) (32.45)
P∈P n
For loss function that is a distance, a natural idea for obtaining proper estimator is the minimum
distance estimator. In the current context, we compute the minimum-distance projection of the
empirical distribution on the model class P :6
Pmin-dist = argmin TV(P̂n , P)
P∈P
1
Pn
where P̂n = ni=1 δXi is the empirical distribution. However, since the empirical distribution is
discrete, this strategy does not make sense if elements of P have densities. The reason for this
degeneracy is because the total variation distance is too strong. The key idea is to replace TV,
which compares two distributions over all measurable sets, by a proxy, which only inspects a
“low-complexity” family of sets.
To this end, let A ⊂ E be a finite collection of measurable sets to be specified later. Define a
pseudo-distance
dist(P, Q) ≜ sup |P(A) − Q(A)|. (32.46)
A∈A
(Note that if A = E , then this is just TV.) One can verify that dist satisfies the triangle inequality.
As a result, the estimator
P̃ ≜ argmin dist(P, P̂n ), (32.47)
P∈P
as a minimizer, satisfies
dist(P̃, P) ≤ dist(P̃, P̂n ) + dist(P, P̂n ) ≤ 2dist(P, P̂n ). (32.48)
6
Here and below, if the minimizer does not exist, we can replace it by an infimizing sequence.
i i
i i
i i
In addition, applying the binomial tail bound and the union bound, we have
C0 log |A|
E[dist(P, P̂n )2 ] ≤ . (32.49)
n
for some absolute constant C0 .
The main idea of Yatracos [465] boils down to the following choice of A: Consider an
ϵ-covering {Q1 , . . . , QN } of P in TV. Define the set
dQi dQj
Aij ≜ x : ( x) ≥ ( x)
d( Qi + Qj ) d(Qi + Qj )
and the collection (known as the Yatracos class)
To see this, we only need to justify the upper bound. For any P, Q ∈ P , there exists i, j ∈ [N], such
that TV(P, Pi ) ≤ ϵ and TV(Q, Qj ) ≤ ϵ. By the key observation that dist(Qi , Qj ) = TV(Qi , Qj ), we
have
Finally, we analyze the estimator (32.47) with A given in (32.50). Applying (32.51) and (32.48)
yields
i i
i i
i i
622
Since 3TV(P, Q∗ ) ≤ 3ϵ + 3 minP′ ∈P TV(P, P′ ) we can see that the estimator also works for
“misspecified case”. Surprisingly, the multiplier 3 is not improvable if the estimator is required to
be proper (inside P ), cf. [70].
Capitalizing on the metric entropy of smooth densities studied in Section 27.4, we will prove
this result by applying the entropic upper bound in Theorem 32.1 and the minimax lower bound
based on Fano’s inequality in Theorem 31.3. However, Theorem 32.17 pertains to the L2 rather
than KL risk. This can be fixed by a simple reduction.
Lemma 32.18 Let F ′ denote the collection of f ∈ F which is bounded from below by 1/2.
Then
Proof. The left inequality follows because F ′ ⊂ F . For the right inequality, we apply a sim-
i.i.d.
ulation argument. Fix some f ∈ F and we observe X1 , . . . , Xn ∼ f. Let us sample U1 , . . . , Un
independently and uniformly from [0, 1]d . Define
(
Ui w.p. 12 ,
Zi =
Xi w.p. 12 .
i.i.d.
Then Z1 , . . . , Zn ∼ g = 12 (1 + f) ∈ F ′ . Let ĝ be an estimator that achieves the minimax risk
R∗L2 (n; F ′ ) on F ′ . Consider the estimator f̂ = 2ĝ − 1. Then kf − f̂k22 = 4kg − ĝk22 . Taking the
supremum over f ∈ F proves R∗L2 (n; F) ≤ 4R∗L2 (n; F ′ ).
i i
i i
i i
Lemma 32.18 allows us to focus on the subcollection F ′ , where each density is lower bounded
by 1/2. In addition, each β -smooth density in F is also upper bounded by an absolute constant.
Therefore, the KL divergence and squared L2 distance are in fact equivalent on F ′ , i.e.,
dQ
Lemma 32.19 Suppose both f = dP
dμ and g = dμ are upper and lower bounded by absolute
constants c and C respectively. Then
Z Z
1 1
dμ(f − g) ≤ H (P, Q) ≤ D(PkQ) ≤ χ (PkQ) ≤
2 2 2
dμ(f − g)2 .
4C c
R R
Proof. For the upper bound, applying (7.34), D(PkQ) ≤ χ2 (PkQ) = dμ (f−gg) ≤ dμ (f−gg) .
2 2
1
c
R R
For the lower bound, applying (7.33), D(PkQ) ≥ H2 (P, Q) = dμ √(f−g√) 2 ≥
2
1
4C dμ(f −
( f+ g)
g) 2 .
Proof. In view of Lemma 32.18, it suffices to consider R∗L2 (n; F ′ ). For the upper bound, we have
( a)
R∗L2 (n; F ′ ) R∗KL (n; F ′ )
(b)
1 ′
≲ inf ϵ + log NKL (F , ϵ)
2
ϵ>0 n
( c) 1 ′
inf ϵ + log N(F , k · k2 , ϵ)
2
ϵ>0 n
(d) 1 2β
inf ϵ2 + d/β n− 2β+d .
ϵ>0 nϵ
where both (a) and (c) apply (32.53), so that both the risk and the metric entropy are equivalent
for KL and L2 distance; (b) follows from Theorem 32.1; (d) applies the metric entropy (under L2 )
of the Lipschitz class from Theorem 27.14 and the fact that the metric entropy of the subclass F ′
is at most that of the full class F .
For the lower bound, we apply Fano’s inequality. Applying Theorem 27.14 and the rela-
tion between covering and packing numbers in Theorem 27.2, we have log N(F, k · k2 , ϵ)
log M(F, k · k2 , ϵ) ϵ−d/β . Fix ϵ to be specified and let f1 , . . . , fM be an ϵ-packing in F , where
M ≥ exp(Cϵ−d/β ). Then g1 , . . . , gM is an 2ϵ -packing in F ′ , with gi = (fi + 1)/2. Applying Fano’s
inequality in Theorem 31.3, we have
∗ ′ C′n
RL2 (n; F ) ≳ ϵ 1 −
2
(32.54)
log M
i i
i i
i i
624
i.i.d.
where C′n is the capacity (or KL radius) from f ∈ F ′ to X1 , . . . , Xn ∼ f. Using (32.17) and
Lemma 32.19, we have
C′n ≤ inf (nϵ2 + log NKL (F ′ , ϵ)) inf (nϵ2 + log N(F ′ , k · k2 , ϵ)) ≲ inf (nϵ2 + ϵ−d/β ) nd/(2β+d) .
ϵ>0 ϵ>0 ϵ>0
β
− 2β+
Thus choosing ϵ = cn d for sufficiently small c ensures C′n ≤ 1
2 log M and hence R∗L2 (n; F ′ ) ≳
2β
− 2β+
ϵ2 n d .
Remark 32.6 The above lower bound proof, based on Fano’s inequality and the intuition that
small mutual information implies large estimation error, requires us to upper bound the capacity
C′n of the subcollection F ′ . On the other hand, as hinted earlier in (32.11) (and shown next), the
C′
risk is expected to be proportional to nn , which suggests one should lower bound the capacity
using metric entropy. Indeed, this is possible: Applying Theorem 32.5,
C′n ≳ min{nϵ2 , log M(F ′ , H, ϵ)} − 2
min{nϵ2 , log M(F ′ , k · k2 , ϵ)} − 2
min{nϵ2 , ϵ−d/β } − 2 nd/(2β+d) ,
where we picked the same ϵ as in the previous proof. So C′n nd/(2β+d) . Finally, applying the
online-to-batch conversion (32.26) in Proposition 32.7 (or equivalently, combining (32.7) and
C′ 2β
(32.9)) yields R∗KL (n; F ′ ) nn n− 2β+d .
Remark 32.7 Note that the above proof of Theorem 32.17 relies on the entropic risk bound
(32.1), which, though rate-optimal, is not attained by a computationally efficient estimator. (The
same criticism also applies to (32.2) and (32.3) for Hellinger and total variation.) To remedy this,
for the squared loss, a classical idea is to apply the kernel density estimator (KDE) – cf. Section 7.9.
Pn
Specifically, one compute the convolution of the empirical distribution P̂n = 1n i=1 δXi with a
kernel function K(·) whose shape and bandwidth are chosen according to the smooth constraint.
For example, for Lipschitz densities, the optimal rate in Theorem 32.17 can be attained by a box
kernel K(·) = 2h1
1 {| · | ≤ h} with bandwidth h = n−1/3 (cf. e.g. [424, Sec. 1.2]).
The classical literature of density estimation is predominantly concerned with the L2 loss,
mainly due to the convenient quadratic nature of the loss function that allows bias-variance decom-
position and facilitates the analysis of KDE. However, L2 -distance between densities has no clear
operational meaning. Next we consider the three f-divergence losses introduced at the beginning
of this chapter. Paralleling Theorem 32.17, we have
β
R∗TV (n; F) ≜ inf sup E TV(f, f̂) n− 2β+d (32.55)
f̂ f∈F
β
R∗H2 (n; F) ≜ inf sup E H2 (f, f̂) n− β+d (32.56)
f̂ f∈F
β β
n− β+d ≲ R∗KL (n; F) ≜ inf sup E D(fkf̂) n− β+d (log n) β+d
d
(32.57)
f̂ f∈F
For TV loss, the upper bound follows from the L2 -rates in Theorem 32.17 and kf − f̂k1 ≤ kf − f̂k2
by Cauchy-Schwarz; alternatively, we can also apply Yatracos’ estimator from Theorem 32.16.
i i
i i
i i
The matching lower bound can be shown using the same argument based on Fano’s method as the
metric entropy under L1 -distance behaves the same (Theorem 27.14).
Recall that for L2 /L1 the rate is derived by considering a subclass F ′ , which has the same
estimation rate, but on which Lp H KL, cf. Lemma 32.18. It was thus, surprising, when
Birgé [54] found the Hellinger rate on the full family F to be different.
To derive the H2 result (32.56), first note that neither upper or a lower bound follow from the
2
generic comparison inequality H2 ≤ TV ≤ H in (7.22). Instead, what works is comparing entropy
numbers via the first of these inequalities. Specifically, we have
log N(F, H, ϵ) ≤ log N(F, TV, ϵ2 /2) ≲ ϵ− β ,
2d
(32.58)
where in the last step we invoked Theorem 27.14. Combining this with Le Cam-Birgé method
(Theorem 32.9) proves the upper bound part of (32.56).7
The lower bound follows from a similar argument as in the proof of Theorem 32.17, although
the construction is more involved. Below c0 , c1 , . . . are absolute constants. Fix a small ϵ and con-
sider the subcollection F ′ = {f ∈ F : f ≥ ϵ} of densities lower bounded by ϵ. We first construct a
Hellinger packing of F ′ . Applying the same argument in Lemma 32.13 yields an L2 -packing in an
L∞ -local ball: there exist f0 , f1 , . . . , fM ∈ F , such that kfi − fj k2 ≥ c0 ϵ for all i 6= j, kfi − f0 k∞ ≤ ϵ
for all i, and M ≥ M(F, k · k2 , c0 ϵ)/M(F, k · k∞ , ϵ) ≥ exp(c1 ϵ−d/β ), the last step applying The-
orem 32.17 for sufficiently small c0 . Let hi = fi − f0 and define fi by fi (x) = ϵ + hi (2x) for
x ∈ [0, 12 ]d and extend fi smoothly elsewhere so that it is a valid density in F ′ . Then f1 , . . . , fM
√ R (f −fj )2
form a Hellinger Ω( ϵ)-packing of F ′ , since H2 (fi , fj ) ≥ [0, 1 ]d √ i √ 2
≥ c2 ϵ. (This construc-
2 ( fi + fj )
tion also shows that the metric entropy bound (32.58) is tight.) It remains to bound the capacity
C′n of F ′ as a function of n and ϵ. Note that for any f, g ∈ F ′ , we have as in Lemma 32.19
D(fkg) ≤ χ2 (fkg) ≤ kf − gk22 /ϵ. Thus NKL (F ′ , δ 2 /ϵ) ≤ N(F ′ , k · k2 , δ). Applying (32.17) and
Lemma 32.19, C′n ≲ infδ>0 (nδ 2 /ϵ + δ −d/β ) (n/ϵ)d/(2β+d) . Applying Fano’s inequality as in
(32.54) yields an Ω(ϵ) lower bound in squared Hellinger, provided log M ≥ 2C′n . This is achieved
β
by choosing ϵ = c3 n− β+d , completing the proof of (32.56).
For KL loss, the lower bound of (32.57) follows from (32.56) because D ≥ H2 . For the upper
bound, applying (32.7) in Theorem 32.1, we have R∗KL (n; F) ≤ Cn+ n+1
1 , where Cn is the capacity
i.i.d.
(32.5) of the channel between f and Xn ∼ f ∈ F . This capacity can be bounded, in turn, using
Theorem 32.6 via the Hellinger entropy. Applying (32.58) in conjunction with (32.22), we obtain
Cn ≤ infϵ (nϵ2 log 1ϵ + ϵ−2d/β ) (n log n)d/(d+β) , proving the upper bound (32.57).8 To the best
of our knowledge, resolving the logarithmic gap in (32.57) remains open.
7
Comparing (32.56) with (32.52), we see that the Hellinger rate coincides with the L2 rate upon replacing the smoothness
parameter β by β/2. Note that Hellinger distance is the L2 between root-densities. For β ≤ 1, one can indeed show that
√
f is β/2-Hölder continuous, which explains the result in (32.56). However, this interpretation fails for general β. For
√
example, Glaeser [191] constructs an infinitely differentiable f such that f has points with arbitrarily large second
derivative.
8
This capacity bound is tight up to logarithmic factors. Note that the construction in the proof of the lower bound in (32.56)
shows that log M(F , H, ϵ) ≳ ϵ−2d/β , which, via Theorem 32.5, implies that Cn ≥ nd/(d+β) .
i i
i i
i i
In this chapter we explore statistical implications of the following effect. For any Markov chain
U→X→Y→V (33.1)
we know from the data-processing inequality (DPI, Theorem 3.7) that
I(U; Y) ≤ I(U; X), I(X; V) ≤ I(Y; V) .
However, something stronger can often be said. Namely, if the Markov chain (33.1) factor through
a known noisy channel PY|X : X → Y , then oftentimes we can prove strong data processing
inequalities (SDPI):
I(U; Y) ≤ η I(U; X), I(X; V) ≤ η (p) I(Y; V) ,
where coefficients η = η(PY|X ), η (p) = η (p) (PY|X ) < 1 only depend on the channel and not the
(generally unknown or very complex) PU,X or PY,V . The coefficients η and η (p) approach 0 for
channels that are very noisy (for example, η is always up to a constant factor equal to the Hellinger-
squared diameter of the channel).
The purpose of this chapter is twofold. First, we want to introduce general properties of the
SDPI coefficients. Second, we want to show how SDPIs help prove sharp lower (impossibility)
bounds on statistical estimation questions. The flavor of the statistical problems in this chapter
is different from the rest of the book in that here the information about unknown parameter θ
is either more “thinly spread” across a high dimensional vector of observations than in classical
X = θ + Z type of tasks (cf., spiked Wigner and tree-coloring examples), or distributed across
different terminals (as in correlation and mean estimation examples).
We point out that SDPIs are an area of current research and multiple topics are not covered by
our brief exposition here. For more, we recommend surveys [345] and [352], of which the latter
explores the functional-theoretic side of SDPIs and their close relation to logarithmic Sobolev
inequalities – a topic we entirely omit in our book.
626
i i
i i
i i
a a
OR a∨b AND a∧b a NOT a′
b b
Z Z Z
a a
OR ⊕ Y AND ⊕ Y a NOT ⊕ Y
b b
Figure 33.1 Basic building blocks of any boolean circuit. Top row shows the classical (Shannon) noiseless
gates. Bottom row shows noisy (von Neumann) gates. Here Z ∼ Ber(δ) is assumed to be independent of the
inputs.
the groundwork for the digital computers, and he was bothered by the following question. Since
real physical (and biological) networks are composed of imperfect elements, can we compute any
boolean function f if the constituent basis gates are in fact noisy? His model of the δ -noisy gate
(bottom row of Figure 33.1) is to take a primitive noiseless gate and apply a (mod 2) additive noise
to the output.
In this case, we have a network of the noisy gates, and such network necessarily has noisy (non-
deterministic) output. Therefore, when we say that a noisy gate circuit C computes f we require
the existence of some ϵ0 = ϵ0 (δ) (that cannot depend on f) such that
1
P[C(x1 , . . . , xn ) 6= f(x1 , . . . , xn ) ≤ − ϵ0 (33.2)
2
where C(x1 , . . . , xn ) is the output of the noisy circuit with inputs x1 , . . . , xn . If we build the circuit
according to the classical (Shannon) methods, we would obviously have catastrophic error accumu-
lation so that deep circuits necessarily have ϵ0 → 0. At the same time, von Neumann was bothered
by the fact that evidently our brains operate with very noisy gates and yet are able to carry very
long computations without mistakes. His thoughts culminated in the following ground-breaking
result.
Theorem 33.1 (von Neumann [443]) There exists δ ∗ > 0 such that for all δ < δ ∗ it is
possible to compute every boolean function f via δ -noisy 3-majority gates.
von Neumann’s original estimate δ ∗ ≈ 0.087 was subsequently improved by Pippenger. The
main (still open) question of this area is to find the largest δ ∗ for which the above theorem holds.
Condition in (33.2) implies the output should be correlated with the inputs. This requires the
mutual information between the inputs (if they are random) and the output to be greater than
zero. We now give a theorem of Evans and Schulman that gives an upper bound to the mutual
information between any of the inputs and the output. We will prove the theorem in Section 33.3
as a consequence of the more general directed information percolation theory.
Theorem 33.2 ([158]) Suppose an n-input noisy boolean circuit composed of gates with at
most K inputs and with noise components having at most δ probability of error. Then, the mutual
i i
i i
i i
628
X1 X2 X3 X4 X5 X6 X7 X8 X9
G1 G2 G3
G4 G5
G6
where di is the minimum length between Xi and Y (i.e, the minimum number of gates required to
be passed through until reaching Y).
Theorem 33.2 implies that noisy computation is only possible for δ < 12 − 2√1 K . This is the best
known threshold. To illustrate this result consider the circuit given on Figure 33.2. That circuit has
9 inputs and composed of gates with at most 3 inputs. The 3-input gates are G4 , G5 and G6 . The
minimum distance between X3 and Y is d3 = 2, and the minimum distance between X5 and Y is
d5 = 3. If Gi ’s are δ -noisy gates, we can invoke Theorem 33.2 between any input and the output.
Now, the main conceptual implication of Theorem 33.2 is in demonstrating that some cir-
cuits are not computable with δ -noisy gates unless δ is sufficiently small. For instance, take
f(X1 , . . . , Xn ) = XOR(X1 , . . . , Xn ). Note that function f depends essentially on every input Xi , since
XOR(X1 , . . . , Xn ) = XOR XOR(X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ), Xi . Thus, any circuit that ignores
any one of the inputs Xi will not be able to satisfy requirement (33.2). Since we are composing
log n
our circuit via K-input gates, this implies that there must exist at least one input Xi with di ≥ log K
(indeed, going from Y up we are to make K-ary choice at each gate and thus at height d we can at
most reach dK inputs). Now as n → ∞ we see from Theorem 33.2 that I(Xi ; Y) → 0 unless
∗ 1 1
δ ≤ δES = − √ .
2 2 K
∗
As we argued I(Xi ; Y) → 0 is incompatible with satisfying (33.2). Hence the value of δES gives
a (at present, the tightest) upper bound for the noise limit under which reliable computation with
K-input gates is possible.
Computation with formulas Note that the graph structure given in Figure 33.2 contains some
undirected loops. A formula is a type of boolean circuits that does not contain any undirected
loops unlike the case in Figure 33.2. In other words, for a formula the underlying graph structure
forms a tree. For example, removing one of the outputs of G2 of Figure 33.2 we obtain a formula
as given on Figure 33.3.
i i
i i
i i
X1 X2 X3 X4 X5 X6 X7 X8 X9
G1 G2 G3
G4 G5
G6
For computation with formulas much stronger results are available. For example, for any odd K,
the threshold is exactly known from [157, Theorem 1]. Specifically, it is shown there that we can
compute reliably any boolean function f that is represented with a formula compose of K-input
δ -noisy gates (with K odd) if δ < δf∗ , and no such computation is possible for δ > δf∗ , where
1 2K−1
δf∗ = − K−1
2 K K− 1
2
Since every formula is also a circuit, we of course have δf∗ < δES
∗
, so that there is no contradiction
with Theorem 33.2. However, comparing the thresholds gives us ability to appreciate tightness of
Theorem 33.2 for general boolean circuits. Indeed, for large K we have an approximation
p
∗ 1 π /2
δf ≈ − √ , K 1 ,
2 2 K
∗
whereas the estimate of Evans-Schulman δES ≈ 1
2 − 1
√
2 K
. We can thus see that it has at least the
right rate of convergence to 1/2 for large K.
Recall that the data-processing inequality (DPI) in Theorem 7.4 states that Df (PX kQX ) ≥
Df (PY kQY ). The concept of the Strong DPI introduced above quantifies the multiplicative decrease
between the two f-divergences.
i i
i i
i i
630
We note that in general ηf (PY|X ) is hard to compute. However, total variation is an exception.
This case is obvious. Take PX = δx0 and QX = δx′0 .1 Then from the definition of ηTV , we
have ηTV ≥ TV(PY|X=x0 , PY|X=x′0 ) for any x0 and x′0 , x0 6= x′0 .
• ηTV ≤ supx0 ̸=x′ TV(PY|X=x0 , PY|X=x′0 ):
0
Define η̃ ≜ supx0 ̸=x′ TV(PY|X=x0 , PY|X=x′0 ). We consider the discrete alphabet case for simplicity.
0
Fix any PX , QX and PY = PX ◦ PY|X , QY = QX ◦ PY|X . Observe that for any E ⊆ Y
Now suppose there are random variables X0 and X′0 having some marginals PX and QX respec-
tively. Consider any coupling π X0 ,X′0 with marginals PX and QX respectively. Then averaging
(33.4) and taking the supremum over E, we obtain
Now the left-hand side equals TV(PY , QY ) by Theorem 7.7(a). Taking the infimum over
couplings π the right-hand side evaluates to TV(PX , QX ) by Theorem 7.7(b).
Example 33.1 (ηTV of a Binary Symmetric Channel) The ηTV of the BSCδ is given by
Theorem 33.5
If (U; Y)
ηf (PY|X ) = sup .
PUX : U→X→Y f (U; X)
I
1
δx0 is the probability distribution with P(X = x0 ) = 1
i i
i i
i i
Recall that for any Markov chain U → X → Y, DPI states that If (U; Y) ≤ If (U; X) and Theorem
33.5 gives the stronger bound
Proof. First, notice that for any u0 , we have Df (PY|U=u0 kPY ) ≤ ηf Df (PX|U=u0 kPX ). Averaging the
above expression over any PU , we obtain
If (U; Y) ≤ ηf If (U; X)
Second, fix P̃X , Q̃X and let U ∼ Bern(λ) for some λ ∈ [0, 1]. Define the conditional distribution
PX|U as PX|U=1 = P̃X , PX|U=0 = Q̃X . Take λ → 0, then (see [345] for technical subtleties)
Theorem 33.6 In the statements below ηf (and others) corresponds to ηf (PY|X ) for some fixed
PY|X from X to Y .
where (recall β̄ ≜ 1 − β )
( 1 − x) 2
LCβ (PkQ) = Df (PkQ), f(x) = β̄β
β̄ x + β
is the Le Cam divergence of order β (recall (7.7) for β = 1/2).
(e) Consequently,
1 2 H4 (P0 , P1 )
H (P0 , P1 ) ≤ ηKL ≤ H2 (P0 , P1 ) − . (33.6)
2 4
(f) If a binary-input channel PY|X is also input-symmetric (or BMS, see Section 19.4*), then
ηKL (PY|X ) = Iχ2 (X; Y) for X ∼ Bern(1/2).
i i
i i
i i
632
(g) For any channel PY|X , the supremum in (33.3) can be restricted to PX , QX with a common
binary support. In other words, ηf (PY|X ) coincides with that of the least contractive binary
subchannel. Consequently, from (e) we conclude
1 diamH2
diamH2 ≤ ηKL (PY|X ) = diamLCmax ≤ diamH2 − ,
2 4
(in particular ηKL diamH2 ), where diamH2 (PY|X ) = supx,x′ ∈X H2 (PY|X=x , PY|X=x′ ),
diamLCmax = supx,x′ LCmax (PY|X=x , PY|x=x′ ) are the squared Hellinger and Le Cam diameters
of the channel.
Proof. Most proofs in full generality can be found in [345]. For (a) one first shows that ηf ≤ ηTV
for the so-called Eγ divergences corresponding to f(x) = |x − γ|+ − |1 − γ|+ , which is not hard to
believe since Eγ is piecewise linear. Then the general result follows from the fact that any convex
function f can be approximated (as N → ∞) in the form
X
N
aj |x − cj |+ + a0 x + c0 .
j=1
For (b) see [93, Theorem 1] and [97, Proposition II.6.13 and Corollary II.6.16]. The idea of this
proof is as follows:
• ηKL ≥ ηχ2 by restricting to local perturbations. Recall that KL divergence behaves locally as
χ2 – Proposition 2.21.
R∞
• Using the identity D(PkQ) = 0 χ2 (PkQt )dt where Qt = tP1+ Q
+t , we have
Z ∞ Z ∞
D(PY kQY ) = χ (PY kQY t )dt ≤ ηχ2
2
χ2 (PX kQX t )dt = ηχ2 D(PX kQX ).
0 0
I(U; Y) = I(U; Y|∆) ≤ Eδ∼P∆ [(1 − 2δ)2 I(U; X|∆ = δ) = E[(1 − 2∆)2 ]I(U; X),
i i
i i
i i
where we used the fact that I(U; X|∆ = δ) = I(U; X) and Example 33.2 below.
For (g) see Ex. VI.20.
Example 33.2 (Computing ηKL (BSCδ )) Consider p the BSCδ channel. In Example 33.1
we computed ηTV . Here we have diamH2 = 2 − 4 δ(1 − δ) and thus the bound (33.6) we get
ηKL ≤ (1 − 2δ)2 . On the other hand taking U = Ber(1/2) and PX|U = Ber(α) we get
I(U; Y) log 2 − h(α + (1 − 2α)δ) 1
ηKL ≥ = → (1 − 2δ)2 α→ .
I(U; X) log 2 − h(α) 2
Thus we have shown:
This example has the following consequence for the KL-divergence geometry.
Proposition 33.7 Consider any distributions P0 and P1 on X and let us consider the interval
in P(X ): Pλ = λP1 + (1 − λ)P0 for λ ∈ [0, 1]. Then divergence (with respect to the midpoint)
behaves subquadratically:
The same statement holds with D replaced by χ2 (and any other Df satisfying Theorem 33.6(b)).
Notice that for any metric d(P, Q) on P(X ) that is induced from the norm on the vector space
M(X ) of all signed measures (such as TV), we must necessarily have d(Pλ , P1−λ ) = |1 −
2λ|d(P0 , P1 ). Thus, the ηKL (BSCλ ) = (1 − 2λ)2 which yields the inequality is rather natural.
i i
i i
i i
634
whose configuration allows us to factorize the joint distribution over XV by Throughout the section,
we consider Shannon mutual information, i.e., f = x log x. Let us give a detailed example below.
Example 33.3 Suppose we have a graph G = (V, E) as follows.
B
X0 W
A
Then every node has a channel from its parents to itself, for example W corresponds to a noisy
channel PW|A,B , and we can define η ≜ ηKL (PW|A,B ). Now, prepend another random variable U ∼
Bern(λ) at the beginning, the new graph G′ = (V′ , E′ ) is shown below:
B
U X0 W
A
Recall that from chain rule we have I(U; B, W) = I(U; B) + I(U; W|B) ≥ I(U; B). Hence, if (33.7)
is correct, then η → 0 implies I(U; B, W) ≈ I(U; B) and symmetrically I(U; A, W) ≈ I(U; A).
Therefore for small δ , observing W, A or W, B does not give advantage over observing solely A or
B, respectively.
Observe that G′ forms a Markov chain U → X0 → (A, B) → W, which allows us to factorize
the joint distribution over E′ as
Now consider the joint distribution conditioned on B = b, i.e., PU,X0 ,A,W|B . We claim that the
conditional Markov chain U → X0 → A → W|B = b holds. Indeed, given B and A, X0 is
independent of W, that is PX0 |A,B PW|A,B = PX0 ,W|AB , from which follows the mentioned conditional
Markov chain. Using the conditional Markov chain, SDPI gives us for any b,
i i
i i
i i
B
R
X W
A
Under this model, for two subsets T, S ⊂ V we define perc[T → S] = P[∃ open path T → S].
Note that PXv |XPa(v) describe the stochastic recipe for producing Xv based on its parent variables.
We assume that in addition to a DAG we also have been given all these constituent channels (or
at least bounds on their ηKL coefficients).
Theorem 33.8 ([345]) Let G = (V, E) be a DAG and let 0 be a node with in-degree equal to
zero (i.e. a source node). Note that for any 0 63 S ⊂ V we can inductively stitch together constituent
channels PXv |XPa(v) and obtain PXS |X0 . Then we have
ηKL (PXS |X0 ) ≤ perc(0 → S). (33.10)
Proof. For convenience let us denote η(T) = ηKL (PXT |X0 ) and ηv = ηKL (PXv |XPa(v) ). The proof
follows from an induction on the size of G. The statement is clear for the |V(G)| = 1 since
S = ∅ or S = {X0 }. Now suppose the statement is already shown for all graphs smaller than
G. Let v be the node with out-degree 0 in G. If v 6∈ S then we can exclude it from G and the
statement follows from induction hypothesis. Otherwise, define SA = Pa(v) \ S and SB = S \ {v},
i i
i i
i i
636
A = XSA , B = XSB , W = Xv . (If 0 ∈ A then we can create a fake 0′ with X0′ = X0 and retain
0′ ∈ A while moving 0 out of A. So without loss of generality, 0 6∈ A.) Prepending arbitrary U to
the graph as U → X0 , the joint DAG of random variables (X0 , A, B, W) is then given by precisely
the graph in (33.7). Thus, we obtain from (33.8) the estimate
From induction hypothesis η(SA ∪ SB ) ≤ perc(0 → SA ) and η(SB ) ≤ perc(0 → SB ) (they live
on a graph G \ {v}). Thus, from computation (33.9) we see that the right-hand side of (33.11) is
precisely perc(0 → S) and thus η(S) ≤ perc(S) as claimed.
Proof of Theorem 33.2. First observe the noisy boolean circuit is a form of DAG. Since the gates
are δ -noisy contraction coefficients of constituent channels ηv in the DAG can be bounded by
(1 − 2δ)2 . Thus, in the percolation question all vertices are open with probability (1 − 2δ)2
From SDPI, for each i, we have I(Xi ; Y) ≤ ηKL (PY|Xi )H(Xi ). From Theorem 33.8, we know
ηKL (PY|Xi ) ≤ perc(Xi → Y). We now want to upper bound perc(Xi → Y). Recall that the minimum
distance between Xi and Y is di . For any path π of length ℓ(π ) from Xi to Y, therefore, the probability
that it will be open is ≤ (1 − 2δ)2ℓ(π ) . We can thus bound
X
perc(Xi → Y) ≤ (1 − 2δ)2ℓ(π ) . (33.12)
π :Xi →Y
Let us now build paths backward starting from Y, which allows us to represent paths X → Yi
as vertices of a K-ary tree with root Yi . By labeling all vertices on a K-ary tree corresponding
to paths X → Yi we observe two facts: the labeled set V is prefix-free (two labeled vertices are
never in ancestral relation) and the depth of each labeled set is at least di . It is easy to see that
P
u∈V c
depth(u)
≤ (Kc)di provided Kc ≤ 1 and attained by taking V to be set of all vertices in the
tree at depth di . We conclude that whenever K(1 − 2δ)2 ≤ 1 the right-hand side of (33.12) is
bounded by (K(1 − 2δ)2 )di , which concludes the proof by upper bounding H(Xi ) ≤ log 2 as
Corollary 33.9 Consider a channel PY|X and its n-letter memoryless extension P⊗ n
Y|X . Then we
have
ηKL (P⊗
Y|X ) ≤ 1 − (1 − ηKL (PY|X )) ≤ nηKL (PY|X ) .
n n
The first inequality can be sharp for some channels. For example, it is sharp when PY|X is a
binary or q-ary erasure channel (defined below in Example 33.6). This fact is proven in [345,
Theorem 17].
i i
i i
i i
Proof. The graph here consists of n parallel lines Xi → Yi . Theorem 33.8 shows that ηKL (P⊗ Y|X ) ≤
n
We conclude the section with a more sophisticated application of Theorem 33.8, emphasizing
how it can yield stronger bounds when compared to Theorem 33.2.
Example 33.5 Suppose we have the topological restriction on the placement of gates (namely
that the inputs to each gets should be from nearest neighbors to the left), resulting in the following
circuit of 2-input δ -noisy gates.
Note that each gate may be a simple passthrough (i.e. serve as router) or a constant output. Theorem
33.2 states that if (1 − 2δ)2 < 12 , then noisy computation within arbitrary topology is not possible.
Theorem 33.8 improves this to (1 − 2δ)2 < pc , where pc is the oriented site percolation threshold
for the particular graph we have. Namely, if each vertex is open with probability p < pc then
with probability 1 the connected component emanating from any given node (and extending to
the right) is finite. For the example above the site percolation threshold is estimated as pc ≈ 0.705
(so-called Stavskaya automata).
i i
i i
i i
638
We refer to ηf (PX , PY|X ) as the input-dependent contraction coefficient, in contrast with the
input-independent contraction coefficient ηf (PY|X ). It is obvious that
but as we will see below the inequality is often strict and the difference can lead to significant
improvements in the applications (Example 33.10). In Theorem 33.6 we have seen that for most
interesting f’s we have ηf (PY|X ) = ηχ2 (PY|X ). Unfortunately, for the input-dependent version this is
not true: we only have a one-sided comparison, namely for any twice continuously differentiable
f with f′′ (1) > 0 (in particular for KL-divergence) it holds that [345, Theorem 2]
For example, for jointly Gaussian X, Y, we in fact have ηχ2 = ηKL (see Example 33.7 next);
however, in general we only have ηχ2 < ηKL (see [19] for an example). Thus, unlike the input-
independent case, here the choice of f is very important. A general rule is that ηχ2 (PX , PY|X ) is the
easiest to bound and by (33.13) it contracts the fastest. However, for various reasons other f are
more useful in applications. Consequently, theory of input-dependent contraction coefficients is
much more intricate (see [201] for many recent results and references). In this section we try to
summarize some important similarities and distinctions between the ηf (PX , PY|X ) and ηf (PY|X ).
First, just as in Theorem 33.5 we can similarly prove a mutual information characterization of
ηKL (PX , PY|X ) as follows [352, Theorem V.2]:
I(U; Y)
ηKL (PX , PY|X ) = sup .
PU|X :U→X→Y I(U; X)
In particular, we see that ηKL (PX , PY|X ) is also a slope of the FI -curve (cf. Definition 16.5):
d
ηKL (PX , PY|X ) = FI (t; PX,Y ) . (33.14)
dt t=0+
(Indeed, from Exercise III.32 we know FI (t) is concave and thus supt≥0 FI t(t) = F′I (0).)
The next property of input-dependent SDPIs emphasizes the key difference compared to its
input-independent counterpart. Recall that Corollary 33.9 (and the discussion thereafter) show
that generally ηKL (P⊗ ⊗n ⊗n
Y|X ) → 1 exponentially fast. At the same time, ηKL (PX , PY|X ) stays constant.
n
i.i.d.
In particular, if (Xi , Yi ) ∼ PX,Y , then ∀PU|Xn
Note that not all ηf satisfy tensorization. We will show below (Theorem 33.12) that ηχ2 does
satisfy it. On the other hand, ηTV (P⊗ ⊗n
X , PY|X ) → 1 exponentially fast (which follows from (7.21)).
n
i i
i i
i i
X1 Y1
X2 Y2
Figure 33.5 Illustration for the probability space in the proof of Proposition 33.11.
Proof. This result is implied by (33.14) and Exercise III.32, but a simple direct proof is useful.
Without loss of generality (by induction) it is sufficient to prove the proposition for n = 2. It is
always useful to keep in mind the diagram in Figure 33.5 Let η = ηKL (PX , PY|X )
I(U; Y1 , Y2 ) = I(U; Y1 ) + I(U; Y2 |Y1 )
≤ η [I(U; X1 ) + I(U; X2 |Y1 )] (33.15)
= η [I(U; X1 ) + I(U; X2 |X1 ) + I(U; X1 |Y1 ) − I(U; X1 |Y1 , X2 )] (33.16)
≤ η [I(U; X1 ) + I(U; X2 |Y1 )] (33.17)
= η I(U; X1 , X2 )
where (33.15) is due to the fact that conditioned on Y1 , U → X2 → Y2 is still a Markov chain,
(33.16) is because U → X1 → Y1 is a Markov chain and (33.17) follows from the fact that
X2 → U → X1 is a Markov chain even when conditioned on Y1 .
As an example, let us analyze the erasure channel.
Example 33.6 (ηKL (PX , PY|X ) for erasure channel) We define ECτ as the following channel
(
X w.p. 1 − τ
Y=
? w.p. τ.
Consider an arbitrary U → X → Y and define an auxiliary random variable B = 1{Y =?}. We
have
I(U; Y) = I(U; Y, B) = I(U; B) +I(U; Y|B) = (1 − τ )I(U; X),
| {z }
=0, since B⊥
⊥U
where the last equality is due to the fact that I(U; Y|B = 1) = 0 and I(U; Y|B = 0) = I(U; X).
By the mutual information characterization of ηKL (PX , PY|X ), we have ηKL (PX , ECτ ) = 1 − τ .
⊗n
Note that by tensorization we also have ηKL (P⊗X , ECτ ) = 1 − τ . However, for non-product input
n
i i
i i
i i
640
Among the input-dependent ηf the most elegant is the theory of ηχ2 . The properties hold for
general PX,Y but we only state it for the finite case for simplicity.
Theorem 33.12 (Properties of ηχ2 (PX , PY|X )) Consider finite X and Y . Then, we have
(a) (Spectral characterization) Let Mx,y = √PX,Y (x,y) be an |X | × |Y| matrix. Let 1 = σ1 (M) ≥
PX (x)PY (y)
p
σ2 (M) ≥ · · · ≥ 0 be the singular values of M, i.e. σj (M) = λj (MT M). Then ηχ2 (PX , PY|X ) =
σ22 (M).
(b) (Symmetry) ηχ2 (PX , PY|X ) = ηχ2 (PY , PX|Y ).
(c) (Maximal correlation) ηχ2 (PX , PY|X ) = supg1 ,g2 ρ2 (g1 (X), g2 (Y)), where the supremum is over
all functions g1 : X → R and g2 : Y → R.
(d) (Tensorization) ηχ2 (P⊗ n ⊗n
X , PY|X ) = ηχ2 (PX , PY|X )
Proof. We focus on the spectral characterization which implies the rest. Denote by EX|Y a linear
P
operator that acts on function g as EX|Y g(y) = x PX|Y (x|y)g(x). For any QX let g(x) = QPXX((xx)) then
QY (y)
we have PY (y) = EX|Y g. Therefore, we have
VarPY [EX|Y g]
ηχ2 (PX , PY|X ) = sup
g VarPX [g]
with supremum over all g ≥ 0 and EPX [g] = 1. We claim that this supremum is also equal to
EPY [E2X|Y h]
ηχ2 (PX , PY|X ) = sup ,
h EPX [h2 ]
taken over all h with EPX h = 0. Indeed, for any such h we can take g = 1 + ϵh for some suffi-
p g ≥ 0) and conversely, for any g we can set h = g − 1. Finally, let us
ciently small ϵ (to satisfy
reparameterize ϕx ≜ PX (x)h(x) in which case we get
ϕ T MT Mϕ
ηχ2 (PX , PY|X ) = sup ,
ϕ ϕT ϕ
p
where ϕ ranges over all vectors in RX that are orthogonal to the vector ψ with ψx = PX (x).
Finally, we notice that top singular value of M corresponds to singular vector ψ and thus restricting
ϕ ⊥ ψ results in recovering the second-largest singular vector.
Symmetry follows from noticing that matrix M is replaced by MT when we interchange X and
Y. The maximal correlation characterization follows from the fact that supg2 E√[g1 (X)g2 (Y)] is attained
Var[g2 (Y)]
at g2 = EX|Y g1 . Tensorization follows from the fact that singular values of the Kronecker product
M⊗n are just products of (all possible) n-tuples of singular values of M.
Example 33.7 (SDPI constants of joint Gaussian) Let X, Y be jointly Gaussian with
correlation coefficient ρ. Then
i i
i i
i i
Indeed, it is well-known that the maximal correlation of X and Y is simply |ρ|. (This can be shown
by finding the eigenvalues of the (Mehler) kernel defined in Theorem 33.12(a); see e.g. [266].)
Applying Theorem 33.12(c) yields ηχ2 (PX , PY|X ) = ρ2 .
Next, in view of (33.13), it suffices to show ηKL ≤ ρ2 , which is a simple consequence of EPI.
Without loss of generality, let us consider Y = X + Z, where X ∼ PX = N (0, 1) and Z ∼ N (0, σ 2 ).
Then PY = N (0, 1 + σ 2 ) and ρ2 = 1+σ 1
2 . Let X̃ have finite second moment and finite differential
1
entropy and set Ỹ = X̃ + Z. Applying Lieb’s EPI (3.36) with U1 = X̃, U2 = Z/σ and cos2 α = 1+σ 2,
we obtain
1 σ2 1
h(Ỹ) ≥ 2
h(X̃) + 2
log(2πe) + log(1 + σ 2 )
1+σ 2( 1 + σ ) 2
Example 33.8 (Mixing of Markov chains) One area in which input-dependent contrac-
tion coefficients have found a lot of use is in estimating mixing time (time to convergence to
equilibrium) of Markov chains. Indeed, suppose K = PY|X is a kernel for a time-homogeneous
Markov chain X0 → X1 → · · · with stationary distribution π (i.e., K = PXt+1 |Xt ). Then for any
initial distribution q, SDPI gives the following bound:
showing exponential decrease of Df provided that ηf (π , K) < 1. For most interesting chains the
TV version is useless, but χ2 and KL is rather effective (the two known as the spectral gap and
modified log-Sobolev inequality methods). For example, for reversible Markov chains, we have
[124, Prop. 3]
where γ∗ is the absolute spectral gap of P. See Exercise VI.19. The most efficient modern method
for bounding ηKL is known as spectral independence, see Exercise VI.26.
i i
i i
i i
642
′ |X X3 ……
PX
P X′ |X
X1
PX′ |X X5 ……
Xρ
PX′ |X X6 ……
PX ′
|X X2
PX ′
|X X4 ……
To simplify our discussion, we will assume that π is a reversible measure on kernel PX′ |X , i.e.,
By standard result on Markov chains, this also implies that π is a stationary distribution of the
reversed Markov kernel PX|X′ .
This model, known as broadcasting on trees, turns out to be rather deep. It first arose in sta-
tistical physics as a simplification of Ising model on lattices (trees are called Bethe lattices in
physics) [63]. Then, it was found to be closely related to a problem of phylogenetic reconstruc-
tion in computational biology [306] and almost simultaneously appeared in random constraint
satisfaction problems [261] and sparse-graph coding theory. Our own interest was triggered by
a discovery of a certain equivalence between reconstruction on trees and community detection in
stochastic block model [307, 119].
We make the following observations:
• We can think of this model as a broadcasting scenario, where the root broadcasts its message
Xρ to the leaves through noisy channels PX′ |X . The condition (33.19) here is only made to avoid
defining the reverse channel. In general, one only requires that π is a stationary distribution of
PX′ |X , in which case the (33.21) should be replaced with ηKL (π , PX|X′ )b < 1.
• Under the assumption (33.19), the joint distribution of this tree can also be written as a Gibbs
distribution
1 X X
PXall = exp f(Xp , Xc ) + g( X v ) , (33.20)
Z
(p,c)∈E v∈V
where Z is the normalization constant, f(xp , xc ) = f(xc , xp ) is symmetric. When X = {0, 1}, this
model is known as the Ising model (on a tree). Note, however, that not every measure factorizing
as (33.20) (with symmetric f) can be written as a broadcasting process for some P and π.
The broadcasting on trees is an inference problem in which we want to reconstruct the root
variable Xρ given the observations XLd = {Xv : v ∈ Ld }, with Ld = {v : v ∈ V, depth(v) = d}.
A natural question is to upper bound the performance of any inference algorithm on this problem.
The following theorem shows that there exists a phase transition depending on the branching factor
b and the contraction coefficient of the kernel PX′ |X .
i i
i i
i i
Theorem 33.13 Consider the broadcasting problem on infinite b-ary tree (b > 1), with root
distribution π and edge kernel PX′ |X . If π is a reversible measure of PX′ |X such that
ηKL (π , PX′ |X )b < 1, (33.21)
then I(Xρ ; XLd ) → 0 as d → 0.
Proof. For every v ∈ L1 , we define the set Ld,v = {u : u ∈ Ld , v ∈ ancestor(u)}. We can upper
bound the mutual information between the root vertex and leaves at depth d
X
I(Xρ ; XLd ) ≤ I(Xρ ; XLd,v ).
v∈L1
i i
i i
i i
644
This problem can be modeled as a broadcasting problem on tree where the root distribution π
is given by the uniform distribution on k colors, and the edge kernel PX′ |X is defined as
(
1
a 6= b
PX′ |X (a|b) = k−1
0, a = b.
It can be shown (see Ex. VI.24) that ηKL (Unif, PX′ |X ) = k log k(11+o(1)) for large k. By Theorem
33.13, this implies that if b < k log k(1 + o(1)) then reliable reconstruction of the root node is not
possible. This result is originally proved in [393] and [50].
The other direction b > k log k(1 + o(1)) can be shown by observing that if b > k log k(1 + o(1))
then the probability of the children of a node taking all available colors (except its own) is close to
1. Thus, an inference algorithm can always determine the color of a node by finding a color that
is not assigned to any of its children. Similarly, when b > (1 + ϵ)k log k even observing (1 − ϵ)-
fraction of the node’s children is sufficient to reconstruct its color exactly. Proceeding recursively
from bottom up, such a reconstruction algorithm will succeed with high probability. In this regime
with positive probability (over the leaf variables) the posterior distribution of the root color is a
point mass (deterministic). This effect is known as “freezing” of the root given the boundary.
We may also consider another reconstruction method which simply computes majority of the
leaves, i.e. X̂ρ = j for the color j that appears the most among the leaves. This method gives
success probability strictly above 1k if and only if d > (k − 1)2 , by a famous result of Kesten and
Stigum [244]. While the threshold is suboptimal, the method is quite robust in the sense that it
also works if we only have access to a small fraction ϵ of the leaves (and the rest are replaced by
erasures).
Let us now consider ηχ2 (Unif, PX′ |X ). The transition matrix is symmetric with eigenvalues
{1, − k−1 1 } and thus from Theorem 33.12 we have that
1 1
ηχ2 (Unif, PX′ |X ) = ηKL (Unif, PX′ |X ) = .
( k − 1) 2 k log k(1 + o(1))
Thus if Theorem 33.13 could be shown with Iχ2 instead of IKL we would be able to show non-
reconstruction for d < (k − 1)2 , contradicting the result of the previous paragraph. What goes
wrong is that Iχ2 fails to be subadditive, cf. (7.47). However, it is locally subadditive (when e.g.
Iχ2 (X; A) 1) by [202, Lemma 26]. Thus, an argument in Theorem 33.13 can be repeated for the
case where the leaves are observed through a very noisy channel (for example, an erasure channel
leaving only ϵ-fraction of the leaves). Consequently, robust reconstruction threshold for coloring
exactly equals d = (k − 1)2 . See [228] for more on robust reconstruction thresholds.
i i
i i
i i
Notice that in this problem we are not sample-limited (each party has infinitely many observations),
but communication-limited (only B bits can be exchanged).
Here is a trivial attempt to solve it. Notice that if Bob sends W = (Y1 , . . . , YB ) then the optimal
PB
estimator is ρ̂(X∞ , W) = 1n i=1 Xi Yi which has minimax error B1 , hence R∗ (B) ≤ B1 . Surprisingly,
this can be improved.
X1 Y1
.. ..
. .
W Xi Yi
.. ..
. .
Note that once the messages W are fixed we have a parameter estimation problem {Qρ , ρ ∈
[−1, 1]} where Qρ is a distribution of (X∞ , W) when A∞ , B∞ are ρ-correlated. Since we mini-
mize mean-squared error, we know from the van Trees inequality (Theorem 29.2)2 that R∗ (B) ≥
1+o(1) 1+o(1)
minρ JF (ρ) ≥ JF (0) where JF (ρ) is the Fisher information of the family {Qρ }.
Recall, that we also know from the local approximation that
ρ2 log e
D(Qρ kQ0 ) = JF (0) + o(ρ2 )
2
Furthermore, notice that under ρ = 0 we have X∞ and W independent and thus
hence JF (0) ≤ (2 ln 2)B + o(1) which in turns implies the theorem. For full details and the
extension to interactive communication between Alice and Bob, see [207].
2
This requires some technical justification about smoothness of the Fisher information JF (ρ).
i i
i i
i i
646
We turn to the upper bound next. First, notice that by taking blocks of m → ∞ consecutive bits
Pim−1
and setting X̃i = √1m j=(i−1)m Xj and similarly for Ỹi , Alice and Bob can replace ρ-correlated
i.i.d. 1 ρ
bits with ρ-correlated standard Gaussians (X̃i , Ỹi ) ∼ N (0, ). Next, fix some very large N
ρ 1
and let
W = argmax Yj .
1≤j≤N
Definition 33.15 (Partial orders on channels) Let PY|X and PZ|X be two channels.
• We say that PY|X is a degradation of PZ|X , denoted by PY|X ≤deg PZ|X , if there exists PY|Z such
that PY|X = PY|Z ◦ PZ|X .
• We say that PZ|X is less noisy than PY|X , denoted by PY|X ≤ln PZ|X , if for every PU,X on the
following Markov chain
U X
We make some remarks on these definitions and refer to [345] for proofs:
i i
i i
i i
• PY|X ≤deg PZ|X =⇒ PY|X ≤ln PZ|X =⇒ PY|X ≤mc PZ|X . Counter examples for reverse
implications can be found in [111, Problem 15.11].
• The less noisy relation can be defined equivalently in terms of the divergence, namely PY|X ≤ln
PZ|X if and only if for all PX , QX we have D(QY kPY ) ≤ D(QZ kPZ ). We refer to [290, Sections
I.B, II.A] and [345, Section 6] for alternative useful characterizations of the less-noisy order.
• For BMS channels (see Section 19.4*) it turns out that among all channels with a given
Iχ2 (X; Y) = η (with X ∼ Ber(1/2)) the BSC and BEC are the minimal and maximal elements
in the poset of ≤ln ; see Ex. VI.21 for details.
Proposition 33.16 ηKL (PY|X ) ≤ 1 − τ if and only if PY|X ≤ln ECτ , where ECτ was defined in
Example 33.6.
Proof. By induction it is sufficient to consider n = 2 only. Consider the following Markov chain:
Y1
X1
Z1
U
Y2
X2
Z2
3 ∏
We remind that ⊗PYi |Xi refers to the product (memoryless) channel with xn 7→ Yn ∼ i PYi |Xi =xi .
i i
i i
i i
648
Hence I(U; Y1 , Y2 ) ≤ I(U; Y1 , Z2 ) for any PX1 ,X2 ,U . Applying the same argument we can replace
Y1 with Z1 to get I(U; Y1 , Z2 ) ≤ I(U; Z1 , Z2 ), completing the proof.
For the second claim, notice that
where equalities are just applications of the chain rule (and in (a) and (b) we also notice that
conditioned on X2 the Y2 or Z2 are non-informative) and both inequalities are applications of
the most capable relation to the conditional distributions. For example, for every y we have
I(X2 ; Y2 |Y1 = y) ≤ I(X2 ; Z2 |Y1 = y) and hence we can average over y ∼ PY1 .
X2 X6
Y2
Y6
6
2
Y5
Y1
Y35
X1 X3 X5 X7
Y5
Y1
9
4
Y7
Y3
9
4
X4 X9
i i
i i
i i
to one of the m communities. We assume that Xv is sampled uniformly from [m] and independent
of the other vertices. The observation Yu,v is defined as
(
Ber(a/n) Xu = Xv
Yuv ∼
Ber(b/n) Xu 6= Xv .
Example 33.12 (Z2 synchronization) For any graph G, we sample Xv uniformly from
{−1, +1} and Ye = BSCδ (Xu Xv ).
Example 33.13 (Spiked Wigner model) We consider the inference problem of estimating
the value of vector (Xi )i∈[n] given the observation (Yij )i,j∈[n],i≤j . The Xi ’s and Yij ’s are related by
r
λ
Yij = Xi Xj + Wij ,
n
i.i.d.
where X = (X1 , . . . , Xn )⊤ is sampled uniformly from {±1}n and Wi,j = Wj,i ∼ N (0, 1), so that
W forms a Wigner matrix (symmetric Gaussian matrix). This model can also be written in matrix
form as
r
λ
Y= XX⊤ + W
n
as a rank-one perturbation of a Wigner matrix W, hence the name of the model. It is used as a
probabilistic model for principal component analysis.
This problem can also be treated as a problem of inference on undirected graph. In this case,
the underlying graph is a complete graph, and we assign Xi to the ith vertex. Under this model, the
edge observations is given by Yij = BIAWGNλ/n (Xi Xj ), cf. Example 3.4.
Although seemingly different, these problems share the following common characteristics,
namely:
(Xu , Xv ) → B → Ye .
In other words, the observation on each edge only depends on whether the random variables on
its endpoints are similar.
i i
i i
i i
650
contraction coefficient, the percolation probability is used to directly control the conditional
mutual information between any subsets of vertices in the graph.
Before stating our main theorem, we will need to define the corresponding percolation model
for inference on undirected graph. For any undirected graph G = (V, E) we define a percolation
model on this graph as follows :
• Every edge e ∈ E is open with the probability ηKL (PYe |Xe ), independent of the other edges,
• For any v ∈ V and S ⊂ V , we define the v ↔ S as the event that there exists an open path from
v to any vertex in S,
• For any S1 , S2 ⊂ V , we define the function percu (S1 , S2 ) as
X
percu (S1 , S2 ) ≜ P(v ↔ S2 ).
v∈S1
Notice that this function is different from the percolation function for information percolation
in DAG. Most importantly, this function is not equivalent to the exact percolation probability.
Instead, it is an upper bound on the percolation probability by union bounding with respect to
S1 . Hence, it is natural that this function is not symmetric, i.e. percu (S1 , S2 ) 6= percu (S2 , S1 ).
Instead of proving Theorem 33.18 in its full generality, we will prove the theorem under
Assumption 33.1. The main step of the proof utilizes the fact we can upper bound the mutual
information of any channel by its less noisy upper bound.
Theorem 33.19 Consider the problem of inference on undirected graph G = (V, E) with
X1 , ..., Xn not necessarily independent. If PYe |Xe ≤LN PZe |Xe , then for any S1 , S2 ⊂ V and E ⊂ E
Proof. From our assumption and the tensorization property of less noisy ordering (Proposi-
tion 33.17), we have PYE |XS1 ,XS2 ≤ln PZE |XS1 ,XS2 . This implies that for σ as a valid realization of
XS2 we will have
I(XS1 ; YE |XS2 = σ) = I(XS1 , XS2 ; YE |XS2 = σ) ≤ I(XS1 , XS2 ; ZE |XS2 = σ) = I(XS1 ; ZE |XS2 = σ).
As this inequality holds for all realization of XS2 , then the following inequality also holds
i i
i i
i i
Proof of Theorem 33.18. We only give a proof under Assumption 33.1 and only for the case
S1 = {i}. For the full proof (that proceeds by induction and does not leverage the less noisy idea),
see [347]. We have the following equalities
I(Xi ; XS2 |YE ) = I(Xi ; XS2 , YE ) = I(Xi ; YE |XS2 ) (33.22)
where the first inequality is due to the fact BE ⊥⊥ Xi under S.C, and the second inequality is due to
Xi ⊥
⊥ XS2 under Assumption 33.1.
Due to our previous result, if ηKL (PYe |Xe ) = 1 − τ then PYe |Xe ≤LN PZe |Xe where PZe |Xe = ECτ .
By tensorization property, this ordering also holds for the channel PYE |XE , thus we have
I(Xi ; YE |XS2 ) ≤ I(Xj ; ZE |XS2 ).
Let us define another auxiliary random variable D = 1{i ↔ S2 }, namely it is the indicator that
there is an open path from i to S2 . Notice that D is fully determined by ZE . By the same argument
as in (33.22), we have
I(Xi ; ZE |XS2 ) = I(Xi ; XS2 |ZE )
= I(Xi ; XS2 |ZE , D)
= (1 − P[i ↔ S2 ]) I(Xi ; XS2 |ZE , D = 0) +P[i ↔ S2 ] I(Xi ; XS2 |ZE , D = 1)
| {z } | {z }
0 ≤log |X |
≤ P[i ↔ S2 ] log |X |
= percu (i, S2 )
as n → ∞. Now, it turns out that in problems like this there is a so-called BBP phase transition, first
discovered in [29, 326]. Specifically, the eigenvalues of √1n W are well-known to follow Wigner’s
i i
i i
i i
652
√
semicircle law supported on the interval√(−2, 2). At the same time the rank-one matrix nλ XXT has
only one non-zero eigenvalue equal to λ. It turns out that for λ < 1 the effect of this “spike” is
√
undetectable and the spectrum of Y/ n is unaffected. For λ > 1 it turns out that the top eigenvalue
√
of Y/ n moves above the edge of the semicircle law to λ + λ1 > 2. Furthermore, computing the
top eigenvector and taking the sign of its coordinates achieves a correlated recovery of the true X
in the sense of (33.23). Note, however, that inability to change the spectrum (when λ < 1) does
not imply that reconstruction of X is not possible by other means. In this section, however, we will
show that indeed for λ ≤ 1 no method can achieve (33.23). Thus, together with the mentioned
spectral algorithm for λ > 1 we may conclude that λ∗ = 1 is the critical threshold separating the
two phases of the problem.
Theorem 33.20 Consider the spiked Wigner model. If λ ≤ 1, then for any sequence of
estimators Xˆn (Y),
" #
1 X
n
E Xi X̂i →0 (33.24)
n
i=1
as n → ∞.
Next, it is clear that we can simplify the task of maximizing (over X̂n ) by allowing to separately
estimate each Xi Xj by T̂i,j , i.e.
X X
max E[Xi Xj X̂i X̂j ] ≤ max E[Xi Xj T̂i,j ] .
X̂n T̂i,j
i̸=j i̸=j
(For example, we may notice I(Xi ; Xj |Y) = I(Xi , Xj ; Y) ≥ I(Xi Xj ; Y) and apply Fano’s inequality).
Thus, from the symmetry of the problem it is sufficient to prove I(X1 ; X2 |Y) → 0 as n → ∞.
By using the undirected information percolation theorem, we have
Now, for computation of perc we need to compute the probability of having an open edge, which in
our case simply equals ηKL (BIAWGNλ/n ). From Theorem 33.6 we know the latter equals Iχ2 (X; Y)
i i
i i
i i
λ
ηKL (BIAWGNλ/n ) = (1 + o(1)) .
n
′
Suppose that λ < 1. In this case, we can overbound λ+no(1) by λn with λ′ < 1. The percolation
random graph then is equivalent to the Erdös-Rényi random graph with n vertices and λ′ /n edge
probability, i.e., ER(n, λ′ /n). Using this observation, the inequality can be rewritten as
A classical result in random graph theory is that the largest connected component in ER(n, λ′ /n)
contains O(log n) vertices if λ′ < 1 [154]. This implies that the probability that two specific
vertices are connected is o(1), hence I(X2 ; X1 |Y) → 0 as n → ∞.
To treat the case of λ = 1 we need a slightly more refined information about ηKL (BIAWGNλ/n )
and about the behavior of the giant component of ER(n, 1+on(1) ) graph; see [347] for full details.
Remark 33.2 (Dense-sparse correspondence) The proof above changes the underlying
structure of the graph. Namely, instead of dealing with a complete graph, the information percola-
tion method replaced it with an Erdös-Rényi random graph. Moreover, if ηKL is small enough, then
the underlying percolation graph tends to be very sparse and has a locally tree-like structure. This
demonstrates a ubiquitous and actively studied effect in modern statistics: dense inference (such
as spiked Wigner model, sparse regression, sparse PCA, etc) with very weak signals (ηKL ≈ 1)
is similar to sparse inference (broadcasting on trees) with moderate signals (ηKL ∈ (ϵ, 1 − ϵ)).
The information percolation method provides a certain bridge between these two worlds, perhaps
partially explaining why the results in these two worlds often parallel one another. (E.g. results on
optimality and phase transitions for belief propagation (sparse inference) often parallel those for
approximate message passing (AMP, dense inference)). We do want to caution, however, that the
reduction given by the information percolation method is not generally tight (spiked Wigner being
a lucky exception). For example [347], for correlated recovery in the SBM √
with k communities
√
and edge probability a/n and b/n it yields an impossibility result ( a − b)2 ≤ 2k , weaker than
the best known upper bounds of [203].
Definition 33.21 (Post-SDPI constant) Given a conditional measure PY|X , define the input-
dependent and input-free contraction coefficients as
(p) I(U; X)
ηKL (PX , PY|X ) = sup :X→Y→U
PU|Y I(U; Y)
i i
i i
i i
654
X Y U
ε̄ 0 τ̄ 0 0
τ
?
τ
ε 1 τ̄ 1 1
(p) I(U; X)
ηKL (PY|X ) = sup :X→Y→U
PX ,PU|Y I(U; Y)
where PY = PY|X ◦ PX and PX|Y is the conditional measure corresponding to PX PY|X . From (33.25)
and Prop. 33.11 we also get tensorization property for input-dependent post-SDPI:
ηKL (P⊗ ⊗n
(p) n (p)
X , (PY|X ) ) = ηKL (PX , PY|X ). (33.27)
(p)
It is easy to see that by the data processing inequality, ηKL (PY|X ) ≤ 1. Unlike the ηKL coefficient
(p)
the ηKL can equal to 1 even for a noisy channel PY|X .
(p)
Example 33.14 (ηKL = 1 for erasure channels) Let PY|X = BECτ and X → Y → U
be defined as on Figure 33.6. Then we can compute I(Y; U) = H(U) = h(ετ̄ ) and I(X; U) =
H(U) − H(U|X) = h(ετ̄ ) − εh(τ ) hence
(p) I(X; U)
ηKL (PY|X ) ≥
I(Y; U)
ε
= 1 − h(τ )
h(ετ̄ )
This last term tends to 1 when ε tends to 0 hence
(p)
ηKL (BECτ ) = 1
i i
i i
i i
Theorem 33.22
(p)
ηKL (BSCδ ) = (1 − 2δ)2 .
Theorem 33.24 (Post-SDPI for BI-AWGN) Let 0 ≤ ϵ ≤ 1 and consider the channel PY|X
with X ∈ {±1} given by
Y = ϵ X + Z, Z ∼ N (0, 1) .
Then for any π ∈ (0, 1) taking PX = Ber(π ) we have for some absolute constant K the estimate
(p) ϵ2
ηKL (PX , PY|X ) ≤ K .
π (1 − π )
Proof. In this proof we assume all information measures are used to base-e. First, notice that
1
v(y) ≜ P[X = 1|Y = y] = 1−π −2yϵ
.
1+ π e
( p)
Then, the optimization defining ηKL can be written as
(p) d(EQY [v(Y)]kπ )
ηKL (PX , PY|X ) ≤ sup . (33.28)
QY D(QY kPY )
From (7.34) we have
(p) 1 (EQY [v(Y)] − π )2
ηKL (PX , PY|X ) ≤ sup . (33.29)
π (1 − π ) QY D(QY kPY )
i i
i i
i i
656
To proceed, we need to introduce a new concept. The T1 -transportation inequality, first intro-
duced by K. Marton, for the measure PY states the following: For every QY we have for some
c = c(PY )
p
W1 (QY , PY ) ≤ 2cD(QY kPY ) , (33.30)
where W1 (QY , PY ) is the 1-Wasserstein distance defined as
W1 (QY , PY ) = sup{EQY [f] − EPY [f] : f 1-Lipschitz} (33.31)
= inf{E[|A − B|] : A ∼ QY , B ∼ PY } .
The constant c(PY ) in (33.30) has been characterized in [64, 125] in terms of properties of PY . One
such estimate is the following:
!1/k
2 G(δ)
c(PY ) ≤ sup 2k
,
δ k≥ 1 k
′ 2 i.i.d.
where G(δ) = E[eδ(Y−Y ) ] where Y, Y′ ∼ PY . Using the estimate 2k
k ≥ √ 4k
and the fact
π (k+1/2)
that 1k ln(k + 1/2) ≤ 1
2 we get a further bound
√
2 π e 6G(δ)
c(PY ) ≤ G(δ) ≤ .
δ 4 δ
d √
Next notice that Y − Y′ = Bϵ + 2Z where Bϵ ⊥ ⊥ Z ∼ N (0, 1) and Bϵ is symmetric and |Bϵ | ≤ 2ϵ.
Thus, we conclude that for any δ < 1/4 we have c̄ ≜ δ6 supϵ≤1 G(δ) < ∞. In the end, we have
inequality (33.30) with constant c = c̄ that holds uniformly for all 0 ≤ ϵ ≤ 1.
Now, notice that dyd
v(y) ≤ 2ϵ and therefore v is 2ϵ -Lipschitz. From (33.30) and (33.31) we
obtain then
ϵp
|EQY [v(Y)] − EPY [v(Y)]| ≤ 2c̄D(QY kPY ) .
2
Squaring this inequality and plugging back into (33.29) completes the proof.
(p)
Remark 33.3 Notice that we can also compute the exact value of ηKL (PX , PY|X ) by noticing the
following. From (33.28) it is evident that among all measures QY with a given value of EQY [v(Y)]
we are interested in the one minimizing D(QY kPY ). From Theorem 15.11 we know that such QY
is given by dQY = ebv(y)−ψV (b) dPY , where ψV (b) ≜ ln EPY [ebv(Y) ]. Thus, by defining the convex
dual ψV∗ (λ) we can get the exact value in terms of the following single-variable optimization:
(p) d(λkπ )
ηKL (PX , PY|X ) = sup ∗ .
λ∈[0,1] ψV (λ)
Numerically, for π = 1/2 it turns out that the optimal value is λ → 12 , justifying our overbounding
of d by χ2 , and surprisingly giving
(p)
ηKL (Ber(1/2), PY|X ) = 4 EPY [tanh2 (ϵY)] = ηKL (PY|X ) ,
i i
i i
i i
Y1 U1
θ .. ..
. . θ̂
Ym Um
• Without the constraint θ ∈ [−1, 1]d , we could take θ ∼ N (0, bId ) and from rate-distortion
quickly conclude that estimating θ within risk R requires communicating at least d2 log bd R bits,
which diverges as b → ∞. Thus, restricting the magnitude of θ is necessary in order for it to be
estimable with finitely many bits communicated.
• Without
h P communication
i constraint, it is easy to establish that R∗ (m, d, σ 2 , ∞) =
2 2 P
E mσ i Zi = dmσ by taking Ui = Yi and θ̂ = m1 i Ui , which matches the minimax
risk (28.17) in non-distributed setting.
• In order to achieve a risk of order md we can apply a crude quantizer as follows. Let Ui = sign(Yi )
(coordinate-wise sign). This yields B = md and it is easy to show that the achievable risk is
Pm
Oσ ( md ). Indeed, notice that by taking V = m1 i=1 Ui we see that each coordinate Vj , j ∈ [d],
estimates (within Op ( √1m )) quantities Φ(θj /σ) with Φ denoting the CDF of N (0, 1). Since Φ
has derivative bounded away from 0 on [−1/σ, 1/σ], we get that the estimate θ̂j ≜ σΦ−1 (Vj )
will have mean square error of O(1/m) (with a poor dependency on σ , though), which gives
overall error O(d/m) as claimed.
Our main result below shows that the previous simple strategy is order-optimal in terms of
communicated bits. This simplifies the proofs of [137, 73].
• We remark that these results are also implicitly contained in the long line of work in the
information theoretic literature on the so-called Gaussian CEO problem. We recommend con-
sulting [156]; in particular, Theorem 3 there implies the B ≳ dm lower bound we show below.
However, the Gaussian CEO work uses a lot more sophisticated machinery (the entropy power
inequality and related results), while our SDPI proof is more elementary.
i i
i i
i i
658
dϵ2
Theorem 33.25 There exists a constant c1 > 0 such that if R∗ (m, d, σ 2 , B) ≤ 9 then B ≥ c1 d
ϵ2
.
Proof. Let X ∼ Unif({±1}d ) and set θ = ϵX. Given an estimate θ̂ we can convert it into an
estimator of X via X̂ = sign(θ̂) (coordinatewise). Then, clearly
ϵ2 dϵ 2
E[dH (X, X̂)] ≤ E[kθ̂ − θk2 ] ≤ .
4 9
Thus, we have an estimator of X within Hamming distance 49 d. From Rate-Distortion (Theo-
rem 26.1) we conclude that I(X; X̂) ≥ cd for some constant c > 0. On the other hand, from
the standard DPI we have
X
m
cd ≤ I(X; X̂) ≤ I(X; U1 , . . . , Um ) ≤ I(X; Uj ) , (33.32)
j=1
where we also applied Theorem 6.1. Next we estimate I(X; Uj ) via I(Yj ; Uj ) by applying the Post-
SDPI. To do this we need to notice that the channel X → Yj for each j is just a memoryless
extension of the binary-input AWGN channel with SNR ϵ. Since each coordinate of X is uniform,
we can apply Theorem 33.24 (with π = 1/2) together with tensorization (33.27) to conclude that
We notice that in this short section we only considered a non-interactive setting in the sense that
the message Ui is produced by machine i independently and without consulting anything except
its private measurement Yi . More generally, we could allow machines to communicate their bits
over a public broadcast channel, so that each communicated bit is seen by all other machines. We
could still restrict the total number of bits sent by all machines to be B and ask for the best possible
interactive estimation rate. While [137, 73] claim lower bounds that apply to this setting, those
bounds contain subtle errors (see [5, 4] for details). There are lower bounds applicable to non-
interactive settings but they are weaker by certain logarithmic terms. For example, [5, Theorem 4]
shows that to achieve risk ≲ dϵ2 one needs B ≳ ϵ2 logd(dm) in the limited interactive setting where
Ui may depend on Ui1−1 but there are no other interactions (i.e. the i-th machine sends its entire
i i
i i
i i
message at once instead of sending part of it and waiting for others to broadcast theirs before
completing its own transmission, as permitted by the fully interactive protocol).
i i
i i
i i
i.i.d.
VI.1 Let X1 , . . . , Xn ∼ Exp(exp(θ)), where θ follows the Cauchy distribution π with parameter s,
whose pdf is given by p(θ) = 1
θ2
for θ ∈ R. Show that the Bayes risk
πs(1+ )
s2
Learning parameters of dynamical systems is known as “system identification”. Denote the law
of (X1 , . . . , Xn ) corresponding to θ by Pθ .
1. Compute D(Pθ kPθ0 ). (Hint: chain rule saves a lot of effort.)
2. Show that Fisher information
X
JF (θ) = θ2t−2 (n − t).
1≤t≤n−1
3. Argue that the hardest regime for system identification is when θ ≈ 0, and that instability
(|θ| > 1) is in fact helpful.
VI.3 (Linear regression) Consider the model
Y = Xβ + Z
where the design matrix X ∈ Rn×d is known and Z ∼ N (0, In ). Define the minimax mean-square
error of estimating the regression coefficient β ∈ Rd based on X and Y as follows:
i i
i i
i i
Redo (a) and (b) by finding the value of R∗pred and identify the minimax estimator. Explain
intuitively why R∗pred is always finite even when d exceeds n.
i.i.d.
VI.4 (Chernoff-Rubin-Stein lower bound.) Let X1 , . . . , Xn ∼ Pθ and θ ∈ [−a, a].
(a) State the appropriate regularity conditions and prove the following minimax lower bound:
(1 − ϵ)2
inf sup Eθ [(θ − θ̂)2 ] ≥ min max ϵ2 a2 , ,
θ̂ θ∈[−a,a] 0<ϵ<1 nJ̄F
1
Ra
where J̄F = 2a J (θ)dθ is the average Fisher information. (Hint: Consider the uniform
−a F
prior on [−a, a] and proceed as in the proof of Theorem 29.2 by applying integration by
parts.)
(b) Simplify the above bound and show that
1
inf sup Eθ [(θ − θ̂)2 ] ≥ p . (VI.3)
θ̂ θ∈[−a,a] (a−1 + nJ̄F )2
(c) Assuming the continuity of θ 7→ JF (θ), show that the above result also leads to the optimal
local minimax lower bound in Theorem 29.4 obtained from Bayesian Cramér-Rao:
1 + o( 1)
inf sup Eθ [(θ − θ̂)2 ] ≥ .
θ̂ θ∈[θ0 ±n−1/4 ] nJF (θ0 )
Note: (VI.3) is an improvement of the inequality given in [92, Lemma 1] without proof and
credited to Rubin and Stein.
VI.5 In this exercise we give a Hellinger-based lower bound analogous to the χ2 -based HCR lower
bound in Theorem 29.1. Let θ̂ be an unbiased estimator for θ ∈ Θ ⊂ R.
(a) For any θ, θ′ ∈ Θ, show that [386]
1 (θ − θ′ )2 1
(Varθ (θ̂) + Varθ′ (θ̂)) ≥ −1 . (VI.4)
2 4 H2 (Pθ , Pθ′ )
R √ √ √ √
(Hint: For any c, θ − θ′ = (θ̂ − c)( pθ + pθ′ )( pθ − pθ′ ). Apply Cauchy-Schwarz
and optimize over c.)
(b) Show that
1
H2 (Pθ , Pθ′ ) ≤ (θ − θ′ )2 J̄F (VI.5)
4
R θ′
where J̄F = θ′ 1−θ θ JF (u)du is the average Fisher information.
(c) State the needed regularity conditions and deduce the Cramér-Rao lower bound from (VI.4)
and (VI.5) with θ′ → θ.
(d) Extend the previous parts to the multivariate case.
VI.6 (Bayesian distribution estimation.) Let {Pθ : θ ∈ Θ} be a family of distributions on X
with a common dominating measure μ and density pθ (x) = dP n
dμ (x). Given a sample X =
θ
i.i.d.
(X1 , . . . , Xn ) ∼ Pθ for some θ ∈ Θ, the goal is to estimate the data-generating distribution Pθ by
some estimator P̂(·) = P̂(·; Xn ) with respect to some loss function ℓ(P, P̂). Suppose we are in
i i
i i
i i
a Bayesian setting where θ is drawn from a prior π. Let’s find the form of the Bayes estimator
and the Bayes risk.
(a) For convenience, let Xn+1 denote a test data point (unseen) drawn from Pθ and independent
of the observed data Xn . Convince yourself that every estimator P̂ can be formally identified
as a conditional distribution QXn+1 |Xn .
(b) Consider the KL loss ℓ(P, P̂) = D(PkP̂). Using Corollary 4.2, show that the Bayes estimator
minimizing the average KL risk is the posterior (conditional mean), i.e. its μ-density is given
by
Qn+1
Eθ∼π [ i=1 pθ (xi )]
qXn+1 |Xn (xn+1 |x ) =
n
Qn . (VI.6)
Eθ∼π [ i=1 pθ (xi )]
(c) Conclude that the Bayes KL risk equals I(θ; Xn+1 |Xn ). Compare with the conclusion of
Exercise II.19 and the KL risk interpretation of batch regret in (13.35).
(d) Now, consider the χ^2 loss ℓ(P, P̂) = χ^2(P‖P̂). Using (I.12) in Exercise I.45 show that the
    optimal risk is given by

        inf_{P̂} E_{θ,X^n}[χ^2(P_θ‖P̂)] = E_{X^n}[ ( ∫ μ(dx_{n+1}) √(E_θ[p_θ(x_{n+1})^2 | X^n]) )^2 ] − 1,   (VI.7)

    attained by

        q_{X_{n+1}|X^n}(x_{n+1}|x^n) ∝ √(E_θ[p_θ(x_{n+1})^2 | X^n = x^n]).   (VI.8)
(e) Now, consider the reverse-χ^2 loss ℓ(P, P̂) = χ^2(P̂‖P), a weighted quadratic loss. Using
    (I.13) in Exercise I.45 show that the optimal risk is attained by

        q_{X_{n+1}|X^n}(x_{n+1}|x^n) ∝ (E_θ[p_θ(x_{n+1})^{−1} | X^n = x^n])^{−1}.   (VI.9)
(f) Consider the discrete alphabet [k] and X^n i.i.d. ∼ P, where P = (P_1, . . . , P_k) is drawn from
    the Dirichlet(α, . . . , α) prior. Applying the previous results (with μ the counting measure),
    show that the Bayes estimator for the KL loss and the reverse-χ^2 loss is given by the Krichevsky-
    Trofimov add-β estimator (Section 13.5)

        P̂(j) = (n_j + β)/(n + kβ),    n_j ≜ ∑_{i=1}^{n} 1{X_i = j},   (VI.10)

    where β = α for KL and β = α − 1 for reverse-χ^2 (assuming α ≥ 1). Hint: The posterior is
    (P_1, . . . , P_k)|X^n ∼ Dirichlet(n_1 + α, . . . , n_k + α) and P_j|X^n ∼ Beta(n_j + α, n − n_j + (k − 1)α).
(g) For the χ^2 loss, show that the Bayes estimator is

        P̂(j) = √((n_j + α)(n_j + α + 1)) / ∑_{j′=1}^{k} √((n_{j′} + α)(n_{j′} + α + 1)).   (VI.11)
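A small numerical illustration of parts (f) and (g) (an addition, with an arbitrary toy count vector of my choosing): under a Dirichlet(α, . . . , α) prior the three Bayes estimators can be computed directly from the counts.

    import numpy as np

    # Bayes estimators from Exercise VI.6 (f)-(g) under a Dirichlet(alpha,...,alpha) prior,
    # evaluated on a toy count vector (k = 4 symbols, n = 10 observations).
    alpha = 1.0
    counts = np.array([5, 3, 0, 2])
    n, k = counts.sum(), len(counts)

    def add_beta(beta):
        # add-beta estimator (VI.10)
        return (counts + beta) / (n + k * beta)

    p_kl     = add_beta(alpha)        # Bayes for KL loss: beta = alpha
    p_revchi = add_beta(alpha - 1)    # Bayes for reverse-chi^2 loss: beta = alpha - 1 (alpha >= 1)
    w        = np.sqrt((counts + alpha) * (counts + alpha + 1))
    p_chi    = w / w.sum()            # Bayes for chi^2 loss, (VI.11)
    print(p_kl, p_revchi, p_chi, sep="\n")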
VI.7 (Coin flips) Given X_1, . . . , X_n i.i.d. ∼ Ber(θ) with θ ∈ Θ = [0, 1], we aim to estimate θ with respect
to the quadratic loss function ℓ(θ, θ̂) = (θ − θ̂)^2. Denote the minimax risk by R*_n.
(a) Use the empirical frequency θ̂_emp = X̄ to estimate θ. Compute and plot the risk R_θ(θ̂_emp) and
    show that

        R*_n ≤ 1/(4n).
(b) Compute the Fisher information of Pθ = Ber(θ)⊗n and Qθ = Bin(n, θ). Explain why they
are equal.
(c) Invoke the Bayesian Cramér-Rao lower bound (Theorem 29.2) to show that

        R*_n = (1 + o(1))/(4n).
(d) Notice that the risk of θ̂_emp is maximized at θ = 1/2 (fair coin), which suggests that it might be
    possible to hedge against this situation by the following randomized estimator:

        θ̂_rand = { θ̂_emp  with probability δ;   1/2  with probability 1 − δ }.   (VI.12)

    Find the worst-case risk of θ̂_rand as a function of δ. Optimizing over δ, show the improved
    upper bound:

        R*_n ≤ 1/(4(n + 1)).
(e) As discussed in Remark 28.3, a randomized estimator can always be improved if the loss is
    convex; so we should average out the randomness in (VI.12) by considering the estimator

        θ̂* = E[θ̂_rand | X] = δ X̄ + (1 − δ)/2.   (VI.13)

    Optimizing over δ to minimize the worst-case risk, find the resulting estimator θ̂* and its
    risk, show that the risk is constant (independent of θ), and conclude

        R*_n ≤ 1/(4(1 + √n)^2).
(f) Next we show that the estimator θ̂* found in part (e) is exactly minimax and hence

        R*_n = 1/(4(1 + √n)^2).

    Consider the Beta(a, b) prior with density

        π(θ) = (Γ(a + b)/(Γ(a)Γ(b))) θ^{a−1}(1 − θ)^{b−1},   θ ∈ [0, 1],

    where Γ(a) ≜ ∫_0^∞ x^{a−1} e^{−x} dx. Show that if a = b = √n/2, then θ̂* coincides with the Bayes
    estimator for this prior, which is therefore least favorable. (Hint: work with the sufficient
    statistic S = X_1 + . . . + X_n.)
(g) Show that the least favorable prior is not unique; in fact, there is a continuum of them. (Hint:
consider the Bayes estimator E[θ|X] and show that it only depends on the first n + 1 moments
of π.)
(h) (k-ary alphabet) Suppose X_1, . . . , X_n are i.i.d. ∼ P on [k]. Show that for any k, n, the minimax squared
    risk of estimating P in Theorem 29.5 is exactly

        R*_sq(k, n) = inf_{P̂} sup_{P∈P_k} E[‖P̂ − P‖_2^2] = (1/(√n + 1)^2) · (k − 1)/k,   (VI.14)

    achieved by the add-(√n/k) estimator. (Hint: For the lower bound, show that the Bayes estimators
    for the squared loss and the KL loss coincide, then apply (VI.10) in Exercise VI.6.)
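An illustrative exact risk computation for parts (a), (e) and (f) (an addition, not part of the exercise): for n = 25 the worst-case risk of X̄ is 1/(4n) = 0.01, whereas θ̂* = (S + √n/2)/(n + √n) has constant risk 1/(4(1 + √n)^2) ≈ 0.0069.

    import numpy as np

    # Exact risk curves of X-bar and of theta* = (S + sqrt(n)/2) / (n + sqrt(n)).
    n = 25
    theta = np.linspace(0, 1, 1001)
    risk_emp = theta * (1 - theta) / n                          # risk of the empirical frequency
    var  = n * theta * (1 - theta) / (n + np.sqrt(n)) ** 2      # variance of theta*
    bias = np.sqrt(n) * (0.5 - theta) / (n + np.sqrt(n))        # bias of theta*
    risk_star = var + bias ** 2                                 # constant in theta
    print(risk_emp.max(), risk_star.max(), 1 / (4 * (1 + np.sqrt(n)) ** 2))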
VI.8 (Distribution estimation in TV) Continuing (VI.14), we show that the minimax rate for
estimating P with respect to the total variation loss is

        R*_TV(k, n) ≜ inf_{P̂} sup_{P∈P_k} E_P[TV(P̂, P)] ≍ √(k/n) ∧ 1,   ∀ k ≥ 2, n ≥ 1.   (VI.15)
(a) Show that the MLE coincides with the empirical distribution.
(b) Show that the MLE achieves the RHS of (VI.15) within constant factors. (Hint: either apply
    (7.58) plus Pinsker's inequality, or directly use the variance of the empirical frequencies.)
(c) Establish the minimax lower bound. (Hint: apply Assouad's lemma, or Fano's inequality
    (with the volume method or an explicit construction of a packing), or the mutual information
    method directly.)
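A Monte Carlo illustration of the √(k/n) rate in (VI.15) for the empirical distribution (an addition; the uniform choice of P is mine):

    import numpy as np

    # E[TV(P_emp, P)] versus sqrt(k/n) for a uniform P on k = 20 symbols.
    rng = np.random.default_rng(1)
    k = 20
    P = np.full(k, 1 / k)
    for n in [50, 200, 800, 3200]:
        tv = [0.5 * np.abs(rng.multinomial(n, P) / n - P).sum() for _ in range(500)]
        print(n, np.mean(tv), np.sqrt(k / n))   # same order, up to constant factors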
VI.9 (Distribution estimation under reverse-χ^2) Consider estimating a discrete distribution P on [k] in
reverse-χ^2 divergence from X^n i.i.d. ∼ P, which is a weighted version of the quadratic loss in (VI.14).
We show that the minimax risk is given by

        R*_revχ²(k, n) ≜ inf_{P̂} sup_{P∈P([k])} E_P[χ^2(P̂‖P)] = (k − 1)/n.
(a) Show that taking P̂(j) = (1/n) ∑_{i=1}^{n} 1{X_i = j} to be the empirical distribution we always have
        R*_χ²(k, n) ≍ k/n.   (VI.17)
(a) Show that the empirical distribution, optimal for the TV loss (Exercise VI.8), implies the
claimed upper bound for the reverse KL loss (Hint: (7.58)). Show, on the other hand, that
for KL and χ2 it results in infinite loss.
(b) To show the upper bound for χ^2, consider the add-α estimator P̂ in (VI.10) with α = 1.
    Show that

        E[χ^2(P‖P̂)] ≤ (k − 1)/(n + 1).

    Using (7.34) conclude the upper bound part of (VI.16). (Hint: E_{N∼Bin(n,p)}[1/(N + 1)] =
    (1 − p̄^{n+1})/((n + 1)p), where p̄ = 1 − p.) (A Monte Carlo sanity check of this bound is
    sketched after part (e) below.)
(c) Show that in the small alphabet regime of k ≲ n, all lower bounds follow from (VI.15).
(d) Next assume k ≥ 4n. Consider a Dirichlet(α, . . . , α) prior in (13.16). Applying (VI.11) and
    (VI.7) for the Bayes χ^2 risk and choosing α = n/k, show the lower bound R*_χ²(k, n) ≳ k/n.
(e) Consider the prior under which P is uniform over a support set S chosen uniformly at ran-
    dom from all s-subsets of [k], where s < k is to be specified. Applying (VI.6), show that for
    this prior the Bayes estimator for the KL loss takes a natural form:

        P̂_j = { 1/s  if j ∈ Ŝ;   (1 − ŝ/s)/(k − ŝ)  if j ∉ Ŝ },

    where Ŝ denotes the set of symbols observed in X^n and ŝ = |Ŝ|.
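The Monte Carlo sanity check of part (b) above (an addition; the "true" P is drawn once from a flat Dirichlet simply to fix a test case): for the add-1 estimator the average of χ^2(P‖P̂) stays below (k − 1)/(n + 1).

    import numpy as np

    # Average chi^2(P || P_hat) of the add-1 estimator versus the bound (k-1)/(n+1).
    rng = np.random.default_rng(2)
    k, n, reps = 10, 40, 20000
    P = rng.dirichlet(np.ones(k))          # a fixed test distribution
    vals = []
    for _ in range(reps):
        counts = rng.multinomial(n, P)
        P_hat = (counts + 1) / (n + k)     # add-1 estimator, (VI.10) with beta = 1
        vals.append(np.sum(P ** 2 / P_hat) - 1.0)   # chi^2(P || P_hat)
    print(np.mean(vals), (k - 1) / (n + 1))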
(a) Let P be the class of distributions (which need not have a density) on the real line with
    variance at most σ^2. Show that R*_n = σ^2/n.
(b) Let P = P([0, 1]), the collection of all probability distributions on [0, 1]. Show that
    R*_n = 1/(4(1 + √n)^2). (Hint: For the upper bound, using the fact that, for any [0, 1]-valued
    random variable Z, Var(Z) ≤ E[Z](1 − E[Z]), mimic the analysis of the estimator (VI.13) in
    Ex. VI.7e.)
VI.12 Prove Theorem 30.2 using Fano's method. (Hint: apply Theorem 31.3 with T = ϵ · S^d_k, where
S^d_k denotes the Hamming sphere of radius k in d dimensions. Choose ϵ appropriately and apply
the Gilbert-Varshamov bound for the packing number of S^d_k in Theorem 27.6.)
VI.13 (Sharp minimax rate in sparse denoising) Continuing Theorem 30.2, in this exercise we deter-
mine the sharp minimax risk for denoising a high-dimensional sparse vector. In the notation of
(30.13), we show that, for the d-dimensional GLM model X ∼ N(θ, I_d), the following minimax
risk satisfies, as d → ∞ and k/d → 0,

        R*(k, d) ≜ inf_{θ̂} sup_{‖θ‖_0 ≤ k} E_θ[‖θ̂ − θ‖_2^2] = (2 + o(1)) k log(d/k).   (VI.18)
(a) For the lower bound, consider the prior π under which θ is uniformly distributed over
    {τe_1, . . . , τe_d} (the 1-sparse case), where the e_i denote the standard basis vectors, and let
    τ = √((2 − ϵ) log d). Show that for any ϵ > 0 the Bayes risk is (1 + o(1))τ^2, and conclude the
    lower bound part of

        R*(1, d) = (2 + o(1)) log d,   d → ∞.   (VI.19)

    (Hint: either apply the mutual information method, or directly compute the Bayes risk by
    evaluating the conditional mean and conditional variance.)
(b) Demonstrate an estimator θ̂ that achieves the RHS of (VI.19) asymptotically. (Hint: consider
the hard-thresholding estimator (30.13) or the MLE (30.11).)
(c) To prove the lower bound part of (VI.18), prove the following generic result

        R*(k, d) ≥ k R*(1, d/k),

    and then apply (VI.19). (Hint: consider a prior of d/k blocks each of which is 1-sparse.)
(d) Similar to the 1-sparse case, demonstrate an estimator θ̂ that achieves the RHS of (VI.18)
asymptotically.
Note: For both the upper and lower bound, the normal tail bound in Exercise V.25 is helpful.
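A Monte Carlo illustration of the k log(d/k) scale in (VI.18) (an addition, not part of the exercise): hard thresholding at t = √(2 log(d/k)) applied to a k-sparse θ whose nonzero entries sit at the threshold. The simulated risk is of order k log(d/k); the sharp factor 2 + o(1) emerges only in the limit and with the more careful choices made in the exercise.

    import numpy as np

    # Risk of hard thresholding at t = sqrt(2 log(d/k)) for a k-sparse theta.
    rng = np.random.default_rng(3)
    d, k, reps = 100_000, 10, 50
    t = np.sqrt(2 * np.log(d / k))
    theta = np.zeros(d)
    theta[:k] = t                                 # nonzero entries placed at the threshold
    risks = []
    for _ in range(reps):
        x = theta + rng.standard_normal(d)
        est = np.where(np.abs(x) > t, x, 0.0)     # hard-thresholding estimator
        risks.append(np.sum((est - theta) ** 2))
    print(np.mean(risks), k * np.log(d / k), 2 * k * np.log(d / k))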
VI.14 Consider the following functional estimation problem in GLM. Observing X ∼ N(θ, I_d), we
intend to estimate the maximal coordinate of θ: T(θ) = θ_max ≜ max{θ_1, . . . , θ_d}. Prove the
minimax rate

        inf_{T̂} sup_{θ∈R^d} E_θ[(T̂ − θ_max)^2] ≍ log d.   (VI.20)

(a) Prove the upper bound by considering T̂ = X_max, the plug-in estimator based on the MLE.
(b) For the lower bound, consider testing

        H_0 : θ = 0   versus   H_1 : θ ∼ Unif{τe_1, τe_2, . . . , τe_d},

    where the e_i are the standard basis vectors and τ > 0. Then under H_0, X ∼ P_0 = N(0, I_d); under H_1,
    X ∼ P_1 = (1/d) ∑_{i=1}^{d} N(τe_i, I_d). Show that χ^2(P_1‖P_0) = (e^{τ^2} − 1)/d. (Hint: Exercise I.48.)
(c) Applying the joint range (7.32) (or (7.38)) to bound TV(P0 , P1 ), conclude the lower bound
part of (VI.20) via Le Cam’s method (Theorem 31.1).
(d) By improving both the upper and lower bound prove the sharp version:

        inf_{T̂} sup_{θ∈R^d} E_θ[(T̂ − θ_max)^2] = (1/2 + o(1)) log d,   d → ∞.   (VI.21)
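Part (a) and the sharp version in part (d) both hinge on the fact that the maximum of d independent standard normals concentrates around √(2 log d); a quick Monte Carlo of that driving quantity (an addition, for illustration only):

    import numpy as np

    # E[max of d standard normals] versus sqrt(2 log d).
    rng = np.random.default_rng(4)
    for d in [10**2, 10**4, 10**6]:
        m = np.mean([rng.standard_normal(d).max() for _ in range(100)])
        print(d, m, np.sqrt(2 * np.log(d)))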
VI.15 (Suboptimality of MLE in high dimensions [55]) Consider the d-dimensional GLM: X ∼
N(θ, I_d), where θ belongs to the parameter space

        Θ = { θ ∈ R^d : |θ_1| ≤ d^{1/4}, ‖θ_{\1}‖_2 ≤ 2(1 − d^{−1/4}|θ_1|) }

with θ_{\1} ≡ (θ_2, . . . , θ_d). For the square loss, prove the following for sufficiently large d.
(a) The minimax risk is bounded:

        inf_{θ̂} sup_{θ∈Θ} E_θ[‖θ̂ − θ‖_2^2] ≲ 1.

(b) The worst-case risk of the MLE is unbounded:

        sup_{θ∈Θ} E_θ[‖θ̂_MLE − θ‖_2^2] ≳ √d.
VI.16 (Covariance model) Let X_1, . . . , X_n be i.i.d. ∼ N(0, Σ), where Σ is a d × d covariance matrix. Let us
show that the minimax quadratic risk of estimating Σ using X_1, . . . , X_n satisfies

        inf_{Σ̂} sup_{‖Σ‖_F ≤ r} E[‖Σ̂ − Σ‖_F^2] ≍ (d/n ∧ 1) r^2,   ∀ r > 0, d, n ∈ N,   (VI.22)

where ‖Σ̂ − Σ‖_F^2 = ∑_{ij} (Σ̂_ij − Σ_ij)^2.
(a) Show that, unlike in the location model, without restricting to a compact parameter space for Σ
    the minimax risk in (VI.22) is infinite.
(b) Consider the sample covariance matrix Σ̂ = (1/n) ∑_{i=1}^{n} X_i X_i^⊤. Show that

        E[‖Σ̂ − Σ‖_F^2] = (1/n)(‖Σ‖_F^2 + Tr(Σ)^2)

    and use this to deduce the minimax upper bound in (VI.22). (A Monte Carlo check of this
    identity is sketched after this exercise.)
(c) To prove the minimax lower bound, we can proceed in several steps. Show that for any
    positive semidefinite (PSD) Σ_0, Σ_1 ⪰ 0, the KL divergence satisfies

        D(N(0, I_d + Σ_0) ‖ N(0, I_d + Σ_1)) ≤ (1/2) ‖Σ_0^{1/2} − Σ_1^{1/2}‖_F^2,   (VI.23)

    where I_d is the d × d identity matrix. (Hint: apply (2.8).)
(d) Let B(δ) denote the Frobenius ball of radius δ centered at the zero matrix. Let PSD = {X :
    X ⪰ 0} denote the collection of d × d PSD matrices. Show that

        vol(B(δ) ∩ PSD) / vol(B(δ)) = P[Z ⪰ 0],   (VI.24)

    where Z is a random matrix distributed according to the Gaussian Orthogonal Ensemble
    (GOE), that is, Z is symmetric with independent diagonal entries Z_ii i.i.d. ∼ N(0, 2) and
    off-diagonal entries Z_ij i.i.d. ∼ N(0, 1).
(e) Show that P[Z ⪰ 0] ≥ c^{d^2} for some absolute constant c.⁴
(f) Prove the following lower bound on the packing number of the set of PSD matrices:

        M(B(δ) ∩ PSD, ‖·‖_F, ϵ) ≥ (c′δ/ϵ)^{d^2/2}   (VI.25)

    for some absolute constant c′. (Hint: Use the volume bound and the results of Parts (d) and
    (e).)
(g) Complete the proof of the lower bound of (VI.22). (Hint: WLOG, we can consider r ≍ √d and
    aim for the lower bound Ω(d^2/n ∧ d). Take the packing from (VI.25) and shift it by the identity
    matrix I. Then apply Fano's method and use (VI.23).)
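The Monte Carlo check of part (b) above (an addition; Σ below is an arbitrary PSD matrix generated only to fix a test case):

    import numpy as np

    # E||Sigma_hat - Sigma||_F^2 = (||Sigma||_F^2 + Tr(Sigma)^2) / n for the sample covariance.
    rng = np.random.default_rng(5)
    d, n, reps = 5, 20, 5000
    A = rng.standard_normal((d, d))
    Sigma = A @ A.T / d                              # an arbitrary PSD covariance
    L = np.linalg.cholesky(Sigma)
    vals = []
    for _ in range(reps):
        X = rng.standard_normal((n, d)) @ L.T        # rows are i.i.d. N(0, Sigma)
        S_hat = X.T @ X / n
        vals.append(np.sum((S_hat - Sigma) ** 2))
    print(np.mean(vals), (np.sum(Sigma ** 2) + np.trace(Sigma) ** 2) / n)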
VI.17 For a family of probability distributions P and a functional T : P → R define its χ^2-modulus
of continuity δ_χ²(t). When the functional T is affine and continuous, and P is compact⁵, it can be
shown [346] that

        (1/7) δ_χ²(1/n)^2 ≤ inf_{T̂_n} sup_{P∈P} E_{X_i i.i.d.∼P}[(T(P) − T̂_n(X_1, . . . , X_n))^2] ≤ δ_χ²(1/n)^2.   (VI.26)
Consider the following problem (interval censored model): A lab conducts experiments with
n mice. In the i-th mouse a tumour develops at time A_i ∈ [0, 1], with the A_i i.i.d. ∼ π where π is a pdf
on [0, 1] bounded pointwise by 1/2 ≤ π ≤ 2. For each i the existence of a tumour is checked at
another random time B_i i.i.d. ∼ Unif(0, 1) with B_i ⊥⊥ A_i. Given observations X_i = (1{A_i ≤ B_i}, B_i)
one is trying to estimate T(π) = π[A ≤ 1/2]. Show that
⁴ Getting the exact exponent is a difficult result (cf. [26]). Here we only need some crude estimate.
⁵ Both under the same, but otherwise arbitrary, topology on P.
VI.21 (BMS channel comparison [371, 367]) Below X ∼ Ber(1/2) and P_{Y|X} is an input-symmetric
channel (BMS). It turns out that BSC and BEC are extremal for various partial orders. Prove
the following statements.
(a) If I_TV(X; Y) = (1/2)(1 − 2δ), then

        BSC_δ ≤_deg P_{Y|X} ≤_deg BEC_{2δ}.
Note: This bound is tight up to the first order: there exists a function g(q) = (1 + o(1))q log q
such that for all d > g(q), BOT with coloring channel on a d-ary tree has reconstruction.
VI.25 ([203]) Fix an integer q ≥ 2 and let X = [q]. Let λ ∈ [−1/(q − 1), 1] be a real number. Let us define
a special kind of q-ary symmetric channel, known as the Potts channel, by taking P_λ : X → X
as

        P_λ(y|x) = λ + (1 − λ)/q  if y = x,   and   P_λ(y|x) = (1 − λ)/q  if y ≠ x.

Prove that

        η_KL(P_λ) = qλ^2 / ((q − 2)λ + 2).
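A quick consistency check (an addition, not part of the exercise): for q = 2 the formula gives η_KL(P_λ) = 2λ^2/2 = λ^2, and writing λ = 1 − 2δ this recovers the familiar contraction coefficient η_KL(BSC_δ) = (1 − 2δ)^2 of the binary symmetric channel.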
VI.26 (Spectral Independence [20]) Say a probability distribution μ = μ_{X^n} supported on [q]^n is c-
pairwise independent if for every T ⊂ [n], σ_T ∈ [q]^T, the conditional measure μ^{(σ_T)} ≜ μ_{X_{T^c}|X_T=σ_T}
satisfies, for every ν_{X_{T^c}},

        ∑_{i≠j∈T^c} D(ν_{X_i,X_j} ‖ μ^{(σ_T)}_{X_i,X_j}) ≥ (2 − c/(n − |T| − 1)) ∑_{i∈T^c} D(ν_{X_i} ‖ μ^{(σ_T)}_{X_i}).
where EC_τ is the erasure channel, cf. Example 33.6. (Hint: Define f(τ) = D(EC_τ^{⊗n} ∘ ν ‖ EC_τ^{⊗n} ∘ μ)
and prove f″(τ) ≥ (c/τ) f′(τ).)
Remark: Applying the above with τ = 1/n shows that a Markov chain G_τ, known as (small-block)
Glauber dynamics for μ, mixes in O(n^{c+1} log n) time. Indeed, G_τ consists of first applying
EC_τ^{⊗n} and then "imputing" the erasures in the set S from the conditional distribution μ_{X_S|X_{S^c}}. It is
also known that c-pairwise independence is implied (under some additional conditions on μ
and for q = 2) by the uniform boundedness of the operator norms of the covariance matrices of
all μ^{(σ_T)} (see [91] for details). Thus the hard question of bounding η_KL(μ, G_τ) is first reduced to
η_KL(μ, EC_τ^{⊗n}) and then to the study of the spectra of covariance matrices.
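A minimal sketch of the "erase, then impute" dynamics described in the remark (an addition; the Ising-chain target μ, the coupling β, and the sequential single-site imputation below are my illustrative choices, not part of the exercise):

    import numpy as np

    # One step: erase each coordinate independently with probability tau (EC_tau),
    # then resample erased sites from their conditionals given the rest. For a cycle
    # Ising model we resample erased sites one at a time from the exact single-site
    # conditional; this matches the block conditional whenever the erased sites are
    # non-adjacent, which is the typical situation for tau = 1/n.
    rng = np.random.default_rng(6)
    n_sites, beta, tau = 30, 0.4, 1 / 30
    x = rng.choice([-1, 1], size=n_sites)

    def glauber_step(x):
        erased = np.nonzero(rng.random(n_sites) < tau)[0]
        for i in erased:
            h = beta * (x[(i - 1) % n_sites] + x[(i + 1) % n_sites])  # local field
            p_plus = 1 / (1 + np.exp(-2 * h))                         # P(X_i = +1 | rest)
            x[i] = 1 if rng.random() < p_plus else -1
        return x

    for _ in range(1000):
        x = glauber_step(x)
    print(x.mean())   # magnetization of the final state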
References
[1] E. Abbe and E. B. Adserà, “Subadditivity vol. 32, no. 4, pp. 533–542, 1986. (pp. 325
beyond trees and the Chi-squared mutual and 327)
information,” in 2019 IEEE International [9] R. Ahlswede and J. Körner, “Source cod-
Symposium on Information Theory (ISIT). ing with side information and a converse for
IEEE, 2019, pp. 697–701. (p. 135) degraded broadcast channels,” IEEE Trans-
[2] M. C. Abbott and B. B. Machta, “A scaling actions on Information Theory, vol. 21,
law from discrete to continuous solutions no. 6, pp. 629–637, 1975. (p. 227)
of channel capacity problems in the low- [10] S. M. Alamouti, “A simple transmit diver-
noise limit,” Journal of Statistical Physics, sity technique for wireless communica-
vol. 176, no. 1, pp. 214–227, 2019. (p. 248) tions,” IEEE Journal on selected areas in
[3] I. Abou-Faycal, M. Trott, and S. Shamai, communications, vol. 16, no. 8, pp. 1451–
“The capacity of discrete-time memoryless 1458, 1998. (p. 409)
rayleigh-fading channels,” IEEE Transac- [11] P. H. Algoet and T. M. Cover, “A sandwich
tion Information Theory, vol. 47, no. 4, pp. proof of the Shannon-Mcmillan-Breiman
1290 – 1301, 2001. (p. 409) theorem,” The annals of probability, pp.
[4] J. Acharya, C. L. Canonne, Y. Liu, Z. Sun, 899–909, 1988. (p. 234)
and H. Tyagi, “Interactive inference under [12] C. D. Aliprantis and K. C. Border, Infi-
information constraints,” IEEE Transac- nite Dimensional Analysis: a Hitchhiker’s
tions on Information Theory, vol. 68, no. 1, Guide, 3rd ed. Berlin: Springer-Verlag,
pp. 502–516, 2021. (p. 658) 2006. (p. 130)
[5] J. Acharya, C. L. Canonne, Z. Sun, and [13] N. Alon, “On the number of subgraphs of
H. Tyagi, “Unified lower bounds for inter- prescribed type of graphs with a given num-
active high-dimensional estimation under ber of edges,” Israel J. Math., vol. 38, no.
information constraints,” arXiv preprint 1-2, pp. 116–130, 1981. (p. 160)
arXiv:2010.06562, 2020. (p. 658) [14] N. Alon and A. Orlitsky, “A lower bound
[6] R. Ahlswede, “Extremal properties of rate on the expected length of one-to-one codes,”
distortion functions,” IEEE transactions on IEEE Transactions on Information The-
information theory, vol. 36, no. 1, pp. 166– ory, vol. 40, no. 5, pp. 1670–1672, 1994.
171, 1990. (p. 543) (p. 199)
[7] R. Ahlswede, B. Balkenhol, and L. Khacha- [15] N. Alon and J. H. Spencer, The Probabilis-
trian, “Some properties of fix free codes,” tic Method, 3rd ed. John Wiley & Sons,
in Proceedings First INTAS International 2008. (pp. 208 and 353)
Seminar on Coding Theory and Combina- [16] P. Alquier, “User-friendly introduction
torics, Thahkadzor, Armenia, 1996, pp. 20– to PAC-Bayes bounds,” arXiv preprint
33. (p. 208) arXiv:2110.11216, 2021. (p. 83)
[8] R. Ahlswede and I. Csiszár, “Hypothesis [17] S.-I. Amari and H. Nagaoka, Methods of
testing with communication constraints,” information geometry. American Math-
IEEE transactions on information theory, ematical Soc., 2007, vol. 191. (pp. 40
and 307)
[18] G. Aminian, Y. Bu, L. Toni, M. R. Theory and Related fields, vol. 108, no. 4,
Rodrigues, and G. Wornell, “Characteriz- pp. 517–542, 1997. (p. 668)
ing the generalization error of Gibbs algo- [27] S. Artstein, K. Ball, F. Barthe, and A. Naor,
rithm with symmetrized KL information,” “Solution of Shannon’s problem on the
arXiv preprint arXiv:2107.13656, 2021. monotonicity of entropy,” Journal of the
(p. 187) American Mathematical Society, pp. 975–
[19] V. Anantharam, A. Gohari, S. Kamath, 982, 2004. (p. 64)
and C. Nair, “On maximal correlation, [28] S. Artstein, V. Milman, and S. J. Szarek,
hypercontractivity, and the data processing “Duality of metric entropy,” Annals of math-
inequality studied by Erkip and Cover,” ematics, pp. 1313–1328, 2004. (p. 535)
arXiv preprint arXiv:1304.6133, 2013. [29] J. Baik, G. Ben Arous, and S. Péché, “Phase
(p. 638) transition of the largest eigenvalue for non-
[20] N. Anari, K. Liu, and S. O. Gharan, “Spec- null complex sample covariance matrices,”
tral independence in high-dimensional The Annals of Probability, vol. 33, no. 5, pp.
expanders and applications to the hardcore 1643–1697, 2005. (p. 651)
model,” SIAM Journal on Computing, [30] A. V. Banerjee, “A simple model of herd
no. 0, pp. FOCS20–1, 2021. (p. 671) behavior,” The Quarterly Journal of Eco-
[21] T. W. Anderson, “The integral of a symmet- nomics, vol. 107, no. 3, pp. 797–817, 1992.
ric unimodal function over a symmetric con- (pp. 135 and 181)
vex set and some probability inequalities,” [31] Z. Bar-Yossef, T. S. Jayram, R. Kumar, and
Proceedings of the American Mathematical D. Sivakumar, “An information statistics
Society, vol. 6, no. 2, pp. 170–176, 1955. approach to data stream and communica-
(p. 572) tion complexity,” Journal of Computer and
[22] A. Antos and I. Kontoyiannis, “Conver- System Sciences, vol. 68, no. 4, pp. 702–
gence properties of functional estimates for 732, 2004. (p. 182)
discrete distributions,” Random Structures [32] B. Bárány and I. Kolossváry, “On the abso-
& Algorithms, vol. 19, no. 3-4, pp. 163–193, lute continuity of the Blackwell measure,”
2001. (p. 138) Journal of Statistical Physics, vol. 159, pp.
[23] E. Arıkan, “Channel polarization: A 158–171, 2015. (p. 111)
method for constructing capacity-achieving [33] B. Bárány, M. Pollicott, and K. Simon,
codes for symmetric binary-input memo- “Stationary measures for projective trans-
ryless channels,” IEEE Transactions on formations: the Blackwell and Furstenberg
information Theory, vol. 55, no. 7, pp. measures,” Journal of Statistical Physics,
3051–3073, 2009. (p. 346) vol. 148, pp. 393–421, 2012. (p. 111)
[24] S. Arimoto, “An algorithm for computing [34] A. Barg and G. D. Forney, “Random codes:
the capacity of arbitrary discrete memory- Minimum distances and error exponents,”
less channels,” IEEE Transactions on Infor- IEEE Transactions on Information The-
mation Theory, vol. 18, no. 1, pp. 14–20, ory, vol. 48, no. 9, pp. 2568–2573, 2002.
1972. (p. 102) (p. 433)
[25] ——, “On the converse to the coding theo- [35] A. Barg and A. McGregor, “Distance distri-
rem for discrete memoryless channels (cor- bution of binary codes and the error proba-
resp.),” IEEE Transactions on Information bility of decoding,” IEEE Transactions on
Theory, vol. 19, no. 3, pp. 357–359, 1973. Information Theory, vol. 51, no. 12, pp.
(p. 433) 4237–4246, 2005. (p. 433)
[26] G. B. Arous and A. Guionnet, “Large devi- [36] S. Barman and O. Fawzi, “Algorithmic
ations for Wigner’s law and Voiculescu’s aspects of optimal channel coding,” IEEE
non-commutative entropy,” Probability Transactions on Information Theory,
vol. 64, no. 2, pp. 1038–1045, 2017. [47] C. Berrou, A. Glavieux, and P. Thiti-
(pp. 366, 367, 368, and 369) majshima, “Near Shannon limit
[37] A. R. Barron, “Universal approximation error-correcting coding and decoding:
bounds for superpositions of a sigmoidal Turbo-codes. 1,” in Proceedings of
function,” IEEE Trans. Inf. Theory, vol. 39, ICC’93-IEEE International Conference on
no. 3, pp. 930–945, 1993. (p. 534) Communications, vol. 2. IEEE, 1993, pp.
[38] P. L. Bartlett and S. Mendelson, 1064–1070. (pp. 346 and 411)
“Rademacher and Gaussian complexi- [48] J. C. Berry, “Minimax estimation of a
ties: Risk bounds and structural results,” bounded normal mean vector,” Journal of
Journal of Machine Learning Research, Multivariate Analysis, vol. 35, no. 1, pp.
vol. 3, no. Nov, pp. 463–482, 2002. (p. 86) 130–139, 1990. (p. 587)
[39] G. Basharin, “On a statistical estimate for [49] D. P. Bertsekas, A. Nedi�, and A. E.
the entropy of a sequence of independent Ozdaglar, Convex analysis and optimiza-
random variables,” Theory of Probability & tion. Belmont, MA, USA: Athena Scien-
Its Applications, vol. 4, no. 3, pp. 333–336, tific, 2003. (p. 93)
1959. (p. 584) [50] N. Bhatnagar, J. Vera, E. Vigoda, and
[40] A. Beck, First-order methods in optimiza- D. Weitz, “Reconstruction for colorings on
tion. SIAM, 2017. (p. 92) trees,” SIAM Journal on Discrete Mathe-
[41] A. Beirami and F. Fekri, “Fundamental lim- matics, vol. 25, no. 2, pp. 809–826, 2011.
its of universal lossless one-to-one com- (p. 644)
pression of parametric sources,” in Informa- [51] A. Bhatt, B. Nazer, O. Ordentlich, and
tion Theory Workshop (ITW). IEEE, 2014, Y. Polyanskiy, “Information-distilling
pp. 212–216. (p. 250) quantizers,” IEEE Transactions on
[42] C. H. Bennett, “Notes on Landauer’s princi- Information Theory, vol. 67, no. 4, pp.
ple, reversible computation, and Maxwell’s 2472–2487, 2021. (p. 190)
Demon,” Studies In History and Philoso- [52] A. Bhattacharyya, “On a measure of diver-
phy of Science Part B: Studies In History gence between two statistical populations
and Philosophy of Modern Physics, vol. 34, defined by their probability distributions,”
no. 3, pp. 501–510, 2003. (p. xix) Bull. Calcutta Math. Soc., vol. 35, pp. 99–
[43] C. H. Bennett, P. W. Shor, J. A. Smolin, and 109, 1943. (p. 117)
A. V. Thapliyal, “Entanglement-assisted [53] L. Birgé, “Approximation dans les espaces
classical capacity of noisy quantum chan- métriques et théorie de l’estimation,”
nels,” Physical Review Letters, vol. 83, Zeitschrift für Wahrscheinlichkeitstheorie
no. 15, p. 3081, 1999. (p. 498) und Verwandte Gebiete, vol. 65, no. 2, pp.
[44] W. R. Bennett, “Spectra of quantized sig- 181–237, 1983. (pp. xxii, 602, and 614)
nals,” The Bell System Technical Journal, [54] ——, “On estimating a density using
vol. 27, no. 3, pp. 446–472, 1948. (p. 483) Hellinger distance and some other strange
[45] P. Bergmans, “A simple converse for broad- facts,” Probability theory and related fields,
cast channels with additive white Gaus- vol. 71, no. 2, pp. 271–291, 1986. (p. 625)
sian noise (corresp.),” IEEE Transactions [55] L. Birgé, “Model selection via testing : an
on Information Theory, vol. 20, no. 2, pp. alternative to (penalized) maximum likeli-
279–280, 1974. (p. 65) hood estimators,” Annales de l’I.H.P. Prob-
[46] J. M. Bernardo, “Reference posterior dis- abilités et statistiques, vol. 42, no. 3, pp.
tributions for Bayesian inference,” Journal 273–325, 2006. (p. 667)
of the Royal Statistical Society: Series B [56] L. Birgé, “Robust tests for model selection,”
(Methodological), vol. 41, no. 2, pp. 113– From probability to statistics and back:
128, 1979. (p. 253) high-dimensional models and processes–A
Festschrift in honor of Jon A. Wellner, IMS [67] H. F. Bohnenblust, “Convex regions and
Collections, Volume 9, pp. 47–64, 2013. projections in Minkowski spaces,” Ann.
(p. 613) Math., vol. 39, no. 2, pp. 301–308, 1938.
[57] M. Š. Birman and M. Solomjak, “Piecewise- (p. 96)
polynomial approximations of functions of [68] A. Borovkov, Mathematical Statistics.
the classes,” Mathematics of the USSR- CRC Press, 1999. (pp. xxii, 141, 581,
Sbornik, vol. 2, no. 3, p. 295, 1967. (p. 538) and 582)
[58] N. Blachman, “The convolution inequal- [69] S. Boucheron, G. Lugosi, and P. Massart,
ity for entropy powers,” IEEE Transactions Concentration Inequalities: A Nonasymp-
on Information theory, vol. 11, no. 2, pp. totic Theory of Independence. OUP
267–271, 1965. (p. 185) Oxford, 2013. (pp. 85, 151, 302, and 541)
[59] D. Blackwell, L. Breiman, and [70] O. Bousquet, D. Kane, and S. Moran,
A. Thomasian, “The capacity of a class of “The optimal approximation factor in den-
channels,” The Annals of Mathematical sity estimation,” in Conference on Learn-
Statistics, pp. 1229–1241, 1959. (p. 465) ing Theory. PMLR, 2019, pp. 318–341.
[60] D. H. Blackwell, “The entropy of functions (p. 622)
of finite-state Markov chains,” Transactions [71] D. Braess and T. Sauer, “Bernstein poly-
of the first Prague conference on infor- nomials and learning theory,” Journal of
mation theory, statistical decision func- Approximation Theory, vol. 128, no. 2, pp.
tions, random processes, pp. 13–20, 1956. 187–206, 2004. (p. 665)
(p. 111) [72] D. Braess, J. Forster, T. Sauer, and
[61] R. E. Blahut, “Hypothesis testing and infor- H. U. Simon, “How to achieve minimax
mation theory,” IEEE Trans. Inf. Theory, expected Kullback-Leibler distance from
vol. 20, no. 4, pp. 405–417, 1974. (p. 289) an unknown finite distribution,” in Algo-
[62] R. Blahut, “Computation of channel rithmic Learning Theory. Springer, 2002,
capacity and rate-distortion functions,” pp. 380–394. (p. 665)
IEEE transactions on Information Theory, [73] M. Braverman, A. Garg, T. Ma, H. L.
vol. 18, no. 4, pp. 460–473, 1972. (p. 102) Nguyen, and D. P. Woodruff, “Communica-
[63] P. M. Bleher, J. Ruiz, and V. A. Zagrebnov, tion lower bounds for statistical estimation
“On the purity of the limiting gibbs state for problems via a distributed data processing
the Ising model on the Bethe lattice,” Jour- inequality,” in Proceedings of the forty-
nal of Statistical Physics, vol. 79, no. 1, pp. eighth annual ACM symposium on Theory
473–482, Apr 1995. (pp. 642 and 643) of Computing. ACM, 2016, pp. 1011–
[64] S. G. Bobkov and F. Götze, “Exponential 1020. (pp. 657 and 658)
integrability and transportation cost related [74] L. M. Bregman, “Some properties of non-
to logarithmic Sobolev inequalities,” Jour- negative matrices and their permanents,”
nal of Functional Analysis, vol. 163, no. 1, Soviet Math. Dokl., vol. 14, no. 4, pp. 945–
pp. 1–28, 1999. (p. 656) 949, 1973. (p. 161)
[65] S. Bobkov and G. P. Chistyakov, “Entropy [75] L. Breiman, “The individual ergodic the-
power inequality for the Rényi entropy.” orem of information theory,” Ann. Math.
IEEE Transactions on Information Theory, Stat., vol. 28, no. 3, pp. 809–811, 1957.
vol. 61, no. 2, pp. 708–714, 2015. (p. 27) (p. 234)
[66] T. Bohman, “A limit theorem for the Shan- [76] L. Brillouin, Science and information the-
non capacities of odd cycles I,” Proceedings ory, 2nd Ed. Academic Press, 1962.
of the American Mathematical Society, vol. (p. xvii)
131, no. 11, pp. 3559–3569, 2003. (p. 452) [77] L. D. Brown, “Fundamentals of statisti-
cal exponential families with applications
Information Theory, vol. 65, no. 1, pp. 380– [110] I. Csiszár and J. Körner, “Graph decomposi-
405, 2018. (p. 437) tion: a new key to coding theorems,” IEEE
[99] J. H. Conway and N. J. A. Sloane, Sphere Trans. Inf. Theory, vol. 27, no. 1, pp. 5–12,
packings, lattices and groups. Springer 1981. (p. 47)
Science & Business Media, 1999, vol. 290. [111] ——, Information Theory: Coding The-
(p. 527) orems for Discrete Memoryless Systems.
[100] M. Costa, “A new entropy power inequal- New York: Academic, 1981. (pp. xvii, xx,
ity,” IEEE Transactions on Information xxi, 355, 499, 500, 502, and 647)
Theory, vol. 31, no. 6, pp. 751–760, 1985. [112] I. Csiszár and P. C. Shields, “Information
(p. 64) theory and statistics: A tutorial,” Founda-
[101] D. J. Costello and G. D. Forney, “Channel tions and Trends in Communications and
coding: The road to channel capacity,” Pro- Information Theory, vol. 1, no. 4, pp. 417–
ceedings of the IEEE, vol. 95, no. 6, pp. 528, 2004. (pp. 104 and 250)
1150–1177, 2007. (p. 411) [113] I. Csiszár and G. Tusnády, “Informa-
[102] T. A. Courtade, “Monotonicity of entropy tion geometry and alternating minimiza-
and Fisher information: a quick proof via tion problems,” Statistics & Decision, Sup-
maximal correlation,” Communications in plement Issue No, vol. 1, 1984. (pp. 102
Information and Systems, vol. 16, no. 2, pp. and 103)
111–115, 2017. (p. 64) [114] I. Csiszár, “I-divergence geometry of prob-
[103] ——, “A strong entropy power inequality,” ability distributions and minimization prob-
IEEE Transactions on Information Theory, lems,” The Annals of Probability, pp. 146–
vol. 64, no. 4, pp. 2173–2192, 2017. (p. 64) 158, 1975. (pp. 303 and 312)
[104] T. M. Cover, “Universal data compression [115] I. Csiszár and J. Körner, Information The-
and portfolio selection,” in Proceedings of ory: Coding Theorems for Discrete Memo-
37th Conference on Foundations of Com- ryless Systems, 2nd ed. Cambridge Uni-
puter Science. IEEE, 1996, pp. 534–538. versity Press, 2011. (pp. xx, xxi, 13, 216,
(p. xx) and 426)
[105] T. M. Cover and B. Gopinath, Open prob- [116] P. Cuff, “Distributed channel synthesis,”
lems in communication and computation. IEEE Transactions on Information The-
Springer Science & Business Media, 2012. ory, vol. 59, no. 11, pp. 7071–7096, 2013.
(p. 413) (p. 503)
[106] T. M. Cover and J. A. Thomas, Elements [117] M. Cuturi, “Sinkhorn distances: Light-
of information theory, 2nd Ed. New speed computation of optimal transport,”
York, NY, USA: Wiley-Interscience, 2006. Advances in neural information process-
(pp. xvii, xx, xxi, 65, 210, 216, 355, 466, ing systems, vol. 26, pp. 2292–2300, 2013.
and 501) (p. 105)
[107] H. Cramér, “Über eine eigenschaft der [118] L. Davisson, R. McEliece, M. Pursley, and
normalen verteilungsfunktion,” Mathema- M. Wallace, “Efficient universal noiseless
tische Zeitschrift, vol. 41, no. 1, pp. 405– source codes,” IEEE Transactions on Infor-
414, 1936. (p. 101) mation Theory, vol. 27, no. 3, pp. 269–279,
[108] ——, Mathematical methods of statistics. 1981. (p. 270)
Princeton university press, 1946. (p. 576) [119] A. Decelle, F. Krzakala, C. Moore, and
[109] I. Csiszár, “Information-type measures of L. Zdeborová, “Asymptotic analysis of the
difference of probability distributions and stochastic block model for modular net-
indirect observation,” Studia Sci. Math. works and its algorithmic applications,”
Hungar., vol. 2, pp. 229–318, 1967. (p. 115) Physical review E, vol. 84, no. 6, p. 066106,
2011. (p. 642)
i i
i i
i i
[120] A. Dembo and O. Zeitouni, Large devia- on Differential Equations, Two on Informa-
tions techniques and applications. New tion Theory, American Mathematical Soci-
York: Springer Verlag, 2009. (pp. 291 ety Translations: Series 2, Volume 33, 1963.
and 308) (p. 83)
[121] A. P. Dempster, N. M. Laird, and D. B. [130] ——, “Mathematical problems in the Shan-
Rubin, “Maximum likelihood from incom- non theory of optimal coding of informa-
plete data via the EM algorithm,” Journal tion,” in Proc. 4th Berkeley Symp. Mathe-
of the royal statistical society. Series B matics, Statistics, and Probability, vol. 1,
(methodological), pp. 1–38, 1977. (p. 103) Berkeley, CA, USA, 1961, pp. 211–252.
[122] P. Diaconis and L. Saloff-Coste, “Logarith- (p. 435)
mic Sobolev inequalities for finite Markov [131] ——, “Asymptotic bounds on error prob-
chains,” Ann. Probab., vol. 6, no. 3, pp. ability for transmission over DMC with
695–750, 1996. (pp. 133 and 191) symmetric transition probabilities,” Theor.
[123] P. Diaconis and D. Freedman, “Finite Probability Appl., vol. 7, pp. 283–311,
exchangeable sequences,” The Annals of 1962. (pp. 383 and 446)
Probability, vol. 8, no. 4, pp. 745–764, [132] J. Dong, A. Roth, and W. J. Su, “Gaus-
1980. (p. 187) sian differential privacy,” Journal of the
[124] P. Diaconis and D. Stroock, “Geometric Royal Statistical Society Series B: Statisti-
bounds for eigenvalues of Markov chains,” cal Methodology, vol. 84, no. 1, pp. 3–37,
The Annals of Applied Probability, vol. 1, 2022. (p. 182)
no. 1, pp. 36–61, 1991. (pp. 641 and 669) [133] D. L. Donoho, “Wald lecture I: Counting
[125] H. Djellout, A. Guillin, and L. Wu, “Trans- bits with Kolmogorov and Shannon,” Note
portation cost-information inequalities and for the Wald Lectures, IMS Annual Meeting,
applications to random dynamical systems July 1997. (p. 543)
and diffusions,” The Annals of Probabil- [134] M. D. Donsker and S. S. Varadhan,
ity, vol. 32, no. 3B, pp. 2702–2732, 2004. “Asymptotic evaluation of certain Markov
(p. 656) process expectations for large time. IV,”
[126] R. Dobrushin and B. Tsybakov, “Informa- Communications on Pure and Applied
tion transmission with additional noise,” Mathematics, vol. 36, no. 2, pp. 183–212,
IRE Transactions on Information Theory, 1983. (p. 72)
vol. 8, no. 5, pp. 293–304, 1962. (p. 548) [135] J. L. Doob, Stochastic Processes. New
[127] R. L. Dobrushin, “Central limit theorem York Wiley, 1953. (p. 233)
for nonstationary Markov chains, I,” The- [136] F. du Pin Calmon, Y. Polyanskiy, and
ory Probab. Appl., vol. 1, no. 1, pp. 65–80, Y. Wu, “Strong data processing inequalities
1956. (p. 630) for input constrained additive noise chan-
[128] ——, “A simplified method of experimen- nels,” IEEE Transactions on Information
tally evaluating the entropy of a station- Theory, vol. 64, no. 3, pp. 1879–1892, 2017.
ary sequence,” Theory of Probability & Its (p. 325)
Applications, vol. 3, no. 4, pp. 428–430, [137] J. C. Duchi, M. I. Jordan, M. J. Wainwright,
1958. (p. 584) and Y. Zhang, “Optimality guarantees for
[129] ——, “A general formulation of the funda- distributed statistical estimation,” arXiv
mental theorem of Shannon in the theory of preprint arXiv:1405.0782, 2014. (pp. 657
information,” Uspekhi Mat. Nauk, vol. 14, and 658)
no. 6, pp. 3–104, 1959, English translation [138] J. Duda, “Asymmetric numeral systems:
in Eleven Papers in Analysis: Nine Papers entropy coding combining speed of
Huffman coding with compression rate
of arithmetic coding,” arXiv preprint Mathematical Statistics, vol. 43, no. 3, pp.
arXiv:1311.2540, 2013. (p. 246) 865–870, 1972. (p. 167)
[139] R. M. Dudley, Uniform central limit theo- [150] ——, “Coding for noisy channels,” IRE
rems. Cambridge university press, 1999, Convention Record, vol. 3, pp. 37–46, 1955.
no. 63. (pp. 86 and 535) (p. 365)
[140] G. Dueck, “The strong converse to the cod- [151] E. O. Elliott, “Estimates of error rates for
ing theorem for the multiple-access chan- codes on burst-noise channels,” Bell Syst.
nel,” J. Comb. Inform. Syst. Sci, vol. 6, no. 3, Tech. J., vol. 42, pp. 1977–1997, Sep. 1963.
pp. 187–196, 1981. (p. 187) (p. 111)
[141] G. Dueck and J. Körner, “Reliability [152] D. M. Endres and J. E. Schindelin, “A new
function of a discrete memoryless chan- metric for probability distributions,” IEEE
nel at rates above capacity (corresp.),” Transactions on Information theory, vol. 49,
IEEE Transactions on Information Theory, no. 7, pp. 1858–1860, 2003. (p. 117)
vol. 25, no. 1, pp. 82–85, 1979. (p. 433) [153] P. Erdös, “Some remarks on the theory of
[142] N. Dunford and J. T. Schwartz, Linear oper- graphs,” Bulletin of the American Mathe-
ators, part 1: general theory. John Wiley matical Society, vol. 53, no. 4, pp. 292–294,
& Sons, 1988, vol. 10. (p. 80) 1947. (p. 215)
[143] R. Durrett, Probability: Theory and Exam- [154] P. Erdös and A. Rényi, “On random graphs,
ples, 4th ed. Cambridge University Press, I,” Publicationes Mathematicae (Debre-
2010. (p. 125) cen), vol. 6, pp. 290–297, 1959. (p. 653)
[144] A. Dytso, S. Yagli, H. V. Poor, and S. S. [155] V. Erokhin, “ε-entropy of a discrete random
Shitz, “The capacity achieving distribu- variable,” Theory of Probability & Its Appli-
tion for the amplitude constrained additive cations, vol. 3, no. 1, pp. 97–100, 1958.
Gaussian channel: An upper bound on the (p. 547)
number of mass points,” IEEE Transactions [156] K. Eswaran and M. Gastpar, “Remote
on Information Theory, vol. 66, no. 4, pp. source coding under Gaussian noise: Duel-
2006–2022, 2019. (p. 408) ing roles of power and entropy power,”
[145] H. G. Eggleston, Convexity, ser. Tracts in IEEE Transactions on Information Theory,
Math and Math. Phys. Cambridge Univer- 2019. (p. 657)
sity Press, 1958, vol. 47. (p. 129) [157] W. Evans and N. Pippenger, “On the maxi-
[146] A. El Alaoui and A. Montanari, “An mum tolerable noise for reliable computa-
information-theoretic view of stochastic tion by formulas,” IEEE Transactions on
localization,” IEEE Transactions on Infor- Information Theory, vol. 44, no. 3, pp.
mation Theory, vol. 68, no. 11, pp. 7423– 1299–1305, May 1998. (p. 629)
7426, 2022. (p. 191) [158] W. S. Evans and L. J. Schulman, “Signal
[147] A. El Gamal and Y.-H. Kim, Network infor- propagation and noisy circuits,” IEEE
mation theory. Cambridge University Transactions on Information Theory,
Press, 2011. (pp. xxi and 501) vol. 45, no. 7, pp. 2367–2373, Nov 1999.
[148] R. Eldan, “Taming correlations through (p. 627)
entropy-efficient measure decompositions [159] M. Falahatgar, A. Orlitsky, V. Pichapati,
with applications to mean-field approxi- and A. Suresh, “Learning Markov distri-
mation,” Probability Theory and Related butions: Does estimation trump compres-
Fields, vol. 176, no. 3-4, pp. 737–755, 2020. sion?” in 2016 IEEE International Sympo-
(p. 191) sium on Information Theory (ISIT). IEEE,
[149] P. Elias, “The efficient construction of July 2016, pp. 2689–2693. (p. 258)
an unbiased random sequence,” Annals of
[160] M. Feder, “Gambling using a finite state [172] G. D. Forney, “Concatenated codes,”
machine,” IEEE Transactions on Informa- MIT RLE Technical Rep., vol. 440, 1965.
tion Theory, vol. 37, no. 5, pp. 1459–1465, (p. 378)
1991. (p. 264) [173] E. Friedgut and J. Kahn, “On the number
[161] M. Feder, N. Merhav, and M. Gut- of copies of one hypergraph in another,”
man, “Universal prediction of individual Israel J. Math., vol. 105, pp. 251–256, 1998.
sequences,” IEEE Trans. Inf. Theory, (p. 160)
vol. 38, no. 4, pp. 1258–1270, 1992. [174] P. Gács and J. Körner, “Common infor-
(p. 260) mation is far less than mutual informa-
[162] M. Feder and Y. Polyanskiy, “Sequential tion,” Problems of Control and Information
prediction under log-loss and misspecifica- Theory, vol. 2, no. 2, pp. 149–162, 1973.
tion,” in Conference on Learning Theory. (p. 338)
PMLR, 2021, pp. 1937–1964. (pp. 175 [175] A. Galanis, D. Štefankovi�, and E. Vigoda,
and 261) “Inapproximability of the partition function
[163] A. A. Fedotov, P. Harremoës, and F. Top- for the antiferromagnetic Ising and hard-
søe, “Refinements of Pinsker’s inequality,” core models,” Combinatorics, Probability
Information Theory, IEEE Transactions on, and Computing, vol. 25, no. 4, pp. 500–559,
vol. 49, no. 6, pp. 1491–1498, Jun. 2003. 2016. (p. 75)
(p. 131) [176] R. G. Gallager, “A simple derivation of
[164] W. Feller, An Introduction to Probability the coding theorem and some applications,”
Theory and Its Applications, 3rd ed. New IEEE Trans. Inf. Theory, vol. 11, no. 1, pp.
York: Wiley, 1970, vol. I. (p. 538) 3–18, 1965. (p. 360)
[165] ——, An Introduction to Probability The- [177] ——, Information Theory and Reliable
ory and Its Applications, 2nd ed. New Communication. New York: Wiley, 1968.
York: Wiley, 1971, vol. II. (p. 435) (pp. xvii, xxi, 383, and 432)
[166] T. S. Ferguson, Mathematical Statistics: A [178] R. Gallager, “The random coding bound
Decision Theoretic Approach. New York, is tight for the average code (corresp.),”
NY: Academic Press, 1967. (p. 558) IEEE Transactions on Information Theory,
[167] ——, “An inconsistent maximum likeli- vol. 19, no. 2, pp. 244–246, 1973. (p. 433)
hood estimate,” Journal of the American [179] R. Gardner, “The Brunn-Minkowski
Statistical Association, vol. 77, no. 380, pp. inequality,” Bulletin of the American
831–834, 1982. (p. 583) Mathematical Society, vol. 39, no. 3, pp.
[168] ——, A course in large sample theory. 355–405, 2002. (p. 573)
CRC Press, 1996. (p. 582) [180] A. M. Garsia, Topics in almost everywhere
[169] R. A. Fisher, “The logic of inductive infer- convergence. Chicago: Markham Publish-
ence,” Journal of the royal statistical soci- ing Company, 1970. (p. 238)
ety, vol. 98, no. 1, pp. 39–82, 1935. (p. xvii) [181] M. Gastpar, B. Rimoldi, and M. Vet-
[170] B. M. Fitingof, “The compression of dis- terli, “To code, or not to code: Lossy
crete information,” Problemy Peredachi source-channel communication revisited,”
Informatsii, vol. 3, no. 3, pp. 28–36, 1967. IEEE Transactions on Information The-
(p. 246) ory, vol. 49, no. 5, pp. 1147–1158, 2003.
[171] P. Fleisher, “Sufficient conditions for (p. 521)
achieving minimum distortion in a quan- [182] I. M. Gel’fand, A. N. Kolmogorov, and
tizer,” IEEE Int. Conv. Rec., pp. 104–111, A. M. Yaglom, “On the general definition
1964. (p. 481) of the amount of information,” Dokl. Akad.
Nauk. SSSR, vol. 11, pp. 745–748, 1956.
(p. 70)
[183] S. I. Gelfand and M. Pinsker, “Coding for vol. 16, no. 3, pp. 1281–1290, 1988.
channels with random parameters,” Probl. (p. 540)
Contr. Inform. Theory, vol. 9, no. 1, pp. [195] V. D. Goppa, “Nonprobabilistic mutual
19–31, 1980. (p. 468) information with memory,” Probl. Contr.
[184] Y. Geng and C. Nair, “The capacity region Inf. Theory, vol. 4, pp. 97–102, 1975.
of the two-receiver Gaussian vector broad- (p. 463)
cast channel with private and common mes- [196] ——, “Codes and information,” Russian
sages,” IEEE Transactions on Information Mathematical Surveys, vol. 39, no. 1, p. 87,
Theory, vol. 60, no. 4, pp. 2087–2104, 2014. 1984. (p. 463)
(p. 109) [197] R. M. Gray and D. L. Neuhoff, “Quanti-
[185] G. L. Gilardoni, “On a Gel’fand-Yaglom- zation,” IEEE Trans. Inf. Theory, vol. 44,
Peres theorem for f-divergences,” arXiv no. 6, pp. 2325–2383, 1998. (p. 475)
preprint arXiv:0911.1934, 2009. (p. 154) [198] R. M. Gray, Entropy and Information The-
[186] ——, “On Pinsker’s and Vajda’s type ory. New York, NY: Springer-Verlag,
inequalities for Csiszár’s-divergences,” 1990. (p. xxi)
Information Theory, IEEE Transactions [199] U. Grenander and G. Szegö, Toeplitz forms
on, vol. 56, no. 11, pp. 5377–5386, 2010. and their applications, 2nd ed. New
(p. 133) York: Chelsea Publishing Company, 1984.
[187] E. N. Gilbert, “Capacity of burst-noise (p. 114)
channels,” Bell Syst. Tech. J., vol. 39, pp. [200] L. Gross, “Logarithmic sobolev inequali-
1253–1265, Sep. 1960. (p. 111) ties,” American Journal of Mathematics,
[188] R. D. Gill and B. Y. Levit, “Applications vol. 97, no. 4, pp. 1061–1083, 1975.
of the van Trees inequality: a Bayesian (pp. 107 and 191)
Cramér-Rao bound,” Bernoulli, vol. 1, no. [201] Y. Gu, “Channel comparison methods and
1–2, pp. 59–79, 1995. (p. 577) statistical problems on graphs,” Ph.D. dis-
[189] J. Gilmer, “A constant lower bound for sertation, MIT, Cambridge, MA, 02139,
the union-closed sets conjecture,” arXiv USA, 2023. (p. 638)
preprint arXiv:2211.09055, 2022. (p. 189) [202] Y. Gu and Y. Polyanskiy, “Uniqueness of
[190] C. Giraud, Introduction to High- BP fixed point for the Potts model and appli-
Dimensional Statistics. Chapman cations to community detection,” in Con-
and Hall/CRC, 2014. (p. xxii) ference on Learning Theory (COLT), 2023.
[191] G. Glaeser, “Racine carrée d’une fonc- (pp. 135 and 644)
tion différentiable,” Annales de l’institut [203] ——, “Non-linear log-Sobolev inequalities
Fourier, vol. 13, no. 2, pp. 203–210, 1963. for the Potts semigroup and appli-
(p. 625) cations to reconstruction problems,”
[192] O. Goldreich, Introduction to property test- Comm. Math. Physics, (to appear), also
ing. Cambridge University Press, 2017. arXiv:2005.05444. (pp. 643, 653, 670,
(p. 325) and 671)
[193] I. Goodfellow, J. Pouget-Abadie, M. Mirza, [204] Y. Gu, H. Roozbehani, and Y. Polyanskiy,
B. Xu, D. Warde-Farley, S. Ozair, “Broadcasting on trees near criticality,” in
A. Courville, and Y. Bengio, “Genera- 2020 IEEE International Symposium on
tive adversarial nets,” Advances in neural Information Theory (ISIT). IEEE, 2020,
information processing systems, vol. 27, pp. 1504–1509. (p. 670)
2014. (p. 150) [205] D. Guo, S. Shamai (Shitz), and S. Verdú,
[194] V. Goodman, “Characteristics of normal “Mutual information and minimum mean-
samples,” The Annals of Probability, square error in Gaussian channels,” IEEE
Trans. Inf. Theory, vol. 51, no. 4, pp. 1261 P. Elias, Eds. Springer Netherlands, 1975,
– 1283, Apr. 2005. (p. 59) vol. 16, pp. 323–355. (p. 584)
[206] D. Guo, Y. Wu, S. S. Shamai, and S. Verdú, [216] D. Haussler and M. Opper, “Mutual infor-
“Estimation in Gaussian noise: Proper- mation, metric entropy and cumulative rel-
ties of the minimum mean-square error,” ative entropy risk,” The Annals of Statis-
IEEE Transactions on Information Theory, tics, vol. 25, no. 6, pp. 2451–2492, 1997.
vol. 57, no. 4, pp. 2371–2385, 2011. (p. 63) (pp. xxii and 188)
[207] U. Hadar, J. Liu, Y. Polyanskiy, and [217] M. Hayashi, “General nonasymptotic and
O. Shayevitz, “Communication complexity asymptotic formulas in channel resolv-
of estimating correlations,” in Proceedings ability and identification capacity and
of the 51st Annual ACM SIGACT Sympo- their application to the wiretap channel,”
sium on Theory of Computing. ACM, IEEE Transactions on Information The-
2019, pp. 792–803. (p. 645) ory, vol. 52, no. 4, pp. 1562–1575, 2006.
[208] B. Hajek, Y. Wu, and J. Xu, “Information (p. 505)
limits for recovering a hidden community,” [218] W. Hoeffding, “Asymptotically optimal
IEEE Trans. on Information Theory, vol. 63, tests for multinomial distributions,” The
no. 8, pp. 4729 – 4745, 2017. (p. 591) Annals of Mathematical Statistics, pp. 369–
[209] J. Hájek, “Local asymptotic minimax and 401, 1965. (p. 289)
admissibility in estimation,” in Proceedings [219] P. J. Huber, “Fisher information and spline
of the sixth Berkeley symposium on math- interpolation,” Annals of Statistics, pp.
ematical statistics and probability, vol. 1, 1029–1033, 1974. (p. 580)
1972, pp. 175–194. (p. 582) [220] ——, Robust Statistics. New York, NY:
[210] J. M. Hammersley, “On estimating Wiley-Interscience, 1981. (pp. 151 and 152)
restricted parameters,” Journal of the Royal [221] ——, “A robust version of the probabil-
Statistical Society. Series B (Methodolog- ity ratio test,” The Annals of Mathematical
ical), vol. 12, no. 2, pp. 192–240, 1950. Statistics, pp. 1753–1758, 1965. (pp. 324,
(p. 575) 338, and 613)
[211] T. S. Han, Information-spectrum methods [222] I. A. Ibragimov and R. Z. Khas’minsk�,
in information theory. Springer Science Statistical Estimation: Asymptotic Theory.
& Business Media, 2003. (pp. xix and xxi) Springer, 1981. (pp. xxii and 143)
[212] T. S. Han and S. Verdú, “Approximation [223] S. Ihara, “On the capacity of channels with
theory of output statistics,” IEEE Transac- additive non-Gaussian noise,” Information
tions on Information Theory, vol. 39, no. 3, and Control, vol. 37, no. 1, pp. 34–39, 1978.
pp. 752–772, 1993. (pp. 504 and 505) (p. 401)
[213] Y. Han, S. Jana, and Y. Wu, “Optimal pre- [224] ——, Information theory for continuous
diction of Markov chains with and without systems. World Scientific, 1993, vol. 2.
spectral gap,” IEEE Transactions on Infor- (p. 419)
mation Theory, vol. 69, no. 6, pp. 3920– [225] Y. I. Ingster and I. A. Suslina, Nonparamet-
3959, 2023. (p. 258) ric goodness-of-fit testing under Gaussian
[214] P. Harremoës and I. Vajda, “On pairs of models. New York, NY: Springer, 2003.
f-divergences and their joint range,” IEEE (pp. 134, 185, 325, and 561)
Trans. Inf. Theory, vol. 57, no. 6, pp. 3230– [226] Y. I. Ingster, “Minimax testing of nonpara-
3235, Jun. 2011. (pp. 115, 128, and 129) metric hypotheses on a distribution density
[215] B. Harris, “The statistical estimation of in the Lp metrics,” Theory of Probability &
entropy in the non-parametric case,” in Top- Its Applications, vol. 31, no. 2, pp. 333–337,
ics in Information Theory, I. Csiszár and 1987. (p. 325)
[227] S. Janson, “Random regular graphs: asymp- [236] I. Johnstone, Gaussian estimation:
totic distributions and contiguity,” Combi- Sequence and wavelet models, 2011, avail-
natorics, Probability and Computing, vol. 4, able at http://www-stat.stanford.edu/~imj/.
no. 4, pp. 369–405, 1995. (p. 186) (p. 590)
[228] S. Janson and E. Mossel, “Robust recon- [237] L. K. Jones, “A simple lemma on greedy
struction on trees is determined by the approximation in Hilbert space and conver-
second eigenvalue,” Ann. Probab., vol. 32, gence rates for projection pursuit regression
no. 1A, pp. 2630–2649, 2004. (p. 644) and neural network training,” The Annals of
[229] E. T. Jaynes, Probability theory: The logic Statistics, pp. 608–613, 1992. (p. 534)
of science. Cambridge university press, [238] A. B. Juditsky and A. S. Nemirovski, “Non-
2003. (p. 253) parametric estimation by convex program-
[230] T. S. Jayram, “Hellinger strikes back: ming,” The Annals of Statistics, vol. 37,
A note on the multi-party information no. 5A, pp. 2278–2300, 2009. (p. 566)
complexity of AND,” in International [239] S. M. Kakade, K. Sridharan, and A. Tewari,
Workshop on Approximation Algorithms “On the complexity of linear prediction:
for Combinatorial Optimization, 2009, pp. Risk bounds, margin bounds, and regular-
562–573. (p. 183) ization,” Advances in neural information
[231] I. Jensen and A. J. Guttmann, “Series expan- processing systems, vol. 21, 2008. (p. 87)
sions of the percolation probability for [240] S. Kamath, A. Orlitsky, D. Pichapati, and
directed square and honeycomb lattices,” A. Suresh, “On learning distributions from
Journal of Physics A: Mathematical and their samples,” in Conference on Learning
General, vol. 28, no. 17, p. 4813, 1995. Theory, June 2015, pp. 1066–1100. (p. 258)
(p. 670) [241] T. Kawabata and A. Dembo, “The rate-
[232] Z. Jia, Y. Polyanskiy, and Y. Wu, “Entropic distortion dimension of sets and measures,”
characterization of optimal rates for learn- IEEE Trans. Inf. Theory, vol. 40, no. 5, pp.
ing Gaussian mixtures,” in Conference on 1564 – 1572, Sep. 1994. (p. 542)
Learning Theory (COLT). PMLR, 2023. [242] M. Keane and G. O’Brien, “A Bernoulli
(p. 619) factory,” ACM Transactions on Modeling
[233] J. Jiao, K. Venkat, Y. Han, and T. Weiss- and Computer Simulation, vol. 4, no. 2, pp.
man, “Minimax estimation of functionals of 213–219, 1994. (p. 172)
discrete distributions,” IEEE Transactions [243] J. Kemperman, “On the Shannon capacity
on Information Theory, vol. 61, no. 5, pp. of an arbitrary channel,” in Indagationes
2835–2885, 2015. (p. 584) Mathematicae (Proceedings), vol. 77, no. 2.
[234] C. Jin, Y. Zhang, S. Balakrishnan, M. J. North-Holland, 1974, pp. 101–115. (p. 97)
Wainwright, and M. I. Jordan, “Local max- [244] H. Kesten and B. P. Stigum, “Additional
ima in the likelihood of Gaussian mixture limit theorems for indecomposable multi-
models: Structural results and algorithmic dimensional galton-watson processes,” The
consequences,” in Advances in neural infor- Annals of Mathematical Statistics, vol. 37,
mation processing systems, 2016, pp. 4116– no. 6, pp. 1463–1481, 1966. (pp. 644
4124. (p. 105) and 670)
[235] W. B. Johnson, G. Schechtman, and J. Zinn, [245] D. P. Kingma and M. Welling, “Auto-
“Best constants in moment inequalities encoding variational Bayes,” arXiv preprint
for linear combinations of independent arXiv:1312.6114, 2013. (pp. 76 and 77)
and exchangeable random variables,” The [246] D. P. Kingma, M. Welling et al., “An intro-
Annals of Probability, vol. 13, no. 1, pp. duction to variational autoencoders,” Foun-
234–253, 1985. (p. 497) dations and Trends® in Machine Learning,
vol. 12, no. 4, pp. 307–392, 2019. (p. 77)
[247] T. Koch, “The Shannon lower bound is [256] O. Kosut and L. Sankar, “Asymptotics
asymptotically tight,” IEEE Transactions and non-asymptotics for universal fixed-to-
on Information Theory, vol. 62, no. 11, pp. variable source coding,” IEEE Transactions
6155–6161, 2016. (p. 511) on Information Theory, vol. 63, no. 6, pp.
[248] Y. Kochman, O. Ordentlich, and Y. Polyan- 3757–3772, 2017. (p. 250)
skiy, “A lower bound on the expected [257] A. Krause and D. Golovin, “Submodular
distortion of joint source-channel coding,” function maximization,” Tractability, vol. 3,
IEEE Transactions on Information The- pp. 71–104, 2014. (p. 367)
ory, vol. 66, no. 8, pp. 4722–4741, 2020. [258] R. Krichevskiy, “Laplace’s law of succes-
(p. 521) sion and universal encoding,” IEEE Trans-
[249] A. Kolchinsky and B. D. Tracey, “Esti- actions on Information Theory, vol. 44,
mating mixture entropy with pairwise dis- no. 1, pp. 296–303, Jan. 1998. (p. 665)
tances,” Entropy, vol. 19, no. 7, p. 361, [259] R. Krichevsky, “A relation between the
2017. (p. 188) plausibility of information about a source
[250] A. N. Kolmogorov and V. M. Tikhomirov, and encoding redundancy,” Problems
“ε-entropy and ε-capacity of sets in function Inform. Transmission, vol. 4, no. 3, pp.
spaces,” Uspekhi Matematicheskikh Nauk, 48–57, 1968. (p. 247)
vol. 14, no. 2, pp. 3–86, 1959, reprinted [260] R. Krichevsky and V. Trofimov, “The per-
in Shiryayev, A. N., ed. Selected Works formance of universal encoding,” IEEE
of AN Kolmogorov: Volume III: Informa- Trans. Inf. Theory, vol. 27, no. 2, pp. 199–
tion Theory and the Theory of Algorithms, 207, 1981. (p. 254)
Vol. 27, Springer Netherlands, 1993, pp 86– [261] F. Krzakała, A. Montanari, F. Ricci-
170. (pp. 522, 523, 524, 526, 535, 538, 539, Tersenghi, G. Semerjian, and L. Zdeborová,
and 543) “Gibbs states and the set of solutions
[251] I. Kontoyiannis and S. Verdú, “Optimal of random constraint satisfaction prob-
lossless data compression: Non- lems,” Proceedings of the National
asymptotics and asymptotics,” IEEE Academy of Sciences, vol. 104, no. 25, pp.
Trans. Inf. Theory, vol. 60, no. 2, pp. 10 318–10 323, 2007. (p. 642)
777–795, 2014. (p. 198) [262] J. Kuelbs, “A strong convergence theorem
[252] J. Körner and A. Orlitsky, “Zero-error for Banach space valued random variables,”
information theory,” IEEE Transactions on The Annals of Probability, vol. 4, no. 5, pp.
Information Theory, vol. 44, no. 6, pp. 744–771, 1976. (p. 540)
2207–2229, 1998. (p. 374) [263] J. Kuelbs and W. V. Li, “Metric entropy and
[253] V. Koshelev, “Quantization with minimal the small ball problem for Gaussian mea-
entropy,” Probl. Pered. Inform, vol. 14, pp. sures,” Journal of Functional Analysis, vol.
151–156, 1963. (p. 483) 116, no. 1, pp. 133–157, 1993. (pp. 540
[254] V. Kostina, Y. Polyanskiy, and S. Verdú, and 541)
“Variable-length compression allowing [264] S. Kullback, Information theory and statis-
errors,” IEEE Transactions on Information tics. Mineola, NY: Dover publications,
Theory, vol. 61, no. 8, pp. 4316–4330, 1968. (p. xxi)
2015. (p. 548) [265] C. Külske and M. Formentin, “A symmetric
[255] V. Kostina and S. Verdú, “Fixed-length entropy bound on the non-reconstruction
lossy compression in the finite blocklength regime of Markov chains on Galton-
regime,” IEEE Transactions on Information Watson trees,” Electronic Communications
Theory, vol. 58, no. 6, pp. 3309–3338, 2012. in Probability, vol. 14, pp. 587–596, 2009.
(p. 485) (p. 135)
[266] H. O. Lancaster, “Some properties of [277] E. Lehmann and J. Romano, Testing Statis-
the bivariate normal distribution consid- tical Hypotheses, 3rd ed. Springer, 2005.
ered in the form of a contingency table,” (pp. 275 and 325)
Biometrika, vol. 44, no. 1/2, pp. 289–292, [278] W. V. Li and W. Linde, “Approximation,
1957. (p. 641) metric entropy and small ball estimates for
[267] R. Landauer, “Irreversibility and heat gen- Gaussian measures,” The Annals of Proba-
eration in the computing process,” IBM bility, vol. 27, no. 3, pp. 1556–1578, 1999.
journal of research and development, vol. 5, (p. 541)
no. 3, pp. 183–191, 1961. (p. xix) [279] W. V. Li and Q.-M. Shao, “Gaussian pro-
[268] A. Lapidoth, A foundation in digital com- cesses: inequalities, small ball probabilities
munication. Cambridge University Press, and applications,” Handbook of Statistics,
2017. (p. 403) vol. 19, pp. 533–597, 2001. (pp. 539, 541,
[269] A. Lapidoth and S. M. Moser, “Capac- and 553)
ity bounds via duality with applications [280] E. H. Lieb, “Proof of an entropy conjecture
to multiple-antenna systems on flat-fading of Wehrl,” Communications in Mathemati-
channels,” IEEE Transactions on Informa- cal Physics, vol. 62, no. 1, pp. 35–41, 1978.
tion Theory, vol. 49, no. 10, pp. 2426–2467, (p. 64)
2003. (p. 409) [281] T. Linder and R. Zamir, “On the asymp-
[270] B. Laurent and P. Massart, “Adaptive esti- totic tightness of the Shannon lower bound,”
mation of a quadratic functional by model IEEE Transactions on Information The-
selection,” The Annals of Statistics, vol. 28, ory, vol. 40, no. 6, pp. 2026–2031, 1994.
no. 5, pp. 1302–1338, 2000. (p. 85) (p. 511)
[271] S. L. Lauritzen, Graphical models. Claren- [282] R. S. Liptser, F. Pukel’sheim, and A. N.
don Press, 1996, vol. 17. (pp. 50 and 51) Shiryaev, “Necessary and sufficient condi-
[272] L. Le Cam, “Convergence of estimates tions for contiguity and entire asymptotic
under dimensionality restrictions,” Annals separation of probability measures,” Rus-
of Statistics, vol. 1, no. 1, pp. 38 – 53, 1973. sian Mathematical Surveys, vol. 37, no. 6,
(p. xxii) p. 107, 1982. (p. 126)
[273] ——, Asymptotic methods in statistical [283] S. Litsyn, “New upper bounds on error
decision theory. New York, NY: Springer- exponents,” IEEE Transactions on Informa-
Verlag, 1986. (pp. 117, 133, 558, 582, 602, tion Theory, vol. 45, no. 2, pp. 385–398,
and 614) 1999. (p. 433)
[274] C. C. Leang and D. H. Johnson, “On [284] S. Lloyd, “Least squares quantization in
the asymptotics of m-hypothesis Bayesian PCM,” IEEE transactions on information
detection,” IEEE Transactions on Informa- theory, vol. 28, no. 2, pp. 129–137, 1982.
tion Theory, vol. 43, no. 1, pp. 280–282, (p. 480)
1997. (p. 337) [285] G. G. Lorentz, M. v. Golitschek, and
[275] K. Lee, Y. Wu, and Y. Bresler, “Near opti- Y. Makovoz, Constructive approximation:
mal compressed sensing of sparse rank-one advanced problems. Springer, 1996, vol.
matrices via sparse power factorization,” 304. (pp. 523 and 538)
IEEE Transactions on Information Theory, [286] L. Lovász, “On the Shannon capacity of
vol. 64, no. 3, pp. 1666–1698, Mar. 2018. a graph,” IEEE Transactions on Informa-
(p. 543) tion theory, vol. 25, no. 1, pp. 1–7, 1979.
[276] E. L. Lehmann and G. Casella, Theory of (p. 452)
Point Estimation, 2nd ed. New York, NY: [287] D. J. MacKay, Information theory, infer-
Springer, 1998. (pp. xxii and 564) ence and learning algorithms. Cambridge
university press, 2003. (p. xxi)
i i
i i
i i
i i
i i
i i
References 687
i i
i i
i i
[330] J. Pitman, “Probabilistic bounds on the [340] Y. Polyanskiy and S. Verdú, “Empirical dis-
coefficients of polynomials with only real tribution of good channel codes with non-
zeros,” Journal of Combinatorial Theory, vanishing error probability,” IEEE Trans.
Series A, vol. 77, no. 2, pp. 279–303, 1997. Inf. Theory, vol. 60, no. 1, pp. 5–21, Jan.
(p. 301) 2014. (p. 429)
[331] E. Plotnik, M. J. Weinberger, and J. Ziv, [341] Y. Polyanskiy, “Saddle point in the mini-
“Upper bounds on the probability of max converse for channel coding,” IEEE
sequences emitted by finite-state sources Transactions on Information Theory,
and on the redundancy of the Lempel-Ziv vol. 59, no. 5, pp. 2576–2595, 2012.
algorithm,” IEEE transactions on informa- (pp. 429 and 430)
tion theory, vol. 38, no. 1, pp. 66–72, 1992. [342] ——, “On dispersion of compound DMCs,”
(p. 264) in 2013 51st Annual Allerton Conference
[332] D. Pollard, “Empirical processes: Theory on Communication, Control, and Comput-
and applications,” NSF-CBMS Regional ing (Allerton). IEEE, 2013, pp. 26–32.
Conference Series in Probability and Statis- (pp. 437 and 465)
tics, vol. 2, pp. i–86, 1990. (p. 603) [343] Y. Polyanskiy and Y. Wu, “Peak-to-average
[333] Y. Polyanskiy, “Channel coding: non- power ratio of good codes for Gaussian
asymptotic fundamental limits,” Ph.D. channel,” IEEE Trans. Inf. Theory, vol. 60,
dissertation, Princeton Univ., Princeton, no. 12, pp. 7655–7660, Dec. 2014. (p. 408)
NJ, USA, 2010. (pp. 109, 383, 385, 429, [344] ——, “Wasserstein continuity of entropy
435, and 436) and outer bounds for interference channels,”
[334] Y. Polyanskiy, H. V. Poor, and S. Verdú, IEEE Transactions on Information Theory,
“Channel coding rate in the finite block- vol. 62, no. 7, pp. 3992–4002, 2016. (pp. 60
length regime,” IEEE Trans. Inf. Theory, and 64)
vol. 56, no. 5, pp. 2307–2359, May 2010. [345] ——, “Strong data-processing inequalities
(pp. 346, 353, 434, 435, 436, and 584) for channels and Bayesian networks,” in
[335] ——, “Dispersion of the Gilbert-Elliott Convexity and Concentration. The IMA Vol-
channel,” IEEE Trans. Inf. Theory, vol. 57, umes in Mathematics and its Applications,
no. 4, pp. 1829–1848, Apr. 2011. (p. 437) vol 161, E. Carlen, M. Madiman, and E. M.
[336] ——, “Feedback in the non-asymptotic Werner, Eds. New York, NY: Springer,
regime,” IEEE Trans. Inf. Theory, vol. 57, 2017, pp. 211–249. (pp. 325, 626, 631, 632,
no. 4, pp. 4903 – 4925, Apr. 2011. (pp. 446, 635, 636, 638, 646, and 647)
454, 455, and 456) [346] ——, “Dualizing Le Cam’s method for
[337] ——, “Minimum energy to send k bits with functional estimation, with applications to
and without feedback,” IEEE Trans. Inf. estimating the unseens,” arXiv preprint
Theory, vol. 57, no. 8, pp. 4880–4902, Aug. arXiv:1902.05616, 2019. (pp. 566 and 668)
2011. (pp. 413 and 449) [347] ——, “Application of the information-
[338] Y. Polyanskiy and S. Verdú, “Arimoto chan- percolation method to reconstruction prob-
nel coding converse and Rényi divergence,” lems on graphs,” Mathematical Statistics
in Proceedings of the Forty-eighth Annual and Learning, vol. 2, no. 1, pp. 1–24, 2020.
Allerton Conference on Communication, (pp. 650, 651, and 653)
Control, and Computing, Sep. 2010, pp. [348] ——, “Self-regularizing property of non-
1327–1333. (pp. 121, 433, and 505) parametric maximum likelihood estima-
[339] Y. Polyanskiy and S. Verdu, “Binary tor in mixture models,” arXiv preprint
hypothesis testing with feedback,” in Infor- arXiv:2008.08244, 2020. (p. 408)
mation Theory and Applications Workshop [349] E. C. Posner and E. R. Rodemich, “Epsilon
(ITA), 2011. (p. 320) entropy and data compression,” Annals of
i i
i i
i i
References 689
Mathematical Statistics, vol. 42, no. 6, pp. parity-check codes,” IEEE Transac-
2079–2125, Dec. 1971. (p. 543) tions on Information Theory, vol. 47, no. 2,
[350] A. Prékopa, “Logarithmic concave mea- pp. 619–637, 2001. (p. 516)
sures with application to stochastic pro- [360] T. Richardson and R. Urbanke, Modern
gramming,” Acta Scientiarum Mathemati- Coding Theory. Cambridge University
carum, vol. 32, pp. 301–316, 1971. (p. 573) Press, 2008. (pp. xxi, 63, 341, 346, 383,
[351] J. Radhakrishnan, “An entropy proof of and 632)
Bregman’s theorem,” J. Combin. Theory [361] Y. Rinott, “On convexity of measures,”
Ser. A, vol. 77, no. 1, pp. 161–164, 1997. Annals of Probability, vol. 4, no. 6, pp.
(p. 161) 1020–1026, 1976. (p. 573)
[352] M. Raginsky, “Strong data processing [362] J. J. Rissanen, “Fisher information and
inequalities and ϕ-Sobolev inequalities for stochastic complexity,” IEEE transactions
discrete channels,” IEEE Transactions on on information theory, vol. 42, no. 1, pp.
Information Theory, vol. 62, no. 6, pp. 40–47, 1996. (p. 261)
3355–3389, 2016. (pp. 626 and 638) [363] H. Robbins, “An empirical Bayes approach
[353] M. Raginsky and I. Sason, “Concentration to statistics,” in Proceedings of the Third
of measure inequalities in information the- Berkeley Symposium on Mathematical
ory, communications, and coding,” Founda- Statistics and Probability, Volume 1: Con-
tions and Trends® in Communications and tributions to the Theory of Statistics. The
Information Theory, vol. 10, no. 1-2, pp. Regents of the University of California,
1–246, 2013. (p. xxi) 1956. (p. 563)
[354] C. R. Rao, “Information and the accuracy [364] R. W. Robinson and N. C. Wormald,
attainable in the estimation of statistical “Almost all cubic graphs are Hamiltonian,”
parameters,” Bull. Calc. Math. Soc., vol. 37, Random Structures & Algorithms, vol. 3,
pp. 81–91, 1945. (p. 576) no. 2, pp. 117–125, 1992. (p. 186)
[355] A. H. Reeves, “The past present and future [365] C. Rogers, Packing and Covering, ser. Cam-
of PCM,” IEEE Spectrum, vol. 2, no. 5, pp. bridge tracts in mathematics and mathemat-
58–62, 1965. (p. 477) ical physics. Cambridge University Press,
[356] A. Rényi, “On measures of entropy and 1964. (p. 527)
information,” in Proc. 4th Berkeley Symp. [366] H. Roozbehani and Y. Polyanskiy, “Alge-
Mathematics, Statistics, and Probability, braic methods of classifying directed
vol. 1, Berkeley, CA, USA, 1961, pp. 547– graphical models,” arXiv preprint
561. (p. 13) arXiv:1401.5551, 2014. (p. 180)
[357] ——, “On the dimension and entropy of [367] ——, “Low density majority codes and
probability distributions,” Acta Mathemat- the problem of graceful degradation,” arXiv
ica Hungarica, vol. 10, no. 1 – 2, Mar. 1959. preprint arXiv:1911.12263, 2019. (pp. 191
(p. 29) and 669)
[358] Z. Reznikova and B. Ryabko, “Anal- [368] H. P. Rosenthal, “On the subspaces of
ysis of the language of ants by Lp (p > 2) spanned by sequences of inde-
information-theoretical methods,” Prob- pendent random variables,” Israel Journal
lemi Peredachi Informatsii, vol. 22, no. 3, of Mathematics, vol. 8, no. 3, pp. 273–303,
pp. 103–108, 1986, english translation: 1970. (p. 497)
http://reznikova.net/R-R-entropy-09.pdf. [369] D. Russo and J. Zou, “Controlling bias in
(p. 9) adaptive data analysis using information
[359] T. J. Richardson, M. A. Shokrollahi, and theory,” in Artificial Intelligence and Statis-
R. L. Urbanke, “Design of capacity- tics. PMLR, 2016, pp. 1232–1240. (pp. 90
approaching irregular low-density and 188)
i i
i i
i i
[370] I. N. Sanov, “On the probability of large channels i,” Inf. Contr., vol. 10, pp. 65–103,
deviations of random magnitudes,” Matem- 1967. (pp. 432 and 433)
aticheskii Sbornik, vol. 84, no. 1, pp. 11–44, [382] J. Shawe-Taylor and R. C. Williamson, “A
1957. (p. 307) PAC analysis of a Bayesian estimator,” in
[371] E. �a�o�lu, “Polar coding theorems for dis- Proceedings of the tenth annual conference
crete systems,” EPFL, Tech. Rep., 2011. on Computational learning theory, 1997,
(p. 669) pp. 2–9. (p. 83)
[372] ——, “Polarization and polar codes,” Foun- [383] O. Shayevitz, “On Rényi measures and
dations and Trends® in Communications hypothesis testing,” in 2011 IEEE Interna-
and Information Theory, vol. 8, no. 4, pp. tional Symposium on Information Theory
259–381, 2012. (p. 341) Proceedings. IEEE, 2011, pp. 894–898.
[373] I. Sason and S. Verdú, “f-divergence (p. 182)
inequalities,” IEEE Transactions on Infor- [384] O. Shayevitz and M. Feder, “Optimal feed-
mation Theory, vol. 62, no. 11, pp. 5973– back communication via posterior match-
6006, 2016. (p. 132) ing,” IEEE Trans. Inf. Theory, vol. 57, no. 3,
[374] G. Schechtman, “Extremal configurations pp. 1186–1222, 2011. (p. 445)
for moments of sums of independent pos- [385] A. N. Shiryaev, Probability-1. Springer,
itive random variables,” in Banach Spaces 2016, vol. 95. (p. 126)
and their Applications in Analysis. De [386] G. Simons and M. Woodroofe, “The
Gruyter, 2011, pp. 183–192. (p. 505) Cramér-Rao inequality holds almost every-
[375] M. J. Schervish, Theory of statistics. where,” in Recent Advances in Statistics:
Springer-Verlag New York, 1995. (pp. 582 Papers in Honor of Herman Chernoff on his
and 583) Sixtieth Birthday. Academic, New York,
[376] A. Schrijver, Theory of linear and integer 1983, pp. 69–93. (p. 661)
programming. John Wiley & Sons, 1998. [387] Y. G. Sinai, “On the notion of entropy of a
(p. 567) dynamical system,” in Doklady of Russian
[377] C. E. Shannon, “A symbolic analysis of Academy of Sciences, vol. 124, no. 3, 1959,
relay and switching circuits,” Electrical pp. 768–771. (pp. xix and 230)
Engineering, vol. 57, no. 12, pp. 713–723, [388] R. Sinkhorn, “A relationship between arbi-
Dec 1938. (p. 626) trary positive matrices and doubly stochas-
[378] C. E. Shannon, “A mathematical theory of tic matrices,” Ann. Math. Stat., vol. 35,
communication,” Bell Syst. Tech. J., vol. 27, no. 2, pp. 876–879, 1964. (p. 105)
pp. 379–423 and 623–656, Jul./Oct. 1948. [389] M. Sion, “On general minimax theorems,”
(pp. xvii, 41, 195, 215, 234, 341, 346, 377, Pacific J. Math, vol. 8, no. 1, pp. 171–176,
and 411) 1958. (p. 93)
[379] ——, “The zero error capacity of a noisy [390] M.-K. Siu, “Which Latin squares are Cay-
channel,” IRE Transactions on Informa- ley tables?” Amer. Math. Monthly, vol. 98,
tion Theory, vol. 2, no. 3, pp. 8–19, 1956. no. 7, pp. 625–627, Aug. 1991. (p. 384)
(pp. 374, 450, and 452) [391] D. Slepian and H. O. Pollak, “Prolate
[380] ——, “Coding theorems for a discrete spheroidal wave functions, Fourier analysis
source with a fidelity criterion,” IRE Nat. and uncertainty–I,” Bell System Technical
Conv. Rec, vol. 4, no. 142-163, p. 1, 1959. Journal, vol. 40, no. 1, pp. 43–63, 1961.
(pp. 475 and 490) (p. 419)
[381] C. E. Shannon, R. G. Gallager, and E. R. [392] D. Slepian and J. Wolf, “Noiseless cod-
Berlekamp, “Lower bounds to error prob- ing of correlated information sources,”
ability for coding on discrete memoryless IEEE Transactions on information Theory,
vol. 19, no. 4, pp. 471–480, 1973. (p. 223)
i i
i i
i i
References 691
[393] A. Sly, “Reconstruction of random colour- decision theory. Berlin, Germany: Walter
ings,” Communications in Mathematical de Gruyter, 1985. (pp. 558 and 566)
Physics, vol. 288, no. 3, pp. 943–961, Jun [405] J. Suzuki, “Some notes on universal noise-
2009. (p. 644) less coding,” IEICE transactions on fun-
[394] A. Sly and N. Sun, “Counting in two-spin damentals of electronics, communications
models on d-regular graphs,” The Annals of and computer sciences, vol. 78, no. 12, pp.
Probability, vol. 42, no. 6, pp. 2383–2416, 1840–1847, 1995. (p. 252)
2014. (p. 75) [406] S. Szarek, “Nets of Grassmann manifold
[395] B. Smith, “Instantaneous companding of and orthogonal groups,” in Proceedings of
quantized signals,” Bell System Technical Banach Space Workshop. University of
Journal, vol. 36, no. 3, pp. 653–709, 1957. Iowa Press, 1982, pp. 169–185. (pp. 527
(p. 483) and 544)
[396] J. G. Smith, “The information capacity of [407] ——, “Metric entropy of homogeneous
amplitude and variance-constrained scalar spaces,” Banach Center Publications,
Gaussian channels,” Information and Con- vol. 43, no. 1, pp. 395–410, 1998. (p. 527)
trol, vol. 18, pp. 203 – 219, 1971. (p. 408) [408] W. Szpankowski and S. Verdú, “Mini-
[397] Spectre, “SPECTRE: Short packet com- mum expected length of fixed-to-variable
munication toolbox,” https://github.com/ lossless compression without prefix con-
yp-mit/spectre, 2015, GitHub repository. straints,” IEEE Trans. Inf. Theory, vol. 57,
(pp. 418 and 441) no. 7, pp. 4017–4025, 2011. (p. 200)
[398] R. Speer, J. Chin, A. Lin, S. Jewett, and [409] I. Tal and A. Vardy, “List decoding of polar
L. Nathan, “Luminosoinsight/wordfreq: codes,” IEEE Transactions on Information
v2.2,” Oct. 2018. [Online]. Available: https: Theory, vol. 61, no. 5, pp. 2213–2226, 2015.
//doi.org/10.5281/zenodo.1443582 (p. 204) (p. 346)
[399] A. J. Stam, “Some inequalities satisfied by [410] M. Talagrand, “The Parisi formula,” Annals
the quantities of information of Fisher and of mathematics, pp. 221–263, 2006. (p. 63)
Shannon,” Information and Control, vol. 2, [411] ——, Upper and lower bounds for stochas-
no. 2, pp. 101–112, 1959. (pp. 64, 185, tic processes. Springer, 2014. (p. 531)
and 191) [412] T. Tanaka, P. M. Esfahani, and S. K. Mitter,
[400] ——, “Distance between sampling with “LQG control with minimum directed
and without replacement,” Statistica Neer- information: Semidefinite programming
landica, vol. 32, no. 2, pp. 81–91, 1978. approach,” IEEE Transactions on Auto-
(pp. 186 and 187) matic Control, vol. 63, no. 1, pp. 37–52,
[401] M. Steiner, “The strong simplex conjecture 2017. (p. 449)
is false,” IEEE Transactions on Information [413] W. Tang and F. Tang, “The Poisson bino-
Theory, vol. 40, no. 3, pp. 721–731, 1994. mial distribution – old & new,” Statistical
(p. 413) Science, vol. 38, no. 1, pp. 108–119, 2023.
[402] V. Strassen, “Asymptotische Abschätzun- (p. 301)
gen in Shannon’s Informationstheorie,” in [414] T. Tao, “Szemerédi’s regularity lemma
Trans. 3d Prague Conf. Inf. Theory, Prague, revisited,” Contributions to Discrete Math-
1962, pp. 689–723. (p. 435) ematics, vol. 1, no. 1, pp. 8–28, 2006.
[403] ——, “The existence of probability mea- (pp. 127 and 190)
sures with given marginals,” Annals of [415] G. Taricco and M. Elia, “Capacity of fading
Mathematical Statistics, vol. 36, no. 2, pp. channel with no side information,” Elec-
423–439, 1965. (p. 122) tronics Letters, vol. 33, no. 16, pp. 1368–
[404] H. Strasser, Mathematical theory of statis- 1370, 1997. (p. 409)
tics: Statistical experiments and asymptotic
i i
i i
i i
[416] V. Tarokh, H. Jafarkhani, and A. R. Calder- [427] I. Vajda, “Note on discrimination informa-
bank, “Space-time block codes from orthog- tion and variation (corresp.),” IEEE Trans-
onal designs,” IEEE Transactions on Infor- actions on Information Theory, vol. 16,
mation theory, vol. 45, no. 5, pp. 1456– no. 6, pp. 771–773, 1970. (p. 131)
1467, 1999. (p. 409) [428] G. Valiant and P. Valiant, “Estimating the
[417] V. Tarokh, N. Seshadri, and A. R. Calder- unseen: an n/ log(n)-sample estimator for
bank, “Space-time codes for high data rate entropy and support size, shown optimal
wireless communication: Performance cri- via new CLTs,” in Proceedings of the 43rd
terion and code construction,” IEEE trans- annual ACM symposium on Theory of com-
actions on information theory, vol. 44, puting, 2011, pp. 685–694. (p. 584)
no. 2, pp. 744–765, 1998. (p. 409) [429] S. van de Geer, Empirical Processes in M-
[418] E. Telatar, “Capacity of multi-antenna Estimation. Cambridge University Press,
Gaussian channels,” European trans. tele- 2000. (pp. 86 and 603)
com., vol. 10, no. 6, pp. 585–595, 1999. [430] A. van der Vaart, “The statistical work of
(pp. 176 and 409) Lucien Le Cam,” Annals of Statistics, pp.
[419] ——, “Wringing lemmas and multiple 631–682, 2002. (pp. 614 and 616)
descriptions,” 2016, unpublished draft. [431] A. W. van der Vaart and J. A. Well-
(p. 187) ner, Weak Convergence and Empirical Pro-
[420] V. N. Temlyakov, “On estimates of ϵ- cesses. Springer Verlag New York, Inc.,
entropy and widths of classes of functions 1996. (pp. 86 and 603)
with a bounded mixed derivative or differ- [432] T. Van Erven and P. Harremoës, “Rényi
ence,” Doklady Akademii Nauk, vol. 301, divergence and Kullback-Leibler diver-
no. 2, pp. 288–291, 1988. (p. 541) gence,” IEEE Trans. Inf. Theory, vol. 60,
[421] N. Tishby, F. C. Pereira, and W. Bialek, no. 7, pp. 3797–3820, 2014. (p. 145)
“The information bottleneck method,” [433] H. L. Van Trees, Detection, Estimation, and
arXiv preprint physics/0004057, 2000. Modulation Theory. Wiley, New York,
(p. 549) 1968. (p. 577)
[422] F. Topsøe, “Some inequalities for informa- [434] S. Verdú, “On channel capacity per unit
tion divergence and related measures of dis- cost,” IEEE Trans. Inf. Theory, vol. 36,
crimination,” IEEE Transactions on Infor- no. 5, pp. 1019–1030, Sep. 1990. (p. 414)
mation Theory, vol. 46, no. 4, pp. 1602– [435] ——, Multiuser Detection. Cambridge,
1609, 2000. (p. 133) UK: Cambridge Univ. Press, 1998. (p. 413)
[423] D. Tse and P. Viswanath, Fundamentals of [436] ——, “Information theory, part I,” draft
wireless communication. Cambridge Uni- (personal communication), 2017. (p. xv)
versity Press, 2005. (pp. xxi, 403, and 409) [437] S. Verdú and D. Guo, “A simple proof of the
[424] A. B. Tsybakov, Introduction to Nonpara- entropy-power inequality,” IEEE Transac-
metric Estimation. New York, NY: tions on Information Theory, vol. 52, no. 5,
Springer Verlag, 2009. (pp. xxi, xxii, 132, pp. 2165–2166, 2006. (p. 64)
and 624) [438] R. Vershynin, High-dimensional probabil-
[425] B. P. Tunstall, “Synthesis of noiseless com- ity: An introduction with applications in
pression codes,” Ph.D. dissertation, Geor- data science. Cambridge university press,
gia Institute of Technology, 1967. (p. 196) 2018, vol. 47. (pp. 86 and 531)
[426] E. Uhrmann-Klingen, “Minimal Fisher [439] A. G. Vitushkin, “On the 13th problem of
information distributions with compact- Hilbert,” Dokl. Akad. Nauk SSSR, vol. 95,
supports,” Sankhy�: The Indian Journal no. 4, pp. 701–704, 1954. (p. 538)
of Statistics, Series A, pp. 360–374, 1995.
(p. 580)
i i
i i
i i
References 693
[440] ——, “On Hilbert’s thirteenth problem and [452] R. J. Williams, “Simple statistical gradient-
related questions,” Russian Mathematical following algorithms for connectionist rein-
Surveys, vol. 59, no. 1, p. 11, 2004. (p. xviii) forcement learning,” Machine learning,
[441] ——, Theory of the Transmission and Pro- vol. 8, pp. 229–256, 1992. (p. 77)
cessing of Information. Pergamon Press, [453] H. Witsenhausen and A. Wyner, “A condi-
1961. (p. 535) tional entropy bound for a pair of discrete
[442] J. von Neumann, “Various techniques used random variables,” IEEE Transactions on
in connection with random digits,” Monte Information Theory, vol. 21, no. 5, pp. 493–
Carlo Method, National Bureau of Stan- 501, 1975. (p. 325)
dards, Applied Math Series, no. 12, pp. [454] J. Wolfowitz, “On Wald’s proof of the con-
36–38, 1951. (p. 166) sistency of the maximum likelihood esti-
[443] ——, “Probabilistic logics and the synthe- mate,” The Annals of Mathematical Statis-
sis of reliable organisms from unreliable tics, vol. 20, no. 4, pp. 601–602, 1949.
components,” in Automata Studies.(AM- (p. 582)
34), Volume 34, C. E. Shannon and [455] Y. Wu and J. Xu, “Statistical problems
J. McCarthy, Eds. Princeton University with planted structures: Information-
Press, 1956, pp. 43–98. (p. 627) theoretical and computational limits,” in
[444] D. Von Rosen, “Moments for the inverted Information-Theoretic Methods in Data
Wishart distribution,” Scandinavian Jour- Science, Y. Eldar and M. Rodrigues,
nal of Statistics, pp. 97–109, 1988. (p. 272) Eds. Cambridge University Press, 2020,
[445] V. G. Vovk, “Aggregating strategies,” Proc. arXiv:1806.00118. (p. 338)
of Computational Learning Theory, 1990, [456] Y. Wu and P. Yang, “Minimax rates
1990. (pp. xx and 271) of entropy estimation on large alpha-
[446] M. J. Wainwright, High-dimensional statis- bets via best polynomial approximation,”
tics: A non-asymptotic viewpoint. Cam- IEEE Transactions on Information The-
bridge University Press, 2019, vol. 48. ory, vol. 62, no. 6, pp. 3702–3720, 2016.
(p. xxii) (p. 584)
[447] M. J. Wainwright and M. I. Jordan, “Graph- [457] A. Wyner and J. Ziv, “A theorem on
ical models, exponential families, and varia- the entropy of certain binary sequences
tional inference,” Foundations and Trends® and applications–I,” IEEE Transactions on
in Machine Learning, vol. 1, no. 1–2, pp. Information Theory, vol. 19, no. 6, pp. 769–
1–305, 2008. (pp. 74 and 75) 772, 1973. (p. 191)
[448] A. Wald, “Sequential tests of statistical [458] A. Wyner, “The common information of
hypotheses,” The Annals of Mathematical two dependent random variables,” IEEE
Statistics, vol. 16, no. 2, pp. 117–186, 1945. Transactions on Information Theory,
(p. 320) vol. 21, no. 2, pp. 163–179, 1975. (pp. 503
[449] ——, “Note on the consistency of the max- and 504)
imum likelihood estimate,” The Annals of [459] ——, “On source coding with side infor-
Mathematical Statistics, vol. 20, no. 4, pp. mation at the decoder,” IEEE Transactions
595–601, 1949. (p. 582) on Information Theory, vol. 21, no. 3, pp.
[450] A. Wald and J. Wolfowitz, “Optimum char- 294–300, 1975. (p. 227)
acter of the sequential probability ratio test,” [460] Q. Xie and A. R. Barron, “Minimax redun-
The Annals of Mathematical Statistics, pp. dancy for the class of memoryless sources,”
326–339, 1948. (p. 320) IEEE Transactions on Information Theory,
[451] M. M. Wilde, Quantum information theory. vol. 43, no. 2, pp. 646–657, 1997. (p. 252)
Cambridge University Press, 2013. (p. xxi)
i i
i i
i i
[461] A. Xu and M. Raginsky, “Information- Theory, vol. 38, no. 5, pp. 1597–1602, 1992.
theoretic analysis of generalization capa- (p. 324)
bility of learning algorithms,” Advances [470] C.-H. Zhang, “Compound decision theory
in Neural Information Processing Systems, and empirical Bayes methods,” The Annals
vol. 30, 2017. (p. 90) of Statistics, vol. 31, no. 2, pp. 379–390,
[462] W. Yang, G. Durisi, T. Koch, and Y. Polyan- 2003. (p. 563)
skiy, “Quasi-static multiple-antenna fading [471] T. Zhang, “Covering number bounds of
channels at finite blocklength,” IEEE Trans- certain regularized linear function classes,”
actions on Information Theory, vol. 60, Journal of Machine Learning Research,
no. 7, pp. 4232–4265, 2014. (p. 437) vol. 2, no. Mar, pp. 527–550, 2002. (p. 533)
[463] W. Yang, G. Durisi, and Y. Polyan- [472] Z. Zhang and R. W. Yeung, “A non-
skiy, “Minimum energy to send k bits Shannon-type conditional inequality of
over multiple-antenna fading channels,” information quantities,” IEEE Trans. Inf.
IEEE Transactions on Information The- Theory, vol. 43, no. 6, pp. 1982–1986, 1997.
ory, vol. 62, no. 12, pp. 6831–6853, 2016. (p. 17)
(p. 417) [473] ——, “On characterization of entropy func-
[464] Y. Yang and A. R. Barron, “Information- tion via information inequalities,” IEEE
theoretic determination of minimax rates of Trans. Inf. Theory, vol. 44, no. 4, pp. 1440–
convergence,” Annals of Statistics, vol. 27, 1452, 1998. (p. 17)
no. 5, pp. 1564–1599, 1999. (pp. xxii, 602, [474] L. Zheng and D. N. C. Tse, “Communica-
606, and 607) tion on the Grassmann manifold: A geomet-
[465] Y. G. Yatracos, “Rates of convergence ric approach to the noncoherent multiple-
of minimum distance estimators and Kol- antenna channel,” IEEE transactions on
mogorov’s entropy,” The Annals of Statis- Information Theory, vol. 48, no. 2, pp. 359–
tics, pp. 768–774, 1985. (pp. 602, 620, 383, 2002. (p. 409)
and 621) [475] N. Zhivotovskiy, “Dimension-free bounds
[466] S. Yekhanin, “Improved upper bound for for sums of independent matrices and sim-
the redundancy of fix-free codes,” IEEE ple tensors via the variational principle,”
Trans. Inf. Theory, vol. 50, no. 11, pp. 2815– arXiv preprint arXiv:2108.08198, 2021.
2818, 2004. (p. 208) (p. 85)
[467] P. L. Zador, “Development and evaluation [476] W. Zhou, V. Veitch, M. Austern, R. P.
of procedures for quantizing multivariate Adams, and P. Orbanz, “Non-vacuous gen-
distributions,” Ph.D. dissertation, Stanford eralization bounds at the ImageNet scale:
University, Department of Statistics, 1963. a PAC-Bayesian compression approach,” in
(p. 482) International Conference on Learning Rep-
[468] ——, “Asymptotic quantization error of resentations (ICLR), 2018. (pp. 89 and 90)
continuous signals and the quantization [477] G. Zipf, Selective Studies and the Principle
dimension,” IEEE Transactions on Informa- of Relative Frequency in Language. Cam-
tion Theory, vol. 28, no. 2, pp. 139–149, bridge MA: Harvard University Press, 1932.
1982. (p. 482) (pp. 203 and 204)
[469] O. Zeitouni, J. Ziv, and N. Merhav, “When
is the generalized likelihood ratio test opti-
mal?” IEEE Transactions on Information
i i
i i
i i
Index
FI -curve, 325, 338, 549, 638 Alon, N., 160 BEC, 181, 372, 380, 439, 455, 460,
I-projection, see information alternating minimization algorithm, 471, 639, 654, 669
projection 102 belief propagation, 653
Log function, 25 Amari, S.-I., 307 Bell Labs, 480
ϵ-covering, 523 Anderson’s lemma, 541, 572 Berlekamp, E., 432
ϵ-net, see ϵ-covering Anderson, T. W., 572 Bernoulli factory, 172
ϵ-packing, 523 approximate message passing (AMP), Bernoulli shifts, 230
Z2 synchronization, 649 653 Bernoulli, D., 143
σ-algebra, 79 area theorem, 63 Berry-Esseen inequality, 435
denseness, 240 area under the curve (AUC), 278 Bhattacharyya distance, 315
monotone limits, 79 Arimoto, S., 102, 433 binary divergence, 1, 22, 56
f-divergence, 115, 631 arithmetic coding, 245, 268 binary entropy function, 1, 9
inf-characterization, 151 Artstein, S., 535 binary symmetric channel, see BSC
sup-characterization, 121, 147 Assouad’s lemma, 389, 597, 664 binomial tail, 159
comparison, 127, 132 via Mutual information method, 598 bipartite graph, 161
conditional, 117 asymmetric numeral system (ANS), Birgé, L., 613, 614, 625
convexity, 120 246 Birkhoff-Khintchine theorem, 234
data processing, 119 asymptotic efficiency, 581 Birman, M. Š, 538
finite partitions, 121 asymptotic equipartition property birthday paradox, 186
local behavior, 138 (AEP), 217, 234 bit error rate (BER), 389
lower semicontinuity, 148 asymptotic separatedness, 125 Blackwell measure, 111
monotonicity, 118 autocovariance function, 114 Blackwell order, 182, 329
operational meaning, 122 automatic repeat request (ARQ), 468 Blackwell, D., 111, 182
SDPI, 629 auxiliary channel, 423, 428 Blahut, R., 102
f-information, 134, 182, 630, 631 auxiliary random variable, 227 Blahut-Arimoto algorithm, 102
χ2 , 136 blocklength, 370
additivity, 135 BMS, 383, 631, 669
definition, 134 B-process, 232 mixture representation, 632
subadditivity, 135, 644 balls and urn, 186 Bollobás, B., 164
symmetric KL, 135, 187 Barg, A., 433 Boltzmann, 15
g-divergence, 121 Barron, A. R., 64, 606 Boltzmann constant, 410
k-means, 481 batch loss, 270, 611 Bonami-Beckner semigroup, 132
3GPP, 403 Bayes risk, 662 boolean function, 626
GLM, 563 bowl-shaped, 571
Bayesian Cramér-Rao, 661 box-counting dimension, see
absolute continuity, 21, 42, 43 Bayesian Cramér-Rao (BCR) lower Minkowski dimension
absolute norm, 526 bound, 577 Brégman’s theorem, 162
achievability, 201 Bayesian Cramér-Rao lower bound , Breiman, L., 234
additive set-functions, 79 663 broadcasting
ADSL, 403 Bayesian networks, 633 on a grid, 670
Ahlswede, R., 208, 227, 326 BCR lower bound, 578 on trees, 642
Ahlswede-Csisár, 326 functional estimation, 580 Brownian motion, 417
Alamouti code, 409 multivariate, 579
BSC, 49, 53, 111, 344, 347, 363, 372, non-existence, 98 finite blocklength, 346
379, 436, 439, 455, 469, 471, 630, capacity-achieving output distribution, fundamental limit, 373, 395
643, 655, 669, 670 94, 96, 253 Gallager’s bound, 360, 431
channel coding, 344 uniqueness, 94, 97 information density, 351
contraction coefficient, 633 capacity-cost function, 396, 466, 467 linear code, 362, 461
SDPI, 633 capacity-redundancy theorem, 248, normal approximation, 439
strong converse, 424 269, 270, 604 normalized rate, 439
Burnashev’s error-exponent, 456 Carnot’s cycle, 14 optimal decoder, 347
carrier frequency, 419 posterior matching, 444
Catoni, O., 83 power constraint, 394
capacity, 49, 91, 94, 96, 102, 178, 179,
causal conditioning, 448 probability of error, 343
256, 345, 348
causal inference, 446 randomized encoder, 460
ϵ-capacity, 373, 395
Cencov, N. N., 307 RCU bound, 359, 439
Z-channel, 380
center of gravity, 67, 184 real-world codes, 440
ACGN, 405
central limit theorem, 78, 148, 181, reliability function, 431
additive non-gaussian noise, 401
202 Schalkwijk-Kailath scheme, 459
amplitude-constrained AWGN, 407
Centroidal Voronoi Tessellation sent codeword, 350
AWGN, 399
(CVT), 481 Shannon’s random coding, 354
BEC, 380
chain rule sphere-packing bound, 427, 432,
bit error rate, 390
χ2 , 183 454, 471
BSC, 379
differential entropy, 27 straight-line bound, 432
compound DMC, 465
divergence, 32, 33, 183 strong converse, 422, 465, 469, 470
continuous-time AWGN, 418
entropy, 12, 158 submodularity, 367
erasure-error channel, 464
Hellinger, 183 threshold decoder, 353
Gaussian channel, 100, 107
mutual information, 52, 63, 187 transmission rate, 373
group channel, 379
Rényi divergence, 183 universal, 463
information capacity, 375, 395
total variation, 183 unsent codeword, 350
information stable channels, 386,
chaining, 86 variable-length, 455, 471
399
channel, 29 weak converse, 348, 397
maximal probability of error, 375
channel automorophism, 381 zero-rate, 432
memoryless channels, 377
channel capacity, see capacity channel comparison, 646, 669, 670
MIMO channel, 176
channel coding channel dispersion, 434
mixture DMC, 465
(M, ϵ)-code, 343 channel filter, 406
non-stationary AWGN, 403
κ-β bound, 435 channel state information, 408
parallel AWGN, 402
admissible constraint, 396 channel symmetry group, 381
per unit cost, 414, 470
BSC, 344 channel symmetry relations, 385
product channel, 465
capacity, 345 channel, OR-channel, 189
Shannon capacity, 373, 395
capacity per unit cost, 414 channel, Z-channel, 372
sum of DMCs, 465
capacity-cost, 396, 466 channel, q-ary erasure, 639
with feedback, 443, 471
cost function, 395 channel, q-ary symmetric, 670, 671
zero-error, 374, 464, 471
cost-constrained code, 395, 467 channel, Z-channel, 380
zero-error with feedback, 450
degrees-of-freedom, 409 channel, ACGN, 405
capacity achieving output distribution,
dispersion, see dispersion channel, additive noise, 50, 363, 464
425
DT bound, 356, 460, 461 channel, additive non-Gaussian noise,
Capacity and Hellinger entropy
DT bound, linear codes, 364 466
lower bound, 609
Elias’ scheme, 457 channel, additive-noise, 371, 467
upper bound, 610
energy-per-bit, see energy-per-bit channel, AWGN, 48, 98, 100, 372,
Capacity and KL covering numbers,
error-exponents, 430, 460, 469, 471 399, 436, 457, 460, 470
603, 608
error-exponents with feedback, 454 channel, AWGN with ISI, 406
capacity-achieving input distribution,
expurgated random coding, 469 channel, bandlimited AWGN, 419
94, 444
feedback code, 442, 471 channel, BI-AWGN, 48, 436, 655
discrete, 407
correlation coefficient, maximal, 640 with feedback, 446 KL, 36, 37, 116, 131, 133, 150, 633
cost function, 395 zero-dispersion channel, 436, 437, Le Cam, 117, 133, 631
Costa, M., 64 455 local behavior, 36, 39, 137, 335
coupling, 122, 151, 178, 183 distortion metric, 484 lower semicontinuity, 78, 99
covariance matrix, 49, 59, 114, 510 separable, 485 Marton’s, 123, 151, 183
covariance matrix estimation, 85, 667 distributed estimation, 325, 644, 657 measure-theoretic properties, 80
covering lemma, 228, 327, 499 distribution estimation over an algebra of sets, 79
CR lower bound, 576, 661 χ2 risk, 664 parametric family, 38, 140
multivariate, 576 binary alphabet, 662 Rényi, see Rényi divergence, 314
Cramér’s condition, 293 KL risk, 664 real Gaussians, 23
Cramér, H., 576 quadratic risk, 583, 664 strong convexity, 92, 182
Cramér-Rao lower bound, see CR TV risk, 664 symmetric KL, 135
lower bound distribution, Bernoulli, 9, 48, 269, 333 total variation, see total variation
cryptography, 14 distribution, binomial, 174 divergence for mixtures, 36
Csisár, I., 326 distribution, Dirichlet, 250, 252 DMC, 372, 456
Cuff, P., 503 distribution, discrete, 40 Dobrushin’s contraction, 629
cumulant generating function, see log distribution, exponential, 177, 330 dominated convergence theorem, 138,
MGF distribution, Gamma, 333 141
cumulative risk, 611 distribution, Gaussian, 47, 49, 59, 64, Donsker, M. D., 71
98, 133, 330, 333, 336 Donsker-Varadhan, 71, 83, 147, 150,
distribution, geometric, 9, 175 297
data-processing inequality, see DPI
distribution, Marchenko-Pastur, 176 Doob, 31
de Bruijn’s identity, 60, 191
distribution, mixture of products, 146 doubling dimension, 616
de Finetti’s theorem, 186
distribution, mixtures, 36 DPI, 42, 154, 426
decibels (dB), 48
distribution, Poisson, 178 χ2 , 148
decoder
distribution, Poisson-Binomial, 301 f-divergence, 119
maximal mutual information
distribution, product, 55 f-information, 134
(MMI), 463
distribution, product of mixtures, 146 divergence, 34, 36, 53, 56, 57, 73,
maximum a posteriori (MAP), 347
distribution, subgaussian, see 348
maximum likelihood (ML), 347,
subgaussian Fisher information, 184
353
distribution, uniform, 11, 27, 175, 178 mutual information, 51, 53
decoding region, 342, 400
distribution, Zipf, 203 Neyman-Pearson region, 329
deconvolution filter, 407
Dite, W., 481 Rényi divergence, 433
degradation of channels, 182, 329, 646
divergence, 20 Duda, J., 246
density estimation, 244, 602, 661, 664
χ2 , 184, 185, 668, 669 Dudley’s entropy integral, 531, 552
Bayes χ2 risk, 662
χ2 , 36, 116, 122, 126, 132, 133, Dudley, R., 531
Bayes KL risk, 605, 662
136, 145, 148, 149, 631, 638, 640, Dueck, G., 187, 433
discrete case, 137
641, 644 dynamical system, 230
derivative of divergence, 36
inf-representation, 123
derivative of mutual information, 59
sup-characterization, 70, 71
diameter of a set, 96 ebno, see energy-per-bit
conditional, 42
differential entropy, 26, 47, 61, 158, ECC, 342
continuity, 78
164, 175, 191 eigenvalues, 114
continuity in σ-algebra, 80
directed acyclic graph (DAG), 50, 633 Elias ensemble, 365
convex duality, 73
directed acyclic graphs (DAGs), 179, Elias’ extractor, 168
convexity, 91
180 Elias, P., 167, 270
finite partitions, 70
directed graph, 189 Elliott, E. O., 111
geodesic, 306, 333, 335
directed information, 446 EM algorithm, 77, 104
Hellinger, see Hellinger
Dirichlet prior, 662, 665 convex, 103
distance117
disintegration of probability measure, empirical Bayes, 563
Jeffreys, 135
29 empirical distribution, 137
Jensen-Shannon, 117, 133, 149
dispersion, 379 empirical mutual information, 462
empirical process, 86 ergodic theorem, 268 finite blocklength, 214, 341, 346, 417,
empirical risk, 87 Birkhoff-Khintchine, 234 460
Empirical risk minimization (ERM), maximal, 238 finite groups, 50
87 ergodicity, 232, 393, 467 finite-state machine (FSM), 172, 264
energy-per-bit, 400, 410 error correcting code Fisher defect, 143
fading channel, 416 see ECC, 342 Fisher information, 38, 140, 252, 576,
finite blocklength, 417 error floor, 423 645, 660, 661
entropic CLT, 64 error-exponents, 123, 144, 286, 335, continuity, 141
Entropic risk bound, 602 430, 454, 456, 460, 469 data processing inequality, 184
Hellinger loss, 602, 614 estimand, 558 matrix, 38, 142, 184
Hellinger loss, parametric rate, 617 functional, 580 minimum, compactly supported,
Hellinger lower bound, 618 estimation 580
KL loss, 602, 603 entropy, 138 monotonicity, 184
local Hellinger entropy, 616 Estimation better than chance of a density, 40, 151, 191, 578
sharp rate, 619 Bounded GLM, 592 variational representation, 151
TV loss, 602, 620 distribution estimation, 593 Fisher information inequality, 185
TV loss, misspecified, 621 estimation in Gaussian noise, 58 Fisher’s factorization theorem, 54
entropy, 8 estimation, discrete parameter, 55 Fisher, R., 54, 275
ant scouts, 9 estimation, information measures, 66 Fisher-Rao metric, 40, 307
as signed measure, 46 estimation-compression inequality, Fitingof, B. M., 246
axioms, 13 see online-to-batch conversion Fitingof, B. M., 463
concavity, 92 estimator, 559 Fitingof-Goppa code, 463
conditional, 10, 46, 57 Bayes, 562 flash signaling, 417
continuity, 78, 178 deterministic, 558 Fourier spectrum, 419
differential, see differential entropy, improper, 603 Fourier transform, 114, 406
48 proper, 603 fractional covering number, 160
empirical, 138 randomized, 558 fractional packing number, 161
hidden Markov model, 111 Evans and Schulman, theorem of, 627 frequentist statistics, 54
inf-representation, 24 evidence lower bound (ELBO), 76 Friedgut, E., 160
infinite, 10 exchangeable distribution, 170, 186 Friis transmission equation, 468
Kolmogorov-Sinai, 16 exchangeable event, 177 Fubini theorem, 45
Markov chains, 110 expectation maximization, see EM functional estimation, 580, 666, 668
max entropy, 99, 175 algorithm
Rényi, 13, 57 exponential family, 310
thermodynamic, 14 natural parameter, 310
Gács-Körner information, 338
Venn diagram, 46 standard (one-parameter), 298, 306
Gallager ensemble, 365
entropy characterization, 502 exponential-weights update algorithm,
Gallager’s bound, 360
entropy estimation, 584 271
Gallager, R., 360, 431
large alphabet, 584
Galois field, 219
small alphabet, 584
Fano’s inequality, 41, 57, 179, 664 game of 20 Questions, 13
entropy method, 158
tensorization, 112 game of guessing, 55
entropy power, 64
Fano, R., 41 Gastpar conditions, 521
entropy power inequality, 41, 64
Fatou’s lemma, 38, 99, 142 Gastpar, M., 521
Costa’s, 64
Feder, M., 260 Gaussian CEO problem, 657
Lieb’s, 64
Feinstein’s lemma, 357, 397 Gaussian comparison, 531
entropy rate, 109, 181, 265
Feinstein, A., 357 Gaussian distribution, 23
relative, 288
Fekete’s lemma, 299 complex, 23
entropy vs conditional entropy, 46
Fenchel-Eggleston-Carathéodory Gaussian isoperimetric inequality, 541
entropy-power inequality, 191
theorem, 129 Gaussian location model, see GLM
Erdös, P., 215
Fenchel-Legendre conjugate, 73, 296 Gaussian mixture, 59, 76, 104, 134,
Erdös-Rényi graph, 185
filtration, 319 185, 619
Gaussian Orthogonal Ensemble Hamming sphere, 169, 170 weak converse, 282
(GOE), 668 Hamming weight, 158, 169, 170, 175
Gaussian width, 530 Han, T. S., 475
I-MMSE, 59
Gelfand, I.M., 70 Harremoës, P., 128
identity
Gelfand-Pinsker problem, 468 Haussler-Opper estimate, 188
de Bruijn’s, 191
Gelfand-Yaglom-Perez HCR lower bound
Ihara, S., 401, 419
characterization, 70, 121 Hellinger-based, 661
independence, 50, 55
generalization bounds, 87 multivariate, 576
individual (one-step) risk, 611
generalization error, 88, 187 Hellinger distance, 116, 117, 124, 131,
individual sequence, 248, 259
generalization risk, 87 133, 153, 182, 289, 302, 315, 631,
inequality
generative adversarial networks 661
Bennett’s, 301
(GANs), 149, 602 sup-characterization, 148
Bernstein’s, 301
Gibbs, 15 location family, 142
Brunn-Minkowski, 573
Gibbs distribution, 100, 242, 336 tensorization, 124
de Caen’s, 433
Gibbs sampler, 87, 187 Hellinger entropy
entropy-power, see entropy power
Gibbs variational principle, 74, 83, bounds on KL covering number,
inequality, 191
178 189, 610
Fano’s, see Fano’s inequality
Gilbert, E. N., 111, 527 covering number, 614
Han’s, 17, 28, 160
Gilbert-Elliott HMM, 111, 181 local covering number, 616
Hoeffding’s, 86, 334
Gilbert-Varshamov bound, 433, 527, local packing number, 618
Jensen’s, 11
666 packing number, 609
log-Sobolev, see log-Sobolev
Gilmer’s method, 189 Hessian, 60, 251
inequality (LSI), 191
Ginibre ensemble, 176 Hewitt-Savage 0-1 law, 177
Loomis-Whitney, 164
GLM, 140, 560, 666, 667 hidden Markov model (HMM), 110
non-Shannon, 17
golden formula, 67, 97, 466 high-dimensional probability, 83
Okamoto’s, 301
Goppa, V., 463 Hilbert’s 13th problem, 522, 538
Pinsker’s, see Pinsker’s inequality
graph coloring, 643, 670 Hoeffding’s lemma, 86, 88, 334
Shearer’s, 18
graph partitioning, 190 Huber, P. J., 151, 324, 613
Stam’s, 185
graphical model Huffman algorithm, 209
Tao’s, 58, 127, 190
directed, 447 hypothesis testing, 122
transportation, 656
graphical models accuracy, precision, recall, 277
van Trees, 645
d-connected, 51 asymptotics, 286
Young-Fenchel, 73
d-separation, 51, 445 Bayesian, 277, 314, 330
inf-convolution, 547
collider, 51 Chernoff’s regime, 314
information bottleneck, 177, 549
directed, 41, 50, 69, 179, 180, 633 communication constraints, 325
information density, 351
non-collider, 51 composite, 289, see composite
conditioning-unconditioning, 352
undirected, 74, 648 hypothesis testing
information distillation, 190
Gross, L., 65 error-exponent, 123, 144
information flow, 69, 447
Guerra interpolation, 63 error-exponents, 289, 314
information geometry, 40, 307
Gutman, M., 260 goodness-of-fit, 275, 325
information inequality, 24
independence testing, 326
information percolation
likelihood ratio test (LRT), 280, 424
Haar measure, 381, 543, 551 directed, 627, 635
null hypothesis, 275
Hamiltonian dynamical system, 231 undirected, 650
power, 277
Hammersley, J. M., 575 information projection, 91, 178
robust, 324, 338
Hammersley-Chapman-Robbins lower definition, 302
ROC curve, 276
bound, see HCR lower bound Gaussian, 336
sequential, 319
Hamming ball, 159 marginals, 331
SPRT, 320
Hamming bound, 527 Pythagorean theorem, 303
Stein’s exponent, 286
Hamming code, 362 information radius, 91, 96
strong converse, 283
Hamming distance, 112, 123, 389, 464 information stability, 386, 393, 396,
type-I, type-II error, 277
Hamming space, 219, 347 399, 464
information tails, 285, 353 Kraft inequality, 207 compact support, 142
Ingster-Suslina formula, 185 Krein-Milman theorem, 130 location parameter, 40
integer programming, 209 Krichevsky, R. E., 253 log MGF, 293
integral probability metrics, 123 Krichevsky-Trofimov algorithm, 253, properties, 293
interactive communication, 182, 658 269 log-concave distribution, 573
intersymbol interference (ISI), 406 Krichevsky-Trofimov estimator, 662 log-likelihood ratio, 280
interval censoring, 668 Kronecker Lemma, 388 log-Sobolev inequality (LSI), 65, 132,
Ising model, 74, 191, 642 191
log-Sobolev inequality, modified
Laplace method, 251
James-Stein estimator, 561 (MLSI), 641
Laplace’s law of succession, 253
Jeffreys prior, 252 Loomis, L. H., 164
Laplacian, 59
Joint entropy, 9 loss function, 559
large deviations, 35, 290, 291, 299
joint range, 115, 127 batch, 611
Gaussian, 332
χ2 vs TV, 665 cross-entropy, see log-loss
multiplicative deviation, 301
χ2 vs TV, 132 cumulative, 259
non-iid, 332
Harremoës-Vajda theorem, 128 log-loss, 24, 259, 548, 664
on the boundary, 332
Hellinger and TV, 124 quadratic, 561, 575
rate function, 296
Hellinger vs TV, 131 separable, 569
large deviations theory, 159
Jensen-Shannon vs TV, 133 test, 87
large language models, 110, 257
KL vs χ2 , 133 loss-functions
law of large numbers, 202
KL vs Hellinger, 132, 189, 302 log-loss, 331
strong, 235
KL vs TV, 131, 665 low-density parity check (LDPC), 346
Le Cam distance, 117
Le Cam and Hellinger, 133 lower semicontinuity, 148
Le Cam lemma, 146
Le Cam and Jensen-Shannon, 133 Le Cam’s method, 666
joint source-channel coding, see JSCC Le Cam’s two-point method, 594 Mandelbrot, B., 203
joint type, 462 looseness in high dimensions, 596 Markov approximation, 235
joint typicality, 228, 355, 499 Le Cam, L., 614 Markov chain, 179–181, 232, 265
JSCC, 391 Le Cam-Birgé’s estimator, 614 finite order, 235
ergodic source, 393 least favorable pair, 338 Markov chains, 110, 464
graceful degradation, 520 least favorable prior, 564 k-th order, 110
lossless, 392 non-uniqueness, 663 ergodic, 266
lossy, 515 Lempel-Ziv algorithm, 263 finite order, 247
lossy, achievability, 516 less noisy channel, 646 mixing, 641, 669, 671
lossy, converse, 515 Lieb, E., 64 Markov kernel, 29, 42
source-channel separation, 392, 516 likelihood-ratio trick, 331 composition, 30
statistical application, 585 linear code, 218, 362 Markov lemma, 228, 327, 501
coset leaders, 364 Markov types, 174
Körner, J., 227 error-exponent, 433 martingale convergence theorem, 80
Kac’s lemma, 262 generator matrix, 362 Marton’s transportation inequality,
Kahn, J., 160 geometric uniformity, 363 656
Kakutani’s dichotomy, 125 parity-check matrix, 362 Massey’s directed information, 449
Kelvin, 14 syndrome decoder, 363 see directed information, 446
kernel density estimator (KDE), 136, linear programming, 368 Massey, J., 158, 446
624 linear programming duality, 161 matrix inversion lemma, 40
Kesten-Stigum bound, 670 linear regression, 271, 660 Mauer, A., 83
KL covering numbers, 603, 608 Liouville’s theorem, 231 Maurey’s empirical method, 534
Kolmogorov identities, 51 list decoding, 346 Maurey, B., 534
Kolmogorov’s 0-1 law, 83, 125, 232 Litsyn, S., 433 maximal coding, 357, 367, 397
Kolmogorov, A. N., 239, 522, 524 Lloyd’s algorithm, 480 maximal correlation, 631, 640
Kolmogorov-Sinai entropy, 239 Lloyd, S., 480 maximal sphere packing density, 527
Koshelev, V., 483 location family, 40 circle packing, 527
average, 562 simplex conjecture, 413 power spectral density, 114, 233,
Bayes, 562 Sinai’s generator theorem, 240 405
minimax, 563 Sinai, Y., 239 spectral measure, 233
Robbins, H., 575 single-letterization, 106 stationary process, 109, 230
run-length encoding, 266 singular value decomposition (SVD), statistical experiment, see statistical
176 model
singular values, 640 statistical learning theory, 83, 87
saddle point, 94, 176
Sinkhorn’s algorithm, 105, 311 statistical model, 558
Gaussian, 100, 107, 466
site percolation threshold, 637 nonparametric, 560
sample complexity, 568
SLB, see Shannon lower bound parametric, 560
sample covariance matrix, 85, 667
Slepian, D., 223, 531 Stavskaya automata, 637
sampling without replacement, 186
Slepian-Wolf theorem, 223, 225, 228 Stein’s lemma, 287, 415, 470
Sanov’s theorem, 307, 334
small subgraph conditioning, 186 Stirling’s approximation, 16, 162, 174,
Sanov, I. N., 307
small-ball probability, 539 260, 588
score function, 152
Brownian motion, 553 stochastic block model, see
parametrized family, 39
finite dimensions, 332 community detection, 642
SDPI, 53, 328, 626, 629, 669, 671
Smooth density estimation, 622 stochastic domination, 338
χ2 , 640
L2 loss, 622 stochastic localization, 58, 191
BSC, 633
Hellinger loss, 625 stopping time of a filtration, 319, 455
erasure channels, 639
KL loss, 625 strong converse, 283, 374, 422, 470
joint Gaussian, 640
TV loss, 624 failure, 427, 465
post-processing, 653
soft-covering lemma, 137, 505 strong data-processing inequality, see
tensorization, 638
Solomjak, M., 538 SDPI
self-normalizing sums, 302
source subadditivity of information, 135
separable cost-constraint, 395
Markov, 110 subgaussian, 85, 188
sequential prediction, 245
memoryless, see memoryless source subgraph counts, 160
sequential probability ratio test
mixed, 110, 265 submodular function, 16, 27, 367
(SPRT), 320
source coding Sudakov’s minoration, 530
Shannon
see compression, 197 sufficient statistic, 41, 54, 178, 180,
boolean circuits, construction of,
source-coding 282, 363
626
noisy, 548 supervised learning, 257, 270
Shannon entropy, 8
remote, 548 support, 3, 309
Shannon lower bound, 511, 586, 588
space-time coding, 409 symmetric KL-information
arbitrary norm, 511, 550
sparse estimation, 589, 666 see f-information
quadratic distortion, 511
sparse-graph codes, 341, 346 symmetric KL, 135
Shannon’s channel coding theorem,
sparsity, 666 symmetry group, 381
345, 377
spatial diversity, 403 system identification, 660
Shannon’s rate-distortion theorem,
spectral gap, 641, 669 Szarek, S. J., 527, 535
491
spectral independence, 641, 671 Szegö’s theorem, 114, 406
Shannon’s source coding theorem,
spectral measure, 233 Szemerédi regularity lemma, 127, 190
202, 214
spiked Wigner model, 649
Shannon, C. E., 1–672
squared error, 561
Shannon-McMillan-Breiman theorem, tail σ-algebra, 83, 231
Stam, A. J., 64, 186
233 Telatar, E., 187, 409
standard Borel space, 20, 29, 30, 42,
Shawe-Taylor, J., 83 temperature, 336
43, 51
Shearer’s lemma, 18, 158, 160 Tensor product of experiments, 568
stationary Gaussian processes, 114,
shift-invariant event, 231 minimax risk, 569
233
shrinkage estimator, 561 tensorization, 33, 55, 63, 106, 107,
autocovariance function, 233
Shtarkov distribution, 245, 248, 260 112, 124, 145, 636, 638, 640, 647,
B-process, 233
Shtarkov sum, 249, 260 670
ergodic, 233, 467
signal-to-noise ratio (SNR), 48 FI -curve, 338
significance testing, 275 I-projection, 331
χ2 , 145 uniquely decodable codes, 206 von Neumann, J., 166, 626
capacity, 377 unitary operator, 243 Voronoi cells, 347
capacity-cost, 397 Universal codes, 462 Vovk, V. G., 271
Hellinger, 124 universal compression, 179, 210, 270
minimax risk, 569 universal prediction, 255
test error, 87 universal probability assignment, 245, Wald, A., 319
thermodynamics, 9, 14, 410 255 Wasserstein distance, 105, 123, 151,
Thomason, A. G., 164 Urysohn’s lemma, 72 656
thresholding, 561, 589, 590 Urysohn, P. S., 72 water-filling solution, 114, 176, 341,
Tikhomirov, V. M., 522, 524 402, 406, 437, 438
tilting, 72, 297 Vajda, I., 128 waterfall plot, 423, 439
time sharing, 226 van Trees inequality, see Bayesian weak converse, 282, 348, 397
Toeplitz matrices, 114 Cramér-Rao (BCR) lower bound Whitney, H., 164
total variation, 98, 116, 122, 131, 132, van Trees, H. L., 577 Wiener process, 417
330, 629 Varadhan, S. S., 71 WiFi, 403
inf-representation, 181 varentropy, 200, 547, 584 Wigner’s semicircle law, 652
inf-representation, 122 variable-length codes, 168, 455 Williamson, R. C., 83
sup-characterization, 148 variational autoencoder (VAE), 76, Wishart matrix, 176
sup-representation, 122 602 Wolf, J., 223
training error, 87 variational inference, 74 Wozencraft ensemble, 366
training sample, 87 variational representation, 70, 71, 123, Wringing lemma, 187
147, 154 Wyner’s common information, 502
transition probability kernel, see
Markov kernel χ2 , 149 Wyner, A., 227, 502
transmit-diversity, 409 Fisher information, 151
Trofimov, V. K., 253 Hellinger distance, 148 Yaglom, A. M., 70
Tunstall code, 196 total variation, 122, 148 Yang, Y., 606
turbo codes, 346 Varshamov, R. R., 527 Yang-Barron’s estimator, 607
types, see method of types, 174 Venn diagrams, 46 Yatracos class, 621
Verdú, S., 414 Yatracos’ estimator, 620
undetectable errors, 213, 224 Verdú, S., 475 Yatracos, Y. G., 620
uniform convergence, 85, 86 Verwandlungsinhalt, 14 Young-Fenchel duality, 73
uniform integrability, 153 Vitushkin, A. G., 538
uniform quantization, 29 VLF codes, 455
uniformly integrable martingale, 80 VLFT codes, 471 Zador, P. L., 482
union-closed sets conjecture, 189 volume ratio, 525 Zipf’s law, 203