
This material will be published by Cambridge University Press as “Information Theory” by


Yury Polyanskiy and Yihong Wu. This prepublication version is free to view and download for
personal use only. Not for redistribution, resale or use in derivative works.
Note that while this version has matching equation and theorem numbers, the printed version
has many typos corrected. We recommend the printed version for thorough reading, while this free
version can be used for quick reference lookups.
Polyanskiy & Wu © 2023


Book Heading

This textbook introduces the subject of information theory at a level suitable for advanced
undergraduate and graduate students. It develops both the classical Shannon theory and recent
applications in statistical learning. There are six parts covering foundations of information mea-
sures; data compression; hypothesis testing and large deviations theory; channel coding and
channel capacity; rate-distortion and metric entropy; and, finally, statistical applications. There are
over 210 exercises, helping the reader deepen their knowledge and learn about recent discoveries.

Yury Polyanskiy is a Professor of Electrical Engineering and Computer Science at MIT and a
member of the Laboratory for Information and Decision Systems (LIDS) and the Statistics and Data
Science Center (SDSC). He was elected an IEEE Fellow (2024), received the 2020 IEEE Information
Theory Society James Massey Award and co-authored papers receiving Best Paper Awards from
the IEEE Information Theory Society (2011), the IEEE International Symposium on Information
Theory (2008, 2010, 2022), and the Conference on Learning Theory (2021). At MIT he teaches
courses on information theory, probability, statistics, and machine learning.

Yihong Wu is a Professor in the Department of Statistics and Data Science at Yale University.
He received the Marconi Society Paul Baran Young Scholar Award in 2011, the NSF
CAREER Award in 2017, and the Sloan Research Fellowship in Mathematics in 2018, and was
elected an IMS Fellow in 2023. He has taught classes on probability, statistics, and information theory at Yale
University and the University of Illinois at Urbana-Champaign.


Information Theory
From Coding to Learning
First Edition

Yury Polyanskiy
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology

Yihong Wu
Department of Statistics and Data Science
Yale University


University Printing House, Cambridge CB2 8BS, United Kingdom


One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
79 Anson Road, #06–04/06, Singapore 079906

Cambridge University Press is part of the University of Cambridge.


It furthers the University’s mission by disseminating knowledge in the pursuit of
education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/XXX-X-XXX-XXXXX-X
DOI: 10.1017/XXX-X-XXX-XXXXX-X
© Author name XXXX
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published XXXX
Printed in <country> by <printer>
A catalogue record for this publication is available from the British Library.
Library of Congress Cataloging-in-Publication Data
ISBN XXX-X-XXX-XXXXX-X Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.


Dedicated to

The memory of Gennady (Y.P.)

My family (Y.W.)


Contents

Preface page xv
Introduction xvii

Frequently used notation 1

Part I Information measures 5


1 Entropy 8
1.1 Entropy and conditional entropy 8
1.2 Axiomatic characterization 13
1.3 History of entropy 14
1.4* Submodularity 16
1.5* Han’s inequality and Shearer’s Lemma 17

2 Divergence 20
2.1 Divergence and Radon-Nikodym derivatives 20
2.2 Divergence: main inequality and equivalent expressions 24
2.3 Differential entropy 26
2.4 Markov kernels 29
2.5 Conditional divergence, chain rule, data-processing inequality 31
2.6* Local behavior of divergence and Fisher information 36
2.6.1* Local behavior of divergence for mixtures 36
2.6.2* Parametrized family 38

3 Mutual information 41
3.1 Mutual information 41
3.2 Mutual information as difference of entropies 44
3.3 Examples of computing mutual information 47
3.4 Conditional mutual information and conditional independence 50
3.5 Sufficient statistic and data processing 53
3.6 Probability of error and Fano’s inequality 55
3.7* Estimation error in Gaussian noise (I-MMSE) 58
3.8* Entropy-power inequality 63

4 Variational characterizations and continuity of information measures 66


4.1 Geometric interpretation of mutual information 67


4.2 Variational characterizations of divergence: Gelfand-Yaglom-Perez 70
4.3 Variational characterizations of divergence: Donsker-Varadhan 71
4.4 Gibbs variational principle 74
4.5 Continuity of divergence 78
4.6* Continuity under monotone limits of σ -algebras 79
4.7 Variational characterizations and continuity of mutual information 81
4.8* PAC-Bayes 83
4.8.1 Uniform convergence 85
4.8.2 Generalization bounds in statistical learning theory 87

5 Extremization of mutual information: capacity saddle point 91


5.1 Convexity of information measures 91
5.2 Saddle point of mutual information 94
5.3 Capacity as information radius 96
5.4* Existence of capacity-achieving output distribution (general channel) 97
5.5 Gaussian saddle point 100
5.6 Iterative algorithms: Blahut-Arimoto, Expectation-Maximization, Sinkhorn 102

6 Tensorization and information rates 106


6.1 Tensorization (single-letterization) of mutual information 106
6.2* Gaussian capacity via orthogonal symmetry 107
6.3 Entropy rate 109
6.4 Entropy and symbol (bit) error rate 112
6.5 Mutual information rate 113

7 f-divergences 115
7.1 Definition and basic properties of f-divergences 115
7.2 Data-processing inequality; approximation by finite partitions 118
7.3 Total variation and Hellinger distance in hypothesis testing 122
7.4 Inequalities between f-divergences and joint range 126
7.5 Examples of computing joint range 130
7.5.1 Hellinger distance versus total variation 131
7.5.2 KL divergence versus total variation 131
7.5.3 χ2 -divergence versus total variation 132
7.6 A selection of inequalities between various divergences 132
7.7 Divergences between Gaussians 133
7.8 Mutual information based on f-divergence 134
7.9 Empirical distribution and χ2 -information 136
7.10 Most f-divergences are locally χ2 -like 138


7.11 f-divergences in parametric families: Fisher information 140


7.12 Rényi divergences and tensorization 144
7.13 Variational representation of f-divergences 147
7.14* Technical proofs: convexity, local expansions and variational representations 152

8 Entropy method in combinatorics and geometry 158


8.1 Binary vectors of average weights 158
8.2 Shearer’s lemma & counting subgraphs 159
8.3 Brégman’s Theorem 161
8.4 Euclidean geometry: Bollobás-Thomason and Loomis-Whitney 164

9 Random number generators 166


9.1 Setup 166
9.2 Converse 167
9.3 Elias’ construction from data compression 167
9.4 Peres’ iterated von Neumann’s scheme 169
9.5 Bernoulli factory 172

Exercises for Part I 174

Part II Lossless data compression 193


10 Variable-length compression 197
10.1 Variable-length lossless compression 197
10.2 Mandelbrot’s argument for universality of Zipf’s (power) law 203
10.3 Uniquely decodable codes, prefix codes and Huffman codes 206

11 Fixed-length compression and Slepian-Wolf theorem 212


11.1 Source coding theorems 212
11.2 Asymptotic equipartition property (AEP) 217
11.3 Linear compression (hashing) 218
11.4 Compression with side information at both compressor and decompressor 221
11.5 Slepian-Wolf: side information at decompressor only 222
11.6 Slepian-Wolf: compressing multiple sources 224
11.7* Source-coding with a helper (Ahlswede-Körner-Wyner) 226

12 Entropy of ergodic processes 230


12.1 Bits of ergodic theory 230
12.2 Shannon-McMillan, entropy rate and AEP 233
12.3 Proof of the Shannon-McMillan-Breiman Theorem 234


12.4* Proof of the Birkhoff-Khintchine Theorem 237


12.5* Sinai’s generator theorem 239

13 Universal compression 244


13.1 Arithmetic coding 245
13.2 Combinatorial construction of Fitingof 246
13.3 Optimal compressors for a class of sources. Redundancy. 247
13.4* Asymptotic maximin solution: Jeffreys prior 250
13.5 Sequential probability assignment: Krichevsky-Trofimov 253
13.6 Online prediction and density estimation 255
13.7 Individual sequence and worst-case regret 259
13.8 Lempel-Ziv compressor 261

Exercises for Part II 265

Part III Hypothesis testing and large deviations 273


14 Neyman-Pearson lemma 276
14.1 Neyman-Pearson formulation 276
14.2 Likelihood ratio tests 280
14.3 Converse bounds on R(P, Q) 282
14.4 Achievability bounds on R(P, Q) 283
14.5 Asymptotics: Stein’s regime 286
14.6 Chernoff regime: preview 289

15 Information projection and large deviations 291


15.1 Basics of large deviations theory 291
15.1.1 Log MGF and rate function 293
15.1.2 Tilted distribution 297
15.2 Large-deviations exponents and KL divergence 299
15.3 Information projection 302
15.4 I-projection and KL geodesics 306
15.5 Sanov’s theorem 307
15.6* Information projection with multiple constraints 308

16 Hypothesis testing: error exponents 313


16.1 (E0, E1)-Tradeoff 313
16.2 Equivalent forms of Theorem 16.1 317
16.3* Sequential hypothesis testing 319
16.4 Composite, robust, and goodness-of-fit hypothesis testing 324


16.5* Hypothesis testing with communication constraints 325

Exercises for Part III 329

Part IV Channel coding 339


17 Error correcting codes 342
17.1 Codes and probability of error 342
17.2 Coding for Binary Symmetric Channels 344
17.3 Optimal decoder 347
17.4 Weak converse bound 348

18 Random and maximal coding 350


18.1 Information density 351
18.2 Shannon’s random coding bound 353
18.3 Dependence-testing (DT) bound 356
18.4 Feinstein’s maximal coding bound 357
18.5 RCU and Gallager’s bound 359
18.6 Linear codes 362
18.7 Why random and maximal coding work well? 366

19 Channel capacity 370


19.1 Channels and channel capacity 370
19.2 Shannon’s noisy channel coding theorem 375
19.3 Examples of capacity computation 379
19.4* Symmetric channels 381
19.5* Information stability 386
19.6 Capacity under bit error rate 389
19.7 Joint source-channel coding 391

20 Channels with input constraints. Gaussian channels. 394


20.1 Channel coding with input constraints 394
20.2 Channel capacity under separable cost constraints 397
20.3 Stationary AWGN channel 399
20.4 Parallel AWGN channel 402
20.5* Non-stationary AWGN 403
20.6* Additive colored Gaussian noise channel 405
20.7* AWGN channel with intersymbol interference 406
20.8* Gaussian channels with amplitude constraints 407
20.9* Gaussian channels with fading 408


21 Capacity per unit cost 410


21.1 Energy-per-bit 410
21.2 Capacity per unit cost 414
21.3 Energy-per-bit for the fading channel 416
21.4 Capacity of the continuous-time AWGN channel 417
21.5* Capacity of the continuous-time bandlimited AWGN channel 419

22 Strong converse. Channel dispersion. Error exponents. Finite blocklength. 422


22.1 Strong converse 422
22.2 Stationary memoryless channel without strong converse 427
22.3 Meta-converse 428
22.4* Error exponents 430
22.5 Channel dispersion 434
22.6 Finite blocklength bounds and normal approximation 439
22.7 Normalized Rate 439

23 Channel coding with feedback 442


23.1 Feedback does not increase capacity for stationary memoryless channels 442
23.2* Massey’s directed information 446
23.3 When is feedback really useful? 450
23.3.1 Code with very small (e.g. zero) error probability 450
23.3.2 Code with variable length 455
23.3.3 Codes with variable power 457

Exercises for Part IV 460

Part V Rate-distortion theory and metric entropy 473


24 Rate-distortion theory 477
24.1 Scalar and vector quantization 477
24.1.1 Scalar uniform quantization 477
24.1.2 Scalar Non-uniform Quantization 479
24.1.3 Optimal quantizers 480
24.1.4 Fine quantization 481
24.1.5 Fine quantization and variable rate 483
24.2 Information-theoretic formulation 483
24.3 Converse bounds 485
24.4* Converting excess distortion to average 488


25 Rate distortion: achievability bounds 490


25.1 Shannon’s rate-distortion theorem 490
25.1.1 Intuition 492
25.1.2 Proof of Theorem 25.1 493
25.2* Covering lemma and joint typicality 497
25.3* Wyner’s common information 502
25.4* Approximation of output statistics and the soft-covering lemma 504

26 Evaluating rate-distortion function. Lossy Source-Channel separation. 507


26.1 Evaluation of R(D) 507
26.1.1 Bernoulli Source 507
26.1.2 Gaussian Source 509
26.2* Analog of saddle-point property in rate-distortion 512
26.3 Lossy joint source-channel coding 515
26.3.1 Converse 515
26.3.2 Achievability via separation 516
26.4 What is lacking in classical lossy compression? 519

27 Metric entropy 522


27.1 Covering and packing 522
27.2 Finite-dimensional space and volume bound 525
27.3 Beyond the volume bound 528
27.3.1 Sudakov’s minoration 530
27.3.2 Hilbert ball has metric entropy 1/ϵ² 532
27.3.3 Maurey’s empirical method 534
27.3.4 Duality of metric entropy 535
27.4 Infinite-dimensional space: smooth functions 535
27.5 Metric entropy and small-ball probability 539
27.6 Metric entropy and rate-distortion theory 542

Exercises for Part V 545

Part VI Statistical applications 555


28 Basics of statistical decision theory 558
28.1 Basic setting 558
28.2 Gaussian location model (GLM) 560
28.3 Bayes risk, minimax risk, and the minimax theorem 561
28.3.1 Bayes risk 561
28.3.2 Minimax risk 563


28.3.3 Minimax and Bayes risk: a duality perspective 565


28.3.4 Minimax theorem 566
28.4 Multiple observations and sample complexity 567
28.5 Tensor product of experiments 568
28.6 Log-concavity, Anderson’s lemma and exact minimax risk in GLM 571

29 Classical large-sample asymptotics 575


29.1 Statistical lower bound from data processing 575
29.1.1 Hammersley-Chapman-Robbins (HCR) lower bound 575
29.1.2 Bayesian CR and HCR 577
29.2 Bayesian CR lower bounds and extensions 578
29.3 Maximum likelihood estimator and asymptotic efficiency 581
29.4 Application: Estimating discrete distributions and entropy 583

30 Mutual information method 585


30.1 GLM revisited and the Shannon lower bound 586
30.2 GLM with sparse means 589
30.3 Community detection 591
30.4 Estimation better than chance 592

31 Lower bounds via reduction to hypothesis testing 594


31.1 Le Cam’s two-point method 594
31.2 Assouad’s Lemma 597
31.3 Assouad’s lemma from the Mutual Information Method 598
31.4 Fano’s method 599

32 Entropic bounds for statistical estimation 602


32.1 Yang-Barron’s construction 603
32.1.1 Bayes risk as conditional mutual information and capacity bound 605
32.1.2 Capacity upper bound via KL covering numbers 608
32.1.3 Bounding capacity and KL covering number using Hellinger entropy 609
32.1.4 General bounds between cumulative and individual (one-step) risks 610
32.2 Pairwise comparison à la Le Cam-Birgé 612
32.2.1 Composite hypothesis testing and Hellinger distance 612
32.2.2 Hellinger guarantee on Le Cam-Birgé’s pairwise comparison estimator 614
32.2.3 Refinement using local entropy 616
32.2.4 Lower bound using local Hellinger packing 618
32.3 Yatracos’ class and minimum distance estimator 620
32.4 Density estimation over Hölder classes 622


33 Strong data processing inequality 626


33.1 Computing a boolean function with noisy gates 626
33.2 Strong data processing inequality 629
33.3 Directed information percolation 633
33.4 Input-dependent SDPI. Mixing of Markov chains 637
33.5 Application: broadcasting and coloring on trees 641
33.6 Application: distributed correlation estimation 644
33.7 Channel comparison: degradation, less noisy, more capable 646
33.8 Undirected information percolation 648
33.9 Application: spiked Wigner model 651
33.10 Strong data post-processing inequality (post-SDPI) 653
33.11 Application: distributed mean estimation 657

Exercises for Part VI 660


References 672
Index 695


Preface

This book is a modern introduction to information theory. In the last two decades, the subject has
evolved from a discipline primarily dealing with problems of information storage and transmission
(“coding”) to one focusing increasingly on information extraction and denoising (“learning”). This
transformation is reflected in the title and content of this book.
It took us more than a decade to complete this work. It started as a set of lecture notes accumu-
lated by the authors through teaching regular courses at MIT, University of Illinois, and Yale, as
well as topics courses at EPFL (Switzerland) and ENSAE (France). Consequently, the intended
usage of this manuscript is as a textbook for a first course on information theory for graduate and
senior undergraduate students, or for a second (topics) course delving deeper into specific areas.
There are two aspects that make this textbook unusual. The first is that, while written by
information-theoretic “insiders”, the material is very much outward looking. While we do cover
in depth the bread-and-butter results (coding theorems) of information theory, we also dedicate
much effort to ideas and methods which have found influential applications in statistical learning,
statistical physics, ergodic theory, computer science, probability theory and more. The second
aspect is that we cover both the time-tested classical material (such as connections to combina-
torics, ergodicity, functional analysis) and the latest developments of very recent years
(large alphabet distribution estimation, community detection, mixing of Markov chains, graphical
models, PAC-Bayes, generalization bounds).
It is hard to mention everyone who helped us start and finish this work, but some stand out
especially. We owe a debt to Sergio Verdú, whose course at Princeton is responsible for our life-
long admiration of the subject. His passion and pedagogy are reflected, if imperfectly, on these
pages. For an undistorted view see his forthcoming comprehensive monograph [436].
Next, we were fortunate to have many bright students contribute to the typing of the original lecture
notes (the precursor of this book), as well as to correcting and extending the content. Among them,
we especially thank Ganesh Ajjanagadde, Austin Collins, Yuzhou Gu, Richard Guo, Qingqing
Huang, Alexander Haberman, Matthew Ho, Yunus Inan, Reka Inovan, Jason Klusowski, Anuran
Makur, Pierre Quinton, Aolin Xu, Sheng Xu, Pengkun Yang, Andrew Yao, and Junhui Zhang.
We thank many colleagues who provided valuable feedback at various stages of the book draft
over the years, in particular, Lucien Birgé, Marco Dalai, Meir Feder, Bob Gallager, Bobak Nazer,
Or Ordentlich, Henry Pfister, Maxim Raginsky, Sasha Rakhlin, Philippe Rigollet, Mark Sellke,
and Nikita Zhivotovskiy. Rachel Cohen (MIT) has been very kind with her time and helped in a
myriad of different ways.
We are grateful for the support from our editors Chloe Mcloughlin, Elizabeth Horne, Sarah
Strange, Julie Lancashire at Cambridge University Press (CUP) and for CUP to allow us to keep
a free version online. Our special acknowledgement is to Julie for providing that initial push and


motivation in 2019, without which we would never even have considered going beyond the initial
set of chaotic online lecture notes. (Though, had we known it would take 4 years...)
The cover art and artwork were contributed by the talented illustrator Nastya Mukhanova [311],
whom we cannot praise enough.
Y. P. would like to thank Olga for her unwavering patience and encouragement. Her loving sac-
rifice made the luxury of countless hours of extra time available to Y. P. to work on this book. Y. P.
would also like to extend a literary hug to Yana, Alina and Evan and thank them for brightening
up his life.
Y. W. would like to thank his parents and his wife, Nanxi.

Y. Polyanskiy <[email protected]>
MIT
Y. Wu <[email protected]>
Yale


Introduction

What is information?
The Oxford English Dictionary lists 18 definitions of the word information, while the Merriam-
Webster Dictionary lists 17. This emphasizes the diversity of meaning and contexts in which the
word information may appear. This book, however, is only concerned with a precise mathematical
understanding of information, independent of the application domain.
How can we measure something that we cannot even define well? Among the earliest attempts
at quantifying information we can list R.A. Fisher’s work on the uncertainty of statistical esti-
mates (“confidence intervals”) and R. Hartley’s definition of information as the logarithm of the
number of possibilities. Around the same time, Fisher [169] and others identified a connection
between information and thermodynamic entropy. This line of thinking culminated in Claude
Shannon’s magnum opus [378], where he formalized the concept of (what we call today) the Shan-
non information and forever changed human language by adopting John Tukey’s word bit as
the unit of its measurement. In addition to possessing a number of elegant properties, Shannon
information turned out to also answer certain rigorous mathematical questions (such as the opti-
mal rate of data compression and data transmission). This singled out Shannon’s definition as the
right way of quantifying information. Classical information theory, as taught in [106, 111, 177],
focuses exclusively on this point of view.
In this book, however, we take a slightly more general point of view. To introduce it, let us
quote the eminent physicist L. Brillouin [76]:

We must start with a precise definition of the word “information”. We consider a problem involving a certain
number of possible answers, if we have no special information on the actual situation. When we happen to be
in possession of some information on the problem, the number of possible answers is reduced, and complete
information may even leave us with only one possible answer. Information is a function of the ratio of the
number of possible answers before and after, and we choose a logarithmic law in order to insure additivity of
the information contained in independent situations.

Note that only the last sentence specializes the more general term information to Shannon’s
special version. In this book, we think of information without that last sentence. Namely, for us
information is a measure of difference between two beliefs about the system state. For example, it
could be the amount of change in our worldview following an observation or an event. Specifically,
suppose that initially the probability distribution P describes our understanding of the world (e.g.,
P allows us to answer questions such as how likely it is to rain today). Following an observation our
distribution changes to Q (e.g., upon observing clouds or a clear sky). The amount of information in
the observation is the dissimilarity between P and Q. How to quantify dissimilarity depends on the
particular context. As argued by Shannon, in many cases the right choice is the Kullback-Leibler


(KL) divergence D(Q‖P), see Definition 2.1. Indeed, if the prior belief is described by a probability
mass function P = (p1, . . . , pk) on the set of k possible outcomes, then the observation of the first
outcome results in the new (posterior) belief vector Q = (1, 0, . . . , 0), giving D(Q‖P) = log(1/p1),
and similarly for other outcomes. Since outcome i happens with probability pi we see that the
average dissimilarity between the prior and posterior beliefs is

$$\sum_{i=1}^{k} p_i \log \frac{1}{p_i},$$

which is precisely the Shannon entropy, cf. Definition 1.1.
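As a quick numerical check of this identity, here is a minimal Python sketch (the prior P = (1/2, 1/4, 1/8, 1/8) is chosen purely for illustration): it averages D(Qᵢ‖P) over the k outcomes, with Qᵢ the point mass on outcome i, and compares the result to the entropy of P.

```python
import math

def entropy(p, base=2):
    """Shannon entropy of a probability vector p, with the convention 0 log(1/0) = 0."""
    return sum(pi * math.log(1 / pi, base) for pi in p if pi > 0)

def kl(q, p, base=2):
    """KL divergence D(q || p); terms with q_i = 0 contribute 0."""
    return sum(qi * math.log(qi / pi, base) for qi, pi in zip(q, p) if qi > 0)

P = [0.5, 0.25, 0.125, 0.125]   # illustrative prior belief over k = 4 outcomes
k = len(P)

# Posterior after observing outcome i is the point mass Q_i on outcome i.
avg_divergence = sum(
    P[i] * kl([1.0 if j == i else 0.0 for j in range(k)], P) for i in range(k)
)

print(avg_divergence)  # 1.75 (bits)
print(entropy(P))      # 1.75 (bits) -- coincides, as the calculation above shows
```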


However, it is our conviction that measures of dissimilarity (or “information measures”) other
than the KL divergence are needed for applying information theory beyond the classical realms.
For example, the concepts of total variation, Hellinger distance and χ²-divergence (all promi-
nent members of the f-divergence family) have found deep and fruitful applications in the theory
of statistical estimation and probability, as well as contemporary topics in theoretical computer
science such as communication complexity, estimation with communication constraints, property
testing (we discuss these in detail in Part VI). Therefore, when we talk about information measures
in Part I of this book we do not exclusively focus on those of Shannon type, although the latter are
justly given a premium treatment.

What is information theory?


Similarly to information, the subject of information theory does not have a precise definition.
In the narrowest sense, it is a scientific discipline concerned with optimal methods of transmit-
ting and storing data. The highlights of this part of the subject are so-called “coding theorems”
showing existence of algorithms for compressing and communicating information across noisy
channels. Classical results, such as Shannon’s noisy channel coding theorem (Theorem 19.9),
not only show existence of algorithms, but also quantify their performance and show that
such performance is best possible. This part is, thus, concerned with identifying fundamental
limits of practically relevant (engineering) problems. Consequently, this branch is sometimes
called “IEEE-style information theory” (IEEE: the Institute of Electrical and Electronics Engineers,
pronounced “Eye-triple-E”), and it influenced or revolutionized much of the information
technology we witness today: digital communication, wireless (cellular and WiFi) networks,
cryptography (Diffie-Hellman), data compression (Lempel-Ziv family of algorithms), and a lot
more.
This book, however, is not limited to the IEEE-style information theory, because the true
scope of the field is much broader. Indeed, Hilbert’s 13th problem (for smooth functions)
was illuminated and resolved by Arnold and Kolmogorov via the idea of metric entropy that
Kolmogorov introduced following Shannon’s rate-distortion theory [440]. The isomorphism prob-
lem for Bernoulli shifts in ergodic theory has been solved by introducing the Kolmogorov-Sinai



entropy [387, 322]. In physics, the Landauer principle and other works on Maxwell’s demon have
been heavily influenced by information theory [267, 42]. In natural language processing (NLP),
the idea of modeling text as a high-order Markov model has seen spectacular successes recently in
the form of GPT [320] and related models. Many more topics ranging from biology, neuroscience
and thermodynamics to pattern recognition, artificial intelligence and control theory all regularly
appear in information-theoretic conferences and journals.
It seems that objectively circumscribing the territory claimed by information theory is futile.
Instead, we would like to highlight what we believe to be the recent developments that fascinate
us and which motivated us to write this book.
First, information processing systems of today are much more varied compared to those of the last
century. A modern controller (robot) is not just reacting to a few-dimensional vector of observa-
tions, modeled as a linear time-invariant system. Instead, it has million-dimensional inputs (e.g.,
a rasterized image), delayed and quantized, which also need to be communicated across noisy
links. The target of statistical inference is no longer a low-dimensional parameter, but rather a
high-dimensional (possibly discrete) object with structure (e.g. a sparse matrix, or a social net-
work between people with underlying community structure). Furthermore, observations arrive
to a statistician from spatially or temporally separated sources, which need to be transmitted
cognizant of rate limitations. Recognizing these new challenges, multiple communities simul-
taneously started re-investigating classical results (Chapter 29) on the optimality of maximum
likelihood and the (optimal) variance bounds given by the Fisher information. These developments
in high-dimensional statistics, computer science and statistical learning depend on the mastery of
the f-divergences (Chapter 7), the mutual-information method (Chapter 30), and the strong version
of the data-processing inequality (Chapter 33).
Second, since the 1990s technological advances have brought about a slew of new noisy channel
models. While classical theory addresses the so-called memoryless channels, the modern channels,
such as in flash storage, or urban wireless (multi-path, multi-antenna) communication, are far from
memoryless. In order to analyze these, the classical “asymptotic i.i.d.” theory is insufficient. The
resolution is the so-called “one-shot” approach to information theory, in which all main results
are developed while treating the channel inputs and outputs as abstract [211]. Only at the last step
those inputs are given the structure of long sequences and the asymptotic values are calculated.
This new “one-shot” approach has additional relevance for quantum information theory, where it
is in fact necessary.
Third, following impressive empirical achievements in 2010s there was an explosion in the
interest of understanding the methods and limits of machine learning from data. Information-
theoretic principles were instrumental for several discoveries in this area. As examples, we recall
the concept of metric entropy (Chapter 27) that is a cornerstone of Vapnik’s approach to supervised
learning (known as empirical risk minimization), non-linear regression and theory of density esti-
mation (Chapter 32). In machine learning density estimation is known as probabilistic generative
modeling, a prototypical problem in unsupervised learning. At present the best algorithms were
derived by applying information-theoretic ideas: Gibbs variational principle for Kullback-Leibler
divergence (in variational auto-encoders (VAE), cf. Example 4.2) and variational characteriza-
tion of Jensen-Shannon divergence (in generative adversarial networks (GAN), cf. Example 7.5).


Another fascinating connection is that the optimal prediction performance of online-learning


algorithms is given by the maximum of the mutual information. This is shown through a deep con-
nection between prediction and universal compression (Chapter 13), which led to the discovery
of the multiplicative weight update algorithm [445, 104].
On the theoretical side, a common information-theoretic method known as the strong data-
processing inequality (Chapter 33) led to the resolution of a series of problems in distributed estima-
tion, community detection (in graphs) and principal component analysis (spiked Wigner model).
The PAC-Bayes method, rooted in Donsker-Varadhan’s characterization of the Kullback-Leibler
divergence, led to numerous breakthroughs in the theory of bounding the generalization error of
learning algorithms and in understanding concentration and uniform convergence properties of
empirical processes in high dimensions (Section 4.8*).
Fourth, theoretical computer science has been exchanging ideas with information theory as
well. Classical connections include entropy and combinatorics (Chapter 8); entropy and random-
ness extraction (Chapter 9); von Neumann’s computation with noisy gates (Section 33.1); Ising,
Potts and coloring models on trees and general graphs (Section 33.5); communication complex-
ity and Hellinger distance (Exercise I.41). More recently, skillful applications of the chain rule
led to an elegant strengthening of Szemerédi’s regularity lemma in graph theory by Tao (Exer-
cise I.63) and to a breakthrough on the union-closed sets conjecture by Gilmer (Exercise I.61). The
so-called I-MMSE identity (Section 3.7*) was applied to get a very short proof of stochastic local-
ization (Exercise I.66). In the area of randomized sampling and counting, the method of spectral
independence (Exercise VI.26) resolved multiple long-standing conjectures.

Why another book on information theory?


Our motivation for writing this book was two-fold. First, in our experience there is a need for
a graduate-level textbook on information theory, developed at an acceptable level of generality
(i.e. not restricted to discrete, or categorical, random variables) while not sacrificing any mathe-
matical rigor. Second, we wanted to introduce the readers to all the exciting (classical and new)
connections between information theory and other disciplines that we surveyed in the previous
section. We believe that topics like the f-divergences, the one-shot point of view, the connections
with statistical learning and probability are not covered adequately in existing textbooks and are
future-proof: their significance will only grow with time. Because these topics are currently relegated
to specialized monographs, an aspiring student’s acquisition of this toolkit is delayed.
There are two great classical textbooks that are unlikely to become irrelevant any time
soon: [106] by Cover-Thomas and [111] by Csiszár-Körner (and the revised edition of the lat-
ter [115]). The former has been a primary textbook for the majority of undergraduate courses on
information theory in the world. It manages to rigorously introduce the concepts of entropy, infor-
mation and divergence and prove all the main results of the field, while also sampling several less
standard topics, such as universal compression, gambling and portfolio theory.


The textbook [111] spearheaded the combinatorial approach to information theory, known as
“the method of types”. While more mathematically demanding than [106], [111] manages to intro-
duce stronger results such as sharp estimates of error exponents and, especially, rate regions in
multi-terminal communication systems. However, both books are almost exclusively focused on
asymptotics, Shannon-type information measures and discrete (finite alphabet) cases.
Focused on specialized topics, several monographs are available. For a communication-oriented
reader, the classical [177] is still indispensable. The one-shot point of view is taken in [211]. Con-
nections to statistical learning theory and learning on graphs (belief propagation) are beautifully
covered in [287]. Ergodic theory is the central subject in [198]. Quantum information theory – a
burgeoning field – is treated in the recent [451]. The only textbook dedicated to the connection
between information theory and statistics is by Kullback [264], though restricted to large-sample
asymptotics in hypothesis testing. In nonparametric statistics, application of information-theoretic
methods is briefly (but elegantly) covered in [424].
Nevertheless, it is not possible to quilt this textbook from chapters of these excellent prede-
cessors. A number of important topics are treated exclusively here, such as those in Chapters 7
(f-divergences), 18 (one-shot coding theorems), 22 (finite blocklength), 27 (metric entropy), 30
(mutual information method), 32 (entropic bounds on estimation), and 33 (strong data-processing
inequalities). Furthermore, building up to these chapters requires numerous small innovations
across the rest of the textbook and are not available elsewhere. In addition, the exercises explore
works of the last few years.
Turning to omissions, this book almost entirely skips the topic of multi-terminal information
theory (with the exception of Sections 11.7*, 16.5* and 25.3*). This difficult subject captivated much
of the effort in the post-Shannon “IEEE-style” theory. We refer to the classical [115] and the recent
excellent textbook [147] containing an encyclopedic coverage of this area.
Another unfortunate omission is the connection between information theory and functional
inequalities [106, Chapter 17]. This topic has seen a flurry of recent activity, especially in loga-
rithmic Sobolev inequalities, isoperimetry, concentration of measure, Brascamp-Lieb inequalities,
(Marton-Talagrand) information-transportation inequalities and others. We only briefly mention
these topics in Sections 3.7*, 3.8* and associated exercises (e.g. I.47 and I.65). For a fuller
treatment, see the monograph [353] and references there.
Finally, this book will not teach one how to construct practical error-correcting codes or design
modern wireless communication systems. Following our Part IV, which covers the basics, an
interested reader is advised to master the tools from coding theory via [360] and multiple-antenna
channels via [423].

A note to statisticians
The interplay between information theory and statistics is a constant theme in the development of
both fields. Since its inception, information theory has been indispensable for understanding the
fundamental limits of statistical estimation. The prominent role of information-theoretic quanti-
ties, such as mutual information, f-divergence, metric entropy, and capacity, in establishing the


minimax rates of estimation has long been recognized since the seminal work of Le Cam [272],
Ibragimov and Khas’minski [222], Pinsker [328], Birgé [53], Haussler and Opper [216], Yang
and Barron [464], among many others. In Part VI of this book we give an exposition of some of
the most influential information-theoretic ideas and their applications in statistics. This part is not
meant to be a thorough treatment of decision theory or mathematical statistics; for that purpose,
we refer to the classics [222, 276, 68, 424] and the more recent monographs [78, 190, 446] focus-
ing on high dimensions. Instead, we apply the theory developed in previous Parts I–V of this book
to several concrete and carefully chosen examples of determining the minimax risk in both classi-
cal (fixed-dimensional, large-sample asymptotic) and modern (high-dimensional, non-asymptotic)
settings.
At a high level, the connection between information theory (in particular, data transmission)
and statistical inference is that both problems are defined by a conditional distribution PY|X , which
is referred to as the channel for the former and the statistical model or experiment for the latter. In
both disciplines the ultimate goal is to estimate X with high fidelity based on its noisy observation Y
using computationally efficient algorithms. However, in data transmission the set of allowed values
of X is typically discrete and restricted to a carefully chosen subset of inputs (called a codebook),
the design of which is considered to be the main difficulty. In statistics, however, the space or
the distribution of allowed values of X (the parameter) is constrained by the problem setup (for
example, requiring sparsity or low rank on X), not by the statistician. Despite this key difference,
both disciplines in the end are all about estimating X based on Y and information-theoretic ideas
are applicable in both settings.
Specifically, in Chapter 29 we show how the data processing inequality can be used to deduce
classical lower bounds in statistical estimation (Hammersley-Chapman-Robbins, Cramér-Rao,
van Trees). In Chapter 30 we introduce the mutual information method, based on the reasoning
in joint source-channel coding. Namely, by comparing the amount of information contained in
the data and the amount of information required for achieving a given estimation accuracy, both
measured in bits, this method allows us to apply the theory of capacity and rate-distortion func-
tion developed in Parts IV and V to lower bound the statistical risk. Besides being principled, this
approach also unifies the three popular methods for proving minimax lower bounds due to Le
Cam, Assouad, and Fano respectively (Chapter 31).
It is a common misconception that information theory only supplies techniques for proving
negative results in statistics. In Chapter 32 we present three upper bounds on statistical estimation
risk based on metric entropy: Yang-Barron’s construction inspired by universal compression, Le
Cam-Birgé’s tournament based on pairwise hypothesis testing, and Yatracos’ minimum-distance
approach. These powerful methods are responsible for some of the strongest and most general
results in statistics and are applicable to both high-dimensional and nonparametric problems. Finally,
in Chapter 33 we introduce the method based on strong data processing inequalities and apply
it to resolve an array of contemporary problems including community detection on graphs, dis-
tributed estimation with communication constraints, and generating random tree colorings. These
problems are increasingly captivating the minds of computer scientists as well.


How to use this textbook


An introductory class on information theory aiming at advanced undergraduate or graduate
students can proceed with the following sequence:

• Part I: Chapters 1–3, Sections 4.1, 5.1–5.3, 6.1, and 3.6, focusing only on discrete prob-
ability space and ignoring Radon-Nikodym derivatives. Some mention of applications in
combinatorics and cryptography (Chapters 8, 9 and select exercises) is recommended.
• Part II: Chapter 10, Sections 11.1–11.5.
• Part III: Chapter 14, Sections 15.1–15.3, and 16.1.
• Part IV: Chapters 17–18, Sections 19.1–19.3, 19.7, 20.1–20.2, 23.1.
• Part V: Sections 24.1–24.3, 25.1, 26.1, and 26.3.
• Conclude with a few applications of information theory outside the classical domain (Chap-
ters 30 and 33).

A graduate-level class on information theory with a traditional focus on communication and


compression can proceed faster through Part I (omitting f-divergences and other non-essential
chapters), but then cover Parts II–V in depth, including strong converse, finite-blocklength regime,
and communication with feedback, but omitting Chapter 27. It is important to work through
exercises at the end of Part IV for this kind of class.
For a graduate-level class on information theory with an emphasis on statistical learning, start
with Part I (especially Chapter 7), followed by Part II (especially Chapter 13) and Part III, from
Part IV limit coverage to Chapters 17-19, and from Part V to Chapter 27 (especially, Sections 27.1–
27.4). This should leave more than half of the semester for carefully working through Part VI. For
example, for a good pace we suggest leaving at least 5-6 lectures for Chapters 32 and 33. These last
chapters contain some bleeding-edge research results and open problems, hopefully welcoming
students to work on them. For that we also recommend going over the exercises at the end of
Parts I and VI.
Difficult sections are marked with asterisks and can be skipped on a first reading as they may
rely on material from future chapters or external sources.
An extensive index should help connect different topics together. For example, looking up
“community detection” shows all the many occurrences of this interesting example across the
chapters.


Frequently used notation

General conventions

• The symbol ≜ reads “defined as” and ≡ reads “abbreviated as”.


• The set of real numbers and integers are denoted by R and Z. Let N = {1, 2, . . .}, Z+ =
{0, 1, . . .}, R+ = {x : x ≥ 0}.
• For n ∈ N, let [n] = {1, . . . , n}.
• Throughout the book, xn ≜ (x1, . . . , xn) denotes an n-dimensional vector, x_i^j ≜ (xi, . . . , xj) for
1 ≤ i < j ≤ n, and xS ≜ {xi : i ∈ S} for S ⊂ [n].
• Unless explicitly specified, the logarithm log and exponential exp are with respect to a generic
common base. The natural logarithm is denoted by ln = log_e, and exp_e{·} = e^(·).
• We agree to take exp{+∞} = +∞, exp{−∞} = 0, log(+∞) = +∞, log(0) = −∞. The
function x ↦ x log x is extended to x = 0 by taking 0 · log 0 = 0. The bivariate function (x, y) ↦
log(x/y), extended to x = 0 and y = 0, is denoted by Log(x/y) and has a special convention (2.10).
• a ∧ b = min{a, b} and a ∨ b = max{a, b}.
• For p ∈ [0, 1], p̄ ≜ 1 − p.
• x+ = max{x, 0}.
• f(x+) ≜ lim_{y↘x} f(y), f(x−) ≜ lim_{y↗x} f(y).
• Limit inferior and limit superior: lim inf_{n→∞} g_n ≜ lim_{n→∞} inf_{m≥n} g_m and lim sup_{n→∞} g_n ≜ lim_{n→∞} sup_{m≥n} g_m.
• w_H(x) denotes the Hamming weight (number of ones) of a binary vector x. d_H(x, y) =
Σ_{i=1}^n 1{x_i ≠ y_i} denotes the Hamming distance between vectors x and y of length n.
• Standard big O notations are used throughout the book: e.g., for any positive sequences {an }
and {bn }, an = O(bn ) if there is an absolute constant c > 0 such that an ≤ cbn ; an = Ω(bn )
if bn = O(an ); an = Θ(bn ) if both an = O(bn ) and an = Ω(bn ), we also write an ≍ bn in
these cases; an = o(bn ) or bn = ω(an ) if an ≤ ϵn bn for some ϵn → 0. In addition, if there is a
parameter p in the discussion and the constant c in the definition of an = O(bn ) depends on p,
then we emphasize this fact by writing an = Op (bn ).

Information theory and statistics

• h(·) is the binary entropy function, H(·) denotes general Shannon entropy.
• d(·‖·) is the binary divergence function, D(·‖·) denotes general Kullback-Leibler divergence.
• Standard channels BSCδ , BECδ , BIAWGNσ2
• Common divergences are χ²(·‖·) (chi-squared), Dα(·‖·) (Rényi divergence), Df(·‖·) (general
f-divergence).


• Common statistical distances TV(·, ·) (total variation), H2 (·, ·) (Hellinger-squared), W1 (·, ·)


(Wasserstein distance)
• Depending on context Pθ may denote a parametric class of distributions indexed by the
parameter θ, or it can mean the law of random variable θ.

Analysis and algebra


• For x ∈ R^d we denote ‖x‖_p = (Σ_{i=1}^d |x_i|^p)^{1/p} and ‖x‖ = ‖x‖_2 the standard Euclidean norm.
• Let int(E) and cl(E) denote the interior and closure of a set E, namely, the largest open set
contained in and smallest closed set containing E, respectively.
• Let co(E) denote the convex hull of E (without topology), namely, the smallest convex set
containing E, given by co(E) = {Σ_{i=1}^n α_i x_i : α_i ≥ 0, Σ_{i=1}^n α_i = 1, x_i ∈ E, n ∈ N}.
• For subsets A, B of a real vector space and λ ∈ R, denote the dilation λA = {λa : a ∈ A} and
the Minkowski sum A + B = {a + b : a ∈ A, b ∈ B}.
• For a metric space (X , d), a function f : X → R is called C-Lipschitz if |f(x) − f(y)| ≤ Cd(x, y)
for all x, y ∈ X . We set ‖f‖_Lip(X) = inf{C : f is C-Lipschitz}.
• Linear algebraic notations. For a matrix A with real entries we define
– trace tr A = Σ_i A_{i,i};
– σ_1(A) ≥ σ_2(A) ≥ · · · its list of singular values sorted in decreasing order. Recall that σ_j(A)
is the square root of the j-th largest eigenvalue of A⊤A.
– The operator norm ‖A‖_op = sup_{v≠0} ‖Av‖/‖v‖ = σ_1(A).
– The Frobenius norm ‖A‖_F² = Σ_{i,j} |A_{i,j}|² = tr A⊤A = tr AA⊤ = Σ_i σ_i²(A).
– We write A ⪰ 0 to denote that A is positive semi-definite, and A ⪰ B to denote that A − B ⪰ 0.
• Hess f(x) denotes the Hessian of f: (Hess f(x))_{i,j} = ∂²f/(∂x_i ∂x_j). ∆f(x) = tr Hess f(x) = Σ_i ∂²f(x)/∂x_i²
denotes the Laplacian.
• Convolution of two functions f, g on R^d is defined as (f ∗ g)(x) = ∫_{R^d} f(y)g(x − y) dy for all
x where the (Lebesgue) integral exists. For two probability measures p, q on R^d we also define
p ∗ q to be the law of A + B where A ⊥⊥ B and A ∼ p, B ∼ q.

Measure theory and probability

• The Lebesgue measure on Euclidean spaces is denoted by Leb and also by vol (volume).
• Throughout the book, all measurable spaces (X , E) are standard Borel spaces. Unless explicitly
needed, we suppress the underlying σ -algebra E .
• The collection of all probability measures on X is denoted by P(X ). For finite spaces we
abbreviate Pk ≡ P([k]), a (k − 1)-dimensional simplex.
• For measures P and Q, their product measure is denoted by P × Q or P ⊗ Q. The n-fold product
of P is denoted by Pn or P⊗n . Similarly, given a Markov kernel PY|X : X → Y, the kernel that
acts independently on each of the n coordinates is denoted by P_{Y|X}^{⊗n} : X^n → Y^n.

• Let P be absolutely continuous with respect to Q, denoted by P ≪ Q. The Radon-Nikodym
derivative of P with respect to Q is denoted by dP/dQ. For a probability measure P, if Q = Leb, dP/dQ


is referred to as the probability density function (pdf); if Q is the counting measure on a countable
X , dP/dQ is the probability mass function (pmf).
• Let P ⊥ Q denote their mutual singularity, namely, P(A) = 0 and Q(A) = 1 for some A.
• The support of a probability measure P, denoted by supp(P), is the smallest closed set C such
that P(C) = 1. An atom x of P is such that P({x}) > 0. A distribution P is discrete if supp(P)
is a countable set (consisting of its atoms).
• Let X be a random variable taking values on X , which is referred to as the alphabet of X. Its
realizations are labeled by lower case letters, e.g. x. Thus, upper case, lower case, and script case
are matched to random variables, realizations, and alphabets, respectively (as in X = x ∈ X ).
Oftentimes X and Y are automatically assumed to be the alphabet of X and Y, etc. We also write
X ∈ X to mean that random variable X is X -valued.
• Let PX denote the distribution (law) of the random variable X, PX,Y the joint distribution of X
and Y, and PY|X the conditional distribution of Y given X.
• A conditional distribution PY|X is also called a Markov kernel acting between spaces X and Y ,
written as PY|X : X → Y . Given a conditional distribution PY|X and a marginal we can form
a joint distribution, written as PX × PY|X , or simply PX PY|X . Its marginal PY is denoted by a
composition operation PY ≜ PY|X ◦ PX .
• The independence of random variables X and Y is denoted by X ⊥⊥ Y, in which case PX,Y =
PX × PY . Similarly, X ⊥⊥ Y|Z denotes their conditional independence given Z, in which case
PX,Y|Z = PX|Z × PY|Z .
• Throughout the book, Xn ≡ X_1^n ≜ (X1 , . . . , Xn ) denotes an n-dimensional random vector. We
write X1 , . . . , Xn ∼ P i.i.d. if they are independently and identically distributed (iid) as P, in which
case PXn = Pn .
• The empirical distribution of a sequence x1 , . . . , xn is denoted by P̂xn ; the empirical distribution of
a random sample X1 , . . . , Xn is denoted by P̂n ≡ P̂Xn .
• →a.s. , →P , and →d denote convergence almost surely, in probability, and in distribution (law),
respectively. We define =d to mean equality in distribution.
• Occasionally, for clarity we use a self-explanatory notation EY∼Q [·] to mean that the expectation
is taken with Y generated from distribution Q. We also use cues like EC [·] to signify that the
expectation is taken over C.
• Some commonly used distributions are as follows:
– Ber(p): Bernoulli distribution with mean p.
– Bin(n, p): Binomial distribution with n trials and success probability p.
– Poisson(λ): Poisson distribution with mean λ.
– N(μ, σ²) is the Gaussian (normal) distribution on R with mean μ and variance σ². N(μ, Σ) is the
Gaussian distribution on R^d with mean μ and covariance matrix Σ. Denote the standard normal
density by φ(x) = (1/√(2π)) e^{−x²/2}, the CDF and complementary CDF by Φ(t) = ∫_{−∞}^t φ(x)dx
and Q(t) = Φ^c(t) = 1 − Φ(t). The inverse of Q is denoted by Q^{−1}.


– Z ∼ Nc(μ, σ²) denotes the complex-valued circularly symmetric normal distribution with
expectation E[Z] = μ ∈ C and E[|Z − μ|2 ] = σ 2 .


– For a compact subset X of Rd with non-empty interior, Unif(X ) denotes the uniform distri-
bution on X , with Unif(a, b) ≡ Unif([a, b]) for interval [a, b]. We also use Unif(X ) to denote
the uniform (equiprobable) distribution on a finite set X .
• For an R^d-valued random variable X we denote by Cov(X) = E[(X−E[X])(X−E[X])⊤ ] its covariance
matrix. A conditional version is denoted as Cov(X|Y) = E[(X − E[X|Y])(X − E[X|Y])⊤ ].
• For a set E ⊂ Ω we denote by 1E (ω) the function equal to 1 iff ω ∈ E. Similarly,
1{boolean condition} denotes a random variable that is equal to 1 iff the “boolean condition”
is satisfied and otherwise equals zero. Thus, for example, P[X > 1] = E[1{X > 1}].


Part I

Information measures


Information measures form the backbone of information theory. The first part of this book is
devoted to an in-depth study of some of them, most notably, entropy, divergence, mutual informa-
tion, as well as their conditional versions (Chapters 1–3). In addition to basic definitions illustrated
through concrete examples, we will also study various aspects including chain rules, regularity,
tensorization, variational representation, local expansion, convexity and optimization properties,
as well as the data processing principle (Chapters 4–6). These information measures will be
imbued with operational meaning when we proceed to classical topics in information theory such
as data compression and transmission, in subsequent parts of the book. This Part also includes
topics connecting information theory to other subjects, such as I-MMSE relation (estimation the-
ory), entropy power inequality (probability), PAC-Bayes bounds and Gibbs variational principle
(machine learning).
In addition to the classical (Shannon) information measures, Chapter 7 provides a systematic
treatment of f-divergences, a generalization of (Shannon) measures introduced by Csiszár that
plays an important role in many statistical problems (see Parts III and VI). Finally, towards the
end of this part we will discuss two operational topics: random number generators in Chapter 9
and the application of entropy method to combinatorics and geometry in Chapter 8.
Several contemporary topics are developed in exercises such as stochastic block model
(Exercise I.49), Gilmer’s method in combinatorics (Exercise I.61), Tao’s proof of Szemerédi’s reg-
ularity lemma (Exercise I.63), Eldan’s stochastic localization (Exercise I.66), Gross’ log-Sobolev
inequality (Exercise I.65) and others.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-8


i i

1 Entropy

This chapter introduces the first information measure – Shannon entropy. After studying its stan-
dard properties (chain rule, conditioning), we will briefly describe how one could arrive at its
definition. We discuss axiomatic characterization, the historical development in statistical mechan-
ics, as well as the underlying combinatorial foundation (“method of types”). We close the chapter
with Han’s and Shearer’s inequalities, that both exploit submodularity of entropy. After this Chap-
ter, the reader is welcome to consult the applications in combinatorics (Chapter 8) and random
number generation (Chapter 9), which are independent of the rest of this Part.

1.1 Entropy and conditional entropy


Definition 1.1 (Entropy) Let X be a discrete random variable with probability mass function
PX (x), x ∈ X . The entropy (or Shannon entropy) of X is
h 1 i
H(X) = E log
PX (X)
X 1
= PX (x) log .
P X ( x)
x∈X

When computing the sum, we agree that (by continuity of x 7→ x log 1x )


1
0 log = 0. (1.1)
0
Since entropy only depends on the distribution of a random variable, it is customary in information
theory to also write H(PX ) in place of H(X), which we will do freely in this book. The basis of the
logarithm in Definition 1.1 determines the units of the entropy:

log2 ↔ bits
loge ↔ nats
log256 ↔ bytes
log ↔ arbitrary units, base always matches exp

Different units will be convenient in different cases and so most of the general results in this book
are stated with “baseless” log/exp.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-9


i i

1.1 Entropy and conditional entropy 9

Definition 1.2 (Joint entropy) The joint entropy of n discrete random variables Xn ≜
(X1 , X2 , . . . , Xn ) is

h 1 i
H(Xn ) = H(X1 , . . . , Xn ) = E log ,
PX1 ,...,Xn (X1 , . . . , Xn )

which can also be written explicitly as a summation over a joint probability mass function (PMF):

X X 1
H(Xn ) = ··· PX1 ,...,Xn (x1 , . . . , xn ) log .
x1 xn
PX1 ,...,Xn (x1 , . . . , xn )

Note that joint entropy is a special case of Definition 1.1 applied to the random vector Xn =
(X1 , X2 , . . . , Xn ) taking values in the product space.

Remark 1.1 The name “entropy” originates from thermodynamics – see Section 1.3, which
also provides combinatorial justification for this definition. Another common justification is to
derive H(X) as a consequence of natural axioms for any measure of “information content” – see
Section 1.2. There are also natural experiments suggesting that H(X) is indeed the amount of
“information content” in X. For example, one can measure time it takes for ant scouts to describe
the location of the food to ants-workers. It was found that when nest is placed at the root of a full
binary tree of depth d and food at one of the leaves, the time was proportional to the entropy of a
random variable describing the food location [358]. (It was also estimated that ants communicate
with about 0.7–1 bit/min and that communication time reduces if there are some regularities in
path-description: paths like “left,right,left,right,left,right” are described by scouts faster).

Entropy measures the intrinsic randomness or uncertainty of a random variable. In the simple
setting where X takes values uniformly over a finite set X , the entropy is simply given by log-
cardinality: H(X) = log |X |. In general, the more spread out (resp. concentrated) a probability
mass function is, the higher (resp. lower) is its entropy, as demonstrated by the following example.
h(p)
Example 1.1 (Bernoulli) Let X ∼ Ber(p), with
PX (1) = p and PX (0) = p ≜ 1 − p. Then
log 2
1 1
H(X) = h(p) ≜ p log + p log .
p p
Here h(·) is called the binary entropy function, which is
continuous, concave on [0, 1], symmetric around 12 , and sat-
isfies h′ (p) = log pp , with infinite slope at 0 and 1. The
highest entropy is achieved at p = 21 (uniform), while the
lowest entropy is achieved at p = 0 or 1 (deterministic).
It is instructive to compare the plot of the binary entropy
p
function with the variance p(1 − p). 0 1
2
1

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-10


i i

10

Example 1.2 (Geometric) Let X be geometrically distributed, with PX (i) = ppi , i = 0, 1, . . ..


Then E[X] = p̄
p and
h 1 i 1 1 h( p)
H(X) = E log X = log + E[X] log = .
pp̄ p p̄ p
Example 1.3 (Infinite entropy) Is it possible that H(X) = +∞? Yes, for example, P[X =
k] ∝ 1
k ln2 k
,k = 2, 3, · · · .
Many commonly used information measures have their conditional counterparts, defined
by applying the original definition to a conditional probability measure followed by a further
averaging. For entropy this is defined as follows.

Definition 1.3 (Conditional entropy) Let X be a discrete random variable and Y arbitrary.
Denote by PX|Y=y (·) or PX|Y (·|y) the conditional distribution of X given Y = y. The conditional
entropy of X given Y is
h 1 i
H(X|Y) = Ey∼PY [H(PX|Y=y )] = E log .
PX|Y (X|Y)

Note that if Y is also discrete we can write out the expression in terms of joint PMF PX,Y and
conditional PMF PX|Y as
XX 1
H(X|Y) = PX,Y (x, y) log .
x y
PX|Y (x|y)

Similar to entropy, conditional entropy measures the remaining randomness of a random vari-
able when another is revealed. As such, H(X|Y) = H(X) whenever Y is independent of X. But
when Y depends on X, observing Y does lower the entropy of X. Before formalizing this in the
next theorem, here is a concrete example.
Example 1.4 (Conditional entropy and noisy channel) Let Y be a noisy observation
of X ∼ Ber( 21 ) as follows.

1 Y = X ⊕ Z, where ⊕ denotes binary addition (XOR) and Z ∼ Ber(δ) independently of X. In


other words, Y agrees with X with probability δ and disagrees with probability δ̄ . Then PX|Y=0 =
Ber(δ) and PX|Y=1 = Ber(δ̄). Since h(δ) = h(δ̄), H(X|Y) = h(δ). Note that when δ = 12 , Y is
independent of X and H(X|Y) = H(X) = 1 bits; when δ = 0 or 1, X is completely determined
by Y and hence H(X|Y) = 0.
2 Y = X + Z be real-valued, where Z ∼ N (0, σ 2 ). Then H(X|Y) = E[h(P [X = 1|Y])], where
φ( y− 1
σ )
P [ X = 1 | Y = y] = φ( σy )+φ( y− 1 and Y ∼ 12 (N (0, σ 2 ) + N (1, σ 2 )). Below is a numerical plot of
σ )
H(X|Y) as a function of σ which can be shown to be monotonically increasing from 0 to 1bit.
2

(Hint: Theorem 1.4(d).)

Before discussing various properties of entropy and conditional entropy, let us first review some
relevant facts from convex analysis, which will be used extensively throughout the book.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-11


i i

1.1 Entropy and conditional entropy 11

1.0

0.8

0.6

0.4

0.2

0.0

Figure 1.1 Conditional entropy of a Bernoulli X given its Gaussian noisy observation.

Review: Convexity

• Convex set: A subset S of some vector space is convex if x, y ∈ S ⇒ αx + ᾱy ∈ S


for all α ∈ [0, 1]. (Recall: ᾱ ≜ 1 − α.)
Examples: unit interval [0, 1]; S = {probability distributions on X }; S = {PX :
E[X] = 0}.
• Convex function: f : S → R is
– convex if f(αx + ᾱy) ≤ αf(x) + ᾱf(y) for all x, y ∈ S, α ∈ [0, 1].
– strictly convex if f(αx + ᾱy) < αf(x) + ᾱf(y) for all x 6= y ∈ S, α ∈ (0, 1).
– (strictly) concave if −f is (strictly) convex.
R
Examples: x 7→ x log x is strictly convex; the mean P 7→ xdP is convex but
not strictly convex, variance is concave (Question: is it strictly concave? Think of
zero-mean distributions.).
• Jensen’s inequality:

For any S-valued random variable X,

– f is convex ⇒ f(EX) ≤ Ef(X) Ef(X)

– f is strictly convex ⇒ f(EX) < Ef(X), unless X


is a constant (X = EX a.s.) f(EX)

Theorem 1.4 (Properties of entropy)

(a) (Positivity) H(X) ≥ 0 with equality, iff X is a constant (no randomness).


(b) (Uniform distribution maximizes entropy) For finite X , H(X) ≤ log |X |, with equality iff X is
uniform on X .
(c) (Invariance under relabeling) H(X) = H(f(X)) for any bijective f.
(d) (Conditioning reduces entropy) H(X|Y) ≤ H(X), with equality iff X and Y are independent.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-12


i i

12

(e) (Simple chain rule)

H(X, Y) = H(X) + H(Y|X) ≤ H(X) + H(Y). (1.2)

(f) (Entropy under deterministic transformation) H(X) = H(X, f(X)) ≥ H(f(X)) with equality iff
f is one-to-one on the support of PX .
(g) (Full chain rule)
X
n X
n
H(X1 , . . . , Xn ) = H(Xi |Xi−1 ) ≤ H(Xi ), (1.3)
i=1 i=1

with equality iff X1 , . . . , Xn are mutually independent.

Proof. (a) Since log PX1(X) is a positive random variable, its expectation H(X) is also positive,
with H(X) = 0 if and only if log PX1(X) = 0 almost surely, namely, PX is a point mass.
(b) Apply Jensen’s inequality to the strictly concave function x 7→ log x:
   
1 1
H(X) = E log ≤ log E = log |X |.
PX (X) PX (X)
(c) H(X) as a summation only depends on the values of PX , not locations.
(d) Abbreviate P(x) ≡ PX (x) and P(x|y) ≡ PX|Y (x|y). Using P(x) = EY [P(x|Y)] and applying
Jensen’s inequality to the strictly concave function x 7→ x log 1x ,
X  1
 X
1
H(X|Y) = EY P(x|Y) log ≤ P(x) log = H(X).
P(x|Y) P ( x)
x∈X x∈X

Additionally, this also follows from (and is equivalent to) Corollary 3.5 in Chapter 3 or
Theorem 5.2 in Chapter 5.
(e) Telescoping PX,Y (X, Y) = PY|X (Y|X)PX (X) and noting that both sides are positive PX,Y -almost
surely, we have
h 1 i h 1 i h 1 i h 1 i
E log = E log = E log + E log
PX,Y (X, Y) PX (X) · PY|X (Y|X) PX (X) PY|X (Y|X)
| {z } | {z }
H(X) H(Y|X)

(f) The intuition is that (X, f(X)) contains the same amount of information as X. Indeed, x 7→
(x, f(x)) is one-to-one. Thus by (c) and (e):

H(X) = H(X, f(X)) = H(f(X)) + H(X|f(X)) ≥ H(f(X))

The bound is attained iff H(X|f(X)) = 0 which in turn happens iff X is a constant given f(X).
(g) Similar to (e), telescoping

PX1 X2 ···Xn = PX1 PX2 |X1 · · · PXn |Xn−1

and taking the logarithm prove the equality. The inequality follows from (d), with the case of
Qn
equality occurring if and only if PXi |Xi−1 = PXi for i = 1, . . . , n, namely, PXn = i=1 PXi .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-13


i i

1.2 Axiomatic characterization 13

To give a preview of the operational meaning of entropy, let us play the game of 20 Questions.
We are allowed to make queries about some unknown discrete RV X by asking yes-no questions.
The objective of the game is to guess the realized value of the RV X. For example, X ∈ {a, b, c, d}
with P [X = a] = 1/2, P [X = b] = 1/4, and P [X = c] = P [X = d] = 1/8. In this case, we can
ask “X = a?”. If not, proceed by asking “X = b?”. If not, ask “X = c?”, after which we will know
for sure the realization of X. The resulting average number of questions is 1/2 + 1/4 × 2 + 1/8 ×
3 + 1/8 × 3 = 1.75, which equals H(X) in bits. An alternative strategy is to ask “X = a, b or c, d”
in the first round then proceeds to determine the value in the second round, which always requires
two questions and does worse on average.
It turns out (Section 10.3) that the minimal average number of yes-no questions to pin down
the value of X is always between H(X) bits and H(X) + 1 bits. In this special case the above
scheme is optimal because (intuitively) it always splits the probability in half.

1.2 Axiomatic characterization


P
One might wonder why entropy is defined as H(P) = pi log p1i and if there are other definitions.
Indeed, the information-theoretic definition of entropy is related to entropy in statistical physics.
Also, it arises as answers to specific operational problems, e.g., the minimum average number of
bits to describe a random variable as discussed above. Therefore it is fair to say that it is not pulled
out of thin air.
Shannon in 1948 paper has also showed that entropy can be defined axiomatically, as a
function satisfying several natural conditions. Denote a probability distribution on m letters by
P = (p1 , . . . , pm ) and consider a functional Hm (p1 , . . . , pm ). If Hm obeys the following axioms:

(a) Permutation invariance: Hm (pπ (1) , . . . , pπ (m) ) = Hm (p1 , . . . , pm ) for any permutation π on [m].
(b) Expansibility: Hm (p1 , . . . , pm−1 , 0) = Hm−1 (p1 , . . . , pm−1 ).
(c) Normalization: H2 ( 12 , 12 ) = log 2.
(d) Subadditivity: H(X, Y) ≤ H(X) + H(Y). Equivalently, Hmn (r11 , . . . , rmn ) ≤ Hm (p1 , . . . , pm ) +
Pn Pm
Hn (q1 , . . . , qn ) whenever j=1 rij = pi and i=1 rij = qj .
(e) Additivity: H(X, Y) = H(X) + H(Y) if X ⊥ ⊥ Y. Equivalently, Hmn (p1 q1 , . . . , pm qn ) =
Hm (p1 , . . . , pm ) + Hn (q1 , . . . , qn ).
(f) Continuity: H2 (p, 1 − p) → 0 as p → 0.
Pm
then Hm (p1 , . . . , pm ) = i=1 pi log p1i is the only possibility. The interested reader is referred to
[115, Exercise 1.13] and the references therein.
We note that there are other meaningful measure of randomness, including, notably, the Rényi
entropy of order α introduced by Alfréd Rényi [356]
( Pm
1
1−α log pα α ∈ (0, 1) ∪ (1, ∞)
Hα (P) ≜ i=1 i
(1.4)
mini log 1
pi α = ∞.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-14


i i

14

(The quantity H∞ is also known as the min-entropy, or Hmin , in the cryptography literature). One
can check that

1 0 ≤ Hα (P) ≤ log m, where the lower (resp. upper) bound is achieved when P is a point mass
(resp. uniform);
2 Hα (P) is non-increasing in α and tends to the Shannon entropy H(P) as α → 1.
3 Rényi entropy satisfies the above six axioms except for the subadditivity.

1.3 History of entropy


In the early days of industrial age, engineers wondered if it is possible to construct a perpetual
motion machine. After many failed attempts, a law of conservation of energy was postulated: a
machine cannot produce more work than the amount of energy it consumed from the ambient
world. (This is also called the first law of thermodynamics.) The next round of attempts was then
to construct a machine that would draw energy in the form of heat from a warm body and convert it
to equal (or approximately equal) amount of work. An example would be a steam engine. However,
again it was observed that all such machines were highly inefficient. That is, the amount of work
produced by absorbing heat Q was far less than Q. The remainder of energy was dissipated to
the ambient world in the form of heat. Again after many rounds of attempting various designs
Clausius and Kelvin proposed another law:

Second law of thermodynamics: There does not exist a machine that operates in a cycle (i.e. returns to its original
state periodically), produces useful work and whose only other effect on the outside world is drawing heat from
a warm body. (That is, every such machine, should expend some amount of heat to some cold body too!)1

Equivalent formulation is as follows: “There does not exist a cyclic process that transfers heat
from a cold body to a warm body”. That is, every such process needs to be helped by expending
some amount of external work; for example, the air conditioners, sadly, will always need to use
some electricity.
Notice that there is something annoying about the second law as compared to the first law. In
the first law there is a quantity that is conserved, and this is somehow logically easy to accept. The
second law seems a bit harder to believe in (and some engineers did not, and only their recurrent
failures to circumvent it finally convinced them). So Clausius, building on an ingenious work of
S. Carnot, figured out that there is an “explanation” to why any cyclic machine should expend
heat. He proposed that there must be some hidden quantity associated to the machine, entropy
of it (initially described as “transformative content” or Verwandlungsinhalt in German), whose
value must return to its original state. Furthermore, under any reversible (i.e. quasi-stationary, or
“very slow”) process operated on this machine the change of entropy is proportional to the ratio

1
Note that the reverse effect (that is converting work into heat) is rather easy: friction is an example.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-15


i i

1.3 History of entropy 15

of absorbed heat and the temperature of the machine:


∆Q
∆S = . (1.5)
T
If heat Q is absorbed at temperature Thot then to return to the original state, one must return some
amount of heat Q′ , where Q′ can be significantly smaller than Q but never zero if Q′ is returned
at temperature 0 < Tcold < Thot . Further logical arguments can convince one that for irreversible
cyclic process the change of entropy at the end of the cycle can only be positive, and hence entropy
cannot reduce.
There were great many experimentally verified consequences that second law produced. How-
ever, what is surprising is that the mysterious entropy did not have any formula for it (unlike, say,
energy), and thus had to be computed indirectly on the basis of relation (1.5). This was changed
with the revolutionary work of Boltzmann and Gibbs, who provided a microscopic explanation
of the second law based on statistical physics principles and showed that, e.g., for a system of n
independent particles (as in ideal gas) the entropy of a given macro-state can be computed as
X

1
S = kn pj log , (1.6)
pj
j=1

where k is the Boltzmann constant, and we assumed that each particle can only be in one of ℓ
molecular states (e.g. spin up/down, or if we quantize the phase volume into ℓ subcubes) and pj is
the fraction of particles in j-th molecular state.
More explicitly, their innovation was two-fold. First, they separated the concept of a micro-
state (which in our example above corresponds to a tuple of n states, one for each particle) and the
macro-state (a list {pj } of proportions of particles in each state). Second, they postulated that for
experimental observations only the macro-state matters, but the multiplicity of the macro-state
(number of micro-states that correspond to a given macro-state) is precisely the (exponential
of the) entropy. The formula (1.6) then follows from the following explicit result connecting
combinatorics and entropy.
Pk
Proposition 1.5 (Method of types) Let n1 , . . . , nk be non-negative integers with i=1 ni =
n, and denote the distribution P = (p1 , . . . , pk ), pi = nni . Then the multinomial coefficient

n1 ,...nk ≜ n1 !···nk ! satisfies
n n!

 
1 n
exp{nH(P)} ≤ ≤ exp{nH(P)} .
( 1 + n) k − 1 n1 , . . . nk

i.i.d. Pn
Proof. For the upper bound, let X1 , . . . , Xn ∼ P and let Ni = i=1 1 {Xj = i} denote the number
of occurrences of i. Then (N1 , . . . , Nk ) has a multinomial distribution:
 Y
k
′ ′ n n′
P[N1 = n1 , . . . , Nk = nk ] = ′ ′ pi i ,
n1 , . . . , nk
i=1

n′i n′1 n′k


for any nonnegative integers such that + · · · + = n. Recalling that pi = ni /n, the upper
bound follows from P[N1 = n1 , . . . , Nk = nk ] ≤ 1. In addition, since (N1 , . . . , Nk ) takes at most

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-16


i i

16

(n + 1)k−1 values, the lower bound follows if we can show that (n1 , . . . , nk ) is its mode. Indeed,
for any n′i with n′1 + · · · + n′k = n, defining ∆i = n′i − ni we have

P[N1 = n′1 , . . . , Nk = n′k ] Y Y


k k
ni !
= i ≤
p∆ i
ni−∆i p∆ i
i = 1,
P[N1 = n1 , . . . , Nk = nk ] (ni + ∆i )!
i=1 i=1
−∆
Pn
where the inequality follows from m!
(m+∆)! ≤m and the last equality follows from i=1 ∆i =
0.
Proposition 1.5 shows that the multinomial coefficient can be approximated up to a polynomial
(in n) term by exp(nH(P)). More refined estimates can be obtained; see Ex. I.2. In particular, the
binomial coefficient can be approximated using the binary entropy function as follows: Provided
that p = nk ∈ (0, 1),
n

e− 1 / 6 ≤ k
≤ 1. (1.7)
√ 1
enh(p)
2πnp(1−p)

For more on combinatorics and entropy, see Ex. I.1, I.3 and Chapter 8. For more on the intricate
relationship between statistical, mechanistic and information-theoretic description of the world
see Section 12.5* on Kolmogorov-Sinai entropy.

1.4* Submodularity

Recall that [n] denotes a set {1, . . . , n}, Sk denotes subsets of S of size k and 2S denotes all subsets
of S. A set function f : 2S → R is called submodular if for any T1 , T2 ⊂ S
f(T1 ∪ T2 ) + f(T1 ∩ T2 ) ≤ f(T1 ) + f(T2 ) (1.8)
Submodularity is similar to concavity, in the sense that “adding elements gives diminishing
returns”. Indeed consider T′ ⊂ T and b 6∈ T. Then
f(T ∪ b) − f(T) ≤ f(T′ ∪ b) − f(T′ ) .

Theorem 1.6 Let Xn be discrete RV. Then T 7→ H(XT ) is submodular.

Proof. Let A = XT1 \T2 , B = XT1 ∩T2 , C = XT2 \T1 . Then we need to show
H(A, B, C) + H(B) ≤ H(A, B) + H(B, C) .
This follows from a simple chain
H(A, B, C) + H(B) = H(A, C|B) + 2H(B) (1.9)
≤ H(A|B) + H(C|B) + 2H(B) (1.10)
= H(A, B) + H(B, C) (1.11)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-17


i i

1.5* Han’s inequality and Shearer’s Lemma 17

Note that entropy is not only submodular, but also monotone:

T1 ⊂ T2 =⇒ H(XT1 ) ≤ H(XT2 ) .

So fixing n, let us denote by Γn the set of all non-negative, monotone, submodular set-functions on
[n]. Note that via an obvious enumeration of all non-empty subsets of [n], Γn is a closed convex cone
in R2+ −1 . Similarly, let us denote by Γ∗n the set of all set-functions corresponding to distributions
n

on Xn . Let us also denote Γ̄∗n the closure of Γ∗n . It is not hard to show, cf. [472], that Γ̄∗n is also a
closed convex cone and that

Γ∗n ⊂ Γ̄∗n ⊂ Γn .

The astonishing result of [473] is that

Γ∗2 = Γ̄∗2 = Γ2 (1.12)


Γ∗3 ⊊ Γ̄∗3 = Γ3 (1.13)
Γ∗n ⊊ Γ̄∗n ⊊Γn n ≥ 4. (1.14)

This follows from the fundamental new information inequality not implied by the submodularity
of entropy (and thus called non-Shannon inequality). Namely, [473] showed that for any 4-tuple
of discrete random variables:
1 1 1
I(X3 ; X4 ) − I(X3 ; X4 |X1 ) − I(X3 ; X4 |X2 ) ≤ I(X1 ; X2 ) + I(X1 ; X3 , X4 ) + I(X2 ; X3 , X4 ) .
2 4 4
Here we have used mutual information and conditional mutual information – notions that we
will introduce later. However, the above inequality (with the help of Theorem 3.4) can be easily
rewritten as a rather cumbersome expression in terms of entropies of sets of variables X1 , X2 , X3 , X4 .
In conclusion, the work [473] demonstrated that the entropy set-function is more constrained than
a generic submodular non-negative set function even if one only considers linear constraints.

1.5* Han’s inequality and Shearer’s Lemma


Theorem
P
1.7 (Han’s inequality) Let Xn be discrete n-dimensional RV and denote H̄k (Xn ) =
1 H̄k
T∈([nk])
H(XT ) the average entropy of a k-subset of coordinates. Then is decreasing in k:
(nk) k

1 1
H̄n ≤ · · · ≤ H̄k · · · ≤ H̄1 . (1.15)
n k
Furthermore, the sequence H̄k is increasing and concave in the sense of decreasing slope:

H̄k+1 − H̄k ≤ H̄k − H̄k−1 . (1.16)

H̄m
Proof. Denote for convenience H̄0 = 0. Note that m is an average of differences:

1X
m
1
H̄m = (H̄k − H̄k−1 )
m m
k=1

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-18


i i

18

Thus, it is clear that (1.16) implies (1.15) since increasing m by one adds a smaller element to the
average. To prove (1.16) observe that from submodularity

H(X1 , . . . , Xk+1 ) + H(X1 , . . . , Xk−1 ) ≤ H(X1 , . . . , Xk ) + H(X1 , . . . , Xk−1 , Xk+1 ) .

Now average this inequality over all n! permutations of indices {1, . . . , n} to get

H̄k+1 + H̄k−1 ≤ 2H̄k

as claimed by (1.16).
Alternative proof: Notice that by “conditioning decreases entropy” we have

H(Xk+1 |X1 , . . . , Xk ) ≤ H(Xk+1 |X2 , . . . , Xk ) .

Averaging this inequality over all permutations of indices yields (1.16).

Theorem 1.8 (Shearer’s Lemma) Let Xn be discrete n-dimensional RV and let S ⊂ [n] be
a random variable independent of Xn and taking values in subsets of [n]. Then

H(XS |S) ≥ H(Xn ) · min P[i ∈ S] . (1.17)


i∈[n]

Remark 1.2 In the special case where S is uniform over all subsets of cardinality k, (1.17)
reduces to Han’s inequality 1n H(Xn ) ≤ 1k H̄k . The case of n = 3 and k = 2 can be used to give
an entropy proof of the following well-known geometry result that relates the size of 3-D object
to those of its 2-D projections: Place N points in R3 arbitrarily. Let N1 , N2 , N3 denote the number
of distinct points projected onto the xy, xz and yz-plane, respectively. Then N1 N2 N3 ≥ N2 . For
another application, see Section 8.2.

Proof. We will prove an equivalent (by taking a suitable limit) version: If C = (S1 , . . . , SM ) is a
list (possibly with repetitions) of subsets of [n] then
X
H(XSj ) ≥ H(Xn ) · min deg(i) , (1.18)
i
j

where deg(i) ≜ #{j : i ∈ Sj }. Let us call C a chain if all subsets can be rearranged so that
S1 ⊆ S2 · · · ⊆ SM . For a chain, (1.18) is trivial, since the minimum on the right-hand side is either
zero (if SM 6= [n]) or equals multiplicity of SM in C ,2 in which case we have
X
H(XSj ) ≥ H(XSM )#{j : Sj = SM } = H(Xn ) · min deg(i) .
i
j

For the case of C not a chain, consider a pair of sets S1 , S2 that are not related by inclusion and
replace them in the collection with S1 ∩ S2 , S1 ∪ S2 . Submodularity (1.8) implies that the sum on the
left-hand side of (1.18) does not increase under this replacement, values deg(i) are not changed.

2
Note that, consequently, for Xn without constant coordinates, and if C is a chain, (1.18) is only tight if C consists of only ∅
and [n] (with multiplicities). Thus if degrees deg(i) are known and non-constant, then (1.18) can be improved, cf. [288].

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-19


i i

1.5* Han’s inequality and Shearer’s Lemma 19

Notice that the total number of pairs that are not related by inclusion strictly decreases by this
replacement: if T was related by inclusion to S1 then it will also be related to at least one of S1 ∪ S2
or S1 ∩ S2 ; if T was related to both S1 , S2 then it will be related to both of the new sets as well.
Therefore, by applying this operation we must eventually arrive to a chain, for which (1.18) has
already been shown.
Remark 1.3 Han’s inequality (1.16) holds for any submodular set-function. For Han’s inequal-
ity (1.15) we also need f(∅) = 0 (this can be achieved by adding a constant to all values of f).
Shearer’s lemma holds for any submodular set-function that is also non-negative.
Example 1.5 (Non-entropy submodular function) Another submodular set-function is
S 7→ I(XS ; XSc ) .
Han’s inequality for this one reads
1 1
0= In ≤ · · · ≤ Ik · · · ≤ I1 ,
n k
1
P
where Ik = S:|S|=k I(XS ; XSc ) measures the amount of k-subset coupling in the random vector
(nk)
n
X.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-20


i i

2 Divergence

In this chapter we study divergence D(PkQ) (also known as information divergence, Kullback-
Leibler (KL) divergence, relative entropy), which is the first example of dissimilarity (information)
measure between a pair of distributions P and Q. As we will see later in Chapter 7, KL divergence is
a special case of f-divergences. Defining KL divergence and its conditional version in full general-
ity requires some measure-theoretic acrobatics (Radon-Nikodym derivatives and Markov kernels),
that we spend some time on. (We stress again that all these abstractions can be ignored if one is
willing to only work with finite or countably-infinite alphabets.)
Besides definitions we prove the “main inequality” showing that KL-divergence is non-negative.
Coupled with the chain rule for divergence, this inequality implies the data-processing inequality,
which is arguably the central pillar of information theory and this book. We conclude the chapter
by studying local behavior of divergence when P and Q are close. In the special case when P and
Q belong to a parametric family, we will see that divergence is locally quadratic with Hessian
being the Fisher information, explaining the fundamental role of the latter in classical statistics.

2.1 Divergence and Radon-Nikodym derivatives

Review: Measurability

For an exposition of measure-theoretic preliminaries, see [84, Chapters I and IV].


We emphasize two aspects. First, in this book we understand Lebesgue integration
R
fdμ as defined for measurable functions that are extended real-valued, i.e. f : X →
X R
R ∪ {±∞}. In particular, for negligible set E, i.e. μ[E] = 0, we have X 1E fdμ = 0
regardless of (possibly infinite) values of f on E, cf. [84, Chapter I, Prop. 4.13]. Second,
we almost always assume that alphabets are standard Borel spaces. Some of the nice
properties of standard Borel spaces:

• All complete separable metric spaces, endowed with Borel σ -algebras are standard
Borel. In particular, countable alphabets and Rn and R∞ (space of sequences) are
standard Borel.
Q∞
• If Xi , i = 1, . . . are standard Borel, then so is i=1 Xi .
• Singletons {x} are measurable sets.
• The diagonal {(x, x) : x ∈ X } is measurable in X × X .

20

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-21


i i

2.1 Divergence and Radon-Nikodym derivatives 21

We now need to define the second central concept of this book: the relative entropy, or Kullback-
Leibler divergence. Before giving the formal definition, we start with special cases. For that we
fix some alphabet A. The relative entropy from between distributions P and Q on X is denoted by
D(PkQ), defined as follows.

• Suppose A is a discrete (finite or countably infinite) alphabet. Then


(P P(a)
a∈A:P(a),Q(a)>0 P(a) log Q(a) , supp(P) ⊂ supp(Q)
D(PkQ) ≜ (2.1)
+∞, otherwise

• Suppose A = Rk , P and Q have densities (pdfs) p and q with respect to the Lebesgue measure.
Then
(R
{p>0,q>0}
p(x) log qp((xx)) dx Leb{p > 0, q = 0} = 0
D(PkQ) = (2.2)
+∞ otherwise

These two special cases cover a vast majority of all cases that we encounter in this book. How-
ever, mathematically it is not very satisfying to restrict to these two special cases. For example, it
is not clear how to compute D(PkQ) when P and Q are two measures on a manifold (such as a
unit sphere) embedded in Rk . Another problematic case is computing D(PkQ) between measures
on the space of sequences (stochastic processes). To address these cases we need to recall the
concepts of Radon-Nikodym derivative and absolute continuity.
Recall that for two measures P and Q, we say P is absolutely continuous w.r.t. Q (denoted by
P  Q) if Q(E) = 0 implies P(E) = 0 for all measurable E. If P  Q, then Radon-Nikodym
theorem show that there exists a function f : X → R+ such that for any measurable set E,
Z
P(E) = fdQ. [change of measure] (2.3)
E
dP
Such f is called a relative density or a Radon-Nikodym derivative of P w.r.t. Q, denoted by dQ .
dP dP
Not that dQ may not be unique. In the simple cases, dQ is just the familiar likelihood ratio:

• For discrete distributions, we can just take dQ


dP
(x) to be the ratio of pmfs.
• For continuous distributions, we can take dQ (x) to be the ratio of pdfs.
dP

We can see that the two special cases of D(PkQ) were both computing EP [log dQdP
]. This turns
out to be the most general definition that we are looking for. However, we will state it slightly
differently, following the tradition.

Definition 2.1 (Kullback-Leibler (KL) Divergence) Let P, Q be distributions on A, with


Q called the reference measure. The divergence (or relative entropy) between P and Q is
(
EQ [ dQ
dP dP
log dQ ] PQ
D(PkQ) = (2.4)
+∞ otherwise

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-22


i i

22

adopting again the convention from (1.1), namely, 0 log 0 = 0.

Below we will show (Lemma 2.5) that the expectation in (2.4) is well-defined (but possibly
infinite) and coincides with EP [log dQdP
] whenever P  Q.
To demonstrate the general definition in the case not covered by discrete/continuous special-
izations, consider the situation in which both P and Q are given as densities with respect to a
common dominating measure μ, written as dP = fP dμ and dQ = fQ dμ for some non-negative
fP , fQ . (In other words, P  μ and fP = dP dμ .) For example, taking μ = P + Q always allows one to
specify P and Q in this form. In this case, we have the following expression for divergence:
(R
dμ fP log ffQP μ({fQ = 0, fP > 0}) = 0,
D(PkQ) = fQ >0,fP >0
(2.5)
+∞ otherwise
Indeed, first note that, under the assumption of P  μ and Q  μ, we have P  Q iff
μ({fQ = 0, fP > 0}) = 0. Furthermore, if P  Q, then dQdP
= ffQP Q-a.e, in which case apply-
ing (2.3) and (1.1) reduces (2.5) to (2.4). Namely, D(PkQ) = EQ [ dQ dP dP
log dQ ] = EQ [ ffQP log ffQP ] =
R R
fP fP
dμfP log fQ 1 {fQ > 0} = dμfP log fQ 1 {fQ > 0, fP > 0}.
Note that D(PkQ) was defined to be +∞ if P 6 Q. However, it can also be +∞ even when
P  Q. For example, D(CauchykGaussian) = ∞. However, it does not mean that there are
somehow two different ways in which D can be infinite. Indeed, what can be shown is that in
both cases there exists a sequence of (finer and finer) finite partitions Π of the space A such that
evaluating KL divergence between the induced discrete distributions P|Π and Q|Π grows without
a bound. This will be subject of Theorem 4.5 below.
Our next observation is that, generally, D(PkQ) 6= D(QkP) and, therefore, divergence is not a
distance. We will see later, that this is natural in many cases; for example it reflects the inherent
asymmetry of hypothesis testing (see Part III and, in particular, Section 14.5). Consider the exam-
ple of coin tossing where under P the coin is fair and under Q the coin always lands on the head.
Upon observing HHHHHHH, one tends to believe it is Q but can never be absolutely sure; upon
observing HHT, one knows for sure it is P. Indeed, D(PkQ) = ∞, D(QkP) = 1 bit.
Having made these remarks we proceed to some examples. First, we show that D is unsurpris-
ingly a generalization of entropy.

Theorem 2.2 (Entropy vs divergence) If distribution P is supported on a finite set A, then


H(P) = log |A| − D(PkUA ) ,

where UA is the uniform distribution on A.

Proof. D(PkUA ) = EP [log 1P/|A|


(X)
] = log |A| − H(P).

Example 2.1 (Binary divergence) Consider P = Ber(p) and Q = Ber(q) on A = {0, 1}.
Then
p p
D(PkQ) = d(pkq) ≜ p log + p log . (2.6)
q q

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-23


i i

2.1 Divergence and Radon-Nikodym derivatives 23

Here is how d(pkq) depends on p and q:

1
log q
d(p∥q) d(p∥q)

1
log q̄

q p
0 p 1 0 q 1

The following quadratic lower bound is easily checked:

d(pkq) ≥ 2(p − q)2 log e .

In fact, this is a special case of the famous Pinsker’s inequality (Theorem 7.10).
Example 2.2 (Real Gaussian) For two Gaussians on A = R,
log e (m1 − m0 )2 1 h σ02  σ12  i
D(N (m1 , σ12 )kN (m0 , σ02 )) = + log + − 1 log e . (2.7)
2 σ02 2 σ12 σ02
Here, the first and second term compares the means and the variances, respectively.
Similarly, in the vector case of A = Rk and assuming det Σ0 6= 0, we have

D(N (m1 , Σ1 )kN (m0 , Σ0 ))


log e 1 
= ( m1 − m0 ) ⊤ Σ −
0
1
( m 1 − m 0 ) + log det Σ 0 − log det Σ 1 + tr(Σ −1
0 Σ 1 − I ) log e . (2.8)
2 2
See Exercise I.8 for the derivation.
Example 2.3 (Complex Gaussian) The complex Gaussian distribution Nc (m, σ 2 ) with
1 −|z−m|2 /σ2
mean m ∈ C and variance σ 2 has a density e for z ∈ C. In other words, the real and
π σ2
imaginary parts are independent real Gaussians:
  
  σ 2 /2 0
Nc (m, σ ) = N Re(m) Im(m) ,
2
0 σ 2 /2

Then
log e |m1 − m0 |2 σ02  σ12 
D(Nc (m1 , σ12 )kNc (m0 , σ02 )) = + log + − 1 log e. (2.9)
2 σ02 σ12 σ02

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-24


i i

24

which follows from (2.8). More generally, for complex Gaussian vectors on Ck , assuming det Σ0 6=
0,

D(Nc (m1 , Σ1 )kNc (m0 , Σ0 )) =(m1 − m0 )H Σ−


0 (m1 − m0 ) log e
1

+ log det Σ0 − log det Σ1 + tr(Σ−


0 Σ1 − I) log e
1

2.2 Divergence: main inequality and equivalent expressions


Many inequalities in information can be attributed to the following fundamental result, namely,
the nonnegativity of divergence.

Theorem 2.3 (Information inequality)


D(PkQ) ≥ 0,

with equality iff P = Q.

Proof. In view of the definition (2.4), it suffices to consider P  Q. Let φ(x) ≜ x log x, which
is strictly convex on R+ . Applying Jensen’s Inequality:
h  dP i  h dP i
D(PkQ) = EQ φ ≥ φ EQ = φ(1) = 0,
dQ dQ
dP
with equality iff dQ = 1 Q-a.e., namely, P = Q.

Here is a typical application of the previous result (variations of it will be applied numerous
times in this book). This result is widely used in machine learning as it shows that minimizing
average cross-entropy loss ℓ(Q, x) ≜ log Q(1x) recovers the true distribution (Exercise III.11).

Corollary 2.4 Let X be a discrete random variable with H(X) < ∞. Then
 
1
min E log = H(P) ,
Q Q( X )
and unique minimizer is Q = PX .

Proof. It is sufficient to prove that for any Q 6= PX we have


 
1
E log > H(X) .
Q(X)
If the LHS is infinite this is clear, so let us assume it is finite and hence Q(x) > 0 whenever
PX (x) > 0. Then subtracting H(X) from both sides and using linearity of expectation we have
   
1 PX (X)
E log − H(X) = E log = D(PX kQ) > 0
Q( X ) Q( X )
where the inequality is via Theorem 2.3.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-25


i i

2.2 Divergence: main inequality and equivalent expressions 25

Another implication of the proof of Theorem 2.3 is in bringing forward the reason for defining
D(PkQ) = EQ [ dQ dP dP
log dQ ] as opposed to D(PkQ) = EP [log dQ
dP
]. However, we still need to show
that the two definitions are equivalent, which is what we do next. In addition, we will also unify
the two cases (P  Q vs P 6 Q) in Definition 2.1.

Lemma 2.5 Let P, Q, R  μ and fP , fQ , fR denote their densities relative to μ. Define a bivariate
function Log ab : R+ × R+ → R ∪ {±∞} by


 −∞ a = 0, b > 0


a  +∞ a > 0, b = 0
Log = (2.10)
b  0 a = 0, b = 0



log ab a > 0, b > 0.
Then the following results hold:

• First, the following expectation exists and equals


 
fR
EP Log = D(PkQ) − D(PkR) , (2.11)
fQ
provided at least one of the hdivergences
i is finite.
• Second, the expectation EP Log ffQP is well-defined (but possibly infinite) and, furthermore,
 
fP
D(PkQ) = EP Log . (2.12)
fQ
In particular, when P  Q we have
 
dP
D(PkQ) = EP log . (2.13)
dQ

Remark 2.1 Note that ignoring the issue of dividing by or taking a log of 0, the proof of (2.12)
dR
is just the simple identity log dQ dRdP
= log dQdP = log dQdP
− log dP
dR . What permits us to handle zeros
is the Log function, which satisfies several natural properties of the log: for every a, b ∈ R+
a b
Log = −Log
b a
and for every c > 0 we have
a a c ac
Log = Log + Log = Log − log(c)
b c b b
except for the case a = b = 0.
Proof. First, suppose D(PkQ) = ∞ and D(PkR) < ∞. Then P[fR (Y) = 0] = 0, and hence in
computation of the expectation in (2.11) only the second part of convention (2.10) can possibly
apply. Since also fP > 0 P-almost surely, we have
fR fR fP
Log = Log + Log , (2.14)
fQ fP fQ

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-26


i i

26

with both logarithms evaluated according to (2.10). Taking expectation over P we see that the
first term, equal to −D(PkR), is finite, whereas the second term is infinite. Thus, the expectation
in (2.11) is well-defined and equal to +∞, as is the LHS of (2.11).
Now consider D(PkQ) < ∞. This implies that P[fQ (Y) = 0] = 0 and this time in (2.11) only
the first part of convention (2.10) can apply. Thus, again we have identity (2.14). Since the P-
expectation of the second term is finite, and of the first term non-negative, we again conclude that
expectation in (2.11) is well-defined, equals the LHS of (2.11) (and both sides are possibly equal
to −∞).
For the second part, we first show that
 
fP log e
EP min(Log , 0) ≥ − . (2.15)
fQ e
Let g(x) = min(x log x, 0). It is clear − loge e ≤ g(x) ≤ 0 for all x. Since fP (Y) > 0 for P-almost
all Y, in convention (2.10) only the 10 case is possible, which is excluded by the min(·, 0) from the
expectation in (2.15). Thus, the LHS in (2.15) equals
Z Z
f P ( y) f P ( y) f P ( y)
fP (y) log dμ = f Q ( y) log dμ
{fP >fQ >0} f Q ( y ) {fP >fQ >0} f Q ( y ) f Q ( y)
Z  
f P ( y)
= f Q ( y) g dμ
{fQ >0} f Q ( y)
log e
≥− .
e
h i h i
Since the negative part of EP Log ffQP is bounded, the expectation EP Log ffQP is well-defined. If
P[fQ = 0] > 0 then it is clearly +∞, as is D(PkQ) (since P 6 Q). Otherwise, let E = {fP >
0, fQ > 0}. Then P[E] = 1 and on E we have fP = fQ · ffQP . Thus, we obtain
  Z Z  
fP fP fP fP
EP Log = dμ fP log = dμfQ φ( ) = EQ 1E φ( ) .
fQ E fQ E fQ fQ
From here, we notice that Q[fQ > 0] = 1 and on {fP = 0, fQ > 0} we have φ( ffQP ) = 0. Thus, the
term 1E can be dropped and we obtain the desired (2.12).
The final statement of the Lemma follows from taking μ = Q and noticing that P-almost surely
we have
dP
dQ dP
Log = log .
1 dQ

2.3 Differential entropy


The definition of D(PkQ) extends verbatim to measures P and Q (not necessarily probability
measures), in which case D(PkQ) can be negative. A sufficient condition for D(PkQ) ≥ 0 is that
R R
P is a probability measure and Q is a sub-probability measure, i.e., dQ ≤ 1 = dP. The notion
of differential entropy is simply the divergence with respect to the Lebesgue measure.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-27


i i

2.3 Differential entropy 27

Definition 2.6 The differential entropy of a random vector X is


h(X) = h(PX ) ≜ −D(PX kLeb). (2.16)

In particular, if X has probability density function (pdf) p, then h(X) = E log p(1X) ; otherwise
h(X) = −∞. The conditional differential entropy is h(X|Y) ≜ E log pX|Y (1X|Y) where pX|Y is a
conditional pdf.

Example 2.4 (Gaussian) For X ∼ N(μ, σ 2 ),


1
h(X) = log(2πeσ 2 ) (2.17)
2
More generally, for X ∼ N( μ, Σ) in Rd ,
1
h(X) = log((2πe)d det Σ) (2.18)
2
Warning: Even for continuous random variable X, h(X) can be positive, negative, take values
of ±∞ or even undefined.1 There are many crucial differences between the Shannon entropy and
the differential entropy. For example, from Theorem 1.4 we know that deterministic processing
cannot increase the Shannon entropy, i.e. H(f(X)) ≤ H(X) for any discrete X, which is intuitively
clear. However, this fails completely for differential entropy (e.g. consider scaling). Furthermore,
for sums of independent random variables, for integer-valued X and Y, H(X + Y) is finite whenever
H(X) and H(Y) are, because H(X + Y) ≤ H(X, Y) = H(X) + H(Y). This again fails for differential
entropy. In fact, there exists real-valued X with finite h(X) such that h(X + Y) = ∞ for any
independent Y such that h(Y) > −∞; there also exist X and Y with finite differential entropy such
that h(X + Y) does not exist (cf. [65, Section V]).
Nevertheless, differential entropy shares many functional properties with the usual Shannon
entropy. For a short application to Euclidean geometry see Section 8.4.

Theorem 2.7 (Properties of differential entropy) Assume that all differential entropies
appearing below exist and are finite (in particular all RVs have pdfs and conditional pdfs).

(a) (Uniform distribution maximizes differential entropy) If P[Xn ∈ S] = 1 then h(Xn ) ≤


log Leb(S),with equality iff Xn is uniform on S.
(b) (Scaling and shifting) h(Xn + x) = h(Xn ), h(αXn ) = h(Xn ) + k log |α| and for an invertible
matrix A, h(AXn ) = h(Xn ) + log | det A|.
(c) (Conditioning reduces differential entropy) h(X|Y) ≤ h(X). (Here Y is arbitrary.)
(d) (Chain rule) Let Xn has a joint probability density function. Then
X
n
h( X n ) = h(Xk |Xk−1 ) .
k=1

1 n c −(−1)n n
For an example, consider a piecewise-constant pdf taking value e(−1) n on the n-th interval of width ∆n = n2
e .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-28


i i

28

(e) (Submodularity) The set-function T 7→ h(XT ) is submodular.


P
(f) (Han’s inequality) The function k 7→ k 1n [n] h(XT ) is decreasing in k.
(k) T∈( k )

Proof. Parts (a), (c), and (d) follow from the similar argument in the proof (b), (d), and (g) of
Theorem 1.4. Part (b) is by a change of variable in the density. Finally, (e) and (f) are analogous
to Theorems 1.6 and 1.7.
Interestingly, the first property is robust to small additive perturbations, cf. Ex. I.6. Regard-
ing maximizing entropy under quadratic constraints, we have the following characterization of
Gaussians.

Theorem 2.8 Let Cov(X) = E[XX⊤ ] − E[X]E[X]⊤ denote the covariance matrix of a random
vector X. For any d × d positive definite matrix Σ,
1
max h(X) = h(N(0, Σ)) = log((2πe)d det Σ) (2.19)
PX :Cov(X)⪯Σ 2
Furthermore, for any a > 0,
  a  d 2πea
max h(X) = h N 0, Id = log . (2.20)
PX :E[∥X∥ ]≤a
2 d 2 d

Proof. To show (2.19), without loss of generality, assume that E[X] = 0. By comparing to
Gaussian, we have

0 ≤ D(PX kN(0, Σ))


1 log e
= − h(X) + log((2π )d det(Σ)) + E[X⊤ Σ−1 X]
2 2
≤ − h(X) + h(N(0, Σ)),

where in the last step we apply E[X⊤ Σ−1 X] = Tr(E[XX⊤ ]Σ−1 ) ≤ Tr(I) due to the constraint
Cov(X)  Σ and the formula (2.18). The inequality (2.20) follows analogously by choosing the
reference measure to be N(0, ad Id ).

Corollary 2.9 The map Σ 7→ log det Σ is concave on the space of real positive definite n × n
matrices.

Proof. Let Σ1 , Σ2 be positive definite n × n matrices. Let Y ∼ Ber(1/2) and given Y = 0 we


set X ∼ N (0, Σ1 ) and otherwise X ∼ N (0, Σ2 ). Let Cov(X) = Σ = 12 Σ1 + 12 Σ2 . Then we have
h(X|Y) ≤ h(X) ≤ 12 log((2πe)n det Σ). For h(X|Y) we apply (2.18) and after simplification obtain
1 1 1 1 
log det Σ1 + log det Σ2 ≤ log det Σ1 + Σ2 ,
2 2 2 2
which is exactly the claimed concavity.
Finally, let us mention a connection between the differential entropy and the Shannon entropy.
Let X be a continuous random vector in Rd . Denote its discretized version by Xm = m1 bmXc

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-29


i i

2.4 Markov kernels 29

for m ∈ N, where b·c is taken componentwise. Rényi showed that [357, Theorem 1] provided
H(bXc) < ∞ and h(X) is defined, we have

H(Xm ) = d log m + h(X) + o(1), m → ∞. (2.21)

To interpret this result, consider, for simplicity, d = 1, m = 2k and assume that X takes values
in the unit interval, in which case X2k is the k-bit uniform quantization of X. Then (2.21) suggests
that for large k, the quantized bits behave as independent fair coin flips. The underlying reason is
that for “nice” density functions, the restriction to small intervals is approximately uniform. For
more on quantization see Section 24.1 (notably Section 24.1.5) in Chapter 24.

2.4 Markov kernels


The main objects in this book are random variables and probability distributions. The main opera-
tion for creating new random variables, as well as for defining relations between random variables,
is that of a Markov kernel (also known as a transition probability kernel).

Definition 2.10 A Markov kernel K : X → Y is a bivariate function K( · | · ), whose first


argument is a measurable subset of Y and the second is an element of X , such that:

1 For any x ∈ X : K( · |x) is a probability measure on Y


2 For any measurable set A: x 7→ K(A|x) is a measurable function on X .

The kernel K can be viewed as a random transformation acting from X to Y , which draws
Y from a distribution depending on the realization of X, including deterministic transformations
PY|X
as special cases. For this reason, we write PY|X : X → Y and also X −−→ Y. In information-
theoretic context, we also refer to PY|X as a channel, where X and Y are the channel input and
output respectively. There are two ways of obtaining Markov kernels. The first way is defining
them explicitly. Here are some examples of that:

1 Deterministic system: Y = f(X). This corresponds to setting PY|X=x = δf(x) .


2 Decoupled system: Y ⊥ ⊥ X. Here we set PY|X=x = PY .
3 Additive noise (convolution): Y = X + Z with Z ⊥
⊥ X. This time we choose PY|X=x (·) = PZ (·− x).
The term convolution corresponds to the fact that the resulting marginal distribution PY = PX ∗
PZ is a convolution of measures.
4 Some of the most useful channels that we will use throughout the book are going to be defined
shorting in Examples 3.3, 3.4, 3.6.

The second way is to disintegrate a joint distribution PX,Y by conditioning on X, which is


denoted simply by PY|X . Specifically, we have the following result [84, Chapter IV, Theorem
2.18]:

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-30


i i

30

Theorem 2.11 (Disintegration) Suppose PX,Y is a distribution on X × Y with Y being


standard Borel. Then there exists a Markov kernel K : X → Y so that for any measurable E ⊂
X × Y and any integrable f we have
Z
PX,Y [E] = PX (dx)K(Ex |x) , Ex ≜ {y : (x, y) ∈ E} (2.22)
X
Z Z Z
f(x, y)PX,Y (dx dy) = PX (dx) f(x, y)K(dy|x) .
X ×Y X Y

Note that above we have implicitly used the facts that the slices Ex of E are measurable subsets
of Y for each x and that the function x 7→ K(Ex |x) is measurable (cf. [84, Chapter I, Prop. 6.8 and
6.9], respectively). We also notice that one joint distribution PX,Y can have many different versions
of PY|X differing on a measure-zero set of x’s.
The operation of combining an input distribution on X and a kernel K : X → Y as we did
in (2.22) is going to appear extensively in this book. We will usually denote it as multiplication:
Given PX and kernel PY|X we can multiply them to obtain PX,Y ≜ PX PY|X , which in the discrete
case simply means that the joint PMF factorizes as product of marginal and conditional PMFs:
PX,Y (x, y) = PY|X (y|x)PX (x) ,
and more generally is given by (2.22) with K = PY|X .
Another useful operation will be that of composition (marginalization), which we denote by
PY|X ◦ PX ≜ PY . In words, this means forming a distribution PX,Y = PX PY|X and then computing
the marginal PY , or, explicitly,
Z
PY [E] = PX (dx)PY|X (E|x) .
X

To denote this (linear) relation between the input PX and the output PY we sometimes also write
PY|X
PX −−→ PY .
We must remark that technical assumptions such as restricting to standard Borel spaces are
really necessary for constructing any sensible theory of disintegration/conditioning and multipli-
cation. To emphasize this point we consider a (cautionary!) example involving a pathological
measurable space Y .
Example 2.5 (X ⊥ ⊥ Y but PY|X=x 6 PY for all x) Consider X a unit interval with
Borel σ -algebra and Y a unit interval with the σ -algebra σY consisting of all sets which are either
countable or have a countable complement. Clearly σY is a sub-σ -algebra of Borel one. We define
the following kernel K : X → Y :
K(A|x) ≜ 1{x ∈ A} .
This is simply saying that Y is produced from X by setting Y = X. It should be clear that for
every A ∈ σY the map x 7→ K(A|x) is measurable, and thus K is a valid Markov kernel. Letting
X ∼ Unif(0, 1) and using formula (2.22) we can define a joint distribution PX,Y . But what is the
conditional distribution PY|X ? On one hand, clearly we can set PY|X (A|x) = K(A|x), since this
was how PX,Y was constructed. On the other hand, we will show that PX,Y = PX PY , i.e. X ⊥ ⊥ Y

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-31


i i

2.5 Conditional divergence, chain rule, data-processing inequality 31

and X = Y at the same time! Indeed, consider any set E = B × C ⊂ X × Y . We always have
PX,Y [B × C] = PX [B ∩ C]. Thus if C is countable then PX,Y [E] = 0 and so is PX PY [E] = 0. On the
other hand, if Cc is countable then PX [C] = PY [C] = 1 and PX,Y [E] = PX PY [E] again. Thus, both
PY|X = K and PY|X = PY are valid conditional distributions. But notice that since PY [{x}] = 0, we
have K(·|x) 6 PY for every x ∈ X . In particular, the value of D(PY|X=x kPY ) can either be 0 or
+∞ for every x depending on the choice of the version of PY|X . It is, thus, advisable to stay within
the realm of standard Borel spaces.
We will also need to use the following result extensively. We remind that a σ -algebra is called
separable if it is generated by a countable collection of sets. Any standard Borel space’s σ -algebra
is separable. The following is another useful result about Markov kernels, cf. [84, Chapter 5,
Theorem 4.44]:

Theorem 2.12 (Doob’s version of Radon-Nikodym Theorem) Assume that Y is a


measurable space with a separable σ -algebra. Let PY|X : X → Y and RY|X : X → Y be two
Markov kernels. Suppose that for every x we have PY|X=x  RY|X=x . Then there exists a measurable
function (x, y) 7→ f(y|x) ≥ 0 such that for every x ∈ X and every measurable subset E of Y ,
Z
PY|X (E|x) = f(y|x)RY|X (dy|x) .
E

dPY|X=x
The meaning of this theorem is that the Radon-Nikodym derivative dRY|X=x can be made jointly
measurable with respect to (x, y).

2.5 Conditional divergence, chain rule, data-processing inequality


We aim to define the conditional divergence between two Markov kernels. Throughout this chapter
we fix a pair of Markov kernels PY|X : X → Y and QY|X : X → Y , and also a probability measure
PX on X . First, let us consider the case of discrete X . We define the conditional divergence as
X
D(PY|X kQY|X |PX ) ≜ PX (x)D(PY|X=x kQY|X=x ) .
x∈X

In order to extend the above definition to more general X , we need to first understand whether
the map x 7→ D(PY|X=x kQY|X=x ) is even measurable.

Lemma 2.13 Suppose that Y is standard Borel. The set A0 ≜ {x : PY|X=x  QY|X=x } and the
function
x 7→ D(PY|X=x kQY|X=x )
are both measurable.
dPY|X=x dQY|X=x
Proof. Take RY|X = 1
2 PY|X + 12 QY|X and define fP (y|x) ≜ dRY|X=x (y) and fQ (y|x) ≜ dRY|X=x (y).
By Theorem 2.12 these can be chosen to be jointly measurable on X × Y . Let us define B0 ≜

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-32


i i

32

{(x, y) : fP (y|x) > 0, fQ (y|x) = 0} and its slice Bx0 = {y : (x, y) ∈ B0 }. Then note that PY|X=x 
QY|X=x iff RY|X=x [Bx0 ] = 0. In other words, x ∈ A0 iff RY|X=x [Bx0 ] = 0. The measurability of B0
implies that of x 7→ RY|X=x [Bx0 ] and thus that of A0 . Finally, from (2.12) we get that
 
f P ( Y | x)
D(PY|X=x kQY|X=x ) = EY∼PY|X=x Log , (2.23)
f Q ( Y | x)
which is measurable, e.g. [84, Chapter 1, Prop. 6.9].

With this preparation we can give the following definition.

Definition 2.14 (Conditional divergence) Assuming Y is standard Borel, define


D(PY|X kQY|X |PX ) ≜ Ex∼PX [D(PY|X=x kQY|X=x )]

We observe that as usual in Lebesgue integration it is possible that a conditional divergence is


finite even though D(PY|X=x kQY|X=x ) = ∞ for some (PX -negligible set of) x.

Theorem 2.15 (Chain rule) For any pair of measures PX,Y and QX,Y we have
D(PX,Y kQX,Y ) = D(PY|X kQY|X |PX ) + D(PX kQX ) , (2.24)

regardless of the versions of conditional distributions PY|X and QY|X one chooses.

Proof. First, let us consider the simplest case: X , Y are discrete and QX,Y (x, y) > 0 for all x, y.
Letting (X, Y) ∼ PX,Y we get
   
PX,Y (X, Y) PX (X)PY|X (Y|X)
D(PX,Y kQX,Y ) = E log = E log
QX,Y (X, Y) QX (X)QY|X (Y|X)
   
PY|X (Y|X) PX (X)
= E log + E log
QY|X (Y|X) QX (X)
completing the proof.
Next, let us address the general case. If PX 6 QX then PX,Y 6 QX,Y and both sides of (2.24) are
infinity. Thus, we assume PX  QX and set λP (x) ≜ dQ dPX
X
(x). Define fP (y|x), fQ (y|x) and RY|X as in
the proof of Lemma 2.13. Then we have PX,Y , QX,Y  RX,Y ≜ QX RY|X , and for any measurable E
Z Z
PX,Y [E] = λP (x)fP (y|x)RX,Y (dx dy) , QX,Y [E] = fQ (y|x)RX,Y (dx dy) .
E E

Then from (2.12) we have


 
fP (Y|X)λP (X)
D(PX,Y kQX,Y ) = EPX,Y Log . (2.25)
fQ (Y|X)
Note the following property of Log: For any c > 0
ac a
Log = log(c) + Log
b b

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-33


i i

2.5 Conditional divergence, chain rule, data-processing inequality 33

unless a = b = 0. Now, since PX,Y [fP (Y|X) > 0, λP (X) > 0] = 1, we conclude that PX,Y -almost
surely
fP (Y|X)λP (X) fP ( Y | X )
Log = log λP (X) + Log .
fQ (Y|X) fQ (Y|X)
We aim to take the expectation of both sides over PX,Y and invoke linearity of expectation. To
ensure that the issue of ∞ − ∞ does not arise, we notice that the negative part of each term has
finite expectation by (2.15). Overall, continuing (2.25) and invoking linearity we obtain
 
fP ( Y | X )
D(PX,Y kQX,Y ) = EPX,Y [log λP (X)] + EPX,Y Log ,
fQ (Y|X)
where the first term equals D(PX kQX ) by (2.12) and the second D(PY|X kQY|X |PX ) by (2.23) and
the definition of conditional divergence.

The chain rule has a number of useful corollaries, which we summarize below.

Theorem 2.16 (Properties of divergence) Assume that X and Y are standard Borel.
Then

(a) Conditional divergence can be expressed unconditionally:

D(PY|X kQY|X |PX ) = D(PX PY|X kPX QY|X ) .

(b) (Monotonicity) D(PX,Y kQX,Y ) ≥ D(PY kQY ).


(c) (Full chain rule)
X
n
D(PX1 ···Xn kQX1 ···Xn ) = D(PXi |Xi−1 kQXi |Xi−1 |PXi−1 ). (2.26)
i=1
Qn
In the special case of QXn = i=1 QX i ,
X
n
D(PX1 ···Xn kQX1 · · · QXn ) = D(PX1 ···Xn kPX1 · · · PXn ) + D(PXi kQXi )
i=1
X
n
≥ D(PXi kQXi ), (2.27)
i=1
Qn
where the inequality holds with equality if and only if PXn = j=1 PXj .
(d) (Tensorization)
 
Yn Y n X n
D  PXj 
QX j = D(PXj kQXj ).
j=1 j=1 j=1

(e) (Conditioning increases divergence) Given PY|X , QY|X and PX , let PY = PY|X ◦ PX and QY =
QY|X ◦ PX , as represented by the diagram:

P_X --P_{Y|X}--> P_Y,        P_X --Q_{Y|X}--> Q_Y

Then D(PY kQY ) ≤ D(PY|X kQY|X |PX ), with equality iff D(PX|Y kQX|Y |PY ) = 0.

We remark that as before without the standard Borel assumption even the first property can
fail. For example, Example 2.5 shows an example where PX PY|X = PX QY|X but PY|X 6= QY|X and
D(PY|X kQY|X |PX ) = ∞.

Proof. (a) This follows from the chain rule (2.24) since PX = QX .
(b) Apply (2.24), with X and Y interchanged and use the fact that conditional divergence is non-
negative.
Qn Qn
(c) By telescoping PXn = i=1 PXi |Xi−1 and QXn = i=1 QXi |Xi−1 .
(d) Apply (c).
(e) The inequality follows from (a) and (b). To get conditions for equality, notice that by the chain
rule for D:

$$D(P_{X,Y}\|Q_{X,Y}) = D(P_{Y|X}\|Q_{Y|X}|P_X) + \underbrace{D(P_X\|P_X)}_{=0} = D(P_{X|Y}\|Q_{X|Y}|P_Y) + D(P_Y\|Q_Y).$$

Some remarks are in order:

• There is a nice interpretation of the full chain rule as a decomposition of the “distance” from
PXn to QXn as a sum of “distances” between intermediate distributions, cf. Ex. I.43.
• In general, D(PX,Y kQX,Y ) and D(PX kQX ) + D(PY kQY ) are incomparable. For example, if X = Y
under P and Q, then D(PX,Y kQX,Y ) = D(PX kQX ) < 2D(PX kQX ). Conversely, if PX = QX and
PY = QY but PX,Y 6= QX,Y we have D(PX,Y kQX,Y ) > 0 = D(PX kQX ) + D(PY kQY ).
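The chain rule (2.24) and its corollaries are easy to sanity-check numerically on a small discrete alphabet. The following sketch is not from the text (the 2x2 joint pmfs are arbitrary illustrative choices); it verifies (2.24) directly, in nats.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p||q) in nats for discrete distributions of the same shape."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Arbitrary joint pmfs P_{X,Y} and Q_{X,Y} on a 2x2 alphabet.
P = np.array([[0.30, 0.20], [0.10, 0.40]])
Q = np.array([[0.25, 0.25], [0.25, 0.25]])

PX, QX = P.sum(1), Q.sum(1)            # marginals of X
PY_X = P / PX[:, None]                 # conditional kernels P_{Y|X}
QY_X = Q / QX[:, None]                 # conditional kernels Q_{Y|X}

lhs = kl(P, Q)                                               # D(P_{X,Y} || Q_{X,Y})
cond = sum(PX[x] * kl(PY_X[x], QY_X[x]) for x in range(2))   # D(P_{Y|X} || Q_{Y|X} | P_X)
rhs = cond + kl(PX, QX)                                      # right-hand side of (2.24)
print(lhs, rhs)   # the two numbers agree up to floating-point error
```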

The following result, known as the Data-Processing Inequality (DPI), is an important principle
in all of information theory. In many ways, it underpins the whole concept of information. The
intuitive interpretation is that it is easier to distinguish two distributions using clean (resp. full)
data as opposed to noisy (resp. partial) data. DPI is a recurring theme in this book, and later we
will study DPI for other information measures such as mutual information and f-divergences.

Theorem 2.17 (DPI for KL divergence) Let PY = PY|X ◦ PX and QY = PY|X ◦ QX , as


represented by the diagram:

P_X --P_{Y|X}--> P_Y,        Q_X --P_{Y|X}--> Q_Y

Then

D(PY kQY ) ≤ D(PX kQX ), (2.28)

with equality if and only if D(PX|Y kQX|Y |PY ) = 0.

Proof. This follows from either the chain rule or monotonicity:

$$D(P_{X,Y}\|Q_{X,Y}) = \underbrace{D(P_{Y|X}\|Q_{Y|X}|P_X)}_{=0} + D(P_X\|Q_X) = D(P_{X|Y}\|Q_{X|Y}|P_Y) + D(P_Y\|Q_Y)$$

Corollary 2.18 (Divergence under deterministic transformation) Let Y = f(X).


Then D(PY kQY ) ≤ D(PX kQX ), with equality if f is one-to-one.

Note that D(Pf(X) kQf(X) ) = D(PX kQX ) does not imply that f is one-to-one; as an example,
consider PX = Gaussian, QX = Laplace, Y = |X|. In fact, the equality happens precisely when
f(X) is a sufficient statistic for testing P against Q; in other words, there is no loss of information
in summarizing X into f(X) as far as testing these two hypotheses is concerned. See Example 3.9
for details.
A particular useful application of Corollary 2.18 is when we take f to be an indicator function:

Corollary 2.19 (Large deviations estimate) For any subset E ⊂ X we have

d(PX [E]kQX [E]) ≤ D(PX kQX ),

where d(·k·) is the binary divergence function in (2.6).

Proof. Consider Y = 1 {X ∈ E}.

This method will be highly useful in large deviations theory which studies rare events (Sec-
tion 14.5 and Section 15.2), where we apply Corollary 2.19 to an event E which is highly likely
under P but highly unlikely under Q.
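As a quick illustration of Corollary 2.19, the following sketch (not from the text; the distributions and the event $E$ are arbitrary choices) checks numerically that $d(P[E]\|Q[E])$ never exceeds $D(P\|Q)$.

```python
import numpy as np

def kl(p, q):
    """KL divergence in nats between discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def d_binary(p, q):
    """Binary divergence d(p||q) in nats."""
    return kl([p, 1 - p], [q, 1 - q])

P = np.array([0.50, 0.20, 0.15, 0.10, 0.05])
Q = np.array([0.20, 0.20, 0.20, 0.20, 0.20])
E = np.array([True, True, False, False, False])   # the event {X in {0, 1}}

print(d_binary(P[E].sum(), Q[E].sum()), "<=", kl(P, Q))
```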


2.6* Local behavior of divergence and Fisher information


As we shall see in Section 4.5, KL divergence is in general not continuous. Nevertheless, it
is reasonable to expect that in non-pathological cases the functional D(PkQ) vanishes when P
approaches Q “smoothly”. Due to the smoothness and strict convexity of x log x at x = 1, it is
then also natural to expect that this functional decays “quadratically”. In this section we exam-
ine this question first along the linear interpolation between P and Q, then, more generally, in
smooth parametrized families of distributions. These properties will be extended to more general
divergences later in Sections 7.10 and 7.11.

2.6.1* Local behavior of divergence for mixtures


Let 0 ≤ λ ≤ 1 and consider D(λP + λ̄QkQ), which vanishes as λ → 0. Next, we show that this
decay is always sublinear.

Proposition 2.20 When $D(P\|Q) < \infty$, the one-sided derivative at $\lambda = 0$ vanishes:
$$\frac{d}{d\lambda}\Big|_{\lambda=0} D(\lambda P + \bar\lambda Q\|Q) = 0\,.$$
If we exchange the arguments, the criterion is even simpler:
$$\frac{d}{d\lambda}\Big|_{\lambda=0} D(Q\|\lambda P + \bar\lambda Q) = 0 \iff P \ll Q\,. \qquad (2.29)$$

Proof.
$$\frac{1}{\lambda} D(\lambda P + \bar\lambda Q\|Q) = \mathbb{E}_Q\left[\frac{1}{\lambda}(\lambda f + \bar\lambda)\log(\lambda f + \bar\lambda)\right]$$
where $f = \frac{dP}{dQ}$. As $\lambda\to 0$ the function under expectation decreases to $(f-1)\log e$ monotonically.
Indeed, the function
$$\lambda \mapsto g(\lambda) \triangleq (\lambda f + \bar\lambda)\log(\lambda f + \bar\lambda)$$
is convex and equals zero at $\lambda = 0$. Thus $\frac{g(\lambda)}{\lambda}$ is increasing in $\lambda$. Moreover, by the convexity of
$x\mapsto x\log x$:
$$\frac{1}{\lambda}(\lambda f + \bar\lambda)\log(\lambda f + \bar\lambda) \le \frac{1}{\lambda}\left(\lambda f\log f + \bar\lambda\cdot 1\cdot\log 1\right) = f\log f$$
and by assumption $f\log f$ is $Q$-integrable. Thus the Monotone Convergence Theorem applies.
To prove (2.29) first notice that if $P \not\ll Q$ then there is a set $E$ with $p = P[E] > 0 = Q[E]$.
Applying data-processing for divergence to $X\mapsto 1_E(X)$, we get
$$D(Q\|\lambda P + \bar\lambda Q) \ge d(0\|\lambda p) = \log\frac{1}{1-\lambda p}$$


and the derivative is non-zero. If $P \ll Q$, then let $f = \frac{dP}{dQ}$ and notice the simple inequalities
$$\log\bar\lambda \le \log(\bar\lambda + \lambda f) \le \lambda(f-1)\log e\,.$$
Dividing by $\lambda$ and assuming $\lambda < \frac{1}{2}$ we get for some absolute constants $c_1, c_2$:
$$\left|\frac{1}{\lambda}\log(\bar\lambda + \lambda f)\right| \le c_1 f + c_2\,.$$
Thus, by the dominated convergence theorem we get
$$\frac{1}{\lambda} D(Q\|\lambda P + \bar\lambda Q) = -\int dQ\,\frac{1}{\lambda}\log(\bar\lambda + \lambda f) \xrightarrow{\ \lambda\to 0\ } \log e\int dQ\,(1-f) = 0\,.$$

Remark 2.2 More generally, under suitable technical conditions,
$$\frac{d}{d\lambda}\Big|_{\lambda=0} D(\lambda P + \bar\lambda Q\|R) = \mathbb{E}_P\left[\log\frac{dQ}{dR}\right] - D(Q\|R)$$
and
$$\frac{d}{d\lambda}\Big|_{\lambda=0} D(\bar\lambda P_1 + \lambda Q_1\|\bar\lambda P_0 + \lambda Q_0) = \mathbb{E}_{Q_1}\left[\log\frac{dP_1}{dP_0}\right] - D(P_1\|P_0) + \mathbb{E}_{P_1}\left[1 - \frac{dQ_0}{dP_0}\right]\log e.$$
See Exercise I.22 for an example.

The main message of Proposition 2.20 is that the function

λ 7→ D(λP + λ̄QkQ) ,

is o(λ) as λ → 0. In fact, in most cases it is quadratic in λ. To make a precise statement, we need


to define the concept of χ²-divergence – a version of f-divergence (see Chapter 7):
$$\chi^2(P\|Q) \triangleq \int dQ\left(\frac{dP}{dQ} - 1\right)^2.$$
This is a popular dissimilarity measure between P and Q, frequently used in statistics. It has many
important properties, but we will only mention that χ2 dominates KL-divergence (cf. (7.34)):

D(PkQ) ≤ log(1 + χ2 (PkQ)) .

Our second result about the local behavior of KL-divergence is the following (see Section 7.10
for generalizations):

Proposition 2.21 (KL is locally χ²-like) We have
$$\liminf_{\lambda\to 0}\frac{1}{\lambda^2} D(\lambda P + \bar\lambda Q\|Q) = \frac{\log e}{2}\chi^2(P\|Q)\,, \qquad (2.30)$$
where both sides are finite or infinite simultaneously.
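Proposition 2.21 can also be observed numerically: for fixed discrete $P$ and $Q$, the ratio $D(\lambda P + \bar\lambda Q\|Q)/\lambda^2$ settles at $\chi^2(P\|Q)/2$ (in nats) as $\lambda\to 0$. A minimal sketch, with arbitrary example distributions chosen for illustration:

```python
import numpy as np

def kl(p, q):
    """KL divergence in nats."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.3, 0.3, 0.4])
chi2 = float(np.sum((P - Q) ** 2 / Q))     # chi^2(P||Q)

for lam in [0.1, 0.01, 0.001]:
    mix = lam * P + (1 - lam) * Q
    print(lam, kl(mix, Q) / lam ** 2)      # tends to chi^2 / 2
print("chi^2/2 =", chi2 / 2)
```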


Proof. First, we assume that $\chi^2(P\|Q) < \infty$ and prove
$$D(\lambda P + \bar\lambda Q\|Q) = \frac{\lambda^2\log e}{2}\chi^2(P\|Q) + o(\lambda^2)\,, \qquad \lambda\to 0.$$
To that end notice that
$$D(P\|Q) = \mathbb{E}_Q\left[g\left(\frac{dP}{dQ}\right)\right],$$
where
$$g(x) \triangleq x\log x - (x-1)\log e\,.$$
Note that $x\mapsto \frac{g(x)}{(x-1)^2\log e} = \int_0^1\frac{s\,ds}{x(1-s)+s}$ is decreasing in $x$ on $(0,\infty)$. Therefore
$$0 \le g(x) \le (x-1)^2\log e\,,$$
and hence
$$0 \le \frac{1}{\lambda^2}\, g\left(\bar\lambda + \lambda\frac{dP}{dQ}\right) \le \left(\frac{dP}{dQ} - 1\right)^2\log e.$$
By the dominated convergence theorem (which is applicable since $\chi^2(P\|Q) < \infty$) we have
$$\lim_{\lambda\to 0}\mathbb{E}_Q\left[\frac{1}{\lambda^2}\, g\left(\bar\lambda + \lambda\frac{dP}{dQ}\right)\right] = \mathbb{E}_Q\left[\frac{g''(1)}{2}\left(\frac{dP}{dQ} - 1\right)^2\right] = \frac{\log e}{2}\chi^2(P\|Q)\,.$$
Second, we show that unconditionally
$$\liminf_{\lambda\to 0}\frac{1}{\lambda^2} D(\lambda P + \bar\lambda Q\|Q) \ge \frac{\log e}{2}\chi^2(P\|Q)\,. \qquad (2.31)$$
Indeed, this follows from Fatou's lemma:
$$\liminf_{\lambda\to 0}\mathbb{E}_Q\left[\frac{1}{\lambda^2}\, g\left(\bar\lambda + \lambda\frac{dP}{dQ}\right)\right] \ge \mathbb{E}_Q\left[\liminf_{\lambda\to 0}\frac{1}{\lambda^2}\, g\left(\bar\lambda + \lambda\frac{dP}{dQ}\right)\right] = \frac{\log e}{2}\chi^2(P\|Q)\,.$$
Therefore, from (2.31) we conclude that if χ2 (PkQ) = ∞ then so is the LHS of (2.30).

2.6.2* Parametrized family


Extending the setting of Section 2.6.1*, consider a parametrized set of distributions {Pθ : θ ∈ Θ}
where the parameter space Θ is an open subset of $\mathbb{R}^d$. Furthermore, suppose that the distributions $P_\theta$
are all given in the form of

Pθ (dx) = pθ (x) μ(dx) ,

where μ is some common dominating measure (e.g. Lebesgue or counting measure). If for each
fixed x, the density pθ (x) depends smoothly on θ, one can define the Fisher information matrix
with respect to the parameter θ as
 
JF (θ) ≜ Eθ VV⊤ , V ≜ ∇θ ln pθ (X) , (2.32)


where Eθ is with respect to X ∼ Pθ . In particular, V is known as the score.


Under suitable regularity conditions, we have the identity
$$\mathbb{E}_\theta[V] = 0 \qquad (2.33)$$
and several equivalent expressions for the Fisher information matrix:
$$J_F(\theta) = \operatorname{Cov}_\theta(V) = 4\int\mu(dx)\,\big(\nabla_\theta\sqrt{p_\theta(x)}\big)\big(\nabla_\theta\sqrt{p_\theta(x)}\big)^\top = -\,\mathbb{E}_\theta\left[\operatorname{Hess}_\theta(\ln p_\theta(X))\right],$$

where the last identity is obtained by differentiating (2.33) with respect to each θj .
The significance of Fisher information matrix arises from the fact that it gauges the local behav-
ior of divergence for smooth parametric families. Namely, we have (again under suitable technical
conditions):2

$$D(P_{\theta_0}\|P_{\theta_0+\xi}) = \frac{\log e}{2}\,\xi^\top J_F(\theta_0)\,\xi + o(\|\xi\|^2)\,, \qquad (2.34)$$
which is obtained by integrating the Taylor expansion:
$$\ln p_{\theta_0+\xi}(x) = \ln p_{\theta_0}(x) + \xi^\top\nabla_\theta\ln p_{\theta_0}(x) + \frac{1}{2}\,\xi^\top\operatorname{Hess}_\theta(\ln p_{\theta_0}(x))\,\xi + o(\|\xi\|^2)\,.$$
We will establish this fact rigorously later in Section 7.11. Property (2.34) is of paramount impor-
tance in statistics. We should remember it as: Divergence is locally quadratic on the parameter
space, with Hessian given by the Fisher information matrix. Note that for the Gaussian location
model Pθ = N (θ, Σ), (2.34) is in fact exact with JF (θ) ≡ Σ−1 – cf. Example 2.2.
As another example, note that Proposition 2.21 is a special case of (2.34) by considering Pλ =
λ̄Q + λP parametrized by λ ∈ [0, 1]. In this case, the Fisher information at λ = 0 is simply
χ2 (PkQ). Nevertheless, Proposition 2.21 is completely general while the asymptotic expansion
(2.34) is not without regularity conditions (see Section 7.11).
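A one-parameter family where (2.34) can be checked by hand is $\mathrm{Ber}(\theta)$, whose Fisher information is $J_F(\theta) = \frac{1}{\theta(1-\theta)}$. The sketch below is not from the text (the particular $\theta$ is an arbitrary choice); working in nats (so $\log e = 1$), it confirms that $D(P_\theta\|P_{\theta+\xi})/\xi^2$ approaches $J_F(\theta)/2$.

```python
import numpy as np

def kl_bern(p, q):
    """D(Ber(p) || Ber(q)) in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta = 0.3
J = 1.0 / (theta * (1 - theta))    # Fisher information of Ber(theta)

for xi in [0.1, 0.01, 0.001]:
    print(xi, kl_bern(theta, theta + xi) / xi ** 2)   # approaches J/2
print("J/2 =", J / 2)
```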
Remark 2.3 Some useful properties of Fisher information are as follows:

• Reparametrization: It can be seen that if one introduces another parametrization θ̃ ∈ Θ̃ by


means of a smooth invertible map Θ̃ → Θ, then Fisher information matrix changes as

JF (θ̃) = A⊤ JF (θ)A , (2.35)

² To illustrate the subtlety here, consider a scalar location family, i.e. $p_\theta(x) = f_0(x-\theta)$ for some density $f_0$. In this case
the Fisher information $J_F(\theta_0) = \int\frac{(f_0')^2}{f_0}$ does not depend on $\theta_0$ and is well-defined even for compactly supported $f_0$,
provided $f_0'$ vanishes at the endpoints sufficiently fast. But at the same time the left-hand side of (2.34) is infinite for any
$\xi > 0$. Thus, a more general interpretation for Fisher information is as the coefficient in the expansion
$D(P_{\theta_0}\|\tfrac{1}{2}P_{\theta_0} + \tfrac{1}{2}P_{\theta_0+\xi}) = \frac{\xi^2}{8}J_F(\theta_0) + o(\xi^2)$. We will discuss this in more detail in Section 7.11.


where $A = \frac{d\theta}{d\tilde\theta}$ is the Jacobian of the map. So we can see that $J_F$ transforms similarly to the
metric tensor in Riemannian geometry. This idea can be used to define a Riemannian metric
on the parameter space Θ, called the Fisher-Rao metric. This is explored in a field known as
information geometry [85, 17].
• Additivity: Suppose we are given a sample of $n$ iid observations $X^n \stackrel{\text{i.i.d.}}{\sim} P_\theta$. As such, consider the
parametrized family of product distributions $\{P_\theta^{\otimes n} : \theta\in\Theta\}$, whose Fisher information matrix
is denoted by $J_F^{\otimes n}(\theta)$. In this case, the score is an iid sum. Applying (2.32) and (2.33) yields
$$J_F^{\otimes n}(\theta) = n J_F(\theta). \qquad (2.36)$$

Example 2.6 Let $P_\theta = (\theta_0,\ldots,\theta_d)$ be a probability distribution on the finite alphabet
$\{0,\ldots,d\}$. We will take $\theta = (\theta_1,\ldots,\theta_d)$ as the free parameter and set $\theta_0 = 1 - \sum_{i=1}^d\theta_i$. So
all derivatives are with respect to $\theta_1,\ldots,\theta_d$ only. Then we have
$$p_\theta(i) = \begin{cases}\theta_i, & i = 1,\ldots,d\\ 1 - \sum_{i=1}^d\theta_i, & i = 0\end{cases}$$
and for the Fisher information matrix we get
$$J_F(\theta) = \operatorname{diag}\left(\frac{1}{\theta_1},\ldots,\frac{1}{\theta_d}\right) + \frac{1}{1-\sum_{i=1}^d\theta_i}\,\mathbf{1}\mathbf{1}^\top, \qquad (2.37)$$
where $\mathbf{1}$ is the $d\times 1$ vector of all ones. For future references (see Sections 29.4 and 13.4*), we also
compute the inverse and determinant of $J_F(\theta)$. By the matrix inversion lemma $(A+UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1}$, we have
$$J_F^{-1}(\theta) = \operatorname{diag}(\theta) - \theta\theta^\top. \qquad (2.38)$$
For the determinant, notice that $\det(A + xy^\top) = \det A\cdot\det(I + A^{-1}xy^\top) = \det A\cdot(1 + y^\top A^{-1}x)$,
where we used the identity $\det(I+AB) = \det(I+BA)$. Thus, we have
$$\det J_F(\theta) = \prod_{i=0}^d\frac{1}{\theta_i}. \qquad (2.39)$$
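The closed forms (2.37)–(2.39) are easy to confirm numerically. A minimal sketch (not from the text; the particular $\theta$ is an arbitrary choice):

```python
import numpy as np

d = 3
theta = np.array([0.2, 0.3, 0.1])        # free parameters theta_1, ..., theta_d
theta0 = 1 - theta.sum()                 # theta_0

# Fisher information matrix (2.37), its claimed inverse (2.38) and determinant (2.39)
J = np.diag(1 / theta) + np.ones((d, d)) / theta0
J_inv_claimed = np.diag(theta) - np.outer(theta, theta)
det_claimed = 1 / (theta0 * np.prod(theta))

print(np.allclose(np.linalg.inv(J), J_inv_claimed))   # True
print(np.isclose(np.linalg.det(J), det_claimed))      # True
```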

Example 2.7 (Location family) In statistics and information theory it is common to talk
about Fisher information of a (single) random variable or a distribution without reference to a
parametric family. In such cases one is implicitly considering a location parameter. Specifically,
for any density p0 on Rd we define a location family of distributions on Rd by setting Pθ (dx) =
p0 (x − θ)dx, θ ∈ Rd . Note that JF (θ) here does not depend on θ. For this special case, we will
adopt the standard notation: Let X ∼ p0 then
J(X) ≡ J(p0 ) ≜ EX∼p0 [(∇ ln p0 (X))(∇ ln p0 (X))⊤ ] = − EX∼p0 [Hess(ln p0 (X))] , (2.40)
where the second equality requires applicability of integration by parts. (See also (7.96) for a
variational definition.)

3 Mutual information

After technical preparations in previous chapters we define perhaps the most famous concept in the
entire field of information theory, the mutual information. It was originally defined by Shannon,
although the name was coined later by Robert Fano.1 It has two equivalent expressions (as a KL
divergence and as difference of entropies), both having their merits. In this chapter, we collect
some basic properties of mutual information (non-negativity, chain rule and the data-processing
inequality). While defining conditional information, we also introduce the language of directed
graphical models, and connect the equality case in the data-processing inequality with Fisher’s
concept of sufficient statistic.
So far in this book we have not yet attempted connecting information quantities to any opera-
tional concepts. The first time this will be done in Section 3.6 where we relate mutual information
to probability of error in the form of Fano’s inequality, which states that whenever I(X; Y) is small,
one should not be able to predict X on the basis of Y with a small probability of error. As such, this
inequality will be applied countless times in the rest of the book as a main workhorse for studying
fundamental limits of problems in both information theory and in statistics.
The connection between information and estimation is furthered in Section 3.7*, in which we
relate mutual information and minimum mean squared error in Gaussian noise (I-MMSE relation).
From the latter we also derive the entropy power inequality, which plays a central role in high-
dimensional probability and concentration of measure.

3.1 Mutual information


Mutual information was first defined by Shannon to measure the decrease in entropy of a random
quantity following the observation of another (correlated) random quantity. Unlike the concept
of entropy itself, which was well-known by then in statistical mechanics, the mutual information
was new and revolutionary and had no analogs in science. Today, however, it is preferred to define
mutual information in a different form (proposed in [378, Appendix 7]).

1
Professor of electrical engineering at MIT, who developed the first course on information theory and as part of it
formalized and rigorized much of Shannon’s ideas. Most famously, he showed the “converse part” of the noisy channel
coding theorem, see Section 17.4.


Definition 3.1 (Mutual information) For a pair of random variables X and Y we define
I(X; Y) = D(PX,Y kPX PY ).

The intuitive interpretation of mutual information is that I(X; Y) measures the dependency
between X and Y by comparing their joint distribution to the product of the marginals in the KL
divergence, which, as we show next, is also equivalent to comparing the conditional distribution
to the unconditional.
The way we defined I(X; Y) it is a functional of the joint distribution PX,Y . However, it is also
rather fruitful to look at it as a functional of the pair (PX , PY|X ) – more on this in Section 5.1.
In general, the divergence D(PX,Y kPX PY ) should be evaluated using the general definition (2.4).
Note that PX,Y  PX PY need not always hold. Let us consider the following examples, though.
Example 3.1 If $X = Y \sim \mathcal{N}(0,1)$ then $P_{X,Y}\not\ll P_X P_Y$ and $I(X;Y) = \infty$. This reflects our
intuition that $X$ contains an "infinite" amount of information requiring infinitely many bits to
describe. On the other hand, if even one of $X$ or $Y$ is discrete, then we always have $P_{X,Y}\ll P_X P_Y$.
Indeed, suppose $X$ is discrete with countable support $S$, and consider any $E\subset\mathcal{X}\times\mathcal{Y}$ measurable in the product sigma algebra with $P_{X,Y}(E) > 0$.
Since $P_{X,Y}(E) = \sum_{x\in S} P[(X,Y)\in E, X = x]$, there exists some $x_0\in S$ such that $P_Y(E_{x_0}) \ge P[X = x_0, Y\in E_{x_0}] > 0$, where $E_{x_0} \triangleq \{y : (x_0,y)\in E\}$ is a section of $E$ (measurable for every $x_0$). But
then $P_X P_Y(E) \ge P_X P_Y(\{x_0\}\times E_{x_0}) = P_X(\{x_0\})P_Y(E_{x_0}) > 0$, implying that $P_{X,Y}\ll P_X P_Y$.

Theorem 3.2 (Properties of mutual information)

(a) (Mutual information as conditional divergence) Whenever Y is standard Borel,

I(X; Y) = D(PY|X kPY |PX ) . (3.1)

(b) (Symmetry) I(X; Y) = I(Y; X)


(c) (Positivity) I(X; Y) ≥ 0, with equality I(X; Y) = 0 iff X ⊥
⊥Y
(d) (Deterministic maps) For any function f we have

I(f(X); Y) ≤ I(X; Y) .

If f is one-to-one (with a measurable inverse), then I(f(X); Y) = I(X; Y).


(e) (More data ⇒ more information) I(X1 , X2 ; Z) ≥ I(X1 ; Z)

Proof. (a) This follows from Theorem 2.16(a) with QY|X = PY .


(b) Consider a Markov kernel K sending $(x,y)\mapsto(y,x)$. This kernel sends measure $P_{X,Y}\xrightarrow{K} P_{Y,X}$
and $P_X P_Y \xrightarrow{K} P_Y P_X$. Therefore, from the DPI Theorem 2.17 applied to this kernel we get

D(PX,Y kPX PY ) ≥ D(PY,X kPY PX ) .

Applying this argument again, shows that inequality is in fact equality.


(c) This is just D ≥ 0 from Theorem 2.3.


(d) Consider a Markov kernel K sending $(x,y)\mapsto(f(x),y)$. This kernel sends measure $P_{X,Y}\xrightarrow{K} P_{f(X),Y}$
and $P_X P_Y \xrightarrow{K} P_{f(X)} P_Y$. Therefore, from the DPI Theorem 2.17 applied to this kernel we
get

D(PX,Y kPX PY ) ≥ D(Pf(X),Y kPf(X) PY ) .

It is clear that the two sides correspond to the two mutual informations. For bijective f, simply
apply the inequality to f and f−1 .
(e) Apply (d) with f(X1 , X2 ) = X1 .

Of the results above, the one we will use the most is (3.1). Note that it implies that
D(PX,Y kPX PY ) < ∞ if and only if

x 7→ D(PY|X=x kPY )

is PX -integrable. This property has a counterpart in terms of absolute continuity, as follows.

Lemma 3.3 Let Y be standard Borel. Then


PX,Y  PX PY ⇐⇒ PY|X=x  PY for PX -a.e. x

Proof. Suppose PX,Y  PX PY . We need to prove that any version of the conditional probability
satisfies PY|X=x  PY for almost every x. Note, however, that if we prove this for some version P̃Y|X
then the statement for any version follows, since PY|X=x = P̃Y|X=x for PX -a.e. x. (This measure-
theoretic fact can be derived from the chain rule (2.24): since PX P̃Y|X = PX,Y = PX PY|X we must
have 0 = D(PX,Y kPX,Y ) = D(P̃Y|X kPY|X |PX ) = Ex∼PX [D(P̃Y|X=x kPY|X=x )], implying the stated
fact.) So let $g(x,y) = \frac{dP_{X,Y}}{d P_X P_Y}(x,y)$ and $\rho(x) \triangleq \int_{\mathcal{Y}} g(x,y)\,P_Y(dy)$. Fix any set $E\subset\mathcal{X}$ and notice
$$P_X[E] = \int_{\mathcal{X}\times\mathcal{Y}} 1_E(x) g(x,y)\, P_X(dx)\,P_Y(dy) = \int_{\mathcal{X}} 1_E(x)\rho(x)\,P_X(dx)\,.$$
On the other hand, we also have $P_X[E] = \int 1_E\,dP_X$, which implies $\rho(x) = 1$ for $P_X$-a.e. $x$. Now
define
$$\tilde P_{Y|X}(dy|x) = \begin{cases} g(x,y)\,P_Y(dy), & \rho(x) = 1\\ P_Y(dy), & \rho(x)\neq 1\,.\end{cases}$$

Directly plugging P̃Y|X into (2.22) shows that P̃Y|X does define a valid version of the conditional
probability of Y given X. Since by construction P̃Y|X=x  PY for every x, the result follows.
Conversely, let PY|X be a kernel such that PX [E] = 1, where E = {x : PY|X=x  PY } (recall that
E is measurable by Lemma 2.13). Define P̃Y|X=x = PY|X=x if x ∈ E and P̃Y|X=x = PY , otherwise.
By construction PX P̃Y|X = PX PY|X = PX,Y and P̃Y|X=x  PY for every x. Thus, by Theorem 2.12
there exists a jointly measurable f(y|x) such that

P̃Y|X (dy|x) = f(y|x)PY (dy) ,


and, thus, by (2.22)


Z
PX,Y [E] = f(y|x)PY (dy)PX (dx) ,
E
implying that PX,Y  PX PY .

3.2 Mutual information as difference of entropies


As promised, we next introduce a different point of view on I(X; Y), namely as a difference of
entropies. This (conditional entropy) point of view of Shannon emphasizes that I(X; Y) is also
measuring the change in the spread or uncertainty of the distribution of X following the observation
of Y.

Theorem 3.4 (Mutual information and entropy)


(a) $I(X;X) = \begin{cases} H(X) & X \text{ discrete}\\ +\infty & \text{otherwise.}\end{cases}$
(b) If X is discrete, then
I(X; Y) + H(X|Y) = H(X) . (3.2)
Consequently, either H(X|Y) = H(X) = ∞,2 or H(X|Y) < ∞ and
I(X; Y) = H(X) − H(X|Y). (3.3)

(c) If both X and Y are discrete, then


I(X; Y) + H(X, Y) = H(X) + H(Y),
so that whenever H(X, Y) < ∞ we have
I(X; Y) = H(X) + H(Y) − H(X, Y) .

(d) Similarly, if X, Y are real-valued random vectors with a joint PDF, then
I(X; Y) = h(X) + h(Y) − h(X, Y)
provided that h(X, Y) < ∞. If X has a marginal PDF pX and a conditional PDF pX|Y (x|y),
then
I(X; Y) = h(X) − h(X|Y) ,
provided h(X|Y) < ∞.

2
This is indeed possible if one takes Y = 0 (constant) and X from Example 1.3, demonstrating that (3.3) does not always
hold.


(e) If X or Y are discrete then I(X; Y) ≤ min (H(X), H(Y)), with equality iff H(X|Y) = 0 or
H(Y|X) = 0, or, equivalently, iff one is a deterministic function of the other.

Proof. (a) By Theorem 3.2.(a), $I(X;X) = D(P_{X|X}\|P_X|P_X) = \mathbb{E}_{x\sim P_X} D(\delta_x\|P_X)$. If $P_X$ is discrete,
then $D(\delta_x\|P_X) = \log\frac{1}{P_X(x)}$ and $I(X;X) = H(X)$. If $P_X$ is not discrete, let $A = \{x : P_X(x) > 0\}$
denote the set of atoms of PX . Let ∆ = {(x, x) : x 6∈ A} ⊂ X × X . (∆ is measurable since it’s
the intersection of Ac × Ac with the diagonal {(x, x) : x ∈ X }.) Then PX,X (∆) = PX (Ac ) > 0
but since
$$(P_X\times P_X)(E) \triangleq \int_{\mathcal{X}} P_X(dx_1)\int_{\mathcal{X}} P_X(dx_2)\,1\{(x_1,x_2)\in E\}$$

we have by taking E = ∆ that (PX × PX )(∆) = 0. Thus PX,X 6 PX × PX and thus by definition

I(X; X) = D(PX,X kPX PX ) = +∞ .

(b) Since $X$ is discrete there exists a countable set $S$ such that $P[X\in S] = 1$, and for any $x_0\in S$ we
have $P[X = x_0] > 0$. Let $\lambda$ be a counting measure on $S$ and let $\mu = \lambda\times P_Y$, so that $P_X P_Y\ll\mu$. As
shown in Example 3.1 we also have $P_{X,Y}\ll\mu$. Furthermore, $f_P(x,y) \triangleq \frac{dP_{X,Y}}{d\mu}(x,y) = p_{X|Y}(x|y)$,
where the latter denotes the conditional pmf of $X$ given $Y$ (which is a proper pmf for almost every
$y$, since $P[X\in S|Y=y] = 1$ for a.e. $y$). We also have $f_Q(x,y) = \frac{dP_X P_Y}{d\mu}(x,y) = \frac{dP_X}{d\lambda}(x) = p_X(x)$,
where the latter is the unconditional pmf of $X$. Note that by definition of Radon-Nikodym
derivatives we have
$$\mathbb{E}[p_{X|Y}(x_0|Y)] = p_X(x_0)\,. \qquad (3.4)$$
Next, according to (2.12) we have
$$I(X;Y) = \mathbb{E}\left[\operatorname{Log}\frac{f_P(X,Y)}{f_Q(X,Y)}\right] = \mathbb{E}_{y\sim P_Y}\left[\sum_{x\in S} p_{X|Y}(x|y)\operatorname{Log}\frac{p_{X|Y}(x|y)}{p_X(x)}\right].$$
Note that $P_{X,Y}$-almost surely both $p_{X|Y}(X|Y) > 0$ and $p_X(X) > 0$, so we can replace Log with
log in the above. On the other hand,
$$H(X|Y) = \mathbb{E}_{y\sim P_Y}\left[\sum_{x\in S} p_{X|Y}(x|y)\log\frac{1}{p_{X|Y}(x|y)}\right].$$
Adding these two expressions, we obtain
$$I(X;Y) + H(X|Y) \stackrel{(a)}{=} \mathbb{E}_{y\sim P_Y}\left[\sum_{x\in S} p_{X|Y}(x|y)\log\frac{1}{p_X(x)}\right]
\stackrel{(b)}{=} \sum_{x\in S}\mathbb{E}_{y\sim P_Y}\left[p_{X|Y}(x|y)\right]\log\frac{1}{p_X(x)} \stackrel{(c)}{=} \mathbb{E}\left[\log\frac{1}{P_X(X)}\right] \triangleq H(X)\,,$$
where in (a) we used linearity of the Lebesgue integral $\mathbb{E}_{P_Y}\sum_x$, in (b) we interchange $\mathbb{E}$ and $\sum$
via Fubini; and (c) holds due to (3.4).
(c) Simply add H(Y) to both sides of (3.2) and use the chain rule for H from (1.2).


(d) These arguments are similar to discrete case, except that counting measure is replaced with
Lebesgue. We leave the details as an exercise.
(e) Follows from (b).

From (3.2) we deduce the following result, which was previously shown in Theorem 1.4(d).

Corollary 3.5 (Conditioning reduces entropy) For discrete X, H(X|Y) ≤ H(X), with
equality iff X ⊥
⊥ Y.

Proof. If H(X) = ∞ then there is nothing to prove. Otherwise, apply (3.2).


Thus, the intuition behind the last corollary (and an important innovation of Shannon) is to
give meaning to the amount of entropy reduction (mutual information). It is important to note
that conditioning reduces entropy on average, not per realization. Indeed, take $X = U\ \mathrm{OR}\ Y$, where
$U, Y \stackrel{\text{i.i.d.}}{\sim}\mathrm{Ber}(\frac{1}{2})$. Then $X\sim\mathrm{Ber}(\frac{3}{4})$ and $H(X) = h(\frac{1}{4}) < 1\ \mathrm{bit} = H(X|Y=0)$, i.e., conditioning on
$Y = 0$ increases entropy. But on average, $H(X|Y) = P[Y=0]H(X|Y=0) + P[Y=1]H(X|Y=1) = \frac{1}{2}\ \mathrm{bit} < H(X)$, by the strong concavity of $h(\cdot)$.
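The numbers in this example are easy to reproduce directly. A small sketch (ours, not from the text; all quantities in bits):

```python
import numpy as np

def H(p):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# X = U OR Y with U, Y iid Ber(1/2): X ~ Ber(3/4); given Y=0, X = U ~ Ber(1/2); given Y=1, X = 1.
HX = H([0.25, 0.75])                                  # = h(1/4) ~ 0.811 bits
HX_given_Y0 = H([0.5, 0.5])                           # = 1 bit  (conditioning on Y=0 increases entropy)
HX_given_Y1 = H([1.0])                                # = 0 bits
HX_given_Y = 0.5 * HX_given_Y0 + 0.5 * HX_given_Y1    # = 0.5 bits < H(X)
print(HX, HX_given_Y, HX - HX_given_Y)                # last value is I(X;Y)
```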
Remark 3.1 (Information, entropy, and Venn diagrams) For discrete random vari-
ables, the following Venn diagram illustrates the relationship between entropy, conditional
entropy, joint entropy, and mutual information from Theorem 3.4(b) and (c).

[Venn diagram: the region $H(X,Y)$ is split into $H(Y|X)$, $I(X;Y)$ and $H(X|Y)$, where $H(X)$ and $H(Y)$ are the two overlapping circles and $I(X;Y)$ is their intersection.]

Applying analogously the inclusion-exclusion principle to three variables X1 , X2 , X3 , we see


that the triple intersection corresponds to
H(X1 ) + H(X2 ) + H(X3 ) − H(X1 , X2 ) − H(X2 , X3 ) − H(X1 , X3 ) + H(X1 , X2 , X3 ) (3.5)
which is sometimes denoted by I(X1 ; X2 ; X3 ). It can be both positive and negative (why?).
In general, one can treat random variables as sets (so that the Xi corresponds to set Ei and the
pair (X1 , X2 ) corresponds to E1 ∪ E2 ). Then we can define a unique signed measure μ on the finite
algebra generated by these sets so that every information quantity is found by replacing
I/H → μ,    ; → ∩,    , → ∪,    | → \.

As an example, we have
H(X1 |X2 , X3 ) = μ(E1 \ (E2 ∪ E3 )) , (3.6)


I(X1 , X2 ; X3 |X4 ) = μ(((E1 ∪ E2 ) ∩ E3 ) \ E4 ) . (3.7)

By inclusion-exclusion, the quantity in (3.5) corresponds to μ(E1 ∩ E2 ∩ E3 ), which explains why


μ is not necessarily a positive measure. For an extensive discussion, see [110, Chapter 1.3].

3.3 Examples of computing mutual information


Below we demonstrate how to compute I in both continuous and discrete settings.
Example 3.2 (Bivariate Gaussian) Let $X, Y$ be jointly Gaussian. Then
$$I(X;Y) = \frac{1}{2}\log\frac{1}{1-\rho_{X,Y}^2} \qquad (3.8)$$
where $\rho_{X,Y} \triangleq \frac{\mathbb{E}[(X-\mathbb{E}X)(Y-\mathbb{E}Y)]}{\sigma_X\sigma_Y}\in[-1,1]$ is the correlation coefficient; see Figure 3.1 for a plot.
To show (3.8), by shifting and scaling if necessary, we can assume without loss of generality that

[Figure 3.1: Mutual information between correlated Gaussians, plotted as a function of $\rho\in[-1,1]$.]

$\mathbb{E}X = \mathbb{E}Y = 0$ and $\mathbb{E}X^2 = \mathbb{E}Y^2 = 1$. Then $\rho = \mathbb{E}XY$. By joint Gaussianity, $Y = \rho X + Z$ for some
$Z\sim\mathcal{N}(0, 1-\rho^2)\perp\!\!\!\perp X$. Then using the divergence formula for Gaussians (2.7), we get
$$I(X;Y) = D(P_{Y|X}\|P_Y|P_X) = \mathbb{E}\,D(\mathcal{N}(\rho X, 1-\rho^2)\|\mathcal{N}(0,1))
= \mathbb{E}\left[\frac{1}{2}\log\frac{1}{1-\rho^2} + \frac{\log e}{2}\left((\rho X)^2 + 1 - \rho^2 - 1\right)\right] = \frac{1}{2}\log\frac{1}{1-\rho^2}\,.$$
Alternatively, we can use the differential entropy representation in Theorem 3.4(d) and the entropy
formula (2.17) for Gaussians:

$$I(X;Y) = h(Y) - h(Y|X) = h(Y) - h(Z) = \frac{1}{2}\log(2\pi e) - \frac{1}{2}\log(2\pi e(1-\rho^2)) = \frac{1}{2}\log\frac{1}{1-\rho^2}\,,$$
where the second equality follows from $h(Y|X) = h(Y-X|X) = h(Z|X) = h(Z)$, applying the shift-invariance of $h$ and the independence between $X$ and $Z$.
Similar to the role of mutual information, the correlation coefficient also measures the
dependency between random variables which are real-valued (or, more generally, valued in an
inner-product space) in a certain sense. In contrast, mutual information is invariant to bijections
and much more general as it can be defined not just for numerical but for arbitrary random
variables.
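Formula (3.8) can be checked against a Monte Carlo estimate of $\mathbb{E}[\log\frac{dP_{X,Y}}{d(P_X P_Y)}]$, using the explicit bivariate normal densities. A hedged sketch (ours; the value of $\rho$ and the sample size are arbitrary choices), working in nats:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.8, 200_000

# Closed form (3.8), in nats
I_formula = 0.5 * np.log(1 / (1 - rho ** 2))

# Monte Carlo: average the log density ratio of the joint vs. product of standard normal marginals
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)
log_ratio = (0.5 * np.log(1 / (1 - rho ** 2))
             - (rho**2 * x**2 - 2 * rho * x * y + rho**2 * y**2) / (2 * (1 - rho ** 2)))
print(I_formula, log_ratio.mean())   # the estimate fluctuates around the closed form
```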

Example 3.3 (AWGN channel) The additive white Gaussian noise (AWGN) channel is one
of the main examples of Markov kernels that we will use in this book. This kernel acts from $\mathbb{R}$ to
$\mathbb{R}$ by taking an input $x$ and setting $K(\cdot|x)\sim\mathcal{N}(x,\sigma_N^2)$, or, in equation form we write $Y = X + N$,
with $X\perp\!\!\!\perp N\sim\mathcal{N}(0,\sigma_N^2)$. Pictorially, we can think of it as the diagram $X\to\oplus\to Y$ with the noise $N$ feeding into the adder.
Now, suppose that $X\sim\mathcal{N}(0,\sigma_X^2)$, in which case $Y\sim\mathcal{N}(0,\sigma_X^2+\sigma_N^2)$. Then by invoking (2.17)
twice we obtain
$$I(X;Y) = h(Y) - h(Y|X) = h(X+N) - h(N) = \frac{1}{2}\log\left(1 + \frac{\sigma_X^2}{\sigma_N^2}\right),$$
where $\frac{\sigma_X^2}{\sigma_N^2}$ is frequently referred to as the signal-to-noise ratio (SNR). See Figure 3.2 for an illustration. Note that in engineering it is common to express SNR in decibels (dB), so that SNR in dB
equals $10\log_{10}(\mathrm{SNR})$. Later, we will define the AWGN channel more formally in Definition 20.10.

Example 3.4 (BI-AWGN channel) In communication and statistical applications one also
often encounters a situation where AWGN channel’s input is restricted to X ∈ {±1}. This Markov
kernel is denoted $\mathrm{BIAWGN}_{\sigma_N^2} : \{\pm 1\}\to\mathbb{R}$ and acts by setting
$$Y = X + N, \qquad X\perp\!\!\!\perp N\sim\mathcal{N}(0,\sigma_N^2)\,.$$
If we set $X\sim\mathrm{Ber}(1/2)$ then in this case it is more convenient to calculate mutual information by
a decomposition different from the AWGN case. Indeed, we have
$$I(X;Y) = H(X) - H(X|Y) = \log 2 - H(X|Y)\,.$$
To compute $H(X|Y=y)$ we simply need to evaluate the posterior distribution given the observation $Y = y$.
In this case we have $P[X = +1|Y=y] = \frac{e^{y/\sigma_N^2}}{e^{y/\sigma_N^2} + e^{-y/\sigma_N^2}}$. Thus, after some algebra we obtain the
following expression
$$I(X;Y) = \log 2 - \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\,e^{-\frac{z^2}{2}}\log\left(1 + e^{-\frac{2}{\sigma^2} + \frac{2}{\sigma}z}\right) dz\,.$$
[Figure 3.2: Comparing mutual information for the AWGN and BI-AWGN channels (see Examples 3.3 and 3.4). It will be shown later in this book that these mutual informations coincide with the capacities of respective channels.]
(One can verify that $H(X|Y)$ here coincides with that in Example 1.4(2) with $\sigma$ replaced by $2\sigma$.)
For this channel, the SNR is given by $\frac{\mathbb{E}[X^2]}{\mathbb{E}[N^2]} = \frac{1}{\sigma^2}$. We compare the mutual informations of AWGN
and BI-AWGN as a function of the SNR on Figure 3.2. Note that for low SNR restricting to
binary input results in virtually no loss of information – a fact underpinning the role played by the
BI-AWGN channel in many real-world communication systems.
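The integral above is easy to evaluate numerically and compare with the Gaussian-input mutual information of Example 3.3, reproducing the behavior shown in Figure 3.2. A sketch (the function names are ours; truncating the integral to $[-12,12]$ is a numerical convenience, since the Gaussian weight makes the tails negligible at these SNRs):

```python
import numpy as np
from scipy.integrate import quad

def awgn_mi(snr):
    """Gaussian-input mutual information (1/2)log(1+SNR), in nats."""
    return 0.5 * np.log(1 + snr)

def biawgn_mi(sigma):
    """I(X;Y) for X uniform on {-1,+1}, Y = X + N(0, sigma^2), in nats."""
    integrand = lambda z: (np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
                           * np.log1p(np.exp(-2 / sigma**2 + 2 * z / sigma)))
    val, _ = quad(integrand, -12, 12)   # Gaussian weight makes |z| > 12 negligible
    return np.log(2) - val

for snr in [0.1, 1.0, 10.0]:
    sigma = 1 / np.sqrt(snr)            # SNR = 1/sigma^2
    print(snr, awgn_mi(snr), biawgn_mi(sigma))
# At low SNR the two values nearly coincide; at high SNR the binary input saturates at log 2.
```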
Example 3.5 (Gaussian vectors) Let $X\in\mathbb{R}^m$ and $Y\in\mathbb{R}^n$ be jointly Gaussian. Then
$$I(X;Y) = \frac{1}{2}\log\frac{\det\Sigma_X\,\det\Sigma_Y}{\det\Sigma_{[X,Y]}}$$
where $\Sigma_X \triangleq \mathbb{E}\left[(X-\mathbb{E}X)(X-\mathbb{E}X)^\top\right]$ denotes the covariance matrix of $X\in\mathbb{R}^m$, and $\Sigma_{[X,Y]}$
denotes the covariance matrix of the random vector $[X,Y]\in\mathbb{R}^{m+n}$.
In the special case of additive noise: $Y = X + N$ for $N\perp\!\!\!\perp X$, we have
$$I(X; X+N) = \frac{1}{2}\log\frac{\det(\Sigma_X+\Sigma_N)}{\det\Sigma_N}$$
since $\det\Sigma_{[X,X+N]} = \det\begin{pmatrix}\Sigma_X & \Sigma_X\\ \Sigma_X & \Sigma_X+\Sigma_N\end{pmatrix} \stackrel{\text{why?}}{=} \det\Sigma_X\,\det\Sigma_N$.
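The additive-noise determinant formula can be verified numerically against the general expression. A minimal sketch (ours), with arbitrary positive-definite covariances:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 3
A = rng.standard_normal((m, m)); Sigma_X = A @ A.T + np.eye(m)   # arbitrary PSD covariances
B = rng.standard_normal((m, m)); Sigma_N = B @ B.T + np.eye(m)

# I(X; X+N) from the additive-noise formula, in nats
I_det = 0.5 * np.log(np.linalg.det(Sigma_X + Sigma_N) / np.linalg.det(Sigma_N))

# Same quantity from the general formula with the joint covariance of [X, X+N]
Sigma_Y = Sigma_X + Sigma_N
Sigma_joint = np.block([[Sigma_X, Sigma_X], [Sigma_X, Sigma_Y]])
I_gen = 0.5 * np.log(np.linalg.det(Sigma_X) * np.linalg.det(Sigma_Y) / np.linalg.det(Sigma_joint))

print(I_det, I_gen)   # the two values agree
```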


Example 3.6 (Binary symmetric channel) Recall the setting in Example 1.4(1). Let $X\sim\mathrm{Ber}(\frac{1}{2})$ and $N\sim\mathrm{Ber}(\delta)$ be independent. Let $Y = X\oplus N$; or equivalently, $Y$ is obtained by flipping
$X$ with probability $\delta$.

[Figure: two equivalent views of the BSC — the transition diagram in which $0\to 0$ and $1\to 1$ with probability $1-\delta$ and the input is crossed over with probability $\delta$, and the additive-noise form $X\to\oplus\to Y$ with noise $N$.]

As shown in Example 1.4(1), H(X|Y) = H(N) = h(δ) and hence


I(X; Y) = log 2 − h(δ).
The corresponding conditional distribution PY|X (Markov kernel) is called the binary symmetric
channel (BSC) with parameter δ and denoted by BSCδ .
Example 3.7 (Addition over finite groups) Generalizing Example 3.6, let X and Z take
values on a finite abelian group G. If X is uniform on G and independent of Z, then
I(X; X + Z) = log |G| − H(Z),
which simply follows from that X + Z is uniform on G regardless of the distribution of Z. Same
holds for non-abelian groups, but then + should be replaced with the group operation ◦ and channel
action is x 7→ x ◦ Z, Z ∼ PZ .

3.4 Conditional mutual information and conditional independence


Definition 3.6 (Conditional mutual information) If X and Y are standard Borel, then
we define
I(X; Y|Z) ≜ D(PX,Y|Z kPX|Z PY|Z |PZ ) (3.9)
= Ez∼PZ [I(X; Y|Z = z)] . (3.10)
where the product PX|Z PY|Z is a conditional distribution such that (PX|Z PY|Z )(A × B|z) =
PX|Z (A|z)PY|Z (B|z), under which X and Y are independent conditioned on Z.

Denoting I(X; Y) as a functional I(PX,Y ) of the joint distribution PX,Y , we have I(X; Y|Z) =
Ez∼PZ [I(PX,Y|Z=z )]. As such, I(X; Y|Z) is a linear functional in PZ . Measurability of the map z 7→
I(PX,Y|Z=z ) is not obvious, but follows from Lemma 2.13.
To further discuss properties of the conditional mutual information, let us first introduce the
notation for conditional independence. A family of joint distributions can be represented by a
directed acyclic graph (DAG) encoding the dependency structure of the underlying random vari-
ables. We do not intend to introduce formal definitions here and refer to the standard monograph
for full details [271]. But in short, every problem consisting of finitely (or countably infinitely)


many random variables can be depicted as a DAG. Nodes of the DAG correspond to random vari-
ables and incoming edges into the node U simply describe which variables need to be known in
order to generate U. A simple example is a Markov chain (path graph) X → Y → Z, which repre-
sents distributions that factor as {PX,Y,Z : PX,Y,Z = PX PY|X PZ|Y }. We have the following equivalent
descriptions:

X → Y → Z ⇔ PX,Z|Y = PX|Y · PZ|Y


⇔ PZ|X,Y = PZ|Y
⇔ PX,Y,Z = PX · PY|X · PZ|Y
⇔ X, Y, Z form a Markov chain
⇔ X⊥
⊥ Z| Y
⇔ X ← Y → Z, PX,Y,Z = PY · PX|Y · PZ|Y
⇔ Z→Y→X

There is a general method for obtaining these equivalences for general graphs, known as d-
separation, see [271]. We say that a variable V is a collider on some undirected path if it appears
on the path as

collider: ··· → V ← ··· . (3.11)

Otherwise, V is called a non-collider (and hence appears as → V →, ← V ←, or ← V →). A pair


of collections of variables A and B are d-connected by a collection C if there exists an undirected
path from some variable in A to some variable in B such that a) there are no non-colliders in C
and b) every collider is either in C or has a descendant in C. The concept of d-connectedness
is important because it characterizes conditional independence. Specifically, A ⊥ ⊥ B|C in every
distribution satisfying a given graphical model if and only if A and B are not d-connected by C. It
is rather useful for many information-theoretic considerations to master this criterion. However,
in our book we will not formally require this apparatus beyond the basic equivalences for a linear
Markov chain listed above. We do recommend practicing these, however, e.g. by doing Exercises
I.26–I.30.

Theorem 3.7 (Further properties of mutual information) Suppose that all random
variables are valued in standard Borel spaces. Then:

(a) I(X; Z|Y) ≥ 0, with equality iff X → Y → Z.


(b) (Simple chain rule)3

I(X, Y; Z) = I(X; Z) + I(Y; Z|X)


= I(Y; Z) + I(X; Z|Y).

3
Also known as “Kolmogorov identities”.


(c) (DPI for mutual information) If X → Y → Z, then


I(X; Z) ≤ I(X; Y) , (3.12)
with equality iff X → Z → Y.
(d) If X → Y → Z → W, then I(X; W) ≤ I(Y; Z).
(e) (Full chain rule)
$$I(X^n; Y) = \sum_{k=1}^n I(X_k; Y|X^{k-1}).$$

(f) (Permutation invariance) If f and g are one-to-one (with measurable inverses), then
I(f(X); g(Y)) = I(X; Y).

Proof. (a) By definition and Theorem 3.2(c).


(b) First, notice that from (3.1) we have (with a self-evident notation):

I(Y; Z|X = x) = D(PY|Z,X=x kPY|X=x |PZ|X=x ) .


Taking expectation over X here we get
( a)
I(Y; Z|X) = D(PY|X,Z kPY|X |PX,Z ) .
On the other hand, from the chain rule for D, (2.24), we have
(b)
D(PX,Y,Z kPX,Y PZ ) = D(PX,Z kPX PZ ) + D(PY|X,Z kPY|X |PX,Z ) ,
where in the second term we noticed that conditioning on X, Z under the measure PX,Y PZ
results in PY|X (independent of Z). Putting (a) and (b) together completes the proof.
(c) Apply Kolmogorov identity to I(Y, Z; X):
$$I(Y,Z;X) = I(X;Y) + \underbrace{I(X;Z|Y)}_{=0} = I(X;Z) + I(X;Y|Z)$$

(d) Several applications of the DPI: I(X; W) ≤ I(X; Z) ≤ I(Y; Z)


(e) Recursive application of Kolmogorov identity.
(f) Apply DPI to f and then to f−1 .

Remark 3.2 In general, I(X; Y|Z) and I(X; Y) are incomparable. Indeed, consider the following
examples:

• I(X; Y|Z) > I(X; Y): We need to find an example of X, Y, Z which do not form a Markov chain.
To that end notice that there is only one directed acyclic graph non-isomorphic to X → Y → Z,
namely X → Y ← Z. With this idea in mind, we construct $X, Z \stackrel{\text{i.i.d.}}{\sim}\mathrm{Ber}(\frac{1}{2})$ and $Y = X\oplus Z$. Then
$I(X;Y) = 0$ since $X\perp\!\!\!\perp Y$; however, $I(X;Y|Z) = I(X; X\oplus Z|Z) = H(X) = 1$ bit (see the numerical sketch after this remark).


• I(X; Y|Z) < I(X; Y): Simply take X, Y, Z to be any random variables on finite alphabets and
Z = Y. Then I(X; Y|Z) = I(X; Y|Y) = H(Y|Y) − H(Y|X, Y) = 0 by a conditional version of (3.3).

Remark 3.3 (Chain rule for I ⇒ Chain rule for H) Set $Y = X^n$. Then $H(X^n) = I(X^n;X^n) = \sum_{k=1}^n I(X_k; X^n|X^{k-1}) = \sum_{k=1}^n H(X_k|X^{k-1})$, since $H(X_k|X^n, X^{k-1}) = 0$.
Remark 3.4 (DPI for divergence =⇒ DPI for mutual information) We proved
DPI for mutual information in Theorem 3.7 using Kolmogorov’s identity. In fact, DPI for mutual
information is implied by that for divergence in Theorem 2.17:

I(X; Z) = D(PZ|X kPZ |PX ) ≤ D(PY|X kPY |PX ) = I(X; Y),


P Z| Y PZ|Y
where note that for each x, we have PY|X=x −−→ PZ|X=x and PY −−→ PZ . Therefore if we have a
bi-variate functional of distributions D(PkQ) which satisfies DPI, then we can define a “mutual
information-like” quantity via ID (X; Y) ≜ D(PY|X kPY |PX ) ≜ Ex∼PX D(PY|X=x kPY ) which will
satisfy DPI on Markov chains. A rich class of examples arises by taking D = Df (an f-divergence
– see Chapter 7).
Remark 3.5 (Strong data-processing inequalities) For many channels PY|X , it is
possible to strengthen the data-processing inequality (2.28) as follows: For any PX , QX we have

D(PY kQY ) ≤ ηKL D(PX kQX ) ,

where ηKL < 1 and depends on the channel PY|X only. Similarly, this gives an improvement in the
data-processing inequality for mutual information in Theorem 3.7(c): For any PU,X we have

U→X→Y =⇒ I(U; Y) ≤ ηKL I(U; X) .

For example, for PY|X = BSCδ we have ηKL = (1 − 2δ)2 . Strong data-processing inequalities
(SDPIs) quantify the intuitive observation that noise intrinsic to the channel PY|X must reduce the
information that Y carries about the data U, regardless of how we optimize the encoding U 7→ X.
We explore SDPI further in Chapter 33 as well as their ramifications in statistics.
In addition to the case of strict inequality in DPI, the case of equality is also worth taking a closer
look. If U → X → Y and I(U; X) = I(U; Y), intuitively it means that, as far as U is concerned,
there is no loss of information in summarizing X into Y. In statistical parlance, we say that Y is a
sufficient statistic of X for U. This is the topic for the next section.

3.5 Sufficient statistic and data processing


Much later in the book we will be interested in estimating parameters θ of probability distributions
of X. To that end, one often first tries to remove unnecessary information contained in X. Let us
formalize the setting as follows:

• Let PθX be a collection of distributions of X parameterized by θ ∈ Θ;


• Let PT|X be some Markov kernel. Let PθT ≜ PT|X ◦ PθX be the induced distribution on T for each
θ.

Definition 3.8 (Sufficient statistic) We say that T is a sufficient statistic of X for θ if there
exists a transition probability kernel PX|T so that PθX PT|X = PθT PX|T , i.e., PX|T can be chosen to not
depend on θ.

The intuitive interpretation of T being sufficient is that, with T at hand, one can ignore X; in
other words, T contains all the relevant information to infer about θ. This is because X can be
simulated on the sole basis of T without knowing θ. As such, X provides no extra information
for identification of θ. Any one-to-one transformation of X is sufficient, however, this is not the
interesting case. In the interesting cases dimensionality of T will be much smaller (typically equal
to that of θ) than that of X. See examples below.
Observe also that the parameter θ need not be a random variable, as Definition 3.8 does not
involve any distribution (prior) on θ. This is a so-called frequentist point of view on the problem
of parameter estimation.

Theorem 3.9 Let θ, X, T be as in the setting above. Then the following are equivalent

(a) T is a sufficient statistic of X for θ.


(b) ∀Pθ , θ → T → X.
(c) ∀Pθ , I(θ; X|T) = 0.
(d) ∀Pθ , I(θ; X) = I(θ; T), i.e., the data processing inequality for mutual information holds with
equality.

Proof. We omit the details, which amount to either restating the conditions in terms of conditional
independence, or invoking equality cases in the properties stated in Theorem 3.7.

The following result of Fisher provides a criterion for verifying sufficiency:

Theorem 3.10 (Fisher’s factorization theorem) For all θ ∈ Θ, let PθX have a density pθ
with respect to a common dominating measure μ. Let T = T(X) be a deterministic function of X.
Then T is a sufficient statistic of X for θ iff

pθ (x) = gθ (T(x))h(x)

for some measurable functions gθ and h and all θ ∈ Θ.

Proof. We only give the proof in the discrete case where $p_\theta$ represents the PMF. (The argument
for the general case is similar, replacing $\sum$ by $\int d\mu$.) Let $t = T(x)$.
"⇒": Suppose $T$ is a sufficient statistic of $X$ for $\theta$. Then $p_\theta(x) = P_\theta(X=x) = P_\theta(X=x, T=t) = P_\theta(X=x|T=t)P_\theta(T=t) = \underbrace{P(X=x|T=T(x))}_{h(x)}\ \underbrace{P_\theta(T=T(x))}_{g_\theta(T(x))}$


"⇐": Suppose the factorization holds. Then
$$P_\theta(X=x|T=t) = \frac{p_\theta(x)}{\sum_{x'} 1\{T(x')=t\}\,p_\theta(x')} = \frac{g_\theta(t)h(x)}{\sum_{x'} 1\{T(x')=t\}\,g_\theta(t)h(x')} = \frac{h(x)}{\sum_{x'} 1\{T(x')=t\}\,h(x')}\,,$$
free of $\theta$.

Example 3.8 (Independent observations) In the following examples, a parametrized


distribution generates an independent sample of size n, which can be summarized into a scalar-
valued sufficient statistic. These can be verified by checking the factorization of the n-fold product
distribution and applying Theorem 3.10.

• Normal mean model. Let $\theta\in\mathbb{R}$ and observations $X_1,\ldots,X_n \stackrel{\text{i.i.d.}}{\sim}\mathcal{N}(\theta,1)$. Then the sample mean
$\bar X = \frac{1}{n}\sum_{j=1}^n X_j$ is a sufficient statistic of $X^n$ for $\theta$.
• Coin flips. Let $B_i \stackrel{\text{i.i.d.}}{\sim}\mathrm{Ber}(\theta)$. Then $\sum_{i=1}^n B_i$ is a sufficient statistic of $B^n$ for $\theta$.
• Uniform distribution. Let $U_i \stackrel{\text{i.i.d.}}{\sim}\mathrm{Unif}(0,\theta)$. Then $\max_{i\in[n]} U_i$ is a sufficient statistic of $U^n$ for $\theta$.

Example 3.9 (Sufficient statistic for hypothesis testing) Let Θ = {0, 1}. Given θ = 0
or 1, X ∼ PX or QX , respectively. Then Y – the output of PY|X – is a sufficient statistic of X for θ iff
D(PX|Y kQX|Y |PY ) = 0, i.e., PX|Y = QX|Y holds PY -a.s. Indeed, the latter means that for kernel QX|Y
we have

PX PY|X = PY QX|Y and QX PY|X = QY QX|Y ,

which is precisely the definition of sufficient statistic when θ ∈ {0, 1}. This example explains
the condition for equality in the data-processing for divergence in Theorem 2.17. Then assuming
D(PY kQY ) < ∞ we have:

D(PX kQX ) = D(PY kQY ) ⇐⇒ Y is a sufficient statistic for testing PX vs QX

Proof. Let QX,Y = QX PY|X , PX,Y = PX PY|X , then

$$D(P_{X,Y}\|Q_{X,Y}) = \underbrace{D(P_{Y|X}\|Q_{Y|X}|P_X)}_{=0} + D(P_X\|Q_X) = D(P_{X|Y}\|Q_{X|Y}|P_Y) + D(P_Y\|Q_Y) \ge D(P_Y\|Q_Y)$$

with equality iff D(PX|Y kQX|Y |PY ) = 0, which is equivalent to Y being a sufficient statistic for
testing PX vs QX as desired.

3.6 Probability of error and Fano’s inequality


Let W be a random variable and Ŵ be our prediction of it. Depending on the information available
for producing Ŵ we can consider three types of problems:

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-56


i i

56

1 Random guessing: W ⊥ ⊥ Ŵ.


2 Guessing with data: W → X → Ŵ, where X = f(W) is a deterministic function of W.
3 Guessing with noisy data: W → X → Y → Ŵ, where X → Y is given by some noisy channel.

Our goal is to draw converse statements of the following type: If the uncertainty of W is too high
or if the information provided by the data is too scarce, then it is difficult to guess the value of W.
In this section we formalize these intuitions using (conditional) entropy and mutual information.

Theorem 3.11 Let $|\mathcal{X}| = M < \infty$. Then for any $\hat X\perp\!\!\!\perp X$,
$$H(X) \le F_M(P[X = \hat X]) \qquad (3.13)$$
where
$$F_M(x) \triangleq (1-x)\log(M-1) + h(x), \qquad x\in[0,1] \qquad (3.14)$$
and $h(x) = x\log\frac{1}{x} + (1-x)\log\frac{1}{1-x}$ is the binary entropy function.
If $P_{\max} \triangleq \max_{x\in\mathcal{X}} P_X(x)$, then
$$H(X) \le F_M(P_{\max}) = (1-P_{\max})\log(M-1) + h(P_{\max})\,, \qquad (3.15)$$
with equality iff $P_X = \big(P_{\max}, \frac{1-P_{\max}}{M-1},\ldots,\frac{1-P_{\max}}{M-1}\big)$.

The function FM (·) is shown in Figure 3.3. Notice that due to its non-monotonicity the
statement (3.15) does not imply (3.13), even though P[X = X̂] ≤ Pmax .


Figure 3.3 The function FM in (3.14) is concave with maximum log M at maximizer 1/M, but not monotone.

Proof. To show (3.13) consider an auxiliary (product) distribution $Q_{X,\hat X} = U_X P_{\hat X}$, where $U_X$ is
uniform on $\mathcal{X}$. Then $Q[X = \hat X] = 1/M$. Denoting $P[X = \hat X] \triangleq P_S$, applying the DPI for divergence
to the data processor $(X,\hat X)\mapsto 1\{X = \hat X\}$ yields $d(P_S\|1/M) \le D(P_{X\hat X}\|Q_{X\hat X}) = \log M - H(X)$.
To show the second part, suppose one is trying to guess the value of X without any side informa-
tion. Then the best bet is obviously the most likely outcome (mode) and the maximal probability


of success is
$$\max_{\hat X\perp\!\!\perp X} P[X = \hat X] = P_{\max}\,. \qquad (3.16)$$
Thus, applying (3.13) with $\hat X$ being the mode yields (3.15). Finally, suppose that $P = (P_{\max}, P_2,\ldots,P_M)$ and introduce $Q = \big(P_{\max},\frac{1-P_{\max}}{M-1},\ldots,\frac{1-P_{\max}}{M-1}\big)$. Then the difference of the right
and left sides of (3.15) equals $D(P\|Q)\ge 0$, with equality iff $P = Q$.

Remark 3.6 Let us discuss the unusual proof technique. Instead of studying directly the prob-
ability space PX,X̂ given to us, we introduced an auxiliary one: QX,X̂ . We then drew conclusions
about the target metric (probability of error) for the auxiliary problem (the probability of error
$= 1 - \frac{1}{M}$). Finally, we used the DPI to transport a statement about Q to a statement about P: if D(P‖Q)
is small, then the probabilities of the events (e.g., {X 6= X̂}) should be small as well. This is a
general method, known as meta-converse, that we develop in more detail later in this book for
channel coding (see Section 22.3). For the specific result (3.15), however, there are much more
explicit ways to derive it – see Ex. I.25.
Similar to the Shannon entropy H, Pmax is also a reasonable measure for randomness of P. In
fact, recall from (1.4) that
$$H_\infty(P) = \log\frac{1}{P_{\max}} \qquad (3.17)$$
is the Rényi entropy of order ∞, cf. (1.4). In this regard, Theorem 3.11 can be thought of as our
first example of a comparison of information measures: it compares H and H∞ . We will study
such comparisons systematically in Section 7.4.
Next we proceed to the setting of Fano’s inequality where the estimate X̂ is made on the basis of
some observation Y correlated with X. We will see that the proof of the previous theorem trivially
generalizes to this new case of possibly randomized estimators. Though not needed in the proof,
it is worth mentioning that the best estimator minimizing the probability of error P[X 6= X̂] is the
maximum posterior (MAP) rule, i.e., the posterior mode: X̂(y) = argmaxx PX|Y (x|y).

Theorem 3.12 (Fano’s inequality) Let |X | = M < ∞ and X → Y → X̂. Let Pe = P[X 6= X̂],
then

H(X|Y) ≤ FM (1 − Pe ) = Pe log(M − 1) + h(Pe ) . (3.18)

Furthermore, if $P_{\max} \triangleq \max_{x\in\mathcal{X}} P_X(x) > 0$, then regardless of $|\mathcal{X}|$,
$$I(X;Y) \ge (1-P_e)\log\frac{1}{P_{\max}} - h(P_e)\,. \qquad (3.19)$$

Proof. To show (3.18) we apply data processing (for divergence) to $P_{X,Y,\hat X} = P_X P_{Y|X} P_{\hat X|Y}$ vs
$Q_{X,Y,\hat X} = U_X P_Y P_{\hat X|Y}$ and the data processor (kernel) $(X,Y,\hat X)\mapsto 1\{X\neq\hat X\}$ (note that $P_{\hat X|Y}$ is
identical for both).


To show (3.19) we apply data processing (for divergence) to $P_{X,Y,\hat X} = P_X P_{Y|X} P_{\hat X|Y}$ vs $Q_{X,Y,\hat X} = P_X P_Y P_{\hat X|Y}$ and the data processor (kernel) $(X,Y,\hat X)\mapsto 1\{X\neq\hat X\}$ to obtain:
$$I(X;Y) = D(P_{X,Y,\hat X}\|Q_{X,Y,\hat X}) \ge d(P[X=\hat X]\|Q[X=\hat X]) \ge -h(P_e) + (1-P_e)\log\frac{1}{Q[X=\hat X]} \ge -h(P_e) - (1-P_e)\log P_{\max}\,,$$
where the last step follows from $Q[X=\hat X]\le P_{\max}$ since $X\perp\!\!\!\perp\hat X$ under $Q$. (Again, we refer to
Ex. I.25 for a direct proof.)
The following corollary of the previous result emphasizes its role in providing converses (or
impossibility results) for statistics and data transmission.

Corollary 3.13 (Lower bound on average probability of error) Let $W\to X\to Y\to\hat W$, where $W$ is uniform on $[M] \triangleq \{1,\ldots,M\}$. Then
$$P_e \triangleq P[W\neq\hat W] \ge 1 - \frac{I(X;Y) + h(P_e)}{\log M} \qquad (3.20)$$
$$\ge 1 - \frac{I(X;Y) + \log 2}{\log M}\,. \qquad (3.21)$$

Proof. Apply Theorem 3.12 and the data processing for mutual information: I(W; Ŵ) ≤ I(X; Y).
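In applications one often uses (3.21) as a black-box lower bound on the error probability. A minimal helper (ours, not from the text), working in bits so that $\log 2 = 1$ bit:

```python
import numpy as np

def fano_lower_bound(mutual_info_bits, M):
    """Lower bound (3.21) on P[W != What] for W uniform on [M]; mutual information in bits."""
    return max(0.0, 1 - (mutual_info_bits + 1) / np.log2(M))

# Example: to identify one of M = 1024 messages with error probability below 1%,
# the channel must carry at least 0.99*10 - 1 = 8.9 bits; with only 5 bits, Pe >= 0.4.
print(fano_lower_bound(5.0, 1024))
```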

3.7* Estimation error in Gaussian noise (I-MMSE)


In previous section we considered estimating a discrete random variable X on the basis of obser-
vation Y and showed bounds on the probability of reconstruction error. Here we consider a more
general case of X ∈ Rd and a quadratic loss, which is also known in signal processing as the
mean-squared error (MSE). Specifically, whenever E[kXk2 ] < ∞ we define

mmse(X|Y) ≜ E[kX − E[X|Y]k2 ] ,

where MMSE stands for minimum MSE (which follows from the fact that the best estimator of X
given Y is precisely E[X|Y]). Just like Fano’s inequality one can derive inequalities relating I(X; Y)
and mmse(X|Y). For example, from Tao’s inequality (see Corollary 7.11) one can easily get for
the case where X ∈ [−1, 1] that
$$0 \le \operatorname{Var}(X) - \operatorname{mmse}(X|Y) \le \frac{2}{\log e}\, I(X;Y)\,,$$
which shows that the variance reduction of X due to Y is at most proportional to their mutual
information. (Simply notice that E[| E[X|Y] − E[X]|2 ] = Var(X) − mmse(X|Y)).
However, this section is not about such inequalities. Here we show a remarkable equality for
the special case when Y is an observation of X corrupted by Gaussian noise. As applications of


this identity we will derive stochastic localization in Exercise I.66 and entropy power inequality
in Theorem 3.16.

Theorem 3.14 (I-MMSE [205]) Let $X\in\mathbb{R}^d$ be independent of $Z\sim\mathcal{N}(0,I_d)$. If $\mathbb{E}[\|X\|^2] < \infty$
then for all $\gamma > 0$ we have
$$\frac{d}{d\gamma} I(X;\sqrt{\gamma}X + Z) = \frac{\log e}{2}\operatorname{mmse}(X|\sqrt{\gamma}X + Z) \qquad (3.22)$$
$$\frac{d^2}{d\gamma^2} I(X;\sqrt{\gamma}X + Z) = -\frac{\log e}{2}\,\mathbb{E}\big[\|\Sigma_\gamma(\sqrt{\gamma}X + Z)\|_F^2\big]\,, \qquad (3.23)$$
where $\Sigma_\gamma(y) = \operatorname{Cov}(X|\sqrt{\gamma}X + Z = y)$ and $\|\cdot\|_F$ is the Frobenius norm of the matrix.

As a simple example, for Gaussian X, one may verify (3.22) by combining the mutual
information in Example 3.3 with the MMSE in Example 28.1.
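For a scalar Gaussian input the identity (3.22) can indeed be checked in a couple of lines, since both $I$ and mmse have closed forms. A sketch (ours; in nats, so $\log e = 1$):

```python
import numpy as np

# Scalar Gaussian input X ~ N(0,1): I(gamma) = (1/2)ln(1+gamma) nats, mmse(gamma) = 1/(1+gamma).
I = lambda g: 0.5 * np.log(1 + g)
mmse = lambda g: 1 / (1 + g)

gamma, eps = 2.0, 1e-5
numeric_derivative = (I(gamma + eps) - I(gamma - eps)) / (2 * eps)
print(numeric_derivative, 0.5 * mmse(gamma))   # both ~ 1/6, matching (3.22) with log e = 1
```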
Before proving Theorem 3.14 we start with some notation and preliminary results. Let $I\subset\mathbb{R}$
be an open interval, $\mu$ a (positive) measure on $\mathbb{R}^d$, and $K, L : \mathbb{R}^d\times I\to\mathbb{R}$ such that the following
conditions are met: a) $\int K(x,\theta)\,\mu(dx)$ exists for all $\theta\in I$; b) $\int_{\mathbb{R}^d}\mu(dx)\int_I d\theta\,|L(x,\theta)| < \infty$; c)
$t\mapsto\int_{\mathbb{R}^d}\mu(dx)\,L(x,t)$ is continuous and d) we have
$$\frac{\partial}{\partial\theta}K(x,\theta) = L(x,\theta)$$
for all $x,\theta$. Then
$$\frac{\partial}{\partial\theta}\int_{\mathbb{R}^d} K(x,\theta)\,\mu(dx) = \int_{\mathbb{R}^d} L(x,\theta)\,\mu(dx)\,. \qquad (3.24)$$
(To see this, take $\theta > \theta_0\in I$ and write $K(x,\theta) = K(x,\theta_0) + \int_{\theta_0}^\theta L(x,t)\,dt$. Now we can integrate
this over $x$ and interchange the order of integrals to get $\int\mu(dx)\,K(x,\theta) = \text{constant} + \int_{\theta_0}^\theta g(t)\,dt$, where
$g(t) = \int\mu(dx)\,L(x,t)$ is continuous.) Note that in the case of a finite interval $I$ both conditions b) and
c) are implied by condition e): for all $t\in I$ we have $|L(x,t)|\le\ell(x)$ and $\ell$ is $\mu$-integrable.
Let $\varphi_a(x) = \frac{1}{(2\pi a)^{d/2}}e^{-\|x\|^2/(2a)}$ be the density of $\mathcal{N}(0, aI_d)$. Suppose $p$ is some probability
distribution and $f$ is a function; then we denote by $p*f(x) = \mathbb{E}_{X'\sim p}[f(x-X')]$, which coincides
with the usual convolution if $p$ is a density. In particular, the Gaussian convolution $p*\varphi_a$ is known
as a Gaussian mixture with mixing distribution $p$. For any differential operator $D = \frac{\partial^k}{\partial x_{i_1}\cdots\partial x_{i_k}}$ we
have
$$D(p*\varphi_a) = p*(D\varphi_a)\,. \qquad (3.25)$$
For $D = \frac{\partial}{\partial x_1}$ this follows from (3.24) by taking $\mu = p$, $K(x,\theta) = \varphi_a(y-x)$ where $y = (\theta, y_2,\ldots,y_d)$. Conditions for $L(x,\theta)$ follow from the fact that $D\varphi_a$ is uniformly bounded. The
case of general $D$ follows by induction.
As a next application we will show that
$$\frac{\partial}{\partial a}\,p*\varphi_a(y) = p*\Big(\frac{\partial}{\partial a}\varphi_a\Big)(y) = \frac{1}{2}\Delta(p*\varphi_a)(y)\,, \qquad (3.26)$$


where $\Delta f = \operatorname{tr}(\operatorname{Hess} f)$ is the Laplacian. Notice that the second equality follows from (3.25) and
the easily checked identity
$$\frac{\partial}{\partial a}\varphi_a(x) = \frac{1}{2a^2}\left(\|x\|^2 - ad\right)\varphi_a(x) = \frac{1}{2}\Delta\varphi_a(x)\,.$$
Thus, we only need to justify the first equality in (3.26). To that end, we use (3.24) with $\mu = p$,
$K(x,a) = \varphi_a(y-x)$ and $L(x,a) = \frac{\partial}{\partial a}K(x,a)$. Note that by the previous calculation we have
$\sup_x|\frac{\partial}{\partial a}\varphi_a(x)| < \infty$, and thus condition e) of (3.24) applies and so (3.24) implies (3.26).
The next lemma shows a special property of Gaussian convolution (derivatives of the log-convolution correspond to conditional moments).

Lemma 3.15 Let $Y = X + \sqrt{a}Z$, where $X\perp\!\!\!\perp Z$ and $X\sim p$ and $Z\sim\mathcal{N}(0,I_d)$. Then

1 If $\mathbb{E}[\|X\|] < \infty$ we have
$$\nabla\ln(p*\varphi_a)(y) = \frac{1}{a}\left(\mathbb{E}[X|Y=y] - y\right) \qquad (3.27)$$
and also
$$\|\nabla\ln p*\varphi_a(x)\| \le \frac{3}{a}\|x\| + \frac{4}{a}\,\mathbb{E}[\|X\|]\,. \qquad (3.28)$$
2 If $\mathbb{E}[\|X\|^2] < \infty$ we have
$$\operatorname{Hess}\ln(p*\varphi_a)(y) = \frac{1}{a^2}\operatorname{Cov}[X|Y=y] - \frac{1}{a}I_d\,. \qquad (3.29)$$

Proof. Notice that since $Y\sim p*\varphi_a$ we have
$$\mathbb{E}[X|Y=y] = \frac{1}{p*\varphi_a(y)}\int_{\mathbb{R}^d} x\,p(x)\varphi_a(y-x)\,dx = \frac{1}{p*\varphi_a(y)}\int_{\mathbb{R}^d}(y-x)\,p(y-x)\varphi_a(x)\,dx$$
$$\stackrel{(a)}{=} y + \frac{a}{p*\varphi_a(y)}\int p(y-x)(\nabla\varphi_a(x))\,dx \stackrel{(b)}{=} y + \frac{a}{p*\varphi_a(y)}\nabla(p*\varphi_a)(y)\,,$$
where (a) follows from the fact that $\nabla\varphi_a(x) = -\frac{x}{a}\varphi_a(x)$ and (b) from (3.25). The proof of (3.27)
is completed after noticing $\frac{1}{p*\varphi_a(y)}\nabla(p*\varphi_a)(y) = \nabla\ln(p*\varphi_a)(y)$. The technical estimate (3.28) is
shown in [344, Proposition 2].
The identity (3.29) is shown entirely similarly.

With these preparations we are ready to prove the Theorem.

Proof of Theorem 3.14. For simplicity, in this proof we compute all informations and entropies
with natural base, so log = ln. With these preparations we can show (3.22). First, let a = 1/γ and
notice
$$I(X;\sqrt{\gamma}X + Z) = I(X; X + \sqrt{a}Z) = h(X + \sqrt{a}Z) - \frac{d}{2}\ln(2\pi e a)\,,$$


where we computed differential entropy of the Gaussian via Theorem 2.8. Thus, the proof is
completed if we show
$$\frac{d}{da}\,h(X + \sqrt{a}Z) = \frac{d}{2a} - \frac{1}{2a^2}\operatorname{mmse}(X|Y_a)\,, \qquad (3.30)$$
where we defined $Y_a = X + \sqrt{a}Z$. Let the law of $X$ be $p$. Conceptually, the computation is just a
few lines:
$$-\frac{d}{da}\,h(X+\sqrt{a}Z) = \frac{d}{da}\int (p*\varphi_a)(x)\ln(p*\varphi_a)(x)\,dx
\stackrel{(a)}{=} \int\frac{\partial}{\partial a}\left[(p*\varphi_a)(x)\ln(p*\varphi_a)(x)\right]dx
\stackrel{(b)}{=} \int\frac{1}{2}\left(1 + \ln p*\varphi_a\right)\Delta(p*\varphi_a)\,dx
\stackrel{(c)}{=} \int\frac{1}{2}(p*\varphi_a)\Delta(\ln p*\varphi_a)\,dx
\stackrel{(d)}{=} \frac{1}{2}\int (p*\varphi_a)(y)\Big(\frac{1}{a^2}\operatorname{mmse}(X|Y_a = y) - \frac{d}{a}\Big)dy\,,$$
where (a) and (c) will require technical justifications, while (b) is just (3.26) and (d) is by taking
trace of (3.29). Note that (a) is just interchange of differentiation and integration, while (c) is
simply the “self-adjointness” of Laplacian.
We proceed to justifying (a). We will apply (3.24) with μ = Leb, I = (a1 , a2 ) some finite
interval, K(x, a) = (p ∗ ϕa )(x) ln(p ∗ ϕa )(x) and
$$L(x,a) = \frac{\partial}{\partial a}K(x,a) = \frac{1}{2}\left(1 + \ln(p*\varphi_a)(x)\right)(p*\Delta\varphi_a)(x)\,,$$
where we again used (3.26).
Integrating (3.28) we get
$$|\ln p*\varphi_a(x) - \ln p*\varphi_a(0)| \le \frac{3}{2a}\|x\|^2 + \frac{4}{a}\|x\|\,\mathbb{E}[\|X\|]\,.$$
Since $p*\varphi_a(0)\le\varphi_a(0)$ we get that for all $a\in(a_1,a_2)$ we have for some $c>0$:
$$|\ln p*\varphi_a(x)| \le c(1 + \|x\| + \|x\|^2)\,. \qquad (3.31)$$
From this estimate we note that
$$|K(x,a)| \le c(1 + \|x\| + \|x\|^2)(p*\varphi_a)(x)\,.$$
The integral of the right-hand side over $x$ is simply $c(1 + \mathbb{E}[\|Y_a\| + \|Y_a\|^2]) < \infty$, which confirms
condition a) of (3.24).
Next, we notice that for any differential operator $D$ we have $D\varphi_a(x) = f(x)\varphi_a(x)$ where $f$ is some
polynomial in $x$. Since for $a < a_2$ we have $\sup_x\big|\frac{f(x)\varphi_a(x)}{\varphi_{a_2}(x)}\big| < \infty$ we have that for some constant $c'$
and all $a < a_2$ and all $x$ we have
$$|D(p*\varphi_a)(x)| \le c'\,p*\varphi_{a_2}(x)\,, \qquad (3.32)$$


where we used (3.25) as well. Thus, for L(x, a) we can see that the first term is bounded by (3.31)
and the second by the previous display, so that overall
    L(x, a) ≤ (cc′/2) (2 + kxk + kxk²) (p ∗ ϕa2 )(x) .
2
Since again the right-hand side is integrable, we see that condition e) of (3.24) applies and thus
the interchange of differentiation and integration in step (a) is valid.
Finally, we argue that step (c) is applicable. To that end we prove an auxiliary result first: If
u, v are two univariate twice-differentiable functions with a) ∫_R |u″ v| and ∫_R |v″ u| both finite and
b) ∫_R |u′ v′ | < ∞ then

    ∫_R u″ v = ∫_R v″ u .                                                    (3.33)

Indeed, from condition b) there must exist a sequence cn → +∞, bn → −∞ such that
|u′ (cn )v′ (cn )| + |u′ (bn )v′ (bn )| → 0. On the other hand, from condition a) we have
    lim_{n→∞} ∫_{bn}^{cn} u″ v = ∫_R u″ v ,

and similarly for ∫_R v″ u. Now applying integration by parts we have

    ∫_{bn}^{cn} u″ v = u′(cn )v′(cn ) − u′(bn )v′(bn ) + ∫_{bn}^{cn} v″ u ,

and the first two terms vanish with n.


Next, consider multivariate twice-differentiable functions U, V with a) ∫_{Rd} |V (∂²/∂xi²) U| and
∫_{Rd} |U (∂²/∂xi²) V| both finite and b) ∫_{Rd} k∇Uk k∇Vk < ∞. Then

    ∫_{Rd} V ∆U = ∫_{Rd} U ∆V .                                              (3.34)

We write x = (x1 , xd2 ) by grouping the last (d − 1) coordinates together. Fix xd2 and define
u(x1 ) = U(x1 , x2 , · · · , xd ) and v(x1 ) = V(x1 , x2 , · · · , xd ). For Lebesgue-a.e. xd2 we see that u, v
satisfy conditions for (3.33). Thus, we obtain that for such xd2 we have

    ∫_R V(x) (∂²/∂x1²) U(x) dx1 = ∫_R U(x) (∂²/∂x1²) V(x) dx1 .

Integrating this over xd2 we get

    ∫_{Rd} V(x) (∂²/∂x1²) U(x) dx = ∫_{Rd} U(x) (∂²/∂x1²) V(x) dx .
Now, to justify step (c) we have to verify that U(x) = 1 + ln(p ∗ ϕa )(x) and V(x) = p ∗ ϕa (x)
satisfy conditions of the previous result. To that end, notice that from (3.29) we have |(∂²/∂yi²) U(y)| ≤
(1/a²) E[Xi²|Ya = y] + 1/a and thus

    ∫_{Rd} |V (∂²/∂xi²) U| = Oa (E[Xi²]) < ∞ .


On the other hand, from (3.25) and (3.32) we have |(∂²/∂yi²) V(y)| ≤ c′ p ∗ ϕa2 (y). From (3.31) then we
obtain

    ∫_{Rd} |U (∂²/∂xi²) V| ≤ cc′ ∫ (1 + kxk + kxk²) p ∗ ϕa2 (x) = cc′ E[1 + kYa2 k + kYa2 k²] < ∞ .

Finally, for showing ∫_{Rd} k∇Uk k∇Vk < ∞ we apply (3.28) to estimate k∇Uk ≲a 1 + kyk and
use (3.32) to estimate k∇Vk ≲a p ∗ ϕa2 (x). Thus, we have

    ∫_{Rd} k∇Uk k∇Vk ≲a E[1 + kYa2 k] < ∞ .
This completes verification of conditions and we conclude ∫ U∆V = ∫ V∆U as required for step (c).
The identity (3.23) is obtained by differentiating the function γ 7→ mmse(X|√γ X + Z) using very
similar methods. We refer to [206] for full justification.

Remark 3.7 (Tensorization of I-MMSE) We proved the I-MMSE identity for d-dimensional
vectors directly. However, it turns out that the one-dimensional version implies the
d-dimensional version. Specifically, suppose the 1D version of (3.22) is already proven. Let us
denote X = (X1 , . . . , Xd ), Y = (Y1 , . . . , Yd ), Z = (Z1 , . . . , Zd ) as in (3.22). However, now let
γ = (γ1 , . . . , γd ) be a vector and we set Yj = √γj Xj + Zj . We are interested in computing the
derivative of I(X; Y) along the diagonal γ = (γ, . . . , γ). To that end, denote by Y∼j = {Yi : i ≠ j}
and notice that by the chain rule we have

    (∂/∂γj ) I(X; Y) = (∂/∂γj ) I(Xj ; Yj |Y∼j ) .                           (3.35)

Similarly, notice that Ey∼j ∼PY∼j [mmse(Xj |Yj , Y∼j = y∼j )] = mmse(Xj |Y). Thus, applying the 1D-
version of (3.22) we get

    (∂/∂γj ) I(X; Y) = (log e/2) mmse(Xj |Y) .

Now since mmse(X|Y) = Σ_j mmse(Xj |Y), by summing (3.35) over j we obtain the d-dimensional
version of (3.22). Note that we computed the derivative in a scalar parameter γ by introducing a
vector one γ and then using the chain rule to simplify partial derivatives. This idea is the basis
of area theorem in information theory [360, Lemma 3] and Guerra interpolation in statistical
physics [410].
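The identity (3.22) can also be checked numerically. The snippet below (our illustration; a binary prior is chosen only because E[X|Y = y] = tanh(√γ y) is then explicit) estimates d/dγ I(X; √γ X + Z) by finite differences and compares it with (1/2) mmse(X|√γ X + Z), all in nats.

```python
# Numerical check of the I-MMSE identity (3.22) for a binary equiprobable
# input X in {-1,+1}; an added illustration, not from the text.
import numpy as np

grid = np.linspace(-12, 12, 400001)
phi = lambda u: np.exp(-u**2 / 2) / np.sqrt(2*np.pi)    # N(0,1) density

def mi(gamma):                    # I(X; sqrt(gamma) X + Z) in nats
    s = np.sqrt(gamma)
    fY = 0.5*phi(grid - s) + 0.5*phi(grid + s)
    hY = -np.trapz(fY * np.log(fY), grid)
    return hY - 0.5*np.log(2*np.pi*np.e)

def mmse(gamma):                  # E[(X - E[X|Y])^2], E[X|Y=y] = tanh(sqrt(gamma) y)
    s = np.sqrt(gamma)
    fY = 0.5*phi(grid - s) + 0.5*phi(grid + s)
    return 1 - np.trapz(fY * np.tanh(s*grid)**2, grid)

g, dg = 1.0, 1e-4
lhs = (mi(g + dg) - mi(g - dg)) / (2*dg)    # derivative of mutual information
rhs = 0.5 * mmse(g)                         # (log e / 2) mmse, with log = ln
print(lhs, rhs)                             # the two numbers agree closely
```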

3.8* Entropy-power inequality


As an application of the last section’s result we demonstrate an important relation between the additive
structure of Rn and entropy. To state the result, recall that from (2.18) an iid Gaussian vector Zd
with coordinates of power σ² has differential entropy h(Zd ) = (d/2) log(2πeσ²). Correspondingly,


for any Rd -valued random variable X we define its entropy power to be


    N(X) ≜ (1/(2πe)) exp{(2/d) h(X)} .
Note that by Theorem 2.8 the entropy power is maximized (under second moment constraint) by
an iid Gaussian vector. Thus, the result that we prove next can be interpreted as a statement that
convolution increases Gaussianity.4 The result was conjectured by Shannon and proved by Stam.

Theorem 3.16 (Entropy power inequality (EPI) [399]) Suppose A1 ⊥ ⊥ A2 are inde-
pendent Rd -valued random variables with finite second moments E[kAi k2 ] < ∞, i ∈ {1, 2}.
Then

N(A1 + A2 ) ≥ N(A1 ) + N(A2 ) .



Remark 3.8 (Costa's EPI) Consider the case of A2 = √a Z, Z ∼ N (0, Id ). Then EPI is
equivalent to the statement that (d/da) N(A1 + √a Z) ≥ 1. For this special case, Costa [100] established
a much stronger property: that a 7→ N(A1 + √a Z) is concave. A further improvement for this case,
in terms of the FI-curve (cf. Definition 16.5), is proposed in [103].

Proof. We present an elegant proof of [437]. First, an observation of Lieb [280] shows that EPI
is equivalent to proving: For all U1 ⊥⊥ U2 and α ∈ [0, 2π) we have

    h(U1 cos α + U2 sin α) ≥ h(U1 ) cos² α + h(U2 ) sin² α .                 (3.36)

(To see that (3.36) implies EPI simply take cos² α = N(A1 )/(N(A1 ) + N(A2 )) and U1 = A1 / cos α, U2 =
A2 / sin α.)
Next, we claim that proving (3.36) for general Ui is equivalent to proving it for their “smoothed”
versions, i.e. Ũi = Ui + √ϵ Zi , where Zi ∼ N (0, Id ) is independent of U1 , U2 . Indeed, this technical
continuity result follows, for example, from [344, Prop. 1], which shows that whenever E[kUi k²] <
∞ then the function ϵ 7→ h(Ui + √ϵ Zi ) is continuous and in fact h(Ui + √ϵ Z) = h(Ui ) + O(√ϵ) as
ϵ → 0.
In other words, to prove Lieb’s EPI it is sufficient to prove for all ϵ > 0
    h(X + √ϵ Z) ≥ h(U1 + √ϵ Z1 ) cos² α + h(U2 + √ϵ Z2 ) sin² α ,

where we also defined X ≜ U1 cos α+ U2 sin α, Z ≜ Z1 cos α+ Z2 sin α. Since the above inequality
is scale-invariant, we can equivalently show for all γ ≥ 0 the following:
    h(√γ X + Z) ≥ h(√γ U1 + Z1 ) cos² α + h(√γ U2 + Z2 ) sin² α .

4
Another deep manifestation of this phenomenon is in the context of CLT. Barron’s entropic CLT states that for iid Xi ’s
with zero mean and unit variance, the KL divergence D( X1 +...+X

n
n
kN (0, 1)), whenever finite, converges to zero. This
convergence is in fact monotonic as shown in [27, 102].


As a final simplification, we replace differential entropies by mutual informations. That is, the
proof is completed if we show

    I(X; √γ X + Z) ≥ I(U1 ; √γ U1 + Z1 ) cos² α + I(U2 ; √γ U2 + Z2 ) sin² α .
This last inequality clearly holds for γ = 0. Thus it is sufficient to prove the same inequality for
derivatives (in γ ) of both sides and then integrate from 0 to γ . Computing derivatives via (3.22)
we get to prove
    mmse(X|√γ X + Z) ≥ mmse(U1 |√γ U1 + Z1 ) cos² α + mmse(U2 |√γ U2 + Z2 ) sin² α .

But this latter inequality is very simple to argue, since clearly


    mmse(X|√γ X + Z) ≥ mmse(X|√γ U1 + Z1 , √γ U2 + Z2 ) .

On the other hand, for the right-hand side X is a sum of two conditionally independent terms and
thus
    mmse(X|√γ U1 + Z1 , √γ U2 + Z2 ) = mmse(U1 |√γ U1 + Z1 ) cos² α + mmse(U2 |√γ U2 + Z2 ) sin² α .

In Corollary 2.9 we have already seen how properties of differential entropy can be translated
to properties of positive definite matrices. Here is another application:

Corollary 3.17 (Minkowski inequality) Let Σ1 , Σ2 be real positive definite n × n matrices.


Then
    det(Σ1 + Σ2 )^{1/n} ≥ det(Σ1 )^{1/n} + det(Σ2 )^{1/n} .

Proof. Take Ai ∼ N (0, Σi ), use (2.18) and the EPI.


The EPI is a cornerstone of many information-theoretic arguments: for example, it was used to establish
the capacity region of the Gaussian broadcast channel [45]. However, its significance extends
throughout geometry and analysis, having deep implications for high-dimensional probability,
convex geometry and concentration of measure. As an example see Exercise I.65 which derives
the log-Sobolev inequality of Gross. Further discussions are outside of the scope of this book, but
we recommend reading [106, Chapter 16].
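For a quick numerical sanity check of the EPI, the sketch below (added here; it uses two independent Uniform[0, 1] variables, whose sum has the triangular density on [0, 2]) computes the three entropy powers on a grid and confirms N(A1 + A2 ) ≥ N(A1 ) + N(A2 ).

```python
# Numerical illustration of the EPI (Theorem 3.16) in one dimension for two
# independent Uniform[0,1] variables; an added example, not from the text.
import numpy as np

dx = 1e-4
x = np.arange(0, 1, dx)
f1 = np.ones_like(x)                 # density of A1 ~ Unif[0,1]
f2 = np.ones_like(x)                 # density of A2 ~ Unif[0,1]
f_sum = np.convolve(f1, f2) * dx     # density of A1 + A2 (triangular on [0,2])

def diff_entropy(f, dx):             # h = -int f ln f (nats), with 0 ln 0 = 0
    fz = f[f > 0]
    return -np.sum(fz * np.log(fz)) * dx

def N(h, d=1):                       # entropy power: exp(2h/d)/(2 pi e)
    return np.exp(2*h/d) / (2*np.pi*np.e)

h1, h2, h12 = diff_entropy(f1, dx), diff_entropy(f2, dx), diff_entropy(f_sum, dx)
print(N(h12), N(h1) + N(h2))         # approx 0.159 >= 0.117, as EPI predicts
```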


4 Variational characterizations and continuity of information measures

In this chapter we collect some results on variational characterizations of information measures. It


is a well known method in analysis to study a functional by proving a variational characterization
of the form F(x) = supλ∈Λ fλ (x) or F(x) = infμ∈M gμ (x). Such representations can be useful for
multiple purposes:

• Convexity: pointwise supremum of convex functions is convex.


• Regularity: pointwise supremum of lower semicontinuous (lsc) functions is lsc.
• Bounds: upper/lower bound on F follows by choosing any λ (μ) and evaluating fλ (gμ ).

We will see in this chapter that divergence has two different sup characterizations (over partitions
and over functions). The mutual information is more special. In addition to inheriting the ones
from KL divergence, it possesses two extra: an inf-representation over (centroid) measures QY
and a sup-representation over Markov kernels.
As applications of these variational characterizations, we discuss the Gibbs variational prin-
ciple, which serves as the basis of many modern algorithms in machine learning, including the
EM algorithm and variational autoencoders; see Section 4.4. An important theoretical construct
in machine learning is the idea of PAC-Bayes bounds (Section 4.8*).
From an information-theoretic point of view, variational characterizations are important because
they address the problem of continuity. We will discuss several types of continuity in this chapter.
First is the continuity in discretization. This is related to the issue of computation. For complicated
P and Q direct computation of D(PkQ) might be hard. Instead, one may want to discretize the
infinite alphabet and compute numerically the finite sum. Does this approximation work, i.e., as
the quantization becomes finer, are the resulting finite sums guaranteed to converge to the true
value of D(PkQ)? The answer is positive and this continuity with respect to discretization is the
content of Theorem 4.5.
Second, is the continuity under change of the distribution. For example, this arises in the prob-
lem of estimating information measures. In many statistical setups, oftentimes we do not know
P or Q, and we estimate the distribution by P̂n using n iid observations sampled from P (in dis-
crete cases we may set P̂n to be simply the empirical distribution). Does D(P̂n kQ) provide a good
estimator for D(PkQ)? Does D(P̂n kQ) → D(PkQ) if P̂n → P? The answer is delicate – see
Section 4.5.
Third, there is yet another kind of continuity: continuity “in the σ -algebra”. Despite the scary
name, this one is useful even in the most “discrete” situations. For example, imagine that θ ∼


Unif(0, 1) and Xi ∼ Ber(θ) i.i.d. Suppose that you observe a sequence of Xi ’s until the random moment
τ equal to the first occurrence of the pattern 0101. How much information about θ did you learn
by time τ ? We can encode these observations as

    Zj = Xj for j ≤ τ , and Zj = ? for j > τ ,
where ? designates the fact that we don’t know the value of Xj on those times. Then the question
we asked above is to compute I(θ; Z∞ ). We will show in this chapter that
    I(θ; Z∞ ) = lim_{n→∞} I(θ; Zn ) = Σ_{n=1}^{∞} I(θ; Zn |Zn−1 )            (4.1)

thus reducing computation to evaluating an infinite sum of simpler terms (not involving infinite-
dimensional vectors). Thus, even in this simple question about biased coin flips we have to
understand how to safely work with infinite-dimensional vectors.

4.1 Geometric interpretation of mutual information


Mutual information (MI) can be understood as a weighted “distance” from the conditional
distributions to the marginal distribution. Indeed, for discrete X, we have
    I(X; Y) = D(PY|X kPY |PX ) = Σ_{x∈X} D(PY|X=x kPY ) PX (x).

Furthermore, it turns out that PY , similar to the center of gravity, minimizes this weighted distance
and thus can be thought as the best approximation for the “center” of the collection of distribu-
tions {PY|X=x : x ∈ X } with weights given by PX . We formalize these results in this section and
start with the proof of a “golden formula”. Its importance is in bridging the two points of view
on mutual information. Recall that on one hand we had the Fano’s Definition 3.1, on the other
hand for discrete cases we had the Shannon’s definition (3.3) as difference of entropies. Then
next result (4.3) presents MI as the difference of relative entropies in the style of Shannon, while
retaining applicability to continuous spaces in the style of Fano.

Theorem 4.1 (Golden formula) For any QY we have


D(PY|X kQY |PX ) = I(X; Y) + D(PY kQY ). (4.2)
Thus, if D(PY kQY ) < ∞, then
I(X; Y) = D(PY|X kQY |PX ) − D(PY kQY ). (4.3)

Proof. In the discrete case and ignoring the possibility of dividing by zero, the argument is really
simple. We just need to write
   
    I(X; Y) (3.1)= EPX,Y [ log (PY|X /PY ) ] = EPX,Y [ log (PY|X QY /(PY QY )) ]


and then expand log (PY|X QY /(PY QY )) = log (PY|X /QY ) − log (PY /QY ). The argument below is a rigorous implementation
of this idea.
First, notice that by Theorem 2.16(e) we have D(PY|X kQY |PX ) ≥ D(PY kQY ) and thus if
D(PY kQY ) = ∞ then both sides of (4.2) are infinite. Thus, we assume D(PY kQY ) < ∞ and
in particular PY ≪ QY . Rewriting the LHS of (4.2) via the chain rule (2.24) we see that the Theorem
amounts to proving
    D(PX,Y kPX QY ) = D(PX,Y kPX PY ) + D(PY kQY ) .
The case of D(PX,Y kPX QY ) = D(PX,Y kPX PY ) = ∞ is clear. Thus, we can assume at least one of
these divergences is finite, and, hence, also PX,Y ≪ PX QY .
Let λ(y) = dPY /dQY (y). Since λ(Y) > 0 PY -a.s., applying the definition of Log in (2.10), we can
write

    EPY [log λ(Y)] = EPX,Y [ Log (λ(Y)/1) ] .                                (4.4)

Notice that the same λ(y) is also the density dPX PY /dPX QY (x, y) of the product measure PX PY with respect
to PX QY . Therefore, the RHS of (4.4) by (2.11) applied with μ = PX QY coincides with
D(PX,Y kPX QY ) − D(PX,Y kPX PY ) ,
while the LHS of (4.4) by (2.13) equals D(PY kQY ). Thus, we have shown the required
D(PY kQY ) = D(PX,Y kPX QY ) − D(PX,Y kPX PY ) .

By dropping the second term in (4.2) we obtain the following result.

Corollary 4.2 (Mutual information as center of gravity) For any QY we have


I(X; Y) ≤ D(PY|X kQY |PX )
and, consequently,
    I(X; Y) = min_{QY} D(PY|X kQY |PX ).                                     (4.5)

If I(X; Y) < ∞, the unique minimizer is QY = PY .

Remark 4.1 The variational representation (4.5) is useful for upper bounding mutual infor-
mation by choosing an appropriate QY . Indeed, often each distribution in the collection PY|X=x
is simple, but their mixture, PY , is very hard to work with. In these cases, choosing a suitable QY
in (4.5) provides a convenient upper bound. As an example, consider the AWGN channel Y = X+Z
in Example 3.3, where Var(X) = σ 2 , Z ∼ N (0, 1). Then, choosing the best possible Gaussian Q
and applying the above bound, we have:
    I(X; Y) ≤ inf_{μ∈R, s≥0} E[D(N (X, 1)kN (μ, s))] = (1/2) log(1 + σ²),
which is tight when X is Gaussian. For more examples and statistical applications, see Chapter 30.
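The bound above is easy to explore numerically. In the sketch below (an added illustration with X Gaussian, so that the answer (1/2) log(1 + σ²) is known) the expectation E[D(N (X, 1)kN (0, s))] is estimated by Monte Carlo for several variances s; the minimum over s is attained near s = 1 + σ².

```python
# Monte Carlo evaluation (in nats) of the upper bound in Remark 4.1 for the
# AWGN channel Y = X + Z with X ~ N(0, sigma^2); an added illustration.
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 2.0
X = rng.normal(0, np.sqrt(sigma2), size=200_000)

def kl_gauss(m1, v1, m2, v2):        # D(N(m1,v1) || N(m2,v2)) in nats
    return 0.5*(np.log(v2/v1) + (v1 + (m1 - m2)**2)/v2 - 1)

for s in [1.0, 2.0, 3.0, 5.0]:       # upper bounds for different choices of Q_Y = N(0,s)
    print(s, kl_gauss(X, 1.0, 0.0, s).mean())

print("exact I(X;Y) = 0.5*log(1+sigma^2):", 0.5*np.log(1 + sigma2))
```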


Theorem 4.3 (Mutual information as distance to product distributions)


    I(X; Y) = min_{QX ,QY} D(PX,Y kQX QY )

with the unique minimizer (QX , QY ) = (PX , PY ).

Proof. We only need to use the previous corollary and the chain rule (2.24):
    D(PX,Y kQX QY ) (2.24)= D(PY|X kQY |PX ) + D(PX kQX ) ≥ I(X; Y) .

Interestingly, the point of view in the previous result extends to conditional mutual information
as follows: We have

    I(X; Z|Y) = min_{QX,Y,Z : X→Y→Z} D(PX,Y,Z kQX,Y,Z ) ,                    (4.6)

where the minimization is over all QX,Y,Z = QX QY|X QZ|Y , cf. Section 3.4. Showing this character-
ization is very similar to the previous theorem. By repeating the same argument as in (4.2) we
get

D(PX,Y,Z kQX QY|X QZ|Y )


=D(PX,Y,Z kPX PY|X PZ|Y ) + D(PX kQX ) + D(PY|X kQY|X |PX ) + D(PZ|Y kQZ|Y |PY )
=D(PX,Y,Z kPY PX|Y PZ|Y ) + D(PX kQX ) + D(PY|X kQY|X |PX ) + D(PZ|Y kQZ|Y |PY )
    = D(PXZ|Y kPX|Y PZ|Y |PY ) + D(PX kQX ) + D(PY|X kQY|X |PX ) + D(PZ|Y kQZ|Y |PY )
    ≥ I(X; Z|Y) ,

where the last step uses that the first term equals I(X; Z|Y).

Characterization (4.6) can be understood as follows. The most general directed graphical model
for the triplet (X, Y, Z) is a 3-clique (triangle).

[Figure: the triangle (3-clique) on X, Y, Z, with the edge X → Z drawn dashed.]

What is the information flow on the dashed edge X → Z? To answer this, notice that removing
this edge restricts the joint distribution to a Markov chain X → Y → Z. Thus, it is natural to
ask what is the minimum (KL-divergence) distance between a given PX,Y,Z and the set of all
distributions QX,Y,Z satisfying the Markov chain constraint. By the above calculation, optimal
QX,Y,Z = PY PX|Y PZ|Y and hence the distance is I(X; Z|Y). For this reason, we may interpret I(X; Z|Y)
as the amount of information flowing through the X → Z edge.
In addition to inf-characterization, mutual information also has a sup-characterization.


Theorem 4.4 For any Markov kernel QX|Y such that QX|Y=y ≪ PX for PY -a.e. y we have

    I(X; Y) ≥ EPX,Y [ log dQX|Y /dPX ] .

If I(X; Y) < ∞ then

    I(X; Y) = sup_{QX|Y} EPX,Y [ log dQX|Y /dPX ] ,                          (4.7)

where the supremum is over Markov kernels QX|Y as in the first sentence.

Remark 4.2 Similar to how (4.5) is used to upper-bound I(X; Y) by choosing a good approx-
imation to PY , this result is used to lower-bound I(X; Y) by selecting a good (but computable)
approximation QX|Y to usually a very complicated posterior PX|Y . See Section 5.6 for applications.
Proof. Since modifying QX|Y=y on a negligible set of y’s does not change the expectations, we
will assume that QX|Y=y ≪ PX for every y. If I(X; Y) = ∞ then there is nothing to prove. So we
assume I(X; Y) < ∞, which implies PX,Y ≪ PX PY . Then by Lemma 3.3 we have that PX|Y=y ≪ PX
for almost every y. Choose any such y and apply (2.11) with μ = PX ; noticing that
Log ((dQX|Y=y /dPX )/1) = log dQX|Y=y /dPX we get

    EPX|Y=y [ log dQX|Y=y /dPX ] = D(PX|Y=y kPX ) − D(PX|Y=y kQX|Y=y ) ,
which is applicable since the first term is finite for a.e. y by (3.1). Taking expectation of the previous
identity over y we obtain

    EPX,Y [ log dQX|Y /dPX ] = I(X; Y) − D(PX|Y kQX|Y |PY ) ≤ I(X; Y) ,       (4.8)
implying the first part. The equality case in (4.7) follows by taking QX|Y = PX|Y , which satisfies
the conditions on Q when I(X; Y) < ∞.

4.2 Variational characterizations of divergence: Gelfand-Yaglom-Perez


The point of the following theorem is that divergence on general alphabets can be defined via
divergence on finite alphabets and discretization. Moreover, as the quantization becomes finer, we
approach the value of divergence.

Theorem 4.5 (Gelfand-Yaglom-Perez [182]) Let P, Q be two probability measures on X
with σ-algebra F . Then

    D(PkQ) = sup_{{E1 ,...,En }} Σ_{i=1}^{n} P[Ei ] log (P[Ei ]/Q[Ei ]) ,    (4.9)

where the supremum is over all finite F -measurable partitions: ∪_{j=1}^{n} Ej = X , Ej ∩ Ei = ∅, and
0 log(0/0) = 0 and log(1/0) = ∞ per our usual convention.


Remark 4.3 This theorem, in particular, allows us to prove all general identities and inequali-
ties for the cases of discrete random variables and then pass to the limit. In the case of mutual
information I(X; Y) = D(PX,Y kPX PY ), the partitions of X and Y can be chosen separately,
see (4.29).

Proof. “≥”: Fix a finite partition E1 , . . . En . Define a function (quantizer) f : X → {1, . . . , n} as


follows: For any x, let f(x) denote the index j of the set Ej to which x belongs. Let X be distributed
according to either P or Q and set Y = f(X). Applying data processing inequality for divergence
yields

    D(PkQ) = D(PX kQX )
           ≥ D(PY kQY )                                                      (4.10)
           = Σ_i P(Ei ) log (P(Ei )/Q(Ei )) .

“≤”: To show D(PkQ) is indeed achievable, first note that if P ̸≪ Q, then by definition, there
exists B such that Q(B) = 0 < P(B). Choosing the partition E1 = B and E2 = Bc , we have
D(PkQ) = ∞ = Σ_{i=1}^{2} P[Ei ] log (P[Ei ]/Q[Ei ]). In the sequel we assume that P ≪ Q and let X = dP/dQ.
Then D(PkQ) = EQ [X log X] = EQ [φ(X)] by (2.4). Note that φ(x) ≥ 0 if and only if x ≥ 1. By
monotone convergence theorem, we have EQ [φ(X)1 {X < c}] → D(PkQ) as c → ∞, regardless
of the finiteness of D(PkQ).
Next, we construct a finite partition. Let n = c/ϵ be an integer and for j = 0, . . . , n − 1, let
Ej = {jϵ ≤ X < (j + 1)ϵ} and En = {X ≥ c}. Define Y = ϵ⌊X/ϵ⌋ as the quantized version. Since φ
is uniformly continuous on [0, c], for any x, y ∈ [0, c] such that |x − y| ≤ ϵ, we have |φ(x) − φ(y)| ≤ ϵ′ for
some ϵ′ = ϵ′ (ϵ, c) such that ϵ′ → 0 as ϵ → 0. Then EQ [φ(Y)1 {X < c}] ≥ EQ [φ(X)1 {X < c}] − ϵ′ .
Moreover,

    EQ [φ(Y)1 {X < c}] = Σ_{j=0}^{n−1} φ(jϵ) Q(Ej ) ≤ ϵ′ + Σ_{j=0}^{n−1} φ(P(Ej )/Q(Ej )) Q(Ej )
                       ≤ ϵ′ + Q(X ≥ c) log e + Σ_{j=0}^{n} P(Ej ) log (P(Ej )/Q(Ej )) ,

where the first inequality applies the uniform continuity of φ since jϵ ≤ P(Ej )/Q(Ej ) < (j + 1)ϵ, and the
second applies φ ≥ − log e. As Q(X ≥ c) → 0 as c → ∞, the proof is completed by first sending
ϵ → 0 then c → ∞.
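The convergence in (4.9) under refinement can be observed numerically. The snippet below (our illustration, with P = N (0, 1) and Q = N (1, 1), so that D(PkQ) = 1/2 nat) evaluates the partition sum over nested partitions of increasing size.

```python
# Approximating D(P||Q) by the finite-partition supremum (4.9), assuming
# P = N(0,1) and Q = N(1,1), so the true value is 1/2 nat; an illustration.
import numpy as np
from scipy.stats import norm

def quantized_kl(n_bins, c=8.0):
    # partition R into n_bins equal cells on [-c, c] plus the two tails
    edges = np.concatenate(([-np.inf], np.linspace(-c, c, n_bins + 1), [np.inf]))
    P = np.diff(norm.cdf(edges, loc=0, scale=1))
    Q = np.diff(norm.cdf(edges, loc=1, scale=1))
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

for n in [2, 8, 32, 128, 512]:       # nested partitions: each refines the previous
    print(n, quantized_kl(n))        # increases towards the true value 0.5
```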

4.3 Variational characterizations of divergence: Donsker-Varadhan


The following is perhaps the most important variational characterization of divergence. We remind
of our convention exp{−∞} = 0, log 0 = −∞.


Theorem 4.6 (Donsker-Varadhan [134]) Let P, Q be probability measures on X and


denote a class of functions CQ = {f : X → R ∪ {−∞} : 0 < EQ [exp{f(X)}] < ∞}. Then
    D(PkQ) = sup_{f∈CQ} ( EP [f(X)] − log EQ [exp{f(X)}] ) .                 (4.11)

In particular, if D(PkQ) < ∞ then EP [f(X)] is well-defined and < ∞ for every f ∈ CQ . The
identity (4.11) holds with CQ replaced by the class of all R-valued simple functions. If X is a
normal topological space (e.g., a metric space) with the Borel σ -algebra, then also
    D(PkQ) = sup_{f∈Cb} ( EP [f(X)] − log EQ [exp{f(X)}] ) ,                 (4.12)

where Cb is the class of all bounded continuous functions.

Proof. “D ≥ supf∈CQ ”: We can assume for this part that D(PkQ) < ∞, since otherwise there is
nothing to prove. Then fix f ∈ CQ and define a probability measure Qf (tilted version of Q) via
Qf (dx) = exp{f(x) − ψf }Q(dx) , ψf ≜ log EQ [exp{f(X)}] . (4.13)
Then, Qf ≪ Q. We will apply (2.11) next with reference measure μ = Q. Note that according
to (2.10) we always have Log (exp{f(x) − ψf }/1) = f(x) − ψf even when f(x) = −∞. Thus, we get
from (2.11)

    EP [f(X)] − ψf = EP [ Log ((dQf /dQ)/1) ] = D(PkQ) − D(PkQf ) ≤ D(PkQ) .
Note that (2.11) also implies that if D(PkQ) < ∞ and f ∈ CQ the expectation EP [f] is well-defined.
“D ≤ supf ” with supremum over all simple functions: The idea is to just take f = log dP/dQ;
however, to handle all cases we proceed more carefully. First, notice that if P ̸≪ Q then for some
E with Q[E] = 0 < P[E] and c → ∞ taking f = c1E shows that both sides of (4.11) are infinite.
Thus, we assume P ≪ Q. For any partition X = ∪_{j=1}^{n} Ej we set f = Σ_{j=1}^{n} 1Ej log (P[Ej ]/Q[Ej ]). Then
the right-hand sides of (4.11) and (4.9) evaluate to the same value and hence by Theorem 4.5 we
obtain that supremum over simple functions (and thus over CQ ) is at least as large as D(PkQ).
Finally, to show (4.12), we show that for every simple function f there exists a continuous
bounded f′ such that EP [f′ ] − log EQ [exp{f′ }] is arbitrarily close to the same functional evaluated
at f. To that end we first show that for any a ∈ R and measurable A ⊂ X there exists a sequence
of continuous bounded fn such that
EP [fn ] → aP[A], and EQ [exp{fn }] → exp{a}Q[A] (4.14)
hold simultaneously, i.e. fn → a1A in the sense of approximating both expectations. We only
consider the case of a > 0 below. Let compact F and open U be such that F ⊂ A ⊂ U and
max(P[U] − P[F], Q[U] − Q[F]) ≤ ϵ. Such F and U exist whenever P and Q are so-called regular
measures. Without going into details, we just notice that finite measures on Polish spaces are
automatically regular. Then by Urysohn’s lemma there exists a continuous function fϵ : X → [0, a]
equal to a on F and 0 on Uc . Then we have
aP[F] ≤ EP [fϵ ] ≤ aP[U]


exp{a}Q[F] ≤ EQ [exp{fϵ }] ≤ exp{a}Q[U] .

Subtracting aP[A] and exp{a}Q[A] for each of these inequalities, respectively, we see that taking
ϵ → 0 indeed results in a sequence of functions satisfying (4.14).
Similarly, if we want to approximate a general simple function g = Σ_{i=1}^{n} ai 1Ai (with Ai disjoint
and |ai | ≤ amax < ∞) we fix ϵ > 0 and define functions fi,ϵ approximating ai 1Ai as above with
sets Fi ⊂ Ai ⊂ Ui , so that S ≜ ∪i (Ui \ Fi ) satisfies max(P[S], Q[S]) ≤ nϵ. We also have

    |Σ_i fi,ϵ − g| ≤ amax Σ_i 1Ui \Fi ≤ n amax 1S .

We then clearly have |EP [Σ_i fi,ϵ ] − EP [g]| ≤ amax n² ϵ. On the other hand, we also have

    Σ_i exp{ai }Q[Fi ] ≤ EQ [exp{Σ_i fi,ϵ }]
                      ≤ EQ [exp{g}1Sc ] + exp{n amax }Q[S] ≤ EQ [exp{g}] + exp{n amax }nϵ .

Hence taking ϵ → 0 the sum Σ_i fi,ϵ → Σ_i ai 1Ai in the sense of both EP [·] and EQ [exp{·}].

Remark 4.4 1 What is the Donsker-Varadhan representation useful for? By setting f(x) =
ϵ · g(x) with ϵ  1 and linearizing exp and log we can see that when D(PkQ) is small, expec-
tations under P can be approximated by expectations over Q (change of measure): EP [g(X)] ≈
EQ [g(X)]. This holds for all functions g with finite exponential moment under Q. Total variation
distance provides a similar bound, but for a narrower class of bounded functions:

| EP [g(X)] − EQ [g(X)]| ≤ kgk∞ TV(P, Q) .

2 More formally, the inequality EP [f(X)] ≤ log EQ [exp f(X)] + D(PkQ) is useful in estimating
EP [f(X)] for complicated distribution P (e.g. over high-dimensional X with weakly dependent
coordinates) by making a smart choice of Q (e.g. with iid components).
3 In Chapter 5 we will show that D(PkQ) is convex in P (in fact, in the pair). A general method
of obtaining variational formulas like (4.11) is via the Young-Fenchel duality, which we review
below in (7.84). Indeed, (4.11) is exactly that inequality since the Fenchel-Legendre conjugate
of D(·kQ) is given by a convex map f 7→ ψf . For more details, see Section 7.13.
4 Donsker-Varadhan should also be seen as an “improved version” of the DPI. For example, one
of the main applications of the DPI in this book is in obtaining estimates like

    P[A] log (1/Q[A]) ≤ D(PkQ) + log 2 ,                                     (4.15)

which is the basis of the large deviations theory (Corollary 2.19 and Chapter 15) and Fano’s
inequality (Theorem 3.12). The same estimate can be obtained by applying (4.11) with f(x) =
1 {x ∈ A} log(1/Q[A]).
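To see (4.11) in action, the following Monte Carlo sketch (added here, not from the text) restricts the supremum to the linear family f(x) = ax with P = N (m, 1) and Q = N (0, 1); the best a is close to m, and the optimal value approaches D(PkQ) = m²/2.

```python
# Donsker-Varadhan (4.11) with the simple parametric family f(x) = a*x, for
# P = N(m,1) and Q = N(0,1); the supremum over a recovers D(P||Q) = m^2/2.
import numpy as np

rng = np.random.default_rng(1)
m = 1.5
xP = rng.normal(m, 1, 500_000)       # samples from P
xQ = rng.normal(0, 1, 500_000)       # samples from Q

def dv_objective(a):                 # E_P[f] - log E_Q[exp f] with f(x) = a x
    return np.mean(a*xP) - np.log(np.mean(np.exp(a*xQ)))

grid = np.linspace(0, 3, 61)
vals = [dv_objective(a) for a in grid]
best = grid[int(np.argmax(vals))]
print(best, max(vals), m**2/2)       # maximizer close to m, value close to 1.125
```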


4.4 Gibbs variational principle


As we remarked before the Donsker-Varadhan characterization can be seen as a way of expressing
the convex function P 7→ D(PkQ) as supremum of linear functions P 7→ EP [f] − ψf , where ψf =
log EQ [exp{f}], in which case ψf is known as the convex conjugate. Now, by the general convex
duality theory one would expect than that the function f 7→ ψf is convex and should have a similar
characterization as the supremum of linear functions of f. In this section we derive it and show
several of its (quite influential) classical and modern applications.

Proposition 4.7 (Gibbs variational principle) Let f : X → R∪{−∞} be any measurable


function and Q a probability measure on X . Then
    log EQ [exp{f(X)}] = sup_P ( EP [f(X)] − D(PkQ) ) ,

where the supremum is taken over all P with D(PkQ) < ∞. If the left-hand side is finite then the
unique maximizer of the right-hand side is P = Qf , a tilted version of Q defined in (4.13).

Proof. Let ψf ≜ log EQ [exp{f(X)}]. First, if ψf = −∞, then Q-a.s. f = −∞ and hence P-a.s. also
f = −∞, so that both sides of the equality are −∞. Next, assume −∞ < ψf < ∞. Then by
Donsker-Varadhan (4.11) we get
ψf ≥ EP [f(X)] − D(PkQ) .
On the other hand, setting P = Qf we obtain an equality. To show uniqueness, notice that
Log ((dQf /dQ)/1) = f − ψf even when f = −∞. Thus, from (2.11) we get whenever D(PkQ) < ∞ that
EP [f(X) − ψf ] = D(PkQ) − D(PkQf ) .
From here we conclude that EP [f(X)] < ∞ and hence can rewrite the above as
EP [f(X)] − D(PkQ) = ψf − D(PkQf ) ,
which shows that EP [f(X)] − D(PkQ) = ψf implies P = Qf .
Next, suppose ψf = ∞. Let us define fn = f ∧ n, n ≥ 1. Since ψfn < ∞ we have by the previous
characterization that there is a sequence Pn such that D(Pn kQ) < ∞ and as n → ∞
EPn [f(X) ∧ n] − D(Pn kQ) = ψfn % ψf = ∞ .
Since EPn [f(X) ∧ n] ≤ EPn [f(X)], we have
EPn [f(X)] − D(Pn kQ) ≥ ψfn → ∞ ,
concluding the proof.
We now briefly explore how Proposition 4.7 has been applied over the last century. We start
with the example from statistical physics and graphical models. Here the key idea is to replace
sup over all distributions P with a subset that is easier to handle. This idea is the basis of much of
variational inference [447].


Example 4.1 (Mean-field approximation for Ising model) Suppose that we have a
complicated model for a distribution of a vector X̃ ∈ {0, 1}n+m given by an Ising model
    PX̃ (x̃) = (1/Z̃) exp{x̃⊤ Ãx̃ + b̃⊤ x̃} ,
where à ∈ R(n+m)×(n+m) is a symmetric interaction matrix with zero diagonal and b̃ is a vector
of external fields and Z̃ is a normalization constant. We note that often à is very sparse with non-
zero entries occurring only for those pairs of variables xi and xj that are considered to be interacting (or
adjacent in some graph). We decompose the vector X̃ = (X, Y) into two components: the last
m coordinates are observables and the first n coordinates are hidden (latent), whose values we
want to infer; in other words, our goal is to evaluate PX|Y=y upon observing y. It is clear that this
conditional distribution is still an Ising model, so that
    PX|Y (x|y) = (1/Z) exp{x⊤ Ax + b⊤ x} ,    x ∈ {0, 1}n
where A is the n × n leading minor of à and b and Z depend on y. Unfortunately, computing even
a single value P[X1 = 1] is known to be generally computationally infeasible [394, 175], since
evaluating Z requires summation over 2n values of x.
Let us denote f(x) = x⊤ Ax + b⊤ x and by Q the uniform distribution on {0, 1}n . Applying
Proposition 4.7 we obtain

    log Z − n log 2 = log EQ [exp{f}] = sup_{PXn} ( EP [f(Xn )] − D(PkQ) ) .

Then by Theorem 2.2 we get

    log Z = sup_{PXn} ( EP [f(Xn )] + H(PXn ) ) .

As we said, exact computation of log Z, though, is not tractable. An influential idea is to instead
Qn
search the maximizer in the class of product distributions PXn = i=1 Ber(pi ). In this case, this
supremization can be solved almost in closed-form:
    sup_p ( p⊤ Ap + b⊤ p + Σ_i h(pi ) ) ,

where p = (p1 , . . . , pn ). Since the objective function is strongly concave (Exercise I.37), we only
need to solve the first order optimality conditions (or mean-field equations), which is a set of n
non-linear equations in n variables:

    pi = σ( bi + 2 Σ_{j=1}^{n} ai,j pj ) ,    σ(x) ≜ 1/(1 + exp(−x)) .

These are solved by iterative message-passing algorithms [447]. Once the values of pi are obtained,
the mean-field approximation is to take
    PX|Y=y ≈ Π_{i=1}^{n} Ber(pi ) .


We stress that the mean-field idea is not only to approximate the value of Z, but also to consider
the corresponding maximizer (over a restricted class of product distributions) as the approximate
posterior distribution.
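A minimal sketch of the resulting mean-field procedure is given below (added here for illustration; the random couplings, fields and damping are arbitrary choices, not from the text). It iterates the mean-field equations and reports the corresponding lower bound on log Z.

```python
# Minimal sketch of the naive mean-field iteration from Example 4.1 for a
# small random Ising model on {0,1}^n; A, b and the damping are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 10
A = rng.normal(0, 0.1, (n, n)); A = np.triu(A, 1); A = A + A.T   # symmetric, zero diagonal
b = rng.normal(0, 0.5, n)

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

p = np.full(n, 0.5)                    # initialize the marginals
for _ in range(500):                   # fixed point: p_i = sigma(b_i + 2 sum_j a_ij p_j)
    p_new = sigma(b + 2*A @ p)
    if np.max(np.abs(p_new - p)) < 1e-10:
        p = p_new; break
    p = 0.5*p + 0.5*p_new              # damping helps convergence

# mean-field lower bound on log Z: p^T A p + b^T p + sum_i h(p_i)  (nats)
h = lambda q: -q*np.log(q) - (1 - q)*np.log(1 - q)
elbo = p @ A @ p + b @ p + h(p).sum()
print(p.round(3), elbo)
```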

To get another flavor of examples, let us consider a more general setting, where we have some
parametric collection of distributions P(θ)X,Y indexed by θ ∈ Rd . Often, the joint distribution is such
that P(θ)X and P(θ)Y|X are both “simple”, but P(θ)Y and P(θ)X|Y are “complex” or even intractable (e.g.
in sparse linear regression and community detection, Section 30.3). As in the previous example, X
is the latent (unobserved) and Y is the observed variable.
For a moment we will omit writing θ and consider the problem of evaluating PY (y) – a quantity
(known as evidence) showing how extreme the observed y is. Note that

PY (y) = Ex∼PX [PY|X (y|x)] .

Although by assumption PX and PY|X are both easy to compute, this marginalization may be
intractable. As a workaround, we invoke Proposition 4.7 with f(x) = log PY|X (y|x) and Q = PX to
get

    log PY (y) = sup_R ( ER [f(X)] − D(RkPX ) ) = sup_R EX∼R [ log ( PX,Y (X, y) / R(X) ) ] ,    (4.16)

where R is an arbitrary distribution. Note that the right-hand side only involves a simple quantity
PX,Y and hence all the complexity of computation is moved to optimization over R. Expres-
sion (4.16) is known as evidence lower bound (ELBO) since for any fixed value of R we get a
provable lower bound on log PY (y). Typically, one optimizes the choice of R over some convenient
set of distributions to get the best (tightest) lower bound in that class.
One such application leads to the famous iterative (EM) algorithm, see (5.33) below. Another
application is a modern density estimation algorithm, which we describe next.

Example 4.2 (Variational autoencoders [245]) A canonical problem in unsupervised
learning is density estimation: given an i.i.d. sample y1 , . . . , yn ∼ PY , estimate the true PY on Rd . We
describe a modern solution of [245]. First, they propose a latent parametric model (a generative
model) for PY . Namely, Y is generated by first sampling a latent variable X ∼ N (0, Id′ ) and then
setting Y to be conditionally Gaussian:

    Y = μ(X; θ) + D(X; θ)Z ,    X ⊥⊥ Z ∼ N (0, Id ) ,

where vector μ(·; θ) and diagonal matrix D(·; θ) are deep neural networks with input (·) and
weights θ. (See [245, App. C.2] for a detailed description.) The resulting distribution P(θ)Y is a
(complicated) location-scale Gaussian mixture. To find the best density, [245] aims to maximize
the likelihood by solving

    max_θ Σ_i log P(θ)Y (yi ) .


Since the marginalization to obtain PY is intractable, we replace the objective (by an appeal to
ELBO (4.16)) with

    max_θ sup_{RX|Y} Σ_i EX∼RX|Y=yi [ log ( p(θ)X,Y (X, yi ) / rX|Y (X|yi ) ) ] ,    (4.17)

where we denoted the PDFs of PX,Y and RX|Y by lower-case letters. Now in this form the algorithm
is simply the EM algorithm (as we discuss below in Section 5.6). What brings the idea to the 21st
century is restricting the optimization to the set of RX|Y which are again defined via

    X = μ̃(Y; ϕ) + D̃(Y; ϕ)Z̃ ,    Y ⊥⊥ Z̃ ∼ N (0, Id′ )

where μ̃(Y; ϕ) and diagonal covariance matrix D̃(Y; ϕ) are output by some neural network with
parameter ϕ. The conditional distribution under this auxiliary model (recognition model), denoted
by R(ϕ)X|Y , is Gaussian. Since the ELBO (4.16) is achieved by the posterior P(θ)X|Y , what this amounts
to is to approximate the true posterior under the generative model by a Gaussian. Replacing also
the expectation over RX|Y=yi with its empirical version (by generating Z̃ij ∼ N (0, Id′ ) i.i.d.) we obtain
the following

    max_θ max_ϕ Σ_i Σ_j log ( p(θ)X,Y (xi,j , yi ) / r(ϕ)X|Y (xi,j |yi ) ) ,    xi,j = μ̃(yi ; ϕ) + D̃(yi ; ϕ)Z̃ij .    (4.18)

Now plugging in the Gaussian form of the densities p(θ)X , p(θ)Y|X and r(ϕ)X|Y one gets an expression whose
gradients ∇θ and ∇ϕ can be easily computed by automatic differentiation software.1 In
fact, since r(ϕ)X|Y and p(θ)Y|X are both Gaussian, we can use less Monte Carlo approximation than
(4.18), because the objective in (4.17) equals EX∼RX|Y=yi [ log ( p(θ)X,Y (X, yi ) / r(ϕ)X|Y (X|yi ) ) ] = −D(RX|Y=yi kPX ) +
EX∼RX|Y=yi [ log p(θ)Y|X (yi |X) ], where PX = N(0, Id′ ), RX|Y=yi = N(μ̃(yi ; ϕ), D̃²(yi ; ϕ)), P(θ)Y|X=x =
N(μ(x; θ), D²(x; θ)), so that the first Gaussian KL divergence is in closed form (Example 2.2) and
we only need to apply Monte Carlo approximation to the second term. For both versions, the
optimization proceeds by (stochastic) gradient ascent over θ and ϕ until convergence to some
(θ∗ , ϕ∗ ). Then P(θ∗)Y can be used to generate new samples from the learned distribution, R(ϕ∗)X|Y to
map (“encode”) samples to the latent space, and P(θ∗)Y|X to “decode” a latent representation into a
target sample. We refer the readers to Chapters 3 and 4 in the survey [246] for other encoder and
decoder architectures.

1
An important part of the contribution of [245] is the “reparametrization trick”. Namely, since [452] a standard way to
compute ∇ϕ EQ(ϕ) [f] in machine learning is to write ∇ϕ EQ(ϕ) [f] = EQ(ϕ) [f(X)∇ϕ ln q(ϕ) (X)] and replace the latter
expectation by its empirical approximation. However, in this case a much better idea is to write
EQ(ϕ) [f] = EZ∼N [f(g(Z; ϕ))] for some explicit g and then move gradient inside the expectation before computing the
empirical version.
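To make the ELBO machinery above concrete, the sketch below (our illustration) uses a toy model in which everything is Gaussian, X ∼ N (0, 1) and Y|X ∼ N (X, 1), so that the evidence log PY (y) = log N (y; 0, 2) is known exactly; optimizing a Gaussian variational family R = N (m, s) recovers both the true posterior N (y/2, 1/2) and the exact evidence.

```python
# ELBO (4.16) on a toy Gaussian model where the evidence is known exactly:
# X ~ N(0,1), Y|X ~ N(X,1), hence P_Y = N(0,2). The variational family
# R = N(m, s) is an illustrative choice added here.
import numpy as np

y = 1.3                                                  # the observed value

def elbo(m, s):
    kl = 0.5*(s + m**2 - 1 - np.log(s))                  # D(N(m,s) || N(0,1)), closed form
    rec = -0.5*np.log(2*np.pi) - 0.5*((y - m)**2 + s)    # E_{X~R}[log p_{Y|X}(y|X)]
    return rec - kl

# crude grid search over the variational parameters (m, s)
ms = np.linspace(-2, 2, 401); ss = np.linspace(0.05, 2, 400)
best = max((elbo(m, s), m, s) for m in ms for s in ss)

log_evidence = -0.5*np.log(2*np.pi*2) - y**2/4           # log N(y; 0, 2), exact
print(best)            # maximizer near (m, s) = (y/2, 1/2), the true posterior
print(log_evidence)    # ELBO at the optimum matches the exact evidence
```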


4.5 Continuity of divergence


For a finite alphabet X it is easy to establish the continuity of entropy and divergence:

Proposition 4.8 Let X be finite. Fix a distribution Q on X with Q(x) > 0 for all x ∈ X . Then
the map P 7→ D(PkQ) is continuous. In particular, P 7→ H(P) is continuous.

Warning: Divergence is never continuous in the pair, even for finite alphabets. For example,
as n → ∞, d(1/n k 2−n ) ̸→ 0.

Proof. Notice that

    D(PkQ) = Σ_x P(x) log (P(x)/Q(x))

and each term is a continuous function of P(x).

Our next goal is to study continuity properties of divergence for general alphabets. We start
with a negative observation.

Remark 4.5 In general, D(PkQ) is not continuous in either P or Q. For example, let X1 , . . . , Xn
be iid and equally likely to be ±1. Then by the central limit theorem, Sn = (1/√n) Σ_{i=1}^{n} Xi →d N (0, 1)
as n → ∞. But D(PSn kN (0, 1)) = ∞ for all n because Sn is discrete. Note that this is also an
example for strict inequality in (4.19).

Nevertheless, there is a very useful semicontinuity property.

Theorem 4.9 (Lower semicontinuity of divergence) Let X be a metric space with


Borel σ -algebra H. If Pn and Qn converge weakly to P and Q, respectively,2 then

D(PkQ) ≤ lim inf D(Pn kQn ) . (4.19)


n→∞

On a general space if Pn → P and Qn → Q pointwise3 (i.e. Pn [E] → P[E] and Qn [E] → Q[E] for
every measurable E) then (4.19) also holds.

Proof. This simply follows from (4.12) since EPn [f] → EP [f] and EQn [exp{f}] → EQ [exp{f}] for
every f ∈ Cb .

2
Recall that sequence of random variables Xn converges in distribution to X if and only if their laws PXn converge weakly
to PX .
3
Pointwise convergence is weaker than convergence in total variation and stronger than weak convergence.


4.6* Continuity under monotone limits of σ -algebras


Our final and somewhat delicate topic is to understand the (so far neglected) dependence of D and I
on the implicit σ -algebra of the space. Indeed, the definition of divergence D(PkQ) implicitly (via
Radon-Nikodym derivative) depends on the σ -algebra F defining the measurable space (X , F).
To emphasize the dependence on F we will write in this section only the underlying σ -algebra
explicitly as follows:

D(PF kQF ) .

Our main results are continuity under monotone limits of σ -algebras. Recall that a sequence of
nested σ -algebras, F1 ⊂ F2 · · · , is said to Fn % F when F ≜ σ (∪n Fn ) is the smallest σ -
algebra containing ∪n Fn (the union of σ -algebras may fail to be a σ -algebra and hence needs
completion). Similarly, a sequence of nested σ -algebras, F1 ⊃ F2 · · · , is said to Fn & F if
F = ∩n Fn (intersection of σ -algebras is always a σ -algebra). We will show in this section that we
always have:

Fn % F =⇒ D(PFn kQFn ) % D(PF kQF ) (4.20)


Fn & F =⇒ D(PFn kQFn ) & D(PF kQF ) (4.21)

For establishing the first result, it will be convenient to extend the definition of the divergence
D(PF kQF ) to (a) any algebra of sets F and (b) two positive additive (not necessarily σ -additive)
set-functions P, Q on F . We do so following the Gelfand-Yaglom-Perez variational representation
of divergence (Theorem 4.5).

Definition 4.10 (KL divergence over an algebra) Let P and Q be two positive, addi-
tive (not necessarily σ -additive) set-functions defined over an algebra F of subsets of X (not
necessarily a σ -algebra). We define
    D(PF kQF ) ≜ sup_{{E1 ,...,En }} Σ_{i=1}^{n} P[Ei ] log (P[Ei ]/Q[Ei ]) ,

where the supremum is over all finite F -measurable partitions: ∪_{j=1}^{n} Ej = X , Ej ∩ Ei = ∅, and
0 log(0/0) = 0 and log(1/0) = ∞ per our usual convention.

Note that when F is not a σ -algebra or P, Q are not σ -additive, we do not have Radon-Nikodym
theorem and thus our original definition of KL-divergence is not applicable.

Theorem 4.11 (Measure-theoretic properties of divergence) Let P, Q be probability


measures on the measurable space (X , H). Assume all algebras below are sub-algebras of H.
Then:

• (Monotonicity) If F ⊆ G are nested algebras then

D(PF kQF ) ≤ D(PG kQG ) . (4.22)


• Let F1 ⊆ F2 . . . be an increasing sequence of algebras and let F = ∪n Fn be their limit, then

D(PFn kQFn ) % D(PF kQF ) .

• If F is (P + Q)-dense in G then4

D(PF kQF ) = D(PG kQG ) . (4.23)

• (Monotone convergence theorem) Let F1 ⊆ F2 . . . be an increasing sequence of algebras and
let F = ∨n Fn be the smallest σ-algebra containing all of Fn . Then we have

D(PFn kQFn ) % D(PF kQF )

and, in particular,

    D(PX∞ kQX∞ ) = lim_{n→∞} D(PXn kQXn ) .

Proof. The first two items are straightforward applications of the definition. The third follows
from the following fact: if F is dense in G then any G -measurable partition {E1 , . . . , En } can
be approximated by an F -measurable partition {E′1 , . . . , E′n } with (P + Q)[Ei △E′i ] ≤ ϵ. Indeed,
first we set E′1 to be an element of F with (P + Q)(E1 △E′1 ) ≤ ϵ/(2n). Then, we set E′2 to be
an ϵ/(2n)-approximation of E2 \ E′1 , etc. Finally, E′n = (∪_{j≤n−1} E′j )c . By taking ϵ → 0 we obtain
Σ_i P[E′i ] log (P[E′i ]/Q[E′i ]) → Σ_i P[Ei ] log (P[Ei ]/Q[Ei ]).
The last statement follows from the previous one and the fact that any algebra F is μ-dense in
the σ -algebra σ{F} it generates for any bounded μ on (X , H) (cf. [142, Lemma III.7.1].)
Finally, we address the continuity under the decreasing σ -algebra, i.e. (4.21).

Proposition 4.12 Let Fn & F be a sequence of decreasing σ -algebras with intersection


F = ∩n Fn ; let P, Q be two probability measures on F0 . If D(PF0 kQF0 ) < ∞ then we have

D(PFn kQFn ) & D(PF kQF ) (4.24)

The condition D(PF0 kQF0 ) < ∞ can not be dropped, cf. the example after (4.32).
h i
Proof. Let X−n = dP
dQ . Since X−n = EQ dP
dQ Fn , we have that (. . . , X−1 , X0 ) is a uniformly
Fn
integrable martingale. By the martingale convergence theorem in reversed time, cf. [84, Theorem
5.4.17], we have almost surely
dP
X−n → X−∞ ≜ . (4.25)
dQ F

We need to prove that

EQ [X−n log X−n ] → EQ [X−∞ log X−∞ ] .

4
Recall that F is μ-dense in G if ∀E ∈ G, ϵ > 0∃E′ ∈ F s.t. μ[E∆E′ ] ≤ ϵ.


We will do so by decomposing x log x as follows

x log x = x log+ x + x log− x ,

where log+ x = max(log x, 0) and log− x = min(log x, 0). Since x log− x is bounded, we have
from the bounded convergence theorem:

EQ [X−n log− X−n ] → EQ [X−∞ log− X−∞ ].

To prove a similar convergence for log+ we need to notice two things. First, the function

x 7→ x log+ x

is convex. Second, for any non-negative convex function ϕ s.t. E[ϕ(X0 )] < ∞ the collection
{Zn = ϕ(E[X0 |Fn ]), n ≥ 0} is uniformly integrable. Indeed, we have from Jensen’s inequality
    P[Zn > c] ≤ (1/c) E[ϕ(E[X0 |Fn ])] ≤ E[ϕ(X0 )]/c
and thus P[Zn > c] → 0 as c → ∞. Therefore, we have again by Jensen’s

E[Zn 1{Zn > c}] ≤ E[ϕ(X0 )1{Zn > c}] → 0 c → ∞.

Finally, since X−n log+ X−n is uniformly integrable, we have from (4.25)

EQ [X−n log+ X−n ] → EQ [X−∞ log+ X−∞ ]

and this concludes the proof.

4.7 Variational characterizations and continuity of mutual information


Again, similarly to Proposition 4.8, it is easy to show that in the case of finite alphabets mutual
information is always continuous on finite-dimensional simplices of distributions.5

Proposition 4.13 (a) If X and Y are both finite alphabets, then PX,Y 7→ I(X; Y) is continuous.
(b) If X is finite, then PX 7→ I(X; Y) is continuous.
(c) Without any assumptions on X and Y , let PX range over the convex hull Π = co(P1 , . . . , Pn ) =
{Σ_{i=1}^{n} αi Pi : Σ_{i=1}^{n} αi = 1, αi ≥ 0}. If I(Pj , PY|X ) < ∞ (using the notation I(PX , PY|X ) =
I(X; Y)) for all j ∈ [n], then the map PX 7→ I(X; Y) is continuous.

Proof. For the first statement, apply the representation

I(X; Y) = H(X) + H(Y) − H(X, Y)

and the continuity of entropy in Proposition 4.8.

5
Here we only assume that topology on the space of measures is compatible with the linear structure, so that all linear
operations on measures are continuous.


For the second statement, take QY = (1/|X |) Σ_{x∈X} PY|X=x . Note that

    D(PY kQY ) = EQY [ f( Σ_x PX (x)hx (Y) ) ] ,

where f(t) = t log t and hx (y) = dPY|X=x /dQY (y) are bounded by |X | and non-negative. Thus, from the
bounded convergence theorem we have that PX 7→ D(PY kQY ) is continuous. The proof is complete
since by the golden formula

I(X; Y) = D(PY|X kQY |PX ) − D(PY kQY ) ,

and the first term is linear in PX .


For the third statement, form a chain Z → X → Y with Z ∈ [n] and PX|Z=j = Pj . WLOG assume
that P1 , . . . , Pn are distinct extreme points of co(P1 , . . . , Pn ). Then there is a linear bijection
between PZ and PX ∈ Π. Furthermore, I(X; Y) = I(Z; Y) + I(X; Y|Z). The first term is continu-
ous in PZ by the previous claim, whereas the second one is simply linear in PZ . Thus, the map
PZ 7→ I(X; Y) is continuous and so is PX 7→ I(X; Y).

Further properties of mutual information follow from I(X; Y) = D(PX,Y kPX PY ) and correspond-
ing properties of divergence, e.g.

1 Donsker-Varadhan for mutual information: by definition of mutual information

    I(X; Y) = sup_f ( E[f(X, Y)] − log E[exp{f(X, Ȳ)}] ) ,                   (4.26)

where Ȳ is a copy of Y, independent of X and supremum can be taken over any of the classes of
(bivariate) functions as in Theorem 4.6. Notice, however, that for mutual information we can
also get a stronger characterization:6

I(X; Y) ≥ E[f(X, Y)] − E[log E[exp{f(X, Ȳ)}|X]] , (4.27)

from which (4.26) follows by moving the outer expectation inside the log. Both of these can
be used to show that E[f(X, Y)] ≈ E[f(X, Ȳ)] as long as the dependence between X and Y (as
measured by I(X; Y)) is weak, cf. Exercise I.55.
2 If (Xn , Yn ) →d (X, Y) converge in distribution, then

    I(X; Y) ≤ lim inf_{n→∞} I(Xn ; Yn ) .                                    (4.28)

• Example of strict inequality: Xn = Yn = (1/n) Z. In this case (Xn , Yn ) →d (0, 0) but I(Xn ; Yn ) =
H(Z) > 0 = I(0; 0).
• An even more impressive example: Let (Xp , Yp ) be uniformly distributed on the unit ℓp -ball
on the plane: {(x, y) ∈ R2 : |x|p + |y|p ≤ 1}. Then as p → 0, (Xp , Yp ) →d (0, 0), but
I(Xp ; Yp ) → ∞. (See Ex. I.16.)

6
Just apply Donsker-Varadhan to D(PY|X=x0 kPY ) and average over x0 ∼ PX .


3 Mutual information as supremum over partitions:


    I(X; Y) = sup_{{Ei }×{Fj }} Σ_{i,j} PX,Y [Ei × Fj ] log ( PX,Y [Ei × Fj ] / (PX [Ei ]PY [Fj ]) ) ,    (4.29)

where supremum is over finite partitions of spaces X and Y .7


4 (Monotone convergence I):

    I(X∞ ; Y) = lim_{n→∞} I(Xn ; Y)                                          (4.30)
    I(X∞ ; Y∞ ) = lim_{n→∞} I(Xn ; Yn )                                      (4.31)

This implies that the full amount of mutual information between two processes X∞ and Y∞
is contained in their finite-dimensional projections, leaving nothing in the tail σ -algebra. Note
also that applying the (finite-n) chain rule to (4.30) recovers (4.1).
5 (Monotone convergence II): Recall that for any random process (X1 , . . .) we define its tail σ-
algebra as Ftail = ∩n≥1 σ(X∞n ). Let Xtail be a random variable such that σ(Xtail ) = ∩n≥1 σ(X∞n ).
Then

    I(Xtail ; Y) = lim_{n→∞} I(X∞n ; Y) ,                                    (4.32)

whenever the right-hand side is finite. This is a consequence of Proposition 4.12. Without the
finiteness assumption the statement is incorrect. Indeed, consider Xj ∼ Ber(1/2) i.i.d. and Y = X∞0 .

Then each I(Xn ; Y) = ∞, but Xtail = constant a.e. by Kolmogorov’s 0-1 law, and thus the
left-hand side of (4.32) is zero.

4.8* PAC-Bayes
A deep implication of Donsker-Varadhan and Gibbs principle is a method, historically known as
PAC-Bayes,8 for bounding suprema of empirical processes. Here we present the key idea together
with two applications: one in high-dimensional probability and the other in statistical learning
theory.
But first, let us agree that in this section ρ and π will denote distributions on Θ and we will
write Eρ [·] and Eπ [·] to mean integration over only the θ variable over the respective prior, i.e.
    Eρ [fθ (x)] ≜ Eθ∼ρ [fθ (x)] = ∫_Θ fθ (x) ρ(dθ)

denotes a function of x. Similarly, EPX [fθ (X)] will denote expectation only over X ∼ PX . The
following estimate is a workhorse of PAC-Bayes method.

7
To prove this from (4.9) one needs to notice that algebra of measurable rectangles is dense in the product σ-algebra. See
[129, Sec. 2.2].
8
For probably approximately correct (PAC), as developed by Shawe-Taylor and Williamson [382], McAllester [298],
Mauer [297], Catoni [83] and many others, see [16] for a survey.


Proposition 4.14 (PAC-Bayes inequality) Consider a collection of functions {fθ : X →


R, θ ∈ Θ} such that (θ, x) 7→ fθ (x) is measurable. Fix a random variable X ∈ X and prior
π ∈ P(Θ). Then with probability at least 1 − δ we have for all ρ ∈ P(Θ):
    Eρ [fθ (X) − ψ(θ)] ≤ D(ρkπ ) + log(1/δ) ,    ψ(θ) ≜ log EPX [exp{fθ (X)}] .    (4.33)
Furthermore, for any joint distribution Pθ,X we have

E[fθ (X) − ψ(θ)] ≤ I(θ; X) ≤ D(Pθ|X kπ |PX ) . (4.34)

Proof. We will prove the following result, known in this area as an exponential inequality:

    EPX [ sup_ρ exp{Eρ [fθ (X) − ψ(θ)] − D(ρkπ )} ] ≤ 1 ,                    (4.35)

where the supremum inside is taken over all ρ such that D(ρkπ ) < ∞. Indeed, from it (4.33)
follows via the Markov inequality. Notice that this supremum is taken over uncountably many
values and hence it is not a priori clear whether the function of X under the outer expectation is
even measurable. We will show the latter together with the exponential inequality.
To that end, we apply the Gibbs principle (Proposition 4.7) to the function θ 7→ fθ (X) − ψ(θ)
and base measure π. Notice that this function may take value −∞, but nevertheless we obtain

    sup_ρ ( Eρ [fθ (X) − ψ(θ)] − D(ρkπ ) ) = log Eπ [exp{fθ (X) − ψ(θ)}] ,

where the right-hand side is a measurable function of X. Exponentiating and taking expectation
over X we obtain

    EPX [ sup_ρ exp{Eρ [fθ (X) − ψ(θ)] − D(ρkπ )} ] = EPX [Eπ [exp{fθ (X) − ψ(θ)}]] .

We claim that the right-hand side equals π [ψ(θ) < ∞] ≤ 1, which completes the proof of (4.35).
Indeed, let E = {θ : ψ(θ) < ∞}. Then for any θ ∈ E we have EPX [exp{fθ (X) − ψ(θ)}] = 1, or in
other words for all θ:

EPX [exp{fθ (X) − ψ(θ)}]1{θ ∈ E} = 1{θ ∈ E} .

Now applying Eπ to both sides here and invoking Fubini we obtain

EPX [Eπ [exp{fθ (X) − ψ(θ)}]1{θ ∈ E}] = π [E] .

Finally, notice that 1{θ ∈ E} can be omitted since for θ ∈ Ec we have exp{fθ (X) − ∞} = 0 by
agreement.
To show (4.34), for each x take ρ = Pθ|X=x in (4.35) to get

    Ex∼PX [ exp{E[fθ (X) − ψ(θ)|X = x] − D(Pθ|X=x kπ )} ] ≤ 1 .

By Jensen’s we can move outer expectation inside the exponent and obtain the right-most
inequality in (4.34). To get the bound in terms of I(θ; X) take π = Pθ and recall (4.5).


4.8.1 Uniform convergence


As stated the PAC-Bayes inequality is too general to appreciate its importance. Its significance
and depth is only revealed once the art of applying it is mastered. We first give such example in
the context of uniform convergence and high-dimensional probability.
Example 4.3 (Norms of subgaussian vectors [475]) Suppose X ∼ N (0, Σ) takes values
in Rd . What is the magnitude of kXk ≡ kXk2 ? First, by Jensen’s we get

    E[kXk] ≤ √(E[kXk²]) = √(tr Σ) .

But what about the typical value of kXk, i.e. can we show an upper bound on kXk that holds with high
probability? In order to see how PAC-Bayes could be useful here, notice that kXk = sup_{∥v∥=1} v⊤ X.
Thus, we aim to use (4.33) to bound this supremum. For any v let ρv = N (v, β² Id ) and notice
v⊤ X = Eρv [θ⊤ X]. We also take π = N (0, β² Id ) and fθ (x) = λθ⊤ x, θ ∈ Rd , where β, λ > 0
are parameters to be optimized later. Taking the base of log to be e, we compute explicitly ψ(θ) =
(λ²/2) θ⊤ Σθ , Eρv [ψ(θ)] = (λ²/2)(v⊤ Σv + β² tr Σ) and D(ρv kπ ) = ∥v∥²/(2β²) via (2.8). Thus, using (4.33)
restricted to ρv with kvk = 1 we obtain that with probability ≥ 1 − δ we have for all v with
kvk = 1

    λ v⊤ X ≤ (λ²/2)(v⊤ Σv + β² tr Σ) + 1/(2β²) + ln(1/δ) .

Now, we can optimize the right-hand side over β by choosing β² = 1/(λ√(tr Σ)) to get

    λ v⊤ X ≤ (λ²/2)(v⊤ Σv) + λ√(tr Σ) + ln(1/δ) .

Finally, estimating v⊤ Σv ≤ kΣkop and optimizing λ we obtain the resulting high-probability
bound:

    kXk ≤ √(tr Σ) + √(2 kΣkop ln(1/δ)) .
Although this result can be obtained using the standard technique of Chernoff bound (Section 15.1)
– see [270, Lemma 1] for a stronger version or [69, Example 5.7] for general norms based on sophis-
ticated Gaussian concentration inequalities, the advantages of the PAC-Bayes proof are that (a) it is
not specific to Gaussian and holds for any X such that ψ(θ) ≤ (λ²/2) θ⊤ Σθ (similar to subgaussian ran-
dom variables introduced below); and (b) its extensions can be used to analyze the concentration
of sample covariance matrices, cf. [475].
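The final bound can be checked by simulation. The snippet below (added here; the covariance is a randomly generated positive definite matrix) draws many Gaussian vectors and verifies that kXk exceeds the bound with frequency well below δ.

```python
# Monte Carlo check of the bound from Example 4.3:
# ||X|| <= sqrt(tr Sigma) + sqrt(2 ||Sigma||_op ln(1/delta)); an illustration.
import numpy as np

rng = np.random.default_rng(0)
d, delta = 50, 0.05
B = rng.normal(size=(d, d))
Sigma = B @ B.T / d                              # a random PSD covariance
L = np.linalg.cholesky(Sigma)

trace = np.trace(Sigma)
op = np.linalg.eigvalsh(Sigma)[-1]               # operator norm (largest eigenvalue)
bound = np.sqrt(trace) + np.sqrt(2*op*np.log(1/delta))

norms = np.linalg.norm(rng.normal(size=(100_000, d)) @ L.T, axis=1)
print(np.mean(norms > bound), "should be at most", delta)
```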
To present further applications, we need to introduce a new concept.

Definition 4.15 (Subgaussian random variables) A random variable X is called σ 2 -


subgaussian if
    E[e^{λ(X−E[X])} ] ≤ e^{σ²λ²/2}    ∀λ ∈ R .

Here are some useful observations:


• N (0, σ 2 ) is σ 2 -subgaussian. In fact it satisfies the condition with equality and explains the origin
of the name.
• If X ∈ [a, b], then X is (b − a)²/4-subgaussian. This is the well-known Hoeffding’s lemma (see
Exercise III.22 for a proof).
• If Xi are iid and σ²-subgaussian then the empirical average Sn = (1/n) Σ_{i=1}^{n} Xi is (σ²/n)-
subgaussian.

There are many equivalent ways to define subgaussianity [438, Prop. 2.5.2], including by requiring
tails of X to satisfy P[|X − E[X]| > t] ≤ 2e^{−t²/(2σ²)} . However, for us the most important property is the
consequence of the two observations above: empirical averages of independent bounded random
variables are O(1/n)-subgaussian.
The concept of subgaussianity is used in the PAC-Bayes method as follows. Suppose we have a
collection of functions F from X to R and an iid sample Xn ∼ PX . One of the main questions of
empirical process theory and uniform convergence is to get a high-probability bound on

    sup_{f∈F} ( E[f(X)] − Ên [f(X)] ) ,    Ên [f(X)] ≜ (1/n) Σ_{i=1}^{n} f(Xi ) .

Suppose that we know each f takes values in [0, 1]. Then (E − Ên )f is 1/(4n)-subgaussian and,
applying the PAC-Bayes inequality to the functions λ(E − Ên )f(X), we get that with probability ≥ 1 − δ
for any ρ on F we have

    Ef∼ρ [(E − Ên )f(X)] ≤ λ/(8n) + (1/λ) ( D(ρkπ ) + ln(1/δ) ) ,
where π is a fixed prior. This method can be used to get interesting bounds for countably-infinite
collections F (see Exercise I.55 and I.56). However, the real power of this method shows when
F is uncountable (as in the previous example for Gaussian norms).
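For a finite class F and point masses ρ = δf under the uniform prior π, we have D(ρkπ ) = ln |F |, so the display above gives sup_{f∈F} (E − Ên )f ≤ λ/(8n) + (ln |F | + ln(1/δ))/λ with probability ≥ 1 − δ for any fixed λ. The Monte Carlo sketch below (an added illustration, with threshold functions on Unif[0, 1] data and one fixed choice of λ) checks this empirically.

```python
# Monte Carlo check of the PAC-Bayes uniform-convergence bound for a finite
# class: f_t(x) = 1{x <= t}, X ~ Unif[0,1], uniform prior over the thresholds,
# point masses rho = delta_f (so D(rho||pi) = ln|F|). Illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, delta = 500, 0.05
thresholds = np.linspace(0.05, 0.95, 19)              # |F| = 19 functions
lam = np.sqrt(8*n*(np.log(len(thresholds)) + np.log(1/delta)))   # fixed in advance
bound = lam/(8*n) + (np.log(len(thresholds)) + np.log(1/delta))/lam

trials, exceed = 2000, 0
for _ in range(trials):
    X = rng.random(n)
    emp = (X[:, None] <= thresholds).mean(axis=0)     # empirical means E_n[f_t]
    sup_gap = np.max(thresholds - emp)                # sup_f (E - E_n) f, since E[f_t] = t
    exceed += sup_gap > bound
print(exceed/trials, "should be at most", delta)
```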
We remark that bounding the supremum of a random process (e.g., an empirical or Gaussian process) indexed by a continuous parameter is a vast subject [139, 429, 431]. The usual method is based on discretization and approximation (with a more advanced version known as chaining; see (27.22) and Exercise V.28 for the counterpart for Gaussian processes). The PAC-Bayes inequality offers an alternative which often allows for sharper results and shorter proofs. There are also applications to high-dimensional probability (small-ball probabilities and random matrix theory); see [317, 309]. In those works, PAC-Bayes is applied with π being the uniform distribution on a small ball.

Remark 4.6 (PAC-Bayes vs Rademacher complexity) Note that PAC-Bayes bounds the supremum of an empirical process. Indeed, for any value Y we can think of θ ↦ f_θ(Y) as a vector (random process) indexed by θ ∈ Θ. Each ρ ∈ P(Θ) defines a linear function F(ρ) ≜ E_{θ∼ρ}[f_θ(Y)] of this vector. A modern method [38] of bounding the supremum of a collection of functions is to use Rademacher complexity. It turns out that PAC-Bayes can be rather naturally understood in that framework: if |f_θ(·)| ≤ M then the set {F(ρ) : D(ρ‖π) ≤ C < ∞} has Rademacher complexity O(M√(C/n)), see [239]. In fact they show that any set {F(ρ) : G(ρ) ≤ C} satisfies this bound as long as G is strongly convex. Recall that ρ ↦ D(ρ‖π) is indeed strongly convex by Exercise I.37.

4.8.2 Generalization bounds in statistical learning theory


The original purpose of introducing PAC-Bayes was the analysis of learning algorithms. We first introduce the standard (PAC) setting of learning theory.
Suppose one has access to a training sample Xⁿ = (X_1, ..., X_n) drawn iid from P and there is a space of parameters Θ. The goal is to find θ ∈ Θ that minimizes the estimation error (loss). Specifically, the loss incurred by an estimator indexed by some parameter θ ∈ Θ on the data point X_i is given by ℓ_θ(X_i). A learning algorithm aims to find a minimizer of the test loss (also known as test error or generalization risk), i.e.

    L(θ) ≜ E[ℓ_θ(X_new)],    X_new ∼ P.

Note that since P is unknown, direct minimization of L(θ) is impossible. This setting encompasses many problems in machine learning and statistics. For example, we could have X_i = (Y_i, Z_i), where Z_i is a feature (covariate) and Y_i is the label (response), with the loss being 1{Y_i ≠ f_θ(Z_i)} (in classification) or (Y_i − f_θ(Z_i))² (in regression), and where f_θ is, for instance, a linear predictor θ⊤Z_i or a deep neural net with weights θ. Or (as in Example 4.2) we could be trying to estimate the distribution P itself by optimizing the cross-entropy loss log(1/p_θ(X)) over a parametric class of densities.

As we mentioned, the value of L(θ) is not computable by the learner. What is computable is the training error (or empirical risk), namely

    L̂_n(θ) ≜ (1/n) ∑_{i=1}^n ℓ_θ(X_i).

So how does one select a good estimate θ̂ given training sample Xn ? Here are two famous options:

• Empirical Risk Minimization (ERM): θ̂ERM = argminθ∈Θ L̂n (θ)


• Gibbs sampler: Fix some λ > 0. Given some prior distribution π on Θ, draw θ̂ from the
“posterior”
ρ(dθ) ∝ π (dθ)e−λL̂n (θ) (4.36)
In the limit of λ → ∞, this reduces to the ERM. Note that in this case θ̂ is a randomized function
of Xn .

However, many other choices exist and are invented daily for specific problems.
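To make these two options concrete, here is a toy Python sketch on a finite parameter grid (the Bernoulli data, the squared loss and the value of λ are arbitrary choices for illustration):

    import numpy as np
    rng = np.random.default_rng(1)
    thetas = np.linspace(0, 1, 21)                            # finite grid of candidate parameters
    X = rng.binomial(1, 0.3, size=50)                         # training sample from Ber(0.3)
    L_hat = np.array([np.mean((X - t)**2) for t in thetas])   # empirical risk, squared loss in [0,1]
    theta_erm = thetas[np.argmin(L_hat)]                      # ERM
    lam = 50.0
    prior = np.full(len(thetas), 1/len(thetas))               # uniform prior pi on Theta
    rho = prior*np.exp(-lam*L_hat); rho /= rho.sum()          # Gibbs "posterior" (4.36)
    theta_gibbs = rng.choice(thetas, p=rho)                   # randomized estimate drawn from rho
    print(theta_erm, theta_gibbs)

As λ grows the Gibbs posterior concentrates on the empirical risk minimizer, consistent with the remark above.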
The main issue that is to be addressed by theory is the following: the choice θ̂ is guided by
the sample Xn and thus the value L̂n (θ̂) is probably not representative of the value L(θ̂) that the
estimator will attain on a fresh sample Xnew because of overfitting to the training sample. To gauge
the amount of overfitting one seeks to prove an estimate of the form:
L(θ̂) ≤ L̂n (θ̂) + small error terms

with high probability over the sample Xn . Note that here θ̂ can be either deterministic (as in ERM)
or a randomized (as in Gibbs) function of the training sample. In either case, it is convenient to
think about the estimator as a θ drawn from a data-dependent distribution ρ, in other words, a
channel from Xn to θ, so that we always understand L(θ̂) as Eρ [L(θ)] and L̂n (θ̂) as Eρ [L̂n (θ)]. Note
that in either case the values L(θ̂) and L̂n (θ̂) are random quantities depending on the sample Xn .
A specific kind of bounds we will show is going to be a high-probability bound that holds
uniformly over all data-dependent ρ, specifically
    P[ ∀ρ : E_{θ∼ρ}[L(θ)] ≤ E_{θ∼ρ}[L̂_n(θ)] + excess risk(ρ) ] ≥ 1 − δ,

for some excess risk depending on (n, ρ, δ). We emphasize that here the probability is with respect
i.i.d.
to Xn ∼ P and the quantifier “∀ρ” is inside the probability. Having a uniform bound like this
suggests selecting that ρ which minimizes the right-hand side of the inequality, thus making the
second term serve as a regularization term preventing overfitting.
The main theorem of this section is the following version of the generalization error bound of
McAllester [298]. Many other similar bounds exist, for example see Exercise I.54.

Theorem 4.16 Fix a reference distribution π on Θ and suppose that for all θ, x the loss ℓ_θ(x) ∈ [0, 1]. Then for any δ ≤ e^{−1}

    P[ ∀ρ : E_{θ∼ρ}[L(θ)] ≤ E_{θ∼ρ}[L̂_n(θ)] + (5/4)√( (D(ρ‖π) + ln(1/δ)) / (2n) ) + 1/√(10n) ] ≥ 1 − δ.    (4.37)

The same result holds if (instead of being bounded) ℓ_θ(X) is 1/4-subgaussian for each θ.

Before proving the theorem, let us consider a finite class Θ with |Θ| = M and argue that

    P[ ∀ρ : E_{θ∼ρ}[L(θ)] ≤ E_{θ∼ρ}[L̂_n(θ)] + √( (1/(2n)) ln(M/δ) ) ] ≥ 1 − δ.    (4.38)

Indeed, by linearity, it suffices to restrict to point mass distributions ρ = δ_θ. For each θ the random variable L̂_n(θ) − L(θ) is zero-mean and 1/(4n)-subgaussian (Hoeffding's lemma). Thus, applying Markov's inequality to e^{λ(L(θ)−L̂_n(θ))} we have for any λ > 0 and t > 0:

    P[L(θ) − L̂_n(θ) ≥ t] ≤ e^{λ²/(8n) − λt}.

Thus, setting t so that the right-hand side equals δ/M, from the union bound we obtain that with probability ≥ 1 − δ simultaneously for all θ we have

    L(θ) − L̂_n(θ) ≤ λ/(8n) + (1/λ) ln(M/δ).

Optimizing λ = √(8n ln(M/δ)) yields (4.38). On the other hand, if we apply (4.37) with π = Unif(Θ) and observe that D(ρ‖π) ≤ log M, we recover (4.38) with only a slightly worse estimate of the excess risk.
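For concreteness, the two excess-risk terms can be compared numerically (a small sketch; the sample size, class size and confidence level are arbitrary, and the constants are those stated in (4.37) and (4.38)):

    import numpy as np
    n, M, delta = 1000, 100, 0.01
    union_bound = np.sqrt(np.log(M/delta)/(2*n))                                       # excess risk in (4.38)
    pac_bayes = 1.25*np.sqrt((np.log(M) + np.log(1/delta))/(2*n)) + 1/np.sqrt(10*n)    # (4.37) with D(rho||pi) <= log M
    print(union_bound, pac_bayes)    # the PAC-Bayes value is only moderately larger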

We can see that, just like in the previous subsection, the core problem in showing Theorem 4.16 is that the union bound only applies to finite Θ, and we need to work around that problem by leveraging the PAC-Bayes inequality.

Proof. First, we fix λ and apply the PAC-Bayes inequality to the functions f_θ(Xⁿ) = λ(L(θ) − L̂_n(θ)). By Hoeffding's lemma we know L̂_n(θ) is 1/(4n)-subgaussian and thus ψ(θ) ≤ λ²/(8n). Thus, we have with probability ≥ 1 − δ simultaneously for all ρ:

    E_ρ[L(θ) − L̂_n(θ)] ≤ λ/(8n) + ( D(ρ‖π) + ln(1/δ) )/λ.    (4.39)

Let us denote for convenience b(ρ) = D(ρ‖π) + ln(1/δ). Since δ < e^{−1}, we see that b(ρ) ≥ 1. We would like to optimize λ in (4.39) by setting λ = λ* ≜ √(8n b(ρ)). However, of course, λ cannot depend on ρ. Thus, instead we select a countable grid λ_i = √(2n)·2^i, i ≥ 1, and apply the PAC-Bayes inequality separately for each λ_i with probability chosen to be δ_i = δ·2^{−i}. Then from the union bound we have for all ρ and all i ≥ 1 simultaneously:

    E_ρ[L(θ) − L̂_n(θ)] ≤ λ_i/(8n) + ( b(ρ) + i ln 2 )/λ_i.

Let i* = i*(ρ) be chosen so that λ*(ρ) ≤ λ_{i*} < 2λ*(ρ). From the latter inequality we have √(2n)·2^{i*} < 2√(8 b(ρ) n) and thus

    i* ln 2 < ln 4 + (1/2) ln(b(ρ)) ≤ ln 4 − 1/2 + (1/2) b(ρ).

Therefore, choosing i = i* in the bound and upper-bounding λ_{i*} ≤ 2λ* and 1/λ_{i*} ≤ 1/λ*, we get that for all ρ

    E_ρ[L(θ) − L̂_n(θ)] ≤ (2 + 3/2)√( b(ρ)/(8n) ) + (ln 4 − 1/2)/√(8n b(ρ)).

Finally, bounding the last term by 1/√(10n) (since b ≥ 1), we obtain the theorem.

We remark that although we proved an optimized (over λ) bound, the intermediate result (4.39) is also quite important. Indeed, it suggests choosing ρ (the randomized estimator) by minimizing the regularized empirical risk E_ρ[L̂_n(θ)] + (1/λ)D(ρ‖π). The minimizing ρ is just the Gibbs sampler (4.36) due to Proposition 4.7, which justifies its popularity.
PAC-Bayes bounds are often criticized on the following grounds. Suppose we take a neural network and train it (perhaps by using a version of gradient descent) until it finds some choice of weight matrices θ̂ that results in an acceptably low value of L̂_n(θ̂). We would now like to apply the PAC-Bayes Theorem 4.16 to convince ourselves that the test loss L(θ̂) will also be small. But notice that the weights of the neural network are non-random, that is ρ = δ_θ̂, and hence for any continuous prior π we will have D(ρ‖π) = ∞, resulting in a vacuous bound. For a while this was considered to be an unavoidable limitation, until the elegant work [476]. There the authors argue that, in the end, the weights of neural networks are stored as finite-bit approximations (floating point) and we can use π(θ) = 2^{−length(θ)} as a prior. Here length(θ) represents the total number of bits in a compressed representation of θ. As we will learn in Part II this indeed defines a valid probability
distribution (for any choice of the lossless compressor). In this way, the idea of [476] bridges
the area of data compression and generalization bounds: if the trained neural network has highly
compressible θ (e.g. has many zero weights) then it has smaller excess risk and thus is less prone
to overfitting.
Before closing this section, let us also apply the "in expectation" version of the PAC-Bayes (4.34). Namely, again suppose that the losses ℓ_θ(x) are in [0, 1] and suppose the learning algorithm (given Xⁿ) selects ρ and then samples θ̂ ∼ ρ. This creates a joint distribution P_{θ̂,Xⁿ}. From (4.34), as in the preceding proof, for every λ > 0 we get

    E[L(θ̂) − L̂_n(θ̂)] ≤ λ/(8n) + (1/λ) I(θ̂; Xⁿ).

Optimizing over λ we obtain the bound

    E[L(θ̂) − L̂_n(θ̂)] ≤ √( I(θ̂; Xⁿ)/(2n) ).
This version of McAllester’s result [369, 461] provides a useful intuition: the algorithm’s propen-
sity to overfit can be gauged by the amount of information leaking from Xn into θ̂. For applications,
though, a version with a flexible reference prior π, as in Theorem 4.16, appears more convenient.

5 Extremization of mutual information: capacity saddle point

There are four fundamental optimization problems arising in information theory:

• Information projection (or I-projection): Given Q minimize D(PkQ) over convex class of P’s.
(See Chapter 15.)
• Maximum likelihood: Given P minimize D(PkQ) over some class of Q. (See Section 29.3.)
• Rate-Distortion: Given PX minimize I(X; Y) over a convex class of PY|X . (See Chapter 26.)
• Capacity: Given PY|X maximize I(X; Y) over a convex class of PX . (This chapter.)

In this chapter we show that all these problems have convex/concave objective functions,
discuss iterative algorithms for solving them, and study the capacity problem in more detail.
Specifically, we will find that the supremum over input distributions PX can also be written as infi-
mum over the output distributions PY and the resulting minimax problem has a saddle point. This
will lead to understanding of capacity as information radius of a set of conditional distributions
{PY|X=x , x ∈ X } measured in KL divergence.

5.1 Convexity of information measures


Theorem 5.1 The map (P, Q) 7→ D(PkQ) is convex.

Proof. Let PX = QX = Ber(λ) and define two conditional kernels:


PY|X=0 = P0 , PY|X=1 = P1
QY|X=0 = Q0 , QY|X=1 = Q1
An explicit calculation shows that
D(PX,Y kQX,Y ) = λ̄D(P0 kQ0 ) + λD(P1 kQ1 ) .
Therefore, from the DPI (monotonicity) we get:
λ̄D(P0 kQ0 ) + λD(P1 kQ1 ) = D(PX,Y kQX,Y ) ≥ D(PY kQY ) = D(λ̄P0 + λP1 kλ̄Q0 + λQ1 ).

Remark 5.1 The proof shows that for an arbitrary measure of similarity D(PkQ), the con-
vexity of (P, Q) 7→ D(PkQ) is equivalent to “conditioning increases divergence” property of D.
Convexity can also be understood as “mixing decreases divergence”.


Remark 5.2 (Strict and strong convexity) There are a number of alternative arguments possible. For example, (p, q) ↦ p log(p/q) is convex on R²_+, which is a manifestation of a general phenomenon: for a convex f(·) the perspective function (p, q) ↦ q f(p/q) is convex too. Yet another way is to invoke the Donsker-Varadhan variational representation (Theorem 4.6) and notice that a supremum of convex functions is convex. Our proof, however, allows us to immediately notice that the map (P, Q) ↦ D(P‖Q) is not strictly convex. Indeed, the gap in the DPI that we used in the proof is equal to D(P_{X|Y}‖Q_{X|Y}|P_Y), which can be zero. For example, this happens if P₀, Q₀ have common support, which is disjoint from the common support of P₁, Q₁. At the same time the map P ↦ D(P‖Q), whose convexity was so crucial in the previous chapter, turns out not only to be strictly convex but in fact strongly convex with respect to total variation, cf. Exercise I.37. This strong convexity is crucial for the analysis of the mirror descent algorithm, which is a first-order method for optimization over probability measures (see [40, Examples 9.10 and 5.27]).

Theorem 5.2 The map PX 7→ H(X) is concave. Furthermore, if PY|X is any channel, then
PX 7→ H(X|Y) is concave. If X is finite, then PX 7→ H(X|Y) is continuous.

Proof. For the special case of the first claim, when PX is on a finite alphabet, the proof is complete
by H(X) = log |X | − D(PX kUX ). More generally, we prove the second claim as follows. Let
f(PX ) = H(X|Y). Introduce a random variable U ∼ Ber(λ) and define the transformation

    P_{X|U=0} = P₀,    P_{X|U=1} = P₁.

Consider the probability space U → X → Y. Then we have f(λP1 + (1 − λ)P0 ) = H(X|Y) and
λf(P1 ) + (1 − λ)f(P0 ) = H(X|Y, U). Since H(X|Y, U) ≤ H(X|Y), the proof is complete. Continuity
follows from Proposition 4.13.

Recall that I(X; Y) is a function of PX,Y , or equivalently, (PX , PY|X ). Denote I(PX , PY|X ) =
I(X; Y).

Theorem 5.3 (Mutual Information)

• For fixed P_{Y|X}, the map P_X ↦ I(P_X, P_{Y|X}) is concave.
• For fixed P_X, the map P_{Y|X} ↦ I(P_X, P_{Y|X}) is convex.

Proof. There are several ways to prove the first statement, all having their merits.

• First proof: Introduce θ ∼ Ber(λ). Define P_{X|θ=0} = P⁰_X and P_{X|θ=1} = P¹_X. Then θ → X → Y and P_X = λ̄P⁰_X + λP¹_X. Now I(X; Y) = I(X, θ; Y) = I(θ; Y) + I(X; Y|θ) ≥ I(X; Y|θ), which is our desired I(λ̄P⁰_X + λP¹_X, P_{Y|X}) ≥ λ̄ I(P⁰_X, P_{Y|X}) + λ I(P¹_X, P_{Y|X}).
• Second proof : I(X; Y) = minQ D(PY|X kQ|PX ), which is a pointwise minimum of affine functions
in PX and hence concave.

• Third proof : Pick a Q and use the golden formula: I(X; Y) = D(PY|X kQ|PX ) − D(PY kQ), where
PX 7→ D(PY kQ) is convex, as the composition of the PX 7→ PY (affine) and PY 7→ D(PY kQ)
(convex).

To prove the second (convexity) statement, simply notice that

I(X; Y) = D(PY|X kPY |PX ) .

The argument PY is a linear function of PY|X and thus the statement follows from convexity of D
in the pair.

Review: Minimax and saddle-point

Suppose we have a bivariate function f. Then we always have the minimax inequality:

    inf_y sup_x f(x, y) ≥ sup_x inf_y f(x, y).

When does it hold with equality?

1 It turns out minimax equality is implied by the existence of a saddle point (x*, y*), i.e.,

    f(x, y*) ≤ f(x*, y*) ≤ f(x*, y)    ∀x, y.

Furthermore, minimax equality also implies the existence of a saddle point if the inf and sup are achieved for all x, y (cf. [49, Section 2.6]). [Straightforward to check; see the proof of the corollary below.]
2 There are a number of known criteria establishing

    inf_y sup_x f(x, y) = sup_x inf_y f(x, y).

They usually require some continuity of f, compactness of the domains, and concavity in x and convexity in y. One of the most general versions is due to M. Sion [389].
3 The mother result of all this minimax theory is a theorem of von Neumann on bilinear functions: let A and B have finite alphabets and g(a, b) be arbitrary; then

    min_{P_A} max_{P_B} E[g(A, B)] = max_{P_B} min_{P_A} E[g(A, B)].

Here (x, y) ↔ (P_A, P_B) and f(x, y) ↔ ∑_{a,b} P_A(a)P_B(b)g(a, b).
4 A more general version is: if X and Y are compact convex domains in Rⁿ and f(x, y) is continuous in (x, y), concave in x and convex in y, then

    max_{x∈X} min_{y∈Y} f(x, y) = min_{y∈Y} max_{x∈X} f(x, y).

5.2 Saddle point of mutual information


The following result is a cornerstone of analyzing maximum of mutual information over convex
sets of input distributions. We will see applications immediately in this section, as well as much
later in the book (in channel coding, sequential prediction, universal data compression and density
estimation).

Theorem 5.4 (Saddle point) Let P be a convex set of distributions on X . Suppose there
exists P∗X ∈ P , called a capacity-achieving input distribution, such that

sup I(PX , PY|X ) = I(P∗X , PY|X ) ≜ C.


PX ∈P

Let P∗Y = PY|X ◦ P∗X , called a capacity-achieving output distribution. Then for all PX ∈ P and for
all QY , we have

D(PY|X kP∗Y |PX ) ≤ D(PY|X kP∗Y |P∗X ) ≤ D(PY|X kQY |P∗X ). (5.1)

Proof. Right inequality in (5.1) follows from C = I(P∗X , PY|X ) = minQY D(PY|X kQY |P∗X ), where
the latter is (4.5).
The left inequality in (5.1) is trivial when C = ∞. So assume that C < ∞, and hence
I(P_X, P_{Y|X}) < ∞ for all P_X ∈ P. Let P_{X_λ} = λP_X + λ̄P*_X ∈ P and P_{Y_λ} = P_{Y|X} ∘ P_{X_λ}. Clearly, P_{Y_λ} = λP_Y + λ̄P*_Y, where P_Y = P_{Y|X} ∘ P_X.
We have the following chain then:

C ≥ I(Xλ ; Yλ ) = D(PY|X kPYλ |PXλ )


= λD(PY|X kPYλ |PX ) + λ̄D(PY|X kPYλ |P∗X )
≥ λD(PY|X kPYλ |PX ) + λ̄C
= λD(PX,Y kPX PYλ ) + λ̄C ,

where the inequality is by the right part of (5.1) (already shown). Thus, subtracting λ̄C and dividing by λ we get

    D(P_{X,Y}‖P_X P_{Y_λ}) ≤ C,

and the proof is completed by taking lim inf_{λ→0} and applying the lower semicontinuity of divergence (Theorem 4.9).

Corollary 5.5 In addition to the assumptions of Theorem 5.4, suppose C < ∞. Then the
capacity-achieving output distribution P∗Y is unique. It satisfies the property that for any PY induced
by some PX ∈ P (i.e. PY = PY|X ◦ PX ) we have

D(PY kP∗Y ) ≤ C < ∞ (5.2)

and in particular PY  P∗Y .

Proof. The statement is: I(PX , PY|X ) = C ⇒ PY = P∗Y . Indeed:

C = D(PY|X kPY |PX ) = D(PY|X kP∗Y |PX ) − D(PY kP∗Y )


≤ D(PY|X kP∗Y |P∗X ) − D(PY kP∗Y )
= C − D(PY kP∗Y ) ⇒ PY = P∗Y

Statement (5.2) follows from the left inequality in (5.1) and “conditioning increases divergence”
property in Theorem 2.16.

Remark 5.3 • The finiteness of C is necessary for Corollary 5.5 to hold. For a counterexample,
consider the identity channel Y = X, where X takes values on integers. Then any distribution
with infinite entropy is a capacity-achieving input (and output) distribution.
• Unlike the output distribution, the capacity-achieving input distribution need not be unique. For example, consider Y₁ = X₁ ⊕ Z₁ and Y₂ = X₂, where Z₁ ∼ Ber(1/2) is independent of X₁. Then max_{P_{X₁X₂}} I(X₁, X₂; Y₁, Y₂) = log 2, achieved by P_{X₁X₂} = Ber(p) × Ber(1/2) for any p. Note that the capacity-achieving output distribution is unique: P*_{Y₁Y₂} = Ber(1/2) × Ber(1/2).

Applying Theorem 5.4 to conditional divergence gives the following result.

Corollary 5.6 (Minimax) Under the assumptions of Theorem 5.4, we have

max I(X; Y) = max min D(PY|X kQY |PX )


PX ∈P PX ∈P QY

= min sup D(PY|X kQY |PX )


QY PX ∈P

Proof. This follows from the standard property of saddle points: Maximizing/minimizing the
leftmost/rightmost sides of (5.1) gives

min sup D(PY|X kQY |PX ) ≤ max D(PY|X kP∗Y |PX ) = D(PY|X kP∗Y |P∗X )
QY PX ∈P PX ∈P

≤ min D(PY|X kQY |P∗X ) ≤ max min D(PY|X kQY |PX ).


QY PX ∈P QY

but by definition min max ≥ max min. Note that we were careful to only use max and min for the
cases where we know the optimum is achievable.

Review: Radius and diameter


Let (X, d) be a metric space. Let A be a bounded subset.

1 Radius (aka Chebyshev radius) of A: the radius of the smallest ball that covers A, i.e.,

    rad(A) = inf_{y∈X} sup_{x∈A} d(x, y).    (5.3)

2 Diameter of A:

    diam(A) = sup_{x,y∈A} d(x, y).    (5.4)

Note that the radius and the diameter both measure the massiveness/richness of a set.
3 From the definition and the triangle inequality we have

    (1/2) diam(A) ≤ rad(A) ≤ diam(A).    (5.5)

The lower and upper bounds are achieved when A is, for example, a Euclidean ball and the Hamming space, respectively.
4 In many special cases, the upper bound in (5.5) can be improved:
• A result of Bohnenblust [67] shows that in Rⁿ equipped with any norm we always have rad(A) ≤ (n/(n+1)) diam(A).
• For Rⁿ with the Euclidean distance, Jung proved rad(A) ≤ √(n/(2(n+1))) diam(A), attained by the simplex. The best constant is sometimes called the Jung constant of the space.
• For Rⁿ with the ℓ∞-norm the situation is even simpler: rad(A) = (1/2) diam(A); such spaces are called centrable.

5.3 Capacity as information radius


The next simple corollary shows that capacity of a channel (Markov kernel) is just the radius of
a (finite) collection of distributions {PY|X=x : x ∈ X } when distances are measured by divergence
(although, we remind, divergence is not a metric).

Corollary 5.7 For any finite X and any kernel PY|X , the maximal mutual information over all
distributions PX on X satisfies

max I(X; Y) = max D(PY|X=x kP∗Y )


PX x∈X
= D(PY|X=x kP∗Y ) ∀x : P∗X (x) > 0 .

The last corollary gives a geometric interpretation to capacity: It equals the radius of the smallest
divergence-“ball” that encompasses all distributions {PY|X=x : x ∈ X }. Moreover, the optimal
center P∗Y is a convex combination of some PY|X=x and is equidistant to those.
The following is the information-theoretic version of “radius ≤ diameter” (in KL divergence)
for arbitrary input space (see Theorem 32.4 for a related representation):

Corollary 5.8 Let {PY|X=x : x ∈ X } be a set of distributions. Then


    C = sup_{P_X} I(X; Y) ≤ inf_Q sup_{x∈X} D(P_{Y|X=x}‖Q) ≤ sup_{x,x′∈X} D(P_{Y|X=x}‖P_{Y|X=x′}),

where the middle quantity is the radius and the rightmost quantity is the diameter.

Proof. By the golden formula (Corollary 4.2), we have

    I(X; Y) = inf_Q D(P_{Y|X}‖Q|P_X) ≤ inf_Q sup_{x∈X} D(P_{Y|X=x}‖Q) ≤ inf_{x′∈X} sup_{x∈X} D(P_{Y|X=x}‖P_{Y|X=x′}).

5.4* Existence of capacity-achieving output distribution (general channel)
In the previous section we have shown that the solution to
C = sup I(X; Y)
PX ∈P

can be (a) interpreted as a saddle point; (b) written in the minimax form; and (c) that the capacity-
achieving output distribution P∗Y is unique. This was all done under the extra assumption that the
supremum over PX is attainable. It turns out, properties b) and c) can be shown without that extra
assumption.

Theorem 5.9 (Kemperman [243]) For any PY|X and a convex set of distributions P such
that
C = sup I(PX , PY|X ) < ∞, (5.6)
PX ∈P

there exists a unique P∗Y with the property that


C = sup D(PY|X kP∗Y |PX ) . (5.7)
PX ∈P

Furthermore,
C = sup min D(PY|X kQY |PX ) (5.8)
PX ∈P QY
= min sup D(PY|X kQY |PX ) (5.9)
QY PX ∈P

= min sup D(PY|X=x kQY ) , (if P = {all PX }.) (5.10)


QY x∈X

Note that Condition (5.6) is automatically satisfied if there exists a QY such that

sup D(PY|X kQY |PX ) < ∞ . (5.11)


PX ∈P

Example 5.1 (Non-existence of capacity-achieving input distribution) Let Z ∼


N (0, 1) and consider the problem

    C = sup_{P_X : E[X]=0, E[X²]=P, E[X⁴]=s} I(X; X + Z).    (5.12)

Without the constraint E[X⁴] = s, the capacity is uniquely achieved at the input distribution P_X = N(0, P); see Theorem 5.11. When s ≠ 3P², such a P_X is no longer feasible. However, for s > 3P² we still have

    C = (1/2) log(1 + P).

Indeed, we can add a small "bump" to the Gaussian distribution as follows:

    P_X = (1 − p)N(0, P) + pδ_x,

where p → 0 and x → ∞ such that px² → 0 but px⁴ → s − 3P² > 0. This shows that for the problem (5.12) with s > 3P², the capacity-achieving input distribution does not exist, but the capacity-achieving output distribution P*_Y = N(0, 1 + P) exists and is unique, as Theorem 5.9 shows.

Proof of Theorem 5.9. Let P′Xn be a sequence of input distributions achieving C, i.e.,
I(P′Xn , PY|X ) → C. Let Pn be the convex hull of {P′X1 , . . . , P′Xn }. Since Pn is a finite-dimensional
simplex, the (concave) function PX 7→ I(PX , PY|X ) is continuous (Proposition 4.13) and attains its
maximum at some point PXn ∈ Pn , i.e.,

In ≜ I(PXn , PY|X ) = max I(PX , PY|X ) .


PX ∈Pn

Denote by PYn be the output distribution induced by PXn . We have then:

D(PYn kPYn+k ) = D(PY|X kPYn+k |PXn ) − D(PY|X kPYn |PXn ) (5.13)


≤ I(PXn+k , PY|X ) − I(PXn , PY|X ) (5.14)
≤ C − In , (5.15)

where in (5.14) we applied Theorem 5.4 to (Pn+k , PYn+k ). The crucial idea is to apply comparison
of KL divergence (which is not a distance) with a true distance known as total variation defined
in (7.3) below. Such comparisons are going to be the topic of Chapter 7. Here we take for granted the validity of Pinsker's inequality (see Theorem 7.10). According to that inequality, and since I_n ↗ C, we conclude that the sequence P_{Y_n} is Cauchy in total variation:

    sup_{k≥1} TV(P_{Y_n}, P_{Y_{n+k}}) → 0,    n → ∞.
k≥ 1

Since the space of all probability distributions on a fixed alphabet is complete in total variation,
the sequence must have a limit point PYn → P∗Y . Convergence in TV implies weak convergence,
and thus by taking a limit as k → ∞ in (5.15) and applying the lower semicontinuity of divergence
(Theorem 4.9) we get

D(PYn kP∗Y ) ≤ lim D(PYn kPYn+k ) ≤ C − In ,


k→∞

and therefore, PYn → P∗Y in the (stronger) sense of D(PYn kP∗Y ) → 0. By Theorem 4.1,

D(PY|X kP∗Y |PXn ) = In + D(PYn kP∗Y ) → C . (5.16)


Take any P_X ∈ ∪_{k≥1} P_k. Then P_X ∈ P_n for all sufficiently large n and thus by Theorem 5.4

D(PY|X kPYn |PX ) ≤ In ≤ C , (5.17)

which, by the lower semicontinuity of divergence and Fatou’s lemma, implies

D(PY|X kP∗Y |PX ) ≤ C . (5.18)

To prove that (5.18) holds for arbitrary PX ∈ P , we may repeat the argument above with Pn
replaced by P̃n = conv({PX } ∪ Pn ), denoting the resulting sequences by P̃Xn , P̃Yn and the limit
point by P̃∗Y , and obtain:

D(PYn kP̃Yn ) = D(PY|X kP̃Yn |PXn ) − D(PY|X kPYn |PXn ) (5.19)


≤ C − In , (5.20)

where (5.20) follows from (5.18) since PXn ∈ P̃n . Hence taking limit as n → ∞ we have P̃∗Y = P∗Y
and therefore (5.18) holds.
To see the uniqueness of P∗Y , assuming there exists Q∗Y that fulfills C = supPX ∈P D(PY|X kQ∗Y |PX ),
we show Q∗Y = P∗Y . Indeed,

C ≥ D(PY|X kQ∗Y |PXn ) = D(PY|X kPYn |PXn ) + D(PYn kQ∗Y ) = In + D(PYn kQ∗Y ).

Since In → C, we have D(PYn kQ∗Y ) → 0. Since we have already shown that D(PYn kP∗Y ) → 0,
we conclude P∗Y = Q∗Y (this can be seen, for example, from Pinsker’s inequality and the triangle
inequality TV(P∗Y , Q∗Y ) ≤ TV(PYn , Q∗Y ) + TV(PYn , P∗Y ) → 0).
Finally, to see (5.9), note that by definition capacity as a max-min is at most the min-max, i.e.,

C = sup min D(PY|X kQY |PX ) ≤ min sup D(PY|X kQY |PX ) ≤ sup D(PY|X kP∗Y |PX ) = C
PX ∈P QY QY PX ∈P PX ∈P

in view of (5.16) and (5.17).

Corollary 5.10 Let X be countable and P a convex set of distributions on X. If sup_{P_X∈P} H(X) < ∞ then

    sup_{P_X∈P} H(X) = min_{Q_X} sup_{P_X∈P} ∑_x P_X(x) log(1/Q_X(x)) < ∞

and the optimizer Q∗X exists and is unique. If Q∗X ∈ P , then it is also the unique maximizer of H(X).

Proof. Just apply Kemperman’s Theorem 5.9 to the identity channel Y = X.

Example 5.2 (Max entropy) Assume that f : Z → R is such that Z(λ) ≜ ∑_{n∈Z} exp{−λf(n)} < ∞ for all λ > 0. Then

    max_{X : E[f(X)]≤a} H(X) ≤ inf_{λ>0} {λa + log Z(λ)}.

This follows from taking

QX (n) = Z(λ)−1 exp{−λf(n)} (5.21)

in Corollary 5.10. Distributions of this form are known as Gibbs distributions for the energy func-
tion f. This bound is often tight and achieved by PX (n) = Z(λ∗ )−1 exp{−λ∗ f(n)} with λ∗ being
the minimizer, see Exercise III.27. (Note that Proposition 4.7 discusses Lagrangian version of the
same problem.)

5.5 Gaussian saddle point


For the additive noise channel there is another curious saddle point relation that we rigorously
present in the next result. The proofs are based on applying the inf- and sup-characterizations of
mutual information in (4.5) and (4.7), respectively. Note that we have already seen that Gaussian
distribution is extremal under covariance constraints (Theorem 2.8).

Theorem 5.11 Let X_g ∼ N(0, σ²_X), N_g ∼ N(0, σ²_N), X_g ⊥⊥ N_g. Then:

1. "Gaussian capacity":

    C = I(X_g; X_g + N_g) = (1/2) log(1 + σ²_X/σ²_N).

2. "Gaussian input is the best for Gaussian noise": For all X ⊥⊥ N_g with Var X ≤ σ²_X,

    I(X; X + N_g) ≤ I(X_g; X_g + N_g),    (5.22)

with equality iff X = X_g in distribution.
3. "Gaussian noise is the worst for Gaussian input": For all N such that E[X_g N] = 0 and E[N²] ≤ σ²_N,

    I(X_g; X_g + N) ≥ I(X_g; X_g + N_g),

with equality iff N = N_g in distribution and independent of X_g.

This result encodes extremality properties of the normal distribution: for the AWGN channel,
Gaussian input is the most favorable (attains the maximum mutual information, or capacity), while
for a general additive noise channel the least favorable noise is Gaussian. For a vector version of
the former statement see Exercise I.9.

Proof. WLOG, assume all random variables have zero mean. Let Y_g = X_g + N_g. Define

    f(x) ≜ D(P_{Y_g|X_g=x}‖P_{Y_g}) = D(N(x, σ²_N)‖N(0, σ²_X + σ²_N)) = (1/2) log(1 + σ²_X/σ²_N) + (log e/2)·(x² − σ²_X)/(σ²_X + σ²_N),

where the first term equals C.
1. Compute I(Xg ; Xg + Ng ) = E[f(Xg )] = C


2. Recall the inf-representation (Corollary 4.2): I(X; Y) = minQ D(PY|X kQ|PX ). Then
I(X; X + Ng ) ≤ D(PYg |Xg kPYg |PX ) = E[f(X)] ≤ C < ∞ .
Furthermore, if I(X; X + Ng ) = C, then the uniqueness of the capacity-achieving output distribu-
tion, cf. Corollary 5.5, implies PY = PYg . But PY = PX ∗N (0, σN2 ), where ∗ denotes convolution.
Then it must be that X ∼ N (0, σX2 ) simply by considering characteristic functions:
    Ψ_X(t)·e^{−σ²_N t²/2} = e^{−(σ²_X+σ²_N)t²/2}  ⇒  Ψ_X(t) = e^{−σ²_X t²/2}  ⇒  X ∼ N(0, σ²_X)

3. Let Y = X_g + N and let P_{Y|X_g} be the respective kernel. Note that here we only assume that N is uncorrelated with X_g, i.e., E[NX_g] = 0, not necessarily independent. Then

    I(X_g; X_g + N) ≥ E log( dP_{X_g|Y_g}(X_g|Y) / dP_{X_g}(X_g) )    (5.23)
                    = E log( dP_{Y_g|X_g}(Y|X_g) / dP_{Y_g}(Y) )    (5.24)
                    = C + (log e/2) E[ Y²/(σ²_X + σ²_N) − N²/σ²_N ]    (5.25)
                    = C + (log e/2)·(σ²_X/(σ²_X + σ²_N))·(1 − E[N²]/σ²_N)    (5.26)
                    ≥ C,    (5.27)

where
• (5.23) follows from (4.7);
• (5.24) uses dP_{X_g|Y_g}/dP_{X_g} = dP_{Y_g|X_g}/dP_{Y_g};
• (5.26) uses E[X_g N] = 0 and E[Y²] = E[N²] + E[X²_g];
• (5.27) uses E[N²] ≤ σ²_N.
Finally, the conditions for equality in (5.23) (see (4.8)) require
D(PXg |Y kPXg |Yg |PY ) = 0
Thus, PXg |Y = PXg |Yg , i.e., Xg is conditionally Gaussian: PXg |Y=y = N (by, c2 ) for some constants
b and c. In other words, under PXg Y , we have
    X_g = bY + cZ,    Z ∼ Gaussian, Z ⊥⊥ Y.

But then Y must be Gaussian itself, by Cramér's theorem [107] or simply by considering characteristic functions:

    Ψ_Y(t)·e^{c t²} = e^{c′ t²}  ⇒  Ψ_Y(t) = e^{c″ t²}  ⇒  Y is Gaussian

Therefore, (Xg , Y) must be jointly Gaussian and hence N = Y − Xg is Gaussian. Thus we


conclude that it is only possible to attain I(Xg ; Xg + N) = C if N is Gaussian of variance σN2 and
independent of Xg .

5.6 Iterative algorithms: Blahut-Arimoto, Expectation-Maximization, Sinkhorn
Although the optimization problems that we discussed above are convex (and thus would be con-
sidered algorithmically “easy” in finite dimensions), there are still clever ideas used to speed up
their numerical solutions. The main underlying principle is the following alternating minimization
algorithm:

• Optimization problem: mint f(t).


• Assumption I: f(t) = mins F(t, s) (i.e. f can be written as a minimum of some other function F).
• Assumption II: There exist two solvers t∗ (s) = argmint F(t, s) and s∗ (t) = argmins F(t, s).
• Iterative algorithm:
– Step 0: Fix some s0 , t0 .
– Step 2k − 1: sk = s∗ (tk−1 ).
– Step 2k: tk = t∗ (sk ).

Note that there is a steady improvement at each step (the value F(sk , tk ) is decreasing), so it
can be often proven that the algorithm converges to a local minimum, or even a global minimum
under appropriate conditions (e.g. the convexity of f). Below we discuss several applications of
this idea, and refer to [113] for proofs of convergence. In general, this class of iterative meth-
ods for maximizing and minimizing mutual information are called Blahut-Arimoto algorithms for
their original discoverers [24, 62]. Unlike gradient ascent/descent that proceeds by small (“local”)
changes of the decision variable, algorithms in this section move by large (“global”) jumps and
hence converge much faster.
The basis of all these algorithms is the Gibbs variational principle (Proposition 4.7): for any function c : Y → R and any Q_Y on Y, under the integrability condition Z = ∫ Q_Y(dy) exp{−c(y)} < ∞, the minimum

    min_{P_Y} D(P_Y‖Q_Y) + E_{Y∼P_Y}[c(Y)]    (5.28)

is attained at P*_Y(dy) = (1/Z) Q_Y(dy) exp{−c(y)}. For simplicity below we mostly consider the case
of discrete alphabets X , Y .

Maximizing mutual information (capacity). We have a fixed P_{Y|X} and the optimization problem

    C = max_{P_X} I(X; Y) = max_{P_X} max_{Q_{X|Y}} E_{P_{X,Y}}[ log( Q_{X|Y}(X|Y)/P_X(X) ) ],

where in the second equality we invoked (4.7). This results in the iterations:

    Q_{X|Y}(x|y) ← (1/Z(y)) P_X(x) P_{Y|X}(y|x),
    P_X(x) ← Q′(x) ≜ (1/Z) exp{ ∑_y P_{Y|X}(y|x) log Q_{X|Y}(x|y) },

where Z(y) and Z are normalization constants. To derive this, notice that for a fixed P_X the optimal Q_{X|Y} = P_{X|Y}. For a fixed Q_{X|Y}, we can see that

    E_{P_{X,Y}}[ log( Q_{X|Y}(X|Y)/P_X(X) ) ] = log Z − D(P_X‖Q′),

and thus the optimal P_X = Q′.
Denoting by P_n the value of P_X at the nth iteration, we observe that

    I(P_n, P_{Y|X}) ≤ C ≤ sup_x D(P_{Y|X=x}‖P_{Y|X} ∘ P_n).    (5.29)

This is useful since at every iteration we get not only an estimate P_n of the optimizer, but also a bound on the gap to optimality: C − I(P_n, P_{Y|X}) is at most the difference between the right- and left-hand sides of (5.29). It can be shown, furthermore, that both sides of (5.29) monotonically converge to C as n → ∞; see [113] for details.
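A minimal numerical sketch of these capacity iterations is given below (discrete alphabets, natural logarithms; the BSC test channel is an arbitrary choice, and for simplicity the kernel is assumed to have strictly positive entries):

    import numpy as np

    def blahut_arimoto(W, iters=300):
        # W[x, y] = P_{Y|X}(y|x), rows sum to 1; assumed entrywise positive for simplicity
        nx = W.shape[0]
        p = np.full(nx, 1.0/nx)                              # current input distribution P_X
        for _ in range(iters):
            q = p[:, None]*W
            q /= q.sum(axis=0, keepdims=True)                # Q_{X|Y}(x|y) <- P_X(x) P_{Y|X}(y|x) / Z(y)
            s = (W*np.log(q)).sum(axis=1)
            p = np.exp(s - s.max()); p /= p.sum()            # P_X <- Q'
        dkl = (W*np.log(W/(p @ W)[None, :])).sum(axis=1)     # D(P_{Y|X=x} || P_{Y|X} o P_n)
        return (p*dkl).sum(), dkl.max(), p                   # lower and upper bounds in (5.29)

    delta = 0.11
    W = np.array([[1 - delta, delta], [delta, 1 - delta]])   # BSC(delta)
    lo, hi, p = blahut_arimoto(W)
    print(lo, hi)    # both approach log 2 - h(delta) nats for the BSC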

Minimizing mutual information (rate-distortion). We have a fixed P_X, a cost function d(x, y), and the optimization problem

    R = min_{P_{Y|X}} I(X; Y) + E[d(X, Y)] = min_{P_{Y|X}, Q_Y} D(P_{Y|X}‖Q_Y|P_X) + E[d(X, Y)],    (5.30)

where in the second equality we invoked (4.5). This minimization problem is the basis of lossy compression and will be discussed extensively in Part V. Using (5.28) we derive the iterations:

    P_{Y|X}(y|x) ← (1/Z(x)) Q_Y(y) exp{−d(x, y)},
    Q_Y ← P_{Y|X} ∘ P_X.

A sandwich bound similar to (5.29) holds here, see (5.32), so that one gets two computable sequences converging to R from above and below, as well as P_{Y|X} converging to the argmin in (5.30).
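The analogous sketch for these rate-distortion iterations is shown below (an explicit multiplier λ is placed in front of the distortion, which simply amounts to rescaling d; the binary source with Hamming distortion is an arbitrary test case):

    import numpy as np

    def ba_rate_distortion(px, d, lam=1.0, iters=300):
        # px[x]: source distribution; d[x, y]: distortion matrix; solves min I(X;Y) + lam*E[d(X,Y)]
        ny = d.shape[1]
        qy = np.full(ny, 1.0/ny)
        for _ in range(iters):
            pyx = qy[None, :]*np.exp(-lam*d)            # P_{Y|X}(y|x) proportional to Q_Y(y) exp{-lam d(x,y)}
            pyx /= pyx.sum(axis=1, keepdims=True)
            qy = px @ pyx                               # Q_Y <- P_{Y|X} o P_X
        I = np.sum(px[:, None]*pyx*np.log(pyx/qy[None, :]))   # mutual information in nats
        D = np.sum(px[:, None]*pyx*d)                   # average distortion
        return I, D

    px = np.array([0.5, 0.5])
    d = 1.0 - np.eye(2)                                 # Hamming distortion
    print(ba_rate_distortion(px, d, lam=2.0))           # one point on the binary rate-distortion tradeoff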

EM algorithm (convex case). Proposed in [121], the Expectation-Maximization (EM) algo-


rithm is a heuristic for solving the maximum likelihood problem. It is known to converge to the
global maximizer for convex problems. We first consider this special case (with a general one to
follow next). Given a distribution PX our goal is to minimize the divergence with respect to the
mixture QX = QX|Y ◦ QY :

L = min D(PX kQX|Y ◦ QY ) , (5.31)


QY

where Q_{X|Y} is a given channel. This is a problem arising in maximum likelihood estimation for mixture models, where Q_Y is the unknown mixing distribution and P_X = (1/n) ∑_{i=1}^n δ_{x_i} is the empirical distribution of the sample (x_1, ..., x_n). To derive an iterative algorithm for (5.31), we write

    min_{Q_Y} D(P_X‖Q_X) = min_{Q_Y} min_{P_{Y|X}} D(P_{X,Y}‖Q_{X,Y}).

(Note that taking d(x, y) = − log( dQ_{X|Y}(x|y)/dP_X(x) ) shows that this problem is equivalent to (5.30).) By the chain rule, we thus find the iterations

    P_{Y|X}(y|x) ← (1/Z(x)) Q_Y(y) Q_{X|Y}(x|y),
    Q_Y ← P_{Y|X} ∘ P_X.
Denote by Q_Y^{(n)} the value of Q_Y at the nth iteration and Q_X^{(n)} = Q_{X|Y} ∘ Q_Y^{(n)}. Notice that for any n and all Q_Y we have, from Jensen's inequality,

    D(P_X‖Q_X^{(n)}) − D(P_X‖Q_X) = E_{X∼P_X} log E_{Y∼Q_Y}[ dQ_{X|Y}/dQ_X^{(n)} ] ≤ gap(Q_X^{(n)}),

where gap(Q_X) ≜ log esssup_y E_{X∼P_X}[ dQ_{X|Y=y}/dQ_X ]. In all, we get the following sandwich bound:

    D(P_X‖Q_X^{(n)}) − gap(Q_X^{(n)}) ≤ L ≤ D(P_X‖Q_X^{(n)}),    (5.32)

and it can be shown that as n → ∞ both sides converge to L; see e.g. [112, Theorem 5.3].
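Below is a small sketch of these iterations for the classical convex case of fitting mixing weights over known Gaussian components (the means, weights and sample size are arbitrary choices):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    mus = np.array([-2.0, 0.0, 2.0])                 # known components Q_{X|Y=j} = N(mu_j, 1)
    w_true = np.array([0.2, 0.5, 0.3])
    labels = rng.choice(3, size=2000, p=w_true)
    x = rng.normal(mus[labels], 1.0)                 # observed sample; P_X is its empirical distribution

    lik = norm.pdf(x[:, None], loc=mus[None, :])     # Q_{X|Y=j}(x_i)
    w = np.full(3, 1/3)                              # current mixing distribution Q_Y
    for _ in range(200):
        post = w[None, :]*lik
        post /= post.sum(axis=1, keepdims=True)      # P_{Y|X}(j|x_i) <- Q_Y(j) Q_{X|Y=j}(x_i) / Z(x_i)
        w = post.mean(axis=0)                        # Q_Y <- P_{Y|X} o P_X
    print(w.round(3), w_true)                        # recovered weights vs the truth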

EM algorithm (general case). The EM algorithm is also applicable more broadly than (5.31), in which the quantity Q_{X|Y} is fixed. In general, we consider the model where both Q_Y^{(θ)} and Q_{X|Y}^{(θ)} depend on the unknown parameter θ and the goal (see Section 29.3) is to maximize the total log likelihood ∑_{i=1}^n log Q_X^{(θ)}(x_i) over θ. A canonical example (which was one of the original motivations for the EM algorithm) is the k-component Gaussian mixture Q_X^{(θ)} = ∑_{j=1}^k w_j N(μ_j, 1); in other words, Q_Y = (w_1, ..., w_k), Q_{X|Y=j} = N(μ_j, 1) and θ = (w_1, ..., w_k, μ_1, ..., μ_k). If the centers μ_j are known and only the weights w_j are to be estimated, then we get the simple convex case in (5.31). Otherwise, we need to jointly optimize the log-likelihood over the centers and the weights, which is a non-convex problem.
Here, one way to approach the problem is to apply the ELBO (4.16) as follows:

    log Q_X^{(θ)}(x_i) = sup_P E_{Y∼P}[ log( Q_{X,Y}^{(θ)}(x_i, Y)/P(Y) ) ].

Thus the maximum likelihood can be written as a double maximization problem

    sup_θ ∑_i log Q_X^{(θ)}(x_i) = sup_θ sup_{P_{Y|X}} F(θ, P_{Y|X}),

where

    F(θ, P_{Y|X}) = ∑_i E_{Y∼P_{Y|X=x_i}}[ log( Q_{X,Y}^{(θ)}(x_i, Y)/P_{Y|X}(Y|x_i) ) ].

Thus, the iterative algorithm is to start with some θ and update according to

    P_{Y|X} ← Q_{Y|X}^{(θ)}                 (E-step)
    θ ← argmax_θ F(θ, P_{Y|X})              (M-step).    (5.33)

In general, if the log likelihood function is non-convex in θ, EM iterations may not converge to the global optimum even with infinite sample size (see [234] for an example for 3-component Gaussian mixtures). Furthermore, for certain problems the E-step quantity Q_{Y|X}^{(θ)} may be intractable to compute. In those cases one performs an approximate version of EM where the step max_{P_{Y|X}} F(θ, P_{Y|X}) is solved over a restricted class of distributions, cf. Examples 4.1 and 4.2.

Sinkhorn's algorithm. This algorithm [388] is very similar to, but not exactly the same as, the ones above. We fix Q_{X,Y} and two marginals V_X, V_Y and solve the problem

    S = min{ D(P_{X,Y}‖Q_{X,Y}) : P_X = V_X, P_Y = V_Y }.

From the results of Chapter 15 (see Theorem 15.16 and Example 15.2) it is clear that the optimal distribution P_{X,Y} is given by

    P*_{X,Y}(x, y) = A(x)Q_{X,Y}(x, y)B(y),

for some A, B ≥ 0. In order to find the functions A, B we notice that under a fixed B the value of A that makes P_X = V_X is given by

    A(x) ← V_X(x) / ∑_y Q_{X,Y}(x, y)B(y).

Similarly, to fix the Y-marginal we set

    B(y) ← V_Y(y) / ∑_x A(x)Q_{X,Y}(x, y).

Sinkhorn's algorithm alternates the A and B updates until convergence.
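A minimal sketch of these alternating scalings is shown below (the Gaussian kernel, the grid and the uniform target marginals are arbitrary choices for illustration):

    import numpy as np

    nx, ny, eps = 6, 6, 0.1
    xs, ys = np.linspace(0, 1, nx), np.linspace(0, 1, ny)
    Q = np.exp(-(xs[:, None] - ys[None, :])**2/eps)       # Q_{X,Y}, normalized below
    Q /= Q.sum()
    VX, VY = np.full(nx, 1.0/nx), np.full(ny, 1.0/ny)     # prescribed marginals

    A, B = np.ones(nx), np.ones(ny)
    for _ in range(500):
        A = VX/(Q @ B)                                    # A-update: enforce P_X = V_X
        B = VY/(A @ Q)                                    # B-update: enforce P_Y = V_Y
    P = A[:, None]*Q*B[None, :]
    print(P.sum(axis=1), P.sum(axis=0))                   # both are (approximately) the prescribed marginals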


The original version in [388] corresponds to V_X = V_Y = Unif([n]), and the goal there was to show that any matrix {C_{x,y}} with non-negative entries can be transformed into a doubly stochastic matrix {A(x)C_{x,y}B(y)} by only rescaling rows and columns. The renewed interest in this classical algorithm arose from the observation that taking a jointly Gaussian Q_{X,Y}(x, y) = c exp{−‖x − y‖²/ϵ} produces a coupling P_{X,Y} which resembles and approximates (as ϵ → 0) the optimal-transport coupling required for computing the Wasserstein distance W₂(V_X, V_Y); see [117] for more.

6 Tensorization and information rates

In this chapter we start with explaining the important property of mutual information known
as tensorization (or single-letterization ), which allows one to maximize and minimize mutual
information between two high-dimensional vectors. Next, we extend the information measures
discussed in previous chapters for random variables to random processes by introducing the con-
cepts of entropy rate (for a stochastic process) and mutual information rate (for a pair of stochastic
processes). For the former, it is shown that two stochastic processes that can be coupled well
(i.e., have small Ornstein’s distance) have close entropy rates – a fact to be used later in the
discussion of ergodicity (see Section 12.5*). For the latter we give a simple expression for the
information rate between a pair of stationary Gaussian processes in terms of their joint spectral
density. This expression will be crucial much later, when we study Gaussian channels with colored
noise (Section 20.6*).

6.1 Tensorization (single-letterization) of mutual information


For many applications we will be dealing with memoryless channels or memoryless sources. The
following result is critical for extremizing mutual information in those cases.

Theorem 6.1 (Joint vs marginal mutual information)


(a) If the channel is memoryless, i.e., P_{Yⁿ|Xⁿ} = ∏ P_{Y_i|X_i}, then

    I(Xⁿ; Yⁿ) ≤ ∑_{i=1}^n I(X_i; Y_i)    (6.1)

with equality iff P_{Yⁿ} = ∏ P_{Y_i}. Consequently, the (unconstrained) capacity is additive for memoryless channels:

    max_{P_{Xⁿ}} I(Xⁿ; Yⁿ) = ∑_{i=1}^n max_{P_{X_i}} I(X_i; Y_i).

(b) If the source is memoryless, i.e., X_1 ⊥⊥ ... ⊥⊥ X_n, then

    I(Xⁿ; Y) ≥ ∑_{i=1}^n I(X_i; Y)    (6.2)

with equality iff P_{Xⁿ|Y} = ∏ P_{X_i|Y} P_Y-almost surely.¹ Consequently,

    min_{P_{Yⁿ|Xⁿ}} I(Xⁿ; Yⁿ) = ∑_{i=1}^n min_{P_{Y_i|X_i}} I(X_i; Y_i).

Proof. (1) Use I(Xⁿ; Yⁿ) − ∑ I(X_i; Y_i) = D(P_{Yⁿ|Xⁿ}‖∏ P_{Y_i|X_i}|P_{Xⁿ}) − D(P_{Yⁿ}‖∏ P_{Y_i}).
(2) Reverse the roles of X and Y: I(Xⁿ; Y) − ∑ I(X_i; Y) = D(P_{Xⁿ|Y}‖∏ P_{X_i|Y}|P_Y) − D(P_{Xⁿ}‖∏ P_{X_i}).

In short, we see that

1 For a product channel, the input maximizing the mutual information is a product distribution.
2 For a product source, the channel minimizing the mutual information is a product channel.

This type of result is often known as single-letterization in information theory. It tremendously simplifies the optimization, reducing a high-dimensional (multi-letter) problem to a scalar (single-letter) one. For example, in the simplest case where Xⁿ, Yⁿ are binary vectors, optimizing I(Xⁿ; Yⁿ) over P_{Xⁿ} and P_{Yⁿ|Xⁿ} entails optimizing over 2ⁿ-dimensional vectors and 2ⁿ × 2ⁿ matrices, whereas optimizing each I(X_i; Y_i) individually is easy.
quantities extend additively to tensor powers is called tensorization. One of the most famous such
examples is a log-Sobolev inequality, see Exercise I.65 or [200]. Since forming a product of chan-
nels or distributions is a form of tensor power, the first part of the theorem shows that the capacity
tensorizes.

Example 6.1 Let us complement Theorem 6.1 with the following examples.

• (6.1) fails for non-product channels. Let X_1 ⊥⊥ X_2 ∼ Ber(1/2). Let Y_1 = X_1 + X_2 (binary addition) and Y_2 = X_1. Then I(X_1; Y_1) = I(X_2; Y_2) = 0 but I(X_1, X_2; Y_1, Y_2) = 2 bits.
• Strict inequality in (6.1). Consider Y_k = X_k = U ∼ Ber(1/2) for all k. Then I(X_k; Y_k) = 1 bit and I(Xⁿ; Yⁿ) = 1 bit < ∑ I(X_k; Y_k) = n bits.
• Strict inequality in (6.2). Let X_1 ⊥⊥ ... ⊥⊥ X_n. Consider Y_1 = X_2, Y_2 = X_3, ..., Y_n = X_1. Then I(X_k; Y_k) = 0 for all k, and I(Xⁿ; Yⁿ) = ∑ H(X_i) > 0 = ∑ I(X_k; Y_k).

6.2* Gaussian capacity via orthogonal symmetry


In this section we revisit the “Gaussian saddle point” result from Theorem 5.11. There it was
derived by an explicit argument. Here we demonstrate how tensorization can be used to show
extremality of Gaussian input/noise without any explicit calculations.

¹ That is, if P_{Xⁿ,Y} = P_Y ∏_{i=1}^n P_{X_i|Y} as joint distributions.

We start with the maximization of mutual information (capacity) question. In the notation of Theorem 5.11 we know that (for Z ∼ N(0, 1))

    max_{P_X : E[X²]≤σ²_X} I(X; X + Z) = (1/2) log(1 + σ²_X).

Note that from tensorization we also immediately get (for Zⁿ ∼ N(0, I_n))

    max_{P_{Xⁿ} : E[‖Xⁿ‖²]≤nσ²_X} I(Xⁿ; Xⁿ + Zⁿ) = (n/2) log(1 + σ²_X).

Thus, the traditional way of solving n-dimensional problems is to solve a 1-dimensional version by explicit (typically calculus of variations) computation and then apply tensorization. However, it turns out that sometimes directly solving the n-dimensional problem is magically easier, and that is what we want to show in this section.

So, suppose that we are trying to directly solve

    max_{E[∑ X²_k]≤nσ²_X} I(Xⁿ; Xⁿ + Zⁿ)

over the joint distribution P_{Xⁿ}. By the tensorization property in Theorem 6.1(a) we get

    max_{E[∑ X²_k]≤nσ²_X} I(Xⁿ; Xⁿ + Zⁿ) = max_{E[∑ X²_k]≤nσ²_X} ∑_{k=1}^n I(X_k; X_k + Z_k).

Given marginal distributions P_{X_1}, ..., P_{X_n} satisfying the constraint, form the "average of marginals" distribution P̄_X = (1/n) ∑_{k=1}^n P_{X_k}, which also satisfies the single-letter constraint E[X²] = (1/n) ∑_{k=1}^n E[X²_k] ≤ σ²_X. Then from the concavity of I(P_X, P_{Y|X}) in P_X,

    I(P̄_X, P_{Y|X}) ≥ (1/n) ∑_{k=1}^n I(P_{X_k}, P_{Y|X}).

So P̄_X gives the same or better mutual information, which shows that the extremization above ought to grow linearly with n, i.e.

    max_{E[∑ X²_k]≤nσ²_X} I(Xⁿ; Xⁿ + Zⁿ) = n · max_{P_X : E[X²]≤σ²_X} I(X; X + Z).

Next, let us return to Yn = Xn + Zn . Since an isotropic Gaussian is rotationally symmetric, for


any orthogonal transformation U ∈ O(n), U · (Zn ) ∼ N (0, In ), so that PUYn |UXn = PYn |Xn , and

I(PXn , PYn |Xn ) = I(PUXn , PUYn |UXn ) = I(PUXn , PYn |Xn ).

Similarly to the “average of marginals” argument above, averaging over all orthogonal rotations U
of Xn can only make the mutual information larger. Therefore, the optimal input distribution PXn
can be chosen to be invariant under orthogonal transformations. Consequently, by Theorem 5.9,
the (unique!) capacity achieving output distribution P∗Yn must be rotationally invariant. Further-
more, from the conditions for equality in (6.1) we conclude that P∗Yn must have independent
components. Since the only product distribution satisfying the power constraints and having rota-
tional symmetry is an isotropic Gaussian, we conclude that PYn = (P∗Y )⊗n and P∗Y = N (0, 1 + σX2 ).

In turn, the only distribution PX such that PX+Z = P∗Y is PX = N (0, σX2 ) (this can be argued by
considering characteristic functions).
The last part of Theorem 5.11 can also be handled similarly. That is, we can show that the
minimizer in

min I(XG ; XG + N)
PN :E[N2 ]=1

is necessarily Gaussian by going to a multidimensional problem and averaging over all orthogonal
rotations.
The idea of “going up in dimension” (i.e. solving an n = 1 problem by going to an n > 1
problem first) as presented here is from [333] and only re-derives something that we have already
shown directly in Theorem 5.11. But the idea can also be employed for solving various non-convex
differential entropy maximization problems, cf. [184].

6.3 Entropy rate


Definition 6.2 The entropy rate of a random process X = (X1 , X2 , . . .) is
1
H(X) ≜ lim H(Xn ) (6.3)
n→∞ n
provided the limit exists.

A sufficient condition for the entropy rate to exist is stationarity, which essentially means invari-
d
ance with respect to time shift. Formally, X is stationary if (Xt1 , . . . , Xtn )=(Xt1 +k , . . . , Xtn +k ) for
any t1 , . . . , tn , k ∈ N. This definition naturally extends to two-sided processes indexed by Z.

Theorem 6.3 For any stationary process X = (X1 , X2 , . . .)

(a) H(X_n|X^{n−1}) ≤ H(X_{n−1}|X^{n−2}).
(b) (1/n) H(X^n) ≥ H(X_n|X^{n−1}).
(c) (1/n) H(X^n) ≤ (1/(n−1)) H(X^{n−1}).
(d) H(X) exists and H(X) = lim_{n→∞} (1/n) H(X^n) = lim_{n→∞} H(X_n|X^{n−1}). Both sequences converge to H(X) from above.
(e) If X can be extended to a two-sided stationary process X = (..., X_{−1}, X_0, X_1, X_2, ...), then H(X) = H(X_1|X^0_{−∞}) provided that H(X_1) < ∞.

Proof.

(a) Conditioning reduces entropy + stationarity: H(X_n|X^{n−1}) ≤ H(X_n|X_2^{n−1}) = H(X_{n−1}|X^{n−2}).
(b) Using the chain rule: (1/n) H(X^n) = (1/n) ∑ H(X_i|X^{i−1}) ≥ H(X_n|X^{n−1}).
(c) H(X^n) = H(X^{n−1}) + H(X_n|X^{n−1}) ≤ H(X^{n−1}) + (1/n) H(X^n).

(d) n 7→ 1n H(Xn ) is a decreasing sequence and lower bounded by zero, hence has a limit
Pn
H(X). Moreover by chain rule, 1n H(Xn ) = 1n i=1 H(Xi |Xi−1 ). From here we claim that
H(Xn |Xn−1 ) converges to the same limit H(X). Indeed, from the monotonicity shown in part
(a), limn H(Xn |Xn−1 ) = H′ exists. Next, recall the following fact from calculus: if an → a,
Pn
then the Cesàro’s mean 1n i=1 ai → a as well. Thus, H′ = H(X).
(e) Assuming H(X1 ) < ∞ we have from (4.30):

lim H(X1 ) − H(X1 |X0−n ) = lim I(X1 ; X0−n ) = I(X1 ; X0−∞ ) = H(X1 ) − H(X1 |X0−∞ ).
n→∞ n→∞

Example 6.2 (Stationary processes) Let us discuss some of the most standard examples
of stationary processes.

(a) Memoryless source: If X is iid, then H(X) = H(X_1).
(b) An iid process is the simplest example of a stationary stochastic process. The next in complexity is a mixed source: given two stationary (e.g., iid) processes Y and Z, define another process X as follows. Flip a coin with bias p. If heads, set X = Y; if tails, set X = Z. Applying Theorem 3.4(b) yields 0 ≤ H(X^n) − (pH(Y^n) + p̄H(Z^n)) ≤ log 2 for all n. Then H(X) = pH(Y) + p̄H(Z).
(c) Stationary Markov process: Let X be a Markov chain X_1 → X_2 → X_3 → ··· with transition kernel P[X_2 = b|X_1 = a] = K(b|a), initialized with an invariant distribution X_1 ∼ μ (i.e. μ(b) = ∑_a K(b|a)μ(a)). Then H(X_n|X^{n−1}) = H(X_n|X_{n−1}) for all n and hence

    H(X) = H(X_2|X_1) = ∑_{a,b} μ(a)K(b|a) log(1/K(b|a)).

See Exercise I.31 for an example. This kind of process is called a first-order Markov process, since X_n depends only on X_{n−1}. There is an extension of that idea, where a k-th order Markov process is defined by a kernel P_{X_n|X^{n−1}_{n−k}}. Shannon classically suggested that such a process is a good model for natural language (with sufficiently large k), and recent breakthroughs in large language models [320] have largely verified his vision.
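As a concrete illustration of item (c), here is a short sketch computing the entropy rate of a two-state chain (the transition matrix is an arbitrary choice):

    import numpy as np

    K = np.array([[0.9, 0.1],
                  [0.4, 0.6]])                       # K[a, b] = P[X_2 = b | X_1 = a]
    evals, evecs = np.linalg.eig(K.T)
    mu = np.real(evecs[:, np.argmax(np.real(evals))])
    mu /= mu.sum()                                   # invariant distribution mu
    H = -(mu[:, None]*K*np.log2(K)).sum()            # H(X) = H(X_2|X_1) in bits
    print(mu, H)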

Note that both of our characterizations of the entropy rate converge to the limit from above, and thus evaluating H(X_n|X^{n−1}) or (1/n)H(X^n) for arbitrarily large n does not give any guarantees on the true value of H(X) beyond an upper bound (in particular, we cannot even rule out H(X) = 0). However, for a certain class of stationary processes, widely used in speech and language modeling, we can have a sandwich bound.

Definition 6.4 (Hidden Markov model (HMM)) Given a stationary Markov chain
. . . , S−1 , S0 , S1 , . . . on state space S and a Markov kernel PX|S : S → X , we define HMM as
a stationary process . . . , X−1 , X0 , X1 , . . . as follows. First a trajectory S∞
−∞ is generated. Then,
conditionally on it, we generate each Xi ∼ PX|S=Si independently. In other words, X is just S but
observed over a stationary memoryless channel PX|S (called the emission channel).

One of the fundamental results in this area is due to Blackwell [60], who showed that the P(S)-valued belief process R_n = (R_{s,n}, s ∈ S) given by R_{s,n} ≜ P[S_n = s|X^{n−1}_{−∞}] is in fact a stationary first-order Markov process. The common law μ of R_n (independent of n) is called the Blackwell measure. Although finding μ is very difficult even for the simplest processes (see the example below), we do have the following representation of the entropy rate in terms of μ:

    H(X) = ∫_{P(S)} μ(dr) E_{s∼r}[H(P_{X|S=s})].

That is, the entropy rate is the integral of the simple function r ↦ ∑_s r_s H(P_{X|S=s}) over μ.
Example 6.3 (Gilbert-Elliott HMM [187, 151]) This is an HMM with binary states and binary emissions. Let S = {0, 1} and P[S_1 ≠ S_0|S_0] = τ, i.e. the transition matrix of the S-process is

    ( 1−τ    τ  )
    (  τ    1−τ ).

Set X_i = BSC_δ(S_i). In this case the Blackwell measure μ is supported on [τ, 1 − τ], is the law of the random variable P[S_1 = 1|X^0_{−∞}], and the entropy rate can be expressed in terms of the binary entropy function h:

    H(X) = ∫_0^1 μ(dx) h(δx̄ + δ̄x),

where we remind that x̄ = 1 − x, etc. In fact, we can express integration over μ in terms of the limit ∫ f dμ = lim_{n→∞} K^n f(1/2), where K is the transition kernel of the belief process, which acts on functions g : [0, 1] → R as

    Kg(x) = p(x) g( (xτ̄δ̄ + x̄τδ)/p(x) ) + p̄(x) g( (xτ̄δ + x̄τδ̄)/p̄(x) ),    p(x) = 1 − p̄(x) = δx̄ + δ̄x.

We can see that the belief process follows simple fractional-linear updates, but nevertheless the stationary measure μ is extremely complicated and can be either absolutely continuous or singular (fractal-like) [33, 32]. As such, understanding H(X) as a function of (τ, δ) is a major open problem in this area. We remark, however, that if instead of the BSC we used X = BEC_δ(S) then the resulting entropy rate is much easier to compute; see Exercise I.32.
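The limit ∫ f dμ = lim K^n f(1/2) above can be evaluated numerically by expanding the two branches of K recursively; the following sketch approximates H(X) in this way (the values of τ, δ and the recursion depth are arbitrary choices):

    import numpy as np

    def h(p):                                        # binary entropy in bits
        return 0.0 if p in (0.0, 1.0) else -p*np.log2(p) - (1 - p)*np.log2(1 - p)

    def K_pow(g, x, tau, delta, depth):              # evaluates (K^depth g)(x) by recursion
        if depth == 0:
            return g(x)
        p1 = delta*(1 - x) + (1 - delta)*x           # p(x) = delta*(1-x) + (1-delta)*x
        x1 = (x*(1 - tau)*(1 - delta) + (1 - x)*tau*delta)/p1
        x0 = (x*(1 - tau)*delta + (1 - x)*tau*(1 - delta))/(1 - p1)
        return p1*K_pow(g, x1, tau, delta, depth - 1) + (1 - p1)*K_pow(g, x0, tau, delta, depth - 1)

    tau, delta = 0.1, 0.2
    f = lambda x: h(delta*(1 - x) + (1 - delta)*x)
    print(K_pow(f, 0.5, tau, delta, 18))             # approx. entropy rate H(X) of the Gilbert-Elliott HMM, bits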
Despite these complications, the entropy rate of an HMM has a nice property: it can be tightly sandwiched between a monotonically increasing and a monotonically decreasing sequence. As we remarked above, such a sandwich bound is not possible for general stationary processes.

Proposition 6.5 Consider an HMM process X with state process S. Then

    H(X_n|X_1^{n−1}, S_0) ≤ H(X) ≤ H(X_n|X_1^{n−1}),    (6.4)

and both sides converge monotonically to H(X) as n → ∞.

Proof. The part about the upper bound we have already established. To show the lower bound, notice that

    H(X) = H(X_n|X^{n−1}_{−∞}) ≥ H(X_n|X^{n−1}_{−∞}, S_0) = H(X_n|X_1^{n−1}, S_0),
where in the last step we used the Markov property X^0_{−∞} → S_0 → X_1^n. Next, we show that H(X_n|X_1^{n−1}, S_0) is increasing in n. Indeed,

    H(X_{n+1}|X_1^n, S_0) = H(X_n|X_0^{n−1}, S_{−1}) ≥ H(X_n|X_0^{n−1}, S_{−1}, S_0) = H(X_n|X_1^{n−1}, S_0),

where the first equality is by stationarity, the inequality is by adding conditioning (Theorem 1.4), and the last equality is due to the Markov property (S_{−1}, X_0) → S_0 → X_1^n.
Finally, we show that

    H(X) = lim_{n→∞} H(X_n|X_1^{n−1}, S_0).

Indeed, note that by (4.30) we have

    I(S_0; X_1^∞) = lim_{n→∞} I(S_0; X_1^n) ≤ H(S_0) < ∞,

and thus I(S_0; X_1^n) − I(S_0; X_1^{n−1}) → 0. But we also have by the chain rule

    I(S_0; X_1^n) − I(S_0; X_1^{n−1}) = I(S_0; X_n|X_1^{n−1}) = H(X_n|X_1^{n−1}) − H(X_n|X_1^{n−1}, S_0) → 0.

Thus, we can see that the difference between the two sides of (6.4) vanishes with n.

6.4 Entropy and symbol (bit) error rate


In this section we show that the entropy rates of two processes X and Y are close whenever they
can be “coupled”. Coupling of two processes means defining them on a common probability space
so that the average distance between their realizations is small. In the following, we will require
that the so-called symbol error rate (expected fraction of errors) is small, namely

    (1/n) ∑_{j=1}^n P[X_j ≠ Y_j] ≤ ϵ.    (6.5)

(The minimal such ϵ over all possible couplings is called Ornstein's distance between stochastic processes.) For a binary alphabet this quantity is known as the bit error rate, which is one of the performance metrics we consider for reliable data transmission in Part IV (see Sections 17.1 and 19.6). Notice that if we define the Hamming distance as

    d_H(x^n, y^n) ≜ ∑_{j=1}^n 1{x_j ≠ y_j},    (6.6)

then (6.5) corresponds to requiring E[d_H(X^n, Y^n)] ≤ nϵ.


Before showing our main result, we show that Fano’s inequality Theorem 3.12 can be
tensorized:

Proposition 6.6 Let X1 , . . . , Xn take values on a finite alphabet X . Then


H(Xn |Yn ) ≤ nF|X | (1 − δ) = n(δ log(|X | − 1) + h(δ)) , (6.7)

where the function FM is defined in (3.14), and

1X
n
1
δ= E[dH (Xn , Yn )] = P[Xj 6= Yj ] .
n n
j=1

Proof. For each j ∈ [n], applying (3.18) to the Markov chain Xj → Yn → Yj yields

H(Xj |Yn ) ≤ FM (P [Xj = Yj ]) , (6.8)

where we denoted M = |X |. Then, upper-bounding joint entropy by the sum of marginals, cf. (1.3),
and combining with (6.8), we get
H(X^n | Y^n) ≤ Σ_{j=1}^n H(X_j | Y^n)  (6.9)
           ≤ Σ_{j=1}^n F_M(P[X_j = Y_j])  (6.10)
           ≤ n F_M( (1/n) Σ_{j=1}^n P[X_j = Y_j] )  (6.11)

where in the last step we used the concavity of FM and Jensen’s inequality.

Corollary 6.7 Consider two processes X and Y with entropy rates H(X) and H(Y). If
P[X_j ≠ Y_j] ≤ ε

for every j and if X takes values on a finite alphabet of size M, then

H(X) − H(Y) ≤ FM (1 − ϵ) .

If both processes have alphabets of size M, then

|H(X) − H(Y)| ≤ ϵ log M + h(ϵ) → 0 as ϵ → 0.

Proof. There is almost nothing to prove:

H(Xn ) ≤ H(Xn , Yn ) = H(Yn ) + H(Xn |Yn )

and apply (6.7). For the last statement just recall the expression for FM .

6.5 Mutual information rate


Extending the definition of entropy rate, the mutual information rate of two random processes
X = (X1 , X2 , . . .) and Y = (Y1 , Y2 , . . .) is defined as follows.


Definition 6.8 (Mutual information rate)


I(X; Y) = lim_{n→∞} (1/n) I(X^n; Y^n)
provided the limit exists.

We provide an example in the context of Gaussian processes which will be useful in studying
Gaussian channels with correlated noise (Section 20.6*).
Example 6.4 (Gaussian processes) Consider X, N two stationary Gaussian processes, independent of each other. Assume that their autocovariance functions are absolutely summable and thus there exist continuous power spectral density functions f_X and f_N. Without loss of generality, assume all means are zero. Let c_X(k) = E[X_1 X_{k+1}]. Then f_X is the Fourier transform of the autocovariance function c_X, i.e., f_X(ω) = Σ_{k=−∞}^∞ c_X(k) e^{iωk}, |ω| ≤ π. Finally, assume f_N ≥ δ > 0. Then recall from Example 3.5:

I(X^n; X^n + N^n) = (1/2) log [ det(Σ_{X^n} + Σ_{N^n}) / det Σ_{N^n} ]
                  = (1/2) Σ_{i=1}^n log σ_i − (1/2) Σ_{i=1}^n log λ_i,

where σ_j, λ_j are the eigenvalues of the covariance matrices Σ_{Y^n} = Σ_{X^n} + Σ_{N^n} and Σ_{N^n}, which are all Toeplitz matrices, e.g., (Σ_{X^n})_{ij} = E[X_i X_j] = c_X(i − j). By Szegő's theorem [199, Sec. 5.2]:

(1/n) Σ_{i=1}^n log σ_i → (1/2π) ∫_{−π}^π log f_Y(ω) dω  (6.12)

Note that c_Y(k) = E[(X_1 + N_1)(X_{k+1} + N_{k+1})] = c_X(k) + c_N(k) and hence f_Y = f_X + f_N. Thus, we have

(1/n) I(X^n; X^n + N^n) → I(X; X + N) = (1/4π) ∫_{−π}^π log [ (f_X(ω) + f_N(ω)) / f_N(ω) ] dω.

Maximizing this over f_X subject to a moment constraint leads to the famous water-filling solution f*_X(ω) = |T − f_N(ω)|^+ – see Theorem 20.18.
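The convergence to the spectral integral can be checked numerically. The sketch below is not from the text: it picks, purely for illustration, the autocovariance c_X(k) = ρ^{|k|} (so f_X(ω) = (1 − ρ²)/(1 − 2ρ cos ω + ρ²)) and white noise with f_N ≡ σ_N², and compares the finite-n mutual information (via Toeplitz determinants) with the limit above.

    import numpy as np
    from scipy.linalg import toeplitz

    rho, sigma2_N, n = 0.7, 0.5, 400                    # illustrative choices
    Sigma_X = toeplitz(rho ** np.arange(n))             # Toeplitz covariance of X^n
    Sigma_N = sigma2_N * np.eye(n)

    _, logdet_Y = np.linalg.slogdet(Sigma_X + Sigma_N)  # log det(Sigma_{X^n} + Sigma_{N^n})
    mi_n = 0.5 * (logdet_Y - n * np.log(sigma2_N)) / n  # (1/n) I(X^n; X^n + N^n), in nats

    omega = np.linspace(-np.pi, np.pi, 20001)
    f_X = (1 - rho**2) / (1 - 2*rho*np.cos(omega) + rho**2)
    mi_limit = np.trapz(np.log((f_X + sigma2_N) / sigma2_N), omega) / (4*np.pi)
    print(mi_n, mi_limit)                               # close for large n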

7 f-divergences

In Chapter 2 we introduced the KL divergence that measures dissimilarity between two dis-
tributions. This turns out to be a special case of a whole family of such measures, known as
f-divergences, introduced by Csiszár [109]. Like KL-divergence, f-divergences satisfy a number
of useful properties:

• operational significance: KL divergence forms a basis of information theory by yielding funda-


mental answers to questions in channel coding and data compression. Similarly, f-divergences
such as χ2 , H2 and TV have their foundational roles in parameter estimation, high-dimensional
statistics and hypothesis testing, respectively.
• invariance to bijective transformations.
• data-processing inequality
• variational representations (à la Donsker-Varadhan)
• local behavior given by χ2 (in nonparametric cases) or Fisher information (in parametric cases).

The purpose of this chapter is to establish these properties and prepare the ground for appli-
cations in subsequent chapters. The important highlight is a joint range Theorem of Harremoës
and Vajda [214], which gives the sharpest possible comparison inequality between arbitrary f-
divergences and puts an end to a long sequence of results starting from Pinsker’s inequality –
Theorem 7.10. This material is mandatory mainly for those interested in “non-classical” applications of information theory, such as the ones we will explore in Part VI; others can skim through this chapter and refer back to it as needed.

7.1 Definition and basic properties of f-divergences


Definition 7.1 (f-divergence) Let f : (0, ∞) → R be a convex function with f(1) = 0.
Let P and Q be two probability distributions on a measurable space (X, F). If P ≪ Q then the f-divergence is defined as

D_f(P‖Q) ≜ E_Q[ f( dP/dQ ) ]  (7.1)

where dP/dQ is a Radon-Nikodym derivative and f(0) ≜ f(0+). More generally, let f′(∞) ≜
is a Radon-Nikodym derivative and f(0) ≜ f(0+). More generally, let f′ (∞) ≜
limx↓0 xf(1/x). Suppose that Q(dx) = q(x) μ(dx) and P(dx) = p(x) μ(dx) for some common


dominating measure μ, then we have


D_f(P‖Q) = ∫_{q>0} q(x) f( p(x)/q(x) ) dμ + f′(∞) P[q = 0]  (7.2)

with the agreement that if P[q = 0] = 0 the last term is taken to be zero regardless of the value of
f′ (∞) (which could be infinite).

Remark 7.1 For the discrete case, with Q(x) and P(x) being the respective pmfs, we can also write

D_f(P‖Q) = Σ_x Q(x) f( P(x)/Q(x) )

with the understanding that

• f(0) = f(0+),
• 0 f(0/0) = 0, and
• 0 f(a/0) = lim_{x↓0} x f(a/x) = a f′(∞) for a > 0.

Remark 7.2 A nice property of D_f(P‖Q) is that the definition is invariant to the choice of the dominating measure μ in (7.2). This is not the case for other dissimilarity measures, e.g., the squared L²-distance between the densities ‖p − q‖²_{L²(μ)}, which is a popular loss function for density estimation in the statistics literature (cf. Section 32.4).
The following are common f-divergences:

• Kullback-Leibler (KL) divergence: We recover the usual D(PkQ) in Chapter 2 by taking


f(x) = x log x.
• Total variation: f(x) = ½|x − 1|,

  TV(P, Q) ≜ E_Q[ ½ |dP/dQ − 1| ] = ½ ∫ |dP − dQ| = 1 − ∫ d(P ∧ Q).  (7.3)

  Moreover, TV(·, ·) is a metric on the space of probability distributions.¹
• χ²-divergence: f(x) = (x − 1)²,

  χ²(P‖Q) ≜ E_Q[ (dP/dQ − 1)² ] = ∫ (dP − dQ)²/dQ = ∫ dP²/dQ − 1.  (7.4)

  Note that we can also choose f(x) = x² − 1. Indeed, f's differing by a linear term lead to the same f-divergence, cf. Proposition 7.2.

¹ In (7.3), ∫ d(P ∧ Q) is the usual short-hand for ∫ (dP/dμ ∧ dQ/dμ) dμ, where μ is any dominating measure. The expressions in (7.4) and (7.5) are understood in the similar sense.


• Squared Hellinger distance: f(x) = (1 − √x)²,

  H²(P, Q) ≜ E_Q[ (1 − √(dP/dQ))² ] = ∫ (√dP − √dQ)² = 2 − 2 ∫ √(dP dQ).  (7.5)

  Here the quantity B(P, Q) ≜ ∫ √(dP dQ) is known as the Bhattacharyya coefficient (or Hellinger affinity) [52]. Note that H(P, Q) = √(H²(P, Q)) defines a metric on the space of probability distributions: indeed, the triangle inequality follows from that of L²(μ) for a common dominating measure. Note, however, that

  P ↦ H(P, Q) is not convex.  (7.6)

  (This is because the metric H is not induced by a Banach norm on the space of measures.) For an explicit example, consider p ↦ H(Ber(p), Ber(0.1)).
• Le Cam divergence (distance) [273, p. 47]: f(x) = (1 − x)/(2x + 2),

  LC(P, Q) = ½ ∫ (dP − dQ)²/(dP + dQ).  (7.7)

  Moreover, √(LC(P, Q)) is a metric on the space of probability distributions [152], known as the Le Cam distance.
• Jensen-Shannon divergence: f(x) = x log( 2x/(x + 1) ) + log( 2/(x + 1) ),

  JS(P, Q) = D( P ‖ (P + Q)/2 ) + D( Q ‖ (P + Q)/2 ).  (7.8)

  Moreover, √(JS(P, Q)) is a metric on the space of probability distributions [152].

Remark 7.3 If D_f(P‖Q) is an f-divergence, then it is easy to verify that D_f(λP + λ̄Q‖Q) and D_f(P‖λP + λ̄Q) are f-divergences for all λ ∈ [0, 1]. In particular, D_f(Q‖P) = D_f̃(P‖Q) with f̃(x) ≜ x f(1/x).
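For concreteness, here is a small sketch (not from the text) computing several of the above f-divergences for discrete distributions; natural logarithms are used, so KL and χ² come out in nats, whereas the book keeps the base of log generic. The pmfs P and Q below are arbitrary examples.

    import numpy as np

    def tv(P, Q):   return 0.5 * np.abs(P - Q).sum()
    def h2(P, Q):   return ((np.sqrt(P) - np.sqrt(Q)) ** 2).sum()
    def kl(P, Q):
        if np.any((Q == 0) & (P > 0)): return np.inf    # P not << Q
        m = P > 0
        return (P[m] * np.log(P[m] / Q[m])).sum()
    def chi2(P, Q):
        if np.any((Q == 0) & (P > 0)): return np.inf
        m = Q > 0
        return ((P[m] - Q[m]) ** 2 / Q[m]).sum()

    P = np.array([0.5, 0.3, 0.2]); Q = np.array([0.25, 0.25, 0.5])
    print(tv(P, Q), h2(P, Q), kl(P, Q), chi2(P, Q))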
We start by summarizing some formal observations about f-divergences.

Proposition 7.2 (Basic properties) The following hold:

1 Df1 +f2 (PkQ) = Df1 (PkQ) + Df2 (PkQ).


2 Df (PkP) = 0.
3 D_f(P‖Q) = 0 for all P ≠ Q iff f(x) = c(x − 1) for some c. For any other f we have D_f(P‖Q) = f(0) + f′(∞) > 0 for P ⊥ Q.
4 If P_{X,Y} = P_X P_{Y|X} and Q_{X,Y} = P_X Q_{Y|X} then the function x ↦ D_f(P_{Y|X=x}‖Q_{Y|X=x}) is measurable and

  D_f(P_{X,Y}‖Q_{X,Y}) = ∫_X dP_X(x) D_f(P_{Y|X=x}‖Q_{Y|X=x}) ≜ D_f(P_{Y|X}‖Q_{Y|X}|P_X),  (7.9)

the latter referred to as the conditional f-divergence (similar to Definition 2.14 for conditional
KL divergence).


5 If PX,Y = PX PY|X and QX,Y = QX PY|X then

Df (PX,Y kQX,Y ) = Df (PX kQX ) . (7.10)

In particular,

Df ( P X P Y k QX P Y ) = Df ( P X k QX ) . (7.11)

6 Let f1 (x) = f(x) + c(x − 1), then

Df1 (PkQ) = Df (PkQ) ∀P, Q .

In particular, we can always assume that f ≥ 0 and (if f is differentiable at 1) that f′ (1) = 0.

Proof. The first and second are clear. For the third property, verify explicitly that Df (PkQ) = 0
for f = c(x − 1). Next consider general f and observe that for P ⊥ Q, by definition we have

Df (PkQ) = f(0) + f′ (∞), (7.12)

which is well-defined (i.e., ∞ − ∞ is not possible) since by convexity f(0) > −∞ and f′ (∞) >
−∞. So all we need to verify is that f(0) + f′ (∞) = 0 if and only if f = c(x − 1) for some c ∈ R.
Indeed, since f(1) = 0, the convexity of f implies that x ↦ g(x) ≜ f(x)/(x − 1) is non-decreasing. By assumption, we have g(0+) = g(∞) and hence g(x) is a constant on x > 0, as desired.
For property 4, let R_{Y|X} = ½P_{Y|X} + ½Q_{Y|X}. By Theorem 2.12 there exist jointly measurable p(y|x) and q(y|x) such that dP_{Y|X=x} = p(y|x) dR_{Y|X=x} and dQ_{Y|X=x} = q(y|x) dR_{Y|X=x}. We can then take μ in (7.2) to be μ = P_X R_{Y|X}, which gives dP_{X,Y} = p(y|x) dμ and dQ_{X,Y} = q(y|x) dμ and thus

D_f(P_{X,Y}‖Q_{X,Y})
  = ∫_{X×Y} dμ 1{y : q(y|x) > 0} q(y|x) f( p(y|x)/q(y|x) ) + f′(∞) ∫_{X×Y} dμ 1{y : q(y|x) = 0} p(y|x)
  = ∫_X dP_X [ ∫_{y: q(y|x)>0} dR_{Y|X=x} q(y|x) f( p(y|x)/q(y|x) ) + f′(∞) ∫_{y: q(y|x)=0} dR_{Y|X=x} p(y|x) ]   (by (7.2))
  = ∫_X dP_X D_f(P_{Y|X=x} ‖ Q_{Y|X=x}),

which is the desired (7.9).


Property 5 follows from the observation: if we take μ = PX,Y + QX,Y and μ1 = PX + QX then
dPX,Y dPX
dμ = dμ1 and similarly for Q.
Property 6 follows from the first and the third. Note also that reducing to f ≥ 0 is done by taking
c = f′ (1) (or any subdifferential at x = 1 if f is not differentiable).

7.2 Data-processing inequality; approximation by finite partitions


Theorem 7.3 (Monotonicity)
Df (PX,Y kQX,Y ) ≥ Df (PX kQX ) . (7.13)


Proof. Note that in the case P_{X,Y} ≪ Q_{X,Y} (and thus P_X ≪ Q_X), the proof is a simple application of Jensen's inequality to definition (7.1):

D_f(P_{X,Y}‖Q_{X,Y}) = E_{X∼Q_X}[ E_{Y∼Q_{Y|X}}[ f( (dP_{Y|X}/dQ_{Y|X}) · (dP_X/dQ_X) ) ] ]
  ≥ E_{X∼Q_X}[ f( E_{Y∼Q_{Y|X}}[ dP_{Y|X}/dQ_{Y|X} ] · (dP_X/dQ_X) ) ]
  = E_{X∼Q_X}[ f( dP_X/dQ_X ) ].

To prove the general case we need to be more careful. Let R_X = ½(P_X + Q_X) and R_{Y|X} = ½P_{Y|X} + ½Q_{Y|X}. It should be clear that P_{X,Y}, Q_{X,Y} ≪ R_{X,Y} ≜ R_X R_{Y|X} and that for every x: P_{Y|X=x}, Q_{Y|X=x} ≪ R_{Y|X=x}. By Theorem 2.12 there exist measurable functions p_1, p_2, q_1, q_2 so that

dP_{X,Y} = p_1(x) p_2(y|x) dR_{X,Y},   dQ_{X,Y} = q_1(x) q_2(y|x) dR_{X,Y}

and dP_{Y|X=x} = p_2(y|x) dR_{Y|X=x}, dQ_{Y|X=x} = q_2(y|x) dR_{Y|X=x}. We also denote p(x, y) = p_1(x) p_2(y|x), q(x, y) = q_1(x) q_2(y|x).
Fix t > 0 and consider a supporting line to f at t with slope μ, so that

f(u) ≥ f(t) + μ(u − t),   ∀u ≥ 0.

Thus, f′(∞) ≥ μ and taking u = λt for any λ ∈ [0, 1] we have shown:

f(λt) + λ̄ t f′(∞) ≥ f(t),   ∀t ≥ 0, λ ∈ [0, 1].  (7.14)

Note that we added the t = 0 case as well, since for t = 0 the statement is obvious (recall, though, that f(0) ≜ f(0+) can be equal to +∞).
Next, fix some x with q_1(x) > 0 and consider the chain

∫_{y: q_2(y|x)>0} dR_{Y|X=x} q_2(y|x) f( p_1(x) p_2(y|x) / (q_1(x) q_2(y|x)) ) + P_{Y|X=x}[q_2(Y|x) = 0] f′(∞) p_1(x)/q_1(x)
  (a)≥ f( (p_1(x)/q_1(x)) P_{Y|X=x}[q_2(Y|x) > 0] ) + (p_1(x)/q_1(x)) P_{Y|X=x}[q_2(Y|x) = 0] f′(∞)
  (b)≥ f( p_1(x)/q_1(x) )

where (a) is by Jensen's inequality and the convexity of f, and (b) by taking t = p_1(x)/q_1(x) and λ = P_{Y|X=x}[q_2(Y|x) > 0] in (7.14). Now multiplying the obtained inequality by q_1(x) and integrating over {x : q_1(x) > 0} we get

∫_{q>0} dR_{X,Y} q(x, y) f( p(x, y)/q(x, y) ) + f′(∞) P_{X,Y}[q_1(X) > 0, q_2(Y|X) = 0] ≥ ∫_{q_1>0} dR_X q_1(x) f( p_1(x)/q_1(x) ).

Adding f′(∞) P_X[q_1(X) = 0] to both sides we obtain (7.13) since both sides evaluate to definition (7.2).
The following is the main result of this section.


Theorem 7.4 (Data processing) Consider a channel that produces Y given X based on the
conditional law PY|X (shown below).

[Diagram: P_X and Q_X passed through the same channel P_{Y|X} produce P_Y and Q_Y, respectively.]

Let PY (resp. QY ) denote the distribution of Y when X is distributed as PX (resp. QX ). For any
f-divergence Df (·k·),

Df (PY kQY ) ≤ Df (PX kQX ). (7.15)

Proof. This follows from the monotonicity (7.13) and (7.10).

Next we discuss some of the more useful properties of f-divergence that parallel those of KL
divergence in Theorem 2.16:

Theorem 7.5 (Properties of f-divergences)

(a) Non-negativity: Df (PkQ) ≥ 0. If f is strictly convex2 at 1, then Df (PkQ) = 0 if and only if


P = Q.
(b) Joint convexity: (P, Q) 7→ Df (PkQ) is a jointly convex function. Consequently, P 7→ Df (PkQ)
and Q 7→ Df (PkQ) are also convex.
(c) Conditioning increases f-divergence. Let P_Y = P_{Y|X} ∘ P_X and Q_Y = Q_{Y|X} ∘ P_X, or, pictorially: [Diagram: the common input P_X fed into the channels P_{Y|X} and Q_{Y|X} produces P_Y and Q_Y, respectively.] Then

D_f(P_Y ‖ Q_Y) ≤ D_f( P_{Y|X} ‖ Q_{Y|X} | P_X ).

Proof. (a) Non-negativity follows from monotonicity by taking X to be unary. To show strict
positivity, suppose for the sake of contradiction that Df (PkQ) = 0 for some P 6= Q. Then
there exists some measurable A such that p = P(A) ≠ q = Q(A) > 0. Applying the data

2
By strict convexity at 1, we mean for all s, t ∈ [0, ∞) and α ∈ (0, 1) such that αs + ᾱt = 1, we have
αf(s) + (1 − α)f(t) > f(1).


processing inequality (with Y = 1{X ∈ A}), we obtain D_f(Ber(p)‖Ber(q)) = 0. Consider two cases:
a  0 < q < 1: Then D_f(Ber(p)‖Ber(q)) = q f(p/q) + q̄ f(p̄/q̄) = f(1);
b  q = 1: Then p < 1 and D_f(Ber(p)‖Ber(q)) = f(p) + p̄ f′(∞) = 0, i.e. f′(∞) = f(p)/(p − 1). Since x ↦ f(x)/(x − 1) is non-decreasing, we conclude that f is affine on [p, ∞).
Both cases contradict the assumed strict convexity of f at 1.
(b) Convexity follows from the DPI as in the proof of Theorem 5.1.
(c) Recall that the conditional divergence was defined in (7.9) and hence the inequality follows
from the monotonicity. Another way to see the inequality is as result of applying Jensen’s
inequality to the jointly convex function Df (PkQ).

Remark 7.4 (Strict convexity) Just like for the KL divergence, f-divergences are never
strictly convex in the sense that (P, Q) 7→ Df (PkQ) can be linear on an interval connecting (P0 , Q0 )
to (P1 , Q1 ). As in Remark 5.2 this is the case when (P0 , Q0 ) have support disjoint from (P1 , Q1 ).
However, for f-divergences this can happen even with pairs with a common support. For example,
TV(Ber(p), Ber(q)) = |p − q| is piecewise linear. In turn, strict convexity of f is related to certain
desirable properties of f-information If (X; Y), see Ex. I.40.
Remark 7.5 (g-divergences) We note that, more generally, we may call functional D(PkQ)
a “g-divergence”, or a generalized dissimilarity measure, if it satisfies the following properties: pos-
itivity, monotonicity (as in (7.13)), data processing inequality (DPI, cf. (7.15)) and D(PkP) = 0
for any P. Note that the last three properties imply positivity by taking X to be unary in the DPI. In
many ways g-divergence properties allow to interpret it as measure of information in the generic
sense adopted in this book. We have seen that f-divergences satisfy two additional properties:
conditioning increases divergence (CID) and convexity in the pair, the two being essentially equiv-
alent (cf. proof of Theorem 5.1). CID and convexity do not necessarily hold for any f-divergence.
Indeed, any monotone function of an f-divergence is a g-divergence, and of course those do not
need to be monotone (cf. (7.6) for an example). Interestingly, there exist g-divergences which
are not monotone transformations of any f-divergence, cf. [338, Section V]; the example there is in fact D(P‖Q) = α − β_α(P, Q) with β defined in (14.3) later in the book. On the other hand, for finite alphabets, [325] shows that any D(P‖Q) = Σ_i φ(P_i, Q_i) is a g-divergence iff it is an f-divergence.
The following convenient property, a counterpart of Theorem 4.5, allows us to reduce any gen-
eral problem about f-divergences to the problem on finite alphabets. The proof is in Section 7.14*.

Theorem 7.6 Let P, Q be two probability measures on X with σ-algebra F. Given a finite F-measurable partition E = {E_1, ..., E_n}, define the distribution P^E on [n] by P^E(i) = P[E_i] and Q^E(i) = Q[E_i]. Then

D_f(P‖Q) = sup_E D_f(P^E ‖ Q^E)  (7.16)

where the supremum is over all finite F-measurable partitions E.


7.3 Total variation and Hellinger distance in hypothesis testing


As we will discover throughout the book, different f-divergences have different operational signif-
icance. For example, χ2 -divergence is useful in the study of Markov chains (see Example 33.8 and
Exercise VI.19); in estimation the Bayes quadratic risk for a binary prior is determined by Le Cam
divergence (7.7). Here we discuss the relation of TV and Hellinger H2 to the problem of binary
hypothesis testing. We will delve deep into this problem in Part III (and return to its composite
version in Part VI). In this section, we only introduce some basics for the purpose of illustration.
The binary hypothesis testing problem is formulated as follows: one is given an observation
(random variable) X, and it is known that either X ∼ P (a case referred to as null-hypothesis H0 )
or X ∼ Q (alternative hypothesis H1 ). The goal is to decide, on the basis of X alone, which of the
two hypotheses holds. In other words, we want to find a (possibly randomized) decision function
ϕ : X → {0, 1} such that the sum of two types of probabilities of error

P[ϕ(X) = 1] + Q[ϕ(X) = 0] (7.17)

is minimized.
In this section we first show that optimization over ϕ naturally leads to the concept of TV.
Subsequently, we will see that asymptotic considerations (when P and Q are replaced with P⊗n
and Q⊗n ) leads to H2 . We start with the former case.

Theorem 7.7 (a) sup-representation of TV:

TV(P, Q) = sup_E { P(E) − Q(E) } = ½ sup_{f∈F} { E_P[f(X)] − E_Q[f(X)] }  (7.18)

where the first supremum is over all measurable sets E, and the second is over F = {f : X → R, ‖f‖_∞ ≤ 1}. In particular, the minimal total error probability in (7.17) is given by

min_φ { P[φ(X) = 1] + Q[φ(X) = 0] } = 1 − TV(P, Q),  (7.19)

where the minimum is over all decision rules φ : X → {0, 1}.³
(b) inf-representation of TV [403]:⁴ Provided that the diagonal {(x, x) : x ∈ X} is measurable,

TV(P, Q) = min_{P_{X,Y}} { P_{X,Y}[X ≠ Y] : P_X = P, P_Y = Q },  (7.20)

where minimization is over joint distributions P_{X,Y} with the property P_X = P and P_Y = Q, which are called couplings of P and Q.

Proof. Let p, q, μ be as in Definition 7.1. Then for any f ∈ F we have

∫ f(x)(p(x) − q(x)) dμ ≤ ∫ |p(x) − q(x)| dμ = 2 TV(P, Q),
3
The extension of (7.19) from simple to composite hypothesis testing is in (32.28).
4
See Exercise I.36 for another inf-representation.


which establishes that the second supremum in (7.18) lower bounds TV, and hence (by taking f(x) = 2·1_E(x) − 1) so does the first. For the other direction, let E = {x : p(x) > q(x)} and notice

0 = ∫ (p(x) − q(x)) dμ = ( ∫_E + ∫_{E^c} ) (p(x) − q(x)) dμ,

implying that ∫_{E^c} (q(x) − p(x)) dμ = ∫_E (p(x) − q(x)) dμ. But the sum of these two integrals precisely equals 2·TV, which implies that this choice of E attains equality in (7.18).
For the inf-representation, we notice that given a coupling P_{X,Y}, for any ‖f‖_∞ ≤ 1, we have

E_P[f(X)] − E_Q[f(X)] = E[f(X) − f(Y)] ≤ 2 P_{X,Y}[X ≠ Y],

which, in view of (7.18), shows that the inf-representation is always an upper bound. To show that this bound is tight one constructs X, Y as follows: with probability π ≜ ∫ min(p(x), q(x)) dμ we take X = Y = c with c sampled from a distribution with density r(x) = (1/π) min(p(x), q(x)), whereas with probability 1 − π we take X, Y sampled independently from distributions p_1(x) = (1/(1−π))(p(x) − min(p(x), q(x))) and q_1(x) = (1/(1−π))(q(x) − min(p(x), q(x))), respectively. The result follows upon verifying that this P_{X,Y} indeed defines a coupling of P and Q and applying the last identity of (7.3).
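The coupling constructed in the proof (the "maximal coupling") is easy to simulate. The following sketch, with an arbitrary pair of pmfs, is not from the text; it checks empirically that the construction attains P[X ≠ Y] ≈ TV(P, Q) while preserving both marginals.

    import numpy as np

    rng = np.random.default_rng(0)
    P = np.array([0.5, 0.3, 0.2]); Q = np.array([0.25, 0.25, 0.5])
    tv = 0.5 * np.abs(P - Q).sum()

    m = np.minimum(P, Q); pi_c = m.sum()                  # common part and its mass
    p1 = (P - m) / (1 - pi_c); q1 = (Q - m) / (1 - pi_c)  # normalized residuals

    n = 200_000
    same = rng.random(n) < pi_c                           # with prob. pi_c set X = Y ~ m/pi_c
    X = np.where(same, rng.choice(3, n, p=m/pi_c), rng.choice(3, n, p=p1))
    Y = np.where(same, X, rng.choice(3, n, p=q1))
    print(np.mean(X != Y), tv)                            # empirical P[X != Y] vs TV(P, Q)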
Remark 7.6 (Variational representation) The sup-representation (7.18) of the total vari-
ation will be extended to general f-divergences in Section 7.13. However, only the TV has the
representation of the form supf∈F | EP [f] − EQ [f]| over the class of functions. Distances of this
form (for different classes of F ) are sometimes known as integral probability metrics (IPMs).
And so TV is an example of an IPM for the class F of all bounded functions.
In turn, the inf-representation (7.20) has no analogs for other f-divergences, with the notable
exception of Marton’s d2 , see Remark 7.15. Distances defined via inf-representations over cou-
plings are often called Wasserstein distances, and hence we may think of TV as the Wasserstein
distance with respect to the Hamming distance d(x, x′) = 1{x ≠ x′} on X. The benefit of variational
representations is that choosing a particular coupling in (7.20) gives an upper bound on TV(P, Q),
and choosing a particular f in (7.18) yields a lower bound.
Of particular relevance is the special case of testing with multiple observations, where the data
X = (X1 , . . . , Xn ) are i.i.d. drawn from either P or Q. In other words, the goal is to test
H0 : X ∼ P⊗n vs H1 : X ∼ Q⊗ n .
By Theorem 7.7, the optimal total probability of error is given by 1 − TV(P⊗n , Q⊗n ). By the data
processing inequality, TV(P⊗n , Q⊗n ) is a non-decreasing sequence in n (and bounded by 1 by
definition) and hence converges. One would expect that as n → ∞, TV(P⊗n , Q⊗n ) converges to 1
and consequently, the probability of error in the hypothesis test vanishes. It turns out that for fixed
distributions P 6= Q, large deviations theory (see Chapter 16) shows that TV(P⊗n , Q⊗n ) indeed
converges to one as n → ∞ and, in fact, exponentially fast:
TV(P⊗n , Q⊗n ) = 1 − exp(−nC(P, Q) + o(n)), (7.21)
where the exponent C(P, Q) > 0 is known as the Chernoff Information of P and Q given in (16.2).
However, as frequently encountered in high-dimensional statistical problems, if the distributions


P = Pn and Q = Qn depend on n, then the large-deviations asymptotics in (7.21) can no longer be


directly applied. Since computing the total variation between two n-fold product distributions is
typically difficult, understanding how a more tractable f-divergence is related to the total variation
may give insight on its behavior. It turns out Hellinger distance is precisely suited for this task.
Shortly, we will show the following relation between TV and the Hellinger divergence:
(1/2) H²(P, Q) ≤ TV(P, Q) ≤ H(P, Q) √( 1 − H²(P, Q)/4 ) ≤ 1.  (7.22)
Direct consequences of the bound (7.22) are:

• H2 (P, Q) = 2, if and only if TV(P, Q) = 1. In this case, the probability of error is zero since
essentially P and Q have disjoint supports.
• H2 (P, Q) = 0 if and only if TV(P, Q) = 0. In this case, the smallest total probability of error is
one, meaning the best test is random guessing.
• Hellinger consistency is equivalent to TV consistency: we have

H²(P_n, Q_n) → 0 ⟺ TV(P_n, Q_n) → 0  (7.23)
H²(P_n, Q_n) → 2 ⟺ TV(P_n, Q_n) → 1;  (7.24)

however, the speed of convergence need not be the same.

Theorem 7.8 For any sequence of distributions P_n and Q_n, as n → ∞,

TV(P_n^{⊗n}, Q_n^{⊗n}) → 0 ⟺ H²(P_n, Q_n) = o(1/n)
TV(P_n^{⊗n}, Q_n^{⊗n}) → 1 ⟺ H²(P_n, Q_n) = ω(1/n)

Proof. For convenience, let X_1, X_2, ..., X_n iid∼ Q_n. Then

H²(P_n^{⊗n}, Q_n^{⊗n}) = 2 − 2 E[ √( ∏_{i=1}^n (P_n/Q_n)(X_i) ) ]
  = 2 − 2 ∏_{i=1}^n E[ √( (P_n/Q_n)(X_i) ) ] = 2 − 2 ( E[ √(P_n/Q_n) ] )^n
  = 2 − 2 ( 1 − ½ H²(P_n, Q_n) )^n.  (7.25)

We now use (7.25) to conclude the proof. Recall from (7.23) that TV(P_n^{⊗n}, Q_n^{⊗n}) → 0 if and only if H²(P_n^{⊗n}, Q_n^{⊗n}) → 0, which happens precisely when H²(P_n, Q_n) = o(1/n). Similarly, by (7.24), TV(P_n^{⊗n}, Q_n^{⊗n}) → 1 if and only if H²(P_n^{⊗n}, Q_n^{⊗n}) → 2, which is further equivalent to H²(P_n, Q_n) = ω(1/n).


Remark 7.7 Property (7.25) is known as tensorization. More generally, we have

H²( ∏_{i=1}^n P_i, ∏_{i=1}^n Q_i ) = 2 − 2 ∏_{i=1}^n ( 1 − ½ H²(P_i, Q_i) ).  (7.26)

While some other f-divergences also satisfy tensorization, see Section 7.12, the H2 has the advan-
tage of a sandwich bound (7.22) making it the most convenient tool for checking asymptotic
testability of hypotheses.
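The tensorization identity (7.25)-(7.26) can be verified by brute force on a small example (a sketch, not from the text; the Bernoulli pair and n = 4 are arbitrary choices):

    import numpy as np
    from itertools import product

    def h2(P, Q): return ((np.sqrt(P) - np.sqrt(Q)) ** 2).sum()

    P, Q, n = np.array([0.6, 0.4]), np.array([0.3, 0.7]), 4
    # exhaustive H^2 between the n-fold products, over all 2^n sequences
    Pn = np.array([np.prod([P[x] for x in s]) for s in product((0, 1), repeat=n)])
    Qn = np.array([np.prod([Q[x] for x in s]) for s in product((0, 1), repeat=n)])
    print(h2(Pn, Qn), 2 - 2 * (1 - h2(P, Q) / 2) ** n)   # agree up to rounding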
Remark 7.8 (Kakutani's dichotomy) Let P = ∏_{i≥1} P_i and Q = ∏_{i≥1} Q_i, where P_i ≪ Q_i. Kakutani's theorem shows the following dichotomy between these two distributions on the infinite sequence space:

• If Σ_{i≥1} H²(P_i, Q_i) = ∞, then P and Q are mutually singular (i.e. P ⊥ Q).
• If Σ_{i≥1} H²(P_i, Q_i) < ∞, then P and Q are equivalent (i.e. P ≪ Q and Q ≪ P).

In the Gaussian case, say, P_i = N(μ_i, 1) and Q_i = N(0, 1), the equivalence condition simplifies to Σ μ_i² < ∞.
To understand Kakutani's criterion, note that by the tensorization property (7.26), we have

H²(P, Q) = 2 − 2 ∏_{i≥1} ( 1 − H²(P_i, Q_i)/2 ).

Thus, if ∏_{i≥1} (1 − H²(P_i, Q_i)/2) = 0, or equivalently, Σ_{i≥1} H²(P_i, Q_i) = ∞, then H²(P, Q) = 2, which, by (7.22), is equivalent to TV(P, Q) = 1 and hence P ⊥ Q. If Σ_{i≥1} H²(P_i, Q_i) < ∞, then H²(P, Q) < 2. To conclude the equivalence between P and Q, note that the likelihood ratio dP/dQ = ∏_{i≥1} dP_i/dQ_i satisfies that either Q(dP/dQ = 0) = 0 or 1 by Kolmogorov's 0-1 law. See [143, Theorem 5.3.5] for details.
We end this section by discussing the related concept of contiguity. Note that if two distributions
P_n and Q_n have vanishing total variation, then P_n(A) = Q_n(A) + o(1) uniformly for all events A. Sometimes, and especially for statistical applications, we are only interested in comparing those events
with probability close to zero or one. This leads us to the following definition.

Definition 7.9 (Contiguity and asymptotic separatedness) Let {Pn } and {Qn } be
sequences of probability measures on some Ωn . We say Pn is contiguous with respect to Qn
(denoted by Pn ◁ Qn ) if for any sequence {An } of measurable sets, Qn (An ) → 0 implies that
Pn (An ) → 0. We say Pn and Qn are mutually contiguous (denoted by Pn ◁▷ Qn ) if Pn ◁ Qn
and Qn ◁ Pn . We say that Pn is asymptotically separated from Qn (denoted Pn △ Qn ) if
lim supn→∞ TV(Pn , Qn ) = 1.

Note that when P_n = P and Q_n = Q these definitions correspond to P ≪ Q and P ⊥ Q,


respectively, and thus should be viewed as their asymptotic versions. Clearly, Pn ◁▷ Qn is much
weaker than TV(Pn , Qn ) → 0; for example, Pn (An ) = 1/2 only guarantees Qn (An ) is not tending to


0 or 1. In addition, if Pn ◁ Qn , then any test that succeeds with high Qn -probability must fail with
high Pn -probability; in other words, Pn and Qn cannot be distinguished perfectly so TV(Pn , Qn ) =
1 − Ω(1), in particular contiguity and separatedness are mutually exclusive. Furthermore, often
many interesting sequences of measures satisfy dichotomy similar to Kakutani’s: either Pn ◁▷ Qn
or Pn △ Qn , see [282].
Our interest in these notions arises from the fact that f-divergences are instrumental for
establishing contiguity and separatedness. For example, from (7.24) we conclude that

Pn △ Qn ⇐⇒ lim sup H2 (Pn , Qn ) = 2 .


n→∞

On the other hand, [385, Theorem III.10.1] shows

Pn ◁ Qn ⇐⇒ lim lim sup Dα (Pn kQn ) = 0 ,


α→0+ n→∞

where Dα is Rényi divergence (Definition 7.24). This criterion can be weakened to the following
(commonly used) one: Pn ◁ Qn if χ2 (Pn kQn ) = O(1). Indeed, applying p a change ofpmeasure
and Cauchy-Schwarz, Pn (An ) = EPn [1 {An }] = EQn [ dQ dPn
n
1 { An }] ≤ 1 + χ2 (Pn kQn ) Qn (An ),
which vanishes whenever Qn (An ) vanishes. (See Exercise I.49 for a concrete example in the con-
text of community detection and random graphs.) In particular, a sufficient condition for mutual
contiguity is the boundedness of likelihood ratio: c ≤ QPnn ≤ C for some constants c, C.

7.4 Inequalities between f-divergences and joint range


In this section we study the relationships, in particular inequalities, between f-divergences. To gain some intuition, we start with an ad hoc approach by proving Pinsker's inequality, which bounds the total variation from above in terms of the KL divergence.

Theorem 7.10 (Pinsker’s inequality)

D(PkQ) ≥ (2 log e)TV2 (P, Q). (7.27)

Proof. It suffices to consider the natural logarithm for the KL divergence. First we show that,
by the data processing inequality, it suffices to prove the result for Bernoulli distributions. For
any event E, let Y = 1 {X ∈ E} which is Bernoulli with parameter P(E) or Q(E). By the DPI,
D(PkQ) ≥ d(P(E)kQ(E)). If Pinsker’s inequality holds for all Bernoulli distributions, we have
√( ½ D(P‖Q) ) ≥ TV(Ber(P(E)), Ber(Q(E))) = |P(E) − Q(E)|.

Taking the supremum over E gives √( ½ D(P‖Q) ) ≥ sup_E |P(E) − Q(E)| = TV(P, Q), in view of Theorem 7.7.


The binary case follows easily from a second-order Taylor expansion (with integral remainder form) of p ↦ d(p‖q):

d(p‖q) = ∫_q^p (p − t)/(t(1 − t)) dt ≥ 4 ∫_q^p (p − t) dt = 2(p − q)²

and TV(Ber(p), Ber(q)) = |p − q|.
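A quick numerical sanity check of (7.27) on the binary case (a sketch, not from the text; natural logarithms are used, so the claim reads d(p‖q) ≥ 2(p − q)²):

    import numpy as np

    def d_bin(p, q):                                    # binary KL divergence, in nats
        return p*np.log(p/q) + (1-p)*np.log((1-p)/(1-q))

    ps = np.linspace(0.01, 0.99, 99)
    ratios = [d_bin(p, q) / (2*(p - q)**2) for p in ps for q in ps if abs(p - q) > 1e-9]
    print(min(ratios))   # >= 1, approaching 1 for p, q near 1/2 with p - q small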

Pinsker’s inequality has already been used multiple times in this book. Here is yet another
implication that is further explored in Exercise I.62 and I.63 (Szemerédi regularity).

Corollary 7.11 (Tao’s inequality [414]) Let Y → X → X′ be a Markov chain with Y ∈


[−1, 1]. Then
E[ |E[Y|X] − E[Y|X′]|² ] ≤ (2/log e) I(Y; X|X′).  (7.28)
The same estimate holds for Y ranging over a unit ball in any normed vector space with | · | in the
LHS being the norm.

Proof. If Y1 and Y2 are two random variables taking values in [−1, 1] then by (7.20) there exists
a coupling such that P[Y_1 ≠ Y_2] ≤ TV(P_{Y_1}, P_{Y_2}). Thus, |E[Y_1] − E[Y_2]| ≤ 2 TV(P_{Y_1}, P_{Y_2}). Now, applying this to P_{Y_1} = P_{Y|X=a} and P_{Y_2} = P_{Y|X′=b} we obtain

|E[Y|X = a] − E[Y|X′ = b]|² ≤ 4 TV²(P_{Y|X=a}, P_{Y|X′=b}) ≤ (2/log e) D(P_{Y|X=a} ‖ P_{Y|X′=b}),
where we applied Pinsker’s inequality in the last step. The proof is completed by averaging over
(a, b) ∼ PX,X′ and noticing that D(PY|X kPY|X′ |PX,X′ ) = I(Y; X|X′ ) due to PY|X,X′ = PY|X by
assumption.

Pinsker's inequality and Tao's inequality are both sharp in the sense that the constants cannot be improved. For example, for (7.27) we can take P_n = Ber(½ + 1/n) and Q_n = Ber(½) and compute that D(P_n‖Q_n)/TV²(P_n, Q_n) → 2 log e as n → ∞. (This is best seen by inspecting the local quadratic behavior in Proposition 2.21.)
in Proposition 2.21.) Nevertheless, this does not mean that the inequality (7.27) is not improvable,
as the RHS can be replaced by some other function of TV(P, Q) with additional higher-order
terms. Indeed, several such improvements of Pinsker’s inequality are known. But what is the best
inequality? In addition, another natural question is the reverse inequality: can we upper-bound
D(PkQ) in terms of TV(P, Q)?
Settling these questions rests on characterizing the joint range (the set of possible values) of a
given pair f-divergences. This systematic approach to comparing f-divergences (as opposed to the
ad hoc proof of Theorem 7.10 we presented above) is the subject of the rest of this section.

Definition 7.12 (Joint range) Consider two f-divergences Df (PkQ) and Dg (PkQ). Their
joint range is a subset of [0, ∞]2 defined by

R ≜ {(Df (PkQ), Dg (PkQ)) : P, Q are probability measures on some measurable space} .


In addition, the joint range over all k-ary distributions is defined as

Rk ≜ {(Df (PkQ), Dg (PkQ)) : P, Q are probability measures on [k]} .

As an example, Figure 7.1 gives the joint range R between the KL divergence and the total vari-
ation. By definition, the lower boundary of the region R gives the optimal refinement of Pinsker’s
inequality:

D(P‖Q) ≥ F(TV(P, Q)),   F(ε) ≜ inf_{(P,Q): TV(P,Q)=ε} D(P‖Q) = inf{ s : (ε, s) ∈ R }.

Also from Figure 7.1 we see that it is impossible to bound D(PkQ) from above in terms of TV(P, Q)
due to the lack of upper boundary.

[Figure 7.1: Joint range of TV and KL divergence. The dashed line is the quadratic lower bound given by Pinsker's inequality (7.27).]

The joint range R may appear difficult to characterize since we need to consider P, Q over
all measurable spaces; on the other hand, the region Rk for small k is easy to obtain (at least
numerically). Revisiting the proof of Pinsker’s inequality in Theorem 7.10, we see that the key
step is the reduction to Bernoulli distributions. It is natural to ask: to obtain full joint range is it
possible to reduce to the binary case? It turns out that it is always sufficient to consider quaternary
distributions, or the convex hull of that of binary distributions.

Theorem 7.13 (Harremoës-Vajda [214])


R = co(R2 ) = R4 .

where co denotes the convex hull with a natural extension of convex operations to [0, ∞]2 .


We will rely on the following famous result from convex analysis (cf. e.g. [145, Chapter 2,
Theorem 18]).

Lemma 7.14 (Fenchel-Eggleston-Carathéodory theorem) Let S ⊆ Rd and x ∈ co(S).


Then there exists a set of d + 1 points S′ = {x1 , x2 , . . . , xd+1 } ∈ S such that x ∈ co(S′ ). If S has at
most d connected components, then d points are enough.

Proof. Our proof will consist of three claims:

• Claim 1: co(R2 ) ⊂ R4 ;
• Claim 2: Rk ⊂ co(R2 );
• Claim 3: R = R4 .
Note that Claims 1-2 prove the most interesting part: ∪_{k=1}^∞ R_k = co(R_2). Claim 3 is more technical and its proof can be found in [214]. However, the approximation result in Theorem 7.6 shows that R is the closure of ∪_{k=1}^∞ R_k. Thus for the purpose of obtaining inequalities between D_f and D_g, Claims 1-2 are sufficient.
We start with Claim 1. Given any two pairs of distributions (P0 , Q0 ) and (P1 , Q1 ) on some space
X and given any α ∈ [0, 1], define two joint distributions of the random variables (X, B) where
PB = QB = Ber(α), PX|B=i = Pi and QX|B=i = Qi for i = 0, 1. Then by (7.9) we get

Df (PX,B kQX,B ) = ᾱDf (P0 kQ0 ) + αDf (P1 kQ1 ) ,

and similarly for the Dg . Thus, R is convex. Next, notice that

R2 = R̃2 ∪ {(pf′ (∞), pg′ (∞)) : p ∈ (0, 1]} ∪ {(qf(0), qg(0)) : q ∈ (0, 1]} ,

where R̃2 is the image of (0, 1)2 of the continuous map


 
(p, q) 7→ Df (Ber(p)kBer(q)), Dg (Ber(p)kBer(q)) .

Since (0, 0) ∈ R̃_2, we see that regardless of which of f(0), f′(∞), g(0), g′(∞) are infinite, the set R_2 ∩ R² is connected. Thus, by Lemma 7.14 any point in co(R_2 ∩ R²) is a combination of two points in R_2 ∩ R², which, by the argument above, is a subset of R_4. Finally, it is not hard to see that co(R_2)\R² ⊂ R_4, which concludes the proof of co(R_2) ⊂ R_4.
Next, we prove Claim 2. Fix P, Q on [k] and denote their PMFs (pj ) and (qj ), respectively. Note
that without changing either Df (PkQ) or Dg (PkQ) (but perhaps, by increasing k by 1), we can
make q_j > 0 for j > 1 and q_1 = 0, which we thus assume. Denote φ_j = p_j/q_j for j > 1 and consider the set

S = { Q̃ = (q̃_j)_{j∈[k]} : q̃_j ≥ 0, Σ_j q̃_j = 1, q̃_1 = 0, Σ_{j=2}^k q̃_j φ_j ≤ 1 }.

We also define a subset Se ⊂ S consisting of points Q̃ of two types:


1 q̃_j = 1 for some j ≥ 2 and φ_j ≤ 1.
2 q̃_{j_1} + q̃_{j_2} = 1 for some j_1, j_2 ≥ 2 and q̃_{j_1} φ_{j_1} + q̃_{j_2} φ_{j_2} = 1.

It can be seen that S_e are precisely all the extreme points of S. Indeed, any Q̃ ∈ S with Σ_{j≥2} q̃_j φ_j < 1 with more than one non-zero atom cannot be extremal (since there is only one active linear constraint Σ_j q̃_j = 1). Similarly, Q̃ with Σ_{j≥2} q̃_j φ_j = 1 can only be extremal if it has one or two non-zero atoms.
We next claim that any point in S can be written as a convex combination of finitely many points in S_e. This can be seen as follows. First, we can view S and S_e as subsets of R^{k−1}. Since S is clearly closed and convex, by the Krein-Milman theorem (see [12, Theorem 7.68]), S coincides with the closure of the convex hull of its extreme points. Since S_e is compact (hence closed), so is co(S_e) [12, Corollary 5.33]. Thus we have S = co(S_e) and, in particular, there are probability weights {α_i : i ∈ [m]} and extreme points Q̃_i ∈ S_e so that

Q = Σ_{i=1}^m α_i Q̃_i.  (7.29)

Next, to each Q̃ we associate P̃ = (p̃_j)_{j∈[k]} as follows:

p̃_j = φ_j q̃_j for j ∈ {2, ..., k},   p̃_1 = 1 − Σ_{j=2}^k φ_j q̃_j.

We then have that

Q̃ ↦ D_f(P̃‖Q̃) = Σ_{j≥2} q̃_j f(φ_j) + f′(∞) p̃_1

affinely maps S to [0, ∞] (note that f(0) or f′(∞) can equal ∞). In particular, if we denote P̃_i = P̃(Q̃_i) corresponding to Q̃_i in decomposition (7.29), we get

D_f(P‖Q) = Σ_{i=1}^m α_i D_f(P̃_i ‖ Q̃_i),

and similarly for Dg (PkQ). We are left to show that (P̃i , Q̃i ) are supported on at most two points,
which verifies that any element of Rk is a convex combination of k elements of R2 . Indeed, for
Q̃ ∈ Se the set {j ∈ [k] : q̃j > 0 or p̃j > 0} has cardinality at most two (for the second type
extremal points we notice p̃j1 + p̃j2 = 1 implying p̃1 = 0). This concludes the proof of Claim
2.

7.5 Examples of computing joint range


In this section we show how to apply the method of Harremoës and Vajda for proving the best
possible comparison inequalities between various f-divergences.


7.5.1 Hellinger distance versus total variation


The joint range R_2 of H² and TV over binary distributions is simply:

R_2 = { ( 2(1 − √(pq) − √(p̄q̄)), |p − q| ) : 0 ≤ p ≤ 1, 0 ≤ q ≤ 1 },

shown as the non-convex grey region in Figure 7.2. By Theorem 7.13, their full joint range R is the convex hull of R_2, which turns out to be exactly described by the sandwich bound (7.22) shown earlier in Section 7.3. This means that (7.22) is not improvable. Indeed, with t ranging from 0 to 1,

• the upper boundary is achieved by P = Ber((1+t)/2), Q = Ber((1−t)/2),
• the lower boundary is achieved by P = (1−t, t, 0), Q = (1−t, 0, t).

[Figure 7.2: The joint range R of TV and H² is characterized by (7.22), which is the convex hull of the grey region R_2.]
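The claim that R is exactly the region cut out by (7.22) can be probed numerically by sampling random pairs of distributions (a sketch, not from the text; Dirichlet sampling over small alphabets is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(1)
    def tv(P, Q): return 0.5 * np.abs(P - Q).sum()
    def h2(P, Q): return ((np.sqrt(P) - np.sqrt(Q)) ** 2).sum()

    for _ in range(10_000):
        k = rng.integers(2, 6)
        P, Q = rng.dirichlet(np.ones(k)), rng.dirichlet(np.ones(k))
        H2, T = h2(P, Q), tv(P, Q)
        assert 0.5*H2 - 1e-12 <= T <= np.sqrt(H2*(1 - H2/4)) + 1e-12
    print("all sampled pairs satisfy the sandwich bound (7.22)")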

7.5.2 KL divergence versus total variation


The joint range between KL and TV was previously shown in Figure 7.1. Although there is
no known closed-form expression, the following parametric formula of the lower boundary (see
Figure 7.1) is known [163, Theorem 1]:
  
TV_t = ½ t ( 1 − (coth(t) − 1/t)² ),   D_t = −t² csch²(t) + t coth(t) + log(t csch(t)),   t ≥ 0,  (7.30)

where we take the natural logarithm. Here is a corollary (weaker bound) due to [427]:
D(P‖Q) ≥ log [ (1 + TV(P, Q)) / (1 − TV(P, Q)) ] − [ 2 TV(P, Q) / (1 + TV(P, Q)) ] log e.  (7.31)
Both bounds are stronger than Pinsker’s inequality (7.27). Note the following consequences:


• D → 0 ⇒ TV → 0, which can be deduced from Pinsker’s inequality;


• TV → 1 ⇒ D → ∞ and hence D = O(1) implies that TV is bounded away from one. This can
be obtained from (7.30) or (7.31), but not Pinsker’s inequality.

7.5.3 χ2 -divergence versus total variation


Proposition 7.15 We have the following bound

χ²(P‖Q) ≥ f(TV(P, Q)) ≥ 4 TV²(P, Q),   where f(t) = 4t² for t ≤ ½ and f(t) = t/(1 − t) for t ≥ ½,  (7.32)

where the function f is a convex increasing bijection of [0, 1) onto [0, ∞). Furthermore, for every s ≥ f(t) there exists a pair of distributions such that χ²(P‖Q) = s and TV(P, Q) = t.

Proof. We claim that the binary joint range is convex. Indeed,

TV(Ber(p), Ber(q)) = |p − q| ≜ t,   χ²(Ber(p)‖Ber(q)) = (p − q)²/(q(1 − q)) = t²/(q(1 − q)).

Given |p − q| = t, let us determine the possible range of q(1 − q). The smallest value of q(1 − q) is always 0 by choosing p = t, q = 0. The largest value is 1/4 if t ≤ 1/2 (by choosing p = 1/2 − t, q = 1/2). If t > 1/2 then we can at most get t(1 − t) (by setting p = 0 and q = t). Thus we get χ²(Ber(p)‖Ber(q)) ≥ f(|p − q|) as claimed. The convexity of f follows since its derivative is monotonically increasing. Clearly, f(t) ≥ 4t² because t(1 − t) ≤ ¼.

7.6 A selection of inequalities between various divergences


This section presents a collection of useful inequalities. For a more complete treatment, con-
sider [373] and [424, Sec. 2.4]. Most of these inequalities are joint ranges, which means they
are tight.

• KL vs TV: see (7.30). For discrete distributions there is partial comparison in the other direction
(“reverse Pinsker”, cf. [373, Section VI]):
 
D(P‖Q) ≤ log( 1 + (2/Q_min) TV(P, Q)² ) ≤ (2 log e / Q_min) TV(P, Q)²,   Q_min = min_x Q(x)

• KL vs Hellinger:
D(P‖Q) ≥ 2 log [ 2 / (2 − H²(P, Q)) ] ≥ log e · H²(P, Q).  (7.33)
The first inequality gives the joint range and is attained at P = Ber(0), Q = Ber(q). For a fixed
H2 , in general D(P||Q) has no finite upper bound, as seen from P = Ber(p), Q = Ber(0).


There is a partial result in the opposite direction (log-Sobolev inequality for Bonami-Beckner
semigroup, cf. [122, Theorem A.1] and Exercise I.64):
D(P‖Q) ≤ [ log(1/Q_min − 1) / (1 − 2 Q_min) ] ( 1 − (1 − H²(P, Q))² ),   Q_min = min_x Q(x)

Another partial result is in Ex. I.59.


• KL vs χ2 :

0 ≤ D(P||Q) ≤ log(1 + χ2 (P||Q)) ≤ log e · χ2 (PkQ) . (7.34)

The left-hand inequality states that no lower bound on KL in terms of χ2 is possible.


• TV vs Hellinger: see (7.22). A useful simplified bound from [186] is the following:
TV(P, Q) ≤ √( −2 ln( 1 − H²(P, Q)/2 ) )
• Le Cam vs Hellinger [273, p. 48]:
½ H²(P, Q) ≤ LC(P, Q) ≤ H²(P, Q).  (7.35)
• Le Cam vs Jensen-Shannon [422]:

LC(P, Q) log e ≤ JS(P, Q) ≤ LC(P, Q) · 2 log 2 (7.36)

• χ2 vs TV: The full joint range is given by (7.32). Two simple consequences are:
TV(P, Q) ≤ ½ √( χ²(P‖Q) )  (7.37)
TV(P, Q) ≤ max( ½, χ²(P‖Q) / (1 + χ²(P‖Q)) ),  (7.38)
where the second is useful for bounding TV away from one.
• JS vs TV: The full joint region is given by
 
2 d( (1 − TV(P, Q))/2 ‖ ½ ) ≤ JS(P, Q) ≤ TV(P, Q) · 2 log 2.  (7.39)

The lower bound is a consequence of Fano's inequality. For the upper bound notice that for p, q ∈ [0, 1] and |p − q| = τ the maximum of d(p ‖ (p+q)/2) is attained at p = 0, q = τ (from the convexity of d(·‖·)) and, thus, the binary joint-range is given by τ ↦ d(τ ‖ τ/2) + d(1 − τ ‖ 1 − τ/2). Since the latter is convex, its concave envelope is a straight line connecting endpoints at τ = 0 and τ = 1.

7.7 Divergences between Gaussians


To get a better feel for the behavior of f-divergences, here we collect expressions (as well as
asymptotic expansions) of divergences between Gaussian distributions.


1 Total variation:

  TV(N(0, σ²), N(μ, σ²)) = 2Φ( |μ|/(2σ) ) − 1 = ∫_{−|μ|/(2σ)}^{|μ|/(2σ)} φ(x) dx = |μ|/(√(2π) σ) + O(μ²),  μ → 0.  (7.40)

2 Hellinger distance:

  H²(N(0, σ²), N(μ, σ²)) = 2 − 2 e^{−μ²/(8σ²)} = μ²/(4σ²) + O(μ³),  μ → 0.  (7.41)

  More generally,

  H²(N(μ_1, Σ_1), N(μ_2, Σ_2)) = 2 − 2 [ |Σ_1|^{1/4} |Σ_2|^{1/4} / |Σ̄|^{1/2} ] exp( −⅛ (μ_1 − μ_2)′ Σ̄^{−1} (μ_1 − μ_2) ),

  where Σ̄ = (Σ_1 + Σ_2)/2.
3 KL divergence:

  D(N(μ_1, σ_1²) ‖ N(μ_2, σ_2²)) = ½ log(σ_2²/σ_1²) + ½ [ (μ_1 − μ_2)²/σ_2² + σ_1²/σ_2² − 1 ] log e.  (7.42)

  For a more general result see (2.8).
4 χ²-divergence:

  χ²(N(μ, σ²) ‖ N(0, σ²)) = e^{μ²/σ²} − 1 = μ²/σ² + O(μ³),  μ → 0  (7.43)

  χ²(N(μ, σ²) ‖ N(0, 1)) = e^{μ²/(2−σ²)} / (σ √(2 − σ²)) − 1 if σ² < 2, and ∞ if σ² ≥ 2.  (7.44)

5 χ²-divergence for Gaussian mixtures [225] (see also Exercise I.48 for the Ingster-Suslina method applicable to general mixture distributions):

  χ²(P ∗ N(0, Σ) ‖ N(0, Σ)) = E[ e^{⟨Σ^{−1} X, X′⟩} ] − 1,   X ⊥⊥ X′ ∼ P.  (7.45)
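Closed forms such as (7.44) are easy to double check numerically; the following sketch (not from the text, with arbitrary μ and σ) compares a direct numerical integration of ∫ p²/q − 1 with the formula:

    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    mu, sigma = 0.8, 1.1                                 # any sigma^2 < 2
    integrand = lambda x: norm.pdf(x, mu, sigma)**2 / norm.pdf(x, 0, 1)
    chi2_numeric = quad(integrand, -20, 20)[0] - 1
    chi2_formula = np.exp(mu**2 / (2 - sigma**2)) / (sigma*np.sqrt(2 - sigma**2)) - 1
    print(chi2_numeric, chi2_formula)                    # agree to numerical precision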

7.8 Mutual information based on f-divergence


Given an f-divergence Df , we can define f-information, an extension of mutual information, as
follows:

If (X; Y) ≜ Df (PX,Y kPX PY ) . (7.46)

Theorem 7.16 (Data processing) For U → X → Y, we have If (U; Y) ≤ If (U; X).

Proof. Note that If (U; X) = Df (PU,X kPU PX ) ≥ Df (PU,Y kPU PY ) = If (U; Y), where we
applied the data-processing Theorem 7.4 to the (possibly stochastic) map (U, X) 7→ (U, Y). See
also Remark 3.4.


A useful property of mutual information is that X ⊥⊥ Y iff I(X; Y) = 0. A generalization of it is


the property that for X → Y → Z we have I(X; Y) = I(X; Z) iff X → Z → Y. Both of these may or
may not hold for If depending on the strict convexity of f, see Ex. I.40.
Another often used property of the standard mutual information is the subadditivity: If PA,B|X =
PA|X PB|X (i.e. A and B are conditionally independent given X), then

I(X; A, B) ≤ I(X; A) + I(X; B). (7.47)

However, other notions of f-information have complicated relationship with subadditivity:

1 The f-information corresponding to the χ2 -divergence,

Iχ2 (X; Y) ≜ χ2 (PX,Y kPX PY ) (7.48)

is not generally subadditive. There are two special cases when I_χ² is subadditive: if one of I_χ²(X; A) or I_χ²(X; B) is small [202, Lemma 26], or if X ∼ Ber(1/2) and the channels P_{A|X} and P_{B|X} are BMS (Section 19.4*), cf. [1].
2 The f-information corresponding to total-variation ITV (X; Y) ≜ TV(PX,Y , PX PY ) is not subad-
ditive. Furthermore, it has a counter-intuitive behavior of “getting stuck”. For example, take
X ∼ Ber(1/2) and A = BSCδ (X), B = BSCδ (X) – two independent observations of X across
the BSC. A simple computation (Exercise I.35) shows:

ITV (X; A, B) = ITV (X; A) = ITV (X; B) .

In other words, an additional observation does not improve TV-information at all. This is the
main reason for the famous herding effect in economics [30].
3 The symmetric KL-information

ISKL (X; Y) ≜ D(PX,Y kPX PY ) + D(PX PY kPX,Y ), (7.49)

the f-information corresponding to the symmetric KL divergence (also known as the Jeffreys
divergence)

DSKL (P, Q) ≜ D(PkQ) + D(QkP), (7.50)

satisfies, quite amazingly [265], the additivity property:

ISKL (X; A, B) = ISKL (X; A) + ISKL (X; B). (7.51)

Let us prove this in the discrete case. First notice the following equivalent expression for I_SKL:

I_SKL(X; Y) = Σ_{x,x′} P_X(x) P_X(x′) D(P_{Y|X=x} ‖ P_{Y|X=x′}).  (7.52)

From (7.52) we get (7.51) by the additivity D(P_{A,B|X=x} ‖ P_{A,B|X=x′}) = D(P_{A|X=x} ‖ P_{A|X=x′}) + D(P_{B|X=x} ‖ P_{B|X=x′}). To prove (7.52) first consider the obvious identity:

Σ_{x,x′} P_X(x) P_X(x′) [ D(P_Y ‖ P_{Y|X=x′}) − D(P_Y ‖ P_{Y|X=x}) ] = 0


which is rewritten as

Σ_{x,x′} P_X(x) P_X(x′) Σ_y P_Y(y) log [ P_{Y|X}(y|x) / P_{Y|X}(y|x′) ] = 0.  (7.53)

Next, by definition,

I_SKL(X; Y) = Σ_{x,y} [ P_{X,Y}(x, y) − P_X(x) P_Y(y) ] log [ P_{X,Y}(x, y) / (P_X(x) P_Y(y)) ].

Since the marginals of P_{X,Y} and P_X P_Y coincide, we can replace log [ P_{X,Y}(x, y) / (P_X(x) P_Y(y)) ] by log [ P_{Y|X}(y|x) / f(y) ] for any f. We choose f(y) = P_{Y|X}(y|x′) to get

I_SKL(X; Y) = Σ_{x,y} [ P_{X,Y}(x, y) − P_X(x) P_Y(y) ] log [ P_{Y|X}(y|x) / P_{Y|X}(y|x′) ].

Now averaging this over P_X(x′) and applying (7.53) to get rid of the second term in [···], we obtain (7.52). For another interesting property of I_SKL, see Ex. I.54.

7.9 Empirical distribution and χ2 -information


Consider an arbitrary channel P_{Y|X} and some input distribution P_X. Suppose that we have X_i iid∼ P_X for i = 1, ..., n. Let

P̂_n = (1/n) Σ_{i=1}^n δ_{X_i}

denote the empirical distribution corresponding to this sample. Let P_Y = P_{Y|X} ∘ P_X be the output distribution corresponding to P_X and P_{Y|X} ∘ P̂_n be the output distribution corresponding to P̂_n (a random distribution). Note that when P_{Y|X=x}(·) = φ(· − x), where φ is a fixed density, we can think of P_{Y|X} ∘ P̂_n as a kernel density estimator (KDE), whose density is p̂_n(x) = (φ ∗ P̂_n)(x) = (1/n) Σ_{i=1}^n φ(X_i − x). Furthermore, using the fact that E[P_{Y|X} ∘ P̂_n] = P_Y, we have

E[D(P_{Y|X} ∘ P̂_n ‖ P_X)] = D(P_Y ‖ P_X) + E[D(P_{Y|X} ∘ P̂_n ‖ P_Y)],

where the first term represents the bias of the KDE due to convolution and increases with the bandwidth of φ, while the second term represents the variability of the KDE and decreases with the bandwidth of φ. Surprisingly, the second term is sharply (within a factor of two) given by the I_χ² information. More exactly, we prove the following result.

Proposition 7.17

E[D(P_{Y|X} ∘ P̂_n ‖ P_Y)] ≤ log( 1 + (1/n) I_χ²(X; Y) ),  (7.54)


where Iχ2 (X; Y) is defined in (7.48). Furthermore,


lim inf_{n→∞} n E[D(P_{Y|X} ∘ P̂_n ‖ P_Y)] ≥ (log e / 2) I_χ²(X; Y).  (7.55)
In particular, E[D(PY|X ◦ P̂n kPY )] = O(1/n) if Iχ2 (X; Y) < ∞ and ω(1/n) otherwise.

In Section 25.4* we will discuss an extension of this simple bound, in particular showing that
in many cases about n = exp{I(X; Y)+ K} observations are sufficient to ensure D(PY|X ◦ P̂n kPY ) =
e−O(K) .

Proof. First, a simple calculation shows that

E[χ²(P_{Y|X} ∘ P̂_n ‖ P_Y)] = (1/n) I_χ²(X; Y).

Then from (7.34) and Jensen's inequality we get (7.54).
To get the lower bound in (7.55), let X̄ be drawn uniformly at random from the sample {X_1, ..., X_n} and let Ȳ be the output of the P_{Y|X} channel with input X̄. With this definition we have:

E[D(P_{Y|X} ∘ P̂_n ‖ P_Y)] = I(X^n; Ȳ).  (7.56)

Next, apply (6.2) to get

I(X^n; Ȳ) ≥ Σ_{i=1}^n I(X_i; Ȳ) = n I(X_1; Ȳ).

Finally, notice that

I(X_1; Ȳ) = D( ((n−1)/n) P_X P_Y + (1/n) P_{X,Y} ‖ P_X P_Y )

and apply the local expansion of KL divergence (Proposition 2.21) to get (7.55).

In the discrete case, by taking PY|X to be the identity channel (Y = X) we obtain the following
guarantee on the closeness between the empirical and the population distribution. This fact can be
used to test whether the sample was truly generated by the distribution PX .

Corollary 7.18 Suppose PX is discrete with support X . If X is infinite, then


lim_{n→∞} n E[D(P̂_n ‖ P_X)] = ∞.  (7.57)

Otherwise, we have

E[D(P̂_n ‖ P_X)] ≤ log( 1 + (|X| − 1)/n ) ≤ (|X| − 1) (log e)/n.  (7.58)

Proof. Simply notice that Iχ2 (X; X) = |X | − 1.


Remark 7.9 For fixed PX , the tight asymptotic result is


lim_{n→∞} n E[D(P̂_n ‖ P_X)] = (log e / 2) (|supp(P_X)| − 1).  (7.59)
See Lemma 13.2 below. See also Exercise VI.10 for the results on estimating PX under different
loss functions by means other than using empirical distribution.
Corollary 7.18 is also useful for the statistical application of entropy estimation. Given n iid
observations, a natural estimator of the entropy of PX is the empirical entropy Ĥemp = H(P̂n )
(plug-in estimator). It is clear that empirical entropy is an underestimate, in the sense that the bias

E[Ĥemp ] − H(PX ) = − E[D(P̂n kPX )]

is always non-negative. For fixed PX , Ĥemp is known to be consistent even on countably infinite
alphabets [22], although the convergence rate can be arbitrarily slow, which aligns with the con-
clusion of (7.57). However, for a large alphabet of size Θ(n), the upper bound (7.58) does not vanish (this is tight for, e.g., the uniform distribution). In this case, one needs to de-bias the empirical entropy
(e.g. on the basis of (7.59)) or employ different techniques in order to achieve consistent estimation.
See Section 29.4 for more details.
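The bound (7.58) and the asymptotics (7.59) can be observed in a small Monte Carlo experiment (a sketch, not from the text; the pmf and sample sizes are arbitrary, and natural logarithms are used so (7.59) reads (|X| − 1)/(2n)):

    import numpy as np

    rng = np.random.default_rng(0)
    P = np.array([0.4, 0.3, 0.2, 0.1]); k = len(P)

    def kl(p, q):
        m = p > 0
        return (p[m] * np.log(p[m] / q[m])).sum()        # nats

    for n in (10, 100, 1000):
        mc = np.mean([kl(rng.multinomial(n, P) / n, P) for _ in range(5000)])
        print(n, mc, np.log(1 + (k-1)/n), (k-1)/(2*n))   # MC vs bound (7.58) vs limit (7.59)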

7.10 Most f-divergences are locally χ2 -like


In this section we prove analogs of Proposition 2.20 and Proposition 2.21 for the general
f-divergences.

Theorem 7.19 Suppose that D_f(P‖Q) < ∞ and the derivative of f at x = 1 exists. Then

lim_{λ→0} (1/λ) D_f(λP + λ̄Q ‖ Q) = (1 − P[supp(Q)]) f′(∞),

where as usual we take 0 · ∞ = 0 in the left-hand side.

Remark 7.10 Note that we do not need a separate theorem for Df (QkλP + λ̄Q) since the
exchange of arguments leads to another f-divergence with f(x) replaced by xf(1/x).

Proof. Without loss of generality we may assume f(1) = f′(1) = 0 and f ≥ 0. Then, decomposing P = μP_1 + μ̄P_0 with P_0 ⊥ Q and P_1 ≪ Q we have

(1/λ) D_f(λP + λ̄Q ‖ Q) = μ̄ f′(∞) + (1/λ) ∫ dQ f( 1 + λ(μ dP_1/dQ − 1) ).

Note that g(λ) = f(1 + λt) is positive and convex for every t ∈ R and hence (1/λ) g(λ) is monotonically decreasing to g′(0) = 0 as λ ↘ 0. Since for λ = 1 the integrand is assumed to be Q-integrable, the dominated convergence theorem applies and we get the result.


Theorem 7.20 Let f be twice continuously differentiable on (0, ∞) with

lim sup_{x→+∞} f″(x) < ∞.

If χ²(P‖Q) < ∞, then D_f(λ̄Q + λP ‖ Q) < ∞ for all 0 ≤ λ < 1 and

lim_{λ→0} (1/λ²) D_f(λ̄Q + λP ‖ Q) = (f″(1)/2) χ²(P‖Q).  (7.60)

If χ²(P‖Q) = ∞ and f″(1) > 0 then (7.60) also holds, i.e. D_f(λ̄Q + λP ‖ Q) = ω(λ²).

Remark 7.11 Conditions of the theorem include D, D_SKL, H², JS, LC and all Rényi divergences of orders λ < 2 (with f(x) = (1/(λ − 1))(x^λ − 1); see Definition 7.24). A similar result holds also for the case when f″(x) → ∞ as x → +∞ (e.g. Rényi divergences with λ > 2), but then we need to make extra assumptions in order to guarantee applicability of the dominated convergence theorem (often just the finiteness of D_f(P‖Q) is sufficient).

Proof. Assuming that χ²(P‖Q) < ∞ we must have P ≪ Q and hence we can use (7.1) as the definition of D_f. Note that under (7.1) without loss of generality we may assume f′(1) = f(1) = 0 (indeed, for that we can just add a multiple of (x − 1) to f(x), which does not change the value of D_f(P‖Q)). From the Taylor expansion we have then

f(1 + u) = u² ∫_0^1 (1 − t) f″(1 + tu) dt.

Applying this with u = λ (P − Q)/Q we get

D_f(λ̄Q + λP ‖ Q) = ∫ dQ ∫_0^1 dt (1 − t) λ² ((P − Q)/Q)² f″( 1 + tλ (P − Q)/Q ).  (7.61)

Note that for any ε > 0 we have sup_{x≥ε} |f″(x)| ≜ C_ε < ∞. Note that (P − Q)/Q ≥ −1 and, thus, for every λ the integrand is non-negative and bounded by

C_{1−λ} ((P − Q)/Q)²  (7.62)

which is integrable over dQ × Leb[0, 1] (by finiteness of χ²(P‖Q) and Fubini, which applies due to non-negativity). Thus, D_f(λ̄Q + λP ‖ Q) < ∞. Dividing (7.61) by λ² we see that the integrand is dominated by (7.62) and hence we can apply the dominated convergence theorem to conclude

lim_{λ→0} (1/λ²) D_f(λ̄Q + λP ‖ Q) (a)= ∫_0^1 dt (1 − t) ∫ dQ lim_{λ→0} ((P − Q)/Q)² f″( 1 + tλ (P − Q)/Q )
  = ∫_0^1 dt (1 − t) ∫ dQ ((P − Q)/Q)² f″(1) = (f″(1)/2) χ²(P‖Q),

which proves (7.60).
We proceed to proving that D_f(λP + λ̄Q ‖ Q) = ω(λ²) when χ²(P‖Q) = ∞. If P ≪ Q then this follows by replacing the equality in (a) with ≥ due to Fatou's lemma. If P is not ≪ Q, we consider


decomposition P = μP_1 + μ̄P_0 with P_1 ≪ Q and P_0 ⊥ Q. From definition (7.2) we have (for λ_1 = λμ/(1 − λμ̄))

D_f(λP + λ̄Q ‖ Q) = (1 − λμ̄) D_f(λ_1 P_1 + λ̄_1 Q ‖ Q) + λμ̄ D_f(P_0 ‖ Q) ≥ λμ̄ D_f(P_0 ‖ Q).

Recall from Proposition 7.2 that D_f(P_0 ‖ Q) > 0 unless f(x) = c(x − 1) for some constant c and the proof is complete.
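Here is a small numerical illustration of (7.60) (a sketch, not from the text; the pmfs are arbitrary and natural logarithms are used, so f″(1) = 1 for KL and f″(1) = 1/2 for H²):

    import numpy as np

    P = np.array([0.7, 0.2, 0.1]); Q = np.array([0.3, 0.3, 0.4])
    chi2 = ((P - Q) ** 2 / Q).sum()

    def kl(p, q):
        m = p > 0
        return (p[m] * np.log(p[m] / q[m])).sum()        # f(x) = x log x
    def h2(p, q):
        return ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()    # f(x) = (1 - sqrt(x))^2

    for lam in (0.1, 0.01, 0.001):
        M = lam*P + (1 - lam)*Q
        print(lam, kl(M, Q)/lam**2, h2(M, Q)/lam**2)     # tend to chi2/2 and chi2/4
    print(chi2/2, chi2/4)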

7.11 f-divergences in parametric families: Fisher information


In Section 2.6.2* we have already previewed the fact that in parametric families of distributions,
the Hessian of the KL divergence turns out to coincide with the Fisher information. Here we
collect such facts and their proofs. These materials form the basis of sharp bounds on parameter
estimation that we will study later in Chapter 29.
To start with an example, let us return to the Gaussian location model (GLM) P_t ≜ N(t, 1), t ∈ R. From the identities presented in Section 7.7 we obtain the following asymptotics:

TV(P_t, P_0) = |t|/√(2π) + o(|t|),   H²(P_t, P_0) = t²/4 + o(t²),
χ²(P_t ‖ P_0) = t² + o(t²),   D(P_t ‖ P_0) = (t²/2) log e + o(t²),
LC(P_t, P_0) = t²/4 + o(t²).
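These expansions follow from the closed forms of Section 7.7 and can be checked directly (a sketch, not from the text; natural logarithms, so D(P_t ‖ P_0) = t²/2):

    import numpy as np

    for t in (0.5, 0.1, 0.01):
        H2, chi2, D = 2 - 2*np.exp(-t**2/8), np.exp(t**2) - 1, t**2/2
        print(t, H2/t**2, chi2/t**2, D/t**2)             # ratios tend to 1/4, 1, 1/2
    # i.e. J_F(0)/4, J_F(0), J_F(0)/2 with Fisher information J_F(0) = 1 for the GLM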
We can see that with the exception of TV, other f-divergences behave quadratically under small
displacement t → 0. This turns out to be a general fact, and furthermore the coefficient in front
of t2 is given by the Fisher information (at t = 0). To proceed carefully, we need some technical
assumptions on the family Pt .

Definition 7.21 (Regular single-parameter families) Fix τ > 0, space X and a family
Pt of distributions on X , t ∈ [0, τ ). We define the following types of conditions that we call
regularity at t = 0:

(a) Pt (dx) = pt (x) μ(dx), for some measurable (t, x) 7→ pt (x) ∈ R+ and a fixed measure μ on X ;
(b0 ) There exists a measurable function (s, x) 7→ ṗs (x), s ∈ [0, τ ), x ∈ X , such that for μ-almost

every x0 we have 0 |ṗs (x0 )|ds < ∞ and
Z t
p t ( x0 ) = p 0 ( x0 ) + ṗs (x0 )ds . (7.63)
0

Furthermore, for μ-almost every x0 we have limt↘0 ṗt (x0 ) = ṗ0 (x0 ).
(b1 ) We have ṗt (x) = 0 whenever p0 (x) = 0 and, furthermore,
Z
(ṗt (x))2
μ(dx) sup < ∞. (7.64)
X 0≤t<τ p0 (x)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-141


i i

7.11 f-divergences in parametric families: Fisher information 141

(c0 ) There exists a measurable function (s, x) 7→ ḣs (x), s ∈ [0, τ ), x ∈ X , such that for μ-almost

every x0 we have 0 |ḣs (x0 )|ds < ∞ and
p p Z t
h t ( x0 ) ≜ p t ( x0 ) = p 0 ( x0 ) + ḣs (x0 )ds . (7.65)
0

Furthermore, for μ-almost every x0 we have limt↘0 ḣt (x0 ) = ḣ0 (x0 ).
(c1 ) The family of functions {(ḣt (x))2 : t ∈ [0, τ )} is uniformly μ-integrable.

Remark 7.12 Recall that the uniform integrability condition (c1 ) is implied by the following
stronger (but easier to verify) condition:
Z
μ(dx) sup (ḣt (x))2 < ∞ . (7.66)
X 0≤t<τ

Impressively, if one also assumes the continuous differentiability of ht then the uniform integra-
bility condition becomes equivalent to the continuity of the Fisher information
Z
t 7→ JF (t) ≜ 4 μ(dx)(ḣt (x))2 . (7.67)

We refer to [68, Appendix V] for this finesse.

Theorem 7.22 Let the family of distributions {Pt : t ∈ [0, τ )} satisfy the conditions (a), (b0 )
and (b1 ) in Definition 7.21. Then we have
χ2 (Pt kP0 ) = JF (0)t2 + o(t2 ) , (7.68)
log e
D(Pt kP0 ) = JF ( 0 ) t 2 + o ( t 2 ) , (7.69)
2
R 2
where JF (0) ≜ X
μ(dx) (ṗp00((xx))) < ∞ is the Fisher information at t = 0.

Proof. From assumption (b1 ) we see that for any x0 with p0 (x0 ) = 0 we must have ṗt (x0 ) = 0
and thus pt (x0 ) = 0 for all t ∈ [0, τ ). Hence, we may restrict all integrals below to subset {x :
p0 (x0 ))2
p0 (x) > 0}, on which the ratio (pt (x0p)−
0 0)
( x is well-defined. Consequently, we have by (7.63)
Z
1 2 1 (pt (x) − p0 (x))2
2
χ ( Pt kP0 ) = 2
μ(dx)
t t p 0 ( x)
Z Z 1 !2
1 1
= 2 μ(dx) t duṗtu (x)
t p 0 ( x) 0
Z Z 1 Z 1
(a) ṗtu (x)ṗtu2 (x)
= μ(dx) du1 du2 1
0 0 p 0 ( x)
Note that by the continuity assumption in (b1 ) we have ṗtu1 (x)ṗtu2 (x) → ṗ20 (x) for every (u1 , u2 , x)
ṗ (x)ṗ ( x) 2
as t → 0. Furthermore, we also have tu1 p0 (xtu)2 ≤ sup0≤t<τ (ṗpt0((xx00))) , which is integrable
by (7.64). Consequently, application of the dominated convergence theorem to the integral in
(a) concludes the proof of (7.68).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-142


i i

142

We next show that for any f-divergence with twice continuously differentiable f (and in fact,
without assuming (7.64)) we have:

1 f′′ (1)
lim inf 2
Df (Pt kP0 ) ≥ JF ( 0 ) . (7.70)
t→0 t 2
Indeed, similar to (7.61) we get
Z "   2 #
1
p ( X) − p ( X ) p ( X) − p ( X )
dz(1 − z) EX∼P0 f′′ 1 + z
t 0 t 0
Df (Pt kP0 ) = . (7.71)
0 p0 ( X ) p0 (X)

pt (X)−p0 (X) a.s. ṗ0 (X)


Dividing by t2 notice that from (b0 ) we have tp0 (X) −−→ p0 (X) and thus
  2  2
pt (X) − p0 (X) pt (X) − p0 (X) ṗ0 (X)
f′′ 1 + z → f′′ (1) .
p0 ( X ) tp0 (X) p0

Thus, applying Fatou’s lemma we recover (7.70).


Next, plugging f(x) = x log x in (7.71) we obtain for the KL divergence
Z 1 "  2 #
1 1−z pt (X) − p0 (X)
D(Pt kP0 ) = (log e) dz EX∼P0 . (7.72)
t2 0 1 + z pt (X)−p0 (X) tp0 (X)
p0 (X)

 2
The first fraction inside the bracket is between 0 and 1 and the second by sup0<t<τ pṗ0t ((XX)) , which
is P0 -integrable by (b1 ). Thus, dominated convergence theorem applies to the double integral
in (7.71) and we obtain
Z 1 "  2 #
1 ṗ0 (X)
lim 2 D(Pt kP0 ) = (log e) dz EX∼P0 (1 − z) ,
t→0 t 0 p0 ( X )

completing the proof of (7.69).

Remark 7.13 Theorem 7.22 extends to the case of multi-dimensional parameters as follows.
Define the Fisher information matrix at θ ∈ Rd :
Z p p ⊤
JF (θ) ≜ μ(dx)∇θ pθ (x)∇θ pθ (x) (7.73)

Then (7.68) becomes χ2 (Pt kP0 ) = t⊤ JF (0)t + o(ktk2 ) as t → 0 and similarly for (7.69), which
has previously appeared in (2.34).

Theorem 7.22 applies to many cases (e.g. to smooth subfamilies of exponential families, for
which one can take μ = P0 and p0 (x) ≡ 1), but it is not sufficiently general. To demonstrate the
issue, consider the following example.

Example 7.1 (Location families with compact support) We say that family Pt is a
(scalar) location family if X = R, μ = Leb and pt (x) = p0 (x − t). Consider the following

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-143


i i

7.11 f-divergences in parametric families: Fisher information 143

example, for α > −1:




 α
x ∈ [ 0, 1] ,
x ,
p0 (x) = Cα × (2 − x)α , x ∈ [ 1, 2] , ,


 0, otherwise

with Cα chosen from normalization. Clearly, here condition (7.64) is not satisfied and both
χ2 (Pt kP0 ) and D(Pt kP0 ) are infinite for t > 0, since Pt 6 P0 . But JF (0) < ∞ whenever α > 1
and thus one expects that a certain remedy should be possible. Indeed, one can compute those
f-divergences that are finite for Pt 6 P0 and find that for α > 1 they are quadratic in t. As an
illustration, we have


 1+α
0≤α<1
Θ(t ),
2 2 1
H (Pt , P0 ) = Θ(t log t ), α = 1 (7.74)


Θ(t2 ), α>1

as t → 0. This can be computed directly, or from a more general results of [222, Theorem VI.1.1].5
For a relation between Hellinger and Fisher information see also (VI.5).
The previous example suggests that quadratic behavior as t → 0 can hold even when Pt 6 P0 ,
which is the case handled by the next (more technical) result, whose proof we placed in Sec-
tion 7.14*). One can verify that condition (c1 ) is indeed satisfied for all α > 1 in Example 7.1,
thus establishing the quadratic behavior. Also note that the stronger (7.66) only applies to α ≥ 2.

Theorem 7.23 Given a family of distributions {Pt : t ∈ [0, τ )} satisfying the conditions (a),
(c0 ) and (c1 ) of Definition 7.21, we have
 
1 − 4ϵ #
χ (Pt kϵ̄P0 + ϵPt ) = t ϵ̄ JF (0) +
2 2 2
J (0) + o(t2 ) , ∀ϵ ∈ (0, 1) (7.75)
ϵ
t2
H2 (Pt , P0 ) = JF ( 0 ) + o ( t 2 ) , (7.76)
4
R R
where JF (0) = 4 ḣ02 dμ < ∞ is the Fisher information and J# (0) = ḣ20 1 {h0 = 0}dμ can be
called the Fisher defect at t = 0.

Example 7.2 (On Fisher defect) Note that in most cases of interest we will have the situa-
tion that t 7→ ht (x) is actually differentiable for all t in some two-sided neighborhood (−τ, τ ) of
0. In such cases, h0 (x) = 0 implies that t = 0 is a local minima and thus ḣ0 (x) = 0, implying that

5
Statistical significance of this calculation is that if we were to estimate the location parameter t from n iid observations,
then precision δn∗ of the optimal estimator up to constant factors is given by solving H2 (Pδn∗ , P0 )  n1 , cf. [222, Chapter
1
− 1+α
VI]. For α < 1 we have δn∗  n which is notably better than the empirical mean estimator (attaining precision of
− 12
only n ). For α = 1/2 this fact was noted by D. Bernoulli in 1777 as a consequence of his (newly proposed) maximum
likelihood estimation.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-144


i i

144

the defect J#
F = 0. However, for other families this will not be so, sometimes even when pt (x) is
smooth on t ∈ (−τ, τ ) (but not ht ). Here is such an example.
Consider Pt = Ber(t2 ). A straightforward calculation shows:
ϵ̄2 p
χ2 (Pt kϵ̄P0 + ϵPt ) = t2 + O(t4 ), H2 (Pt , P0 ) = 2(1 − 1 − t2 ) = t2 + O( t4 ) .
ϵ
Taking μ({0}) = μ({1}) = 1 to be the counting measure, we get the following

(√ 
 √−t , x = 0
1−t , x=0
2  1−t2
h t ( x) = , ḣt (x) = sign(t), x = 1, t 6= 0 .
|t|, x=1 


1, x = 1, t = 0 (just as an agreement)

Note that if we view Pt as a family on t ∈ [0, τ ) for small τ , then all conditions (a), (c0 ) and
(c1 ) are clearly satisfied (ḣt is bounded on t ∈ (−τ, τ )). We have JF (0) = 4 and J# (0) = 1 and
thus (7.75) recovers the correct expansion for χ2 and (7.76) for H2 .
Notice that the non-smoothness of ht only becomes visible if we extend the domain to t ∈
(−τ, τ ). In fact, this issue is not seen in terms of densities pt . Indeed, let us compute the density pt
and its derivative ṗt explicitly too:
( (
1 − t2 , x = 0 −2t, x = 0
pt (x) = , ṗt (x) = .
2
t, x=1 2t, x=1

Clearly, pt is continuously differentiable on t ∈ (−τ, τ ). Furthermore, the following expectation


(typically equal to JF (t) in (7.67))
" 2 # (
ṗt (X) 0, t=0
EX∼Pt = 2
pt (X) 4 + 2 , t 6= 0
4t
1−t

is discontinuous at t = 0. To make things worse, at t = 0 this expectation does not match our
definition of the Fisher information JF (0) in Theorem 7.23, and thus does not yield the correct
small-t behavior for either χ2 or H2 . In general, to avoid difficulties one should restrict to those
families with t 7→ ht (x) continuously differentiable in t ∈ (−τ, τ ).

7.12 Rényi divergences and tensorization


The following family of divergence measures introduced by Rényi is key in many applications
involving product measures. Although these measures are not f-divergences, they are obtained as
monotone transformation of an appropriate f-divergence and thus satisfy DPI and other properties
of f-divergences. Later, Rényi divergence will feature prominently in characterizing the optimal
error exponents in hypothesis testing (see Section 16.1 and especially Remark 16.1), in approxi-
mating of channel output statistic (see Section 25.4*), and in nonasymptotic bounds for composite
hypothesis testing (see Section 32.2.1).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-145


i i

7.12 Rényi divergences and tensorization 145

Definition 7.24 For any λ ∈ R \ {0, 1}, the Rényi divergence of order λ between probability
distributions P and Q is defined as
" λ #
1 dP
Dλ (PkQ) ≜ log EQ ,
λ−1 dQ

where EQ [( dQ ) ] is formally understood as a sign(λ−1)Df (PkQ)+1 with f(x) = sign(λ−1)(xλ −1)


dP λ

– see Definition 7.1. Extending Definition 2.14 of conditional KL divergence and assuming the
same setup, the conditional Rényi divergence is defined as

Dλ (PX|Y ||QX|Y |PY ) ≜ Dλ (PY × PX|Y ||PY × QX|Y )


Z
1
= log EY∼PY (dPX|Y (x))λ (dQX|Y (a))1−λ .
λ−1 X

Numerous properties of Rényi divergences are known, see [432]. Here we only notice a few:

• Special cases of λ = 12 , 1, 2: Under mild regularity conditions limλ→1 Dλ (PkQ) = D(PkQ).


2
On the other hand, D2 = log(1 + χ2 ) and D 21 = −2 log(1 − H2 ) are monotone transformation
of the χ2 -divergence (7.4) and the Hellinger distance (7.5), respectively.
• For all λ ∈ R the map λ 7→ Dλ (PkQ) is non-decreasing and the map λ 7→ (1 − λ)Dλ (PkQ) is
concave.
• For λ ∈ [0, 1] the map (P, Q) 7→ Dλ (PkQ) is convex.
• For λ ≥ 0 the map Q 7→ Dλ (PkQ) is convex.
• For Q uniform on a finite alphabet of size m, Dλ (PkQ) = log m − Hλ (P), where Hλ is the Rényi
entropy of order λ defined in (1.4). This recovers Theorem 2.2 as the special case of λ = 1.
• There is a version of the chain rule:
(λ)
Dλ (PA,B ||QA,B ) = Dλ (PB ||QB ) + Dλ (PA|B ||QA|B |PB ) , (7.77)
(λ)
where PB is the λ-tilting of PB towards QB given by

PB (b) ≜ PλB (b)Q1B−λ (b) exp{−(λ − 1)Dλ (PB ||QB )} .


(λ)
(7.78)

• The key property is additivity under products, or tensorization:


!
Y Y X
Dλ PXi QXi = Dλ (PXi kQXi ) , (7.79)
i i i

which is a simple consequence of (7.77). Dλ ’s are the only divergences satisfying DPI and
tensorization [310]. The most well-known special cases of (7.79) are for Hellinger distance
(see (7.26)) and for χ2 :
!
Yn Yn Yn

1+χ 2
Pi Qi = 1 + χ2 (Pi kQi ) .
i=1 i=1 i=1

We can also obtain additive bounds for non-product distributions, see Ex. I.42 and I.43.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-146


i i

146

The following consequence of the chain rule will be crucial in statistical applications later (see
Section 32.2, in particular, Theorem 32.8).

Q Q
Proposition 7.25 Consider product channels PYn |Xn = PYi |Xi and QYn |Xn = QYi |Xi . We
have (with all optimizations over all possible distributions)

X
n
inf Dλ (PYn kQYn ) = inf Dλ (PYi kQYi ), (7.80)
PXn ,QXn PXi ,QXi
i=1
Xn X
n
sup Dλ (P kQ ) =
Yn Yn sup Dλ (PYi kQYi ) = sup Dλ (PYi |Xi =x kQYi |Xi =x′ ). (7.81)
PXn ,QXn ′
i=1 PXi ,QXi i=1 x,x

In particular, for any collections of distributions {Pθ : θ ∈ Θ} and {Qθ : θ ∈ Θ}:

inf Dλ (PkQ) ≥ n inf Dλ (PkQ), (7.82)


P∈co{P⊗ n ⊗n
θ },Q∈co{Qθ }
P∈co{Pθ },Q∈co{Qθ }

sup Dλ (PkQ) ≤ n sup Dλ (PkQ). (7.83)


P∈co{P⊗ n ⊗n
θ },Q∈co{Qθ }
P∈co{Pθ },Q∈co{Qθ }

Remark 7.14 The mnemonic for (7.82)-(7.83) is that “mixtures of products are less distin-
guishable than products of mixtures”. The former arise in statistical settings where iid observations
are drawn a single distribution whose parameter is drawn from a prior.

Proof. The second equality in (7.81) follows from the fact that Dλ is an increasing function
of an f-divergence, and thus maximization should be attained at an extreme point of the space
of probabilities, which are just the single-point masses. The main equalities (7.80)-(7.81) follow
from a) restricting optimizations to product distributions and invoking (7.79); and b) the chain rule
for Dλ . For example for n = 2, we fix PX2 and QX2 , which (via channels) induce joint distributions
PX2 ,Y2 and QX2 ,Y2 . Then we have

Dλ (PY1 |Y2 =y kQY1 |Y2 =y′ ) ≥ inf Dλ (P̃Y1 kQ̃Y1 ) ,


P̃X1 ,Q̃X1

since PY1 |Y2 =y is a distribution induced by taking P̃X1 = PX1 |Y2 =y , and similarly for QY1 |Y2 =y′ . In
all, we get

(λ)
X
2
Dλ (PY2 kQY2 ) = Dλ (PY2 kQY2 ) + Dλ (PY1 |Y2 kQY1 |Y2 |PY2 ) ≥ inf Dλ (PYi kQYi ) ,
PXi ,QXi
i=1

as claimed. The case of sup is handled similarly.


From (7.80)-(7.81), we get (7.82)-(7.83) by taking X = Θ and specializing the inf and sup to
diagonal distributions PXn and QXn , i.e., those with the property that P[X1 = · · · = Xn ] = 1 and
Q[X1 = · · · = Xn ] = 1).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-147


i i

7.13 Variational representation of f-divergences 147

7.13 Variational representation of f-divergences


In Theorem 4.6 we had a very useful variational representation of KL-divergence due to Donsker
and Varadhan. In this section we show how to derive such representations for other f-divergences
in a principled way. The proofs are slightly technical and given in Section 7.14* at the end of this
chapter.
Let f : (0, +∞) → R be a convex function. The convex conjugate f∗ : R → R ∪ {+∞} of f is
defined by:

f∗ (y) = sup xy − f(x) , y ∈ R. (7.84)


x∈ R +

Denote the domain of f∗ by dom(f∗ ) ≜ {y : f∗ (y) < ∞}. Two important properties of the convex
conjugates are

1 f∗ is also convex (which holds regardless of f being convex or not);


2 Biconjugation: (f∗ )∗ = f, which means

f(x) = sup xy − f∗ (y)


y

and implies the following (for all x > 0 and y)

f(x) + f∗ (y) ≥ xy .

Similarly, we can define a convex conjugate for any convex functional Ψ(P) defined on the
space of measures, by setting
Z
Ψ∗ (g) = sup gdP − Ψ(P) . (7.85)
P

Under appropriate conditions (e.g. finite X ), biconjugation then yields the sought-after variational
representation
Z
Ψ(P) = sup gdP − Ψ∗ (g) . (7.86)
g

Next we will now compute these conjugates for Ψ(P) = Df (PkQ). It turns out to be convenient
to first extend the definition of Df (PkQ) to all finite signed measures P then compute the conjugate.
To this end, let fext : R → R ∪ {+∞} be an extension of f, such that fext (x) = f(x) for x ≥ 0 and
fext is convex on R. In general, we can always choose fext (x) = ∞ for all x < 0. In special cases
e.g. f(x) = |x − 1|/2 or f(x) = (x − 1)2 we can directly take fext (x) = f(x) for all x. Now we can
define Df (PkQ) for all signed measure measures P in the same way as in Definition 7.1 using fext
in place of f.
For each choice of fext we have a variational representation of f-divergence:

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-148


i i

148

Theorem 7.26 Let P and Q be probability measures on X . Fix an extension fext of f and let f∗ext
is the conjugate of fext , i.e., f∗ext (y) = supx∈R xy − fext (x). Denote dom(f∗ext ) ≜ {y : f∗ext (y) < ∞}.
Then

Df (PkQ) = sup EP [g(X)] − EQ [f∗ext (g(X))]. (7.87)


g:X →dom(f∗
ext )

where the supremum can be taken over either (a) all simple g or (b) over all g satisfying
EQ [f∗ext (g(X))] < ∞.

We remark that when P  Q then both results (a) and (b) also hold for supremum over g :
X → R, i.e. without restricting g(x) ∈ dom(f∗ext ).
As a consequence of the variational characterization, we get the following properties for f-
divergences:

1 Convexity: First of all, note that Df (PkQ) is expressed as a supremum of affine functions (since
the expectation is a linear operation). As a result, we get that (P, Q) 7→ Df (PkQ) is convex,
which was proved previously in Theorem 7.5 using different method.
2 Weak lower semicontinuity: Recall the example in Remark 4.5, where {Xi } are i.i.d. Rademach-
ers (±1), and
Pn
i=1 Xi d
√ →N (0, 1)

n
by the central limit theorem; however, by Proposition 7.2, for all n,
 
PX1 +X2 +...+Xn
Df √ N (0, 1) = f(0) + f′ (∞) > 0,
n

since the former distribution is discrete and the latter is continuous. Therefore similar to the
KL divergence, the best we can hope for f-divergence is semicontinuity. Indeed, if X is a nice
space (e.g., Euclidean space), in (7.87) we can restrict the function g to continuous bounded
functions, in which case Df (PkQ) is expressed as a supremum of weakly continuous functionals
(note that f∗ ◦ g is also continuous and bounded since f∗ is continuous) and is hence weakly
w
lower semicontinuous, i.e., for any sequence of distributions Pn and Qn such that Pn −→ P and
w
Qn −→ Q, we have

lim inf Df (Pn kQn ) ≥ Df (PkQ).


n→∞

3 Relation to DPI: As discussed in (4.15) variational representations can be thought of as


extensions of the DPI. As an exercise, one should try to derive the estimate
p
|P[A] − Q[A]| ≤ Q[A] · χ2 (PkQ)

via both the DPI and (7.91).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-149


i i

7.13 Variational representation of f-divergences 149

Example 7.3 (Total variation and Hellinger) For total variation, we have f(x) = 12 |x − 1|.
Consider the extension fext (x) = 21 |x − 1| for x ∈ R. Then
  
∗ 1 +∞ if |y| > 1
fext (y) = sup xy − |x − 1| = 2 .
x 2 y if |y| ≤ 1
2

Thus (7.87) gives


TV(P, Q) = sup EP [g(X)] − EQ [g(X)], (7.88)
g:|g|≤ 12

which previously appeared in (7.18). A calculation for squared Hellinger yields f∗ext (y) = y
1−y with
y ∈ (−∞, 1) and, thus, after changing from g to h = 1 − g in (7.87), we obtain
1
H2 (P, Q) = 2 − inf EP [h] + EQ [ ] .
h>0 h
As an application, consider f : X → [0, 1] and τ ∈ (0, 1), so that h = 1 − τ f satisfies 1
h ≤ 1 + 1−τ
τ
f.
Then the previous characterization implies
1 1
EP [f] ≤ EQ [f] + H2 (P, Q) ∀f : X → [0, 1], ∀τ ∈ (0, 1) .
1−τ τ

Example 7.4 (χ2 -divergence) For χ2 -divergence we have f(x) = (x − 1)2 . Take fext (x) =
y2
(x − 1)2 , whose conjugate is f∗ext (y) = y + 4.
Applying (7.87) yields
" #
2
g ( X )
χ2 (PkQ) = sup EP [g(X)] − EQ g(X) + (7.89)
g:X →R 4
= sup 2EP [g(X)] − EQ [g2 (X)] − 1 (7.90)
g:X →R

where the last step follows from a change of variable (g ← 12 g − 1).


To get another equivalent, but much more memorable representation, we notice that (7.90) it is
not scale-invariant. To make it so, setting g = λh and optimizing over the λ ∈ R first we get
(EP [h(X)] − EQ [h(X)])2
χ2 (PkQ) = sup . (7.91)
h:X →R VarQ (h(X))
The statistical interpretation of (7.91) is as follows: if a test statistic h(X) is such that the separation
between its expectation under P and Q far exceeds its standard deviation, then this suggests the two
hypothesis can be distinguished reliably. The representation (7.91) will turn out useful in statistical
applications in Chapter 29 for deriving the Hammersley-Chapman-Robbins (HCR) lower bound
as well as its Bayesian version, see Section 29.1.2, and ultimately the Cramér-Rao and van Trees
lower bounds.
Example 7.5 (Jensen-Shannon divergence and GANs) For the Jensen-Shannon diver-

gence (7.8) we have f(x) = x log 12x 2
+x + log 1+x . Computing the conjugate we obtain f (s) =
− log(2 − exp(s)) with domain s ∈ (−∞, log 2). We obtain from (7.87) the characterization
JS(P, Q) = sup EP [g] + EQ [log(2 − exp{g(X)})] ,
g:X →(−∞,log 2)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-150


i i

150

or after reparametrizing h = exp{g}/2 we get

JS(P, Q) = sup log 2 + EP [log(h)] + EQ [log(1 − h)] .


h:X →(0,1)

This characterization is the basis of an influential modern method of density estimation, known
as generative adversarial networks (GANs) [193]. Here is its essence. Suppose that we are trying to
approximate a very complicated distribution P on Rd by representing it as (the law of) a generator
map G : Rm → Rd applied to a standard normal Z ∼ N (0, Im ). The idea of [193] is to search for
a good G by minimizing JS(P, PG(Z) ). Due to the variational characterization we can equivalently
formulate this problem as

inf sup EX∼P [log h(X)] + EZ∼N [log(1 − h(G(Z))]


G h

(and in this context the test function h is called a discriminator or, less often, a critic). Since the
distribution P is only available to us through a sample of iid observations x1 , . . . , xn ∼ P, we
approximate this minimax problem by

1X
n
inf sup log h(xi ) + EZ∼N [log(1 − h(G(Z))] .
G h n
i=1

In order to be able to solve this problem another idea of [193] is to approximate the intractable
optimizations over the infinite-dimensional function spaces of G and h by an optimization over
neural networks. This is implemented via alternating gradient ascent/descent steps over the
(finite-dimensional) parameter spaces defining the neural networks of G and h. Following the
breakthrough of [193] variations on their idea resulted in finding G(Z)’s that yielded incredibly
realistic images, music, videos, 3D scenery and more.
Example 7.6 (KL-divergence) In this case we have f(x) = x log x. Consider the extension
fext (x) = ∞ for x < 0, whose convex conjugate is f∗ (y) = log e
e exp(y). Hence (7.87) yields

D(PkQ) = sup EP [g(X)] − (EQ [exp{g(X)}] − 1)log e (7.92)


g:X →R

Note that in the last example, the variational representation (7.92) we obtained for the KL
divergence is not the same as the Donsker-Varadhan identity in Theorem 4.6, that is,

D(PkQ) = sup EP [g(X)] − log EQ [exp{g(X)}] . (7.93)


g:X →R

In fact, (7.92) is weaker than (7.93) in the sense that for each choice of g, the obtained lower bound
on D(PkQ) in the RHS is smaller. Furthermore, regardless of the choice of fext , the Donsker-
Varadhan representation can never be obtained from Theorem 7.26 because, unlike (7.93), the
second term in (7.87) is always linear in Q. It turns out if we define Df (PkQ) = ∞ for all non-
probability measure P, and compute its convex conjugate, we obtain in the next theorem a different
type of variational representation, which, specialized to KL divergence in Example 7.6, recovers
exactly the Donsker-Varadhan identity.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-151


i i

7.13 Variational representation of f-divergences 151

Theorem 7.27 Consider the extension fext of f such that fext (x) = ∞ for x < 0. Let S = {x :
q(x) > 0} where q is as in (7.2). Then
Df (PkQ) = f′ (∞)P[Sc ] + sup EP [g1S ] − Ψ∗Q,P (g) , (7.94)
g

where
Ψ∗Q,P (g) ≜ inf EQ [f∗ext (g(X) − a)] + aP[S].
a∈R

In the special case f (∞) = ∞, we have
Df (PkQ) = sup EP [g] − Ψ∗Q (g), Ψ∗Q (g) ≜ inf EQ [f∗ext (g(X) − a)] + a. (7.95)
g a∈R

Remark 7.15 (Marton’s divergence) Recall that in Theorem 7.7 we have shown both the
sup and inf characterizations for the TV. Do other f-divergences also possess inf characterizations?
The only other known example (to us) is due to Marton. Let
Z  2
dP
Dm (PkQ) = dQ 1 − ,
dQ +

which is clearly an f-divergence with f(x) = (1 − x)2+ . We have the following [69, Lemma 8.3]:
Dm (PkQ) = inf{E[P[X 6= Y|Y]2 ] : X ∼ P, Y ∼ Q} ,
where the infimum is over all couplings of P and Q. See Ex. I.44.
Marton’s Dm divergence plays a crucial role in the theory of concentration of measure [69,
Chapter 8]. Note also that while Theorem 7.20 does not apply to Dm , due to the absence of twice
continuous differentiability, it does apply to the symmetrized Marton divergence Dsm (PkQ) ≜
Dm (PkQ) + Dm (QkP).
We end this section by describing some properties of Fisher information akin to those of f-
divergences. In view of its role in the local expansion, we expect the Fisher information to inherit
these properties such as monotonicity, data processing inequality, and the variational representa-
tion. Indeed the first two can be established directly; see Exercise I.46. In [220] Huber introduced
the following variational extension of the Fisher information (in the location family) (2.40) of a
density on R: for any P ∈ P(R), define
EP [h′ (X)]2
J(P) = sup (7.96)
h EP [h(X)2 ]
where the supremum is over all test functions h ∈ C1c that are continuously differentiable and
compactly supported such that EP [h(X)2 ] > 0. Huber showed that J(P) < ∞ if and only if P
R
has an absolutely continuous density p such that (p′ )2 /p < ∞, in which case (7.96) agrees
with the usual definition (2.40).6 This sup-representation can be anticipated by combining the

6
As an example in the reverse direction, J(Unif(0, 1)) = ∞ which follows from choosing test functions such as
h(x) = cos2 xπ
ϵ
1 {|x| ≤ ϵ/2} and ϵ → 0.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-152


i i

152

variational representation (7.91) of χ2 -divergence and its local expansion (7.68) that involves the
Fisher information. Indeed, setting aside regularity conditions, by Taylor expansion we have
(E[h(X + t) − h(X)])2 E[h′ (X)]2 2
χ2 (Pt kP) = sup = sup · t + o(t2 ),
E[h2 (X)] E[h2 (X)]
which is also χ2 (Pt kP) = J(P)t2 + o(t2 ). A direct proof can be given applying integration by parts
R R R R
and Cauchy-Schwarz: ( ph′ )2 = ( p′ h)2 ≤ h2 p (p′ )2 /p, which also shows the optimal test
function is given by the score function h = p′ /p; for details, see [220, Theorem 4.2].

7.14* Technical proofs: convexity, local expansions and variational


representations
In this section we collect proofs of some technical theorems from this chapter.

Proof of Theorem 7.23. By definition we have


Z Z
1 1 (pt (x) − p0 (x))2 1
L(t) ≜ 2 2 χ2 (Pt kϵ̄P0 + ϵPt ) = 2 μ(dx) = 2 μ(dx)g(t, x)2 , (7.97)
ϵ̄ t t X ϵ̄p0 (x) + ϵpt (x) t
where
p t ( x) − p 0 ( x) h 2 − p 0 ( x)
g ( t , x) ≜ p = ϕ(ht (x); x) , ϕ(h; x) ≜ p .
ϵ̄p0 (x) + ϵpt (x) ϵ̄p0 (x) + ϵh2
p
By (c0 ) the function t 7→ ht (x) ≜ pt (x) is absolutely continuous (for μ-a.e. x). Below we
2−ϵ√
will show that kϕ(·; x)kLip = suph≥0 |ϕ′ (h; x)| ≤ (1−ϵ) ϵ
. This implies that t 7→ g(t, x) is also
absolutely continuous and hence differentiable almost everywhere. Consequently, we have
Z 1
g(t, x) = t duġ(tu, x), ġ(t, x) ≜ ϕ′ (ht (x); x)ḣt (x) ,
0

Since ϕ′ (·; x) is continuous with


(

2, x : h 0 ( x) > 0 ,
ϕ (h0 (x); x) = (7.98)
√1 , x : h 0 ( x) = 0
ϵ

(we verify these facts below too), we conclude that


 
1
lim ġ(s, x) = ġ(0, x) = ḣ0 (x) 2 · 1{h0 (x) > 0} + √ 1{h0 (x) = 0} , (7.99)
s→ 0 ϵ

where we also used continuity ḣt (x) → ḣ0 (x) by assumption (c0 ).
Substituting the integral expression for g(t, x) into (7.97) we obtain
Z Z 1 Z 1
L(t) = μ(dx) du1 du2 ġ(tu1 , x)ġ(tu2 , x) . (7.100)
0 0

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-153


i i

7.14* Technical proofs: convexity, local expansions and variational representations 153

Since |ġ(s, x)| ≤ C|hs (x)| for some C = C(ϵ), we have from Cauchy-Schwarz
Z Z
μ(dx)|ġ(s1 , x)ġ(s2 , x)| ≤ C2 sup μ(dx)ḣt (x)2 < ∞ . (7.101)
t X

where the last inequality follows from the uniform integrability assumption (c1 ). This implies that
Fubini’s theorem applies in (7.100) and we obtain
Z 1 Z 1 Z
L(t) = du1 du2 G(tu1 , tu2 ) , G(s1 , s2 ) ≜ μ(dx)ġ(s1 , x)ġ(s2 , x) .
0 0

Notice that if a family of functions {fα (x) : α ∈ I} is uniformly square-integrable, then the family
{fα (x)fβ (x) : α ∈ I, β ∈ I} is uniformly integrable simply because apply |fα fβ | ≤ 12 (f2α + f2β ).
Consequently, from the assumption (c1 ) we see that the integral defining G(s1 , s2 ) allows passing
the limit over s1 , s2 inside the integral. From (7.99) we get as t → 0
Z  
1 1 − 4ϵ #
G(tu1 , tu2 ) → G(0, 0) = μ(dx)ḣ0 (x) 4 · 1{h0 > 0} + 1{h0 = 0} = JF (0)+
2
J ( 0) .
ϵ ϵ
From (7.101) we see that G(s1 , s2 ) is bounded and thus, the bounded convergence theorem applies
and
Z 1 Z 1
lim du1 du2 G(tu1 , tu2 ) = G(0, 0) ,
t→0 0 0

which thus concludes the proof of L(t) → JF (0) and of (7.75) assuming facts about ϕ. Let us
verify those.
For simplicity, in the next paragraph we omit the argument x in h0 (x) and ϕ(·; x). A straightfor-
ward differentiation yields
h20 (1 − 2ϵ ) + 2ϵ h2
ϕ′ (h) = 2h .
(ϵ̄h20 + ϵh2 )3/2
h20 (1− ϵ2 )+ ϵ2 h2 1−ϵ/2
Since √ h
≤ √1
ϵ
and ϵ̄h20 +ϵh2
≤ 1−ϵ we obtain the finiteness of ϕ′ . For the continuity
ϵ̄h20 +ϵh2
of ϕ′ notice that if h0 > 0 then clearly the function is continuous, whereas for h0 = 0 we have
ϕ′ (h) = √1ϵ for all h.
We next proceed to the Hellinger distance. Just like in the argument above, we define
Z Z 1 Z 1
1
M(t) ≜ 2 H2 (Pt , P0 ) = μ(dx) du1 du2 ḣtu1 (x)ḣtu2 (x) .
t 0 0
R
Exactly as above from Cauchy-Schwarz and supt μ(dx)ḣt (x)2 < ∞ we conclude that Fubini
applies and hence
Z 1 Z 1 Z
M(t) = du1 du2 H(tu1 , tu2 ) , H(s1 , s2 ) ≜ μ(dx)ḣs1 (x)ḣs2 (x) .
0 0

Again, the family {ḣs1 ḣs2 : s1 ∈ [0, τ ), s2 ∈ [0, τ } is uniformly integrable and thus from (c0 ) we
conclude H(tu1 , tu2 ) → 14 JF (0). Furthermore, similar to (7.101) we see that H(s1 , s2 ) is bounded

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-154


i i

154

and thus
Z 1 Z 1
1
lim M(t) = du1 du2 lim H(tu1 , tu2 ) = JF ( 0 ) ,
t→0 0 0 t→0 4
concluding the proof of (7.76).

Proceeding to variational representations, we prove the counterpart of Gelfand-Yaglom-


Perez Theorem 4.5, cf. [185].

Proof of Theorem 7.6. The lower bound Df (PkQ) ≥ Df (PE kQE ) follows from the DPI. To prove
an upper bound, first we reduce to the case of f ≥ 0 by property 6 in Proposition 7.2. Then define
sets S = suppQ, F∞ = { dQdP
= 0} and for a fixed ϵ > 0 let
   
dP
Fm = ϵm ≤ f < ϵ(m + 1) , m = 0, 1, . . . .
dQ
We have
X Z   X
dP
ϵ mQ[Fm ] ≤ dQf ≤ϵ (m + 1)Q[Fm ] + f(0)Q[F∞ ]
m S dQ m
X
≤ϵ mQ[Fm ] + f(0)Q[F∞ ] + ϵ . (7.102)
m

m = {x > 1 : ϵm ≤ f(x) < ϵ(m + 1)} the function f is increasing and


Notice that on the interval I+

on Im = {x ≤ 1 : ϵm ≤ f(x) < ϵ(m + 1)} it is decreasing. Thus partition further every Fm into
− −
m = { dQ ∈ Im } and Fm = { dQ ∈ Im }. Then, we see that
dP dP
F+ +

 
P[F±
m]
f ≥ ϵm .
Q[ F ±
m]
− −
0 , F0 , . . . , Fn , Fn , F∞ , S , ∪m>n Fm }. For this
Next, define the partition consisting of sets E = {F+ + c

partition we have, by the previous display:


X
D(PE kQE ) ≥ ϵ mQ[Fm ] + f(0)Q[F∞ ] + f′ (∞)P[Sc ] . (7.103)
m≤n

We next show that with sufficiently large n and sufficiently small ϵ the RHS of (7.103)
approaches Df (PkQ). If f(0)Q[F∞ ] = ∞ (and hence Df (PkQ) = ∞) then clearly (7.103) is also
infinite. Thus,
 assume
 that f(0)Q[F∞ ] < ∞.
R
If S dQf dQ = ∞ then the sum over m on the RHS of (7.102) is also infinite, and hence
dP
P
for any N > 0 there exists some n such that m≤n mQ[Fm ] ≥ N, thus showing that RHS
R  
for (7.103) can be made arbitrarily large. Thus, assume S dQf dQdP
< ∞. Considering LHS
P
of (7.102) we conclude that for some large n we have m>n mQ[Fm ] ≤ 12 . Then, we must have
again from (7.102)
X Z  
dP 3
ϵ mQ[Fm ] + f(0)Q[F∞ ] ≥ dQf − ϵ.
S dQ 2
m≤n

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-155


i i

7.14* Technical proofs: convexity, local expansions and variational representations 155

Thus, we have shown that for arbitrary ϵ > 0 the RHS of (7.103) can be made greater than
Df (PkQ) − 32 ϵ.
Proof of Theorem 7.26. First, we show that for any g : X → dom(f∗ext ) we must have
EP [g(X)] ≤ Df (PkQ) + EQ [f∗ext (g(X))] . (7.104)
Let p(·) and q(·) be the densities of P and Q. Then, from the definition of f∗ext we have for every x
s.t. q(x) > 0:
p ( x) p ( x)
f∗ext (g(x)) + fext ( ) ≥ g ( x) .
q ( x) q ( x)
Integrating this over dQ = q dμ restricted to the set {q > 0} we get
Z
p ( x)
EQ [f∗ext (g(X))] + q(x)fext ( ) dμ ≥ EP [g(X)1{q(X) > 0}] . (7.105)
q>0 q ( x)
Now, notice that
fext (x)
sup{y : y ∈ dom(f∗ext )} = lim = f′ (∞) (7.106)
x→∞ x
Therefore, f′ (∞)P[q(X) = 0] ≥ EP [g(X)1{q(X) = 0}]. Summing the latter inequality with (7.105)
we obtain (7.104).
Next we prove that supremum in (7.87) over simple functions g does yield Df (PkQ), so that
inequality (7.104) is tight. Armed with Theorem 7.6, it suffices to show (7.87) for finite X . Indeed,
for general X , given a finite partition E = {E1 , . . . , En } of X , we say a function g : X → R is
E -compatible if g is constant on each Ei ∈ E . Taking the supremum over all finite partitions E we
get
Df (PkQ) = sup Df (PE kQE )
E

= sup sup EP [g(X)] − EQ [f∗ext (g(X))]


E g:X →dom(f∗
ext )
g E -compatible

= sup EP [g(X)] − EQ [f∗ext (g(X))],


g:X →dom(f∗
ext )
g simple

where the last step follows is because the two suprema combined is equivalent to the supremum
over all simple (finitely-valued) functions g.
Next, consider finite X . Let S = {x ∈ X : Q(x) > 0} denote the support of Q. We show the
following statement
Df (PkQ) = sup EP [g(X)] − EQ [f∗ext (g(X))] + f′ (∞)P(Sc ), (7.107)
g:S→dom(f∗
ext )

which is equivalent to (7.87) by (7.106). By definition,


X  
P(x)
Df (PkQ) = Q(x)fext +f′ (∞) · P(Sc ),
Q ( x)
x∈S
| {z }
≜Ψ(P)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-156


i i

156

Consider the functional Ψ(P) defined above where P takes values over all signed measures on S,
which can be identified with RS . The convex conjugate of Ψ(P) is as follows: for any g : S → R,
( )
X P ( x)
∗ ∗
Ψ (g) = sup P(x)g(x) − Q(x) sup h − fext (h)
P x h∈dom(f∗
ext )
Q ( x)
X
= sup inf ∗ P(x)(g(x) − h(x)) + Q(x)f∗ext (h(x))
P h:S→dom(fext ) x
( a) X
= inf sup P(x)(g(x) − h(x)) + EQ [f∗ext (h)]
h:S→dom(f∗
ext ) P
x
(
EQ [f∗ext (g(X))] g : S → dom(f∗ext )
= .
+∞ otherwise

where (a) follows from the minimax theorem (which applies due to finiteness of X ). Applying
the convex duality in (7.86) yields the proof of the desired (7.107).

Proof of Theorem 7.27. First we argue that the supremum in the right-hand side of (7.94) can
be taken over all simple functions g. Then thanks to Theorem 7.6, it will suffice to consider finite
alphabet X . To that end, fix any g. For any δ , there exists a such that EQ [f∗ext (g − a)] − aP[S] ≤
Ψ∗Q,P (g) + δ . Since EQ [f∗ext (g − an )] can be approximated arbitrarily well by simple functions we
conclude that there exists a simple function g̃ such that simultaneously EP [g̃1S ] ≥ EP [g1S ] − δ and

Ψ∗Q,P (g̃) ≤ EQ [f∗ext (g̃ − a)] − aP[S] + δ ≤ Ψ∗Q,P (g) + 2δ .

This implies that restricting to simple functions in the supremization in (7.94) does not change the
right-hand side.
Next consider finite X . We proceed to compute the conjugate of Ψ, where Ψ(P) ≜ Df (PkQ) if
P is a probability measure on X and +∞ otherwise. Then for any g : X → R, maximizing over
all probability measures P we have:
X
Ψ∗ (g) = sup P(x)g(x) − Df (PkQ)
P x∈X
X X X  
P(x)
= sup P(x)g(x) − P(x)g(x) − Q ( x) f
P x∈X Q ( x)
x∈Sc x∈ S
X X X
= sup inf P(x)[g(x) − h(x)] + P(x)[g(x) − f′ (∞)] + Q(x)f∗ext (h(x))
P h:S→R x∈S x∈Sc x∈S
( ! )
( a) X X
= inf sup P(x)[g(x) − h(x)] + P(x)[g(x) − f′ (∞)] + EQ [f∗ext (h(X))]
h:S→R P x∈ S x∈Sc
   
(b) ′ ∗
= inf max max g(x) − h(x), maxc g(x) − f (∞) + EQ [fext (h(X))]
h:S→R x∈ S x∈ S
   
( c) ′ ∗
= inf max a, maxc g(x) − f (∞) + EQ [fext (g(X) − a)]
a∈ R x∈ S

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-157


i i

7.14* Technical proofs: convexity, local expansions and variational representations 157

where (a) follows from the minimax theorem; (b) is due to P being a probability measure; (c)
follows since we can restrict to h(x) = g(x) − a for x ∈ S, thanks to the fact that f∗ext is non-
decreasing (since dom(fext ) = R+ ).
From convex duality we have shown that Df (PkQ) = supg EP [g] − Ψ∗ (g). Notice that without
loss of generality we may take g(x) = f′ (∞) + b for x ∈ Sc . Interchanging the optimization over
b with that over a we find that
sup bP[Sc ] − max(a, b) = −aP[S] ,
b

which then recovers (7.94). To get (7.95) simply notice that if P[Sc ] > 0, then both sides of (7.95)
are infinite (since Ψ∗Q (g) does not depend on the values of g outside of S). Otherwise, (7.95)
coincides with (7.94).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-158


i i

8 Entropy method in combinatorics and geometry

A commonly used method in combinatorics for bounding the number of certain objects from above
involves a smart application of Shannon entropy. This method typically proceeds as follows: in
order to count the cardinality of a given set C , we draw an element uniformly at random from
C , whose entropy is given by log |C|. To bound |C| from above, we describe this random object
by a random vector X = (X1 , . . . , Xn ) then proceed to compute or upper-bound the joint entropy
H(X1 , . . . , Xn ) via one of the following methods:
Pn
• Marginal bound: H(X1 , . . . , Xn ) ≤ i=1 H(Xi )
• Pairwise bound (Shearer’s lemma) and generalization cf. Theorem 1.8: H(X1 , . . . , Xn ) ≤
1
P
n−1 i<j H(Xi , Xj ) Pn
• Chain rule (exact calculation): H(X1 , . . . , Xn ) = i=1 H(Xi |X1 , . . . , Xi−1 )

We give three applications using the above three methods, respectively, in the order of increas-
ing difficulty: enumerating binary vectors of a given average weights, counting triangles and other
subgraphs, and Brégman’s theorem.
Finally, to demonstrate how entropy method can also be used for questions in Euclidean spaces,
we prove the Loomis-Whitney and Bollobás-Thomason theorems based on analogous properties
of differential entropy (Section 2.3).

8.1 Binary vectors of average weights


Lemma 8.1 (Massey [293]) Let C ⊂ {0, 1}n and let p be the average fraction of 1’s in C ,
i.e.
1 X wH (x)
p= ,
|C| n
x∈C

where wH (x) is the Hamming weight (number of 1’s) of x ∈ {0, 1}n . Then |C| ≤ exp{nh(p)}.

We emphasize that this result holds even if p > 1/2.


Proof. Let X = (X1 , . . . , Xn ) be drawn uniformly at random from C . Then
X
n X
n
log |C| = H(X) = H(X1 , . . . , Xn ) ≤ H(Xi ) = h(pi ),
i i=1

158

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-159


i i

8.2 Shearer’s lemma & counting subgraphs 159

where pi = P [Xi = 1] is the fraction of vertices whose i-th bit is 1. Note that

1X
n
p= pi ,
n
i=1

since we can either first average over vectors in C or first average across different bits. By Jensen’s
inequality and the fact that x 7→ h(x) is concave,
!
Xn
1X
n
h(pi ) ≤ nh pi = nh(p).
n
i=1 i=1

Hence we have shown that log |C| ≤ nh(p).

As a consequence we obtain the following bound on the volume of the Hamming ball, which
will be instrumental much later when we talk about metric entropy (Chapter 27).

Theorem 8.2
k  
X n
≤ exp{nh(k/n)}, k ≤ n/2.
j
j=0

Proof. We take C = {x ∈ {0, 1}n : wH (x) ≤ k} and invoke the previous lemma, which says that
k  
X n
= |C| ≤ exp{nh(p)} ≤ exp{nh(k/n)},
j
j=0

where the last inequality follows from the fact that x 7→ h(x) is increasing for x ≤ 1/2.

For extensions to non-binary alphabets see Exercise I.1 and I.2. Note that, Theorem 8.2 also
follows from the large deviations theory in Part III:
  
LHS k 1 RHS
= P ( Bin ( n , 1 / 2) ≤ k ) ≤ exp − nd k = exp{−n(log 2 − h(k/n))} = n ,
2n n 2 2
where the inequality is the Chernoff bound on the binomial tail (see (15.19) in Example 15.1).

8.2 Shearer’s lemma & counting subgraphs


Recall that a special case of Shearer’s lemma Theorem 1.8 (Han’s inequality) says:
1
H(X1 , X2 , X3 ) ≤ [H(X1 , X2 ) + H(X2 , X3 ) + H(X1 , X3 )].
2
A classical application of this result (see Remark 1.2) is to bound cardinality of a set in R3 given
cardinalities of its projections.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-160


i i

160

For graphs H and G, define N(H, G) to be the number of copies of H in G.1 For example,
N( , ) = 4, N( , ) = 8.
If we know G has m edges, what is the maximal number of H that are contained in G? To study
this quantity, we define
N(H, m) = max N(H, G).
G:|E(G)|≤m

As an example, we show that the maximal number of triangles satisfies


N(K3 , m)  m3/2 . (8.1)

To show that N(H, m) ≳ m3/2 , consider G = Kn which has m = |E(G)| = 2  n2 and n

N(K3 , Kn ) = n3  n3  m3/2 .
To show the upper bound, fix a graph G = (V, E) with m edges. Draw a labeled triangle
uniformly at random and denote the vertices by (X1 , X2 , X3 ). Then by Shearer’s Lemma,
1 3
log(3!N(K3 , G)) = H(X1 , X2 , X3 ) ≤ [H(X1 , X2 ) + H(X2 , X3 ) + H(X1 , X3 )] ≤ log(2m).
2 2
Hence

2 3/2
N(K3 , G) ≤ m . (8.2)
3
Remark 8.1 Interestingly, linear algebra argument yields exactly the same upper bound as
(8.2): Let A be the adjacency matrix of G with eigenvalues {λi }. Then
X
2|E(G)| = tr(A2 ) = λ2i
X
6N(K3 , G) = tr(A3 ) = λ3i

By Minkowski’s inequality, (6N(K3 , G))1/3 ≤ (2|E(G)|)1/2 which yields N(K3 , G) ≤ 2 3/2
3 m .
Using Shearer’s lemma (Theorem 1.8), Friedgut and Kahn [173] obtained the counterpart of
(8.1) for arbitrary H; this result was first proved by Alon [13]. We start by introducing the fractional
covering number of a graph. For a graph H = (V, E), define the fractional covering number as the
value of the following linear program (LP):2
( )
X X

ρ (H) = min w(e) : w(e) ≥ 1, ∀v ∈ V, w(e) ∈ [0, 1] . (8.3)
w
e∈E e∈E, v∈e

Theorem 8.3
∗ ∗
c0 ( H ) m ρ (H)
≤ N(H, m) ≤ c1 (H)mρ (H)
. (8.4)

1
To be precise, here N(H, G) is the number of subgraphs of G (subsets of edges) isomorphic to H. If we denote by inj(H, G)
the number of injective maps V(H) → V(G) mapping edges of H to edges of G, then N(H, G) = |Aut(H)| 1
inj(H, G).
2
If the “∈ [0, 1]” constraints in (8.3) and (8.5) are replaced by “∈ {0, 1}”, we obtain the covering number ρ(H) and the
independence number α(H) of H, respectively.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-161


i i

8.3 Brégman’s Theorem 161

For example, for triangles we have ρ∗ (K3 ) = 3/2 and Theorem 8.3 is consistent with (8.1).
Proof. Upper bound: Let V(H) = [n] and let w∗ (e) be the solution for ρ∗ (H). For any G with m
edges, draw a subgraph of G, uniformly at random from all those that are isomorphic to H. Given
such a random subgraph set Xi ∈ V(G) to be the vertex corresponding to an i-th vertex of H, i ∈ [n].

Now define a random 2-subset S of [n] by sampling an edge e from E(H) with probability ρw∗ ((He)) .
By the definition of ρ∗ (H) we have for any i ∈ [n] that P[i ∈ S] ≥ ρ∗1(H) . We are now ready to
apply Theorem 1.8:
log N(H, G) = H(X) ≤ H(XS |S)ρ∗ (H) ≤ log(2m)ρ∗ (H) ,
where the last inequality is as before: if S = {v, w} then XS = (Xv , Xw ) takes one of 2m values.

Overall, we get3 N(H, G) ≤ (2m)ρ (H) .
Lower bound: It amounts to construct a graph G with m edges for which N(H, G) ≥

c(H)|e(G)|ρ (H) . Consider the dual LP of (8.3)
 
 X 
α∗ (H) = max ψ(v) : ψ(v) + ψ(w) ≤ 1, ∀(vw) ∈ E, ψ(v) ∈ [0, 1] (8.5)
ψ  
v∈V(H)

i.e., the fractional packing number. By the duality theorem of LP, we have α∗ (H) = ρ∗ (H). The
graph G is constructed as follows: for each vertex v of H, replicate it for m(v) times. For each edge
e = (vw) of H, replace it by a complete bipartite graph Km(v),m(w) . Then the total number of edges
of G is
X
|E(G)| = m(v)m(w).
(vw)∈E(H)
Q
Furthermore, N(G, H) ≥ v∈V(H) m(v). To minimize the exponent log N(G,H)
log |E(G)| , fix a large number
 ψ(v) 
M and let m(v) = M , where ψ is the maximizer in (8.5). Then
X
|E(G)| ≤ 4Mψ(v)+ψ(w) ≤ 4M|E(H)|
(vw)∈E(H)
Y ∗
N(G, H) ≥ Mψ(v) = Mα (H)

v∈V(H)

and we are done.

8.3 Brégman’s Theorem


In this section, we present an elegant entropy proof of Radhakrishnan [351] of Brégman’s Theorem
[74], which bounds the number of perfect matchings (1-regular spanning subgraphs) in a bipartite
graphs.

3
Note that for H = K3 this gives a bound weaker than (8.2). To recover (8.2) we need to take X = (X1 , . . . , Xn ) be
uniform on all injective homomorphisms H → G.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-162


i i

162

We start with some definitions. The permanent of an n × n matrix A is defined as


XY
n
perm(A) ≜ aiπ (i) ,
π ∈Sn i=1

where Sn denotes the group of all permutations of [n]. For a bipartite graph G with n vertices on
the left and right respectively, the number of perfect matchings in G is given by perm(A), where
A is the adjacency matrix. For example,
 
 
 
 
perm   = 1, perm  =2
 
 

Theorem 8.4 (Brégman’s Theorem) For any n × n bipartite graph with adjacency matrix
A,
Y
n
1
perm(A) ≤ (di !) di ,
i=1

where di is the degree of left vertex i (i.e. sum of the ith row of A).

As an example, consider G = Kn,n . Then perm(G) = n!, which coincides with the RHS
[(n!)1/n ]n = n!. More generally, if G consists of n/d copies of Kd,d , then Brégman’s bound is
tight and perm = (d!)n/d .

Proof. If perm(A) = 0 then there is nothing to prove, so instead we assume perm(A) > 0 and
some perfect matchings exist. As a first attempt of proving Theorem 8.4 using the entropy method,
we select a perfect matching uniformly at random which matches the ith left vertex to the Xi th right
one. Let X = (X1 , . . . , Xn ). Then
X
n X
n
log perm(A) = H(X) = H(X1 , . . . , Xn ) ≤ H( X i ) ≤ log(di ).
i=1 i=1
Q
Hence perm(A) ≤ i di . This is worse than Brégman’s bound by an exponential factor, since by
Stirling’s formula (I.2)
!
Yn
1 Y
n
(di !) di ∼ d i e− n .
i=1 i=1

Here is our second attempt. The hope is to use the chain rule to expand the joint entropy and
bound the conditional entropy more carefully. Let us write
X
n X
n
H(X1 , . . . , Xn ) = H(Xi |X1 , . . . , Xi−1 ) ≤ E[log Ni ].
i=1 i=1

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-163


i i

8.3 Brégman’s Theorem 163

where Ni , as a random variable, denotes the number of possible values Xi can take conditioned
on X1 , . . . , Xi−1 , i.e., how many possible matchings for left vertex i given the outcome of where
1, . . . , i − 1 are matched to. However, it is hard to proceed from this point as we only know the
degree information, not the graph itself. In fact, since we do not know the relative positions of the
vertices, there is no reason why we should order from 1 to n. The key idea is to label the vertices
randomly, apply chain rule in this random order and average.
To this end, pick π uniformly at random from Sn and independent of X. Then

log perm(A) = H(X) = H(X|π )


= H(Xπ (1) , . . . , Xπ (n) |π )
X
n
= H(Xπ (k) |Xπ (1) , . . . , Xπ (k−1) , π )
k=1
Xn
= H(Xk |{Xj : π −1 (j) < π −1 (k)}, π )
k=1
X
n
≤ E log Nk ,
k=1

where Nk denotes the number of possible matchings for vertex k given the outcomes of {Xj :
π −1 (j) < π −1 (k)} and the expectation is with respect to (X, π ). The key observation is:

Lemma 8.5 Nk is uniformly distributed on [dk ].


Example 8.1 As a concrete example for Lemma 8.5, consider the 1 1
graph G on the right. For vertex k = 1, dk = 2. Depending on the
random ordering, if π = 1 ∗ ∗, then Nk = 2 w.p. 1/3; if π = ∗ ∗ 1, 2 2
then Nk = 1 w.p. 1/3; if π = 213, then Nk = 2 w.p. 1/3; if π = 312,
then Nk = 1 w.p. 1/3. Combining everything, indeed Nk is equally
3 3
likely to be 1 or 2.

Applying Lemma 8.5,

1 X
dk
1
E(X,π ) log Nk = log i = log(di !) di
dk
i=1

and hence
X
n
1 Y
n
1
log perm(A) ≤ log(di !) di = log (di !) di .
k=1 i=1

Proof of Lemma 8.5. In fact, we will show that even conditioned on Xn the distribution of Nk
is uniform. Indeed, if d = dk is the degree of k-th (right) node then let J1 , . . . , Jd be those right
nodes that match with neighbors of k under the fixed perfect matching (one of Ji ’s, say J1 , equals

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-164


i i

164

k). Random permutation π rearranges Ji ’s in the order in which corresponding right nodes are
revealed. Clearly the induced order of Ji ’s is uniform on d! possible choices. Note that if J1 occurs
in position ℓ ∈ {1, . . . , d} then Nk = d − ℓ + 1. Clearly ℓ and thus Nk are uniform on [d] = [dk ].

8.4 Euclidean geometry: Bollobás-Thomason and Loomis-Whitney


The following famous result shows that n-dimensional rectangles simultaneously minimize the
volumes of all coordinate projections:4

Theorem 8.6 (Bollobás-Thomason Box Theorem) Let K ⊂ Rn be a compact set. For


S ⊂ [n], denote by KS ⊂ RS the projection of K onto those coordinates indexed by S. Then there
exists a rectangle A s.t. Leb(A) = Leb(K) and for all S ⊂ [n]:

Leb(AS ) ≤ Leb(KS )

Thus, rectangles are extremal objects from the point of view of maximizing volumes of
coordinate projections.

Proof. Let Xn be uniformly distributed on K. Then h(Xn ) = log Leb(K). Let A be a rectangle of
size a1 × · · · × an where

log ai = h(Xi |Xi−1 ) .

Then, we have by Theorem 2.7(a)

h(XS ) ≤ log Leb(KS ).

On the other hand, by the chain rule and the fact that conditioning reduces differential entropy
(recall Theorem 2.7(a) and (c)),
X
n
h(XS ) = 1{i ∈ S}h(Xi |X[i−1]∩S )
i=1
X
≥ h(Xi |Xi−1 )
i∈S
Y
= log ai
i∈S

= log Leb(AS ).

The following result is a continuous counterpart of Shearer’s lemma (see Theorem 1.8 and
Remark 1.2):

4
Note that since K is compact, its projection and slices are all compact and hence measurable.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-165


i i

8.4 Euclidean geometry: Bollobás-Thomason and Loomis-Whitney 165

Corollary 8.7 (Loomis-Whitney) Let K be a compact subset of Rn and let Kjc denote the
projection of K onto coordinates in [n] \ j. Then
Y
n
1
Leb(K) ≤ Leb(Kjc ) n−1 . (8.6)
j=1

Proof. Let A be a rectangle having the same volume as K. Note that


Y
n
1
Leb(K) = Leb(A) = Leb(Ajc ) n−1
j=1

By the previous theorem, Leb(Ajc ) ≤ Leb(Kjc ).


The meaning of the Loomis-Whitney inequality is best understood by introducing the average
Leb(K)
width of K in the jth direction: wj ≜ Leb(Kjc ) . Then (8.6) is equivalent to

Y
n
Leb(K) ≥ wj ,
j=1

i.e. that volume of K is greater than that of the rectangle of average widths.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-166


i i

9 Random number generators

In this chapter we consider (a toy version of) the problem of creating high-quality random number
generators. Given a stream of independent Ber(p) bits, with unknown p, we want to turn them into
pure random bits, i.e., independent Ber(1/2) bits. Our goal is to find a way of extracting as many
fair coin flips as possible from possibly biased coin flips, without knowing the actual bias p.
In 1951 von Neumann [442] proposed the following scheme: Divide the stream into pairs of
bits, output 0 if 10, output 1 if 01, otherwise do nothing and move to the next pair. Since both 01
and 10 occur with probability pp̄ (where, we remind p̄ = 1 − p), regardless of the value of p, we
obtain fair coin flips at the output. To measure the efficiency of von Neumann’s scheme, note that,
on average, we have 2n bits in and 2pp̄n bits out. So the efficiency (rate) is pp̄. The question is:
Can we do better?
There are several choices to be made in the problem formulation. Universal vs non-universal:
the source distribution can be unknown or partially known, respectively. Exact vs approximately
fair coin flips: whether the generated coin flips are exactly fair or approximately, as measured by
one of the f-divergences studied in Chapter 7 (e.g., the total variation or KL divergence). In this
chapter, we only focus on the universal generation of exactly fair coins. On the other extreme,
in Part II we will see that optimal data compressors’ output consists of almost purely random
bits, however those compressors are non-universal (need to know source statistics, e.g. bias p) and
approximate.
For convenience, in this chapter we consider entropies measured in bits, i.e. log = log2 in this
chapter.

9.1 Setup
Let {0, 1}∗ = ∪k≥0 {0, 1}k = {∅, 0, 1, 00, 01, . . . } denote the set of all finite-length binary strings,
where ∅ denotes the empty string. For any x ∈ {0, 1}∗ , l(x) denotes the length of x.
Let us first introduce the definition of random number generator formally. If the input vector is
X, denote the output (variable-length) vector by Y ∈ {0, 1}∗ . Then the desired property of Y is the
following: Conditioned on the length of Y being k, Y is uniformly distributed on {0, 1}k .

Definition 9.1 (Randomness extractor) We say Ψ : {0, 1}∗ → {0, 1}∗ is an extractor if

1 Ψ(x) is a prefix of Ψ(y) if x is a prefix of y.

166

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-167


i i

9.3 Elias’ construction from data compression 167

i.i.d.
2 For any n and any p ∈ (0, 1), if Xn ∼ Ber(p), then Ψ(Xn ) ∼ Ber(1/2)k conditioned on
l(Ψ(Xn )) = k for each k ≥ 1.

The efficiency of an extractor Ψ is measured by its rate:


E[l(Ψ(Xn ))] i.i.d.
rΨ (p) = lim sup , Xn ∼ Ber(p).
n→∞ n
In other words, Ψ consumes a stream of n coins with bias p and outputs on average nrΨ (p) fair
coins.
Note that the von Neumann scheme above defines a valid extractor ΨvN (with ΨvN (x2n+1 ) =
ΨvN (x2n )), whose rate is rvN (p) = pp̄. Clearly this is wasteful, because even if the input bits are
already fair, we only get 25% in return.

9.2 Converse
We show that no extractor has a rate higher than the binary entropy function h(p), even if the
extractor is allowed to be non-universal (depending on p). The intuition is that the “information
content” contained in each Ber(p) variable is h(p) bits; as such, it is impossible to extract more
than that. This is easily made precise by the data processing inequality for entropy (since extractors
are deterministic functions).

Theorem 9.2 For any extractor Ψ and any p ∈ (0, 1),


1 1
rΨ (p) ≥ h(p) = p log2 + p̄ log2 .
p p̄

Proof. Let L = Ψ(Xn ). Then

nh(p) = H(Xn ) ≥ H(Ψ(Xn )) = H(Ψ(Xn )|L) + H(L) ≥ H(Ψ(Xn )|L) = E [L] bits,

where the last step follows from the assumption on Ψ that Ψ(Xn ) is uniform over {0, 1}k
conditioned on L = k.

The rate of von Neumann’s extractor and the entropy bound are plotted in Figure 9.1. Next
we present two extractors, due to Elias [149] and Peres [327] respectively, that attain the binary
entropy function. (More precisely, both construct a sequence of extractors whose rate approaches
the entropy bound).

9.3 Elias’ construction from data compression


The intuition behind Elias’ scheme is the following:

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-168


i i

168

rate

1 bit

rvN

p
0 1 1
2

Figure 9.1 Rate function of von Neumann’s extractor and the binary entropy function.

1 For iid Xn , the probability of each string only depends on its type, i.e., the number of 1’s, cf.
method of types in Exercise I.1. Therefore conditioned on the number of 1’s to be qn, Xn is
uniformly distributed over the type class Tq . This observation holds universally for any value
of the actual bias p.
2 Given a uniformly distributed random variable on some finite set, we can easily turn it into
variable-length string of fair coin flips. For example:
• If U is uniform over {1, 2, 3}, we can map 1 7→ ∅, 2 7→ 0 and 3 7→ 1.
• If U is uniform over {1, 2, . . . , 11}, we can map 1 7→ ∅, 2 7→ 0, 3 7→ 1, and the remaining
eight numbers 4, . . . , 11 are assigned to 3-bit strings.
We will study properties of these kind of variable-length encoders later in Chapter 10.

Lemma 9.3 Given U uniformly distributed on [M], there exists f : [M] → {0, 1}∗ such that
conditioned on l(f(U)) = k, f(U) is uniformly over {0, 1}k . Moreover,
log2 M − 4 ≤ E[l(f(U))] ≤ log2 M bits.

Proof. We defined f by partitioning [M] into subsets whose cardinalities are powers of two, and
assign elements in each subset to binary strings of that length. Formally, denote the binary expan-
Pn
sion of M by M = i=0 mi 2i , where the most significant bit mn = 1 and n = blog2 Mc + 1. Taking
non-zero mi ’s we can write M = 2i0 + · · · 2it as a sum of distinct powers of twos and thus define
a partition [M] = ∪tj=0 Mj , where |Mj | = 2ij . We map the elements of Mj to {0, 1}ij . Finally, notice
that uniform distribution conditioned on any subset is still uniform.
To prove the bound on the expected length, the upper bound follows from the same entropy
argument log2 M = H(U) ≥ H(f(U)) ≥ H(f(U)|l(f(U))) = E[l(f(U))], and the lower bound
follows from
1 X 1 X 2n X i−n
n n n
2n+1
E[l(f(U))] = mi 2i · i = n − mi 2i (n − i) ≥ n − 2 ( n − i) ≥ n − ≥ n − 4,
M M M M
i=0 i=0 i=0

where the last step follows from n ≤ log2 M + 1.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-169


i i

9.4 Peres’ iterated von Neumann’s scheme 169

Elias’ extractor Fix n ≥ 1. Let wH (xn ) define the Hamming weight (number of ones) of a
binary string xn . Let Tk = {xn ∈ {0, 1}n : wH (xn ) = k} define the Hamming sphere of radius k.
For each 0 ≤ k ≤ n, we apply the function f from Lemma 9.3 to each Tk . This defines a mapping
ΨE : {0, 1}n → {0, 1}∗ and then we extend it to ΨE : {0, 1}∗ → {0, 1}∗ by applying the mapping
per n-bit block and discard the last incomplete block. Then it is clear that the rate is given by
n E[l(ΨE (X ))]. By Lemma 9.3, we have
1 n

   
n n
E log − 4 ≤ E[l(ΨE (X ))] ≤ E log
n
wH (Xn ) wH (Xn )
Using Stirling’s approximation (cf. Exercise I.1) we can show
 
n
log = nh(wH (Xn )/n) + O(log n) .
wH (Xn )
Pn wH ( X n )
Since n1 wH (Xn ) = 1n i=1 1{Xi = 1}, from the law of large numbers we conclude n →p
and since h is a continuous bounded function, we also have
1
E[l(ΨE (Xn ))] = h(p) + O(log n/n).
n
Therefore the extraction rate of ΨE approaches the optimum h(p) as n → ∞.

9.4 Peres’ iterated von Neumann’s scheme


The main idea is to recycle the bits thrown away in von Neumann’s scheme and iterate. What von
Neumann’s extractor discarded are: (a) bits from equal pairs; (b) location of the distinct pairs. To
achieve the entropy bound, we need to extract the randomness out of these two parts as well.
First, some notations: Given x2n , let k = l(ΨvN (x2n )) denote the number of consecutive distinct
bit-pairs.

• Let 1 ≤ m1 < . . . < mk ≤ n denote the locations such that x2mj 6= x2mj −1 .
• Let 1 ≤ i1 < . . . < in−k ≤ n denote the locations such that x2ij = x2ij −1 .
• yj = x2mj , vj = x2ij , uj = x2j ⊕ x2j+1 .

Here yk are the bits that von Neumann’s scheme outputs and both vn−k and un are discarded. Note
that un is important because it encodes the location of the yk and contains a lot of information.
Therefore von Neumann’s scheme can be improved if we can extract the randomness out of both
vn−k and un .

Peres’ extractor For each t ∈ N, recursively define an extractor Ψt as follows:

• Set Ψ1 to be von Neumann’s extractor ΨvN , i.e., Ψ1 (x2n+1 ) = Ψ1 (x2n ) = yk .


• Define Ψt by Ψt (x2n ) = Ψt (x2n+1 ) = (Ψ1 (x2n ), Ψt−1 (un ), Ψt−1 (vn−k )).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-170


i i

170

As an example, consider input x = 100111010011 of length 2n = 12. Then the output is


determined recursively as follows:
y u v
z }| { z }| { z }| {
(011) (110100) (101)
(1)(010)(10)(0)
(1)(0)

Next we (a) verify Ψt is a valid extractor; (b) evaluate its efficiency (rate). Note that the bits that
enter into the iteration are no longer i.i.d. To compute the rate of Ψt , it is convenient to introduce
the notion of exchangeability. We say Xn are exchangeable if the joint distribution is invariant
under permutation, that is, PX1 ,...,Xn = PXπ (1) ,...,Xπ (n) for any permutation π on [n]. In particular, if
Xi ’s are binary, then Xn are exchangeable if and only if the joint distribution only depends on the
Hamming weight, i.e., PXn (xn ) = f(wH (xn )) for some function f. Examples: Xn is iid Ber(p); Xn is
uniform over the Hamming sphere Tk .
As an example, if X2n are i.i.d. Ber(p), then conditioned on L = k, Vn−k is iid Ber(p2 /(p2 + p̄2 )),
since L ∼ Binom(n, 2pp̄) and

pk+2m p̄n−k−2m
P[Yk = y, Un = u, Vn−k = v|L = k] = 
n
k(p2 + p̄2 )n−k (2pp̄)k
 − 1 
n p2 m  p̄2 n−k−m
= 2− k · · 2
k p + p̄2 p2 + p̄2
= P[Yk = y|L = k]P[Un = u|L = k]P[Vn−k = v|L = k],

where m = wH (v). In general, when X2n are only exchangeable, we have the following:

Lemma 9.4 (Ψt preserves exchangeability) Let X2n be exchangeable and L = Ψ1 (X2n ).
Then conditioned on L = k, Yk , Un and Vn−k are independent, each having an exchangeable
i.i.d.
distribution. Furthermore, Yk ∼ Ber( 21 ) and Un is uniform over Tk .

Proof. If suffices to show that ∀y, y′ ∈ {0, 1}k , u, u′ ∈ Tk and v, v′ ∈ {0, 1}n−k such that wH (v) =
wH (v′ ), we have

P[Yk = y, Un = u, Vn−k = v|L = k] = P[Yk = y′ , Un = u′ , Vn−k = v′ |L = k],

which implies that P[Yk = y, Un = u, Vn−k = v|L = k] = f(wH (v)) for some function f. Note that
the string X2n and the triple (Yk , Un , Vn−k ) are in one-to-one correspondence of each other. Indeed,
to reconstruct X2n , simply read the k distinct pairs from Y and fill them according to the locations of
ones in U and fill the remaining equal pairs from V. [Examples: (y, u, v) = (01, 1100, 01) ⇒ x =
(10010011), (y, u, v) = (11, 1010, 10) ⇒ x′ = (01110100).] Finally, note that u, y, v and u′ , y′ , v′
correspond to two input strings x and x′ of identical Hamming weight (wH (x) = k + 2wH (v)) and
hence of identical probability due to the exchangeability of X2n .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-171


i i

9.4 Peres’ iterated von Neumann’s scheme 171

i.i.d.
Lemma 9.5 (Ψt is an extractor) Let X2n be exchangeable. Then Ψt (X2n ) ∼ Ber(1/2)
conditioned on l(Ψt (X2n )) = m.

Proof. Note that Ψt (X2n ) ∈ {0, 1}∗ . It is equivalent to show that for all sm ∈ {0, 1}m ,

P[Ψt (X2n ) = sm ] = 2−m P[l(Ψt (X2n )) = m].

Proceed by induction on t. The base case of t = 1 follows from Lemma 9.4 (the distribution of
the Y part). Assume Ψt−1 is an extractor. Recall that Ψt (X2n ) = (Ψ1 (X2n ), Ψt−1 (Un ), Ψt−1 (Vn−k ))
and write the length as L = L1 + L2 + L3 , where L2 ⊥ ⊥ L3 |L1 by Lemma 9.4. Then

P[Ψt (X2n ) = sm ]
Xm
= P[Ψt (X2n ) = sm |L1 = k]P[L1 = k]
k=0
X
m X
m−k
Lemma 9.4 n−k
= P[L1 = k]P[Yk = sk |L1 = k]P[Ψt−1 (Un ) = skk+1 |L1 = k]P[Ψt−1 (V
+r
k+r+1 |L1 = k]
) = sm
k=0 r=0
X
m X
m−k
P[L1 = k]2−k 2−r P[L2 = r|L1 = k]2−(m−k−r) P[L3 = m − k − r|L1 = k]
induction
=
k=0 r=0

= 2−m P[L = m].

i.i.d.
Next we compute the rate of Ψt . Let X2n ∼ Ber(p). Then by the Strong Law of Large
Numbers (SLLN), 2n 1
l(Ψ1 (X2n )) ≜ 2n Ln
converges a.s. to pp̄. Assume, again by induction, that
a. s .
1
2n l (Ψ t−1 ( X 2n
)) −
− → rt− 1 ( p ) , with r1 ( p ) = pq. Then

1 Ln 1 1
l(Ψt (X2n )) = + l(Ψt−1 (Un )) + l(Ψt−1 (Vn−Ln )).
2n 2n 2n 2n
i.i.d. i.i.d. a. s .
Note that Un ∼ Ber(2pp̄), Vn−Ln |Ln ∼ Ber(p2 /(p2 +p̄2 )) and Ln −−→∞. Then the induction hypoth-
a. s . a. s .
esis implies that 1n l(Ψt−1 (Un ))−−→rt−1 (2pp̄) and 2(n−1 Ln ) l(Ψt−1 (Vn−Ln ))−−→rt−1 (p2 /(p2 +p̄2 )). We
obtain the recursion:
 
1 p2 + p̄2 p2
rt (p) = pp̄ + rt−1 (2pp̄) + rt−1 ≜ (Trt−1 )(p), (9.1)
2 2 p2 + p̄2

where the operator T maps a continuous function on [0, 1] to another. Furthermore, T is mono-
tone in the sense that f ≤ g pointwise then Tf ≤ Tg. Then it can be shown that rt converges
monotonically from below to the fixed point of T, which turns out to be exactly the binary
entropy function h. Instead of directly verifying Th = h, here is a simple proof: Consider
i.i.d.
X1 , X2 ∼ Ber(p). Then 2h(p) = H(X1 , X2 ) = H(X1 ⊕ X2 , X1 ) = H(X1 ⊕ X2 ) + H(X1 |X1 ⊕ X2 ) =
2
h(2pp̄) + 2pp̄h( 12 ) + (p2 + p̄2 )h( p2p+p̄2 ).
The convergence of rt to h are shown in Figure 9.2.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-172


i i

172

1.0

0.8

0.6

0.4

0.2

0.2 0.4 0.6 0.8 1.0

Figure 9.2 The rate function rt of Ψt (by iterating von Neumann’s extractor t times) versus the binary entropy
function, for t = 1, 4, 10.

9.5 Bernoulli factory


Given a stream of Ber(p) bits with unknown p, for what kind of function f : [0, 1] → [0, 1] can
we simulate iid bits from Ber(f(p)). Our discussion above deals with f(p) ≡ 12 . The most famous
example is whether we can simulate Ber(2p) from Ber(p), i.e., f(p) = 2p ∧ 1. Keane and O’Brien
[242] showed that all f that can be simulated are either constants or “polynomially bounded away
from 0 or 1”: for all 0 < p < 1, min{f(p), 1 − f(p)} ≥ min{p, 1 − p}n for some n ∈ N. In particular,
doubling the bias is impossible.
The above result deals with what f(p) can be simulated in principle. What type of computational
devices are needed for such as task? Note that since r1 (p) is quadratic in p, all rate functions rt
that arise from the iteration (9.1) are rational functions (ratios of polynomials), converging to
the binary entropy function as Figure 9.2 shows. It turns out that for any rational function f that
satisfies 0 < f < 1 on (0, 1), we can generate independent Ber(f(p)) from Ber(p) using either of
the following schemes with finite memory [308]:

1 Finite-state machine (FSM): initial state (red), intermediate states (white) and final states (blue,
output 0 or 1 then reset to initial state).
2 Block simulation: let A0 , A1 be disjoint subsets of {0, 1}k . For each k-bit segment, output 0 if
falling in A0 or 1 if falling in A1 . If neither, discard and move to the next segment. The block
size is at most the degree of the denominator polynomial of f.
The next table gives some examples of f that can be realized with these two architectures. (Exercise:
How to generate f(p) = 1/3?)
It turns out that the only type of f that can be simulated using either FSM or block simulation

is rational function. For f(p) = p, which satisfies Keane-O’Brien’s characterization, it cannot
be simulated by FSM or block simulation, but it can be simulated by the so-called pushdown
automata, which is a FSM operating with a stack (infinite memory) [308].
It is unknown how to find the optimal Bernoulli factory with the best rate. Clearly, a converse
is the entropy bound h(hf((pp))) , which can be trivial (bigger than one).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-173


i i

9.5 Bernoulli factory 173

Goal Block simulation FSM

1
1
0

0
f(p) = 1/2 A0 = 10; A1 = 01
1

1 0
0

0 0
1
f(p) = 2pq A0 = 00, 11; A1 = 01, 10 0 1

0
1 1

0 0
0
0
1
1
p3
f(p) = p3 +p̄3
A0 = 000; A1 = 111

0
0
1
1
1 1

Table 9.1 Bernoulli factories realized by FSM or block simulation.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-174


i i

Exercises for Part I

I.1 (Combinatorial meaning of entropy)


(a) Fix n ≥ 1 and 0 ≤ k ≤ n. Let p = nk and define Tp ⊂ {0, 1}n to be the set of all binary
sequences with p fraction of ones. Show that if k ∈ [1, n − 1] then
 
exp{nh(p)} n exp{nh(p)}
p ≤ | Tp | = ≤p . (I.1)
8k(n − k)/n k 2πk(n − k)/n

where h(·) is the binary entropy. Conclude that for all 0 ≤ k ≤ n we have

log |Tp | = nh(p) + O(log n) .

Hint: Stirling’s approximation:


1 n! 1
e 12n+1 ≤ √ ≤ e 12n , n≥1 (I.2)
2πn(n/e)n
(b) Let Qn = Ber(q)n be iid Bernoulli distribution on {0, 1}n . Show that

log Qn [Tp ] = −nd(pkq) + O(log n)

(c*) More generally, let X be a finite alphabet, P̂, Q distributions on X , and TP̂ a set of all strings
in X n with composition P̂. If TP̂ is non-empty (i.e. if nP̂(·) is integral) then

log |TP̂ | = nH(P̂) + O(log n)


log Qn [TP̂ ] = −nD(P̂kQ) + O(log n)

and furthermore, both O(log n) terms can be bounded as |O(log n)| ≤ |X | log(n + 1). (Hint:
show that number of non-empty TP̂ is ≤ (n + 1)|X | .)
I.2 (Refined method of types) The following refines Proposition 1.5. Let n1 , . . . , be non-negative
P
integers with i ni = n and let k+ be the number of non-zero ni ’s. Then
 
n k+ − 1 1 X
log = nH(P̂) − log(2πn) − log P̂i − Ck+ ,
n1 , n2 , . . . 2 2
i:ni >0

where P̂i = nni and 0 ≤ Ck+ ≤ log e


12 . (Hint: use (I.2)).
I.3 (Conditional entropy and Markov types)
(a) Fix n ≥ 1, a sequence xn ∈ X n and define

Nxn (a, b) = |{(xi , xi+1 ) : xi = a, xi+1 = b, i = 1, . . . , n}| ,

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-175


i i

Exercises for Part I 175

where we define xn+1 = x1 (cyclic continuation). Show that 1n Nxn (·, ·) defines a probability
distribution PA,B on X ×X with equal marginals PA = PB . Conclude that H(A|B) = H(B|A).
Is PA|B = PB|A ?
(2)
(b) Let Txn (Markov type-class of xn ) be defined as
(2)
Txn = {x̃n ∈ X n : Nx̃n = Nxn } .
(2)
Show that elements of Txn can be identified with cycles in the complete directed graph G
on X , such that for each (a, b) ∈ X × X the cycle passes Nxn (a, b) times through edge
( a, b) .
(c) Show that each such cycle can be uniquely specified by indentifying the first node and by
choosing at each vertex of the graph the order in which the outgoing edges are taken. From
this and Stirling’s approximation conclude that
(2)
log |Txn | = nH(xT+1 |xT ) + O(log n) , T ∼ Unif([n]) .

Check that H(xT+1 |xT ) = H(A|B) = H(B|A).


(d) Show that for any time-homogeneous Markov chain Xn with PX1 ,X2 (a1 , a2 ) > 0 ∀a1 , a2 ∈ X
we have
(2)
log PXn (Xn ∈ Txn ) = −nD(PB|A kPX2 |X1 |PA ) + O(log n) .

I.4 (Maximum entropy) Prove that for any X taking values on N = {1, 2, . . .} such that E[X] < ∞,
 
1
H(X) ≤ E[X]h ,
E [ X]
maximized uniquely by the geometric distribution. Hint: Find an appropriate Q such that RHS
- LHS = D(PX kQ).
I.5 (Finiteness of entropy) In Exercise I.4 we have shown that the entropy of any N-valued random
variable with finite expectation is finite. Next let us improve this result.
(a) Show that E[log X] < ∞ ⇒ H(X) < ∞.
Moreover, show that the condition of X being integer-valued is not superfluous by giving a
counterexample.
(b) Show that if k 7→ PX (k) is a decreasing sequence, then H(X) < ∞ ⇒ E[log X] < ∞.
Moreover, show that the monotonicity assumption is not superfluous by giving a counterex-
ample.
I.6 (Robust version of the maximal entropy) The maximal differential entropy among all densities
supported on [−b, b] is attained by the uniform distribution. Prove that as ϵ → 0+ we have

sup{h(M + Z) : M ∈ [−b, b], E[Z] = 0, Var[Z] ≤ ϵ} = log(2b) + o(1) .

where supremization is over all (not necessarily independent) random variables M, Z such that
M + Z possesses a density. (Hint: [162, Appendix C] proves o(1) = O(ϵ1/3 log 1ϵ ) bound.)
I.7 (Maximum entropy under Hamming weight constraint) For any α ≤ 1/2 and d ∈ N,

max{H(Y) : Y ∈ {0, 1}d , E[wH (Y)] ≤ αd} = dh(α),

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-176


i i

176 Exercises for Part I

achieved by the product distribution Y ∼ Ber(α)⊗d . Hint: Find an appropriate Q such that RHS
- LHS = D(PY kQ).
I.8 (Gaussian divergence)
(a) Under what conditions on m0 , Σ0 , m1 , Σ1 is D( N (m1 , Σ1 ) k N (m0 , Σ0 ) ) < ∞?
(b) Compute D(N (m, Σ)kN (0, In )), where In is the n × n identity matrix.
(c) Compute D( N (m1 , Σ1 ) k N (m0 , Σ0 ) ) for non-singular Σ0 . (Hint: think how Gaussian
distribution changes under shifts x 7→ x+a and non-singular linear transformations x 7→ Ax.
Apply data-processing to reduce to previous case.)
I.9 (Water-filling solution) Let M ∈ Rk×n be a fixed matrix, X ⊥⊥ Z ∼ N (0, In ).
(a) Let M = UΛVT be an SVD decomposition, so that U, V are orthogonal matrices and Λ =
diag(λ1 , . . . , λn ) (with rank(M) non-zero λj ’s). Show that

max I(X; MX + Z) = max I(X; ΛX + Z) .


PX :E[∥X∥2 ]≤s2 PX :E[∥X∥2 ]≤s2

(b) Conclude that

1X + 2
n
max I(X; MX + Z) = log (λi t) ,
PX :E[∥X∥2 ]≤s2 2
i=1
Pn
where log+ x = max(0, log x) and t is determined from solving i=1 |t − λ− i |+ = s .
2 2

This distribution of energy of X along singular vectors of M is known as water-filling solution,


see Section 20.4.
(d)
I.10 (MIMO capacity) Let M ∈ Rk×n be a random, orthogonally invariant matrix (i.e. M = MU
for any orthogonal matrix U). Let X ⊥⊥ (Z, M) and Z ∼ N (0, In ). Show that

1 h i 1X h i n
s2 s2
max I(X; MX + Z|M) = E log det(I + MT M) = E log(1 + σi2 (M)) ,
PX :E[∥X∥2 ]≤s2 2 n 2 n
i=1

where σi (M) are the singular values of M. (Hint: average over rotations as in Section 6.2*)
i.i.d.
Note: In communication theory Mi,j ∼ N (0, 1) (Ginibre ensemble) models a multi-input, multi-
output (MIMO) channel with n transmit and k receive antennas. The matrix MMT is a Wishart
matrix and its spectrum, when n and k grow proportionally, approaches the Marchenko-Pastur
distribution. The important practical consequence is that the capacity of a MIMO channel grows
for high-SNR as 21 min(n, k) log SNR. This famous observation [418] is the reason modern WiFi
and cellular systems employ multiple antennas.
I.11 (Conditional capacity) Consider a Markov kernel PB,C|A : A → B × C , which we will also
(a)
understand as a collection of distributions PB,C ≜ PB,C|A=a . Prove
(a) (a)
inf sup D(PC|B kQC|B |PB ) = sup I(A; C|B) ,
QC|B a∈A PA

whenever supremum on the right-hand side is finite and achievable by some distribution P∗A . In
R
this case, optimal QC|B = P∗C|B is found by disintegrating P∗B,C = PA∗ (da)PB,C . (Hint: follow
(a)

the steps of (5.1).)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-177


i i

Exercises for Part I 177

I.12 Conditioned on X = x, let Y be Poisson with mean x, i.e.,


xk
PY|X (k|x) = e−x , k = 0, 1, 2, . . .
k!
Let X be an exponential random variable with unit mean. Find I(X; Y).
I.13 (Information lost in erasures) Let X, Y be a pair of random variables with I(X; Y) < ∞. Let Z
be obtained from Y by passing the latter through an erasure channel, i.e., X → Y → Z where
(
1 − δ, z = y ,
PZ|Y (z|y) =
δ, z =?

where ? is a symbol not in the alphabet of Y. Find I(X; Z).


I.14 (Information bottleneck) Let X → Y → Z where Y is a discrete random variable taking values
on a finite set Y . Prove that

I(X; Z) ≤ log |Y|.

I.15 The Hewitt-Savage 0-1 law states that certain symmetric events have no randomness. Let
{Xi }i≥1 be a sequence be iid random variables. Let E be an event determined by this sequence.
We say E is exchangeable if it is invariant under permutation of finitely many indices in
the sequence of {Xi }’s, e.g., the occurance of E is unchanged if we permute the values of
(X1 , X4 , X7 ), etc.
Let us prove the Hewitt-Savage 0-1 law information-theoretically in the following steps:
P Pn
(a) (Warm-up) Verify that E = { i≥1 Xi converges} and E = {limn→∞ n1 i=1 Xi = E[X1 ]}
are exchangeable events.
(b) Let E be an exchangeable event and W = 1E is its indicator random variable. Show that
for any k, I(W; X1 , . . . , Xk ) = 0. (Hint: Use tensorization (6.2) to show that for arbitrary n,
nI(W; X1 , . . . , Xk ) ≤ 1 bit.)
(c) Since E is determined by the sequence {Xi }i≥1 , we have by continuity of mutual informa-
tion:

H(W) = I(W; X1 , . . .) = lim I(W; X1 , . . . , Xk ) = 0.


k→∞

Conclude that E has no randomness, i.e., P(E) = 0 or P(E) = 1.


(d) (Application to random walk) Often after the application of Hewitt-Savage, further efforts
are needed to determine whether the probability is 0 or 1. As an example, consider Xi ’s
Pn
are iid ±1 and Sn = i=1 Xi denotes the symmetric random walk. Verify that the event
E = {Sn = 0 finitely often} is exchangeable. Now show that P(E) = 0.
(Hint: consider E+ = {Sn > 0 eventually} and E− similarly. Apply Hewitt-Savage to them
and invoke symmetry.)
I.16 Let (X, Y) be uniformly distributed in the unit ℓp -ball Bp ≜ {(x, y) : |x|p + |y|p ≤ 1}, where
p ∈ (0, ∞). Also define the ℓ∞ -ball B∞ ≜ {(x, y) : |x| ≤ 1, |y| ≤ 1}.
(a) Compute I(X; Y) for p = 1/2, p = 1 and p = ∞.
(b*) Determine the limit of I(X; Y) as p → 0.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-178


i i

178 Exercises for Part I

Pn
I.17 Suppose Z1 , . . . Zn are independent Poisson random variables with mean λ. Show that i=1 Zi
is a sufficient statistic of (Z1 , . . . Zn ) for λ.
I.18 Suppose Z1 , . . . Zn are independent uniformly distributed on the interval [0, λ]. Show that
max1≤i≤n Zi is a sufficient statistic of (Z1 , . . . Zn ) for λ.
I.19 (Divergence of order statistic) Given xn = (x1 , . . . , xn ) ∈ Rn , let x(1) ≤ . . . ≤ x(n) denote the
ordered entries. Let P, Q be distributions on R and PXn = Pn , QXn = Qn .
(a) Prove that
D(PX(1) ,...,X(n) kQX(1) ,...,X(n) ) = nD(PkQ). (I.3)
(b) Show that
D(Bin(n, p)kBin(n, q)) = nd(pkq).
I.20 (Continuity of entropy on finite alphabet) We have shown that on a finite alphabet entropy is a
continuous function of the distribution. Quantify this continuity by explicitly showing
|H(P) − H(Q)| ≤ h(TV(P, Q)) + TV(P, Q) log(|X | − 1)
for any P and Q supported on X .
Hint: Use Fano’s inequaility and the inf-representation (over coupling) of total variation in
Theorem 7.7(a).
I.21 (a) For any X such that E [|X|] < ∞, show that
(E[X])2
D(PX kN (0, 1)) ≥ nats.
2
(b) For a > 0, find the minimum and minimizer of
min D(PX kN (0, 1)).
PX :EX≥a

Is the minimizer unique? Why?


I.22 Suppose D(P1 kP0 ) < ∞ then show
d
D(λP1 + λ̄QkλP0 + λ̄Q) = 0 .
dλ λ=0

This extends Prop. 2.20.


I.23 (Metric entropy and capacity) Let {PY|X=x : x ∈ X } be a set of distributions and let C =
supPX I(X; Y) be its capacity. For every ϵ ≥ 0, define1
N(ϵ) = min{k : ∃Q1 . . . Qk : ∀x ∈ X , minj D(PY|X=x kQj ) ≤ ϵ2 } . (I.4)
(a) Prove that


C = inf ϵ2 + log N(ϵ) . (I.5)
ϵ≥0

1
N(ϵ) is the minimum number of radius-ϵ (in divergence) balls that cover the set {PY|X=x : x ∈ X }. Thus, log N(ϵ) is a
metric entropy – see Chapter 27.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-179


i i

Exercises for Part I 179

(Hint: when is N(ϵ) = 1? See Theorem 32.4.)


(b) Similarly, show

I(X; Y) = inf (ϵ + log N(ϵ; PX )) ,


ϵ≥0

where the average-case covering number is

N(ϵ; PX ) = min{k : ∃Q1 . . . Qk : Ex∼PX [minj D(PY|X=x kQj )] ≤ ϵ} (I.6)

Comments: These estimates are useful because N(ϵ) for small ϵ roughly speaking depends on
local (differential) properties of the map x 7→ PY|X=x , unlike C which is global.
I.24 Consider the channel PYm |X : [0, 1] 7→ {0, 1}m , where given x ∈ [0, 1], Ym is i.i.d. Ber(x).
(a) Using the upper bound from Exercise I.23 prove
1
C(m) ≜ max I(X; Ym ) ≤ log m + O(1) , m → ∞.
PX 2
Hint: Find a covering of the input space.
(b) Show a lower bound to establish
1
C(m) ≥ log m + o(log m) , m → ∞.
2
Hint: Show that for any ϵ > 0 there exists K(ϵ) such that for all m ≥ 1 and all p ∈ [ϵ, 1 − ϵ]
we have |H(Bin(m, p)) − 12 log m| ≤ K(ϵ).
I.25 This exercise shows other ways of proving Fano’s inequality in its various forms.
(a) Prove (3.15) as follows. Given any P = (Pmax , P2 , . . . , PM ), apply a random permutation π
to the last M − 1 atoms to obtain the distribution Pπ . By comparing H(P) and H(Q), where
Q is the average of Pπ over all permutations, complete the proof.
(b) Prove (3.15) by directly solving the convex optimization max{H(P) : 0 ≤ pi ≤ Pmax , i =
P
1, . . . , M, i pi = 1}.
(c) Prove (3.19) as follows. Let Pe = P[X 6= X̂]. First show that

I(X; Y) ≥ I(X; X̂) ≥ min{I(PX , PZ|X ) : P[X = Z] ≥ 1 − Pe }.


PZ|X

Notice that the minimum is non-zero unless Pe = Pmax . Second, solve the stated convex
optimization problem. (Hint: look for invariants that the matrix PZ|X must satisfy under
permutations (X, Z) 7→ (π (X), π (Z)) then apply the convexity of I(PX , ·)).
Qn
I.26 Show that PY1 ···Yn |X1 ···Xn = i=1 PYi |Xi if and only if the Markov chain Yi → Xi → (X\i , Y\i )
holds for all i = 1, . . . , n, where X\i = {Xj , j 6= i}.
I.27 (Distributions and graphical models)
(a) Draw all possible directed acyclic graphs (DAGs, or directed graphical models) compatible
with the following distribution on X, Y, Z ∈ {0, 1}:
(
1/6, x = 0, z ∈ {0, 1} ,
PX,Z (x, z) = (I.7)
1/3, x = 1, z ∈ {0, 1}
Y=X+Z (mod2) (I.8)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-180


i i

180 Exercises for Part I

You may include only the minimal DAGs (recall: the DAG is minimal for a given
distribution if removal of any edge leads to a graphical model incompatible with the
distribution).2
Qn
(b) Draw the DAG describing the set of distributions PXn Yn satisfying PYn |Xn = i=1 PYi |Xi .
(c) Recall that two DAGs G1 and G2 are called equivalent if they have the same vertex sets and
each distribution factorizes w.r.t. G1 if and only if it does so w.r.t. G2 . For example, it is
well known

X→Y→Z ⇐⇒ X←Y←Z ⇐⇒ X ← Y → Z.

Consider the following two DAGs with countably many vertices:

X1 → X2 → · · · → Xn → · · ·
X1 ← X2 ← · · · ← Xn ← · · ·

Are they equivalent?


I.28 Give a necessary and sufficient condition for A → B → C for jointly Gaussian (A, B, C) in
terms of correlation coefficients. For discrete (A, B, C) denote xabc = PABC (a, b, c) and write
the Markov chain condition as a list of degree-2 polynomial equations in {xabc , a ∈ A, b ∈
B, c ∈ C}.
I.29 Let A, B, C be discrete with PC|B (c|b) > 0 ∀b, c.
(a) Show

A→B→C
=⇒ A ⊥
⊥ (B, C)
A→C→B

Discuss implications for sufficient statistic.


(b*) For binary (A, B, C) characterize all counterexamples.
Comment: Thus, a popular positivity condition PA,B,C > 0 allows to infer conditional inde-
pendence relations, which are not true in general. In other words, a set of distributions
satisfying certain (conditional) independence relations does not coincide with the closure of
its intersection with {PA,B,C > 0}, see [366] for more.
I.30 Consider the implication

I( A; C ) = I( B; C ) = 0 =⇒ I(A, B; C) = 0 . (I.9)

(a) Show (I.9) for jointly Gaussian (A, B, C).


(b) Find a counterexample for general (A, B, C).
(c) Prove or disprove: (I.9) also holds for arbitrary finite-cardinality discrete (A, B, C) under
positivity condition PA,B,C (a, b, c) > 0 ∀a, b, c.

2
Note: {X → Y}, {X ← Y} and {X Y} are the three possible directed graphical modelss for two random variables. For
example, the third graph describes the set of distributions for which X and Y are independent: PXY = PX PY . In fact, PX PY
factorizes according to any of the three DAGs, but {X Y} is the unique minimal DAG.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-181


i i

Exercises for Part I 181

I.31 Find the entropy rate of a stationary ergodic Markov chain with transition probability matrix
 1 1 1 
2 4 4
P= 0 1
2
1
2

1 0 0

I.32 (Solvable HMM) Similar to the Gilbert-Elliott process (Example 6.3) let Sj ∈ {±1} be a
stationary two-state Markov chain with P[Sj = −Sj−1 |Sj−1 ] = 1 − P[Sj = Sj−1 |Sj−1 ] = τ . Let
iid
Ej ∼ Ber(δ), with Ej ∈ {0, 1} and let Xj = BECδ (Sj ) be the observation of Sj through the binary
erasure channel (BEC) with erasure probability δ , i.e. Xj = Sj Ej . Find entropy rate of Xj (you
can give answer in the form of a convergent series). Evaluate at τ = 0.11, δ = 1/2 and compare
with H(X1 ).
I.33 Consider a binary symmetric random walk Xn on Z that starts at zero. In other words, Xn =
Pn
j=1 Bj , where (B1 , B2 , . . .) are independent and equally likely to be ±1.
(a) When n  1 does knowing X2n provide any information about Xn ? More exactly, prove

lim inf I(Xn ; X2n ) > 0.


n→∞

(Hint: lower semicontinuity and central limit theorem)


(b) Compute the exact value of this limit limn→∞ I(Xn ; X2n ).
I.34 (Entropy rate and contiguity) Theorem 2.2 states that if a distribution on a finite alphabet X is
almost uniform then its entropy must be close to log |X |. This exercise extends this observation
to random processes.
(a) Let Qn be the uniform distribution on X n . Show that, if {Pn } ◁ {Qn } (cf. Definition 7.9),
then H(Pn ) = H(Qn ) + o(n) = n log |X | + o(n), or equivalently, D(Pn kQn ) = o(n).
(b) Show that for non-uniform Qn , Pn ◁▷ Qn does not imply H(Pn ) = H(Qn ) + o(n). (Hint:
for a counterexample, consider the mixed source in Example 6.2(b).)
I.35 Let ITV (X; Y) = TV(PX,Y , PX PY ). Let X ∼ Ber(1/2) and conditioned on X generate A and B
independently setting them equal to X or 1 − X with probabilities 1 − δ and δ , respectively (i.e.
A ← X → B). Show
1
ITV (X; A, B) = ITV (X; A) = | − δ| .
2
This means the second observation of X is “uninformative” (in the ITV sense).
Similarly, show that when X ∼ Ber(δ) for δ < 1/2 there exists joint distribution PX,Y so that
TV(PY|X=0 , PY|X=1 ) > 0 (thus ITV (X; Y) and I(X; Y) are strictly positive), but at the same time
minX̂(Y) P[X 6= X̂] = δ . In other words, observation Y is informative about X, but does not
improve the probability of error.
Note: This effect is the basis of an interesting economic effect of herding [30].
I.36 Prove the following variational representation of the total variation:
s
Z  2
1 d(P0 − P1 )
TV(P0 , P1 ) = inf dQ . (I.10)
2 Q dQ

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-182


i i

182 Exercises for Part I

I.37 Show that map P 7→ D(PkQ) is strongly convex, i.e. for all λ ∈ [0, 1] and all P0 , P1 , Q we have
λD(P1 kQ) + λ̄D(P0 kQ) − D(λP1 + λ̄P0 kQ) ≥ 2λλ̄TV(P0 , P1 )2 log e .

(Hint: Write LHS as I(X; Y) for X ∼ Ber(λ) and apply Pinsker’s).


ϵ
I.38 (Rényi divergences and Blackwell order) Let pϵ = 1+e eϵ . Show that for all ϵ > 0 and all α > 0
we have
Dα (Ber(pϵ )kBer(1 − pϵ )) < Dα (N (ϵ, 1)kN (0, 1)) .
Yet, for small enough ϵ we have
TV(Ber(pϵ ), Ber(1 − pϵ )) > TV(N (ϵ, 1), N (0, 1)) .
Note: This shows that domination under all Rényi divergences does not imply a similar
comparison in other f-divergences [132]. On the other hand, we have the equivalence [310]:
∀α > 0 : Dα (P1 kP0 ) ≤ Dα (Q1 kQ0 )
⇐⇒ ∃ n0 ∀ n ≥ n0 ∀ f : Df ( P ⊗ ⊗n ⊗n ⊗n
1 kP0 ) ≤ Df (Q1 kQ0 ) .
n

(The latter is also equivalent to existence of a kernel Kn such that Kn ◦ P⊗


i
n
= Q⊗ n
i – a so-called
Blackwell order on pairs of measures, also known as channel degradation).
I.39 (Rényi divergence as KL [383]) Show for all α ∈ R:
(1 − α)Dα (PkQ) = inf (αD(RkP) + (1 − α)D(RkQ)) . (I.11)
R

Whenever the LHS is finite, derive the explicit form of a unique minimizer R.
I.40 For an f-divergence, consider the following statements:
(i) If If (X; Y) = 0, then X ⊥
⊥ Y.
(ii) If X − Y − Z and If (X; Y) = If (X; Z) < ∞, then X − Z − Y.
Recall that f : (0, ∞) → R is a convex function with f(1) = 0.
(a) Choose an f-divergence which is not a multiple of the KL divergence (i.e., f cannot be of
form c1 x log x + c2 (x − 1) for any c1 , c2 ∈ R). Prove both statements for If .
(b) Choose an f-divergence which is non-linear (i.e., f cannot be of form c(x − 1) for any c ∈ R)
and provide examples that violate (i) and (ii).
(c) Choose an f-divergence. Prove that (i) holds, and provide an example that violates (ii).
I.41 (Hellinger and interactive protocols [31]) In the area of interactive communication Alice has
access to X and outputs bits Ai , i ≥ 1, whereas Bob has access to Y and outputs bits Bi , i ≥ 1.
The communication proceeds in rounds, so that at i-th round Alice and Bob see the previous
messages of each other. This means that conditional distribution of the protocol is given by
Y
n
PAn ,Bn |X,Y = PAi |Ai−1 ,Bi−1 ,X PBi |Ai−1 ,Bi−1 ,Y .
i=1

Denote for convenience Πx,y ≜ PAn ,Bn |X=x,Y=y . Show


(a) (Cut-and-paste lemma) H2 (Πx,y , Πx′ ,y′ ) = H2 (Πx,y′ , Πx′ ,y ). Are there any other f-
divergences with this property?

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-183


i i

Exercises for Part I 183

(b) H2 (Πx,y , Πx′ ,y ) + H2 (Πx,y′ , Πx′ ,y′ ) ≤ 2H2 (Πx,y , Πx′ ,y′ )
I.42 (Chain rules I)
(a) Show using (I.11) and the chain rule for KL that
X
n
(1 − α)Dα (PXn kQXn ) ≥ inf(1 − α)Dα (PXi |Xi−1 =a kQXi |Xi−1 =a )
a
i=1

(b) Derive two special cases:


1 Y n
1
1 − H2 (PXn , QXn ) ≤ sup(1 − H2 (PXi |Xi−1 =a , QXi |Xi−1 =a ))
2 a 2
i=1
Y
n
1 + χ2 (PXn kQXn ) ≤ sup(1 + χ2 (PXi |Xi−1 =a kQXi |Xi−1 =a ))
a
i=1

I.43 (Chain rules II)


(a) Show that the chain rule for divergence can be restated as
X
n
D(PXn kQXn ) = D(Pi kPi−1 ),
i=1

where Pi = PXi QXni+1 |Xi , with Pn = PXn and P0 = QXn . The identity above shows how
KL-distance from PXn to QXn can be traversed by summing distances between intermediate
Pi ’s.
(b) Using the same path and triangle inequality show that
X
n
TV(PXn , QXn ) ≤ EPXi−1 TV(PXi |Xi−1 , QXi |Xi−1 )
i=1

(c) Similarly, show for the Hellinger distance H:


Xn q
H(PXn , QXn ) ≤ EPXi−1 H2 (PXi |Xi−1 , QXi |Xi−1 )
i=1

See also [230, Theorem 7] for a deeper result, where for a universal C > 0 it is shown that
X
n
H2 (PXn , QXn ) ≤ C EPXi−1 H2 (PXi |Xi−1 , QXi |Xi−1 ) .
i=1

I.44 (a) Define Marton’s divergence


Z  2
dP
Dm (PkQ) = dQ 1 − .
dQ +

Prove that
Dm (PkQ) = inf{E[P[X 6= Y|Y]2 ] : PX = P, PY = Q}
PXY

where the infimum is over all couplings. (Hint: For one direction use the same coupling
achieving TV. For the other direction notice that P[X 6= Y|Y] ≥ 1 − QP((YY)) .)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-184


i i

184 Exercises for Part I

(b) Define symmetrized Marton’s divergence

Dsm (PkQ) = Dm (PkQ) + Dm (QkP).

Prove that

Dsm (PkQ) = inf{E[P2 [X 6= Y|Y]] + E[P2 [X 6= Y|X]] : PX = P, PY = Q}.


PXY

I.45 (Center of gravity under f-divergences) Recall from Corollary 4.2 the fact that
minQY D(PY|X kQY |PX ) = I(X; Y) achieved at QY = PY . Prove the following versions for other
f-divergences:
(a) Suppose that for PX -a.e. x, PY|X=x  μ with density p(y|x).3 Then
Z q 2
inf χ2 (PY|X kQY |PX ) = μ(dy) E[pY|X (y|X)2 ] − 1. (I.12)
QY

p
If the right-hand side is finite, the minimum is achieved at QY (dy) ∝ E[p(y|X)2 ] μ(dy).
(b) Show that
Z − 1
1
inf χ (QY kPY|X |PX ) =
2
μ(dy) − 1, (I.13)
QY g ( y)

where g(y) ≜ E[pY|X (y|X)−1 ] and we use agreement 1/0 = ∞ for all reciprocals. If the right-
hand side is finite, then the minimum is achieved by QY (dy) ∝ g(1y) 1{g(y) < ∞} μ(dy).
(c) Show that
Z
inf D(QY kPY|X |PX ) = − log μ(dy) exp(E[log p(y|X)]). (I.14)
QY

If the right-hand side is finite, the minimum is achieved at QY (dy) ∝ exp(E[log p(y|X)]) μ(dy).
Note: This exercise shows that the center of gravity with respect to other f-divergences need not
be PY but its reweighted version. For statistical applications, see Exercises VI.6, VI.9, and VI.10,
where (I.12) and (I.13) are used to determine the form of the Bayes estimator.
I.46 (DPI for Fisher information) Let pθ (x, y) be a smoothly parametrized family of densities on
X ⊗ Y (with respect to some reference measure μX ⊗ μY ) where θ ∈ Rd . Let JXF,Y (θ) denote
the Fisher information matrix of the joint distribution and similarly JXF (θ), JYF (θ) those of the
marginals.
(a) (Monotonicity) Assume the interchangeability of derivative and integral, namely,
R
∇θ pθ (y) = μX (dx)∇θ pθ (x, y) for every θ, y. Show that JYF (θ)  JXF,Y (θ).
(b) (Data processing inequality) Suppose, in addition, that θ → X → Y. (In other words, pθ (y|x)
does not depend on θ.) Then JYF (θ)  JXF (θ), with equality if Y is a sufficient statistic of X
for θ.

3
Note that the results do not depend on the choice of μ, so we can take for example μ = PY , in view of Lemma 3.3.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-185


i i

Exercises for Part I 185

I.47 (Fisher information inequality) Consider real-valued A ⊥ ⊥ B with differentiable densities and
finite (location) Fisher informations J(A), J(B). Then Stam’s inequality [399] shows
1 1 1
≥ + . (I.15)
J(A + B) J(A) J(B)
(a) Show that Stam’s inequality is equivalent to (a + b)2 J(A + B) ≤ a2 J(A) + b2 J(B) for all
a, b > 0.
(b) Let X1 = aθ+ A, X2 = bθ+ B. This defines a family of distributions of (X1 , X2 ) parametrized
by θ ∈ R. Show that its Fisher information is given by JF (θ) = a2 J(A) + b2 J(B).
(c) Let Y = X1 + X2 and assume that conditions for the applicability of the DPI for Fisher
information (Exercise I.46) hold. Conclude the proof of (I.15).
Note: A simple sufficient condition that implies (I.15) is that densities of A and B are everywhere
strictly positive on R. For a direct proof in this case, see [58].
I.48 The Ingster-Suslina formula [225] computes the χ2 -divergence between a mixture and a sim-
ple distribution, exploiting the second moment nature of χ2 . Let Pθ be a family of probability
distributions on X parameterized by θ ∈ Θ. Each distribution (prior) π on Θ induces a mixture
R
Pπ ≜ Pθ π (dθ). Assume that Pθ ’s have a common dominating distribution Q.
(a) Show that

χ2 (Pπ kQ) = E[G(θ, θ̃)] − 1,


R dPθ̃
where θ, θ̃ are two “replicas” independently drawn from π and G(θ, θ̃) ≜ dQ dP
dQ
θ
dQ .
(b) Show that for Gaussian mixtures (with ∗ denoting convolution)

χ2 (π ∗ N (0, I)kN (0, I)) = E[e⟨θ,θ̃⟩ ] − 1,


i.i.d.
θ, θ̃ ∼ π .

Deduce (7.45) from this result.


I.49 (Community detection) The Erdös-Rényi model, denoted by ER(n, p), is the distribution of
a random graph with n nodes where each pair is connected independently with probability p.
The Stochastic Block Model (SBM), denoted by SBM(n, p, q), extends the homogeneous Erdös-
Rényi model to incorporate community structure: Each node i is labeled by σi ∈ {±1} denoting
its community membership; conditioned on σ = (σ1 , . . . , σn ), each pair i and j are connected
independently with probability p if σi = σj and q otherwise. (For example, the assortative case
of p > q models the scenario where individuals in the same community are more likely to
be friends.) The problem of community detection asks whether SBM(n, p, q) is distinguishable
from its Erdös-Rényi counterpart ER(n, p+2 q ) when n is large. Consider the sparse setting where
p = an and q = bn for fixed constants a, b > 0.
(a) Assume that the labels σi ’s are independent and uniform. Applying the Ingster-Suslina
formula in Exercise I.48, show that as n → ∞,
(
  p + q  √1 + o( 1) τ < 1
χ2 SBM(n, p, q) ER n, = 1−τ (I.16)
2 ∞ τ ≥ 1,
(a−b)2
where τ ≜ 2(a+b) .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-186


i i

186 Exercises for Part I

(b) Assume that n is even and σ is uniformly distributed over the set of bisections {z ∈ {±1}n :
Pn
i=1 zi = 0}, so that the two communities are equally sized. Show that (I.16) continues to
hold.
Note: As a consequence of (I.16), we have the contiguity SBM(n, p, q) ◁ ER(n, p+2 q ) whenever
τ < 1. In fact, they are mutually contiguous if and only if τ < 1. This much more difficult
result can be shown using the method of small subgraph conditioning developed by [364, 227];
cf. [307, Section 5].
I.50 (Sampling without replacement I [400]) Consider two ways of generating a random vector
Xn = (X1 , . . . , Xn ): Under P, Xn are sampled from the set [n] = {1, . . . , n} without replacement;
under Q, Xn are sampled from [n] with replacement. Let’s compare the joint distribution of the
first k draws X1 , . . . , Xk for some 1 ≤ k ≤ n.
(a) Show that
 
k! n
TV(PXk , QXk ) = 1 −
nk k
 
k! n
D(PXk kQXk ) = − log k .
n k

Conclude that D and TV are o(1) iff k = o( n).

(b) Explain the specialness of n by find an explicit test that distinguishes P and Q with high

probability when k  n. Hint: birthday paradox.
I.51 (Sampling without replacement II [400]) Let X1 , . . . , Xk be a random sample of balls without
Pq
replacement from an urn containing ai balls of color i ∈ [q], i=1 ai = n. Let QX (i) = ani . Show
that
k2 (q − 1) log e
D(PXk kQkX ) ≤ c , c= .
(n − 1)(n − k + 1) 2
Let Rm,b0 ,b1 be the distribution of the number of 1’s in the first m ≤ b0 + b1 coordinates of a
randomly permuted binary strings with b0 zeros and b1 ones.
(a) Show that
X
q
ai − V i ai − V i
D(PXm+1 |Xm kQX |PXm ) = E[ log ],
N−m pi (N − m)
i=1

where Vi ∼ Rm,N−ai ,ai .


(b) Show that the i-th term above also equals pi E[log pia(iN−−Ṽmi ) ], Ṽi ∼ Rm,N−ai ,ai −1 .
 
(c) Use Jensen’s inequality to show that the i-th term is upper bounded by pi log 1 + (n−1)(mn−m) 1−pipi .
(d) Use the bound log(1 + x) ≤ x log e to complete the proof.
I.52 (Effective de Finetti) We will show that for any distribution PXn invariant to permutation and
k < n there exists a mixture of iid distributions QXk which approximates PXk :
r
k2 H(X1 ) X m
TV(PXk , QXk ) ≤ c , QX k = λ i Q⊗ k
(I.17)
n−k+1 i
i=1

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-187


i i

Exercises for Part I 187

Pm
where i=1 λi = 1, λi ≥ 0 and Qi are some distributions on X and c > 0 is a universal constant.
Follow the steps:
(a) Show the identity (here PXk is arbitrary)
 Y
k  Xk− 1
D PXk PXj = I(Xj ; Xj+1 ).
j=1 j=1

(b) Show that there must exist some t ∈ {k, k + 1, . . . , n} such that
H(Xk−1 )
I(Xk−1 ; Xk |Xnt+1 ) ≤ .
n−k+1
(Hint: Expand I(Xk−1 ; Xnk ) via chain rule.)
(c) Show from 1 and 2 that
 Y  kH(Xk−1 )
D PXk |T PXj |T |PT ≤
n−k+1
n
where T = Xt+1 .
(d) By Pinsker’s inequality
h  i r
Y kH(Xk−1 )|X | 1
ET TV PXk |T , PXj |T ≤ c , c= p .
n−k+1 2 log e
Conclude (I.17) by the convexity of total variation.
Note: Another estimate [400, 123] is easy to deduce from Exercise I.51 and Exercise I.50: there
exists a mixture of iid QXk such that
k
min(2|X |, k − 1) .
TV(QXk , PXk ) ≤
n
The bound (I.17) improves the above only when H(X1 ) ≲ 1.
I.53 (Wringing lemma [140, 419]) Prove that for any δ > 0 and any (Un , Vn ) there exists an index
n n
set I ⊂ [n] of size |I| ≤ I(U δ;V ) such that
I(Ut ; Vt |UI , VI ) ≤ δ ∀ t ∈ [ n] .
When I(Un ; Vn )  n, this shows that conditioning on a (relatively few) entries, one can make
individual coordinates almost independent. (Hint: Show I(A, B; C, D) ≥ I(A; C) + I(B; D|A, C)
first. Then start with I = ∅ and if there is any index t s.t. I(Ut ; Vt |UI , VI ) > δ then add it to I and
repeat.)
I.54 (Generalization gap = ISKL , [18]) A learning algorithm selects a parameter W based on observing
(not necessarily independent) S1 , . . . , Sn , where all Si have a common marginal law PS , with the
goal of minimizing the loss on a fresh sample = E[ℓ(W, S)], where Sn ⊥ ⊥ S ∼ PS and ℓ is an
4
arbitrary loss function . Consider a Gibbs sampler (see Section 4.8.2) which chooses
αX
n
1
W ∼ PW|Sn (w|sn ) = n
π (w) exp{− ℓ(w, si )} ,
Z( s ) n
i=1

4
For example, if S = (X, Y) we may have ℓ(w, (x, y)) = 1{fw (x) 6= y} where fw denotes a neural network with weights w.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-188


i i

188 Exercises for Part I

where π (·) is a fixed prior on weights and Z(·) the normalization constant. Show that general-
ization gap of this algorithm is given by
1X
n
1
E[ℓ(W, S)] − E[ ℓ(W, Si )] = ISKL (W; Sn ) ,
n α
i=1

where ISKL is the symmetric KL-information defined in (7.49).


I.55 Let (X, Y) be some dependent random variables. Suppose that for every x the random variable
h(x, Y) is ϵ2 -subgaussian. Show that
p
EPX,Y [h(X, Y)] − EPX ×PY [h(X, Y)] ≤ 2ϵ2 I(X; Y) . (I.18)
This allows one to control expectations of functions of dependent random variables by replacing
them with independent pairs at the expense of (square-root of the) mutual information slack.
I.56 ([369]) Let A = {Aj : j ∈ J} be a countable collection of random variables and T is a J-valued
random index. Show that if each Aj is ϵ2 -subgaussian, then
p
| E[AT ]| ≤ 2ϵ2 I(A; T) .

P
I.57 (Divergence for mixtures [216, 249]) Let Q̄ = i π i Qi be a mixture distribution.
(a) Prove
!
X
D(PkQ̄) ≤ − log π i exp(−D(PkQi )) ,
i
P
improving over the simple convexity estimate D(PkQ̄) ≤ i π i D(PkQi ). (Hint: Prove that
the function Q 7→ exp{−aD(PkQ)} is concave for every a ≤ 1.)
(b) Furthermore, for any distribution {π̃ j }, any λ ∈ [0, 1] we have
X X X
π̃ j D(Qj kQ̄) + D(π kπ̃ ) ≥ − π i log π̃ j e−(1−λ)Dλ (Pi ∥Pj )
j i j
X
≥ − log π i π̃ j e−(1−λ)Dλ (Pi ∥Pj )
i,j

(Hint: Prove D(PA|B=b kQA ) ≥ − EA|B=b [log EA′ ∼QA gg((AA,,bb)) ] via Donsker-Varadhan. Plug in
g(a, b) = PB|A (b|a)1−λ , average over B and use Jensen to bring outer EB|A inside the log.)
I.58 (Mutual information and pairwise distances [216]) Suppose we have knowledge of pairwise
distances dλ (x, x′ ) ≜ Dλ (PY|X=x kPY|X=x′ ), where Dλ is the Rényi divergence of order λ. What
i.i.d.
can be said about I(X; Y)? Let X, X′ ∼ PX . Using Exercise I.57, prove that
I(X; Y) ≤ − E[log E[exp(−d1 (X, X′ ))|X]]
and for every λ ∈ [0, 1]
I(X; Y) ≥ − E[log E[exp(−(1 − λ)dλ (X, X′ ))|X]].
See Theorem 32.5 for an application.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-189


i i

Exercises for Part I 189

I.59 (D ≲ H2 log H12 trick) Show that for any P, U, R, λ > 1, and 0 < ϵ < 2−5 λ−1 we have
λ

 
λ 1
D(PkϵU + ϵ̄R) ≤ 8(H (P, R) + 2ϵ)
2
log + Dλ (PkU) .
λ−1 ϵ

Thus, a Hellinger ϵ-net for a set of P’s can be converted into a KL (ϵ2 log 1ϵ )-net; see
Theorem 32.6 in Section 32.2.4.
−1
(a) Start by proving the tail estimate for the divergence: For any λ > 1 and b > e(λ−1)
 
dP dP log b
EP log · 1{ > b} ≤ λ−1 exp{(λ − 1)Dλ (PkQ)}
dQ dQ b
(b) Show that for any b > 1 we have
 
b log b dP dP
D(PkQ) ≤ H2 (P, Q) √ + EP log · 1{ > b}
( b − 1)2 dQ dQ
h(x)
(Hint: Write D(PkQ) = EP [h( dQ
dP )] for h(x) = − log x + x − 1 and notice that

( x−1)2
is
monotonically decreasing on R+ .)
(c) Set Q = ϵU + ϵ̄R and show that for every δ < e− λ−1 ∧ 14
1

 1
D(PkQ) ≤ 4H2 (P, R) + 8ϵ + cλ ϵ1−λ δ λ−1 log ,
δ
where cλ = exp{(λ − 1)Dλ (PkU). (Notice H2 (P, Q) ≤ H2 (P, R) + 2ϵ, Dλ (PkQ) ≤
Dλ (PkU) + log 1ϵ and set b = 1/δ .)
2
(d) Complete the proof by setting δ λ−1 = 4H c(λPϵ,λ−
R)+2ϵ
1 .
I.60 Let G = (V, E) be a finite directed graph. Let

4 = (x, y, z) ∈ V3 : (x, y), (y, z), (z, x) ∈ E ,

∧ = (x, y, z) ∈ V3 : (x, y), (x, z) ∈ E .

Prove that 4 ≤ ∧.
Hint: Prove H(X, Y, Z) ≤ H(X) + 2H(Y|X) for random variables (X, Y, Z) distributed uniformly
over the set of directed 3-cycles, i.e. subsets X → Y → Z → X.
I.61 (Union-closed sets conjecture (UCSC)) Let X and Y be independent vectors in {0, 1}n .
Show [88]

H(X OR Y) ≥ (H(X) + H(Y)) , p̄ ≜ min min(P[Xi = 0], P[Yi = 0]) ,
2ϕ i

where OR denotes coordinatewise logical-OR and ϕ = 52−1 . (Hint: set Z = X OR Y, use chain
P
rule H(Z) ≥ i H(Zi |Xi−1 , Yi−1 ), and the inequality for binary-entropy h(ab) ≥ h(a)b2+ϕh(b)a ).
Comment: F ⊂ {0, 1}n is called a union-closed set if x, y ∈ F =⇒ (x OR y) ∈ F . The UCSC
states that p = maxi P[Xi = 1] ≥ 1/2, where X is uniform over F . Gilmer’s method [189]
applies the inequality above to Y taken to be an independent copy of X (so that H(X OR Y) ≤
H(X) = H(Y) = log |F|) to prove that p ≥ 1 − ϕ ≈ 0.382.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-190


i i

190 Exercises for Part I

I.62 (Compression for regression) Let Y ∈ [−1; 1] and X ∈ X with X being finite (for simplic-
ity). Auxiliary variables U, U′ in this exercise are assumed to be deterministic functions of X.
For simplicity assume X is finite (but giant). Let cmp(U) be a complexity measure satisfying
cmp(U, U′ ) ≤ cmp(U) + cmp(U′ ), cmp(constant) = 0 and cmp(Ber(p)) ≤ log 2 for any p
(think of H(U) or log |U|). Choose U to be a maximizer of I(U; Y) − δ cmp(U).
(a) Show that cmp(U) ≤ I(Xδ;Y)
(b) For any U′ show I(Y; U′ |U) ≤ δ cmp(U′ ) (Hint: check U′′ = (U, U′ ))
(c) For any event S = {X ∈ A} show

|E[(Y − E[Y|U])1S ]| ≤ 2δ ln 2 (I.19)
(Hint: by Cauchy-Schwarz only need to show E[| E[Y|U, 1S ]− E[Y|U]|2 ] ≲ δ , which follows
by taking U′ = 1S in b) and applying Tao’s inequality (7.28))
(d) By choosing a proper S and applying above to S and Sc conclude that

E[| E[Y|X] − E[Y|U]|] ≤ 2 2δ ln 2 .
(So any high-dimensional complex feature vector X can be compressed down to U whose car-
dinality is of order I(Y; X) (and independent of |X |) but which, nevertheless, is essentially as
good as X for regression; see [51] for other results on information distillation.)
I.63 (IT version of Szemerédi regularity [414]) Fix ϵ, m > 0 and consider random variables Y, X =
(X1 , X2 ) with Y ∈ [−1, 1], X = X1 × X2 finite (but giant) and I(X; Y) ≤ m. In this excercise,
all auxiliary random variables U have structure U = (f1 (X1 ), f2 (X2 )) for some deterministic
functions f1 , f2 . Thus U partitions X into product blocks and we call block U = u0 ϵ-regular if
|E[(Y − E[Y|U])1S |U = u0 ]| ≤ ϵ ∀S = {X1 ∈ A1 , X2 ∈ A2 } .
We will show there is J = J(ϵ, m) such that there exists a U with |U| ≤ J and such that
P[block U is not ϵ-regular] ≤ ϵ . (I.20)
(a) Suppose that we found random variables Y → X → U′ → U such that (i) I(Y; U′ |U) ≤ ϵ4
4
and (ii) for all S as above I(Y; 1S |U′ ) ≤ |Uϵ |2 . Then (I.20) holds with ϵ replaced by O(ϵ).
(Hint: define g(u0 ) = E[| E[Y|U′ ] − E[Y|U]| |U = u0 ] and show via (7.28) that E[g(U)] ≲ ϵ2 .
ϵ2
As in (I.19) argue that E[(Y − E[Y|U′ ])1S ] ≲ |U | . From triangle inequality any u0 -block is
O(ϵ)-regular whenever g(u0 ) < ϵ and P[U = u0 ] > |Uϵ | . Finally, apply Markov inequality
twice to show that the last condition is violated with O(ϵ) probability.)
(b) Show that such U′ , U indeed exist. (Hint: Construct a sequence Y → X → · · · Uj → Uj−1 →
· · · U0 = 0 sequentially by taking Uj+1 to be maximizer of I(Y; U) − δj+1 log |U| among all
4
Y → X → U → Uj (compare Exercise I.62) and δj+1 = |Uϵj |2 . We take U′ , U = Un+1 , Un
for the first pair that has I(Y; Un+1 |Un ) ≤ ϵ4 . Show n ≤ ϵm4 and |Un | is bounded by the n-th
iterate of map h → h exp{mh2 /ϵ4 } started from h = 1.)
Remark: The point is that J does not depend on PX,Y or |X |. For Szemerédi’s regularity lemma
one takes X1 , X2 to be uniformly sampled vertices of a bipartite graph and Y = 1{X1 ∼ X2 } is
the incidence relation. An ϵ-regular block corresponds to an ϵ-regular bipartite subgraph, and
lemma decomposes arbitrary graph into finitely many pairwise (almost) regular subgraphs.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-191


i i

Exercises for Part I 191

I.64 (Entropy and binary convolution) Binary convolution is defined for (a, b) ∈ [0, 1]2 by a ∗ b =
a(1 − b) + (1 − a)b and describes the law of Ber(a) ⊕ Ber(b) where ⊕ denotes modulo-2
addition.
(a) (Mrs. Gerber’s lemma, MGL5 ) Let (U, X) ⊥
⊥ Z ∼ Ber(δ) with X ∈ {0, 1}. Show that
H(X|U)H(Z)
h(h−1 (H(X|U)) ∗ δ) ≤ H(X ⊕ Z|U) ≤ H(X|U) + H(Z) − .
log 2
(Hint: equivalently [457], need to show that a parametric curve (h(p), h(p ∗ δ)), p ∈ [0, 1/2]
is convex.)
(b) Show that for any p, q the parametric curve ((1 − 2r)2 , d(p ∗ rkq ∗ r)), r ∈ [0, 1/2] is convex.
(Hint: see [367, Appendix A])
MGL has various applications (Example 16.1 and Exercise VI.21), it tensorizes (see Exer-
cise III.32) and its infinitesimal version (derivative in δ = 0+) is exactly the log-Sobolev
inequality for the hypercube [122, Section 4.1].
I.65 (log-Sobolev inequality, LSI) Let X be a Rd -valued random variable, E[kXk2 ] < ∞, and
X ⊥⊥ Z ∼ γ , where γ = N (0, Id ) is the standard Gaussian measure. Recall the notation for
Fisher information matrix J(·) from (2.40).
(a) Show de Bruijn’s identity:
d √ log e √
h(X + aZ) = tr J(X + aZ)
da 2
(Hint: inspect the proof of Theorem 3.14)
(b) Show that EPI implies
d √
exp{2h(X + aZ)/d} ≥ 2πe .
da
(c) Conclude that Gaussians minimize the differential entropy among all X with bounded Fisher
information J(X), namely [399]
n 2πen
h(X) ≥
log .
2 tr J(X)
R
(d) Show the LSI of Gross [200]: For any f with f2 dγ = 1, we have
Z Z
f2 ln(f2 )dγ ≤ 2 · k∇fk2 dγ .
R
(Hint: PX (dx) = f2 (x)γ(dx), prove 2 (xT ∇f)fγ(dx) = E[kXk2 ] − d and use ln(1 + y) ≤ y.)
I.66 (Stochastic localization [148, 146]) Consider a discrete X ∼ μ taking values in Rn and let ρ =
Pn
E[kX − E[X]k2 ] = i=1 Var[Xi ]. We will show that for any ϵ > 0 there exists a decomposition
of μ = Eθ μθ as a mixture of measures μθ , which have similar entropy ( Eθ [H( μθ )] = H( μ) −
O(ρ/ϵ)) but have almost no pairwise correlations (Eθ [Cov( μθ )]  ϵIn and Eθ [k Cov( μθ )k2F ] =
O(ϵρ)). This has useful applications in statistical physics of Ising models.

5
Apparently, named after a landlady renting a house to Wyner and Ziv [457] at the time.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-192


i i

192 Exercises for Part I

√ √
(a) Let Yt = tX + ϵZ, where X ⊥ ⊥ Z ∼ N (0, Id ) and t, ϵ > 0. Show that Cov(X|Yt )  ϵt In

(Hint: consider the suboptimal estimator X̂(Yt ) = Yt / t).
(b) Show that 0 ≤ H(X) − H(X|Yt ) ≤ n2 log(1 + ϵtn ρ) ≤ t log e
2Rϵ ρ. (Hint: use (5.22))
2
(c) Show that ρ ≥ mmse(X|Y1 ) − mmse(X|Y2 ) = 1ϵ 1 E[kΣt (Yt )k2F ]dt, where Σt (y) =
Cov[X|Yt = y]. (Hint: use (3.23)).
Thus we conclude that for some t ∈ [1, 2] decomposing μ = EYt PX|Yt satisfies the stated claims.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-193


i i

Part II

Lossless data compression

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-194


i i

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-195


i i

195

The principal engineering goal of data compression is to represent a given sequence


a1 , a2 , . . . , an produced by a source as a sequence of bits of minimal possible length with possible
algorithmic constraints. Of course, reducing the number of bits is generally impossible, unless the
source satisfies certain statistical restrictions, that is, only a small subset of all sequences actually
occur in practice. (Or, more precisely, only a small subset captures the majority of the overall
probability distribution) Is this the case for real-world data?
As a simple demonstration, one may take two English novels and compute empirical frequen-
cies of each letter. It will turn out to be the same for both novels (approximately). Thus, we can
see that there is some underlying structure in English texts restricting possible output sequences.
The structure goes beyond empirical frequencies of course, as further experimentation (involving
digrams, word frequencies etc) may reveal. Thus, the main reason for the possibility of data com-
pression is the experimental (empirical) law: Real-world sources produce very restricted sets of
sequences.
How do we model these restrictions? Further experimentation (with language, music, images)
reveals that frequently, the structure may be well described if we assume that sequences are gener-
ated probabilistically [378, Sec. III]. One of the lasting contributions of Shannon is the following
empirical law: real-world sequences may be described probabilistically with increasing precision
starting from i.i.d., first-order Markov, second-order Markov etc. Note that sometimes one needs
to find an appropriate basis in which this “law” holds: for language you have a choice of characters
or words, for images you have pixels and wavelets/local Fourier transforms. Indeed, a rasterized
sequence of pixels does not exhibit any stable local structure whereas changing basis to wavelets
and local Fourier transform reveals that structure.6 Finding correct representations are practically
very important, but in this book we assume this step has already been done and we are facing a
simple stochastic process (iid, Markov or ergodic). How do we represent it with least number of
bits?
In the beginning, we will simplify the problem even further and restrict attention to representing
one random variable X in terms of (minimal number of) bits. Later, X will be taken to be a large
n-letter chunk of the target process, i.e. X = Sn = (S1 , . . . , Sn ). The types of compression we will
consider in this book are:

• Variable-length lossless compression. Here we require P[X 6= X̂] = 0, where X̂ is the decoded
version. To make the question interesting, we compress X into a variable-length binary string. It
will turn out that optimal compression length is H(X) − O(log(1 + H(X))). If we further restrict
attention to so-called prefix-free or uniquely decodable codes, then the optimal compression
length is H(X) + O(1). Applying these results to n-letter variables X = Sn we see that optimal
compression length normalized by n converges to the entropy rate (Section 6.3) of the process
{Sj }.

6
Of course, one should not take these “laws” too far. In regards to language modeling, (finite-state) Markov assumption is
too simplistic to truly generate all proper sentences, cf. Chomsky [94]. However, astounding success of modern high-order
Markov models, such as GPT-4 [320], shows that such models are very difficult to distinguish from true language.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-196


i i

196

• Fixed-length, almost lossless compression. Here, we allow some very small (or vanishing with
n → ∞ when X = Sn ) probability of error, i.e. P[X 6= X̂] ≤ ϵ. It turns out that under mild
assumptions on the process {Sj }, here again we can compress to entropy rate but no more.
This mode of compression permits various beautiful results in the presence of side-information
(Slepian-Wolf, etc).
• Lossy compression. Here we require only E[d(X, X̂)] ≤ ϵ where d(·, ·) is some loss function.
This type of compression problems is the topic of Part V.

We also note that thinking of the X = Sn , it would be more correct to call the first two com-
pression types above as “fixed-to-variable” and “fixed-to-fixed”, because they take fixed number
of input letters and produce variable or fixed number of output bits. There exists other types of
compression algorithms, which we do not discuss, e.g. a beautiful variable-to-fixed compressor
of Tunstall [425].

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-197


i i

10 Variable-length compression

In this chapter we consider a basic question: how does one describe a discrete random variable
X ∈ X in terms of a variable-length bit string so that the description is the shortest possible. The
basic idea, already used in the telegraph’s Morse code, is completely obvious: shorter descriptions
(bit strings) should correspond to more probable symbols. Later, however, we will see that this
basic idea becomes a lot more subtle once we take X to mean a group of symbols. The discovery
of Shannon was that compressing groups of symbols together (even if they are iid!) can lead
to impressive savings in compressed length. That is, coding English text by first grouping 10
consecutive characters together is much better than doing so on a character-by-character basis. One
should appreciate boldness of Shannon’s proposition since sorting all possible 2610 realizations of
the 10-letter English chunks in the order of their decreasing frequency appears quite difficult. It is
only later, with the invention of Huffman coding, arithmetic coding and Lempel-Ziv compressors
(decades after) that these methods became practical and ubiquitous.
In this Chapter we discover that the minimal compression length of X is essentially equal to the
entropy H(X) for both the single-shot, uniquely-decodable and prefix-free codes. These results
are the first examples of coding theorems in our book, that is results connecting an operational
problem and an information measure. (For this reason, compression is also called source coding
in information theory.) In addition, we also discuss the so called Zipf law and how its widespread
occurrence can be described information-theoretically.

10.1 Variable-length lossless compression


The setting of the lossless data compression is depicted in the following figure.

X Compressor
{0, 1}∗ Decompressor X
f: X →{0,1}∗ g: {0,1}∗ →X

More formally, a function f : X → {0, 1}∗ is a variable-length single-shot lossless compressor


of a random variable X if it satisfies the following properties:

1 It maps each symbol x ∈ X into a variable-length string f(x) in {0, 1}∗ ≜ ∪k≥0 {0, 1}k =
{∅, 0, 1, 00, 01, . . . }. Each f(x) is referred to as a codeword and the collection of codewords the
codebook.

197

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-198


i i

198

PX (i)

i
1 2 3 4 5 6 7 ···

f

∅ 0 1 00 01 10 11 ···

Figure 10.1 Illustration of the optimal variable-length lossless compressor f∗ .

2 It is lossless for X: there exists a decompressor g : {0, 1}∗ → X such that P [X = g(f(X))] = 1.
In other words, f is injective on the support of PX .

Notice that since {0, 1}∗ is countable, lossless compression is only possible for discrete X. Also,
since the structure of X is not important, we can relabel X such that X = N = {1, 2, . . . } and
sort the PMF decreasingly: PX (1) ≥ PX (2) ≥ · · · . In a single-shot compression setting, cf. [251],
we do not impose any additional constraints on the map f. Later in Section 10.3 we will introduce
conditions such as prefix-freeness and unique-decodability.
To quantify how good a compressor f is, we introduce the length function l : {0, 1}∗ → Z+ , e.g.,
l(∅) = 0, l(01001) = 5. We could consider different objectives for selecting the best compressor f,
for example, minimizing any of E[l(f(X))], esssup l(f(X)), median[l(f(X))] appears reasonable. It
turns out that there is a compressor f∗ that minimizes all objectives simultaneously. As mentioned
in the preface to this chapter, the main idea is to assign longer codewords to less likely symbols,
and reserve the shorter codewords for more probable symbols. To make precise of the optimality
of f∗ , let us recall the concept of stochastic dominance.

Definition 10.1 (Stochastic dominance) For real-valued random variables X and Y, we


st.
say Y stochastically dominates (or, is stochastically larger than) X, denoted by X ≤ Y, if P [Y ≤ t] ≤
P [X ≤ t] for all t ∈ R.

st.
By definition, X ≤ Y if and only if the CDF of X is larger than that of Y pointwise; in other words,
the distribution of X assigns more probability to lower values than that of Y does. In particular, if
X is dominated by Y stochastically, so are their means, medians, supremum, etc.

Theorem 10.2 (Optimal f∗ ) Consider the compressor f∗ defined (for a down-sorted PMF
PX ) by f∗ (1) = ∅, f∗ (2) = 0, f∗ (3) = 1, f∗ (4) = 00, etc, assigning strings with increasing lengths
to symbols i ∈ X . (See Figure 10.1 for an illustration.) Then

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-199


i i

10.1 Variable-length lossless compression 199

1 Length of codeword:

l(f∗ (i)) = blog2 ic.

2 l(f∗ (X)) is stochastically the smallest: For any lossless compressor f : X → {0, 1}∗ ,
st.
l(f∗ (X)) ≤ l(f(X))

i.e., for any k, P[l(f(X)) ≤ k] ≤ P[l(f∗ (X)) ≤ k]. As a result, E[l(f∗ (X))] ≤ E[l(f(X))].

Proof. Note that


X
k
|Ak | ≜ |{x : l(f(x)) ≤ k}| ≤ 2i = 2k+1 − 1 = |{x : l(f∗ (x)) ≤ k}| ≜ |A∗k |.
i=0

Here the inequality is because f is lossless so that |Ak | can at most be the total number of binary
strings of length up to k. Then
X X
P[l(f(X)) ≤ k] = PX (x) ≤ PX (x) = P[l(f∗ (X)) ≤ k], (10.1)
x∈Ak x∈A∗
k

since |Ak | ≤ |A∗k | and A∗k contains all 2k+1 − 1 most likely symbols.

Having identified the optimal compressor the next question is to understand its average com-
pression length E[ℓ(f∗ (X))]. It turns out that one can in fact compute it exactly as an infinite series,
see Exercise II.1. However, much more importantly, it turns out to be essentially equal to H(X).
Specifically, we have the following result.

Theorem 10.3 (Optimal average code length vs entropy [14])


H(X) bits − log2 [e(H(X) + 1)] ≤ E[l(f∗ (X))] ≤ H(X) bits

Remark 10.1 (Source coding theorem) Theorem 10.3 is the first example of a coding
theorem in this book, which relates the fundamental limit E[l(f∗ (X))] (an operational quantity) to
the entropy H(X) (an information measure).

Proof. Define L(X) = l(f∗ (X))). For the upper bound, observe that since the PMF are ordered
decreasingly by assumption, PX (m) ≤ 1/m, so L(m) ≤ log2 m ≤ log2 (1/PX (m)). Taking
expectation yields E[L(X)] ≤ H(X).
For the lower bound,
( a)
H(X) = H(X, L) = H(X|L) + H(L) ≤ E[L] + H(L)
 
(b) 1
≤ E [ L] + h (1 + E[L])
1 + E[L]

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-200


i i

200

 
1
= E[L] + log2 (1 + E[L]) + E[L] log2 1 + (10.2)
E [ L]
( c)
≤ E[L] + log2 (1 + E[L]) + log2 e
(d)
≤ E[L] + log2 (e(1 + H(X)))

where in (a) we have used the fact that H(X|L = k) ≤ k bits, because f∗ is lossless, so that given
f∗ (X) ∈ {0, 1}k , X can take at most 2k values; (b) follows by Exercise I.4; (c) is via x log(1 + 1/x) ≤
log e, ∀x > 0; and (d) is by the previously shown upper bound H(X) ≤ E[L].
To give an illustration, we need to introduce an important method of going from a single-letter
i.i.d.
source to a multi-letter one, already alluded to in the preface. Suppose that Sj ∼ PS (this is called a
memoryless source). We can group n letters of Sj together and consider X = Sn as one super-letter.
Applying our results to random variable X we obtain:
nH(S) ≥ E[l(f∗ (Sn ))] ≥ nH(S) − log2 n + O(1).
In fact for memoryless sources, the exact asymptotic behavior is found in [408, Theorem 4]:
(
∗ n nH(S) + O(1) , PS = Unif
E[l(f (S ))] = .
nH(S) − 2 log2 n + O(1) , PS 6= Unif
1

1
For the case of sources for which log2 PS has non-lattice distribution, it is further shown in [408,
Theorem 3]:
1
E[l(f∗ (Sn ))] = nH(S) − log2 (8πeV(S)n) + o(1) , (10.3)
2
where V(S) is the varentropy of the source S:
 1 
V(S) ≜ Var log2 . (10.4)
PS (S)

Theorem 10.3 relates the mean of l(f∗ (X)) to that of log2 PX1(X) (entropy). It turns out that
distributions of these random variables are also closely related.

Theorem 10.4 (Code length distribution of f∗ ) ∀τ > 0, k ≥ 0,


   
1 ∗ 1
P log2 ≤ k ≤ P [l(f (X)) ≤ k] ≤ P log2 ≤ k + τ + 2−τ +1 .
PX (X) PX (X)

Proof. Lower bound (achievability): Use PX (m) ≤ 1/m. Then similarly as in Theorem 10.3,
L(m) = blog2 mc ≤ log2 m ≤ log2 PX 1(m) . Hence L(X) ≤ log2 PX1(X) a.s.
Upper bound (converse): Consider, the following chain
   
1 1
P [L ≤ k] = P L ≤ k, log2 ≤ k + τ + P L ≤ k, log2 >k+τ
PX (X) PX (X)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-201


i i

10.1 Variable-length lossless compression 201

  X
1 
≤ P log2 ≤k+τ + PX (x)1 {l(f∗ (x)) ≤ k}1 PX (x) ≤ 2−k−τ
PX (X)
x∈X
 
1
≤ P log2 ≤ k + τ + (2k+1 − 1) · 2−k−τ
PX (X)

Remark 10.2 (Achievability vs converse) Traditionally, in information theory positive


results (“compression length is smaller than ...”) are called achievability and negative results
(“compression length cannot be smaller than ...”) are called converse.
So far our discussion applies to an arbitrary random variable X. Next we consider the source as
a random process (S1 , S2 , . . .) and introduce blocklength n. We apply our results to X = Sn , that is,
by treating the first n symbols as a supersymbol. The following corollary states that the limiting
behavior of l(f∗ (Sn )) and log PSn 1(Sn ) always coincide.

Corollary 10.5 Let (S1 , S2 , . . .) be a random process and U, V real-valued random variable.
Then
1 1 d 1 ∗ n d
log2 →U
− ⇔ l(f (S ))−
→U (10.5)
n PSn (Sn ) n
and
 
1 1 1
√ (l(f∗ (Sn )) − H(Sn ))−
d d
√ log2 n
− H( S ) →
n
−V ⇔ →V (10.6)
n PS (S )
n n

Proof. First recall that convergence in distribution is equivalent to convergence of CDF at all
d
→U ⇔ P [Un ≤ u] → P [U ≤ u] for all u at which point the CDF of U is
continuity point, i.e., Un −
continuous (i.e., not an atom of U).

To get (10.5), apply Theorem 10.4 with k = un and τ = n:
     
1 1 1 ∗ 1 1 1 √
P log2 ≤ u ≤ P l(f (X)) ≤ u ≤ P log2 ≤ u + √ + 2− n+1 .
n PX (X) n n PX (X) n

To get (10.6), apply Theorem 10.4 with k = H(Sn ) + nu and τ = n1/4 :
     ∗ n 
1 1 l(f (S )) − H(Sn )
P √ log − H( S ) ≤ u ≤ P
n
√ ≤u
n PSn (Sn ) n
   
1 1 −1/4
+ 2−n +1 .
1/ 4
≤P √ log n
− H ( S n
) ≤ u + n
n PSn (S )
(10.7)

Now let us particularize the preceding theorem to memoryless sources of i.i.d. Sj ’s. The
important observation is that the log likelihood becomes an i.i.d. sum:
1 X n
1
log n
= log .
PSn (S ) PS (Si )
i=1 | {z }
i.i.d.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-202


i i

202

This implies several results at once:

P
1 By the weak law of large numbers (WLLN), we know that n1 log PSn 1(Sn ) −→E log PS1(S) = H(S).
Therefore in (10.5) the limiting distribution U is degenerate, i.e., U = H(S), and we have the
following result of fundamental importance:1
1 ∗ n P
l(f (S ))−
→H(S) .
n
That is, the optimal compression rate of an iid process converges to its entropy rate. This is
a version of Shannon’s source coding theorem, which we will also discuss in the subsequent
chapter.
2 By the Central Limit Theorem (CLT), if varentropy V(S) < ∞, then we know that V in (10.6)
is Gaussian, i.e.,
 
1 1 d
p log n)
− nH(S) −→N (0, 1).
nV(S) PSn ( S

Consequently, we have the following Gaussian approximation for the probability law of the
optimal code length
1
(l(f∗ (Sn )) − nH(S))−
d
p →N (0, 1),
nV(S)

or, in shorthand,
p
l(f∗ (Sn )) ≈ nH(S) + nV(S)N (0, 1) in distribution.

Gaussian approximation tells us the speed of convergence 1n l(f∗ (Sn )) → H(S) and also gives us
a good approximation of the distribution of length at finite n.

Example 10.1 (Ternary source) Next we apply our bounds to approximate the distribu-
tion of l(f∗ (Sn )) in a concrete example. Consider a memoryless ternary source outputting i.i.d. n
symbols from the distribution PS = [0.445, 0.445, 0.11]. We first compare different results on the
minimal expected length E[l(f∗ (Sn ))] in the following table:

Blocklength Lower bound (10.3) E[l(f∗ (Sn ))] H(Sn ) (upper bound) asymptotics (10.3)
n = 20 21.5 24.3 27.8 23.3 + o(1)
n = 100 130.4 134.4 139.0 133.3 + o(1)
n = 500 684.1 689.2 695.0 688.1 + o(1)

In all cases above E[l(f∗ (S))] is close to a midpoint between the bounds.

1
Convergence to a constant in distribution is equivalent to that in probability.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-203


i i

10.2 Mandelbrot’s argument for universality of Zipf’s (power) law 203

Optimal compression: CDF, n = 200, PS = [0.445 0.445 0.110] Optimal compression: PMF, n = 200, P S = [0.445 0.445 0.110]
1 0.06
True PMF
Gaussian approximation
Gaussian approximation (mean adjusted)
0.9

0.05
0.8

0.7

0.04

0.6

0.5
P

0.03

P
0.4

0.02
0.3

0.2
0.01

True CDF
0.1 Lower bound
Upper bound
Gaussian approximation
Gaussian approximation (mean adjusted)
0 0
1.25 1.3 1.35 1.4 1.45 1.5 1.25 1.3 1.35 1.4 1.45 1.5
Rate Rate

Figure 10.2 Left plot: Comparison of the true CDF of l(f∗ (Sn )), bounds of Theorem 10.4 (optimized over τ ),
and the Gaussian approximations in (10.8) and (10.9). Right plot: PMF of the optimal compression length
l(f∗ (Sn )) and the two Gaussian approximations.

Next we consider the distribution of l(f∗ (Sn ). Its Gaussian approximation is defined as
p
nH(S) + nV(S)Z , Z ∼ N ( 0, 1) . (10.8)

However, in view of (10.3) we also define the mean-adjusted Gaussian approximation as


1 p
nH(S) − log2 (8πeV(S)n) + nV(S)Z , Z ∼ N ( 0, 1) . (10.9)
2
Figure 10.2 compares the true distribution of l(f∗ (Sn )) with bounds and two Gaussian approxi-
mations.

10.2 Mandelbrot’s argument for universality of Zipf’s (power) law


Given a corpus of text it is natural to plot its rank-frequency table by sorting the word frequencies
according to their rank p1 ≥ p2 ≥ · · · . The resulting tables, as noticed by Zipf [477], satisfy
pr  r−α for some value of α. Remarkably, this holds across various corpi of text in multiple dif-
ferent languages (and with α ≈ 1) – see Figure 10.3 for an illustration. Even more surprisingly, a
lot of other similar tables possess the power-law distribution: “city populations, the sizes of earth-
quakes, moon craters, solar flares, computer files, wars, personal names in most cultures, number
of papers scientists write, number of citations a paper receives, number of hits on web pages, sales
of books and music recordings, number of species in biological taxa, people’s incomes” (quoting
from [315], which gives references for each study). This spectacular universality of the power law
continues to provoke scientists from many disciplines to suggest explanations for its occurrence;
see [305] for a survey of such. One of the earliest (in the context of natural language of Zipf) is
due to Mandelbrot [291] and is in fact intimately related to the topic of this Chapter.
Let us go back to the question of minimal expected length of the representation of source X. We
have shown bounds on this quantity in terms of the entropy of X in Theorem 10.3. Let us introduce

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-204


i i

204

Figure 10.3 The log-log frequency-rank plots of the most used words in various languages exhibit a power
law tail with exponent close to 1, as popularized by Zipf [477]. Data from [398].

the following function

H(Λ) = sup{H(X) : E[l(f(X))] ≤ Λ} ,


f,PX

where optimization is over lossless encoders and probability distributions PX = {pj : j = 1, . . .}.
Theorem 10.3 (or more precisely, the intermediate result (10.2)) shows that

Λ log 2 ≤ H(Λ) ≤ Λ log 2 + (1 + Λ) log(1 + Λ) − Λ log Λ .

It turns out that the upper bound is in fact tight. Furthermore, among all distributions the optimal
tradeoff between entropy and minimal compression length is attained at power law distributions.
To show that, notice that in computing H(Λ), we can restrict attention to sorted PMFs p1 ≥ p2 ≥
· · · (call this class P ↓ ), for which the optimal encoder is such that l(f(j)) = blog2 jc (Theorem 10.2).
Thus, we have shown
X
H(Λ) = sup {H(P) : pj blog2 jc ≤ Λ} .
P∈P ↓ j

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-205


i i

10.2 Mandelbrot’s argument for universality of Zipf’s (power) law 205

Next, let us fix the base of the logarithm of H to be 2, for convenience. (We will convert to arbitrary
base at the end). Applying Example 5.2 we obtain:
H(Λ) ≤ inf λΛ + log2 Z(λ) , (10.10)
λ>0
P∞ P∞
where Z(λ) = n=1 2−λ⌊log2 n⌋ = m=0 2(1−λ)m = 1−211−λ if λ > 1 and Z(λ) = ∞ otherwise.
Clearly, the infimum over λ > 0 is a minimum attained at a value λ∗ > 1 satisfying
d
Λ=− log2 Z(λ) .
dλ λ=λ∗

Define the distribution


1 −λ⌊log2 n⌋
P λ ( n) ≜ 2 , n≥1
Z(λ)
and notice that
d 21−λ
EPλ [blog2 Xc] = − log2 Z(λ) =
dλ 1 − 21−λ
H(Pλ ) = log2 Z(λ) + λ EPλ [blog2 Xc] .
Comparing with (10.10) we find that the upper bound in (10.10) is tight and attained by Pλ∗ . From
the first equation above, we also find λ∗ = log2 2+Λ2Λ . Altogether this yields
H(Λ) = Λ log 2 + (Λ + 1) log(Λ + 1) − Λ log Λ ,

and the extremal distribution Pλ∗ (n)  n−λ is power-law distribution with the exponent λ∗ → 1
as Λ → ∞.

The argument of Mandelbrot [291] The above derivation shows a special (extremality) prop-
erty of the power law, but falls short of explaining its empirical ubiquity. Here is a way to connect
the optimization problem H(Λ) to the evolution of the natural language. Suppose that there is a
countable set S of elementary concepts that are used by the brain as building blocks of perception
and communication with the outside world. As an approximation we can think that concepts are
in one-to-one correspondence with language words. Now every concept x is represented internally
by the brain as a certain pattern, in the simplest case – a sequence of zeros and ones of length l(f(x))
([291] considers more general representations). Now we have seen that the number of sequences
of concepts with a composition P grows exponentially (in length) with the exponent given by
H(P), see Proposition 1.5. Thus in the long run the probability distribution P over the concepts
results in the rate of information transfer equal to EP [Hl((fP(X) ))] . Mandelbrot concludes that in order
to transfer maximal information per unit, language and brain representation co-evolve in such a
way as to maximize this ratio. Note that
H(P) H(Λ)
sup = sup .
P,f EP [l(f(X))] Λ Λ
It is not hard to show that H(Λ) is concave and thus the supremum is achieved at Λ = 0+ and
equals infinity. This appears to have not been observed by Mandelbrot. To fix this issue, we can

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-206


i i

206

postulate that for some unknown physiological reason there is a requirement of also having a
certain minimal entropy H(P) ≥ h0 . In this case
H(P) h0
sup = −1
P,f:H(P)≥h0 EP [l(f(X))] H ( h0 )

and the supremum is achieved at a power law distribution P. Thus, the implication is that the fre-
quency of word usage in human languages evolves until a power law is attained, at which point it
maximizes information transfer within the brain. That’s the gist of the argument of [291]. It is clear
that this does not explain appearance of the power law in other domains, for which other explana-
tions such as preferential attachment models are more plausible, see [305]. Finally, we mention
that the Pλ distributions take discrete values 2−λm−log2 Z(λ) , m = 0, 1, 2, . . . with multiplicities 2m .
Thus Pλ appears as a rather unsightly staircase on frequency-rank plots such as Figure 10.3. This
artifact can be alleviated by considering non-binary brain representations with unequal lengths of
signals.

10.3 Uniquely decodable codes, prefix codes and Huffman codes


In the previous sections we have studied f∗ , which achieves the stochastically (in particular, in
expectation) shortest code length among all variable-length lossless compressors. Note that f∗ is
obtained by ordering the PMF and assigning shorter codewords to more likely symbols. In this
section we focus on a specific class of compressors with good algorithmic properties which lead to
low complexity decoding and short delay when decoding from a stream of compressed bits. This
part is more combinatorial in nature.
S
We start with a few definitions. Let A+ = n≥1 An denotes all non-empty finite-length strings
consisting of symbols from the alphabet A. Throughout this chapter A is a countable set.

Definition 10.6 (Extension of a code) The (symbol-by-symbol) extension of f : A →


{0, 1}∗ is f : A+ → {0, 1}∗ where f(a1 , . . . , an ) = (f(a1 ), . . . , f(an )) is defined by concatenating
the bits.

Definition 10.7 (Uniquely decodable codes) f : A → {0, 1}∗ is uniquely decodable if


its extension f : A+ → {0, 1}∗ is injective.

Definition 10.8 (Prefix codes) f : A → {0, 1}∗ is a prefix code2 if no codeword is a prefix
of another (e.g., 010 is a prefix of 0101).

Example 10.2 A = {a, b, c}.

2
Also known as prefix-free/comma-free/self-punctuating/instantaneous code.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-207


i i

10.3 Uniquely decodable codes, prefix codes and Huffman codes 207

• f(a) = 0, f(b) = 1, f(c) = 10. Not uniquely decodable, since f(ba) = f(c) = 10.
• f(a) = 0, f(b) = 10, f(c) = 11. Uniquely decodable and a prefix code.
• f(a) = 0, f(b) = 01, f(c) = 011, f(d) = 0111 Uniquely decodable but not a prefix code, since
as long as 0 appears, we know that the previous codeword has terminated.3

Remark 10.3

1 Prefix codes are uniquely decodable and hence lossless, as illustrated in the following picture:

all lossless codes

uniquely decodable codes

prefix codes

Huffman
code

2 Similar to prefix-free codes, one can define suffix-free codes. Those are also uniquely decodable
(one should start decoding in reverse direction).
3 By definition, any uniquely decodable code does not have the empty string as a codeword. Hence
f : X → {0, 1}+ in both Definition 10.7 and Definition 10.8.
4 Unique decodability means that one can decode from a stream of bits without ambiguity, but
one might need to look ahead in order to decide the termination of a codeword. (Think of the
last example). In contrast, prefix codes allow the decoder to decode instantaneously without
looking ahead.
5 Prefix codes are in one-to-one correspondence with binary trees (with codewords at leaves). It
is also equivalent to strategies to ask “yes/no” questions previously mentioned at the end of
Section 1.1.

Theorem 10.9 (Kraft-McMillan)

3
In this example, if 0 is placed at the very end of each codeword, the code is uniquely decodable, known as the unary code.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-208


i i

208

1 Let f : A → {0, 1}∗ be uniquely decodable. Set la = l(f(a)). Then f satisfies the Kraft inequality
X
2−la ≤ 1. (10.11)
a∈A

2 Conversely, for any set of code length {la : a ∈ A} satisfying (10.11), there exists a prefix code
f, such that la = l(f(a)). Moreover, such an f can be computed efficiently.

Remark 10.4 The consequence of Theorem 10.9 is that as far as compression efficiency is
concerned, we can ignore those uniquely decodable codes that are not prefix codes.

Proof. We prove the Kraft inequality for prefix codes and uniquely decodable codes separately.
The proof for the former is probabilistic, following ideas in [15, Exercise 1.8, p. 12]. Let f be a
prefix code. Let us construct a probability space such that the LHS of (10.11) is the probability
of some event, which cannot exceed one. To this end, consider the following scenario: Generate
independent Ber( 12 ) bits. Stop if a codeword has been written, otherwise continue. This process
P
terminates with probability a∈A 2−la . The summation makes sense because the events that a
given codeword is written are mutually exclusive, thanks to the prefix condition.
Now let f be a uniquely decodable code. The proof uses generating function as a device for
counting. (The analogy in coding theory is the weight enumerator function.) First assume A is
P PL
finite. Then L = maxa∈A la is finite. Let Gf (z) = a∈A zla = l=0 Al (f)zl , where Al (f) denotes
the number of codewords of length l in f. For k ≥ 1, define fk : Ak → {0, 1}+ as the symbol-
P k k P P
by-symbol extension of f. Then Gfk (z) = ak ∈Ak zl(f (a )) = a1 · · · ak zla1 +···+lak = [Gf (z)]k =
PkL k l
l=0 Al (f )z . By the unique decodability of f, fk is lossless. Hence Al (fk ) ≤ 2l . Therefore we have
P
Gf (1/2)k = Gfk (1/2) ≤ kL for all k. Then a∈A 2−la = Gf (1/2) ≤ limk→∞ (kL)1/k = 1. If A is
P
countably infinite, for any finite subset A′ ⊂ A, repeating the same argument gives a∈A′ 2−la ≤
1. The proof is complete by the arbitrariness of A′ .
P
Conversely, given a set of code lengths {la : a ∈ A} s.t. a∈A 2−la ≤ 1, construct a prefix
code f as follows: First relabel A to N and assume that 1 ≤ l1 ≤ l2 ≤ . . .. For each i, define

X
i−1
ai ≜ 2−lk
k=1

with a1 = 0. Then ai < 1 by Kraft inequality. Thus we define the codeword f(i) ∈ {0, 1}+ as the
first li bits in the binary expansion of ai . Finally, we prove that f is a prefix code by contradiction:
Suppose for some j > i, f(i) is the prefix of f(j), since lj ≥ li . Then aj − ai ≤ 2−li , since they agree
on the most significant li bits. But aj − ai = 2−li + 2−li+1 +. . . > 2−li , which is a contradiction.

Remark
P
10.5 A conjecture of Ahlswede et al [7] states that for any set of lengths for which
2−la ≤ 43 there exists a fix-free code (i.e. one which is simultaneously prefix-free and suffix-
free). So far, existence has only been shown when the Kraft sum is ≤ 58 , cf. [466].

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-209


i i

10.3 Uniquely decodable codes, prefix codes and Huffman codes 209

In view of Theorem 10.9, the optimal average code length among all prefix (or uniquely
decodable) codes is given by the following optimization problem
X
L∗ (X) ≜ min PX (a)la (10.12)
a∈A
X
s.t. 2− l a ≤ 1
a∈A

la ∈ N

This is an integer programming (IP) problem, which, in general, is computationally hard to solve.
It is remarkable that this particular IP can be solved in near-linear time, thanks to the Huffman
algorithm. Before describing the construction of Huffman codes, let us give bounds to L∗ (X) in
terms of entropy:

Theorem 10.10
H(X) ≤ L∗ (X) ≤ H(X) + 1 bit. (10.13)
 
Proof. Right inequality: Consider the following length assignment la = log2 PX1(a) ,4 which
P P
satisfies Kraft since l a∈A 2−la m≤ a∈A PX (a) = 1. By Theorem 10.9, there exists a prefix code
f such that l(f(a)) = log2 PX1(a) and El(f(X)) ≤ H(X) + 1.
Light inequality: We give two proofs for this converse. One of the commonly used ideas to deal
with combinatorial optimization is relaxation. Our first idea is to drop the integer constraints in
(10.12) and relax it into the following optimization problem, which obviously provides a lower
bound
X
L∗ (X) ≜ min PX (a)la (10.14)
a∈A
X
s.t. 2− l a ≤ 1
a∈A

This is a nice convex optimization problem, with an affine objective function and a convex feasible
set. Solving (10.14) by Lagrange multipliers (Exercise!) yields that the minimum is equal to H(X)
(achieved at la = log2 PX1(a) ).
Another proof is the following: For any f whose codelengths {la } satisfying the Kraft inequality,
− la
define a probability measure Q(a) = P 2 2−la . Then
a∈A

X
El(f(X)) − H(X) = D(PkQ) − log 2−la ≥ 0.
a∈A

4
Such a code is called a Shannon code.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-210


i i

210

Next we describe the Huffman code, which achieves the optimum in (10.12). In view of the fact
that prefix codes and binary trees are one-to-one, the main idea of the Huffman code is to build
the binary tree from the bottom up: Given a PMF {PX (a) : a ∈ A},

1 Choose the two least-probable symbols in the alphabet.


2 Delete the two symbols and add a new symbol (with combined probabilities). Add the new
symbol as the parent node of the previous two symbols in the binary tree.

The algorithm terminates in |A| − 1 steps. Given the binary tree, the code assignment can be
obtained by assigning 0/1 to the branches. Therefore the time complexity is O(|A|) (sorted PMF)
or O(|A| log |A|) (unsorted PMF).

Example 10.3 A = {a, b, c, d, e}, PX = {0.25, 0.25, 0.2, 0.15, 0.15}.


Huffman tree: Codebook:
0 1 f(a) = 00
0.55 0.45 f(b) = 10
0 1 0 1 f(c) = 11
f(d) = 010
a 0.3 b c
0 1 f(e) = 011

d e

Theorem 10.11 (Optimality of Huffman codes) The Huffman code achieves the minimal
average code length (10.12) among all prefix (or uniquely decodable) codes.

Proof. See [106, Sec. 5.8].

Remark 10.6 (Drawbacks of Huffman codes)

1 Constructing the Huffman code requires knowing the source distribution. This brings us the
question: Is it possible to design universal compressor which achieves entropy for a class of
source distributions? And what is the price to pay? These questions are the topic of universal
compression and will be addressed in Chapter 13.
2 To understand the main limitation of Huffman coding, we recall that (as Shannon pointed out),
while Morse code already exploits the nonequiprobability of English letters, working with
pairs (or more generally, n-grams) of letters achieves even more compression, since letters in
a pair are not independent. In other words, to compress a block of symbols (S1 , . . . , Sn ) by
applying Huffman code on a symbol-by-symbol basis one can achieve an average length of
Pn
i=1 H(Si ) + n bits. But applying Huffman codes on a whole block (S1 , . . . , Sn ), that is the
code designed for PS1 ,...,Sn , allows to exploit the memory in the source and achieve compres-
P
sion length H(S1 , . . . , Sn ) + O(1). Due to (1.3) the joint entropy is smaller than i H(Si ) (and

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-211


i i

10.3 Uniquely decodable codes, prefix codes and Huffman codes 211

usually much smaller). However, the drawback of this idea is that constructing the Huffman
code has complexity |A|n – exponential in the blocklength.
To resolve these problems we will later study other methods:

1 Arithmetic coding has a sequential encoding algorithm with complexity linear in the block-
length, while still attaining H(Sn1 ) length – Section 13.1.
2 Lempel-Ziv algorithm also has low-complexity and is even universal, provably optimal for all
ergodic sources – Section 13.8.
As a summary of this chapter, we learned the following relationship between entropy and
compression length of various codes:
H(X) − log2 [e(H(X) + 1)] ≤ E[l(f∗ (X))] ≤ H(X) ≤ E[l(fHuffman (X))] ≤ H(X) + 1.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-212


i i

11 Fixed-length compression and Slepian-Wolf


theorem

In previous chapter we introduced the concept of variable-length compression and studied its
fundamental limits with and without prefix-free condition. In some situations, however, one may
desire that the output of the compressor always be of a fixed length, say, k bits. Unless k is unrea-
sonably large, then, this will require relaxing the losslessness condition. This is the focus of this
chapter: compression in the presence of (typically vanishingly small) probability of error. It turns
out allowing even very small error enables several beautiful effects:

• The possibility to compress data via matrix multiplication over finite fields (linear compression
or hashing).
• The possibility to reduce compression length from H(X) to H(X|Y) if side information Y is
available at the decompressor (Slepian-Wolf).
• The possibility to reduce compression length below H(X) if access to a compressed representa-
tion of side-information Y is available at the decompressor (Ahlswede-Körner-Wyner).

All of these effects are ultimately based on the fundamental property of many high-dimensional
probability distributions, the asymptotic equipartition (AEP), which we study in the context of iid
distributions. Later we will extend this property to all ergodic processes in Chapter 12.

11.1 Source coding theorems


The coding paradigm in this section is illustrated as follows:

X Compressor {0, 1}k Decompressor X ∪ {e}


f: X →{0,1}k g: {0,1}k →X ∪{e}

Note that if we insist like in Chapter 10 that g(f(X)) = X with probability one, then k ≥
log2 |supp(PX )| and no meaningful compression can be achieved. It turns out that by tolerating
a small error probability, we can gain a lot in terms of code length! So, instead of requiring
g(f(x)) = x for all x ∈ X , consider only lossless decompression for a subset S ⊂ X :
(
x x∈S
g(f(x)) =
e x 6∈ S

212

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-213


i i

11.1 Source coding theorems 213

and the probability of error is:

P [g(f(X)) 6= X] = P [g(f(X)) = e] = P [X ∈
/ S] .

We summarize this formally next.

Definition 11.1 A compressor-decompressor pair (f, g) is called a fixed-length almost-lossless


(k, ϵ) source code for X ∈ X , or (k, ϵ)-code for short, if:

f : X → {0, 1}k
g : {0, 1}k → X ∪ {e}

such that g(f(x)) ∈ {x, e} for all x ∈ X and P [g(f(X)) = e] ≤ ϵ. The fundamental limit of fixed-
length compression is simply the minimum probability of error and is defined as

ϵ∗ (X, k) ≜ inf{ϵ : ∃(k, ϵ)-code for X} .

The following result connects the respective fundamental limits of fixed-length almost lossless
compression and variable-length lossless compression (Section 10.1):

Theorem 11.2 (Fundamental limit of fixed-length compression) Recall the optimal


variable-length compressor f∗ defined in Theorem 10.2 and assume as before that X = N and
PX (1) ≥ PX (2) ≥ · · · . Then
X
ϵ∗ (X, k) = P [l(f∗ (X)) ≥ k] = PX (x).
x≥2k

Proof. Note that because of the assumption X = N compressor must reserve one k-bit string for
the error message even if PX (1) = 1. The proof is essentially tautological. Note 1 + 2 +· · ·+ 2k−1 =
2k − 1. Let S be the set of top 2k − 1 most likely (as measured by PX (x)) elements x ∈ X . Then

ϵ∗ (X, k) = P [X 6∈ S] = P [l(f∗ (X)) ≥ k] .

The last equality follows from (10.1).


Comparing Theorems 10.2 and 11.2, we see that the optimal codes in these two settings work
as follows:

• Variable-length: f∗ encodes the 2k − 1 symbols with the highest probabilities to


{ϕ, 0, 1, 00, . . . , 1k−1 }.
• Fixed-length: The optimal compressor f maps the elements of S into (00 . . . 00), . . . , (11 . . . 10)
and the rest in X \S to (11 . . . 11). The decompressor g decodes perfectly except for outputting
e upon receipt of (11 . . . 11).

Remark 11.1 (Detectable vs undetectable errors) In Definition 11.1 we required that


the errors be always detectable, i.e., g(f(x)) = x or e. Alternatively, we can drop this requirement
and allow undetectable errors, in which case we can of course do better since we have more

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-214


i i

214

freedom in designing codes. It turns out that we do not gain much by this relaxation. Indeed, if
we define

ϵ̃∗ (X, k) = inf{P [g(f(X)) 6= X] : f : X → {0, 1}k , g : {0, 1}k → X ∪ {e}},


P P
then ϵ̃∗ (X, k) = x>2k PX (x). This follows immediately from P [g(f(X)) = X] = x∈S PX (x)
where S ≜ {x : g(f(x)) = x} satisfies |S| ≤ 2k , because f takes no more than 2k values. Compared
to Theorem 11.2, we see that ϵ̃∗ (X, k) and ϵ∗ (X, k) only differ by PX (2k ) ≤ 2−k . In particular,
ϵ∗ (X, k + 1) ≤ ϵ̃∗ (X, k) ≤ ϵ∗ (X, k) and we can at most save a single bit in compressed strings.

These simple observations lead us to the first fundamental result of Shannon.

Corollary 11.3 (Shannon’s source coding theorem) Let Sn be i.i.d. discrete random
variables. Then for any R > 0 and γ ∈ R asymptotically in blocklength n we have

∗ n 0 R > H( S)
lim ϵ (S , nR) =
n→∞ 1 R < H( S)

If varentropy V(S) < ∞ then also


p
lim ϵ∗ (Sn , nH(S) + nV(S)γ) = Q(γ)
n→∞
R∞
√1 e−t /2 dt
2
where Q(x) = x 2π
is the complementary CDF of N (0, 1)s.

Proof. Combine Theorem 11.2 with Corollary 10.5.

This result demonstrates that if we are to compress an iid string Sn down to k = k(n) bits
then minimal possible k enabling vanishing error satisfies nk = H(S), that is we can compress to
entropy rate of the iid process S and no more. Furthermore, if we allow a non-vanishing error ϵ
then compression is possible down to
p
k = nH(S) + nV(S)Q−1 (ϵ)

bits. In the language of modern information theory, Corollary 11.3 derives both the asymptotic
fundamental limit (minimal k/n) and the normal approximation under non-vanishing error.
The next desired step after understanding asymptotics is to derive finite blocklength guarantees,
that is bounds on ϵ∗ (X, k) in terms of the information quantities. As we mentioned above, the
upper and lower bounds are typically called achievability and converse bounds. In the case of
lossless compression such bounds are rather trivial corollaries of Theorem 11.2, but we present
them for completeness next. For other problems in this Part and other Parts obtaining good finite
blocklength bounds is much more challenging.

Theorem 11.4 (Finite blocklength bounds) For all τ > 0 and all k ∈ Z+ we have
   
1 −τ ∗ ∗ 1
P log2 > k + τ − 2 ≤ ϵ̃ (X, k) ≤ ϵ (X, k) ≤ P log2 ≥k .
PX (X) PX (X)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-215


i i

11.1 Source coding theorems 215

Proof. The argument for the lower (converse) bound is identical to the converse of Theorem 10.4.
Indeed, considering the optimal (undetectable error) code let S = {x : g(f(x)) = x} and note
   
∗ 1 1
1 − ϵ̃ (X, k) = P [X ∈ S] ≤ P log2 ≤ k + τ + P X ∈ S, log2 >k+τ
PX (X) PX (X)

. For the second term we have


  X
1
P X ∈ S, log2 >k+τ = PX (x)1{PX (x) < 2−k−τ } ≤ |S|2−k−τ ≤ 2−τ ,
PX (X)
x∈ S

where we used the fact that |S| ≤ 2k . Combining the two inequalities yields the lower bound.
For the upper bound, without loss of generality we assume PX (1) ≥ PX (2) ≥ · · · . Then by
Theorem 11.2 we have
X X  1  
1

ϵ∗ (X, k) = PX (m) ≤ 1 ≥ 2k PX (m) = P log2 ≥k ,
P X ( m) PX (X)
k m≥2

where ≤ follows from the fact that mth largest mass PX (m) ≤ 1
m.

We now will do something strange. We will prove an upper bound that is weaker than that of
Theorem 11.4 and furthermore, the proof is much longer. However, this will be our first exposition
to the technique of random coding (also known as probabilistic method outside of information
theory).1 We will quickly find out that outside of the simplest setting of lossless compression,
where the optimal encoder f∗ was easy to describe, good encoders are very hard to find and thus
random coding becomes indispensable. In particular, Slepian-Wolf theorem (Section 11.5 below)
all of data transmission (Part IV) and lossy data compression (Part V) will be based on the method.

Theorem 11.5 (Random coding achievability) For any k ∈ Z+ and any τ > 0 we have
 
1
ϵ̃∗ (X, k) ≤ P log2 > k − τ + 2−τ , ∀τ > 0 , (11.1)
PX (X)

that is there exists a compressor-decompressor pair with the (possibly undetectable) error bounded
by the right-hand side.

Proof. We first start with constructing a suboptimal decompressor g for a given f. Indeed, for a
given compressor f, the optimal decompressor which minimizes the error probability is simply the
maximum a posteriori (MAP) decoder, i.e.,

g∗ (w) = argmax PX|f(X) (x|w) = argmax PX (x) .


x x:f(x)=w

1
These methods were discovered simultaneously by Shannon [378] and Erdös [153], respectively.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-216


i i

216

However, this decoder’s performance is a little hard to analyze, so instead, we consider the
following (suboptimal) decompressor g:


 x, ∃! x ∈ X s.t. f(x) = w and log2 PX1(x) ≤ k − τ,
g(w) = (exists unique high-probability x that is mapped to w)

 e, otherwise

Note that log2 PX1(x) ≤ k − τ ⇐⇒ PX (x) ≥ 2−(k−τ ) . We call those x “high-probability”. (In the
language of [106] and [115] these would be called “typical” realizations).
Denote f(x) = cx and call the long vector C = [cx : x ∈ X ] a codebook. It is instructive to think
of C as a hashing table: it takes an object x ∈ X and assigns to it a k-bit hash value.
To analyze the error probability let us define
 
′ ′ 1
J(x, C) ≜ x ∈ X : cx′ = cx , x 6= x, log2 ≤k−τ
PX (x′ )
to be the set of high-probability inputs whose hashes collide with that of x. Then we have the
following estimate for probability of error:
  
1
P [g(f(X)) = e] = P log2 > k − τ ∪ {J(X, C) 6= ∅}
PX (X)
 
1
≤ P log2 > k − τ + P [J(X, C) 6= ϕ]
PX (X)
The first term does not depend on the codebook C , while the second term does. The idea now
is to randomize over C and show that when we average over all possible choices of codebook,
the second term is smaller than 2−τ . Therefore there exists at least one codebook that achieves
i.i.d.
the desired bound. Specifically, let us consider C generated by setting each cx ∼ Unif[{0, 1}k ] and
independently of X. Equivalently, since C can be represented by an |X | × k binary matrix, whose
rows correspond to codewords, we choose each entry to be independent fair coin flip. Averaging
the error probability (over C and over X), we have
  
′ 1
EC [P [J(X, C) 6= ϕ]] = EC,X 1 ∃x 6= X : log2 ≤ k − τ, cx = cX

PX (x′ )
 
X  1

≤ EC,X  1 log2 ≤ k − τ 1 {cx′ = cX } (union bound)
PX ( x′ )
x′ ̸=X
 
X 
= 2− k E X  1 PX (x′ ) ≥ 2−k+τ 
x′ ̸=X
X 
≤ 2− k 1 PX (x′ ) ≥ 2−k+τ
x′ ∈X
−k k−τ
≤2 2 = 2−τ ,

where the crucial penultimate step uses the fact that there can be at most 2k−τ values of x′ with
PX (x′ ) > 2−k+τ .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-217


i i

11.2 Asymptotic equipartition property (AEP) 217

Remark 11.2 (Why random coding works) The compressor f(x) = cx can be thought as
hashing x ∈ X to a random k-bit string cx ∈ {0, 1}k , as illustrated below:

Here, x has high probability ⇔ log2 PX1(x) ≤ k − τ ⇔ PX (x) ≥ 2−k+τ . Therefore the number of
those high-probability x’s is at most 2k−τ , which is far smaller than 2k , the total number of k-bit
codewords. Hence the chance of collision among high-probability x’s is small.
Let us again emphasize that the essence of the random coding argument is the following. To
prove the existence of an object with certain property, we construct a probability distribution
(randomize) and show that on average the property is satisfied. Hence there exists at least one
realization with the desired property. The downside of this argument is that it is not constructive,
i.e., does not give us an algorithm to find the object. One may wonder whether we can practically
simply generate a large random hashing table and use it for compression. The problem is that
generating such a table requires a lot of randomness and a lot of storage space (both are impor-
tant resources). We will address this issue in Section 11.3, but for now let us make the following
remark.
Remark 11.3 (Pairwise independence of codewords) In the proof we choose the
i.i.d.
random codebook to be uniform over all possible codebooks: cx ∼ Unif. But a careful inspec-
tion (exercise!) shows that we only used pairwise independence, i.e., cx ⊥ ⊥ cx′ for any x 6= x′ .
This suggests that perhaps in generating the table we can use a lot fewer than k|X | random bits.
Indeed, given 2 independent random bits B1 , B2 we can generate 3 bits that are pairwise indepen-
dent: B1 , B2 , B1 ⊕ B2 . This observation will lead us to the idea of linear compression studied in
Section 11.3, where the codewords generated not iid, but as elements of a random linear subspace.

11.2 Asymptotic equipartition property (AEP)


Finally, we address the following question. Our random coding proof restricted attention only to
those x’s with sufficiently high probability PX (x) > 2−k+τ . But it turns out that for iid sources we
could have restricted attention only to what are called “typical” x’s.

Proposition 11.6 (Asymptotic equipartition (AEP)) Consider iid Sn and for any δ > 0,
define the so-called entropy δ -typical set
 
1 1
Tδn ≜ sn : log − H ( S ) ≤ δ .
n PSn (sn )
Then the following properties hold:

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-218


i i

218

 
1 P Sn ∈ Tδn → 1 as n → ∞.
2 |Tδn | ≤ exp{(H(S) + δ)n}.

i.i.d.
For example if Sn ∼ Ber(p), then PSn (sn ) = pwH (s ) p̄n−wH (s ) , where wH (sn ) is the Hamming
n n

weight of the string (number of 1’s). Thus the typical set corresponds to those sequences whose
Hamming weight 1n wH (sn ) is close to the expected value of p + Op (δ).

Proof. By WLLN, we have


1 1 P
log −
→H(S). (11.2)
n PSn (Sn )

Thus, P[Sn ∈ Tδn ] → 1. On the other hand, since for every sn ∈ Tδn we have PSn (sn ) > exp{−(H(S)+
δ)n} there can be at most exp{(H(S) + δ)n} elements in Tδn .

To understand the meaning of the AEP, notice that it shows that the gigantic space S n has
almost all of probability PSn concentrated on an exponentially smaller subset Tδn . Furthermore, on
this subset the measure PSn is approximately uniform: PSn (sn ) = exp{−nH(S) ± nδ}.
To see how AEP is related to compression, let us give a third proof of Shannon’s result:

∗ n 0 R > H( S)
lim ϵ (S , nR) =
n→∞ 1 R < H( S)

Indeed, let us consider an encoder f that enumerates (by strings in {0, 1}nR ) elements of Tδn . Then
if R > H(S) + δ the decoding error happens with probability P[Sn 6∈ Tδn ] → 0. Hence any rate
R > H(S) results in a vanishing error. On the other hand, if R < H(S) then it is clear that 2nR -bits
cannot describe any significant portion of |Tδn | and since on the latter the measure PSn is almost
uniform, the probability of error necessarily converges to 1 (in fact exponentially fast). There is a
certain conceptual beauty in this way of proving source coding theorem. For example, it explains
why optimal compressor’s output should look almost like iid Ber(1/2):2 after all it enumerates
over an almost uniform set Tδn .

11.3 Linear compression (hashing)


So far we have seen three proofs of Shannon’s theorem (Corollary 11.3), but unfortunately each of
the proofs used methods that are not feasible to implement in practice. The first method required
sorting all |S|n realizations of the input data block, the second required constructing a |S|n × k
hashing table and the third enumerating the entropy-typical set |Tδn | = exp{nH(S) + o(n)}. In
this section we show a fourth method, which is conceptually important and also results in a very
simple compressor. (The decompressor that we describe is still going to be very impractical, but
it can be made practical by leveraging efficient decoders of linear error correcting codes.)

2
This is the intuitive basis why compressors can be used as random number generators; cf. Section 9.3.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-219


i i

11.3 Linear compression (hashing) 219

In this section we assume that the source takes the form X = Sn , where each coordinate is an
element of a finite field (Galois field), i.e., Si ∈ Fq , where q is the cardinality of Fq . (This is only
possible if q = pk for some prime number p and k ∈ N.)

Definition 11.7 (Galois field) F is a finite set with operations (+, ·) where

• The addition operation + is associative and commutative.


• The multiplication operation · is associative and commutative.
• There exist elements 0, 1 ∈ F s.t. 0 + a = 1 · a = a.
• ∀a, ∃ − a, s.t. a + (−a) = 0
• ∀a 6= 0, ∃a−1 , s.t. a−1 a = 1
• Distributive: a · (b + c) = (a · b) + (a · c)

Simple examples of finite fields:

• Fp = Z/pZ, where p is prime (“modulo-p arithmetic”)


• F4 = {0, 1, x, x + 1} with addition and multiplication as polynomials in F2 [x] modulo x2 + x + 1.

A linear compressor is a linear function H : Fnq → Fkq (represented by a matrix H ∈ Fqk×n ) that
maps each x ∈ Fnq to its codeword w = Hx, namely
    
w1 h11 ... h1n x1
 ..   .. ..   .. 
 . = . .  . 
wk hk1 ... hkn xn

Compression is achieved if k < n, i.e., H is a fat matrix, which, again, is only possible in the
almost lossless sense.

Theorem 11.8 (Achievability via linear codes) Let X ∈ Fnq be a random vector. For all
τ > 0, there exists a linear compressor H ∈ Fnq×k and decompressor g : Fkq → Fnq ∪ {e}, s.t. its
undetectable error probability is bounded by
 
1
P [g(HX) 6= X] ≤ P logq > k − τ + q−τ .
PX (X)

Remark 11.4 Consider the Hamming space q = 2. In comparison with Shannon’s random
coding achievability, which uses k2n bits to construct a completely random codebook, here for lin-
ear codes we need kn bits to randomly generate the matrix H, and the codebook is a k-dimensional
linear subspace of the Hamming space.

Proof. Fix τ . As pointed in the proof of Shannon’s random coding theorem (Theorem 11.5),
given the compressor H, the optimal decompressor is the MAP decoder, i.e., g(w) =
argmaxx:Hx=w PX (x), which outputs the most likely symbol that is compatible with the codeword

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-220


i i

220

received. Instead, as before we consider the following (suboptimal) decoder:


(
x ∃!x ∈ Fnq : w = Hx, x h.p.
g(w) =
e otherwise

where, as in the proof of Theorem 11.5, we denoted x to be “h.p.” (high probability) whenever
logq PX1(x) ≤ k − τ .
Note that this decoder is the same as in the proof of Theorem 11.5. The proof is also mostly the
same, except now hash collisions occur under the linear map H. Specifically, we have by applying
the union bound twice:

P [g(HX) 6= X|H] ≤ P [x not h.p.] + P [∃x′ h.p. : x′ 6= X, Hx′ = HX]


  X X
1
≤ P logq >k−τ + PX (x) 1{Hx′ = Hx}
P X ( x) ′ ′
x x h.p.,x ̸=x

Now we use random coding to average the second term over all possible choices of H. Specif-
ically, choose H as a matrix independent of X where each entry is iid and uniform on Fq . For
distinct x0 and x1 , the collision probability is

P[Hx1 = Hx0 ] = P[Hx2 = 0] (x2 ≜ x1 − x0 6= 0)


= P[H1 · x2 = 0] k
(iid rows)

where H1 is the first row of the matrix H, and each row of H is independent. This is the probability
that Hi is in the orthogonal complement of x2 . On Fnq , the orthogonal complement of a given
non-zero vector has cardinality qn−1 . So the probability for the first row to lie in this subspace is
qn−1 /qn = 1/q, hence the collision probability 1/qk . Averaging over H gives
 
X
EH  1{Hx′ = Hx} = |{x′ : x′ h.p., x′ 6= x}|q−k ≤ qk−τ q−k = q−τ .

x h.p.,x′ ̸=x

This completes the proof.

We remark that the bounds in Theorems 11.5 and 11.8 produce compressors with undetectable
errors. However, the non-linear construction in the former is easy to modify to make all errors
detectable (e.g. by increasing k by 1 and making sure the first bit is 1 for all x = sn with low
probability). For the linear compressors, however, the errors cannot be made detectable.
Note that we restricted our theorem to inputs over Fq . Can we loosen the requirements and
produce compressors over an arbitrary commutative ring? In general, the answer is negative due
to existences of zero divisors in the commutative ring. The latter ruin the key proof ingredient of
low collision probability in the random hashing. Indeed, consider the following computation over

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-221


i i

11.4 Compression with side information at both compressor and decompressor 221

Z/6Z
       
1 2
  0     0  
       
P H  ..  = 0 = 6− k but P H  ..  = 0 = 3− k ,
  .     .  
0 0

since 0 · 2 = 3 · 2 = 0 in Z/6Z.

11.4 Compression with side information at both compressor and


decompressor
We now move to discussing several variations of the compression problem when the data consists
of a correlated pair (X, Y) ∼ PX,Y . The first variation is schematically depicted in the next figure:

X {0, 1}k X ∪ {e}


Compressor Decompressor

Formally, we make the following definition

Definition 11.9 (Compression with side information) Given PX,Y we define

• compressor f : X × Y → {0, 1}k


• decompressor g : {0, 1}k × Y → X ∪ {e}
• probability of error P[g(f(X, Y), Y) 6= X] < ϵ. A code satisfying this property is called a (k, ϵ)-
s.i. code
• Fundamental Limit: ϵ∗ (X|Y, k) = inf{ϵ : ∃(k, ϵ)–s.i. code}

Note that here unlike the source X, the side information Y need not be discrete. Conditioned on
Y = y, the problem reduces to compression without side information studied in Section 11.1, but
with the source X distributed according to PX|Y=y . Since Y is known to both the compressor and
decompressor, they can use the best code tailored for this distribution. Recall ϵ∗ (X, k) defined in
Definition 11.1, the optimal probability of error for compressing X using k bits, which can also be
denoted by ϵ∗ (PX , k). Then we have the following relationship

ϵ∗ (X|Y, k) = Ey∼PY [ϵ∗ (PX|Y=y , k)],

which allows us to apply various bounds developed before. In particular, we clearly have the
following result.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-222


i i

222

Theorem 11.10
   
1 1
P log > k + τ − 2−τ ≤ ϵ∗ (X|Y, k) ≤ P log2 > k − τ + 2−τ , ∀τ > 0
PX|Y (X|Y) PX|Y (X|Y)

Corollary 11.11 Let (X, Y) = (Sn , Tn ) where the pairs (Si , Ti )i.i.d.
∼ PS,T . Then
(
∗ 0 R > H(S|T)
lim ϵ (S |T , nR) =
n n
n→∞ 1 R < H(S|T)

Proof. Indeed, note that from WLLN we have

1X
n
1 1 1 P
log = log −
→H(S|T)
n PSn |Tn (S |T )
n n n PS|T (Si |Ti )
i=1

as n → ∞. Thus, the result follows from setting (X, Y) = (Sn , Tn ) in the previous theorem.

11.5 Slepian-Wolf: side information at decompressor only


In previous section we learned that given access to side-information at both compressor and
decompressor the optimal compression rate is given by the conditional entropy H(S|T). We now
consider what happens if the side information Y = Tn is not available at the compressor. This is
demonstrated schematically in the following figure:

X {0, 1}k X ∪ {e}


Compressor Decompressor

Formally, we make the following definition.

Definition 11.12 (Slepian-Wolf code) Given PX,Y , we define a Slepian-Wolf coding


problem as:

• compressor f : X → {0, 1}k


• decompressor g : {0, 1}k × Y → X ∪ {e}
• probability of error P[g(f(X), Y) 6= X] ≤ ϵ. A code satisfying this property is called a (k, ϵ)-
Slepian-Wolf code
• Fundamental Limit: ϵ∗SW (X|Y, k) = inf{ϵ : ∃(k, ϵ)-Slepian-Wolf code}

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-223


i i

11.5 Slepian-Wolf: side information at decompressor only 223

Here is the very surprising result of Slepian and Wolf3 , which shows that unavailability of the
side information at compressor does not hinder the compression rate at all.

Theorem 11.13 (Slepian-Wolf [392])


 
∗ 1
ϵ ( X | Y , k) ≤ ϵ∗SW (X|Y, k) ≤ P log ≥ k − τ + 2−τ
PX|Y (X|Y)

From this theorem we will get by the WLLN the asymptotic result:

Corollary 11.14 (Slepian-Wolf [392])


(
0 R > H(S|T)
lim ϵ∗SW (Sn |Tn , nR) =
n→∞ 1 R < H(S|T)

And we remark that the side-information (T-process) is not even required to be discrete, see
Exercise II.9.

Proof of the Theorem. LHS is obvious, since side information at the compressor and decompres-
sor is better than only at the decompressor.
For the RHS, first generate a random codebook with iid uniform codewords: C = {cx ∈ {0, 1}k :
x ∈ X } independently of (X, Y), then define the compressor and decoder as

f(x) = cx
(
x ∃!x : cx = w, x h.p.|y
g(w, y) =
0 otherwise

where we used the shorthand x h.p.|y ⇔ log2 PX|Y1(x|y) < k −τ . The error probability of this scheme,
as a function of the code book C , is
 
1
P[X 6= g(f(X))|C] = P log ≥ k − τ or J(X, C|Y) 6= ∅|C
PX|Y (X|Y)
 
1
≤ P log ≥ k − τ + P [J(X, C|Y) 6= ∅|C]
PX|Y (X|Y)
  X
1
= P log ≥k−τ + PX,Y (x, y)1 {J(x, C|y) 6= ∅}.
PX|Y (X|Y) x, y

where J(x, C|y) ≜ {x′ 6= x : x′ h.p.|y, cx = cx′ }.

3
This result is often informally referred to as “the most surprising result post-Shannon”.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-224


i i

224

Now averaging over C we get


 
( a) X
P[J(x, C|y) 6= ∅] ≤ EC  1 {x′ h.p.|y}1 {cx′ = cx }
x′ ̸=x
(b)
≤ 2k−τ P[cx′ = cx ]
( c)
= 2−τ ,

where (a) is a union bound, (b) follows from the fact that |{x′ : x′ h.p.|y}| ≤ 2k−τ , and (c) is from
P[cx′ = cx ] = 2−k for any x 6= x′ .

Remark 11.5 (Undetectable error) Definition 11.12 allows appearance of undetected


errors. Now, we have seen that in all previous random coding results (except for the linear com-
pression) we could always easily modify the compression algorithm to make all undetected errors
detectable. However, Slepian-Wolf magic crucially depends on undetectable errors. Indeed, sup-
pose we require that g(f(x), y) = x e for all x, y with PX,Y (x, y) > 0. Suppose there is some
c ∈ {0, 1}k such that f(x1 ) = f(x2 ) = c. Then g(c, y) = e for all y, and the side-information is not
needed for such a c. On the other hand, if c ∈ {0, 1}k is such that it has a unique x s.t. f(x) = c,
then we can set g(c, y) = x and ignore y again. Overall, we see that in either case side-information
is not useful at the decompressor and we can only compress down to H(X) not H(X|Y). Similarly,
one can show that Slepian-Wolf theorem does not hold in the setting of variable-length lossless
compression: the minimal average compression length of any lossless algorithm is at least H(X)
(instead of H(X|Y)).

11.6 Slepian-Wolf: compressing multiple sources


A simple extension of the previous result also covers another variation of the data-compression
task, in which two correlated sources X and Y are compressed individually (possibly at two remote
locations), but are decompressed jointly at the destination. This time, however, the goal is to
reproduce both X and Y. This is depicted as follows:

X {0, 1}k1
Compressor f1
Decompressor g

(X̂, Ŷ)

Y {0, 1}k2
Compressor f2

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-225


i i

11.6 Slepian-Wolf: compressing multiple sources 225

More formally, we define:

Definition 11.15 (Multi-terminal compression) Given PX,Y let

• compressors f1 : X → {0, 1}k1 , f2 : Y → {0, 1}k2 .


• decompressor g : {0, 1}k1 × {0, 1}k2 → X × Y ∪ {e}
• probability of error P[g(f1 (X), f2 (Y)) 6= (X, Y)] ≤ ϵ. A code satisfying this property is called a
(k1 , k2 , ϵ)-code (or multi-terminal Slepian-Wolf code).
• Fundamental limit: ϵ∗SW (X, Y, k1 , k2 ) = inf{ϵ : ∃(k1 , k2 , ϵ)-code}.

The asymptotic fundamental limit here is given as follows.

Theorem 11.16 Let (X, Y) = (Sn , Tn ) with (Si , Ti )i.i.d.


∼ PS,T . Then
(
0 (R1 , R2 ) ∈ int(RSW )
lim ϵ∗SW (Sn , Tn , nR1 , nR2 ) =
n→∞ 1 (R1 , R2 ) 6∈ RSW

where RSW denotes the Slepian-Wolf rate region




 a ≥ H(S|T)

RSW = (a, b) : b ≥ H(T|S)


 a + b ≥ H( S, T )

The rate region RSW typically looks like this:

R2

Achievable
H(T )
Region

H(T |S)
R1
H(S|T ) H(S)

Since H(T) − H(T|S) = H(S) − H(S|T) = I(S; T), the slope of the skewed line is −1.

Proof. Converse: Take (R1 , R2 ) 6∈ RSW . Then one of three cases must occur:

1 R1 < H(S|T). In this case, even if f1 encoder and decoder had access to full Tn , we still can not
achieve vanishing error (Corollary 11.11).
2 R2 < H(T|S) (same).
3 R1 + R2 < H(S, T). If this were possible, then we would be compressing the joint (Sn , Tn ) at
rate lower than H(S, T), violating Corollary 11.3.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-226


i i

226

Achievability: First note that we can achieve the two corner points. The point (H(S), H(T|S))
can be approached by almost lossless compressing S at entropy and compressing T with side infor-
mation S at the decoder. To make this rigorous, let k1 = n(H(S) + δ) and k2 = n(H(T|S) + δ). By
Corollary 11.3, there exist f1 : S n → {0, 1}k1 and g1 : {0, 1}k1 → S n s.t. P [g1 (f1 (Sn )) 6= Sn ] ≤
ϵn → 0. By Theorem 11.13, there exist f2 : T n → {0, 1}k2 and g2 : {0, 1}k1 × S n → T n
s.t. P [g2 (f2 (Tn ), Sn ) 6= Tn ] ≤ ϵn → 0. Now that Sn is not available, feed the S.W. decompres-
sor with g1 (f1 (Sn )) and define the joint decompressor by g(w1 , w2 ) = (g1 (w1 ), g2 (w2 , g1 (w1 )))
(see below):

Sn Ŝn
f1 g1

Tn T̂n
f2 g2

Apply union bound:


P [g(f1 (Sn ), f2 (Tn )) 6= (Sn , Tn )]
= P [g1 (f1 (Sn )) 6= Sn ] + P [g2 (f2 (Tn ), g(f1 (Sn ))) 6= Tn , g1 (f1 (Sn )) = Sn ]
≤ P [g1 (f1 (Sn )) 6= Sn ] + P [g2 (f2 (Tn ), Sn ) 6= Tn ]
≤ 2ϵ n → 0.
Similarly, the point (H(S), H(T|S)) can be approached.
To achieve other points in the region, use the idea of time sharing: If you can achieve with
vanishing error probability any two points (R1 , R2 ) and (R′1 , R′2 ), then you can achieve for λ ∈
[0, 1], (λR1 + λ̄R′1 , λR2 + λ̄R′2 ) by dividing the block of length n into two blocks of length λn and
λ̄n and apply the two codes respectively
 
λnR1
(S1 , T1 ) →
λn λn
using (R1 , R2 ) code
λnR2
 
λ̄nR′1
(Sλn+1 , Tλn+1 ) →
n n
using (R′1 , R′2 ) code
λ̄nR′2
Therefore, all convex combinations of points in the achievable regions are also achievable, so the
achievable region must be convex.

11.7* Source-coding with a helper (Ahlswede-Körner-Wyner)


In the Slepian-Wolf setting the goal was to compress/decompress X with decompressor having
access to a side information Y. A natural variation of the problem is to consider the case where
access to Y itself is available over a rate-limited link, that is we have the following setting:

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-227


i i

11.7* Source-coding with a helper (Ahlswede-Körner-Wyner) 227

X {0, 1}nR1
Compressor f1

Decompressor g

Y {0, 1}nR2
Compressor f2

Note also that unlike the previous section decompressor is only required to produce an esti-
mate of X (not of Y), hence the name of this problem: compression with a (rate-limited) helper.
The difficulty this time is that what needs to be communicated over this link from Y to decom-
pressor is not the information about Y but only that information in Y that is maximally useful
for decompressing X. Despite similarity with the previous sections, this task is completely new
and, consequently, characterization of rate pairs R1 , R2 is much more subtle in this case. It was
completed independently in two works [9, 459].

Theorem 11.17 (Ahlswede-Körner-Wyner) Consider i.i.d. source (Xn , Yn ) ∼ PX,Y with X


discrete. Compressor produces message W1 = f1 (Xn ) and helper produces a message W2 = f2 (Yn ),
with Wi consisting of at most nRi bits, i ∈ {1, 2}. Decompressor produces an estimate X̂n =
g(W1 , W2 ). If rate pair (R1 , R2 ) is achievable with vanishing probability of error P[X̂n 6= Xn ] → 0,
then there exists an auxiliary random variable U taking values on alphabet of cardinality |Y| + 1
such that PX,Y,U = PX,Y PU|Y (i.e. X → Y → U) and

R1 ≥ H(X|U), R2 ≥ I(Y; U) . (11.3)

Furthermore, for every such random variable U the rate pair (H(X|U), I(Y; U)) is achievable with
vanishing error.

In other words, this time the set of achievable pairs (R1 , R2 ) belongs to a region of R2+ described
as ∪{[H(X|U), +∞)×[I(Y; U), +∞)} with the union taken over all possible PU|Y : Y → U , where
|U| = |Y| + 1. The boundary of the optimal (R1 , R2 )-region is traced by an FI -curve, a concept
we will define later (Definition 16.5).

Proof. First, note that iterating over all possible random variables U (without cardinality con-
straint) the set of pairs (R1 , R2 ) satisfying (11.3) is convex. Next, consider a compressor W1 =
f1 (Xn ) and W2 = f2 (Yn ). Then from Fano’s inequality (3.19) assuming P[Xn 6= X̂n ] = o(1) we
have

H(Xn |W1 , W2 )) = o(n) .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-228


i i

228

Thus, from chain rule and the fact that conditioning decreases entropy, we get
nR1 ≥ I(Xn ; W1 |W2 ) ≥ H(Xn |W2 ) − o(n) (11.4)
Xn
= H(Xk |W2 , Xk−1 ) − o(n) (11.5)
k=1
Xn
≥ H(Xk | W2 , Xk−1 , Yk−1 ) − o(n) (11.6)
| {z }
k=1
≜Uk

On the other hand, from (6.2) we have


X
n
nR2 ≥ I(W2 ; Y ) =n
I(W2 ; Yk |Yk−1 ) (11.7)
k=1
Xn
= I(W2 , Xk−1 ; Yk |Yk−1 ) (11.8)
k=1
Xn
= I(W2 , Xk−1 , Yk−1 ; Yk ) (11.9)
k=1

where (11.8) follows from I(W2 , Xk−1 ; Yk |Yk−1 ) = I(W2 ; Yk |Yk−1 ) + I(Xk−1 ; Yk |W2 , Yk−1 ) and the
⊥ Xk−1 |Yk−1 ; and (11.9) from Yk−1 ⊥
fact that (W2 , Yk ) ⊥ ⊥ Yk . Comparing (11.6) and (11.9) we
k− 1 k− 1
notice that denoting Uk = (W2 , X , Y ) we have both Xk → Yk → Uk and
1X
n
(R1 , R2 ) ≥ (H(Xk |Uk ), I(Uk ; Yk ))
n
k=1

and thus (from convexity) the rate pair must belong to the region spanned by all pairs
(H(X|U), I(U; Y)).
To show that without loss of generality the auxiliary random variable U can be chosen to take
at most |Y| + 1 values, one can invoke Carathéodory’s theorem (see Lemma 7.14). We omit the
details.
Next, we show that for each U the mentioned rate-pair is achievable. To that end, we first
notice that if there were side information at the decompressor in the form of the i.i.d. sequence
Un correlated to Xn , then Slepian-Wolf theorem implies that only rate R1 = H(X|U) would be
sufficient to reconstruct Xn . Thus, the question boils down to creating a correlated sequence Un at
the decompressor by using the minimal rate R2 . One way to do it is to communicate Un exactly by
spending nH(U) bits. However, it turns out that with nI(U; X) bits we can communicate a “fake”
Ûn which nevertheless has conditional distribution PXn |Ûn ≈ PXn |Un (such Ûn is known as “jointly
typical” with Xn ). Possibility of producing such Ûn is a result of independent prominence known
as covering lemma, which we will study much later – see Corollary 25.6. Here we show how to
apply covering lemma in this case.
By Corollary 25.6 and by Proposition 25.7 we know that for every δ > 0 there exists a
sufficiently large m and Ûm = f2 (Ym ) ∈ {0, 1}mR2 such that
Xm → Ym → Ûm

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-229


i i

11.7* Source-coding with a helper (Ahlswede-Körner-Wyner) 229

and I(Xm ; Ûm ) ≥ m(I(X; U)−δ). This means that H(Xm |Ûm ) ≤ mH(X|U)+ mδ . We can now apply
Slepian-Wolf theorem to the block-symbols (Xm , Ûm ). Namely, we define a new compression prob-
lem with X̃ = Xm and Ũ = Ûm . These still take values on finite alphabets and thus there must exist
(for sufficiently large ℓ) a compressor W1 = f1 (X̃ℓ ) ∈ {0, 1}ℓR̃1 and a decompressor g(W1 , Ũℓ )
with a low probability of error and R̃1 ≤ H(X̃|Ũ) + mδ ≤ mH(X|U) + 2mδ). Now since the actual
blocklength is n = ℓm we get that the effective rate of this scheme is R1 = R̃m1 ≤ H(X|U) + 2δ .
Since δ > 0 is arbitrary, the proof is completed.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-230


i i

12 Entropy of ergodic processes

So far we studied compression of i.i.d. sequence {Si }, for which we demonstrated that the average
compression length (for variable length compressors) converges to the entropy H(S) and that the
probability of error (for fixed-length compressor) converges to zero or one depending on whether
compression rate R ≶ H(S). In this chapter, we shall examine similar results for a large class of
processes with memory, known as ergodic processes. We start this chapter with a quick review of
main concepts of ergodic theory, then state our main results (Shannon-McMillan theorem, com-
pression limit and AEP). Subsequent sections are dedicated to proofs of Shannon-McMillan and
ergodic theorems. Finally, in the last section we introduce Kolmogorov-Sinai entropy, which asso-
ciates to a fully deterministic transformation the measure of how “chaotic” it is. This concept
plays a very important role in formalizing an apparent paradox: large mechanical systems (such
as collections of gas particles) are on one hand fully deterministic (described by Newton’s laws
of motion) and on the other hand have a lot of probabilistic properties (Maxwell distribution of
velocities, fluctuations etc). Kolmogorov-Sinai entropy shows how these two notions can co-exist.
In addition it was used to resolve a long-standing open problem in dynamical systems regarding
isomorphism of Bernoulli shifts [387, 322].

12.1 Bits of ergodic theory


Let us start with a dynamical system point of view on stochastic processes. Throughout this chapter
we assume that all random variables are defined as functions on a common space of elementary
outcomes (Ω, F).

Definition 12.1 (Measure preserving transformation) τ : Ω → Ω is a measure


preserving transformation, also known as a probability preserving transformation (p.p.t.), if
∀E ∈ F , P(E) = P(τ −1 E).
The set E is called τ -invariant if E = τ −1 E. The set of all τ -invariant sets forms a σ -algebra
(exercise) denoted Finv .

Definition 12.2 (Stationary process) A process {Sn , n = 0, . . .} is stationary if there


exists a measure preserving transformation τ : Ω → Ω such that:
Sj = Sj−1 ◦ τ = S0 ◦ τ j

230

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-231


i i

12.1 Bits of ergodic theory 231

Therefore a stationary process can be described by the tuple (Ω, F, P, τ, S0 ) and Sk = S0 ◦ τ k .

Remark 12.1

1 Alternatively, a random process (S0 , S1 , S2 , . . . ) is stationary if its joint distribution is invariant


with respect to shifts in time, i.e., PSmn = PSm+t , ∀n, m, t. Indeed, given such a process we can set
n+t
Ω = S ∞ and define an m.p.t. as follows:
τ
(s0 , s1 , . . . ) −
→ (s1 , s2 , . . . ) (12.1)
So τ is a shift to the left.
2 An event E ∈ F is shift-invariant if
(s1 , s2 , . . . ) ∈ E ⇐⇒ (s0 , s1 , s2 , . . . ) ∈ E, ∀s0

or, equivalently, E = τ −1 E. Thus τ -invariant events are also called shift-invariant, when τ is
interpreted as (12.1).
3 Some examples of shift-invariant events are {∃n : xi = 0, ∀i ≥ n}, {lim sup xi < 1} etc. A non
shift-invariant event is A = {x0 = x1 = · · · = 0}, since τ (1, 0, 0, . . .) ∈ A but (1, 0, . . .) 6∈ A.
4 Also recall that the tail σ -algebra is defined as
\
Ftail ≜ σ{Sn , Sn+1 , . . .} .
n≥1

It is easy to check that all shift-invariant events belong to Ftail . The inclusion is strict, as for
example the event {∃n : xi = 0, ∀ odd i ≥ n} is in Ftail but not shift-invariant.

Proposition 12.3 (Poincaré recurrence) Let τ be measure-preserving for (Ω, F, P). Then
for any measurable A with P[A] > 0 we have
" #
[
P τ −k A A = P[τ k (ω) ∈ A occurs infinitely often|A] = 1 .
k≥ 1

S
Proof. Let B = k≥ 1 τ −k A. It is sufficient to show that P[A ∩ B] = P[A] or equivalently

P[A ∪ B] = P[B] . (12.2)


To that end notice that τ −1 A ∪ τ −1 B = B and thus
P[τ −1 (A ∪ B)] = P[B] ,
but the left-hand side equals P[A ∪ B] by the measure preservation of τ , proving (12.2).
Consider τ mapping the initial state of the conservative (Hamiltonian) mechanical system to its
state after passage of a given amount of time. It is known that τ preserves the Lebesgue measure in
phase space (Liouville’s theorem). Thus the Poincaré recurrence leads to a rather counter-intuitive
conclusions. For example, opening the barrier separating two gases in a cylinder allows them to
mix. Poincaré recurrence says that eventually they will return back to the original separated state

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-232


i i

232

(with each gas occupying roughly its half of the cylinder). Of course, the “paradox” is resolved
by observing that it will take unphysically long for this to happen.

Definition 12.4 (Ergodicity) A transformation τ is ergodic if ∀E ∈ Finv we have P[E] = 0 or


1. A process {Si } is ergodic if all shift-invariant events are deterministic, i.e., for any shift-invariant
event E, P [S∞
1 ∈ E] = 0 or 1.

Here are some examples:

• {Sk = k2 }: ergodic but not stationary.


• {Sk = S0 }: stationary but not ergodic (unless S0 is a constant). Note that the singleton set
E = {(s, s, . . .)} is shift invariant and P [S∞ 1 ∈ E] = P [S0 = s] ∈ (0, 1), not deterministic.
• {Sk } i.i.d. is stationary and ergodic (by Kolmogorov’s 0-1 law, tail events have no randomness).
• (Sliding-window construction of ergodic processes) If {Si } is ergodic, then {Xi =
f(Si , Si+1 , . . . )} is also ergodic. Such a process {Xi } is called a B-process if Si is i.i.d.
• Here is an important example demonstrating how one can look at a simple iid process Si in two
i.i.d. P∞
ways via sliding window. Take Si ∼ Ber( 12 ) and set Xk = n=0 2−n−1 Sk+n = 2Xk−1 mod 1.
The marginal distribution of Xi is uniform on [0, 1]. Furthermore, Xk ’s behavior is completely
deterministic: given X0 , all future Xk ’s can be computed exactly. This example shows that cer-
tain deterministic maps exhibit ergodic/chaotic behavior under iterative application: although
the trajectory of Xk is completely deterministic, its time-averages converge to expectations and
in general “look random” since full determinism is only guaranteed if infinite-precision mea-
surement of X0 is available. Any discretization of Xk ’s results in random behavior of positive
entropy rate – see more on this in Section 12.5*.
• There are also stronger conditions than ergodicity. Namely, we say that τ is mixing (or strongly
mixing) if

P[A ∩ τ −n B] → P[A]P[B] .

We say that τ is weakly mixing if


X
n
1
P[A ∩ τ −n B] − P[A]P[B] → 0 .
n
k=1

Strong mixing implies weak mixing, which implies ergodicity (Exercise II.12).
• {Si }: finite irreducible Markov chain with recurrent states is ergodic (in fact strong mixing),
regardless of initial distribution.
As a toy example, consider the kernel P(0|1) = P(1|0) = 1 with initial distribution
P(S0 = 0) = 0.5. This process only has two sample paths: P [S∞ 1 = (010101 . . .)] =
P [ S∞
1 = ( 101010 . . .)] = 1
2 . It is easy to verify this process is ergodic (in the sense of Defi-
nition 12.4). Note however, that in the Markov-chain literature a chain is called ergodic if it is
irreducible, aperiodic and recurrent. This example does not satisfy this definition (this clash of
terminology is a frequent source of confusion).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-233


i i

12.2 Shannon-McMillan, entropy rate and AEP 233

• {Si }: stationary zero-mean Gaussian process with autocovariance function c(n) = E[S0 S∗n ].

1 X
n
lim c(t) = 0 ⇔ {Si } ergodic ⇔ {Si } weakly mixing
n→∞ n + 1
t=0

lim c(n) = 0 ⇔ {Si } mixing


n→∞

Intuitively speaking, an ergodic process can have infinite memory in general, but the memory
is weak. Indeed, we see that for a stationary Gaussian process ergodicity means the correlation
dies (in the Cesáro-mean sense).
The spectral measure is defined as the (discrete time) Fourier transform of the autocovariance
sequence {c(n)}, in the sense that there exists a unique positive measure μ on [−π , π ] such that
1
R
c(n) = 2π exp(inx) μ(dx). The spectral criteria can be formulated as follows:

{Si } is ergodic ⇔ spectral measure has no atoms (CDF is continuous)


{Si } is a B-process ⇔ spectral measure has a density (power spectral density, cf. Example 6.4)

Detailed exposition on stationary Gaussian processes can be found in [135, Theorem 9.3.2, pp.
474, Theorem 9.7.1, pp. 493–494].

12.2 Shannon-McMillan, entropy rate and AEP


Equipped with the definitions of ergodicity we can state the three main results of this chapter. First
is an analog of the law of large numbers for normalized log-likelihoods.

Theorem 12.5 (Shannon-McMillan-Breiman) Let S = {S1 , S2 , . . . } be a stationary and


ergodic discrete process with entropy rate H ≜ H(S). Then
1 1 P
log n

→H, also a.s. and in L1 (12.3)
n PS ( S )
n

Corollary 12.6 Let {S1 , S2 , . . . } be a discrete stationary and ergodic process with entropy rate
H (in bits). Denote by f∗n the optimal variable-length compressor for Sn and ϵ∗ (Sn , nR) the optimal
probability of error of its fixed-length compressor with R bits per symbol (Definition 11.1). Then
we have
(
1 ∗ n P 0 R > H,
l(f (S ))− →H and lim ϵ∗ (Sn , nR) = (12.4)
n n n→∞ 1 R < H.

Proof. By Corollary 10.5, the asymptotic distributions of 1n l(f∗n (Sn )) and 1n log PSn1(sn ) coincide.
By the Shannon-McMillan-Breiman theorem (we only need convergence in probability) the latter
converges to a constant H.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-234


i i

234

In Chapter 11 we learned the asymptotic equipartition property (AEP) for iid sources. Thanks
to Shannon-McMillan-Breiman the same proof we did for the iid processes works for a general
ergodic process.

Corollary 12.7 (AEP for stationary ergodic sources) Let {S1 , S2 , . . . } be a stationary
and ergodic discrete process. For any δ > 0, define the set
 
1 1
δ
Tn = s : n
log −H ≤δ .
n PSn (sn )
Then
 
1 P Sn ∈ Tδn → 1 as n → ∞.
2 exp{n(H − δ)}(1 + o(1)) ≤ |Tδn | ≤ exp{(H + δ)n}(1 + o(1)).

Some historical notes are in order. Convergence in probability for stationary ergodic Markov
chains was already shown in [378]. The extension to convergence in L1 for all stationary ergodic
processes is due to McMillan in [301], and to almost sure convergence to Breiman [75].1 A modern
proof is in [11]. Note also that for a Markov chain, existence of typical sequences and the AEP can
be anticipated by thinking of a Markov process as a sequence of independent decisions regarding
which transitions to take at each state. It is then clear that Markov process’s trajectory is simply a
transformation of trajectories of an iid process, hence must concentrate similarly.

12.3 Proof of the Shannon-McMillan-Breiman Theorem


We shall show the L1 -convergence, which implies convergence in probability automatically. We
will not prove a.s. convergence. To this end, let us first introduce Birkhoff-Khintchine’s conver-
gence theorem for ergodic processes, the proof of which is presented in the next subsection. The
interpretation of this result is that time averages converge to the ensemble average.

Theorem 12.8 (Birkhoff-Khintchine’s Ergodic Theorem) Let {Si } be a stationary and


ergodic process. For any integrable function f, i.e., E |f(S1 , . . . )| < ∞,

1X
n
lim f(Sk , . . . ) = E f(S1 , . . . ) a.s. and in L1 .
n→∞ n
k=1

In the special case where f depends on finitely many coordinates, say, f = f(S1 , . . . , Sm ),

1X
n
lim f(Sk , . . . , Sk+m−1 ) = E f(S1 , . . . , Sm ) a.s. and in L1 .
n→∞ n
k=1

1
Curiously, both McMillan and Breiman left the field after these contributions. McMillan went on to head the US satellite
reconnaissance program, and Breiman became a pioneer and advocate of machine learning approach to statistical
inference.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-235


i i

12.3 Proof of the Shannon-McMillan-Breiman Theorem 235

Example 12.1 Consider f = f(S1 ). Then for an iid process Theorem 12.8 is simply the strong
law of large numbers. On the extreme, if {Si } has constant trajectories, i.e. Si = S1 for all i ≥
1, then such process is non-ergodic and conclusion of Theorem 12.8 fails (unless S1 is an a.s.
constant).
We introduce an extension of the idea of the Markov chain.

Definition 12.9 (Finite order Markov chain) {Si : i ∈ N} is an mth order Markov chain
if PSt+1 |St1 = PSt+1 |Stt−m+1 for all t ≥ m. It is called time homogeneous if PSt+1 |Stt−m+1 = PSm+1 |Sm1 .

Remark 12.2 Showing (12.3) for an mth order time-homogeneous Markov chain {Si } is a
direct application of Birkhoff-Khintchine. Indeed, we have

1X
n
1 1 1
log = log
n n
PSn (S ) n PSt |St−1 (St |St−1 )
t=1

1 X
n
1 1 1
= log + log
n PSm (Sm ) n
t=m+1
PSt |St−1 (Sl |Sll− 1
−m )
t−m

1 X
n
1 1 1
= log + log t−1
, (12.5)
n PS1 (Sm ) n PSm+1 |S1 (St |St−m )
m
| {z 1
} | t=m +1
{z }
→0
→H(Sm+1 |Sm
1 ) by Birkhoff-Khintchine

1
where we applied Theorem 12.8 with f(s1 , s2 , . . .) = log PS m (sm+1 |sm )
.
m+1 |S1 1

Now let us prove (12.3) for a general stationary ergodic process {Si } which might have infinite
memory. The idea is to first approximate the distribution of that ergodic process by an m-th order
Markov chain (finite memory) and make use of (12.5), then let m → ∞ to make the approximation
accurate. This is a highly influential contribution of Shannon to the theory of stochastic processes,
known as Markov approximation.

Proof of Theorem 12.5 in L1 . To show that (12.3) converges in L1 , we want to show that
1 1
E log − H → 0, n → ∞.
n PSn (Sn )
To this end, fix an m ∈ N. Define the following auxiliary distribution for the process:
Y

Q(m) (S∞
1 ) ≜ PSm
1
( Sm
1) PSt |St−1 (St |Stt− 1
−m )
t− m
t=m+1
Y∞
= PSm1 (Sm
1) PSm+1 |Sm1 (St |Stt− 1
−m ),
t=m+1

where the second line applies stationarity. Note that under Q(m) , {Si } is an mth -order time-
homogeneous Markov chain.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-236


i i

236

By triangle inequality,

1 1 1 1 1 1 1 1
E log − H ≤ E log − log (m) +E log (m) − Hm + |Hm − H|
n n
PSn (S ) n n
PSn (S ) n QSn (Sn ) n QSn (Sn ) | {z }
| {z } | {z } ≜C
≜A ≜B

where Hm ≜ H(Sm+1 |Sm1 ).


We discuss each term separately next:

• C = |Hm − H| → 0 as m → ∞ by Theorem 5.4 (Recall that for stationary processes:


H(Sm+1 |Sm
1 ) → H from above).
• As shown in Remark 12.2, for any fixed m, B → 0 in L1 as n → ∞, as a consequence of
Birkhoff-Khintchine. Hence for any fixed m, EB → 0 as n → ∞.
• For term A, applying the next Lemma 12.10,

1 dPSn 1 (m) 2 log e


E[A] = EP log (m)
≤ D(PSn kQSn ) +
n dQSn n en

where
" #
1 1 PSn (Sn )
(m)
D(PSn kQSn ) = E log Qn
n n PSm (Sm ) t=m+1 PSm+1 |S1 (St |St−1 )
m t− m

1
= (−H(Sn ) + H(Sm ) + (n − m)Hm )
n
→ Hm − H as n → ∞,

with the second equality following from stationarity again.

Combining all three terms and sending n → ∞, we obtain for any m,

1 1
lim sup E log − H ≤ 2(Hm − H).
n→∞ n PSn (Sn )

Sending m → ∞ completes the proof of the L1 -convergence.

Lemma 12.10
 
dP 2 log e
EP log ≤ D(PkQ) + .
dQ e

Proof. |x log x| − x log x ≤ 2 log e


e , ∀x > 0, since LHS is zero if x ≥ 1, and otherwise upper
1 2 log e
bounded by 2 sup0≤x≤1 x log x = e .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-237


i i

12.4* Proof of the Birkhoff-Khintchine Theorem 237

12.4* Proof of the Birkhoff-Khintchine Theorem


Proof of Theorem 12.8. For any function f̃ ∈ L1 and any ϵ, there exists a decomposition f̃ = f + h
such that f is bounded and h ∈ L1 with khk1 = E |h(S∞ 1 )| ≤ ϵ.
Let us first focus on the bounded function f. Note that in the bounded domain L1 ⊂ L2 , thus
f ∈ L2 . Furthermore, L2 is a Hilbert space with inner product (f, g) = E[f(S∞ ∞
1 )g(S1 )].
For the measure preserving transformation τ that generates the stationary process {Si }, define
the operator T(f) = f ◦ τ . Since τ is measure preserving, we know that kTfk22 = kfk22 , thus T is a
unitary and bounded operator.
Define the operator
1X
n
A n ( f) = f ◦ τk
n
k=1

Intuitively:
1X k 1
n
An = T = (I − Tn )(I − T)−1
n n
k=1

Then, if f ⊥ ker(I − T) we should have An f → 0, since only components in the kernel can blow
up. This intuition is formalized in the proof below.
Let us further decompose f into two parts f = f1 + f2 , where f1 ∈ ker(I − T) and f2 ∈ ker(I − T)⊥ .
We make the following observations:

• if g ∈ ker(I − T), g must be a constant function. This is due to the ergodicity. Consider the
indicator function 1A : if 1A = 1A ◦ τ = 1τ −1 A , then P[A] = 0 or 1. For a general case, suppose
g = Tg and g is not constant, then at least some set {g ∈ (a, b)} will be shift-invariant and have
non-trivial measure, violating ergodicity.
• ker(I − T) = ker(I − T∗ ). This is due to the fact that T is unitary:

g = Tg ⇒ kgk2 = (Tg, g) = (g, T∗ g) ⇒ (T∗ g, g) = kgkkT∗ gk ⇒ T∗ g = g


where in the last step we used the fact that Cauchy-Schwarz (f, g) ≤ kfk · kgk only holds with
equality for g = cf for some constant c.
• ker(I − T)⊥ = ker(I − T∗ )⊥ = [Im(I − T)], where [Im(I − T)] denotes an L2 closure.
• g ∈ ker(I − T)⊥ ⇐⇒ E[g] = 0. Indeed, only zero-mean functions are orthogonal to constants.

With these observations, we know that f1 = m is a constant. Also, f2 ∈ [Im(I − T)] so we further
approximate it by f2 = f0 + h1 , where f0 ∈ Im(I − T), namely f0 = g − g ◦ τ for some function
g ∈ L2 , and kh1 k1 ≤ kh1 k2 < ϵ. Therefore we have
An f1 = f1 = E[f]
1
An f0 = (g − g ◦ τ n ) → 0 a.s. and L1
n
P g◦τ n 2 P 1 a.s.
since E[ n≥1 ( n ) ] = E[g ] n2 < ∞ and hence 1n g ◦ τ n −−→0 by Borel-Cantelli.
2

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-238


i i

238

The proof is completed by showing


 

P lim sup An (h + h1 ) ≥ δ ≤ . (12.6)
n δ
Indeed, then by taking ϵ → 0 we will have shown
 
P lim sup An (f) ≥ E[f] + δ = 0
n→∞

as required, and the opposite direction is shown analogously.

The proof of (12.6) makes use of the Maximal Ergodic Lemma stated as follows:

Theorem 12.11 (Maximal Ergodic Lemma) Let (P, τ ) be a probability measure and a
measure-preserving transformation. Then for any f ∈ L1 (P) we have
  
E[f1 supn≥1 An f > a ] kfk1
P sup An f > a ≤ ≤
n≥1 a a
Pn−1
where An f = 1
n k=0 f ◦ τ k.

This is a so-called “weak L1 ” estimate for a sublinear operator supn An (·). In fact, this theorem
is exactly equivalent to the following result:

Lemma 12.12 (Estimate for the maximum of averages) Let {Zn : n = 1, . . .} be a


stationary process with E[|Z1 |] < ∞ then
 
Z1 + . . . + Zn E[|Z1 |]
P sup >a ≤ ∀a > 0.
n≥1 n a

Proof. The argument for this Lemma has originally been quite involved, until a dramatically
simple proof (below) was found by A. Garsia [180, Theorem 2.2.2]. Define
X
n
Sn = Zk ,
k=1

Ln = max{0, Z1 , . . . , Z1 + · · · + Zn },
Mn = max{0, Z2 , Z2 + Z3 , . . . , Z2 + · · · + Zn },
Sn
Z∗ = sup .
n≥1 n

It is sufficient to show that

E[Z1 1{Z∗ >0} ] ≥ 0 . (12.7)

Indeed, applying (12.7) to Z̃1 = Z1 − a and noticing that Z̃∗ = Z∗ − a we obtain

E[Z1 1{Z∗ >a} ] ≥ aP[Z∗ > a] ,

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-239


i i

12.5* Sinai’s generator theorem 239

from which Lemma follows by upper-bounding the left-hand side with E[|Z1 |].
In order to show (12.7) we notice that

Z1 + Mn = max{S1 , . . . , Sn }

and furthermore

Z1 + M n = Ln on {Ln > 0}

Thus, we have

Z1 1{Ln >0} = Ln − Mn 1{Ln >0}

where we do not need indicator in the first term since Ln = 0 on {Ln > 0}c . Taking expectation
we get

E[Z1 1{Ln >0} ] = E[Ln ] − E[Mn 1{Ln >0} ]


≥ E[Ln ] − E[Mn ]
= E[Ln ] − E[Ln−1 ] = E[Ln − Ln−1 ] ≥ 0 , (12.8)

where we used Mn ≥ 0, the fact that Mn has the same distribution as Ln−1 , and Ln ≥ Ln−1 ,
respectively. Taking limit as n → ∞ in (12.8) and noticing that {Ln > 0} % {Z∗ > 0}, we
obtain (12.7).

12.5* Sinai’s generator theorem


As we mentioned in the introduction to this Chapter, there is a classical conundrum in natural
science. Our microscopic description of motions of atoms is fully deterministic (i.e. given posi-
tions and velocities of atoms at time t there is an operator τ that gives their positions in time
t + 1). On the other hand, in many ways large systems behave probabilistically (as described by
statistical mechanics, Gibbs distributions etc). An important conceptual bridge was built with the
introduction of Kolmogorov-Sinai entropy, which in a nutshell attempts to resolve the conundrum
by noticing that our way of describing a system at time t would typically involve only finitely
many bits, and thus while τ is deterministic when acting on a full description of state, from the
point of view of any finite-bit description τ appears to act stochastically.
More formally, we associate to every probability-preserving transformation (p.p.t.) τ a number,
called the Kolmogorov-Sinai entropy. This number is invariant to isomorphisms of p.p.t.’s (appro-
priately defined). Sinai’s generator theorem then allows one to compute the Kolmogorov-Sinai
entropy.

Definition 12.13 Fix a probability-preserving transformation τ acting on probability space


(Ω, F, P). Kolmogorov-Sinai entropy of τ is defined as
1
H(τ ) ≜ sup lim H(X0 , X0 ◦ τ, . . . , X0 ◦ τ n−1 ) ,
X0 n→∞ n

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-240


i i

240

where supremum is taken over all finitely-valued random variables X0 : Ω → X measurable with
respect to F .

Note that every random variable X0 generates a stationary process adapted to τ , that is

Xk ≜ X0 ◦ τ k .

In this way, Kolmogorov-Sinai entropy of τ equals the maximal entropy rate among all stationary
processes adapted to τ . This quantity may be extremely hard to evaluate, however. One help comes
in the form of the famous criterion of Y. Sinai. We need to elaborate on some more concepts before:

• σ -algebra G ⊂ F is P-dense in F , or sometimes we also say G = F mod P or even G = F


mod 0, if for every E ∈ F there exists E′ ∈ G s.t.

P[E∆E′ ] = 0 .

• Partition A = {Ai : i = 1, 2, . . .} measurable with respect to F is called generating if


_

σ{τ −n A} = F mod P .
n=0

• Random variable Y : Ω → Y with a countable alphabet Y is called a generator of (Ω, F, P, τ )


if

σ{Y, Y ◦ τ, . . . , Y ◦ τ n , . . .} = F mod P

Theorem 12.14 (Sinai’s generator theorem) Let Y be the generator of a p.p.t.


(Ω, F, P, τ ). Let H(Y) be the entropy rate of the process Y = {Yk = Y ◦ τ k : k = 0, . . .}. If
H(Y) is finite, then H(τ ) = H(Y).

Proof. Notice that since H(Y) is finite, we must have H(Yn0 ) < ∞ and thus H(Y) < ∞. First, we
argue that H(τ ) ≥ H(Y). If Y has finite alphabet, then it is simply from the definition. Otherwise
let Y be Z+ -valued. Define a truncated version Ỹm = min(Y, m), then since Ỹm → Y as m → ∞
we have from lower semicontinuity of mutual information, cf. (4.28), that

lim I(Y; Ỹm ) ≥ H(Y) ,


m→∞

and consequently for arbitrarily small ϵ and sufficiently large m

H(Y|Ỹ) ≤ ϵ ,

Then, consider the chain

H(Yn0 ) = H(Ỹn0 , Yn0 ) = H(Ỹn0 ) + H(Yn0 |Ỹn0 )


X n
= H(Ỹn0 ) + H(Yi |Ỹn0 , Yi0−1 )
i=0

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-241


i i

12.5* Sinai’s generator theorem 241

X
n
≤ H(Ỹn0 ) + H(Yi |Ỹi )
i=0

= H(Ỹn0 ) + nH(Y|Ỹ) ≤ H(Ỹn0 ) + nϵ


Thus, the entropy rate of Ỹ (which is on a finite alphabet) can be made arbitrarily close to that of
Y, concluding that H(τ ) ≥ H(Y).
The bulk of the proof is to show that for any stationary process X adapted to τ the entropy rate
is upper bounded by H(Y). To that end, consider X : Ω → X with finite X and define as usual the
process X = {X ◦ τ k : k = 0, 1, . . .}. By the generating property of Y we have that X (perhaps
after modification on a set of measure zero) is a function of Y∞0 . So are all Xk ’s. Thus

H(X0 ) = I(X0 ; Y∞ n
0 ) = lim I(X0 ; Y0 ) ,
n→∞

where we used the continuity-in-σ -algebra property of mutual information, cf. (4.30). Rewriting
the latter limit differently, we have
lim H(X0 |Yn0 ) = 0 .
n→∞

0 ) ≤ ϵ. Then consider the following chain:


Fix ϵ > 0 and choose m so that H(X0 |Ym
H(Xn0 ) ≤ H(Xn0 , Yn0 ) = H(Yn0 ) + H(Xn0 |Yn0 )
X n
≤ H(Yn0 ) + H(Xi |Yni )
i=0
X
n
= H(Yn0 ) + H(X0 |Yn0−i )
i=0

≤ H(Yn0 ) + m log |X | + (n − m)ϵ ,


where we used stationarity of (Xk , Yk ) and the fact that H(X0 |Yn0−i ) < ϵ for i ≤ n − m. After
dividing by n and passing to the limit our argument implies
H(X) ≤ H(Y) + ϵ .
Taking here ϵ → 0 completes the proof.
Alternative proof: Suppose X0 is taking values on a finite alphabet X and X0 = f(Y∞
0 ). Then (this
is a measure-theoretic fact) for every ϵ > 0 there exists m = m(ϵ) and a function fϵ : Y m+1 → X
s.t.
P[f(Y∞
0 ) 6= fϵ (Y0 )] ≤ ϵ .
m

S
(This is just another way to say that n σ(Yn0 ) is P-dense in σ(Y∞
0 ).) Define a stationary process
X̃ as
X̃j ≜ fϵ (Yjm+j ) .
Notice that since X̃n0 is a function of Yn0+m we have
H(X̃n0 ) ≤ H(Yn0+m ) .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-242


i i

242

Dividing by m and passing to the limit we conclude that the entropy rates satisfy

H(X̃) ≤ H(Y) .

Finally, to relate X̃ to X notice that by construction for every j

P[X̃j 6= Xj ] ≤ ϵ .

Since both processes take values on a fixed finite alphabet, from Corollary 6.7 we infer that

|H(X) − H(X̃)| ≤ ϵ log |X | + h(ϵ) .

Altogether, we have shown that H(X) ≤ H(Y) + ϵ log |X | + h(ϵ). Taking ϵ → 0 concludes the
proof.

Some examples of Theorem 12.14 are as follows:

• Let Ω = [0, 1], F the Borel σ -algebra, P = Leb and


(
2ω, ω < 1/2
τ (ω) = 2ω mod 1 =
2ω − 1, ω ≥ 1/2

It is easy to show that Y(ω) = 1{ω < 1/2} is a generator and that Y is an i.i.d. Bernoulli(1/2)
process. Thus, we get that Kolmogorov-Sinai entropy is H(τ ) = log 2.
Let us understand significance of this example and Sinai’s result. If we have full “microscopic”
description of the initial state of the system ω , then the future states of the system are completely
deterministic: τ (ω), τ (τ (ω)), · · · . However, in practice we can not possibly have a complete
description of the initial state, and should be satisfied with some discrete (i.e. finite or countably-
infinite) measurement outcomes Y(ω), Y(τ (ω)) etc. What we infer from the previous result
is that no matter how fine our discrete measurements are, they will still generate a process
that will have finite entropy rate (equal to log 2 bits per measurement). This reconciles the
apparent paradox between Newtonian (dynamical) and Gibbsian (statistical) points of view
on large mechanical systems. In more mundane terms, we may notice that Sinai’s theorem tells
us that much more complicated stochastic processes (e.g. the one generated by a ternary valued
measurement Y′ (ω = 1{ω > 1/3} + 1{ω > 2/3}) would still have the entropy rate same as the
simple iid Bernoulli(1/2) process.
• Let Ω be the unit circle S1 , F the Borel σ -algebra, and P the normalized length and

τ (ω) = ω + γ
γ
i.e. τ is a rotation by the angle γ . (When 2π is irrational, this is known to be an ergodic p.p.t.).
Here Y = 1{|ω| < 2π ϵ} is a generator for arbitrarily small ϵ and hence

H(τ ) ≤ H(X) ≤ H(Y0 ) = h(ϵ) → 0 as ϵ → 0 .

This is an example of a zero-entropy p.p.t.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-243


i i

12.5* Sinai’s generator theorem 243

Remark 12.3 Two p.p.t.’s (Ω1 , τ1 , P1 ) and (Ω0 , τ0 , P0 ) are called isomorphic if there exists
fi : Ωi → Ω1−i defined Pi -almost everywhere and such that 1) τ1−i ◦fi = f1−i ◦τi ; 2) fi ◦f1−i is identity
on Ωi (a.e.); 3) Pi [f−
1−i E] = P1−i [E]. It is easy to see that Kolmogorov-Sinai entropies of isomorphic
1

p.p.t.s are equal. This observation was made by Kolmogorov in 1958. It was revolutionary, since it
allowed to show that p.p.t.s corresponding to shifts of iid Ber(1/2) and iid Ber(1/3) processes are
not isomorphic. Before, the only invariants known were those obtained from studying the spectrum
of a unitary operator
Uτ : L 2 ( Ω , P ) → L 2 ( Ω , P ) (12.9)
ϕ(x) 7→ ϕ(τ (x)) . (12.10)
However, the spectrum of τ corresponding to any non-constant i.i.d. process consists of the entire
unit circle, and thus is unable to distinguish Ber(1/2) from Ber(1/3).2

2
To see the statement about the spectrum, let Xi be iid with zero mean and unit variance. Then consider ϕ(x∞1 ) defined as
∑m iωk x . This ϕ has unit energy and as m → ∞ we have kU ϕ − eiω ϕk
√1 L2 → 0. Hence every e
iω belongs to
m k=1 e k τ
the spectrum of Uτ .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-244


i i

13 Universal compression

Unfortunately, theory developed so far is not very helpful for anyone tasked with actually com-
pressing a file of English text. Indeed, since the probability law governing text generation is not
given to us, one cannot apply compression results that we discussed so far. In this chapter we
will discuss how to produce compression schemes that do not require a priori knowledge of the
distribution. For example, an n-letter input compressor maps X n → {0, 1}∗ . There is no one fixed
probability distribution PXn on X n , but rather a whole class of distributions. Thus, the problem of
compression becomes intertwined with the problem of distribution (density) estimation and we
will see that optimal algorithms for both problems are equivalent.
The plan for this chapter is as follows:

1 We will start by discussing the earliest example of a universal compression algorithm (of
Fitingof). It does not talk about probability distributions at all. However, it turns out to be asymp-
totically optimal simultaneously for all i.i.d. distributions and with small modifications for all
finite-order Markov chains.
2 Next class of universal compressors is based on assuming that the true distribution PXn belongs
to a given class. These methods proceed by choosing a good model distribution QXn serving as
the minimax approximation to each distribution in the class. The compression algorithm for a
single distribution QXn is then designed as in previous chapters.
3 Finally, an entirely different idea are algorithms of Lempel-Ziv type. These automatically adapt
to the distribution of the source, without any prior assumptions required.
Throughout this chapter, all logarithms are binary. Instead of describing each compres-
sion algorithm, we will merely specify some distribution QXn and apply one of the following
constructions:

• Sort all xn in the order of decreasing QXn (xn ) and assign values from {0, 1}∗ as in Theorem 10.2,
this compressor has lengths satisfying
1
ℓ(f(xn )) ≤ log .
QXn (xn )
• Set lengths to be
 
1
ℓ(f(xn )) ≜ log
QXn (xn )
and apply Kraft’s inequality Theorem 10.9 to construct a prefix code.

244

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-245


i i

13.1 Arithmetic coding 245

• Use arithmetic coding (see next section).

The important conclusion is that in all these cases we have


1
ℓ(f(xn )) = log ± universal constant ,
Q X n ( xn )
and in this way we may and will always replace lengths with log QXn1(xn ) . In this architecture, the
only task of a universal compression algorithm is to specify the QXn , which is known as universal
probability assignment in this context.
Qn
If one factorizes QXn = t=1 QXt |Xt−1 then we arrive at a crucial conclusion: universal compres-
1
sion is equivalent to sequential (online) prediction under the log-loss, which in itself is simply
a version of the density estimation task in learning theory. This exciting connection between
compression and learning theory is explored in Section 13.6 and is a highlight of this Chapter.
In turn, machine learning drives advances in universal compression. As of 2022 the best per-
forming text compression algorithms (cf. the leaderboard at [289]) use a deep neural network
(specifically, a transformer model) that starts from a fixed initialization. As the input text is pro-
cessed, parameters of the network are continuously updated via stochastic gradient descent causing
progressively better prediction (and hence compression) performance.
This chapter, thus, can be understood as both a set of results on information theory (universal
compression) or machine learning (online prediction/density estimation).

13.1 Arithmetic coding


Constructing an encoder table from QXn may require a lot of resources if n is large. Arithmetic
coding provides a convenient workaround by allowing the encoder to output bits sequentially.
Notice that to do so, it requires that not only QXn but also its marginalizations QX1 , QX2 , · · · be
easily computable. (This is not the case, for example, for Shtarkov distributions (13.12)-(13.14),
which are not compatible for different n.)
Let us agree upon some ordering on the alphabet of X (e.g. a < b < · · · < z) and extend this
order lexicographically to X n (that is for x = (x1 , . . . , xn ) and y = (y1 , . . . , yn ), we say x < y if
xi < yi for the first i such that xi 6= yi , e.g., baba < babb). Then let
X
F n ( xn ) = Q X n ( yn ) .
yn < xn

Associate to each xn an interval Ixn = [Fn (xn ), Fn (xn ) + QXn (xn )). These intervals are disjoint
subintervals of [0, 1). As such, each xn can be represented uniquely by any point in the interval Ixn .
A specific choice is as follows. Encode
xn 7→ largest dyadic interval Dxn contained in Ixn (13.1)
and we agree to select the left-most dyadic interval when there are two possibilities. Recall that
dyadic intervals are intervals of the type [m2−k , (m + 1)2−k ) where m is an integer. We encode
such interval by the k-bit (zero-padded) binary expansion of the fractional number m2−k =

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-246


i i

246

Pk
0.b1 b2 . . . bk = i=1 bi 2−i . For example, [3/4, 7/8) 7→ 110, [3/4, 13/16) 7→ 1100. We set the
codeword f(xn ) to be that string. The resulting code is a prefix code satisfying
 
1 1
log2 ≤ ℓ(f(x )) ≤ log2
n
+ 1. (13.2)
QXn (xn ) Q X n ( xn )
(This is an exercise, see Ex. II.13.)
Observe that
X
Fn (xn ) = Fn−1 (xn−1 ) + QXn−1 (xn−1 ) QXn |Xn−1 (y|xn−1 )
y< xn

and thus Fn (xn ) can be computed sequentially if QXn−1 and QXn |Xn−1 are easy to compute. This
method is the method of choice in many modern compression algorithms because it allows to
dynamically incorporate the learned information about the data stream, in the form of updating
QXn |Xn−1 (e.g. if the algorithm detects that an executable file contains a long chunk of English text,
it may temporarily switch to QXn |Xn−1 modeling the English language).
We note that efficient implementation of arithmetic encoder and decoder is a continuing
research area. Indeed, performance depends on number-theoretic properties of denominators of
distributions QXt |Xt−1 , because as encoder/decoder progress along the string, they need to peri-
odically renormalize the current interval Ixt to be [0, 1) but this requires carefully realigning the
dyadic boundaries. A recent idea of J. Duda, known as asymmetric numeral system (ANS) [138],
lead to such impressive computational gains that in less than a decade it was adopted by most
compression libraries handling diverse data streams (e.g., the Linux kernel images, Dropbox and
Facebook traffic, etc).

13.2 Combinatorial construction of Fitingof


Fitingof [170] suggested that a sequence xn ∈ X n should be prescribed information Φ0 (xn ) equal
to the logarithm of the number of all possible permutations obtainable from xn (i.e. log-size of the
type-class containing xn ). As we have shown in Proposition 1.5:
Φ0 (xn ) = nH(xT ) + O(log n) T ∼ Unif([n]) (13.3)
= nH(P̂xn ) + O(log n) , (13.4)
where P̂xn is the empirical distribution of the sequence xn :
1X
n
P̂xn (a) ≜ 1{xi = a} . (13.5)
n
i=1

Then Fitingof argues that it should be possible to produce a prefix code with
ℓ(f(xn )) = Φ0 (xn ) + O(log n) . (13.6)
This can be done in many ways. In the spirit of what comes next, let us define
QXn (xn ) ≜ exp{−Φ0 (xn )}cn , (13.7)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-247


i i

13.3 Optimal compressors for a class of sources. Redundancy. 247

where the normalization constant cn is determined by the number of types, namely, cn =


|−1

1/ n+|X
|X |−1
. Counting the number of different possible empirical distributions (types), we get

cn = O(n−(|X |−1) ) ,

and thus, by Kraft inequality, there must exist a prefix code with lengths satisfying (13.6).1 Now
i.i.d.
taking expectation over Xn ∼ PX we get

E[ℓ(f(Xn ))] = nH(PX ) + (|X | − 1) log n + O(1) ,

for every i.i.d. source on X .

Universal compressor for all finite-order Markov chains. Fitingof’s idea can be extended as
follows. Define now the first-order information content Φ1 (xn ) to be the log of the number of all
sequences, obtainable by permuting xn with the extra restriction that the new sequence should have
the same statistics on digrams. Asymptotically, Φ1 is just the conditional entropy

Φ1 (xn ) = nH(xT |xT−1 ) + O(log n), T ∼ Unif([n]) ,

where T − 1 is understood in the sense of modulo n. Again, it can be shown that there exists a code
such that lengths

ℓ(f(xn )) = Φ1 (xn ) + O(log n) .

This implies that for every first-order stationary Markov chain X1 → X2 → · · · → Xn we have

E[ℓ(f(Xn ))] = nH(X2 |X1 ) + O(log n) .

This can be further continued to define Φ2 (xn ) leading to a universal code that is asymptotically
optimal for all second-order Markov chains, and so on and so forth.

13.3 Optimal compressors for a class of sources. Redundancy.


So we have seen that we can construct compressor f : X n → {0, 1}∗ that achieves

E[ℓ(f(Xn ))] ≤ H(Xn ) + o(n) ,

simultaneously for all i.i.d. sources (or even all r-th order Markov chains). What should we
do next? Krichevsky [259] suggested that the next barrier should be to minimize the regret, or
redundancy:

E[ℓ(f(Xn ))] − H(Xn )

simultaneously for all sources in a given class. We proceed to rigorous definitions.

1
Explicitly, we can do a two-part encoding: first describe the type class of xn (takes (|X | − 1) log n bits) and then describe
the element of the class (takes Φ0 (xn ) bits).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-248


i i

248

Given a collection {PXn |θ : θ ∈ Θ} of sources, and a compressor f : X n → {0, 1}∗ we define


its redundancy as

sup E[ℓ(f(Xn ))|θ = θ0 ] − H(Xn |θ = θ0 ) .


θ0

Replacing code lengths with log Q1Xn , we define redundancy of the distribution QXn as

sup D(PXn |θ=θ0 kQXn ) .


θ0

Thus, the question of designing the best universal compressor (in the sense of optimizing worst-
case deviation of the average length from the entropy) becomes the question of finding solution
of:

Q∗Xn = argmin sup D(PXn |θ=θ0 kQXn ) .


QXn θ0

We therefore arrive at the following definition

Definition 13.1 (Redundancy in universal compression) Given a class of sources


{PXn |θ=θ0 : θ0 ∈ Θ, n = 1, . . .} we define its minimax redundancy as

R∗n ≡ R∗n (Θ) ≜ min sup D(PXn |θ=θ0 kQXn ) . (13.8)


QXn θ0 ∈Θ

Assuming the finiteness of R∗n , Theorem 5.9 gives the maximin and capacity representation

R∗n = sup min D(PXn |θ kQXn |π ) (13.9)


π QXn

= sup I(θ; Xn ) , (13.10)


π

where optimization is over priors π ∈ P(Θ) on θ. Thus redundancy is simply the capacity of
the channel θ → Xn . This result, obvious in hindsight, was rather surprising in the early days of
universal compression. It is known as capacity-redundancy theorem.
Finding exact QXn -minimizer in (13.8) is a daunting task even for the simple class of all i.i.d.
Bernoulli sources (i.e. Θ = [0, 1], PXn |θ = Bern (θ)). In fact, for smooth parametric families the
capacity-achieving input distribution is rather cumbersome: it is a discrete distribution with a kn
atoms, kn slowly growing as n → ∞. A provocative conjecture was put forward by physicists [296,
2] that there is a certain universality relation:
3
R∗n = log kn + o(log kn )
4
satisfied for all parametric families simultaneously. For the Bernoulli example this implies kn 
n2/3 , but even this is open. However, as we will see below it turns out that these unwieldy capacity-
achieving input distributions converge as n → ∞ to a beautiful limiting law, known as the Jeffreys
prior.
Remark 13.1 (Shtarkov, Fitingof and individual sequence approach) There is a connection
between the combinatorial method of Fitingof and the method of optimality for a class. Indeed,

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-249


i i

13.3 Optimal compressors for a class of sources. Redundancy. 249

following Shtarkov we may want to choose distribution QXn so as to minimize the worst-case
redundancy for each realization xn (not average!):
PXn |θ (xn |θ0 )
R∗∗
n (Θ) ≜ min max sup log (13.11)
nQXn x θ0 ∈Θ QXn (xn )
This minimization is attained at the Shtarkov distribution (also known as the normalized maximal
likelihood (NML) code):
(S) 1
Q X n ( xn ) = sup P n (xn |θ0 ) , (13.12)
Z θ0 ∈Θ X |θ
where the normalization constant
X
Z= sup PXn |θ (xn |θ0 ) , (13.13)
xn ∈X n θ0 ∈Θ

is called the Shtarkov sum. If the class {PXn |θ : θ ∈ Θ} is chosen to be all product distributions on
X then
( S) exp{−nH(P̂xn )}
(i.i.d.) QXn (xn ) = P , (13.14)
xn exp{−nH(P̂x )}
n

(S)
where H(P̂xn ) is the empirical entropy of xn . As such, compressing with respect to QXn recovers
Fitingof’s construction Φ0 up to O(log n) differences between nH(P̂xn ) and Φ0 (xn ). If we take
PXn |θ to be all first-order Markov chains, then we get construction Φ1 etc. Note also, that the
problem (13.11) can also be written as a minimization of the regret for each individual sequence
(under the log-loss, with respect to a parameter class PXn |θ ):
 
1 1
min max log − inf log . (13.15)
QXn xn QXn (xn ) θ0 ∈Θ PXn |θ (xn |θ0 )
In summary, using Shtarkov’s distribution (minimizer of (13.15)) makes sure that any individual
realization of xn (whether it was or was not generated by PXn |θ=θ0 for some θ0 ) is compressed
almost as well as the best compressor tailored for the class of PXn |θ . Hence, if our model class
PXn |θ approximates the generative process of xn well, we achieves nearly optimal compression. In
Section 13.7 below we will also learn that QXn |Xn−1 can be interpreted as online estimator of the
distribution of xj ’s.
Remark 13.2 (Two redundancies) In the literature of universal compression, the quantity
R∗∗
n is known as the worst-case or pointwise minimax redundancy, in comparison with the average-
case minimax redundancy R∗n in (13.8), which replaces maxxn in (13.11) by Exn ∼PXn |θ0 . It is known
that for many model classes, such as iid and finite-order Markov sources, R∗n and R∗∗ n agree in
the leading term as n → ∞.2 As R∗n ≤ R∗∗ n , typically the way one bounds the redundancies
is to upper bound R∗∗ n by bounding the pointwise redundancy (via combinatorial means) for a
specific probability assignment and lower bound R∗n by applying (13.10) and bounding the mutual

2
This, however, is not true in general. See Exercise II.21 for an example where R∗n < ∞ but R∗∗
n = ∞.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-250


i i

250

information for a specific prior; see Exercises II.15 and II.16 for an example and [112, Chap. 6-7]
for more.
Remark 13.3 (Redundancy for single-shot codes) We note that any prefix code f :
X n → {0, 1}∗ defines a distribution QXn (xn ) = 2−ℓ(f(x )) . (We assume the code’s binary tree is
n

full such that the Kraft sum equals one). Therefore, our definition of redundancy in (13.8) assess
the excess of expected length E[ℓ(f(Xn ))] over H(Xn ) for the prefix codes. For single-shot codes
(Section 10.1) without prefix constraints the optimal answers are slightly different, however. For
example, the optimal universal code for all i.i.d. sources satisfies E[ℓ(f(Xn ))] ≈ H(Xn )+ |X 2|−3 log n
in contrast with |X 2|−1 log n for prefix codes, cf. [41, 256].

13.4* Asymptotic maximin solution: Jeffreys prior


In this section we will only consider the simple setting of a class of sources consisting of all
i.i.d. distributions on a given finite alphabet |X | = d + 1, which defines a d-parameter family
of distributions. We will show that the prior, asymptotically achieving the capacity (13.10), is
given by the Dirichlet distribution with parameters set to 1/2. Recall that the Dirichlet distribution
Dirichlet(α0 , . . . , αd ) with parameters αj > 0 is a distribution for a probability vector (θ0 , . . . , θd )
such that (θ1 , . . . , θd ) has a joint density

Y
d
αj − 1
c(α0 , . . . , αd ) θj (13.16)
j=0

Pd Γ(α0 +...+αd )
and θ0 = 1 − j=1 θj , where c(α0 , . . . , αd ) = Qd is the normalizing constant.
j=0 Γ(αj )
First, we give the formal setting as follows:

• Fix a finite alphabet X of size |X | = d + 1, which we will enumerate as X = {0, . . . , d}.


Pd
• As in Example 2.6, let Θ = {(θ1 , . . . , θd ) : j=1 θj ≤ 1, θj ≥ 0} parametrizes the collection of
all probability distributions on X . Note that Θ is a d-dimensional simplex. We will also define

X
d
θ0 ≜ 1 − θj .
j=1

• The source class is


( )
Y
n X 1
PXn |θ (xn |θ) ≜ θxj = exp −n θa log ,
j=1 a∈X
P̂xn (a)

where as before P̂xn is the empirical distribution of xn , cf. (13.5).

In order to find the (near) optimal QXn , we need to guess an (almost) optimal prior π ∈ P(Θ)
in (13.10) and take QXn to be the mixture of PXn |θ ’s. We will search for π in the class of smooth

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-251


i i

13.4* Asymptotic maximin solution: Jeffreys prior 251

densities on Θ and set


Z
QXn (x ) ≜
n
PXn |θ (xn |θ′ )π (θ′ )dθ′ . (13.17)
Θ

Before proceeding further, we recall the Laplace method of approximating exponential inte-
grals. Suppose that f(θ) has a unique minimum at the interior point θ̂ of Θ and that Hessian Hessf
is uniformly lower-bounded by a multiple of identity (in particular, f(θ) is strongly convex). Then
taking Taylor expansion of π and f we get
Z Z
π (θ)e−nf(θ) dθ = (π (θ̂) + O(ktk))e−n(f(θ̂)− 2 t Hess f(θ̂)t+o(∥t∥ )) dt
1 T 2
(13.18)
Θ
Z
dx
= π (θ̂)e−nf(θ̂) e−x Hess f(θ̂)x √ (1 + O(n−1/2 ))
T
(13.19)
Rd nd
  d2
−nf(θ̂) 2π 1
= π (θ̂)e q (1 + O(n−1/2 )) (13.20)
n
det Hessf(θ̂)

where in the last step we computed Gaussian integral.


Next, we notice that

PXn |θ (xn |θ′ ) = exp{−n(D(P̂xn kPX|θ=θ′ ) + H(P̂xn ))} ,

and therefore, denoting

θ̂(xn ) ≜ P̂xn

we get from applying (13.20) to (13.17)


d 2π Pθ (θ̂)
+ O( n− 2 ) ,
1
log QXn (xn ) = −nH(θ̂) + log + log q
2 n log e
det JF (θ̂)

where we used the fact that Hess θ′ D(P̂kPX|θ=θ′ )|θ′ =θ̂ = log1 e JF (θ̂) with JF being the Fisher infor-
mation matrix introduced previously in (2.34). From here, using the fact that under Xn ∼ PXn |θ=θ′
the random variable θ̂ = θ′ + O(n−1/2 ) we get by approximating JF (θ̂) and Pθ (θ̂)
d Pθ (θ′ )
D(PXn |θ=θ′ kQXn ) = n(E[H(θ̂)]−H(X|θ = θ′ ))+ log n−log p +C+O(n− 2 ) , (13.21)
1

2 ′
det JF (θ )
where C is some constant (independent of the prior Pθ or θ′ ). The first term is handled by the next
result, refining Corollary 7.18.

Lemma 13.2 Let Xn i.i.d.


∼ P on a finite alphabet X such that P(x) > 0 for all x ∈ X . Let P̂ = P̂Xn
be the empirical distribution of Xn , then
 
|X | − 1 1
E[D(P̂kP)] = log e + o .
2n n
log e 2
In fact, nD(P̂kP) → 2 χ (|X | − 1) in distribution.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-252


i i

252


Proof. By Central Limit Theorem, n(P̂ − P) converges in distribution to N (0, Σ), where Σ =
diag(P) − PPT , where P is an |X |-by-1 column vector. Thus, computing second-order Taylor
expansion of D(·kP), cf. (2.34) and (2.37), we get the result. (To interchange the limit and the
expectation, more formally we need to condition on the event P̂n (x) ∈ (ϵ, 1 − ϵ) for all x ∈ X to
make the integrand function bounded. We leave these technical details as an exercise.)

Continuing (13.21) we get in the end

d π (θ′ )
+ constant + O(n− 2 )
1
D(PXn |θ=θ′ kQXn ) = log n − log p (13.22)
2 ′
det JF (θ )

under the assumption of smoothness of prior π and that θ′ is not on the boundary of Θ. Con-
sequently, we can see that in order for the prior π be the saddle point solution, we should
have
p
π (θ′ ) ∝ det JF (θ′ ) ,

provided that the right side is integrable. The prior proportional to the square-root of the
determinant of Fisher information matrix is known as the Jeffreys prior. In our case, using
the explicit expression for Fisher information (2.39), the Jeffreys prior π ∗ is found to be
Dirichlet(1/2, 1/2, · · · , 1/2), with density:

1
π ∗ (θ) = cd qQ , (13.23)
d
j=0 θj

Γ( d+ 1
2 )
where cd = Γ(1/2)d+1
is the normalization constant. The corresponding redundancy is then

d n Γ( d+2 1 )
R∗n = log − log + o( 1) . (13.24)
2 2πe Γ(1/2)d+1

Making the above derivation rigorous is far from trivial and was completed in [460]. (In
Exercise II.15 and II.16 we analyze the d = 1 case and show R∗n = 12 log n + O(1).)
Overall, we see that Jeffreys prior asymptotically maximizes (within o(1)) the supπ I(θ; Xn ) and
for this reason is called asymptotically maximin solution. Surprisingly [405], the corresponding
(KT)
mixture QXn , that we denote QXn (and study in detail in the next section), however, turns out to
not give the asymptotically optimal redundancy. That is we have for some c1 > c2 > 0 inequalities

R∗n + c1 + o(1) ≤ sup D(PXn |θ=θ0 kQXn ) ≤ R∗n + c2 + o(1) .


(KT)
θ0

(KT)
That is QXn is not asymptotically minimax (but it does achieve optimal redundancy up to O(1)
term). However, it turns out that patching the Jeffreys prior near the boundary of the simplex
(or using a mixture of Dirichlet distributions) does result in asymptotically minimax universal
probability assignments [460].

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-253


i i

13.5 Sequential probability assignment: Krichevsky-Trofimov 253

Extension to general smooth parametric families. The fact that Jeffreys prior θ ∼ π maxi-
mizes the value of mutual information I(θ; Xn ) for general parametric families was conjectured
in [46] in the context of selecting priors in Bayesian inference. This result was proved rigorously
in [95, 96]. We briefly summarize the results of the latter.
Let {Pθ : θ ∈ Θ0 } be a smooth parametric family admitting a continuous and bounded Fisher
information matrix JF (θ) everywhere on the interior of Θ0 ⊂ Rd . Then for every compact Θ
contained in the interior of Θ0 we have
Z p
d n
R∗n (Θ) = log + log det JF (θ)dθ + o(1) . (13.25)
2 2πe Θ

Although Jeffreys prior on Θ achieves (up to o(1)) the optimal value of supπ I(θ; Xn ), to produce
an approximate capacity-achieving output distribution QXn , however, one needs to take a mixture
with respect to a Jeffreys prior on a slightly larger set Θϵ = {θ : d(θ, Θ) ≤ ϵ} and take ϵ → 0
slowly with n → ∞. This sequence of QXn ’s does achieve the optimal redundancy up to o(1).
Remark 13.4 (Laplace’s law of succession) In statistics Jeffreys prior is justified as
being invariant to smooth reparametrization, as evidenced by (2.35). For example, in answering
“will the sun rise tomorrow”3 , Laplace proposed to estimate the probability by modeling sunrise
as i.i.d. Bernoulli process with a uniform prior on θ ∈ [0, 1]. However, this is √
clearly not very
10
logical, as one may equally well postulate uniformity of α = θ or β = pθ. Jeffreys prior
θ∼ √ 1 is invariant to reparametrization in the sense that if one computed det JF (α) under
θ(1−θ)
α-parametrization the result would be exactly the pushforward of the √ 1
along the map
θ(1−θ)
θ 7→ θ10 .

13.5 Sequential probability assignment: Krichevsky-Trofimov


From (13.23) it is not hard to derive the (asymptotically) optimal universal probability assignment
QXn . For simplicity we consider Bernoulli case, i.e. d = 1 and θ ∈ [0, 1] is the 1-dimensional
parameter. Then the Jeffrey’s prior and the resulting mixture distribution are given by4
1
P∗θ = p (13.26)
π θ(1 − θ)
(KT) (2t0 − 1)!! · (2t1 − 1)!!
Q X n ( xn ) = , ta = #{j ≤ n : xj = a}. (13.27)
2n n!
This assignment can now be used to create a universal compressor via one of the methods outlined
in the beginning of this chapter.

3
Interested readers should check Laplace’s rule of succession and the sunrise problem; see [229, Chap. 18] for a historical
and philosophical account.
4
We remind (2a − 1)!! = 1 · 3 · · · (2a − 1). The expression for QXn is obtained from the identity
∫ 1 θa (1−θ)b
0
√ (2a−1)!!(2b−1)!!
dθ = π 2a+b (a+b)! for integer a, b ≥ 0, which in turn is derived by change of variable z = θ
1−θ
and
θ(1−θ)
using the standard keyhole contour on the complex plane.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-254


i i

254

(KT) (KT)
Note that QXn−1 coincides with the marginalization of QXn to first n − 1 coordinates. This prop-
R
erty is not specific to KT distribution and holds for any QXn that is given in the form Pθ (dθ)PXn |θ
with Pθ not depending on n. What is remarkable, however, is that the conditional distribution
(KT)
QXn |Xn−1 has a rather elegant form:

t1 + 12
QXn |Xn−1 (1|xn−1 ) =
(KT)
, t1 = #{j ≤ n − 1 : xj = 1} (13.28)
n
1
t0 + 2
QXn |Xn−1 (0|xn−1 ) =
(KT)
, t0 = #{j ≤ n − 1 : xj = 0} (13.29)
n
This is the famous “add-1/2” rule of Krichevsky and Trofimov [260]. As mentioned in Section 13.1,
this sequential assignment is very convenient for implementing an arithmetic
 coder. 
Let fKT : {0, 1}n → {0, 1}∗ be the encoder assigning length l(fKT (xn )) = log2 1
(KT) . Now
QXn (xn )
from (13.24) we know that we

1
sup {E [l(fKT (Snθ ))] − nh(θ)} = log n + O(1) .
0≤θ≤1 2

Since (13.24) was not shown rigorously in Exercise II.15 we prove the upper bound of this claim
independently.

Remark 13.5 (Laplace “add-1” rule) A slightly less optimal choice of QXn results from
Laplace prior (recall that Laplace advise to take Pθ to be uniform on [0, 1]). Then, in the case of
binary (Bernoulli) alphabet we get

1
(Lap)
QXn = n
 , w = #{j : xj = 1} . (13.30)
w ( n + 1)

The corresponding sequential probability assignment is given by

t1 + 1
QXn |Xn−1 (1|xn−1 ) =
(Lap)
, t1 = #{j ≤ n − 1 : xj = 1} .
n+1
We notice two things. First, the distribution (13.30) is exactly the same as Fitingof’s (13.7). Second,
this distribution “almost” attains the optimal first-order term in (13.24). Indeed, when Xn is iid
Ber(θ) we have for the redundancy:
" #   
1 n
E log (Lap) − H(Xn ) = log(n + 1) + E log − nh(θ) , W ∼ Bin(n, θ) . (13.31)
Q n (X ) n W
X

From Stirling’s expansion we know that as n → ∞ this redundancy evaluates to 12 log n + O(1),
uniformly in θ over compact subsets of (0, 1). However, for θ = 0 or θ = 1 the Laplace redun-
dancy (13.31) clearly equals log(n + 1). Thus, supremum over θ ∈ [0, 1] is achieved close to
endpoints and results in suboptimal redundancy log n + O(1). The Jeffreys prior (13.26) and the
resulting KT compressor fixes the problem at the endpoints.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-255


i i

13.6 Online prediction and density estimation 255

For a general (non-binary) alphabet KT distribution is equally simple:


1
ta +
QXn |Xn−1 (a|xn−1 ) =
(KT) 2
|X |−2
, ta = #{j ≤ n − 1 : xj = a} (13.32)
n+ 2
In summary, to build a universal compressor for a class of all iid sources on a given finite
alphabet X we can do the following:

1 Learner: Set QX1 = Unif[X ].


2 Arithmetic encoder: subdivide [0, 1] interval using QX1 . Receive the first letter x1 and select
partition QX1 (x1 ).
3 Learner: given x1 compute QX2 |X1 =x1 according to (13.32).
4 Arithmetic encoder: Subdivide currently selected partition according to QX2 |X1 =x1 . Receive the
next letter x2 and select partition QX2 |X1 =x1 (x2 ).
5 etc.
For compression we only use the state of the arithmetic encoder to output corresponding {0, 1}
bits. In the next section, however, we will see that the learner’s output QXn |Xn−1 is a very relevant
quantity: it is (regret-optimal) online density estimator under KL loss.

13.6 Online prediction and density estimation


It turns out that the universal compression problem (or, more specifically, a universal probabil-
ity assignment QXn representing a whole class of distributions) automatically solves two other
important problems: one in machine learning (online prediction) and the other in statistics (den-
sity estimation). For this reason, in information theory the problem of universal compression is
also sometimes called “universal prediction”. In the next two sections we will briefly explain this
fundamental connection. For a full story we recommend an excellent textbook on the topic [86].
(θ)
Let us fix X and a collection PXn , θ ∈ Θ of measures on X n , which we will call a model class
Θ. Although a case of continuous X is even more interesting (as we will explore in Section 32.1),
(θ)
for now we restrict to attention to discrete X and, thus, we understand PXn (·) as a PMF on X n .
(θ) Qn (θ)
The most immediate choice is to have PXn (x1 , . . . , xn ) = i=1 PX1 (xi ), which would correspond
to a class of iid sources, but we do not make this restriction.
(θ ∗ )
Online prediction. Given a sequence X1 , . . . , Xn ∼ PXn the learner sequentially observes
X1 , . . . , Xt−1 and outputs its prediction Qt (·) about the next sample Xt . Once the next sample is
revealed learner experiences a loss measured via log-loss:
1
ℓ(Qt , Xt ) ≜ log .
Qt (Xt )
Given a sequence of predictors {Qt } we can assign average cumulative loss to it as follows:
" n #
X 1

ℓn ({Qt }, θ ) ≜ EP(θ∗ )
(a)
log .
Xn Qt (Xt |Xt−1 )
t=1

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-256


i i

256

The intimate connection between this problem of online prediction and universal compression
is revealed by the following almost trivial observation. Notice that distribution Qt (·) output by the
learner at time t is in fact a function of observations x1 , . . . xt−1 . Therefore, we should be writing
more explicitly it as Qt (·; xt−1 ) to emphasize dependence on the (training) data. But then we can
also think of Qt (xt ; xt−1 ) as a Markov kernel QXt |Xt−1 (xt |xt−1 ) and compute a joint probability
distribution
Y
n
Q X n ( xn ) ≜ Qt (xt ; xt−1 ) . (13.33)
t=1

Conversely, any QXn we can factorize sequentially in the form (13.33) and obtain an online pre-
dictor. It turns out that the choice of the optimal QXn is precisely the same problem of universal
probability assignment that is solved by the universal compression in (13.8). But before that we
need to explain how to define optimality in the online prediction game.
Since the ℓn ({Qt }, θ∗ ) depends on the value of θ∗ that governs the stochastics of the input
(a)

sequence, our first instinct could be to try to minimize

min sup ℓ(na) ({Qt }, θ∗ ) .


{Qt } θ ∗ ∈Θ

(θ)
However, this turns out to be a bad way to pose the problem. For example, if one of PXn =
(a)
Unif[X n ] then no predictor can achieve ℓn ≤ n log |X |. Furthermore, a trivial predictor that
always outputs Qt = Unif[X ] achieves this upper bound exactly. Thus in the minimax setting
predicting Unif[X ] turns out to be optimal.
To understand how to work around this issue, let us first recall from Corollary 2.4 that if we
have oracle knowledge about the true θ∗ generating Xj ’s, then our choice would be to set Qt to be
(θ ∗ )
the factorization of PXn . This achieves the loss
(θ ∗ )
ℓ(na) ({P∗θ }, θ∗ ) = H(PXn ) .
(θ ∗ )
Thus, even if given the oracle information we cannot avoid the H(PXn ) loss (this would also
(θ ∗ )
be called Bayes loss in machine learning). Note that for iid model class we have H(PXn ) =

(θ )
nH(PX1 ) and the oracle loss is of order n. Consequently, the quality of the learning algorithm
should be measured by the amount of excess of loss above the oracle loss. This quantity is known
as average regret and defined as
" n #
X 1 (θ ∗ )
AvgRegn ({Qt }, θ∗ ) ≜ EP(θ∗ ) log − H(PXn ) .
Xn Qt (Xt |Xt−1 )
t=1

Hence to design an optimal algorithm we want to minimize the worst regret, or in other words to
solve the minimax problem

AvgReg∗n (Θ) ≜ inf sup AvgRegn ({Qt }, θ∗ ) . (13.34)


{Qt } θ ∗ ∈Θ

This turns out to be completely equivalent to the universal compression problem, as we state next.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-257


i i

13.6 Online prediction and density estimation 257

Theorem 13.3 Recall the definition of compression redundancy R∗n in (13.8). Then we have
(θ ∗ )
AvgReg∗n (Θ) = R∗n (Θ) ≜ min sup D(PXn kQXn ) ,
QXn θ ∗ ∈Θ

where the minimum in the RHS is achieved and at a unique distribution Q∗Xn . The optimal predictor
is given by setting Qt (·) = Q∗Xt |Xt−1 =xt−1 (·). Furthermore, let θ ∈ Θ have a prior distribution
π ∈ P(Θ). Then

AvgReg∗n (Θ) = sup I(θ; Xn ) .


π

If there exists a maximizer π ∗ of the right-hand side maximization then the optimal estimator is
R Qn
found by factorizing Q∗Xn = π ∗ (dθ) i=1 Pθ (xi ).

Proof. There is almost nothing to prove. We only need to rewrite definition of average regret in
terms of QXn as follows
" #
(θ ∗ )
P (θ ∗ )
AvgRegn ({Qt }, θ∗ ) = EP(θ∗ ) log X = D(PXn kQXn ) .
n

Xn QX n

The rest of the claims follow from Theorem 5.9 (recall that I(θ; Xn ) ≤ n log |X | < ∞) and
Theorem 5.4.

As an application of this result, we see that Krichevsky-Trofimov’s estimator achieves for any
i.i.d.
iid string Xn ∼ P a log-loss
 
X
n
E log (KT)
1  ≤ nH(P) + |X | − 1 log n + cX ,
Q t−1 (Xt |X )t− 1 2
t=1 Xt |X

where cX < ∞ is a constant independent of P or n. This excess above nH(P) is optimal among
all possible online estimators except possibly for a constant cX .
The problem we discussed may appear at first to be somewhat contrived, especially to some-
one who has been used to supervised learning/prediction tasks. Indeed, our prediction problem
does not have any features to predict from! Nevertheless, the modern large language models are
solving precisely this task: they are trained to predict a sequence of words by minimizing log-loss
(cross-entropy loss), cf. [320]. In those instances the learning task is made feasible due to non-iid
nature of the sequence. The iid setting, however, is also quite interesting and practically relevant.
But one needs to introduce supervised learning version for that, where prediction task is to esti-
mate an unknown (label or quantity) Yt given a correlated feature vector Xt . There is an analog of
Theorem 13.3 for that case as well – see Exercise II.20 and II.22.

Batch regret. In machine learning what we have defined above is known as cumulative (or
online) regret, because we insisted on the estimator to produce some prediction at every time step
t. However, a much more common setting is that of prediction, where the first n − 1 observations

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-258


i i

258

are available as the training data and we only assess the loss on the new unseen sample (test data).
This is called batch loss and the corresponding minimax regret is
 
∗ 1 (θ ∗ )
BatchRegn (Θ) ≜ inf sup E (θ ) log
∗ − H(PXn |Xn−1 ) . (13.35)
Qn (·;xn−1 ) θ ∗ ∈Θ PXn Qn (Xn ; Xn−1 )
|  {z  }
(θ ∗ )
D P Qn (·;Xn−1 )
X n | X n− 1

In other words, this is the optimal KL loss of predicting the next symbol by estimating its condi-
tional distribution given the past, a central task in language models such as GPT [320]. Similar to
Theorem 13.3 we can give a max-information formula for batch regret (Exercise II.19). However,
it turns out that there is also a connection to universal compression. Indeed, we have the following
inequalities

1
AvgReg∗n (Θ) − AvgReg∗n−1 (Θ) ≤ BatchReg∗n (Θ) ≤ AvgReg∗n (Θ) , (13.36)
n

where the upper bound is only guaranteed to hold for iid models.5 The inequality (13.36) is known
as online-to-batch conversion or estimation-compression inequality [159, 240]; see Lemma 32.3
and Proposition 32.7 for a justification. The estimator that achieves the above upper bound is very
simple: it takes a probability assignment Q∗Xn and sets its predictor as

1X
n
n−1
Q n ( xn ; x )≜ QXt |Xt−1 (xn |xt−1 ) . (13.37)
n
t=1

However, unlike the cumulative regret, minimizers of the batch regret are distinct from those in
universal compression. For example, for the model class of all iid distributions over k symbols, we
know that (asymptotically) the “add-1/2” estimator of Krichevsky-Trofimov is optimal. However,
for the batch loss it is not so (see the note at the end of Exercise VI.10). We also note that optimal
batch regret in this case is O( k−n 1 ), but the online-to-batch rule only yields O( (k−1n) log n ). On the
other hand, for first-order Markov chains with k ≥ 3 states, the online-to-batch upper bound
turns out to be order optimal, as we have BatchReg∗n  1n AvgReg∗n  kn log kn2 provided that
2

n  k2 [213]; however, proving this result, especially the lower bound, requires arguments native
to Markov chains.

Density estimation. Consider now the following problem. Given a collection of (single-letter)
(θ) i.i.d. (θ)
distributions PX on X and X1 , . . . , Xn−1 ∼ PX we want to produce a density estimate P̂ which
minimizes the worst-case error as measured by KL divergence, i.e. we seek to minimize
h i
(θ ∗ )
sup E n−1 i.i.d. (θ∗ ) D(P̂kPX ) .
X ∼ PX
θ ∗ ∈Θ

5
For stationary m-order Markov models, the upper bound in (13.36) holds with n − m in the denominator [213, Lemma 6].

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-259


i i

13.7 Individual sequence and worst-case regret 259

To connect to the previous discussion, we only need to notice that P̂(·) can be interpreted as Qn in
the batch regret problem and we have an exact equality
h i 1
(θ ∗ )
inf sup E n−1 i.i.d. (θ∗ ) D(P̂kPX ) = BatchReg∗n (Θ) ≤ sup I(θ; Xn ) .

P̂ θ ∈Θ X ∼ P X n π
Thus, we bound the minimax (KL-divergence) density estimation rate by capacity of a certain
(θ)
channel. The estimator achieving this bound is improper (i.e. P̂ 6= PX for any θ) and given
by (13.37). This is the basis of the Yang-Barron approach to density estimation, see Section 32.1
for more.

13.7 Individual sequence and worst-case regret


In previous section we explained how learner can predict stochastically generated strings Xn .
Though very standard, this approach suffers from a common criticism: what if our model class Θ
for the stochasticity of Xn is incorrect and the real data is not generated by any process in the class
Θ? In this section, we show a somewhat surprising workaround: it turns out there can be a theory
of predicting any possible sequences xn even non-random and adversarially chosen! This exciting
area is known in information theory as individual sequence approach and we describe it next.
Consider the following problem: a sequence xn is observed sequentially and our goal is to
predict (by making a soft, or probabilistic, prediction) the next symbol given the past observations.
The experiment proceeds as follows:

1 A string xn ∈ X n is selected by the nature.


2 Having observed x1 , . . . , xt−1 we are tasked to output a probability distribution Qt (·|xt−1 ) on X .
3 After that nature reveals the next sample xt and our loss for the t-th prediction is evaluated via
the log-loss:
1
log .
Qt (xt |xt−1 )
The main objective is to find a sequence of predictors {Qt } that minimizes the cumulative loss:
X
n
1
ℓ({Qt }, x ) ≜n
log .
Qt (xt |xt−1 )
t=1

Consider first the naive goal of minimizing the worst-case loss:

min max ℓ({Qt }, xn ) .


{Qt }nt=1
n x

This is clearly hopeless. Indeed, at any step t the distribution Qt must have at least one atom with
weight at most |X1 | , and hence for any predictor

max
n
ℓ({Qt }, xn ) ≥ n log |X | ,
x

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-260


i i

260

which is clearly achieved iff Qt (·) ≡ |X1 | , i.e. if the predictor simply makes uniform random
guesses. This triviality is not surprising: In the absence of whatsoever prior information on xn it
is impossible to predict anything.
The exciting idea, originated by Feder, Merhav and Gutman, cf. [161, 303], is to replace loss
with regret, i.e. the gap to the best possible static oracle. More precisely, suppose a non-causal
oracle can examine the entire string xn and output a constant Qt ≡ Q. From the non-negativity of
divergence this non-causal oracle achieves:
X
n
1
ℓoracle (xn ) = min log = nH(P̂xn ) .
Q Q ( xt )
t=1

Can causal (but time-varying) predictor come close to this performance? In other words, we define
regret of a sequential predictor as the excess risk over the static oracle
reg({Qt }, xn ) ≜ ℓ({Qt }, xn ) − nH(P̂xn )
and ask to minimize the worst-case regret:
Reg∗n ≜ min max reg({Qt }, xn ) . (13.38)
{Qt }
n x

Excitingly, non-trivial predictors emerge as solutions to the above problem, which furthermore do
not rely on any assumptions on xn .
We next consider the case of X = {0, 1} for simplicity. To solve (13.38), first notice that
designing a sequence {Qt (·|xt−1 } is equivalent to defining one joint distribution QXn and then
Q
factorizing the latter as QXn (xn ) = t Qt (xt |xt−1 ). Then the problem (13.38) becomes simply
1
Reg∗n = min max log − nH(P̂xn ) .
n
QXn x Q ( xn )
Xn

First, we notice that generally we have that optimal QXn is the Shtarkov distribution (13.12), which
implies that the regret coincides with the log of the Shtarkov sum (13.13). In the iid case we are
considering, from (13.14) we get
X Y
n X
Reg∗n = log max Q(xi ) = log exp{−nH(P̂xn )} .
Q
xn i=1 xn

This expression is, however, frequently not very convenient to analyze, so instead we consider
upper and lower bounds. We may lower-bound the max over xn with the average over the Xn ∼
Ber(θ)n and obtain (also applying Lemma 13.2):
|X | − 1
Reg∗n ≥ R∗n + log e + o(1) ,
2
where R∗n is the universal compression redundancy defined in (13.8), whose asymptotics we
derived in (13.24).
(KT)
On the other hand, taking QXn from Krichevsky-Trofimov (13.27) we find after some algebra
and Stirling’s expansion:
1 1
max log (KT)
− nH(P̂xn ) = log n + O(1) .
n
x QXn (xn ) 2

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-261


i i

13.8 Lempel-Ziv compressor 261

In all, we conclude that,


|X | − 1
Reg∗n = R∗n + O(1) = log n + O(1) ,
2
and remarkably, the per-letter regret 1n Reg∗n converges to zero. That is, there exists a causal predic-
tor that can predict (under the log-loss) almost as well as any constant one, even if it is adapted
to a particular sequence xn non-causally.
Explicit (asymptotically optimal) sequential prediction rules are given by Krichevsky-
Trofimov’s “add-1/2” rules (13.29). We note that the resulting rules are also independent of n
(“horizon-free”). This is a very desirable property not shared by the optimal sequential predictors
derived from factorizing the Shtarkov’s distribution (13.12).

General parametric families. The general definition of (cumulative) individual-sequence (or


worst-case) regret for a model class {PXn |θ=θ0 : θ0 ∈ Θ} is given by
1 1
Reg∗n (Θ) = min sup log − inf log ,
QXn xn QXn (xn ) θ0 ∈Θ PXn |θ=θ0 (xn )
This regret can be interpreted as worst-case loss of a given estimator compared to the best possible
one from a class PXn |θ , when the latter is selected optimally for each sequence. In this sense, regret
gives a uniform (in xn ) bound on the performance of an algorithm against a class.
It turns out that similarly to (13.25) the individual sequence redundancy for general d-
parametric families (under smoothness conditions) can be shown to satisfy [362]:
Z p
∗ ∗ d d n
Regn (Θ) = Rn (Θ) + log e + o(1) = log + log det JF (θ)dθ + o(1) .
2 2 2π Θ

In machine learning terms, we say that R∗n (Θ) in (13.8) is a cumulative sequential prediction
regret under the well-specified setting (i.e. data Xn is generated by a distribution inside the model
class Θ), while here Reg∗n (Θ) corresponds to a fully mis-specified setting (i.e. data is completely
arbitrary). There are also interesting settings in between these extremes, e.g. when data is iid but
not from a model class Θ, cf. [162].

13.8 Lempel-Ziv compressor


So given a class of sources {PXn |θ : θ ∈ Θ} we have shown how to produce an asymptotically
optimal compressors by using Jeffreys’ prior. In the case of a class of i.i.d. processes, the result-
ing sequential probability of Krichevsky-Trofimov, see (13.32), had a very simple algorithmic
description. When extended to more general classes (such as r-th order Markov chains), however,
the sequential probability rules become rather complex. The Lempel-Ziv approach was to forgo
the path “ design QXn , convert to QXt |Xt−1 , extract compressor” and attempt to directly construct
a reasonable sequential compressor or, equivalently, derive an algorithmically simple sequential
estimator QXt |Xt−1 . The corresponding joint distribution QXn is hard to imagine, and the achieved
redundancy is not easy to derive, but the algorithm becomes very transparent.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-262


i i

262

In principle, the problem is rather straightforward: as we observe a stationary process, we may


estimate with better and better precision the conditional probability P̂Xn |Xn−1 and then use it as
n−r

the basis for arithmetic coding. As long as P̂ converges to the actual conditional probability, we
will attain the entropy rate of H(Xn |Xnn− 1
−r ). Note that Krichevsky-Trofimov assignment (13.29) is
clearly learning the distribution too: as n grows, the estimator QXn |Xn−1 converges to the true PX
(provided that the sequence is i.i.d.). So in some sense the converse is also true: any good universal
compression scheme is inherently learning the true distribution.
The main drawback of the learn-then-compress approach is the following. Once we extend the
class of sources to include those with memory, we invariably are lead to the problem of learning
the joint distribution PXr−1 of r-blocks. However, the sample size required to obtain a good esti-
0
mate of PXr−1 is exponential in r. Thus learning may proceed rather slowly. Lempel-Ziv family of
0
algorithms works around this in an ingeniously elegant way:

• First, estimating probabilities of rare substrings takes longest, but it is also the least useful, as
these substrings almost never appear at the input.
• Second, and the most crucial, point is that an unbiased estimate of PXr (xr ) is given by the
reciprocal of the time since the last observation of xr in the data stream.
• Third, there is a prefix code6 mapping any integer n to binary string of length roughly log2 n:

fint : Z+ → {0, 1}+ , ℓ(fint (n)) = log2 n + O(log log n) . (13.39)


Thus, by encoding the pointer to the last observation of xr via such a code we get a string of
length roughly log PXr (xr ) automatically.

There are a number of variations of these basic ideas, so we will only attempt to give a rough
explanation of why it works, without analyzing any particular algorithm.
We proceed to formal details. First, we need to establish Kac’s lemma.

Lemma 13.4 (Kac) Consider a finite-alphabet stationary ergodic process . . . , X−1 , X0 , X1 . . ..


Let L = inf{t > 0 : X−t = X0 } be the last appearance of symbol X0 in the sequence X− 1
−∞ . Then
for any u such that P[X0 = u] > 0 we have
1
E [ L | X 0 = u] = .
P [ X 0 = u]
In particular, mean recurrence time E[L] = |supp(PX )|.

Proof. Note that from stationarity the following probability


P[∃t ≥ k : Xt = u]
does not depend on k ∈ Z. Thus by continuity of probability we can take k = −∞ to get
P[∃t ≥ 0 : Xt = u] = P[∃t ∈ Z : Xt = u] .

6 ∑
For this just notice that k≥1 2− log2 k−2 log2 log(k+1) < ∞ and use Kraft’s inequality. See also Ex. II.18.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-263


i i

13.8 Lempel-Ziv compressor 263

However, the last event is shift-invariant and thus must have probability zero or one by ergodic
assumption. But since P[X0 = u] > 0 it cannot be zero. So we conclude
P[∃t ≥ 0 : Xt = u] = 1 . (13.40)
Next, we have
X
E[L|X0 = u] = P [ L ≥ t| X 0 = u] (13.41)
t≥ 1
1 X
= P[L ≥ t, X0 = u] (13.42)
P[X0 = u]
t≥1
1 X
= P[X−t+1 6= u, . . . , X−1 6= u, X0 = u] (13.43)
P[X0 = u]
t≥1
1 X
= P[X0 6= u, . . . , Xt−2 6= u, Xt−1 = u] (13.44)
P[X0 = u]
t≥1
1
= P[∃t ≥ 0 : Xt = u] (13.45)
P[X0 = u]
1
= , (13.46)
P[X0 = u]
where (13.41) is the standard expression for the expectation of a Z+ -valued random vari-
able, (13.44) is from stationarity, (13.45) is because the events corresponding to different t are
disjoint, and (13.46) is from (13.40).
The following result serves to explain the basic principle behind operation of Lempel-Ziv
methods.

Theorem 13.5 Consider a finite-alphabet stationary ergodic process . . . , X−1 , X0 , X1 . . . with


entropy rate H. Suppose that X− 1
−∞ is known to the decoder. Then there exists a sequence of prefix-
codes fn (xn0−1 , x− 1
−∞ ) with expected length

1
E[ℓ(fn (Xn0−1 , X∞
−1
))] → H ,
n

Proof. Let Ln be the last occurrence of the block xn0−1 in the string x− 1
−∞ (recall that the latter is
known to decoder), namely
Ln = inf{t > 0 : x−
−t
t+n−1
= xn0−1 } .

= Xtt+n−1 we have
(n)
Then, by Kac’s lemma applied to the process Yt
1
E[Ln |Xn0−1 = xn0−1 ] = .
P[Xn0−1 = xn0−1 ]
We know encode Ln using the code (13.39). Note that there is crucial subtlety: even if Ln < n and
thus [−t, −t + n − 1] and [0, n − 1] overlap, the substring xn0−1 can be decoded from the knowledge
of Ln .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-264


i i

264

We have, by applying Jensen’s inequality twice and noticing that 1n H(Xn0−1 ) & H and
1
n log H(Xn0−1 ) → 0 that
1 1 h 1 i
E[ℓ(fint (Ln ))] ≤ E log + o(1) → H .
n n PXn−1 (Xn0−1 )
0

From Kraft’s inequality we know that for any prefix code we must have
1 1
E[ℓ(fint (Ln ))] ≥ H(Xn0−1 |X− 1
−∞ ) = H .
n n

The result shown above demonstrates that LZ algorithm has asymptotically optimal com-
pression rate for every stationary ergodic process. Recall, however, that previously discussed
compressors also enjoyed non-stochastic (individual sequence) guarantees. For example, we have
seen in Section 13.7 that Krichevsky-Trofimov’s compressor achieves on every input sequence a
compression ratio that is at most O( logn n ) worse than the arithmetic encoder built with the best
possible (for this sequence!) static probability assignment. It turns out that LZ algorithm is also
special from this point of view. In [331] (see also [160, Theorem 4]) it was shown that the LZ
compression rate on every input sequence is better than that achieved by any finite state machine
(FSM) up to correction terms O( logloglogn n ). Consequently, investing via LZ achieves capital growth
that is competitive against any possible FSM investor [160].
Altogether we can see that LZ compression enjoys certain optimality guarantees in both the
stochastic and individual sequence senses.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-265


i i

Exercises for Part II

II.1 (Exact value of minimal compression length) Suppose X ∈ N and PX (1) ≥ PX (2) ≥ . . .. Show
that the optimal compressor f∗ ’s length satisfies
X

E[l(f∗ (X))] = P[X ≥ 2k ].
k=1

II.2 (Mixed source) Consider a finite collection Π = {P1 , . . . , Pm } of distributions on S . Mixed


i.i.d.
source {Sj } is generated by sampling Pi from Π with probability π i and then generating Sj ∼ Pi .
1 1 ∗ n
Show that n log PSn (Sn ) converges in distribution. Compute limn→∞ ϵ (S , nR).
II.3 Recall that an entropy rate of a process {Xj : j = 1, . . .} is defined as follows provided the limit
exists:
1
H = lim H( X n ) .
n→∞ n

Consider a 4-state Markov chain with transition probability matrix


 
0.89 0.11 0 0
 0.11 0.89 0 0 
 
 0 0 0.11 0.89 
0 0 0.89 0.11

The distribution of the initial state is [p, 0, 0, 1 − p].


(a) Does the entropy rate of such a Markov chain exist? If it does, find it.
(b) Describe the asymptotic behavior of the optimum variable-length rate n1 ℓ(f∗ (X1 , . . . , Xn )).
Consider convergence in probability and in distribution.
(c) Repeat with transition matrix:
 
0.89 0.11 0 0
 0.11 0.89 0 0 
 
 0 0 0. 5 0. 5 
0 0 0. 5 0. 5

II.4 Consider a three-state Markov chain S1 , S2 , . . . with the following transition probability matrix
 1 1 1 
2 4 4
P= 0 1
2
1
2
.
1 0 0

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-266


i i

266 Exercises for Part II

Compute the limit of 1n E[l(f∗ (Sn ))] when n → ∞. Does your answer depend on the distribution
of the initial state S1 ?
II.5 (a) Let X take values on a finite alphabet X . Prove that
H(X) − k − 1
ϵ ∗ ( X , k) ≥ .
log(|X | − 1)
(b) Deduce the following converse result: For a stationary process {Sk : k ≥ 1} on a finite
alphabet S ,
H−R
lim inf ϵ∗ (Sn , nR) ≥ .
n→∞ log |S|
n
where H = limn→∞ H(nS ) is the entropy rate of the process.
II.6 Run-length encoding is a popular variable-length lossless compressor used in fax machines,
image compression, etc. Consider compression of Sn , an i.i.d. Ber(δ) source with very small
1
δ = 128 using run-length encoding: A chunk of consecutive r ≤ 255 zeros (resp. ones) is
encoded into a zero (resp. one) followed by an 8-bit binary encoding of r (If there are > 255
consecutive zeros then two or more 9-bit blocks will be output). Compute the average achieved
compression rate
1
lim E[ℓ(f(Sn )]
n→∞n
How does it compare with the optimal lossless compressor?
Hint: Compute the expected number of 9-bit blocks output per chunk of consecutive zeros/ones;
normalize by the expected length of the chunk.
II.7 Draw n random points independently and uniformly from the vertices of the following square.

Denote the coordinates by (X1 , Y1 ), . . . , (Xn , Yn ). Suppose Alice only observes Xn and Bob only
observes Yn . They want to encode their observation using RX and RY bits per symbol respectively
and send the codewords to Charlie who will be able to reconstruct the sequence of pairs.
(a) Find the optimal rate region for (RX , RY ).
(b) What if the square is rotated by 45◦ ?
II.8 Consider a particle walking randomly on the graph of Exercise II.7 (each edge is taken with
equal probability; particle does not stay in the same node). Alice observes the X coordinate
and Bob observes the Y coordinate. How many bits per step (in the long run) does Bob need to

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-267


i i

Exercises for Part II 267

send to Alice so that Alice will be able to reconstruct the particle’s trajectory with vanishing
probability of error? (Hint: you need to extend certain theorem from Chapter 11 to the case of
an ergodic Markov chain)
II.9 Recall from Theorem 11.13 the upper bound on the probability of error for the Slepian-Wolf
compression to k bits:
 
∗ 1 −τ
ϵSW (k) ≤ min P log|A| > k − τ + |A| (II.1)
τ >0 PXn |Y (Xn |Y)

Consider the following case, where Xn = (X1 , . . . , Xn ) is uniform on {0, 1}n and

Y = (X1 , . . . , Xn ) + (N1 , . . . , Nn ) ,

where Ni are iid Gaussian with zero mean and variance 0.1. Let n = 10. Propose a method to
numerically compute or approximate the bound (II.1) as a function of k = 1, . . . 10. Plot the
results.
II.10 (Mismatched compression) Let P, Q be distributions on some discrete alphabet A.
(a) Let f∗P : A → {0, 1} denote the optimal variable-length lossless compressor for X ∼ P.
Show that under Q,

EQ [l(f∗P (X))] ≤ H(Q) + D(QkP).

(b) The Shannon code for X ∼ P is a prefix code fP with the code length l(fP (a)) =
dlog2 P(1a) e, a ∈ A. Show that if X is distributed according to Q instead, then

H(Q) + D(QkP) ≤ EQ [l(fP (X))] ≤ H(Q) + D(QkP) + 1 bit.

Comments: This can be interpreted as a robustness result for compression with model misspec-
ification: When a compressor designed for P is applied to a source whose distribution is in fact
Q, the suboptimality incurred by this mismatch can be related to divergence D(QkP).
II.11 Consider a ternary fixed length (almost lossless) compression X → {0, 1, 2}k with an additional
requirement that the string in wk ∈ {0, 1, 2}k should satisfy

X
k
k
wj ≤ (II.2)
2
j=1

For example, (0, 0, 0, 0), (0, 0, 0, 2) and (1, 1, 0, 0) satisfy the constraint but (0, 0, 1, 2) does not.
Let ϵ∗ (Sn , k) denote the minimum probability of error among all possible compressors of Sn =
{Sj , j = 1, . . . , n} with i.i.d. entries of finite entropy H(S) < ∞. Compute

lim ϵ∗ (Sn , nR)


n→∞

as a function of R ≥ 0.
Hint: Relate to P[ℓ(f∗ (Sn )) ≥ γ n] and use Stirling’s formula (I.2) to find γ .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-268


i i

268 Exercises for Part II

II.12 Consider a probability measure P and a measure-preserving transformation τ : Ω → Ω. Prove


that τ is ergodic if and only if for any measurable A, B we have

1X
n− 1
P[A ∩ τ −k B] → P[A]P[B] .
n
k=0

Comment: Thus ergodicity is a weaker condition than mixing: P[A ∩ τ −n B] → P[A]P[B].


II.13 (Arithmetic Coding) We analyze the encoder defined by (13.1) for iid source. Let P be a
distribution on some ordered finite alphabet, say, a < b < · · · < z. For each n, define
Qn P
p(xn ) = i=1 P(xi ) and q(xn ) = n
yn <xn p(y ) according to the lexicographic ordering, so
that Fn (x ) = q(x ) and |Ixn | = p(x ).
n n n

(a) Show that if xn−1 = (x1 , . . . , xn−1 ), then


X
q(xn ) = q(xn−1 ) + p(xn−1 ) P(α).
α<xn

Conclude that q(xn ) can be computed in O(n) steps sequentially.


(b) Show that intervals Ixn are disjoint subintervals of [0, 1).
n
(c) Encoding. Show that the codelength l l(f(x )) m defined in (13.1) satisfies the constraint (13.2),
namely, log2 p(xn ) ≤ ℓ(f(x )) ≤ log2 p(xn ) +1. Furthermore, verify that the map xn 7→ f(xn )
1 n 1

defines a prefix code. (Warning: This is not about checking Kraft’s inequality.)
(d) Decoding. Upon receipt of the codeword, we can reconstruct the interval Dxn . Divide the
unit interval according to the distribution P, i.e., partition [0, 1) into disjoint subintervals
Ia , . . . , Iz . Output the index that contains Dxn . Show that this gives the first symbol x1 . Con-
tinue in this fashion by dividing Ix1 into Ix1 ,a , . . . , Ix1 ,z and etc. Argue that xn can be decoded
losslessly. How many steps are needed?
(e) Suppose PX (e) = 0.5, PX (o) = 0.3, PX (t) = 0.2. Encode etoo (write the binary codewords)
and describe how to decode.
(f) Show that the average length of this code satisfies
nH(P) ≤ E[l(f(Xn ))] ≤ nH(P) + 2 bits.
(g) Assume that X = (X1 , . . . , Xn ) is not iid but PX1 , PX2 |X1 , . . . , PXn |Xn−1 are known. How would
you modify the scheme so that we have
H(Xn ) ≤ E[l(f(Xn ))] ≤ H(Xn ) + 2 bits.
II.14 (Enumerative Codes) Consider the following simple universal compressor for binary sequences:
Pn
Given xn ∈ {0, 1}n , denote by n1 = i=1 xi and n0 = n − n1 the number of ones and zeros in xn .
First encode n1 ∈ {0, 1, . . . , n} using dlog2 (ln + 1)e bits, n
m then encode the index of x in the set

of all strings with n1 number of ones using log2 nn1 bits. Concatenating two binary strings,
we obtain the codeword of xn . This defines a lossless compressor f : {0, 1}n → {0, 1}∗ .
(a) Verify that f is a prefix code.
i.i.d.
(b) Let Eθ be taken over Xn ∼ Ber(θ). Show that for any θ ∈ [0, 1],
Eθ [l(f(Xn ))] ≤ nh(θ) + log n + O(1),

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-269


i i

Exercises for Part II 269

where h(·) is the binary entropy function. Conclude that

sup {Eθ [l(f(Xn ))] − nh(θ)} ≥ log n + O(1).


0≤θ≤1

[Optional: Explain why enumerative coding fails to achieve the optimal redundancy.]
Hint: Stirling’s approximation (I.2) might be useful.
II.15 (Krichevsky-Trofimov codes). Consider the K-T probability assignment for the binary alpha-
bet (13.27) and its sequential
 form (13.29). Let fKT be the encoder with length assignment
1
l(f (xn )) = log2 (KT) for all xn .
QXn (xn )

(a) Prove that for any n and any xn ∈ {0, 1}n ,


 t0  t1
(KT) 1 1 t0 t1
QXn (xn ) ≥ √ .
2 t0 + t1 t0 + t1 t0 + t1
where ti = ti (xn ), i ∈ {0, 1} counts the number of i’s occuring in xn−1 . Hint: induction on
n.
(b) Conclude that the K-T code length satisfies:
n  1
1
l(fKT (xn )) ≤ nh + log n + 2, ∀xn ∈ {0, 1}n .
n 2
(c) Conclude that for K-T codes its redundancy is bounded by: in the notation of Exercise
II.14b,
1
sup {Eθ [l(fKT (Xn ))] − nh(θ)} ≤ log n + O(1).
0≤θ≤1 2
This establishes an upper-bound part of the O(1) version of (13.24) for the binary alphabet.
(In fact, Part (b) shows the stronger result that the pointwise redundancy (cf. (13.11)) of KT
P n ( xn )
code satisfies maxxn ∈{0,1}n maxθ log X(KT|θ) n ≤ 12 log n + O(1).)
QXn (x )
II.16 (Redundancy lower bound: binary alphabet) In Exercise II.15 we showed that the minimax
average-case redundancy, cf. (13.8), for the iid Bernoulli model satisfies R∗n ([0, 1]) ≤ 12 log n +
O(1). We show that this bound is tight by computing I(θ; Xn ) for θ ∼ Unif([0, 1]). Thus, from
the capacity-redundancy theorem (13.10), this provides a lower bound on the redundancy.
Pn
(a) Let θ̂ = n1 i=1 Xi denote the empirical frequency of ones. Prove that

I(θ; Xn ) = I(θ; θ̂).

(Hint: Section 3.5.)


(b) Show that the Bayes risk satisfies E[(θ − θ̂)2 ] = 1
6n .
(c) Compute the differential entropy h(θ).
(d) Justify each step in
1 2πe
h(θ|θ̂) = h(θ − θ̂|θ̂) ≤ h(θ − θ̂) ≤ log .
2 6n
(Hint: (2.20).)
(e) Assemble the above steps to conclude that R∗n ≥ 1
2 log n + c1 and compute the value of c1 .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-270


i i

270 Exercises for Part II

(f) Now choose π to be the Beta( 12 ) prior and redo the previous part. Do you get a better
constant c1 ?
Comments: We followed the strategy, introduced in [118], of lower bounding I(θ; Xn ) by guess-
ing a good estimator θ̂ = θ̂(X1 , . . . , Xn ) and bounding I(θ; θ̂) on the basis of the estimator
error. The rationale is that if θ can be estimated well, then Xn needs to provide large amount of
information. We will further explore this idea in Chapter 30.
II.17 Consider the following collection of stationary ergodic Markov
( processes depending on param-
Xt−1 w.p. 1 − θ,
eter θ ∈ [0, 1]. The X1 ∼ Ber( 12 ) and after that Xt = . Denote the
1 − Xt−1 w.p. θ.
resulting Markov kernel as PXn |θ .
(a) Compute JF (θ).
(b) Prove that minimax redundancy R∗n = ( 12 + o(1)) log n.
II.18 (Elias coding) In this problem we construct universal codes for integers. Namely, they compress
any integer-valued (infinite alphabet!) random variable almost to its entropy.
(a) Consider the following universal compressor for natural numbers: For x ∈ N = {1, 2, . . .},
let k(x) denote the length of its binary representation. Define its codeword c(x) to be k(x)
zeros followed by the binary representation of x. Compute c(10). Show that c is a prefix
code and describe how to decode a stream of codewords.
(b) Next we construct another code using the one above: Define the codeword c′ (x) to be c(k(x))
followed by the binary representation of x. Compute c′ (10). Show that c′ is a prefix code
and describe how to decode a stream of codewords.
(c) Let X be a random variable on N whose probability mass function is decreasing. Show that
E[log(X)] ≤ H(X).
(d) Show that the average code length of c satisfies E[ℓ(c(X))] ≤ 2H(X) + 2 bit.
(e) Show that the average code length of c′ satisfies E[ℓ(c′ (X))] ≤ H(X) + 2 log(H(X) + 1) + 3
bit.
Comments: The two coding schems are known as Elias γ -codes and δ -codes.
II.19 (Batch loss) Recall the definition of batch regret in online prediction in Section 13.6. Show that
whenever maximizer π ∗ exists we have

BatchReg∗n (Θ) = max I(θ; Xn |Xn−1 ) ,


π ∈P(Θ)

where optimization is over distribution of θ ∼ π. (Hint: Apply Exercise I.11.)


II.20 (Supervised learning) Consider a possibly non-iid stochastic process Xn = (X1 , . . . , Xn ) and
(θ)
a parametric collection of conditional distributions PY|X , θ ∈ Θ, which we also understand
(θ ∗ )
as a kernel PY|X,θ . Nature fixes θ∗ and generates Yi ∼ PY|X=Xi independently. These samples
are sequentially fed to the learner who having observed (Xt , Yt−1 ) outputs Q̂t (·) ∈ P(Y) and
experiences log-loss − log Q̂t (Yt ). The goal of supervised learning is to minimize the worst case

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-271


i i

Exercises for Part II 271

regret, i.e. find minimizer in


 
X
n
1 1
AvgReg∗n (Θ) ≜ inf sup E  log − log (θ∗ ) 
{Qt } θ ∗ ∈Θ
t=1
Q̂t (Yt ) PY|X (Yt |Xt )

Here we show analog of Theorem 13.3, namely that


AvgReg∗n (Θ) = Cn ≜ max I(θ; Yn |Xn ) ,
π

with optimization over π ∈ P(Θ) of priors on θ. We assume the maximum is attained at some
π∗.
(θ) ⊗n
(a) Let Dn ≜ infQYn |Xn supθ∈Θ D(PY|X kQYn |Xn |PXn ), where the infimum is over all conditional
kernels QYn |Xn : X n → Y n . Show AvgReg∗n (Θ) ≤ Dn .
R Qn
(b) Show that Cn = Dn and that optimal Q∗Yn |Xn (yn |xn ) = π ∗ (dθ) t=1 PY|X (yt |xt ) (Hint:
(θ)

Exercise I.11.)
Qn
(c) Show that we can always factorize Q∗Yn |Xn = t=1 QYt |Xt ,Yt−1 .
(d) Conclude that Q∗Yn |Xn defines an optimal learner, who also operates without any knowledge
of PXn .
Note: this characterization is mostly useful for upper-bounding regret (Exercise II.22). Indeed,
the optimal learner requires knowledge of π ∗ which in turn often depends on PXn , which is
not available to the learner. This shows why supervised learning is quite a bit more deli-
cate than universal compression. Nevertheless, taking a “natural” prior π and factorizing the
R (θ) ⊗n
mixture π (dθ)PY|X often gives very interesting and often almost optimal algorithms (e.g.
exponential-weights update algorithm [445]).
II.21 (Average-case and worst-case redundancies are incomparable.) This exercise provides an exam-
ple where the worst-case minimax redundancy (13.11) is infinite but the average-case one (13.8)
is finite. Take n = 1 and consider the class of distributions P1 = {P ∈ P(Z+ ) : EP [X] ≤ 1}.
Define
P ( x)
R∗ = min sup D(PkQ), R∗∗ = min max sup log .
Q P∈P1 Q x∈Z+ P∈P1 Q ( x)
(a) Applying the capacity-redundancy theorem, show that R∗ ≤ 2 log 2. (Hint: use Exercise I.4
to bound the mutual information.)
(b) Prove that R∗∗ = ∞ if and only if the Shtarkov sum (13.13) is infinite, namely,
P
x∈Z+ supP∈P1 P(x) = ∞
(c) Verify that
(
1 x=0
sup P(x) =
P∈P1 1/x x ≥ 1
and conclude R∗∗ = ∞. (Hint: Markov’s inequality.)
i.i.d. (d)
II.22 (Linear regression) Let Xi ∼ PX on Rp with PX being rotationally invariant (i.e. X = UX for
any orthogonal matrix U). Fix θ ∈ Rp with kθk ≤ s and given Xi generate Yi ∼ N (θ⊤ Xi , σ 2 ).
Having observed Yt−1 , Xt (but not θ) the learner outputs a prediction Ŷt of Yt .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-272


i i

272 Exercises for Part II

(a) Show that


X
n X
p   
s2 n
AvgRegn ≜ sup E[(Ŷt − Yt ) ] − nσ ≤ σ
2 2
E ln 1 +
2
λi (Σ̂X ) ,
∥θ∥≤s t=1 p
i=1

1
Pn ⊤
where Σ̂X = n i=1 Xi Xi
is the sample covariance matrix. (Hint: Interpret LHS as regret
under log-loss and solve maxπ θ I(θ; Yn |Xn ) s.t. E[kθk2 ] ≤ s2 via Exercise I.10.)
(b) Show that
 s2 n   s2 n 
AvgRegn ≤ σ 2 ln det Ip + ΣX ≤ pσ 2 ln 1 + 2 E[kXk2 ] .
p p
(Hint: Jensen’s inequality.)
i.i.d.
Remark: Note that if Xi ∼ N (0, BIp ) the RHS is pσ 2 ln n + O(1). At the same time prediction
error of an ordinary least-square estimate Ŷt = θ̂⊤ Xt for n ≥ p + 2 is known to be exactly7
E[(Ŷt − Yt )2 ] = σ 2 (1 + n−pp−1 ) and hence achieves the optimal pσ 2 ln n + O(1) cumulative
regret.

7
This can be shown by applying Exercise VI.3b and evaluating the expected trace using [444, Theorem 3.1].

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-273


i i

Part III

Hypothesis testing and large deviations

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-274


i i

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-275


i i

275

In this part we study the topic of binary hypothesis testing (BHT) which we first encountered
in Section 7.3. This is an important area of statistics, with a definitive treatment given in [277].
Historically, there has been two schools of thought on how to approach this question. One is the
so-called significance testing of Karl Pearson and Ronald Fisher. This is perhaps the most widely
used approach in modern biomedical and social sciences. The concepts of null hypothesis, p-value,
χ2 -test, goodness-of-fit belong to this world. We will not be discussing these.
The other school was pioneered by Jerzy Neyman and Egon Pearson, and is our topic in this part.
The concepts of Type-I and Type-II errors, likelihood-ratio tests, Chernoff exponent are from this
domain. This is, arguably, a more popular way of looking at the problem among the engineering
disciplines (perhaps explained by its foundational role in radar and electronic signal detection).
The conceptual difference between the two is that in the first approach the full probabilistic
i.i.d.
model is specified only under the null hypothesis. (It still could be very specific like Xi ∼ N (0, 1),
i.i.d.
contain unknown parameters, like Xi ∼ N (θ, 1) with θ ∈ R arbitrary, or be nonparametric, like
i.i.d.
(Xi , Yi ) ∼ PX,Y = PX PY denoting that observables X and Y are statistically independent). The main
goal of the statistician in this setting is inventing a testing process that is able to find statistically
significant deviations from the postulated null behavior. If such deviation is found then the null is
rejected and (in scientific fields) a discovery is announced. The role of the alternative hypothesis
(if one is specified at all) is to roughly suggest what feature of the null are most likely to be violated
i.i.d.
and motivates the choice of test procedures. For example, if under the null Xi ∼ N (0, 1), then both
of the following are reasonable tests:
1X ? 1X 2 ?
n n
Xi ≈ 0 Xi ≈ 1 .
n n
i=1 i=1

However, the first one would be preferred if, under the alternative, “data has non-zero mean”, and
the second if “data has zero mean but variance not equal to one”. Whichever of the alternatives is
selected does not imply in any way the validity of the alternative. In addition, theoretical properties
of the test are mostly studied under the null rather than the alternative. For this approach the null
hypothesis (out of the two) plays a very special role.
The second approach treats hypotheses in complete symmetry. Exact specifications of proba-
bility distributions are required for both hypotheses and the precision of a proposed test is to be
analyzed under both. This is the setting that is most useful for our treatment of forthcoming topics
of channel coding (Part IV) and statistical estimation (Part VI).
The outline of this part is the following. First, we define the performance metric R(P, Q) giving
a full description of the BHT problem. A key result in this theory, the Neyman-Pearson lemma
determines the form of the optimal test and, at the same time, characterizes R(P, Q). We then
specialize to the setting of iid observations and consider two types of asymptotics (as the sample
size n goes to infinity): Stein’s regime (where type-I error is held constant) and Chernoff’s regime
(where errors of both types are required to decay exponentially). The fundamental limit in the
former regime is simply a scalar (given by D(PkQ)), while in the latter it is a region. To describe
this region (as we do in Chapter 16) we will first need to dive deep into another foundational topic:
theory of large deviations and the information projection (Chapter 15).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-276


i i

14 Neyman-Pearson lemma

In this Chapter we formally define the problem of binary hypothesis testing between two sim-
ple hypotheses. We introduce the fundamental limit for this problem in the form of a region
R(P, Q) ⊂ [0, 1]2 , whose boundary is known as the received operating characteristic (ROC)
curve. We will show how to compute this region/curve exactly (Neyman-Pearson lemma) and
show optimality of the likelihood-ratio tests in the process. However, for high-dimensional situa-
tions exact computation of the region is still too complex and we will also derive upper and lower
bounds (as usual, we call them achievability and converse, respectively). Finally, we will conclude
by introducing two different asymptotic settings: the Stein regime and the Chernoff regime. The
answer in the former will be given completely (for iid distributions), while the answer for the latter
will require further developments in the subsequent chapters.

14.1 Neyman-Pearson formulation


Consider the situation where we have two distributions P and Q on a space X one of which have
generated our observation X. These two possibilities are summarized as a pair of hypotheses:
H0 : X ∼ P
H1 : X ∼ Q ,
which states that, under hypothesis H0 (the null hypothesis) X is distributed according to P, and
under H1 (the alternative hypothesis) X is distributed according to Q. A test (or decision rule)
between two distributions chooses either H0 or H1 based on the data X. We will consider

• Deterministic tests: f : X → {0, 1}, or equivalently, f(x) = 1 {x ∈ E} where E is known as a


decision region; and more generally,
• Randomized tests: PZ|X : X → {0, 1}, so that PZ|X (1|x) ∈ [0, 1] is the probability of rejecting
the null upon observing X = x.

Let Z = 0 denote that the test chooses P (accepting the null) and Z = 1 that the test chooses Q
(rejecting the null).
This setting is called “testing simple hypothesis against simple hypothesis”. Here “simple”
refers to the fact that under each hypothesis there is only one distribution that could have gen-
erated the data. In comparison, composite hypothesis postulates that X ∼ P for some P is a given
class of distributions; see Sections 16.4 and 32.2.1.

276

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-277


i i

14.1 Neyman-Pearson formulation 277

Table 14.1 Expressions for common performance metrics of hypothesis tests


Term Expression

type-I error, significance, size, false alarm rate, false positive 1−α
specificity, selectivity, true negative α
power, recall, sensitivity, true positive 1−β
type-II error, missed detection, false negative β
accuracy π 1 (1 − β) + (1 − π 1 )α
2π 1 (1−β)
F1 -score 1+π 1 (1−β)−(1−π 1 )α
Bayesian error π 1 β + (1 − π 1 )(1 − α)
π 1 (1−β)
positive predictive value (PPV), precision 1−π 1 β−(1−π 1 )α

Entries involving π 1 = P[H1 ] correspond to Bayesian setting where a prior probability on occurrence of H1 is
postulated.

In order to quantify performance of a test, we focus on two metrics. Let π i|j denote the proba-
bility of the test choosing i when the correct hypothesis is j, with i, j ∈ {0, 1}. For every test PZ|X
we associate a pair of numbers:

α = π 0|0 = P[Z = 0] (Probability of success given H0 is true)


β = π 0|1 = Q[Z = 0] (Probability of error given H1 is true),
R R
where P[Z = 0] = PZ|X (0|x)P(dx) and Q[Z = 0] = PZ|X (0|x)Q(dx). Depending on the field of
study there are many different names (and transformations) that have been defined, see Table 14.1.
Because we have two performance metrics it is not easy to understand what should one
designate as the “best test”. Consequently, there are several approaches:

• Bayesian: Assuming the prior distribution P[H0 ] = π 0 and P[H1 ] = π 1 , we minimize the
average probability of error:

P∗b = min π 0 π 1|0 + π 1 π 0|1 . (14.1)


PZ|X :X →{0,1}

• Minimax: Assuming there is an unknown prior distribution, we choose the test that preforms
the best for the worst-case prior

P∗m = min max{π 1|0 , π 0|1 }.


PZ|X :X →{0,1}

• Neyman-Pearson: Minimize the type-II error β subject to that the success probability under the
null is at least α.

In this Part the Neyman-Pearson formulation is our choice. We formalize the fundamental
performance limit as follows.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-278


i i

278

Definition 14.1 Given (P, Q), the Neyman-Pearson region consists of achievable points for
all randomized tests

R(P, Q) = (P[Z = 0], Q[Z = 0]) : PZ|X : X → {0, 1} ⊂ [0, 1]2 . (14.2)

In particular, its lower boundary is defined as (see Figure 14.1 for an illustration)

βα (P, Q) ≜ inf Q[ Z = 0] (14.3)


P[Z=0]≥α

R(P, Q)

βα (P, Q)

P = Q ⇔ R(P, Q) = P ⊥ Q ⇔ R(P, Q) =

Figure 14.1 Illustration of the Neyman-Pearson regions: typical (top plot) and two extremal cases (bottom
row). Recall that P is mutually singular w.r.t. Q, denoted by P ⊥ Q, if P[E] = 0 and Q[E] = 1 for some E.

The Neyman-Pearson region encodes much useful information about the relationship between
P and Q. In particular, the mutual singularity (see Figure 14.1) can be detected. Furthermore, every
f-divergence can be computed from the R(P, Q). For example, TV(P, Q) coincides with half the
length of the longest vertical segment contained in R(P, Q) (Exercise III.7). In machine learning
some of the most popular metric used to characterized quality of a R(P, Q) is area under the curve
(AUC)
Z 1
AUC(P, Q) ≜ 1 − βα (P, Q)dα .
0

We next prove several basic properties of R(P, Q).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-279


i i

14.1 Neyman-Pearson formulation 279

Theorem 14.2 (Properties of R(P, Q))

(a) R(P, Q) is a closed, convex subset of [0, 1]2 .


(b) R(P, Q) contains the diagonal.
(c) Symmetry: (α, β) ∈ R(P, Q) ⇔ (1 − α, 1 − β) ∈ R(P, Q).

Proof. (a) For convexity, suppose that (α0 , β0 ), (α1 , β1 ) ∈ R(P, Q), corresponding to tests
PZ0 |X , PZ1 |X , respectively. Randomizing between these two tests, we obtain the test λPZ0 |X +
λ̄PZ1 |X for λ ∈ [0, 1], which achieves the point (λα0 + λ̄α1 , λβ0 + λ̄β1 ) ∈ R(P, Q).
The closedness of R(P, Q) will follow from the explicit determination of all boundary
points via the Neyman-Pearson lemma – see Remark 14.1. In more complicated situations
(e.g. in testing against composite hypothesis) simple explicit solutions similar to Neyman-
Pearson Lemma are not available but closedness of the region can frequently be argued
still. The basic reason is that the collection of bounded functions {g : X → [0, 1]} (with
g(x) = PZ|X (0|x)) forms a weakly compact set and hence its image under the linear functional
R R
g 7→ ( gdP, gdQ) is closed.
(b) Testing by random guessing, i.e., Z ∼ Ber(1 − α) ⊥ ⊥ X, achieves the point (α, α).
(c) If (α, β) ∈ R(P, Q) is achieved by PZ|X , P1−Z|X achieves (1 − α, 1 − β).

The region R(P, Q) consists of the operating points of all randomized tests, which include as
special cases those of deterministic tests, namely

Rdet (P, Q) = {(P(E), Q(E)) : E measurable} . (14.4)

As the next result shows, the former is in fact the closed convex hull of the latter. Recall that
cl(E) (resp. co(E)) denotes the closure and convex hull of a set E, namely, the smallest closed
(resp. convex) set containing E. A useful example: for a subset E of a Euclidean space and a
measurable map (f, g) : R → E, we have (E[f(X)], E[g(X)]) ∈ cl(co(E)) for any real-valued
random variable X.

Theorem 14.3 (Randomized test v.s. deterministic tests)


R(P, Q) = cl(co(Rdet (P, Q))).

Consequently, if P and Q are on a finite alphabet X , then R(P, Q) is a polygon of at most 2|X |
vertices.

Proof. “⊃”: Comparing (14.2) and (14.4), by definition, R(P, Q) ⊃ Rdet(P, Q), and the former is
closed and convex by Theorem 14.2.
“⊂”: Given any randomized test PZ|X , define a measurable function g : X → [0, 1] by g(x) =
PZ|X (0|x). Then
    P[Z = 0] = Σ_x g(x) P(x) = E_P[g(X)] = ∫_0^1 P[g(X) ≥ t] dt

    Q[Z = 0] = Σ_x g(x) Q(x) = E_Q[g(X)] = ∫_0^1 Q[g(X) ≥ t] dt

where we applied the “area rule” that E[U] = ∫_{R_+} P[U ≥ t] dt for any non-negative random
variable U. Therefore the point (P[Z = 0], Q[Z = 0]) ∈ R is a mixture of points (P[g(X) ≥
t], Q[g(X) ≥ t]) ∈ Rdet, averaged according to t uniformly distributed on the unit interval. Hence
R ⊂ cl(co(Rdet)).
The last claim follows because there are at most 2^{|X|} subsets in (14.4).

Example 14.1 (Testing Ber(p) versus Ber(q)) Assume that p < 1/2 < q. Using Theo-
rem 14.3, note that there are 2² = 4 events E = ∅, {0}, {1}, {0, 1}. Then R(Ber(p), Ber(q)) is
given by the convex hull of the four points (0, 0), (p, q), (1, 1) and (p̄, q̄), where p̄ = 1 − p and
q̄ = 1 − q:

[Figure here: the quadrilateral R(Ber(p), Ber(q)) in the (α, β)-plane with vertices (0, 0), (p, q), (1, 1), (p̄, q̄).]
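The four deterministic points in this example can be enumerated directly; below is a small Python sketch (our own illustration, assuming numeric p, q) listing (P(E), Q(E)) over all events E, whose convex hull is the quadrilateral above.

    from itertools import chain, combinations

    def det_points(P, Q):
        """All pairs (P(E), Q(E)) over subsets E of the alphabet, i.e. Rdet in (14.4)."""
        symbols = list(P)
        subsets = chain.from_iterable(combinations(symbols, r)
                                      for r in range(len(symbols) + 1))
        return sorted({(sum(P[s] for s in E), sum(Q[s] for s in E)) for E in subsets})

    p, q = 0.2, 0.7                      # hypothetical values with p < 1/2 < q
    print(det_points({0: 1 - p, 1: p}, {0: 1 - q, 1: q}))
    # four vertices (up to floating point): (0,0), (p,q), (1-p,1-q), (1,1)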

14.2 Likelihood ratio tests


To define optimal hypothesis tests, we need to define the concept of the log-likelihood ratio (LLR).
In the simple case when P ≪ Q we can define the LLR T(x) = log (dP/dQ)(x) as a function T : X →
R ∪ {−∞} by thinking of log 0 = −∞. In order to handle also the case of P ̸≪ Q, we can leverage
our concept of the Log function, cf. (2.10).

Definition 14.4 (Extended log likelihood ratio) Assume that dP = p(x)dμ and
dQ = q(x)dμ for some dominating measure μ (e.g. μ = P + Q.) Recalling the definition of
Log from (2.10) we define the extended LLR as


 log qp((xx)) , p ( x) > 0 , q ( x) > 0


p(x) +∞, p ( x) > 0 , q ( x) = 0
T(x) ≜ Log =
q ( x)  −∞, p ( x) = 0 , q ( x) > 0



0, p ( x) = 0 , q ( x) = 0 ,


Definition 14.5 (Likelihood ratio test (LRT)) Given a binary hypothesis testing H0 : X ∼
P vs H1 : X ∼ Q the likelihood ratio test (LRT) with threshold τ ∈ R ∪ {±∞} is 1{x : T(x) ≤ τ },
in other words it decides
    LRT_τ(x) =  declare H0,   T(x) > τ
                declare H1,   T(x) ≤ τ.

When P ≪ Q it is clear that T(x) = log (dP/dQ)(x) for P- and Q-almost every x. For this reason,
everywhere in this Part we abuse notation and write simply log dP/dQ to denote the extended (!) LLR
as defined above. Notice that the LRT is a deterministic test, and that it does make intuitive sense:
upon observing x, if Q(x)/P(x) is large then Q is more likely and one should reject the null hypothesis
P.
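A direct transcription of Definitions 14.4 and 14.5 for discrete distributions, as a Python sketch (our own illustration; the distributions are hypothetical):

    import math

    def extended_llr(p, q):
        """Extended LLR T(x) = Log(p(x)/q(x)) of Definition 14.4."""
        if p > 0 and q > 0:
            return math.log(p / q)
        if p > 0:            # q == 0
            return math.inf
        if q > 0:            # p == 0
            return -math.inf
        return 0.0

    def lrt(x, P, Q, tau):
        """LRT_tau of Definition 14.5: declare H0 iff T(x) > tau."""
        return 'H0' if extended_llr(P.get(x, 0.0), Q.get(x, 0.0)) > tau else 'H1'

    P = {'a': 0.5, 'b': 0.5, 'c': 0.0}
    Q = {'a': 0.1, 'b': 0.5, 'c': 0.4}
    print([(x, lrt(x, P, Q, tau=0.0)) for x in 'abc'])
    # 'a' -> H0 (T = log 5), 'b' -> H1 (T = 0), 'c' -> H1 (T = -inf)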
Note that for a discrete alphabet X and assuming Q ≪ P we can see

    Q[T = t] = exp(−t) P[T = t]   ∀t ∈ R ∪ {+∞}.

Indeed, this is shown by the following chain:

    Q_T(t) = Σ_x Q(x) 1{log (P(x)/Q(x)) = t} = Σ_x Q(x) 1{e^t Q(x) = P(x)}
           = Σ_x e^{−t} P(x) 1{log (P(x)/Q(x)) = t} = e^{−t} P_T(t).

We see that taking expectations over P and over Q are equivalent upon multiplying the integrand
by exp(±T). The next result gives precise details in the general case.

Theorem 14.6 (Change of measure P ↔ Q) The following hold:

1 For any h : X → R we have

EQ [h(X)1{T > −∞}] = EP [h(X) exp(−T)] (14.5)


EP [h(X)1{T < +∞}] = EQ [h(X) exp(T)] (14.6)

2 For any f ≥ 0 and any −∞ < τ < ∞ we have

    E_Q[f(X) 1{T ≥ τ}] ≤ E_P[f(X) 1{T ≥ τ}] · exp(−τ)
    E_Q[f(X) 1{T ≤ τ}] ≥ E_P[f(X) 1{T ≤ τ}] · exp(−τ)                               (14.7)

Proof. We first observe that

Q[T = +∞] = P[T = −∞] = 0 . (14.8)

Then consider the chain


    E_Q[h(X) 1{T > −∞}] (a)= ∫_{−∞<T(x)<∞} q(x) h(x) dμ
                        (b)= ∫_{−∞<T(x)<∞} p(x) exp(−T(x)) h(x) dμ
                        (c)= ∫_{−∞<T(x)≤∞} p(x) exp(−T(x)) h(x) dμ = E_P[exp(−T) h(X)],

where in (a) we used (14.8) to justify the restriction to finite values of T; in (b) we used exp(−T(x)) =
q(x)/p(x) for p, q > 0; and (c) follows from the fact that exp(−T(x)) = 0 whenever T = ∞. Exchanging
the roles of P and Q proves (14.6).
The last part follows upon taking h(x) = f(x)1{T(x) ≥ τ} and h(x) = f(x)1{T(x) ≤ τ} in (14.5)
and (14.6), respectively.

The importance of the LLR is that it is a sufficient statistic for testing the two hypotheses (recall
Section 3.5 and in particular Example 3.9), as the following result shows.

Corollary 14.7 T = T(X) is a sufficient statistic for testing P versus Q.

Proof. Sufficiency of T would be implied by P_{X|T} = Q_{X|T}. For the case of X being
discrete we have:

    P_{X|T}(x|t) = P_X(x) P_{T|X}(t|x) / P_T(t) = P(x) 1{P(x)/Q(x) = e^t} / P_T(t)
                 = e^t Q(x) 1{P(x)/Q(x) = e^t} / P_T(t) = Q_{XT}(x, t) / (e^{−t} P_T(t))
                 = Q_{XT}(x, t) / Q_T(t) = Q_{X|T}(x|t).
We leave the general case as an exercise.

From Theorem 14.3 we know that to obtain the achievable region R(P, Q), one can iterate over
all decision regions and compute the region Rdet (P, Q) first, then take its closed convex hull. But
this is a formidable task if the alphabet is large or infinite. On the other hand, we know that the
LLR is a sufficient statistic. Next we give bounds to the region R(P, Q) in terms of the statistics
of the LLR. As usual, there are two types of statements:

• Converse (outer bounds): any point in R(P, Q) must satisfy certain constraints;
• Achievability (inner bounds): points satisfying certain constraints belong to R(P, Q).

14.3 Converse bounds on R(P, Q)


Theorem 14.8 (Weak converse) ∀(α, β) ∈ R(P, Q),
d(αkβ) ≤ D(PkQ)
d(βkα) ≤ D(QkP)

where d(·k·) is the binary divergence function in (2.6).

Proof. Use the data processing inequality for KL divergence with PZ|X ; cf. Corollary 2.19.

We will strengthen this bound with the aid of the following result.


Lemma 14.9 For any test Z and any γ > 0 we have


 
    P[Z = 0] − γ Q[Z = 0] ≤ P[T > log γ],

where T = log dP/dQ is understood in the extended sense of Definition 14.4.

Note that we do not need to assume P ≪ Q precisely because ±∞ are admissible values for
the (extended) LLR.
Proof. Defining τ = log γ and g(x) = PZ|X (0|x) we get from (14.7):
P[Z = 0, T ≤ τ ] − γ Q[Z = 0, T ≤ τ ] ≤ 0 .
Decomposing P[Z = 0] = P[Z = 0, T ≤ τ ] + P[Z = 0, T > τ ] and similarly for Q we obtain then
    P[Z = 0] − γ Q[Z = 0] ≤ P[T > log γ, Z = 0] − γ Q[T > log γ, Z = 0] ≤ P[T > log γ].

Theorem 14.10 (Strong converse) ∀(α, β) ∈ R(P, Q), ∀γ > 0,


    α − γβ ≤ P[log dP/dQ > log γ]                                                   (14.9)
    β − (1/γ)α ≤ Q[log dP/dQ < log γ]                                               (14.10)

Proof. Apply Lemma 14.9 to (P, Q, γ) and (Q, P, 1/γ).


Theorem 14.10 provides an outer bound for the region R(P, Q) in terms of half-planes. To see
this, fix γ > 0 and consider the line α − γβ = c by gradually increasing c from zero. There exists
a maximal c, say c∗ , at which point the line touches the lower boundary of the region. Then (14.9)
says that c* cannot exceed P[log dP/dQ > log γ]. Hence R must lie to the left of the line. Similarly,
(14.10) provides bounds for the upper boundary. Altogether Theorem 14.10 states that R(P, Q) is
contained in the intersection of an infinite collection of half-planes indexed by γ .
To apply the strong converse Theorem 14.10, we need to know the CDF of the LLR, whereas
to apply the weak converse Theorem 14.8 we need only to know the expectation of the LLR, i.e.,
the divergence. This is the usual pattern between the weak and strong converses in information
theory.

14.4 Achievability bounds on R(P, Q)


Given the convexity of the set R(P, Q), it is natural to try to find all of its supporting lines (hyper-
planes in dimension 2), as it is well-known that closed convex set equals the intersection of all
half-spaces that correspond to the supporting hyperplanes. We are thus led to the following
problem: for t > 0,

    max{α − tβ : (α, β) ∈ R(P, Q)},


which is equivalent to minimizing the average probability of error in (14.1), with t = π_1/π_0. This can
be solved without much effort. For simplicity, consider the discrete case. Then

    α* − tβ* = max_{(α,β)∈R} (α − tβ) = max_{P_{Z|X}} Σ_{x∈X} (P(x) − tQ(x)) P_{Z|X}(0|x) = Σ_{x∈X} |P(x) − tQ(x)|_+

where the last equality follows from the fact that we are free to choose P_{Z|X}(0|x), and the best
choice is obvious:

    P_{Z|X}(0|x) = 1{ log (P(x)/Q(x)) ≥ log t }.
Thus, we have shown that all supporting hyperplanes are parameterized by LRT. This completely
recovers the region R(P, Q) except for the points corresponding to the faces (flat pieces) of the
region. The precise result is stated as follows:

Theorem 14.11 (Neyman-Pearson Lemma) For each α, βα in (14.3) is attained by the


following test:

                   1,   log dP/dQ > τ
    P_{Z|X}(0|x) = λ,   log dP/dQ = τ                                               (14.11)
                   0,   log dP/dQ < τ

where τ ∈ R and λ ∈ [0, 1] are the unique solutions to α = P[log dP/dQ > τ] + λ P[log dP/dQ = τ].

Proof of Theorem 14.11. Let t = exp(τ). Given any test P_{Z|X}, let g(x) = P_{Z|X}(0|x) ∈ [0, 1]. We
want to show that

    α = P[Z = 0] = E_P[g(X)] = P[dP/dQ > t] + λ P[dP/dQ = t]                        (14.12)

implies

    β = Q[Z = 0] = E_Q[g(X)] ≥ Q[dP/dQ > t] + λ Q[dP/dQ = t].                      (14.13)

Using the simple fact that E_Q[f(X) 1{dP/dQ ≤ t}] ≥ t^{−1} E_P[f(X) 1{dP/dQ ≤ t}] for any f ≥ 0 twice, we
have

    β = E_Q[g(X) 1{dP/dQ ≤ t}] + E_Q[g(X) 1{dP/dQ > t}]
      ≥ (1/t) E_P[g(X) 1{dP/dQ ≤ t}] + E_Q[g(X) 1{dP/dQ > t}]
      = (1/t) ( E_P[(1 − g(X)) 1{dP/dQ > t}] + λ P[dP/dQ = t] ) + E_Q[g(X) 1{dP/dQ > t}]   (by (14.12))
      ≥ E_Q[(1 − g(X)) 1{dP/dQ > t}] + λ Q[dP/dQ = t] + E_Q[g(X) 1{dP/dQ > t}]
      = Q[dP/dQ > t] + λ Q[dP/dQ = t].
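For a finite alphabet the threshold τ and randomization λ in (14.11) can be found by a simple scan over the distinct LLR values. The Python sketch below (our own illustration, assuming numpy and that symbols outside both supports have been dropped) returns (τ, λ) and the resulting β_α.

    import numpy as np

    def neyman_pearson(P, Q, alpha):
        """Find (tau, lam) of (14.11) for target size alpha in (0, 1], plus beta_alpha."""
        P, Q = np.asarray(P, float), np.asarray(Q, float)
        with np.errstate(divide='ignore'):
            T = np.log(P) - np.log(Q)            # extended LLR (values may be +-inf)
        p_gt, q_gt = 0.0, 0.0                    # P, Q mass of {T > current value}
        for t in np.unique(T)[::-1]:             # distinct LLR values, decreasing
            p_eq, q_eq = P[T == t].sum(), Q[T == t].sum()
            if p_gt + p_eq >= alpha:             # threshold found: tau = t
                lam = (alpha - p_gt) / p_eq      # randomize on {T = tau}
                return t, lam, q_gt + lam * q_eq
            p_gt += p_eq
            q_gt += q_eq
        return -np.inf, 1.0, 1.0                 # alpha = 1: always accept H0

    P = np.array([0.5, 0.3, 0.2])                # hypothetical P and Q
    Q = np.array([0.1, 0.3, 0.6])
    print(neyman_pearson(P, Q, alpha=0.6))       # tau = 0, lam = 1/3, beta_0.6 = 0.2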


Remark 14.1 As a consequence of the Neyman-Pearson lemma, all the points on the boundary
of the region R(P, Q) are attainable. Therefore
    R(P, Q) = {(α, β) : β_α ≤ β ≤ 1 − β_{1−α}}.

Since α 7→ βα is convex on [0, 1], hence continuous, the region R(P, Q) is a closed convex set, as
previously stated in Theorem 14.2. Consequently, the infimum in the definition of βα is in fact a
minimum.
Furthermore, the lower half of the region R(P, Q) is the convex hull of the union of the
following two sets:

    { (α, β) : α = P[log dP/dQ > τ],  β = Q[log dP/dQ > τ] },   τ ∈ R ∪ {±∞},

and

    { (α, β) : α = P[log dP/dQ ≥ τ],  β = Q[log dP/dQ ≥ τ] },   τ ∈ R ∪ {±∞}.

Therefore it loses no optimality to restrict our attention to tests of the form 1{log dP/dQ ≥ τ}
or 1{log dP/dQ > τ}. The convex combination (randomization) of these two types of tests leads
to the achievability part of the Neyman-Pearson lemma (Theorem 14.11).


Remark 14.2 The Neyman-Pearson test (14.11) is related to the LRT¹ as follows (the two plots
below show the complementary CDF t ↦ P[log dP/dQ > t] and the level α):

[Figure here: two plots of t ↦ P[log dP/dQ > t]. In the left plot the level α is attained, α = P[log dP/dQ > τ]; in the right plot α falls inside a jump of this function at τ.]

• Left figure: If α = P[log dP/dQ > τ] for some τ, then λ = 0, and (14.11) becomes the LRT
  Z = 1{log dP/dQ ≤ τ}.
• Right figure: If α ≠ P[log dP/dQ > τ] for any τ, then we have λ ∈ (0, 1), and (14.11) is equivalent
  to randomizing over tests: Z = 1{log dP/dQ ≤ τ} with probability 1 − λ, or Z = 1{log dP/dQ < τ} with
  probability λ.

¹ Note that it so happens that in Definition 14.5 the LRT is defined with an ≤ instead of <.

Corollary 14.12 ∀τ ∈ R, there exists (α, β) ∈ R(P, Q) s.t.


    α = P[log dP/dQ > τ],


    β ≤ exp(−τ) P[log dP/dQ > τ] ≤ exp(−τ).

Proof. For the case of discrete X it is easy to give an explicit proof


    Q[log dP/dQ > τ] = Σ_x Q(x) 1{P(x)/Q(x) > exp(τ)}
                     ≤ Σ_x P(x) exp(−τ) 1{P(x)/Q(x) > exp(τ)} = exp(−τ) P[log dP/dQ > τ].

The general case is just an application of (14.7).

14.5 Asymptotics: Stein’s regime


Having understood how to compute and bound R(P, Q) we next proceed to the analysis of asymp-
totics. We will focus on iid observations in the large-sample asymptotics, that is we will be talking
about R(P⊗n , Q⊗n ) here. In other words, we consider
i.i.d.
H0 : X1 , . . . , Xn ∼ P
i.i.d.
H1 : X1 , . . . , Xn ∼ Q, (14.14)

where P and Q do not depend on n; this is a particular case of our general setting with P and Q
replaced by their n-fold product distributions. We are interested in the asymptotics of the error
probabilities π 0|1 and π 1|0 as n → ∞ in the following two regimes:

• Stein’s regime: When π 1|0 is constrained to be at most ϵ, what is the best exponential rate of
convergence for π 0|1 ?
• Chernoff’s regime: When both π 1|0 and π 0|1 are required to vanish exponentially, what is the
optimal tradeoff between their exponents?

Recall that we are in the iid setting (14.14) and are interested in tests satisfying 1 −α = π 1|0 ≤ ϵ
and β = π 0|1 ≤ exp(−nE) for some exponent E > 0. Motivation of this asymmetric objective
is that often a “missed detection” (π 0|1 ) is far more disastrous than a “false alarm” (π 1|0 ). For
example, a false alarm could simply result in extra computations (attempting to decode a packet
when in fact only noise has been received), while a missed detection results in a complete
loss of the packet. The formal definition of the best exponent is as follows.

Definition 14.13 The ϵ-optimal exponent in Stein’s regime is


Vϵ ≜ sup{E : ∃n0 , ∀n ≥ n0 , ∃PZ|Xn s.t. α > 1 − ϵ, β < exp (−nE)}.

and Stein’s exponent is defined as V ≜ limϵ→0 Vϵ .


It is an exercise to check the following equivalent definition


    V_ϵ = liminf_{n→∞} (1/n) log ( 1 / β_{1−ϵ}(P_{X^n}, Q_{X^n}) ),
where βα is defined in (14.3).
Here is the main result of this section.

Theorem 14.14 (Stein’s lemma) Consider the iid setting (14.14) where PXn = Pn and
QXn = Qn . Then
Vϵ = D(PkQ), ∀ϵ ∈ (0, 1).

Consequently, V = D(PkQ).

The way to use this result in practice is the following. Suppose it is required that α ≥ 0.999,
and β ≤ 10^{−40}, what is the required sample size? Stein's lemma provides a rule of thumb:
n ≥ (−log 10^{−40}) / D(P‖Q).
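As a numerical illustration of this rule of thumb (our own sketch with hypothetical numbers, testing Ber(0.6) against Ber(0.5)):

    import math

    def d(p, q):                         # binary divergence d(p||q) in nats
        return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

    beta_target = 1e-40
    D = d(0.6, 0.5)                      # D(P||Q) for P = Ber(0.6), Q = Ber(0.5)
    # Stein rule of thumb: n >= log(1/beta)/D(P||Q); the alpha >= 0.999 constraint
    # only affects lower-order terms.
    print(math.ceil(-math.log(beta_target) / D))   # roughly 4.6e3 samples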

Proof. We first assume that P ≪ Q so that dP/dQ is well defined. Define the LLR

    F_n = log (dP_{X^n}/dQ_{X^n}) = Σ_{i=1}^n log (dP/dQ)(X_i),                     (14.15)

which is an iid sum under both hypotheses. As such, by WLLN, under P, as n → ∞,

    (1/n) F_n = (1/n) Σ_{i=1}^n log (dP/dQ)(X_i)  → (in P-probability)  E_P[log dP/dQ] = D(P‖Q).   (14.16)

Alternatively, under Q, we have

    (1/n) F_n  → (in Q-probability)  E_Q[log dP/dQ] = −D(Q‖P).                      (14.17)
Note that both convergence results hold even if the divergence is infinite.
(Achievability) We show that Vϵ ≥ D(PkQ) ≡ D for any ϵ > 0. First assume that D < ∞. Pick
τ = n(D − δ) for some small δ > 0. Then Corollary 14.12 yields

α = P(Fn > n(D − δ)) → 1, by (14.16)


β ≤ e−n(D−δ)

Then, picking n large enough (depending on ϵ, δ) such that α ≥ 1 − ϵ, the exponent E = D − δ is
achievable and Vϵ ≥ E. Sending δ → 0 yields Vϵ ≥ D. Finally, if D = ∞, the above argument holds
for arbitrary τ > 0, proving that Vϵ = ∞.
(Converse) We show that Vϵ ≤ D for any ϵ < 1, to which end it suffices to consider D < ∞. As
a warm-up, we first show a weak converse by applying Theorem 14.8 based on data processing
inequality. For any (α, β) ∈ R(PXn , QXn ), we have
    −h(α) + α log(1/β) ≤ d(α‖β) ≤ D(P_{X^n}‖Q_{X^n})                                (14.18)


For any achievable exponent E < Vϵ , by definition, there exists a sequence of tests such that
αn ≥ 1 − ϵ and βn ≤ exp(−nE). Plugging this into (14.18) and using h ≤ log 2, we have
E ≤ D(P‖Q)/(1 − ϵ) + log 2/(n(1 − ϵ)). Sending n → ∞ yields

    Vϵ ≤ D(P‖Q)/(1 − ϵ),

which is weaker than what we set out to prove; nevertheless, this weak converse is tight for ϵ → 0,
so that for Stein's exponent we have already established the desired result V = lim_{ϵ→0} Vϵ =
D(P‖Q). So the question remains: if we allow the type-I error to be ϵ = 0.999, is it possible for
the type-II error to decay faster? This is shown impossible by the strong converse next.
To this end, note that, in proving the weak converse, we only made use of the expectation
of Fn in (14.18), we need to make use of the entire distribution (CDF) in order to obtain better
results. Applying the strong converse Theorem 14.10 to testing PXn versus QXn and α = 1 − ϵ and
β = exp(−nE), we have

1 − ϵ − γ exp (−nE) ≤ αn − γβn ≤ PXn [Fn > log γ].

Pick γ = exp(n(D + δ)) for δ > 0, by WLLN (14.16) the probability on the right side goes to 0,
which implies that for any fixed ϵ < 1, we have E ≤ D + δ and hence Vϵ ≤ D + δ . Sending δ → 0
completes the proof.
Finally, let us address the case of P ̸≪ Q, in which case D(P‖Q) = ∞. By definition, there
exists a subset A such that Q(A) = 0 but P(A) > 0. Consider the test that selects P if Xi ∈ A for
some i ∈ [n]. It is clear that this test achieves β = 0 and 1 − α = (1 − P(A))n , which can be made
less than any ϵ for large n. This shows Vϵ = ∞, as desired.

Remark 14.3 (Non-iid data) Just like in Chapter 12 on data compression, Theorem 14.14
can be extended to stationary ergodic processes. Specifically, one can show that Stein's
exponent corresponds to the relative entropy rate, i.e.

    Vϵ = lim_{n→∞} (1/n) D(P_{X^n}‖Q_{X^n}),

where {Xi } is stationary and ergodic under both P and Q. Indeed, the counterpart of (14.16) based
on WLLN, which is the key for choosing the appropriate threshold τ , for ergodic processes is the
Birkhoff-Khintchine convergence theorem (cf. Theorem 12.8).

The theoretical importance of Stein’s exponent is in implications of the following type:

∀E ⊂ X n , PXn [E] ≥ 1 − ϵ ⇒ QXn [E] ≥ exp (−nVϵ + o(n))

Thus knowledge of Stein’s exponent Vϵ allows one to prove exponential bounds on probabilities
of arbitrary sets; this technique is known as “change of measure”, which will be applied in large
deviations analysis in Chapter 15.


14.6 Chernoff regime: preview


We are still considering iid setting (14.14), namely, testing

H0 : Xn ∼ Pn versus H1 : Xn ∼ Qn ,

but the objective in the Chernoff regime is to achieve exponentially small error probability of both
types simultaneously. We say a pair of exponents (E0 , E1 ) is achievable if there exists a sequence
of tests such that

1 − α = π 1|0 ≤ exp(−nE0 )
β = π 0|1 ≤ exp(−nE1 ).

Intuitively, one exponent can be made large at the expense of making the other small. So the interest-
ing question is to find their optimal tradeoff by characterizing the achievable region of (E0 , E1 ).
This problem was solved by [218, 61] and is the topic of Chapter 16. (See Figure 16.2 for an
illustration of the optimal (E0 , E1 )-tradeoff.)
Let us explain what we already know about the region of achievable pairs of exponents (E0 , E1 ).
First, Stein’s regime corresponds to corner points of this achievable region. Indeed, Theo-
rem 14.14 tells us that when fixing αn = 1 − ϵ, namely E0 = 0, picking τ = D(PkQ) − δ
(δ → 0) gives the exponential convergence rate of βn as E1 = D(PkQ). Similarly, exchanging the
roles of P and Q, we can achieve the point (E0, E1) = (D(Q‖P), 0).
Second, we have shown in Section 7.3 that the minimum total error probabilities over all tests
satisfies

    min_{(α,β)∈R(Pⁿ,Qⁿ)} (1 − α + β) = 1 − TV(Pⁿ, Qⁿ).

As n → ∞, Pⁿ and Qⁿ become increasingly distinguishable and their total variation converges


to 1 exponentially, with exponent E given by max min(E0 , E1 ) over all achievable pairs. From the
bounds (7.22) and tensorization of the Hellinger distance (7.25), we obtain
    1 − √(1 − exp(−2nE_H)) ≤ 1 − TV(Pⁿ, Qⁿ) ≤ exp(−nE_H),                           (14.19)

where we denoted

    E_H ≜ log ( 1 / (1 − H²(P, Q)/2) ).

Thus, we can see that

EH ≤ E ≤ 2EH .

This characterization is valid even if P and Q depend on the sample size n, which will prove
useful later when we study composite hypothesis testing in Section 32.2.1. However, for fixed P
and Q this is not precise enough. In order to determine the full set of achievable pairs, we need


to make a detour into the topic of large deviations next. To see how this connection arises, notice
that the (optimal) likelihood ratio tests give us explicit expressions for both error probabilities:

    1 − α_n = P[(1/n) F_n ≤ τ],    β_n = Q[(1/n) F_n > τ],
where Fn is the LLR in (14.15). When τ falls in the range of (−D(QkP), D(PkQ)), both proba-
bilities are vanishing thanks to WLLN – see (14.16) and (14.17), and we are interested in their
exponential convergence rate. This falls under the purview of large deviations theory.

15 Information projection and large deviations

In this chapter we develop tools needed for the analysis of the error-exponents in hypothesis test-
ing (Chernoff regime). We will start by introducing the concepts of large deviations theory (the
log moment generating function (MGF) ψ_X, its convex conjugate ψ*_X, known as the rate function,
and the idea of tilting, which we revisit). Then, we show that the probability of deviation of an
empirical mean is governed by the solution of an information projection (also known as I-projection)
problem:

    min_{Q: E_Q[X]≥γ} D(Q‖P) = ψ*(γ).

Equipped with the information projection we will prove a tight version of the Chernoff bound.
Specifically, for iid copies X_1, . . . , X_n of X, we show

    P[ (1/n) Σ_{k=1}^n X_k ≥ γ ] = exp(−n ψ*(γ) + o(n)).

In the remaining sections we extend the simple information projection problem to a general
minimization over convex sets of measures and connect it to empirical process theory (Sanov’s the-
orem) and also show how to solve the problem under finitely many linear constraints (exponential
families).
In the next chapter, we apply these results to characterize the achievable (E0 , E1 )-region (as
defined in Section 14.6) to get

    ( E_0(θ) = ψ*_P(θ),  E_1(θ) = ψ*_P(θ) − θ ),

with ψ*_P being the rate function of log dP/dQ (under P). This gives us a complete (parametric)
description of the sought-after tradeoff between the two exponents in the Chernoff regime.

15.1 Basics of large deviations theory


Let X_1, . . . , X_n be an iid sequence drawn from P and P̂_n = (1/n) Σ_{i=1}^n δ_{X_i} their empirical distribution.
The large deviations theory focuses on establishing sharp exponential estimates of the kind

P[P̂n ∈ E] = exp{−nE + o(n)} .

The full account of such theory requires delicate consideration of topological properties of E , and
is the subject of classical treatments e.g. [120]. We focus here on a simple special case which,
however, suffices for the purpose of establishing the Chernoff exponents in hypothesis testing,


and also showcases all the relevant information-theoretic ideas. Our initial goal is to show the
following result:

Theorem 15.1 Consider a random variable X whose log MGF ψX (λ) = log E[exp(λX)] is
finite for all λ ∈ R. Let B = esssup X and let E[X] < γ < B. Then

    P[ Σ_{i=1}^n X_i ≥ nγ ] = exp{−n E(γ) + o(n)},

where E(γ) = sup_{λ≥0} (λγ − ψ_X(λ)) = ψ*_X(γ), known as the rate function.

The concepts of log MGF and the rate function will be elaborated in subsequent sections. We
provide the proof below that should be revisited after reading the rest of the chapter.

Proof. Let us recall the usual Chernoff bound: For iid X^n, for any λ ≥ 0, applying Markov's
inequality yields

    P[ Σ_{i=1}^n X_i ≥ nγ ] = P[ exp(λ Σ_{i=1}^n X_i) ≥ exp(nλγ) ]
                            ≤ exp(−nλγ) E[ exp(λ Σ_{i=1}^n X_i) ]
                            = exp(−nλγ + n log E[exp(λX)]) = exp(−n(λγ − ψ_X(λ))).

Optimizing over λ ≥ 0 gives the following non-asymptotic upper bound (concentration inequality)
which holds for any n:

    P[ Σ_{i=1}^n X_i ≥ nγ ] ≤ exp{ −n sup_{λ≥0} (λγ − ψ_X(λ)) }.                    (15.1)

This proves the upper bound part of Theorem 15.1.


To show the lower bound we need more tools that are going to be developed below. First, we will
express E(γ) as a certain KL-minimization problem (see Theorem 15.9), known as information
projection. Second, we will solve this problem (see (15.26)) to obtain the desired value of E(γ). In
the process of this proof we will also gain a deeper understanding of why the naive Chernoff bound
turns out to be sharp. It will be seen that inequality (15.1) performs a change of measure to a new
distribution Q, which is chosen to be the closest to P (in KL divergence) among all distributions
Q with EQ [X] ≥ γ . (This new distribution will turn out to be the tilted version of P, denoted by
Pλ .)


15.1.1 Log MGF and rate function


Definition 15.2 The log moment-generating function (log MGF, also known as the cumulant
generating function) of a real-valued random variable X is

ψX (λ) = log E[exp(λX)], λ ∈ R.

Per convention in information theory, we will denote ψP (λ) = ψX (λ) if X ∼ P.

As an example, for a standard Gaussian Z ∼ N(0, 1), we have ψ_Z(λ) = λ²/2. Taking X = Z³
yields a random variable such that ψ_X(λ) is infinite for all non-zero λ.
In the remaining of the chapter, we shall make the following simplifying assumption, known
as Cramér’s condition.

Assumption 15.1 The random variable X is such that ψX (λ) < ∞ for all λ ∈ R.

Most of the results we discuss in this chapter hold under a much weaker assumption of ψX
having domain with non-empty interior. But proofs in this generality significantly obscure the
elegance of the main ideas and we decided to avoid them. We note that Assumption 15.1 implies
that all moments of X are finite.

Theorem 15.3 (Properties of ψX ) Under Assumption 15.1 we have:

(a) ψX is convex;
(b) ψX is continuous;
(c) ψX is infinitely differentiable and

    ψ'_X(λ) = E[X exp{λX}] / E[exp{λX}] = exp{−ψ_X(λ)} E[X exp{λX}].

In particular, ψX (0) = 0, ψX′ (0) = E [X].


(d) If a ≤ X ≤ b a.s., then a ≤ ψX′ ≤ b;
(e) Conversely, if


    A = inf_{λ∈R} ψ'_X(λ),    B = sup_{λ∈R} ψ'_X(λ),

then A ≤ X ≤ B a.s.;
(f) If X is not a constant, then ψX is strictly convex, and consequently, ψX′ is strictly increasing.
(g) Chernoff bound:

P(X ≥ γ) ≤ exp(−λγ + ψX (λ)), λ ≥ 0. (15.2)

Remark 15.1 The slope of log MGF encodes the range of X. Indeed, Theorem 15.3(d) and
(e) together show that the smallest closed interval containing the support of PX equals (closure of)


the range of ψX′ . In other words, A and B coincide with the essential infimum and supremum (min
and max of RV in the probabilistic sense) of X respectively,
A = essinf X ≜ sup{a : X ≥ a a.s.}
B = esssup X ≜ inf{b : X ≤ b a.s.}
See Figure 15.1 for an illustration.

[Figure 15.1 here: plot of ψ_X(λ) with asymptotic slopes A (as λ → −∞) and B (as λ → +∞), and slope E[X] at λ = 0.]

Figure 15.1 Example of a log MGF ψ_X(λ) with P_X supported on [A, B]. The limiting minimal and
maximal slopes are A and B respectively. The slope at λ = 0 is ψ'_X(0) = E[X]. Here we plot for X = ±1 with
P[X = 1] = 1/3.
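The example of Figure 15.1 is easy to reproduce numerically; the Python sketch below (our own illustration, assuming numpy) shows that the slope ψ'_X(λ) equals E[X] at λ = 0 and approaches A = −1 and B = +1 in the two limits.

    import numpy as np

    xs = np.array([-1.0, 1.0])           # X = +-1 with P[X = 1] = 1/3, as in Figure 15.1
    ps = np.array([2/3, 1/3])

    def psi_prime(lmb):
        """psi_X'(lambda) = E_{P_lambda}[X], the mean of the tilted distribution."""
        w = ps * np.exp(lmb * xs)
        return float(np.sum(w * xs) / np.sum(w))

    print(psi_prime(0.0))                      # = E[X] = -1/3
    print(psi_prime(-20.0), psi_prime(20.0))   # close to A = -1 and B = +1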

Proof. For the proof we assume that base of log and exp is e. Note that (g) is already proved
in (15.1). The proof of (e)–(f) relies on Theorem 15.8 and can be revisited later.

(a) Fix θ ∈ (0, 1). Recall Hölder’s inequality:


    E[|UV|] ≤ ‖U‖_p ‖V‖_q,   for p, q ≥ 1, 1/p + 1/q = 1,
where the Lp -norm of a random variable U is defined by kUkp = (E|U|p )1/p . Applying to
E[e(θλ1 +θ̄λ2 )X ] with p = 1/θ, q = 1/θ̄, we get
E[exp((λ1 /p + λ2 /q)X)] ≤ k exp(λ1 X/p)kp k exp(λ2 X/q)kq = E[exp(λ1 X)]θ E[exp(λ2 X)]θ̄ ,
i.e., eψX (θλ1 +θ̄λ2 ) ≤ eψX (λ1 )θ eψX (λ2 )θ̄ . Another proof is by expressing ψX′′ as certain variance;
cf. Theorem 15.8(c).
(b) By our assumptions on X, the domain of ψX is R. By the fact that a convex function must be
continuous on the interior of its domain, we conclude that ψX is continuous on R.
(c) The subtlety here is that we need to be careful when exchanging the order of differentiation
and expectation.
Assume without loss of generality that λ ≥ 0. First, we show that E[|XeλX |] exists. Since
e|X| ≤ eX + e−X


|XeλX | ≤ e|(λ+1)X| ≤ e(λ+1)X + e−(λ+1)X

by assumption on X, both of the summands are absolutely integrable in X. Therefore by the


dominated convergence theorem, E[|XeλX |] exists and is continuous in λ.
Second, by the existence and continuity of E[|XeλX |], u 7→ E[|XeuX |] is integrable on [0, λ], we
can switch order of integration and differentiation as follows:

    e^{ψ_X(λ)} = E[e^{λX}] = E[ 1 + ∫_0^λ X e^{uX} du ] (Fubini)= 1 + ∫_0^λ E[X e^{uX}] du
    ⇒ ψ'_X(λ) e^{ψ_X(λ)} = E[X e^{λX}],

thus ψ'_X(λ) = e^{−ψ_X(λ)} E[X e^{λX}] exists and is continuous in λ on R.


Furthermore, using similar application of the dominated convergence theorem we can extend
to λ ∈ C and show that λ 7→ E[eλX ] is a holomorphic function. Thus it is infinitely
differentiable.
(d) A ≤ X ≤ B ⇒ ψ'_X(λ) = E[X e^{λX}] / E[e^{λX}] ∈ [A, B].
(e) Suppose (for contradiction) that PX [X > B] > 0. Then PX [X > B + 2ϵ] > 0 for some small
ϵ > 0. But then Pλ [X ≤ B +ϵ] → 0 for λ → ∞ (see Theorem 15.8.3 below). On the other hand,
we know from Theorem 15.8.2 that EPλ [X] = ψX′ (λ) ≤ B. This is not yet a contradiction, since
Pλ might still have some very small mass at a very negative value. To show that this cannot
happen, we first assume that B − ϵ > 0 (otherwise just replace X with X − 2B). Next note that

B ≥ EPλ [X] = EPλ [X1 {X < B − ϵ}] + EPλ [X1 {B − ϵ ≤ X ≤ B + ϵ}] + EPλ [X1 {X > B + ϵ}]
≥ EPλ [X1 {X < B − ϵ}] + EPλ [X1 {X > B + ϵ}]
≥ − EPλ [|X|1 {X < B − ϵ}] + (B + ϵ) Pλ [X > B + ϵ] . (15.3)
| {z }
→1

Therefore we will obtain a contradiction if we can show that EPλ [|X|1 {X < B − ϵ}] → 0 as
λ → ∞. To that end, notice that the convexity of ψ_X implies that ψ'_X ↗ B. Thus, for all λ ≥ λ_0
we have ψ'_X(λ) ≥ B − ϵ/2, and hence for all λ ≥ λ_0

    ψ_X(λ) ≥ ψ_X(λ_0) + (λ − λ_0)(B − ϵ/2) = c + λ(B − ϵ/2),                        (15.4)

for some constant c. Then,

    E_{P_λ}[|X| 1{X < B − ϵ}] = E[|X| e^{λX − ψ_X(λ)} 1{X < B − ϵ}]
                              ≤ E[|X| e^{λX − c − λ(B − ϵ/2)} 1{X < B − ϵ}]
                              ≤ E[|X| e^{λ(B−ϵ) − c − λ(B − ϵ/2)}]
                              = E[|X|] e^{−λϵ/2 − c} → 0   as λ → ∞,

where the first inequality is from (15.4) and the second from X < B − ϵ. Thus, the first term
in (15.3) goes to 0 implying the desired contradiction.
(f) Suppose ψX is not strictly convex. Since ψX is convex from part (a), ψX must be “flat” (affine)
near some point. That is, there exists a small neighborhood of some λ0 such that ψX (λ0 + u) =


ψ_X(λ_0) + ur for some r ∈ R. Then ψ_{P_{λ_0}}(u) = ur for all u in a small neighborhood of zero, or
equivalently E_{P_{λ_0}}[e^{u(X−r)}] = 1 for u small. The following Lemma 15.4 implies P_{λ_0}[X = r] = 1,
but then P[X = r] = 1, contradicting the assumption that X is not a constant.

Lemma 15.4 If E[e^{uS}] = 1 for all u ∈ (−ϵ, ϵ), then S = 0 a.s.

Proof. Expand in Taylor series around u = 0 to obtain E[S] = 0, E[S2 ] = 0. Alternatively, we


can extend the argument we gave for differentiating ψX (λ) to show that the function z 7→ E[ezS ] is
holomorphic on the entire complex plane1 . Thus by uniqueness, E[euS ] = 1 for all u.

Definition 15.5 (Rate function) The rate function ψX∗ : R → R ∪ {+∞} is given by the
Fenchel-Legendre conjugate (convex conjugate) of the log MGF:

    ψ*_X(γ) = sup_{λ∈R} (λγ − ψ_X(λ)).                                              (15.5)

Note that the maximization (15.5) is a convex optimization problem since ψ_X is strictly convex,
so we can find the maximum by taking the derivative and finding the stationary point. In fact, ψ*_X
is precisely the convex conjugate of ψ_X; cf. (7.84).
The next result describes useful properties of the rate function. See Figure 15.2 for an
illustration.
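Numerically, ψ*_X(γ) can be obtained by maximizing λγ − ψ_X(λ); a minimal Python sketch (our own illustration, assuming scipy is available; same two-point X as above):

    import numpy as np
    from scipy.optimize import minimize_scalar

    xs = np.array([-1.0, 1.0])           # X = +-1 with P[X = 1] = 1/3
    ps = np.array([2/3, 1/3])

    def psi(lmb):                        # log MGF
        return float(np.log(np.sum(ps * np.exp(lmb * xs))))

    def rate(gamma):
        """psi_X^*(gamma) = sup_lambda (lambda*gamma - psi_X(lambda)), cf. (15.5)."""
        res = minimize_scalar(lambda l: psi(l) - l * gamma,
                              bounds=(-50, 50), method='bounded')
        return -res.fun

    print(rate(-1/3))   # ~0 at gamma = E[X]
    print(rate(0.5))    # strictly positive for gamma > E[X]
    print(rate(1.0))    # approaches log(1/P[X = 1]) = log 3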

Theorem 15.6 (Properties of ψX∗ ) Assume that X is non-constant and satisfies Assump-
tion 15.1.

(a) Let A = essinf X and B = esssup X. Then

    ψ*_X(γ) =  λγ − ψ_X(λ) for the λ s.t. γ = ψ'_X(λ),   A < γ < B
               log (1/P(X = γ)),                         γ = A or B
               +∞,                                       γ < A or γ > B.

(b) ψX∗ is strictly convex and strictly positive except ψX∗ (E[X]) = 0.
(c) ψX∗ is decreasing when γ ∈ (A, E[X]), and increasing when γ ∈ [E[X], B)

Proof. By Theorem 15.3(d), since A ≤ X ≤ B a.s., we have A ≤ ψX′ ≤ B. When γ ∈ (A, B), the
strictly concave function λ 7→ λγ − ψX (λ) has a single stationary point which achieves the unique
maximum. When γ > B (resp. < A), λ 7→ λγ − ψX (λ) increases (resp. decreases) without bounds.

1
More precisely, if we only know that E[eλS ] is finite for |λ| ≤ 1 then the function z 7→ E[ezS ] is holomorphic in the
vertical strip {z : |Rez| < 1}.


[Figure 15.2 here: top panel, ψ_X(λ) with a supporting line of slope γ whose intercept gives −ψ*_X(γ); bottom panel, the rate function ψ*_X(γ), finite on [A, B], vanishing at γ = E[X], and equal to +∞ outside [A, B].]

Figure 15.2 Log MGF ψX and its conjugate (rate function) ψX∗ for X taking values in [A, B], continuing the
example in Figure 15.1.

When γ = B, since X ≤ B a.s., we have


    ψ*_X(B) = sup_{λ∈R} (λB − log E[exp(λX)]) = − log inf_{λ∈R} E[exp(λ(X − B))]
            = − log lim_{λ→∞} E[exp(λ(X − B))] = − log P(X = B),

by the monotone convergence theorem.


By Theorem 15.3(f), since ψX is strictly convex, the derivative of ψX and ψX∗ are inverse to each
other. Hence ψX∗ is strictly convex. Since ψX (0) = 0, we have ψX∗ (γ) ≥ 0. Moreover, ψX∗ (E[X]) = 0
follows from E[X] = ψX′ (0).

15.1.2 Tilted distribution


As early as in Chapter 4, we have already introduced the concept of tilting in the proof of Donsker-
Varadhan’s variational characterization of divergence (Theorem 4.6). Let us formally define it now.

Definition 15.7 (Tilting) Given X ∼ P and λ ∈ R, the tilted measure Pλ is defined by


    P_λ(dx) = ( exp{λx} / E[exp{λX}] ) P(dx) = exp{λx − ψ_X(λ)} P(dx).              (15.6)
In particular, if P has a PDF p, then the PDF of Pλ is given by pλ (x) = eλx−ψX (λ) p(x).


The set of distributions {Pλ : λ ∈ R} parametrized by λ is called a standard (one-parameter)


exponential family, an important object in statistics [77]. Here are some examples:

• Gaussian: P = N(0, 1) with density p(x) = (1/√(2π)) exp(−x²/2). Then P_λ has density
  (exp(λx)/exp(λ²/2)) · (1/√(2π)) exp(−x²/2) = (1/√(2π)) exp(−(x − λ)²/2). Hence P_λ = N(λ, 1).
• Bernoulli: P = Ber(1/2). Then P_λ = Ber(e^λ/(e^λ + 1)), which puts more (resp. less) mass on 1 if λ > 0
  (resp. < 0). Moreover, P_λ → δ_1 weakly if λ → ∞ and P_λ → δ_0 if λ → −∞.
• Uniform: Let P be the uniform distribution on [0, 1]. Then P_λ is also supported on [0, 1] with
  pdf p_λ(x) = λ exp(λx)/(e^λ − 1). Therefore as λ increases, P_λ becomes increasingly concentrated near 1,
  and P_λ → δ_1 as λ → ∞. Similarly, P_λ → δ_0 as λ → −∞.

In the above examples we see that Pλ shifts the mean of P to the right (resp. left) when λ > 0
(resp. < 0). Indeed, this is a general property of tilting.
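The mean shift (and the identity D(P_λ‖P) = ψ*_X(E_{P_λ}[X]) of the next theorem) can be checked numerically; a small Python sketch with a hypothetical three-point P (our own illustration, assuming numpy):

    import numpy as np

    xs = np.array([0.0, 1.0, 2.0])       # hypothetical baseline P
    ps = np.array([0.5, 0.3, 0.2])

    def tilt(lmb):
        """P_lambda(x) proportional to exp(lambda*x) P(x), cf. (15.6)."""
        w = ps * np.exp(lmb * xs)
        return w / w.sum()

    def kl(q, p):
        m = q > 0
        return float(np.sum(q[m] * np.log(q[m] / p[m])))

    for lmb in (-1.0, 0.0, 1.0):
        q = tilt(lmb)
        # the mean moves right as lambda grows; kl(q, ps) equals psi*(mean) by (15.8)
        print(lmb, float(q @ xs), kl(q, ps))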

Theorem 15.8 (Properties of Pλ ) Under Assumption 15.1 we have:

(a) Log MGF:

ψPλ (u) = ψX (λ + u) − ψX (λ)

(b) Tilting trades mean for divergence:

EPλ [X] = ψX′ (λ) ≷ EP [X] if λ ≷ 0. (15.7)


D(Pλ kP) = ψX∗ (ψX′ (λ)) = ψX∗ (EPλ [X]). (15.8)

(c) Tilted variance: VarPλ (X) = ψX′′ (λ) log e.


(d)

P(X > b) > 0 ⇒ ∀ϵ > 0, Pλ (X ≤ b − ϵ) → 0 as λ → ∞;


P(X < a) > 0 ⇒ ∀ϵ > 0, Pλ (X ≥ a + ϵ) → 0 as λ → −∞.
Therefore if X_λ ∼ P_λ, then X_λ → essinf X = A in distribution as λ → −∞, and X_λ → esssup X = B in distribution as λ → ∞.

Proof. Again for the proof we assume base e for exp and log.

(a) By definition.
(b) E_{P_λ}[X] = E[X exp(λX)] / E[exp(λX)] = ψ'_X(λ), which is strictly increasing in λ, with ψ'_X(0) = E_P[X].
    D(P_λ‖P) = E_{P_λ}[log (dP_λ/dP)] = E_{P_λ}[log (exp(λX)/E[exp(λX)])] = λ E_{P_λ}[X] − ψ_X(λ)
             = λ ψ'_X(λ) − ψ_X(λ) = ψ*_X(ψ'_X(λ)), where the last equality follows from Theorem 15.6(a).
(c) ψ''_X(λ) = ( E[X² exp(λX)] E[exp(λX)] − (E[X exp(λX)])² ) / (E[exp(λX)])² = Var_{P_λ}(X).


(d)

    P_λ(X ≤ b − ϵ) = E_P[e^{λX − ψ_X(λ)} 1{X ≤ b − ϵ}]
                   ≤ E_P[e^{λ(b−ϵ) − ψ_X(λ)} 1{X ≤ b − ϵ}]
                   ≤ e^{−λϵ} e^{λb − ψ_X(λ)}
                   ≤ e^{−λϵ} / P[X > b] → 0   as λ → ∞,
where the last inequality is due to the usual Chernoff bound (Theorem 15.3(g)): P[X > b] ≤
exp(−λb + ψX (λ)).

15.2 Large-deviations exponents and KL divergence


Large deviations problems deal with rare events by making statements about the tail probabilities
of a sequence of distributions. Here, we are interested in the following special case: the speed of
decay of P[(1/n) Σ_{k=1}^n X_k ≥ γ] for iid X_k when γ exceeds the mean.
In (15.1) we have used the Chernoff bound to obtain an upper bound on the exponent via the log
MGF. Here we use a different method to give a formula for the exponent as a convex optimiza-
tion problem involving the KL divergence (the information projection, studied in the subsequent
sections). Later in Section 15.4 we shall revisit the Chernoff bound after we have computed the
value of the information projection.

Theorem 15.9 Let X_1, X_2, . . . be i.i.d. ∼ P. Then for any γ ∈ R,

    lim_{n→∞} (1/n) log ( 1 / P[(1/n) Σ_{k=1}^n X_k > γ] ) = inf_{Q: E_Q[X]>γ} D(Q‖P)     (15.9)

    lim_{n→∞} (1/n) log ( 1 / P[(1/n) Σ_{k=1}^n X_k ≥ γ] ) = inf_{Q: E_Q[X]≥γ} D(Q‖P)     (15.10)

Furthermore, for every n we have the firm upper bound

    P[(1/n) Σ_{k=1}^n X_k ≥ γ] ≤ exp( −n · inf_{Q: E_Q[X]≥γ} D(Q‖P) )                     (15.11)

and similarly for > in place of ≥.

Remark 15.2 (Subadditivity) One can argue from first principles that the limits
(15.9) and (15.10) exist without computing their values. Indeed, note that the sequence
p_n ≜ P[Σ_{k=1}^n X_k ≥ nγ] satisfies p_{n+m} ≥ p_n p_m and hence n ↦ log(1/p_n) is subadditive. As such,
lim_{n→∞} (1/n) log(1/p_n) = inf_n (1/n) log(1/p_n) by Fekete's lemma.


Proof. First note that if the events have zero probability, then both sides coincide with infinity.
Indeed, if P[(1/n) Σ_{k=1}^n X_k > γ] = 0, then P[X > γ] = 0. Then E_Q[X] > γ ⇒ Q[X > γ] > 0 ⇒
Q ̸≪ P ⇒ D(Q‖P) = ∞ and hence (15.9) holds trivially. The case for (15.10) is similar.
In the sequel we assume both probabilities are nonzero. We start by proving (15.9). Set
P[E_n] = P[(1/n) Σ_{k=1}^n X_k > γ].

Lower Bound on P[E_n]: Fix a Q such that E_Q[X] > γ. Let X^n be iid ∼ Q. Then by WLLN,

    Q[E_n] = Q[ Σ_{k=1}^n X_k > nγ ] = 1 − o(1).

Now the data processing inequality (Corollary 2.19) gives


d(Q[En ]kP[En ]) ≤ D(QXn kPXn ) = nD(QkP)
And a lower bound for the binary divergence is
    d(Q[E_n]‖P[E_n]) ≥ −h(Q[E_n]) + Q[E_n] log (1/P[E_n]).
Combining the two bounds on d(Q[E_n]‖P[E_n]) gives

    P[E_n] ≥ exp( ( −n D(Q‖P) − log 2 ) / Q[E_n] ).                                 (15.12)
Optimizing over Q to give the best bound:

    limsup_{n→∞} (1/n) log (1/P[E_n]) ≤ inf_{Q: E_Q[X]>γ} D(Q‖P).

Upper Bound on P[E_n]: The key observation is that for any X and any event E with P_X(E) > 0,
the probability can be expressed via the divergence between the conditional and unconditional
distributions as log (1/P_X(E)) = D(P_{X|X∈E}‖P_X). Define P̃_{X^n} = P_{X^n | Σ X_i > nγ}, under which Σ X_i > nγ holds a.s. Then

    log (1/P[E_n]) = D(P̃_{X^n}‖P_{X^n}) ≥ inf_{Q_{X^n}: E_Q[Σ X_i]>nγ} D(Q_{X^n}‖P_{X^n}).     (15.13)

We now show that the last problem “single-letterizes”, i.e., reduces to n = 1. Note that this is a
special case of a more general phenomenon – see Ex. III.12. Consider the following two steps:

    D(Q_{X^n}‖P_{X^n}) ≥ Σ_{j=1}^n D(Q_{X_j}‖P)
                       ≥ n D(Q̄‖P),    Q̄ ≜ (1/n) Σ_{j=1}^n Q_{X_j},                  (15.14)

where the first step follows from (2.27) in Theorem 2.16, after noticing that P_{X^n} = Pⁿ, and the
second step is by convexity of divergence (Theorem 5.1). From this argument we conclude that

    inf_{Q_{X^n}: E_Q[Σ X_i]>nγ} D(Q_{X^n}‖P_{X^n}) = n · inf_{Q: E_Q[X]>γ} D(Q‖P)            (15.15)
P D(QXn kPXn ) = n · inf D(QkP) (15.15)
QXn :EQ [ Xi ]>nγ Q:EQ [X]>γ


    inf_{Q_{X^n}: E_Q[Σ X_i]≥nγ} D(Q_{X^n}‖P_{X^n}) = n · inf_{Q: E_Q[X]≥γ} D(Q‖P)            (15.16)

In particular, (15.13) and (15.15) imply the required lower bound in (15.9).
Next we prove (15.10). First, notice that the lower bound argument (15.13) applies equally well,
so that for each n we have
    (1/n) log ( 1 / P[(1/n) Σ_{k=1}^n X_k ≥ γ] ) ≥ inf_{Q: E_Q[X]≥γ} D(Q‖P).

To get a matching upper bound we consider two cases:

• Case I: P[X > γ] = 0. If P[X ≥ γ] = 0, then both sides of (15.10) are +∞. If P[X = γ] > 0,
  then P[Σ X_k ≥ nγ] = P[X_1 = . . . = X_n = γ] = P[X = γ]ⁿ. For the right-hand side, since
  D(Q‖P) < ∞ ⇒ Q ≪ P ⇒ Q(X ≤ γ) = 1, the only possibility for E_Q[X] ≥ γ is that
  Q(X = γ) = 1, i.e., Q = δ_γ. Then inf_{E_Q[X]≥γ} D(Q‖P) = log (1/P(X = γ)).
• Case II: P[X > γ] > 0. Since P[Σ X_k ≥ nγ] ≥ P[Σ X_k > nγ], from (15.9) we know that

      limsup_{n→∞} (1/n) log ( 1 / P[(1/n) Σ_{k=1}^n X_k ≥ γ] ) ≤ inf_{Q: E_Q[X]>γ} D(Q‖P).

  We next show that in this case

      inf_{Q: E_Q[X]>γ} D(Q‖P) = inf_{Q: E_Q[X]≥γ} D(Q‖P).                           (15.17)

  Indeed, let P̃ = P_{X|X>γ}, which is well defined since P[X > γ] > 0. For any Q such that E_Q[X] ≥
  γ, the mixture Q̃ = ϵ̄Q + ϵP̃ satisfies E_{Q̃}[X] > γ. Then by convexity, D(Q̃‖P) ≤ ϵ̄D(Q‖P) + ϵD(P̃‖P) =
  ϵ̄D(Q‖P) + ϵ log (1/P[X > γ]). Sending ϵ → 0, we conclude the proof of (15.17).

Remark 15.3 Note that the upper bound (15.11) also holds for independent non-identically
distributed X_i. Indeed, we only need to replace the step (15.14) with D(Q_{X^n}‖P_{X^n}) ≥
Σ_{i=1}^n D(Q_{X_i}‖P_{X_i}) ≥ n D(Q̄‖P̄), where P̄ = (1/n) Σ_{i=1}^n P_{X_i}. This yields a bound (15.11) with P
replaced by P̄ in the right-hand side.

Example 15.1 (Poisson-Binomial tails) Consider X which is a sum of n independent


Bernoulli random variables so that E[X] = np. The distribution of X is known as Poisson-Binomial
[330, 413], including Bin(n, p) as a special case. Applying Theorem 15.9 (or the Remark 15.3),
we get the following tail bounds on X:
    P[X ≥ k] ≤ exp{−n d(k/n‖p)},   k/n > p,                                          (15.18)
    P[X ≤ k] ≤ exp{−n d(k/n‖p)},   k/n < p,                                          (15.19)

where for (15.18) we used the fact that min_{Q: E_Q[X]≥k/n} D(Q‖Ber(p)) = min_{q≥k/n} d(q‖p) = d(k/n‖p)
and similarly for (15.19). These bounds, in turn, can be used to derive various famous estimates:


• Multiplicative deviation from the mean (Bennett’s inequality): We have

P[X ≥ u E[X]] ≤ exp{− E[X]f(u)} ∀u > 1 ,


P[X ≤ u E[X]] ≤ exp{− E[X]f(u)} ∀0 ≤ u < 1 ,

where f(u) ≜ u log u − (u − 1) log e ≥ 0. These follow from (15.18)-(15.19) via the following
useful estimate:

d(upkp) ≥ pf(u) ∀p ∈ [0, 1], u ∈ [0, 1/p] (15.20)

Indeed, consider the elementary inequality


    x log(x/y) ≥ (x − y) log e
for all x, y ∈ [0, 1] (since the difference between the left and right side is minimized over y at
y = x). Using x = 1 − up and y = 1 − p establishes (15.20).
• Bernstein’s inequality:
      P[X > np + t] ≤ e^{−t²/(2(t+np))}   ∀t > 0.

  This follows from the previous bound for u > 1 by bounding f(u)/log e = ∫_1^u ((u − x)/x) dx ≥
  (1/u) ∫_1^u (u − x) dx = (u − 1)²/(2u).
• Okamoto’s inequality: For all 0 < p < 1 and t > 0,
      P[√X − √(np) ≥ t] ≤ e^{−t²},                                                   (15.21)
      P[√X − √(np) ≤ −t] ≤ e^{−t²}.                                                  (15.22)

  These simply follow from the inequality between KL divergence and Hellinger distance
  in (7.33). Indeed, we get d(x‖p) ≥ H²(Ber(x), Ber(p)) ≥ (√x − √p)². Plugging x = (√(np) + t)²/n
  into (15.18)-(15.19) we obtain the result. We note that [316, Theorem 3] shows a stronger bound
  of e^{−2t²} in (15.21).

Remarkably, the bounds in (15.21) and (15.22) do not depend on n or p. This is due to the
variance-stabilizing effect of the square-root transformation for binomials: Var(√X) is at most a
constant for all n, p. In addition, √X − √(np) = (X − np)/(√X + √(np)) is of a self-normalizing form: the
denominator is on par with the standard deviation of the numerator. For more on self-normalizing sums,
see [69, Problem 12.2].
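A quick numerical comparison of (15.18) with the Bennett and Okamoto bounds for a Bin(n, p) tail (our own sketch, assuming scipy is available for the exact tail; the numbers n = 100, p = 0.3, k = 45 are hypothetical):

    import math
    from scipy.stats import binom

    n, p, k = 100, 0.3, 45

    def d(a, b):                                  # binary divergence in nats
        return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

    exact = binom.sf(k - 1, n, p)                 # P[X >= k]
    chernoff = math.exp(-n * d(k / n, p))         # (15.18)
    u = k / (n * p)
    bennett = math.exp(-n * p * (u * math.log(u) - (u - 1)))
    okamoto = math.exp(-(math.sqrt(k) - math.sqrt(n * p)) ** 2)   # (15.21)
    print(exact, chernoff, bennett, okamoto)      # exact <= chernoff <= bennett; exact <= okamoto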

15.3 Information projection


The results of Theorem 15.9 motivate us to study the following general information projection
problem: Let E be a convex set of distributions on some abstract space Ω, then for the distribution
P on Ω, we want

    inf_{Q∈E} D(Q‖P).


Denote the minimizing distribution Q by Q∗ . The next result shows that intuitively the “line”
between P and optimal Q∗ is “orthogonal” to E (cf. Figure 15.3).

[Figure 15.3 here: the convex set E inside the space of distributions on X, the point P outside it, and its projection Q* onto E.]

Figure 15.3 Illustration of information projection and the Pythagorean theorem.

Theorem 15.10 Let E be a convex set of distributions. If there exists Q∗ ∈ E such that

D(Q*‖P) = min_{Q∈E} D(Q‖P) < ∞, then ∀Q ∈ E

    D(Q‖P) ≥ D(Q‖Q*) + D(Q*‖P).

Proof. If D(QkP) = ∞, then there is nothing to prove. So we assume that D(QkP) < ∞, which
also implies that D(Q∗ kP) < ∞. For λ ∈ [0, 1], form the convex combination Q(λ) = λ̄Q∗ +λQ ∈
E . Since Q∗ is the minimizer of D(QkP), then
    0 ≤ (d/dλ) D(Q^{(λ)}‖P) |_{λ=0} = D(Q‖P) − D(Q‖Q*) − D(Q*‖P).
The rigorous analysis requires an argument for interchanging derivatives and integrals (via domi-
nated convergence theorem) and is similar to the proof of Proposition 2.20. The details are in [114,
Theorem 2.2].
Remark 15.4 If we view the picture above in the Euclidean setting, the “triangle” formed by
P, Q∗ and Q (for Q∗ , Q in a convex set, P outside the set) is always obtuse, and is a right triangle
only when the convex set has a “flat face”. In this sense, the divergence is similar to the squared
Euclidean distance, and the above theorem is sometimes known as a “Pythagorean” theorem.
The relevant set E of Q’s that we will focus next is the “half-space” of distributions E = {Q :
EQ [X] ≥ γ}, where X : Ω → R is some fixed function (random variable). This is justified by rela-
tion with the large-deviations exponent in Theorem 15.9. First, we solve this I-projection problem
explicitly.

Theorem 15.11 Given a distribution P on Ω and X : Ω → R suppose Assumption 15.1 holds.


We denote
A = inf ψX′ = essinf X ≜ sup{a : X ≥ a P-a.s.} (15.23)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-304


i i

304

B = sup ψX′ = esssup X ≜ inf{b : X ≤ b P-a.s.} (15.24)

1 The information projection problem over E = {Q : E_Q[X] ≥ γ} has solution

    min_{Q: E_Q[X]≥γ} D(Q‖P) =  0,                   γ < E_P[X]
                                ψ*_P(γ),             E_P[X] ≤ γ < B
                                log (1/P(X = B)),    γ = B                          (15.25)
                                +∞,                  γ > B

                             = ψ*_P(γ) 1{γ ≥ E_P[X]}.                               (15.26)

2 Whenever the minimum is finite, the minimizing distribution is unique and equal to the tilting
of P along X, namely2

dPλ = exp{λX − ψ(λ)} · dP (15.27)

3 For all γ ∈ [EP [X], B) we have

    min_{E_Q[X]≥γ} D(Q‖P) = inf_{E_Q[X]>γ} D(Q‖P) = min_{E_Q[X]=γ} D(Q‖P).

Remark 15.5 Both Theorem 15.9 and Theorem 15.11 are stated for the right tail where the
sample mean exceeds the population mean. For the left tail, simply apply these results to −X_i to obtain,
for γ < E[X],

    lim_{n→∞} (1/n) log ( 1 / P[(1/n) Σ_{k=1}^n X_k < γ] ) = inf_{Q: E_Q[X]<γ} D(Q‖P) = ψ*_X(γ).

In other words, the large deviations exponent is still given by the rate function (15.5) except that
the optimal tilting parameter λ is negative.

Proof. We first prove (15.25).

• First case: Take Q = P.


• Fourth case: If E_Q[X] > B, then Q[X ≥ B + ϵ] > 0 for some ϵ > 0, but P[X ≥ B + ϵ] = 0, since
  P(X ≤ B) = 1, by Theorem 15.3(e). Hence Q ̸≪ P ⇒ D(Q‖P) = ∞.
• Third case: If P(X = B) = 0, then X < B a.s. under P, and Q ̸≪ P for any Q s.t. E_Q[X] ≥ B.
  Then the minimum is ∞. Now assume P(X = B) > 0. Since D(Q‖P) < ∞ ⇒ Q ≪ P ⇒
  Q(X ≤ B) = 1, the only possibility for E_Q[X] ≥ B is that Q(X = B) = 1, i.e., Q = δ_B.
  Then D(Q‖P) = log (1/P(X = B)).
• Second case: Fix EP [X] ≤ γ < B, and find the unique λ such that ψX′ (λ) = γ = EPλ [X] where
dPλ = exp(λX − ψX (λ))dP. This corresponds to tilting P far enough to the right to increase its

2
Note that unlike the setting of Theorems 15.1 and 15.9 here P and Pλ are measures on an abstract space Ω, not necessarily
on the real line.


mean from EP X to γ , in particular λ ≥ 0. Moreover, ψX∗ (γ) = λγ − ψX (λ). Take any Q such
that E_Q[X] ≥ γ, then

    D(Q‖P) = E_Q[ log ( (dQ/dP_λ) · (dP_λ/dP) ) ]                                    (15.28)
            = D(Q‖P_λ) + E_Q[ log (dP_λ/dP) ]
            = D(Q‖P_λ) + E_Q[λX − ψ_X(λ)]
            ≥ D(Q‖P_λ) + λγ − ψ_X(λ)
            = D(Q‖P_λ) + ψ*_X(γ)
            ≥ ψ*_X(γ),                                                               (15.29)

where the last inequality holds with equality if and only if Q = Pλ . In addition, this shows
the minimizer is unique, proving the second claim. Note that even in the corner case of γ = B
(assuming P(X = B) > 0) the minimizer is a point mass Q = δB , which is also a tilted measure
(P∞ ), since Pλ → δB as λ → ∞, cf. Theorem 15.8(d).

An alternative version of the solution, given by expression (15.26), follows from Theorem 15.6.
For the third claim, notice that there is nothing to prove for γ < EP [X], while for γ ≥ EP [X] we
have just shown

    ψ*_X(γ) = min_{Q: E_Q[X]≥γ} D(Q‖P),

while from the next corollary we have

    inf_{Q: E_Q[X]>γ} D(Q‖P) = inf_{γ′>γ} ψ*_X(γ′).

The final step is to notice that ψX∗ is increasing and continuous by Theorem 15.6, and hence the
right-hand side infimum equals ψX∗ (γ). The case of minQ:EQ [X]=γ is handled similarly.

Corollary 15.12 For any Q with EQ [X] ∈ (A, B), there exists a unique λ ∈ R such that the
tilted distribution Pλ satisfies

EPλ [X] = EQ [X]


D(Pλ kP) ≤ D(QkP)

and furthermore the gap in the last inequality equals D(QkPλ ) = D(QkP) − D(Pλ kP).

Proof. Proceed as in the proof of Theorem 15.11, and find the unique λ s.t. EPλ [X] = ψX′ (λ) =
EQ [X]. Then D(Pλ kP) = ψX∗ (EQ [X]) = λEQ [X] − ψX (λ). Repeat the steps (15.28)-(15.29)
obtaining D(QkP) = D(QkPλ ) + D(Pλ kP).

For any Q the previous result allows us to find a tilted measure Pλ that has the same mean as
Q yet smaller (or equal) divergence distance to P. We will see that the same can be done under
multiple linear constraints (Section 15.6*).
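Solving the I-projection onto {Q : E_Q[X] ≥ γ} amounts to the scalar equation ψ′(λ) = γ; a Python sketch (our own illustration, assuming scipy and a hypothetical three-point P):

    import numpy as np
    from scipy.optimize import brentq

    xs = np.array([0.0, 1.0, 2.0])                # baseline P (hypothetical)
    ps = np.array([0.5, 0.3, 0.2])
    gamma = 1.2                                   # target mean, E_P[X] = 0.7 < gamma < B = 2

    def tilted_mean(lmb):                         # psi'(lambda) = E_{P_lambda}[X]
        w = ps * np.exp(lmb * xs)
        return float(np.sum(w * xs) / np.sum(w))

    lam = brentq(lambda l: tilted_mean(l) - gamma, -50, 50)   # solve psi'(lambda) = gamma
    q = ps * np.exp(lam * xs); q /= q.sum()                   # the minimizer P_lambda
    value = float(np.sum(q * np.log(q / ps)))                 # = D(P_lambda||P) = psi*(gamma)
    print(lam, q, value)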


15.4 I-projection and KL geodesics


The following Figure 15.4 describes many properties of information projections, where we fix
some baseline measure P. Then

[Figure 15.4 here: the space of distributions on R, sliced into sets {Q : E_Q[X] = γ}; the one-parameter family {P_λ} starting at P (λ = 0) crosses the slices between γ = A and γ = B, meeting the slice E_Q[X] = γ at the projection Q* = P_λ at divergence D(P_λ‖P) = ψ*(γ); distributions in slices with γ < A or γ > B satisfy Q ̸≪ P.]

Figure 15.4 Illustration of information projections, tilting, and rate function.

• Each set {Q : EQ [X] = γ} is a slice of P(R), the space of probability distributions on R. As γ


varies from −∞ to +∞ the union of these slices fills the entire space of distributions with finite
first moment.
• When γ < A or γ > B, any distribution Q inside the slice satisfies Q ̸≪ P.
• As γ varies between A and B the slices fill out the space of all {Q : Q ≪ P}. By Corollary 15.12,
inside each slice there is one special distribution Pλ of the tilted form (15.6) that minimizes the
divergence D(QkP) to P.
• The set of Pλ ’s trace out a curve in the space of distributions. The “geodesic” distance from P to
Pλ is measured by ψ ∗ (γ) = D(Pλ kP). This set of distributions is the one-parameter exponential
family in Definition 15.7.

The key observation here is that the curve of this one-parameter family {Pλ : λ ∈ R} intersects
each γ -slice E = {Q : EQ [X] = γ} “orthogonally” at the minimizing Q∗ ∈ E , and the distance
from P to Q* is given by ψ*(γ). To see this, note that applying Theorem 15.10 to the convex set
E gives us D(QkP) ≥ D(QkQ∗ ) + D(Q∗ kP). Now thanks to Corollary 15.12, we in fact have an
equality D(QkP) = D(QkQ∗ ) + D(Q∗ kP) and Q∗ = Pλ for some tilted measure.
Let us give an intuitive (non-rigorous) justification for calling the curve {Pμ , μ ∈ [0, λ]}
geodesic connecting P = P0 to Pλ . Suppose there existed another curve {Qμ } connecting P to
P_λ and minimizing KL distance. Then the expectation E_{Q_μ}[X] should continuously change from
EP [X] to EPλ [X]. Now take any intermediate value γ ′ of the expectation EQμ [X]. We know that on
the slice {Q : EQ [X] = γ ′ } the closest to P element is Pμ′ for some μ′ ∈ [0, λ]. Thus, we could
shorten the distance by connecting P to Pμ′ instead of Qμ .


Our treatment above is specific to distributions on R. How do we find a geodesic between two
arbitrary distributions P̃ and Q̃ on an abstract measurable space X ? To find the answer we notice
that the “intrinsic” definition of the geodesic between P and P_λ above can be given as follows:

    dP_μ/dP = (1/Z(μ)) (dP_λ/dP)^{μ/λ},

where Z(μ) = exp{ψ(μ) − (μ/λ)ψ(λ)} is a normalization constant. Correspondingly, we define the
geodesic between P̃ and Q̃ as a parametric family {P̃_μ, μ ∈ [0, 1]} given by

    dP̃_μ/dP̃ ≜ (1/Z̃(μ)) (dQ̃/dP̃)^μ,                                                   (15.30)

where the normalizing constant Z̃(μ) = exp{(μ − 1) D_μ(Q̃‖P̃)} is given in terms of Rényi
divergence. See also Exercise III.25.
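On a finite alphabet the curve (15.30) is just a normalized geometric interpolation; a minimal Python sketch (our own illustration with hypothetical P̃, Q̃, assuming numpy):

    import numpy as np

    def geodesic(P, Q, mu):
        """The family (15.30): density proportional to P^(1-mu) * Q^mu, mu in [0, 1]."""
        P, Q = np.asarray(P, float), np.asarray(Q, float)
        w = P ** (1 - mu) * Q ** mu
        return w / w.sum()          # the normalizer equals exp{(mu-1) D_mu(Q||P)}

    P = np.array([0.5, 0.3, 0.2])
    Q = np.array([0.2, 0.2, 0.6])
    for mu in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(mu, geodesic(P, Q, mu))   # interpolates from P (mu = 0) to Q (mu = 1)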
Formal justification of (15.30) as a true geodesic in the sense of differential geometry was
given by Cencov in [85, Theorem 12.3] for the case of finite underlying space X . His argument
was the following. To enable discussion of geodesics one needs to equip the space P([k]) with a
connection (or parallel transport map). It is natural to require the connection to be equivariant (or
commute) with respect to some maps P([k]) → P([k′ ]). Cencov lists (a) permutations of elements
(k = k′ ); (b) embedding of a distribution P ∈ P([k]) into a larger space by splitting atoms of [k]
(with specified ratio) into multiple atoms of [k′ ], so that k < k′ ; and (c) conditioning on an event
(k > k′). It turns out there is a one-parameter family of connections satisfying (a)-(b), including
the Riemannian (Levi-Civita) connection corresponding to the Fisher-Rao metric (2.35). However,
there is a unique connection satisfying all (a)-(c). It is different from the Fisher-Rao one and its
geodesics are exactly given by (15.30). Geodesically complete submanifolds in this metric are
simply the exponential families (Section 15.6*). For more on this exciting area, see Cencov [85]
and Amari [17].

15.5 Sanov’s theorem


A corollary of the WLLN is that the empirical distribution P̂ of n iid observations drawn from a
distribution P converges weakly to P itself. The following theorem due to Sanov [370] quantifies
the large-deviations behavior of this convergence.

Theorem 15.13 (Sanov’s


P
Theorem) Given X1 , . . . , Xn i.i.d.
∼ P on X , denote the empirical
n
distribution by P̂ = 1n j=1 δXj . Let E be a convex set of distributions. Then under regularity
conditions on E and P,
 
P[P̂ ∈ E] = exp −n min D(QkP) + o(n) .
Q∈E

Examples of regularity conditions in the above theorem include: (a) X is finite, P is fully sup-
ported on X , and E is closed with non-empty interior: see Exercise III.23 for a full proof in this


case; (b) X is a Polish space and infQ∈int(E) D(QkP) = infQ∈cl(E) D(QkP): this is the content
of [120, Theorem 6.2.10]. The reference [120] contains full details about various other versions
and extensions of Sanov’s theorem to infinite-dimensional settings.

15.6* Information projection with multiple constraints


We have so far considered the example of a single inequality E[X] ≥ γ. However, the entire theory
can be extended to accommodate multiple constraints. Let P be a fixed distribution on some space
X and let ϕ1 , . . . , ϕd : X → R be arbitrary functions, which we will stack together into a vector-
valued function ϕ : X → Rd . In this section we discuss solution of the following I-projection
problem, known as I-projection on a hyperplane:

F(γ) ≜ inf{D(QkP) : EQ [ϕ(X)] = γ} , γ ∈ Rd . (15.31)

This problem arises in statistical physics, the Gibbs variational principle, exponential families, and
many other fields. Note that taking P uniform corresponds to max-entropy problems.
In the case of d = 1 we have seen that whenever the value of the minimization is finite, the solution
Q* can be sought inside a single-parameter family of tilted versions of P, cf. (15.27). For this more
general case of d > 1 we define tilted measures as

Pλ (dx) ≜ exp{λ⊤ ϕ(x) − ψ(λ)}P(dx) , λ ∈ Rd

where the multi-dimensional log MGF of P (with respect to ϕ) is defined as

    ψ(λ) ≜ log E_{X∼P}[exp{λ^⊤ ϕ(X)}].                                               (15.32)

In order to discuss the solution of (15.31) we first make a simple observation analogous to
Corollary 15.12:

Proposition 15.14 If there exists λ such that ψ(λ) < ∞ and EX∼Pλ [ϕ(X)] = γ , then the
unique minimizer of (15.31) is Pλ and for any Q with EQ [ϕ(X)] = γ we have

D(QkP) = D(QkPλ ) + D(Pλ kP) . (15.33)

⊤ ⊤
Proof. Since log dPdP = λ ϕ(x) − ψ(λ) is finite everywhere we have that D(Pλ kP) = λ γ −
λ

ψ(λ) < ∞ and hence the solution of (15.31) is finite. The fact that Pλ is the unique minimizer
follows from the identity (15.33) that we are to prove next. Take Q as in the statement and suppose
that either D(QkP) or D(QkPλ ) is finite (otherwise there is nothing to prove). Since Pλ ≪ P this
implies that Q ≪ P and so let us denote fQ = dQ/dP. From (2.11) we see that

D(QkP) − D(QkPλ ) = EQ [log (dPλ /dP)] = EQ [λ⊤ ϕ(X) − ψ(λ)] = λ⊤ γ − ψ(λ) = D(Pλ kP) ,

establishing the claim.


Unfortunately, Proposition 15.14 is far from being able to completely resolve the prob-
lem (15.31) since it does not explain for what values γ ∈ Rd of the constraints it is possible
to find a required tilting Pλ . For d = 1 we had a very simple characterization of the set of values
that the means of Pλ ’s can achieve. Specifically, Theorem 15.8 showed (under Assumption 15.1)

{EX∼Pλ [ϕ(X)] : λ ∈ R} = (A, B) ,

where A, B are the boundaries of the support of ϕ. To obtain a similar characterization for the
case of d > 1, we let P̃ be the probability distribution on Rd of ϕ(X) when X ∼ P, i.e. P̃ is the
push-forward of P along ϕ. The analog of (A, B) is then played by the following concept:

Definition 15.15 (Convex support) Let P̃ be a probability measure on Rd . We recall that the
support supp(P̃) is defined as the intersection of all closed sets of full measure. The convex support
of P̃ is defined as the intersection of all closed convex sets with full measure:

csupp(P̃) ≜ ∩{S : P̃[S] = 1, S is closed and convex} .

It is clear that csupp is itself a closed convex set. Furthermore, it can be obtained by taking the
convex hull of supp(P̃) followed by the topological closure cl(·), i.e.

csupp(P̃) = cl(co(supp(P̃))) . (15.34)

(Indeed, csupp(P̃) ⊂ cl(co(suppP̃)) since the set on the right is convex and closed. On the other
hand, for any closed half-space H ⊃ csupp(P̃) of full measure, i.e. P̃[H] = 1 we must have
supp(P̃) ⊂ H. Taking convex hull and then closure of both sides yields cl(co(supp(P̃))) ⊂ H.
Taking the intersection over all such H shows that cl(co(suppP̃)) ⊂ csupp(P̃) as well.)
We are now ready to state the characterization of when I-projection is solved by a tilted measure.

Theorem 15.16 (I-projection on hyperplane) Suppose P and ϕ satisfy the following two
assumptions: (a) The d + 1 functions (1, ϕ1 , . . . , ϕd ) are linearly independent P-a.s. and (b) the
log MGF ψ(λ) is finite for all λ ∈ Rd . Then

1 If there exists any Q such that EQ [ϕ] = γ and D(QkP) < ∞, we must have γ ∈ csupp(P̃).
2 There is a solution λ to EPλ [ϕ] = γ if and only if γ ∈ int(csupp(P̃)).

Corollary 15.17 Whenever γ ∈ int(csupp(P̃)) the I-projection problem (15.31) is solved by


Pλ for some λ and the identity (15.33) holds. Furthermore, F(γ) = ψ ∗ (γ) and λ = ∇F(γ).

Remark 15.6 Assumption (b) of Theorem 15.16 can be relaxed to requiring only the domain
of the log MGF to be an open set (see [85, Theorem 23.1] or [77, Theorem 3.6].) Applying Theo-
rem 15.16, whenever γ ∈ int csupp(P̃) the I-projection can be sought in the tilted family Pλ and
only in such a case. If γ ∉ csupp(P̃) then the I-projection is trivially impossible and every Q with
the given expectation yields D(QkP) = ∞. When γ ∈ ∂ csupp(P̃) it could be that the I-projection
(i.e. the minimizer of (15.31)) exists, is unique and yields a finite divergence, but the minimizer is
not given by the λ-tilting of P. It could also be that every feasible Q yields D(QkP) = ∞.


As a concrete example, consider X ∼ P = N (0, 1) and ϕ(x) = (x, x²). Then csupp(P̃) =
{(γ1 , γ2 ) ∈ R² : γ2 ≥ γ1²}, consisting of all valid values of first and second moments (satisfy-
ing Cauchy-Schwarz) of distributions on R. The solution to this I-projection problem is as follows
(a small numerical sketch is given after the list).

• γ2 > γ1² : the optimal Q is N (γ1 , γ2 − γ1²), which is a tilted version of P along ϕ.
• γ2 = γ1² : the only feasible Q is δγ1 , which results in D(QkP) = ∞.
• γ2 < γ1² : there is no feasible Q.
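Here is the promised sketch (illustrative values of γ1 , γ2 chosen arbitrarily): it checks numerically that the dual value F(γ) = sup_λ ⟨λ, γ⟩ − ψ(λ) from Corollary 15.17 agrees with D(N (γ1 , γ2 − γ1²) k N (0, 1)), using the closed-form ψ(λ) for P = N (0, 1) and ϕ(x) = (x, x²).

```python
import numpy as np

# Sketch: I-projection of N(0,1) under E[X] = g1, E[X^2] = g2 (interior case g2 > g1^2).
def kl_gauss(m, v):
    """D(N(m, v) || N(0, 1)) in nats."""
    return 0.5 * (v + m**2 - 1 - np.log(v))

g1, g2 = 0.7, 1.5                       # must satisfy g2 > g1**2
claimed = kl_gauss(g1, g2 - g1**2)      # value at the claimed minimizer

# psi(l1, l2) = log E[exp(l1 X + l2 X^2)] for X ~ N(0,1), finite only for l2 < 1/2
l1 = np.linspace(-3, 3, 601)
l2 = np.linspace(-3, 0.49, 601)
L1, L2 = np.meshgrid(l1, l2)
psi = -0.5 * np.log(1 - 2 * L2) + L1**2 / (2 * (1 - 2 * L2))
dual = L1 * g1 + L2 * g2 - psi          # objective of the Legendre transform

print("grid approximation of F(gamma):", dual.max())
print("D(N(g1, g2-g1^2) || N(0,1))   :", claimed)
```

The two printed numbers agree up to the grid resolution, which is consistent with the tilted-family characterization above.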

Before giving the proof of the theorem we recall some standard and easy facts about
exponential families of which Pλ is an example. In this context ϕ is called a vector of statistics
and λ is the natural parameter. Note that all Pλ ∼ P are mutually absolutely continuous and hence
we have from the linear independence assumption:
CovX∼Pλ [ϕ(X)] ≻ 0 (15.35)
i.e. the covariance matrix is (strictly) positive definite. Similar to Theorem 15.3 we can show that
λ 7→ ψ(λ) is a convex, infinitely differentiable function [77]. We want to study the map from
natural parameter λ to the mean parameter μ:
λ 7→ μ(λ) ≜ EX∼Pλ [ϕ(X)] ,
Specifically, we will show that the image μ(Rd ) = int csupp(P̃). To that end note that, similar to
Theorem 15.8(b) and (c), the first two derivatives give moments of ϕ as follows:
EX∼Pλ [ϕ(X)] = ∇ψ(λ) , CovX∼Pλ [ϕ(X)] = Hess ψ(λ) log e . (15.36)
Together with (15.35) we see that then ψ is strictly convex and hence for any λ1 , λ2 we have the
strict monotonicity of ∇ψ , i.e.
(λ1 − λ2 )T (∇ψ(λ1 ) − ∇ψ(λ2 )) > 0 . (15.37)
Additionally, from (15.35) we obtain that Jacobian of the map λ 7→ μ(λ) equals det Hess ψ >
0. Thus by the inverse function theorem the image μ(Rd ) is an open set in Rd and there is an
infinitely differentiable inverse μ 7→ λ = μ−1 ( μ) defined on this set. Hence, the family Pλ can be
equivalently reparameterized by μ’s. What is non-trivial is that the image μ(Rd ) is convex and in
fact coincides with int csupp(P̃).
Proof of Theorem 15.16. Throughout the proof we denote C = csupp(P̃), Co = int(csupp(P̃)).
Suppose there is a Q ≪ P with t = EQ [ϕ(X)] ∉ C. Then there is a (separating hyperplane)
b ∈ Rd and c ∈ R such that b⊤ t < c ≤ b⊤ p for any p ∈ C. Since P[ϕ(X) ∈ C] = 1 we conclude
that Q[b⊤ ϕ(X) ≥ c] = 1. But then this contradicts the fact that EQ [b⊤ ϕ(X)] < c. This shows the
first claim.
Next, we show that for any λ we have μ(λ) = EPλ [ϕ] ∈ Co . Indeed, by the previous paragraph
we know μ(λ) ∈ C. On the other hand, as we discussed the map λ → μ(λ) is smooth, one-to-one,
with smooth inverse. Hence the image of a small ball around λ is open and hence μ(λ) ∈ int(C) =
Co .


Finally, we prove the main implication that for any γ ∈ Co there must exist a λ such that
μ(λ) = γ . To that end, consider the unconstrained minimization problem

inf_λ ψ(λ) − λ⊤ γ . (15.38)

If we can show that the minimum is achieved at some λ∗ , then from the first-order optimality
conditions we conclude the desired ∇ψ(λ∗ ) = γ . Since the objective function is continuous, it is
sufficient to show that the minimization without loss of generality can be restricted to a compact
ball {kλk ≤ R} for some large R.
To that end, we first notice that if γ ∈ Co then for some ϵ > 0 we must have

cϵ ≜ inf_{v:∥v∥=1} P[v⊤ (ϕ(X) − γ) > ϵ] > 0 . (15.39)

Indeed, suppose this is not the case. Then for any ϵ > 0 there is a sequence vk s.t.

P[vk⊤ (ϕ(X) − γ) > ϵ] → 0 .

Now by compactness of the sphere, vk → ṽϵ without loss of generality and thus we have for every
ϵ some ṽϵ such that

P[ṽϵ⊤ (ϕ(X) − γ) > ϵ] = 0 .

Again, by compactness there must exist convergent subsequence ṽϵ → v∗ and ϵ → 0 such that

P[v∗⊤ (ϕ(X) − γ) > 0] = 0 .

Thus, supp(P̃) ⊂ {x : v∗⊤ ϕ(x) ≤ v∗⊤ γ} and hence γ cannot be an interior point of C = csupp(P̃).
Given (15.39) we obtain the following estimate, where we denote v = λ/∥λ∥ :

exp{ψ(λ) − λ⊤ γ} = EP [exp{λ⊤ (ϕ(X) − γ)}]


≥ EP [exp{λ⊤ (ϕ(X) − γ)}1{v⊤ (ϕ(X) − γ) > ϵ}]
≥ cϵ exp{ϵkλk}

Thus, returning to the minimization problem (15.38) we see that the objective function satisfies
a lower bound

ψ(λ) − λ⊤ γ ≥ log cϵ + ϵkλk .

Then it is clear that restricting the minimization to a sufficiently large ball {kλk ≤ R} is without
loss of generality. As we explained this completes the proof.

Example 15.2 (Sinkhorn’s problem) As an application of Theorem 15.16, consider a joint


distribution PX,Y on finite X × Y . Our goal is to solve a coupling problem:

min{D(QX,Y kPX,Y ) : QX = VX , QY = VY } ,


where the marginals VX and VY are given. As we discussed in Section 5.6, Sinkhorn identified
an elegant iterative algorithm that converges to the minimizer. Here, we can apply our general
I-projection theory to show that minimizer has the form
Q∗X,Y (x, y) = A(x)PX,Y (x, y)B(y) . (15.40)
Specifically, let us assume PX,Y (x, y) > 0 and consider |X | + |Y| functions ϕa (x, y) = 1{x = a}
and ϕb (x, y) = 1{y = b}, a ∈ X , b ∈ Y . They are linearly independent. The set csupp(P̃) =
P(X ) × P(Y) in this case corresponds to all marginal distributions. Thus, whenever VX , VY have
no zeros they belong to int(csupp(P̃)) and the solution to the I-projection problem is a tilted version
of PX,Y which is precisely of the form (15.40). In this case, it turns out that I-projection exists also
on the boundary and even when PX,Y is allowed to have zeros but these cases are outside the scope
of Theorem 15.16 and need to be treated differently, see [114].
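The alternating-scaling structure behind (15.40) can be sketched numerically as follows (an illustrative implementation with made-up marginals, not the text's formal algorithm statement): starting from a strictly positive PX,Y , alternately rescale rows and columns until the prescribed marginals VX and VY are matched; the limit has exactly the form A(x)PX,Y (x, y)B(y).

```python
import numpy as np

def sinkhorn(P, vx, vy, iters=500):
    """Illustrative Sinkhorn iteration: returns Q = diag(a) P diag(b)
    whose marginals match vx, vy (assumes P > 0 and vx, vy have no zeros)."""
    a = np.ones(P.shape[0])
    b = np.ones(P.shape[1])
    for _ in range(iters):
        a = vx / (P @ b)          # rescale rows to match vx
        b = vy / (P.T @ a)        # rescale columns to match vy
    return a[:, None] * P * b[None, :]

rng = np.random.default_rng(1)
P = rng.random((3, 4)); P /= P.sum()            # strictly positive joint pmf
vx = np.array([0.2, 0.5, 0.3])
vy = np.array([0.1, 0.4, 0.3, 0.2])
Q = sinkhorn(P, vx, vy)
print(np.allclose(Q.sum(1), vx), np.allclose(Q.sum(0), vy))
```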


16 Hypothesis testing: error exponents

In this chapter our goal is to determine the achievable region of the exponent pairs (E0 , E1 ) for
the Type-I and Type-II error probabilities in Chernoff’s regime when both exponents are strictly
positive. Our strategy is to apply the achievability and (strong) converse bounds from Chapter 14
in conjunction with the large deviations theory developed in Chapter 15. After characterizing the
full tradeoff we will discuss an adaptive setting of hypothesis testing where instead of committing
ahead of time to testing on the basis of n samples, one can decide adaptively whether to request
more samples or stop. We will find out that adaptivity greatly increases the region of achievable
error-exponents and will learn about the sequential probability ratio test (SPRT) of Wald. In the
closing sections we will discuss relation to more complicated settings in hypothesis testing: one
with composite hypotheses and one with communication constraints.

16.1 (E0 , E1 )-Tradeoff


Recall the setting of the Chernoff regime introduced in Section 14.6, where the goal is to design
tests satisfying
π 1|0 = 1 − α ≤ exp (−nE0 ) , π 0|1 = β ≤ exp (−nE1 ) .
To find the best tradeoff of E0 versus E1 we can define the following function
E∗1 (E0 ) ≜ sup{E1 : ∃n0 , ∀n ≥ n0 , ∃PZ|Xn s.t. α > 1 − exp (−nE0 ) , β < exp (−nE1 )}
         = lim inf_{n→∞} (1/n) log (1/β_{1−exp(−nE0 )} (Pn , Qn ))
This should be compared with Stein’s exponent in Definition 14.13.
Define

Tk = log (dQ/dP)(Xk ), k = 1, . . . , n,

which are iid copies of T = log (dQ/dP)(X). Then log (dQn /dPn )(Xn ) = Σ_{k=1}^n Tk , which is an iid sum under
both P and Q.
The log MGF of T under P (again assumed to be finite and also T is not a constant since P ≠ Q)
and the corresponding rate function are (cf. Definitions 15.2 and 15.5):

ψP (λ) = log EP [exp(λT)], ψP∗ (θ) = sup_{λ∈R} θλ − ψP (λ).


[Figure 16.1: plot of λ 7→ ψP (λ), vanishing at λ = 0 and λ = 1, with a supporting line of slope θ whose intercepts mark E0 = ψP∗ (θ) and E1 = ψP∗ (θ) − θ.]

Figure 16.1 Geometric interpretation of Theorem 16.1 relies on the properties of ψP (λ) and ψP∗ (θ). Note that
ψP (0) = ψP (1) = 0. Moreover, by Theorem 15.6, θ 7→ E0 (θ) is increasing, θ 7→ E1 (θ) is decreasing.

For discrete distributions, we have ψP (λ) = log Σ_x P(x)^{1−λ} Q(x)^λ ; in general, ψP (λ) =
log ∫ dμ (dP/dμ)^{1−λ} (dQ/dμ)^λ for some dominating measure μ.
Note that since ψP (0) = ψP (1) = 0, from the convexity of ψP (Theorem 15.3) we conclude
that ψP (λ) is finite on 0 ≤ λ ≤ 1. Furthermore, assuming P ≪ Q and Q ≪ P we also have that
λ 7→ ψP (λ) is continuous everywhere on [0, 1]. (The continuity on (0, 1) follows from convexity,
but for the boundary points we need more detailed arguments.) Although all results in this section
apply under the (milder) conditions of P ≪ Q and Q ≪ P, we will only present proofs under
the (stronger) condition that log MGF exists for all λ, following the convention of the previous
chapter. The following result determines the optimal (E0 , E1 )-tradeoff in a parametric form. For a
concrete example, see Exercise III.19 for testing two Gaussians.

Theorem 16.1 Assume P ≪ Q and Q ≪ P. Then

E0 (θ) = ψP∗ (θ), E1 (θ) = ψP∗ (θ) − θ, (16.1)

parametrized by −D(PkQ) ≤ θ ≤ D(QkP), characterizes the upper boundary of the region of all
achievable (E0 , E1 )-pairs. (See Figure 16.1 for an illustration.)

Remark 16.1 (Rényi divergence) In Definition 7.24 we defined Rényi divergences Dλ .


It turns out that Dλ ’s are intimately related to error exponents. Indeed, we have ψP (λ) =
(λ − 1)Dλ (QkP) = −λD1−λ (PkQ). This provides another proof of why ψP (λ) is negative
for λ between 0 and 1, and also recovers the slope at endpoints: ψP′ (0) = −D(PkQ) and
ψP′ (1) = D(QkP). See also Ex. I.39.
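As a numerical illustration (a sketch with arbitrarily chosen Bernoulli parameters, working in nats), the tradeoff curve (16.1) can be traced by evaluating ψP on a grid and forming its Legendre transform at θ = ψP′ (λ):

```python
import numpy as np

# Sketch: (E0, E1) tradeoff of Theorem 16.1 for P = Ber(1/3), Q = Ber(2/3), in nats.
p, q = 1 / 3, 2 / 3
lam = np.linspace(1e-6, 1 - 1e-6, 2001)

# psi_P(lambda) = log E_P[(dQ/dP)^lambda] = log sum_x P(x)^{1-lambda} Q(x)^lambda
psi = np.log(p**(1 - lam) * q**lam + (1 - p)**(1 - lam) * (1 - q)**lam)

theta = np.gradient(psi, lam)     # theta(lambda) = psi_P'(lambda)
E0 = lam * theta - psi            # psi_P^*(theta) at the maximizing lambda
E1 = E0 - theta                   # psi_P^*(theta) - theta, as in (16.1)

for k in range(0, len(lam), 500):
    print(f"lambda={lam[k]:.2f}  E0={E0[k]:.4f}  E1={E1[k]:.4f}")
```

As λ runs from 0 to 1, E0 sweeps from 0 to D(QkP) while E1 sweeps from D(PkQ) down to 0, matching the endpoints discussed above.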


Corollary 16.2 (Bayesian criterion) Fix a prior (π 0 , π 1 ) such that π 0 + π 1 = 1 and 0 <
π 0 < 1. Denote the optimal Bayesian (average) error probability by
P∗e (n) ≜ inf_{PZ|Xn } π 0 π 1|0 + π 1 π 0|1

with exponent

E ≜ lim_{n→∞} (1/n) log (1/P∗e (n)) .

Then

E = max_θ min(E0 (θ), E1 (θ)) = ψP∗ (0)

regardless of the prior, and

ψP∗ (0) = − inf_{λ∈[0,1]} ψP (λ) = − inf_{λ∈[0,1]} log ∫ (dP)^{1−λ} (dQ)^λ ≜ C(P, Q) (16.2)

is called the Chernoff exponent or Chernoff information.

Notice that from (14.19) we always have

log [1/(1 − H²(P, Q)/2)] ≤ C(P, Q) ≤ 2 log [1/(1 − H²(P, Q)/2)]

and thus for small H²(P, Q) we have C(P, Q) ≍ H²(P, Q).
Remark 16.2 (Bhattacharyya distance) There is an important special case in which Cher-
noff exponent simplifies. Instead of i.i.d. observations, consider independent, but not identically
distributed observations. Namely, suppose that two hypotheses correspond to two different strings
xn and x̃n over a finite alphabet X . The hypothesis tester observes Yn = (Y1 , . . . , Yn ) obtained by
applying one of the two strings to the input of the memoryless channel PY|X ; in other words, either
Yn ∼ ∏_{t=1}^n PY|X=xt or Yn ∼ ∏_{t=1}^n PY|X=x̃t . (The alphabet Y does not need to be finite, but we assume
this below.) Extending Corollary 16.2 it can be shown that in this case the optimal (average)
probability of error P∗e (xn , x̃n ) has (Chernoff) exponent1

E = − inf_{λ∈[0,1]} (1/n) Σ_{t=1}^n log Σ_{y∈Y} PY|X (y|xt )^λ PY|X (y|x̃t )^{1−λ} .

If |X | = 2 and if the compositions (types) of xn and x̃n are equal (!), the expression is invariant
under λ ↔ 1 − λ and thus from the convexity in λ we conclude that λ = 1/2 is optimal,2 yielding
E = (1/n) dB (xn , x̃n ), where

dB (xn , x̃n ) = − Σ_{t=1}^n log Σ_{y∈Y} √(PY|X (y|xt ) PY|X (y|x̃t )) (16.3)

1
In short, this is because the optimal tilting parameter λ does not need to be chosen differently for different values of
(xt , x̃t ).
2
For another example where λ = 1/2 achieves the optimum in the Chernoff information, see Exercise III.30.


is known as the Bhattacharyya distance between codewords xn and x̃n . (Compare with the Bhat-
tacharyya coefficient defined after (7.5).) Without the two assumptions stated, dB (·, ·) does not
necessarily give the optimal error exponent. We do, however, always have the bounds, see (14.19):
(1/4) exp (−2dB (xn , x̃n )) ≤ P∗e (xn , x̃n ) ≤ exp (−dB (xn , x̃n )) ,
where the upper bound becomes tighter when the joint composition of (xn , x̃n ) and that of (x̃n , xn )
are closer.
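For a BSC this distance has a particularly simple form: positions where the two codewords agree contribute zero, and each disagreement contributes −log(2√(δ(1 − δ))). The following short sketch (with illustrative codewords and crossover probability) evaluates (16.3) in that case:

```python
import numpy as np

def bhattacharyya_bsc(x, x_tilde, delta):
    """d_B(x^n, x~^n) from (16.3) for a BSC(delta), in nats.
    Only Hamming-disagreeing positions contribute, each -log(2*sqrt(delta*(1-delta)))."""
    x, x_tilde = np.asarray(x), np.asarray(x_tilde)
    d_hamming = int(np.sum(x != x_tilde))
    return -d_hamming * np.log(2 * np.sqrt(delta * (1 - delta)))

x       = [0, 1, 1, 0, 1, 0, 0, 1]
x_tilde = [0, 1, 0, 0, 1, 1, 0, 1]
print(bhattacharyya_bsc(x, x_tilde, delta=0.1))
```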
Pn
Proof of Theorem 16.1. The idea is to apply the large deviations theory to the iid sum k=1 Tk .
Specifically, let’s rewrite the achievability and converse bounds from Chapter 14 in terms of T:

• Achievability (Neyman-Pearson): Applying Theorem 14.11 with τ = −nθ, the LRT achieves
the following
" n # " n #
X X
π 1|0 = P T k ≥ nθ π 0|1 = Q T k < nθ (16.4)
k=1 k=1

• Converse (strong): Applying Theorem 14.10 with γ = exp (−nθ), any achievable π 1|0 and π 0|1
satisfy
" n #
X
π 1|0 + exp (−nθ) π 0|1 ≥ P T k ≥ nθ . (16.5)
k=1

For achievability, applying the nonasymptotic large deviations upper bound in Theorem 15.9
(and Theorem 15.11) to (16.4), we obtain that for any n,
" n #
X
π 1| 0 = P Tk ≥ nθ ≤ exp (−nψP∗ (θ)) , for θ ≥ EP T = −D(PkQ)
k=1
" #
Xn

π 0|1 = Q Tk < nθ ≤ exp −nψQ∗ (θ) , for θ ≤ EQ T = D(QkP)
k=1

Notice that by the definition of T = log dQ/dP we have

ψQ (λ) = log EQ [e^{λT} ] = log EP [e^{(λ+1)T} ] = ψP (λ + 1)
⇒ ψQ∗ (θ) = sup_{λ∈R} θλ − ψP (λ + 1) = ψP∗ (θ) − θ.

Thus the pair of exponents (E0 (θ), E1 (θ)) in (16.1) is achievable.


For converse, we aim to show that any achievable (E0 , E1 ) pair must lie below the curve
achieved by the above Neyman-Pearson test, namely (E0 (θ), E1 (θ)) parametrized by θ. Suppose
π 1|0 = exp (−nE0 ) and π 0|1 = exp (−nE1 ) is achievable. Combining the strong converse bound
(16.5) with the large deviations lower bound, we have: for any fixed θ ∈ [−D(PkQ), D(QkP)],

exp (−nE0 ) + exp (−nθ) exp (−nE1 ) ≥ exp (−nψP∗ (θ) + o(n))
⇒ min(E0 , E1 + θ) ≤ ψP∗ (θ)


⇒ either E0 ≤ ψP∗ (θ) or E1 ≤ ψP∗ (θ) − θ,

proving the desired result.

16.2 Equivalent forms of Theorem 16.1


Alternatively, the optimal (E0 , E1 )-tradeoff can be stated in the following equivalent forms:

Theorem 16.3 (a) The optimal exponents are given (parametrically) in terms of λ ∈ [0, 1] as
E0 = D(Pλ kP), E1 = D(Pλ kQ) (16.6)
where the distribution Pλ 3 is tilting of P along T given in (15.27), which moves from P0 = P
to P1 = Q as λ ranges from 0 to 1:
dPλ = (dP)1−λ (dQ)λ exp{−ψP (λ)}.
(b) Yet another characterization of the boundary is
E∗1 (E0 ) = min D(Q′ kQ) , 0 ≤ E0 ≤ D(QkP) (16.7)
Q′ :D(Q′ ∥P)≤E0

Remark 16.3 The interesting consequence of this point of view is that it also suggests how
typical error event looks like. Namely, consider an optimal hypothesis test achieving the pair of
exponents (E0 , E1 ). Then conditioned on the error event (under either P or Q) we have that the
empirical distribution of the sample will be close to Pλ . For example, if P = Bin(m, p) and Q =
Bin(m, q), then the typical error event will correspond to a sample whose empirical distribution
P̂n is approximately Bin(m, r) for some r = r(p, q, λ) ∈ (p, q), and not any other distribution on
{0, . . . , m}.
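The equivalence of the parametrizations (16.1) and (16.6) is easy to verify numerically (a sketch with arbitrary finite-alphabet P and Q, in nats): form the geometric mixture Pλ , compute θ = EPλ [T], and compare (D(Pλ kP), D(Pλ kQ)) with (ψP∗ (θ), ψP∗ (θ) − θ).

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])
T = np.log(Q / P)                          # T = log dQ/dP

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

for lam in (0.25, 0.5, 0.75):
    Plam = P**(1 - lam) * Q**lam           # unnormalized geometric mixture
    psi = np.log(Plam.sum())               # psi_P(lambda)
    Plam /= Plam.sum()                     # tilted distribution P_lambda
    theta = float(np.sum(Plam * T))        # theta = E_{P_lambda}[T]
    E0_legendre = lam * theta - psi        # psi_P^*(theta)
    print(f"lambda={lam}: D(Plam||P)={kl(Plam, P):.6f} vs {E0_legendre:.6f}, "
          f"D(Plam||Q)={kl(Plam, Q):.6f} vs {E0_legendre - theta:.6f}")
```

Both pairs of printed numbers coincide, as the proof below shows in general.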

Proof. The first part is verified trivially. Indeed, if we fix λ and let θ(λ) ≜ EPλ [T], then
from (15.8) we have
D(Pλ kP) = ψP∗ (θ) ,
whereas
   
D(Pλ kQ) = EPλ [log (dPλ /dQ)] = EPλ [log (dPλ /dP · dP/dQ)] = D(Pλ kP) − EPλ [T] = ψP∗ (θ) − θ .
Also from (15.7) we know that as λ ranges in [0, 1] the mean θ = EPλ [T] ranges from −D(PkQ)
to D(QkP).
To prove the second claim (16.7), the key observation is the following: Since Q is itself a tilting
of P along T (with λ = 1), the following two families of distributions
dPλ = exp{λT − ψP (λ)} · dP

3
This is called a geometric mixture of P and Q.


dQλ′ = exp{λ′ T − ψQ (λ′ )} · dQ

are in fact the same family with Qλ′ = Pλ′ +1 .


Now, suppose that Q∗ achieves the minimum in (16.7) and that Q∗ 6= Q, Q∗ 6= P (these cases
should be verified separately). Note that we have not shown that this minimum is achieved, but it
will be clear that our argument can be extended to the case when Q′n is a sequence achieving
the infimum. Then, on one hand, obviously

D(Q∗ kQ) = min D(Q′ kQ) ≤ D(PkQ)


Q′ :D(Q′ ∥P)≤E0

On the other hand, since E0 ≤ D(QkP) we also have

D(Q∗ kP) ≤ D(QkP) .

Therefore,
 
EQ∗ [T] = EQ∗ [log (dQ∗ /dP · dQ/dQ∗ )] = D(Q∗ kP) − D(Q∗ kQ) ∈ [−D(PkQ), D(QkP)] . (16.8)

Next, we have from Corollary 15.12 that there exists a unique Pλ with the following three
properties:4

EPλ [T] = EQ∗ [T]


D(Pλ kP) ≤ D(Q∗ kP)
D(Pλ kQ) ≤ D(Q∗ kQ)

Thus, we immediately conclude that minimization in (16.7) can be restricted to Q∗ belonging


to the family of tilted distributions {Pλ , λ ∈ R}. Furthermore, from (16.8) we also conclude
that λ ∈ [0, 1]. Hence, characterization of E∗1 (E0 ) given by (16.6) coincides with the one given
by (16.7).

Remark 16.4 A geometric interpretation of (16.7) is given in Figure 16.2: As λ increases from
0 to 1, or equivalently, θ increases from −D(PkQ) to D(QkP), the optimal distribution Pλ traverses
down the dotted path from P and Q. Note that there are many ways to interpolate between P and
Q, e.g., by taking their (arithmetic) mixture (1 − λ)P + λQ. In contrast, Pλ is a geometric mixture
of P and Q, and this special path is in essence a geodesic connecting P to Q and the exponents
E0 and E1 measures its respective distances to P and Q. Unlike Riemannian geometry, though,
here the sum of distances to the two endpoints from an intermediate Pλ actually varies along the
geodesic.

4
A subtlety: In Corollary 15.12 we ask EQ∗ [T] ∈ (A, B). But A, B – the essential range of T – depend on the distribution
under which the essential range is computed, cf. (15.23). Fortunately, we have Q  P and P  Q, so the essential range
is the same under both P and Q. And furthermore (16.8) implies that EQ∗ [T] ∈ (A, B).


[Figure 16.2: sketch in the space of distributions showing P, Q, the tilted distribution Pλ and the divergence ball {Q′ : D(Q′ kP) ≤ E0 } around P; the distances D(Pλ kP) and D(Pλ kQ) are marked on the E0 and E1 axes.]
Figure 16.2 Geometric interpretation of (16.7). Here the shaded circle represents {Q′ : D(Q′ kP) ≤ E0 }, the
KL divergence “ball” of radius E0 centered at P. The optimal E∗1 (E0 ) in (16.7) is given by the divergence from
Q to the closest element of this ball, attained by some tilted distribution Pλ . The tilted family Pλ is the
geodesic traversing from P to Q as λ increases from 0 to 1.

16.3* Sequential hypothesis testing

Review: Filtration and stopping time

• A sequence of nested σ -algebras F0 ⊂ F1 ⊂ F2 · · · ⊂ Fn · · · ⊂ F is called a


filtration of F .
• A random variable τ is called a stopping time of a filtration Fn if (a) τ is valued in
Z+ and (b) for every n ≥ 0 the event {τ ≤ n} ∈ Fn .
• The σ -algebra Fτ consists of all events E such that E ∩ {τ ≤ n} ∈ Fn for all n ≥ 0.
• When Fn = σ{X1 , . . . , Xn } the interpretation is that τ is a time that can be deter-
mined by causally observing the sequence Xj , and random variables measurable
with respect to Fτ are precisely those whose value can be determined on the basis
of knowing (X1 , . . . , Xτ ).
• Let Mn be a martingale adapted to Fn , i.e. Mn is Fn -measurable and E[Mn |Fk ] =
Mmin(n,k) . Then M̃n = Mmin(n,τ ) is also a martingale. If collection {Mn } is uniformly
integrable then
E[Mτ ] = E[M0 ] .
• For more details, see [84, Chapter V].


So far we have always been working with a fixed number of observations n. However, different
realizations of Xn are informative to different levels, i.e. under some realizations we are very certain
about declaring the true hypothesis, whereas some other realizations leave us more doubtful. In
the fixed n setting, the tester is forced to take a guess in the latter case. In the sequential setting,
pioneered by Wald [448], the tester is allowed to request more observations. We show in this
section that the optimal test in this setting is something known as sequential probability ratio test
(SPRT) [450]. It will also be shown that the resulting tradeoff between the exponents E0 and E1 is
much improved in the sequential setting.
We start with the concept of a sequential test. Informally, at each time t, upon receiving the
observation Xt , a sequential test either declares H0 , declares H1 , or requests one more observation.
The rigorous definition is as follows: a sequential hypothesis test consists of (a) a stopping time
τ with respect to the filtration {Fk , k ∈ Z+ }, where Fk ≜ σ{X1 , . . . , Xn } is generated by the first
n observations; and (b) a random variable (decision) Z ∈ {0, 1} measurable with respect to Fτ .
Each sequential test is associated with the following performance metrics:
α = P[Z = 0], β = Q [ Z = 0] (16.9)
l0 = EP [τ ], l1 = EQ [τ ] (16.10)
The easiest way to see why sequential tests may be dramatically superior to fixed-sample-size
tests is the following example: Consider P = ½δ0 + ½δ1 and Q = ½δ0 + ½δ−1 . Since P and Q are not mutually singular, we also
have that Pn and Qn are not mutually singular. Consequently, no finite-sample-size test can achieve zero error under both hypothe-
ses. However, an obvious sequential test (waiting for the first appearance of ±1) achieves zero error
probability with finite number of (two) observations in expectation under both hypotheses. This
advantage is also clear in terms of the achievable error exponents shown in Figure 16.3.
The following result is essentially due to [450], though there it was shown only for the special
case of E0 = D(QkP) and E1 = D(PkQ). The version below is from [339].

Theorem 16.4 Assume bounded LLR:5

|log (P(x)/Q(x))| ≤ c0 , ∀ x
where c0 is some positive constant. Call a pair of exponents (E0 , E1 ) achievable if there exists a
test with l0 , l1 → ∞ whose error probabilities satisfy
π 1|0 ≤ exp (−l0 E0 (1 + o(1))) , π 0|1 ≤ exp (−l1 E1 (1 + o(1)))
Then the set of achievable exponents must satisfy
E0 E1 ≤ D(PkQ)D(QkP).
Furthermore, any such (E0 , E1 ) is achieved by the sequential probability ratio test SPRT(A, B)
(A, B are large positive numbers) defined as follows:
τ = inf{n : Sn ≥ B or Sn ≤ −A}

5
This assumption is satisfied for example for a pair of fully supported discrete distributions on finite alphabets.


[Figure 16.3: (E0 , E1 ) error-exponent regions; the fixed-sample-size boundary meets the axes at D(QkP) and D(PkQ), and the sequential-test boundary lies strictly above it.]
Figure 16.3 Tradeoff between Type-I and Type-II error exponents. The bottom curve corresponds to optimal
tests with fixed sample size (Theorem 16.1) and the upper curve to optimal sequential tests (Theorem 16.4).


Z = 0 if Sτ ≥ B, and Z = 1 if Sτ ≤ −A,

where
Sn = Σ_{k=1}^n log (P(Xk )/Q(Xk ))

is the log likelihood ratio of the first n observations.

Remark 16.5 (Interpretation of SPRT) Under the usual setup of hypothesis testing, we
collect a sample of n iid observations, evaluate the LLR Sn , and compare it to the threshold to give
the optimal test. Under the sequential setup, {Sn : n ≥ 1} is a random walk, which has positive
(resp. negative) drift D(PkQ) (resp. −D(QkP)) under the null (resp. alternative)! SPRT simply
declares P if the random walk crosses the upper boundary B, or Q if the random walk crosses the
lower boundary −A. See Figure 16.4 for an illustration.
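A simulation sketch of SPRT(A, B) for two Bernoulli distributions (parameters chosen purely for illustration) estimates the error probabilities and the expected stopping times under both hypotheses and compares them with the bounds derived in the proof below:

```python
import numpy as np

def sprt_trial(p_true, p, q, A, B, rng, max_steps=100_000):
    """Run SPRT(A, B) on iid Ber(p_true) data; LLR increments are log(P/Q)
    with P = Ber(p), Q = Ber(q). Returns (decision Z, stopping time tau)."""
    llr_1 = np.log(p / q)               # increment when X_k = 1
    llr_0 = np.log((1 - p) / (1 - q))   # increment when X_k = 0
    S = 0.0
    for n in range(1, max_steps + 1):
        S += llr_1 if rng.random() < p_true else llr_0
        if S >= B:
            return 0, n                 # declare H0 (P)
        if S <= -A:
            return 1, n                 # declare H1 (Q)
    return (0 if S >= 0 else 1), max_steps

p, q, A, B = 0.3, 0.7, 8.0, 8.0
rng = np.random.default_rng(2)
runs = 5000
res_P = [sprt_trial(p, p, q, A, B, rng) for _ in range(runs)]
res_Q = [sprt_trial(q, p, q, A, B, rng) for _ in range(runs)]
print("pi_{1|0} ~", np.mean([z for z, _ in res_P]), " E_P[tau] ~", np.mean([n for _, n in res_P]))
print("pi_{0|1} ~", 1 - np.mean([z for z, _ in res_Q]), " E_Q[tau] ~", np.mean([n for _, n in res_Q]))
print("bounds: pi_{1|0} <= e^{-A} =", np.exp(-A), ", pi_{0|1} <= e^{-B} =", np.exp(-B))
```

With these parameters the expected stopping time is roughly B/D(PkQ), and essentially no errors are observed, consistent with the e^{−A}, e^{−B} bounds.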

Proof. As preparation we show two useful identities:

• For any stopping time with EP [τ ] < ∞ we have

EP [Sτ ] = EP [τ ]D(PkQ) (16.11)


[Figure 16.4: a sample path of the LLR process Sn between the thresholds B and −A, stopped at time τ .]
Figure 16.4 Illustration of the SPRT(A, B) test. Here, at the stopping time τ , the LLR process Sn reaches B
before reaching −A and the decision is Z = 1.

and similarly, if EQ [τ ] < ∞ then

EQ [Sτ ] = − EQ [τ ]D(QkP) .

To prove these, notice that

Mn = Sn − nD(PkQ)

is clearly a martingale w.r.t. Fn . Consequently,

M̃n ≜ Mmin(τ,n)

is also a martingale. Thus

E[M̃n ] = E[M̃0 ] = 0 ,

or, equivalently,

E[Smin(τ,n) ] = E[min(τ, n)]D(PkQ) . (16.12)

This holds for every n ≥ 0. From the boundedness assumption we have |Sn | ≤ nc0 and thus
|Smin(n,τ ) | ≤ τ c0 , implying that collection {Smin(n,τ ) : n ≥ 0} is uniformly integrable. Thus, we
can take n → ∞ in (16.12) and interchange expectation and limit safely to conclude (16.11).


• Let τ be a stopping time. Recall that a random variable R is a Radon-Nikodym derivative of P
  w.r.t. Q on a σ -algebra Fτ , denoted by dP|Fτ /dQ|Fτ , if

  EP [1E ] = EQ [R1E ] ∀E ∈ Fτ . (16.13)

We will show that it is in fact given by


dP|Fτ /dQ|Fτ = exp{Sτ } .
Indeed, what we need to verify is that (16.13) holds with R = exp{Sτ } and an arbitrary event
E ∈ Fτ , which we decompose as
1E = Σ_{n≥0} 1_{E∩{τ =n}} .

By the monotone convergence theorem applied to both sides of (16.13) it is then sufficient to
verify that for every n

EP [1_{E∩{τ =n}} ] = EQ [exp{Sτ }1_{E∩{τ =n}} ] . (16.14)

This, however, follows from the fact that E ∩ {τ = n} ∈ Fn and dP|Fn /dQ|Fn = exp{Sn } by the very
definition of Sn .

We now proceed to the proof. For achievability we apply (16.13) to infer

π 1|0 = P[Sτ ≤ −A] = EQ [exp{Sτ }1{Sτ ≤ −A}] ≤ e−A .

Next, we denote τ0 = inf{n : Sn ≥ B} and observe that τ ≤ τ0 , whereas the expectation of τ0 can
be bounded using (16.11) as:
EP [τ ] ≤ EP [τ0 ] = EP [Sτ0 ]/D(PkQ) ≤ (B + c0 )/D(PkQ) ,
where in the last step we used the boundedness assumption to infer Sτ0 ≤ B + c0 . Overall,
l0 = EP [τ ] ≤ EP [τ0 ] ≤ (B + c0 )/D(PkQ) .

Similarly, we can show π 0|1 ≤ e−B and l1 ≤ (A + c0 )/D(QkP). Now consider a pair of exponents E0 , E1
at the boundary, that is E0 E1 = D(PkQ)D(QkP). Let x = E0 /D(PkQ) = D(QkP)/E1 . Set A = xB and
let B → ∞. From the argument above B + c0 ≥ l0 D(PkQ), so π 1|0 ≤ e−A = e−xB ≤ e−l0 E0 (1+o(1)) , and
similarly π 0|1 ≤ e−l1 E1 (1+o(1)) .
Converse: Assume that (E0 , E1 ) is achievable for large l0 , l1 . Recall from Section 4.6* that
D(PFτ kQFτ ) denotes the divergence between P and Q when viewed as measures on σ -algebra
Fτ . We apply the data processing inequality for divergence to obtain:
d(P(Z = 1)kQ(Z = 1)) ≤ D(PFτ kQFτ ) = EP [Sτ ] = EP [τ ]D(PkQ) = l0 D(PkQ) , where the second equality uses (16.11).


Notice that for l0 E0 and l1 E1 large, we have d(P(Z = 1)kQ(Z = 1)) = l1 E1 (1 + o(1)), therefore
l1 E1 ≤ (1 + o(1))l0 D(PkQ). Similarly we can show that l0 E0 ≤ (1 + o(1))l1 D(QkP). Thus taking
ℓ0 , ℓ1 → ∞ we conclude

E0 E1 ≤ D(PkQ)D(QkP) .

16.4 Composite, robust, and goodness-of-fit hypothesis testing


In this chapter we have considered the setting of distinguishing between the two alternatives, under
either of which the data distribution was specified completely. There are multiple other settings
that have also been studied in the literature, which we briefly mention here for completeness.
The key departure is to replace the simple hypotheses that we started with in Chapter 14 with
composite ones. Namely, we postulate
i.i.d. i.i.d.
H0 : Xi ∼ P, P∈P vs H1 : Xi ∼ Q, Q ∈ Q,

where P and Q are two families of distributions. In this case for a given test Z = Z(X1 , . . . , Xn ) ∈
{0, 1} we define the two types of error as before, but taking worst-case choices over the
distribution:

1 − α = inf_{P∈P} P⊗n [Z = 0], β = sup_{Q∈Q} Q⊗n [Z = 0] .

Unlike testing simple hypotheses for which Neyman-Pearson’s test is optimal (Theorem 14.11), in
general there is no explicit description for the optimal test of composite hypotheses (cf. (32.28)).
The popular choice is a generalized likelihood-ratio test (GLRT) that proposes to threshold the
GLR
T(Xn ) = sup_{P∈P} P⊗n (Xn ) / sup_{Q∈Q} Q⊗n (Xn ) .
For examples and counterexamples of the optimality of GLRT in terms of error exponents, see,
e.g. [469].
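To make the GLR concrete, here is a sketch in a toy parametric family (the family, thresholds and parameters are illustrative choices, not taken from the text): testing a unit-variance Gaussian mean restricted to θ ≤ 0 against θ ≥ 1, where both suprema are attained at clipped sample means.

```python
import numpy as np

def log_glr_gaussian(x, theta0_max=0.0, theta1_min=1.0):
    """log T(X^n) for H0: N(theta, 1), theta <= theta0_max  vs
       H1: N(theta, 1), theta >= theta1_min (illustrative composite test)."""
    xbar, n = np.mean(x), len(x)
    th0 = min(xbar, theta0_max)        # MLE restricted to the null set
    th1 = max(xbar, theta1_min)        # MLE restricted to the alternative set
    # log sup_P P(x^n) - log sup_Q Q(x^n) reduces to a difference of squared gaps
    return 0.5 * n * ((xbar - th1)**2 - (xbar - th0)**2)

rng = np.random.default_rng(3)
x_null = rng.normal(-0.2, 1.0, size=50)
x_alt  = rng.normal( 1.3, 1.0, size=50)
print("log GLR under H0:", log_glr_gaussian(x_null))   # typically large and positive
print("log GLR under H1:", log_glr_gaussian(x_alt))    # typically large and negative
```

A GLRT then simply thresholds this statistic; as noted above, its error exponents are not always optimal.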
Sometimes the families P and Q are small balls (in some metric) surrounding the center dis-
tributions P and Q, respectively. In this case, testing P against Q is known as robust hypothesis
testing (since the test is robust to small deviations of the data distribution). There is a notable finite-
sample optimality result in this case due to Huber [221] – see Exercise III.31. Asymptotically, it
turns out that if P and Q are separated in the Hellinger distance, then the probability of error can
be made exponentially small: see Theorem 32.8.
Sometimes in the setting of composite testing the distance between P and Q is zero. This is
the case, for example, for the most famous setting of a Student t-test: P = {N (0, σ 2 ) : σ 2 > 0},
Q = {N ( μ, σ 2 ) : μ 6= 0, σ 2 > 0}. It is clear that in this case there is no way to construct a test with
α + β < 1, since the data distribution under H1 can be arbitrarily close to P0 . Here, thus, instead
of minimizing worst-case β , one tries to find a test statistic T(X1 , . . . , Xn ) which is a) pivotal in the
sense that its distribution under the H0 is (asymptotically) independent of the choice P0 ∈ P ; and


b) consistent, in the sense that T → ∞ as n → ∞ under any fixed Q ∈ Q. Optimality questions are
studied by minimizing β as a function of Q ∈ Q (known as the power curve). The uniform most
powerful tests are the gold standard in this area [277, Chapter 3], although besides a few classical
settings (such as the one above) their existence is unknown.
In other settings, known as the goodness-of-fit testing [277, Chapter 14], instead of relatively
low-complexity parametric families P and Q one is interested in a giant set Q of alternatives. For
i.i.d. i.i.d.
example, the simplest setting is to distinguish H0 : Xi ∼ P0 vs H1 : Xi ∼ Q, TV(P0 , Q) > δ . If
δ = 0, then in this case again the worst case α + β = 1 for any test and one may only ask for a
statistic T(Xn ) with a known distribution under H0 and T → ∞ for any Q in the alternative. For
δ > 0 the problem is known as nonparametric detection [225, 226] and related to that of property
testing [192].

16.5* Hypothesis testing with communication constraints


In this section we consider a variation of the hypothesis testing problem where determination of
the Stein’s exponent is still open, except for the special case resolved in [8]. Specifically, we still
consider the case of a pair of simple iid hypotheses as in (14.14) except this time the Y sample is
available to a statistician, but the X sample needs to be communicated from a remote location over
a (noiseless) rate-constrained link:
H0 : (X1 , Y1 ), . . . , (Xn , Yn ) i.i.d. ∼ PX,Y
H1 : (X1 , Y1 ), . . . , (Xn , Yn ) i.i.d. ∼ QX,Y , (16.15)
and the tester consists of an X-compressor W = f(Xn ) with W ∈ {0, 1}nR and a decision PZ|W,Yn .
(The illustration of the setting is given in Figure 16.5.) Our goal is to determine dependence of the
Stein exponent on the rate R, namely
Vϵ (R) ≜ sup{E : ∃n0 , ∀n ≥ n0 , ∃(f, PZ|Xn ) s.t. α > 1 − ϵ, β < exp (−nE)}.
Exponents E satisfying constraints inside the supremum are known as ϵ-achievable exponents.
The importance of this problem is that it highlighted new phenomena arising in dis-
tributed statistical problems. Although the problem is still open and the topic of characterizing
Stein’s exponent fell out of fashion, the tools that were developed for this problem (namely, the
strong data-processing inequalities) are important and found many uses in modern distributed
inference problems (see Chapter 33 and specifically Section 33.11 for more). We will discuss this
after we present the main result, for which we introduce a key new concept.

Definition 16.5 (FI -curve6 ) Given pair of random variables (X, Y) we define their FI curve
as
FI (t; PX,Y ) ≜ sup{I(U; Y) : I(U; X) ≤ t, U → X → Y} ,

6
This concept was introduced in [453], see also [136] and [345, Section 2.2] for the “PX -independent” version.


[Figure 16.5: block diagram — Xn enters the compressor f, producing W ∈ {0, 1}nR ; the tester observes (W, Yn ) and outputs Z ∈ {0, 1}.]
Figure 16.5 Illustration to the problem of hypothesis testing with communication constraints.

[Figure 16.6: FI -curve plotting I(Y; U) against I(X; U), with slope ηKL at the origin and saturating at I(X; Y) as I(X; U) approaches H(X).]
Figure 16.6 A typical FI -curve whose slope at zero is the SDPI constant ηKL .

supremum taken over all random variables U satisfying U → X → Y Markov relation.

Example 16.1 A typical FI -curve is shown in Figure 16.6. In general, computing FI -curves is
hard. An exception is the case of X ∼ Ber(1/2) and Y = BSCδ (X). In this case, applying MGL in
Exercise I.64 we get, in the notation of that exercise, that

FI (t) = log 2 − h(h−1 (log 2 − t) ∗ δ)

achieved by taking U ∼ Ber(1/2) and X = BSCp (U) with p chosen such that h(p) = log 2 − t.
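This FI -curve is easy to evaluate numerically. The sketch below works in bits (so log 2 becomes 1 and h, h−1 are in bits); the crossover probability and grid of t values are illustrative.

```python
import numpy as np

def h(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def h_inv(y, tol=1e-12):
    """Inverse of h on [0, 1/2], by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def F_I(t, delta):
    """F_I(t) = 1 - h(h^{-1}(1 - t) * delta) in bits, as in Example 16.1."""
    p = h_inv(1.0 - t)
    conv = p * (1 - delta) + (1 - p) * delta     # binary convolution p * delta
    return 1.0 - h(conv)

delta = 0.11
for t in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(t, round(F_I(t, delta), 4))
```

At t = 0 the curve vanishes, and as t → 1 (= H(X) in bits) it saturates at 1 − h(δ) = I(X; Y).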
From the DPI (3.12) we know that FI (t) ≤ t and the FI -curve strengthens the DPI to I(U; Y) ≤
FI (I(U; X)) whenever U → X → Y. (Note that the roles of X and Y are not symmetric.) In general,
it is not clear how to compute this function; nevertheless, in Exercise III.32 we show that if X
takes values over a finite alphabet then it is sufficient to consider |U| ≤ |X | + 1, and hence FI is
a value of a finite-dimensional convex program. Other properties of the FI -curve and applications
are found in Exercise III.32 and III.33.
The main result of this section is the following.


Theorem 16.6 (Ahslwede-Csiszár [8]) Suppose X, Y are ranging over the finite alphabets
and QX,Y = PX PY (independence testing problem). Then Vϵ (R) = FI (R) for all ϵ ∈ (0, 1).

The setting describes the problem of detecting correlation between two sequences. When R = 0
the testing problem is impossible since the marginal distribution of Y is the same under both
hypotheses. If only a very small communication rate is available then the sample size required
will be very large (Stein exponent small).

Proof. Let us start with an upper bound. Fix a compressor W and notice that for any ϵ-achievable
exponent E by Theorem 14.8 we have

d(1 − ϵk exp{−nE}) ≤ D(PW,Yn kQW,Yn ) .

But under conditions of the theorem we have QW,Yn = PW PYn and thus we obtain as in (14.18):

(1 − ϵ)nE ≤ D(PW,Yn kPW PYn ) + log 2 = I(W; Yn ) + log 2 .

Now, from looking at Figure 16.5 we see that W → Xn → Yn and then from Exercise III.32
(tensorization) we know that
I(W; Yn ) ≤ nFI ((1/n) I(W; Xn )) ≤ nFI (R) .
Thus, we have shown that for all sufficiently large n
E ≤ FI (R)/(1 − ϵ) + (log 2)/n .
This demonstrates that lim supϵ→0 Vϵ ≤ FI (R). For a stronger claim of Vϵ ≤ FI (R), i.e. the strong
converse, see [8].
Now, for the constructive part, consider any n1 and any compressed random variable W1 =
f1 (Xn1 ) with W1 ∈ {0, 1}n1 R . Given blocklength n we can repeatedly send W1 by compress-
ing each n1 chunk independently (for a total of n/n1 “frames”). Then the decompressor will
observe n/n1 iid copies of W1 and also of Yn1 vector-observations. Note that D(PW1 ,Yn1 kQW1 ,Yn1 ) =
I(W1 ; Yn1 ) as above. Thus, by Theorem 14.14 we should be able to achieve α ≥ 1 − ϵ and
β ≤ exp{−n/n1 I(W1 ; Yn1 )}.
Therefore, we obtained the following lower bound (after optimizing over the choice of W1 and
blocklength n1 that we replace by more convenient n again):
Vϵ ≥ F̃I (R) ≜ sup_{n,W1 } {(1/n) I(W1 ; Yn ) : W1 → Xn → Yn , W1 ∈ {0, 1}nR } . (16.16)

This looks very similar to the definition of (tensorized) FI -curve except that the constraint is on
the cardinality of W1 instead of the I(W1 ; Xn ). It turns out that the two quantities coincide, i.e.
F̃I (R) = FI (R). We only need to show F̃I (R) ≥ FI (R) for that.
To that end, consider any U → X → Y and R > I(U; X). We apply covering lemma (Corol-
lary 25.6) and Markov lemma (Proposition 25.7), where we set An = Xn , Bn = Un , Xn = Yn .


Overall, we get that as n → ∞ there exists an encoder W1 = f1 (Xn ), W1 ∈ {0, 1}nR such that
W1 → Xn → Yn and
I(W1 ; Yn ) ≥ nI(U; Y) + o(n) .
By optimizing the choice of U this proves F̃I (R+) ≥ FI (R). Since (as we showed above) F̃I (R+) ≤
FI (R+) and FI (R+) = FI (R) (Exercise III.32), we conclude that F̃I (R) = FI (R).
The theorem shown above has interesting implications for certain tasks in modern machine learn-
ing. Consider the situation where the sample size n is gigantic but the communication budget (or
memory bandwidth) is constrained so that we can at most deliver k bits from the X terminal to the
tester. Then the rate R = k/n ≪ 1 and the error probability β of an optimal test is roughly given as

β ≈ 2^{−nFI (k/n)} ≈ 2^{−kF′I (0)} ,

where we used the fact that FI (k/n) ≈ (k/n) F′I (0). We see that the error probability decays expo-
nentially with the number of communicated bits, not the sample size. In many ways, Theorem 16.6
foreshadowed various results in the last decade on distributed inference. We will get back to this
topic in Chapter 33 dedicated to strong data processing inequalities (SDPI). There is a simple
relation that connects the classical results (this Section) with the modern approach via SDPIs (in
Chapter 33): the slope F′I (0) is precisely the SDPI constant:
F′I (0) = ηKL (PX , PY|X ) ,
see (33.14) for the definition of ηKL . In essence, SDPIs are just linearized versions of FI -curves as
illustrated in Figure 16.6.


Exercises for Part III

III.1 Let P0 and P1 be distributions on X . Recall that the region of achievable pairs (P0 [Z =
0], P1 [Z = 0]) via randomized tests PZ|X : X → {0, 1} is denoted
R(P0 , P1 ) ≜ ∪_{PZ|X } {(P0 [Z = 0], P1 [Z = 0])} ⊆ [0, 1]2 .

(a) Let PY|X : X → Y be a Markov kernel, which maps Pj to Qj (that is, Pj → Qj under PY|X ), j =
0, 1. Compare the regions R(P0 , P1 ) and R(Q0 , Q1 ). What does this say about βα (P0 , P1 )
vs βα (Q0 , Q1 )?
(b*) Prove that R(P0 , P1 ) ⊃ R(Q0 , Q1 ) implies existence of some PY|X mapping P0 to Q0 and
P1 to Q1 . In other words, inclusion of R is equivalent to degradation or Blackwell order
(see Definition 33.15).
Comment: This is the most general form of data processing inequality, of which all the other
ones (divergence, mutual information, f-divergence, total-variation, Rényi-divergence, etc) are
corollaries.
III.2 Consider the following binary hypothesis testing (BHT) problem. Under both hypotheses X and
Y are uniform on {0, 1}. However, under H0 , X and Y are independent, while under H1 :

P1 [X 6= Y] = δ < 1/2 .

For this problem:


(a) Draw the region R(P0 , P1 ) of achievable pairs of values (P0 [Z = 0], P1 [Z = 0]) for all
randomized tests PZ|XY : X × Y → {0, 1}.
(b) Find a sufficient statistic and define an equivalent BHT problem on a smaller alphabet.
(c) Let PFi , i ∈ {0, 1} be the distribution of log (P0 (X)/P1 (X)) under X ∼ Pi . What are the distributions
PF0 , PF1 . How can you read them off of R(P0 , P1 )?
(d) Compute the minimal probability of error in the Bayesian setup, when

P [ H1 ] = 1 − P [ H0 ] = π 1 .

Identify the corresponding point on R(P0 , P1 ).


(e) Compute the minimal probability of error in the non-Bayesian minimax setup:

min max{P0 [decide H1 ], P1 [decide H0 ]} ,

where the min is over the tests and the max is between the two numbers in the braces.
Identify the corresponding point on R(P0 , P1 ).
III.3 Consider distributions P and Q on [0, 3] with densities given in Figure 16.7.


[Figure 16.7: plots of the densities p and q on [0, 3]; the value 1/3 and the endpoint 3 are marked on the axes.]
Figure 16.7 Densities of P and Q in Exercise III.3.

(a) Compute the expression of βα (P, Q).


(b) Plot the region R(P, Q).
(c) Specify the tests achieving βα for α = 5/6 and α = 1/2, respectively.
III.4 Let P be the uniform distribution on the interval [0, 1]. Let Q be the equal mixture of the uniform
distribution on [0, 1/2] and the point mass at 1.
(a) Compute the region R(P, Q).
(b) Explicitly describe the tests that achieve the optimal boundary βα (P, Q).
III.5 (a) Consider the binary hypothesis test:
H0 : X ∼ N (0, 1) vs H1 : X ∼ N ( μ, 1).
Compute and plot the Neyman-Pearson region R(N (0, 1), N ( μ, 1)).
(b) Now suppose we have n samples and we want to test
i.i.d. i.i.d.
H0 : X1 , . . . , Xn ∼ N (0, 1) vs H1 : X1 , . . . , Xn ∼ N ( μ, 1).
Compute the Neyman-Pearson region R(N (0, 1)n , N ( μ, 1)n ). As the sample size increases,
describe how the region evolves and provides an interpretation. Hint: Consider sufficient
statistics.
III.6 (a) Consider the binary hypothesis test:
H0 : X ∼ Exp(1) vs H1 : X ∼ Exp(λ),
where Exp(λ) has density λe−λx 1 {x ≥ 0}. Compute the region R(Exp(1), Exp(λ)). What
is the optimal test for achieving βα ?
(b) Now suppose we have n samples and we want to test
i.i.d. i.i.d.
H0 : X1 , . . . , Xn ∼ Exp(1) vs H1 : X1 , . . . , Xn ∼ Exp(λ).
Compute the region R(Exp(1)n , Exp(λ)n ). As the sample size increases, describe how the
region evolves and provides an interpretation. What is the optimal test for achieving βα ?
III.7 (a) Prove that TV(P, Q) = sup0≤α≤1 {α − βα (P, Q)}. Explain how to read the value TV(P, Q)
from the region R(P, Q). Does it equal half the maximal vertical segment in R(P, Q)?
(b) (Bayesian criteria) Fix a prior π = (π 0 , π 1 ) such that π 0 + π 1 = 1 and 0 < π 0 < 1.
Denote the optimal average error probability by Pe ≜ infPZ|X π 0 π 1|0 + π 1 π 0|1 . Prove that if
π = (1/2, 1/2), then Pe = (1/2)(1 − TV(P, Q)). Find the optimal test.


(c) Find the optimal test for general prior π (not necessarily equiprobable).
(d) Show that it is sufficient to focus on deterministic tests in order to minimize the Bayesian
error probability.
III.8 The function α 7→ βα (P, Q) is monotone and thus by Lebesgue’s theorem possesses a derivative
βα′ ≜ (d/dα) βα (P, Q)

almost everywhere on [0, 1]. Prove
D(PkQ) = − ∫_0^1 log βα′ dα . (III.1)

III.9 Let P, Q be distributions such that for all α ∈ [0, 1] we have

βα (P, Q) ≜ min Q[ Z = 0] = α 2 .
PZ|X :P[Z=0]≥α

Find TV(P, Q), D(PkQ) and D(QkP).


III.10 We have shown that for testing iid products and any fixed ϵ ∈ (0, 1):

log β1−ϵ (Pn , Qn ) = −nD(PkQ) + o(n) , n → ∞,

which is equivalent to Stein’s lemma (Theorem 14.14). Show furthermore that assuming
V(PkQ) < ∞ we have
log β1−ϵ (Pn , Qn ) = −nD(PkQ) + √(nV(PkQ)) Q−1 (ϵ) + o(√n) , (III.2)

where Q−1 (·) is the functional inverse of Q(x) = ∫_x^∞ (1/√(2π)) e^{−t²/2} dt and

V(PkQ) ≜ VarP [log (dP/dQ)] .

III.11 (Likelihood-ratio trick) Given two distributions P and Q on X let us generate iid samples (Xi , Yi )
as follows: first Yi ∼ Ber(1/2) and then if Yi = 0 we sample Xi ∼ Q and otherwise Xi ∼ P. We
next train a classifier to minimize the cross-entropy loss:
p̂∗ = argmin_{p̂:X →[0,1]} (1/n) Σ_{i=1}^n [Yi log (1/p̂(Xi )) + (1 − Yi ) log (1/(1 − p̂(Xi )))] .

Show that (1 − p̂∗ (x))/p̂∗ (x) → (dQ/dP)(x) as n → ∞. This trick is used in machine learning to approximate
dP/dQ for complicated high-dimensional distributions.
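A sketch of the trick in a toy one-dimensional setting (the distributions, features and optimizer are hypothetical illustrative choices; the classifier is a plain logistic regression fit by gradient descent): the fitted posterior p̂(x) ≈ P[Y = 1|X = x] yields (1 − p̂)/p̂ as an estimate of dQ/dP.

```python
import numpy as np

# Illustration: P = N(0,1), Q = N(1,1); the true ratio is dQ/dP(x) = exp(x - 1/2).
rng = np.random.default_rng(4)
n = 20_000
y = rng.integers(0, 2, size=n)                       # labels: Y=1 -> X ~ P, Y=0 -> X ~ Q
x = np.where(y == 1, rng.normal(0, 1, n), rng.normal(1, 1, n))

feats = np.column_stack([np.ones(n), x, x**2])       # bias, x, x^2
w = np.zeros(3)
for _ in range(5000):                                # gradient descent on cross-entropy loss
    p_hat = 1 / (1 + np.exp(-feats @ w))
    w -= 0.2 * feats.T @ (p_hat - y) / n

def ratio_estimate(xq):
    f = np.array([1.0, xq, xq**2])
    p = 1 / (1 + np.exp(-f @ w))
    return (1 - p) / p                               # estimate of dQ/dP(xq)

for xq in (-1.0, 0.0, 1.0, 2.0):
    print(xq, round(ratio_estimate(xq), 3), round(np.exp(xq - 0.5), 3))
```

The second and third printed columns (estimated vs true ratio) roughly agree, up to sampling and optimization error.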
III.12 Prove
min_{QYn ∈F } D(QYn k ∏_{j=1}^n PYj ) = min_{QYn ∈F } Σ_{j=1}^n D(QYj kPYj )

whenever the constraint set F is marginals-based, i.e.:

QYn ∈ F ⇐⇒ (QY1 , . . . , QYn ) ∈ F ′


for some F ′ .
Conclude that in the case when PYj = P and
   
 X
n 
F = QYn : EQ  f(Yj ) ≥ nγ
 
j=1

we have the single-letterization:

min D(QYn kPn ) = n min D(QY kP) ,


QYn QY :EQY [f(Y)]≥γ

of which (15.15) is a special case. Hint: Convexity of divergence.


III.13 Fix a distribution PX on a finite set X and a channel PY|X : X → Y . Consider a sequence xn
with composition PX , i.e.

#{j : xj = a} = nPX (a) ± 1 ∀a ∈ X .

Let Yn be generated according to P⊗n_{Y|X} (·|xn ). Show that

log P[Σ_{j=1}^n f(Xj , Yj ) ≥ nγ | Xn = xn ] = −n min_{Q:EQ [f(X,Y)]≥γ} D(QY|X kPY|X |PX ) + o(n) ,

where the minimum is over all QX,Y with QX = PX .


III.14 (Large deviations on the boundary) Recall that A = infλ ψX′ (λ) and B = supλ ψX′ (λ) were shown
to be the boundaries of the support of PX , e.g. B = sup{b : P[X > b] > 0}.
(a) Show by example that ψX∗ (B) can be finite or infinite.
(b) Show by example that asymptotic behavior of
" n #
1X
P Xi ≥ B , (III.3)
n
i=1

can be quite different depending on the distribution of PX .


(c) Compare Ψ∗X (B) and the exponent in (III.3) for your examples. Prove a general statement
(you can assume that ψX (λ) < ∞ for all λ ∈ R).
III.15 (Simple radar) A binary signal detector is being built. When the signal A is being sent a sequence
of i.i.d. Xj ∼ N (−1, 1) is received. When the signal B is being sent a sequence of Xj ∼ N (+1, 1)
is being received. Given a very large number n of observations (X1 , . . . , Xn ) propose a detector
for deciding between A and B. Consider two separate design cases:
(a) Misdetecting A for B or B for A are equally bad.
(b) Misdetecting A for B in 10−3 cases is ok, but the opposite should be avoided as much as
possible.
III.16 (Small-ball probability I.) Let Z ∼ N (0, Id ). Without using the χ2 -density, show the following
bound on P [kZk2 ≤ ϵ].



(a) Using the Chernoff bound, show that for all ϵ² ≤ d,

P [kZk2 ≤ ϵ] ≤ (eϵ²/d)^{d/2} e^{−ϵ²/2} .
(b) Prove the lower bound
P [kZk2 ≤ ϵ] ≥ (ϵ²/(2πd))^{d/2} e^{−ϵ²/2} .
(c) Extend the results to Z ∼ N (0, Σ).
See Exercise V.30 for an example in infinite dimensions.
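A quick Monte Carlo sanity check of these two bounds (a sketch with arbitrary d and ϵ; the upper bound is used in the Chernoff form (eϵ²/d)^{d/2} e^{−ϵ²/2} obtained by optimizing the MGF of kZk²):

```python
import numpy as np

d, eps = 10, 2.0                        # illustrative values with eps**2 <= d
rng = np.random.default_rng(5)
z = rng.normal(size=(500_000, d))
p_mc = np.mean(np.linalg.norm(z, axis=1) <= eps)   # Monte Carlo estimate of P[||Z|| <= eps]

upper = (np.e * eps**2 / d) ** (d / 2) * np.exp(-eps**2 / 2)
lower = (eps**2 / (2 * np.pi * d)) ** (d / 2) * np.exp(-eps**2 / 2)
print(f"lower {lower:.3e} <= MC {p_mc:.3e} <= upper {upper:.3e}")
```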
III.17 Consider the hypothesis testing problem:
i.i.d.
H0 : X1 , . . . , Xn ∼ P = Ber(1/3) ,
i.i.d.
H1 : X1 , . . . , Xn ∼ Q = Ber(2/3) .

(a) Compute the Stein exponent.


(b) Compute the tradeoff region E of achievable error-exponent pairs (E0 , E1 ) using the charac-
terization E0 (θ) = ψP∗ (θ) and E1 (θ) = ψP∗ (θ) − θ. Express the optimal boundary in explicit
form (eliminate the parameter).
(c) Identify the divergence-minimizing geodesic P(λ) running from P to Q, λ ∈ [0, 1]. Verify
that (E0 , E1 ) = (D(P(λ) kP), D(P(λ) kQ)), 0 ≤ λ ≤ 1 gives the same tradeoff curve.
(d) Compute the Chernoff exponent.
III.18 Let γ(a, c) denote a Gamma distribution with shape parameter a and scale parameter c:
γ(a, c)(dx) = c (cx)^{a−1} e^{−cx} /Γ(a) dx .
Consider a hypothesis testing problem:

H0 : X1 , . . . , Xn -i.i.d. ∼ P0 = exp(1) , (III.4)


H1 : X1 , . . . , Xn -i.i.d. ∼ P1 = γ(a, c = 1) . (III.5)

Questions:
(a) Compute the Stein exponent
(b) For a = 3 draw the tradeoff region E of achievable error-exponent pairs (E0 , E1 ).
(c) Identify the divergence-minimizing geodesic Pλ running from P0 to P1 , λ ∈ [0, 1].
Hint: To simplify calculations try differentiating in u the following identity
∫_0^∞ x^u e^{−x} dx = Γ(u + 1) .

III.19 Consider the hypothesis testing problem:


i.i.d.
H0 : X1 , . . . , Xn ∼ P = N (0, 1) ,
i.i.d.
H1 : X1 , . . . , Xn ∼ Q = N ( μ, 1) .


(a) Show that the Stein exponent is V = (log e/2) μ² .
(b) Show that the optimal tradeoff between achievable error-exponent pairs (E0 , E1 ) is given
by
E1 = (log e/2) (μ − √(2E0 /log e))² , 0 ≤ E0 ≤ (log e/2) μ² ,
(c) Show that the Chernoff exponent is C(P, Q) = (log e/8) μ² .
III.20 Let Uj be iid uniform on [0, 1]. Prove/disprove that
 
P[Σ_{j=1}^n Uj ≥ nγ] , γ = 1/(e − 1) ≈ 0.582

converges to zero exponentially fast as n → ∞. If it does then find the exponent. Repeat with
γ = 0.5.
III.21 Let Xj be i.i.d. exponential with unit mean. Since the log MGF ψX (λ) ≜ log E[exp{λX}] does
not exist for λ > 1, the large deviations result in Theorem 15.1
P[Σ_{j=1}^n Xj ≥ nγ] = exp{−nψX∗ (γ) + o(n)} (III.6)

does not apply. Show (III.6) directly via the following steps:
(a) Apply Chernoff argument directly to prove an upper bound.
(b) Fix an arbitrary c > 0 and prove
P[Σ_{j=1}^n Xj ≥ nγ] ≥ P[Σ_{j=1}^n (Xj ∧ c) ≥ nγ] . (III.7)

(c) Apply the results shown in Chapter 15 to investigate the asymptotics of the right-hand side
of (III.7).
(d) Conclude the proof of (III.6) by taking c → ∞.
III.22 (Hoeffding’s lemma) In this exercise we prove Hoeffding’s lemma (stated after Definition 4.15)
and derive Hoeffding’s concentration inequality. Let X ∈ [−1, 1] with E[X] = 0.
(a) Show that the log MGF ψX (λ) satisfies ψX (0) = ψX′ (0) = 0 and 0 ≤ ψX′′ (λ) ≤ 1. (Hint:
Apply Theorem 15.8(c) and the fact that the variance of any distribution supported on
[−1, 1] is at most 1.)
(b) By Taylor expansion, show that ψX (λ) ≤ λ2 /2.
(c) Applying Theorem 15.1, prove Hoeffding’s inequality: Let Xi ’s be iid copies of X. For any
γ > 0, P[Σ_{i=1}^n Xi ≥ nγ] ≤ exp(−nγ²/2).

III.23 (Sanov’s theorem for discrete X ) Let X be a finite set. Let E be a set of probability distributions
on X with non-empty interior. Let Xn = (X1 , . . . , Xn ) be iid drawn from some distribution P
fully supported on X and let π n denote the empirical distribution, i.e., π n = (1/n) Σ_{i=1}^n δXi . Our
goal is to show that
E ≜ lim_{n→∞} (1/n) log (1/P(π n ∈ E)) = inf_{Q∈E} D(QkP). (III.8)


(a) We first assume that E is convex. Define the following set of joint distributions En ≜ {QXn :
QXi ∈ E, i = 1, . . . , n}. Show that
inf_{QXn ∈En } D(QXn kPXn ) = n inf_{Q∈E} D(QkP),

where PXn = Pn .
(b) Consider the conditional distribution P̃Xn = PXn |π n ∈E . Show that P̃Xn ∈ En .
(c) Prove the following nonasymptotic upper bound: for any convex E ,
 
P(π n ∈ E) ≤ exp − n inf D(QkP) , ∀n.
Q∈E

(d) Show that for any E :


 
P(π n ∈ E) ≤ exp − n inf D(QkP) + o(n) , ∀ n.
Q∈E

(Hint: For each ϵ > 0, cover E by N TV balls of radius ϵ where N = N(ϵ) is finite;
cf. Theorem 27.3. Applying the previous part and the union bound.)
(e) For any Q in the interior of E , show that
P(π n ∈ E) ≥ exp(−nD(QkP) + o(n)), n → ∞.
(Hint: Use data processing as in the proof of the large deviations Theorem 15.9.)
(f) Conclude (III.8) by applying the continuity of divergence on finite alphabet (Proposi-
tion 4.8).
III.24 (Error exponents of data compression) Let Xn be iid according to P on a finite alphabet X .
Let ϵ∗ (Xn , nR) denote the minimal probability of error achieved by fixed-length compressors
and decompressors for Xn of compression rate R (cf. Definition 11.1). We know that if R <
H(P) then ϵ∗ (Xn , nR) → 0. Here we show it converges to zero exponentially fast and find the
expopnent.
(a) For any sequence xn , denote by P̂xn its empirical distribution and by H(P̂xn ) its empirical
entropy, i.e., the entropy of the empirical distribution. For example, for the binary sequence
xn = (010110), the empirical distribution is Ber(1/2) and the empirical entropy is 1 bit.
For each R > 0, define the set T = {xn : H(P̂xn ) <R}. Show that |T| ≤ exp( nR)(n + 1)|X | .

(b) Show that for any R > H(P), ϵ∗ (Xn , nR) ≤ exp(−n inf_{Q:H(Q)>R} D(QkP)). Specify the
achievability scheme. (Hint: Use Sanov’s theorem in Exercise III.23.)


(c) Prove that the above exponent is asymptotically optimal:
lim sup_{n→∞} (1/n) log (1/ϵ∗ (Xn , nR)) ≤ inf_{Q:H(Q)>R} D(QkP).
(Hint: Recall that any compression scheme for memoryless source with rate below the
entropy fails with probability tending to one. Use data processing inequality.)
III.25 (Local KL geodesics) Recall from Section 2.6.1* the local expansion D(λQ + (1 − λ)PkP) =
(λ²/2) χ²(QkP) + o(λ²), provided that χ²(QkP) < ∞. Instead of the linear mixture, consider the
geometric mixture Pλ ∝ Q^λ P^{1−λ} , which we argued should be called a “KL geodesic” in (15.30).
Show that


(a) D(Pλ kP) = −ψP (λ) + λψP′ (λ), where ψP (λ) = log EP [exp(λ log dQ/dP)].
(b) State the appropriate conditions to conclude D(Pλ kP) = (1/2)λ²ψP′′ (0) + o(λ²), where ψP′′ (0) =
VarP [log dQ/dP], which is clearly different from χ²(QkP) = VarP [dQ/dP].

III.26 Denote by N ( μ, σ ) the one-dimensional Gaussian distribution with mean μ and variance σ 2 .
2

Let a > 0. All logarithms below are natural.


(a) Show that
a2
min D(QkN (0, 1)) = .
Q:EQ [X]≥a 2
(b) Let X1 , . . . , Xn be drawn iid from N (0, 1). Using part (a) show that
1 1 a2
lim log = . (III.9)
n→∞ n P [X1 + · · · + Xn ≥ na] 2
R∞
(c) Let Φ̄(x) = x √12π e−t /2 dt denote the complementary CDF of the standard Gaussian
2

distribution. Express P [X1 + · · · + Xn ≥ na] in terms of the Φ̄ function. Using the fact that
Φ̄(x) = e−x /2+o(x ) as x → ∞ (cf. Exercise V.25), reprove (III.9).
2 2

(d) (reverse I-projection) Let Y be a continuous random variable with zero mean and unit
variance. Show that
min D(PY kN ( μ, σ 2 )) = D(PY kN(0, 1)).
μ,σ

III.27 (Why temperatures equalize?) Let X be finite alphabet and f : X → R an arbitrary function.
Let Emin = min f(x).
(a) Using I-projection show that for any E ≥ Emin the solution of
H∗ (E) = max{H(X) : E[f(X)] ≤ E}
1
is given by a Gibbs distribution (cf. (5.21)) PX (x) = Z(β) e−β f(x) for some β = β(E).
Comment: In statistical physics x is state of the system (e.g. locations and velocities of all
molecules), f(x) is energy of the system in state x, PX is the Gibbs distribution and β = T1 is
the inverse temperature of the system. In thermodynamic equilibrium, PX (x) gives fraction
of time system spends in state x.

(b) Show that dHdE(E) = β(E).
(c) Next consider two functions f0 , f1 (i.e. two types of molecules with different state-energy
relations). Show that for E ≥ minx0 f0 (x0 ) + minx1 f1 (x1 ) we have
max H(X0 , X1 ) = max H∗0 (E0 ) + H∗1 (E1 ) (III.10)
E[f0 (X0 )+f1 (X1 )]≤E E0 +E1 ≤E

where H∗j (E) = maxE[fj (X)]≤E H(X).


(d) Further, show that for the optimal choice of E0 and E1 in (III.10) we have
β 0 ( E0 ) = β 1 ( E1 ) (III.11)
or equivalently that the optimal distribution PX0 ,X1 is given by
1
PX0 ,X1 (a, b) = e−β(f0 (a)+f1 (b)) (III.12)
Z0 (β)Z1 (β)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-337


i i

Exercises for Part III 337

Remark: (III.12) also just follows from part (a) by taking f(x0 , x1 ) = f0 (x0 ) + f1 (x1 ). The
point here is relation (III.11): when two thermodynamical systems are brought in contact with
each other, the energy distributes among them in such a way that β parameters (temperatures)
equalize.
III.28 (Importance Sampling [90]) Let μ and ν be two probability measures on set X . Assume that
ν  μ. Let L = D(νk μ) and ρ = ddμν be the Radon-Nikodym derivative. Let f : X → R be a
measurable function. We would like to estimate Eν f using data from μ.
i.i.d. P
Let X1 , . . . , Xn ∼ μ and In (f) = n1 1≤i≤n f(Xi )ρ(Xi ). Prove the following.
(a) For n ≥ exp(L + t) with t ≥ 0, we have
 q 
E |In (f) − Eν f| ≤ kfkL2 (ν) exp(−t/4) + 2 Pμ (log ρ > L + t/2) .

Hint: Let h = f1{ρ ≤ exp(L + t/2)}. Use triangle inequality and bound E |In (h) − Eν h|,
E |In (h) − In (f)|, | Eν f − Eν h| separately.
(b) On the other hand, for n ≤ exp(L − t) with t ≥ 0, we have
Pμ (log ρ ≤ L − t/2)
P(In (1) ≥ 1 − δ)| ≤ exp(−t/2) + ,
1−δ
for all δ ∈ (0, 1), where 1 is the constant-1 function.
Hint: Divide into two cases depending on whether max1≤i≤n ρ(Xi ) ≤ exp(L − t/2).
This shows that a sample of size exp(D(νk μ) + Θ(1)) is both necessary and sufficient for
accurate estimation by importance sampling.
III.29 (M-ary hypothesis testing)7 The following result [274] generalizes Corollary 16.2 on the best
average probability of error for testing two hypotheses to multiple hypotheses.
Fix a collection of distributions {P1 , . . . , PM }. Conditioned on θ, which takes value i with prob-
i.i.d.
ability π i > 0 for i = 1, . . . , M, let X1 , . . . , Xn ∼ Pθ . Denote the optimal average probability of

error by pn = inf P[θ̂ 6= θ], where the infimum is taken over all decision rules θ̂ = θ̂(X1 , . . . , Xn ).
(a) Show that
1 1
lim log ∗ = min C(Pi , Pj ), (III.13)
n→∞ n pn 1≤i<j≤M

where C is the Chernoff information defined in (16.2).


(b) It is clear that the optimal decision rule is the Maximum a Posteriori (MAP) rule. Does
maximum likelihood rule also achieve the optimal exponent (III.13)? Prove or disprove it.
III.30 Given n observations (X1 , Y1 ), . . . , (Xn , Yn ), where each observation consists of a pair of random
variables, we want to test the following hypothesis:
i.i.d.
H0 : (Xi , Yi ) ∼ P ⊗ Q
i.i.d.
H1 : (Xi , Yi ) ∼ Q ⊗ P

7
Not to be confused with multiple testing in the statistics literature, which refers to testing multiple pairs of binary
hypotheses simultaneously.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-338


i i

338 Exercises for Part III

where ⊗ denotes product distribution as usual.


(a) Show that the Stein exponent D(P ⊗ QkQ ⊗ P) is equal to D(PkQ) + D(QkP).
(b) Show that the Chernoff exponent (Chernoff information) C(P ⊗ Q, Q ⊗ P) is equal to
2 R√
−2 log(1 − H (P2 ,Q) ) = −2 log dPdQ, where H(P, Q) is the Hellinger distance – cf. (7.5).
Comment: This type of hypothesis testing problems arises in the context of community detec-
tion, where n nodes indexed by i ∈ [n] are partitioned into two communities (labeled by σi = +
and σi = − uniformly and independently) and the task is to classify the nodes based on the
pairwise observations W = (Wij : 1 ≤ i < j ≤ n) are independent conditioned on σi ’s and
Wij ∼ P if σi = σj and Q otherwise. (The stochastic block model previously introduced in
Exercise I.49 corresponds to P and Q being Bernoulli.) As a means to prove the impossibil-
ity result [455], consider the setting where an oracle reveals all labels except for σ1 . Define
i.i.d.
S+ = {j = 2, . . . , n : σj = +} and similarly S− . If σ1 = +, {W1,j : j ∈ S+ } ∼ P and
i.i.d.
{W1,j : j ∈ S− } ∼ Q and vice versa if σ1 = −.
III.31 (Stochastic dominance and robust LRT) Let P0 , P1 be two families of probability distributions
on X . Suppose that there is a least favorable pair (LFP) (Q0 , Q1 ) ∈ P0 × P1 such that
Q0 [π > t] ≥ Q′0 [π > t]
Q1 [π > t] ≤ Q′1 [π > t],
for all t ≥ 0 and Q′i ∈ Pi , where π = dQ1 /dQ0 . Prove that (Q0 , Q1 ) simultaneously minimizes
all f-divergences between P0 and P1 , i.e.
Df (Q1 kQ0 ) ≤ Df (Q′1 kQ′0 ) ∀Q′0 ∈ P0 , Q′1 ∈ P1 . (III.14)
Hint: Interpolate between (Q0 , Q1 ) and (Q′0 , Q′1 ) and differentiate.
Remark: For the case of two TV-balls, i.e. Pi = {Q : TV(Q, Pi ) ≤ ϵ}, the existence of LFP is
shown in [221], in which case π = min(c′ , max(c′′ , dP ′ ′′
dP1 )) for some 0 ≤ c < c ≤ ∞ giving
0

the robust likelihood-ratio test.


III.32 Recall the FI -curve from Definition 16.5. Suppose X and Y are finite and prove the following:
a (Tensorization) FI (nt; P⊗ n
X,Y ) = nFI (t; PX,Y ) (Hint: Theorem 11.17)
b (Concavity) t 7→ FI (t) is concave and continuous on its domain t ∈ [0; H(X)].
c (Cardinality bound) It is sufficient to take |U| ≤ |X | + 1 in the definition of FI . (Hint: inspect
Theorem 11.17)
d Show that sup in the definition is a max and that for every t there exists a random variable U
s.t. I(U; X) = t and I(U; Y) = FI (t).
e (Strict DPI) If PX,Y (x, y) > 0 then FI (t) < t for all t > 0.
III.33 Gács-Körner (GK) common information [174] between a pair of random variables (X, Y) is
defined as the supremum of rates R such that for all large enough n there exists functions f(Xn )
i.i.d.
and g(Yn ) with H(f(Xn )) ≥ nR and P[f(Xn ) 6= g(Xn )] → 0 where (Xi , Yi ) ∼ PX,Y (i.e. GK com-
mon information is the maximal rate at which randomness can be extracted from two correlated
sequences). Show that if PX,Y (x, y) > 0 then GK common information is zero. (Hint: Show that
I(Yn ; g(Yn )) = I(Yn ; f(Xn )) + o(n) and apply tensorization and strict DPI from Exercise III.32.)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-339


i i

Part IV

Channel coding

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-340


i i

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-341


i i

341

In this Part we study a new type of problem. The goal of channel coding is to communicate
digital information across a noisy channel. Historically, this was the first area of information the-
ory that lead to immediately and widely deployed applications. Shannon’s discovery [378] of the
possibility of transmitting information with vanishing error and positive (i.e. bigger than zero) rate
of bits per second was also theoretically quite surprising and unexpected. Our goal in this Part is
to understand these arguments.
To explain the relation of this Part to others, let us revisit what problems we have studied so
far. In Part I we introduced various information measures and studied their properties irrespec-
tive of engineering applications. Then, in Part II our objective was data compression. The main
object there was a single distribution PX and the fundamental limit E[ℓ(f∗ (X))] is the minimal
compression length. The main result (the “coding theorem”) established connection between the
fundamental limit and an information quantity, that we can summarize as
E[ℓ(f∗ (X))] ≈ H(X)
Next, in Part III we studied binary hypothesis testing. There the main object was a pair of distri-
butions (P, Q), the fundamental limit was the Neyman-Pearson curve β1−ϵ (Pn , Qn ) and the main
result
β1−ϵ (Pn , Qn ) ≈ exp{−nD(PkQ)} ,
again connecting an operational quantity to an information measure.
In channel coding – the topic of this Part – the main object is going to be a channel PY|X .
The fundamental limit is M∗ (ϵ), the maximum number of messages that can be transmitted with
probability of error at most ϵ, which we rigorously define in this chapter. Our main result in this
part is to show the celebrated Shannon’s noisy channel coding theorem:
log M∗ (ϵ) ≈ max I(X; Y) .
PX

We will demonstrate the possibility of sending information with high reliability and also will
rigorously derive the asymptotically (and non-asymptotically!) highest achievable rates. However,
we entirely omit a giant and beautiful field of coding theory that deals with the question of how
to construct transmitters and receivers with low computational complexity. This area of science,
though deeply related to the content of our book, deserves a separate dedicated treatment. We
recommend reading [360] for the sparse-graph based codes and [372] for introduction to more
modern polar codes.
The practical implications of this chapter are profound even without giving explicit construc-
tions of codes. First, in the process of finding channel capacity one needs to maximize mutual
information and the maximizing distributions reveal properties of optimal codes (e.g. water-filling
solution dictates how to optimally allocate power between frequency bands, Figure 3.2 suggests
when to use binary modulation, etc). Second, the non-asymptotic (finite blocklength) bounds that
we develop in this Part are routinely used for benchmarking performance of all newly developed
codes. Other implications tell how to exploit memory in the channel, or leverage multiple antennas
(Exercise I.10), and many more. In all, the contents of this Part have had by far the most real-world
impact of all (at least as of the writing of this book).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-342


i i

17 Error correcting codes

In this chapter we introduce the concept of an error correcting code (ECC). We will spend time
discussing what it means for a code to have low probability of error, what is the optimum (ML or
MAP) decoder. On the special case of coding for the BSC we showcase evolution of our under-
standing of fundamental limits from pre-Shannon’s to modern finite blocklength. We also briefly
review the history of ECCs. We conclude with a conceptually important proof of a weak converse
(impossibility) bound for the performance of ECCs.

17.1 Codes and probability of error


We start with a simple definition of a code.

Definition 17.1 An M-code for PY|X is an encoder/decoder pair (f, g) of (randomized)


functions1

• encoder f : [M] → X
• decoder g : Y → [M] ∪ {e}

In most cases f and g are deterministic functions, in which case we think of them, equivalently,
in terms of codewords, codebooks, and decoding regions (see Figure 17.1 for an illustration)

• ∀i ∈ [M] : ci ≜ f(i) are codewords, the collection C = {c1 , . . . , cM } is called a codebook.


• ∀i ∈ [M], Di ≜ g−1 ({i}) is the decoding region for i.

Given an M-code we can define a probability space, underlying all the subsequent developments
in this Part. For that we chain the three objects – message W, the encoder and the decoder – together
into the following Markov chain:

f P Y| X g
W −→ X −→ Y −→ Ŵ (17.1)

1
For randomized encoder/decoders, we identify f and g as probability transition kernels PX|W and PŴ|Y .

342

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-343


i i

17.1 Codes and probability of error 343

c1 b
b
b

D1 b

b b

b cM
b b
b

DM

Figure 17.1 When X = Y, the decoding regions can be pictured as a partition of the space, each containing
one codeword.

where we set W ∼ Unif([M]). In the case of discrete spaces, we can explicitly write out the joint
distribution of these variables as follows:
1
(general) PW,X,Y,Ŵ (m, a, b, m̂) = PX|W (a|m)PY|X (b|a)PŴ|Y (m̂|b)
M
1
(deterministic f, g) PW,X,Y,Ŵ (m, cm , b, m̂) = PY|X (b|cm )1{b ∈ Dm̂ }
M
Throughout these sections, these random variables will be referred to by their traditional names:
W – original (true) message, X - (induced) channel input, Y - channel output and Ŵ - decoded
message.
Although any pair (f, g) is called an M-code, in reality we are only interested in those that satisfy
certain “error correcting” properties. To assess their quality we define the following performance
metrics:

1 Maximum error probability: Pe,max (f, g) ≜ maxm∈[M] P[Ŵ 6= m|W = m].


2 Average error probability: Pe (f, g) ≜ P[W 6= Ŵ].
Note that, clearly, Pe ≤ Pe,max . Therefore, requirement of the small maximum error probability
is a more stringent criterion, and offers uniform protection for all codewords. Some codes (such
as linear codes, see Section 18.6) have the property of Pe = Pe,max by construction, but generally
these two metrics could be very different.
Having defined the concept of an M-code and the performance metrics, we can finally define
the fundamental limits for a given channel PY|X .

Definition 17.2 A code (f, g) is an (M, ϵ)-code for PY|X if Pe (f, g) ≤ ϵ. Similarly, an (M, ϵ)max -
code must satisfy Pe,max ≤ ϵ. The fundamental limits of channel coding are defined as
M∗ (ϵ; PY|X ) = max{M : ∃(M, ϵ)-code}
M∗max (ϵ; PY|X ) = max{M : ∃(M, ϵ)max -code}
ϵ∗ (M; PY|X ) = inf{ϵ : ∃(M, ϵ)-code}
ϵ∗max (M; PY|X ) = inf{ϵ : ∃(M, ϵ)max -code}

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-344


i i

344

The argument PY|X will be omitted when PY|X is clear from the context.

In other words, the quantity log2 M∗ (ϵ) gives the maximum number of bits that we can
push through a noisy transformation PY|X , while still guaranteeing the error probability in the
appropriate sense to be at most ϵ.

Example 17.1 The channel BSC⊗ n


δ (recall from Example 3.6 that BSC stands for binary sym-
metric channel) acts between X = {0, 1}n and Y = {0, 1}n , where the input Xn is contaminated
i.i.d.
by additive noise Zn ∼ Ber(δ) independent of Xn , resulting in the channel output

Yn = X n ⊕ Zn .

In other words, the BSC⊗ n


δ channel takes a binary sequence length n and flips each bit indepen-
dently with probability δ ; pictorially,

0 1 0 0 1 1 0 0 1 1

PY n |X n

1 1 0 1 0 1 0 0 0 1

In the next section we discuss coding for the BSC channel in more detail.

17.2 Coding for Binary Symmetric Channels


To understand the problem of designing the encoders and decoders, let us consider the BSC trans-
formation with δ = 0.11 and n = 1000. The problem of studying log2 M∗ (ϵ) attempts to answer
what is the maximum number k of bits you can send with Pe ≤ ϵ? For concreteness, let us fix
ϵ = 10−3 and discuss some of the possible ideas.
Perhaps our first attempt would be to try sending k = 1000 bits with one data bit mapped to
one channel input position. However, a simple calculation shows that in this case we get Pe =
1 − (1 − δ)n ≈ 1. In other word, the uncoded transmission does not meet our objective of small
Pe and some form of coding is necessary. This incurs a fundamental tradeoff: reduce the number
of bits to send (and use the freed channel inputs for sending redundant copies) in order to increase
the probability of success.
So let us consider the next natural idea: the repetition coding. We take each of the input data
bits and repeat it ℓ times:

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-345


i i

17.2 Coding for Binary Symmetric Channels 345

0 0 1 0

0000000 0000000 1111111 0000000

Decoding can be done by taking a majority vote inside each ℓ-block. Thus, each data bit is decoded
with probability of bit error Pb = P[Binom(l, δ) > l/2]. However, the probability of block error of
this scheme is Pe ≤ kP[Binom(l, δ) > l/2]. (This bound is essentially tight in the current regime).
Consequently, to satisfy Pe ≤ 10−3 we must solve for k and ℓ satisfying kl ≤ n = 1000 and also

kP[Binom(l, δ) > l/2] ≤ 10−3 .

This gives l = 21, k = 47 bits. So we can see that using repetition coding we can send 47 data
bits by using 1000 channel uses.
Repetition coding is a natural idea. It also has a very natural tradeoff: if you want better reliabil-
ity, then the number ℓ needs to increase and hence the ratio nk = 1ℓ should drop. Before Shannon’s
groundbreaking work, it was almost universally accepted that this is fundamentally unavoidable:
vanishing error probability should imply vanishing communication rate nk .
Before delving into optimal codes let us offer a glimpse of more sophisticated ways of injecting
redundancy into the channel input n-sequence than simple repetition. For that, consider the so-
called first-order Reed-Muller codes (1, r). We interpret a sequence of r data bits a0 , . . . , ar−1 ∈ Fr2
as a degree-1 polynomial in (r − 1) variables:
X
r− 1
a = (a0 , . . . , ar−1 ) 7→ fa (x) ≜ a i xi + a 0 .
i=1

In order to transmit these r bits of data we simply evaluate fa (·) at all possible values of the variables
xr−1 ∈ Fr2−1 . This code, which maps r bits to 2r−1 bits, has minimum distance dmin = 2r−2 . That
is, for two distinct a 6= a′ the number of positions in which fa and fa′ disagree is at least 2r−2 . In
coding theory notation [n, k, dmin ] we say that the first-order Reed-Muller code (1, 7) is a [64, 7, 32]
code. It can be shown that the optimal decoder for this code achieves over the BSC⊗ 64
0.11 channel a
probability of error at most 6 · 10−6 . Thus, we can use 16 such blocks (each carrying 7 data bits and
occupying 64 bits on the channel) over the BSC⊗ δ
1024
, and still have (by the union bound) overall
−4 −3
probability of block error Pe ≲ 10 < 10 . Thus, with the help of Reed-Muller codes we can
send 7 · 16 = 112 bits in 1024 channel uses, more than doubling that of the repetition code.
Shannon’s noisy channel coding theorem (Theorem 19.9) – a crown jewel of information theory
– tells us that over memoryless channel PYn |Xn = (PY|X )n of blocklength n the fundamental limit
satisfies

log M∗ (ϵ; PYn |Xn ) = nC + o(n) (17.2)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-346


i i

346

as n → ∞ and for arbitrary ϵ ∈ (0, 1). Here C = maxPX1 I(X1 ; Y1 ) is the capacity of the single-letter
channel. In our case of BSC we have
1
C = log 2 − h(δ) ≈ bit ,
2
since the optimal input distribution is uniform (from symmetry) – see Section 19.3. Shannon’s
expansion (17.2) can be used to predict (not completely rigorously, of course, because of the
o(n) residual) that it should be possible to send around 500 bits reliably. As it turns out, for the
blocklength n = 1000 this is not quite possible.
Note that computing M∗ exactly requires iterating over all possible encoders and decoder –
an impossible task even for small values of n. However, there exist rigorous and computation-
ally tractable finite blocklength bounds [334] that demonstrate for our choice of n = 1000, δ =
0.11, ϵ = 10−3 :

414 ≤ log2 M∗ (ϵ = 10−3 ) ≤ 416 bits (17.3)

Thus we can see that Shannon’s prediction is about 20% too optimistic. We will see below
some of such finite-length bounds. Notice, however, that while the bounds guarantee existence
of an encoder-decoder pair achieving a prescribed performance, building an actual f and g
implementable with a modern software/hardware is a different story.
It took about 60 years after Shannon’s discovery of (17.2) to construct practically imple-
mentable codes achieving that performance. The first codes that approach the bounds on log M∗
are calledturbo codes [47] (after the turbocharger engine, where the exhaust is fed back in to
power the engine). This class of codes is known as sparse-graph codes, of which the low-density
parity check (LDPC) codes invented by Gallager are particularly well studied [360]. As a rule of
thumb, these codes typically approach 80 . . . 90% of log M∗ when n ≈ 103 . . . 104 . For shorter
blocklengths in the range of n = 100 . . . 1000 there is an exciting alternative to LDPC codes: the
polar codes of Arıkan [23], which are most typically used together with the list-decoding idea
of Tal and Vardy [409]. And of course, the story is still evolving today as new channel models
become relevant and new hardware possibilities open up.
We wanted to point out a subtle but very important conceptual paradigm shift introduced by
Shannon’s insistence on coding over many (information) bits together. Indeed, consider the sit-
uation discussed above, where we constructed a powerful code with M ≈ 2400 codewords and
n = 1000. Now, one might imagine this code as a constellation of 2400 points carefully arranged
inside a hypercube {0, 1}1000 to guarantee some degree of separation between them, cf. (17.6).
Next, suppose one was using this code every second for the lifetime of the universe (≈ 1018 sec).
Yet, even after this laborious process she will have explored at most 260 different codewords from
among an overwhelmingly large codebook 2400 . So a natural question arises: why did we need
to carefully place all these many codewords if majority of them will never be used by anyone?
The answer is at the heart of the concept of information: to transmit information is to convey a
selection of one element (W) from a collection of possibilities ([M]). The fact that we do not know
which W will be selected forces us to a priori prepare for every one of the possibilities. This simple
idea, proposed in the first paragraph of [378], is now tacitly assumed by everyone, but was one of

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-347


i i

17.3 Optimal decoder 347

the subtle ways in which Shannon revolutionized scientific approach to the study of information
exchange.

17.3 Optimal decoder


Given any encoder f : [M] → X , the decoder that minimizes Pe is the Maximum A Posteriori
(MAP) decoder, or equivalently, the Maximum Likelihood (ML) decoder, since the codewords are
equiprobable (W is uniform):

g∗ (y) = argmax P [W = m|Y = y]


m∈[M]

= argmax P [Y = y|W = m] . (17.4)


m∈[M]

Notice that the optimal decoder is deterministic. For the special case of deterministic encoder,
where we can identify the encoder with its image C the minimal (MAP) probability of error for
the codebook C can be written as
1 X
Pe,MAP (C) = 1 − max PY|X (y|x) , (17.5)
M x∈C
y∈Y

with a similar extension to non-discrete Y .


Remark 17.1 For the case of BSC⊗ n
δ MAP decoder has a nice geometric interpretation. Indeed,
if dH (xn , yn ) = |{i : xi 6= yi }| denotes the Hamming distance and if f (the encoder) is deterministic
with codewords C = {ci , i ∈ [M]} then

g∗ (yn ) = argmin dH (cm , yn ) . (17.6)


m∈[M]

Consequently, the optimal decoding regions – see Figure 17.1 – become the Voronoi cells tesse-
lating the Hamming space {0, 1}n . Similarly, the MAP decoder for the AWGN channel induces a
Voronoi tesselation of Rn – see Section 20.3.
So we have seen that the optimal decoder is without loss of generality can be assumed to be
deterministic. Similarly, we can represent any randomized encoder f as a function of two argu-
ments: the true message W and an external randomness U ⊥ ⊥ W, so that X = f(W, U) where this
time f is a deterministic function. Then we have

P[W 6= Ŵ] = E[P[W 6= Ŵ|U]] ,

which implies that if P[W 6= Ŵ] ≤ ϵ then there must exist some choice u0 such that P[W 6= Ŵ|U =
u0 ] ≤ ϵ. In other words, the fundamental limit M∗ (ϵ) is unchanged if we restrict our attention to
deterministic encoders and decoders only.
Note, however, that neither of the above considerations apply to the maximal probability of
error Pe,max . Indeed, the fundamental limit M∗max (ϵ) does indeed require considering randomized
encoders and decoders. For example, when M = 2 from the decoding point of view we are back to

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-348


i i

348

the setting of binary hypotheses testing in Part III. The optimal decoder (test) that minimizes the
maximal Type-I and II error probability, i.e., max{1 − α, β}, will not be deterministic if max{1 −
α, β} is not achieved at a vertex of the Neyman-Pearson region R(PY|W=1 , PY|W=2 ).

17.4 Weak converse bound


The main focus of both the theory and the practice of channel coding lies in showing the existence
of (or constructing explicit) (M, ϵ)-codes with large M and small ϵ. To understand how close the
constructed code is to the fundamental limit, one needs to prove an “impossibility result” bound-
ing M from the above or ϵ from below. Such negative results are known as “converse bounds”,
with the name coming from the fact that classically these bounds followed right after the positive
(existential) results and were preceded with the words “Conversely, …”. The next result shows
that M can never (multiplicatively) exceed capacity supPX I(X; Y) by much.

Theorem 17.3 (Weak converse) Any (M, ϵ)-code for PY|X satisfies
supPX I(X; Y) + h(ϵ)
log M ≤ ,
1−ϵ
where h(x) = H(Ber(x)) is the binary entropy function.

Proof. This can be derived as a one-line application of Fano’s inequality (Theorem 3.12), but we
proceed slightly differently with an eye towards future extensions in meta-converse (Section 22.3).
Consider an M-code with probability of error Pe and its corresponding probability space: W →
X → Y → Ŵ. We want to show that this code can be used as a hypothesis test between distributions
PX,Y and PX PY . Indeed, given a pair (X, Y) we can sample (W, Ŵ) from PW,Ŵ|X,Y = PW|X PŴ|Y and
compute the binary value Z = 1{W 6= Ŵ}. (Note that in the most interesting cases when encoder
and decoder are deterministic and the encoder is injective, the value Z is a deterministic function
of (X, Y).) Let us compute performance of this binary hypothesis test under two hypotheses. First,
when (X, Y) ∼ PX PY we have that Ŵ ⊥ ⊥ W ∼ Unif([M]) and therefore:
1
PX PY [Z = 1] = .
M
Second, when (X, Y) ∼ PX,Y then by definition we have

PX,Y [Z = 1] = 1 − Pe .

Thus, we can now apply the data-processing inequality for divergence to conclude: Since W →
X → Y → Ŵ, we have the following chain of inequalities (cf. Fano’s inequality Theorem 3.12):
DPI 1
D(PX,Y kPX PY ) ≥ d(1 − Pe k )
M
≥ −h(P[W 6= Ŵ]) + (1 − Pe ) log M

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-349


i i

17.4 Weak converse bound 349

By noticing that the left-hand side is I(X; Y) ≤ supPX I(X; Y) we obtain:


supPX I(X; Y) + h(Pe )
log M ≤ ,
1 − Pe
h(p)
and the proof is completed by checking that p 7→ 1− p is monotonically increasing.

Remark 17.2 The bound can be significantly improved by considering other divergence mea-
sures in the data-processing step. In particular, we will see below how one can get “strong”
converse (explaining the term “weak” converse here as well) in Section 22.1. The proof technique
is known as meta-converse; see Section 22.3.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-350


i i

18 Random and maximal coding

So far our discussion of channel coding was mostly following the same lines as the M-ary hypothe-
sis testing (HT) in statistics. In this chapter we introduce the key departure: the principal and most
interesting goal in information theory is the design of the encoder f : [M] → X or the codebook
{ci ≜ f(i), i ∈ [M]}. Once the codebook is chosen, the problem indeed becomes that of M-ary HT
and can be tackled by standard statistical tools. However, the task of choosing the encoder f has no
exact analogs in statistical theory (the closest being design of experiments). Each f gives rise to a
different HT problem and the goal is to choose these M hypotheses PY|X=c1 , . . . , PY|X=cM to ensure
maximal testability. It turns out that the problem of choosing a good f will be much simplified
if we adopt a suboptimal way of testing M-ary HT. Roughly speaking we will run M binary HTs
testing PY|X=cm against PY , which tries to distinguish the channel output induced by the message m
from an “average background noise” PY . An optimal such test, as we know from Neyman-Pearson
(Theorem 14.11), thresholds the following quantity

PY|X=x
log .
PY
This explains the central role played by the information density (defined next) in this chapter.
After introducing the latter we will present several results demonstrating existence of good codes.
We start with the original bound of Shannon (expressed in modern language), followed by its
tightenings (DT, RCU and Gallager’s bounds). These belong to the class of random coding bounds.
An entirely different approach was developed by Feinstein and is called maximal coding. We
will see that the two result in eerily similar results. Why two of these rather different methods
yield similar results, which are also quite close to the best possible (i.e. “achieve capacity and
dispersion”)? It turns out that the answer lies in a certain submodularity property of the channel
coding task. Finally, we will also discuss a more structured class of codes based on linear algebraic
constructions. Similar to the case of compression it will be shown that linear codes are no worse
than general codes, explaining why virtually all practical codes are linear.
While reading this Chapter, we recommend also consulting Figure 22.2, in which various
achievability bounds are compared for the BSC.
In this chapter it will be convenient to introduce the following independent pairs (X, Y) ⊥
⊥ (X, Y)
with their joint distribution given by:

PX,Y,X,Y (a, b, a, b) = PX (a)PY|X (b|a)PX (a)PY|X (b|a). (18.1)

We will often call X the sent codeword and X̄ the unsent codeword.

350

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-351


i i

18.1 Information density 351

18.1 Information density


A crucial object for the subsequent development is the information density. Historically, the con-
cept seems to originate in early works of Soviet information theorists. In a nutshell, we want to
set i(x; y) = log PPXX(,xY)(PxY,y()y) . However, we want to make this definition sufficiently general so as
to take into account continuous distributions, the possibility of PX,Y 6 PX PY (in which case the
value under the log can equal +∞) and the possibility of argument of the log being equal to 0. The
definition below is similar to what we did in Definition 14.4 and (2.10) using the Log function,
but we repeat it below for convenience.

Definition 18.1 (Information density1 ) Let PX,Y  μ and PX PY  μ for some dominat-
ing measure μ, and denote by f(x, y) = dPdμX,Y and f̄(x, y) = dPdμ X PY
the Radon-Nikodym derivatives
of PX,Y and PX PY with respect to μ, respectively. Then recalling the Log definition (2.10) we set


 log ff̄((xx,,yy)) , f(x, y) > 0, f̄(x, y) > 0


f(x, y) +∞, f(x, y) > 0, f̄(x, y) = 0
iPX,Y (x; y) ≜ Log = (18.2)
f̄(x, y)  −∞, f(x, y) = 0, f̄(x, y) > 0



0, f(x, y) = f̄(x, y) = 0 .

In the most common special case of PX,Y  PX PY we have simply


dPX,Y
iPX,Y (x; y) = log ( x, y) ,
dPX PY
with log 0 = −∞.
Notice that the information density as a function depends on the underlying PX,Y . Throughout
this Part, however, PY|X is going to be a fixed channel (fixed by the problem at hand), and thus
information density only depends on the choice of PX . Most of the time PX (and, correspondingly,
PX,Y ) used to define information density will be apparent from the context. Thus for the benefit of
the reader as well as our own, we will write i(x; y) dropping the subscript PX,Y .
We proceed to show some elementary properties of the information density. The next result
explains the name “information density”.

Proposition 18.2 The expectation E[i(X; Y)] is well-defined and non-negative (but possibly
infinite). In any case, we have I(X; Y) = E[i(X; Y)].

Proof. This is follows from (2.12) and the definition of i(x; y) as log-ratio.

Being defined as log-likelihood, information density possesses the standard properties of the
latter, cf. Theorem 14.6. However, because its defined in terms of two variables (X, Y), there are

1
We remark that in machine learning (especially natural language processing) information density is also called pointwise
mutual information (PMI) [292].

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-352


i i

352

also very useful conditional expectation versions. To illustrate the meaning of the next proposition,
let us consider the case of discrete X, Y and PX,Y  PX PY . Then we have for every x:
X X
f(x, y)PX (x)PY (y) = f(x, y) exp{−i(x; y)}PX,Y (x, y) .
y y

The general case requires a little more finesse.

Proposition 18.3 (Conditioning-unconditioning trick) Let X̄ ⊥


⊥ (X, Y) be a copy of X.
We have the following:

1 For any function f : X × Y → R

E[f(X̄, Y)1{i(X̄; Y) > −∞}] = E[f(X, Y) exp{−i(X; Y)}] . (18.3)

2 Let f+ be a non-negative function. Then for PX -almost every x we have

E[f+ (X̄, Y)1{i(X̄; Y) > −∞}|X̄ = x] = E[f+ (X, Y) exp{−i(X; Y)}|X = x] (18.4)

Proof. The first part (18.3) is simply a restatement of (14.5). For the second part, let us define

a(x) ≜ E[f+ (X̄, Y)1{i(X̄; Y) > −∞}|X̄ = x], b(x) ≜ E[f+ (X, Y) exp{−i(X; Y)}|X = x]

We first additionally assume that f is bounded. Fix ϵ > 0 and denote Sϵ = {x : a(x) ≥ b(x) + ϵ}.
As ϵ → 0 we have Sϵ % {x : a(x) > b(x)} and thus if we show PX [Sϵ ] = 0 this will imply that
a(x) ≤ b(x) for PX -a.e. x. The symmetric argument shows b(x) ≤ a(x) and completes the proof
of the equality.
To show PX [Sϵ ] = 0 let us apply (18.3) to the function f(x, y) = f+ (x, y)1{x ∈ Sϵ }. Then we get

E[f+ (X, Y)1{X ∈ Sϵ } exp{−i(X; Y)}] = E[f+ (X̄, Y)1{i(X̄; Y) > −∞}1{X ∈ Sϵ }] .

Let us re-express both sides of this equality by taking the conditional expectations over Y to get:

E[b(X)1{X ∈ Sϵ }] = E[a(X̄)1{X̄ ∈ Sϵ }] .

But from the definition of Sϵ we have

E[b(X)1{X ∈ Sϵ }] ≥ E[(b(X̄) + ϵ)1{X̄ ∈ Sϵ }] .


d
Recall that X=X̄ and hence

E[b(X)1{X ∈ Sϵ }] ≥ E[b(X)1{X ∈ Sϵ }] + ϵPX [Sϵ ] .

Since f+ (and therefore b) was assumed to be bounded we can cancel the common term from both
sides and conclude PX [Sϵ ] = 0 as required.
Finally, to show (18.4) in full generality, given an unbounded f+ we define fn (x, y) =
min(f+ (x, y), n). Since (18.4) holds for fn we can take limit as n → ∞ on both sides of it:

lim E[fn (X̄, Y)1{i(X̄; Y) > −∞}|X̄ = x] = lim E[fn (X, Y) exp{−i(X; Y)}|X = x]
n→∞ n→∞

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-353


i i

18.2 Shannon’s random coding bound 353

By the monotone convergence theorem (for conditional expectations!) we can take the limits inside
the expectations to conclude the proof.

Corollary 18.4 (Information tails) For PX -almost every x we have


P[i(x; Y) > t] ≤ exp(−t), (18.5)
P[i(X; Y) > t] ≤ exp(−t) (18.6)

Proof. Pick f+ (x, y) = 1 {i(x; y) > t} in (18.4).


Remark 18.1 This estimate has been used by us several times before. In the hypothesis testing
part we used (Corollary 14.12):
h dP i
Q log ≥ t ≤ exp(−t). (18.7)
dQ
In data compression, we used the fact that |{x : log PX (x) ≥ t}| ≤ exp(−t), which is also of the
form (18.7) with Q being the counting measure.

18.2 Shannon’s random coding bound


In this section we present perhaps the most virtuous technical result of Shannon. As we discussed
before, good error correcting code is supposed to be a geometrically elegant constellation in a
high-dimensional space. Its chief goal is to push different codewords as far apart as possible, so as
to reduce the deleterious effects of channel noise. However, in early 1940’s there were no codes
and no tools for constructing them available to Shannon. So facing the problem of understanding
if error correction is even possible, Shannon decided to check if placing codewords randomly
in space will somehow result in favorable geometric arrangement. To everyone’s astonishment,
which is still producing aftershocks today, this method not only produced reasonable codes, but in
fact turned out to be optimal asymptotically (and almost-optimal non-asymptotically [334]). As
we mentioned earlier, the method of proving existence of certain combinatorial objects by random
selection is known as Erdös’s probabilistic method [15], which Shannon apparently discovered
independently and, perhaps, earlier.
Before going to the proof, we need to explain why we use a particular suboptimal decoder
and also how information density arises in this context. First, consider, for simplicity, the case of
discrete alphabets and PX,Y  PX PY . Then we have an equivalent expression
PY|X (y|x)
i(x; y) = log .
PY (y)
Therefore, the optimal (maximum likelihood) decoder can be written in terms of the information
density:
g∗ (y) = argmax PX|Y (cm |y) = argmax PY|X (y|cm ) = argmax i(cm ; y). (18.8)
m∈[M] m∈[M] m∈[M]

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-354


i i

354

Note that (18.8) holds regardless of the input distribution PX used for the definition of i(x; y), in
PM
particular we do not have to use the code-induced distribution PX = M1 i=1 δci . However, if we
are to threshold information density, different choices of PX will result in different decoders, so
we need to justify the choice of PX .
To that end, recall that to distinguish between two codewords ci and cj , one can apply (as we
P
learned in Part III for binary HT) the likelihood ratio test, namely thresholding the LLR log PYY||XX=
=c
ci
.
j
As we explained at the beginning of this Chapter, a (possibly suboptimal) approach in M-ary HT
is to run binary tests by thresholding each information density i(ci ; y). This, loosely speaking,
evaluates the likelihood of ci against the average distribution of the other M − 1 codewords, which
1
P
we approximate by PY (as opposed to the more precise form M− 1 j̸=i PY|X=cj ). Putting these ideas
together we can propose the decoder as

g(y) = any m s.t. i(cm ; y) > γ ,

where γ is a threshold and PX is judiciously chosen (to maximize I(X; Y) as we will see soon).
A yet another way to see why thresholding decoder (as opposed to an ML one) is a natural idea
is to simply believe the fact that for good error correcting codes the most likely (ML) codeword
has likelihood (information density) so much higher than the rest of the candidates that instead
of looking for the maximum we simply can select the one (and only) codeword that exceeds a
pre-specified threshold.
With these initial justifications we proceed to the main result of this section.

Theorem 18.5 (Shannon’s achievability bound) Fix a channel PY|X and an arbitrary
input distribution PX . Then for every τ > 0 there exists an (M, ϵ)-code with

ϵ ≤ P[i(X; Y) ≤ log M + τ ] + exp(−τ ). (18.9)

Proof. Recall that for a given codebook {c1 , . . . , cM }, the optimal decoder is MAP and is equiv-
alent to maximizing information density, cf. (18.8). The step of maximizing the i(cm ; Y) makes
analyzing the error probability difficult. Similar to what we did in almost loss compression, cf. The-
orem 11.5, the first important step for showing the achievability bound is to consider a suboptimal
decoder. In Shannon’s bound, we consider a threshold-based suboptimal decoder g(y) as follows:

m, ∃! cm s.t. i(cm ; y) ≥ log M + τ
g ( y) = (18.10)
e, otherwise

In words, decoder g reports m as decoded message if and only if codeword cm is a unique one
with information density exceeding the threshold log M + τ . If there are multiple or none such
codewords, then decoder outputs a special value of e, which always results in error since W 6= e
ever. (We could have decreased probability of error slightly by allowing the decoder to instead
output a random message, or to choose any one of the messages exceeding the threshold, or any
other clever ideas. The point, however, is that even the simplistic resolution of outputting e already
achieves all qualitative goals, while simplifying the analysis considerably.)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-355


i i

18.2 Shannon’s random coding bound 355

For a given codebook (c1 , . . . , cM ), the error probability is:

Pe (c1 , . . . , cM ) = P[{i(cW ; Y) ≤ log M + τ } ∪ {∃m 6= W, i(cm ; Y) > log M + τ }]

where W is uniform on [M] and the probability space is as in (17.1).


The second (and most ingenious) step proposed by Shannon was to forgo the complicated
discrete optimization of the codebook. His proposal is to generate the codebook (c1 , . . . , cM ) ran-
domly with cm ∼ PX i.i.d. for m ∈ [M] and then try to reason about the average E[Pe (c1 , . . . , cM )].
By symmetry, this averaged error probability over all possible codebooks is unchanged if we con-
dition on W = 1. Considering also the random variables (X, Y, X̄) as in (18.1), we get the following
chain:

E[Pe (c1 , . . . , cM )]
= E[Pe (c1 , . . . , cM )|W = 1]
= P[{i(c1 ; Y) ≤ log M + τ } ∪ {∃m 6= 1, i(cm , Y) > log M + τ }|W = 1]
X
M
≤ P[i(c1 ; Y) ≤ log M + τ |W = 1] + P[i(cm̄ ; Y) > log M + τ |W = 1] (union bound)
m̄=2
( a)  
= P [i(X; Y) ≤ log M + τ ] + (M − 1)P i(X; Y) > log M + τ
≤ P [i(X; Y) ≤ log M + τ ] + (M − 1) exp(−(log M + τ )) (by Corollary 18.4)
≤ P [i(X; Y) ≤ log M + τ ] + exp(−τ ) ,

where the crucial step (a) follows from the fact that given W = 1 and m̄ 6= 1 we have
d
(c1 , cm̄ , Y) = (X, X̄, Y)

with the latter triple defined in (18.1).


The last expression does indeed conclude the proof of existence of the (M, ϵ) code: it shows
that the average of Pe (c1 , . . . , cM ) satisfies the required bound on probability of error, and thus
there must exist at least one choice of c1 , . . . , cM satisfying the same bound.

Remark 18.2 (Joint typicality) Shortly in Chapter 19, we will apply this theorem for the
case of PX = P⊗ n ⊗n
X1 (the iid input) and PY|X = PY1 |X1 (the memoryless channel). Traditionally,
cf. [111], decoders in such settings were defined with the help of so called “joint typicality”. Those
decoders given y = yn search for the codeword xn (both of which are an n-letter vectors) such that
the empirical joint distribution is close to the true joint distribution, i.e., P̂xn ,yn ≈ PX1 ,Y1 , where
1
P̂xn ,yn (a, b) = · |{j ∈ [n] : xj = a, yj = b}|
n
is the joint empirical distribution of (xn , yn ). This definition is used for the case when random
coding is done with cj ∼ uniform on the type class {xn : P̂xn ≈ PX }. Another alternative, “entropic
Pn
typicality”, cf. [106], is to search for a codeword with j=1 log PX ,Y 1(xj ,yj ) ≈ H(X, Y). We think
1 1
of our requirement, {i(xn ; yn ) ≥ nγ1 }, as a version of “joint typicality” that is applicable to much
wider generality of channels (not necessarily over product alphabets, or memoryless).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-356


i i

356

18.3 Dependence-testing (DT) bound


The following result is a slight refinement of Theorem 18.5, that results in a bound that is free
from the auxiliary parameters and is provably stronger.

Theorem 18.6 (DT bound) Fix a channel PY|X and an arbitrary input distribution PX . Then
for every τ > 0 there exists an (M, ϵ)-code with
   
M − 1 +
ϵ ≤ E exp − i(X; Y) − log (18.11)
2
where x+ ≜ max(x, 0).

Proof. For a fixed γ , consider the following suboptimal decoder:


(
m for the smallest m s.t. i(cm ; y) ≥ γ
g ( y) =
e otherwise.

Setting Ŵ = g(Y) we note that given a codebook {c1 , . . . , cM }, we have by union bound
P[Ŵ 6= j|W = j] = P[i(cj ; Y) ≤ γ|W = j] + P[i(cj ; Y) > γ, ∃k ∈ [j − 1], s.t. i(ck ; Y) > γ]
j− 1
X
≤ P[i(cj ; Y) ≤ γ|W = j] + P[i(ck ; Y) > γ|W = j].
k=1

Averaging over the randomly generated codebook, the expected error probability is upper bounded
by:

1 X
M
E[Pe (c1 , . . . , cM )] = P[Ŵ 6= j|W = j]
M
j=1

1 X 
j−1
M X
≤ P[i(X; Y) ≤ γ] + P[i(X; Y) > γ]
M
j=1 k=1
M−1
= P[i(X; Y) ≤ γ] + P[i(X; Y) > γ]
2
M−1
= P[i(X; Y) ≤ γ] + E[exp(−i(X; Y))1 {i(X; Y) > γ}] (by (18.3))
2
h M−1 i
= E 1 {i(X; Y) ≤ γ} + exp(−i(X; Y))1 {i(X, Y) > γ}
2
To optimize over γ , note the simple observation that U1E + V1Ec ≥ min{U, V}. Therefore for any
x, y, 1{i(x; y) ≤ γ} + M− M−1
2 exp{−i(x; y)}1{i(x; y) > γ} > min(1, 2 exp{−i(x; y)}), achieved
1
M−1
by γ = log 2 regardless of x, y. Thus, we continue the bounding as follows
h M−1 i
inf E[Pe (c1 , . . . , cM )] ≤ inf E 1 {i(X; Y) ≤ γ} + exp(−i(X; Y))1 {i(X, Y) > γ}
γ γ 2
h  M−1 i
= E min 1, exp(−i(X; Y))
2

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-357


i i

18.4 Feinstein’s maximal coding bound 357

   
M − 1 +
= E exp − i(X; Y) − log .
2

Remark 18.3 (Dependence testing interpretation) The RHS of (18.11) equals to M+


2
1

multiple of the minimum error probability of the following Bayesian hypothesis testing problem:
H0 : X, Y ∼ PX,Y versus H1 : X, Y ∼ PX PY
2 M−1
prior prob.: π 0 = , π1 = .
M+1 M+1
Note that X, Y ∼ PX,Y and X, Y ∼ PX PY , where X is the sent codeword and X is the unsent code-
word. As we know from binary hypothesis testing, the best threshold for the likelihood ratio test
(minimizing the weighted probability of error) is log ππ 10 , as we indeed found out.
One of the immediate benefits of Theorem 18.6 compared to Theorem 18.5 is precisely the fact
that we do not need to perform a cumbersome minimization over τ in (18.9) to get the minimum
upper bound in Theorem 18.5. Nevertheless, it can be shown that the DT bound is stronger than
Shannon’s bound with optimized τ . See also Exercise IV.5.
Finally, we remark (and will develop this below in our treatment of linear codes) that DT bound
and Shannon’s bound both hold without change if we generate {ci } by any other (non-iid) pro-
cedure with a prescribed marginal and pairwise independent codewords – see Theorem 18.13
below.

18.4 Feinstein’s maximal coding bound


The previous achievability results are obtained using probabilistic methods (random coding). In
contrast, the following achievability bound due to Feinstein uses a greedy construction. One imme-
diate advantage of Feinstein’s method is that it shows existence of codes satisfying maximal
probability of error criterion.2

Theorem 18.7 (Feinstein’s lemma) Fix a channel PY|X and an arbitrary input distribution
PX . Then for every γ > 0 and for every ϵ ∈ (0, 1) there exists an (M, ϵ)max -code with

M ≥ γ(ϵ − P[i(X; Y) < log γ]) (18.12)

Remark 18.4 (Comparison with Shannon’s bound) We can also interpret (18.12) differ-
ently: for any fixed M, there exists an (M, ϵ)max -code that achieves the maximal error probability
bounded as follows:
M
ϵ ≤ P[i(X; Y) < log γ] +
γ

2
Nevertheless, we should point out that this is not a serious advantage: from any (M, ϵ) code we can extract an
(M′ , ϵ′ )max -subcode with a smaller M′ and larger ϵ′ – see Theorem 19.4.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-358


i i

358

If we take log γ = log M + τ , this gives the bound of exactly the same form as Shannon’s (18.9). It
is rather surprising that two such different methods of proof produced essentially the same bound
(modulo the difference between maximal and average probability of error). We will discuss the
reason for this phenomenon in Section 18.7.

Proof. From the definition of (M, ϵ)max -code, we recall that our goal is to find codewords
c1 , . . . , cM ∈ X and disjoint subsets (decoding regions) D1 , . . . , DM ⊂ Y , s.t.

PY|X (Di |ci ) ≥ 1 − ϵ, ∀i ∈ [M].

Feinstein’s idea is to construct a codebook of size M in a sequential greedy manner.


For every x ∈ X , associate it with a preliminary decoding region Ex defined as follows:

Ex ≜ {y ∈ Y : i(x; y) ≥ log γ}

Notice that the preliminary decoding regions {Ex } may be overlapping, and we will trim them
into final decoding regions {Dx }, which will be disjoint. Next, we apply Corollary 18.4 and find
out that there is a set F ⊂ X with two properties: a) PX [F] = 1 and b) for every x ∈ F we have
1
PY (Ex ) ≤ . (18.13)
γ

We can assume that P[i(X; Y) < log γ] ≤ ϵ, for otherwise the RHS of (18.12) is negative and
there is nothing to prove. We first claim that there exists some c ∈ F such that P[Y ∈ Ec |X =
c] = PY|X (Ec |c) ≥ 1 − ϵ. Indeed, assume (for the sake of contradiction) that ∀c ∈ F, P[i(c; Y) ≥
log γ|X = c] < 1 − ϵ. Note that since PX (F) = 1 we can average this inequality over c ∼ PX . Then
we arrive at P[i(X; Y) ≥ log γ] < 1 − ϵ, which is a contradiction.
With these preparations we construct the codebook in the following way:

1 Pick c1 to be any codeword in F such that PY|X (Ec1 |c1 ) ≥ 1 − ϵ, and set D1 = Ec1 ;
2 Pick c2 to be any codeword in F such that PY|X (Ec2 \D1 |c2 ) ≥ 1 − ϵ, and set D2 = Ec2 \D1 ;
...
−1
3 Pick cM to be any codeword in F such that PY|X (EcM \ ∪M j=1 Dj |cM ] ≥ 1 − ϵ, and set DM =
M− 1
EcM \ ∪j=1 Dj .

We stop if cM+1 codeword satisfying the requirement cannot be found. Thus, M is determined by
the stopping condition:

∀c ∈ F, PY|X (Ec \ ∪M
j=1 Dj |c) < 1 − ϵ

Averaging the stopping condition over c ∼ PX (which is permissible due to PX (F) = 1), we
obtain
 
[
M
P i(X; Y) ≥ log γ and Y 6∈ Dj  < 1 − ϵ,
j=1

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-359


i i

18.5 RCU and Gallager’s bound 359

or, equivalently,
 
[
M
ϵ < P i(X; Y) < log γ or Y ∈ Dj  .
j=1

Applying the union bound to the right hand side yields


X
M
ϵ < P[i(X; Y) < log γ] + PY (Dj )
j=1

X
M
≤ P[i(X; Y) < log γ] + PY (Ecj )
j=1
M
≤ P[i(X; Y) < log γ] +
γ
where the last step makes use of (18.13).Evidently, this completes the proof.

18.5 RCU and Gallager’s bound


Although the bounds we demonstrated so far will be sufficient for recovering the noisy channel
coding theorem later, they are not the best possible. Namely, for a given M one can show much
smaller upper bounds on the probability of error. Two such bounds are the so-called random-
coding union (RCU) and the Gallager’s bound, which we prove here. The main new ingredient
is that instead of using suboptimal (threshold) decoders as before, we will analyze the optimal
maximum likelihood decoder.

Theorem 18.8 (RCU bound) Fix a channel PY|X and an arbitrary input distribution PX . Then
for every integer M ≥ 1 there exists an (M, ϵ)-code with
    
ϵ ≤ E min 1, (M − 1)P i(X̄; Y) ≥ i(X; Y) X, Y , (18.14)
where the joint distribution of (X, X̄, Y) is as in (18.1).

Proof. For a given codebook (c1 , . . . cM ) the average probability of error for the maximum
likelihood decoder, cf. (18.8), is upper bounded by
 
1 X  [
M M
ϵ≤ P {i(cj ; Y) ≥ i(cm ; Y)} |X = cm  .
M
m=1 j=1;j̸=m

Note that we do not necessarily have equality here, since the maximum likelihood decoder will
resolves ties (i.e. the cases when multiple codewords maximize information density) in favor of
the correct codeword, whereas in the expression above we pessimistically assume that all ties are
resolved incorrectly. Now, similar to Shannon’s bound in Theorem 18.5 we prove existence of a
i.i.d.
good code by averaging the last expression over cj ∼ PX .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-360


i i

360

To that end, notice that expectations of each term in the sum coincide (by symmetry). To evalu-
ate this expectation, let us take m = 1 condition on W = 1 and observe that under this conditioning
we have
Y
M
(c1 , Y, c2 , . . . , cM ) ∼ PX,Y PX .
j=2

With this observation in mind we have the following chain:


 
[
M
P  {i(cj ; Y) ≥ i(c1 ; Y)} W = 1
j=2
  
[
M
= E(x,y)∼PX,Y P  {i(cj ; Y) ≥ i(c1 ; Y)} c1 = x, Y = y, W = 1
( a)

j=2
(b)    
≤ E min{1, (M − 1)P i(X̄; Y) ≥ i(X; Y) X, Y }

where (a) is just expressing the probability by first conditioning on the values of (c1 , Y); and (b)
corresponds to applying the union bound but capping the result by 1. This completes the proof
of the bound. We note that the step (b) is the essence of the RCU bound and corresponds to the
self-evident fact that for any collection of events Ej we have
X
P[∪Ej ] ≤ min{1, P[Ej ]} .
j

What is makes its application clever is that we first conditioned on (c1 , Y). If we applied the union
bound right from the start without conditioning, the resulting estimate on ϵ would have been much
weaker (in particular, would not have lead to achieving capacity).

It turns out that Shannon’s bound Theorem 18.5 is just a weakening of (18.14) obtained by
splitting the expectation according to whether or not i(X; Y) ≤ log M + τ and upper bounding
min{x, 1} by 1 when i(X; Y) ≤ log M + τ and by x otherwise. Another such weakening is a
famous Gallager’s bound [176], which in fact gives tight estimate of the exponent in the decay of
error probability over memoryless channels (Section 22.4*).

Theorem 18.9 (Gallager’s bound) Fix a channel PY|X , an arbitrary input distribution PX
and ρ ∈ [0, 1]. Then there exists an (M, ϵ) code such that
"  1+ρ #
i ( X̄; Y )
ϵ ≤ Mρ E E exp Y (18.15)
1+ρ

where again (X̄, Y) ∼ PX PY as in (18.1).

For a classical way of writing this bound see (22.13).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-361


i i

18.5 RCU and Gallager’s bound 361

Proof. We first notice that by Proposition 18.3 applied with f+ (x, y) = exp{ i1(+ρ
x;y)
} and
interchanged X and Y we have for PY -almost every y
ρ 1 1
E[exp{−i(X; Y) }|Y = y] = E[exp{i(X; Ȳ) }|Ȳ = y] = E[exp{i(X̄; Y) }|Y = y] ,
1+ρ 1+ρ 1+ρ
(18.16)
d
where we also used the fact that (X, Ȳ) = (X̄, Y) under (18.1).
Now, consider the bound (18.14) and replace the min via the bound

min{t, 1} ≤ tρ ∀t ≥ 0 . (18.17)

this results in
 
ϵ ≤ Mρ E P[i(X̄; Y) > i(X; Y)|X, Y]ρ . (18.18)

We apply the Chernoff bound


1 1
P[i(X̄; Y) > i(X; Y)|X, Y] ≤ exp{− i(X; Y)} E[exp{ i(X̄; Y)}|Y] .
1+ρ 1+ρ
Raising this inequality to ρ-power and taking expectation E[·|Y] we obtain
  1 ρ
E P[i(X̄; Y) > i(X; Y)|X, Y]ρ |Y ≤ Eρ [exp{ i(X̄; Y)|Y] E[exp{− i(X; Y)}|Y] .
1+ρ 1+ρ
The last term can be now re-expressed via (18.16) to obtain
  1
E P[i(X̄; Y) > i(X; Y)|X, Y]ρ |Y ≤ E1+ρ [exp{ i(X̄; Y)|Y] .
1+ρ
Applying this estimate to (18.18) completes the proof.

The key innovation of Gallager, namely the step (18.17), which became known as the ρ-trick,
corresponds to the following version of the union bound: For any events Ej and 0 ≤ ρ ≤ 1 we
have
   ρ
 X  X
P[∪Ej ] ≤ min 1, P [ Ej ] ≤  P[Ej ] .
 
j j

Now to understand properly the significance of Gallager’s bound we need to first define the concept
of the memoryless channels (see (19.1) below). For such channels and using the iid inputs, the
expression (18.15) turns, after optimization over ρ, into

ϵ ≤ exp{−nEr (R)} ,

where R = logn M is the rate and Er (R) is the Gallager’s random coding exponent. This shows that
not only the error probability at a fixed rate can be made to vanish, but in fact it can be made to
vanish exponentially fast in the blocklength. We will discuss such exponential estimates in more
detail in Section 22.4*.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-362


i i

362

18.6 Linear codes


So far in this Chapter we have shown existence of good error correcting codes by either doing
the random or maximal coding. The constructed codes have little structure. At the same time,
most codes used in practice are so-called linear codes and a natural question whether restricting
to linear codes leads to loss in performance. In this section we show that there exist good linear
codes as well. A pleasant property of linear codes is that Pe = Pe,max and, therefore, bounding
average probability of error (as in Shannon’s bound) automatically yields control of the maximal
probability of error as well.

Definition 18.10 (Linear codes) Let Fq denote the finite field of cardinality q (cf. Defini-
tion 11.7). Let the input and output space of the channel be X = Y = Fnq . We say a codebook
C = {cu : u ∈ Fkq } of size M = qk is a linear code if C is a k-dimensional linear subspace of Fnq .

A linear code can be equivalently described by:

• Generator matrix G ∈ Fkq×n , so that the codeword for each u ∈ Fkq is given by cu = uG
(row-vector convention) and the codebook C is the row-span of G, denoted by Im(G);
(n−k)×n
• Parity-check matrix H ∈ Fq , so that each codeword c ∈ C satisfies Hc⊤ = 0. Thus C is
the nullspace of H, denoted by Ker(H). We have HG⊤ = 0.

Example 18.1 (Hamming code) The [7, 4, 3]2 Hamming code over F2 is a linear code with
the following generator and parity check matrices:
 
1 0 0 0 1 1 0  
 0  1 1 0 1 1 0 0
1 0 0 1 0 1 
G=
 0 , H= 1 0 1 1 0 1 0 
0 1 0 0 1 1 
0 1 1 1 0 0 1
0 0 0 1 1 1 1

In particular, G and H are of the form G = [I; P] and H = [−P⊤ ; I] (systematic codes) so that
HG⊤ = 0. The following picture helps to visualize the parity check operation:

x5

x2 x1
x4
x7 x6
x3

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-363


i i

18.6 Linear codes 363

Note that all four bits in each circle (corresponding to a row of H) sum up to zero. One can verify
that the minimum distance of this code is 3 bits. As such, it can correct 1 bit of error and detect 2
bits of error.
Linear codes are almost always examined with channels of additive noise, a precise definition
of which is given below:

Definition 18.11 (Additive noise) A channel PY|X with input and output space Fnq is called
additive-noise if
PY|X (y|x) = PZ (y − x)
for some random vector Z taking values in Fnq . In other words, Y = X + Z, where Z ⊥
⊥ X.

Given a linear code and an additive-noise channel PY|X , it turns out that there is a special
“syndrome decoder” that is optimal.

Theorem 18.12 Any [k, n]Fq linear code over an additive-noise PY|X has a maximum likelihood
(ML) decoder g : Fnq → Fkq such that:

1 g(y) = y − gsynd (Hy⊤ ), i.e., the decoder is a function of the “syndrome” Hy⊤ only. Here gsynd :
Fnq−k → Fnq , defined by gsynd (s) ≜ argmaxz:Hz⊤ =s PZ (z), is called the “syndrome decoder”,
which decodes the most likely realization of the noise.
2 (Geometric uniformity) Decoding regions are translates of D0 = Im(gsynd ): Du = cu + D0 for
any u ∈ Fkq .
3 Pe,max = Pe .

In other words, syndrome is a sufficient statistic (Definition 3.8) for decoding a linear code.
Proof. 1 The maximum likelihood decoder for a linear code is
g(y) = argmax PY|X (y|c) = argmax PZ (y − c) = y − argmax PZ (z) = y − gsynd (Hy⊤ ).
c∈C c:Hc⊤ =0 z:Hz⊤ =Hy⊤

2 For any u, the decoding region


Du = {y : g(y) = cu } = {y : y−gsynd (Hy⊤ ) = cu } = {y : y−cu = gsynd (H(y−cu )⊤ )} = cu +D0 ,
where we used Hc⊤
u = 0 and c0 = 0.
3 For any u,
P[Ŵ 6= u|W = u] = P[g(cu +Z) 6= cu ] = P[cu +Z−gsynd (Hc⊤ ⊤ ⊤
u +HZ ) 6= cu ] = P[gsynd (HZ ) 6= Z].

Remark 18.5 (BSC example) As a concrete example, consider the binary symmetric channel
BSC⊗ n
δ previously considered in Example 17.1 and Section 17.2. This is an additive-noise channel
i.i.d.
over Fn2 , where Y = X + Z and Z = (Z1 , . . . , Zn ) ∼ Ber(δ). Assuming δ < 1/2, the syndrome

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-364


i i

364

decoder aims to find the noise realization with the fewest number of flips that is compatible with
the received codeword, namely gsynd (s) = argminz:Hz⊤ =s wH (z), where wH denotes the Hamming
weight. In this case elements of the image of gsynd , which we denoted by D0 , are known as “minimal
weight coset leaders”. Counting how many of them occur at each Hamming weight is a difficult
open problem even for the most well-studied codes such as Reed-Muller ones. In Hamming space
D0 looks like a Voronoi region of a lattice and Du ’s constitute a Voronoi tesselation of Fnq .

Overwhelming majority of practically used codes are in fact linear codes. Early in the history
of coding, linearity was viewed as a way towards fast and low-complexity encoding (just binary
matrix multiplication) and slightly lower complexity of the maximum-likelihood decoding (via the
syndrome decoder). As codes became longer and longer, though, the syndrome decoding became
impractical and today only those codes are used in practice for which there are fast and low-
complexity (suboptimal) decoders.

Theorem 18.13 (DT bound for linear codes) Let PY|X be an additive noise channel over
Fnq . For all integers k ≥ 1 there exists a linear code f : Fkq → Fnq with error probability:
  + 
− n−k−logq 1

Pe,max = Pe ≤ E q .
P Z ( Z)
(18.19)

Remark 18.6 The bound above is the same as Theorem 18.6 evaluated with PX = Unif(Fnq ).
The analogy between Theorems 18.6 and 18.13 is the same as that between Theorems 11.5 and
11.8 (full random coding vs random linear codes).

Proof. Recall that in proving the DT bound (Theorem 18.6), we selected the codewords
i.i.d.
c1 , . . . , cM ∼ PX and showed that

M−1
E[Pe (c1 , . . . , cM )] ≤ P[i(X; Y) ≤ γ] + P[i(X; Y) ≥ γ]
2

Here we will adopt the same approach and take PX = Unif(Fnq ) and M = qk .
By Theorem 18.12 the optimal decoding regions are translational invariant, i.e. Du = cu +
D0 , ∀u, and therefore:

Pe,max = Pe = P[Ŵ 6= u|W = u], ∀u.

Step 1: Random linear coding with dithering: Let codewords be chosen as

cu = uG + h, ∀u ∈ Fkq

where random G and h are drawn as follows: the k × n entries of G and the 1 × n entries
of h are i.i.d. uniform over Fq . We add the dithering to eliminate the special role that the
all-zero codeword plays (since it is contained in any linear codebook).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-365


i i

18.6 Linear codes 365

Step 2: We claim that the codewords are pairwise independent and uniform, i.e. ∀u 6= u′ ,
(cu , cu′ ) ∼ (X, X), where PX,X (x, x) = 1/q2n . To see this note that

cu ∼ uniform on Fnq
cu′ = u′ G + h = uG + h + (u′ − u)G = cu + (u′ − u)G

We claim that cu ⊥ ⊥ G because conditioned on the generator matrix G = G0 , cu ∼


uniform on Fnq due to the dithering h.
We also claim that cu ⊥ ⊥ cu′ because conditioned on cu , (u′ − u)G ∼ uniform on Fnq .
Thus random linear coding with dithering indeed gives codewords cu , cu′ pairwise
independent and are uniformly distributed.
Step 3: Repeat the same argument in proving DT bound for the symmetric and pairwise indepen-
dent codewords, we have
+ +
M − 1 + qk − 1
E[Pe (c1 , . . . , cM )] ≤ E[exp{− i(X; Y) − log }] = E[q− i(X;Y)−logq 2 ] ≤ E [ q− i(X;Y)−k
]
2
where we used M = qk and picked the base of log to be q.
Step 4: compute i(X; Y):
P Z ( b − a) 1
i(a; b) = logq = n − logq
q− n PZ (b − a)
therefore
+
− n−k−logq 1
Pe ≤ E [ q P Z ( Z)
] (18.20)

Step 5: Remove dithering h. We claim that there exists a linear code without dithering such
that (18.20) is satisfied. The intuition is that shifting a codebook has no effect on its
performance. Indeed,
• Before, with dithering, the encoder maps u to uG + h, the channel adds noise to produce
Y = uG + h + Z, and the decoder g outputs g(Y).
• Now, without dithering, we encode u to uG, the channel adds noise to produce Y =
uG + Z, and we apply decode g′ defined by g′ (Y) = g(Y + h).
By doing so, we “simulate” dithering at the decoder end and the probability of error
remains the same as before. Note that this is possible thanks to the additivity of the noisy
channel.

We see that random coding can be done with different ensembles of codebooks. For example,
we have

i.i.d.
• Shannon ensemble: C = {c1 , . . . , cM } ∼ PX – fully random ensemble.
• Elias ensemble [150]: C = {uG : u ∈ Fkq }, with the k × n generator matrix G drawn uniformly
at random from the set of all matrices. (This ensemble is used in the proof of Theorem 18.13.)
• Gallager ensemble: C = {c : Hc⊤ = 0}, with the (n − k) × n parity-check matrix H drawn
uniformly at random. Note this is not the same as the Elias ensemble.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-366


i i

366

• One issue with Elias ensemble is that with some non-zero probability G may fail to be full rank.
(It is a good exercise to find P [rank(G) < k] as a function of n, k, q.) If G is not full rank, then
there are two identical codewords and hence Pe,max ≥ 1/2. To fix this issue, one may let the
generator matrix G be uniform on the set of all k × n matrices of full (row) rank.
• Similarly, we may modify Gallager’s ensemble by taking the parity-check matrix H to be
uniform on all n × (n − k) full rank matrices.

For the modified Elias and Gallager’s ensembles, we could still do the analysis of random coding.
A small modification would be to note that this time (X, X̄) would have distribution
1
PX,X̄ = 1 {X 6= X′ }
q2n − qn
uniform on all pairs of distinct codewords and are not pairwise independent.
Finally, we note that the Elias ensemble with dithering, cu = uG + h, has pairwise independence
property and its joint entropy H(c1 , . . . , cM ) = H(G) + H(h) = (nk + n) log q. This is significantly
smaller than for Shannon’s fully random ensemble that we used in Theorem 18.5. Indeed, when
i.i.d.
cj ∼ Unif(Fnq ) we have H(c1 , . . . , cM ) = qk n log q. An interesting question, thus, is to find

min H(c1 , . . . , cM )

where the minimum is over all distributions with P[ci = a, cj = b] = q−2n when i 6= j (pairwise
independent, uniform codewords). Note that H(c1 , . . . , cM ) ≥ H(c1 , c2 ) = 2n log q. Similarly, we
may ask for (ci , cj ) to be uniform over all pairs of distinct elements. In this case, the Wozencraft
ensemble (see Exercise IV.7) for the case of n = 2k achieves H(c1 , . . . , cqk ) ≈ 2n log q, which is
essentially our lower bound.

18.7 Why random and maximal coding work well?


As we will see later the bounds developed in this chapter are very tight both asymptotically and
non-asymptotically. That is, the codes constructed by the apparently rather naive processes of ran-
domly selecting codewords or a greedily growing the codebook turn out to be essentially optimal
in many ways. An additional mystery is that the bounds we obtained via these two rather different
processes are virtually the same. These questions have puzzled researchers since the early days of
information theory.
A rather satisfying reason was finally given in an elegant work of Barman and Fawzi [36].
Before going into the details, we want to vocalize explicitly the two questions we want to address:

1 Why is greedy procedure close to optimal?


2 Why is random coding procedure (with a simple PX ) close to optimal?

In short, we will see that the answer is that both of these methods are well-known to be (almost)
optimal for submodular function maximization, and this is exactly what channel coding is about.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-367


i i

18.7 Why random and maximal coding work well? 367

Before proceeding, we notice that in the second question it is important to qualify that PX
is simple, since taking PX to be supported on the optimal M∗ (ϵ)-achieving codebook would of
course result in very good performance. However, instead we will see that choosing rather simple
PX already achieves a rather good lower bound on M∗ (ϵ). More explicitly, by simple we mean a
product distribution for the memoryless channel. Or, as an even better example to have in mind,
consider an additive-noise vector channel:

Yn = Xn + Zn

with addition over a product abelian group and arbitrary (even non-memoryless) noise Zn . In this
case the choice of uniform PX in random coding bound works, and is definitely “simple”.
The key observation of [36] is submodularity of the function mapping a codebook C ⊂ X to
the |C|(1 − Pe,MAP (C)), where Pe,MAP (C) is the probability of error under the MAP decoder (17.5).
(Recall (1.8) for the definition of submodularity.) More explicitly, consider a discrete Y and define
X
S(C) ≜ max PY|X (y|x) , S(∅) = 0
x∈C
y∈Y

It is clear that S(C) is submodular non-decreasing as a sum of submodular non-decreasing func-


tions max (i.e. T 7→ maxx∈T ϕ(x) is submodular for any ϕ). On the other hand, Pe,MAP (C) =
1 − |C|1
S(C), and thus search for the minimal error codebook is equivalent to maximizing the
set-function S.
The question of finding

S∗ (M) ≜ max S(C)


|C|≤M

was algorithmically resolved in a groundbreaking work of [313] showing (approximate) optimality


of the greedy process. Consider, the following natural greedy process of constructing a sequence
of good sets Ct . Start with C0 = ∅. At each step find any

xt+1 ∈ argmax S(Ct ∪ {x})


x̸∈Ct

and set

Ct+1 = Ct ∪ {xt+1 } .

They showed that

S(Ct ) ≥ (1 − 1/e) max S(C) .


|C|=t

In other words, the probability of successful (MAP) decoding for the greedily constructed code-
book is at most a factor (1 − 1/e) away from the largest possible probability of success among all
codebooks of the same cardinality. Since we are mostly interested in success probabilities very
close to 1, this result may not appear very exciting. However, a small modification of the argument
yields the following (see [257, Theorem 1.5] for the proof):

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-368


i i

368

Theorem 18.14 ([313]) For any non-negative submodular set-function f and a greedy
sequence Ct we have for all ℓ, t:

f(Cℓ ) ≥ (1 − e−ℓ/t ) max f(C) .


|C|=t

Applying this to the special case of f(·) = S(·) we obtain the result of [36]: The greedily
constructed codebook C ′ with M′ codewords satisfies
M ′
1 − Pe,MAP (C ′ ) ≥ (1 − e−M /M )(1 − ϵ∗ (M)) .
M′
In particular, the greedily constructed code with M′ = M2−10 achieves probability of success that
is ≥ 0.9995(1 −ϵ∗ (M)). In other words, compared to the best possible code a greedy code carrying
10 bits fewer of data suffers at most 5 · 10−4 worse probability of error. This is a very compelling
evidence for why greedy construction is so good. We do note, however, that Feinstein’s bound
does greedy construction not with the MAP decoder, but with a suboptimal one.
Next we address the question of random coding. Recall that our goal is to explain how can
selecting codewords uniformly at random from a “simple” distribution PX be any good. The key
idea is again contained in [313]. The set-function S(C) can also be understood as a function with
domain {0, 1}|X | . Here is a natural extension of this function to the entire solid hypercube [0, 1]|X | :
X X
SLP (π ) = sup{ PY|X (y|x)rx,y : 0 ≤ rx,y ≤ π x , rx,y ≤ 1} . (18.21)
x, y x

Indeed, it is easy to see that SLP (1C ) = S(C) and that SLP is a concave function.3
Since SLP is an extension of S it is clear that
X
S∗ (M) ≤ S∗LP (M) ≜ max{SLP (π ) : 0 ≤ π x ≤ 1, π x ≤ M} . (18.22)
x

In fact, we will see later in Section 22.3 that this bound coincides with the bound known as
meta-converse. Surprisingly, [313] showed that the greedy construction not only achieves a large
multiple of S∗ (M) but also of S∗LP (M):

S(CM ) ≥ (1 − e−1 )S∗LP (M) . (18.23)


P
The importance of this result (which is specific to submodular functions C 7→ y maxx∈C g(x, y))
is that it gave one of the first integrality gap results relating the value of combinatorial optimization
S∗ (M) and a linear program relaxation S∗LP (M): (1 − e−1 )S∗LP (M) ≤ S∗ (M) ≤ S∗LP (M).
An extension of (18.23) similar to the preceding theorem can also be shown: for all M′ , M we
have

S(CM′ ) ≥ (1 − e−M /M )S∗LP (M) .

3
There are a number of standard extensions of a submodular function f to a hypercube. The largest convex interpolant f+ ,
also known as Lovász extension, the least concave interpolant f− , and multi-linear extension [80]. However, SLP does not
coincide with any of these and in particular strictly larger (in general) than f− .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-369


i i

18.7 Why random and maximal coding work well? 369

To connect to the concept of random coding, though, we need the following result of [36]:4
P
Theorem 18.15 Fix π ∈ [0, 1]|X | and let M = x∈X π x . Let C = {c1 , . . . , cM′ } with
i.i.d.
cj ∼ PX (x) = πx
M. Then we have

E[S(C)] ≥ (1 − e−M /M )SLP (π ) .

The proof of this result trivially follows from applying the following lemma with g(x) =
PY|X (y|x), summing over y and recalling the definition of SLP in (18.21).

Theorem. Let g : X → R be any function and denote


Lemma 18.16P Let π and C be as in P
T(π , g) = max{ x r x g ( x) : 0 ≤ rx ≤ π x , x rx = 1}. Then

E[max g(x)] ≥ (1 − e−M /M )T(π , g) .
x∈C

Proof. Without loss of generality we take X = [m] and g(1) ≥ g(2) ≥ · · · ≥ g(m) ≥ g(m + 1) ≜
′ ′
0. Denote for convenience a = 1 − (1 − M1 )M ≥ 1 − e−M /M , b(j) ≜ P[{1, . . . , j} ∩ C 6= ∅]. Then
P[max g(x) = g(j)] = b(j) − b(j − 1) ,
x∈C

and from the summation by parts we get


X
m
E[max g(x)] = (g(j) − g(j + 1))b(j) . (18.24)
x∈C
j=1
P
On the other hand, denoting c(j) = min( i≤j π i , 1). Now from the definition of b(j) we have
π1 + . . . πj ℓ c(j) M′
b( j) = 1 − ( 1 − ) ≥ 1 − (1 − ) .
M M
x M′
From the simple inequality (1 − M) ≤ 1 − ax (valid for any x ∈ [0, 1]) we get

b(j) ≥ ac(j) .
Plugging this into (18.24) we conclude the proof by noticing that rj = c(j) − c(j − 1) attains the
maximum in the definition of T(π , g).
Theorem 18.15 completes this section’s goal and shows that the random coding (as well as the
greedy/maximal coding) attains an almost optimal value of S∗ (M). Notice also that the random
coding distribution that we should be using is the one that attains the definition of S∗LP (M). For input
symmetric channels (such as additive noise ones) it is easy to show that the optimal π ∈ [0, 1]X is
a constant vector, and hence the codewords are to be generated iid uniformly on X .

4
There are other ways of doing “random coding” to produce an integer solution from a fractional one. For example, see the
multi-linear extension based one in [80].

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-370


i i

19 Channel capacity

In this chapter we apply methods developed in the previous chapters (namely the weak converse
and the random/maximal coding achievability) to compute the channel capacity. This latter notion
quantifies the maximal amount of (data) bits that can be reliably communicated per single channel
use in the limit of using the channel many times. Formalizing the latter statement will require
introducing the concept of a communication channel. Then for special kinds of channels (the
memoryless and the information stable ones) we will show that computing the channel capacity
reduces to maximizing the (sequence of the) mutual informations. This result, known as Shannon’s
noisy channel coding theorem, is the third example of a coding theorem in this book. It connects the
value of an operationally defined (discrete, combinatorial) optimization problem over codebooks
to that of a (convex) optimization problem over information measures. It builds a bridge between
the abstraction of information measures (Part I) and a practical engineering problem of channel
coding.
Information theory as a subject is sometimes accused of “asymptopia”, or the obsession with
asymptotic results and computing various limits. Although in this book we attempt to refrain from
asymptopia, the topic of this chapter requires committing this sin ipso facto. After proving capacity
theorems in various settings, we conclude the Chapter with Shannon’s separation theorem, that
shows that any (stationary ergodic) source can be communicated over an (information stable)
channel if and only its entropy rate is smaller than the channel capacity. Furthermore, doing so
can be done by first compressing a source to pure bits and then using channel code to match those
bits to channel inputs. The fact that no performance is lost in the process of this conversion to bits
had important historical consequence in cementing bits as the universal currency of the digital
age.

19.1 Channels and channel capacity


As we discussed in Chapter 17 the main information-theoretic question of data transmission is the
following: How many bits can one transmit reliably if one is allowed to use a given noisy channel
n times? The normalized quantity equal to the number of message bits per channel use is known as
rate, and capacity refers to the highest achievable rate under a small probability of decoding error.
However, what does it mean to “use channel several times”? How do we formalize the concept of
a channel use? To that end, we need to change the meaning of the term “channel”. So far in this
book we have used the term channel as a synonym of the Markov kernel (Definition 2.10). More
correctly, however, this term should be used to refer to the following notion.

370

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-371


i i

19.1 Channels and channel capacity 371

Definition 19.1 Fix an input alphabet A and an output alphabet B. A sequence of Markov
kernels PYn |Xn : An → B n indexed by the integer n = 1, 2 . . . is called a channel. The length of
the input n is known as blocklength.

To give this abstract notion more concrete form one should recall Section 17.2, in which we
described the BSC channel. Note, however, that despite this definition, it is customary to use the
term channel to refer to a single Markov kernel (as we did before in this book). An even worse,
yet popular, abuse of terminology is to refer to n-th element of the sequence, the kernel PYn |Xn , as
the n-letter channel.
Although we have not imposed any requirements on the sequence of kernels PYn |Xn , one is never
interested in channels at this level of generality. Most of the time the elements of the channel input
Xn = (X1 , . . . , Xn ) are thought as indexed by time. That is the Xt corresponds to the letter that is
transmitted at time t inside an overall block of length n, while Yt is the letter received at time t.
The channel’s action is that of “adding noise” to Xt and outputting Yt . However, the generality of
the previous definition allows to model situations where the channel has internal state, so that the
amount and type of noise added to Xt depends on the previous inputs and in principle even on the
future inputs. The interpretation of t as time, however, is not exclusive. In storage (magnetic, non-
volatile or flash) t indexes space. In those applications, the noise may have a rather complicated
structure with transformation Xt → Yt depending on both the “past” X<t and the “future” X>t .
Almost all channels of interest satisfy one or more of the restrictions that we list next:

• A channel is called non-anticipatory if it has the following extension property. Under the n-letter
kernel PYn |Xn , the conditional distribution of the first k output symbols Yk only depends on Xk
(and not on Xnk+1 ) and coincides with the kernel PYk |Xk (the k-th element of the channel sequence)
the k-th channel transition kernel in the sequence. This requirement models the scenario wherein
channel outputs depend causally on the inputs.
• A channel is discrete if A and B are finite.
• A channel is additive-noise if A = B are abelian group and Yn = Xn + Zn for some Zn
independent of Xn (see Definition 18.11). Thus

PYn |Xn (yn |xn ) = PZn (yn − xn ).

• A channel is memoryless if PYn |Xn factorizes into a product distribution. Namely,

Y
n
PYn |Xn = PYk |Xk . (19.1)
k=1

where each PYk |Xk : A → B ; in particular, PYn |Xn are compatible at different blocklengths n.
• A channel is stationary memoryless if (19.1) is satisfied with PYk |Xk not depending on k, denoted
commonly by PY|X . In other words,

PYn |Xn = (PY|X )⊗n . (19.2)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-372


i i

372

Channel Bipartite graph Channel matrix

δ̄
1 1
δ [ ]
δ̄ δ
BSCδ
δ δ δ̄
0 0
δ̄

δ̄
1 1
δ
[ ]
? δ̄ δ 0
BECδ
δ 0 δ δ̄
0 0
δ̄

1 1
δ
[ ]
1 0
Z-channel
0 0 δ δ̄
δ̄

Table 19.1 Examples of DMCs.

Thus, in discrete cases, we have


Y
n
PYn |Xn (yn |xn ) = PY|X (yi |xi ). (19.3)
k=1

The interpretation is that each coordinate of the transmitted codeword Xn is corrupted by noise
independently with the same noise statistic.
• Discrete memoryless stationary channel (DMC): A DMC is a channel that is both discrete and
stationary memoryless. It can be specified in two ways:
– an |A| × |B|-dimensional (row-stochastic) matrix PY|X where elements specify the transition
probabilities;
– a bipartite graph with edge weight specifying the non-zero transition probabilities.
Table 19.1 lists some common binary-input binary-output DMCs: the binary symmetric channel
(BSC), the binary symmetric channel (BEC), and the Z-channel.

As another example, let us recall the AWGN channel in Example 3.3: the alphabets A = B =
R and Yn = Xn + Zn , with Xn ⊥ ⊥ Zn ∼ N (0, σ 2 In ). This channel is a non-discrete, stationary
memoryless, additive-noise channel.
Having defined the notion of the channel, we can define next the operational problem that the
communication engineer faces when tasked with establishing a data link across the channel. Since
the channel is noisy, the data is not going to pass unperturbed and the error correcting codes are

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-373


i i

19.1 Channels and channel capacity 373

naturally to be employed. To send one of M = 2k messages (or k data bits) with low probabil-
ity of error, it is often desirable to use the shortest possible length of the input sequence. This
desire explains the following definitions, which extend the fundamental limits in Definition 17.2
to involve the blocklength n.

Definition 19.2 (Fundamental Limits of Channel Coding)

• An (n, M, ϵ)-code is an (M, ϵ)-code for PYn |Xn , consisting of an encoder f : [M] → An and a
decoder g : B n → [M] ∪ {e}.
• An (n, M, ϵ)max -code is analogously defined for the maximum probability of error.

The (non-asymptotic) fundamental limits are

M∗ (n, ϵ) = max{M : ∃ (n, M, ϵ)-code}, (19.4)


M∗max (n, ϵ) = max{M : ∃ (n, M, ϵ)max -code}

ϵ (n, M) = inf{ϵ : ∃(n, M, ϵ)-code} (19.5)
ϵ∗max (n, M) = inf{ϵ : ∃(n, M, ϵ)max -code} .

We will mostly focus on understanding M∗ (n, ϵ) and a relate quantity known as rate. Recall that
blocklength n measures the amount of time or space resource used by the code. Thus, it is natural
to maximize the ratio of the data transmitted to the resource used, and that leads us to the notion of
log M
the transmission rate defined as R = n2 and equal to the number of bits transmitted per channel
use. Consequently, instead of studying M∗ (n, ϵ) one is lead to the study of 1n log M∗ (n, ϵ). A natural
first question is to determine the first-order asymptotics of this quantity and this motivates the final
definition of the Section.

Definition 19.3 (Channel capacity) The ϵ-capacity Cϵ and the Shannon capacity C are
defined as follows
1
Cϵ ≜ lim inf log M∗ (n, ϵ);
n→∞ n
C = lim Cϵ .
ϵ→0+

Channel capacity is measured in information units per channel use, e.g. “bit/ch.use”.

The operational meaning of Cϵ (resp. C) is the maximum achievable rate at which one can
communicate through a noisy channel with probability of error at most ϵ (resp. o(1)). In other
words, for any R < C, there exists an (n, exp(nR), ϵn )-code, such that ϵn → 0. In this vein, Cϵ and
C can be equivalently defined as follows:

Cϵ = sup{R : ∀δ > 0, ∃n0 (δ), ∀n ≥ n0 (δ), ∃(n, exp(n(R − δ)), ϵ)-code}


C = sup{R : ∀ϵ > 0, ∀δ > 0, ∃n0 (δ, ϵ), ∀n ≥ n0 (δ, ϵ), ∃(n, exp(n(R − δ)), ϵ)-code}

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-374


i i

374

The reason that capacity is defined as a large-n limit (as opposed to a supremum over n) is because
we are concerned with rate limit of transmitting large amounts of data without errors (such as in
communication and storage).
The case of zero-error (ϵ = 0) is so different from ϵ > 0 that the topic of ϵ = 0 constitutes a
separate subfield of its own (cf. the survey [252]). Introduced by Shannon in 1956 [379], the value

1
C0 ≜ lim inf log M∗ (n, 0) (19.6)
n→∞ n

is known as the zero-error capacity and represents the maximal achievable rate with no error
whatsoever. Characterizing the value of C0 is often a hard combinatorial problem. However, for
many practically relevant channels it is quite trivial to show C0 = 0. This is the case, for example,
for the DMCs we considered before: the BSC or BEC. Indeed, for them we have log M∗ (n, 0) = 0
for all n, meaning transmitting any amount of information across these channels requires accepting
some (perhaps vanishingly small) probability of error. Nevertheless, there are certain interesting
and important channels for which C0 is positive, cf. Section 23.3.1 for more.
As a function of ϵ the Cϵ could (most generally) behave like the plot below on the left-hand
side below. It may have a discontinuity at ϵ = 0 and may be monotonically increasing (possibly
even with jump discontinuities) in ϵ. Typically, however, Cϵ is zero at ϵ = 0 and stays constant for
all 0 < ϵ < 1 and, hence, coincides with C (see the plot on the right-hand side). In such cases we
say that the strong converse holds (more on this later in Section 22.1).

Cǫ Cǫ

strong converse
holds

Zero error b
C0
Capacity
ǫ ǫ
0 1 0 1

In Definition 19.3, the capacities Cϵ and C are defined with respect to the average probabil-
ity of error. By replacing M∗ with M∗max , we can define, analogously, the capacities Cϵ
(max)
and
(max)
C with respect to the maximal probability of error. It turns out that these two definitions are
equivalent, as the next theorem shows.

Theorem 19.4 ∀τ ∈ (0, 1),

τ M∗ (n, ϵ(1 − τ )) ≤ M∗max (n, ϵ) ≤ M∗ (n, ϵ)

Proof. The second inequality is obvious, since any code that achieves a maximum error
probability ϵ also achieves an average error probability of ϵ.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-375


i i

19.2 Shannon’s noisy channel coding theorem 375

For the first inequality, take an (n, M, ϵ(1 − τ ))-code, and define the error probability for the jth
codeword as

λj ≜ P[Ŵ 6= j|W = j]

Then
X X X
M(1 − τ )ϵ ≥ λj = λj 1 {λj ≤ ϵ} + λj 1 {λj > ϵ} ≥ ϵ|{j : λj > ϵ}|.

Hence |{j : λj > ϵ}| ≤ (1 − τ )M. (Note that this is exactly Markov inequality.) Now by removing
those codewords1 whose λj exceeds ϵ, we can extract an (n, τ M, ϵ)max -code. Finally, take M =
M∗ (n, ϵ(1 − τ )) to finish the proof.

Corollary 19.5 (Capacity under maximal probability of error) C(ϵmax) = Cϵ for all
ϵ > 0 such that Cϵ = Cϵ− . In particular, C(max) = C.

Proof. Using the definition of M∗ and the previous theorem, for any fixed τ > 0
1
Cϵ ≥ C(ϵmax) ≥ lim inf log τ M∗ (n, ϵ(1 − τ )) ≥ Cϵ(1−τ )
n→∞ n
(max)
Sending τ → 0 yields Cϵ ≥ Cϵ ≥ Cϵ− .

19.2 Shannon’s noisy channel coding theorem


Now that we have the basic definitions for Shannon capacity, we define another type of capac-
ity, and show that for a stationary memoryless channels, these two notions (“operational” and
“information”) of capacity coincide.

Definition 19.6 The information capacity of a channel is


1
C(I) = lim inf sup I(Xn ; Yn ),
n→∞ n P Xn
where for each n the supremum is taken over all joint distributions PXn on An .

Note that information capacity C(I) so defined is not the same as the Shannon capacity C in Def-
inition 19.3; as such, from first principles it has no direct interpretation as an operational quantity
related to coding. Nevertheless, they are related by the following coding theorems. We start with
a converse result:

C(I)
Theorem 19.7 (Upper Bound for Cϵ ) For any channel, ∀ϵ ∈ [0, 1), Cϵ ≤ 1−ϵ and C ≤ C(I) .

1
This operation is usually referred to as expurgation which yields a smaller code by killing part of the codebook to reach a
desired property.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-376


i i

376

Proof. Applying the general weak converse bound in Theorem 17.3 to PYn |Xn yields
supPXn I(Xn ; Yn ) + h(ϵ)
log M∗ (n, ϵ) ≤
1−ϵ
Normalizing this by n and taking the lim inf as n → ∞, we have
1 1 supPXn I(Xn ; Yn ) + h(ϵ) C(I)
Cϵ = lim inf log M∗ (n, ϵ) ≤ lim inf = .
n→∞ n n→∞ n 1−ϵ 1−ϵ

Next we give an achievability bound:

Theorem 19.8 (Lower Bound for Cϵ ) For a stationary memoryless channel, Cϵ ≥


supPX I(X; Y), for any ϵ ∈ (0, 1].

Proof. Fix an arbitrary PX on A and let PXn = P⊗ n


X be an iid product of a single-letter distribution
PX . Recall Shannon’s achievability bound Theorem 18.5 (or any other one from Chapter 18 would
work just as well). From that result we know that for any n, M and any τ > 0, there exists an
(n, M, ϵn )-code with
ϵn ≤ P[i(Xn ; Yn ) ≤ log M + τ ] + exp(−τ )
Here the information density is defined with respect to the distribution PXn ,Yn = P⊗ n
X,Y and, therefore,

X
n
dPX,Y Xn
i(Xn ; Yn ) = log (Xk , Yk ) = i(Xk ; Yk ),
dPX PY
k=1 k=1
n n n n
where i(x; y) = iPX,Y (x; y) and i(x ; y ) = iPXn ,Yn (x ; y ). What is important is that under PXn ,Yn the
random variable i(Xn ; Yn ) is a sum of iid random variables with mean I(X; Y). Thus, by the weak
law of large numbers we have
P[i(Xn ; Yn ) < n(I(X; Y) − δ)] → 0
for any δ > 0.
With this in mind, let us set log M = n(I(X; Y) − 2δ) for some δ > 0, and take τ = δ n in
Shannon’s bound. Then for the error bound we have
" n #
X n→∞
ϵn ≤ P i(Xk ; Yk ) ≤ nI(X; Y) − δ n + exp(−δ n) −−−→ 0, (19.7)
k=1

Since the bound converges to 0, we have shown that there exists a sequence of (n, Mn , ϵn )-codes
with ϵn → 0 and log Mn = n(I(X; Y) − 2δ). Hence, for all n such that ϵn ≤ ϵ
log M∗ (n, ϵ) ≥ n(I(X; Y) − 2δ)
And so
1
Cϵ = lim inf log M∗ (n, ϵ) ≥ I(X; Y) − 2δ
n→∞
n
Since this holds for all PX and all δ > 0, we conclude Cϵ ≥ supPX I(X; Y).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-377


i i

19.2 Shannon’s noisy channel coding theorem 377

The following result follows from pairing the upper and lower bounds on Cϵ .

Theorem 19.9 (Shannon’s channel coding theorem [378]) For a stationary memory-
less channel,
C = C(I) = sup I(X; Y). (19.8)
PX

As we mentioned several times already this result is among the most significant results in
information theory. From the engineering point of view, the major surprise was that C > 0,
i.e. communication over a channel is possible with strictly positive rate for any arbitrarily small
probability of error. The way to achieve this is to encode the input data jointly (i.e. over many
input bits together). This is drastically different from the pre-1948 methods, which operated on
a letter-by-letter bases (such as Morse code). This theoretical result gave impetus (and still gives
guidance) to the evolution of practical communication systems – quite a rare achievement for an
asymptotic mathematical fact.
Proof. Statement (19.8) contains two equalities. The first one follows automatically from the
second and Theorems 19.7 and 19.8. To show the second equality C(I) = supPX I(X; Y), we note
that for stationary memoryless channels C(I) is in fact easy to compute. Indeed, rather than solving
a sequence of optimization problems (one for each n) and taking the limit of n → ∞, memory-
lessness of the channel implies that only the n = 1 problem needs to be solved. This type of result
is known as single-letterization (or tensorization) in information theory and we show it formally
in the following proposition, which concludes the proof.

Proposition 19.10 (Tensorization of capacity)

• For memoryless channels,


X
n
sup I(Xn ; Yn ) = sup I(Xi ; Yi ).
PXn PXi
i=1

• For stationary memoryless channels,


C(I) = sup I(X; Y).
PX

Q
Proof. Recall that from Theorem 6.1 we know that for product kernels PYn |Xn = PYi |Xi , mutual
P n
information satisfies I(Xn ; Yn ) ≤ k=1 I(Xk ; Yk ) with equality whenever Xi ’s are independent.
Then
1
C(I) = lim inf sup I(Xn ; Yn ) = lim inf sup I(X; Y) = sup I(X; Y).
n→∞ n P n n→∞ PX PX
X

Shannon’s noisy channel theorem shows that by employing codes of large blocklength, we can
approach the channel capacity arbitrarily close. Given the asymptotic nature of this result (or any

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-378


i i

378

other asymptotic result), a natural question is understanding the price to pay for reaching capacity.
This can be understood in two ways:

1 The complexity of achieving capacity: Is it possible to find low-complexity encoders and


decoders with polynomial number of operations in the blocklength n which achieve the capac-
ity? This question was resolved by Forney [172] who showed that this is possible in linear time
with exponentially small error probability.
Note that if we are content with polynomially small probability of error, e.g., Pe = O(n−100 ),
then we can construct polynomial-time decodable codes as follows. First, it can be shown that
with rate strictly below capacity, the error probability of optimal codes decays exponentially
w.r.t. the blocklength. Now divide the block of length n into shorter block of length c log n and
apply the optimal code for blocklength c log n with error probability n−101 . The by the union
bound, the whole block has error with probability at most n−100 . The encoding and exhaustive-
search decoding are obviously polynomial time.
2 The speed of achieving capacity. Suppose we are content with achieving 90% of the capacity.
Then the question is how large a blocklength do we need to take in order for that to be possible?
(Blocklength is directly related to latency, and, thus, equivalently we may ask how much of a
delay is incurred by the desire to achieve 90% of capacity.) In other words, we want to know
how fast the gap to capacity vanish as blocklength grows. Shannon’s theorem shows that the
gap C − 1n log M∗ (n, ϵ) = o(1). Next theorem shows that under proper conditions, the o(1) term
is in fact O( √1n ).

The main tool in the proof of Theorem 19.8 was the law of large numbers. The lower bound
Cϵ ≥ C(I) in Theorem 19.8 shows that log M∗ (n, ϵ) ≥ nC + o(n) (this just restates the fact
that normalizing by n and taking the lim inf must result in something ≥ C). If instead we apply
a more careful analysis using the central limit theorem (CLT), we obtain the following refined
achievability result.

Theorem 19.11 Consider a stationary memoryless channel with a capacity-achieving input


distribution. Namely, C = maxPX I(X; Y) = I(X∗ ; Y∗ ) is attained at P∗X , which induces PX∗ Y∗ =
PX∗ PY|X . Assume that V = Var[i(X∗ ; Y∗ )] < ∞. Then
√ √
log M∗ (n, ϵ) ≥ nC − nVQ−1 (ϵ) + o( n),

where Q(·) is the complementary Gaussian CDF and Q−1 (·) is its functional inverse.

Proof. Writing the little-o notation in terms of lim inf, our goal is
log M∗ (n, ϵ) − nC
lim inf √ ≥ −Q−1 (ϵ) = Φ−1 (ϵ),
n→∞ nV
where Φ(t) is the standard normal CDF.
Recall Feinstein’s bound

∃(n, M, ϵ)max : M ≥ β (ϵ − P[i(Xn ; Yn ) ≤ log β])

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-379


i i

19.3 Examples of capacity computation 379


Take log β = nC + nVt, then applying the CLT gives
√  hX √ i
log M ≥ nC + nVt + log ϵ − P i(Xk ; Yk ) ≤ nC + nVt

=⇒ log M ≥ nC + nVt + log (ϵ − Φ(t)) ∀Φ(t) < ϵ
log M − nC log(ϵ − Φ(t))
=⇒ √ ≥t+ √ ,
nV nV
where Φ(t) is the standard normal CDF. Taking the liminf of both sides
log M∗ (n, ϵ) − nC
lim inf √ ≥ t,
n→∞ nV
for all t such that Φ(t) < ϵ. Finally, taking t % Φ−1 (ϵ), and writing the liminf in little-oh notation
completes the proof
√ √
log M∗ (n, ϵ) ≥ nC − nVQ−1 (ϵ) + o( n).

Remark 19.1 Theorem 19.9 implies that for any R < C, there exists a sequence of
(n, exp(nR), ϵn )-codes such that the probability of error ϵn vanishes as n → ∞. Examining the
upper bound (19.7), we see that the probability of error actually vanishes exponentially fast, since
the event in the first term is of large-deviations type (recall Chapter 15) so that both terms are
exponentially small. Finding the value of the optimal exponent (or even the existence thereof) has
a long history (but remains generally open) in information theory, see Section 22.4*. Recently,
however, it was understood that a practically more relevant, and also much easier to analyze, is
the regime of fixed (non-vanishing) error ϵ, in which case the main question is to bound the speed
of convergence of R → Cϵ = C. Previous theorem shows one bound on this speed of convergence.
The optimal √1n coefficient is known as channel dispersion, see Sections 22.5 and 22.6 for more.

In particular, we will show that the bound on the n term in Theorem 19.11 is often tight.

19.3 Examples of capacity computation


We compute the capacities of the simple DMCs from Table 19.1 and plot them in Figure 19.1.
First, for the BSCδ we have the following description of the input-output law:

Y = X + Z mod 2, Z ∼ Ber(δ) ⊥
⊥ X.

To compute the capacity, let us notice

I(X; X + Z) = H(X + Z) − H(X + Z|X) = H(X + Z) − H(Z) ≤ log 2 − h(δ)

with equality iff X ∼ Ber(1/2). Hence we have shown

C = sup I(X; Y) = log 2 − h(δ).


PX

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-380


i i

380

C
C C
1 bit
1 bit 1 bit

δ
0 1 1 δ δ
2 0 1 0 1
BSCδ BECδ Z-channel

Figure 19.1 Capacities of three simple DMCs.

More generally, recalling Example 3.7, for any additive-noise channel over a finite abelian
group G, we have C = supPX I(X; X + Z) = log |G| − H(Z), achieved by X ∼ Unif(G). Similarly,
for a group-noise channel acting over a non-abelian group G by x 7→ x ◦ Z, Z ∼ PZ we also have
capacity equal log |G| − H(Z) and achieved by X ∼ Unif(G).
Next we consider the BECδ . This is a multiplicative channel. Indeed, if we equivalently redefine
the input X ∈ {±1} and output Y ∈ {±1, 0}, then BEC relation can be written as

Y = XZ, Z ∼ Ber(δ) ⊥
⊥ X.

To compute the capacity, we first notice that even without evaluating Shannon’s formula, it is clear
that C ≤ 1 −δ (bit), because for a large blocklength n about δ -fraction of the message is completely
lost (even if the encoder knows a priori where the erasures are going to occur, the rate still cannot
exceed 1 − δ ). More formally, we notice that P[X = 1|Y = 0] = P[X= δ
1]δ
= P[X = 1] and therefore

I(X; Y) = H(X) − H(X|Y) = H(X) − H(X|Y = e) ≤ (1 − δ)H(X) ≤ (1 − δ) log 2 ,

with equality iff X ∼ Ber(1/2). Thus we have shown

C = sup I(X; Y) = 1 − δ bits.


PX

Finally, the Z-channel can also be thought of as a multiplicative channel with transition law

Y = XZ, X ∈ { 0, 1} ⊥
⊥ Z ∼ Ber(1 − δ) ,

so that P[Z = 0] = δ . For this channel if X ∼ Ber(p) we have

I(X; Y) = H(Y) − H(Y|X) = h(p(1 − δ)) − ph(δ) .

Optimizing over p we get that the optimal input is given by


1 1
p∗ (δ) = .
1 − δ 1 + exp{ h(δ) }
1−δ

The capacity-achieving input distribution p (δ) monotonically decreases from 12 when δ = 0 to 1e
when δ → 1. (Infamously, there is no “explanation” for this latter limiting value.) For the capacity,

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-381


i i

19.4* Symmetric channels 381

thus, we get

C = h(p∗ (δ)(1 − δ)) − p∗ (δ)h(δ) .

19.4* Symmetric channels


Definition 19.12 A pair of measurable maps f = (fi , fo ) is a symmetry of PY|X if
PY|X (f−
o (E)|fi (x)) = PY|X (E|x) ,
1

for all measurable E ⊂ Y and x ∈ X . Two symmetries f and g can be composed to produce another
symmetry as

( gi , go ) ◦ ( fi , fo ) ≜ ( gi ◦ fi , fo ◦ go ) . (19.9)

A symmetry group G of PY|X is any collection of invertible symmetries (automorphisms) closed


under the group operation (19.9).

Note that both components of an automorphism f = (fi , fo ) are bimeasurable bijections, that is
fi , f− 1 −1
i , fo , fo are all measurable and well-defined functions.
Naturally, every symmetry group G possesses a canonical left action on X × Y defined as

g · (x, y) ≜ (gi (x), g− 1


o (y)) . (19.10)

Since the action on X × Y splits into actions on X and Y , we will abuse notation slightly and write

g · ( x, y) ≜ ( g x , g y ) .

Let us assume in addition that our group G can be equipped with a σ -algebra σ(G) such that
the maps h 7→ hg and h 7→ gh are measurable for each g ∈ G. We say that a probability measure μ
on (G, σ(G)) is a left-invariant Haar measure if when H ∼ μ we also have gH ∼ μ for any g ∈ G.
(See also Exercise V.23.) Existence of Haar measure is trivial for finite (and compact) groups, but
in general is a difficult subject. To proceed we need to make an assumption about the symmetry
group G that we call regularity. (This condition is trivially satisfied whenever X and Y are finite,
thus all the sophistication in these few paragraphs is only relevant to non-discrete channels.)

Definition 19.13 A symmetry group G is called regular if it possesses a left-invariant Haar


probability measure ν such that the group action (19.10)

G×X ×Y →X ×Y

is measurable.

Note that under the regularity assumption the action (19.10) also defines a left action of G on
P(X ) and P(Y) according to

(gPX )[E] ≜ PX [g−1 E] , (19.11)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-382


i i

382

(gQY )[E] ≜ QY [g−1 E] , (19.12)

or, in words, if X ∼ PX then gX ∼ gPX , and similarly for Y and gY. For every distribution PX we
define an averaged distribution P̄X as
Z
P̄X [E] ≜ PX [g−1 E]ν(dg) , (19.13)
G

which is the distribution of random variable gX when g ∼ ν and X ∼ PX . The measure P̄X is G-
invariant, in the sense that gP̄X = P̄X . Indeed, by left-invariance of ν we have for every bounded
function f
Z Z
f(g)ν(dg) = f(hg)ν(dg) ∀h ∈ G ,
G G

and therefore
Z
−1
P̄X [h E] = PX [(hg)−1 E]ν(dg) = P̄X [E] .
G

Similarly one defines Q̄Y :


Z
Q̄Y [E] ≜ QY [g−1 E]ν(dg) , (19.14)
G

which is also G-invariant: gQ̄Y = Q̄Y .


The main property of the action of G may be rephrased as follows: For arbitrary ϕ : X ×Y → R
we have
Z Z
ϕ(x, y)PY|X (dy|x)(gPX )(dx)
X Y
Z Z
= ϕ(gx, gy)PY|X (dy|x)PX (dx) . (19.15)
X Y

In other words, if the pair (X, Y) is generated by taking X ∼ PX and applying PY|X , then the pair
(gX, gY) has marginal distribution gPX but conditional kernel is still PY|X . For finite X , Y this is
equivalent to

PY|X (gy|gx) = PY|X (y|x) , (19.16)

which may also be taken as the definition of the automorphism. In terms of the G-action on P(Y)
we may also say:

gPY|X=x = PY|X=gx ∀ g ∈ G, x ∈ X . (19.17)

It is not hard to show that for any channel and a regular group of symmetries G the capacity-
achieving output distribution must be G-invariant, and capacity-achieving input distribution can
be chosen to be G-invariant. That is, the saddle point equation

inf sup D(PY|X kQY |PX ) = sup inf D(PY|X kQY |PX ) ,
PX QY QY PX

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-383


i i

19.4* Symmetric channels 383

can be solved in the class of G-invariant distribution. Often, the action of G is transitive on X (Y ),
in which case the capacity-achieving input (output) distribution can be taken to be uniform.
Below we systematize many popular notions of channel symmetry and explain relationship
between them.

• PY|X is called input-symmetric (output-symmetric) if there exists a regular group of symmetries


G acting transitively on X (Y ).
• An input-symmetric channel with a binary X is known as BMS (for Binary Memoryless
Symmetric). These channels possess a rich theory; see [360, Section 4.1] and Ex. VI.21.
• PY|X is called weakly input-symmetric if there exists an x0 ∈ X and a channel Tx : B → B for
each x ∈ X such that Tx ◦ PY|X=x0 = PY|X=x and Tx ◦ P∗Y = P∗Y , where P∗Y is the capacity achieving
output distribution. In [333, Section 3.4.5] it is shown that the allowing for a randomized maps
Tx is essential and that not all PY|X are weakly input-symmetric.
• DMC PY|X is a group-noise channel if X = Y is a group and PY|X acts by composing X with a
noise variable Z:
Y = X ◦ Z,
where ◦ is a group operation and Z is independent of X.
• DMC PY|X is called Dobrushin-symmetric if every row of PY|X is a permutation of the first one
and every column of PY|X is a permutation of the first one; see [131].
• DMC PY|X is called Gallager-symmetric if the output alphabet Y can be split into a disjoint union
of sub-alphabets such that restricted to each sub-alphabet PY|X has the Dobrushin property: every
row (every column) is a permutation of the first row (column); see [177, Section 4.5].
• for convenience, say that the channel is square if |X | = |Y|.

We demonstrate some of the relationship between these various notions of symmetry:

• Note that it is an easy consequence of the definitions that any input-symmetric (resp. output-
symmetric) channel, all rows of the channel matrix PY|X (resp. columns) are permutations of
the first row (resp. column). Hence,
input-symmetric, output-symmetric =⇒ Dobrushin (19.18)
• Group-noise channels satisfy all other definitions of symmetry:

group-noise =⇒ square, input/output-symmetric (19.19)


=⇒ Dobrushin, Gallager (19.20)
• Since Gallager symmetry implies all rows are permutations of the first one, while output
symmetry implies the same statement for columns we have
Gallager, output-symmetric =⇒ Dobrushin
• Clearly, not every Dobrushin-symmetric channel is square. One may wonder, however, whether
every square Dobrushin channel is a group-noise channel. This is not so. Indeed, according

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-384


i i

384

to [390] the latin squares that are Cayley tables are precisely the ones in which composition of
two rows (as permutations) gives another row. An example of the latin square which is not a
Cayley table is the following:
 
1 2 3 4 5
2 5 4 1 3
 
 
3 1 2 5 4 . (19.21)
 
4 3 5 2 1
5 4 1 3 2
1
Thus, by multiplying this matrix by 15 we obtain a counterexample:
Dobrushin, square 6=⇒ group-noise
In fact, this channel is not even input-symmetric. Indeed, suppose there is g ∈ G such that
g4 = 1 (on X ). Then, applying (19.16) with x = 4 we figure out that on Y the action of g must
be:
1 7→ 4, 2 7→ 3, 3 7→ 5, 4 7→ 2, 5 7→ 1 .
But then we have
 1
gPY|X=1 = 5 4 2 1 3 · ,
15
which by a simple inspection does not match any of the rows in (19.21). Thus, (19.17) cannot
hold for x = 1. We conclude:
Dobrushin, square 6=⇒ input-symmetric
Similarly, if there were g ∈ G such that g2 = 1 (on Y ), then on X it would act as
1 7→ 2, 2 7→ 5, 3 7→ 1, 4 7→ 3, 5 7→ 4 ,
which implies via (19.16) that PY|X (g1|x) is not a column of (19.21). Thus:
Dobrushin, square 6=⇒ output-symmetric
• Clearly, not every input-symmetric channel is Dobrushin (e.g., BEC). One may even find a
counterexample in the class of square channels:
 
1 2 3 4
1 3 2 4  1
 
4 2 3 1 · 10 (19.22)

4 3 2 1
This shows:
input-symmetric, square 6=⇒ Dobrushin
• Channel (19.22) also demonstrates:
Gallager-symmetric, square 6=⇒ Dobrushin .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-385


i i

19.4* Symmetric channels 385

• Example (19.22) naturally raises the question of whether every input-symmetric channel is
Gallager symmetric. The answer is positive: by splitting Y into the orbits of G we see that a
subchannel X → {orbit} is input and output symmetric. Thus by (19.18) we have:

input-symmetric =⇒ Gallager-symmetric =⇒ weakly input-symmetric (19.23)

(The second implication is evident).


• However, not all weakly input-symmetric channels are Gallager-symmetric. Indeed, consider
the following channel
 
1/7 4/7 1/7 1/7
 
 4/7 1/7 0 4/7 
 
W= . (19.24)
 0 0 4 /7 2 / 7 
 
2/7 2/7 2/7 0

Since det W 6= 0, the capacity achieving input distribution is unique. Since H(Y|X = x) is
independent of x and PX = [1/4, 1/4, 3/8, 1/8] achieves uniform P∗Y it must be the unique
optimum. Clearly any permutation Tx fixes a uniform P∗Y and thus the channel is weakly input-
symmetric. At the same time it is not Gallager-symmetric since no row is a permutation of
another.
• For more on the properties of weakly input-symmetric channels see [333, Section 3.4.5].

A pictorial representation of these relationships between the notions of symmetry is given


schematically on Figure 19.2.

Weakly input symmetric

Gallager
1010
1111111
0000000
0000000
1111111 0
1 Dobrushin
0000000
1111111
0000000
1111111 101111
0000000000
1111111111
0000
0000000
1111111 101111
0000000000
1111111111
0000
101111
0000
1111
0000000
1111111 0000000000
1111111111
0000
000
111
0
10000
1111 000
111
0000
1111
0000000
1111111 0000000000
1111111111
0000
1111
000
111
000
111
0
10000
1111
0000000000
1111111111
0000
1111
000
111
000
111 000
111
0000
1111
0000000
1111111
0000
1111 0000
1111
000
111
0000
1111 000
111
000
111
0000000
1111111
0000
1111 0000
1111 000input−symmetric
111
output−symmetric group−noise

Figure 19.2 Schematic representation of inclusions of various classes of channels.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-386


i i

386

19.5* Information stability


We saw that C = C(I) for stationary memoryless channels, but what other channels does this hold
for? And what about non-stationary channels? To answer this question, we introduce the notion
of information stability.

Definition 19.14 A channel is called information stable if there exists a sequence of input
distributions {PXn , n = 1, 2, . . .} such that
1 n n P (I)
i( X ; Y ) −
→C .
n

For example, we can pick PXn = (P∗X )n for stationary memoryless channels. Therefore
stationary memoryless channels are information stable.
The purpose for defining information stability is the following theorem.

Theorem 19.15 For an information stable channel, C = C(I) .

Proof. Like the stationary, memoryless case, the upper bound comes from the general con-
verse Theorem 17.3, and the lower bound uses a similar strategy as Theorem 19.8, except utilizing
the definition of information stability in place of WLLN.
The next theorem gives conditions to check for information stability in memoryless channels
which are not necessarily stationary.

Theorem 19.16 A memoryless channel is information stable if there exists {X∗k : k ≥ 1} such
that both of the following hold:
1X ∗ ∗
n
I(Xk ; Yk ) → C(I) (19.25)
n
k=1
X

1
Var[i(X∗n ; Y∗n )] < ∞ . (19.26)
n2
n=1

In particular, this is satisfied if


|A| < ∞ or |B| < ∞ (19.27)

Proof. To show the first part, it is sufficient to prove


" #
1 X ∗ ∗
n
∗ ∗
P i(Xk ; Yk ) − I(Xk , Yk ) > δ → 0
n
k=1

So that 1n i(Xn ; Yn ) → C(I) in probability. We bound this by Chebyshev’s inequality


" # Pn
1 X ∗ ∗ ∗ ∗
n 1
∗ ∗ k=1 Var[i(Xk ; Yk )]
P i(Xk ; Yk ) − I(Xk , Yk ) > δ ≤ n2
→ 0,
n δ2
k=1

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-387


i i

19.5* Information stability 387

where convergence to 0 follows from Kronecker lemma (Lemma 19.17 to follow) applied with
bn = n2 , xn = Var[i(X∗n ; Y∗n )]/n2 .
The second part follows from the first. Indeed, notice that

1X
n
C(I) = lim inf sup I(Xk ; Yk ) .
n→∞ n PXk
k=1

Now select PX∗k such that

I(X∗k ; Y∗k ) ≥ sup I(Xk ; Yk ) − 2−k .


PXk

(Note that each supPX I(Xk ; Yk ) ≤ log min{|A|, |B|} < ∞.) Then, we have
k

X
n X
n
I(X∗k ; Y∗k ) ≥ sup I(Xk ; Yk ) − 1 ,
PXk
k=1 k=1

and hence normalizing by n we get (19.25). We next show that for any joint distribution PX,Y we
have

Var[i(X; Y)] ≤ 2 log2 (min(|A|, |B|)) . (19.28)

The argument is symmetric in X and Y, so assume for concreteness that |B| < ∞. Then

E[i2 (X; Y)]


Z X  
≜ dPX (x) PY|X (y|x) log2 PY|X (y|x) + log2 PY (y) − 2 log PY|X (y|x) · log PY (y)
A y∈B
Z X h i
≤ dPX (x) PY|X (y|x) log2 PY|X (y|x) + log2 PY (y) (19.29)
A y∈B
   
Z X X
= dPX (x)  PY|X (y|x) log2 PY|X (y|x) +  PY (y) log2 PY (y)
A y∈B y∈B
Z
≤ dPX (x)g(|B|) + g(|B|) (19.30)
A
=2g(|B|) ,

where (19.29) is because 2 log PY|X (y|x) · log PY (y) is always non-negative, and (19.30) follows
because each term in square-brackets can be upper-bounded using the following optimization
problem:
X
n
g( n) ≜ sup
Pn
aj log2 aj . (19.31)
aj ≥0: j= 1 aj =1 j=1

Since the x log2 x has unbounded derivative at the origin, the solution of (19.31) is always in the
interior of [0, 1]n . Then it is straightforward to show that for n > e the solution is actually aj = 1n .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-388


i i

388

For n = 2 it can be found directly that g(2) = 0.5629 log2 2 < log2 2. In any case,

2g(|B|) ≤ 2 log2 |B| .

Finally, because of the symmetry, a similar argument can be made with |B| replaced by |A|.

Lemma 19.17 (Kronecker


P
Lemma) Let a sequence 0 < bn % ∞ and a non-negative

sequence {xn } such that n=1 xn < ∞, then

1 X
n
bj xj → 0
bn
j=1

Proof. Since bn ’s are strictly increasing, we can split up the summation and bound them from
above
X
n X
m X
n
bk xk ≤ bm xk + b k xk
k=1 k=1 k=m+1

Now throw in the rest of the xk ’s in the summation

1 X bm X X bm X X
n ∞ n ∞ ∞
bk
=⇒ b k xk ≤ xk + xk ≤ xk + xk
bn bn bn bn
k=1 k=1 k=m+1 k=1 k=m+1

1 X X
n ∞
=⇒ lim bk xk ≤ xk → 0
n→∞ bn
k=1 k=m+1

Since this holds for any m, we can make the last term arbitrarily small.

How to show information stability? One important class of channels with memory for which
information stability can be shown easily are Gaussian channels. The complete details will be
shown below (see Sections 20.5* and 20.6*), but here we demonstrate a crucial fact.
For jointly Gaussian (X, Y) we always have bounded variance:
cov[X, Y]
Var[i(X; Y)] = ρ2 (X, Y) log2 e ≤ log2 e , ρ(X, Y) = p . (19.32)
Var[X] Var[Y]

Indeed, first notice that we can always represent Y = X̃ + Z with X̃ = aX ⊥


⊥ Z. On the other hand,
we have
 
log e x̃2 + 2x̃z σ2 2
i(x̃; y) = − 2 2z , z ≜ y − x̃ .
2 σY2 σY σZ

From here by using Var[·] = Var[E[·|X̃]] + Var[·|X̃] we need to compute two terms separately:
 σX̃2

log e  X̃ 2
− 2
E[i(X̃; Y)|X̃] =
σZ
,
2 σY2

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-389


i i

19.6 Capacity under bit error rate 389

and hence
2 log2 e 4
Var[E[i(X̃; Y)|X̃]] = σ .
4σY4 X̃
On the other hand,

2 log2 e 2 2
Var[i(X̃; Y)|X̃] = [4σX̃ σZ + 2σX̃4 ] .
4σY4
Putting it all together we get (19.32). Inequality (19.32) justifies information stability of all sorts
of Gaussian channels (memoryless and with memory), as we will see shortly.

19.6 Capacity under bit error rate


In most cases of interest the space [M] of messages can be given additional structure by identifying
[M] = {0, 1}k , which is, of course, only possible for M = 2k . In these cases, in addition to Pe and
Pe,max every code (f, g) has another important figure of merit – the so-called Bit Error Rate (BER),
denoted as Pb and defined in Section 6.4:

1X
k
1
Pb ≜ P[Sj 6= Ŝj ] = E[dH (Sk , Ŝk )] , (19.33)
k k
j=1

where we represented W and Ŵ as k-tuples: W = (S1 , . . . , Sk ), Ŵ = (Ŝ1 , . . . , Ŝk ), and dH denotes


the Hamming distance (6.6). In words, Pb is the average fraction of errors in a decoded k-bit block.
In addition to constructing codes minimizing block error probability Pe or Pe,max as we studied
above, the problem of minimizing the BER Pb is also practically relevant. Here, we discuss some
simple facts about this setting. In particular, we will see that the capacity value for memoryless
channels does not increase even if one insists only on a vanishing Pb – a much weaker criterion
compared to vanishing Pe .
First, we give a bound on the average probability of error (block error rate) in terms of the bit
error rate.

Theorem 19.18 For all (f, g), M = 2k =⇒ Pb ≤ Pe ≤ kPb

Proof. Recall that M = 2k gives us the interpretation of W = Sk sequence of bits.

1X X
k k
1{Si 6= Ŝi } ≤ 1{Sk 6= Ŝk } ≤ 1{Si 6= Ŝi },
k
i=1 i=1

where the first inequality is obvious and the second follow from the union bound. Taking
expectation of the above expression gives the theorem.

Next, the following pair of results is often useful for lower bounding Pb for some specific codes.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-390


i i

390

Theorem 19.19 (Assouad’s lemma) If M = 2k then


Pb ≥ min{P[Ŵ = c′ |W = c] : c, c′ ∈ {0, 1}k , dH (c, c′ ) = 1} .

Proof. Let ei be a length k vector that is 1 in the i-th position, and zero everywhere else. Then

X
k X
k
1{Si =
6 Ŝi } ≥ 1{Sk = Ŝk + ei }
i=1 i=1

Dividing by k and taking expectation gives

1X
k
Pb ≥ P[Sk = Ŝk + ei ]
k
i=1

≥ min{P[Ŵ = c′ |W = c] : c, c′ ∈ {0, 1}k , dH (c, c′ ) = 1} .

Similarly, we can prove the following generalization:

Theorem 19.20 If A, B ∈ {0, 1}k (with arbitrary marginals!) then for every r ≥ 1 we have
 
1 k−1
Pb = E[dH (A, B)] ≥ Pr,min (19.34)
k r−1
Pr,min ≜ min{P[B = c′ |A = c] : c, c′ ∈ {0, 1}k , dH (c, c′ ) = r} (19.35)

Proof. First, observe that


X  
k
P[dH (A, B) = r|A = a] = PB|A (b|a) ≥ Pr,min .
r
b:dH (a,b)=r

Next, notice

dH (x, y) ≥ r1{dH (x, y) = r}

and take the expectation with x ∼ A, y ∼ B.

In statistics, Assouad’s Lemma is a useful tool for obtaining lower bounds on the minimax risk
of an estimator (Section 31.2).
The following is a converse bound for channel coding under BER constraint.

Theorem 19.21 (Converse under BER) Any M-code with M = 2k and bit-error rate Pb
satisfies
supPX I(X; Y)
log M ≤ .
log 2 − h(Pb )

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-391


i i

19.7 Joint source-channel coding 391

i.i.d.
Proof. Note that Sk → X → Y → Ŝk , where Sk ∼ Ber( 12 ). Recall from Theorem 6.1 that for iid
P
Sn , I(Si ; Ŝi ) ≤ I(Sk ; Ŝk ). This gives us
X
k
sup I(X; Y) ≥ I(X; Y) ≥ I(Si ; Ŝi )
PX
i=1
 
1X 1
≥k d P[Si = Ŝi ]
k 2
!
1X
k
1
≥ kd P[Si = Ŝi ]
k 2
i=1
 
1
= kd 1 − Pb = k(log 2 − h(Pb ))
2
where the second line used Fano’s inequality (Theorem 3.12) for binary random variables (or data
processing inequality for divergence), and the third line used the convexity of divergence. One
should note that this last chain of inequalities is similar to the proof of Proposition 6.6.
Pairing this bound with Proposition 19.10 shows that any sequence of codes with Pb → 0 (for
a memoryless channel) must have rate R < C. In other words, relaxing the constraint from Pe to
Pb does not yield any higher rates.
Later in Section 26.3 we will see that channel coding under BER constraint is a special case
of a more general paradigm known as lossy joint source channel coding so that Theorem 19.21
follows from Theorem 26.5. Furthermore, this converse bound is in fact achievable asymptotically
for stationary memoryless channels.

19.7 Joint source-channel coding


Now we will examine a slightly different data transmission scenario called Joint Source Channel
Coding (JSCC):

Sk Encoder Xn Yn Decoder Ŝ k
Source (JSCC) Channel (JSCC)

Formally, a JSCC code consists of an encoder f : Ak → X n and a decoder g : Y n → Ak . The


goal is to maximize the transmission rate R = nk (symbol per channel use) while ensuring the
probability of error P[Sk 6= Ŝk ] is small. The fundamental limit (optimal probability of error) is
defined as
ϵ∗JSCC (k, n) = inf P[Sk 6= Ŝk ]
f,g

In channel coding we are interested in transmitting M messages and all messages are born equal.
Here we want to convey the source realizations which might not be equiprobable (has redundancy).
Indeed, if Sk is uniformly distributed on, say, {0, 1}k , then we are back to the channel coding setup

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-392


i i

392

with M = 2k under average probability of error, and ϵ∗JSCC (k, n) coincides with ϵ∗ (n, 2k ) defined
in Section 22.1.
Here, we look for a clever scheme to directly encode k symbols from A into a length n channel
input such that we achieve a small probability of error over the channel. This feels like a mix
of two problems we have seen: compressing a source and coding over a channel. The following
theorem shows that compressing and channel coding separately is optimal. This is a relief, since
it implies that we do not need to develop any new theory or architectures to solve the Joint Source
Channel Coding problem. As far as the leading term in the asymptotics is concerned, the following
two-stage scheme is optimal: First use the optimal compressor to eliminate all the redundancy in
the source, then use the optimal channel code to add redundancy to combat the noise in the data
transmission.
The result is known as separation theorem since it separates the jobs of compressor and channel
code, with the two blocks interfacing in terms of bits. Note that the source can generate symbols
over very different alphabet than the channel’s input alphabet. Nevertheless, the bit stream pro-
duced by the source code (compressor) is matched to the channel by the channel code. There is
an even more general version of this result (Section Section 26.3).

Theorem 19.22 (Shannon separation theorem) Let the source {Sk } be stationary mem-
oryless on a finite alphabet with entropy H. Let the channel be stationary memoryless with finite
capacity C. Then
(
∗ → 0 R < C/H
ϵJSCC (nR, n) n → ∞.
6→ 0 R > C/H

The interpretation of this result is as follows: Each source symbol has information content
(entropy) H bits. Each channel use can convey C bits. Therefore to reliably transmit k symbols
over n channel uses, we need kH ≤ nC.
Proof. (Achievability.) The idea is to separately compress our source and code it for transmission.
Since this is a feasible way to solve the JSCC problem, it gives an achievability bound. This
separated architecture is
f1 f2 P Yn | X n g2 g1
Sk −→ W −→ Xn −→ Yn −→ Ŵ −→ Ŝk

Where we use the optimal compressor (f1 , g1 ) and optimal channel code (maximum probability of
error) (f2 , g2 ). Let W denote the output of the compressor which takes at most Mk values. Then
from Corollary 11.3 and Theorem 19.9 we get:
1
(From optimal compressor) log Mk > H + δ =⇒ P[Ŝk 6= Sk (W)] ≤ ϵ ∀k ≥ k0
k
1
(From optimal channel code) log Mk < C − δ =⇒ P[Ŵ 6= m|W = m] ≤ ϵ ∀m, ∀k ≥ k0
n
Using both of these,

P[Sk 6= Ŝk (Ŵ)] ≤ P[Sk 6= Ŝk , W = Ŵ] + P[W 6= Ŵ]

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-393


i i

19.7 Joint source-channel coding 393

≤ P[Sk 6= Ŝk (W)] + P[W 6= Ŵ] ≤ ϵ + ϵ

And therefore if R(H + δ) < C − δ , then ϵ∗ → 0. By the arbitrariness of δ > 0, we conclude the
weak converse for any R > C/H.
(Converse.) To prove the converse notice that any JSCC encoder/decoder induces a Markov
chain
Sk → Xn → Yn → Ŝk .
Applying data processing for mutual information
I(Sk ; Ŝk ) ≤ I(Xn ; Yn ) ≤ sup I(Xn ; Yn ) = nC.
PXn

On the other hand, since P[Sk 6= Ŝk ] ≤ ϵn , Fano’s inequality (Theorem 3.12) yields
I(Sk ; Ŝk ) = H(Sk ) − H(Sk |Ŝk ) ≥ kH − ϵn log |A|k − log 2.
Combining the two gives
nC ≥ kH − ϵn log |A|k − log 2. (19.36)
Since R = nk , dividing both sides by n and sending n → ∞ yields
RH − C
lim inf ϵn ≥ .
n→∞ R log |A|
Therefore ϵn does not vanish if R > C/H.

We remark that instead of using Fano’s inequality we could have lower bounded I(Sk ; Ŝk ) as in
the proof of Theorem 17.3 by defining QSk Ŝk = USk PŜk (with USk = Unif({0, 1}k ) and applying the
data processing inequality to the map (Sk , Ŝk ) 7→ 1{Sk = Ŝk }:
D(PSk Ŝk kQSk Ŝk ) = D(PSk kUSk ) + D(PŜ|Sk kPŜ |PSk ) ≥ d(1 − ϵn k|A|−k )
Rearranging terms yields (19.36). As we discussed in Remark 17.2, replacing D with other f-
divergences can be very fruitful.
In a very similar manner, by invoking Corollary 12.6 and Theorem 19.15 we obtain:

Theorem 19.23 Let source {Sk } be ergodic on a finite alphabet, and have entropy rate H. Let
the channel have capacity C and be information stable. Then
(
= 0 R > H/C
lim ϵ∗JSCC (nR, n)
n→∞ > 0 R < H/C

We leave the proof as an exercise.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-394


i i

20 Channels with input constraints. Gaussian


channels.

In this chapter we study data transmission with constraints on the channel input. Namely, in pre-
vious chapter the encoder for blocklength n code was permitted to produce arbitrary sequences
of channel inputs (i.e. the range of the encoder could be all of An ). However, in many practical
problem only a subset of An is allowed to be used. The main such example is the AWGN chan-
nel Example 3.3. If encoder is allowed to produce arbitrary elements of Rn as input, the channel
capacity is infinite: supPX I(X; X + Z) = ∞ (for example, take X ∼ N (0, P) and P → ∞). That
is, one can transmit arbitrarily many messages with arbitrarily small error probability by choos-
ing elements of Rn with giant pairwise distance. In reality, however, allowed channel inputs are
limited by the available1 power and the encoder is only capable of using xn ∈ Rn are satisfying

1X 2
n
xt ≤ P ,
n
t=1

where P > 0 is known as the power constraint. How many bits per channel use can we transmit
under this constraint on the codewords? To answer this question in general, we need to extend
the setup and coding theorems to channels with input constraints. After doing that we will apply
these results to compute capacities of various Gaussian channels (memoryless, with inter-symbol
interference and subject to fading).

20.1 Channel coding with input constraints


We say that an (n, M, ϵ)-code satisfies the input constraint Fn ⊂ An if the encoder maps [M] into
Fn , i.e. f : [M] → Fn as illustrated by the following figure.

An
b b
b
b Fn b
b b
b b b
b b
b

Codewords all land in a subset of An

1
or allowed by regulatory bodies, such as the FCC in the US.

394

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-395


i i

20.1 Channel coding with input constraints 395

What type of constraint sets Fn are of practical interest? In the context of Gaussian channels,
we have A = R. Then one often talks about the following constraints:

• Average power constraint:

1X 2 √
n
| xi | ≤ P ⇔ kxn k2 ≤ nP.
n
i=1

In other words, codewords must lie in a ball of radius nP.
• Peak power constraint :

max |xi | ≤ A ⇔ kxn k∞ ≤ A


1≤i≤n

Notice that the second type of constraint does not introduce any new problems: we can simply
restrict the input space from A = R to A = [−A, A] and be back into the setting of input-
unconstrained coding. The first type of the constraint is known as a separable cost-constraint.
We will restrict our attention from now on to it exclusively.

Definition 20.1 A channel with a separable cost constraint is specified by

• input space A and output space B ;


• a sequence of Markov kernels PYn |Xn : An → B n , indexed by the blocklength n = 1, 2, . . .
• (per-letter) cost function c : A → R ∪ {±∞}.

We extend the per-letter cost to n-sequences as follows:


1X
n
c(xn ) ≜ c(xk )
n
k=1

We next extend the channel coding notions to such channels.

Definition 20.2 • A P
code is an (n, M, ϵ, P)-code if it is an (n, M, ϵ)-code satisfying input
n
constraint Fn ≜ {x : n k=1 c(xk ) ≤ P}
n 1

• Finite-n fundamental limits:

M∗ (n, ϵ, P) = max{M : ∃(n, M, ϵ, P)-code}


M∗max (n, ϵ, P) = max{M : ∃(n, M, ϵ, P)max -code}

• ϵ-capacity and Shannon capacity


1
Cϵ (P) = lim inf log M∗ (n, ϵ, P)
n→∞ n

C(P) = lim Cϵ (P)


ϵ↓0

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-396


i i

396

• Information capacity
1
C(I) (P) = lim inf sup I(Xn ; Yn )
n→∞ n PXn :E[Pnk=1 c(Xk )]≤nP

• Information stability: Channel is information stable if for all (admissible) P, there exists a
sequence of channel input distributions PXn such that the following two properties hold:
1 P
iP n n (Xn ; Yn )−
→C(I) (P) (20.1)
n X ,Y
P[c(Xn ) > P + δ] → 0 ∀δ > 0 . (20.2)

These definitions clearly parallel those of Definitions 19.3 and 19.6 for channels without input
constraints. A notable and crucial exception is the definition of the information capacity C(I) (P).
Indeed, under input constraints instead of maximizing I(Xn ; Yn ) over distributions supported on
Fn we extend maximization to a richer set of distributions, namely, those satisfying
" n #
X
E c(Xk ) ≤ nP .
k=1

This will be crucial for the single-letterization of C(I) (P) soon.

Definition 20.3 (Admissible constraint) We say P is an admissible constraint if ∃x0 ∈ A


s.t. c(x0 ) ≤ P, or equivalently, ∃PX : E[c(X)] ≤ P. The set of admissible P’s is denoted by Dc ,
and can be either in the form (P0 , ∞) or [P0 , ∞), where P0 ≜ infx∈A c(x).

Clearly, if P ∈
/ Dc , then there is no code (even a useless one, with one codeword) satisfying the
input constraint. So in the remaining we always assume P ∈ Dc .

Proposition 20.4 (Capacity-cost function) Define the capacity-cost function ϕ(P) ≜


supPX :E[c(X)]≤P I(X; Y). Then

1 ϕ is concave and non-decreasing. The domain of ϕ is dom ϕ ≜ {x : f(x) > −∞} = Dc .


2 One of the following is true: ϕ(P) is continuous and finite on (P0 , ∞), or ϕ = ∞ on (P0 , ∞).

Both of these properties extend to the function P 7→ C(I) (P).

Proof. In the first part all statements are obvious, except for concavity, which follows from the
concavity of PX 7→ I(X; Y). For any PXi such that E [c(Xi )] ≤ Pi , i = 0, 1, let X ∼ λ̄PX0 + λPX1 .
Then E [c(X)] ≤ λ̄P0 + λP1 and I(X; Y) ≥ λ̄I(X0 ; Y0 ) + λI(X1 ; Y1 ). Hence ϕ(λ̄P0 + λP1 ) ≥
λ̄ϕ(P0 ) + λϕ(P1 ). The second claim follows from concavity of ϕ(·).
To extend these results to C(I) (P) observe that for every n
1
P 7→ sup I(Xn ; Yn )
n PXn :E[c(Xn )]≤P

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-397


i i

20.2 Channel capacity under separable cost constraints 397

is concave. Then taking lim infn→∞ the same holds for C(I) (P).

An immediate consequence is that memoryless input is optimal for memoryless channel with
separable cost, which gives us the single-letter formula of the information capacity:

Corollary 20.5 (Single-letterization) The information capacity of a stationary memory-


less channel with separable cost is given by

C(I) (P) = ϕ(P) = sup I(X; Y).


E[c(X)]≤P

Proof. C(I) (P) ≥ ϕ(P) is obvious by using PXn = (PX )⊗n . For “≤”, fix any PXn satisfying the
cost constraint. Consider the chain
 
( a) X (b) X X
n n ( c)
n
1
I(Xn ; Yn ) ≤ I(Xj ; Yj ) ≤ ϕ(E[c(Xj )]) ≤ nϕ  E[c(Xj )] ≤ nϕ(P) ,
n
j=1 j=1 j=1

where (a) follows from Theorem 6.1; (b) from the definition of ϕ; and (c) from Jensen’s inequality
and concavity of ϕ.

20.2 Channel capacity under separable cost constraints


We start with a straightforward extension of the weak converse to the case of input constraints.

Theorem 20.6 (Weak converse)


C(I) (P)
Cϵ (P) ≤
1−ϵ

Proof. The argument is the same as we used in Theorem 17.3. Take any (n, M, ϵ, P)-code, W →
Xn → Yn → Ŵ. Applying Fano’s inequality and the data-processing, we get

−h(ϵ) + (1 − ϵ) log M ≤ I(W; Ŵ) ≤ I(Xn ; Yn ) ≤ sup I(Xn ; Yn )


PXn :E[c(Xn )]≤P

Normalizing both sides by n and taking lim infn→∞ we obtain the result.

Next we need to extend one of the coding theorems to the case of input constraints. We do so for
the Feinstein’s lemma (Theorem 18.7). Note that when F = X , it reduces to the original version.

Theorem 20.7 (Extended Feinstein’s lemma) Fix a Markov kernel PY|X and an arbitrary
PX . Then for any measurable subset F ⊂ X , everyγ > 0 and any integer M ≥ 1, there exists an
(M, ϵ)max -code such that

• Encoder satisfies the input constraint: f : [M] → F ⊂ X ;

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-398


i i

398

• Probability of error bound:


M
ϵPX (F) ≤ P[i(X; Y) < log γ] +
γ

Proof. Similar to the proof of the original Feinstein’s lemma, define the preliminary decoding
regions Ec = {y : i(c; y) ≥ log γ} for all c ∈ X . Next, we apply Corollary 18.4 and find out
that there is a set F0 ⊂ X with two properties: a) PX [F0 ] = 1 and b) for every x ∈ F0 we have
PY (Ex ) ≤ γ1 . We now let F′ = F ∩ F0 and notice that PX [F′ ] = PX [F].
We sequentially pick codewords {c1 , . . . , cM } from the set F′ (!) and define the decoding regions
{D1 , . . . , DM } as Dj ≜ Ecj \ ∪jk− 1
=1 Dk . The stopping criterion is that M is maximal, i.e.,

∀x0 ∈ F′ , PY [Ex0 \ ∪M
j=1 Dj X = x0 ] < 1 − ϵ
′ ′c
⇔ ∀ x0 ∈ X , P Y [ E x 0 \ ∪ M
j=1 Dj X = x0 ] < (1 − ϵ)1[x0 ∈ F ] + 1[x0 ∈ F ]

Now average the last inequality over x0 ∼ PX to obtain


′ ′c
P[{i(X; Y) ≥ log γ}\ ∪M
j=1 Dj ] ≤ (1 − ϵ)PX [F ] + PX [F ] = 1 − ϵPX [F]

From here, we complete the proof by following the same steps as in the proof of original Feinstein’s
lemma (Theorem 18.7).

Given the coding theorem we can establish a lower bound on capacity.

Theorem 20.8 (Achievability bound) For any information stable channel with input
constraints and P > P0 we have

C(P) ≥ C(I) (P). (20.3)

Proof. Let us consider a special case of the stationary memoryless channel (the proof for general
information stable channel follows similarly). Thus, we assume PYn |Xn = (PY|X )⊗n .
Fix n ≥ 1. Choose a PX such that E[c(X)] < P, Pick log M = n(I(X; Y) − 2δ) and log γ =
n(I(X; Y) − δ).
P
With the input constraint set Fn = {xn : 1n c(xk ) ≤ P}, and iid input distribution PXn = P⊗ n
X ,
we apply the extended Feinstein’s lemma. This shows existence of an (n, M, ϵn , P)max -code with
the encoder satisfying input constraint Fn and vanishing (maximal) error probability

ϵn PXn [Fn ] ≤ P[i(Xn ; Yn ) ≤ n(I(X; Y) − δ)] + exp(−nδ)


| {z } | {z } | {z }
→1 →0 as n→∞ by WLLN and stationary memoryless assumption →0

Indeed, the first term is vanishing by the weak law of large numbers: since E[c(X)] < P, we have
P
PXn (Fn ) = P[ 1n c(xk ) ≤ P] → 1. Since ϵn → 0 this implies that for every ϵ > 0 we have
1
Cϵ (P) ≥ log M = I(X; Y) − 2δ, ∀δ > 0, ∀PX s.t. E[c(X)] < P
n
⇒ Cϵ (P) ≥ sup lim (I(X; Y) − 2δ)
PX :E[c(X)]<P δ→0

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-399


i i

20.3 Stationary AWGN channel 399

⇒ Cϵ (P) ≥ sup I(X; Y) = C(I) (P−) = C(I) (P)


PX :E[c(X)]<P

where the last equality is from the continuity of C(I) on (P0 , ∞) by Proposition 20.4.
For a general information stable channel, we just need to use the definition to show that
P[i(Xn ; Yn ) ≤ n(C(I) − δ)] → 0, and the rest of the proof follows similarly.

Theorem 20.9 (Channel capacity under cost constraint) For an information stable
channel with cost constraint and for any admissible constraint P we have
C(P) = C(I) (P).

Proof. The boundary case of P = P0 is treated in Ex. IV.23, which shows that C(P0 ) = C(I) (P0 )
even though C(I) (P) may be discontinuous at P0 . So assume P > P0 next. Theorem 20.6 shows
(I)
Cϵ (P) ≤ C1−ϵ (P)
, thus C(P) ≤ C(I) (P). On the other hand, from Theorem 20.8 we have C(P) ≥
( I)
C ( P) .

20.3 Stationary AWGN channel


We start our applications with perhaps the most important channel (from the point of view of
communication engineering).

Z ∼ N (0, σ 2 )

X + Y

Definition 20.10 (The stationary AWGN channel) The Additive White Gaussian Noise
(AWGN) channel is a stationary memoryless additive-noise channel with separable cost constraint:
A = B = R, c(x) = x2 , and a single-letter kernel PY|X given by Y = X + Z, where Z ∼ N (0, σ 2 ) ⊥⊥
X. The n-letter kernel is given by a product extension, i.e. Yn = Xn + Zn with Zn ∼ N (0, In ). When
the power constraint is E[c(X)] ≤ P we say that the signal-to-noise ratio (SNR) equals σP2 . Note
that our informal definition early on (Example 3.3) lacked specification of the cost constraint
function, without which it was not complete.

The terminology white noise refers to the fact that the noise variables are uncorrelated across
time. This makes the power spectral density of the process {Zj } constant in frequency (or “white”).
We often drop the word stationary when referring to this channel. The definition we gave above is
more correctly should be called the real AWGN, or R-AWGN, channel. The complex AWGN, or
C-channel is defined similarly: A = B = C, c(x) = |x|2 , and Yn = Xn + Zn , with Zn ∼ Nc (0, In )
being the circularly symmetric complex gaussian.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-400


i i

400

Theorem 20.11 For the stationary AWGN channel, the channel capacity is equal to informa-
tion capacity, and is given by:
 
1 P
( I)
C(P) = C (P) = log 1 + 2 for R-AWGN (20.4)
2 σ
 
P
( I)
C(P) = C (P) = log 1 + 2 for C-AWGN
σ

Proof. By Corollary 20.5,

C(I) = sup I(X; X + Z)


PX :EX2 ≤P

Then using Theorem 5.11 (the Gaussian saddle point) to conclude X ∼ N (0, P) (or Nc (0, P)) is
the unique capacity-achieving input distribution.

At this point it is also instructive to revisit Section 6.2* which shows that Gaussian capacity
can in fact be derived essentially without solving the maximization of mutual information: the
Euclidean rotational symmetry implies the optimal input should be Gaussian.
There is a great deal of deep knowledge embedded in the simple looking formula of Shan-
non (20.4). First, from the engineering point of view we immediately see that to transmit
information faster (per unit time) one needs to pay with radiating at higher power, but the payoff
in transmission speed is only logarithmic. The waveforms of good error correcting codes should
look like samples of the white Gaussian process.
Second, the amount of energy spent per transmitted information bit is minimized by solving
P log 2
inf = 2σ 2 loge 2 (20.5)
P>0 C(P)

and is achieved by taking P → 0. (We will discuss the notion of energy-per-bit more in Sec-
tion 21.1.) Thus, we see that in order to maximize communication rate we need to send powerful,
high-power waveforms. But in order to minimize energy-per-bit we need to send in very quiet
“whisper” and at very low communication rate.2 In addition, when signaling with very low power

(and hence low rate), by inspecting Figure 3.2 we can see that one can restrict to just binary ± P
symbols (so called BPSK modulation). This results in virtually no loss of capacity.
Third, from the mathematical point of view, formula (20.4) reveals certain properties of high-
dimensional Euclidean geometry
√ as follows. Since Zn ∼ N (0, σ 2 ), then with high probability,
kZ k2 concentrates around nσ . Similarly, due the power constraint and the fact that Zn ⊥
n 2 ⊥ Xn , we
 n 2  n 2  n 2
have E kY k = E kY p k + E kZ k ≤ n(P + σ 2 ) and the received vector Yn lies in an ℓ√ 2 -ball
of radius approximately n(P + σ 2 ). Since the noise √ can at most perturb the codeword p by nσ 2
in Euclidean distance, if we can pack M balls of radius nσ 2 into the ℓ2 -ball of radius n(P + σ 2 )
centered at the origin, this yields a good codebook and decoding regions – see Figure 20.1 for an
illustration. So how large can M be? Note that the volume of an ℓ2 -ball of radius r in Rn is given by

2
This explains why, for example, the deep space probes communicate with earth via very low-rate codes and very long
blocklengths.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-401


i i

20.3 Stationary AWGN channel 401

c3

c4

p n
√ c2
nσ 2

(P
c1

+
σ
2
)
c5
c8

···

c6
c7

cM

Figure 20.1 Interpretation of the AWGN capacity formula as “soft” packing.

2 n/ 2 n/2
cn rn for some constant cn . Then cn (cnn((Pn+σ ))
= 1 + σP2 . Taking the log and dividing by n, we
 σ 2 ) n/ 2

get n log M ≈ 2 log 1 + σ2 . This tantalizingly convincing reasoning, however, is flawed in at
1 1 P

least two ways. (a) Computing the volume ratio only gives an upper bound on the maximal number
of disjoint balls (See Section 27.2 for an extensive discussion on this topic.) (b) Codewords need
not correspond to centers of disjoint ℓ2 -balls. √ Indeed, the fact that we allow some vanishing (but
non-zero) probability of error means that the nσ 2 -balls are slightly overlapping and Shannon’s
formula establishes the maximal number of such partially overlapping balls that we can pack so
that they are (mostly) inside a larger ball.

Since Theorem 20.11 applies to Gaussian noise, it is natural to ask: What if the noise is non-
Gaussian and how sensitive is the capacity formula 21 log(1 + SNR) to the Gaussian assumption?
Recall the Gaussian saddle point result in Theorem 5.11 which shows that for the same variance,
Gaussian noise is the worst which shows that the capacity of any non-Gaussian noise is at least
1
2 log(1 + SNR). Conversely, it turns out the increase of the capacity can be controlled by how
non-Gaussian the noise is (in terms of KL divergence). The following result is due to Ihara [223].

Theorem 20.12 (Additive Non-Gaussian noise) Let Z be a real-valued random variable


independent of X and EZ2 < ∞. Let σ 2 = Var Z. Then
   
1 P 1 P
log 1 + 2 ≤ sup I(X; X + Z) ≤ log 1 + 2 + D(PZ kN (EZ, σ 2 )). (20.6)
2 σ PX :EX2 ≤P 2 σ

Proof. See Ex. IV.24.

Remark 20.1 The quantity D(PZ kN (EZ, σ 2 )) is sometimes called the non-Gaussianness of Z,
where N (EZ, σ 2 ) is a Gaussian with the same mean and variance as Z. So if Z has a non-Gaussian
density, say, Z is uniform on [0, 1], then the capacity can only differ by a constant compared to

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-402


i i

402

AWGN, which still scales as 12 log SNR in the high-SNR regime. On the other hand, if Z is discrete,
then D(PZ kN (EZ, σ 2 )) = ∞ and indeed in this case one can show that the capacity is infinite
because the noise is “too weak”.

20.4 Parallel AWGN channel


Definition 20.13 (Parallel AWGN) A parallel AWGN channel with L branches is a station-
ary memoryless channel whose single-letter kernel is defined as follows: alphabets A = B = RL ,
PL
k=1 |xk | ; and the kernel PYL |XL : Yk = Xk + Zk , for k = 1, . . . , L, and
2
the cost c(x) =
Zk ∼ N (0, σk ) are independent for each branch.
2

Theorem 20.14 (Water-filling) The capacity of L-parallel AWGN channel is given by


1X + T
L
C = log
2 σj2
j=1

where log+ (x) ≜ max(log x, 0), and T ≥ 0 is determined by


X
L
P = |T − σj2 |+
j=1

Proof.

C(I) (P) = sup


P
I(XL ; YL )
PXL : E[X2i ]≤P

X
L
≤ P
sup sup I(Xk ; Yk )
Pk ≤P,Pk ≥0 k=1 E[X2k ]≤Pk

X
L
1 Pk
= P
sup log(1 + )
Pk ≤P,Pk ≥0 k=1 2 σk2

with equality if Xk ∼ N (0, Pk ) are independent. So the question boils down to the last maximiza-
tion problem, known as problem of optimal power allocation. Denote the Lagrangian multipliers
P
for the constraint Pk ≤ P by λ and for the constraint Pk ≥ 0 by μk . We want to solve
P1 P
max 2 log(1 + σPk2 ) − μk Pk + λ(P − Pk ). First-order condition on Pk gives that
k

1 1
= λ − μ k , μ k Pk = 0
2 σk2 + Pk
therefore the optimal solution is
X
L
Pk = |T − σk2 |+ , T is chosen such that P = |T − σk2 |+
k=1

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-403


i i

20.5* Non-stationary AWGN 403

T
P1 P3
σ22

σ12 σ32

Figure 20.2 Power allocation via water-filling across three parallel channels. Here, the second branch is too
noisy (σ2 too big) for the amount of available power P and the optimal coding should discard (input zeros to)
this branch altogether.

Figure 20.2 illustrates the water-filling solution. It has a number of practically important con-
clusions. First, it gives a precise recipe for how much power to allocate to different frequency
bands. This solution, simple and elegant, was actually pivotal for bringing high-speed internet
to many homes (via cable modems, or ADSL): initially, before information theorists had a say,
power allocations were chosen on the basis of costly and imprecise simulations of real codes. The
simplicity of the water-filling scheme makes power allocation dynamic and enables instantaneous
reaction to changing noise environments.
Second, there is a very important consequence for multiple-antenna (MIMO) communication.
Given nr receive antennas and nt transmit antennas, very often one gets as a result a parallel AWGN
with L = min(nr , nt ) branches (see Exercise I.9 and I.10). For a single-antenna system the capacity
then scales as 12 log P with increasing power (Theorem 20.11), while the capacity for a MIMO
AWGN channel is approximately L2 log( PL ) ≈ L2 log P for large P. This results in an L-fold increase
in capacity at high SNR. This is the basis of a powerful technique of spatial multiplexing in MIMO,
largely behind much of advance in 4G, 5G cellular (3GPP) and post-802.11n WiFi systems.
Notice that spatial diversity (requiring both receive and transmit antennas) is different from a
so-called multipath diversity (which works even if antennas are added on just one side). Indeed,
if a single stream of data is sent through every parallel channel simultaneously, then the sufficient
statistic would be to the average of all received vectors, resulting in a the effective noise level
reduced by L1 factor. This results in capacity increasing from 12 log P to 21 log(LP) – a far cry
from the L-fold increase of spatial multiplexing. These exciting topics are explored in excellent
textbooks [423, 268].

20.5* Non-stationary AWGN


Definition 20.15 (Non-stationary AWGN) A non-stationary AWGN channel is a memory-
less channel with single-letter alphabets A = B = R and the separable cost c(x) = x2 . The channel
acts on the input vector Xn by addition Yn = Xn + Zn , where Zj ∼ N (0, σj2 ) are independent.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-404


i i

404

Theorem 20.16 Assume that for every T > 0 the following limits exist:
1X1
n
T
(I)
C̃ (T) = lim log+ 2
n→∞ n 2 σj
j=1

1X
n
P̃(T) = lim |T − σj2 |+ .
n→∞ n
j=1

Then the capacity of the non-stationary AWGN channel is given by the parameterized form:
C(T) = C̃(I) (T) with input power constraint P̃(T).

Proof. Fix T > 0. Then it is clear from the water-filling solution in Theorem 20.14 that

X
n
1 T
sup I(Xn ; Yn ) = log+ , (20.7)
2 σj2
j=1

where the supremum is over all PXn such that

1X
n
E[c(Xn )] ≤ |T − σj2 |+ . (20.8)
n
j=1

Now, by assumption, the LHS of (20.8) converges to P̃(T). Thus, we have that for every δ > 0

C(I) (P̃(T) − δ) ≤ C̃(I) (T)


C(I) (P̃(T) + δ) ≥ C̃(I) (T)

Taking δ → 0 and invoking the continuity of P 7→ C(I) (P), we get from Theorem 19.15 that the
information capacity satisfies

C(I) (P̃(T)) = C̃(I) (T)

provided that the channel is information stable. Indeed, from (19.32)

log2 e Pj log2 e
Var(i(Xj ; Yj )) = 2

2 Pj + σj 2

and thus
X
n
1
Var(i(Xj ; Yj )) < ∞ .
n2
j=1

From here information stability follows via Theorem 19.16.

Non-stationary AWGN is primarily of interest due to its relationship to the additive colored
Gaussian noise channel in the following section.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-405


i i

20.6* Additive colored Gaussian noise channel 405

Zn : Cov(Zn ) = Σ

multiply by
X̃ U−1 X + Y multiply by
U Ỹ

stationary additive Gaussian noise channel

fZ (ω)
T

ω
−π π

power allocation

Figure 20.3 The ACGN channel: the “whitening” process used in the capacity proof and the water-filling
power allocation across spectrum.

20.6* Additive colored Gaussian noise channel


Definition 20.17 The Additive Colored Gaussian Noise (ACGN) channel is a channel with
memory defined as follows. The single-letter alphabets are A = B = R and the separable cost is
c(x) = x2 . The channel acts on the input vector Xn by addition Yn = Xn + Zn , where {Zj : j ≥
1} is a stationary Gaussian process with power spectral density fZ (ω) ≥ 0, ω ∈ [−π , π ] (recall
Example 6.4).

Theorem 20.18 The capacity of the ACGN channel with fZ (ω) > 0 for almost every ω ∈
[−π , π ] is given by the following parametric form:
Z π
1 1 T
C ( T) = log+ dω,
2π −π 2 fZ (ω)
Z π
1
P ( T) = |T − fZ (ω)|+ dω.
2π −π

Proof. For n ≥ 1, consider the diagonalization of the covariance matrix of Zn :


e U, such that Σ
Cov(Zn ) = Σ = U⊤ Σ e = diag(σ1 , . . . , σn ) ,

where U is an orthogonal matrix. (Since Cov(Zn ) is positive semi-definite this diagonalization is


en = UXn and Y
always possible.) Define X en = UYn , the channel between X en and Y en is thus

en = X
Y en + UZn ,

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-406


i i

406

e
Cov(UZn ) = U · Cov(Zn ) · U⊤ = Σ
Therefore we have the equivalent channel as follows:
en = X
Y en + Z
en , ej ∼ N (0, σj2 ) independent across j.
Z
Note that since U and U⊤ is orthogonal the maps X̃n = UXn and Xn = U⊤ X̃n preserve the norm
kX̃n k = kXn k. Therefore, capacities of both channels are equal: C = C̃. But the latter follows from
Theorem 20.16. Indeed, we have that
Z π
1X + T
n
e 1 1 T
C = lim log 2
= log+ dω. (Szegö’s theorem, see (6.12))
n→∞ n σj 2π −π 2 f Z (ω)
j=1

1X
n
lim |T − σj2 |+ = P(T).
n→∞ n
j=1

The idea used in the proof as well as the water-filling power allocation are illustrated on Fig-
ure 20.3. Note that most of the time the noise that impacts real-world systems is actually “born”
white (because it is a thermal noise). However, between the place of its injection and the process-
ing there are usually multiple circuit elements. If we model them linearly then their action can
equivalently be described as the ACGN channel, since the effective noise added becomes colored.
In fact, this filtering can be inserted deliberately in order to convert the actual channel into an
additive noise one. This is the content of the next section.

20.7* AWGN channel with intersymbol interference


Oftentimes in wireless communication systems a signal is propagating through a rich scattering
environment. Thus, reaching the receiver are multiple delayed and attenuated copies of the initial
signal. Such situation is formally called intersymbol interference (ISI). A similar effect also occurs
when the cable modem attempts to send signals across telephone or TV wires due to the presence
of various linear filters, transformers and relays. The mathematical model for such channels is the
following.

Definition 20.19 (AWGN with ISI) An AWGN channel with ISI is a channel with memory
that is defined as follows: the alphabets are A = B = R, and the separable cost is c(x) = x2 . The
channel law PYn |Xn is given by
X
n
Yk = hk−j Xj + Zk , k = 1, . . . , n
j=1

i.i.d.
where Zk ∼ N (0, σ 2 ) is white Gaussian noise, {hk , k ∈ Z} are coefficients of a discrete-time
channel filter.

The coefficients {hk } describe the action of the environment. They are often learned by the
receiver during the “handshake” process of establishing a communication link.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-407


i i

20.8* Gaussian channels with amplitude constraints 407

Theorem 20.20 Suppose that the sequence {hk } is the inverse Fourier transform of a
frequency response H(ω):
Z π
1
hk = eiωk H(ω)dω .
2π −π
Assume also that H(ω) is a continuous function on [0, 2π ]. Then the capacity of the AWGN channel
with ISI is given by
Z π
1 1
C ( T) = log+ (T|H(ω)|2 )dω
2π −π 2
Z π +
1 1
P ( T) = T− dω
2π −π |H(ω)|2

Proof sketch. At the decoder apply the inverse filter with frequency response ω 7→ 1
H(ω) . The
equivalent channel then becomes a stationary colored-noise Gaussian channel:

Ỹj = Xj + Z̃j ,

where Z̃j is a stationary Gaussian process with spectral density


1
fZ̃ (ω) = .
|H(ω)|2
Then apply Theorem 20.18 to the resulting channel.
To make the above argument rigorous one must carefully analyze the non-zero error introduced
by truncating the deconvolution filter to finite n. This would take us too much outside of the scope
of this book.

20.8* Gaussian channels with amplitude constraints


We have examined some classical results of additive Gaussian noise channels. In the following,
we will list some more recent results without proof.

Theorem 20.21 (Amplitude-constrained AWGN channel capacity) For an AWGN


channel Yi = Xi + Zi with amplitude constraint |Xi | ≤ A, we denote the capacity by:

C(A) = max I(X; X + Z).


PX :|X|≤A

The capacity achieving input distribution P∗X is discrete, with finitely many atoms on [−A, A]. The
number of atoms is Ω(A) and O(A2 ) as A → ∞. Moreover,
 
1 2A2 1 
log 1 + ≤ C(A) ≤ log 1 + A2
2 eπ 2

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-408


i i

408

Theorem 20.22 (Amplitude-and-power-constrained AWGN channel capacity)


For an AWGN channel Yi = Xi + Zi with amplitude constraint |Xi | ≤ A and power constraint
Pn
i=1 Xi ≤ nP, we denote the capacity by:
2

C(A, P) = max I(X; X + Z).


PX :|X|≤A,E|X|2 ≤P

Capacity achieving input distribution P∗X is discrete, with finitely many atoms on [−A, A]. Moreover,
the convergence speed of limA→∞ C(A, P) = 21 log(1 + P) is of the order e−O(A ) .
2

For details, see [396], [343, Section III] and [144, 348] for the O(A2 ) bound on the number of
atoms.

20.9* Gaussian channels with fading


So far we assumed that the channel is either additive (as in AWGN or ACGN) or has known
multiplicative gains (as in ISI). However, in practice the channel gain is a random variable. This
situation is called “fading” and is often used to model the urban signal propagation with multipath
or shadowing. The received signal at time i, Yi , is affected by multiplicative fading coefficient Hi
and additive noise Zi as follows:
Yi = Hi Xi + Zi , Zi ∼ N (0, σ 2 )
This is illustrated in Figure 20.4.

Hi Zi
E[X2i ] ≤ P
Xi × + Yi receiver

Figure 20.4 AWGN channel with fading.

There are two drastically different cases of fading channels, depending on the presence or
absence of the dashed link on Figure 20.4. In the first case, known as the coherent case or the
CSIR case (for channel state information at the receiver), the receiver is assumed to have perfect
estimate of the channel state information Hi at every time i. In other words, the channel output
is effectively (Yi , Hi ). This situation occurs, for example, when there are pilot signals sent peri-
odically and are used at the receiver to estimate the channel. In some cases, the index i refers to
different frequencies or sub-channels of an OFDM frame.
Whenever Hj is a stationary ergodic process, we have the channel capacity given by:
  
1 P | H| 2
C(P) = E log 1 +
2 σ2

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-409


i i

20.9* Gaussian channels with fading 409

and the capacity achieving input distribution is the usual PX = N (0, P). Note that the capacity
C(P) is in the order of log P and we call the channel “energy efficient”.
In the second case, known as non-coherent or no-CSIR, the receiver does not have any knowl-
edge of the coefficients Hi ’s. In this case, there is no simple expression for the channel capacity.
Most of the known results were shown for the case of iid Hi according to a Rayleigh distribution.
In this case, the capacity achieving input distribution is discrete [3], and the capacity scales as
[415, 269]
C(P) = O(log log P), P→∞ (20.9)
This channel is said to be “energy inefficient” since increasing the communication rate requires
dramatic expenditures in power.
Further generalization of the Gaussian channel models requires introducing multiple input and
output antennas (known as MIMO channel). In this case, the single-letter input Xi ∈ Cnt and the
output Yi ∈ Cnr are related by
Yi = Hi Xi + Zi , (20.10)
i.i.d.
where Zi ∼ CN (0, σ 2 Inr ), nt and nr are the number of transmit and receive antennas, and Hi ∈
Cnt ×nr is a matrix-valued channel gain process. For the capacity of this channel under CSIR,
see Exercise I.10. An incredible effort in the 1990s and 2000s was invested by the information-
theoretic and communication-theoretic researchers to understand this channel model. Some of the
highlights include:

• a beautiful transmit-diversity 2x2 code of Alamouti [10]


• generalization of Alamouti’s code lead to the discovery of space-time coding [417, 416]
• the result of Telatar [418] showing that under coherent fading the capacity scales as
min(nt , nr ) log P (the coefficient appearing in front of log P is known as pre-log or degrees-
of-freedom of the channel)
• the result of Zheng and Tse [474] showing a different pre-log in the scaling for the non-coherent
(block-fading) case.

It is not possible to do any justice to these and many other fundamental results in MIMO communi-
cation here, unfortunately. We suggest textbook [423] as an introduction to this deep and exciting
field.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-410


i i

21 Capacity per unit cost

In this chapter we will consider an interesting variation of the channel coding problem. Instead
of constraining the blocklength (i.e. the number of channel uses), we will constrain the total cost
incurred by the codewords. The most important special case of this problem is that of the AWGN
channel and quadratic (energy) cost constraint. The standard motivation in this setting is the fol-
lowing. Consider a deep space probe which has a k bit message that needs to be delivered to Earth
(or a satellite orbiting it). The duration of transmission is of little worry for the probe, but what is
really limited is the amount of energy it has stored in its battery. In this chapter we will learn how
to study this question abstractly, how coding over large number of bits k → ∞ reduces the energy
spent (per bit), and how this fundamental limit is related to communication over continuous-time
channels.

21.1 Energy-per-bit
In this chapter we will consider Markov kernels PY∞ |X∞ acting between two spaces of infinite
sequences. The prototypical example is again the AWGN channel:

Yi = Xi + Zi , Zi ∼ N (0, N0 /2). (21.1)

Note that in this chapter we have denoted the noise level for Zi to be N20 . There is a long tradition for
such a notation. Indeed, most of the noise in communication systems is a white thermal noise at the
receiver. The power spectral density of that noise is flat and denoted by N0 (in Joules per second
per Hz). However, recall that received signal is complex-valued and, thus, each real component
has power N20 . Note also that thermodynamics suggests that N0 = kT, where k = 1.38 × 10−23 is
the Boltzmann constant, and T is the absolute temperature in Kelvins.
In previous chapter, we analyzed the maximum number of information messages (M∗ (n, ϵ, P))
that can be sent through this channel for a given n number of channel uses and under the power
constraint P. We have also hinted that in (20.5) that there is a fundamental minimal required cost
to send each (data) bit. Here we develop this question more rigorously. Everywhere in this chapter
for v ∈ R∞

X

kvk22 ≜ v2j .
j=1

410

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-411


i i

21.1 Energy-per-bit 411

Definition 21.1 ((E, 2k , ϵ)-code) Given a Markov kernel with input space R∞ we define
an (E, 2k , ϵ)-code to be an encoder-decoder pair, f : [2k ] → R∞ and g : R∞ → [2k ] (or similar
randomized versions), such that for all messages m ∈ [2k ] we have kf(m)k22 ≤ E and

P[g(Y∞ ) 6= W] ≤ ϵ ,

where as usual the probability space is W → X∞ → Y∞ → Ŵ with W ∼ Unif([2k ]), X∞ = f(W)


and Ŵ = g(Y∞ ). The fundamental limit is defined to be

E∗ (k, ϵ) = min{E : ∃(E, 2k , ϵ) code}

The operational meaning of E∗ (k, ϵ) should be apparent: it is the minimal amount of energy the
space probe needs to draw from the battery in order to send k bits of data.

Theorem 21.2 ((Eb /N0 )min = −1.6dB) For the AWGN channel we have
E∗ (k, ϵ) N0
lim lim sup = . (21.2)
ϵ→0 k→∞ k log2 e

Remark 21.1 This result, first obtained by Shannon [378], is colloquially referred to as: min-
imal Eb /N0 (pronounced “eebee over enzero” or “ebno”) is −1.6 dB. The latter value is simply
10 log10 ( log1 e ) ≈ −1.592. Achieving this value of the ebno was an ultimate quest for coding the-
2
ory, first resolved by the turbo codes [47]. See [101] for a review of this long conquest. We also
remark that the fundamental limit is unchanged if instead of real-valued AWGN channel we used
a C-AWGN channel

Yi = Xi + Zi , Zi ∼ CN (0, N0 )
P∞
and energy constraint i=1 |Xi |2 ≤ E. Indeed, this channel’s single input can be simply converted
into a pair of inputs for the R-AWGN channel. This double the blocklength, but it is anyway
considered to be infinite.

Proof. We start with a lower bound (or the “converse” part). As usual, we have the working
probability space

W → X∞ → Y∞ → Ŵ .

Then consider the following chain:


 1
−h(ϵ) + ϵk ≤ d (1 − ϵ)k Fano’s inequality
M
≤ I(W; Ŵ) data processing for divergence
∞ ∞
≤ I( X ; Y ) data processing for mutual information
X

≤ I(Xi ; Yi ) lim I(Xn ; U) = I(X∞ ; U) by (4.30)
n→∞
i=1

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-412


i i

412

X

1  EX2i 
≤ log 1 + Gaussian capacity, Theorem 5.11
2 N0 /2
i=1

log e X EX2i

≤ linearization of log
2 N0 /2
i=1
E
≤ log e.
N0
Thus, we have shown
 
E∗ (k, ϵ) N0 h(ϵ)
≥ ϵ−
k log e k
and taking the double limit in n → ∞ then in ϵ → 0 completes the proof.
Next, for the upper bound (the “achievability” part). We first give a traditional existential proof.
Notice that a (n, 2k , ϵ, P)-code for the AWGN channel is also a (nP, 2k , ϵ)-code for the energy
problem without time constraint. Therefore,

log2 M∗ (n, ϵ, P) ≥ k ⇒ E∗ (k, ϵ) ≤ nP.


E∗ (kn ,ϵ)
Take kn = blog2 M∗ (n, ϵ, P)c, so that we have kn ≤ nP
kn for all n ≥ 1. Taking the limit then we
get
E∗ (kn , ϵ) nP
lim sup ≤ lim sup ∗
n→∞ kn n→∞ log M (n, ϵ, P)
P
=
lim infn→∞ n log M∗max (n, ϵ, P)
1

P
= 1 P
,
2 log(1 + N0 /2 )

where in the last step we applied Theorem 20.11. Now the above statement holds for every P > 0,
so let us optimize it to get the best bound:
E∗ (kn , ϵ) P
lim sup ≤ inf 1 P
n→∞ kn P≥0
2 log(1 + N0 / 2 )
P
= lim
P→0 1 log(1 + P
2 N0 / 2 )
N0
= (21.3)
log2 e
Note that the fact that minimal energy per bit is attained at P → 0 implies that in order to send
information reliably at the Shannon limit of −1.6dB, infinitely many time slots are needed. In
other words, the information rate (also known as spectral efficiency) should be vanishingly small.
Conversely, in order to have non-zero spectral efficiency, one necessarily has to step away from
the −1.6 dB. This tradeoff is known as spectral efficiency vs energy-per-bit.
We next can give a simpler and more explicit construction of the code, not relying on the random
coding implicit in Theorem 20.11. Let M = 2k and consider the following code, known as the

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-413


i i

21.1 Energy-per-bit 413

pulse-position modulation (PPM):



PPM encoder: ∀m, f(m) = cm ≜ (0, 0, . . . , E ,...)
|{z} (21.4)
m-th location

It is not hard to derive an upper bound on the probability of error that this code achieves [337,
Theorem 2]:
" ( r ! )#
2E
ϵ ≤ E min MQ + Z ,1 , Z ∼ N (0, 1) . (21.5)
N0

Indeed, our orthogonal codebook under a maximum likelihood decoder has probability of error
equal to

Z " r !#M−1
∞ √
(z− E)2
1 2 − N
Pe = 1 − √ 1−Q z e 0 dz , (21.6)
πN0 −∞ N0

which is obtained by observing that conditioned on (W = j,q Zj ) the events {||cj + z||2 ≤ ||cj +
z − ci ||2 }, i 6= j are independent. A change of variables x = N20 z and application of the bound
1 − (1 − y)M−1 ≤ min{My, 1} weakens (21.6) to (21.5).
To see that (21.5) implies (21.3), fix c > 0 and condition on |Z| ≤ c in (21.5) to relax it to
r
2E
ϵ ≤ MQ( − c) + 2Q(c) .
N0

Recall the expansion for the Q-function [435, (3.53)]:

x2 log e 1
log Q(x) = − − log x − log 2π + o(1) , x→∞ (21.7)
2 2

Thus, choosing τ > 0 and setting E = (1 + τ )k logN0 e we obtain


2

r
2E
2k Q( − c) → 0
N0

as k → ∞. Thus choosing c > 0 sufficiently large we obtain that lim supk→∞ E∗ (k, ϵ) ≤ (1 +
τ ) logN0 e for every τ > 0. Taking τ → 0 implies (21.3).
2

Remark 21.2 (Simplex conjecture) The code (21.4) in fact achieves the first three terms
in the large-k expansion of E∗ (k, ϵ), cf. [337, Theorem 3]. In fact, the code can be further slightly
√ √
optimized by subtracting the common center of gravity (2−k E, . . . , 2−k E . . .) and rescaling
each codeword to satisfy the power constraint. The resulting constellation is known as the simplex
code. It is conjectured to be the actual optimal code achieving E∗ (k, ϵ) for a fixed k and ϵ; see [105,
Section 3.16] and [401] for more.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-414


i i

414

21.2 Capacity per unit cost


Generalizing the energy-per-bit setting of the previous section, we get the problem of capacity per
unit cost.

Definition 21.3 Given a channel PY∞ |X∞ : X ∞ → Y ∞ and a cost function c : X → R+ ,


we define (E, M, ϵ)-code to be an encoder f : [M] → X ∞ and a decoder g : Y ∞ → [M] s.t. a) for
every m the output of the encoder x∞ ≜ f(m) satisfies
X

c(xt ) ≤ E . (21.8)
t=1

and b) the probability of error is small

P[g(Y∞ ) 6= W] ≤ ϵ ,

where as usual we operate on the space W → X∞ → Y∞ → Ŵ with W ∼ Unif([M]). We let

M∗ (E, ϵ) = max{M : (E, M, ϵ)-code} ,

and define capacity per unit cost as


1
Cpuc ≜ lim lim inf log M∗ (E, ϵ) . (21.9)
ϵ→0 E→∞ E

Let C(P) be the capacity-cost function of the channel (in the usual sense of capacity, as defined
in (20.1)). Assuming P0 = 0 and C(0) = 0 it is not hard to show (basically following the steps of
Theorem 21.2) that:
C(P) C(P) d
Cpuc = sup = lim = C(P) .
P P P→ 0 P dP P=0

The surprising discovery of Verdú [434] is that one can avoid computing C(P) and derive the Cpuc
directly. This is a significant help, as for many practical channels C(P) is unknown. Additionally,
this gives a yet another fundamental meaning to the KL divergence.
Q
Theorem 21.4 For a stationary memoryless channel PY∞ |X∞ = PY|X with P0 = c(x0 ) = 0
(i.e. there is a symbol of zero cost), we have
D(PY|X=x kPY|X=x0 )
Cpuc = sup .
x̸=x0 c(x)

In particular, Cpuc = ∞ if there exists x1 6= x0 with c(x1 ) = 0.

Proof. Let
D(PY|X=x kPY|X=x0 )
CV = sup .
x̸=x0 c(x)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-415


i i

21.2 Capacity per unit cost 415

Converse: Consider a (E, M, ϵ)-code W → X∞ → Y∞ → Ŵ. Introduce an auxiliary distribution


QW,X∞ ,Y∞ ,Ŵ , where a channel is a useless one

QY∞ |X∞ = QY∞ ≜ P∞


Y|X=x0 .

That is, the overall factorization is

QW,X∞ ,Y∞ ,Ŵ = PW PX∞ |W QY∞ PŴ|Y∞ .

Then, as usual we have from the data-processing for divergence


1
(1 − ϵ) log M + h(ϵ) ≤ d(1 − ϵk )
M
≤ D(PW,X∞ ,Y∞ ,Ŵ kQW,X∞ ,Y∞ ,Ŵ )
= D(PY∞ |X∞ kQY∞ |PX∞ )
"∞ #
X
=E d(Xt ) , (21.10)
t=1

where we denoted for convenience d(x) ≜ D(PY|X=x kPY|X=x0 ). By the definition of CV we have

d(x) ≤ c(x)CV .

Thus, continuing (21.10) we obtain


" #
X

(1 − ϵ) log M + h(ϵ) ≤ CV E c(Xt ) ≤ CV · E ,
t=1

where the last step is by the cost constraint (21.8). Thus, dividing by E and taking limits we get

Cpuc ≤ CV .

Achievability: We generalize the PPM code (21.4). For each x1 ∈ X and n ∈ Z+ we define the
encoder f as follows:

f(1) = (x1 , x1 , . . . , x1 , x0 , . . . , x0 )
| {z } | {z }
n-times n(M−1)-times

f(2) = (x0 , x0 , . . . , x0 , x1 , . . . , x1 , x0 , . . . , x0 )
| {z } | {z } | {z }
n-times n-times n(M−2)-times

···
f ( M ) = ( x0 , . . . , x0 , x1 , x1 , . . . , x1 )
| {z } | {z }
n(M−1)-times n-times

Now, by Stein’s lemma (Theorem 14.14) there exists a subset S ⊂ Y n with the property that

P[Yn ∈ S|Xn = (x1 , . . . , x1 )] ≥ 1 − ϵ1


P[Yn ∈ S|Xn = (x0 , . . . , x0 )] ≤ exp{−nD(PY|X=x1 kPY|X=x0 ) + o(n)} .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-416


i i

416

Therefore, we propose the following (suboptimal!) decoder:

Yn ∈ S =⇒ Ŵ = 1
Y2n
n+1 ∈S =⇒ Ŵ = 2
···

From the union bound we find that the overall probability of error is bounded by

ϵ ≤ ϵ1 + M exp{−nD(PY|X=x1 kPY|X=x0 ) + o(n)} .

At the same time the total cost of each codeword is given by nc(x1 ). Thus, taking n → ∞ and
after straightforward manipulations, we conclude that
D(PY|X=x1 kPY|X=x0 )
Cpuc ≥ .
c(x1 )
This holds for any symbol x1 ∈ X , and so we are free to take supremum over x1 to obtain Cpuc ≥
CV , as required.

21.3 Energy-per-bit for the fading channel


Note that Theorem 21.4 when applied to the AWGN channel seems to be pointless: since the
capacity-cost function C(P) is known, it is not hard to compute limP→0 C(PP) directly. Theo-
rem 21.4’s true strength is revealed when applied to channels for which capacity-cost function
is unknown.
Specifically, we consider a stationary memoryless Gaussian channel with fading Hj unknown
at the receiver (i.e. non-coherent fading channel, see Section 20.9*):

Yj = Hj Xj + Zj , Hj ∼ Nc (0, 1) ⊥
⊥ Zj ∼ Nc (0, N0 ).

(We use here a more convenient C-valued fading channel with Hj ∼ Nc , known as the Rayleigh
fading). The cost function is the usual quadratic one c(x) = |x|2 . As we discussed previously,
cf. (20.9), the capacity-cost function C(P) is unknown in closed form, but is known to behave
drastically different from the case of non-fading AWGN (i.e. when Hj = 1). So here Theorem 21.4
comes handy. Let us perform a simple computation required, cf. (2.9):
D(Nc (0, |x|2 + N0 )kNc (0, N0 ))
Cpuc = sup
x̸=0 | x| 2
 
log(1 + |Nx|0 )
2
1
= sup log e − |x|2
 (21.11)
N0 x̸=0
N0
log e
=
N0
Comparing with Theorem 21.2 we discover that surprisingly, the capacity-per-unit-cost is unaf-
fected by the presence of fading. In other words, the random multiplicative noise which is so

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-417


i i

21.4 Capacity of the continuous-time AWGN channel 417

detrimental at high SNR, appears to be much more benign at low SNR (recall that Cpuc = C′ (0)
and thus computing Cpuc corresponds to computing C(P) at P → 0). There is one important differ-
ence: the supremization over x in (21.11) is solved at x = ∞. Following the proof of the converse
bound, we conclude that any code hoping to achieve optimal Cpuc must satisfy a strange constraint:
X X
|xt |2 1{|xt | ≥ A} ≈ | xt | 2 ∀A > 0
t t

i.e. the total energy expended by each codeword must be almost entirely concentrated in very
large spikes. Such a coding method is called “flash signaling”. Thus, we can see that unlike the
non-fading AWGN (for which due to rotational symmetry all codewords can be made relatively
non-spiky), the only hope of achieving full Cpuc in the presence of fading is by signaling in short
bursts of energy. Thus, while the ultimate minimal energy-per-bit is the same for the AWGN or
the fading channel, the nature of optimal coding schemes is rather different.
Another fundamental difference between the two channels is revealed in the finite blocklength
behavior of E∗ (k, ϵ). Specifically, we have the following different asymptotic expansions for the

energy-per-bit E (kk,ϵ) :
r
E∗ (k, ϵ) constant −1
= (−1.59 dB) + Q (ϵ) (AWGN)
k k
r
E∗ (k, ϵ) 3 log k 2
= (−1.59 dB) + (Q−1 (ϵ)) (non-coherent fading.)
k k
That is we see that the speed of convergence to Shannon limit is much slower under fading. Fig-
ure 21.1 shows this effect numerically by plotting evaluation of (the upper and lower bounds for)
E∗ (k, ϵ) for the fading and non-fading channel. We see that the number of data bits k that need
to be coded over for the fading channel is about factor of 103 larger than for the AWGN channel.
See [463] for further details.

21.4 Capacity of the continuous-time AWGN channel


We now briefly discuss the topic of continuous-time channels. We would like to define the channel
as acting on waveforms x(t), t ≥ 0 by adding white Gaussian noise as follows:

Y(t) = X(t) + N(t) ,

where N(t) is a (generalized) Gaussian process with covariance function


N0
E[N(t)N(s)] = δ(t − s) ,
2
and δ is the Dirac δ -function. Defining the channel in this way requires careful understanding of
the nature of N(t) (in particular, it is not a usual stochastic process, since its value at any point
N(t) = ∞), but is preferred by engineers. Mathematicians prefer to define the continuous-time

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-418


i i

418

14

12

10
Achievability

8
Converse
dB

6 Rayleigh fading, noCSI


N0 ,
Eb

2
fading+CSIR, non-fading AWGN
0
−1.59 dB
−2
100 101 102 103 104 105 106 107 108
Information bits, k

Figure 21.1 Comparing the energy-per-bit required to send a packet of k-bits for different channel models

(curves represent upper and lower bounds on the unknown optimal value E (k,ϵ) k
). As a comparison: to get to
−1.5 dB one has to code over 6 · 104 data bits when the channel is non-fading AWGN or fading AWGN with
Hj known perfectly at the receiver. For fading AWGN without knowledge of Hj (no CSI), one has to code over
at least 7 · 107 data bits to get to the same −1.5 dB. Plot generated using [397].

channel by introducing the standard Wiener process (Brownian motion) Wt and setting
Z t r
N0
Yint (t) = X(τ )dτ + Wt ,
0 2
where Wt is the zero-mean Gaussian process with covariance function
E[Ws Wt ] = min(s, t) .
Denote by L2 ([0, T]) the space of all square-integrable functions on [0, T]. Let M∗ (T, ϵ, P) the
maximum number of messages that can be sent through this channel such that given an encoder
f : [M] → L2 [0, T] for each m ∈ [M] the waveform x(t) ≜ f(m)

1 is non-zero only on [0, T];


RT
2 input energy constrained to t=0 x2 (t) ≤ TP;
and the decoding error probability P[Ŵ 6= W] ≤ ϵ. This is a natural extension of the previously
defined log M∗ functions to continuous-time setting.
We prove the capacity result for this channel next.

Theorem 21.5 The maximal reliable rate of communication across the continuous-time AWGN
P
channel is N0 log e (per unit of time). More formally, we have
1 P
lim lim inf log M∗ (T, ϵ, P) = log e (21.12)
ϵ→0 T→∞ T N0

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-419


i i

21.5* Capacity of the continuous-time bandlimited AWGN channel 419

Proof. Note that the space L2 [0, T] has a countable basis (e.g. sinusoids). Thus, by expanding our
input and output waveforms in that basis we obtain an equivalent channel model:
 N 
0
Ỹj = X̃j + Z̃j , Z̃j ∼ N 0, ,
2
and energy constraint (dependent upon duration T):
X

X̃2j ≤ PT .
j=1

But then the problem is equivalent to the energy-per-bit for the (discrete-time) AWGN channel
(see Theorem 21.2) and hence

log2 M∗ (T, ϵ, P) = k ⇐⇒ E∗ (k, ϵ) = PT .

Thus,
1 P P
lim lim inf log2 M∗ (T, ϵ, P) = E∗ (k,ϵ)
= log2 e ,
ϵ→0 n→∞ T limϵ→0 lim supk→∞ N0
k

where the last step is by Theorem 21.2.

21.5* Capacity of the continuous-time bandlimited AWGN channel


An engineer looking at the previous theorem will immediately point out an issue with the definition
of an error correcting code. Namely, we allowed the waveforms x(t) to have bounded duration
and bounded power, but did not constrain its frequency content. In practice, waveforms are also
required to occupy a certain limited band of B Hz. What is the capacity of the AWGN channel
subject to both the power p and the bandwidth B constraints?
Unfortunately, answering this question rigorously requires a long and delicate digression into
functional analysis and prolate spheroidal functions. We thus only sketch the main result, without
stating it as a rigorous theorem. For a full treatment, consider the monograph of Ihara [224].
Let us again define M∗CT (T, ϵ, P, B) to be the maximum number of waveforms that can be sent
with probability of error ϵ in time T, power P and so that each waveform in addition to those two
constraints also has Fourier spectrum entirely contained in [fc − B/2, fc + B/2], where fc is a certain
“carrier” frequency.1
We claim that
1 P
CB (P) ≜ lim lim inf log M∗CT (T, ϵ, P, B) = B log(1 + ), (21.13)
ϵ→0 n→∞ T N0 B

1
Here we already encounter a major issue: the waveform x(t) supported on a finite interval (0, T] cannot have spectrum
supported on a compact. The requirements of finite duration and finite spectrum are only satisfied by the zero waveform.
Rigorously, one should relax the bandwidth constraint to requiring that the signal have a vanishing out-of-band energy as
T → ∞. As we said, rigorous treatment of this issue lead to the theory of prolate spheroidal functions [391].

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-420


i i

420

In other words, the capacity of this channel is B log(1 + NP0 B ). To understand the idea of the proof,
we need to recall the concept of modulation first. Every signal X(t) that is required to live in
[fc − B/2, fc + B/2] frequency band can be obtained by starting with a complex-valued signal XB (t)
with frequency content in [−B/2, B/2] and mapping it to X(t) via the modulation:

X(t) = Re(XB (t) 2ejωc t ) ,
where ωc = 2πfc . Upon receiving the sum Y(t) = X(t) + N(t) of the signal and the white noise
N(t) we may demodulate Y by computing

YB (t) = 2LPF(Y(t)ejωc t ), ,
where the LPF is a low-pass filter removing all frequencies beyond [−B/2, B/2]. The important
fact is that converting from Y(t) to YB (t) does not lose information.
Overall we have the following input-output relation:
e ( t) ,
YB (t) = XB (t) + N
e (t) is a complex Gaussian white noise and
where all processes are C-valued and N
e ( t) N
E[ N e (s)∗ ] = N0 δ(t − s).

(Notice that after demodulation, the power spectral density √


of the noise is N0 /2 with N0 /4 in the
real part and N0 /4 in the imaginary part, and after the 2 amplifier the spectral density of the
noise is restored to N0 /2 in both real and imaginary part.)
Next, we do Nyquist sampling to convert from continuous-time to discrete time. Namely, the
input waveform XB (t) is going to be represented by its values at an equispaced grid of time instants,
separated by B1 . Similar representation is done to YB (t). It is again known that these two operations
do not lead to either restriction of the space of input waveforms (since every band limited signal
can be uniquely represented by its samples at Nyquist rate) or loss of the information content in
YB (t) (again, Nyquist samples represent the signal YB completely). Mathematically, what we have
done is
X∞
i
XB (t) = Xi sincB (t − )
B
i=−∞
Z ∞
i
Yi = YB (t)sincB (t − )dt ,
t=−∞ B

where sincB (x) = sin(xBx) and Xi = XB (i/B). After the Nyquist sampling on XB and YB we get the
following equivalent input-output relation:
Yi = Xi + Zi , Zi ∼ Nc (0, N0 ) (21.14)
R∞
where the noise Zi = t=−∞ N e (t)sincB (t − i )dt. Finally, given that XB (t) is only non-zero for
B
t ∈ (0, T] we see that the C-AWGN channel (21.14) is only allowed to be used for i = 1, . . . , TB.
This fact is known in communication theory as “bandwidth B and duration T signal has BT complex
degrees of freedom”.
Let us summarize what we obtained so far:

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-421


i i

21.5* Capacity of the continuous-time bandlimited AWGN channel 421

• After sampling the equivalent channel model is that of discrete-time C-AWGN.


• Given time T and bandwidth B the discrete-time equivalent channel has blocklength n = BT.
• The power constraint in the discrete-time model corresponds to:
X
BT
|Xi |2 = kX(t)k22 ≤ PT ,
i=1

Thus the effective discrete-time power constraint becomes Pd = PB .

Hence, we have established the following fact:


1 1
log M∗CT (T, ϵ, P, B) = log M∗C−AWGN (BT, ϵ, Pd ) ,
T T
where M∗C−AWGN denotes the fundamental limit of the C-AWGN channel from Theorem 20.11.
Thus, taking T → ∞ we get (21.13).
Note also that in the limit of large bandwidth B the capacity formula (21.13) yields
P P
CB=∞ (P) = lim B log(1 + )= log e ,
B→∞ N0 B N0
agreeing with (21.12).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-422


i i

22 Strong converse. Channel dispersion. Error


exponents. Finite blocklength.

In previous chapters our main object of study was the fundamental limit of blocklength-n coding:

M∗ (n, ϵ) = max{M : ∃(n, M, ϵ)-code}

Equivalently, we can define it in terms of the smallest probability of error at a given M:

ϵ∗ (n, M) = inf{ϵ : ∃(n, M, ϵ)-code}

What we learned so far is that for stationary memoryless channels we have


1
lim lim inf log M∗ (n, ϵ) = C ,
ϵ→0 n→∞ n
or, equivalently,
(
∗ 0, R<C
lim sup ϵ (n, exp{nR}) =
n→∞ > 0, R > C.
These results were proved 75 years ago by Shannon. What happened in the ensuing 75 years is that
we obtained much more detailed information about M∗ and ϵ∗ . For example, the strong converse
says that in the previous limit the > 0 can be replaced with 1. The error-exponents show that
convergence of ϵ∗ (n, exp{nR}) to zero or one happens exponentially fast (with partially known
exponents). The channel dispersion refines the asymptotic description to

log M∗ (n, ϵ) = nC − nVQ−1 (ϵ) + O(log n) .

Finally, the finite blocklength information theory strives to prove the sharpest possible computa-
tional bounds on log M∗ (n, ϵ) at finite n, which allows evaluating real-world codes’ performance
taking their latency n into account. These results are surveyed in this chapter.

22.1 Strong converse


We begin by stating the main theorem.

Theorem 22.1 For any stationary memoryless channel with either |A| < ∞ or |B| < ∞ we
have Cϵ = C for 0 < ϵ < 1. Equivalently, for every 0 < ϵ < 1 we have

log M∗ (n, ϵ) = nC + o(n) , n → ∞.

422

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-423


i i

22.1 Strong converse 423

Previously in Theorem 19.7, we showed that C ≤ Cϵ ≤ 1−ϵ C


. Now we are asserting that equality
holds for every ϵ. Our previous converse arguments (Theorem 17.3 based on Fano’s inequality)
showed that communication with an arbitrarily small error probability is possible only when using
rate R < C. The strong converse shows that when communicating at any rate above capacity R > C,
the probability of error in fact goes to 1. (An even more detailed result of Arimoto characterizes
the speed of convergence to 1 as exponential in n and gives an exact expression for the exponent.)
In other words,
(
0 R<C
ϵ∗ (n, exp(nR)) → (22.1)
1 R>C

where ϵ∗ (n, M) is the inverse of M∗ (n, ϵ) defined in (19.5).


In practice, engineers observe this effect differently. Instead of changing the coding rate, they
fix a code and then allow the channel parameter (SNR for the AWGN channel, or δ for BSCδ ) vary.
This typically results in a waterfall plot for the probability of error:

Pe
1
10−1
10−2
10−3
10−4
10−5
SNR

In other words, below a certain critical SNR, the probability of error quickly approaches 1, so
that the receiver cannot decode anything meaningful. Above the critical SNR the probability of
error quickly approaches 0 (unless there is an effect known as the error floor, in which case prob-
ability of error decreases reaches that floor value and stays there regardless of the further SNR
increase). Thus, long-blocklength codes have a threshold-like behavior of probability of error sim-
ilar to (22.1). Besides changing SNR instead of rate, there is another important difference between
a waterfall plot and (22.1). The former applies to only a single (perhaps rather suboptimal) code,
while the latter is a statement about the best possible code for each (n, R) pair.

Proof. We will improve the method used in the proof of Theorem 17.3. Take an (n, M, ϵ)-code
and consider the usual probability space

W → Xn → Yn → Ŵ ,

where W ∼ Unif([M]). Note that PXn is the empirical distribution induced by the encoder at the
channel input. We denote the joint measure on (W, Xn , Yn , Ŵ) induced in this way by P. Our goal
is to replace this probability space with a different one where the true channel PYn |Xn = P⊗ n
Y|X is
replaced with an auxiliary channel (which is a “dummy” one in this case):

QYn |Xn = (QY )⊗n .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-424


i i

424

We will denote the measure on (W, Xn , Yn , Ŵ) induced by this new channel by Q. Note that for
communication purposes, QYn |Xn is a useless channel since it ignores the input and randomly picks
i.i.d.
a member of the output space according to Yi ∼ QY , so that Xn and Yn are independent (under Q).
Therefore, for the probability of success under each channel we have
1
Q[Ŵ = W] =
M
P[Ŵ = W] ≥ 1 − ϵ
n o
Therefore, the random variable 1 Ŵ = W is likely to be 1 under P and likely to be 0 under Q.
It thus looks like a rather good choice for a binary hypothesis test statistic distinguishing the two
distributions, PW,Xn ,Yn ,Ŵ and QW,Xn ,Yn ,Ŵ . Since no hypothesis test can beat the optimal (Neyman-
Pearson) test, we get the upper bound
1
β1−ϵ (PW,Xn ,Yn ,Ŵ , QW,Xn ,Yn ,Ŵ ) ≤ (22.2)
M
(Recall the definition of β from (14.3).) The likelihood ratio is a sufficient statistic for this
hypothesis test, so let us compute it:
PW,Xn ,Yn ,Ŵ PW PXn |W PYn |Xn PŴ|Yn PW|Xn PXn ,Yn PŴ|Yn PXn ,Yn
= ⊗
= =
QW,Xn ,Yn ,Ŵ n
PW PXn |W (QY ) PŴ|Yn PW|Xn PXn (QY )⊗n PŴ|Yn PXn (QY )⊗n
Therefore, inequality above becomes
1
β1−ϵ (PXn ,Yn , PXn (QY )⊗n ) ≤ (22.3)
M
Computing the LHS of this bound may appear to be impossible because the distribution PXn
depends on the unknown code. However, it will turn out that a judicious choice of QY will make
knowledge of PXn unnecessary. Before presenting a formal argument, let us consider a special case
of the BSCδ channel. It will show that for symmetric channels we can select QY to be the capacity
achieving output distribution (recall, that it is unique by Corollary 5.5). To treat the general case
later we will (essentially) decompose the channel into symmetric subchannels (corresponding to
“composition” of the input).
Special case: BSCδ . So let us take PYn |Xn = BSC⊗ δ
n
and for QY we will take the capacity
achieving output distribution which is simply QY = Ber(1/2).
PYn |Xn (yn |xn ) = PnZ (yn − xn ), Zn ∼ Ber(δ)n
( Q Y ) ⊗ n ( yn ) = 2 − n
From the Neyman-Pearson lemma, the optimal HT takes the form
   
⊗n PXn Yn PXn Yn
βα (PXn Yn , PXn (QY ) ) = Q log ≥ γ where α = P log ≥γ
| {z } | {z } PXn (QY )⊗n PXn (QY )⊗n
P Q

For the BSC, this becomes


PXn Yn PZn (yn − xn )
log = log .
PXn (P∗Y )n 2− n

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-425


i i

22.1 Strong converse 425

Notice that the effect of unknown PXn completely disappeared, and so we can compute βα :
1
βα (PXn Yn , PXn (QY )⊗n ) = βα (Ber(δ)⊗n , Ber( )⊗n ) (22.4)
2
1
= exp{−nD(Ber(δ)kBer( )) + o(n)} (by Stein’s Lemma: Theorem 14.14)
2
Putting this together with our main bound (22.3), we see that any (n, M, ϵ) code for the BSC
satisfies
1
log M ≤ nD(Ber(δ)kBer( )) + o(n) = nC + o(n) .
2
Clearly, this implies the strong converse for the BSC. (For a slightly different, but equivalent, proof
see Exercise IV.32 and for the AWGN channel see Exercise IV.33).
For the general channel, let us denote by P∗Y the capacity achieving output distribution. Recall
that by Corollary 5.5 it is unique and by (5.1) we have for every x ∈ A:

D(PY|X=x kP∗Y ) ≤ C . (22.5)

This property will be very useful. We next consider two cases separately:

1 If |B| < ∞ we take QY = P∗Y and note that from (19.31) we have
X
PY|X (y|x0 ) log2 PY|X (y|x0 ) ≤ log2 |B| ∀ x0 ∈ A
y

and since miny P∗Y (y)> 0 (without loss of generality), we conclude that for some constant
K > 0 and for all x0 ∈ A we have
 
PY|X (Y|X = x0 )
Var log | X = x0 ≤ K < ∞ .
QY (Y)
Thus, if we let
X
n
PY|X (Yi |Xi )
Sn = log ,
P∗Y (Yi )
i=1

then we have

E[Sn |Xn ] ≤ nC, Var[Sn |Xn ] ≤ Kn . (22.6)

Hence from Chebyshev inequality (applied conditional on Xn ), we have


√ √ 1
P[Sn > nC + λ Kn] ≤ P[Sn > E[Sn |Xn ] + λ Kn] ≤ 2 . (22.7)
λ
2 If |A| < ∞, then first we recall that without loss of generality the encoder can be taken to be
deterministic. Then for each codeword c ∈ An we define its composition (also known as type)
to simply be its empirical distribution
1X
n
P̂c (x) ≜ 1{cj = x} .
n
j=1

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-426


i i

426

By simple counting1 it is clear that from any (n, M, ϵ) code, it is possible to select an (n, M′ , ϵ)
subcode, such that a) all codeword have the same composition P0 ; and b) M′ > (n+1M )|A|−1
. Note
′ ′
that, log M = log M + O(log n) and thus we may replace M with M and focus on the analysis of
the chosen subcode. Then we set QY = PY|X ◦ P0 . From now on we also assume that P0 (x) > 0
for all x ∈ A (otherwise just reduce A). Let i(x; y) denote the information density with respect
to P0 PY|X . If X ∼ P0 then I(X; Y) = D(PY|X kQY |P0 ) ≤ log |A| < ∞ and we conclude that
PY|X=x  QY for each x and thus
dPY|X=x
i(x; y) = log ( y) .
dQY
From (19.28) we have

Var [i(X; Y)|X] ≤ Var[i(X; Y)] ≤ K < ∞

Furthermore, we also have

E [i(X; Y)|X] = D(PY|X kQY |P0 ) = I(X; Y) ≤ C X ∼ P0 .

So if we define
X
n
dPY|X=Xi (Yi |Xi ) X n
Sn = log ( Yi ) = i(Xi ; Yi ) ,
dQY
i=1 i=1

we again first get the estimates (22.6) and then (22.7).


To proceed with (22.3) we apply the lower bound on β from (14.9):
h i
γβ1−ϵ (PXn ,Yn , PXn (QY )⊗n ) ≥ 1 − ϵ − P Sn > log γ ,
√ 2
where γ is arbitrary. We set log γ = nC + λ Kn and λ2 = 1−ϵ to obtain (via (22.7)):
1−ϵ
γβ1−ϵ (PXn ,Yn , PXn (QY )⊗n ) ≥ ,
2
which then implies

log β1−ϵ (PXn Yn , PXn (QY )n ) ≥ −nC + O( n) .

Consequently, from (22.3) we conclude that



log M ≤ nC + O( n) ,

implying the strong converse.


We note several lessons from this proof. First, we basically followed the same method as in the
proof of the weak converse, except instead of invoking data-processing inequality for divergence,
we analyzed the hypothesis testing problem explicitly. Second, the bound on variance of the infor-
mation density is important. Thus, while the AWGN channel is excluded by the assumptions of

1
This kind of reduction from a general code to a constant-composition subcode is the essence of the method of types [115].

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-427


i i

22.2 Stationary memoryless channel without strong converse 427

the Theorem, the strong converse for it does hold as well (see Ex. IV.33). Third, this method of
proof is also known as “sphere-packing”, for the reason that becomes clear if we do the example
of the BSC slightly differently (see Ex. IV.32).

22.2 Stationary memoryless channel without strong converse


In the proof above we basically only used the fact that the sum of independent random vari-
ables concentrates around its expectation (we used second moment to show that, but it could
have been done more generally, when the second moment does not exist). Thus one may won-
der whether the strong converse should hold for all stationary memoryless channels (it was only
showed in Theorem 22.1 for those with finite input or output spaces). In this section we construct
a counterexample.
Let the output alphabet be B = [0, 1]. The input A is going to be countably infinite. It will be
convenient to define it as

A = {(j, m) : j, m ∈ Z+ , 0 ≤ j ≤ m} .

The single-letter channel PY|X is defined in terms of probability density function as


(
j
am , ≤y≤ j+1
m ,,
pY|X (y|(j, m)) = m
bm , otherwise ,

where am , bm are chosen to satisfy


1 1
am + ( 1 − ) bm = 1 (22.8)
m m
1 1
am log am + (1 − )bm log bm = C , (22.9)
m m
where C > 0 is an arbitrary fixed constant. Note that for large m we have
mC 1
am = (1 + O( )) , (22.10)
log m log m
C 1
bm = 1 − + O( 2 ) (22.11)
log m log m
It is easy to see that P∗Y = Unif(0, 1) is the capacity-achieving output distribution and

sup I(X; Y) = C .
PX

Thus by Theorem 19.9 the capacity of the corresponding stationary memoryless channel is C. We
next show that nevertheless the ϵ-capacity can be strictly greater than C.
Indeed, fix blocklength n and consider a single letter distribution PX assigning equal weights
to all atoms (j, m) with m = exp{2nC}. It can be shown that in this case, the distribution of a

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-428


i i

428

single-letter information density is given by


( (
log am , w.p. amm 2nC + O(log n), w.p. am
m
i(X; Y) = = .
log bm , w.p. 1 − m
am
O( 1n ), w.p. 1 − am
m

Thus, for blocklength-n density we have


1X
n
1 n n d 1 1 am d
i( X ; Y ) = i(Xi ; Yi ) = O( ) + (2C + O( log n)) · Bin(n, )−→2C · Poisson(1/2) ,
n n n n m
i=1

where we used the fact that amm = (1 + o(1)) 2n


1
and invoked the Poisson limit theorem for Binomial.
Therefore, from Theorem 18.5 we get that for ϵ > e−1/2 there exist (n, M, ϵ)-codes with
log M ≥ 2nC(1 + o(1)) .
In particular,
Cϵ ≥ 2C ∀ϵ > e−1/2

22.3 Meta-converse
We have seen various ways in which one can derive upper (impossibility or converse) bounds on
the fundamental limits such as log M∗ (n, ϵ). In Theorem 17.3 we used data-processing and Fano’s
inequalities. In the proof of Theorem 22.1 we reduced the problem to that of hypothesis testing.
There are many other converse bounds that were developed over the years. It turns out that there
is a very general approach that encompasses all of them. For its versatility it is sometimes referred
to as the “meta-converse”.
To describe it, let us fix a Markov kernel PY|X (usually, it will be the n-letter channel PYn |Xn ,
but in the spirit of “one-shot” approach, we avoid introducing blocklength). We are also given a
certain (M, ϵ) code and the goal is to show that there is an upper bound on M in terms of PY|X and
ϵ. The essence of the meta-converse is described by the following diagram:

PY |X
W Xn Yn Ŵ

QY |X

Here the W → X and Y → Ŵ represent encoder and decoder of our fixed (M, ϵ) code. The upper
arrow X → Y corresponds to the actual channel, whose fundamental limits we are analyzing. The
lower arrow is an auxiliary channel that we are free to select.
The PY|X or QY|X together with PX (distribution induced by the code) define two distribu-
tions: PX,Y and QX,Y . Consider a map (X, Y) 7→ Z ≜ 1{W 6= Ŵ} defined by the encoder and
decoder pair (if decoders are randomized or W → X is not injective, we consider a Markov kernel
PZ|X,Y (1|x, y) = P[Z = 1|X = x, Y = y] instead). We have
PX,Y [Z = 0] = 1 − ϵ, QX,Y [Z = 0] = 1 − ϵ′ ,

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-429


i i

22.3 Meta-converse 429

where ϵ and ϵ′ are the average probabilities of error of the given code under the PY|X and QY|X
respectively. This implies the following relation for the binary HT problem of testing PX,Y vs
QX,Y :

β1−ϵ (PX,Y , QX,Y ) ≤ 1 − ϵ′ .

The high-level idea of the meta-converse is to select a convenient QY|X , bound 1 − ϵ′ from above
(i.e. prove a converse result for the QY|X ), and then use the Neyman-Pearson β -function to lift the
Q-channel converse to P-channel.
How one chooses QY|X is a matter of art. For example, in the proof of Case 2 of Theorem 22.1
we used the trick of reducing to the constant-composition subcode. This can instead be done by
taking QYn |Xn =c = (PY|X ◦ P̂c )⊗n . Since there are at most (n + 1)|A|−1 different output distributions,
we can see that
(n + 1)∥A|−1
1 − ϵ′ ≤ ,
M
and bounding of β can be done similar to Case 2 proof of Theorem 22.1. For channels with
|A| = ∞ the technique of reducing to constant-composition codes is not available, but the meta-
converse can still be applied. Examples include proof of parallel AWGN channel’s dispersion [333,
Theorem 78] and the study of the properties of good codes [340, Theorem 21].
However, the most common way of using meta-converse is to apply it with the trivial channel
QY|X = QY . We have already seen this idea in Section 22.1. Indeed, with this choice the proof
of the converse for the Q-channel is trivial, because we always have: 1 − ϵ′ = M1 . Therefore, we
conclude that any (M, ϵ) code must satisfy
1
≥ β1−ϵ (PX,Y , PX QY ) . (22.12)
M
Or, after optimization we obtain
1
≥ inf sup β1−ϵ (PX,Y , PX QY ) .
M∗ (ϵ) PX QY

This is a special case of the meta-converse known as the minimax meta-converse. It has a number
of interesting properties. First, the minimax problem in question possesses a saddle-point and is of
convex-concave type [341]. It, thus, can be seen as a stronger version of the capacity saddle-point
result for divergence in Theorem 5.4.
Second, the bound given by the minimax meta-converse coincides with the bound we obtained
before via linear programming relaxation (18.22), as discovered by [295]. To see this connection,
instead of writing the meta-converse as an upper bound M (for a given ϵ) let us think of it as an
upper bound on 1 − ϵ (for a given M).
We have seen that existence of an (M, ϵ)-code for PY|X implies existence of the (stochastic) map
(X, Y) 7→ Z ∈ {0, 1}, denoted by PZ|X,Y , with the following property:

1
PX,Y [Z = 0] ≥ 1 − ϵ, and P X QY [ Z = 0] ≤ ∀ QY .
M

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-430


i i

430

That is PZ|X,Y is a test of a simple null hypothesis (X, Y) ∼ PX,Y against a composite alternative
(X, Y) ∼ PX QY for an arbitrary QY . In other words every (M, ϵ) code must satisfy

1 − ϵ ≤ α̃(M; PX ) ,

where (we are assuming finite X , Y for simplicity)


X X 1
α̃(M; PX ) ≜ sup { PX,Y (x, y)PZ|X,Y (0|x, y) : PX (x)QY (y)PZ|X,Y (0|x, y) ≤ ∀ QY } .
PZ|X,Y x, y x,y
M

We can simplify the constraint by rewriting it as


X 1
sup PX (x)QY (y)PZ|X,Y (0|x, y) ≤ ,
QY x, y
M

and further simplifying it to


X 1
PX (x)PZ|X,Y (0|x, y) ≤ , ∀y ∈ Y .
x
M

Let us now replace PX with a π x ≜ MPX (x), x ∈ X . It is clear that π ∈ [0, 1]X . Let us also
replace the optimization variable with rx,y ≜ MPZ|X,Y (0|x, y)PX (x). With these notational changes
we obtain
1 X X
α̃(M; PX ) = sup{ PY|X (y|x)rx,y : 0 ≤ rx,y ≤ π x , rx,y ≤ 1} .
M x, y x

It is now obvious that α̃(M; PX ) = SLP (π ) defined in (18.21). Optimizing over the choice of PX
P
(or equivalently π with x π x ≤ M) we obtain
1 1 X S∗ (M)
1−ϵ≤ SLP (π ) ≤ sup{SLP (π ) : π x ≤ M} = LP .
M M x
M

Now recall that in (18.23) we showed that a greedy procedure (essentially, the same as the one we
used in the Feinstein’s bound Theorem 18.7) produces a code with probability of success
1 S∗ (M)
1 − ϵ ≥ (1 − ) LP .
e M
This indicates that in the regime of a fixed ϵ the bound based on minimax meta-converse should
be very sharp. This, of course, provided we can select the best QY in applying it. Fortunately, for
symmetric channels optimal QY can be guessed fairly easily, cf. [341] for more.

22.4* Error exponents


We have studied the question of optimal error exponents in hypothesis testing before (Chapter 16).
The corresponding topic for channel coding is much harder and full of open problems.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-431


i i

22.4* Error exponents 431

We motivate the question by trying to understand the speed of convergence in the strong con-
verse (22.1). If we return to the proof of Theorem 19.9, namely the step (19.7), we see that by
applying large-deviations Theorem 15.9 we can prove that for some Ẽ(R) and any R < C we have
ϵ∗ (n, exp{nR}) ≤ exp{−nẼ(R)} .
What is the best value of Ẽ(R) for each R? This is perhaps the most famous open question in all
of channel coding. Let us proceed in more details.
We will treat both regimes R < C and R > C. The reliability function E(R) of a channel is
defined as follows:
(
limn→∞ − 1n log ϵ∗ (n, exp{nR}) R<C
E(R) =
limn→∞ − 1n log(1 − ϵ∗ (n, exp{nR})) R > C .
We leave E(R) as undefined if the limit does not exist. Unfortunately, there is no general argument
showing that this limit exist. The only way to show its existence is to prove an achievability bound
1
lim inf − log ϵ∗ (n, exp{nR}) ≥ Elower (R) ,
n→∞ n
a converse bound
1
lim sup − log ϵ∗ (n, exp{nR}) ≤ Eupper (R) ,
n→∞ n
and conclude that the limit exist whenever Elower = Eupper . It is common to abuse notation and
write such pair of bounds as
Elower (R) ≤ E(R) ≤ Eupper (R) ,
even though, as we said, the E(R) is not known to exist unless the two bounds match unless the
two bounds match.
From now on we restrict our discussion to the case of a DMC. An important object to
define is the Gallager’s E0 function, which is nothing else than the right-hand side of Gallager’s
bound (18.15). For the DMC it has the following expression:
!1+ρ
X X 1
E0 (ρ, PX ) = − log PX (x)PY|X (y|x)
1+ρ

y∈B x∈A

E0 (ρ) = max E0 (ρ, PX ) , ρ≥0


PX

E0 (ρ) = min E0 (ρ, PX ) , ρ ≤ 0.


PX

This expression is defined in terms of the single-letter channel PY|X . It is not hard to see that E0
function for the n-letter extension evaluated with P⊗ n
X just equals nE0 (ρ, PX ), i.e. it tensorizes
2
similar to mutual information. From this observation we can apply Gallager’s random coding

2
There is one more very pleasant analogy with mutual information: the optimization problems in the definition of E0 (ρ)
also tensorize. That is, the optimal distribution for the n-letter channel is just P⊗n
X , where PX is optimal for a single-letter
one.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-432


i i

432

bound (Theorem 18.9) with P⊗ n


X to obtain

ϵ∗ (n, exp{nR}) ≤ exp{n(ρR − E0 (ρ, PX ))} ∀PX , ρ ∈ [0, 1] . (22.13)

Optimizing the choice of PX we obtain our first estimate on the reliability function

E(R) ≤ Er (R) ≜ sup E0 (ρ) − ρR .


ρ∈[0,1]

An analysis, e.g. [177, Section 5.6], shows that the function Er (R) is a convex, decreasing and
strictly positive on 0 ≤ R < C. Therefore, Gallager’s bound provides a non-trivial estimate of
the reliability function for the entire range of rates below capacity. At rates R → C the optimal
choice of ρ → 0. As R departs further away from the capacity the optimal ρ reaches 1 at a certain
rate R = Rcr known as the critical rate, so that for R < Rcr we have Er (R) = E0 (1) − R behaving
linearly. The Er (R) bound is shown on Figure 22.1 by a curve labeled “Random code ensemble”.
Going to the upper bounds, taking QY to be the iid product distribution in (22.12) and optimizing
yields the bound [381] known as the sphere-packing bound:

E(R) ≤ Esp (R) ≜ sup E0 (ρ) − ρR . (22.14)


ρ≥0

Comparing the definitions of Esp and Er we can see that for Rcr < R < C we have

Esp (R) = E(R) = Er (R)

thus establishing reliability function value for high rates. However, for R < Rcr we have Esp (R) >
Er (R), so that E(R) remains unknown. The Esp (R) bound is shown on Figure 22.1 by a curve
labeled “Sphere-packing (volume)”.
Both upper and lower bounds have classical improvements. The random coding bound can be
improved via technique known as expurgation showing

E(R) ≥ Eex (R) ,

and Eex (R) > Er (R) for rates R < Rx where Rx ≤ Rcr is the second critical rate; see Exercise IV.31.
The Eex (R) bound is shown on Figure 22.1 by a curve labeled “Typical random linear code (aka
expurgation)”. (See below for the explanation of the naming.)
The sphere packing bound can also be improved at low rates by analyzing a combinatorial
packing problem by showing that any code must have a pair of codewords which are close (in terms
of Hellinger distance between the induced output distributions) and concluding that confusing
these two leads to a lower bound on probability of error (via (16.3)). This class of bounds is
known as “minimum distance” based bounds and several of them are shown on Figure 22.1 with
the strongest labeled “MRRW + mindist”, corresponding to the currently the best known minimum
distance upper bound due to [299]. (This bound also known as a linear programming or JPL bound
has not seen improvements in 60 years and it is a long-standing open problem in combinatorics
to do so.)
The straight-line bound [177, Theorem 5.8.2] allows to interpolate between any minimum dis-
tance bound and the Esp (R). Unfortunately, these (classical) improvements tightly bound E(R) at

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-433


i i

22.4* Error exponents 433

only one additional rate point R = 0:

E(0+) = Eex (0) .

This state of affairs remains unchanged (for a general DMC) since the foundational work of Shan-
non, Gallager and Berlekamp in 1967. As far as we know, the common belief is that Eex (R) is in
fact the true value of E(R) for all rates. As we mentioned above this is perhaps one of the most
famous open problems in classical information theory.
We demonstrate these bounds (with exception of the straight-line bound) on the reliability func-
tion on Figure 22.1 for the case of the BSCδ . For this channel, there is an interesting interpretation
of the expurgated bound. To explain it, let us recall the different ensembles of random codes
that we discussed in Section 18.6. In particular, we had the Shannon ensemble (as used in Theo-
rem 18.5) and the random linear code (either Elias or Gallager ensembles, we do not need to make
a distinction here).
For either ensemble, it it is known [178] that Er (R) is not just an estimate, but in fact the exact
value of the exponent of the average probability of error (averaged over a code in the ensemble).
For either ensemble, however, for low rates the average is dominated by few bad codes, whereas
a typical (high probability) realization of the code has a much better error exponent. For Shannon
ensemble this happens at R < 12 Rx and for the linear ensemble it happens at R < Rx . Furthermore,
the typical linear code in fact has error exponent exactly equal to the expurgated exponent Eex (R),
see [34].
There is a famous conjecture in combinatorics stating that the best possible minimum pairwise
Hamming distance of a code with rate R is given by the Gilbert-Varshamov bound (Theorem 27.5).
If true, this would imply that E(R) = Eex (R) for R < Rx , see e.g. [283].
The most outstanding development in the error exponents since 1967 was a sequence of papers
starting from [283], which proposed a new technique for bounding E(R) from above. Litsyn’s
idea was to first prove a geometric result (that any code of a given rate has a large number of
pairs of codewords at a given distance) and then use de Caen’s inequality to convert it into a lower
bound on the probability of error. The resulting bound was very cumbersome. Thus, it was rather
surprising when Barg and MacGregor [35] were able to show that the new upper bound on E(R)
matched Er (R) for Rcr − ϵ < R < Rcr for some small ϵ > 0. This, for the first time since [381]
extended the range of knowledge of the reliability function. Their amazing result (together with
Gilbert-Varshamov conjecture) reinforced the belief that the typical linear codes achieve optimal
error exponent in the whole range 0 ≤ R ≤ C.
Regarding E(R) for R > C the situation is much simpler. We have

E(R) = sup E0 (ρ) − ρR .


ρ∈(−1,0)

The lower (achievability) bound here is due to Dueck [141] (see also [319]), while the harder
(converse) part is by Arimoto [25]. It was later discovered that Arimoto’s converse bound can
be derived by a simple modification of the weak converse (Theorem 17.3): instead of applying
1
data-processing to the KL divergence, one uses Rényi divergence of order α = 1+ρ ; see [338]

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-434


i i

434

Error−exponent bounds for BSC(0.02)


2
Sphere−packing (radius) + mindist
MRRW + mindist
1.8 Sphere−packing (volume)
Random code ensemble
Typical random linear code (aka expurgation)
Gilbert−Varshamov, dmin/2 halfplane + union bound
1.6 Gilbert−Varshamov, dmin/2 sphere

1.4

1.2
Err.Exp. (log2)

0.8 Rx = 0.24

0.6 Rcrit = 0.46

0.4
C = 0.86

0.2

0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
Rate

Figure 22.1 Comparison of bounds on the error exponent of the BSC. The MRRW stands for the upper
bound on the minimum distance of a code [299] and Gilbert-Varshamov is a lower bound (cf. Theorem 27.5).

for details. This suggests a general conjecture that replacing Shannon information measures with
Rényi ones upgrades the (weak) converse proofs to a strong converse.

22.5 Channel dispersion


Historically, first error correcting codes had rather meager rates R very far from channel capacity.
As we have seen in Section 22.4* the best codes at any rate R < C have probability of error that
behaves as

Pe = exp{−nE(R) + o(n)} .

Therefore, for a while the question of non-asymptotic characterization of log M∗ (n, ϵ) and ϵ∗ (n, M)
was equated with establishing the sharp value of the error exponent E(R). However, as codes
became better and started having rates approaching the channel capacity, the question has changed
to that of understanding behavior of log M∗ (n, ϵ) in the regime of fixed ϵ and large n (and, thus, rates
R → C). It was soon discovered by [334] that the next-order terms in the asymptotic expansion of
log M∗ give surprisingly sharp estimates on the true value of the log M∗ . Since then, the work on
channel coding focused on establishing sharp upper and lower bounds on log M∗ (n, ϵ) for finite n
(the topic of Section 22.6) and refining the classical results on the asymptotic expansions, which
we discuss here.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-435


i i

22.5 Channel dispersion 435

We have already seen that the strong converse (Theorem 22.1) can be stated in the asymptotic
expansion form as: for every fixed ϵ ∈ (0, 1),
log M∗ (n, ϵ) = nC + o(n), n → ∞.
Intuitively, though, the smaller values of ϵ should make convergence to capacity slower. This
suggests that the term o(n) hides some interesting dependence on ϵ. What is it?
This topic has been investigated since the 1960s, see [130, 402, 334, 333] , and resulted in
the concept of channel dispersion. We first present the rigorous statement of the result and then
explain its practical uses.

Theorem 22.2 Consider one of the following channels:

1 DMC
2 DMC with cost constraint
3 AWGN
4 Parallel AWGN
Let (X∗ , Y∗ ) be the input-output of the channel under the capacity achieving input distribution, and
i(x; y) be the corresponding (single-letter) information density. The following expansion holds for
a fixed 0 < ϵ < 1/2 and n → ∞

log M∗ (n, ϵ) = nC − nVQ−1 (ϵ) + O(log n) (22.15)
where Q−1 is the inverse of the complementary standard normal CDF, the channel capacity is
C = I(X∗ ; Y∗ ) = E[i(X∗ ; Y∗ )], and the channel dispersion3 is V = Var[i(X∗ ; Y∗ )|X∗ ].

Proof. The full proofs of these results are somewhat technical, even for the DMC.4 Here we only
sketch the details.
First, in the absence of cost constraints the achievability (lower bound on log M∗ ) part has
already been done by us in Theorem 19.11, where we have shown that log M∗ (n, ϵ) ≥ nC −
√ √
nVQ−1 (ϵ) + o( n) by refining the proof of the noisy channel coding theorem and using the
CLT. Replacing the CLT with its non-asymptotic version (Berry-Esseen inequality [165, Theorem

2, Chapter XVI.5]) improves o( n) to O(log n). In the presence of cost constraints, one is inclined
to attempt to use an appropriate version of the achievability bound such as Theorem 20.7. However,
for the AWGN this would require using input distribution that is uniform on the sphere. Since this
distribution is non-product, the information density ceases to be a sum of iid, and CLT is harder
to justify. Instead, there is a different achievability bound known as the κ-β bound [334, Theorem
25] that has become the workhorse of achievability proofs for cost-constrained channels with
continuous input spaces.

3
There could be multiple capacity-achieving input distributions, in which case PX∗ should be chosen as the one that
minimizes Var[i(X∗ ; Y∗ )|X∗ ]. See [334] for more details.
4
Recently, subtle gaps in [402] and [334] in the treatment of DMCs with non-unique capacity-achieving input distributions
were found and corrected in [81].

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-436


i i

436

The upper (converse) bound requires various special methods depending on the channel. How-
ever, the high-level idea is to always apply the meta-converse bound from (22.12) with an
appropriate choice of QY . Most often, QY is taken as the n-th power of the capacity achieving out-
put distribution for the channel. We illustrate the details for the special case of the BSC. In (22.4)
we have shown that
1
log M∗ (n, ϵ) ≤ − log βα (Ber(δ)⊗n , Ber( )⊗n ) . (22.16)
2
On the other hand, Exercise III.10 shows that
1 1 √ √
− log β1−ϵ (Ber(δ)⊗n , Ber( )⊗n ) = nd(δk ) + nvQ−1 (ϵ) + o( n) ,
2 2
where v is just the variance of the (single-letter) log-likelihood ratio:
" #
δ 1−δ δ δ
v = VarZ∼Ber(δ) Z log 1 + (1 − Z) log 1 = Var[Z log ] = δ(1 − δ) log2 .
2 2
1 − δ 1 − δ

Upon inspection we notice that v = V – the channel dispersion of the BSC, which completes the
proof of the upper bound:
√ √
log M∗ (n, ϵ) ≤ nC − nVQ−1 (ϵ) + o( n)

Improving the o( n) to O(log n) is done by applying the Berry-Esseen inequality in place of the
CLT, similar to the upper bound. Many more details on these proofs are contained in [333].

Remark 22.1 (Zero dispersion) We notice that V = 0 is entirely possible. For example,
consider an additive-noise channel Y = X + Z over some abelian group G with Z being uniform
on some subset of G, e.g. channel in Exercise IV.7. Among the zero-dispersion channels there is
a class of exotic channels [334], which for ϵ > 1/2 have asymptotic expansions of the form [333,
Theorem 51]:

log M∗ (n, ϵ) = nC + Θϵ (n 3 ) .
1

Existence of this special case is why we restricted the theorem above to ϵ < 21 .
Remark 22.2 The expansion (22.15) only applies to certain channels (as described in the
theorem). If, for example, Var[i(X∗ ; Y∗ )] = ∞, then the theorem need not hold and there might
be other stable (non-Gaussian) distributions that the n-letter information density will converge to.
Also notice that in the absence of cost constraints we have

Var[i(X∗ ; Y∗ )|X∗ ] = Var[i(X∗ ; Y∗ )]

since, by capacity saddle-point (Corollary 5.7), E[i(X∗ ; Y∗ )|X∗ = x] = C for PX∗ -almost all x.
As an example, we have the following dispersion formulas for the common channels that we
discussed so far:

BECδ : V(δ) = δ δ̄ log2 2

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-437


i i

22.5 Channel dispersion 437

δ̄
BSCδ : V(δ) = δ δ̄ log2
δ
P ( P + 2)
AWGN (real): VAWGN (P) = log2 e
2( P + 1) 2
P ( P + 2)
AWGN (complex): V(P) = log2 e
( P + 1) 2

BI-AWGN: V(P) = Var[log(1 + e−2P+2 PZ
)] , Z ∼ N ( 0, 1)

where for the AWGN and BI-AWGN P is the SNR. √ We also remind that, cf. Example 3.4, for the BI-
AWGN we have C(P) = log 2 − E[log(1 + e−2P+2 PZ )]. For the Parallel AWGN, cf. Section 20.4,
we have
!2 +
log2 e X
L
σj2
Parallel AWGN: V(P, {σj , j ∈ [L]}) =
2
1− ,
2 T
j=1

PL
where T is the optimal threshold in the water-filling solution, i.e. j=1 |T − σj2 |+ = P. We remark
that the expression for the parallel AWGN channel can be guessed by noticing that it equals
PL Pj
j=1 VAWGN ( σ 2 ) with Pj = |T − σj | – the optimal power allocation.
2 +
j
What about channel dispersion for other channels? Discrete channels with memory have seen
some limited success in [335], which expresses dispersion in terms of the Fourier spectrum of the
information density process. The compound DMC (Ex. IV.19) has a much more delicate dispersion
formula (and the remainder term is not O(log n), but O(n1/4 )), see [342]. For non-discrete channels
(other than the AWGN and Poisson) new difficulties appear in the proof of the converse part. For
example, the dispersion of a (coherent) fading channel is known only if one additionally restricts
the input codewords to have limited peak values, cf. [98, Remark 1]. In particular, dispersion of
the following Gaussian erasure channel is unknown:

Y i = Hi ( X i + Z i ) ,
Pn
where we have N (0, 1) ∼ Zi ⊥ ⊥ Hi ∼ Ber(1/2) and the usual quadratic cost constraint i=1 x2i ≤
nP.
Multi-antenna (MIMO) channels (20.10) present interesting new challenges as well. For exam-
ple, for coherent channels the capacity achieving input distribution is non-unique [98]. The
quasi-static channels are similar to fading channels but the H1 = H2 = · · · , i.e. the channel
gain matrix in (20.10) is not changing with time. This channel model is often used to model cellu-
lar networks. By leveraging an unexpected amount of differential geometry, it was shown in [462]
that they have zero-dispersion, or more specifically:

log M∗ (n, ϵ) = nCϵ + O(log n) ,

where the ϵ-capacity Cϵ is known as outage capacity in this case (and depends on ϵ). The main
implication is that Cϵ is a good predictor of the ultimate performance limits for these practically-
relevant channels (better than C is for the AWGN channel, for example). But some caution must
be taken in approximating log M∗ (n, ϵ) ≈ nCϵ , nevertheless. For example, in the case where H

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-438


i i

438

0.5

0.4

Rate, bit/ch.use

0.3

0.2

0.1 Capacity
Converse
RCU
DT
Gallager
Feinstein
0
0 200 400 600 800 1000 1200 1400 1600 1800 2000
Blocklength, n

Figure 22.2 Comparing various lower (achievability) bounds on 1


n
log M∗ (n, ϵ) for the BSCδ channel
(δ = 0.11, ϵ = 10−3 ).

0.5

0.4
Rate, bit/ch.use

0.3

0.2

Capacity
Converse
0.1 Normal approximation + 1/2 log n
Normal approximation
Achievability

0
0 200 400 600 800 1000 1200 1400 1600 1800 2000
Blocklength, n

Figure 22.3 Comparing the normal approximation against the best upper and lower bounds on 1
n
log M∗ (n, ϵ)
for the BSCδ channel (δ = 0.11, ϵ = 10−3 ).

matrix is known at the transmitter, the same paper demonstrated that the standard water-filling
power allocation (Theorem 20.14) that maximizes Cϵ is rather sub-optimal at finite n.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-439


i i

22.7 Normalized Rate 439

22.6 Finite blocklength bounds and normal approximation


As stated earlier, direct computation of M∗ (n, ϵ) by exhaustive search doubly exponential in com-
plexity, and thus is infeasible in most cases. However, the bounds we have developed so far can
often help sandwich the unknown value pretty tightly. Less rigorously, we may also evaluate the
normal approximation which simply suggests dropping unknown terms in the expansion (22.15):

log M∗ (n, ϵ) ≈ nC − nVQ−1 (ϵ)

(The log n term in (22.15) is known to be equal to O(1) for the BEC, and 12 log n for the BSC,
AWGN and binary-input AWGN. For these latter channels, normal approximation is typically
defined with + 12 log n added to the previous display.)
For example, considering the BEC1/2 channel we can easily compute the capacity and disper-
sion to be C = (1 − δ) and V = δ(1 − δ) (in bits and bits2 , resp.). Detailed calculation in Ex. IV.4
lead to the following rigorous estimates:

213 ≤ log2 M∗ (500, 10−3 ) ≤ 217 .

At the same time the normal approximation yields


p
log M∗ (500, 10−3 ) ≈ nδ̄ − nδ δ̄ Q−1 (10−3 ) ≈ 215.5 bits

This tightness is preserved across wide range of n, ϵ, δ .


As another example, we can consider the BSCδ channel. We have already presented numerical
results for this channel in (17.3). Here, we evaluate all the lower bounds that were discussed in
Chapter 18. We show the results in Figure 22.2 together with the upper bound (22.16). We conclude
that (unsurprisingly) the RCU bound is the tightest and is impressively close to the converse bound,
as we have already seen in (17.3). The normal approximation (with and without the 1/2 log n
term) is compared against the rigorous bounds on Figure 22.3. The excellent precision of the
approximation should be contrasted with a fairly loose estimate arising from the error-exponent
approximation (which coincides with the “Gallager” curve on Figure 22.2).
We can see that for the simple cases of the BEC and the BSC, the solution to the incredibly
complex combinatorial optimization problem log M∗ (n, ϵ) can be rather well approximated by
considering the first few terms in the expansion (22.15). This justifies further interest in computing
channel dispersion and establishing such expansions for other channels.

22.7 Normalized Rate


Suppose we are considering two different codes. One has M = 2k1 and blocklength n1 (and so, in
engineering language is a k1 -to-n1 code) and another is a k2 -to-n2 code. How can we compare the
two of them fairly? A traditional way of presenting the code performance is in terms of the “water-
fall plots” showing dependence of the probability of error on the SNR (or crossover probability)
of the channel. These two codes could have waterfall plots of the following kind:

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-440


i i

440

Pe k1 → n1 Pe k2 → n2

10−4 10−4

P∗ SNR P∗ SNR

After inspecting these plots, one may believe that the k1 → n1 code is better, since it requires a
smaller SNR to achieve the same error probability. However, this ignores the fact that the rate of
this code nk11 might be much smaller as well. The concept of normalized rate allows us to compare
the codes of different blocklengths and coding rates.
Specifically, suppose that a k → n code is given. Fix ϵ > 0 and find the value of the SNR P for
which this code attains probability of error ϵ (for example, by taking a horizontal intercept at level
ϵ on the waterfall plot). The normalized rate is defined as
k k
Rnorm (ϵ) = ≈ p ,
log2 M∗ (n, ϵ, P) nC(P) − nV(P)Q−1 (ϵ)
where log M∗ , capacity and dispersion correspond to the channel over which evaluation is being
made (most often the AWGN, BI-AWGN or the fading channel). We also notice that, of course,
the value of log M∗ is not possible to compute exactly and thus, in practice, we use the normal
approximation to evaluate it.
This idea allows us to clearly see how much different ideas in coding theory over the decades
were driving the value of normalized rate upward to 1. This comparison is show on Figure 22.4.
A short summary is that at blocklengths corresponding to “data stream” channels in cellular net-
works (n ∼ 104 ) the LDPC codes and non-binary LDPC codes are already achieving 95% of the
information-theoretic limit. At blocklengths corresponding to “control plane” (n ≲ 103 ) the polar
codes and LDPC codes are at similar performance and at 90% of the fundamental limits.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-441


i i

22.7 Normalized Rate 441

Normalized rates of code families over AWGN, Pe=0.0001


1

0.95

0.9

0.85 Turbo R=1/3


Turbo R=1/6
Turbo R=1/4
0.8 Voyager
Normalized rate

Galileo HGA
Turbo R=1/2
0.75 Cassini/Pathfinder
Galileo LGA
Hermitian curve [64,32] (SDD)
0.7 Reed−Solomon (SDD)
BCH (Koetter−Vardy)
Polar+CRC R=1/2 (List dec.)
0.65 ME LDPC R=1/2 (BP)

0.6

0.55

0.5 2 3 4 5
10 10 10 10
Blocklength, n
Normalized rates of code families over BIAWGN, Pe=0.0001
1

0.95

0.9
Turbo R=1/3
Turbo R=1/6
Turbo R=1/4
0.85
Voyager
Normalized rate

Galileo HGA
Turbo R=1/2
Cassini/Pathfinder
0.8
Galileo LGA
BCH (Koetter−Vardy)
Polar+CRC R=1/2 (L=32)
Polar+CRC R=1/2 (L=256)
0.75
Huawei NB−LDPC
Huawei hybrid−Cyclic
ME LDPC R=1/2 (BP)
0.7

0.65

0.6 2 3 4 5
10 10 10 10
Blocklength, n

Figure 22.4 Normalized rates for various codes. Plots generated using [397] (color version recommended)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-442


i i

23 Channel coding with feedback

So far we have been focusing on the paradigm for one-way communication: data are mapped to
codewords and transmitted, and later decoded based on the received noisy observations. In most
practical settings (except for storage), frequently the communication goes in both ways so that the
receiver can provide certain feedback to the transmitter. As a motivating example, consider the
communication channel of the downlink transmission from a satellite to earth. Downlink transmis-
sion is very expensive (power constraint at the satellite), but the uplink from earth to the satellite
is cheap which makes virtually noiseless feedback readily available at the transmitter (satellite).
In general, channel with noiseless feedback is interesting when such asymmetry exists between
uplink and downlink. Even in less ideal settings, noisy or partial feedbacks are commonly available
that can potentially improve the reliability or complexity of communication.
In the first half of our discussion, we shall follow Shannon to show that even with noiseless
feedback nothing (in terms of capacity) can be gained in the conventional setup. In the process, we
will also introduce the concept of Massey’s directed information. In the second half of the Chapter
we examine situations where feedback is extremely helpful: low probability of error, variable
transmission length and variable transmission power.

23.1 Feedback does not increase capacity for stationary memoryless


channels
Definition 23.1 (Code with feedback) An (n, M, ϵ)-code with feedback is specified by
the encoder-decoder pair (f, g) as follows:

• Encoder: (time varying)

f1 : [ M ] → A
f2 : [ M ] × B → A
..
.
fn : [M] × B n−1 → A

• Decoder:

g : B n → [M]

442

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-443


i i

23.1 Feedback does not increase capacity for stationary memoryless channels 443

such that P[W 6= Ŵ] ≤ ϵ.

Here the symbol transmitted at time t depends on both the message and the history of received
symbols (causality constraint):

Xt = ft (W, Yt1−1 ).

Hence the probability space is as follows:

W ∼ uniform on [M]
PY|X

X1 = f1 (W) −→ Y1 


.. −→ Ŵ = g(Yn )
. 
PY|X 

Xn = fn (W, Yn1−1 ) −→ Yn
Figure 23.1 compares the settings of channel coding without feedback and with ideal full feedback:

W Xn channel Yn Ŵ

W Xk channel Yk Ŵ

delay

Figure 23.1 Schematic representation of coding without feedback (left) and with full noiseless feedback
(right).

Definition 23.2 (Fundamental limits)


M∗fb (n, ϵ) = max{M : ∃(n, M, ϵ) code with feedback.}
1
Cfb,ϵ = lim inf log M∗fb (n, ϵ)
n→∞ n

Cfb = lim Cfb,ϵ (Shannon capacity with feedback)


ϵ→0

Theorem 23.3 (Shannon 1956) For a stationary memoryless channel,


Cfb = C = C(I) = sup I(X; Y).
PX

Proof. Achievability: Although it is obvious that Cfb ≥ C, we wanted to demonstrate that in fact
constructing codes achieving capacity with full feedback can be done directly, without appealing

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-444


i i

444

to a (much harder) problem of non-feedback codes. Let π t (·) ≜ PW|Yt (·|Yt ) with the (random) pos-
terior distribution after t steps. It is clear that due to the knowledge of Yt on both ends, transmitter
and receiver have perfectly synchronized knowledge of π t . Now consider how the transmission
progresses:

1 Initialize π 0 (·) = M1 .
2 At (t + 1)-th step, encoder sends Xt+1 = ft+1 (W, Yt ). Note that selection of ft+1 is equivalent to
the task of partitioning message space [M] into classes Pa , i.e.

Pa ≜ {j ∈ [M] : ft+1 (j, Yt ) = a} a ∈ A.

How to do this partitioning optimally is what we will discuss soon.


3 Channel perturbs Xt+1 into Yt+1 and both parties compute the updated posterior:
PY|X (Yt+1 |ft+1 (j, Yt ))
π t+1 (j) ≜ π t (j)Bt+1 (j) , Bt+1 (j) ≜ P .
a∈A π t (Pa )

Notice that (this is the crucial part!) the random multiplier satisfies:
XX PY|X (y|a)
E[log Bt+1 (W)|Yt ] = π t (Pa ) log P = I(π̃ t+1 , PY|X ) (23.1)
a∈A y∈B a∈A π t (Pa )a

where π̃ t+1 (a) ≜ π t (Pa ) is a (random) distribution on A, induced by the encoder at the channel
input in round (t + 1). Note that while π t is decided before the (t + 1)-st step, design of partition
Pa (and hence π̃ t+1 ) is in the hands of the encoder.

The goal of the code designer is to come up with such a partitioning {Pa : a ∈ A} that the speed
of growth of π t (W) is maximal. Now, analyzing the speed of growth of a random-multiplicative
process is best done by taking logs:
X
t
log π t (j) = log Bs + log π 0 (j) .
s=1

Intuitively, we expect that the process log π t (W) resembles a random walk starting from − log M
and having a positive drift. Thus to estimate the time it takes for this process to reach value 0
we need to estimate the upward drift. Appealing to intuition and the law of large numbers (more
exactly to the theory of martingales) we approximate
X
t
log π t (W) − log π 0 (W) ≈ E[log Bs ] .
s=1

Finally, from (23.1) we conclude that the best idea is to select partitioning at each step in such a
way that π̃ t+1 ≈ P∗X (capacity-achieving input distribution) and this obtains

log π t (W) ≈ tC − log M ,

implying that the transmission terminates in time ≈ logCM . The important lesson here is the follow-
ing: The optimal transmission scheme should map messages to channel inputs in such a way that

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-445


i i

23.1 Feedback does not increase capacity for stationary memoryless channels 445

the induced input distribution PXt+1 |Yt is approximately equal to the one maximizing I(X; Y). This
idea is called posterior matching and explored in detail in [384].1
Although our argument above is not rigorous, it is not hard to make it such by an appeal to
theory of martingale converges, very similar to the way we used it in Section 16.3* to analyze
SPRT. We omit those details (see [384]), since the result is in principle not needed for the proof
of the Theorem.
Converse: We are left to show that Cfb ≤ C(I) . Recall the key in proving weak converse for
channel coding without feedback: Fano’s inequality plus the graphical model

W → Xn → Yn → Ŵ. (23.2)

Then

−h(ϵ) + ϵ̄ log M ≤ I(W; Ŵ) ≤ I(Xn ; Yn ) ≤ nC(I) .

With feedback the probabilistic picture becomes more complicated as the following Figure 23.2
demonstrates for n = 3 (dependence introduced by the extra squiggly arrows):

X1 Y1 X1 Y1

W X2 Y2 Ŵ W X2 Y2 Ŵ

X3 Y3 X3 Y3
without feedback with feedback

Figure 23.2 Graphical model for channel coding and n = 3 with and without feedback. Double arrows ⇒
correspond to the channel links.

Notice that the d-separation criterion shows we no longer have Markov chain (23.2), i.e. given
X the W and Yn are not independent.2 Furthermore, the input-output relation is also no longer
n

memoryless
Y
n
PYn |Xn (yn |xn ) 6= PY|X (yj |xj )
j=1

1
This simple (but capacity-achieving) feedback coding scheme also helps us appreciate more fully the magic of Shannon’s
(non-feedback) coding theorem, which demonstrated that the (almost) optimal partitioning can be done in a way that is
totally blind to actual channel outputs. That is, we can preselect partitions Pa that are independent of π t (but dependent on
t) and so that π t (Pa ) ≈ P∗X (a) with overwhelming probability and for almost all t ∈ [n].
2
For example, suppose we are transmitting W ∼ Ber(1/2) over the BSC and set X1 = 0, X2 = W ⊕ Y1 . Then given X1 , X2
we see that Y2 and W can be exactly determined from one another.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-446


i i

446

(as an example, let X2 = Y1 and thus PY1 |X1 X2 = δX1 is a point mass). Nevertheless, there is still a
large degree of independence in the channel. Namely, we have
(Yt−1 , W) →Xt → Yt , t = 1, . . . , n (23.3)
W → Y → Ŵ
n
(23.4)
Then
−h(ϵ) + ϵ̄ log M ≤ I(W; Ŵ) (Fano)
≤ I(W; Y ) n
(Data processing applied to (23.4))
X
n
= I(W; Yt |Yt−1 ) (Chain rule)
t=1
Xn
≤ I(W, Yt−1 ; Yt ) (I(W; Yt |Yt−1 ) = I(W, Yt−1 ; Yt ) − I(Yt−1 ; Yt ))
t=1
X
n
≤ I(Xt ; Yt ) (Data processing applied to (23.3))
t=1

≤ nC
In comparison with Theorem 22.2, the following result shows that, with fixed-length block cod-
ing, feedback does not even improve the speed of approaching capacity and can at most improve
the third-order log n terms.

Theorem 23.4 (Dispersion with feedback [131, 336]) For weakly input-symmetric
DMC (e.g. additive noise, BSC, BEC) we have:

log M∗fb (n, ϵ) = nC − nVQ−1 (ϵ) + O(log n)

23.2* Massey’s directed information


In this section we show an alternative proof of the converse part of Theorem 23.3, which is more
in the spirit of “channel substitution” ideas (meta-converse) that we emphasize in this book, see
Sections 3.6, 17.4 and 22.3. In addition, it will also lead us to defining the concept of directed
information ⃗I(Xn ; Yn ) due to Massey [294]. Directed information is an important tool in the field
of causal inference, though we will not go into those connections [294].
Proof. Let us revisit the steps of showing the weak converse C ≤ C(I) , when phrased in the
style of meta-converse. We take an arbitrary (n, M, ϵ) code and define two distributions with
corresponding graphical models:
P : W → Xn → Yn → Ŵ (23.5)
Q:W→X n
Y → Ŵ
n
(23.6)
We then make two key observations:

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-447


i i

23.2* Massey’s directed information 447

1 Under Q, W ⊥⊥ W, so that Q[W = Ŵ] = M1 while P[W = Ŵ] ≥ 1 − ϵ.


2 The two graphical models give the factorization:

PW,Xn ,Yn ,Ŵ = PW,Xn PYn |Xn PŴ|Yn , QW,Xn ,Yn ,Ŵ = QW,Xn QYn QŴ|Yn .

We are free to choose factors QW,Xn , QYn and QŴ|Yn . However, as we will see soon, it is best
to choose them to minimize D(PkQ) which gives us (see the discussion of information flow
after (4.6))

min D(PkQ) = I(Xn ; Yn ) (23.7)


QW,Xn ,QYn ,QŴ|Yn

and achieved by taking the factors equal to their values under P, namely QW,Xn = PW,Xn , QYn =
PYn and QŴ|Yn = PŴ|Yn . (It is a good exercise to show this by writing out the chain rule for
divergence (2.26).) As we know this minimal value of D(PkQ) measures the information flow
through the links excluded in the graphical model, i.e. through Xn → Yn .

From here we proceed via the data-processing inequality and tensorization of capacity for
memoryless channels as follows:

(∗) X
n
1 DPI
−h(ϵ) + ϵ̄ log M = d(1 − ϵk ) ≤ D(PkQ) = I(Xn ; Yn ) ≤ I(Xi ; Yi ) ≤ nC(I) (23.8)
M
i=1

where the (∗) followed from the fact that the Xn → Yn is a memoryless channel ((6.1)).
Let us now go back to the case of channels with feedback. There are several problems with
adapting the previous argument. First, when feedback is present, Xn → Yn is not memoryless due
to the influence of the transmission protocol (for example, knowing both X1 and X2 affects the law
Qn
of Y1 , that is PY1 |X1 6= PY1 |X1 ,X2 and also PYn |Xn 6= j=1 PYj |Xj even for the DMC). However, an
even bigger problem is revealed by attempting to replicate the previous proof.
Suppose we again try to induce an auxiliary probability space Q as in (23.6). Then due to lack
of Markov chain under P (i.e. (23.5)) solution of the problem (23.7) can be shown to equal this
time

min D(PkQ) = I(W, Xn ; Yn ) .

This value can be quite a bit higher than capacity. For example, consider an extremely noisy
(in fact, useless) channel BSC1/2 and a feedback transmission scheme Xt+1 = Yt . We see that
I(W, Xn ; Yn ) ≥ H(Yn−1 ) = (n − 1) log 2, whereas capacity C = 0. What went wrong in this case?
For the explanation, we should revisit the graphical model under P as shown on Figure 23.2
(right graph). When Q is defined as in (23.6) the value min D(PkQ) = I(W, Xn ; Yn ) measures the
information flow through both the ⇒ and ⇝ links.
This motivates us to find a graphical model for Q such that min D(PkQ) only captured the
information flow through only the ⇒ links {Xi → Yi : i = 1, . . . , n} (and so that min D(PkQ) ≤
⊥ Ŵ, so that Q[W = Ŵ] = M1 .
nC(I) ), while still guaranteeing that W ⊥

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-448


i i

448

X1 Y1 X1 Y1

W X2 Y2 Ŵ W X2 Y2 Ŵ

X3 Y3 X3 Y3
without feedback with feedback

Figure 23.3 Graphical model for n = 3 under the auxiliary distribution Q. Compare with Figure 23.2 under
the actual distribution P.

Such a graphical model is depicted on Figure 23.3 (right graph).3 Formally, we shall restrict
QW,Xn ,Yn ,Ŵ ∈ Q, where Q is the set of distributions that can be factorized as follows:

QW,Xn ,Yn ,Ŵ = QW QX1 |W QY1 QX2 |W,Y1 QY2 |Y1 · · · QXk |W,Yk−1 QYk |Yk−1 · · · QXn |W,Yn−1 QYn |Yn−1 QŴ|Yn

Using the d-separation criterion (see (3.11)) we can verify that for any Q ∈ Q we have W ⊥ ⊥ W:
n
W and Ŵ are d-separated by X . (More directly, one can clearly see that conditioning on any fixed
value of W = w does affect distributions of X1 , . . . , Xn but leaves Yn and Ŵ unaffected.)
Notice that in the graphical model for Q, when removing ⇒ we also added the directional links
between the Yi ’s, these links serve to maximally preserve the dependence relationships between
variables when ⇒ are removed, so that Q could be made closer to P, while still maintaining
W⊥ ⊥ W. We note that these links were also implicitly present in the non-feedback case (see model
for Q in that case on the left graph in Figure 23.3).
Now since as we agreed under Q we still have Q[W = Ŵ] = M1 we can use our usual data-
processing for divergence to conclude d(1 − ϵk M1 ) ≤ D(PkQ).
Assuming the crucial fact about this Q-graphical model that will be shown in a Lemma 23.6
(to follow), we then have the following chain:

1
d(1 − ϵk ) ≤ inf D(PW,Xn ,Yn ,Ŵ kQW,Xn ,Yn ,Ŵ )
M Q∈Q
Xn
= I(Xt ; Yt |Yt−1 ) (Lemma 23.6)
t=1

X
n
= EYt−1 [I(PXt |Yt−1 , PY|X )]
t=1

3
This kind of removal of one-directional links is known as causal conditioning.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-449


i i

23.2* Massey’s directed information 449

X
n
≤ I(EYt−1 [PXt |Yt−1 ], PY|X ) (concavity of I(·, PY|X ))
t=1
Xn
= I(PXt , PY|X )
t=1

≤nC . ( I)

Thus, we complete our proof


nC + h(ϵ) C
−h(ϵ) + ϵ̄ log M ≤ nC(I) ⇒ log M ≤ ⇒ Cfb,ϵ ≤ ⇒ Cfb ≤ C.
1−ϵ 1−ϵ
We notice that the proof can also be adapted essentially without change to channels with cost
constraints and for capacity per unit cost setting, cf. [337].

We now proceed to showing the crucial omitted step in the above proof. Before that let us define
an interesting new kind of information.

Definition 23.5 (Massey’s directed information [294]) For a pair of blocklength-n


random variables Xn and Yn we define
X
n
⃗I(Xn ; Yn ) ≜ I(Xt ; Yt |Yt−1 )
t=1

Note that directed information is not symmetric. As [294] (and subsequent work, e.g. [412])
shows ⃗I(Xn ; Yn ) quantifies the amount of causal information transfer from X-process to Y-process.
In context of feedback communication a formal justification for introduction of this concept is the
following result.

Lemma 23.6 Consider communication with feedback over a non-anticipatory channel given
by a sequence of Markov kernels PYt |Xt ,Yt−1 , t ∈ [n], i.e. we have a probability distribution P on
(W, Xn , Yn , Ŵ) described by factorization

Y
n
PW,Xn ,Yn ,Ŵ = PW PXt |W,Xt−1 ,Yt−1 PYt |Xt ,Yt−1 . (23.9)
t=1

Denote by Q all distributions factorizing with respect to the graphical models on Figure 23.3
(right graph), that is those satisfying
Y
n
QW,Xn ,Yn ,Ŵ = QW QXk |W,Yk−1 QYk |Yk−1 (23.10)
t=1

Then we have

inf D(PW,Xn ,Yn ,Ŵ kQW,Xn ,Yn ,Ŵ ) = ⃗I(Xn ; Yn ) . (23.11)


Q∈Q

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-450


i i

450

In addition, if the channel is memoryless, i.e. PYt |Xt ,Yt−1 = PYt |Xt for all t ∈ [n], then we have
X
n
⃗I(Xn ; Yn ) = I(Xt ; Yt |Yt−1 ) .
t=1

Proof. By comparing factorizations (23.9) and (23.10) and applying the chain rule (2.26) we can
immediately optimize several terms (we leave this justification as an exercise):

QX,W = PX,W ,
QXt |W,Yt−1 = PXt |W,Yt−1
QŴ|Yn = PW|Yn .

From here we conclude that

inf D(PW,Xn ,Yn ,Ŵ kQW,Xn ,Yn ,Ŵ )


Q∈Q

= inf D(PY1 |X1 kQY1 |X1 ) + D(PY2 |X2 ,Y1 kQY2 |Y1 |X2 , Y1 ) + · · · + D(PYn |Xn ,Yn−1 kQYn |Yn−1 |Xn , Yn−1 )
Q∈Q

= I(X1 ; Y1 ) + I(X2 ; Y2 |Y1 ) + · · · + I(Xn ; Yn |Yn−1 ) ,

where in the last step we simply applied (conditional) versions of Corollary 4.2.
To prove the claim for the memoryless channels, we only need to notice that

I(Xt ; Yt |Yt−1 ) = I(Xt ; Yt |Yt−1 ) + I(Xt−1 ; Yt |Yt−1 , Xt ) ,

and that the last term is zero. The latter can be justified via d-separation criterion. Indeed, in the
absence of channel memory every undirected path from Xt−1 to Yt must pass through Xt , which is
a non-collider and is conditioned on.

To summarize, we can see that Shannon’s result for feedback communication can be best
understood as a simple modification of the standard weak converse in channel coding: instead
of using

23.3 When is feedback really useful?


Theorems 23.3 and 23.4 state that feedback does not improve communication rate neither asymp-
totically nor for moderate blocklengths. In this section, we shall examine three cases where
feedback turns out to be very useful.

23.3.1 Code with very small (e.g. zero) error probability


Theorem 23.7 (Shannon [379]) For any DMC PY|X ,
1
Cfb,0 = max min log (23.12)
PX y∈B PX (Sy )

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-451


i i

23.3 When is feedback really useful? 451

where

Sy = {x ∈ A : PY|X (y|x) > 0}

denotes the set of input symbols that can lead to the output symbol y.

Remark 23.1 For stationary memoryless channel, we have


( a) (b) ( c) (d)
C0 ≤ Cfb,0 ≤ Cfb = lim Cfb,ϵ = C = lim Cϵ = C(I) = sup I(X; Y)
ϵ→0 ϵ→0 PX

where (a) and (b) are by definitions, (c) follows from Theorem 23.3, and (d) is due to Theorem 19.9.
All capacity quantities above are defined with (fixed-length) block codes.

Remark 23.2 1 In DMC for both zero-error capacities (C0 and Cfb,0 ) only the support of the
transition matrix PY|X , i.e., whether PY|X (b|a) > 0 or not, matters; the values of those non-zero
PY|X (b|a) are irrelevant. That is, C0 and Cfb,0 are determined by the bipartite graph represen-
tation between the input alphabet A and the output alphabet B . Furthermore, the C0 (but not
Cfb,0 !) is a function of the confusability graph – a simple undirected graph on A with a 6= a′
connected by an edge iff ∃b ∈ B s.t. PY|X (b|a)PY|X (b|a′ ) > 0.
2 That Cfb,0 is not a function of the confusability graph alone is easily seen from comparing the
polygon channel (next example) with L = 3 (for which Cfb,0 = log 32 ) and the useless channel
with A = {1, 2, 3} and B = {1} (for which Cfb,0 = 0). Clearly in both cases the confusability
graph is the same – a triangle.
3 Oftentimes C0 is very hard to compute, but Cfb,0 can be obtained in closed form as in (23.12).
As an example, consider the following polygon channel (named after its confusability graph):

1 1

1
2 2 5

. .
2
. .
. .
4
L L 3
Bipartite graph Confusability graph (L = 5)

The following are known about the zero-error capacity C0 of the polygon channel:
• L = 3: C0 = 0.
• L = 5: C0 = 12 log 5. This is a famous “capacity of a pentagon” problem. For achievability,
with blocklength one, one can use {1, 3} to achieve rate 1 bit; with blocklength two, the code-
book {(1, 1), (2, 3), (3, 5), (4, 2), (5, 4)} achieves rate 12 log 5 bits, as pointed out by Shannon

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-452


i i

452

in his original 1956 paper [379]. More than two decades later this was shown optimal by
Lovász using a technique now known as semidefinite programming relaxation [286].
• Even L: C0 = log L2 (Exercise IV.36).
• L = 7: 3/5 log 7 ≤ C0 ≤ log 3.32. Finding the exact value for any odd L ≥ 7 is a famous
open problem in combinatorics.
• Asymptotics of odd L: Despite being unknown in general C0 has a known asymptotic
expansion: For odd L, C0 = log L2 + o(1) [66].
In comparison, the zero-error capacity with feedback (Exercise IV.36) equals Cfb,0 = log L2 for
any L, which, thus, can strictly exceed C0 .
4 Notice that Cfb,0 is not necessarily equal to Cfb = limϵ→0 Cfb,ϵ = C. Here is an example with

C0 < Cfb,0 < Cfb = C.

Consider a channel with the following bipartite graph representation:

1 1

2 2

3 3

4 4

Then one can verify that

C0 = log 2
2 
Cfb,0 = max − log max δ, 1 − δ (P∗X = (δ/3, δ/3, δ/3, δ̄))
δ 3
5 3
= log > C0 (δ ∗ = )
2 5
On the other hand, the Shannon capacity C = Cfb can be made arbitrarily close to log 4 by
picking the cross-over probabilities arbitrarily close to zero, while the confusability graph stays
the same.

Proof of Theorem 23.7. 1 Fix any (n, M, 0)-code. For each t = 0, 1, . . . , n, denote the confusabil-
ity set of all possible messages that could have produced the received signal yt = (y1 , . . . , yt )
by:

Et (yt ) ≜ {m ∈ [M] : f1 (m) ∈ Sy1 , f2 (m, y1 ) ∈ Sy2 , . . . , fn (m, yt−1 ) ∈ Syt }

Notice that zero-error means no ambiguity:

ϵ = 0 ⇔ ∀yn ∈ B n , |En (yn )| = 1 or 0. (23.13)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-453


i i

23.3 When is feedback really useful? 453

2 The key quantities in the proof are defined as follows:

θfb ≜ min max PX (Sy ), P∗X ≜ argmin max PX (Sy )


PX y∈B PX y∈B

The goal is to show


1
Cfb,0 = log .
θfb
By definition, we have

∀PX , ∃y ∈ B, such that PX (Sy ) ≥ θfb (23.14)

Notice that in general the minimizer P∗X is not the capacity-achieving input distribution in the
usual sense (recall Theorem 5.4). This definition also sheds light on how the encoding and
decoding should proceed and serves to lower bound the uncertainty reduction at each stage of
the decoding scheme.
3 “≤” (converse): Let PXn be the joint distribution of the codewords. Denote by E0 = [M] the
original message set.
t = 1: For PX1 , by (23.14), ∃y∗1 such that:
|{m : f1 (m) ∈ Sy∗1 }| |E1 (y∗1 )|
PX1 (Sy∗1 ) = = ≥ θfb .
|{m ∈ [M]}| | E0 |
t = 2: For PX2 |X1 ∈Sy∗ , by (23.14), ∃y∗2 such that:
1

|{m : f1 (m) ∈ Sy∗1 , f2 (m, y∗1 ) ∈ Sy∗2 }| |E2 (y∗1 , y∗2 )|


PX2 (Sy∗2 |X1 ∈ Sy∗1 ) = = ≥ θfb ,
|{m : f1 (m) ∈ Sy∗1 }| |E1 (y∗1 )|
t = n: Continue the selection process up to y∗n which satisfies that:
|En (y∗1 , . . . , y∗n )|
PXn (Sy∗n |Xt ∈ Sy∗t for t = 1, . . . , n − 1) = ≥ θfb .
|En−1 (y∗1 , . . . , y∗n−1 )|
Finally, by (23.13) and the above selection procedure, we have
1 |En (y∗1 , . . . , y∗n )|
≥ ≥ θfb
n
.
M |E0 |
Thus M ≤ −n log θfb and we have shown Cfb,0 ≤ − log θfb .
4 “≥” (achievability): Let us construct a code that achieves (M, n, 0).
As an example, consider the specific channel in Figure 23.4 with |A| = |B| = 3. As the first
stage, the encoder f1 partitions the space of all M messages to three groups of size proportional
to the weight P∗X (ai ), then maps messages in each group to the corresponding ai for i = 1, 2, 3.
Suppose the channel outputs, say, y1 . Since in this example Sy1 = {a1 , a2 }, the decoder can
eliminate a total number of MP∗X (a3 ) candidate messages in this round. Afterwards, the “con-
fusability set” only contains the remaining MP∗X (Sy1 ) messages. By definition of P∗X we know
that MP∗X (Sy1 ) ≤ Mθfb . In the second round, the encoder f2 partitions the remaining messages
into three groups, sends the group index and repeats.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-454


i i

454

encoder f1 channel

MP∗
X (a1 ) messages
a1 y1

MP∗
X (a2 ) messages
a2 y2

MP∗
X (a3 ) messages
a3 y3

Figure 23.4 Achievability scheme for zero-error capacity with feedback.

By similar arguments, each interaction reduces the uncertainty by a factor of at least θfb . After n
n
iterations, the size of “confusability set” is upper bounded by Mθfb n
, if Mθfb ≤ 1,4 then zero error
probability is achieved. This is guaranteed by choosing log M = −n log θfb . Therefore we have
shown that −n log θfb bits can be reliably delivered with n + O(1) channel uses with feedback,
thus Cfb,0 ≥ − log θfb .

Theorem above shows possible advantages of feedback for zero-error communication. How-
ever, the zero-error capacity for a generic DMC (e.g. BSCδ with δ ∈ (0, 1)) we have C0 =
Cfb,0 = 0. Can we show any advantage of feedback for such channels? Clearly for that we need to
understand behavior of log M∗fb (n, ϵ) for ϵ > 0. It turns out that [336] for weakly-input symmetric
channels (Section 19.4*) we have

log M∗fb (n, ϵ) = nC − nVQ−1 (ϵ) + O(log n) ,

and thus at least up to second order the behavior of fundamental limits is unchanged in the presence
of feedback. Let us next discuss the error-exponent asymptotics (Section 22.4*) by defining
1
Efb (R) ≜ lim − log ϵ∗fb (n, exp{nR}) ,
n→∞ n

provided the limit exists and having denoted by ϵ∗fb (n, M) the smallest possible error of a feedback
code of blocklength n.
First, it is known that the sphere-packing bound (22.14) continues to hold in the presence of
feedback [312], that is

Efb (R) ≤ Esp (R) ,

4
Some rounding-off errors need to be corrected in a few final steps (because P∗X may not be closely approximable when
very few messages are remaining). This does not change the asymptotics though.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-455


i i

23.3 When is feedback really useful? 455

and thus for rates R ∈ (Rcr , C) the error-exponent Efb (R) = E(R) showing no change due to
availability of the feedback. So what is the advantage of feedback then? It turns out that the error-
exponents do improve at rates below critical. For example, for the BECδ a simple transmission
scheme where each bit is retransmitted until it is successfully received achieves error exponent
Esp (R) at all rates (since the probability of error here is given by P[Bin(n, δ) > n(1 − R/ log 2)]):
BEC : Efb (R) = Esp (R) 0 < R < C,
which is strictly higher than E(R) for R < Rcr .
For the BSCδ a beautiful result of Berlekamp shows that
1
Efb (0+) = − log((1 − δ)1/3 δ 2/3 + (1 − δ)2/3 δ 1/3 ) > E(0+) = Eex (0) = − log(4δ(1 − δ)) .
4
In other words, the error probability of optimal codes of size M = exp{o(n)} does significantly
improve in the presence of feedback (at least over the BSC and BEC).

23.3.2 Code with variable length


Consider the example of BECδ channel with feedback. We can define a simple communication
strategy that sends k data bits in the following way. Send each data bit repeatedly until it gets
through the channel unerased. Using feedback the transmitter can know exactly when this occurs
and at this point can switch to transmitting the next data bit. The expected number of channel uses
for sending k bits is given by
k
ℓ = E [ n] = .
1−δ
So, remarkably, we can see that by allowing variable-length transmission one can achieve capac-
ity even by coding each data bit independently, thus completely avoiding any finite blocklength
penalty. While this cute coding scheme is special to BECδ , it turns out that the general fact of
zero-dispersion is universal. For that we define formally the following concept.

Definition 23.8 An (ℓ, M, ϵ) variable-length feedback (VLF) code, where ℓ is a positive real,
M is a positive integer and 0 ≤ ϵ ≤ 1, is defined by:

1 A space U 5 and a probability distribution PU on it, defining a random variable U which is


revealed to both transmitter and receiver before the start of transmission; i.e. U acts as common
randomness used to initialize the encoder and the decoder before the start of transmission.
2 A sequence of encoders fn : U × {1, . . . , M} × B n−1 → A, n ≥ 1, defining channel inputs
Xn = fn (U, W, Yn−1 ) , (23.15)
where W ∈ {1, . . . , M} is the equiprobable message.
3 A sequence of decoders gn : U × B n → {1, . . . , M} providing the best estimate of W at time n.

5
It can be shown that without loss of generality we can assume |U | ≤ 3, see [336, Appendix].

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-456


i i

456

4 A non-negative integer-valued random variable τ , a stopping time of the filtration Gn =


σ{U, Y1 , . . . , Yn }, which satisfies

E[τ ] ≤ ℓ . (23.16)

The final decision Ŵ is computed at the time instant τ :

Ŵ = gτ (U, Yτ ) , (23.17)

and must satisfy

P[Ŵ 6= W] ≤ ϵ . (23.18)

The fundamental limit of channel coding with feedback is given by the following quantity:

M∗VLF (ℓ, ϵ) = max{M : ∃(ℓ, M, ϵ)-VLF code} . (23.19)

In this language, our example above showed that for the BECδ we have

log M∗VLF (ℓ, 0) ≥ (1 − δ)ℓ log 2 + O(1) , ℓ→∞

Notice that compared to the scheme without feedback, there is a significant improvement since

the term nVQ−1 (ϵ) in the expansion for log M∗ (n, ϵ) is now dropped. For this reason, results like
this are known as zero-dispersion results.
It turns out that this effect is general for all DMC as long as we allow some ϵ > 0 error.

Theorem 23.9 (VLF zero dispersion[336]) For any DMC with capacity C we have
lC
log M∗VLF (l, ϵ) = + O(log l) (23.20)
1−ϵ

We omit the proof of this result, only mentioning that the achievability part relies on ideas
similar to SPRT from Section 16.3*: the message keeps being transmitted until the information
density i(cj ; Yn ) of one of the codewords exceeds log M. See [336] for details. We also mention
that there is another variant of the VLF coding known as VLFT coding in which the stopping time
τ instead of being determined by the receiver is allowed to be determined by the transmitter (see
Exercise IV.35(d)). The expansion (23.20) continues to hold for VLFT codes as well [336].
Example 23.1 For the channel BSC0.11 without feedback the minimal is n = 3000 needed
to achieve 90% of the capacity C, while there exists a VLF code with ℓ = E[n] = 200 achieving
that [336]. This showcases how much feedback can improve the latency and decoding complexity.
VLF codes not only kill the dispersion term, but also dramatically improves error-exponents.
We have already discussed them in the context of fixed-length codes in Section 22.4* (without
feedback) and the end of last Section (with feedback). Here we mention a deep result of Burna-
shev [79], who showed that the optimal probability of error for VLF codes of rate R (i.e. with
log M = ℓR) satisfies for every DMC

ϵ∗VLF (ℓ, exp{ℓR}) = exp{−ℓEVLF (R) + o(ℓ)} ,

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-457


i i

23.3 When is feedback really useful? 457

where Burnashev’s error-exponent has a particularly simple expression


C1
EVLF (R) = (C − R)+ , C1 = max D(PY|X=x1 kPY|X=x2 ) .
C x1 , x2

Simplicity of this expression when compared to the complicated (and still largely open!) situation
with respect to non-feedback or fixed-length feedback error-exponents is striking.

23.3.3 Codes with variable power


In previous sections we discussed advantages of feedback for the DMC. For the AWGN channel, it
turns out that unlocking the potential of feedback requires relaxing the cost constraint. Recall that
in Section 20.1 we postulated that every codeword xn ∈ Rn should satisfy a fixed power constraint
Pn
P, namely j=1 x2j ≤ nP. It turns out that under such power constraint one can show that feedback
does not help much neither in any of the dispersion, finite block-length or error-exponent senses.
However, the true potential of feedback is unlocked if the power constraint is relaxed to
X
n
E[ X2j ] ≤ nP ,
j=1

where expectation here is both over the channel noise and the potential randomness employed
by the transmitter in determination of Xj on the basis of the message W and Y1 , . . . , Yj−1 . In the
following, we demonstrate how to leverage this new freedom effectively.

Elias’ scheme Consider sending a standard Gaussian random variable A over the following set
of AWGN channels:

Yk = X k + Zk , Zk ∼ N (0, σ 2 ) i.i.d.
E[X2k ] ≤ P.

We assume that full noiseless feedback is available as in Figure 23.1. Note that, crucially, the
power constraint is imposed in expectation, which does not increase the channel capacity (recall
the converse in Theorem 20.6) but enables simple algorithms such as Elias’ scheme below. In
contrast, if we insist as in Section 20.1 that each codeword satisfies the power constraint almost
Pn
surely instead in expectation, i.e., k=1 X2k ≤ nP a.s., then Elias’ scheme does not work.
Using only linear processing, Elias’ scheme proceeds according to illustration on Figure 23.5.
According to the orthogonality principle, at the receiver side we have for all t = 1, . . . , n,

A = Ât + Nt , Nt ⊥
⊥ Yt .

Moreover, since all operations are linear, all random variables are jointly Gaussian and hence the
residual error satisfies Nt ⊥
⊥ Yt . Since Xt ∝ Nt−1 ⊥⊥ Yt−1 , the codeword we are sending at each
time slot is independent of the history of the channel output (“innovation”), in order to maximize
the information transfer.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-458


i i

458

Encoder Decoder

X1 = c 1 A Y1 = c1 A + Z1 √
Â1 = E[A|Y1 ] = P
Y1
P+σ 2

X2 = c2 (A − Â1 ) Y2 = c2 (A − Â1 ) + Z2
Â2 = E[A|Y1 , Y2 ] = linear combination of Y1 , Y2

. .
. .
. .

Xn = cn (A − Ân−1 ) Yn = cn (A − Ân−1 ) + Zn
Ân = E[A|Yn ] = linear combination of Yn

Figure 23.5 Elias’ scheme for the AWGN channel with variable power. Here, each coefficient ct is chosen
such that E[X2t ] = P.

Note that Yn → Ân → A, and the optimal estimator Ân (a linear combination of Yn ) is a sufficient
statistic of Yn for A thanks to Gaussianity. Then

I(A; Yn ) =I(A; Ân , Yn )


= I(A; Ân ) + I(A; Yn |Ân )
= I(A; Ân )
1 Var(A)
= log ,
2 Var(Nn )

where the last equality applies Ân ⊥


⊥ Nn and the Gaussian mutual information formula in Exam-
ple 3.3. While Var(Nn ) can be readily computed using standard linear MMSE results, next we
determine it information-theoretically: Notice that we also have

I(A; Yn ) = I(A; Y1 ) + I(A; Y2 |Y1 ) + · · · + I(A; Yn |Yn−1 )


= I(X1 ; Y1 ) + I(X2 ; Y2 |Y1 ) + · · · + I(Xn ; Yn |Yn−1 )
key
= I(X1 ; Y1 ) + I(X2 ; Y2 ) + · · · + I(Xn ; Yn )
n
= log(1 + P) = nC,
2

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-459


i i

23.3 When is feedback really useful? 459

where the key step applies Xt ⊥ ⊥ Yt−1 for all t. Therefore, with Elias’ scheme of sending A ∼
N (0, 1), after the n-th use of the AWGN channel with feedback and expected power P, we have
 P n
Var Nn = Var(Ân − A) = 2−2nC Var A = ,
P + σ2
which says that the reduction of uncertainty in the estimation is exponential fast in n.

Schalkwijk-Kailath scheme Elias’ scheme can also be used to send digital data. Let W ∼ be
uniform on the M-PAM (Pulse-amplitude modulation) constellation in [−1, 1], i.e., {−1, −1 +
M , · · · , −1 + M , · · · , 1}. In the very first step, W is sent (after scaling to satisfy the power
2 2k

constraint):

X0 = PW, Y0 = X0 + Z0
Since Y0 and X0 are both known at the encoder, it can compute Z0 . Hence, to describe W it is
sufficient for the encoder to describe the noise realization Z0 . This is done by employing the Elias’
scheme (n − 1 times). After n − 1 channel uses, and the MSE estimation, the equivalent channel
output:
e0 = X0 + Z
Y e0 , e0 ) = 2−2(n−1)C
Var(Z
e0 to the nearest PAM point. Notice that
Finally, the decoder quantizes Y
   √   (n−1)C √ 
e 1 −(n−1)C P 2 P
ϵ ≤ P |Z0 | > =P 2 | Z| > = 2Q
2M 2M 2M
so that

P ϵ
log M ≥ (n − 1)C + log − log Q−1 ( ) = nC + O(1).
2 2
Hence if the rate is strictly less than capacity, the error probability decays doubly exponentially as

n increases. More importantly, we gained an n term in terms of log M, since for the case without
feedback we have (by Theorem 22.2)

log M∗ (n, ϵ) = nC − nVQ−1 (ϵ) + O(log n) .
As an example, consider P = 1 and  (n−thenchannel capacity is C = 0.5 bit per channel use. To
e(n−1)C
−3
achieve error probability 10 , 2Q 2 1) C
2M ≈ 10 , so 2M ≈ 3, and logn M ≈ n−n 1 C − logn 8 .
−3

Notice that the capacity is achieved to within 99% in as few as n = 50 channel uses, whereas the
best possible block codes without feedback require n ≈ 2800 to achieve 90% of capacity.

The take-away message of this chapter is as follows: Feedback is best harnessed with adaptive
strategies. Although it does not increase capacity under block coding, feedback can greatly boost
reliability as well as reduce coding complexity.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-460


i i

Exercises for Part IV

IV.1 Consider the AWGN channel Yn = Xn + Zn , where Zi is iid N (0, 1) and Xi ∈ [−1, 1] (amplitude
constraint). Recall that ϵ∗ (n, 2) denotes the optimal average probability of error of transmitting
1 bit of information over this channel.
R∞
(a) Express the value of ϵ∗ (n, 2) in terms of Q(x) = x √12π e−t /2 dt.
2

(b) Compute the exponent r = limn→∞ 1n log ϵ∗ (1n,2) . (Hint: Q(x) = e−x (1/2+o(1)) when x → ∞,
2

cf. Exercise V.25)


(c) Use asymptotics of hypothesis testing to compute r differently and check that two values
2
agree. (Hint: MGF of standard Gaussian Z ∼ N (0, 1) is given by E[etZ ] = et /2 .)
IV.2 Randomized encoders and decoders may help maximal probability of error:
(a) Consider a binary asymmetric channel PY|X : {0, 1} → {0, 1} specified by PY|X=0 =
Ber(1/2) and PY|X=1 = Ber(1/3). The encoder f : [M] → {0, 1} tries to transmit 1 bit
of information, i.e., M = 2, with f(1) = 0, f(2) = 1. Show that the optimal decoder which
minimizes the maximal probability of error is necessarily randomized. Find the optimal
decoder and the optimal Pe,max . (Hint: Recall binary hypothesis testing.)
(b) Give an example of PY|X : X → Y , M > 1 and ϵ > 0 such that there is an (M, ϵ)max -code
with a randomized encoder-decoder, but no such code with a deterministic encoder-decoder.
IV.3 (Lousy typist) Let X = Y = {A, S, D, F, G, H, J, K, L}. Let PY|X (α|β) = 0.1 if α and β are
neighboring letters in the keyboard, and PY|X (α|β) = 0 if α 6= β and they are not neighbors.
Find the smallest ϵ for which you can guarantee that a (4, ϵ)avg -code exists.
IV.4 (Finite-blocklength bounds for BEC). Consider a code with M = 2k operating over the
blocklength-n binary erasure channel (BEC) with erasure probability δ ∈ [0, 1).
(a) Show that regardless of the encoder-decoder pair:
+
P[error|#erasures = z] ≥ 1 − 2n−z−k
(b) Conclude by averaging over the distribution of z that the probability of error ϵ must satisfy
X  
n
n ℓ 
ϵ≥ δ (1 − δ)n−ℓ 1 − 2n−ℓ−k , (IV.1)

ℓ=n−k+1

(c) By applying the DT bound with uniform PX show that there exist codes with
X n  
n t
δ (1 − δ)n−t 2−|n−t−k+1| .
+
ϵ≤ (IV.2)
t
t=0

(d) Fix n = 500, δ = 1/2. Compute the smallest k for which the right-hand side of (IV.1) is
greater than 10−3 .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-461


i i

Exercises for Part IV 461

(e) Fix n = 500, δ = 1/2. Find the largest k for which the right-hand side of (IV.2) is smaller
than 10−3 .
(f) Express your results in terms of lower and upper bounds on log M∗ (500, 10−3 ).
IV.5 Recall that in the proof of the DT bound (Theorem 18.6) we used the decoder that outputs (for
a given channel output y) the first cm that satisfies

{i(cm ; y) > log β} . (IV.3)

One may consider the following generalization. Fix E ⊂ X × Y and let the decoder output the
first cm which satisfies

( cm , y) ∈ E

By repeating the random coding proof steps (as in the DT bound) show that the average
probability of error satisfies
M−1
E[Pe ] ≤ P[(X, Y) 6∈ E] + P[(X̄, Y) ∈ E] ,
2
where

PXYX̄ (a, b, ā) = PX (a)PY|X (b|a)PX (ā) .

Conclude that the optimal E is given by (IV.3) with β = M− 2 .


1

IV.6 In Section 18.6 we showed that for additive noise, random linear codes achieves the same per-
formance as Shannon’s ensemble (fully random coding). The total number of possible generator
matrices is qnk , which is significant smaller than double exponential, but still quite large. Now
we show that without degrading the performance, we can reduce this number to qn by restricting
to Toeplitz generator matrix G, i.e., Gij = Gi−1,j−1 for all i, j > 1.
Prove the following strengthening of Theorem 18.13: Let PY|X be additive noise over Fnq . For
any 1 ≤ k ≤ n, there exists a linear code f : Fkq → Fnq with Toeplitz generator matrix, such that
 +
h − n−k−log 1 n i
Pe,max = Pe ≤ E q
q P Zn ( Z )

How many Toeplitz generator matrices are there?


Hint: Analogous to the proof Theorem 15.2, first consider random linear codewords plus ran-
dom dithering, then argue that dithering can be removed without changing the performance of
the codes. Show that codewords are pairwise independent and uniform.
IV.7 (Wozencraft ensemble) Let X = Y = F2q , a vector space of dimension two over Galois field
with q elements. A Wozencraft code of rate 1/2 is a map parameterized by 0 6= u ∈ Fq given as
a 7→ (a, a · u), where a ∈ Fq corresponds to the original message, multiplication is over Fq and
(·, ·) denotes a 2-dimensional vector in F2q . We will show there exists u yielding a (q, ϵ)avg code
with
" ( +
)#
q2
ϵ ≤ E exp − i(X; Y) − log (IV.4)
2( q − 1)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-462


i i

462 Exercises for Part IV

for the channel Y = X + Z where X is uniform on F2q , noise Z ∈ F2q has distribution PZ and

P Z ( b − a)
i(a; b) ≜ log .
q− 2
(a) Show that probability of error of the code a 7→ (av, au) + h is the same as that of a 7→
(a, auv−1 ).
(b) Let {Xa , a ∈ Fq } be a random codebook defined as

Xa = (aV, aU) + H ,

with V, U uniform over non-zero elements of Fq and H uniform over F2q , the three being
jointly independent. Show that for a 6= a′ we have
1
PXa ,X′a (x21 , x̃21 ) = 1{x1 6= x̃1 , x2 6= x̃2 }.
q2 ( q − 1) 2
(c) Show that for a 6= a′
q2 1
P[i(X′a ; Xa + Z) > log β] ≤ P[i(X̄; Y) > log β] − P[i(X; Y) > log β]
( q − 1) 2 (q − 1)2
q2
≤ P[i(X̄; Y) > log β] ,
( q − 1) 2

where PX̄XY (ā, a, b) = q14 PZ (b − a).


(d) Conclude by following the proof of the DT bound with M = q that the probability of error
averaged over the random codebook {Xa } satisfies (IV.4).
IV.8 (Universal codes) Fix finite alphabets X and Y .
a Let C be a finite collection of channels PY|X : X → Y . Show that for any PX and any R > 0
there exists a sequence of codes (fn , gn ) such that regardless of what DMC PY|X ∈ C is selected
we have P[fn (W) 6= gn (Yn )] → 0 as long as I(PX , PY|X ) > R. (Hint: union bound over C )
b Extend the idea to show that there exists (fn , gn ) such that for any DMC with I(PX , PY|X > R
we have P[fn (W) 6= gn (Yn )] → 0. (Hint: discretize the set of X × Y stochastic matrices)
IV.9 (Information density and types.) Let PY|X : A → B be a DMC and let PX be some input
distribution. Take PXn Yn = PnXY and define i(an ; bn ) with respect to this PXn Yn .
(a) Show that i(xn ; yn ) is a function of only the “joint type” P̂XY of (xn , yn ), which is a
distribution on A × B defined as
1
P̂XY (a, b) = #{i : xi = a, yi = b} ,
n
where a ∈ A and b ∈ B . Therefore the condition of the form { n1 i(xn ; yn ) ≥ γ} in the decoder
(18.10) used in Shannon’s random coding bound can be interpreted as a constraint on the
joint type of (xn , yn ).
(b) Assume also that the input xn is such that P̂X = PX . Show that
1 n n
i(x ; y ) ≤ I(P̂X , P̂Y|X ) .
n

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-463


i i

Exercises for Part IV 463

The quantity I(P̂X , P̂Y|X ), sometimes written as I(xn ∧ yn ), is an empirical mutual informa-
tion.6 Hint:
 
PY|X (Y|X)
EQXY log = D(QY|X kQY |QX ) + D(QY kPY ) − D(QY|X kPY|X |QX ).
PY (Y)
IV.10 (Fitingof-Goppa universal codes) Consider a finite abelian group X . Define the Fitingof norm
as

kxn kΦ ≜ nH(P̂xn ) = nH(xT ), T ∼ Unif([n]) ,

where P̂xn is the empirical distribution of xn .


(a) Show that kxn kΦ = k − xn kΦ and the triangle inequality

kxn − yn kΦ ≤ kxn kΦ + kyn kΦ

Conclude that dΦ (xn , yn ) ≜ kxn − yn kΦ is a translation invariant (Fitingof) metric on the set
of equivalence classes in X n , with equivalence xn ∼ yn ⇐⇒ kxn − yn kΦ = 0.
(b) Define the Fitingof ball Br (xn ) ≜ {yn : dΦ (xn , yn ) ≤ r}. Show that

log |Bλn (xn )| = λn + O(log n)

for all 0 ≤ λ ≤ log |X |.


(c) Show that for any product measure PZn = PnZ on X n we have
(
1, H( Z ) < λ
lim PZn [Bλn (0n )] =
n→∞ 0, H( Z ) > λ

(d) Conclude that a code C ⊂ X n with Fitingof minimal distance dmin,Φ (C) ≜
minc̸=c′ ∈C dΦ (c, c′ ) ≥ 2λn is decodable with vanishing probability of error on any
additive-noise channel Y = X + Z, as long as H(Z) < λ.
Comment: By Feinstein-lemma like argument it can be shown that there exist codes of size
X n(1−λ) , such that balls of radius λn centered at codewords are almost disjoint. Such codes are
universally capacity-achieving for all memoryless additive-noise channels on X . Extension to
general (non-additive) channels is done via introducing dΦ (xn , yn ) = nH(xT |yT ), while exten-
sion to channels with Markov memory is done by introducing Markov-type norm kxn kΦ1 =
nH(xT |xT−1 ). See [196, Chapter 3].
IV.11 A magician is performing card tricks on stage. In each round he takes a shuffled deck of 52
cards and asks someone to pick a random card N from the deck, which is then revealed to the
audience. Assume the magician can prepare an arbitrary ordering of cards in the deck (before
each round) and that N is distributed binomially on {0, . . . , 51} with mean 51
2 .
(a) What is the maximal number of bits per round that he can send over to his companion in
the room (in the limit of infinitely many rounds)?

6
Invented by V. Goppa for his maximal mutual information (MMI) decoder [195]: Ŵ = argmaxi I(ci ∧ yn ).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-464


i i

464 Exercises for Part IV

(b) Is communication possible if N were uniform on {0, . . . , 51}? (In practice, however, nobody
ever picks the top or the bottom ones.)
IV.12 (Channel with memory) Consider the additive noise channel with A = B = F2 (Galois field
of order 2) and PYn |Xn : Fn2 → Fn2 specified by
Yn = Xn + Zn ,
where Zn = (Z1 , . . . , Zn ) is a stationary Markov chain with PZ2 |Z1 (0|1) = PZ2 |Z1 (1|0) = τ .
Show information stability and find the capacity. (Hint: your proof should work for an arbitrary
stationary ergodic noise process Z∞ = (Z1 , . . .)). Can the capacity be achieved by linear codes?

IV.13 Consider a DMC PYn |Xn = PnY|X , where a single-letter PY|X : A → B is given by A = B =
{0, 1}7 , and

1−p y=x
PY|X (y|x) =
p/7 dH ( y, x) = 1
where dH stands for Hamming distance.In other words, for each 7-bit string, the channel either
leaves it intact, or randomly flips exactly one bit.
(a) Compute the Shannon capacity C as a function of p and plot.
(b) Consider the special case of p = 78 . Show that the zero-error capacity C0 coincides with C.
Moreover, C0 can be achieved with blocklength n = 1 and give a capacity-achieving code.
IV.14 Find the capacity of the erasure-error channel (Figure 23.6) with channel matrix
 
1 − 2δ δ δ
W=
δ δ 1 − 2δ
where 0 ≤ δ ≤ 1/2.

1 − 2δ
1 δ 1

δ
0 0
1 − 2δ

Figure 23.6 Binary erasure-error channel.

IV.15 (Capacity of reordering) Routers A and B are setting up a covert communication channel in
which the data is encoded in the ordering of packets. Formally, router A receives n packets,
each having one of two types, Ack or Data, with probabilities p and 1 − p, respectively (and
iid). It encodes k bits of secret data by reordering these packets. The network between A and B
delivers packets in-order with loss rate δ . (Note: packets have sequence numbers, so each loss
is detected by B).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-465


i i

Exercises for Part IV 465

What is the maximum rate of asymptotically reliable communication achievable?


IV.16 (Sum of channels) Let W1 and W2 denote the channel matrices of discrete memoryless channel
(DMC) PY1 |X1 and PY2 |X2 with capacity C1 and C2 , respectively. The sum of the two channels is
 
another DMC with channel matrix W01 W02 . Show that the capacity of the sum channel is given
by
C = log(exp(C1 ) + exp(C2 )).
IV.17 (Product of channels) For i = 1, 2, let PYi |Xi be a (stationary memoryless) channel with input
space Ai , output space Bi , and capacity Ci . Their product channel is a channel with input space
A1 × A2 , output space B1 × B2 , and transition kernel PY1 Y2 |X1 X2 = PY1 |X1 PY2 |X2 . Show that the
capacity of the product channel is given by C = C1 + C2 .
IV.18 (Mixtures of DMCs) Consider two DMCs UY|X and VY|X with a common capacity achieving
input distribution and capacities CU < CV . Let T = {0, 1} be uniform and consider a channel
PYn |Xn that uses U if T = 0 and V if T = 1, or more formally:
1 n 1
PYn |Xn (yn |xn ) = UY|X (yn |xn ) + VnY|X (yn |xn ) . (IV.5)
2 2
Show:
(a) Is this channel {PYn |Xn }n≥1 stationary? Memoryless?
(b) Show that the Shannon capacity C of this channel is not greater than CU .
(c) The maximal mutual information rate is
1 CU + CV
C(I) = lim sup I(Xn ; Yn ) =
n→∞ n Xn 2
(d) Conclude that C < C(I) and strong converse does not hold.
IV.19 (Compound DMC [59]) Compound DMC is a family of DMC’s with common input and output
alphabets PYs |X : A → B, s ∈ S . An (n, M, ϵ) code is an encoder-decoder pair whose probability
of error ≤ ϵ over any channel PYs |X in the family (note that the same encoder and the same
decoder are used for each s ∈ S ). Show that capacity is given by
C = sup inf I(X; Ys ) .
PX s

The dispersion of the compound DMC is, however, more delicate [342].
IV.20 Consider the following (memoryless) channel. It has a side switch U that can be in positions
ON and OFF. If U is on then the channel from X to Y is BSCδ and if U is off then Y is Bernoulli
(1/2) regardless of X. The receiving party sees Y but not U. A design constraint is that U should
be in the ON position no more than the fraction s of all channel uses, 0 ≤ s ≤ 1.
(a) One strategy is to put U into ON over the first sn time units and ignore the rest of the (1 − s)n
readings of Y. What is the maximal rate in bits per channel use achievable with this strategy?
(b) Can we increase the communication rate if the encoder is allowed to modulate the U switch
together with the input X (while still satisfying the s-constraint on U)?
(c) Now assume nobody has access to U, which is iid Ber(s) independent of X. Find the
capacity.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-466


i i

466 Exercises for Part IV

IV.21 Alice has n oranges and great many (essentially, infinite) number of empty trays. She wants to
communicate a message to Bob by placing the oranges in (sequentially numbered) trays with
at most one orange per tray. Unfortunately, before Bob gets to see the trays Eve inspects them
and eats each orange independently with probability 0 < δ < 1. In the limit of n → ∞ show
that an arbitrary high rate (in bits per orange) is achievable.
Show that capacity changes to log2 δ1 bits per orange if Eve never eats any oranges but places
an orange into each empty tray with probability δ (iid).
IV.22 (Non-stationary channel [106, Problem 9.12]) A train pulls out of the station at constant velocity.
The received signal energy thus falls off with time as 1/i2 . The total received signal at time i is
 
1
Yi = X i + Zi ,
i
i.i.d. Pn
where Z1 , Z2 , . . . ∼ N(0, σ 2 ). The transmitter cost constraint for block length n is i |x2i | ≤ nP.
Show that the capacity C is equal to zero for this channel.
IV.23 (Capacity-cost function at the boundary.) Recall from Corollary 20.5 that we have shown that
for stationary memoryless channels and P > P0 capacity equals f(P):

C(P) = f(P) , (IV.6)

where

P0 ≜ inf c(x) (IV.7)


x∈A

f( P ) ≜ sup I(X; Y) . (IV.8)


X:E[c(X)]≤P

Show:
(a) If P0 is not admissible, i.e., c(x) > P0 for all x ∈ A, then C(P0 ) is undefined (even M = 1
is not possible)
(b) If there exists a unique x0 such that c(x0 ) = P0 then

C(P0 ) = f(P0 ) = 0 .

(c) If there are more than one x with c(x) = P0 then we still have

C(P0 ) = f(P0 ) .

(d) Give example of a channel with discontinuity of C(P) at P = P0 . (Hint: select a suitable
cost function for the channel Y = (−1)Z · sign(X), where Z is Bernoulli and sign : R →
{−1, 0, 1}.)
IV.24 Consider a stationary memoryless additive non-Gaussian noise channel:

Yi = Xi + Zi , E [ Z i ] = 0, Var[Zi ] = 1
Pn 2
with the standard input constraint i=1 xi ≤ nP.
(a) Prove that capacity C(P) of this channel satisfies (20.6). (Hint: Gaussian saddle point
Theorem 5.11 and the golden formula I(X; Y) ≤ D(PY|X kQY |PX ).)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-467


i i

Exercises for Part IV 467

(b) If D(PZ kN (0, 1)) = ∞ (Z is very non-Gaussian), then it is possible that the capacity is
infinite. Consider Z is ±1 with equal probability. Show that the capacity is infinite by a)
proving the maximal mutual information is infinite; b) giving an explicit scheme to achieve
infinite capacity.
IV.25 (Input-output cost) Let PY|X : X → Y be a DMC and consider a cost function c : X × Y → R
(note that c(x, y) ≤ L < ∞ for some L). Consider a problem of channel coding, where the
error-event is defined as
( n )
X
{error} ≜ {Ŵ 6= W} ∪ c(Xk , Yk ) > nP ,
k=1

where P is a fixed parameter. Define operational capacity C(P) and show it is given by

C(I) (P) = max I(X; Y)


PX :E[c(X,Y)]≤P

for all P > P0 ≜ minx0 E[c(X, Y)|X = x0 ]. Give a counterexample for P = P0 . (Hint: do a
converse directly, and for achievability reduce to an appropriately chosen cost-function c′ (x)).
IV.26 (Gauss-Markov noise) Let {Zj , j = 1, 2, . . .} be a stationary ergodic Gaussian process with
variance 1 such that Zj form an Markov chain Z1 → . . . → Zn → . . . Consider an additive
channel

Yn = X n + Zn
Pn
with power constraint j=1 |xj |2 ≤ nP. Suppose that I(Z1 ; Z2 ) = ϵ  1, then capacity-cost
function
1
C(P) = log(1 + P) + Bϵ + o(ϵ)
2
as ϵ → 0. Compute B and interpret your answer.
How does the frequency spectrum of optimal signal change with increasing ϵ?
IV.27 A semiconductor company offers a random number generator that outputs a block of random n
bits Y1 , . . . , Yn . The company wants to secretly embed a signature in every chip. To that end, it
decides to encode a k-bit signature in n real numbers Xj ∈ [0, 1]. Given an individual signature a
chip is manufactured such that it produces the outputs Yj ∼ Ber(Xj ). In order for the embedding
to be inconspicuous the average bias p should be small:

1X
n
1
Xj − ≤ p.
n 2
j=1

As a function of p how many signature bits per output (k/n) can be reliably embedded in this
fashion? Is there a simple coding scheme achieving this performance?
IV.28 (Capacity of sneezing) A sick student once every minute with probability p (iid) wants to sneeze.
He decides to send k bits to a friend by modulating the sneezes. For that, every time he realizes
he is about to sneeze he chooses to suppress a sneeze or not. A friend listens for n minutes and
then tries to decode k bits.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-468


i i

468 Exercises for Part IV

(a) Find the capacity in bits per minute. (Hint: Think how to define the channel so that channel
input at time t were not dependent on the arrival of the sneeze at time t. To rule out strategies
that depend on arrivals of past sneezes, you may invoke Exercise IV.34.)
(b) Suppose the sender can suppress at most E sneezes and listener can wait indefinitely (n =
∞). Show that the sender can transmit Cpuc E + o(E) bits reliably as E → ∞ and find the
capacity per unit cost Cpuc . Curiously, Cpuc ≥ 1.44 bits/sneeze regardless of p. (Hint: This
is similar to Exercise IV.25.)
(d*) Redo (a) and (b) for the case of a clairvoyant student who knows exactly when sneezes
will happen in the future. (This is a simple example of a so-called Gelfand-Pinsker
problem [183].)
IV.29 A data storage company is considering two options for sending its 100 Tb archive from Boston
to NYC: via (physical) mail or via wireless transmission. Let us analyze these options:
(a) Given the radiated power Pt the received power Pr at distance r for communicating at fre-
 2
c
quency f is given by Pr = G 4πrf Pt , where G is antenna-to-antenna coupling gain and c
– a speed of light7 . Assuming transmitting between Boston and NYC compute the energy
transfer coefficient η (take G = 15 dB and f = 4 GHz).
(b) The receiver’s amplifier adds white Gaussian noise of power spectral density N0 (W/Hz
or J). On the basis of required energy per bit, compute the minimal N0 which still makes
transmission over the radio economically justified assuming optimal channel coding is done
(assume 0.07$ per kWh and $20 per shipment).
(c) Compare this N0 with the thermal noise power N0,thermal = kT, where k – Boltzmann
constant and T – temperature in Kelvins. Conclude that T ≤ 103 K should work.
(d) Codes that achieve Shannon’s minimal Eb /N0 in principle do not put restrictions on the
receiver SNR (signal-to-noise ratio in one channel sample), however synchronization and
other issues constrain this SNR to be ≥ −10 dB. Assuming communication bandwidth
B = 20 Mhz compute the minimal power (in W) required for transmitter radio station.
Pr
(Hint: received SNR = BN 0
, the answer should be a few watts).
(e) How long will it take to send archive at this bandwidth and SNR? (Hint: the answer is
between a few days and a few years).
IV.30 (Optimal ϵ under ARQ) A packet of k bits is to be delivered over an AWGN channel with a
given SNR. To that end, a k-to-n error correcting code is used, whose probability of error is ϵ.
The system employs automatic repeat request (ARQ) to resend the packet whenever an error
occured.8 Suppose that the optimal k-to-n codes achieving

√ 1
k ≈ nC − nVQ−1 (ϵ) + log n
2

7
This formula is known as Friis transmission equation and it simply reflects the fact that the receiving antenna captures a
2
plane wave at the effective area of λ

.
8
Assuming there is a way for receiver to verify whether his decoder produced the correct packet contents or not (e.g. by
finding HTML tags).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-469


i i

Exercises for Part IV 469

are available. The goal is to optimize ϵ to get the highest average throughput: ϵ too small requires
excessive redundancy, ϵ too large leads to lots of retransmissions. Compute the optimal ϵ and
optimal block length n for the following four cases: SNR=0 dB or 20 dB; k = 103 or 104 bits.
(This gives an idea of what ϵ you should aim for in practice.)
IV.31 (Expurgated random coding bound)
(a) For any code C show the following bound on probability of error
1 X −dB (c,c′ )
Pe (C) ≤ 2 ,
M ′ c̸=c
Pn
where recall from (16.3) the Bhattacharya distance dB (xn , x̃n ) = j=1 dB (xj , x̃j ) and
Xp
dB (x, x̃) = − log2 W(y|x)W(y|x̃) .
y∈Y

− ρ1 dB (X,X′ )
(b) Fix PX and let E0,x (ρ, PX ) ≜ −ρ log2 E[2 ⊥ X′ ∼ PX . Show by random
], where X ⊥
coding that there always exists a code C of rate R with

Pe (C) ≤ 2n(E0,x (1,PX )−R) .

(c) We improve the previous bound as follows. We still generate C by random coding. But this
time we expurgate all codewords with f(c, C) > med(f(c, C)), where med(·) denotes the
P ′
median and f(c) = c′ ̸=c 2−dB (c,c ) . Using the bound

med(V) ≤ 2ρ E[V1/ρ ]ρ ∀ρ ≥ 1

show that

med(f(c, C)) ≤ 2n(ρR−E0,x (ρ,PX )) .

(d) Conclude that there must exist a code with rate R − O(1/n) and Pe (C) ≤ 2−nEex (R) , where

Eex (R) ≜ max −ρR + max E0,x (ρ, PX ) .


ρ≥1 PX

IV.32 (Strong converse for BSC) In this exercise we give a combinatorial proof of the strong converse
for the binary symmetric channel. For BSCδ with 0 < δ < 21 ,
(a) Given any (n, M, ϵ)max -code with deterministic encoder f and decoder g, recall that the
decoding regions {Di = g−1 (i)}M i=1 form a partition of the output space. Prove that for
all i ∈ [M],
L  
X n
| Di | ≥
j
j=0

where L is the largest integer such that P [Binomial(n, δ) ≤ L] ≤ 1 − ϵ.


(b) Conclude that

M ≤ 2n(1−h(δ))+o(n) . (IV.9)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-470


i i

470 Exercises for Part IV

(c) Show that (IV.9) holds for average probability of error. (Hint: how to go from maximal to
averange probability of error?)
(d) Conclude that strong converse holds for BSC. (Hint: argue that requiring deterministic
encoder/decoder does not change the asymptotics.)
IV.33 (Strong converse for AWGN) Recall that the AWGN channel is specified by
1 n 2
Yn = X n + Zn , Zn ∼ N (0, In ) , c(xn ) = kx k
n
Prove the strong converse for the AWGN via the following steps:
(a) Let ci = f(i) and Di = g−1 (i), i = 1, . . . , M be the codewords and the decoding regions of
an (n, M, P, ϵ)max code. Let
QYn = N (0, (1 + P)In ) .
Show that there must exist a codeword c and a decoding region D such that
PYn |Xn =c [D] ≥ 1 − ϵ (IV.10)
1
QYn [D] ≤ . (IV.11)
M
(b) Show that then
1
β1−ϵ (PYn |Xn =c , QYn ) ≤ . (IV.12)
M
(c) Show that hypothesis testing problem
PYn |Xn =c vs QYn
is equivalent to
PYn |Xn =Uc vs QYn
where U ∈ Rn×n is an orthogonal matrix. (Hint: use spherical symmetry of white Gaussian
distributions.)
(d) Choose U such that
PYn |Xn =Uc = Pn ,
where Pn is an iid Gaussian distribution of mean that depends on kck2 .
(e) Apply Stein’s lemma (Theorem 14.14) to show that for a certain value of E = E(P) > 0
we have
β1−ϵ (PYn |Xn =c , QYn ) = exp{−nE + o(n)}

(f) Conclude via (IV.12) that


1
log M ≤ nE + o(n) =⇒ Cϵ ≤
log(1 + P) .
2
IV.34 Consider a DMC with two outputs PY,U|X . Suppose that receiver observes only Y, while U is
(causally) fed back to the transmitter. We know that when Y = U the capacity is not increased.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-471


i i

Exercises for Part IV 471

(a) Show that capacity is not increased in general (even when Y 6= U).
(b) Suppose now that there is a cost function c and c(x0 ) = 0. Show that capacity per unit cost
(with U being fed back) is still given by
D(PY|X=x kPY|X=x0 )
CV = max
x̸=x0 c(x)
IV.35 Consider a binary symmetric channel with crossover probability δ ∈ (0, 1):
Y = X + Z mod 2 , Z ∼ Ber(δ) .
Suppose that in addition to Y the receiver also gets to observe noise Z through a binary erasure
channel with erasure probability δe ∈ (0, 1). Compute:
(a) Capacity C of the channel.
(b) Zero-error capacity C0 of the channel.
(c) Zero-error capacity in the presence of feedback Cfb,0 .
(d*) Now consider the setup when in addition to feedback also the variable-length communica-
tion with feedback and termination (VLFT) is allowed. What is the zero-error capacity (in
bits per average number of channel uses) in this case? (In VLFT model, the transmitter can
send a special symbol T that is received without error, but the channel dies after T has been
sent; cf. Section 23.3.2)
IV.36 Consider the polygon channel discussed in Remark 23.2, where the input and output alphabet
are both {1, . . . , L}, and PY|X (b|a) > 0 if and only if b = a or b = (a mod L) + 1. The
confusability graph is a cycle of L vertices. Rigorously prove the following:
(a) For all L, The zero-error capacity with feedback is Cfb,0 = log L2 .
(b) For even L, the zero-error capacity without feedback C0 = log L2 .
(c) Now consider the following channel, where the input and output alphabet are both
{1, . . . , L}, and PY|X (b|a) > 0 if and only if b = a or b = a + 1. In this case the confusability
graph is a path of L vertices. Show that the zero-error capacity is given by
 
L
C0 = log
2
What is Cfb,0 ?
IV.37 (BEC with feedback) Consider the stationary memoryless binary erasure channel with erasure
probability δ and noiseless feedback. Design a fixed-blocklength coding scheme achieving the
capacity, i.e., find a scheme that sends k bits over n channel uses with noiseless feedback, such
that the rate nk approaches the capacity 1 − δ when n → ∞ and the maximal probability of
error vanishes. Show also that for any rate R < (1 − δ) bit the error-exponent matches the
sphere-packing bound.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-472


i i

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-473


i i

Part V

Rate-distortion theory and metric entropy


In Part II we studied lossless data compression (source coding), where the goal is to compress
a random variable (source) X into a minimal number of bits on average (resp. exactly) so that
X can be reconstructed exactly (resp. with high probability) using these bits. In both cases, the
fundamental limit is given by the entropy of the source X. Clearly, this paradigm is confined to
discrete random variables.
In this part we will tackle the problem of compressing continuous random variables, known as
lossy data compression. Given a random variable X, we need to encode it into a minimal number
of bits, such that the decoded version X̂ is a faithful reconstruction of X, which is rigorously
understood as the distortion between X and X̂ being bounded by a prescribed fidelity either on
average or with high probability.
The motivations for studying lossy compression are at least two-fold:

1 Many natural signals (e.g. audio, images, or video) are continuously valued. As such, there is
a need to represent these real-valued random variables or processes using finitely many bits,
which can be fed to downstream digital processing; see Figure 23.7 for an illustration.


Figure 23.7 Sampling and quantization in engineering practice.

2 There is a lot to be gained in compression if we allow some reconstruction errors. This is espe-
cially important in applications where certain errors (such as high-frequency components in
natural audio and visual signals) are imperceptible to humans. This observation is the basis of
many important compression algorithms and standards that are widely deployed in practice,
including JPEG for images, MPEG for videos, and MP3 for audios.
The operation of mapping (naturally occurring) continuous time/analog signals into
(electronics-friendly) discrete/digital signals is known as quantization, which is an important sub-
ject in signal processing in its own right (cf. the encyclopedic survey [197]). In information theory,
the study of optimal quantization is called rate-distortion theory, introduced by Shannon in 1959
[380]. To start, we will take a closer look at quantization next in Section 24.1, followed by the
information-theoretic formulation in Section 24.2. A simple (and tight) converse bound is given
in Section 24.3, with the matching achievability bound deferred to the next chapter.
In Chapter 25 we present the hard direction of the rate-distortion theorem: the random coding
construction of a quantizer. This method is extended to the development of a covering lemma and
soft-covering lemma, which lead to a sharp result of Cuff showing that the fundamental limit of
channel simulation is given by Wyner's common information. We also derive (a strengthened form
of) Han-Verdú's results on approximating output distributions in KL.
Chapter 26 evaluates rate-distortion function for Gaussian and Hamming sources. We also dis-
cuss the important foundational implication that optimal (lossy) compressor paired with an optimal


error correcting code together form an optimal end-to-end communication scheme (known as joint
source-channel coding separation principle). This principle explains why “bits” are the natural
currency of the digital age.
Finally, in Chapter 27 we study Kolmogorov’s metric entropy, which is a non-probabilistic
theory of quantization for sets in metric spaces. While traditional rate-distortion tries to compress
samples from a fixed distribution, metric entropy tries to compress any element of the metric
space. What links the two topics is that metric entropy can be viewed as a rate-distortion theory
applied to the “worst-case” distribution on the space (an idea further expanded in Section 27.6). In
addition to connections to the probabilistic theory of quantization in the preceding chapters, this
concept has far-reaching consequences in both probability (e.g. empirical processes, small-ball
probability) and statistical learning (e.g. entropic upper and lower bounds for estimation) that will
be explored further in Part VI. Exercises explore applications to Brownian motion (Exercise V.30),
random matrices (Exercise V.29) and more.


24 Rate-distortion theory

In this chapter we introduce the theory of optimal quantization. In Section 24.1 we examine the
classical theory for quantization for fixed dimension and high rate, discussing various aspects such
as uniform versus non-uniform quantization, fixed versus variable rate, quantization algorithm (of
Lloyd) versus clustering, and the asymptotics of optimal quantization error. In Section 24.2 we turn
to the information-theoretic formulation of quantization, known as the rate-distortion theory, that is
targeted at high dimension and fixed rate and the regime where the number of reconstruction points
grows exponentially with dimension. Section 24.3 introduces the rate-distortion function and the
main converse bound. Finally, in Section 24.4* we discuss how to relate the average distortion
(on which we focus) to excess distortion, which targets a reconstruction error in high probability as
opposed to in expectation.

24.1 Scalar and vector quantization


Before going to the information-theoretic setting, it is important to set the stage by introducing
some classical pre-Shannon point of view on quantization. In this Section and various subsec-
tions we focus on the setting where the continuous signal lives in a relatively low-dimensional
space (for most of the Section we only discuss scalar signals). We start with the very basic but
overwhelmingly the most often used case of a scalar uniform quantization.

24.1.1 Scalar uniform quantization


The idea of quantizing an inherently continuous-valued signal was most explicitly expounded in
the patenting of Pulse-Coded Modulation (PCM) by A. Reeves; cf. [355] for some interesting
historical notes. His argument was that unlike AM and FM modulation, quantized (digital) sig-
nals could be sent over long routes without the detrimental accumulation of noise. Some initial
theoretical analysis of the PCM was undertaken in 1948 by Oliver, Pierce, and Shannon [318].
For a random variable X ∈ [−A/2, A/2] ⊂ R, the scalar uniform quantizer qU (X) with N
quantization points partitions the interval [−A/2, A/2] uniformly

(figure: N equally spaced points partitioning the interval [−A/2, A/2])


where the points are in {−A/2 + kA/N : k = 0, . . . , N − 1}.
What is the quality (or fidelity) of this quantization? Most of the time, mean squared error is
used as the quality criterion:

D(N) = E|X − qU (X)|2

where D denotes the average distortion. Often R = log2 N is used instead of N, so that we think
about the number of bits we can use for quantization instead of the number of points. To analyze
this scalar uniform quantizer, we’ll look at the high-rate regime (R  1). The key idea in the high
rate regime is that (assuming a smooth density PX ), each quantization interval ∆j looks nearly flat,
so conditioned on ∆j , the distribution is accurately approximately by a uniform distribution.


Let cj be the j-th quantization point, and ∆j be the j-th quantization interval. Here we have

DU (R) = E|X − qU (X)|^2 = ∑_{j=1}^N E[|X − cj |^2 | X ∈ ∆j ] P[X ∈ ∆j ] (24.1)
(high-rate approximation) ≈ ∑_{j=1}^N (|∆j |^2/12) P[X ∈ ∆j ] (24.2)
= (A/N)^2/12 = (A^2/12) 2^{−2R} , (24.3)
where we used the fact that the variance of Unif(−a, a) is a2 /3.
How much do we gain per bit?
10 log10 SNR = 10 log10 [Var(X) / E|X − qU (X)|^2]
= 10 log10 [12 Var(X)/A^2] + (20 log10 2) R
= constant + (6.02 dB) · R

For example, when X is uniform on [−A/2, A/2], the constant is 0. Every engineer knows the rule of
thumb “6dB per bit”; adding one more quantization bit gets you 6 dB improvement in SNR. How-
ever, here we can see that this rule of thumb is valid only in the high rate regime. (Consequently,
widely articulated claims such as “16-bit PCM (CD-quality) provides 96 dB of SNR” should be
taken with a grain of salt.)
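As a quick numerical sanity check of (24.3) and the 6 dB-per-bit rule, here is a minimal Python sketch (not part of the original text; numpy, the random seed, and the midpoint-reconstruction rule are our own illustrative choices):

import numpy as np

# Monte Carlo check of D_U(R) = A^2 2^{-2R}/12 and the ~6.02 dB/bit rule
# for X ~ Unif[-A/2, A/2] under uniform scalar quantization with midpoint reconstruction.
rng = np.random.default_rng(0)
A = 2.0
X = rng.uniform(-A / 2, A / 2, size=1_000_000)
for R in (2, 4, 6, 8):
    N = 2 ** R
    cell = np.clip(np.floor((X + A / 2) / (A / N)), 0, N - 1)   # index of quantization cell
    Xhat = -A / 2 + (cell + 0.5) * (A / N)                      # cell midpoint
    D = np.mean((X - Xhat) ** 2)
    print(R, D, A ** 2 * 2 ** (-2 * R) / 12, 10 * np.log10(np.var(X) / D))

The empirical distortion matches A^2 2^{−2R}/12 and the SNR grows by about 6.02 dB per added bit, as claimed.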
The above discussion deals with X with a bounded support. When X is unbounded, it is wise to
allocate the quantization points to those values that are more likely and saturate the large values at


the dynamic range of the quantizer, resulting in two types of contributions to the quantization error,
known as the granular distortion and overload distortion. This leads us to the question: Perhaps
uniform quantization is not optimal?

24.1.2 Scalar Non-uniform Quantization


Since our source has density pX , a good idea might be to assign more quantization points where
pX is larger, and fewer where pX is smaller.

Often the way such quantizers are implemented is to take a monotone transformation of the source
f(X), perform uniform quantization, then take the inverse function:
X −−f→ U −−qU→ qU (U) −−f^{−1}→ X̂ (24.4)

i.e., q(X) = f−1 (qU (f(X))). The function f is usually called the compander (compressor+expander).
One choice of f is the CDF of X, which maps X into a uniform distribution on [0, 1]. In fact, this compander
architecture is optimal in the high-rate regime (fine quantization) but the optimal f is not the CDF
(!). We defer this discussion till Section 24.1.4.
In terms of practical considerations, for example, the human ear can detect sounds with volume
as small as 0 dB, and a painful, ear-damaging sound occurs around 140 dB. Achieving this is
possible because the human ear inherently uses a logarithmic companding function. Furthermore,
many natural signals (such as differences of consecutive samples in speech or music (but not
samples themselves!)) have an approximately Laplace distribution. Due to these two factors, a
very popular and sensible choice for f is the μ-companding function

f(X) = sign(X) · ln(1 + μ|X|) / ln(1 + μ)


which compresses the dynamic range, uses more bits for smaller |X|’s, e.g. |X|’s in the range of
human hearing, and less quantization bits outside this region. This results in the so-called μ-law
which is used in the digital telecommunication systems in the US, while in Europe a slightly
different compander called the A-law is used.
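A minimal Python sketch of the compander architecture (24.4) with the μ-law f above (the specific values μ = 255, R = 8 bits, the Laplace test signal, and the helper name are illustrative assumptions, not from the text):

import numpy as np

def mu_law_quantize(x, mu=255.0, R=8):
    # compander architecture (24.4): compress by f, uniformly quantize on [-1, 1], expand by f^{-1}
    f = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)        # compressor f
    N = 2 ** R
    cell = np.clip(np.floor((f + 1) / (2 / N)), 0, N - 1)
    f_hat = -1 + (cell + 0.5) * (2 / N)                             # uniform quantizer q_U
    return np.sign(f_hat) * np.expm1(np.abs(f_hat) * np.log1p(mu)) / mu   # expander f^{-1}

rng = np.random.default_rng(0)
x = np.clip(rng.laplace(scale=0.05, size=100_000), -1, 1)   # Laplace-like signal on [-1, 1]
x_hat = mu_law_quantize(x)
print("MSE with mu-law companding:", np.mean((x - x_hat) ** 2))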

24.1.3 Optimal quantizers


Now we look for the optimal scalar quantizer given R bits for reconstruction. Formally, this is

Dscalar (R) = min_{q : |Im q| ≤ 2^R} E|X − q(X)|^2 (24.5)

Intuitively, we would think that the optimal quantization regions should be contiguous; otherwise,
given a point cj , our reconstruction error will be larger. Therefore in one dimension quantizers are
piecewise constant:

q(x) = cj 1 {Tj ≤ x ≤ Tj+1 }

for some cj ∈ [Tj , Tj+1 ].


Example 24.1 As a simple example, consider the one-bit quantization of X ∼ N (0, σ^2). Then the
optimal quantization points are c1 = E[X|X ≥ 0] = E[|X|] = √(2/π) σ and c2 = E[X|X ≤ 0] = −√(2/π) σ,
with quantization error equal to Var(|X|) = (1 − 2/π) σ^2 .
With ideas like this, in 1957 S. Lloyd developed an algorithm (called Lloyd’s algorithm or
Lloyd’s Method I) for iteratively finding optimal quantization regions and points.1 Suitable for
both the scalar and vector cases, this method proceeds as follows: Initialized with some choice of
N = 2k quantization points, the algorithm iterates between the following two steps:

1 Draw the Voronoi regions around the chosen quantization points (aka minimum distance
tessellation, or set of points closest to cj ), which forms a partition of the space.
2 Update the quantization points by the centroids E[X|X ∈ D] of each Voronoi region D.


Steps of Lloyd’s algorithm

1
This work at Bell Labs remained unpublished until 1982 [284].
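For concreteness, here is a minimal Python sketch of the two-step iteration above, run on samples of a standard Gaussian (sample averages stand in for the exact centroids E[X|X ∈ D]; the initialization, sample size, and helper name are our own choices). With N = 2 points it converges to roughly ±0.7979 ≈ ±√(2/π), matching Example 24.1:

import numpy as np

def lloyd(samples, centers, iters=100):
    # Lloyd's Method I on empirical data (the same iteration as the k-means heuristic)
    for _ in range(iters):
        # Step 1: Voronoi (nearest-center) partition of the samples
        labels = np.argmin(np.abs(samples[:, None] - centers[None, :]), axis=1)
        # Step 2: move each center to the centroid of its region
        centers = np.array([samples[labels == j].mean() for j in range(len(centers))])
    return np.sort(centers)

rng = np.random.default_rng(0)
x = rng.normal(size=200_000)                       # N(0, 1) source
print(lloyd(x, centers=np.array([-2.0, 0.5])))     # -> approximately [-0.80, 0.80]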


Lloyd’s clever observation is that the centroid of each Voronoi region is (in general) different than
the original quantization points. Therefore, iterating through this procedure gives the Centroidal
Voronoi Tessellation (CVT - which are very beautiful objects in their own right), which can be
viewed as the fixed point of this iterative mapping. The following theorem gives the results on
Lloyd’s algorithm.

Theorem 24.1 (Lloyd)

1 Lloyd’s algorithm always converges to a Centroidal Voronoi Tessellation.


2 The optimal quantization strategy is always a CVT.
3 CVT’s need not be unique, and the algorithm may converge to non-global optima.

Remark 24.1 The third point tells us that Lloyd’s algorithm is not always guaranteed to give
the optimal quantization strategy.2 One sufficient condition for uniqueness of a CVT is the log-
concavity of the density of X [171], e.g., Gaussians. On the other hand, even for the Gaussian PX
and N > 3, the optimal quantization points are not known in closed form. So it may seem to be
very hard to have any meaningful theory of optimal quantizers. However, as we shall see next,
when N becomes very large, locations of optimal quantization points can be characterized. In this
section we will do so in the case of fixed dimension, while for the rest of this Part we will consider
the regime of taking N to grow exponentially with dimension.
Remark 24.2 (k-means) A popular clustering method called k-means is the following: Given
n data points x1 , . . . , xn ∈ Rd , the goal is to find k centers μ1 , . . . , μk ∈ Rd to minimize the objective
function
∑_{i=1}^n min_{j∈[k]} ‖xi − μj ‖^2 .

This is equivalent to solving the optimal vector quantization problem analogous to (24.5):

min_{q : |Im(q)| ≤ k} E‖X − q(X)‖^2
where X is distributed according to the empirical distribution over the dataset, namely, (1/n) ∑_{i=1}^n δ_{xi} .
Solving the k-means problem is NP-hard in the worst case, and Lloyd’s algorithm is a commonly
used heuristic.

24.1.4 Fine quantization


Following Panter-Dite [324], we now study the asymptotics of small quantization error. For this,
introduce a probability density function λ(x), which represents the density of quantization points

2
As a simple example one may consider PX = (1/3)f(x − 1) + (1/3)f(x) + (1/3)f(x + 1) where f(·) is a very narrow pdf, symmetric
around 0. Here the CVT with centers ±2/3 is not optimal among binary quantizers (just compare to any quantizer that
quantizes two adjacent spikes to the same value).


in a given interval and allows us to approximate summations by integrals.3 Then the number of
quantization points in any interval [a, b] is ≈ N ∫_a^b λ(x)dx. For any point x, denote the size of the
quantization interval that contains x by ∆(x). Then
N ∫_x^{x+∆(x)} λ(t)dt ≈ Nλ(x)∆(x) ≈ 1 =⇒ ∆(x) ≈ 1/(Nλ(x)) .
With this approximation, the quality of reconstruction is

E|X − q(X)|^2 = ∑_{j=1}^N E[|X − cj |^2 | X ∈ ∆j ] P[X ∈ ∆j ]
≈ ∑_{j=1}^N (|∆j |^2/12) P[X ∈ ∆j ] ≈ ∫ p(x) (∆^2(x)/12) dx
= (1/(12N^2)) ∫ p(x) λ^{−2}(x) dx ,
To find the optimal density λ that gives the best reconstruction (minimum MSE) when X has den-
sity p, we use Hölder's inequality: ∫ p^{1/3} ≤ (∫ pλ^{−2})^{1/3} (∫ λ)^{2/3} . Therefore ∫ pλ^{−2} ≥ (∫ p^{1/3})^3 ,
with equality iff pλ^{−2} ∝ λ. Hence the optimizer is λ∗(x) = p^{1/3}(x) / ∫ p^{1/3}(t)dt .
Therefore when N = 2R ,4
Dscalar (R) ≈ (1/12) 2^{−2R} ( ∫ p^{1/3}(x)dx )^3
So our optimal quantizer density in the high rate regime is proportional to the cubic root of the
density of our source. This approximation is called the Panter-Dite approximation. For example,

• When X ∈ [−A/2, A/2], using Hölder's inequality again ⟨1, p^{1/3}⟩ ≤ ‖1‖_{3/2} ‖p^{1/3}‖_3 = A^{2/3} , we have
Dscalar (R) ≤ (1/12) 2^{−2R} A^2 = DU (R)
where the RHS is the uniform quantization error given in (24.1). Therefore as long as the
source distribution is not uniform, there is strict improvement. For uniform distribution, uniform
quantization is, unsurprisingly, optimal.
• When X ∼ N (0, σ^2), this gives
Dscalar (R) ≈ (√3 π/2) σ^2 2^{−2R} (24.6)
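A quick numerical check of the Gaussian constant in (24.6), as a minimal Python sketch (the integration range and grid are arbitrary choices of ours):

import numpy as np

# Panter-Dite: D_scalar(R) ≈ (1/12) 2^{-2R} (∫ p^{1/3} dx)^3.
# For p the N(0,1) density the constant (1/12)(∫ p^{1/3})^3 should equal sqrt(3)*pi/2 ≈ 2.72.
x = np.linspace(-30.0, 30.0, 600_001)
dx = x[1] - x[0]
p = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
print((np.sum(p ** (1 / 3)) * dx) ** 3 / 12, np.sqrt(3) * np.pi / 2)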

Remark 24.3 In fact, in scalar case the optimal non-uniform quantizer can be realized using
the compander architecture (24.4) that we discussed in Section 24.1.2: As an exercise, use Taylor

3
This argument is easy to make rigorous. We only need to define reconstruction points cj as the solution of
∫_{−∞}^{cj} λ(x) dx = j/N (quantile).
4
In fact when R → ∞, “≈” can be replaced by “= 1 + o(1)” as shown by Zador [467, 468].


expansion to analyze the quantization error of (24.4) when N → ∞. The optimal compander
f : R → [0, 1] turns out to be f(x) = ∫_{−∞}^{x} p^{1/3}(t)dt / ∫_{−∞}^{∞} p^{1/3}(t)dt [44, 395].

24.1.5 Fine quantization and variable rate


So far we have been focusing on quantization with restriction on the cardinality of the image
of q(·). If one, however, intends to further compress the values q(X) losslessly, a more natural
constraint is to bound H(q(X)).
Koshelev [253] discovered in 1963 that in the high-rate regime uniform quantization is asymp-
totically optimal under the entropy constraint. Indeed, if q∆ is a uniform quantizer with cell size
∆, then under appropriate assumptions we have (recall (2.21))

H(q∆ (X)) = h(X) − log ∆ + o(1) , (24.7)


where h(X) = −∫ pX (x) log pX (x) dx is the differential entropy of X. So a uniform quantizer with
H(q(X)) = R achieves
D = ∆^2/12 ≈ (2^{2h(X)}/12) 2^{−2R} .
On the other hand, any quantizer with unnormalized point density function Λ(x) (i.e. smooth
function such that ∫_{−∞}^{cj} Λ(x)dx = j) can be shown to achieve (assuming Λ → ∞ pointwise)
D ≈ (1/12) ∫ pX (x) (1/Λ^2(x)) dx
H(q(X)) ≈ ∫ pX (x) log (Λ(x)/pX (x)) dx .
Now, from Jensen’s inequality we have
(1/12) ∫ pX (x) (1/Λ^2(x)) dx ≥ (1/12) exp{−2 ∫ pX (x) log Λ(x) dx} ≈ (2^{2h(X)}/12) 2^{−2H(q(X))} ,
concluding that uniform quantizer is asymptotically optimal.
Furthermore, it turns out that for any source, even the optimal vector quantizers (to be con-
sidered next) cannot achieve distortion better than 2^{−2R} · 2^{2h(X)}/(2πe). That is, the maximal improvement
they can gain for any i.i.d. source is 1.53 dB (or 0.255 bit/sample). This is one reason why scalar
uniform quantizers followed by lossless compression is an overwhelmingly popular solution in
practice.
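The 1.53 dB (0.255 bit/sample) figure quoted here is simply the ratio of the two constants above, 2^{2h(X)}/12 versus 2^{2h(X)}/(2πe); a one-line check (minimal Python sketch, our own illustration):

import math

gap = 2 * math.pi * math.e / 12          # ratio of the two constants, = pi*e/6
print(10 * math.log10(gap), 0.5 * math.log2(gap))   # ~1.53 dB and ~0.255 bit/sample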

24.2 Information-theoretic formulation


Before describing the mathematical formulation of optimal quantization, let us begin with two
concrete examples.


Hamming Game. Given 100 unbiased bits, we are asked to inspect them and scribble something
down on a piece of paper that can store 50 bits at most. Later we will be asked to guess the original
100 bits, with the goal of maximizing the number of correctly guessed bits. What is the best
strategy? Intuitively, it seems the optimal strategy would be to store half of the bits then randomly
guess on the rest, which gives a 25% bit error rate (BER). However, as we will show later
(Theorem 26.1), the optimal strategy amazingly achieves a BER of about 11%. How is this possible?
After all we are guessing independent bits and the loss function (BER) treats all bits equally.
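The 11% figure comes from inverting the rate-distortion function 1 − h(D) of Theorem 26.1 at rate 1/2 bit per source bit, i.e., solving h(D) = 1/2. A minimal Python sketch (bisection; the helper names are ours):

import math

def h(p):   # binary entropy in bits
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def h_inv(y):   # the solution D in (0, 1/2) of h(D) = y
    lo, hi = 1e-12, 0.5
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if h(mid) < y else (lo, mid)
    return (lo + hi) / 2

print(h_inv(0.5))   # ~0.110: the optimal BER when storing 50 bits about 100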

Gaussian example. Given (X1 , . . . , Xn ) drawn independently from N (0, σ 2 ), we are given a
budget of one bit per symbol to compress, so that the decoded version (X̂1 , . . . , X̂n ) has a small
mean-squared error (1/n) ∑_{i=1}^n E[(Xi − X̂i )^2 ].
To this end, a simple strategy is to quantize each coordinate into 1 bit. As worked out in Exam-
ple 24.1, the optimal one-bit quantization error is (1 − 2/π)σ^2 ≈ 0.36σ^2 . In comparison, we will
show later (Theorem 26.2) that there is a scheme that achieves an MSE of σ^2/4 per coordinate
for large n; furthermore, this is optimal. More generally, given R bits per symbol, by doing opti-
mal vector quantization in high dimensions (namely, compressing (X1 , . . . , Xn ) jointly to nR bits),
rate-distortion theory will tell us that when n is large, we can achieve the per-coordinate MSE:

Dvec (R) = σ 2 2−2R

which, compared to (24.6), gains 4.35 dB (or 0.72 bit/sample).
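The constants quoted in these two examples are easy to reproduce (a minimal Python sketch, our own illustration): the per-coordinate MSE of one-bit scalar quantization versus the vector-quantization value σ^2 2^{−2R} at R = 1, and the 4.35 dB (0.72 bit) gap between σ^2 2^{−2R} and the high-rate scalar value (√3 π/2) σ^2 2^{−2R} of (24.6):

import math

scalar_one_bit = 1 - 2 / math.pi                 # Example 24.1 with sigma = 1:  ~0.363
vector_one_bit = 2.0 ** (-2 * 1)                 # sigma^2 2^{-2R} at R = 1:      0.25
panter_dite_const = math.sqrt(3) * math.pi / 2   # constant in (24.6)
print(scalar_one_bit, vector_one_bit)
print(10 * math.log10(panter_dite_const), 0.5 * math.log2(panter_dite_const))  # ~4.35 dB, ~0.72 bit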


The conclusions from both the Bernoulli and the Gaussian examples are rather surprising: Even
when X1 , . . . , Xn are iid, there is something to be gained by quantizing these coordinates jointly.
Some intuitive explanations for this high-dimensional phenomenon are as follows:

1 Applying scalar quantization componentwise results in quantization regions that are hypercubes,
which may be suboptimal for covering in high dimensions.
2 Concentration of measure effectively removes many atypical source realizations. For example,
when quantizing a single Gaussian X, we need to cover a large portion of R in order to deal with
those significant deviations of X from 0. However, when we are quantizing many (X1 , . . . , Xn )
together, the law of large numbers makes sure that many Xj 's cannot conspire together and all
produce large values. Indeed, (X1 , . . . , Xn ) concentrates near a sphere, as the short numerical
sketch below illustrates. As such, we may exclude large portions of the space Rn from consideration.
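The second point can be seen numerically (a minimal Python sketch; sample sizes and seed are arbitrary choices of ours): the normalized norm of (X1 , . . . , Xn ) concentrates around σ as n grows, so a high-dimensional quantizer only needs to cover a thin spherical shell.

import numpy as np

rng = np.random.default_rng(0)
for n in (1, 10, 100, 1000):
    X = rng.normal(size=(10_000, n))              # sigma = 1
    r = np.linalg.norm(X, axis=1) / np.sqrt(n)    # normalized radius of each realization
    print(n, round(r.mean(), 3), round(r.std(), 3))   # std shrinks roughly like 1/sqrt(n)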

Mathematical formulation A lossy compressor is an encoder/decoder pair (f, g) that induces
the following Markov chain
X −f→ W −g→ X̂
where X ∈ X is referred to as the source, W = f(X) is the compressed discrete data, and X̂ = g(W)
is the reconstruction, which takes values in some alphabet X̂ that need not be the same as X .
A distortion metric (or loss function) is a measurable function d : X × X̂ → R ∪ {+∞}. There
are various formulations of the lossy compression problem:


1 Fixed length (fixed rate), average distortion: W ∈ [M], minimize E[d(X, X̂)].
2 Fixed length, excess distortion: W ∈ [M], minimize P[d(X, X̂) > D].
3 Variable length, max distortion: W ∈ {0, 1}∗ , d(X, X̂) ≤ D a.s., minimize the average length
E[l(W)] or entropy H(W).

In this book we focus on lossy compression with fixed length and are chiefly concerned with
average distortion (with the exception of joint source-channel coding in Section 26.3 where excess
distortion will be needed). The difference between average and excess distortion is analogous
to average and high-probability risk bound in statistics and machine learning. It turns out that
under mild assumptions these two formulations lead to the same asymptotic fundamental limit
(cf. Remark 25.2). However, the speed of convergence to that limit is very different: the excess
distortion version converges as O(1/√n) and has a rich dispersion theory [255], which we do not discuss.
The convergence under average distortion is much faster, as O(log n / n); see Exercise V.3.
As usual, of particular interest is when the source takes the form of a random vector Sn =
(S1 , . . . , Sn ) ∈ S^n and the reconstruction is Ŝn = (Ŝ1 , . . . , Ŝn ) ∈ Ŝ^n . We will be focusing on the
so-called separable distortion metric defined for n-letter vectors by averaging the single-letter
distortions:
d(sn , ŝn ) ≜ (1/n) ∑_{i=1}^n d(si , ŝi ). (24.8)

Definition 24.2 An (n, M, D)-code consists of an encoder f : An → [M] and a decoder g :


[M] → Ân such that the average distortion satisfies E[d(Sn , g(f(Sn )))] ≤ D. The nonasymptotic
and asymptotic fundamental limits are defined as follows:

M∗ (n, D) = min{M : ∃(n, M, D)-code} (24.9)


R(D) = lim sup_{n→∞} (1/n) log M∗(n, D). (24.10)

Note that, for stationary memoryless (iid) source, the large-blocklength limit in (24.10) in fact
exists and coincides with the infimum over all blocklengths. This is a consequence of the average
distortion criterion and the separability of the distortion metric – see Exercise V.2.

24.3 Converse bounds


Now that we have the definitions, we give a (surprisingly simple) general converse.

Theorem 24.3 (General Converse) Suppose X → W → X̂, where W ∈ [M] and


E[d(X, X̂)] ≤ D. Then

log M ≥ ϕX (D) ≜ inf_{PY|X : E[d(X,Y)]≤D} I(X; Y).


Proof.

log M ≥ H(W) ≥ I(X; W) ≥ I(X; X̂) ≥ ϕX (D)

where the last inequality follows from the fact that PX̂|X is a feasible solution (by assumption).

Theorem 24.4 (Properties of ϕX )

(a) ϕX is convex, non-increasing.


(b) ϕX is continuous on (D0 , ∞), where D0 = inf{D : ϕX (D) < ∞}.
(c) Suppose X = X̂ and the distortion metric satisfies d(x, x) = D0 for all x and d(x, y) > D0 for
all x ≠ y. Then ϕX (D0 ) = I(X; X).
(d) If d is a proper metric and X is a complete metric space, we have D0 = 0 and ϕX (D0 +) =
ϕX (D0 ) = I(X; X).
(e) Let

Dmax = inf_{x̂∈X̂} E d(X, x̂).

Then ϕX (D) = 0 for all D > Dmax . If D0 < Dmax then also ϕX (Dmax ) = 0.

Remark 24.4 (The role of D0 and Dmax ) By definition, Dmax is the distortion attainable
without any information. Indeed, if Dmax = Ed(X, x̂) for some fixed x̂, then this x̂ is the “default”
reconstruction of X, i.e., the best estimate when we have no information about X. Therefore D ≥
Dmax can be achieved for free. This is the reason for the notation Dmax despite that it is defined as
an infimum. On the other hand, D0 should be understood as the minimum distortion one can hope
to attain. Indeed, suppose that X̂ = X and d is a metric on X . In this case, we have D0 = 0, since
we can choose Y to be a finitely-valued approximation of X.
As an example, consider the Gaussian source with MSE distortion, namely, X ∼ N (0, σ^2) and
d(x, x̂) = (x − x̂)^2 . We will show later that ϕX (D) = (1/2) log^+(σ^2/D). In this case D0 = 0 and the infimum
defining it is not attained; Dmax = σ 2 and if D ≥ σ 2 , we can simply output 0 as the reconstruction
which requires zero bits.

Proof.

(a) Convexity follows from the convexity of PY|X 7→ I(PX , PY|X ) (Theorem 5.3).
(b) Continuity in the interior of the domain follows from convexity, since D0 =
inf_{PX̂|X} E[d(X, X̂)] = inf{D : ϕX (D) < ∞}.
(c) The only way to satisfy the constraint is to take X = Y.
(d) Clearly, D0 = d(x, x) = 0. We also clearly have ϕX (D0 ) ≥ ϕX (D0 +). Consider a sequence
of Yn such that E[d(X, Yn )] ≤ 2−n and I(X; Yn ) → ϕX (D0 +). By Borel-Cantelli we have with
probability 1 d(X, Yn ) → 0 and hence (X, Yn ) → (X, X). Then from lower-semicontinuity of
mutual information (4.28) we get I(X; X) ≤ lim I(X; Yn ) = ϕX (D0 +).
(e) For any D > Dmax we can set X̂ = x̂ deterministically. Thus I(X; x̂) = 0. The second claim
follows from continuity.


In channel coding, the main result relates the Shannon capacity, an operational quantity, to the
information capacity. Here we introduce the information rate-distortion function in an analogous
way, which by itself is not an operational quantity.

Definition 24.5 The information rate-distortion function for a source {Si } is


R(I)(D) = lim sup_{n→∞} (1/n) ϕSn (D), where ϕSn (D) = inf_{PŜn|Sn : E[d(Sn,Ŝn)]≤D} I(Sn ; Ŝn ).

The reason for defining R(I) (D) is because from Theorem 24.3 we immediately get:

Corollary 24.6 ∀D, R(D) ≥ R(I) (D).

Naturally, the information rate-distortion function inherits the properties of ϕ from Theo-
rem 24.4:

Theorem 24.7 (Properties of R(I) )

(a) R(I) (D) is convex, non-increasing.


(b) R(I) (D) is continuous on (D0 , ∞), where D0 ≜ inf{D : R(I) (D) < ∞}.
(c) Assume the same assumption on the distortion function as in Theorem 24.4(c). For stationary
ergodic {Sn }, R(I) (D) = H (entropy rate) or +∞ if Sk is not discrete.
(d) R(I) (D) = 0 for all D > Dmax , where

Dmax ≜ lim sup_{n→∞} inf_{x̂n ∈ X̂^n} E d(Xn , x̂n ) .

If D0 < Dmax , then R(I) (Dmax ) = 0 too.

Proof. All properties follow directly from corresponding properties in Theorem 24.4 applied to
ϕSn .

Next we show that R(I) (D) can be easily calculated for stationary memoryless (iid) source
without going through the multi-letter optimization problem. This parallels Corollary 20.5 for
channel capacity (with separable cost function).

Theorem 24.8 (Single-letterization) For stationary memoryless source Si i.i.d.∼ PS and
separable distortion d in the sense of (24.8), we have for every n,

ϕSn (D) = nϕS (D).

Thus

R(I)(D) = ϕS (D) = inf_{PŜ|S : E[d(S,Ŝ)]≤D} I(S; Ŝ).


Proof. By definition we have that ϕSn (D) ≤ nϕS (D) by choosing a product channel: PŜn|Sn = (PŜ|S )^{⊗n} .
Thus R(I)(D) ≤ ϕS (D).
For the converse, for any PŜn|Sn satisfying the constraint E[d(Sn , Ŝn )] ≤ D, we have
I(Sn ; Ŝn ) ≥ ∑_{j=1}^n I(Sj ; Ŝj ) (Sn independent)
≥ ∑_{j=1}^n ϕS (E[d(Sj , Ŝj )])
≥ n ϕS ( (1/n) ∑_{j=1}^n E[d(Sj , Ŝj )] ) (convexity of ϕS )
≥ n ϕS (D) (ϕS non-increasing)
In the first step we used the crucial super-additivity property of mutual information (6.2).
For generalization to memoryless but non-stationary sources see Exercise V.10.
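For a concrete feel of the single-letter formula, ϕS (D) can also be evaluated numerically by brute force. The following minimal Python sketch (a grid search over all binary channels; not an efficient method, and entirely our own illustration) treats S ∼ Ber(1/2) with Hamming distortion and recovers 1 − h(D), the value obtained in Theorem 26.1 and behind the Hamming game of Section 24.2:

import numpy as np

def h2(p):   # binary entropy in bits, elementwise
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def phi_ber_half(D, grid=801):
    # brute-force phi_S(D) = min I(S; S_hat) over binary channels P_{S_hat|S}
    # subject to E[1{S != S_hat}] <= D, for S ~ Ber(1/2)
    a, b = np.meshgrid(np.linspace(0, 1, grid), np.linspace(0, 1, grid), indexing="ij")
    # a = P[S_hat = 1 | S = 0],  b = P[S_hat = 0 | S = 1]
    distortion = 0.5 * (a + b)
    mi = h2(0.5 * (a + 1 - b)) - 0.5 * (h2(a) + h2(b))   # I = H(S_hat) - H(S_hat | S)
    return np.min(np.where(distortion <= D, mi, np.inf))

for D in (0.05, 0.11, 0.25):
    print(D, phi_ber_half(D), 1 - h2(D))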

24.4* Converting excess distortion to average


Finally, we discuss how to build a compressor for average distortion if we have one for excess
distortion, the former of which is our focus.

Theorem 24.9 (Excess-to-Average) Suppose that there exists (f, g) such that W = f(X) ∈
[M] and P[d(X, g(W)) > D] ≤ ϵ. Assume for some p ≥ 1 and x̂0 ∈ X̂ that (E[d(X, x̂0 )p ])1/p =
Dp < ∞. Then there exists (f′ , g′ ) such that W′ = f′ (X) ∈ [M + 1] and
E[d(X, g′(W′ ))] ≤ D(1 − ϵ) + Dp ϵ^{1−1/p} . (24.11)

Remark 24.5 This result is only useful for p > 1, since for p = 1 the right-hand side of (24.11)
does not converge to D as ϵ → 0. However, a different method (as we will see in the proof of
Theorem 25.1) implies that under just Dmax = D1 < ∞ the analog of the second term in (24.11)
is vanishing as ϵ → 0, albeit at an unspecified rate.
Proof. We transform the first code into the second by adding one codeword:
f′(x) = f(x) if d(x, g(f(x))) ≤ D, and f′(x) = M + 1 otherwise;
g′(j) = g(j) for j ≤ M, and g′(M + 1) = x̂0 .
Then by Hölder's inequality,
E[d(X, g′(W′ ))] ≤ E[d(X, g(W)) | W′ ≠ M + 1](1 − ϵ) + E[d(X, x̂0 ) 1{W′ = M + 1}]


≤ D(1 − ϵ) + Dp ϵ^{1−1/p} .


25 Rate distortion: achievability bounds

In this chapter, we prove an achievability bound and (together with the converse from the previous
chapter) establish the identity R(D) = infŜ:E[d(S,Ŝ)]≤D I(S; Ŝ) for stationary memoryless sources.
The key idea is again random coding, which is a probabilistic construction of quantizers by gener-
ating the reconstruction points independently from a carefully chosen distribution. Before proving
this result rigorously, we first convey the main intuition in the case of Bernoulli sources by making
connections to large deviations theory (Chapter 15) and explaining how the constrained minimiza-
tion of mutual information is related to optimization of the random coding ensemble. Later in
Sections 25.2*–25.4*, we extend this random coding construction to establish covering lemma
and soft-covering lemma, which are at the heart of the problem of channel simulation.
We start by recalling the key concepts introduced in the last chapter:
R(D) = lim sup_{n→∞} (1/n) log M∗(n, D), (rate-distortion function)
R(I)(D) = lim sup_{n→∞} (1/n) ϕSn (D), (information rate-distortion function)
where
ϕS (D) ≜ inf_{PŜ|S : E[d(S,Ŝ)]≤D} I(S; Ŝ) (25.1)
ϕSn (D) = inf_{PŜn|Sn : E[d(Sn,Ŝn)]≤D} I(Sn ; Ŝn ) (25.2)
and d(Sn , Ŝn ) = (1/n) ∑_{i=1}^n d(Si , Ŝi ) takes a separable form.
We have shown the following general converse in Theorem 24.3: for any compression scheme
X → W → X̂ with W ∈ [M] such that E[d(X, X̂)] ≤ D, we must have log M ≥ ϕX (D), which implies in the
special case of X = Sn that log M∗(n, D) ≥ ϕSn (D) and hence, in the large-n limit, R(D) ≥ R(I)(D).
For a stationary memoryless source Si i.i.d.∼ PS , Theorem 24.8 shows that ϕSn single-letterizes as
ϕSn (D) = nϕS (D). As a result, we obtain the converse
R(D) ≥ R(I)(D) = ϕS (D).
As we said, the goal of this Chapter is to show R(D) = R(I) (D).

25.1 Shannon’s rate-distortion theorem


The following result is (essentially) proved by Shannon in his 1959 paper [380].


Theorem 25.1 Consider a stationary memoryless source Sn i.i.d.∼ PS . Suppose that the distortion
metric d and the target distortion D satisfy:

1 d(sn , ŝn ) is non-negative and separable.


2 D > D0 , where D0 = inf{D : ϕS (D) < ∞}.
3 Dmax is finite, i.e.

Dmax ≜ inf_{ŝ} E[d(S, ŝ)] < ∞. (25.3)


Then

R(D) = R(I)(D) = ϕS (D) = inf_{PŜ|S : E[d(S,Ŝ)]≤D} I(S; Ŝ). (25.4)

Remark 25.1

• Note that Dmax < ∞ does not require that d(·, ·) only takes values in R. That is, Theorem 25.1
permits d(s, ŝ) = ∞.
• When Dmax = ∞, typically we have R(D) = ∞ for all D. Indeed, suppose that d(·, ·) is a metric
(i.e. real-valued and satisfies triangle inequality). Then, for any x0 ∈ An we have

d(X, X̂) ≥ d(X, x0 ) − d(x0 , X̂) .

Thus, for any finite codebook {c1 , . . . , cM } we have maxj d(x0 , cj ) < ∞ and therefore

E[d(X, X̂)] ≥ E[d(X, x0 )] − max d(x0 , cj ) = ∞ .


j

So that R(D) = ∞ for any finite D. This observation, however, should not be interpreted as
the absolute impossibility of compressing such sources; it is just not possible with fixed-length
codes. As an example, for quadratic distortion and Cauchy-distributed S, Dmax = ∞ since S
has infinite second moment. But it is easy to see that1 the information rate-distortion function
R(I) (D) < ∞ for any D ∈ (0, ∞). In fact, in this case R(I) (D) is a hyperbola-like curve that
never touches either axis. Using variable-length codes, Sn can be compressed non-trivially into
W with bounded entropy (but unbounded cardinality) H(W). An open question: Is H(W) =
nR(I) (D) + o(n) attainable?
• We restricted theorem to D > D0 because it is possible that R(D0 ) 6= R(I) (D0 ). For exam-
ple, consider an iid non-uniform source {Sj } with A = Â being a finite metric space with
metric d(·, ·). Then D0 = 0 and from Exercise V.5 we have R(D0 +) < R(D0 ). At the same
time, from Theorem 24.4(d) we know that R(I) is continuous at D0 : R(I) (D0 +) = ϕS (D0 +) =
ϕS (D0 ) = R(I) (D0 ).

1
Indeed, if we take W to be a quantized version of S with small quantization error D and notice that differential entropy of
the Cauchy S is finite, we get from (24.7) that R(I) (D) ≤ H(W) < ∞.


• Techniques for proving (25.4) for memoryless sources can be extended to stationary ergodic
sources by making changes to the proof similar to those we have discussed in lossless
compression (Chapter 12).

Before giving a formal proof, we give a heuristic derivation emphasizing the connection to large
deviations estimates from Chapter 15.

25.1.1 Intuition
Let us throw M random points C = {c1 , . . . , cM } into the space Ân by generating them indepen-
dently according to a product distribution QnŜ , where QŜ is some distribution on Â to be optimized.
Consider the following simple coding strategy:

Encoder : f(sn ) = argmin_{j∈[M]} d(sn , cj ) (25.5)
Decoder : g(j) = cj (25.6)

The basic idea is the following: Since the codewords are generated independently of the source,
the probability that a given codeword is close to the source realization is (exponentially) small, say,
ϵ. However, since we have many codewords, the chance that there exists a good one can be of high
probability. More precisely, the probability that no good codewords exist is approximately (1 −ϵ)M ,
which can be made close to zero provided M ≫ 1/ϵ.
To explain this intuition further, consider a discrete memoryless source Sn i.i.d.∼ PS and let us eval-
uate the excess distortion of this random code: P[d(Sn , f(Sn )) > D], where the probability is over
all random codewords c1 , . . . , cM and the source Sn . Define

Pfailure ≜ P[∀c ∈ C, d(Sn , c) > D] = ESn [P[d(Sn , c1 ) > D|Sn ]M ],

where the last equality follows from the assumption that c1 , . . . , cM are iid and independent of Sn .
To simplify notation, let Ŝn i.i.d.∼ QnŜ independently of Sn , so that PSn,Ŝn = PnS QnŜ . Then

P[d(Sn , c1 ) > D|Sn ] = P[d(Sn , Ŝn ) > D|Sn ]. (25.7)

To evaluate the failure probability, let us consider the special case of PS = Ber(1/2) and also
choose QŜ = Ber(1/2) to generate the random codewords, aiming to achieve a normalized Hamming
distortion at most D < 1/2. Since n d(Sn , Ŝn ) = ∑_{i: si =1}(1 − Ŝi ) + ∑_{i: si =0} Ŝi ∼ Bin(n, 1/2) for any sn ,
the conditional probability (25.7) does not depend on Sn and is given by
P[d(Sn , Ŝn ) > D|Sn ] = P[Bin(n, 1/2) ≥ nD] ≈ 1 − 2^{−n(1−h(D))+o(n)} , (25.8)

where in the last step we applied large-deviations estimates from Theorem 15.9 and Example 15.1.
(Note that here we actually need lower estimates on these exponentially small probabilities.) Thus,


Pfailure = (1 − 2^{−n(1−h(D))+o(n)})^M , which vanishes if M = 2^{n(1−h(D)+δ)} for any δ > 0.2 As we will
compute in Theorem 26.1, the rate-distortion function for PS = Ber(1/2) is precisely ϕS (D) =
1 − h(D), so we have a rigorous proof of the optimal achievability in this special case.
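One can also watch the argument work empirically. The minimal Python sketch below (our own illustration) implements the random code (25.5)–(25.6) for PS = Ber(1/2) with Hamming distortion; the blocklength, rate, and seed are arbitrary choices. At such a short blocklength the achieved distortion still exceeds h^{−1}(1 − R) ≈ 0.110, but it is already far below the 0.25 of the naive "store half, guess the rest" strategy of the Hamming game:

import numpy as np

rng = np.random.default_rng(0)
n, R = 30, 0.5
M = int(round(2 ** (n * R)))                 # 2^{nR} random codewords
C = rng.integers(0, 2, size=(M, n))          # codebook drawn iid Ber(1/2), i.e. Q_Shat = Ber(1/2)

def encode_decode(s):
    # encoder (25.5): nearest codeword in Hamming distance; decoder (25.6) outputs that codeword
    return C[np.argmin((C != s).sum(axis=1))]

S = rng.integers(0, 2, size=(1000, n))       # iid Ber(1/2) source blocks
D_hat = np.mean([(s != encode_decode(s)).mean() for s in S])
print(D_hat)     # between h^{-1}(0.5) ~ 0.11 and 0.25; approaches 0.11 as n grows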
For general distribution PS (or even for PS = Ber(p) for which it is suboptimal to choose
QŜ as Ber(1/2)), the situation is more complicated as the conditional probability (25.7) depends
on the source realization Sn through its empirical distribution (type). Let Tn be the set of typical
realizations whose empirical distribution is close to PS . We have

Pfailure ≈ P[d(Sn , Ŝn ) > D|Sn ∈ Tn ]^M
= (1 − P[d(Sn , Ŝn ) ≤ D|Sn ∈ Tn ])^M (25.9)
≈ (1 − 2^{−nE(QŜ )})^M ,
in which the probability subtracted in (25.9) is ≈ 0 since Sn ⊥⊥ Ŝn ,

where it can be shown (using large deviations analysis similar to information projection in
Chapter 15) that

E(QŜ ) = min_{PŜ|S : E[d(S,Ŝ)]≤D} D(PŜ|S ‖ QŜ | PS ) (25.10)

Thus we conclude that for any choice of QŜ (from which the random codewords were drawn) and
any δ > 0, the above code with M = 2n(E(QŜ )+δ) achieves vanishing excess distortion

Pfailure = P[∀c ∈ C, d(Sn , c) > D] → 0 as n → ∞.

Finally, we optimize QŜ to get the smallest possible M:

min_{QŜ} E(QŜ ) = min_{QŜ} min_{PŜ|S : E[d(S,Ŝ)]≤D} D(PŜ|S ‖ QŜ | PS )
= min_{PŜ|S : E[d(S,Ŝ)]≤D} min_{QŜ} D(PŜ|S ‖ QŜ | PS )
= min_{PŜ|S : E[d(S,Ŝ)]≤D} I(S; Ŝ)
= ϕS (D)

where the third equality follows from the variational representation of mutual information (Corol-
lary 4.2). This heuristic derivation explains how the constrained mutual information minimization
arises. Below we make it rigorous using a different approach, again via random coding.

25.1.2 Proof of Theorem 25.1


Theorem 25.2 (Random coding bound of average distortion) Fix PX and suppose
d(x, x̂) ≥ 0 for all x, x̂. For any PY|X and any y0 ∈ Â, there exists a code X → W → X̂ with

2
In fact, this argument shows that M = 2n(1−h(D))+o(n) codewords suffice to cover the entire Hamming space within
distance Dn. See (27.9) and Exercise V.26.


W ∈ [M + 1], such that d(X, X̂) ≤ d(X, y0 ) almost surely and for any γ > 0,

E[d(X, X̂)] ≤ E[d(X, Y)] + E[d(X, y0 )]e−M/γ + E[d(X, y0 )1 {i(X; Y) > log γ}].

Here the first and the third expectations are over (X, Y) ∼ PX,Y = PX PY|X and the information
density i(·; ·) is defined with respect to this joint distribution (cf. Definition 18.1).

Some remarks are in order:

• Theorem 25.2 says that from an arbitrary PY|X such that E[d(X, Y)] ≤ D, we can extract a good
code with average distortion D plus some extra terms which will vanish in the asymptotic regime
for memoryless sources.
• The proof uses the random coding argument with codewords drawn independently from PY , the
marginal distribution induced by the source distribution PX and the auxiliary channel PY|X . As
such, PY|X plays no role in the code construction and is used only in analysis (by defining a
coupling between PX and PY ).
• The role of the deterministic y0 is a “fail-safe” codeword (think of y0 as the default reconstruc-
tion with Dmax = E[d(X, y0 )]). We add y0 to the random codebook for “damage control”, to
hedge against the (highly unlikely) event that we end up with a terrible codebook.

Proof. Similar to the intuitive argument sketched in Section 25.1.1, we apply random coding and
generate the codewords randomly and independently of the source:
C = {c1 , . . . , cM } i.i.d.∼ PY ⊥⊥ X

and add the “fail-safe” codeword cM+1 = y0 . We adopt the same encoder-decoder pair (25.5) –
(25.6) and let X̂ = g(f(X)). Then by definition,

d(X, X̂) = min_{j∈[M+1]} d(X, cj ) ≤ d(X, y0 ).

To simplify notation, let Ȳ be an independent copy of Y (similar to the idea of introducing the unsent
codeword X̄ in channel coding – see Chapter 18):
PX,Y,Ȳ = PX,Y PȲ
where PȲ = PY . Recall the formula for computing the expectation of a random variable U ∈ [0, a]:
E[U] = ∫_0^a P[U ≥ u] du. Then the average distortion is

E d(X, X̂) = E[ min_{j∈[M+1]} d(X, cj ) ]
= EX [ E[ min_{j∈[M+1]} d(X, cj ) | X ] ]
= EX ∫_0^{d(X,y0)} P[ min_{j∈[M+1]} d(X, cj ) > u | X ] du
≤ EX ∫_0^{d(X,y0)} P[ min_{j∈[M]} d(X, cj ) > u | X ] du


= EX ∫_0^{d(X,y0)} P[d(X, Ȳ) > u|X]^M du
= EX ∫_0^{d(X,y0)} (1 − P[d(X, Ȳ) ≤ u|X])^M du
≤ EX ∫_0^{d(X,y0)} (1 − P[d(X, Ȳ) ≤ u, i(X; Ȳ) > −∞|X])^M du , (25.11)
where we denote δ(X, u) ≜ P[d(X, Ȳ) ≤ u, i(X; Ȳ) > −∞ | X].

Next we upper bound (1 − δ(X, u))M as follows:


(1 − δ(X, u))^M ≤ e^{−M/γ} + |1 − γδ(X, u)|^+ (25.12)
= e^{−M/γ} + |1 − γ E[exp{−i(X; Y)}1{d(X, Y) ≤ u}|X]|^+ (25.13)
≤ e^{−M/γ} + P[i(X; Y) > log γ|X] + P[d(X, Y) > u|X] (25.14)
where

• (25.12) uses the following trick in dealing with (1 − δ)^M for δ ≪ 1 and M ≫ 1. First, recall the
standard rule of thumb: (1 − δ)^M ≈ 0 if δM ≫ 1, and (1 − δ)^M ≈ 1 if δM ≪ 1.
In order to obtain firm bounds of a similar flavor, we apply, for any γ > 0,
(1 − δ)^M ≤ e^{−δM} ≤ e^{−M/γ} + (1 − γδ)^+ .

• (25.13) is simply a change of measure argument of Proposition 18.3. Namely we apply (18.4)
with f(x, y) = 1 {d(x, y) ≤ u}.
• For (25.14) consider the chain:

1 − γ E[exp{−i(X; Y)}1 {d(X, Y) ≤ u}|X] ≤ 1 − γ E[exp{−i(X; Y)}1 {d(X, Y) ≤ u, i(X; Y) ≤ log γ}|X]
≤ 1 − E[1 {d(X, Y) ≤ u, i(X; Y) ≤ log γ}|X]
= P[d(X, Y) > u or i(X; Y) > log γ|X]
≤ P[d(X, Y) > u|X] + P[i(X; Y) > log γ|X]

Plugging (25.14) into (25.11), we have
E[d(X, X̂)] ≤ EX [ ∫_0^{d(X,y0)} ( e^{−M/γ} + P[i(X; Y) > log γ|X] + P[d(X, Y) > u|X] ) du ]
≤ E[d(X, y0 )] e^{−M/γ} + E[d(X, y0 ) P[i(X; Y) > log γ|X]] + EX ∫_0^{∞} P[d(X, Y) > u|X] du
= E[d(X, y0 )] e^{−M/γ} + E[d(X, y0 ) 1{i(X; Y) > log γ}] + E[d(X, Y)].

As a side product, we have the following achievability result for excess distortion.


Theorem 25.3 (Random coding bound of excess distortion) For any PY|X , there
exists a code X → W → X̂ with W ∈ [M], such that for any γ > 0,
P[d(X, X̂) > D] ≤ e−M/γ + P[{d(X, Y) > D} ∪ {i(X; Y) > log γ}]

Proof. Proceed exactly as in the proof of Theorem 25.2 (without using the extra codeword y0 ),
replace (25.11) by P[d(X, X̂) > D] = P[∀j ∈ [M], d(X, cj ) > D] = EX [(1 − P[d(X, Y) ≤ D|X])M ],
and continue similarly.
Finally, we give a rigorous proof of Theorem 25.1 by applying Theorem 25.2 to the iid source
X = Sn i.i.d.∼ PS and n → ∞:
Proof of Theorem 25.1. Our goal is the achievability: R(D) ≤ R(I) (D) = ϕS (D).
WLOG we can assume that Dmax = E[d(S, ŝ0 )] is achieved at some fixed ŝ0 – this is our default
reconstruction; otherwise just take any other fixed symbol so that the expectation is finite. The
default reconstruction for Sn is ŝn0 = (ŝ0 , . . . , ŝ0 ) and E[d(Sn , ŝn0 )] = Dmax < ∞ since the distortion
is separable.
Fix some small δ > 0. Take any PŜ|S such that E[d(S, Ŝ)] ≤ D − δ ; such PŜ|S exists since D > D0 by
assumption. Apply Theorem 25.2 to (X, Y) = (Sn , Ŝn ) with
PX = PSn
PY|X = PŜn |Sn = (PŜ|S )n
log M = n(I(S; Ŝ) + 2δ)
log γ = n(I(S; Ŝ) + δ)
d(X, Y) = (1/n) ∑_{j=1}^n d(Sj , Ŝj )

y0 = ŝn0
we conclude that there exists a compressor f : An → [M + 1] and g : [M + 1] → Ân , such that
E[d(Sn , g(f(Sn )))] ≤ E[d(Sn , Ŝn )] + E[d(Sn , ŝn0 )] e^{−M/γ} + E[d(Sn , ŝn0 ) 1{i(Sn ; Ŝn ) > log γ}]
≤ D − δ + Dmax e^{−exp(nδ)} + E[d(Sn , ŝn0 ) 1_{En} ], (25.15)
where the middle term tends to 0 as n → ∞, the last term will be shown to vanish later, and
En = {i(Sn ; Ŝn ) > log γ} = {(1/n) ∑_{j=1}^n i(Sj ; Ŝj ) > I(S; Ŝ) + δ} satisfies P[En ] → 0 by the WLLN.

If we can show the expectation in (25.15) vanishes, then there exists an (n, M, D)-code with:
M = 2^{n(I(S;Ŝ)+2δ)} and distortion D − δ + o(1) ≤ D (for n large enough).
To summarize, for all PŜ|S such that E[d(S, Ŝ)] ≤ D − δ we have shown that R(D) ≤ I(S; Ŝ). Sending δ ↓
0, we have, by continuity of ϕS (D) on (D0 , ∞) (recall Theorem 24.4), R(D) ≤ ϕS (D−) = ϕS (D).


It remains to show the expectation in (25.15) vanishes. This is a simple consequence of the
uniform integrability of the sequence {d(Sn , ŝn0 )}. We need the following lemma.
Lemma 25.4 For any positive random variable U, define g(δ) = sup_{H: P[H]≤δ} E[U 1_H ], where
the supremum is over all events measurable with respect to U. Then3 E U < ∞ ⇒ g(δ) → 0 as δ → 0.
Proof. For any b > 0, E[U 1_H ] ≤ E[U 1{U > b}] + bδ , where E[U 1{U > b}] → 0 as b → ∞ by the
dominated convergence theorem. Then the proof is completed by setting b = 1/√δ .
Now d(Sn , ŝn0 ) = (1/n) ∑_{j=1}^n Uj , where the Uj are iid copies of U ≜ d(S, ŝ0 ). Since E[U] = Dmax < ∞
by assumption, applying Lemma 25.4 yields E[d(Sn , ŝn0 ) 1_{En} ] = (1/n) ∑_j E[Uj 1_{En} ] ≤ g(P[En ]) → 0,
since P[En ] → 0. This proves the theorem.
Remark 25.2 (Fundamental limit for excess distortion) Although Theorem 25.1 is
stated for the average distortion, under certain mild extra conditions, it also holds for excess distor-
tion where the goal is to achieve d(Sn , Ŝn ) ≤ D with probability arbitrarily close to one as opposed
to in expectation. Indeed, the achievability proof of Theorem 25.1 is already stated in high proba-
bility. For converse, assume in addition to (25.3) that Dp ≜ E[d(S, ŝ)p ]1/p < ∞ for some ŝ ∈ Ŝ and
Pn
p > 1. Applying Rosenthal’s inequality [368, 235], we have E[d(S, ŝn )p ] = E[( i=1 d(Si , ŝ))p ] ≤
CDpp for some constant C = C(p). Then we can apply Theorem 24.9 to convert a code for excess
distortion to one for average distortion and invoke the converse for the latter.
To end this section, we note that in Section 25.1.1 and in Theorem 25.1 it seems we applied
different proof techniques. How come they both turn out to yield the same tight asymptotic result?
This is because the key to both proofs is to estimate the exponent (large deviations) of the under-
lined probabilities in (25.9) and (25.11), respectively. To obtain the right exponent, as we know,
the key is to apply tilting (change of measure) to the distribution solving the information projec-
tion problem (25.10). When PY = (QŜ )n = (PŜ )n with PŜ chosen as the output distribution in the
solution to rate-distortion optimization (25.1), the resulting exponent is precisely given by 2−i(X;Y) .

25.2* Covering lemma and joint typicality


In this section we consider the following curious problem, a version of channel simulation/syn-
thesis. We want to simulate a sequence of iid correlated strings (Ai , Bi ) i.i.d.∼ PA,B via a protocol we
describe next. First, a sequence An i.i.d.∼ PA is generated at one terminal. Then we can look at it,
produce a rate constrained message W ∈ [2nR ] which gets communicated to a remote destination
(noiselessly). Upon receipt of the message, remote decoder produces a string Bn out of it. The goal
is to be able to fool the tester who inspects (An , Bn ) and tries to check that it was indeed generated
i.i.d.
as (Ai , Bi ) ∼ PA,B . See Figure 25.1 for an illustration.
How large a rate R is required depends on how we exactly understand the requirement to “fool
the tester”. If the tester is fixed ahead of time (this just means that we know the set F such that

3
In fact, ⇒ is ⇔.



Figure 25.1 Description of channel simulation game. The distribution P (left) is to be simulated via the
distribution Q (right) at minimal rate R. Depending on the exact formulation we either require R = I(A; B)
(covering lemma) or R = C(A; B) (soft-covering lemma).

(Ai , Bi ) i.i.d.∼ PA,B is declared whenever (An , Bn ) ∈ F) then this is precisely the setting in which
covering lemma operates. In the next section we show that a higher rate R = C(A; B) is required
if F is not known ahead of time. We leave out the celebrated theorem of Bennett and Shor [43]
which shows that rate R = I(A; B) is also attainable even if F is not known, but if encoder and
decoder are given access to a source of common random bits (independent of An , of course).
Before proceeding, we note some simple corner cases:

1 If R = H(A), we can compress An and send it to “B side”, who can reconstruct An perfectly and
use that information to produce Bn through PBn |An .
2 If R = H(B), “A side” can generate Bn according to PnA,B and send that Bn sequence to the “B
side”.
3 If A ⊥⊥ B, we know that R = 0, as "B side" can generate Bn independently.

Our previous argument for achieving the rate-distortion turns out to give a sharp answer (that
R = I(A; B) is sufficient) for the F-known case as follows.

Theorem 25.5 (Covering Lemma) Fix PA,B and let (Aj , Bj ) i.i.d.∼ PA,B , R > I(A; B). We gener-
ate a random codebook C = {c1 , . . . , cM }, log M = nR, with each codeword cj drawn i.i.d. from
distribution PnB . Then we have for all sets F

P[∃c : (An , c) ∈ F] ≥ P[(An , Bn ) ∈ F] + oR (1) , (25.16)
where the oR (1) term is uniform in F.

Remark 25.3 The origin of the name “covering” is from the application to a proof of Theo-
rem 25.1. In that context we set A = S and B = Ŝ to be the source and optimal reconstruction (in


the sense of minimizing R(I) (D)). Then taking F = {(an , bn ) : d(an , bn ) ≤ D + δ} we see that both
terms in the right-hand side of the inequality are o(1). Thus, sampling 2nR reconstruction points
we covered the space of source realizations in such a way that with high probability we can always
find a reconstruction with low distortion.
Proof. Set γ > M and following similar arguments of the proof for Theorem 25.2, we have
P[∀c ∈ C : (An , c) ∉ F] ≤ e^{−M/γ} + P[{(An , Bn ) ∉ F} ∪ {i(An ; Bn ) > log γ}]
= P[(An , Bn ) ∉ F] + o(1)
⇒ P[∃c ∈ C : (An , c) ∈ F] ≥ P[(An , Bn ) ∈ F] + o(1)

As we explained, the version of covering lemma that we stated shows how to “fool the tester”
applying only one fixed test set F. However, if both A and B take values on finite alphabets then
something stronger can be stated. This original version of the covering lemma [111] is what is
used in sophisticated distributed compression arguments, e.g. Theorem 11.17. Before stating the
result we remind that for two sequences an , bn we denote their joint empirical distribution by
P̂an,bn (α, β) ≜ (1/n) ∑_{i=1}^n 1{ai = α, bi = β} , α ∈ A, β ∈ B .

It is also useful to review joint typicality discussion in Remark 18.2. In this section we say that a
sequence of pairs of vectors (an , bn ) is jointly typical with respect to PA,B if
TV(P̂an ,bn , PA,B ) = o(1) .
Fix a distribution PA,B and any codebook C = {c1 , . . . , cM } consisting of elements cj ∈ B n . For
any fixed input string an we define
W = argmin_{j∈[M]} TV(P̂an,cj , PA,B ) , B̂n = cW . (25.17)

The next corollary says that in order for us to produce a jointly typical pair (An , B̂n ) a codebook
must have the rate R > I(A; B) and this is optimal.

Corollary 25.6 Fix PA,B on a pair of finite alphabets A, B. For any R > I(A; B) we generate
a random codebook C = {c1 , . . . , cM }, log M = nR, where each codeword cj is drawn i.i.d. from
distribution PnB . With B̂n defined as in (25.17) we have that pair (An , B̂n ) is jointly typical with high
probability
E[TV(P̂An ,B̂n , PA,B )] = oR (1) . (25.18)
Furthermore, no codebook with rate R < I(A, B) can achieve (25.18).

Proof. First, in this case i(An ; Bn ) is a sum of bounded iid terms and thus the oR (1) term in (25.16)
is in fact e^{−Ω(n)} . Fix arbitrary ϵ > 0 and apply Theorem 25.5 to
F = {(an , bn ) : |P̂an,bn (α, β) − PA,B (α, β)| ≤ ϵ}


with all possible α ∈ A and β ∈ B we conclude that

P[TV(PAn ,B̂n , PA,B ) ≤ ϵ] ≥ 1 − |A||B|e−Ω(n) = 1 + o(1) .

Due to the arbitrariness of ϵ, (25.18) follows.


This proof can also be understood combinatorially (as is done classically [111]). Indeed, the
rate R ≈ I(A; B) is sufficient since an iid Bn ranges over about 2nH(B) high probability sequences
(cf. Proposition 1.5). Applying the same proposition conditionally on a typical An sequence, there
are around 2nH(B|A) of Bn sequences that have the same joint distribution. Therefore, while we need
nH(B) bits to describe all of Bn , it is intuitively clear that we only need to describe a class of Bn for
each An sequence, and there are around 2^{nH(B)}/2^{nH(B|A)} = 2^{nI(A;B)} classes. We can represent this situation
by a bipartite graph with (typical) An sequences on the left and (typical) Bn sequences on the right
and edges corresponding to pairs having joint typicality; this graph is regular with degrees 2^{nH(B|A)}
and 2^{nH(A|B)} , respectively. Thus, to convert our intuition to a rigorous proof as above we need to
show that a random subset of 2^{nI(A;B)} right vertices covers all left vertices. (This alternate proof
has the advantage of showing (25.18) conditional on any typical An .)
We next proceed to proving that R ≥ I(A; B) is in fact necessary for simulating a jointly typical
Bn . Consider an arbitrary codebook C such that (25.18) holds. On one hand we have

nR = log M ≥ I(An ; B̂n ) = H(An ) − H(An |B̂n ) = nH(A) − H(An |B̂n ) .

Thus, the proof will be complete if we show

H(An |B̂n ) ≤ nH(A|B) + o(n) .

To that end, let us define a random conditional type of An |B̂n as


Q̂(α|β) ≜ #{i : Ai = α, B̂i = β} / #{i : B̂i = β} .
For an arbitrarily small ϵ > 0 we define a binary T = 1 if and only if for some α or β at least one of
the following holds:
|(1/n) #{i : B̂i = β} − PB (β)| > ϵ ,
|Q̂(α|β) − PA|B (α|β)| > ϵ .

By the Markov inequality (and assuming WLOG that PB (β) > 0 for all β ) we get that

P [ T = 1] = o( 1) .

Thus, we have

H(An |B̂n ) ≤ H(An , T|B̂n ) ≤ log 2 + n log |A|P[T = 1] + H(An |B̂n , T = 0) .

The first two terms are o(n) so we focus on the last term. Since Q̂ is a random variable with
polynomially many possible values, cf. Exercise I.2, we have

H(An |B̂n , T = 0) ≤ H(An , Q̂|B̂n , T = 0) ≤ H(An |Q̂, B̂n , T = 0) + O(log n) .


Let there be nβ positions with B̂i = β . Conditioned on Q̂, the random variable An ranges over at most
∏_β ( nβ choose nβ Q̂(α1 |β), . . . , nβ Q̂(α|A| |β) ) values, a product of multinomial coefficients. Since under
T = 0 we have Q̂ → PA|B and nβ /n → PB (β) as n → ∞ we
conclude from Proposition 1.5 and the continuity of entropy Proposition 4.8 that
H(An |Q̂, B̂n , T = 0) ≤ n(H(A|B) + δ)
for some δ = δ(ϵ) > 0 that vanishes as ϵ → 0.
Applications of Corollary 25.6 include distributed compression (Theorem 11.17) and hypoth-
esis testing (Theorem 16.6). Now, those applications use the rate-constrained B̂n to create a
required correlation (joint typicality) not only with An but also with other random variables. Those
applications will require the following simple observation.

Proposition 25.7 Fix some PX,A,B on finite alphabets and consider a pair of random variables
(An , B̂n ) which are jointly typical on average (specifically, (25.18) holds as n → ∞). Given An , B̂n
suppose that Xn is generated ∼ P^{⊗n}_{X|A,B} . Then we have

E[TV(P̂Xn ,An ,B̂n , PX,A,B )] = o(1) .


And, furthermore we have
I(Xn ; B̂n ) ≥ nI(X; B) + o(n) . (25.19)

Remark 25.4 (Markov lemma) This result is known as Markov lemma, e.g. [106, Lemma
15.8.1] because in the standard application setting one considers a joint distribution PX,A,B =
PX,A PB|A , i.e. X → A → B. In this application, one further has (Xn , An ) i.i.d.∼ PX,A generated by
nature with only An being observed. Given An one constructs a jointly typical vector B̂n (e.g. via
covering lemma Corollary 25.6). Now, since with high probability (Xn , An ) is jointly typical, it
is tempting to automatically conclude that (Xn , B̂n ) would also be jointly typical. Unfortunately,
joint typicality relation is generally not transitive.4 In the above result, however, what resolves
this issue is the fact that Xn can be viewed as generated after (An , B̂n ) were already selected. Thus,
viewing the process in this order we can even allow Xn to depend on B̂n , which is what we did. For
stronger results under the classical setting of PX|A,B = PX|A see [147, Lemma 12.1].
Proof. Note that from condition (25.18) and Markov inequality we get that TV(P̂An ,B̂n , PA,B ) =
o(1) with probability 1 − o(1). Fix any a, b, x ∈ A × B × X and consider m = nP̂An ,B̂n (a, b)
coordinates i ∈ [n] with Ai = a, B̂i = b. Among these there are m′ ≤ m coordinates i that
also satisfy Xi = x. Standard concentration estimate shows that |m′ − mPX|A,B (x|a, b)| = o(m)
with probability 1 − o(1). Hence, normalizing by m we obtain (from the union bound) that with
probability 1 − o(1) we have
|P̂Xn ,An ,B̂n (x, a, b) − PX,A,B (x, a, b)| = o(1) .

4
Let PX,A,B = PX PA PB with PX = PA = PB = Ber(1/2). Take an to be any binary string in {0, 1}n with n/2 ones. Set
xj = bj = aj for j ≤ n/2 and xj = bj = 1 − aj , otherwise. Then (xn , an ) and (an , bn ) are jointly typical, but (xn , bn ) is
not.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-502


i i

502

This implies the first statement. Note that by summing out a ∈ A we obtain that

E[|TV(P̂Xn ,B̂n , PX,B ) = o(1) .

But then repeating the steps of the second part of Corollary 25.6 we obtain I(Xn ; B̂n ) ≥ nI(X; B) +
o(n), as required.

Remark 25.5 Although in (25.19) we only proved a lower bound (which is sufficient for
applications in this book), it is known that under the Markov assumption X → A → B the inequal-
ity (25.19) holds with equality [111, Chapter 15]. This follows as a by-product of a deep entropy
characterization problem for which we recommend the mentioned reference.
Let us go back to the discussion in the beginning of this section. We have learned how to “fool”
the tester that uses one fixed test set F (Theorem 25.5). Then for finite alphabets we have shown
that we can also “fool” the tester that computes empirical averages since

1X
n
f(Aj , B̂j ) ≈ EA,B∼PA,B [f(A, B)] ,
n
j=1

for any bounded function f. A stronger requirement would be to demand that the joint distribution
PAn ,B̂n fools any permutation invariant tester, i.e.

sup |PAn ,B̂n (F) − PnA,B (F)| → 0

where the supremum is taken over all permutation invariant subsets F ⊂ An × B n . This is not
guaranteed by Corollary 25.6. Indeed, note that a sufficient statistic for a permutation invariant
tester is a joint type P̂An ,B̂n , and Corollary does show that P̂An ,B̂n ≈ PA,B (in the sense of L1 distance
of vectors). However, it still might happen that P̂An ,B̂n although close to PA,B takes highly different
values compared to those of P̂An ,Bn . For example, if we restrict all c ∈ C to have a fixed composition
P0 , the tester can easily detect the problem since PnB -measure of all strings of composition P0

cannot exceed O(1/ n). Formally, to fool permutation invariant tester we need to have small
total variation between the distribution of P̂An ,B̂n and P̂An ,Bn .
We conjecture, however, that nevertheless the rate R = I(A; B) should be sufficient to achieve
also this stronger requirement. In the next section we show that if one removes the permutation-
invariance constraint, then a larger rate R = C(A; B) is needed.

25.3* Wyner’s common information


We continue discussing the channel simulation setting as in previous section. We now want to
determined the minimal possible communication rate (i.e. cardinality of W ∈ [2nR ]) required to
have small total variation:

TV(PAn ,B̂n , PnA,B ) ≤ ϵ (25.20)

between the simulated and the true output (see Figure 25.1).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-503


i i

25.3* Wyner’s common information 503

Theorem 25.8 (Cuff [116]) Let PA,B be an arbitrary distribution on the finite space A × B.
i.i.d.
Consider a coding scheme where Alice observes An ∼ PnA , sends a message W ∈ [2nR ] to Bob,
who given W generates a (possibly random) sequence B̂n . If (25.20) is satisfied for all ϵ > 0 and
sufficiently large n, then we must have

R ≥ C(A; B) ≜ min I(A, B; U) , (25.21)


A→U→B

where C(A; B) is known as the Wyner’s common information [458]. Furthermore, for any R >
C(A; B) and ϵ > 0 there exists n0 (ϵ) such that for all n ≥ n0 (ϵ) there exists a scheme
satisfying (25.20).

Note that condition (25.20) guarantees that any tester (permutation invariant or not) is fooled to
believe he sees the truly iid (An , Bn ) with probability ≥ 1 −ϵ. However, compared to Theorem 25.5,
this requires a higher communication rate since C(A; B) ≥ I(A; B), clearly.

Proof. Showing that Wyner’s common information is a lower-bound is not hard. First, since
PAn ,B̂n ≈ PnA,B (in TV) we have

I(At , B̂t ; At−1 , B̂t−1 ) ≈ I(At , Bt ; At−1 , Bt−1 ) = 0

(Here one needs to use finiteness of the alphabet of A and B and the bounds relating H(P) − H(Q)
with TV(P, Q), cf. (7.20) and Corollary 6.7). Next, we have

nR = H(W) ≥ I(An , B̂n ; W) (25.22)


X n
≥ I(At , B̂t ; W) − I(At , B̂t ; At−1 , B̂t−1 ) (25.23)
t=1
Xn
≈ I(At , B̂t ; W) (25.24)
t=1

≳ nC(A; B) (25.25)

where in the last step we used the crucial observation that

At → W → B̂t

and that Wyner’s common information PA,B 7→ C(A; B) should be continuous in the total variation
distance on PA,B .
To show achievability, let us notice that the problem is equivalent to constructing three random
variables (Ân , W, B̂n ) such that a) W ∈ [2nR ], b) the Markov relation

Ân ← W → B̂n (25.26)

holds and c) TV(PÂn ,B̂n , PnA,B ) ≤ ϵ/2. Indeed, given such a triple we can use coupling charac-
terization of TV (7.20) and the fact that TV(PÂn , PnA ) ≤ ϵ/2 to extend the probability space
to

An → Ân → W → B̂n

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-504


i i

504

and P[An = Ân ] ≥ 1 − ϵ/2. Again by (7.20) we conclude that TV(PAn ,B̂n , PÂn ,B̂n ) ≤ ϵ/2 and by
triangle inequality we conclude that (25.20) holds.
Finally, construction of the triple satisfying a)-c) follows from the soft-covering lemma
(Corollary 25.10) applied with V = (A, B) and W being uniform on the set of xi ’s there.

25.4* Approximation of output statistics and the soft-covering lemma


In this section we aim to prove the remaining ingredient (the soft-covering lemma) required for
the proof of Theorem 25.8. To that end, recall that in Section 7.9 we have shown that generating
i.i.d.
iid Xi ∼ PX and passing their empirical distribution P̂n across the channel PY|X results in a good
approximation of PY = PY|X ◦ PX , i.e.

D(PY|X ◦ P̂n kPY ) → 0 .

A natural questions is how large n should be in order for the approximation PY|X ◦ P̂n ≈ PY to
hold. A remarkable fact that we establish in this section is that the answer is n ≈ 2I(X;Y) , assum-
ing I(X; Y)  1 and there is certain concentration properties of i(X; Y) around I(X; Y). This fact
originated from Wyner [458] and was significantly strengthened in [212].
Here, we show a new variation of such results by strengthening our simple χ2 -information
bound of Proposition 7.17 (corresponding to λ = 2).

Theorem 25.9 Fix PX,Y and for any λ ∈ R define the Rényi mutual information of order λ
Iλ (X; Y) ≜ Dλ (PX,Y kPX PY ) ,

where Dλ is the Rényi-divergence, cf. Definition 7.24. We have for every 1 < λ ≤ 2
1
E[D(PY|X ◦ P̂n kPY )] ≤ log(1 + exp{(λ − 1)(Iλ (X; Y) − log n)}) . (25.27)
λ−1

Proof. Since λ 7→ Dλ is non-decreasing, it is sufficient to prove an equivalent upper bound on


E[Dλ (PY|X ◦ Pn kPY )]. From Jensen’s inequality we see that
( )λ 
1 P Y|X ◦ P̂
EXn log EY∼PY  
n
E[Dλ (PY|X ◦ P̂n kPY )] ≜ (25.28)
λ−1 PY
( )λ 
1 PY|X ◦ P̂n
≤ log EXn EY∼PY   = Iλ (Xn ; Ȳ) ,
λ−1 PY
Pn
where similarly to (7.56) we introduced the channel PȲ|Xn = 1n i=1 PY|X=Xi . To analyze Iλ (Xn ; Ȳ)
we need to bound
( )λ 
1 X PY|X (Y|Xi )
E(Xn ,Ȳ)∼PnX ×PY  . (25.29)
n PY (Y)
i

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-505


i i

25.4* Approximation of output statistics and the soft-covering lemma 505

Note that conditioned on Y we get to analyze a λ-th moment of a sum of iid random variables.
This puts us into a well-known setting of Rosenthal-type inequalities. In particular, we have that5
for any iid non-negative Bj we have, provided 1 ≤ λ ≤ 2,
 !λ 
X n
E Bi  ≤ n E[Bλ ] + (n E[B])λ . (25.30)
i=1

Now using (25.30) we can overbound (25.29) as


" λ #
PY|X (Y|Xi )
≤ 1 + n1−λ E(X,Ȳ)∼PX ×PY ,
PY (Y)

which implies
1 
Iλ (Xn ; Ȳ) ≤ log 1 + n1−λ exp{(λ − 1)Iλ (X; Y)} ,
λ−1
which together with (25.28) recovers the main result (25.27).

Remark 25.6 Hayashi [217] upper bounds the LHS of (25.27) with
λ λ−1
log(1 + exp{ (Kλ (X; Y) − log n)}) ,
λ−1 λ
where Kλ (X; Y) = infQY Dλ (PX,Y kPX QY ) is the so-called Sibson-Csiszár information, cf. [338].
This bound, however, does not have the right rate of convergence as n → ∞, at least for λ = 1 as
comparison with Proposition 7.17 reveals.
We note that [217, 212] also contain direct bounds on

E[TV(PY|X ◦ P̂n , PY )]
P
which do not assume existence of λ-th moment of PYY|X for λ > 1 and instead rely on the distribution
of i(X; Y). We do not discuss these bounds here, however, since for the purpose of discussing finite
alphabets the next corollary is sufficient.

Corollary 25.10 (Soft-covering lemma) Suppose X = (U1 , . . . , Ud ) and Y = (V1 , . . . , Vd )


i.i.d.
are vectors with (Ui , Vi ) ∼ PU,V and Iλ0 (U; V) < ∞ for some λ0 > 1 (e.g. if one of U or V is over
a finite alphabet). Then for any R > I(U; V) there exists ϵ > 0, so that for all d ≥ 1 there exists
x1 , . . . , xn , n = dexp{dR}e, such that
1 X
n  1
D PY|X=xi kPY ≤ exp{−dϵ} .
n ϵ
i=1

5
The inequality (25.30), which is known to be essentially tight [374], can be shown by applying

(a + b)λ−1 ≤ aλ−1 + bλ−1 and Jensen’s to get E Bi (Bi + j̸=i Bj )λ−1 ≤ E[Bλ ] + E[B]((n − 1) E[B])λ−1 . Summing
the left side over i and bounding (n − 1) ≤ n we get (25.30).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-506


i i

506

Remark 25.7 The origin of the name “soft-covering” is due to the fact that unlike the covering
lemma (Theorem 25.5) which selects one xi (trying to make PY|X=xi as close to PY as possible) here
we mix over n choices uniformly.
Proof. By tensorization of Rényi divergence, cf. Section 7.12, we have
Iλ (X; Y) = dIλ (U; V) .
For every 1 < λ < λ0 we have that λ 7→ Iλ (U; V) is continuous and converging to I(U; V) as
λ → 1. Thus, we can find λ sufficiently small so that R > Iλ (U; V). Applying Theorem 25.9 with
this λ completes the proof.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-507


i i

26 Evaluating rate-distortion function. Lossy


Source-Channel separation.

In previous chapters we established the main coding theorem for lossy data compression: For
stationary memoryless (iid) sources and separable distortion, under the assumption that Dmax < ∞,
the operational and information rate-distortion functions coincide, namely,

R(D) = R(I) (D) = inf I(S; Ŝ).


PŜ|S :Ed(S,Ŝ)≤D

In addition, we have shown various properties about the rate-distortion function (cf. Theorem 24.4).
In this chapter we compute the rate-distortion function for several important source distributions
by evaluating this constrained minimization of mutual information. The common technique we
apply to evaluate these special cases in Section 26.1 is then formalized in Section 26.2* as a saddle
point property akin to those in Sections 5.2 and 5.4* for mutual information maximization (capac-
ity). Next we extend the paradigm of joint source-channel coding in Section 19.7 to the lossy
setting; this reasoning will later be found useful in statistical applications in Part VI (cf. Chap-
ter 30). Finally, in Section 26.4 we discuss several limitations, both theoretical and practical, of
the classical theory for lossy compression and joint source-channel coding.

26.1 Evaluation of R(D)

26.1.1 Bernoulli Source


Let S ∼ Ber(p) with Hamming distortion d(S, Ŝ) = 1{S 6= Ŝ} and alphabets S = Ŝ = {0, 1}. Then
d(sn , ŝn ) = 1n dH (sn , ŝn ) is the bit-error rate (fraction of erroneously decoded bits). By symmetry,
we may assume that p ≤ 1/2.

Theorem 26.1
R(D) = (h(p) − h(D))+ . (26.1)

For example, when p = 1/2, D = .11, we have R(D) ≈ 1/2 bits. In the Hamming game
described in Section 24.2 where we aim to compress 100 bits down to 50, we indeed can do this
while achieving 11% average distortion, compared to the naive scheme of storing half the string
and guessing on the other half, which achieves 25% average distortion. Note that we can also get
very tight non-asymptotic bounds, cf. Exercise V.3.

507

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-508


i i

508

Proof. Since Dmax = p, in the sequel we can assume D < p for otherwise there is nothing to
show.
For the converse, consider any PŜ|S such that P[S 6= Ŝ] ≤ D ≤ p ≤ 21 . Then

I(S; Ŝ) = H(S) − H(S|Ŝ)


= H(S) − H(S + Ŝ|Ŝ)
≥ H(S) − H(S + Ŝ)
= h(p) − h(P[S 6= Ŝ])
≥ h(p) − h(D).

In order to achieve this bound, we need to saturate the above chain of inequalities, in particular,
choose PŜ|S so that the difference S + Ŝ is independent of Ŝ. Let S = Ŝ + Z, where Ŝ ∼ Ber(p′ ) ⊥

Z ∼ Ber(D), and p′ is such that the convolution gives exactly Ber(p), namely,

p′ ∗ D = p′ (1 − D) + (1 − p′ )D = p,

p−D
i.e., p′ = 1−2D . In other words, the backward channel PS|Ŝ is exactly BSC(D) and the resulting
PŜ|S is our choice of the forward channel PŜ|S . Then, I(S; Ŝ) = H(S) − H(S|Ŝ) = H(S) − H(Z) =
h(p) − h(D), yielding the upper bound R(D) ≤ h(p) − h(D).

Remark 26.1 Here is a more general strategy (which we will later implement in the Gaussian
case.) Denote the optimal forward channel from the achievability proof by P∗Ŝ|S and P∗S|Ŝ the asso-
ciated backward channel (which is BSC(D)). We need to show that there is no better PŜ|S with
P[S 6= Ŝ] ≤ D and a smaller mutual information. Then

I(PS , PŜ|S ) = D(PS|Ŝ kPS |PŜ )


" #
P∗S|Ŝ
= D(PS|Ŝ kP∗S|Ŝ |PŜ ) + EP log
PS
≥ H(S) + EP [log D1{S 6= Ŝ} + log D̄1{S = Ŝ}]
≥ h(p) − h(D)

where the last inequality uses P[S 6= Ŝ] ≤ D ≤ 12 .

Remark 26.2 By WLLN, the distribution PnS = Ber(p)n concentrates near the Hamming sphere
of radius np as n grows large. Recall that in proving Shannon’s rate distortion theorem, the optimal
codebook are drawn independently from PnŜ = Ber(p′ )n with p′ = 1p−−2D D
. Note that p′ = 1/2 if
p = 1/2 but p′ < p if p < 1/2. In the latter case, the reconstruction points concentrate on a smaller
sphere of radius np′ and none of them are typical source realizations, as illustrated in Figure 26.1.

For a generalization of this result to m-ary uniform source, see Exercise V.6.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-509


i i

26.1 Evaluation of R(D) 509

S(0, np)

S(0, np′ )

Hamming Spheres

Figure 26.1 Source realizations (solid sphere) versus codewords (dashed sphere) in compressing Hamming
sources.

26.1.2 Gaussian Source


The following results compute the Gaussian rate-distortion function for quadratic distortion in
both the scalar and vector case. (For general covariance, see Exercise V.21.)

Theorem 26.2 Let S ∼ N (0, σ 2 ) and d(s, ŝ) = (s − ŝ)2 for s, ŝ ∈ R. Then
1 σ2
R ( D) = log+ . (26.2)
2 D
In the vector case of S ∼ N (0, σ 2 Id ) and d(s, ŝ) = ks − ŝk22 ,
d dσ 2
R ( D) = log+ . (26.3)
2 D

Proof. Since Dmax = σ 2 , in the sequel we can assume D < σ 2 for otherwise there is nothing to
show.
(Achievability) Choose S = Ŝ + Z , where Ŝ ∼ N (0, σ 2 − D) ⊥
⊥ Z ∼ N (0, D). In other words,
the backward channel PS|Ŝ is AWGN with noise power D, and the forward channel can be easily
found to be PŜ|S = N ( σ σ−2 D S, σ σ−2 D D). Then
2 2

1 σ2 1 σ2
I(S; Ŝ) =
log =⇒ R(D) ≤ log
2 D 2 D
(Converse) Formally, we can mimic the proof of Theorem 26.1 replacing Shannon entropy by
the differential entropy and applying the maximal entropy result from Theorem 2.8; the caveat is
that for Ŝ (which may be discrete) the differential entropy may not be well-defined. As such, we
follow the alternative proof given in Remark 26.1. Let PŜ|S be any conditional distribution such
that EP [(S − Ŝ)2 ] ≤ D. Denote the forward channel in the above achievability by P∗Ŝ|S . Then
" #
P∗S|Ŝ

I(PS , PŜ|S ) = D(PS|Ŝ kPS|Ŝ |PŜ ) + EP log
PS

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-510


i i

510

" #
P∗S|Ŝ
≥ EP log
PS
 (S−Ŝ)2

√ 1 e− 2D

= EP log 2πD
S2

√ 1
2π σ 2
e− 2 σ 2
" #
1 σ2 log e S2 (S − Ŝ)2
= log + EP −
2 D 2 σ2 D
1 σ2
≥ log .
2 D
Finally, for the vector case, (26.3) follows from (26.2) and the same single-letterization argu-
ment in Theorem 24.8 using the convexity of the rate-distortion function in Theorem 24.4(a).

The interpretation of the optimal reconstruction points in the Gaussian case is analogous to that
of the Hamming source previously
√ discussed in Remark 26.2: As n grows, the Gaussian random
2
vector concentrates on S(0, nσ ) (n-sphere in Euclidean p space rather than Hamming), but each
reconstruction point drawn from (P∗Ŝ )n is close to S(0, n(σ 2 − D)). So again the picture is similar
to Figure 26.1 of two nested spheres.
We can also understand geometry of errors of optimal compressors. Indeed, suppose we have
a sequence of quantizers Xn → W → X̂n with n1 log M → R(D). As we know, without loss of
generality we may assume X̂n = E[Xn |W]. Let us denote by Σ = Cov[Xn |W] be the covariance
matrix of the reconstruction errors. We know that 1n tr Σ ≤ D by the distortion constraint. Now let
us express mutual information in terms of differential entropy to obtain

log M = I(Xn ; W) = h(Xn ) − h(Xn |W) .

Applying maximum entropy principle (2.19) to the second term (and taking expectation over W
inside log det via Jensen’s and Corollary 2.9) we obtain
n 1
log M ≥ log σ 2 − log det Σ .
2 2
Let {λj , j ∈ [n]} denote the spectrum of Σ. Dividing by n and recalling that quantizer is optimal
we get

1X1
n
1 σ2 σ2
log + o( 1) ≥ log .
2 D n 2 λj
j=1

2
From strict convexity of λ 7→ 12 log σλ we conclude that empirical distributions of eigenvalues, i.e.
P
j δλj , must converge to a point, i.e. to δD . In this sense Σ ≈ D · In and the uncertainty regions
1
n
(given the message) should be approximately spherical.
Note that the exact expression in Theorem 26.2 relies on the Gaussianity assumption of the
source. How sensitive is the rate-distortion formula to this assumption? The following comparison
result is a counterpart of Theorem 20.12 for channel capacity:

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-511


i i

26.1 Evaluation of R(D) 511

Theorem 26.3 Assume that ES = 0 and Var S = σ 2 . Consider the MSE distortion. Then
1 σ2 1 σ2
log+ − D(PS kN (0, σ 2 )) ≤ R(D) = inf I(S; Ŝ) ≤ log+ .
2 D PŜ|S :E(Ŝ−S)2 ≤D 2 D

Remark 26.3 A simple consequence of Theorem 26.3 is that for source distributions with a
density, the rate-distortion function grows according to 12 log D1 in the low-distortion regime as
long as D(PS kN (0, σ 2 )) is finite. In fact, the first inequality, known as the Shannon lower bound
(SLB), is asymptotically tight, in the sense that
1 σ2
R(D) = log − D(PS kN (0, σ 2 )) + o(1), D → 0 (26.4)
2 D
under appropriate conditions on PS [281, 247]. Therefore, by comparing (2.21) and (26.4), we
see that, for small distortion, uniform scalar quantization (Section 24.1) is in fact asymptotically
optimal within 12 log(2πe) ≈ 2.05 bits.
Later in Section 30.1 we will apply SLB to derive lower bounds for statistical estimation. For
this we need the following general version of SLB (see Exercise V.22 for a proof): Let k · k be an
arbitrary norm on Rd and r > 0. Let X be a d-dimensional continuous random vector with finite
differential entropy h(X). Then
   
d d d
inf I(X; X̂) ≥ h(X) + log − log Γ +1 V , (26.5)
PX̂|X :E[∥X̂−X∥r ]≤D r Dre r

where V = vol({x ∈ Rd : kxk ≤ 1}) is the volume of the unit k · k-ball.


Proof. Again, assume D < Dmax = σ 2 . Let SG ∼ N (0, σ 2 ).
“≤”: Use the same P∗Ŝ|S = N ( σ σ−2 D S, σ σ−2 D D) in the achievability proof of Gaussian rate-
2 2

distortion function:
R(D) ≤ I(PS , P∗Ŝ|S )
σ2 − D σ2 − D
= I ( S; S + W) W ∼ N ( 0, D)
σ2 σ2
σ2 − D
≤ I ( SG ; SG + W) by Gaussian saddle point (Theorem 5.11)
σ2
1 σ2
= log .
2 D
“≥”: For any PŜ|S such that E(Ŝ − S)2 ≤ D. Let P∗S|Ŝ = N (Ŝ, D) denote the AWGN channel
with noise power D. Then
I(S; Ŝ) = D(PS|Ŝ kPS |PŜ )
" #
P∗S|Ŝ
= D(PS|Ŝ kP∗S|Ŝ |PŜ ) + EP log − D(PS kPSG )
PSG
 (S−Ŝ)2

√ 1 e− 2D

≥ EP log 2πD
S2
 − D(PS kPSG )
√ 1
2π σ 2
e− 2 σ 2

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-512


i i

512

1 σ2
≥ log − D(PS kPSG ).
2 D

26.2* Analog of saddle-point property in rate-distortion


In the computation of R(D) for the Hamming and Gaussian source, we guessed the correct form
of the rate-distortion function. In both of their converse arguments, we used the same trick to
establish that any other feasible PŜ|S gave a larger value for R(D). In this section, we formalize
this trick, in an analogous manner to the saddle point property of the channel capacity (recall
Theorem 5.4 and Section 5.4*). Note that typically we do not need any tricks to compute R(D),
since we can obtain a solution in a parametric form to the unconstrained convex optimization
min I(S; Ŝ) + λ E[d(S, Ŝ)].
PŜ|S

In fact we have discussed in Section 5.6 iterative algorithms (Blahut-Arimoto) that computes R(D).
However, for the peace of mind it is good to know there are some general reasons why tricks like
we used in the Hamming or Gaussian case actually are guaranteed to work.

Theorem 26.4

1 Suppose PY∗ and PX|Y∗  PX are such that E[d(X, Y∗ )] ≤ D and for any PX,Y with E[d(X, Y)] ≤
D we have
 
dPX|Y∗
E log (X|Y) ≥ I(X; Y∗ ) . (26.6)
dPX
Then R(D) = I(X; Y∗ ).
2 Suppose that I(X; Y∗ ) = R(D). Then for any regular branch of conditional probability PX|Y∗
and for any PX,Y satisfying
• E[d(X, Y)] ≤ D and
• PY  PY∗ and
• I ( X ; Y) < ∞ ,
the inequality (26.6) holds.

Some remarks on the preceding theorem are as follows:

1 The first part is a sufficient condition for optimality of a given PXY∗ . The second part gives a
necessary condition that is convenient to narrow down the search. Indeed, typically the set of
PX,Y satisfying those conditions is rich enough to infer from (26.6):
dPX|Y∗
log (x|y) = R(D) − θ[d(x, y) − D] ,
dPX
for a positive θ > 0.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-513


i i

26.2* Analog of saddle-point property in rate-distortion 513

2 Note that the second part is not valid without assuming PY  PY∗ . A counterexample to this
and various other erroneous (but frequently encountered) generalizations is the following: A =
{0, 1}, PX = Ber(1/2), Â = {0, 1, 0′ , 1′ } and

d(0, 0) = d(0, 0′ ) = 1 − d(0, 1) = 1 − d(0, 1′ ) = 0 .


The R(D) = |1 − h(D)|+ , but there exist multiple non-equivalent optimal choices of PY|X , PX|Y
and PY .
Proof. The first part is just a repetition of the proofs above for the Hamming and Gaussian case,
so we focus on the second part. Suppose there exists a counterexample PX,Y achieving
 
dPX|Y∗
I1 = E log (X|Y) < I∗ = R(D) .
dPX
Notice that whenever I(X; Y) < ∞ we have
I1 = I(X; Y) − D(PX|Y kPX|Y∗ |PY ) ,
and thus
D(PX|Y kPX|Y∗ |PY ) < ∞ . (26.7)
Before going to the actual proof, we describe the principal idea. For every λ we can define a joint
distribution
PX,Yλ = λPX,Y + (1 − λ)PX,Y∗ .
Then, we can compute
   
PX|Yλ PX|Yλ PX|Y∗
I(X; Yλ ) = E log (X|Yλ ) = E log
PX PX|Y∗ PX
 
PX|Y∗ (X|Yλ )
= D(PX|Yλ kPX|Y∗ |PYλ ) + E
PX
= D(PX|Yλ kPX|Y∗ |PYλ ) + λI1 + (1 − λ)I∗ .

From here we will conclude, similar to Proposition 2.20, that the first term is o(λ) and thus for
sufficiently small λ we should have I(X; Yλ ) < R(D), contradicting optimality of coupling PX,Y∗ .
We proceed to details. For every λ ∈ [0, 1] define
dPY
ρ 1 ( y) ≜ ( y)
dPY∗
λρ1 (y)
λ(y) ≜
λρ1 (y) + λ̄
(λ)
PX|Y=y = λ(y)PX|Y=y + λ̄(y)PX|Y∗ =y
dPYλ = λdPY + λ̄dPY∗ = (λρ1 (y) + λ̄)dPY∗
D(y) = D(PX|Y=y kPX|Y∗ =y )
(λ)
Dλ (y) = D(PX|Y=y kPX|Y∗ =y ) .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-514


i i

514

Notice:

On {ρ1 = 0} : λ(y) = D(y) = Dλ (y) = 0

and otherwise λ(y) > 0. By convexity of divergence

Dλ (y) ≤ λ(y)D(y)

and therefore
1
Dλ (y)1{ρ1 (y) > 0} ≤ D(y)1{ρ1 (y) > 0} .
λ(y)
Notice that by (26.7) the function ρ1 (y)D(y) is non-negative and PY∗ -integrable. Then, applying
dominated convergence theorem we get
Z Z
1 1
lim dPY∗ Dλ (y)ρ1 (y) = dPY∗ ρ1 (y) lim D λ ( y) = 0 (26.8)
λ→0 {ρ >0}
1
λ( y ) {ρ1 >0} λ→ 0 λ( y)
where in the last step we applied the result from Chapter 5

D(PkQ) < ∞ =⇒ D(λP + λ̄QkQ) = o(λ)

since for each y on the set {ρ1 > 0} we have λ(y) → 0 as λ → 0.


On the other hand, notice that
Z Z
1 1
dPY∗ Dλ (y)ρ1 (y)1{ρ1 (y) > 0} = dPY∗ (λρ1 (y) + λ̄)Dλ (y)
{ρ1 >0} λ(y) λ {ρ1 >0}
Z
1
= dPYλ Dλ (y)
λ {ρ1 >0}
Z
1 1 (λ)
= dPYλ Dλ (y) = D(PX|Y kPX|Y∗ |PYλ ) ,
λ Y λ
where in the penultimate step we used Dλ (y) = 0 on {ρ1 = 0}. Hence, (26.8) shows
(λ)
D(PX|Y kPX|Y∗ |PYλ ) = o(λ) , λ → 0.

Finally, since
(λ)
PX|Y ◦ PYλ = PX ,

we have
   
(λ) dPX|Y∗ dPX|Y∗ ∗
I ( X ; Yλ ) = D(PX|Y kPX|Y∗ |PYλ ) + λ E log (X|Y) + λ̄ E log (X|Y )
dPX dPX
= I∗ + λ(I1 − I∗ ) + o(λ) ,

contradicting the assumption

I ( X ; Y λ ) ≥ I ∗ = R ( D) .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-515


i i

26.3 Lossy joint source-channel coding 515

26.3 Lossy joint source-channel coding


Extending the lossless joint source channel coding problem studied in Section 19.7, in this section
we study its lossy version: How to transmit a source over a noisy channel such that the receiver
can reconstruct the original source within a prescribed distortion?
The setup of the lossy joint source-channel coding problem is as follows. For each k and n, we
are given a source Sk = (S1 , . . . , Sk ) taking values on S , a distortion metric d : S k × Ŝ k → R, and
a channel PYn |Xn acting from An to B n . A lossy joint source-channel code (JSCC) consists of an
encoder f : S k → An and decoder g : B n → Ŝ k , such that the channel input is Xn = f(Sk ) and the
reconstruction Ŝk = g(Yn ) satisfies E[d(Sk , Ŝk )] ≤ D. By definition, we have the Markov chain
f PYn |Xn g
Sk −−→ Xn −−−→ Yn −−→ Ŝk

Such a pair (f, g) is called a (k, n, D)-JSCC, which transmits k symbols over n channel uses such
that the end-to-end distortion is at most D in expectation. Our goal is to optimize the encoder/de-
coder pair so as to maximize the transmission rate (number of symbols per channel use) R = nk .1
As such, we define the asymptotic fundamental limit as
1
RJSCC (D) ≜ lim inf max {k : ∃(k, n, D)-JSCC} .
n→∞ n
To simplify the exposition, we will focus on JSCC for a stationary memoryless source Sk ∼ P⊗S
k
⊗n
transmitted over a stationary memoryless channel PYn |Xn = PY|X subject to a separable distortion
Pk
function d(sk , ŝk ) = 1k i=1 d(si , ŝi ).

26.3.1 Converse
The converse for the JSCC is quite simple, based on data processing inequality and following the
weak converse of lossless JSCC using Fano’s inequality.

Theorem 26.5 (Converse)


C
RJSCC (D) ≤ ,
R ( D)

where C = supPX I(X; Y) is the capacity of the channel and R(D) = infP :E[d(S,Ŝ)]≤D I(S; Ŝ) is the
Ŝ|S
rate-distortion function of the source.

The interpretation of this result is clear: Since we need at least R(D) bits per symbol to recon-
struct the source up to a distortion D and we can transmit at most C bits per channel use, the overall
transmission rate cannot exceeds C/R(D). Note that the above theorem clearly holds for channels
with cost constraint with the corresponding capacity (Chapter 20).

1
Or equivalently, minimize the bandwidth expansion factor ρ = nk .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-516


i i

516

Proof. Consider a (k, n, D)-code which induces the Markov chain Sk → Xn → Yn → Ŝk such
Pk
that E[d(Sk , Ŝk )] = 1k i=1 E[d(Si , Ŝi )] ≤ D. Then

( a) (b) ( c)
kR(D) = inf I(Sk ; Ŝk ) ≤ I(Sk ; Ŝk ) ≤ I(Xn ; Yn ) ≤ sup I(Xn ; Yn ) = nC
PŜk |Sk :E[d(Sk ,Ŝk )]≤D P Xn

where (b) applies data processing inequality for mutual information, (a) and (c) follow from the
respective single-letterization result for lossy compression and channel coding (Theorem 24.8 and
Proposition 19.10).

Remark 26.4 Consider the case where the source is Ber(1/2) with Hamming distortion. Then
Theorem 26.5 coincides with the converse for channel coding under bit error rate Pb in (19.33):
k C
R= ≤
n 1 − h(Pb )
which was previously given in Theorem 19.21 and proved using ad hoc techniques. In the case of
channel with cost constraints, e.g., the AWGN channel with C(SNR) = 12 log(1 + SNR), we have
 
−1 C(SNR)
Pb ≥ h 1−
R
This is often referred to as the Shannon limit in plots comparing the bit-error rate of practical
codes. (See, e.g., Fig. 2 from [359] for BIAWGN (binary-input) channel.) This is erroneous, since
the pb above refers to the bit error rate of data bits (or systematic bits), not all of the codeword bits.
The latter quantity is what typically called BER (see (19.33)) in the coding-theoretic literature.

26.3.2 Achievability via separation


The proof strategy is similar to lossless JSCC in Section 19.7 by separately constructing a
channel coding scheme and a lossy compression scheme, as opposed to jointly optimizing the
JSCC encoder/decoder pair. Specifically, first compress the data into bits then encode with
a channel code; to decode, apply the channel decoder followed by the source decompressor.
Under appropriate assumptions, this separately-designed scheme achieves the optimal rate in
Theorem 26.5.

Theorem 26.6 For any stationary memoryless source (PS , S, Ŝ, d) with rate-distortion func-
tion R(D) satisfying Assumption 26.1 (below), and for any stationary memoryless channel PY|X
with capacity C,
C
RJSCC (D) = .
R(D)

Assumption 26.1 on the source (which is rather technical and can be skipped in the first reading)
is to control the distortion incurred by the channel decoder making an error. Despite this being a
low-probability event, without any assumption on the distortion metric, we cannot say much about
its contribution to the end-to-end average distortion. (Note that this issue does not arise in lossless

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-517


i i

26.3 Lossy joint source-channel coding 517

JSCC). Assumption 26.1 is trivially satisfied by bounded distortion (e.g., Hamming), and can be
shown to hold more generally such as for Gaussian sources and MSE distortion.
Proof. In view of Theorem 26.5, we only prove achievability. We constructed a separated
compression/channel coding scheme as follows:

• Let (fs , gs ) be a (k, 2kR(D)+o(k) , D)-code for compressing Sk such that E[d(Sk , gs (fs (Sk )] ≤ D.
By Lemma 26.8 (below), we may assume that all reconstruction points are not too far from
some fixed string, namely,
d(sk0 , gs (i)) ≤ L (26.9)
for all i and some constant L, where sk0 = (s0 , . . . , s0 ) is from Assumption 26.1 below.
• Let (fc , gc ) be a (n, 2nC+o(n) , ϵn )max -code for channel PYn |Xn such that kR(D) + o(k) ≤ nC +
o(n) and the maximal probability of error ϵn → 0 as n → ∞. Such as code exists thanks to
Theorem 19.9 and Corollary 19.5.

Let the JSCC encoder and decoder be f = fc ◦ fs and g = gs ◦ gc . So the overall system is
fs fc gc gs
Sk −
→W−
→ Xn −→ Yn −
→ Ŵ −
→ Ŝk .
Note that here we need to control the maximal probability of error of the channel code since
when we concatenate these two schemes, W at the input of the channel is the output of the source
compressor, which need not be uniform.
To analyze the average distortion, we consider two cases depending on whether the channel
decoding is successful or not:
E[d(Sk , Ŝk )] = E[d(Sk , gs (W))1{W = Ŵ}] + E[d(Sk , gs (Ŵ)))1{W 6= Ŵ}].
By assumption on our lossy code, the first term is at most D. For the second term, we have P[W 6=
Ŵ] ≤ ϵn = o(1) by assumption on our channel code. Then
( a)
E[d(Sk , gs (Ŵ))1{W 6= Ŵ}] ≤ E[1{W 6= Ŵ}λ(d(Sk , ŝk0 ) + d(sk0 , gs (Ŵ)))]
(b)
≤ λ · E[1{W 6= Ŵ}d(Sk , ŝk0 )] + λL · P[W 6= Ŵ]
( c)
= o(1),
where (a) follows from the generalized triangle inequality from Assumption 26.1(a) below; (b)
follows from (26.9); in (c) we apply Lemma 25.4 that were used to show the vanishing of the
expectation in (25.15) before.
In all, our scheme meets the average distortion constraint. Hence we conclude that for all R >
C/R(D), there exists a sequence of (k, n, D + o(1))-JSCC codes.
The following assumption is needed by the previous theorem:

Assumption 26.1 Fix D. For a source (PS , S, Ŝ, d), there exists λ ≥ 0, s0 ∈ S, ŝ0 ∈ Ŝ such
that

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-518


i i

518

(a) Generalized triangle inequality: d(s, ŝ) ≤ λ(d(s, ŝ0 ) + d(s0 , â)) ∀a, â.
(b) E[d(S, ŝ0 )] < ∞ (so that Dmax < ∞ too).
(c) E[d(s0 , Ŝ)] < ∞ for any output distribution PŜ achieving the rate-distortion function R(D).
(d) d(s0 , ŝ0 ) < ∞.

The interpretation of this assumption is that the spaces S and Ŝ have “nice centers” s0 and ŝ0 ,
in the sense that the distance between any two points is upper bounded by a constant times the
distance from the centers to each point (see figure below).

b
b

s ŝ

b b

s0 ŝ0

S Ŝ

Note that Assumption 26.1 is not straightforward to verify. Next we give some more convenient
sufficient conditions. First of all, Assumption 26.1 holds automatically for bounded distortion
function. In other words, for a discrete source on a finite alphabet S , a finite reconstruction alphabet
Ŝ , and a finite distortion function d(s, ŝ) < ∞, Assumption 26.1 is fulfilled. More generally, we
have the following criterion.

Theorem 26.7 If S = Ŝ and d(s, ŝ) = ρ(s, ŝ)q for some metric ρ and q ≥ 1, and Dmax ≜
infŝ0 E[d(S, ŝ0 )] < ∞, then Assumption 26.1 holds.

Proof. Take s0 = ŝ0 that achieves a finite Dmax = E[d(S, ŝ0 )]. (In fact, any points can serve as
centers in a metric space). Applying triangle inequality and Jensen’s inequality, we have
 q  q
1 1 1 1 1
ρ(s, ŝ) ≤ ρ(s, s0 ) + ρ(s0 , ŝ) ≤ ρq (s, s0 ) + ρq (s0 , ŝ).
2 2 2 2 2
Thus d(s, ŝ) ≤ 2q−1 (d(s, s0 ) + d(s0 , ŝ)). Taking λ = 2q−1 verifies (a) and (b) in Assumption 26.1.
To verify (c), we can apply this generalized triangle inequality to get d(s0 , Ŝ) ≤ 2q−1 (d(s0 , S) +
d(S, Ŝ)). Then taking the expectation of both sides gives

E[d(s0 , Ŝ)] ≤ 2q−1 (E[d(s0 , S)] + E[d(S, Ŝ)])


≤ 2q−1 (Dmax + D) < ∞.

So we see that metrics raised to powers (e.g. squared norms) satisfy Assumption 26.1. Finally,
we give the lemma used in the proof of Theorem 26.6.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-519


i i

26.4 What is lacking in classical lossy compression? 519

Lemma 26.8 Fix a source satisfying Assumption 26.1 and an arbitrary PŜ|S . Let R > I(S; Ŝ),
L > max{E[d(s0 , Ŝ)], d(s0 , ŝ0 )} and D > E[d(S, Ŝ)]. Then, there exists a (k, 2kR , D)-code such that
d(sk0 , ŝk ) ≤ L for every reconstruction point ŝk , where sk0 = (s0 , . . . , s0 ).

Proof. Let X = S k , X̂ = Ŝ ⊗k and PX = PkS , PY|X = P⊗ k


Ŝ|S
. We apply the achievability bound
for excess distortion from Theorem 25.3 with γ = 2k(R+I(S;Ŝ))/2 to the following non-separable
distortion function
(
d(x, x̂) d(sk0 , x̂) ≤ L
d1 (x, x̂) =
+∞ otherwise.

For any D′ ∈ (E[d(S, Ŝ)], D), there exist M = 2kR reconstruction points (c1 , . . . , cM ) such that
 
P min d(S , cj ) > D ≤ P[d1 (Sk , Ŝk ) > D′ ] + o(1),
k ′
j∈[M]

where on the right side (Sk , Ŝk ) ∼ P⊗ k


S,Ŝ
. Note that without any change in d1 -distortion we can
remove all (if any) reconstruction points cj with d(sk0 , cj ) > L. Furthermore, from the WLLN we
have

P[d1 (S, Ŝ) > D′ ] ≤ P[d(Sk , Ŝk ) > D′ ] + P[d(sk0 , Ŝk ) > L] → 0

as k → ∞ (since E[d(S, Ŝ)] < D′ and E[d(s0 , Ŝ)] < L). Thus we have
 
P min d(Sk , cj ) > D′ → 0
j∈[M]

and d(sk0 , cj ) ≤ L. Finally, by adding another reconstruction point cM+1 = ŝk0 = (ŝ0 , . . . , ŝ0 ) we
get
h i h  i
′ ′
E min d(S , cj ) ≤ D + E d(S , ŝ0 )1 min d(S , cj ) > D
k k k k
= D′ + o(1) ,
j∈[M+1] j∈[M]

where the last estimate follows from the same argument that shows the vanishing of the expectation
in (25.15). Thus, for sufficiently large n the expected distortion is at most D, as required.

26.4 What is lacking in classical lossy compression?


Let us discuss some issues and open problems in the classical compression theory. First, for
the compression the standard results in lossless compression apply well for text files. The lossy
compression theory, however, relies on the independence assumption and on separable distortion
metrics. Because of that, while the scalar quantization theory has been widely used in practice (in
the form of analog-to-digital converters, ADCs), the vector quantizers (rate-distortion) theory so
far has not been employed. The assumptions of the rate-distortion theory can be seen to be espe-
cially problematic in the case of compressing digital images, which evidently have very strong

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-520


i i

520

spatial correlation compared to 1D signals. (For example, the first sentence and the last in Tol-
stoy’s novel are pretty uncorrelated. But the regions in the upper-left and bottom-right corners of
one image can be strongly correlated. At the same time, the uncompressed size of the novel and
the image could be easily equal.) Thus, for practicing the lossy compression of videos and images
the key problem is that of coming up with a good “whitening” bases, which is an art still being
refined.
For the joint-source-channel coding, the separation principle has definitely been a guiding light
for the entire development of digital information technology. But this now ubiquitous solution
that Shannon’s separation has professed led to a rather undesirable feature of dropped cellular
calls (as opposed to slowly degraded quality of the old analog telephones) or “snow screen” on
TV whenever the SNR falls below a certain threshold. That is, the separated systems can be very
unstable, or lacks graceful degradation. To sketch this effect consider an example of JSCC, where
the source distribution is Ber( 12 ), with rate-distortion function R(D) = 1 − h(D), and the channel
is BSCδ with capacity C(δ) = 1 − h(δ). Consider two solutions:

1 a separated scheme: targeting a certain acceptable distortion level D∗ we compress the source
at rate R(D∗ ). Then we can use a channel code of rate R(D∗ ) which would achieve vanishing
error as long as R(D∗ ) < C(δ), i.e. δ < D∗ . Overall, this scheme has a bandwidth expansion
factor ρ = nk = 1. Note that there exists channel codes (Exercises IV.8 and IV.10) that work
simultaneously for all δ < δ ∗ = D∗ .
2 a simple JSCC with ρ = 1 which transmits “uncoded” data, i.e. sets Xi = Si .

For large blocklengths, the achieved distortion are shown in Figure 26.2 as a function of δ .
We can now see why separated solution, though in some sense optimal, is not ideal. First, below

distortion

separated
1
2

uncoded

D∗ = δ ∗

δ
0 δ∗ 1
2

Figure 26.2 No graceful degradation of separately designed source channel code (black solid), as compared
with uncoded transmission (blue dashed).

δ < δ ∗ the separated solution does achieve acceptable distortion D∗ , but it does not improve if the
channel improves, i.e. the distortion stays constant at D∗ , unlike the uncoded system. Second, and
much more importantly, is a problem with δ > δ ∗ . In this regime, separated scheme undergoes a

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-521


i i

26.4 What is lacking in classical lossy compression? 521

catastrophic failure and distortion becomes 1/2 (that is, we observe pure noise, or “snow” in TV-
speak). At the same time, the distortion of the simple “uncoded” JSCC is also deteriorating but
gracefully so. Unfortunately, such graceful schemes are only known for very few cases, requiring
ρ = 1 and certain “perfect match” conditions between channel noise and source (distortion met-
ric)2 . It is a long-standing (practical and theoretical) open problem in information theory to find
schemes that exhibit non-catastrophic degradation for general source-channel pairs and general ρ.
Even purely theoretically the problem of JSCC still contains many mysteries. For example, in
Section 22.5 we described refined expansion of the channel coding rate as a function of block-
length. In particular, we have seen that convergence to channel capacity happens at the rate √1n ,
which is rather slow. At the same time, convergence to the rate-distortion function is almost at
the rate of 1n (see Exercises V.3 and V.4). Thus, it is not clear what the convergence rate of the
JSCC may be. Unfortunately, sharp results here are still at a nascent stage. In fact, even for the
most canonical setting of a binary source and BSCδ channel it was only very recently shown [248]

that the optimal rate nk converges to the ultimate limit of R(CD) at the speed of Θ(1/ n) unless
the Gastpar condition R(D) = C(δ) is met. Analyzing other source-channel pairs or any general
results of this kind is another important open problem.

2
Often informally called “Gastpar conditions” after [181].

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-522


i i

27 Metric entropy

In the previous chapters of this part we discussed optimal quantization of random vectors in both
fixed and high dimensions. Complementing this average-case perspective, the topic of this chapter
is on the deterministic (worst-case) theory of quantization. The main object of interest is the metric
entropy of a set, which allows us to answer two key questions (a) covering number: the minimum
number of points to cover a set up to a given accuracy; (b) packing number: the maximal number
of elements of a given set with a prescribed minimum pairwise distance.
The foundational theory of metric entropy were put forth by Kolmogorov, who, together with
his students, also determined the behavior of metric entropy in a variety of problems for both finite
and infinite dimensions. Kolmogorov’s original interest in this subject stems from Hilbert’s 13th
problem, which concerns the possibility or impossibility of representing multi-variable functions
as compositions of functions of fewer variables. It turns out that the theory of metric entropy can
provide a surprisingly simple and powerful resolution to such problems. Over the years, metric
entropy has found numerous connections to and applications in other fields such as approximation
theory, empirical processes, small-ball probability, mathematical statistics, and machine learning.
In particular, metric entropy will be featured prominently in Part VI of this book, wherein we
discuss its applications to proving both lower and upper bounds for statistical estimation.
This chapter is organized as follows. Section 27.1 provides basic definitions and explains the
fundamental connections between covering and packing numbers. These foundations are laid out
by Kolmogorov and Tikhomirov in [250], which remains the definitive reference on this subject.
In Section 27.2 we study metric entropy in finite-dimensional spaces and a popular approach for
bounding the metric entropy known as the volume bound. To demonstrate the limitations of the
volume method and the associated high-dimensional phenomenon, in Section 27.3 we discuss
a few other approaches through concrete examples. Infinite-dimensional spaces are treated next
for smooth functions in Section 27.4 (wherein we also discuss the application to Hilbert’s 13th
problem) and Hilbert spaces in Section 27.3.2 (wherein we also discuss the application to empirical
processes). Section 27.5 gives an exposition of the connections between metric entropy and the
small-ball problem in probability theory. Finally, in Section 27.6 we circle back to rate-distortion
theory and discuss how it is related to metric entropy and how information-theoretic methods can
be useful for the latter.

27.1 Covering and packing


Definition 27.1 Let (V, d) be a metric space and Θ ⊂ V.

522

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-523


i i

27.1 Covering and packing 523

• We say {v1 , ..., vN } ⊂ V is an ϵ-covering (or ϵ-net) of Θ if Θ ⊂ ∪Ni=1 B(vi , ϵ), where B(v, ϵ) ≜
{u ∈ V : d(u, v) ≤ ϵ} is the (closed) ball of radius ϵ centered at v; or equivalently, ∀θ ∈ Θ,
∃i ∈ [N] such that d(θ, vi ) ≤ ϵ.
• We say {θ1 , ..., θM } ⊂ Θ is an ϵ-packing of Θ if mini̸=j kθi − θj k > ϵ;1 or equivalently, the balls
{B(θi , ϵ/2) : j ∈ [M]} are disjoint.

ϵ
≥ϵ

Θ Θ

Figure 27.1 Illustration of ϵ-covering and ϵ-packing.

Upon defining ϵ-covering and ϵ-packing, a natural question concerns the size of the optimal
covering and packing, leading to the definition of covering and packing numbers:

N(Θ, d, ϵ) ≜ min{n : ∃ ϵ-covering of Θ of size n} (27.1)


M(Θ, d, ϵ) ≜ max{m : ∃ ϵ-packing of Θ of size m} (27.2)

with min ∅ understood as ∞; we will sometimes abbreviate these as N(ϵ) and M(ϵ) for brevity.
Similar to volume and width, covering and packing numbers provide a meaningful measure for
the “massiveness” of a set. The major focus of this chapter is to understanding their behavior in
both finite and infinite-dimensional spaces as well as their statistical applications.
Some remarks are in order.

• Monotonicity: N(Θ, d, ϵ) and M(Θ, d, ϵ) are non-decreasing and right-continuous functions of


ϵ. Furthermore, both are non-decreasing in Θ with respect to set inclusion.
• Finiteness: Θ is totally bounded (e.g. compact) if N(Θ, d, ϵ) < ∞ for all ϵ > 0. For Euclidean
spaces, this is equivalent to Θ being bounded, namely, diam(Θ) < ∞ (cf. (5.4)).
• The logarithm of the covering and packing numbers are commonly referred to as metric
entropy. In particular, log M(ϵ) and log N(ϵ) are called ϵ-entropy and ϵ-capacity in [250]. Quan-
titative connections between metric entropy and other information measures are explored in
Section 27.6.
• Widely used in the literature of functional analysis [329, 285], the notion of entropy numbers
essentially refers to the inverse of the metric entropy: The kth entropy number of Θ is ek (Θ) ≜
inf{ϵ : N(Θ, d, ϵ) ≤ 2k−1 }. In particular, e1 (Θ) = rad(Θ), the radius of Θ defined in (5.3).

1
Notice we imposed strict inequality for convenience.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-524


i i

524

Remark 27.1 Unlike the packing number M(Θ, d, ϵ), the covering number N(Θ, d, ϵ) defined
in (27.1) depends implicitly on the ambient space V ⊃ Θ, since, per Definition 27.1), an ϵ-covering
is required to be a subset of V rather than Θ. Nevertheless, as the next Theorem 27.2 shows, this
dependency on V has almost no effect on the behavior of the covering number.
As an alternative to (27.1), we can define N′ (Θ, d, ϵ) as the size of the minimal ϵ-covering of Θ
that is also a subset of Θ, which is closely related to the original definition as

N(Θ, d, ϵ) ≤ N′ (Θ, d, ϵ) ≤ N(Θ, d, ϵ/2) (27.3)

Here, the left inequality is obvious. To see the right inequality,2 let {θ1 , . . . , θN } be an 2ϵ -covering
of Θ. We can project each θi to Θ by defining θi′ = argminu∈Θ d(θi , u). Then {θ1′ , . . . , θN′ } ⊂ Θ
constitutes an ϵ-covering. Indeed, for any θ ∈ Θ, we have d(θ, θi ) ≤ ϵ/2 for some θi . Then
d(θ, θi′ ) ≤ d(θ, θi ) + d(θi , θi′ ) ≤ 2d(θ, θi ) ≤ ϵ. On the other hand, the N′ covering numbers need
not be monotone with respect to set inclusion.
The relation between the covering and packing numbers is described by the following funda-
mental result.

Theorem 27.2 (Kolomogrov-Tikhomirov [250])


M(Θ, d, 2ϵ)≤N(Θ, d, ϵ)≤M(Θ, d, ϵ). (27.4)

Proof. To prove the right inequality, fix a maximal packing E = {θ1 , ..., θM }. Then ∀θ ∈ Θ\E,
∃i ∈ [M], such that d(θ, θi ) ≤ ϵ (for otherwise we can obtain a bigger packing by adding θ). Hence
E must an ϵ-covering (which is also a subset of Θ). Since N(Θ, d, ϵ) is the minimal size of all
possible coverings, we have M(Θ, d, ϵ) ≥ N(Θ, d, ϵ).
We next prove the left inequality by contradiction. Suppose there exists a 2ϵ-packing
{θ1 , ..., θM } and an ϵ-covering {x1 , ..., xN } such that M ≥ N + 1. Then by the pigeonhole prin-
ciple, there exist distinct θi and θj belonging to the same ϵ-ball B(xk , ϵ). By triangle inequality,
d(θi , θj ) ≤ 2ϵ, which is a contradiction since d(θi , θj ) > 2ϵ for a 2ϵ-packing. Hence the size of any
2ϵ-packing is at most that of any ϵ-covering, that is, M(Θ, d, 2ϵ) ≤ N(Θ, d, ϵ).

The significance of (27.4) is that it shows that the small-ϵ behavior of the covering and packing
numbers are essentially the same. In addition, the right inequality therein, namely, N(ϵ) ≤ M(ϵ),
deserves some special mention. As we will see next, it is oftentimes easier to prove negative
results (lower bound on the minimal covering or upper bound on the maximal packing) than pos-
itive results which require explicit construction. When used in conjunction with the inequality
N(ϵ) ≤ M(ϵ), these converses turn into achievability statements,3 leading to many useful bounds
on metric entropy (e.g. the volume bound in Theorem 27.3 and the Gilbert-Varshamov bound

2
Another way to see this is from Theorem 27.2: Note that the right inequality in (27.4) yields a ϵ-covering that is included
in Θ. Together with the left inequality, we get N′ (ϵ) ≤ M(ϵ) ≤ N(ϵ/2).
3
This is reminiscent of duality-based argument in optimization: To bound a minimization problem from above, instead of
constructing an explicit feasible solution, a fruitful approach is to equate it with the dual problem (maximization) and
bound this maximum from above.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-525


i i

27.2 Finite-dimensional space and volume bound 525

Theorem 27.5 in the next section). Revisiting the proof of Theorem 27.2, we see that this logic
actually corresponds to a greedy construction (greedily increase the packing until no points can
be added).

27.2 Finite-dimensional space and volume bound


A commonly used method to bound metric entropy in finite dimensions is in terms of volume
ratio. Consider the d-dimensional Euclidean space V = Rd with metric given by an arbitrary
norm d(x, y) = kx − yk. We have the following result.

Theorem 27.3 Let k · k be an arbitrary norm on Rd and B = {x ∈ Rd : kxk ≤ 1} the


corresponding unit norm ball. Then for any Θ ⊂ Rd ,
 d  d
1 vol(Θ) (a) (b) vol(Θ + ϵ B) (c) 3 vol(Θ)
≤ N(Θ, k · k, ϵ) ≤ M(Θ, k · k, ϵ) ≤ 2
≤ .
ϵ vol(B) vol( 2ϵ B) ϵ vol(B)
where (c) holds under the extra condition that Θ is convex and contains ϵB.

Proof. To prove (a), consider an ϵ-covering Θ ⊂ ∪Ni=1 B(θi , ϵ). Applying the union bound yields
  XN
vol(Θ) ≤ vol ∪Ni=1 B(θi , ϵ) ≤ vol(B(θi , ϵ)) = Nϵd vol(B),
i=1

where the last step follows from the translation-invariance and scaling property of volume.
To prove (b), consider an ϵ-packing {θ1 , . . . , θM } ⊂ Θ such that the balls B(θi , ϵ/2) are disjoint.
M(ϵ)
Since ∪i=1 B(θi , ϵ/2) ⊂ Θ + 2ϵ B, taking the volume on both sides yields
 ϵ    ϵ 
vol Θ + B ≥ vol ∪M i=1 B(θi , ϵ/2) = Mvol B .
2 2
This proves (b).
Finally, (c) follows from the following two statements: (1) if ϵB ⊂ Θ, then Θ + 2ϵ B ⊂ Θ + 21 Θ;
and (2) if Θ is convex, then Θ+ 12 Θ = 32 Θ. We only prove (2). First, ∀θ ∈ 32 Θ, we have θ = 13 θ+ 32 θ,
where 13 θ ∈ 12 Θ and 32 θ ∈ Θ. Thus 32 Θ ⊂ Θ + 12 Θ. On the other hand, for any x ∈ Θ + 12 Θ, we
have x = y + 21 z with y, z ∈ Θ. By the convexity of Θ, 23 x = 23 y + 31 z ∈ Θ. Hence x ∈ 23 Θ, implying
Θ + 21 Θ ⊂ 32 Θ.

Remark 27.2 Similar to the proof of (a) in Theorem 27.3, we can start from Θ + 2ϵ B ⊂
∪Ni=1 B(θi , 32ϵ ) to conclude that
N(Θ, k · k, ϵ)
(2/3)d ≤ ≤ 2d .
vol(Θ + 2ϵ B)/vol(ϵB)
In other words, the volume of the fattened set Θ + 2ϵ determines the metric entropy up to constants
that only depend on the dimension. We will revisit this reasoning in Section 27.5 to adapt the
volumetric estimates to infinite dimensions where this fattening step becomes necessary.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-526


i i

526

Next we discuss several applications of Theorem 27.3.

Corollary 27.4 (Metric entropy of balls and spheres) Let k · k be an arbitrary norm on
Rd . Let B ≡ B∥·∥ = {x ∈ Rd : kxk ≤ 1} and S ≡ S∥·∥ = {x ∈ Rd : kxk ≤ 1} be the corresponding
unit ball and unit sphere. Then for ϵ < 1,
 d  d
1 2
≤ N(B, k · k, ϵ) ≤ 1 + (27.5)
ϵ ϵ
 d−1  d−1
1 1
≤ N(S, k · k, ϵ) ≤ 2d 1 + (27.6)
2ϵ ϵ

where the left inequality in (27.6) holds under the extra assumption that k · k is an absolute norm
(invariant to sign changes of coordinates).

Proof. For balls, the estimate (27.5) directly follows from Theorem 27.3 since B + 2ϵ B = (1 + 2ϵ )B.
Next we consider the spheres. Applying (b) in Theorem 27.3 yields
vol(S + ϵB) vol((1 + ϵ)B) − vol((1 − ϵ)B)
N(S, k · k, ϵ) ≤ M(S, k · k, ϵ) ≤ ≤
vol(ϵB) vol(ϵB)
Z ϵ  d−1
(1 + ϵ) − (1 − ϵ)
d d
d d−1 1
= = d (1 + x) dx ≤ 2d 1 + .
ϵd ϵ −ϵ ϵ

where the third inequality applies S + ϵB ⊂ ((1 + ϵ)B)\((1 − ϵ)B) by triangle inequality.
Finally, we prove the lower bound in (27.6) for an absolute norm k · k. To this end one cannot
directly invoke the lower bound in Theorem 27.3 as the sphere has zero volume. Note that k · k′ ≜
k(·, 0)k defines a norm on Rd−1 . We claim that every ϵ-packing in k · k′ for the unit k · k′ -ball
induces an ϵ-packing in k · k for the unit k · k-sphere. Fix x ∈ Rd−1 such that k(x, 0)k ≤ 1 and
define f : R+ → R+ by f(y) = k(x, y)k. Using the fact that k · k is an absolute norm, it is easy to
verify that f is a continuous increasing function with f(0) ≤ 1 and f(∞) = ∞. By the mean value
theorem, there exists yx , such that k(x, yx )k = 1. Finally, for any ϵ-packing {x′1 , . . . , x′M } of the unit
ball B∥·∥′ with respect to k·k′ , setting x′i = (xi , yxi ) we have kx′i −x′j k ≥ k(xi −xj , 0)k = kxi −xj k′ ≥ ϵ.
This proves

M(S∥·∥ , k · k, ϵ) ≥ M(B∥·∥′ , k · k′ , ϵ).

Then the left inequality of (27.6) follows from those of (27.4) and (27.5).

Remark 27.3 Several remarks on Corollary 27.4 are in order:

(a) Using (27.5), we see that for any compact Θ with nonempty interior, we have
1
N(Θ, k · k, ϵ)  M(Θ, k · k, ϵ)  (27.7)
ϵd
for small ϵ, with proportionality constants depending on both Θ and the norm. In fact, the
sharp constant is also known to exist. It is shown in [250, Theorem IX] that there exists a

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-527


i i

27.2 Finite-dimensional space and volume bound 527

constant τ depending only on k · k and the dimension, such that


vol(Θ) 1
M(Θ, k · k, 2ϵ) = (τ + o(1))
vol(B) ϵd
holds for any Θ with positive volume. This constant τ is the maximal sphere packing density in
Rd (the proportion of the whole space covered by the balls in the packing – see [365, Chapter
1] for a formal definition); a similar result and interpretation hold for the covering number as
well. Computing or bounding the value of τ is extremely difficult and remains open except
for some special cases.4 For more on this subject see the monographs [365, 99].
(b) The result (27.6) for spheres suggests that one may expect the metric entropy for a smooth man-
ifold Θ to behave as ( 1ϵ )dim , where dim stands for the dimension of Θ as opposed to the ambient
dimension. This is indeed true in many situations, for example, in the context of matrices, for
the orthogonal group O(d), unitary group U(d), and Grassmanian manifolds [406, 407], in
which case dim corresponds to the “degrees of freedom” (for example, dim = d(d − 1)/2 for
O(d)). More generally, for an arbitrary set Θ, one may define the limit limϵ→0 log Nlog (Θ,∥·∥,ϵ)
1 as
ϵ
its dimension (known as the Minkowski dimension or box-counting dimension). For sets of a
fractal nature, this dimension can be a non-integer (e.g. log2 3 for the Cantor set).
(c) Since all norms on Euclidean space are equivalent (within multiplicative constant factors
depending on dimension), the small-ϵ behavior in (27.7) holds for any norm as long as the
dimension d is fixed. However, this result does not capture the full picture in high dimensions
when ϵ is allowed to depend on d. Understanding these high-dimensional phenomena requires
us to go beyond volume methods. See Section 27.3 for details.

Next we switch our attention to the discrete case of Hamming space. The following theorem
bounds its packing number M(Fd2 , dH , r) ≡ M(Fd2 , r), namely, the maximal number of binary code-
words of length d with a prescribed minimum distance r + 1.5 This is a central question in coding
theory, wherein the lower and upper bounds below are known as the Gilbert-Varshamov bound
and the Hamming bound, respectively.

Theorem 27.5 For any integer 1 ≤ r ≤ d − 1,


2d 2d
Pr d
 ≤ M( F d
2 , r) ≤ P ⌊r/2⌋ d
. (27.8)
i=0 i i=0 i

Proof. Both inequalities in (27.8) follow from the same argument as that in Theorem 27.3, with
Rd replaced by Fd2 and volume by the counting measure (which is translation invariant).

Of particular interest to coding theory is the asymptotic regime of d → ∞ and r = ρd for some
constant ρ ∈ (0, 1). Using the asymptotics of the binomial coefficients (cf. Proposition 1.5), the

4
For example, it is easy to show that τ = 1 for both ℓ∞ and ℓ1 balls in any dimension since cubes can be subdivided into
smaller cubes; for ℓ2 -ball in d = 2, τ = √π is the famous result of L. Fejes Tóth on the optimality of hexagonal
12
arrangement for circle packing [365].
5
Recall that the packing number in Definition 27.1 is defined with a strict inequality.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-528


i i

528

Hamming and Gilbert-Varshamov bounds translate to

2d(1−h(ρ)+o(d) ≤ M(Fd2 , ρd) ≤ 2d(1−h(ρ/2))+o(d) .

Finding the exact exponent is one of the most significant open questions in coding theory. The best
upper bound to date is due to McEliece, Rodemich, Rumsey and Welch [299] using the technique
of linear programming relaxation.
In contrast, the corresponding covering problem in Hamming space is much simpler, as we
have the following tight result

N(Fd2 , ρd) = 2dR(ρ)+o(d) , (27.9)

where R(ρ) = (1 − h(ρ))+ is the rate-distortion function of Ber( 12 ) from Theorem 26.1. Although
this does not automatically follow from the rate-distortion theory, it can be shown using similar
argument – see Exercise V.26.
Finally, we state a lower bound on the packing number of Hamming spheres, which is needed
for subsequent application in sparse estimation (Exercise VI.12) and useful as basic building blocks
for computing metric entropy in more complicated settings (Theorem 27.7).

Theorem 27.6 (Gilbert-Varshamov bound for Hamming spheres) Denote by


Sdk = {x ∈ Fd2 : wH (x) = k} (27.10)

the Hamming sphere of radius 0 ≤ k ≤ d. Then



d
M(Sdk , r) ≥ Pr k
.
d
(27.11)
i=0 i

In particular,
k d
log M(Sdk , k/2) ≥ log . (27.12)
2 2ek

Proof. Again (27.11) follows from the volume argument. To verify (27.12), note that for r ≤ d/2,
Pr 
we have i=0 di ≤ exp(dh( dr )) (see Theorem 8.2 or (15.19) with p = 1/2). Using h(x) ≤ x log xe

and dk ≥ ( dk )k , we conclude (27.12) from (27.11).

27.3 Beyond the volume bound


The volume bound in Theorem 27.3 provides a useful tool for studying metric entropy in Euclidean
spaces. As a result, as ϵ → 0, the covering number of any set with non-empty interior always grows
exponentially in d as ( 1ϵ )d – cf. (27.7). This asymptotic result, however, has its limitations and does
not apply if the dimension d is large and ϵ scales with d. In fact, one expects that there is some
critical threshold of ϵ depending on the dimension d, below which the exponential asymptotics is
tight, and above which the covering number can grow polynomially in d. This high-dimensional
phenomenon is not fully captured by the volume method.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-529


i i

27.3 Beyond the volume bound 529

As a case in point, consider the maximum number of ℓ2 -balls of radius ϵ packed into the unit
ℓ1 -ball, namely, M(B1 , k · k2 , ϵ). (Recall that Bp denotes the unit ℓp -ball in Rd with 1 ≤ p ≤ ∞.)
We have studied the metric entropy of arbitrary norm balls under the same norm in Corollary 27.4,
where the specific value of the volume was canceled from the √
volume ratio. Here, although ℓ1 and
ℓ2 norms are equivalent in the sense that kxk2 ≤ kxk1 ≤ dkxk2 , this relationship is too loose
when d is large.
Let us start by applying the volume method in Theorem 27.3:
vol(B1 ) vol(B1 + 2ϵ B2 )
≤ N(B1 , k · k2 , ϵ) ≤ M(B1 , k · k2 , ϵ) ≤ .
vol(ϵB2 ) vol( 2ϵ B2 )
Applying the formula for the volume of a unit ℓq -ball in Rd :
h  id
2Γ 1 + 1q
vol(Bq ) =   , (27.13)
Γ 1 + qd

πd
we get6 vol(B1 ) = 2d /d! and vol(B2 ) = Γ(1+d/2) , which yield, by Stirling approximation,
1 1
vol(B1 )1/d  , vol(B2 )1/d  √ . (27.14)
d d
Then for some absolute constant C,
√   d
vol(B1 + 2ϵ B2 ) vol((1 + ϵ 2 d )B1 ) 1
M(B1 , k · k2 , ϵ) ≤ ≤ ≤ C 1 + √ , (27.15)
vol( 2ϵ B2 ) vol( 2ϵ B2 ) ϵ d

where the second inequality follows from B2 ⊂ dB1 by Cauchy-Schwarz inequality. (This step
is tight in the sense that vol(B1 + 2ϵ B2 )1/d ≳ max{vol(B1 )1/d , 2ϵ vol(B2 )1/d }  max{ d1 , √ϵd }.) On
the other hand, for some absolute constant c,
 d  d
vol(B1 ) 1 vol(B1 ) c
M(B1 , k · k2 , ϵ) ≥ = = √ . (27.16)
vol(ϵB2 ) ϵ vol(B2 ) ϵ d
Overall, for ϵ ≤ √1d , we have M(B1 , k · k2 , ϵ)1/d  ϵ√1 d ; however, the lower bound trivializes and
the upper bound (which is exponential in d) is loose in the regime of ϵ  √1d , which requires
different methods than volume calculation. The following result describes the complete behavior
of this metric entropy. In view of Theorem 27.2, we will go back and forth between the covering
and packing numbers in the argument.

Theorem 27.7 For 0 < ϵ < 1 and d ∈ N,


(
d log ϵ2ed ϵ≤ √1
log M(B1 , k · k2 , ϵ)  d .
1
ϵ2
log(eϵ2 d) ϵ≥ √1
d

6
For B1 this can be proved directly by noting that B1 consists 2d disjoint “copies” of the simplex whose volume is 1/d! by
induction on d.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-530


i i

530

Proof. The case of ϵ ≤ √1d follows from earlier volume calculation (27.15)–(27.16). Next we
focus on √1d ≤ ϵ < 1.
For the upper bound, we construct an ϵ-covering in ℓ2 by quantizing each coordinate. Without
loss of generality, assume that ϵ < 1/4. Fix some δ < 1. For each θ ∈ B1 , there exists x ∈
(δ Zd ) ∩ B1 such that kx − θk∞ ≤ δ . Then kx − θk22 ≤ kx − θk1 kx − θk∞ ≤ 2δ . Furthermore, x/δ
belongs to the set
( )
X d
Z= z∈Z : d
|zi | ≤ k (27.17)
i=1

with k = b1/δc. Note that each z ∈ Z has at most k nonzeros. By enumerating the number of non-

negative solutions (stars and bars calculation) and the sign pattern, we have7 |Z| ≤ 2k∧d d−k1+k .
Finally, picking δ = ϵ2 /2, we conclude that N(B1 , k · k2 , ϵ) ≤ |Z| ≤ ( 2e(dk+k) )k as desired. (Note
that this method also recovers the volume bound for ϵ ≤ √1d , in which case k ≤ d.)

For the lower bound, note that M(B1 , k · k2 , 2) ≥ 2d by considering ±e1 , . . . , ±ed . So it
suffices to consider d ≥ 8. We construct a packing of B1 based on a packing of the Hamming
sphere. Without loss of generality, assume that ϵ > 4√1 d . Fix some 1 ≤ k ≤ d. Applying
the Gilbert-Varshamov bound in Theorem 27.6, in particular, (27.12), there exists a k/2-packing
Pd
{x1 , . . . , xM } ⊂ Sdk = {x ∈ {0, 1}d : i=1 xi = k} and log M ≥ 2k log 2ek d
. Scale the Hamming
sphere to fit the ℓ1 -ball by setting θi = xi /k. Then θi ∈ B1 and kθi − θj k2 = k2 dH (xi , xj ) ≥ 2k
2 1 1
for all
1
i 6= j. Choosing k = ϵ2 which satisfies k ≤ d/8, we conclude that {θ1 , . . . , θM } is a 2 -packing
ϵ

of B1 in k · k2 as desired.

The above elementary proof can be adapted to give the following more general result (see
Exercise V.27): Let 1 ≤ p < q ≤ ∞. For all 0 < ϵ < 1 and d ∈ N,
(
d log ϵes d ϵ ≤ d−1/s 1 1 1
log M(Bp , k · kq , ϵ) p,q 1 , ≜ − . (27.18)
−1/s
s log(eϵ d)
ϵ
s
ϵ≥d s p q

In the remainder of this section, we discuss a few generic results in connection to Theorem 27.7,
in particular, metric entropy upper bounds via the Sudakov minorization and Maurey’s empirical
method, as well as the duality of metric entropy in Euclidean spaces.

27.3.1 Sudakov’s minoration


Theorem 27.8 (Sudakov’s minoration) Define the Gaussian width of Θ ⊂ Rd as8

w(Θ) ≜ E sup hθ, Zi , Z ∼ N (0, Id ). (27.19)


θ∈Θ

7 ∑d (d)( k )
By enumerating the support and counting positive solutions, it is easy to show that |Z| = i=0 2d−i i d−i
.
8
To avoid measurability difficulty, w(Θ) should be understood as supT⊂Θ,|T|<∞ E maxθ∈T hθ, Zi.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-531


i i

27.3 Beyond the volume bound 531

For any Θ ⊂ Rd ,
p
w(Θ) ≳ sup ϵ log M(Θ, k · k2 , ϵ). (27.20)
ϵ>0

As a quick corollary, applying the volume lower bound on the packing number in Theorem 27.3
to (27.20) and optimizing over ϵ, we obtain Urysohn’s inequality:9
 1/d
√ vol(Θ) (27.14)
w(Θ) ≳ d  d · vol(Θ)1/d . (27.21)
vol(B2 )

Sudakov’s theorem relates the Gaussian width to the metric entropy, both of which are meaning-
ful measures of the massiveness of a set. The important point is that the proportionality constant
in (27.20) is independent of the dimension. It turns out that Sudakov’s lower bound is tight up to
a log(d) factor [438, Theorem 8.1.13]. The following complementary result is known as Dudley’s
chaining inequality (see Exercise V.28 for a proof.)
Z ∞p
w(Θ) ≲ log M(Θ, k · k2 , ϵ)dϵ. (27.22)
0

Understanding the maximum of Gaussian processes is a field on its own; see the monograph [411].
In this section we focus on the lower bound (27.20) in order to develop upper bound for metric
entropy using the Gaussian width.
The proof of Theorem 27.8 relies on the following Gaussian comparison lemma of Slepian
(whom we have encountered earlier in Theorem 11.13). For a self-contained proof see [89]. See
also [329, Lemma 5.7, p. 70] for a simpler proof of a weaker version E max Xi ≤ 2E max Yi , which
suffices for our purposes.

Lemma 27.9 (Slepian’s lemma) Let X = (X1 , . . . , Xn ) and Y = (Y1 , . . . , Yn ) be Gaussian


random vectors. If E(Yi − Yj )2 ≤ E(Xi − Xj )2 for all i, j, then E max Yi ≤ E max Xi .

We also need the result bounding the expectation of the maximum of n Gaussian random
variables.

Lemma 27.10 Let Z1 , . . . , Zn be distributed as N (0, 1). Then


h i p
E max Zi ≤ 2 log n. (27.23)
i∈[n]

i.i.d.
In addition, if Z1 , . . . , Zn ∼ N (0, 1), then
h i p
E max Zi = 2 log n(1 + o(1)). (27.24)
i∈[n]

9 vol(Θ)
For a sharp form, see [329, Corollary 1.4], which states that for all symmetric convex Θ, w(Θ) ≥ E[kZk2 ]( vol(B ) )1/d ;
2
in other words, balls minimize the Gaussian width among all symmetric convex bodies of the same volume.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-532


i i

532

Proof. First, let T = argmaxj Zj . Since Zj are 1-subgaussian (recall Definition 4.15), from
Exercise I.56 we have
p p p
| E[max Zi ]| = | E[ZT ]| ≤ 2I(Zn ; T) = 2H(T) ≤ 2 log n .
i

Next, assume that Zi are iid. For any t > 0,

E[max Zi ] ≥ t P[max Zi ≥ t] + E[max Zi 1 {Z1 < 0}1 {Z2 < 0} . . . 1 {Zn < 0}]
i i i

≥ t(1 − (1 − Φc (t))n ) + E[Z1 1 {Z1 < 0}1 {Z2 < 0} . . . 1 {Zn < 0}].

where Φc (t) = P[Z1 ≥ t] is the normal tail probability. The second term equals
2−(n−1) E[Z1 1 {Z1 < 0}] = o(1). For the first term, recall that Φc (t) ≥ 1+t t2 φ(t) (Exercise V.25).
p
Choosing
p t = (2 − ϵ) log n for small ϵ > 0 so that Φc (t) = ω( 1n ) and hence E[maxi Zi ] ≥
(2 − ϵ) log n(1 + o(1)). By the arbitrariness of ϵ > 0, the lower bound part of (27.24)
follows.

Proof of Theorem 27.8. Let {θ1 , . . . , θM } be an optimal ϵ-packing of Θ. Let Xi = hθi , Zi for
i.i.d.
i ∈ [M], where Z ∼ N (0, Id ). Let Yi ∼ N (0, ϵ2 /2). Then

E(Xi − Xj )2 = (θi − θj )⊤ E[ZZ⊤ ](θi − θj ) = kθi − θj k22 ≥ ϵ2 = E(Yi − Yj )2 .

Then
p
E sup hθ, Zi ≥ E max Xi ≥ E max Yi  ϵ log M
θ∈Θ 1≤i≤M 1≤i≤M

where the second and third step follows from Lemma 27.9 and Lemma 27.10 respectively.

Revisiting the packing number of the ℓ1 -ball, we apply Sudakov minorization to Θ = B1 . By


duality and applying Lemma 27.10,
p
w(B1 ) = E sup hx, Zi = EkZk∞ ≤ 2 log d.
x: ∥ x∥ 1 ≤ 1

Then Theorem 27.8 gives


log d
log M(B1 , k · k2 , ϵ) ≲ . (27.25)
ϵ2

When ϵ ≳ 1/ d, this is much tighter than the volume bound (27.15) and almost optimal (com-
2 √
pared to log(ϵd2 ϵ ) ); however, when ϵ  1/ d, (27.25) yields d log d but we know (even from
the volume bound) that the correct behavior is d. In Section 27.3.3 we discuss another general
approach that gives the optimal bound in this case.

1
27.3.2 Hilbert ball has metric entropy ϵ2
P
We consider a Hilbert ball B2 = {x ∈ R∞ : i x2i ≤ 1}. Under the usual metric ℓ2 (R∞ ) this
set is not compact and cannot have finite ϵ-nets for all ϵ. However, the metric of interest in many

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-533


i i

27.3 Beyond the volume bound 533

applications is often different. Specifically, let us fix some probability distribution P on B2 s.t.
EX∼P [kXk22 ] ≤ 1 and define
p
dP (θ, θ′ ) ≜ EX∼P [| hθ − θ′ , Xi |2 ]

for θ, θ′ ∈ B2 . The importance of this metric is that it allows to analyze complexity of a class
of linear functions θ 7→ hθ, Xi for any random variable X of unit norm and has applications in
learning theory [302, 471].

Theorem 27.11 For some universal constant c we have for all ϵ > 0
c
log N(B2 , dP , ϵ) ≤ .
ϵ2

Proof. First, we show that without loss of generality we may assume that X has all coordinates
P 2
other than the first n zero. Indeed, take n so large that E[ j>n X2j ] ≤ ϵ4 . Let us denote by θ̃ the
vector obtained from θ by zeroing out all coordinates j > n. Then from Cauchy-Schwartz we see
that dP (θ, θ̃) ≤ 2ϵ and therefore any 2ϵ -covering of B̃2 = {θ̃ : θ ∈ B2 } will be an ϵ-covering of B2 .
Hence, from now on we assume that the ball B2 is in fact finite-dimensional.
We can redefine distance dP in a more explicit way as follows

dP (θ, θ′ )2 = (θ − θ′ )⊤ Σ(θ − θ′ ) ,

where√Σ = E[XX⊤ ] is a positive-semidefinite matrix of second moments of X ∼ P. Let us set


D = Σ be the symmetric square-root of Σ. To each θ let us associate v(θ) = Dθ and let V =
D(B2 ) be the image of B2 under D. Note dP (θ, θ′ ) = kv(θ) − v(θ′ )k2 . Therefore, from Sudakov
minoration Theorem 27.8 we obtain
c
log M(V, k · k2 , ϵ) ≤ E[sup hv, Zi] Z ∼ N (0, Id ) .
ϵ2 v∈V

Since V is an ellipsoid, we can compute the supremum explicitly, indeed


q √
E[sup hv, Zi] = E[ sup hDθ, Zi] = E[kDZk2 ] ≤ E[kDZk22 ] = tr Σ ≤ 1 .
v∈ V θ∈B2

To see one simple implication of the result, recall the standard bound on empirical processes: By
endowing any collection of functions {fθ , θ ∈ Θ} with a metric dP̂n (θ, θ′ )2 = EP̂n [(fθ (X)− fθ′ (X))2 ]
we have
  " Z ∞r #
log N(Θ, dP̂n , ϵ)
E sup E[fθ (X)] − Ên [fθ (X)] ≲ E inf δ + dϵ .
θ δ>0 δ n

It can be seen that when entropy behaves as ϵ−p we get rate n− min(1/p,1/2) except for p = 2
for which the upper bound yields n− 2 log n. The significance of the previous theorem is that the
1

Hilbert ball is precisely “at the phase transition” from parametric to nonparametric rate.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-534


i i

534

As a sanity check, let us take any PX over the unit (possibly infinite dimensional) ball B with
E[X] = 0 and let Θ = B. We have
" # r
1X
n
log n
E[kX̄n k] = E sup hθ, Xi i ≲ ,
θ n i=1 n
Pn
where
p X̄n = 1n i=1 Xi is the empirical mean. In this special case it is easy to bound E[kX̄n k] ≤
E[kX̄n k2 ] ≤ √1n by an explicit calculation.

27.3.3 Maurey’s empirical method


In this section we discuss a powerful probabilistic method due to B. Maurey for constructing a
good covering. It has found applications in approximation theory and especially that for neural
nets [237, 37]. The following result gives a dimension-free bound on the cover number of convex
hulls in Hilbert spaces:
p
Theorem 27.12 Let H be an inner product space with the norm kxk ≜ hx, xi. Let T ⊂ H be
a finite set, with radius r = rad(T) = infy∈H supx∈T kx − yk (recall (5.3)). Denote the convex hull
of T by co(T). Then for any 0 < ϵ ≤ r,
 2 
|T| + d ϵr2 e − 2
N(co(T), k · k, ϵ) ≤ . (27.26)
d ϵr2 e − 1
2

Proof. Let T = {t1 , t2 , . . . , tm } and denote the Chebyshev center of T by c ∈ H, such that r =
maxi∈[m] kc − ti k. For n ∈ Z+ , let
( ! )
1 Xm Xm
Z= c+ ni ti : ni ∈ Z+ , ni = n .
n+1
i=1 i=1
Pm P
For any x = i=1 xi ti ∈ co(T) where xi ≥ 0 and xi = 1, let Z be a discrete random variable
such that Z = ti with probability xi . Then E[Z] = x. Let Z0 = c and Z1 , . . . , Zn be i.i.d. copies of
Pm
Z. Let Z̄ = n+1 1 i=0 Zi , which takes values in the set Z . Since
 
2
1 X n
1 Xn X
EkZ̄ − xk22 = E (Zi − x) =  E kZi − xk2 + EhZi − x, Zj − xi
(n + 1)2 ( n + 1) 2
i=0 i=0 i̸=j

1 X
n
1  r2
= E kZi − xk2 = kc − xk2 + nE[kZ − xk2 ] ≤ ,
(n + 1)2 ( n + 1) 2 n+1
i=0
Pm
where the last inequality follows from that kc − xk ≤ i=1 xi kc − ti k ≤ r (in other words, rad(T) =
 
rad(co(T)) and E[kZ − xk2 ] ≤ E[kZ − ck2 ] ≤ r2 . Set n = r2 /ϵ2 − 1 so that r2 /(n + 1) ≤ ϵ2 .
There exists some z ∈ N such that kz − xk ≤ ϵ. Therefore Z is an ϵ-covering of co(T). Similar to
(27.17), we have
     
n+m−1 m + r2 /ϵ2 − 2
|Z| ≤ = .
n dr2 /ϵ2 e − 1

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-535


i i

27.4 Infinite-dimensional space: smooth functions 535

We now apply Theorem 27.12 to recover the result for the unit ℓ1 -ball B1 in Rd in Theorem 27.7:
Note that B1 = co(T), where T = {±e1 , . . . , ±ed , 0} satisfies rad(T) = 1. Then
 
2d + d ϵ12 e − 1
N(B1 , k · k2 , ϵ) ≤ , (27.27)
d ϵ12 e − 1

which recovers the optimal upper bound in Theorem 27.7 at both small and big scales.

27.3.4 Duality of metric entropy


First we define a more general notion of covering number. For K, T ⊂ Rd , define the covering
number of K using translates of T as

N(K, T) = min{N : ∃x1 , . . . , xN ∈ Rd such that K ⊂ ∪Ni=1 T + xi }.

Then the usual covering number in Definition 27.1 satisfies N(K, k · k, ϵ) = N(K, ϵB), where B is
the corresponding unit norm ball.
A deep result of Artstein, Milman, and Szarek [28] establishes the following duality for metric
entropy: There exist absolute constants α and β such that for any symmetric convex body K,10
1  ϵ 
log N B2 , K◦ ≤ log N(K, ϵB2 ) ≤ log N(B2 , αϵK◦ ), (27.28)
β α
where B2 is the usual unit ℓ2 -ball, and K◦ = {y : supx∈K hx, yi ≤ 1} is the polar body of K.
As an example, consider p < 2 < q and p1 + 1q = 1. By duality, B◦p = Bq . Then (27.28) shows
that N(Bp , k · k2 , ϵ) and N(B2 , k · kq , ϵ) have essentially the same behavior, as verified by (27.18).

27.4 Infinite-dimensional space: smooth functions


Unlike Euclidean spaces, in infinite-dimensional spaces, the metric entropy can grow arbitrarily
fast [250, Theorem XI]. Studying of metric entropy in functional space (for example, under shape
or smoothness constraints) is an area of interest in functional analysis (cf. [441]), and has important
applications in nonparametric statistics, empirical processes, and learning theory [139]. To gain
some insight on the fundamental distinction between finite- and infinite-dimensional spaces, let
us work out a concrete example, which will later be used in the application of density estimation
in Section 32.4. For more general and more precise results (including some cases of equality),
see [250, Sec. 4 and 7]. Consider the class F(A, L) of all L-Lipschitz probability densities on the
compact interval [0, A].

10
A convex body K is a compact convex set with non-empty interior. We say K is symmetric if K = −K.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-536


i i

536

Theorem 27.13 Assume that L, A > 0 and p ∈ [1, ∞] are constants. Then
 
1
log N(F(A, L), k · kp , ϵ) = Θ . (27.29)
ϵ

Furthermore, for the sup-norm we have the sharp asymptotics:


LA
log2 N(F(A, L), k · k∞ , ϵ) = (1 + o(1)), ϵ → 0. (27.30)
ϵ

Proof. By replacing f(x) by √1 f( √x ), we have


L L
√ 1− p
N(F(A, L), k · kp , ϵ) = N(F( LA, 1), k · kp , L 2p ϵ). (27.31)

Thus, it is sufficient to consider F(A, 1) ≜ F(A), the collection of 1-Lipschitz densities on [0, A].
Next, observe that any such density function f is bounded from above. Indeed, since f(x) ≥ (f(0) −
RA
x)+ and 0 f = 1, we conclude that f(0) ≤ max{A, A2 + A1 } ≜ m.
To show (27.29), it suffices to prove the upper bound for p = ∞ and the lower bound for p = 1.
Specifically, we aim to show, by explicit construction,
C Aϵ
N(F(A), k · k∞ , ϵ) ≤ 2 (27.32)
ϵ
c
M(F(A), k · k1 , ϵ) ≥ 2 ϵ (27.33)

which imply the desired (27.29) in view of Theorem 27.2. Here and below, c, C are constants
depending on A. We start with the easier (27.33). We construct a packing by perturbing the uniform
density. Define a function T by T(x) = x1 {x ≤ ϵ} + (2ϵ − x)1 {x ≥ ϵ} + A1 on [0, 2ϵ] and zero
 
elsewhere. Let n = 4Aϵ and a = 2nϵ. For each y ∈ {0, 1}n , define a density fy on [0, A] such that

X
n
f y ( x) = yi T(x − 2(i − 1)ϵ), x ∈ [0, a],
i=1
RA
and we linearly extend fy to [a, A] so that 0 fy = 1; see Figure 27.2. For sufficiently small ϵ, the
Ra
resulting fy is 1-Lipschitz since 0 fy = 12 + O(ϵ) so that the slope of the linear extension is O(ϵ).

1/A

x
0 ϵ 2ϵ 2nϵ A

Figure 27.2 Packing that achieves (27.33). The solid line represent one such density fy (x) with
y = (1, 0, 1, 1). The dotted line is the density of Unif(0, A).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-537


i i

27.4 Infinite-dimensional space: smooth functions 537

Thus we conclude that each fy is a valid member of F(A). Furthermore, for y, z ∈ {0, 1}n , we
have kfy −fz k1 = dH (y, z)kTk1 = ϵ2 dH (y, z). Invoking the Gilbert-Varshamov bound Theorem 27.5,
we obtain an n2 -packing Y of the Hamming space {0, 1}n with |Y| ≥ 2cn for some absolute constant
2
c. Thus {fy : y ∈ Y} constitutes an n2ϵ -packing of F(A) with respect to the L1 -norm. This is the
2
desired (27.33) since n2ϵ = Θ(ϵ).
   
To construct a covering, set J = mϵ , n = Aϵ , and xk = kϵ for k = 0, . . . , n. Let G be the
collection of all lattice paths (with grid size ϵ) of n steps starting from the coordinate (0, jϵ) for
some j ∈ {0, . . . , J}. In other words, each element g of G is a continuous piecewise linear function
on each subinterval Ik = [xk , xk+1 ) with slope being either +1 or −1. Evidently, the number of
such paths is at most (J + 1)2n = O( 1ϵ 2A/ϵ ). To show that G is an ϵ-covering, for each f ∈ F (A),
we show that there exists g ∈ G such that |f(x) − g(x)| ≤ ϵ for all x ∈ [0, A]. This can be shown
by a simple induction. Suppose that there exists g such that |f(x) − g(x)| ≤ ϵ for all x ∈ [0, xk ],
which clearly holds for the base case of k = 0. We show that g can be extended to Ik so that this
holds for k + 1. Since |f(xk ) − g(xk )| ≤ ϵ and f is 1-Lipschitz, either f(xk+1 ) ∈ [g(xk ), g(xk ) + 2ϵ]
or [g(xk ) − 2ϵ, g(xk )], in which case we extend g upward or downward, respectively. The resulting
g satisfies |f(x) − g(x)| ≤ ϵ on Ik , completing the induction.

b′ + ϵ1/3

b′

x
0 a′ A

Figure 27.3 Improved packing for (27.34). Here the solid and dashed lines are two lattice paths on a grid of
size ϵ starting from (0, b′ ) and staying in the range of [b′ , b′ + ϵ1/3 ], followed by their respective linear
extensions.

Finally, we prove the sharp bound (27.30) for p = ∞. The upper bound readily follows from
(27.32) plus the scaling relation (27.31). For the lower bound, we apply Theorem 27.2 converting
the problem to the construction of 2ϵ-packing. Following the same idea of lattice paths, next we
give an improved packing construction such that
a
M(F(A), k · k∞ , 2ϵ) ≥ Ω(ϵ3/2 2 ϵ ). (27.34)
a b
for any a < A. Choose any b such that A1 < b < A1 + (A−
2
a) ′ ′
2A . Let a = ϵ ϵ and b = ϵ ϵ . Consider
a density f on [0, A] of the following form (cf. Figure 27.3): on [0, a ], f is a lattice path from (0, b′ )

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-538


i i

538

to (a′ , b′ ) that stays in the vertical range of [b′ , b′ +ϵ1/3 ]; on [a′ , A], f is a linear extension chosen so
RA
that 0 f = 1. This is possible because by the 1-Lipschitz constraint we can linearly extend f so that
RA ′ 2 ′ 2 R a′
a′
f takes any value in the interval [b′ (A−a′ )− (A−2a ) , b′ (A−a′ )+ (A−2a ) ]. Since 0 f = ab+o(1),
RA R a′
we need a′ f = 1 − 0 f = 1 − ab + o(1), which is feasible due to the choice of b. The collection
G of all such functions constitute a 2ϵ-packing in the sup norm (for two distinct paths consider the
first subinterval where they differ). Finally, we bound the cardinality of this packing by counting
the number of such paths. This can be accomplished by standard estimates on random walks (see
e.g. [164, Chap. III]). For any constant c > 0, the probability that a symmetric random walk on
Z returns to zero in n (even) steps and stays in the range of [0, n1+c ] is Θ(n−3/2 ); this implies the
desired (27.34). Finally, since a < A is arbitrary, the lower bound part of (27.30) follows in view
of Theorem 27.2.

The following result, due to Birman and Solomjak [57] (cf. [285, Sec. 15.6] for an exposition),
is an extension of Theorem 27.13 to the more general Hölder class.

Theorem 27.14 Fix positive constants A, L and d ∈ N. Let β > 0 and write β = ℓ + α,
where ℓ ∈ Z+ and α ∈ (0, 1]. Let Fβ (A, L) denote the collection of ℓ-times continuously
differentiable densities f on [0, A]d whose ℓth derivative is (L, α)-Hölder continuous, namely,
kD(ℓ) f(x) − D(ℓ) f(y)k∞ ≤ Lkx − ykα ∞ for all x, y ∈ [0, A] . Then for any 1 ≤ p ≤ ∞,
d

 d
log N(Fβ (A, L), k · kp , ϵ) = Θ ϵ− β . (27.35)

The main message of the preceding theorem is that is the entropy of the function class grows
more slowly if the dimension decreases or the smoothness increases. As such, the metric entropy
for very smooth functions can grow subpolynomially in 1ϵ . For example, Vitushkin (cf. [250,
Eq. (129)]) showed that for the class of analytic functions on the unit complex disk D having
analytic extension to a bigger disk rD for r > 1, the metric entropy (with respect to the sup-norm
on D) is Θ((log 1ϵ )2 ); see [250, Sec. 7 and 8] for more such results.
As mentioned at the beginning of this chapter, the conception and development of the subject
on metric entropy, in particular, Theorem 27.14, are motivated by and plays an important role
in the study of Hilbert’s 13th problem. In 1900, Hilbert conjectured that there exist functions of
several variables which cannot be represented as a superposition (composition) of finitely many
functions of fewer variables. This was disproved by Kolmogorov and Arnold in 1950s who showed
that every continuous function of d variables can be represented by sums and superpositions of
single-variable functions; however, their construction does not work if one requires the constituent
functions to have specific smoothness. Subsequently, Hilbert’s conjecture for smooth functions
was positively resolved by Vitushkin [439], who showed that there exist functions of d variables
in the β -Hölder class (in the sense of Theorem 27.14) that cannot be expressed as finitely many
superpositions of functions of d′ variables in the β ′ -Hölder class, provided d/β > d′ /β ′ . The
original proof of Vitushkin is highly involved. Later, Kolmogorov gave a much simplified proof
by proving and applying the k · k∞ -version of Theorem 27.14. As evident in (27.35), the index
d/β provides a complexity measure for the function class; this allows an proof of impossibility

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-539


i i

27.5 Metric entropy and small-ball probability 539

of superposition by an entropy comparison argument. For concreteness, let us prove the follow-
ing simpler version: There exists a 1-Lipschitz function f(x, y, z) of three variables on [0, 1]3 that
cannot be written as g(h1 (x, y), h2 (y, z)) where g, h1 , h2 are 1-Lipschitz functions of two variables
on [0, 1]2 . Suppose, for the sake of contradiction, that this is possible. Fixing an ϵ-covering of
cardinality exp(O( ϵ12 )) for 1-Lipschitz functions on [0, 1]2 and using it to approximate the func-
tions g, h1 , h2 , we obtain by superposition g(h1 , h2 ) an O(ϵ)-covering of cardinality exp(O( ϵ12 )) of
1-Lipschitz functions on [0, 1]3 ; however, this is a contradiction as any such covering must be of
size exp(Ω( ϵ13 )). For stronger and more general results along this line, see [250, Appendix I].

27.5 Metric entropy and small-ball probability


The small ball problem in probability theory concerns the behavior of the function
1
ϕ(ϵ) ≜ log
P [kXk ≤ ϵ]
as ϵ → 0, where X is a random variable taking values on some real separable Banach space
(V, k · k). For example, for standard normal X ∼ N (0, Id ) and the ℓ2 -ball, a simple large-deviations
calculation (Exercise III.16) shows that
1
ϕ(ϵ)  d log .
ϵ
Of more interest is the infinite-dimensional case of Gaussian processes. For example, for the
standard Brownian motion on the unit interval and the sup norm, it is elementary to show
(Exercise V.30) that
1
ϕ(ϵ)  . (27.36)
ϵ2
We refer the reader to the excellent survey [279] for this field.
There is a deep connection between the small-ball probability and metric entropy, which allows
one to translate results from one area to the other in fruitful ways. To identify this link, the start-
ing point is the volume argument in Theorem 27.3. On the one hand, it is well-known that there
exists no analog of Lebesgue measure (translation-invariant) in infinite-dimensional spaces. As
such, for functional spaces, one frequently uses a Gaussian measure. On the other hand, the “vol-
ume” argument in Theorem 27.3 and Remark 27.2 can adapted to a measure γ that need not be
translation-invariant, leading to
γ (Θ + B (0, ϵ)) γ (Θ + B (0, ϵ/2))
≤ N(Θ, k · k, ϵ) ≤ M(Θ, k · k, ϵ) ≤ , (27.37)
maxθ∈V γ (B (θ, 2ϵ)) minθ∈Θ γ (B (θ, ϵ/2))
where we recall that B(θ, ϵ) denotes the norm ball centered at θ of radius ϵ. From here we have
already seen the natural appearance of small-ball probabilities. Using properties native to the
Gaussian measure, this can be further analyzed and reduced to balls centered at zero.
To be precise, let γ be a zero-mean Gaussian measure on V such that EX∼γ [kXk2 ] < ∞. Let
H ⊂ V be the reproducing kernel Hilbert space (RKHS) generated by γ and K the unit ball in

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-540


i i

540

H. We refer the reader to, e.g., [262, Sec. 2] and [314, III.3.2], for the precise definition of this
object.11 For the purpose of this section, it is enough to consider the following examples (for more
see [262]):

• Finite dimensions. Let γ = N (0, Σ). Then

K = {Σ1/2 x : kxk2 ≤ 1} (27.38)

is a rescaled Euclidean ball, with inner product hx, yiH = x⊤ Σ−1 y.


• Brownian motion: Let γ be the law of the standard Brownian motion on the unit interval [0, 1].
Then
 Z t 
′ ′
K = f( t) = f (s)ds : kf k2 ≤ 1 (27.39)
0
R1
with inner product hf, giH = hf′ , g′ i ≡ 0
f′ (t)g′ (t)dt.

The following fundamental result due to Kuelbs and Li [263] (see also the earlier work of
Goodman [194]) describes a precise connection between the small-ball probability function ϕ(ϵ)
and the metric entropy of the unit Hilbert ball N(K, k · k, ϵ) ≡ N(ϵ).

Theorem 27.15 For all ϵ > 0,


!
ϵ
ϕ(2ϵ) − log 2 ≤ log N p ≤ 2ϕ(ϵ/2) (27.40)
2ϕ(ϵ/2)

Proof. We show that for any λ > 0,

λ2
ϕ(2ϵ) + log Φ(λ + Φ−1 (e−ϕ(ϵ) )) ≤ log N(λK, ϵ) ≤ log M(λK, ϵ) ≤ + ϕ(ϵ/2) (27.41)
2
p
To deduce (27.40), choose λ = 2ϕ(ϵ/2) and note that by scaling N(λK, ϵ) = N(K, ϵ/λ).
) = Φc (t) ≤ e−t /2 (Exercise V.25) yields Φ−1 (e−ϕ(ϵ) ) ≥
2
Applying
p the normal tail bound Φ(− t
− 2ϕ(ϵ) ≥ −λ so that Φ(Φ−1 (e−ϕ(ϵ) ) + λ) ≥ Φ(0) = 1/2.
We only give the proof in finite dimensions as the results are dimension-free and extend natu-
rally to infinite-dimensional spaces. Let Z ∼ γ = N (0, Σ) on Rd so that K = Σ1/2 B2 is given in
(27.38). Applying (27.37) to λK and noting that γ is a probability measure, we have
γ (λK + B (0, ϵ)) 1
≤ N(λK, ϵ) ≤ M(λK, ϵ) ≤ . (27.42)
maxθ∈Rd γ (B (θ, 2ϵ)) minθ∈λK γ (B (θ, ϵ/2))
Next we further bound (27.42) using properties native to the Gaussian measure.

11
In particular, if γ is the law of a Gaussian process X on C([0, 1]) with E[kXk22 ] < ∞, the kernel K(s, t) = E[X(s)X(t)]

admits the eigendecomposition K(s, t) = λk ψk (s)ψk (t) (Mercer’s theorem), where {ϕk } is an orthonormal basis for

L2 ([0, 1]) and λk > 0. Then H is the closure of the span of {ϕk } with the inner product hx, yiH = k hx, ψk ihy, ψk i/λk .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-541


i i

27.5 Metric entropy and small-ball probability 541

• For the upper bound, for any symmetric set A = −A and any θ ∈ λK, by a change of measure

γ(θ + A) = P [Z − θ ∈ A]
1 ⊤ −1
h −1 i
= e− 2 θ Σ θ E e⟨Σ θ,Z⟩ 1 {Z ∈ A}

≥ e−λ
2
/2
P [Z ∈ A] ,
h −1 i
where the last step follows from θ⊤ Σ−1 θ ≤ λ2 and by Jensen’s inequality E e⟨Σ θ,Z⟩ |Z ∈ A ≥
−1
e⟨Σ θ,E[Z|Z∈A]⟩ = 1, using crucially that E [Z|Z ∈ A] = 0 by symmetry. Applying the above to
A = B(0, ϵ/2) yields the right inequality in (27.41).
• For the lower bound, recall Anderson’s lemma (Lemma 28.10) stating that the Gaussian measure
of a ball is maximized when centered at zero, so γ(B(θ, 2ϵ)) ≤ γ(B(0, 2ϵ)) for all θ. To bound
the numerator, recall the Gaussian isoperimetric inequality (see e.g. [69, Theorem 10.15]):12

γ(A + λK) ≥ Φ(Φ−1 (γ(A)) + λ). (27.43)

Applying this with A = B(0, ϵ) proves the left inequality in (27.41) and the theorem.

The implication of Theorem 27.15 is the following. Provided that ϕ(ϵ)  ϕ(ϵ/2), then we
should expect that approximately
!
ϵ
log N p  ϕ(ϵ)
ϕ(ϵ)

With more effort this can be made precise unconditionally (see e.g. [279, Theorem 3.3], incorporat-
ing the later improvement by [278]), leading to very precise connections between metric entropy
and small-ball probability, for example: for fixed α > 0, β ∈ R,
 β   2+α

−α 1 − 2+α
2α 1
ϕ(ϵ)  ϵ log ⇐⇒ log N(ϵ)  ϵ log (27.44)
ϵ ϵ

As a concrete example, consider the unit ball (27.39) in the RKHS generated by the standard
Brownian motion, which is similar to a Sobolev ball.13 Using (27.36) and (27.44), we conclude
that log N(ϵ)  1ϵ , recovering the metric entropy of Sobolev ball determined in [420]. This result
also coincides with the metric entropy of Lipschitz ball in Theorem 27.14 which requires the
derivative to be bounded everywhere as opposed to on average in L2 . For more applications of
small-ball probability on metric entropy (and vice versa), see [263, 278].

12
The connection between (27.43) and isoperimetry is that if we interpret limλ→0 (γ(A + λK) − γ(A))/λ as the surface
measure of A, then among all sets with the same Gaussian measure, the half space has maximal surface measure.
13
The Sobolev norm is kfkW1,2 ≜ kfk2 + kf′ k2 . Nevertheless, it is simple to verify a priori that the metric entropy of
(27.39) and that of the Sobolev ball share the same behavior (see [263, p. 152]).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-542


i i

542

27.6 Metric entropy and rate-distortion theory


In this section we discuss a connection between metric entropy and rate-distortion function. Note
that the former is a non-probabilistic quantity whereas the latter is an information measure depend-
ing on the source distribution; nevertheless, if we consider the rate-distortion function induced by
the “least favorable” source distribution, it turns out to behave similarly to the metric entropy.
To make this precise, consider a metric space (X , d). For an X -valued random variable X, denote
by
ϕX (ϵ) = inf I(X; X̂) (27.45)
PX̂|X :E[d(X,X̂)]≤ϵ

its rate-distortion function (recall Section 24.3). Denote the worst-case rate-distortion function on
X by

ϕX (ϵ) = sup ϕX (ϵ). (27.46)


PX ∈P(X )

The next theorem relates ϕX to the covering and packing number of X . The lower bound simply
follows from a “Bayesian” argument, which bounds the worst case from below by the average case,
akin to the relationship between minimax and Bayes risk (see Section 28.3). The upper bound was
shown in [241] using the dual representation of rate-distortion functions; here we give a simpler
proof via Fano’s inequality.

Theorem 27.16 For any 0 < c < 1/2,


ϕX (cϵ) + log 2
ϕX (ϵ) ≤ log N(X , d, ϵ) ≤ log M(X , d, ϵ) ≤ . (27.47)
1 − 2c

Proof. Fix an ϵ-covering of X in d of size N. Let X̂ denote the closest element in the covering to
X. Then d(X, X̂) ≤ ϵ almost surely. Thus ϕX (ϵ) ≤ I(X; X̂) ≤ log N. Optimizing over PX proves the
left inequality.
For the right inequality, let X be uniformly distributed over a maximal ϵ-packing of X . For
any PX̂|X such that E[d(X, X̂)] ≤ cϵ. Let X̃ denote the closest point in the packing to X̂. Then we
have the Markov chain X → X̂ → X̃. By definition, d(X, X̃) ≤ d(X̂, X̃) + d(X̂, X) ≤ 2d(X̂, X) so
E[d(X, X̃)] ≤ 2cϵ. Since either X = X̃ or d(X, X̃) > ϵ, we have P[X 6= X̃] ≤ 2c. On the other
hand, Fano’s inequality (Corollary 3.13) yields P[X 6= X̃] ≥ 1 − I(X;log X̂)+log 2
M . In all, I(X; X̂) ≥
(1 − 2c) log M − log 2, proving the upper bound.
Remark 27.4 (a) Clearly, Theorem 27.16 can be extended to the case where the distortion
function equals a power of the metric, namely, replacing (27.45) with
ϕX,r (ϵ) ≜ inf I(X; X̂).
PX̂|X :E[d(X,X̂)r ]≤ϵr

Then (27.47) continues to hold with 1 − 2c replaced by 1 − (2c)r . This will be useful, for
example, in the forthcoming applications where second moment constraint is easier to work
with.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-543


i i

27.6 Metric entropy and rate-distortion theory 543

(b) In the earlier literature a variant of the rate-distortion function is also considered, known as
the ϵ-entropy of X, where the constraint is d(X, X̂) ≤ ϵ with probability one as opposed to
in expectation (cf. e.g. [250, Appendix II] and [349]). With this definition, it is natural to
conjecture that the maximal ϵ-entropy over all distributions on X coincides with the metric
entropy log N(X , ϵ); nevertheless, this need not be true (see [300, Remark, p. 1708] for a
counterexample).
Theorem 27.16 points out an information-theoretic route to bound the metric entropy by the
worst-case rate-distortion function (27.46).14 Solving this maximization, however, is not easy as
PX 7→ ϕX (D) is in general neither convex nor concave [6].15 Fortunately, for certain spaces, one
can show via a symmetry argument that the “uniform” distribution maximizes the rate-distortion
function at every distortion level; see Exercise V.24 for a formal statement. As a consequence, we
have:

• For Hamming space X = {0, 1}d and Hamming distortion, ϕX (D) is attained by Ber( 12 )d . (We
already knew this from Theorem 26.1 and Theorem 24.8.)
• For the unit sphere X = Sd−1 and distortion function defined by the Euclidean distance, ϕX (D)
is attained by Unif(Sd−1 ).
• For the orthogonal group X = O(d) or unitary group U(d) and distortion function defined by
the Frobenius norm, ϕX (D) is attained by the Haar measure. Similar statements also hold for
the Grassmann manifold (collection of linear subspaces).

Next we give a concrete example by computing the rate-distortion function of θ ∼ Unif(Sd−1 ):

Theorem 27.17 Let θ be uniformly distributed over the unit sphere Sd−1 . Then for all 0 <
ϵ < 1,
 
1 1
(d − 1) log − C ≤ inf I(θ; θ̂) ≤ (d − 1) log 1 + + log(2d)
ϵ Pθ̂|θ :E[∥θ̂−θ∥22 ]≤ϵ2 ϵ

for some universal constant C.

Note that the random vector θ have dependent entries so we cannot invoke the single-
d
letterization technique in Theorem 24.8. Nevertheless, we have the representation θ=Z/kZk2 for
Z ∼ N (0, Id ), which allows us to relate the rate-distortion function of θ to that of the Gaussian
found in Theorem 26.2. The resulting lower bound agree with the metric entropy for spheres in
Corollary 27.4, which scales as (d − 1) log 1ϵ . Using similar reduction arguments (see [275, The-
orem VIII.18]), one can obtain tight lower bound for the metric entropy of the orthogonal group
O(d) and the unitary group U(d), which scales as d(d2−1) log 1ϵ and d2 log 1ϵ , with pre-log factors

14
A striking parallelism between the metric entropy of Sobolev balls and the rate-distortion function of smooth Gaussian
processes has been observed by Donoho in [133]. However, we cannot apply Theorem 27.16 to formally relate one to the
other since it is unclear whether the Gaussian rate-distortion function is maximal.
15
As a counterexample, consider Theorem 26.1 for the binary source.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-544


i i

544

commensurate with their respective degrees of freedoms. As mentioned in Remark 27.3(b), these
results were obtained by Szarek in [406] using a volume argument with Haar measures; in compar-
ison, the information-theoretic approach is more elementary as we can again reduce to Gaussian
rate-distortion computation.
Proof. The upper bound follows from Theorem 27.16 and Remark 27.4(a), applying the metric
entropy bound for spheres in Corollary 27.4.
To prove the lower bound, let Z ∼ N (0, Id ). Define θ = ∥ZZ∥ and A = kZk, where k · k ≡ k · k2
henceforth. Then θ ∼ Unif(Sd−1 ) and A ∼ χd are independent. Fix Pθ̂|θ such that E[kθ̂ − θk2 ] ≤
ϵ2 . Since Var(A) ≤ 1, the Shannon lower bound (Theorem 26.3) shows that the rate-distortion
function of A is majorized by that of the standard Gaussian. So for each δ ∈ (0, 1), there exists
PÂ|A such that E[(Â − A)2 ] ≤ δ 2 , I(A, Â) ≤ log δ1 , and E[A] = E[Â]. Set Ẑ = Âθ̂. Then

I(Z; Ẑ) = I(θ, A; Ẑ) ≤ I(θ, A; θ̂, Â) = I(θ; θ̂) + I(A, Â).
Furthermore, E[Â2 ] = E[(Â − A)2 ] + E[A2 ] + 2E[(Â − A)(A − E[A])] ≤ d + δ 2 + 2δ ≤ d + 3δ .
Similarly, |E[Â(Â − A)]| ≤ 2δ and E[kZ − Ẑk2 ] ≤ dϵ2 + 7δϵ + δ . Choosing δ = ϵ, we have
E[kZ − Ẑk2 ] ≤ (d + 8)ϵ2 . Combining Theorem 24.8 with the Gaussian rate-distortion function in
Theorem 26.2, we have I(Z; Ẑ) ≥ d2 log (d+d8)ϵ2 , so applying log(1 + x) ≤ x yields
1
I(θ; θ̂) ≥ (d − 1) log − 4 log e.
ϵ2

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-545


i i

Exercises for Part V

V.1 Let S = Ŝ = {0, 1}. Consider the source X10 consisting of fair coin flips. Construct a simple
1
(suboptimal) compressor achieving average Hamming distortion 20 with 512 codewords.
V.2 Assume a separable distortion loss. Show that the minimal number of codewords M∗ (n, D)
required to represent memoryless source Xn with average distortion D (recall (24.9)) satisfies

log M∗ (n1 + n2 , D) ≤ log M∗ (n1 , D) + log M∗ (n2 , D) .

Conclude that
1 1
lim log M∗ (n, D) = inf log M∗ (n, D) . (V.1)
n→∞ n n n

That is, one can always achieve a better compression rate by using a longer blocklength. Neither
claim holds for log M∗ (n, ϵ) in channel coding as defined in (19.4). Explain why this different
behavior arises.
V.3 (Non-asymptotic rate-distortion) Our goal is to show that the convergence to R(D) happens
much faster than that to capacity in channel coding. Consider binary uniform X ∼ Ber(1/2)
with Hamming distortion.
(a) Show that there exists a lossy code Xn → W → X̂n with M codewords and

P[d(Xn , X̂n ) > D] ≤ (1 − p(nD))M ,

where
s  
X n
p(s) = 2−n .
j
j=0

(b) Show that there exists a lossy code with M codewords and

1X
n−1
M
E[d(Xn , X̂n )] ≤ (1 − p(s)) . (V.2)
n
s=0

(c) Show that there exists a lossy code with M codewords and

1 X −Mp(s)
n−1
E[d(Xn , X̂n )] ≤ e . (V.3)
n
s=0

(Note: For M ≈ 2nR , numerical evaluation of (V.2) for large n is challenging. At the same
time (V.3) is only slightly looser.)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-546


i i

546 Exercises for Part V

(d) For n = 10, 50, 100 and 200 compute the upper bound on log M∗ (n, 0.11) via (V.3).
Compare with the lower bound

log M∗ (n, D) ≥ nR(D) . (V.4)

V.4 Continuing Exercise V.3 use Stirling formula and (V.3)-(V.4) to show

log M∗ (n, D) = nR(D) + O(log n) .

Note: Thus, optimal compression rate converges to its asymptotic fundamental limit R(D) at

a fast rate of O(log n/n) as opposed to O(1/ n) for channel coding (cf. Theorem 22.2). This
result holds for most memoryless sources.
i.i.d.
V.5 Let Sj ∼ PS be an iid source on a finite alphabet A and PS (a) > 0 for all a ∈ A. Suppose
the distortion metric satisfies d(x, y) = D0 =⇒ x = y. Show that R(D0 ) = log |A|, while
R(D0 +) = H(X).
V.6 Consider a memoryless source X uniform on X = X̂ = [m] with Hamming distortion: d(x, x̂) =
1{x 6= x̂}. Show that
(
log m − h(D) − D log(m − 1) D ≤ mm−1
R(D) =
0 otherwise.

(Hint: apply Fano’s inequality Theorem 3.12.)


V.7 Let X and Y be random variables taking values in {1, 2, 3} and such that P[X = Y] = 12 . If Y is
uniform what are the best upper and lower bounds on I(X; Y) you can find?
V.8 (Erasure distortion metric) Consider Bernoulli(1/2) source S ∈ {0, 1}, reproduction alphabet
 = {0, ?, 1} and the distortion metric


0, a = â,

d(a, â) = 1, â =?,


∞, a 6= â, â 6=? .

Is Dmax finite? Find rate-distortion function R(D).


V.9 Let X ∼ Unif[±1] and reconstruction X̂ ∈ R with d(x, x̂) = (x − x̂)2 . Show that rate-distortion
function is given in parametric form by

R = log 2 − h(p), D = 4p(1 − p), p ∈ [ 0, 1/ 2]

and that for any distortion level optimal vector quantizer is only taking values ±(1 − 2p) (Hint:
you may find Exercise I.64(b) useful). Compare with the case of X̂ ∈ {±1}, for which we have
shown R(D) = log 2 − h(D/4), D ∈ [0, 2].
V.10 (Product source) Consider two independent stationary memoryless sources X ∈ X and Y ∈ Y
with reproduction alphabets X̂ and Ŷ , distortion measures d1 : X × X̂ → R+ and d2 : Y ×
Ŷ → R+ , and rate-distortion functions RX and RY , respectively. Now consider the stationary
memoryless product source Z = (X, Y) with reproduction alphabet X̂ ×Ŷ and distortion measure
d(z, ẑ) = d1 (x, x̂) + d2 (y, ŷ).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-547


i i

Exercises for Part V 547

(a) Show that

I(X, Y; X̂, Ŷ) ≥ I(X; X̂) + I(Y; Ŷ)

provided that X and Y are independent.


(b) Show that the rate-distortion function of Z is related to that of X and Y via the inf-convolution,
i.e.

R(D) = inf RX (D1 ) + RY (D − D1 ).


0≤D1 ≤D

(c) How do you build an optimal lossy compressor for Z using optimal lossy compressors for
X and Y?
V.11 (Compression with output constraints) Compute the rate-distortion function R(D; a, p) of a
Ber(p) source, Hamming distortion under an extra constraint that reconctruction points X̂n
should have average Hamming weight E[wH (X̂n )] ≤ an, where 0 < a, p ≤ 21 . (Hint: Show a
more general result that given two distortion metrics d1 , d2 we have R(D1 , D2 ) = min{I(S; Ŝ) :
E[di (S, Ŝ)] ≤ Di , i ∈ {1, 2}}.)
V.12 Commercial (mono) FM radio modulates a bandlimited (15kHz) audio signal into a radio-
frequency signal of bandwidth 200kHz. This system roughly achieves

SNRaudio ≈ 40 dB + SNRchannel

over the AWGN channel whenever SNRchannel ≳ 12 dB. Thus for the 12 dB channel, we get
that FM radio has distortion of 52 dB. Show that information-theoretic limit is about 160 dB.
Hint: assume that input signal is low-pass filtered to 15kHz white, zero-mean Gaussian and use
the optimal joint source channel code (JSSC) for the given bandwidth expansion ratio and fixed
SNRchannel . Also recall that the SNR of the reconstruction Ŝn expressed in dB is defined as
Pk
j=1 E[Sj ]
2
SNRaudio ≜ 10 log10 Pk .
j=1 E[(Sj − Ŝj ) ]
2

V.13 Consider a memoryless Gaussian source X ∼ N (0, 1), reconstruction alphabet  = {±1} and
quadratic distortion d(a, â) = (a − â)2 . Compute D0 , R(D0 +), Dmax . Then obtain a parametric
formula for R(D).
V.14 (Erokhin’s rate-distortion [155]) Let d(ak , bk ) = 1{ak 6= bk } be a (non-separable) distortion
metric for k-strings on a finite alphabet S = Ŝ . Prove that for any source Sk we have

φSk (ϵ) ≜ min I(Sk ; Ŝk ) ≥ H(Sk ) − ϵk log |S| − h(ϵ) , (V.5)
P[Sk ̸=Ŝk ]≤ϵ

and that the bound is tight only for Sk uniform on S k . Next, suppose that Sk is i.i.d. source. Prove
r
kV(S) − (Q−12(ϵ))2
ϕSk (ϵ) = (1 − ϵ)kH(S) − e + O(log k) ,

where V(S) is the varentropy (10.4). (Hint: Let T = P̂Sk be the empirical distribution (type) of the
realization Sk . Then I(Sk ; Ŝk ) = I(Sk , T; Ŝk ) = I(Sk ; Ŝk |T) + O(log k). Denote ϵT ≜ P[Sk 6= Ŝk |T]

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-548


i i

548 Exercises for Part V

and given ϵT optimize the first term via (V.5). Then optimize the assignment ϵT over all E[ϵT ] ≤ ϵ.
Also use E[Z1{Z > c}] = √12π e−c /2 for Z ∼ N (0, 1). See [254, Lemma 1] for full details).
2

i.i.d.
V.15 Consider a source Sn ∼ Ber( 21 ). Answer the following questions when n is large.
(a) Suppose the goal is to compress Sn into k bits so that one can reconstruct Sn with at most
one bit of error. That is, the decoded version Ŝn satisfies E[dH (Ŝn , Sn )] ≤ 1. Show that this
can be done (if possible, with an explicit algorithm) with k = n − C log n bits for some
constant C. Is it optimal?
(b) Suppose we are required to compress Sn into only 1 bit. Show that one can achieve (if

possible, with an explicit algorithm) a reconstruction error E[dH (Ŝn , Sn )] ≤ n2 − C n for
some constant C. Is it optimal?
Warning: We cannot blindly apply the asymptotic rate-distortion theory to show achievability
since here the distortion level changes with n. The converse, however, directly applies.
V.16 Consider a standard Gaussian vector Sn and quadratic distortion metric. We will discuss zero-
rate quantization.

(a) Let Smax =√max1≤i≤n Si . Show that E[(Smax − 2 ln n)2 ] → 0 when n → ∞. Show that
E[(Smax − 2 ln n)2 ] → 0 when n → ∞.
(b) Suppose you are given a budget of log2 n bits. Consider the following scheme: Let i∗ denote
the index of the largest coordinate. The compressor √
stores the index i∗ which costs log2 n
bits and the decompressor outputs Ŝ where Ŝi = 2 ln n for i = i∗ and Si = 0 otherwise.
n

Show that distortion in terms of mean-square error satisfies E[kŜn − Sn k22 ] = n − 2 ln n + o(1)
when n → ∞.
(c) Show that for any compressor (using log2 n bits) we must have E[kŜn − Sn k22 ] ≥ n − 2 ln n +
o( 1) .
V.17 (Noisy source-coding; also remote source-coding [126]) Consider the problem of compressing
i.i.d. sequence Xn under separable distortion metric d. Now, however, compressor does not have
direct access to Xn but only to its noisy version Yn obtained over a stationary memoryless channel
i.i.d.
PY|X (i.e. (Xi , Yi ) ∼ PX,Y for a fixed PX,Y and encoder is a map f : Y n → [M]). Show that the
rate-distortion function is
n o
R(D) = min I(Y; X̂) : E[d(X, X̂)] ≤ D, X → Y → X̂ ,

where minimization is over all PX̂|Y . (Hint: define d̃(y, x̂) ≜ E[d(X, x̂)|Y = y].)
i.i.d.
V.18 (Noisy/remote source coding; special case) Let Zn ∼ Ber( 12 ) and Xn = BECδ (Zn ). Compressor
is to encode Xn at rate R so that we can reconstruct Zn with bit-error rate D. Let R(D) denote
the optimal rate.
(a) Suppose that locations of erasures in Xn are provided as a side information to decompressor.
Show that R(δ/2) = δ̄2 (Hint: compressor is very simple).
(b) Surprisingly, the same rate is achievable without knowledge of erasures. Use Exercise V.17
to prove R(D) = H(δ̄/2, δ̄/2, δ) − H(1 − D − δ2 , D − δ2 , δ) for D ∈ [ δ2 , 12 ].
V.19 (Log-loss) Consider the rate-distortion problem where the reconstruction alphabet X̂n = P(X n )
is the space of all probability mass functions on X n . We define two loss functions. The first one

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-549


i i

Exercises for Part V 549

is non-separable (!)
1 1
dnonsep (xn , P̂) = log (V.6)
n P̂(xn )
and the second one is separable:
1X
n
1
dsep (xn , P̂Xn ) = log .
n P̂Xj (xj )
j=1

Let {Xi , i ≥ 1} be a process with entropy rate H.


(a) Show that any (n, M, D)-code for dsep can be converted to a (n, M, D)-code for dnonsep . Hence
Rsep (D) ≥ Rnonsep (D) .
(b) Prove that the information rate-distortion function under dnonsep is given by
( I)
Rnonsep (D) = |H − D|+ .
(c) Prove that if quantizer U = f(Xn ) is such that distribution of U is uniform then it is
(I)
necessarily optimal for non-separable log-loss, i.e. log-cardinality of U = Rnonsep (D).
(d) Assume {Xi } is stationary and ergodic. Show that rate Ri,nonsep (D) is achievable, i.e.
(I)
Rnonsep (D) = Rnonsep (D) .
(Hint: Use AEP (11.2).)
(e) Assume {Xi } is i.i.d. Show that
(I)
Rsep (D) = Rsep (D) = Rnonsep (D) = |H(X) − D|+ .
(Second equality follows from general rate-distortion theorem.)
V.20 (Information bottleneck and log-loss) Given PX,Y Information bottleneck (IB) [421] proposes
that W is an optimal approximate sufficient statistic of Y for X if it solves
FI (t) ≜ max{I(X; W) : I(Y; W) ≤ t, X → Y → W} ,
where FI -curve has been defined previously in Definition 16.5. Here we explain that IB is
equivalent to noisy source-coding under log-loss distortion.
(a) (Log-loss, information) Consider the rate-distortion problem (Exercise V.19) with X̂n =
P(X n ) and the (non-separable) loss function dnonsep in (V.6). Show that solution to the noisy
source-coding (cf. Ex. V.17) satisfies for all n:
n o
inf E[dnonsep (Xn , P̂Xn )] : I(Yn ; P̂Xn ) ≤ nR = n (H(X) − FI (R)) .

(b) (Log-loss, operational) Define operational distortion-rate function as


D(R) ≜ lim sup inf {D : ∃(n, exp(nR), nD)-code} .
n→∞

Show that
D(R) = H(X) − FI (R) .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-550


i i

550 Exercises for Part V

Q
(Hint: for achievability, restrict reconstruction to P̂Xn = P̂Xi , this makes distortion
additive and then apply Ex. V.17; for converse use tensorization property of FI -curve
from Exercise III.32)
V.21 (a) Let 0 ≺ ∆  Σ be positive definite matrices. For S ∼ N (0, Σ), show that
1 det Σ
inf I(S; Ŝ) = log .
PŜ|S :E[(S−Ŝ)(S−Ŝ)⊤ ]⪯∆ 2 det ∆

(Hint: for achievability, consider S = Ŝ + Z with Ŝ ∼ N (0, Σ − ∆) ⊥ ⊥ Z ∼ N (0, ∆) and


apply Example 3.5; for converse, follow the proof of Theorem 26.2.)
(b) Prove the following extension of (26.3): Let σ12 , . . . , σd2 be the eigenvalues of Σ. Then

1 X + σi2
d
inf I(S; Ŝ) = log
PŜ|S :E[∥S−Ŝ∥22 ]≤D 2 λ
i=1
Pd
where λ > 0 is such that i=1 min{σi2 , λ} = D. This is the counterpart of the solution in
Theorem 20.14.
(Hint: First, using the orthogonal invariance of distortion metric we can assume that
Σ is diagonal. Next, apply the same single-letterization argument for (26.3) and solve
Pd σ2
minP Di =D 12 i=1 log+ Dii .)
V.22 (Shannon lower bound) Let k · k be an arbitrary norm on Rd and r > 0. Let X be a Rd -valued
random vector with a probability density function pX . Denote the rate-distortion function

ϕ X ( D) ≜ inf I(X; X̂)


PX̂|X :E[∥X̂−X∥r ]≤D

Prove the Shannon lower bound (26.5), namely


   
d d d
ϕX (D) ≥ h(X) +log − log Γ +1 V , (V.7)
r Dre r
R
where the differential entropy h(X) = Rd pX (x) log pX1(x) dx is assumed to be finite and V =
vol({x ∈ Rd : kxk ≤ 1}).
(a) Show that 0 < V < ∞.
(b) Show that for any s > 0,
Z  
d
+ 1 Vs− r .
d
Z(s) ≜ exp(−skwkr )dw = Γ
Rd r
R R R∞
(Hint: Apply Fubini’s theorem to Rd exp(−skwkr )dw = Rd ∥w∥r s exp(−sx)dxdw and use
R∞
Γ(x) = 0 tx−1 e−t dt.)
(c) Show that for any feasible PX|X̂ such that E[kX − X̂kr ] ≤ D,

I(X; X̂) ≥ h(X) − log Z(s) − sD.

(Hint: Define an auxiliary backward channel QX|X̂ (dx|x̂) = qs (x − x̂)dx, where qs (w) =
QX|X̂
1
Z(s) exp(−skwkr ). Then I(X; X̂) = EP [log PX ] + D(PX|X̂ kQX|X̂ kPX̂ ).)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-551


i i

Exercises for Part V 551

(d) Optimize over s > 0 to conclude (V.7).


(e) Verify that the lower bound of Theorem 26.3 is a special case of (V.7).
Note: Alternatively, the SLB can be written in the following form:

ϕX (D) ≥ h(X) − sup h(W)


PW :E[∥W∥r ]≤D

and this entropy maximization can be solved following the argument in Example 5.2.
V.23 (Uniform distribution minimizes convex symmetric functional.) Let G be a group acting on a
set X such that each g ∈ G sends x ∈ X to gx ∈ X . Suppose G acts transitively, i.e., for each
x, x′ ∈ X there exists g ∈ G such that gx = x′ . Let g be a random element of G with an invariant
d
distribution, namely hg=g for any h ∈ G. (Such a distribution, known as the Haar measure,
exists for compact topological groups.)
(a) Show that for any x ∈ X , gx has the same law, denoted by Unif(X ), the uniform distribution
on X .
(b) Let f : P(X ) → R be convex and G-invariant, i.e., f(PgX ) = f(PX ) for any X -valued random
variable X and any g ∈ G. Show that minPX ∈P(X ) f(PX ) = f(Unif(X )).
V.24 (Uniform distribution maximizes rate-distortion function.) Continuing the setup of Exer-
cise V.23, let d : X × X → R be a G-invariant distortion function, i.e., d(gx, gx′ ) =
d(x, x′ ) for any g ∈ G. Denote the rate-distortion function of an X -valued X by ϕX (D) =
infP :E[d(X,X̂)]≤D I(X; X̂). Suppose that ϕX (D) < ∞ for all X and all D > 0.
X̂|X

(a) Let ϕ∗X (λ) = supD {λD − ϕX (D)} denote the conjugate of ϕX . Applying Theorem 24.4 and
Fenchel-Moreau’s biconjugation theorem to conclude that ϕX (D) = supλ {λD − ϕ∗X (λ)}.
(b) Show that

ϕ∗X (λ) = sup{λE[d(X, X̂)] − I(X; X̂)}.


PX̂|X

As such, for each λ, PX 7→ ϕ∗X (λ) is convex and G-invariant. (Hint: Theorem 5.3.)
(c) Applying Exercise V.23 to conclude that ϕ∗U (λ) ≤ ϕ∗X (λ) for U ∼ Unif(X ) and that

ϕX (D) ≤ ϕU (D), ∀ D > 0.

V.25 (Normal tail bound.) Denote the standard normal density and tail probability by φ(x) =
R∞
√1 e−x /2 and Φc (t) =
2

2π t
φ(x)dx. Show that for all t > 0,
 
t φ(t) −t2 /2
φ(t) ≤ Φ (t) ≤ min
c
,e . (V.8)
1 + t2 t

(Hint: For Φc (t) ≤ e−t /2 apply the Chernoff bound (15.2); for the rest, note that by integration
2

R∞
by parts Φc (t) = φ(t t) − t φ(x2x) dx.)
V.26 (Covering radius in Hamming space) In this exercise we prove (27.9), namely, for any fixed
0 ≤ D ≤ 1, as n → ∞,

N(Fn2 , dH , Dn) = 2n(1−h(D))+ +o(n) ,

where h(·) is the binary entropy function.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-552


i i

552 Exercises for Part V

(a) Prove the lower bound by invoking the volume bound in Theorem 27.3 and the large-
deviations estimate in Example 15.1.
(b) Prove the upper bound using probabilistic construction and a similar argument to (25.8).
(c) Show that for D ≥ 1/2, N(F_2^n, d_H, Dn) ≤ 2 – cf. Ex. V.15a.
V.27 (Covering ℓp -ball with ℓq -balls)
(a) For 1 ≤ p < q ≤ ∞, prove the bound (27.18) on the metric entropy of the unit ℓp -ball with
respect to the ℓq -norm (Hint: for small ϵ, apply the volume calculation in (27.15)–(27.16)
and the formula in (27.13); for large ϵ, proceed as in the proof of Theorem 27.7 by applying
the quantization argument and the Gilbert-Varshamov bound of Hamming spheres.)
(b) What happens when p > q?
V.28 In this exercise we prove Dudley's chaining inequality (27.22). In view of Theorem 27.2, it is equivalent to show the following version with covering numbers:

w(Θ) ≲ ∫_0^∞ √(log N(ϵ)) dϵ,    (V.9)

where N(ϵ) ≡ N(Θ, ‖ · ‖_2, ϵ) is the ℓ_2-covering number of Θ and w(Θ) = E_{Z∼N(0,I_d)} sup_{θ∈Θ} ⟨θ, Z⟩ is its Gaussian width.
(a) Show that if Θ is not totally bounded, then both sides are infinite.
(b) Next, assuming rad(Θ) < ∞, let T_i be the optimal ϵ_i-covering of Θ, where ϵ_i = rad(Θ)2^{−i}, i ≥ 0, so T_0 = {θ_0}, the Chebyshev center of Θ. Show that

sup_{θ∈Θ} ⟨Z, θ⟩ ≤ ⟨Z, θ_0⟩ + Σ_{i≥1} M_i,

where M_i ≜ max{⟨Z, s_i − s_{i−1}⟩ : s_i ∈ T_i, s_{i−1} ∈ T_{i−1}}. (Hint: For any θ ∈ Θ, let θ_i denote its nearest neighbor in T_i. Then ⟨Z, θ⟩ = ⟨Z, θ_0⟩ + Σ_{i≥1} ⟨Z, θ_i − θ_{i−1}⟩.)
(c) Show that E[M_i] ≲ ϵ_i √(log N(ϵ_i)). (Hint: ‖θ_i − θ_{i−1}‖ ≤ ‖θ_i − θ‖ + ‖θ_{i−1} − θ‖ ≤ 3ϵ_i. Then apply Lemma 27.10 and the bounded convergence theorem.)
(d) Conclude that

w(Θ) ≲ Σ_{i≥0} ϵ_i √(log N(ϵ_i))

and compare with the integral version (V.9).


V.29 (Random matrix) Let A be an m × n matrix of iid N(0, 1) entries. Denote its operator norm by ‖A‖_op = max_{v∈S^{n−1}} ‖Av‖, which is also the largest singular value of A.
(a) Show that

‖A‖_op = max_{u∈S^{m−1}, v∈S^{n−1}} ⟨A, uv′⟩.    (V.10)

(b) Let U = {u_1, . . . , u_M} and V = {v_1, . . . , v_M} be ϵ-nets for the spheres S^{m−1} and S^{n−1}, respectively. Show that

‖A‖_op ≤ (1/(1 − ϵ)²) max_{u∈U, v∈V} ⟨A, uv′⟩.


(c) Apply Corollary 27.4 and Lemma 27.10 to conclude that

E[‖A‖] ≲ √n + √m.    (V.11)

(d) By choosing u and v in (V.10) smartly, show a matching lower bound and conclude that

E[‖A‖] ≍ √n + √m.    (V.12)

(e) Use Sudakov minorization (Theorem 27.8) to prove a matching lower bound. (Hint: use (27.6).)
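The two-sided estimate (V.12) can be illustrated by a quick Monte Carlo experiment; the sketch below (an illustration with arbitrary shapes and sample sizes, and only indicative of the constants) compares the average operator norm of a Gaussian matrix with √m + √n.

```python
# Monte Carlo illustration of E||A||_op being on the order of sqrt(m) + sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
for (m, n) in [(50, 200), (100, 400), (200, 800)]:
    # ord=2 on a 2-D array returns the largest singular value.
    norms = [np.linalg.norm(rng.standard_normal((m, n)), ord=2) for _ in range(20)]
    print(m, n, round(float(np.mean(norms)), 1), round(np.sqrt(m) + np.sqrt(n), 1))
```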
V.30 (Small-ball probability II.) In this exercise we prove (27.36). Let {W_t : t ≥ 0} be a standard Brownian motion. Show that for small ϵ,^16

ϕ(ϵ) = − log P[sup_{t∈[0,1]} |W_t| ≤ ϵ] ≍ 1/ϵ².

(a) By rescaling space and time, show that P[sup_{t∈[0,1]} |W_t| ≤ ϵ] = P[sup_{t∈[0,T]} |W_t| ≤ 1] ≜ p_T, where T = 1/ϵ². To show p_T = e^{−Θ(T)}, there is no loss of generality to assume that T is an integer.
(b) (Upper bound) Using the independent increment property, show that p_{T+1} ≤ a p_T, where a = P[|Z| ≤ 1] with Z ∼ N(0, 1). (Hint: g(z) ≜ P[|Z − z| ≤ 1] for z ∈ [−1, 1] is maximized at z = 0 and minimized at z = ±1.)
(c) (Lower bound) Again by scaling, it is equivalent to show P[sup_{t∈[0,T]} |W_t| ≤ C] ≥ C^{−T} for some constant C. Let q_T ≜ P[sup_{t∈[0,T]} |W_t| ≤ 2, max_{t=1,...,T} |W_t| ≤ 1]. Show that q_{T+1} ≥ b q_T, where b = P[|Z − 1| ≤ 1] · P[sup_{t∈[0,1]} |B_t| ≤ 1], and B_t = W_t − tW_1 is a Brownian bridge. (Hint: {W_t : t ∈ [0, T]}, W_{T+1} − W_T, and {W_{T+t} − (1 − t)W_T − tW_{T+1} : t ∈ [0, 1]} are mutually independent, with the latter distributed as a Brownian bridge.)

^16 Using the large deviations theory developed by Donsker-Varadhan, the sharp constant can be found to be lim_{ϵ→0} ϵ²ϕ(ϵ) = π²/8; see for example [279, Sec. 6.2].

Part VI

Statistical applications


This part gives an exposition on the application of information-theoretic principles and meth-
ods in mathematical statistics; we do so by discussing a selection of topics. To start, Chapter 28
introduces the basic decision-theoretic framework of statistical estimation and the Bayes risk
and the minimax risk as the fundamental limits. Chapter 29 gives an exposition of the classi-
cal large-sample asymptotics for smooth parametric models in fixed dimensions, highlighting the
role of Fisher information introduced in Chapter 2. Notably, we discuss how to deduce classi-
cal lower bounds (Hammersley-Chapman-Robbins, Cramér-Rao, van Trees) from the variational
characterization and the data processing inequality (DPI) of χ2 -divergence in Chapter 7.
Moving into high dimensions, Chapter 30 introduces the mutual information method for statistical lower bounds, based on the DPI for mutual information as well as the theory of capacity and rate-distortion functions from Parts IV and V. This principled approach includes three popular methods for proving minimax lower bounds (Le Cam, Assouad, and Fano) as special cases, which are discussed at length in Chapter 31, also drawing on results on metric entropy from Chapter 27.
Complementing the exposition on lower bounds in Chapters 30 and 31, in Chapter 32 we present
three upper bounds on statistical estimation based on metric entropy. These bounds appear strik-
ingly similar but follow from completely different methodologies. Application to nonparametric
density estimation is used as a primary example.
Chapter 33 introduces strong data processing inequalities (SDPIs), which are quantitative strengthenings of the DPIs in Part I. As applications we show how to apply SDPIs to deduce lower bounds for various estimation problems on graphs or in distributed settings.

28 Basics of statistical decision theory

In this chapter, we discuss the decision-theoretic framework of statistical estimation and introduce
several important examples. Section 28.1 presents the basic elements of statistical experiment and
statistical estimation. Section 28.3 introduces the Bayes risk (average-case) and the minimax risk
(worst-case) as the respective fundamental limits of statistical estimation in the Bayesian and frequentist settings, with the latter being our primary focus in this part. We discuss several versions of the minimax theorem (and prove a simple one) that equate the minimax risk with the worst-case
Bayes risk. Two variants are introduced next that extend a basic statistical experiment to either
large sample size or large dimension: Section 28.4 on independent observations and Section 28.5
on tensorization of experiments. Throughout this part the Gaussian location model (GLM), intro-
duced in Section 28.2, serves as a running example, with different focus at different places (such
as the role of loss functions, parameter spaces, low versus high dimensions, etc). In Section 28.6,
we discuss a key result known as Anderson's lemma for determining the exact minimax risk
of (unconstrained) GLM in any dimension for a broad class of loss functions, which provides a
benchmark for various more general techniques introduced in later chapters.

28.1 Basic setting


We start by presenting the basic elements of statistical decision theory. We refer to the classics
[166, 273, 404] for a systematic treatment.
A statistical experiment or statistical model refers to a collection P of probability distributions
(over a common measurable space (X , F)). Specifically, let us consider
P = {Pθ : θ ∈ Θ}, (28.1)
where each distribution is indexed by a parameter θ taking values in the parameter space Θ.
In the decision-theoretic framework, we play the following game: Nature picks some parameter
θ ∈ Θ and generates a random variable X ∼ Pθ . A statistician observes the data X and wants to
infer the parameter θ or its certain attributes. Specifically, consider some functional T : Θ → Y
and the goal is to estimate T(θ) on the basis of the observation X. Here the estimand T(θ) may be
the parameter θ itself, or some function thereof (e.g. T(θ) = 1{θ > 0} or ‖θ‖).
An estimator (decision rule) is a function T̂ : X → Ŷ . Note that, similar to the rate-distortion
theory in Part V, the action space Ŷ need not be the same as Y (e.g. T̂ may be a confidence
interval that aims to contain the scalar T). Here T̂ can be either deterministic, i.e. T̂ = T̂(X), or
randomized, i.e., T̂ obtained by passing X through a conditional probability distribution (Markov


transition kernel) PT̂|X , or a channel in the language of Part I. For all practical purposes, we can
write T̂ = T̂(X, U), where U denotes external randomness uniform on [0, 1] and independent of X.
To measure the quality of an estimator T̂, we introduce a loss function ℓ : Y × Ŷ → R such
that ℓ(T, T̂) is the risk of T̂ for estimating T. Since we are dealing with loss (as opposed to reward),
all the negative (converse) results are lower bounds and all the positive (achievable) results are
upper bounds. Note that X is a random variable, so are T̂ and ℓ(T, T̂). Therefore, to make sense of
“minimizing the loss”, we consider the expected risk:
R_θ(T̂) = E_θ[ℓ(T, T̂)] = ∫ P_θ(dx) P_{T̂|X}(dt̂|x) ℓ(T(θ), t̂),    (28.2)

which we refer to as the risk of T̂ at θ. The subscript in Eθ indicates the distribution with respect
to which the expectation is taken. Note that the expected risk depends on the estimator as well as
the ground truth.
Remark 28.1 We note that the problem of hypothesis testing and inference can be encom-
passed as special cases of the estimation paradigm. As previously discussed in Section 16.4, there
are three formulations for testing:

• Simple vs simple hypotheses

H_0 : θ = θ_0  vs  H_1 : θ = θ_1,  θ_0 ≠ θ_1

• Simple vs composite hypotheses

H_0 : θ = θ_0  vs  H_1 : θ ∈ Θ_1,  θ_0 ∉ Θ_1

• Composite vs composite hypotheses

H_0 : θ ∈ Θ_0  vs  H_1 : θ ∈ Θ_1,  Θ_0 ∩ Θ_1 = ∅.

For each case one can introduce the appropriate parameter space and loss function. For example,
in the last (most general) case, we may take
Θ = Θ_0 ∪ Θ_1,  T(θ) = 0 if θ ∈ Θ_0 and T(θ) = 1 if θ ∈ Θ_1,  T̂ ∈ {0, 1},

and use the zero-one loss ℓ(T, T̂) = 1{T ≠ T̂} so that the expected risk R_θ(T̂) = P_θ{θ ∉ Θ_T̂} is the probability of error.
For the problem of inference, the goal is to output a confidence interval (or region) which covers
the true parameter with high probability. In this case T̂ is a subset of Θ and we may choose the loss function ℓ(θ, T̂) = 1{θ ∉ T̂} + λ · length(T̂) for some λ > 0, in order to balance the coverage
and the size of the confidence interval.
Remark 28.2 (Randomized versus deterministic estimators) Although most of the
estimators used in practice are deterministic, there are a number of reasons to consider randomized
estimators:


• For certain formulations, such as minimizing the worst-case risk (minimax approach), deter-
ministic estimators are suboptimal and it is necessary to randomize. On the other hand, if the
objective is to minimize the average risk (Bayes approach), then it does not lose generality to
restrict to deterministic estimators.
• The space of randomized estimators (viewed as Markov kernels) is convex; it is the convex
hull of deterministic estimators. This convexification is needed for example for the treatment
of minimax theorems.

See Section 28.3 for a detailed discussion and examples.


A well-known fact is that for convex loss function (i.e., T̂ 7→ ℓ(T, T̂) is convex), randomization
does not help. Indeed, for any randomized estimator T̂, we can derandomize it by considering its
conditional expectation E[T̂|X], which is a deterministic estimator and whose risk dominates that
of the original T̂ at every θ, namely, Rθ (T̂) = Eθ ℓ(T, T̂) ≥ Eθ ℓ(T, E[T̂|X]), by Jensen’s inequality.

28.2 Gaussian location model (GLM)


Note that, without loss of generality, all statistical models can be expressed in the parametric form
of (28.1) (since we can take θ to be the distribution itself). In the statistics literature, it is customary
to refer to a model as parametric if θ takes values in a finite-dimensional Euclidean space (so that
each distribution is specified by finitely many parameters), and nonparametric if θ takes values in
some infinite-dimensional space (e.g. density estimation or sequence model).
Perhaps the most basic parametric model is the Gaussian Location Model (GLM), also known
as the Normal Mean Model, which corresponds to our familiar Gaussian channel in Example 3.3.
This will be our running example in this part of the book. In this model, we have

P = {N (θ, σ 2 Id ) : θ ∈ Θ}

where Id is the d-dimensional identity matrix and the parameter space Θ ⊂ Rd . Equivalently, we
can express the data as a noisy observation of the unknown vector θ as:

X = θ + Z, Z ∼ N (0, σ 2 Id ).

The case of d = 1 and d > 1 refers to the univariate (scalar) and multivariate (vector) case,
respectively. (Also of interest is the case where θ is a d1 × d2 matrix, which can be vectorized into
a d = d1 d2 -dimensional vector.)
The choice of the parameter space Θ represents our prior knowledge of the unknown parameter θ, for example,

• Θ = R^d, in which case there is no assumption on θ.
• Θ = ℓ_p-norm balls.
• Θ = {all k-sparse vectors} = {θ ∈ R^d : ‖θ‖_0 ≤ k}, where ‖θ‖_0 ≜ |{i : θ_i ≠ 0}| denotes the size of the support, informally referred to as the ℓ_0-“norm”.
• Θ = {θ ∈ R^{d_1×d_2} : rank(θ) ≤ r}, the set of low-rank matrices.


By definition, more structure (smaller parameter space) always makes the estimation task easier
(smaller worst-case risk), but not necessarily so in terms of computation.
For estimating θ itself (denoising), it is customary to use a loss function defined by certain norms, e.g., ℓ(θ, θ̂) = ‖θ − θ̂‖_p^α for some 1 ≤ p ≤ ∞ and α > 0, where ‖θ‖_p ≜ (Σ_i |θ_i|^p)^{1/p}, with p = α = 2 corresponding to the commonly used quadratic loss (squared error). Some well-known estimators include the Maximum Likelihood Estimator (MLE)

θ̂_ML = X    (28.3)

and the James-Stein estimator based on shrinkage

θ̂_JS = (1 − (d − 2)σ²/‖X‖²₂) X.    (28.4)
The choice of the estimator depends on both the objective and the parameter space. For instance,
if θ is known to be sparse, it makes sense to set the smaller entries in the observed X to zero
(thresholding) in order to better denoise θ (cf. Section 30.2).
In addition to estimating the vector θ itself, it is also of interest to estimate certain functionals
T(θ) thereof, e.g., T(θ) = ‖θ‖_p, max{θ_1, . . . , θ_d}, or eigenvalues in the matrix case. In addition, the hypothesis testing problem in the GLM has been well-studied. For example, one can consider detecting the presence of a signal by testing H_0 : θ = 0 against H_1 : ‖θ‖ ≥ ϵ, or testing weak signal H_0 : ‖θ‖ ≤ ϵ_0 versus strong signal H_1 : ‖θ‖ ≥ ϵ_1, with or without further structural assumptions
on θ. We refer the reader to the monograph [225] devoted to these problems.

28.3 Bayes risk, minimax risk, and the minimax theorem


One of our main objectives in this part of the book is to understand the fundamental limit of
statistical estimation, that is, to determine the performance of the best estimator. As in (28.2), the
risk Rθ (T̂) of an estimator T̂ for T(θ) depends on the ground truth θ. To compare the risk profiles of
different estimators meaningfully requires some thought. As a toy example, Figure 28.1 depicts the
risk functions of three estimators. It is clear that θ̂1 is superior to θ̂2 in the sense that the risk of the
former is pointwise lower than that of the latter. (In statistical literature we say θ̂2 is inadmissible.)
However, the comparison of θ̂1 and θ̂3 is less clear. Although the peak risk value of θ̂3 is bigger
than that of θ̂1 , on average its risk (area under the curve) is smaller. In fact, both views are valid
and meaningful, and they correspond to the worst-case (minimax) and average-case (Bayesian)
approach, respectively. In the minimax formulation, we summarize the risk function into a scalar
quantity, namely, the worst-case risk, and seek the estimator that minimizes this objective. In the
Bayesian formulation, the objective is the average risk. Below we discuss these two approaches
and their connections. For notational simplicity, we consider the task of estimating T(θ) = θ.

28.3.1 Bayes risk


The Bayesian approach is an average-case formulation in which the statistician acts as if the param-
eter θ is random with a known distribution. Concretely, let π be a probability distribution (prior)


Figure 28.1 Risk profiles of three estimators.

on Θ. Then the average risk (w.r.t. π) of an estimator θ̂ is defined as

Rπ (θ̂) = Eθ∼π [Rθ (θ̂)] = Eθ,X [ℓ(θ, θ̂)]. (28.5)

Given a prior π, its Bayes risk is the minimal average risk, namely

R*_π = inf_θ̂ R_π(θ̂).

An estimator θ̂∗ is called a Bayes estimator if it attains the Bayes risk, namely, R∗π = Eθ∼π [Rθ (θ̂∗ )].
Remark 28.3 Bayes estimator is always deterministic, a fact that holds for any loss function.
To see this, note that for any randomized estimator, say θ̂ = θ̂(X, U), where U is some external
randomness independent of X and θ, its risk is lower bounded by

Rπ (θ̂) = Eθ,X,U [ℓ(θ, θ̂(X, U))] = EU [Rπ (θ̂(·, U))] ≥ inf Rπ (θ̂(·, u)).
u

Note that for any u, θ̂(·, u) is a deterministic estimator. This shows that we can find a deterministic
estimator whose average risk is no worse than that of the randomized estimator.
An alternative explanation of this fact is the following: Note that the average risk R_π(θ̂) defined in (28.5) is an affine function of the randomized estimator (understood as a Markov kernel P_{θ̂|X}), whose minimum is achieved at the extremal points. In this case the extremal points of Markov kernels are simply delta measures, which correspond to deterministic estimators.
In certain settings the Bayes estimator can be found explicitly. Consider the problem of estimat-
ing θ ∈ Rd drawn from a prior π. Under the quadratic loss ℓ(θ, θ̂) = kθ̂ − θk22 , the Bayes estimator
is the conditional mean θ̂(X) = E[θ|X] and the Bayes risk is the minimum mean-square error
(MMSE), which we previously encountered in Section 3.7* in the context of I-MMSE relationship:

R∗π = E[kθ − E[θ|X]k22 ] = E[Tr(Cov(θ|X))],

where Cov(θ|X = x) is the conditional covariance matrix of θ given X = x.


As a concrete example, let us consider the Gaussian Location Model in Section 28.2 with a
Gaussian prior.
Example 28.1 (Bayes risk in GLM) Consider the scalar case, where X = θ + Z and Z ∼ N(0, σ²) is independent of θ. Consider a Gaussian prior θ ∼ π = N(0, s). One can verify that the posterior distribution P_{θ|X=x} is N(s/(s+σ²) x, sσ²/(s+σ²)). As such, the Bayes estimator is E[θ|X] = s/(s+σ²) X and the Bayes risk is

R*_π = sσ²/(s + σ²).    (28.6)

Similarly, for the multivariate GLM X = θ + Z, Z ∼ N(0, σ²I_d), if θ ∼ π = N(0, sI_d), then we have

R*_π = sσ²/(s + σ²) · d.    (28.7)
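The closed-form Bayes risk (28.6) is easy to verify by simulation. The following Python sketch (a numerical illustration with arbitrary choices of s and σ²) draws (θ, X) from the joint distribution and checks that the posterior-mean estimator attains average risk sσ²/(s + σ²).

```python
# Monte Carlo check of the Bayes risk (28.6) in the scalar GLM with prior N(0, s).
import numpy as np

rng = np.random.default_rng(0)
s, sigma2, N = 2.0, 1.0, 10**6
theta = rng.normal(0, np.sqrt(s), N)              # theta ~ N(0, s)
X = theta + rng.normal(0, np.sqrt(sigma2), N)     # X | theta ~ N(theta, sigma^2)
theta_hat = s / (s + sigma2) * X                  # Bayes estimator E[theta | X]
print("empirical risk :", np.mean((theta_hat - theta) ** 2))
print("formula (28.6) :", s * sigma2 / (s + sigma2))
```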

28.3.2 Minimax risk


A common criticism of the Bayesian approach is the arbitrariness of the selected prior. A frame-
work related to this but not discussed in this case is the empirical Bayes approach [363, 470],
where one “estimates” the prior from the data instead of choosing a prior a priori. Instead, we take
a frequentist viewpoint by considering the worst-case situation. The minimax risk is defined as

R* = inf_θ̂ sup_{θ∈Θ} R_θ(θ̂).    (28.8)

If there exists θ̂ s.t. supθ∈Θ Rθ (θ̂) = R∗ , then the estimator θ̂ is minimax (minimax optimal).
Finding the value of the minimax risk R∗ entails proving two things, namely,

• a minimax upper bound, by exhibiting an estimator θ̂∗ such that Rθ (θ̂∗ ) ≤ R∗ + ϵ for all θ ∈ Θ;
• a minimax lower bound, by proving that for any estimator θ̂, there exists some θ ∈ Θ, such that
Rθ (θ̂) ≥ R∗ − ϵ,

where ϵ > 0 is arbitrary. This task is frequently difficult especially in high dimensions. Instead of
the exact minimax risk, it is often useful to find a constant-factor approximation Ψ, which we call
minimax rate, such that

R* ≍ Ψ,    (28.9)

that is, cΨ ≤ R* ≤ CΨ for some universal constants c, C > 0. Establishing that Ψ is the minimax rate still entails proving the minimax upper and lower bounds, albeit within multiplicative constant
factors.
In practice, minimax lower bounds are rarely established according to the original definition.
The next result shows that the Bayes risk is always lower than the minimax risk. Throughout
this book, all lower bound techniques essentially boil down to evaluating the Bayes risk with a
sagaciously chosen prior.


Theorem 28.1 Let P(Θ) denote the collection of probability distributions on Θ. Then

R* ≥ R*_Bayes ≜ sup_{π∈P(Θ)} R*_π.    (28.10)

(If the supremum is attained for some prior, we say it is least favorable.)

Proof. Two (equivalent) ways to prove this fact:

1 “max ≥ mean”: For any θ̂, Rπ (θ̂) = Eθ∼π Rθ (θ̂) ≤ supθ∈Θ Rθ (θ̂). Taking the infimum over θ̂
completes the proof;
2 “min max ≥ max min”:

R* = inf_θ̂ sup_{θ∈Θ} R_θ(θ̂) = inf_θ̂ sup_{π∈P(Θ)} R_π(θ̂) ≥ sup_{π∈P(Θ)} inf_θ̂ R_π(θ̂) = sup_π R*_π,

where the inequality follows from the generic fact that minx maxy f(x, y) ≥ maxy minx f(x, y).

Remark 28.4 Unlike Bayes estimators which, as shown in Remark 28.3, are always deter-
ministic, to minimize the worst-case risk it is sometimes necessary to randomize for example in
the context of hypothesis testing (Chapter 14). Specifically, consider a trivial experiment where θ ∈ {0, 1} and X is absent, so that we are forced to guess the value of θ under the zero-one loss ℓ(θ, θ̂) = 1{θ ≠ θ̂}. It is clear that in this case the minimax risk is 1/2, achieved by random guessing θ̂ ∼ Ber(1/2) but not by any deterministic θ̂.

As an application of Theorem 28.1, let us determine the minimax risk of the Gaussian location
model under the quadratic loss function.

Example 28.2 (Minimax quadratic risk of GLM) Consider the Gaussian location model
without structural assumptions, where X ∼ N (θ, σ 2 Id ) with θ ∈ Rd . We show that

R* ≡ inf_θ̂ sup_{θ∈R^d} E_θ[‖θ̂(X) − θ‖²₂] = dσ².    (28.11)

By scaling, it suffices to consider σ = 1. For the upper bound, we consider θ̂_ML = X, which achieves R_θ(θ̂_ML) = d for all θ. To get a matching minimax lower bound, we consider the prior θ ∼ N(0, s). Using the Bayes risk previously computed in (28.6), we have R* ≥ R*_π = sd/(s + 1). Sending s → ∞ yields R* ≥ d.

Remark 28.5 (Non-uniqueness of minimax estimators) In general, estimators that


achieve the minimax risk need not be unique. For instance, as shown in Example 28.2, the MLE
θ̂ML = X is minimax for the unconstrained GLM in any dimension. On the other hand, it is known
that whenever d ≥ 3, the risk of the James-Stein estimator (28.4) is smaller than that of the MLE everywhere (see Figure 28.2) and thus it is also minimax. In fact, there exists a continuum of estimators
that are minimax for (28.11) [276, Theorem 5.5].


Figure 28.2 Risk of the James-Stein estimator (28.4) in dimension d = 3 and σ = 1 as a function of ‖θ‖.
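A curve such as Figure 28.2 can be reproduced approximately by Monte Carlo; the sketch below (a numerical illustration; the values of ‖θ‖ and the sample size are arbitrary choices) estimates the risks of the MLE (28.3) and the James-Stein estimator (28.4) for d = 3, σ = 1.

```python
# Monte Carlo comparison of the MLE (28.3) and James-Stein (28.4) risks, d = 3, sigma = 1.
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 200_000
for r in [0.0, 1.0, 2.0, 4.0, 8.0]:
    theta = np.zeros(d)
    theta[0] = r                                   # risk depends only on ||theta||
    X = theta + rng.standard_normal((N, d))
    mle_risk = np.mean(np.sum((X - theta) ** 2, axis=1))       # equals d = 3
    shrink = 1 - (d - 2) / np.sum(X ** 2, axis=1)              # James-Stein factor
    js_risk = np.mean(np.sum((shrink[:, None] * X - theta) ** 2, axis=1))
    print(f"||theta|| = {r}: MLE {mle_risk:.2f}, James-Stein {js_risk:.2f}")
```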

For most of the statistical models, Theorem 28.1 in fact holds with equality; such a result is
known as a minimax theorem. Before discussing this important topic, here is an example where
minimax risk is strictly bigger than the worst-case Bayes risk.
Example 28.3 Let θ, θ̂ ∈ N ≜ {1, 2, ...} and ℓ(θ, θ̂) = 1{θ̂ < θ}, i.e., the statistician loses one dollar if Nature's choice exceeds the statistician's guess and loses nothing otherwise. Consider the extreme case of blind guessing (i.e., no data is available, say, X = 0). Then for any θ̂, possibly randomized, we have R_θ(θ̂) = P(θ̂ < θ). Thus R* ≥ lim_{θ→∞} P(θ̂ < θ) = 1, which is clearly achievable. On the other hand, for any prior π on N, R_π(θ̂) = P(θ̂ < θ), which vanishes as the deterministic guess θ̂ → ∞. Therefore R*_π = 0, and in this case R* = 1 > R*_Bayes = 0.
As an exercise, one can show that the minimax quadratic risk of the GLM X ∼ N (θ, 1) with
parameter space θ ≥ 0 is the same as the unconstrained case. (This might be a bit surprising
because the thresholded estimator X+ = max(X, 0) achieves a better risk pointwise at every θ ≥ 0;
nevertheless, just like the James-Stein estimator (cf. Figure 28.2), in the worst case the gain is
asymptotically diminishing.)

28.3.3 Minimax and Bayes risk: a duality perspective


Recall from Theorem 28.1 the inequality

R∗ ≥ R∗Bayes .

This result can be interpreted from an optimization perspective. More precisely, R∗ is the value
of a convex optimization problem (primal) and R∗Bayes is precisely the value of its dual program.
Thus the inequality (28.10) is simply weak duality. If strong duality holds, then (28.10) is in fact
an equality, in which case the minimax theorem holds.
For simplicity, we consider the case where Θ is a finite set. Then

R* = min_{P_{θ̂|X}} max_{θ∈Θ} E_θ[ℓ(θ, θ̂)].    (28.12)


This is a convex optimization problem. Indeed, Pθ̂|X 7→ Eθ [ℓ(θ, θ̂)] is affine and the pointwise
supremum of affine functions is convex. To write down its dual problem, first let us rewrite (28.12)
in an augmented form

R* = min_{P_{θ̂|X}, t} t    (28.13)
     s.t. E_θ[ℓ(θ, θ̂)] ≤ t, ∀θ ∈ Θ.

Let π_θ ≥ 0 denote the Lagrange multiplier (dual variable) for each inequality constraint. The Lagrangian of (28.13) is

L(P_{θ̂|X}, t, π) = t + Σ_{θ∈Θ} π_θ (E_θ[ℓ(θ, θ̂)] − t) = (1 − Σ_{θ∈Θ} π_θ) t + Σ_{θ∈Θ} π_θ E_θ[ℓ(θ, θ̂)].

By definition, we have R* ≥ min_{t, P_{θ̂|X}} L(P_{θ̂|X}, t, π). Note that unless Σ_{θ∈Θ} π_θ = 1, min_{t∈R} L(P_{θ̂|X}, t, π) is −∞. Thus π = (π_θ : θ ∈ Θ) must be a probability measure and the dual problem is

max_π min_{P_{θ̂|X}, t} L(P_{θ̂|X}, t, π) = max_{π∈P(Θ)} min_{P_{θ̂|X}} R_π(θ̂) = max_{π∈P(Θ)} R*_π.

Hence, R∗ ≥ R∗Bayes .
In summary, the minimax risk and the worst-case Bayes risk are related by convex duality,
where the primal variables are (randomized) estimators and the dual variables are priors. This
view can in fact be operationalized. For example, [238, 346] showed that for certain problems dualizing Le Cam's two-point lower bound (Theorem 31.1) leads to optimal minimax upper bounds; see Exercise VI.17.
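To make the duality concrete, the following sketch solves the primal LP (28.13) for a toy binary experiment (the distributions P_0, P_1 below are arbitrary choices) with the zero-one loss using scipy.optimize.linprog, and compares its value with the worst-case Bayes risk obtained by a grid search over priors; strong duality for finite LPs makes the two values agree.

```python
# Minimax risk of a finite experiment via the primal LP (28.13), checked against
# the worst-case Bayes risk (the dual). Toy binary experiment, zero-one loss.
import numpy as np
from scipy.optimize import linprog

P = np.array([[0.8, 0.2],     # P_0(x), x in {0, 1}
              [0.3, 0.7]])    # P_1(x)
loss = 1.0 - np.eye(2)        # zero-one loss l(theta, a)

# Variables: q[x, a] = P(theta_hat = a | X = x), flattened, plus the slack t.
c = np.array([0, 0, 0, 0, 1.0])
A_ub = np.zeros((2, 5))
for th in range(2):
    for x in range(2):
        for a in range(2):
            A_ub[th, 2 * x + a] = P[th, x] * loss[th, a]
    A_ub[th, 4] = -1.0                      # E_theta[loss] - t <= 0
A_eq = np.array([[1, 1, 0, 0, 0], [0, 0, 1, 1, 0.0]])   # rows of the kernel sum to 1
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2), A_eq=A_eq, b_eq=np.ones(2),
              bounds=[(0, 1)] * 4 + [(None, None)])

# Dual side: worst-case Bayes risk over priors (p, 1 - p).
priors = np.linspace(0, 1, 10001)
bayes = np.array([np.minimum(p * P[0], (1 - p) * P[1]).sum() for p in priors])
print("minimax (primal LP):", res.fun, " worst-case Bayes (dual):", bayes.max())
```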

28.3.4 Minimax theorem


Much earlier in Chapter 5 we have already seen an example of strong minimax duality. That is, we found that capacity satisfies C = max_{P_X} min_{Q_Y} D(P_{Y|X}‖Q_Y|P_X) = min_{Q_Y} max_{P_X} D(P_{Y|X}‖Q_Y|P_X), and the optimal pair (P*_X, Q*_Y) forms a saddle point. Here we give an example of a similar minimax theorem but for the statistical risk. Namely, we want to specify conditions that ensure (28.10) holds with equality. For simplicity, let us consider the case of estimating θ itself, where the estimator θ̂ takes values in the action space Θ̂ with a loss function ℓ : Θ × Θ̂ → R. A very general result (cf. [404, Theorem 46.6]) asserts that R* = R*_Bayes, provided that the following conditions hold:

• The experiment is dominated, i.e., P_θ ≪ ν holds for all θ ∈ Θ for some ν on X.


• The action space Θ̂ is a locally compact topological space with a countable base (e.g. the
Euclidean space).
• The loss function is level-compact (i.e., for each θ ∈ Θ, ℓ(θ, ·) is bounded from below and the
sublevel set {θ̂ : ℓ(θ, θ̂) ≤ a} is compact for each a).

This result shows that for virtually all problems encountered in practice, the minimax risk coin-
cides with the least favorable Bayes risk. At the heart of any minimax theorem, there is an


application of the separating hyperplane theorem. Below we give a proof of a special case
illustrating this type of argument.

Theorem 28.2 (Minimax theorem)


R∗ = R∗Bayes
in either of the following cases:

• Θ is a finite set and the data X takes values in a finite set X .


• Θ is a finite set and the loss function ℓ is bounded from below, i.e., infθ,θ̂ ℓ(θ, θ̂) > −∞.

Proof. The first case directly follows from the duality interpretation in Section 28.3.3 and the fact that strong duality holds for finite-dimensional linear programming (see for example [376, Sec. 7.4]).
For the second case, we start by showing that if R* = ∞, then R*_Bayes = ∞. To see this, consider the uniform prior π on Θ. Then for any estimator θ̂, there exists θ ∈ Θ such that R_θ(θ̂) = ∞. Then R_π(θ̂) ≥ (1/|Θ|) R_θ(θ̂) = ∞.
Next we assume that R* < ∞. Then R* ∈ R since ℓ is bounded from below (say, by a) by assumption. Given an estimator θ̂, denote its risk vector R(θ̂) = (R_θ(θ̂))_{θ∈Θ}. Then its average risk with respect to a prior π is given by the inner product ⟨R(θ̂), π⟩ = Σ_{θ∈Θ} π_θ R_θ(θ̂). Define

S = {R(θ̂) ∈ R^Θ : θ̂ is a randomized estimator} = set of all possible risk vectors,
T = {t ∈ R^Θ : t_θ < R*, θ ∈ Θ}.

Note that both S and T are convex (why?) subsets of the Euclidean space R^Θ and S ∩ T = ∅ by definition of R*. By the separating hyperplane theorem, there exist a non-zero π ∈ R^Θ and c ∈ R such that inf_{s∈S} ⟨π, s⟩ ≥ c ≥ sup_{t∈T} ⟨π, t⟩. Clearly, π must be componentwise nonnegative, for otherwise sup_{t∈T} ⟨π, t⟩ = ∞. Therefore by normalization we may assume that π is a probability vector, i.e., a prior on Θ. Then R*_Bayes ≥ R*_π = inf_{s∈S} ⟨π, s⟩ ≥ sup_{t∈T} ⟨π, t⟩ ≥ R*, completing the proof.

28.4 Multiple observations and sample complexity


Given an experiment {P_θ : θ ∈ Θ}, consider the experiment

P_n = {P_θ^{⊗n} : θ ∈ Θ},  n ≥ 1.    (28.14)
We refer to this as the independent sampling model, in which we observe a sample X =
(X1 , . . . , Xn ) consisting of independent observations drawn from Pθ for some θ ∈ Θ ⊂ Rd . Given
a loss function ℓ : Rd × Rd → R+ , the minimax risk is denoted by
R*_n(Θ) = inf_θ̂ sup_{θ∈Θ} E_θ[ℓ(θ, θ̂)].    (28.15)

Clearly, n 7→ R∗n (Θ) is non-increasing since we can always discard the extra observations.
Typically, when Θ is a fixed subset of Rd , R∗n (Θ) vanishes as n → ∞. Thus a natural question is


at what rate R∗n converges to zero. Equivalently, one can consider the sample complexity, namely,
the minimum sample size to attain a prescribed error ϵ even in the worst case:

n∗ (ϵ) ≜ min {n ∈ N : R∗n (Θ) ≤ ϵ} . (28.16)

In the classical large-sample asymptotics (Chapter 29), the rate of convergence for the quadratic risk is usually Θ(1/n), which is commonly referred to as the “parametric rate”. In comparison, in this
book we focus on understanding the dependency on the dimension and other structural parameters
nonasymptotically.
As a concrete example, let us revisit the GLM in Section 28.2 with sample size n, in which case we observe X = (X_1, . . . , X_n) i.i.d.∼ N(θ, σ²I_d), θ ∈ R^d. In this case, the minimax quadratic risk is^1

R*_n = dσ²/n.    (28.17)

To see this, note that in this case X̄ = (1/n)(X_1 + . . . + X_n) is a sufficient statistic (cf. Section 3.5) of X for θ. Therefore the model reduces to X̄ ∼ N(θ, (σ²/n) I_d) and (28.17) follows from the minimax risk (28.11) for a single observation.
From (28.17), we conclude that the sample complexity is n*(ϵ) = ⌈dσ²/ϵ⌉, which grows linearly with the dimension d. This is the common wisdom that “sample complexity scales proportionally
to the number of parameters”, also known as “counting the degrees of freedom”. Indeed in high
dimensions we typically expect the sample complexity to grow with the ambient dimension; how-
ever, the exact dependency need not be linear as it depends on the loss function and the objective
of estimation. For example, consider the matrix case θ ∈ Rd×d with n independent observations
in Gaussian noise. Let ϵ be a small constant. Then we have
• For quadratic loss, namely, ‖θ − θ̂‖²_F, we have R*_n = d²/n and hence n*(ϵ) = Θ(d²);
• If the loss function is ‖θ − θ̂‖²_op, then R*_n ≍ d/n and hence n*(ϵ) = Θ(d) (Example 28.4);
• As opposed to θ itself, suppose we are content with estimating only the scalar functional θ_max = max{θ_1, . . . , θ_d} up to accuracy ϵ, then n*(ϵ) = Θ(√(log d)) (Exercise VI.14).

In the last two examples, the sample complexity scales sublinearly with the dimension.

28.5 Tensor product of experiments


Tensor product is a way to define a high-dimensional model from low-dimensional models. Given
statistical experiments Pi = {Pθi : θi ∈ Θi } and the corresponding loss function ℓi , for i ∈ [d],
their tensor product refers to the following statistical experiment:
P = {P_θ = ∏_{i=1}^d P_{θ_i} : θ = (θ_1, . . . , θ_d) ∈ Θ ≜ ∏_{i=1}^d Θ_i},

^1 See Exercise VI.11 for an extension of this result to nonparametric location models.


ℓ(θ, θ̂) ≜ Σ_{i=1}^d ℓ_i(θ_i, θ̂_i), ∀θ, θ̂ ∈ Θ.

In this model, the observation X = (X_1, . . . , X_d) consists of independent (not identically distributed) X_i ∼ P_{θ_i} and the loss function takes a separable form, which is reminiscent of the separable distortion function in (24.8). This should be contrasted with the multiple-observation model in (28.14), in which n iid observations drawn from the same distribution are given.
The minimax risk of the tensorized experiment is related to the minimax risk R*(P_i) and worst-case Bayes risk R*_Bayes(P_i) ≜ sup_{π_i∈P(Θ_i)} R*_{π_i}(P_i) of each individual experiment as follows:

Theorem 28.3 (Minimax risk of tensor product)


Σ_{i=1}^d R*_Bayes(P_i) ≤ R*(P) ≤ Σ_{i=1}^d R*(P_i).    (28.18)

Consequently, if the minimax theorem holds for each experiment, i.e., R*(P_i) = R*_Bayes(P_i), then it also holds for the product experiment and, in particular,

R*(P) = Σ_{i=1}^d R*(P_i).    (28.19)

Proof. The right inequality of (28.18) simply follows by separately estimating θ_i on the basis of X_i, namely, θ̂ = (θ̂_1, . . . , θ̂_d), where θ̂_i depends only on X_i. For the left inequality, consider a product prior π = ∏_{i=1}^d π_i, under which the θ_i's are independent and so are the X_i's. Consider any randomized estimator θ̂_i = θ̂_i(X, U_i) of θ_i based on X, where U_i is some auxiliary randomness independent of X. We can rewrite it as θ̂_i = θ̂_i(X_i, Ũ_i), where Ũ_i = (X_{\i}, U_i) ⊥⊥ X_i. Thus θ̂_i can be viewed as a randomized estimator based on X_i alone and its average risk must satisfy R_{π_i}(θ̂_i) = E[ℓ_i(θ_i, θ̂_i)] ≥ R*_{π_i}. Summing over i and taking the suprema over the priors π_i yields the left inequality of (28.18).
As an example, we note that the unstructured d-dimensional GLM {N (θ, σ 2 Id ) : θ ∈ Rd } with
quadratic loss is simply the d-fold tensor product of the one-dimensional GLM. Since minimax
theorem holds for the GLM (cf. Section 28.3.4), Theorem 28.3 shows the minimax risks sum up to
σ 2 d, which agrees with Example 28.2. In general, however, it is possible that the minimax risk of
the tensorized experiment is less than the sum of the individual minimax risks and the right inequality of (28.18) can be strict. This might appear surprising since X_i only carries information about θ_i
and it makes sense intuitively to estimate θi based solely on Xi . Nevertheless, the following is a
counterexample:
Remark 28.6 Consider X = θZ, where θ ∈ N, Z ∼ Ber(1/2). The estimator θ̂ takes values in N as well and the loss function is ℓ(θ, θ̂) = 1{θ̂ < θ}, i.e., whoever guesses the greater number wins. The minimax risk for this experiment is equal to P[Z = 0] = 1/2. To see this, note that if Z = 0, then all information about θ is erased. Therefore for any (randomized) estimator P_{θ̂|X}, the risk is lower bounded by R_θ(θ̂) = P[θ̂ < θ] ≥ P[θ̂ < θ, Z = 0] = (1/2) P[θ̂ < θ | X = 0]. Therefore

sending θ → ∞ yields sup_θ R_θ(θ̂) ≥ 1/2. This is achievable by θ̂ = X. Clearly, this is a case where the minimax theorem does not hold, which is very similar to the previous Example 28.3.
Next consider the tensor product of two copies of this experiment with loss function ℓ(θ, θ̂) = 1{θ̂_1 < θ_1} + 1{θ̂_2 < θ_2}. We show that the minimax risk is strictly less than one. For i = 1, 2, let X_i = θ_i Z_i, where Z_1, Z_2 i.i.d.∼ Ber(1/2). Consider the following estimator:

θ̂_1 = θ̂_2 = X_1 ∨ X_2 if X_1 > 0 or X_2 > 0, and θ̂_1 = θ̂_2 = 1 otherwise.

Then for any θ_1, θ_2 ∈ N, averaging over Z_1, Z_2, we get

E[ℓ(θ, θ̂)] ≤ (1/4)(1{θ_1 < θ_2} + 1{θ_2 < θ_1} + 2) ≤ 3/4.
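A quick simulation (a numerical illustration with one particular pair (θ_1, θ_2)) confirms that the average loss of this joint estimator stays at most 3/4, strictly below the value 1/2 + 1/2 = 1 obtained by estimating each coordinate separately.

```python
# Simulation of the joint estimator in Remark 28.6: average loss stays <= 3/4 < 1.
import numpy as np

rng = np.random.default_rng(0)
theta1, theta2, N = 2, 10, 10**6
Z1 = rng.integers(0, 2, N)
Z2 = rng.integers(0, 2, N)
X1, X2 = theta1 * Z1, theta2 * Z2
guess = np.where((X1 > 0) | (X2 > 0), np.maximum(X1, X2), 1)
loss = (guess < theta1).astype(float) + (guess < theta2)
print(loss.mean())     # about 0.75 for this (theta1, theta2)
```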
We end this section by considering the minimax risk of the GLM with non-quadratic loss. The following result extends Example 28.2:

Theorem 28.4 Consider the Gaussian location model X_1, . . . , X_n i.i.d.∼ N(θ, I_d). Then for 1 ≤ q < ∞,

inf_θ̂ sup_{θ∈R^d} E_θ[‖θ − θ̂‖_q^q] = E[‖Z‖_q^q] / n^{q/2},  Z ∼ N(0, I_d).

Proof. Note that N(θ, I_d) is a product distribution and the loss function is separable: ‖θ − θ̂‖_q^q = Σ_{i=1}^d |θ_i − θ̂_i|^q. Thus the experiment is a d-fold tensor product of the one-dimensional version. By Theorem 28.3, it suffices to consider d = 1. The upper bound is achieved by the sample mean X̄ = (1/n)Σ_{i=1}^n X_i ∼ N(θ, 1/n), which is a sufficient statistic.
For the lower bound, following Example 28.2, consider a Gaussian prior θ ∼ π = N(0, s). Then the posterior distribution is also Gaussian: P_{θ|X} = N(E[θ|X], s/(1+sn)). The following lemma shows that the Bayes estimator is simply the conditional mean:

Lemma 28.5 Let Z ∼ N(0, 1). Then min_{y∈R} E[|y + Z|^q] = E[|Z|^q].

Thus the Bayes risk is

R*_π = E[|θ − E[θ|X]|^q] = (s/(1 + sn))^{q/2} E|Z|^q.

Sending s → ∞ proves the matching lower bound.

Proof of Lemma 28.5. Write

E|y + Z|^q = ∫_0^∞ P[|y + Z|^q > c] dc ≥ ∫_0^∞ P[|Z|^q > c] dc = E|Z|^q,

where the inequality follows from the simple observation that for any a > 0, P[|y + Z| ≤ a] ≤ P[|Z| ≤ a], due to the symmetry and unimodality of the normal density.


28.6 Log-concavity, Anderson’s lemma and exact minimax risk in GLM


As mentioned in Section 28.3.2, computing the exact minimax risk is frequently difficult especially
in high dimensions. Nevertheless, for the special case of (unconstrained) GLM, the minimax risk is
known exactly in arbitrary dimensions for a large collection of loss functions.2 We have previously
seen in Theorem 28.4 that this is possible for loss functions of the form ℓ(θ, θ̂) = ‖θ − θ̂‖_q^q. Examining the proof of this result, we note that the major limitation is that it only applies to separable loss functions, so that tensorization allows us to reduce the problem to one dimension. This does not apply (and actually fails) for non-separable losses, since Theorem 28.3, if applicable, dictates the risk to grow linearly with the dimension, which is not always the case. We next discuss
a more general result that goes beyond separable losses.

Definition 28.6 A function ρ : Rd → R+ is called bowl-shaped if its sublevel set Kc ≜ {x :


ρ(x) ≤ c} is convex and symmetric (i.e. Kc = −Kc ) for all c ∈ R.

Theorem 28.7 Consider the d-dimensional GLM where X_1, . . . , X_n i.i.d.∼ N(θ, I_d) are observed. Let the loss function be ℓ(θ, θ̂) = ρ(θ − θ̂), where ρ : R^d → R_+ is bowl-shaped and lower-semicontinuous. Then the minimax risk is given by

R* ≜ inf_θ̂ sup_{θ∈R^d} E_θ[ρ(θ − θ̂)] = E[ρ(Z/√n)],  Z ∼ N(0, I_d).

Furthermore, the upper bound is attained by X̄ = (1/n)Σ_{i=1}^n X_i.

The following corollary extends Theorem 28.4 to arbitrary norms.

Corollary 28.8 Let ρ(·) = ‖ · ‖^q for some q > 0, where ‖ · ‖ is an arbitrary norm on R^d. Then

R* = E‖Z‖^q / n^{q/2}.    (28.20)

Example 28.4 Some applications of Corollary 28.8:

• For ρ = ‖·‖²₂, R* = (1/n) E‖Z‖²₂ = d/n, which has been shown in (28.17).
• For ρ = ‖·‖_∞, E‖Z‖_∞ ≍ √(log d) (Lemma 27.10) and R* ≍ √((log d)/n).
• For a matrix θ ∈ R^{d×d}, let ρ(θ) = ‖θ‖_op denote the operator norm (maximum singular value). It has been shown in Exercise V.29 that E‖Z‖_op ≍ √d and so R* ≍ √(d/n); for ρ(·) = ‖ · ‖_F, R* ≍ d/√n.
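The dimension dependence in Example 28.4 can be checked numerically. The sketch below (a quick Monte Carlo illustration; the sample sizes are arbitrary) estimates E‖Z‖_∞ for a Gaussian vector and E‖Z‖_op for a d × d Gaussian matrix and compares them with √(2 log d) and 2√d, the usual sharp constants for these orders.

```python
# Monte Carlo estimates of E||Z||_inf (vector of dim d) and E||Z||_op (d x d matrix).
import numpy as np

rng = np.random.default_rng(0)
for d in [10, 100, 1000]:
    z_vec = rng.standard_normal((200, d))
    z_mat = rng.standard_normal((20, d, d))
    e_inf = np.abs(z_vec).max(axis=1).mean()
    e_op = np.linalg.norm(z_mat, ord=2, axis=(1, 2)).mean()   # largest singular values
    print(d, round(e_inf, 2), round(np.sqrt(2 * np.log(d)), 2),
          round(float(e_op), 1), round(2 * np.sqrt(d), 1))
```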

We can also phrase the result of Corollary 28.8 in terms of the sample complexity n*(ϵ) as defined in (28.16). For example, for q = 2 we have n*(ϵ) = ⌈E[‖Z‖²]/ϵ⌉. The above examples

^2 Another example is the multinomial model with squared error; cf. Exercises VI.7 and VI.9.


show that the scaling of n*(ϵ) with dimension depends on the loss function and the “rule of thumb” that the sample complexity is proportional to the number of parameters need not always hold.
Finally, for the sake of high-probability (as opposed to average) risk bounds, consider ρ(θ − θ̂) = 1{‖θ − θ̂‖ > ϵ}, which is lower semicontinuous and bowl-shaped. Then we obtain the exact expression R* = P[‖Z‖ ≥ ϵ√n]. This result is stronger since the sample mean is optimal simultaneously for all ϵ, so that integrating over ϵ recovers (28.20).
Proof of Theorem 28.7. We only prove the lower bound. We bound the minimax risk R* from below by the Bayes risk R*_π with the prior π = N(0, sI_d):

R* ≥ R*_π = inf_θ̂ E_π[ρ(θ − θ̂)]
          = E[ inf_θ̂ E[ρ(θ − θ̂) | X] ]
      (a) = E[ E[ρ(θ − E[θ|X]) | X] ]
      (b) = E[ ρ(√(s/(1 + sn)) Z) ],

where (a) follows from the crucial Lemma 28.9 below; (b) uses the fact that θ − E[θ|X] ∼ N(0, s/(1+sn) I_d) under the Gaussian prior. Since ρ(·) is lower semicontinuous, sending s → ∞ and applying Fatou's lemma, we obtain the matching lower bound:

R* ≥ lim_{s→∞} E[ρ(√(s/(1 + sn)) Z)] ≥ E[ρ(Z/√n)].
The following lemma establishes the conditional mean as the Bayes estimator under the
Gaussian prior for all bowl-shaped losses, extending the previous Lemma 28.5 in one dimension:

Lemma 28.9 (Anderson [21]) Let X ∼ N(0, Σ) for some Σ ⪰ 0 and let ρ : R^d → R_+ be a bowl-shaped loss function. Then

min_{y∈R^d} E[ρ(y + X)] = E[ρ(X)].

In order to prove Lemma 28.9, it suffices to consider ρ being indicator functions. This is done
in the next lemma, which we prove later.

Lemma 28.10 Let K ⊂ R^d be a symmetric convex set and X ∼ N(0, Σ). Then max_{y∈R^d} P(X + y ∈ K) = P(X ∈ K).

Proof of Lemma 28.9. Denote the sublevel set K_c = {x ∈ R^d : ρ(x) ≤ c}. Since ρ is bowl-shaped, K_c is convex and symmetric, which satisfies the conditions of Lemma 28.10. So,

E[ρ(y + X)] = ∫_0^∞ P(ρ(y + X) > c) dc
            = ∫_0^∞ (1 − P(y + X ∈ K_c)) dc
            ≥ ∫_0^∞ (1 − P(X ∈ K_c)) dc
            = ∫_0^∞ P(ρ(X) > c) dc
            = E[ρ(X)].

Hence, min_{y∈R^d} E[ρ(y + X)] = E[ρ(X)].

Before going into the proof of Lemma 28.10, we need the following definition.

Definition 28.11 A measure μ on Rd is said to be log-concave if


μ(λA + (1 − λ)B) ≥ μ(A)^λ μ(B)^{1−λ}

for all measurable A, B ⊂ Rd and any λ ∈ [0, 1].

The following result, due to Prékopa [350], characterizes the log-concavity of measures in terms
of that of its density function; see also [361] (or [179, Theorem 4.2]) for a proof.

Theorem 28.12 Suppose that μ has a density f with respect to the Lebesgue measure on Rd .
Then μ is log-concave if and only if f is log-concave.

Example 28.5 Examples of log-concave measures:

• Lebesgue measure: Let μ = vol be the Lebesgue measure on Rd , which satisfies Theorem 28.12
(f ≡ 1). Then

vol(λA + (1 − λ)B) ≥ vol(A)^λ vol(B)^{1−λ},    (28.21)

which implies^3 the Brunn-Minkowski inequality:

vol(A + B)^{1/d} ≥ vol(A)^{1/d} + vol(B)^{1/d}.    (28.22)

• Gaussian distribution: Let μ = N(0, Σ), with a log-concave density f since log f(x) = −(d/2) log(2π) − (1/2) log det(Σ) − (1/2) x^⊤ Σ^{−1} x is concave in x.

Proof of Lemma 28.10. By Theorem 28.12, the distribution of X is log-concave. Then


P[X ∈ K] (a)= P[X ∈ (1/2)(K + y) + (1/2)(K − y)]
         (b)≥ √(P[X ∈ K − y] · P[X ∈ K + y])
         (c)= P[X + y ∈ K],

^3 Applying (28.21) to A′ = vol(A)^{−1/d} A, B′ = vol(B)^{−1/d} B (both of which have unit volume), and λ = vol(A)^{1/d}/(vol(A)^{1/d} + vol(B)^{1/d}) yields (28.22).


where (a) follows from (1/2)(K + y) + (1/2)(K − y) = (1/2)K + (1/2)K = K since K is convex; (b) follows from the definition of log-concavity in Definition 28.11 with λ = 1/2, A = K − y = {x − y : x ∈ K} and B = K + y; (c) follows from P[X ∈ K + y] = P[X ∈ −K − y] = P[X + y ∈ K] since X has a symmetric distribution and K is symmetric (K = −K).

29 Classical large-sample asymptotics

In this chapter we give an overview of the classical large-sample theory in the setting of iid obser-
vations in Section 28.4 focusing again on the minimax risk (28.15). These results pertain to smooth
parametric models in fixed dimensions, with the sole asymptotics being the sample size going to
infinity. The main result is that, under suitable conditions, the minimax squared error of estimating
θ based on X_1, . . . , X_n i.i.d.∼ P_θ satisfies

inf_θ̂ sup_{θ∈Θ} E_θ[‖θ̂ − θ‖²₂] = (1 + o(1))/n · sup_{θ∈Θ} Tr J_F^{−1}(θ),    (29.1)

where J_F(θ) is the Fisher information matrix introduced in (2.32) in Chapter 2. This is an asymptotic characterization of the minimax risk with a sharp constant. In later chapters, we will proceed to high
dimensions where such precise results are difficult and rare.
Throughout this chapter, we focus on the quadratic risk and assume that Θ is an open set of
the Euclidean space Rd . While reading this chapter, a reader is advised to consult Exercise VI.7 to
understand the minimax risk in the simple setting of estimating the parameter of a Bernoulli model.

29.1 Statistical lower bound from data processing


In this section we derive several statistical lower bounds from data processing argument. Specif-
ically, we will take a comparison-of-experiment approach by comparing the actual model with a
perturbed model. The performance of a given estimator can be then related to the f-divergence via
the data processing inequality and the variational representation (Chapter 7).
We start by discussing the Hammersley-Chapman-Robbins lower bound which implies the well-
known Cramér-Rao lower bound. Because these results are restricted to unbiased estimators, we
will also discuss their Bayesian version; in particular, the Bayesian Cramér-Rao lower bound
is responsible for proving the lower bound in (29.1). We focus on explaining how these results
can be anticipated from information-theoretic reasoning and postpone the exact statement and
assumption of the Bayesian Cramér-Rao bound to Section 29.2.

29.1.1 Hammersley-Chapman-Robbins (HCR) lower bound


The following result due to [210, 87] is a direct consequence of the variational representation of
χ2 -divergence in Section 7.13, which relates it to the mean and variance of test functions.


Theorem 29.1 (HCR lower bound) The quadratic loss of any estimator θ̂ at θ ∈ Θ ⊂ R satisfies

R_θ(θ̂) = E_θ[(θ̂ − θ)²] ≥ Var_θ(θ̂) ≥ sup_{θ'≠θ} (E_θ[θ̂] − E_{θ'}[θ̂])² / χ²(P_{θ'}‖P_θ).    (29.2)

Proof. Let θ̂ be a (possibly randomized) estimator based on X. Fix θ' ≠ θ ∈ Θ. Denote by P and Q the probability distributions when the true parameter is θ' or θ, respectively. That is, P_X = P_{θ'} and Q_X = P_θ. Then

χ²(P_X‖Q_X) ≥ χ²(P_θ̂‖Q_θ̂) ≥ (E_θ[θ̂] − E_{θ'}[θ̂])² / Var_θ(θ̂),    (29.3)

where the first inequality applies the data processing inequality (Theorem 7.4) and the second inequality the variational representation (7.91) of χ²-divergence.
Next we apply Theorem 29.1 to unbiased estimators θ̂ that satisfy E_θ[θ̂] = θ for all θ ∈ Θ. Then

Var_θ(θ̂) ≥ sup_{θ'≠θ} (θ − θ')² / χ²(P_{θ'}‖P_θ).

Lower bounding the supremum by the limit of θ' → θ and recalling the asymptotic expansion of χ²-divergence from Theorem 7.22 in terms of the Fisher information, we get, under the regularity conditions in Theorem 7.22, the celebrated Cramér-Rao (CR) lower bound [108, 354]:

Var_θ(θ̂) ≥ 1/J_F(θ).    (29.4)
A few more remarks are as follows:

• Note that the HCR lower bound Theorem 29.1 is based on the χ2 -divergence. For a version
based on Hellinger distance which also implies the CR lower bound, see Exercise VI.5.
• Both the HCR and the CR lower bounds extend to the multivariate case as follows. Let θ̂ be
an unbiased estimator of θ ∈ Θ ⊂ R^d. Assume that its covariance matrix Cov_θ(θ̂) = E_θ[(θ̂ − θ)(θ̂ − θ)^⊤] is positive definite. Fix a ∈ R^d. Applying Theorem 29.1 to ⟨a, θ̂⟩, we get

χ²(P_θ‖P_{θ'}) ≥ ⟨a, θ − θ'⟩² / (a^⊤ Cov_θ(θ̂) a).

Optimizing over a yields^1

χ²(P_θ‖P_{θ'}) ≥ (θ − θ')^⊤ Cov_θ(θ̂)^{−1} (θ − θ').

Sending θ' → θ and applying the asymptotic expansion χ²(P_θ‖P_{θ'}) = (θ − θ')^⊤ J_F(θ)(θ − θ')(1 + o(1)) (see Remark 7.13), we get the multivariate version of the CR lower bound:

Cov_θ(θ̂) ⪰ J_F^{−1}(θ).    (29.5)

^1 For Σ ≻ 0, sup_{x≠0} ⟨x, y⟩²/(x^⊤Σx) = y^⊤Σ^{−1}y, attained at x = Σ^{−1}y.
= y⊤ Σ−1 y, attained at x = Σ−1 y.


• For a sample of n iid observations, by the additivity property (2.36), the Fisher information
matrix is equal to nJF (θ). Taking the trace on both sides, we conclude that the squared error of
any unbiased estimator satisfies

E_θ[‖θ̂ − θ‖²₂] ≥ (1/n) Tr(J_F^{−1}(θ)).
This is already very close to (29.1), except for the fundamental restriction of unbiased
estimators.
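As a quick numerical illustration of the information bound (this particular example, n iid observations from an exponential distribution with rate θ, is only an illustration), note that J_F(θ) = 1/θ² for Exp(θ); the MLE 1/X̄ is slightly biased, but its variance is close to the benchmark 1/(nJ_F(θ)) = θ²/n for moderate n.

```python
# Monte Carlo check: variance of the MLE for the rate of Exp(theta) versus
# the information benchmark theta^2 / n.
import numpy as np

rng = np.random.default_rng(0)
theta, n, N = 2.0, 200, 50_000
X = rng.exponential(scale=1 / theta, size=(N, n))   # Exp(theta) has mean 1/theta
mle = 1 / X.mean(axis=1)
print("bias    :", mle.mean() - theta)              # small, of order 1/n
print("variance:", mle.var(), " benchmark theta^2/n:", theta**2 / n)
```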

29.1.2 Bayesian CR and HCR


The drawback of the HCR and CR lower bounds is that they are confined to unbiased estimators.
For the minimax settings in (29.1), there is no sound reason to restrict to unbiased estimators; in
fact, it is often wise to trade bias with variance in order to achieve a smaller overall risk.
Next we discuss a lower bound, known as the Bayesian Cramér-Rao (BCR) lower bound [188]
or the van Trees inequality [433], for a Bayesian setting that applies to all estimators; to apply to
the minimax setting, in view of Theorem 28.1, one just needs to choose an appropriate prior. The
exact statement and the application to minimax risk are postponed till the next section. Here we
continue the previous line of thinking and derive it from the data processing argument.
Fix a prior π on Θ and a (possibly randomized) estimator θ̂. Then we have the Markov chain
θ → X → θ̂. Consider two joint distributions for (θ, X):

• Under Q, θ is drawn from π and X ∼ Pθ conditioned on θ;


• Under P, θ is drawn from Tδ π, where Tδ denote the pushforward of shifting by δ , i.e., Tδ π (A) =
π (A − δ), and X ∼ Pθ−δ conditioned on θ.

Similar to (29.3), applying data processing and variational representation of χ2 -divergence yields
χ²(P_{θX}‖Q_{θX}) ≥ χ²(P_{θθ̂}‖Q_{θθ̂}) ≥ χ²(P_{θ−θ̂}‖Q_{θ−θ̂}) ≥ (E_P[θ − θ̂] − E_Q[θ − θ̂])² / Var_Q(θ̂ − θ).
Note that by design, PX = QX and thus EP [θ̂] = EQ [θ̂]; on the other hand, EP [θ] = EQ [θ] + δ .
Furthermore, Eπ [(θ̂ − θ)2 ] ≥ VarQ (θ̂ − θ). Since this applies to any estimators, we conclude that
the Bayes risk R∗π (and hence the minimax risk) satisfies
R*_π ≜ inf_θ̂ E_π[(θ̂ − θ)²] ≥ sup_{δ≠0} δ² / χ²(P_{Xθ}‖Q_{Xθ}),    (29.6)
which is referred to as the Bayesian HCR lower bound in comparison with (29.2).
Similar to the deduction of CR lower bound from the HCR, we can further lower bound
this supremum by evaluating the small-δ limit. First note the following chain rule for the
χ²-divergence:

χ²(P_{Xθ}‖Q_{Xθ}) = χ²(P_θ‖Q_θ) + E_Q[χ²(P_{X|θ}‖Q_{X|θ}) · (dP_θ/dQ_θ)²].


Under suitable regularity conditions in Theorem 7.22, again applying the local expansion of χ²-divergence yields

• χ²(P_θ‖Q_θ) = χ²(T_δ π‖π) = (J(π) + o(1))δ², where J(π) ≜ ∫ π'²/π is the Fisher information of the prior (see Example 2.7);
• χ²(P_{X|θ}‖Q_{X|θ}) = [J_F(θ) + o(1)]δ².

Thus from (29.6) we get

R*_π ≥ 1 / (J(π) + E_{θ∼π}[J_F(θ)]).    (29.7)
We conclude this section by revisiting the Gaussian Location Model (GLM) in Example 28.1.

Example 29.1 Let X^n = (X_1, . . . , X_n) i.i.d.∼ N(θ, 1) and consider the prior θ ∼ π = N(0, s). To apply the Bayesian HCR bound (29.6), note that

χ²(P_{θX^n}‖Q_{θX^n}) (a)= χ²(P_{θX̄}‖Q_{θX̄})
                          = χ²(P_θ‖Q_θ) + E_Q[(dP_θ/dQ_θ)² χ²(P_{X̄|θ}‖Q_{X̄|θ})]
                      (b) = e^{δ²/s} − 1 + e^{δ²/s}(e^{nδ²} − 1)
                          = e^{δ²(n + 1/s)} − 1,

where (a) follows from the sufficiency of X̄ = (1/n)Σ_{i=1}^n X_i; (b) is by Q_θ = N(0, s), Q_{X̄|θ} = N(θ, 1/n), P_θ = N(δ, s), P_{X̄|θ} = N(θ − δ, 1/n), and the fact (7.43) for Gaussians. Therefore,

R*_π ≥ sup_{δ≠0} δ²/(e^{δ²(n + 1/s)} − 1) = lim_{δ→0} δ²/(e^{δ²(n + 1/s)} − 1) = s/(sn + 1).

In view of the Bayes risk found in Example 28.1, we see that in this case the Bayesian HCR and
Bayesian Cramér-Rao lower bounds are exact.

29.2 Bayesian CR lower bounds and extensions


In this section we give the rigorous statement of the Bayesian Cramér-Rao lower bound and discuss
its extensions and consequences. For the proof, we take a more direct approach as opposed to the
data-processing argument in Section 29.1 based on asymptotic expansion of the χ2 -divergence.

Theorem 29.2 (BCR lower bound) Let π be a differentiable prior density on the interval
[θ0 , θ1 ] such that π (θ0 ) = π (θ1 ) = 0 and
J(π) ≜ ∫_{θ_0}^{θ_1} π'(θ)²/π(θ) dθ < ∞.    (29.8)


Let Pθ (dx) = pθ (x) μ(dx), where the density pθ (x) is differentiable in θ for μ-almost every x.
Assume that for π-almost every θ,
∫ μ(dx) ∂_θ p_θ(x) = 0.    (29.9)

Then the Bayes quadratic risk R*_π ≜ inf_θ̂ E[(θ − θ̂)²] satisfies

R*_π ≥ 1 / (E_{θ∼π}[J_F(θ)] + J(π)).    (29.10)

Proof. In view of Remark 28.3, it loses no generality to assume that the estimator θ̂ = θ̂(X) is
deterministic. For each x, integration by parts yields
∫_{θ_0}^{θ_1} dθ (θ̂(x) − θ) ∂_θ(p_θ(x)π(θ)) = ∫_{θ_0}^{θ_1} p_θ(x)π(θ) dθ.

Integrating both sides over μ(dx) yields


E[(θ̂ − θ)V(θ, X)] = 1,

where V(θ, x) ≜ ∂_θ(log(p_θ(x)π(θ))) = ∂_θ log p_θ(x) + ∂_θ log π(θ) and the expectation is over
the joint distribution of (θ, X). Applying Cauchy-Schwarz, we have E[(θ̂ − θ)2 ]E[V(θ, X)2 ] ≥ 1.
The proof is completed by noting that E[V(θ, X)2 ] = E[(∂θ log pθ (X))2 ] + E[(∂θ log π (θ))2 ] =
E[JF (θ)] + J(π ), thanks to the assumption (29.9).
The multivariate version of Theorem 29.2 is the following.
Theorem 29.3 (Multivariate BCR) Consider a product prior density π(θ) = ∏_{i=1}^d π_i(θ_i) over the box ∏_{i=1}^d [θ_{0,i}, θ_{1,i}], where each π_i is differentiable on [θ_{0,i}, θ_{1,i}] and vanishes on the boundary. Assume that for π-almost every θ,

∫ μ(dx) ∇_θ p_θ(x) = 0.    (29.11)

Then

R*_π ≜ inf_θ̂ E_π[‖θ̂ − θ‖²₂] ≥ Tr((E_{θ∼π}[J_F(θ)] + J(π))^{−1}),    (29.12)

where the Fisher information matrices are given by JF (θ) = Eθ [∇θ log pθ (X)∇θ log pθ (X)⊤ ] and
J(π ) = diag(J(π 1 ), . . . , J(π d )).

Proof. Fix an estimator θ̂ = (θ̂1 , . . . , θ̂d ) and a non-zero u ∈ Rd . For each i, k = 1, . . . , d,


integration by parts yields
∫_{θ_{0,i}}^{θ_{1,i}} (θ̂_k(x) − θ_k) ∂_{θ_i}(p_θ(x)π(θ)) dθ_i = 1{k = i} ∫_{θ_{0,i}}^{θ_{1,i}} p_θ(x)π(θ) dθ_i.

Integrating both sides over ∏_{j≠i} dθ_j and μ(dx), multiplying by u_i, and summing over i, we obtain

E[(θ̂_k(X) − θ_k)⟨u, ∇ log(p_θ(X)π(θ))⟩] = ⟨u, e_k⟩,


where e_k denotes the kth standard basis vector. Applying Cauchy-Schwarz and optimizing over u yield

E[(θ̂_k(X) − θ_k)²] ≥ sup_{u≠0} ⟨u, e_k⟩² / (u^⊤ Σ u) = (Σ^{−1})_{kk},

where Σ ≡ E[∇ log(pθ (X)π (θ))∇ log(pθ (X)π (θ))⊤ ] = Eθ∼π [JF (θ)] + J(π ), thanks to (29.11).
Summing over k completes the proof of (29.12).

Several remarks are in order:

• The above versions of the BCR bound assume a prior density that vanishes at the boundary.
If we choose a uniform prior, the same derivation leads to a similar lower bound known as
the Chernoff-Rubin-Stein inequality (see Ex. VI.4), which also suffices for proving the optimal
minimax lower bound in (29.1).
• For the purpose of the lower bound, it is advantageous to choose a prior density with the mini-
mum Fisher information. The optimal density with a compact support is known to be a squared
cosine density [219, 426]:

min_{g on [−1,1]} J(g) = π²,

attained by

g(u) = cos²(πu/2).    (29.13)
• Suppose the goal is to estimate a smooth functional T(θ) of the unknown parameter θ, where T : R^d → R^s is differentiable with ∇T(θ) = (∂T_i(θ)/∂θ_j) its s × d Jacobian matrix. Then under the
same condition of Theorem 29.3, we have the following Bayesian Cramér-Rao lower bound for
functional estimation:

inf_T̂ E_π[‖T̂(X) − T(θ)‖²₂] ≥ Tr(E[∇T(θ)] (E[J_F(θ)] + J(π))^{−1} E[∇T(θ)]^⊤),    (29.14)
where the expectation on the right-hand side is over θ ∼ π.

As a consequence of the BCR bound, we prove the lower bound part for the asymptotic minimax
risk in (29.1).

Theorem 29.4 Assume that θ 7→ JF (θ) is continuous. Denote the minimax squared error
i.i.d.
R∗n ≜ infθ̂ supθ∈Θ Eθ [kθ̂ − θk22 ], where Eθ is taken over X1 , . . . , Xn ∼ Pθ . Then as n → ∞,
1 + o( 1)
R∗n ≥ sup TrJ− 1
F (θ). (29.15)
n θ∈Θ

Proof. Fix θ ∈ Θ. Then for all sufficiently small δ , B∞ (θ, δ) = θ + [−δ, δ]d ⊂ Θ. Let π i (θi ) =
1 θ−θi Qd
δ g( δ ), where g is the prior density in (29.13). Then the product distribution π = i=1 π i
satisfies the assumption of Theorem 29.3. By the scaling rule of Fisher information (see (2.35)),
2 2
J(π i ) = δ12 J(g) = δπ2 . Thus J(π ) = δπ2 Id .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-581


i i

29.3 Maximum likelihood estimator and asymptotic efficiency 581

It is known that (see [68, Theorem 2, Appendix V]) the continuity of θ 7→ JF (θ) implies (29.11).
So we are ready to apply the BCR bound in Theorem 29.3. Lower bounding the minimax by the
Bayes risk and also applying the additivity property (2.36) of Fisher information, we obtain
 − 1 !
∗ 1 π2
Rn ≥ · Tr Eθ∼π [JF (θ)] + 2 Id .
n nδ

Finally, choosing δ = n−1/4 and applying the continuity of JF (θ) in θ, the desired (29.15) follows.

Similarly, for estimating a smooth functional T(θ), applying (29.14) with the same argument
yields

1 + o( 1)
inf sup Eθ [kT̂ − T(θ)k22 ] ≥ sup Tr(∇T(θ)J− 1 ⊤
F (θ)∇T(θ) ). (29.16)
T̂ θ∈Θ n θ∈Θ

29.3 Maximum likelihood estimator and asymptotic efficiency


Theorem 29.4 shows that in a small neighborhood of each parameter θ, the best estimation error
is at best 1n (TrJ− 1
F (θ) + o(1)) when the sample size n grows; this is known as the information
bound as determined by the Fisher information matrix. Estimators achieving this bound are called
asymptotic efficient. A cornerstone of the classical large-sample theory is the asymptotic efficiency
of the maximum likelihood estimator (MLE). Rigorously stating this result requires a lengthy
list of technical conditions, and an even lengthier one is needed to make the error uniform so
as to attain the minimax lower bound in Theorem 29.4. In this section we give a sketch of the
asymptotic analysis of MLE, focusing on the main ideas and how Fisher information emerges
from the likelihood optimization.
i.i.d.
Suppose we observe a sample Xn = (X1 , X2 , · · · , Xn ) ∼ Pθ0 , where θ0 stands for the true
parameter. The MLE is defined as:

θ̂MLE ∈ arg max Lθ (Xn ), (29.17)


θ∈Θ

where
X
n
Lθ (Xn ) = log pθ (Xi )
i=1

is the total log-likelihood and pθ (x) = dP dμ (x) is the density of Pθ with respect to some com-
θ

mon dominating measure μ. For discrete distribution Pθ , the MLE can also be written as the KL
projection2 of the empirical distribution P̂n to the model class: θ̂MLE ∈ arg minθ∈Θ D(P̂n kPθ ).

2
Note that this is the reverse of the information projection studied in Section 15.3.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-582


i i

582

The main intuition why MLE works is as follows. Assume that the model is identifiable, namely,
θ 7→ Pθ is injective. Then for any θ 6= θ0 , we have by positivity of the KL divergence (Theorem 2.3)
" n #
X pθ (Xi )
Eθ0 [Lθ − Lθ0 ] = Eθ0 log = −nD(Pθ0 ||Pθ ) < 0.
pθ0 (Xi )
i=1

In other words, Lθ − Lθ0 is an iid sum with a negative mean and thus negative with high probability
for large n. From here the consistency of MLE follows upon assuming appropriate regularity
conditions, among which is Wald’s integrability condition Eθ0 [sup∥θ−θ0 ∥≤ϵ log ppθθ (X)] < ∞ [449,
0
454].
Assuming more conditions one can obtain the asymptotic normality and efficiency of the
MLE. This follows from the local quadratic approximation of the log-likelihood function. Define
V(θ, x) ≜ ∇θ log pθ (x) (score) and H(θ, x) ≜ ∇2θ log pθ (x). By Taylor expansion,
! !
Xn
1 Xn
⊤ ⊤
Lθ =Lθ0 + (θ − θ0 ) V(θ0 , Xi ) + (θ − θ0 ) H(θ0 , Xi ) (θ − θ0 )
2
i=1 i=1

+ o(n(θ − θ0 ) ).
2
(29.18)
Recall from Section 2.6.2* that, under suitable regularity conditions, we have
Eθ0 [V(θ0 , X)] = 0, Eθ0 [V(θ0 , X)V(θ0 , X)⊤ ] = −Eθ0 [H(θ0 , X)] = JF (θ0 ).
Thus, by the Central Limit Theorem and the Weak Law of Large Numbers, we have
1 X 1X
n n
d P
√ V(θ0 , Xi )−
→N (0, JF (θ0 )), H(θ0 , Xi )−
→ − JF (θ0 ).
n n
i=1 i=1

Substituting these quantities into (29.18), we obtain the following stochastic approximation of the
log-likelihood:
p n
Lθ ≈ Lθ0 + h nJF (θ0 )Z, θ − θ0 i − (θ − θ0 )⊤ JF (θ0 )(θ − θ0 ),
2
where Z ∼ N (0, Id ). Maximizing the right-hand side yields:
1
θ̂MLE ≈ θ0 + √ JF (θ0 )−1/2 Z.
n
From this asymptotic normality, we can obtain Eθ0 [kθ̂MLE − θ0 k22 ] ≤ n1 (TrJF (θ0 )−1 + o(1)), and
for smooth functionals by Taylor expanding T at θ0 (delta method), Eθ0 [kT(θ̂MLE ) − T(θ0 )k22 ] ≤
−1 ⊤
n (Tr(∇T(θ0 )JF (θ0 ) ∇T(θ0 ) ) + o(1)), matching the information bounds (29.15) and (29.16).
1

Of course, the above heuristic derivation requires additional assumptions to justify (for example,
Cramér’s condition, cf. [168, Theorem 18] and [375, Theorem 7.63]). Even stronger assumptions
are needed to ensure the error is uniform in θ in order to achieve the minimax lower bound in
Theorem 29.4; see, e.g., Theorem 34.4 (and also Chapters 36-37) of [68] for the exact conditions
and statements. A more general and abstract theory of MLE and the attainment of information
bound were developed by Hájek and Le Cam; see [209, 273].
Despite its wide applicability and strong optimality properties, the methodology of MLE is not
without limitations. We conclude this section with some remarks along this line.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-583


i i

29.4 Application: Estimating discrete distributions and entropy 583

• MLE may not exist even for simple parametric models. For example, consider X1 , . . . , Xn
drawn iid from the location-scale mixture of two Gaussians 21 N ( μ1 , σ12 ) + 12 N ( μ2 , σ22 ), where
( μ1 , μ2 , σ1 , σ2 ) are unknown parameters. Then the likelihood can be made arbitrarily large by
setting for example μ1 = X1 and σ1 → 0.
• MLE may be inconsistent; see [375, Example 7.61] and [167] for examples, both in one-
dimensional parametric family.
• In high dimensions, it is possible that MLE fails to achieve the minimax rate (Exercise VI.15).

29.4 Application: Estimating discrete distributions and entropy


As an application in this section we consider the concrete problems of estimating a discrete dis-
tribution or its property (such as Shannon entropy) based on iid observations. Of course, the
asymptotic theory developed in this chapter applies only to the classical setting of fixed alpha-
bet and large sample size. Along the way, we will also discuss extensions to large alphabet and
what may go wrong with the classical theory.
i.i.d.
Throughout this section, let X1 , · · · , Xn ∼ P ∈ Pk , where Pk ≡ P([k]) denotes the collection
of probability distributions over [k] = {1, . . . , k}. We first consider the estimation of P under the
squared loss.

Theorem 29.5 Fox fixed k, the minimax squared error of estimating P satisfies
 
b − Pk22 ] = 1 k−1
R∗sq (k, n) ≜ inf sup E[kP + o( 1) , n → ∞. (29.19)
b
P P∈Pk n k

Proof. Let P = (P1 , . . . , Pk ) be parametrized, as in Example 2.6, by θ = (P1 , P2 , · · · , Pk−1 ) and


Pk = 1 − P1 − · · · − Pk−1 . Then P = T(θ), where T : Rk−1 → Rk is an affine functional so that
I 1
∇T(θ) = [ −k−
1⊤
], with 1 being the all-ones (column) vector.
The Fisher information matrix and its inverse have been calculated in (2.37) and (2.38): We
have J−
F (θ) = diag(θ) − θθ and
1 ⊤

 
diag(θ) − θθ⊤ −Pk θ
∇T(θ)J− 1
F (θ)∇T(θ)

=
−Pk θ⊤ Pk (1 − Pk ).
Pk Pk
So Tr(∇T(θ)J− 1 ⊤
F (θ)∇T(θ) ) = i=1 Pi (1 − Pi ) = 1 −
2
i=1 Pi , which achieves its maximum
1 − 1k at the uniform distribution. Applying the functional form of the BCR bound in (29.16), we
conclude R∗sq (k, n) ≥ 1n (1 − 1k + o(1)).
For the upper bound, consider the MLE, which in this case coincides with the empirical distri-
Pn
bution P̂ = (P̂i ) (Exercise VI.8). Note that nP̂i = j=1 1 {Xj = i} ∼ Bin(n, Pi ). Then for any P,
Pk
E[kP̂ − Pk22 ] = n1 i=1 Pi (1 − Pi ) ≤ n1 (1 − 1k ).

Some remarks on Theorem 29.5 are in order:

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-584


i i

584

−1/k
• In fact, for any k, n, we have the precise result: R∗sq (k, n) = (11+√ 2 – see Ex. VI.7h. This can be
n)
shown by considering a Dirichlet prior (13.16) and applying the corresponding Bayes estimator,
which is an additively-smoothed empirical distribution (Section 13.5).
• Note that R∗sq (k, n) does not grow with the alphabet size k; this is because squared loss is
too weak for estimating probability vectors. More meaningful loss functions include the f-
divergences in Chapter 7, such as the total variation, KL divergence, χ2 -divergence. These
minimax rates are worked out in Exercise VI.8 and Exercise VI.10, for both small and large
alphabets, and they indeed depend on the alphabet size k. For example, the minimax KL risk
satisfies Θ( nk ) for k ≤ n and grows as Θ(log nk ) for k  n. This agrees with the rule of thumb
that consistent estimation requires the sample size to scale faster than the dimension.

As a final application, let us consider the classical problem of entropy estimation in information
theory and statistics [304, 128, 215], where the goal is to estimate the Shannon entropy, a non-
linear functional of P. The following result follows from the functional BCR lower bound (29.16)
and analyzing the MLE (in this case the empirical entropy) [39].

Theorem 29.6 For fixed k, the minimax quadratic risk of entropy estimation satisfies
 
b (X1 , . . . , Xn ) − H(P))2 ] = 1
R∗ent (k, n) ≜ inf sup E[(H max V(P) + o(1) , n→∞
b P∈Pk
H n P∈Pk

Pk
where H(P) = i=1 Pi log P1i = E[log P(1X) ] and V(P) = Var[log P(1X) ] are the Shannon entropy
and varentropy (cf. (10.4)) of P.

Let us analyze the result of Theorem 29.6 and see how it extends to large alphabets. It can be
2
shown that3 maxP∈Pk V(P)  log2 k, which suggests that R∗ent ≡ R∗ent (k, n) may satisfy R∗ent  logn k
even when the alphabet size k grows with n; however, this result only holds for sufficiently small
alphabet. In fact, back in Lemma 13.2 we have shown that for the empirical entropy which achieves
the bound in Theorem 29.6, its bias is on the order of nk , which is no longer negligible on large
alphabets. Using techniques of polynomial approximation [456, 233], one can reduce this bias to
n log k and further show that consistent entropy estimation is only possible if and only if n  log k
k k

[428], in which case the minimax rate satisfies


 2
k log2 k
R∗ent  +
n log k n
In summary, one needs to exercise caution extending classical large-sample results to high
dimensions, especially when bias becomes the dominating factor.

3
Indeed, maxP∈Pk V(P) ≤ log2 k for all k ≥ 3 [334, Eq. (464)]. For the lower bound, consider
P = ( 12 , 2(k−1)
1 1
, . . . 2(k−1) ).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-585


i i

30 Mutual information method

In this chapter we describe a strategy for proving statistical lower bound we call the Mutual Infor-
mation Method (MIM), which entails comparing the amount of information data provides with
the minimum amount of information needed to achieve a certain estimation accuracy. Similar to
Section 29.2, the main information-theoretical ingredient is the data-processing inequality, this
time for mutual information as opposed to f-divergences.
Here is the main idea of the MIM: Fix some prior π on Θ and we aim to lower bound the Bayes
risk R∗π of estimating θ ∼ π on the basis of X with respect to some loss function ℓ. Let θ̂ be an
estimator such that E[ℓ(θ, θ̂)] ≤ D. Then we have the Markov chain θ → X → θ̂. Applying the
data processing inequality (Theorem 3.7), we have

inf I(θ; θ̂) ≤ I(θ; θ̂) ≤ I(θ; X). (30.1)


Pθ̃|θ :Eℓ(θ,θ̃)≤D

Note that

• The leftmost quantity can be interpreted as the minimum amount of information required to
achieve a given estimation accuracy. This is precisely the rate-distortion function ϕ(D) ≡ ϕθ (D)
(recall Section 24.3).
• The rightmost quantity can be interpreted as the amount of information provided by the data
about the latent parameter. Sometimes it suffices to further upper-bound it by the capacity of
the channel PX|θ by maximizing over all priors (Chapter 5):

I(θ; X) ≤ sup I(θ; X) ≜ C. (30.2)


π ∈P(Θ)

Therefore, we arrive at the following lower bound on the Bayes and hence the minimax risks

R∗π ≥ ϕ−1 (I(θ; X)) ≥ ϕ−1 (C). (30.3)

The reasoning of the mutual information method is reminiscent of the converse proof for joint-
source channel coding in Section 26.3. As such, the argument here retains the flavor of “source-
channel separation”, in that the lower bound in (30.1) depends only on the prior (source) and
the loss function, while the capacity upper bound (30.2) depends only on the statistical model
(channel).
In the next few sections, we discuss a sequence of examples to illustrate the MIM and its
execution:

585

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-586


i i

586

• Denoising a vector in Gaussian noise, where we will compute the exact minimax risk;
• Denoising a sparse vector, where we determine the sharp minimax rate;
• Community detection, where the goal is to recover a dense subgraph planted in a bigger Erdös-
Rényi graph.

In the next chapter we will discuss three popular approaches for, namely, Le Cam’s method,
Assouad’s lemma, and Fano’s method. As illustrated in Figure 30.1, all three follow from the

Mutual Information Method

Fano Assouad Le Cam

Figure 30.1 The three lower bound techniques as consequences of the Mutual Information Method.

mutual information method, corresponding to different choice of prior π for θ, namely, the uni-
form distribution over a two-point set {θ0 , θ1 }, the hypercube {0, 1}d , and a packing (recall
Section 27.1). While these methods are highly useful in determining the minimax rate for many
problems, they are often loose with constant factors compared to the MIM. In the last section
of this chapter, we discuss the problem of how and when is non-trivial estimation achievable by
applying the MIM; for this purpose, none of the three methods in the next chapter works.

30.1 GLM revisited and the Shannon lower bound


i.i.d.
Consider the d-dimensional GLM, where we observe X = (X1 , . . . , Xn ) ∼ N (θ, Id ) and θ ∈ Θ
is the parameter. Denote by R∗ (Θ) the minimax risk with respect to the quadratic loss ℓ(θ, θ̂) =
kθ̂ − θk22 .
First, let us consider the unconstrained model where Θ = Rd . Estimating using the sample
Pn
mean X̄ = n1 i=1 Xi ∼ N (θ, 1n Id ), we achieve the upper bound R∗ (Rd ) ≤ dn . This turns out to
be the exact minimax risk, as shown in Example 28.2 by computing the Bayes risk for Gaussian
priors. Next we apply the mutual information method to obtain the same matching lower bound
without evaluating the Bayes risk. Again, let us consider θ ∼ N (0, sId ) for some s > 0. We know
from the Gaussian rate-distortion function (Theorem 26.2) that
(
d
2 log sd
D D < sd
ϕ(D) = inf I(θ; θ̂) =
Pθ̂|θ :E[∥θ̂−θ∥22 ]≤D 0 otherwise.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-587


i i

30.1 GLM revisited and the Shannon lower bound 587

Using the sufficiency of X̄ and the formula of Gaussian channel capacity (cf. Theorem 5.11 or
Theorem 20.11), the mutual information between the parameter and the data can be computed as
d
I(θ; X) = I(θ; X̄) = log(1 + sn).
2
It then follows from (30.3) that R∗π ≥ 1+sdsn , which in fact matches the exact Bayes risk in (28.7).
Sending s → ∞ we recover the result in (28.17), namely
d
R∗ ( R d ) = . (30.4)
n
In the above unconstrained GLM, we are able to compute everything in closed form when
applying the mutual information method. Such exact expressions are rarely available in more
complicated models in which case various bounds on the mutual information will prove useful.
Next, let us consider the GLM with bounded means, where the parameter space Θ = B(ρ) =
{θ : kθk2 ≤ ρ} is the ℓ2 -ball of radius ρ centered at zero. In this case there is no known closed-
form formula for the minimax quadratic risk even in one dimension.1 Nevertheless, the next result
determines the sharp minimax rate, which characterizes the minimax risk up to universal constant
factors.

Theorem 30.1 (Bounded GLM)


d
R∗ (B(ρ))  ∧ ρ2 . (30.5)
n
p
p we see that if ρ ≳
Remark 30.1 Comparing (30.5) with (30.4), d/n, it is rate-optimal to
ignore the bounded-norm constraint; if ρ ≲ d/n, we can discard all observations and estimate
by zero, because data do not provide a better resolution than the prior information.

Proof. The upper bound R∗ (B(ρ)) ≤ dn ∧ ρ2 follows from considering the estimator θ̂ = X̄ and
θ̂ = 0. To prove the lower bound, we apply the mutual information method with a uniform prior
θ ∼ Unif(B(r)), where r ∈ [0, ρ] is to be optimized. The mutual information can be upper bound
using the AWGN capacity as follows:
 
1 d nr2 nr2
I(θ; X) = I(θ; X̄) ≤ sup I(θ; θ + √ Z) = log 1 + ≤ , (30.6)
Pθ :E[∥θ∥2 ]≤r n 2 d 2
2

where Z ∼ N (0, Id ). Alternatively, we can use Corollary 5.8 to bound the capacity (as information
radius) by the KL diameter, which yields the same bound within constant factors:
1
I(θ; X) ≤ sup I(θ; θ + √ Z) ≤ max D(N (θ, Id /n)kN (θ, Id /n)k) = 2nr2 . (30.7)
Pθ :∥θ∥≤r n θ,θ ′ ∈ B( r)

1
It is known that there exists some ρ0 depending on d/n such that for all ρ ≤ ρ0 , the uniform prior over the sphere of
radius ρ is exactly least favorable (see [82] for d = 1 and [48] for d > 1.)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-588


i i

588

For the lower bound, due to the lack of closed-form formula for the rate-distortion function
for uniform distribution over Euclidean balls, we apply the Shannon lower bound (SLB) from
Section 26.1. Since θ has an isotropic distribution, applying Theorem 26.3 yields
d 2πed d cr2
inf I(θ; θ̂) ≥ h(θ) + log ≥ log ,
Pθ̂|θ :E∥θ−θ̂∥2 ≤D 2 D 2 D

for some universal constant c, where the last inequality is because for θ ∼ Unif(B(r)), h(θ) =
log vol(B(r)) = d log r + log vol(B(1)) and the volume of a unit Euclidean ball in d dimensions
satisfies (recall (27.14)) vol(B(1))1/d  √1d .
2 2
∗ 2 −nr /d 2
R∗ ≤ 2 , i.e., R ≥ cr e
Finally, applying (30.3) yields 12 log cr nr
. Optimizing over r and
−ax −a
using the fact that sup0<x<1 xe = ea if a ≥ 1 and e if a < 1, we have
1

d
R∗ ≥ sup cr2 e−nr /d 
2
∧ ρ2 .
r∈[0,ρ] n

As a final example, let us consider a non-quadratic loss ℓ(θ, θ̂) = kθ − θ̂kr , the rth power of an
arbitrary norm on Rd . Recall that we have determined in Corollary 28.8 the exact minimax risk
using Anderson’s lemma, namely,
inf sup Eθ [kθ̂ − θkr ] = n−r/2 E[kZkr ], Z ∼ N (0, Id ). (30.8)
θ̂ θ∈Rd

In order to apply the mutual information method, consider again a Gaussian prior θ ∼ N (0, sId ).
Suppose that E[kθ̂ − θkr ] ≤ D. By the data processing inequality,
(  d  )
d d Dre r d
log(1 + ns) ≥ I(θ; X) ≥ I(θ; θ̂) ≥ log(2πes) − log V∥·∥ Γ 1+ ,
2 2 d r
where the last inequality follows from the general SLB (26.5). Rearranging terms and sending
s → ∞ yields
 r/2   −r/d  r
d 2πe d − r/ 2 − r/ d d
inf sup Eθ [kθ̂ − θk ] ≥
r
V∥·∥ Γ 1 + n V∥·∥ ≳ √ ,
θ̂ θ∈Rd re n r nE[kZk∗ ]
(30.9)
where the middle inequality applies Stirling’s approximation Γ(x)1/x  x for x → ∞, and the
right applies Urysohn’s volume inequality (27.21), with kxk∗ = sup{hx, yi : kyk ≤ 1} denoting
the dual norm of k · k.
To evaluate the tightness of the lower bound from SLB in comparison with the exact result
P 1/q
d
(30.8), consider the example of r = 2 and the ℓq -norm kxkq = i=1 | x i | q
with 1 ≤ q ≤ ∞.
Recall the volume of a unit ℓq -ball given in (27.13). In the special case of q = 2, the (first) lower
bound in (30.9) is in fact exact and coincides with (30.4). For general q ∈ [1, ∞), (30.9) gives the
2/q
tight minimax rate d n ; however, for q = ∞, the minimax lower bound we get is 1/n, independent p
of the dimension d. In comparison, from (30.8) we get the sharp rate logn d , since EkZk∞  log d
(cf. Lemma 27.10). We will revisit this example in Section 31.4 and show how to obtain the optimal
dependency on the dimension.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-589


i i

30.2 GLM with sparse means 589

Remark 30.2 (SLB versus the volume method) Recall the connection between the rate-
distortion function and the metric entropy in Section 27.6. As we have seen in Section 27.2, a
common lower bound for metric entropy is via the volume bound. In fact, the SLB can be inter-
preted as a volume-based lower bound to the rate-distortion function. To see this, consider r = 1
and let θ be uniformly distributed over some compact set Θ, so that h(θ) = log vol(Θ) (Theo-
rem 2.7(a)). Applying Stirling’s approximation, the lower bound in (26.5) becomes log vol(vol (Θ)
B∥·∥ (cϵ))
for some constant c, which has the same form as the volume ratio in Theorem 27.3 for metric
entropy. We will see later in Section 31.4 that in statistical applications, applying SLB yields basi-
cally the same lower bound as applying Fano’s method to a packing obtained from the volume
bound, although SLB does not rely explicitly on a packing.

30.2 GLM with sparse means


In this section we consider the problem of denoising for a sparse vector. Specifically, consider
again the Gaussian location model N (θ, Id ) where the mean vector θ is known to be k-sparse,
taking values in the “ℓ0 -ball”

B0 (k) = {θ ∈ Rd : kθk0 ≤ k}, k ∈ [p],

where kθk0 = |{i ∈ [d] : θi 6= 0}| is the number of nonzero entries of θ, indicating the sparsity of
θ. Our goal is to characterize the minimax quadratic risk

R∗n (B0 (k)) = inf sup Eθ kθ̂ − θk22 .


θ̂ θ∈B0 (k)

Next we prove an optimal lower bound applying MIM. (For a different proof using Fano’s method
in Section 31.4, see Exercise VI.12.)

Theorem 30.2
k ed
R∗n (B0 (k)) ≳ log . (30.10)
n k

A few remarks are in order:


Remark 30.3 • The lower bound (30.10) turns out to be tight, achieved by the maximum
likelihood estimator

θ̂MLE = arg min kX̄ − θk2 , (30.11)


∥θ∥0 ≤k

which is equivalent to keeping the k entries from X̄ with the largest magnitude and setting the
rest to zero, or the following hard-thresholding estimator θ̂τ with an appropriately chosen τ (see
Exercise VI.13):

θ̂iτ = Xi 1 {|Xi | ≥ τ }. (30.12)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-590


i i

590

• Sharp asymptotics: For sublinear sparsity k = o(d), we have R∗n (B0 (k)) = (2 + o(1)) nk log dk
(Exercise VI.13); for linear sparsity k = (η + o(1))d with η ∈ (0, 1), R∗n (B0 (k)) = (β(η) +
o(1))d for some constant β(η). For the latter and more refined results, we refer the reader to the
monograph [236, Chapter 8].
Proof. First, note that B0 (k) is a union of linear subspace of Rd and thus homogeneous. Therefore
by scaling, we have
1 ∗ 1
R∗n (B0 (k)) =
R (B0 (k)) ≜ R∗ (k, d). (30.13)
n 1 n
Thus it suffices to consider n = 1. Denote the observation by X = θ + Z.
Next, note that the following oracle lower bound:
R∗ (k, d) ≥ k,
which is the optimal risk given the extra information of the support of θ, in view of (30.4). Thus
to show (30.10), below it suffices to consider k ≤ d/4.
We now apply the mutual information method. Recall from (27.10) that Sdk denotes the
Hamming sphere, namely,
Sdk = {b ∈ {0, 1}d : wH (b) = k},
d
where wH (b) denotes
qthe Hamming weights of b. Let b be uniformly distributed over Sk and let
θ = τ b, where τ = log dk . Given any estimator θ̂ = θ̂(X), define an estimator b̂ ∈ {0, 1}d for b
by
(
0 θ̂i ≤ τ /2
b̂i = , i ∈ [d].
1 θ̂i > τ /2

Thus the Hamming loss of b̂ can be related to the squared loss of θ̂ as


τ2
kθ − θ̂k22 ≥ dH (b, b̂). (30.14)
4
Let EdH (b, b̂) = δ k. Assume that δ ≤ 14 , for otherwise, we are done.
Note the following Markov chain b → θ → X → θ̂ → b̂ and thus, by the data processing
inequality of mutual information,
 
d kτ 2 kτ 2 k d
I(b; b̂) ≤ I(θ; X) ≤ log 1 + ≤ = log .
2 d 2 2 k
where the second inequality follows from the fact that kθk22 = kτ 2 and the Gaussian channel
capacity.
Conversely,
I(b̂; b) ≥ min I(b̂; b)
EdH (b,b̂)≤δ d

= H(b) − max H(b|b̂)


EdH (b,b̂)≤δ k

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-591


i i

30.3 Community detection 591

     
d d δk
≥ log − max H(b ⊕ b̂) = log − dh , (30.15)
k EwH (b⊕b̂)≤δ k k d
where the last step follows from Exercise I.7.

Combining the lower and upper bound on the mutual information and using dk ≥ ( dk )k , we
get dh( δdk ) ≥ 2k k log dk . Since h(p) ≤ −p log p + p for p ∈ [0, 1] and k/d ≤ 14 by assumption, we
conclude that δ ≥ ck/d for some absolute constant c, completing the proof of (30.10) in view of
(30.14).

30.3 Community detection


As another application of the mutual information method, let us consider the following statistical
problem of detecting a single hidden community in random graphs, also known as the planted
dense subgraph model. (The reader should compare this with the stochastic block model with
two communities introduced in Exercise I.49.) Let C∗ be drawn uniformly at random from all
subsets of [n] of cardinality k ≥ 2. Let G denote a random graph on the vertex set [n], such that
for each i 6= j, they are connected independently with probability p if both i and j belong to C∗ ,
and with probability q otherwise. Assuming that p > q, the set C∗ represents a densely connected
community, which forms an Erdös-Rényi graph ER(k, p) planted in the bigger ER(n, q) graph.
Upon observing G, the goal is to reconstruct C∗ as accurately as possible. In particular, given an
estimator Ĉ = Ĉ(G), we say it achieves almost exact recovery if E|C4Ĉ| = o(k). The following
result gives a necessary condition in terms of the parameters (p, q, n, k):

Theorem 30.3 Assume that k/n is bounded away from one. If almost exact recovery is possible,
then
2 + o(1) n
d(pkq) ≥ log . (30.16)
k−1 k

Remark 30.4 In addition to Theorem 30.3, another necessary condition is that


 
1
d(pkq) = ω , (30.17)
k
which can be shown by a reduction to testing the membership of two nodes given the rest. It turns
out that conditions (30.16) and (30.17) are optimal, in the sense that almost exact recovery can be
achieved (via maximum likelihood) provided that (30.17) holds and d(pkq) ≥ 2k− +ϵ n
1 log k for any
constant ϵ > 0. For details, we refer the readers to [208].

Proof. Suppose Ĉ achieves almost exact recovery of C∗ . Let ξ ∗ , ξˆ ∈ {0, 1}n denote their indicator
vectors, respectively, for example, ξi∗ = 1 {i ∈ C∗ } for each i ∈ [n]. Then E[dH (ξ, ξ)]
ˆ = ϵn k for
some ϵn → 0. Applying the mutual information method as before, we have
   
( a) n ϵn k (b) n
∗ ˆ ∗
I(G; ξ ) ≥ I(ξ; ξ ) ≥ log − nh ≥ k log (1 + o(1)),
k n k

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-592


i i

592

where (a) follows in the same manner as (30.15) did from Exercise I.7; (b) is due to the assumption
that k/n ≤ 1 − c for some constant c.
On the other hand, the mutual information between the hidden community and the graph can
be upper bounded as:
 
(b) k
∗ ( a) ⊗(n2) ( c)
I(G; ξ ) = min D(PG|ξ∗ kQ|Pξ∗ ) ≤ D(PG|ξ∗ kBer(q) | Pξ ∗ ) = d(pkq),
Q 2
where (a) is by the variational representation of mutual information in Corollary 4.2; (b) follows
from choosing Q to be the distribution of the Erdös-Rényi graph ER(n, q); (c) is by the tensoriza-
tion property of KL divergence for product distributions (Theorem 2.16(d)). Combining the last
two displays completes the proof.

30.4 Estimation better than chance


Instead of characterizing the rate of convergence of the minimax risk to zero as the amount of data
grows, suppose we are in a regime where this is impossible due to either limited sample size, poor
signal to noise ratio, or the high dimensionality; instead, we are concerned with the modest goal of
achieving an estimation error strictly better than the trivial error (without data). In the context of
clustering, this is known as weak recovery or correlated recovery, where the goal is not to achieve
a vanishing misclassification rate but one strictly better than random guessing the labels. It turns
out that MIM is particularly suited for this regime. (In fact, we will see in the next chapter that all
three popular further relaxations of MIM fall short due to the loss of constant factors.)
As an example, let us continue the setting of Theorem 30.1, where the goal is to estimate a vec-
tor in a high-dimensional unit-ball based on noisy observations. Since the radius of the parameter
space is one, the trivial squared error equals one. The following theorem shows that in high dimen-
sions, non-trivial estimation is achievable if and only if the sample size n grows proportionally
with the dimension d; otherwise, when d  n  1, the optimal estimation error is 1 − nd (1 + o(1)).

i.i.d.
Theorem 30.4 (Bounded GLM continued) Suppose X1 , . . . , Xn ∼ N (θ, Id ), where θ
belongs to B, the unit ℓ2 -ball in Rd . Then for some universal constant C0 ,
n+C0 d
e− d−1 ≤ inf sup Eθ [kθ̂ − θk2 ] ≤ .
θ̂ θ∈B d+n

Proof. Without loss of generality, assume that the observation is X = θ+ √Zn , where Z ∼ N (0, Id ).
For the upper bound, applying the shrinkage estimator2 θ̂ = 1+1d/n X yields E[kθ̂ − θk2 ] ≤ n+d d .
For the lower bound, we apply MIM as in the proof of Theorem 30.1 with the prior θ ∼
Unif(Sd−1 ). We still apply the AWGN capacity in (30.6) to get I(θ; X) ≤ n/2. (Here the

2
This corresponds to the Bayes estimator (Example 28.1) when we choose θ ∼ N (0, 1d Id ), which is approximately
concentrated on the unit sphere for large d.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-593


i i

30.4 Estimation better than chance 593

constant 1/2 is important and so the diameter-based bound (30.7) is too loose.) For the rate-
distortion function of spherical uniform distribution, applying Theorem 27.17 yields I(θ; θ̂) ≥
d−1
2 log E[∥θ̂−θ∥2 ] − C. Thus the lower bound on E[kθ̂ − θk ] follows from the data processing
1 2

inequality.
A similar phenomenon also occurs in the problem of estimating a discrete distribution P on k
elements based on n iid observations, which has been studied in Section 29.4 for small alphabet in
the large-sample asymptotics and extended in Exercise VI.7–VI.10 to large alphabets. In particular,
consider the total variation loss, which is at most one. Ex. VI.10f shows that the TV error of any
estimator is 1 − o(1) if n  k; conversely, Ex. VI.10b demonstrates an estimator P̂ such that
E[χ2 (PkP̂)] ≤ nk− 1 2
+1 . Applying the joint range (7.32) between TV and χ and Jensen’s inequality,
we have
 q
 1 k− 1 n ≥ k − 2
E[TV(P, P̂)] ≤ 2 n+1
 k− 1 n≤k−2
k+n

which is bounded away from one whenever n = Ω(k). In summary, non-trivial estimation in total
variation is possible if and only if n scales at least proportionally with k.
Finally, let us mention the problem of correlated recovery in the stochastic block model
(cf. Exercise I.49), which refers to estimating the community labels better than chance. The
optimal information-theoretic threshold of this problem can be established by bounding the
appropriate mutual information; see Section 33.9 for the Gaussian version (spiked Wigner model).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-594


i i

31 Lower bounds via reduction to hypothesis


testing

In this chapter we study three commonly used techniques for proving minimax lower bounds,
namely, Le Cam’s method, Assouad’s lemma, and Fano’s method. Compared to the results in
Chapter 29 geared towards large-sample asymptotics in smooth parametric models, the approach
here is more generic, less tied to mean-squared error, and applicable in nonasymptotic settings
such as nonparametric or high-dimensional problems.
The common rationale of all three methods is reducing statistical estimation to hypothesis test-
ing. Specifically, to lower bound the minimax risk R∗ (Θ) for the parameter space Θ, the first step
is to notice that R∗ (Θ) ≥ R∗ (Θ′ ) for any subcollection Θ′ ⊂ Θ, and Le Cam, Assouad, and Fano’s
methods amount to choosing Θ′ to be a two-point set, a hypercube, or a packing, respectively. In
particular, Le Cam’s method reduces the estimation problem to binary hypothesis testing. This
method is perhaps the easiest to evaluate; however, the disadvantage is that it is frequently loose
in estimating high-dimensional parameters. To capture the correct dependency on the dimension,
both Assouad’s and Fano’s method rely on reduction to testing multiple hypotheses.
As illustrated in Figure 30.1, all three methods in fact follow from the common principle of
the mutual information method (MIM) in Chapter 30, corresponding to different choice of priors.
The limitation of these methods, compared to the MIM, is that, due to the looseness in constant
factors, they are ineffective for certain problems such as estimation better than chance discussed
in Section 30.4.

31.1 Le Cam’s two-point method


Theorem 31.1 Suppose the loss function ℓ : Θ × Θ → R+ satisfies ℓ(θ, θ) = 0 for all θ ∈ Θ
and the following α-triangle inequality for some α > 0: For all θ0 , θ1 , θ ∈ Θ,

ℓ(θ0 , θ1 ) ≤ α(ℓ(θ0 , θ) + ℓ(θ1 , θ)). (31.1)

Then

ℓ(θ0 , θ1 )
inf sup Eθ ℓ(θ, θ̂) ≥ sup (1 − TV(Pθ0 , Pθ1 )) (31.2)
θ̂ θ∈Θ θ0 ,θ1 ∈Θ 2α

594

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-595


i i

31.1 Le Cam’s two-point method 595

Proof. Fix θ0 , θ1 ∈ Θ. Given any estimator θ̂, let us convert it into the following (randomized)
test:

θ0 with probability ℓ(θ1 ,θ̂)
,
ℓ(θ0 ,θ̂)+ℓ(θ1 ,θ̂)
θ̃ =
θ1 with probability ℓ(θ 0 , θ̂)
.
ℓ(θ ,θ̂)+ℓ(θ ,θ̂) 0 1

By the α-triangle inequality, we have


" #
ℓ(θ0 , θ̂) 1
Eθ0 [ℓ(θ̃, θ0 )] = ℓ(θ0 , θ1 )Eθ0 ≥ Eθ [ℓ(θ̂, θ0 )],
ℓ(θ0 , θ̂) + ℓ(θ1 , θ̂) α 0

and similarly for θ1 . Consider the prior π = 21 (δθ0 + δθ1 ) and let θ ∼ π. Taking expectation on
both sides yields the following lower bound on the Bayes risk:
ℓ(θ0 , θ1 )   ℓ(θ0 , θ1 )
Eπ [ℓ(θ̂, θ)] ≥ P θ̃ 6= θ ≥ (1 − TV(Pθ0 , Pθ1 ))
α 2α
where the last step follows from the minimum average probability of error in binary hypothesis
testing (Theorem 7.7).
Remark 31.1 As an example where the bound (31.2) is tight (up to constants), consider a
binary hypothesis testing problem with Θ = {θ0 , θ1 } and the Hamming loss ℓ(θ, θ̂) = 1{θ 6= θ̂},
where θ, θ̂ ∈ {θ0 , θ1 } and α = 1. Then the left side is the minimax probability of error, and the
right side is the optimal average probability of error (cf. (7.19)). These two quantities can coincide
(for example for Gaussian location model).
Another special case of interest is the quadratic loss ℓ(θ, θ̂) = kθ − θ̂k22 , where θ, θ̂ ∈ Rd , which
satisfies the α-triangle inequality with α = 2. In this case, the leading constant 41 in (31.2) makes
sense, because in the extreme case of TV = 0 where Pθ0 and Pθ1 cannot be distinguished, the best
estimate is simply θ0 +θ2 . In addition, the inequality (31.2) can be deduced based on properties of
1

f-divergences and their joint range (Chapter 7). To this end, abbreviate Pθi as Pi for i = 0, 1 and
consider the prior π = 21 (δθ0 + δθ1 ). Then the Bayes estimator (posterior mean) is θ0 dP 0 +θ1 dP1
dP0 +dP1 and
the Bayes risk is given by
Z
kθ0 − θ1 k2 dP0 dP1
R∗π =
2 dP0 + dP1
kθ0 − θ1 k2 kθ0 − θ1 k2
= (1 − LC(P0 , P1 )) ≥ (1 − TV(P0 , P1 )),
4 4
R 0 −dP1 )
2
where LC(P0 , P1 ) = (dP dP0 +dP1 is the Le Cam divergence defined in (7.7) and satisfies LC ≤ TV.

Example 31.1 As a concrete example, consider


P
the one-dimensional GLM with sample size
n
n. By considering the sufficient statistic X̄ = n1 i=1 Xi , the model is simply {N (θ, 1n ) : θ ∈ R}.
Applying Theorem 31.1 yields
     
∗ 1 1 1
R ≥ sup |θ0 − θ1 | 1 − TV N θ0 ,
2
, N θ1 ,
θ0 ,θ1 ∈R 4 n n
(a) 1 ( b) c
= sup s2 (1 − TV(N (0, 1), N (s, 1))) = (31.3)
4n s>0 n

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-596


i i

596

where (a) follows from the shift and scale invariance of the total variation; in (b) c ≈ 0.083 is
some absolute constant, obtained by applying the formula TV(N (0, 1), N (s, 1)) = 2Φ( 2s ) − 1
from (7.40). On the other hand, we know from Example 28.2 that the minimax risk equals 1n , so
the two-point method is rate-optimal in this case.
In the above example, for two points separated by Θ( √1n ), the corresponding hypothesis cannot
be tested with vanishing probability of error so that the resulting estimation risk (say in squared
error) cannot be smaller than 1n . This convergence rate is commonly known as the “parametric
rate”, which we have studied in Chapter 29 for smooth parametric families focusing on the Fisher
information as the sharp constant. More generally, the 1n rate is not improvable for models with
locally quadratic behavior

H2 (Pθ0 , Pθ0 +t )  t2 , t → 0. (31.4)

(Recall that Theorem 7.23 gives a sufficient condition for this behavior.) Indeed, pick θ0 in the
interior of the parameter space and set θ1 = θ0 + √1n , so that H2 (Pθ0 , Pθ1 ) = Θ( 1n ) thanks to (31.4).
By Theorem 7.8, we have TV(P⊗ ⊗n
θ0 , Pθ1 ) ≤ 1 − c for some constant c and hence Theorem 31.1
n

yields the lower bound Ω(1/n) for the squared error. Furthermore, later we will show that the same
locally quadratic behavior in fact guarantees the achievability of the 1/n rate; see Corollary 32.12.
Example 31.2 As a different example, consider the family Unif(0, θ). Note that as opposed
to the quadratic behavior (31.4), we have

H2 (Unif(0, 1), Unif(0, 1 + t)) = 2(1 − 1/ 1 + t)  t.

Thus an application of Theorem 31.1 yields an Ω(1/n2 ) lower bound. This rate is not achieved
by the empirical mean estimator (which only achieves 1/n rate), but by the maximum likelihood
estimator θ̂ = max{X1 , . . . , Xn }. Other types of behavior in t, and hence the rates of convergence,
can occur even in compactly supported location families – see Example 7.1.
The limitation of Le Cam’s two-point method is that it does not capture the correct dependency
on the dimensionality. To see this, let us revisit Example 31.1 for d dimensions.
Example 31.3 Consider the d-dimensional GLM in Corollary 28.8. Again, it is equivalent
to consider the reduced model {N (θ, 1n ) : θ ∈ Rd }. We know from Example 28.2 (see also
Theorem 28.4) that for quadratic risk ℓ(θ, θ̂) = kθ − θ̂k22 , the exact minimax risk is R∗ = nd for any
d and n. Let us compare this with the best two-point lower bound. Applying Theorem 31.1 with
α = 2,
     
1 1 1
R∗ ≥ sup kθ0 − θ1 k22 1 − TV N θ0 , Id , N θ1 , Id
θ0 ,θ1 ∈Rd 4 n n
1
= sup kθk22 {1 − TV (N (0, Id ) , N (θ, Id ))}
θ∈Rd 4n
1
= sup s2 (1 − TV(N (0, 1), N (s, 1))),
4n s>0

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-597


i i

31.2 Assouad’s Lemma 597

where the second step applies the shift and scale invariance of the total variation; in the last step,
by rotational invariance of isotropic Gaussians, we can rotate the vector θ align with a coordinate
vector (say, e1 = (1, 0 . . . , 0)) which reduces the problem to one dimension, namely,
TV(N (0, Id ), N (θ, Id )) = TV(N (0, Id ), N (kθke1 , Id )
= TV(N (0, 1), N (kθk, 1)).

Comparing the above display with (31.3), we see that the best Le Cam two-point lower bound in
d dimensions coincide with that in one dimension.
Let us mention in passing that although Le Cam’s two-point method is typically suboptimal for
estimating a high-dimensional parameter θ, for functional estimation in high dimensions (e.g. esti-
mating a scalar functional T(θ)), Le Cam’s method is much more effective and sometimes even
optimal. The subtlety is that is that as opposed to testing a pair of simple hypotheses H0 : θ = θ0
versus H1 : θ = θ1 , we need to test H0 : T(θ) = t0 versus H1 : T(θ) = t1 , both of which are
composite hypotheses and require a sagacious choice of priors. See Exercise VI.14 for an example.

31.2 Assouad’s Lemma


From Example 31.3 we see that Le Cam’s two-point method effectively only perturbs one out
of d coordinates, leaving the remaining d − 1 coordinates unexplored; this is the source of its
suboptimality. In order to obtain a lower bound that scales with the dimension, it is necessary to
randomize all d coordinates. Our next topic Assouad’s Lemma is an extension in this direction.

Theorem 31.2 (Assouad’s lemma) Assume that the loss function ℓ satisfies the α-triangle
inequality (31.1). Suppose Θ contains a subset Θ′ = {θb : b ∈ {0, 1}d } indexed by the hypercube,
such that ℓ(θb , θb′ ) ≥ β · dH (b, b′ ) for all b, b′ and some β > 0. Then
 
βd
inf sup Eθ ℓ(θ, θ̂) ≥ 1 − max TV(Pθb , Pθb′ ) (31.5)
θ̂ θ∈Θ 4α dH (b,b′ )=1

Proof. We lower bound the Bayes risk with respect to the uniform prior over Θ′ . Given any
estimator θ̂ = θ̂(X), define b̂ ∈ argmin ℓ(θ̂, θb ). Then for any b ∈ {0, 1}d ,
β dH (b̂, b) ≤ ℓ(θb̂ , θb ) ≤ α(ℓ(θb̂ , θ̂b ) + ℓ(θ̂, θb )) ≤ 2αℓ(θ̂, θb ).

Let b ∼ Unif({0, 1}d ) and we have b → θb → X. Then


β
E[ℓ(θ̂, θb )] ≥ E[dH (b̂, b)]

β X h i
d
= P b̂i 6= bi

i=1

β X
d
≥ (1 − TV(PX|bi =0 , PX|bi =1 )),

i=1

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-598


i i

598

where the last step is again by Theorem 7.7, just like in the proof of Theorem 31.1. Each total
variation can be upper bounded as follows:
!
( a) 1 X 1 X (b)
TV(PX|bi =0 , PX|bi =1 ) = TV d − 1
Pθb , d−1 Pθb ≤ max TV(Pθb , Pθb′ )
2 2 dH (b,b′ )=1
b:bi =1 b:bi =0

where (a) follows from the Bayes rule, and (b) follows from the convexity of total variation
(Theorem 7.5). This completes the proof.

Example 31.4 Let us continue the discussion of the d-dimensional GLM in Example 31.3.
Consider the quadratic loss first. To apply Theorem 31.2, consider the hypercube θb = ϵb, where
b ∈ {0, 1}d . Then kθb − θb′ k22 = ϵ2 dH (b, b′ ). Applying Theorem 31.2 yields
     
∗ ϵ2 d 1 ′ 1
R ≥ 1− max TV N ϵb, Id , N ϵb , Id
4 b,b′ ∈{0,1}d ,dH (b,b′ )=1 n n
2
     
ϵ d 1 1
= 1 − TV N 0, , N ϵ, ,
4 n n

where the last step applies (7.11) for f-divergence between product distributions that only differ
in one coordinate. Setting ϵ = √1n and by the scale-invariance of TV, we get the desired R∗ ≳ dn .
Next, let’s consider the loss function kθb − θb′ k∞ . In the same setup, we only kθb − θb′ k∞ ≥
′ ∗ √1 , which does not depend on d. In fact, R∗ 
d dH (b, b ). Then Assouad’s lemma yields R ≳
ϵ
q n
log d
n as shown in Corollary 28.8. In the next section, we will discuss Fano’s method which can
resolve this deficiency.

31.3 Assouad’s lemma from the Mutual Information Method


One can integrate the Assouad’s idea into the mutual information method. Consider the Bayesian
i.i.d.
setting of Theorem 31.2, where bd = (b1 , . . . , bd ) ∼ Ber( 12 ). From the rate-distortion function of
the Bernoulli source (Section 26.1.1), we know that for any b̂d and τ > 0 there is some τ ′ > 0
such that

I(bd ; X) ≤ d(1 − τ ) log 2 =⇒ E[dH (b̂d , bd )] ≥ dτ ′ . (31.6)

Here τ ′ is related to τ by τ log 2 = h(τ ′ ). Thus, using the same “hypercube embedding b → θb ”,
the bound similar to (31.5) will follow once we can bound I(bd ; X) away from d log 2.
Can we use the pairwise total variation bound in (31.5) to do that? Yes! Notice that thanks to
the independence of bi ’s we have1

I(bi ; X|bi−1 ) = I(bi ; X, bi−1 ) ≤ I(bi ; X, b\i ) = I(bi ; X|b\i ) .

1
Equivalently, this also follows from the convexity of the mutual information in the channel (cf. Theorem 5.3).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-599


i i

31.4 Fano’s method 599

Applying the chain rule leads to the upper bound


X
d X
d
I(bd ; X) = I(bi ; X|bi−1 ) ≤ I(bi ; X|b\i ) ≤ d log 2 max TV(PX|bd =b , PX|bd =b′ ) , (31.7)
dH (b,b′ )=1
i=1 i=1

where in the last step we used the fact that whenever B ∼ Ber(1/2),
I(B; X) ≤ TV(PX|B=0 , PX|B=1 ) log 2 , (31.8)
which follows from (7.39) by noting that the mutual information is expressed as the Jensen-
Shannon divergence as 2I(B; X) = JS(PX|B=0 , PX|B=1 ). Combining (31.6) and (31.7), the mutual
information method implies the following version of the Assouad’s lemma: Under the assumption
of Theorem 31.2,
   
βd −1 (1 − t) log 2
inf sup Eθ ℓ(θ, θ̂) ≥ ·f max TV(Pθ , Pθ′ ) , f(t) ≜ h (31.9)
θ̂ θ∈Θ 4α dH (θ,θ ′ )=1 2
where h−1 : [0, log 2] → [0, 1/2] is the inverse of the binary entropy function. Note that (31.9) is
slightly weaker than (31.5). Nevertheless, as seen in Example 31.4, Assouad’s lemma is typically
applied when the pairwise total variation is bounded away from one by a constant, in which case
(31.9) and (31.5) differ by only a constant factor.
In all, we may summarize Assouad’s lemma as a convenient method for bounding I(bd ; X) away
from the full entropy (d bits) on the basis of distances between PX|bd corresponding to adjacent
bd ’s.

31.4 Fano’s method


In this section we discuss another method for proving minimax lower bound by reduction to multi-
ple hypothesis testing. To this end, assume that the loss function is a metric. The idea is to consider
an ϵ-packing (Chapter 27) of the parameter space, namely, a finite collection of parameters whose
minimum separation is ϵ. Suppose we can show that given data one cannot reliably distinguish
these hypotheses. Then the best estimation error is at least proportional to ϵ. The impossibility of
testing is often shown by applying Fano’s inequality (Corollary 3.13), which bounds the probabil-
ity of error of testing in terms of the mutual information in Section 3.6. As such, we refer to this
program Fano’s method. The following is a precise statement.

Theorem 31.3 Let d be a metric on Θ. Fix an estimator θ̂. For any T ⊂ Θ and ϵ > 0,
h ϵi C(T) + log 2
P d(θ, θ̂) ≥ ≥1− , (31.10)
2 log M(T, d, ϵ)
where C(T) ≜ sup I(θ; X) is the capacity of the channel from θ to X with input space T, with the
supremum taken over all distributions (priors) on T. Consequently,
 ϵ r  C(T) + log 2

inf sup Eθ [d(θ, θ̂) ] ≥ sup
r
1− , (31.11)
θ̂ θ∈Θ T⊂Θ,ϵ>0 2 log M(T, d, ϵ)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-600


i i

600

Proof. It suffices to show (31.10). Fix T ⊂ Θ. Consider an ϵ-packing T′ = {θ1 , . . . , θM } ⊂ T such


that mini̸=j d(θi , θj ) ≥ ϵ. Let θ be uniformly distributed on this packing and X ∼ Pθ conditioned
on θ. Given any estimator θ̂, construct a test by rounding θ̂ to θ̃ = argminθ∈T′ d(θ̂, θ). By triangle
inequality, d(θ, θ̃) ≤ 2d(θ, θ̂). Thus P[θ 6= θ̃] ≤ P[d(θ, θ̃) ≥ ϵ/2]. On the other hand, applying
Fano’s inequality (Corollary 3.13) yields

I(θ; X) + log 2
P[θ 6= θ̃] ≥ 1 − .
log M

The proof of (31.10) is completed by noting that I(θ; X) ≤ C(T).

In applying Fano’s method, since it is often difficult to evaluate the capacity C(T), it is useful
to recall from Theorem 5.9 that C(T) coincides with the KL radius of the set of distributions
{Pθ : θ ∈ T}, namely, C(T) ≜ infQ supθ∈T D(Pθ kQ). As such, choosing any Q leads to an upper
bound on the capacity. As an application, we revisit the d-dimensional GLM in Corollary 28.8
under the ℓq -loss (1 ≤ q ≤ ∞), with the particular focus on the dependency on the dimension.
(For a different application in sparse setting see Exercise VI.12.)

Example 31.5 Consider GLM with sample size n, where Pθ = N (θ, Id )⊗n . Taking natural
logs here and below, we have
n
D(Pθ kPθ′ ) = kθ − θ′ k22 ;
2
in other words, KL-neighborhoods are ℓ2 -balls. As such, let us apply Theorem 31.3 to T = B2 (ρ)
2
for some ρ > 0 to be specified. Then C(T) ≤ supθ∈T D(Pθ kP0 ) = nρ2 . To bound the packing
number from below, we applying the volume bound in Theorem 27.3,
 d
ρd vol(B2 ) cq ρd1/q
M(B2 (ρ), k · kq , ϵ) ≥ d ≥ √
ϵ vol(Bq ) ϵ d

for some
p constant cq ,c where the last step follows the volume formula (27.13) for ℓq -balls. Choosing
ρ = d/n and ϵ = eq2 ρd1/q−1/2 , an application of Theorem 31.3 yields the minimax lower bound

d1/q
Rq ≡ inf sup Eθ [kθ̂ − θkq ] ≥ Cq √ (31.12)
θ̂ θ∈Rd n

for some constant Cq depending on q. This is the same lower bound as that in (30.9) obtained via
the mutual information method plus the Shannon lower bound (which is also volume-based).
For any q ≥ 1, (31.12) is rate-optimal since we can apply the MLE θ̂ = X̄. (Note that at q = ∞,
pq = ∞, (31.12)
the constant Cq is still finite since vol(B∞ ) = 2d .) However, for the special case of
does not depend on the dimension at all, as opposed to the correct dependency log d shown in
Corollary 28.8. In fact, previously in Example 31.4 the application of Assouad’s lemma yields
the same suboptimal result. So is it possible to fix this looseness with Fano’s method? It turns out
that the answer is yes and the suboptimality is due to the volume bound on the metric entropy,
which, as we have seen in Section 27.3, can be ineffective if ϵ scales with dimension. Indeed, if

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-601


i i

31.4 Fano’s method 601

q q
c log d
we apply the tight bound of M(B2 , k · k∞ , ϵ) in (27.18),2 with ϵ = and ρ = c′ logn d for
q n

some absolute constants c, c′ , we do get R∞ ≳ logn d as desired.


We end this section with some comments regarding the application of Theorem 31.3:

• It is sometimes convenient to further bound the KL radius by the KL diameter, since C(T) ≤
diamKL (T) ≜ supθ,θ′ ∈T D(Pθ′ kPθ ) (cf. Corollary 5.8). This suffices for Example 31.5.
• In Theorem 31.3 we actually lower bound the global minimax risk by that restricted on a param-
eter subspace T ⊂ Θ for the purpose of controlling the mutual information, which is often
difficult to compute. For the GLM considered in Example 31.5, the KL divergence is propor-
tional to squared ℓ2 -distance and T is naturally chosen to be a Euclidean ball. For other models
such as the covariance model (Exercise VI.16) wherein the KL divergence is more complicated,
the KL neighborhood T needs to be chosen carefully. Later in Section 32.4 we will apply the
same Fano’s method to the infinite-dimensional problem of estimating smooth density.

2
In fact, in this case we can also choose the explicit packing {ϵe1 , . . . , ϵed }.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-602


i i

32 Entropic bounds for statistical estimation

So far our discussion on information-theoretic methods have been mostly focused on statistical
lower bounds (impossibility results), with matching upper bounds obtained on a case-by-case basis.
In this chapter, we will discuss three information-theoretic upper bounds for statistical estimation.
These three results apply to different loss functions and are obtained using completely different
means. However, they take on exactly the same form involving the appropriate metric entropy of
the model.
Specifically, suppose that we observe X1 , . . . , Xn drawn independently from a distribution Pθ for
some unknown parameter θ ∈ Θ, and the goal is to produce an estimate P̂ for the true distribution
Pθ . We have the following entropic minimax upper bounds:

• KL loss (Yang-Barron [464]):


 
1
inf sup Eθ [D(Pθ kP̂)] ≲ inf ϵ + log NKL (P, ϵ) .
2
(32.1)
P̂ θ∈Θ ϵ>0 n
• Hellinger loss (Le Cam-Birgé [273, 53]):
 
1
inf sup Eθ [H (Pθ , P̂)] ≲ inf ϵ + log NH (P, ϵ) .
2 2
(32.2)
P̂ θ∈Θ ϵ>0 n
• Total variation loss (Yatracos [465]):
 
1
inf sup Eθ [TV2 (Pθ kP̂)] ≲ inf ϵ2 + log NTV (P, ϵ) . (32.3)
P̂ θ∈Θ ϵ>0 n

Here N(P, ϵ) refers to the metric entropy (cf. Chapter 27) of the model class P = {Pθ : θ ∈ Θ}
under various distances, which we will formalize along the way.
In particular, we will see that these methods achieve minimax optimal rates for the classical
problem of density estimation under smoothness constraints. To place these results in the bigger
context, we remind that we have already discussed modern methods of density estimation based
on machine learning ideas (Examples 4.2 and 7.5). However, those methods, beautiful and empir-
ically successful, are not known to achieve optimality over any reasonable classes. The metric
entropy methods as above, though, could and should be used to derive fundamental limits for the
classes which are targeted by the machine learning methods. Thus, there is a rich field of modern
applications, which this chapter will hopefully welcome the reader to explore.
We note that there are other entropic upper bound for statistical estimation, notably, MLE and
other M-estimators. This require different type of metric entropy (bracketing entropy, which is

602

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-603


i i

32.1 Yang-Barron’s construction 603

akin to metric entropy under the sup norm) and the style of analysis is more related in spirit to the
theory of empirical processes (e.g. Dudley’s entropy integral (27.22)). We refer the readers to the
monographs [332, 429, 431] for details. In this chapter we focus on more information-theoretic
style results.

32.1 Yang-Barron’s construction


Let P = {Pθ : θ ∈ Θ} be a parametric family of distributions on the space X . Given Xn =
i.i.d.
(X1 , . . . , Xn ) ∼ Pθ for some θ ∈ Θ, we obtain an estimate P̂ = P̂(·|Xn ), which is a distribution
depending on Xn . The loss function is the KL divergence D(Pθ kP̂).1 The average risk is thus
Z  
Eθ D(Pθ kP̂) = D Pθ kP̂(·|Xn ) P⊗n (dxn ).

If the family has a common dominating measure μ, the problem is equivalent to estimate the
density pθ = dP dμ , commonly referred to as the problem of density estimation in the statistics
θ

literature.
Our objective is to prove the upper bound (32.1) for the minimax KL risk
R∗KL (n) ≜ inf sup Eθ D(Pθ kP̂), (32.4)
P̂ θ∈Θ

where the infimum is taken over all estimators P̂ = P̂(·|Xn ) which is a distribution on X ; in
other words, we allow improper estimates in the sense that P̂ can step outside the model class P .
Indeed, the construction we will use in this section (such as predictive density estimators (Bayes)
or their mixtures) need not be a member of P . Later we will see in Sections 32.2 and 32.3 that for
total variation and Hellinger loss we can always restrict to proper estimators;2 however these loss
functions are weaker than the KL divergence.
The main result of this section is the following.

Theorem 32.1 Let Cn denotes the capacity of the channel θ 7→ Xn ∼ P⊗ n


θ , namely

Cn = sup I(θ; Xn ), (32.5)


where the supremum is over all distributions (priors) of θ taking values in Θ. Denote by

NKL (P, ϵ) ≜ min N : ∃Q1 , . . . , QN s.t. ∀θ ∈ Θ, ∃i ∈ [N], D(Pθ kQi ) ≤ ϵ2 . (32.6)
the KL covering number for the class P . Then
Cn+1
R∗KL (n) ≤ (32.7)
n+1

1
Note the asymmetry in this loss function. Alternatively the loss D(P̂kP) is typically infinite in nonparametric settings,
because it is impossible to estimate the support of the true density exactly.
2
This is in fact a generic observation: Whenever the loss function satisfies an approximate triangle inequality, any improper
estimate can be converted to a proper one by its project on the model class whose risk is inflated by no more than a
constant factor.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-604


i i

604

 
1
≤ inf ϵ2 + log NKL (P, ϵ) . (32.8)
ϵ>0 n+1
Conversely,
X
n
R∗KL (t) ≥ Cn+1 . (32.9)
t=0

Note that the capacity Cn is precisely the redundancy (13.10) which governs the minimax regret
in universal compression; the fact that it bounds the KL risk can be attributed to a generic relation
between individual and cumulative risks which we explain later in Section 32.1.4. As explained in
Chapter 13, it is in general difficult to compute the exact value of Cn even for models as simple as
Bernoulli (Pθ = Ber(θ)). This is where (32.8) comes in: one can use metric entropy and tools from
Chapter 27 to bound this capacity, leading to useful (and even optimal) risk bounds. We discuss
two types of applications of this result.

Finite-dimensional models Consider a family P = {Pθ : θ ∈ Θ} of smooth parametrized


densities, where Θ ⊂ Rd is some compact set. Suppose that the KL-divergence behaves like
squared norm, namely, D(Pθ kPθ′ )  kθ − θ′ k2 for any θ, θ′ ∈ Θ and some norm k · k on Rd .
(For example, for GLM with Pθ = N (θ, Id ), we have D(Pθ kPθ′ ) = 12 kθ − θ′ k22 .). In this case, the
KL covering numbers inherits the usual behavior of metric entropy in finite-dimensional space
(cf. Theorem 27.3 and Corollary 27.4) and we have
 d
1
NKL (P, ϵ) ≲ .
ϵ
Then (32.8) yields
 
1
Cn ≲ inf nϵ + d log
2
 d log n, (32.10)
ϵ>0 ϵ
d
which is consistent with the typical asymptotics of redundancy Cn = 2 log n + o(log n) (recall
(13.24) and (13.25)).
Applying the upper bound (32.7) or (32.8) yields
d log n
R∗KL (n) ≲ .
n
d
As compared to the usual parametric rate of n in d dimensions (e.g. GLM), this upper bound is
suboptimal only by a logarithmic factor.

Infinite-dimensional models Similar to the results in Section 27.4, for nonparametric models
NKL (ϵ) typically grows super-polynomially in 1ϵ and, in turn, the capacity Cn grows super-
logarithmically. In fact, whenever we have Cn = nα polylog(n) for some α > 0 where (log n)c0 ≤
polylog(n) ≤ (log n)c1 for some absolute c0 , c1 , Theorem 32.1 shows the minimax KL rate satisfies

R∗KL (n) = nα−1 polylog(n) (32.11)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-605


i i

32.1 Yang-Barron’s construction 605

which easily follows from combining (32.7) and (32.8) – see (32.27) for details. For concrete
examples, see Section 32.4 for the application of estimating smooth densities.
Next, we explain the intuition behind and the proof of Theorem 32.1.

32.1.1 Bayes risk as conditional mutual information and capacity bound


To gain some insight, let us start by considering the Bayesian setting with a prior π on Θ. Condi-
i.i.d.
tioned on θ ∼ π, the data Xn = (X1 , . . . , Xn ) ∼ Pθ .3 Any estimator, P̂ = P̂(·|Xn ), is a distribution
on X depending on Xn . As such, P̂ can be identified with a conditional distribution, say, QXn+1 |Xn ,
and we shall do so henceforth. For convenience, let us introduce an (unseen) observation Xn+1
that is drawn from the same Pθ and independent of Xn conditioned on θ. In this light, the role of
the estimator is to predict the distribution of the unseen Xn+1 .
The following lemma shows that the Bayes KL risk equals the conditional mutual information
and the Bayes estimator is precisely PXn+1 |Xn (with respect to the joint distribution induced by the
prior), known as the predictive density estimator in the statistics literature.

Lemma 32.2 The Bayes risk for prior π is given by


Z
R∗KL,Bayes (π ) ≜ inf π (dθ)P⊗
θ (dx )D(Pθ kP̂(·|x )) = I(θ; Xn+1 |X ),
n n n n

i.i.d.
where θ ∼ π and (X1 , . . . , Xn+1 ) ∼ Pθ conditioned on θ. The Bayes estimator achieving this infi-
mum is given by P̂Bayes (·|xn ) = PXn+1 |Xn =xn . If each Pθ has a density pθ with respect to some
common dominating measure μ, the Bayes estimator has density:
R Qn+1
π (dθ) i=1 pθ (xi )
p̂Bayes (xn+1 |x ) = R
n
Qn . (32.12)
π (dθ) i=1 pθ (xi )

Proof. The Bayes risk can be computed as follows:


 
inf Eθ,Xn D(Pθ kQXn+1 |Xn ) = inf D(PXn+1 |θ kP̂Xn+1 |Xn |Pθ,Xn )
QXn+1 |Xn QXn+1 |Xn
" #
= EXn inf D(PXn+1 |θ kP̂Xn+1 |Xn |Pθ|Xn )
QXn+1 |Xn

( a)  
= EXn D(PXn+1 |θ kPXn+1 |Xn |Pθ|Xn )
= D(PXn+1 |θ kPXn+1 |Xn |Pθ,Xn )
(b)
= I(θ; Xn+1 |Xn ).
where (a) follows from the variational representation of mutual information (Theorem 4.1 and
Corollary 4.2); (b) invokes the definition of the conditional mutual information (Section 3.4) and

3
Throughout this chapter, we continue to use the conventional notation Pθ for a parametric family of distributions and use
π to stand for the distribution of θ.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-606


i i

606

the fact that Xn → θ → Xn+1 forms a Markov chain, so that PXn+1 |θ,Xn = PXn+1 |θ . In addition, the
Bayes optimal estimator is given by PXn+1 |Xn .

Note that the operational meaning of I(θ; Xn+1 |Xn ) is the information provided by one extra
observation about θ having already obtained n observations. In most situations, since Xn will have
already allowed θ to be consistently estimated as n → ∞, the additional usefulness of Xn+1 is
vanishing. This is made precisely by the following result.

Lemma 32.3 (Diminishing marginal utility in information) n 7→ I(θ; Xn+1 |Xn ) is a


decreasing sequence. Furthermore,
1
I(θ; Xn+1 |Xn ) ≤ I(θ; Xn+1 ). (32.13)
n+1

Proof. In view of the chain rule for mutual information (Theorem 3.7): I(θ; Xn+1 ) =
Pn+1 i−1
i=1 I(θ; Xi |X ), (32.13) follows from the monotonicity. To show the latter, let us consider
a “sampling channel” where the input is θ and the output is X sampled from Pθ . Let I(π )
denote the mutual information when the input distribution is π, which is a concave function in
π (Theorem 5.3). Then

I(θ; Xn+1 |Xn ) = EXn [I(Pθ|Xn )] ≤ EXn−1 [I(Pθ|Xn−1 )] = I(θ; Xn |Xn−1 )

where the inequality follows from Jensen’s inequality, since Pθ|Xn−1 is a mixture of Pθ|Xn .

Lemma 32.3 allows us to prove the converse bound (32.9): Fix any prior π. Since the minimax
risk dominates any Bayes risk (Theorem 28.1), in view of Lemma 32.2, we have
X
n X
n
R∗KL (t) ≥ I(θ; Xt+1 |Xt ) = I(θ; Xn+1 ).
t=0 t=0

Recall from (32.5) that Cn+1 = supπ ∈P(Θ) I(θ; Xn+1 ). Optimizing over the prior π yields (32.9).
Now suppose that the minimax theorem holds for (32.4), so that R∗KL = supπ ∈P(Θ) R∗KL,Bayes (π ).
Lemma 32.2 then allows us to express the minimax risk as the conditional mutual information
maximized over the prior π:

R∗KL (n) = sup I(θ; Xn+1 |Xn ).


π ∈P(Θ)

Thus Lemma 32.3 implies the desired


1
R∗KL (n) ≤ Cn+1 .
n+1
Next, we prove this directly without going through the Bayesian route or assuming the minimax
theorem. The main idea, due to Yang and Barron [464], is to consider Bayes estimators (of the
form (32.12)) but analyze it in the worst case. Fix an arbitrary joint distribution QXn+1 on X n+1 ,
Qn−1
which factorizes as QXn+1 = i=1 QXi |Xi−1 . (This joint distribution is an auxiliary object used only
for constructing an estimator.) For each i, the conditional distribution QXi |Xi−1 defines an estimator

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-607


i i

32.1 Yang-Barron’s construction 607

taking the sample Xi of size i as the input. Taking their Cesàro mean results in the following
estimator operating on the full sample Xn :

1 X
n+1
P̂(·|Xn ) ≜ QXi |Xi−1 . (32.14)
n+1
i=1

Let us bound the worst-case KL risk of this estimator. Fix θ ∈ Θ and let Xn+1 be drawn
⊗(n+1)
independently from Pθ so that PXn+1 = Pθ . Taking expectations with this law, we have
" !#
1 X
n+1
Eθ [D(Pθ kP̂(·|Xn ))] = E D Pθ QXi |Xi−1
n+1
i=1

( a) 1 X
n+1
≤ D(Pθ kQXi |Xi−1 |PXi−1 )
n+1
i=1
(b) 1 ⊗(n+1)
= D(Pθ kQXn+1 ),
n+1
where (a) and (b) follows from the convexity (Theorem 5.1) and the chain rule for KL divergence
(Theorem 2.16(c)). Taking the supremum over θ ∈ Θ bounds the worst-case risk as
1 ⊗(n+1)
R∗KL (n) ≤ sup D(Pθ kQXn+1 ).
n + 1 θ∈Θ
Optimizing over the choice of QXn+1 , we obtain
1 ⊗(n+1) Cn+1
R∗KL (n) ≤ inf sup D(Pθ kQXn+1 ) = ,
n + 1 QXn+1 θ∈Θ n+1
where the last identity applies Theorem 5.9 of Kemperman, completing the proof of (32.7).
Furthermore, Theorem 5.9 asserts that the optimal QXn+1 exists and given uniquely by the capacity-
achieving output distribution P∗Xn+1 . Thus the above minimax upper bound can be attained by
taking the Cesàro average of P∗X1 , P∗X2 |X1 , . . . , P∗Xn+1 |Xn , namely,

1 X ∗
n+1
P̂∗ (·|Xn ) = PXi |Xi−1 . (32.15)
n+1
i=1

Note that in general this is an improper estimate as it steps outside the class P .
In the special case where the capacity-achieving input distribution π ∗ exists, the capacity-
achieving output distribution can be expressed as a mixture over product distributions as P∗Xn+1 =
R ∗ ⊗(n+1)
π (dθ)Pθ . Thus the estimator P̂∗ (·|Xn ) is in fact the average of Bayes estimators (32.12)

under prior π for sample sizes ranging from 0 to n.
Finally, as will be made clear in the next section, in order to achieve the further upper bound
(32.8) in terms of the KL covering numbers, namely R∗KL (n) ≤ ϵ2 + n+1 1 log NKL (P, ϵ), it suffices to
choose the following QXn+1 as opposed to the exact capacity-achieving output distribution: Pick an
ϵ-KL cover Q1 , . . . , QN for P of size N = NKL (P, ϵ) and choose π to be the uniform distribution
PN ⊗(n+1)
and define QXn+1 = N1 j=1 Qj – this was the original construction in [464]. In this case,

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-608


i i

608

applying the Bayes rule (32.12), we see that the estimator is in fact a convex combination P̂(·|Xn ) =
PN
j=1 wj Qj of the centers Q1 , . . . , QN , with data-driven weights given by
Qi−1
1 X
n+1
t=1 Qj (Xt )
wj = PN Qi−1 .
n+1 Qj ( X t )
i=1 j=1 t=1

Again, except for the extraordinary case where P is convex and the centers Qj belong to P , the
estimate P̂(·|Xn ) is improper.

32.1.2 Capacity upper bound via KL covering numbers


As explained earlier, finding the capacity Cn requires solving the difficult optimization problem in
(32.5). In this subsection we prove (32.8) which bounds this capacity by metric entropy. Concep-
tually speaking, both metric entropy and capacity measure the complexity of a model class. The
following result, which applies to a more general setting than (32.5), makes precise their relations.

Theorem 32.4 Let Q = {PB|A=a : a ∈ A} be a collection of distributions on some space B


and denote the capacity C = supPA ∈P(A) I(A; B). Then

C = inf {ϵ2 + log NKL (Q, ϵ)}, (32.16)


ϵ>0

where NKL is the KL covering number defined in (32.6).

Proof. Fix ϵ and let N = NKL (Q, ϵ). Then there exist Q1 , . . . , QN that form an ϵ-KL cover, such
that for any a ∈ A there exists i(a) ∈ [N] such that D(PB|A=a kQi(a) ) ≤ ϵ2 . Fix any PA . Then

I(A; B) = I(A, i(A); B) = I(i(A); B) + I(A; B|i(A))


≤ H(i(A)) + I(A; B|i(A)) ≤ log N + ϵ2 .

where the last inequality follows from that i(A) takes at most N values and, by applying
Theorem 4.1,

I(A; B|i(A)) ≤ D PB|A kQi(A) |Pi(A) ≤ ϵ2 .

For the lower bound, note that if C = ∞, then in view of the upper bound above, NKL (Q, ϵ) = ∞
for any ϵ and (32.16) holds with equality. If C < ∞, Theorem 5.9 shows that C is the KL radius of
Q, namely, there exists P∗B , such that C = supPA ∈P(A) D(PB|A kP∗B |PA ) = supx∈A D(PB|A kP∗B |PA ).

In other words, NKL (Q, C + δ) = 1 for any δ > 0. Sending δ → 0 proves the equality of
(32.16).

Next we specialize Theorem 32.4 to our statistical setting (32.5) where the input A is θ and the
output B is Xn ∼ Pθ . Recall that P = {Pθ : θ ∈ Θ}. Let Pn ≜ {P⊗
i.i.d.
θ : θ ∈ Θ}. By tensorization of
n
⊗n ⊗n
KL divergence (Theorem 2.16(d)), D(Pθ kPθ′ ) = nD(Pθ kPθ′ ). Thus
 
ϵ
NKL (Pn , ϵ) ≤ NKL P, √ .
n

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-609


i i

32.1 Yang-Barron’s construction 609

Combining this with Theorem 32.4, we obtain the following upper bound on the capacity Cn in
terms of the KL metric entropy of the (single-letter) family P :

Cn ≤ inf nϵ2 + log NKL (P, ϵ) . (32.17)
ϵ>0

This proves (32.8), completing the proof of Theorem 32.1.

32.1.3 Bounding capacity and KL covering number using Hellinger entropy


Recall that in order to deduce from (32.9) concrete bounds on the minimax KL risk, such as
(32.11), one needs to have matching upper and lower bounds on the capacity Cn . Although Theo-
rem 32.4 characterizes capacity in terms of the KL covering numbers, working with the latter is
not convenient as it is not a metric so that results developed in Chapter 27 such as Theorem 27.2
do not apply. Next, we give bounds on the KL covering number and the capacity Cn using metric
entropy with respect to the Hellinger distance, which are tight up to logarithmic factors under mild
conditions.

Theorem 32.5 Let P = {Pθ : θ ∈ Θ} and MH (ϵ) ≡ M(P, H, ϵ) the Hellinger packing number
of the set P , cf. (27.2). Then Cn defined in (32.5) satisfies
 
log e 2
Cn ≥ min nϵ , log MH (ϵ) − log 2 (32.18)
2

Proof. The idea of the proof is simple. Given a packing θ1 , . . . , θM ∈ Θ with pairwise distances
2
H2 (Qi , Qj ) ≥ ϵ2 for i 6= j, where Qi ≡ Pθi , we know that one can test Q⊗ n ⊗n
i vs Qj with error e
− nϵ2
,
nϵ 2
cf. Theorem 7.8 and Theorem 32.8. Then by the union bound, if Me− 2 < 12 , we can distinguish
these M hypotheses with error < 12 . Let θ ∼ Unif(θ1 , . . . , θM ). Then from Fano’s inequality we
get I(θ; Xn ) ≳ log M.
To get sharper constants, though, we will proceed via the inequality shown in Ex. I.58. In the
notation of that exercise we take λ = 1/2 and from Definition 7.24 we get that
1
D1/2 (Qi , Qj ) = −2 log(1 − H2 (Qi , Qj )) ≥ H2 (Qi , Qj ) log e ≥ ϵ2 log e i 6= j .
2
By the tensorization property (7.79) for Rényi divergence, D1/2 (Q⊗ n ⊗n
i , Qj ) = nD1/2 (Qi , Qj ) and
we get by Ex. I.58
 
X
M
1 X
M
1 n n o
I(θ; Xn ) ≥ − log  exp − D1/2 (Qi , Qj ) 
M M 2
i=1 j=1

X
M  
( a) 1M − 1 − nϵ22 1
≥− log e +
M M M
i=1
XM    
1 − nϵ2
2 1 − nϵ2
2 1
≥− log e + = − log e + ,
M M M
i=1

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-610


i i

610

where in (a) we used the fact that pairwise distances are all ≥ nϵ2 except when i = j. Finally, since
A + B ≤ min(A,B) we conclude the result.
1 1 2

Note that, from the joint range (7.33) that D(PkQ) ≥ H2 (P, Q), a different (weaker) lower
bound on the KL risk also follows from Section 32.2.4 below.
Next we proceed to the converse of Theorem 32.5. The KL and Hellinger covering numbers
always satisfy

NKL (P, ϵ) ≥ NH (ϵ) ≡ N(P, H, ϵ). (32.19)

We next show that, assuming that the class P has a finite radius in Rényi divergence, (32.19)
and hence the capacity bound in Theorem 32.5 are tight up to logarithmic factors. Later in Sec-
tion 32.4 we will apply these results to the class of smooth densities, which has a finite χ2 -radius
(by choosing the uniform distribution as the center).

Theorem 32.6 Suppose that the family P has a finite Dλ radius for some λ > 1, i.e.
Rλ (P) ≜ inf sup Dλ (PkU) < ∞ , (32.20)
U P∈P

where Dλ is the Rényi divergence of order λ (see Definition 7.24). Then there exist ϵ0 and c
depending only on λ and Rλ , such that for all ϵ ≤ ϵ0 ,
r !
1
NKL P, cϵ log ≤ NH (ϵ) (32.21)
ϵ

and, consequently,
 
1
Cn ≤ inf 2
cnϵ log + log NH (ϵ) . (32.22)
ϵ≤ϵ0 ϵ

Proof. Let Q1 , . . . , QM be an ϵ-covering of P such that for any P ∈ P , there exists i ∈ [M] such
that H2 (P, Qi ) ≤ ϵ2 . Fix an arbitrary U and let Pi = ϵ2 U + (1 − ϵ2 )Qi . Applying Exercise I.59
yields
 
2λ 1
D(PkPi ) ≤ 24ϵ 2
log + Dλ (PkU) .
λ−1 ϵ
Optimizing over U to approach (32.20) proves (32.21). Finally, (32.22) follows from applying
(32.21) to (32.17).

32.1.4 General bounds between cumulative and individual (one-step) risks


In summary, we can see that the beauty of the Yang-Barron method lies in two ideas:

• Instead of directly studying the risk R∗KL (n), (32.7) relates it to a cumulative risk Cn
• The cumulative risk turns out to be equal to a capacity, which can be conveniently bounded in
terms of covering numbers.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-611


i i

32.1 Yang-Barron’s construction 611

In this subsection we want to point out that while the second step is very special to KL (log-loss),
the first idea is generic. Namely, we have the following relationship between individual risk (also
known as batch loss) and cumulative risk (also known as online loss), which were previously
introduced in Section 13.6 in the context of universal compression.

Proposition 32.7 (Online-to-batch conversion) Fix a loss function ℓ : P(X )×P(X ) →


R̄ and a class Π of distributions on X . Define the cumulative and one-step minimax risks as
follows:4
" n #
X
t−1
Cn = inf sup E ℓ(P, P̂t (X )) (32.23)
{P̂t (·)} P∈Π
t=1
h i
R∗n = inf sup E ℓ(P, P̂(Xn−1 )) (32.24)
P̂(·) P∈Π

where both infima are over measurable (possibly randomized) estimators P̂t : X t−1 → P(X ), and
i.i.d.
the expectations are over Xi ∼ P and the randomness of the estimators. Then we have
X
n
nR∗n ≤ Cn ≤ Cn−1 + R∗n ≤ R∗t . (32.25)
t=1
Pn−1
Thus, if the sequence {R∗n } satisfies R∗n  1n t=0 R∗t then Cn  nR∗n . Conversely, if nα− ≲ Cn ≲
nα+ for all n and some α+ ≥ α− > 0, then
α
(α− −1) α+
n − ≲ R∗n ≲ nα+ −1 . (32.26)

Remark 32.1 The meaning of the above is that R∗n ≈ 1


n Cn within either constant or
polylogarithmic factors, for most cases of interest.
Proof. To show the first inequality in (32.25), given predictors {P̂t (Xt−1 ) : t ∈ [n]} for Cn , con-
sider a randomized predictor P̂(Xn−1 ) for R∗n that equals each of the P̂t (Xt−1 ) with equal probability.
P
The second inequality follows from interchanging supP and t via:
" n # " n−1 #
X X h i
t− 1
sup E ℓ(P, P̂t (X )) ≤ sup E ℓ(P, P̂t (X )) + sup E ℓ(P, P̂n (Xn−1 )) .
t−1
P∈Π t=1 P∈Π t=1 P∈Π

(In other words, Cn is bounded by using the Cn−1 -optimal online learner for first n − 1 rounds and
the R∗n -optimal batch learner for the last round.) The third inequality in (32.25) follows from the
second by induction and C1 = R∗1 .
To derive (32.26) notice that the upper bound on R∗n follows from (32.25). For the lower bound,
notice that the sequence R∗n is non-increasing and hence we have for any n < m
X
m−1 X
n−1
Ct
Cm ≤ R∗t ≤ + (m − n)R∗n . (32.27)
t
t=0 t=0

4
Note that for KL loss, Cn and R∗n coincide with AvgReg∗n and BatchReg∗n defined in (13.34) and (13.35), respectively.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-612


i i

612

α+

Setting m = an α− with some appropriate constant a yields the lower bound.

32.2 Pairwise comparison à la Le Cam-Birgé


When we proved the lower bound in Theorem 31.3, we applied the reasoning that if an ϵ-packing
of the parameter space Θ cannot be tested, then θ ∈ Θ cannot be estimated more than precision ϵ,
thereby establishing a minimax lower bound in terms of the KL metric entropy. Conversely, we
can ask the following question:
Is it possible to construct an estimator based on tests, and produce a minimax upper bound in terms of the metric
entropy?

For Hellinger loss, the answer is yes, although the metric entropy involved is with respect to
the Hellinger distance not KL divergence. The basic construction is due to Le Cam and further
developed by Birgé. The main idea is as follows: Fix an ϵ-covering {P1 , . . . , PN } of the set of
distributions P . Given n observations drawn from P ∈ P , let us test which ball P belongs to;
this allows us to estimate P up to Hellinger loss ϵ. This can be realized by a pairwise comparison
argument of testing the (composite) hypothesis P ∈ B(Pi , ϵ) versus P ∈ B(Pj , ϵ). This program
can be further refined to involve on the local entropy of the model.

32.2.1 Composite hypothesis testing and Hellinger distance


Recall the problem of composite hypothesis testing introduced in Section 16.4. Let P and Q be two
(not necessarily convex) classes of distributions. Given iid observations X1 , . . . , Xn drawn from
some distribution P, we want to test, according some decision rule ϕ = ϕ(X1 , . . . , Xn ) ∈ {0, 1},
whether P ∈ P (indicated by ϕ = 0) or P ∈ Q (indicated by ϕ = 1). By the minimax theorem,
the optimal error is given by the total variation between the worst-case mixtures:
 
min sup P(ϕ = 1) + sup Q(ϕ = 0) = 1 − TV(co(P ⊗n ), co(Q⊗n )), (32.28)
ϕ P∈P Q∈Q

wherein the notations are explained as follows:

• P ⊗n ≜ {P⊗n : P ∈ P} consists of all n-fold products of distributions in P ;


• co(·) denotes the convex hull, that is, the set of all mixtures. For example, for a parametric
R
family, co({Pθ : θ ∈ Θ}) = {Pπ : π ∈ P(Θ)}, where Pπ = Pθ π (dθ) is the mixture under the
mixing distribution π, and P(Θ) denotes the collection of all distributions (priors) on Θ.

The optimal test that achieves (32.28) is the likelihood ratio given by the worst-case mixtures, that
is, the closest5 pair of mixture (P∗n , Q∗n ) such that TV(P∗n , Q∗n ) = TV(co(P ⊗n ), co(Q⊗n )).

5
In case the closest pair does not exist, we can replace it by an infimizing sequence.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-613


i i

32.2 Pairwise comparison à la Le Cam-Birgé 613

The exact result (32.28) is unwieldy as the RHS involves finding the least favorable priors over
the n-fold product space. However, there are several known examples where much simpler and
explicit results are available. In the case when P and Q are TV-balls around P0 and Q0 , Huber [221]
showed that the minimax optimal test has the form
( n )
X   dP0 
n ′ ′′
ϕ(x ) = 1 min c , max c , log ( Xi ) >t .
dQ0
i=1

(See also Ex. III.31.) However, there are few other examples where minimax optimal tests are
known explicitly. Fortunately, as was shown by Le Cam, there is a general “single-letter” upper
bound in terms of the Hellinger separation between P and Q. It is the consequence of the more
general tensorization property of Rényi divergence in Proposition 7.25 (of which Hellinger is a
special case).

Theorem 32.8
 
≤ e− 2 H
n 2
(co(P),co(Q))
min sup P(ϕ = 1) + sup Q(ϕ = 0) , (32.29)
ϕ P∈P Q∈Q

Remark 32.2 For the case when P and Q are Hellinger balls of radius r around P0 and
Q0 (which arises in the proof of Theorem 32.9 below), respectively, Birgé [56] constructed an
explicit test.
nPNamely, under the assumption
o H(P0 , Q0 ) > 2.01r, there q is a test of the form
n n α+βψ(Xi ) −Ω(nr2 ) dP0
ϕ(x ) = 1 i=1 log β+αψ(Xi ) > t attaining error e , where ψ(x) = dQ 0
(x) and α, β > 0
depend only on H(P0 , Q0 ).

Remark 32.3 Here is an example where Theorem 32.8 is (very) loose. Consider P =
{Ber(1/2)} and Q = {Ber(0), Ber(1)}. Then co(P) ⊂ co(Q)and so the upper bound in (32.29) is
trivial. On the other hand, the test that declares P ∈ Q if we see all 0’s or all 0’s has exponentially
small probability of error.

Proof. We start by restating the special case of Proposition 7.25:


! !! n  
1 2 O n On Y 1
1 − H co Pi , co Qi ≤ 1 − H2 (co(Pi ), co(Qi )) . (32.30)
2 2
i=1 i=1 i=1

Then from (32.28) we get


( a) 1
1 − TV(co(P ⊗n ), co(Q⊗n )) ≤ 1 − H2 (co(P ⊗n ), co(Q⊗n ))
2
 n
(b) 1 2
≤ 1 − H (co(P), co(Q)) ≤ e− 2 H (co(P),co(Q))
n 2

where (a) follows from (7.22); (b) follows from (32.30).

In the sequel we will apply Theorem 32.8 to two disjoint Hellinger balls (both are convex).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-614


i i

614

32.2.2 Hellinger guarantee on Le Cam-Birgé’s pairwise comparison estimator


The idea of constructing estimator based on pairwise tests is due to Le Cam ([273], see also [430,
Section 10]) and Birgé [53]. We are given n i.i.d. observations X1 , · · · , Xn generated from P, where
P ∈ P is the distribution to be estimated. Here let us emphasize that P need not be a convex set. Let
the loss function between the true distribution P and the estimated distribution P̂ be their squared
Hellinger distance, i.e.

ℓ(P, P̂) = H2 (P, P̂).

Then, we have the following result:

Theorem 32.9 (Le Cam-Birgé) Denote by NH (P, ϵ) the ϵ-covering number of the set P
under the Hellinger distance (cf. (27.1)). Let ϵn be such that

nϵ2n ≥ log NH (P, ϵn ) ∨ 1.

Then there exists an estimator P̂ = P̂(X1 , . . . , Xn ) taking values in P such that for any t ≥ 1,

sup P[H(P, P̂) > 4tϵn ] ≲ e−t


2
(32.31)
P∈P

and, consequently,

sup EP [H2 (P, P̂)] ≲ ϵ2n (32.32)


P∈P

Proof of Theorem 32.9. It suffices to prove the high-probability bound (32.31). Abbreviate ϵ =
ϵn and N = NH (P, ϵn ). Let P1 , · · · , PN be a maximal ϵ-packing of P under the Hellinger distance,
which also serves as an ϵ-covering (cf. Theorem 27.2). Thus, ∀i 6= j,

H(Pi , Pj ) ≥ ϵ,

and for ∀P ∈ P , ∃i ∈ [N], s.t.

H(P, Pi ) ≤ ϵ,

Denote B(P, ϵ) = {Q : H(P, Q) ≤ ϵ} denote the ϵ-Hellinger ball centered at P. Crucially,


Hellinger ball is convex thanks to the convexity of squared Hellinger distance as an f-divergence;
cf. Theorem 7.5. (In contrast, recall from (7.6) that Hellinger distance itself is not convex.) Indeed,
for any P′ , P′′ ∈ B(P, ϵ) and α ∈ [0, 1],

H2 (ᾱP′ + αP′′ , P) ≤ ᾱH2 (P′ , P) + αH2 (P′′ , P) ≤ ϵ2 .

Next, consider the following pairwise comparison problem, where we test two Hellinger balls
(composite hypothesis) against each other:

Hi : P ∈ B(Pi , ϵ) vs Hj : P ∈ B(Pj , ϵ)

for all i 6= j, s.t. H(Pi , Pj ) ≥ δ = 4ϵ.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-615


i i

32.2 Pairwise comparison à la Le Cam-Birgé 615

Since both B(Pi , ϵ) and B(Pj , ϵ) are convex, applying Theorem 32.8 yields a test ψij =
ψij (X1 , . . . , Xn ), with ψij = 0 corresponding to declaring P ∈ B(Pi , ϵ), and ψij = 1 corresponding
to declaring P ∈ B(Pj , ϵ), such that ψij = 1 − ψji and the following large deviations bound holds:
for all i, j, s.t. H(Pi , Pj ) ≥ δ ,

P(ψij = 1) ≤ e− 8 H(Pi ,Pj ) ,


n 2
sup (32.33)
P∈B(Pi ,ϵ)

where we used the triangle inequality of Hellinger distance: for any P ∈ B(Pi , ϵ) and any Q ∈
B(Pj , ϵ),

H(P, Q) ≥ H(Pi , Pj ) − 2ϵ ≥ H(Pi , Pj )/2 ≥ 2ϵ.

For i ∈ [N], define the random variable



maxj∈[N] H2 (Pi , Pj ) s.t. ψij = 1, H(Pi , Pj ) > δ ;
Ti ≜
0, no such j exists.

Basically, Ti records the maximum distance from Pi to those Pj outside the δ -neighborhood of Pi
that is confusable with Pi given the present sample. Our density estimator is defined as

P̂ = Pi∗ , where i∗ ∈ argmin Ti . (32.34)


i∈[N]

Now for the proof of correctness, assume that P ∈ B(P1 , ϵ). The intuition is that, we should
expect, typically, that T1 = 0, and furthermore, Tj ≥ δ 2 for all j such that H(P1 , Pj ) ≥ δ . Note
that by the definition of Ti and the symmetry of the Hellinger distance, for any pair i, j such that
H(Pi , Pj ) ≥ δ , we have

max{Ti , Tj } ≥ H(Pi , Pj ).

Consequently,
n o
H(P̂, P1 )1 H(P̂, P1 ) ≥ δ = H(Pi∗ , P1 )1 {H(Pi∗ , P1 ) ≥ δ}
≤ max{Ti∗ , T1 }1 {max{Ti∗ , T1 } ≥ δ} = T1 1 {T1 ≥ δ},

where the last equality follows from the definition of i∗ as a global minimizer in (32.34). Thus, for
any t ≥ 1,

P[H(P̂, P1 ) ≥ tδ] ≤ P[T1 ≥ tδ]


≤ N(ϵ)e−2nϵ
2 2
t
(32.35)
≲ e− t ,
2
(32.36)
2
where (32.35) follows from (32.33) and (32.36) uses the assumption that nϵ2 ≥ 1 and N ≤ enϵ .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-616


i i

616

32.2.3 Refinement using local entropy


Just like Theorem 32.1, while they are often tight for nonparametric problems with superlogarith-
mically metric entropy, for finite-dimensional models a direct application of Theorem 32.9 results
in a slack by a log factor. For example, for a d-dimensional parametric family, e.g., the Gaussian
location model or its finite mixtures, the metric entropy usually behaves as log NH (ϵ)  d log 1ϵ .
Thus when n ≳ d, Theorem 32.9 entails choosing ϵ2n  dn log nd , which falls short of the parametric
rate E[H2 (P̂, P)] ≲ dn which are typically achievable.
As usual, such a log factor can be removed using the local entropy argument. To this end, define
the local Hellinger entropy:
Nloc (P, ϵ) ≜ sup sup NH (B(P, η) ∩ P, η/2). (32.37)
P∈P η≥ϵ

Theorem 32.10 (Le Cam-Birgé: local entropy version) Let ϵn be such that
nϵ2n ≥ log Nloc (P, ϵn ) ∨ 1. (32.38)
Then there exists an estimator P̂ = P̂(X1 , . . . , Xn ) taking values in P such that for any t ≥ 2,
sup P[H(P, P̂) > 4tϵn ] ≤ e−t
2
(32.39)
P∈P

and hence
sup EP [H2 (P, P̂)] ≲ ϵ2n (32.40)
P∈P

Remark 32.4 (Doubling dimension) Suppose that for some d > 0, log Nloc (P, ϵ) ≤ d log 1ϵ
holds for all sufficiently large small ϵ; this is the case for finite-dimensional models where the
Hellinger distance is comparable with the vector norm by the usual volume argument (Theo-
rem 27.3). Then we say the doubling dimension (also known as the Le Cam dimensionLe Cam
dimension|see doubling dimension [430]) of P is at most d; this terminology comes from the
fact that the local entropy concerns covering Hellinger balls using balls of half the radius. Then
Theorem 32.10 shows that it is possible to achieve the “parametric rate” O( dn ). In this sense, the
doubling dimension serves as the effective dimension of the model P .

Lemma 32.11 For any P ∈ P and η ≥ ϵ and k ≥ Z+ ,


NH (B(P, 2k η) ∩ P, η/2) ≤ Nloc (P, ϵ)k (32.41)

Proof. We proceed by induction on k. The base case of k = 0 follows from the definition (32.37).
For k ≥ 1, assume that (32.41) holds for k − 1 for all P ∈ P . To prove it for k, we construct a cover
of B(P, 2k η) ∩ P as follows: first cover it with 2k−1 η -balls, then cover each ball with η/2-balls. By
the induction hypothesis, the total number of balls is at most
NH (B(P, 2k η) ∩ P, 2k−1 η) · sup NH (B(P′ , 2k−1 η) ∩ P, η/2) ≤ Nloc (ϵ) · Nloc (ϵ)k−1
P′ ∈P

completing the proof.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-617


i i

32.2 Pairwise comparison à la Le Cam-Birgé 617

We now prove Theorem 32.10:


Proof. We analyze the same estimator (32.34) following the proof of Theorem 32.9, except
that the estimate (32.35) is improved as follows: Define the Hellinger shell Ak ≜ {P : 2k δ ≤
H(P1 , P) < 2k+1 δ} and Gk ≜ {P1 , . . . , PN } ∩ Ak . Recall that δ = 4ϵ. Given t ≥ 2, let ℓ = blog2 tc
so that 2ℓ ≤ t < 2ℓ+1 . Then
X
P[T1 ≥ tδ] ≤ P[2k δ ≤ T1 < 2k+1 δ]
k≥ℓ
( a) X
|Gk |e− 8 (2 δ)
n k 2

k≥ℓ
(b) X
Nloc (ϵ)k+3 e−2nϵ 4
2 k

k≥ℓ
( c)
≲ e− 4 ≤ e− t
ℓ 2

where (a) follows from from (32.33); (c) follows from the assumption that log Nloc ≤ nϵ2 and
k ≥ ℓ ≥ log2 t ≥ 1; (b) follows from the following reasoning: since {P1 , . . . , PN } is an ϵ-packing,
we have
|Gk | ≤ M(Ak , ϵ) ≤ N(Ak , ϵ/2) ≤ N(B(P1 , 2k+1 δ) ∩ P, ϵ/2) ≤ Nloc (ϵ)k+3

where the first and the last inequalities follow from Theorem 27.2 and Lemma 32.11 respectively.

As an application of Theorem 32.10, we show that parametric rate (namely, dimension divided
by the sample size) is achievable for models with locally quadratic behavior, such as those smooth
parametric models (cf. Section 7.11 and in particular Theorem 7.23).

Corollary 32.12 Consider a parametric family P = {Pθ : θ ∈ Θ}, where Θ ⊂ Rd and P is


totally bounded in Hellinger distance. Suppose that there exists a norm k · k and constants t0 , c, C
such that for all θ0 , θ1 ∈ Θ with kθ0 − θ1 k ≤ t0 ,
ckθ0 − θ1 k ≤ H(Pθ0 , Pθ1 ) ≤ Ckθ0 − θ1 k. (32.42)
i.i.d.
Then there exists an estimator θ̂ based on X1 , . . . , Xn ∼ Pθ , such that
d
sup Eθ [H2 (Pθ , Pθ̂ )] ≲ .
θ∈Θ n

Proof. It suffices to bound the local entropy Nloc (P, ϵ) in (32.37). Fix θ0 ∈ Θ. Indeed, for any
η > t0 , we have NH (B(Pθ0 , η) ∩ P, η/2) ≤ NH (P, t0 ) ≲ 1. For ϵ ≤ η ≤ t0 ,
( a)
NH (B(Pθ0 , η) ∩ P, η/2) ≤ N∥·∥ (B∥·∥ (θ0 , η/c), η/(2C))
 d
vol(B∥·∥ (θ0 , η/c + η/(2C)))
(b) 2C
≤ = 1+
vol(B∥·∥ (θ0 , η/(2C))) c

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-618


i i

618

where (a) and (b) follow from (32.42) and Theorem 27.3 respectively. This shows that
log Nloc (P, ϵ) ≲ d, completing the proof by applying Theorem 32.10.

32.2.4 Lower bound using local Hellinger packing


It turns out that under certain regularity assumptions we can prove an almost matching lower
bound (typically within a logarithmic term) on the Hellinger risk. First we define the local packing
number as follows:
n ϵ o
Mloc (ϵ) ≡ Mloc (P, H, ϵ) = max M : ∃R, P1 , . . . , PM ∈ P : H(Pi , R) ≤ ϵ, H(Pi , Pj ) ≥ ∀i 6= j .
2
The local packing number is related to the global one M(ϵ) ≡ M(P, H, ϵ) by the following general
lemma that holds for any metric. This result shows that the local and global packing numbers
behave similarly as long as the growth is super polynomial in 1/ϵ (e.g. for those nonparametric
class considered in Section 27.4).

Lemma 32.13
M(ϵ/2)
≤ Mloc (ϵ) ≤ M(ϵ)
M(ϵ)

Proof. The upper bound is obvious. For the lower bound, Let P1 , . . . , PM be a maximal ϵ-packing
for P with M = M(ϵ). Let Q1 , . . . , QM′ be a maximal ϵ/2-packing for P with M′ = M(ϵ/2).
Partition E = {P1 , . . . , PM } into the Voronoi cells centered at each Qi , namely, Ei ≜ {Pj :
H(Pj , Qi ) = mink∈[M] H(Pk , Qi )} (with ties broken arbitrarily), so that E1 , . . . , EM′ are disjoint
and E = ∪i∈[M′ ] Ei . Thus max |Ei | ≥ M/M′ . Finally, note that each Ei ⊂ B(Qi , ϵ) because E is also
an ϵ-covering.
Note that unlike the definition of Nloc in (32.37) we are not taking the supremum over the scale
η ≥ ϵ. For this reason, we cannot generally apply Theorem 27.2 to conclude that Nloc (ϵ) ≥ Mloc (ϵ).
In all instances known to us we have log Nloc  log Mloc , in which case the following general result
provides a minimax lower bound that matches the upper bound in Theorem 32.10 up to logarithmic
factors.

Theorem 32.14 Suppose that the Dλ radius Rλ (P) of the family P is finite for some λ > 1;
cf. (32.20). There exists constants c = c(λ) and ϵ < ϵ0 (λ) such that whenever n and ϵ < ϵ0 are
such that
 
2 1
c(λ)nϵ log 2 + Rλ (P) + 2 log 2 < log Mloc (ϵ), (32.43)
ϵ

any estimator P̂ = P̂(·; Xn ) must satisfy


ϵ2
sup EP [H2 (P, P̂)] ≥ ,
P∈P 32
i.i.d.
where EP is taken with respect to Xn ∼ P.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-619


i i

32.2 Pairwise comparison à la Le Cam-Birgé 619

Proof. Let M = Mloc (P, ϵ). From the definition there exists an ϵ/2-packing P1 , . . . , PM in some
Hellinger ball B(R, ϵ).
i.i.d.
Let θ ∼ Unif([M]) and Xn ∼ Pθ conditioned on θ. Then from Fano’s inequality in the form
of Theorem 31.3 we get

 ϵ 2  I(θ; Xn ) + log 2

sup E[H (P, P̂)] ≥
2
1−
P∈P 4 log M
It remains to show that
I(θ; Xn ) + log 2 1
≤ . (32.44)
log M 2
To that end for an arbitrary distribution U define

Q = ϵ2 U + ( 1 − ϵ2 )R .

We first notice that from Ex. I.59 we have that for all i ∈ [M]
 
λ 1
D(Pi kQ) ≤ 8(H (Pi , R) + 2ϵ )
2 2
log 2 + Dλ (Pi kU)
λ−1 ϵ

provided that ϵ < 2− 2(λ−1) ≜ ϵ0 . Since H2 (Pi , R) ≤ ϵ2 , by optimizing U (as the Dλ -center of P )

we obtain
   
λ 1 c(λ) 2 1
inf max D(Pi kQ) ≤ 24ϵ 2
log 2 + Rλ ≤ ϵ log 2 + Rλ .
U i∈[M] λ−1 ϵ 2 ϵ
By Theorem 4.1 we have
 
nc(λ) 2 1
I(θ; X ) ≤ n
max D(P⊗ n ⊗n
i kQ ) ≤ ϵ log 2 + Rλ .
i∈[M] 2 ϵ
This final bound and condition (32.43) then imply (32.44) and the statement of the theorem.

Finally, we mention that for sufficiently regular models wherein the KL divergence and the
squared Hellinger distances are comparable, the upper bound in Theorem 32.10 based on local
entropy gives the exact minimax rate. Models of this type include GLM and more generally
Gaussian mixture models with bounded centers in arbitrary dimensions [232].

Corollary 32.15 Assume that


H2 (P, P′ )  D(PkP′ ), ∀P, P′ ∈ P.

Then

inf sup EP [H2 (P, P̂)]  ϵ2n


P̂ P∈P

where ϵn was defined in (32.38).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-620


i i

620

Proof. By assumption, KL neighborhoods coincide with Hellinger balls up to constant factors.


Thus the lower bound follows from apply Fano’s method in Theorem 31.3 to a Hellinger ball of
radius O(ϵn ).

32.3 Yatracos’ class and minimum distance estimator


In this section we prove (32.3), the third entropy upper bound on statistical risk. Paralleling the
result (32.1) of Yang-Barron (for KL divergence) and (32.2) of Le Cam-Birgé (for Hellinger dis-
tance), the following result bounds the minimax total variation risk using the metric entropy of
the parameter space in total variation:

Theorem 32.16 (Yatracos [465]) There exists a universal constant C such that the following
i.i.d.
holds. Let X1 , . . . , Xn ∼ P ∈ P , where P is a collection of distributions on a common measurable
space (X , E). For any ϵ > 0, there exists a proper estimator P̂ = P̂(X1 , . . . , Xn ) ∈ P , such that
 
1
sup EP [TV(P̂, P)2 ] ≤ C ϵ2 + log N(P, TV, ϵ) (32.45)
P∈P n

For loss function that is a distance, a natural idea for obtaining proper estimator is the minimum
distance estimator. In the current context, we compute the minimum-distance projection of the
empirical distribution on the model class P :6
Pmin-dist = argmin TV(P̂n , P)
P∈P

1
Pn
where P̂n = ni=1 δXi is the empirical distribution. However, since the empirical distribution is
discrete, this strategy does not make sense if elements of P have densities. The reason for this
degeneracy is because the total variation distance is too strong. The key idea is to replace TV,
which compares two distributions over all measurable sets, by a proxy, which only inspects a
“low-complexity” family of sets.
To this end, let A ⊂ E be a finite collection of measurable sets to be specified later. Define a
pseudo-distance
dist(P, Q) ≜ sup |P(A) − Q(A)|. (32.46)
A∈A

(Note that if A = E , then this is just TV.) One can verify that dist satisfies the triangle inequality.
As a result, the estimator
P̃ ≜ argmin dist(P, P̂n ), (32.47)
P∈P

as a minimizer, satisfies
dist(P̃, P) ≤ dist(P̃, P̂n ) + dist(P, P̂n ) ≤ 2dist(P, P̂n ). (32.48)

6
Here and below, if the minimizer does not exist, we can replace it by an infimizing sequence.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-621


i i

32.3 Yatracos’ class and minimum distance estimator 621

In addition, applying the binomial tail bound and the union bound, we have
C0 log |A|
E[dist(P, P̂n )2 ] ≤ . (32.49)
n
for some absolute constant C0 .
The main idea of Yatracos [465] boils down to the following choice of A: Consider an
ϵ-covering {Q1 , . . . , QN } of P in TV. Define the set
 
dQi dQj
Aij ≜ x : ( x) ≥ ( x)
d( Qi + Qj ) d(Qi + Qj )
and the collection (known as the Yatracos class)

A ≜ {Aij : i 6= j ∈ [N]}. (32.50)

Then the corresponding dist approximates the TV on P , in the sense that

dist(P, Q) ≤ TV(P, Q) ≤ dist(P, Q) + 4ϵ, ∀P, Q ∈ P. (32.51)

To see this, we only need to justify the upper bound. For any P, Q ∈ P , there exists i, j ∈ [N], such
that TV(P, Pi ) ≤ ϵ and TV(Q, Qj ) ≤ ϵ. By the key observation that dist(Qi , Qj ) = TV(Qi , Qj ), we
have

TV(P, Q) ≤ TV(P, Qi ) + TV(Qi , Qj ) + TV(Qj , Q)


≤ 2ϵ + dist(Qi , Qj )
≤ 2ϵ + dist(Qi , P) + dist(P, Q) + dist(Q, Qj )
≤ 4ϵ + dist(P, Q).

Finally, we analyze the estimator (32.47) with A given in (32.50). Applying (32.51) and (32.48)
yields

TV(P̃, P) ≤ dist(P, P̃) + 4ϵ


≤ 2dist(P, P̂n ) + 4ϵ.

Squaring both sizes, taking expectation and applying (32.49), we have


8C0 log |N|
E[TV(P̃, P)2 ] ≤ 32ϵ2 + 8E[dist(P, P̂n )2 ] ≤ 32ϵ2 + .
n
Choosing the optimal TV-covering completes the proof of (32.45).
Remark 32.5 (Robust version) Note that Yatracos’ scheme idea works even if the model
is misspecified, i.e., when the data generating distribution P is outside (but close to) P . Indeed,
denote Q∗ = argminQ∈{Qi } TV(P, Q) and notice that

dist(Q∗ , P̂n ) ≤ dist(Q∗ , P) + dist(P, P̂n ) ≤ TV(P, Q∗ ) + dist(P, P̂n ) ,

since dist(Q, Q′ ) ≤ TV(Q, Q′ ) for any pair of distributions. Then we have:

TV(P̃, P) ≤ TV(P̃, Q∗ ) + TV(Q∗ , P) = dist(P̃, Q∗ ) + TV(Q∗ , P)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-622


i i

622

≤ dist(P̃, P̂n ) + dist(P̂n , P) + dist(P, Q∗ ) + TV(Q∗ , P)


≤ dist(Q∗ , P̂n ) + dist(P̂n , P) + 2TV(P, Q∗ )
≤ 2dist(P, P̂n ) + 3TV(P, Q∗ ) .

Since 3TV(P, Q∗ ) ≤ 3ϵ + 3 minP′ ∈P TV(P, P′ ) we can see that the estimator also works for
“misspecified case”. Surprisingly, the multiplier 3 is not improvable if the estimator is required to
be proper (inside P ), cf. [70].

32.4 Density estimation over Hölder classes


In this section we will talk about the classical problem of nonparametric density estimation prob-
lem under smoothness constraint. Following Theorem 27.14, for brevity denote by F ≡ Fβ (1, 1)
the collection of β -smooth densities f on the unit cube [0, 1]d for some constant d. (In this case the
parameter is simply the density f, so we shall refrain from writing a parametrized form.) Given
i.i.d.
X1 , · · · , Xn ∼ f ∈ F , an estimator of the unknown density f is a function f̂(·) = f̂(·; X1 , . . . , Xn ).
R1
Let us first consider the conventional quadratic risk kf − f̂k22 = 0 (f(x) − f̂(x))2 . Then we will state
the results for Hellinger, KL, and total variation risks.

Theorem 32.17 Given X1 , · · · , Xn i.i.d.


∼ f ∈ F , the minimax quadratic risk over F satisfies

R∗L2 (n; F) ≜ inf sup E kf − f̂k22  n− 2β+d . (32.52)
f̂ f∈F

Capitalizing on the metric entropy of smooth densities studied in Section 27.4, we will prove
this result by applying the entropic upper bound in Theorem 32.1 and the minimax lower bound
based on Fano’s inequality in Theorem 31.3. However, Theorem 32.17 pertains to the L2 rather
than KL risk. This can be fixed by a simple reduction.

Lemma 32.18 Let F ′ denote the collection of f ∈ F which is bounded from below by 1/2.
Then

R∗L2 (n; F ′ ) ≤ R∗L2 (n; F) ≤ 4R∗L2 (n; F ′ ).

Proof. The left inequality follows because F ′ ⊂ F . For the right inequality, we apply a sim-
i.i.d.
ulation argument. Fix some f ∈ F and we observe X1 , . . . , Xn ∼ f. Let us sample U1 , . . . , Un
independently and uniformly from [0, 1]d . Define
(
Ui w.p. 12 ,
Zi =
Xi w.p. 12 .
i.i.d.
Then Z1 , . . . , Zn ∼ g = 12 (1 + f) ∈ F ′ . Let ĝ be an estimator that achieves the minimax risk
R∗L2 (n; F ′ ) on F ′ . Consider the estimator f̂ = 2ĝ − 1. Then kf − f̂k22 = 4kg − ĝk22 . Taking the
supremum over f ∈ F proves R∗L2 (n; F) ≤ 4R∗L2 (n; F ′ ).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-623


i i

32.4 Density estimation over Hölder classes 623

Lemma 32.18 allows us to focus on the subcollection F ′ , where each density is lower bounded
by 1/2. In addition, each β -smooth density in F is also upper bounded by an absolute constant.
Therefore, the KL divergence and squared L2 distance are in fact equivalent on F ′ , i.e.,

D(fkg)  kf − gk22 , f, g ∈ F ′ , (32.53)

as shown by the following lemma:

dQ
Lemma 32.19 Suppose both f = dP
dμ and g = dμ are upper and lower bounded by absolute
constants c and C respectively. Then
Z Z
1 1
dμ(f − g) ≤ H (P, Q) ≤ D(PkQ) ≤ χ (PkQ) ≤
2 2 2
dμ(f − g)2 .
4C c

R R
Proof. For the upper bound, applying (7.34), D(PkQ) ≤ χ2 (PkQ) = dμ (f−gg) ≤ dμ (f−gg) .
2 2
1
c
R R
For the lower bound, applying (7.33), D(PkQ) ≥ H2 (P, Q) = dμ √(f−g√) 2 ≥
2
1
4C dμ(f −
( f+ g)
g) 2 .

We now prove Theorem 32.17:

Proof. In view of Lemma 32.18, it suffices to consider R∗L2 (n; F ′ ). For the upper bound, we have

( a)
R∗L2 (n; F ′ )  R∗KL (n; F ′ )
(b)
 
1 ′
≲ inf ϵ + log NKL (F , ϵ)
2
ϵ>0 n
 
( c) 1 ′
 inf ϵ + log N(F , k · k2 , ϵ)
2
ϵ>0 n
 
(d) 1 2β
 inf ϵ2 + d/β  n− 2β+d .
ϵ>0 nϵ

where both (a) and (c) apply (32.53), so that both the risk and the metric entropy are equivalent
for KL and L2 distance; (b) follows from Theorem 32.1; (d) applies the metric entropy (under L2 )
of the Lipschitz class from Theorem 27.14 and the fact that the metric entropy of the subclass F ′
is at most that of the full class F .
For the lower bound, we apply Fano’s inequality. Applying Theorem 27.14 and the rela-
tion between covering and packing numbers in Theorem 27.2, we have log N(F, k · k2 , ϵ) 
log M(F, k · k2 , ϵ)  ϵ−d/β . Fix ϵ to be specified and let f1 , . . . , fM be an ϵ-packing in F , where
M ≥ exp(Cϵ−d/β ). Then g1 , . . . , gM is an 2ϵ -packing in F ′ , with gi = (fi + 1)/2. Applying Fano’s
inequality in Theorem 31.3, we have
 
∗ ′ C′n
RL2 (n; F ) ≳ ϵ 1 −
2
(32.54)
log M

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-624


i i

624

i.i.d.
where C′n is the capacity (or KL radius) from f ∈ F ′ to X1 , . . . , Xn ∼ f. Using (32.17) and
Lemma 32.19, we have
C′n ≤ inf (nϵ2 + log NKL (F ′ , ϵ))  inf (nϵ2 + log N(F ′ , k · k2 , ϵ)) ≲ inf (nϵ2 + ϵ−d/β )  nd/(2β+d) .
ϵ>0 ϵ>0 ϵ>0
β
− 2β+
Thus choosing ϵ = cn d for sufficiently small c ensures C′n ≤ 1
2 log M and hence R∗L2 (n; F ′ ) ≳

− 2β+
ϵ2  n d .
Remark 32.6 The above lower bound proof, based on Fano’s inequality and the intuition that
small mutual information implies large estimation error, requires us to upper bound the capacity
C′n of the subcollection F ′ . On the other hand, as hinted earlier in (32.11) (and shown next), the
C′
risk is expected to be proportional to nn , which suggests one should lower bound the capacity
using metric entropy. Indeed, this is possible: Applying Theorem 32.5,
C′n ≳ min{nϵ2 , log M(F ′ , H, ϵ)} − 2
 min{nϵ2 , log M(F ′ , k · k2 , ϵ)} − 2
 min{nϵ2 , ϵ−d/β } − 2  nd/(2β+d) ,

where we picked the same ϵ as in the previous proof. So C′n  nd/(2β+d) . Finally, applying the
online-to-batch conversion (32.26) in Proposition 32.7 (or equivalently, combining (32.7) and
C′ 2β
(32.9)) yields R∗KL (n; F ′ )  nn  n− 2β+d .
Remark 32.7 Note that the above proof of Theorem 32.17 relies on the entropic risk bound
(32.1), which, though rate-optimal, is not attained by a computationally efficient estimator. (The
same criticism also applies to (32.2) and (32.3) for Hellinger and total variation.) To remedy this,
for the squared loss, a classical idea is to apply the kernel density estimator (KDE) – cf. Section 7.9.
Pn
Specifically, one compute the convolution of the empirical distribution P̂n = 1n i=1 δXi with a
kernel function K(·) whose shape and bandwidth are chosen according to the smooth constraint.
For example, for Lipschitz densities, the optimal rate in Theorem 32.17 can be attained by a box
kernel K(·) = 2h1
1 {| · | ≤ h} with bandwidth h = n−1/3 (cf. e.g. [424, Sec. 1.2]).
The classical literature of density estimation is predominantly concerned with the L2 loss,
mainly due to the convenient quadratic nature of the loss function that allows bias-variance decom-
position and facilitates the analysis of KDE. However, L2 -distance between densities has no clear
operational meaning. Next we consider the three f-divergence losses introduced at the beginning
of this chapter. Paralleling Theorem 32.17, we have
β
R∗TV (n; F) ≜ inf sup E TV(f, f̂)  n− 2β+d (32.55)
f̂ f∈F
β
R∗H2 (n; F) ≜ inf sup E H2 (f, f̂)  n− β+d (32.56)
f̂ f∈F
β β
n− β+d ≲ R∗KL (n; F) ≜ inf sup E D(fkf̂)  n− β+d (log n) β+d
d
(32.57)
f̂ f∈F

For TV loss, the upper bound follows from the L2 -rates in Theorem 32.17 and kf − f̂k1 ≤ kf − f̂k2
by Cauchy-Schwarz; alternatively, we can also apply Yatracos’ estimator from Theorem 32.16.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-625


i i

32.4 Density estimation over Hölder classes 625

The matching lower bound can be shown using the same argument based on Fano’s method as the
metric entropy under L1 -distance behaves the same (Theorem 27.14).
Recall that for L2 /L1 the rate is derived by considering a subclass F ′ , which has the same
estimation rate, but on which Lp  H  KL, cf. Lemma 32.18. It was thus, surprising, when
Birgé [54] found the Hellinger rate on the full family F to be different.
To derive the H2 result (32.56), first note that neither upper or a lower bound follow from the
2
generic comparison inequality H2 ≤ TV ≤ H in (7.22). Instead, what works is comparing entropy
numbers via the first of these inequalities. Specifically, we have
log N(F, H, ϵ) ≤ log N(F, TV, ϵ2 /2) ≲ ϵ− β ,
2d
(32.58)
where in the last step we invoked Theorem 27.14. Combining this with Le Cam-Birgé method
(Theorem 32.9) proves the upper bound part of (32.56).7
The lower bound follows from a similar argument as in the proof of Theorem 32.17, although
the construction is more involved. Below c0 , c1 , . . . are absolute constants. Fix a small ϵ and con-
sider the subcollection F ′ = {f ∈ F : f ≥ ϵ} of densities lower bounded by ϵ. We first construct a
Hellinger packing of F ′ . Applying the same argument in Lemma 32.13 yields an L2 -packing in an
L∞ -local ball: there exist f0 , f1 , . . . , fM ∈ F , such that kfi − fj k2 ≥ c0 ϵ for all i 6= j, kfi − f0 k∞ ≤ ϵ
for all i, and M ≥ M(F, k · k2 , c0 ϵ)/M(F, k · k∞ , ϵ) ≥ exp(c1 ϵ−d/β ), the last step applying The-
orem 32.17 for sufficiently small c0 . Let hi = fi − f0 and define fi by fi (x) = ϵ + hi (2x) for
x ∈ [0, 12 ]d and extend fi smoothly elsewhere so that it is a valid density in F ′ . Then f1 , . . . , fM
√ R (f −fj )2
form a Hellinger Ω( ϵ)-packing of F ′ , since H2 (fi , fj ) ≥ [0, 1 ]d √ i √ 2
≥ c2 ϵ. (This construc-
2 ( fi + fj )
tion also shows that the metric entropy bound (32.58) is tight.) It remains to bound the capacity
C′n of F ′ as a function of n and ϵ. Note that for any f, g ∈ F ′ , we have as in Lemma 32.19
D(fkg) ≤ χ2 (fkg) ≤ kf − gk22 /ϵ. Thus NKL (F ′ , δ 2 /ϵ) ≤ N(F ′ , k · k2 , δ). Applying (32.17) and
Lemma 32.19, C′n ≲ infδ>0 (nδ 2 /ϵ + δ −d/β )  (n/ϵ)d/(2β+d) . Applying Fano’s inequality as in
(32.54) yields an Ω(ϵ) lower bound in squared Hellinger, provided log M ≥ 2C′n . This is achieved
β
by choosing ϵ = c3 n− β+d , completing the proof of (32.56).
For KL loss, the lower bound of (32.57) follows from (32.56) because D ≥ H2 . For the upper
bound, applying (32.7) in Theorem 32.1, we have R∗KL (n; F) ≤ Cn+ n+1
1 , where Cn is the capacity
i.i.d.
(32.5) of the channel between f and Xn ∼ f ∈ F . This capacity can be bounded, in turn, using
Theorem 32.6 via the Hellinger entropy. Applying (32.58) in conjunction with (32.22), we obtain
Cn ≤ infϵ (nϵ2 log 1ϵ + ϵ−2d/β )  (n log n)d/(d+β) , proving the upper bound (32.57).8 To the best
of our knowledge, resolving the logarithmic gap in (32.57) remains open.

7
Comparing (32.56) with (32.52), we see that the Hellinger rate coincides with the L2 rate upon replacing the smoothness
parameter β by β/2. Note that Hellinger distance is the L2 between root-densities. For β ≤ 1, one can indeed show that

f is β/2-Hölder continuous, which explains the result in (32.56). However, this interpretation fails for general β. For

example, Glaeser [191] constructs an infinitely differentiable f such that f has points with arbitrarily large second
derivative.
8
This capacity bound is tight up to logarithmic factors. Note that the construction in the proof of the lower bound in (32.56)
shows that log M(F , H, ϵ) ≳ ϵ−2d/β , which, via Theorem 32.5, implies that Cn ≥ nd/(d+β) .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-626


i i

33 Strong data processing inequality

In this chapter we explore statistical implications of the following effect. For any Markov chain
U→X→Y→V (33.1)
we know from the data-processing inequality (DPI, Theorem 3.7) that
I(U; Y) ≤ I(U; X), I(X; V) ≤ I(Y; V) .
However, something stronger can often be said. Namely, if the Markov chain (33.1) factor through
a known noisy channel PY|X : X → Y , then oftentimes we can prove strong data processing
inequalities (SDPI):
I(U; Y) ≤ η I(U; X), I(X; V) ≤ η (p) I(Y; V) ,
where coefficients η = η(PY|X ), η (p) = η (p) (PY|X ) < 1 only depend on the channel and not the
(generally unknown or very complex) PU,X or PY,V . The coefficients η and η (p) approach 0 for
channels that are very noisy (for example, η is always up to a constant factor equal to the Hellinger-
squared diameter of the channel).
The purpose of this chapter is twofold. First, we want to introduce general properties of the
SDPI coefficients. Second, we want to show how SDPIs help prove sharp lower (impossibility)
bounds on statistical estimation questions. The flavor of the statistical problems in this chapter
is different from the rest of the book in that here the information about unknown parameter θ
is either more “thinly spread” across a high dimensional vector of observations than in classical
X = θ + Z type of tasks (cf., spiked Wigner and tree-coloring examples), or distributed across
different terminals (as in correlation and mean estimation examples).
We point out that SDPIs are an area of current research and multiple topics are not covered by
our brief exposition here. For more, we recommend surveys [345] and [352], of which the latter
explores the functional-theoretic side of SDPIs and their close relation to logarithmic Sobolev
inequalities – a topic we entirely omit in our book.

33.1 Computing a boolean function with noisy gates


A boolean function with n inputs is defined as f : {0, 1}n → {0, 1}. Note that a boolean function
can be described as a network of primitive logic gates of the three kinds as illustrated on Figure 33.1
In 1938 Shannon has shown how any boolean function f can be represented with primitive
logic gates [377] from the top row of the Figure 33.1. In 1950s John von Neumann was laying

626

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-627


i i

33.1 Computing a boolean function with noisy gates 627

a a
OR a∨b AND a∧b a NOT a′
b b
Z Z Z
a a
OR ⊕ Y AND ⊕ Y a NOT ⊕ Y
b b

Figure 33.1 Basic building blocks of any boolean circuit. Top row shows the classical (Shannon) noiseless
gates. Bottom row shows noisy (von Neumann) gates. Here Z ∼ Ber(δ) is assumed to be independent of the
inputs.

the groundwork for the digital computers, and he was bothered by the following question. Since
real physical (and biological) networks are composed of imperfect elements, can we compute any
boolean function f if the constituent basis gates are in fact noisy? His model of the δ -noisy gate
(bottom row of Figure 33.1) is to take a primitive noiseless gate and apply a (mod 2) additive noise
to the output.
In this case, we have a network of the noisy gates, and such network necessarily has noisy (non-
deterministic) output. Therefore, when we say that a noisy gate circuit C computes f we require
the existence of some ϵ0 = ϵ0 (δ) (that cannot depend on f) such that

 1
P[C(x1 , . . . , xn ) 6= f(x1 , . . . , xn ) ≤ − ϵ0 (33.2)
2

where C(x1 , . . . , xn ) is the output of the noisy circuit with inputs x1 , . . . , xn . If we build the circuit
according to the classical (Shannon) methods, we would obviously have catastrophic error accumu-
lation so that deep circuits necessarily have ϵ0 → 0. At the same time, von Neumann was bothered
by the fact that evidently our brains operate with very noisy gates and yet are able to carry very
long computations without mistakes. His thoughts culminated in the following ground-breaking
result.

Theorem 33.1 (von Neumann [443]) There exists δ ∗ > 0 such that for all δ < δ ∗ it is
possible to compute every boolean function f via δ -noisy 3-majority gates.

von Neumann’s original estimate δ ∗ ≈ 0.087 was subsequently improved by Pippenger. The
main (still open) question of this area is to find the largest δ ∗ for which the above theorem holds.
Condition in (33.2) implies the output should be correlated with the inputs. This requires the
mutual information between the inputs (if they are random) and the output to be greater than
zero. We now give a theorem of Evans and Schulman that gives an upper bound to the mutual
information between any of the inputs and the output. We will prove the theorem in Section 33.3
as a consequence of the more general directed information percolation theory.

Theorem 33.2 ([158]) Suppose an n-input noisy boolean circuit composed of gates with at
most K inputs and with noise components having at most δ probability of error. Then, the mutual

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-628


i i

628

X1 X2 X3 X4 X5 X6 X7 X8 X9

G1 G2 G3

G4 G5

G6

Figure 33.2 An example of a 9-input Boolean Circuit

information between any input Xi and output Y is upper bounded as


di
I(Xi ; Y) ≤ K(1 − 2δ)2 log 2

where di is the minimum length between Xi and Y (i.e, the minimum number of gates required to
be passed through until reaching Y).

Theorem 33.2 implies that noisy computation is only possible for δ < 12 − 2√1 K . This is the best
known threshold. To illustrate this result consider the circuit given on Figure 33.2. That circuit has
9 inputs and composed of gates with at most 3 inputs. The 3-input gates are G4 , G5 and G6 . The
minimum distance between X3 and Y is d3 = 2, and the minimum distance between X5 and Y is
d5 = 3. If Gi ’s are δ -noisy gates, we can invoke Theorem 33.2 between any input and the output.
Now, the main conceptual implication of Theorem 33.2 is in demonstrating that some cir-
cuits are not computable with δ -noisy gates unless δ is sufficiently small. For instance, take
f(X1 , . . . , Xn ) = XOR(X1 , . . . , Xn ). Note that function f depends essentially on every input Xi , since

XOR(X1 , . . . , Xn ) = XOR XOR(X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ), Xi . Thus, any circuit that ignores
any one of the inputs Xi will not be able to satisfy requirement (33.2). Since we are composing
log n
our circuit via K-input gates, this implies that there must exist at least one input Xi with di ≥ log K
(indeed, going from Y up we are to make K-ary choice at each gate and thus at height d we can at
most reach dK inputs). Now as n → ∞ we see from Theorem 33.2 that I(Xi ; Y) → 0 unless

∗ 1 1
δ ≤ δES = − √ .
2 2 K

As we argued I(Xi ; Y) → 0 is incompatible with satisfying (33.2). Hence the value of δES gives
a (at present, the tightest) upper bound for the noise limit under which reliable computation with
K-input gates is possible.
Computation with formulas Note that the graph structure given in Figure 33.2 contains some
undirected loops. A formula is a type of boolean circuits that does not contain any undirected
loops unlike the case in Figure 33.2. In other words, for a formula the underlying graph structure
forms a tree. For example, removing one of the outputs of G2 of Figure 33.2 we obtain a formula
as given on Figure 33.3.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-629


i i

33.2 Strong data processing inequality 629

X1 X2 X3 X4 X5 X6 X7 X8 X9

G1 G2 G3

G4 G5

G6

Figure 33.3 An example of a 9-input Boolean Formula

For computation with formulas much stronger results are available. For example, for any odd K,
the threshold is exactly known from [157, Theorem 1]. Specifically, it is shown there that we can
compute reliably any boolean function f that is represented with a formula compose of K-input
δ -noisy gates (with K odd) if δ < δf∗ , and no such computation is possible for δ > δf∗ , where

1 2K−1
δf∗ = − K−1
2 K K− 1
2

Since every formula is also a circuit, we of course have δf∗ < δES

, so that there is no contradiction
with Theorem 33.2. However, comparing the thresholds gives us ability to appreciate tightness of
Theorem 33.2 for general boolean circuits. Indeed, for large K we have an approximation
p
∗ 1 π /2
δf ≈ − √ , K  1 ,
2 2 K

whereas the estimate of Evans-Schulman δES ≈ 1
2 − 1

2 K
. We can thus see that it has at least the
right rate of convergence to 1/2 for large K.

33.2 Strong data processing inequality


Definition 33.3 (Contraction coefficient for PY|X ) For a fixed conditional distribution (or
kernel) PY|X , define
Df (PY kQY )
ηf (PY|X ) = sup , (33.3)
Df (PX kQX )
where PY = PY|X ◦ PX , QY = PY|X ◦ QX and supremum is over all pairs (PX , QX ) satisfying
0 < Df ( P X k QX ) < ∞ .

Recall that the data-processing inequality (DPI) in Theorem 7.4 states that Df (PX kQX ) ≥
Df (PY kQY ). The concept of the Strong DPI introduced above quantifies the multiplicative decrease
between the two f-divergences.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-630


i i

630

We note that in general ηf (PY|X ) is hard to compute. However, total variation is an exception.

Theorem 33.4 ([127]) ηTV = supx̸=x′ TV(PY|X=x , PY|X=x′ ).

Proof. We consider two directions separately.

• ηTV ≥ supx0 ̸=x′ TV(PY|X=x0 , PY|X=x′0 ):


0

This case is obvious. Take PX = δx0 and QX = δx′0 .1 Then from the definition of ηTV , we
have ηTV ≥ TV(PY|X=x0 , PY|X=x′0 ) for any x0 and x′0 , x0 6= x′0 .
• ηTV ≤ supx0 ̸=x′ TV(PY|X=x0 , PY|X=x′0 ):
0

Define η̃ ≜ supx0 ̸=x′ TV(PY|X=x0 , PY|X=x′0 ). We consider the discrete alphabet case for simplicity.
0
Fix any PX , QX and PY = PX ◦ PY|X , QY = QX ◦ PY|X . Observe that for any E ⊆ Y

PY|X=x0 (E) − PY|X=x′0 (E) ≤ η̃ 1{x0 6= x′0 }. (33.4)

Now suppose there are random variables X0 and X′0 having some marginals PX and QX respec-
tively. Consider any coupling π X0 ,X′0 with marginals PX and QX respectively. Then averaging
(33.4) and taking the supremum over E, we obtain

sup PY (E) − QY (E) ≤ η̃ π [X0 6= X′0 ]


E⊆Y

Now the left-hand side equals TV(PY , QY ) by Theorem 7.7(a). Taking the infimum over
couplings π the right-hand side evaluates to TV(PX , QX ) by Theorem 7.7(b).

Example 33.1 (ηTV of a Binary Symmetric Channel) The ηTV of the BSCδ is given by

ηTV (BSCδ ) =TV(Bern(δ ), Bern(1 − δ ))


1 
= |δ − (1 − δ)| + |1 − δ − δ| = |1 − 2δ|.
2
We sometimes want to relate ηf to the f-information (Section 7.8) instead of f-divergence. This
relation is given in the following theorem.

Theorem 33.5
If (U; Y)
ηf (PY|X ) = sup .
PUX : U→X→Y f (U; X)
I

1
δx0 is the probability distribution with P(X = x0 ) = 1

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-631


i i

33.2 Strong data processing inequality 631

Recall that for any Markov chain U → X → Y, DPI states that If (U; Y) ≤ If (U; X) and Theorem
33.5 gives the stronger bound

If (U; Y) ≤ ηf (PY|X )If (U; X). (33.5)

Proof. First, notice that for any u0 , we have Df (PY|U=u0 kPY ) ≤ ηf Df (PX|U=u0 kPX ). Averaging the
above expression over any PU , we obtain

If (U; Y) ≤ ηf If (U; X)

Second, fix P̃X , Q̃X and let U ∼ Bern(λ) for some λ ∈ [0, 1]. Define the conditional distribution
PX|U as PX|U=1 = P̃X , PX|U=0 = Q̃X . Take λ → 0, then (see [345] for technical subtleties)

If (U; X) = λDf (P̃X kQ̃X ) + o(λ)


If (U; Y) = λDf (P̃Y kQ̃Y ) + o(λ)
I (U;Y) Df (P̃Y ∥Q̃Y )
The ratio Iff(U;X) will then converge to Df (P̃X ∥Q̃X )
. Thus, optimizing over P̃X and Q̃X we can get ratio
of If ’s arbitrarily close to ηf .

We next state some of the fundamental properties of contraction coefficients.

Theorem 33.6 In the statements below ηf (and others) corresponds to ηf (PY|X ) for some fixed
PY|X from X to Y .

(a) For any f, ηf ≤ ηTV .


(b) ηKL = ηH2 = ηχ2 . More generally, for any operator-convex and twice continuously
differentiable f we have ηf = ηχ2 .
(c) ηχ2 equals the squared maximal correlation: Denote by ρ(A, B) ≜ √ Cov(A,B) the correla-
Var[A] Var[B]
tion coefficient between scalar random variables A and B. Then ηχ2 = supPX ,f,g ρ2 (f(X), g(Y)),
where the supremum is over all distributions PX on X , all functions f : X → R and g : Y → R.
(d) For binary-input channels, denote P0 = PY|X=0 and P1 = PY|X=1 . Then

ηKL = LCmax (P0 , P1 ) ≜ sup LCβ (P0 kP1 )


0<β<1

where (recall β̄ ≜ 1 − β )
( 1 − x) 2
LCβ (PkQ) = Df (PkQ), f(x) = β̄β
β̄ x + β
is the Le Cam divergence of order β (recall (7.7) for β = 1/2).
(e) Consequently,
1 2 H4 (P0 , P1 )
H (P0 , P1 ) ≤ ηKL ≤ H2 (P0 , P1 ) − . (33.6)
2 4
(f) If a binary-input channel PY|X is also input-symmetric (or BMS, see Section 19.4*), then
ηKL (PY|X ) = Iχ2 (X; Y) for X ∼ Bern(1/2).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-632


i i

632

(g) For any channel PY|X , the supremum in (33.3) can be restricted to PX , QX with a common
binary support. In other words, ηf (PY|X ) coincides with that of the least contractive binary
subchannel. Consequently, from (e) we conclude
1 diamH2
diamH2 ≤ ηKL (PY|X ) = diamLCmax ≤ diamH2 − ,
2 4
(in particular ηKL  diamH2 ), where diamH2 (PY|X ) = supx,x′ ∈X H2 (PY|X=x , PY|X=x′ ),
diamLCmax = supx,x′ LCmax (PY|X=x , PY|x=x′ ) are the squared Hellinger and Le Cam diameters
of the channel.

Proof. Most proofs in full generality can be found in [345]. For (a) one first shows that ηf ≤ ηTV
for the so-called Eγ divergences corresponding to f(x) = |x − γ|+ − |1 − γ|+ , which is not hard to
believe since Eγ is piecewise linear. Then the general result follows from the fact that any convex
function f can be approximated (as N → ∞) in the form
X
N
aj |x − cj |+ + a0 x + c0 .
j=1

For (b) see [93, Theorem 1] and [97, Proposition II.6.13 and Corollary II.6.16]. The idea of this
proof is as follows:

• ηKL ≥ ηχ2 by restricting to local perturbations. Recall that KL divergence behaves locally as
χ2 – Proposition 2.21.
R∞
• Using the identity D(PkQ) = 0 χ2 (PkQt )dt where Qt = tP1+ Q
+t , we have
Z ∞ Z ∞
D(PY kQY ) = χ (PY kQY t )dt ≤ ηχ2
2
χ2 (PX kQX t )dt = ηχ2 D(PX kQX ).
0 0

For (c), we fix QX (and thus QX,Y = QX PY|X ). If g = dQ dPX


X
then Tg(y) = dQ dPY
Y
= EQX|Y [g(X)|Y =
y] is a linear operator. ηχ2 (PY|X ) is then nothing but the maximal singular value (spectral norm
squared) of T : L2 (QX ) → L2 (QY ) when restricted to a linear subspace {h : EQX [h] = 0}. The
adjoint of T is T∗ h(x) = EPY|X [h(Y)|X = x]. The spectral norms of an operator and its adjoint
coincide and the spectral norm of T∗ is precisely the squared maximal correlation. These two
facts together yield the result. (See Theorem 33.12(c) which strengthens this result for a fixed
PX .)
2
P1 +ᾱP0 ∥β P1 +β̄ P0 )
Part (d) follows from the definition of ηχ2 = supα,β χ (α χ2 (Ber(α)∥Ber(β))
and some algebra.
Next, (e) follows from bounding (via Cauchy-Schwarz etc) LCmax in terms of H2 ; see [345,
Appendix B].
Part (f) follows from the fact that every BMS channel can be represented as X 7→ Y = (Y∆ , ∆)
where ∆ ∈ [0, 1/2] is independent of X and Yδ = BSCδ (X). In other words, every BMS channel
is a mixture of BSCs; see [360, Section 4.1]. Thus, we have for any U → X → Y = (Y∆ , ∆) and
∆⊥ ⊥ (U, X) the following chain

I(U; Y) = I(U; Y|∆) ≤ Eδ∼P∆ [(1 − 2δ)2 I(U; X|∆ = δ) = E[(1 − 2∆)2 ]I(U; X),

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-633


i i

33.3 Directed information percolation 633

where we used the fact that I(U; X|∆ = δ) = I(U; X) and Example 33.2 below.
For (g) see Ex. VI.20.

Example 33.2 (Computing ηKL (BSCδ )) Consider p the BSCδ channel. In Example 33.1
we computed ηTV . Here we have diamH2 = 2 − 4 δ(1 − δ) and thus the bound (33.6) we get
ηKL ≤ (1 − 2δ)2 . On the other hand taking U = Ber(1/2) and PX|U = Ber(α) we get
I(U; Y) log 2 − h(α + (1 − 2α)δ) 1
ηKL ≥ = → (1 − 2δ)2 α→ .
I(U; X) log 2 − h(α) 2
Thus we have shown:

ηKL (BSCδ ) = ηH2 (BSCδ ) = ηχ2 = (1 − 2δ)2 .

This example has the following consequence for the KL-divergence geometry.

Proposition 33.7 Consider any distributions P0 and P1 on X and let us consider the interval
in P(X ): Pλ = λP1 + (1 − λ)P0 for λ ∈ [0, 1]. Then divergence (with respect to the midpoint)
behaves subquadratically:

D(Pλ kP1/2 ) + D(P1−λ kP1/2 ) ≤ (1 − 2λ)2 {D(P0 kP1/2 ) + D(P1 kP1/2 )) .

The same statement holds with D replaced by χ2 (and any other Df satisfying Theorem 33.6(b)).

Proof. Let X ∼ Ber(1/2) and Y = BSCλ (X). Let U ← X → Y be defined with U ∼ P0 if X = 0


and U ∼ P1 if X = 1. Then
1 1
I f ( U; Y ) = Df (Pλ kP1/2 ) + Df (Pλ kP1/2 ) .
2 2
Thus, applying SDPI (33.5) completes the proof.
p p
Remark 33.1 Let us introduce dJS (P, Q) = JS(P, Q) and dLC = LC(P, Q) – the Jensen-
Shannon (7.8) and Le Cam (7.7) metrics. Then the proposition can be rewritten as

dJS (Pλ , P1−λ ) ≤ |1 − 2λ|dJS (P0 , P1 )


dLC (Pλ , P1−λ ) ≤ |1 − 2λ|dLC (P0 , P1 ) .

Notice that for any metric d(P, Q) on P(X ) that is induced from the norm on the vector space
M(X ) of all signed measures (such as TV), we must necessarily have d(Pλ , P1−λ ) = |1 −
2λ|d(P0 , P1 ). Thus, the ηKL (BSCλ ) = (1 − 2λ)2 which yields the inequality is rather natural.

33.3 Directed information percolation


In this section, we are concerned about the amount of information decay experienced in a directed
acyclic graph (DAG) G = (V, E). In the following context the vertex set V refers to a set of vertices
v, each associated with a random variable Xv and the edge set E refers to a set of directed edges

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-634


i i

634

whose configuration allows us to factorize the joint distribution over XV by Throughout the section,
we consider Shannon mutual information, i.e., f = x log x. Let us give a detailed example below.
Example 33.3 Suppose we have a graph G = (V, E) as follows.

B
X0 W
A

This means that we have a joint distribution factorizing as

PX0 ,A,B,W = PX0 PB|X0 PA|B,X0 PW|A,B .

Then every node has a channel from its parents to itself, for example W corresponds to a noisy
channel PW|A,B , and we can define η ≜ ηKL (PW|A,B ). Now, prepend another random variable U ∼
Bern(λ) at the beginning, the new graph G′ = (V′ , E′ ) is shown below:

B
U X0 W
A

We want to verify the relation

I(U; B, W) ≤ η̄ I(U; B) + η I(U; A, B). (33.7)

Recall that from chain rule we have I(U; B, W) = I(U; B) + I(U; W|B) ≥ I(U; B). Hence, if (33.7)
is correct, then η → 0 implies I(U; B, W) ≈ I(U; B) and symmetrically I(U; A, W) ≈ I(U; A).
Therefore for small δ , observing W, A or W, B does not give advantage over observing solely A or
B, respectively.
Observe that G′ forms a Markov chain U → X0 → (A, B) → W, which allows us to factorize
the joint distribution over E′ as

PU,X0 ,A,B,W = PU PX0 |U PA,B|X0 PW|A,B .

Now consider the joint distribution conditioned on B = b, i.e., PU,X0 ,A,W|B . We claim that the
conditional Markov chain U → X0 → A → W|B = b holds. Indeed, given B and A, X0 is
independent of W, that is PX0 |A,B PW|A,B = PX0 ,W|AB , from which follows the mentioned conditional
Markov chain. Using the conditional Markov chain, SDPI gives us for any b,

I(U; W|B = b) ≤ η I(U; A|B = b).

Averaging over b and adding I(U; B) to both sides we obtain

I(U; W, B) ≤ η I(U; A|B) + I(U; B)


= η I(U; A, B) + η̄ I(U; B).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-635


i i

33.3 Directed information percolation 635

B
R
X W
A

Figure 33.4 Illustration for Example 33.4.

From the characterization of ηf in Theorem 33.5 we conclude


ηKL (PW,B|X0 ) ≤ η · ηKL (PA,B|X0 ) + (1 − η) · ηKL (PB|X0 ) . (33.8)
Now, we provide another example which has in some sense an analogous setup to Example
33.3.
Example 33.4 (Percolation) Take the graph G = (V, E) in Example 33.3 with a small
modification. See Figure 33.4. Now, suppose X,A,B,W are some cities and the edge set E represents
the roads between these cities. Let R be a random variable denoting the state of the road connecting
to W with P(R is open) = η and P[R is closed] = η̄ . For any Y ∈ V, let the event {X → Y} indicate
that one can drive from X to Y. Then
P[X → B or W] = η P[X → A or B] + η̄ P[X → B]. (33.9)
Observe the resemblance between (33.8) and (33.9).
We will now give a theorem that relates ηKL to percolation probability on a DAG under the
following setting: Consider a DAG G = (V, E).

• All edges are open



• Every vertex is open with probability p(v) = ηKL PXv |XPa(v) where Pa(v) denotes the set of
parents of v.

Under this model, for two subsets T, S ⊂ V we define perc[T → S] = P[∃ open path T → S].
Note that PXv |XPa(v) describe the stochastic recipe for producing Xv based on its parent variables.
We assume that in addition to a DAG we also have been given all these constituent channels (or
at least bounds on their ηKL coefficients).

Theorem 33.8 ([345]) Let G = (V, E) be a DAG and let 0 be a node with in-degree equal to
zero (i.e. a source node). Note that for any 0 63 S ⊂ V we can inductively stitch together constituent
channels PXv |XPa(v) and obtain PXS |X0 . Then we have
ηKL (PXS |X0 ) ≤ perc(0 → S). (33.10)

Proof. For convenience let us denote η(T) = ηKL (PXT |X0 ) and ηv = ηKL (PXv |XPa(v) ). The proof
follows from an induction on the size of G. The statement is clear for the |V(G)| = 1 since
S = ∅ or S = {X0 }. Now suppose the statement is already shown for all graphs smaller than
G. Let v be the node with out-degree 0 in G. If v 6∈ S then we can exclude it from G and the
statement follows from induction hypothesis. Otherwise, define SA = Pa(v) \ S and SB = S \ {v},

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-636


i i

636

A = XSA , B = XSB , W = Xv . (If 0 ∈ A then we can create a fake 0′ with X0′ = X0 and retain
0′ ∈ A while moving 0 out of A. So without loss of generality, 0 6∈ A.) Prepending arbitrary U to
the graph as U → X0 , the joint DAG of random variables (X0 , A, B, W) is then given by precisely
the graph in (33.7). Thus, we obtain from (33.8) the estimate

η(S) ≤ ηv η(SA ∪ SB ) + (1 − ηv )ηKL (SB ) . (33.11)

From induction hypothesis η(SA ∪ SB ) ≤ perc(0 → SA ) and η(SB ) ≤ perc(0 → SB ) (they live
on a graph G \ {v}). Thus, from computation (33.9) we see that the right-hand side of (33.11) is
precisely perc(0 → S) and thus η(S) ≤ perc(S) as claimed.

We are now in position to complete the postponed proof.

Proof of Theorem 33.2. First observe the noisy boolean circuit is a form of DAG. Since the gates
are δ -noisy contraction coefficients of constituent channels ηv in the DAG can be bounded by
(1 − 2δ)2 . Thus, in the percolation question all vertices are open with probability (1 − 2δ)2
From SDPI, for each i, we have I(Xi ; Y) ≤ ηKL (PY|Xi )H(Xi ). From Theorem 33.8, we know
ηKL (PY|Xi ) ≤ perc(Xi → Y). We now want to upper bound perc(Xi → Y). Recall that the minimum
distance between Xi and Y is di . For any path π of length ℓ(π ) from Xi to Y, therefore, the probability
that it will be open is ≤ (1 − 2δ)2ℓ(π ) . We can thus bound
X
perc(Xi → Y) ≤ (1 − 2δ)2ℓ(π ) . (33.12)
π :Xi →Y

Let us now build paths backward starting from Y, which allows us to represent paths X → Yi
as vertices of a K-ary tree with root Yi . By labeling all vertices on a K-ary tree corresponding
to paths X → Yi we observe two facts: the labeled set V is prefix-free (two labeled vertices are
never in ancestral relation) and the depth of each labeled set is at least di . It is easy to see that
P
u∈V c
depth(u)
≤ (Kc)di provided Kc ≤ 1 and attained by taking V to be set of all vertices in the
tree at depth di . We conclude that whenever K(1 − 2δ)2 ≤ 1 the right-hand side of (33.12) is
bounded by (K(1 − 2δ)2 )di , which concludes the proof by upper bounding H(Xi ) ≤ log 2 as

I(Xi ; Y) ≤ ηKL (PY|Xi )H(Xi ) ≤ Kdi (1 − 2δ)2di log 2

As another (simple) application of Theorem 33.8 we show the following.

Corollary 33.9 Consider a channel PY|X and its n-letter memoryless extension P⊗ n
Y|X . Then we
have

ηKL (P⊗
Y|X ) ≤ 1 − (1 − ηKL (PY|X )) ≤ nηKL (PY|X ) .
n n

The first inequality can be sharp for some channels. For example, it is sharp when PY|X is a
binary or q-ary erasure channel (defined below in Example 33.6). This fact is proven in [345,
Theorem 17].

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-637


i i

33.4 Input-dependent SDPI. Mixing of Markov chains 637

Proof. The graph here consists of n parallel lines Xi → Yi . Theorem 33.8 shows that ηKL (P⊗ Y|X ) ≤
n

perc({X1 , . . . , Xn } → {Y1 , . . . , Yn }). The latter simply equals 1 − (1 − η) where η = η(PY|X ) is


n

the probability of an edge being open.

We conclude the section with a more sophisticated application of Theorem 33.8, emphasizing
how it can yield stronger bounds when compared to Theorem 33.2.
Example 33.5 Suppose we have the topological restriction on the placement of gates (namely
that the inputs to each gets should be from nearest neighbors to the left), resulting in the following
circuit of 2-input δ -noisy gates.

Note that each gate may be a simple passthrough (i.e. serve as router) or a constant output. Theorem
33.2 states that if (1 − 2δ)2 < 12 , then noisy computation within arbitrary topology is not possible.
Theorem 33.8 improves this to (1 − 2δ)2 < pc , where pc is the oriented site percolation threshold
for the particular graph we have. Namely, if each vertex is open with probability p < pc then
with probability 1 the connected component emanating from any given node (and extending to
the right) is finite. For the example above the site percolation threshold is estimated as pc ≈ 0.705
(so-called Stavskaya automata).

33.4 Input-dependent SDPI. Mixing of Markov chains


Previously we have defined contraction coefficient ηf (PY|X ), as the maximum contraction of an
f-divergences over all input distributions. We now define an analogous concept for a fixed input
distribution PX .

Definition 33.10 (Input-dependent contraction coefficient) For any input distribu-


tion PX , Markov kernel PY|X and convex function f, we define
Df (QY kPY )
ηf (PX , PY|X ) ≜ sup
Df (QX kPX )
where PY = PY|X ◦ PX , QY = PY|X ◦ QX and the supremum is over QX satisfying 0 < Df (QX kPX ) <
∞.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-638


i i

638

We refer to ηf (PX , PY|X ) as the input-dependent contraction coefficient, in contrast with the
input-independent contraction coefficient ηf (PY|X ). It is obvious that

ηf (PX , PY|X ) ≤ ηf (PY|X )

but as we will see below the inequality is often strict and the difference can lead to significant
improvements in the applications (Example 33.10). In Theorem 33.6 we have seen that for most
interesting f’s we have ηf (PY|X ) = ηχ2 (PY|X ). Unfortunately, for the input-dependent version this is
not true: we only have a one-sided comparison, namely for any twice continuously differentiable
f with f′′ (1) > 0 (in particular for KL-divergence) it holds that [345, Theorem 2]

ηχ2 (PX , PY|X ) ≤ ηf (PX , PY|X ) . (33.13)

For example, for jointly Gaussian X, Y, we in fact have ηχ2 = ηKL (see Example 33.7 next);
however, in general we only have ηχ2 < ηKL (see [19] for an example). Thus, unlike the input-
independent case, here the choice of f is very important. A general rule is that ηχ2 (PX , PY|X ) is the
easiest to bound and by (33.13) it contracts the fastest. However, for various reasons other f are
more useful in applications. Consequently, theory of input-dependent contraction coefficients is
much more intricate (see [201] for many recent results and references). In this section we try to
summarize some important similarities and distinctions between the ηf (PX , PY|X ) and ηf (PY|X ).
First, just as in Theorem 33.5 we can similarly prove a mutual information characterization of
ηKL (PX , PY|X ) as follows [352, Theorem V.2]:

I(U; Y)
ηKL (PX , PY|X ) = sup .
PU|X :U→X→Y I(U; X)

In particular, we see that ηKL (PX , PY|X ) is also a slope of the FI -curve (cf. Definition 16.5):

d
ηKL (PX , PY|X ) = FI (t; PX,Y ) . (33.14)
dt t=0+

(Indeed, from Exercise III.32 we know FI (t) is concave and thus supt≥0 FI t(t) = F′I (0).)
The next property of input-dependent SDPIs emphasizes the key difference compared to its
input-independent counterpart. Recall that Corollary 33.9 (and the discussion thereafter) show
that generally ηKL (P⊗ ⊗n ⊗n
Y|X ) → 1 exponentially fast. At the same time, ηKL (PX , PY|X ) stays constant.
n

Proposition 33.11 (Tensorization of ηKL )


ηKL (P⊗ n ⊗n
X , PY|X ) = ηKL (PX , PY|X )

i.i.d.
In particular, if (Xi , Yi ) ∼ PX,Y , then ∀PU|Xn

I(U; Yn ) ≤ ηKL (PX , PY|X )I(U; Xn )

Note that not all ηf satisfy tensorization. We will show below (Theorem 33.12) that ηχ2 does
satisfy it. On the other hand, ηTV (P⊗ ⊗n
X , PY|X ) → 1 exponentially fast (which follows from (7.21)).
n

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-639


i i

33.4 Input-dependent SDPI. Mixing of Markov chains 639

X1 Y1

X2 Y2

Figure 33.5 Illustration for the probability space in the proof of Proposition 33.11.

Proof. This result is implied by (33.14) and Exercise III.32, but a simple direct proof is useful.
Without loss of generality (by induction) it is sufficient to prove the proposition for n = 2. It is
always useful to keep in mind the diagram in Figure 33.5 Let η = ηKL (PX , PY|X )
I(U; Y1 , Y2 ) = I(U; Y1 ) + I(U; Y2 |Y1 )
≤ η [I(U; X1 ) + I(U; X2 |Y1 )] (33.15)
= η [I(U; X1 ) + I(U; X2 |X1 ) + I(U; X1 |Y1 ) − I(U; X1 |Y1 , X2 )] (33.16)
≤ η [I(U; X1 ) + I(U; X2 |Y1 )] (33.17)
= η I(U; X1 , X2 )
where (33.15) is due to the fact that conditioned on Y1 , U → X2 → Y2 is still a Markov chain,
(33.16) is because U → X1 → Y1 is a Markov chain and (33.17) follows from the fact that
X2 → U → X1 is a Markov chain even when conditioned on Y1 .
As an example, let us analyze the erasure channel.
Example 33.6 (ηKL (PX , PY|X ) for erasure channel) We define ECτ as the following channel
(
X w.p. 1 − τ
Y=
? w.p. τ.
Consider an arbitrary U → X → Y and define an auxiliary random variable B = 1{Y =?}. We
have
I(U; Y) = I(U; Y, B) = I(U; B) +I(U; Y|B) = (1 − τ )I(U; X),
| {z }
=0, since B⊥
⊥U

where the last equality is due to the fact that I(U; Y|B = 1) = 0 and I(U; Y|B = 0) = I(U; X).
By the mutual information characterization of ηKL (PX , PY|X ), we have ηKL (PX , ECτ ) = 1 − τ .
⊗n
Note that by tensorization we also have ηKL (P⊗X , ECτ ) = 1 − τ . However, for non-product input
n

measure μ the study of ηKL ( μ, EC⊗ n


τ ) is essentially as hard as the study of mixing of Glauber
dynamics for μ – see Exercise VI.26.
Another interesting observation about the erasure channel is that even for uniform input
distribution we may have
ηKL (PY , PX|Y ) > ηKL (PX , PY|X ) = 1 − τ
(at least for τ > 0.74, see Example 33.14). Thus, ηKL is not symmetric even for such a simple
channel. This is in contrast with ηχ2 , as we show next.

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-640


i i

640

Among the input-dependent ηf the most elegant is the theory of ηχ2 . The properties hold for
general PX,Y but we only state it for the finite case for simplicity.

Theorem 33.12 (Properties of ηχ2 (PX , PY|X )) Consider finite X and Y . Then, we have

(a) (Spectral characterization) Let Mx,y = √PX,Y (x,y) be an |X | × |Y| matrix. Let 1 = σ1 (M) ≥
PX (x)PY (y)
p
σ2 (M) ≥ · · · ≥ 0 be the singular values of M, i.e. σj (M) = λj (MT M). Then ηχ2 (PX , PY|X ) =
σ22 (M).
(b) (Symmetry) ηχ2 (PX , PY|X ) = ηχ2 (PY , PX|Y ).
(c) (Maximal correlation) ηχ2 (PX , PY|X ) = supg1 ,g2 ρ2 (g1 (X), g2 (Y)), where the supremum is over
all functions g1 : X → R and g2 : Y → R.
(d) (Tensorization) ηχ2 (P⊗ n ⊗n
X , PY|X ) = ηχ2 (PX , PY|X )

Proof. We focus on the spectral characterization which implies the rest. Denote by EX|Y a linear
P
operator that acts on function g as EX|Y g(y) = x PX|Y (x|y)g(x). For any QX let g(x) = QPXX((xx)) then
QY (y)
we have PY (y) = EX|Y g. Therefore, we have

VarPY [EX|Y g]
ηχ2 (PX , PY|X ) = sup
g VarPX [g]

with supremum over all g ≥ 0 and EPX [g] = 1. We claim that this supremum is also equal to

EPY [E2X|Y h]
ηχ2 (PX , PY|X ) = sup ,
h EPX [h2 ]
taken over all h with EPX h = 0. Indeed, for any such h we can take g = 1 + ϵh for some suffi-
p g ≥ 0) and conversely, for any g we can set h = g − 1. Finally, let us
ciently small ϵ (to satisfy
reparameterize ϕx ≜ PX (x)h(x) in which case we get

ϕ T MT Mϕ
ηχ2 (PX , PY|X ) = sup ,
ϕ ϕT ϕ
p
where ϕ ranges over all vectors in RX that are orthogonal to the vector ψ with ψx = PX (x).
Finally, we notice that top singular value of M corresponds to singular vector ψ and thus restricting
ϕ ⊥ ψ results in recovering the second-largest singular vector.
Symmetry follows from noticing that matrix M is replaced by MT when we interchange X and
Y. The maximal correlation characterization follows from the fact that supg2 E√[g1 (X)g2 (Y)] is attained
Var[g2 (Y)]
at g2 = EX|Y g1 . Tensorization follows from the fact that singular values of the Kronecker product
M⊗n are just products of (all possible) n-tuples of singular values of M.

Example 33.7 (SDPI constants of joint Gaussian) Let X, Y be jointly Gaussian with
correlation coefficient ρ. Then

ηχ2 (PX , PY|X ) = ηKL (PX , PY|X ) = ρ2 .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-641


i i

33.5 Application: broadcasting and coloring on trees 641

Indeed, it is well-known that the maximal correlation of X and Y is simply |ρ|. (This can be shown
by finding the eigenvalues of the (Mehler) kernel defined in Theorem 33.12(a); see e.g. [266].)
Applying Theorem 33.12(c) yields ηχ2 (PX , PY|X ) = ρ2 .
Next, in view of (33.13), it suffices to show ηKL ≤ ρ2 , which is a simple consequence of EPI.
Without loss of generality, let us consider Y = X + Z, where X ∼ PX = N (0, 1) and Z ∼ N (0, σ 2 ).
Then PY = N (0, 1 + σ 2 ) and ρ2 = 1+σ 1
2 . Let X̃ have finite second moment and finite differential
1
entropy and set Ỹ = X̃ + Z. Applying Lieb’s EPI (3.36) with U1 = X̃, U2 = Z/σ and cos2 α = 1+σ 2,

we obtain

1 σ2 1
h(Ỹ) ≥ 2
h(X̃) + 2
log(2πe) + log(1 + σ 2 )
1+σ 2( 1 + σ ) 2

which implies D(PỸ kPY ) ≤ 1+σ 2 D(PX̃ kPX )


1
as desired.

Before proceeding to statistical applications we mention a very important probabilistic appli-


cation.

Example 33.8 (Mixing of Markov chains) One area in which input-dependent contrac-
tion coefficients have found a lot of use is in estimating mixing time (time to convergence to
equilibrium) of Markov chains. Indeed, suppose K = PY|X is a kernel for a time-homogeneous
Markov chain X0 → X1 → · · · with stationary distribution π (i.e., K = PXt+1 |Xt ). Then for any
initial distribution q, SDPI gives the following bound:

Df (qPn kπ ) ≤ ηf (π , K)n Df (qkπ ) ,

showing exponential decrease of Df provided that ηf (π , K) < 1. For most interesting chains the
TV version is useless, but χ2 and KL is rather effective (the two known as the spectral gap and
modified log-Sobolev inequality methods). For example, for reversible Markov chains, we have
[124, Prop. 3]

χ2 (PXn kπ ) ≤ γ∗2n χ2 (PX0 kπ ) (33.18)

where γ∗ is the absolute spectral gap of P. See Exercise VI.19. The most efficient modern method
for bounding ηKL is known as spectral independence, see Exercise VI.26.

33.5 Application: broadcasting and coloring on trees


Consider an infinite b-ary tree G = (V, E). We assign a random variable Xv for each v ∈ V . These
random variables Xv ’s are defined on the same alphabet X . In this model, the joint distribution
is induced by the distribution on the root vertex π, i.e., Xρ ∼ π, and the edge kernel PX′ |X , i.e.
∀(p, c) ∈ E, PXc |Xp = PX′ |X .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-642


i i

642

′ |X X3 ……
PX

P X′ |X
X1
PX′ |X X5 ……

PX′ |X X6 ……
PX ′
|X X2
PX ′
|X X4 ……

To simplify our discussion, we will assume that π is a reversible measure on kernel PX′ |X , i.e.,

PX′ |X (a|b)π (b) = PX′ |X (b|a)π (a). (33.19)

By standard result on Markov chains, this also implies that π is a stationary distribution of the
reversed Markov kernel PX|X′ .
This model, known as broadcasting on trees, turns out to be rather deep. It first arose in sta-
tistical physics as a simplification of Ising model on lattices (trees are called Bethe lattices in
physics) [63]. Then, it was found to be closely related to a problem of phylogenetic reconstruc-
tion in computational biology [306] and almost simultaneously appeared in random constraint
satisfaction problems [261] and sparse-graph coding theory. Our own interest was triggered by
a discovery of a certain equivalence between reconstruction on trees and community detection in
stochastic block model [307, 119].
We make the following observations:

• We can think of this model as a broadcasting scenario, where the root broadcasts its message
Xρ to the leaves through noisy channels PX′ |X . The condition (33.19) here is only made to avoid
defining the reverse channel. In general, one only requires that π is a stationary distribution of
PX′ |X , in which case the (33.21) should be replaced with ηKL (π , PX|X′ )b < 1.
• Under the assumption (33.19), the joint distribution of this tree can also be written as a Gibbs
distribution
 
1 X X
PXall = exp  f(Xp , Xc ) + g( X v )  , (33.20)
Z
(p,c)∈E v∈V

where Z is the normalization constant, f(xp , xc ) = f(xc , xp ) is symmetric. When X = {0, 1}, this
model is known as the Ising model (on a tree). Note, however, that not every measure factorizing
as (33.20) (with symmetric f) can be written as a broadcasting process for some P and π.

The broadcasting on trees is an inference problem in which we want to reconstruct the root
variable Xρ given the observations XLd = {Xv : v ∈ Ld }, with Ld = {v : v ∈ V, depth(v) = d}.
A natural question is to upper bound the performance of any inference algorithm on this problem.
The following theorem shows that there exists a phase transition depending on the branching factor
b and the contraction coefficient of the kernel PX′ |X .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-643


i i

33.5 Application: broadcasting and coloring on trees 643

Theorem 33.13 Consider the broadcasting problem on infinite b-ary tree (b > 1), with root
distribution π and edge kernel PX′ |X . If π is a reversible measure of PX′ |X such that
ηKL (π , PX′ |X )b < 1, (33.21)
then I(Xρ ; XLd ) → 0 as d → 0.

Proof. For every v ∈ L1 , we define the set Ld,v = {u : u ∈ Ld , v ∈ ancestor(u)}. We can upper
bound the mutual information between the root vertex and leaves at depth d
X
I(Xρ ; XLd ) ≤ I(Xρ ; XLd,v ).
v∈L1

For each term in the summation, we consider the Markov chain


XLd,v → Xv → Xρ .
Due to our assumption on π and PX′ |X , we have PXρ |Xv = PX′ |X and PXv = π. By the definition of
the contraction coefficient, we have
I(XLd,v ; Xρ ) ≤ ηKL (π , PX′ |X )I(XLd,v ; Xv ).
Observe that because PXv = π and all edges have the same kernel, then I(XLd,v ; Xv ) = I(XLd−1 ; Xρ ).
This gives us the inequality
I(Xρ ; XLd ) ≤ ηKL (π , PX′ |X )bI(Xρ ; XLd−1 ),
which implies
I(Xρ ; XLd ) ≤ (ηKL (π , PX′ |X )b)d H(Xρ ).
Therefore if ηKL (π , PX′ |X )b < 1 then I(Xρ ; XLd ) → 0 exponentially fast as d → ∞.
Note that a weaker version of this theorem (non-reconstruction when ηKL (PX′ |X )b ≤ 1) is
implied by the directed information percolation theorem. The k-coloring example (see below)
demonstrates that this strengthening is essential; see [203] for details.
Example 33.9 (Broadcasting on BSC tree.) Consider a broadcasting problem on b-ary tree
with vertex alphabet X = {0, 1}, edge kernel PX′ |X = BSCδ , and π = Unif . Note that uniform
distribution is a reversible measure for BSCδ . In Example 33.2, we calculated ηKL (BSCδ ) = (1 −
2δ)2 . Therefore, using Theorem 33.13, we can deduce that if
b(1 − 2δ)2 < 1
then no inference algorithm can recover the root nodes as depth of the tree goes to infinity. This
result is originally proved in [63].
Example 33.10 (k-coloring on tree) Given a b-ary tree, we assign a k-coloring Xvall by
sampling uniformly from the ensemble of all valid k-coloring. For this model, we can define a
corresponding inference problem, namely given all the colors of the leaves at a certain depth, i.e.,
XLd , determine the color of the root node, i.e., Xρ .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-644


i i

644

This problem can be modeled as a broadcasting problem on tree where the root distribution π
is given by the uniform distribution on k colors, and the edge kernel PX′ |X is defined as
(
1
a 6= b
PX′ |X (a|b) = k−1
0, a = b.

It can be shown (see Ex. VI.24) that ηKL (Unif, PX′ |X ) = k log k(11+o(1)) for large k. By Theorem
33.13, this implies that if b < k log k(1 + o(1)) then reliable reconstruction of the root node is not
possible. This result is originally proved in [393] and [50].
The other direction b > k log k(1 + o(1)) can be shown by observing that if b > k log k(1 + o(1))
then the probability of the children of a node taking all available colors (except its own) is close to
1. Thus, an inference algorithm can always determine the color of a node by finding a color that
is not assigned to any of its children. Similarly, when b > (1 + ϵ)k log k even observing (1 − ϵ)-
fraction of the node’s children is sufficient to reconstruct its color exactly. Proceeding recursively
from bottom up, such a reconstruction algorithm will succeed with high probability. In this regime
with positive probability (over the leaf variables) the posterior distribution of the root color is a
point mass (deterministic). This effect is known as “freezing” of the root given the boundary.
We may also consider another reconstruction method which simply computes majority of the
leaves, i.e. X̂ρ = j for the color j that appears the most among the leaves. This method gives
success probability strictly above 1k if and only if d > (k − 1)2 , by a famous result of Kesten and
Stigum [244]. While the threshold is suboptimal, the method is quite robust in the sense that it
also works if we only have access to a small fraction ϵ of the leaves (and the rest are replaced by
erasures).
Let us now consider ηχ2 (Unif, PX′ |X ). The transition matrix is symmetric with eigenvalues
{1, − k−1 1 } and thus from Theorem 33.12 we have that

1 1
ηχ2 (Unif, PX′ |X ) =  ηKL (Unif, PX′ |X ) = .
( k − 1) 2 k log k(1 + o(1))
Thus if Theorem 33.13 could be shown with Iχ2 instead of IKL we would be able to show non-
reconstruction for d < (k − 1)2 , contradicting the result of the previous paragraph. What goes
wrong is that Iχ2 fails to be subadditive, cf. (7.47). However, it is locally subadditive (when e.g.
Iχ2 (X; A)  1) by [202, Lemma 26]. Thus, an argument in Theorem 33.13 can be repeated for the
case where the leaves are observed through a very noisy channel (for example, an erasure channel
leaving only ϵ-fraction of the leaves). Consequently, robust reconstruction threshold for coloring
exactly equals d = (k − 1)2 . See [228] for more on robust reconstruction thresholds.

33.6 Application: distributed correlation estimation


Tensorization property can be used for correlation estimation. Suppose Alice observes
i.i.d. i.i.d.
{Xi }i≥1 ∼ B(1/2) and Bob observes {Yi }i≥1 ∼ B(1/2) such that the (Xi , Yi ) are i.i.d. with
E[Xi Yi ] = ρ ∈ [−1, 1]. The goal is for Bob to send W to Alice with H(W) = B bits and for

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-645


i i

33.6 Application: distributed correlation estimation 645

Alice to estimate ρ̂ = ρ̂(X∞ , W) with objective

R∗ (B) = inf sup E[(ρ − ρ̂)2 ].


W,ρ̂ ρ

Notice that in this problem we are not sample-limited (each party has infinitely many observations),
but communication-limited (only B bits can be exchanged).
Here is a trivial attempt to solve it. Notice that if Bob sends W = (Y1 , . . . , YB ) then the optimal
PB
estimator is ρ̂(X∞ , W) = 1n i=1 Xi Yi which has minimax error B1 , hence R∗ (B) ≤ B1 . Surprisingly,
this can be improved.

Theorem 33.14 ([207]) The optimal rate when B → ∞ is given by


1 + o( 1) 1
R∗ (X∞ , W) = ·
2 ln 2 B

Proof. Fix PW|Y∞ , we get the following decomposition

X1 Y1
.. ..
. .
W Xi Yi
.. ..
. .

Note that once the messages W are fixed we have a parameter estimation problem {Qρ , ρ ∈
[−1, 1]} where Qρ is a distribution of (X∞ , W) when A∞ , B∞ are ρ-correlated. Since we mini-
mize mean-squared error, we know from the van Trees inequality (Theorem 29.2)2 that R∗ (B) ≥
1+o(1) 1+o(1)
minρ JF (ρ) ≥ JF (0) where JF (ρ) is the Fisher information of the family {Qρ }.
Recall, that we also know from the local approximation that
ρ2 log e
D(Qρ kQ0 ) = JF (0) + o(ρ2 )
2
Furthermore, notice that under ρ = 0 we have X∞ and W independent and thus

D(Qρ kQ0 ) = D(PρX∞ ,W kP0X∞ ,W )


= D(PρX∞ ,W kPρX∞ × PρW )
= I(W; X∞ )
≤ ρ2 I(W; Y∞ )
≤ ρ2 B log 2

hence JF (0) ≤ (2 ln 2)B + o(1) which in turns implies the theorem. For full details and the
extension to interactive communication between Alice and Bob, see [207].

2
This requires some technical justification about smoothness of the Fisher information JF (ρ).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-646


i i

646

We turn to the upper bound next. First, notice that by taking blocks of m → ∞ consecutive bits
Pim−1
and setting X̃i = √1m j=(i−1)m Xj and similarly for Ỹi , Alice and Bob can replace ρ-correlated
 
i.i.d. 1 ρ
bits with ρ-correlated standard Gaussians (X̃i , Ỹi ) ∼ N (0, ). Next, fix some very large N
ρ 1
and let

W = argmax Yj .
1≤j≤N

(See Exercise V.16



for a motivation behind this idea.) From standard concentration results we know
that E[YW ] = 2 ln N(1 + o(1)) (Lemma 27.10) and Var[YW ] = O( ln1N ). Therefore, knowing W
Alice can estimate
XW
ρ̂ = .
E [ YW ]
1−ρ2 +o(1)
This is an unbiased estimator and Varρ [ρ̂] = 2 ln N . Finally, setting N = 2B completes the
argument.

33.7 Channel comparison: degradation, less noisy, more capable


It turns out that the ηKL coefficient is intimately related to the concept of less noisy partial order
on channels. We define several such partial orders together.

Definition 33.15 (Partial orders on channels) Let PY|X and PZ|X be two channels.

• We say that PY|X is a degradation of PZ|X , denoted by PY|X ≤deg PZ|X , if there exists PY|Z such
that PY|X = PY|Z ◦ PZ|X .
• We say that PZ|X is less noisy than PY|X , denoted by PY|X ≤ln PZ|X , if for every PU,X on the
following Markov chain

U X

we have I(U; Y) ≤ I(U; Z).


• We say that PZ|X is more capable than PY|X , denoted PY|X ≤mc PZ|X if for any PX we have
I(X; Y) ≤ I(X; Z).

We make some remarks on these definitions and refer to [345] for proofs:

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-647


i i

33.7 Channel comparison: degradation, less noisy, more capable 647

• PY|X ≤deg PZ|X =⇒ PY|X ≤ln PZ|X =⇒ PY|X ≤mc PZ|X . Counter examples for reverse
implications can be found in [111, Problem 15.11].
• The less noisy relation can be defined equivalently in terms of the divergence, namely PY|X ≤ln
PZ|X if and only if for all PX , QX we have D(QY kPY ) ≤ D(QZ kPZ ). We refer to [290, Sections
I.B, II.A] and [345, Section 6] for alternative useful characterizations of the less-noisy order.
• For BMS channels (see Section 19.4*) it turns out that among all channels with a given
Iχ2 (X; Y) = η (with X ∼ Ber(1/2)) the BSC and BEC are the minimal and maximal elements
in the poset of ≤ln ; see Ex. VI.21 for details.

Proposition 33.16 ηKL (PY|X ) ≤ 1 − τ if and only if PY|X ≤ln ECτ , where ECτ was defined in
Example 33.6.

Proof. For ECτ we always have

I(U; Z) = (1 − τ )I(U; X).

By the mutual information characterization of ηKL we have,

I(U; Y) ≤ (1 − τ )I(U; X).

Combining these two inequalities gives us

I(U; Y) ≤ I(U; Z).

This proposition gives us an intuitive interpretation of contraction coefficient as the worst


erasure channel that still dominates the channel.

Proposition 33.17 (Tensorization


N
ofNless noisy and more capable) If for all i ∈
[n], PYi |Xi ≤ln PZi |Xi , then i∈[n] PYi |Xi ≤ln PZi |Xi .3 If for all i ∈ [n], PYi |Xi ≤mc PZi |Xi , then
N N i∈[n]
i∈[n] PYi |Xi ≤mc i∈[n] PZi |Xi .

Proof. By induction it is sufficient to consider n = 2 only. Consider the following Markov chain:

Y1

X1
Z1
U
Y2
X2

Z2

3 ∏
We remind that ⊗PYi |Xi refers to the product (memoryless) channel with xn 7→ Yn ∼ i PYi |Xi =xi .

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-648


i i

648

Consider the following inequalities,

I(U; Y1 , Y2 ) = I(U; Y1 ) + I(U; Y2 |Y1 )


≤ I(U; Y1 ) + I(U; Z2 |Y1 )
= I(U; Y1 , Z2 ).

Hence I(U; Y1 , Y2 ) ≤ I(U; Y1 , Z2 ) for any PX1 ,X2 ,U . Applying the same argument we can replace
Y1 with Z1 to get I(U; Y1 , Z2 ) ≤ I(U; Z1 , Z2 ), completing the proof.
For the second claim, notice that

I(X2 ; Y2 ) = I(X2 ; Y2 ) + I(X1 ; Y2 |X2 )


( a)
= I(X2 ; Y1 ) + I(X2 ; Y2 |Y1 ) + I(X1 ; Y1 |X2 )
≤ I(X2 ; Y1 ) + I(X2 ; Z2 |Y1 ) + I(X1 ; Z1 |X2 )
= I(X2 ; Y1 , Z2 ) + I(X1 ; Z2 |X2 )
= I(X2 ; Z2 ) + I(X2 |Y1 |Z2 ) + I(X1 ; Z2 |X2 )
≤ I(X2 ; Z2 ) + I(X2 |Z1 |Z2 ) + I(X1 ; Z2 |X2 ) = I(X2 ; Z2 ) ,

where equalities are just applications of the chain rule (and in (a) and (b) we also notice that
conditioned on X2 the Y2 or Z2 are non-informative) and both inequalities are applications of
the most capable relation to the conditional distributions. For example, for every y we have
I(X2 ; Y2 |Y1 = y) ≤ I(X2 ; Z2 |Y1 = y) and hence we can average over y ∼ PY1 .

33.8 Undirected information percolation


In this section we will study the problem of inference on undirected graph. Consider an undirected
graph G = (V, E). We assign a random variable Xv on the alphabet X to each vertex v. For each
e = (u, v) ∈ E , we assign Ye sampled according to the kernel PYe |Xe with Xe = (Xu , Xv ). The
goal of this inference model is to determine the value of Xv ’s given the value of Ye ’s. As a visual
illustration we could be considering the following graph:

X2 X6
Y2

Y6
6
2

Y5
Y1

Y35
X1 X3 X5 X7
Y5
Y1

9
4

Y7
Y3

9
4

X4 X9

Example 33.11 (Community detection) In this model, we consider a complete graph


with n vertices, i.e. Kn , and the random variables Xv representing the membership of each vertex

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-649


i i

33.8 Undirected information percolation 649

to one of the m communities. We assume that Xv is sampled uniformly from [m] and independent
of the other vertices. The observation Yu,v is defined as
(
Ber(a/n) Xu = Xv
Yuv ∼
Ber(b/n) Xu 6= Xv .

Example 33.12 (Z2 synchronization) For any graph G, we sample Xv uniformly from
{−1, +1} and Ye = BSCδ (Xu Xv ).

Example 33.13 (Spiked Wigner model) We consider the inference problem of estimating
the value of vector (Xi )i∈[n] given the observation (Yij )i,j∈[n],i≤j . The Xi ’s and Yij ’s are related by
r
λ
Yij = Xi Xj + Wij ,
n
i.i.d.
where X = (X1 , . . . , Xn )⊤ is sampled uniformly from {±1}n and Wi,j = Wj,i ∼ N (0, 1), so that
W forms a Wigner matrix (symmetric Gaussian matrix). This model can also be written in matrix
form as
r
λ
Y= XX⊤ + W
n
as a rank-one perturbation of a Wigner matrix W, hence the name of the model. It is used as a
probabilistic model for principal component analysis.
This problem can also be treated as a problem of inference on undirected graph. In this case,
the underlying graph is a complete graph, and we assign Xi to the ith vertex. Under this model, the
edge observations is given by Yij = BIAWGNλ/n (Xi Xj ), cf. Example 3.4.
Although seemingly different, these problems share the following common characteristics,
namely:

Assumption 33.1 • Each Xv is uniformly distributed.


• Defining an auxiliary random variable B = 1{Xu 6= Xv } for any edges e = (u, v), the following
Markov chain holds

(Xu , Xv ) → B → Ye .

In other words, the observation on each edge only depends on whether the random variables on
its endpoints are similar.

Due to Assumption 33.1, the reconstructed Xv ’s is symmetric up to any permutation on X . In


the case of alphabet X = {−1, +1}, this implies that for any realization σ then PXall |Yall (σ|b) =
PXall |Yall (−σ|b). Consequently, our reconstruction metric also needs to accommodate this symmetry.
Pn
For X = {−1, +1}, this leads to the use of 1n | i=1 Xi X̂i | as our reconstruction metric.
Our main theorem for undirected inference problem can be seen as the analogue of the
information percolation theorem for DAG (Theorem 33.8). However, instead of controlling the

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-650


i i

650

contraction coefficient, the percolation probability is used to directly control the conditional
mutual information between any subsets of vertices in the graph.
Before stating our main theorem, we will need to define the corresponding percolation model
for inference on undirected graph. For any undirected graph G = (V, E) we define a percolation
model on this graph as follows :

• Every edge e ∈ E is open with the probability ηKL (PYe |Xe ), independent of the other edges,
• For any v ∈ V and S ⊂ V , we define the v ↔ S as the event that there exists an open path from
v to any vertex in S,
• For any S1 , S2 ⊂ V , we define the function percu (S1 , S2 ) as
X
percu (S1 , S2 ) ≜ P(v ↔ S2 ).
v∈S1

Notice that this function is different from the percolation function for information percolation
in DAG. Most importantly, this function is not equivalent to the exact percolation probability.
Instead, it is an upper bound on the percolation probability by union bounding with respect to
S1 . Hence, it is natural that this function is not symmetric, i.e. percu (S1 , S2 ) 6= percu (S2 , S1 ).

Theorem 33.18 (Undirected information percolation [347]) Consider an inference


problem on undirected graph G = (V, E). For any S1 , S2 ⊂ V , then

I(XS1 ; XS2 |Y) ≤ percu (S1 , S2 ) log |X |.

Instead of proving Theorem 33.18 in its full generality, we will prove the theorem under
Assumption 33.1. The main step of the proof utilizes the fact we can upper bound the mutual
information of any channel by its less noisy upper bound.

Theorem 33.19 Consider the problem of inference on undirected graph G = (V, E) with
X1 , ..., Xn not necessarily independent. If PYe |Xe ≤LN PZe |Xe , then for any S1 , S2 ⊂ V and E ⊂ E

I(XS1 ; YE |XS2 ) ≤ I(XS1 ; ZE |XS2 )

Proof. From our assumption and the tensorization property of less noisy ordering (Proposi-
tion 33.17), we have PYE |XS1 ,XS2 ≤ln PZE |XS1 ,XS2 . This implies that for σ as a valid realization of
XS2 we will have

I(XS1 ; YE |XS2 = σ) = I(XS1 , XS2 ; YE |XS2 = σ) ≤ I(XS1 , XS2 ; ZE |XS2 = σ) = I(XS1 ; ZE |XS2 = σ).

As this inequality holds for all realization of XS2 , then the following inequality also holds

I(XS1 ; YE |XS2 ) ≤ I(XS1 ; ZE |XS2 ).

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-651


i i

33.9 Application: spiked Wigner model 651

Proof of Theorem 33.18. We only give a proof under Assumption 33.1 and only for the case
S1 = {i}. For the full proof (that proceeds by induction and does not leverage the less noisy idea),
see [347]. We have the following equalities
I(Xi ; XS2 |YE ) = I(Xi ; XS2 , YE ) = I(Xi ; YE |XS2 ) (33.22)
where the first inequality is due to the fact BE ⊥⊥ Xi under S.C, and the second inequality is due to
Xi ⊥
⊥ XS2 under Assumption 33.1.
Due to our previous result, if ηKL (PYe |Xe ) = 1 − τ then PYe |Xe ≤LN PZe |Xe where PZe |Xe = ECτ .
By tensorization property, this ordering also holds for the channel PYE |XE , thus we have
I(Xi ; YE |XS2 ) ≤ I(Xj ; ZE |XS2 ).
Let us define another auxiliary random variable D = 1{i ↔ S2 }, namely it is the indicator that
there is an open path from i to S2 . Notice that D is fully determined by ZE . By the same argument
as in (33.22), we have
I(Xi ; ZE |XS2 ) = I(Xi ; XS2 |ZE )
= I(Xi ; XS2 |ZE , D)
= (1 − P[i ↔ S2 ]) I(Xi ; XS2 |ZE , D = 0) +P[i ↔ S2 ] I(Xi ; XS2 |ZE , D = 1)
| {z } | {z }
0 ≤log |X |

≤ P[i ↔ S2 ] log |X |
= percu (i, S2 )

33.9 Application: spiked Wigner model


The following theorem shows how the undirected information percolation concept allows us to
derive a converse result for spiked Wigner model, which we described in Example 33.13. To restate
the problem, we are given an observation
r
λ
Y= XXT + W ,
n
where W is a Gaussian Wigner matrix and X = (X1 , . . . , Xn )T consists of iid uniform ±1 entries.
As in many modern problems (cf. Section 30.4) our goal here is not to recover X completely, but
only to achieve better than random guess performance (“weak recovery”). That is we will be happy
if we find an estimator with
" n #
1 X
E Xi X̂i ≥ ϵ0 > 0 (33.23)
n
i=1

as n → ∞. Now, it turns out that in problems like this there is a so-called BBP phase transition, first
discovered in [29, 326]. Specifically, the eigenvalues of √1n W are well-known to follow Wigner’s

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-652


i i

652


semicircle law supported on the interval√(−2, 2). At the same time the rank-one matrix nλ XXT has
only one non-zero eigenvalue equal to λ. It turns out that for λ < 1 the effect of this “spike” is

undetectable and the spectrum of Y/ n is unaffected. For λ > 1 it turns out that the top eigenvalue

of Y/ n moves above the edge of the semicircle law to λ + λ1 > 2. Furthermore, computing the
top eigenvector and taking the sign of its coordinates achieves a correlated recovery of the true X
in the sense of (33.23). Note, however, that inability to change the spectrum (when λ < 1) does
not imply that reconstruction of X is not possible by other means. In this section, however, we will
show that indeed for λ ≤ 1 no method can achieve (33.23). Thus, together with the mentioned
spectral algorithm for λ > 1 we may conclude that λ∗ = 1 is the critical threshold separating the
two phases of the problem.

Theorem 33.20 Consider the spiked Wigner model. If λ ≤ 1, then for any sequence of
estimators Xˆn (Y),
" #
1 X
n
E Xi X̂i →0 (33.24)
n
i=1

as n → ∞.

Proof of Theorem 33.20. By Cauchy-Schwarz, for (33.24) it suffices to show


X
E[Xi Xj X̂i X̂j ] = o(n2 ) .
i̸=j

Next, it is clear that we can simplify the task of maximizing (over X̂n ) by allowing to separately
estimate each Xi Xj by T̂i,j , i.e.
X X
max E[Xi Xj X̂i X̂j ] ≤ max E[Xi Xj T̂i,j ] .
X̂n T̂i,j
i̸=j i̸=j

The latter maximization is solved by the MAP decoder:

T̂i,j (Y) = arg max P[Xi Xj = σ|Y] .


σ∈{±1}

Since each Xi ∼ Ber(1/2) it is easy to see that

I(Xi ; Xj |Y) → 0 ⇐⇒ max E[Xi Xj T̂i,j ] → 0 .


T̂i,j

(For example, we may notice I(Xi ; Xj |Y) = I(Xi , Xj ; Y) ≥ I(Xi Xj ; Y) and apply Fano’s inequality).
Thus, from the symmetry of the problem it is sufficient to prove I(X1 ; X2 |Y) → 0 as n → ∞.
By using the undirected information percolation theorem, we have

I(X2 ; X1 |Y) ≤ percu ({1}, {2}) .

Now, for computation of perc we need to compute the probability of having an open edge, which in
our case simply equals ηKL (BIAWGNλ/n ). From Theorem 33.6 we know the latter equals Iχ2 (X; Y)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-653


i i

33.10 Strong data post-processing inequality (post-SDPI) 653

where X ∼ Ber(1/2) and Y = BIAWGNλ/n (X). A short computation shows thus

λ
ηKL (BIAWGNλ/n ) = (1 + o(1)) .
n

Suppose that λ < 1. In this case, we can overbound λ+no(1) by λn with λ′ < 1. The percolation
random graph then is equivalent to the Erdös-Rényi random graph with n vertices and λ′ /n edge
probability, i.e., ER(n, λ′ /n). Using this observation, the inequality can be rewritten as

I(X2 ; X1 |Y) ≤ P(Vertex 1 and 2 is connected in ER(n, λ′ /n)).

A classical result in random graph theory is that the largest connected component in ER(n, λ′ /n)
contains O(log n) vertices if λ′ < 1 [154]. This implies that the probability that two specific
vertices are connected is o(1), hence I(X2 ; X1 |Y) → 0 as n → ∞.
To treat the case of λ = 1 we need a slightly more refined information about ηKL (BIAWGNλ/n )
and about the behavior of the giant component of ER(n, 1+on(1) ) graph; see [347] for full details.

Remark 33.2 (Dense-sparse correspondence) The proof above changes the underlying
structure of the graph. Namely, instead of dealing with a complete graph, the information percola-
tion method replaced it with an Erdös-Rényi random graph. Moreover, if ηKL is small enough, then
the underlying percolation graph tends to be very sparse and has a locally tree-like structure. This
demonstrates a ubiquitous and actively studied effect in modern statistics: dense inference (such
as spiked Wigner model, sparse regression, sparse PCA, etc) with very weak signals (ηKL ≈ 1)
is similar to sparse inference (broadcasting on trees) with moderate signals (ηKL ∈ (ϵ, 1 − ϵ)).
The information percolation method provides a certain bridge between these two worlds, perhaps
partially explaining why the results in these two worlds often parallel one another. (E.g. results on
optimality and phase transitions for belief propagation (sparse inference) often parallel those for
approximate message passing (AMP, dense inference)). We do want to caution, however, that the
reduction given by the information percolation method is not generally tight (spiked Wigner being
a lucky exception). For example [347], for correlated recovery in the SBM √
with k communities

and edge probability a/n and b/n it yields an impossibility result ( a − b)2 ≤ 2k , weaker than
the best known upper bounds of [203].

33.10 Strong data post-processing inequality (post-SDPI)


For the applications in distributed estimation the following version of the SDPI is useful.

Definition 33.21 (Post-SDPI constant) Given a conditional measure PY|X , define the input-
dependent and input-free contraction coefficients as
 
(p) I(U; X)
ηKL (PX , PY|X ) = sup :X→Y→U
PU|Y I(U; Y)

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-654


i i

654

X Y U

ε̄ 0 τ̄ 0 0
τ
?
τ
ε 1 τ̄ 1 1

Figure 33.6 Post-SDPI coefficient of BEC equals to 1.

 
(p) I(U; X)
ηKL (PY|X ) = sup :X→Y→U
PX ,PU|Y I(U; Y)

To get characterization in terms of KL-divergence we simply notice that


(p)
ηKL (PX , PY|X ) = ηKL (PY , PX|Y ) (33.25)
(p)
ηKL (PX , PY|X ) = sup ηKL (PY , PX|Y ) , (33.26)
PX

where PY = PY|X ◦ PX and PX|Y is the conditional measure corresponding to PX PY|X . From (33.25)
and Prop. 33.11 we also get tensorization property for input-dependent post-SDPI:
ηKL (P⊗ ⊗n
(p) n (p)
X , (PY|X ) ) = ηKL (PX , PY|X ). (33.27)
(p)
It is easy to see that by the data processing inequality, ηKL (PY|X ) ≤ 1. Unlike the ηKL coefficient
(p)
the ηKL can equal to 1 even for a noisy channel PY|X .
(p)
Example 33.14 (ηKL = 1 for erasure channels) Let PY|X = BECτ and X → Y → U
be defined as on Figure 33.6. Then we can compute I(Y; U) = H(U) = h(ετ̄ ) and I(X; U) =
H(U) − H(U|X) = h(ετ̄ ) − εh(τ ) hence
(p) I(X; U)
ηKL (PY|X ) ≥
I(Y; U)
ε
= 1 − h(τ )
h(ετ̄ )
This last term tends to 1 when ε tends to 0 hence
(p)
ηKL (BECτ ) = 1

even though Y is not a one to one function of X.


(p)
Note that this example also shows that even for an input-constrained version of ηKL the natural
(p)
conjecture ηKL (Unif, BMS) = ηKL (BMS) is incorrect. Indeed, by taking ε = 12 , we have that
(p)
ηKL (Unif, BECτ ) > 1 − τ for τ → 1.
Nevertheless, the post-SDPI constant is often non-trivial, most importantly for the BSC:

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-655


i i

33.10 Strong data post-processing inequality (post-SDPI) 655

Theorem 33.22
(p)
ηKL (BSCδ ) = (1 − 2δ)2 .

To prove this theorem, the following lemma is useful.

Lemma 33.23 If for any X and Y in {0, 1} we have


 1(x̸=y)
δ
p X , Y ( x, y) = f ( x) g( Y )
1−δ
for some functions f and g, then ηKL (PY|X ) ≤ (1 − 2δ)2 .

Proof. From (33.6) we know that for binary input channels


H4 (PY|X=0 , PY|X=1 )
ηKL (PY|X ) ≤ H2 (PY|X=0 , PY|X=1 ) −
4
   
If we let ϕ = gg((01)) , then we have pY|X=0 = B ϕ+λ λ
and pY|X=1 = B 1+ϕλ 1
and a simple check
shows that
H4 (PY|X=0 , PY|X=1 ) ϕ=1
max H2 (PY|X=0 , PY|X=1 ) − = (1 − 2δ)2
ϕ 4
Now observe that PX,Y in Theorem 33.22 satisfies the property of the lemma with X
(p)
and Y exchanged, hence ηKL (PY , PX|Y ) ≤ (1 − 2δ)2 which implies that ηKL (PY|X ) =
supPX ηKL (PY , PX|Y ) ≤ (1 − 2δ)2 with equality if PX is uniform.

Theorem 33.24 (Post-SDPI for BI-AWGN) Let 0 ≤ ϵ ≤ 1 and consider the channel PY|X
with X ∈ {±1} given by
Y = ϵ X + Z, Z ∼ N (0, 1) .
Then for any π ∈ (0, 1) taking PX = Ber(π ) we have for some absolute constant K the estimate
(p) ϵ2
ηKL (PX , PY|X ) ≤ K .
π (1 − π )

Proof. In this proof we assume all information measures are used to base-e. First, notice that
1
v(y) ≜ P[X = 1|Y = y] = 1−π −2yϵ
.
1+ π e
( p)
Then, the optimization defining ηKL can be written as
(p) d(EQY [v(Y)]kπ )
ηKL (PX , PY|X ) ≤ sup . (33.28)
QY D(QY kPY )
From (7.34) we have
(p) 1 (EQY [v(Y)] − π )2
ηKL (PX , PY|X ) ≤ sup . (33.29)
π (1 − π ) QY D(QY kPY )

i i

i i
i i

itbook-export CUP/HE2-design August 16, 2024 18:58 Page-656


i i

656

To proceed, we need to introduce a new concept. The T1 -transportation inequality, first intro-
duced by K. Marton, for the measure PY states the following: For every QY we have for some
c = c(PY )
p
W1 (QY , PY ) ≤ 2cD(QY kPY ) , (33.30)
where W1 (QY , PY ) is the 1-Wasserstein distance defined as
W1 (QY , PY ) = sup{EQY [f] − EPY [f] : f 1-Lipschitz} (33.31)
= inf{E[|A − B|] : A ∼ QY , B ∼ PY } .

The constant c(PY ) in (33.30) has been characterized in [64, 125] in terms of properties of PY . One
such estimate is the following:
c(P_Y) ≤ (2/δ) sup_{k≥1} ( G(δ) / binom(2k, k) )^{1/k},
where G(δ) = E[e^{δ(Y−Y′)²}] with Y, Y′ i.i.d.∼ P_Y. Using the estimate binom(2k, k) ≥ 4^k/√(π(k + 1/2)) and the fact
that (1/k) ln(k + 1/2) ≤ 1/2, we get the further bound
c(P_Y) ≤ (2/δ) (√(πe)/4) G(δ) ≤ 6 G(δ)/δ.
Next notice that Y − Y′ has the same distribution as B_ϵ + √2 Z, where B_ϵ ⊥⊥ Z ∼ N(0, 1) and B_ϵ is symmetric with |B_ϵ| ≤ 2ϵ.
Thus, we conclude that for any δ < 1/4 we have c̄ ≜ (6/δ) sup_{ϵ≤1} G(δ) < ∞. In the end, we have
inequality (33.30) with constant c = c̄ that holds uniformly for all 0 ≤ ϵ ≤ 1.
Now, notice that |d/dy v(y)| = 2ϵ v(y)(1 − v(y)) ≤ ϵ/2, and therefore v is (ϵ/2)-Lipschitz. From (33.30) and (33.31) we then obtain
|E_{Q_Y}[v(Y)] − E_{P_Y}[v(Y)]| ≤ (ϵ/2) √(2c̄ D(Q_Y ‖ P_Y)).
Squaring this inequality and plugging back into (33.29) completes the proof.
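A minimal numerical check of the Lipschitz step (again our own illustration): since v′(y) = 2ϵ v(y)(1 − v(y)), the maximum of |v′| should equal ϵ/2 for every π.

```python
import numpy as np

def v(y, eps, pi):
    # posterior P[X = 1 | Y = y] for Y = eps*X + Z, X in {±1}, P[X = 1] = pi
    return 1.0 / (1.0 + (1 - pi) / pi * np.exp(-2 * y * eps))

y = np.linspace(-30, 30, 200001)
for pi in [0.5, 0.1]:
    for eps in [0.3, 1.0]:
        dv = np.gradient(v(y, eps, pi), y)
        print(f"pi={pi:3.1f} eps={eps:3.1f}   max|v'| = {np.abs(dv).max():.4f}   eps/2 = {eps/2:.4f}")
```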
Remark 33.3   Notice that we can also compute the exact value of η^(p)_KL(P_X, P_{Y|X}) by noticing the
following. From (33.28) it is evident that among all measures QY with a given value of EQY [v(Y)]
we are interested in the one minimizing D(QY kPY ). From Theorem 15.11 we know that such QY
is given by dQY = ebv(y)−ψV (b) dPY , where ψV (b) ≜ ln EPY [ebv(Y) ]. Thus, by defining the convex
dual ψV∗ (λ) we can get the exact value in terms of the following single-variable optimization:
η^(p)_KL(P_X, P_{Y|X}) = sup_{λ∈[0,1]} d(λ ‖ π) / ψ*_V(λ).

Numerically, for π = 1/2 it turns out that the optimum is approached as λ → 1/2, justifying our overbounding
of d by χ², and surprisingly giving
η^(p)_KL(Ber(1/2), P_{Y|X}) = 4 Var_{P_Y}(v(Y)) = E_{P_Y}[tanh²(ϵY)] = η_KL(P_{Y|X}),
where in the last equality we used Theorem 33.6(f).
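The quantity E_{P_Y}[tanh²(ϵY)] is easy to evaluate numerically; the sketch below (discretization choices are ours) integrates tanh²(ϵy) against p_Y(y) = ½φ(y−ϵ) + ½φ(y+ϵ). For small ϵ the value is close to ϵ², consistent with the ϵ²/(π(1−π)) scaling of Theorem 33.24 at π = 1/2.

```python
import numpy as np

def eta_uniform_biawgn(eps, span=12.0, grid=400001):
    # E_{P_Y}[tanh^2(eps*Y)] for Y = eps*X + Z, X ~ Unif{±1}, Z ~ N(0,1)
    y = np.linspace(-span, span, grid)
    dy = y[1] - y[0]
    p_y = 0.5 / np.sqrt(2 * np.pi) * (np.exp(-(y - eps) ** 2 / 2) + np.exp(-(y + eps) ** 2 / 2))
    return float(np.sum(np.tanh(eps * y) ** 2 * p_y) * dy)

for eps in [0.1, 0.3, 1.0]:
    print(f"eps = {eps:3.1f}   E[tanh^2(eps*Y)] = {eta_uniform_biawgn(eps):.4f}   eps^2 = {eps*eps:.4f}")
```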


33.11 Application: distributed mean estimation


We want to estimate θ ∈ [−1, 1]d and we have m machines observing Yi = θ + σ Zi where Zi ∼
N (0, Id ) independently for i = 1, . . . , m. They can send a total of B bits to a remote estimator
which produces θ̂ with the goal of minimizing the worst-case risk sup_θ E[‖θ − θ̂‖²]. If we denote
by U_i ∈ U_i the messages, then we have the communication constraint Σ_i log₂|U_i| ≤ B and the
diagram is the following:

[Diagram: θ → Y_i → U_i for each machine i = 1, . . . , m; the messages U_1, . . . , U_m are all forwarded to the remote estimator, which outputs θ̂.]

Finally, denote the minimax risk of estimation by

R*(m, d, σ², B) = inf_{U_1,...,U_m, θ̂} sup_θ E[‖θ − θ̂‖²].

We begin with some simple observations:

• Without the constraint θ ∈ [−1, 1]^d, we could take θ ∼ N(0, bI_d) and from rate-distortion
quickly conclude that estimating θ within risk R requires communicating at least (d/2) log(bd/R) bits,
which diverges as b → ∞. Thus, restricting the magnitude of θ is necessary in order for it to be
estimable with finitely many bits communicated.
• Without communication constraint, it is easy to establish that R*(m, d, σ², ∞) =
E[‖(σ/m) Σ_i Z_i‖²] = dσ²/m by taking U_i = Y_i and θ̂ = (1/m) Σ_i U_i, which matches the minimax
risk (28.17) in the non-distributed setting.
• In order to achieve a risk of order d/m we can apply a crude quantizer as follows. Let U_i = sign(Y_i)
(coordinate-wise sign). This yields B = md and it is easy to show that the achievable risk is
O_σ(d/m). Indeed, notice that by taking V = (1/m) Σ_{i=1}^m U_i we see that each coordinate V_j, j ∈ [d],
estimates (within O_p(1/√m)) the quantity Φ(θ_j/σ), with Φ denoting the CDF of N(0, 1). Since Φ
has derivative bounded away from 0 on [−1/σ, 1/σ], we get that the estimate θ̂_j ≜ σΦ^{−1}(V_j)
will have mean squared error of O(1/m) (with a poor dependency on σ, though), which gives
overall error O(d/m) as claimed; a small simulation sketch of this scheme follows after this list.
Our main result below shows that the previous simple strategy is order-optimal in terms of
communicated bits. This simplifies the proofs of [137, 73].
• We remark that these results are also implicitly contained in the long line of work in the
information theoretic literature on the so-called Gaussian CEO problem. We recommend con-
sulting [156]; in particular, Theorem 3 there implies the B ≳ dm lower bound we show below.
However, the Gaussian CEO work uses a lot more sophisticated machinery (the entropy power
inequality and related results), while our SDPI proof is more elementary.
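The one-bit scheme from the quantizer bullet above can be simulated in a few lines. The sketch below is only an illustration (the parameter values are arbitrary, and we use the indicator 1{Y_ij > 0}, which carries the same single bit as the sign); the empirical risk comes out of the same order as d/m, up to a σ-dependent constant.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
d, m, sigma = 50, 2000, 1.0
theta = rng.uniform(-1, 1, size=d)                 # unknown mean in [-1,1]^d

Y = theta + sigma * rng.standard_normal((m, d))    # machine i observes Y_i
U = (Y > 0).astype(float)                          # one bit per coordinate: B = m*d bits total
V = U.mean(axis=0)                                 # V_j estimates Phi(theta_j / sigma)
V = np.clip(V, 1 / (2 * m), 1 - 1 / (2 * m))       # keep Phi^{-1} finite
theta_hat = sigma * norm.ppf(V)                    # invert the CDF

risk = float(np.sum((theta_hat - theta) ** 2))
print(f"empirical risk {risk:.3f}   vs   d/m = {d/m:.3f}")
```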


Our goal is to show that R* ≲ d/m implies B ≳ md.


Notice, first of all, that this is completely obvious for d = 1. Indeed, if B ≤ τm then fewer than τm
machines are communicating anything at all, and hence R* ≥ K/(τm) for some universal constant K
(which is not 1 because θ is restricted to [−1, 1]). However, for d ≫ 1 it is not clear whether each
machine is required to communicate Ω(d) bits. Perhaps sending ≪ d single-bit measurements
taken in different orthogonal bases could work? Hopefully, this (incorrect) guess demonstrates
why the following result is interesting and non-trivial.

Theorem 33.25   There exists a constant c₁ > 0 such that if R*(m, d, σ², B) ≤ dϵ²/9 then B ≥ c₁ d/ϵ².

Proof. Let X ∼ Unif({±1}d ) and set θ = ϵX. Given an estimate θ̂ we can convert it into an
estimator of X via X̂ = sign(θ̂) (coordinatewise). Then, clearly
(ϵ²/4) E[d_H(X, X̂)] ≤ E[‖θ̂ − θ‖²] ≤ dϵ²/9.
Thus, we have an estimator of X within Hamming distance (4/9)d. From Rate-Distortion (Theo-
rem 26.1) we conclude that I(X; X̂) ≥ cd for some constant c > 0. On the other hand, from
the standard DPI we have
cd ≤ I(X; X̂) ≤ I(X; U_1, . . . , U_m) ≤ Σ_{j=1}^m I(X; U_j),    (33.32)

where we also applied Theorem 6.1. Next we estimate I(X; Uj ) via I(Yj ; Uj ) by applying the Post-
SDPI. To do this we need to notice that the channel X → Yj for each j is just a memoryless
extension of the binary-input AWGN channel with SNR ϵ. Since each coordinate of X is uniform,
we can apply Theorem 33.24 (with π = 1/2) together with tensorization (33.27) to conclude that

I(X; U_j) ≤ 4Kϵ² I(Y_j; U_j) ≤ 4Kϵ² log|U_j|.
Together with (33.32) we thus obtain
cd ≤ I(X; X̂) ≤ 4Kϵ² B log 2,    (33.33)
which gives B ≥ c₁ d/ϵ² with c₁ = c/(4K log 2).

We notice that in this short section we only considered a non-interactive setting in the sense that
the message Ui is produced by machine i independently and without consulting anything except
its private measurement Yi . More generally, we could allow machines to communicate their bits
over a public broadcast channel, so that each communicated bit is seen by all other machines. We
could still restrict the total number of bits sent by all machines to be B and ask for the best possible
interactive estimation rate. While [137, 73] claim lower bounds that apply to this setting, those
bounds contain subtle errors (see [5, 4] for details). There are lower bounds applicable to
interactive settings, but they are weaker by certain logarithmic terms. For example, [5, Theorem 4]
shows that to achieve risk ≲ dϵ² one needs B ≳ d/(ϵ² log(dm)) in the limited interactive setting where
U_i may depend on U_1^{i−1} but there are no other interactions (i.e., the i-th machine sends its entire


message at once instead of sending part of it and waiting for others to broadcast theirs before
completing its own transmission, as permitted by the fully interactive protocol).

Exercises for Part VI

VI.1 Let X1, . . . , Xn be i.i.d. ∼ Exp(exp(θ)), where θ follows the Cauchy distribution π with parameter s,
whose pdf is given by p(θ) = 1/(πs(1 + θ²/s²)) for θ ∈ R. Show that the Bayes risk
R*_π ≜ inf_{θ̂} E_{θ∼π} E(θ̂(X^n) − θ)²
satisfies R*_π ≥ 2s²/(2ns² + 1).
VI.2 (System identification) Let θ ∈ R be an unknown parameter of a dynamical system:
X_t = θ X_{t−1} + Z_t,    Z_t i.i.d.∼ N(0, 1),    X_0 = 0.

Learning parameters of dynamical systems is known as “system identification”. Denote the law
of (X1 , . . . , Xn ) corresponding to θ by Pθ .
1. Compute D(Pθ kPθ0 ). (Hint: chain rule saves a lot of effort.)
2. Show that the Fisher information is
J_F(θ) = Σ_{1≤t≤n−1} θ^{2t−2} (n − t).

3. Argue that the hardest regime for system identification is when θ ≈ 0, and that instability
(|θ| > 1) is in fact helpful.
VI.3 (Linear regression) Consider the model

Y = Xβ + Z

where the design matrix X ∈ Rn×d is known and Z ∼ N (0, In ). Define the minimax mean-square
error of estimating the regression coefficient β ∈ Rd based on X and Y as follows:

R*_est = inf_{β̂} sup_{β∈R^d} E‖β̂ − β‖²₂.    (VI.1)

(a) Show that if rank(X) < d, then R∗est = ∞;


(b) Show that if rank(X) = d, then

R∗est = tr((X⊤ X)−1 )

and identify which estimator achieves the minimax risk.


(c) As opposed to the estimation error in (VI.1), consider the prediction error:

R*_pred = inf_{β̂} sup_{β∈R^d} E‖Xβ̂ − Xβ‖²₂.    (VI.2)


Redo (a) and (b) by finding the value of R*_pred and identifying the minimax estimator. Explain
intuitively why R∗pred is always finite even when d exceeds n.
VI.4 (Chernoff-Rubin-Stein lower bound.) Let X1, . . . , Xn i.i.d.∼ P_θ and θ ∈ [−a, a].
(a) State the appropriate regularity conditions and prove the following minimax lower bound:
 
inf_{θ̂} sup_{θ∈[−a,a]} E_θ[(θ − θ̂)²] ≥ min_{0<ϵ<1} max{ ϵ²a², (1 − ϵ)²/(nJ̄_F) },
where J̄_F = (1/(2a)) ∫_{−a}^{a} J_F(θ) dθ is the average Fisher information. (Hint: Consider the uniform
prior on [−a, a] and proceed as in the proof of Theorem 29.2 by applying integration by
parts.)
(b) Simplify the above bound and show that
inf_{θ̂} sup_{θ∈[−a,a]} E_θ[(θ − θ̂)²] ≥ 1/(a^{−1} + √(nJ̄_F))².    (VI.3)
(c) Assuming the continuity of θ 7→ JF (θ), show that the above result also leads to the optimal
local minimax lower bound in Theorem 29.4 obtained from Bayesian Cramér-Rao:
inf_{θ̂} sup_{θ∈[θ₀±n^{−1/4}]} E_θ[(θ − θ̂)²] ≥ (1 + o(1))/(n J_F(θ₀)).

Note: (VI.3) is an improvement of the inequality given in [92, Lemma 1] without proof and
credited to Rubin and Stein.
VI.5 In this exercise we give a Hellinger-based lower bound analogous to the χ2 -based HCR lower
bound in Theorem 29.1. Let θ̂ be an unbiased estimator for θ ∈ Θ ⊂ R.
(a) For any θ, θ′ ∈ Θ, show that [386]
 
(1/2)(Var_θ(θ̂) + Var_{θ′}(θ̂)) ≥ ((θ − θ′)²/4) (1/H²(P_θ, P_{θ′}) − 1).    (VI.4)
(Hint: For any c, θ − θ′ = ∫ (θ̂ − c)(√p_θ + √p_{θ′})(√p_θ − √p_{θ′}). Apply Cauchy-Schwarz
and optimize over c.)
(b) Show that
H²(P_θ, P_{θ′}) ≤ (1/4)(θ − θ′)² J̄_F,    (VI.5)
where J̄_F = (1/(θ′ − θ)) ∫_θ^{θ′} J_F(u) du is the average Fisher information.
(c) State the needed regularity conditions and deduce the Cramér-Rao lower bound from (VI.4)
and (VI.5) with θ′ → θ.
(d) Extend the previous parts to the multivariate case.
VI.6 (Bayesian distribution estimation.) Let {Pθ : θ ∈ Θ} be a family of distributions on X
with a common dominating measure μ and density p_θ(x) = dP_θ/dμ(x). Given a sample X^n =
(X1, . . . , Xn) i.i.d.∼ P_θ for some θ ∈ Θ, the goal is to estimate the data-generating distribution P_θ by
some estimator P̂(·) = P̂(·; Xn ) with respect to some loss function ℓ(P, P̂). Suppose we are in


a Bayesian setting where θ is drawn from a prior π. Let’s find the form of the Bayes estimator
and the Bayes risk.
(a) For convenience, let Xn+1 denote a test data point (unseen) drawn from Pθ and independent
of the observed data Xn . Convince yourself that every estimator P̂ can be formally identified
as a conditional distribution QXn+1 |Xn .
(b) Consider the KL loss ℓ(P, P̂) = D(PkP̂). Using Corollary 4.2, show that the Bayes estimator
minimizing the average KL risk is the posterior (conditional mean), i.e. its μ-density is given
by
q_{X_{n+1}|X^n}(x_{n+1}|x^n) = E_{θ∼π}[∏_{i=1}^{n+1} p_θ(x_i)] / E_{θ∼π}[∏_{i=1}^{n} p_θ(x_i)].    (VI.6)
(c) Conclude that the Bayes KL risk equals I(θ; Xn+1 |Xn ). Compare with the conclusion of
Exercise II.19 and the KL risk interpretation of batch regret in (13.35).
(d) Now, consider the χ² loss ℓ(P, P̂) = χ²(P‖P̂). Using (I.12) in Exercise I.45 show that the
optimal risk is given by
inf_{P̂} E_{θ,X^n}[χ²(P_θ‖P̂)] = E_{X^n}[ ( ∫ μ(dx_{n+1}) √(E_θ[p_θ(x_{n+1})² | X^n]) )² ] − 1,    (VI.7)
attained by
q_{X_{n+1}|X^n}(x_{n+1}|x^n) ∝ √( E_θ[p_θ(x_{n+1})² | X^n = x^n] ).    (VI.8)

(e) Now, consider the reverse-χ² loss ℓ(P, P̂) = χ²(P̂‖P), a weighted quadratic loss. Using
(I.13) in Exercise I.45 show that the optimal risk is attained by
q_{X_{n+1}|X^n}(x_{n+1}|x^n) ∝ ( E_θ[p_θ(x_{n+1})^{−1} | X^n = x^n] )^{−1}.    (VI.9)
(f) Consider the discrete alphabet [k] and X^n i.i.d.∼ P, where P = (P1, . . . , Pk) is drawn from
the Dirichlet(α, . . . , α) prior. Applying previous results (with μ the counting measure),
show that the Bayes estimator for the KL loss and reverse-χ2 loss is given by Krichevsky-
Trofimov add-β estimator (Section 13.5)
P̂(j) = (n_j + β)/(n + kβ),    n_j ≜ Σ_{i=1}^n 1{X_i = j},    (VI.10)

where β = α for KL and β = α − 1 for reverse-χ2 (assuming α ≥ 1). Hint: The posterior is
(P1 , . . . , Pk )|Xn ∼ Dirichlet(n1 + α, . . . , nk + α) and Pj |Xn ∼ Beta(nj + α, n − nj + (k − 1)α).
(g) For the χ2 loss, show that the Bayes estimator is
P̂(j) = √((n_j + α)(n_j + α + 1)) / Σ_{j=1}^k √((n_j + α)(n_j + α + 1)).    (VI.11)
VI.7 (Coin flips) Given X1, . . . , Xn i.i.d.∼ Ber(θ) with θ ∈ Θ = [0, 1], we aim to estimate θ with respect
to the quadratic loss function ℓ(θ, θ̂) = (θ − θ̂)2 . Denote the minimax risk by R∗n .


(a) Use the empirical frequency θ̂emp = X̄ to estimate θ. Compute and plot the risk Rθ (θ̂) and
show that
R*_n ≤ 1/(4n).
(b) Compute the Fisher information of Pθ = Ber(θ)⊗n and Qθ = Bin(n, θ). Explain why they
are equal.
(c) Invoke the Bayesian Cramér-Rao lower bound Theorem 29.2 to show that
R*_n = (1 + o(1))/(4n).
(d) Notice that the risk of θ̂emp is maximized at 1/2 (fair coin), which suggests that it might be
possible to hedge against this situation by the following randomized estimator
θ̂_rand = { θ̂_emp, with probability δ;  1/2, with probability 1 − δ.    (VI.12)

Find the worst-case risk of θ̂rand as a function of δ . Optimizing over δ , show the improved
upper bound:
R*_n ≤ 1/(4(n + 1)).
(e) As discussed in Remark 28.3, a randomized estimator can always be improved if the loss is
convex; so we should average out the randomness in (VI.12) by considering the estimator
θ̂* = E[θ̂_rand | X] = X̄δ + (1 − δ)/2.    (VI.13)
Optimizing over δ to minimize the worst-case risk, find the resulting estimator θ̂∗ and its
risk, show that it is constant (independent of θ), and conclude
R*_n ≤ 1/(4(1 + √n)²).
(f) Next we show θ̂∗ found in part (e) is exactly minimax and hence
R*_n = 1/(4(1 + √n)²).
Consider the following prior Beta(a, b) with density:
π(θ) = (Γ(a + b)/(Γ(a)Γ(b))) θ^{a−1} (1 − θ)^{b−1},    θ ∈ [0, 1],
where Γ(a) ≜ ∫_0^∞ x^{a−1} e^{−x} dx. Show that if a = b = √n/2, θ̂* coincides with the Bayes
estimator for this prior, which is therefore least favorable. (Hint: work with the sufficient
statistic S = X1 + . . . + Xn .)
(g) Show that the least favorable prior is not unique; in fact, there is a continuum of them. (Hint:
consider the Bayes estimator E[θ|X] and show that it only depends on the first n + 1 moments
of π.)


(h) (k-ary alphabet) Suppose X1, . . . , Xn i.i.d.∼ P on [k]. Show that for any k, n, the minimax squared
risk of estimating P in Theorem 29.5 is exactly
R*_sq(k, n) = inf_{P̂} sup_{P∈P_k} E[‖P̂ − P‖²₂] = (1/(√n + 1)²) · (k − 1)/k,    (VI.14)
achieved by the add-(√n/k) estimator. (Hint: For the lower bound, show that the Bayes estimator
for the squared loss and the KL loss coincide, then apply (VI.10) in Exercise VI.6.)
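A quick numerical check of parts (e)–(f) (optional, not part of the exercise): the risk of θ̂* = X̄δ + (1 − δ)/2 with δ = √n/(1 + √n) equals variance plus squared bias and is constant in θ.

```python
import numpy as np

n = 100
delta = np.sqrt(n) / (1 + np.sqrt(n))            # weight from part (e)
theta = np.linspace(0, 1, 11)
# exact risk of theta_hat = delta*Xbar + (1-delta)/2 under Ber(theta)^n: variance + bias^2
risk = delta ** 2 * theta * (1 - theta) / n + ((1 - delta) * (0.5 - theta)) ** 2
print(risk)                                      # constant over theta
print(1 / (4 * (1 + np.sqrt(n)) ** 2))           # the claimed minimax value
```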
VI.8 (Distribution estimation in TV) Continuing (VI.14), we show that the minimax rate for
estimating P with respect to the total variation loss is
R*_TV(k, n) ≜ inf_{P̂} sup_{P∈P_k} E_P[TV(P̂, P)] ≍ √(k/n) ∧ 1,    ∀ k ≥ 2, n ≥ 1,    (VI.15)
(a) Show that the MLE coincides with the empirical distribution.
(b) Show that the MLE achieves the RHS of (VI.15) within constant factors. (Hint: either apply
(7.58) plus Pinsker’s inequality, or directly use the variance of empirical frequencies.)
(c) Establish the minimax lower bound. (Hint: apply Assouad’s lemma, or Fano’s inequality
(with volume method or explicit construction of packing), or the mutual information method
directly.)
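An optional simulation illustrating the √(k/n) rate in (VI.15) for the empirical distribution (the choice of a uniform P and of k, n is ours):

```python
import numpy as np

rng = np.random.default_rng(1)
k, n, trials = 20, 500, 200
P = np.full(k, 1.0 / k)
tv = []
for _ in range(trials):
    counts = rng.multinomial(n, P)
    P_hat = counts / n                           # empirical distribution (= MLE, part (a))
    tv.append(0.5 * np.abs(P_hat - P).sum())
print(f"E[TV] ~ {np.mean(tv):.3f}   sqrt(k/n) = {np.sqrt(k/n):.3f}")
```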
VI.9 (Distribution estimation under reverse-χ²) Consider estimating a discrete distribution P on [k] in
reverse-χ² divergence from X^n i.i.d.∼ P, which is a weighted version of the quadratic loss in (VI.14).
We show that the minimax risk is given by
R*_revχ²(k, n) ≜ inf_{P̂} sup_{P∈P([k])} E_P[χ²(P̂‖P)] = (k − 1)/n.
(a) Show that taking P̂(j) = (1/n) Σ_{i=1}^n 1{X_i = j} to be the empirical distribution we always have
E[χ²(P̂‖P)] = (k − 1)/n.


(b) Given P ∼ Dirichlet(α, . . . , α) show that when α = 1 the Bayes optimal estimator is
precisely the empirical distribution. (Hint: (VI.9))
(c) Conclude that the uniform prior on P([k]) is least favorable and the empirical distribution is exactly
minimax optimal.
VI.10 (Distribution estimation in KL and χ2 ) Consider estimating a discrete distribution P on [k] in
KL and χ2 divergence, which are examples of an unbounded loss (KL loss is also known as
cross-entropy or log-loss in machine learning). Since these divergences are not symmetric, we
define (recall reverse-χ2 has been addressed in Exercise VI.9)
R*_KL(k, n) ≜ inf_{P̂} sup_{P∈P_k} E_P[D(P‖P̂)],    R*_revKL(k, n) ≜ inf_{P̂} sup_{P∈P_k} E_P[D(P̂‖P)],
R*_χ²(k, n) ≜ inf_{P̂} sup_{P∈P_k} E_P[χ²(P‖P̂)].

We have (up to universal constant factors) for all k, n:
R*_KL(k, n) ≍ R*_revKL(k, n) ≍ log(1 + k/n) ≍ { k/n, if k ≤ 1.1n;  log(k/n), if k > 1.1n },    (VI.16)


R*_χ²(k, n) ≍ k/n.    (VI.17)
(a) Show that the empirical distribution, optimal for the TV loss (Exercise VI.8), implies the
claimed upper bound for the reverse KL loss (Hint: (7.58)). Show, on the other hand, that
for KL and χ2 it results in infinite loss.
(b) To show the upper bound for χ2 , consider the add-α estimator P̂ in (VI.10) with α = 1.
Show that
E[χ²(P‖P̂)] ≤ (k − 1)/(n + 1).
Using (7.34) conclude the upper bound part of (VI.16). (Hint: E_{N∼Bin(n,p)}[1/(N + 1)] = (1 − p̄^{n+1})/((n + 1)p).)
(c) Show that in the small alphabet regime of k ≲ n, all lower bounds follow from (VI.15).
(d) Next assume k ≥ 4n. Consider a Dirichlet(α, . . . , α) prior in (13.16). Applying (VI.11) and
(VI.7) for the Bayes χ² risk and choosing α = n/k, show the lower bound R*_χ²(k, n) ≳ k/n.
(e) Consider the prior under which P is uniform over a support set S chosen uniformly at ran-
dom from all s-subsets of [k], where s < k is to be specified. Applying (VI.6), show that for
this prior the Bayes estimator for KL loss takes a natural form:
P̂_i = { 1/s, if i ∈ Ŝ;  (1 − ŝ/s)/(k − ŝ), if i ∉ Ŝ, }

where Ŝ = {i : n√ i ≥ 1} is the support of the empirical distribution and ŝ = |Ŝ|.


p
(f) Choosing s = nk, conclude E[TV(P, P̂)] ≥ 1 − 2 nk . (Hint: Show that TV(P, P̂) ≥
(1 − ŝs )(1 − ks ) and ŝ ≤ n.)
(g) Using (7.31), show that E[D(P̂‖P)], E[D(P‖P̂)] ≥ Ω(log(k/n)), concluding the lower bound in
(VI.16). (Hint: (7.31) is convex in TV.)
Note: For k ≫ 1, [258] found that the best add-α estimator has α* ≈ 0.509 (unlike α = 1/2
optimal for cumulative loss in Section 13.5) and it achieves loss α*(k − 1 + o(1))/n. In this regime, the
optimal risk is only slightly smaller and equals R*_KL(k, n) = (k − 1 + o(1))/(2n), which was shown in [72,
71] via deep results in polynomial approximation (the optimal estimator is an add-c estimator but
with c varying according to the empirical count in each bin). For k ≫ n, Paninski [323] showed
R*_KL(k, n) = log(k/n)(1 + o(1)) by a careful analysis of the Dirichlet prior.
VI.11 (Nonparametric model) In this exercise we consider some nonparametric extensions of the
Gaussian location model and the Bernoulli model. Observing X1, . . . , Xn i.i.d.∼ P for some P ∈ P,
where P is a collection of distributions on the real line, our goal is to estimate the mean of
the distribution P: μ(P) ≜ ∫ x P(dx), which is a linear functional of P. Denote the minimax
quadratic risk by

R*_n = inf_{μ̂} sup_{P∈P} E_P[(μ̂(X1, . . . , Xn) − μ(P))²].

(a) Let P be the class of distributions (which need not have a density) on the real line with
variance at most σ². Show that R*_n = σ²/n.


(b) Let P = P([0, 1]), the collection of all probability distributions on [0, 1]. Show that
R*_n = 1/(4(1 + √n)²). (Hint: For the upper bound, using the fact that, for any [0, 1]-valued ran-
dom variable Z, Var(Z) ≤ E[Z](1 − E[Z]), mimic the analysis of the estimator (VI.13) in
Ex. VI.7e.)
VI.12 Prove Theorem 30.2 using Fano’s method. (Hint: apply Theorem 31.3 with T = ϵ · Sdk , where
Sdk denotes the Hamming sphere of radius k in d dimensions. Choose ϵ appropriately and apply
the Gilbert-Varshamov bound for the packing number of Sdk in Theorem 27.6.)
VI.13 (Sharp minimax rate in sparse denoising) Continuing Theorem 30.2, in this exercise we deter-
mine the sharp minimax risk for denoising a high-dimensional sparse vector. In the notation of
(30.13), we show that, for the d-dimensional GLM model X ∼ N (θ, Id ), the following minimax
risk satisfies, as d → ∞ and k/d → 0,
R*(k, d) ≜ inf_{θ̂} sup_{‖θ‖₀≤k} E_θ[‖θ̂ − θ‖²₂] = (2 + o(1)) k log(d/k).    (VI.18)

(a) We first consider 1-sparse vectors and prove

R*(1, d) ≜ inf_{θ̂} sup_{‖θ‖₀≤1} E_θ[‖θ̂ − θ‖²₂] = (2 + o(1)) log d,    d → ∞.    (VI.19)

For the lower bound, consider the prior π under which θ is uniformly distributed over
{τe1, . . . , τed}, where the ei's denote the standard basis. Let τ = √((2 − ϵ) log d). Show that
for any ϵ > 0, the Bayes risk is given by

inf_{θ̂} E_{θ∼π}[‖θ̂ − θ‖²₂] = τ²(1 + o(1)),    d → ∞.

(Hint: either apply the mutual information method, or directly compute the Bayes risk by
evaluating the conditional mean and conditional variance.)
(b) Demonstrate an estimator θ̂ that achieves the RHS of (VI.19) asymptotically. (Hint: consider
the hard-thresholding estimator (30.13) or the MLE (30.11).)
(c) To prove the lower bound part of (VI.18), prove the following generic result
 
R*(k, d) ≥ k R*(1, d/k),
and then apply (VI.19). (Hint: consider a prior of d/k blocks each of which is 1-sparse.)
(d) Similar to the 1-sparse case, demonstrate an estimator θ̂ that achieves the RHS of (VI.18)
asymptotically.
Note: For both the upper and lower bound, the normal tail bound in Exercise V.25 is helpful.
VI.14 Consider the following functional estimation problem in GLM. Observing X ∼ N (θ, Id ), we
intend to estimate the maximal coordinate of θ: T(θ) = θmax ≜ max{θ1 , . . . , θd }. Prove the
minimax rate:

inf_{T̂} sup_{θ∈R^d} E_θ(T̂ − θ_max)² ≍ log d.    (VI.20)

(a) Prove the upper bound by considering T̂ = Xmax , the plug-in estimator with the MLE.


(b) For the lower bound, consider two hypotheses:

H0 : θ = 0, H1 : θ ∼ Unif {τ e1 , τ e2 , . . . , τ ed } .

where ei ’s are the standard bases and τ > 0. Then under H0 , X ∼ P0 = N (0, Id ); under H1 ,
X ∼ P1 = (1/d) Σ_{i=1}^d N(τe_i, I_d). Show that χ²(P1‖P0) = (e^{τ²} − 1)/d. (Hint: Exercise I.48)
(c) Applying the joint range (7.32) (or (7.38)) to bound TV(P0 , P1 ), conclude the lower bound
part of (VI.20) via Le Cam’s method (Theorem 31.1).
(d) By improving both the upper and lower bound prove the sharp version:
 
inf_{T̂} sup_{θ∈R^d} E_θ(T̂ − θ_max)² = (1/2 + o(1)) log d,    d → ∞.    (VI.21)

VI.15 (Suboptimality of MLE in high dimensions [55]) Consider the d-dimensional GLM: X ∼
N (θ, Id ), where θ belongs to the parameter space
Θ = { θ ∈ R^d : |θ1| ≤ d^{1/4}, ‖θ_{\1}‖₂ ≤ 2(1 − d^{−1/4}|θ1|) }

with θ\1 ≡ (θ2 , . . . , θd ). For the square loss, prove the following for sufficiently large d.
(a) The minimax risk is bounded:

inf_{θ̂} sup_{θ∈Θ} E_θ[‖θ̂ − θ‖²₂] ≲ 1.

(b) The worst-case risk of maximum likelihood estimator

θ̂_MLE ≜ argmin_{θ̃∈Θ} ‖X − θ̃‖₂

is unbounded:

sup_{θ∈Θ} E_θ[‖θ̂_MLE − θ‖²₂] ≳ d.

VI.16 (Covariance model) Let X1, . . . , Xn i.i.d.∼ N(0, Σ), where Σ is a d × d covariance matrix. Let us
show that the minimax quadratic risk of estimating Σ using X1 , . . . , Xn satisfies
 
inf_{Σ̂} sup_{‖Σ‖_F≤r} E[‖Σ̂ − Σ‖²_F] ≍ (d/n ∧ 1) r²,    ∀ r > 0, d, n ∈ N,    (VI.22)
where ‖Σ̂ − Σ‖²_F = Σ_{ij}(Σ̂_ij − Σ_ij)².
(a) Show that unlike location model, without restricting to a compact parameter space for Σ,
the minimax risk in (VI.22) is infinite.
(b) Consider the sample covariance matrix Σ̂ = (1/n) Σ_{i=1}^n X_i X_i^⊤. Show that
E[‖Σ̂ − Σ‖²_F] = (1/n)(‖Σ‖²_F + Tr(Σ)²)
and use this to deduce the minimax upper bound in (VI.22).


(c) To prove the minimax lower bound, we can proceed in several steps. Show that for any
positive semidefinite (PSD) Σ0, Σ1 ⪰ 0, the KL divergence satisfies
D(N(0, I_d + Σ0) ‖ N(0, I_d + Σ1)) ≤ (1/2) ‖Σ0^{1/2} − Σ1^{1/2}‖²_F,    (VI.23)
where Id is the d × d identity matrix. (Hint: apply (2.8).)
(d) Let B(δ) denote the Frobenius ball of radius δ centered at the zero matrix. Let PSD = {X :
X ⪰ 0} denote the collection of d × d PSD matrices. Show that
vol(B(δ) ∩ PSD) / vol(B(δ)) = P[Z ⪰ 0],    (VI.24)
where Z is a random matrix distributed according to the Gaussian Orthogonal Ensemble (GOE),
that is, Z is symmetric with independent diagonals Z_ii i.i.d.∼ N(0, 2) and off-diagonals Z_ij i.i.d.∼ N(0, 1).
(e) Show that P[Z ⪰ 0] ≥ c^{d²} for some absolute constant c.⁴
(f) Prove the following lower bound on the packing number of the set of PSD matrices:
M(B(δ) ∩ PSD, ‖·‖_F, ϵ) ≥ (c′ δ/ϵ)^{d²/2}    (VI.25)
for some absolute constant c′ . (Hint: Use the volume bound and the result of Part (d) and
(e).)
(g) Complete the proof of the lower bound of (VI.22). (Hint: WLOG, we can consider r ≍ √d and
aim for the lower bound Ω(d²/n ∧ d). Take the packing from (VI.25) and shift by the identity
matrix I. Then apply Fano’s method and use (VI.23).)
VI.17 For a family of probability distributions P and a functional T : P → R define its χ2 -modulus
of continuity as

δ_χ²(t) = sup_{P1,P2∈P} { T(P1) − T(P2) : χ²(P1‖P2) ≤ t }.

When the functional T is affine and continuous, and P is compact⁵, it can be shown [346] that
(1/7) δ_χ²(1/n)² ≤ inf_{T̂n} sup_{P∈P} E_{X_i i.i.d.∼ P}(T(P) − T̂n(X1, . . . , Xn))² ≤ δ_χ²(1/n)².    (VI.26)

Consider the following problem (interval censored model): A lab conducts experiments with
n mice. In the i-th mouse a tumour develops at time A_i ∈ [0, 1] with A_i i.i.d.∼ π, where π is a pdf
on [0, 1] bounded by 1/2 ≤ π ≤ 2 pointwise. For each i the existence of a tumour is checked at
another random time B_i i.i.d.∼ Unif(0, 1) with B_i ⊥⊥ A_i. Given observations X_i = (1{A_i ≤ B_i}, B_i),
one is trying to estimate T(π ) = π [A ≤ 1/2]. Show that

inf_{T̂n} sup_π E[(T(π) − T̂n(X1, . . . , Xn))²] ≍ n^{−2/3}.

⁴ Getting the exact exponent is a difficult result (cf. [26]). Here we only need some crude estimate.
⁵ Both under the same, but otherwise arbitrary, topology on P.


VI.18 (Comparison between contraction coefficients.) Prove (33.13) for ηf = ηKL .


Hint: Use local behavior of f-divergences (Proposition 2.21).
VI.19 (Spectral gap and χ2 -contraction of Markov chains.) In this exercise we prove (33.18). Let
P = (P(x, y)) denote the transition matrix of a time-reversible Markov chain with finite state
space X and stationary distribution π, so that π (x)P(x, y) = π (y)P(y, x) for all x, y ∈ X . It is
known that the k = |X | eigenvalues of P satisfy 1 = λ1 ≥ λ2 ≥ . . . ≥ λk ≥ −1. Define by
γ∗ ≜ max{λ2 , |λk |} the absolute spectral gap.
(a) Show that
χ²(P_{X1} ‖ π) ≤ χ²(P_{X0} ‖ π) γ*².

from which (33.18) follows.


(b) Conclude that for any initial state x,
χ²(P_{Xn|X0=x} ‖ π) ≤ ((1 − π(x))/π(x)) γ*^{2n}.
(c) Compute γ∗ for the BSCδ channel and compare with the ηχ2 contraction coefficients.
For a continuous-time version, see [124].
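An optional numerical illustration of part (a) (the construction below is ours): build a reversible chain via a Metropolis kernel with a random stationary distribution π, take γ* to be the second-largest absolute eigenvalue, and check the χ² contraction after one step.

```python
import numpy as np

rng = np.random.default_rng(2)
k = 6
pi = rng.dirichlet(np.ones(k))                   # target stationary distribution

# Metropolis chain with uniform proposals: reversible w.r.t. pi
P = np.zeros((k, k))
for x in range(k):
    for y in range(k):
        if y != x:
            P[x, y] = min(1.0, pi[y] / pi[x]) / k
    P[x, x] = 1.0 - P[x].sum()

eig = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
gamma = eig[1]                                   # max{lambda_2, |lambda_k|}

chi2 = lambda q: float(np.sum((q - pi) ** 2 / pi))
mu0 = rng.dirichlet(np.ones(k))                  # arbitrary initial distribution
mu1 = mu0 @ P                                    # distribution after one step
print(chi2(mu1), "<=", gamma ** 2 * chi2(mu0))
```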
VI.20 (Input-independent contraction coefficient is achieved by binary inputs [321]) Let K : X → Y
be a Markov kernel with countable X . Prove that for all f-divergence, we have
η_f(K) = sup { D_f(K∘P ‖ K∘Q) / D_f(P ‖ Q) : |supp(P) ∪ supp(Q)| ≤ 2, 0 < D_f(P‖Q) < ∞ }.

Hint: Define the function
L_λ(P, Q) = D_f(K∘P ‖ K∘Q) − λ D_f(P ‖ Q)
and prove that Q̂ ↦ L_λ((P/Q)·Q̂, Q̂) is convex on the set
{ Q̂ ∈ P(X) : supp(Q̂) ⊆ supp(Q), (P/Q)·Q̂ ∈ P(X) }.

VI.21 (BMS channel comparison [371, 367]) Below X ∼ Ber(1/2) and PY|X is an input-symmetric
channel (BMS). It turns out that BSC and BEC are extremal for various partial orders. Prove
the following statements.
(a) If I_TV(X; Y) = (1/2)(1 − 2δ), then
BSCδ ≤deg PY|X ≤deg BEC2δ .

(b) If I(X; Y) = C, then


BSCh−1 (log 2−C) ≤mc PY|X ≤mc BEC1−C/ log 2 .

(c) If Iχ2 (X; Y) = η , then


BSC1/2−√η/2 ≤ln PY|X ≤ln BEC1−η .


(Hint: apply Exercise I.64.)


VI.22 (Broadcasting on Trees with BSC [204]) We have seen in Example 33.9 that broadcasting on
trees (BOT) with BSCδ has non-reconstruction when b(1 − 2δ)2 < 1. In this exercise we prove
the achievability bound (known as the Kesten-Stigum bound [244]) using channel comparison.
We work with an infinite b-ary tree with BSCδ edge channels. Let ρ be the root and Lk be the
set of nodes at distance k to ρ. Let Mk denote the channel Xρ → XLk .
In the following, assume that b(1 − 2δ)2 > 1.
(a) Prove that there exists τ < 1/2 such that
BSC_τ ≤_ln BSC_τ^{⊗b} ∘ M1.

Hint: Use Ex. VI.21.


(b) Prove BSCτ ≤ln Mk by induction on k. Conclude that reconstruction holds.
Hint: Use tensorization of less noisy ordering.
VI.23 (Broadcasting on a 2D Grid) Consider the following broadcasting model on a 2D grid:
• Nodes are labeled with (i, j) for i, j ∈ Z;
• Xi,j = 0 when i < 0 or j < 0;
• X0,0 ∼ Ber( 12 );
i.i.d.
• Xi,j = fi,j (Xi−1,j ⊕ Zi,j,1 , Xi,j−1 ⊕ Zi,j,2 ) for i, j ≥ 0 and (i, j) 6= (0, 0), where Zi,j,k ∼ Ber(δ),
and fi,j is any function {0, 1} × {0, 1} → {0, 1}.
Let L_n = {(n − i, i) : 0 ≤ i ≤ n} be the set of nodes at level n. Let p_c be the directed bond percolation
threshold from (0, 0) to L_n as n → ∞. Apply Theorem 33.8 to show that for (1 − 2δ)² < p_c
we have
lim_{n→∞} I(X_{0,0}; X_{Ln}) = 0.

It is known that pc ≈ 0.645 (e.g. [231]).


Note: We could also use Theorem 33.8 differently. Above we replaced each directed edge by
a BSCδ . We could instead consider channels (Xi−1,j , Xi,j−1 ) 7→ Xi,j and relate its contraction
coefficient to the directed site percolation threshold p′c . This would yield a non-reconstruction
whenever
1 − 2δ + 4δ³ − 2δ⁴ − 2δ(1 − δ)√(δ(1 + δ)(1 − δ)(2 − δ)) < p′_c.
Since p′c ≈ 0.705 it turns out the bond percolation result is stronger.
VI.24 (Input-dependent contraction coefficient for coloring channel [203]) Fix an integer q ≥ 3 and
let X = [q]. Consider the following coloring channel K : X → X :
K(y|x) = { 0, if y = x;  1/(q − 1), if y ≠ x.

Let π be uniform distribution on X .


(a) Compute ηKL (π , K).
(b) Conclude that there exists a function f(q) = (1 − o(1))q log q as q → ∞ such that for all
d < f(q), BOT with the coloring channel on a d-ary tree has non-reconstruction.


Note: This bound is tight up to the first order: there exists a function g(q) = (1 + o(1))q log q
such that for all d > g(q), BOT with coloring channel on a d-ary tree has reconstruction.
VI.25 ([203]) Fix an integer q ≥ 2 and let X = [q]. Let λ ∈ [−1/(q − 1), 1] be a real number. Let us define
a special kind of q-ary symmetric channel, known as the Potts channel, by taking Pλ : X → X
as
P_λ(y|x) = { λ + (1 − λ)/q, if y = x;  (1 − λ)/q, if y ≠ x.

Prove that
η_KL(P_λ) = qλ² / ((q − 2)λ + 2).

VI.26 (Spectral Independence [20]) Say a probability distribution μ = μXn supported on [q]n is c-
pairwise independent if for every T ⊂ [n], σT ∈ [q]T , the conditional measure μ(σT ) ≜ μXTc |XT =σT
satisfies for every νXTc ,
Σ_{i≠j∈T^c} D(ν_{X_i,X_j} ‖ μ^{(σ_T)}_{X_i,X_j}) ≥ (2 − c/(n − |T| − 1)) Σ_{i∈T^c} D(ν_{X_i} ‖ μ^{(σ_T)}_{X_i}).

Prove that for such a measure μ we have


η_KL(μ, EC_τ^{⊗n}) ≤ 1 − τ^{c+1},

where EC_τ is the erasure channel, cf. Example 33.6. (Hint: Define f(τ) = D(EC_τ^{⊗n} ∘ ν ‖ EC_τ^{⊗n} ∘ μ)
and prove f″(τ) ≥ (c/τ) f′(τ).)
Remark: Applying the above with τ = 1/n shows that a Markov chain G_τ known as (small-block)
Glauber dynamics for μ mixes in O(n^{c+1} log n) time. Indeed, G_τ consists of first applying
EC_τ^{⊗n} and then “imputing” the erasures in the set S from the conditional distribution μ_{X_S|X_{S^c}}. It is
also known that c-pairwise independence is implied (under some additional conditions on μ
and q = 2) by the uniform boundedness of the operator norms of the covariance matrices of
all μ^{(σ_T)} (see [91] for details). Thus the hard question of bounding η_KL(μ, G_τ) is first reduced to
η_KL(μ, EC_τ^{⊗n}) and then to the study of the spectrum of a covariance matrix.

References

[1] E. Abbe and E. B. Adserà, “Subadditivity vol. 32, no. 4, pp. 533–542, 1986. (pp. 325
beyond trees and the Chi-squared mutual and 327)
information,” in 2019 IEEE International [9] R. Ahlswede and J. Körner, “Source cod-
Symposium on Information Theory (ISIT). ing with side information and a converse for
IEEE, 2019, pp. 697–701. (p. 135) degraded broadcast channels,” IEEE Trans-
[2] M. C. Abbott and B. B. Machta, “A scaling actions on Information Theory, vol. 21,
law from discrete to continuous solutions no. 6, pp. 629–637, 1975. (p. 227)
of channel capacity problems in the low- [10] S. M. Alamouti, “A simple transmit diver-
noise limit,” Journal of Statistical Physics, sity technique for wireless communica-
vol. 176, no. 1, pp. 214–227, 2019. (p. 248) tions,” IEEE Journal on selected areas in
[3] I. Abou-Faycal, M. Trott, and S. Shamai, communications, vol. 16, no. 8, pp. 1451–
“The capacity of discrete-time memoryless 1458, 1998. (p. 409)
rayleigh-fading channels,” IEEE Transac- [11] P. H. Algoet and T. M. Cover, “A sandwich
tion Information Theory, vol. 47, no. 4, pp. proof of the Shannon-Mcmillan-Breiman
1290 – 1301, 2001. (p. 409) theorem,” The annals of probability, pp.
[4] J. Acharya, C. L. Canonne, Y. Liu, Z. Sun, 899–909, 1988. (p. 234)
and H. Tyagi, “Interactive inference under [12] C. D. Aliprantis and K. C. Border, Infi-
information constraints,” IEEE Transac- nite Dimensional Analysis: a Hitchhiker’s
tions on Information Theory, vol. 68, no. 1, Guide, 3rd ed. Berlin: Springer-Verlag,
pp. 502–516, 2021. (p. 658) 2006. (p. 130)
[5] J. Acharya, C. L. Canonne, Z. Sun, and [13] N. Alon, “On the number of subgraphs of
H. Tyagi, “Unified lower bounds for inter- prescribed type of graphs with a given num-
active high-dimensional estimation under ber of edges,” Israel J. Math., vol. 38, no.
information constraints,” arXiv preprint 1-2, pp. 116–130, 1981. (p. 160)
arXiv:2010.06562, 2020. (p. 658) [14] N. Alon and A. Orlitsky, “A lower bound
[6] R. Ahlswede, “Extremal properties of rate on the expected length of one-to-one codes,”
distortion functions,” IEEE transactions on IEEE Transactions on Information The-
information theory, vol. 36, no. 1, pp. 166– ory, vol. 40, no. 5, pp. 1670–1672, 1994.
171, 1990. (p. 543) (p. 199)
[7] R. Ahlswede, B. Balkenhol, and L. Khacha- [15] N. Alon and J. H. Spencer, The Probabilis-
trian, “Some properties of fix free codes,” tic Method, 3rd ed. John Wiley & Sons,
in Proceedings First INTAS International 2008. (pp. 208 and 353)
Seminar on Coding Theory and Combina- [16] P. Alquier, “User-friendly introduction
torics, Thahkadzor, Armenia, 1996, pp. 20– to PAC-Bayes bounds,” arXiv preprint
33. (p. 208) arXiv:2110.11216, 2021. (p. 83)
[8] R. Ahlswede and I. Csiszár, “Hypothesis [17] S.-I. Amari and H. Nagaoka, Methods of
testing with communication constraints,” information geometry. American Math-
IEEE transactions on information theory, ematical Soc., 2007, vol. 191. (pp. 40
and 307)


[18] G. Aminian, Y. Bu, L. Toni, M. R. Theory and Related fields, vol. 108, no. 4,
Rodrigues, and G. Wornell, “Characteriz- pp. 517–542, 1997. (p. 668)
ing the generalization error of Gibbs algo- [27] S. Artstein, K. Ball, F. Barthe, and A. Naor,
rithm with symmetrized KL information,” “Solution of Shannon’s problem on the
arXiv preprint arXiv:2107.13656, 2021. monotonicity of entropy,” Journal of the
(p. 187) American Mathematical Society, pp. 975–
[19] V. Anantharam, A. Gohari, S. Kamath, 982, 2004. (p. 64)
and C. Nair, “On maximal correlation, [28] S. Artstein, V. Milman, and S. J. Szarek,
hypercontractivity, and the data processing “Duality of metric entropy,” Annals of math-
inequality studied by Erkip and Cover,” ematics, pp. 1313–1328, 2004. (p. 535)
arXiv preprint arXiv:1304.6133, 2013. [29] J. Baik, G. Ben Arous, and S. Péché, “Phase
(p. 638) transition of the largest eigenvalue for non-
[20] N. Anari, K. Liu, and S. O. Gharan, “Spec- null complex sample covariance matrices,”
tral independence in high-dimensional The Annals of Probability, vol. 33, no. 5, pp.
expanders and applications to the hardcore 1643–1697, 2005. (p. 651)
model,” SIAM Journal on Computing, [30] A. V. Banerjee, “A simple model of herd
no. 0, pp. FOCS20–1, 2021. (p. 671) behavior,” The Quarterly Journal of Eco-
[21] T. W. Anderson, “The integral of a symmet- nomics, vol. 107, no. 3, pp. 797–817, 1992.
ric unimodal function over a symmetric con- (pp. 135 and 181)
vex set and some probability inequalities,” [31] Z. Bar-Yossef, T. S. Jayram, R. Kumar, and
Proceedings of the American Mathematical D. Sivakumar, “An information statistics
Society, vol. 6, no. 2, pp. 170–176, 1955. approach to data stream and communica-
(p. 572) tion complexity,” Journal of Computer and
[22] A. Antos and I. Kontoyiannis, “Conver- System Sciences, vol. 68, no. 4, pp. 702–
gence properties of functional estimates for 732, 2004. (p. 182)
discrete distributions,” Random Structures [32] B. Bárány and I. Kolossváry, “On the abso-
& Algorithms, vol. 19, no. 3-4, pp. 163–193, lute continuity of the Blackwell measure,”
2001. (p. 138) Journal of Statistical Physics, vol. 159, pp.
[23] E. Arıkan, “Channel polarization: A 158–171, 2015. (p. 111)
method for constructing capacity-achieving [33] B. Bárány, M. Pollicott, and K. Simon,
codes for symmetric binary-input memo- “Stationary measures for projective trans-
ryless channels,” IEEE Transactions on formations: the Blackwell and Furstenberg
information Theory, vol. 55, no. 7, pp. measures,” Journal of Statistical Physics,
3051–3073, 2009. (p. 346) vol. 148, pp. 393–421, 2012. (p. 111)
[24] S. Arimoto, “An algorithm for computing [34] A. Barg and G. D. Forney, “Random codes:
the capacity of arbitrary discrete memory- Minimum distances and error exponents,”
less channels,” IEEE Transactions on Infor- IEEE Transactions on Information The-
mation Theory, vol. 18, no. 1, pp. 14–20, ory, vol. 48, no. 9, pp. 2568–2573, 2002.
1972. (p. 102) (p. 433)
[25] ——, “On the converse to the coding theo- [35] A. Barg and A. McGregor, “Distance distri-
rem for discrete memoryless channels (cor- bution of binary codes and the error proba-
resp.),” IEEE Transactions on Information bility of decoding,” IEEE Transactions on
Theory, vol. 19, no. 3, pp. 357–359, 1973. Information Theory, vol. 51, no. 12, pp.
(p. 433) 4237–4246, 2005. (p. 433)
[26] G. B. Arous and A. Guionnet, “Large devi- [36] S. Barman and O. Fawzi, “Algorithmic
ations for Wigner’s law and Voiculescu’s aspects of optimal channel coding,” IEEE
non-commutative entropy,” Probability Transactions on Information Theory,


vol. 64, no. 2, pp. 1038–1045, 2017. [47] C. Berrou, A. Glavieux, and P. Thiti-
(pp. 366, 367, 368, and 369) majshima, “Near Shannon limit
[37] A. R. Barron, “Universal approximation error-correcting coding and decoding:
bounds for superpositions of a sigmoidal Turbo-codes. 1,” in Proceedings of
function,” IEEE Trans. Inf. Theory, vol. 39, ICC’93-IEEE International Conference on
no. 3, pp. 930–945, 1993. (p. 534) Communications, vol. 2. IEEE, 1993, pp.
[38] P. L. Bartlett and S. Mendelson, 1064–1070. (pp. 346 and 411)
“Rademacher and Gaussian complexi- [48] J. C. Berry, “Minimax estimation of a
ties: Risk bounds and structural results,” bounded normal mean vector,” Journal of
Journal of Machine Learning Research, Multivariate Analysis, vol. 35, no. 1, pp.
vol. 3, no. Nov, pp. 463–482, 2002. (p. 86) 130–139, 1990. (p. 587)
[39] G. Basharin, “On a statistical estimate for [49] D. P. Bertsekas, A. Nedi�, and A. E.
the entropy of a sequence of independent Ozdaglar, Convex analysis and optimiza-
random variables,” Theory of Probability & tion. Belmont, MA, USA: Athena Scien-
Its Applications, vol. 4, no. 3, pp. 333–336, tific, 2003. (p. 93)
1959. (p. 584) [50] N. Bhatnagar, J. Vera, E. Vigoda, and
[40] A. Beck, First-order methods in optimiza- D. Weitz, “Reconstruction for colorings on
tion. SIAM, 2017. (p. 92) trees,” SIAM Journal on Discrete Mathe-
[41] A. Beirami and F. Fekri, “Fundamental lim- matics, vol. 25, no. 2, pp. 809–826, 2011.
its of universal lossless one-to-one com- (p. 644)
pression of parametric sources,” in Informa- [51] A. Bhatt, B. Nazer, O. Ordentlich, and
tion Theory Workshop (ITW). IEEE, 2014, Y. Polyanskiy, “Information-distilling
pp. 212–216. (p. 250) quantizers,” IEEE Transactions on
[42] C. H. Bennett, “Notes on Landauer’s princi- Information Theory, vol. 67, no. 4, pp.
ple, reversible computation, and Maxwell’s 2472–2487, 2021. (p. 190)
Demon,” Studies In History and Philoso- [52] A. Bhattacharyya, “On a measure of diver-
phy of Science Part B: Studies In History gence between two statistical populations
and Philosophy of Modern Physics, vol. 34, defined by their probability distributions,”
no. 3, pp. 501–510, 2003. (p. xix) Bull. Calcutta Math. Soc., vol. 35, pp. 99–
[43] C. H. Bennett, P. W. Shor, J. A. Smolin, and 109, 1943. (p. 117)
A. V. Thapliyal, “Entanglement-assisted [53] L. Birgé, “Approximation dans les espaces
classical capacity of noisy quantum chan- métriques et théorie de l’estimation,”
nels,” Physical Review Letters, vol. 83, Zeitschrift für Wahrscheinlichkeitstheorie
no. 15, p. 3081, 1999. (p. 498) und Verwandte Gebiete, vol. 65, no. 2, pp.
[44] W. R. Bennett, “Spectra of quantized sig- 181–237, 1983. (pp. xxii, 602, and 614)
nals,” The Bell System Technical Journal, [54] ——, “On estimating a density using
vol. 27, no. 3, pp. 446–472, 1948. (p. 483) Hellinger distance and some other strange
[45] P. Bergmans, “A simple converse for broad- facts,” Probability theory and related fields,
cast channels with additive white Gaus- vol. 71, no. 2, pp. 271–291, 1986. (p. 625)
sian noise (corresp.),” IEEE Transactions [55] L. Birgé, “Model selection via testing : an
on Information Theory, vol. 20, no. 2, pp. alternative to (penalized) maximum likeli-
279–280, 1974. (p. 65) hood estimators,” Annales de l’I.H.P. Prob-
[46] J. M. Bernardo, “Reference posterior dis- abilités et statistiques, vol. 42, no. 3, pp.
tributions for Bayesian inference,” Journal 273–325, 2006. (p. 667)
of the Royal Statistical Society: Series B [56] L. Birgé, “Robust tests for model selection,”
(Methodological), vol. 41, no. 2, pp. 113– From probability to statistics and back:
128, 1979. (p. 253) high-dimensional models and processes–A


Festschrift in honor of Jon A. Wellner, IMS [67] H. F. Bohnenblust, “Convex regions and
Collections, Volume 9, pp. 47–64, 2013. projections in Minkowski spaces,” Ann.
(p. 613) Math., vol. 39, no. 2, pp. 301–308, 1938.
[57] M. Š. Birman and M. Solomjak, “Piecewise- (p. 96)
polynomial approximations of functions of [68] A. Borovkov, Mathematical Statistics.
the classes,” Mathematics of the USSR- CRC Press, 1999. (pp. xxii, 141, 581,
Sbornik, vol. 2, no. 3, p. 295, 1967. (p. 538) and 582)
[58] N. Blachman, “The convolution inequal- [69] S. Boucheron, G. Lugosi, and P. Massart,
ity for entropy powers,” IEEE Transactions Concentration Inequalities: A Nonasymp-
on Information theory, vol. 11, no. 2, pp. totic Theory of Independence. OUP
267–271, 1965. (p. 185) Oxford, 2013. (pp. 85, 151, 302, and 541)
[59] D. Blackwell, L. Breiman, and [70] O. Bousquet, D. Kane, and S. Moran,
A. Thomasian, “The capacity of a class of “The optimal approximation factor in den-
channels,” The Annals of Mathematical sity estimation,” in Conference on Learn-
Statistics, pp. 1229–1241, 1959. (p. 465) ing Theory. PMLR, 2019, pp. 318–341.
[60] D. H. Blackwell, “The entropy of functions (p. 622)
of finite-state Markov chains,” Transactions [71] D. Braess and T. Sauer, “Bernstein poly-
of the first Prague conference on infor- nomials and learning theory,” Journal of
mation theory, statistical decision func- Approximation Theory, vol. 128, no. 2, pp.
tions, random processes, pp. 13–20, 1956. 187–206, 2004. (p. 665)
(p. 111) [72] D. Braess, J. Forster, T. Sauer, and
[61] R. E. Blahut, “Hypothesis testing and infor- H. U. Simon, “How to achieve minimax
mation theory,” IEEE Trans. Inf. Theory, expected Kullback-Leibler distance from
vol. 20, no. 4, pp. 405–417, 1974. (p. 289) an unknown finite distribution,” in Algo-
[62] R. Blahut, “Computation of channel rithmic Learning Theory. Springer, 2002,
capacity and rate-distortion functions,” pp. 380–394. (p. 665)
IEEE transactions on Information Theory, [73] M. Braverman, A. Garg, T. Ma, H. L.
vol. 18, no. 4, pp. 460–473, 1972. (p. 102) Nguyen, and D. P. Woodruff, “Communica-
[63] P. M. Bleher, J. Ruiz, and V. A. Zagrebnov, tion lower bounds for statistical estimation
“On the purity of the limiting gibbs state for problems via a distributed data processing
the Ising model on the Bethe lattice,” Jour- inequality,” in Proceedings of the forty-
nal of Statistical Physics, vol. 79, no. 1, pp. eighth annual ACM symposium on Theory
473–482, Apr 1995. (pp. 642 and 643) of Computing. ACM, 2016, pp. 1011–
[64] S. G. Bobkov and F. Götze, “Exponential 1020. (pp. 657 and 658)
integrability and transportation cost related [74] L. M. Bregman, “Some properties of non-
to logarithmic Sobolev inequalities,” Jour- negative matrices and their permanents,”
nal of Functional Analysis, vol. 163, no. 1, Soviet Math. Dokl., vol. 14, no. 4, pp. 945–
pp. 1–28, 1999. (p. 656) 949, 1973. (p. 161)
[65] S. Bobkov and G. P. Chistyakov, “Entropy [75] L. Breiman, “The individual ergodic the-
power inequality for the Rényi entropy.” orem of information theory,” Ann. Math.
IEEE Transactions on Information Theory, Stat., vol. 28, no. 3, pp. 809–811, 1957.
vol. 61, no. 2, pp. 708–714, 2015. (p. 27) (p. 234)
[66] T. Bohman, “A limit theorem for the Shan- [76] L. Brillouin, Science and information the-
non capacities of odd cycles I,” Proceedings ory, 2nd Ed. Academic Press, 1962.
of the American Mathematical Society, vol. (p. xvii)
131, no. 11, pp. 3559–3569, 2003. (p. 452) [77] L. D. Brown, “Fundamentals of statisti-
cal exponential families with applications


in statistical decision theory,” in Lecture assumptions,” The Annals of Mathematical


Notes-Monograph Series, S. S. Gupta, Ed. Statistics, vol. 22, no. 4, pp. 581–586, 1951.
Hayward, CA: Institute of Mathematical (p. 575)
Statistics, 1986, vol. 9. (pp. 298, 309, [88] Z. Chase and S. Lovett, “Approximate
and 310) union closed conjecture,” arXiv preprint
[78] P. Bühlmann and S. van de Geer, Statistics arXiv:2211.11689, 2022. (p. 189)
for high-dimensional data: methods, theory [89] S. Chatterjee, “An error bound in the
and applications. Springer Science & Sudakov-Fernique inequality,” arXiv
Business Media, 2011. (p. xxii) preprint arXiv:0510424, 2005. (p. 531)
[79] M. V. Burnashev, “Data transmission over [90] S. Chatterjee and P. Diaconis, “The sample
a discrete channel with feedback. ran- size required in importance sampling,” The
dom transmission time,” Prob. Peredachi Annals of Applied Probability, vol. 28, no. 2,
Inform., vol. 12, no. 4, pp. 10–30, 1976. pp. 1099–1135, 2018. (p. 337)
(p. 456) [91] Z. Chen, K. Liu, and E. Vigoda, “Optimal
[80] G. Calinescu, C. Chekuri, M. Pal, and mixing of Glauber dynamics: Entropy fac-
J. Vondrák, “Maximizing a monotone sub- torization via high-dimensional expansion,”
modular function subject to a matroid in Proceedings of the 53rd Annual ACM
constraint,” SIAM Journal on Computing, SIGACT Symposium on Theory of Comput-
vol. 40, no. 6, pp. 1740–1766, 2011. ing, 2021, pp. 1537–1550. (p. 671)
(pp. 368 and 369) [92] H. Chernoff, “Large-sample theory: Para-
[81] M. X. Cao and M. Tomamichel, “Com- metric case,” The Annals of Mathematical
ments on “Channel Coding Rate in the Statistics, vol. 27, no. 1, pp. 1–22, 1956.
Finite Blocklength Regime”: On the (p. 661)
quadratic decaying property of the informa- [93] M. Choi, M. Ruskai, and E. Seneta, “Equiv-
tion rate function,” IEEE Transactions on alence of certain entropy contraction coeffi-
Information Theory, vol. 69, no. 9, 2023. cients,” Linear algebra and its applications,
(p. 435) vol. 208, pp. 29–36, 1994. (p. 632)
[82] G. Casella and W. E. Strawderman, “Esti- [94] N. Chomsky, “Three models for the descrip-
mating a bounded normal mean,” The tion of language,” IRE Trans. Inform. Th.,
Annals of Statistics, vol. 9, no. 4, pp. 870– vol. 2, no. 3, pp. 113–124, 1956. (p. 195)
878, 1981. (p. 587) [95] B. S. Clarke and A. R. Barron,
[83] O. Catoni, “PAC-Bayesian supervised clas- “Information-theoretic asymptotics of
sification: the thermodynamics of statis- Bayes methods,” IEEE Trans. Inf. Theory,
tical learning,” Lecture Notes-Monograph vol. 36, no. 3, pp. 453–471, 1990. (p. 253)
Series. IMS, vol. 1277, 2007. (p. 83) [96] ——, “Jeffreys’ prior is asymptotically
[84] E. Çinlar, Probability and Stochastics. least favorable under entropy risk,” Jour-
New York: Springer, 2011. (pp. 20, 29, 30, nal of Statistical planning and Inference,
31, 32, 80, and 319) vol. 41, no. 1, pp. 37–60, 1994. (p. 253)
[85] N. N. Cencov, Statistical decision rules [97] J. E. Cohen, J. H. B. Kempermann, and
and optimal inference. American Math- G. Zb�ganu, Comparisons of Stochastic
ematical Soc., 2000, no. 53. (pp. 40, 307, Matrices with Applications in Information
and 309) Theory, Statistics, Economics and Popula-
[86] N. Cesa-Bianchi and G. Lugosi, Prediction, tion. Springer, 1998. (p. 632)
learning, and games. Cambridge univer- [98] A. Collins and Y. Polyanskiy, “Coherent
sity press, 2006. (p. 255) multiple-antenna block-fading channels at
[87] D. G. Chapman and H. Robbins, “Mini- finite blocklength,” IEEE Transactions on
mum variance estimation without regularity


Information Theory, vol. 65, no. 1, pp. 380– [110] I. Csiszár and J. Körner, “Graph decomposi-
405, 2018. (p. 437) tion: a new key to coding theorems,” IEEE
[99] J. H. Conway and N. J. A. Sloane, Sphere Trans. Inf. Theory, vol. 27, no. 1, pp. 5–12,
packings, lattices and groups. Springer 1981. (p. 47)
Science & Business Media, 1999, vol. 290. [111] ——, Information Theory: Coding The-
(p. 527) orems for Discrete Memoryless Systems.
[100] M. Costa, “A new entropy power inequal- New York: Academic, 1981. (pp. xvii, xx,
ity,” IEEE Transactions on Information xxi, 355, 499, 500, 502, and 647)
Theory, vol. 31, no. 6, pp. 751–760, 1985. [112] I. Csiszár and P. C. Shields, “Information
(p. 64) theory and statistics: A tutorial,” Founda-
[101] D. J. Costello and G. D. Forney, “Channel tions and Trends in Communications and
coding: The road to channel capacity,” Pro- Information Theory, vol. 1, no. 4, pp. 417–
ceedings of the IEEE, vol. 95, no. 6, pp. 528, 2004. (pp. 104 and 250)
1150–1177, 2007. (p. 411) [113] I. Csiszár and G. Tusnády, “Informa-
[102] T. A. Courtade, “Monotonicity of entropy tion geometry and alternating minimiza-
and Fisher information: a quick proof via tion problems,” Statistics & Decision, Sup-
maximal correlation,” Communications in plement Issue No, vol. 1, 1984. (pp. 102
Information and Systems, vol. 16, no. 2, pp. and 103)
111–115, 2017. (p. 64) [114] I. Csiszár, “I-divergence geometry of prob-
[103] ——, “A strong entropy power inequality,” ability distributions and minimization prob-
IEEE Transactions on Information Theory, lems,” The Annals of Probability, pp. 146–
vol. 64, no. 4, pp. 2173–2192, 2017. (p. 64) 158, 1975. (pp. 303 and 312)
[104] T. M. Cover, “Universal data compression [115] I. Csiszár and J. Körner, Information The-
and portfolio selection,” in Proceedings of ory: Coding Theorems for Discrete Memo-
37th Conference on Foundations of Com- ryless Systems, 2nd ed. Cambridge Uni-
puter Science. IEEE, 1996, pp. 534–538. versity Press, 2011. (pp. xx, xxi, 13, 216,
(p. xx) and 426)
[105] T. M. Cover and B. Gopinath, Open prob- [116] P. Cuff, “Distributed channel synthesis,”
lems in communication and computation. IEEE Transactions on Information The-
Springer Science & Business Media, 2012. ory, vol. 59, no. 11, pp. 7071–7096, 2013.
(p. 413) (p. 503)
[106] T. M. Cover and J. A. Thomas, Elements [117] M. Cuturi, “Sinkhorn distances: Light-
of information theory, 2nd Ed. New speed computation of optimal transport,”
York, NY, USA: Wiley-Interscience, 2006. Advances in neural information process-
(pp. xvii, xx, xxi, 65, 210, 216, 355, 466, ing systems, vol. 26, pp. 2292–2300, 2013.
and 501) (p. 105)
[107] H. Cramér, “Über eine eigenschaft der [118] L. Davisson, R. McEliece, M. Pursley, and
normalen verteilungsfunktion,” Mathema- M. Wallace, “Efficient universal noiseless
tische Zeitschrift, vol. 41, no. 1, pp. 405– source codes,” IEEE Transactions on Infor-
414, 1936. (p. 101) mation Theory, vol. 27, no. 3, pp. 269–279,
[108] ——, Mathematical methods of statistics. 1981. (p. 270)
Princeton university press, 1946. (p. 576) [119] A. Decelle, F. Krzakala, C. Moore, and
[109] I. Csiszár, “Information-type measures of L. Zdeborová, “Asymptotic analysis of the
difference of probability distributions and stochastic block model for modular net-
indirect observation,” Studia Sci. Math. works and its algorithmic applications,”
Hungar., vol. 2, pp. 229–318, 1967. (p. 115) Physical review E, vol. 84, no. 6, p. 066106,
2011. (p. 642)


[120] A. Dembo and O. Zeitouni, Large devia- on Differential Equations, Two on Informa-
tions techniques and applications. New tion Theory, American Mathematical Soci-
York: Springer Verlag, 2009. (pp. 291 ety Translations: Series 2, Volume 33, 1963.
and 308) (p. 83)
[121] A. P. Dempster, N. M. Laird, and D. B. [130] ——, “Mathematical problems in the Shan-
Rubin, “Maximum likelihood from incom- non theory of optimal coding of informa-
plete data via the EM algorithm,” Journal tion,” in Proc. 4th Berkeley Symp. Mathe-
of the royal statistical society. Series B matics, Statistics, and Probability, vol. 1,
(methodological), pp. 1–38, 1977. (p. 103) Berkeley, CA, USA, 1961, pp. 211–252.
[122] P. Diaconis and L. Saloff-Coste, “Logarith- (p. 435)
mic Sobolev inequalities for finite Markov [131] ——, “Asymptotic bounds on error prob-
chains,” Ann. Probab., vol. 6, no. 3, pp. ability for transmission over DMC with
695–750, 1996. (pp. 133 and 191) symmetric transition probabilities,” Theor.
[123] P. Diaconis and D. Freedman, “Finite Probability Appl., vol. 7, pp. 283–311,
exchangeable sequences,” The Annals of 1962. (pp. 383 and 446)
Probability, vol. 8, no. 4, pp. 745–764, [132] J. Dong, A. Roth, and W. J. Su, “Gaus-
1980. (p. 187) sian differential privacy,” Journal of the
[124] P. Diaconis and D. Stroock, “Geometric Royal Statistical Society Series B: Statisti-
bounds for eigenvalues of Markov chains,” cal Methodology, vol. 84, no. 1, pp. 3–37,
The Annals of Applied Probability, vol. 1, 2022. (p. 182)
no. 1, pp. 36–61, 1991. (pp. 641 and 669) [133] D. L. Donoho, “Wald lecture I: Counting
[125] H. Djellout, A. Guillin, and L. Wu, “Trans- bits with Kolmogorov and Shannon,” Note
portation cost-information inequalities and for the Wald Lectures, IMS Annual Meeting,
applications to random dynamical systems July 1997. (p. 543)
and diffusions,” The Annals of Probabil- [134] M. D. Donsker and S. S. Varadhan,
ity, vol. 32, no. 3B, pp. 2702–2732, 2004. “Asymptotic evaluation of certain Markov
(p. 656) process expectations for large time. IV,”
[126] R. Dobrushin and B. Tsybakov, “Informa- Communications on Pure and Applied
tion transmission with additional noise,” Mathematics, vol. 36, no. 2, pp. 183–212,
IRE Transactions on Information Theory, 1983. (p. 72)
vol. 8, no. 5, pp. 293–304, 1962. (p. 548) [135] J. L. Doob, Stochastic Processes. New
[127] R. L. Dobrushin, “Central limit theorem York Wiley, 1953. (p. 233)
for nonstationary Markov chains, I,” The- [136] F. du Pin Calmon, Y. Polyanskiy, and
ory Probab. Appl., vol. 1, no. 1, pp. 65–80, Y. Wu, “Strong data processing inequalities
1956. (p. 630) for input constrained additive noise chan-
[128] ——, “A simplified method of experimen- nels,” IEEE Transactions on Information
tally evaluating the entropy of a station- Theory, vol. 64, no. 3, pp. 1879–1892, 2017.
ary sequence,” Theory of Probability & Its (p. 325)
Applications, vol. 3, no. 4, pp. 428–430, [137] J. C. Duchi, M. I. Jordan, M. J. Wainwright,
1958. (p. 584) and Y. Zhang, “Optimality guarantees for
[129] ——, “A general formulation of the funda- distributed statistical estimation,” arXiv
mental theorem of Shannon in the theory of preprint arXiv:1405.0782, 2014. (pp. 657
information,” Uspekhi Mat. Nauk, vol. 14, and 658)
no. 6, pp. 3–104, 1959, English translation [138] J. Duda, “Asymmetric numeral systems:
in Eleven Papers in Analysis: Nine Papers entropy coding combining speed of
Huffman coding with compression rate

of arithmetic coding,” arXiv preprint Mathematical Statistics, vol. 43, no. 3, pp.
arXiv:1311.2540, 2013. (p. 246) 865–870, 1972. (p. 167)
[139] R. M. Dudley, Uniform central limit theo- [150] ——, “Coding for noisy channels,” IRE
rems. Cambridge university press, 1999, Convention Record, vol. 3, pp. 37–46, 1955.
no. 63. (pp. 86 and 535) (p. 365)
[140] G. Dueck, “The strong converse to the cod- [151] E. O. Elliott, “Estimates of error rates for
ing theorem for the multiple-access chan- codes on burst-noise channels,” Bell Syst.
nel,” J. Comb. Inform. Syst. Sci, vol. 6, no. 3, Tech. J., vol. 42, pp. 1977–1997, Sep. 1963.
pp. 187–196, 1981. (p. 187) (p. 111)
[141] G. Dueck and J. Körner, “Reliability [152] D. M. Endres and J. E. Schindelin, “A new
function of a discrete memoryless chan- metric for probability distributions,” IEEE
nel at rates above capacity (corresp.),” Transactions on Information theory, vol. 49,
IEEE Transactions on Information Theory, no. 7, pp. 1858–1860, 2003. (p. 117)
vol. 25, no. 1, pp. 82–85, 1979. (p. 433) [153] P. Erdös, “Some remarks on the theory of
[142] N. Dunford and J. T. Schwartz, Linear oper- graphs,” Bulletin of the American Mathe-
ators, part 1: general theory. John Wiley matical Society, vol. 53, no. 4, pp. 292–294,
& Sons, 1988, vol. 10. (p. 80) 1947. (p. 215)
[143] R. Durrett, Probability: Theory and Exam- [154] P. Erdös and A. Rényi, “On random graphs,
ples, 4th ed. Cambridge University Press, I,” Publicationes Mathematicae (Debre-
2010. (p. 125) cen), vol. 6, pp. 290–297, 1959. (p. 653)
[144] A. Dytso, S. Yagli, H. V. Poor, and S. S. [155] V. Erokhin, “ε-entropy of a discrete random
Shitz, “The capacity achieving distribu- variable,” Theory of Probability & Its Appli-
tion for the amplitude constrained additive cations, vol. 3, no. 1, pp. 97–100, 1958.
Gaussian channel: An upper bound on the (p. 547)
number of mass points,” IEEE Transactions [156] K. Eswaran and M. Gastpar, “Remote
on Information Theory, vol. 66, no. 4, pp. source coding under Gaussian noise: Duel-
2006–2022, 2019. (p. 408) ing roles of power and entropy power,”
[145] H. G. Eggleston, Convexity, ser. Tracts in IEEE Transactions on Information Theory,
Math and Math. Phys. Cambridge Univer- 2019. (p. 657)
sity Press, 1958, vol. 47. (p. 129) [157] W. Evans and N. Pippenger, “On the maxi-
[146] A. El Alaoui and A. Montanari, “An mum tolerable noise for reliable computa-
information-theoretic view of stochastic tion by formulas,” IEEE Transactions on
localization,” IEEE Transactions on Infor- Information Theory, vol. 44, no. 3, pp.
mation Theory, vol. 68, no. 11, pp. 7423– 1299–1305, May 1998. (p. 629)
7426, 2022. (p. 191) [158] W. S. Evans and L. J. Schulman, “Signal
[147] A. El Gamal and Y.-H. Kim, Network infor- propagation and noisy circuits,” IEEE
mation theory. Cambridge University Transactions on Information Theory,
Press, 2011. (pp. xxi and 501) vol. 45, no. 7, pp. 2367–2373, Nov 1999.
[148] R. Eldan, “Taming correlations through (p. 627)
entropy-efficient measure decompositions [159] M. Falahatgar, A. Orlitsky, V. Pichapati,
with applications to mean-field approxi- and A. Suresh, “Learning Markov distri-
mation,” Probability Theory and Related butions: Does estimation trump compres-
Fields, vol. 176, no. 3-4, pp. 737–755, 2020. sion?” in 2016 IEEE International Sympo-
(p. 191) sium on Information Theory (ISIT). IEEE,
[149] P. Elias, “The efficient construction of July 2016, pp. 2689–2693. (p. 258)
an unbiased random sequence,” Annals of

[160] M. Feder, “Gambling using a finite state [172] G. D. Forney, “Concatenated codes,”
machine,” IEEE Transactions on Informa- MIT RLE Technical Rep., vol. 440, 1965.
tion Theory, vol. 37, no. 5, pp. 1459–1465, (p. 378)
1991. (p. 264) [173] E. Friedgut and J. Kahn, “On the number
[161] M. Feder, N. Merhav, and M. Gut- of copies of one hypergraph in another,”
man, “Universal prediction of individual Israel J. Math., vol. 105, pp. 251–256, 1998.
sequences,” IEEE Trans. Inf. Theory, (p. 160)
vol. 38, no. 4, pp. 1258–1270, 1992. [174] P. Gács and J. Körner, “Common infor-
(p. 260) mation is far less than mutual informa-
[162] M. Feder and Y. Polyanskiy, “Sequential tion,” Problems of Control and Information
prediction under log-loss and misspecifica- Theory, vol. 2, no. 2, pp. 149–162, 1973.
tion,” in Conference on Learning Theory. (p. 338)
PMLR, 2021, pp. 1937–1964. (pp. 175 [175] A. Galanis, D. Štefankovi�, and E. Vigoda,
and 261) “Inapproximability of the partition function
[163] A. A. Fedotov, P. Harremoës, and F. Top- [175] A. Galanis, D. Štefankovič, and E. Vigoda,
søe, “Refinements of Pinsker’s inequality,” core models,” Combinatorics, Probability
Information Theory, IEEE Transactions on, and Computing, vol. 25, no. 4, pp. 500–559,
vol. 49, no. 6, pp. 1491–1498, Jun. 2003. 2016. (p. 75)
(p. 131) [176] R. G. Gallager, “A simple derivation of
[164] W. Feller, An Introduction to Probability the coding theorem and some applications,”
Theory and Its Applications, 3rd ed. New IEEE Trans. Inf. Theory, vol. 11, no. 1, pp.
York: Wiley, 1970, vol. I. (p. 538) 3–18, 1965. (p. 360)
[165] ——, An Introduction to Probability The- [177] ——, Information Theory and Reliable
ory and Its Applications, 2nd ed. New Communication. New York: Wiley, 1968.
York: Wiley, 1971, vol. II. (p. 435) (pp. xvii, xxi, 383, and 432)
[166] T. S. Ferguson, Mathematical Statistics: A [178] R. Gallager, “The random coding bound
Decision Theoretic Approach. New York, is tight for the average code (corresp.),”
NY: Academic Press, 1967. (p. 558) IEEE Transactions on Information Theory,
[167] ——, “An inconsistent maximum likeli- vol. 19, no. 2, pp. 244–246, 1973. (p. 433)
hood estimate,” Journal of the American [179] R. Gardner, “The Brunn-Minkowski
Statistical Association, vol. 77, no. 380, pp. inequality,” Bulletin of the American
831–834, 1982. (p. 583) Mathematical Society, vol. 39, no. 3, pp.
[168] ——, A course in large sample theory. 355–405, 2002. (p. 573)
CRC Press, 1996. (p. 582) [180] A. M. Garsia, Topics in almost everywhere
[169] R. A. Fisher, “The logic of inductive infer- convergence. Chicago: Markham Publish-
ence,” Journal of the royal statistical soci- ing Company, 1970. (p. 238)
ety, vol. 98, no. 1, pp. 39–82, 1935. (p. xvii) [181] M. Gastpar, B. Rimoldi, and M. Vet-
[170] B. M. Fitingof, “The compression of dis- terli, “To code, or not to code: Lossy
crete information,” Problemy Peredachi source-channel communication revisited,”
Informatsii, vol. 3, no. 3, pp. 28–36, 1967. IEEE Transactions on Information The-
(p. 246) ory, vol. 49, no. 5, pp. 1147–1158, 2003.
[171] P. Fleisher, “Sufficient conditions for (p. 521)
achieving minimum distortion in a quan- [182] I. M. Gel’fand, A. N. Kolmogorov, and
tizer,” IEEE Int. Conv. Rec., pp. 104–111, A. M. Yaglom, “On the general definition
1964. (p. 481) of the amount of information,” Dokl. Akad.
Nauk. SSSR, vol. 11, pp. 745–748, 1956.
(p. 70)

[183] S. I. Gelfand and M. Pinsker, “Coding for vol. 16, no. 3, pp. 1281–1290, 1988.
channels with random parameters,” Probl. (p. 540)
Contr. Inform. Theory, vol. 9, no. 1, pp. [195] V. D. Goppa, “Nonprobabilistic mutual
19–31, 1980. (p. 468) information with memory,” Probl. Contr.
[184] Y. Geng and C. Nair, “The capacity region Inf. Theory, vol. 4, pp. 97–102, 1975.
of the two-receiver Gaussian vector broad- (p. 463)
cast channel with private and common mes- [196] ——, “Codes and information,” Russian
sages,” IEEE Transactions on Information Mathematical Surveys, vol. 39, no. 1, p. 87,
Theory, vol. 60, no. 4, pp. 2087–2104, 2014. 1984. (p. 463)
(p. 109) [197] R. M. Gray and D. L. Neuhoff, “Quanti-
[185] G. L. Gilardoni, “On a Gel’fand-Yaglom- zation,” IEEE Trans. Inf. Theory, vol. 44,
Peres theorem for f-divergences,” arXiv no. 6, pp. 2325–2383, 1998. (p. 475)
preprint arXiv:0911.1934, 2009. (p. 154) [198] R. M. Gray, Entropy and Information The-
[186] ——, “On Pinsker’s and Vajda’s type ory. New York, NY: Springer-Verlag,
inequalities for Csiszár’s-divergences,” 1990. (p. xxi)
Information Theory, IEEE Transactions [199] U. Grenander and G. Szegö, Toeplitz forms
on, vol. 56, no. 11, pp. 5377–5386, 2010. and their applications, 2nd ed. New
(p. 133) York: Chelsea Publishing Company, 1984.
[187] E. N. Gilbert, “Capacity of burst-noise (p. 114)
channels,” Bell Syst. Tech. J., vol. 39, pp. [200] L. Gross, “Logarithmic sobolev inequali-
1253–1265, Sep. 1960. (p. 111) ties,” American Journal of Mathematics,
[188] R. D. Gill and B. Y. Levit, “Applications vol. 97, no. 4, pp. 1061–1083, 1975.
of the van Trees inequality: a Bayesian (pp. 107 and 191)
Cramér-Rao bound,” Bernoulli, vol. 1, no. [201] Y. Gu, “Channel comparison methods and
1–2, pp. 59–79, 1995. (p. 577) statistical problems on graphs,” Ph.D. dis-
[189] J. Gilmer, “A constant lower bound for sertation, MIT, Cambridge, MA, 02139,
the union-closed sets conjecture,” arXiv USA, 2023. (p. 638)
preprint arXiv:2211.09055, 2022. (p. 189) [202] Y. Gu and Y. Polyanskiy, “Uniqueness of
[190] C. Giraud, Introduction to High- BP fixed point for the Potts model and appli-
Dimensional Statistics. Chapman cations to community detection,” in Con-
and Hall/CRC, 2014. (p. xxii) ference on Learning Theory (COLT), 2023.
[191] G. Glaeser, “Racine carrée d’une fonc- (pp. 135 and 644)
tion différentiable,” Annales de l’institut [203] ——, “Non-linear log-Sobolev inequalities
Fourier, vol. 13, no. 2, pp. 203–210, 1963. for the Potts semigroup and appli-
(p. 625) cations to reconstruction problems,”
[192] O. Goldreich, Introduction to property test- Comm. Math. Physics, (to appear), also
ing. Cambridge University Press, 2017. arXiv:2005.05444. (pp. 643, 653, 670,
(p. 325) and 671)
[193] I. Goodfellow, J. Pouget-Abadie, M. Mirza, [204] Y. Gu, H. Roozbehani, and Y. Polyanskiy,
B. Xu, D. Warde-Farley, S. Ozair, “Broadcasting on trees near criticality,” in
A. Courville, and Y. Bengio, “Genera- 2020 IEEE International Symposium on
tive adversarial nets,” Advances in neural Information Theory (ISIT). IEEE, 2020,
information processing systems, vol. 27, pp. 1504–1509. (p. 670)
2014. (p. 150) [205] D. Guo, S. Shamai (Shitz), and S. Verdú,
[194] V. Goodman, “Characteristics of normal “Mutual information and minimum mean-
samples,” The Annals of Probability, square error in Gaussian channels,” IEEE

Trans. Inf. Theory, vol. 51, no. 4, pp. 1261 P. Elias, Eds. Springer Netherlands, 1975,
– 1283, Apr. 2005. (p. 59) vol. 16, pp. 323–355. (p. 584)
[206] D. Guo, Y. Wu, S. S. Shamai, and S. Verdú, [216] D. Haussler and M. Opper, “Mutual infor-
“Estimation in Gaussian noise: Proper- mation, metric entropy and cumulative rel-
ties of the minimum mean-square error,” ative entropy risk,” The Annals of Statis-
IEEE Transactions on Information Theory, tics, vol. 25, no. 6, pp. 2451–2492, 1997.
vol. 57, no. 4, pp. 2371–2385, 2011. (p. 63) (pp. xxii and 188)
[207] U. Hadar, J. Liu, Y. Polyanskiy, and [217] M. Hayashi, “General nonasymptotic and
O. Shayevitz, “Communication complexity asymptotic formulas in channel resolv-
of estimating correlations,” in Proceedings ability and identification capacity and
of the 51st Annual ACM SIGACT Sympo- their application to the wiretap channel,”
sium on Theory of Computing. ACM, IEEE Transactions on Information The-
2019, pp. 792–803. (p. 645) ory, vol. 52, no. 4, pp. 1562–1575, 2006.
[208] B. Hajek, Y. Wu, and J. Xu, “Information (p. 505)
limits for recovering a hidden community,” [218] W. Hoeffding, “Asymptotically optimal
IEEE Trans. on Information Theory, vol. 63, tests for multinomial distributions,” The
no. 8, pp. 4729 – 4745, 2017. (p. 591) Annals of Mathematical Statistics, pp. 369–
[209] J. Hájek, “Local asymptotic minimax and 401, 1965. (p. 289)
admissibility in estimation,” in Proceedings [219] P. J. Huber, “Fisher information and spline
of the sixth Berkeley symposium on math- interpolation,” Annals of Statistics, pp.
ematical statistics and probability, vol. 1, 1029–1033, 1974. (p. 580)
1972, pp. 175–194. (p. 582) [220] ——, Robust Statistics. New York, NY:
[210] J. M. Hammersley, “On estimating Wiley-Interscience, 1981. (pp. 151 and 152)
restricted parameters,” Journal of the Royal [221] ——, “A robust version of the probabil-
Statistical Society. Series B (Methodolog- ity ratio test,” The Annals of Mathematical
ical), vol. 12, no. 2, pp. 192–240, 1950. Statistics, pp. 1753–1758, 1965. (pp. 324,
(p. 575) 338, and 613)
[211] T. S. Han, Information-spectrum methods [222] I. A. Ibragimov and R. Z. Khas’minskii,
in information theory. Springer Science Statistical Estimation: Asymptotic Theory.
& Business Media, 2003. (pp. xix and xxi) Springer, 1981. (pp. xxii and 143)
[212] T. S. Han and S. Verdú, “Approximation [223] S. Ihara, “On the capacity of channels with
theory of output statistics,” IEEE Transac- additive non-Gaussian noise,” Information
tions on Information Theory, vol. 39, no. 3, and Control, vol. 37, no. 1, pp. 34–39, 1978.
pp. 752–772, 1993. (pp. 504 and 505) (p. 401)
[213] Y. Han, S. Jana, and Y. Wu, “Optimal pre- [224] ——, Information theory for continuous
diction of Markov chains with and without systems. World Scientific, 1993, vol. 2.
spectral gap,” IEEE Transactions on Infor- (p. 419)
mation Theory, vol. 69, no. 6, pp. 3920– [225] Y. I. Ingster and I. A. Suslina, Nonparamet-
3959, 2023. (p. 258) ric goodness-of-fit testing under Gaussian
[214] P. Harremoës and I. Vajda, “On pairs of models. New York, NY: Springer, 2003.
f-divergences and their joint range,” IEEE (pp. 134, 185, 325, and 561)
Trans. Inf. Theory, vol. 57, no. 6, pp. 3230– [226] Y. I. Ingster, “Minimax testing of nonpara-
3235, Jun. 2011. (pp. 115, 128, and 129) metric hypotheses on a distribution density
[215] B. Harris, “The statistical estimation of in the Lp metrics,” Theory of Probability &
entropy in the non-parametric case,” in Top- Its Applications, vol. 31, no. 2, pp. 333–337,
ics in Information Theory, I. Csiszár and 1987. (p. 325)

[227] S. Janson, “Random regular graphs: asymp- [236] I. Johnstone, Gaussian estimation:
totic distributions and contiguity,” Combi- Sequence and wavelet models, 2011, avail-
natorics, Probability and Computing, vol. 4, able at http://www-stat.stanford.edu/~imj/.
no. 4, pp. 369–405, 1995. (p. 186) (p. 590)
[228] S. Janson and E. Mossel, “Robust recon- [237] L. K. Jones, “A simple lemma on greedy
struction on trees is determined by the approximation in Hilbert space and conver-
second eigenvalue,” Ann. Probab., vol. 32, gence rates for projection pursuit regression
no. 1A, pp. 2630–2649, 2004. (p. 644) and neural network training,” The Annals of
[229] E. T. Jaynes, Probability theory: The logic Statistics, pp. 608–613, 1992. (p. 534)
of science. Cambridge university press, [238] A. B. Juditsky and A. S. Nemirovski, “Non-
2003. (p. 253) parametric estimation by convex program-
[230] T. S. Jayram, “Hellinger strikes back: ming,” The Annals of Statistics, vol. 37,
A note on the multi-party information no. 5A, pp. 2278–2300, 2009. (p. 566)
complexity of AND,” in International [239] S. M. Kakade, K. Sridharan, and A. Tewari,
Workshop on Approximation Algorithms “On the complexity of linear prediction:
for Combinatorial Optimization, 2009, pp. Risk bounds, margin bounds, and regular-
562–573. (p. 183) ization,” Advances in neural information
[231] I. Jensen and A. J. Guttmann, “Series expan- processing systems, vol. 21, 2008. (p. 87)
sions of the percolation probability for [240] S. Kamath, A. Orlitsky, D. Pichapati, and
directed square and honeycomb lattices,” A. Suresh, “On learning distributions from
Journal of Physics A: Mathematical and their samples,” in Conference on Learning
General, vol. 28, no. 17, p. 4813, 1995. Theory, June 2015, pp. 1066–1100. (p. 258)
(p. 670) [241] T. Kawabata and A. Dembo, “The rate-
[232] Z. Jia, Y. Polyanskiy, and Y. Wu, “Entropic distortion dimension of sets and measures,”
characterization of optimal rates for learn- IEEE Trans. Inf. Theory, vol. 40, no. 5, pp.
ing Gaussian mixtures,” in Conference on 1564 – 1572, Sep. 1994. (p. 542)
Learning Theory (COLT). PMLR, 2023. [242] M. Keane and G. O’Brien, “A Bernoulli
(p. 619) factory,” ACM Transactions on Modeling
[233] J. Jiao, K. Venkat, Y. Han, and T. Weiss- and Computer Simulation, vol. 4, no. 2, pp.
man, “Minimax estimation of functionals of 213–219, 1994. (p. 172)
discrete distributions,” IEEE Transactions [243] J. Kemperman, “On the Shannon capacity
on Information Theory, vol. 61, no. 5, pp. of an arbitrary channel,” in Indagationes
2835–2885, 2015. (p. 584) Mathematicae (Proceedings), vol. 77, no. 2.
[234] C. Jin, Y. Zhang, S. Balakrishnan, M. J. North-Holland, 1974, pp. 101–115. (p. 97)
Wainwright, and M. I. Jordan, “Local max- [244] H. Kesten and B. P. Stigum, “Additional
ima in the likelihood of Gaussian mixture limit theorems for indecomposable multi-
models: Structural results and algorithmic dimensional galton-watson processes,” The
consequences,” in Advances in neural infor- Annals of Mathematical Statistics, vol. 37,
mation processing systems, 2016, pp. 4116– no. 6, pp. 1463–1481, 1966. (pp. 644
4124. (p. 105) and 670)
[235] W. B. Johnson, G. Schechtman, and J. Zinn, [245] D. P. Kingma and M. Welling, “Auto-
“Best constants in moment inequalities encoding variational Bayes,” arXiv preprint
for linear combinations of independent arXiv:1312.6114, 2013. (pp. 76 and 77)
and exchangeable random variables,” The [246] D. P. Kingma, M. Welling et al., “An intro-
Annals of Probability, vol. 13, no. 1, pp. duction to variational autoencoders,” Foun-
234–253, 1985. (p. 497) dations and Trends® in Machine Learning,
vol. 12, no. 4, pp. 307–392, 2019. (p. 77)

[247] T. Koch, “The Shannon lower bound is [256] O. Kosut and L. Sankar, “Asymptotics
asymptotically tight,” IEEE Transactions and non-asymptotics for universal fixed-to-
on Information Theory, vol. 62, no. 11, pp. variable source coding,” IEEE Transactions
6155–6161, 2016. (p. 511) on Information Theory, vol. 63, no. 6, pp.
[248] Y. Kochman, O. Ordentlich, and Y. Polyan- 3757–3772, 2017. (p. 250)
skiy, “A lower bound on the expected [257] A. Krause and D. Golovin, “Submodular
distortion of joint source-channel coding,” function maximization,” Tractability, vol. 3,
IEEE Transactions on Information The- pp. 71–104, 2014. (p. 367)
ory, vol. 66, no. 8, pp. 4722–4741, 2020. [258] R. Krichevskiy, “Laplace’s law of succes-
(p. 521) sion and universal encoding,” IEEE Trans-
[249] A. Kolchinsky and B. D. Tracey, “Esti- actions on Information Theory, vol. 44,
mating mixture entropy with pairwise dis- no. 1, pp. 296–303, Jan. 1998. (p. 665)
tances,” Entropy, vol. 19, no. 7, p. 361, [259] R. Krichevsky, “A relation between the
2017. (p. 188) plausibility of information about a source
[250] A. N. Kolmogorov and V. M. Tikhomirov, and encoding redundancy,” Problems
“ε-entropy and ε-capacity of sets in function Inform. Transmission, vol. 4, no. 3, pp.
spaces,” Uspekhi Matematicheskikh Nauk, 48–57, 1968. (p. 247)
vol. 14, no. 2, pp. 3–86, 1959, reprinted [260] R. Krichevsky and V. Trofimov, “The per-
in Shiryayev, A. N., ed. Selected Works formance of universal encoding,” IEEE
of AN Kolmogorov: Volume III: Informa- Trans. Inf. Theory, vol. 27, no. 2, pp. 199–
tion Theory and the Theory of Algorithms, 207, 1981. (p. 254)
Vol. 27, Springer Netherlands, 1993, pp 86– [261] F. Krzakała, A. Montanari, F. Ricci-
170. (pp. 522, 523, 524, 526, 535, 538, 539, Tersenghi, G. Semerjian, and L. Zdeborová,
and 543) “Gibbs states and the set of solutions
[251] I. Kontoyiannis and S. Verdú, “Optimal of random constraint satisfaction prob-
lossless data compression: Non- lems,” Proceedings of the National
asymptotics and asymptotics,” IEEE Academy of Sciences, vol. 104, no. 25, pp.
Trans. Inf. Theory, vol. 60, no. 2, pp. 10 318–10 323, 2007. (p. 642)
777–795, 2014. (p. 198) [262] J. Kuelbs, “A strong convergence theorem
[252] J. Körner and A. Orlitsky, “Zero-error for Banach space valued random variables,”
information theory,” IEEE Transactions on The Annals of Probability, vol. 4, no. 5, pp.
Information Theory, vol. 44, no. 6, pp. 744–771, 1976. (p. 540)
2207–2229, 1998. (p. 374) [263] J. Kuelbs and W. V. Li, “Metric entropy and
[253] V. Koshelev, “Quantization with minimal the small ball problem for Gaussian mea-
entropy,” Probl. Pered. Inform, vol. 14, pp. sures,” Journal of Functional Analysis, vol.
151–156, 1963. (p. 483) 116, no. 1, pp. 133–157, 1993. (pp. 540
[254] V. Kostina, Y. Polyanskiy, and S. Verdú, and 541)
“Variable-length compression allowing [264] S. Kullback, Information theory and statis-
errors,” IEEE Transactions on Information tics. Mineola, NY: Dover publications,
Theory, vol. 61, no. 8, pp. 4316–4330, 1968. (p. xxi)
2015. (p. 548) [265] C. Külske and M. Formentin, “A symmetric
[255] V. Kostina and S. Verdú, “Fixed-length entropy bound on the non-reconstruction
lossy compression in the finite blocklength regime of Markov chains on Galton-
regime,” IEEE Transactions on Information Watson trees,” Electronic Communications
Theory, vol. 58, no. 6, pp. 3309–3338, 2012. in Probability, vol. 14, pp. 587–596, 2009.
(p. 485) (p. 135)

[266] H. O. Lancaster, “Some properties of [277] E. Lehmann and J. Romano, Testing Statis-
the bivariate normal distribution consid- tical Hypotheses, 3rd ed. Springer, 2005.
ered in the form of a contingency table,” (pp. 275 and 325)
Biometrika, vol. 44, no. 1/2, pp. 289–292, [278] W. V. Li and W. Linde, “Approximation,
1957. (p. 641) metric entropy and small ball estimates for
[267] R. Landauer, “Irreversibility and heat gen- Gaussian measures,” The Annals of Proba-
eration in the computing process,” IBM bility, vol. 27, no. 3, pp. 1556–1578, 1999.
journal of research and development, vol. 5, (p. 541)
no. 3, pp. 183–191, 1961. (p. xix) [279] W. V. Li and Q.-M. Shao, “Gaussian pro-
[268] A. Lapidoth, A foundation in digital com- cesses: inequalities, small ball probabilities
munication. Cambridge University Press, and applications,” Handbook of Statistics,
2017. (p. 403) vol. 19, pp. 533–597, 2001. (pp. 539, 541,
[269] A. Lapidoth and S. M. Moser, “Capac- and 553)
ity bounds via duality with applications [280] E. H. Lieb, “Proof of an entropy conjecture
to multiple-antenna systems on flat-fading of Wehrl,” Communications in Mathemati-
channels,” IEEE Transactions on Informa- cal Physics, vol. 62, no. 1, pp. 35–41, 1978.
tion Theory, vol. 49, no. 10, pp. 2426–2467, (p. 64)
2003. (p. 409) [281] T. Linder and R. Zamir, “On the asymp-
[270] B. Laurent and P. Massart, “Adaptive esti- totic tightness of the Shannon lower bound,”
mation of a quadratic functional by model IEEE Transactions on Information The-
selection,” The Annals of Statistics, vol. 28, ory, vol. 40, no. 6, pp. 2026–2031, 1994.
no. 5, pp. 1302–1338, 2000. (p. 85) (p. 511)
[271] S. L. Lauritzen, Graphical models. Claren- [282] R. S. Liptser, F. Pukel’sheim, and A. N.
don Press, 1996, vol. 17. (pp. 50 and 51) Shiryaev, “Necessary and sufficient condi-
[272] L. Le Cam, “Convergence of estimates tions for contiguity and entire asymptotic
under dimensionality restrictions,” Annals separation of probability measures,” Rus-
of Statistics, vol. 1, no. 1, pp. 38 – 53, 1973. sian Mathematical Surveys, vol. 37, no. 6,
(p. xxii) p. 107, 1982. (p. 126)
[273] ——, Asymptotic methods in statistical [283] S. Litsyn, “New upper bounds on error
decision theory. New York, NY: Springer- exponents,” IEEE Transactions on Informa-
Verlag, 1986. (pp. 117, 133, 558, 582, 602, tion Theory, vol. 45, no. 2, pp. 385–398,
and 614) 1999. (p. 433)
[274] C. C. Leang and D. H. Johnson, “On [284] S. Lloyd, “Least squares quantization in
the asymptotics of m-hypothesis Bayesian PCM,” IEEE transactions on information
detection,” IEEE Transactions on Informa- theory, vol. 28, no. 2, pp. 129–137, 1982.
tion Theory, vol. 43, no. 1, pp. 280–282, (p. 480)
1997. (p. 337) [285] G. G. Lorentz, M. v. Golitschek, and
[275] K. Lee, Y. Wu, and Y. Bresler, “Near opti- Y. Makovoz, Constructive approximation:
mal compressed sensing of sparse rank-one advanced problems. Springer, 1996, vol.
matrices via sparse power factorization,” 304. (pp. 523 and 538)
IEEE Transactions on Information Theory, [286] L. Lovász, “On the Shannon capacity of
vol. 64, no. 3, pp. 1666–1698, Mar. 2018. a graph,” IEEE Transactions on Informa-
(p. 543) tion theory, vol. 25, no. 1, pp. 1–7, 1979.
[276] E. L. Lehmann and G. Casella, Theory of (p. 452)
Point Estimation, 2nd ed. New York, NY: [287] D. J. MacKay, Information theory, infer-
Springer, 1998. (pp. xxii and 564) ence and learning algorithms. Cambridge
university press, 2003. (p. xxi)

[288] M. Madiman and P. Tetali, “Information annual conference on Computational learn-
inequalities for joint distributions, with ing theory, 1998, pp. 230–234. (pp. 83
interpretations and applications,” IEEE and 88)
Trans. Inf. Theory, vol. 56, no. 6, pp. 2699– [299] R. McEliece, E. Rodemich, H. Rumsey,
2713, 2010. (p. 18) and L. Welch, “New upper bounds on the
[289] M. Mahoney, “Large text compression rate of a code via the Delsarte-MacWilliams
benchmark,” http://www.mattmahoney.net/ inequalities,” IEEE transactions on Infor-
dc/text.html, Aug. 2021. (p. 245) mation Theory, vol. 23, no. 2, pp. 157–166,
[290] A. Makur and Y. Polyanskiy, “Compari- 1977. (pp. 432, 434, and 528)
son of channels: Criteria for domination by [300] R. J. McEliece and E. C. Posner, “Hide
a symmetric channel,” IEEE Transactions and seek, data storage, and entropy,” The
on Information Theory, vol. 64, no. 8, pp. Annals of Mathematical Statistics, vol. 42,
5704–5725, 2018. (p. 647) no. 5, pp. 1706–1716, 1971. (p. 543)
[291] B. Mandelbrot, “An informational theory of [301] B. McMillan, “The basic theorems of infor-
the statistical structure of language,” Com- mation theory,” Ann. Math. Stat., pp. 196–
munication theory, vol. 84, pp. 486–502, 219, 1953. (p. 234)
1953. (pp. 203, 205, and 206) [302] S. Mendelson, “Rademacher averages
[292] C. Manning and H. Schutze, Foundations and phase transitions in Glivenko-Cantelli
of statistical natural language processing. classes,” IEEE Transactions on Informa-
MIT press, 1999. (p. 351) tion Theory, vol. 48, no. 1, pp. 251–263,
[293] J. Massey, “On the fractional weight of 2002. (p. 533)
distinct binary n-tuples (corresp.),” IEEE [303] N. Merhav and M. Feder, “Universal pre-
Transactions on Information Theory, diction,” IEEE Trans. Inf. Theory, vol. 44,
vol. 20, no. 1, pp. 131–131, 1974. (p. 158) no. 6, pp. 2124–2147, 1998. (p. 260)
[294] ——, “Causality, feedback and directed [304] G. A. Miller, “Note on the bias of infor-
information,” in Proc. Int. Symp. Inf. The- mation estimates,” Information theory in
ory Applic.(ISITA-90), 1990, pp. 303–305. psychology: Problems and methods, vol. 2,
(pp. 446 and 449) pp. 95–100, 1955. (p. 584)
[295] W. Matthews, “A linear program for the [305] M. Mitzenmacher, “A brief history of gen-
finite block length converse of Polyanskiy– erative models for power law and lognormal
Poor–Verdú via nonsignaling codes,” distributions,” Internet mathematics, vol. 1,
IEEE Transactions on Information Theory, no. 2, pp. 226–251, 2004. (pp. 203 and 206)
vol. 58, no. 12, pp. 7036–7044, 2012. [306] E. Mossel, “Phase transitions in phy-
(p. 429) logeny,” Transactions of the American
[296] H. H. Mattingly, M. K. Transtrum, M. C. Mathematical Society, vol. 356, no. 6, pp.
Abbott, and B. B. Machta, “Maximizing 2379–2404, 2004. (p. 642)
the information learned from finite data [307] E. Mossel, J. Neeman, and A. Sly, “Recon-
selects a simple model,” Proceedings of the struction and estimation in the planted
National Academy of Sciences, vol. 115, partition model,” Probability Theory and
no. 8, pp. 1760–1765, 2018. (p. 248) Related Fields, vol. 162, no. 3-4, pp. 431–
[297] A. Maurer, “A note on the PAC Bayesian 461, 2015. (pp. 186 and 642)
theorem,” arXiv preprint cs/0411099, 2004. [308] E. Mossel and Y. Peres, “New coins from
(p. 83) old: computing with unknown bias,” Com-
[298] D. A. McAllester, “Some PAC-Bayesian binatorica, vol. 25, no. 6, pp. 707–724,
theorems,” in Proceedings of the eleventh 2005. (p. 172)
[309] J. Mourtada, “Exact minimax risk for linear
least squares, and the lower tail of sample

covariance matrices,” The Annals of Statis- Electronics, Communications and Com-
tics, vol. 50, no. 4, pp. 2157–2178, 2022. puter Sciences, vol. 98, no. 12, pp. 2471–
(p. 86) 2475, 2015. (p. 433)
[310] X. Mu, L. Pomatto, P. Strack, and [320] OpenAI, “GPT-4 technical report,” arXiv
O. Tamuz, “From Blackwell dominance preprint arXiv:2303.08774, 2023. (pp. xix,
in large samples to rényi divergences and 110, 195, 257, and 258)
back again,” Econometrica, vol. 89, no. 1, [321] O. Ordentlich and Y. Polyanskiy, “Strong
pp. 475–506, 2021. (pp. 145 and 182) data processing constant is achieved by
[311] N. Mukhanova, “Illustrator with binary inputs,” IEEE Trans. Inf. Theory,
a focus on children’s books,” vol. 68, no. 3, pp. 1480–1481, Mar. 2022.
https://nastyamukhanova.com/, 2023. (p. 669)
(p. xvi) [322] D. Ornstein, “Bernoulli shifts with the same
[312] B. Nakiboğlu, “The sphere packing bound entropy are isomorphic,” Advances in Math-
via Augustin’s method,” IEEE Transactions ematics, vol. 4, no. 3, pp. 337–352, 1970.
on Information Theory, vol. 65, no. 2, pp. (pp. xix and 230)
816–840, 2018. (p. 454) [323] L. Paninski, “Variational minimax estima-
[313] G. L. Nemhauser, L. A. Wolsey, and M. L. tion of discrete distributions under KL loss,”
Fisher, “An analysis of approximations for Advances in Neural Information Processing
maximizing submodular set functions–I,” Systems, vol. 17, 2004. (p. 665)
Mathematical programming, vol. 14, no. 1, [324] P. Panter and W. Dite, “Quantization distor-
pp. 265–294, 1978. (pp. 367 and 368) tion in pulse-count modulation with nonuni-
[314] J. Neveu, Mathematical foundations of the form spacing of levels,” Proceedings of
calculus of probability. Holden-day, 1965. the IRE, vol. 39, no. 1, pp. 44–48, 1951.
(p. 540) (p. 481)
[315] M. E. Newman, “Power laws, Pareto dis- [325] M. Pardo and I. Vajda, “About distances
tributions and Zipf’s law,” Contemporary of discrete distributions satisfying the data
physics, vol. 46, no. 5, pp. 323–351, 2005. processing theorem of information the-
(p. 203) ory,” IEEE transactions on information the-
[316] M. Okamoto, “Some inequalities relating to ory, vol. 43, no. 4, pp. 1288–1293, 1997.
the partial sum of binomial probabilities,” (p. 121)
Annals of the institute of Statistical Math- [326] S. Péché, “The largest eigenvalue of small
ematics, vol. 10, no. 1, pp. 29–35, 1959. rank perturbations of Hermitian random
(p. 302) matrices,” Probability Theory and Related
[317] R. I. Oliveira, “The lower tail of random Fields, vol. 134, pp. 127–173, 2006.
quadratic forms with applications to ordi- (p. 651)
nary least squares,” Probability Theory and [327] Y. Peres, “Iterating von Neumann’s proce-
Related Fields, vol. 166, pp. 1175–1194, dure for extracting random bits,” Annals of
2016. (p. 86) Statistics, vol. 20, no. 1, pp. 590–597, 1992.
[318] B. Oliver, J. Pierce, and C. Shannon, “The (p. 167)
philosophy of PCM,” Proceedings of the [328] M. S. Pinsker, “Optimal filtering of square-
IRE, vol. 36, no. 11, pp. 1324–1331, 1948. integrable signals in Gaussian noise,” Prob-
(p. 477) lemy Peredachi Informatsii, vol. 16, no. 2,
[319] Y. Oohama, “On two strong converse the- pp. 52–68, 1980. (p. xxii)
orems for discrete memoryless channels,” [329] G. Pisier, The volume of convex bodies and
IEICE Transactions on Fundamentals of Banach space geometry. Cambridge Uni-
versity Press, 1999. (pp. 523 and 531)

[330] J. Pitman, “Probabilistic bounds on the [340] Y. Polyanskiy and S. Verdú, “Empirical dis-
coefficients of polynomials with only real tribution of good channel codes with non-
zeros,” Journal of Combinatorial Theory, vanishing error probability,” IEEE Trans.
Series A, vol. 77, no. 2, pp. 279–303, 1997. Inf. Theory, vol. 60, no. 1, pp. 5–21, Jan.
(p. 301) 2014. (p. 429)
[331] E. Plotnik, M. J. Weinberger, and J. Ziv, [341] Y. Polyanskiy, “Saddle point in the mini-
“Upper bounds on the probability of max converse for channel coding,” IEEE
sequences emitted by finite-state sources Transactions on Information Theory,
and on the redundancy of the Lempel-Ziv vol. 59, no. 5, pp. 2576–2595, 2012.
algorithm,” IEEE transactions on informa- (pp. 429 and 430)
tion theory, vol. 38, no. 1, pp. 66–72, 1992. [342] ——, “On dispersion of compound DMCs,”
(p. 264) in 2013 51st Annual Allerton Conference
[332] D. Pollard, “Empirical processes: Theory on Communication, Control, and Comput-
and applications,” NSF-CBMS Regional ing (Allerton). IEEE, 2013, pp. 26–32.
Conference Series in Probability and Statis- (pp. 437 and 465)
tics, vol. 2, pp. i–86, 1990. (p. 603) [343] Y. Polyanskiy and Y. Wu, “Peak-to-average
[333] Y. Polyanskiy, “Channel coding: non- power ratio of good codes for Gaussian
asymptotic fundamental limits,” Ph.D. channel,” IEEE Trans. Inf. Theory, vol. 60,
dissertation, Princeton Univ., Princeton, no. 12, pp. 7655–7660, Dec. 2014. (p. 408)
NJ, USA, 2010. (pp. 109, 383, 385, 429, [344] ——, “Wasserstein continuity of entropy
435, and 436) and outer bounds for interference channels,”
[334] Y. Polyanskiy, H. V. Poor, and S. Verdú, IEEE Transactions on Information Theory,
“Channel coding rate in the finite block- vol. 62, no. 7, pp. 3992–4002, 2016. (pp. 60
length regime,” IEEE Trans. Inf. Theory, and 64)
vol. 56, no. 5, pp. 2307–2359, May 2010. [345] ——, “Strong data-processing inequalities
(pp. 346, 353, 434, 435, 436, and 584) for channels and Bayesian networks,” in
[335] ——, “Dispersion of the Gilbert-Elliott Convexity and Concentration. The IMA Vol-
channel,” IEEE Trans. Inf. Theory, vol. 57, umes in Mathematics and its Applications,
no. 4, pp. 1829–1848, Apr. 2011. (p. 437) vol 161, E. Carlen, M. Madiman, and E. M.
[336] ——, “Feedback in the non-asymptotic Werner, Eds. New York, NY: Springer,
regime,” IEEE Trans. Inf. Theory, vol. 57, 2017, pp. 211–249. (pp. 325, 626, 631, 632,
no. 4, pp. 4903 – 4925, Apr. 2011. (pp. 446, 635, 636, 638, 646, and 647)
454, 455, and 456) [346] ——, “Dualizing Le Cam’s method for
[337] ——, “Minimum energy to send k bits with functional estimation, with applications to
and without feedback,” IEEE Trans. Inf. estimating the unseens,” arXiv preprint
Theory, vol. 57, no. 8, pp. 4880–4902, Aug. arXiv:1902.05616, 2019. (pp. 566 and 668)
2011. (pp. 413 and 449) [347] ——, “Application of the information-
[338] Y. Polyanskiy and S. Verdú, “Arimoto chan- percolation method to reconstruction prob-
nel coding converse and Rényi divergence,” lems on graphs,” Mathematical Statistics
in Proceedings of the Forty-eighth Annual and Learning, vol. 2, no. 1, pp. 1–24, 2020.
Allerton Conference on Communication, (pp. 650, 651, and 653)
Control, and Computing, Sep. 2010, pp. [348] ——, “Self-regularizing property of non-
1327–1333. (pp. 121, 433, and 505) parametric maximum likelihood estima-
[339] Y. Polyanskiy and S. Verdu, “Binary tor in mixture models,” arXiv preprint
hypothesis testing with feedback,” in Infor- arXiv:2008.08244, 2020. (p. 408)
mation Theory and Applications Workshop [349] E. C. Posner and E. R. Rodemich, “Epsilon
(ITA), 2011. (p. 320) entropy and data compression,” Annals of

Mathematical Statistics, vol. 42, no. 6, pp. parity-check codes,” IEEE Transac-
2079–2125, Dec. 1971. (p. 543) tions on Information Theory, vol. 47, no. 2,
[350] A. Prékopa, “Logarithmic concave mea- pp. 619–637, 2001. (p. 516)
sures with application to stochastic pro- [360] T. Richardson and R. Urbanke, Modern
gramming,” Acta Scientiarum Mathemati- Coding Theory. Cambridge University
carum, vol. 32, pp. 301–316, 1971. (p. 573) Press, 2008. (pp. xxi, 63, 341, 346, 383,
[351] J. Radhakrishnan, “An entropy proof of and 632)
Bregman’s theorem,” J. Combin. Theory [361] Y. Rinott, “On convexity of measures,”
Ser. A, vol. 77, no. 1, pp. 161–164, 1997. Annals of Probability, vol. 4, no. 6, pp.
(p. 161) 1020–1026, 1976. (p. 573)
[352] M. Raginsky, “Strong data processing [362] J. J. Rissanen, “Fisher information and
inequalities and ϕ-Sobolev inequalities for stochastic complexity,” IEEE transactions
discrete channels,” IEEE Transactions on on information theory, vol. 42, no. 1, pp.
Information Theory, vol. 62, no. 6, pp. 40–47, 1996. (p. 261)
3355–3389, 2016. (pp. 626 and 638) [363] H. Robbins, “An empirical Bayes approach
[353] M. Raginsky and I. Sason, “Concentration to statistics,” in Proceedings of the Third
of measure inequalities in information the- Berkeley Symposium on Mathematical
ory, communications, and coding,” Founda- Statistics and Probability, Volume 1: Con-
tions and Trends® in Communications and tributions to the Theory of Statistics. The
Information Theory, vol. 10, no. 1-2, pp. Regents of the University of California,
1–246, 2013. (p. xxi) 1956. (p. 563)
[354] C. R. Rao, “Information and the accuracy [364] R. W. Robinson and N. C. Wormald,
attainable in the estimation of statistical “Almost all cubic graphs are Hamiltonian,”
parameters,” Bull. Calc. Math. Soc., vol. 37, Random Structures & Algorithms, vol. 3,
pp. 81–91, 1945. (p. 576) no. 2, pp. 117–125, 1992. (p. 186)
[355] A. H. Reeves, “The past present and future [365] C. Rogers, Packing and Covering, ser. Cam-
of PCM,” IEEE Spectrum, vol. 2, no. 5, pp. bridge tracts in mathematics and mathemat-
58–62, 1965. (p. 477) ical physics. Cambridge University Press,
[356] A. Rényi, “On measures of entropy and 1964. (p. 527)
information,” in Proc. 4th Berkeley Symp. [366] H. Roozbehani and Y. Polyanskiy, “Alge-
Mathematics, Statistics, and Probability, braic methods of classifying directed
vol. 1, Berkeley, CA, USA, 1961, pp. 547– graphical models,” arXiv preprint
561. (p. 13) arXiv:1401.5551, 2014. (p. 180)
[357] ——, “On the dimension and entropy of [367] ——, “Low density majority codes and
probability distributions,” Acta Mathemat- the problem of graceful degradation,” arXiv
ica Hungarica, vol. 10, no. 1 – 2, Mar. 1959. preprint arXiv:1911.12263, 2019. (pp. 191
(p. 29) and 669)
[358] Z. Reznikova and B. Ryabko, “Anal- [368] H. P. Rosenthal, “On the subspaces of
ysis of the language of ants by Lp (p > 2) spanned by sequences of inde-
information-theoretical methods,” Prob- pendent random variables,” Israel Journal
lemi Peredachi Informatsii, vol. 22, no. 3, of Mathematics, vol. 8, no. 3, pp. 273–303,
pp. 103–108, 1986, English translation: 1970. (p. 497)
http://reznikova.net/R-R-entropy-09.pdf. [369] D. Russo and J. Zou, “Controlling bias in
(p. 9) adaptive data analysis using information
[359] T. J. Richardson, M. A. Shokrollahi, and theory,” in Artificial Intelligence and Statis-
R. L. Urbanke, “Design of capacity- tics. PMLR, 2016, pp. 1232–1240. (pp. 90
approaching irregular low-density and 188)

[370] I. N. Sanov, “On the probability of large channels i,” Inf. Contr., vol. 10, pp. 65–103,
deviations of random magnitudes,” Matem- 1967. (pp. 432 and 433)
aticheskii Sbornik, vol. 84, no. 1, pp. 11–44, [382] J. Shawe-Taylor and R. C. Williamson, “A
1957. (p. 307) PAC analysis of a Bayesian estimator,” in
[371] E. Şaşoğlu, “Polar coding theorems for dis- Proceedings of the tenth annual conference
crete systems,” EPFL, Tech. Rep., 2011. on Computational learning theory, 1997,
(p. 669) pp. 2–9. (p. 83)
[372] ——, “Polarization and polar codes,” Foun- [383] O. Shayevitz, “On Rényi measures and
dations and Trends® in Communications hypothesis testing,” in 2011 IEEE Interna-
and Information Theory, vol. 8, no. 4, pp. tional Symposium on Information Theory
259–381, 2012. (p. 341) Proceedings. IEEE, 2011, pp. 894–898.
[373] I. Sason and S. Verdú, “f-divergence (p. 182)
inequalities,” IEEE Transactions on Infor- [384] O. Shayevitz and M. Feder, “Optimal feed-
mation Theory, vol. 62, no. 11, pp. 5973– back communication via posterior match-
6006, 2016. (p. 132) ing,” IEEE Trans. Inf. Theory, vol. 57, no. 3,
[374] G. Schechtman, “Extremal configurations pp. 1186–1222, 2011. (p. 445)
for moments of sums of independent pos- [385] A. N. Shiryaev, Probability-1. Springer,
itive random variables,” in Banach Spaces 2016, vol. 95. (p. 126)
and their Applications in Analysis. De [386] G. Simons and M. Woodroofe, “The
Gruyter, 2011, pp. 183–192. (p. 505) Cramér-Rao inequality holds almost every-
[375] M. J. Schervish, Theory of statistics. where,” in Recent Advances in Statistics:
Springer-Verlag New York, 1995. (pp. 582 Papers in Honor of Herman Chernoff on his
and 583) Sixtieth Birthday. Academic, New York,
[376] A. Schrijver, Theory of linear and integer 1983, pp. 69–93. (p. 661)
programming. John Wiley & Sons, 1998. [387] Y. G. Sinai, “On the notion of entropy of a
(p. 567) dynamical system,” in Doklady of Russian
[377] C. E. Shannon, “A symbolic analysis of Academy of Sciences, vol. 124, no. 3, 1959,
relay and switching circuits,” Electrical pp. 768–771. (pp. xix and 230)
Engineering, vol. 57, no. 12, pp. 713–723, [388] R. Sinkhorn, “A relationship between arbi-
Dec 1938. (p. 626) trary positive matrices and doubly stochas-
[378] C. E. Shannon, “A mathematical theory of tic matrices,” Ann. Math. Stat., vol. 35,
communication,” Bell Syst. Tech. J., vol. 27, no. 2, pp. 876–879, 1964. (p. 105)
pp. 379–423 and 623–656, Jul./Oct. 1948. [389] M. Sion, “On general minimax theorems,”
(pp. xvii, 41, 195, 215, 234, 341, 346, 377, Pacific J. Math, vol. 8, no. 1, pp. 171–176,
and 411) 1958. (p. 93)
[379] ——, “The zero error capacity of a noisy [390] M.-K. Siu, “Which Latin squares are Cay-
channel,” IRE Transactions on Informa- ley tables?” Amer. Math. Monthly, vol. 98,
tion Theory, vol. 2, no. 3, pp. 8–19, 1956. no. 7, pp. 625–627, Aug. 1991. (p. 384)
(pp. 374, 450, and 452) [391] D. Slepian and H. O. Pollak, “Prolate
[380] ——, “Coding theorems for a discrete spheroidal wave functions, Fourier analysis
source with a fidelity criterion,” IRE Nat. and uncertainty–I,” Bell System Technical
Conv. Rec, vol. 4, no. 142-163, p. 1, 1959. Journal, vol. 40, no. 1, pp. 43–63, 1961.
(pp. 475 and 490) (p. 419)
[381] C. E. Shannon, R. G. Gallager, and E. R. [392] D. Slepian and J. Wolf, “Noiseless cod-
Berlekamp, “Lower bounds to error prob- ing of correlated information sources,”
ability for coding on discrete memoryless IEEE Transactions on information Theory,
vol. 19, no. 4, pp. 471–480, 1973. (p. 223)

[393] A. Sly, “Reconstruction of random colour- decision theory. Berlin, Germany: Walter
ings,” Communications in Mathematical de Gruyter, 1985. (pp. 558 and 566)
Physics, vol. 288, no. 3, pp. 943–961, Jun [405] J. Suzuki, “Some notes on universal noise-
2009. (p. 644) less coding,” IEICE transactions on fun-
[394] A. Sly and N. Sun, “Counting in two-spin damentals of electronics, communications
models on d-regular graphs,” The Annals of and computer sciences, vol. 78, no. 12, pp.
Probability, vol. 42, no. 6, pp. 2383–2416, 1840–1847, 1995. (p. 252)
2014. (p. 75) [406] S. Szarek, “Nets of Grassmann manifold
[395] B. Smith, “Instantaneous companding of and orthogonal groups,” in Proceedings of
quantized signals,” Bell System Technical Banach Space Workshop. University of
Journal, vol. 36, no. 3, pp. 653–709, 1957. Iowa Press, 1982, pp. 169–185. (pp. 527
(p. 483) and 544)
[396] J. G. Smith, “The information capacity of [407] ——, “Metric entropy of homogeneous
amplitude and variance-constrained scalar spaces,” Banach Center Publications,
Gaussian channels,” Information and Con- vol. 43, no. 1, pp. 395–410, 1998. (p. 527)
trol, vol. 18, pp. 203 – 219, 1971. (p. 408) [408] W. Szpankowski and S. Verdú, “Mini-
[397] Spectre, “SPECTRE: Short packet com- mum expected length of fixed-to-variable
munication toolbox,” https://github.com/ straints,” IEEE Trans. Inf. Theory, vol. 57,
yp-mit/spectre, 2015, GitHub repository. straints,” IEEE Trans. Inf. Theory, vol. 57,
(pp. 418 and 441) no. 7, pp. 4017–4025, 2011. (p. 200)
[398] R. Speer, J. Chin, A. Lin, S. Jewett, and [409] I. Tal and A. Vardy, “List decoding of polar
L. Nathan, “Luminosoinsight/wordfreq: codes,” IEEE Transactions on Information
v2.2,” Oct. 2018. [Online]. Available: https: Theory, vol. 61, no. 5, pp. 2213–2226, 2015.
//doi.org/10.5281/zenodo.1443582 (p. 204) (p. 346)
[399] A. J. Stam, “Some inequalities satisfied by [410] M. Talagrand, “The Parisi formula,” Annals
the quantities of information of Fisher and of mathematics, pp. 221–263, 2006. (p. 63)
Shannon,” Information and Control, vol. 2, [411] ——, Upper and lower bounds for stochas-
no. 2, pp. 101–112, 1959. (pp. 64, 185, tic processes. Springer, 2014. (p. 531)
and 191) [412] T. Tanaka, P. M. Esfahani, and S. K. Mitter,
[400] ——, “Distance between sampling with “LQG control with minimum directed
and without replacement,” Statistica Neer- information: Semidefinite programming
landica, vol. 32, no. 2, pp. 81–91, 1978. approach,” IEEE Transactions on Auto-
(pp. 186 and 187) matic Control, vol. 63, no. 1, pp. 37–52,
[401] M. Steiner, “The strong simplex conjecture 2017. (p. 449)
is false,” IEEE Transactions on Information [413] W. Tang and F. Tang, “The Poisson bino-
Theory, vol. 40, no. 3, pp. 721–731, 1994. mial distribution – old & new,” Statistical
(p. 413) Science, vol. 38, no. 1, pp. 108–119, 2023.
[402] V. Strassen, “Asymptotische Abschätzun- (p. 301)
gen in Shannon’s Informationstheorie,” in [414] T. Tao, “Szemerédi’s regularity lemma
Trans. 3d Prague Conf. Inf. Theory, Prague, revisited,” Contributions to Discrete Math-
1962, pp. 689–723. (p. 435) ematics, vol. 1, no. 1, pp. 8–28, 2006.
[403] ——, “The existence of probability mea- (pp. 127 and 190)
sures with given marginals,” Annals of [415] G. Taricco and M. Elia, “Capacity of fading
Mathematical Statistics, vol. 36, no. 2, pp. channel with no side information,” Elec-
423–439, 1965. (p. 122) tronics Letters, vol. 33, no. 16, pp. 1368–
[404] H. Strasser, Mathematical theory of statis- 1370, 1997. (p. 409)
tics: Statistical experiments and asymptotic

[416] V. Tarokh, H. Jafarkhani, and A. R. Calder- [427] I. Vajda, “Note on discrimination informa-
bank, “Space-time block codes from orthog- tion and variation (corresp.),” IEEE Trans-
onal designs,” IEEE Transactions on Infor- actions on Information Theory, vol. 16,
mation theory, vol. 45, no. 5, pp. 1456– no. 6, pp. 771–773, 1970. (p. 131)
1467, 1999. (p. 409) [428] G. Valiant and P. Valiant, “Estimating the
[417] V. Tarokh, N. Seshadri, and A. R. Calder- unseen: an n/ log(n)-sample estimator for
bank, “Space-time codes for high data rate entropy and support size, shown optimal
wireless communication: Performance cri- via new CLTs,” in Proceedings of the 43rd
terion and code construction,” IEEE trans- annual ACM symposium on Theory of com-
actions on information theory, vol. 44, puting, 2011, pp. 685–694. (p. 584)
no. 2, pp. 744–765, 1998. (p. 409) [429] S. van de Geer, Empirical Processes in M-
[418] E. Telatar, “Capacity of multi-antenna Estimation. Cambridge University Press,
Gaussian channels,” European trans. tele- 2000. (pp. 86 and 603)
com., vol. 10, no. 6, pp. 585–595, 1999. [430] A. van der Vaart, “The statistical work of
(pp. 176 and 409) Lucien Le Cam,” Annals of Statistics, pp.
[419] ——, “Wringing lemmas and multiple 631–682, 2002. (pp. 614 and 616)
descriptions,” 2016, unpublished draft. [431] A. W. van der Vaart and J. A. Well-
(p. 187) ner, Weak Convergence and Empirical Pro-
[420] V. N. Temlyakov, “On estimates of ϵ- cesses. Springer Verlag New York, Inc.,
entropy and widths of classes of functions 1996. (pp. 86 and 603)
with a bounded mixed derivative or differ- [432] T. Van Erven and P. Harremoës, “Rényi
ence,” Doklady Akademii Nauk, vol. 301, divergence and Kullback-Leibler diver-
no. 2, pp. 288–291, 1988. (p. 541) gence,” IEEE Trans. Inf. Theory, vol. 60,
[421] N. Tishby, F. C. Pereira, and W. Bialek, no. 7, pp. 3797–3820, 2014. (p. 145)
“The information bottleneck method,” [433] H. L. Van Trees, Detection, Estimation, and
arXiv preprint physics/0004057, 2000. Modulation Theory. Wiley, New York,
(p. 549) 1968. (p. 577)
[422] F. Topsøe, “Some inequalities for informa- [434] S. Verdú, “On channel capacity per unit
tion divergence and related measures of dis- cost,” IEEE Trans. Inf. Theory, vol. 36,
crimination,” IEEE Transactions on Infor- no. 5, pp. 1019–1030, Sep. 1990. (p. 414)
mation Theory, vol. 46, no. 4, pp. 1602– [435] ——, Multiuser Detection. Cambridge,
1609, 2000. (p. 133) UK: Cambridge Univ. Press, 1998. (p. 413)
[423] D. Tse and P. Viswanath, Fundamentals of [436] ——, “Information theory, part I,” draft
wireless communication. Cambridge Uni- (personal communication), 2017. (p. xv)
versity Press, 2005. (pp. xxi, 403, and 409) [437] S. Verdú and D. Guo, “A simple proof of the
[424] A. B. Tsybakov, Introduction to Nonpara- entropy-power inequality,” IEEE Transac-
metric Estimation. New York, NY: tions on Information Theory, vol. 52, no. 5,
Springer Verlag, 2009. (pp. xxi, xxii, 132, pp. 2165–2166, 2006. (p. 64)
and 624) [438] R. Vershynin, High-dimensional probabil-
[425] B. P. Tunstall, “Synthesis of noiseless com- ity: An introduction with applications in
pression codes,” Ph.D. dissertation, Geor- data science. Cambridge university press,
gia Institute of Technology, 1967. (p. 196) 2018, vol. 47. (pp. 86 and 531)
[426] E. Uhrmann-Klingen, “Minimal Fisher [439] A. G. Vitushkin, “On the 13th problem of
information distributions with compact- Hilbert,” Dokl. Akad. Nauk SSSR, vol. 95,
supports,” Sankhyā: The Indian Journal Hilbert,” Dokl. Akad. Nauk SSSR, vol. 95,
of Statistics, Series A, pp. 360–374, 1995.
(p. 580)

[440] ——, “On Hilbert’s thirteenth problem and [452] R. J. Williams, “Simple statistical gradient-
related questions,” Russian Mathematical following algorithms for connectionist rein-
Surveys, vol. 59, no. 1, p. 11, 2004. (p. xviii) forcement learning,” Machine learning,
[441] ——, Theory of the Transmission and Pro- vol. 8, pp. 229–256, 1992. (p. 77)
cessing of Information. Pergamon Press, [453] H. Witsenhausen and A. Wyner, “A condi-
1961. (p. 535) tional entropy bound for a pair of discrete
[442] J. von Neumann, “Various techniques used random variables,” IEEE Transactions on
in connection with random digits,” Monte Information Theory, vol. 21, no. 5, pp. 493–
Carlo Method, National Bureau of Stan- 501, 1975. (p. 325)
dards, Applied Math Series, no. 12, pp. [454] J. Wolfowitz, “On Wald’s proof of the con-
36–38, 1951. (p. 166) sistency of the maximum likelihood esti-
[443] ——, “Probabilistic logics and the synthe- mate,” The Annals of Mathematical Statis-
sis of reliable organisms from unreliable tics, vol. 20, no. 4, pp. 601–602, 1949.
components,” in Automata Studies.(AM- (p. 582)
34), Volume 34, C. E. Shannon and [455] Y. Wu and J. Xu, “Statistical problems
J. McCarthy, Eds. Princeton University with planted structures: Information-
Press, 1956, pp. 43–98. (p. 627) theoretical and computational limits,” in
[444] D. Von Rosen, “Moments for the inverted Information-Theoretic Methods in Data
Wishart distribution,” Scandinavian Jour- Science, Y. Eldar and M. Rodrigues,
nal of Statistics, pp. 97–109, 1988. (p. 272) Eds. Cambridge University Press, 2020,
[445] V. G. Vovk, “Aggregating strategies,” Proc. arXiv:1806.00118. (p. 338)
of Computational Learning Theory, 1990, [456] Y. Wu and P. Yang, “Minimax rates
1990. (pp. xx and 271) of entropy estimation on large alpha-
[446] M. J. Wainwright, High-dimensional statis- bets via best polynomial approximation,”
tics: A non-asymptotic viewpoint. Cam- IEEE Transactions on Information The-
bridge University Press, 2019, vol. 48. ory, vol. 62, no. 6, pp. 3702–3720, 2016.
(p. xxii) (p. 584)
[447] M. J. Wainwright and M. I. Jordan, “Graph- [457] A. Wyner and J. Ziv, “A theorem on
ical models, exponential families, and varia- the entropy of certain binary sequences
tional inference,” Foundations and Trends® and applications–I,” IEEE Transactions on
in Machine Learning, vol. 1, no. 1–2, pp. Information Theory, vol. 19, no. 6, pp. 769–
1–305, 2008. (pp. 74 and 75) 772, 1973. (p. 191)
[448] A. Wald, “Sequential tests of statistical [458] A. Wyner, “The common information of
hypotheses,” The Annals of Mathematical two dependent random variables,” IEEE
Statistics, vol. 16, no. 2, pp. 117–186, 1945. Transactions on Information Theory,
(p. 320) vol. 21, no. 2, pp. 163–179, 1975. (pp. 503
[449] ——, “Note on the consistency of the max- and 504)
imum likelihood estimate,” The Annals of [459] ——, “On source coding with side infor-
Mathematical Statistics, vol. 20, no. 4, pp. mation at the decoder,” IEEE Transactions
595–601, 1949. (p. 582) on Information Theory, vol. 21, no. 3, pp.
[450] A. Wald and J. Wolfowitz, “Optimum char- 294–300, 1975. (p. 227)
acter of the sequential probability ratio test,” [460] Q. Xie and A. R. Barron, “Minimax redun-
The Annals of Mathematical Statistics, pp. dancy for the class of memoryless sources,”
326–339, 1948. (p. 320) IEEE Transactions on Information Theory,
[451] M. M. Wilde, Quantum information theory. vol. 43, no. 2, pp. 646–657, 1997. (p. 252)
Cambridge University Press, 2013. (p. xxi)

[461] A. Xu and M. Raginsky, “Information-theoretic analysis of generalization capability of learning algorithms,” Advances in Neural Information Processing Systems, vol. 30, 2017. (p. 90)
[462] W. Yang, G. Durisi, T. Koch, and Y. Polyanskiy, “Quasi-static multiple-antenna fading channels at finite blocklength,” IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 4232–4265, 2014. (p. 437)
[463] W. Yang, G. Durisi, and Y. Polyanskiy, “Minimum energy to send k bits over multiple-antenna fading channels,” IEEE Transactions on Information Theory, vol. 62, no. 12, pp. 6831–6853, 2016. (p. 417)
[464] Y. Yang and A. R. Barron, “Information-theoretic determination of minimax rates of convergence,” Annals of Statistics, vol. 27, no. 5, pp. 1564–1599, 1999. (pp. xxii, 602, 606, and 607)
[465] Y. G. Yatracos, “Rates of convergence of minimum distance estimators and Kolmogorov’s entropy,” The Annals of Statistics, pp. 768–774, 1985. (pp. 602, 620, and 621)
[466] S. Yekhanin, “Improved upper bound for the redundancy of fix-free codes,” IEEE Trans. Inf. Theory, vol. 50, no. 11, pp. 2815–2818, 2004. (p. 208)
[467] P. L. Zador, “Development and evaluation of procedures for quantizing multivariate distributions,” Ph.D. dissertation, Stanford University, Department of Statistics, 1963. (p. 482)
[468] ——, “Asymptotic quantization error of continuous signals and the quantization dimension,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 139–149, 1982. (p. 482)
[469] O. Zeitouni, J. Ziv, and N. Merhav, “When is the generalized likelihood ratio test optimal?” IEEE Transactions on Information Theory, vol. 38, no. 5, pp. 1597–1602, 1992. (p. 324)
[470] C.-H. Zhang, “Compound decision theory and empirical Bayes methods,” The Annals of Statistics, vol. 31, no. 2, pp. 379–390, 2003. (p. 563)
[471] T. Zhang, “Covering number bounds of certain regularized linear function classes,” Journal of Machine Learning Research, vol. 2, no. Mar, pp. 527–550, 2002. (p. 533)
[472] Z. Zhang and R. W. Yeung, “A non-Shannon-type conditional inequality of information quantities,” IEEE Trans. Inf. Theory, vol. 43, no. 6, pp. 1982–1986, 1997. (p. 17)
[473] ——, “On characterization of entropy function via information inequalities,” IEEE Trans. Inf. Theory, vol. 44, no. 4, pp. 1440–1452, 1998. (p. 17)
[474] L. Zheng and D. N. C. Tse, “Communication on the Grassmann manifold: A geometric approach to the noncoherent multiple-antenna channel,” IEEE Transactions on Information Theory, vol. 48, no. 2, pp. 359–383, 2002. (p. 409)
[475] N. Zhivotovskiy, “Dimension-free bounds for sums of independent matrices and simple tensors via the variational principle,” arXiv preprint arXiv:2108.08198, 2021. (p. 85)
[476] W. Zhou, V. Veitch, M. Austern, R. P. Adams, and P. Orbanz, “Non-vacuous generalization bounds at the ImageNet scale: a PAC-Bayesian compression approach,” in International Conference on Learning Representations (ICLR), 2018. (pp. 89 and 90)
[477] G. Zipf, Selective Studies and the Principle of Relative Frequency in Language. Cambridge, MA: Harvard University Press, 1932. (pp. 203 and 204)
Index

FI -curve, 325, 338, 549, 638 Alon, N., 160 BEC, 181, 372, 380, 439, 455, 460,
I-projection, see information alternating minimization algorithm, 471, 639, 654, 669
projection 102 belief propagation, 653
Log function, 25 Amari, S.-I., 307 Bell Labs, 480
ϵ-covering, 523 Anderson’s lemma, 541, 572 Berlekamp, E., 432
ϵ-net, see ϵ-covering Anderson, T. W., 572 Bernoulli factory, 172
ϵ-packing, 523 approximate message passing (AMP), Bernoulli shifts, 230
Z2 synchronization, 649 653 Bernoulli, D., 143
σ-algebra, 79 area theorem, 63 Berry-Esseen inequality, 435
denseness, 240 area under the curve (AUC), 278 Bhattacharyya distance, 315
monotone limits, 79 Arimoto, S., 102, 433 binary divergence, 1, 22, 56
f-divergence, 115, 631 arithmetic coding, 245, 268 binary entropy function, 1, 9
inf-characterization, 151 Artstein, S., 535 binary symmetric channel, see BSC
sup-characterization, 121, 147 Assouad’s lemma, 389, 597, 664 binomial tail, 159
comparison, 127, 132 via Mutual information method, 598 bipartite graph, 161
conditional, 117 asymmetric numeral system (ANS), Birgé, L., 613, 614, 625
convexity, 120 246 Birkhoff-Khintchine theorem, 234
data processing, 119 asymptotic efficiency, 581 Birman, M. Š, 538
finite partitions, 121 asymptotic equipartition property birthday paradox, 186
local behavior, 138 (AEP), 217, 234 bit error rate (BER), 389
lower semicontinuity, 148 asymptotic separatedness, 125 Blackwell measure, 111
monotonicity, 118 autocovariance function, 114 Blackwell order, 182, 329
operational meaning, 122 automatic repeat request (ARQ), 468 Blackwell, D., 111, 182
SDPI, 629 auxiliary channel, 423, 428 Blahut, R., 102
f-information, 134, 182, 630, 631 auxiliary random variable, 227 Blahut-Arimoto algorithm, 102
χ2 , 136 blocklength, 370
additivity, 135 BMS, 383, 631, 669
definition, 134 B-process, 232 mixture representation, 632
subadditivity, 135, 644 balls and urn, 186 Bollobás, B., 164
symmetric KL, 135, 187 Barg, A., 433 Boltzmann, 15
g-divergence, 121 Barron, A. R., 64, 606 Boltzmann constant, 410
k-means, 481 batch loss, 270, 611 Bonami-Beckner semigroup, 132
3GPP, 403 Bayes risk, 662 boolean function, 626
GLM, 563 bowl-shaped, 571
Bayesian Cramér-Rao, 661 box-counting dimension, see
absolute continuity, 21, 42, 43 Bayesian Cramér-Rao (BCR) lower Minkowski dimension
absolute norm, 526 bound, 577 Brégman’s theorem, 162
achievability, 201 Bayesian Cramér-Rao lower bound , Breiman, L., 234
additive set-functions, 79 663 broadcasting
ADSL, 403 Bayesian networks, 633 on a grid, 670
Ahlswede, R., 208, 227, 326 BCR lower bound, 578 on trees, 642
Ahlswede-Csisár, 326 functional estimation, 580 Brownian motion, 417
Alamouti code, 409 multivariate, 579


BSC, 49, 53, 111, 344, 347, 363, 372, non-existence, 98 finite blocklength, 346
379, 436, 439, 455, 469, 471, 630, capacity-achieving output distribution, fundamental limit, 373, 395
643, 655, 669, 670 94, 96, 253 Gallager’s bound, 360, 431
channel coding, 344 uniqueness, 94, 97 information density, 351
contraction coefficient, 633 capacity-cost function, 396, 466, 467 linear code, 362, 461
SDPI, 633 capacity-redundancy theorem, 248, normal approximation, 439
strong converse, 424 269, 270, 604 normalized rate, 439
Burnashev’s error-exponent, 456 Carnot’s cycle, 14 optimal decoder, 347
carrier frequency, 419 posterior matching, 444
Catoni, O., 83 power constraint, 394
capacity, 49, 91, 94, 96, 102, 178, 179,
causal conditioning, 448 probability of error, 343
256, 345, 348
causal inference, 446 randomized encoder, 460
ϵ-capacity, 373, 395
Cencov, N. N., 307 RCU bound, 359, 439
Z-channel, 380
center of gravity, 67, 184 real-world codes, 440
ACGN, 405
central limit theorem, 78, 148, 181, reliability function, 431
additive non-gaussian noise, 401
202 Schalkwijk-Kailath scheme, 459
amplitude-constrained AWGN, 407
Centroidal Voronoi Tessellation sent codeword, 350
AWGN, 399
(CVT), 481 Shannon’s random coding, 354
BEC, 380
chain rule sphere-packing bound, 427, 432,
bit error rate, 390
χ2 , 183 454, 471
BSC, 379
differential entropy, 27 straight-line bound, 432
compound DMC, 465
divergence, 32, 33, 183 strong converse, 422, 465, 469, 470
continuous-time AWGN, 418
entropy, 12, 158 submodularity, 367
erasure-error channel, 464
Hellinger, 183 threshold decoder, 353
Gaussian channel, 100, 107
mutual information, 52, 63, 187 transmission rate, 373
group channel, 379
Rényi divergence, 183 universal, 463
information capacity, 375, 395
total variation, 183 unsent codeword, 350
information stable channels, 386,
chaining, 86 variable-length, 455, 471
399
channel, 29 weak converse, 348, 397
maximal probability of error, 375
channel automorophism, 381 zero-rate, 432
memoryless channels, 377
channel capacity, see capacity channel comparison, 646, 669, 670
MIMO channel, 176
channel coding channel dispersion, 434
mixture DMC, 465
(M, ϵ)-code, 343 channel filter, 406
non-stationary AWGN, 403
κ-β bound, 435 channel state information, 408
parallel AWGN, 402
admissible constraint, 396 channel symmetry group, 381
per unit cost, 414, 470
BSC, 344 channel symmetry relations, 385
product channel, 465
capacity, 345 channel, OR-channel, 189
Shannon capacity, 373, 395
capacity per unit cost, 414 channel, Z-channel, 372
sum of DMCs, 465
capacity-cost, 396, 466 channel, q-ary erasure, 639
with feedback, 443, 471
cost function, 395 channel, q-ary symmetric, 670, 671
zero-error, 374, 464, 471
cost-constrained code, 395, 467 channel, Z-channel, 380
zero-error with feedback, 450
degrees-of-freedom, 409 channel, ACGN, 405
capacity achieving output distribution,
dispersion, see dispersion channel, additive noise, 50, 363, 464
425
DT bound, 356, 460, 461 channel, additive non-Gaussian noise,
Capacity and Hellinger entropy
DT bound, linear codes, 364 466
lower bound, 609
Elias’ scheme, 457 channel, additive-noise, 371, 467
upper bound, 610
energy-per-bit, see energy-per-bit channel, AWGN, 48, 98, 100, 372,
Capacity and KL covering numbers,
error-exponents, 430, 460, 469, 471 399, 436, 457, 460, 470
603, 608
error-exponents with feedback, 454 channel, AWGN with ISI, 406
capacity-achieving input distribution,
expurgated random coding, 469 channel, bandlimited AWGN, 419
94, 444
feedback code, 442, 471 channel, BI-AWGN, 48, 436, 655
discrete, 407

channel, binary erasure, see BEC Clausius, 14 Slepian-Wolf, 222, 267


channel, binary input, 631 codebook, 342 time sharing, 226
channel, binary memoryless, see BMS codeword, 342 universal, 245
channel, binary symmetric, see BSC coding theorem variable-length, 168, 233, 265
channel, binary-input AWGN AWGN, 399 variable-length lossless, 197
(BI-AWGN), 649 capacity per unit cost, 414 computation via formulas, 628
channel, block-fading, 409 capacity-cost, 399 computation with noisy gates, 627
channel, coloring, 643, 670 channel coding, 345, 377 concentration of measure, 151
channel, compound, 465 compression, 199 conditional expectation in Gaussian
channel, continuous-time, 417 information stable channels, 386 channel, 60
channel, cost constrained, 395 non-stationary AWGN, 403 conditional independence, 50, 180,
channel, definition, 370 parallel AWGN, 402 182
channel, discrete memoryless, see coding theory, 341 conditioning increases divergence, 33,
DMC combinatorics, 15, 174, 432, 452, 471 121
channel, Dobrushin-symmetric, 383 common information, 338 confidence interval, 559
channel, erasure, 177, 647, 671 community detection, 126, 185, 337, confusability graph, 451, 471
channel, erasure-error, 464 642, 648 constant-composition code, 426
channel, exotic, 436 single community, 591 contiguity, 125, 186
channel, fading, 408 compander, 479 continuity of information measures, 66
channel, fading (non-coherent), 416 comparison of information measures, Contraction coefficient, 629
channel, Gallager-symmetric, 383 57 contraction coefficient, 669
channel, Gaussian, 388 composite hypothesis testing χ2 , 669
channel, Gaussian broadcast, 65 Hellinger upper bound, 613 χ2 , 631, 638, 640, 644
channel, Gaussian erasure, 437 meta-converse, 429 f-divergence, 629, 637
channel, information stable, 386, 393 minimax risk, 612 f-information, 630
channel, input-symmetric, 369, 383, TV balls, 338, 613 KL, 53, 326, 636, 638, 654, 655,
631 compression 669, 671
channel, MIMO, 176, 403, 409, 437 almost lossless, 213 tensorization, 638
channel, mixture DMC, 465 arithmetic coding, 245, 268 total variation, 629
channel, non-anticipatory, 371, 449 blocklength, 201 convergence in distribution, 78, 82
channel, non-stationary AWGN, 403, Elias code, 270 converse, 201, 348
466 enumerative codes, 268 bit error rate, 390
channel, parallel AWGN, 176, 402, ergodic process, 233, 393 convex body, 535
436 error-exponents, 335 polar body, 535
channel, Poisson, 177 finite blocklength, 214 symmetric, 535
channel, polygon, 451, 471 Fitingof, 246 convex conjugate, 147, 296
channel, Potts, 671 fixed length, 213 convex duality, 73, 147, 157
channel, product channel, 465 Huffman code, 209 convex hull, 279, 309
channel, quasi-static fading, 437 iid source, 214 convex optimization, 209
channel, stationary memoryless, 371 Lempel-Ziv, 263 convex support of measure, 309
channel, sum of channels, 465 linear codes, 218 convexity, biconjugation, 147
channel, weakly input-symmetric, 383, maximin vs minimax solution, 252 convexity, strict, 92, 121, 295, 510
446 mismatched, 267 convexity, strong, 92
channel, with feedback, 455, 457 multi-terminal, 225, 266 convolution, 2, 59
channel, with memory, 464 normal approximation, 214 binary, 191
Chapman, D. G., 575 normalized maximal likelihood, 248 functions, 2
Chebyshev radius, 96 optimal, 198, 265 Gaussian, 59
Chernoff bound, 159, 292 redundancy, see redundancy probability measures, 2
Chernoff information, 123, 315, 337 run-length, 266 correlated recovery, 593
Chernoff-Rubin-Stein lower bound, side information, 221 correlation coefficient, 180, 388, 631,
661 single-shot, 198 644

correlation coefficient, maximal, 640 with feedback, 446 KL, 36, 37, 116, 131, 133, 150, 633
cost function, 395 zero-dispersion channel, 436, 437, Le Cam, 117, 133, 631
Costa, M., 64 455 local behavior, 36, 39, 137, 335
coupling, 122, 151, 178, 183 distortion metric, 484 lower semicontinuity, 78, 99
covariance matrix, 49, 59, 114, 510 separable, 485 Marton’s, 123, 151, 183
covariance matrix estimation, 85, 667 distributed estimation, 325, 644, 657 measure-theoretic properties, 80
covering lemma, 228, 327, 499 distribution estimation over an algebra of sets, 79
CR lower bound, 576, 661 χ2 risk, 664 parametric family, 38, 140
multivariate, 576 binary alphabet, 662 Rényi, see Rényi divergence, 314
Cramér’s condition, 293 KL risk, 664 real Gaussians, 23
Cramér, H., 576 quadratic risk, 583, 664 strong convexity, 92, 182
Cramér-Rao lower bound, see CR TV risk, 664 symmetric KL, 135
lower bound distribution, Bernoulli, 9, 48, 269, 333 total variation, see total variation
cryptography, 14 distribution, binomial, 174 divergence for mixtures, 36
Csiszár, I., 326 distribution, Dirichlet, 250, 252 DMC, 372, 456
Cuff, P., 503 distribution, discrete, 40 Dobrushin’s contraction, 629
cumulant generating function, see log distribution, exponential, 177, 330 dominated convergence theorem, 138,
MGF distribution, Gamma, 333 141
cumulative risk, 611 distribution, Gaussian, 47, 49, 59, 64, Donsker, M. D., 71
98, 133, 330, 333, 336 Donsker-Varadhan, 71, 83, 147, 150,
distribution, geometric, 9, 175 297
data-processing inequality, see DPI
distribution, Marchenko-Pastur, 176 Doob, 31
de Bruijn’s identity, 60, 191
distribution, mixture of products, 146 doubling dimension, 616
de Finetti’s theorem, 186
distribution, mixtures, 36 DPI, 42, 154, 426
decibels (dB), 48
distribution, Poisson, 178 χ2 , 148
decoder
distribution, Poisson-Binomial, 301 f-divergence, 119
maximal mutual information
distribution, product, 55 f-information, 134
(MMI), 463
distribution, product of mixtures, 146 divergence, 34, 36, 53, 56, 57, 73,
maximum a posteriori (MAP), 347
distribution, subgaussian, see 348
maximum likelihood (ML), 347,
subgaussian Fisher information, 184
353
distribution, uniform, 11, 27, 175, 178 mutual information, 51, 53
decoding region, 342, 400
distribution, Zipf, 203 Neyman-Pearson region, 329
deconvolution filter, 407
Dite, W., 481 Rényi divergence, 433
degradation of channels, 182, 329, 646
divergence, 20 Duda, J., 246
density estimation, 244, 602, 661, 664
χ2 , 184, 185, 668, 669 Dudley’s entropy integral, 531, 552
Bayes χ2 risk, 662
χ2 , 36, 116, 122, 126, 132, 133, Dudley, R., 531
Bayes KL risk, 605, 662
136, 145, 148, 149, 631, 638, 640, Dueck, G., 187, 433
discrete case, 137
641, 644 dynamical system, 230
derivative of divergence, 36
inf-representation, 123
derivative of mutual information, 59
sup-characterization, 70, 71
diameter of a set, 96 ebno, see energy-per-bit
conditional, 42
differential entropy, 26, 47, 61, 158, ECC, 342
continuity, 78
164, 175, 191 eigenvalues, 114
continuity in σ-algebra, 80
directed acyclic graph (DAG), 50, 633 Elias ensemble, 365
convex duality, 73
directed acyclic graphs (DAGs), 179, Elias’ extractor, 168
convexity, 91
180 Elias, P., 167, 270
finite partitions, 70
directed graph, 189 Elliott, E. O., 111
geodesic, 306, 333, 335
directed information, 446 EM algorithm, 77, 104
Hellinger, see Hellinger
Dirichlet prior, 662, 665 convex, 103
distance117
disintegration of probability measure, empirical Bayes, 563
Jeffreys, 135
29 empirical distribution, 137
Jensen-Shannon, 117, 133, 149
dispersion, 379 empirical mutual information, 462

empirical process, 86 ergodic theorem, 268 finite blocklength, 214, 341, 346, 417,
empirical risk, 87 Birkhoff-Khintchine, 234 460
Empirical risk minimization (ERM), maximal, 238 finite groups, 50
87 ergodicity, 232, 393, 467 finite-state machine (FSM), 172, 264
energy-per-bit, 400, 410 error correcting code Fisher defect, 143
fading channel, 416 see ECC, 342 Fisher information, 38, 140, 252, 576,
finite blocklength, 417 error floor, 423 645, 660, 661
entropic CLT, 64 error-exponents, 123, 144, 286, 335, continuity, 141
Entropic risk bound, 602 430, 454, 456, 460, 469 data processing inequality, 184
Hellinger loss, 602, 614 estimand, 558 matrix, 38, 142, 184
Hellinger loss, parametric rate, 617 functional, 580 minimum, compactly supported,
Hellinger lower bound, 618 estimation 580
KL loss, 602, 603 entropy, 138 monotonicity, 184
local Hellinger entropy, 616 Estimation better than chance of a density, 40, 151, 191, 578
sharp rate, 619 Bounded GLM, 592 variational representation, 151
TV loss, 602, 620 distribution estimation, 593 Fisher information inequality, 185
TV loss, misspecified, 621 estimation in Gaussian noise, 58 Fisher’s factorization theorem, 54
entropy, 8 estimation, discrete parameter, 55 Fisher, R., 54, 275
ant scouts, 9 estimation, information measures, 66 Fisher-Rao metric, 40, 307
as signed measure, 46 estimation-compression inequality, Fitingof, B. M., 246
axioms, 13 see online-to-batch conversion Fitingof, B. M., 463
concavity, 92 estimator, 559 Fitingof-Goppa code, 463
conditional, 10, 46, 57 Bayes, 562 flash signaling, 417
continuity, 78, 178 deterministic, 558 Fourier spectrum, 419
differential, see differential entropy, improper, 603 Fourier transform, 114, 406
48 proper, 603 fractional covering number, 160
empirical, 138 randomized, 558 fractional packing number, 161
hidden Markov model, 111 Evans and Schulman, theorem of, 627 frequentist statistics, 54
inf-representation, 24 evidence lower bound (ELBO), 76 Friedgut, E., 160
infinite, 10 exchangeable distribution, 170, 186 Friis transmission equation, 468
Kolmogorov-Sinai, 16 exchangeable event, 177 Fubini theorem, 45
Markov chains, 110 expectation maximization, see EM functional estimation, 580, 666, 668
max entropy, 99, 175 algorithm
Rényi, 13, 57 exponential family, 310
thermodynamic, 14 natural parameter, 310
Gács-Körner information, 338
Venn diagram, 46 standard (one-parameter), 298, 306
Gallager ensemble, 365
entropy characterization, 502 exponential-weights update algorithm,
Gallager’s bound, 360
entropy estimation, 584 271
Gallager, R., 360, 431
large alphabet, 584
Galois field, 219
small alphabet, 584
Fano’s inequality, 41, 57, 179, 664 game of 20 Questions, 13
entropy method, 158
tensorization, 112 game of guessing, 55
entropy power, 64
Fano, R., 41 Gastpar conditions, 521
entropy power inequality, 41, 64
Fatou’s lemma, 38, 99, 142 Gastpar, M., 521
Costa’s, 64
Feder, M., 260 Gaussian CEO problem, 657
Lieb’s, 64
Feinstein’s lemma, 357, 397 Gaussian comparison, 531
entropy rate, 109, 181, 265
Feinstein, A., 357 Gaussian distribution, 23
relative, 288
Fekete’s lemma, 299 complex, 23
entropy vs conditional entropy, 46
Fenchel-Eggleston-Carathéodory Gaussian isoperimetric inequality, 541
entropy-power inequality, 191
theorem, 129 Gaussian location model, see GLM
Erdös, P., 215
Fenchel-Legendre conjugate, 73, 296 Gaussian mixture, 59, 76, 104, 134,
Erdös-Rényi graph, 185
filtration, 319 185, 619

Gaussian Orthogonal Ensemble Hamming sphere, 169, 170 weak converse, 282
(GOE), 668 Hamming weight, 158, 169, 170, 175
Gaussian width, 530 Han, T. S., 475
I-MMSE, 59
Gelfand, I.M., 70 Harremoës, P., 128
identity
Gelfand-Pinsker problem, 468 Haussler-Opper estimate, 188
de Bruijn’s, 191
Gelfand-Yaglom-Perez HCR lower bound
Ihara, S., 401, 419
characterization, 70, 121 Hellinger-based, 661
independence, 50, 55
generalization bounds, 87 multivariate, 576
individual (one-step) risk, 611
generalization error, 88, 187 Hellinger distance, 116, 117, 124, 131,
individual sequence, 248, 259
generalization risk, 87 133, 153, 182, 289, 302, 315, 631,
inequality
generative adversarial networks 661
Bennett’s, 301
(GANs), 149, 602 sup-characterization, 148
Bernstein’s, 301
Gibbs, 15 location family, 142
Brunn-Minkowski, 573
Gibbs distribution, 100, 242, 336 tensorization, 124
de Caen’s, 433
Gibbs sampler, 87, 187 Hellinger entropy
entropy-power, see entropy power
Gibbs variational principle, 74, 83, bounds on KL covering number,
inequality, 191
178 189, 610
Fano’s, see Fano’s inequality
Gilbert, E. N., 111, 527 covering number, 614
Han’s, 17, 28, 160
Gilbert-Elliott HMM, 111, 181 local covering number, 616
Hoeffding’s, 86, 334
Gilbert-Varshamov bound, 433, 527, local packing number, 618
Jensen’s, 11
666 packing number, 609
log-Sobolev, see log-Sobolev
Gilmer’s method, 189 Hessian, 60, 251
inequality (LSI), 191
Ginibre ensemble, 176 Hewitt-Savage 0-1 law, 177
Loomis-Whitney, 164
GLM, 140, 560, 666, 667 hidden Markov model (HMM), 110
non-Shannon, 17
golden formula, 67, 97, 466 high-dimensional probability, 83
Okamoto’s, 301
Goppa, V., 463 Hilbert’s 13th problem, 522, 538
Pinsker’s, see Pinsker’s inequality
graph coloring, 643, 670 Hoeffding’s lemma, 86, 88, 334
Shearer’s, 18
graph partitioning, 190 Huber, P. J., 151, 324, 613
Stam’s, 185
graphical model Huffman algorithm, 209
Tao’s, 58, 127, 190
directed, 447 hypothesis testing, 122
transportation, 656
graphical models accuracy, precision, recall, 277
van Trees, 645
d-connected, 51 asymptotics, 286
Young-Fenchel, 73
d-separation, 51, 445 Bayesian, 277, 314, 330
inf-convolution, 547
collider, 51 Chernoff’s regime, 314
information bottleneck, 177, 549
directed, 41, 50, 69, 179, 180, 633 communication constraints, 325
information density, 351
non-collider, 51 composite, 289, see composite
conditioning-unconditioning, 352
undirected, 74, 648 hypothesis testing
information distillation, 190
Gross, L., 65 error-exponent, 123, 144
information flow, 69, 447
Guerra interpolation, 63 error-exponents, 289, 314
information geometry, 40, 307
Gutman, M., 260 goodness-of-fit, 275, 325
information inequality, 24
independence testing, 326
information percolation
likelihood ratio test (LRT), 280, 424
Haar measure, 381, 543, 551 directed, 627, 635
null hypothesis, 275
Hamiltonian dynamical system, 231 undirected, 650
power, 277
Hammersley, J. M., 575 information projection, 91, 178
robust, 324, 338
Hammersley-Chapman-Robbins lower definition, 302
ROC curve, 276
bound, see HCR lower bound Gaussian, 336
sequential, 319
Hamming ball, 159 marginals, 331
SPRT, 320
Hamming bound, 527 Pythagorean theorem, 303
Stein’s exponent, 286
Hamming code, 362 information radius, 91, 96
strong converse, 283
Hamming distance, 112, 123, 389, 464 information stability, 386, 393, 396,
type-I, type-II error, 277
Hamming space, 219, 347 399, 464

information tails, 285, 353 Kraft inequality, 207 compact support, 142
Ingster-Suslina formula, 185 Krein-Milman theorem, 130 location parameter, 40
integer programming, 209 Krichevsky, R. E., 253 log MGF, 293
integral probability metrics, 123 Krichevsky-Trofimov algorithm, 253, properties, 293
interactive communication, 182, 658 269 log-concave distribution, 573
intersymbol interference (ISI), 406 Krichevsky-Trofimov estimator, 662 log-likelihood ratio, 280
interval censoring, 668 Kronecker Lemma, 388 log-Sobolev inequality (LSI), 65, 132,
Ising model, 74, 191, 642 191
log-Sobolev inequality, modified
Laplace method, 251
James-Stein estimator, 561 (MLSI), 641
Laplace’s law of succession, 253
Jeffreys prior, 252 Loomis, L. H., 164
Laplacian, 59
Joint entropy, 9 loss function, 559
large deviations, 35, 290, 291, 299
joint range, 115, 127 batch, 611
Gaussian, 332
χ2 vs TV, 665 cross-entropy, see log-loss
multiplicative deviation, 301
χ2 vs TV, 132 cumulative, 259
non-iid, 332
Harremoës-Vajda theorem, 128 log-loss, 24, 259, 548, 664
on the boundary, 332
Hellinger and TV, 124 quadratic, 561, 575
rate function, 296
Hellinger vs TV, 131 separable, 569
large deviations theory, 159
Jensen-Shannon vs TV, 133 test, 87
large language models, 110, 257
KL vs χ2 , 133 loss-functions
law of large numbers, 202
KL vs Hellinger, 132, 189, 302 log-loss, 331
strong, 235
KL vs TV, 131, 665 low-density parity check (LDPC), 346
Le Cam distance, 117
Le Cam and Hellinger, 133 lower semicontinuity, 148
Le Cam lemma, 146
Le Cam and Jensen-Shannon, 133 Le Cam’s method, 666
joint source-channel coding, see JSCC Le Cam’s two-point method, 594 Mandelbrot, B., 203
joint type, 462 looseness in high dimensions, 596 Markov approximation, 235
joint typicality, 228, 355, 499 Le Cam, L., 614 Markov chain, 179–181, 232, 265
JSCC, 391 Le Cam-Birgé’s estimator, 614 finite order, 235
ergodic source, 393 least favorable pair, 338 Markov chains, 110, 464
graceful degradation, 520 least favorable prior, 564 k-th order, 110
lossless, 392 non-uniqueness, 663 ergodic, 266
lossy, 515 Lempel-Ziv algorithm, 263 finite order, 247
lossy, achievability, 516 less noisy channel, 646 mixing, 641, 669, 671
lossy, converse, 515 Lieb, E., 64 Markov kernel, 29, 42
source-channel separation, 392, 516 likelihood-ratio trick, 331 composition, 30
statistical application, 585 linear code, 218, 362 Markov lemma, 228, 327, 501
coset leaders, 364 Markov types, 174
Körner, J., 227 error-exponent, 433 martingale convergence theorem, 80
Kac’s lemma, 262 generator matrix, 362 Marton’s transportation inequality,
Kahn, J., 160 geometric uniformity, 363 656
Kakutani’s dichotomy, 125 parity-check matrix, 362 Massey’s directed information, 449
Kelvin, 14 syndrome decoder, 363 see directed information, 446
kernel density estimator (KDE), 136, linear programming, 368 Massey, J., 158, 446
624 linear programming duality, 161 matrix inversion lemma, 40
Kesten-Stigum bound, 670 linear regression, 271, 660 Mauer, A., 83
KL covering numbers, 603, 608 Liouville’s theorem, 231 Maurey’s empirical method, 534
Kolmogorov identities, 51 list decoding, 346 Maurey, B., 534
Kolmogorov’s 0-1 law, 83, 125, 232 Litsyn, S., 433 maximal coding, 357, 367, 397
Kolmogorov, A. N., 239, 522, 524 Lloyd’s algorithm, 480 maximal correlation, 631, 640
Kolmogorov-Sinai entropy, 239 Lloyd, S., 480 maximal sphere packing density, 527
Koshelev, V., 483 location family, 40 circle packing, 527

cubes, 527 rate-distortion function, 542 de Finetti, 186


Maximum entropy, 100 Sobolev ball, 541 divergence, 188
Gaussian, 28 volume bound, 525, 539 Hellinger distance, 146
maximum entropy, 99, 336 volume bound, sharp constant, 527 Le Cam lemma, 146
continuous uniform, 27 metric space, 78 Rényi divergence, 146
discrete uniform, 11 Milman, V., 535 MLE, 76, 91, 143, 581, 667
Gaussian, 64 min-entropy, 14, 57 suboptimality in high dimensions,
geometric, 175 minimax estimator 667
Hamming weight constraint, 175 non-uniqueness, 564 MMSE, 58, 562
multiple constraints, 308 minimax lower bound modulus of continuity, 668
robust version, 175 Assouad’s lemma, see Assouad’s monotone convergence theorem, 36
maximum likelihood estimation, see lemma monotonicity of information, 33, 42,
MLE asymptotic, 580 118
maximum posterior (MAP), 57 better than chance, 592 more capable channel, 646
McAllester, D. A., 83 Fano’s method, 599 MRRW (JPL) bound, 432
McMillan, B., 234 Le Cam’s two-point method, see Le MRRW bound, 528
mean-field approximation, 74 Cam’s two-point method Mrs. Gerber’s lemma (MGL), 191,
measure preserving transformation, mutual information method, see 326
230 mutual information method multinomial coefficient, 16
memoryless channel, 106 minimax rate, 563 multipath diversity, 403
memoryless source, 106, 110, 200 minimax risk mutual information, 41
Mercer’s theorem, 540 binomial model, 663 inf-characterization, 68, 103
Merhav, N., 260 covariance model, 667 sup-characterization, 69, 103
meta-converse, 57, 348, 368, 428, 435 exact asymptotics, 575 concavity, 92
minimax, 429 GLM, bounded means, 587 continuity, 81
method of types, 15, 168, 174, 425, GLM, estimating the maximum, convexity, 92
462, 463 666 finite partitions, 83
metric entropy, 159, 178, 189, 523 GLM, non-quadratic loss, 570, 571 Gaussian, 47
ℓq -covering of ℓp ball, 552 GLM, quadratic loss, 564, 568, 587 lower semicontinuity, 82, 181
ℓ2 -covering of ℓ1 ball, 529 GLM, with sparse mean, 589, 666 monotone limits, 83
ℓq -covering of ℓp ball, 530 linear regression, estimation, 660 permutation invariance, 52
ϵ-capacity, 523 linear regression, prediction, 661 saddle point, 94
ϵ-entropy, 523 multinomial model, 664 single-letterization, 106
convex null, 534 nonparametric location model, 665 stochastic processes, 113
covering number, 523, 524, 535 minimax theorem, 565, 566 variational characterization, 82
covering of Hamming space, 528, counterexample, 565 Venn diagram, 46
551 duality, 565 Mutual information method
duality, 535 minimum distance, 432 via Shannon lower bound, 586
entropy numbers, 523 minimum distance estimator, 620 mutual information method, 664
finite-dimensional balls and spheres, minimum mean-square error mutual information vs entropy, 44
526 see MMSE, 58
finiteness, 523 Minkowski dimension, 527
Newtonian mechanics, 242
global to local, 618 Minkowski inequality, 65
Neyman, J., 275
Hölder class, 538 mirror descent, 92
Neyman-Pearson lemma, 275, 284,
Hilbert balls, 533 MIT, 41
424
Lipschitz class, 536 mixing distribution, 59
Neyman-Pearson region, 278, 348,
local to global, 616 mixing process, 232, 268
424
monotonicity, 523 strongly mixing, 232
noisy gates, 626
packing number, 401, 523, 524 weakly mixing, 232
non-Gaussianness, 401
packing of Hamming space, 527 mixture models
normal approximation, 214
packing of Hamming sphere, 528 χ2 , 185
channel coding, 439, 468

compression, 202 Prékopa’s theorem, 573 rate region, 225, 227


normal tail bound, 551 Prékopa, A., 573 rate-distortion function, 91, 103, 485
Nyquist sampling, 420 predictive density estimator, 605 Bernoulli, 507
prefix code, 206 discrete uniform, 546
online learning, 245 probabilistic method, see random erasure metric, 546
online-to-batch conversion, 258, 611 coding, 215, 353 Erokhin, 547
open problem, 111, 208, 248, 380, probability of bit error, 389 Gaussian, 509, 550
413, 432, 433, 452, 521 probability of error, 57 Haar measure, 544
operator-convex function, 631 probability preserving transformation, information rate-distortion function,
optimal transport, 656 230 487
oracle inequality, 260 probably approximately correct (PAC), non-Gaussian, 511
order statistic, 178 83 product source, 546
Ornstein’s distance, 106, 112 prolate spheroidal functions, 419 properties, 486
overfitting, 87 pulse-amplitude modulation, 459 single-letterization, 487
pulse-coded modulation (PCM), 477 uniform on sphere, 543
pulse-position modulation (PPM), 413 worst-case, 542, 551
PAC-Bayes, 188
rate-distortion theory
Panter, P., 481
quantization, 477 asymptotics, 491
Panter-Dite approximation, 482
entropy constrained, 483 average distortion, 485
parameter space, 558
optimal, 480 convergence rate for average
parametric family, 38, 140
optimal asymptotics, 481 distortion, 546
location family
scalar non-uniform, 479 excess distortion, 485
see location family, 40
scalar uniform, 477 excess-to-average, 488
multi-dimensional, 142
variable rate, 483 general converse, 485
regular, 140
zero-rate, 548 max distortion, 485
smooth, 252
multiple distortion metrics, 547
parametrized family, see parametric
output constraint, 547
family Rényi divergence, 126, 144, 182, 188,
random coding bound, average
Pearson, E., 275 307, 504
distortion, 493
Pearson, K., 275 convexity, 145
random coding bound, excess
percolation, 635 tensorization, 145
distortion, 496
Peres’ extractor, 169 Rényi entropy, 13, 57, 145
Rayleigh fading, 416
Peres, Y., 169 Rényi mutual information, 504
redundancy, 179, 247, 261, 269
perfect matchings, 161 Rényi, A., 13
average-case minimax, 249
permanent, 161 Rademacher complexity, 86
worst-case minimax, 249
perspective function, 92 Radhakrishnan, J., 161
Reed-Muller code, 345, 364
phase transition, 643 radius of a set, 96
Reeves, A., 477
BBP, 651 Radon-Nikodym derivative, 21, 115,
regression, 190
Pinsker’s inequality, 23, 98, 126, 131, 322, 351
regret, 247, 256, 260
187 Radon-Nikodym Theorem, 31
finite-state machines, 264
reverse, 132 random coding, 161, 215, 353, 368,
supervised learning, 270
planted dense subgraph model, 591 461, 493
regular measures, 72
plug-in estimator, 138, 666 expurgation, 432, 469
regularization term, 88
Poincaré recurrence, 231 random matrix theory, 176, 552, 651,
relative density, 21
pointwise mutual information (PMI), 668
reliable computation, 626
351 random number generator, 166
repetition code, 345
polar codes, 341, 346 random transformation, see Markov
reproducing kernel Hilbert space
Polish space, 72 kernel
(RKHS), 539
positive predictive value (PPV), 277 random walk, 181, 266, 321, 444
reverse Pinsker’s inequality, 132
power allocation, 402 randomness extractor, 166
Riemannian metric, 40
power constraint, 394 rank-frequency plot, 203
risk, 559
power spectral density, 405 Rao, C. R., 576

average, 562 simplex conjecture, 413 power spectral density, 114, 233,
Bayes, 562 Sinai’s generator theorem, 240 405
minimax, 563 Sinai, Y., 239 spectral measure, 233
Robbins, H., 575 single-letterization, 106 stationary process, 109, 230
run-length encoding, 266 singular value decomposition (SVD), statistical experiment, see statistical
176 model
singular values, 640 statistical learning theory, 83, 87
saddle point, 94, 176
Sinkhorn’s algorithm, 105, 311 statistical model, 558
Gaussian, 100, 107, 466
site percolation threshold, 637 nonparametric, 560
sample complexity, 568
SLB, see Shannon lower bound parametric, 560
sample covariance matrix, 85, 667
Slepian, D., 223, 531 Stavskaya automata, 637
sampling without replacement, 186
Slepian-Wolf theorem, 223, 225, 228 Stein’s lemma, 287, 415, 470
Sanov’s theorem, 307, 334
small subgraph conditioning, 186 Stirling’s approximation, 16, 162, 174,
Sanov, I. N., 307
small-ball probability, 539 260, 588
score function, 152
Brownian motion, 553 stochastic block model, see
parametrized family, 39
finite dimensions, 332 community detection, 642
SDPI, 53, 328, 626, 629, 669, 671
Smooth density estimation, 622 stochastic domination, 338
χ2 , 640
L2 loss, 622 stochastic localization, 58, 191
BSC, 633
Hellinger loss, 625 stopping time of a filtration, 319, 455
erasure channels, 639
KL loss, 625 strong converse, 283, 374, 422, 470
joint Gaussian, 640
TV loss, 624 failure, 427, 465
post-processing, 653
soft-covering lemma, 137, 505 strong data-processing inequality, see
tensorization, 638
Solomjak, M., 538 SDPI
self-normalizing sums, 302
source subadditivity of information, 135
separable cost-constraint, 395
Markov, 110 subgaussian, 85, 188
sequential prediction, 245
memoryless, see memoryless source subgraph counts, 160
sequential probability ratio test
mixed, 110, 265 submodular function, 16, 27, 367
(SPRT), 320
source coding Sudakov’s minoration, 530
Shannon
see compression, 197 sufficient statistic, 41, 54, 178, 180,
boolean circuits, construction of,
source-coding 282, 363
626
noisy, 548 supervised learning, 257, 270
Shannon entropy, 8
remote, 548 support, 3, 309
Shannon lower bound, 511, 586, 588
space-time coding, 409 symmetric KL-information
arbitrary norm, 511, 550
sparse estimation, 589, 666 see f-information
quadratic distortion, 511
sparse-graph codes, 341, 346 symmetric KL, 135
Shannon’s channel coding theorem,
sparsity, 666 symmetry group, 381
345, 377
spatial diversity, 403 system identification, 660
Shannon’s rate-distortion theorem,
spectral gap, 641, 669 Szarek, S. J., 527, 535
491
spectral independence, 641, 671 Szegö’s theorem, 114, 406
Shannon’s source coding theorem,
spectral measure, 233 Szemerédi regularity lemma, 127, 190
202, 214
spiked Wigner model, 649
Shannon, C. E., 1–672
squared error, 561
Shannon-McMillan-Breiman theorem, tail σ-algebra, 83, 231
Stam, A. J., 64, 186
233 Telatar, E., 187, 409
standard Borel space, 20, 29, 30, 42,
Shawe-Taylor, J., 83 temperature, 336
43, 51
Shearer’s lemma, 18, 158, 160 Tensor product of experiments, 568
stationary Gaussian processes, 114,
shift-invariant event, 231 minimax risk, 569
233
shrinkage estimator, 561 tensorization, 33, 55, 63, 106, 107,
autocovariance function, 233
Shtarkov distribution, 245, 248, 260 112, 124, 145, 636, 638, 640, 647,
B-process, 233
Shtarkov sum, 249, 260 670
ergodic, 233, 467
signal-to-noise ratio (SNR), 48 FI -curve, 338
significance testing, 275 I-projection, 331

χ2 , 145 uniquely decodable codes, 206 von Neumann, J., 166, 626
capacity, 377 unitary operator, 243 Voronoi cells, 347
capacity-cost, 397 Universal codes, 462 Vovk, V. G., 271
Hellinger, 124 universal compression, 179, 210, 270
minimax risk, 569 universal prediction, 255
test error, 87 universal probability assignment, 245, Wald, A., 319
thermodynamics, 9, 14, 410 255 Wasserstein distance, 105, 123, 151,
Thomason, A. G., 164 Urysohn’s lemma, 72 656
thresholding, 561, 589, 590 Urysohn, P. S., 72 water-filling solution, 114, 176, 341,
Tikhomirov, V. M., 522, 524 402, 406, 437, 438
tilting, 72, 297 Vajda, I., 128 waterfall plot, 423, 439
time sharing, 226 van Trees inequality, see Bayesian weak converse, 282, 348, 397
Toeplitz matrices, 114 Cramér-Rao (BCR) lower bound Whitney, H., 164
total variation, 98, 116, 122, 131, 132, van Trees, H. L., 577 Wiener process, 417
330, 629 Varadhan, S. S., 71 WiFi, 403
inf-representation, 181 varentropy, 200, 547, 584 Wigner’s semicircle law, 652
inf-representation, 122 variable-length codes, 168, 455 Williamson, R. C., 83
sup-characterization, 148 variational autoencoder (VAE), 76, Wishart matrix, 176
sup-representation, 122 602 Wolf, J., 223
training error, 87 variational inference, 74 Wozencraft ensemble, 366
training sample, 87 variational representation, 70, 71, 123, Wringing lemma, 187
147, 154 Wyner’s common information, 502
transition probability kernel, see
Markov kernel χ2 , 149 Wyner, A., 227, 502
transmit-diversity, 409 Fisher information, 151
Trofimov, V. K., 253 Hellinger distance, 148 Yaglom, A. M., 70
Tunstall code, 196 total variation, 122, 148 Yang, Y., 606
turbo codes, 346 Varshamov, R. R., 527 Yang-Barron’s estimator, 607
types, see method of types, 174 Venn diagrams, 46 Yatracos class, 621
Verdú, S., 414 Yatracos’ estimator, 620
undetectable errors, 213, 224 Verdú, S., 475 Yatracos, Y. G., 620
uniform convergence, 85, 86 Verwandlungsinhalt, 14 Young-Fenchel duality, 73
uniform integrability, 153 Vitushkin, A. G., 538
uniform quantization, 29 VLF codes, 455
uniformly integrable martingale, 80 VLFT codes, 471 Zador, P. L., 482
union-closed sets conjecture, 189 volume ratio, 525 Zipf’s law, 203
