Algorithmic Entropy

Timothy Murphy
Course MA346H
2013
Alan Mathison Turing
Table of Contents
1 The Anatomy of a Turing Machine
1.1 Formality
1.2 The Turing machine as map
1.3 The Church-Turing Thesis
2 Prefix-free codes
2.1 Domain of definition
2.2 Prefix-free sets
2.3 Prefix-free codes
2.4 Standard encodings
2.4.1 Strings
2.4.2 Natural numbers
2.4.3 Turing machines
2.4.4 Product sets
2.4.5 A second code for N
3 Universal Machines
3.1 Universal Turing machines
4 Algorithmic Entropy
4.1 The entropy of a string
4.2 Entropy of a number
4.3 Equivalent codes
4.3.1 The binary code for numbers
4.4 Joint entropy
4.5 Conditional entropy
5 The Halting Problem
5.1 Sometimes it is possible
5.2 The Halting Theorem
6 Recursive sets
6.1 Recursive sets
6.1.1 Recursive codes
6.2 Recursively enumerable sets
6.3 The main theorem
7 Kraft's Inequality and its Converse
7.1 Kraft's inequality
7.1.1 Consequences of Kraft's inequality
7.2 The converse of Kraft's inequality
7.3 Chaitin's lemma
8 A Statistical Interpretation of Algorithmic Entropy
8.1 Statistical Algorithmic Entropy
8.2 The Turing machine as random generator
9 Equivalence of the Two Entropies
10 Conditional entropy re-visited
10.1 Conditional Entropy
10.2 The last piece of the jigsaw
10.3 The easy part
10.4 The hard part
A Cardinality
A.1 Cardinality
A.1.1 Cardinal arithmetic
A.2 The Schröder-Bernstein Theorem
A.3 Cantor's Theorem
A.4 Comparability
A.4.1 The Well Ordering Theorem
Introduction
In Shannon's theory, entropy is a property of a statistical ensemble: the entropy of a message depends not just on the message itself, but also on its position as an event in a probability space. Kolmogorov and Chaitin showed (independently) that entropy is in fact an intrinsic property of an object, which can be computed (in theory at least) without reference to anything outside that object.

It is a truism of popular science that the 20th century saw 2 great scientific revolutions: Albert Einstein's General Theory of Relativity, and the Quantum Theory of Paul Dirac and others.

But as we look back at the end of the century over a longer perspective, 2 more recent innovations, Claude Shannon's Information Theory and Alan Turing's Theory of Computability, begin to loom as large as those earlier ones.

Each of these has in its own way changed our perception of the external and internal world: information theory, with its image of information as a substance flowing through channels of communication; and Turing machines as models for algorithms, and perhaps for thought itself.

Algorithmic Information Theory is the child of these 2 theories. Its basic idea is very simple. The Informational Content, or Entropy, of an object is to be measured by the size of the smallest computer program describing that object.

(By object we mean a finite object, perhaps the state of all the particles in the universe, which can be represented as computer data, that is, as a string of bits.)

Thus an object has low entropy if it can be described succinctly. This implies that the object is highly ordered (eg a million molecules all in state 1). As in Statistical Mechanics, entropy can be regarded as a measure of disorder, or randomness. High entropy corresponds to disorder; low entropy to order.

This definition of entropy throws up a number of questions.

In the first place, what do we mean by a computer program? That is answered easily enough: we take the Turing machine as our model computer,
with its input as program, since Turing showed that anything any other
computer can do one of his machines could do as well.
But even so, there still seems to be a problem: given any finite object, however complicated, we could always construct a one-off Turing machine which would output a description of the object without any input at all.
Turing comes to our rescue again. He showed that a single universal machine can emulate every other machine. We may therefore insist that this universal machine should be used in computing the entropy of an object.
In Chapters 1-3 we recall the main features of Turing's theory, which we then apply in Chapter 4 to give a precise definition of Algorithmic Entropy.
Shannon showed that a message M can be compressed until it consists of pure information, as measured by its entropy H(M). Algorithmic Information Theory turns this idea on its head: the compressibility of an object is taken as a measure of its entropy.
Chapter 1
The Anatomy of a Turing Machine

We follow Chaitin in adopting a slight variation on Turing's original machine.
Turing's model of an ideal computer was remarkably concise, largely because he used a single tape for input, output and scratch calculations.
However, Turing's work showed (paradoxically perhaps) that almost any model with unlimited memory would serve as well as his own; at least, so long as computability was the only question at issue. Once one steps beyond that (which we shall not do) into the realms of complexity or efficiency, the equivalence of different models is by no means so clear.
For our purpose, Turing's model is too concise, since we need to distinguish more clearly between input and output.
The Chaitin model that we shall follow shares the following essential features of Turing's machine:
1. The machine moves in discrete steps 0, 1, 2, . . . , which we may call moments.
2. The working part of the machine is a doubly infinite tape, divided into squares numbered by the integers. Each square contains one of the bits 0, 1. At any moment all but a finite number of the bits are 0. Mathematically, the tape contents are defined by a map
    t : Z → {0, 1},
where
    t(i) = 0 if |i| > N
for some N.
Figure 1.1: Tape with numbered squares (squares -11 to 9 shown, containing the bits 0 1 0 1 0 1 1 1 1 0 0 0 1 1 0 1 1 0 0 0 0)
3. At each moment the machine is in one of a finite number of states q_0, q_1, . . . , q_n.
State q_0 is both the initial and the final, or halting, state. The computation is complete if and when the machine re-enters this state.
We denote the set of states by
    Q = {q_0, q_1, . . . , q_n}.
4. Both the action that the machine takes at each moment, and the new state that it moves into, depend entirely on 2 factors:
(a) the state q of the machine;
(b) the bit t(0) on square 0 of the tape.
It follows that a machine T can be described by a single map
    T : Q × {0, 1} → A × Q,
where A denotes the set of possible actions.
To emphasize that at any moment the machine can only take account of the content of square 0, we imagine a scanner sitting on the tape, through which we see this square.
Turning to the differences between the Chaitin and Turing models:
1. Turing assumes that the input x to the machine is written directly onto the tape; and whatever is written on the tape if and when the machine halts is the output T(x).
In the Chaitin model, by contrast, the machine reads its input from an input string, and writes its output to an output string.
2. Correspondingly, the possible actions differ in the 2 models. In the Turing model there are just 4 actions:
    A = {noop, swap, ←, →}.
Each of these corresponds to a map 𝒯 → 𝒯 from the set 𝒯 of all tapes (ie all maps Z → {0, 1}) to itself. Thus noop leaves the content of square 0 unchanged (though of course the state may change at the same moment), swap changes the value of t(0) from 0 to 1, or 1 to 0, while ← and → correspond to the mutually inverse shifts
    (←t)(i) = t(i + 1),   (→t)(i) = t(i - 1).
Informally, we can think of ← and → as causing the scanner to move to the left or right along the tape.
In the Chaitin model we have to add 2 more possible actions:
    A = {noop, swap, ←, →, read, write}.
The action read takes the next bit b of the input string and sets t(0) = b, while write appends t(0) to the output string. (Note that once the bit b has been read, it cannot be re-read; the bit is removed from the input string. Similarly, once write has appended a bit to the output string, this cannot be deleted or modified. We might say that the input string is a read-only stack, while the output string is a write-only stack.)
To summarise, while the Turing model has in a sense no input/output, the Chaitin model has input and output ports through which it communicates with the outside world.
Why do we need this added complication? We are interested in the Turing machine as a tool for converting an input string into an output string. Unfortunately it is not clear how we translate the content of the tape into a string, or vice versa. We could specify that the string begins in square 0, but where does it end? How do we distinguish between 11 and 110?
We want to be able to measure the length of the input program or string p, since we have defined the entropy of s to be the length of the shortest p for which T(p) = s. While we could indeed define the length of an input tape in the Turing model to be (say) the distance between the first and last 1, this seems somewhat artificial. It would also mean that the input and output strings would have to end in 1s; outputs 111 and 1110 would presumably be indistinguishable.
Having said all that, one must regret to some extent departing from the standard; and it could be that the extra work involved in basing Algorithmic Information Theory on the Turing model would be a small price to pay.
However that may be, for the future we shall use the Chaitin model exclusively; and henceforth we shall always use the term Turing machine to mean the Chaitin model of the Turing machine.
1.1 Formality
Definition 1.1. We set
    B = {0, 1}.
Thus B is the set of possible bits.
We may sometimes identify B with the 2-element field F_2, but we leave that for the future.
Definition 1.2. A Turing machine T is defined by giving a finite set Q, and a map
    T : Q × B → A × Q,
where
    A = {noop, swap, ←, →, read, write}.
1.2 The Turing machine as map
We denote the set of all finite strings by 𝒮.
Definition 1.3. Suppose T is a Turing machine. Then we write
    T(p) = s,
where p, s ∈ 𝒮, if the machine, when presented with the input string p, reads it in completely and halts, having written out the string s. If for a given p there is no s such that T(p) = s then we say that T(p) is undefined, and write
    T(p) = ⊥.
Remark. Note that T must read to the end of the string p, and then halt. Thus T(p) = ⊥ (ie T(p) is undefined) if the machine halts before reaching the end of p, if it tries to read beyond the end of p, or if it never halts, eg because it gets into a loop.
Example. As a very simple example, consider the 1-state machine T (we will generally ignore the starting state q_0 when counting the number of states) defined by the following rules:
    (q_0, 0) → (read, q_1)
    (q_1, 0) → (write, q_0)
    (q_1, 1) → (read, q_1)
This machine reads from the input string until it reads a 0, at which point it outputs a 0 and halts. (Recall that q_0 is both the initial and the final, or halting, state; the machine halts if and when it re-enters state q_0.)
Thus T(p) is defined, with value 0, if and only if p is of the form
    p = 1^n 0.
Remark. By convention, if no rule is given for (q_i, b) then we take the rule to be (q_i, b) → (noop, q_0) (so that the machine halts).
In the above example, no rule is given for (q_0, 1). However, this case can never arise; for the tape is blank when the machine starts, and if it re-enters state q_0 then it halts.
It is convenient to extend the set 𝒮 of strings by setting
    𝒮⊥ = 𝒮 ∪ {⊥}.
By convention we set
    T(⊥) = ⊥
for any Turing machine T. As one might say, garbage in, garbage out.
With this convention, every Turing machine defines a map
    T : 𝒮⊥ → 𝒮⊥.
1.3 The Church-Turing Thesis
Turing asserted that any effective calculation can be implemented by a Turing machine. (Actually, Turing spoke of the Logical Computing Machine, or LCM, the term Turing machine not yet having been introduced.)
About the same time, in 1936, Church put forward a similar thesis, couched in the rather esoteric language of the lambda calculus. The two ideas were rapidly shown to be equivalent, and the term Church-Turing thesis was coined, although from our point of view it would be more accurate to call it the Turing thesis.
The thesis is a philosophical, rather than a mathematical, proposition, since the term effective is not (perhaps cannot be) defined. Turing explained on many occasions what he meant, generally taking humans as his model, eg "A man provided with paper, pencil, and rubber, and subject to strict discipline, is in effect a universal machine." or "The idea behind digital computers may be explained by saying that these machines are intended to carry out any operations which could be done by a human computer."
The relevance of the Church-Turing thesis to us is that it gives us confidence (if we believe it) that any algorithmic procedure, eg the Euclidean algorithm for computing gcd(m, n), can be implemented by a Turing machine.
It also implies that a Turing machine cannot be improved in any simple way, eg by using 2 tapes or even 100 tapes. Any function 𝒮⊥ → 𝒮⊥ that can be implemented with such a machine could equally well be implemented by a standard 1-tape Turing machine. What we shall later call the class of computable or Turing functions appears to have a natural boundary that is not easily breached.
Chapter 2
Prefix-free codes
2.1 Domain of definition
Definition 2.1. The domain of definition of the Turing machine T is the set
    Ω(T) = {p ∈ 𝒮 : T(p) ≠ ⊥}.
In other words, Ω(T) is the set of strings p for which T(p) is defined.
Recall that T(p) may be undefined in three different ways:
Incompletion: the computation may never end;
Under-run: the machine may halt before reading the entire input string p;
Over-run: there may be an attempt to read beyond the end of p.
We do not distinguish between these 3 modes of failure, writing
    T(p) = ⊥
in all three cases.
As we shall see, this means that T(p) can only be defined for a restricted range of input strings p. At first sight, this seems a serious disadvantage; one might suspect that Chaitin's modification of the Turing machine had affected its functionality. However, as we shall see, we can avoid the problem entirely by encoding the input; and this even turns out to have incidental advantages, for example extending the theory to objects other than strings.
2.2 Prefix-free sets
Definition 2.2. Suppose s, s′ ∈ 𝒮. We say that s is a prefix of s′, and we write s ≼ s′, if s is an initial segment of s′, ie if
    s′ = b_0 b_1 . . . b_n
then
    s = b_0 b_1 . . . b_r
for some r ≤ n.
Evidently
    s ≼ s′ ⟹ |s| ≤ |s′|.
Definition 2.3. A subset S ⊂ 𝒮 is said to be prefix-free if
    s ≼ s′ ⟹ s = s′
for all s, s′ ∈ S.
In other words, S is prefix-free if no string in S is a prefix of another string in S.
Theorem 2.1. The domain of definition of a Turing machine T,
    Ω(T) = {s ∈ 𝒮 : T(s) ≠ ⊥},
is prefix-free.
Proof. Suppose s′ ≼ s with s′ ≠ s. Then if T(s′) is defined, the machine must halt after reading in s′, and so it cannot read in the whole of s. Hence T(s) is undefined.
Proposition 2.1. If S ⊂ 𝒮 is prefix-free then so is every subset T ⊂ S.
Proposition 2.2. A prefix-free subset S ⊂ 𝒮 is maximal (among prefix-free subsets of 𝒮) if and only if each t ∈ 𝒮 is either a prefix of some s ∈ S or else some s ∈ S is a prefix of t.
Remark. For those of a logical turn of mind, we may observe that being prefix-free is a property of finite character, that is, a set S is prefix-free if and only if that is true of every finite subset F ⊂ S. It follows by Zorn's Lemma that each prefix-free set S is contained in a maximal prefix-free set. However, we shall make no use of this fact.
2.3 Prefix-free codes
Definition 2.4. A coding of a set X is an injective map
    γ : X → 𝒮.
The coding is said to be prefix-free if its image
    im γ ⊂ 𝒮
is prefix-free.
By encoding X, we can in effect take the elements x ∈ X as input for the computation T(γx); and by choosing a prefix-free encoding we allow the possibility that the computation may complete for all x ∈ X.
2.4 Standard encodings
It is convenient to adopt standard prefix-free encodings for some of the sets we encounter most often, for example the set N of natural numbers, or the set of Turing machines. In general, whenever we use the notation ⟨x⟩ without further explanation it refers to the standard encoding for the set in question.
2.4.1 Strings
Definition 2.5. We encode the string
    s = b_1 b_2 . . . b_n ∈ 𝒮
as
    ⟨s⟩ = 1b_1 1b_2 . . . 1b_n 0.
Thus a 1 in odd position signals that there is a string-bit to follow, while a 0 in odd position signals the end of the string.
Example. If s = 01011 then
    ⟨s⟩ = 10111011110.
If s = ε (the empty string) then
    ⟨s⟩ = 0.
Definition 2.6. We denote the length of the string s ∈ 𝒮, ie the number of bits in s, by |s|.
Evidently
    |s| = n ⟹ |⟨s⟩| = 2n + 1.
Proposition 2.3. The map
    s ↦ ⟨s⟩ : 𝒮 → 𝒮
defines a maximal prefix-free code for 𝒮.
Proof. A string is of the form ⟨s⟩ if and only if
1. it is of odd length,
2. the last bit is 0, and
3. this is the only 0 in an odd position.
The fact that ⟨s⟩ contains just one 0 in odd position, and that at the end, shows that the encoding is prefix-free.
To see that it is maximal, suppose x ∈ 𝒮 is not of the form ⟨s⟩ for any s ∈ 𝒮. We need only look at the odd bits of x. If there is no 0 in odd position then appending 0 or 00 to x (according as x is of even or odd length) will give a string of form ⟨s⟩. If there is a 0 in odd position, consider the first such. If it occurs at the end of x then x is of form ⟨s⟩, while if it does not occur at the end of x then the prefix up to this 0 is of the form ⟨s⟩ for some s.
It follows that if x is not already of the form ⟨s⟩ then it cannot be appended to the set {⟨s⟩ : s ∈ 𝒮} without destroying the prefix-free property of this set.
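As a quick illustration, the standard string code and its inverse are easy to write out. This is our own sketch (the function names encode_string and decode_string are not from the text); decode_string returns None for bit-strings that are not of the form ⟨s⟩.

def encode_string(s):
    # <s> = 1 b1 1 b2 ... 1 bn 0
    return "".join("1" + b for b in s) + "0"

def decode_string(code):
    # Inverse of encode_string; None if code is not of the form <s>.
    bits, i = [], 0
    while i < len(code):
        if code[i] == "0":
            return "".join(bits) if i == len(code) - 1 else None
        if i + 1 >= len(code):
            return None
        bits.append(code[i + 1])
        i += 2
    return None

assert encode_string("01011") == "10111011110"   # the example above
assert encode_string("") == "0"
assert decode_string("10111011110") == "01011"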
2.4.2 Natural numbers
Definition 2.7. Suppose n ∈ N. Then we define ⟨n⟩ to be the string
    ⟨n⟩ = 1 . . . 1 0   (n 1s followed by a 0).
Example.
    ⟨3⟩ = 1110,
    ⟨0⟩ = 0.
Proposition 2.4. The map
    n ↦ ⟨n⟩ : N → 𝒮
defines a maximal prefix-free code for N.
2.4.3 Turing machines
Recall that a Turing machine T is defined by a set of rules
    R : (q, b) → (a, q′).
We encode such a rule as a string ⟨R⟩, concatenating codes for q, b, a and q′, where the 6 actions a are coded by 3 bits as follows:
    noop   000
    swap   001
    ←      010
    →      011
    read   100
    write  101
So for example, the rule (1, 1) → (←, 2) is coded as
    1011010110.
Definition 2.8. Suppose the Turing machine T is specified by the rules R_1, . . . , R_n. Then we set
    ⟨T⟩ = ⟨n⟩⟨R_1⟩ . . . ⟨R_n⟩.
We do not insist that all the rules are given, adopting the convention that if no rule is given for (q, b) then the default rule
    (q, b) → (noop, q_0)
applies.
Also, we do not specify the order of the rules; so different codes may define the same machine.
Instead of prefacing the rules with the number n of rules we could equally well signal the end of the rules by giving a rule with a non-existent action, eg 111. Thus
    001110
could serve as a stop-sign.
2.4.4 Product sets
Proposition 2.5. If S and S′ are both prefix-free subsets of 𝒮 then so is
    SS′ = {ss′ : s ∈ S, s′ ∈ S′},
where ss′ denotes the concatenation of s and s′.
Proof. Suppose s_1 s_2 ≼ s_1′ s_2′, where s_1, s_1′ ∈ S and s_2, s_2′ ∈ S′. Then either (a) s_1 ≼ s_1′ with s_1 ≠ s_1′, or (b) s_1′ ≼ s_1 with s_1′ ≠ s_1, or (c) s_1 = s_1′ and either s_2 ≼ s_2′ or s_2′ ≼ s_2.
This gives a simple way of extending prefix-free codes to product-sets. For example, the set 𝒮² = 𝒮 × 𝒮 of pairs of strings can be coded by
    (s_1, s_2) ↦ ⟨s_1⟩⟨s_2⟩.
Or again (an instance we shall apply later) the set 𝒮 × N can be coded by
    (s, n) ↦ ⟨s⟩⟨n⟩.
2.4.5 A second code for N
Definition 2.9. Suppose n ∈ N. Then we set
    [n] = ⟨B(n)⟩,
where B(n) denotes the binary code for n.
Example. Take n = 6. Then
    B(n) = 110,
and so
    [6] = 1111100.
Proposition 2.6. The coding n ↦ [n] is a maximal prefix-free code for N.
The conversion from one code for N to the other is clearly algorithmic. So according to the Church-Turing thesis, there should exist Turing machines S, T that will convert each code into the other:
    S([n]) = ⟨n⟩,   T(⟨n⟩) = [n].
We construct such a machine T in Appendix ??. (We leave the construction of S to the reader . . . .) As we shall see, it may be obvious but it is not simple!
Summary
We have adopted Chaitin's model of a Turing machine. The set Ω(T) of input strings, or programs, for which such a machine T is defined constitutes a prefix-free subset of the set 𝒮 of all strings.
Chapter 3
Universal Machines

Certain Turing machines have the remarkable property that they can emulate, or imitate, all others.
3.1 Universal Turing machines
We have defined a code ⟨T⟩ for Turing machines in the last Chapter (Subsection 2.4.3).
Definition 3.1. We say that the Turing machine U is universal if
    U(⟨T⟩p) = T(p)
for every Turing machine T and every string p ∈ 𝒮.
Informally, a universal Turing machine can emulate any Turing machine.
There is another definition of a universal machine which is sometimes seen: the machine U is said to be universal if given any Turing machine T we can find a string s = s(T) such that
    U(sp) = T(p)
for all p ∈ 𝒮. Evidently a universal machine according to our definition is universal in this sense; and since we are only concerned with the existence of a universal machine it does not matter for our purposes which definition is taken.
Theorem 3.1. There exists a universal Turing machine.
We shall outline the construction of a universal machine in Chapter ??. The construction (which is rather complicated) can be divided into 4 parts:
1. We introduce a variant of Turing machines, using stacks instead of a tape.
2. We show that 2 stacks suffice, with these stacks replacing the left and right halves of the tape.
3. We show that n ≥ 2 stacks is equivalent to 2 stacks, in the sense that given an n-stack machine S_n we can always find a 2-stack machine S_2 such that
    S_2(p) = S_n(p)
for all inputs p.
4. We show that a universal machine can be implemented as a 4-stack machine, 2 of the stacks being used to store the rules of the machine being emulated.
But for the moment we merely point out that the Church-Turing thesis suggests that such a machine must exist; for it is evident that given the rules defining a machine T, and an input string p, the human computer can determine the state of the machine and the content of the tape at each moment t = 0, 1, 2, . . . , applying the appropriate rule to determine the state and tape-content at the subsequent moment.
There are many universal machines. Indeed, a universal machine U can emulate itself. So the machine V defined by
    V(p) = U(⟨U⟩p)
is also universal.
It is an interesting question to ask if there is a best universal machine, in some sense of best. One might ask for the universal machine with the minimal number of states; for there are only a finite number of Turing machines with n states, since there are only a finite number of rules
    (q_i, b) → (a, q_o)
with 0 ≤ q_i, q_o ≤ n. This question has some relevance for us, since we shall define algorithmic entropy (shortly) with respect to a given universal machine. It will follow from the fact that two universal machines U, V can emulate each other that the entropies with respect to U and V cannot differ by more than a constant C = C(U, V). But it would be nice to have a concept of absolute algorithmic entropy.
Summary
The universal machine U performs exactly like the machine T, provided the program is prefixed by the code ⟨T⟩ for the machine:
    U(⟨T⟩p) = T(p).
Chapter 4
Algorithmic Entropy

We are now in a position to give a precise definition of the Algorithmic Entropy (or Informational Content) H(s) of a string s.
4.1 The entropy of a string
Definition 4.1. The algorithmic entropy H_T(s) of a string s with respect to the Turing machine T is defined to be the length of the shortest input p which, when fed into the machine T, produces the output s:
    H_T(s) = min_{p : T(p) = s} |p|.
If there is no such p then we set H_T(s) = ∞.
Now let us choose, once and for all, a particular universal Turing machine U as our model. The actual choice, as we shall see, is irrelevant up to a constant.
Definition 4.2. The algorithmic entropy H(s) of the string s is defined to be the entropy with respect to our chosen universal machine U:
    H(s) = H_U(s).
Informational content and Kolmogorov complexity are alternative terms for algorithmic entropy, which we shall often abbreviate to entropy.
Proposition 4.1. For each string s,
    0 < H(s) < ∞.
Proof. The universal machine U cannot halt before it has read any input; for in that case its output would always be the same, and it could not possibly emulate every machine T. Thus H(s) > 0.
On the other hand, given a string s we can certainly construct a machine T which will output s, say without any input. But then U outputs s on input p = ⟨T⟩. Hence H(s) < ∞.
It is often convenient to choose a particular minimal input p for a given string s with respect to a given machine T.
Definition 4.3. Suppose H_T(s) < ∞. Then we denote by π_T(s) that input p of minimal length for which T(p) = s, choosing the first such p in lexicographical order in case of ambiguity; and we set
    π(s) = π_U(s).
We shall sometimes call π_T(s) or π(s) the minimal program for s.
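H_T(s) is not computable in general, but for a fixed machine and a bound on program length it can be found by brute force. The sketch below is our own helper (not from the text), reusing the run simulator and the example machine from the Chapter 1 sketch; it searches programs in order of length, and within each length in lexicographic order, so the program it returns is exactly π_T(s) as in Definition 4.3, provided one exists within the bound.

from itertools import product

def brute_force_H(rules, s, max_len=12):
    # H_T(s) restricted to programs of length <= max_len; returns (H_T(s), pi_T(s)).
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):      # lexicographic within each length
            p = "".join(bits)
            if run(rules, p) == s:
                return n, p
    return None                                   # no program of length <= max_len

print(brute_force_H(example, "0"))   # (1, '0'): the shortest program for '0' is p = 0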
Proposition 4.2. For each machine T,
    H(s) ≤ H_T(s) + O(1),
ie H(s) ≤ H_T(s) + C, where C = C(T) is independent of s.
Proof. Let π = π_T(s), so that
    T(π) = s,   |π| = H_T(s).
Then
    U(⟨T⟩π) = T(π) = s,
where ⟨T⟩ is the code for T. Hence
    H(s) ≤ |⟨T⟩| + |π|
         = |⟨T⟩| + H_T(s)
         = H_T(s) + C.
Proposition 4.3. If V is a second universal machine, then
    H_V(s) = H(s) + O(1).
Proof. By the last Proposition,
    H(s) ≤ H_V(s) + O(1).
But by the same argument, taking V as our chosen universal machine,
    H_V(s) ≤ H(s) + O(1).
4.2 Entropy of a number
The concept of algorithmic entropy extends to any object that can be encoded as a string.
Definition 4.4. Suppose γ : X → 𝒮 is a coding for objects in the set X. Then we define the entropy of x ∈ X with respect to this encoding as
    H_γ(x) = H(γ(x)).
Where we have defined a standard encoding for a set, we naturally use that to define the entropy of an object in the set.
Definition 4.5. The algorithmic entropy of the natural number n ∈ N is
    H(n) = H(⟨n⟩).
It is worth noting that the entropy of a string is unchanged (up to a constant) by encoding.
Proposition 4.4. We have
    H(⟨s⟩) = H(s) + O(1).
Proof. It is easy to modify U (we need only consider write statements) to construct a machine X which outputs ⟨s⟩ when U outputs s, ie
    U(π) = s ⟹ X(π) = ⟨s⟩.
If now π = π(s) then
    H(⟨s⟩) ≤ |π| + |⟨X⟩|
           = H(s) + |⟨X⟩|
           = H(s) + O(1).
Conversely, we can equally easily modify U to give a machine Y which outputs s when U outputs ⟨s⟩; and then by the same argument
    H(s) ≤ H(⟨s⟩) + O(1).
4.3 Equivalent codes
As we have said, given a coding γ of a set X we can define the algorithmic entropy H_γ(x) of the elements x ∈ X. But what if we choose a different encoding?
The Church-Turing thesis (Section 1.3) allows us to hope that H_γ(x) will be independent of the coding chosen for X, provided we stick to effective codes.
The following definition should clarify this idea.
Definition 4.6. The two encodings
    γ, δ : X → 𝒮
are said to be Turing-equivalent if there exist Turing machines S, T such that
    S(γx) = δx,   T(δx) = γx
for all x ∈ X.
Proposition 4.5. If γ, δ are Turing-equivalent encodings for X then
    H_γ(x) = H_δ(x) + O(1).
Proof. Suppose
    S(γx) = δx,   T(δx) = γx.
Let
    π = π(γx),
so that
    U(π) = γx.
We can construct a machine X which starts by emulating U, except that it saves the output rather than writing it out. Then, when U has completed its computation, X emulates the machine S, taking the saved string as input. Thus
    X(s′) = S(U(s′)).
In particular,
    X(π) = S(γx) = δx,
and so
    H_δ(x) ≤ |π| + |⟨X⟩|
           = H_γ(x) + O(1);
and similarly in the other direction.
4.3.1 The binary code for numbers
As an illustration of these ideas, recall that in Section 2.4.5 we introduced a second binary encoding
    [n] = ⟨B(n)⟩
for the natural numbers.
Proposition 4.6. The binary code [n] for the natural numbers is Turing-equivalent to the standard code ⟨n⟩.
Proof. We have to construct Turing machines S, T such that
    S([n]) = ⟨n⟩,   T(⟨n⟩) = [n].
The construction of a suitable machine T is detailed in Appendix ??. The (simpler) construction of S is left as an exercise to the reader.
Corollary 4.1. The algorithmic entropies with respect to the two encodings of N are the same, up to a constant:
    H([n]) = H(⟨n⟩) + O(1).
Thus it is open to us to use the binary encoding or the standard encoding, as we prefer, when computing H(n).
4.4 Joint entropy
We end this chapter by considering some basic properties of algorithmic entropy.
Definition 4.7. The joint entropy H(s, t) of 2 strings s and t is defined to be
    H(s, t) = H(⟨s⟩⟨t⟩).
Note that H(s, t) measures all the information in s and t; for we can recover s and t from ⟨s⟩⟨t⟩. We could not set H(s, t) = H(st), ie we cannot simply concatenate s and t, since this would lose the information of where the split occurred.
Example. Suppose
    s = 1011,   t = 010.
Then H(s, t) = H(⟨s⟩⟨t⟩), where
    ⟨s⟩⟨t⟩ = 1110111101011100.
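A quick check with the encode_string helper from our Chapter 2 sketch reproduces this pair code directly from Definition 4.7.

s, t = "1011", "010"
pair = encode_string(s) + encode_string(t)    # <s><t>
print(pair)                                   # 1110111101011100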
Proposition 4.7. Joint entropy is independent of order:
    H(t, s) = H(s, t) + O(1).
Proof. We can construct a Turing machine T which converts ⟨s⟩⟨t⟩ into ⟨t⟩⟨s⟩ for any strings s and t.
Now suppose p is a minimal program for ⟨s⟩⟨t⟩: U(p) = ⟨s⟩⟨t⟩. Let the machine M run U and then T, so that M(p) = T(U(p)).
Then M(p) = ⟨t⟩⟨s⟩ and so
    H(⟨t⟩⟨s⟩) ≤ H_M(⟨t⟩⟨s⟩) + O(1)
             ≤ |p| + O(1)
             = H(⟨s⟩⟨t⟩) + O(1).
4.5 Conditional entropy
Informally, the conditional entropy H(s | t) measures the additional information contained in the string s if we already know the string t.
But what do we mean by knowing t?
In the context of algorithmic information theory it is natural to interpret this to mean that we are given the minimal program
    π = π(t) = π_U(t)
for t.
We would like to define H(s | t) as the length of the shortest string q which when appended to π gives us a program for s:
    U(πq) = s.
Unfortunately, there is a flaw in this idea. If U(π) is defined then U(πq) cannot be, for U will already have halted on reading the string π.
To circumvent this obstacle, let us recall that
    H(s) ≤ H_T(s) + |⟨T⟩|.
Thus if we set
    H*(s) = min_T (H_T(s) + |⟨T⟩|),
then on the one hand
    H(s) ≤ H*(s),
while on the other hand
    H*(s) ≤ H_U(s) + |⟨U⟩|
          = H(s) + |⟨U⟩|.
Putting these together,
    H*(s) = H(s) + O(1).
So H*(s) would serve in place of H(s) as a measure of absolute entropy.
This trick suggests the following definition of conditional entropy.
Definition 4.8. Suppose s, t ∈ 𝒮. Let π = π(t) be the minimal program for t (with respect to the universal machine U):
    U(π) = t,   |π| = H(t).
Then we define the conditional entropy of s given t to be
    H(s | t) = min_{T, q : T(πq) = s} (|q| + |⟨T⟩|).
Proposition 4.8. Conditional entropy does not exceed absolute entropy:
    H(s | t) ≤ H(s) + O(1).
Proof. Let σ be the minimal program for s:
    U(σ) = s,   |σ| = H(s).
We can construct a machine T such that
    T(pq) = U(q)
for any p ∈ Ω(U).
The machine T starts by imitating U silently and cleanly, ie without output, and in such a way that the internal tape is cleared after use. Then, when U would halt, T imitates U again, but this time in earnest.
In particular,
    T(πσ) = U(σ) = s,
and so
    H(s | t) ≤ |σ| + |⟨T⟩|
             = H(s) + |⟨T⟩|
             = H(s) + O(1).
Corollary 4.1. H(s | t) < ∞.
Proposition 4.9.
    H(s | s) = O(1).
Proof. In this case we can use the universal machine U itself to output s, taking q = ε, the empty string:
    U(πε) = U(π) = s,
and so
    H(s | s) ≤ 0 + |⟨U⟩|
             = O(1).
Summary
We have defined the entropy H(s) of a string s, as well as the conditional entropy H(s | t) and the joint entropy H(s, t) of two strings s and t. It remains to justify this terminology, by showing in particular that
    H(s, t) = H(s) + H(t | s) + O(1).
Chapter 5
The Halting Problem

In general it is impossible to determine for which input strings p a Turing machine T will complete its computation and halt. The proof of this is reminiscent of the diagonal argument used in the proof of Cantor's Theorem on cardinal numbers, which states that the number of elements of a set X is strictly less than the number of subsets of X.
5.1 Sometimes it is possible
It is easy to say when some Turing machines will halt. Consider for example our adding machine
    T : ⟨m⟩⟨n⟩ ↦ ⟨m + n⟩.
We know that, for any Turing machine T, the input strings p for which T(p) is defined form a prefix-free set.
But it is easy to see that the set
    S = {p = ⟨m⟩⟨n⟩ : m, n ∈ N}
is not only prefix-free, but is actually a maximal prefix-free set, that is, if any further string is added to S it will cease to be prefix-free. It follows that for our adding machine, T(p) is defined precisely when p ∈ S.
Moreover, it is easy to construct a Turing machine H which recognises S, ie such that
    H(p) = 1 if p ∈ S,
           0 if p ∉ S.
This argument applies to any Turing machine
    T : ⟨n⟩ ↦ ⟨f(n)⟩,
where f : N → N is a computable function.
Perhaps surprisingly, there are Turing machines to which it does not apply.
5.2 The Halting Theorem
Proposition 5.1. There does not exist a Turing machine H such that
    H(⟨T⟩p) = 1 if T(p) is defined,
              0 if T(p) is undefined,
for every Turing machine T and every string p.
Proof. Suppose such a Turing machine H exists. (We might call it a universal halting machine.)
Let us set p = ⟨T⟩. (In other words, we are going to feed T with its own code ⟨T⟩.) Then
    H(⟨T⟩⟨T⟩) = 1 if T(⟨T⟩) is defined,
                0 if T(⟨T⟩) is undefined.
Now let us modify H to construct a doubling machine D such that
    D(⟨T⟩) = H(⟨T⟩⟨T⟩)
for any machine T.
Thus D reads the input, expecting the code ⟨T⟩ for a Turing machine. It doesn't really matter what D does if the input is not of this form; we may suppose that it either tries to read past the end of the input, or else goes into an infinite loop. But if the input is of the form ⟨T⟩ then D doubles it, writing first ⟨T⟩ on the tape, followed by the code ⟨T⟩ again. Then it emulates H, taking this as input.
Thus
    D(⟨T⟩) = H(⟨T⟩⟨T⟩) = 1 if T(⟨T⟩) is defined,
                         0 if T(⟨T⟩) is undefined.
Finally, we modify D (at the output stage) to construct a machine X which outputs 0 if D outputs 0, but which goes into an infinite loop if D outputs 1. Thus
    X(s) = ⊥ if D(s) = 1,
           0 if D(s) = 0.
(We don't care what X does if D(s) is not 0 or 1.) Then
    X(⟨T⟩) = ⊥ if T(⟨T⟩) ≠ ⊥,
             0 if T(⟨T⟩) = ⊥.
This holds for all Turing machines T. In particular it holds for the machine X itself:
    X(⟨X⟩) = ⊥ if X(⟨X⟩) ≠ ⊥,
             0 if X(⟨X⟩) = ⊥.
This leads to a contradiction, whether X(⟨X⟩) is defined or not. We conclude that there cannot exist a halting machine H of this kind.
We improve the result above by showing that in general there does not exist a halting machine even for a single Turing machine.
Theorem 5.1. Suppose U is a universal machine. Then there does not exist a Turing machine H such that
    H(p) = 1 if U(p) is defined,
           0 if U(p) is undefined.
Proof. Substituting ⟨X⟩p for p, we have
    H(⟨X⟩p) = 1 if X(p) is defined,
              0 if X(p) is undefined.
Evidently we can construct a slight variant H′ of H which starts by decoding the input ⟨X⟩p and then acts like H, so that
    H′(⟨X⟩p) = 1 if X(p) is defined,
               0 if X(p) is undefined.
But we saw in the Proposition above that such a machine cannot exist.
Chapter 6
Recursive sets

A set of strings is recursive if it can be recognised by a computer. Recursive sets offer an alternative approach to computability. The concept of recursive enumerability is more subtle, and links up with the Halting Problem.
6.1 Recursive sets
Definition 6.1. A set of strings S ⊂ 𝒮 is said to be recursive if there exists a Turing machine A such that
    A(⟨s⟩) = 1 if s ∈ S,
             0 if s ∉ S.
We say that A is an acceptor for S, or that A recognises S.
Note that if A is an acceptor then A(⟨s⟩) must be defined for all s. Since the set {⟨s⟩ : s ∈ 𝒮} is a maximal prefix-free set, it follows that A(p) must be undefined for all input strings p not of the form ⟨s⟩.
Proposition 6.1.
1. The empty set ∅ and 𝒮 are recursive.
2. If R, S ⊂ 𝒮 are recursive then so are R ∪ S and R ∩ S.
3. If S is recursive then so is its complement S̄ = 𝒮 \ S.
Proof.
1. This is trivial.
2. Suppose A, B are acceptors for R, S. Then we construct a machine C which first emulates A, and then emulates B.
More precisely, given input ⟨s⟩, C first saves the input, and then emulates A, taking ⟨s⟩ as input.
We know that A will end by outputting 0 or 1.
If A ends by outputting 0, then C outputs nothing, but instead emulates B, again with input ⟨s⟩. If A ends by outputting 1, then C outputs 1 and halts.
Evidently C accepts the union R ∪ S.
We construct a machine which accepts the intersection R ∩ S in exactly the same way, except that now it outputs 0 and halts if A outputs 0, and emulates B if A outputs 1.
3. Suppose A accepts S. Let the machine C be identical to A, except that C outputs 1 when A outputs 0, and 0 when A outputs 1. Then C accepts the complementary set S̄.
6.1.1 Recursive codes
We say that a code
    γ : X → 𝒮
for a set X is recursive if the image im(γ) ⊂ 𝒮 is recursive.
For example, the codes ⟨n⟩ and ⟨s⟩ that we have used for numbers and strings are both recursive and prefix-free (since we want to use them as input to Turing machines); and the same is true of our code ⟨T⟩ for Turing machines, and indeed all other codes we have used.
6.2 Recursively enumerable sets
Definition 6.2. The set S ⊂ 𝒮 is said to be recursively enumerable if there exists a Turing machine T such that
    s ∈ S ⟺ s = T(p) for some p ∈ 𝒮.
We will say in such a case that the machine T outputs S.
Proposition 6.2.
1. A recursive set is recursively enumerable.
2. A set S ⊂ 𝒮 is recursive if and only if S and its complement 𝒮 \ S are both recursively enumerable.
Proof.
1. Suppose S is recursive. Let A be an acceptor for S. Then a slight modification A′ of A will output S. Thus given an input string ⟨s⟩, A′ first saves s and then emulates A, taking ⟨s⟩ as input. If A concludes by outputting 1 then A′ outputs s; while if A concludes by outputting 0 then A′ goes into an infinite loop.
2. If S is recursive then so is its complement S̄ = 𝒮 \ S, by Proposition 6.1. So if S is recursive then both S and S̄ are recursively enumerable.
Conversely, suppose S and S̄ are recursively enumerable. Let C, D output S, S̄, respectively (always with coded input ⟨p⟩). Then we construct an acceptor A for S as follows.
Given an input string ⟨s⟩, A starts by saving s. Then A runs through a sequence of steps, which we will call Stage 1, Stage 2, . . . . At stage n, A runs through all strings p of length ≤ n, carrying out n steps in the computation of C(⟨p⟩) and then n steps in the computation of D(⟨p⟩), saving the output string in either case. If one or both computations end then the output is compared with s. If C(⟨p⟩) = s then A outputs 1 and halts; if D(⟨p⟩) = s then A outputs 0 and halts.
One or other event must happen sooner or later, since C and D together output all strings s ∈ 𝒮.
This trick is reminiscent of the proof that N × N is enumerable, where we arrange the pairs (m, n) in a 2-dimensional array, and then run down the diagonals:
    (0, 0), (0, 1), (1, 0), (0, 2), (1, 1), (2, 0), (0, 3), (1, 2), . . . .
So we will call it the diagonal trick.
It should not be confused with Cantor's entirely different and much more subtle diagonal method, used to show that #(X) < #(2^X) and in the proof of the Halting Theorem. Note that Cantor's method is used to prove that something is not possible, while the diagonal trick is a way of showing that some procedure is possible.
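In code, the diagonal trick is just a dovetailed simulation. The sketch below is our own illustration (not from the text), again using the run simulator and example machine from the Chapter 1 sketch, with the step bound of run standing in for "n steps"; it enumerates the strings output by a machine within n steps on programs of length at most n, for successive stages n.

from itertools import product

def enumerate_outputs(rules, stages=8):
    # Diagonal trick: at stage n, run the machine on every program of length <= n
    # for at most n steps, and yield each output string the first time it appears.
    seen = set()
    for n in range(stages + 1):
        for length in range(n + 1):
            for bits in product("01", repeat=length):
                out = run(rules, "".join(bits), max_steps=n)
                if out is not None and out not in seen:
                    seen.add(out)
                    yield out

print(list(enumerate_outputs(example)))   # ['0'] -- the only string this machine outputs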
Proposition 6.3.
1. ∅ and 𝒮 are recursively enumerable.
2. If R, S ⊂ 𝒮 are recursively enumerable then so are R ∪ S and R ∩ S.
Proof.
1. This follows at once from the fact that ∅ and 𝒮 are recursive.
2. Suppose C, D output R, S. In each case we use the diagonal trick; at stage n we input ⟨p⟩ for all p of length ≤ n, run C and D for n steps, and determine for which p (if any) C or D halts.
For R ∪ S we simply output C(⟨p⟩) or D(⟨p⟩) in each such case.
For R ∩ S, we check to see if C(⟨p⟩) = D(⟨p′⟩) = s for any inputs p, p′, and if there are any such we output s.
6.3 The main theorem
Theorem 6.1. There exists a set S ⊂ 𝒮 which is recursively enumerable but not recursive.
Proof. Suppose U is a universal machine. By the Halting Theorem 5.1,
    S = {p : U(p) defined}
is not recursive.
For a halting machine in this case is precisely an acceptor for S; and we saw that such a machine cannot exist.
It is easy to see that S is recursively enumerable, using the diagonal trick. At stage n we run through strings p of length ≤ n, and follow the computation of U(p) for n steps. If U(p) completes in this time we output p.
It is clear that we will output all p ∈ S sooner or later.
Chapter 7
Kraft's Inequality and its Converse

Kraft's inequality constrains entropy to increase at a certain rate. Its converse (sometimes known as Chaitin's lemma) shows that we can construct machines approaching arbitrarily close to this constraint.
7.1 Kraft's inequality
Recall that, for any Turing machine T, the set of strings
    S = {p ∈ 𝒮 : T(p) defined}
is prefix-free.
Theorem 7.1 (Kraft's Inequality). If S ⊂ 𝒮 is prefix-free then
    ∑_{s ∈ S} 2^(-|s|) ≤ 1.
Proof. To each string s = b_1 b_2 . . . b_n we associate the binary number
    B(s) = 0.b_1 b_2 . . . b_n ∈ [0, 1),
and the half-open interval
    I(s) = [B(s), B(s) + 2^(-|s|)) ⊂ [0, 1).
Lemma 1. The real numbers B(s), s ∈ 𝒮, are dense in [0, 1).
Proof. If
    x = 0.b_1 b_2 . . . ∈ [0, 1)
then
    B(b_1), B(b_1 b_2), B(b_1 b_2 b_3), . . . → x.
Recall that we write s ≼ s′ to mean that s is a prefix of s′, eg
    01101 ≼ 0110110.
Lemma 2. For any two strings s, s′ ∈ 𝒮:
1. B(s′) ∈ I(s) ⟺ s ≼ s′;
2. I(s′) ⊂ I(s) ⟺ s ≼ s′;
3. I(s), I(s′) are disjoint unless s ≼ s′ or s′ ≼ s.
Proof. 1. Let
    s = b_1 . . . b_n.
Suppose s ≼ s′, say
    s′ = b_1 . . . b_n b_{n+1} . . . b_{n+r}.
Then
    B(s) ≤ B(s′) = B(s) + 2^(-n) · 0.b_{n+1} . . . b_{n+r} < B(s) + 2^(-n) = B(s) + 2^(-|s|),
so B(s′) ∈ I(s).
Conversely, suppose s is not a prefix of s′. Then either s′ ≼ s (but s′ ≠ s); or else s, s′ differ at some point, say
    s = b_1 . . . b_{r-1} b_r b_{r+1} . . . b_n,   s′ = b_1 . . . b_{r-1} c_r c_{r+1} . . . c_m,
where b_r ≠ c_r.
If s′ ≼ s or b_r = 1, c_r = 0, then B(s′) < B(s), so B(s′) ∉ I(s).
If b_r = 0, c_r = 1, then
    B(s′) ≥ 0.b_1 . . . b_{r-1} 1 > B(s) = 0.b_1 . . . b_{r-1} 0 b_{r+1} . . . b_n.
Thus B(s) = a/2^n while B(s′) ≥ b/2^n, for integers a < b. Hence
    B(s′) ≥ B(s) + 1/2^n,
so again B(s′) ∉ I(s).
2. Suppose s ≼ s′. Then
    B(s″) ∈ I(s′) ⟹ s′ ≼ s″ ⟹ s ≼ s″ ⟹ B(s″) ∈ I(s).
It follows that
    I(s′) ⊂ I(s).
Conversely,
    I(s′) ⊂ I(s) ⟹ B(s′) ∈ I(s) ⟹ s ≼ s′.
3. If I(s), I(s′) are not disjoint then we can find s″ ∈ 𝒮 such that
    B(s″) ∈ I(s) ∩ I(s′),
so that
    s ≼ s″ and s′ ≼ s″,
which implies that
    s ≼ s′ or s′ ≼ s.
Conversely,
    s ≼ s′ ⟹ I(s′) ⊂ I(s),   s′ ≼ s ⟹ I(s) ⊂ I(s′);
and in neither case are I(s), I(s′) disjoint.
It follows from the last result that if the set of strings S ⊂ 𝒮 is prefix-free then the half-open intervals
    I(s), s ∈ S,
are disjoint; and so, since they are all contained in [0, 1),
    ∑_{s ∈ S} |I(s)| = ∑_{s ∈ S} 2^(-|s|) ≤ 1.
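The objects in this proof are easy to compute with exactly, using Python fractions. The helpers below are our own (not from the text); check_kraft simply verifies that the intervals I(s) of a prefix-free set are pairwise disjoint inside [0, 1), so that their lengths 2^(-|s|) sum to at most 1.

from fractions import Fraction
from itertools import combinations

def B(s):
    # The binary number 0.b1 b2 ... bn associated to the string s.
    return sum(Fraction(int(b), 2 ** (i + 1)) for i, b in enumerate(s))

def I(s):
    # The half-open interval [B(s), B(s) + 2^-|s|), as a pair of endpoints.
    return (B(s), B(s) + Fraction(1, 2 ** len(s)))

def check_kraft(prefix_free):
    intervals = [I(s) for s in prefix_free]
    assert all(a1 <= b0 or b1 <= a0                 # pairwise disjoint
               for (a0, a1), (b0, b1) in combinations(intervals, 2))
    total = sum(Fraction(1, 2 ** len(s)) for s in prefix_free)
    assert total <= 1                               # Kraft's inequality
    return total

print(check_kraft(["0", "10", "110", "111"]))       # 1
print(check_kraft(["00", "01", "10"]))              # 3/4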
7.1.1 Consequences of Kraft's inequality
Proposition 7.1. For each Turing machine T,
    ∑_{s ∈ 𝒮} 2^(-H_T(s)) ≤ 1.
Proof. We know that
    ∑_{p : T(p) is defined} 2^(-|p|) ≤ 1.
But each s output by T arises from a unique minimal input
    p = π_T(s)
of length |p| = H_T(s); while if s is never output, then
    H_T(s) = ∞, so that 2^(-H_T(s)) = 0.
It follows that the entropy of strings must increase sufficiently fast to ensure that
    ∑_{s ∈ 𝒮} 2^(-H(s)) ≤ 1.
Thus there cannot be more than 4 strings of entropy 2, or more than 16 strings of entropy 4; if there is one string of entropy 2 there cannot be more than 6 of entropy 3; and so on.
7.2 The converse of Kraft's inequality
Theorem 7.2. Suppose {h_i} is a set of integers such that
    ∑ 2^(-h_i) ≤ 1.
Then we can find a prefix-free set {p_i} ⊂ 𝒮 of strings such that
    |p_i| = h_i.
Moreover this can be achieved by the following strategy: the strings p_0, p_1, . . . are chosen successively, taking p_i to be the first string (in lexicographical order) of length h_i such that the set
    {p_0, p_1, . . . , p_i}
is prefix-free.
Recall that the lexicographical order of 𝒮 is
    ε < 0 < 1 < 00 < 01 < 10 < 11 < 000 < . . . ,
where ε denotes the empty string.
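The strategy in Theorem 7.2 translates directly into code. The sketch below is ours (function name and assertions included); it takes the lengths h_i in the given order and, for each, chooses the first string of that length, in lexicographic order, that keeps the chosen set prefix-free.

from itertools import product

def kraft_converse(lengths):
    assert sum(2.0 ** -h for h in lengths) <= 1, "Kraft's condition must hold"
    chosen = []
    for h in lengths:
        for bits in product("01", repeat=h):        # lexicographic within length h
            p = "".join(bits)
            if not any(p.startswith(q) or q.startswith(p) for q in chosen):
                chosen.append(p)
                break
        else:
            raise AssertionError("no room left; cannot happen under Kraft's condition")
    return chosen

print(kraft_converse([1, 2, 3, 3]))   # ['0', '10', '110', '111']
print(kraft_converse([2, 2, 2, 2]))   # ['00', '01', '10', '11']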
Proof. Suppose the strings p_0, p_1, . . . , p_{i-1} have been chosen in accordance with the above specification. The remaining space (the gaps in [0, 1))
    G = [0, 1) \ (I(p_0) ∪ I(p_1) ∪ . . . ∪ I(p_{i-1}))
is expressible as a finite union of disjoint half-open intervals I(s), say
    G = I(s_0) ∪ I(s_1) ∪ . . . ∪ I(s_j),
where
    B(s_0) < B(s_1) < . . . < B(s_j).
(This expression is unique if we agree to amalgamate any adjoining twin intervals of the form
    I(b_1 . . . b_r 0), I(b_1 . . . b_r 1)
to form the single interval
    I(b_1 . . . b_r)
of twice the length.)
Lemma 3. The intervals I(s_0), . . . , I(s_j) are strictly increasing in length, ie
    |s_0| > |s_1| > . . . > |s_j|;
and
    h_i ≥ |s_j|,
so that it is possible to add another string p_i of length h_i.
Proof. We prove the result by induction on i. Suppose it is true for the prefix-free set {p_0, . . . , p_{i-1}}.
Since the intervals I(s_k) are strictly increasing in size, each I(s_k) is at most half as large as its successor I(s_{k+1}):
    |I(s_k)| ≤ (1/2)|I(s_{k+1})|.
It follows that the total space remaining is
    < |I(s_j)| (1 + 1/2 + 1/4 + . . .) = 2|I(s_j)|.
The next interval we are to add is to have length 2^(-h_i). By hypothesis
    2^(-h_0) + . . . + 2^(-h_{i-1}) + 2^(-h_i) ≤ 1.
Thus
    2^(-h_i) ≤ 1 - 2^(-h_0) - . . . - 2^(-h_{i-1})
             = |[0, 1) \ (I(p_0) ∪ I(p_1) ∪ . . . ∪ I(p_{i-1}))|
             = |I(s_0) ∪ I(s_1) ∪ . . . ∪ I(s_j)|
             < 2|I(s_j)|.
It follows that
    2^(-h_i) ≤ |I(s_j)|,
or
    h_i ≥ |s_j|.
So we can certainly fit an interval I(p) of length 2^(-h_i) into one of our gap intervals I(s_k).
By prescription, we must take the first position available for this new interval. Let us determine where an interval of length 2^(-h_i) first fits into the sequence of strictly increasing gaps I(s_0), I(s_1), . . . . Suppose
    |I(s_{k-1})| < 2^(-h_i) ≤ |I(s_k)|.
Then I(s_k) is the first gap into which we can fit an interval I(p) of length 2^(-h_i).
If in fact
    2^(-h_i) = |I(s_k)|
then we set
    p_i = s_k.
In this case the gap is completely filled, and we continue with one fewer gap, the remaining gaps evidently satisfying the conditions of the lemma.
If however
    2^(-h_i) < |I(s_k)|
then our strategy prescribes that I(p_i) is to come at the beginning of I(s_k), ie
    p_i = s_k 0 . . . 0   (e 0s),
where
    e = h_i - |s_k|.
We note that
    I(s_k) \ I(p_i) = I(t_0) ∪ I(t_1) ∪ . . . ∪ I(t_{e-1}),
where
    t_0 = s_k 0 . . . 0 1 (e-1 0s), t_1 = s_k 0 . . . 0 1 (e-2 0s), . . . , t_{e-2} = s_k 01, t_{e-1} = s_k 1.
Thus after the addition of the new interval I(p_i) the complement
    [0, 1) \ (I(p_0) ∪ . . . ∪ I(p_i))
        = I(s_0) ∪ . . . ∪ I(s_{k-1}) ∪ I(t_0) ∪ . . . ∪ I(t_{e-1}) ∪ I(s_{k+1}) ∪ . . . ∪ I(s_j)
retains the property described in the lemma. It therefore follows by induction that this property always holds.
It follows that the strategy can be continued indefinitely, creating a prefix-free set of strings with the required properties.
7.3 Chaitin's lemma
We would like to construct a machine T so that specified strings s_0, s_1, . . . have specified entropies h_0, h_1, . . . :
    H_T(s_i) = h_i.
By Kraft's Inequality this is certainly not possible unless
    ∑_i 2^(-h_i) ≤ 1.
But suppose that is so. The converse to Kraft's inequality encourages us to believe that we should be able to construct such a machine.
But one question remains. What exactly do we mean by saying that the entropies h_i are specified? How are they specified?
If the machine T is to understand the specification, it must be in machine-readable form. In other words, we must have another machine M outputting the numbers h_i.
Theorem 7.3. Suppose
    S ⊂ 𝒮 × N
is a set of pairs (s, h_s) such that
1. the integers h_s satisfy Kraft's condition:
    ∑_{(s, h_s) ∈ S} 2^(-h_s) ≤ 1;
2. the set S is recursively enumerable.
Then there exists a Turing machine T such that
    H_T(s) ≤ h_s
for all (s, h_s) ∈ S.
Proof. By definition, there exists a machine M which generates the set S, say
    M(⟨n⟩) = (s_n, h_n) ∈ S.
Suppose we are given an input string p. We have to determine T(p). Our machine does this by successively building up a prefix-free set
    P = {p_0, p_1, p_2, . . .},
where |p_n| = h_n, according to the prescription above. As each p_n is created, it is compared with the given string p; and if p_n = p then T outputs the string s_n and halts.
If p never occurs in the prefix-free set P then T(p) is undefined.
More fully, T functions in stages 0, 1, 2, . . . . At stage n, T emulates each of M(⟨0⟩), M(⟨1⟩), . . . , M(⟨n⟩) for n steps.
If M(⟨r⟩) halts after m ≤ n steps, with
    M(⟨r⟩) = (s_r, h_r),
then T adds a further string p_i with |p_i| = h_r to the prefix-free set
    P = {p_0, p_1, . . . , p_{i-1}}
which it is building up, by following the Kraft prescription.


Summary
We have constructed a machine T with specified entropies H_T(s_i) for specified strings s_i, provided these entropies satisfy Kraft's inequality and can be recursively generated.
Chapter 8
A Statistical Interpretation of Algorithmic Entropy

We can regard a Turing machine as a kind of random generator, with a certain probability of outputting a given string. This suggests an alternative definition of the entropy of a string, more in line with the concepts of Shannon's statistical information theory. Fortunately, we are able to establish the equivalence of this new definition with our earlier one, at least up to an additive constant.
8.1 Statistical Algorithmic Entropy
Definition 8.1. The statistical algorithmic entropy h_T(s) of the string s with respect to the Turing machine T is defined by
    2^(-h_T(s)) = ∑_{p : T(p) = s} 2^(-|p|),
that is,
    h_T(s) = -lg( ∑_{p : T(p) = s} 2^(-|p|) ).
We set
    h(s) = h_U(s),
where U is our chosen universal machine.
Recall the convention that lg x denotes log_2 x; all our logarithms will be taken to base 2.
Proposition 8.1. For any Turing machine T, and any string s,
    h_T(s) ≤ H_T(s);
and in particular,
    h(s) ≤ H(s).
Proof. If T never outputs s then the sum is empty and so has value 0, giving
    h_T(s) = ∞ = H_T(s).
Otherwise, one of the programs that outputs s is the minimal program (for T), p = π_T(s). Thus
    2^(-h_T(s)) = ∑_{p : T(p) = s} 2^(-|p|) ≥ 2^(-|π_T(s)|) = 2^(-H_T(s)),
and so
    h_T(s) ≤ H_T(s).
Proposition 8.2.
    h_T(s) ≥ 0.
Proof. Since
    {p : T(p) = s} ⊂ {p : T(p) defined},
it follows that
    ∑_{p : T(p) = s} 2^(-|p|) ≤ ∑_{p : T(p) defined} 2^(-|p|) ≤ 1,
by Kraft's Inequality 7.1.
Hence
    h_T(s) = -lg ∑_{p : T(p) = s} 2^(-|p|) ≥ 0.
Proposition 8.3. For any string s ∈ 𝒮,
    h(s) < ∞.
Proof. We can certainly construct a machine T which outputs a given string s without any input. Then
    U(⟨T⟩) = s.
Hence
    H(s) ≤ |⟨T⟩| < ∞,
and a fortiori
    h(s) ≤ H(s) < ∞.
8.2 The Turing machine as random generator
Imagine the following scenario. Suppose we choose the input to T by successively tossing a coin. (We might employ a mad statistician for this purpose.) Thus if the current rule requires that T should read in an input bit, then a coin is tossed, and 1 or 0 is input according as the coin comes up heads or tails. The experiment ends if and when the machine halts.
The probability that this results in a given string s being output is
    P_T(s) = ∑_{p : T(p) = s} 2^(-|p|).
Recall that in Shannon's Information Theory, if an event e has probability p of occurring, then we regard the occurrence of e as conveying -lg p bits of information. (Thus occurrence of an unlikely event conveys more information than occurrence of an event that was more or less inevitable.)
In our case, the event is the outputting of the string s. So the information conveyed by the string is just
    h_T(s) = -lg P_T(s).
Summary
The statistical algorithmic entropy h(s) gives us an alternative measure of the informational content of a string. Fortunately we shall be able to establish that it is equivalent to our earlier measure, up to the ubiquitous constant:
    h(s) = H(s) + O(1).
Chapter 9
Equivalence of the Two Entropies

We show that the rival definitions of algorithmic entropy, H(s) and h(s), are in fact equivalent.
Theorem 9.1.
    h(s) = H(s) + O(1).
More precisely, there exists a constant C independent of s such that
    h(s) ≤ H(s) ≤ h(s) + C
for all s ∈ 𝒮.
Proof. As we saw in Proposition 8.1,
    h(s) ≤ H(s).
We must show that there exists a constant C, dependent only on our choice of universal machine U, such that
    H(s) ≤ h(s) + C
for all strings s ∈ 𝒮.
Lemma 1.
    ∑_{s ∈ 𝒮} 2^(-h_T(s)) ≤ 1.
Proof. Each p for which T(p) is defined contributes to h_T(s) for just one s. Hence
    ∑_{s ∈ 𝒮} 2^(-h_T(s)) = ∑_{s ∈ 𝒮} ( ∑_{p : T(p) = s} 2^(-|p|) ) = ∑_{p : T(p) defined} 2^(-|p|) ≤ 1,
since the set of p for which T(p) is defined is prefix-free.
Thus the numbers h(s) satisfy Kraft's Inequality. However, we cannot apply the converse as it stands, since these numbers are not in general integral. We therefore set
    h_s = [h(s)] + 1
for each string s ∈ 𝒮. (Here [x] denotes, as usual, the greatest integer ≤ x.)
Thus
    h(s) < h_s ≤ h(s) + 1.
Since
    ∑ 2^(-h_s) ≤ ∑ 2^(-h(s)) ≤ 1,
the integers h_s, or rather the set of pairs
    S = {(s, h_s)} ⊂ 𝒮 × N,
satisfy the first criterion of Chaitin's Lemma.
The Converse, if we could apply it, would allow us to construct a machine M such that
    H_M(s) ≤ h_s
for all s with h_s < ∞. It would follow from this that
    H(s) ≤ H_M(s) + |⟨M⟩|
         ≤ h_s + O(1)
         ≤ h(s) + O(1).
Unfortunately, we have no reason to suppose that the h_s are recursively enumerable. We cannot therefore apply the Converse directly, since we have not shown that its second criterion is fulfilled.
Fortunately, a nimble side-step steers us round this obstacle.
Lemma 2. Suppose T is a Turing machine. Then the set
    S′ = {(s, m) ∈ 𝒮 × N : m > h_T(s)}
is recursively enumerable.
Proof. We construct a machine M which runs as follows.
At the nth stage, M runs through all 2^(n+1) - 1 strings p of length ≤ n. For each such string p, M emulates T for n steps. If T halts within these n steps, with s = T(p), a note is made of the pair (s, |p|).
At the end of the nth stage, the accumulated total
    P*_T(s) = ∑_{|p| ≤ n : T(p) = s} 2^(-|p|)
is calculated for each string s that has appeared; and for each new integer m = m(s) for which
    P*_T(s) > 2^(-m),
the pair (s, m) is output.
(Note that as more inputs are considered, P*_T(s) is increasing, tending towards P_T(s). Thus m is decreasing, passing through the integers > h_T(s).)
Lemma 3. With the notation of the last lemma,
    ∑_{(s, m) ∈ S′} 2^(-m) ≤ 2.
Proof. As we saw in the proof of the last Lemma, the m = m(s) that arise for a given s are ≥ h_T(s). Hence their sum is
    ≤ 2^(-h_T(s)) (1 + 1/2 + 1/4 + . . .) = 2 · 2^(-h_T(s)).
Thus the sum for all s is
    ≤ 2 ∑_s 2^(-h_T(s)) ≤ 2,
by Lemma 1.
Now we can apply the Converse to the set

    S′′ = {(s, m+1) : (s, m) ∈ S′};

for we have shown in Lemma 3 that this set satisfies the first criterion (the shift from m to m+1 halves the sum, bringing it down to ≤ 1), while we saw in Lemma 2 that it is recursively enumerable.
Thus we can construct a machine M with the property that for each (s, m) ∈ S′ we can find a program p such that

    M(p) = s,  |p| ≤ m + 1.

Now take T = U. Since h_s > h(s), we have P_U(s) = 2^{-h(s)} > 2^{-h_s}, so that (s, h_s) ∈ S′. It follows that

    H_M(s) ≤ h_s + 1;

and so

    H(s) ≤ H_M(s) + |M| ≤ h_s + 1 + |M| = h_s + O(1) ≤ h(s) + O(1).
Summary
We have established that H(s) and h(s) are equivalent definitions of entropy. It is thus open to us to use whichever is more convenient for the problem in hand.
Chapter 10
Conditional entropy re-visited

Recall that in Chapter 4 we defined the conditional entropy H(s | t) of one string s given another string t. We stated, but did not prove, the fundamental identity

    H(s, t) = H(t) + H(s | t) + O(1).

Informally, this says that the information in t, together with the additional information in s, is precisely the information in both s and t. This result is the last piece of the jigsaw in the basic theory of algorithmic entropy. The proof we give now is somewhat convoluted, involving the equivalence of the two entropies, as well as a further application of the converse to Kraft's Inequality.
10.1 Conditional Entropy

Recall the definition of H(s | t). Let π = π(t) be the minimal input outputting t from our universal machine U. Then we consider the set of pairs (T, p) consisting of a Turing machine T and a string p such that

    T(πp) = s;

and we define H(s | t) to be the minimum of

    |T| + |p|.

(We cannot simply take the minimum of |p| over strings p such that

    U(πp) = s,

because U is going to halt as soon as it has read in π. So we make an indirect reference to the fact that

    H(s) ≤ H_T(s) + |T|,

since U(⟨T⟩p) = T(p).)
10.2 The last piece of the jigsaw

Theorem 10.1.

    H(s, t) = H(t) + H(s | t) + O(1).

Our proof is in two parts.

1. The easy part:

    H(s, t) ≤ H(t) + H(s | t) + O(1).

2. The harder part:

    H(s | t) ≤ H(s, t) − H(t) + C,

where C is a constant depending only on our choice of universal machine U.
10.3 The easy part

Proposition 10.1. There exists a Turing machine M such that

    H_M(s, t) ≤ H(t) + H(s | t)

for all s, t ∈ S.

Proof. Let

    π = π(t)

be the minimal program for t:

    U(π) = t,  H(t) = |π|.

By the definition of conditional entropy H(s | t), we can find a machine T and a string p such that

    T(πp) = s,  H(s | t) = |T| + |p|.

It follows that

    U(⟨T⟩πp) = T(πp) = s.

Now we construct the machine M as follows. M starts by imitating U, so that if the input string is

    q = ⟨T⟩πp

then M will compute U(q) = s. However, it does not output s, but simply records its value for later use.

But now M goes back to the beginning of the string q (which it has wisely stored) and skips the machine code ⟨T⟩.

Now M imitates U again, but this time with input πp. Since U(π) = t, U will only read in the prefix π before halting and outputting t. Our machine M does not of course output t straight away. Instead it first outputs the coded version ⟨s⟩ of the string s which it recorded earlier, and then outputs t, before halting.

In summary,

    M(⟨T⟩πp) = ⟨s⟩t.

It follows that

    H_M(s, t) ≤ |T| + |π| + |p| = |π| + (|T| + |p|) = H(t) + H(s | t).
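The two passes made by M may be easier to see in the following sketch (an informal illustration only; run_U, encode and code_length are assumed helpers of our own, not notation from the text):

    def M(q, run_U, encode, code_length):
        """Sketch of the machine M of Proposition 10.1.

        run_U(x) is assumed to return (U(x), k), where k is the number of
        bits of x that U read before halting; encode(s) is the prefix-free
        code <s> of s; code_length(q) is the length of the self-delimiting
        machine code <T> at the start of q."""
        s, _ = run_U(q)         # first pass: U(<T> pi p) = T(pi p) = s (recorded, not output)
        k = code_length(q)      # rewind and skip the machine code <T>
        t, _ = run_U(q[k:])     # second pass: U reads only the prefix pi and yields t
        return encode(s) + t    # output <s> t, the encoded pair (s, t)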
Corollary 10.1. We have

    H(s, t) ≤ H(t) + H(s | t) + O(1).

Proof. Since

    H(s, t) ≤ H_M(s, t) + |M|,

the Proposition implies that

    H(s, t) ≤ H(t) + H(s | t) + |M| = H(t) + H(s | t) + O(1).
10.4 The hard part

Proposition 10.2. For each string t ∈ S,

    Σ_{s∈S} 2^{-(h(s,t) − h(t))} ≤ C,

where C is a constant depending only on our choice of universal machine U.
Proof. We argue through a chain of lemmas.

Lemma 1. Given a machine T, there exists a machine M such that

    Σ_{s∈S} P_T(s, t) ≤ P_M(t)

(where P_T(s, t) = P_T(⟨s⟩t)).

Proof. Let the machine M start by imitating T, except that instead of outputting ⟨s⟩t, it skips the coded part ⟨s⟩ and decodes the rest: that is, as T outputs ⟨s⟩t, M outputs t, and then halts.

It follows that

    T(p) = ⟨s⟩t ⟹ M(p) = t.

Hence

    Σ_{s∈S} P_T(s, t) = Σ_{s∈S} ( Σ_{p : T(p) = ⟨s⟩t} 2^{-|p|} ) ≤ Σ_{p : M(p) = t} 2^{-|p|} = P_M(t).
Lemma 2. With the same assumptions as the last Lemma,

    Σ_{s∈S} 2^{-h_T(s,t)} ≤ 2^{-h_M(t)}.

Proof. This follows at once from the last Lemma, since

    2^{-h_T(s,t)} = P_T(s, t),  2^{-h_M(t)} = P_M(t).
Lemma 3. For any Turing machine T,

    2^{-h_T(s)} ≤ 2^{|T|} 2^{-h(s)}.

Proof. Since

    T(p) = s ⟹ U(⟨T⟩p) = s,

it follows that

    2^{-h(s)} = Σ_{q : U(q) = s} 2^{-|q|} ≥ Σ_{p : T(p) = s} 2^{-|⟨T⟩p|} = 2^{-|T|} Σ_{p : T(p) = s} 2^{-|p|} = 2^{-|T|} 2^{-h_T(s)}.
Lemma 4.

    Σ_{s∈S} 2^{-h(s,t)} ≤ 2^{-h(t) + c},

where c is a constant depending only on our choice of universal machine U.

Proof. This follows at once on taking T = U in Lemma 2, and applying Lemma 3 with T = M.

The Proposition follows on taking h(t) to the other side, with C = 2^c.
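Spelt out, the chain of inequalities is

    Σ_{s∈S} 2^{-h(s,t)} = Σ_{s∈S} 2^{-h_U(s,t)} ≤ 2^{-h_M(t)} ≤ 2^{|M|} 2^{-h(t)} = 2^{-h(t) + c},

where M is the machine obtained from T = U in Lemma 1, and c = |M| depends only on U. Dividing by 2^{-h(t)} gives the Proposition, with C = 2^c.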
Corollary 10.1.

    Σ_{s∈S} 2^{-(H(s,t) − H(t))} ≤ C′,

where C′ depends only on U.

Proof. This follows from the Proposition on applying the Equivalence Theorem 9.1.
We can now formulate our strategy. Let us fix the string t. Suppose

    π = π(t)

is the minimal program for t:

    U(π) = t,  |π| = H(t).

We can re-write Corollary 10.1 as

    Σ_{s∈S} 2^{-(H(s,t) − H(t) + c)} ≤ 1,

where c depends only on U. Thus the numbers

    h_s = H(s, t) − H(t) + c

satisfy Kraft's inequality

    Σ_{s∈S} 2^{-h_s} ≤ 1.

If these numbers were integers, and were recursively enumerable, then we could find a prefix-free set (depending on t)

    P_t = {p_{st} : s ∈ S}

such that

    |p_{st}| ≤ H(s, t) − H(t) + c.

Now let us prefix this set with π(t) = π:

    π(t)P_t = {π(t)p_{st} : s ∈ S}.

It is easy to see that the sets π(t)P_t for different t are disjoint; and their union

    P = ∪_{t∈S} π(t)P_t = {π(t)p_{st} : s, t ∈ S}

is prefix-free.

Thus we may (with luck) be able to construct a machine T such that

    T(π(t)p_{st}) = s

for each pair s, t ∈ S.

From this we would deduce that

    H(s | t) ≤ |p_{st}| + |T| ≤ H(s, t) − H(t) + c + |T| = H(s, t) − H(t) + O(1),

as required.
Now for the gory details. Our machine T starts by imitating U. Thus if the input string is

    q = πp,

where π is a string on which U halts, then T begins by reading in π and computing

    t = U(π).

However, instead of outputting t, T stores π and t for further use.

We are only interested in the case where π is the minimal program for t:

    π = π(t),  |π| = H(t).

Of course, this is not generally true. But if it is not true then we do not care whether T(q) is defined or not, or if it is defined, what value it takes. Therefore we assume in what follows that π = π(t).

By Corollary 10.1 above,

    Σ_{s∈S} 2^{-(H(s,t) − H(t) + c)} ≤ 1.

Thus if we set

    h_s = [H(s, t) − H(t) + c + 2]

then

    Σ_{s∈S} 2^{-h_s} ≤ 1/2.

As in Chapter 9, we cannot be sure that the set

    S′ = {(s, h_s) : s ∈ S} ⊂ S × N

is recursively enumerable. However, we can show that a slightly (but not too much) larger set S̃ is recursively enumerable. (This is the reason for introducing the factor 1/2 above: it allows for the difference between S̃ and S′.)
Lemma 5. Let

    S′ = {(s, h_s) : s ∈ S},  S′′ = {(s, n) : n ≥ h_s}.

There exists a recursively generated set S̃ ⊂ S × N such that

    S′ ⊂ S̃ ⊂ S′′.

Proof. We construct an auxiliary machine M which recursively generates the values U(p), together with the corresponding programs p.

As each U(p) is generated, M determines whether it is of the form ⟨s⟩t (where t is the string U(π) found above). If it is, then M checks whether the pair (s, n), where

    n = |p| − |π| + c + 2,

has already been output; if not, it outputs (s, n).

Since

    |p| ≥ H(s, t)

by definition, while by hypothesis

    |π| = H(t),

it follows that

    n ≥ H(s, t) − H(t) + c + 2 ≥ h_s,

and so (s, n) ∈ S′′. Hence

    S̃ ⊂ S′′.

On the other hand, with the particular input p = π(s, t), the minimal program for the pair ⟨s⟩t,

    |p| = H(s, t),

and so n = h_s. Thus (s, h_s) ∈ S̃, and so

    S′ ⊂ S̃.
But now

    Σ_{(s,n) ∈ S̃} 2^{-n} ≤ 1,

since each s contributes at most

    2^{-h_s} + 2^{-h_s − 1} + 2^{-h_s − 2} + ⋯ = 2 · 2^{-h_s},

and Σ_{s} 2 · 2^{-h_s} ≤ 2 · (1/2) = 1.
We can therefore follow Kraft's prescription for a prefix-free set. As each pair (s, n) is generated, T determines a string p_{sn} with

    |p_{sn}| = n,

in such a way that the set

    P = {p_{sn} : (s, n) ∈ S̃}

is prefix-free.
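Kraft's prescription can indeed be carried out effectively; the following sketch (ours, for illustration only) serves each request of length n with a codeword of exactly that length, and the codewords issued are prefix-free, provided the running sum of 2^{-n} over all requests stays ≤ 1:

    def kraft_allocator():
        """Online assignment of prefix-free codewords of requested lengths."""
        free = [""]                    # free subtrees, identified by their root strings
        def assign(n):
            # pick the deepest free subtree whose root has depth <= n
            candidates = [w for w in free if len(w) <= n]
            if not candidates:
                raise ValueError("Kraft inequality would be violated")
            w = max(candidates, key=len)
            free.remove(w)
            # walk down to depth n along 0s, returning the sibling subtrees to the pool
            while len(w) < n:
                free.append(w + "1")
                w = w + "0"
            return w
        return assign

    assign = kraft_allocator()
    print(assign(2), assign(2), assign(3), assign(3))   # 00 01 100 101, a prefix-free set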
As each string p_{sn} is generated, T checks the rest of the input string q to see whether

    q = πp_{sn}.

If this is true then the computation is complete: T outputs s and halts:

    T(πp_{sn}) = s.

One small point: in comparing p_{sn} with the input string q, T might well go past the end of q, so that T(q) is undefined. However, in that case q is certainly not of the form πp_{s′n′}, since this would imply that p_{s′n′} is a proper prefix of p_{sn}, contradicting the prefix-freedom of P.
To recap, we have constructed a machine T such that for each pair s, t ∈ S we can find a string p_{st} with

    T(π(t)p_{st}) = s,  |p_{st}| ≤ H(s, t) − H(t) + c + 2.

But now, from the definition of the conditional entropy H(s | t),

    H(s | t) ≤ |p_{st}| + |T| ≤ H(s, t) − H(t) + c + 2 + |T| = H(s, t) − H(t) + O(1).
Summary
With the proof that

    H(s, t) = H(t) + H(s | t) + O(1)

the basic theory of algorithmic entropy is complete.
Appendix A
Cardinality

Cardinality, that is, Cantor's theory of infinite cardinal numbers, does not play a direct rôle in algorithmic information theory, or more generally in the study of computability, since all the sets that arise there are enumerable. However, the proof of Cantor's Theorem below, using Cantor's diagonal method, is the predecessor, or model, for our proof of the Halting Theorem and other results in Algorithmic Information Theory. The idea also lies behind Gödel's Unprovability Theorem, that in any non-trivial axiomatic system there are propositions that can neither be proved nor disproved.
A.1 Cardinality

Definition A.1. Two sets X and Y are said to have the same cardinality, and we write

    #(X) = #(Y),

if there exists a bijection f : X → Y.

When we use the = sign for a relation in this way we should verify that the relation is reflexive, symmetric and transitive. In this case that is trivial.

Proposition A.1.
1. #(X) = #(X);
2. #(X) = #(Y) ⟹ #(Y) = #(X);
3. #(X) = #(Y), #(Y) = #(Z) ⟹ #(X) = #(Z).

By convention, we write

    #(N) = ℵ_0.
Definition A.2. We say that the cardinality of X is less than or equal to the cardinality of Y, and we write

    #(X) ≤ #(Y),

if there exists an injection f : X → Y.

We say that the cardinality of X is (strictly) less than the cardinality of Y, and we write

    #(X) < #(Y),

if #(X) ≤ #(Y) and #(X) ≠ #(Y).

Proposition A.2.
1. #(X) ≤ #(X);
2. #(X) ≤ #(Y), #(Y) ≤ #(Z) ⟹ #(X) ≤ #(Z).

Again, these follow at once from the definition of injectivity.
A.1.1 Cardinal arithmetic

Addition and multiplication of cardinal numbers are defined by

    #(X) + #(Y) = #(X + Y),  #(X) · #(Y) = #(X × Y),

where X + Y is the disjoint union of X and Y (ie if X and Y are not disjoint we take copies that are).

However, these operations are not very useful; for if one (or both) of the cardinals ℵ_1, ℵ_2 is infinite then

    ℵ_1 + ℵ_2 = ℵ_1 · ℵ_2 = max(ℵ_1, ℵ_2).

The power operation is more useful, as we shall see. Recall that 2^X denotes the set of subsets of X. We set

    2^{#(X)} = #(2^X).
A.2 The Schröder-Bernstein Theorem

Theorem A.1.

    #(X) ≤ #(Y), #(Y) ≤ #(X) ⟹ #(X) = #(Y).

Proof. By definition there exist injective maps

    f : X → Y,  g : Y → X.

We have to construct a bijection

    h : X → Y.

To simplify the discussion, we assume that X and Y are disjoint (taking disjoint copies if necessary).

Given x_0 ∈ X, we construct the sequence

    y_0 = f(x_0) ∈ Y, x_1 = g(y_0) ∈ X, y_1 = f(x_1) ∈ Y, . . . .

There are two possibilities:

(i) The sequence continues indefinitely without repetition, giving a singly-infinite chain starting in X:

    x_0, y_0, x_1, y_1, x_2, . . . .

(ii) There is a repetition, say

    x_r = x_s

for some r < s. Since f and g are injective, it follows that the first repetition must be

    x_0 = x_r,

so that we have a loop

    x_0, y_0, x_1, y_1, . . . , x_{r−1}, y_{r−1}, x_0.

In case (i), we may be able to extend the chain backwards, if x_0 ∈ im(g). In that case we set

    x_0 = g(y_{−1}),

where y_{−1} is unique since g is injective.

Then we may be able to go further back:

    y_{−1} = f(x_{−1}), x_{−1} = g(y_{−2}), . . . .

There are three possibilities:

(A) The process continues indefinitely, giving a doubly-infinite chain

    . . . , x_{−n}, y_{−n}, x_{−n+1}, y_{−n+1}, . . . , x_0, y_0, x_1, . . . .

(B) The process ends at an element of X, giving a singly-infinite chain

    x_{−n}, y_{−n}, x_{−n+1}, . . . .

(C) The process ends at an element of Y, giving a singly-infinite chain

    y_{−n}, x_{−n+1}, y_{−n+1}, . . . .

It is easy to see that these chains and loops are disjoint, partitioning the union X + Y into disjoint sets. This allows us to define the map h on each chain and loop separately. Thus in the case of a doubly-infinite chain, or a chain starting at an element of X, or a loop, we set

    h(x_r) = y_r;

while in the case of a chain starting at an element y_{−n} ∈ Y we set

    h(x_r) = y_{r−1}.

Putting these maps together gives a bijective map

    h : X → Y.
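The chain-chasing in this proof can be carried out quite mechanically; the following sketch (our own illustration, for finite sets, where doubly-infinite chains cannot occur) builds the bijection h from injections f and g given as dictionaries:

    def schroeder_bernstein(X, f, g):
        """Build a bijection h: X -> Y from injections f: X -> Y, g: Y -> X (finite illustration)."""
        f_inv = {v: k for k, v in f.items()}
        g_inv = {v: k for k, v in g.items()}
        h = {}
        for x in X:
            # trace the chain backwards from x
            cur, side, seen = x, "X", set()
            while True:
                if side == "X":
                    if cur not in g_inv or (cur, side) in seen:
                        stop = "X"          # chain stops in X, or closes up into a loop
                        break
                    seen.add((cur, side))
                    cur, side = g_inv[cur], "Y"
                else:
                    if cur not in f_inv:
                        stop = "Y"          # chain stops in Y
                        break
                    seen.add((cur, side))
                    cur, side = f_inv[cur], "X"
            h[x] = f[x] if stop == "X" else g_inv[x]
        return h

    f = {0: 'a', 1: 'b', 2: 'c'}
    g = {'a': 1, 'b': 2, 'c': 0}
    print(schroeder_bernstein({0, 1, 2}, f, g))   # a bijection {0,1,2} -> {'a','b','c'}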
A.3 Cantor's Theorem

Theorem A.2. The number of elements of a set is strictly less than the number of subsets of the set:

    #(X) < #(2^X).

Proof. We have to show that #(X) ≤ #(2^X) but #(X) ≠ #(2^X).

There is an obvious injection X → 2^X, namely

    x ↦ {x}.

Hence

    #(X) ≤ #(2^X).

Suppose there is a surjection

    f : X → 2^X.

Let

    S = {x ∈ X : x ∉ f(x)}.

Since f is surjective, there exists an element s ∈ X such that

    S = f(s).

We ask the question: does the element s belong to the subset S, or not?

If s ∈ S, then from the definition of S,

    s ∉ f(s) = S.

On the other hand, if s ∉ S, then again from the definition of S,

    s ∈ S.

Either way, we encounter a contradiction. Hence our hypothesis is untenable: there is no surjection, and so certainly no bijection, f : X → 2^X, ie

    #(X) ≠ #(2^X).
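The diagonal construction is entirely explicit, as the following small check (our own illustration, on a finite set) shows: whatever map f : X → 2^X we write down, the diagonal set S is missed by f.

    def diagonal_set(X, f):
        """Cantor's diagonal set S = {x in X : x not in f(x)} for f: X -> 2^X."""
        return {x for x in X if x not in f(x)}

    X = {0, 1, 2}
    f = {0: {0, 1}, 1: set(), 2: {1, 2}}          # an arbitrary attempt at a surjection
    S = diagonal_set(X, lambda x: f[x])
    print(S)                                      # {1}
    print(any(f[x] == S for x in X))              # False: S is not a value of f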
A.4 Comparability

Our aim in this Section is to prove that any 2 sets X, Y are comparable, ie

    either #(X) ≤ #(Y) or #(Y) ≤ #(X).

To this end we introduce the notion of well-ordering.

Recall that a partial order on a set X is a relation ≤ such that

1. x ≤ x for all x;
2. x ≤ y, y ≤ z ⟹ x ≤ z;
3. x ≤ y, y ≤ x ⟹ x = y.

A partial order is said to be a total order if in addition, for all x, y ∈ X,

4. either x ≤ y or y ≤ x.

A total order is said to be a well-ordering if

5. every non-empty subset S ⊂ X has a least element μ(S) ∈ S.

Examples:

1. The natural numbers N are well-ordered.
2. The integers Z are not well-ordered, since Z itself does not have a least element.
3. The set of non-negative reals R_+ = {x ∈ R : x ≥ 0} is not well-ordered, since the subset S = {x : x > 0} does not have a least element in S.
4. The set N × N with the lexicographic ordering
    (m, n) ≤ (m′, n′) if m < m′, or m = m′ and n ≤ n′,
is well-ordered. To find the least element (m, n) in a subset S ⊂ N × N we first find the least m occurring in S; and then among the pairs (m, n) ∈ S we find the least n. (A short computational illustration follows this list.)
5. The disjoint sum N + N, with the ordering under which every element of the first copy of N is less than every element of the second copy of N, is well-ordered.
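As promised in Example 4, here is a tiny illustration (ours): tuples in Python are compared lexicographically, so the least element of a finite subset of N × N in this ordering is simply its minimum.

    S = {(3, 5), (2, 9), (2, 4), (7, 0)}
    print(min(S))        # (2, 4): least first coordinate, then least second among those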
It follows at once from the definition that every subset S ⊂ X of a well-ordered set X is well-ordered.
A well-ordered set X has a first (least) element

    x_0 = μ(X).

Unless this is the only element, X has a second (next least) element

    x_1 = μ(X \ {x_0}).

Similarly, unless these are the only elements, X has a third element

    x_2 = μ(X \ {x_0, x_1}),

and so on. Moreover, after all these elements x_0, x_1, x_2, . . . (assuming they have not exhausted X) there is a next element

    x_ω = μ(X \ {x_0, x_1, x_2, . . . }).

Then comes the element

    x_{ω+1} = μ(X \ {x_0, x_1, x_2, . . . , x_ω}),

and after that elements x_{ω+2}, x_{ω+3}, . . . .
Proposition A.3. There is at most one order-preserving isomorphism between 2 well-ordered sets.

Proof. Suppose

    f, g : X → Y

are 2 isomorphisms between the well-ordered sets X, Y, and suppose f ≠ g. Let

    z = μ({x ∈ X : fx ≠ gx}).

In other words, z is the first point at which the maps f and g diverge.

We may assume without loss of generality that

    fz < gz.

Since g is an isomorphism,

    fz = gt

for some element t ∈ X. But now, since g is order-preserving,

    gt < gz ⟹ t < z ⟹ ft = gt = fz ⟹ t = z,

the second implication holding by the minimality of z, and the third since f is injective; and this contradicts t < z. We conclude that f = g, ie fx = gx for all x ∈ X.

Although we shall make no use of this, we associate an ordinal number (or just ordinal) to each well-ordered set. By the last Proposition, two well-ordered sets have the same ordinal number if and only if they are order-isomorphic.
A subset I ⊂ X of a partially-ordered set X is called an initial segment if

    x ∈ I, y ≤ x ⟹ y ∈ I.

It is easy to see that the set

    I(x) = {y ∈ X : y < x}

(where x < y means x ≤ y but x ≠ y) is an initial segment in X for each x ∈ X.

In a well-ordered set every initial segment I ⊂ X, except X itself, is of this form; for it is easily seen that

    I = I(x),

where

    x = μ(X \ I).

If an element x of a well-ordered set X is not the greatest element of X then it has an immediate successor x′, namely

    x′ = μ({y ∈ X : y > x}).

But not every element x ∈ X (apart from the minimal element x_0) need be a successor element. We call an element x ≠ x_0 with no immediate predecessor a limit element.
Lemma 6. If x is a limit element then

    I(x) = ∪_{y<x} I(y).

Proof. Certainly

    J = ∪_{y<x} I(y)

is an initial segment contained in I(x). Suppose

    J = I(z),

where z ≤ x.

If z < x then x must be the immediate successor of z. For suppose z < t < x. Then

    z ∈ I(t) ⊂ J = I(z),

contrary to the definition of the initial segment I(z). But x is a limit element, and so has no immediate predecessor; hence z = x, ie J = I(x).
Lemma 7. Suppose X, Y are well-ordered sets. Then X is order-isomorphic to at most one initial segment I of Y.

Proof. If X is isomorphic to two different initial segments I, J ⊂ Y then there are two different order-preserving injective maps

    f, g : X → Y.

Suppose these maps first diverge at z ∈ X:

    z = μ({x ∈ X : fx ≠ gx}).

We may assume without loss of generality that

    fz < gz.

But then fz ∈ J = im(g), since J is an initial segment containing gz; and so

    fz = gt

for some t ∈ X. Thus

    gt < gz ⟹ t < z ⟹ ft = gt = fz ⟹ t = z,

the second implication holding by the minimality of z, and the third since f is injective; and this contradicts t < z. Hence f = g, and so I = J.

Corollary A.1. If there is such an isomorphism f : X → I ⊂ Y then it is unique.

This follows at once from Proposition A.3.
Proposition A.4. Suppose X, Y are well-ordered sets. Then either there exists an order-preserving injection

    f : X → Y,

or there exists an order-preserving injection

    f : Y → X.

Proof. Suppose first that for every element x ∈ X there is an order-preserving isomorphism

    f_x : I(x) → J

onto an initial segment J of Y. If J = Y for some x, then Y is order-isomorphic to the initial segment I(x) ⊂ X, and we are done. Otherwise

    J = I(y),

where (by Lemma 7) y is completely determined by x, say

    y = f(x).

Then it follows easily that

    f : X → Y

is an order-preserving injection.

Suppose instead that there is no such map f_x for some x ∈ X. Let z ∈ X be the smallest such element. If u < v < z then the Corollary above shows that f_u is the restriction of f_v to I(u). It follows that the maps f_u for u < z fit together to give an order-preserving injection

    f : I(z) → Y.

More precisely, if x < z then (from the definition of z) there is an order-preserving isomorphism

    f_x : I(x) → I(y),

where y ∈ Y is well-defined; and we set fx = y. This defines an order-preserving isomorphism of I(z) onto an initial segment of Y, contrary to the definition of z.
If X, Y are two well-ordered sets, we say that the ordinal of X is less than or equal to the ordinal of Y if there exists an order-preserving injection f : X → Y. Ordinals are even further from our main theme than cardinals; but nevertheless, every young mathematician should be at least vaguely familiar with them.

We denote the ordinal of N, with the usual ordering (which we have observed is a well-ordering), by ω:

    ω = {0, 1, 2, . . . }.

If X, Y are well-ordered sets then so is the disjoint union X + Y, taking the elements of X before those of Y. This allows us to add ordinals. For example

    ω + 1 = {0, 1, 2, . . . , ω},

where we have added another element after the natural numbers. It is easy to see that

    ω + 1 ≠ ω:

the two ordered sets are not order-isomorphic, although both are enumerable. So different ordinals may correspond to the same cardinal. Of course this is not true for finite numbers; there is a one-one correspondence between finite cardinals and finite ordinals.

Note that addition of ordinals is not commutative, eg

    1 + ω = ω,

since adding an extra element at the beginning of N does not alter its ordinality.
A.4.1 The Well Ordering Theorem

The Axiom of Choice states that for every set X we can find a map

    c : 2^X → X

such that

    c(S) ∈ S

for every non-empty subset S ⊂ X.

We call such a map c a choice function for X.

The Well Ordering Theorem states that every set X can be well-ordered.

Proposition A.5. The Axiom of Choice and the Well Ordering Theorem are equivalent.

Proof. If X is a well-ordered set then there is an obvious choice function, namely

    c(S) = μ(S).

The converse is more difficult.
Suppose c is a choice function for X. Let us say that a well-ordering of a subset S ⊂ X has property T if

    μ(S \ I) = c(X \ I)

for every initial segment I ⊂ S (except S itself).

We note that such a subset S must start with the elements

    x_0 = c(X), x_1 = c(X \ {x_0}), x_2 = c(X \ {x_0, x_1}), . . . , x_ω = c(X \ {x_0, x_1, x_2, . . . }), . . . ,

unless S is exhausted earlier.

Lemma 8. A subset S ⊂ X has at most one well-ordering with property T.
Proof. Suppose there are two such well-orderings on S. Let us denote them by < and ≺, respectively. If x ∈ S let us write

    I_<(x) ≡ I_≺(x)

to mean that not only are these initial segments the same as sets, but they also carry the same orderings.

If this is true of every x ∈ S then the two orderings are the same, since

    u < v ⟹ u ∈ I_<(v) = I_≺(v) ⟹ u ≺ v.

If however this is not true of all x, let z be the least such element according to the first ordering:

    z = μ_<({x ∈ S : I_<(x) ≢ I_≺(x)}).

If u, v ∈ I_<(z) then

    u < v ⟹ u ∈ I_<(v) = I_≺(v) ⟹ u ≺ v.

It follows that the two orderings agree on I_<(z); and I_<(z) is also an initial segment in the second ordering (if a ∈ I_<(z) and b ≺ a, then b ∈ I_≺(a) = I_<(a) ⊂ I_<(z)), say

    I_<(z) = I_≺(t).

Hence, applying property T to both orderings,

    z = μ_<(S \ I_<(z)) = c(X \ I_<(z)) = c(X \ I_≺(t)) = μ_≺(S \ I_≺(t)) = t.

Thus

    I_<(z) = I_≺(z);

and since, as we saw, the two orderings coincide on this set,

    I_<(z) ≡ I_≺(z),

contrary to the choice of z. So there is no such z, and therefore the two orderings on S coincide.
Lemma 9. Suppose the subsets S, T ⊂ X both carry well-orderings with property T. Then

    either S ⊂ T or T ⊂ S.

Moreover, in the first case S is an initial segment of T, and in the second case T is an initial segment of S.

Proof. As before, we denote the well-orderings on S and T by < and ≺, respectively.

Consider the elements x ∈ S such that the initial segment I = I_<(x) in S is also an initial segment in T. By the last Lemma, the two orderings on I coincide.

If

    I_<(x) = T

for some x, we are done: T is an initial segment of S. Otherwise I_<(x) = I_≺(y) for some y ∈ T; and by property T (applied in S and in T respectively)

    x = μ_<(S \ I) = c(X \ I) = μ_≺(T \ I) = y.

Thus

    I = I_<(x) = I_≺(x),

with

    x = c(X \ I).

Suppose this is true for all x ∈ S.

If S has a largest element s then

    S = I_<(s) ∪ {s} = I_≺(s) ∪ {s}.

Thus S ⊂ T; and either S = T, or

    S = I_≺(s′),

where s′ is the successor of s in T. In either case S is an initial segment of T.

If S does not have a largest element then

    S = ∪_{x∈S} I_<(x) = ∪_{x∈S} I_≺(x) ⊂ T.

Thus S is again an initial segment of T; for

    u ∈ S, v ∈ T, v ≺ u ⟹ v ∈ I_≺(u) = I_<(u) ⟹ v ∈ S.

Now suppose that I_<(x) is not an initial segment of T for some x ∈ S, and let z be the smallest such element of S (note that I_<(z) ⊂ T, by the argument above). Since I_<(z) is not an initial segment of T, there is an element t ∈ T, not in I_<(z), with t ≺ u for some u ∈ I_<(z). But u < z, so I_<(u) = I_≺(u); and then t ∈ I_≺(u) = I_<(u) ⊂ I_<(z), which is a contradiction. So this case cannot arise, and the Lemma is proved.
We are now in a position to well-order X. Let us denote by 𝒮 the collection of subsets S ⊂ X which can be well-ordered with property T; and let

    U = ∪_{S ∈ 𝒮} S.

We shall show that U is well-ordered with property T.

Firstly we define a total ordering on U. Suppose u, v ∈ U. There exists a set S ∈ 𝒮 containing both u and v; for if u ∈ S_1, v ∈ S_2, where S_1, S_2 ∈ 𝒮, then by Lemma 9 either S_1 ⊂ S_2, in which case u, v ∈ S_2, or S_2 ⊂ S_1, in which case u, v ∈ S_1.

Also, if u, v ∈ S and u, v ∈ T, where S, T ∈ 𝒮, then by the same Lemma the orderings of S and T agree on u and v. Thus we have defined the order on U unambiguously.

To see that this is a well-ordering, suppose A is a non-empty subset of U; and suppose a ∈ A. Then a ∈ S for some S ∈ 𝒮. Let

    z = μ(A ∩ S).

We claim that z is the smallest element of A. For suppose t < z, t ∈ A. Then t ∈ I(z) ⊂ S (t lies in some S′ ∈ 𝒮, and one of S, S′ is an initial segment of the other, so t < z ∈ S gives t ∈ S), and so t ∈ A ∩ S, contradicting the minimality of z.

Finally, to see that this well-ordering of U has property T, suppose I is an initial segment of U, I ≠ U. Let z be the smallest element in U \ I; and suppose z ∈ S, where S ∈ 𝒮. Then I = I(z) is an initial segment in S, with

    z = μ(S \ I) = c(X \ I),

since S has property T.

Thus

    U ∈ 𝒮.

If U ≠ X, let

    u = c(X \ U),

and set

    V = U ∪ {u}.

We extend the order on U to V by making u the greatest element of V. It is a straightforward matter to verify that V is well-ordered, and has property T. It follows that

    V ∈ 𝒮 ⟹ V ⊂ U,

which is absurd.

Hence U = X, and so X is well-ordered.
Now we can prove the Comparability Theorem.

Theorem A.3. Any 2 sets X, Y are comparable, ie

    either #(X) ≤ #(Y) or #(Y) ≤ #(X).

Proof. Let us well-order X and Y. Then by Proposition A.4 either there exists an injection

    j : X → Y,

or there exists an injection

    j : Y → X.

In other words,

    #(X) ≤ #(Y) or #(Y) ≤ #(X).