Introduction To Data Compression - Guy E. Blelloch PDF
Guy E. Blelloch
Computer Science Department
Carnegie Mellon University
[email protected]
Contents
1 Introduction
2 Information Theory
2.1 Entropy
2.2 The Entropy of the English Language
2.3 Conditional Entropy and Markov Chains
3 Probability Coding
3.1 Prefix Codes
3.1.1 Relationship to Entropy
3.2 Huffman Codes
3.2.1 Combining Messages
3.2.2 Minimum Variance Huffman Codes
3.3 Arithmetic Coding
3.3.1 Integer Implementation
This is an early draft of a chapter of a book I'm starting to write on "algorithms in the real world". There are surely
many mistakes, and please feel free to point them out. In general the lossless compression part is more polished
than the lossy compression part. Some of the text and figures in the Lossy Compression sections are from scribe notes
taken by Ben Liblit at UC Berkeley. Thanks for many comments from students that helped improve the presentation.
© 2000, 2001 Guy Blelloch
1 Introduction
Compression is used just about everywhere. All the images you get on the web are compressed,
typically in the JPEG or GIF formats; most modems use compression; HDTV will be compressed
using MPEG-2; several file systems automatically compress files when stored; and the rest
of us do it by hand. The neat thing about compression, as with the other topics we will cover in
this course, is that the algorithms used in the real world make heavy use of a wide set of algorithmic tools, including sorting, hash tables, tries, and FFTs. Furthermore, algorithms with strong
theoretical foundations play a critical role in real-world applications.
In this chapter we will use the generic term message for the objects we want to compress,
which could be either files or messages. The task of compression consists of two components, an
encoding algorithm that takes a message and generates a compressed representation (hopefully
with fewer bits), and a decoding algorithm that reconstructs the original message or some approximation of it from the compressed representation. These two components are typically intricately
tied together since they both have to understand the shared compressed representation.
We distinguish between lossless algorithms, which can reconstruct the original message exactly
from the compressed message, and lossy algorithms, which can only reconstruct an approximation
of the original message. Lossless algorithms are typically used for text, and lossy for images and
sound where a little bit of loss in resolution is often undetectable, or at least acceptable. Lossy is
used in an abstract sense, however, and does not mean random lost pixels, but instead means loss
of a quantity such as a frequency component, or perhaps loss of noise. For example, one might
think that lossy text compression would be unacceptable, imagining missing or
switched characters. Consider instead a system that reworded sentences into a more standard
form, or replaced words with synonyms so that the file can be better compressed. Technically
the compression would be lossy since the text has changed, but the meaning and clarity of the
message might be fully maintained, or even improved. In fact Strunk and White might argue that
good writing is the art of lossy text compression.
Is there a lossless algorithm that can compress all messages? There has been at least one
patent application that claimed to be able to compress all files (messages): Patent 5,533,051, titled
"Methods for Data Compression". The patent application claimed that if it was applied recursively,
a file could be reduced to almost nothing. With a little thought you should convince yourself that
this is not possible, at least if the source messages can contain any bit-sequence. We can see this
by a simple counting argument. Let's consider all 1000 bit messages, as an example. There are
$2^{1000}$ different messages we can send, each of which needs to be distinctly identified by the decoder.
It should be clear we can't represent that many different messages by sending 999 or fewer bits for
all the messages: 999 bits would only allow us to send $2^{999}$ distinct messages. The truth is that
if any one message is shortened by an algorithm, then some other message needs to be lengthened.
You can verify this in practice by running GZIP on a GIF file. It is, in fact, possible to go further
and show that for a set of input messages of fixed length, if one message is compressed, then the
average length of the compressed messages over all possible inputs is always going to be longer
than the original input messages. Consider, for example, the 8 possible 3 bit messages. If one is
compressed to two bits, it is not hard to convince yourself that two messages will have to expand
to 4 bits, giving an average of $3\frac{1}{8}$ bits. Unfortunately, the patent was granted.
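The counting argument is easy to check mechanically. The following is only a sketch: the function names are our own, and the list of codeword lengths is just the 3 bit example from the paragraph above written out.

```python
def strings_shorter_than(n: int) -> int:
    """Number of distinct bit strings of length 0 through n - 1."""
    return sum(2 ** k for k in range(n))  # equals 2**n - 1


def average_length(lengths: list[int]) -> float:
    """Average codeword length when every message is equally likely."""
    return sum(lengths) / len(lengths)


# There are 2**n messages of n bits but only 2**n - 1 strictly shorter
# bit strings, so no lossless code can shorten every n-bit message.
n = 1000
fewer_codes_than_messages = strings_shorter_than(n) < 2 ** n

# The 3 bit example: one message shrinks to 2 bits, two grow to 4 bits,
# and the remaining five stay at 3 bits.
lengths = [2, 3, 3, 3, 3, 3, 4, 4]
avg = average_length(lengths)  # 25 / 8 bits
```

The average comes out above 3 bits, which is the point of the example: shortening one message costs more than it saves once every input is counted.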
Because one can't hope to compress everything, all compression algorithms must assume that
there is some bias on the input messages so that some inputs are more likely than others, i.e. that
there is some unbalanced probability distribution over the possible messages. Most compression
algorithms base this bias on the structure of the messages, i.e., an assumption that repeated
characters are more likely than random characters, or that large white patches occur in typical
images. Compression is therefore all about probability.
When discussing compression algorithms it is important to make a distinction between two
components: the model and the coder. The model component somehow captures the probability
distribution of the messages by knowing or discovering something about the structure of the input.
The coder component then takes advantage of the probability biases generated in the model to
generate codes. It does this by effectively lengthening low probability messages and shortening
high-probability messages. A model, for example, might have a generic understanding of human
faces knowing that some faces are more likely than others (e.g., a teapot would not be a very
likely face). The coder would then be able to send shorter messages for objects that look like
faces. This could work well for compressing teleconference calls. The models in most current
real-world compression algorithms, however, are not so sophisticated, and use more mundane
measures such as repeated patterns in text. Although there are many different ways to design the
model component of compression algorithms and a huge range of levels of sophistication, the coder
components tend to be quite generic: in current algorithms they are almost exclusively based on either
Huffman or arithmetic codes. Lest we try to make too fine a distinction here, it should be pointed
out that the line between model and coder components of algorithms is not always well defined.
It turns out that information theory is the glue that ties the model and coder components together. In particular it gives a very nice theory about how probabilities are related to information
content and code length. As we will see, this theory matches practice almost perfectly, and we can
achieve code lengths almost identical to what the theory predicts.
Another question about compression algorithms is how one judges the quality of one versus another. In the case of lossless compression there are several criteria I can think of: the time to
compress, the time to reconstruct, the size of the compressed messages, and the generality, i.e.,
does it only work on Shakespeare or does it do Byron too. In the case of lossy compression the
judgement is further complicated since we also have to worry about how good the lossy approximation is. There are typically tradeoffs between the amount of compression, the runtime, and
the quality of the reconstruction. Depending on your application one might be more important
than another and you would want to pick your algorithm appropriately. Perhaps the best attempt
to systematically compare lossless compression algorithms is the Archive Comparison Test (ACT)
by Jeff Gilchrist. It reports times and compression ratios for 100s of compression algorithms over
many databases. It also gives a score based on a weighted average of runtime and the compression
ratio.
This chapter will be organized by first covering some basics of information theory. Section 3
then discusses the coding component of compressing algorithms and shows how coding is related
to the information theory. Section 4 discusses various models for generating the probabilities
needed by the coding component. Section 5 describes the Lempel-Ziv algorithms, and Section 6
covers other lossless algorithms (currently just Burrows-Wheeler).
2 Information Theory
2.1 Entropy
Shannon borrowed the definition of entropy from statistical physics to capture the notion of how
much information is contained in a set of messages and their probabilities. For a set of possible messages $S$,
Shannon defined entropy as
\[ H(S) = \sum_{s \in S} p(s) \log_2 \frac{1}{p(s)} \]
where $p(s)$ is the probability of message $s$. The definition of entropy is very similar to that in
statistical physics: in physics $S$ is the set of possible states a system can be in and $p(s)$ is the
probability the system is in state $s$. We might remember that the second law of thermodynamics
basically says that the entropy of a system and its surroundings can only increase.
Getting back to messages, if we consider the individual messages $s \in S$, Shannon defined the
notion of the self information of a message as
\[ i(s) = \log_2 \frac{1}{p(s)}. \]
This self information represents the number of bits of information contained in it and, roughly
speaking, the number of bits we should use to send that message. The equation says that messages
with higher probability will contain less information (e.g., a message saying that it will be sunny
out in LA tomorrow is less informative than one saying that it is going to snow).
The entropy is simply a weighted average of the information of each message, and therefore
the average number of bits of information in the set of messages. Larger entropies represent more
information, and perhaps counter-intuitively, the more random a set of messages (the more even
the probabilities) the more information they contain on average.
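As a minimal sketch of these two definitions (the function names and the two distributions below are our own illustrative choices, not taken from the text), the self information and the entropy can be computed directly:

```python
import math


def self_information(p: float) -> float:
    """Bits of information in a message of probability p: log2(1/p)."""
    return math.log2(1.0 / p)


def entropy(probs: list[float]) -> float:
    """Weighted average of the self information over all messages."""
    return sum(p * self_information(p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]   # evenly spread probabilities
skewed = [0.7, 0.1, 0.1, 0.1]        # one message dominates

h_uniform = entropy(uniform)
h_skewed = entropy(skewed)
```

The evenly spread distribution has the larger entropy, which matches the remark that more random sets of messages carry more information on average.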
Here are some examples of entropies for different probability distributions over five messages.
\[ p(S) = \{0.25, 0.25, 0.25, 0.125, 0.125\} \]
\[ H = 3 \times 0.25 \times \log_2 4 + 2 \times 0.125 \times \log_2 8 = 1.5 + 0.75 = 2.25 \]
\[ p(S) = \{0.5, 0.125, 0.125, 0.125, 0.125\} \]
\[ H = 0.5 \times \log_2 2 + 4 \times 0.125 \times \log_2 8 = 0.5 + 1.5 = 2 \]
\[ p(S) = \{0.75, 0.0625, 0.0625, 0.0625, 0.0625\} \]
\[ H = 0.75 \times \log_2 (4/3) + 4 \times 0.0625 \times \log_2 16 \approx 0.3 + 1 = 1.3 \]
Note that the more uneven the distribution, the lower the entropy.
(Technically the definition of entropy given above is for first-order entropy. We will get back to the general notion of entropy.)

                                        bits/char
bits $\lceil \log_2(96) \rceil$         7
entropy                                 4.5
Huffman Code (avg.)                     4.7
Entropy (Groups of 8)                   2.4
Asymptotically approaches:              1.3
Compress                                3.7
Gzip                                    2.7
BOA                                     2.0
Table 1: Information Content of the English Language

Why is the logarithm of the inverse probability the right measure for self information of a message? Although we will relate the self information and entropy to message length more formally in
Section 3, let's try to get some intuition here. First, for a set of $n = 2^i$ equal probability messages, the
probability of each is $1/n$. We also know that if all are the same length, then $\log_2 n$ bits are required
to encode each message. Well this is exactly the self information since $i(s_i) = \log_2 \frac{1}{p_i} = \log_2 n$.
Another property of information we would like is that the information given by two independent
messages should be the sum of the information given by each. In particular if messages $A$ and $B$
are independent, the probability of sending one after the other is $p(A)p(B)$ and the information
contained in them is
\[ i(AB) = \log_2 \frac{1}{p(A)p(B)} = \log_2 \frac{1}{p(A)} + \log_2 \frac{1}{p(B)} = i(A) + i(B). \]
The logarithm is the simplest function that has this property.
Date        bpc    scheme   authors
May 1977    3.94   LZ77     Ziv, Lempel
1984        3.32   LZMW     Miller and Wegman
1987        3.30   LZH      Brent
1987        3.24   MTF      Moffat
1987        3.18   LZB      Bell
.           2.71   GZIP     .
1988        2.48   PPMC     Moffat
.           2.47   SAKDC    Williams
Oct 1994    2.34   PPM*     Cleary, Teahan, Witten
1995        2.29   BW       Burrows, Wheeler
1997        1.99   BOA      Sutton
1999        1.89   RK       Taylor
[Figure: a two-state Markov chain with states Sw and Sb and transition probabilities P(w|w), P(b|w), P(b|b) and P(w|b).]
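If the probability distribution of $S$ is independent of the context $C$ then $H(S \mid C) = H(S)$, and otherwise $H(S \mid C) < H(S)$. In other words, knowing the context can only
reduce the entropy.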
Shannon actually originally defined entropy in terms of information sources. An information
source generates an infinite sequence of messages $X_k$, $k \in \{-\infty, \ldots, \infty\}$, from a fixed message
set $S$. If the probability of each message is independent of the previous messages then the system
is called an independent and identically distributed (iid) source. The entropy of such a source is
called the unconditional or first order entropy and is as defined in Section 2.1. In this chapter by
default we will use the term entropy to mean first-order entropy.
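Another kind of source of messages is a Markov process, or more precisely a discrete time
Markov chain. A sequence follows an order $k$ Markov model if the probability of each message
depends only on the $k$ previous messages, in particular
\[ p(x_n \mid x_{n-1}, \ldots, x_{n-k}) = p(x_n \mid x_{n-1}, \ldots, x_{n-k}, \ldots) \]
where $x_i$ is the $i$th message generated by the source. The values that can be taken on by
$\{x_{n-1}, \ldots, x_{n-k}\}$ are called the states of the system.
For a two-state source of black ($b$) and white ($w$) pixels with $p(b|w) = .01$, $p(w|w) = .99$, $p(b|b) = .7$ and $p(w|b) = .3$, so that $p(b) = 1/31$ and $p(w) = 30/31$, these probabilities
give the conditional entropy
\[ \frac{30}{31}\left(.01 \log_2 \frac{1}{.01} + .99 \log_2 \frac{1}{.99}\right) + \frac{1}{31}\left(.7 \log_2 \frac{1}{.7} + .3 \log_2 \frac{1}{.3}\right) \approx .107 \]
This gives the expected number of bits of information contained in each pixel generated by the
source. Note that the first-order entropy of the source is
\[ \frac{30}{31} \log_2 \frac{31}{30} + \frac{1}{31} \log_2 31 \approx .206 \]
which is almost twice as large.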
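The comparison between conditional and first-order entropy can be reproduced numerically. This is a sketch under stated assumptions: the four transition probabilities below are assumed values for a two-state black and white pixel source, and the variable names are ours.

```python
import math


def h(probs):
    """Entropy in bits of a discrete distribution."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


# Assumed transition probabilities for a two-state (white/black) source.
p_b_given_w, p_w_given_w = 0.01, 0.99
p_b_given_b, p_w_given_b = 0.7, 0.3

# Stationary probabilities, from p(b) = p(b|w) p(w) + p(b|b) p(b)
# together with p(w) + p(b) = 1.
p_b = p_b_given_w / (p_b_given_w + p_w_given_b)
p_w = 1.0 - p_b

# Conditional entropy: the entropy of the next pixel in each state,
# weighted by how often the source is in that state.
conditional = (p_w * h([p_b_given_w, p_w_given_w])
               + p_b * h([p_b_given_b, p_w_given_b]))

# First-order entropy ignores the previous pixel entirely.
first_order = h([p_w, p_b])
```

With these assumed numbers the first-order entropy is close to twice the conditional entropy, which is the gain from knowing the previous pixel.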
Shannon also defined a general notion of source entropy for an arbitrary source. Let $A^n$ denote the
set of all strings of length $n$ from an alphabet $A$, then the $n$th order normalized entropy is defined
as
\[ H_n = \frac{1}{n} \sum_{X \in A^n} p(X) \log_2 \frac{1}{p(X)}. \tag{1} \]
This is normalized since we divide it by $n$; it represents the per-character information. The source
entropy is then defined as
\[ H = \lim_{n \to \infty} H_n. \]
In general it is extremely hard to determine the source entropy of an arbitrary source process just
by looking at the output of the process. This is because to calculate accurate probabilities even for
a relatively simple process could require looking at extremely long sequences.
3 Probability Coding
As mentioned in the introduction, coding is the job of taking probabilities for messages and generating bit strings based on these probabilities. How the probabilities are generated is part of the
model component of the algorithm, which is discussed in Section 4.
In practice we typically use probabilities for parts of a larger message rather than for the complete message, e.g., each character or word in a text. To be consistent with the terminology in the
previous section, we will consider each of these components a message on its own, and we will
use the term message sequence for the larger message made up of these components. In general
each little message can be of a different type and come from its own probability distribution. For
example, when sending an image we might send a message specifying a color followed by messages specifying a frequency component of that color. Even the messages specifying the color
might come from different probability distributions since the probability of particular colors might
depend on the context.
We distinguish between algorithms that assign a unique code (bit-string) for each message, and
ones that blend the codes together from more than one message in a row. In the first class we
will consider Huffman codes, which are a type of prefix code. In the latter category we consider
arithmetic codes. The arithmetic codes can achieve better compression, but can require the encoder
to delay sending messages since the messages need to be combined before they can be sent.
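3.1 Prefix Codes
A code $C$ for a message set $S$ is a mapping from each message to a bit string. Each bit string is
called a codeword, and we will denote codes using the syntax $C = \{(s_1, w_1), (s_2, w_2), \ldots, (s_m, w_m)\}$.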
Typically in computer science we deal with fixed-length codes, such as the ASCII code which maps
every printable character and some control characters into 7 bits. For compression, however, we
would like codewords that can vary in length based on the probability of the message. Such variable length codes have the potential problem that if we are sending one codeword after the other
it can be hard or impossible to tell where one codeword finishes and the next starts. For example,
given the code $\{(a, 1), (b, 01), (c, 101), (d, 011)\}$, the bit-sequence 1011 could either be
decoded as aba, ca, or ad. To avoid this ambiguity we could add a special stop symbol to the
end of each codeword (e.g., a 2 in a 3-valued alphabet), or send a length before each symbol.
These solutions, however, require sending extra data. A more efficient solution is to design codes
in which we can always uniquely decipher a bit sequence into its code words. We will call such
codes uniquely decodable codes.
A prefix code is a special kind of uniquely decodable code in which no bit-string is a prefix
of another one, for example $\{(a, 1), (b, 01), (c, 000), (d, 001)\}$. All prefix codes are uniquely
decodable since once we get a match, there is no longer code that can also match.
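To make the ambiguity concrete, here is a small sketch (the helper names are ours) that enumerates every way a bit string can be split into codewords, and checks whether a code has the prefix property. It uses the two example codes from the text.

```python
def all_decodings(bits: str, code: dict[str, str]) -> list[str]:
    """Return every way to split `bits` into codewords of `code`."""
    if not bits:
        return [""]
    results = []
    for message, word in code.items():
        if bits.startswith(word):
            rest = all_decodings(bits[len(word):], code)
            results += [message + r for r in rest]
    return results


def is_prefix_code(code: dict[str, str]) -> bool:
    """True if no codeword is a prefix of a different codeword."""
    words = list(code.values())
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words))
        for j in range(len(words))
    )


ambiguous = {"a": "1", "b": "01", "c": "101", "d": "011"}
prefix = {"a": "1", "b": "01", "c": "000", "d": "001"}

parses = sorted(all_decodings("1011", ambiguous))  # ['aba', 'ad', 'ca']
unique = all_decodings("101000", prefix)           # ['abc']
```

The first code yields three parses of 1011, while the prefix code yields exactly one parse of its input.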
Exercise 3.1.1 Come up with an example of a uniquely decodable code that is not a prefix code.
Prefix codes actually have an advantage over other uniquely decodable codes in that we can
decipher each message without having to see the start of the next message. This is important when
sending messages of different types (e.g., from different probability distributions). In fact in certain
applications one message can specify the type of the next message, so it might be necessary to fully
decode the current message before the next one can be interpreted.
A prefix code can be viewed as a binary tree as follows:
Each message is a leaf in the tree.
The code for each message is given by following a path from the root to the leaf, and appending a 0 each time a left branch is taken, and a 1 each time a right branch is taken.
We will call this tree a prefix-code tree. Such a tree can also be useful in decoding prefix codes. As
the bits come in, the decoder can follow a path down the tree until it reaches a leaf, at which point
it outputs the message and returns to the root for the next bit (or possibly the root of a different tree
for a different message type).
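The decoding procedure just described can be sketched as follows. The nested-dictionary representation of the tree is our own choice for brevity, not something the text prescribes, and it assumes messages are not themselves dictionaries.

```python
def build_tree(code: dict[str, str]) -> dict:
    """Build a prefix-code tree as nested dicts keyed by '0' (left) and '1' (right)."""
    root: dict = {}
    for message, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = message  # a leaf holds the message itself
    return root


def decode(bits: str, root: dict) -> list[str]:
    """Follow one branch per bit; emit a message and restart at each leaf."""
    out, node = [], root
    for bit in bits:
        node = node[bit]
        if not isinstance(node, dict):  # reached a leaf
            out.append(node)
            node = root
    return out


code = {"a": "1", "b": "01", "c": "000", "d": "001"}
tree = build_tree(code)
decoded = decode("1010000011", tree)  # 1 | 01 | 000 | 001 | 1
```

Note that the decoder never looks ahead: each message is output as soon as its last bit arrives, which is the advantage of prefix codes mentioned above.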
In general prefix codes do not have to be restricted to binary alphabets. We could have a prefix
code in which the bits have 3 possible values, in which case the corresponding tree would be
ternary. In this chapter we only consider binary codes.
Given a probability distribution on a set of messages and associated variable length code $C$, we
define the average length of the code as
\[ l_a(C) = \sum_{(s,w) \in C} p(s)\, l(w) \]
where $l(w)$ is the length of the codeword $w$. We say that a prefix code $C$ is an optimal prefix code
if $l_a(C)$ is minimized (i.e., there is no other prefix code for the given probability distribution that
has a lower average length).
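3.1.1 Relationship to Entropy
Lemma 3.1.1 (Kraft-McMillan Inequality) For any uniquely decodable code $C$,
\[ \sum_{(s,w) \in C} 2^{-l(w)} \le 1, \]
where $l(w)$ is the length of the codeword $w$. Also, for any set of lengths $L$ such that
\[ \sum_{l \in L} 2^{-l} \le 1, \]
there is a prefix code $C$ of the same size such that $l(w_i) = l_i$ $(i = 1, \ldots, |L|)$.
The proof of this is left as a homework assignment. Using this we show the following: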
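Lemma 3.1.2 For any message set $S$ with a probability distribution and associated uniquely decodable code $C$,
\[ H(S) \le l_a(C). \]
Proof: In the following equations, for a message $s \in S$, $l(s)$ refers to the length of the associated codeword in $C$.
\begin{align*}
H(S) - l_a(C) &= \sum_{s \in S} p(s) \log_2 \frac{1}{p(s)} - \sum_{s \in S} p(s)\, l(s) \\
&= \sum_{s \in S} p(s) \left( \log_2 \frac{1}{p(s)} - l(s) \right) \\
&= \sum_{s \in S} p(s) \left( \log_2 \frac{1}{p(s)} - \log_2 2^{l(s)} \right) \\
&= \sum_{s \in S} p(s) \log_2 \frac{2^{-l(s)}}{p(s)} \\
&\le \log_2 \left( \sum_{s \in S} 2^{-l(s)} \right) \\
&\le 0
\end{align*}
The second to last line is based on Jensen's inequality, which states that if a function $f(x)$ is concave
then $\sum_i p_i f(x_i) \le f\left(\sum_i p_i x_i\right)$, where the $p_i$ are positive probabilities. The logarithm function is
concave. The last line uses the Kraft-McMillan inequality.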
This theorem says that entropy is a lower bound on the average code length. We now also show
an upper bound based on entropy for optimal prefix codes.
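Lemma 3.1.3 For any message set $S$ with a probability distribution and associated optimal prefix
code $C$,
\[ l_a(C) \le H(S) + 1. \]
Proof: Take each message $s \in S$ and assign it a length $l(s) = \lceil \log_2 \frac{1}{p(s)} \rceil$. We have
\[ \sum_{s \in S} 2^{-l(s)} = \sum_{s \in S} 2^{-\lceil \log_2 (1/p(s)) \rceil} \le \sum_{s \in S} 2^{-\log_2 (1/p(s))} = \sum_{s \in S} p(s) = 1. \]
Therefore by the Kraft-McMillan inequality there is a prefix code $C'$ with codewords of length
$l(s)$. Now
\begin{align*}
l_a(C') &= \sum_{(s,w) \in C'} p(s)\, l(w) \\
&= \sum_{(s,w) \in C'} p(s) \left\lceil \log_2 \frac{1}{p(s)} \right\rceil \\
&\le \sum_{(s,w) \in C'} p(s) \left( 1 + \log_2 \frac{1}{p(s)} \right) \\
&= 1 + \sum_{(s,w) \in C'} p(s) \log_2 \frac{1}{p(s)} \\
&= 1 + H(S)
\end{align*}
By the definition of optimal prefix codes, $l_a(C) \le l_a(C')$.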
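The two bounds can be checked numerically for a particular distribution. The sketch below uses the codeword lengths $\lceil \log_2 (1/p(s)) \rceil$ from the proof; the distribution itself and the function names are illustrative assumptions of ours.

```python
import math


def entropy(probs):
    """Entropy in bits of a discrete distribution with positive probabilities."""
    return sum(p * math.log2(1.0 / p) for p in probs)


def shannon_lengths(probs):
    """Codeword lengths l(s) = ceil(log2(1/p(s))) used in the upper-bound proof."""
    return [math.ceil(math.log2(1.0 / p)) for p in probs]


probs = [0.5, 0.2, 0.2, 0.1]           # illustrative distribution
lengths = shannon_lengths(probs)        # [1, 3, 3, 4]

# Kraft-McMillan sum: at most 1, so a prefix code with these lengths exists.
kraft_sum = sum(2.0 ** -l for l in lengths)

# Average length of that code, to compare against the entropy bounds.
avg_len = sum(p * l for p, l in zip(probs, lengths))
H = entropy(probs)
```

For this distribution the average length lands between $H(S)$ and $H(S) + 1$, as the two lemmas require. An optimal prefix code can only do better than these lengths, never worse.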
Another property of optimal prefix codes is that larger probabilities can never lead to longer
codes, as shown by the following theorem. This theorem will be useful later.
Theorem 3.1.1 If $C$ is an optimal prefix code for the probabilities $\{p_1, p_2, \ldots, p_n\}$ then $p_i > p_j$
implies that $l(c_i) \le l(c_j)$.
Proof: Assume $l(c_i) > l(c_j)$. Now consider the code gotten by switching $c_i$ and $c_j$. If $l_a$ is the
average length of our original code, this new code will have length
\begin{align*}
l_a' &= l_a + p_j (l(c_i) - l(c_j)) + p_i (l(c_j) - l(c_i)) \tag{2} \\
&= l_a + (p_j - p_i)(l(c_i) - l(c_j)) \tag{3}
\end{align*}
Given our assumptions the term $(p_j - p_i)(l(c_i) - l(c_j))$ is negative, which contradicts the assumption
that $C$ is an optimal prefix code.
3.2 Huffman Codes
David Huffman developed the Huffman coding algorithm as a student in a
class on information theory at MIT in 1950. The algorithm is now probably the most prevalently
used component of compression algorithms, used as the back end of GZIP, JPEG and many other
utilities.
The Huffman algorithm is very simple and is most easily described in terms of how it generates
the prefix-code tree.
Start with a forest of trees, one for each message. Each tree contains a single vertex with
weight $w_i = p_i$.
Repeat until only a single tree remains:
  Select two trees with the lowest weight roots ($w_1$ and $w_2$).
  Combine them into a single tree by adding a new root with weight $w_1 + w_2$, and making
  the two trees its children. It does not matter which is the left or right child, but our
  convention will be to put the lower weight root on the left if $w_1 \ne w_2$.
For a code of size $n$ this algorithm will require $n - 1$ steps since every complete binary tree with
$n$ leaves has $n - 1$ internal nodes, and each step creates one internal node. If we use a priority queue
with $O(\log n)$ time insertions and find-mins (e.g., a heap) the algorithm will run in $O(n \log n)$ time.
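The steps above can be sketched with a binary heap as the priority queue. This is one possible rendering, not a definitive implementation: the function name, the tuple representation of trees, and the tie-breaking counter are our own choices, and messages are assumed not to be tuples themselves.

```python
import heapq
from itertools import count


def huffman_code(probs: dict[str, float]) -> dict[str, str]:
    """Return a prefix code {message: bit string} built by the Huffman algorithm."""
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    # Heap entries are (weight, tiebreak, tree); a tree is either a message
    # (a leaf) or a (left, right) pair of subtrees.
    heap = [(w, next(tiebreak), m) for m, w in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)  # lowest weight root goes on the left
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (t1, t2)))

    code: dict[str, str] = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")  # left branch appends a 0
            walk(tree[1], prefix + "1")  # right branch appends a 1
        else:
            code[tree] = prefix

    walk(heap[0][2], "")
    return code


probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(probs)
avg_len = sum(probs[m] * len(w) for m, w in code.items())
```

For this distribution every probability is a power of 1/2, so the average length equals the entropy exactly (1.75 bits). Each loop iteration removes two trees and adds one, so the loop runs $n - 1$ times, matching the step count in the text.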
The key property of Huffman codes is that they generate optimal prefix codes. We show this in
the following theorem, originally given by Huffman.
Lemma 3.2.1 The Huffman algorithm generates an optimal prefix code.
Proof: The proof will be on induction of the number of messages in the code. In particular we
will show that if the Huffman code generates an optimal prefix code for all probability distributions
of $n$ messages, then it generates an optimal prefix code for all distributions of $n + 1$ messages.
The base case is trivial since the prefix code for 1 message is unique (i.e., the null message) and
therefore optimal.
We first argue that for any set of messages $S$ there is an optimal code for which the two minimum probability messages are siblings (have the same parent in their prefix tree). By Theorem 3.1.1
we know that the two minimum probabilities are on the lowest level of the tree (any complete binary tree has at least two leaves on its lowest level). Also, we can switch any leaves on the lowest
level without affecting the average length of the code since all these codes have the same length.
We therefore can just switch the two lowest probabilities so they are siblings.
Now for induction we consider a set of message probabilities $S$ of size $n + 1$ and the corresponding
tree $T$ built by the Huffman algorithm. Call the two lowest probability nodes in the tree
$x$ and $y$, which must be siblings in $T$ because of the design of the algorithm. Consider the tree
$T'$ gotten by replacing $x$ and $y$ with their parent, call it $z$, with probability $p_z = p_x + p_y$ (this is
effectively what the Huffman algorithm does). Let's say the depth of $z$ is $d$, then
\begin{align*}
l_a(T) &= l_a(T') + p_x (d + 1) + p_y (d + 1) - p_z d \tag{4} \\
&= l_a(T') + p_x + p_y. \tag{5}
\end{align*}
To see that $T$ is optimal, note that there is an optimal tree in which $x$ and $y$ are siblings, and that
wherever we place these siblings they are going to add a constant $p_x + p_y$ to the average length of
any prefix tree on $S$ with the pair $x$ and $y$ replaced with their parent $z$. By the induction hypothesis
$l_a(T')$ is minimized, since $T'$ is of size $n$ and built by the Huffman algorithm, and therefore $l_a(T)$
is minimized and $T$ is optimal.
Since Huffman coding is optimal we know that for any probability distribution $S$ and associated
Huffman code $C$
\[ H(S) \le l_a(C) \le H(S) + 1. \]
3.2.1 Combining Messages
multiplies the probabilities. In general by grouping messages the overhead of Huffman coding
3.2.2 Minimum Variance Huffman Codes
\[ \sigma(C) = \sum_{c \in C} p(c) \left( l(c) - l_a(C) \right)^2 \]
With lower variance it can be easier to maintain a constant character transmission rate, or reduce
the size of buffers. In the above example, code 1 clearly has a much higher variance than code 2. It
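3.3 Arithmetic Coding
For a sequence of messages with probabilities $p_1, \ldots, p_n$, the algorithm assigns the sequence to an interval of
size $\prod_{i=1}^{n} p_i$, by starting with an interval of size 1 (from 0 to 1) and narrowing the interval by a
factor of $p_i$ on each message $i$. We can bound the number of bits required to uniquely identify an
interval of size $s$, and use this to relate the length of the representation to the self information of
the messages.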
In the following discussion we assume the decoder knows when a message sequence is complete either by knowing the length of the message sequence or by including a special end-of-file
message. This was also implicitly assumed when sending a sequence of messages with Huffman
codes since the decoder still needs to know when a message sequence is over.
We will denote the probability distributions of a message set as $\{p(1), \ldots, p(m)\}$, and we
define the accumulated probability for the probability distribution as
\[ f(j) = \sum_{i=1}^{j-1} p(i) \qquad (j = 1, \ldots, m). \tag{6} \]
So, for example, the probabilities $\{.2, .5, .3\}$ correspond to the accumulated probabilities $\{0, .2, .7\}$.
[Figure 3 shows the successive intervals for the sequence babc: [0, 1.0), then [.2, .7), then [.2, .3), then [.22, .27), then [.255, .27).]
Figure 3: An example of generating an arithmetic code assuming all messages are from the same
probability distribution $a = .2$, $b = .5$ and $c = .3$. The interval given by the message sequence
$babc$ is $[.255, .27)$.
A
Since we will often be talking about sequences of messages, each possibly from a different
probability distribution, we will denote the probability distribution of the $i$th message as
$\{p_i(1), \ldots, p_i(m_i)\}$, and the accumulated probabilities as $\{f_i(1), \ldots, f_i(m_i)\}$. For a particular
sequence of message values, we denote the index of the $i$th message value as $v_i$. We will use the
shorthand $p_i$ for $p_i(v_i)$ and $f_i$ for $f_i(v_i)$.
Arithmetic coding assigns an interval to a sequence of messages using the following recurrences
\[ l_i = \begin{cases} f_i & i = 1 \\ l_{i-1} + f_i \cdot s_{i-1} & 1 < i \le n \end{cases} \]
\[ s_i = \begin{cases} p_i & i = 1 \\ s_{i-1} \cdot p_i & 1 < i \le n \end{cases} \tag{7} \]
where $l_n$ is the lower bound of the interval and $s_n$ is the size of the interval, i.e. the interval is given
by $[l_n, l_n + s_n)$. We assume the interval is inclusive of the lower bound, but exclusive of the upper
bound. The recurrence narrows the interval on each step to some part of the previous interval. Since
the interval starts in the range [0,1), it always stays within this range. An example of generating
an interval for a short message sequence is illustrated in Figure 3. An important property of the
intervals generated by Equation 7 is that all unique message sequences of length $n$ will have
non-overlapping intervals. Specifying an interval therefore uniquely determines the message sequence.
In fact, any number within an interval uniquely determines the message sequence. The job of
decoding is basically the same as encoding but instead of using the message value to narrow the
interval, we use the interval to select the message value, and then narrow it. We can therefore
send a message sequence by specifying a number within the corresponding interval.
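The recurrences and the decoding idea can be sketched with exact rational arithmetic. The distribution below (a = .2, b = .5, c = .3) is assumed to match the running example, and the function names are ours. Starting from $l = 0$, $s = 1$ and applying the general step gives the same result as the $i = 1$ base case of the recurrences.

```python
from fractions import Fraction as F

# Assumed distribution for the running example: a = .2, b = .5, c = .3.
p = {"a": F(2, 10), "b": F(5, 10), "c": F(3, 10)}


def accumulated(p):
    """f(j): sum of the probabilities of all messages listed before j."""
    f, total = {}, F(0)
    for m, pm in p.items():
        f[m] = total
        total += pm
    return f


def sequence_interval(msgs, p):
    """Return (l, s): lower bound and size of the interval for msgs."""
    f = accumulated(p)
    l, s = F(0), F(1)
    for m in msgs:
        l, s = l + f[m] * s, s * p[m]
    return l, s


def decode_interval(x, n, p):
    """Recover n messages from any number x inside the sequence interval."""
    f = accumulated(p)
    l, s, out = F(0), F(1), []
    for _ in range(n):
        for m in p:
            lo = l + f[m] * s
            if lo <= x < lo + p[m] * s:  # x falls in this message's sub-interval
                out.append(m)
                l, s = lo, s * p[m]
                break
    return "".join(out)


l, s = sequence_interval("babc", p)               # l = 51/200, s = 3/200
recovered = decode_interval(F(26, 100), 4, p)     # .26 lies in [.255, .27)
```

Exact fractions are used here only to keep the example free of rounding concerns; the precision they need grows with the sequence length, which is the practical problem discussed later in this section.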
The question remains of how to efficiently send a sequence of bits that represents the interval,
or a number within the interval. Real numbers between 0 and 1 can be represented in binary
fractional notation as $.b_1 b_2 b_3 \ldots$. For example $.75 = .11$, $9/16 = .1001$ and $1/3 = .\overline{01}$, where
$\overline{w}$ means that the sequence $w$ is repeated infinitely. We might therefore think that it is adequate
to represent each interval by selecting the number within the interval which has the fewest bits in
binary fractional notation, and use that as the code. For example, if we had the intervals $[0, .33)$,
$[.33, .67)$, and $[.67, 1)$ we would represent these with $.01$ $(1/4)$, $.1$ $(1/2)$, and $.11$ $(3/4)$. It is not hard
to show that for an interval of size $s$ we need at most $\lceil -\log_2 s \rceil$ bits to represent such a number.
The problem is that these codes are not a set of prefix codes. If you sent me 1 in the above example,
I would not know whether to wait for another 1 or interpret it immediately as the interval $[.33, .67)$.
To avoid this problem we interpret every binary fractional codeword as an interval itself. In
particular as the interval of all possible completions. For example, the codeword $.010$ would represent
the interval $[1/4, 3/8)$ since the smallest possible completion is $.010\overline{0} = 1/4$ and the largest
possible completion is $.010\overline{1} = 3/8 - \epsilon$. Since we now have several kinds of intervals running
around, we will use the following terms to distinguish them. We will call the current interval of the
message sequence (i.e. $[l_i, l_i + s_i)$) the sequence interval, the interval corresponding to the probability
of the $i$th message (i.e., $[f_i, f_i + p_i)$) the message interval, and the interval of a codeword the
code interval.
An important property of code intervals is that there is a direct correspondence between whether
intervals overlap and whether they form prefix codes, as the following lemma shows.
Lemma 3.3.1 For a code $C$, if no two intervals represented by its binary codewords $w \in C$
overlap then the code is a prefix code.
$.11$ $[.75, 1)$ are adequate. In general for an interval of size $s$ we can always find a codeword of
length $\lceil -\log_2 s \rceil + 1$, as shown by the following lemma.
Lemma 3.3.2 For any $l$ and an $s$ such that $l, s \ge 0$ and $l + s < 1$, the interval represented by
taking the binary fractional representation of $l + s/2$ and truncating it to $\lceil -\log_2 s \rceil + 1$ bits is
contained in the interval $[l, l + s)$.
Proof: A binary fractional representation with $k$ digits represents an interval of size less than $2^{-k}$
since the difference between the minimum and maximum completions are all 1s starting at the
$(k + 1)$th location. This has a value $2^{-k} - \epsilon$. The interval size of a $\lceil -\log_2 s \rceil + 1$ bit representation
is therefore less than $s/2$. Since we truncate $l + s/2$ downwards the upper bound of the interval
represented by the bits is less than $l + s$. Truncating the representation of a number to $\lceil -\log_2 s \rceil + 1$
bits can have the effect of reducing it by at most $s/2$. Therefore the lower bound of truncating
$l + s/2$ is at least $l$. The interval is therefore contained in $[l, l + s)$.
We will call the algorithm made up of generating an interval by Equation 7 and then using the
truncation method of Lemma 3.3.2, the RealArithCode algorithm.
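The truncation step of the lemma can be sketched as follows, again with exact fractions. The function names are ours, and the interval used is an assumed example with lower bound 51/200 and size 3/200.

```python
from fractions import Fraction as F


def codeword_length(s: F) -> int:
    """ceil(-log2 s) + 1, computed exactly as (smallest j with 2**-j <= s) + 1."""
    j = 0
    while F(1, 2 ** j) > s:
        j += 1
    return j + 1


def truncated_codeword(l: F, s: F) -> str:
    """Binary fractional digits of l + s/2, truncated to codeword_length(s) bits."""
    x, bits = l + s / 2, []
    for _ in range(codeword_length(s)):
        x *= 2
        bit = int(x >= 1)
        bits.append(str(bit))
        x -= bit
    return "".join(bits)


def code_interval(word: str) -> tuple[F, F]:
    """Interval [low, high) covering all completions of the binary fraction .word"""
    low = F(int(word, 2), 2 ** len(word))
    return low, low + F(1, 2 ** len(word))


l, s = F(51, 200), F(3, 200)       # assumed sequence interval [.255, .27)
word = truncated_codeword(l, s)    # 8 bits, since ceil(-log2 .015) = 7
low, high = code_interval(word)
```

The resulting code interval sits entirely inside the sequence interval, so every completion of the codeword still identifies the same message sequence.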
Theorem 3.3.1 For a sequence of $n$ messages, with self informations $s_1, \ldots, s_n$, the length of the
arithmetic code generated by RealArithCode is bounded by $2 + \sum_{i=1}^{n} s_i$, and the code will not be
a prefix of any other sequence of $n$ messages.
Proof: Equation 7 will generate a sequence interval of size $s = \prod_{i=1}^{n} p_i$. Now by Lemma 3.3.2
we know an interval of size $s$ can be represented by $1 + \lceil -\log_2 s \rceil$ bits, so we have
\begin{align*}
1 + \lceil -\log_2 s \rceil &= 1 + \left\lceil -\log_2 \prod_{i=1}^{n} p_i \right\rceil \\
&= 1 + \left\lceil \sum_{i=1}^{n} -\log_2 p_i \right\rceil \\
&= 1 + \left\lceil \sum_{i=1}^{n} s_i \right\rceil \\
&< 2 + \sum_{i=1}^{n} s_i
\end{align*}
The claim that the code is not a prefix of other messages is taken directly from Lemma 3.3.1.
The decoder for RealArithCode needs to read the input bits on demand so that it can determine
when the input string is complete. In particular it loops for $n$ iterations, where $n$ is the number of
messages in the sequence. On each iteration it reads enough input bits to narrow the code interval to
within one of the possible message intervals, narrows the sequence interval based on that message,
and outputs that message. When complete, the decoder will have read exactly all the characters
generated by the coder. We give a more detailed description of decoding along with the integer
implementation described below.
From a practical point of view there are a few problems with the arithmetic coding algorithm
we described so far. First, the algorithm needs arbitrary precision arithmetic to manipulate $l$ and $s$.
Manipulating these numbers can become expensive as the intervals get very small and the number
of significant bits gets large. Another problem is that as described the encoder cannot output any
bits until it has coded the full message. It is actually possible to interleave the generation of
the interval with the generation of its bit representation by opportunistically outputting a 0 or 1
whenever the interval falls within the lower or upper half. This technique, however, still does not
guarantee that bits are output regularly. In particular if the interval keeps reducing in size but still
straddles 0.5, then the algorithm cannot output anything. In the worst case the algorithm might still
have to wait until the whole sequence is received before outputting any bits. To avoid this problem
many implementations of arithmetic coding break message sequences into fixed size blocks and
use arithmetic coding on each block separately. This approach also has the advantage that since
the group size is fixed, the encoder need not send the number of messages, except perhaps for the
last group which could be smaller than the block size.
3.3.1 Integer Implementation
It turns out that if we are willing to give up a little bit in the efficiency of the coding, we can
use fixed precision integers for arithmetic coding. This implementation does not give precise
arithmetic codes, because of roundoff errors, but if we make sure that both the coder and decoder
are always rounding in the same way the decoder will always be able to precisely interpret the
message.
For this algorithm we assume the probabilities are given as counts $c(1), c(2), \ldots$, with cumulative
counts $f(i) = \sum_{j=1}^{i-1} c(j)$, and we denote the total count as $T = \sum_j c(j)$.
Using counts avoids the need for fractional or real representations of the probabilities. Instead of
using intervals between 0 and 1, we will use intervals between $0$ and $R - 1$, where $R = 2^k$
(i.e., $R$ is a power of 2). There is the additional restriction that $R > 4T$.
The coding algorithm is given in Figure 4. The current sequence interval is specified by the
integers $l$ (lower) and $u$ (upper), and the corresponding interval is $[l, u + 1)$. The size of the
interval $s$ is therefore $u - l + 1$. The main idea of this algorithm is to always keep the size greater
than $R/4$ by expanding the interval whenever it gets too small. This is what the inner while loop does.
In this loop whenever the sequence interval falls completely within the top half of the region (from
$R/2$ to $R$) we know that the next bit is going to be a 1 since intervals can only shrink. We can
therefore output a 1 and expand the top half to fill the region. Similarly if the sequence interval
falls completely within the bottom half we can output a 0 and expand the bottom half of the region
to fill the full region.
The third case is when the interval falls within the middle half of the region (from $R/4$ to
$3R/4$). In this case the algorithm cannot output a bit since it does not know whether the bit will
be a 0 or a 1. It can, however, expand the middle region and keep track that it has expanded by
incrementing a count $m$. Now when the algorithm does expand around the top (bottom), it outputs a 1 (0)
followed by $m$ 0s (1s). To see why this is the right thing to do, consider expanding around the middle
$m$ times and then around the top. The first expansion around the middle locates the interval between
$R/4$ and $3R/4$ of the initial region, and the second between $3R/8$ and $5R/8$. After $m$ expansions the
interval is narrowed to the region $(R/2 - R/2^{m+1}, R/2 + R/2^{m+1})$. Now when we expand around
the top we narrow the interval to $(R/2, R/2 + R/2^{m+1})$. All intervals contained in this range will
start with a 1 followed by $m$ 0s.
Another interesting aspect of the algorithm is how it finishes. As in the case of real-number
arithmetic coding, to make it possible to decode, we want to make sure that the code (bit pattern)
for any one message sequence is not a prefix of the code for another message sequence. As before,
the way we do this is to make sure the code interval is fully contained in the sequence interval.
When the integer arithmetic coding algorithm (Figure 4) exits the for loop, we know the sequence
interval $[l, u]$ completely covers either the second quarter (from $R/4$ to $R/2$) or the third quarter
(from $R/2$ to $3R/4$) since otherwise one of the expansion rules would have been applied. The
algorithm therefore simply determines which of these two regions the sequence interval covers
and outputs code bits that narrow the code interval to one of these two quarters: a 01 for the
second quarter, since all completions of 01 are in the second quarter, and a 10 for the third quarter.
function IntArithCode(file, k, n)
    R = 2^k
    l = 0
    u = R - 1
    m = 0
    for i = 1 to n
        s = u - l + 1
        u = l + floor(s * f(v_i + 1) / T_i) - 1
        l = l + floor(s * f(v_i) / T_i)
        while true
            if (l >= R/2)                        // interval in top half
                WriteBit(1)
                u = 2u - R + 1;  l = 2l - R
                for j = 1 to m WriteBit(0)
                m = 0
            else if (u < R/2)                    // interval in bottom half
                WriteBit(0)
                u = 2u + 1;  l = 2l
                for j = 1 to m WriteBit(1)
                m = 0
            else if (l >= R/4 and u < 3R/4)      // interval in middle half
                u = 2u - R/2 + 1;  l = 2l - R/2
                m = m + 1
            else break                           // exit while loop
        end while
    end for
    if (l >= R/4)                                // output final bits
        WriteBit(1)
        for j = 1 to m WriteBit(0)
        WriteBit(0)
    else
        WriteBit(0)
        for j = 1 to m WriteBit(1)
        WriteBit(1)
Figure 4: Integer Arithmetic Coding.
After outputting the first of these two bits the algorithm must also output $m$ bits corresponding to
previous expansions around the middle.
The reason that $R$ needs to be at least $4T$ is that the sequence interval can become as small as
slightly more than $R/4$ without falling completely within any of the three halves. To be able to
resolve the counts $c(i)$, this interval has to be at least as large as $T$.
An example: Here we consider an example of encoding a sequence of messages each from the
same probability distribution, given by the following counts.
$$c(1) = 1, \quad c(2) = 10, \quad c(3) = 20$$
The cumulative counts are
$$f(1) = 0, \quad f(2) = 1, \quad f(3) = 11$$
and $T = 31$. We will choose $k = 8$, so that $R = 256$. This satisfies the requirement that $R > 4T$.
Now consider coding the message sequence $3, 2, 1, 2$. Figure 5 illustrates the steps taken in coding
this message sequence. The full code that is output is 0101111101, which is of length 10. The
sum of the self-information of the messages is
$$-\left(\log_2 \tfrac{20}{31} + \log_2 \tfrac{10}{31} + \log_2 \tfrac{1}{31} + \log_2 \tfrac{10}{31}\right) = 8.85 .$$
For this sequence the code length is within the bound of $2 + 8.85$ given by Theorem 3.3.1. In general,
however, the integer implementation is not guaranteed to meet that bound: we are not
generating an exact arithmetic code and we are losing some coding efficiency to roundoff.
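The encoder can be transcribed fairly directly into runnable code. The following is a sketch of our reading of Figure 4, not a definitive implementation: it assumes a single fixed distribution for all messages, message values numbered from 1, and returns the bits as a string rather than writing to a file. The name `int_arith_encode` and the helper `emit` are ours.

```python
def int_arith_encode(messages, counts, k):
    """messages: 1-based message values; counts[i-1] is the count c(i)."""
    R = 1 << k
    f = [0]
    for c in counts:
        f.append(f[-1] + c)            # f[i] = c(1) + ... + c(i), so f[-1] = T
    T = f[-1]
    assert R > 4 * T, "R must exceed 4T"
    l, u, m = 0, R - 1, 0
    out = []

    def emit(bit):
        # Write the bit, then the m pending opposite bits from middle expansions.
        nonlocal m
        out.append(str(bit))
        out.extend(str(1 - bit) * m)
        m = 0

    for v in messages:
        s = u - l + 1
        u = l + s * f[v] // T - 1      # uses the old l
        l = l + s * f[v - 1] // T
        while True:
            if l >= R // 2:                              # top half
                emit(1)
                l, u = 2 * l - R, 2 * u - R + 1
            elif u < R // 2:                             # bottom half
                emit(0)
                l, u = 2 * l, 2 * u + 1
            elif l >= R // 4 and u < 3 * R // 4:         # middle half
                l, u = 2 * l - R // 2, 2 * u - R // 2 + 1
                m += 1
            else:
                break
    # Final bits: pick the quarter that the sequence interval covers.
    if l >= R // 4:
        out.append("1"); out.extend("0" * m); out.append("0")
    else:
        out.append("0"); out.extend("1" * m); out.append("1")
    return "".join(out)

code = int_arith_encode([3, 2, 1, 2], counts=[1, 10, 20], k=8)
print(code, len(code))
```

Running it on the counts 1, 10, 20 with $k = 8$ lets the reader step through the contractions and expansions one message at a time.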
We now consider how to decode a message sent using the integer arithmetic coding algorithm.
The code is given in Figure 6. The idea is to keep separate lower and upper bounds for the code
interval ($l_b$ and $u_b$) and the sequence interval ($l$ and $u$). The algorithm reads one bit at a time and
reduces the code interval by half for each bit that is read (the bottom half when the bit is a 0 and the
top half when it is a 1). Whenever the code interval falls within an interval for the next message,
the message is output and the sequence interval is reduced by the message interval. This reduction
is followed by the same set of expansions around the top, bottom and middle halves as followed by
the encoder. The sequence intervals therefore follow the exact same set of lower and upper bounds
as when they were coded. This property guarantees that all rounding happens in the same way for
both the coder and decoder, and is critical for the correctness of the algorithm. It should be noted
that reduction and expansion of the code interval is always exact since these are always changed
by powers of 2.
i      v_i  f(v_i)  f(v_i+1)     l    u    s   m   expand rule      output
start                            0  255  256   0
1       3     11       31       90  255  166   0
2       2      1       11       95  147   53   0
+                               62  167  106   1   (middle half)
3       1      0        1       62   64    3   1
+                              124  129    6   0   (bottom half)    01
+                              120  131   12   1   (middle half)
+                              112  135   24   2   (middle half)
+                               96  143   48   3   (middle half)
+                               64  159   96   4   (middle half)
+                                0  191  192   5   (middle half)
4       2      1       11        6   67   62   5
+                               12  135  124   0   (bottom half)    011111
end                                            0   (final out)      01
Figure 5: Example of integer arithmetic coding. The rows represent the steps of the algorithm.
Each row starting with a number represents the application of a contraction based on the next
message, and each row with a + represents the application of one of the expansion rules.
transforming the data before coding (e.g., run-length coding, move-to-front coding, and residual
coding), or directly using conditional probabilities based on a context (JBIG and PPM).
An issue to consider about a model is whether it is static or dynamic. A model can be static
over all message sequences. For example one could predetermine the frequency of characters in
text and hardcode those probabilities into the encoder and decoder. Alternatively, the model can
be static over a single message sequence. The encoder executes one pass over the sequence to
determine the probabilities, and then a second pass to use those probabilities in the code. In this
case the encoder needs to send the probabilities to the decoder. This is the approach taken by most
vector quantizers. Finally, the model can be dynamic over the message sequence. In this case the
encoder updates its probabilities as it encodes messages. To make it possible for the decoder to
determine the probability based on previous messages, it is important that for each message, the
encoder codes it using the old probability and then updates the probability based on the message.
The advantages of this approach are that the coder need not send additional probabilities, and that
it can adapt to the sequence as it changes. This approach is taken by PPM.
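The ordering constraint for a dynamic model (code with the old probability, then update) can be made concrete with a small sketch. The class `AdaptiveModel` and the function `ideal_code_length` are hypothetical names of ours, and starting every count at 1 is just one simple way to avoid zero probabilities; it is not a scheme taken from the text.

```python
from math import log2

class AdaptiveModel:
    """Frequency counts that both encoder and decoder can maintain in lockstep."""
    def __init__(self, alphabet):
        # Assumption: every symbol starts with count 1.
        self.counts = {a: 1 for a in alphabet}

    def prob(self, symbol):
        return self.counts[symbol] / sum(self.counts.values())

    def update(self, symbol):
        self.counts[symbol] += 1

def ideal_code_length(message, alphabet):
    """Sum of self-informations when each symbol is coded before the update."""
    model = AdaptiveModel(alphabet)
    bits = 0.0
    for ch in message:
        bits += -log2(model.prob(ch))   # code with the old probability ...
        model.update(ch)                # ... then update, exactly as the decoder will
    return bits

print(round(ideal_code_length("aaaa", "ab"), 3))
```

Because the decoder only ever sees the symbols already decoded, it can replay the same `update` calls and arrive at the same probabilities for the next symbol.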
Figure 7 illustrates several aspects of our general framework. It shows, for example, the interaction of the model and the coder. In particular, the model generates the probabilities for each possible message, and the coder uses these probabilities along with the particular message to generate
the codeword. It is important to note that the model has to be identical on both sides. Furthermore
the model can only use previous messages to determine the probabilities. It cannot use the current
message since the decoder does not have this message and therefore could not generate the same
probability distribution. The transform has to be invertible.
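function IntArithDecode(file, k, n)
    R = 2^k
    l = 0;  u = R - 1                            // sequence interval
    l_b = 0;  u_b = R - 1                        // code interval
    j = 1                                        // message number
    while j <= n do
        s = u - l + 1
        i = 0
        do      // find if the code interval is within one of the message intervals
            i = i + 1
            u' = l + floor(s * f(i + 1) / T_j) - 1
            l' = l + floor(s * f(i) / T_j)
            fits = (l_b >= l') and (u_b <= u')
        while i < m and not fits                 // m is the number of possible message values
        if not fits then
            // halve the size of the code interval by reading a bit
            b = ReadBit(file)
            s_b = u_b - l_b + 1
            l_b = l_b + b (s_b / 2)
            u_b = l_b + s_b / 2 - 1
        else
            Output(i)                            // output the message in which the code interval fits
            u = u';  l = l'                      // adjust the sequence interval
            j = j + 1
            while true
                if (l >= R/2)                    // sequence interval in top half
                    u = 2u - R + 1;  l = 2l - R
                    u_b = 2u_b - R + 1;  l_b = 2l_b - R
                else if (u < R/2)                // sequence interval in bottom half
                    u = 2u + 1;  l = 2l
                    u_b = 2u_b + 1;  l_b = 2l_b
                else if (l >= R/4 and u < 3R/4)  // sequence interval in middle half
                    u = 2u - R/2 + 1;  l = 2l - R/2
                    u_b = 2u_b - R/2 + 1;  l_b = 2l_b - R/2
                else break                       // exit inner while loop
            end while
        end if
    end while
Figure 6: Integer Arithmetic Decoding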
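A runnable sketch of the decoder, under the same assumptions as before (one fixed distribution, message values numbered from 1, bits supplied as a string), is given below. The name `int_arith_decode` is ours, and the search over message intervals is written as a simple linear scan.

```python
def int_arith_decode(bits, counts, k, n):
    """Decode n messages (1-based values) from a bit string."""
    R = 1 << k
    f = [0]
    for c in counts:
        f.append(f[-1] + c)
    T = f[-1]
    l, u = 0, R - 1          # sequence interval
    lb, ub = 0, R - 1        # code interval
    pos = 0
    out = []
    while len(out) < n:
        s = u - l + 1
        found = None
        for i in range(len(counts)):
            lo = l + s * f[i] // T
            hi = l + s * f[i + 1] // T - 1
            if lb >= lo and ub <= hi:        # code interval fits message i+1
                found = (i, lo, hi)
                break
        if found is None:
            # Halve the code interval by reading one more bit.
            b = int(bits[pos])
            pos += 1
            half = (ub - lb + 1) // 2
            lb = lb + b * half
            ub = lb + half - 1
        else:
            i, l, u = found
            out.append(i + 1)
            # Apply the same expansions as the encoder, to both intervals.
            while True:
                if l >= R // 2:
                    shift = R
                elif u < R // 2:
                    shift = 0
                elif l >= R // 4 and u < 3 * R // 4:
                    shift = R // 2
                else:
                    break
                l, u = 2 * l - shift, 2 * u - shift + 1
                lb, ub = 2 * lb - shift, 2 * ub - shift + 1
    return out

print(int_arith_decode("0101111101", counts=[1, 10, 20], k=8, n=4))
```

Since every expansion doubles both intervals and subtracts the same shift, the code interval stays exact, as noted in the text.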
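[Figure 7: block diagram with a Compress side (In, Transform, Message, Model with static and dynamic parts, Coder) and an Uncompress side (Decoder, Model with static and dynamic parts, Message, Inverse Transform, Out); the Coder passes the codeword to the Decoder.]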
run-length   white codeword   black codeword
0            00110101         0000110111
1            000111           010
2            0111             11
3            1000             10
4            1011             011
..
20           0001000          00001101000
..
64+          11011            0000001111
128+         10010            000011001000
that each message comes from the same alphabet, and starts with a total order on the alphabet
(e.g., $[a, b, c, d, \ldots]$). For each message, the first pass of the algorithm outputs the position of the
character in the current order of the alphabet, and then updates the order so that the character is at
the head (e.g., after coding c the order becomes $[c, a, b, d, \ldots]$). This is repeated for the full
message sequence. The second pass converts the sequence of integers into a bit sequence using Huffman
or Arithmetic coding.
The hope is that equal characters often appear close to each other in the message sequence so
that the integers will be biased to have low values. This will give a skewed probability distribution
and good compression.
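The first pass described above is short enough to write out in full. This is a minimal sketch with positions numbered from 1; the function names `mtf_encode` and `mtf_decode` are ours.

```python
def mtf_encode(message, alphabet):
    """Return the 1-based position of each character, moving it to the front."""
    order = list(alphabet)
    out = []
    for ch in message:
        i = order.index(ch)
        out.append(i + 1)
        order.insert(0, order.pop(i))    # move the character to the head
    return out

def mtf_decode(positions, alphabet):
    """Invert mtf_encode by replaying the same reordering."""
    order = list(alphabet)
    out = []
    for p in positions:
        ch = order.pop(p - 1)
        out.append(ch)
        order.insert(0, ch)
    return "".join(out)

codes = mtf_encode("ccbbaa", "abcd")
print(codes, mtf_decode(codes, "abcd"))
```

Repeated characters always code as position 1, which is the source of the bias toward low values.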
values. The idea of residual coding is that the encoder tries to guess the next message value based
on the previous context and then outputs the difference between the actual and guessed value. This
is called the residual. The hope is that this residual is biased toward low values so that it can be
effectively compressed. Assuming the decoder has already decoded the previous context, it can
make the same guess as the coder and then use the residual it receives to correct the guess. By not
specifying the residual to its full accuracy, residual coding can also be used for lossy compression.
Residual coding is used in JPEG lossless (JPEG LS), which is used to compress both grey
scale and color images. Here we discuss how it is used on grey scale images. Color images can
simply be compressed by compressing each of the three color planes separately. The algorithm
compresses images in raster order: the pixels are processed starting at the top-most row of an
image from left to right and then the next row, continuing down to the bottom. When guessing a
pixel the encoder and decoder therefore have at their disposal the pixels to the left in the current
row and all the pixels above it in the previous rows. The JPEG LS algorithm just uses 4 other pixels
as a context for the guess: the pixel to the left (W), above and to the left (NW), above (N), and
above and to the right (NE). The guess works in two stages. The first stage makes the following
guess for each pixel value.
$$G = \begin{cases} \min(W, N) & \text{if } NW \ge \max(W, N) \\ \max(W, N) & \text{if } NW \le \min(W, N) \\ N + W - NW & \text{otherwise} \end{cases} \qquad (8)$$
This might look like a magical equation, but it is based on the idea of taking an average of nearby
pixels while taking account of edges. The first and second clauses capture horizontal and vertical
edges. For example if $N > W$ and $N \le NW$ this indicates a horizontal edge and $W$ is used as
the guess. The last clause captures diagonal edges.
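The first-stage guess is a three-way case analysis, sketched below. We assume the form of the predictor commonly given for LOCO-I (minimum of W and N when NW is at least their maximum, maximum when NW is at most their minimum, and N + W - NW otherwise); the function names are ours.

```python
def first_stage_guess(W, N, NW):
    """Edge-detecting guess for a pixel from its left, upper, and upper-left neighbours."""
    if NW >= max(W, N):
        return min(W, N)
    if NW <= min(W, N):
        return max(W, N)
    return N + W - NW

def residual(actual, W, N, NW):
    """Value that would be passed on to the second stage and the probability coder."""
    return actual - first_stage_guess(W, N, NW)

print(first_stage_guess(10, 20, 25), first_stage_guess(10, 20, 5), first_stage_guess(10, 20, 15))
```

In the first call N > W and NW >= N, so the guess is W, matching the horizontal-edge example in the text.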
Given an initial guess $G$, a second pass adjusts that guess based on local gradients. It uses
the three gradients between the pairs of pixels $(NW, W)$, $(NW, N)$, and $(N, NE)$. Based on the
value of the gradients (the difference between the two adjacent pixels) each is classified into one of
9 groups. This gives a total of 729 contexts, of which only 365 are needed because of symmetry.
Each context stores its own adjustment value which is used to adjust the guess. Each context also
stores information about the quality of previous guesses in that context. This can be used to predict
variance and can help the probability coder. Once the algorithm has the final guess for the pixel, it
determines the residual and codes it.
This algorithm is based on the LOCO-I (LOw COmplexity LOssless COmpression for Images) algorithm and the
official standard number is ISO-14495-1/ITU-T.87.
(a)     O O O
      O O O O A
      O O ?
(b)     O O O O O A
      O O O O ?
Figure 8: JBIG contexts: (a) three-line template, and (b) two-line template. ? is the current pixel
and A is the roaming pixel.
[Figure 9: four context templates, each combining low-resolution pixels with high-resolution pixels (O), a roaming pixel (A), and the current pixel (?).]
Figure 9: JBIG contexts for progressive transmission. The dark circles are the low resolution
pixels, the Os are the high-resolution pixels, the A is a roaming pixel, and the ? is the pixel we
want to code/decode. The four context configurations are for the four possible configurations of
the high-resolution pixel relative to the low resolution pixel.
JBIG is similar to JPEG LS in that it uses a local context of pixels to code the current pixel.
Unlike JPEG LS, however, JBIG uses conditional probabilities directly. JBIG also allows for
progressive compression: an image can be sent as a set of layers of increasing resolution. Each layer
can use the previous layer to aid compression. We first outline how the initial layer is compressed,
and then how each following layer is compressed.
The first layer is transmitted in raster order, and the compression uses a context of 10 pixels
above and to the left of the current pixel. The standard allows for two different templates for
the context as shown in Figure 8. Furthermore, the pixel marked A is a roaming pixel and can
be chosen to be any fixed distance to the right of where it is marked in the figure. This roaming
pixel is useful for getting good compression on images with repeated vertical lines. The encoder
decides on which of the two templates to use and on where to place A based on how well they
compress. This information is specified at the head of the compressed message sequence. Since
each pixel can only have two values, there are $2^{10}$ possible contexts. The algorithm dynamically
generates the conditional probabilities for a black or white pixel for each of the contexts, and uses
these probabilities in a modified arithmetic coder; the coder is optimized to avoid multiplications
and divisions. The decoder can decode the pixels since it can build the probability table in the
same way as the encoder.
The higher-resolution layers are also transmitted in raster order, but now in addition to using
a context of previous pixels in the current layer, the compression algorithm can use pixels from
the previous layer. Figure 9 shows the context templates. The context consists of 6 pixels from
the current layer, and 4 pixels from the lower resolution layer. Furthermore 2 additional bits are
needed to specify which of the four configurations the coded pixel is in relative to the previous
layer. This gives a total of 12 bits and 4096 contexts. The algorithm generates probabilities in the
same way as for the first layer, but now with some more contexts. The JBIG standard also specifies
how to generate lower resolution layers from higher resolution layers, but this won't be discussed
here.
The approach used by JBIG is not well suited for coding grey-scale images directly since the
number of possible contexts goes up as $m^p$, where $m$ is the number of grey-scale pixel values, and
$p$ is the number of pixels. For 8-bit grey-scale images and a context of size 10, the number of
possible contexts is $2^{80}$, which is far too many. The algorithm can, however, be applied to
grey-scale images indirectly by compressing each bit-position in the grey scale separately. This still
does not work well for grey-scale levels with more than 2 or 3 bits.
have counts $C(x|s)$ for every character $x$ that follows a context string $s$. The conditional
probability of $x$ in the context $s$ is then $C(x|s)/C(s)$, where $C(x|s)$ is the number of times $x$
follows $s$ and $C(s)$ is the number of times $s$ appears. The probability distributions can then be
used by a Huffman or Arithmetic coder to generate a bit sequence. For example, we might have a
dictionary with qu appearing 100 times and e appearing 45 times after qu. The conditional
probability of the e is then .45 and the coder should use about 1 bit to encode it. Note that the
probability distribution will change from character to character since each context has its own
distribution. In terms of decoding, as long as the context precedes the character being coded, the
decoder will know the context and therefore know which probability distribution to use. Because the
probabilities tend to be high, arithmetic codes work much better than Huffman codes for this approach.
There are two problems with the basic dictionary method described in the previous paragraph.
First, the dictionaries can become very large. There is no solution to this problem other than to
keep $k$ small, typically 3 or 4. A second problem is what happens if the count is zero. We cannot
use zero probabilities in any of the coding methods (they would imply infinitely long strings).
One way to get around this is to assume a probability of not having seen a sequence before and
evenly distribute this probability among the possible following characters that have not been seen.
Unfortunately this gives a completely even distribution, when in reality we might know that a is
more likely than b, even without knowing its context.
The PPM algorithm has a clever way to deal with the case when a context has not been seen
before, and is based on the idea of partial matching. The algorithm builds the dictionary on the
fly starting with an empty dictionary, and every time the algorithm comes across a string it has not
seen before it tries to match a string of one shorter length. This is repeated for shorter and shorter
lengths until a match is found. For each length $0, 1, \ldots, k$ the algorithm keeps statistics of patterns
it has seen before and counts of the following characters. In practice this can all be implemented in
a single trie. In the case of the length-0 contexts the counts are just counts of each character seen
assuming no context.
Order 0            Order 1            Order 2
Context  Counts    Context  Counts    Context  Counts
empty    a = 4     a        c = 3     ac       b = 1
         b = 2                                 c = 2
         c = 5     b        a = 2     ba       c = 1
                   c        a = 1     ca       c = 1
                            b = 2     cb       a = 2
                            c = 2     cc       a = 1
                                               b = 1
Figure 10: An example of the PPM table for $k = 2$ on the string accbaccacba.
An example table is given in Figure 10 for a string accbaccacba. Now consider following
this string with a c. Since the algorithm has the context ba followed by c in its dictionary, it can
output the c based on its probability in this context. Although we might think the probability should
be 1, since c is the only character that has ever followed ba, we need to give some probability of no
match, which we will call the escape probability. We will get back to how this probability is set
shortly. If instead of c the next character to code is an a, then the algorithm does not find a match
for a length 2 context so it looks for a match of length 1; in this case the context is the previous a.
Since a has never been followed by another a, the algorithm still does not find a match, and looks for
a match with a zero length context. In this case it finds the a and uses the appropriate probability
for a (4/11). What if the algorithm needs to code a d? In this case the algorithm does not even
find the character in the zero-length context, so it assigns the character a probability assuming all
unseen characters have even likelihood.
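The context statistics and the fall-back to shorter contexts can be sketched as follows. This is a plain dictionary of dictionaries rather than the trie mentioned above, and the names `ppm_counts` and `longest_match` are ours; escape handling is left out here.

```python
from collections import defaultdict

def ppm_counts(text, k):
    """For every context of length 0..k, count the characters that follow it."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, ch in enumerate(text):
        for order in range(k + 1):
            if i - order >= 0:
                counts[text[i - order:i]][ch] += 1
    return counts

def longest_match(counts, history, ch, k):
    """Longest context length (k down to 0) in which ch has been seen, or None."""
    for order in range(min(k, len(history)), -1, -1):
        context = history[len(history) - order:]
        if ch in counts.get(context, {}):
            return order
    return None

text = "accbaccacba"
counts = ppm_counts(text, 2)
print(dict(counts[""]), dict(counts["ba"]))
print(longest_match(counts, text, "c", 2),
      longest_match(counts, text, "a", 2),
      longest_match(counts, text, "d", 2))
```

For the string accbaccacba, a following c is found at order 2 (context ba), a following a only at order 0, and a d is not found in any context.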
Although it is easy for the encoder to know when to go to a shorter context, how is the decoder
supposed to know in which sized context to interpret the bits it is receiving? To make this possible,
the encoder must notify the decoder of the size of the context. The PPM algorithm does this by
assuming the context is of size $k$ and then sending an escape character whenever moving down
a size. In the example of coding an a given above, the encoder would send two escapes followed
by the a since the context was reduced from 2 to 0. The decoder then knows to use the probability
distribution for zero length contexts to decode the following bits.
The escape can just be viewed as a special character and given a probability within each context
as if it was any other kind of character. The question is how to assign this probability. Different
variants of PPM have different rules. PPMC uses the following scheme. It sets the count for
the escape character to be the number of different characters seen following the given context.
Order 0            Order 1            Order 2
Context  Counts    Context  Counts    Context  Counts
empty    a = 4     a        c = 3     ac       b = 1
         b = 2              $ = 1              c = 2
         c = 5                                 $ = 2
         $ = 3     b        a = 2     ba       c = 1
                            $ = 1              $ = 1
                   c        a = 1     ca       c = 1
                            b = 2              $ = 1
                            c = 2     cb       a = 2
                            $ = 3              $ = 1
                                      cc       a = 1
                                               b = 1
                                               $ = 2
Figure 11: An example of the PPMC table for $k = 2$ on the string accbaccacba. This assumes the
virtual count of each escape symbol ($) is the number of different characters that have
appeared in the context.
Figure 11 shows an example of the counts using this scheme. In this example, the probability of
no match for a context of ac is $2/(1 + 2 + 2) = .4$ while the probability for a b in that context is
$.2$. There seems to be no theoretical justification for this choice, but empirically it works well.
There is one more trick that PPM uses. This is that when switching down a context, the
algorithm can use the fact that it switched down to exclude the possibility of certain characters from
the shorter context. This effectively increases the probability of the other characters and decreases
the code length. For example, if the algorithm were to code an a, it would send two escapes, but
then could exclude the c from the counts in the zero length context. This is because there is no
way that two escapes would be followed by a c since the c would have been coded in a length 2
context. The algorithm could then give the a a probability of $4/6$ instead of $4/11$ (.58 bits instead
of 1.46 bits!).
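The escape counts and the exclusion trick amount to a few lines of arithmetic. The sketch below uses exact fractions; `ppmc_probs` and `exclude` are hypothetical helper names, and the exclusion comparison leaves the escape count out of the denominator, as the 4/6 and 4/11 figures in the text do.

```python
from fractions import Fraction
from math import log2

def ppmc_probs(context_counts):
    """Probabilities within one context, with the escape ($) counted PPMC-style."""
    distinct = len(context_counts)              # escape count = number of distinct characters
    total = sum(context_counts.values()) + distinct
    probs = {ch: Fraction(n, total) for ch, n in context_counts.items()}
    probs["$"] = Fraction(distinct, total)
    return probs

def exclude(context_counts, excluded):
    """Drop characters that an earlier escape has already ruled out."""
    return {ch: n for ch, n in context_counts.items() if ch not in excluded}

ac = {"b": 1, "c": 2}
print(ppmc_probs(ac)["$"], ppmc_probs(ac)["b"])

order0 = {"a": 4, "b": 2, "c": 5}
plain = Fraction(order0["a"], sum(order0.values()))
kept = exclude(order0, {"c"})
with_exclusion = Fraction(kept["a"], sum(kept.values()))
print(plain, round(-log2(plain), 2), with_exclusion, round(-log2(with_exclusion), 2))
```

A full PPMC coder would apply `ppmc_probs` after exclusion as well, so its exact probabilities would also account for the escape count in the shorter context.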
The Lempel-Ziv algorithms compress by building a dictionary of previously seen strings. Unlike
PPM which uses the dictionary to predict the probability of each character, and codes each
character separately based on the context, the Lempel-Ziv algorithms code groups of characters of
varying lengths. The original algorithms also did not use probabilities: strings were either in the
dictionary or not and all strings in the dictionary were given equal probability. Some of the newer
variants, such as gzip, do take some advantage of probabilities.
At the highest level the algorithms can be described as follows. Given a position in a file,
look through the preceding part of the file to find the longest match to the string starting at the
current position, and output some code that refers to that match. Now move the position past the
match. The two main variants of the algorithm were described by Ziv and Lempel in two separate
papers in 1977 and 1978, and are often referred to as LZ77 and LZ78. The algorithms differ in how
far back they search and how they find matches. The LZ77 algorithm is based on the idea of a
sliding window. The algorithm only looks for matches in a window a fixed distance back from the
current position. Gzip and ZIP are based on LZ77. The LZ78 algorithm is based on a more
conservative approach to adding strings to the dictionary. Unix compress, the GIF format, and
V.42bis (a standard modem protocol) are all based on LZ78.
In the following discussion of the algorithms we will use the term cursor to mean the position
an algorithm is currently trying to encode from.
Step   Input String                                     Output Code
1      [a] a  c  a  a  c  a  b  c  a  b  a  a  a  c     (0, 0, a)
2       a [a] c  a  a  c  a  b  c  a  b  a  a  a  c     (1, 1, c)
3       a  a  c [a] a  c  a  b  c  a  b  a  a  a  c     (3, 4, b)
4       a  a  c  a  a  c  a  b [c] a  b  a  a  a  c     (3, 3, a)
5       a  a  c  a  a  c  a  b  c  a  b  a [a] a  c     (1, 2, c)
Figure 12: An example of LZ77 with a dictionary of size 6 and a lookahead buffer of size 4. The
cursor position is shown in brackets; the dictionary is the 6 characters before it and the lookahead
buffer is the 4 characters starting at it. The last step does not find the longer match (10,3,1) since
it is outside of the window.
2. Output a triple $(p, n, c)$ containing the position $p$ of the occurrence in the window, the length
$n$ of the match and the next character $c$ past the match.
3. Move the cursor $n + 1$ characters forward.
The position $p$ can be given relative to the cursor with 0 meaning no match, 1 meaning a match
starting at the previous character, etc. Figure 12 shows an example of the algorithm on the string
aacaacabcabaaac.
To decode the message we consider a single step. Inductively we assume that the decoder has
correctly constructed the string up to the current cursor, and we want to show that given the triple
$(p, n, c)$ it can reconstruct the string up to the next cursor position. To do this the decoder can look
the string up by going back $p$ positions and taking the next $n$ characters, and then following this
with the character $c$. The one tricky case is when $n > p$, as in step 3 of the example in Figure 12.
The problem is that the string to copy overlaps the lookahead buffer, which the decoder has not
filled yet. In this case the decoder can reconstruct the message by taking $p$ characters before the
cursor and repeating them enough times after the cursor to fill in $n$ positions. If, for example, the
code was (2,7,d) and the two characters before the cursor were ab, the algorithm would place
abababa and then the d after the cursor.
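The decoding step, including the overlapping case, can be sketched in a few lines. Copying one character at a time handles the overlap without special treatment, since characters written earlier in the same copy become available to later iterations. The function name `lz77_decode` and the `prefix` parameter are ours.

```python
def lz77_decode(triples, prefix=""):
    """Decode (position, length, next_char) triples; position is relative to the cursor."""
    out = list(prefix)
    for p, n, c in triples:
        start = len(out) - p
        for i in range(n):
            # When n > p this reads characters appended earlier in this same loop.
            out.append(out[start + i])
        out.append(c)
    return "".join(out)

codes = [(0, 0, "a"), (1, 1, "c"), (3, 4, "b"), (3, 3, "a"), (1, 2, "c")]
print(lz77_decode(codes))
print(lz77_decode([(2, 7, "d")], prefix="ab"))
```

The first call replays the five output codes of the example, and the second reproduces the (2,7,d) case with ab before the cursor.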
There have been many improvements on the basic algorithm. Here we will describe several
improvements that are used by gzip.
Two formats: This improvement, often called the LZSS Variant, does not include the next character in the triple. Instead it uses two formats, either a pair with a position and length, or just a
character. An extra bit is typically used to distinguish the formats. The algorithm tries to find a
match and if it finds a match that is at least of length 3, it uses the offset, length format, otherwise
it uses the single character format. It turns out that this improvement makes a huge difference for
files that do not compress well since we no longer have to waste the position and length fields.
Huffman coding the components: Gzip uses separate Huffman codes for the offset, the length
and the character. Each uses adaptive Huffman codes.
Non greedy: The LZ77 algorithm is greedy in the sense that it always tries to find a match starting at the first character in the lookahead buffer without caring how this will affect later matches.
For some strings it can save space to send out a single character at the current cursor position and
then match on the next position, even if there is a match at the current position. For example,
consider coding the string
d
In this case LZSS would code it as (1,3,3), (0,a), (0,b). The last two letters are coded as singletons
since the match is not at least three characters long. This same buffer could instead be coded as
(0,a), (1,6,4) if the coder was not greedy. In theory one could imagine trying to optimize coding by
trying all possible combinations of matches in the lookahead buffer, but this could be costly. As a
tradeoff that seems to work well in practice, Gzip only looks ahead 1 character, and only chooses
to code starting at the next character if the match is longer than the match at the current character.
Hash Table: To quickly access the dictionary Gzip uses a hash table with every string of length
3 used as the hash keys. These keys index into the position(s) in which they occur in the file. When
trying to find a match the algorithm goes through all of the hash entries which match on the first
three characters and looks for the longest total match. To avoid long searches when the dictionary
window has many strings with the same three characters, the algorithm only searches a bucket
to a fixed length. Within each bucket, the positions are stored in an order based on the position.
This makes it easy to select the more recent match when the two longest matches are equal length.
Using the more recent match better skews the probability distribution for the offsets and therefore
decreases the average length of the Huffman codes.
5.2 Lempel-Ziv-Welch
In this section we will describe the LZW (Lempel-Ziv-Welch) variant of LZ78 since it is the one
that is most commonly used in practice. In the following discussion we will assume the algorithm
is used to encode byte streams (i.e., each message is a byte). The algorithm maintains a dictionary
of strings (sequences of bytes). The dictionary is initialized with one entry for each of the 256
possible byte values; these are strings of length one. As the algorithm progresses it will add new
strings to the dictionary such that each string is only added if a prefix one byte shorter is already in
the dictionary. For example, "John" is only added if "Joh" had previously appeared in the message
sequence.
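We will use the following interface to the dictionary. We assume that each entry of the
dictionary is given an index, where these indices are typically given out incrementally starting at 256
(the first 256 are reserved for the byte values).
C' = AddDict(C, x)        Creates a new dictionary entry by extending an existing dictionary
                          entry given by index C with the byte x. Returns the new index.
C' = GetIndex(C, x)       Returns the index of the string obtained by extending the string for
                          index C with the byte x, or -1 if there is no such entry.
W = GetString(C)          Returns the string W corresponding to index C.
Flag = IndexInDict?(C)    Returns true if the index C is in the dictionary and false otherwise.

function LZW_Encode(File)
    C = ReadByte(File)
    while C != EOF do
        x = ReadByte(File)
        C' = GetIndex(C, x)
        while C' != -1 do
            C = C'
            x = ReadByte(File)
            C' = GetIndex(C, x)
        Output(C)
        AddDict(C, x)
        C = x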
dictionary for a string starting at the current position; the inner loop finds this match. The iteration
then outputs the index for $W$ and adds the string $Wx$ to the dictionary, where $x$ is the next character
after the match. The use of a dictionary is similar to LZ77 except that the dictionary is stored
explicitly rather than as indices into a window. Since the dictionary is explicit, i.e., each index
corresponds to a precise string, LZW need not specify the length.
The decoder works since it builds the dictionary in the same way as the encoder and in general
can just look up the indices it receives in its copy of the dictionary. One problem, however, is that
the dictionary at the decoder is always one step behind the encoder. This is because the encoder
can add Wx to its dictionary at a given iteration, but the decoder will not know x until the next
message it receives. The only case in which this might be a problem is if the encoder sends an
index of an entry added to the dictionary in the previous step. This happens when the encoder
sends an index for a string W and the string is followed by WW[0], where W[0] refers to the first
character of W (i.e., the input is of the form WWW[0]). On the iteration the encoder sends the
index for W it adds WW[0] to its dictionary. On the next iteration it sends the index for WW[0]. If
this happens, the decoder will receive the index for WW[0], which it does not have in its dictionary
yet. Since it is able to decode the previous W, however, it can easily reconstruct WW[0]. This
case is handled by the else clause in LZW decode, and shown by the second example.
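The encoder and decoder described above can be sketched in Python as follows. This is a minimal illustration, not the author's code: the function names `lzw_encode` and `lzw_decode` are made up for this example, the dictionary is never pruned, and indices are returned as a plain list of integers rather than packed into bits. The `else` branch of the decoder handles the case in which the received index is not yet in the decoder's dictionary.

```python
def lzw_encode(data: bytes) -> list[int]:
    """Return the sequence of dictionary indices that encodes data."""
    # Maps (index of a string, next byte) -> index of the extended string.
    table: dict[tuple[int, int], int] = {}
    next_index = 256                      # 0..255 are the single-byte strings
    out: list[int] = []
    if not data:
        return out
    c = data[0]
    for x in data[1:]:
        if (c, x) in table:               # GetIndex succeeded: extend the match
            c = table[(c, x)]
        else:
            out.append(c)                 # Output(C)
            table[(c, x)] = next_index    # AddDict(C, x)
            next_index += 1
            c = x
    out.append(c)                         # flush the final match
    return out


def lzw_decode(codes: list[int]) -> bytes:
    """Rebuild the byte string from the indices produced by lzw_encode."""
    strings = {i: bytes([i]) for i in range(256)}
    next_index = 256
    if not codes:
        return b""
    w = strings[codes[0]]
    out = bytearray(w)
    for code in codes[1:]:
        if code in strings:
            entry = strings[code]
        else:
            # The encoder added this entry in its previous step, so it
            # must be the previous string followed by its own first byte.
            entry = w + w[:1]
        strings[next_index] = w + entry[:1]
        next_index += 1
        out += entry
        w = entry
    return bytes(out)
```

Encoding `b"abcabca"` with this sketch gives the indices a, b, c, 256, 258 (with a, b, c as byte values 97, 98, 99), and `b"aaaaaa"` gives a, 256, 257, which exercises the special case in the decoder.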
A problem with the algorithm is that the dictionary can get too large. There are several choices
of what to do. Here are some of them.
(a) Encoding
        C     x     AddDict(C,x)   Output(C)
  init  a
        a     b     256 (a,b)      a
        b     c     257 (b,c)      b
        c     a     258 (c,a)      c
  +     a     b
        256   c     259 (256,c)    256
  +     c     a
        258   EOF                  258

(b) Decoding
        C     C'    AddDict        Output
  Init  a                          a
        a     b     256 (a,b)      b
        b     c     257 (b,c)      c
        c     256   258 (c,a)      ab
        256   258   259 (256,c)    ca

Table 4: LZW Encoding and Decoding abcabca. The rows with a + for encoding are iterations
of the inner while loop.
(a) Encoding
        C     x     AddDict(C,x)   Output(C)
  init  a
        a     a     256 (a,a)      a
  +     a     a
        256   a     257 (256,a)    256
  +     a     a
  +     256   a
        257   EOF                  257

(b) Decoding
        C     C'    AddDict        Output
  Init  a                          a
        a     256   256 (a,a)      aa
        256   257   257 (256,a)    aaa

Table 5: LZW Encoding and Decoding aaaaaa. This is an example in which the decoder does
not have the index in its dictionary.
Implementing the Dictionary: One of the biggest advantages of the LZ78 algorithms and a reason
for their success is that the dictionary operations can run very quickly. Our goal is to implement the
3 dictionary operations. The basic idea is to store the dictionary as a partially filled k-ary tree such
that the root is the empty string, and any path down the tree to a node from the root specifies the
match. The path need not go to a leaf since, because of the prefix property of the LZ78 dictionary, all
paths to internal nodes must belong to strings in the dictionary. We can use the indices as pointers to
nodes of the tree (possibly indirectly through an array). To implement the GetString(C) function
we start at the node pointed to by C and follow a path from that node to the root. This requires that
every child has a pointer to its parent. To implement the GetIndex(C, x) operation we go from the
node pointed to by C and search to see if there is a child with byte-value x and return the corresponding
index. For the AddDict(C, x) operation we add a child with byte-value x to the node pointed to by
C. If we assume k is constant, the GetIndex and AddDict operations will take constant time since
they only require going down one level of the tree. The GetString operation requires |W| time to
follow the tree up to the root, but this operation is only used by the decoder, which always outputs
W after decoding it. The whole algorithm for both coding and decoding therefore requires time that is
linear in the message size.
To discuss one more level of detail, let's consider how to store the pointers. The parent pointers
are trivial to keep since each node only needs a single pointer. The children pointers are a bit more
difficult to do efficiently. One choice is to store an array of length k for each node. Each entry is
initialized to empty and then searches can be done with a single array reference, but we need k
pointers per node (k is often 256 in practice) and the memory is prohibitive. Another choice is to
use a linked list (or possibly balanced tree) to store the children. This has much better space but
requires more time to find a child (although technically still constant time since k is constant).
A compromise that can be made in practice is to use a linked list until the number of children in
a node rises above some threshold k' and then switch to an array. This would require copying the
linked list into the array when switching.
Yet another technique is to use a hash table instead of child pointers. The string being searched
for can be hashed directly to the appropriate index.
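The hash-table variant can be sketched as follows. This is an illustrative assumption about one way to do it, not the author's implementation: the class name `LZDictionary` and its method names are hypothetical, and Python's built-in `dict` stands in for the hash table, keyed by the pair (parent index, byte). Parent pointers are kept in an array so that the string for an index can be recovered by walking up to the root.

```python
class LZDictionary:
    """Sketch of the dictionary interface backed by a hash table."""

    def __init__(self) -> None:
        # Entry i is "the string of parent[i] followed by byte[i]".
        # The 256 single-byte entries have no parent (-1).
        self.parent = [-1] * 256
        self.byte = list(range(256))
        self.child: dict[tuple[int, int], int] = {}

    def get_index(self, c: int, x: int) -> int:
        """Index of string(c) + x, or -1 if it is not in the dictionary."""
        return self.child.get((c, x), -1)

    def add_dict(self, c: int, x: int) -> int:
        """Add string(c) + x and return its new index."""
        index = len(self.parent)
        self.parent.append(c)
        self.byte.append(x)
        self.child[(c, x)] = index
        return index

    def index_in_dict(self, c: int) -> bool:
        return 0 <= c < len(self.parent)

    def get_string(self, c: int) -> bytes:
        """Follow parent pointers up to the root, then reverse."""
        out = bytearray()
        while c != -1:
            out.append(self.byte[c])
            c = self.parent[c]
        out.reverse()
        return bytes(out)
```

With this layout `get_index` and `add_dict` take expected constant time, and `get_string` takes time proportional to the length of the string it returns.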
A problem with PPM is in selecting k. If we set k too large we will usually not find matches and
end up sending too many escape characters. On the other hand if we set it too low, we would not
be taking advantage of enough context. We could have the system automatically select k based on
which does the best encoding, but this is expensive. Also within a single text there might be some
very long contexts that could help predict, while most helpful contexts are short. Using a fixed k
we would probably end up ignoring the long contexts.

         (a)                      (b)                     (c)
  context      char       context       char      context       char
  (empty)      a1         ccbaccacba4   a1        cbaccacbaa1   c1
  a            c1         cbaccacbaa1   c1        ccbaccacba4   a1
  ac           c2         baccacbaac1   c2        cacbaaccba2   c3
  acc          b1         accacbaacc2   b1        baaccbacca3   c5
  accb         a2         ccacbaaccb1   a2        accbaccacb2   a4
  accba        c3         cacbaaccba2   c3        ccacbaaccb1   a2
  accbac       c4         acbaaccbac3   c4        baccacbaac1   c2
  accbacc      a3         cbaaccbacc4   a3        acbaaccbac3   c4
  accbacca     c5         baaccbacca3   c5        aaccbaccac5   b2
  accbaccac    b2         aaccbaccac5   b2        accacbaacc2   b1
  accbaccacb   a4         accbaccacb2   a4        cbaaccbacc4   a3

Figure 14: Sorting the characters a1 c1 c2 b1 a2 c3 c4 a3 c5 b2 a4 based on context: (a) each character
in its context, (b) end context moved to front, and (c) characters sorted by their context using
reverse lexicographic ordering. We use subscripts (written here as trailing digits) to distinguish
different occurrences of the same character.
Let's see if we can come up with a way to take advantage of the context that somehow automatically
adapts. Ideally we would like the method also to be a bit faster. Consider taking the string we
want to compress and looking at the full context for each character, i.e., all previous characters
from the start of the string up to the character. In fact, to make the contexts the same length, which
will be convenient later, we add to the head of each context the part of the string following the
character, making each context n - 1 characters. Examples of the context for each character of
the string accbaccacba are given in Figure 14. Now let's sort these contexts based on reverse
lexical order, such that the last character of the context is the most significant (see Figure 14c).
Note that now characters with similar contexts (preceding characters) are near each other. In
fact, the longer the match (the more preceding characters that match identically) the closer they
will be to each other. This is similar to PPM in that it prefers longer matches when grouping,
but will group things with shorter matches when the longer match does not exist. The difference
is that there is no fixed limit on the length of a match; a match of length 100 has priority over a
match of 99.
In practice the sorting based on the context is executed in blocks, rather than for the full message
sequence. This is because the full message sequence, and the additional data structures required
for sorting it, might not fit in memory. The process of sorting the characters by their context
is often referred to as a block-sorting transform. In the discussion below we will refer to the
sequence of characters generated by a block-sorting transform as the context-sorted sequence (e.g.,
c1 a1 c3 c5 a4 a2 c2 c4 b2 b1 a3 in Figure 14). Given the correlation between nearby characters in a
context-sorted sequence, we should be able to code them quite efficiently by using, for example, a
move-to-front coder (Section 4.2). For long strings with somewhat larger character sets this technique
should compress the string significantly since the same character is likely to appear in similar
contexts. Experimentally, in fact, the technique compresses about as well as PPM even though it
has no magic number k or magic way to select the escape probabilities.
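The context sort described above can be sketched directly in Python. This is a naive illustration under stated assumptions, not how a production block sorter works: the name `block_sort` is hypothetical, each sort key is built explicitly as the n - 1 preceding characters (with wraparound) read backwards, and the cost is roughly quadratic in the block length. It returns the context-sorted sequence together with the position of the first input character in that sequence.

```python
def block_sort(s: str) -> tuple[str, int]:
    """Sort characters by their preceding context in reverse lexicographic order.

    Returns the context-sorted sequence and the index, within that
    sequence, of the first character of the original string.
    """
    n = len(s)

    def context_key(i: int) -> list[str]:
        # Context of s[i]: the n - 1 characters before it, wrapping around,
        # listed from the nearest (most significant) to the farthest.
        return [s[(i - k) % n] for k in range(1, n)]

    order = sorted(range(n), key=context_key)
    return "".join(s[i] for i in order), order.index(0)
```

For the string accbaccacba this produces the sequence caccaaccbba, which agrees with column (c) of Figure 14, and reports that the first input character sits at position 1 (counting from 0).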
The problem remains, however, of how to reconstruct the original sequence from the context-sorted
sequence. The way to do this is the ingenious contribution made by Burrows and Wheeler.
You might try to recreate it before reading on. The order of the most-significant characters in
the sorted contexts plays an important role in decoding. In the example of Figure 14, these are
a1 a4 a2 a3 b2 b1 c1 c3 c5 c2 c4. The characters are sorted, but equal valued characters do not
necessarily appear in the same order as in the input sequence. The following lemma is critical in the
algorithm for efficiently reconstructing the sequence.
Lemma 6.1.1 For the Block-Sorting transform, as long as there are at least two distinct characters
in the input, equal valued characters appear in the same order in the most-significant characters
of the sorted contexts as in the output (the context-sorted sequence).
Proof: Since the contexts are sorted in reverse lexicographic order, sets of contexts whose
most-significant characters are equal will be ordered by the remaining context, i.e., the string of all
previous characters. Now consider the contexts of the context-sorted sequence. If we drop the
least-significant character of these contexts, then they are exactly the same as the remaining context
above, and therefore will be sorted into the same ordering. The only time that dropping the
least-significant character can make a difference is when all other characters are equal. This can only
happen when all characters in the input are equal.
Based on Lemma 6.1.1, it is not hard to reconstruct the sequence from the context-sorted sequence
as long as we are also given the index of the first character to output (the first character in
the original input sequence). The algorithm is given by the following code.

  function BW_Decode(In, FirstIndex, n)
    S = MoveToFrontDecode(In, n)
    R = Rank(S)
    j = FirstIndex
    for i = 0 to n - 1
      Out[i] = S[j]
      j = R[j]

For an ordered sequence S, the Rank(S) function returns a sequence of integers specifying for each
character c in S how many characters are either less than c or equal to c and appear before c in S.
Another way of saying this is that it specifies the position of the character if it were sorted using
a stable sort.
To show how this algorithm works, we consider an example in which the MoveToFront decoder
returns S = ssnasmaisssaai, and in which FirstIndex = 4 (the first a). The example
is shown in Figure 15(a). We can generate the most significant characters of the contexts simply
by sorting S. The result of the sort is shown in Figure 15(b) along with the rank R. Because of
Lemma 6.1.1, we know that equal valued characters will have the same order in this sorted sequence
and in S. This is indicated by the subscripts in the figure. Now each row of Figure 15(b)
tells us for each character what the next character is. We can therefore simply rebuild the initial
sequence by starting at the first character and adding characters one by one, as done by BW Decode
and as illustrated in Figure 15(c).
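The decoding loop can be sketched in Python as below. This is a minimal illustration with hypothetical names (`rank`, `bw_decode`): it uses 0-based indices, so the first a of the example is index 3 rather than 4, it relies on Python's stable sort to compute the rank, and it omits the move-to-front decoding step by taking the context-sorted sequence directly as input.

```python
def rank(s: str) -> list[int]:
    """Position each character would occupy after a stable sort (0-based)."""
    order = sorted(range(len(s)), key=lambda i: s[i])   # sorted() is stable
    r = [0] * len(s)
    for position, i in enumerate(order):
        r[i] = position
    return r


def bw_decode(s: str, first_index: int) -> str:
    """Rebuild the original string from its context-sorted sequence."""
    r = rank(s)
    j = first_index
    out = []
    for _ in range(len(s)):
        out.append(s[j])   # emit the current character
        j = r[j]           # its rank is the position of the next character
    return "".join(out)
```

Running `bw_decode("ssnasmaisssaai", 3)` walks the same chain of characters as Figure 15(c).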
      (a)     (b)
  position   S     Sort(S)   Rank(S)
  1          s1    a1        9
  2          s2    a2        10
  3          n1    a3        8
  4          a1    a4        1      <- FirstIndex
  5          s3    i1        11
  6          m1    i2        7
  7          a2    m1        2
  8          i1    n1        5
  9          s4    s1        12
  10         s5    s2        13
  11         s6    s3        14
  12         a3    s4        3
  13         a4    s5        4
  14         i2    s6        6

      (c)
  Out:  a1 s1 s4 a3 n1 i1 s3 s6 i2 m1 a2 s2 s5 a4

Figure 15: (a) the sequence S, (b) Sort(S) and Rank(S), and (c) the decoded output Out.
Subscripts are written as trailing digits.
Lossy compression is compression in which some of the information from the original message
sequence is lost. This means the original sequences cannot be regenerated from the compressed
sequence. Just because information is lost doesn't mean the quality of the output is reduced. For
example, random noise has very high information content, but when present in an image or a sound
file, we would typically be perfectly happy to drop it. Also certain losses in images or sound might
be completely imperceptible to a human viewer (e.g., the loss of very high frequencies). For this
reason, lossy compression algorithms on images can often get a factor of 2 better compression
than lossless algorithms with an imperceptible loss in quality. However, when quality does start
degrading in a noticeable way, it is important to make sure it degrades in a way that is least
objectionable to the viewer (e.g., dropping random pixels is probably more objectionable than dropping
some color information). For these reasons, the way most lossy compression techniques are used
is highly dependent on the media that is being compressed. Lossy compression for sound, for
example, is very different from lossy compression for images.
In this section we go over some general techniques that can be applied in various contexts, and
in the next two sections we go over more specific examples and techniques.
[Figure 16 plots Out against In in two panels, (a) and (b).]
Figure 16: Examples of (a) uniform and (b) non-uniform scalar quantization.
For example, we could take 8-bit integers and divide by 4 (i.e., drop the lower two bits), or take a
character set in which upper and lowercase characters are distinguished and replace all the
uppercase ones with lowercase ones. This general technique is called quantization. Since the mapping
used in quantization is many-to-one, it is irreversible and therefore lossy.
In the case that the input set S comes from a total order and the total order is broken up into
regions that map onto the elements of the output set S', the mapping is called scalar quantization.
The example of dropping the lower two bits given in the previous paragraph is an example of scalar
quantization. Applications of scalar quantization include reducing the number of color bits or gray-scale
levels in images (used to save memory on many computer monitors), and classifying the intensity
of frequency components in images or sound into groups (used in JPEG compression). In fact we
mentioned an example of quantization when talking about JPEG-LS. There quantization is used to
reduce the number of contexts instead of the number of message values. In particular we categorized
each of 3 gradients into one of 9 levels so that the context table needs only 9^3 entries (actually
only (9^3 + 1)/2 due to symmetry).
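The divide-by-4 example can be written out as a tiny sketch. The function names `quantize` and `reconstruct` are hypothetical, and reconstructing to the middle of each region (rather than its lower end) is a choice made here for illustration, not something the text prescribes.

```python
def quantize(v: int) -> int:
    """Map an 8-bit value to one of 64 levels by dropping the lower two bits."""
    return v >> 2


def reconstruct(q: int) -> int:
    """Map a level back to a representative 8-bit value (middle of its region)."""
    return (q << 2) + 2


# The mapping is many-to-one: 256 inputs collapse onto 64 levels, and the
# reconstruction error is bounded by the region width.
levels = {quantize(v) for v in range(256)}
worst_error = max(abs(v - reconstruct(quantize(v))) for v in range(256))
```

Because four inputs share each level, the original value cannot be recovered exactly, which is what makes the step lossy.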
The term uniform scalar quantization is typically used when the mapping is linear. Again,
the example of dividing 8-bit integers by 4 is a linear mapping. In practice it is often better to
use a nonuniform scalar quantization. For example, it turns out that the eye is more sensitive to
low values of red than to high values. Therefore we can get better quality compressed images by
making the regions in the low values smaller than the regions in the high values. Another choice
is to base the nonlinear mapping on the probability of different input values. In fact, this idea can
be formalized: for a given error metric and a given probability distribution over the input values,
we want a mapping that will minimize the expected error. For certain error metrics, finding this
mapping might be hard. For the root-mean-squared error metric there is an iterative algorithm
known as the Lloyd-Max algorithm that will find the optimal mapping. An interesting point is that
finding this optimal mapping will have the effect of decreasing the effectiveness of any probability
coder that is used on the output. This is because the mapping will tend to more evenly spread the
probabilities in S'.
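A rough sketch of the Lloyd-Max iteration on a list of sample values is given below. It is an illustration under assumptions, not a reference implementation: the function name `lloyd_max` is hypothetical, the input distribution is represented by samples, the starting representatives are spaced uniformly, and a fixed number of rounds is run instead of testing for convergence. In general an iteration of this kind settles on a mapping that cannot be improved by either of its two steps, which may depend on the starting point.

```python
def lloyd_max(samples: list[float], levels: int, rounds: int = 50) -> list[float]:
    """Return representative values chosen to reduce mean squared error."""
    lo, hi = min(samples), max(samples)
    # Start from uniformly spaced representatives (uniform quantization).
    reps = [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]
    for _ in range(rounds):
        # Step 1: assign each sample to its nearest representative. This is
        # the same as placing region boundaries midway between neighbours.
        buckets: list[list[float]] = [[] for _ in reps]
        for v in samples:
            nearest = min(range(levels), key=lambda i: abs(v - reps[i]))
            buckets[nearest].append(v)
        # Step 2: move each representative to the mean of its region,
        # leaving it unchanged if the region is empty.
        reps = [sum(b) / len(b) if b else r for b, r in zip(buckets, reps)]
    return reps
```

For samples clustered around two values, for example `[0, 1, 2, 10, 11, 12]` with two levels, the representatives move from the uniform starting points 3 and 9 to the cluster means 1 and 11.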
[Figure: plot of Weight against Height.]
[Figure: f(x) for the Cosine, Polynomial, and Wavelet transforms.]
  \Theta_{ij} = \sqrt{1/n} \cos\frac{(2j+1) i \pi}{2n}    for i = 0,      0 \le j < n
  \Theta_{ij} = \sqrt{2/n} \cos\frac{(2j+1) i \pi}{2n}    for 0 < i < n,  0 \le j < n
The DCT is one of the most commonly used transforms in practice for image compression,
more so than the discrete Fourier transform (DFT). This is because the DFT assumes periodicity,
which is not necessarily true in images. In particular to represent a linear function over a region
requires many large amplitude high-frequency components in a DFT. This is because the periodicity assumption will view the function as a sawtooth, which is highly discontinuous at the teeth
requiring the high-frequency components. The DCT does not assume periodicity and will only require much lower amplitude high-frequency components. The DCT also does not require a phase,
which is typically represented using complex numbers in the DFT.
For the purpose of compression, the properties we would like of a transform are (1) to decorrelate the data, (2) have many of the transformed coefficients be small, and (3) have it so that from
the point of view of perception, some of the terms are more important than others.
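The claim about linear functions can be checked numerically with a short sketch. The normalizations used here are assumptions made for the comparison (an orthonormal DCT-II and a DFT scaled by 1/sqrt(n)), and the function names `dct` and `dft` are hypothetical. The code transforms an 8-point ramp both ways and compares the magnitude of the highest-frequency coefficient.

```python
import cmath
import math


def dct(xs: list[float]) -> list[float]:
    """Orthonormal DCT-II of a real sequence (assumed normalization)."""
    n = len(xs)
    out = []
    for i in range(n):
        scale = math.sqrt(1 / n) if i == 0 else math.sqrt(2 / n)
        out.append(scale * sum(x * math.cos((2 * t + 1) * i * math.pi / (2 * n))
                               for t, x in enumerate(xs)))
    return out


def dft(xs: list[float]) -> list[complex]:
    """DFT scaled by 1/sqrt(n) so magnitudes are comparable with dct()."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(xs))
            / math.sqrt(n)
            for k in range(n)]


ramp = [float(t) for t in range(8)]            # a linear function over the region
dct_coeffs = dct(ramp)
dft_coeffs = dft(ramp)
highest_dct = abs(dct_coeffs[-1])              # DCT frequency index n - 1
highest_dft = abs(dft_coeffs[len(ramp) // 2])  # DFT index n/2 is its highest frequency
```

On this ramp the highest-frequency DCT coefficient is much smaller than the highest-frequency DFT coefficient, in line with the sawtooth argument above.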
The JPEG and the related MPEG format make good real-world examples of compression because
(a) they are used very widely in practice, and (b) they use many of the compression techniques
[Figure 19: Overview of the JPEG compression process. Labels in the figure: Y, (optional),
for each 8x8 block, DCT, zig-zag order, Bits.]
8.1 JPEG
JPEG is a lossy compression scheme for color and gray-scale images. It works on full 24-bit color,
and was designed to be used with photographic material and naturalistic artwork. It is not the ideal
format for line-drawings, textual images, or other images with large areas of solid color or a very
limited number of distinct colors. The lossless techniques, such as JBIG, work better for such
images.
JPEG is designed so that the loss factor can be tuned by the user to trade off image size and
image quality, and is designed so that the loss has the least effect on human perception. It does,
however, have some anomalies when the compression ratio gets high, such as odd effects across the
boundaries of 8x8 blocks. For high compression ratios, other techniques such as wavelet
compression appear to give more satisfactory results.
An overview of the JPEG compression process is given in Figure 19. We will cover each of the
steps in this process.
The input to JPEG is three color planes of 8 bits per pixel, each representing Red, Green, and
Blue (RGB). These are the colors used by hardware to generate images. The first step of JPEG
compression, which is optional, is to convert these into YIQ color planes. The YIQ color planes are
designed to better represent human perception and are what are used on analog TVs in the US (the
NTSC standard). The Y plane is designed to represent the brightness (luminance) of the image. It
is a weighted average of red, blue and green (0.59 Green + 0.30 Red + 0.11 Blue). The weights
are not balanced since the human eye is more responsive to green than to red, and more to red than
to blue. The I (in-phase) and Q (quadrature) components represent the color hue (chrominance).
If you have an old black-and-white television, it uses only the Y signal and drops the I and Q
components, which are carried on a sub-carrier signal. The reason for converting to YIQ is that it
is more important in terms of perception to get the intensity right than the hue. Therefore JPEG
keeps all pixels for the intensity, but typically downsamples the two color planes by a factor of 2
in each dimension (a total factor of 4). This is the first lossy component of JPEG and gives a factor
of 2 compression: (1 + 2 × 1/4)/3 = 1/2.
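The luminance computation and the chrominance downsampling can be sketched as follows. Only the Y weights come from the text; the function names are hypothetical, averaging each 2x2 group is one possible way to downsample (an assumption here), and the plane is assumed to have even width and height.

```python
def luminance(r: float, g: float, b: float) -> float:
    """Y value of a pixel, using the weights quoted in the text."""
    return 0.59 * g + 0.30 * r + 0.11 * b


def downsample_2x2(plane: list[list[float]]) -> list[list[float]]:
    """Halve a plane in each dimension by averaging 2x2 groups of pixels."""
    return [[(plane[y][x] + plane[y][x + 1] + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
             for x in range(0, len(plane[0]), 2)]
            for y in range(0, len(plane), 2)]


# One full-resolution plane plus two quarter-size planes, relative to three
# full-resolution planes: the fraction of the data that is kept.
kept_fraction = (1 + 2 * (1 / 4)) / 3
```

The value of `kept_fraction` is one half, which is the factor of 2 mentioned above.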
The next step of the JPEG algorithm is to partition each of the color planes into 8x8 blocks.
Each of these blocks is then coded separately. The first step in coding a block is to apply a cosine
transform across both dimensions. This returns an 8x8 block of 8-bit frequency terms. So far this
does not introduce any loss, or compression. The block size is motivated by wanting it to be large
enough to capture some frequency components but not so large that it causes "frequency spilling".
In particular if we cosine-transformed the whole image, a sharp boundary anywhere in a line would
cause high values across all frequency components in that line.
After the cosine transform, the next step applied to the blocks is to use uniform scalar
quantization on each of the frequency terms. This quantization is controllable based on user parameters
and is the main source of information loss in JPEG compression. Since the human eye is more
sensitive to certain frequency components than to others, JPEG allows the quantization scaling
factor to be different for each frequency component. The scaling factors are specified using an
8x8 table that simply is used to element-wise divide the 8x8 table of frequency components. JPEG
defines standard quantization tables for both the Y and I-Q components. The table for Y is shown
in Table 6. In this table the largest components are in the lower-right corner. This is because these
are the highest frequency components, which humans are less sensitive to than the lower-frequency
components in the upper-left corner. The selection of the particular numbers in the table seems
magic, for example the table is not even symmetric, but it is based on studies of human perception.
If desired, the coder can use a different quantization table and send the table in the head of the
message. To further compress the image, the whole resulting table can be divided by a constant,
which is a scalar "quality control" given to the user. The result of the quantization will often drop
most of the terms in the lower right to zero.
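The element-wise division can be sketched as below, using the Y table from Table 6. The names `Y_TABLE`, `quantize_block`, and `dequantize_block` are hypothetical, and rounding to the nearest integer with Python's `round` is an assumption about how the division is turned into integer levels.

```python
Y_TABLE = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def quantize_block(coeffs, table=Y_TABLE):
    """Divide each frequency term by its table entry and round to an integer."""
    return [[round(c / q) for c, q in zip(crow, qrow)]
            for crow, qrow in zip(coeffs, table)]


def dequantize_block(levels, table=Y_TABLE):
    """Approximate the original terms by multiplying the levels back."""
    return [[lv * q for lv, q in zip(lrow, qrow)]
            for lrow, qrow in zip(levels, table)]


# A block with a large DC term, one low-frequency term, and a small
# high-frequency term in the lower-right corner.
block = [[0.0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[7][7] = 160.0, -22.0, 30.0
levels = quantize_block(block)
restored = dequantize_block(levels)
```

In this example the high-frequency term of 30 is divided by 99 and rounds to zero, so it is lost, while the two low-frequency terms survive.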
JPEG compression then compresses the DC component (upper-leftmost) separately from the
other components. In particular it uses a difference coding by subtracting the value given by the
DC component of the previous block from the DC component of this block. It then Huffman or
arithmetic codes this difference. The motivation for this method is that the DC component is often
similar from block-to-block so that difference coding it will give better compression.
  16  11  10  16  24  40  51  61
  12  12  14  19  26  58  60  55
  14  13  16  24  40  57  69  56
  14  17  22  29  51  87  80  62
  18  22  37  56  68 109 103  77
  24  35  55  64  81 104 113  92
  49  64  78  87 103 121 120 101
  72  92  95  98 112 100 103  99

Table 6: JPEG quantization table for the Y component.

  Playback order:     0  1  2  3  4  5  6  7  8  9
  Frame type:         I  B  B  P  B  B  P  B  B  I
  Data stream order:  0  2  3  1  5  6  4  8  9  7

Figure 21: MPEG frames in playback order and in data stream order.

The other components (the AC components) are now compressed. They are first converted into
a linear order by traversing the frequency table in a zig-zag order (see Figure 20). The motivation
for this order is that it keeps frequencies of approximately equal length close to each other
in the linear order. In particular most of the zeros will appear as one large contiguous block at
the end of the order. A form of run-length coding is used to compress the linear order. It is
coded as a sequence of (skip,value) pairs, where skip is the number of zeros before a value, and
value is the value. The special pair (0,0) specifies the end of block. For example, the sequence
[4,3,0,0,1,0,0,0,1,0,0,0,...] is represented as [(0,4),(0,3),(2,1),(3,1),(0,0)]. This sequence is then
compressed using either arithmetic or Huffman coding. Which of the two coding schemes is used is
specified on a per-image basis in the header.
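The zig-zag traversal and the (skip,value) coding of the AC components can be sketched as follows. The function names are hypothetical, the traversal direction (first step to the right along the top row) is an assumption, and the sketch stops at producing the pairs; it does not perform the Huffman or arithmetic coding that follows.

```python
def zigzag_order(n: int = 8) -> list[tuple[int, int]]:
    """(row, column) positions of an n x n table in zig-zag order."""
    def key(pos: tuple[int, int]) -> tuple[int, int]:
        row, col = pos
        diagonal = row + col
        # Alternate the direction of travel along successive anti-diagonals.
        return (diagonal, row if diagonal % 2 else -row)
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)


def run_length(values: list[int]) -> list[tuple[int, int]]:
    """Code a linear sequence as (skip, value) pairs ending with (0, 0)."""
    pairs = []
    skip = 0
    for v in values:
        if v == 0:
            skip += 1
        else:
            pairs.append((skip, v))
            skip = 0
    pairs.append((0, 0))          # end of block; trailing zeros are implied
    return pairs
```

Applying `run_length` to the sequence from the text reproduces the pairs given there.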
8.2 MPEG
Correlation improves compression. This is a recurring theme in all of the approaches we have seen;
the more effectively a technique is able to exploit correlations in the data, the more effectively it
will be able to compress that data.
This principle is most evident in MPEG encoding. MPEG compresses video streams. In theory, a video stream is a sequence of discrete images. In practice, successive images are highly
interrelated. Barring cut shots or scene changes, any given video frame is likely to bear a close
resemblance to neighboring frames. MPEG exploits this strong correlation to achieve far better
compression rates than would be possible with isolated images.
Each frame in an MPEG image stream is encoded using one of three schemes:
I-frames, or intra-frames, are coded as isolated images.
P-frames, or predictive coded frames, are based on the previous I- or P-frame.
B-frames, or bidirectionally predictive coded frames, are based on either or both the previous and
next I- or P-frame.
Figure 21 shows an MPEG stream containing all three types of frames. I-frames and P-frames
appear in an MPEG stream in simple, chronological order. However, B-frames are moved so that
they appear after their neighboring I- and P-frames. This guarantees that each frame appears after
any frame upon which it may depend. An MPEG decoder can decode any frame by buffering the
two most recent I- or P-frames encountered in the data stream. Figure 21 shows how B-frames are
postponed in the data stream so as to simplify decoder buffering. MPEG encoders are free to mix
the frame types in any order. When the scene is relatively static, P- and B-frames could be used,
while major scene changes could be encoded using I-frames. In practice, most encoders use some
fixed pattern.
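The reordering of B-frames can be sketched in a few lines. The function name `stream_order` is hypothetical, and frame types are given as a string of the letters I, P, and B in playback order; trailing B-frames with no later reference frame are simply emitted at the end, which is a choice made for this sketch.

```python
def stream_order(frame_types: str) -> list[int]:
    """Playback indices of the frames, listed in data stream order."""
    order: list[int] = []
    pending_b: list[int] = []
    for index, kind in enumerate(frame_types):
        if kind == "B":
            pending_b.append(index)      # postpone until the next I- or P-frame
        else:
            order.append(index)          # the reference frame goes first
            order.extend(pending_b)      # then the B-frames that depend on it
            pending_b = []
    order.extend(pending_b)
    return order


frames = stream_order("IBBPBBPBBI")
# Position in the data stream of each frame, indexed by playback order.
positions = [frames.index(f) for f in range(len(frames))]
```

For the pattern in Figure 21, `positions` matches the "Data stream order" row of the figure.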
Since I-frames are independent images, they can be encoded as if they were still images. The
particular technique used by MPEG is a variant of the JPEG technique (the color transformation
and quantization steps are slightly different). I-frames are very important for use as anchor points
so that the frames in the video can be accessed randomly without requiring one to decode all
previous frames. To decode any frame we need only find its closest previous I-frame and go from
there. This is important for allowing reverse playback, skip-ahead, or error-recovery.
The intuition behind encoding P-frames is to find matches, i.e., groups of pixels with similar
patterns, in the previous reference frame and then code the difference between the P-frame and
its match. To find these matches the MPEG algorithm partitions the P-frame into 16x16 blocks.
The process by which each of these blocks is encoded is illustrated in Figure 22. For each target
block in the P-frame the encoder finds a reference block in the previous P- or I-frame that most
closely matches it. The reference block need not be aligned on a 16-pixel boundary and can
potentially be anywhere in the image. In practice, however, the x-y offset is typically small. The
offset is called the motion vector. Once the match is found, the pixels of the reference block are
subtracted from the corresponding pixels in the target block. This gives a residual which ideally is
close to zero everywhere. This residual is coded using a scheme similar to JPEG encoding, but will
ideally get a much better compression ratio because of the low intensities. In addition to sending
the coded residual, the coder also needs to send the motion vector. This vector is Huffman coded.
The motivation for searching other locations in the reference image for a match is to allow for the
efficient encoding of motion. In particular if there is a moving object in the sequence of images
(e.g., a car or a ball), or if the whole video is panning, then the best match will not be in the same
location in the image. It should be noted that if no good match is found, then the block is coded as
if it were from an I-frame.
In practice, the search for good matches for each target block is the most computationally
expensive part of MPEG encoding. With current technology, real-time MPEG encoding is only
possible with the help of custom hardware. Note, however, that while the search for a match is
expensive, regenerating the image as part of the decoder is cheap since the decoder is given the
motion vector and only needs to look up the block from the previous image.
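The search for a matching reference block can be sketched as an exhaustive search over nearby offsets. Everything specific here is an assumption for illustration: the names `sad` and `motion_search`, the use of the sum of absolute differences as the match criterion, the search radius, and the representation of a frame as a list of rows of integer intensities. The target block is assumed to lie inside the frame.

```python
def sad(ref, target, rx, ry, tx, ty, size):
    """Sum of absolute differences between a reference and a target block."""
    return sum(abs(ref[ry + dy][rx + dx] - target[ty + dy][tx + dx])
               for dy in range(size) for dx in range(size))


def motion_search(ref, target, tx, ty, size=16, radius=7):
    """Return (motion vector, residual block) for the target block at (tx, ty)."""
    height, width = len(ref), len(ref[0])
    best = None
    for oy in range(-radius, radius + 1):
        for ox in range(-radius, radius + 1):
            rx, ry = tx + ox, ty + oy
            # Only consider reference blocks that lie inside the frame.
            if 0 <= rx <= width - size and 0 <= ry <= height - size:
                cost = sad(ref, target, rx, ry, tx, ty, size)
                if best is None or cost < best[0]:
                    best = (cost, ox, oy)
    _, ox, oy = best
    # Residual: target pixels minus the pixels of the best reference block.
    residual = [[target[ty + dy][tx + dx] - ref[ty + oy + dy][tx + ox + dx]
                 for dx in range(size)] for dy in range(size)]
    return (ox, oy), residual
```

The nested loops over offsets are what makes encoding expensive; the decoder only needs the returned vector to look up the reference block and add the residual back.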
B-frames were not present in MPEG's predecessor, H.261. They were added in an effort to
address the following situation: portions of an intermediate P-frame may be completely absent
from all previous frames, but may be present in future frames. For example, consider a car entering
a shot from the side. Suppose an I-frame encodes the shot before the car has started to appear, and
another I-frame appears when the car is completely visible. We would like to use P-frames for
the intermediate scenes. However, since no portion of the car is visible in the first I-frame, the
P-frames will not be able to reuse that information. The fact that the car is visible in a later
I-frame does not help us, as P-frames can only look back in time, not forward.
B-frames look for reusable data in both directions. The overall technique is very similar to that
used in P-frames, but instead of just searching in the previous I- or P-frame for a match, it also
searches in the next I- or P-frame. Assuming a good match is found in each, the two reference
frames are averaged and subtracted from the target frame. If only one good match is found, then it
is used as the reference. The coder needs to send some information on which reference(s) is (are)
used, and potentially needs to send two motion vectors.
How effective is MPEG compression? We can examine typical compression ratios for each
frame type, and form an average weighted by the ratios in which the frames are typically interleaved.
2
Starting with a 7 E8 E!1 pixel, 24-bit color image, typical compression ratios for MPEG-I are:
  Type   Size     Ratio
  I      18 Kb    7:1
  P      6 Kb     20:1
  B      2.5 Kb   50:1
  Avg    4.8 Kb   27:1
2
If one 7 E98 E!1 frame requires 4.8 Kb, how much bandwidth does MPEG require in order to
provide a reasonable video feed at thirty frames per second?
  30 frames/sec × 4.8 Kb/frame × 8 bits/byte ≈ 1.2 Mbits/sec
Thus far, we have been concentrating on the visual component of MPEG. Adding a stereo audio
stream will require roughly another 0.25 Mbits/sec, for a grand total bandwidth of 1.45 Mbits/sec.
This fits nicely within the 1.5 Mbit/sec capacity of a T1 line. In fact, this specific limit was a
design goal in the formation of MPEG. Real-life MPEG encoders track bit rate as they encode, and
will dynamically adjust compression qualities to keep the bit rate within some user-selected bound.
This bit-rate control can also be important in other contexts. For example, video on a multimedia
CD-ROM must fit within the relatively poor bandwidth of a typical CD-ROM drive.
MPEG in the Real World
MPEG has found a number of applications in the real world, including:
1. Direct Broadcast Satellite. MPEG video streams are received by a dish/decoder, which
unpacks the data and synthesizes a standard NTSC television signal.
2. Cable Television. Trial systems are sending MPEG-II programming over cable television
lines.
3. Media Vaults. Silicon Graphics, Storage Tech, and other vendors are producing on-demand
video systems, with twenty-five thousand MPEG-encoded films on a single installation.
4. Real-Time Encoding. This is still the exclusive province of professionals. Incorporating
special-purpose parallel hardware, real-time encoders can cost twenty to fifty thousand dollars.
@JILK>MON1
The family of basis functions are scaled and translated versions of this mother function. For
some scaling factor s and translation factor l,

  \phi_{sl}(x) = \phi(2^s x - l)

A well known family of wavelets are the Haar wavelets, which are derived from the following
mother function:

  \phi(x) =  1    for 0 \le x < 1/2
            -1    for 1/2 \le x < 1
             0    for x < 0 or x \ge 1
Figure 23 shows a family of seven Haar basis functions. Of the many potential wavelets,
Haar wavelets are probably the most described but the least used. Their regular form makes the
underlying mathematics simple and easy to illustrate, but tends to create bad blocking artifacts if
actually used for compression.
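The Haar family can be written down in a few lines. This sketch assumes the usual definition of the Haar mother function (1 on the first half of the unit interval, -1 on the second half, 0 elsewhere) and the scaling and translation rule phi(2^s x - l); the function names `haar_mother` and `haar` are hypothetical.

```python
def haar_mother(x: float) -> int:
    """Haar mother function: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    if 0 <= x < 0.5:
        return 1
    if 0.5 <= x < 1:
        return -1
    return 0


def haar(s: int, l: int, x: float) -> int:
    """Basis function with scaling factor s and translation factor l."""
    return haar_mother(2 ** s * x - l)


def inner_product(s1: int, l1: int, s2: int, l2: int, samples: int = 64) -> int:
    """Sum of products of two basis functions over a grid on [0, 1)."""
    return sum(haar(s1, l1, (k + 0.5) / samples) * haar(s2, l2, (k + 0.5) / samples)
               for k in range(samples))
```

With s ranging over 0, 1, 2 and l over 0 to 2^s - 1 this gives seven functions on the unit interval, and the sampled inner product of any two distinct ones is zero.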
[Figure 23: A family of seven Haar basis functions: \phi_{00} = \phi(x), \phi_{10} = \phi(2x),
\phi_{11} = \phi(2x - 1), \phi_{20} = \phi(4x), \phi_{21} = \phi(4x - 1), \phi_{22} = \phi(4x - 2),
\phi_{23} = \phi(4x - 3).]
[Figure 24: A sampling of other popular wavelets.]
[Figure 25: Self-similarity of the Daubechies wavelet.]
Many other wavelet mother functions have also been proposed. The Morlet wavelet multiplies
a Gaussian by a cosine, resulting in a periodic but smoothly decaying function. This function is
equivalent to a wave packet from quantum physics, and the mathematics of Morlet functions have
been studied extensively. Figure 24 shows a sampling of other popular wavelets. Figure 25 shows
that the Daubechies wavelet is actually a self-similar fractal.
Wavelets in the Real World
Summus Ltd. is the premier vendor of wavelet compression technology. Summus claims to achieve
better quality than JPEG for the same compression ratios, but has been loath to divulge details of
how their wavelet compression actually works. Summus wavelet technology has been incorporated
into such items as:
- Image viewing plugins for Netscape Navigator and Microsoft Internet Explorer.
- Desktop image and movie compression in Corel Draw and Corel Video.
- Digital cameras under development by Fuji.
]ILK>MN
K\
Z\
b
This was a simple case. Many functions may be too complex to solve directly. Or a function
may be a black box, whose formal definition is not known. In that case, we might try an iterative
approach. Keep feeding numbers back through the function in hopes that we will converge on a
solution:
  x_0 = guess
  x_i = f(x_{i-1})
For example, suppose that we have f(x) as a black box. We might guess zero as x_0 and iterate
from there:

  x_0 = 0
  x_1 = f(x_0) = 1
  x_2 = f(x_1) = 1.5
  x_3 = f(x_2) = 1.75
  x_4 = f(x_3) = 1.875
  x_5 = f(x_4) = 1.9375

In this example, f(x) was actually defined as x/2 + 1. The exact fixed point is 2, and the iterative
solution was converging upon this value.
Iteration is by no means guaranteed to find a fixed point. Not all functions have a single fixed
point. Functions may have no fixed point, many fixed points, or an infinite number of fixed points.
Even if a function has a fixed point, iteration may not necessarily converge upon it.
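Both behaviours can be seen in a small sketch. The helper name `iterate` is hypothetical, and the two functions are chosen for illustration only: x/2 + 1 and 2x - 2 both have the fixed point 2, but iteration from a guess of zero converges for the first and moves steadily away for the second.

```python
def iterate(f, guess, steps):
    """Return [x_0, x_1, ..., x_steps] where x_i = f(x_{i-1})."""
    values = [guess]
    for _ in range(steps):
        values.append(f(values[-1]))
    return values


converging = iterate(lambda x: x / 2 + 1, 0, 4)   # approaches the fixed point 2
diverging = iterate(lambda x: 2 * x - 2, 0, 4)    # same fixed point, never reached
```

The difference is the slope at the fixed point: the first function halves the distance to 2 at every step, while the second doubles it.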
In the above example, we were able to associate a fixed point value with a function. If we were
given only the function, we would be able to recompute the fixed point value. Put differently, if
we wish to transmit a value, we could instead transmit a function that iteratively converges on that
value.
This is the idea behind fractal compression. However, we are not interested in transmitting
simple numbers, like 2. Rather, we wish to transmit entire images. Our fixed points will be
images. Our functions, then, will be mappings from images to images.
Our encoder will operate roughly as follows:
1. Given an image from the set of all possible images.
Figure 26: Identifying self-similarity. Range blocks appear on the left; one domain block appears
on the left. The arrow identifies one of several collage functions that would be composited into a
complete image.
2. Compute a function