An introduction to Information Theory
Adrish Banerjee
Department of Electrical Engineering
Indian Institute of Technology Kanpur
Kanpur, Uttar Pradesh
India
Lecture #7B: Universal Source Coding-II: Lempel-Ziv-Welch Algorithm (LZW)
In today's lecture we are going to continue our discussion of the Lempel-Ziv
algorithm. We are going to talk about another variant of the Lempel-Ziv
algorithm, known as the Lempel-Ziv-Welch (LZW) algorithm. The outline of the
lecture is:
LZW encoding
LZW decoding
LZW encoding:
It was published by Terry Welch in 1984 as an improved implementation
of the LZ-78 algorithm published by Lempel and Ziv in 1978.
This procedure creates a dictionary that contains a list of previously
encountered phrases along with associated codewords.
In this procedure, the input is incrementally parsed into phrases where
each new phrase is a concatenation of an old phrase that has occurred in the
past and an innovation symbol.
The same dictionary is created at the encoder and at the decoder, with each
phrase in the list represented by an integer index.
The dictionary is initialized with all length-one phrases (q of them,
because each source symbol can take q values).
The codeword of a new phrase can be represented as a pair $(C(w), w_k)$, where
$C(w)$ is the code (index in the dictionary) of the prefix of the new phrase
and $w_k$ is the innovation symbol.
When the dictionary is full, different methods can be used to reset parts
of the dictionary:
Reset the half of the dictionary that holds the oldest unused phrases
(excluding the alphabet letters) and reuse these entries for new
phrases.
Do not reset the dictionary. Continue coding without adding any new
phrases.
Another approach is to reset the entries for shorter phrases that are
prefixes of longer entries already in the dictionary.
$\lceil \log_2 |D_n| \rceil$ bits can be used to represent each codeword, where
$|D_n|$ is the size of the dictionary at time n.
Alternatively, Elias coding of positive integers can be used to represent
codewords, with smaller indices assigned to more recent phrases (reverse
order of the dictionary).
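The two codeword representations mentioned above can be compared with a small sketch. It assumes that "Elias coding" means the Elias gamma code (the notes do not say which Elias code is intended) and that "reverse order" means rank 1 goes to the most recently added phrase. The function names are illustrative only.

```python
def fixed_length_bits(dictionary_size: int) -> int:
    """Bits per codeword, ceil(log2 |D_n|), using integer arithmetic only."""
    return (dictionary_size - 1).bit_length()


def elias_gamma(n: int) -> str:
    """Elias gamma code of a positive integer, returned as a bit string.

    The code is floor(log2 n) zeros followed by the binary expansion of n.
    Assumption: gamma is the Elias code meant in the notes.
    """
    if n < 1:
        raise ValueError("Elias gamma is defined for positive integers only")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def recency_rank(index: int, dictionary_size: int) -> int:
    """Map a dictionary index to its rank in reverse order (newest phrase -> 1).

    Assumption: this is what 'reverse order of dictionary' refers to.
    """
    return dictionary_size - index + 1


if __name__ == "__main__":
    size = 11  # dictionary size when a codeword is sent (illustrative value)
    print(fixed_length_bits(size))               # fixed-length cost in bits
    print(elias_gamma(recency_rank(11, size)))   # newest phrase, gamma-coded
    print(elias_gamma(recency_rank(1, size)))    # oldest phrase, gamma-coded
```

With a dictionary of 11 phrases, the fixed-length representation costs 4 bits for every codeword, whereas under this recency ordering the most recent phrase costs a single bit and older phrases cost more.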
Let us take an example to illustrate how the LZW algorithm works.
Problem: Use the LZW algorithm to encode the 24-bit sequence
$W_1^{24} = \{001011011000011011011000\}$
Solution: Remember that this is a universal source compression algorithm, so
it does not need any prior information about the distribution of 0s and 1s.
The encoding procedure is shown in the figure and dictionary table below.
[Figure: LZW encoding of the 24-bit sequence. After each codeword is sent, a new phrase is added to the dictionary.]

Input sequence: 0 0 1 0 1 1 0 1 1 0 0 0 0 1 1 0 1 1 0 1 1 0 0 0

Dictionary
Phrase   C    alt
0        1    1
1        2    2
00       3    1.0
01       4    1.1
10       5    2.0
011      6    4.1
101      7    5.1
11       8    2.1
100      9    5.0
000      10   3.0
0110     11   6.0
01101    12   11.1
110      13   8.0

Codeword number:  1  2  3  4  5  6  7  8  9  10  11  12
Codeword sent:    1  1  2  4  5  2  5  3  6  11   8  10
The sequence to be encoded using LZW is 001011011000011011011000.
The first step is to initialize the dictionary with the length-1 phrases, which
are 0 and 1. Their indices are 1 and 2 respectively. Now let us look at the
sequence that is to be encoded. The first bit is 0, which is in the
dictionary. The next bit is also 0, but 00 is not in the dictionary, so we can
only encode 0 here and not 00. The first 0 is therefore encoded as 1, which is
the index of phrase 0. Since 00 is not in the dictionary, we add it to the
dictionary as codeword 3. An alternative representation of the codewords is
also given in the dictionary above. As I said, each new phrase can be written
as the concatenation of an old phrase and a new innovation symbol. So 00 is
written as 1.0, i.e., codeword 3 is written as 1.0 in the alternative way,
where 1 is the index of the phrase already in the dictionary and 0 is the
innovation symbol. We now start from the second 0. The next bit is 1, and 01
is not in the dictionary. Please note that we always look for the longest
phrase that is already in the dictionary, and that is what we encode. Phrase 0
is codeword 1, so the second 0 in the sequence is encoded as 1. We add 01 to
the dictionary as codeword 4. 01 is the phrase 0 concatenated with the
innovation symbol 1, so its alternative representation is 1.1. We now start
from 1. The next bit is 0, and 10 is not in the dictionary, so we encode 1 as
2. So far we have encoded 001 as the codewords 1, 1, 2.
Since 10 is not in the dictionary, we add it to the dictionary as codeword 5,
written alternatively as 2.0. We continue encoding from the bit 0. The next
bit is 1, and 01 is in the dictionary, but 011 is not. So 01 is encoded as
codeword 4, and we add 011 to the dictionary as codeword 6. It is the phrase
01 concatenated with 1, so it is written as 4.1 alternatively. Now we start
with 1. The phrase 10 is in the dictionary, but 101 is not. So 10 is encoded
as 5, and 101 is added to the dictionary as codeword 7, which can be written
as 5.1 alternatively. As before, each new phrase that we add to the dictionary
can be written as an old phrase concatenated with the innovation symbol. Next
we start with 1, which is in the dictionary. We check 11, which is not in the
dictionary. So 1 is encoded as 2, and 11 is added to the dictionary as
codeword 8, which can be written as 2.1 alternatively. Next we start with 1
again, which is in the dictionary, and 10 is also in the dictionary. We then
check 100, which is not in the dictionary. So 10 is encoded as 5, and 100 is
added as codeword 9, with 5.0 as the alternative representation. Next, 00 is
in the dictionary but 000 is not. So 00 is encoded as 3, and 000 is added to
the dictionary as codeword 10, with 3.0 as the alternative representation.
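Next we observe that 011 is in the dictionary but 0110 is not. So 011 is
encoded as 6, and 0110 is added to the dictionary with index 11; its
alternative expression is 6.0, since 011 is indexed by 6. Similarly, 0110 is
encoded as 11, and 01101, which is not in the dictionary, is added to it as
codeword 12, with 11.1 as the alternative representation. Then 11 is encoded
as 8, and 110, which is not in the dictionary, is added as codeword 13 with
8.0 as the alternative expression. The remaining bits 000 are already in the
dictionary as codeword 10, so the last codeword is 10. The encoded sequence is
therefore 1, 1, 2, 4, 5, 2, 5, 3, 6, 11, 8, 10.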
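The encoding walk-through above can be reproduced with a short sketch. This is a minimal illustration under stated assumptions, not a reference implementation: the dictionary is never reset, the input is a string over the given alphabet, and the names lzw_encode and alt_form are hypothetical helpers introduced here.

```python
def lzw_encode(sequence, alphabet=("0", "1")):
    """Encode a string with LZW and return (codewords, dictionary).

    The dictionary maps each phrase to its integer index, starting at 1 as in
    the lecture. Assumption: the dictionary is allowed to grow without limit.
    """
    dictionary = {symbol: i + 1 for i, symbol in enumerate(alphabet)}
    codewords = []
    phrase = ""
    for symbol in sequence:
        candidate = phrase + symbol
        if candidate in dictionary:
            # Keep extending: we want the longest phrase already known.
            phrase = candidate
        else:
            # Send the index of the longest known phrase ...
            codewords.append(dictionary[phrase])
            # ... add "old phrase + innovation symbol" as a new entry ...
            dictionary[candidate] = len(dictionary) + 1
            # ... and start the next phrase at the innovation symbol.
            phrase = symbol
    if phrase:
        # The final phrase is already in the dictionary; send its index.
        codewords.append(dictionary[phrase])
    return codewords, dictionary


def alt_form(phrase, dictionary):
    """Alternative representation 'index-of-prefix.innovation-symbol'."""
    if len(phrase) == 1:
        return str(dictionary[phrase])
    return f"{dictionary[phrase[:-1]]}.{phrase[-1]}"


if __name__ == "__main__":
    codes, dictionary = lzw_encode("001011011000011011011000")
    print(codes)  # [1, 1, 2, 4, 5, 2, 5, 3, 6, 11, 8, 10]
    for phrase, index in dictionary.items():
        print(phrase, index, alt_form(phrase, dictionary))
```

Running it on the 24-bit example yields the same twelve codewords and the same thirteen dictionary entries as the table.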
LZW Decoding:
Now we will see the decoding part of LZW.
LZW decoding is done using alternative representation of the codewords
in an iterative fashion.
In alternative representation, codewords are represented as concatenation
of another codeword and an innovation symbol.
Decoding procedure includes timely generation of the dictionary from the
received codeword sequence.
At the beginning of decoding, the dictionary is initialized with the codewords
corresponding to the source alphabet.
A phrase is added to the decoder dictionary after each new codeword is
received except the first codeword.
Every received codeword represents the prefix of a new entry in the
dictionary. Recall that during LZW encoding we add to the dictionary a new
string that is not already present, formed by concatenating a string that is
already present with an innovation symbol. The innovation symbol for this
entry is determined from the first symbol of the next decoded codeword.
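The dictionary update described in the points above can be written compactly. The notation is introduced here for illustration and is not from the lecture: $c_1, c_2, \dots$ are the received codewords, $\phi(c)$ is the phrase with index $c$, $\Vert$ denotes concatenation, $\mathrm{first}(\cdot)$ is the first symbol of a phrase, and $q$ is the alphabet size as before.

```latex
% Entry added after the n-th codeword is received (none is added for n = 1):
\[
  \phi(q + n - 1) \;=\; \phi(c_{n-1}) \,\Vert\, \mathrm{first}\bigl(\phi(c_n)\bigr),
  \qquad n \ge 2 .
\]
% If c_n is the entry that is only now being created, c_n = q + n - 1,
% then \phi(c_n) starts with \phi(c_{n-1}), so
\[
  \phi(c_n) \;=\; \phi(c_{n-1}) \,\Vert\, \mathrm{first}\bigl(\phi(c_{n-1})\bigr).
\]
```

For a binary source ($q = 2$) with $c_1 = c_2 = 1$, the first rule gives $\phi(3) = 0 \Vert 0 = 00$.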
Problem: Use the LZW algorithm to decode the received codeword sequence
$W_1^{12} = \{1, 1, 2, 4, 5, 2, 5, 3, 6, 11, 8, 10\}$
Solution: Decoding procedure is shown in a table.
[Figure: LZW decoding of the received codeword sequence. After each codeword except the first, a new phrase is added to the dictionary.]

Received codeword:  1  1  2  4   5   2  5   3   6    11    8   10
Decoded phrase:     0  0  1  01  10  1  10  00  011  0110  11  000

Note: codeword 11 is unknown to the decoder at the time it is received. The
decoder uses the immediately previous decoded phrase to build phrase 11.

Dictionary
Phrase   C    alt
0        1    1
1        2    2
00       3    1.0
01       4    1.1
10       5    2.0
011      6    4.1
101      7    5.1
11       8    2.1
100      9    5.0
000      10   3.0
0110     11   6.0
01101    12   11.1
110      13   8.0

Decoded sequence: 0 0 1 0 1 1 0 1 1 0  0  0  0  1  1  0  1  1  0  1  1  0  0  0
Bit position:     1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
The first step in decoding is to initialize the dictionary. We initialize the
dictionary with the length-1 phrases. We are dealing with a binary sequence,
so the length-1 phrases are 0 and 1, which are the codewords indexed by 1 and
2 respectively. We start by decoding the first codeword, 1, as 0. The next
received codeword is also 1, which is decoded as the phrase 0. The first bit
of this newly decoded phrase is our innovation symbol, so we add 00 to the
dictionary as the codeword with index 3, which can alternatively be written as
1.0. Next, 2 corresponds to the phrase 1 and so is decoded as 1. The previous
phrase was 0 and the innovation symbol is 1, so we add 01 to the dictionary as
codeword 4, with 1.1 as the alternative representation. Next is 4, which is
decoded as 01 by looking it up in the dictionary. The previous phrase was 1
and the innovation symbol is 0, so we add 10 to the dictionary as codeword 5,
which can be written as 2.0 alternatively.
Next we decode 5 as 10. The innovation symbol is 1 and the previous phrase was
01, so we add 011 to the dictionary as codeword 6, with 4.1 as the alternative
expression. Next we decode 2 as 1. The innovation symbol is 1 and the previous
phrase was 10, so 101, which is not in the dictionary, is added to it as
codeword 7, which is 5.1 in the alternative expression. Next we decode 5 as
10. The innovation symbol is 1 and the previous phrase was 1, so we add 11 to
the dictionary as codeword 8, with 2.1 as the alternative representation.
Next we decode 3, which is 00. Here 0 is the innovation symbol and the
previous phrase was 10. Appending the innovation symbol to the previous phrase
gives the new phrase 100, which we add to our dictionary as codeword 9, with
5.0 as the alternative expression. Next, 6 is decoded as 011. The innovation
symbol is its first bit, 0, and the previous phrase was 00. Appending gives
000 as our new phrase, which is codeword 10, with 3.0 as the alternative form.
Next we receive 11, but codeword 11 is not in the dictionary. The dictionary
so far is filled only up to index 10. This situation arises when the encoder
uses a phrase immediately after putting it in the dictionary. We can see in
the encoding part that 0110 was put in the dictionary as codeword 11 and was
used immediately for the next phrase. The new entry must be of the form
previous phrase plus innovation symbol, that is, 011 followed by an innovation
symbol. The innovation symbol is the first bit of the newly decoded phrase,
and that phrase itself begins with 011, so the innovation symbol is 0. So
codeword 11 must be 0110; we add it to the dictionary and decode 11 as 0110.
In general, if a received codeword is not yet in the dictionary, the encoder
has used the phrase that it had just added to its dictionary.
Next we go to 8, which is decoded as 11. The innovation symbol is 1, and we
append it to the previous phrase 0110, which gives 01101. This is added to the
dictionary as a new entry with codeword 12, and 11.1 is the alternative form.
Finally, 10 is decoded as 000 and the innovation symbol is 0. Appended to the
previous phrase 11, it gives the new phrase 110, which is codeword 13, and 8.0
is the alternative form.
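The decoding steps above, including the case of a codeword that is not yet in the dictionary, can be sketched as follows. As with any such sketch, the function name lzw_decode and its interface are assumptions made for illustration, and the dictionary is assumed to grow without being reset.

```python
def lzw_decode(codewords, alphabet=("0", "1")):
    """Decode a list of LZW codewords and return (sequence, phrases).

    phrases maps each integer index (starting at 1) to its phrase.
    Assumption: the encoder never resets its dictionary.
    """
    phrases = {i + 1: symbol for i, symbol in enumerate(alphabet)}
    if not codewords:
        return "", phrases
    previous = phrases[codewords[0]]  # first codeword: no entry is added
    decoded = [previous]
    for code in codewords[1:]:
        if code in phrases:
            current = phrases[code]
        elif code == len(phrases) + 1:
            # Codeword unknown to the decoder: the encoder used the phrase it
            # had just added, so it is the previous phrase plus that phrase's
            # own first symbol.
            current = previous + previous[0]
        else:
            raise ValueError(f"invalid codeword {code}")
        # New entry: previous phrase + first symbol of the current phrase.
        phrases[len(phrases) + 1] = previous + current[0]
        decoded.append(current)
        previous = current
    return "".join(decoded), phrases


if __name__ == "__main__":
    sequence, phrases = lzw_decode([1, 1, 2, 4, 5, 2, 5, 3, 6, 11, 8, 10])
    print(sequence)     # 001011011000011011011000
    print(phrases[11])  # 0110
```

On the received sequence of the example, the branch for an unknown codeword is taken exactly once, for codeword 11, and the decoder ends with the same thirteen dictionary entries as the encoder.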