WSU-DTC Campus
Dept. of Information Technology
Itec3121 - Multimedia Systems
Year: III, Semester: II
Chapter 7:
Lossless Compression Algorithms
Abraham Abayneh
[email protected]
Contents of the chapter
• Introduction
• Basics of Information Theory
• Run-Length Coding
• Variable-Length Coding (VLC)
• Dictionary Based Coding
• Huffman Coding
• Arithmetic Coding
• Lossless Image Compression
Introduction: Lossless Compression Algorithms
• Compression is the process of eliminating redundant information
to decrease file size.
• More precisely, compression is a coding process that effectively reduces the
total number of bits needed to represent certain information.
• During encoding, frames and pixels are converted into a compact coded
representation that the computer can process.
• Data compression refers to a technique where a large file is
reduced to a smaller file and can later be decompressed back to the
original file.
• Decompression converts the coded representation back to frames
and pixels for playback.
Lossless Compression
• In lossless compression (as the name suggests) data are
reconstructed after compression without errors, i.e. no
information is lost.
• Lossless compression retains all of the data of the original file as
it's converted to a smaller file size.
• Typical application domains where you do not want to lose
information are the compression of text, program files, and fax documents.
In lossless compression, the information is recovered without any
alteration after the decompression stage.
When a lossless file is opened, the algorithm restores all compressed
information, creating an exact duplicate of the source file.
Lossless Compression….
• Lossless Compression typically is a process with three stages:
• The model: the data to be compressed is analyzed with respect
to its structure and the relative frequency of the occurring
symbols.
• The encoder: produces a compressed bit stream / file using the
information provided by the model.
• The adaptor: uses information extracted from the data (usually
during encoding) in order to adapt the model (more or less)
continuously to the data.
What is a lossless compression algorithm?
• Lossless compression is a form of data compression that reduces
file sizes without sacrificing any information in the
process - meaning it will not diminish the quality of your photos.
• Although there are various compression methods,
including Motion JPEG, only MPEG-1 and MPEG-2 are
internationally recognized standards for the compression of
moving pictures (video).
• The main advantages of compression are reductions in storage
hardware, data transmission time, and communication bandwidth.
• This can result in significant cost savings.
• Compressed files require significantly less storage capacity than
uncompressed files, meaning a significant decrease in expenses for
storage.
Information theory
• Information theory is defined to be the study of efficient coding and its
consequences.
• It is the field of study concerned about the storage and transmission of
data.
• It is concerned with source coding and channel coding.
– Source coding: involves compression
– Channel coding: how to transmit data, how to overcome noise, etc.
• Entropy is the measure of information content in a message.
• Data compression may be viewed as a branch of information theory in
which the primary objective is to minimize the amount of data to be
transmitted.
• Information theory examines the utilization, processing, transmission and
extraction of information.
• In the scenario of information communication over noisy channels, this
theoretical concept was formalized by Claude Shannon (in his work called A
Mathematical Theory of Communication) in 1948.
• In his work, information is considered as a group of possible messages.
• The main goal is to transfer these messages over noisy channels and to have the
receiving device reconstruct the message with negligible error probability
(regardless of the channel noise).
• The main result of Shannon’s work is the noisy-channel coding theorem.
According to the famous scientist Claude E. Shannon, of
Bell Labs, the entropy of an information source with
alphabet S = {s1, s2, ..., sn} is defined as

H(S) = sum over i = 1..n of p_i * log2(1 / p_i)

where p_i is the probability that symbol s_i in S will occur.
The term log2(1 / p_i) indicates the amount of information (the so-
called self-information defined by Shannon) contained in s_i,
which corresponds to the number of bits needed to encode s_i.
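The entropy definition above can be evaluated directly. A minimal sketch in Python (the example distributions are illustrative, not from the slides):

```python
import math

def entropy(probs):
    """Shannon entropy H(S) = sum of p_i * log2(1/p_i), in bits per symbol."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# A uniform 4-symbol source needs 2 bits per symbol:
print(entropy([0.25, 0.25, 0.25, 0.25]))           # 2.0
# A skewed source carries less information per symbol:
print(round(entropy([0.5, 0.25, 0.125, 0.125]), 3))  # 1.75
```

No entropy coder can use fewer bits per symbol, on average, than this value.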
7.3. The algorithms used in lossless
compression are:
7.3.1. Run-Length Coding
Memoryless source: an information source whose symbols are
independently distributed.
Namely, the value of the current symbol does not depend
on the values of the previously appeared symbols.
Instead of assuming a memoryless source, Run-Length
Coding (RLC) exploits the memory present in the information
source.
Rationale for RLC: if the information source has the
property that symbols tend to form continuous groups (runs),
then each symbol and the length of its group can be coded together.
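A minimal sketch of this idea, coding each run as a (symbol, length) pair (the helper names are illustrative):

```python
def rle_encode(data):
    """Encode a string as a list of (symbol, run_length) pairs."""
    runs = []
    for sym in data:
        if runs and runs[-1][0] == sym:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([sym, 1])     # start a new run
    return [(s, n) for s, n in runs]

def rle_decode(runs):
    """Reverse the encoding exactly: lossless by construction."""
    return "".join(s * n for s, n in runs)

print(rle_encode("AAAABBBCCD"))  # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
print(rle_decode(rle_encode("AAAABBBCCD")))  # AAAABBBCCD
```

RLC only pays off when runs are common; for a source with no repeats it can even expand the data.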
7.3.2. Variable-Length Coding (VLC)
Since the entropy indicates the information content in an information
source S, it leads to a family of coding methods commonly known as
entropy coding methods.
As described earlier, variable-length coding (VLC) is one of the best-
known such methods.
Here, we will study the Shannon–Fano algorithm, Huffman coding,
and adaptive Huffman coding.
Shannon–Fano Algorithm
A Top-down approach
Sort the symbols according to the frequency count of their
occurrences.
Recursively divide the symbols into two parts, each with
approximately the same number of counts, until all parts
contain only one symbol.
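The two steps above can be sketched as follows. This is a minimal recursive implementation; the split rule (pick the cut that balances the two halves' counts) and the example counts are assumptions for illustration:

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, count). Returns dict symbol -> binary code."""
    codes = {}

    def split(items, prefix):
        if len(items) == 1:
            codes[items[0][0]] = prefix or "0"
            return
        total = sum(c for _, c in items)
        # choose the cut that makes the two halves' counts as equal as possible
        acc, best_cut, best_diff = 0, 1, float("inf")
        for i, (_, c) in enumerate(items[:-1]):
            acc += c
            diff = abs(2 * acc - total)
            if diff < best_diff:
                best_cut, best_diff = i + 1, diff
        split(items[:best_cut], prefix + "0")   # top half
        split(items[best_cut:], prefix + "1")   # bottom half

    split(sorted(symbols, key=lambda sc: -sc[1]), "")  # sort by count first
    return codes

print(shannon_fano([("A", 15), ("B", 7), ("C", 6), ("D", 6), ("E", 5)]))
# {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}
```

Note that frequent symbols (A, B) receive shorter codes than rare ones (D, E), which is what makes the average code length approach the entropy.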
Huffman Coding
A bottom-up approach
Initialization: Put all symbols on a list sorted according to
their frequency counts.
Repeat until the list has only one symbol left:
From the list pick two symbols with the lowest frequency counts.
Form a Huffman sub-tree that has these two symbols as child nodes
and create a parent node.
Assign the sum of the children’s frequency counts to the parent and
insert it into the list such that the order is maintained
Delete the children from the list.
Assign a code word for each leaf based on the path from the
root.
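The bottom-up procedure above maps naturally onto a priority queue. A minimal sketch using Python's heapq (the tie-breaking counter is an implementation detail added so equal frequencies compare cleanly):

```python
import heapq
import itertools

def huffman_codes(freqs):
    """freqs: dict symbol -> count. Returns dict symbol -> binary code."""
    counter = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(f, next(counter), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two lowest-frequency subtrees
        f2, _, right = heapq.heappop(heap)
        # form a parent: prepend 0 to the left subtree's codes, 1 to the right's
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]

codes = huffman_codes({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5})
print(codes)  # the most frequent symbol, A, gets the shortest code
```

With these counts the code lengths come out as 1 bit for A and 3 bits for the other four symbols.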
Properties of Huffman Coding
Unique Prefix Property: No Huffman code is a prefix of
any other Huffman code.
Optimality: a minimum-redundancy code, proved optimal
for a given data model.
7.3.3 Dictionary-based Coding
The Lempel-Ziv-Welch (LZW) algorithm uses fixed-length code words to
represent variable-length strings of symbols/characters that
commonly occur together,
e.g., words in English text.
The LZW encoder and decoder build up the same dictionary
dynamically while receiving the data.
LZW places longer and longer repeated entries into a dictionary,
and then emits the code for an element, rather than the string
itself, if the element has already been placed in the dictionary.
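The encoder side of this process can be sketched as follows. This is a minimal version, assuming the dictionary is seeded with the 256 single-byte strings and that code-word width is handled elsewhere:

```python
def lzw_encode(text):
    """Build the dictionary dynamically; emit codes for the longest known strings."""
    dictionary = {chr(i): i for i in range(256)}  # initial single-character entries
    next_code = 256
    current = ""
    output = []
    for ch in text:
        if current + ch in dictionary:
            current += ch                         # keep extending the match
        else:
            output.append(dictionary[current])    # emit code for longest match
            dictionary[current + ch] = next_code  # add the new, longer entry
            next_code += 1
            current = ch
    if current:
        output.append(dictionary[current])
    return output

print(lzw_encode("ABABABA"))  # [65, 66, 256, 258] - codes for A, B, AB, ABA
```

The decoder rebuilds the same dictionary from the code stream alone, which is why no dictionary needs to be transmitted.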
7.3.4. Arithmetic Coding
Arithmetic coding is a more modern coding method that
usually outperforms Huffman coding in practice.
Huffman coding assigns each symbol a code word which
has an integral bit length. Arithmetic coding can treat the
whole message as one unit.
A message is represented by a half-open interval [a, b)
where a and b are real numbers between 0 and 1.
Initially, the interval is [0, 1). When the message becomes
longer, the length of the interval shortens and the number of
bits needed to represent the interval increases.
Suppose the alphabet is {A, B, C, D, E, F, $}, in which $ is a special symbol used to
terminate the message, and the probability distribution is known, as shown in the
accompanying figure.
A string of symbols CAEE$ is encoded as follows:
Initially, low = 0, high = 1.0, and range = 1.0.
After the first symbol C, Range_low(C) = 0.3, Range_high(C)
= 0.5;
so low = 0 + 1.0 x 0.3 = 0.3, high = 0 + 1.0 x 0.5 = 0.5.
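The remaining symbols are handled the same way. A sketch of the full interval-narrowing loop follows; the probability table is not reproduced in the text, so the ranges below are assumptions chosen to be consistent with Range_low(C) = 0.3 and Range_high(C) = 0.5 above:

```python
# Assumed per-symbol subintervals of [0, 1); only C's range is given in the text.
ranges = {
    "A": (0.0, 0.2), "B": (0.2, 0.3), "C": (0.3, 0.5),
    "D": (0.5, 0.55), "E": (0.55, 0.85), "F": (0.85, 0.9), "$": (0.9, 1.0),
}

def arith_encode(message):
    """Narrow [low, high) to each symbol's subinterval in turn."""
    low, high = 0.0, 1.0
    for sym in message:
        rng = high - low                 # current interval width
        sym_low, sym_high = ranges[sym]
        high = low + rng * sym_high      # shrink to the symbol's subinterval
        low = low + rng * sym_low
    return low, high

low, high = arith_encode("CAEE$")
print(low, high)  # final interval is approximately [0.33184, 0.33220)
```

Any single number inside the final interval identifies the whole message, which is how arithmetic coding escapes the one-codeword-per-symbol restriction of Huffman coding.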
7.4. Lossless Image Compression
One of the most commonly used compression
techniques in multimedia data compression is
differential coding.
Given an original image I(x, y), using a simple
difference operator we can define a difference
image d(x, y) as follows:
d(x, y) = I(x, y) − I(x − 1, y)
Due to the spatial redundancy that exists in normal images
I, the difference image d will have a narrower
histogram and hence a smaller entropy, as shown in
Fig. 7.9
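A minimal sketch of the difference operator on a row-major list of pixel rows (handling of the first column, which has no left neighbor, is an assumption; here it is simply copied unchanged):

```python
def difference_image(image):
    """d(x, y) = I(x, y) - I(x-1, y); the first column is copied as-is."""
    return [[row[x] if x == 0 else row[x] - row[x - 1]
             for x in range(len(row))]
            for row in image]

# A smooth row of pixel values becomes a narrow-range difference signal:
img = [[100, 102, 103, 103, 105]]
print(difference_image(img))  # [[100, 2, 1, 0, 2]]
```

The differences cluster near zero, so an entropy coder applied to d needs fewer bits per pixel than one applied to I, and the original image is exactly recoverable by a running sum.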
Lossless JPEG
Lossless JPEG: A special case of the JPEG image compression.
The Predictive method
1. Forming a differential prediction: A predictor combines
the values of up to three neighboring pixels as the predicted value for the current
pixel, indicated by ‘X’ in Fig.
7.10. The predictor can use any one of the seven schemes
listed in Table 7.6.
Fig. 7.10: Neighboring Pixels for Predictors in Lossless JPEG.
2. Encoding: The encoder compares the prediction with
the actual pixel value at the position ‘X’ and encodes the
difference using one of the lossless compression
techniques we have discussed, e.g., the Huffman coding
scheme
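The two steps can be sketched together as follows. Table 7.6 is not reproduced in the text, so the seven predictor formulas below are the standard lossless-JPEG predictors, and the integer (floor) division used for the averaging schemes is an assumption:

```python
# Neighbors of the current pixel X: A is to the left, B above, C above-left.
PREDICTORS = {
    1: lambda A, B, C: A,
    2: lambda A, B, C: B,
    3: lambda A, B, C: C,
    4: lambda A, B, C: A + B - C,
    5: lambda A, B, C: A + (B - C) // 2,
    6: lambda A, B, C: B + (A - C) // 2,
    7: lambda A, B, C: (A + B) // 2,
}

def predict_and_diff(A, B, C, X, scheme):
    """Return the residual X - P that would then be entropy-coded (e.g. Huffman)."""
    return X - PREDICTORS[scheme](A, B, C)

# Example neighborhood: A=100 (left), B=104 (above), C=102 (above-left), X=103
print(predict_and_diff(100, 104, 102, 103, 4))  # 1: P4 = 100 + 104 - 102 = 102
```

Because neighboring pixels are strongly correlated, the residuals are small and cluster near zero, which is exactly the situation where Huffman coding performs well.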
Thank you!!!!