Information Theory and Coding
(BECE313L)
Winter Semester 2024-25
Dr. Dilip Kumar Choudhary
Ph.D. (IIT-ISM, Dhanbad)
Assistant Professor, Department of Communication Engineering
School of Electronics Engineering (SENSE)
Vellore Institute of Technology, Vellore
Email: [email protected]
Cabin: SJT Annex 203 (J)
Website : https://sites.google.com/site/dilipiitdh
Winter Semester : 2024-25 Information Theory and Coding Slide No: 2
Module-5: Audio and Video Coding
• Audio Coding: Types – Linear Predictive Coding (LPC) – Code Excited LPC – Perceptual Coding – MPEG Audio Coding.
• Video Coding: Motion Estimation and Compensation – Types of Frames – Encoding and Decoding of Frames – Video Coding Standard: MPEG-4.
Dr. Dilip Kumar Choudhary, Asst. Prof. (Sr), SENSE,VIT
Audio Coding
Audio coding refers to the process of encoding and compressing audio
signals to reduce storage space and bandwidth requirements while
maintaining audio quality, using formats such as MP3, AAC, or FLAC.
Applications: Audio coding is used in various applications, including music and
video streaming, digital music players, and mobile devices.
Examples:
• MP3: A widely used lossy audio coding format, known for its good
compression and compatibility.
• AAC (Advanced Audio Coding): Another popular lossy format, offering better
audio quality than MP3 at similar bitrates.
• FLAC (Free Lossless Audio Codec): A lossless format that preserves the
original audio quality without any data loss.
• Opus: A versatile audio codec designed for both speech and music, offering
good quality at low bitrates.
• Vorbis: Another open-source lossy audio codec, similar to MP3 but with a focus
on quality and efficiency.
Perceptual audio coding
Perceptual audio coding is a compression technology for audio signals that
exploits the imperfections of the human ear.
Perceptual encoding is a lossy compression technique.
The task of a perceptual audio coder is to produce a decoded bit stream that
sounds identical (or at least as close as possible) to the original audio
while keeping the compressed file as small as possible.
Perceptual audio coding is based on two related characteristics of the human ear:
The sensitivity of the human ear is not the same for all frequencies;
A loud tone or noise can make a weaker tone inaudible.
The human ear is capable of hearing sounds in the frequency range between 20
Hz - 20 kHz.
The human ear is most sensitive to frequencies between 500 Hz and 5 kHz; the
sensitivity decreases below and above this range.
This means that a tone of 100 Hz must be louder than a tone of 1 kHz before
you can hear it.
The loudness needed for a tone to become audible is called the Threshold
in Quiet. Any tone with a loudness below this threshold will not be audible
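This frequency-dependent Threshold in Quiet is often summarized by Terhardt's well-known approximation of the absolute threshold of hearing. The sketch below is my addition, not part of the slides; the function name and the dB SPL units are assumptions:

```python
import math

def threshold_in_quiet_db(f_hz):
    # Terhardt's approximation of the absolute threshold of hearing (dB SPL).
    f = f_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)
```

Evaluating the formula confirms the statements above: a 100 Hz tone needs a much higher level to be audible than a 1 kHz tone, and the ear is most sensitive around 3-4 kHz.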
A second characteristic of the human ear is that a tone that is (just)
audible can be made inaudible by a louder tone or noise. Or to put it the
other way around, a loud tone can make other tones inaudible.
These two characteristics led to perceptual audio coding.
The input data is used to calculate the actual masking threshold for a short
period of time.
The coder also analyses the frequencies contained in the audio signal, using
either a filter bank or a time-to-frequency transformation.
The resulting frequency components are quantized whereby more bits are
allocated if the frequency component is well above the masking threshold and
none if the component is below the threshold.
The coded frequency components are packed into a bit stream.
The decoder performs the same steps in reverse. It does not have to calculate
the masking threshold.
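The encoder steps above can be sketched in a toy form. Everything in this sketch is an illustrative assumption (an FFT in place of a filter bank, a simple ~6 dB-of-SNR-per-bit rule, and a caller-supplied masking threshold), not the actual MPEG algorithm:

```python
import numpy as np

def perceptual_encode(frame, mask_threshold_db):
    # Frequency analysis (here a simple FFT instead of a filter bank).
    spectrum = np.fft.rfft(frame)
    level_db = 20 * np.log10(np.abs(spectrum) + 1e-12)
    coded = []
    for comp, lvl, thr in zip(spectrum, level_db, mask_threshold_db):
        margin = lvl - thr
        if margin <= 0:
            coded.append(None)  # below the masking threshold: no bits
        else:
            # More bits the further the component is above the threshold.
            bits = min(16, int(np.ceil(margin / 6.0)))  # ~6 dB SNR per bit
            step = (np.abs(comp) + 1e-12) / (2 ** bits)
            q = complex(round(comp.real / step), round(comp.imag / step))
            coded.append((bits, q))
    return coded
```

The decoder would dequantize the surviving components and apply the inverse transform; as noted above, it never needs to recompute the masking threshold, because the bit allocation is carried in the bit stream.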
Linear Predictive Coding (LPC)
LPC vocoders extract salient features of speech directly from the waveform, rather
than transforming the signal to the frequency domain
LPC Features:
• uses a time-varying model of vocal tract sound generated from a given excitation
• transmits only a set of parameters modeling the shape and excitation of the
vocal tract, not actual signals or differences – small bit-rate
About "Linear": the speech signal generated by the vocal tract model is
calculated as a function of the current excitation input plus a second term
that is linear in the previous output samples
LPC starts by deciding whether the current segment is voiced (vocal cords resonate)
or unvoiced:
For unvoiced: a wide-band noise generator creates a signal f(n) that acts as input to
the vocal tract simulator
For voiced: a pulse train generator creates signal f(n)
Model parameters ai: calculated using a least-squares set of equations that
minimizes the difference between the actual speech and the speech generated by
the vocal tract model when excited by the noise or pulse-train generator
• If the output values s(n) are generated from input values f(n), the output
depends on the p previous output sample values:

  s(n) = G · f(n) + Σ_{i=1}^{p} a_i · s(n − i)

where G is the "gain" factor and the a_i are the coefficients of the linear
predictor model
• The LP coefficients can be calculated by solving the following minimization
problem:

  min over {a_i} of E[ ( s(n) − Σ_{i=1}^{p} a_i · s(n − i) )² ]
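The minimization problem above can be sketched with NumPy. The helper name `lpc_coefficients` and the synthetic order-2 test signal are my additions for illustration:

```python
import numpy as np

def lpc_coefficients(s, p):
    # Least-squares fit of a_1..a_p minimizing
    #   sum over n of ( s[n] - sum_i a_i * s[n-i] )^2
    N = len(s)
    # Column i-1 holds s[n-i] for each predicted sample n = p..N-1.
    A = np.column_stack([s[p - i:N - i] for i in range(1, p + 1)])
    b = s[p:]
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    return a

# Synthesize a toy order-2 signal with known coefficients, then recover them.
rng = np.random.default_rng(0)
true_a = [0.75, -0.5]
s = np.zeros(5000)
e = rng.normal(size=5000)  # excitation (white noise)
for n in range(2, 5000):
    s[n] = true_a[0] * s[n - 1] + true_a[1] * s[n - 2] + e[n]

est = lpc_coefficients(s, 2)  # close to [0.75, -0.5]
```

Real LPC coders typically solve the same normal equations via the autocorrelation method (Levinson-Durbin recursion) rather than a general least-squares solver, since it is cheaper and guarantees a stable filter.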
Code Excited Linear Prediction (CELP)
CELP is a more complex family of coders that attempts to overcome the limited
quality of the simple LPC model
CELP uses a more complex description of the excitation:
• An entire set (a codebook) of excitation vectors is matched to the actual
speech, and the index of the best match is sent to the receiver
• The complexity increases the bit-rate to 4,800-9,600 bps
• The resulting speech is perceived as closer to the original and more continuous
• Quality achieved this way is sufficient for audio conferencing
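The codebook search can be illustrated as a naive exhaustive search with an optimal per-entry gain. The function name and structure are my assumptions, not a real CELP implementation (which uses perceptually weighted error and structured codebooks):

```python
import numpy as np

def best_codebook_index(target, codebook):
    # Exhaustive search: for each excitation vector, compute the optimal
    # scalar gain, then keep the index with the smallest squared error.
    best_i, best_err = -1, float("inf")
    for i, c in enumerate(codebook):
        g = np.dot(target, c) / np.dot(c, c)  # optimal gain for entry i
        err = np.sum((target - g * c) ** 2)   # residual matching error
        if err < best_err:
            best_i, best_err = i, err
    return best_i  # only this index (plus the gain) is transmitted
```

Transmitting an index into a shared codebook instead of the excitation itself is what keeps the bit-rate in the 4,800-9,600 bps range quoted above.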
Psychoacoustics
The range of human hearing is about 20 Hz to about 20 kHz
The frequency range of the voice is typically only from about 500 Hz to 4
kHz
The dynamic range, the ratio of the maximum sound amplitude to the quietest
sound that humans can hear, is on the order of about 120 dB
MPEG Audio Psycho-acoustic Model
MPEG audio compresses by removing acoustically irrelevant parts of audio
signals
Takes advantage of the human auditory system's inability to hear quantization
noise under auditory masking
Auditory masking: occurs whenever the presence of a strong audio signal
makes weaker audio signals in its temporal or spectral neighborhood
imperceptible.
MPEG Layers 1 and 2
• Mask calculations are performed in parallel with subband filtering
Layer I Coding
Bit Allocation Algorithm
Aim: ensure that all of the quantization noise is below the masking
thresholds.
One common scheme:
• For each subband, the psychoacoustic model calculates the
Signal-to-Mask Ratio (SMR) in dB
• Then the "Mask-to-Noise Ratio" (MNR) is defined as the difference
MNR_dB = SNR_dB − SMR_dB
• The subband with the lowest MNR is determined, and the number of code
bits allocated to that subband is incremented
• Then a new estimate of the SNR is made, and the process
iterates until there are no more bits to allocate
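The iteration above can be sketched as a greedy loop. The ~6 dB-of-SNR-per-bit rule and the function signature are assumptions for illustration, not the normative MPEG procedure:

```python
def allocate_bits(smr_db, total_bits, snr_per_bit_db=6.0):
    # Greedy allocation: repeatedly give one bit to the subband with the
    # lowest mask-to-noise ratio, MNR = SNR - SMR, until bits run out.
    n = len(smr_db)
    bits = [0] * n
    for _ in range(total_bits):
        # Estimated SNR of each subband from its current bit count.
        mnr = [bits[i] * snr_per_bit_db - smr_db[i] for i in range(n)]
        worst = min(range(n), key=lambda i: mnr[i])
        bits[worst] += 1
    return bits
```

For example, with subband SMRs of 20, 10 and 0 dB and a budget of 10 bits, the loop gives the most bits to the subband whose signal stands furthest above its mask.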
Layer 1 Coding: Framing
Layer 1 Coding: Mode
Layer 2 Coding: Overview
Layer 2 Coding: Quantization
Layer 2 Coding: Framing
Layer 2 Coding: Framing and Quantization
Layer 2 of MPEG Audio
Main difference:
• Three groups of 12 samples are encoded in each
frame and temporal masking is brought into play, as
well as frequency masking
• Bit allocation is applied to window lengths of 36
samples instead of 12
• The resolution of the quantizers is increased from
15 bits to 16
Advantage:
• a single scaling factor can be used for
all three groups
Layer 3 Coding: MP3
Layer 3 of MPEG Audio
Main difference:
• Employs a similar filter bank to that used in Layer 2,
except using a set of filters with non-equal frequency bands
• Takes into account stereo redundancy
• Uses the Modified Discrete Cosine Transform (MDCT), which
addresses the problems that the DCT has at the boundaries of
the analysis window by overlapping frames by 50%
MPEG Layer 3 Coding
MPEG Audio Header Frame Format
• First 12 bits – sync pattern, consisting of all 1's
• Next 1 bit is the version ID, and the next 2 bits are the layer indicator
• Next 1 bit is the CRC protection flag (0 if a 16-bit CRC follows the
header)
• All of the above 16 bits are used for frame synchronization
• Next 4 bits specify the bit rate in kbit/s (14 specified bit
rates are available)
• Next 2 bits indicate the sampling frequency (00 indicates
44.1 kHz for MPEG-1 and 22.05 kHz for MPEG-2)
• Next bit is a single padding bit, and the next 2 bits indicate
the mode (stereo, joint stereo, dual channel, mono).
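The field layout listed above can be exercised with a small parser. The function name is my addition; the example header value 0xFFFB9064 is a typical MPEG-1 Layer III header (bit-rate index 9 corresponds to 128 kbit/s per the standard bit-rate table):

```python
def parse_mpeg1_header(header):
    # Extract the fields listed above from a 32-bit frame header,
    # numbering bits from the most significant end.
    return dict(
        sync        = (header >> 20) & 0xFFF,  # 12 sync bits, all 1's
        version_id  = (header >> 19) & 0x1,    # 1 = MPEG-1
        layer       = (header >> 17) & 0x3,    # 01 = Layer III, 11 = Layer I
        protection  = (header >> 16) & 0x1,    # 0 = CRC follows the header
        bitrate_idx = (header >> 12) & 0xF,    # index into bit-rate table
        samp_freq   = (header >> 10) & 0x3,    # 00 = 44.1 kHz for MPEG-1
        padding     = (header >> 9)  & 0x1,    # single padding bit
        mode        = (header >> 6)  & 0x3,    # 00 stereo .. 11 mono
    )
```

Running it on 0xFFFB9064 yields sync 0xFFF, version 1, layer bits 01 (Layer III), no CRC, bit-rate index 9, 44.1 kHz, and mode 01 (joint stereo).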
MP3 Compression Performance
MPEG-2 AAC (Advanced Audio Coding)
The standard vehicle for DVDs:
• Audio coding technology for the DVD-Audio
Recordable (DVD-AR) format, also adopted by XM
Radio
• Aimed at transparent sound reproduction for theaters
• Can deliver this at 320 kbps for five channels so that
sound can be played from 5 different directions:
• Left, Right, Center, Left-Surround, and Right-Surround
MPEG-2 AAC
• Also capable of delivering high-quality stereo sound
at bit-rates below 128 kbps
• Supports up to 48 channels, sampling rates between
8 kHz and 96 kHz, and bit-rates up to 576 kbps per
channel
• Like MPEG-1 and MPEG-2, it supports three different
"profiles", but with a different purpose:
• Main profile
• Low Complexity (LC) profile
• Scalable Sampling Rate (SSR) profile
MPEG-4 Audio
• Integrates several different audio components into one
standard: speech compression, perceptually based coders,
text-to-speech, and MIDI
• MPEG-4 AAC (Advanced Audio Coding) is similar to the
MPEG-2 AAC standard, with some minor changes
Perceptual Coders
• Incorporate a Perceptual Noise Substitution module
• Include a Bit-Sliced Arithmetic Coding (BSAC) module
• Also include a second perceptual audio coder, a vector-
quantization method entitled TwinVQ
MPEG-4 Audio
Structured Coders
• Adopts "Synthetic/Natural Hybrid Coding" (SNHC) to make
very low bit-rate delivery an option
• Objective: integrate "natural" multimedia sequences, both
video and audio, with synthetically generated ones –
"structured" audio
• Takes a “toolbox" approach and allows specification
of many such models.
• E.g., Text-To-Speech (TTS) is an ultra-low bit-rate method,
and actually works, provided one need not care what the
speaker actually sounds like
Other Commercial Audio Codecs
MPEG-7 and MPEG-21
MPEG-7: A means of standardizing meta-data for audiovisual
multimedia sequences - meant to represent information about
multimedia information
• In terms of audio: facilitate the representation and search for sound
content. Example application supported by MPEG-7: automatic speech
recognition (ASR).
MPEG-21: An ongoing effort aimed at standardizing a
Multimedia Framework from a consumer's perspective,
particularly interoperability
• In terms of audio: support of this goal, using audio.
Difference from current standards:
• MPEG-4 is aimed at compression using objects.
• MPEG-7 is mainly aimed at "search": how can we find objects,
assuming that multimedia is indeed coded in terms of objects?