Binaural Cue Coding Applied To Audio Compression With Flexible Rendering
This convention paper has been reproduced from the author's advance manuscript, without editing, corrections,
or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers
may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New
York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper,
or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering
Society.
ABSTRACT
In this paper we describe an efficient scheme for compression and flexible spatial rendering of audio signals.
The method is based on Binaural Cue Coding (BCC) which was recently introduced for efficient compression
of multi-channel audio signals. The encoder input consists of separate signals without directional spatial
cues, such as separate sound source signals, i.e. several monophonic signals. The signal transmitted to the
decoder consists of the mono sum-signal of all input signals plus a low bit rate (e.g. 2 kb/s) set of BCC
parameters. The mono signal can be encoded with any conventional audio or speech coder. Using the BCC
parameters and the mono signal, the BCC synthesizer can flexibly render a spatial image by determining
the perceived direction of the audio content of each of the encoder input signals. We provide the results of
an audio quality assessment using headphones, which is a more critical scenario than loudspeaker playback.
AES 113TH CONVENTION, LOS ANGELES, CA, USA, 2002 OCTOBER 5-8 2
FALLER AND BAUMGARTE BCC FOR AUDIO COMPRESSION WITH FLEXIBLE RENDERING
Fig. 4: The time-frequency plane is divided into a grid of partitions (axes: time index k, partition
index b). Each partition is associated with the energetically strongest source.

case shown in Fig. 3. For the scheme we are presenting in this paper this overlap is ignored and only the
set of cues of one source in areas of the subdivided time-frequency plane is considered (e.g. the set of
cues of the energetically dominant source). A set of cues are the spatial cues associated with one audio
source (one ICLD and ICTD, or a pair of HRTFs).

3 SYNTHESIS OF SPATIAL CUES

BCC generates multi-channel audio signals by synthesizing ICLD and ICTD between pairs of channels.
The time-frequency plane is divided into a grid of non-overlapping partitions as shown in Fig. 4.
At each time index k there are B partitions with partition index b. For simplicity, the time index
is mostly suppressed in the following. The specific time-frequency resolution of the grid of partitions is
chosen according to perceptual criteria. In this section we describe the time-frequency transform that
is used and the processing that is applied to the spectral coefficients for synthesizing one ICLD and
ICTD in each partition. A more detailed description is given in [5].

3.1 The Time-Frequency Transform

A frame of N samples is multiplied with an analysis window before an N-point DFT is applied. The
following analysis window with zero padding at both sides is used,

$$w_a[i] = \begin{cases} 0 & \text{for } 0 \le i < Z \\ \sin^2\!\left(\frac{(i-Z)\pi}{W}\right) & \text{for } Z \le i < Z+W \\ 0 & \text{for } N-Z \le i < N \end{cases} \qquad (1)$$

where Z is the width of the zero region before and after the non-zero part of the window. The non-zero
window span is W and the size of the transform is N = 2Z + W. Adjacent windows are overlapping and
are shifted by W/2 samples. The analysis window was chosen such that the overlapping windows add
up to a constant value of one. Therefore, for synthesis in the decoder there is no need for additional
windowing.

The underlying signal can be filtered by multiplication of its spectrum $S_n$ with the frequency response
$H_n$ of a filter,

$$\tilde{S}_n = S_n H_n. \qquad (2)$$

As long as the filter's impulse response satisfies

$$h[i] = 0 \quad \text{for } |i| > Z, \qquad (3)$$

no aliasing occurs due to the circular convolution property of the DFT. The described procedure is
very similar to the overlap-add convolution method using the DFT [6], except that the filter does not
need to be causal. Obviously, time shifts and level modifications can be imposed on the underlying
audio signal by filtering (2). Time shifts and level modifications are applied to the audio channels to
synthesize ICTD and ICLD.

The uniform spectral resolution of the DFT is not well adapted to human perception. Therefore, the
uniformly spaced spectral coefficients $S_n$ ($0 \le n \le N/2$) are grouped into B non-overlapping partitions
such that each partition has a bandwidth equal to the "critical bandwidth" for binaural perception.
Only the first N/2 + 1 spectral coefficients of the spectrum are considered because the spectrum is
symmetric. We found that for our purposes the frequency resolution was high enough when choosing
the critical bandwidth to be two times the Equivalent Rectangular Bandwidth (ERB) [7]. The range
of consecutive DFT coefficients belonging to partition b includes the following DFT indices
$n \in \{A_{b-1}, A_{b-1}+1, \ldots, A_b - 1\}$. For the first partition the lower bound is $A_0 = 0$.
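The grouping of DFT coefficients into partitions of roughly two ERB can be sketched as follows. This is only an illustration under stated assumptions: the ERB-rate formula is the commonly used one associated with Glasberg and Moore [7], while the rule for placing the boundaries $A_b$ (a new partition starts at the first bin that crosses the next 2-ERB edge) and the names `erb_number` and `partition_bounds` are hypothetical and not taken from the paper.

```python
import math

def erb_number(f_hz):
    """ERB-rate scale: approximate number of ERBs below frequency f_hz."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def partition_bounds(n_fft=1024, fs=32000.0, erbs_per_partition=2.0):
    """Return [A_0, ..., A_B]; partition b covers bins A_{b-1} .. A_b - 1.

    The boundary placement rule is an assumption made for this sketch.
    """
    n_bins = n_fft // 2 + 1          # only N/2 + 1 coefficients are considered
    bounds = [0]                     # A_0 = 0
    next_edge = erbs_per_partition   # upper edge of the current partition (ERB units)
    for n in range(1, n_bins):
        e = erb_number(n * fs / n_fft)
        if e >= next_edge:
            bounds.append(n)         # bin n opens a new partition
            while e >= next_edge:
                next_edge += erbs_per_partition
    bounds.append(n_bins)            # the last partition ends at the Nyquist bin
    return bounds

bounds = partition_bounds()
B = len(bounds) - 1                  # number of partitions
```

With N = 1024 and a 32 kHz sample rate this gives on the order of twenty partitions, narrow at low frequencies and wide at high frequencies.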
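The overlap-add property of the analysis window in Section 3.1 can be checked numerically. The sketch below takes the non-zero part of the window to be a sin² taper, which is one shape that sums to one when shifted by W/2; it uses W = 896 and Z = 64, the values reported for the experiments. The function names are illustrative only.

```python
import math

def analysis_window(W=896, Z=64):
    """Window of length N = 2Z + W: Z zeros, a sin^2 taper of span W (assumed shape), Z zeros."""
    N = 2 * Z + W
    w = [0.0] * N
    for i in range(Z, Z + W):
        w[i] = math.sin((i - Z) * math.pi / W) ** 2
    return w

def overlap_sum(w, W, n_frames=4):
    """Add n_frames copies of w, each shifted by W/2 samples relative to the previous one."""
    hop = W // 2
    total = [0.0] * (len(w) + hop * (n_frames - 1))
    for k in range(n_frames):
        for i, v in enumerate(w):
            total[k * hop + i] += v
    return total

W, Z = 896, 64
w = analysis_window(W, Z)
s = overlap_sum(w, W, n_frames=4)
# Region in which two non-zero window spans overlap: the sum should equal one there.
interior = s[Z + W // 2 : Z + 4 * (W // 2)]
max_dev = max(abs(v - 1.0) for v in interior)
```

Because the shifted windows already sum to one, the decoder can overlap-add the inverse-transformed frames without a separate synthesis window.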
For our experiments we used W = 896, Z = 64, and N = 1024 for a sample rate of 32 kHz.

3.2 Definition of ICLD and ICTD for Multiple Channels

In the general case of C playback channels the ICLD and ICTD are given for each channel relative to a
reference channel. Without loss of generality, channel number 1 is defined as the reference channel.
Figure 5 shows how ICLD and ICTD are defined between the reference channel and each other channel
for the b-th partition. For example, $\{\Delta L_{i,b}, \tau_{i,b}\}$ is the set of the ICLD and ICTD between channel 1
and channel i + 1 for the b-th partition.

3.3 BCC Synthesis of a Multi-Channel Audio Signal

Given the spectral coefficients $S_n$ of the mono sum signal, the spectral coefficients $S_{c,n}$ for each channel
c are obtained by

$$S_{c,n} = F_{c,n}\, G_{c,n}\, S_n. \qquad (4)$$

For the reference channel, the level factor is

$$F_{1,n} = \left(1 + \sum_{i=1}^{C-1} 10^{\Delta L_{i,b}/10}\right)^{-1/2}.$$

The complex factors $G_{c,n}$ for channel c > 1 are computed for each spectral coefficient within a partition
b ($A_{b-1} \le n < A_b$) given $\tau_{c-1,b}$ in sampling intervals,

$$G_{c,n} = \exp\!\left(-j\,\frac{2\pi n\,(\tau_{c-1,b} - d_b)}{N}\right), \qquad (7)$$

where $d_b$ is the delay which is introduced into reference channel 1,

$$d_b = \left(\max_{1 \le i < C} \tau_{i,b} + \min_{1 \le i < C} \tau_{i,b}\right)/2, \qquad (8)$$

$$G_{1,n} = \exp\!\left(-j\,\frac{2\pi n\, d_b}{N}\right).$$

The delay for the reference channel $d_b$ is computed in (8) such that the maximum absolute delay
introduced to any channel in each partition is minimal. By minimizing the maximum delay, the maximum
audio signal distortion resulting from (7) is reduced.
where $S_{m,n}$ are the spectral coefficients of audio source m.

We observed that the selection of source indices according to (9) sometimes results in fast variations of
$I_b$ in time. When two sources have nearly the same energy in a frequency band, then the choice between
the two corresponding source indices by (9) is quite random. The result is that the source indices toggle
between two values resulting in instability of the spatial image of the BCC synthesized signal. Therefore,
we implemented a second version of the source index selection. In that version the source index of
a partition can only change if a different source is at least a dB stronger than the second strongest source
in that partition.

5 BCC SYNTHESIS

For each partition b the spatial cues are obtained from a local table which stores spatial cues for each
source m ($1 \le m \le M$). These spatial cues are selected according to the specific playback setup
(headphone/loudspeakers, number and position of loudspeakers). For each partition the spatial cues
are assigned by table look-up according to the source index $I_b$. Then the multi-channel output signal is
generated by applying (4).

As opposed to synthesizing ICLD and ICTD, binaural signals can be generated with HRTFs as spatial
cues. In this case, the local table in the BCC synthesizer stores for each source m a left and right HRTF
frequency response, $H_{L,m,n}$ and $H_{R,m,n}$. For obtaining the left and right binaural signals, the spectral
coefficients in each partition b are multiplied with the coefficients corresponding to the left and right
HRTF associated with source $I_b$,

$$S_{L,n} = H_{L,I_b,n}\, S_n \quad \text{and} \quad S_{R,n} = H_{R,I_b,n}\, S_n, \qquad (11)$$

where $n \in \{A_{b-1}, A_{b-1}+1, \ldots, A_b - 1\}$ are the indices of spectral coefficients within partition b. It
must be made sure that the impulse response of each HRTF satisfies the condition of (3). An example of
this process of "mixing" HRTFs is shown for the left channel in Fig. 8. To prevent aliasing artifacts, the
transitions between partitions need to be smoothed. This is done by using overlapping windows for the
different portions of HRTFs between the partitions.

6 CODING OF THE BCC PARAMETERS

For coding applications it is desirable to compress the BCC parameters. A run-length coding
algorithm [8] is applied to the source indices $I_b$ (9) over
Fig. 9: Scheme for a tele-conferencing client based on
BCC: (A) BCC synthesis and (B) stereo echo canceler.
signals. The sources are panned to the left and right by using ICLD = ±8 dB and ICTD = ±219 μs
(= ±7 samples). For panning a source to the middle, ICLD = 0 dB and ICTD = 0 μs are used.

     category      left        right      middle
1    speech        male        female     none
2    singing       tenor       soprano    none
3    instruments   piano       organ      none
4    percussion    castanets   drums      none
5    speech        male        female     male
6    singing       tenor       soprano    alto
7    instruments   piano       organ      harpsichord

Table 1: List of the audio excerpts used in the subjective test. The last three columns are the sources that
are placed to the left side, right side, and middle of the spatial image by imposing ICLD and ICTD between
left and right.

     property                                               scale
1    image width                                            stereo ... mono
2    image stability                                        stable ... unstable
3    audio quality disregarding spatial image distortions   ITU-R 5-grade impairment
4    overall audio quality                                  ITU-R 5-grade impairment

Table 2: Tasks of the subjective test.
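The panning values quoted above can be cross-checked with a few lines of arithmetic. The sketch assumes the 32 kHz sample rate stated for the experiments in Section 3.1; the variable names are illustrative.

```python
fs = 32000.0                          # sample rate in Hz (as stated for the experiments)
ictd_samples = 7                      # lateral panning delay in samples

ictd_us = ictd_samples / fs * 1e6     # delay in microseconds: 218.75, i.e. about 219
amp_ratio = 10.0 ** (8.0 / 20.0)      # amplitude ratio for an ICLD of 8 dB, about 2.5
power_ratio = 10.0 ** (8.0 / 10.0)    # corresponding power ratio, about 6.3
```

A 7-sample delay at 32 kHz is thus consistent with the quoted 219 μs, and an ICLD of 8 dB makes one channel about 2.5 times larger in amplitude than the other.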
Fig. 11: The average grading for image width and image stability for the 8 sound signals (A solid,
B dashed, C dotted).

Fig. 12: The average grading for audio quality disregarding spatial image and overall audio quality for
the 8 sound signals (A solid, B dashed, C dotted).
source signals plus BCC parameters are transmitted. The decoder can flexibly render spatial images
as if the source signals were transmitted separately. The presented scheme provides flexible rendering
capability at bitrates lower than conventional schemes which usually transmit each of the source signals
separately.

We described a number of application scenarios for the presented scheme, such as tele-conferencing,
virtual reality systems, and computer gaming. Not only the flexible rendering capability and low bitrate
is appealing, but also backwards compatibility with existing monophonic systems (if the BCC parameters
can be transmitted in addition to the mono signal).

Headphone listening test results reveal that the flexible rendering functionality of BCC does not imply
degraded audio quality with respect to BCC with encoder-controlled spatial image (BCC type II) for
the items tested.

Future work will focus on further refinement of the algorithm for improving the audio quality of the
most critical audio material.

ACKNOWLEDGMENTS

We would like to thank Peter Kroon and Martin Vetterli for the valuable discussions and suggestions.

REFERENCES

[1] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization, MIT Press, 1983.

[5] C. Faller and F. Baumgarte, "Binaural Cue Coding applied to stereo and multi-channel audio
compression," in Preprint 112th Conv. Aud. Eng. Soc., May 2002.

[6] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, Signal Processing Series.
Prentice Hall, 1989.

[7] B. R. Glasberg and B. C. J. Moore, "Derivation of auditory filter shapes from notched-noise
data," Hear. Res., vol. 47, pp. 103-138, 1990.

[8] N. S. Jayant and P. Noll, Digital Coding of Waveforms, Prentice-Hall Signal Processing Series, 1984.

[9] J. Benesty, T. Gansler, D. R. Morgan, M. M. Sondhi, and S. L. Gay, Advances in Network and
Acoustic Echo Cancellation, Springer, 2001.

[10] ITU-T, Pulse Code Modulation (PCM) of Voice Frequencies, 1988 (Blue Book), 1993,
Recommendation G.711.

[11] F. Baumgarte and C. Faller, "Why Binaural Cue Coding is better than intensity stereo,"
in Preprint 112th Conv. Aud. Eng. Soc., May 2002.

[12] F. Baumgarte and C. Faller, "Design and evaluation of Binaural Cue Coding schemes,"
in Preprint 113th Conv. Aud. Eng. Soc., Oct. 2002.