
Audio Engineering Society

Convention Paper 5686


Presented at the 113th Convention
2002 October 5-8 Los Angeles, CA, USA

This convention paper has been reproduced from the author's advance manuscript, without editing, corrections,
or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers
may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New
York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper,
or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering
Society.

Binaural Cue Coding Applied to Audio Compression with Flexible Rendering

Christof Faller¹, Frank Baumgarte¹
¹Media Signal Processing Research, Agere Systems, Murray Hill, New Jersey 07974, USA
Correspondence should be addressed to Christof Faller (cfaller@agere.com)

ABSTRACT

In this paper we describe an efficient scheme for compression and flexible spatial rendering of audio signals.
The method is based on Binaural Cue Coding (BCC) which was recently introduced for efficient compression
of multi-channel audio signals. The encoder input consists of separate signals without directional spatial
cues, such as separate sound source signals, i.e. several monophonic signals. The signal transmitted to the
decoder consists of the mono sum-signal of all input signals plus a low bit rate (e.g. 2 kb/s) set of BCC
parameters. The mono signal can be encoded with any conventional audio or speech coder. Using the BCC
parameters and the mono signal, the BCC synthesizer can flexibly render a spatial image by determining
the perceived direction of the audio content of each of the encoder input signals. We provide the results of
an audio quality assessment using headphones, which is a more critical scenario than loudspeaker playback.

1 INTRODUCTION

In this paper we present a scheme for audio compression with a flexible rendering capability at the decoder. Flexible rendering means that the decoder can determine how the compressed audio is spatialized. A number of sound sources, e.g. audio tracks, which are encoded jointly can be synthesized individually such that they appear at desired locations in auditory space. Furthermore, the number of playback channels and even the rendering method (binaural rendering with head-related transfer functions [1], rendering with level differences and time differences [1]) can be determined by the decoder. In the following we refer to the encoder inputs as sound sources, but in a more general sense the inputs can be any kind of audio signal, i.e. any monophonic signal.

A generic audio compression scheme with flexible rendering capability at the decoder is shown in Fig. 1. The M input sound sources are encoded and transmitted as one bitstream to the decoder. Parameters locally determined at the decoder are the playback setup (e.g. loudspeakers/headphones, number and positioning of loudspeakers) and the spatial rendering parameters (e.g. rendering method and direction at which each track is rendered).

Fig. 1: Generic scheme for audio compression with flexible rendering capability at the decoder.

A straightforward way of implementing a system as shown in Fig. 1 is to encode each of the M sound sources independently with a mono audio coder and combine the resulting M bitstreams into one combined bitstream. The decoder then parses the combined bitstream and decodes each of the M sound sources. Given the decoded M sound sources the decoder can render these without constraints using any specific playback setup. The drawback of this approach is that the bitrate scales with the number of sound sources M.

In this paper we show how the recently introduced concept of Binaural Cue Coding (BCC) [2, 3, 4] can be applied to audio compression with flexible rendering. The resulting scheme only transmits the sum signal of all sound sources and BCC parameters to the decoder. The sum signal can be coded with any mono audio or speech coder. Given the sum signal plus the BCC parameters, the decoder can freely render a custom spatial image of the sound sources for any specific playback setup as if the sound sources were transmitted separately. The described scheme is shown in Fig. 2. A BCC scheme with flexible rendering capability is called BCC type I in the BCC concept overview paper presented earlier [3].

Fig. 2: Scheme for audio compression with flexible rendering using BCC.

The paper is organized as follows. Section 2 motivates BCC for flexible rendering. A review on how to synthesize multi-channel audio signals given a mono audio signal plus spatial cues (level differences, time differences) is given in Section 3. Sections 4 and 5 describe the BCC analysis and synthesis schemes for flexible rendering, respectively. The coding of the BCC parameters for a lower bitrate is described in Section 6. Section 7 proposes a few applications for the BCC scheme with flexible rendering. The subjective evaluation and results are presented in

AES 113TH CONVENTION, LOS ANGELES, CA, USA, 2002 OCTOBER 5-8 2
FALLER AND BAUMGARTE BCC FOR AUDIO COMPRESSION WITH FLEXIBLE RENDERING

Section 8 and conclusions are drawn in Section 9.

2 CONCEPT AND RATIONALE FOR BCC

In Section 2.1 we briefly describe conventional methods for rendering binaural signals, stereo signals, and multi-channel signals given a number of sound sources. Based on the described methods, Section 2.2 motivates the BCC scheme with flexible rendering.

2.1 Rendering a Spatial Image

Sound from a single source in free field reaches the two ears of a listener with an interaural level difference (ILD) and an interaural time difference (ITD). The ILD and ITD determine the perceived azimuth of a sound source in the horizontal plane [1]. A more precise description of binaural directional cues is given by head-related transfer functions (HRTFs) [1]. An HRTF is the transfer function that characterizes the path from the sound source to the eardrum. It largely depends on frequency and angle of sound incidence. The terms inter-channel level difference (ICLD) and inter-channel time difference (ICTD) are used for a level difference or time difference between pairs of loudspeaker signals or between the left and right channel of a binaural signal (headphones). For headphone playback, the ICLD and ICTD between left and right directly determine the binaural directional cues (ILD and ITD). For loudspeaker playback, the ICLD and ICTD between a pair of loudspeakers indirectly determine binaural directional cues. It is a common practice to generate stereo and multi-channel audio signals for loudspeaker playback by simply imposing ICLD between pairs of loudspeaker signals [1]. ICTD can also be used for directionally placing phantom sources between a pair of loudspeakers [1]. In this paper, the term spatial cues refers to ICLD and ICTD, or possibly HRTFs in the case of headphone playback.

The conventional way of generating stereo or multi-channel audio signals with spatialized sources is described in the following. A mono signal of a sound source can be processed to generate a number of audio channels with spatial cues. For example, a stereo signal with a single phantom source, placed somewhere between left and right, can be generated by imposing an ICLD and/or ICTD between the left and right channel. Similarly, a multi-channel signal can be generated by imposing ICLD and/or ICTD between pairs of channels. For generating a multi-channel signal with several phantom sources perceived in different directions, separate multi-channel signals are generated for each source as described. Then, all resulting multi-channel signals are added to form one composite multi-channel signal.

2.2 Rendering a Spatial Image with BCC

The concept of BCC is the separation of the information relevant for spatial perception and the basic audio content. BCC represents spatial audio as a mono audio signal and BCC parameters. The mono audio signal is just the sum signal derived from all sound sources which are to be part of the spatial image of the multi-channel signal generated by the BCC synthesis. The BCC parameters include information which the BCC synthesis needs for generating a multi-channel audio signal (i.e. where in time and frequency spatial cues need to be inserted).

BCC aims at generating multi-channel signals similar to the composite multi-channel signal described in Section 2.1. However, only the sum signal of the sources plus BCC parameters are available at the receiver. One can see that this regeneration is feasible when the source signals occupy non-overlapping regions in the time-frequency plane. In this case the multi-channel signal is generated by taking the sum signal and by applying to each region in the time-frequency plane the spatial cues corresponding to the source the region belongs to. Figure 3 shows an example of the time-frequency plane in which three sources occupy non-overlapping regions.

Fig. 3: Three sources which occupy non-overlapping regions in the time-frequency plane.

In general, different sources occupy overlapping time-frequency regions in contrast to the simplified


case shown in Fig. 3. For the scheme we are presenting in this paper this overlap is ignored and only the set of cues of one source in areas of the subdivided time-frequency plane is considered (e.g. the set of cues of the energetically dominant source). A set of cues are the spatial cues associated with one audio source (one ICLD and ICTD, or a pair of HRTFs).

3 SYNTHESIS OF SPATIAL CUES

BCC generates multi-channel audio signals by synthesizing ICLD and ICTD between pairs of channels. The time-frequency plane is divided into a grid of non-overlapping partitions as shown in Fig. 4. At each time index k there are B partitions with partition index b. For simplicity, the time index is mostly suppressed in the following. The specific time-frequency resolution of the grid of partitions is chosen according to perceptual criteria. In this section we describe the time-frequency transform that is used and the processing that is applied to the spectral coefficients for synthesizing one ICLD and ICTD in each partition. A more detailed description is given in [5].

Fig. 4: The time-frequency plane is divided into a grid of partitions (partition index b versus time index k). Each partition is associated with the energetically strongest source.

3.1 The Time-Frequency Transform

A frame of N samples is multiplied with an analysis window before an N-point DFT is applied. The following analysis window with zero padding at both sides is used,

$$w_a[i] = \begin{cases} 0 & \text{for } 0 \le i < Z \text{ and } N - Z \le i < N, \\ \sin^2\!\left(\dfrac{(i - Z)\pi}{W}\right) & \text{for } Z \le i < Z + W, \end{cases} \qquad (1)$$

where Z is the width of the zero region before and after the non-zero part of the window. The non-zero window span is W and the size of the transform is N = 2Z + W. Adjacent windows are overlapping and are shifted by W/2 samples. The analysis window was chosen such that the overlapping windows add up to a constant value of one. Therefore, for synthesis in the decoder there is no need for additional windowing.

The underlying signal can be filtered by multiplication of its spectrum $S_n$ with the frequency response $H_n$ of a filter,

$$\tilde{S}_n = S_n H_n. \qquad (2)$$

As long as the filter's impulse response satisfies

$$h[l] = 0 \quad \text{for } |l| > Z, \qquad (3)$$

no aliasing occurs due to the circular convolution property of the DFT. The described procedure is very similar to the overlap-add convolution method using the DFT [6], except that the filter does not need to be causal. Obviously, time shifts and level modifications can be imposed on the underlying audio signal by filtering (2). Time shifts and level modifications are applied to the audio channels to synthesize ICTD and ICLD.

The uniform spectral resolution of the DFT is not well adapted to human perception. Therefore, the uniformly spaced spectral coefficients $S_n$ ($0 \le n \le N/2$) are grouped into B non-overlapping partitions such that each partition has a bandwidth equal to the "critical bandwidth" for binaural perception. Only the first N/2 + 1 spectral coefficients of the spectrum are considered because the spectrum is symmetric. We found that for our purposes the frequency resolution was high enough when choosing the critical bandwidth to be two times the Equivalent Rectangular Bandwidth (ERB) [7]. The range of consecutive DFT coefficients belonging to partition b includes the following DFT indices $n \in \{A_{b-1}, A_{b-1} + 1, \ldots, A_b - 1\}$. For the first partition the lower bound is $A_0 = 0$.
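The two properties of the transform described in this section can be checked numerically. The following is a minimal sketch, not the authors' implementation: it builds the zero-padded sin² window of (1) with the parameter values the paper reports for its experiments (W = 896, Z = 64, 32 kHz sample rate), verifies that windows shifted by W/2 add up to one, and derives one possible set of partition boundaries $A_b$ that are about two ERB wide. The function names, the greedy boundary rule (bandwidth evaluated at the lower edge of each partition), and the use of the commonly quoted ERB formula 24.7(4.37 f/1000 + 1) Hz are assumptions made for illustration; the paper does not state how the boundaries were computed.

```python
import math

def analysis_window(W, Z):
    """sin^2 window of span W with Z zeros on each side, as in (1); N = 2Z + W."""
    N = 2 * Z + W
    return [math.sin((i - Z) * math.pi / W) ** 2 if Z <= i < Z + W else 0.0
            for i in range(N)]

def overlap_sum(window, hop, frames):
    """Sum of `frames` copies of the window, each shifted by `hop` samples."""
    total = [0.0] * (hop * (frames - 1) + len(window))
    for f in range(frames):
        for i, w in enumerate(window):
            total[f * hop + i] += w
    return total

def erb_hz(f_hz):
    """Equivalent Rectangular Bandwidth in Hz (assumed formula, see lead-in)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def partition_bounds(N, fs, erb_per_partition=2.0):
    """Hypothetical greedy rule: boundaries A_0..A_B so that partition b
    covers DFT bins A_{b-1} .. A_b - 1 and is roughly 2 ERB wide."""
    bin_hz = fs / N
    last = N // 2 + 1  # one past the highest bin considered (n = N/2)
    bounds = [0]
    while bounds[-1] < last:
        lo_hz = bounds[-1] * bin_hz
        width_bins = max(1, round(erb_per_partition * erb_hz(lo_hz) / bin_hz))
        bounds.append(min(last, bounds[-1] + width_bins))
    return bounds

W, Z, FS = 896, 64, 32000
window = analysis_window(W, Z)
N = len(window)
summed = overlap_sum(window, W // 2, frames=6)
# Inspect only the region where two neighbouring frames overlap.
start = Z + W // 2
steady = summed[start : start + 4 * (W // 2)]
max_dev = max(abs(v - 1.0) for v in steady)
bounds = partition_bounds(N, FS)
print(N, max_dev < 1e-12, len(bounds) - 1)
```

The constant sum follows from sin²(x) + sin²(x + π/2) = 1, which is why no synthesis window is needed. The resulting number of partitions depends on the boundary rule chosen, so the count printed here should not be read as the value of B used in the paper.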


For our experiments we used W = 896, Z = 64, and N = 1024 for a sample rate of 32 kHz.

3.2 Definition of ICLD and ICTD for Multiple Channels

In the general case of C playback channels the ICLD and ICTD are given for each channel relative to a reference channel. Without loss of generality, channel number 1 is defined as the reference channel. Figure 5 shows how ICLD and ICTD are defined between the reference channel and each other channel for the b-th partition. For example, $\{\Delta L_{i,b}, \tau_{i,b}\}$ is the set of the ICLD and ICTD between channel 1 and channel i + 1 for the b-th partition.

Fig. 5: For the general case of C playback channels, the spatial cues are defined relative to a reference channel (channel 1) for each partition b.

3.3 BCC Synthesis of a Multi-Channel Audio Signal

Given the spectral coefficients $S_n$ of the mono sum signal, the spectral coefficients $S_{c,n}$ for each channel c are obtained by

$$S_{c,n} = F_{c,n} G_{c,n} S_n, \qquad (4)$$

where $F_{c,n}$ is a positive real number determining a level modification for each spectral coefficient. $G_{c,n}$ is a complex number of magnitude one determining a phase modification for each spectral coefficient. The following two paragraphs describe how $F_{c,n}$ and $G_{c,n}$ are obtained given $\{\Delta L_{c,b}, \tau_{c,b}\}$.

The factors $F_{c,n}$ for channel c > 1 are computed for each spectral coefficient within a partition b ($A_{b-1} \le n < A_b$) given $\Delta L_{c-1,b}$,

$$F_{c,n} = 10^{\Delta L_{c-1,b}/20}\, F_{1,n}. \qquad (5)$$

The factors $F_{1,n}$ for the reference channel (c = 1) are computed such that for each partition the sum of the power of all channels is the same as the power of the mono signal,

$$F_{1,n} = \left(1 + \sum_{i=1}^{C-1} 10^{\Delta L_{i,b}/10}\right)^{-1/2}, \qquad (6)$$

resulting in a loudness which is approximately independent of the level differences $\Delta L_{i,b}$. This is done to maintain the loudness and to prevent inaccurate ICLD from causing undesired loudness fluctuations. Before applying (4), $F_{c,n}$ is smoothed at the partition boundaries by interpolating between the different $\Delta L_{i,b}$ to reduce aliasing artifacts.

The complex factors $G_{c,n}$ for channel c > 1 are computed for each spectral coefficient within a partition b ($A_{b-1} \le n < A_b$) given $\tau_{c-1,b}$ in sampling intervals,

$$G_{c,n} = \exp\!\left(-j\,\frac{2\pi n(\tau_{c-1,b} - \delta_b)}{N}\right), \qquad (7)$$

where $\delta_b$ is the delay which is introduced into reference channel 1,

$$\delta_b = \left(\max_{1 \le i < C} \tau_{i,b} + \min_{1 \le i < C} \tau_{i,b}\right)/2, \qquad G_{1,n} = \exp\!\left(-j\,\frac{2\pi n \delta_b}{N}\right). \qquad (8)$$

The delay for the reference channel $\delta_b$ is computed in (8) such that the maximum absolute delay introduced to any channel in each partition is minimal. By minimizing the maximum delay, the maximum audio signal distortion resulting from (7) is reduced.

4 BCC ANALYSIS

The BCC analysis scheme is shown in Fig. 6. It generates the BCC parameters, and the sum signal is derived in parallel. As described in Section 2.2, the BCC parameters assign to each area of the time-frequency plane one source index. An obvious strategy for choosing this source index is to select for each area of the time-frequency plane the source which is energetically strongest within that area. This is implemented by assigning at each time k to each partition b the source index $I_b$ ($1 \le I_b \le M$) of the energetically strongest source,

$$I_b = \arg\max_{1 \le m \le M} \{P_{m,b}\}, \qquad (9)$$


with partition power estimates $P_{m,b}$,

$$P_{m,b} = \sum_{n=A_{b-1}}^{A_b - 1} |S_{m,n}|^2, \qquad (10)$$

where $S_{m,n}$ are the spectral coefficients of audio source m.

Fig. 6: BCC analysis scheme.

We observed that the selection of source indices according to (9) sometimes results in fast variations of $I_b$ in time. When two sources have nearly the same energy in a frequency band, then the choice between the two corresponding source indices by (9) is quite random. The result is that the source indices toggle between two values, resulting in instability of the spatial image of the BCC synthesized signal. Therefore, we implemented a second version of the source index selection. In that version the source index of a partition can only change if a different source is at least $a$ dB stronger than the second strongest source in that partition.

5 BCC SYNTHESIS

For each partition b the spatial cues are obtained from a local table which stores spatial cues for each source m ($1 \le m \le M$). These spatial cues are selected according to the specific playback setup (headphone/loudspeakers, number and position of loudspeakers). For each partition the spatial cues are assigned by table look-up according to the source index $I_b$. Then the multi-channel output signal is generated by applying (4).

Fig. 7: BCC synthesis scheme with flexible rendering.

Different numbers of playback channels C can be easily accommodated by adapting the table size. For each source, C - 1 spatial cues for channel pairs (similar to Fig. 5) are stored in the table. Time adaptive flexible rendering is implemented by (smoothly) modifying the spatial cues in the table in real-time.

As opposed to synthesizing ICLD and ICTD, binaural signals can be generated with HRTFs as spatial cues. In this case, the local table in the BCC synthesizer stores for each source m a left and right HRTF frequency response, $H_{L,m,n}$ and $H_{R,m,n}$. For obtaining the left and right binaural signals, the spectral coefficients in each partition b are multiplied with the coefficients corresponding to the left and right HRTF associated with source $I_b$,

$$S_{L,n} = H_{L,I_b,n} S_n \quad \text{and} \quad S_{R,n} = H_{R,I_b,n} S_n, \qquad (11)$$

where $n \in \{A_{b-1}, A_{b-1} + 1, \ldots, A_b - 1\}$ are the indices of spectral coefficients within partition b. It must be ensured that the impulse response of each HRTF satisfies the condition of (3). An example of this process of "mixing" HRTFs is shown for the left channel in Fig. 8. To prevent aliasing artifacts, the transitions between partitions need to be smoothed. This is done by using overlapping windows for the different portions of HRTFs between the partitions.

6 CODING OF THE BCC PARAMETERS

For coding applications it is desirable to compress the BCC parameters. A run-length coding algorithm [8] is applied to the source indices $I_b$ (9) over


frequency. For the chosen parameters (Section 3.1), this results in a bitrate of about 2 kb/s for the case of three simultaneously active talkers as source signals. If not all talkers are active simultaneously, the run-length coding is more efficient and results in lower bitrates.

Fig. 8: The process of generating the left channel of a binaural signal with HRTFs.

7 APPLICATIONS OF BCC WITH FLEXIBLE RENDERING

7.1 Tele-Conferencing

Figure 9 shows a BCC-based scheme for a stereo tele-conferencing client consisting of (A) BCC synthesis and (B) a stereo echo canceler [9]. The sum signal and microphone signal can be coded using any suitable mono speech or audio coder. For generating the BCC parameters for each client, the server uses a BCC analysis scheme with the inputs assigned to the other clients' incoming signals.

Fig. 9: Scheme for a tele-conferencing client based on BCC: (A) BCC synthesis and (B) stereo echo canceler.

There are several reasons why BCC is very interesting for applications in tele-conferencing:

• Flexible rendering capability: Each client (BCC synthesis) can not only determine the spatialization of each of the other conference participants but also the number of playback channels C. With one type of bitstream, different types of client systems are possible (e.g. mono, stereo, or multi-channel).

• Low bitrate: The bitrate is as low as in a mono system with a small overhead for the BCC parameters.

• Conventional coders for speech and audio can be used for compressing the sum signal.

• Backwards compatibility: If the BCC parameters can be embedded into the transmission channel between the server and the clients of a mono system, this system can be upgraded for stereo or multi-channel tele-conferencing while maintaining backwards compatibility. For example, one could use LSB "bit-flipping" in μLaw [10] to embed the BCC bitstream.

7.2 Virtual Reality

In a virtual reality system users have control over the visual and auditory scene which is presented to them, e.g. they can move around within the virtual scene. It is desirable that in such a system the spatial audio (stereo or multi-channel) adapts to the visual scene. For example, depending on the position of the person in the virtual scene, sound sources are perceived in different directions. The flexible rendering capability of BCC can be used for placing the different audio sources in different directions within the virtual scene. For a flexible rendering capability without BCC, each source signal would need to be stored or transmitted separately. Therefore, with BCC, such systems can be implemented with a lower bitrate for the audio. Similarly, BCC can be used for interactive computer games. Especially games which are played over a network (e.g. Internet) benefit from a low bitrate for the audio transmission.

8 RESULTS

For assessing the audio quality of the presented scheme we conducted a subjective test. The reference signals were 32 kHz sampled synthetic stereo signals with sound sources panned to the left, right, and center. Table 1 shows the kind of audio signals that were used for composing the synthetic stereo


signals. The sources are panned to the left and right by using ICLD = ±8 dB and ICTD = ±219 μs (= ±7 samples). For panning a source to the middle, ICLD = 0 dB and ICTD = 0 μs are used.

     category       left        right      middle
1    speech         male        female     none
2    singing        tenor       soprano    none
3    instruments    piano       organ      none
4    percussion     castanets   drums      none
5    speech         male        female     male
6    singing        tenor       soprano    alto
7    instruments    piano       organ      harpsichord
8    percussion     castanets   drums      bongo

Table 1: List of the audio excerpts used in the subjective test. The last three columns are the sources that are placed to the left side, right side, and middle of the spatial image by imposing ICLD and ICTD between left and right.

The synthetic stereo reference signals were compared to three kinds of signals synthesized with BCC. The three kinds of BCC synthesized signals were generated by the BCC scheme with flexible rendering capability with index selection according to (9) (scheme A), the alternative index selection which changes the index in time only when the strongest source is $a$ = 6 dB stronger than the second strongest source (scheme B), and the BCC scheme without flexible rendering capability, BCC type II, presented in [5] (scheme C).

The analysis inputs for (A) and (B) were the separate source files listed in the last three columns of Table 1. The rendering parameters for the BCC synthesis (ICLD, ICTD) for (A) and (B) were set to the same values as was used for generating the synthetic stereo signals. (C) estimated the ICLD [5] from the synthetic stereo reference signals. The ICTD were not estimated but chosen to be proportional to the ICLD with a conversion factor of 219/8 [μs/dB], resulting in the same range for the ICLD and ICTD as in the reference signals.

     property                                                scale
1    image width                                             stereo ... mono
2    image stability                                         stable ... unstable
3    audio quality disregarding spatial image distortions    ITU-R 5-grade impairment
4    overall audio quality                                   ITU-R 5-grade impairment

Table 2: Tasks of the subjective test.

Table 2 lists the properties which were assessed with the subjective test and the corresponding scales used. For each of these properties a panel was used as shown in Fig. 10. The listener could randomly access and listen to the reference signal and the coded signals (A), (B), and (C). Switching between the signals was possible at any time. The listeners were asked to pay attention to the rank order of the gradings of (A), (B), and (C). For each of the 8 test items and 4 tasks (Table 1) a panel as shown in Fig. 10 was displayed. The test was conducted in a sound insulated room with high quality headphones.

Fig. 10: An example of the test panel shown for grading the property image stability.

The results for the image width and image stability properties averaged over the 8 test participants are shown for each audio signal in Fig. 11. The image width is closer to the reference for (A) and (B). This can be explained by the fact that in (A) and (B) only the spatial cues of either source occur (e.g. for items 1-4: ICLD = ±8 dB, ICTD = ±219 μs) while in (C) a continuous range of spatial cues occur which are bound to the range of -8 ... 8 dB and -219 ... 219 μs. Therefore, the average magnitude of the spatial cues in (A) and (B) is larger than the average magnitude of the spatial cues in (C), resulting in a wider image for (A) and (B). There is no significant difference in image stability between (A), (B), and (C). Surprising to us was that (A) is not worse than (B). The


problem of toggling between sources does not seem to be so relevant for the range of ICLD used. However, we observed in a different informal test that (B) indeed is better for certain audio signals with larger ICLD.

Fig. 11: The average grading for image width and image stability for the 8 sound signals, plotted against item number (A solid, B dashed, C dotted).

While the image width degradation does not depend significantly on the audio item, the image stability shows a larger degradation for item 7. We found that the harpsichord, which is supposed to appear in the middle, moves temporarily to the left or right. For this item scheme (C) performs better than (A) and (B). For this particular signal the cue estimation of (C) with a "soft" output performs better than the hard decision used to choose the source index in (A) and (B). The hard decision is especially problematic if the sources have equal critical band energies or the number of sources is very large.

Figure 12 shows the results for the audio quality disregarding spatial image and overall audio quality properties. There is no significant difference for audio quality disregarding spatial image between (A), (B), and (C). For the items with two sources (1-4), the overall audio quality was better for (A) and (B) than for (C). For the items with three sources (5-8), (C) was slightly better than (A) and (B).

Fig. 12: The average grading for audio quality disregarding spatial image and overall audio quality for the 8 sound signals, plotted against item number (A solid, B dashed, C dotted).

The ranking of items in both plots of Fig. 12 closely resembles the ranking for image stability in Fig. 11. Therefore, the image stability distortions appear to be correlated with other audible distortions created by the BCC processing. This interdependency was not observed in previous tests [11, 12] that used ICLD only. Thus, both types of distortions might be due to the different listening method (loudspeakers vs. headphones) or may be caused by the time-delay insertion. The time-delay insertion in (C) causes a certain degree of coloration. We are still in the process of investigating this effect. The overall quality gradings indicate that BCC type I has an advantage over type II for a low number of sources. However, for higher numbers of sources, the estimation of cues (C) (BCC type II) avoids imaging artifacts created by BCC type I (A) and (B), due to the source index selection with the ambiguity of overlapping sources.

In conclusion, the coders (A), (B), and (C) show nearly the same performance. The image width is consistently slightly better for (A) and (B).

9 CONCLUSIONS

In this paper we presented an audio compression scheme with flexible rendering capability at the decoder. The scheme is based on binaural cue coding (BCC). Only the sum signal of the encoder input


source signals plus BCC parameters are transmitted. The decoder can flexibly render spatial images as if the source signals were transmitted separately. The presented scheme provides flexible rendering capability at bitrates lower than conventional schemes which usually transmit each of the source signals separately.

We described a number of application scenarios for the presented scheme, such as tele-conferencing, virtual reality systems, and computer gaming. Not only are the flexible rendering capability and low bitrate appealing, but so is the backwards compatibility with existing monophonic systems (if the BCC parameters can be transmitted in addition to the mono signal).

Headphone listening test results reveal that the flexible rendering functionality of BCC does not imply degraded audio quality with respect to BCC with encoder-controlled spatial image (BCC type II) for the items tested.

Future work will focus on further refinement of the algorithm for improving the audio quality of the most critical audio material.

ACKNOWLEDGMENTS

We would like to thank Peter Kroon and Martin Vetterli for the valuable discussions and suggestions.

REFERENCES

[1] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization, MIT Press, 1983.

[2] C. Faller and F. Baumgarte, "Efficient representation of spatial audio using perceptual parametrization," in Proc. IEEE Workshop on Appl. of Sig. Proc. to Audio and Acoust., Oct. 2001.

[3] C. Faller and F. Baumgarte, "Binaural Cue Coding: A novel and efficient representation of spatial audio," in Proc. ICASSP 2002, Orlando, Florida, May 2002.

[4] F. Baumgarte and C. Faller, "Estimation of auditory spatial cues for Binaural Cue Coding (BCC)," in Proc. ICASSP 2002, Orlando, Florida, May 2002.

[5] C. Faller and F. Baumgarte, "Binaural Cue Coding applied to stereo and multi-channel audio compression," in Preprint 112th Conv. Aud. Eng. Soc., May 2002.

[6] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, Signal Processing Series, Prentice Hall, 1989.

[7] B. R. Glasberg and B. C. J. Moore, "Derivation of auditory filter shapes from notched-noise data," Hear. Res., vol. 47, pp. 103-138, 1990.

[8] N. S. Jayant and P. Noll, Digital Coding of Waveforms, Prentice-Hall Signal Processing Series, 1984.

[9] J. Benesty, T. Gänsler, D. R. Morgan, M. M. Sondhi, and S. L. Gay, Advances in Network and Acoustic Echo Cancellation, Springer, 2001.

[10] ITU-T, Pulse Code Modulation (PCM) of Voice Frequencies, 1988 (Blue Book), 1993, Recommendation G.711.

[11] F. Baumgarte and C. Faller, "Why Binaural Cue Coding is better than intensity stereo," in Preprint 112th Conv. Aud. Eng. Soc., May 2002.

[12] F. Baumgarte and C. Faller, "Design and evaluation of Binaural Cue Coding schemes," in Preprint 113th Conv. Aud. Eng. Soc., Oct. 2002.
