
CHAPTER 1

_______________________________________________________________________
INTRODUCTION

Recently, the idea of Spatial Audio Coding (SAC) has emerged as a promising new concept in perceptual coding of multi-channel audio. This approach extends traditional techniques for coding of two or more channels in a way that provides several significant advantages in terms of compression efficiency and user features. Firstly, it allows the transmission of multi-channel audio at bit rates that have so far been used for the transmission of monophonic audio. Secondly, by its underlying structure, the multi-channel audio signal is transmitted in a backward-compatible way, i.e., the technology can be used to upgrade existing distribution infrastructures for stereo or mono audio content (radio channels, Internet streaming, music downloads, etc.) towards the delivery of multi-channel audio while retaining full compatibility with existing receivers.
This paper briefly sketches the concept of Spatial Audio Coding, the MPEG Surround technology, and its performance. It describes the MPEG Surround reference model architecture and its manifold capabilities, as well as some significant extensions that have resulted from recent development work in MPEG. The performance of the technology is illustrated by several listening tests.

_______________________________________________________________________
CHAPTER 2
_______________________________________________________________________

SURROUND SOUND

There are many ways to make and present a sound recording. The simplest method, and the one used in the earliest sound movies, is called monaural or simply mono. Mono means that all the sound is recorded onto one audio track or channel (a single spiraled groove in a record, for example, or a single magnetic track on tape), which is typically played on one speaker.
Two-channel recordings, in which sound is played on speakers on either side of
the listener, are often referred to as stereo. This isn't entirely accurate, as stereo (or
stereophonic) actually refers to a wider range of multi-channel recordings. Two-channel
sound is the standard format for home stereo receivers, television and FM radio
broadcasts. The simplest two-channel recordings, known as binaural recordings, are
produced with two microphones set up at a live event (a concert for example) to take the
place of a human's two ears. When you listen to these two channels on separate speakers,
it recreates the experience of being present at the event.
Surround recordings take this idea a step further, adding more audio channels so
sound comes from three or more directions. While the term "surround sound" technically
refers to specific multi-channel systems designed by Dolby Laboratories, it is more
commonly used as a generic term for theater and home theater multi-channel sound
systems.

There are special microphones that will record surround sound (by picking up
sound in three or more directions), but this is not the standard way to produce a surround
soundtrack. Almost all movie surround soundtracks are created in a mixing studio.
Sound editors and mixers take a number of different audio recordings -- dialogue recorded on the movie set, sound effects recorded in a dubbing studio or created on a
computer, a musical score -- and decide which audio channel or channels to put them on.

Up to now, there have been two established methods for coding multichannel audio content: discrete surround and matrixed surround.

2.1 MATRIXED SURROUND

Dolby Surround and its enhanced Dolby Pro Logic playback decoder debuted in tens of millions of home theaters. Dolby Surround, like earlier quadraphonic systems, uses channel matrixing to combine four audio channels into two signals.

Sometimes known as 4-2-4 matrixing, these signals can be compatibly played over two speakers for stereo playback, or decoded to yield multiple channels. Specifically, basic Dolby Surround decoding yields front left and right, and one surround channel (the center channel appears as a phantom image), as shown in Fig 2.1.
Dolby Surround Pro Logic decoding yields left, center, right, and a monaural surround channel. Because of its compatibility with stereo playback and coding simplicity, Dolby Surround became nearly ubiquitous in movie soundtracks, and appeared in many television broadcasts.

Matrixing is elegant in its simplicity. The 4-2-4 encoder accepts four separate inputs (left, right, center, and surround) and creates two outputs (left-total and right-total). The front left and right channels are placed in the two conveying channels as a regular stereo signal. The center channel is placed equally in the left and right channels (similar to conventional stereo panning) with a 3 dB level reduction to maintain constant acoustic power.

Fig 2.1 4-2-4 matrixing

The surround input is also divided equally between the left-total and right-total
signals but first undergoes three processing steps:

1. It is frequency band-limited from 100 Hz to 7 kHz.
2. It is encoded with a modified Dolby B-type noise reduction.
3. Plus and minus 90-degree phase shifts are applied. (The surround signal is split
into two identical signals; one is phase shifted by +90 degrees relative to the
fronts and the other by -90 degrees). This creates a 180-degree phase differential
between the surround signal components creating the left-total and right-total
signals.

At the output, the center channel is recovered by summing the left-total and right-
total signals, and the surround signal is recovered by taking the difference between them.
The identical center channel components in the left-total and right-total signals will
cancel each other in the surround output, and the equal but opposite surround channel
components will cancel each other in the center output. So, signals that are different on
left-total and right-total are sent to the front left and right speakers, signals that are
identical and in-phase are sent to the center channel, and signals that are identical but out
of phase are sent to the rear surround. The main left and right channels have good
separation because they are conveyed independently. The surround signal is also heard in
the front left and right speakers, but it is out of phase and the resulting diffuse sound
image is acceptable.
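To make the sum-and-difference arithmetic above concrete, the following Python sketch implements a stripped-down 4-2-4 matrix encode and decode. It is a minimal illustration under stated assumptions: the 100 Hz to 7 kHz band-limiting and Dolby B-type noise reduction are omitted, the +/-90-degree phase shift is approximated with a Hilbert transform, and the function names are ours, not part of any Dolby specification.

    import numpy as np
    from scipy.signal import hilbert

    def matrix_encode(left, right, center, surround):
        # Simplified 4-2-4 encode; band-limiting and noise reduction omitted.
        g = 1.0 / np.sqrt(2.0)              # -3 dB for constant acoustic power
        # 90-degree phase shift via the Hilbert transform; +90 into left-total
        # and -90 into right-total differ only in sign.
        s = g * np.imag(hilbert(surround))
        lt = left + g * center + s          # left-total
        rt = right + g * center - s         # right-total
        return lt, rt

    def matrix_decode(lt, rt):
        # Basic decode: the sum cancels the out-of-phase surround components,
        # the difference cancels the identical center components.
        center = (lt + rt) / 2.0
        surround = (lt - rt) / 2.0
        return lt, rt, center, surround

Feeding a surround-only test signal through both functions shows the cancellation at work: the recovered center output is essentially zero, while the difference output reproduces the (phase-shifted) surround signal.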

Dolby Surround's limitations are most apparent by contrast with the later Dolby Digital format:

1. In Dolby Digital, the channels are coded discretely, two independent surround
channels are provided, and the five main channels can all carry full-range audio.
2. The LFE channel is coded discretely (instead of being derived from the front
mains as in Dolby Surround). This allows a dedicated bass track, separate from
the bass in the main speakers. The benefits, of course, are the low-frequency
explosions, rumbles and roars that add excitement to movies. On the other hand, an LFE channel is less useful in most music.
3. Dolby Digital provides a dialogue normalization level control so that dialogue
levels are uniform for different programs with widely varying overall dynamic range. Listeners can select a maximum sound pressure level and the decoder will
replay all dialogue no higher than that level regardless of how it was recorded. In
addition, control data can be placed in the bitstream so that a program's recorded
dynamic range can be varied in the decoder over a +/-24dB range. So the decoder
can compress the dynamic range of a program to suit the listener's preference.
(Music compression of the soft to loud dynamic range should not be confused
with data compression, which reduces bit rate and file size). The format can also
provide bass management, for example, routing low bass only to those speakers
with woofers.
4. Known limitations in sound quality

2.2 DISCRETE SURROUND

Discrete surround describes any surround sound encoding where the additional
channels, beyond the standard left and right for stereo audio, are encoded in separate
streams rather than being matrixed from a stereo signal. The most common discrete
surround formats are the various versions of Dolby Digital and DTS.

Dolby Digital is most commonly known for delivering 5.1-channel audio, but it can be used in other configurations ranging from one to six channels, parsed as 3/2, 3/1, 3/0, 2/2, 2/1, 2/0, and 1/0, plus an optional LFE channel. Most popularly, Dolby Digital is used to provide a 5.1 multichannel surround format with left, center, right, left-surround, right-surround, and a low-frequency effects channel.

These six channels can be coded at a nominal rate of 384 kbps. However, the standard also supports bit rates ranging from 32 kbps for a single monaural channel to 640 kbps for 5.1 channels. The codec is backward compatible with matrixed surround formats and monaural reproduction, and all of these formats can be decoded from the bit stream. Rather than using matrixing, Dolby Digital transmits a discrete multichannel coded bit stream, with digital down-mixing in the decoder to create the appropriate number of playback channels (mono, stereo, matrix surround, or full multichannel).
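As an illustration of decoder-side down-mixing, the short sketch below folds a 5.1 signal down to stereo. The -3 dB (1/sqrt(2)) coefficients follow a common ITU-style downmix convention assumed here for illustration; a real Dolby Digital decoder takes its downmix coefficients from the bit stream.

    import numpy as np

    def downmix_51_to_stereo(L, R, C, Ls, Rs):
        # Common two-channel fold-down; the LFE is usually omitted.
        g = 1.0 / np.sqrt(2.0)   # -3 dB for center and surrounds (assumed)
        lo = L + g * C + g * Ls
        ro = R + g * C + g * Rs
        return lo, ro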

Using this method, the channels are coded independently; this results in very high
quality, yet at rather high data rates (e.g. 192-256 kbit/s with MPEG-4 AAC or 384 kbit/s
or more with Dolby Digital). Many broadcast applications require lower data rates than
that, and discrete coding only provides either surround or stereo audio in one bit stream.

Fig 2.2 Discrete Surround

If broadcasters only transmit a surround signal and do not simulcast stereo audio, the
stereo receivers have to generate some automatic stereo downmix from the surround
signal.

Limitations of discrete surround:

1. High bitrates
2. Not backward compatible with stereo (or mono) playback

2.3 MPEG SURROUND

MPEG Surround technology upgrades traditional ways of coding two or more channels, providing several significant advantages, including compression efficiency, backward compatibility, wide-range scalability, and a variety of additional tools. Transmission of multichannel audio becomes feasible at bit rates formerly used for monophonic audio only, while its inherent structure maintains backward compatibility. A receiver without a spatial audio decoder simply presents the transmitted mono or stereophonic down-mix signal and discards the spatial parameter side information.

Depending on the operating point chosen on the bit rate versus perceptual quality curve, a range of operation stretching from the extremes of very low bit rate to near-transparent quality can be covered with one codec. Another objective is to enable optimized playback of a multichannel signal on a wide range of playback systems, such as headphone reproduction on mobile devices that have limited processing power as well as full-size home theater setups. The technology of MPEG Surround is based on the spatial audio coding (SAC) principle.

_______________________________________________________________________
CHAPTER 3
_______________________________________________________________________
PSYCHOACOUSTIC BACKGROUND

Spatial perception of audio is mediated by a limited set of cues that are created in a
natural way due to the properties of sound propagation. For example, a sound source that
is placed toward the left side of a listener will result in different acoustical pathways
toward the left and right ears. As a result the sound arriving at the left ear will be leading
in time compared to the sound arriving at the right ear, creating an interaural time
difference (ITD). Due to the acoustic shadow effect of the head, the signal at the right ear
will also tend to be lower in intensity than at the left ear, especially at high frequencies,
creating an interaural level difference (ILD). In line with these acoustical laws, it has
been observed that ITDs and ILDs are binaural cues that influence the perceived direction
of a sound source in the horizontal plane.

A sound source that is placed in an echoic environment will create numerous reflections
that, together with the direct sound, arrive at both ears with many different time delays
and amplitudes. As a result the signals at the left and right ears will be (partially)
incoherent, that is, the maximum of the normalized interaural correlation function is
smaller than 1. This reduction in interaural correlation is perceived as a widening of the
sound source . Besides the ITDs and ILDs, additional localization cues result from the
direction-dependent acoustical filtering of the outer ear. Specifically in the perceptually
relevant region from 6 to 10 kHz, sharp peaks and valleys are found, which result from
the acoustical filtering of the head and pinna. These spectral features allow listeners to
differentiate between sounds arriving from the back and front directions, and to perceive
the elevation of a sound source.

When listening to a multichannel loudspeaker setup, all these spatial cues play a
role in creating the perceived spatial sound image. Under most practical circumstances
signals that are played through one loudspeaker can be localized accurately at the
position of the loudspeaker using these binaural cues. When identical signals are played
simultaneously on the left and right loudspeakers, a phantom source is created in between
the two loudspeakers, assuming that the listener is sitting at an equal distance from both
loudspeakers. The reason that a single image is perceived in the middle instead of two
separate images at the two loudspeakers is that the left and right loudspeaker sounds are
mixed at the entrance of the ear canal in a very similar way in both ears. As a result no
effective interaural time or level differences are perceived; only the pinna cues contribute
to the perceived elevation. When identical sounds are played on the left and right front
loudspeakers, with the left signal having a higher intensity, there will be differences in the
signals entering the ear canals. Both the left and the right ears will receive the signals
from the left and right loudspeakers. At low frequencies, the left loudspeaker signal will
be dominating in both ears due to its higher level (because of the absence of a head-
shadow effect) and predominantly determine the arrival time of the signal. Since the left
loudspeaker is closer to the left ear and the right loudspeaker closer to the right ear, the
composite signal at the left ear will be leading in time, whereas the right ear receives a
delayed left loudspeaker signal and therefore the composite signal will tend to be lagging
in the right ear.
As a result binaural ITD cues are present at low frequencies that will create a
localization of the sound toward the left loudspeaker while at high frequencies head
shadow effects will create ILD cues resulting from cross-channel level differences
(CLDs), since the left signal will arrive attenuated at the right ear and vice versa.

Often signals will be played over two loudspeakers that result from the same source, but will have gone through different acoustical pathways before being recorded with two microphones. For example, this occurs when recording a single sound source in an echoic room with two microphones placed at different positions. When playing these microphone signals one to one through left and right frontal loudspeakers, the mixed signals at the ear canal will tend to have an interaural correlation that is reduced
significantly compared to the situation where identical signals would be played on both
loudspeakers. As discussed earlier, a reduction in interaural correlation will result in an
increase of the perceived source width. In general the interchannel level differences
(ICLDs) and interchannel time differences (ICTDs), together with the interchannel
correlation (ICC), will be transformed into binaural ITDs, ILDs, and interaural
correlation cues at the entrance of the two ears. The exact transformation will depend on
the loudspeaker placement, the room acoustic properties, and the relevant anthropometric
aspects of the listener. Nevertheless it is clear that the across-channel differences define
the binaural cues and are therefore also defining the spatial image. In practical situations
binaural cues will not be constant across time nor frequency. The spectral resolution for
perceiving binaural cues seems to be mainly determined by the resolution imposed by the
peripheral auditory system.
A good approximation of this resolution is given by the ERB scale derived from various monaural masking experiments. The human hearing system can track sound source positions that change over time, given certain restrictions. For example, the perception of temporal changes in binaural cues has been shown to be rather sluggish. For ITDs, at fluctuation rates of only 10 to 20 Hz, listeners cannot follow the movement at all and hear a spatially widened image, reflecting that the long-term interaural correlation of the fluctuating stimulus is less than 1. For ILDs, the binaural
system seems to be less sluggish, although it still tends to become less sensitive to
dynamic ILDs above rates of fluctuation of 50 Hz for low frequencies. The perception of
changes in interaural correlation has also been reported to be very sluggish. If a binaural
signal has an interaural correlation that is less than 1 (or more precisely, a coherence less
than 1 if temporal alignment is taken into account), it implies that there is a difference
between the two signals. The relative intensity of the difference signal compared to the
common signal determines the reduction in interaural correlation and contributes in this
way to the perceived widening of the sound source. Although the presence of the
difference signal is highly detectable, it has been shown that listeners are not very
sensitive to the character of the difference signal.

The binaural ITDs, ILDs, and interaural correlation cues provide simple statistical relations between the acoustic signals that arrive at the left and right ears, and together they form the basic cues for the spatial perception of sound. Therefore it should be
possible to reinstate the original spatial illusion that is present in a two-channel recording
by imposing the proper binaural cues on a mono down mix of a two-channel recording
taking into account the spectral and temporal resolution of the binaural hearing system.
Breebaart et al. showed that this is indeed possible, maintaining a high audio quality for
stereo recordings. In their work a two-channel input signal was down-mixed to a mono
signal, and in addition the spectrotemporal patterns of binaural cues were analyzed. The
spatial parameters derived from this analysis were encoded at a very low bit rate, creating
a significant reduction in overall bit rate because only a single instead of two audio
signals needed to be encoded in the bit stream.
With this information it was possible at the decoder side to recreate a high-quality
spatial stereophonic audio signal. The current work extends this concept toward
multichannel conditions, where spatial parameters are derived from the multichannel
audio signal such that across-channel differences in level and correlation are extracted
accurately, and can be imposed on a down mix at the decoder side. By creating a
multichannel up mix in this way, the multichannel reconstruction will result in binaural
cues at the two ears very similar to those that would result from the original multichannel
signal.
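The analysis step described above can be sketched compactly. The Python fragment below computes, for one frame, a per-band channel level difference and a normalized cross-correlation between two channels, plus a mono down mix; the FFT framing, band edges, and names are illustrative assumptions, not the exact MPEG Surround algorithm.

    import numpy as np

    def spatial_params(x1, x2, n_fft=1024, n_bands=20):
        # Windowed spectra of one frame of each channel.
        win = np.hanning(n_fft)
        X1 = np.fft.rfft(x1[:n_fft] * win)
        X2 = np.fft.rfft(x2[:n_fft] * win)
        edges = np.linspace(0, len(X1), n_bands + 1, dtype=int)
        cld, icc = [], []
        for lo, hi in zip(edges[:-1], edges[1:]):
            b1, b2 = X1[lo:hi], X2[lo:hi]
            p1 = np.sum(np.abs(b1) ** 2) + 1e-12
            p2 = np.sum(np.abs(b2) ** 2) + 1e-12
            cld.append(10 * np.log10(p1 / p2))          # level difference, dB
            icc.append(np.abs(np.sum(b1 * np.conj(b2))) / np.sqrt(p1 * p2))
        downmix = 0.5 * (x1 + x2)    # mono down mix for the legacy codec
        return np.array(cld), np.array(icc), downmix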

3.1 PERCEPTUAL CODING

Most of the time our world presents us with a multitude of sounds simultaneously.
We automatically accomplish the task of distinguishing each of the sounds and attending
to the ones of greatest importance. Unless there is something we want to hear but cannot,
we probably do not consider all the sounds we do not hear in the course of a day.

It is often difficult to hear one sound when a much louder sound is present. This
process seems intuitive, but on the psychoacoustic and cognitive levels it becomes very complex. The term for this process is masking, and it is probably the most researched
phenomenon in audition (Zwislocki 1978).

Definitions of masking differ according to the field to which the term is applied. In order to gain a broad and thorough understanding of this phenomenon, we can survey the definition and its accompanying explanation from several views. Masking, as defined by the
the American Standards Association (ASA) is the amount (or the process) by which the
threshold of audibility for one sound is raised by the presence of another (masking) sound
(B.C.J. Moore 1982, p. 74). For example, a loud car stereo could mask the car's engine
noise. The term was originally borrowed from studies of vision, meaning the failure to
recognize the presence of one stimulus in the presence of another at a level normally
adequate to elicit the first perception (Schubert 1978, p. 63).

3.2 CRITICAL BANDS

To determine this threshold of audibility, an experiment must be performed. A typical masking experiment might proceed as follows. A short, about 400 msec, pulse of a
1,000 Hz sine wave acts as the target, or the sound the listener is trying to hear. Another
sound, the masker, is a band of noise centered on the frequency of the target (the masker
could also be another pure tone). The intensity of the masker is increased until the target
cannot be heard. This point is then recorded as the masked threshold (Scharf 1975).
Another way of proceeding is to slowly widen the bandwidth of the noise without adding
energy to the original band. The increased bandwidth gradually causes more masking
until a certain point is reached, at which no more masking occurs. This bandwidth is
called the critical band (Bregman 1990). We can keep extending the masker until it is
full-bandwidth white noise and it will have no more effect than at the critical band.

Critical bands grow larger as we ascend the frequency spectrum. Conversely, we have many more bands in the lower frequency range, because
they are smaller. It will become important later in the discussion to remember that the
size of the critical bands is not constant across the frequency range.
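A widely used quantitative model of this growth is the equivalent rectangular bandwidth (ERB) scale mentioned in Chapter 3; the small sketch below evaluates the standard Glasberg-Moore formula at a few center frequencies as a stand-in for critical bandwidth (our illustrative choice, not taken from the text above).

    def erb_hz(f_hz):
        # Equivalent rectangular bandwidth (Glasberg & Moore, 1990), in Hz.
        return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

    # Bandwidth grows with center frequency:
    # ~35 Hz at 100 Hz, ~133 Hz at 1 kHz, ~1104 Hz at 10 kHz.
    for f in (100.0, 1000.0, 10000.0):
        print(f, "Hz ->", round(erb_hz(f), 1), "Hz")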

Critical bands seem to be formed at some level by auditory filters (Schubert 1978). These filters act similarly to the conventional frequency-specific electronic devices that parse the audio spectrum. There is only sparse evidence for the process of the auditory filter; it is not clear whether separation occurs in the inner ear or at some higher level. There is no agreement as to the specific number of critical bands active simultaneously. Critical bands and their center frequencies are continuous, as opposed to having strict boundaries at specific frequency locations. Therefore, the filters must be easily variable. Use of the auditory filter may be the unconscious equivalent of our willfully focusing on a specific frequency range.

In general, low sounds mask higher sounds. There is little masking below the center frequency of the noise band. A general rule displayed by this masking pattern is that masking tends to occur between sounds that are close together in frequency. It is also apparent that above 20 dB, each increase in masker energy produces a direct rise in the threshold of the target. Conveniently, noise maskers at different center frequencies share the same shape of masking audiogram, as do pure-tone maskers.

The spatial representation of frequency on the basilar membrane is perhaps the single
most important piece of physiological information about the auditory system, clarifying
many psychophysical data, including the masking data and their asymmetry (p. 130).
We often use visual analogies to aid in learning. The conventional graph showing one tone masking another may be an effective visual analogy when trying to comprehend the masking effect. The thin line represents our hearing threshold when no sounds are present. A 500 Hz tone at 25 dB would be within our threshold of hearing.
When a masking tone is present, a 200 Hz tone at 50 dB in this case, the threshold of
audibility is altered (represented by the thicker line on the graph) so that the 500 Hz tone
is masked.

This graph only describes a surface understanding of the cognitive processes. The
graph implies that once a masking tone is present we are biologically incapable of
receiving the target tone. In reality, we still sense, physiologically, the masked tone, but it
cannot be audibly recognized. Albert S. Bregman offers us a more neurologically-sound
analogy. He asks us to imagine hiding a red spot on a white canvas by painting the entire
canvas red. The spot is still there, but it is impossible to distinguish. He continues,

A masker is something that fills in the background in such a way that there is no
longer any spectral shape defined against the white canvas of silence. The signal is still
there, but it is camouflaged. Masking, then, is the loss of individuality of the neural
consequences of the target sound, because there is no way of segregating it from the
effects of the louder tone.

3.3 CENTRAL MASKING AND OTHER EFFECTS

Another way to approach masking is to question at what level it occurs. Studies in cognition have shown that masking can occur at or above the point where audio signals
from the two ears combine. The threshold of a signal entering monaurally can be raised
by a masker entering in the other ear monaurally. This phenomenon is referred to as
central masking because the effect occurs between the ears.

Spatial separation can counteract masking. Many studies have been performed in which unintelligible speech becomes understandable once the source is separated in space from the interference (Bregman 323). The effect holds whether the sources are actually physically separated or perceptually separated through the use of interaural time delay.

Asynchrony of the onset of two sounds has been shown to help prevent masking, as long as the onset difference does not fall within the realm of non-simultaneous masking. Each 10 msec increase in the inter-onset interval was perceived as being equal to a 10 dB increase in the target's intensity (Bregman 1990). Experiments by Rasch revealed that musicians in an ensemble had typical onset deviations of 30 to 50 msec, unwittingly providing their own solution to masking effects. Incidentally, computer music sequencers would do well to provide a feature for differing onsets between tracks, ideally modeling the deviation after human performers.

A multichannel perceptual codec (coder/decoder) has some small advantages over monaural codecs. The increased number of channels promotes greater masking--the loud
sound in one channel may mask a softer sound in another channel, and redundancies
between channels can be eliminated. In addition, the ear's localization acuity decreases in
the rear. These and other phenomena allow greater use of composite channels, requiring
fewer bits. So, the number of bits needed to code a multichannel signal is approximately
proportional to the square root of the number of channels. A 5.1-channel coder, for example, would theoretically require 2.26 times the number of bits needed to code one
channel.
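As a quick check of that figure: applying the square-root rule to the nominal channel count gives sqrt(5.1) = 2.258... ≈ 2.26, i.e., roughly 2.26 times the single-channel bit budget instead of the six-fold cost of coding all channels independently.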

The actual reduction depends on the coder itself. For example, a 5.1-channel Dolby Digital bit stream might require 384 kbps, which is only about one-quarter the bit rate of a stereo CD. On the other hand, a 5.1-channel DTS bit stream might require 1.5 Mbps--essentially the same bit rate as a stereo CD.

_______________________________________________________________________
CHAPTER 4
_______________________________________________________________________
SPATIAL AUDIO CODING

The concept of spatial audio coding as employed in the MPEG Surround standard
is outlined in Fig. 4.1. A multichannel input signal is converted to a down mix by an
MPEG Surround encoder. Typically the down mix is a mono or a stereo signal, but more
down-mix channels are also supported (for example, a 5.1 down mix from a 7.1 input
channel configuration). The perceptually relevant spatial properties of the original input
signals that are lost by the down-mix process are captured in a spatial parameter bit
stream. The down mix can subsequently be encoded with an existing compression
technology. In the last encoder step the spatial parameters are combined with the down-
mix bit stream by a multiplexer to form the output bit stream. Preferably the parameters
are stored in an ancillary data portion of the down-mix bit stream to ensure backward
compatibility.

At the decoder, in a first stage the transmitted bit stream is split into a down-mix bit stream and a spatial parameter stream. The down-mix bit stream is decoded using a legacy decoder. Finally, the multichannel output is constructed by an MPEG Surround decoder based on the transmitted spatial parameters.

Fig 4.1 Principle of Spatial Audio Coding

The use of an MPEG Surround encoder as a preprocessor for a conventional (legacy) codec (and a corresponding postprocessor in the decoder) has important advantages over existing multichannel compression methods.
• The parametric representation of spatial properties results in a significant compression gain over conventional multichannel audio codecs. The use of a legacy codec with an additional spatial parameter stream allows for backward compatibility with existing compression schemes and broadcast services.
• The spatial parameterization enables novel techniques to process or modify certain aspects of a down mix. Examples are matrixed-surround compatible down mixes, support for so-called artistic down mixes, or the generation of a three-dimensional/binaural signal to evoke a multichannel experience over legacy headphones.
• The channel configuration at the spatial encoder can be different from that of the spatial decoder without the need for full multichannel decoding as an intermediate step. For example, a decoder may directly render an accurate four-channel representation from a 5.1 signal configuration without having to decode all 5.1 channels first.

_______________________________________________________________________
CHAPTER 5
_______________________________________________________________________
MPEG SURROUND

The MPEG Surround spatial coder structure is composed of a limited set of elementary building blocks. Each elementary building block is characterized by a set of input signals, a set of output signals, and a parameter interface. A generic elementary building block is shown in Fig 5.1. An elementary building block can have up to three input and output signals (as shown left and right, respectively), as well as an input or output for (sets of) spatial parameters.
Fig 5.1 Generic Building Block

Different realizations of elementary building blocks serve different purposes in the spatial coding process. For example, a first type of building block may decrease the number of audio channels by means of spatial parameterization. Hence, if such a block is applied at the encoder side, the block will have fewer output channels than input channels, and has a parameter output. The corresponding block at the decoder side, however, has a parameter input and more output channels than input channels. The encoder and decoder representations of such an encoding/decoding block are shown in Fig 5.2. Two different realizations of the encoding/decoding blocks exist.

The first realization is a block that describes two signals as one down-mix signal plus parameters. The corresponding encoding block is referred to as two-to-one (TTO), whereas the decoding block is termed one-to-two (OTT). In essence, these blocks are similar to a parametric stereo encoder/decoder. The second realization is a so-called three-to-two (TTT) encoding block, which generates two output signals and parameters from three input signals. The corresponding two-to-three decoding block generates three signals from a stereo input accompanied by parameters.
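The TTO idea can be sketched in a few lines: fold two channels into one and record a channel level difference (CLD) and inter-channel correlation (ICC). This broadband Python fragment is a schematic only; the real codec computes these parameters per time/frequency tile, and the names and normalization here are our assumptions.

    import numpy as np

    def tto_encode(ch1, ch2):
        # Two-to-one: mono down mix plus CLD (dB) and ICC parameters.
        p1 = np.sum(ch1 ** 2) + 1e-12
        p2 = np.sum(ch2 ** 2) + 1e-12
        cld = 10 * np.log10(p1 / p2)
        icc = np.sum(ch1 * ch2) / np.sqrt(p1 * p2)
        downmix = (ch1 + ch2) / np.sqrt(2.0)  # energy-preserving for uncorrelated inputs
        return downmix, cld, icc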
A second type of building block is referred to as a signal converter. For example, a stereo input signal may be converted into a stereo output signal that has different spatial properties, the processing of which is controlled by parameters. This is shown in Fig 5.2. The corresponding decoder-side operation inverts the processing that is applied at the encoder to retrieve the original (unmodified) stereo input signal. Examples of signal converters are the conversion from conventional stereo to matrixed-surround compatible stereo, or to three-dimensional/binaural stereo for playback over headphones.

Fig 5.2 Elementary building blocks
The third type of building block is an analysis block. This type generates parameters from a signal stream without modifying the actual signals or signal configuration. This block can be applied at both the spatial encoder and the decoder sides.

Fig 5.3 Multichannel Audio Encoder and Decoder

5.1 MPEG SURROUND ENCODER


5.1.1 Structure
The structure of the MPEG Surround encoder is shown in Fig 5.4. A multichannel input signal is first processed by a channel-dependent pregain. These gains enable adjustment of the level of certain channels (for example, LFE and surround) within the transmitted down mix. Subsequently the input signals are decomposed into time/frequency tiles using an analysis filter bank. A spatial encoder generates a down-mix signal and (encoded) spatial parameters for each time/frequency tile. These parameters are quantized and encoded into a parameter bit stream
by a parameter encoder Q. The down mix is converted to the time domain using a
synthesis filter bank. Finally a postgain is applied to control the overall signal
level of the down mix.
5.1.2 Pre- and Postgains
In the process of down-mixing a multichannel signal to a stereo signal, it is often desirable to have nonequal weights for the different input channels. For example, the surround channels are often attenuated by 3 dB prior to the actual down-mix process. MPEG Surround supports user-controllable pregains between 0 and −6 dB, in steps of 1.5 dB. For the LFE, these weights are adjustable between 0 and −20 dB in steps of 5 dB.

Fig 5.4 MPEG Encoder

The level of the generated down mix can also be controlled using (post-encoder) gains to prevent clipping in the digital signal domain. The down mix can be attenuated between 0 and −12 dB in steps of 1.5 dB.
The applied pre- and postgain factors are signaled in the MPEG Surround bit stream to
enable their inverse scaling at the decoder side.
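How such quantized gain indices could map to linear factors is shown below; the step sizes come from the text, while the function name and interface are our illustrative assumptions.

    def gain_factor(index, step_db=1.5):
        # Map a gain index (0, 1, 2, ...) to a linear attenuation factor.
        # Channel pregains: 1.5 dB steps down to -6 dB (indices 0..4).
        # LFE pregains: pass step_db=5.0, down to -20 dB.
        return 10.0 ** (-(index * step_db) / 20.0)

    surround_gain = gain_factor(2)           # 3 dB attenuation, factor ~0.708
    lfe_gain = gain_factor(2, step_db=5.0)   # 10 dB attenuation, factor ~0.316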
5.1.3 Analysis Filter Bank

As outlined in Chapter 3, the human auditory system determines spatial properties based on a certain time and frequency decomposition. Therefore spatial audio parameterization cannot be employed directly on time-domain signals, but requires a filter bank to mimic the temporal and spectral resolution of the human listener. Moreover, given the need for time-variant processing (especially at the spatial decoder side), the filter bank used is preferably oversampled to reduce aliasing artifacts that would otherwise result from a critically sampled structure.
A complex-modulated quadrature mirror filter (QMF) bank is used to obtain a uniformly distributed, oversampled frequency representation of the audio signal. MPEG Surround uses this QMF filter bank as part of a hybrid structure to obtain an efficient nonuniform frequency resolution. Furthermore, by grouping filter-bank outputs for spatial parameter analysis and synthesis, the frequency resolution for spatial parameters can be varied extensively while applying a single filter-bank configuration. More specifically, the number of parameters covering the full frequency range can be varied from only a few (for low-bit-rate applications) up to 28 (for high-quality processing) to closely mimic the frequency resolution of the human auditory system.
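The grouping of filter-bank outputs into parameter bands can be pictured as follows: many uniform QMF bands are collapsed into fewer parameter bands whose widths grow with frequency. The mapping below (quadratically spaced edges over 64 hypothetical QMF bands) is invented for illustration and is not the normative MPEG Surround grouping table.

    import numpy as np

    def group_qmf_bands(band_powers, n_param_bands=10):
        # Collapse per-QMF-band powers into parameter bands that are
        # narrow at low frequencies and wide at high frequencies.
        n_qmf = len(band_powers)
        edges = np.unique(np.round(
            np.linspace(0.0, np.sqrt(n_qmf), n_param_bands + 1) ** 2).astype(int))
        grouped = [np.sum(band_powers[lo:hi])
                   for lo, hi in zip(edges[:-1], edges[1:])]
        return np.array(grouped), edges

    powers = np.abs(np.random.randn(64)) ** 2  # stand-in for QMF subband powers
    grouped, edges = group_qmf_bands(powers)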

5.1.4 Synthesis Filter Bank


The spatial encoding process is followed by a set of hybrid QMF synthesis filter banks (one for each output channel), also consisting of two stages. The first stage comprises the summation of the sub-subbands that stem from the same QMF subband. Finally, upsampling, convolution with the synthesis filters (which are similar to the QMF analysis filters), and summation of the resulting subband signals produce the final output signals x̂_i[n].

5.1.5 Spatial Encoder


Tree Structures
The elementary building blocks are combined to form a spatial coding tree. Depending on the number of (desired) input and output channels, and on additional features that are employed, different tree structures may be constructed. The most common tree structures for 5.1-channel input are outlined here. First, two tree structures for a mono down mix will be described, followed by the preferred tree structure for a stereo down mix. The first tree structure supports a mono down mix and is outlined in Fig 5.6(a). The six input channels, left front, right front, left surround, right surround, center, and low-frequency enhancement, labeled Lf, Rf, Ls, Rs, C, and LFE, respectively, are combined pairwise using encoding blocks (TTO type) until a mono down mix is obtained.
Each TTO block produces a set of parameters P. As a first step the two front channels (Lf, Rf) are combined in a TTO encoding block E3, resulting in parameters P3. Similarly, the pairs C, LFE and Ls, Rs are combined by TTO encoding blocks E4 and E2, respectively. Subsequently the combination of Lf, Rf on the one hand, and C, LFE on the other hand are combined using TTO encoding block E1 to form a “front” channel F. Finally this front channel is merged with the common surround channel in encoding block E0 to result in a mono output S. One of the advantages of this structure is its support for configurations with only one surround channel. In that case Ls and Rs are identical, and hence the corresponding TTO block can be omitted (that is, the tree can be pruned).

Fig 5.5 Stereo Downmix

The second tree structure for 5.1 input combined with a mono down mix is shown in Fig 5.6(b). In this configuration the Lf and Ls channels are first combined into a
common left channel (L) using a TTO encoding block E3. The same process is repeated
for the Rf and Rs channels (E4). The resulting common left and common right channels are then combined in E1, and finally merged (E0) with the combination of the center and LFE channels (E2). The advantage of this scheme is that a front-only channel configuration (that is, only comprising L, R, and C) is simply obtained by pruning the tree.

Fig 5.6 Mono Downmix

For a stereo down mix the preferred tree configuration is given in Fig 5.5. As for the second mono-based tree, this tree also starts by the generation of common left and right channels, and a combined center/LFE channel. These three signals are combined into a stereo output signal SL, SR using a TTT encoding block.

5.2 SPATIAL DECODER
The spatial decoder generates multichannel output signals from the down-mixed input signal by reinstating the spatial cues captured by the spatial parameters. The spatial synthesis of OTT decoding blocks employs so-called decorrelators and matrix operations in a similar fashion as parametric stereo decoders. In an OTT decoding block, two output signals with the correct spatial cues are generated by mixing a mono input signal with the output of a decorrelator that is fed with that mono input signal.

Fig 5.7 Structure of Decoder

A first attempt at building a multichannel decoder could be to simply concatenate OTT decoding blocks according to the tree structure at hand.

The process of transforming spatial parameterization trees from cascaded decorrelator structures to decorrelators in parallel, extended with combined matrix multiplications, leads to the generalized spatial decoder structure. Any encoder tree configuration can be mapped to this generalized decoder structure. The input signals are first processed by a preprocess matrix, which applies decorrelator input gains as outlined in TTT-type decoding (in the case of a stereo down mix), as well as any decoder-side inversion processes that should be applied on the down mix. The outputs of the prematrix are fed to a decorrelation stage with one or more mutually independent decorrelators. Finally a postmix matrix generates the multichannel output signals. In this scheme both the preprocess matrix and the postmix matrix are dependent on the transmitted spatial parameters.
The spatial synthesis stage of the parametric multichannel decoder consists of matrixing and decorrelation units. Decorrelation units are required to synthesize output signals with a variable degree of correlation (dictated by the transmitted ICC parameters). To be more specific, each decorrelator should generate an output signal from an input signal according to the following requirements:
1. The coherence between input and output signals should be sufficiently close to zero. In this context, coherence is specified as the maximum of the normalized cross-correlation function operating on band-pass signals (with bandwidths sufficiently close to those estimated from the human hearing system). Said differently, the coherence between input and output should be very small, even if analyzed in narrow frequency bands.
2. Both the spectral and temporal envelopes of the output signal should be close to those of the incoming signal.
3. The outputs of multiple decorrelators should be mutually incoherent according to the same constraints as for their input/output relation.
A suitable implementation that meets these requirements uses lattice all-pass filters, with additional spectral and temporal enhancement tools, as sketched below.
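The essence of the OTT synthesis can be sketched as follows: a toy decorrelator (a plain delay, far simpler than the lattice all-pass filters named above) and a 2x2 mix that re-imposes a target CLD and ICC on the two outputs. The mixing math is a schematic simplification under our own naming, not the normative MPEG Surround equations.

    import numpy as np

    def decorrelate(x, delay=397):
        # Toy decorrelator: a long prime-length delay.
        y = np.zeros_like(x)
        y[delay:] = x[:-delay]
        return y

    def ott_synthesize(mono, cld_db, icc):
        # One-to-two synthesis: mix the mono signal with its decorrelated
        # version so the outputs exhibit the target CLD and ICC.
        d = decorrelate(mono)
        c = 10.0 ** (cld_db / 20.0)              # amplitude ratio of outputs
        g1 = np.sqrt(2.0) * c / np.sqrt(1.0 + c * c)
        g2 = np.sqrt(2.0) / np.sqrt(1.0 + c * c)
        a = 0.5 * np.arccos(np.clip(icc, -1.0, 1.0))
        left = g1 * (np.cos(a) * mono + np.sin(a) * d)
        right = g2 * (np.cos(a) * mono - np.sin(a) * d)
        return left, right

With equal-power mono and decorrelated signals, the normalized correlation of the two outputs works out to cos(2a) = ICC, which is exactly the property the decorrelator requirements above are meant to guarantee.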

_______________________________________________________________________
CHAPTER 6
_______________________________________________________________________
PERFORMANCE AND QUALITY

6.1 SUBJECTIVE QUALITY


To assess the perceptual rate/distortion curve of MPEG Surround and to compare its quality with discrete multichannel codecs as well as a popular matrixed-surround coding scheme, a range of codec configurations was rated in a listening test.
A total of 13 listeners at three test labs took part in a multiple stimuli with hidden reference and anchor (MUSHRA) [Rec. ITU-R BS.1534-1] test. This test methodology included a hidden reference item, as well as a 3.5-kHz band-limited anchor signal. The subjects graded the codecs on a perceptual quality scale ranging from zero to 100 with labels from “bad” to “excellent.” Fig 6.1 shows the overall mean for all subjects and items, drawn as squares.
The continuous lines extrapolate the rate/distortion curve for a specific codec. For the lower bit-rate range, the advantage of applying MPEG Surround on top of an HE-AAC core coder is clearly evident. At 64 kb/s, the average quality is already in the so-called good region.

Fig 6.1 Quality Comparison

The quality increases monotonically with the bit rate. At 96 kb/s, the mean subjective quality crosses the border of the excellent region at 80 MUSHRA points.
For 160 kb/s, excellent quality can be achieved with discrete HE-AAC as well as with HE-AAC plus MPEG Surround. Similar performance is observed for an AAC core coder with MPEG Surround running at 192 kb/s total. The combination of MPEG-1 Layer 2 with MPEG Surround results in a good rating at 192 kb/s and an excellent rating at 256 kb/s.
When using Dolby Pro Logic II as a matrixed surround scheme, the limited spatial reproduction results in only a fair score, even at 256 kb/s. Discrete 5.1 AAC running at 320 kb/s serves as a higher anchor in this test, delivering a score above 96 points.
6.2 SPEED/COMPLEXITY PERFORMANCE
For the combination of MPEG Surround with a stereophonic MPEG-4 HE-AAC core, the combined computational complexity for 5.1 output corresponds to the complexity of an HE-AAC decoder for six discrete channels. The low-power version of MPEG Surround provides 5.1 output at the cost of a four-channel discrete decoder.

Fig 6.2 Performance Comparison

