MMSE Speech Enhancement

SPEECH ENHANCEMENT
Chunjian Li
Aalborg University, Denmark
Introduction
Applications:
- Improving quality and intelligibility (hearing
aids, cockpit communication, video conferencing ...)
- Source coding (mobile phone, video
conferencing, IP phone ...)
- Pre-processor for other speech processing
applications (speech recognition, speaker
verification ...)
Introduction
Classification 1
- Single channel
- Multi-channel
* with acoustic barrier (Adaptive Noise Cancelling)
* without acoustic barrier (Array Processing)
Classification 2
- Spectrum subtraction (Power Spectral Subtraction, Amplitude
Spectral Subtraction, Autocorrelation Subtraction, Non-causal
Infinite Wiener Filtering)
- Parametric method (Iterative Wiener Filtering)
- Adaptive noise cancelling
- Adaptive comb filtering
Single Channel Speech
Enhancement
Stochastic Model
- Noise process: broadband (white),
stationary (or short-time stationary),
uncorrelated to speech, additive.
- Speech process: short-time stationary.
- Need short-time processing
y (n; m) = s (n; m) + d (n; m)
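A minimal sketch of the short-time processing step, assuming a Hann analysis window and a fixed hop between frames (neither is specified in the slides). Each returned frame plays the role of y(n; m) for one frame index m. The name `split_into_frames` and its parameters are illustrative only.

```python
import math


def split_into_frames(signal, frame_len, hop):
    """Cut a signal into overlapping, Hann-windowed frames y(n; m).

    Assumption: a periodic Hann window; the slides do not name a window.
    Trailing samples that do not fill a whole frame are dropped.
    """
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames


# Usage: 10 samples, frames of 4 samples, 50% overlap
example_frames = split_into_frames([1.0] * 10, frame_len=4, hop=2)
```

With 50% overlap the frame length is usually chosen short enough (tens of milliseconds) that speech can be treated as stationary within a frame.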

Single Channel Speech
Enhancement
Important relation in the power spectrum domain:
$$\Gamma_y(\omega; m) = \Gamma_s(\omega; m) + \Gamma_d(\omega; m)$$
This is true only when the noise is uncorrelated with the speech signal.
* To be concise, the index "m" is dropped in the following discussion.
Power Spectral Subtraction
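$$\hat{\Gamma}_s(\omega) = \Gamma_y(\omega) - \hat{\Gamma}_d(\omega)$$
$$|\hat{S}_s(\omega)| = \sqrt{N\, \hat{\Gamma}_s(\omega)}$$
$$\hat{S}_s(\omega) = |\hat{S}_s(\omega)|\, e^{j\varphi_y(\omega)} \qquad (1)$$
* The Power Spectral Subtraction method uses the noisy phase spectrum to synthesize the enhanced signal.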
Generalized Spectral
Subtraction and its variants
Generalization
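Eq. (1) can be generalized to:
$$\hat{S}_s(\omega) = \left[ |S_y(\omega)|^{\alpha} - |\hat{S}_d(\omega)|^{\alpha} \right]^{1/\alpha} e^{j\varphi_y(\omega)} \qquad (2)$$
When α = 2, eq. (2) reduces to Power Spectral Subtraction, eq. (1).
When α = 1, eq. (2) is called Amplitude Spectral Subtraction (Boll, 1979).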
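The generalized subtraction rule can be sketched per DFT bin as follows. This is an illustrative sketch, not the author's implementation: the function name `spectral_subtract` is hypothetical, the noise magnitude estimate is assumed to be given, and negative differences are clipped to zero (half-wave rectification) so that the 1/α root stays real.

```python
import cmath


def spectral_subtract(noisy_bins, noise_mags, alpha=2.0):
    """Generalized spectral subtraction on one frame of DFT bins.

    noisy_bins: complex DFT coefficients S_y of the noisy frame.
    noise_mags: estimated noise magnitudes |S_d| for the same bins.
    alpha=2 gives power subtraction, alpha=1 amplitude subtraction.
    Assumption: negative differences are clipped to zero.
    """
    enhanced = []
    for y_bin, d_mag in zip(noisy_bins, noise_mags):
        diff = abs(y_bin) ** alpha - d_mag ** alpha
        magnitude = max(diff, 0.0) ** (1.0 / alpha)
        # The noisy phase is reused unchanged.
        enhanced.append(cmath.rect(magnitude, cmath.phase(y_bin)))
    return enhanced


# Usage: one bin of magnitude 5 with an estimated noise magnitude of 3
power_result = spectral_subtract([3 + 4j], [3.0], alpha=2.0)      # magnitude 4
amplitude_result = spectral_subtract([3 + 4j], [3.0], alpha=1.0)  # magnitude 2
```

For the same noise estimate, α = 1 removes more of the magnitude than α = 2, which is one reason the two variants sound different.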
Variant – Correlation subtraction
$$\hat{r}_s(\eta; m) = r_y(\eta; m) - \hat{r}_d(\eta; m)$$
Comments on Spectral
Subtraction methods
Low complexity
Severe musical noise
Usually need further enhancement
- Smoothing in time and frequency; Rectification;
[Audio samples: Amplitude Spectral Subtraction; Power Spectral Subtraction; noisy speech sample]

Comments on Spectral
Subtraction methods
Oversuppression and smoothing can reduce residual noise but result in
distortion of the speech spectrum.
[Audio samples: oversuppressed Amplitude Spectral Subtraction (ASS); oversuppressed Power Spectral Subtraction (PSS); smoothing in time]
Wiener Filtering
The non-causal infinite Wiener filter using a spectral-subtraction prior
(hereafter referred to as the Non-causal Wiener Filter) can be
recognized as a Spectral Subtraction method.
The non-causal infinite Wiener filter using an LPC model as prior can be
employed in an iterative manner, which can be recognized
as a parametric method.
Noncausal Wiener Filter
A linear Minimum Mean Squared Error filter:
$$\hat{s}(n) = \sum_{q=-\infty}^{\infty} h(q)\, y(n-q) = h(n) * y(n)$$
Orthogonality principle:
$$E\left\{ \left[ s(n) - \sum_{q=-\infty}^{\infty} h(q)\, y(n-q) \right] y(n-k) \right\} = 0$$
$$\Rightarrow R_{ys}(k) = \sum_{q=-\infty}^{\infty} h(q)\, R_{yy}(k-q) \qquad \text{(Wiener-Hopf equation)}$$
Noncausal Wiener Filter
Orthogonality principle (frequency domain):
$$S_{ys}(\omega) = H(\omega)\, S_{yy}(\omega)$$
Transfer function:
$$H(\omega) = \frac{S_{ys}(\omega)}{S_{yy}(\omega)}$$
MSE of the Wiener filter:
$$E[(s(n) - \hat{s}(n))^2] = \sigma_s^2 - \sum_{q=-\infty}^{\infty} h(q)\, R_{ys}(q)$$
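As a worked special case, assume the additive model y(n) = s(n) + d(n) with noise that is uncorrelated with the speech, as in the stochastic model stated earlier. The cross-spectrum then reduces to the speech power spectrum, and the transfer function becomes a real gain. The symbol ξ(ω) below is shorthand introduced here for the ratio of speech to noise power at frequency ω.

```latex
S_{ys}(\omega) = \Gamma_s(\omega), \qquad
S_{yy}(\omega) = \Gamma_s(\omega) + \Gamma_d(\omega)
\;\;\Rightarrow\;\;
H(\omega) = \frac{\Gamma_s(\omega)}{\Gamma_s(\omega) + \Gamma_d(\omega)}
          = \frac{\xi(\omega)}{1 + \xi(\omega)},
\qquad \xi(\omega) = \frac{\Gamma_s(\omega)}{\Gamma_d(\omega)}
```

Under these assumptions H(ω) is real and lies between 0 and 1, so the filter only attenuates each frequency and leaves the noisy phase untouched.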
Comments on Noncausal WF
Requires estimates of the power spectra of speech and noise.
Performance depends strongly on the quality of the estimated
speech and noise spectra.
The WF oversuppresses the speech spectrum, which results in a muffling effect.
The WF does not process the phase spectrum.
Comments on Noncausal WF
Roughness caused by phase noise
The phase spectrum is not processed, which results in a
loss of phase coherence in voiced speech. The
effect is called roughness or reverberance.
[Audio samples of muffling and roughness: clean samples; muffling; roughness; muffling and roughness]
Iterative Wiener Filtering
A parametric method using an all-pole model.
A sequential MAP estimator of both the
speech waveform and the LP coefficients.
[Lim, Oppenheim 1978]
All-pole modeling of speech
- The speech amplitude spectrum can be well modeled
by an all-pole transfer function (the vocal tract)
excited by a white sequence or a pulse train (the glottal
pulses). The coefficients of the all-pole model are
found by the Linear Prediction method and are thus called
LP coefficients; the excitation is called the residual.
- The LP model is minimum phase, which is
generally not the true phase of the vocal tract.
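For reference, the all-pole model described above can be written out as follows. The notation is introduced here and is not taken from the slides: p is the model order, a_i are the LP coefficients, G is the gain and e(n) is the excitation (residual). Sign conventions for a_i differ between textbooks; the form with a plus sign in the denominator is assumed here.

```latex
s(n) = -\sum_{i=1}^{p} a_i\, s(n-i) + G\, e(n), \qquad
H(z) = \frac{G}{1 + \sum_{i=1}^{p} a_i z^{-i}}, \qquad
\Gamma_s(\omega) \approx \frac{G^2}{\left| 1 + \sum_{i=1}^{p} a_i e^{-j\omega i} \right|^2}
```

The last expression is the smooth spectral envelope that a parametric method can use as its estimate of the speech power spectrum, assuming a white, unit-variance excitation.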
The algorithm:
1. Estimate the LP coefficients from the noisy
observation samples. Estimate the noise
spectrum during non-speech activity.
2. Estimate the waveform using the noncausal WF given
the current estimate of the LP coefficients and the current
estimate of the noise spectrum.
3. Re-estimate the LP coefficients given the current
estimate of the waveform.
4. Repeat steps 2 and 3 until some stopping criterion is
satisfied.
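The four steps can be sketched for a single frame as below. This is a simplified illustration under several assumptions that are not in the slides: the noise power spectrum is given per DFT bin in periodogram units (|D_k|^2 / N), the stopping criterion is a fixed iteration count, the all-pole gain is taken directly from the prediction error, and a naive O(N^2) DFT is used so that only the standard library is needed. All function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) DFT; adequate for a short illustrative frame."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts) for n in range(n_pts))
            for k in range(n_pts)]


def idft_real(spectrum):
    """Inverse DFT, keeping the real part of each output sample."""
    n_pts = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * n / n_pts)
                 for k in range(n_pts)) / n_pts).real
            for n in range(n_pts)]


def lp_coefficients(x, order):
    """Autocorrelation method with the Levinson-Durbin recursion.

    Returns (a, err) with A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order
    and err the prediction-error energy.
    """
    n_pts = len(x)
    r = [sum(x[n] * x[n + lag] for n in range(n_pts - lag)) for lag in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return a, err


def iterative_wiener(noisy, noise_psd, order=4, iterations=3):
    """Alternate between LP estimation and noncausal Wiener filtering."""
    n_pts = len(noisy)
    noisy_spectrum = dft(noisy)
    estimate = list(noisy)               # step 1 starts from the noisy frame
    for _ in range(iterations):          # step 4: fixed count as the stop rule
        a, err = lp_coefficients(estimate, order)     # steps 1 and 3
        excitation_power = max(err, 0.0) / n_pts
        gains = []
        for k in range(n_pts):
            omega = 2.0 * math.pi * k / n_pts
            a_of_omega = sum(a[i] * cmath.exp(-1j * omega * i) for i in range(order + 1))
            speech_psd = excitation_power / max(abs(a_of_omega) ** 2, 1e-12)
            total = speech_psd + noise_psd[k]
            gains.append(speech_psd / total if total > 0.0 else 0.0)
        # step 2: Wiener-filter the noisy frame with the current model
        estimate = idft_real([gains[k] * noisy_spectrum[k] for k in range(n_pts)])
    return estimate
```

Each pass filters the original noisy frame, not the previous output, so only the LP model is carried from one iteration to the next.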
Comments:
- Convergence is not guaranteed; a heuristic stopping
criterion is needed
- Results in unrealistically sharp formants and pole
jittering
- Suffers from musical noise
- Needs some kind of smoothing
[Audio samples: 10 dB noisy sample; iterative WF; iterative WF with smoothing]
Further enhancement to IWF
Constrained IWF [Hansen, Clements 1987]:
apply spectral constraints inter-frame and intra-frame
using the LSP transformation.
Pole-zero modeling [Flanagan 1972]
Replace WF with Kalman filtering [Gibson
1991]
Vector quantization method [Gibson 1988]
Use HMM [Ephraim 1988]
Phase issues
The majority of noise reduction methods
process only the amplitude spectrum, while the
noisy phase spectrum is left unprocessed.
The reasons are:
- Human ears are less sensitive to the phase
than to the amplitude spectrum.
- Masking of amplitude to phase (6 dB / 0.6 rad
threshold).
For low-SNR (< 6 dB) sources, the noisy
phase causes roughness/reverberance.
MMSE approaches to speech
enhancement
Wiener filtering; MMSE amplitude spectrum
estimator; MMSE log-amplitude spectrum
estimator; non-Gaussian prior MMSE
approaches.
These are the dominant techniques because they
perform better than the Spectral
Subtraction methods.
They need a priori information about the speech and noise
spectra.
MMSE amplitude spectrum
estimator (Ephraim-Malah filter)
Ephraim-Malah, 1984
The basis of the noise reduction
function of the MELPe coding standard
Consists of two parts: the Decision-Directed
method, which estimates the a priori
speech spectrum, and the MMSE
Short-Time Spectral Amplitude (STSA)
estimator
MMSE STSA estimator
Assumptions:
- Stationary additive Gaussian noise with known spectrum.
- An estimate of the speech spectrum is available.
- Spectral components (DFT coefficients) are statistically independent
and each follows a Gaussian distribution (the DFT amplitude follows a
Rayleigh distribution).
- The DFT phase follows a uniform distribution and is independent of the
amplitude.
The signal model: $y(t) = x(t) + d(t)$
Let $Y_k \equiv R_k \exp(j\theta_k)$, $X_k \equiv A_k \exp(j\alpha_k)$, and $D_k$ denote the kth spectral
component of the noisy observation y(t), the signal x(t), and the
noise d(t), respectively.
MMSE STSA estimator
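With the following PDFs:
$$p(Y_k \mid A_k, \alpha_k) = \frac{1}{\pi \lambda_d(k)} \exp\left( -\frac{1}{\lambda_d(k)} \left| Y_k - A_k e^{j\alpha_k} \right|^2 \right), \qquad p(A_k, \alpha_k) = \frac{A_k}{\pi \lambda_x(k)} \exp\left( -\frac{A_k^2}{\lambda_x(k)} \right)$$
and Bayes' rule, the estimator $\hat{A}_k$ can be shown to be:
$$\hat{A}_k = E[A_k \mid Y_k] = \frac{\sqrt{\pi}}{2} \frac{\sqrt{v_k}}{\gamma_k} \exp\left( -\frac{v_k}{2} \right) \left[ (1 + v_k)\, I_0\!\left( \frac{v_k}{2} \right) + v_k\, I_1\!\left( \frac{v_k}{2} \right) \right] R_k$$
where $I_0(\cdot)$ and $I_1(\cdot)$ denote the modified Bessel functions of zero
and first order, and $v_k$ is defined by:
$$v_k = \frac{\xi_k}{1 + \xi_k}\, \gamma_k$$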
MMSE STSA estimator
where $\xi_k$ and $\gamma_k$ are defined by:
$$\xi_k = \frac{\lambda_x(k)}{\lambda_d(k)}, \qquad \gamma_k = \frac{R_k^2}{\lambda_d(k)}$$
with $\lambda_x(k) = E[|X_k|^2]$ and $\lambda_d(k) = E[|D_k|^2]$.
$\xi_k$ and $\gamma_k$ are interpreted as the a priori and a posteriori signal-to-noise ratio, respectively.
$\xi_k$ is estimated by the Decision-Directed method.
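The MMSE STSA gain (the factor that multiplies R_k in the estimator) can be evaluated numerically as sketched below. The function names are hypothetical, and the Bessel functions are computed from their power series with the exp(-v/2) factor folded in, which is one possible way to avoid overflow, not necessarily how production code does it.

```python
import math


def scaled_bessel_i(order, x):
    """exp(-x) * I_order(x) for order 0 or 1 and x >= 0.

    Uses the power series of the modified Bessel function, summed in the
    log domain so that large x does not overflow.
    """
    if x == 0.0:
        return 1.0 if order == 0 else 0.0
    log_half_x = math.log(x / 2.0)
    total = 0.0
    for k in range(int(x) + 60):
        log_term = ((2 * k + order) * log_half_x
                    - math.lgamma(k + 1) - math.lgamma(k + order + 1) - x)
        total += math.exp(log_term)
    return total


def mmse_stsa_gain(xi, gamma):
    """Gain G such that A_hat = G * R, for a priori SNR xi and a posteriori SNR gamma."""
    v = xi / (1.0 + xi) * gamma
    half_v = v / 2.0
    bracket = ((1.0 + v) * scaled_bessel_i(0, half_v)
               + v * scaled_bessel_i(1, half_v))
    return (math.sqrt(math.pi) / 2.0) * (math.sqrt(v) / gamma) * bracket


# Usage: compare with the Wiener gain xi / (1 + xi) (linear SNR values, not dB)
high_snr_gain = mmse_stsa_gain(100.0, 101.0)   # close to 100 / 101
low_snr_gain = mmse_stsa_gain(0.1, 1.0)        # larger than 0.1 / 1.1
```

Both SNR arguments are linear ratios and gamma must be positive.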

Decision-Directed method
An estimate of the a priori SNR.
A combination of Power Spectral
Subtraction, half-wave rectification and
inter-frame smoothing.
$$\hat{\xi}_k(n) = \alpha\, \frac{\hat{A}_k^2(n-1)}{\lambda_d(k, n-1)} + (1 - \alpha) \max[\gamma_k(n) - 1,\, 0], \qquad 0 \le \alpha < 1$$
α is usually chosen to be 0.98 in order to
get the best smoothing performance. The
higher α is, the less musical noise, but the
more distortion to the speech.
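The decision-directed update for one frequency bin can be sketched as follows. The function and parameter names are illustrative; the previous-frame amplitude estimate and the noise variance are assumed to be available from earlier processing.

```python
def decision_directed_xi(prev_amplitude, prev_noise_var, gamma, alpha=0.98):
    """A priori SNR estimate for frame n of one frequency bin.

    prev_amplitude: enhanced amplitude A_hat from frame n-1.
    prev_noise_var: noise variance lambda_d from frame n-1 (must be > 0).
    gamma: a posteriori SNR of the current frame n.
    alpha: smoothing factor, 0 <= alpha < 1.
    """
    smoothed_part = alpha * prev_amplitude ** 2 / prev_noise_var
    # max(gamma - 1, 0) is the half-wave rectified power-subtraction estimate.
    instantaneous_part = (1.0 - alpha) * max(gamma - 1.0, 0.0)
    return smoothed_part + instantaneous_part


# Usage: previous amplitude 2, unit noise variance, current gamma 3
xi_estimate = decision_directed_xi(2.0, 1.0, 3.0)   # 0.98 * 4 + 0.02 * 2
```

With α close to 1 the estimate is dominated by the previous frame's output, which is what smooths out isolated spectral peaks from one frame to the next.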
Comments on the MMSE
STSA estimator
Comparison of the suppression gains of the Wiener filter and the MMSE STSA estimator
- The instantaneous SNR can be
interpreted as the a priori SNR
estimated without smoothing.
- WF gains do not vary with the
instantaneous SNR; they vary only with
the a priori SNR, whereas the
MMSE STSA gains vary with both the
instantaneous SNR and the a priori SNR.
- When the a priori SNR is high, the
MMSE STSA estimator has gain
curves very close to those of the WF. When
the a priori SNR is low, the MMSE
STSA shows higher gain, which is
strongly affected by the
instantaneous SNR.
Comments on the MMSE STSA estimator
A comparison of the suppression gains of PSS, WF and the MMSE STSA estimator
[Figure, two panels, each with the axis label "Estimated a priori SNR".
First panel: solid line, power subtraction; dashed line, Wiener filter.
Second panel: the MMSE STSA. Rpost denotes the a priori SNR estimated without smoothing (the instantaneous SNR).]
Comments on the MMSE STSA estimator
The gain curve transitions smoothly between the power
subtraction curve and the Wiener curve. This transition is
controlled by the unsmoothed estimate of the a priori SNR (Rpost).
The larger Rpost, the stronger the attenuation.
This counter-intuitive behavior manages to flatten the spurious
spectral peaks caused by the noise in the low-SNR part of the
spectrum, whereas the WF tends to sharpen the spurious peaks in
the low-SNR part of the spectrum.
The phase of the noisy speech is used as the phase of the
enhanced speech, because of the assumption of uniformly
distributed phase. An independent MMSE estimate of the
phasor has non-unity modulus, and thus cannot be combined with
the MMSE STSA.
Suffers less from musical noise than the WF.
MMSE Log-Spectral Amplitude
Estimator
A modification to the MMSE STSA based on the fact that a distortion
measure based on the mean-square error of the log-spectra is more
suitable for speech processing.
Minimize the distortion measure $E[(\log A_k - \log \hat{A}_k)^2]$
The MMSE LSA estimator can be shown to be:
$$\hat{A}_k = \exp(E[\ln A_k \mid Y_k]) = \frac{\xi_k}{1 + \xi_k} \exp\left( \frac{1}{2} \int_{v_k}^{\infty} \frac{e^{-t}}{t}\, dt \right) R_k$$
where $v_k = \frac{\xi_k}{1 + \xi_k}\, \gamma_k$, and $\xi_k$ and $\gamma_k$ are the a priori SNR and the a
posteriori SNR as defined before.
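The integral in the LSA estimator is the exponential integral E1(v). A minimal standard-library sketch of the resulting gain follows; the function names are hypothetical, and E1 is evaluated with a textbook series for small arguments and a continued fraction for larger ones, which is one common approach and an assumption of this sketch.

```python
import math

EULER_GAMMA = 0.5772156649015329


def exp_integral_e1(x):
    """E1(x) = integral from x to infinity of exp(-t) / t dt, for x > 0."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    if x <= 1.0:
        # Power series: -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total = -EULER_GAMMA - math.log(x)
        term = 1.0                      # holds (-x)^k / k!
        for k in range(1, 60):
            term *= -x / k
            total -= term / k
        return total
    # Continued fraction evaluated with the modified Lentz method.
    b = x + 1.0
    c = 1.0e300
    d = 1.0 / b
    h = d
    for i in range(1, 200):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return h * math.exp(-x)


def mmse_lsa_gain(xi, gamma):
    """Gain G such that A_hat = G * R for the log-spectral amplitude estimator."""
    v = xi / (1.0 + xi) * gamma
    return xi / (1.0 + xi) * math.exp(0.5 * exp_integral_e1(v))


# Usage: linear SNR values, both positive
high_snr_gain = mmse_lsa_gain(100.0, 101.0)   # approaches the Wiener gain 100 / 101
low_snr_gain = mmse_lsa_gain(0.1, 1.0)        # above the Wiener gain 0.1 / 1.1
```

Because the exponential factor is never below 1, this gain never falls below the Wiener gain ξ/(1+ξ) for the same a priori SNR.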
MMSE Log-Spectral Amplitude
Estimator
Comparison of the suppression gains of MMSE STSA and MMSE LSA
- The gain curves of the MMSE LSA are
always lower than those of the MMSE
STSA, resulting in lower residual
noise.
- When the a priori SNR is high, the
gain curve of the MMSE LSA is very flat,
similar to that of the Wiener filter.
When the a priori SNR is low, the
gain curve of the MMSE LSA varies
with the instantaneous SNR, as that of the
MMSE STSA does.
[Audio samples (Decision-Directed): noisy sample (0 dB); Wiener filter; MMSE LSA]
MMSE estimator with non-
Gaussian prior
How well does the Gaussian model fit the real probability distribution of DFT
coefficients?
[Figures: histogram of speech DFT amplitude; histogram of noise (recorded in a marketplace) DFT amplitude]
* The histograms are taken from one hour of speech
MMSE estimator with non-Gaussian prior
The probability density function of the DFT
coefficients of speech can be better modeled by
super-Gaussian functions (e.g. Gamma or
Laplace) than by the Gaussian function [Rainer
Martin 2002, 2003].
An even more exact probability density function is
one tailored to fit the shape of the histogram of
the DFT coefficients [Lotter, Vary 2003].
Using these density functions in place of the
Gaussian density function (for speech or noise
processes) in the MMSE estimator can result in
better noise reduction.
The non-Gaussian prior MMSE estimator is nonlinear
and non-zero-phase.
MMSE estimator with non-Gaussian prior
Compared with the WF:
- Better output SNR (Gaussian/Gamma)
- Less musical noise (Laplace/Gamma)
- Less distortion to the speech
Exercises
1. The noncausal Wiener filter and the Ephraim-Malah filter are both MMSE
estimators. They have a lot in common. Please list at least 4 common points
of these two estimators.
2. So, what makes the two estimators different?
3. The residual noise is often categorized as white noise or musical noise.
Different filters produce different residual noise. Which of the two
types of residual noise do you prefer? Discuss how you would make the choice in different
communication scenarios.
4. What do you think of the experimental data (the histograms) used for finding the
PDF of the DFT amplitude? Can you suggest any improvements?
