
SPEECH ENHANCEMENT

Chunjian Li
Aalborg University, Denmark
Introduction

„ Applications:
- Improving quality and intelligibility (hearing
aid, cockpit comm., video conferencing ...)
- Source coding (mobile phone, video
conferencing, IP phone ...)
- Pre-processor for other speech processing
applications (speech recognition, speaker
verification ...)
Introduction
„ Classification 1
- Single channel
- Multi-channel
* with acoustic barrier (Adaptive Noise Cancelling)
* without acoustic barrier (Array Processing)
„ Classification 2
- Spectral subtraction (Power Spectral Subtraction, Amplitude
Spectral Subtraction, Autocorrelation Subtraction, Non-causal
Infinite Wiener Filtering)
- Parametric method (Iterative Wiener Filtering)
- Adaptive noise cancelling
- Adaptive comb filtering
Single Channel Speech
Enhancement

„ Stochastic Model
- Noise process: broadband (white),
stationary (or short-time stationary),
uncorrelated to speech, additive.
- Speech process: short-time stationary.
- Need short-time processing

y (n; m) = s (n; m) + d (n; m)


Single Channel Speech
Enhancement

„ Important relation in the power spectrum domain:

$$\Gamma_y(\omega; m) = \Gamma_s(\omega; m) + \Gamma_d(\omega; m)$$

This is true only when the noise is uncorrelated with the speech signal.

* To be concise, the index "m" is dropped in the following discussion.
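Why the power spectra add can be seen from a short derivation. It is sketched here under the added assumption that speech and noise are zero-mean and jointly stationary within the frame; the cross-correlations $r_{sd}$ and $r_{ds}$ are notation introduced only for this sketch.

$$
\begin{aligned}
r_y(\eta) &= E[y(n)\,y(n+\eta)] \\
&= E[(s(n)+d(n))\,(s(n+\eta)+d(n+\eta))] \\
&= r_s(\eta) + r_d(\eta) + r_{sd}(\eta) + r_{ds}(\eta)
\end{aligned}
$$

If s and d are uncorrelated, $r_{sd}(\eta) = r_{ds}(\eta) = 0$, so $r_y(\eta) = r_s(\eta) + r_d(\eta)$. Taking the Fourier transform of both sides gives the power spectrum relation.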
Power Spectral Subtraction

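$$\hat{\Gamma}_s(\omega) = \Gamma_y(\omega) - \hat{\Gamma}_d(\omega)$$

$$|\hat{S}_s(\omega)| = \sqrt{N\,\hat{\Gamma}_s(\omega)}$$

$$\hat{S}_s(\omega) = |\hat{S}_s(\omega)|\, e^{j\varphi_y(\omega)} \qquad (1)$$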

* The Power Spectral Subtraction method uses the noisy phase spectrum to
synthesize the enhanced signal.
Generalized Spectral
Subtraction and its variants

„ Generalization
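Eq. (1) can be written as:

$$\hat{S}_s(\omega) = \left[\, |S_y(\omega)|^{\alpha} - |\hat{S}_d(\omega)|^{\alpha} \,\right]^{1/\alpha} e^{j\varphi_y(\omega)} \qquad (2)$$

With α = 2, eq. (2) reduces to Power Spectral Subtraction. When α = 1, eq. (2) is called Amplitude Spectral Subtraction (Boll, 1979).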
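The generalized rule in eq. (2) can be sketched per DFT bin as follows. This is a minimal illustration, not a reference implementation: the function name `spectral_subtract` and its arguments are hypothetical, the noise magnitudes are assumed to come from some separate estimate, and clamping negative differences to zero (half-wave rectification) is an assumption of this sketch.

```python
import cmath


def spectral_subtract(noisy_bins, noise_mag, alpha=2.0):
    """Generalized spectral subtraction on one frame of DFT coefficients.

    noisy_bins: complex DFT coefficients S_y of the noisy frame.
    noise_mag:  estimated noise magnitudes |S_d|, one per bin (assumed given).
    alpha:      2.0 gives power subtraction, 1.0 amplitude subtraction.
    """
    enhanced = []
    for s_y, d_mag in zip(noisy_bins, noise_mag):
        diff = abs(s_y) ** alpha - d_mag ** alpha
        # Assumption of this sketch: a negative difference is clamped to zero
        # (half-wave rectification) before taking the 1/alpha root.
        mag = max(diff, 0.0) ** (1.0 / alpha)
        # The noisy phase is reused, as in eq. (1) and eq. (2).
        enhanced.append(cmath.rect(mag, cmath.phase(s_y)))
    return enhanced
```

For example, a bin with S_y = 3 + 4j and a noise magnitude of 3 keeps the phase of S_y and has its magnitude reduced from 5 to 4 with alpha = 2, or to 2 with alpha = 1.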

„ Variant – Correlation subtraction

$$\hat{r}_s(\eta; m) = r_y(\eta; m) - \hat{r}_d(\eta; m)$$
Comments on Spectral
Subtraction methods

„ Low complexity
„ Severe musical noise
„ Usually need further enhancement
- Smoothing in time and frequency; Rectification;

[Audio samples: Amplitude Spectral Subtraction; Power Spectral Subtraction; noisy speech sample]


Comments on Spectral
Subtraction methods

„ Oversuppression and smoothing can
reduce residual noise but result in
distortion of the speech spectrum.

[Audio samples: oversuppressing ASS; oversuppressing PSS; smoothing in time]
Wiener Filtering

„ The non-causal infinite Wiener filter using a
spectral-subtraction prior (hereafter referred
to as the Non-causal Wiener Filter) can be
recognized as a Spectral Subtraction
method.
„ The non-causal infinite Wiener filter using an LPC
model as prior can be employed in an
iterative manner, which can be recognized
as a parametric method.
Noncausal Wiener Filter

„ A linear Minimum Mean Squared Error filter:

$$\hat{s}(n) = \sum_{q=-\infty}^{\infty} h(q)\, y(n-q) = h(n) * y(n)$$

„ Orthogonality principle:

$$E\left\{ \left[ s(n) - \sum_{q=-\infty}^{\infty} h(q)\, y(n-q) \right] y(n-k) \right\} = 0$$

$$\Rightarrow\; R_{ys}(k) = \sum_{q=-\infty}^{\infty} h(q)\, R_{yy}(k-q) \qquad \text{(Wiener-Hopf equation)}$$
Noncausal Wiener Filter

„ Orthogonality principle (frequency domain):

$$S_{ys}(\omega) = H(\omega)\, S_{yy}(\omega)$$

„ Transfer function:

$$H(\omega) = \frac{S_{ys}(\omega)}{S_{yy}(\omega)}$$

„ MSE of the Wiener filter:

$$E[(s(n) - \hat{s}(n))^2] = \sigma_s^2 - \sum_{q=-\infty}^{\infty} h(q)\, R_{ys}(q)$$
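For additive noise that is uncorrelated with the speech, the cross spectrum equals the speech power spectrum and the noisy power spectrum is the sum of the speech and noise power spectra, so the transfer function becomes the ratio of speech power to speech-plus-noise power. The sketch below computes this gain per bin using a spectral-subtraction estimate of the speech spectrum as the prior. The function name `wiener_gain` is hypothetical, and clamping the subtracted spectrum at zero is an assumption of this sketch.

```python
def wiener_gain(noisy_psd, noise_psd):
    """Per-bin non-causal Wiener gain with a spectral-subtraction prior.

    noisy_psd: power spectrum of the noisy frame, one value per bin.
    noise_psd: estimated noise power spectrum, one value per bin.
    """
    gains = []
    for g_y, g_d in zip(noisy_psd, noise_psd):
        # Speech power estimate by power spectral subtraction,
        # clamped at zero (assumption of this sketch).
        g_s = max(g_y - g_d, 0.0)
        denom = g_s + g_d
        gains.append(g_s / denom if denom > 0.0 else 0.0)
    return gains
```

The gain is real and lies between 0 and 1, so applying it to the noisy DFT coefficients changes only the amplitude and leaves the noisy phase untouched.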
Comments on Noncausal WF

„ Requires estimates of the power
spectra of speech and noise.
„ Performance depends strongly on
the quality of the speech and noise
spectrum estimates.
„ The WF oversuppresses the speech
spectrum, which results in a muffling effect.
„ The WF does not process the phase spectrum.
Comments on Noncausal WF

„ Roughness caused by phase noise
The phase spectrum is not processed, which results in
a loss of phase coherence in voiced speech. The
effect is called roughness or reverberance.
„ Samples of muffling and roughness:
Clean samples:
Muffling:
Roughness:
Muffling & roughness:
Iterative Wiener Filtering

„ A parametric method using an all-pole
model
„ A sequential MAP estimator of both
speech waveform and LP coefficients.
„ [Lim, Oppenheim 1978]
Iterative Wiener Filtering

„ All-pole modeling of speech
- The speech amplitude spectrum can be well modeled
by an all-pole transfer function (the vocal tract)
excited by a white sequence or a pulse train (the glottal
pulses). The coefficients of the all-pole model are
found by the Linear Prediction method and are thus called
LP coefficients; the excitation is called the residue.
- The LP model is minimum phase, which is
generally not the true phase of the vocal tract.
Iterative Wiener Filtering

„ The algorithm:
1. Estimate the LP coef. from the noisy
observation samples. Estimate the noise
spectrum during nonspeech activity.
2. Estimate the waveform using the noncausal WF given
the current estimate of the LP coef. and the current
estimate of the noise spectrum.
3. Estimate the LP coef. again given the current
estimate of the waveform.
4. Repeat steps 2 and 3 until some criterion is
satisfied.
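The four steps above can be sketched for a single frame as follows. This is an illustrative sketch under several assumptions that are not in the slides: the names `dft`, `lpc` and `iterative_wiener` are hypothetical, the LP analysis uses the autocorrelation method with the Levinson-Durbin recursion, the noise spectrum is passed in by the caller as the expected |D_k|^2 per bin (symmetric across bins), and a fixed iteration count stands in for the stopping criterion. A naive DFT is used to stay within the standard library.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) DFT or inverse DFT, adequate for short frames."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def lpc(frame, order):
    """Autocorrelation-method LP analysis (Levinson-Durbin).

    Returns (a, err): a[0] = 1 and A(z) = sum_j a[j] z^-j is the inverse
    filter; err is the residual energy.
    """
    n = len(frame)
    r = [sum(frame[t] * frame[t + lag] for t in range(n - lag))
         for lag in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def iterative_wiener(noisy, noise_power, order=4, iterations=3):
    """noisy: real samples of one frame; noise_power: E|D_k|^2 per DFT bin."""
    n = len(noisy)
    spec_y = dft(noisy)
    estimate = list(noisy)              # step 1 starts from the noisy frame
    for _ in range(iterations):         # step 4: fixed count as stop rule
        a, err = lpc(estimate, order)   # steps 1 and 3: (re)estimate LP coef.
        gain = []
        for k in range(n):
            w = 2.0 * math.pi * k / n
            a_w = sum(a[j] * cmath.exp(-1j * w * j) for j in range(order + 1))
            p = err / max(abs(a_w) ** 2, 1e-12)   # all-pole model spectrum
            d = noise_power[k]
            gain.append(p / (p + d) if p + d > 0.0 else 0.0)
        # Step 2: Wiener filtering of the noisy spectrum (noisy phase kept).
        spec_hat = [g * y for g, y in zip(gain, spec_y)]
        estimate = [v.real for v in dft(spec_hat, inverse=True)]
    return estimate
```

A practical version would process overlapping windowed frames with an FFT and would add the smoothing mentioned in the comments that follow.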
Iterative Wiener Filtering

„ Comments:
- Convergence is not guaranteed; a heuristic stopping
criterion is needed
- Results in unrealistically sharp formants and pole
jittering
- Suffers from musical noise
- Needs some kind of smoothing

[Audio samples: 10 dB noisy sample; iterative WF; iterative WF with smoothing]
Further enhancement to IWF

„ Constrained IWF [Hansen, Clements 1987]
Apply spectral constraints inter-frame and intra-frame
using the LSP transformation.
„ Pole-zero modeling [Flanagan 1972]
„ Replace WF with Kalman filtering [Gibson
1991]
„ Vector quantization method [Gibson 1988]
„ Use HMM [Ephraim 1988]
Phase issues
„ The majority of noise reduction methods
only process the amplitude spectrum, while the
noisy phase spectrum is left unprocessed.
„ The reasons are:
- Human ears are less sensitive to the phase
than to the amplitude spectrum.
- Masking of amplitude to phase (6dB/0.6rad
threshold).
„ For low-SNR (<6dB) sources, the noisy
phase causes roughness/reverberance.
MMSE approaches to speech
enhancement

„ Wiener filtering; MMSE amplitude spectrum
estimator; MMSE log-amplitude spectrum
estimator; non-Gaussian prior MMSE
approaches.
„ The dominant technique, because of
better performance than the Spectral
Subtraction methods.
„ Need a priori information about the speech and noise
spectra.
MMSE amplitude spectrum
estimator (Ephraim-Malah filter)

„ Ephraim-Malah, 1984
„ The basis of the noise reduction
function of the MELPe coding standard
„ Consists of two parts: the Decision-Directed
method for estimating the a priori
speech spectrum, and the MMSE
Short-Time Spectral Amplitude (STSA)
estimator
MMSE STSA estimator

„ Assumptions:
- Stationary additive Gaussian noise with known spectrum.
- An estimate of the speech spectrum is available.
- Spectral components (DFT coefficients) are statistically independent
and each follows a Gaussian distribution (the DFT amplitude follows a
Rayleigh distribution).
- The DFT phase follows a uniform distribution and is independent of the
amplitude.

„ The signal model: $y(t) = x(t) + d(t)$

Let $Y_k \equiv R_k \exp(j\theta_k)$, $X_k \equiv A_k \exp(j\alpha_k)$, and $D_k$ denote the kth spectral
component of the noisy observation y(t), the signal x(t), and the
noise d(t), respectively.
MMSE STSA estimator
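With the following PDFs:

$$p(Y_k \mid A_k, \alpha_k) = \frac{1}{\pi \lambda_d(k)} \exp\left( -\frac{1}{\lambda_d(k)} \left| Y_k - A_k e^{j\alpha_k} \right|^2 \right), \qquad p(A_k, \alpha_k) = \frac{A_k}{\pi \lambda_x(k)} \exp\left( -\frac{A_k^2}{\lambda_x(k)} \right)$$

and Bayes' rule, the estimator $\hat{A}_k$ can be shown to be:

$$\hat{A}_k = E[A_k \mid Y_k] = \frac{\sqrt{\pi}}{2}\, \frac{\sqrt{v_k}}{\gamma_k} \exp\left( -\frac{v_k}{2} \right) \left[ (1 + v_k)\, I_0\!\left( \frac{v_k}{2} \right) + v_k\, I_1\!\left( \frac{v_k}{2} \right) \right] R_k$$

Where $I_0(\cdot)$ and $I_1(\cdot)$ denote the modified Bessel functions of zero
and first order, and $v_k$ is defined by:

$$v_k = \frac{\xi_k}{1 + \xi_k}\, \gamma_k$$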
MMSE STSA estimator
Where $\xi_k$ and $\gamma_k$ are defined by:

$$\xi_k = \frac{\lambda_x(k)}{\lambda_d(k)}, \qquad \gamma_k = \frac{R_k^2}{\lambda_d(k)}$$

with $\lambda_x(k) = E[|X_k|^2]$ and $\lambda_d(k) = E[|D_k|^2]$.

$\xi_k$ and $\gamma_k$ are interpreted as the a priori and a posteriori signal-to-noise
ratios, respectively.

$\xi_k$ is estimated by the Decision-Directed method.
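The MMSE STSA estimator can be viewed as a gain applied to the noisy amplitude R_k, depending only on the a priori and a posteriori SNRs. The sketch below evaluates that gain. The names `bessel_i` and `mmse_stsa_gain` are hypothetical, and the Bessel functions are computed from their power series only to keep the example within the standard library; this is an assumption of the sketch and limits it to moderate arguments.

```python
import math


def bessel_i(order, x):
    """Modified Bessel function I_0 or I_1 from its power series (x >= 0).

    Assumption of this sketch: x is moderate (well below about 700), since
    the unscaled series grows like exp(x).
    """
    half = x / 2.0
    term = 1.0 if order == 0 else half      # m = 0 term of the series
    total = term
    m = 0
    while term > 1e-17 * total and m < 1000:
        m += 1
        term *= half * half / (m * (m + order))
        total += term
    return total


def mmse_stsa_gain(xi, gamma):
    """Gain A_hat / R for a priori SNR xi and a posteriori SNR gamma (> 0)."""
    v = xi / (1.0 + xi) * gamma
    bracket = (1.0 + v) * bessel_i(0, v / 2.0) + v * bessel_i(1, v / 2.0)
    return (math.sqrt(math.pi) / 2.0) * (math.sqrt(v) / gamma) \
        * math.exp(-v / 2.0) * bracket
```

Evaluating `mmse_stsa_gain(xi, gamma)` over a grid of the two SNRs gives the kind of gain curves discussed in the comments on the estimator: close to xi / (1 + xi) when both SNRs are high, and above it when the a priori SNR is low.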


Decision-Directed method
„ An estimate of the a priori SNR.
„ A combination of Power Spectral
Subtraction, half-wave rectification and
inter-frame smoothing.

$$\hat{\xi}_k(n) = \alpha\, \frac{\hat{A}_k^2(n-1)}{\lambda_d(k, n-1)} + (1 - \alpha) \max[\gamma_k(n) - 1,\, 0], \qquad 0 \le \alpha < 1$$

„ α is usually chosen to be 0.98 in order to
get the best smoothing performance. The
higher α is, the less musical noise, but the
more distortion to the speech.
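The recursion can be sketched for a single frequency bin as follows. The function names `decision_directed_xi` and `track_xi` are hypothetical. Two assumptions are made only for this sketch: the noise variance is constant across frames, and the previous amplitude estimate is zero before the first frame. The suppression rule is passed in as `gain_fn`, so any estimator that maps the two SNRs to a gain can be plugged in.

```python
def decision_directed_xi(prev_amp, prev_noise_var, gamma, alpha=0.98):
    """One update of the a priori SNR estimate for one bin.

    prev_amp:       amplitude estimate of the previous frame.
    prev_noise_var: noise variance of the previous frame.
    gamma:          a posteriori SNR of the current frame.
    """
    smoothed = alpha * prev_amp ** 2 / prev_noise_var
    # Half-wave rectified power subtraction, expressed in SNR terms.
    instantaneous = max(gamma - 1.0, 0.0)
    return smoothed + (1.0 - alpha) * instantaneous


def track_xi(noisy_amps, noise_var, gain_fn, alpha=0.98):
    """Run the estimator over successive frames of one bin.

    gain_fn(xi, gamma) is any suppression rule, e.g. xi / (1 + xi).
    Assumptions of this sketch: constant noise variance, and a zero
    amplitude estimate before the first frame.
    """
    xis = []
    prev_amp = 0.0
    for r in noisy_amps:
        gamma = r * r / noise_var
        xi = decision_directed_xi(prev_amp, noise_var, gamma, alpha)
        prev_amp = gain_fn(xi, gamma) * r   # amplitude estimate of this frame
        xis.append(xi)
    return xis
```

With alpha close to 1 the estimate follows the previous frame's enhanced amplitude and changes slowly, which is the inter-frame smoothing that suppresses musical noise.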
Comments on the MMSE
STSA estimator
„ Comparison of the suppression gains of Wiener filter and MMSE STSA

- The instantaneous SNR can be
interpreted as the a priori SNR
estimated without smoothing.
- WF gains do not vary with the
instantaneous SNR; they vary only with
the a priori SNR, whereas the
MMSE STSA gains vary with both the
instantaneous SNR and the a priori SNR.
- When the a priori SNR is high, the
MMSE STSA estimator has gain
curves very close to those of the WF. When
the a priori SNR is low, the MMSE
STSA shows a higher gain, which is
strongly affected by the
instantaneous SNR.
Comments on the MMSE
STSA estimator
„ A comparison of the suppression gains of PSS, WF and MMSE STSA estimator

[Figure, two panels; axis label of both panels: estimated a priori SNR]
Panel 1: solid line: power subtraction; dashed line: Wiener filter.
Panel 2: the MMSE STSA. Rpost denotes the a priori SNR estimated without smoothing (the
instantaneous SNR).
Comments on the MMSE
STSA estimator
„ The gain curve transitions smoothly between the power
subtraction curve and the Wiener curve. This transition is
controlled by the un-smoothed estimate of the a priori SNR (Rpost).
The larger Rpost, the stronger the attenuation.
„ This counter-intuitive behavior manages to flatten the spurious
spectral peaks caused by the noise in the low-SNR part of the
spectrum, whereas the WF tends to sharpen the spurious peaks in
the low-SNR part of the spectrum.
„ The phase of the noisy speech is used as the phase of the
enhanced speech, because of the assumption of uniformly
distributed phase. An independent MMSE estimate of the
phasor has non-unity modulus and thus cannot be combined with
the MMSE STSA.
„ Suffers less musical noise than the WF.
MMSE Log-Spectral Amplitude
Estimator
„ A modification to the MMSE STSA based on the fact that a distortion
measure based on the mean-square error of the log-spectra is more
suitable for speech processing.
„ Minimize the distortion measure $E[(\log A_k - \log \hat{A}_k)^2]$
„ The MMSE LSA estimator can be shown to be:

$$\hat{A}_k = \exp(E[\ln A_k \mid Y_k]) = \frac{\xi_k}{1 + \xi_k} \exp\left( \frac{1}{2} \int_{v_k}^{\infty} \frac{e^{-t}}{t}\, dt \right) R_k$$

where $v_k = \frac{\xi_k}{1 + \xi_k}\, \gamma_k$, and $\xi_k$ and $\gamma_k$ are the a priori SNR and a
posteriori SNR as defined before.
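The integral in the LSA estimator is the exponential integral E1 evaluated at v_k, so the gain can be computed directly once E1 is available. The sketch below uses only the standard library. The names `expint_e1` and `mmse_lsa_gain` are hypothetical, and the choice of a power series for small arguments and a continued fraction for larger ones is an assumption of this sketch, not something stated in the slides.

```python
import math

EULER_GAMMA = 0.5772156649015329


def expint_e1(x):
    """Exponential integral E1(x) = integral of exp(-t)/t from x to inf, x > 0."""
    if x <= 1.0:
        # Power series: -gamma - ln(x) + sum_{n>=1} (-1)^(n+1) x^n / (n * n!)
        total, term = 0.0, 1.0
        for n in range(1, 60):
            term *= -x / n          # term = (-x)^n / n!
            total -= term / n
        return -EULER_GAMMA - math.log(x) + total
    # Continued fraction evaluated with the modified Lentz method.
    b = x + 1.0
    c = 1e300
    d = 1.0 / b
    h = d
    for i in range(1, 200):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(-x)


def mmse_lsa_gain(xi, gamma):
    """Gain A_hat / R for a priori SNR xi and a posteriori SNR gamma (> 0)."""
    v = xi / (1.0 + xi) * gamma
    return xi / (1.0 + xi) * math.exp(0.5 * expint_e1(v))
```

For large v_k the exponential integral tends to zero, so the gain approaches xi / (1 + xi); for small v_k the exponential factor raises the gain above that value.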
MMSE Log-Spectral Amplitude
Estimator
„ Comparison of the suppression gains of MMSE STSA and MMSE LSA

- The gain curves of MMSE LSA are
always lower than those of MMSE
STSA, resulting in lower residual
noise.
- When the a priori SNR is high, the
gain curve of MMSE LSA is very flat,
which is similar to the Wiener filter.
When the a priori SNR is low, the
gain curve of the MMSE LSA varies
w.r.t. the instantaneous SNR as the
MMSE STSA does.

[Audio samples: Decision-Directed Wiener filter; MMSE LSA; noisy sample (0 dB)]
MMSE estimator with non-
Gaussian prior
How well does the Gaussian model fit the real probability distribution of DFT
coefficients?

[Figure: histogram of speech DFT amplitude; histogram of noise (recorded from
a marketplace) DFT amplitude.]

* The histograms are taken from one hour of speech.


MMSE estimator with non-
Gaussian prior
„ The probability density function of the DFT
coefficients of speech can be better modeled by
super-Gaussian functions (e.g. Gamma or
Laplace) than by the Gaussian function [Rainer
Martin 2002, 2003].
„ An even more exact probability density function is
one tailored to fit the shape of the histogram of
the DFT coefficients [Lotter, Vary 2003].
„ Using these density functions in place of the
Gaussian density function (for speech or noise
processes) in the MMSE estimator can result in
better noise reduction.
„ The non-Gaussian prior MMSE estimator is nonlinear and
non-zero-phase.
MMSE estimator with non-
Gaussian prior

„ Compared with the WF:
- Better output SNR (Gaussian/Gamma)
- Less musical noise (Laplace/Gamma)
- Less distortion to the speech
Exercises
1. The noncausal Wiener filter and the Ephraim-Malah filter are both MMSE
estimators. They have a lot in common. Please list at least 4 points
that these two estimators have in common.
2. So, what makes the two estimators different?
3. The residual noise is often categorized as white noise or musical noise.
Different filters produce different residual noise. Which of the two
types of residual noise do you prefer? Discuss how you would make the choice in different
communication scenarios.
4. What do you think about the experimental data (the histograms) used for finding the
PDF of the DFT amplitude? Can you suggest any improvement?
