
Wavelet based robust sub-band features for phoneme recognition

O. Farooq and S. Datta

Abstract: The wavelet transform has been found to be an effective tool for the time–frequency analysis of non-stationary and quasi-stationary signals. Recent years have seen the wavelet transform being used for feature extraction in speech recognition applications. In the paper a sub-band feature extraction technique based on an admissible wavelet transform is proposed, and the features are modified to make them robust to additive white Gaussian noise. The performance of this system is compared with that of the conventional mel frequency cepstral coefficients (MFCC) under various signal-to-noise ratios. The recognition performance based on the eight sub-band features is found to be superior to that of the MFCC features under noisy conditions.

1 Introduction

Owing to recent progress in automatic speech recognition (ASR), ASR systems are finding new applications in areas such as recognition over the telephone network, over the mobile network and over the Internet. An ASR system therefore has to perform recognition under all of these unknown conditions. It has been established that the performance of ASR degrades substantially in the presence of a mismatch between the training and test environments. This mismatch may be caused by differences in noise, microphone, channel characteristics, etc.

The basic strategies currently used to obtain robust ASR systems are based on the following two approaches:

(i) robust feature extraction
(ii) compensation.

The first approach is based on the extraction of features that are inherently resistant to noise. The techniques in this category include RASTA (relative spectra) processing [1], one-sided autocorrelation LPC (linear predictive coefficients) [2], the power difference method [3] and the cepstral subtraction method [4]. The second approach is based on a compensation model, which tries to recover the speech from the corrupted speech either in the feature parameter domain or at the pattern matching stage. Methods using the second approach are cepstral normalisation [5], probabilistic optimum filtering [6, 7] and parallel model combination [8].

All the above studies have been carried out using STFT based features. The recognition performance for plosives is found to be particularly poor with STFT based features. This is because, although the signal is assumed to be stationary over the window duration, this is not strictly true for plosives. Thus the STFT is not the ideal solution for this type of situation. The wavelet transform is a multi-resolution transform that can also process non-stationary signals, and it has recently been used to extract features for speech recognition [9-17]. Another advantage of using the wavelet transform is its compact support, which allows no spilling of energy into the side-lobes; this helps in selecting a non-overlapping window.

However, there has been no study of the recognition performance of these features under noisy conditions. Acoustic phonetic features have also been studied for the recognition of stop consonants [18, 19]. In this paper we present the results of phoneme recognition using wavelet based features under different levels of white Gaussian noise. Sub-band features based on the wavelet transform are calculated, and the feature extraction process is modified to make the features robust to white noise.

2 Wavelet transform

Owing to the fixed time–frequency resolution of the STFT, there have been several attempts [9-17] to use the wavelet transform for feature extraction. The wavelet transform is a time–frequency analysis technique that decomposes a signal over dilated and translated wavelets. A wavelet is a function ψ ∈ L²(ℝ) (i.e. a finite energy function) with zero mean, normalised so that ‖ψ‖ = 1 [20]. A family of wavelets can be obtained by scaling ψ by s and translating it by u:

    ψ_{u,s}(t) = s^(-1/2) ψ((t - u)/s)    (1)

The continuous wavelet transform (CWT) of a finite energy signal f(t) is given by:

    CWTf(u, s) = ∫_{-∞}^{+∞} f(t) s^(-1/2) ψ*((t - u)/s) dt    (2)

where ψ*(·) is the complex conjugate of ψ(·). The above equation can be viewed as the convolution of the signal with dilated band-pass filters. The DWT of a signal f[n] with period N is computed as:

    DWTf[n, a^j] = Σ_{m=0}^{N-1} f[m] a^(-j/2) ψ((m - n)/a^j)    (3)

© IEE, 2004
IEE Proceedings online no. 20040324
doi: 10.1049/ip-vis:20040324
Paper first received 28th January 2002 and in final revised form 4th September 2003. Originally published online 26th March 2004
O. Farooq is with the Department of Electronics Engineering, AMU Aligarh, Aligarh, 202 002, India
S. Datta is with the Department of Electronic and Electrical Engineering, Loughborough University, Loughborough LE11 3TU, UK

IEE Proc.-Vis. Image Signal Process., Vol. 151, No. 3, June 2004 187
where m and n are integers. The value of a is equal to 2 for a dyadic transform.

The signal representation is not complete if the wavelet decomposition is computed only up to a scale a^j. The information corresponding to scales larger than a^j is also required; it is computed by a scaling filter and is given by

    SFf[n, a^j] = Σ_{m=0}^{N-1} f[m] a^(-j/2) φ((m - n)/a^j)    (4)

where φ(n) is the discrete scaling filter.

By using the DWT, the problem in the recognition of stop phonemes with the STFT is overcome [11], as higher frequency bursts can easily be detected by moving up in frequency and reducing the time window. The features in [11] were based on the energy in the frequency sub-bands obtained by discrete wavelet decomposition. Since the DWT recursively decomposes only the lower frequency sub-band obtained from the previous decomposition, at higher levels of decomposition more features are derived from the lower frequency sub-bands, which are not very useful for discrimination tasks. To overcome this problem, wavelet packet decomposition based features [12-17], which give the liberty to partition either the lower or the higher frequency sub-band, were proposed. In [13, 14] a best basis selection criterion is applied for the selection of features; however, making the features shift invariant requires excessive computation. In [16] the use of an admissible wavelet packet (AWP), which gives an admissible binary tree structure, was proposed for feature extraction. The two wavelet packet orthogonal bases generated from a parent node are defined as:

    ψ_{j+1}^{2p}(k) = Σ_{n=-∞}^{+∞} h[n] ψ_j^p(k - 2^j n)    (5)

    ψ_{j+1}^{2p+1}(k) = Σ_{n=-∞}^{+∞} g[n] ψ_j^p(k - 2^j n)    (6)

where h[n] is the low-pass (scaling) filter and g[n] is the high-pass (wavelet) filter.

By using the AWP, more sub-bands can be obtained in the frequency region carrying more discriminatory information (the sub-bands that are perceptually more important). This is shown in Fig. 1, which illustrates an AWP tree and the corresponding partitioning of the time–frequency plane. A uniform bandwidth of 500 Hz is obtained from 0 to 3 kHz, giving six sub-bands. Two additional sub-bands are obtained in the frequency ranges 3–4 kHz and 4–8 kHz, giving a total of eight sub-bands. An equivalent DWT decomposition instead gives the frequency sub-bands 0–62.5 Hz, 62.5–125 Hz, 125–250 Hz, 250–500 Hz, 500–1000 Hz, 1–2 kHz, 2–4 kHz and 4–8 kHz. The first two of these sub-bands carry very little speech discriminatory information and hence are not very useful for speech recognition systems. Thus the features derived from the frequency sub-bands obtained by the AWP will have better classification ability. Further, since the wavelet transform can also process non-stationary signals, it is more suitable for feature extraction than the Fourier transform, and therefore a wavelet transform based feature extraction has been proposed.

3 Wavelet based feature extraction for noisy speech

A frame of 32 ms (512 samples) is formed; this is sub-divided into sub-frames of 8 ms. The AWP decomposition is applied using different 'Daubechies' wavelets over this sub-frame duration. First, the total energy of the wavelet coefficients in each frequency sub-band is calculated. This is normalised by dividing the total energy by the number of wavelet coefficients in the corresponding sub-band.

Since the number of coefficients in a sub-band depends on the bandwidth obtained after decomposition, this normalisation results in a non-uniform weighting of the energies. The AWP decomposition is performed in such a manner that higher bandwidth is obtained at the higher frequency end and lower bandwidth at the lower frequencies, as shown in Fig. 1. The normalisation therefore yields energy features that place more emphasis on the lower frequencies and less on the higher frequencies.

Figure 2 shows a clean and a noisy phoneme (/iy/) with their respective power spectral densities (PSD). Since most of the energy of the vowel /iy/ is concentrated at lower frequencies, the addition of white Gaussian noise does not change the shape of the PSD at these frequencies. However, some differences can be noticed at higher frequencies, where the PSD of the clean signal is itself low. This will cause problems in recognition if the effect of noise is not properly compensated. To offset the effect of noise, a simple technique similar to cepstral compensation, avoiding most of its complexity, is used. If the added noise is perfectly white and Gaussian, it will increase the energy in each sub-band by a constant factor. For this reason, if the mean is subtracted from the energies of all the sub-bands, the features will be as good as those of clean speech. However, this may result in negative energy values if the noise is not perfectly white, in which case logarithmic compression cannot be applied. To overcome this problem, the minimum sub-band energy in each sub-frame is calculated and 50% of this value is subtracted from all the sub-band energy features. An additional feature is also calculated, based on the variance of the energy features. This helps in the recognition of phonemes, because the variance is not altered by the constant addition that white noise may cause. Finally, logarithmic compression is applied to all the extracted features. In the feature extraction process more features are extracted from the lower frequency end, as shown in Table 1, since the signal-to-noise ratio (SNR) in this region is higher than the SNR at higher frequencies (Fig. 2). Further details of the steps involved in feature extraction from a noisy phoneme are given in the Appendix.

Two admissible tree structures were chosen, giving 6/8 sub-bands. The frequency bands obtained are shown in Table 1.

Fig. 1 Time–frequency tiling of the time–frequency plane by the admissible wavelet packet, and the corresponding tree structure
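The cascaded two-channel splits of (5) and (6) behind a tree such as that of Fig. 1, together with the per-band energy normalisation described above, can be sketched as follows. This is a minimal sketch, not the paper's implementation: it assumes an orthonormal Haar filter pair in place of the Daubechies filters used in the paper, and a synthetic 128-sample (8 ms at 16 kHz) sub-frame; band labels follow the 6-band structure of Table 1.

```python
import numpy as np

def haar_split(x):
    """One orthonormal two-channel split, i.e. the low-pass (h) and
    high-pass (g) branches of (5)/(6) followed by downsampling by 2."""
    x = x[: len(x) // 2 * 2]                 # ensure even length
    lo = (x[0::2] + x[1::2]) / np.sqrt(2.0)  # scaling-filter branch
    hi = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # wavelet-filter branch
    return lo, hi

def awp_6_bands(x):
    """Admissible tree for the 6-band structure of Table 1 (0-8 kHz input).
    Note: after splitting a high-pass branch, its low-pass output covers
    the UPPER half of that branch's band (frequency-order inversion)."""
    a1, d1 = haar_split(x)    # a1: 0-4 kHz, d1: 4-8 kHz
    a2, d2 = haar_split(a1)   # a2: 0-2 kHz, d2: 2-4 kHz
    a3, d3 = haar_split(a2)   # a3: 0-1 kHz, d3: 1-2 kHz
    a4, d4 = haar_split(a3)   # a4: 0-0.5 kHz, d4: 0.5-1 kHz
    u, v = haar_split(d2)     # u: 3-4 kHz, v: 2-3 kHz (inverted order)
    return {'0-500': a4, '500-1000': d4, '1000-2000': d3,
            '2000-3000': v, '3000-4000': u, '4000-8000': d1}

def band_energies(bands):
    """Total energy per band as in (7), and per-coefficient average (8)."""
    E = {k: float(np.sum(c ** 2)) for k, c in bands.items()}
    F = {k: E[k] / len(bands[k]) for k in bands}
    return E, F

rng = np.random.default_rng(0)
x = rng.standard_normal(128)   # one 8 ms sub-frame at 16 kHz
bands = awp_6_bands(x)
E, F = band_energies(bands)
```

Since every split is orthonormal, the six band energies sum to the energy of the input sub-frame, which gives a quick sanity check on the tree.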
Fig. 2 Clean and noisy phoneme /iy/
a Original phoneme /iy/ without noise
b Power spectral density of original phoneme
c Phoneme /iy/ in presence of noise (SNR = 10 dB)
d Power spectral density of noisy phoneme

Table 1: Frequency band distribution of wavelet and mel filters used for feature extraction

Sl. no.   Wavelet 6-band filter, Hz   Wavelet 8-band filter, Hz   Mel central frequency, Hz   Mel bandwidth, Hz
1         0-500                       0-500                       100                         100
2         500-1000                    500-1000                    200                         100
3         1000-2000                   1000-1500                   300                         100
4         2000-3000                   1500-2000                   400                         100
5         3000-4000                   2000-2500                   500                         100
6         4000-8000                   2500-3000                   600                         100
7         -                           3000-4000                   700                         100
8         -                           4000-8000                   800                         100
9         -                           -                           900                         100
10        -                           -                           1000                        124
11        -                           -                           1149                        160
12        -                           -                           1320                        184
13        -                           -                           1516                        211
14        -                           -                           1741                        242
15        -                           -                           2000                        278
16        -                           -                           2297                        320
17        -                           -                           2639                        367
18        -                           -                           3031                        422
19        -                           -                           3482                        484
20        -                           -                           4000                        556
21        -                           -                           4595                        639
22        -                           -                           5278                        734
23        -                           -                           6063                        843
24        -                           -                           6954                        969

Features were extracted as explained above for each sub-band, resulting in 7/9 features in each sub-frame and giving a total of 28/36 features for a frame of 32 ms duration. Figure 3 shows a block diagram of the procedure for extracting 6 sub-band features over a sub-frame duration, and the subsequent classification.

To see the effect of the mother wavelet on feature extraction, three different orders of Daubechies wavelet, DB2, DB6 and DB20 (Fig. 4), can be used. With higher order wavelets the support size increases and the function becomes smoother. Another important consequence of changing the order of the wavelet is a change in the normalisation factor. This is due to the change in the filter length, which results in different numbers of coefficients in each sub-band. The number of coefficients after a single decomposition of a signal x[n] by a mother wavelet 'DBN' is given by ⌊(n - 1)/2⌋ + N, where ⌊p⌋ denotes the largest integer not exceeding p and n is the length of the signal being decomposed. From this expression it is clear that the number of coefficients obtained depends on the mother wavelet as well as on the length of the input signal (in other words, on the sub-frame duration). Hence a lower order mother wavelet puts less emphasis on the higher bandwidth features. Since the higher bandwidths occur at the higher frequencies (as seen in Fig. 1), the features of these sub-bands are de-emphasised. If the weighting factor for the sub-bands with minimum bandwidth is scaled to unity, the weighting factors for different mother wavelets and bandwidths are as shown in Table 2. Thus, by choosing a higher N, the weighting factor of the 4-8 kHz sub-band becomes larger, giving more emphasis to this sub-band. This may result in improved recognition of phonemes with higher frequency content, such as fricatives, but may also reduce the recognition of other phonemes.
Fig. 3 Extraction of sub-band features
a Block diagram of feature extraction procedure by wavelet technique
b Processing for robust feature extraction for noisy phoneme classification

This is because the higher frequencies carry speaker dependent information as well.

Mel filters having triangular profiles with overlapping bands are used to extract the mel-frequency cepstral coefficients (MFCC). For a frequency range of 0-8 kHz there are 24 mel filters, with the specifications shown in Table 1. These filters must be normalised so that they do not increase the energy in the higher frequency bands. The log of the energy at the output of each filter is calculated and a discrete cosine transform (DCT) is applied to give 24 coefficients. These coefficients are known as the MFCC and can be viewed as the short-term spectral envelope of the speech signal after filtering. The lower order cepstral coefficients give an idea of the smoothness of the spectrum and correspond mainly to the vocal tract response rather than to the fine spectral structure, which produces artefacts that reduce spectral matching. Usually the first 13 DCT coefficients are used for speech recognition applications.

After the feature extraction phase, these features are given to a classifier for the task of recognition.
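The filterbank-plus-DCT pipeline described above can be sketched as follows. The 24 filters, 13 retained coefficients and 512-sample frame come from the text; the mel-scale mapping mel(f) = 2595 log10(1 + f/700), the triangular filter construction and the toy input are assumptions of this sketch, and the filter normalisation mentioned above is omitted for brevity.

```python
import numpy as np

def mel(f):
    # Hertz -> mel; a common form of the mel scale (an assumption here)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):
    # mel -> Hertz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs=16000, n_filters=24, n_keep=13):
    """Log mel filterbank energies followed by a DCT-II,
    keeping the first n_keep coefficients."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2          # power spectrum
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    # Triangular filters with edges equally spaced on the mel scale
    edges = mel_inv(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    fbank_energy = np.empty(n_filters)
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        tri = np.maximum(np.minimum(rising, falling), 0.0)
        fbank_energy[i] = np.sum(tri * power)
    log_e = np.log(fbank_energy + 1e-12)             # log compression
    # DCT-II of the log energies
    k = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_filters), k + 0.5) / n_filters)
    return (basis @ log_e)[:n_keep]

# Toy 32 ms frame (512 samples at 16 kHz): a 300 Hz tone
frame = np.sin(2 * np.pi * 300 * np.arange(512) / 16000)
coeffs = mfcc(frame)
```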
Fig. 4 Daubechies wavelets
a Order 2
b Order 6
c Order 20

Table 2: Number of samples and weighting factor for different mother wavelets for a frame of 8 ms

                              0.5 kHz bandwidth   1 kHz bandwidth   4 kHz bandwidth
DB2    No. of coefficients    10                  18                65
       Weighting factor       1                   0.5556            0.1538
DB6    No. of coefficients    18                  25                69
       Weighting factor       1                   0.7200            0.2609
DB20   No. of coefficients    44                  50                83
       Weighting factor       1                   0.8800            0.5301

In this work a simple classifier based on linear discriminant analysis (LDA) [21] using the Mahalanobis distance measure has been implemented and is used for phoneme classification.
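The coefficient-count expression ⌊(n - 1)/2⌋ + N and the weighting factors of Table 2 can be reproduced with a short script. The 128-sample sub-frame (8 ms at 16 kHz) and the level counts per band (4 levels for the 0.5 kHz bands, 3 for the 1 kHz bands, 1 for the 4 kHz band) are read off the tree of Fig. 1.

```python
def coeff_count(n, N, levels):
    """Coefficients left after `levels` single decompositions of an
    n-sample signal by a DBN wavelet: n -> (n - 1)//2 + N per level."""
    for _ in range(levels):
        n = (n - 1) // 2 + N
    return n

SUBFRAME = 128                       # 8 ms at 16 kHz
LEVELS = {'0.5 kHz': 4, '1 kHz': 3, '4 kHz': 1}

for name, N in [('DB2', 2), ('DB6', 6), ('DB20', 20)]:
    counts = {bw: coeff_count(SUBFRAME, N, lv) for bw, lv in LEVELS.items()}
    ref = counts['0.5 kHz']          # scale minimum-bandwidth factor to unity
    weights = {bw: round(ref / c, 4) for bw, c in counts.items()}
    print(name, counts, weights)     # matches the rows of Table 2
```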

4 Experimentation on noisy speech recognition

Vowels (/aa/, /ax/ and /iy/), unvoiced fricatives (/f/, /sh/ and /s/) and unvoiced stops (/p/, /t/ and /k/) from the dialect regions DR1 (New England region) and DR2 (Northern part of the USA) of the TIMIT (Texas Instruments/Massachusetts Institute of Technology) database were extracted for training and testing the classifier. A total of 151 speakers' data was used, of which 114 speakers' data were used for training and the rest for testing the classifier. There were 49 female speakers in all, of which 37 speakers' data were used for training the LDA based classifier.

Fig. 5 Recognition performance for fricatives, vowels and stops by using 52 MFCC features and 28 AWP (DB6) based log-energy features
a Unvoiced fricative recognition
b Vowel recognition
c Unvoiced stop recognition
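The LDA based classification with the Mahalanobis distance measure can be sketched as a nearest-class-mean rule under a pooled within-class covariance. This is a minimal sketch rather than the paper's implementation: the two-class Gaussian toy data, class count and dimensionality are illustrative assumptions.

```python
import numpy as np

class MahalanobisLDA:
    """Assign a feature vector to the class whose mean is nearest in
    Mahalanobis distance under a pooled within-class covariance."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # Pool the within-class scatter by centring each class on its mean
        centred = np.vstack([X[y == c] - X[y == c].mean(axis=0)
                             for c in self.classes_])
        self.inv_cov_ = np.linalg.inv(np.cov(centred, rowvar=False))
        return self

    def predict(self, X):
        diff = X[:, None, :] - self.means_[None, :, :]       # (n, k, dim)
        dist2 = np.einsum('nkd,de,nke->nk', diff, self.inv_cov_, diff)
        return self.classes_[np.argmin(dist2, axis=1)]

# Toy two-class data: two well-separated Gaussian clusters in 4 dimensions
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 4)), rng.normal(5.0, 1.0, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
clf = MahalanobisLDA().fit(X, y)
acc = float((clf.predict(X) == y).mean())
```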
To have noisy speech, white Gaussian noise of different power was generated and injected into the speech signal to obtain different levels of signal-to-noise ratio. In the first experiment the AWP was used to decompose the signal into six sub-bands using Daubechies 6th order (DB6) wavelets. An admissible tree structure was chosen such that the 0-8 kHz frequency range was divided into six sub-bands. Features were extracted as explained earlier for each sub-band, resulting in 7 features per sub-frame and giving a total of 28 features for a 32 ms duration. Thirteen MFCC based features were also extracted for every sub-frame duration. The comparative results obtained are shown in Fig. 5. It can be seen from the results that the MFCC based features perform especially well for vowel and voiced fricative recognition, while comparable performance is achieved for unvoiced fricative and voiced/unvoiced stop recognition, despite a reduction of approximately 46% in the feature dimension.

The recognition performance for unvoiced fricatives, vowels and unvoiced stops for DB2, DB6, DB20 and MFCC is shown in Fig. 6. The DB20 based features show considerable improvement in the recognition performance for unvoiced fricatives and stops over the lower order wavelet based features. In most of these cases the performance is superior to that of MFCC (Fig. 5); however, the same improvement is not reflected in vowel recognition. This is because vowels have more energy concentrated at the lower frequencies (as seen in Fig. 2). Also, better recognition is achieved in all cases by DB20 at lower SNR, and it is poorer only for stop identification at higher SNR.

In the next set of experiments the number of sub-bands was increased from 6 to 8 and DB20 was chosen for decomposition.
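The noise injection step can be sketched as follows: the white Gaussian noise is rescaled so that the measured SNR matches the requested level exactly. The 440 Hz test tone and the 16 kHz sample rate are illustrative assumptions.

```python
import numpy as np

def add_white_noise(x, snr_db, rng):
    """Return x plus white Gaussian noise scaled to the target SNR in dB."""
    noise = rng.standard_normal(len(x))
    p_signal = np.mean(x ** 2)
    p_target = p_signal / (10.0 ** (snr_db / 10.0))   # desired noise power
    noise *= np.sqrt(p_target / np.mean(noise ** 2))  # exact rescaling
    return x + noise, noise

rng = np.random.default_rng(2)
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s toy signal
noisy, noise = add_white_noise(x, snr_db=10.0, rng=rng)
measured = 10.0 * np.log10(np.mean(x ** 2) / np.mean(noise ** 2))
```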
Fig. 6 Recognition performance by using different orders of Daubechies wavelet
a Unvoiced fricative recognition
b Vowel recognition
c Unvoiced stop recognition

Fig. 7 Comparative recognition performance of 52 MFCC and 32 and 36 AWP based features on DB20
a Unvoiced fricative recognition
b Vowel recognition
c Unvoiced stop recognition

Two sets of feature vectors were extracted; the first was similar to the above, having eight normalised log-energy features and one variance feature, while in the second set the variance was omitted. The recognition performance achieved is shown in Fig. 7. It can be seen that there is no improvement in the recognition performance of the phonemes when the variance feature is added. The variance feature depends on the distribution of the sub-band energies: if the sub-bands are fewer in number (i.e. the bandwidth is larger), speaker variation will not cause much variation in the sub-band energy for a given phoneme. This gives the same variance feature for a phoneme across different speakers. However, if there are more sub-bands (i.e. the bandwidth is smaller), the energies in the sub-bands may differ for a given phoneme across speakers owing to differences in their formant frequencies. Also, the noise energy in each sub-band is more uniform when the sub-bands are fewer in number (i.e. of higher bandwidth).

5 Discussion

The performance of the wavelet based sub-band features has been studied for the task of phoneme recognition. For the results obtained, the wavelet based features show better recognition performance than the MFCCs for the phonemes under test. An additional advantage of using the wavelet transform is the elimination of overlapping windows, which reduces the number of computations performed during the feature extraction phase. The proposed features are also robust to additive white Gaussian noise and show significant improvement over MFCC at 5 dB and 0 dB SNR.

6 References

1 Hermansky, H., and Morgan, N.: 'RASTA processing of speech', IEEE Trans. Speech Audio Process., 1994, 2, (4), pp. 578-589
2 You, K.H., and Wang, H.C.: 'Robust features for noisy speech recognition based on temporal trajectory filtering of short-time autocorrelation sequences', Speech Commun., 1999, 28, (1), pp. 13-24
3 Xu, J., and Wei, G.: 'Noise-robust speech recognition based on difference of power spectrum', Electron. Lett., 2000, 36, (14), pp. 1247-1248
4 Rahim, M.G., Juang, B.H., Chou, W., and Buhrke, E.: 'Signal conditioning techniques for robust speech recognition', IEEE Signal Process. Lett., 1996, 3, (4), pp. 107-109
5 Acero, A., and Stern, R.M.: 'Environmental robustness in automatic speech recognition'. Proc. Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP'90, Albuquerque, USA, April 1990, pp. 849-852

6 Neumeyer, L., and Weintraub, M.: 'Probabilistic optimum filtering for robust speech recognition'. Proc. Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP'94, Adelaide, South Australia, April 1994, pp. I-417-I-420
7 Kim, D.Y., and Un, C.K.: 'Probabilistic vector mapping with trajectory information for noise-robust speech recognition', Electron. Lett., 1996, 32, (17), pp. 1550-1551
8 Gales, M.J.F., and Young, S.J.: 'Robust continuous speech recognition using parallel model combination', IEEE Trans. Speech Audio Process., 1996, 4, (5), pp. 352-359
9 Long, C.J., and Datta, S.: 'Wavelet based feature extraction for phoneme recognition'. Proc. 4th Int. Conf. on Speech and Language Processing, ICSLP'96, Philadelphia, USA, October 1996, pp. 264-267
10 Tan, B.T., Fu, M., Spray, A., and Dermody, P.: 'The use of wavelet transform in phoneme recognition'. Proc. 4th Int. Conf. on Speech and Language Processing, ICSLP'96, Philadelphia, USA, October 1996, pp. 2431-2434
11 Farooq, O., and Datta, S.: 'Wavelet transform for dynamic feature extraction of phonemes', Acoust. Lett., 1999, 23, (4), pp. 79-82
12 Chang, S., Kwon, Y., and Yang, S.: 'Speech feature extracted from adaptive wavelet for speech recognition', Electron. Lett., 1998, 34, (23), pp. 2211-2213
13 Long, C.J., and Datta, S.: 'Discriminant wavelet basis construction for speech recognition'. Proc. 5th Int. Conf. on Speech and Language Processing, ICSLP'98, Sydney, Australia, Nov.-Dec. 1998, vol. 3, pp. 1047-1049
14 Lukasik, E.: 'Wavelet packets based features selection for voiceless plosives classification'. Proc. Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP 2000, Istanbul, Turkey, June 2000, vol. 2, pp. 689-692
15 Farooq, O., and Datta, S.: 'Dynamic feature extraction by wavelet analysis'. Proc. 6th Int. Conf. on Speech and Language Processing, ICSLP 2000, Beijing, China, October 2000, vol. 4, pp. 696-699
16 Farooq, O., and Datta, S.: 'Modified discrete wavelet features for phoneme recognition'. Proc. Workshop on Innovations in Speech Processing, WISP 2001, Stratford-upon-Avon, UK, April 2001, 23, (3), pp. 93-99
17 Farooq, O., and Datta, S.: 'Mel filter-like admissible wavelet packet structure for speech recognition', IEEE Signal Process. Lett., 2001, 8, (7), pp. 196-198
18 Abdelatty, A.M., Spiegel, J.V., and Mueller, P.: 'Acoustic phonetic features for the automatic recognition of stop consonants', J. Acoust. Soc. Am., 1998, 103, (5), pp. 2777-2778
19 Abdelatty, A.M., Spiegel, J.V., and Mueller, P.: 'Automatic detection and classification of stop consonants using an acoustic-phonetic feature-based system'. Proc. Int. Congress of Phonetic Sciences, San Francisco, 1999, pp. 1709-1712
20 Mallat, S.: 'A wavelet tour of signal processing' (Academic Press, San Diego, CA, USA, 1998)
21 Fukunaga, K.: 'Introduction to statistical pattern recognition' (Academic Press, San Diego, CA, USA, 1990)

7 Appendix: Feature extraction steps from noisy phonemes

• Perform wavelet decomposition of the phoneme using the AWP tree structure.
• Calculate the energy of the wavelet coefficients in each sub-band. If c_{j,p} is the jth wavelet coefficient in the pth sub-band, the total energy E_p in sub-band p is given by:

    E_p = Σ_{j=1}^{N_p} (c_{j,p})²,  p = 1, 2, ..., L    (7)

    F_p = E_p / N_p,  p = 1, 2, ..., L    (8)

where N_p is the number of wavelet coefficients in the pth sub-band and L is the number of sub-bands. The calculated energy is divided by the number of wavelet coefficients in the corresponding sub-band, giving the average energy per wavelet coefficient per sub-band, F_p (see (8)).
• To evaluate noise robust features, first calculate the average sub-band energy m by using (9):

    m = (1/L) Σ_{p=1}^{L} F_p    (9)

• The final features FF_p are calculated as:

    FF_p = F_p - 0.5 m,  p = 1, 2, ..., L    (10)

This gives L sub-band energy based features.
• The variance based feature VF is extracted as:

    VF = (1/L) Σ_{p=1}^{L} (F_p - m)²    (11)

This gives a total of L + 1 features for each phoneme.
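The steps above, i.e. (7)-(11), can be collected into one short routine. The two toy sub-bands are illustrative, chosen so that the outputs can be checked by hand; the final logarithmic compression described in Section 3 is omitted here, since (10) can produce non-positive values when F_p falls below m/2.

```python
import numpy as np

def robust_subband_features(band_coeffs):
    """Implements (7)-(11): per-band energy E_p, per-coefficient average
    F_p, average sub-band energy m, shifted energy features
    FF_p = F_p - 0.5*m, and the variance feature VF."""
    E = np.array([np.sum(np.asarray(c, float) ** 2) for c in band_coeffs])  # (7)
    N = np.array([len(c) for c in band_coeffs])
    F = E / N                        # (8) average energy per coefficient
    m = F.mean()                     # (9) average sub-band energy
    FF = F - 0.5 * m                 # (10) L energy based features
    VF = np.mean((F - m) ** 2)       # (11) one variance based feature
    return FF, VF

# Toy example: two sub-bands with hand-checkable numbers.
# E = [2, 16], N = [2, 4], F = [1, 4], m = 2.5
FF, VF = robust_subband_features([[1.0, 1.0], [2.0, 2.0, 2.0, 2.0]])
# FF = [-0.25, 2.75], VF = 2.25
```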

