
Time-Frequency Feature Extraction from Spectrograms and Wavelet Packets with Application to Automatic Stress and Emotion Classification in Speech

Ling He, Margaret Lech, Namunu C. Maddage
School of Electrical and Computer Engineering, RMIT University, Melbourne, Australia
margaret.lech@rmit.edu.au

Nicholas B. Allen
Department of Psychology, The University of Melbourne, Melbourne, Australia
nba@unimelb.edu.au

Abstract: Three new methods of feature extraction based on time-frequency analysis of speech are presented and compared. In the first approach, speech spectrograms were passed through a bank of 12 log-Gabor filters and the outputs were averaged. In the second approach, the spectrograms were sub-divided into ERB frequency bands and the average energy for each band was calculated. In the third approach, wavelet packet arrays were calculated, passed through a bank of 12 log-Gabor filters, and averaged. The feature extraction methods were tested in the process of automatic stress and emotion classification. The feature distributions were modeled and classified using a Gaussian mixture model. The test samples included single vowels, words and sentences from the SUSAS database with 3 classes of stress, and spontaneous speech recordings with 5 emotional classes from the ORI database. The classification results showed correct classification rates ranging from 64.70% to 84.85% for the different SUSAS data sets and from 39.6% to 53.4% for the ORI database.

Keywords: time-frequency analysis; speech classification; spectrograms; wavelet packets; stress and emotion recognition

I. INTRODUCTION

Speech expresses emotion, and that emotion is a vital part of human communication. Just as effective human-to-human communication is virtually impossible without speakers being able to detect and understand each other's emotions, human-machine communication suffers from significant inefficiencies because machines cannot understand our emotions or generate emotional responses. Words alone are not enough to correctly understand the mood and intention of a speaker, and thus the introduction of human social skills into human-machine communication is of paramount importance. This can be achieved by researching and creating methods of speech modeling and analysis that embrace the signal, linguistic and emotional aspects of communication.

Prosodic features of speech produced by a speaker under stress or emotion differ from features produced under the neutral condition. The most often observed changes include changes in utterance duration, a decrease or increase of pitch, and a shift of formant frequencies.
The automatic recognition and classification of speech
under stressful conditions has applications in behavioral and
mental health sciences, human to machine communication,
robotics, and medicine.
Stress and emotion classification in speech are
computational tasks consisting of two major parts: feature
extraction and feature classification. The majority of recent
studies [12] focus on the acoustic features derived from linear
models of speech production. Features that are most often used
include: pitch features (F0), spectral features (formants) and
intensity features (energy). There are also studies proposing
features such as linear predictive cepstral coefficients (LPCC)
[11] and mel frequency cepstral coefficients (MFCC) [5].
Classification methods used in stress recognition include: the
Gaussian mixture model (GMM) [17], the hidden Markov
model (HMM) [5] and various neural network systems [6].
A 2D narrowband magnitude spectrogram is a graphical
display of the squared magnitude of the time-varying spectral
characteristics of speech [1]. It is a compact and highly
efficient representation carrying information about parameters
such as energy, pitch F0, formants and timing. These
parameters are the acoustic features of speech most often used
in automatic stress and emotion recognition systems [2-4]. The
majority of these systems analyze each parameter separately
and then combine them into a set of feature vectors. Through
analyzing a spectrogram one could capture all of these
characteristics at once and preserve the important underlying
dependencies between different parameters.
The additional advantage is that by analyzing a speech
spectrogram, it is more likely to preserve and take into account
speech features that are caused by specific psycho-physiological effects appearing under certain emotions and stress. In previous studies [8,9], Kleinschmidt et al. applied a 2D Gabor filter bank to mel-spectrograms. The resulting outputs of the Gabor filters were concatenated into one-dimensional vectors and used as features in speech recognition experiments. Chih et al. [10] applied a similar method in speech discrimination and enhancement. In recent
studies Ezzat et al. [5-7] described a 2D Gabor filter bank
decomposing localized patches of spectrograms into
components representing speech harmonicity, formants,
vertical onsets/offsets, noise and overlapping simultaneous
speakers. In recent studies of automatic stress recognition by
He et al., [12], speech spectrograms were used to derive
features extracted from the sigma-pi cells. The analysis was
performed at three alternative sets of frequency bands: critical
bands, Bark scale bands and equivalent rectangular bandwidth
(ERB) scale bands. The ERB scale provided the highest
correct classification rates ranging from 67.84% to 73.76%.
The classification results did not differ between data sets
containing specific types of vowels and data sets containing
mixtures of vowels. This indicates that the proposed method
can be applied to voiced speech independently of the spoken content. Following this line of research, we propose
new methods of feature extraction from speech spectrograms
and time-frequency wavelet packet arrays, and apply them to
automatic recognition of emotion and stress in speech. We will
test our methods using speech recordings containing
spontaneous (not acted) emotions. A general flowchart of our
classification system is depicted in Fig. 1. Section II describes the speech data, Section III explains the details of the feature extraction and classification stages, Section IV presents the classification results, and Section V presents the conclusions.
Fig. 1. Feature extraction and classification system (speech → pre-processing → voiced speech detection → feature generation → classification → classification result).

II. SPEECH DATA

A. SUSAS Database.
The Speech under Simulated and Actual Stress (SUSAS)
[13] database comprises a wide variety of acted and actual
stresses and emotions. Only speech recorded under actual stress
conditions was used in this study. The speech samples were
selected from the Actual Speech under Stress domain, which
includes speech recordings made by passengers during rides on
a roller-coaster. This domain consisted of recordings from 7
speakers (4 male and 3 female). The speakers read words from a 35-word list. The amount of stress was subjectively determined by the position of the roller-coaster
during the time when the recording was made. A total of 3179
speech recordings, including 1202 recordings representing the
high stress, 1276 recordings representing the moderate stress,
and 701 recordings representing the neutral speech, were used
in this study.
B. ORI Database.
A soundtrack of video recordings from the Oregon
Research Institute (ORI) [14] was used to select speech
samples for processing. The data included 71 parents (27
mothers and 44 fathers) video recorded while being engaged in
a family discussion with their children. During the discussion
the family was asked to discuss different problem solving tasks.
The videotapes were annotated by a trained psychologist based
on both speech and facial expressions, using the Living in Family Environments (LIFE) coding system [15]. The Adobe Pro software was applied to convert the video files into audio files with a sampling frequency of 8 kHz. Each class (angry, happy, anxious, dysphoric and neutral) was represented by 200 utterances (100 with male and 100 with female speech). The average duration of each utterance was 3 seconds. A neutral voice tone has an even, relaxed quality without marked stress on individual syllables. Anger communicates displeasure, irritation, annoyance or frustration. Happiness is reflected by a high-pitched or sing-song voice that is not whining; speech is faster or louder than usual, but not angry. An anxious state is expressed when the speaker communicates anxiety, nervousness, fear or embarrassment; an elevated voice volume accompanied by rapid speech is common. A dysphoric state is evident when the subject communicates sadness and depression with a low voice tone and a slow pace of speech.
TABLE I. DESCRIPTION OF DATA SETS.

Dataset | Words used in dataset | High stress | Low stress | Neutral
SUSAS: vowels | east, freeze, three | 133 | 143 | 59
SUSAS: vowels | break, change, eight, eighty, gain, strafe | 206 | 220 | 121
SUSAS: mixed vowels | break, change, east, eight, fix, freeze, gain, go, help, hot, mark, navy, no, oh, on, out, point, six, south, stand, steer, strafe, ten, three, white, wide | 871 | 931 | 523
SUSAS: Actual Speech under Stress (words) | break, change, degree, hot, east, eight, eighty, enter, fifty, fix, freeze, gain, go, hello, help, histogram, destination, mark, nav, no, oh, on, out, point, six, south, stand, steer, strafe, ten, thirty, three, white, wide, zero | 1202 | 1276 | 701
ORI database | spontaneous speech; 5 emotions (angry, anxious, dysphoric, neutral and happy) | 200 recordings per emotion

III. SYSTEM STRUCTURE

Two databases of spontaneous speech recordings were used: the SUSAS data, which is widely used in emotion recognition studies, and the ORI data, originally created by psychologists for the purpose of behavioral studies. The speech was sampled by a 16-bit A/D converter with an 8 kHz
sampling rate. The feature extraction was done using voiced
speech. The data was divided into five sets described in Table
I. For each data set, 80% of all recordings were randomly
selected as the training data, and the remaining 20% were used
as the testing data. For each dataset and for each type of
feature selection method, the classification process was
performed 15 times. An average percentage of identification
accuracy was then calculated over the 15 runs.
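The split-and-average protocol can be outlined as follows (an illustrative Python sketch, not the authors' code; the names features, labels and train_fn are hypothetical placeholders):

```python
import numpy as np

def average_identification_accuracy(features, labels, train_fn, n_runs=15, train_ratio=0.8, seed=0):
    """Average percentage of identification accuracy over repeated random 80/20 splits."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    accuracies = []
    for _ in range(n_runs):
        order = rng.permutation(n)
        n_train = int(train_ratio * n)
        train_idx, test_idx = order[:n_train], order[n_train:]
        model = train_fn(features[train_idx], labels[train_idx])   # e.g. per-class GMMs
        predictions = model.predict(features[test_idx])
        accuracies.append(np.mean(predictions == labels[test_idx]) * 100.0)
    return float(np.mean(accuracies))
```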
A. Feature Extraction Methods
We have tested and compared two spectrogram-based
approaches and one wavelet packet-based approach to the
feature selection process. In the first approach (Fig. 2), the
speech spectrograms were passed through a bank of 12 log-Gabor filters with 2 different scales and 6 different orientations.


The magnitudes of the 12 filter outputs were then averaged and
passed through an optimal feature selection algorithm based on
mutual information (MI) criteria. In the second approach (Fig.
3), the 2D spectrograms were divided into sub-bands based on
the ERB auditory scale [12]. The frequencies corresponding to
the ERB bands are listed in Table II. For each band a single
value of average energy E_i (i = 1, ..., N) was calculated using:

E_i = \frac{1}{N_f N_t} \sum_{y=1}^{N_f} \sum_{x=1}^{N_t} s(x, y)    (1)

where s(x,y) are the spectrogram values (squared magnitudes)


at the time coordinates x and frequency coordinates y, Nf is the
total number of frequency coordinates, Nt is the total number of
time coordinates, and N is the total number of frequency bands
(N=27 for ERB scale). The resulting feature values were then
concatenated into 1D vectors. In the third approach (Fig.4), the
wavelet packet arrays were calculated and passed through a
bank of 12 log-Gabor filters, averaged, and passed through the
MI feature selection.
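As an illustration of the second approach, Eq. 1 amounts to averaging the spectrogram energy within each frequency band. A minimal Python sketch is given below (not the authors' implementation; the band edges would be taken from the ERB column of Table II, and the helper name band_energy_features is hypothetical):

```python
import numpy as np
from scipy.signal import spectrogram

def band_energy_features(signal, fs, band_edges_hz):
    """Return one average-energy value per frequency band (Eq. 1)."""
    freqs, times, sxx = spectrogram(signal, fs=fs, nperseg=256, noverlap=196)
    features = []
    for lo, hi in band_edges_hz:                       # e.g. the 27 ERB bands of Table II
        rows = (freqs >= lo) & (freqs < hi)            # frequency bins inside this band
        if not np.any(rows):
            features.append(0.0)
            continue
        features.append(float(np.mean(sxx[rows, :])))  # 1/(Nf*Nt) * sum over the band
    return np.asarray(features)
```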
Fig. 2. Feature extraction using averaged outputs from log-Gabor filters (voiced speech → spectrogram calculation → 12 log-Gabor filters → averaging → optimal feature selection).
Fig. 3. Feature extraction using auditory frequency band filters (voiced speech → spectrogram calculation → CB, Bark or ERB bands → average energy calculation).


Fig. 4. Feature extraction using wavelet packet coefficients and averaged log-Gabor filter outputs (voiced speech → wavelet packet coefficient calculation → 12 log-Gabor filters → averaging → optimal feature selection).

B. Pre-processing
Both the SUSAS and ORI data sets were recorded in real-life noisy conditions. To reduce the background noise, a wavelet-based method developed by Donoho [16] was applied. Speech signals of length N and noise standard deviation \sigma were decomposed using the wavelet transform with the mother wavelet db2 up to the second level, and the universal threshold \lambda = \sigma \sqrt{2 \log(N)} was applied to each wavelet sub-band. The signal was then reconstructed using the inverse wavelet transform (IWT). The voiced speech was extracted using a rule-based adaptive endpoint detection method [19].
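A minimal sketch of this denoising step, assuming the PyWavelets package, is given below. The median-absolute-deviation estimate of \sigma and the soft thresholding of the detail coefficients only are assumptions, since the paper does not state these details:

```python
import numpy as np
import pywt

def denoise_speech(signal):
    # db2 decomposition to level 2, as described in the pre-processing stage.
    coeffs = pywt.wavedec(signal, wavelet="db2", level=2)
    # Noise standard deviation estimated from the finest detail coefficients
    # (MAD estimator) -- an assumption, as the paper does not say how sigma was obtained.
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    threshold = sigma * np.sqrt(2.0 * np.log(len(signal)))   # universal threshold
    denoised = [coeffs[0]] + [pywt.threshold(c, threshold, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet="db2")
```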
C. Calculation of Spectrogram Arrays
Narrowband spectrogram arrays were calculated using short-time spectral analysis applied to 256-point frames of voiced speech with a 196-point overlap. The energy spectral density was calculated using the FFT algorithm, and only components with an SNR greater than or equal to 50 dB were kept in the spectrogram; everything with an SNR below 50 dB was removed [8,9]. It was observed that with an increasing level of
stress, the spectrograms revealed increasing formant energy in
the higher frequency bands, as well as clearly increasing pitch
for high-level stress. Other acoustic characteristics, such as the formants, also vary under different levels of stress. These


observations indicate that the spectrograms contain important
characteristics that can be used to differentiate between
different levels of stress.
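The spectrogram computation described above could be sketched as follows (an illustrative outline, not the authors' code; applying the 50 dB floor relative to the spectrogram maximum is an assumption, since the reference level is not specified):

```python
import numpy as np
from scipy.signal import spectrogram

def speech_spectrogram(voiced_speech, fs=8000, floor_db=50.0):
    # 256-point frames with a 196-point overlap, as described in Section III.C.
    freqs, times, sxx = spectrogram(voiced_speech, fs=fs, nperseg=256, noverlap=196,
                                    window="hann", mode="psd")
    sxx_db = 10.0 * np.log10(sxx + np.finfo(float).tiny)
    # Discard components more than floor_db below the maximum (assumed reference).
    sxx_db = np.maximum(sxx_db, sxx_db.max() - floor_db)
    return freqs, times, sxx_db
```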
D. Calculation of Wavelet Packet Arrays
The wavelet packet time-frequency arrays of coefficients
were calculated by decomposing the speech frames up to the 4th level of the wavelet packet tree [4]. The frequency ranges
corresponding to each of the terminal nodes are listed in Table
II. As in the case of spectrograms, an increasing level of stress
leads to an increase of formant energy within the higher
frequency bands of the wavelet packet arrays.
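A minimal sketch of the level-4 wavelet packet decomposition, assuming the PyWavelets package, is shown below; the choice of db2 as the mother wavelet for this step is an assumption carried over from the denoising stage:

```python
import numpy as np
import pywt

def wavelet_packet_array(frame, wavelet="db2", level=4):
    """Return a (16, n) array of coefficients, one row per terminal node (Table II bands)."""
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet, maxlevel=level)
    # 'freq' ordering lists the terminal nodes from the lowest to the highest band.
    nodes = wp.get_level(level, order="freq")
    return np.vstack([node.data for node in nodes])
```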
TABLE II. ERB SCALE BANDS AND WAVELET PACKET BANDS IN HZ.

No | ERB lower [Hz] | ERB upper [Hz] | Wavelet packet lower [Hz] | Wavelet packet upper [Hz]
1 | 0.05 | 24.75 | 0 | 250
2 | 24.75 | 52.25 | 250 | 500
3 | 52.2 | 82.8 | 500 | 750
4 | 82.85 | 116.95 | 750 | 1000
5 | 116.9 | 154.9 | 1000 | 1250
6 | 154.95 | 197.25 | 1250 | 1500
7 | 197.25 | 244.35 | 1500 | 1750
8 | 244.35 | 296.85 | 1750 | 2000
9 | 296.8 | 355.2 | 2000 | 2250
10 | 355.15 | 420.25 | 2250 | 2500
11 | 428.3 | 501.7 | 2500 | 2750
12 | 492.65 | 573.35 | 2750 | 3000
13 | 573.4 | 663.2 | 3000 | 3250
14 | 663.2 | 763.2 | 3250 | 3500
15 | 763.2 | 874.6 | 3500 | 3750
16 | 874.65 | 998.75 | 3750 | 4000
17 | 998.7 | 1136.9 | - | -
18 | 1136.9 | 1290.7 | - | -
19 | 1290.75 | 1462.05 | - | -
20 | 1462 | 1652.8 | - | -
21 | 1652.85 | 1865.35 | - | -
22 | 1865.3 | 2101.9 | - | -
23 | 2101.85 | 2365.35 | - | -
24 | 2365.4 | 2658.8 | - | -
25 | 2658.75 | 2985.45 | - | -
26 | 2985.5 | 3349.3 | - | -
27 | 3349.3 | 3754.5 | - | -

E. Application of Log-Gabor Filters


Gabor filters are commonly recognized [17] as one of the
best choices for obtaining features in image classification.
They offer an excellent simultaneous localization of spatial
and frequency information. However, the maximum
bandwidth of a Gabor filter is limited to approximately one
octave and Gabor filters are not optimal if one is seeking broad
spectral information with maximal spatial localization. As an
alternative to the Gabor filters the log-Gabor filters were
proposed by Field [18]. Log-Gabor filters can be constructed
with arbitrary bandwidth and the bandwidth can be optimized
to produce a filter with minimal spatial extent. The log-Gabor
filters have Gaussian transfer functions when viewed on the
logarithmic frequency scale, whereas the Gabor filters have
Gaussian transfer functions when viewed on the linear
frequency scale. It was therefore postulated that the log-Gabor functions, having extended tails at the high frequency ends, should be able to encode natural images more efficiently by better representing the higher frequency components. The transfer functions of log-Gabor filters are compatible with the human visual system, which has cell responses that are symmetric on the log frequency scale. Furthermore, a log-Gabor filter always has a zero DC component and therefore the filter bandwidth can be optimized to produce a filter with minimal spatial extent.

The log-Gabor filters in the frequency domain can be defined in polar coordinates by the transfer function G(r, \theta) constructed as the following product:

G(r, \theta) = G_{radial}(r) \, G_{angular}(\theta)    (2)

where G_{radial}(r) is the frequency response of the radial filter component, given as:

G_{radial}(r) = \exp\left( -\frac{(\log(r/f_0))^2}{2\sigma_r^2} \right)    (3)

and G_{angular}(\theta) represents the frequency response of the angular filter component, given as:

G_{angular}(\theta) = \exp\left( -\frac{(\theta - \theta_0)^2}{2\sigma_\theta^2} \right)    (4)

In Eqs. 2-4, (r, \theta) are the polar coordinates, f_0 represents the central filter frequency, \theta_0 is the orientation angle, and \sigma_r and \sigma_\theta represent the scale bandwidth and the angular bandwidth, respectively. The number of different wavelengths (scales) for the filter bank was set to N_r = 2, and for each wavelength the number of different orientations was set to N_\theta = 6. This produced a bank of 12 log-Gabor filters {G_1, G_2, ..., G_{12}}, with each filter representing a different scale and orientation.

The log-Gabor feature representation |S(x, y)|_{m,n} of a magnitude spectrogram s(x, y) was calculated as a convolution operation performed separately for the real and imaginary parts of the log-Gabor filters:

Re(S(x, y))_{m,n} = s(x, y) * Re(G(r_m, \theta_n))    (5)

Im(S(x, y))_{m,n} = s(x, y) * Im(G(r_m, \theta_n))    (6)

where (x, y) represent the time and frequency coordinates of a spectrogram, m = 1, ..., N_r = 2 and n = 1, ..., N_\theta = 6. This was followed by the magnitude calculation for the filter bank outputs:

|S(x, y)|_{m,n} = \sqrt{ (Re(S(x, y))_{m,n})^2 + (Im(S(x, y))_{m,n})^2 }    (7)

In the two approaches illustrated in Fig. 2 and Fig. 4, the filter bank outputs were averaged to produce a single output array:

|S(x, y)| = \frac{1}{N_r N_\theta} \sum_{m=1}^{N_r} \sum_{n=1}^{N_\theta} |S(x, y)|_{m,n}    (8)

The averaged arrays were then converted to 1D vectors via row-by-row concatenation.
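Eqs. 2-8 can be sketched as follows. Filtering is carried out here by multiplication in the 2D frequency domain, which is equivalent to the spatial convolutions in Eqs. 5-6; the centre frequencies f_0 and the bandwidths \sigma_r and \sigma_\theta are not reported in the paper, so the values used below are placeholders:

```python
import numpy as np

def log_gabor_bank(shape, f0_list=(0.1, 0.25), sigma_r=0.55, sigma_theta=np.pi / 6, n_orient=6):
    """Bank of 2 scales x 6 orientations of log-Gabor transfer functions (Eqs. 2-4)."""
    rows, cols = shape
    fy = np.fft.fftfreq(rows)[:, None]           # normalized frequency grid
    fx = np.fft.fftfreq(cols)[None, :]
    r = np.sqrt(fx ** 2 + fy ** 2)
    r[0, 0] = 1.0                                 # avoid log(0); DC is zeroed below
    theta = np.arctan2(fy, fx)
    filters = []
    for f0 in f0_list:                            # N_r = 2 scales (placeholder values)
        radial = np.exp(-(np.log(r / f0) ** 2) / (2.0 * sigma_r ** 2))   # Eq. 3
        radial[0, 0] = 0.0                        # zero DC component
        for n in range(n_orient):                 # N_theta = 6 orientations
            theta0 = n * np.pi / n_orient
            d_theta = np.arctan2(np.sin(theta - theta0), np.cos(theta - theta0))
            angular = np.exp(-(d_theta ** 2) / (2.0 * sigma_theta ** 2))  # Eq. 4
            filters.append(radial * angular)      # Eq. 2: product of the two components
    return filters

def averaged_log_gabor_features(spec):
    """Filter a spectrogram with the 12-filter bank, take magnitudes and average (Eqs. 5-8)."""
    spec_fft = np.fft.fft2(spec)
    bank = log_gabor_bank(spec.shape)
    responses = [np.abs(np.fft.ifft2(spec_fft * g)) for g in bank]   # |S(x,y)|_{m,n}
    averaged = np.mean(responses, axis=0)                            # Eq. 8
    return averaged.ravel()                                          # row-by-row concatenation
```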
F. Optimal Feature Selection Using Mutual Information Criteria
The total set of N_F feature vectors was reduced to a small sub-set of N_S < N_F vectors selected using the mutual information (MI) feature selection algorithm [18]. MI represents a measure of the information shared by two random variables X and Y, and is given as:

I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}    (9)

where p(x) is the probability density function (pdf), defined as p(x) = Pr{X = x}, and p(x, y) is the joint pdf, defined as p(x, y) = Pr{X = x, Y = y}. Given an initial set F with N_F feature
vectors and a set C of all output classes (C={1,2,3} for
SUSAS data and C={1,2,3,4,5} for ORI data), the aim was to
find an optimal subset S with NS < NF feature vectors. Starting
from the empty set, the best available feature vectors were
added one by one to the selected feature set, until the size of
the set reached the desired value of NS. The sub-set S of
feature vectors was selected through simultaneous
maximization of the mutual information between the selected
feature vectors in S and the class labels C, and minimization of
the mutual information between the selected feature vectors
within S. As a result an optimal sub-set S of mutually
independent and highly representative feature vectors was
obtained. Given the full set size of NF=513 (using SUSAS
data) we tested the classification process using optimal sub-set
sizes of Ns=10, 20, 30, 40, 50, 60 and 70. The results showed
that Ns=10 gave the best compromise between the
classification accuracy and data reduction.
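The selection procedure can be sketched as follows. The paper follows [18]; the histogram-based MI estimate and the simple relevance-minus-redundancy score used below are assumptions made for illustration, and the class labels are assumed to be integer indices:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram estimate of I(X;Y) (Eq. 9) for a 1D feature x and a 1D variable y."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])))

def greedy_mi_selection(features, labels, n_select=10):
    """Greedy forward selection: high MI with the class, low MI with already chosen features."""
    n_features = features.shape[1]
    relevance = np.array([mutual_information(features[:, j], labels) for j in range(n_features)])
    selected, remaining = [], list(range(n_features))
    while len(selected) < n_select and remaining:
        scores = []
        for j in remaining:
            redundancy = np.mean([mutual_information(features[:, j], features[:, s])
                                  for s in selected]) if selected else 0.0
            scores.append(relevance[j] - redundancy)   # relevance-minus-redundancy (assumed)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```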
G. Data Modelling and Classification
The GMM method [1] is widely used in computational pattern recognition. Each class is represented by a Gaussian mixture referred to as a class model u_n, n = 1, 2, 3, ..., where n is the class index. The complete class model u_n is a weighted sum of M component densities:
p(u \mid \mu, \Sigma, p) = \sum_{i=1}^{M} p_i \, b_i(u)    (10)

where p_i \geq 0, i = 1, 2, ..., M are the mixture weights, and b_i, i = 1, 2, ..., M are the Gaussian densities with mean vectors \mu_i and covariance matrices \Sigma_i. Each set of Gaussian mixture coefficients (p_i, \mu_1, \Sigma_1, ..., \mu_M, \Sigma_M) is estimated using the expectation maximization (EM) algorithm applied to a training dataset. When classifying a speech utterance x_k^{test} from a test dataset, the probability
P(x_k^{test} \mid u_n) = \frac{p(x_k^{test} \mid u_n) \, P(u_n)}{P(x_k^{test})}    (11)
for each class is calculated and the test utterance is assigned to
the class which gives the maximum probability. The
classification score for both classifiers was calculated as an
average percentage of identification accuracy (APIA), defined
as follows:
APIA = \frac{1}{N_r} \frac{N_C}{N_T} \times 100\%    (12)

where N_C is the number of test inputs correctly identified, N_T is the total number of test inputs, and N_r is the number of algorithm executions (runs).
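A minimal sketch of this modeling and classification stage, using the scikit-learn GaussianMixture implementation of the EM algorithm, is given below; the number of mixture components M and the covariance type are placeholders, as they are not reported in the paper:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_models(train_features, train_labels, n_components=8):
    """Fit one Gaussian mixture (Eq. 10) per stress/emotion class via the EM algorithm."""
    models, priors = {}, {}
    for c in np.unique(train_labels):
        x = train_features[train_labels == c]
        models[c] = GaussianMixture(n_components=n_components, covariance_type="diag").fit(x)
        priors[c] = len(x) / len(train_features)
    return models, priors

def classify(models, priors, test_vector):
    """Assign the test input to the class with the maximum posterior (Eq. 11)."""
    log_posteriors = {c: gmm.score_samples(test_vector[None, :])[0] + np.log(priors[c])
                      for c, gmm in models.items()}
    return max(log_posteriors, key=log_posteriors.get)
```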

IV. EXPERIMENTS AND RESULTS

The classification results are presented in Table III. All three approaches showed very similar performance; however, the best performing features were the wavelet packet arrays combined with the log-Gabor filters, providing correct classification rates of 76.35%-84.85% for the SUSAS data and 45.5% for the ORI data. The second best were the ERB spectrograms, and the lowest performance was obtained from the averaged outputs of the 12 log-Gabor filters.
These results indicate that features extracted from a sub-band analysis are more effective than features representing the whole bandwidth. Due to differences in bandwidth definition (Table II), the ERB bands are narrower at low frequencies and wider at high frequencies when compared with the wavelet packet bands, which have a constant width of 250 Hz across the whole bandwidth (0-4 kHz). Therefore, the high performance of the wavelet packet sub-bands for the SUSAS data could indicate that the stress-related characteristics are contained in the high frequency bands. Interestingly, the ERB spectrograms outperformed the wavelet packets in emotion recognition on the ORI data, which could indicate equal importance of low and high frequency analysis in emotion recognition.
TABLE III. APIA% FOR AVERAGED LOG-GABOR FILTERS, ERB SPECTROGRAMS, AND WAVELET PACKETS COMBINED WITH LOG-GABOR FILTERS.

Dataset | Averaged 12 log-Gabor filters | ERB spectrograms | Wavelet packets and log-Gabor filters
SUSAS: vowels under actual stress (3 stress levels) | 77.58 | 81.82 | 84.85
SUSAS: vowels under actual stress (3 stress levels) | 79.03 | 79.09 | 81.27
SUSAS: mixed vowels under actual stress (3 stress levels) | 73.76 | 70.69 | 76.72
SUSAS: words under actual stress (3 stress levels) | 64.70 | 70.63 | 76.35
ORI: natural speech sentences (5 emotions) | 39.60 | 53.40 | 45.50

V. CONCLUSIONS

We have presented and tested a number of new approaches to feature selection based on the analysis of 2D time-frequency representations of speech. The wavelet packet method combined with the log-Gabor filters showed particularly promising results in the process of automatic stress and emotion classification in speech. Our results showed significantly lower classification rates for the ORI database when compared with the results obtained from the SUSAS sets. This can be attributed to the different environments in which these two databases were recorded. The SUSAS database was generated for the purpose of research on stress and emotion detection, and contains speech recordings made during a roller-coaster ride, when a very strong stress or emotion expression can be expected. The ORI data, on the other hand, is a clinical database containing emotions expressed spontaneously during typical family-based conversations, where the emotional expressions are not expected to be as strong as in the situations captured by the SUSAS data. In all approaches, the highest classification accuracy was achieved while using single vowels, which is not surprising since vowels are distinguished by characteristic time-frequency patterns. It is possible that the results for the ORI data could be improved if, instead of voiced speech detection, automatic detection of particular vowels were used and the features were then extracted from time-frequency arrays representing these vowels.

ACKNOWLEDGMENT

This research was supported by the ARC Grant LP0776235.

REFERENCES
[1] Quatieri T.F., Discrete-Time Speech Signal Processing, Prentice Hall.
[2] He L., Lech M., Maddage N., and Allen N., "Emotion Recognition in Speech of Parents of Depressed Adolescents," iCBBE 2009.
[3] He L., Lech M., Maddage N., Memon S., and Allen N., "Emotion Recognition in Spontaneous Speech within Work and Family Environments," iCBBE 2009.
[4] He L., Lech M., Memon S., and Allen N., "Recognition of Stress in Speech Using Wavelet Analysis and Teager Energy Operator," Interspeech 2008.
[5] Ezzat T., and Poggio T., "Discriminative Word-Spotting Using Ordered Spectro-Temporal Patch Features," Interspeech 2008.
[6] Bouvrie J., Ezzat T., and Poggio T., "Localized Spectro-Temporal Cepstral Analysis of Speech," ICASSP 2008.
[7] Ezzat T., Bouvrie J., and Poggio T., "Spectro-Temporal Analysis of Speech Using 2-D Gabor Filters," Interspeech 2007.
[8] Kleinschmidt M., "Methods for capturing spectro-temporal modulations in automatic speech recognition," Acta Acustica, 2001 (8).
[9] Kleinschmidt M., and Hohmann V., "Sub-band SNR estimation using auditory feature processing," Speech Communication, vol. 39, no. 1-2, pp. 47-63, 2003.
[10] Chih T., Ru P., and Shamma S., "Multiresolution spectrotemporal analysis of complex sounds," JASA, vol. 118, pp. 887-906, 2005.
[11] Peng H., Long F., et al., "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.
[12] He L., Lech M., Maddage N., and Allen N., "Stress Detection Using Speech Spectrograms and Sigma-pi Neuron Units," ICNC'09-FSKD'09.
[13] Hansen J.H.L., and Bou-Ghazale S., "Getting Started with SUSAS: A Speech Under Simulated and Actual Stress Database," EUROSPEECH 1997.
[14] Davis B., Sheeber L., Hops H., and Tildesley E., "Adolescent Responses to Depressive Parental Behaviors in Problem-Solving Interactions: Implications for Depressive Symptoms," Journal of Abnormal Child Psychology, vol. 28, no. 5, pp. 451-465, 2000.
[15] Longoria N., Sheeber L., and Davis B., Living in Family Environments (LIFE) Coding: A Reference Manual for Coders, Oregon Research Institute, 2006.
[16] Donoho D.L., "De-noising by soft thresholding," IEEE Trans. on Information Theory, vol. 41, no. 3, pp. 613-627, 1995.
[17] Lajevardi S.M., and Lech M., "Facial Expression Recognition Using a Bank of Neural Networks and Logarithmic Gabor Filters," DICTA 2008.
[18] Kwak N., and Choi C., "Input Feature Selection for Classification Problems," IEEE Trans. on Neural Networks, vol. 13, no. 1, pp. 143-159, 2002.
[19] Lynch et al., "Speech/Silence segmentation for real-time coding via rule based adaptive endpoint detection," ICASSP 1987.

