ABSTRACT
This paper describes a method for automatic singer identification from polyphonic musical audio signals including sounds of various instruments. Because singing voices
play an important role in musical pieces with a vocal part,
the identification of singer names is useful for music information retrieval systems. The main problem in automatically identifying singers is the negative influences caused
by accompaniment sounds. To solve this problem, we
developed two methods, accompaniment sound reduction
and reliable frame selection. The former method makes it
possible to identify the singer of a singing voice after reducing accompaniment sounds. It first extracts harmonic
components of the predominant melody from sound mixtures and then resynthesizes the melody by using a sinusoidal model driven by those components. The latter
method then judges whether each frame of the obtained melody is reliable (i.e., little influenced by accompaniment sounds) or not by using two Gaussian mixture models for vocal and non-vocal frames. It enables singer identification using only the reliable vocal portions of musical pieces.
Experimental results with forty popular-music songs by
ten singers showed that our method was able to reduce
the influences of accompaniment sounds and achieved an
accuracy of 95%, while the accuracy for a conventional
method was 53%.
Keywords: Singer identification, artist identification,
melody extraction, singing detection, similarity-based
MIR
INTRODUCTION
[Figure: overview of the system. The input audio passes through F0 estimation, accompaniment sound reduction (resynthesis), feature extraction, reliable frame selection (vocal and non-vocal GMMs; a frame is accepted if its likelihood exceeds a threshold, otherwise rejected), and singer determination; the output is the singer's name.]
We band-pass filter the observed frequency components at time t and normalize the result so that it can be treated as a probability density function (PDF):

p'^{(t)}(x) = \frac{\mathrm{BPF}(x)\, p^{(t)}(x)}{\int \mathrm{BPF}(x)\, p^{(t)}(x)\, dx} \qquad (2)
Then, we consider each observed PDF to have been generated from a weighted-mixture model of the tone models of all the possible F0s, which is represented as follows:

p(x \mid \theta^{(t)}) = \int_{F_l}^{F_h} w^{(t)}(F)\, p(x \mid F)\, dF, \qquad (3)

where p(x|F) is the PDF of the tone model for each F0, F_l and F_h are the lower and upper limits of the possible (allowable) F0 range, and w^{(t)}(F) is the weight of a tone model that satisfies

\int_{F_l}^{F_h} w^{(t)}(F)\, dF = 1. \qquad (5)

A tone model represents a typical harmonic structure and indicates where the harmonics of the F0 tend to occur. We then estimate w^{(t)}(F) using the EM algorithm and regard it as the PDF of the F0. Finally, we obtain the most predominant F0 \bar{F}^{(t)} by the following equation:

\bar{F}^{(t)} = \arg\max_{F} w^{(t)}(F). \qquad (6)

IMPLEMENTATION

In this section, we describe the implementation of our system. As described above, our system consists of the following four phases: accompaniment sound reduction, feature extraction, reliable frame selection, and classification.
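As a rough sketch of how these four phases fit together, the following Python outline strings the steps described below into one pipeline. Every function name in it (preprocess, estimate_f0, extract_harmonics, resynthesize, extract_features, select_reliable_frames, determine_singer) is a hypothetical placeholder for the corresponding step, not an interface of the original system, and the selection percentage of 15 is only an illustrative value.

```python
def identify_singer(samples, rate, singer_models, vocal_gmm, nonvocal_gmm):
    """Hypothetical end-to-end pipeline tying the four phases together."""
    x, spec = preprocess(samples, rate)                     # Section 3.1
    f0s = [estimate_f0(frame) for frame in spec.T]          # Section 3.2.1
    harmonics = [extract_harmonics(frame, f0)               # Section 3.2.2
                 for frame, f0 in zip(spec.T, f0s)]
    melody = resynthesize(harmonics)                        # Section 3.2.3
    feats = extract_features(melody)                        # Section 3.3
    reliable = select_reliable_frames(feats, vocal_gmm,     # Section 3.4
                                      nonvocal_gmm, percent=15.0)
    return determine_singer(reliable, singer_models)        # singer determination
```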
3.1 Pre-Processing

Given an audio signal, we first convert it to monaural and downsample it to 16 kHz. The spectrogram is then calculated using the short-time Fourier transform with a 2048-point (128.0 ms) Hamming window shifted by 10.0 ms (160 points).
3.2 Accompaniment Sound Reduction

Using the method described in Section 2.1, we reduce accompaniment sounds as follows:

3.2.1 F0 Estimation

The F0 of the melody in each frame, \bar{F}^{(t)}, is estimated with the method of equations (2)-(6).

3.2.2 Harmonic Structure Extraction

For each harmonic component l, we extract its frequency F_l and amplitude A_l from the spectrum S(F):

F_l = \arg\max_{F} |S(F)|, \qquad A_l = |S(F_l)|,

where the search for the l-th harmonic is restricted to a neighborhood of l\,\bar{F}^{(t)} whose width is determined by r.
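The following sketch illustrates this extraction for one spectrum frame, assuming the STFT settings of Section 3.1; the number of harmonics (20) and the allowed deviation r (given here in Hz) are illustrative assumptions rather than the original parameter values.

```python
import numpy as np

def extract_harmonics(spec_frame, f0, n_harmonics=20, r=20.0,
                      fs=16000, nfft=2048):
    """Pick F_l = argmax |S(F)| near each harmonic l*f0, and A_l = |S(F_l)|."""
    hz_per_bin = fs / nfft
    freqs, amps = [], []
    for l in range(1, n_harmonics + 1):
        center = l * f0
        lo = max(int(np.floor((center - r) / hz_per_bin)), 0)
        hi = min(int(np.ceil((center + r) / hz_per_bin)) + 1, len(spec_frame))
        if lo >= hi:                       # harmonic lies above the Nyquist limit
            break
        band = np.abs(spec_frame[lo:hi])   # |S(F)| restricted to the neighborhood
        k = lo + int(np.argmax(band))
        freqs.append(k * hz_per_bin)       # F_l
        amps.append(float(np.abs(spec_frame[k])))   # A_l
    return np.array(freqs), np.array(amps)
```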
[Figure: two example spectra plotted as linear power versus frequency (0-8000 Hz).]
3.2.3 Resynthesis

We resynthesize the melody from the extracted harmonic components by using a sinusoidal model:

s(t) = \sum_{l=1}^{L} A_l \cos(\omega_l t + \phi_l), \qquad (10)

where \omega_l is the angular frequency corresponding to F_l and \phi_l is its phase.
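A simplified sketch of such a sinusoidal resynthesis: each frame's extracted frequencies and amplitudes drive a bank of cosines whose phases are kept continuous across frame boundaries. The hop size and sampling rate are carried over from Section 3.1 as assumptions; a real implementation would also interpolate amplitudes and frequencies between frames.

```python
import numpy as np

def resynthesize(harmonics, fs=16000, hop=160):
    """harmonics: list of (freqs, amps) arrays, one pair per analysis frame."""
    n_frames = len(harmonics)
    out = np.zeros(n_frames * hop)
    max_harm = max((len(f) for f, _ in harmonics), default=0)
    phase = np.zeros(max_harm)            # running phase of each harmonic
    t = np.arange(hop) / fs
    for i, (freqs, amps) in enumerate(harmonics):
        frame = np.zeros(hop)
        for l, (f, a) in enumerate(zip(freqs, amps)):
            # A_l * cos(2*pi*F_l*t + phi_l) over one hop
            frame += a * np.cos(2.0 * np.pi * f * t + phase[l])
            phase[l] += 2.0 * np.pi * f * hop / fs    # keep the phase continuous
        out[i * hop:(i + 1) * hop] = frame
    return out
```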
3.3 Feature Extraction

We examine four kinds of features: MFCC, LPC, LPCC, and LPMCC. The LPC-based features model the signal as a linear combination of its past samples:

s(n) = \sum_{i=1}^{p} \alpha_i\, s(n-i) + g(n), \qquad (11)

where p represents the order of the predictor, the \alpha_i are the linear prediction coefficients (LPC), and g(n) represents the error in the model. The LPC are determined by minimizing the mean squared prediction error of g(n). We use 20th-order LPC in this paper.

3.3.3

The LPC are converted to cepstral coefficients by the recursion

c(n) =
\begin{cases}
\log \sigma^2 & (n = 0) \\
\alpha_n + \frac{1}{n} \sum_{k=1}^{n-1} k\, c(k)\, \alpha_{n-k} & (1 \le n \le p) \\
\frac{1}{n} \sum_{k=1}^{n-1} k\, c(k)\, \alpha_{n-k} & (n > p)
\end{cases} \qquad (12)

where \sigma^2 is the prediction error power and \alpha_{n-k} = 0 for n-k > p.
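To make the feature computation concrete, here is a sketch of a 20th-order LPC analysis (autocorrelation/Levinson-Durbin method) followed by the cepstral recursion of equation (12). It computes plain LPC cepstra only, so any mel-scale processing used for the LPMCC is not included, and the sign convention of the coefficients (the alpha_i that predict s(n)) is an assumption to be checked against the recursion's convention.

```python
import numpy as np

def lpc(frame, order=20):
    """Linear prediction coefficients alpha_1..alpha_p (Levinson-Durbin on the
    autocorrelation), so that s(n) is predicted as sum_i alpha_i * s(n - i)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)          # error-filter coefficients, a[0] = 1
    a[0] = 1.0
    err = r[0] + 1e-12               # prediction error power
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= 1.0 - k * k
    return -a[1:], err               # alpha_i = -a_i, plus the error power

def lpc_cepstrum(alpha, err, n_ceps=20):
    """LPC-derived cepstral coefficients c(1)..c(n_ceps), as in equation (12)."""
    p = len(alpha)
    c = np.zeros(n_ceps + 1)
    c[0] = np.log(max(err, 1e-12))   # c(0) = log of the prediction error power
    for n in range(1, n_ceps + 1):
        acc = sum((k / n) * c[k] * alpha[n - k - 1]
                  for k in range(max(1, n - p), n))
        c[n] = (alpha[n - 1] if n <= p else 0.0) + acc
    return c[1:]
```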
Gender   Piece Number
M        027
M        037
M        032, 078
M        048, 049, 051
M        015, 041
M        056, 057
M        047
M        006
F        013
F        020
F        026
F        055
F        060
F        063
F        070
F        081, 089, 091, 093, 097
3.4 Reliable Frame Selection

A frame is judged to be reliable if the difference between its log likelihoods under the vocal GMM and the non-vocal GMM exceeds a threshold, and not-reliable otherwise. In our experiments, we use 64-mixture GMMs. It is difficult to decide a universal threshold for a variety of songs because we cannot select enough feature vectors for classification from a song that has few reliable frames. We therefore determine the threshold for each song so that a fixed percentage of the frames in the song are selected as reliable frames. Note that most of the non-vocal frames are rejected in this selection step. This means that a separate step for detecting vocal sections is unnecessary when this reliable frame selection is used.
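A sketch of this selection rule, assuming per-frame feature vectors stacked in a NumPy array and two trained GMMs exposing a scikit-learn-style score_samples method (per-frame log likelihood); the per-song threshold is taken as the percentile of the log-likelihood differences that keeps the desired percentage of frames.

```python
import numpy as np

def select_reliable_frames(features, vocal_gmm, nonvocal_gmm, percent=15.0):
    """Keep the `percent`% of frames with the largest vocal-vs-non-vocal
    log-likelihood difference (the per-song threshold described above)."""
    ll_vocal = vocal_gmm.score_samples(features)        # shape (n_frames,)
    ll_nonvocal = nonvocal_gmm.score_samples(features)
    diff = ll_vocal - ll_nonvocal
    threshold = np.percentile(diff, 100.0 - percent)    # per-song threshold
    return features[diff > threshold]
```

With scikit-learn, for example, the two models could be trained as GaussianMixture(n_components=64).fit(...) on vocal and non-vocal training frames.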
Name               Gender   D1    D2    D3    D4
Kazuo Nishi        M        012   029   036   043
Hisayoshi Kazato   M        004   011   019   024
Kousuke Morimoto   M        038   039   042   044
Shinya Iguchi      M        082   084   088   090
Jeff Manning       M        085   087   095   098
Hiromi Yoshii      F        002   017   069   075
Tomomi Ogata       F        007   028   052   080
Rin                F        014   021   050   053
Makiko Hattori     F        065   067   068   077
Betty              F        086   092   094   096
Singer Determination
The singer of a song is determined as the one whose model \lambda_i maximizes the average log likelihood of the selected reliable frames x_1, \dots, x_T:

\frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid \lambda_i). \qquad (13)
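A sketch of this decision rule under the same assumptions as above (singer models with a score_samples method returning per-frame log likelihoods); the dictionary layout of the models is illustrative.

```python
import numpy as np

def determine_singer(reliable_features, singer_models):
    """singer_models: dict of name -> trained GMM (lambda_i); returns the name
    maximizing (1/T) * sum_t log p(x_t | lambda_i), as in equation (13)."""
    best_name, best_score = None, -np.inf
    for name, model in singer_models.items():
        avg_ll = float(np.mean(model.score_samples(reliable_features)))
        if avg_ll > best_score:
            best_name, best_score = name, avg_ll
    return best_name
```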
EXPERIMENTS
In this section, we describe the experiments that were conducted to evaluate our system.
4.1
To confirm the effectiveness of our methods, accompaniment sound reduction and reliable frame selection, we
conducted experiments on singer identification under the
following four conditions:
Singer   (i) baseline   (ii) reduc. only   (iii) selec. only   (iv) ours
a        1/4            2/4                2/4                 4/4
b        3/4            1/4                3/4                 4/4
c        2/4            2/4                3/4                 4/4
d        4/4            4/4                4/4                 4/4
e        1/4            0/4                0/4                 4/4
f        1/4            2/4                2/4                 3/4
g        0/4            2/4                0/4                 3/4
h        4/4            4/4                4/4                 4/4
i        4/4            4/4                3/4                 4/4
j        1/4            3/4                2/4                 4/4
Total    53%            60%                58%                 95%
[Four confusion matrices, one per condition from (i) baseline to (iv) ours; each plots the singer's label (a-j) against the classification result.]
Figure 3: Confusion matrices. Center lines in each figure are boundaries between male and female. Note that confusion
between male and female decreased by using the accompaniment sound reduction method.
[Figure 4: accuracy (%) versus the selection percentage (5-30%), with curves labeled "reduction only" and "reduction and selection".]
4.2 Dependency of Accuracy on the Selection Percentage

We conducted experiments with the selection percentage set to various values to investigate how the accuracy depends on it. The experimental results shown in Figure 4 indicate that the classification accuracy was not affected by small changes of the percentage. It is also noticeable that the value giving the highest accuracy differed between the two conditions. The reason is as follows: the accompaniment sound reduction method reduced the influences of accompaniment sounds and emphasized the differences between reliable and unreliable frames. Thus, if we raised the percentage too much, the system selected many unreliable frames and its performance decreased. Without the reduction method, on the other hand, the reliability of individual frames did not differ much because of the influences of accompaniment sounds. It was therefore profitable to use many frames for classification by setting the percentage comparatively higher. Even so, sufficient classification accuracy could not be attained because of the influences of accompaniment sounds.
4.3
4.3.1
Table 4: Accuracies for each feature.

Feature   (i) vocal-only, w/o reduc.   (ii) vocal-only, with reduc.   (iii) mixed-down, with reduc., corr. F0s   (iv) mixed-down, with reduc., est. F0s   (v) mixed-down, w/o reduc.
MFCC      98%                          95%                            78%                                        75%                                      75%
LPC       83%                          88%                            50%                                        58%                                      48%
LPCC      95%                          98%                            75%                                        75%                                      63%
LPMCC     98%                          98%                            88%                                        83%                                      68%
The first purpose of these experiments is to compare the accuracies for mixed-down data with those for vocal-only data. In addition, we compared estimating the F0s of the melodies with using given correct ones. In our experiments, we virtually generated the correct F0s by estimating them from the vocal-only tracks. Although the estimates obtained from the vocal-only tracks were not completely correct, their accuracy was sufficiently high compared with estimates obtained from the mixed-down versions. The second purpose is to compare a variety of features. We used the four kinds of features described in Section 3.3 and compared their results. In these experiments, we manually cut out 60 seconds of singing regions from each song, because we omitted the reliable frame selection method in order to investigate the effectiveness of the accompaniment sound reduction method alone.
4.3.2
Table 4 shows the results of the experiments. Focusing on the differences between cases (iv) and (v), the accuracies for the LPC, the LPCC, and the LPMCC were improved by the reduction method, whereas that for the MFCC was not. This is because the LPC-based features assume that the given signal contains only a single voice. Under this assumption, the accuracies for these features in case (v) were low because the inputs were mixtures of singing voices and accompaniment sounds. Because case (iv) dealt with signals obtained by extracting only the singing voices, its accuracies were higher than those of case (v). Because the MFCC is not based on such an assumption, its accuracy was not improved by the reduction method.
Singer   (iii) with reduc., corr. F0s, LPMCC   (iv) with reduc., est. F0s, LPMCC   (v) w/o reduc., MFCC
a        3/4                                   3/4                                 3/4
b        4/4                                   4/4                                 4/4
c        4/4                                   4/4                                 3/4
d        4/4                                   4/4                                 4/4
e        3/4                                   2/4                                 3/4
f        3/4                                   2/4                                 2/4
g        2/4                                   2/4                                 0/4
h        4/4                                   4/4                                 4/4
i        4/4                                   4/4                                 4/4
j        4/4                                   4/4                                 3/4
Total    88%                                   83%                                 75%

CONCLUSION
ACKNOWLEDGEMENTS
This research was partially supported by the Ministry
of Education, Culture, Sports, Science and Technology (MEXT), Grant-in-Aid for Scientific Research (A),
No.15200015, and Informatics Research Center for Development of Knowledge Society Infrastructure (COE
program of MEXT, Japan). We thank everyone who has
contributed to building and distributing the RWC Music Database [15]. We also thank Kazuyoshi Yoshii and
Takuya Yoshioka (Kyoto University) for their valuable
discussions.
REFERENCES
[1] Wei-Ho Tsai, Hsin-Min Wang, and Dwight Rodgers. Automatic singer identification of popular music recordings via estimation and modeling of solo vocal signal. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech 2003), 2003.

[2] Wei-Ho Tsai and Hsin-Min Wang. Automatic detection and tracking of target singer in multi-singer music recordings. In Proceedings of the 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004), pages 221-224, 2004.

Masataka Goto. A real-time music-scene-description system: Predominant-F0 estimation for detecting melody and bass lines in real-world audio signals. Speech Communication, 43(4):311-329, 2004.

[9] James Anderson Moorer. Signal processing aspects of computer music: A survey. Proceedings of the IEEE, 65(8):1108-1137, 1977.

[10] Joseph Picone. Signal modeling techniques in speech recognition. Proceedings of the IEEE, 81(9):1215-1247, 1993.

[11] Steven B. Davis and Paul Mermelstein. Comparison of parametric representations for monosyllabic word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4):357-366, 1980.

[12] Beth Logan. Mel frequency cepstral coefficients for music modelling. In Proceedings of the International Symposium on Music Information Retrieval (ISMIR 2000), pages 23-25, 2000.

[13] Bishnu S. Atal. Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. Journal of the Acoustical Society of America, 55(6):1304-1312, 1974.

[16] Tomohiro Nakatani and Hiroshi G. Okuno. Harmonic sound stream segregation using localization and its application to speech stream segregation. Speech Communication, 27:209-222, 1999.