Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
6684
Exploring
Music Contents
7th International Symposium, CMMR 2010
Málaga, Spain, June 21-24, 2010
Revised Papers
Volume Editors
Sølvi Ystad
CNRS-LMA, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France
E-mail: ystad@lma.cnrs-mrs.fr
Mitsuko Aramaki
CNRS-INCM, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France
E-mail: aramaki@incm.cnrs-mrs.fr
Richard Kronland-Martinet
CNRS-LMA, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France
E-mail: kronland@lma.cnrs-mrs.fr
Kristoffer Jensen
Aalborg University Esbjerg, Niels Bohr Vej 8, 6700 Esbjerg, Denmark
E-mail: krist@create.aau.dk
ISSN 0302-9743
e-ISSN 1611-3349
e-ISBN 978-3-642-23126-1
ISBN 978-3-642-23125-4
DOI 10.1007/978-3-642-23126-1
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2011936382
CR Subject Classification (1998): J.5, H.5, C.3, H.5.5, G.3, I.5
LNCS Sublibrary: SL 3 - Information Systems and Applications, incl. Internet/Web and HCI
Preface
Computer Music Modeling and Retrieval (CMMR) 2010 was the seventh event
of this international conference series that was initiated in 2003. Since its start,
the conference has been co-organized by the University of Aalborg, Esbjerg, Denmark (http://www.aaue.dk) and the Laboratoire de Mécanique et d'Acoustique
in Marseille, France (http://www.lma.cnrs-mrs.fr), and has taken place in France,
Italy and Denmark. The six previous editions of CMMR offered a varied overview
of recent music information retrieval (MIR) and sound modeling activities in addition to alternative fields related to human interaction, perception and cognition.
This year's CMMR took place in Málaga, Spain, June 21-24, 2010. The
conference was organized by the Application of Information and Communications Technologies Group (ATIC) of the University of Málaga (Spain), together
with LMA and INCM (CNRS, France) and AAUE (Denmark). The conference
featured three prominent keynote speakers working in the MIR area, and the
program of CMMR 2010 included in addition paper sessions, panel discussions,
posters and demos.
The proceedings of the previous CMMR conferences were published in the
Lecture Notes in Computer Science series (LNCS 2771, LNCS 3310, LNCS 3902,
LNCS 4969, LNCS 5493 and LNCS 5954), and the present edition follows the
lineage of the previous ones, including a collection of 22 papers within the topics
of CMMR. These articles were specially reviewed and corrected for this proceedings volume.
The current book is divided into five main chapters that reflect the present
challenges within the field of computer music modeling and retrieval. The chapters span topics from music interaction, composition tools and sound source
separation to data mining and music libraries. One chapter is also dedicated
to perceptual and cognitive aspects that are currently the subject of increased
interest in the MIR community. We are confident that CMMR 2010 brought
forward the research in these important areas.
We would like to thank Isabel Barbancho and her team at the Application of
Information and Communications Technologies Group (ATIC) of the University
of Málaga (Spain) for hosting the 7th CMMR conference and for ensuring a
successful organization of both scientific and social matters. We would also like
to thank the Program Committee members for their valuable paper reports and
thank all the participants who made CMMR 2010 a fruitful and convivial event.
Finally, we would like to thank Springer for accepting to publish the CMMR
2010 proceedings in their LNCS series.
March 2011
Sølvi Ystad
Mitsuko Aramaki
Richard Kronland-Martinet
Kristoffer Jensen
Organization
University of Málaga, Spain
Symposium Co-chairs
Kristoffer Jensen    AAUE, Denmark
Sølvi Ystad          CNRS-LMA, France
Program Committee
Paper and Program Chairs
Mitsuko Aramaki              CNRS-INCM, France
Richard Kronland-Martinet    CNRS-LMA, France
Brian Gygi
Goffredo Haus
Kristoffer Jensen
Anssi Klapuri
Richard Kronland-Martinet
Marc Leman
Sylvain Marchand
Gregory Pallone
Andreas Rauber
David Sharp
Bob L. Sturm
Lorenzo J. Tardón
Vesa Välimäki
Sølvi Ystad
Introduction
Two examples are the bag-of-frames approach to music similarity [5], and the periodicity pattern approach to rhythm analysis [13], which are both independent
of the order of musical notes, whereas temporal order is an essential feature of
melody, rhythm and harmonic progression. Perhaps surprisingly, much progress
has been made in music informatics in recent years, despite the naïveté of the
musical models used and the claims that some tasks have reached a glass ceiling [6].
The continuing progress can be explained in terms of a combination of factors:
the high level of redundancy in music, the simplicity of many of the tasks which
are attempted, and the limited scope of the algorithms which are developed. In
this regard we agree with [14], who review the first 10 years of ISMIR conferences and list some challenges which the community has not fully engaged with
before. One of these challenges is to "dig deeper" into the music itself, which
would enable researchers to address more musically complex tasks; another is to
"expand ... musical horizons", that is, broaden the scope of MIR systems.
In this paper we present two approaches to modelling musical harmony, aiming
at capturing the type of musical knowledge and reasoning a musician might use
in performing similar tasks. The first task we address is that of chord transcription from audio recordings. We present a system which uses a high-level model
of musical context in which chord, key, metrical position, bass note, chroma
features and repetition structure are integrated in a Bayesian framework, and
generates the content of a lead-sheet containing the sequence of chord symbols, including their bass notes and metrical positions, and the key signature and
any modulations over time. This system achieves state-of-the-art performance,
being rated first in its category in the 2009 and 2010 MIREX evaluations. The
second task to which we direct our attention is the machine learning of logical
descriptions of harmonic sequences in order to characterise particular styles or
genres. For this work we use inductive logic programming to obtain representations such as decision trees which can be used to classify unseen examples or
provide insight into the characteristics of a data corpus.
Computational models of harmony are important for many application areas
of music informatics, as well as for music psychology and musicology itself. For
example, a harmony model is a necessary component of intelligent music notation software, for determining the correct key signature and pitch spelling of
accidentals where music is obtained from digital keyboards or MIDI files. Likewise, processes such as automatic transcription benefit from tracking the
harmonic context at each point in the music [24]. It has been shown that harmonic modelling improves search and retrieval in music databases, for example
in order to find variations of an example query [36], which is useful for musicological research. Theories of music cognition, if expressed unambiguously, can
be implemented and tested on large data corpora and compared with human
annotations, in order to verify or refine concepts in the theory.
The remainder of the paper is structured as follows. The next section provides
an overview of research in harmony modelling. This is followed by a section
describing our probabilistic model of chord transcription. In section 4, we present
our logic-based approach to modelling of harmony, and show how this can be
used to characterise and classify music. The final section is a brief conclusion
and outline of future work.
Background
used in the analysis. The proposed graph search algorithm is shown to be much
more efficient than standard algorithms without differing greatly in the quality
of analyses it produces.
As an alternative to the rule-based approaches, which suffer from the cumulative effects of errors, [38] proposed a probabilistic approach to functional
harmonic analysis, using a hidden Markov model. For each time unit (measure
or half-measure), their system outputs the current key and the scale degree of
the current chord. In order to make the computation tractable, a number of
simplifying assumptions were made, such as the symmetry of all musical keys.
Although this reduced the number of parameters by at least two orders of magnitude, the training algorithm was only successful on a subset of the parameters,
and the remaining parameters were set by hand.
An alternative stream of research has been concerned with multidimensional
representations of polyphonic music [10,11,42] based on the Viewpoints approach
of [12]. This representation scheme is for example able to preserve information
about voice leading which is otherwise lost by approaches that treat harmony as
a sequence of chord symbols.
Although most research has focussed on analysing musical works, some work
investigates the properties of entire corpora. [25] compared two corpora of chord
sequences, belonging to jazz standards and popular (Beatles) songs respectively,
and found key- and context-independent patterns of chords which occurred frequently in each corpus. [26] examined the statistics of the chord sequences of several thousand songs, and compared the results to those from a standard natural
language corpus in an attempt to find lexical units in harmony that correspond
to words in language. [34,35] investigated whether stochastic language models including naive Bayes classifiers and 2-, 3- and 4-grams could be used for automatic
genre classification. The models were tested on both symbolic and audio data,
where an off-the-shelf chord transcription algorithm was used to convert the audio
data to a symbolic representation. [39] analysed the Beatles corpus using probabilistic N-grams in order to show that the dependency of a chord on its context
extends beyond the immediately preceding chord (the first-order Markov assumption). [9] studied differences in the use of harmony across various periods of classical music history, using root progressions (i.e. the sequence of root notes of chords
in a progression) reduced to 2 categories (dominant and subdominant) to give a
representation called harmonic vectors. The use of root progressions is one of the
representations we use in our own work in section 4 [2].
All of the above systems process symbolic input, such as that found in a score,
although most of the systems do not require the level of detail provided by the
score (e.g. key signature, pitch spelling), which they are able to reconstruct from
the pitch and timing data. In recent years, the focus of research has shifted to the
analysis of audio files, starting with the work of [16], who computed a chroma
representation (salience of frequencies representing the 12 Western pitch classes,
independent of octave) which was matched to a set of chord templates using the
inner product. Alternatively, [7] modelled chords with a 12-dimensional Gaussian
distribution, where chord notes had a mean of 1, non-chord notes had a mean of 0,
and the covariance matrix had high values between pairs of chord notes. A hidden
Markov model was used to infer the most likely sequence of chords, where state
transition probabilities were initialised based on the distance between chords on
a special circle of fifths which included minor chords near to their relative major
chord. Further work on audio-based harmony analysis is reviewed thoroughly in
three recent doctoral theses, to which the interested reader is referred [22,18,32].
Music theory, perceptual studies, and musicians themselves generally agree that
no musical quality can be treated individually. When a musician transcribes
the chords of a piece of music, the chord labels are not assigned solely on the
basis of local pitch content of the signal. Musical context such as the key, metrical position and even the large-scale structure of the music play an important
role in the interpretation of harmony. [17, Chapter 4] conducted a survey among
human music transcription experts, and found that they use several musical context elements to guide the transcription process: not only is a prior rough chord
detection the basis for accurate note transcription, but the chord transcription
itself depends on the tonal context and other parameters such as beats, instrumentation and structure.
The goal of our recent work on chord transcription [24,22,23] is to propose
computational models that integrate musical context into the automatic chord
estimation process. We employ a dynamic Bayesian network (DBN) to combine
models of metrical position, key, chord, bass note and beat-synchronous bass and
treble chroma into a single high-level musical context model. The most probable
sequence of metrical positions, keys, chords and bass notes is estimated via
Viterbi inference.
A DBN is a graphical model representing a succession of simple Bayesian
networks in time. These are assumed to be Markovian and time-invariant, so
the model can be expressed recursively in two time slices: the initial slice and
the recursive slice. Our DBN is shown in Figure 1. Each node in the network
represents a random variable, which might be an observed node (in our case
the bass and treble chroma) or a hidden node (the key, metrical position, chord
and bass pitch class nodes). Edges in the graph denote dependencies between
variables. In our DBN the musically interesting behaviour is modelled in the
recursive slice, which represents the progress of all variables from one beat to
the next. In the following paragraphs we explain the function of each node.
Chord. Technically, the dependencies of the random variables are described in the
conditional probability distribution of the dependent variable. Since the highest
number of dependencies join at the chord variable, it takes a central position
in the network. Its conditional probability distribution is also the most complex: it depends not only on the key and the metrical position, but also on the
chord variable in the previous slice. The chord variable has 121 different chord
states (see below), and its dependency on the previous chord variable enables
Fig. 1. Our network model topology, represented as a DBN with two slices and six layers (metrical position, key, chord, bass, bass chroma and treble chroma). The clear nodes represent random variables, while the observed ones are shaded grey. The directed edges represent the dependency structure. Intra-slice dependency edges are drawn solid, inter-slice dependency edges are dashed.
the reinforcement of smooth sequences of these states. The probability distribution of chords conditional on the previous chord strongly favours the chord that
was active in the previous slice, similar to a high self-transition probability in
a hidden Markov model. While leading to a chord transcription that is stable
over time, dependence on the previous chord alone is not sufficient to model adherence to the key. Instead, it is modelled conditionally on the key variable: the
probability distribution depends on the chord's fit with the current key, based on
an expert function motivated by Krumhansl's chord-key ratings [19, page 171].
Finally, the chord variable's dependency on the metrical position node allows us
to favour chord changes at strong metrical positions to achieve a transcription
that resembles more closely that of a human transcriber.
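The interplay of these three dependencies can be illustrated with a small Python sketch; the chord state space, the key-fit table and the weights below are hypothetical stand-ins, not the actual distributions of the published model:

import numpy as np

# Sketch of the chord conditional distribution P(C_i | C_{i-1}, K_i, M_i):
# a strong self-transition bias, a key-fit term (in the paper an expert
# function motivated by Krumhansl's chord-key ratings) and a boost for
# chord changes on strong metrical positions. All weights are hypothetical.
N_CHORDS = 121           # number of chord states in the model
SELF_TRANSITION = 20.0   # unnormalised weight for staying on the same chord

def chord_distribution(prev_chord, key, metrical_pos, key_fit, change_boost):
    """Return P(chord | previous chord, key, metrical position)."""
    w = key_fit[key].astype(float).copy()   # fit of every chord with the key
    w *= change_boost[metrical_pos]         # changes preferred on strong beats
    w[prev_chord] = SELF_TRANSITION         # favour the previously active chord
    return w / w.sum()                      # normalise to a distribution

# Usage with random (hypothetical) tables: 24 keys, four metrical positions.
rng = np.random.default_rng(0)
key_fit = rng.uniform(0.1, 1.0, size=(24, N_CHORDS))
change_boost = np.array([1.5, 1.0, 1.2, 1.0])   # beat 1 strongest, beat 3 next
p = chord_distribution(prev_chord=5, key=0, metrical_pos=0,
                       key_fit=key_fit, change_boost=change_boost)
assert abs(p.sum() - 1.0) < 1e-9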
[Fig. 2 appears here: (a) metrical position transitions; (b) Gaussian density as a function of note salience.]
Key and metrical position. The dependency structures of the key and metrical
position variables are comparatively simple, since they depend only on their
respective predecessors. The emphasis on smooth, stable key sequences is handled
in the same way as it is for chords, but the 24 states representing major and minor
keys have an even higher self-transition probability, and hence they will persist for
longer stretches of time. The metrical position model represents a 4/4 meter and
hence has four states. The conditional probability distribution strongly favours
normal beat transitions, i.e. from one beat to the next, but it also allows for
irregular transitions in order to accommodate temporary deviations from 4/4 meter and occasional beat tracking errors. In Figure 2a black arrows represent a
transition probability of 1 − ε (where ε = 0.05) to the following beat. Grey arrows
represent a probability of ε/2 for jumps to different beats through self-transition
or omission of the expected beat.
Bass. The random variable that models the bass has 13 states, one for each of
the pitch classes, and one 'no bass' state. It depends on both the current chord
and the previous chord. The current chord is the basis of the most probable bass
notes that can be chosen. The highest probability is assigned to the nominal
chord bass pitch class (the chord symbol itself always implies a bass note, but
the bass line might include other notes not specified by the chord symbol, as in
the case of a walking bass), lower probabilities to the remaining chord pitch classes,
and the rest of the probability mass is distributed between the remaining pitch
classes. The additional use of the dependency on the previous chord allows us
to model the behaviour of the bass note on the first beat of the chord differently
from its behaviour on later beats. We can thus model the tendency for the played
bass note to coincide with the nominal bass note of the chord (e.g. the note B
in the B7 chord), while there is more variation in the bass notes played during
the rest of the duration of the chord.
Chroma. The chroma nodes provide models of the bass and treble chroma audio features. Unlike the discrete nodes previously discussed, they are continuous,
because the 12 elements of the chroma vector represent relative salience, which
can assume any value between zero and unity. We represent both bass and treble
chroma as multidimensional Gaussian random variables. The bass chroma variable has 13 different Gaussians, one for every bass state, and the treble chroma
node has 121 Gaussians, one for every chord state. The means of the Gaussians
are set to reflect the nature of the chords: to unity for pitch classes that are
part of the chord, and to zero for the rest. A single variate in the 12-dimensional
Gaussian treble chroma distribution models one pitch class, as illustrated in Figure 2b. Since the chroma values are normalised to the unit interval, the Gaussian
model functions similarly to a regression model: for a given chord the Gaussian
density increases with increasing salience of the chord notes (solid line), and
decreases with increasing salience of non-chord notes (dashed line). For more
details see [22].
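A minimal sketch of such a chord-conditional chroma likelihood is shown below; the diagonal covariance and the variance value are simplifications of ours for illustration, not the parameters of the published model:

import numpy as np
from scipy.stats import multivariate_normal

def chord_chroma_likelihood(chroma, chord_pitch_classes, sigma=0.2):
    """Likelihood of a 12-bin chroma vector under one chord's Gaussian model.

    Mean 1 for chord pitch classes, mean 0 for the others; the diagonal
    covariance is an assumption made here for brevity.
    """
    mean = np.zeros(12)
    mean[list(chord_pitch_classes)] = 1.0
    cov = (sigma ** 2) * np.eye(12)
    return multivariate_normal(mean, cov).pdf(chroma)

# Example: an observed chroma with salient C, E and G evaluated under C major.
chroma = np.zeros(12)
chroma[[0, 4, 7]] = 0.9                     # pitch classes C, E, G
print(chord_chroma_likelihood(chroma, {0, 4, 7}))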
One important aspect of the model is the wide variety of chords it uses.
It models ten different chord types (maj, min, maj/3, maj/5, maj6, 7, maj7,
min7, dim, aug) and the 'no chord' class N. The chord labels with slashes
denote chords whose bass note differs from the chord root; for example, D/3
represents a D major chord in first inversion (sometimes written D/F♯). The
recognition of these chords is a novel feature of our chord recognition algorithm.
Figure 3 shows a score rendered using exclusively the information in our model.
In the last four bars, marked with a box, the second chord is correctly annotated
as D/F♯. The position of the bar lines is obtained from the metrical position
variable, the key signature from the key variable, and the bass notes from the
bass variable. The chord labels are obtained from the chord variable, replicated
as notes in the treble staff for better visualisation. The crotchet rest on the first
beat of the piece indicates that here, the Viterbi algorithm inferred that the
'no chord' model fits best.
Using a standard test set of 210 songs used in the MIREX chord detection
task, our basic model achieved an accuracy of 73%, with each component of the
model contributing significantly to the result. This improves on the best result
at MIREX 2009 for pre-trained systems.
Fig. 3. Excerpt of the automatic output of our algorithm (top) and the song book version (bottom) of the pop song "Friends Will Be Friends" (Deacon/Mercury). The song book excerpt corresponds to the four bars marked with a box.
Fig. 4. Segmentation and its effect on chord transcription for the Beatles song "It Won't Be Long" (Lennon/McCartney). The top two rows show the human and automatic segmentation respectively. Although the structure is different, the main repetitions are correctly identified. The bottom two rows show (in black) where the chord was transcribed correctly by our algorithm using (respectively not using) the segmentation information.
Further improvements have been made via two extensions of this model: taking advantage of repeated structural segments
(e.g. verses or choruses), and refining the front-end audio processing.
Most musical pieces have segments which occur more than once in the piece,
and there are two reasons for wishing to identify these repetitions. First, multiple
sets of data provide us with extra information which can be shared between the
repeated segments to improve detection performance. Second, in the interest of
consistency, we can ensure that the repeated sections are labelled with the same
set of chord symbols. We developed an algorithm that automatically extracts the
repetition structure from a beat-synchronous chroma representation [27], which
ranked first in the 2009 MIREX Structural Segmentation task.
After building a similarity matrix based on the correlation between beat-synchronous chroma vectors, the method finds sets of repetitions whose elements have the same length in beats. A repetition set composed of n elements
with length d receives a score of (n − 1)d, reflecting how much space a hypothetical music editor could save by typesetting a repeated segment only once. The
repetition set with the maximum score (part A in Figure 4) is added to the
final list of structural elements, and the process is repeated on the remainder of
the song until no valid repetition sets are left.
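The scoring and greedy selection described above can be sketched as follows; the candidate-finding step on the beat-synchronous similarity matrix is omitted and the data structures are our own:

def score(rep_set):
    """(n - 1) * d for a repetition set with n occurrences of length d beats."""
    return (len(rep_set["starts"]) - 1) * rep_set["length"]

def select_structure(candidate_sets):
    """Greedily pick the best-scoring repetition set, then repeat on the rest."""
    structure, remaining = [], list(candidate_sets)
    while remaining:
        best = max(remaining, key=score)
        if score(best) <= 0:
            break
        structure.append(best)
        chosen = [(s, s + best["length"]) for s in best["starts"]]
        # keep only candidates whose occurrences do not overlap chosen segments
        remaining = [c for c in remaining if c is not best and not any(
            s < e2 and s2 < s + c["length"]
            for s in c["starts"] for (s2, e2) in chosen)]
    return structure

# Example with made-up beat indices: a 3-fold 32-beat repetition scores 64.
candidates = [{"starts": [0, 32, 64], "length": 32},
              {"starts": [96, 112], "length": 16}]
print([score(c) for c in candidates])        # [64, 16]
print(len(select_structure(candidates)))     # 2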
The resulting structural segmentation is then used to merge the chroma representations of matching segments. Despite the inevitable errors propagated from
incorrect segmentation, we found a significant performance increase (to 75% on
the MIREX score) by using the segmentation. In Figure 4 the beneficial effect
of using the structural segmentation can clearly be observed: many of the white
stripes representing chord recognition errors are eliminated by the structural
segmentation method, compared to the baseline method.
A further improvement was achieved by modifying the front-end audio processing. We found that by learning chord profiles as Gaussian mixtures, the
recognition rate of some chords can be improved. However, this did not result
in an overall improvement, as the performance on the most common chords decreased. Instead, an approximate pitch transcription method using non-negative
least squares was employed to reduce the effect of upper harmonics in the chroma
representations [23]. This results in both a qualitative (reduction of specific errors) and quantitative (a substantial overall increase in accuracy) improvement
in results, with a MIREX score of 79% (without using segmentation), which
again is significantly better than the state of the art. By combining both of the
above enhancements we reach an accuracy of 81%, a statistically significant improvement over the best result (74%) in the 2009 MIREX Chord Detection tasks
and over our own previously mentioned results.
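The idea of approximate transcription by non-negative least squares can be sketched as follows; this is our own simplification with idealised harmonic note templates, not the implementation of [23]:

import numpy as np
from scipy.optimize import nnls

def note_template(midi, freqs, n_harmonics=6):
    """Idealised magnitude spectrum of one note (decaying harmonic peaks)."""
    f0 = 440.0 * 2 ** ((midi - 69) / 12)
    template = np.zeros_like(freqs)
    for h in range(1, n_harmonics + 1):
        template += (0.6 ** (h - 1)) * np.exp(-0.5 * ((freqs - h * f0) / 10.0) ** 2)
    return template

def nnls_chroma(spectrum, freqs, midi_range=range(36, 96)):
    """Approximate note activations by NNLS, then fold them into a 12-bin chroma."""
    A = np.column_stack([note_template(m, freqs) for m in midi_range])
    activations, _ = nnls(A, spectrum)        # non-negative least squares solve
    chroma = np.zeros(12)
    for m, a in zip(midi_range, activations):
        chroma[m % 12] += a
    return chroma / (chroma.max() + 1e-12)

# Example: a synthetic spectrum of the note A3 (MIDI 57, pitch class 9).
freqs = np.linspace(0, 5000, 2048)
print(np.argmax(nnls_chroma(note_template(57, freqs), freqs)))   # expect 9 (A)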
Style Characterisation
In our first experiments [2], we analysed two chord corpora, consisting of the
Beatles studio albums (180 songs, 14132 chords) and a set of jazz standards
from the Real Book (244 songs, 24409 chords), to find harmonic patterns that
differentiate the two corpora. Chord sequences were represented in terms of the
interval between successive root notes or successive bass notes (to make the
directly, but only indirectly, by using the learnt models to classify unseen examples. Thus the following harmony modelling experiments are evaluated via the
task of genre classification.
4.2
Genre Classification
Fig. 5. Part of the decision tree for a binary classifier for the classes Jazz and Academic. Each internal node tests whether the chord sequence matches a pattern such as gap(A,C), degAndCat(5,maj,C,D,Key), degAndCat(1,min,D,E,Key); the leaves assign the class (academic or jazz).
Table 1. Results compared with the baseline for 2-class, 3-class and 9-class classification tasks

Classification Task             Baseline  Symbolic  Audio
Academic vs. Jazz               0.55      0.947     0.912
Academic vs. Popular            0.55      0.826     0.728
Jazz vs. Popular                0.61      0.891     0.807
Academic vs. Popular vs. Jazz   0.40      0.805     0.696
All 9 subgenres                 0.21      0.525     0.415
scale degree, chord category, and intervals between successive root notes, and we
constrained the learning algorithm to generate rules containing subsequences of
length at least two chords. The model can be expressed as a decision tree, as
shown in Figure 5, where the choice of branch taken is based on whether or not
the chord sequence matches the predicates at the current node, and the class
to which the sequence belongs is given by the leaf of the decision tree reached
by following these choices. The decision tree is equivalent to an ordered set of
rules or a Prolog program. Note that a rule at a single node of a tree cannot
necessarily be understood outside of its context in the tree. In particular, a rule
by itself cannot be used as a classifier.
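How such an ordered rule list can be applied as a classifier is sketched below; simple subsequence matching over (degree, category) pairs stands in for the Prolog/DCG machinery, and the rules shown are loose paraphrases of those quoted later in this section:

def matches(rule, chords):
    """True if the (degree, category) pattern occurs contiguously in the song."""
    n = len(rule)
    return any(chords[i:i + n] == rule for i in range(len(chords) - n + 1))

def classify(chords, ordered_rules, default="popular"):
    """Walk the ordered rules; the first matching rule decides the class."""
    for rule, genre in ordered_rules:
        if matches(rule, chords):
            return genre
    return default

rules = [([(5, "7"), (1, "maj7")], "jazz"),       # V7 -> Imaj7
         ([(5, "maj"), (1, "maj")], "academic")]  # V -> I with plain triads
song = [(2, "min7"), (5, "7"), (1, "maj7")]       # ii7 - V7 - Imaj7
print(classify(song, rules))                      # -> jazz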
The results for various classification tasks are shown in Table 1. All results are
significantly above the baseline, but performance clearly decreases for more difficult tasks. Perfect classification is not to be expected from harmony data, since
other aspects of music such as instrumentation (timbre), rhythm and melody
are also involved in defining and recognising musical styles.
Analysis of the most common rules extracted from the decision tree models
built during these experiments reveals some interesting and well-known jazz,
academic and popular music harmony patterns. For each rule shown below, the
coverage expresses the fraction of songs in each class that match the rule. For
example, while a perfect cadence is common to both academic and jazz styles,
the chord categories distinguish the styles very well, with academic music using
triads and jazz using seventh chords:
genre(academic,A,B,Key) :- gap(A,C),
degreeAndCategory(5,maj,C,D,Key),
degreeAndCategory(1,maj,D,E,Key),
gap(E,B).
[Coverage: academic=133/235; jazz=10/338]
genre(jazz,A,B,Key) :- gap(A,C),
degreeAndCategory(5,7,C,D,Key),
degreeAndCategory(1,maj7,D,E,Key),
gap(E,B).
[Coverage: jazz=146/338; academic=0/235]
A good indicator of blues is the sequence: ... - I7 - IV7 - ...
genre(blues,A,B,Key) :- gap(A,C),
degreeAndCategory(1,7,C,D,Key),
degreeAndCategory(4,7,D,E,Key),
gap(E,B).
[Coverage: blues=42/84; celtic=0/99; pop=2/100]
On the other hand, jazz is characterised (but not exclusively) by the sequence:
... - ii7 - V7 - ...
genre(jazz,A,B,Key) :- gap(A,C),
degreeAndCategory(2,min7,C,D,Key),
degreeAndCategory(5,7,D,E,Key),
gap(E,B).
[Coverage: jazz=273/338; academic=42/235; popular=52/283]
The representation also allows for longer rules to be expressed, such as the
following rule describing a modulation to the dominant key and back again in
academic music: ... - II7 - V - ... - I - V7 - ...
genre(academic,A,B,Key) :- gap(A,C),
degreeAndCategory(2,7,C,D,Key),
degreeAndCategory(5,maj,D,E,Key),
gap(E,F),
degreeAndCategory(1,maj,F,G,Key),
degreeAndCategory(5,7,G,H,Key),
gap(H,B).
[Coverage: academic=75/235; jazz=0/338; popular=1/283]
Although none of the rules are particularly surprising, these examples illustrate some meaningful musicological concepts that are captured by the rules. In
general, we observed that Academic music is characterised by rules establishing the tonality, e.g. via cadences, while Jazz is less about tonality, and more
about harmonic colour, e.g. the use of 7th, 6th, augmented and more complex
chords, and Popular music harmony tends to have simpler harmonic rules as
melody is predominant in this style. The system is also able to find longer rules
that a human might not spot easily. Working from audio data, even though the
transcriptions are not fully accurate, the classification and rules still capture the
same general trends as for symbolic data.
For genre classification we are not advocating a harmony-based approach
alone. It is clear that other musical features are better predictors of genre.
Nevertheless, the positive results encouraged a further experiment in which we
integrated the current classification approach with a state-of-the-art genre classification system, to test whether the addition of a harmony feature could improve
its performance.
4.3
that the classification rate of the harmony-based classifier alone is poor. For
both datasets the improvements over the standard classifier (as shown in Table
2) were found to be statistically significant.
Conclusion
We have looked at two approaches to the modelling of harmony which aim to
"dig deeper" into the music. In our probabilistic approach to chord transcription, we
demonstrated the advantage of modelling musical context such as key, metrical
structure and bass line, and simultaneously estimating all of these variables
along with the chord. We also developed an audio feature using non-negative
least squares that reflects the notes played better than the standard chroma
feature, and therefore reduces interference from harmonically irrelevant partials
and noise. A further improvement of the system was obtained by modelling the
global structure of the music, identifying repeated sections and averaging features
over these segments. One promising avenue of further work is the separation of
the audio (low-level) and symbolic (high-level) models which are conceptually
distinct but modelled together in current systems. A low-level model would be
concerned only with the production or analysis of audio (the mapping from
notes to features), while a high-level model would be a musical model handling
the mapping from chord symbols to notes.
Using a logic-based approach, we showed that it is possible to automatically
discover patterns in chord sequences which characterise a corpus of data, and
to use such models as classifiers. The advantage of a logic-based approach is
that models learnt by the system are transparent: the decision tree models can
be presented to users as sets of human-readable rules. This explanatory power is
particularly relevant for applications such as music recommendation. The DCG
representation allows chord sequences of any length to coexist in the same model,
as well as context information such as key. Our experiments found that the more
musically meaningful Degree-and-Category representation gave better classification results than using root intervals. The results using transcription from audio
data were encouraging in that, although some information was lost in the transcription process, the classification results remained well above the baseline, and
thus this approach is still viable when symbolic representations of the music are
not available. Finally, we showed that the combination of high-level harmony
features with low-level features can lead to genre classification accuracy improvements in a state-of-the-art system, and believe that such high-level models
provide a promising direction for genre classification research.
While these methods have advanced the state of the art in music informatics,
it is clear that in several respects they are not yet close to an expert musician's
understanding of harmony. Limiting the representation of harmony to a list of
chord symbols is inadequate for many applications. Such a representation may
be sufficient as a memory aid for jazz and pop musicians, but it allows only a very
limited specification of chord voicing (via the bass note), and does not permit
analysis of polyphonic texture such as voice leading, an important concept in
many harmonic styles, unlike the recent work of [11] and [29]. Finally, we note
that the current work provides little insight into harmonic function, for example
the ability to distinguish harmony notes from ornamental and passing notes and
to recognise chord substitutions, both of which are essential characteristics of a
system that models a musician's understanding of harmony. We hope to address
these issues in future work.
Acknowledgements. This work was performed under the OMRAS2 project,
supported by the Engineering and Physical Sciences Research Council, grant
EP/E017614/1. We would like to thank Chris Harte, Matthew Davies and others
at C4DM who contributed to the annotation of the audio data, and the Pattern
Recognition and Artificial Intelligence Group at the University of Alicante, who
provided the Band in a Box data.
References
1. Anglade, A., Benetos, E., Mauch, M., Dixon, S.: Improving music genre classification using automatically induced harmony rules. Journal of New Music Research 39(4), 349-361 (2010)
2. Anglade, A., Dixon, S.: Characterisation of harmony with inductive logic programming. In: 9th International Conference on Music Information Retrieval, pp. 63-68 (2008)
3. Anglade, A., Ramirez, R., Dixon, S.: First-order logic classification models of musical genres based on harmony. In: 6th Sound and Music Computing Conference, pp. 309-314 (2009)
4. Anglade, A., Ramirez, R., Dixon, S.: Genre classification using harmony rules induced from automatic chord transcriptions. In: 10th International Society for Music Information Retrieval Conference, pp. 669-674 (2009)
5. Aucouturier, J.J., Defreville, B., Pachet, F.: The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music. Journal of the Acoustical Society of America 122(2), 881-891 (2007)
6. Aucouturier, J.J., Pachet, F.: Improving timbre similarity: How high is the sky? Journal of Negative Results in Speech and Audio Sciences 1(1) (2004)
7. Bello, J.P., Pickens, J.: A robust mid-level representation for harmonic content in music signals. In: 6th International Conference on Music Information Retrieval, pp. 304-311 (2005)
8. Benetos, E., Kotropoulos, C.: Non-negative tensor factorization applied to music genre classification. IEEE Transactions on Audio, Speech, and Language Processing 18(8), 1955-1967 (2010)
9. Cathé, P.: Harmonic vectors and stylistic analysis: A computer-aided analysis of the first movement of Brahms' String Quartet Op. 51-1. Journal of Mathematics and Music 4(2), 107-119 (2010)
10. Conklin, D.: Representation and discovery of vertical patterns in music. In: Anagnostopoulou, C., Ferrand, M., Smaill, A. (eds.) ICMAI 2002. LNCS (LNAI), vol. 2445, pp. 32-42. Springer, Heidelberg (2002)
11. Conklin, D., Bergeron, M.: Discovery of contrapuntal patterns. In: 11th International Society for Music Information Retrieval Conference, pp. 201-206 (2010)
12. Conklin, D., Witten, I.: Multiple viewpoint systems for music prediction. Journal of New Music Research 24(1), 51-73 (1995)
13. Dixon, S., Pampalk, E., Widmer, G.: Classification of dance music by periodicity patterns. In: 4th International Conference on Music Information Retrieval, pp. 159-165 (2003)
14. Downie, J., Byrd, D., Crawford, T.: Ten years of ISMIR: Reflections on challenges and opportunities. In: 10th International Society for Music Information Retrieval Conference, pp. 13-18 (2009)
15. Ebcioğlu, K.: An expert system for harmonizing chorales in the style of J. S. Bach. In: Balaban, M., Ebcioğlu, K., Laske, O. (eds.) Understanding Music with AI, pp. 294-333. MIT Press, Cambridge (1992)
16. Fujishima, T.: Realtime chord recognition of musical sound: A system using Common Lisp Music. In: Proceedings of the International Computer Music Conference, pp. 464-467 (1999)
17. Hainsworth, S.W.: Techniques for the Automated Analysis of Musical Audio. Ph.D. thesis, University of Cambridge, Cambridge, UK (2003)
18. Harte, C.: Towards Automatic Extraction of Harmony Information from Music Signals. Ph.D. thesis, Queen Mary University of London, Centre for Digital Music (2010)
19. Krumhansl, C.L.: Cognitive Foundations of Musical Pitch. Oxford University Press, Oxford (1990)
20. Lerdahl, F., Jackendoff, R.: A Generative Theory of Tonal Music. MIT Press, Cambridge (1983)
21. Longuet-Higgins, H., Steedman, M.: On interpreting Bach. Machine Intelligence 6, 221-241 (1971)
22. Mauch, M.: Automatic Chord Transcription from Audio Using Computational Models of Musical Context. Ph.D. thesis, Queen Mary University of London, Centre for Digital Music (2010)
23. Mauch, M., Dixon, S.: Approximate note transcription for the improved identification of difficult chords. In: 11th International Society for Music Information Retrieval Conference, pp. 135-140 (2010)
24. Mauch, M., Dixon, S.: Simultaneous estimation of chords and musical context from audio. IEEE Transactions on Audio, Speech and Language Processing 18(6), 1280-1289 (2010)
25. Mauch, M., Dixon, S., Harte, C., Casey, M., Fields, B.: Discovering chord idioms through Beatles and Real Book songs. In: 8th International Conference on Music Information Retrieval, pp. 111-114 (2007)
26. Mauch, M., Müllensiefen, D., Dixon, S., Wiggins, G.: Can statistical language models be used for the analysis of harmonic progressions? In: International Conference on Music Perception and Cognition (2008)
27. Mauch, M., Noland, K., Dixon, S.: Using musical structure to enhance automatic chord transcription. In: 10th International Society for Music Information Retrieval Conference, pp. 231-236 (2009)
28. Maxwell, H.: An expert system for harmonizing analysis of tonal music. In: Balaban, M., Ebcioğlu, K., Laske, O. (eds.) Understanding Music with AI, pp. 334-353. MIT Press, Cambridge (1992)
29. Mearns, L., Tidhar, D., Dixon, S.: Characterisation of composer style using high-level musical features. In: 3rd ACM Workshop on Machine Learning and Music (2010)
30. Morales, E.: PAL: A pattern-based first-order inductive system. Machine Learning 26(2-3), 227-252 (1997)
31. Pachet, F.: Surprising harmonies. International Journal of Computing Anticipatory Systems 4 (February 1999)
32. Papadopoulos, H.: Joint Estimation of Musical Content Information from an Audio Signal. Ph.D. thesis, Université Pierre et Marie Curie (Paris 6) (2010)
33. Pardo, B., Birmingham, W.: Algorithms for chordal analysis. Computer Music Journal 26(2), 27-49 (2002)
34. Pérez-Sancho, C., Rizo, D., Iñesta, J.M.: Genre classification using chords and stochastic language models. Connection Science 21(2-3), 145-159 (2009)
35. Pérez-Sancho, C., Rizo, D., Iñesta, J.M., de León, P.J.P., Kersten, S., Ramirez, R.: Genre classification of music by tonal harmony. Intelligent Data Analysis 14, 533-545 (2010)
36. Pickens, J., Bello, J., Monti, G., Sandler, M., Crawford, T., Dovey, M., Byrd, D.: Polyphonic score retrieval using polyphonic audio queries: A harmonic modelling approach. Journal of New Music Research 32(2), 223-236 (2003)
37. Ramirez, R.: Inducing musical rules with ILP. In: Proceedings of the International Conference on Logic Programming, pp. 502-504 (2003)
38. Raphael, C., Stoddard, J.: Functional harmonic analysis using probabilistic models. Computer Music Journal 28(3), 45-52 (2004)
39. Scholz, R., Vincent, E., Bimbot, F.: Robust modeling of musical chord sequences using probabilistic N-grams. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 53-56 (2009)
40. Steedman, M.: A generative grammar for jazz chord sequences. Music Perception 2(1), 52-77 (1984)
41. Temperley, D., Sleator, D.: Modeling meter and harmony: A preference rule approach. Computer Music Journal 23(1), 10-27 (1999)
42. Whorley, R., Wiggins, G., Rhodes, C., Pearce, M.: Development of techniques for the computational modelling of harmony. In: First International Conference on Computational Creativity, pp. 11-15 (2010)
43. Widmer, G.: Discovering simple rules in complex data: A meta-learning algorithm and some surprising musical discoveries. Artificial Intelligence 146(2), 129-148 (2003)
44. Winograd, T.: Linguistics and the computer analysis of tonal harmony. Journal of Music Theory 12(1), 2-49 (1968)
Introduction
The advent of the Internet and the exploding popularity of file sharing websites
have challenged the music industry's traditional supply model that relied on the
physical distribution of music recordings such as vinyl records, cassettes, CDs,
etc. [5], [3]. In this direction, new interactive music services have emerged [1],
[6], [7]. However, a standardized file format is required to provide
interoperability between various interactive music players and interactive music
applications.
Video games and music consumption, once discrete markets, are now merging.
Games for dedicated gaming consoles such as the Microsoft Xbox, Nintendo Wii
and Sony Playstation and applications for smart phones using the Apple iPhone
and Google Android platforms are incorporating music creation and manipulation
into applications which encourage users to purchase music. These games can even
be centered around specific performers such as the Beatles [11] or T-Pain [14].
Many of these games follow a format inspired by karaoke. In its simplest case,
audio processing for karaoke applications involves removing the lead vocals so
that a live singer can perform with the backing tracks. This arrangement grew in
complexity by including automatic lyric following as well. Karaoke performance
used to be relegated to a setup involving a sound system with microphone and
playback capabilities within a dedicated space such as a karaoke bar or living
room, but it has found a revitalized market with mobile devices such as smart
phones. Karaoke is now no longer limited to a certain place or equipment, but can
be performed with a group of friends with a gaming console in a home, or performed
with a smart phone, recorded and uploaded online to share with others.
A standard format is needed to allow for the same musical content to be produced once and used with multiple applications. We will look at the current commercial applications for interactive music and discuss what requirements need to
be met. We will then look at three standards that address these requirements:
the MPEG-A Interactive Music Application Format (IM AF), IEEE 1599 and
interaction eXtensible Music Format (iXMF). We conclude by discussing what
improvements still need to be made for these standards to meet the requirements
of currently commercially-available applications.
Applications
Requirements
If the music industry continues to produce content for interactive music applications, a standard distribution format is needed. Content then will not need to be
individually authored for each application. At the most basic level, a standard
needs to allow:
- Separate tracks or groups of tracks
- Apply signal processing to those tracks or groups
- Mark up those tracks or stems to include time-based symbolic information
Once tracks or groups of tracks are separated from the full mix of the song,
additional processing or information can be included to enhance the interactivity
with those tracks.
3.1
Symbolic Information
The most simplistic interactive model of multiple tracks requires basic mixing
capabilities so that those tracks can be combined to create a single mix of the
song. A traditional karaoke mix could easily be created within this model by
muting the vocal track, but this model could also be extended. Including audio effects as in I Am T-Pain and Glee Karaoke allows users to add musical
content (such as their singing voice) to the mix and better emulate the original
performance.
Spatial audio signal processing is also required for more advanced applications. This could be as simple as panning a track between the left and right
channels of a stereo song, but could grow in complexity when considering applications for gaming consoles. Many games allow for surround sound playback,
usually over a 5.1 loudspeaker setup, so the optimal standard would allow for
flexible loudspeaker configurations. Mobile applications could take advantage of
headphone playback and use binaural audio to create an immersive 3D space.
MPEG-A IM AF
The MPEG-A Interactive Music Application Format (IM AF) standard structures the playback of songs that have multiple, unmixed audio tracks [8], [9], [10].
IM AF creates a container for the tracks, the associated metadata and symbolic
data while also managing how the audio tracks are played. Creating an IM AF
file involves formatting different types of media data, especially multiple audio
tracks with interactivity data, and storing them into an ISO Base Media File
Format. An IM AF file is composed of:
- Multiple audio tracks representing the music (e.g. instruments and/or voices).
- Groups of audio tracks: a hierarchical structure of audio tracks (e.g. all guitars of a song can be gathered in the same group).
- Preset data: pre-defined mixing information on multiple audio tracks (e.g. karaoke and rhythmic versions).
- User mixing data and interactivity rules: information related to user interaction (e.g. track/group selection, volume control).
- Metadata used to describe a song, music album, artist, etc.
- Additional media data that can be used to enrich the user's interaction space (e.g. timed text synchronized with audio tracks which can represent the lyrics of a song, images related to the song, music album, artist, etc.).
4.1
Mixes
The multiple audio tracks are combined to produce a mix. The mix is defined
by the playback level of tracks and may be determined by the music content
creator or by the end-user.
An interactive music player utilizing IM AF could allow users to re-mix music
tracks by enabling them to select the number of instruments to be listened to
and adjust the volume of individual tracks to their particular taste. Thus, IM
AF enables users to publish and exchange this re-mixing data, enabling other
users with IM AF players to experience their particular music taste creations.
Preset mixes of tracks could also be available. In particular IM AF supports two
possible mix modes for interaction and playback: preset-mix mode and user-mix
mode.
In the preset-mix mode, the user selects one preset among the presets stored
in IM AF, and then the audio tracks are mixed using the preset parameters
associated with the selected preset. Some preset examples are:
- General preset: composed of multiple audio tracks by the music producer.
- Karaoke preset: composed of multiple audio tracks except the vocal tracks.
- A cappella preset: composed of vocal and chorus tracks.
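As an illustration of the preset idea only (not of the IM AF file syntax), a preset can be thought of as a set of per-track gains applied before summation; the track names and gain values below are made up:

import numpy as np

def render_mix(tracks, gains):
    """tracks: name -> mono signal (np.ndarray); gains: name -> gain (0 = muted)."""
    length = max(len(x) for x in tracks.values())
    mix = np.zeros(length)
    for name, signal in tracks.items():
        mix[:len(signal)] += gains.get(name, 0.0) * signal
    return mix

tracks = {"vocals": np.ones(4), "guitar": np.ones(4), "drums": np.ones(4)}
karaoke_preset = {"vocals": 0.0, "guitar": 0.8, "drums": 0.9}   # vocals muted
print(render_mix(tracks, karaoke_preset))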
Figure 1 shows an MPEG-A IM AF player. In user-mix mode, the user selects/deselects the audio tracks/groups and controls the volume of each of them. Thus,
in user-mix mode, audio tracks are mixed according to the user's control and
taste; however, they should comply with the interactivity rules stored in the
IM AF. User interaction should conform to certain rules defined by the music
composers with the aim of fitting their artistic creation. However, the rules definition is optional and up to the music composer; the rules are not imposed by the IM
AF format. In general there are two categories of rules in IM AF: selection and
Fig. 1. An interactive music application. The player on the left shows the song being
played in a preset mix mode and the player on the right shows the user mix mode.
mixing rules. The selection rules relate to the selection of the audio tracks and
groups at rendering time whereas the mixing rules relate to the audio mixing.
Note that the interactivity rules allow the music producer to define the amount
of freedom available in IM AF users' mixes. The interactivity rules analyser in the
player verifies whether the user interaction conforms to the music producer's rules.
Figure 2 depicts in a block diagram the logic for both the preset-mix and the
user-mix usage modes.
IM AF supports four types of selection rules, as follows:
- Min/max rule: specifies both the minimum and maximum number of tracks/groups of a group that may be in the active state.
- Exclusion rule: specifies that several tracks/groups of a song will never be in the active state at the same time.
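A sketch of how a player's interactivity-rules analyser might check a user selection against these two rule types is given below; the data structures are hypothetical and do not follow the normative IM AF syntax:

def check_selection(active, rules):
    """active: set of selected track/group ids; rules: list of rule dicts."""
    for rule in rules:
        members_on = active & set(rule["members"])
        if rule["type"] == "minmax":
            if not (rule["min"] <= len(members_on) <= rule["max"]):
                return False
        elif rule["type"] == "exclusion":
            if len(members_on) > 1:      # at most one member may be active
                return False
    return True

rules = [{"type": "minmax", "members": ["guitar1", "guitar2"], "min": 1, "max": 2},
         {"type": "exclusion", "members": ["lead_vox", "karaoke_guide"]}]
print(check_selection({"guitar1", "lead_vox"}, rules))                   # True
print(check_selection({"guitar1", "lead_vox", "karaoke_guide"}, rules))  # False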
Component Name   Specification
File Structure   ISO/IEC 14496-12:2008
Audio            ISO/IEC 14496-3:2005; ISO/IEC 23003-2:2010; ISO/IEC 11172-3:1993
Image            JPEG (ISO/IEC 10918-1:1994)
Text             3GPP TS 26.245:2004; ISO/IEC 15938-5:2003
Table 2. Brands supported by IM AF. For im04 and im12, simultaneously decoded
audio tracks consist of tracks related to SAOC, which are a downmix signal and SAOC
bitstream. The downmix signal should be encoded using AAC or MP3. For all brands,
the maximum channel number of each track is restricted to 2 (stereo).
[Table 2 body: the brands im01, im02, im03, im04, im11, im12 and im21 are distinguished by application class (Mobile, Normal, High-end), audio profile and level (AAC Level 2, AAC Level 5, SAOC Baseline 2 or 3), the maximum number of simultaneously decoded audio tracks (2, 8, 16 or 32), and the sampling format (48 kHz/16 bits or 96 kHz/24 bits).]
Related Formats
While the IM AF packages together the relevant metadata and content that an
interactive music application would require, other file formats have also been
developed as a means to organize and describe synchronized streams of information for different applications. The two that will be briefly reviewed here are
IEEE 1599 [12] and iXMF [4].
5.1
IEEE 1599
IEEE 1599 is an XML-based format for synchronizing multiple streams of symbolic and non-symbolic data validated against a document type definition (DTD).
It was proposed to IEEE Standards in 2001 and was previously referred to as
MX (Musical Application Using XML). The standard emphasizes the readability
of symbols by both humans and machines, hence the decision to represent all
information that is not audio or video sample data within XML.
The standard is developed primarily for applications that provide additional
information surrounding a piece of music. Example applications include being
able to easily navigate between a score, multiple recordings of performances of
that score and images of the performers in the recordings [2].
The format consists of six layers that communicate with each other, but there
can be multiple instances of the same layer type. Figure 4 illustrates how the
layers interact. The layers are referred to as:
- General: holds metadata relevant to the entire document.
- Logic: logical description of score symbols.
- Structural: description of musical objects and their relationships.
- Notational: graphical representation of the score.
- Performance: computer-based descriptions of a musical performance.
- Audio: digital audio recording.
5.2
iXMF
Another file format that performs a similar task, with a particular focus on video
games, is iXMF (interaction eXtensible Music Format) [4]. The iXMF standard
is targeted at interactive audio within game development. XMF is a meta file
format that bundles multiple files together, and iXMF uses this same meta file
format as its structure.
iXMF uses a structure in which a moment in time can trigger an event. The
triggered event can encompass a wide array of activities such as the playing of
an audio file or the execution of specific code. The overall structure is described
in [4].
The format allows for both audio and symbolic information, such
as MIDI, to be included. Scripts then allow for real-time adaptive audio
effects. iXMF has been developed to create interactive soundtracks for video
game environments, so the audio can be generated in real time based on a user's
actions and other external factors. There are a number of standard Scripts that
perform basic tasks such as starting or stopping a Cue, but this set of Scripts
can also be extended.
Discussion
References
1. Audizen, http://www.audizen.com (last viewed February 2011)
2. Ludovico, L.A.: The new standard IEEE 1599, introduction and examples. J. Multimedia 4(1), 3-8 (2009)
3. Goel, S., Miesing, P., Chandra, U.: The Impact of Illegal Peer-to-Peer File Sharing on the Media Industry. California Management Review 52(3) (Spring 2010)
4. IASIG Interactive XMF Workgroup: Interactive XMF specification: file format specification. Draft 0.9.1a (2008), http://www.iasig.org/pubs/ixmf_draft-v091a.pdf
5. IFPI Digital Music Report 2009: New Business Models for a Changing Environment. International Federation of the Phonographic Industry (January 2009)
6. iKlax Media, http://www.iklaxmusic.com (last viewed February 2011)
7. Interactive Music Studio by MXP4, Inc., http://www.mxp4.com/interactive-music-studio (last viewed February 2011)
8. ISO/IEC 23000-12, Information technology - Multimedia application format (MPEG-A) - Part 12: Interactive music application format (2010), http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=53644
9. ISO/IEC 23000-12/FDAM 1 IM AF Conformance and Reference Software, N11746, 95th MPEG Meeting, Daegu, S. Korea (2011)
10. Kudumakis, P., Jang, I., Sandler, M.: A new interactive MPEG format for the music industry. In: 7th Int. Symposium on Computer Music Modeling and Retrieval (CMMR 2010), Málaga, Spain (2010)
11. Kushner, D.: The Making of the Beatles: Rock Band. IEEE Spectrum 46(9), 30-35 (2009)
12. Ludovico, L.A.: IEEE 1599: a multi-layer approach to music description. J. Multimedia 4(1), 9-14 (2009)
13. Smule, Inc.: Glee Karaoke iPhone Application, http://glee.smule.com/ (last viewed February 2011)
14. Smule, Inc.: I Am T-Pain iPhone Application, http://iamtpain.smule.com/ (last viewed February 2011)
Abstract. With a standard compact disc (CD) audio player, the only
possibility for the user is to listen to the recorded track, passively: the
interaction is limited to changing the global volume or the track. Imagine
now that the listener can turn into a musician, playing with the sound
sources present in the stereo mix, changing their respective volumes and
locations in space. For example, a given instrument or voice can be either
muted, amplied, or more generally moved in the acoustic space. This
will be a kind of generalized karaoke, useful for disc jockeys and also for
music pedagogy (when practicing an instrument). Our system shows that
this dream has come true, with active CDs fully backward compatible
while enabling interactive music. The magic is that the music is in the
sound: the structure of the mix is embedded in the sound signal itself,
using audio watermarking techniques, and the embedded information is
exploited by the player to perform the separation of the sources (patent
pending) used in turn by a spatializer.
Keywords: interactive music, compact disc, audio watermarking, source
separation, sound spatialization.
Introduction
Composers of acousmatic music conduct different stages through the composition process, from sound recording (generally stereophonic) to diffusion (multiphonic). During live interpretation, they intervene decisively in the spatialization
and coloration of pre-recorded sonorities. For this purpose, the musicians generally use a(n un)mixing console. With two hands, this requires some skill and
becomes hardly tractable with many sources or speakers.
Nowadays, the public is also eager to interact with the musical sound. Indeed, more and more commercial CDs come with several versions of the same
musical piece. Some are instrumental versions (for karaoke), other are remixes.
The karaoke phenomenon gets generalized from voice to instruments, in musical
video games such as Rock Band. But in this case, to get the interaction the
user has to buy the video game, which includes the multitrack recording.
Yet, the music industry is still reluctant to release the multitrack version of
musical hits. The only thing the user can get is a standard CD, thus a stereo
mix, or its dematerialized version available for download. The CD is not dead:
imagine a CD fully backward compatible while permitting musical interaction...
We present the proof of concept of the active audio CD, as a player that
can read any active disc (in fact, any 16-bit PCM stereo sound file), decode the
musical structure present in the sound signal, and use it to perform high-quality
source separation. Then, the listener can see and manipulate the sound sources
in the acoustic space. Our system is composed of two parts.
First, a CD reader extracts the audio data of the stereo track and decodes
the musical structure embedded in the audio signal (Section 2). This additional
information consists of the combination of active sources for each time-frequency
atom. As shown in [16], this permits an informed source separation of high quality
(patent pending). In our current system, we get up to 5 individual tracks out of
the stereo mix.
Second, a sound spatializer is able to map in real time all the sound sources
to any position in the acoustic space (Section 3). Our system supports either
binaural (headphones) or multi-loudspeaker configurations. As shown in [14], the spatialization is done in the spectral domain, is based on acoustic and interaural cues, and the listener can control the distance and the azimuth of
each source.
Finally, the corresponding software implementation is described in Section 4.
2 Source Separation
2.1 Time-Frequency Decomposition
The voice / instrument source signals are non-stationary, with possibly large
temporal and spectral variability, and they generally strongly overlap in the
time domain. Decomposing the signals in the time-frequency (TF) domain leads
to a sparse representation, i.e. few TF coefficients have a high energy and the
overlapping of signals is much lower in the TF domain than in the time domain
[29][7][15][20].

Fig. 1. Overview of the informed source separation scheme. At the coder, the source signals s_i[n] and the stereo mix x_L[n], x_R[n] are MDCT-transformed; an oracle estimator selects the set I_ft of predominant sources for each TF bin, and this side-information is coded and embedded by watermarking into the MDCT coefficients of the mix, which is then converted back to a 16-bit PCM signal by IMDCT. At the decoder, the watermarked mix is MDCT-transformed, the watermark is extracted and decoded into I_ft, the 2 x 2 mixing sub-matrix A_Ift is inverted, and the estimated source coefficients are converted back to time-domain signals by IMDCT.

Therefore, the separation of source signals can be carried out more efficiently in the TF domain. The Modified Discrete Cosine Transform (MDCT)
[21] is used as the TF decomposition since it presents several properties very
suitable for the present problem: good energy concentration (hence emphasizing audio signal sparsity), very good robustness to quantization (hence robustness to quantization-based watermarking), orthogonality, and perfect reconstruction. A detailed description of the MDCT equations is not provided in the present
paper, since it can be found in many papers, e.g. [21]. The MDCT is applied on
the source signals and on the mixture signal at the input of the coder to enable
the selection of predominant sources in the TF domain. Watermarking of the
resulting side-information is applied on the MDCT coefficients of the mix signal and the time samples of the watermarked mix signal are provided by inverse MDCT (IMDCT). At the decoder, the (PCM-quantized) mix signal is MDCT-transformed and the side-information is extracted from the resulting coefficients. Source separation is also carried out in the MDCT domain, and the resulting separated MDCT coefficients are used to reconstruct the corresponding time-domain separated source signals by IMDCT. Technically, the MDCT / IMDCT is applied on signal time frames of W = 2048 samples (46.5 ms for a sampling frequency fs = 44.1 kHz), with a 50% overlap between consecutive frames (of
1024 frequency bins). The frame length W is chosen to follow the dynamics of
music signals while providing a frequency resolution suitable for the separation.
Appropriate windowing is applied at both analysis and synthesis to ensure the
perfect reconstruction property [21].
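To make this analysis/synthesis chain concrete, here is a minimal Python sketch of a direct-form MDCT/IMDCT with a sine window, W = 2048 samples and 50% overlap. It is only an illustration of the transform used above; the actual player computes the MDCT through a fast FFT-based routine (see Section 4.2), and all function names here are ours.

```python
import numpy as np

W = 2048          # frame length
N = W // 2        # number of MDCT coefficients per frame (1024 bins)
win = np.sin(np.pi / W * (np.arange(W) + 0.5))  # sine window (Princen-Bradley condition)

def mdct(frame):
    """Direct-form MDCT of one frame of W samples -> N coefficients."""
    n = np.arange(W)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))
    return basis @ (win * frame)

def imdct(coeffs):
    """Inverse MDCT of N coefficients -> W windowed time samples."""
    n = np.arange(W)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[:, None] + 0.5 + N / 2) * (k[None, :] + 0.5))
    return (2.0 / N) * win * (basis @ coeffs)

def analyze(x):
    """50%-overlap MDCT analysis of a mono signal x."""
    return np.array([mdct(x[t:t + W]) for t in range(0, len(x) - W + 1, N)])

def synthesize(spec, length):
    """Overlap-add IMDCT synthesis (perfect reconstruction up to boundary frames)."""
    y = np.zeros(length)
    for i, c in enumerate(spec):
        y[i * N:i * N + W] += imdct(c)
    return y
```

With the sine window and 50% overlap, overlap-adding the inverse frames cancels the time-domain aliasing, so synthesize(analyze(x), len(x)) recovers x except at the first and last half-frames.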
2.2 Informed Source Separation
Since the MDCT is a linear transform, the LISS source separation problem
remains LISS in the transformed domain. For each frequency bin f and time bin
t, we thus have:
X(f, t) = A S(f, t)    (1)
where X(f, t) = [X_1(f, t), X_2(f, t)]^T denotes the stereo mixture coefficient vector and S(f, t) = [S_1(f, t), ..., S_N(f, t)]^T denotes the N-source coefficient vector. Because of audio signal sparsity in the TF domain, at most 2 sources are assumed to be relevant, i.e. of significant energy, at each TF bin (f, t). Therefore, the mixture is locally given by:

X(f, t) ≈ A_Ift S_Ift(f, t)    (2)

where I_ft denotes the set of 2 relevant sources at TF bin (f, t), and A_Ift represents the 2 x 2 mixing sub-matrix made of the columns A_i of A, i ∈ I_ft. If Ī_ft denotes the complementary set of non-active (or at least poorly active) sources at TF bin (f, t), the source signals at bin (f, t) are estimated by [7]:

Ŝ_Ift(f, t) = A_Ift⁻¹ X(f, t),   Ŝ_Īft(f, t) = 0    (3)

where A_Ift⁻¹ denotes the inverse of A_Ift. Note that such a separation technique
exploits the 2-channel spatial information of the mixture signal and relaxes the
restrictive assumption of a single active source at each TF bin, as made in
[29][2][3].
The side-information that is transmitted between coder and decoder (in addition to the mix signal) mainly consists of the coefficients of the mixing matrix A and the combination of indexes I_ft that identifies the predominant sources in each TF bin. This contrasts with classic blind and semi-blind separation methods, where both types of information have to be estimated from the mix signal only, generally in two steps which can each be very challenging and a source of significant errors.
As for the mixing matrix, the number of coefficients to be transmitted is quite low in the present LISS configuration. Therefore, the transmission cost of A is negligible compared to the transmission cost of I_ft, and it occupies a very small
portion of the watermarking capacity.
As for the source indexes, I_ft is determined at the coder for each TF bin using the source signals, the mixture signal, and the mixing matrix A, as the combination that provides the lowest mean squared error (MSE) between the original source signals and the estimated source signals obtained with Equation (3) (see [17] for details). This process follows the line of oracle estimators, as introduced in [26] for the general purpose of evaluating the performance of source separation algorithms.
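The following rough Python sketch illustrates the principle of this oracle index selection and of the local 2 x 2 inversion of Equation (3), assuming the source and mix MDCT coefficients are available as arrays. It is only a didactic illustration of the criterion, not the authors' patented implementation.

```python
import numpy as np
from itertools import combinations

def oracle_indices(S, A):
    """Coder side: for each TF bin, choose the pair of sources whose local
    2x2 inversion of the mix best reconstructs the true source coefficients (MSE)."""
    N, F, T = S.shape                       # N sources, F frequency bins, T frames
    X = np.einsum('cn,nft->cft', A, S)      # stereo mix coefficients (2, F, T)
    pairs = list(combinations(range(N), 2))
    best = np.zeros((F, T), dtype=int)
    best_err = np.full((F, T), np.inf)
    for p, (i, j) in enumerate(pairs):
        Ainv = np.linalg.inv(A[:, [i, j]])            # 2x2 sub-matrix inversion
        Shat = np.einsum('rc,cft->rft', Ainv, X)      # estimated pair coefficients
        err = (S[i] - Shat[0]) ** 2 + (S[j] - Shat[1]) ** 2
        err += np.sum(np.delete(S, [i, j], axis=0) ** 2, axis=0)  # muted sources count too
        mask = err < best_err
        best[mask], best_err[mask] = p, err[mask]
    return best, pairs

def separate(X, A, best, pairs, N):
    """Decoder side: rebuild the N source coefficient maps from the mix and the indexes."""
    F, T = best.shape
    Shat = np.zeros((N, F, T))
    for p, (i, j) in enumerate(pairs):
        sel = (best == p)
        est = np.einsum('rc,cft->rft', np.linalg.inv(A[:, [i, j]]), X)
        Shat[i][sel], Shat[j][sel] = est[0][sel], est[1][sel]
    return Shat
```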
2.3 Watermarking Process
Fig. 2. Example of QIM using a set of quantizers for C(t, f) = 2 and the resulting global grid. We have Δ(t, f) = 2^C(t,f) Δ_QIM. The binary code 01 is embedded into the MDCT coefficient X(t, f) by quantizing it to X^w(t, f) using the quantizer indexed by 01.
The watermarking technique relies on Quantization Index Modulation (QIM) [8] applied to the MDCT coefficients of the mixture signal, in combination with the use of a psycho-acoustic model (PAM) for the control of inaudibility. It has been presented in detail in [19][18]. Therefore,
we just present the general lines of the watermarking process in this section, and
we refer the reader to these papers for technical details.
The embedding principle is the following. Let us denote by C(t, f ) the capacity at TF bin (t, f ), i.e. the maximum size of the binary code to be embedded
in the MDCT coefficient at that TF bin (under the inaudibility constraint). We will see below how C(t, f) is determined for each TF bin. For each TF bin (t, f), a set of 2^C(t,f) uniform quantizers is defined, whose quantization levels are intertwined, and each quantizer represents a C(t, f)-bit binary code. Embedding a given binary code on a given MDCT coefficient is done by quantizing this coefficient with the corresponding quantizer (i.e. the quantizer indexed by the code to transmit; see Fig. 2). At the decoder, recovering the code is done by comparing the transmitted MDCT coefficient (potentially corrupted by transmission noise) with the 2^C(t,f) quantizers, and selecting the quantizer with the quantization level closest to the transmitted MDCT coefficient. Note that because
the capacity values depend on (f, t), those values must be transmitted to the
decoder to select the right set of quantizers. For this, a fixed-capacity embedding reservoir is allocated in the higher frequency region of the spectrum, and the capacity values are actually defined within subbands (see [18] for details). Note also that the complete binary message to transmit (here the set of I_ft codes) is split and spread across the different MDCT coefficients according to the local capacity values, so that each MDCT coefficient carries a small part of the complete message. Conversely, the decoded elementary messages have to be concatenated to recover the complete message. The embedding rate R is given by the
average total number of embedded bits per second of signal. It is obtained by
summing the capacity C(t, f ) over the embedded region of the TF plane and
dividing the result by the signal duration.
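A minimal Python sketch of this embed/decode principle for a single MDCT coefficient is given below, with intertwined uniform quantizers of global step Δ(t, f) = 2^C Δ_QIM as in Fig. 2. Variable names are ours, and the per-subband capacity allocation driven by the PAM (described next) is left out.

```python
import numpy as np

def qim_embed(x, code, capacity, delta_qim):
    """Embed a `capacity`-bit code into coefficient x by quantizing it with
    the quantizer indexed by `code` (grids offset by code * delta_qim)."""
    step = (2 ** capacity) * delta_qim      # distance between levels of one quantizer
    offset = code * delta_qim               # intertwined grids, one per code
    return np.round((x - offset) / step) * step + offset

def qim_decode(y, capacity, delta_qim):
    """Recover the code by finding the quantizer whose grid is closest to y."""
    step = (2 ** capacity) * delta_qim
    codes = np.arange(2 ** capacity)
    nearest = np.round((y - codes * delta_qim) / step) * step + codes * delta_qim
    return int(np.argmin(np.abs(y - nearest)))

# Tiny usage example: embed code 0b01 with C = 2 bits, then decode it back.
xw = qim_embed(0.1234, code=0b01, capacity=2, delta_qim=1e-3)
assert qim_decode(xw, capacity=2, delta_qim=1e-3) == 0b01
```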
The performance of the embedding process is determined by two related constraints: the watermark decoding must be robust to the 16-bit PCM conversion
of the mix signal (which is the only source of noise because the perfect reconstruction property of MDCT ensures transparency of IMDCT/MDCT chained
processes), and the watermark must be inaudible. The time-domain PCM quantization leads to additive white Gaussian noise on the MDCT coefficients, which induces a lower bound for Δ_QIM, the minimum distance between two different levels of all QIM quantizers (see Fig. 2). Given that lower bound, the inaudibility constraint induces an upper bound on the number of quantizers, hence a corresponding upper bound on the capacity C(t, f) [19][18]. More specifically, the
constraint is that the power of the embedding error in the worst case remains
under the masking threshold M (t, f ) provided by a psychoacoustic model. The
PAM is inspired from the MPEG-AAC model [11] and adapted to the present
data hiding problem. It is shown in [18] that the optimal capacity is given by:
C_α(t, f) = ⌊ (1/2) log₂( 10^((M(t,f)+α)/10) / Δ²_QIM + 1 ) ⌋    (4)
where ⌊·⌋ denotes the floor function, and α is a scaling factor (in dB) that enables users to control the trade-off between signal degradation and embedding rate by translating the masking threshold. Signal quality is expected to decrease as the embedding rate increases, and vice versa. When α > 0 dB, the masking threshold is raised: larger values of the quantization error allow for larger capacities (and thus a higher embedding rate), at the price of potentially lower quality. Conversely, when α < 0 dB, the masking threshold is lowered, leading to a safety margin for the inaudibility of the embedding process, at the price of a lower embedding rate. It can be shown that the embedding rate R_α corresponding to C_α and the basic rate R = R_(α=0) are related by [18]:

R_α ≈ R + (log₂(10) / 10) F_u α    (5)

(F_u being the bandwidth of the embedded frequency region). This linear relation makes it easy to control the embedding rate by the setting of α.
The inaudibility of the watermarking process has been assessed by subjective
and objective tests. In [19][18], Objective Difference Grade (ODG) scores [24][12] were calculated for a large range of embedding rates and different musical styles. The ODG remained very close to zero (hence imperceptibility of the watermark) for rates up to about 260 kbps for musical styles such as pop, rock, jazz, funk, bossa, fusion, etc. (and only up to about 175 kbps for classical music). Such rates generally correspond to the basic level of the masking curve allowed by the PAM (i.e. α = 0 dB). More comfortable rates can be set between 150 and 200 kbits/s to guarantee transparent quality for the embedded signal. This flexibility is used in our informed source separation system to fit the embedding capacity to the bit-rate of the side-information, which is at the very reasonable value of 64 kbits/s/channel. Here, the watermarking is guaranteed to be highly inaudible, since the masking curve is significantly lowered to fit the required
capacity.
3 Sound Spatialization
Now that we have recovered the different sound sources present in the original mix, we can allow the user to manipulate them in space. We consider each sound source as a point, omni-directional source in the horizontal plane, located by its (ρ, θ) coordinates, where ρ is the distance of the source to the head center and θ is the azimuth angle. Indeed, as a first approximation, in most musical situations both the listeners and the instrumentalists are standing on the (same) ground, with no relative elevation. Moreover, we consider that the distance ρ is large enough for the
acoustic wave to be regarded as planar when reaching the ears.
3.1 Acoustic Cues

Interaural Level Differences. Following the model by Viste [27], the ILDs are expressed as proportional to sin(θ), with a frequency-dependent scaling:

ILD(θ, f) = α(f) sin(θ)    (6)
where α(f) is the average scaling factor that best suits our model, in the least-square sense, for each listener of the CIPIC database [1] (see Fig. 3). The overall error of this model over the CIPIC database for all subjects, azimuths, and frequencies is 4.29 dB.
Fig. 3. Frequency-dependent scaling factors: α (top) and β (bottom)
Interaural Time Differences. Because of the head shadowing, Viste uses for the ITDs a model based on sin(θ) + θ, after Woodworth [28]. However, from the theory of the diffraction of a harmonic plane wave by a sphere (the head), the ITDs should be proportional to sin(θ). Contrary to the model by Kuhn [13], our model takes into account the inter-subject variation and the full frequency band. The ITD model is then expressed as:

ITD(θ, f) = β(f) r sin(θ) / c    (7)

where β(f) is the average scaling factor that best suits our model, in the least-square sense, for each listener of the CIPIC database (see Fig. 3), r denotes the
head radius, and c is the speed of sound. The overall error of this model over the
CIPIC database is 0.052ms (thus comparable to the 0.045ms error of the model
by Viste).
Distance Cues. In ideal conditions, the sound pressure of a source is halved (the level decreases by 6 dB) when the distance is doubled, according to the well-known inverse square law [5]. Applying only this frequency-independent rule to a signal has no effect on the sound timbre. But when a source moves away from the listener, the high frequencies are more attenuated than the low frequencies. Thus the sound spectrum changes with the distance. More precisely, the spectral centroid moves towards the low frequencies as the distance increases. In [4], the authors show that the frequency-dependent attenuation due to atmospheric absorption is roughly proportional to f², in accordance with the ISO 9613-1 norm [10].
Here, we manipulate the magnitude spectrum to simulate the distance between
the source and the listener. Conversely, we would measure the spectral centroid
(related to brightness) to estimate the source's distance to the listener.
In a concert room, the distance is often simulated by placing the speaker near
/ away from the auditorium, which is sometimes physically restricted in small
rooms. In fact, the architecture of the room plays an important role and can
lead to severe modifications in the interpretation of the piece.
Here, simulating the distance is a matter of changing the magnitude of each
short-term spectrum X. More precisely, the ISO 9613-1 norm [10] gives the
frequency-dependent attenuation factor a(f) in dB per meter for given air temperature, humidity, and pressure conditions. At distance ρ, the magnitudes of X(f) should be attenuated by D(f, ρ) decibels:

D(f, ρ) = a(f) ρ    (8)
3.2 Binaural Spatialization
In binaural listening conditions using headphones, the sound from each earphone
speaker is heard only by one ear. Thus the encoded spatial cues are not affected
by any cross-talk signals between earphone speakers.
The left (X_L) and right (X_R) spectra of a source with spectrum X are obtained by multiplying X with a pair of complex spatialization coefficients:

H_L(f) = 10^(+a(f)/2) e^(+jφ(f)/2),    (11)
H_R(f) = 10^(−a(f)/2) e^(−jφ(f)/2)    (12)

(because of the symmetry between the left and right ears), where a and φ are given by:

a(f) = ILD(θ, f) / 20,    (13)
φ(f) = ITD(θ, f) · 2πf.    (14)
This is indeed a convolutive model, the convolution turning into a multiplication in the spectral domain. Moreover, the spatialization coefficients are complex.
The control of both amplitude and phase should provide better audio quality [25]
than amplitude-only spatialization. Indeed, we reach a remarkable spatialization
realism through informal listening tests with AKG K240 Studio headphones.
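The following Python sketch illustrates Equations (6), (7) and (11)-(14). For readability it assumes constant scaling factors in place of the fitted, frequency-dependent α(f) and β(f) of Fig. 3; the numerical values are ours and purely illustrative.

```python
import numpy as np

C_SOUND = 340.0          # speed of sound (m/s), assumed
R_HEAD = 0.09            # head radius (m), assumed
ALPHA, BETA = 7.0, 1.0   # illustrative constants (the paper fits alpha(f), beta(f))

def binaural_coeffs(theta, freqs):
    """Spatialization coefficients H_L, H_R for azimuth theta (radians)."""
    ild = ALPHA * np.sin(theta)                      # Eq. (6), in dB
    itd = BETA * R_HEAD * np.sin(theta) / C_SOUND    # Eq. (7), in seconds
    a = ild / 20.0                                   # Eq. (13)
    phi = itd * 2.0 * np.pi * freqs                  # Eq. (14)
    h_l = 10.0 ** (+a / 2.0) * np.exp(+1j * phi / 2.0)   # Eq. (11)
    h_r = 10.0 ** (-a / 2.0) * np.exp(-1j * phi / 2.0)   # Eq. (12)
    return h_l, h_r

# Usage: spatialize one short-term spectrum (e.g. a 2048-point frame) at 30 degrees.
fs, n = 44100, 2048
freqs = np.fft.rfftfreq(n, d=1.0 / fs)
x = np.random.randn(n)                  # stand-in for one separated source frame
X = np.fft.rfft(x)
h_l, h_r = binaural_coeffs(np.deg2rad(30.0), freqs)
left, right = np.fft.irfft(h_l * X), np.fft.irfft(h_r * X)
```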
3.3 Multi-loudspeaker Spatialization
With a pair of loudspeakers, the sound of each speaker reaches both ears through four acoustic paths C_LL, C_LR, C_RL, C_RR (see Fig. 4). If the left and right speakers are fed with the signals Y_L = K_L X and Y_R = K_R X, matching the binaural signals at the ears requires:

X_L = H_L X = C_LL K_L X + C_LR K_R X,    (15)
X_R = H_R X = C_RL K_L X + C_RR K_R X;    (16)

the best panning coefficients under CIPIC conditions for the pair of speakers to match the binaural signals at the ears (see Equations (11) and (12)) are then given by:

K_L(t, f) = C (C_RR H_L − C_LR H_R),    (17)
K_R(t, f) = C (−C_RL H_L + C_LL H_R),    (18)

with C = 1 / (C_LL C_RR − C_LR C_RL).    (19)

During diffusion, the left and right signals (Y_L, Y_R) to feed the left and right speakers are obtained by multiplying the short-term spectra X with K_L and K_R, respectively:

Y_L(t, f) = K_L(t, f) X(t, f) = C (C_RR X_L − C_LR X_R),    (20)
Y_R(t, f) = K_R(t, f) X(t, f) = C (−C_RL X_L + C_LL X_R).    (21)
Fig. 4. Stereophonic loudspeaker display: the sound source X reaches the ears L, R through four acoustic paths (C_LL, C_LR, C_RL, C_RR)
Fig. 5. Pairwise paradigm: for a given sound source, signals are dispatched only to the
two speakers closest to it (in azimuth)
In a setup with many speakers we use the classic pair-wise paradigm [9], consisting in choosing for a given source only the two speakers closest to it (in azimuth): one at the left of the source, the other at its right (see Fig. 5). The left and right signals computed for the source are then dispatched accordingly.

Fig. 6. Overview of the software architecture: the player reads the 2-channel stereo file of the active CD and its separator exposes the N sources on N output ports; these feed the N input ports of the spatializer, whose M output ports drive the M speakers
4 Software System
Our methods for source separation and sound spatialization have been implemented as a real-time software system, programmed in C++ and using Qt4, JACK, and FFTW. These libraries were chosen to ensure portability and
performance on multiple platforms. The current implementation has been tested
on Linux and MacOS X operating systems, but should work with very minor
changes on other platforms, e.g. Windows.
Fig. 6 shows an overview of the architecture of our software system. Source
separation and sound spatialization are implemented as two dierent modules.
We rely on the JACK audio port system to route audio streams between these two
modules in real time.
This separation into two modules was mainly dictated by a different choice of
distribution license: the source separation of the active player should be patented
and released without sources, while the spatializer will be freely available under
the GNU General Public License.
4.1 Usage
Player. The active player is presented as a simple audio player, based on JACK.
The graphical user interface (GUI) is a very common player interface. It allows the user to play or pause the reading / decoding. The player reads activated stereo files, from an audio CD or a file, and then decodes the stereo mix in order to extract
the N (mono) sources. Then these sources are transferred to N JACK output
ports, currently named QJackPlayerSeparator:outputi, with i in [1; N ].
Spatializer. The spatializer is also a real-time application, standalone and based
on JACK. It has N input ports that correspond to the N sources to spatialize. These ports are to be connected, with the JACK port connection system, to the N output ports of the active player. The spatializer can be configured to work with headphones (binaural configuration) or with M loudspeakers.
Fig. 7. From the stereo mix stored on the CD, our player allows the listener (center) to manipulate 5 sources in the acoustic space, using here an octophonic display (top) or headphones (bottom)
Fig. 7 shows the current interface of the spatializer, which displays a bird's-eye view of the audio scene. The user's avatar is in the middle, represented by a head viewed from above, surrounded by the various sound sources.
The binaural configuration is distinguished by the fact that it has only two speakers with neither azimuth nor distance specified. Fig. 8 shows the speaker configuration files for the binaural and octophonic (8-speaker) configurations.
4.2 Implementation
Player. The current implementation is divided into three threads. The main
thread is the Qt GUI. A second thread reads and buffers data from the stereo file, to be able to compensate for any physical CD reader latency. The third
thread is the JACK process function. It separates the data for the N sources and
feeds the output ports accordingly. In the current implementation, the number
of output sources is fixed to N = 5.
Our source separation implementation is rather efficient: for a Modified Discrete Cosine Transform (MDCT) of W samples, we only compute a Fast Fourier Transform (FFT) of size W/4. Indeed, an MDCT of length W is almost equivalent to a type-IV DCT of length W/2, which can be computed with an FFT of length
W/4. Thus, as we use MDCT and IMDCT of size W = 2048, we only do FFT
and IFFT of 512 samples.
Spatializer. The spatializer is currently composed of two threads: a main
thread, the Qt GUI, and the JACK process function.
Fig. 9 shows the processing pipeline for the spatialization. For each source, x_i is first transformed into the spectral domain with an FFT to obtain its spectrum X_i. This spectrum is attenuated according to its distance ρ_i (see Section 3.1). Then, for an azimuth θ_i, we obtain the left (X_iL) and right (X_iR) spectra (see Equations (11) and (12)). The dispatcher then chooses the pair (j, j + 1) of speakers surrounding the azimuth θ_i, transforms the spectra X_iL and X_iR by the coefficients corresponding to this speaker pair (see Equations (20) and (21)), and adds the resulting spectra Y_j and Y_j+1 to the spectra of these speakers. Finally, for each
speaker, its spectrum is transformed with an IFFT to obtain back in the time
domain the mono signal yj for the corresponding output.
Source spatialization is more computation-intensive than source separation,
mainly because it requires more transforms (N FFTs and M IFFTs) of larger
size W = 2048. For now, source spatialization is implemented as a serial process. However, we can see that this pipeline is highly parallel. Indeed, almost
everything operates on separate data. Only the spectra of the speakers may be
accessed concurrently, to accumulate the spectra of sources that would be spatialized to the same or neighbouring speaker pairs. These spectra should then
be protected with mutual exclusion mechanisms. A future version will take advantage of multi-core processor architectures.
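A serial Python sketch of this dispatching loop is given below; it reuses the binaural_coeffs() helper from the earlier sketch and, for brevity, feeds the binaural spectra directly to the surrounding speaker pair instead of applying the transaural correction of Equations (20) and (21). The speaker azimuths and the accumulation strategy are illustrative assumptions.

```python
import numpy as np

def closest_pair(theta, speaker_azimuths):
    """Return the indices of the two speakers surrounding azimuth theta."""
    az = np.asarray(speaker_azimuths)
    order = np.argsort(az)
    # First speaker (in sorted order) at or to the right of the source, wrapping around.
    right = next((k for k in range(len(az)) if az[order[k]] >= theta), 0)
    left = (right - 1) % len(az)
    return order[left], order[right]

def spatialize_frame(sources, azimuths, speaker_azimuths, freqs):
    """Accumulate the spectra of all sources into per-speaker spectra.
    `freqs` must be np.fft.rfftfreq(frame_length, 1/fs) for the frame length used."""
    out = [np.zeros_like(np.fft.rfft(sources[0])) for _ in speaker_azimuths]
    for x, theta in zip(sources, azimuths):
        X = np.fft.rfft(x)
        h_l, h_r = binaural_coeffs(theta, freqs)    # from the previous sketch
        j_left, j_right = closest_pair(theta, speaker_azimuths)
        out[j_left] += h_l * X                      # simplified pair-wise dispatch
        out[j_right] += h_r * X
    return [np.fft.irfft(Y) for Y in out]
```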
4.3 Experiments
Our current prototype has been tested on an Apple MacBook Pro, with an Intel
Core 2 Duo 2.53 GHz processor, connected to headphones or to an 8-speaker system via a MOTU 828 MKII soundcard. For such a configuration, the processing power is well contained. In order to run in real time, given a signal sampling frequency of 44.1 kHz and windows of 2048 samples, the overall processing time should be less than 23 ms. With our current implementation, 5-source separation and 8-speaker spatialization, this processing time is in fact less than 3 ms on the laptop mentioned previously. Therefore, the margin to increase the number of sources to separate and/or the number of loudspeakers is quite comfortable. To confirm this, we exploited the split of the source separation and spatialization modules to test the spatializer without the active player, since the latter is currently limited to 5 sources. We connected to the spatializer a multi-track player that reads several files simultaneously and exposes these tracks as JACK output ports. Tests showed that the spatialization can be applied to roughly 48 sources on 8 speakers, or 40 sources on 40 speakers, on this computer.

Fig. 10. Enhanced graphical interface with pictures of instruments for sources and propagating sound waves represented as colored circles
This performance headroom leaves some processing power for other computations, to improve the user experience for example. Fig. 10 shows an example of
an enhanced graphical interface where the sources are represented with pictures
of the instruments, and the propagation of the sound waves is represented for
each source by time-evolving colored circles. The color of each circle is computed
from the color (spectral envelope) of the spectrum of each source and updated
in real time as the sound changes.
5 Conclusion

We have presented a real-time system for musical interaction from stereo files,
fully backward-compatible with standard audio CDs. This system consists of a
source separator and a spatializer.
The source separation is based on the sparsity of the source signals in the
spectral domain and the exploitation of the stereophony. This system is characterized by a quite simple separation process and by the fact that some side-information is inaudibly embedded in the signal itself to guide the separation process. Compared to (semi-)blind approaches also based on sparsity and local mixture inversion, the informed aspect of the separation guarantees the optimal combination of the sources, thus leading to a remarkable increase in the quality of
the separated signals.
The sound spatialization is based on a simplified model of the head-related transfer functions, generalized to any multi-loudspeaker configuration using a transaural technique applied to the best pair of loudspeakers for each sound source. Although this quite simple technique does not compete with the 3D accuracy of Ambisonics or holophony (Wave Field Synthesis), it is very flexible (no specific loudspeaker configuration) and suitable for a large audience (no hot-spot effect), with sufficient sound quality.
The resulting software system is able to separate 5-source stereo mixtures
(read from an audio CD or 16-bit PCM files) in real time, and it enables the user to remix the piece of music during playback with basic functions such as volume
and spatialization control. The system has been demonstrated in several countries with excellent feedback from the users / listeners, with a clear potential in
terms of musical creativity, pedagogy, and entertainment.
For now, the mixing model imposed by the informed source separation is
generally over-simplistic when professional / commercial music production is
at stake. Extending the source separation technique to high-quality convolutive
mixing is part of our future research.
As shown in [14], the model we use for the spatialization is more general, and
can be used as well to localize audio sources. Thus we would like to add the
automatic detection of the speaker configuration to our system, from a pair of microphones placed in the audience, as well as the automatic fine tuning of the spatialization coefficients to improve the 3D sound effect.
Regarding performance, many operations work on separate data and thus
could easily be parallelized on modern hardware architectures. Last but not least,
we are also porting the whole application to mobile touch devices, such as smart
phones and tablets. Indeed, we believe that these devices are perfect targets for
a system in between music listening and gaming, and gestural interfaces with
direct interaction to move the sources are very intuitive.
Acknowledgments
This research was partly supported by the French ANR (Agence Nationale de la
Recherche) DReaM project (ANR-09-CORD-006).
References
1. Algazi, V.R., Duda, R.O., Thompson, D.M., Avendano, C.: The CIPIC HRTF database. In: Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York, pp. 99-102 (2001)
2. Araki, S., Sawada, H., Makino, S.: K-means based underdetermined blind speech separation. In: Makino, S., et al. (eds.) Blind Source Separation, pp. 243-270. Springer, Heidelberg (2007)
3. Araki, S., Sawada, H., Mukai, R., Makino, S.: Underdetermined blind sparse source separation for arbitrarily arranged multiple sensors. Signal Processing 87(8), 1833-1847 (2007)
4. Bass, H., Sutherland, L., Zuckerwar, A., Blackstock, D., Hester, D.: Atmospheric absorption of sound: Further developments. Journal of the Acoustical Society of America 97(1), 680-683 (1995)
5. Berg, R.E., Stork, D.G.: The Physics of Sound, 2nd edn. Prentice Hall, Englewood Cliffs (1994)
6. Blauert, J.: Spatial Hearing, revised edn. MIT Press, Cambridge (1997); translation by J.S. Allen
7. Bofill, P., Zibulevski, M.: Underdetermined blind source separation using sparse representations. Signal Processing 81(11), 2353-2362 (2001)
8. Chen, B., Wornell, G.: Quantization index modulation: A class of provably good methods for digital watermarking and information embedding. IEEE Transactions on Information Theory 47(4), 1423-1443 (2001)
9. Chowning, J.M.: The simulation of moving sound sources. Journal of the Audio Engineering Society 19(1), 2-6 (1971)
10. International Organization for Standardization, Geneva, Switzerland: ISO 9613-1:1993: Acoustics - Attenuation of Sound During Propagation Outdoors - Part 1: Calculation of the Absorption of Sound by the Atmosphere (1993)
11. ISO/IEC JTC1/SC29/WG11 MPEG: Information technology - Generic coding of moving pictures and associated audio information - Part 7: Advanced Audio Coding (AAC), IS 13818-7(E) (2004)
12. ITU-R: Method for objective measurements of perceived audio quality (PEAQ), Recommendation BS.1387-1 (2001)
13. Kuhn, G.F.: Model for the interaural time differences in the azimuthal plane. Journal of the Acoustical Society of America 62(1), 157-167 (1977)
14. Mouba, J., Marchand, S., Mansencal, B., Rivet, J.M.: RetroSpat: a perception-based system for semi-automatic diffusion of acousmatic music. In: Proceedings of the Sound and Music Computing (SMC) Conference, Berlin, pp. 33-40 (2008)
15. O'Grady, P., Pearlmutter, B.A., Rickard, S.: Survey of sparse and non-sparse methods in source separation. International Journal of Imaging Systems and Technology 15(1), 18-33 (2005)
16. Parvaix, M., Girin, L.: Informed source separation of underdetermined instantaneous
stereo mixtures using source index embedding. In: IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP), Dallas, Texas (2010)
17. Parvaix, M., Girin, L.: Informed source separation of linear instantaneous underdetermined audio mixtures by source index embedding. IEEE Transactions on Audio, Speech, and Language Processing (accepted, pending publication, 2011)
18. Pinel, J., Girin, L., Baras, C.: A high-rate data hiding technique for uncompressed
audio signals. IEEE Transactions on Audio, Speech, and Language Processing
(submitted)
19. Pinel, J., Girin, L., Baras, C., Parvaix, M.: A high-capacity watermarking technique
for audio signals based on MDCT-domain quantization. In: International Congress
on Acoustics (ICA), Sydney, Australia (2010)
20. Plumbley, M.D., Blumensath, T., Daudet, L., Gribonval, R., Davies, M.E.: Sparse representations in audio and music: From coding to source separation. Proceedings of the IEEE 98(6), 995-1005 (2010)
21. Princen, J., Bradley, A.: Analysis/synthesis filter bank design based on time domain aliasing cancellation. IEEE Transactions on Acoustics, Speech, and Signal Processing 34(5), 1153-1161 (1986)
22. Strutt (Lord Rayleigh), J.W.: Acoustical observations I. Philosophical Magazine 3, 456-457 (1877)
23. Strutt (Lord Rayleigh), J.W.: On the acoustic shadow of a sphere. Philosophical Transactions of the Royal Society of London 203A, 87-97 (1904)
24. Thiede, T., Treurniet, W., Bitto, R., Schmidmer, C., Sporer, T., Beerends, J., Colomes, C.: PEAQ - the ITU standard for objective measurement of perceived audio quality. Journal of the Audio Engineering Society 48(1), 3-29 (2000)
25. Tournery, C., Faller, C.: Improved time delay analysis/synthesis for parametric stereo audio coding. Journal of the Audio Engineering Society 29(5), 490-498 (2006)
26. Vincent, E., Gribonval, R., Plumbley, M.D.: Oracle estimators for the benchmarking of source separation algorithms. Signal Processing 87, 1933-1950 (2007)
27. Viste, H.: Binaural Localization and Separation Techniques. Ph.D. thesis, Ecole
Polytechnique Federale de Lausanne, Switzerland (2004)
28. Woodworth, R.S.: Experimental Psychology. Holt, New York (1954)
29. Yılmaz, Ö., Rickard, S.: Blind separation of speech mixtures via time-frequency masking. IEEE Transactions on Signal Processing 52(7), 1830-1847 (2004)
1 Introduction
Music generation has more and more uses in today's media. Be it in computer games, interactive music performances, or interactive films, the emotional effect of the music is primordial in the appreciation of the media. While traditionally the music has been generated from pre-recorded loops that are mixed on-the-fly, or recorded by traditional orchestras, a better understanding of, and better models for, generative music are believed to push interactive generative music into multimedia. Papadopoulos
and Wiggins (1999) gave an early overview of the methods of algorithmic
composition, deploring that the music that they produce is meaningless: the
computers do not have feelings, moods or intentions. While vast progress has been
made in the decade since this statement, there is still room for improvement.
The cognitive understanding of musical time perception is the basis of the work
presented here. According to Kühl (2007), this memory can be separated into three
time-scales, the short, microtemporal, related to microstructure, the mesotemporal,
related to gesture, and the macrotemporal, related to form. These time-scales are
named (Kühl and Jensen 2008) subchunk, chunk and superchunk: subchunks extend from 30 ms to 300 ms; the conscious mesolevel of chunks from 300 ms to 3 sec; and the reflective macrolevel of superchunks from 3 sec to roughly 30-40 sec.
The subchunk is related to individual notes, the chunk to meter and gesture, and the
superchunk is related to form. The superchunk was analyzed and used in a generative model in Kühl and Jensen (2008), and the chunks were analyzed in Jensen and Kühl (2009). Further analysis of the implications of how temporal perception is
related to the durations and timing of existing music, together with anatomical and perceptual findings from the literature, is given in Section 2, along with an overview of previous work on rhythm. Section 3 presents the proposed model for the inclusion of pitch gestures in music generation using statistical methods, and Section 4 discusses the integration of the pitch gesture in the generative music model. Finally, Section 5 offers a
conclusion.
(>400 msec) and temps courts, and two-to-one ratios are only found between temps longs and courts. As for natural tempo, when subjects are asked to reproduce temporal
intervals, they tend to overestimate short intervals (making them longer) and underestimate long intervals (making them shorter). At an interval of about 500 msec to
600 msec, there is little over- or under-estimation. However, there are large
differences across individuals: the spontaneous tempo is found to be between 1.1 and 5 taps per second, with 1.7 taps per second occurring most often. There are also many
spontaneous motor movements that occur at the rate of approximately 2/sec, such as
walking, sucking in the newborn, and rocking.
Friberg (1991) and Widmer (2002) give rules for how the dynamics and timing should be changed according to the musical position of the notes. Dynamic changes include a 6 dB increase (doubling), and up to 100 msec deviations of the duration, depending on the musical position of the notes. Regarding these timing changes, Snyder (2000) indicates the categorical perception of beats, measures and patterns. The perception of timing deviations is an example of within-category distinctions.
Even with large deviations from the nominal score, the notes are recognized as falling
on the beats.
As for melodic perception, Thomassen (1982) investigated the role of intervals as
melodic accents. In a controlled experiment, he modeled the anticipation using an
attention span of three notes, and found that the accent perception is described fairly
well. The first of two opposite frequency changes gives the strongest accentuation.
Two changes in the same direction are equally effective. The larger of two changes is more powerful, as are frequency rises compared to frequency falls.
Fig. 1. Different shapes of a chunk. Positive (a-c) or negative arches (g-i), rising (a,d,g) or
falling slopes (c,f,i).
Fig. 2. Note (top) and interval probability density function obtained from The Digital Tradition
folk database
pitch of music. According to Vos and Troost (1989), the smaller intervals occur more
often in descending form, while the larger ones occur mainly in ascending form.
However, since the slope and arch are modelled in this work, the pdf of the intervals is mirrored and added around zero, and subsequently weighted and copied back to recreate the full interval pdf. This later makes it possible to create a melodic contour with given slope and arch characteristics, as detailed below.
In order to generate pitch contours with gestures, the model in figure 1 is used. For the pitch contour, only the neutral gesture (e) in figure 1, the falling and rising slopes (d) and (f), and the positive and negative arches (b) and (h) are modeled here.
The gestures are obtained by weighting the positive and negative slopes of the interval probability density function with a weight w,

pdf_i(x) = (1 - w) pdf_i+(x) for x ≥ 0,   pdf_i(x) = w pdf_i+(-x) for x < 0    (1)
Here, pdfi+ is the mirrored/added positive interval pdf, and w is the weight. If
w=0.5, a neutral gesture is obtained, and if w<0.5, a positive slope is obtained, and if
w>0.5, a negative slope is obtained. In order to obtain an arch, the value of the weight is changed to w = 1 - w in the middle of the gesture.
In order to obtain a musical scale, the probability density function for the intervals
(pdf_i) is multiplied by a suitable pdf_s for the scale, such as the one illustrated in figure 2 (top),

pdf(n) = pdf_i(n - n0) · pdf_s(n)    (2)
As pdfs is only defined for one octave, it is circularly repeated. The interval
probabilities, pdfi, are shifted for each note n0. This is done under the hypothesis that
the intervals and scale notes are independent. So as to retain the register,
approximately, a register weight wr is further multiplied into the pdf. This weight is one
for one octave, and decreases exponentially on both sides, in order to lower the
possibility of obtaining notes far from the original register.
In order to obtain successive notes, the cumulative density function, cdf, is calculated from eq. (2) and used to relate a uniform random variable to the note choice. If r is a random variable with uniform distribution in the interval (0,1), then the next note n0 can be found as the index of the first occurrence of cdf > r.
Examples of pitch contours obtained by setting w=0, and w=1, respectively, are
shown in figure 3. The rising and falling pitches are reset after each gesture, in order
to stay at the same register throughout the melody.
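The sampling scheme can be sketched in a few lines of Python, under the assumptions made above for eqs. (1) and (2) (weighted interval pdf, multiplied by a circularly repeated scale pdf and a register weight, then sampled through the cdf). The densities used here are toy stand-ins, not the ones estimated from The Digital Tradition database.

```python
import numpy as np

rng = np.random.default_rng(0)
MAX_INT = 12                                   # largest interval considered (one octave)

# Toy stand-ins for the estimated densities (the paper fits these on a folk-song corpus).
pdf_i_pos = np.exp(-0.4 * np.arange(1, MAX_INT + 1))        # mirrored/added positive interval pdf
pdf_scale = np.array([1, 0, .6, 0, .8, .7, 0, .9, 0, .7, 0, .5], float)  # major-like scale pdf
pdf_scale /= pdf_scale.sum()

def interval_pdf(w):
    """Eq. (1): weight the descending (w) and ascending (1 - w) halves of the interval pdf."""
    desc = w * pdf_i_pos[::-1]
    asc = (1.0 - w) * pdf_i_pos
    pdf = np.concatenate([desc, [0.0], asc])   # intervals -MAX_INT .. +MAX_INT
    return pdf / pdf.sum()

def next_note(n0, w, lo=48, hi=84, register=60):
    """Eq. (2) + cdf sampling: combine interval, scale and register weights around note n0."""
    notes = np.arange(lo, hi)
    p_int = np.zeros(len(notes))
    valid = np.abs(notes - n0) <= MAX_INT
    p_int[valid] = interval_pdf(w)[(notes - n0)[valid] + MAX_INT]
    p_scale = pdf_scale[notes % 12]            # scale pdf circularly repeated over octaves
    w_r = np.where(np.abs(notes - register) <= 6, 1.0,
                   np.exp(-0.3 * (np.abs(notes - register) - 6)))
    pdf = p_int * p_scale * w_r
    cdf = np.cumsum(pdf) / pdf.sum()
    return notes[np.searchsorted(cdf, rng.uniform())]   # first index with cdf > r

# Usage: a rising-slope gesture (w < 0.5) of seven notes starting from middle C.
melody, n = [60], 60
for _ in range(6):
    n = next_note(n, w=0.2)
    melody.append(int(n))
print(melody)
```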
The positive and negative slopes are easily recognized when listening to the
resulting melodies, because of the abrupt pitch fall at the end of each gesture. The
arches, in comparison, are more in need of loudness and/or brightness variations in
order to make them perceptually recognized. Without this, a positive slope can be confused with a negative arch that is shifted in time, or with a positive or negative slope, likewise shifted in time.
likewise shifted in time. Normally, an emphasis at the beginning of each gesture is
sufficient for the slopes, while the arches may be in need of an emphasis at the peak
of the arch as well.
Fig. 3. Pitch contours of four melodies with positive arch, rising slope, negative arch and
falling slope
Fig. 4. The generative model including meter, gesture and form. Structural changes of the note values, the intensity and the rhythm are made approximately every 30 seconds, and gesture changes are made on average every seven notes
The notes are created using a simple envelope model and the synthesis method
dubbed brightness creation function (bcf, Jensen 1999) that creates a sound with
exponentially decreasing amplitudes, which allows continuous control of the brightness. The accent affects the note, so that the loudness and brightness are doubled, and
the duration is increased by 25 %, with 75% of the elongation made by advancing the
start of the note, as found in Jensen (2010).
These findings are put into a generative model of tonal music. A subset of notes
(3-5) is chosen at each new form (superchunk), together with a new dynamic level. At
the chunk level, new notes are created in a metrical loop, and the gestures are added
to the pitch contour and used for additional gesture emphasis. Finally, at the
microtemporal (subchunk) level, expressive deviations are added in order to render
the loops musical. The interaction of the rigid meter with the looser pitch gesture gives the generated notes a more musical sense, through the uncertainty and the double stream that result. The pure rising and falling pitch gestures are still clearly perceptible, while the arches are less present. By setting w in eq. (1) to something in between (0,1), e.g. 0.2 or 0.8, more realistic, agreeable rising and falling gestures result. Still, the arches are more natural to the ear, while the rising and falling
demand more attention, in particular perhaps the rising gestures.
5 Conclusion
The automatic generation of music is in need of a model to render the music expressive. This model is found using knowledge from studies of the time perception of music, and further studies of the cognitive and perceptual aspects of rhythm. Indeed, the generative model consists of three sources, corresponding to the immediate microtemporal, the present mesotemporal and the long-term memory macrotemporal. This corresponds to the note, the gesture and the form in music. While a single stream in each of the sources may not be sufficient, so far the model incorporates the
macrotemporal superchunk, the metrical mesotemporal chunk and the microtemporal
expressive enhancements. The work presented here has introduced gestures in the
pitch contour, corresponding to the rising and falling slopes, and to the positive and
negative arches, which adds a perceptual stream to the more rigid meter stream.
The normal beat is given by different researchers to be approximately 100 BPM,
and Fraisse (1982) furthermore shows the existence of two main note durations, one
above and one below 0.4 secs, with a ratio of two. Indications as to subjective time,
given by Zwicker and Fastl (1999), are yet to be investigated, but these may well create uneven temporal intervals in conflict with the pulse.
The inclusion of the pitch gesture model certainly, in the author's opinion, renders
the music more enjoyable, but more work remains before the generative model is
ready for general-purpose uses.
References
1. Fraisse, P.: Rhythm and Tempo. In: Deutsch, D. (ed.) The Psychology of Music, 1st edn., pp. 149-180. Academic Press, New York (1982)
2. Friberg, A.: Performance Rules for Computer-Controlled Contemporary Keyboard Music. Computer Music Journal 15(2), 49-55 (1991)
3. Gordon, J.W.: The perceptual attack time of musical tones. Journal of the Acoustical Society of America, 88-105 (1987)
4. Handel, S.: Listening. MIT Press, Cambridge (1989)
5. Huron, D.: The Melodic Arch in Western Folk songs. Computing in Musicology 10, 3-23 (1996)
6. Jensen, K.: Timbre Models of Musical Sounds, PhD Dissertation, DIKU Report 99/7 (1999)
7. Jensen, K.: Investigation on Meter in Generative Modeling of Music. In: Proceedings of the CMMR, Malaga, June 21-24 (2010)
8. Jensen, K., Kühl, O.: Towards a model of musical chunks. In: Ystad, S., Kronland-Martinet, R., Jensen, K. (eds.) CMMR 2008. LNCS, vol. 5493, pp. 81-92. Springer, Heidelberg (2009)
9. Kühl, O., Jensen, K.: Retrieving and recreating musical form. In: Kronland-Martinet, R., Ystad, S., Jensen, K. (eds.) CMMR 2007. LNCS, vol. 4969, pp. 270-282. Springer, Heidelberg (2008)
10. Kühl, O.: Musical Semantics. Peter Lang, Bern (2007)
11. Lerdahl, F., Jackendoff, R.: A Generative Theory of Tonal Music. The MIT Press,
Cambridge (1983)
12. Malbrán, S.: Phases in Children's Rhythmic Development. In: Zatorre, R., Peretz, I. (eds.) The Biological Foundations of Music. Annals of the New York Academy of Sciences (2000)
13. Papadopoulos, G., Wiggins, G.: AI methods for algorithmic composition: a survey, a critical view and future prospects. In: AISB Symposium on Musical Creativity, pp. 110-117 (1999)
14. Patel, A., Peretz, I.: Is music autonomous from language? A neuropsychological appraisal. In: Deliège, I., Sloboda, J. (eds.) Perception and cognition of music, pp. 191-215. Psychology Press, Hove (1997)
15. Samson, S., Ehrl, N., Baulac, M.: Cerebral Substrates for Musical Temporal Processes.
In: Zatorre, R., Peretz, I. (eds.) The Biological Foundations of Music. Annals of the New
York Academy of Sciences (2000)
16. Snyder, B.: Music and Memory. An Introduction. The MIT Press, Cambridge (2000)
17. The Digital Tradition (2010), http://www.mudcat.org/AboutDigiTrad.cfm
(visited December 1, 2010)
18. Thomassen, J.M.: Melodic accent: Experiments and a tentative model. J. Acoust. Soc. Am. 71(6), 1596-1605 (1982)
19. Vos, P.G., Troost, J.M.: Ascending and Descending Melodic Intervals: Statistical Findings and Their Perceptual Relevance. Music Perception 6(4), 383-396 (1989)
20. Widmer, G.: Machine discoveries: A few simple, robust local expression principles. Journal of New Music Research 31, 37-50 (2002)
21. Zwicker, E., Fastl, H.: Psychoacoustics: facts and models, 2nd edn. Springer series in
information sciences. Springer, Berlin (1999)
1 Università di Firenze, Dip. di Matematica U. Dini,
Viale Morgagni, 67/a - 50134 Florence - ITALY
2 IRCAM - CNRS STMS, Analysis/Synthesis Team,
1, Place Igor-Stravinsky - 75004 Paris - FRANCE
{marco.liuni,axel.roebel,xavier.rodet}@ircam.fr
marco.romito@math.unifi.it
http://www.ircam.fr/anasyn.html
1 Introduction
Far from being restricted to entertainment, sound processing techniques are required in many different domains: they find applications in medical sciences, security instruments, and communications, among others. The most challenging class of signals to consider is indeed music: the completely new perspective opened by contemporary music, assigning a fundamental role to concepts such as noise and
timbre, gives musical potential to every sound.
The standard techniques of digital analysis are based on the decomposition
of the signal in a system of elementary functions, and the choice of a specific system necessarily has an influence on the result. Traditional methods based on
single sets of atomic functions have important limits: a Gabor frame imposes a
fixed resolution over the whole time-frequency plane, while a wavelet frame gives a
strictly determined variation of the resolution: moreover, the user is frequently
asked to define the analysis window features himself, which in general is not a
simple task even for experienced users. This motivates the search for adaptive
methods of sound analysis and synthesis, and for algorithms whose parameters
are designed to change according to the analyzed signal features. Our research
is focused on the development of mathematical models and tools based on the
local automatic adaptation of the system of functions used for the decomposition
of the signal: we are interested in a complete framework for analysis, spectral
transformation and re-synthesis; thus we need to define an efficient strategy to
reconstruct the signal through the adapted decomposition, which must give a
perfect recovery of the input if no transformation is applied.
Here we propose a method for local automatic time-adaptation of the Short
Time Fourier Transform window function, through a minimization of the Rényi entropy [22] of the spectrogram; we then define a re-synthesis technique with
an extension of the method proposed in [11]. Our approach can be presented
schematically in three parts:
1. a model for signal analysis exploiting concepts of Harmonic Analysis, and
Frame Theory in particular: it is a generally highly redundant decomposing
system belonging to the class of multiple Gabor frames [8],[14];
2. a sparsity measure defined on time-frequency localized subsets of the analysis coefficients, in order to determine local optimal concentration;
3. a reduced representation obtained from the original analysis using the information about optimal concentration, and a synthesis method through an
expansion in the reduced system obtained.
We have realized a first implementation of this scheme in two different versions: for both of them a sparsity measure is applied on subsets of analysis coefficients covering the whole frequency dimension, thus defining a time-adapted analysis of the signal. The main difference between the two concerns the first part of the model, that is, the single frames composing the multiple Gabor frame. This is a key point, as the first and third parts of the scheme are strictly linked: the frame
used for re-synthesis is a reduction of the original multi-frame, so the entire
model depends on how the analysis multi-frame is designed. The section Frame
Theory in Sound Analysis and Synthesis treats this part of our research in more
detail.
The second point of the scheme is related to the measure applied on the coefficients of the analysis within the multi-frame to determine local best resolutions.
We consider measures borrowed from Information Theory and Probability Theory according to the interpretation of the analysis within a frame as a probability
density [4]: our model is based on a class of entropy measures known as Rényi entropies, which extend the classical Shannon entropy. The fundamental idea
is that minimizing the complexity or information over a set of time-frequency
representations of the same signal is equivalent to maximizing the concentration and peakiness of the analysis, thus selecting the best resolution trade-off [1]: in the section Rényi Entropy of Spectrograms we describe how a sparsity measure can consequently be defined through an information measure.
2 Frame Theory in Sound Analysis and Synthesis
When analyzing a signal through its decomposition, the features of the representation are influenced by the decomposing functions; Frame Theory (see [3],[12] for detailed mathematical descriptions) allows a unified approach when dealing with different bases and systems, studying the properties of the operators
that they identify. The concept of frame extends the one of orthonormal basis in
a Hilbert space, and it provides a theory for the discretization of time-frequency
densities and operators [8], [20], [2]. Both the STFT and the Wavelet transform
can be interpreted within this setting (see [16] for a comprehensive survey of
theory and applications).
Here we summarize the basic definitions and theorems, and outline the fundamental step consisting in the introduction of Multiple Gabor Frames, which
is comprehensively treated in [8]. The problem of standard frames is that the
decomposing atoms are defined from the same original function, thus imposing a limit on the type of information that one can deduce from the analysis coefficients; if we were able to consider frames where several families of atoms coexist, then we would have an analysis with variable information, at the price
of a higher redundancy.
2.1
Given a Hilbert space H, seen as a vector space over C with its own scalar product, we consider in H a set of vectors {φ_γ}, γ ∈ Γ, where the index set Γ may be infinite and γ can also be a multi-index. The set {φ_γ} is a frame for H if there exist two positive non-zero constants A and B, called frame bounds, such that for all f ∈ H,

A ‖f‖² ≤ Σ_{γ∈Γ} |⟨f, φ_γ⟩|² ≤ B ‖f‖²    (1)
For any frame {φ_k}_{k∈Z} there exist dual frames {φ̃_k}_{k∈Z} such that, for all f ∈ L²(R),

f = Σ_{k∈Z} ⟨f, φ_k⟩ φ̃_k = Σ_{k∈Z} ⟨f, φ̃_k⟩ φ_k .    (3)

A Gabor frame is obtained from a single window function g, translated in time and modulated in frequency over a regular lattice:

g_{n,k}(t) = g(t − u_n) e^{2πi b k t} ,    (4)
{g_{n,k} : (n, k) ∈ Z²} ;    (5)
the nodes are the centers of the Heisenberg boxes associated to the windows in
the frame. The lattice has to satisfy certain conditions for {gn,k } to be a frame
[7], which impose limits on the choice of the time and frequency steps: for certain
choices [6] which are often adopted in standard applications, the frame operator
takes the form of a multiplication,
U f(t) = b⁻¹ Σ_{n∈Z} |g(t − u_n)|² f(t) ,    (6)
Thus we see that the frame bounds also provide information on the redundancy
of the decomposition of the signal within the frame.
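As a numerical illustration of Equation (6), the following Python fragment evaluates the diagonal of the frame operator for a sampled window and hop, and estimates the frame bounds A and B from its extrema (a painless-case sketch with an assumed Hanning window; parameter values are ours).

```python
import numpy as np

def frame_operator_diagonal(win, hop, length, freq_step):
    """Diagonal of the Gabor frame operator U f(t) = b^-1 * sum_n |g(t - u_n)|^2 f(t)."""
    diag = np.zeros(length)
    for start in range(0, length - len(win) + 1, hop):
        diag[start:start + len(win)] += np.abs(win) ** 2
    return diag / freq_step                      # the b^-1 factor

# Hanning window of 1024 samples, 75% overlap, frequency step b = 1/len(win).
win = np.hanning(1024)
diag = frame_operator_diagonal(win, hop=256, length=16384, freq_step=1.0 / 1024)
interior = diag[1024:-1024]                      # ignore boundary effects
print("estimated frame bounds: A ~", interior.min(), " B ~", interior.max())
```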
2.2 Multiple Gabor Frames
A multiple Gabor frame is obtained by taking the union of several Gabor systems, each built from its own window g_l and lattice steps (a_l, b_l):

{g_{n(l),k}(t) = g_l(t − n a_l) e^{2πi b_l k t} : l ∈ L, n, k ∈ Z} .    (9)

For such a system, in the painless case, the frame operator again takes the form of a multiplication,

U f(t) = Σ_l (1/b_l) Σ_{n(l)} |g_{n(l)}(t)|² f(t) .    (10)

Here, if N(s) = Σ_l (1/b_l) Σ_{n(l)} |g_{n(l)}(s)|² ≥ 1, then U is invertible and the set (9) is a frame whose dual frame is given by

g̃_{n(l),k}(t) = (1/N) g_{n(l)}(t) e^{2πi b_l k t} .    (11)
3 Rényi Entropy of Spectrograms
The Rényi entropy of order α (α > 0, α ≠ 1) of a spectrogram PS_f, restricted to a time-frequency region R, is defined as

H_R^α(PS_f) = 1/(1−α) log₂ ∫∫_R ( PS_f(u, ξ) / ∫∫_R PS_f(u′, ξ′) du′ dξ′ )^α du dξ ,    (14)

where R ⊆ R², and we omit its indication if equality holds. Given a discrete spectrogram with time step a and frequency step b as in (13), we consider R as a rectangle of the time-frequency plane R = [t1, t2] × [ν1, ν2] ⊆ R². It identifies a sequence of points G_ab, where G = {(n, k) ∈ Z² : t1 ≤ na ≤ t2, ν1 ≤ kb ≤ ν2}. As a discretization of the original continuous spectrogram, every sample of PS_f[n, k] is related to a time-frequency region of area ab; we thus obtain the discrete Rényi entropy measure directly from (14),

H_G^α[PS_f] = 1/(1−α) log₂ Σ_{[n,k]∈G} ( PS_f[n, k] / Σ_{[n′,k′]∈G} PS_f[n′, k′] )^α + log₂(ab) .    (15)
A first basic property of Rényi entropies is their monotonicity with respect to the order: for a discrete probability density P,

α₁ ≤ α₂  ⇒  H_{α₁}(P) ≥ H_{α₂}(P) .    (16)
As we are working with finite discrete densities, we can also consider the case α = 0, which is simply the logarithm of the number of elements in P; as a consequence, H₀(P) ≥ H_α(P) for every admissible order α.
A third basic fact is that for every order α the Rényi entropy H_α is maximum
when P is uniformly distributed, while it is minimum and equal to zero when P
has a single non-zero value.
All of these results give useful information on the values of different measures on a single density P as in (15), while the relations between the entropies of two different densities P and Q are in general hard to determine analytically; in our problem, P and Q are two spectrograms of a signal in the same time-frequency area, based on two window functions with different scaling as in (8). In some
basic cases such a relation is achievable, as shown in the following example.
3.1
(17)
The sparsity measure we are using chooses as best window the one which minimizes the entropy measure: we deduce from (17) that it is the one obtained with
the largest scaling factor available, therefore with the largest time-support. This
is coherent with our expectation as stationary signals, such as sinusoids, are best
analyzed with a high frequency resolution, because time-independency allows a
small time resolution. Moreover, this is true for any order α used for the entropy
calculus. Symmetric considerations apply whenever the spectrogram of a signal
does not depend on frequency, as for impulses.
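The core selection rule can be sketched as follows in Python: compute spectrograms of the same segment with differently scaled windows, evaluate the discrete Rényi entropy of Equation (15) on each, and keep the window of minimum entropy. Plain STFTs from SciPy stand in for the multiple Gabor frame of the actual algorithm, and the window sizes and order α are illustrative.

```python
import numpy as np
from scipy.signal import stft

def renyi_entropy(spectrogram, alpha, cell_area):
    """Discrete Renyi entropy of Eq. (15) for a (power) spectrogram patch."""
    p = spectrogram / spectrogram.sum()
    return np.log2((p ** alpha).sum()) / (1.0 - alpha) + np.log2(cell_area)

def best_window(segment, fs, sizes=(512, 1024, 2048, 4096), alpha=0.7):
    """Return the window length whose spectrogram of `segment` is sparsest."""
    entropies = {}
    for n in sizes:
        f, t, Z = stft(segment, fs=fs, window='hann', nperseg=n, noverlap=n // 2)
        cell_area = (n // 2) / fs * (fs / n)     # time step a times frequency step b
        entropies[n] = renyi_entropy(np.abs(Z) ** 2, alpha, cell_area)
    return min(entropies, key=entropies.get), entropies

# Usage: a pure sinusoid should select the longest window (best frequency resolution).
fs = 44100
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
size, h = best_window(x, fs)
print(size, h)
```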
3.2 The Parameter α
Fig. 1. Rényi entropy evaluations of the D_M vectors with varying α; the distribution becomes flatter as M increases
3.3
A last remark regards the dependency of (15) on the time and frequency step a
and b used for the discretization of the spectrogram. When considering signals as
finite vectors, a signal and its Fourier Transform have the same length. Therefore, in the STFT the window length determines the number of frequency points, while the sampling rate sets the frequency values: the definition of b is thus implicit in the window choice. Actually, the FFT algorithm allows one to specify a number
of frequency points larger than the signal length: further frequency values are
obtained as an interpolation between the original ones by properly adding zero
values to the signal. If the sampling rate is fixed, this procedure causes a smaller
b as a consequence of a larger number of frequency points. We have numerically
verified that such a variation of b has no impact on the entropy calculus, so that
the FFT size can be set according to implementation needs.
Regarding the time step a, we are working on the analytical demonstration
of a largely verified evidence: as long as the decomposing system is a frame, the
entropy measure is invariant to redundancy variation, so the choice of a can be
ruled by considerations on the invertibility of the decomposing frame without
losing coherence between the information measures of the different analyses. This is a key point, as it states that the sparsity measure obtained allows a total independence between the hop sizes of the different analyses: with the implementation of proper structures to handle multi-hop STFTs we have obtained a more efficient algorithm in comparison with those imposing a fixed hop size, such as [15] and the first version of the one we have realized.
The mother window h we consider is defined in (19) as the indicator function of a specified interval, but it is obviously possible to generalize the results thus obtained to the entire class of compactly supported window functions. In both versions of our algorithm we create a multiple Gabor frame as in (5), using as mother functions some scaled versions of h, obtained as in (8) with a finite set of positive real scaling factors L.
We consider consecutive segments of the signal, and for each segment we calculate |L| spectrograms with the |L| scaled windows; the length of the analysis segment and the overlap between two consecutive segments are given as parameters.
In the first version of the algorithm the different frames composing the multi-frame have the same time step a and frequency step b: this guarantees that, for each signal segment, the different frames have Heisenberg boxes whose centers lie on the same lattice of the time-frequency plane, as illustrated in Figure 2. To
Fig. 2. An analysis segment: time locations of the Heisenberg boxes associated to the multi-frame used in the first version of our algorithm.
Fig. 3. (Spectrograms of the analyzed signal; axes: frequency vs. time.)
guarantee that all the |L| scaled windows constitute a frame when translated and modulated according to this global lattice, the time step a must be set to the hop size assigned to the smallest window frame. On the other hand, as the FFT of a discrete signal has the same number of points as the signal itself, the frequency step b has to correspond to the FFT size of the largest window analysis: for the smaller ones, zero-padding is performed.
Fig. 4. Example of an adaptive analysis performed by the first version of our algorithm with four Hanning windows of different sizes (512, 1024, 2048 and 4096 samples) on a B4 note played by a marimba: on top, the best window chosen as a function of time; at the bottom, the adaptive spectrogram. The entropy order is α = 0.7 and each analysis segment contains twenty-four analysis frames with a sixteen-frame overlap between consecutive segments.
Fig. 5. An analysis segment: time locations of the Heisenberg boxes associated to the
multi-frame used in the second version of our algorithm
The analysis of such a sound needs to combine the benefit of a good time resolution at the strike with that of a good frequency resolution on the harmonic resonance. This is fully provided by the algorithm, as shown in the adaptive spectrogram at the bottom of Figure 4. Moreover, we see that the pre-echo of the analysis at the bottom of Figure 3 is completely removed in the adapted spectrogram.
The main difference in the second version of our algorithm concerns the individual frames composing the multi-frame, which have the same frequency step b but different time steps {a_l : l ∈ L}: the smallest and largest window sizes are given as parameters together with |L|, the number of different windows composing the multi-frame, and the global overlap needed for the analyses. The algorithm fixes the intermediate sizes so that, for each signal segment, the different frames have the same overlap between consecutive windows, and thus the same redundancy.
This choice greatly reduces the computational cost by avoiding unnecessarily small hop sizes for the larger windows and, as we have observed in the previous section, it does not affect the entropy evaluation. Such a structure generates an irregular time disposition of the multi-frame elements in each signal segment, as illustrated in Figure 5; in this way we also avoid the problem of parts of the signal not being shared between the systems, but the boundary parts still have a different influence depending on the analysis frame: the beginning and the end of the signal segment have a higher energy when windowed in the smaller frames. This is avoided with a preliminary weighting: the beginning and the end of each signal segment are windowed respectively with the first and second half of the largest analysis window.
As for the first implementation, the weighting does not concern the decomposition for re-synthesis purposes, but only the analyses used for the entropy evaluations.
Fig. 6. Example of an adaptive analysis performed by the second version of our algorithm with eight Hanning windows of different sizes from 512 to 4096 samples, on a B4 note played by a marimba sampled at 44.1 kHz: on top, the best window chosen as a function of time; at the bottom, the adaptive spectrogram. The entropy order is α = 0.7 and each analysis segment contains four frames of the largest window analysis with a two-frame overlap between consecutive segments.
After the pre-weighting, the algorithm follows the same steps described above: calculation of the |L| local spectrograms, evaluation of their entropies, selection of the window providing the minimum entropy, and computation of the adapted spectrogram with the best window at each time point, thus creating an analysis with time-varying resolution and hop size.
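A compact sketch of this selection loop (Python with SciPy's STFT; window sizes, hop and entropy order mirror the values quoted in this section, while everything else is an assumption of the example rather than the authors' implementation) could read:

```python
import numpy as np
from scipy.signal import stft

def renyi(spec, alpha=0.7):
    # Discrete Renyi entropy of Eq. (15) without the log2(ab) grid term.
    p = spec / spec.sum()
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

def best_window(segment, sr, sizes=(512, 1024, 2048, 4096), alpha=0.7):
    """Return the window size whose spectrogram of `segment` has minimum entropy."""
    hop = min(sizes) // 2                    # common hop: that of the smallest window
    entropies = {}
    for n in sizes:
        # Same FFT size and hop for all windows, so the log2(ab) term of (15) is a
        # common additive constant and can be dropped from the comparison.
        _, _, Z = stft(segment, fs=sr, window='hann', nperseg=n,
                       noverlap=n - hop, nfft=max(sizes))
        entropies[n] = renyi(np.abs(Z) ** 2, alpha)
    return min(entropies, key=entropies.get)

# A stationary sinusoid favours the largest window (best frequency resolution).
sr = 44100
t = np.arange(0, 0.2, 1 / sr)
print(best_window(np.sin(2 * np.pi * 440 * t), sr))
```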
In Figure 6 we give a first example of an adaptive analysis performed by the second version of our algorithm with eight Hanning windows of different sizes: the sound is still the B4 note of a marimba, and we can see that the two versions give very similar results. Thus, if the considered application does not specifically require a fixed hop size for the overall analysis, the second version is preferable as it greatly reduces the computational cost without affecting the best window choice.
In Figure 8 we give a second example with a synthetic sound, a sinusoid with sinusoidal frequency modulation: as Figure 7 shows, a small window is best adapted where the frequency variation is fast compared to the window length; on the other hand, the largest window is better where the signal is almost stationary.
4.1 Re-synthesis Method
Fig. 8. Example of an adaptive analysis performed by the second version of our algorithm with eight Hanning windows of different sizes from 512 to 4096 samples, on a sinusoid with sinusoidal frequency modulation synthesized at 44.1 kHz: on top, the best window chosen as a function of time; at the bottom, the adaptive spectrogram. The entropy order is α = 0.7 and each analysis segment contains four frames of the largest window analysis with a three-frame overlap between consecutive segments.
Conclusions
We have presented an algorithm for time-adaptation of the spectrogram resolution, which can easily be integrated in existing frameworks for analysis, transformation and re-synthesis of an audio signal: the adaptation is locally obtained through an entropy minimization within a finite set of resolutions, which can be defined by the user or left at default values. The user can also specify the time duration and overlap of the analysis segments in which the entropy minimization is performed, in order to favor more or less discontinuous adapted analyses.
Future improvements of this method will concern the spectrogram adaptation in both the time and frequency dimensions: this will provide a decomposition of the signal into several layers of analysis frames, thus requiring an extension of the proposed technique for re-synthesis.
References
1. Baraniuk, R.G., Flandrin, P., Janssen, A.J.E.M., Michel, O.J.J.: Measuring Time-Frequency Information Content Using the Rényi Entropies. IEEE Trans. Inf. Theory 47(4) (2001)
2. Borichev, A., Gröchenig, K., Lyubarskii, Y.: Frame constants of Gabor frames near the critical density. J. Math. Pures Appl. 94(2) (2010)
Abstract. In this paper models and algorithms are presented for transcription of pitch and timings in polyphonic music extracts. The data
are decomposed framewise into the frequency domain, where a Poisson
point process model is used to write a polyphonic pitch likelihood function. From here Bayesian priors are incorporated both over time (to link
successive frames) and also within frames (to model the number of notes
present, their pitches, the number of harmonics for each note, and inharmonicity parameters for each note). Inference in the model is carried out
via Bayesian filtering using a powerful Sequential Markov chain Monte
Carlo (MCMC) algorithm that is an MCMC extension of particle filtering. Initial results with guitar music, both laboratory test data and
commercial extracts, show promising levels of performance.
Keywords: Automated music transcription, multi-pitch estimation,
Bayesian filtering, Poisson point process, Markov chain Monte Carlo,
particle filter, spatio-temporal dynamical model.
Introduction
The audio signal generated by a musical instrument as it plays a note is complex, containing multiple frequencies, each with a time-varying amplitude and
phase. However, the human brain perceives such a signal as a single note, with
associated high-level properties such as timbre (the musical texture) and
expression (loud, soft, etc.). A musician playing a piece of music takes as input
a score, which describes the music in terms of these high-level properties, and
produces a corresponding audio signal. An accomplished musician is also able to
reverse the process, listening to a musical audio signal and transcribing a score.
A desirable goal is to automate this transcription process. Further developments
in computer understanding of audio signals of this type can be of assistance
to musicologists; they can also play an important part in source separation systems, as well as in automated mark-up systems for content-based annotation of
music databases.
Perhaps the most important property to extract in the task of musical transcription is the note or notes playing at each instant. This will be the primary
2
2.1
Fig. 1. An example of a single note spectrum, with the associated median threshold
(using a window of 4 frequency bins) and peaks identified by the peak detection
algorithm (circles)
The resulting likelihood of the observed peak data takes the form

p(Y \mid \theta) = \prod_{k=1}^{K} \left[ y_k \,(1 - e^{-\lambda_k}) + (1 - y_k)\, e^{-\lambda_k} \right]   (3)

where Y = {y_1, y_2, ..., y_K} are the observed peak data in the K frequency bins, such that y_k = 1 if a peak is observed in the k-th bin, and y_k = 0 otherwise. It only remains to formulate the intensity function λ(f), and hence λ_k = ∫_{f ∈ k-th bin} λ(f) df. For this purpose, the Gaussian mixture model of Peeling et al. [8] is used. Note that in this formulation we can regard each harmonic of each note as an independent Poisson process itself, and hence, by the union property of Poisson processes, all of the individual Poisson intensities add to give a single overall intensity λ(f), as follows:
\lambda(f) = \sum_{j=1}^{N} \lambda_j(f) + \lambda_c   (4)

\lambda_j(f) = \sum_{h=1}^{H_j} \frac{A}{\sqrt{2\pi \sigma_{j,h}^2}} \exp\left( -\frac{(f - f_{j,h})^2}{2 \sigma_{j,h}^2} \right)   (5)
where j indicates the note number, h indicates the partial number, and N and H_j are the numbers of notes and of harmonics in each note, respectively. λ_c is a constant that accounts for detected clutter peaks due to noise and non-musical sounds. σ²_{j,h} = σ² h² sets the variance of each Gaussian. A and σ are constant parameters, chosen so as to give good performance on a set of test pieces. f_{j,h} is the frequency of the h-th partial of the j-th note, given by the inharmonic model [4]:

f_{j,h} = f_{0,j}\, h \sqrt{1 + B_j h^2}   (6)

f_{0,j} is the fundamental frequency of the j-th note, and B_j is the inharmonicity parameter for the note (of the order of 10^{-4}).
Three parameters for each note are variable and are to be determined by the inference engine: the fundamental frequency, the number of partials, and the inharmonicity. Moreover, the number of notes N is also treated as unknown in the fully Bayesian framework.
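As an illustration of how the likelihood (3) is evaluated from the intensity model (4)-(6), the following sketch (Python/NumPy; bin layout, parameter values and helper names are assumptions made for the example, not the authors' settings) computes the log-likelihood of a binary peak map:

```python
import numpy as np

def note_intensity(f, f0, n_partials, B, A=1.0, sigma=2.0):
    """Poisson intensity of one note over frequencies f, Eqs. (5)-(6)."""
    h = np.arange(1, n_partials + 1)
    f_jh = f0 * h * np.sqrt(1.0 + B * h ** 2)        # inharmonic partial frequencies
    var = (sigma * h) ** 2                            # sigma_{j,h}^2 = sigma^2 h^2
    return np.sum(A / np.sqrt(2 * np.pi * var)[:, None]
                  * np.exp(-(f[None, :] - f_jh[:, None]) ** 2 / (2 * var[:, None])),
                  axis=0)

def peak_log_likelihood(y, bin_edges, notes, clutter=1e-3):
    """Log of Eq. (3): y[k] = 1 if a peak was detected in bin k, else 0."""
    f = 0.5 * (bin_edges[:-1] + bin_edges[1:])        # bin centres
    widths = np.diff(bin_edges)
    lam_f = clutter + sum(note_intensity(f, *n) for n in notes)
    lam_k = lam_f * widths                            # midpoint approximation of the bin integral
    p_peak = 1.0 - np.exp(-lam_k)                     # P(at least one event in bin k)
    return np.sum(np.where(y == 1, np.log(p_peak), -lam_k))

# Toy usage: two notes given as (f0, number of partials, inharmonicity B).
edges = np.linspace(0.0, 2000.0, 201)
notes = [(220.0, 8, 1e-4), (330.0, 6, 1e-4)]
y = np.zeros(200, dtype=int)
print(peak_log_likelihood(y, edges, notes))
```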
2.2
The prior P(θ) over the unknown parameters in each time frame may be decomposed, assuming the parameters of different notes to be independent, as

P(\theta) = P(N) \prod_{j=1}^{N} P(\theta_j)   (7)

where θ_j collects the parameters of note j.
In fact, we have here assumed all priors to be uniform over their expected ranges, except for f_{0,j} and N, which are stochastically linked to their values in previous frames. To consider this linkage explicitly, we now introduce a frame number label t, denote the corresponding parameters for frame t by θ_t, and the frame peak data by Y_t. In order to carry out optimal sequential updating we require a transition density p(θ_t | θ_{t−1}), and assume that the {θ_t} process is Markovian. Then we can write the required sequential update as:

p(\theta_{t-1:t} \mid Y_{1:t}) \propto p(\theta_{t-1} \mid Y_{1:t-1}) \, p(\theta_t \mid \theta_{t-1}) \, p(Y_t \mid \theta_t)   (8)
Results
The methods have been evaluated on a selection of guitar music extracts, recorded both in the laboratory and taken from commercial recordings. See Fig. 2, in which three guitar extracts, two lab-generated ((a) and (b)) and one from a commercial recording ((c)), are processed. Note that a few spurious note estimates arise, particularly around instants of note change, and many of these have been removed by a post-processing stage which simply eliminates note estimates that last for a single frame. The results are quite accurate, agreeing well with manually obtained transcriptions.
When two notes an octave apart are played together, the upper note is not found; see the final chord of panel (a) in Figure 2. This is attributable to the two notes sharing many of the same partials, making discrimination difficult based on peak frequencies alone.
In the case of strong notes, the algorithm often correctly identifies up to 35 partial frequencies. In this regard, the use of inharmonicity modelling has proved successful: without this feature, the estimate of the number of harmonics is often lower, due to the inaccurate partial frequencies predicted by the linear model.
The effect of the sequential formulation is to provide a degree of smoothing when compared to the frame-wise algorithm. Fewer single-frame spurious notes appear, although these are not entirely removed, as shown in Figure 2. Octave errors towards the end of notes are also reduced.
The new algorithms have shown significant promise, especially given that the likelihood function takes account only of peak frequencies and not of amplitudes or other information that may be useful for a transcription system. The good performance obtained so far is a result of several novel modelling and algorithmic features, notably the formulation of a flexible frame-based model that can account robustly for inharmonicities, unknown numbers of notes and unknown numbers of harmonics in each note. A further key feature is the ability to link frames together via a probabilistic model; this makes the algorithm more robust in the estimation of continuous fundamental frequency tracks from the data. A final important component is the implementation through sequential MCMC, which allows us to obtain reasonably accurate inferences from the models as posed.
The models may be improved in several ways, and work is underway to address these issues. A major point is that the current Poisson model accounts only for the frequencies of the peaks present. It is likely that performance may be improved by including the peak amplitudes in the model; for example, this might make it possible to distinguish more robustly when two notes an octave apart are being played. Improvements are also envisaged in the dynamical prior linking one frame to the next, which is currently quite crudely formulated. Thus, further improvements will be possible if the dependency between frames is more carefully considered, incorporating melodic and harmonic principles to generate likely note and chord transitions over time. Ideally also, the algorithm should be able to run in real time, processing a piece of music as it is played. Currently, however, the Matlab-based processing runs at many times real time, and we will study the parallel processing possibilities (as a simple starting point, the MCMC runs can be split into several shorter parallel chains at each time frame within a parallel architecture).
References
1. Cemgil, A., Godsill, S.J., Peeling, P., Whiteley, N.: Bayesian statistical methods for audio and music processing. In: O'Hagan, A., West, M. (eds.) Handbook of Applied Bayesian Analysis. OUP (2010)
2. Davy, M., Godsill, S., Idier, J.: Bayesian analysis of polyphonic western tonal music. Journal of the Acoustical Society of America 119(4) (April 2006)
3. Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (eds.): Markov Chain Monte Carlo in Practice. Chapman and Hall, Boca Raton (1996)
4. Godsill, S.J., Davy, M.: Bayesian computational models for inharmonicity in musical instruments. In: Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY (October 2005)
5. Kashino, K., Nakadai, K., Kinoshita, T., Tanaka, H.: Application of the Bayesian probability network to music scene analysis. In: Rosenthal, D.F., Okuno, H. (eds.) Computational Audio Scene Analysis, pp. 115–137. Lawrence Erlbaum Associates, Mahwah (1998)
6. Klapuri, A., Davy, M.: Signal Processing Methods for Music Transcription. Springer, Heidelberg (2006)
7. Pang, S.K., Godsill, S.J., Li, J., Septier, F.: Sequential inference for dynamically evolving groups of objects. In: Barber, Cemgil, Chiappa (eds.) Inference and Learning in Dynamic Models. CUP (2009, to appear)
8. Peeling, P.H., Li, C., Godsill, S.J.: Poisson point process modeling for polyphonic music transcription. Journal of the Acoustical Society of America Express Letters 121(4), EL168–EL175 (2007)
Abstract. Separating multiple music sources from a single channel mixture is a challenging problem. We present a new approach to this problem based on non-negative matrix factorization (NMF) and note classification, assuming that the instruments used to play the sound signals are known a priori. The spectrogram of the mixture signal is first decomposed into building components (musical notes) using an NMF algorithm. The Mel frequency cepstrum coefficients (MFCCs) of both the decomposed components and the signals in the training dataset are extracted. The mean squared errors (MSEs) between the MFCC feature space of the decomposed music component and those of the training signals are used as the similarity measures for the decomposed music notes. The notes are then labelled with the corresponding type of instrument by the K nearest neighbors (K-NN) classification algorithm based on the MSEs. Finally, the source signals are reconstructed from the classified notes and the weighting matrices obtained from the NMF algorithm. Simulations are provided to show the performance of the proposed system.
Keywords: Non-negative matrix factorization, single-channel sound separation, Mel frequency cepstrum coefficients, instrument classification, K nearest neighbors, unsupervised learning.
Introduction
This section describes the details of the processes in our proposed sound source separation system. First, the single-channel mixture of music sources is decomposed into basic building blocks (musical notes) by applying the NMF algorithm. The NMF algorithm describes the mixture in the form of basis functions and their corresponding weights (coefficients), which represent the strength of each basis function in the mixture. The next step is to extract the feature vectors of the musical notes and then classify the notes into different source streams. Finally, the source signals are reconstructed by combining the notes with the same class labels. In this work, we assume that the instruments used to generate the music sources are known a priori. In particular, two kinds of instruments, i.e. piano and violin, were used in our study. The block diagram of our proposed system is depicted in Figure 1.
2.1
In many data analysis tasks, it is a fundamental problem to find a suitable representation of the data so that its underlying hidden structure may be revealed or displayed explicitly. NMF is a data-adaptive linear representation technique for 2-D matrices that has been shown to have such potential. Given a non-negative data matrix X, the objective of NMF is to find two non-negative matrices W and H [12], such that

X = WH   (1)
or, equivalently,

X = \sum_{r=1}^{R} w_r h_r   (2)

where w_r and h_r denote the r-th column of W and the r-th row of H, respectively, and R is the rank of the decomposition.
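As a concrete illustration of this decomposition step, the sketch below factorizes a non-negative matrix with the standard multiplicative updates for the Euclidean cost (the paper does not state which NMF variant is used, so this choice, like the parameter values, is an assumption of the example):

```python
import numpy as np

def nmf(X, R, n_iter=200, eps=1e-12):
    """Factorize a non-negative matrix X (freq x time) into W (freq x R) and
    H (R x time) using multiplicative updates for the Euclidean cost."""
    F, N = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, R)) + eps
    H = rng.random((R, N)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update activations
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update basis spectra
    return W, H

# Toy usage: a spectrogram-like matrix built from two "notes".
F, N, R = 64, 100, 2
w_true = np.abs(np.random.randn(F, R))
h_true = np.abs(np.random.randn(R, N))
X = w_true @ h_true
W, H = nmf(X, R)
print(np.linalg.norm(X - W @ H) / np.linalg.norm(X))   # small relative error
```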
Fig. 2. The contour plot of a sound mixture (i.e. the matrix X) containing two different musical notes, G4 and A3.
Fig. 3. The contour plots of the individual musical notes obtained by applying an NMF algorithm to the sound mixture X. The separated notes G4 and A3 are shown in the left and right plots, respectively.
2.2 Feature Extraction
Fig. 4. The 13-dimensional MFCC feature vectors calculated from two selected frames
of the four audio signals: (a) Piano..A0.wav, (b) Piano..B0.wav, (c) Violin.pizz.mf.sulG.C4B4.wav, and (d) Violin.pizz.pp.sulG.C4B4.wav. In each of the
four plots, the solid and dashed lines represent the two frames (i.e. the 400th and
900th frame), respectively.
Fig. 5. The 20-dimensional MFCC feature vectors calculated from two selected frames
of the four audio signals: (a) Piano..A0.wav, (b) Piano..B0.wav, (c) Violin.pizz.mf.sulG.C4B4.wav, and (d) Violin.pizz.pp.sulG.C4B4.wav. In each of the
four plots, the solid and dashed lines represent the two frames (i.e. the 400th and
900th frame), respectively.
Fig. 6. The 7-dimensional MFCC feature vectors calculated from two selected frames
of the four audio signals: (a) Piano..A0.wav, (b) Piano..B0.wav, (c) Violin.pizz.mf.sulG.C4B4.wav, and (d) Violin.pizz.pp.sulG.C4B4.wav. In each of the
four plots, the solid and dashed lines represent the two frames (i.e. the 400th and
900th frame), respectively.
of piano and violin notes. The basic steps in music note classification include preprocessing, feature extraction or selection, classifier design and optimization. The main steps used in our system are detailed in Table 1.
The main disadvantage of a classification technique based on simple majority voting is that classes with more frequent examples tend to dominate the K nearest neighbors when the neighbors are computed from a large number of training examples [5]. Therefore, the class with more frequent training examples tends to dominate the prediction of the new vector. One possible technique to solve this problem is to weight the classification based on the distance from the test pattern to each of its K nearest neighbors.
2.4 K-NN Classifier
This section briefly describes the K-NN classifier used in our algorithm. K-NN is a simple technique for pattern classification and is particularly important for non-parametric distributions. The K-NN classifier labels an unknown pattern x by the majority vote of its K nearest neighbors [5], [9]. The K-NN classifier belongs to a class of techniques based on non-parametric probability density estimation. Suppose there is a need to estimate the density function P(x) from a given dataset. In our case, each signal in the dataset is segmented into 999 frames, and a feature vector of 13 MFCC coefficients is computed for each frame. Therefore, the total number of examples in the training dataset is 52947. Similarly, an unknown pattern x is also a 13-dimensional MFCC feature vector whose label needs to be determined based on the majority vote of its nearest neighbors. The volume V around an unknown pattern x is selected such that the number of nearest neighbors (training examples) within V is 30. We are dealing with a two-class problem with prior probabilities P(ω_i). The measurement distribution of the patterns in class ω_i is denoted by P(x | ω_i). The posterior class probability P(ω_i | x) decides the label of an unknown feature vector of a separated note. The approximation of P(x) is given by the relation [5], [10]

P(x) \approx \frac{K}{NV}   (3)
where N is the total number of examples in the dataset, V is the volume surrounding the unknown pattern x, and K is the number of examples within V. The class prior probability depends on the number of examples of each class in the dataset,

P(\omega_i) = \frac{N_i}{N}   (4)

and the class-conditional density is approximated in the same way as P(x),

P(x \mid \omega_i) \approx \frac{K_i}{N_i V}   (5)

By Bayes' rule, the posterior probability is

P(\omega_i \mid x) = \frac{P(x \mid \omega_i) P(\omega_i)}{P(x)}   (6)

which, combining (3)-(5), reduces to

P(\omega_i \mid x) \approx \frac{K_i}{K}   (7)

The discriminant function g_i(x) = K_i / K assigns the class label to an unknown pattern x based on the majority of examples K_i of class ω_i within the volume V.
2.5 Parameter Selection
The most important parameter in the K-NN algorithm is the user-defined constant K. The best value of K depends on the data to be classified [5]. In general, the effect of noise on the classification may be reduced by selecting a higher value of K; the problem arises when a large value of K is used with less distinct boundaries between classes [31]. To select a good value of K, heuristic techniques such as cross-validation may be used. In the presence of noisy or irrelevant features the performance of the K-NN classifier may degrade severely [5]. The selection of feature scales according to their importance is another important issue, and much effort has been devoted to selecting or scaling the features in the best possible way. The optimal classification results are achieved for most datasets by selecting K = 10 or more.
2.6 Data Preparation
For the classification of the components separated from the mixture, the features, i.e. the MFCCs, are extracted from all the signals in the training dataset, and a label is attached to each feature vector according to its class (piano or violin). The labels of the feature vectors of the separated components are not known and need to be determined by classification. Each feature vector consists of 13 MFCCs. When computing the MFCCs, the training signals and the separated components are all divided into frames, each with a length of 40 ms, and a 50 percent overlap between frames is used to avoid discontinuities between neighboring frames. The similarity of the feature vectors of the separated components to the feature vectors obtained in the training process determines which class the separated notes belong to. This is achieved by the K-NN classifier: if the majority vote goes to the piano, then a piano label is assigned to the separated component, and vice versa.
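The following sketch mirrors this preparation and classification pipeline (Python, using librosa and scikit-learn as stand-ins; the frame length, overlap, number of MFCCs and K follow the values quoted above, while the function names and the silence-removal rule are assumptions of the example, not the authors' code):

```python
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def mfcc_frames(y, sr, n_mfcc=13):
    """13 MFCCs per 40 ms frame with 50% overlap; all-zero (silent) frames dropped."""
    n_fft = int(0.040 * sr)                  # 40 ms frames
    hop = n_fft // 2                          # 50 percent overlap
    feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                 n_fft=n_fft, hop_length=hop).T
    return feats[np.abs(feats).sum(axis=1) > 0]

def train_knn(train_signals, sr, k=30):
    """train_signals: list of (waveform, label) pairs, label 'piano' or 'violin'."""
    X, y = [], []
    for wav, label in train_signals:
        f = mfcc_frames(wav, sr)
        X.append(f)
        y += [label] * len(f)
    return KNeighborsClassifier(n_neighbors=k).fit(np.vstack(X), y)

def classify_component(component, sr, knn):
    """Label one NMF-separated component by majority vote over its frames."""
    votes = knn.predict(mfcc_frames(component, sr))
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]

# Usage (with waveforms loaded elsewhere):
# knn = train_knn(training_set, sr=44100); label = classify_component(note, 44100, knn)
```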
2.7 Evaluations
Two music sources (played by two different instruments, i.e. piano and violin) with different numbers of notes overlapping each other in the time domain were used to artificially generate an instantaneous mixture signal. The lengths of the piano and violin source signals are both 20 seconds, containing 6 and 5 notes respectively. The K-NN classifier constant was set to K = 30. The signal-to-noise ratio (SNR), defined as follows, was used to measure the quality of both the separated notes and the whole source signals,

SNR(m, j) = \frac{\sum_{s,t} [X_m]_{s,t}^2}{\sum_{s,t} \left( [X_m]_{s,t} - [X_j]_{s,t} \right)^2}   (8)

where s and t are the row and column indices of the matrices, respectively. The SNR was computed based on the magnitude spectrograms X_m and X_j of the m-th reference and the j-th separated component in order to prevent the reconstruction
Fig. 7. The collection of the audio features from a typical piano signal (i.e. Piano..A0.wav) in the training process. In total, 999 frames of features were computed.
Fig. 8. The collection of the audio features from a typical violin signal (i.e. Violin.pizz.pp.sulG.C4B4.wav) in the training process. In total, 999 frames of features
were computed.
process from affecting the measured quality [22]. For the same note, j = m. In general, higher SNR values represent better separation quality of the separated notes and source signals, and vice versa. The training database used in the classification process was provided by the McGill University Master Samples collection [16] and the University of Iowa website [21]. It contains 53 music signals, of which 29 are piano signals and the rest are violin signals. All the signals were sampled at 44100 Hz. The reference source signals were stored for the measurement of the separation quality.
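For reference, the quality measure (8) can be computed directly on the magnitude spectrograms; the helper below (illustrative only, not the authors' evaluation code) also converts the ratio to dB, which is how the results are quoted later:

```python
import numpy as np

def snr(ref_spec, est_spec):
    """Eq. (8) on magnitude spectrograms of a reference and a separated component."""
    ratio = np.sum(ref_spec ** 2) / np.sum((ref_spec - est_spec) ** 2)
    return ratio, 10.0 * np.log10(ratio)    # raw ratio and its value in dB

# Toy usage with random "spectrograms".
ref = np.abs(np.random.randn(257, 400))
est = ref + 0.1 * np.abs(np.random.randn(257, 400))
print(snr(ref, est)[1])
```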
For the purpose of training, the signals were firstly segmented into frames, and then the MFCC feature vectors were computed from these frames. In total,
Fig. 9. The collection of the audio features from a separated music component in the testing process. Similar to the training process, 999 frames of features were computed.
Fig. 10. The MSEs between the feature vector of a frame of the music component to
be classied and those from the training data. The frame indices in the horizontal axis
are ranked from the lower to the higher. The frame index 28971 is the highest frame
number of the piano signals. Therefore, on this plot, to the left of this frame are those
from piano signals, and to the right are those from the violin signals.
999 frames were computed for each signal. Figures 7 and 8 show the collection of the features from typical piano and violin signals (i.e. Piano..A0.wav and Violin.pizz.pp.sulG.C4B4.wav), respectively. In both figures, it can be seen that there exist frames whose feature coefficients are all zeros, due to the silent parts of the signals. Before running the training algorithm, we performed feature selection by removing such frames. In the testing stage, the MFCC feature vectors of the individual music components separated by the NMF algorithm were calculated. Figure 9 shows the feature space of the 15th separated component
Fig. 11. The MSE values obtained in Figure 10, sorted from the lowest to the highest. The frame indices on the horizontal axis, associated with the MSEs, are shuffled accordingly.
Fig. 12. The MSE values of the K nearest neighbors (i.e. the frames with the K minimal
MSEs) are selected based on the K-NN clustering. In this experiment, K was set to 30.
Fig. 13. The frame indices of the 30 nearest neighbors to the frame of the decomposed
music note obtained in Figure 12. In our experiment, the maximum frame index for
the piano signals is 28971, shown by the dashed line, while the frame indices of violin
signals are all greater than 28971. Therefore, this typical audio frame under testing
can be classied as a violin signal.
Fig. 14. A separation example of the proposed system. (a) and (b) are the piano and violin sources respectively, (c) is the single channel mixture of these two sources, and (d) and (e) are the separated sources respectively. The vertical axes are the amplitude of the signals; the horizontal axes show time in samples.
Most of the frame indices of its nearest neighbors were greater than 28971, which was the highest index number of the piano signals in the training data. As a result, this component was classified as a violin signal.
Figure 14 shows a separation example of the proposed system, where (a) and (b) are the piano and violin sources respectively, (c) is the single-channel mixture of these two sources, and (d) and (e) are the separated sources. From this figure we can observe that, although most notes are correctly separated and classified into the corresponding sources, some notes were wrongly classified. The separated note with the highest SNR is the first note of the violin signal, for which the SNR equals 9.7 dB, while the highest SNR of a note within the piano signal is 6.4 dB. The average SNRs for piano and violin are 3.7 dB and 1.3 dB, respectively. According to our observations, the separation quality varies from note to note. On average, the separation quality of the piano signal is better than that of the violin signal.
Discussions
At the moment, for the components separated by the NMF algorithm, we calculate their MFCC features in the same way as for the signals in the training data. As a result, the evaluation of the MSEs becomes straightforward, which consequently facilitates the K-NN classification. It is however possible to use the dictionary returned by the NMF algorithm (and possibly the activation coefficients as well) as a set of features. In such a case, the NMF algorithm needs to be applied to the training data in the same way as to the separated components obtained in the testing and classification process. Similar to principal component analysis (PCA), which has been widely used to generate features in many classification systems, using NMF components directly as features has great potential. Compared to using the MFCC features, the computational cost associated with the NMF features could be higher due to the iterations required for the NMF algorithm to converge. However, its applicability as a feature for classification deserves further investigation in the future.
Another important issue in applying NMF algorithms is the selection of the mode of the NMF model (i.e. the rank R). In our study, this determines the number of components that will be learned from the signal. In general, for a higher rank R, the NMF algorithm learns components that are more likely to correspond to individual notes. However, there is a trade-off between the decomposition rank and the computational load, as a larger R incurs a higher computational cost. Also, it is known that NMF produces not only harmonic dictionary components but sometimes also ad-hoc spectral shapes corresponding to drums, transients, residual noise, etc. In our recognition system, these components were treated in the same way as the harmonic components; in other words, their feature vectors were calculated and evaluated in the same manner, and the final decision was made from the labelling scores and the K-NN classification results.
We note that many other classification algorithms could also be applied for labelling the separated components, such as Gaussian mixture models (GMMs), which have been used in both automatic speech/speaker recognition and music information retrieval. In this work, we chose the K-NN algorithm due to its simplicity. Moreover, the performance of the single-channel source separation system developed here is largely dependent on the separated components provided by the NMF algorithm. Although the music components obtained by the NMF algorithm are somewhat sparse, their sparsity is not explicitly controlled. Also, we did not explicitly use information from the music signals, such as the pitch information and harmonic structure. According to Li et al. [14], the information of pitch and common amplitude modulation can be used to improve the separation quality.
Conclusions
We have presented a new system for the single-channel music sound separation problem. The system essentially integrates two techniques: automatic note decomposition using NMF, and note classification based on the K-NN algorithm. A main assumption of the proposed system is that we have prior knowledge about the type of instruments used for producing the music sounds. The simulation results show that the system produces reasonable performance on this challenging source separation problem. Future work includes using more robust classification algorithms to improve the note classification accuracy, and incorporating pitch and common amplitude modulation information into the learning algorithm to improve the separation performance of the proposed system.
References
1. Abdallah, S.A., Plumbley, M.D.: Polyphonic Transcription by Non-Negative Sparse Coding of Power Spectra. In: International Conference on Music Information Retrieval, Barcelona, Spain (October 2004)
2. Barry, D., Lawlor, B., Coyle, E.: Real-time Sound Source Separation: Azimuth Discrimination and Re-synthesis. AES (2004)
3. Brown, G.J., Cooke, M.P.: Perceptual Grouping of Musical Sounds: A Computational Model. J. New Music Res. 23, 107–132 (1994)
4. Casey, M.A., Westner, W.: Separation of Mixed Audio Sources by Independent Subspace Analysis. In: Proc. Int. Comput. Music Conf. (2000)
5. Devijver, P.A., Kittler, J.: Pattern Recognition - A Statistical Approach. Prentice Hall International, Englewood Cliffs (1982)
6. Every, M.R., Szymanski, J.E.: Separation of Synchronous Pitched Notes by Spectral Filtering of Harmonics. IEEE Trans. Audio Speech Lang. Process. 14, 1845–1856 (2006)
7. Fevotte, C., Bertin, N., Durrieu, J.-L.: Nonnegative Matrix Factorization With the Itakura-Saito Divergence. With Application to Music Analysis. Neural Computation 21, 793–830 (2009)
8. FitzGerald, D., Cranitch, M., Coyle, E.: Extended Nonnegative Tensor Factorisation Models for Musical Sound Source Separation, Article ID 872425, 15 pages (2008)
9. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press Inc., London (1990)
29. Wang, W., Luo, Y., Chambers, J.A., Sanei, S.: Note Onset Detection via Non-negative Factorization of Magnitude Spectrum. EURASIP Journal on Advances in Signal Processing, Article ID 231367, 15 pages (June 2008); doi:10.1155/2008/231367
30. Wang, W., Cichocki, A., Chambers, J.A.: A Multiplicative Algorithm for Convolutive Non-negative Matrix Factorization Based on Squared Euclidean Distance. IEEE Transactions on Signal Processing 57, 2858–2864 (2009)
31. Webb, A.: Statistical Pattern Recognition, 2nd edn. Wiley, New York (2005)
32. Woodruff, J., Pardo, B.: Using Pitch, Amplitude Modulation and Spatial Cues for Separation of Harmonic Instruments from Stereo Music Recordings. EURASIP J. Adv. Signal Process. (2007)
Introduction
Nonnegative matrix factorization (NMF) is an unsupervised data decomposition technique with growing popularity in the fields of machine learning and signal/image processing [8]. Much research on this topic has been driven by applications in audio, where the data matrix is taken as the magnitude or power spectrogram of a sound signal. NMF has for example been applied with success to automatic music transcription [15] and audio source separation [19,14]. The factorization amounts to decomposing the spectrogram data into a sum of rank-1
spectrograms, each of which is the expression of an elementary spectral pattern amplitude-modulated in time.
However, while most music recordings are available in multichannel format (typically, stereo), NMF in its standard setting is only suited to single-channel data. Extensions to multichannel data have been considered, either by stacking up the spectrograms of each channel into a single matrix [11] or, equivalently, by considering nonnegative tensor factorization (NTF) under a parallel factor analysis (PARAFAC) structure, where the channel spectrograms form the slices of a 3-valence tensor [5,6]. Let X_i be the short-time Fourier transform (STFT) of channel i, a complex-valued matrix of dimensions F × N, where i = 1, ..., I and I is the number of channels (I = 2 in the stereo case). The latter approaches boil down to assuming that the magnitude spectrograms |X_i| are approximated by a linear combination of nonnegative rank-1 elementary spectrograms |C_k| = w_k h_k^T, such that

|X_i| \approx \sum_{k=1}^{K} q_{ik} |C_k|   (1)

where |C_k| is the matrix containing the modulus of the coefficients of some latent components whose precise meaning we will attempt to clarify in this paper.
Equivalently, Eq. (1) writes

|x_{ifn}| \approx \sum_{k=1}^{K} q_{ik} w_{fk} h_{nk}   (2)

and the loading matrices are estimated by solving an optimization problem of the form

\min_{Q, W, H \ge 0} \sum_{ifn} d(|x_{ifn}| \,|\, \hat{v}_{ifn})   (3)

with

\hat{v}_{ifn} \stackrel{def}{=} \sum_{k=1}^{K} q_{ik} w_{fk} h_{nk}   (4)

and where the constraint A ≥ 0 means that the coefficients of matrix A are nonnegative, and d(x|y) is a scalar cost function, taken as the generalized Kullback-Leibler (KL) divergence in [5] or as the Euclidean distance in [11]. Complex-valued STFT component estimates Ĉ_k are subsequently constructed using the phase of the observations (typically, ĉ_{kfn} is given the phase of x_{ifn}, where i = arg max_i {q_{ik}} [6]) and then inverted to produce time-domain components. The components pertaining to the same sources (e.g., instruments) can then be grouped either manually or via clustering of the estimated spatial cues {q_k}_k.
In this paper we build on these previous works and bring the following contributions:
- We recast the approach of [5] into a statistical framework, based on a generative statistical model of the multichannel observations X. In particular we discuss NTF of the power spectrogram |X|² with the Itakura-Saito (IS) divergence and NTF of the magnitude spectrogram |X| with the KL divergence.
- We describe an NTF with a novel structure, which takes care of the clustering of the components within the decomposition, as opposed to after it.
The paper is organized as follows. Section 2 describes the generative and statistical source models implied by NTF. Section 3 describes new and existing multiplicative algorithms for standard NTF and for Cluster NTF. Section 4 reports experimental source separation results on musical data; we test in particular the impact of the simplifying non-point-source assumption on underdetermined linear instantaneous mixtures of musical sources and point out the limits of the approach for such mixtures. We conclude in Section 5. This article builds on related publications [10,3].
2
2.1
Assume a multichannel audio recording with I channels x(t) = [x_1(t), ..., x_I(t)]^T, also referred to as observations or data, generated as a linear mixture of sound source signals. The term source refers to the production system, for example a musical instrument, and the term source signal refers to the signal produced by that source. When the intended meaning is clear from the context we will simply refer to the source signals as the sources.
Under the linear mixing assumption, the multichannel data can be expressed as

x(t) = \sum_{j=1}^{J} s_j(t)   (5)

where J is the number of sources and s_j(t) = [s_{1j}(t), ..., s_{ij}(t), ..., s_{Ij}(t)]^T is the multichannel contribution of source j to the data. Under the common assumptions of point sources and linear instantaneous mixing, we have

s_{ij}(t) = s_j(t) \, a_{ij}   (6)

where the coefficients {a_{ij}} define an I × J mixing matrix A, with columns denoted [a_1, ..., a_J]. In the following we will show that the NTF techniques described in this paper correspond to maximum likelihood (ML) estimation of source and mixing parameters in a model where the point-source assumption is dropped and replaced by

s_{ij}(t) = s_j^{(i)}(t) \, a_{ij}   (7)

where the signals s_j^{(i)}(t), i = 1, ..., I, are assumed to share a certain resemblance, as modelled by being two different realizations of the same random
process. In the STFT domain, the model leads to

x_{ifn} = \sum_{k=1}^{K} m_{ik} \, c^{(i)}_{kfn}   (10)

where x_{ifn} and c^{(i)}_{kfn} are the complex-valued STFTs of x_i(t) and c^{(i)}_k(t), and where f = 1, ..., F is a frequency bin index and n = 1, ..., N is a time frame index.
2.2
The model underlying KL-NTF of the magnitude spectrogram assumes magnitude-additive mixing

|x_{ifn}| = \sum_{k} |m_{ik}| \, |c^{(i)}_{kfn}|   (12)

with Poisson-distributed component magnitudes,

|c^{(i)}_{kfn}| \sim \mathcal{P}(w_{fk} h_{nk})   (13)
where P(λ) denotes the Poisson distribution, defined in Appendix A, and the KL divergence d_KL(·|·) is defined as

d_{KL}(x|y) = x \log \frac{x}{y} + y - x .   (14)

Given estimates Q, W and H of the loading matrices, Minimum Mean Square Error (MMSE) estimates of the component magnitudes are given by

|\hat{c}^{(i)}_{kfn}| \stackrel{def}{=} E\{ |c^{(i)}_{kfn}| \mid Q, W, H, X \}   (15)
= \frac{q_{ik} w_{fk} h_{nk}}{\sum_l q_{il} w_{fl} h_{nl}} \, |x_{ifn}|   (16)
To remedy the drawbacks of the KL-NTF model for audio we describe a new model based on IS-NTF of the power spectrogram, along the lines of [4] and also introduced in [10]. The model reads

x_{ifn} = \sum_{k} m_{ik} \, c^{(i)}_{kfn}   (17)

with proper complex Gaussian components

c^{(i)}_{kfn} \sim \mathcal{N}_c(0, w_{fk} h_{nk})   (18)

and maximum likelihood estimation of Q = |M|², W and H amounts to the minimization of

\sum_{ifn} d_{IS}\left( |x_{ifn}|^2 \,|\, \hat{v}_{ifn} \right)   (19)

where the Itakura-Saito divergence is defined as

d_{IS}(x|y) = \frac{x}{y} - \log \frac{x}{y} - 1 .   (20)
Note that our notations are abusive in the sense that the mixing parameters |m_{ik}| and the components |c_{kfn}| appearing through their modulus in Eq. (12) are in no way the modulus of the mixing parameters and components appearing in Eq. (17). Similarly, the matrices W and H represent different types of quantities in each case; in Eq. (13) their product is homogeneous to component magnitudes, while in Eq. (18) their product is homogeneous to component variances. Formally we should have introduced variables |c^{KL}_{kfn}|, W^{KL}, H^{KL}, to be distinguished from variables c^{IS}_{kfn}, W^{IS}, H^{IS}, but we have not, in order to avoid cluttering the notations. The difference between these quantities should be clear from the context.
Model (17)-(18) is a truly generative model in the sense that the linear mixing assumption is made on the STFT frames themselves, which is a realistic assumption in audio. Eq. (18) defines a Gaussian variance model of c^{(i)}_{kfn}; the zero-mean assumption reflects the property that the audio frames taken as the input of the STFT can be considered centered, for a typical window size of about 20 ms or more. The proper Gaussian assumption means that the phase of c^{(i)}_{kfn} is assumed to be a uniform random variable [9], i.e., the phase is taken into the model, but in a noninformative way. This contrasts with model (12)-(13), which simply discards the phase information.
Given estimates Q, W and H of the loading matrices, Minimum Mean Square Error (MMSE) estimates of the components are given by

\hat{c}^{(i)}_{kfn} \stackrel{def}{=} E\{ c^{(i)}_{kfn} \mid Q, W, H, X \}   (21)
= \frac{q_{ik} w_{fk} h_{nk}}{\sum_l q_{il} w_{fl} h_{nl}} \, x_{ifn}   (22)
We would like to underline that the MMSE estimator of the components in the STFT domain (21) is equivalent (thanks to the linearity of the STFT and its inverse) to the MMSE estimator of the components in the time domain, while the MMSE estimator of STFT magnitudes (15) for KL-NTF is not consistent with time-domain MMSE. Equivalence of an estimator with time-domain squared error minimization is an attractive property, at least because it is consistent with a popular objective source separation measure such as the signal to distortion ratio (SDR) defined in [16].
The differences between the two models, termed KL-NTF.mag and IS-NTF.pow, are summarized in Table 1.

Table 1. Statistical models and optimization problems underlaid to KL-NTF.mag and IS-NTF.pow

                       KL-NTF.mag                                           IS-NTF.pow
Model
  Mixing model         |x_ifn| = \sum_k |m_ik| |c^(i)_kfn|                  x_ifn = \sum_k m_ik c^(i)_kfn
  Comp. distribution   |c^(i)_kfn| ~ P(w_fk h_nk)                           c^(i)_kfn ~ N_c(0, w_fk h_nk)
ML estimation
  Data                 V = |X|                                              V = |X|^2
  Parameters           W, H, Q = |M|                                        W, H, Q = |M|^2
  Approximate          \hat{v}_ifn = \sum_k q_ik w_fk h_nk                  \hat{v}_ifn = \sum_k q_ik w_fk h_nk
  Optimization         min_{Q,W,H>=0} \sum_ifn d_KL(v_ifn | \hat{v}_ifn)    min_{Q,W,H>=0} \sum_ifn d_IS(v_ifn | \hat{v}_ifn)
Reconstruction         |\hat{c}^(i)_kfn| = (q_ik w_fk h_nk / \sum_l q_il w_fl h_nl) |x_ifn|    \hat{c}^(i)_kfn = (q_ik w_fk h_nk / \sum_l q_il w_fl h_nl) x_ifn
3
3.1
The standard NTF problem consists in minimizing the criterion

D(V|\hat{V}) \stackrel{def}{=} \sum_{ifn} d(v_{ifn} \,|\, \hat{v}_{ifn})   (23)
where \hat{v}_{ifn} = \sum_k q_{ik} w_{fk} h_{nk}, and d(x|y) is the cost function, either the KL or the IS divergence in our case. Furthermore we impose ||q_k||_1 = 1 and ||w_k||_1 = 1, so as to remove obvious scale indeterminacies between the three loading matrices Q, W and H. With these conventions, the columns of Q convey normalized mixing proportions (spatial cues) between the channels, the columns of W convey normalized frequency shapes, and all time-dependent amplitude information is relegated to H.
As is common practice in NMF and NTF, we employ multiplicative algorithms for the minimization of D(V|\hat{V}). These algorithms essentially consist of updating each scalar parameter θ by multiplying its value at the previous iteration by the ratio of the negative and positive parts of the derivative of the criterion w.r.t. this parameter, namely

\theta \leftarrow \theta \cdot \frac{[\nabla_\theta D(V|\hat{V})]_-}{[\nabla_\theta D(V|\hat{V})]_+}   (24)

where \nabla_\theta D(V|\hat{V}) = [\nabla_\theta D(V|\hat{V})]_+ - [\nabla_\theta D(V|\hat{V})]_- and the two parts are both nonnegative [4]. This scheme automatically ensures the nonnegativity of the parameter updates, provided initialization with a nonnegative value. The derivative of the criterion w.r.t. a scalar parameter θ writes

\nabla_\theta D(V|\hat{V}) = \sum_{ifn} \nabla_\theta \hat{v}_{ifn} \; d'(v_{ifn}|\hat{v}_{ifn})   (25)

which, for the individual coefficients of Q, W and H, gives

\nabla_{q_{ik}} D(V|\hat{V}) = \sum_{fn} w_{fk} h_{nk} \, d'(v_{ifn}|\hat{v}_{ifn})   (26)
\nabla_{w_{fk}} D(V|\hat{V}) = \sum_{in} q_{ik} h_{nk} \, d'(v_{ifn}|\hat{v}_{ifn})   (27)
\nabla_{h_{nk}} D(V|\hat{V}) = \sum_{if} q_{ik} w_{fk} \, d'(v_{ifn}|\hat{v}_{ifn})   (28)
In tensor form, denoting by G the I × F × N tensor with entries g_{ifn} = d'(v_{ifn}|\hat{v}_{ifn}), by G_+ and G_- the tensors of its positive and negative parts, and by W ∘ H the F × N × K tensor with entries w_{fk} h_{nk} (and similarly for Q ∘ H and Q ∘ W), the gradients write

\nabla_Q D(V|\hat{V}) = \langle G, W \circ H \rangle_{\{2,3\},\{1,2\}}   (31)
\nabla_W D(V|\hat{V}) = \langle G, Q \circ H \rangle_{\{1,3\},\{1,2\}}   (32)
\nabla_H D(V|\hat{V}) = \langle G, Q \circ W \rangle_{\{1,2\},\{1,2\}}   (33)

leading to the multiplicative updates

Q \leftarrow Q \,.\, \frac{\langle G_-, W \circ H \rangle_{\{2,3\},\{1,2\}}}{\langle G_+, W \circ H \rangle_{\{2,3\},\{1,2\}}}   (34)
W \leftarrow W \,.\, \frac{\langle G_-, Q \circ H \rangle_{\{1,3\},\{1,2\}}}{\langle G_+, Q \circ H \rangle_{\{1,3\},\{1,2\}}}   (35)
H \leftarrow H \,.\, \frac{\langle G_-, Q \circ W \rangle_{\{1,2\},\{1,2\}}}{\langle G_+, Q \circ W \rangle_{\{1,2\},\{1,2\}}}   (36)
The resulting algorithm can easily be shown not to increase the cost function at each iteration, by generalizing existing proofs for KL-NMF [13] and for IS-NMF [1]. In our implementation, normalization of the variables is carried out at the end of every iteration by dividing every column of Q by its ℓ1 norm and scaling the columns of W accordingly, then dividing the columns of W by their ℓ1 norm and scaling the columns of H accordingly.
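For concreteness, here is a sketch of these multiplicative updates for the IS-NTF.pow case, written directly from the scalar rules (26)-(28) in NumPy rather than in contracted-tensor form; it is an illustrative sketch under those assumptions, not the authors' implementation, and the last helper applies the component reconstruction of Eq. (22):

```python
import numpy as np

def is_ntf(V, K, n_iter=100, eps=1e-12):
    """PARAFAC-NTF of a nonnegative I x F x N tensor V (channel power spectrograms)
    with the Itakura-Saito divergence, using multiplicative updates."""
    I, F, N = V.shape
    rng = np.random.default_rng(0)
    Q = rng.random((I, K)) + eps
    W = rng.random((F, K)) + eps
    H = rng.random((N, K)) + eps

    def v_hat():
        return np.einsum('ik,fk,nk->ifn', Q, W, H) + eps

    for _ in range(n_iter):
        Vh = v_hat(); Gm, Gp = V / Vh**2, 1.0 / Vh   # negative / positive parts of d'_IS
        Q *= np.einsum('ifn,fk,nk->ik', Gm, W, H) / np.einsum('ifn,fk,nk->ik', Gp, W, H)
        Vh = v_hat(); Gm, Gp = V / Vh**2, 1.0 / Vh
        W *= np.einsum('ifn,ik,nk->fk', Gm, Q, H) / np.einsum('ifn,ik,nk->fk', Gp, Q, H)
        Vh = v_hat(); Gm, Gp = V / Vh**2, 1.0 / Vh
        H *= np.einsum('ifn,ik,fk->nk', Gm, Q, W) / np.einsum('ifn,ik,fk->nk', Gp, Q, W)
        # Normalize columns of Q, then of W, pushing the scale into H.
        s = Q.sum(axis=0); Q /= s; W *= s
        s = W.sum(axis=0); W /= s; H *= s
    return Q, W, H

def wiener_components(X, Q, W, H):
    """MMSE component estimates of Eq. (22) from complex STFTs X (I x F x N)."""
    num = np.einsum('ik,fk,nk->ikfn', Q, W, H)
    return num / (num.sum(axis=1, keepdims=True) + 1e-12) * X[:, None, :, :]
```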
3.2 Cluster NTF
We now consider a variant of PARAFAC-NTF in which the loading matrix Q is constrained to have the structure

Q = D L   (37)

where L is a J × K labelling matrix reflecting a partition K_1, ..., K_J of the K components into J sources, i.e.,

l_{jk} = 1 \quad \text{if } k \in K_j   (38)
l_{jk} = 0 \quad \text{otherwise}   (39)

and where

D = |A| \quad \text{in KL-NTF.mag}   (41)
D = |A|^2 \quad \text{in IS-NTF.pow}   (42)
With this structure, the derivative of the criterion w.r.t. a coefficient d_{ij} of D writes

\nabla_{d_{ij}} D(V|\hat{V}) = \sum_{k} l_{jk} \sum_{fn} w_{fk} h_{nk} \, d'(v_{ifn}|\hat{v}_{ifn})   (44)

i.e.,

\nabla_D D(V|\hat{V}) = \langle G, W \circ H \rangle_{\{2,3\},\{1,2\}} \, L^T   (45)

leading to the multiplicative update

D \leftarrow D \,.\, \frac{\langle G_-, W \circ H \rangle_{\{2,3\},\{1,2\}} \, L^T}{\langle G_+, W \circ H \rangle_{\{2,3\},\{1,2\}} \, L^T}   (46)
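Under the same assumptions as the previous sketch, the only change in Cluster NTF is that Q is replaced by DL with a fixed labelling matrix L and that D is updated with rule (46); a minimal NumPy fragment for that update might look as follows (again an illustration, not the authors' code):

```python
import numpy as np

def update_D(V, D, L, W, H, eps=1e-12):
    """One multiplicative IS-divergence update of D in the Cluster NTF model Q = D L.

    V : I x F x N data tensor (power spectrograms)
    D : I x J nonnegative mixing-like matrix,  L : J x K fixed 0/1 labelling matrix
    W : F x K,  H : N x K loading matrices
    """
    Q = D @ L                                          # Eq. (37)
    Vh = np.einsum('ik,fk,nk->ifn', Q, W, H) + eps
    Gm, Gp = V / Vh**2, 1.0 / Vh                       # parts of d'_IS
    num = np.einsum('ifn,fk,nk->ik', Gm, W, H) @ L.T   # <G-, W o H> L^T
    den = np.einsum('ifn,fk,nk->ik', Gp, W, H) @ L.T   # <G+, W o H> L^T
    return D * num / (den + eps)

# Example labelling: K = 6 components grouped into J = 3 sources of 2 components each.
L = np.kron(np.eye(3), np.ones((1, 2)))
```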
Results
We consider source separation of simple audio mixtures taken from the Signal Separation Evaluation Campaign (SiSEC 2008) website. More specifically, we used some development data from the underdetermined speech and music mixtures task [18]. We considered the following datasets:
- wdrums, a linear instantaneous stereo mixture (with positive mixing coefficients) of 2 drum sources and 1 bass line;
- nodrums, a linear instantaneous stereo mixture (with positive mixing coefficients) of 1 rhythmic acoustic guitar, 1 electric lead guitar and 1 bass line.
The signals are of length 10 s and sampled at 16 kHz. We applied an STFT with a sine bell of length 64 ms (1024 samples), leading to F = 513 and N = 314. We applied the following algorithms to the two datasets:
- KL-NTF.mag with K = 9,
- IS-NTF.pow with K = 9,
Fig. 1. Mixing parameter estimation and ground truth. Top: wdrums dataset. Bottom: nodrums dataset. Left: results of KL-NTF.mag and KL-cNTF.mag; ground truth mixing vectors {|a_j|}_j (red), mixing vectors {d_j}_j estimated with KL-cNTF.mag (blue), spatial cues {q_k}_k given by KL-NTF.mag (dashed, black). Right: results of IS-NTF.pow and IS-cNTF.pow; ground truth mixing vectors {|a_j|²}_j (red), mixing vectors {d_j}_j estimated with IS-cNTF.pow (blue), spatial cues {q_k}_k given by IS-NTF.pow (dashed, black).
Table 2. SDR, ISR, SIR and SAR of source estimates for the two considered datasets. Higher values indicate better results. Values in bold font indicate the results with best average SDR.

wdrums             s1 (Hi-hat)   s2 (Drums)   s3 (Bass)
KL-NTF.mag   SDR       -0.2          0.4         17.9
             ISR       15.5          0.7         31.5
             SIR        1.4         -0.9         18.9
             SAR        7.4         -3.5         25.7
KL-cNTF.mag  SDR      -0.02        -14.2          1.9
             ISR       15.3          2.8          2.1
             SIR        1.5        -15.0         18.9
             SAR        7.8         13.2          9.2
IS-NTF.pow   SDR       12.7          1.2         17.4
             ISR       17.3          1.7         36.6
             SIR       21.1         14.3         18.0
             SAR       15.2          2.7         27.3
IS-cNTF.pow  SDR       13.1          1.8         18.0
             ISR       17.0          2.5         35.4
             SIR       22.0         13.7         18.7
             SAR       15.9          3.4         26.5

nodrums            s1 (Bass)   s2 (Lead G.)   s3 (Rhythmic G.)
KL-NTF.mag   SDR       13.2         -1.8            1.0
             ISR       22.7          1.0            1.2
             SIR       13.9         -9.3            6.1
             SAR       24.2          7.4            2.6
KL-cNTF.mag  SDR        5.8         -9.9            3.1
             ISR        8.0          0.7            6.3
             SIR       13.5        -15.3            2.9
             SAR        8.3          2.7            9.9
IS-NTF.pow   SDR        5.0        -10.0           -0.2
             ISR        7.2          1.9            4.2
             SIR       12.3        -13.5            0.3
             SAR        7.2          3.3           -0.1
IS-cNTF.pow  SDR        3.9        -10.2           -1.9
             ISR        6.2          3.3            4.6
             SIR       10.6        -10.9           -3.7
             SAR        3.7          1.0            1.5
contain much bass and lead guitar.² Results from all four methods on this dataset are overall much worse than with the dataset wdrums, corroborating the established idea that percussive signals are favorably modeled by NMF models [7]. Increasing the total number of components K did not seem to solve the observed deficiencies of the four approaches on this dataset.
Conclusions
In this paper we have attempted to clarify the statistical models latent to audio source separation using PARAFAC-NTF of the magnitude or power spectrogram. In particular, we have emphasized that PARAFAC-NTF does not optimally exploit interchannel redundancy in the presence of point sources. This may still be sufficient to estimate spatial cues correctly in linear instantaneous mixtures, in particular when the NMF model suits the sources well, as seen from the results on the dataset wdrums, but it may also lead to incorrect results in other cases, as seen from the results on the dataset nodrums. In contrast, methods fully exploiting interchannel dependencies, such as the EM algorithm based on model (17)-(18) with c^{(i)}_{kfn} = c_{kfn} in [10], can successfully estimate the mixing matrix in both datasets. The latter method is however about 10 times more computationally demanding than IS-cNTF.pow.
In this paper we have considered a variant of PARAFAC-NTF in which the loading matrix Q is given a structure such that Q = DL. We have assumed that L is a known labelling matrix that reflects the partition K_1, ..., K_J. An important perspective of this work is to let the labelling matrix be free and to estimate it automatically from the data, either under the constraint that every column l_k of L may contain only one nonzero entry, akin to hard clustering, i.e., ||l_k||_0 = 1, or more generally under the constraint that ||l_k||_0 is small, akin to soft clustering. This should be made feasible using NTF under sparse ℓ1 constraints and is left for future work.

² The numerical evaluation criteria were computed using the bss_eval.m function available from the SiSEC website. The function automatically pairs source estimates with ground truth signals according to best mean SIR. This resulted here in pairing the left, middle and right blue directions with respectively the left, middle and right red directions, i.e., preserving the panning order.
References
1. Cao, Y., Eggermont, P.P.B., Terebey, S.: Cross Burg entropy maximization and its application to ringing suppression in image reconstruction. IEEE Transactions on Image Processing 8(2), 286–292 (1999)
2. Cemgil, A.T.: Bayesian inference for nonnegative matrix factorisation models. Computational Intelligence and Neuroscience (Article ID 785152), 17 pages (2009); doi:10.1155/2009/785152
3. Fevotte, C.: Itakura-Saito nonnegative factorizations of the power spectrogram for music signal decomposition. In: Wang, W. (ed.) Machine Audition: Principles, Algorithms and Systems, ch. 11. IGI Global Press (August 2010), http://perso.telecom-paristech.fr/~fevotte/Chapters/isnmf.pdf
4. Fevotte, C., Bertin, N., Durrieu, J.L.: Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis. Neural Computation 21(3), 793–830 (2009), http://www.tsi.enst.fr/~fevotte/Journals/neco09_is-nmf.pdf
5. FitzGerald, D., Cranitch, M., Coyle, E.: Non-negative tensor factorisation for sound source separation. In: Proc. of the Irish Signals and Systems Conference, Dublin, Ireland (September 2005)
6. FitzGerald, D., Cranitch, M., Coyle, E.: Extended nonnegative tensor factorisation models for musical sound source separation. Computational Intelligence and Neuroscience (Article ID 872425), 15 pages (2008)
7. Helen, M., Virtanen, T.: Separation of drums from polyphonic music using non-negative matrix factorization and support vector machine. In: Proc. 13th European Signal Processing Conference (EUSIPCO 2005) (2005)
8. Lee, D.D., Seung, H.S.: Learning the parts of objects with nonnegative matrix factorization. Nature 401, 788–791 (1999)
9. Neeser, F.D., Massey, J.L.: Proper complex random processes with applications to information theory. IEEE Transactions on Information Theory 39(4), 1293–1302 (1993)
10. Ozerov, A., Fevotte, C.: Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Transactions on Audio, Speech and Language Processing 18(3), 550–563 (2010), http://www.tsi.enst.fr/~fevotte/Journals/ieee_asl_multinmf.pdf
11. Parry, R.M., Essa, I.: Estimating the spatial position of spectral components in audio. In: Rosca, J.P., Erdogmus, D., Príncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 666–673. Springer, Heidelberg (2006)
12. Shashua, A., Hazan, T.: Non-negative tensor factorization with applications to statistics and computer vision. In: Proc. 22nd International Conference on Machine Learning, pp. 792–799. ACM, Bonn (2005)
13. Shepp, L.A., Vardi, Y.: Maximum likelihood reconstruction for emission tomography. IEEE Transactions on Medical Imaging 1(2), 113–122 (1982)
14. Smaragdis, P.: Convolutive speech bases and their application to speech separation. IEEE Transactions on Audio, Speech, and Language Processing 15(1), 1–12 (2007)
15. Smaragdis, P., Brown, J.C.: Non-negative matrix factorization for polyphonic music transcription. In: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2003) (October 2003)
16. Vincent, E., Gribonval, R., Fevotte, C.: Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech and Language Processing 14(4), 1462–1469 (2006), http://www.tsi.enst.fr/~fevotte/Journals/ieee_asl_bsseval.pdf
17. Vincent, E., Sawada, H., Bofill, P., Makino, S., Rosca, J.P.: First stereo audio source separation evaluation campaign: Data, algorithms and results. In: Davies, M.E., James, C.J., Abdallah, S.A., Plumbley, M.D. (eds.) ICA 2007. LNCS, vol. 4666, pp. 552–559. Springer, Heidelberg (2007)
18. Vincent, E., Araki, S., Bofill, P.: Signal Separation Evaluation Campaign (SiSEC 2008) / Under-determined speech and music mixtures task results (2008), http://www.irisa.fr/metiss/SiSEC08/SiSEC_underdetermined/dev2_eval.html
19. Virtanen, T.: Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech and Language Processing 15(3), 1066–1074 (2007)
Standard Distributions
Proper complex Gaussian distribution:
$\mathcal{N}_c(x \mid \mu, \Sigma) = |\pi \Sigma|^{-1} \exp\left( -(x-\mu)^H \Sigma^{-1} (x-\mu) \right)$

Poisson distribution:
$\mathcal{P}(x \mid \lambda) = \exp(-\lambda)\, \dfrac{\lambda^x}{x!}$

Contracted tensor product over the first M modes, of dimensions $I_1, \dots, I_M$, common to both tensors:
$\left[ \langle A, B \rangle_{\{1,\dots,M\}} \right]_{j_1 \dots j_N,\, k_1 \dots k_P} = \sum_{i_1=1}^{I_1} \cdots \sum_{i_M=1}^{I_M} a_{i_1 \dots i_M j_1 \dots j_N} \, b_{i_1 \dots i_M k_1 \dots k_P} \qquad (47)$

The contracted tensor product should be thought of as a generalized dot product of two tensors along common modes of the same dimensions.
Introduction
Signal processing techniques are a powerful set of mathematical tools that allow the extraction of the information required for a certain purpose from a signal. They can be applied to any type of signal: communication signals, medical signals, speech signals, multimedia signals, etc. In this contribution, we focus on the application of signal processing techniques to music information: audio and scores.
Signal processing techniques can be used for music database exploration. In this field, we present a 3D adaptive environment for music content exploration that allows the exploration of musical contents in a novel way. The songs are analyzed and a series of numerical descriptors are computed to characterize their spectral content. Six main musical genres are defined as axes of a multidimensional framework onto which the songs are projected. A three-dimensional subdomain is defined by choosing three of the six genres at a time, and the user is allowed to navigate in this space, browsing, exploring and analyzing the elements of this musical universe. Also within this field of music database exploration, a novel method for music similarity evaluation is presented. The evaluation of music similarity is one of the core components of the field of Music Information Retrieval (MIR). In this study, rhythmic and spectral analyses are combined to extract the tonal profile of musical compositions and evaluate music similarity.
Music signal processing can also be used for the preservation of cultural heritage. In this sense, we have developed a complete system with an interactive
graphical user interface for Optical Music Recognition (OMR), specially adapted for scores written in white mensural notation. Color photographs of ancient scores taken at the Archivo de la Catedral de Málaga have been used as input to the system. A series of pre-processing steps aim to improve their quality and return binary images to be processed. The music symbols are extracted and classified, so that the system is able to transcribe the ancient music notation into modern notation and make it sound.
Music signal processing can also be focused on developing tools for technology-enhanced learning and revolutionary learning appliances. In this sense, we present different applications we have developed to help learning different instruments: piano, violin and guitar. The graphical tool for piano learning is able to detect whether a person is playing the proper piano chord. It shows the user the time and frequency response of each frame of piano sound under analysis, together with a piano keyboard on which the played notes are highlighted and their names displayed. The core of the designed tool is a polyphonic transcription system able to detect the played notes, based on the use of spectral patterns of the piano notes. The designed tool is useful both for users with knowledge of music and for users without such knowledge. The violin learning tool is based on a transcription system able to detect the pitch and duration of the violin notes and to identify the different expressiveness techniques: détaché with and without vibrato, pizzicato, tremolo, spiccato and flageolett-töne. The interface is a pedagogical tool to aid violin learning. For the guitar, we have developed a system able to perform string and fret estimation of guitar notes in real time. The system works in three modes: it is able to estimate the string and fret of a single note played on a guitar, to recognize strummed chords from a predefined list, and to make a free estimation when no information about what is being played is given. We have also developed a lightweight pitch detector for embedded systems to be used in toys. The detector is based on a neural network whose input is a frequency-domain preprocessing of the signal. The selected neural network is a perceptron-type network. For the preprocessing, the Goertzel algorithm is the selected technique for the frequency analysis because it is a light alternative to FFT computation and is very well suited when only a few spectral points are enough to extract the relevant information.
The outline of the paper is therefore as follows. In Section 2, tools related to musical content management are presented. Section 3 is devoted to the tool directly related to the preservation of cultural heritage. Section 4 presents the different tools developed for technology-enhanced music learning. Finally, the conclusions are presented in Section 5.
The huge amount of digital musical content available through different databases makes it necessary to have intelligent music signal processing tools that help us manage all this information.
In Subsection 2.1, a novel tool for navigating through music content is presented. This 3D navigation environment makes it easier to look for inter-related
musical contents and also gives the user the opportunity to get to know certain types of music that he or she would not have found through more traditional ways of searching musical contents.
In order to use a 3D environment such as the one presented, or other methods for music information retrieval, the evaluation of music similarity is one of the core components. In Subsection 2.2, the rhythmic and spectral analyses of music contents are combined to extract the tonal profile of musical compositions and evaluate music similarity.
2.1 3D Environment for Music Content Exploration
Interactive music exploration is an open problem [31], with increasing interest due to the growing possibilities of accessing large music databases. Efforts to automate and simplify the access to musical contents require analyzing the songs to obtain numerical descriptors, in the time or frequency domains, that can be used to measure and compare differences and similarities among them. We have developed an adaptive 3D environment that allows intuitive music exploration and browsing through its graphical interface. Music analysis is based on the use of the Mel frequency cepstral coefficients (MFCCs) [27]; a multidimensional space is built and each song is represented as a sphere in a 3D environment with tools to navigate, listen and query the music space.
The MFCCs are essentially based on the short-term Fourier transform. The windowed spectrum of the original signal is computed, a Mel filter bank is applied to obtain a logarithmic frequency representation, and the resulting spectrum is processed with a discrete cosine transform (DCT). To this end, the Mel coefficients have to be clustered into a few groups in order to achieve a compact representation of the global spectral content of the signal. Here, the popular k-means clustering method has been employed and the centroid of the most populated cluster has been taken as a compact vectorial representation of the spectral content of the whole piece. This approach has been applied to a large number of samples for the six selected genres, and a predominant vector has been computed for each genre. These vectors are considered as pseudo-orthonormal reference coordinate vectors for the projection of the songs. In particular, for each song, the six coordinates have been obtained by computing the scalar product between the predominant vector of the song itself and the ones of the six genres, conveniently normalized to unit norm.
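To make the descriptor pipeline concrete, the following minimal sketch reproduces its main steps in Python. It is only an illustration under assumptions: librosa and scikit-learn are used for the MFCC extraction and the k-means clustering, and the function names, the number of MFCCs and the number of clusters are illustrative choices, not the settings of the system described here.

```python
import numpy as np
import librosa                      # assumed available for MFCC extraction
from sklearn.cluster import KMeans  # assumed available for clustering

def song_descriptor(path, n_mfcc=13, n_clusters=4):
    """Compact spectral descriptor of one song: centroid of the most
    populated k-means cluster of its frame-wise MFCC vectors."""
    y, sr = librosa.load(path, sr=None, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # (frames, n_mfcc)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(mfcc)
    counts = np.bincount(km.labels_, minlength=n_clusters)
    return km.cluster_centers_[np.argmax(counts)]              # predominant vector

def genre_coordinates(song_vec, genre_vecs):
    """Project a song onto the genre axes by scalar product with the
    (unit-norm) predominant vector of each genre."""
    coords = []
    for g in genre_vecs:                       # genre_vecs: list of six genre vectors
        g = g / np.linalg.norm(g)
        coords.append(float(np.dot(song_vec, g)))
    return coords                              # six coordinates; three are picked for the 3D view
```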
The graphical user interface comprises a main window with different functional panels (Figure 1). In the main panel, the representation of the songs in a 3D framework is shown: three orthogonal axes, representing the three selected genres, are centered in the coordinate range, and the songs are represented as blue spheres, correspondingly titled. A set of other panels with different functions completes the window. During the exploration of the space, the user is informed in real time about the closest songs and can listen to them.

Fig. 1. The graphical user interface for the 3D exploration of musical audio

2.2 Evaluation of Music Similarity Based on Tonal Behavior
The evaluation of music similarity is one of the core components of the field of Music Information Retrieval (MIR). Similarity is often computed on the basis of the extraction of low-level time and frequency descriptors [25] or on the computation of rhythmic patterns [21]. Logan and Salomon [26] use the Mel Frequency Cepstral Coefficients (MFCCs) as the main tool to compare audio tracks based on their spectral content. Ellis et al. [13] adopt the cross-correlation of rhythmic patterns to identify common parts among songs.
In this study, rhythmic and spectral analyses are combined to extract the tonal profile of musical compositions and evaluate music similarity. The processing stage comprises two main steps: the computation of the main rhythmic meter of the song and the estimation of the distribution of contributions of tonalities to the overall tonal content of the composition. The cross-correlation of the rhythmic pattern of the envelope of the raw signal allows a quantitative estimation of the main melodic motif of the song. This temporal unit is then employed as a basis for the temporal segmentation of the signal, aimed at extracting the pitch class profile of the song [14] and, consequently, the vector of tonality contributions. Finally, this tonal behavior vector is employed as the main feature to describe the song and is used to evaluate similarity.
Estimation of the melodic cell. In order to characterize the main melodic motif of the track, the songs are analyzed to estimate the tempo. More than a quantitative metrical calculus of the rhythmic pattern, the method aims at delivering measures for guiding the temporal segmentation of the musical signal, and at subsequently improving the representation of the song dynamics. This optimizes the step for the computation of the tonal content of the audio signal, supplying the reference temporal frame for the audio windowing. The aim of the tempo induction is to estimate the width of the window used for windowing, so that the stage for the computation of the tonal content of the song can operate on a musically meaningful temporal frame.
Table 1. Relative and absolute differences between the widths of the melodic window manually evaluated by the listeners and the ones automatically computed by the proposed algorithm

Genre            Relative difference (%)   Absolute difference
Pop              14.6                      0.60
Classic          21.2                      0.99
Electro-Disco     6.2                      0.34
Heavy Metal      18.4                      0.54
Jazz             14.8                      0.58
Mean             17.3                      0.68
Table 1 shows the differences between the widths of the window manually measured and those automatically computed. The best results are obtained for the Disco music tracks (6.2%), where the clear drummed bass background is well detected and the pulse coincides most of the time with the tempo. The worst results are related to the lack of a clear driving bass in Classical music (21.2%), where changes in time can be frequent and a uniform tempo measure is hardly detectable. However, the beats, or lower-level metrical features, are most of the time submultiples of this tempo value, which makes them usable for the melodic cell computation.
Tonal behavior. Most music similarity systems aim at imitating the human perception of a song. This capacity is complex to analyze: the human brain carries out a series of subconscious processes, such as the computation of the rhythm, the richness of the instrumentation, the musical complexity, the tonality, the mode, the musical form or structure, the presence of modulations, etc., even without any technical musical knowledge [29].
A novel technique for the determination of the tonal behavior of music signals, based on the extraction of the pattern of tonality contributions, is presented. The main process is based on the calculus of the contributions of each note of the chromatic scale (Pitch Class Profile - PCP) and the computation of the possible matching tonalities. The outcome is a vector reflecting the variation of the spectral contribution of each tonality throughout the entire piece. The song is time windowed with non-overlapping windows, whose width is determined on the basis of the tempo induction algorithm.
The Pitch Class Profile is based on the contribution of the twelve semitone pitch classes to the whole spectrum. Fujishima [14] employed the PCPs as the main descriptor for chord recognition. For each temporal frame, the PCP is computed from a simplified spectrum by summing the contributions of each pitch class over the octaves:
$PCP_t(k) = \sum_i |X_s(k + 12\, i)|^2 \qquad (1)$
where $X_s$ is the simplified spectrum, the index k covers the twelve semitone pitches and i is used to index each octave. The subscript t stands for the temporal frame for which the PCP is computed.
In order to estimate the predominant tonality of a track, it is important to define a series of PCPs for all the possible tonalities, to be compared with its own PCP. The shape of the PCP mainly depends on the modality of the tonality (major or minor). Hence, by assembling only two global profiles, for the major and minor modes, and by shifting each of them twelve times according to the tonic pitch of the twelve possible tonalities of each mode, 24 tonality profiles are obtained.
Krumhansl [24] defined the profiles empirically, on the basis of a series of listening sessions carried out with a group of undergraduates from Harvard University, who had to evaluate the correspondence between test tracks and probe tones. The author presented two global profiles, one for the major and one for the minor mode, representing the global contribution of each tone to all the tonalities of each mode. More recently, Temperley [35] presented a modified, less biased version of the Krumhansl profiles. In this context, we propose a revised version of the Krumhansl profiles with the aim of avoiding a bias of the system towards a particular mode. Basically, the two mode profiles are normalized to show the same sum of values and, then, each profile is divided by its corresponding maximum.
For each windowed frame of the track, the squared Euclidean distance between the PCP of the frame and each tonality profile is computed to define a 24-element vector. Each element of the vector is the sum of the squared differences between the amplitudes of the PCP and the tonality profile. The squared distance is defined as follows:
$D_t(k) = \begin{cases} \sum_{j=0}^{11} \left( PCP_t(j) - P_M^{(k)}(j) \right)^2, & k \in \{1, \dots, 12\} \\ \sum_{j=0}^{11} \left( PCP_t(j) - P_m^{(k)}(j) \right)^2, & k \in \{13, \dots, 24\} \end{cases} \qquad (2)$
where $D_t(k)$ is the squared distance computed at time t for the k-th tonality, with $k \in \{1, 2, \dots, 24\}$, and $P_M$ / $P_m$ are, respectively, the (shifted) major and minor profiles.
The predominant tonality of each frame corresponds to the minimum of the distance vector $D_t(k)$, where the index k, with $k \in \{1, \dots, 12\}$, refers to the twelve major tonalities (from C to B) and k, with $k \in \{13, \dots, 24\}$, refers to the twelve minor tonalities (from c to b). Usually, major and minor tonalities are represented with capital and lower-case letters, respectively.
The empirical distribution of all the predominant tonalities, estimated throughout the entire piece, is calculated in order to represent the tonality contributions to the tonal content of the song. This is defined as the tonal behavior of the composition. In Figure 2, an example of the distribution of the tonality contributions for the Beatles song "I'll Be Back" is shown.

Fig. 2. An example of the tonal behavior of the Beatles song "I'll Be Back", where the main tonality is E major
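As an illustration of the tonal behavior computation, a minimal sketch is given below. It assumes the per-frame PCPs and the 24 shifted key profiles are already available as arrays; the shapes and names are illustrative, and the construction of the normalized Krumhansl-style profiles is not reproduced.

```python
import numpy as np

def tonal_behavior(pcp_frames, profiles):
    """pcp_frames: (n_frames, 12) pitch-class profiles, one per melodic window.
    profiles: (24, 12) key profiles (12 major followed by 12 minor).
    Returns the empirical distribution of predominant tonalities (24 values)."""
    counts = np.zeros(24)
    for pcp in pcp_frames:
        d = np.sum((pcp[None, :] - profiles) ** 2, axis=1)  # squared distances, Eq. (2)
        counts[np.argmin(d)] += 1                            # predominant tonality of the frame
    return counts / counts.sum()                             # tonal behavior vector
```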
Music similarity. The vectors describing the tonal behavior of the songs are employed to measure their reciprocal degree of similarity. In fact, the human brain is able to detect the main melodic pattern, even by means of subconscious processes, and its perception of musical similarity is partially based on it [24]. The tonal similarity between two songs is computed as the Euclidean distance between the calculated tonal vectors, following the equation:
$TS_{AB} = \| T_A - T_B \| \qquad (3)$
where $TS_{AB}$ stands for the coefficient of tonal similarity between the songs A and B, and $T_A$ and $T_B$ are the empirical tonality distributions of songs A and B, respectively.
A robust evaluation of the performance of the proposed method for the evaluation of music similarity is very hard to achieve. The judgment of the similarity between audio files is a very subjective issue, reflecting the complex reality of human perception. Nevertheless, a series of tests have been performed on some predetermined lists of songs.
Four lists of 11 songs were submitted to a group of ten listeners. They were instructed to sort the songs according to their perceptual similarity and their tonal similarity. For each list, a reference song was defined and the remaining 10 songs had to be sorted with respect to their degree of similarity with the reference one.
A series of 10-element lists were returned by the users, as well as by the automatic method. Two kinds of experiments were carried out: in the first experiment, the users had to listen to the songs and sort them according to a global perception of their degree of similarity. In the second framework, they were asked to focus only on the tonal content. The latter was the hardest target to meet, because of the complexity of discerning the parameters to be taken into account when listening to a song and evaluating its similarity with respect to other songs.
The degree of coherence between the manually sorted lists and the automatically processed ones was then obtained. A weighted matching score was computed for each pair of lists: the reciprocal distance of the songs (in terms of their position index in the lists) was calculated and these distances were linearly weighted, so that the first songs in the lists carried more importance than the last ones. In fact, it is easier to evaluate which is the most similar song among pieces that are similar to the reference one than to perform the same selection among very different songs. The weights help to compensate for this bias.
Let L and L′ represent two different ordered lists of n songs, for the same reference song. The matching score C has been computed as follows:
$C = \sum_{i=1}^{n} |i - j| \, \omega_i \qquad (4)$
where i and j are the indexes for lists L and L′, respectively, such that j is the position in L′ of the song occupying position i in L (i.e., L(i) = L′(j)). The absolute difference is linearly weighted by the weights $\omega_i$, normalized so as to sum to one: $\sum_{i=1}^{n} \omega_i = 1$. Finally, the scores are transformed to be represented as a percentage of the maximum score attainable.
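A small sketch of Eqs. (3) and (4) follows. The linearly decaying weights are an assumption consistent with the description above, and the final conversion to a percentage of the maximum attainable score is omitted.

```python
import numpy as np

def tonal_similarity(t_a, t_b):
    """Eq. (3): Euclidean distance between the tonal behavior vectors of two songs
    (smaller means more similar)."""
    return float(np.linalg.norm(np.asarray(t_a) - np.asarray(t_b)))

def matching_score(list_ref, list_other):
    """Eq. (4): weighted positional distance between two orderings of the same
    n songs; weights decay linearly with rank and sum to one."""
    n = len(list_ref)
    w = np.arange(n, 0, -1, dtype=float)
    w /= w.sum()                                    # weights sum to one
    pos = {song: j for j, song in enumerate(list_other)}
    return sum(abs(i - pos[song]) * w[i] for i, song in enumerate(list_ref))
```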
The efficiency of the automatic method was evaluated by measuring its coherence with the users' responses: the closer the two sets of values, the better the performance of the automatic method. As expected, the evaluation of the automatic method in the first experimental framework did not return reliable results, because of the extreme deviation of the marks, due to the scarce relevance of the tone distribution in the subjective judgment of a song. As mentioned before, the tonal behavior of the song is only one of the parameters taken into account subconsciously by the human ear. Nevertheless, when the same songs were to be evaluated only by their tonal content, the scores drastically decreased, revealing the limited capacity of abstraction of the human ear. Table 2 shows the results for both experimental frameworks.
The differences between the results of the two experiments are evident. Concerning the first experiment, the mean correspondence score is 74.2% among the users' lists and 60.1% between the users and the automatic list. That is, the automatic method poorly reproduces the choices made by the users when a global evaluation of music similarity is taken into account. Conversely, in the second experiment, better results were obtained. The mean correspondence score for the users' lists decreases to 61.1%, approaching the value returned by the users and the automatic list together, 59.8%. The performance of the system can therefore be considered similar to the behavior of a mean human user regarding the perception of tonal similarities.
Table 2. Means and standard deviations of the correspondence scores obtained by computing equation (4). The rows Auto+Users and Users refer to the correspondence scores computed among the users' lists together with the automatic list, and among only the users' lists, respectively. Experiment 1 is done by listening to and sorting the songs on the basis of a global perception of the track, while Experiment 2 is performed trying to take into account only the tone distributions.

Lists     Method        Experiment 1           Experiment 2
                        Mean    St. Dev.       Mean    St. Dev.
List A    Auto+Users    67.6    7.1            66.6    8.8
          Users         72.3    13.2           57.9    11.5
List B    Auto+Users    63.6    1.9            66.3    8.8
          Users         81.8    9.6            66.0    10.5
List C    Auto+Users    61.5    4.9            55.6    10.2
          Users         77.2    8.2            57.1    12.6
List D    Auto+Users    47.8    8.6            51.0    9.3
          Users         65.7    15.4           63.4    14.4
Means     Auto+Users    60.1    5.6            59.8    9.2
          Users         74.2    11.6           61.1    12.5
Fig. 3. A snapshot of some of the main windows of the interface of the OMR system
Another important use of music signal processing is the preservation and diffusion of the music heritage. In this sense, we have paid special attention to the recognition of ancient music scores.
OMR (Optical Music Recognition) systems are essentially based on the conversion of a digitized music score into an electronic format. The computer must read the document (in this case a manuscript), interpret it and transcribe its content (notes, time information, execution symbols, etc.) into an electronic format. The task can be addressed to recover important ancient documents and to improve their availability to the music community.
In the OMR framework, the recognition of ancient handwritten scores is a real challenge. The manuscripts are often in a very poor state of conservation, due to their age and the way they have been preserved. The handwritten symbols are not uniform, and additional symbols may have been added manually a posteriori by other authors. The digital acquisition of the scores and the lighting conditions during exposure can cause inconsistencies in the background of the image. All these conditions make the development of an efficient OMR system a very hard task. Although the system workflow can be generalized, the specific algorithms cannot be blindly reused for different authors: they have to be trained for each use.
We have developed a whole OMR system [34] for two styles of writing scores in white mensural notation. Figure 3 shows a snapshot of its graphical user interface. In the main window, a series of tools are supplied to follow a complete workflow based on a number of steps: the pre-processing of the image, the partition of the score into single staves and the processing of the staves with the extraction, classification and transcription of the musical neumes. Each tool corresponds to an individual window that allows the user to interact and complete the stage.
The preprocessing of the image, aimed at feeding the system with the cleanest possible black and white image of the score, is divided into the following tasks: the clipping of the region of interest of the image [9], the automatic blanking of red frontispieces, the conversion from RGB to grayscale, the compensation of the lighting conditions, the binarization of the image [17] and the correction of the image tilt [36]. After partitioning the score into single staves, the staff lines are tracked and blanked, and the symbols are extracted and classified. In particular, a series of multidimensional feature vectors are computed on the geometrical extent of the symbols and a series of corresponding classifiers are employed to relate the neumes to their corresponding musical symbols. At any moment, the interface allows the user to carefully follow each processing stage.
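As a rough illustration of this kind of pre-processing chain, a sketch using OpenCV is given below. The thresholding parameters and the way the tilt angle is obtained are illustrative assumptions: the system described above relies on its own binarization [17] and tilt-correction [36] algorithms.

```python
import cv2

def preprocess_score(path, tilt_deg=0.0):
    """Pre-processing sketch: grayscale conversion, adaptive binarization and
    optional tilt correction of a photographed score page."""
    img = cv2.imread(path)                                    # BGR colour photograph
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)              # RGB/BGR -> grayscale
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY, blockSize=35, C=15)
    if tilt_deg:                                              # correct a known page tilt
        h, w = binary.shape
        m = cv2.getRotationMatrix2D((w / 2, h / 2), tilt_deg, 1.0)
        binary = cv2.warpAffine(binary, m, (w, h), borderValue=255)
    return binary
```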
Music signal processing tools also make possible the development of new interactive methods for music learning using a computer or a toy. In this sense, we have developed a number of specialized tools to help learn how to play the piano, the violin and the guitar. These tools will be presented in Sections 4.1, 4.2 and 4.3, respectively. It is worth mentioning that, for the development of these tools, the very special characteristics of each instrument have been taken into account. In fact, the people who have developed these tools are able to play these instruments. This has contributed to making the tools especially useful, because during their development we have observed the main difficulties of each instrument. Finally, thinking about the development of toys or other small embedded systems with musical intelligence, in Subsection 4.4 we present a lightweight pitch detector that has been designed for this aim.
4.1 Tool for Piano Learning
The piano is a musical instrument that is widely used in all kinds of music and as an aid to composition, due to its versatility and ubiquity. The instrument is played by means of a keyboard and can produce very rich polyphonic sound. Piano learning involves several difficulties that come from this great capacity for generating highly polyphonic sound. These difficulties are easily observed when the musical skills are limited, or when trying to transcribe the piano sound when the piano is used for composition. Therefore, it is useful to have a system that determines the notes sounding on a piano in each time frame and represents them in a simple form that can be easily understood; this is the aim of the tool that will be presented. The core of the designed tool is a polyphonic transcription system able to detect the played notes using spectral patterns of the piano notes [6], [4].
The approach used in the proposed tool to perform the polyphonic transcription is rather different from the proposals that can be found in the literature [23]. In our case, the audio signal to be analyzed is considered to have certain similarities to a code division multiple access (CDMA) communications signal. Our model considers the spectral patterns of the different piano notes [4]. Therefore, in order to detect the notes that sound during each time frame, we have considered a suitable modification of a CDMA multiuser detection technique that copes with the polyphonic nature of piano music and with the different energies of the piano notes, in the same way as an advanced CDMA receiver [5].
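The following sketch is not the detector of [5]; it only illustrates the general idea of treating note spectral patterns as signatures and cancelling detected contributions before looking for further notes, in the spirit of successive interference cancellation. All names, shapes and thresholds are illustrative assumptions.

```python
import numpy as np

def detect_notes(frame_spectrum, patterns, max_notes=6, min_gain=0.05):
    """Greedy, interference-cancelling detection of simultaneous piano notes.
    frame_spectrum: magnitude spectrum of one analysis frame (1-D array).
    patterns: dict note_name -> spectral pattern (same length, unit norm)."""
    residual = frame_spectrum.astype(float).copy()
    detected = []
    for _ in range(max_notes):
        # correlate the residual with every note pattern (matched filtering)
        scores = {n: float(np.dot(residual, p)) for n, p in patterns.items()}
        note, score = max(scores.items(), key=lambda kv: kv[1])
        if score < min_gain * np.linalg.norm(frame_spectrum):
            break                                            # remaining energy too small
        detected.append(note)
        residual = np.maximum(residual - score * patterns[note], 0.0)  # cancel its contribution
    return detected
```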
A snapshot of the main window of the interface is presented in Figure 4. The designed graphical user interface is divided into three parts:
- The management items of the tool are three main buttons: one button to acquire the piano music to analyze, another button to start the system and a final button to reset the system.
- The time and frequency responses of each frame of piano sound under analysis are shown in the middle part of the window.
Fig. 4. A snapshot of the main window of the interface of the tool for piano learning
- A piano keyboard, on which the played notes are highlighted and their names displayed, is shown at the bottom.
4.2 Tool for Violin Learning
The violin is one of the most complex instruments and is often used by children for their first approach to music learning. The main characteristic of the violin is its great expressiveness, due to the wide range of interpretation techniques. The system we have developed is able to detect not only the played pitch, as other transcription systems do [11], but also the technique employed [8]. The signal envelope and the frequency spectrum are considered in the time and frequency domains, respectively. The descriptors employed for the detection system have been computed by analyzing a large amount of violin recordings, from the Musical Instrument Sound Data Base RWC-MDB-1-2001-W05 [18] and other home-made recordings. Different playing techniques have been performed in order to train the system for its expressiveness capability. The graphical interface is aimed at facilitating violin learning for any user.
A graphical interface has been developed for the signal processing tool. The main window presents two options for the user, the theory section (Teoría) and the practical section (Práctica). In the section Teoría, the user is encouraged to learn the concepts about the violin's history, the violin's parts and the playing posture (left and right hand), while the section Práctica is mainly based on an expressiveness transcription system [8]. Here, the user starts with the basic study sub-section, where the main violin positions are presented, illustrating the placement of the left hand on the fingerboard, with the aim of attaining a good intonation. Hence, the user can record the melody corresponding to the
selected position and ask the application to correct it, returning the errors made. Otherwise, in the free practice sub-section, any kind of violin recording can be analyzed for its melodic content, detecting the pitch, the duration of the notes and the techniques employed (e.g., détaché with and without vibrato, pizzicato, tremolo, spiccato, flageolett-töne). The user can also visualize the envelope and the spectrum of each note and listen to the generated MIDI transcription. In Figure 5, some snapshots of the interface are shown. The overall performance attained by our system in the detection and correction of the notes and expressiveness is 95.4%.
4.3 Tool for Guitar Learning
The guitar is one of the most popular musical instruments nowadays. In contrast to other instruments like the piano, on the guitar the same note can be played by plucking different strings at different positions. Therefore, the algorithms used for piano transcription [10] cannot be used for the guitar: in guitar transcription, it is important to estimate the string used to play a note [7].
Fig. 5. Three snapshots of the interface for violin learning. Clockwise from top left: the main window, the analysis window and a plot of the MIDI melody.
The system presented here is able to estimate the string and the fret of a single played note with a very low error probability. In order to keep a low error probability when a chord is strummed on the guitar, the system chooses the chord that has most likely been played from a predefined list. The system works with classical guitars as well as with acoustic or electric guitars. The sound has to be captured with a microphone connected to the computer sound card; it is also possible to plug a cable from an electric guitar directly into the sound card.
The graphical interface consists of a main window (Figure 6(a)) with a pop-up menu where the type of guitar to be used with the interface can be chosen. The main window includes a panel (Estimación) with three push buttons, where one of three estimation modes can be selected:
- The mode Nota única (Figure 6(b)) estimates the string and fret of a single note that is being played and includes a tuner (afinador).
- The mode Acorde predeterminado estimates strummed chords that are being played. The system estimates the chord by choosing the most likely one from a predefined list.
- The last mode, Acorde libre, makes a free estimation of what is being played. In this mode, the system does not have the information of how many notes are being played, so this piece of information is also estimated.
Each mode includes a window that shows the microphone input, a window with the Fourier transform of the sound sample, a start button, a stop button and an exit button (Salir). At the bottom of the screen there is a panel that represents a guitar: each row stands for a string of the guitar and the frets are numbered from one to twelve. The current estimation of the sound sample, either a note or a chord, is shown on the panel with a red dot.
4.4 Lightweight Pitch Detector for Embedded Systems
Artificial intelligence techniques, such as neural networks sized to be implemented in a small system, can provide the necessary accuracy. There are two alternatives [3], [33] in neural networks. The first one is unsupervised training. This is the case of some networks that have been specially designed for pattern classification, such as self-organizing maps. However, the computational complexity of this implementation is too high for a low-cost microcontroller.
The other alternative is supervised training of neural networks. This is the case of perceptron-type networks. In these networks, the synaptic weights connecting the neurons are modified as each new training vector is presented. Once the network is trained, the weights can be statically stored to classify new network inputs. The training algorithm can be run on a different machine from the one where the network propagation algorithm is executed. Hence, the only limitation comes from the available memory.
In the proposed system, we focus on the design of a lightweight pitch detector for embedded systems, based on a neural network whose input is a frequency-domain preprocessing of the signal. The selected neural network is a perceptron-type network. For the preprocessing, the Goertzel algorithm [15] is the selected technique for the frequency analysis because it is a light alternative to FFT computation when only some of the spectral points are of interest.
Fig. 7. Block diagram of the pitch detection system: audio in, preamp, 8th-order elliptic filter, 10-bit A/D conversion, buffering, preprocessing and pitch detection on the AVR ATMEGA168, with an I2C output.
Figure 7 shows the block diagram of the detection system. The figure shows the hardware connected to the microcontroller's A/D input, which consists of a preamplifier, to accommodate the input from the electret microphone to the A/D input range, and an anti-aliasing filter. The anti-aliasing filter provides 58 dB of attenuation at cutoff, which is enough to ensure the anti-aliasing function. After the filter, the internal A/D converter of the microcontroller is used. After conversion, a buffer memory is required in order to store enough samples for the preprocessing block. The output of the preprocessing block is used for pitch detection using a neural network. Finally, an I2C (Inter-Integrated Circuit) [32] interface is used for connecting the microcontroller with other boards.
We use the open-source Arduino environment [1] with the AVR ATMEGA168 microcontroller [2] for the development and testing of the pitch detection implementation. The system is configured to detect the notes between A3 (220 Hz)
and G#5 (830.6 Hz), following the well-tempered scale, as it is the system mainly used in Western music. This range of notes has been selected because one of the applications of the proposed system is the detection of the vocal music of children and adolescents.
The aim of the preprocessing stage is to transform the samples of the audio signal from the time domain to the frequency domain. The Goertzel algorithm [15], [30] is a light alternative to FFT computation when the interest is focused only on some points of the spectrum, as in this case. Given the frequency range of the musical notes in which the system is going to work, along with the sampling restrictions of the selected processor, the selected sampling frequency is $f_s = 4$ kHz and the number of input samples is $N = 400$, which gives a precision of 10 Hz, sufficient for the pitch detection system. On the other hand, in the preprocessing block, the number of frequencies $f_p$ at which the Goertzel algorithm is computed is 50; they are given by $f_p = 440 \cdot 2^{p/12}$ Hz with $p = -24, -23, \dots, 0, \dots, 24, 25$, so that each note in the range of interest has at least one harmonic and one subharmonic, to improve the detection performance for notes with an octave or perfect fifth relation. Finally, the output of the preprocessing stage is a vector that contains the squared modulus of the 50 points of interest of the Goertzel algorithm: the points of the power spectrum of the input audio signal at the frequencies of interest.
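A sketch of this preprocessing stage is given below. The Goertzel recursion is the textbook formulation [15], [30], while the bin rounding and the buffer handling are simplifications of the embedded implementation.

```python
import math

def goertzel_power(samples, fs, f_target):
    """Squared magnitude of the DFT of `samples` at the single frequency
    `f_target`, computed with the Goertzel recursion."""
    n = len(samples)
    k = round(n * f_target / fs)            # nearest DFT bin
    w = 2.0 * math.pi * k / n
    coeff = 2.0 * math.cos(w)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

# the 50 analysis frequencies of the pitch detector: f_p = 440 * 2^(p/12) Hz
fs, n = 4000, 400
freqs = [440.0 * 2.0 ** (p / 12.0) for p in range(-24, 26)]
# feature vector for one 100 ms buffer `buf` of 400 samples:
# features = [goertzel_power(buf, fs, f) for f in freqs]
```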
With the algorithm implemented using fixed-point arithmetic, the execution time is less than 3 ms on a 16 MIPS AVR microcontroller. The number of points of the Goertzel algorithm is limited by the available memory. Equation (5) gives the number of bytes required to implement the algorithm:
$n_{bytes} = 2\left(\frac{N}{4} + 2N + m\right) \qquad (5)$
In this expression, m represents the number of desired frequency points. Thus, with m = 50 points and N = 400, the algorithm requires 1900 bytes of RAM for signal input/processing/output buffering. Since the microcontroller has 1024 bytes of RAM, it is necessary to use an external high-speed SPI RAM in order to have enough memory for buffering the audio samples.
Once the Goertzel algorithm has been computed and the points are stored in RAM, a recognition algorithm has to be executed for pitch detection. A useful alternative to spectral processing techniques consists of using artificial intelligence techniques. We use a statically trained neural network, storing the network weight vectors in an EEPROM memory. Thus, the network training is performed on a computer with the same algorithm implemented, and the embedded system only runs the network. Figure 8 depicts the structure of the neural network used for pitch recognition. It is a multilayer feed-forward perceptron with a back-propagation training algorithm.
In our approach, a sigmoidal activation has been used for each neuron, with no neuron bias. This provides a fuzzy set of values, $y_j$, at the output of each neural layer. The fuzzy set is controlled by the shape factor of the sigmoid function, which is set to 0.8, and it is applied to a threshold-based decision function. Hence, outputs below 0.5 do not activate output neurons, while values above 0.5 do.
Fig. 8. Structure of the neural network used for pitch recognition: input layer, hidden layer and output layer.
Fig. 9. Learning test, validation test and ideal output of the designed neural network
The neural network parameters, such as the number of neurons in the hidden layer or the shape factor of the sigmoid function, have been determined experimentally. The neural network has been trained by running the BPN (Back-Propagation Neural Network) algorithm on a PC. Once network convergence is achieved, the weight vectors are stored. Regarding the output layer of the neural network, we use five neurons to encode 24 different outputs, corresponding to each note in the two octaves from A3 to G#5.
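A forward-pass sketch of such a statically trained network is given below. The hidden-layer size and the decoding of the five-neuron output code are assumptions; the text above only fixes the 50 inputs, the sigmoidal activations with shape factor 0.8, the absence of biases and the five output neurons.

```python
import numpy as np

def sigmoid(x, alpha=0.8):
    """Sigmoid activation with shape factor alpha (no bias terms are used)."""
    return 1.0 / (1.0 + np.exp(-alpha * x))

def run_network(features, w_hidden, w_out, threshold=0.5):
    """Forward pass of the stored (pre-trained) perceptron.
    features: 50 Goertzel powers; w_hidden: (n_hidden, 50); w_out: (5, n_hidden).
    Returns the 5-bit output pattern (outputs above 0.5 fire), to be decoded
    into one of the 24 note classes."""
    h = sigmoid(w_hidden @ np.asarray(features, dtype=float))
    y = sigmoid(w_out @ h)
    return (y > threshold).astype(int)
```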
The training and the evaluation of the proposed system have been performed using independent note samples taken from the Musical Instrument Data Base RWC [19]. The selected instruments were piano and human voice. The training of the neural network was performed using 27 samples for each note in the
range of interest. Thus, we used 648 input vectors to train the network. In this way, network convergence was achieved with an error of 0.5%.
Figure 9 shows the learning characteristic of the network when simulating the network with the training vectors. At the same time, we show the validation test using 96 input vectors (4 per note), which corresponds to about 15% of new inputs. As shown in Figure 9, the inputs are correctly classified, given the small difference among the outputs for the ideal, learning and validation inputs.
Conclusions
Nowadays, it is a requirement that all types of information be widely available in digital form in digital libraries, together with intelligent techniques for the creation and management of the digital information; in this way, contents will be plentiful, open, interactive and reusable. It becomes necessary to link contents, knowledge and learning in such a way that information will be produced, stored, handled, transmitted and preserved to ensure long-term accessibility to everyone, regardless of the special requirements of certain communities (e-inclusion). Among the many different types of information, music happens to be one of the most widely demanded, due to its cultural interest, for entertainment, or even for therapeutic reasons.
Throughout this paper, we have presented several applications of music signal processing techniques. It is clear that the use of such tools can be very enriching from several points of view: music content management, cultural heritage preservation and diffusion, tools for technology-enhanced learning and revolutionary learning appliances, etc. Now that we have the technology at our side at every moment (mobile phones, e-books, computers, etc.), all the tools we have developed can be easily used. There are still many open issues and things that should be improved but, more and more, technology helps music.
Acknowledgments
This work has been funded by the Ministerio de Ciencia e Innovación of the Spanish Government under Project No. TIN2010-21089-C03-02, by the Junta de Andalucía under Project No. P07-TIC-02783 and by the Ministerio de Industria, Turismo y Comercio of the Spanish Government under Project No. TSI-020501-2008-117. The authors are grateful to the person in charge of the Archivo de la Catedral de Málaga, who allowed the utilization of the data sets used in this work.
References
1. Arduino board, http://www.arduino.cc (last viewed February 2011)
2. Atmel corporation web site, http://www.atmel.com (last viewed February 2011)
3. Aliev, R.: Soft Computing and its Applications. World Scientific Publishing Company, Singapore (2001)
21. Holzapfel, A., Stylianou, Y.: Rhythmic similarity of music based on dynamic periodicity warping. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2008, Las Vegas, USA, March 31 - April 4, pp. 2217-2220 (2008)
22. Izmirli, Ö.: Audio key finding using low-dimensional spaces. In: Proc. Music Information Retrieval Conference, ISMIR 2006, Victoria, Canada, pp. 127-132 (2006)
23. Klapuri, A.: Automatic music transcription as we know it today. Journal of New Music Research 33(3), 269-282 (2004)
24. Krumhansl, C.L., Kessler, E.J.: Tracing the dynamic changes in perceived tonal organization in a spatial representation of musical keys. Psychological Review 89, 334-368 (1982)
25. Lampropoulos, A.S., Sotiropoulos, D.N., Tsihrintzis, G.A.: Individualization of music similarity perception via feature subset selection. In: Proc. Int. Conference on Systems, Man and Cybernetics, Massachusetts, USA, vol. 1, pp. 552-556 (2004)
26. Logan, B., Salomon, A.: A music similarity function based on signal analysis. In: IEEE International Conference on Multimedia and Expo, ICME 2001, Tokyo, Japan, pp. 745-748 (August 2001)
27. Logan, B.: Mel frequency cepstral coefficients for music modeling. In: Proc. Music Information Retrieval Conference (ISMIR 2000) (2000)
28. Marolt, M.: A connectionist approach to automatic transcription of polyphonic piano music. IEEE Transactions on Multimedia 6(3), 439-449 (2004)
29. Ockelford, A.: On Similarity, Derivation and the Cognition of Musical Structure. Psychology of Music 32(1), 23-74 (2004), http://pom.sagepub.com/cgi/content/abstract/32/1/23 (last viewed February 2011)
30. Oppenheim, A., Schafer, R.: Discrete-Time Signal Processing. Prentice-Hall, Englewood Cliffs (1989)
31. Pampalk, E.: Islands of music - analysis, organization, and visualization of music archives. Tech. rep., Vienna University of Technology (2001)
32. Philips: The I2C bus specification v.2.1 (2000), http://www.nxp.com (last viewed February 2011)
33. Prasad, B., Mahadeva, S.: Speech, Audio, Image and Biomedical Signal Processing using Neural Networks. Springer, Heidelberg (2004)
34. Tardón, L.J., Sammartino, S., Barbancho, I., Gómez, V., Oliver, A.J.: Optical music recognition for scores written in white mensural notation. EURASIP Journal on Image and Video Processing 2009, Article ID 843401, 23 pages (2009), doi:10.1155/2009/843401
35. Temperley, D.: The Cognition of Basic Musical Structures. The MIT Press, Cambridge (2004)
36. William, W.K.P.: Digital image processing, 2nd edn. John Wiley & Sons Inc., New York (1991)
Introduction
Increasing amounts of broadcast material are being made available in the podcast format, which is defined in reference [52] as a digital audio or video file that is episodic; downloadable; programme-driven, mainly with a host and/or theme; and convenient, usually via an automated feed with computer software (the word podcast comes from the contraction of webcast, a digital media file distributed over the Internet using streaming technology, and iPod, the portable media player by Apple). New technologies have indeed emerged allowing users
to access audio podcast material either online (on radio websites such as the one from the BBC used in this study: http://www.bbc.co.uk/podcasts), or offline, after downloading the content to personal computers or mobile devices using dedicated services. A drawback of the podcast format, however, is its lack of indexes for individual songs and sections, such as speech. This makes navigation through podcasts a difficult, manual process, and software built on top of automated podcast segmentation methods would therefore be of considerable help for end-users. Automatic segmentation of podcasts is a challenging task in speech processing and music information retrieval, since the nature of the content from which they are composed is very broad. A non-exhaustive list of the types of content commonly found in podcasts includes: spoken parts of various types, depending on the characteristics of the speakers (language, gender, number, etc.) and the recording conditions (reverberation, telephonic transmission, etc.); music tracks often belonging to disparate musical genres (classical, rock, jazz, pop, electro, etc.), which may include a predominant singing voice (a source of confusion since the latter intrinsically shares properties with the spoken voice); and jingles and commercials, which are usually complex sound mixtures including voice, music, and sound effects. One step of the process of automatically segmenting and annotating podcasts is therefore to segregate sections of speech from sections of music. In this study, we propose two computational models for speech/music discrimination based on structural segmentation and/or timbre recognition and evaluate their performance in the classification of audio podcast content. In addition to their use with audio broadcast material (e.g. music shows, interviews), as assessed in this article, speech/music discrimination models may also be of interest to enhance navigation into archival sound recordings that contain both spoken word and music (e.g. ethnomusicology interviews available on the online sound archive from the British Library: https://sounds.bl.uk/). While speech/music discrimination models find a direct application in automatic audio indexing, they may also be used as a preprocessing stage to enhance numerous speech processing and music information retrieval tasks such as speech and music coding, automatic speaker recognition (ASR), chord recognition, or musical instrument recognition.
The speech/music discrimination methods proposed in this study rely on timbre models (based on various features such as the line spectral frequencies [LSF] and the mel-frequency cepstral coefficients [MFCC]) and machine learning techniques (K-means clustering and hidden Markov models [HMM]). The first proposed method comprises an automatic timbre recognition (ATR) stage using the model proposed in [7] and [16], trained here with speech and music content. The results of the timbre recognition system are then post-processed using a median filter to minimize undesired inter-class switches. The second method utilizes the automatic structural segmentation (ASS) model proposed in [35] to divide the signal into a set of segments which are homogeneous with respect to timbre before applying the timbre recognition procedure. A database of classical music, jazz, and popular music podcasts from the BBC was manually annotated for training and testing purposes (approximately 2.5
hours of speech and music). Both methods were evaluated at the semantic level, to measure the accuracy of the machine-estimated classifications, and at the temporal level, to measure the accuracy of the machine-estimated boundaries between speech and music sections. Whilst studies on speech/music discrimination techniques usually provide the first type of evaluation (classification accuracy), boundary retrieval performances are, to our knowledge, not reported, despite their interest. The results of the proposed methods were also compared with those obtained with a state-of-the-art speech/music discrimination algorithm based on support vector machines (SVM) [44].
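As an illustration of the median-filtering post-processing mentioned above, a frame-level label sequence could be smoothed as follows; the window length is an arbitrary placeholder, not the value of W used in the experiments.

```python
import numpy as np
from scipy.signal import medfilt   # assumed available

def smooth_labels(frame_labels, window=21):
    """Median-filter a sequence of short-term speech(0)/music(1) decisions to
    remove isolated inter-class switches; `window` (odd, in frames) plays the
    role of the sliding window of the post-processing stage."""
    labels = np.asarray(frame_labels, dtype=float)
    return medfilt(labels, kernel_size=window).astype(int)
```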
The remainder of the article is organized as follows. In Section 2, a review of related work on speech/music discrimination is proposed. In Section 3, we give a brief overview of timbre research in psychoacoustics, speech processing and music information retrieval, and then describe the architecture of the proposed timbre-based methods. Section 4 details the protocols and databases used in the experiments, and specifies the measures used to evaluate the algorithms. The results of the experiments are given and discussed in Section 5. Finally, Section 6 is devoted to the summary and the conclusions of this work.
Related Work
Speech/music discrimination is a special case of audio content classification reduced to two classes. Most audio content classification methods are based on the following stages: (i) the extraction of (psycho)acoustical variables aimed at characterizing the classes to be discriminated (these variables are commonly referred to as descriptors or features), (ii) a feature selection stage in order to further improve the performance of the classifier, which can be done either a priori, based on some heuristics about the disparities between the classes to discern, or a posteriori, using an automated selection technique, and (iii) a classification system relying either on generative methods modeling the distributions in the feature space, or on discriminative methods which determine the boundaries between classes. The seminal works on speech/music discrimination by Saunders [46], and Scheirer and Slaney [48], developed descriptors quantifying various acoustical specificities of speech and music which were then widely used in studies on the same subject. In [46], Saunders proposed five features suitable for speech/music discrimination whose quick computation in the time domain, directly from the waveform, allowed for a real-time implementation of the algorithm; four of them are based on the zero-crossing rate (ZCR) measure (a correlate of the spectral centroid, or center of mass of the power spectral distribution, that characterizes the dominant frequency in the signal [33]), and the other was an energy contour (or envelope) dip measure (the number of energy minima below a threshold defined relative to the peak energy in the analyzed segment). The zero-crossing rates were computed on a short-term basis (frame-by-frame) and then integrated on a longer-term basis with measures of the skewness of their distribution (standard deviation of the derivative, the third central moment about the mean, the number of zero crossings exceeding a threshold, and the difference of the zero-crossing
samples above and below the mean). When both the ZCR- and energy-based features were used jointly with a supervised machine learning technique relying on a multivariate Gaussian classifier, a 98% accuracy was obtained on average (speech and music) using 2.4 s long audio segments. The good performance of the algorithm can be explained by the fact that the zero-crossing rate is a good candidate to discern unvoiced speech (fricatives), with a modulated noise spectrum (relatively high ZCR), from voiced speech (vowels), with a quasi-harmonic spectrum (relatively low ZCR): speech signals, whose characteristic structure is a succession of syllables made of short periods of fricatives and long periods of vowels, present a marked rise in the ZCR during the periods of fricativity, which does not appear in music signals, which are largely tonal (this, however, depends on the musical genre considered). Secondly, the energy contour dip measure characterizes well the differences between speech (whose systematic changeovers between voiced vowels and fricatives produce marked and frequent changes in the energy envelope) and music (which tends to have a more stable energy envelope). However, the algorithm proposed by Saunders is limited in time resolution (2.4 s). In [48], Scheirer and Slaney proposed a multifeature approach and examined various powerful classification methods. Their system relied on the 13 following features and, in some cases, their variance: the 4 Hz modulation energy (characterizing the syllabic rate in speech [30]), the percentage of low-energy frames (more silences are present in speech than in music), the spectral rolloff, defined as the 95th percentile of the power spectral distribution (a good candidate to discriminate voiced from unvoiced sounds), the spectral centroid (often higher for music with percussive sounds than for speech, whose pitches stay in a fairly low range), the spectral flux, which is a measure of the fluctuation of the short-term spectrum (music tends to have a higher rate of spectral flux change than speech), the zero-crossing rate as in [46], the cepstrum resynthesis residual magnitude (the residual is lower for unvoiced speech than for voiced speech or music), and a pulse metric (indicating whether or not the signal contains a marked beat, as is the case in some popular music). Various classification frameworks were tested by the authors: a multidimensional Gaussian maximum a posteriori (MAP) estimator as in [46], a Gaussian mixture model (GMM), a k-nearest-neighbour estimator (k-NN), and a spatial partitioning scheme (k-d tree); all led to similar performances. The best average recognition accuracy using the spatial partitioning classification was 94.2% on a frame-by-frame basis, and 98.6% when integrating over 2.4 s long segments of sound, the latter results being similar to those obtained by Saunders. Some authors used extensions or correlates of the previous descriptors for the speech/music discrimination task, such as the higher order crossings (HOC), which are the zero-crossing rates of filtered versions of the signal [37] [20], originally proposed by Kedem [33], the spectral flatness (quantifying how tonal or noisy a sound is) and the spectral spread (the second central moment of the spectrum) defined in the MPEG-7 standard [9], and a rhythmic pulse computed in the MPEG compressed domain [32]. Carey et al. introduced the use of the fundamental frequency f0 (strongly correlated with the perceptual attribute of pitch) and its derivative in order to characterize
some prosodic aspects of the signals (f0 changes in speech are more evenly distributed than in music, where they are either strongly concentrated about zero, due to steady notes, or large, due to shifts between notes) [14]. The authors obtained a recognition accuracy of 96% using the f0-based features with a Gaussian mixture model classifier. Descriptors quantifying the shape of the spectral envelope were also widely used, such as the Mel Frequency Cepstral Coefficients (MFCC) [23] [25] [2] and the Linear Prediction Coefficients (LPC) [23] [1]. El-Maleh et al. [20] used descriptors quantifying the formant structure of the spectral envelope, the line spectral frequencies (LSF), as in this study (see Section 3.1). By coupling the LSF and HOC features with a quadratic Gaussian classifier, the authors obtained a 95.9% average recognition accuracy with decisions made over 1 s long audio segments, a procedure which performed slightly better than the algorithm by Scheirer and Slaney tested on the same dataset (an accuracy increase of approximately 2%). Contrary to the studies described above, which relied on generative methods, Ramona and Richard [44] developed a discriminative classification system relying on support vector machines (SVM) and median filtering post-processing, and compared diverse hierarchical and multi-class approaches depending on the grouping of the learning classes (speech only, music only, speech with musical background, and music with singing voice). The most relevant features amongst a large collection of about 600 features are selected using the inertia ratio maximization with feature space projection (IRMFSP) technique introduced in [42] and integrated over 1 s long segments. The method provided an F-measure of 96.9% with a feature vector dimension of 50. Those results represent an error reduction of about 50% compared to the results gathered by the French ESTER evaluation campaign [22]. As will be further shown in Section 5, we obtained performances that compare favorably to those provided by this algorithm. Surprisingly, all the mentioned studies evaluated the speech/music class recognition accuracy, but none, to our knowledge, evaluated the boundary retrieval performance commonly used to evaluate structural segmentation algorithms [35] (see Section 4.3), which we also investigate in this work.
Classification Frameworks
[Figure 1 diagram: (a) testing audio -> timbre recognition (LSF, K, L) -> intermediate (short-term) classification -> post-processing (median filtering) -> segment-level classification; (b) testing audio -> structural segmentation (S, D) -> homogeneous segments -> timbre recognition (LSF, K, L) -> intermediate (short-term) classification -> post-processing (class decision) -> segment-level classification, i.e. classification based on automatic structural segmentation and timbre recognition (ASS/ATR).]
Fig. 1. Architecture of the two proposed audio segmentation systems. The tuning parameters of the systems' components are also reported: number of line spectral frequencies (LSF), number of codevectors K, latency L for the automatic timbre recognition module, size of the sliding window W used in the median filtering (post-processing), maximal number S of segment types, and minimal duration D of segments for the automatic structural segmentation module.
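As a reading aid for Figure 1, the sketch below gathers the tuning parameters listed in the caption into a single configuration structure and shows the kind of median-filtering post-processing applied to the intermediate short-term decisions. Parameter names and default values are illustrative assumptions, not values prescribed by the paper.

from dataclasses import dataclass
import numpy as np

@dataclass
class SegmenterConfig:
    n_lsf: int = 8                 # number of line spectral frequencies (LSF)
    n_codevectors: int = 32        # K: codebook size for the K-means timbre model
    latency_s: float = 1.0         # L: decision latency of the timbre recognizer
    median_window: int = 11        # W: sliding window for median filtering (odd)
    max_segment_types: int = 2     # S: maximal number of segment types (ASS)
    min_segment_dur_s: float = 5.0 # D: minimal duration of structural segments

def median_filter_decisions(labels, window=11):
    """Smooth a sequence of integer class decisions (e.g. 0=speech, 1=music)
    with a sliding median; an odd window keeps the output integer-valued."""
    labels = np.asarray(labels)
    half = window // 2
    padded = np.pad(labels, half, mode="edge")
    return np.array([int(np.median(padded[i:i + window]))
                     for i in range(len(labels))])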
The two proposed systems rely on the assumption that speech and music can be discriminated based on their differences in timbre. Exhaustive computational models of timbre have not yet been found, and the common definition used by scholars remains vague: timbre is "that attribute of auditory sensation in terms of which a listener can judge that two sounds similarly presented and having the same loudness and pitch are dissimilar; timbre depends primarily upon the spectrum of the stimulus, but it also depends on the waveform, the sound pressure, the frequency location of the spectrum, and the temporal characteristics of the stimulus" [3]. Research in psychoacoustics [24] [10] [51], analysis/synthesis [45], music perception [4] [5], speech recognition [19], and music information retrieval [17] has however developed acoustical correlates of timbre characterizing some of the facets of this complex and multidimensional variable.
The Two-fold Nature of Timbre: from Identity to Quality. One of the pioneers of timbre research, the French researcher and electroacoustic music composer Schaeffer, put forward a relevant paradox about timbre, wondering how a musical instrument's timbre could be defined considering that each of its tones also possesses a specific timbre [47]. Cognitive categorization theories shed light on Schaeffer's paradox by showing that sounds (or objects, respectively) can be categorized either in terms of the sources from which they are generated, or simply as sounds (objects, respectively), in terms of the properties that characterize them [15]. These principles were applied to timbre by Handel, who described timbre perception as being guided both by our ability to recognize the various physical factors that determine the acoustical signal produced by musical instruments [27] (later coined the source mode of timbre perception by Hajda et al. [26]), and by our ability to analyze the acoustic properties of sound objects perceived by the ear, traditionally modeled as a time-evolving frequency analyser (later coined the interpretative mode of timbre perception in [26]). In order to refer to this two-fold nature of timbre, we like to use the terms timbre identity and timbre quality, which were proposed in reference [38]. The identity and quality facets of timbre perception have several properties: they are not independent but intrinsically linked together (e.g. we can hear a guitar tone and recognize the guitar, or we can hear a guitar tone and hear the sound for itself without thinking of the instrument); they are a function of the sounds we listen to (in some music, like musique concrète, the sound sources are deliberately hidden by the composer, hence the notion of timbre identity is different and may refer to the technique employed by the musician, e.g. a specific filter); and their range of application varies (music lovers are often able to recognize the performer behind the instrument, extending the notion of identity to the very start of the chain of sound production, the musician who controls the instrument). Based on these considerations, we include the notion of sound texture, such as that produced by layers of instruments in music, into the definitions of timbre. The notion of timbre identity in music may then be closely linked to a specific band, a sound engineer, or the musical genre, largely related to the instrumentation.
The Formant Theory of Timbre. Contrary to the classical theory of musical timbre advocated in the late 19th century by Helmholtz [29], timbre does not only depend on the relative proportion between the harmonic components of a (quasi-)harmonic sound; two straightforward experiments indeed show that timbre is highly altered when a sound a) is reversed in time, or b) is pitch-shifted by frequency translation of the spectrum, despite the fact that in both cases the relative energy ratios between harmonics are kept. The work of the phonetician Slawson showed that the timbre of voiced sounds is mostly characterized by the invariance of their spectral envelope through pitch changes, and therefore by a mostly fixed formant structure, i.e. zones of high spectral energy (however, in the case of large pitch changes the formant structure needs to be slightly shifted for the timbral identity of the sounds to remain unchanged): "The popular notion that a particular timbre depends upon the presence of certain overtones (if that notion is interpreted as the relative pitch theory of timbre) is seen [...] to lead not to invariance but to large differences in musical timbre with changes in fundamental frequency. The fixed pitch or formant theory of timbre is seen in those same results to give much better predictions of the minimum differences in musical timbre with changes in fundamental frequency. The results [...] suggest that the formant theory may have to be modified slightly. A precise determination of minimum differences in musical timbre may require a small shift of the lower resonances, or possibly the whole spectrum envelope, when the fundamental frequency changes drastically." [49]. The findings by Slawson have causal and cognitive explanations. Sounds produced by the voice (spoken or sung) and by most musical instruments present a formant structure closely linked to resonances generated by one or several components implicated in their production (e.g. the vocal tract for the voice, the body for the guitar, the mouthpiece for the trumpet). It therefore seems legitimate, from the perceptual point of view, to suggest that the auditory system relies on the formant structure of the spectral envelope to discriminate such sounds (e.g. two distinct male voices of the same pitch, loudness, and duration), as proposed by the source or identity mode of timbre perception hypothesis mentioned earlier.
The timbre models used in this study to discriminate speech and music rely on features modeling the spectral envelope (see the next section). In these timbre models, the temporal dynamics of timbre are captured to a certain extent by performing signal analysis on successive frames where the signal is assumed to be stationary, and by the use of a hidden Markov model (HMM), as described in section 3.3. Temporal (e.g. attack time) and spectro-temporal parameters (e.g. spectral flux) have also been shown to be major correlates of timbre spaces, but these findings were obtained in studies which did not include speech sounds, only musical instrument tones produced either on different instruments (e.g. [40]) or within the same instrument (e.g. [6]). In situations where we discriminate timbres from various sources, either implicitly (e.g. in everyday-life situations) or explicitly (e.g. in a controlled experiment), it is most probable that the auditory system uses different acoustical cues depending on the typological differences of the considered sources. Hence, the descriptors used to account for timbre differences between musical instrument tones may not be adapted to the discrimination between speech and music sounds. If subtle timbre differences are possible within a same instrument, large timbre differences are expected to occur between disparate classes such as speech and music, and those are liable to be captured by spectral envelope correlates. Music, generally being a mixture of musical instrument sounds playing either synchronously in a polyphonic way, or solo, may exhibit complex formant structures induced by its
3.2
The method is based on the timbre recognition system proposed in [7] and [16],
which we describe in the remainder of this section.
Feature Extraction. The algorithm relies on a frequency-domain representation of the signal using short-term spectra (see Figure 2). The signal is first decomposed into overlapping frames of equal size, obtained by multiplying blocks of audio data with a Hamming window to minimize spectral distortion. The fast Fourier transform (FFT) is then computed on a frame-by-frame basis. The LSF features described above are extracted from the short-term spectra.
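A minimal sketch of such a front end is given below: Hamming-windowed frames, an LPC fit per frame (autocorrelation method with the Levinson-Durbin recursion), and conversion of the LPC polynomial into line spectral frequencies by taking the root angles of the symmetric and antisymmetric polynomials. This is one common route to the LSFs [31], not necessarily the exact procedure used by the authors, and the frame length and hop size are assumptions.

import numpy as np

def lpc_coefficients(frame, order):
    """LPC via the autocorrelation method and the Levinson-Durbin recursion."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        new_a = a.copy()
        new_a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a  # prediction polynomial A(z) = 1 + a1 z^-1 + ... + ap z^-p

def lpc_to_lsf(a):
    """LSFs: angles of the roots of P(z) = A(z) + z^-(p+1) A(1/z) and Q(z) = A(z) - z^-(p+1) A(1/z)."""
    a_ext = np.concatenate([a, [0.0]])
    p_poly = a_ext + a_ext[::-1]
    q_poly = a_ext - a_ext[::-1]
    angles = np.angle(np.concatenate([np.roots(p_poly), np.roots(q_poly)]))
    return np.sort(angles[(angles > 1e-6) & (angles < np.pi - 1e-6)])  # p values in (0, pi)

def extract_lsf_frames(x, n_lsf=8, frame_len=512, hop=256):
    """Per-frame LSF vectors for a mono signal x (1-D numpy array)."""
    win = np.hamming(frame_len)
    return np.array([lpc_to_lsf(lpc_coefficients(x[i:i + frame_len] * win, n_lsf))
                     for i in range(0, len(x) - frame_len + 1, hop)])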
Classifier. The classification process is based on the unsupervised K-means clustering technique at both the training and the testing stages. The principle of K-means clustering is to partition an n-dimensional space (here the LSF feature space) into K distinct regions (or clusters), which are characterized by their centres (called codevectors). The collection of the K codevectors (LSF vectors) constitutes a codebook, whose function, within this context, is to capture the most relevant features characterizing the timbre of an audio signal segment. Hence, to a certain extent, the K-means clustering can here be viewed both as a classifier and as a technique of feature selection in time. The clustering of the feature space is performed according to the Linde-Buzo-Gray (LBG) algorithm [36].
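One plausible instantiation of this training/testing scheme, in the spirit of [16], is sketched below: a codebook is trained per class from LSF vectors with a plain K-means loop (a simple stand-in for the LBG splitting procedure), and a test segment is assigned to the class whose codebook yields the lowest mean distortion. The exact decision rule of the present system may differ.

import numpy as np

def train_codebook(features, k, n_iter=50, seed=0):
    """Plain K-means over LSF vectors (simple stand-in for the LBG algorithm [36])."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = features[assign == j].mean(axis=0)
    return centers

def mean_distortion(features, codebook):
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def classify_segment(lsf_frames, codebooks):
    """Assign the class whose codebook gives the lowest mean distortion.
    codebooks: dict mapping class name (e.g. 'speech', 'music') to a codebook array."""
    return min(codebooks, key=lambda c: mean_distortion(lsf_frames, codebooks[c]))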
Fig. 2. Automatic timbre recognition system based on line spectral frequencies and
K-means clustering
Fig. 3. Podcast ground truth annotations (a), classification results at 1 s intervals (b), and post-processed results (c)
discrimination relies on the fact that a higher level of similarity is expected between the various spoken parts on the one hand, and between the various music parts on the other hand.
The algorithm, implemented as a Vamp plugin [43], is based on a frequency-domain representation of the audio signal using either a constant-Q transform, a chromagram, or mel-frequency cepstral coefficients (MFCC). For the reasons mentioned earlier in section 3.1, we chose the MFCCs as underlying features in this study. The extracted features are normalised in accordance with the MPEG-7 standard (normalized audio spectrum envelope [NASE] descriptor [34]), by expressing the spectrum on the decibel scale and normalizing each spectral vector by the root mean square (RMS) energy envelope. This stage is followed by the extraction of 20 principal components per block of audio data using principal component analysis. The 20 PCA components and the RMS envelope constitute a sequence of 21-dimensional feature vectors. A 40-state hidden Markov model (HMM) is then trained on the whole sequence of features (Baum-Welch algorithm), each state of the HMM being associated with a specific timbre quality. After training and decoding (Viterbi algorithm) the HMM, the signal is assigned a sequence of timbre features according to specific timbre quality distributions for each possible structural segment. The minimal duration D of expected structural segments can be tuned. The segmentation is then computed by clustering timbre quality histograms. A series of histograms is created using a sliding window and then grouped into S clusters with an adapted soft K-means
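The sketch below strings together the stages described above (dB-scaled short-term spectra, an RMS normalisation in the spirit of the NASE descriptor, PCA to 20 components, and a 40-state Gaussian HMM decoded with Viterbi) using off-the-shelf libraries (numpy, scikit-learn, hmmlearn). It is an approximation of the pipeline, not the Vamp plugin itself; the frame sizes are assumptions and the final histogram-clustering stage is omitted.

import numpy as np
from sklearn.decomposition import PCA
from hmmlearn import hmm

def timbre_state_sequence(x, frame_len=2048, hop=1024, n_pca=20, n_states=40):
    """dB spectra -> RMS-normalised (NASE-like) -> PCA -> one HMM state per frame."""
    win = np.hanning(frame_len)
    frames = np.array([x[i:i + frame_len] * win
                       for i in range(0, len(x) - frame_len + 1, hop)])
    spec = np.abs(np.fft.rfft(frames, axis=1))
    spec_db = 20.0 * np.log10(spec + 1e-10)
    rms = np.sqrt(np.mean(spec_db ** 2, axis=1, keepdims=True)) + 1e-10
    nase = spec_db / rms                                  # normalised spectral vectors
    comps = PCA(n_components=n_pca).fit_transform(nase)   # 20 principal components
    feats = np.hstack([comps, rms])                       # 21-dimensional feature vectors
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
    model.fit(feats)                                      # Baum-Welch training
    return model.predict(feats)                           # Viterbi state sequence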
Experiments
Protocols
Fig. 5. Taxonomy used to train the automatic timbre recognition model in the speech/music discrimination task. The first taxonomic level is associated with a training stage with two classes: speech and music. The second taxonomic level is associated with a training stage with five classes: male speech (speech m), female speech (speech f), classical, jazz, and rock & pop music.
4.2 Database
The training data used in the automatic timbre recognition system consisted of a number of audio clips extracted from a wide variety of radio podcasts from BBC 6 Music (mostly pop) and BBC Radio 3 (mostly classical and jazz) programmes. The clips were manually auditioned and then classified as either speech or music when the ATR model was trained with two classes, or as male speech, female speech, classical music, jazz music, and rock & pop music when the ATR model was trained with five classes. These manual classifications constituted the ground truth annotations further used in the algorithm evaluations. All speech was in English, and the training audio clips, whose durations are shown in Table 1, amounted to approximately 30 min of speech and 15 min of music.
For testing purposes, four podcasts different from the ones used for training (hence containing different speakers and music excerpts) were manually annotated using terms from the following vocabulary: speech, multi-voice speech, music, silence, jingle, efx (effects), tone, tones, beats. Mixtures of these terms were also employed (e.g. speech + music, to represent speech with background music). The music class included cases where a singing voice was predominant (opera and choral music). More detailed descriptions of the podcast material used for testing are given in Tables 2 and 3.
4.3 Evaluation Measures
Table 3. Audio testing data durations. Durations are expressed in the format HH:MM:SS (hours:minutes:seconds).

Podcast   Total duration   Speech duration   Music duration
1         00:51:43         00:38:52          00:07:47
2         00:53:32         00:32:02          00:18:46
3         00:45:00         00:18:09          00:05:40
4         00:47:08         00:06:50          00:32:31
Total     03:17:23         01:35:53          01:04:43
Relative Correct Overlap. The quality of the segment-type labeling can be assessed either on a frame-by-frame basis or on a segment-level basis, by considering the relative proportion of correctly identified segments. We applied the latter segment-based method by computing the relative correct overlap (RCO) measure used to evaluate algorithms in the music information retrieval evaluation exchange (MIREX) competition [39]. The relative correct overlap is defined as the cumulated duration of segments where the correct class has been identified, normalized by the total duration of the annotated segments:

RCO = \frac{\|\{\text{segments whose class is correctly identified}\}\|}{\|\{\text{annotated segments}\}\|}    (1)

where \{\cdot\} denotes a set of segments and \|\cdot\| its cumulated duration. When comparing the machine-estimated segments with the manually annotated ones, any sections not labelled as speech (male or female), multi-voice speech, or music (classical, jazz, rock & pop) were disregarded due to their ambiguity (e.g. jingle). The durations of these disregarded parts are stated in the results section.
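As an illustration, the RCO of equation (1) can be computed from two lists of (start, end, label) segments as sketched below; ambiguous ground-truth labels (e.g. jingles) are assumed to have been removed beforehand, and the segment representation is an assumption of this sketch, not a format defined by the paper.

def relative_correct_overlap(estimated, annotated):
    """RCO: duration where estimated and annotated labels agree / annotated duration."""
    correct, total = 0.0, 0.0
    for a_start, a_end, a_label in annotated:
        total += a_end - a_start
        for e_start, e_end, e_label in estimated:
            lo, hi = max(a_start, e_start), min(a_end, e_end)
            if hi > lo and e_label == a_label:
                correct += hi - lo
    return correct / total if total > 0 else 0.0

# Example: one minute of annotated audio, 55 s correctly labelled -> RCO ~ 0.917
ann = [(0, 40, "speech"), (40, 60, "music")]
est = [(0, 35, "speech"), (35, 60, "music")]
print(relative_correct_overlap(est, ann))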
Boundary Retrieval F-measure. In order to assess the precision with which the algorithms are able to detect the time location of transitions from one class to another (i.e. the start/end of speech and music sections), we computed the boundary retrieval F-measure proposed in [35] and used in MIREX to evaluate the temporal accuracy of automatic structural segmentation methods [41]. The boundary retrieval F-measure, denoted F in the following, is defined as the harmonic mean between the boundary retrieval precision P and recall R:

F = \frac{2PR}{P + R}

The boundary retrieval precision and recall are obtained by counting the numbers of correctly detected boundaries (true positives tp), false detections (false positives fp), and missed detections (false negatives fn) as follows:

P = \frac{tp}{tp + fp}    (2)

R = \frac{tp}{tp + fn}    (3)

Hence, the precision and the recall can be viewed as measures of exactness and completeness, respectively. As in [35] and [41], the number of true positives was determined using a tolerance window of duration T = 3 s: a retrieved boundary is considered a hit (correct) if its time position l lies within the range l_a - T/2 \le l \le l_a + T/2 around an annotated boundary l_a. This method of computing the F-measure is also used in onset detector evaluation [18] (the tolerance window in the latter case being much shorter). Before comparing the manually and the machine-estimated boundaries, a post-processing was performed on the ground-truth annotations in order to remove the internal boundaries between two or more successive segments whose type was discarded in the classification process (e.g. the boundary between a jingle and a sound effect section).
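A minimal sketch of this boundary evaluation is given below. It matches each annotated boundary greedily to at most one retrieved boundary within the tolerance window, which is one common convention; it may differ in detail from the Matlab toolbox used by the authors.

def boundary_prf(retrieved, annotated, tol=3.0):
    """Precision, recall and F-measure of boundary retrieval with a +/- tol/2 window."""
    retrieved = sorted(retrieved)
    annotated = sorted(annotated)
    matched = set()
    tp = 0
    for a in annotated:
        # greedy: take the first unmatched retrieved boundary within tol/2 of a
        for i, r in enumerate(retrieved):
            if i not in matched and abs(r - a) <= tol / 2:
                matched.add(i)
                tp += 1
                break
    fp = len(retrieved) - tp
    fn = len(annotated) - tp
    p = tp / (tp + fp) if retrieved else 0.0
    r = tp / (tp + fn) if annotated else 0.0
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f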
In this section, we present and discuss the results obtained for the two sets of experiments described in section 4.1. In both sets of experiments, all audio training clips were extracted from 128 kbps, 44.1 kHz, 16-bit stereo mp3 files (mixed down to mono), and the podcasts used in the testing stage were full-duration mp3 files of the same format.
5.1
Table 4. Influence of the training class taxonomy on the performances of the automatic timbre recognition model assessed at the semantic level with the relative correct overlap (RCO) measure

ATR model - RCO measure (%)
                      Speech                  Music
Podcast               Two class   Five class  Two class   Five class
1 (Rock & pop)        90.5        91.9        93.7        94.5
2 (Classical)         91.8        93.0        97.8        99.4
3 (Classical)         88.3        91.0        76.1        82.7
4 (Rock & pop)        48.7        63.6        99.8        99.9
Overall               85.2        89.2        96.6        97.8
jazz, and rock & pop). We see from this table that training the ATR model on five classes instead of two improved the classification performances in all cases, most notably for the speech classifications of podcast number 4 (an increase of 14.9 percentage points, from 48.7% to 63.6%) and for the music classifications of podcast number 3 (up from 76.1% to 82.7%, an increase of 6.6 percentage points). In all other cases, the increase is more modest, between 0.1 and 2.7 percentage points. The combined results show an increased RCO of 4 percentage points for speech and 1.2 percentage points for music when trained on five classes instead of two.
5.2
Table 5. Comparison of the relative correct overlap performances for the ATR and ASS/ATR methods, as well as the SVM-based algorithm from [44]. For each method, the best average result (combining speech and music) is indicated in bold.

RCO (%)
                    ATR (LSF number)             ASS/ATR (LSF number)          SVM [44]
Podcast  Class      8      16     24     32      8      16     24     32       n/a
1        speech     94.8   94.8   94.7   94.3    96.9   95.8   96.9   96.9     97.5
         music      94.9   92.4   90.8   92.8    84.3   82.5   82.3   86.3     94.1
2        speech     94.2   95.4   92.9   92.8    96.3   96.3   96.3   96.1     97.6
         music      98.8   98.7   98.8   98.1    97.1   94.2   96.5   96.9     99.9
3        speech     96.7   96.9   93.5   92.0    96.4   95.3   93.6   93.5     97.2
         music      96.1   79.0   76.8   77.4    92.3   85.8   77.5   83.5     96.9
4        speech     55.3   51.9   56.4   58.9    61.8   48.5   60.2   65.6     88.6
         music      99.5   99.5   99.9   99.5    99.7   100    100    100      99.5
Overall  speech     90.3   90.0   89.5   89.5    92.8   89.4   92.0   92.8     96.8
         music      98.5   96.1   96.1   96.3    96.2   94.4   94.3   95.8     98.8
Average             94.4   93.1   92.8   92.9    94.5   91.9   93.2   94.3     97.3
ATR method (the computation time has not been measured in these experiments). The lower performances obtained by the three compared methods for the speech class of the fourth podcast should be put into perspective given the very short proportion of spoken excerpts within this podcast (see Table 3), which hence does not affect the overall results much. The good performances obtained with a low-dimensional LSF vector can be explained by the fact that the voice has a limited number of formants, which are therefore well characterized by a small number of line spectral frequencies (LSF = 8 corresponds to the characterization of 4 formants). Improving the recognition accuracy for the speech class diminishes the confusions made with the music class, which explains the concurrent increase of RCO for the music class when LSF = 8. When considering the class identification accuracy, the ATR method conducted with a low number of LSF hence appears attractive, since it is not computationally expensive relative to the performance of modern CPUs (linear predictive filter determination, computation of 8 LSFs, K-means clustering and distance computation). For feature vectors of higher dimensions, the higher-order LSFs may contain information associated with the noise in the case of the voice, which would explain the drop in overall performance obtained with LSF = 16 and LSF = 24. However, the RCOs obtained when LSF = 32 are very close to those obtained when LSF = 8. In this case, the higher number of LSF may be better adapted to capture the more complex formant structures of music.
Boundary Retrieval Performances. The boundary retrieval performance
measures (F-measure, precision P, and recall R) obtained for the ATR, ASS/ATR,
and SVM-based method from [44] are reported in Table 6.
Table 6. Boundary retrieval performances (precision P, recall R, and F-measure F, in %) for the ATR and ASS/ATR methods and the SVM-based algorithm from [44]

                       ATR (LSF number)       ASS/ATR (LSF number)           SVM [44]
Podcast  Measure (%)   8      16     24       8      16     24     32        n/a
1        P             40.0   45.7   31.0     43.6   36.0   37.2   34.1      36.0
         R             21.3   34.0   19.1     36.2   38.3   34.0   31.9      57.4
         F             27.8   39.0   23.7     39.5   37.1   35.6   33.0      44.3
2        P             61.5   69.0   74.1     72.7   35.3   84.6   71.9      58.2
         R             37.5   31.3   31.3     37.5   37.5   34.4   35.9      60.9
         F             46.6   43.0   44.0     49.5   36.4   48.9   47.9      59.5
3        P             69.2   54.5   56.7     75.0   68.0   60.4   64.0      67.3
         R             24.3   32.4   23.0     44.6   45.9   43.2   43.2      50.0
         F             36.0   40.7   32.7     55.9   54.8   50.4   51.6      57.4
4        P             11.7   12.3   21.7     56.7   57.1   57.7   48.5      28.6
         R             21.9   21.9   15.6     53.1   50.0   46.9   50.0      50.0
         F             15.2   15.7   18.2     54.8   53.3   51.7   49.2      57.4
Overall  P             23.3   40.6   46.8     62.3   46.9   57.4   54.1      47.0
         R             27.2   30.9   23.5     41.9   42.4   39.2   39.6      54.8
         F             32.2   35.1   31.3     50.1   44.6   46.6   45.7      50.6

As opposed to the relative correct overlap evaluation, where the ATR and ASS/ATR methods obtained similar performances, the ASS/ATR method clearly
outclassed the ATR method regarding the boundary retrieval accuracy. The best overall F-measure of the ASS/ATR method (50.1% with LSF = 8) is approximately 15 percentage points higher than the one obtained with the ATR method (35.1% for LSF = 16). This shows the benefit of using the automatic structural segmenter prior to the timbre recognition stage to locate the transitions between the speech and music sections. As in the previous set of experiments, the best configuration is obtained with a small number of LSF features (ASS/ATR method with LSF = 8), which stems from the fact that the boundary positions are a consequence of the classification decisions. For all the tested podcasts, the ASS/ATR method yields a better precision than the SVM-based algorithm. The most notable difference occurs for the second podcast, where the precision of the ASS/ATR method (72.7%) is approximately 14 percentage points higher than the one obtained with the SVM-based algorithm (58.2%). The resulting increase in overall precision achieved with the ASS/ATR method (62.3%) compared with the SVM-based method (47.0%) is approximately 15 percentage points. The SVM-based method however obtains a better overall boundary recall (54.8%) than the ASS/ATR method (42.4%), so that the boundary F-measures of both methods end up very close (50.6% and 50.1%, respectively).
We have presented and compared two methods for the speech/music segmentation of audio podcasts. The first method (ATR) relies on automatic timbre recognition (LSF/K-means) and median filtering. The second method (ASS/ATR) performs an automatic structural segmentation (MFCC, RMS / HMM, K-means) before applying the timbre recognition system. The algorithms were tested with more than 2.5 hours of speech and music content extracted from popular and classical music podcasts from the BBC. Some of the music tracks contained a predominant singing voice, which can be a source of confusion with the spoken voice. The algorithms were evaluated both at the semantic level, to measure the quality of the retrieved segment-type labels (classification relative correct overlap), and at the temporal level, to measure the accuracy of the retrieved boundaries between sections (boundary retrieval F-measure). Both methods obtained similar and relatively high segment-type labeling performances. The ASS/ATR method led to an RCO of 92.8% for speech and 96.2% for music, yielding an average performance of 94.5%. The boundary retrieval performances were higher for the ASS/ATR method (F-measure = 50.1%), showing the benefit of using a structural segmentation technique to locate transitions between different timbral qualities. The results were compared against the SVM-based algorithm proposed in [44], which provides a good benchmark of state-of-the-art speech/music discriminators. The performances obtained by the ASS/ATR method were approximately 3 percentage points lower than those obtained with the SVM-based method for the segment-type labeling evaluation, but led to better boundary retrieval precision (approximately 15 percentage points higher).
The boundary retrieval scores were clearly lower, for the three compared methods, than the segment-type labeling performances, which were fairly high, with up to 100% correct identification in some cases. Future work will be dedicated to refining the accuracy of the section boundaries, either by performing a new analysis of the feature variations locally around the retrieved boundaries, or by including descriptors complementary to the timbre ones, e.g. rhythmic information such as tempo, whose fluctuations around speech/music transitions may give complementary clues to detect them accurately. The discrimination of intricate mixtures of music, speech, and sometimes strong post-production sound effects (e.g. the case of jingles) will also be investigated.
Acknowledgments. This work was partly funded by the Musicology for the
Masses (M4M) project (EPSRC grant EP/I001832/1, http://www.elec.qmul.
ac.uk/digitalmusic/m4m/), the Online Music Recognition and Searching 2
(OMRAS2) project (EPSRC grant EP/E017614/1, http://www.omras2.org/),
and a studentship (EPSRC grant EP/505054/1). The authors wish to thank
Matthew Davies from the Centre for Digital Music for sharing his F-measure
computation Matlab toolbox, as well as György Fazekas for fruitful discussions
on the structural segmenter. Many thanks to Mathieu Ramona from the Institut
de Recherche et Coordination Acoustique Musique (IRCAM) for sending us the
results obtained with his speech/music segmentation algorithm.
References
1. Ajmera, J., McCowan, I., Bourlard, H.: Robust HMM-Based Speech/Music Segmentation. In: Proc. ICASSP 2002, vol. 1, pp. 297–300 (2002)
2. Alexandre-Cortizo, E., Rosa-Zurera, M., Lopez-Ferreras, F.: Application of Fisher Linear Discriminant Analysis to Speech/Music Classification. In: Proc. EUROCON 2005, vol. 2, pp. 1666–1669 (2005)
3. ANSI: USA Standard Acoustical Terminology. American National Standards Institute, New York (1960)
4. Barthet, M., Depalle, P., Kronland-Martinet, R., Ystad, S.: Acoustical Correlates of Timbre and Expressiveness in Clarinet Performance. Music Perception 28(2), 135–153 (2010)
5. Barthet, M., Depalle, P., Kronland-Martinet, R., Ystad, S.: Analysis-by-Synthesis of Timbre, Timing, and Dynamics in Expressive Clarinet Performance. Music Perception 28(3), 265–278 (2011)
6. Barthet, M., Guillemain, P., Kronland-Martinet, R., Ystad, S.: From Clarinet Control to Timbre Perception. Acta Acustica united with Acustica 96(4), 678–689 (2010)
7. Barthet, M., Sandler, M.: Time-Dependent Automatic Musical Instrument Recognition in Solo Recordings. In: 7th Int. Symposium on Computer Music Modeling and Retrieval (CMMR 2010), Malaga, Spain, pp. 183–194 (2010)
8. Bello, J.P., Daudet, L., Abdallah, S., Duxbury, C., Davies, M., Sandler, M.: A Tutorial on Onset Detection in Music Signals. IEEE Transactions on Speech and Audio Processing (2005)
9. Burred, J.J., Lerch, A.: Hierarchical Automatic Audio Signal Classification. Journal of the Audio Engineering Society 52(7/8), 724–739 (2004)
10. Caclin, A., McAdams, S., Smith, B.K., Winsberg, S.: Acoustic Correlates of Timbre Space Dimensions: A Confirmatory Study Using Synthetic Tones. J. Acoust. Soc. Am. 118(1), 471–482 (2005)
11. Cannam, C.: Queen Mary University of London: Sonic Annotator, http://omras2.org/SonicAnnotator
12. Cannam, C.: Queen Mary University of London: Sonic Visualiser, http://www.sonicvisualiser.org/
13. Cannam, C.: Queen Mary University of London: Vamp Audio Analysis Plugin System, http://www.vamp-plugins.org/
14. Carey, M., Parris, E., Lloyd-Thomas, H.: A Comparison of Features for Speech, Music Discrimination. In: Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 149–152 (1999)
15. Castellengo, M., Dubois, D.: Timbre ou Timbres? Propriété du Signal, de l'Instrument, ou Construction Cognitive (Timbre or Timbres? Property of the Signal, the Instrument, or Cognitive Construction?). In: Proc. of the Conf. on Interdisciplinary Musicology (CIM 2005), Montreal, Quebec, Canada (2005)
16. Chetry, N., Davies, M., Sandler, M.: Musical Instrument Identification Using LSF and K-Means. In: Proc. AES 118th Convention (2005)
17. Childers, D., Skinner, D., Kemerait, R.: The Cepstrum: A Guide to Processing. Proc. of the IEEE 65, 1428–1443 (1977)
18. Davies, M.E.P., Degara, N., Plumbley, M.D.: Evaluation Methods for Musical Audio Beat Tracking Algorithms. Technical report C4DM-TR-09-06, Queen Mary University of London, Centre for Digital Music (2009), http://www.eecs.qmul.ac.uk/~matthewd/pdfs/DaviesDegaraPlumbley09-evaluation-tr.pdf
19. Davis, S.B., Mermelstein, P.: Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-28(4), 357–366 (1980)
20. El-Maleh, K., Klein, M., Petrucci, G., Kabal, P.: Speech/Music Discrimination for Multimedia Applications. In: Proc. ICASSP 2000, vol. 6, pp. 2445–2448 (2000)
21. Fazekas, G., Sandler, M.: Intelligent Editing of Studio Recordings With the Help of Automatic Music Structure Extraction. In: Proc. of the AES 122nd Convention, Vienna, Austria (2007)
22. Galliano, S., Geoffrois, E., Mostefa, D., Choukri, K., Bonastre, J.F., Gravier, G.: The ESTER Phase II Evaluation Campaign for the Rich Transcription of French Broadcast News. In: Proc. Interspeech (2005)
23. Gauvain, J.L., Lamel, L., Adda, G.: Audio Partitioning and Transcription for Broadcast Data Indexation. Multimedia Tools and Applications 14(2), 187–200 (2001)
24. Grey, J.M., Gordon, J.W.: Perception of Spectral Modifications on Orchestral Instrument Tones. Computer Music Journal 11(1), 24–31 (1978)
25. Hain, T., Johnson, S., Tuerk, A., Woodland, P.C., Young, S.: Segment Generation and Clustering in the HTK Broadcast News Transcription System. In: Proc. of the DARPA Broadcast News Transcription and Understanding Workshop, pp. 133–137 (1998)
26. Hajda, J.M., Kendall, R.A., Carterette, E.C., Harshberger, M.L.: Methodological Issues in Timbre Research. In: Deliège, I., Sloboda, J. (eds.) Perception and Cognition of Music, 2nd edn., pp. 253–306. Psychology Press, New York (1997)
27. Handel, S.: Hearing. In: Timbre Perception and Auditory Object Identification, 2nd edn., pp. 425–461. Academic Press, San Diego (1995)
28. Harte, C.: Towards Automatic Extraction of Harmony Information From Music Signals. Ph.D. thesis, Queen Mary University of London (2010)
29. Helmholtz, H.v.: On the Sensations of Tone. Dover, New York (1954); (from the works of 1877). English translation with notes and appendix by E.J. Ellis
30. Houtgast, T., Steeneken, H.J.M.: The Modulation Transfer Function in Room Acoustics as a Predictor of Speech Intelligibility. Acustica 28, 66–73 (1973)
31. Itakura, F.: Line Spectrum Representation of Linear Predictive Coefficients of Speech Signals. J. Acoust. Soc. Am. 57(S35) (1975)
32. Jarina, R., O'Connor, N., Marlow, S., Murphy, N.: Rhythm Detection for Speech/Music Discrimination in MPEG Compressed Domain. In: Proc. of the IEEE 14th International Conference on Digital Signal Processing (DSP), Santorini (2002)
33. Kedem, B.: Spectral Analysis and Discrimination by Zero-Crossings. Proc. IEEE 74, 1477–1493 (1986)
34. Kim, H.G., Berdahl, E., Moreau, N., Sikora, T.: Speaker Recognition Using MPEG-7 Descriptors. In: Proc. of EUROSPEECH (2003)
35. Levy, M., Sandler, M.: Structural Segmentation of Musical Audio by Constrained Clustering. IEEE Transac. on Audio, Speech, and Language Proc. 16(2), 318–326 (2008)
36. Linde, Y., Buzo, A., Gray, R.M.: An Algorithm for Vector Quantizer Design. IEEE Transactions on Communications 28, 702–710 (1980)
37. Lu, L., Jiang, H., Zhang, H.J.: A Robust Audio Classification and Segmentation Method. In: Proc. ACM International Multimedia Conference, vol. 9, pp. 203–211 (2001)
38. Marozeau, J., de Cheveigné, A., McAdams, S., Winsberg, S.: The Dependency of Timbre on Fundamental Frequency. Journal of the Acoustical Society of America 114(5), 2946–2957 (2003)
39. Mauch, M.: Automatic Chord Transcription from Audio Using Computational Models of Musical Context. Ph.D. thesis, Queen Mary University of London (2010)
40. McAdams, S., Winsberg, S., Donnadieu, S., De Soete, G., Krimphoff, J.: Perceptual Scaling of Synthesized Musical Timbres: Common Dimensions, Specificities, and Latent Subject Classes. Psychological Research 58, 177–192 (1995)
41. Music Information Retrieval Evaluation Exchange Wiki: Structural Segmentation (2010), http://www.music-ir.org/mirex/wiki/2010:Structural_Segmentation
42. Peeters, G.: Automatic Classification of Large Musical Instrument Databases Using Hierarchical Classifiers with Inertia Ratio Maximization. In: Proc. AES 115th Convention, New York (2003)
43. Queen Mary University of London: QM Vamp Plugins, http://www.omras2.org/SonicAnnotator
44. Ramona, M., Richard, G.: Comparison of Different Strategies for a SVM-Based Audio Segmentation. In: Proc. of the 17th European Signal Processing Conference (EUSIPCO 2009), pp. 20–24 (2009)
45. Risset, J.C., Wessel, D.L.: Exploration of Timbre by Analysis and Synthesis. In: Deutsch, D. (ed.) Psychology of Music, 2nd edn. Academic Press, London (1999)
46. Saunders, J.: Real-Time Discrimination of Broadcast Speech/Music. In: Proc. ICASSP 1996, vol. 2, pp. 993–996 (1996)
47. Schaeffer, P.: Traité des Objets Musicaux (Treatise of Musical Objects). Éditions du Seuil (1966)
48. Scheirer, E., Slaney, M.: Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator. In: Proc. ICASSP 1997, vol. 2, pp. 1331–1334 (1997)
49. Slawson, A.W.: Vowel Quality and Musical Timbre as Functions of Spectrum Envelope and Fundamental Frequency. J. Acoust. Soc. Am. 43(1) (1968)
50. Sundberg, J.: Articulatory Interpretation of the Singing Formant. J. Acoust. Soc. Am. 55, 838–844 (1974)
51. Terasawa, H., Slaney, M., Berger, J.: A Statistical Model of Timbre Perception. In: ISCA Tutorial and Research Workshop on Statistical and Perceptual Audition (SAPA 2006), pp. 18–23 (2006)
52. Gil de Zúñiga, H., Veenstra, A., Vraga, E., Shah, D.: Digital Democracy: Reimagining Pathways to Political Participation. Journal of Information Technology & Politics 7(1), 36–51 (2010)
Cloud Computing
EvMusic Representation
The first step when planning a composition system should be choosing a proper music representation. The chosen representation will set the frontiers of the system's capabilities. As a result, our CM research developed a solid and versatile representation for music composition. EvMetamodel [3] was used to model the music knowledge representation behind EvMusic. A prior, in-depth analysis of music knowledge was carried out to ensure that the representation meets music composition requirements. This multilevel representation is not only compatible with traditional notation but also capable of representing highly abstract music elements. It can also represent symbolic pitch entities [1] from both music theory and algorithmic composition, keeping the door open to the representation of music elements of a higher symbolic level conceived by the composer's creativity. It is based on real composition experience and was designed to support CM composition, including experiences in musical artificial intelligence (MAI). The current music representation is described in a platform-independent UML format [15]. It is therefore not confined to its original LISP system but can be used in any system or language: a valuable feature when approaching a cloud system.
Fig. 2 is a UML class diagram for the representation core of the EvMetamodel, the base representation for the time dimension. The three main classes are shown: event, parameter and dynamic object. High-level music elements are represented as subclasses of metaevent, the interface which provides the develop functionality. The special dynamic object changes is also shown. This is a very useful option for the graphic editing of parameters, since it represents a dynamic object as a sequence of parameter-change events which can easily be moved in time.
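To make the diagram more concrete, the sketch below mirrors the three core classes as plain Python classes. The class and method names follow the figure's terminology, but the actual EvMetamodel attributes are richer than shown here; this is an illustrative reading, not the EvMusic source.

class Parameter:
    """A named property attached to an event (pitch, duration, dynamics, ...)."""
    def __init__(self, name, value):
        self.name, self.value = name, value

class DynamicObject:
    """A value that evolves over time; the 'changes' object represents it
    as a sequence of parameter-change events that can be moved in time."""
    def value_at(self, time):
        raise NotImplementedError

class Event:
    """Anything placed in time; high-level elements subclass a meta-event interface."""
    def __init__(self, pos, parameters=None, children=None):
        self.pos = pos
        self.parameters = parameters or {}
        self.children = children or []

    def develop(self):
        """Meta-events override this to expand into lower-abstraction events."""
        return [self]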
Our CM system has undergone several changes since its beginning back in 1997. Fig. 3 shows the evolution undergone by formats, platforms and technologies toward the current proposal. This figure follows the same horizontal scheme as Figure 1. The first column indicates the user interface for music input, and the second shows the music computation system and its evolution over recent years, while the music intermediate format is reported in the central column. Post-production utilities and their final results are shown in the last two columns, respectively.
The model in Fig. 1 clearly shows the evolution undergone by the system: first, a process of service diversification and specialisation, mainly at the post-production stage; second, as a trend in user input, CM is demanding graphic environments. Finally, technologies and formats undergo multiple changes. The reason behind most of these changes can be found in external technology advances and the need to accommodate these new situations. In some cases the needed tool or library was not available at the time; in others the available tool was suitable at that moment but offered no long-term availability. As stated above, the recent shift of IT towards cloud computing brings new opportunities for evolution. In CM, system development can benefit from computing distribution and specialization. Splitting the system into several specialized services avoids the limitations imposed by a single programming language or platform. Individual music services can therefore be developed and evolved independently of the others. Each component service can be implemented on the most appropriate platform for its particular task, regardless of the rest of the services and without being conditioned by the technologies necessary for the implementation of other services. In the previous paradigm, all services were performed by a single system, and the selection of technologies to complete a particular task affected or even constrained the implementation of other tasks. The new approach frees the system design, thus making it more platform-independent. In addition, widely available tools can be used for specific tasks, thus benefiting from tool development in other areas such as database storage and web application design.
4.1
In the usual cloud computing terminology, each of the services can be considered a Music-computing as a Service (MaaS) component. In their simplest form, they are servers receiving a request and performing a task. At the end of the task, the resulting objects are returned to the stream or stored in an exchange database. Access to this database is a valuable feature for services, since it allows the definition of complex services. They could even be turned into real MAI agents (i.e., intelligent devices which can perceive their environment, make decisions and act within their environment) [17]. Storing the music composition as a virtual environment in a database allows music services to interact within the composition, thus opening a door toward a MAI system of distributed agents. The music services of the Cloud are classified according to their function.
Input. This group includes the services aimed particularly at incorporating new music elements and translating them from other input formats.
Agents. These are services capable of inspecting and modifying the music composition, as well as introducing new elements. This type includes human user interfaces, but may also include other intelligent elements taking part in the composition by introducing decisions, suggestions or modifications. In our prototype, we have developed a web application acting as a user interface for the editing of music elements. This service is described in the next section.
Storage. At this stage, only music object instances and relations are stored, but the hypothetical model also includes procedural information. Three music storage services are implemented in the prototype. The main lib stores shared music elements as global definitions; this content may be seen as some kind of music culture. User-related music objects are stored in the user lib. These include music objects defined by the composer which can be reused in several parts and complete music sections, or which represent the composer's style. The editing storage service is provided as temporary storage for the editing session. The piece is progressively composed in the database: the composition environment (i.e., everything related to the piece under composition) lives in the database. This is the environment in which several composing agents can integrate and interact by reading and writing to this database-stored score. All three storage services in this experiment were written in Python and deployed in the cloud with Google App Engine.
Development. The services in this group perform development processes. As
explained in [8], development is the process by which higher-abstraction symbolic
elements are turned into lower-abstraction elements. High-abstraction symbols
What technologies are behind successful, widespread web cloud applications such as GoogleDocs? What type of exchange format do they use? JavaScript [13] is a dialect of the standard ECMAScript [6] supported by almost all web browsers. It is the key tool behind these web clients. In this environment, the code for the client is downloaded from a server and then run in the web browser as a dynamic script that keeps communicating with web services. Thus, the browser window behaves as a user-interface window for the system.
5.1
EvEditor Client
The Extjs library [7] was chosen as a development framework for the client-side implementation. The web application takes the shape of a desktop-with-windows archetype (i.e., a well-tested approach and an intuitive interface environment we would like to benefit from, but within a web-browser window). The main objective of its implementation was not only to produce a suitable music editor for a music cloud example but also to provide a whole framework for the development of general-purpose object editors under an OOP approach. Once this aim was reached, different editors could be subclassed with the customizations required by each object type. Extjs is a powerful JavaScript library including a vast collection of components for user interface creation. Elements are defined in a hierarchic class system, which offers a great solution for our OOP approach. That is, all EvEditor components are subclassed from ExtJS classes. As shown in Fig. 5, the EvEditor client is a web application consisting of three layers. The bottom or data layer contains an editor proxy for the data under current edition. It duplicates the selected records in the remote storage service. The database in the
remote storage service is synched with the edits and updates the editor writes to its proxy. Several editors can share the same proxy, so all listening editors are updated when the data are modified in the proxy. The intermediate layer is a symbolic zone for all components. It includes graphic interface components such as editor windows and container views, as well as robjects: representations of the objects under current edition. Interface components are subclassed from ExtJS components. Every editor is displayed in its own window on the working desktop and optionally contains a contentView displaying its child objects as robjects.
Fig. 6 is a browser-window capture showing the working desktop and some editor windows. The application menu is shown in the lower left-hand corner, including user settings. The central area of the capture shows a diatonic sequence editor based on our TclTk Editor [1]. A list editor and a form-based note editor are also shown. In the third layer, the DOM (Document Object Model) [19], all components are rendered as DOM elements (i.e., HTML document elements to be visualized).
5.2
Server Side
The client script code is dynamically provided by the server side of the web application. It is written in Python and can run as a GoogleApp for integration into the cloud. All user-related tasks are managed by this server side. It identifies the user and manages sessions, profiles and environments.
6.1
Database storage allows several services to share the same data and to collaborate in the composition process. The information stored in a database is organized in tables of records. To store EvMusic tree structures in a database, they must first be converted into records. For this purpose, the three main classes of the Ev representation are subclassed from a tree node class, as shown in Fig. 2. Thus, every object is identified by a unique reference and a parent attribute. This allows a large tree structure of nested events to be represented as a set of records for individual retrieval or update.
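A minimal sketch of this record-based storage is shown below. The field names (ref, parent) and the use of an "events" list are illustrative assumptions rather than the actual EvMusic schema, but they show how a nested event tree can be flattened into individually retrievable records and rebuilt later.

import itertools

def flatten_tree(node, records=None, parent=None, counter=None):
    """Flatten a nested event dict (children held in an 'events' list) into flat records."""
    if records is None:
        records, counter = [], itertools.count()
    ref = next(counter)
    record = {k: v for k, v in node.items() if k != "events"}
    record["ref"], record["parent"] = ref, parent
    records.append(record)
    for child in node.get("events", []):
        flatten_tree(child, records, ref, counter)
    return records

def rebuild_tree(records):
    """Inverse operation: rebuild the nested structure from its flat records."""
    nodes = {r["ref"]: {k: v for k, v in r.items() if k not in ("ref", "parent")}
             for r in records}
    root = None
    for r in records:  # records are in depth-first order, so children keep their order
        if r["parent"] is None:
            root = nodes[r["ref"]]
        else:
            nodes[r["parent"]].setdefault("events", []).append(nodes[r["ref"]])
    return root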
6.2
Web applications usually use XML and JSON (JavaScript Object Notation) for data exchange [12]. Both formats meet the requirements. However, two reasons supported our inclination for JSON: 1) the large tool library available for JSON at the time of this writing, and 2) the fact that JSON is offered as the exchange format for some of the main Internet web services such as Google or Yahoo. The second reason has to do with JSON's strong features, such as human readability and dynamic unclosed object support, a very valuable feature inherited from the prototype-based nature of JavaScript.
JSON can be used to describe EvMusic objects and to communicate among web music services. MusicJSON [2] is the name given to this use. As a simple example, the code below shows the description of the short music fragment shown in Fig. 7. As can be seen, the code is self-explanatory.
{"pos":0, "objclass": "part","track":1, "events":[
{"pos":0, "objclass": "note","pitch":"d4","dur":0.5,
"art":"stacatto"},
{"pos":0.5, "objclass": "note","pitch":"d4","dur":0.5,
"art":"stacatto"},
{"pos":1, "objclass": "pitch":"g4","dur":0.75,
"dyn":"mf","legato":"start"},
{"pos":1.75, "objclass": "pitch":"f#4","dur":0.25
"legato":"end"},
{"pos":2, "objclass": "note","pitch":"g4","dur":0.5,
"art":"stacatto"}
{"pos":2.5, "objclass": "note","pitch":"a4","dur":0.5,
"art":"stacatto"}
{"pos":3, "objclass":"nchord","dur":1,"pitches":[
{"objclass": "spitch","pitch": "d4"},
{"objclass": "spitch","pitch": "b4"}]
}
]
}
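Any standard JSON library can consume such a description. The short sketch below loads the fragment above (assumed to be held in a string variable musicjson_text) and walks the event tree, turning each record into a plain Python object; the Event class and its attribute names are illustrative and not part of the MusicJSON specification.

import json

class Event:
    def __init__(self, record):
        self.objclass = record.get("objclass")
        self.pos = record.get("pos", 0)
        self.children = [Event(e) for e in record.get("events", [])]
        # keep any remaining fields (pitch, dur, art, ...) as generic attributes
        self.attrs = {k: v for k, v in record.items()
                      if k not in ("objclass", "pos", "events")}

part = Event(json.loads(musicjson_text))   # musicjson_text holds the fragment above
print(part.objclass, len(part.children))   # -> part 7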
6.3
MusicJSON File
Every EvMusic object, from single notes to complex structures, can be serialized into a MusicJSON text and subsequently transmitted through the Web. In addition, MusicJSON can be used as an intermediate format for local storage of compositions. The next code listing shows a draft example of the proposed description of an EvMusic file.
{"objclass":"evmusicfile","ver":"0911","content":
{"lib":{
"instruments":"http://evmusic.fauno.org/lib/main/instruments",
"pcstypes": "http://evmusic.fauno.org/lib/main/pcstypes",
"mypcs": "http://evmusic.fauno.org/lib/jesus/pcstypes",
"mymotives": "http://evmusic.fauno.org/lib/jesus/motives"
},
"def":{
"ma": {"objclass":"motive",
"symbol":[ 0,7, 5,4,2,0 ],
"slength": "+-+- +-++ +---"},
"flamenco": {"objclass":"pcstype",
"pcs":[ 0,5,7,13 ],},
},
"orc":{
"flauta": {"objclass":"instrument",
"value":"x.lib.instruments.flute",
"role":"r1"}
"cello": {"objclass":"instrument",
"value":"x.lib.instruments.cello",
"role":"r2"}
},
"score":[
{"pos": 0,"objclass":"section",
"pars":[
"tempo":120,"dyn":"mf","meter":"4/4",
...
],
"events":[
{"pos":0, "objclass": "part","track":1,"role":"i1",
"events":[
... ]
... ]},
{"pos": 60,"objclass":"section","ref":"s2",
...
},],}}}
The code shows four sections in the content. The library section is an array of libraries with object definitions to be loaded; both main and user libraries can be addressed. The following section includes local definitions of objects; as an example, a motive and a chord type are defined. The next section establishes instrumentation assignments by means of the arrangement object role. The last section is the score itself, where all events are placed in a tree structure using parts. Using MusicJSON as the intermediary communication format enables us to connect several music services to form a cloud composition system.
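As a rough illustration of such service-to-service exchange, the sketch below posts a MusicJSON object to a development service and stores the returned, developed events in a storage service. The URLs and the request/response layout are invented for the example and do not correspond to actual EvMusic endpoints.

import requests

DEVELOP_URL = "https://develop-service.example.org/develop"   # hypothetical endpoint
STORAGE_URL = "https://storage-service.example.org/store"     # hypothetical endpoint

def develop_and_store(musicjson_obj):
    """Ask a development service to expand a high-level object, then store the result."""
    resp = requests.post(DEVELOP_URL, json=musicjson_obj, timeout=30)
    resp.raise_for_status()
    developed = resp.json()
    requests.post(STORAGE_URL, json=developed, timeout=30).raise_for_status()
    return developed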
Conclusion
The present paper puts forward an experience of music composition under a distributed computation approach as a viable solution for Computer Music Composition in the Cloud. The system is split into several music services hosted by common IaaS providers such as Google or Amazon. Different music systems can be built by the joint operation of some of these music services in the cloud.
In order to cooperate and deal with music objects, each service in the music cloud must understand the same music knowledge. The music knowledge representation they share must therefore be standardized. The EvMusic representation is proposed for this purpose, since it is a solid multilevel representation successfully tested in real CM compositions in recent years.
Furthermore, MusicJSON is proposed as an exchange data format between services. Example descriptions of music elements, as well as a file format for local saving of a musical composition, are given. A graphic environment is also proposed for the creation of user interfaces for object editing as a web application. As an example, the EvEditor application is described.
This CMC approach opens multiple possibilities for derivative work. New music creation interfaces can be developed as web applications benefiting from upcoming web technologies such as the promising HTML5 standard [20]. The described music in the cloud, together with the EvMusic representation, provides a ground environment for MAI research, where specialised agents can cooperate in a music composition environment sharing the same music representation.
References
1. Alvaro, J.L.: Symbolic Pitch: Composition Experiments in Music Representation. Research Report, http://cml.fauno.org/symbolicpitch.html (retrieved December 10, 2010) (last viewed February 2011)
2. Alvaro, J.L., Barros, B.: MusicJSON: A Representation for the Computer Music Cloud. In: Proceedings of the 7th Sound and Music Computing Conference, Barcelona (2010)
3. Alvaro, J.L., Miranda, E.R., Barros, B.: Music knowledge analysis: Towards an efficient representation for composition. In: Marín, R., Onaindía, E., Bugarín, A., Santos, J. (eds.) CAEPIA 2005. LNCS (LNAI), vol. 4177, pp. 331–341. Springer, Heidelberg (2006)
Abstract. Recognition of sound sources and events is an important process in sound perception and has been studied in many research domains. Conversely, sounds that cannot be recognized are rarely studied, except by electroacoustic music composers. Moreover, considerations on the recognition of sources might help to address the problem of stimulus selection and categorization of sounds in the context of perception research. This paper introduces what we call abstract sounds together with the existing musical background and shows their relevance for different applications.
Keywords: abstract sound, stimuli selection, acousmatic.
Introduction
How do sounds convey meaning? How can the acoustic characteristics that convey the relevant information in sounds be identified? These questions interest researchers within various research fields such as cognitive neuroscience, musicology, sound synthesis, sonification, etc. Recognition of sound sources, identification, discrimination and sonification deal with the problem of linking signal properties and perceived information. In several domains (linguistics, music analysis), this problem is known as semiotics [21]. The analysis-by-synthesis approach [28] has made it possible to understand some important features that characterize the sound of vibrating objects or the interaction between objects. A similar approach was also adopted in [13], where the authors use vocal imitations in order to study human sound source identification, with the assumption that vocal imitations are simplifications of the original sounds that still contain the relevant information.
Recently, there has been an important development in the use of sounds to convey information to a user (of a computer, a car, etc.) within a new research community called auditory display [19], which deals with topics related to sound design, sonification and augmented reality. In such cases, it is important to use
sounds that are meaningful independently of cultural references, taking into account that sounds are presented through speakers concurrently with other audio/visual information.
Depending on the research topics, authors have focused on different sound categories (i.e. speech, environmental sounds, music or calibrated synthesized stimuli). In [18], the author proposed a classification of everyday sounds according to the physical interactions from which the sound originates. When working within the synthesis and/or sonification domains, the aim is often to reproduce the acoustic properties responsible for the attribution of meaning, and thus sound categories can be considered from the point of view of semiotics, i.e. focusing on the information that can be gathered from sounds.
In this spirit, we consider a specific category of sounds that we call abstract sounds. This category includes any sound that cannot be associated with an identifiable source. It includes environmental sounds that cannot easily be identified by listeners or that give rise to many different interpretations depending on listeners and contexts. It also includes synthesized sounds and laboratory-generated sounds, if they are not associated with a clear origin. For instance, alarm or warning sounds cannot be considered abstract sounds. In practice, recordings with a microphone close to the sound source and some synthesis methods like granular synthesis are especially efficient for creating abstract sounds. Note that in this paper, we mainly consider acoustically complex stimuli since they best meet our needs in the different applications (as discussed further).
Various labels that refer to abstract sounds can be found in the literature: confused sounds [6], strange sounds [36], sounds without meaning [16]. Conversely, [34] uses the term source-bonded and the expression source bonding for "the natural tendency to relate sounds to supposed sources and causes". Chion introduced acousmatic sounds [9] in the context of cinema and audiovisual applications with the following definition: sounds one hears without seeing their originating cause - an invisible sound source (for more details see section 2).
The most common expression is abstract sounds [27,14,26], particularly within the domain of auditory display, when concerning earcons [7]. Abstract used as an adjective means "based on general ideas and not on any particular real person, thing or situation" and also "existing in thought or as an idea but not having a physical reality". For sounds, we can also consider another definition, used for art "not representing people or things in a realistic way". Abstract as a noun is "a short piece of writing containing the main ideas in a document", and thus shares the idea of essential attributes, which is suitable in the context of semiotics. In [4], the authors wrote: "Edworthy and Hellier (2006) suggested that abstract sounds can be interpreted very differently depending on the many possible meanings that can be linked to them, and in large part depending on the surrounding environment and the listener."
In fact, there is general agreement on the use of the adjective abstract for sounds, expressing both the issue of source recognition and the different possible interpretations.
This paper will first present the existing framework for the use of abstract sounds by electroacoustic music composers and researchers. We will then discuss some important aspects that should be considered when conducting listening tests, with a special emphasis on the specificities of abstract sounds. Finally, three practical examples of experiments with abstract sounds in different research domains will be presented.
Even if the term abstract sounds was not used in the context of electroacoustic
music, it seems that this community was one of the first to consider the issue
related to the recognition of sound sources and to use such sounds. In 1966,
P. Schaeffer, who was both a musician and a researcher, wrote the Traité des
objets musicaux [29], in which he reported more than ten years of research on
electroacoustic music. With a multidisciplinary approach, he intended to carry
out fundamental music research that included both Concrète and traditional
music. One of the first concepts he introduced was the so-called acousmatic
listening, related to the experience of listening to a sound without paying attention to the source or the event. The word acousmatic is at the origin of many
discussions, and is now mainly employed to describe a musical trend.
Discussions about acousmatic listening were kept alive due to a fundamental
problem in Concrète music. Indeed, for music composers the problem is to create
new meaning from sounds that already carry information about their origins. In
compositions where sounds are organized according to their intrinsic properties,
thanks to the acousmatic approach, information on the origins of sounds is still
present and interacts with the composer's goals.
There was an important divergence of points of view between Concrète and
Elektronische music (see [10] for a complete review), since the Elektronische
music composers used only electronically generated sounds and thus avoided the
problem of meaning [15]. Both Concrète and Elektronische music have developed
a research tradition on acoustics and perception, but only Schaeffer adopted a
scientific point of view. In [11], the author wrote: "Schaeffer's decision to use
recorded sounds was based on his realization that such sounds were often rich
in harmonic and dynamic behaviors and thus had the largest potential for his
project of musical research." This work was of importance for electroacoustic
musicians, but is almost unknown by researchers in auditory perception, since
there is no published translation of his book except for concomitant works [30]
and Chion's Guide des objets musicaux. As reported in [12], translating Schaeffer's writing is extremely difficult since he used neologisms and very specific
Fig. 1. Schaeffer's typology. Note that some column labels are redundant since the
table must be read from the center to the borders. For instance, the "Non-existent evolution"
column in the right part of the table corresponds to endless iterations, whereas the
"Non-existent evolution" column in the left part concerns sustained sounds (with no
amplitude variations). Translation from [12].
meanings of French words. However, there has recently been a growing interest in this
book, in particular in the domain of music information retrieval, for morphological sound description [27,26,5]. The authors indicate that in the case of what
they call abstract sounds, classical approaches based on sound source recognition are not relevant, and thus base their algorithms on Schaeffer's morphology
and typology classifications.
Morphology and typology have been introduced as analysis and creation tools
for composers, as an attempt to construct a music notation that includes electroacoustic music and therefore any sound. The typology classification (cf. Figure 1)
is based on a characterization of the spectral (mass) and dynamical (facture) profiles of sounds with respect to their complexity, and consists of twenty-eight categories.
There are nine central categories of balanced sounds for which the variations
are neither too rapid and random nor too slow or nonexistent. Those nine categories include three facture profiles (sustained, impulsive or iterative) and three
mass profiles (tonic, complex and varying). On both sides of the balanced objects in the table, there are nineteen additional categories for which mass and
facture profiles are very simple/repetitive or vary a lot. (As discussed in [12], even if facture is not a common English word, there is no better translation from French.)
Note that some automatic classification methods are available [26]. In [37] the
authors proposed an extension of Schaeffer's typology that includes graphical
notations.
Since the 1950s, electroacoustic music composers have addressed the problem
of the meaning of sounds and provided an interesting tool for the classification of sounds
with no a priori differentiation with respect to the type of sound. For sound perception
research, a classification of sounds according to these categories may be useful
since they are suitable for any sound. The next section will detail the use of such a
classification for the design of listening tests.
The design of listening tests is a fundamental part of sound perception studies and implies consideration of different aspects of perception that are closely
related to the intended measurements. For instance, it is important to design
calibrated stimuli and experimental procedures in order to best control the main factors that affect the subjects' evaluations. We propose to discuss such aspects in
the context of abstract sounds.
3.1 Stimuli
attention to certain evocations) in order to study specific aspects of the information conveyed by the sounds. In particular, we will see how the same set of
abstract sounds was used in two different studies, described in Sections 4.3 and 4.1.
3.2 Procedure
Type of Listening
Auditory streams have been introduced by Bregman [8], and describe our ability to
group/separate different elements of a sound.
Abstract sounds are not often heard in our everyday life and could even be
completely novel for listeners. Therefore, they might be perceived as strange
or bizarre. As mentioned above, listeners' judgements of abstract sounds are
highly subjective. In some cases, it is possible to use this subjectivity to investigate some specificities of human perception and, in particular, to highlight
differences in sound evaluations between groups of listeners. In particular, the
concept of "bizarre" is one important element of the standard classification of
mental disorders (DSM-IV) for schizophrenia ([1], p. 275). Another frequently
reported element is the existence of auditory hallucinations, i.e., perception
without stimulation. From such considerations, we explored the perception of
bizarre and familiar sounds in patients with schizophrenia by using both environmental sounds (for their familiar aspect) and abstract sounds (for their bizarre aspect). The procedure consisted of rating sounds on continuous scales according
to a perceptual dimension labelled by an adjective (by contrast, classical differential semantics uses an adjective and an antonym to define the extremes of each
scale). Sounds were evaluated on six dimensions along linear scales: "familiar",
"reassuring", "pleasant", "bizarre", "frightening", "invasive" (arguable translations of the French adjectives familier, rassurant, plaisant, bizarre, angoissant, envahissant). Concerning the
abstract sound corpus, we chose 20 sounds from the initial set of 200 sounds through
a pre-test on seven subjects, and selected the sounds that best spread in the space of
measured variables (the perceptual dimensions). This preselection was validated
by a second pre-test on fourteen subjects that produced a similar distribution of
the sounds along the perceptual dimensions.
Preliminary results showed that the selected sound corpus made it possible to
highlight significant differences between patients with schizophrenia and control
groups. Further analysis and testing (for instance with brain imaging techniques) will
be conducted in order to better understand these differences.
4.2
randomly. This step allowed us to validate the abstract sounds since no label referring to the actual source was given. Indeed, when listeners were asked to explicitly
label abstract sounds, different labels, mostly related to the sound quality, were collected. In a first experiment, a written word (prime) was visually presented
before a sound (target) and subjects had to decide whether or not the sound and
the word fit together. In a second experiment, the presentation order was reversed (i.e.,
sound presented before word). Results showed that participants were able to evaluate the semiotic relation between the prime and the target in both sound-word and
word-sound presentations with relatively low inter-subject variability and good
consistency (see [32] for details on the experimental data and related analysis). This
result indicated that abstract sounds are suitable for studying conceptual processing. Moreover, their contextualization by the presentation of a word reduced the
variability of interpretations and led to a consensus between listeners. The study
also revealed similarities in the electrophysiological patterns (event-related potentials) between abstract sounds and word targets, supporting the assumption
that similar processing is involved for linguistic and non-linguistic sounds.
4.3 Sound Synthesis
Conclusion
References
1. Association, A.P.: The Diagnostic and Statistical Manual of Mental Disorders,
Fourth Edition (DSM-IV). American Psychiatric Association (1994), http://www.psychiatryonline.com/DSMPDF/dsm-iv.pdf (last viewed February 2011)
2. Ballas, J.A.: Common factors in the identification of an assortment of brief everyday sounds. Journal of Experimental Psychology: Human Perception and Performance 19, 250–267 (1993)
3. Bentin, S., McCarthy, G., Wood, C.C.: Event-related potentials, lexical decision
and semantic priming. Electroencephalogr Clin. Neurophysiol. 60, 343355 (1985)
4. Bergman, P., Sköld, A., Västfjäll, D., Fransson, N.: Perceptual and emotional categorization of sound. The Journal of the Acoustical Society of America 126, 3156–3167 (2009)
5. Bloit, J., Rasamimanana, N., Bevilacqua, F.: Towards morphological sound description using segmental models. In: DAFX, Milan, Italy (2009)
6. Bonebright, T.L., Miner, N.E., Goldsmith, T.E., Caudell, T.P.: Data collection
and analysis techniques for evaluating the perceptual qualities of auditory stimuli.
ACM Trans. Appl. Percept. 2, 505516 (2005)
7. Bonebright, T.L., Nees, M.A.: Most earcons do not interfere with spoken passage
comprehension. Applied Cognitive Psychology 23, 431445 (2009)
8. Bregman, A.S.: Auditory Scene Analysis. The MIT Press, Cambridge (1990)
9. Chion, M.: Audio-vision, Sound on Screen. Columbia University Press, New-York
(1993)
10. Cross, L.: Electronic music, 1948-1953. Perspectives of New Music (1968)
11. Dack, J.: Abstract and concrete. Journal of Electroacoustic Music 14 (2002)
12. Dack, J., North, C.: Translating Pierre Schaeffer: Symbolism, literature and music.
In: Proceedings of EMS 2006 Conference, Beijing (2006)
13. Dessein, A., Lemaitre, G.: Free classification of vocal imitations of everyday sounds.
In: Sound And Music Computing (SMC 2009), Porto, Portugal, pp. 213218 (2009)
14. Dubois, D., Guastavino, C., Raimbault, M.: A cognitive approach to urban soundscapes: Using verbal data to access everyday life auditory categories. Acta Acustica
United with Acustica 92, 865874 (2006)
15. Eimert, H.: What is electronic music. Die Reihe 1 (1957)
16. Fastl, H.: Neutralizing the meaning of sound for sound quality evaluations. In:
Proc. Int. Congress on Acoustics ICA 2001, Rome, Italy, vol. 4, CD-ROM (2001)
17. Gaver, W.W.: How do we hear in the world? Explorations of ecological acoustics.
Ecological Psychology 5, 285–313 (1993)
18. Gaver, W.W.: What in the world do we hear? An ecological approach to auditory
source perception. Ecological Psychology 5, 1–29 (1993)
19. Hermann, T.: Taxonomy and definitions for sonification and auditory display.
In: Proceedings of the 14th International Conference on Auditory Display, Paris,
France (2008)
20. Hoffman, M., Cook, P.R.: Feature-based synthesis: Mapping acoustic and perceptual features onto synthesis parameters. In: Proceedings of the 2006 International
Computer Music Conference (ICMC), New Orleans (2006)
21. Jekosch, U.: Assigning meaning to sounds - semiotics in the context of product-sound design. In: Blauert, J. (ed.), pp. 193–221 (2005)
22. McKay, C., McEnnis, D., Fujinaga, I.: A large publicly accessible prototype audio
database for music research (2006)
23. Merer, A., Ystad, S., Kronland-Martinet, R., Aramaki, M.: Semiotics of sounds
evoking motions: Categorization and acoustic features. In: Kronland-Martinet, R.,
Ystad, S., Jensen, K. (eds.) CMMR 2007. LNCS, vol. 4969, pp. 139158. Springer,
Heidelberg (2008)
24. Micoulaud-Franchi, J.A., Cermolacce, M., Vion-Dury, J.: Bizarre and familiar
recognition troubles of auditory perception in patients with schizophrenia (2010)
(in preparation)
25. Moore, B.C.J., Tan, C.T.: Perceived naturalness of spectrally distorted speech and
music. The Journal of the Acoustical Society of America 114, 408419 (2003)
26. Peeters, G., Deruty, E.: Automatic morphological description of sounds. In: Acoustics 2008, Paris, France (2008)
27. Ricard, J., Herrera, P.: Morphological sound description computational model and
usability evaluation. In: AES 116th Convention (2004)
28. Risset, J.C., Wessel, D.L.: Exploration of timbre by analysis and synthesis. In:
Deutsch, D. (ed.) The psychology of music. Series in Cognition and Perception,
pp. 113169. Academic Press, London (1999)
29. Schaeffer, P.: Traité des objets musicaux. Éditions du Seuil (1966)
30. Schaeffer, P., Reibel, G.: Solfège de l'objet sonore. INA-GRM (1967)
31. Schlauch, R.S.: Loudness. In: Ecological Psychoacoustics, pp. 318–341. Elsevier, Amsterdam (2004)
32. Schön, D., Ystad, S., Kronland-Martinet, R., Besson, M.: The evocative power
of sounds: Conceptual priming between words and nonverbal sounds. Journal of
Cognitive Neuroscience 22, 1026–1035 (2010)
33. Shafiro, V., Gygi, B.: How to select stimuli for environmental sound research and
where to find them. Behavior Research Methods, Instruments, & Computers 36,
590–598 (2004)
34. Smalley, D.: Defining timbre - refining timbre. Contemporary Music Review 10,
35–48 (1994)
35. Smalley, D.: Space-form and the acousmatic image. Org. Sound 12, 3558 (2007)
36. Tanaka, K., Matsubara, K., Sato, T.: Study of onomatopoeia expressing strange
sounds: Cases of impulse sounds and beat sounds. Transactions of the Japan Society
of Mechanical Engineers C 61, 47304735 (1995)
37. Thoresen, L., Hedman, A.: Spectromorphological analysis of sound objects: an
adaptation of Pierre Schaeffer's typomorphology. Organised Sound 12, 129–141
(2007)
38. Zeitler, A., Ellermeier, W., Fastl, H.: Significance of meaning in sound quality
evaluation. Fortschritte der Akustik, CFA/DAGA 4, 781–782 (2004)
39. Zeitler, A., Hellbrueck, J., Ellermeier, W., Fastl, H., Thoma, G., Zeller, P.: Methodological approaches to investigate the effects of meaning, expectations and context
in listening experiments. In: INTER-NOISE 2006, Honolulu, Hawaii (2006)
Introduction
in principle be located at any temporal position and are not necessarily scaled
to the same length as the query pattern, temporal alignment of the query and
target patterns poses a significant computational challenge in large databases.
Given that the alignment problem can be solved, another prerequisite for meaningful pattern matching is to define a distance measure between musical patterns
of different kinds. These issues will be discussed in Sec. 4.
Pattern processing in music has several interesting applications, including
music information retrieval, music classification, cover song identification, and
creation of mash-ups by blending matching excerpts from different music pieces.
Given a large database of music, quite detailed queries can be made, such as
searching for a piece that would work as an accompaniment for a user-created
melody.
There are various levels at which pattern induction and matching can take place
in music. At one extreme, a polyphonic music signal is considered as a coherent
whole and features describing its harmonic or timbral aspects, for example, are
calculated. In a more analytic approach, some part of the signal, such as the
melody or the drums, is extracted before the feature calculation. Both of these
approaches are valid from the perceptual viewpoint. Human listeners, especially
trained musicians, can switch between a holistic listening mode and a more
analytic one where they focus on the part played by a particular instrument or
decompose music into its constituent elements and their relationships [8,3].
Even when a music signal is treated as a coherent whole, it is necessary to
transform the acoustic waveform into a series of feature vectors x1, x2, ..., xT
that characterize the desired aspect of the signal. Among the most widely used
features are Mel-frequency cepstral coefficients (MFCCs) to represent the timbral
content of a signal in terms of its spectral energy distribution [73]. The local
harmonic content of a music signal, in turn, is often summarized using a 12-dimensional chroma vector that represents the amount of spectral energy falling
at each of the 12 tones of an equally-tempered scale [5,50]. Rhythmic aspects are
conveniently represented by the modulation spectrum which encodes the pattern
of sub-band energy fluctuations within windows of approximately one second in
length [15,34]. Besides these, there are a number of other acoustic features, see
[60] for an overview.
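As a concrete illustration of these features, the following minimal Python sketch (assuming the librosa library, which is not referenced in this paper; "track.wav" is a placeholder file name) computes MFCCs and chroma vectors for a recording:

```python
# Minimal sketch of the features mentioned above, assuming the librosa
# library; "track.wav" is a placeholder file name.
import librosa

y, sr = librosa.load("track.wav", sr=None, mono=True)

# Timbre: Mel-frequency cepstral coefficients, one vector per analysis frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # shape (13, T)

# Local harmony: 12-dimensional chroma vectors.
chroma = librosa.feature.chroma_stft(y=y, sr=sr)        # shape (12, T)

print(mfcc.shape, chroma.shape)
```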
Focusing pattern extraction on a certain instrument or part in polyphonic music requires that the desired part be pulled apart from the rest before the feature
extraction. While this is not entirely straightforward in all cases, it enables musically more interesting pattern induction and matching, such as looking at the
melodic contour independently of the accompanying instruments. Some strategies towards decomposing a music signal into its constituent parts are discussed
in the following.
190
2.1
A. Klapuri
Musical sounds, like most natural sounds, tend to be sparse in the time-frequency
domain, meaning that the sounds can be approximated using a small number of
non-zero elements in the time-frequency domain. This facilitates sound source
separation and audio content analysis. Usually the short-time Fourier transform
(STFT) is used to represent a given signal in the time-frequency domain. A
viable alternative for STFT is the constant-Q transform (CQT), where the center
frequencies of the frequency bins are geometrically spaced [9,68]. CQT is often
ideally suited for the analysis of music signals, since the fundamental frequencies
(F0s) of the tones in Western music are geometrically spaced.
Spatial information can sometimes be used to organize time-frequency components to their respective sound sources [83]. In the case of stereophonic audio,
time-frequency components can be clustered based on the ratio of left-channel
amplitude to the right, for example. This simple principle has been demonstrated
to be quite effective for some music types, such as jazz [4], despite the fact that
overlapping partials partly undermine the idea. Duda et al. [18] used stereo
information to extract the lead vocals from complex audio for the purpose of
query-by-humming.
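The following sketch illustrates the panning-based grouping idea in a very simplified form (it is not the algorithm of [4] or [18]; librosa and the placeholder file name are assumptions):

```python
# Simplified illustration of panning-based grouping: assign each STFT bin to
# the left, center or right "source" by the ratio of left- to right-channel
# magnitude, then resynthesize one group.
import numpy as np
import librosa

y, sr = librosa.load("track.wav", sr=None, mono=False)   # stereo, shape (2, N)
L, R = librosa.stft(y[0]), librosa.stft(y[1])

ratio = np.abs(L) / (np.abs(L) + np.abs(R) + 1e-12)      # 0 = hard right, 1 = hard left
center_mask = (ratio > 0.4) & (ratio < 0.6)              # bins panned near the middle

center = librosa.istft(center_mask * (L + R) / 2)        # crude "center" (often vocals)
```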
2.2
It is often desirable to analyze the drum track of music separately from the
harmonic part. The sinusoids+noise model is the most widely-used technique
for this purpose [71]. It produces quite robust quality for the noise residual,
although the sinusoidal (harmonic) part often suffers quality degradation for
music with dense sets of sinusoids, such as orchestral music.
Ono et al. proposed a method which decomposes the power spectrogram
X (of size F × T) of a mixture signal into a harmonic part H and a percussive part P so
that X = H + P [52]. The decomposition is done by minimizing an objective
function that measures variation over time n for the harmonic part and variation over frequency k for the percussive part. The method is straightforward to
implement and produces good results.
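The decomposition X = H + P can be experimented with along the following lines; note that librosa implements a median-filtering harmonic/percussive separation rather than the iterative minimization of Ono et al., so this is only an illustration of the general idea:

```python
# Illustration of the decomposition X = H + P. librosa uses a median-filtering
# harmonic/percussive separation, not the algorithm of Ono et al.
import librosa

y, sr = librosa.load("track.wav", sr=None)
S = librosa.stft(y)
H, P = librosa.decompose.hpss(S)       # complex spectrograms, S is roughly H + P

harmonic = librosa.istft(H)            # resynthesized harmonic part
percussive = librosa.istft(P)          # resynthesized percussive (drum) part
```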
Non-negative matrix factorization (NMF) is a technique that decomposes the
spectrogram of a music signal into a linear sum of components that have a
fixed spectrum and time-varying gains [41,76]. Helen and Virtanen used the
NMF to separate the magnitude spectrogram of a music signal into a couple of
dozen components and then used a support vector machine (SVM) to classify
each component either to pitched instruments or to drums, based on features
extracted from the spectrum and the gain function of each component [31].
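A minimal sketch of such an NMF decomposition (using scikit-learn's NMF; the component count is arbitrary and the SVM classification stage of [31] is omitted):

```python
# Sketch of NMF on a magnitude spectrogram: S is approximated by W @ G, where
# the columns of W are fixed component spectra and the rows of G are their
# time-varying gains.
import numpy as np
import librosa
from sklearn.decomposition import NMF

y, sr = librosa.load("track.wav", sr=None)
S = np.abs(librosa.stft(y))                       # magnitude spectrogram, shape (F, T)

model = NMF(n_components=20, init="nndsvd", max_iter=400)
W = model.fit_transform(S)                        # component spectra, shape (F, 20)
G = model.components_                             # gain functions, shape (20, T)

# Features of each component (its spectrum W[:, k] and gain G[k, :]) could now
# be fed to a classifier that labels it as "pitched instrument" or "drums".
```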
2.3
Vocal melody is usually the main focus of attention for an average music listener,
especially in popular music. It tends to be the part that makes music memorable
and easily reproducible by singing or humming [69].
Several different methods have been proposed for the main melody extraction
from polyphonic music. The task was first considered by Goto [28] and later
various methods for melody tracking have been proposed by Paiva et al. [54],
Ellis and Poliner [22], Dressler [17], and Ryynänen and Klapuri [65]. Typically,
the methods are based on framewise pitch estimation followed by tracking or
streaming over time. Some methods involve a timbral model [28,46,22] or a
musicological model [67]. For comparative evaluations of the different methods,
see [61] and [www.music-ir.org/mirex/].
Melody extraction is closely related to vocals separation: extracting the melody
facilitates lead vocals separation, and vice versa. Several different approaches
have been proposed for separating the vocals signal from polyphonic music, some
based on tracking the pitch of the main melody [24,45,78], some based on timbre
models for the singing voice and for the instrumental background [53,20], and
yet others utilizing stereo information [4,18].
Bass line is another essential part in many music types and usually contains
a great deal of repetition and note patterns that are rhythmically and tonally
interesting. Indeed, high-level features extracted from the bass line and the playing style have been successfully used for music genre classification [1]. Methods
for extracting the bass line from polyphonic music have been proposed by Goto
[28], Hainsworth [30], and Ryynänen [67].
2.4
Pattern Induction
Pattern induction deals with the problem of detecting repeated sequential structures in music and learning the pattern underlying these repetitions. In the
following, we discuss the problem of musical pattern induction from a general
Fig. 1. A piano-roll representation for an excerpt from Mozart's Turkish March. The
vertical lines indicate a possible grouping of the component notes into phrases.
perspective. We assume that a time series of feature vectors x1, x2, ..., xT describing the desired characteristics of the input signal is given. The task of
pattern induction, then, is to detect repeated sequences in this data and to
learn a prototypical pattern that can be used to represent all its occurrences.
What makes this task challenging is that the data is generally multidimensional
and real-valued (as opposed to symbolic data), and furthermore, music seldom
repeats itself exactly, but variations and transformations are applied on each
occurrence of a given pattern.
3.1
The good news here is that meter analysis is a well-understood and feasible problem
for audio signals too (see e.g. [39]). Furthermore, melodic phrase boundaries often coincide with strong beats, although this is not always the case. For melodic
patterns, for example, this segmenting rule effectively requires two patterns to be
similarly positioned with respect to the musical measure boundaries in order for
them to be similar, which may sometimes be too strong an assumption. However,
for drum patterns this requirement is well justified.
Bertin-Mahieux et al. performed harmonic pattern induction for a large
database of music in [7]. They calculated a 12-dimensional chroma vector for
each musical beat in the target songs. The beat-synchronous chromagram data
was then segmented at barline positions and the resulting beat-chroma patches
were vector quantized to obtain a couple of hundred prototype patterns.
A third strategy is to avoid segmentation altogether by using shift-invariant
features. As an example, let us consider a sequence of one-dimensional features
x1, x2, ..., xT. The sequence is first segmented into partly-overlapping frames
that have length approximately the same as the patterns being sought. Then the
sequence within each frame is Fourier transformed and the phase information is
discarded in order to make the features shift-invariant. The resulting magnitude
spectra are then clustered to find repeated patterns. The modulation spectrum
features (aka fluctuation patterns) mentioned in the beginning of Sec. 2 are an
example of such a shift-invariant feature [15,34].
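A minimal sketch of such a shift-invariant feature, assuming a one-dimensional feature sequence and hypothetical frame parameters:

```python
# Shift-invariant features: frame a one-dimensional feature sequence,
# Fourier-transform each frame, and keep only the magnitude, which does not
# change when the pattern is (circularly) shifted within the frame.
import numpy as np

def shift_invariant_features(x, frame_len, hop):
    """x: 1-D feature sequence; returns one magnitude spectrum per frame."""
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]
    return np.array([np.abs(np.fft.rfft(f)) for f in frames])

x = np.random.rand(1000)                  # stand-in for a real feature sequence
F = shift_invariant_features(x, frame_len=64, hop=16)
# The rows of F can now be clustered (e.g. with k-means) to find recurring patterns.
```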
3.2 Self-distance Matrix
Pattern induction, in the sense defined in the beginning of this section, is possible
only if a pattern is repeated in a given feature sequence. The repetitions need not
be identical, but bear some similarity with each other. A self-distance matrix (aka
self-similarity matrix) offers a direct way of detecting these similarities. Given
a feature sequence x1, x2, ..., xT and a distance function d that specifies the
distance between two feature vectors xi and xj, the self-distance matrix (SDM)
is defined as

D(i, j) = d(xi, xj)    (1)

for i, j ∈ {1, 2, ..., T}. Frequently used distance measures include the Euclidean
distance ||xi - xj|| and the cosine distance 0.5(1 - <xi, xj>/(||xi|| ||xj||)). Repeated
sequences appear in the SDM as off-diagonal stripes. Methods for detecting these
will be discussed below.
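Equation (1) translates directly into code; the sketch below uses SciPy's "cosine" metric, which is 1 - cos rather than the 0.5(1 - cos) scaling given above:

```python
# Direct implementation of Eq. (1).
import numpy as np
from scipy.spatial.distance import cdist

def self_distance_matrix(X, metric="cosine"):
    """X: array of shape (T, d), one feature vector per row."""
    return cdist(X, X, metric=metric)

X = np.random.rand(200, 12)               # e.g. 200 beat-synchronous chroma vectors
D = self_distance_matrix(X)               # D[i, j] = d(x_i, x_j)
# Repeated sequences appear as low-valued stripes parallel to the main diagonal.
```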
An obvious difficulty in calculating the SDM is that when the length T of the
feature sequence is large, the number of distance computations T^2 may become
computationally prohibitive. A typical solution to overcome this is to use beat-synchronized features: a beat tracking algorithm is applied and the features xt
are then calculated within (or averaged over) each inter-beat interval. Since the
average inter-beat interval is approximately 0.5 seconds, much larger than a
typical analysis frame size, this greatly reduces the number of elements in the
time sequence and in the SDM. An added benefit of using beat-synchronous
features is that this compensates for tempo fluctuations within the piece under
analysis. As a result, repeated sequences appear in the SDM as stripes that run
Fig. 2. A self-distance matrix for Chopin's Étude Op. 25 No. 9, calculated using beat-synchronous chroma features. As the off-diagonal dark stripes indicate, the note sequence between 1 s and 5 s starts again at 5 s, and later at 28 s and 32 s in a varied
form.
exactly parallel to the main diagonal. Figure 2 shows an example SDM calculated
using beat-synchronous chroma features.
Self-distance matrices have been widely used for audio-based analysis of the
sectional form (structure) of music pieces [12,57]. In that domain, several different methods have been proposed for localizing the off-diagonal stripes that
indicate repeating sequences in the music [59,27,55]. Goto, for example, first
calculates a marginal histogram which indicates the diagonal bands that contain considerable repetition, and then finds the beginning and end points of the
repeated segments in a second step [27]. Serrà has proposed an interesting method
for detecting locally similar sections in two feature sequences [70].
3.3
Repeated patterns are heavily utilized in universal lossless data compression algorithms. The Lempel-Ziv-Welch (LZW) algorithm, in particular, is based on
matching and replacing repeated patterns with code values [80]. Let us denote a
sequence of discrete symbols by s1, s2, ..., sT. The algorithm initializes a dictionary which contains codes for individual symbols that are possible at the input.
At the compression stage, the input symbols are gathered into a sequence until
the next character would make a sequence for which there is no code yet in the
dictionary, and a new code for that sequence is then added to the dictionary.
The usefulness of the LZW algorithm for musical pattern matching is limited
by the fact that it requires a sequence of discrete symbols as input, as opposed to
real-valued feature vectors. This means that a given feature vector sequence has
to be vector-quantized before processing with the LZW. In practice, also beat-synchronous feature extraction is needed to ensure that the lengths of repeated
sequences are not affected by tempo fluctuation. Vector quantization (VQ, [25])
as such is not a problem, but choosing a suitable level of granularity becomes
very difficult: if the number of symbols is too large, then two repeats of a certain
very dicult: if the number of symbols is too large, then two repeats of a certain
pattern are quantized dissimilarly, and if the number of symbols is too small, too
much information is lost in the quantization and spurious repeats are detected.
Another inherent limitation of the LZW family of algorithms is that they require exact repetition. This is usually not appropriate in music, where variation
is more a rule than an exception. Moreover, the beginning and end times of the
learned patterns are arbitrarily determined by the order in which the input sequence is analyzed. Improvements over the LZW family of algorithms for musical
pattern induction have been considered e.g. by Lartillot et al. [40].
3.4
Pattern induction is often used for the purpose of predicting a data sequence. N-gram models are a popular choice for predicting a sequence of discrete symbols
s1, s2, ..., sT [35]. In an N-gram, the preceding N - 1 symbols are used to determine the probabilities for different symbols to appear next, P(st | st-1, ..., st-N+1).
Increasing N gives more accurate predictions, but requires a very large amount of
training data to estimate the probabilities reliably. A better solution is to use a
variable-order Markov model (VMM) for which the context length N varies in
response to the available statistics in the training data [6]. This is a very desirable feature, and for note sequences, this means that both short and long note
sequences can be modeled within a single model, based on their occurrences in
the training data. Probabilistic predictions can be made even when patterns do
not repeat exactly.
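For illustration, a fixed-order N-gram (here N = 3) estimated by simple counting; a VMM would additionally back off to shorter contexts when the counts for a long context are sparse:

```python
# Fixed-order trigram model estimated by counting symbol transitions.
from collections import Counter, defaultdict

def train_trigram(symbols):
    counts = defaultdict(Counter)
    for a, b, c in zip(symbols, symbols[1:], symbols[2:]):
        counts[(a, b)][c] += 1
    return counts

def predict(counts, context):
    """Relative frequencies P(next symbol | length-2 context)."""
    total = sum(counts[context].values())
    return {s: n / total for s, n in counts[context].items()} if total else {}

model = train_trigram(list("abcabcabd"))
print(predict(model, ("a", "b")))      # roughly {'c': 0.67, 'd': 0.33}
```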
Ryynänen and Klapuri used VMMs as a predictive model in a method that
transcribes bass lines in polyphonic music [66]. They used the VMM toolbox of
transcribes bass lines in polyphonic music [66]. They used the VMM toolbox of
Begleiter et al. for VMM training and prediction [6].
3.5
Music often introduces a certain pattern to the listener in a simpler form before
adding further layers of instrumentation at subsequent repetitions (and variations) of the pattern. Provided that the repetitions are detected via pattern
induction, this information can be fed back in order to improve the separation
and analysis of certain instruments or parts in the mixture signal. This idea was
used by Mauch et al. who used information about music structure to improve
recognition of chords in music [47].
Pattern Matching
This section considers the problem of searching a database of music for segments
that are similar to a given pattern. The query pattern is denoted by a feature
Fig. 3. A matrix of distances used by DTW to find a time-alignment between different
feature sequences. The vertical axis represents the time in a query excerpt (Queen's
Bohemian Rhapsody). The horizontal axis corresponds to the concatenation of features
from three different excerpts: 1) the query itself, 2) Target 1 (Bohemian Rhapsody
performed by the London Symphonium Orchestra) and 3) Target 2 (It's a Kind of Magic by
Queen). The beginnings of the three targets are indicated below the matrix. Darker values
indicate smaller distance.
To allow flexibility in pattern scaling and to mitigate the effect of tempo estimation errors,
it is sometimes useful to further time-scale the beat-synchronized query pattern
by factors 1/2, 1, and 2, and match each of these separately.
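A bare-bones sketch of the dynamic time warping (DTW) alignment illustrated in Fig. 3, written directly in NumPy (random arrays stand in for real query and target features):

```python
# Bare-bones DTW between a query and a target feature sequence; the cumulative
# cost in the bottom-right corner serves as an alignment-based distance.
import numpy as np
from scipy.spatial.distance import cdist

def dtw_cost(Q, T, metric="cosine"):
    """Q: (m, d) query feature vectors, T: (n, d) target feature vectors."""
    d = cdist(Q, T, metric=metric)
    D = np.full((len(Q) + 1, len(T) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(Q) + 1):
        for j in range(1, len(T) + 1):
            D[i, j] = d[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1, -1]

Q = np.random.rand(40, 12)             # e.g. beat-synchronous chroma of the query
T = np.random.rand(60, 12)
print(dtw_cost(Q, T))
```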
A remaining problem to be solved is the temporal shift: if the target database
is very large, comparing the query pattern at every possible temporal position
in the database can be infeasible. Shift-invariant features are one way of dealing
with this problem: they can be used for approximate pattern matching to prune
the target data, after which the temporal alignment is computed for the best-matching candidates. This allows the first stage of matching to be performed an
order of magnitude faster.
Another potential solution for the time-shift problem is to segment the target
database by meter analysis or grouping analysis, and then match the query
pattern only at temporal positions determined by estimated bar lines or group
boundaries. This approach was already discussed in Sec. 3.
Finally, efficient indexing techniques exist for dealing with extremely large
databases. In practice, these require that the time-scale problem is eliminated
(e.g. using beat-synchronous features) and the number of time-shifts is greatly
reduced (e.g. using shift-invariant features or pre-segmentation). If these conditions are satisfied, locality-sensitive hashing (LSH), for example, enables
sublinear search complexity for retrieving the approximate nearest neighbours
of the query pattern from a large database [14]. Ryynänen et al. used LSH for
melodic pattern matching in [64].
4.2
Melodic pattern matching is usually considered in the context of query-by-humming (QBH), where a user's singing or humming is used as a query to
retrieve music with a matching melodic fragment. Typically, the user's singing is
first transcribed into a pitch trajectory or a note sequence before the matching
takes place. QBH has been studied for more than 15 years and remains an active
research topic [26,48].
Research on QBH originated in the context of the retrieval from MIDI or
score databases. Matching approaches include string matching techniques [42],
hidden Markov models [49,33], dynamic programming [32,79], and efficient recursive alignment [81]. A number of QBH systems have been evaluated in the Music
Information Retrieval Evaluation eXchange (MIREX) [16].
Methods for the QBH of audio data have been proposed only quite recently
[51,72,18,29,64]. Typically, the methods extract the main melodies from the target musical audio (see Sec. 2.3) before the matching takes place. However, it
should be noted that a given query melody can in principle be matched directly against polyphonic audio data in the time-frequency or time-pitch domain. Some on-line services incorporating QBH are already available, see e.g.
[www.midomi.com], [www.musicline.de], [www.musipedia.org].
Matching two melodic patterns requires a proper definition of similarity. The
trivial assumption that two patterns are similar if they have identical pitches
is usually not appropriate. There are three main reasons that cause the query
pattern and the target matches to differ: 1) low quality of the sung queries (especially in the case of musically untrained users), 2) errors in extracting the main
melodies automatically from music recordings, and 3) musical variation, such as
fragmentation (elaboration) or consolidation (reduction) of a given melody [43].
One approach that works quite robustly in the presence of all these factors is
to calculate Euclidean distance between temporally aligned log-pitch trajectories. Musical key normalization can be implemented simply by normalizing the
two pitch contours to zero mean. More extensive review of research on melodic
similarity can be found in [74].
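The zero-mean log-pitch comparison described above amounts to only a few lines:

```python
# Key-invariant melodic distance: normalize two temporally aligned log-pitch
# trajectories (e.g. MIDI note numbers) to zero mean, then take the Euclidean
# distance.
import numpy as np

def melodic_distance(query_pitch, target_pitch):
    """Inputs: 1-D arrays of equal length, pitch in semitones."""
    q = query_pitch - np.mean(query_pitch)     # removes the transposition offset
    t = target_pitch - np.mean(target_pitch)
    return np.linalg.norm(q - t)

a = np.array([60, 62, 64, 65, 67], dtype=float)
b = a + 5                                      # the same melody transposed upwards
print(melodic_distance(a, b))                  # 0.0: transposition does not matter
```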
4.3
Instead of using only the main melody for music retrieval, polyphonic pitch
data can be processed directly. Multipitch estimation algorithms (see [11,38] for
review) can be used to extract multiple pitch values in successive time frames, or
alternatively, a mapping from time-frequency to a time-pitch representation can
be employed [37]. Both of these approaches yield a representation in the time-pitch plane, the difference being that multipitch estimation algorithms yield a
discrete set of pitch values, whereas mapping to a time-pitch plane yields a
more continuous representation. Matching a query pattern against a database
of music signals can be carried out by a two-dimensional correlation analysis in
the time-pitch plane.
4.4 Chord Sequences
Fig. 4. Major and minor triads arranged in a two-dimensional chord space. Here the
Euclidean distance between any two points can be used to approximate the distance
between chords. The dotted lines indicate the four distance parameters that define this
particular space.
Measuring the distance between two chord sequences requires that the distance
between each pair of different chords is defined. Often this distance is approximated by arranging chords in a one- or two-dimensional space, and then using
the geometric distance between chords in this space as the distance measure [62],
see Fig. 4 for an example. In the one-dimensional case, the circle of fifths is often
used.
It is often useful to compare two chord sequences in a key-invariant manner.
This can be done by expressing chords in relation to the tonic (that is, using chord
degrees instead of the absolute chords), or by comparing all the 12 possible
transformations and choosing the minimum distance.
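A toy example of such a key-invariant comparison, using a one-dimensional circle-of-fifths distance for chord roots; the reduction to major-triad roots and the exact distance definition are simplifications for illustration only:

```python
# Toy key-invariant comparison of two chord sequences: chord-to-chord distance
# is measured in steps along the circle of fifths, and the minimum over all 12
# transpositions of one sequence is kept.
def fifths_distance(a, b):
    """a, b: root pitch classes 0..11; number of steps on the circle of fifths."""
    pos = lambda pc: (pc * 7) % 12            # pitch class -> circle-of-fifths index
    diff = abs(pos(a) - pos(b))
    return min(diff, 12 - diff)

def sequence_distance(seq1, seq2):
    return sum(fifths_distance(a, b) for a, b in zip(seq1, seq2))

def key_invariant_distance(seq1, seq2):
    return min(sequence_distance(seq1, [(c + k) % 12 for c in seq2]) for k in range(12))

C, F, G, D, A = 0, 5, 7, 2, 9
print(key_invariant_distance([C, F, G, C], [D, G, A, D]))   # 0: same I-IV-V-I progression
```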
4.5
Here we discuss pattern matching in drum tracks that are presented as acoustic
signals and are possibly extracted from polyphonic music using the methods
described in Sec. 2.2. Applications of this include for example query-by-tapping
[www.music-ir.org/mirex/] and music retrieval based on drum track similarity.
Percussive music devoid of both harmony and melody can contain a considerable amount of musical form and structure, encoded into the timbre, loudness,
and timing relationships between the component sounds. Timbre and loudness characteristics can be conveniently represented by MFCCs extracted in
successive time frames. Often, however, the absolute spectral shape and loudness of the component sounds are not of interest; instead, the timbre and
loudness of sounds relative to each other defines the perceived rhythm. Paulus
and Klapuri reduced the rhythmic information into a two-dimensional signal
describing the evolution of loudness and spectral centroid over time, in order to compare rhythmic patterns performed using an arbitrary set of sounds
[56]. The features were mean- and variance-normalized to allow comparison
across different sound sets, and DTW was used to align the two patterns under
comparison.
Ellis and Arroyo projected drum patterns into a low-dimensional representation, where different rhythms could be represented as a linear sum of so-called
eigenrhythms [21]. They collected 100 drum patterns from popular music tracks
and estimated the bar line positions in these. Each pattern was normalized and
the resulting set of patterns was subjected to principal component analysis in
order to obtain a set of basis patterns (eigenrhythms) that were then combined to approximate the original data. The low-dimensional representation of
the drum patterns was used as a space for classication and for measuring similarity between rhythms.
Non-negative matrix factorization (NMF, see Sec. 2.2) is another technique for
obtaining a mid-level representation for drum patterns [58]. The resulting component gain functions can be subjected to the eigenrhythm analysis described
above, or statistical measures can be calculated to characterize the spectra and
gain functions for rhythm comparison.
Conclusions
This paper has discussed the induction and matching of sequential patterns
in musical audio. Such patterns are neglected by the commonly used bag-of-features approach to music retrieval, where statistics over feature vectors are
calculated to collapse the time structure altogether. Processing sequential structures poses computational challenges, but also enables musically interesting retrieval tasks beyond those possible with the bag-of-features approach. Some of
these applications, such as query-by-humming services, are already available for
consumers.
Acknowledgments. Thanks to Jouni Paulus for the Matlab code for computing self-distance matrices. Thanks to Christian Dittmar for the idea of using
repeated patterns to improve the accuracy of source separation and analysis.
References
1. Abesser, J., Lukashevich, H., Dittmar, C., Schuller, G.: Genre classification using bass-related high-level features and playing styles. In: Intl. Society on Music
Information Retrieval Conference, Kobe, Japan (2009)
2. Badeau, R., Emiya, V., David, B.: Expectation-maximization algorithm for multipitch estimation and separation of overlapping harmonic spectra. In: Proc. IEEE
ICASSP, Taipei, Taiwan, pp. 30733076 (2009)
3. Barbour, J.: Analytic listening: A case study of radio production. In: International
Conference on Auditory Display, Sydney, Australia (July 2004)
4. Barry, D., Lawlor, B., Coyle, E.: Sound source separation: Azimuth discrimination
and resynthesis. In: 7th International Conference on Digital Audio Eects, Naples,
Italy, pp. 240244 (October 2004)
5. Bartsch, M.A., Wakefield, G.H.: To catch a chorus: Using chroma-based representations for audio thumbnailing. In: IEEE Workshop on Applications of Signal
Processing to Audio and Acoustics, New Paltz, USA, pp. 1518 (2001)
6. Begleiter, R., El-Yaniv, R., Yona, G.: On prediction using variable order Markov
models. J. of Artificial Intelligence Research 22, 385–421 (2004)
7. Bertin-Mahieux, T., Weiss, R.J., Ellis, D.P.W.: Clustering beat-chroma patterns in
a large music database. In: Proc. of the Int. Society for Music Information Retrieval
Conference, Utrecht, Netherlands (2010)
8. Bever, T.G., Chiarello, R.J.: Cerebral dominance in musicians and nonmusicians.
The Journal of Neuropsychiatry and Clinical Neurosciences 21(1), 9497 (2009)
9. Brown, J.C.: Calculation of a constant Q spectral transform. J. Acoust. Soc.
Am. 89(1), 425434 (1991)
10. Burred, J., Röbel, A., Sikora, T.: Dynamic spectral envelope modeling for the
analysis of musical instrument sounds. IEEE Trans. Audio, Speech, and Language
Processing (2009)
11. de Cheveigne, A.: Multiple F0 estimation. In: Wang, D., Brown, G.J. (eds.) Computational Auditory Scene Analysis: Principles, Algorithms and Applications. Wiley
IEEE Press (2006)
12. Dannenberg, R.B., Goto, M.: Music structure analysis from acoustic signals. In:
Havelock, D., Kuwano, S., Vorländer, M. (eds.) Handbook of Signal Processing in
Acoustics, pp. 305–331. Springer, Heidelberg (2009)
13. Dannenberg, R.B., Hu, N.: Pattern discovery techniques for music audio. Journal
of New Music Research 32(2), 153163 (2003)
14. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.: Locality-sensitive hashing
scheme based on p-stable distributions. In: ACM Symposium on Computational
Geometry, pp. 253262 (2004)
15. Dixon, S., Pampalk, E., Widmer, G.: Classification of dance music by periodicity
patterns. In: 4th International Conference on Music Information Retrieval, Baltimore, MD, pp. 159165 (2003)
16. Downie, J.S.: The music information retrieval evaluation exchange (2005–2007): A
window into music information retrieval research. Acoustical Science and Technology 29(4), 247255 (2008)
17. Dressler, K.: An auditory streaming approach on melody extraction. In: Intl. Conf.
on Music Information Retrieval, Victoria, Canada (2006); MIREX evaluation
18. Duda, A., Nürnberger, A., Stober, S.: Towards query by humming/singing on audio
databases. In: International Conference on Music Information Retrieval, Vienna,
Austria, pp. 331334 (2007)
19. Durrieu, J.L., Ozerov, A., Fevotte, C., Richard, G., David, B.: Main instrument separation from stereophonic audio signals using a source/filter model. In: Proc.
separation from stereophonic audio signals using a source/lter model. In: Proc.
EUSIPCO, Glasgow, Scotland (August 2009)
20. Durrieu, J.L., Richard, G., David, B., Fevotte, C.: Source/filter model for unsupervised main melody extraction from polyphonic audio signals. IEEE Trans. on
Audio, Speech, and Language Processing 18(3), 564575 (2010)
21. Ellis, D., Arroyo, J.: Eigenrhythms: Drum pattern basis sets for classification
and generation. In: International Conference on Music Information Retrieval,
Barcelona, Spain
22. Ellis, D.P.W., Poliner, G.: Classification-based melody transcription. Machine
Learning 65(2-3), 439456 (2006)
23. FitzGerald, D., Cranitch, M., Coyle, E.: Extended nonnegative tensor factorisation
models for musical source separation. Computational Intelligence and Neuroscience
(2008)
24. Fujihara, H., Goto, M.: A music information retrieval system based on singing voice
timbre. In: Intl. Conf. on Music Information Retrieval, Vienna, Austria (2007)
25. Gersho, A., Gray, R.: Vector Quantization and Signal Compression. Kluwer Academic Publishers, Dordrecht (1991)
26. Ghias, A., Logan, J., Chamberlin, D.: Query by humming: Musical information
retrieval in an audio database. In: ACM Multimedia Conference 1995. Cornell
University, San Francisco (1995)
27. Goto, M.: A chorus-section detecting method for musical audio signals. In: IEEE
International Conference on Acoustics, Speech, and Signal Processing, Hong Kong,
China, vol. 5, pp. 437440 (April 2003)
28. Goto, M.: A real-time music scene description system: Predominant-F0 estimation
for detecting melody and bass lines in real-world audio signals. Speech Communication 43(4), 311329 (2004)
29. Guo, L., He, X., Zhang, Y., Lu, Y.: Content-based retrieval of polyphonic music
objects using pitch contour. In: IEEE International Conference on Audio, Speech
and Signal Processing, Las Vegas, USA, pp. 22052208 (2008)
30. Hainsworth, S.W., Macleod, M.D.: Automatic bass line transcription from polyphonic music. In: International Computer Music Conference, Havana, Cuba, pp.
431434 (2001)
31. Helen, M., Virtanen, T.: Separation of drums from polyphonic music using non-negative matrix factorization and support vector machine. In: European Signal Processing Conference, Antalya, Turkey (2005)
32. Jang, J.S.R., Gao, M.Y.: A query-by-singing system based on dynamic programming. In: International Workshop on Intelligent Systems Resolutions (2000)
33. Jang, J.S.R., Hsu, C.L., Lee, H.R.: Continuous HMM and its enhancement for
singing/humming query retrieval. In: 6th International Conference on Music Information Retrieval, London, UK (2005)
34. Jensen, K.: Multiple scale music segmentation using rhythm, timbre, and harmony.
EURASIP Journal on Advances in Signal Processing (2007)
35. Jurafsky, D., Martin, J.H.: Speech and language processing. Prentice Hall, New
Jersey (2000)
36. Kitahara, T., Goto, M., Komatani, K., Ogata, T., Okuno, H.G.: Instrogram: Probabilistic representation of instrument existence for polyphonic music. IPSJ Journal 48(1), 214226 (2007)
37. Klapuri, A.: A method for visualizing the pitch content of polyphonic music signals.
In: Intl. Society on Music Information Retrieval Conference, Kobe, Japan (2009)
38. Klapuri, A., Davy, M. (eds.): Signal Processing Methods for Music Transcription.
Springer, New York (2006)
39. Klapuri, A., Eronen, A., Astola, J.: Analysis of the meter of acoustic musical signals. IEEE Trans. Speech and Audio Processing 14(1) (2006)
40. Lartillot, O., Dubnov, S., Assayag, G., Bejerano, G.: Automatic modeling of musical style. In: International Computer Music Conference (2001)
41. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401, 788791 (1999)
42. Lemström, K.: String Matching Techniques for Music Retrieval. Ph.D. thesis, University of Helsinki (2000)
43. Lerdahl, F., Jackendoff, R.: A Generative Theory of Tonal Music. MIT Press,
Cambridge (1983)
44. Leveau, P., Vincent, E., Richard, G., Daudet, L.: Instrument-specific harmonic
atoms for mid-level music representation. IEEE Trans. Audio, Speech, and Language Processing 16(1), 116128 (2008)
45. Li, Y., Wang, D.L.: Separation of singing voice from music accompaniment for
monaural recordings. IEEE Trans. on Audio, Speech, and Language Processing 15(4), 14751487 (2007)
46. Marolt, M.: Audio melody extraction based on timbral similarity of melodic fragments. In: EUROCON (November 2005)
47. Mauch, M., Noland, K., Dixon, S.: Using musical structure to enhance automatic
chord transcription. In: Proc. 10th Intl. Society for Music Information Retrieval
Conference, Kobe, Japan (2009)
48. McNab, R., Smith, L., Witten, I., Henderson, C., Cunningham, S.: Towards the
digital music library: Tune retrieval from acoustic input. In: First ACM International Conference on Digital Libraries, pp. 1118 (1996)
49. Meek, C., Birmingham, W.: Applications of binary classication and adaptive
boosting to the query-by-humming problem. In: Intl. Conf. on Music Information
Retrieval, Paris, France (2002)
50. Müller, M., Ewert, S., Kreuzer, S.: Making chroma features more robust to timbre
changes. In: Proceedings of IEEE International Conference on Acoustics, Speech,
and Signal Processing, Taipei, Taiwan, pp. 18691872 (April 2009)
51. Nishimura, T., Hashiguchi, H., Takita, J., Zhang, J.X., Goto, M., Oka, R.: Music
signal spotting retrieval by a humming query using start frame feature dependent
continuous dynamic programming. In: 2nd Annual International Symposium on
Music Information Retrieval, Bloomington, Indiana, USA, pp. 211218 (October
2001)
52. Ono, N., Miyamoto, K., Roux, J.L., Kameoka, H., Sagayama, S.: Separation of
a monaural audio signal into harmonic/percussive components by complementary
diffusion on spectrogram. In: European Signal Processing Conference, Lausanne,
Switzerland, pp. 240244 (August 2008)
53. Ozerov, A., Philippe, P., Bimbot, F., Gribonval, R.: Adaptation of Bayesian models
for single-channel source separation and its application to voice/music separation
in popular songs. IEEE Trans. on Audio, Speech, and Language Processing 15(5),
15641578 (2007)
54. Paiva, R.P., Mendes, T., Cardoso, A.: On the detection of melody notes in polyphonic audio. In: 6th International Conference on Music Information Retrieval,
London, UK, pp. 175182
55. Paulus, J.: Signal Processing Methods for Drum Transcription and Music Structure
Analysis. Ph.D. thesis, Tampere University of Technology (2009)
56. Paulus, J., Klapuri, A.: Measuring the similarity of rhythmic patterns. In: Intl.
Conf. on Music Information Retrieval, Paris, France (2002)
57. Paulus, J., Müller, M., Klapuri, A.: Audio-based music structure analysis. In: Proc.
of the Int. Society for Music Information Retrieval Conference, Utrecht, Netherlands (2010)
58. Paulus, J., Virtanen, T.: Drum transcription with non-negative spectrogram factorisation. In: European Signal Processing Conference, Antalya, Turkey (September 2005)
59. Peeters, G.: Sequence representations of music structure using higher-order similarity matrix and maximum-likelihood approach. In: Intl. Conf. on Music Information
Retrieval, Vienna, Austria, pp. 3540 (2007)
60. Peeters, G.: A large set of audio features for sound description (similarity and
classification) in the CUIDADO project. Tech. rep., IRCAM, Paris, France (April
2004)
61. Poliner, G., Ellis, D., Ehmann, A., Gómez, E., Streich, S., Ong, B.: Melody transcription from music audio: Approaches and evaluation. IEEE Trans. on Audio,
Speech, and Language Processing 15(4), 12471256 (2007)
62. Purwins, H.: Profiles of Pitch Classes - Circularity of Relative Pitch and Key:
Experiments, Models, Music Analysis, and Perspectives. Ph.D. thesis, Berlin University of Technology (2005)
63. Rowe, R.: Machine musicianship. MIT Press, Cambridge (2001)
64. Ryynänen, M., Klapuri, A.: Query by humming of MIDI and audio using locality
sensitive hashing. In: IEEE International Conference on Audio, Speech and Signal
Processing, Las Vegas, USA, pp. 22492252
65. Ryynänen, M., Klapuri, A.: Transcription of the singing melody in polyphonic
music. In: Intl. Conf. on Music Information Retrieval, Victoria, Canada, pp. 222–227 (2006)
66. Ryynänen, M., Klapuri, A.: Automatic bass line transcription from streaming polyphonic audio. In: IEEE International Conference on Audio, Speech and Signal
Processing, pp. 14371440 (2007)
67. Ryynänen, M., Klapuri, A.: Automatic transcription of melody, bass line, and
chords in polyphonic music. Computer Music Journal 32(3), 7286 (2008)
68. Schörkhuber, C., Klapuri, A.: Constant-Q transform toolbox for music processing.
In: 7th Sound and Music Computing Conference, Barcelona, Spain (2010)
69. Selfridge-Field, E.: Conceptual and representational issues in melodic comparison.
Computing in Musicology 11, 364 (1998)
70. Serrà, J., Gómez, E., Herrera, P., Serra, X.: Chroma binary similarity and local
alignment applied to cover song identification. IEEE Trans. on Audio, Speech, and
Language Processing 16, 11381152 (2007)
71. Serra, X.: Musical sound modeling with sinusoids plus noise. In: Roads, C., Pope,
S., Picialli, A., Poli, G.D. (eds.) Musical Signal Processing, Swets & Zeitlinger
(1997)
72. Song, J., Bae, S.Y., Yoon, K.: Mid-level music melody representation of polyphonic
audio for query-by-humming system. In: Intl. Conf. on Music Information Retrieval,
Paris, France, pp. 133139 (October 2002)
73. Tokuda, K., Kobayashi, T., Masuko, T., Imai, S.: Mel-generalized cepstral analysis:
a unified approach to speech spectral estimation. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, Adelaide, Australia (1994)
74. Typke, R.: Music Retrieval based on Melodic Similarity. Ph.D. thesis, Universiteit
Utrecht (2007)
75. Vincent, E., Bertin, N., Badeau, R.: Harmonic and inharmonic nonnegative matrix
factorization for polyphonic pitch transcription. In: IEEE ICASSP, Las Vegas, USA
(2008)
76. Virtanen, T.: Unsupervised learning methods for source separation in monaural
music signals. In: Klapuri, A., Davy, M. (eds.) Signal Processing Methods for Music
Transcription, pp. 267296. Springer, Heidelberg (2006)
77. Virtanen, T.: Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio, Speech,
and Language Processing 15(3), 10661074 (2007)
78. Virtanen, T., Mesaros, A., Ryynänen, M.: Combining pitch-based inference and
non-negative spectrogram factorization in separating vocals from polyphonic music.
In: ISCA Tutorial and Research Workshop on Statistical and Perceptual Audition,
Brisbane, Australia (September 2008)
79. Wang, L., Huang, S., Hu, S., Liang, J., Xu, B.: An effective and efficient method
for query by humming system based on multi-similarity measurement fusion. In:
International Conference on Audio, Language and Image Processing, pp. 471475
(July 2008)
80. Welch, T.A.: A technique for high-performance data compression. Computer 17(6),
8–19 (1984)
81. Wu, X., Li, M., Yang, J., Yan, Y.: A top-down approach to melody match in pitch
contour for query by humming. In: International Conference of Chinese Spoken
Language Processing (2006)
82. Yeh, C.: Multiple fundamental frequency estimation of polyphonic recordings.
Ph.D. thesis, University of Paris VI (2008)
83. Yilmaz, O., Rickard, S.: Blind separation of speech mixtures via time-frequency
masking. IEEE Trans. on Signal Processing 52(7), 18301847 (2004)
Abstract. A system is presented that learns the structure of an audio recording of a rhythmical percussion fragment in an unsupervised
manner and that synthesizes musical variations from it. The procedure
consists of 1) segmentation, 2) symbolization (feature extraction, clustering, sequence structure analysis, temporal alignment), and 3) synthesis.
The symbolization step yields a sequence of event classes. Simultaneously, representations are maintained that cluster the events into few or
many classes. Based on the most regular clustering level, a tempo estimation procedure is used to preserve the metrical structure in the generated
sequence. Employing variable length Markov chains, the final synthesis
is performed, recombining the audio material derived from the sample
itself. Representations with different numbers of classes are used to trade
off statistical significance (short context sequence, low clustering refinement) versus specificity (long context, high clustering refinement) of the
generated sequence. For a broad variety of musical styles the musical
characteristics of the original are preserved. At the same time, considerable variability is introduced in the generated sequence.
Keywords: music analysis, music generation, unsupervised clustering,
Markov chains, machine listening.
Introduction
with musicians, producing jazz-style music. Another system with the same characteristics as the Continuator, called OMax, was able to learn an audio stream employing an indexing procedure explained in [5]. Hazan et al. [8] built a system which first segments the musical stream and extracts timbre and onsets. An unsupervised clustering process yields a sequence of symbols that is then processed by n-grams. The method by Marxer and Purwins [12] consists of a conceptual clustering algorithm coupled with a hierarchical N-gram. Our method presented in this article was first described in detail in [11].
First, we define the system design and the interaction of its parts. Starting from low-level descriptors, we translate them into a fuzzy score representation, where two sounds can either be discretized yielding the same symbol or yielding different symbols according to which level of interpretation is chosen (Section 2). Then we perform skeleton subsequence extraction and tempo detection to align the score to a grid. At the end, we get a sequence that is homogeneous in time, on which we perform the prediction. For the generation of new sequences, we reorder the parts of the score, respecting the statistical properties of the sequence while at the same time maintaining the metrical structure (Section 3). In Section 4, we discuss an example.
Fig. 1. Overview of the system: Segmentation and Symbolization of the audio input yield an aligned multilevel representation; a statistic model produces continuation indices that select audio segments for the generation of audio sequences.
2.1 Segmentation
First, the audio input signal is analyzed by an onset detector that segments the audio file into a sequence of musical events. Each event is characterized by its position in time (onset) and an audio segment, the audio signal starting at the onset position and ending at the following contiguous onset. In the further processing, these events will serve two purposes. On one side, the events are stored as an indexed sequence of audio fragments which will be used for the re-synthesis in the end. On the other side, these events will be compared with each other to generate a reduced score-like representation of the percussion patterns to base a tempo analysis on (cf. Fig. 1 and Sec. 2.2).
We used the onset detector implemented in the MIR toolbox [9] that is based only on the energy envelope, which proves to be sufficient for our purpose of analyzing percussion sounds.
2.2 Symbolization
We used the single linkage algorithm to discover event clusters in this space (cf. [6] for details). This algorithm recursively performs clustering in a bottom-up manner. Points are grouped into clusters. Then clusters are merged with additional points, and clusters are merged with clusters into super clusters. The distance between two clusters is defined as the shortest distance between two points, each being in a different cluster, yielding a binary tree representation of the point similarities (cf. Fig. 2). The leaf nodes correspond to single events. Each node of the tree occurs at a certain height, representing the distance between the two child nodes. Figure 2 (top) shows an example of a clustering tree of the onset events of a sound sequence.
Fig. 2. Clustering tree of the onset events of a sound sequence (top). The vertical axis shows the cluster distance, the horizontal line marks the height threshold, and the horizontal axis is time (s).
The height threshold controls the number of clusters. Clusters are generated with inter-cluster distances higher than the height threshold. Two thresholds lead to the same cluster configuration if and only if their values are both within the range delimited by the previous lower node and the next upper node in the tree. It is therefore evident that by changing the height threshold, we can get as many different cluster configurations as the number of events we have in the sequence. Each cluster configuration leads to a different symbol alphabet size and therefore to a different symbol sequence representing the original audio file. We will refer to those sequences as representation levels or simply levels.
These levels are implicitly ordered. On the leaf level at the bottom of the tree we find the lowest inter-cluster distances, corresponding to a sequence with each event being encoded by a unique symbol due to weak quantization. On the root level on top of the tree we find the cluster configuration with the highest inter-cluster distances, corresponding to a sequence with all events denoted by the same symbol due to strong quantization. Given a particular level, we will refer to the events denoted by the same symbol as the instances of that symbol. We do not consider the implicit inheritance relationships between symbols of different levels.
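As a rough illustration of this step, the following sketch (not the authors' code) builds the single-linkage tree with SciPy and cuts it at every distinct merge height, so that each cut yields one representation level; the per-event feature vectors are assumed to be given.

```python
# Sketch: single-linkage clustering of onset events and extraction of representation
# levels by cutting the tree at different height thresholds (assumed feature vectors).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def representation_levels(features):
    """Map each distinct height threshold to a symbol sequence (one symbol per event)."""
    tree = linkage(features, method="single")       # bottom-up, shortest-distance merging
    merge_heights = sorted(set(tree[:, 2]))          # each gap between merges gives one level
    levels = {}
    for lo, hi in zip(merge_heights, merge_heights[1:] + [merge_heights[-1] * 1.01]):
        threshold = (lo + hi) / 2                    # any value in (lo, hi) yields the same clustering
        levels[threshold] = fcluster(tree, t=threshold, criterion="distance")
    return levels

# Example with random 12-dimensional descriptors for 20 onset events.
rng = np.random.default_rng(0)
levels = representation_levels(rng.normal(size=(20, 12)))
for thr, symbols in sorted(levels.items()):
    print(f"threshold {thr:.2f}: {len(set(symbols))} clusters")
```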
Fig. 3. A continuous audio signal (top) is discretized via clustering, yielding a sequence of symbols (bottom). The colors inside the colored triangles denote the cluster of the event, related to the type of sound, i.e., bass drum, hi-hat, or snare.
2.3 Level Selection
onsets given by this subsequence. This sequence can be seen as a set of points on a time line. We are interested in quantifying the degree of temporal regularity of those onsets. Firstly, we compute the histogram of the time differences (CIOIH) between all possible combinations of two onsets (middle of Fig. 4). What we obtain is a sort of harmonic series of peaks that are more or less prominent according to the self-similarity of the sequence on different scales. Secondly, we compute the autocorrelation ac(t) (where t is the time in seconds) of the CIOIH which, in the case of a regular sequence, has peaks at multiples of its tempo. Let t_usp be the positive time value corresponding to its upper side peak. Given the sequence of m onsets x = (x_1, ..., x_m), we define the regularity of the sequence of onsets x to be:

$$\mathrm{Regularity}(x) = \frac{ac(t_{usp})}{\frac{1}{t_{usp}}\int_0^{t_{usp}} ac(t)\,dt}\,\log(m)$$
This definition was motivated by the observation that the higher this value, the more equally the onsets are spaced in time. The logarithm of the number of onsets was multiplied by the ratio to give more importance to symbols with more instances.
Fig. 4. The procedure applied for computing the regularity value of an onset sequence
(top) is outlined. Middle: the histogram of the complete IOI between onsets. Bottom:
the autocorrelation of the histogram is shown for a subrange of IOI with relevant peaks
marked.
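The following sketch shows one possible reading of this regularity measure; the IOI histogram bin size, the maximum lag, the peak-picking rule for the upper side peak, and the discrete approximation of the integral are assumptions, not taken from the paper.

```python
# Sketch of the regularity measure: histogram of pairwise inter-onset intervals,
# its autocorrelation, and the ratio at the upper side peak weighted by log(m).
import numpy as np

def regularity(onsets, bin_size=0.01, max_lag=4.0):
    onsets = np.asarray(onsets, dtype=float)
    m = len(onsets)
    if m < 3:
        return 0.0
    iois = np.abs(onsets[:, None] - onsets[None, :])[np.triu_indices(m, k=1)]
    edges = np.arange(0.0, max_lag + bin_size, bin_size)
    cioih, _ = np.histogram(iois, bins=edges)                  # the CIOIH
    ac = np.correlate(cioih, cioih, mode="full")[len(cioih) - 1:].astype(float)
    if ac[0] == 0:
        return 0.0
    ac /= ac[0]
    peaks = [k for k in range(1, len(ac) - 1) if ac[k - 1] < ac[k] >= ac[k + 1]]
    if not peaks:
        return 0.0
    k_usp = peaks[0]                                           # upper side peak
    mean_ac = np.mean(ac[:k_usp + 1])                          # ~ (1/t_usp) * integral of ac
    return float(ac[k_usp] / mean_ac * np.log(m))

# A perfectly regular onset sequence scores higher than a jittered one.
regular = np.arange(0.0, 8.0, 0.5)
jittered = regular + np.random.default_rng(1).normal(0.0, 0.08, len(regular))
print(regularity(regular), regularity(jittered))
```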
Then we extended, for each level, the regularity concept to an overall regularity of the level. This simply corresponds to the mean of the regularities over all the appropriate symbols of the level. The regularity of the level is defined to be zero in case there is no appropriate symbol.
After the regularity value has been computed for each level, we select the level where the maximum regularity is reached. The resulting level will be referred to as the regular level.
We also decided to keep the levels where we have a local maximum, because they generally refer to the levels where a partially regular interpretation of the sequence is achieved. In the case where consecutive levels of a sequence share the same regularity, only the one derived from the higher cluster distance threshold is kept. Figure 5 shows the regularity of the sequence for different levels.
Fig. 5. Sequence regularity for a range of cluster distance thresholds (x-axis). An ENST
audio excerpt was used for the analysis. The regularity reaches its global maximum
value in a central position. Towards the right, regularity increases and then remains
constant. The selected peaks are marked with red crosses implying a list of cluster
distance threshold values.
2.4 Beat Alignment
In order to predict future events without breaking the metrical structure, we use a tempo detection method and introduce a way to align onsets to a metrical grid, accounting for the position of the sequence in the metrical context. For our purpose of learning and regenerating the structure statistically, we do not require a perfect beat detection. Even if we detect a beat that is twice or half as fast as the perceived beat, or that mistakes an on-beat for an off-beat, our system can still tolerate this for the analysis of a music fragment, as long as the inter-beat interval and the beat phase are always misestimated in the same way.
Our starting point is the regular level that has been found with the procedure explained in the previous subsection. On this level we select the appropriate symbol with the highest regularity value. The subsequence that carries this symbol
Fig. 7. The event sequence derived from a segmentation by onset detection is indicated by triangles. The vertical lines show the division of the sequence into blocks of
homogeneous tempo. The red solid lines represent the beat position (as obtained by
the skeleton subsequence). The other black lines (either dashed if aligned to a detected
onset or dotted if no close onset is found) represent the subdivisions of the measure
into four blocks.
Because of the tolerance used for building such a grid, it can be noticed that sometimes the effective measure duration is slightly longer or slightly shorter. This implements the idea that the grid should be elastic in the sense that, up to a certain degree, it adapts to the (expressive) timing variations of the actual sequence.
The skeleton grid catches a part of the complete list of onsets, but we would like to build a grid where most of the onsets are aligned. Therefore, starting from the skeleton grid, the intermediate point between every two subsequent beats is found and aligned with an onset (if one exists within a tolerance region; otherwise a place-holding onset is added). The procedure is recursively repeated until at least 80% of the onsets are aligned to a grid position or the number of created onsets exceeds the number of total onsets. In Fig. 7, an example is presented along with the resulting grid, where the skeleton grid, its aligned subdivisions, and the non-aligned subdivisions are indicated by different line markers.
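A minimal sketch of this refinement loop is given below; the tolerance value, the snapping rule, and the hard cap on refinement passes are assumptions introduced for the illustration.

```python
# Sketch of the recursive grid refinement: subdivide between grid points, snap
# subdivisions to nearby onsets, otherwise insert place-holders, and stop once
# enough onsets are aligned or too many place-holders have been created.
import numpy as np

def refine_grid(skeleton, onsets, tol=0.05, target=0.8, max_passes=10):
    grid = sorted(float(b) for b in skeleton)                  # beat positions of the skeleton
    onsets = np.asarray(sorted(onsets), dtype=float)
    created = 0
    for _ in range(max_passes):
        dists = np.min(np.abs(onsets[:, None] - np.asarray(grid)[None, :]), axis=1)
        if np.sum(dists <= tol) >= target * len(onsets) or created > len(onsets):
            break                                              # enough onsets are aligned
        new_grid = [grid[0]]
        for a, b in zip(grid, grid[1:]):
            mid = (a + b) / 2.0
            near = onsets[np.abs(onsets - mid) <= tol]
            if len(near):                                      # snap the subdivision to an onset
                new_grid.append(float(near[np.argmin(np.abs(near - mid))]))
            else:                                              # otherwise add a place-holder
                new_grid.append(mid)
                created += 1
            new_grid.append(b)
        grid = new_grid
    return np.asarray(grid)

beats = [0.0, 1.0, 2.0, 3.0]
onsets = [0.0, 0.52, 1.0, 1.49, 1.76, 2.0, 2.51, 3.0]
print(refine_grid(beats, onsets))
```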
Note that, for the sake of simplicity, our approach assumes that the metrical structure is binary. This may cause the sequence to be split erroneously. However, we will see in a ternary tempo example that this is not a limiting factor for the generation, because the statistical representation somehow compensates for it, even if less variable generations are achieved. A more general approach could be implemented with little modification.
The final grid is made of blocks of time of almost equal duration that can contain none, one, or more onset events. It is important that the sequence given to the statistical model is almost homogeneous in time, so that a certain number of blocks corresponds to a defined time duration.
We used the following rules to assign a symbol to a block (cf. Fig. 7):
– blocks starting on an aligned onset are denoted by the symbol of the aligned onset,
– blocks starting on a non-aligned grid position are denoted by the symbol of the previous block.
Finally, a metrical phase value is assigned to each block, describing the number of grid positions passed after the last beat position (corresponding to the metrical position of the block). For each representation level, the new representation of the sequence will be the Cartesian product of the instrument symbol and the phase.
Generation Strategies
We tested the system on two audio databases. The first one is the ENST database (see [7]), which provided a collection of around forty drum recording examples. For a descriptive evaluation, we asked two professional percussionists to judge several examples of generations as if they were performances of a student. Moreover, we asked one of them to record beat boxing excerpts, trying to push the system to the limits of complexity, and to critically assess the sequences that the system had generated from these recordings. The evaluations of the generations created from the ENST examples revealed that the style of the original had been maintained and that the generations had a high degree of interestingness [10].
Some examples are available on the website [1] along with graphical animations visualizing the analysis process. In each video, we see the original sound fragment and the generation derived from it. The horizontal axis corresponds to the time in seconds and the vertical axis to the clustering quantization resolution. Each video shows an animated graphical representation in which each block is represented by a triangle. At each moment, the context and the currently played block are represented by enlarged and colored triangles.
In the first part of the video, the original sound is played and the animation shows the extracted block representation. The currently played block is represented by an enlarged colored triangle and highlighted by a vertical dashed red line. The other colored triangles highlight all blocks from the starting point of the bar up to the current block. In the second part of the video, only the skeleton subsequence is played. The sequence on top is derived from applying the largest clustering threshold (smallest number of clusters) and the one on the bottom corresponds to the lowest clustering threshold (highest number of clusters). In the final part of the video, the generation is shown. The colored triangles
represent the current block and the current context. The size of the colored triangles decreases monotonically from the current block backwards, displaying the past time context window considered by the system. The colored triangles appear only on the levels selected by the generation strategy.
In Figure 8, we see an example of successive states of the generation. The levels used by the generator to compute the continuation and the context are highlighted, showing colored triangles that decrease in size from the largest, corresponding to the current block, to the smallest, which is the furthest past context block considered by the system. In Frame I, the generation starts with block no. 4, belonging to the event class indicated by light blue. In the beginning, no previous context is considered for the generation. In Frame II, a successive block no. 11 of the green event class has been selected using all five levels, and a context history of length 1 just consisting of block no. 4 of the light blue event class. Note that the context given by only one light blue block matches the continuation no. 11, since the previous block (no. 10) is also denoted by light blue at all five levels. In Frame III, the context is the bi-gram of the event classes light blue (no. 4) and green (no. 11). Only one level is selected, since at all other levels the bi-gram that corresponds to the colors light blue and green appears only once. However, at that level the system finds three matches (blocks no. 6, 10 and 12) and randomly selects no. 10. In Frame IV, the levels differ in the length of the maximal past context. At one level, one but only one match (no. 11) is found for the 3-gram light blue - green - light blue, and thus this level is discarded. At three further levels, no matches for 3-grams are found, but all these levels include 2 matches (blocks no. 5 and 9) for the bi-gram (green - light blue). At the remaining level, no match is found for a bi-gram either, but 3 occurrences of the light blue triangle are found.
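The following simplified sketch illustrates the idea of variable-length context matching across several representation levels, trading context length against the number of matches; the actual level-selection and weighting rules of the system are more involved than shown here.

```python
# Simplified sketch: per level, search the longest recent context that occurs often
# enough in the original block sequence, pool the candidate continuations over the
# levels, and draw one at random.
import random

def candidates(sequence, context, min_matches=2):
    """Indices that follow the longest suffix of `context` occurring often enough."""
    for length in range(len(context), 0, -1):
        ctx = context[-length:]
        hits = [i + length for i in range(len(sequence) - length)
                if sequence[i:i + length] == ctx]
        if len(hits) >= min_matches:
            return hits
    return list(range(len(sequence)))              # no usable context: any block may follow

def generate(levels, n_blocks, max_context=4, seed=0):
    """`levels` maps a level name to the original block sequence encoded at that level."""
    seqs = list(levels.values())
    rng = random.Random(seed)
    generated = [rng.randrange(len(seqs[0]))]      # indices into the original block sequence
    while len(generated) < n_blocks:
        pool = []
        for seq in seqs:
            context = [seq[i] for i in generated[-max_context:]]
            pool.extend(candidates(seq, context))
        generated.append(rng.choice(pool))
    return generated

fine   = list("abacabad")                          # a fine and a coarse encoding of 8 blocks
coarse = list("xyxzxyxw")
print(generate({"fine": fine, "coarse": coarse}, 12))
```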
Discussion
Our system effectively generates sequences respecting the structure and the tempo of the original sound fragment for medium to high complexity rhythmic patterns.
A descriptive evaluation by a professional percussionist confirmed that the metrical structure is correctly managed and that the statistical representation generates musically meaningful sequences. He noticed explicitly that the drum fills (short musical passages which help to sustain the listener's attention during a break between the phrases) were handled adequately by the system.
The percussionist's criticism was directed at the lack of dynamics, agogics, and musically meaningful long-term phrasing, which we did not address in our approach.
Part of those features could be achieved in the future by extending the system to the analysis of non-binary meter. Achieving musically sensible dynamics and agogics (rallentando, accelerando, rubato, ...) of the generated musical continuation, for example by extrapolation [14], remains a challenge for future work.
Fig. 8. Nine successive frames of the generation. The red vertical dashed line marks the currently played event. In each frame, the largest colored triangle denotes the last played event that influences the generation of the next event. The size of the triangles decreases going back in time. Only for the selected levels are the triangles enlarged. We can see how the length of the context as well as the number of selected levels dynamically change during the generation. Cf. Section 4 for a detailed discussion of this figure.
Acknowledgments
Many thanks to Panos Papiotis for his patience during lengthy recording sessions and for providing us with beat boxing examples, the evaluation feedback, and inspiring comments. Thanks a lot to Ricard Marxer for his helpful support. The first author (MM) expresses his gratitude to Mirko Degli Esposti and Anna Rita Addessi for their support and for motivating this work. The second author (HP) was supported by a Juan de la Cierva scholarship of the Spanish Ministry of Science and Innovation.
References
1. (December 2010), www.youtube.com/user/audiocontinuation
2. Bühlmann, P., Wyner, A.J.: Variable length Markov chains. Annals of Statistics 27, 480–513 (1999)
3. Cope, D.: Virtual Music: Computer Synthesis of Musical Style. MIT Press, Cambridge (2004)
4. Dixon, S.: Automatic extraction of tempo and beat from expressive performances. Journal of New Music Research 30(1), 39–58 (2001)
5. Dubnov, S., Assayag, G., Cont, A.: Audio oracle: A new algorithm for fast learning of audio structures. In: Proceedings of the International Computer Music Conference (ICMC), pp. 224–228 (2007)
6. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley, Chichester (2001)
7. Gillet, O., Richard, G.: ENST-Drums: an extensive audio-visual database for drum signals processing. In: ISMIR, pp. 156–159 (2006)
8. Hazan, A., Marxer, R., Brossier, P., Purwins, H., Herrera, P., Serra, X.: What/when causal expectation modelling applied to audio signals. Connection Science 21, 119–143 (2009)
9. Lartillot, O., Toiviainen, P., Eerola, T.: A Matlab toolbox for music information retrieval. In: Annual Conference of the German Classification Society (2007)
10. Marchini, M.: Unsupervised Generation of Percussion Sequences from a Sound Example. Master's thesis (2010)
11. Marchini, M., Purwins, H.: Unsupervised generation of percussion sound sequences from a sound example. In: Sound and Music Computing Conference (2010)
12. Marxer, R., Purwins, H.: Unsupervised incremental learning and prediction of audio signals. In: Proceedings of the 20th International Symposium on Music Acoustics (2010)
13. Pachet, F.: The Continuator: Musical interaction with style. In: Proceedings of ICMC, pp. 211–218. ICMA (2002)
14. Purwins, H., Holonowicz, P., Herrera, P.: Polynomial extrapolation for prediction of surprise based on loudness – a preliminary study. In: Sound and Music Computing Conference, Porto (2009)
15. Ron, D., Singer, Y., Tishby, N.: The power of amnesia: learning probabilistic automata with variable memory length. Mach. Learn. 25(2-3), 117–149 (1996)
Introduction
Musical expressivity can be studied by analyzing the differences (deviations) between a musical score and its execution. These deviations are mainly motivated by two purposes: to clarify the musical structure [26,10,23] and to communicate affective content [16,19,11]. Moreover, these expressive deviations vary depending on the musical genre, the instrument, and the performer. Specifically, each performer has his/her own unique way to add expressivity by using the instrument.
Our research on musical expressivity aims at developing a system able to model the use of the expressive resources of a classical guitar. In guitar playing, both hands are used: one hand is used to press the strings on the fretboard and the other to pluck the strings. Strings can be plucked using a single plectrum called a flatpick or directly using the tips of the fingers. The hand that presses the frets mainly determines the notes, while the hand that plucks the strings mainly determines the note onsets and timbral properties. However, the left hand is also involved in the creation of a note onset or different expressive articulations like legatos, glissandos, and vibratos.
Some guitarists use the right hand to pluck the strings whereas others use the
left hand. For the sake of simplicity, in the rest of the document we consider the
hand that plucks the strings as the right hand and the hand that presses the frets as the left hand.
As a first stage of our research, we are developing a tool able to automatically identify, from a recording, the use of guitar articulations. According to Norton [22], guitar articulations can be divided into three main groups related to the part of the sound where they act: attack, sustain, and release articulations.
In this research we focus on the identification of attack articulations such as legatos and glissandos. Specifically, we present an automatic detection and classification system that takes audio recordings as input. We can divide our system into two main modules (Figure 1): extraction and classification. The extraction module determines the expressive articulation regions of a classical guitar recording, whereas the classification module analyzes these regions and determines the kind of articulation (legato or glissando).
In both legato and glissando, the left hand is involved in the creation of the note onset. In the case of ascending legato, after plucking the string with the right hand, one of the fingers of the left hand (not already used for pressing one of the frets) presses a fret, causing another note onset. Descending legato is performed by plucking the string with a left-hand finger that was previously used to play a note (i.e., pressing a fret).
The case of glissando is similar, but this time, after plucking one of the strings with the right hand, the left-hand finger that is pressing the string is slid to another fret, also generating another note onset.
When playing legato or glissando on the guitar, it is common for the performer to play more notes within a beat than the stated timing, enriching the music that is played. A powerful legato and glissando can easily be differentiated from each other by ear. However, in a musical phrase context where the legato and glissando are not isolated, it is hard to differentiate between these two expressive articulations.
The structure of the paper is as follows: Section 2 briefly describes the current state of the art of guitar analysis studies. Section 3 describes our methodology for articulation determination and classification. Section 4 focuses on the experiments conducted to evaluate our approach. The last section, Section 5, summarizes current results and presents the next research steps.
Related Work
The guitar is one of the most popular instruments in western music, and most music genres include the guitar. Although plucked instruments and guitar synthesis have been studied extensively (see [9,22]), the analysis of expressive articulations from real guitar recordings has not been fully tackled. This analysis is complex because the guitar is an instrument with a rich repertoire of expressive articulations and because, when playing guitar melodies, several strings may be vibrating at the same time. Moreover, even the synthesis of a single tone is a complex subject [9].
Expressive studies go back to the early twentieth century. In 1913, Johnstone [15] analyzed piano performers. Johnstone's analysis can be considered one of the first studies focusing on musical expressivity. Advances in audio processing techniques have made it possible to analyze audio recordings at a finer level (see [12] for an overview). Up to now, there have been several studies focused on the analysis of expressivity of different instruments. Although the instruments analyzed differ, most of them focus on analyzing monophonic or single-instrument recordings.
For instance, Mantaras et al. [20] presented a survey of computer music systems based on Artificial Intelligence techniques. Examples of AI-based systems are SaxEx [1] and TempoExpress [13]. SaxEx is a case-based reasoning system that generates expressive jazz saxophone melodies from recorded examples of human performances. More recently, TempoExpress performs tempo transformations of audio recordings taking into account the expressive characteristics of a performance and using a CBR approach.
Regarding guitar analysis, an interesting piece of research comes from Stanford University. Traube [28] estimated the plucking point on a guitar string by using a frequency-domain technique applied to acoustically recorded signals. The plucking point of a guitar string affects the sound envelope and influences the timbral characteristics of notes. For instance, plucking close to the guitar hole produces more mellow and sustained sounds, whereas plucking near the bridge (end of the guitar body) produces sharper and less sustained sounds. Traube also proposed an original method to detect the fingering point, based on the plucking point information.
In another interesting paper, Lee [17] proposes a new method for extraction of the excitation point of an acoustic guitar signal. Before explaining the method, three state-of-the-art techniques are examined in order to compare them with the new one. The techniques analyzed are matrix pencil inverse-filtering, sinusoids plus noise inverse-filtering, and magnitude spectrum smoothing. After describing and comparing these three techniques, the author proposes a new method, statistical spectral interpolation, for excitation signal extraction.
Although fingering studies are not directly related to expressivity, their results may contribute to clarify and/or constrain the use of left-hand expressive articulations. Hank Heijink and Ruud G. J. Meulenbroek [14] performed a behavioral study about the complexity of the left-hand fingering of classical guitar. Different audio and camera recordings of six professional guitarists playing the
same song were used to find optimal places and fingerings for the notes. Several constraints were introduced to calculate cost functions, such as minimization of jerk, torque change, muscle-tension change, work, energy, and neuromotor variance. As a result of the study, they found a significant effect on timing.
In another interesting study, [25] investigates the optimal fingering position for a given set of notes. Their method, path difference learning, uses tablatures and AI techniques to obtain fingering positions and transitions. Radicioni et al. [24] also worked on finding the proper fingering position and transitions. Specifically, they calculated the weights of the finger transitions between finger positions by using the weights of Heijink [14]. Burns and Wanderley [4] proposed a method to visually detect and recognize fingering gestures of the left hand of a guitarist by using an affordable camera.
Unlike the general trend in the literature, Trajano [27] investigated right-hand fingering. Although he analyzed the right hand, his approach has similarities with left-hand studies. In his article, Trajano uses his own definitions and cost functions to calculate the optimal selection of right-hand fingers.
The first step when analyzing guitar expressivity is to identify and characterize the way notes are played, i.e., guitar articulations. The analysis of expressive articulations has previously been performed with image analysis techniques. Last but not least, one of the few studies focusing on guitar expressivity is the PhD thesis of Norton [22]. In his dissertation, Norton proposed the use of a motion capture system based on PhaseSpace Inc. technology to analyze guitar articulations.
Methodology
Articulation refers to how the pieces of something are joined together. In music, these pieces are the notes, and the different ways of executing them are called articulations. In this paper we propose a new system that is able to determine and classify two expressive articulations from audio files. For this purpose we have two main modules: the extraction module and the classification module (see Figure 1). In the extraction module, we determine the sound segments where expressive articulations are present. The purpose of this module is to classify audio regions as expressive articulations or not. Next, the classification module analyzes the regions that were identified as candidates of expressive articulations by the extraction module, and labels them as legato or glissando.
3.1 Extraction
The goal of the extraction module is to find the places where a performer played expressive articulations. To that purpose, we analyzed a recording using several audio analysis algorithms and combined the information obtained from them to make a decision.
Our approach is based on first determining the note onsets caused when plucking the strings. Next, a more fine-grained analysis is performed inside the regions delimited by two plucking onsets to determine whether an articulation may be present. A simple representation diagram of the extraction module is shown in Figure 2.
For the analysis we used Aubio [2]. Aubio is a library designed for the annotation of audio signals. The Aubio library includes four main applications: aubioonset, aubionotes, aubiocut, and aubiopitch. Each application gives us the chance of trying different algorithms and also of tuning several other parameters. In the current prototype we are using aubioonset for our plucking detection sub-module and aubionotes for our pitch detection sub-module.
At the end we combine the outputs from both sub-modules and decide whether there is an expressive articulation or not. In the next two sections, the plucking detection sub-module and the pitch detection sub-module are described. Finally, we explain how we combine the information provided by these two sub-modules to determine the existence of expressive articulations.
Plucking Detection. Our first task is to determine the onsets caused by the plucking hand. As we stated before, guitar performers can apply different articulations by using both of their hands. However, the kind of articulations that we are investigating (legatos and glissandos) are performed by the left hand. Although they can cause onsets, these onsets are not as powerful in terms of both energy and harmonicity [28]. Therefore, we need an onset determination algorithm suited to this specific characteristic.
The High Frequency Content measure is a measure taken across a signal spectrum, and can be used to characterize the amount of high-frequency content (HFC) in the signal. The magnitudes of the spectral bins are added together, but multiplying each magnitude by the bin position [21]. As Brossier stated, HFC is effective with percussive onsets but less successful in determining non-percussive and legato phrases [3]. As right-hand onsets are more percussive than left-hand onsets, HFC was the strongest candidate detection algorithm for right-hand onsets. HFC is sensitive to abrupt onsets but not too sensitive to the changes of fundamental frequency caused by the left hand. This is the main reason why we chose HFC to measure the changes in the harmonic content of the signal.
The aubioonset application gave us the opportunity to tune the peak-picking threshold, which we tested with a set of hand-labeled recordings, including both articulated and non-articulated notes. We used 1.7 for the peak-picking threshold and 95 dB for the silence threshold. We used this set as our ground truth and tuned our values according to this set.
An example of the resulting onsets proposed by HFC is shown in Figure 3. Specifically, in the exemplified recording, 5 plucking onsets are detected, i.e., onsets caused by the plucking hand, which are shown with vertical lines. Between some pairs of detected onsets, expressive articulations are present. However, as shown in the figure, HFC succeeds, as it only determines the onsets caused by the right hand.
Next, each portion between two plucking onsets is analyzed individually. Specifically, we are interested in determining two points: the end of the attack and the start of the release. Based on experimental measures, the attack end position is considered to be 10 ms after the amplitude reaches its local maximum. The release start position is considered to be the final point where the local amplitude is equal to or greater than 3 percent of the local maximum. For example, in Figure 4, the first portion of Figure 3 is zoomed. The first and the last lines are the plucking onsets identified by the HFC algorithm. The first dashed line is the place where the attack finishes. The second dashed line is the place where the release starts.
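The two reference points can be estimated roughly as in the following sketch; the frame-based envelope computation is an assumption, only the 10 ms and 3 percent rules come from the text.

```python
# Sketch of attack-end / release-start estimation between two plucking onsets.
import numpy as np

def attack_release(segment, sr, frame=256):
    """Return (attack_end, release_start) in seconds, relative to the segment start."""
    n_frames = len(segment) // frame
    env = np.array([np.max(np.abs(segment[i * frame:(i + 1) * frame]))
                    for i in range(n_frames)])            # coarse amplitude envelope
    peak = int(np.argmax(env))
    attack_end = peak * frame / sr + 0.010                 # 10 ms after the local maximum
    above = np.nonzero(env >= 0.03 * env[peak])[0]         # frames above 3% of the maximum
    release_start = (above[-1] + 1) * frame / sr
    return attack_end, release_start

sr = 44100
t = np.arange(0, 0.5, 1 / sr)
tone = np.exp(-6 * t) * np.sin(2 * np.pi * 196 * t)        # a decaying synthetic tone
print(attack_release(tone, sr))
```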
Pitch Detection. Our second task is to analyze the sound fragment between two onsets. Since we know the onset values of the plucking hand, what we require is another peak detection algorithm with a lower threshold in order to capture the changes in fundamental frequency. Specifically, if the fundamental frequency is not constant between two onsets, we consider that the possibility of the existence of an expressive articulation is high.
In the pitch detection module, i.e., to extract onsets and their corresponding fundamental frequencies, we used aubionotes. In the Aubio library, both the onset detection and the fundamental frequency estimation algorithms can be chosen from several alternatives. For onset detection, this time we need a more sensitive algorithm than the one used for right-hand onset detection. Thus, we used the complex domain algorithm [8] to determine the peaks and YIN [6] for the fundamental frequency estimation. Complex domain onset detection is based on a combination of a phase approach and an energy-based approach.
We used 2048 bins as our window size, 512 bins as our hop size, 1 as our peak-picking threshold and 95 dB as our silence threshold. With these parameters we obtained an output like the one shown in Figure 5. As shown in the figure, the first results were not as we expected. Specifically, they were noisier than expected. There were noisy parts, especially at the beginning of the notes, which generated
false-positive peaks. For instance, in Figure 5, many false-positive note onsets are detected in the interval from 0 to 0.2 seconds.
A careful analysis of the results demonstrated that the false-positive peaks were located in the region of the notes' frequency borders. Therefore, we propose a lightweight solution for the problem: to apply a chroma filtering to the regions that are at the borders of the complex domain peaks. As shown in Figure 6, after applying the chroma conversion, the results are drastically improved.
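A minimal sketch of such a chroma conversion is given below, assuming a 440 Hz reference and equal temperament; it maps each f0 estimate to a pitch class so that jitter around a note's frequency border no longer produces spurious note changes.

```python
# Sketch: map f0 estimates to pitch classes (chroma); -1 marks unvoiced bins.
import numpy as np

def to_chroma(f0, fref=440.0):
    f0 = np.asarray(f0, dtype=float)
    chroma = np.full(f0.shape, -1)                         # -1 marks unvoiced / silent bins
    voiced = f0 > 0
    midi = np.round(69 + 12 * np.log2(f0[voiced] / fref))  # nearest equal-tempered note
    chroma[voiced] = midi.astype(int) % 12
    return chroma

# A slightly noisy A3 (220 Hz) track stays on the same pitch class after conversion.
track = np.array([219.2, 220.4, 221.0, 0.0, 246.9, 247.3])
print(to_chroma(track))                                    # -> [ 9  9  9 -1 11 11]
```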
Next, we analyzed the fragments between two onsets based on the segments provided by the plucking detection module. Specifically, we analyzed the sound fragment between the attack ending point and the release starting point (because the noisiest part of a signal is the attack part, and the release part of a signal contains unnecessary information for pitch detection [7]). Therefore, for our analysis we take the fragment between the attack and release parts, where the pitch information is relatively constant.
Figure 7 shows fundamental frequency values and right-hand onsets. The x-axis represents the time domain bins and the y-axis represents the frequency. In Figure 7, the vertical lines depict the attack and release parts respectively. In the middle there is a change in frequency, which was not determined as an onset by the first module. Although it seems like an error, it is a successful result for our model. Specifically, in this phrase there is a glissando, which is a left-hand articulation; it was not identified as an onset by the plucking detection module (HFC algorithm), but it was identified by the pitch detection module (complex domain algorithm). The output of the pitch detection module for this recording is shown in Table 1.
Analysis and Annotation. After obtaining the results from the plucking detection and pitch detection modules, the goal of the analysis and annotation module is to determine the candidates for expressive articulations. Specifically, from the results of the pitch detection module, we analyze the differences of fundamental frequencies in the segments between the attack and release parts (provided by the plucking detection module). For instance, in Table 1 the light gray values represent the attack and release parts, which we did not take into account while applying our decision algorithm.
The differences of fundamental frequencies are calculated by subtracting from each bin its preceding bin. Thus, when the fragment we are examining is a non-articulated fragment, this operation returns 0 for all bins. On the other hand, in expressively articulated fragments some peaks will arise (see Figure 8 for an example).
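The decision step can be sketched as follows; the 5-bin consecutiveness threshold is quoted in the text, while the peak threshold on the frequency differences is an assumed placeholder.

```python
# Sketch: first-order f0 differences between attack end and release start; nearby
# difference peaks are merged into a single candidate transition.
import numpy as np

def articulation_peaks(f0, peak_thr=5.0, consecutive=5):
    diffs = np.abs(np.diff(f0))                   # 0 everywhere for a non-articulated fragment
    peaks = np.nonzero(diffs > peak_thr)[0]
    groups = []
    for p in peaks:
        if groups and p - groups[-1][-1] <= consecutive:
            groups[-1].append(p)                  # close peaks count as a single transition
        else:
            groups.append([p])
    return groups

f0 = np.array([196.0] * 20 + [195.5, 207.0, 220.1] + [220.0] * 20)   # a glissando-like jump
print(articulation_peaks(f0))                     # one group of peaks -> one candidate
```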
In Figure 8 there is only one peak, but in other recordings several consecutive peaks may arise. The explanation is that the left hand also causes an onset, i.e., it also generates a transient part. As a result of this transient, more than one change in fundamental frequency may be present. If those changes or peaks are close to each other, we consider them as a single peak.
We define this closeness with a pre-determined consecutiveness threshold. Specifically, if the maximum distance between these peaks is 5 bins, we
3.2 Classification
The classification module analyzes the regions identified by the extraction module and labels them as legato or glissando. A diagram of the classification module is shown in Figure 9. In this section, we first describe our research to select the appropriate descriptor to analyze the behavior of legato and glissando. Then, we explain the two new components, Models Builder and Detection.
Selecting a Descriptor. After extracting the regions which contain candidates for expressive articulations, the next step was to analyze them. Because different expressive articulations (legato vs. glissando) should present different characteristics in terms of changes in amplitude, aperiodicity, or pitch [22], we focused the analysis on comparing these deviations.
Specifically, we built representations of these three features (amplitude, aperiodicity, and pitch). The representations helped us to compare different data with different lengths and densities. As we stated above, we are mostly interested in changes: changes in High Frequency Content, changes in fundamental frequency, changes in amplitude, etc. Therefore, we explored the peaks in the examined data, because peaks are the points where changes occur.
As an example, Figure 10 shows, from top to bottom, the amplitude evolution, the pitch evolution, and the changes in aperiodicity for both legato and glissando. As both examples show, the changes in pitch are similar for glissando and legato. However, the changes in amplitude and aperiodicity present a characteristic slope.
Thus, as a first step we concentrated on determining which descriptor could be used. To make this decision, we built models for both aperiodicity and
Fig. 10. From top to bottom, representations of amplitude, pitch and aperiodicity of the examined regions
amplitude by using a set of training data. As a result, we obtained two models (for amplitude and aperiodicity) for both legato and glissando, as shown in Figure 11a and Figure 11b. Analyzing the results, amplitude is not a good candidate because the models behave similarly. In contrast, the aperiodicity models present a different behavior. Therefore, we selected aperiodicity as the descriptor. The details of this model construction are explained in the Building the Models section.
Preprocessing. Before analyzing and testing our recordings, we applied two different preprocessing techniques to the data in order to make it smoother and ready for comparison: smoothing and envelope approximation.
1. Smoothing. As expected, the aperiodicity portion of the audio file that we are examining includes noise. Our first concern was to avoid this noise and obtain a nicer representation. In order to do that, we first applied a 50-step running median smoothing. Running median smoothing is also known as median filtering. Median filtering is widely used in digital image processing
10000 to 1000) and round them, so they become 146, 146, 147, 150 and 150. As shown, we have 2 peaks at 146 and at 150. In order to fix this duplicity, we choose the one with the highest peak value. After collecting and scaling the peak positions, the peaks are linearly connected. As shown in Figure 13, the obtained graph is an approximation of the graph shown in Figure 12b. The linear approximation helps the system to avoid consecutive small tips and dips.
In our case all the recordings were performed at 60 bpm and all the notes in the recordings are 8th notes. That is, each note lasts half a second, and each legato or glissando portion lasts 1 second. We recorded with a sampling rate of 44100 Hz, and we did our analysis using a hop size of 32, i.e., 44100/32 ≈ 1378 bins per second. We knew that this was our upper limit. For the sake of simplicity, we scaled our x-axis to 1000 bins.
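The two preprocessing steps and the rescaling can be sketched as follows; the odd median-filter kernel (51 instead of 50), the peak-picking rule, and the linear interpolation are simplifications of the procedure described in the text.

```python
# Sketch of the preprocessing: running median smoothing, peak-based linear envelope
# approximation, and rescaling of the curve to 1000 bins.
import numpy as np
from scipy.signal import medfilt
from scipy.interpolate import interp1d

def preprocess(aperiodicity, out_len=1000, kernel=51, peak_thr=0.1):
    smooth = medfilt(np.asarray(aperiodicity, dtype=float), kernel_size=kernel)
    # Keep local maxima above a fraction of the global maximum and connect them linearly.
    peaks = [i for i in range(1, len(smooth) - 1)
             if smooth[i - 1] < smooth[i] >= smooth[i + 1]
             and smooth[i] >= peak_thr * smooth.max()]
    xs = [0] + peaks + [len(smooth) - 1]
    ys = [smooth[0]] + [smooth[i] for i in peaks] + [smooth[-1]]
    envelope = interp1d(xs, ys)(np.linspace(0, len(smooth) - 1, out_len))
    return envelope

noisy = np.abs(np.random.default_rng(2).normal(size=1378)) * np.hanning(1378)
print(preprocess(noisy).shape)                    # -> (1000,)
```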
Building the Models. After applying the preprocessing techniques, we obtained equal-length aperiodicity representations of all our expressive articulation portions. The next step was to construct models for both legato and glissando using this data. In this section we describe how we constructed the models shown in Figure 11a and Figure 11b. The following steps were used to construct the models: histogram calculation, smoothing and envelope approximation (explained in the Preprocessing section), and finally, SAX representation. In this section we present the histogram calculation and the SAX representation techniques.
1. Histogram Calculation. Another method that we use is histogram envelope calculation. We use this technique to calculate the peak density of a set of data. Specifically, a set of recordings containing 36 legato and 36 glissando examples (recorded by a professional classical guitarist) was used as the training set. First, for each legato and glissando example, we determined the peaks. Since we want to model the places where condensed peaks occur, this time we used a threshold of 30 percent and collected the peaks with amplitude values above this threshold. Notice that this threshold is different from the one used in the envelope approximation. Then, we used histograms to compute the density of the peak locations. Figure 14 shows the resulting histograms. After constructing the histograms, as shown in Figure 14, we used our envelope approximation method to construct the envelopes of the legato and glissando histogram models (see Figure 15).

Fig. 15. Final envelope approximation of peak histograms of legato and glissando training sets
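A sketch of the histogram step is given below; the number of histogram bins is an assumption, while the 30 percent peak threshold and the 36 training examples per class follow the text.

```python
# Sketch: collect, over all training examples of one class, the positions of peaks
# above 30% of each curve's maximum, and build a density histogram over the time axis.
import numpy as np

def peak_positions(curve, thr=0.30):
    c = np.asarray(curve)
    return [i for i in range(1, len(c) - 1)
            if c[i - 1] < c[i] >= c[i + 1] and c[i] >= thr * c.max()]

def peak_histogram(curves, n_bins=50, length=1000):
    positions = np.concatenate([peak_positions(c) for c in curves])
    hist, _ = np.histogram(positions, bins=n_bins, range=(0, length))
    return hist

rng = np.random.default_rng(3)
training = [np.abs(rng.normal(size=1000)) * np.hanning(1000) for _ in range(36)]
print(peak_histogram(training))
```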
2. SAX: Symbolic Aggregate Approximation. Although the histogram envelope approximations of legato and glissando in Figure 15 are close to our purposes, they still include noisy sections. Rather than these abrupt changes (noise), we are interested in a more general representation reflecting the changes more smoothly. SAX (Symbolic Aggregate Approximation) [18] is a symbolic representation used in time series analysis that provides a dimensionality reduction while preserving the properties of the curves. Moreover, the SAX representation makes distance measurements easier. We therefore applied the SAX representation to the histogram envelope approximations.
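A hand-rolled SAX sketch is shown below (z-normalization, piecewise aggregate approximation, Gaussian breakpoints); with 1000-bin curves, a SAX step size of 5 as used later in the experiments would correspond to 200 segments, but the segment and alphabet sizes here are only illustrative.

```python
# Minimal SAX sketch: z-normalize, reduce by piecewise aggregate approximation, and
# map each segment mean to a letter using equiprobable Gaussian breakpoints.
import numpy as np
from scipy.stats import norm

def sax(curve, n_segments=200, alphabet=4):
    x = np.asarray(curve, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)                         # z-normalization
    paa = x[: len(x) // n_segments * n_segments]
    paa = paa.reshape(n_segments, -1).mean(axis=1)                 # piecewise aggregate approx.
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet + 1)[1:-1])  # equiprobable regions
    letters = np.digitize(paa, breakpoints)
    return "".join(chr(ord("a") + int(s)) for s in letters)

envelope = np.hanning(1000)
print(sax(envelope, n_segments=20))
```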
Experiments
The goal of the experiments was to test the performance of our model. Since different modules have been designed, and they work independently of each other, we tested the Extraction and Classification modules separately. After these separate studies, we combined the results to assess the overall performance of the proposed system.
As explained in Section 1, legato and glissando can be played on ascending or descending intervals. Thus, we were interested in studying the results distinguishing between these two movements. Additionally, since a guitar has three nylon strings and three metallic strings, we also studied the results taking into account these two sets of strings.
4.1 Recordings
Borrowing from Carlevaro's guitar exercises [5], we recorded a collection of ascending and descending chromatic scales. Legato and glissando examples were recorded by a professional classical guitar performer. The performer was asked to play chromatic scales in three different regions of the guitar fretboard. Specifically, we recorded notes from the first 12 frets of the fretboard, where each recording concentrated on 4 specific frets. The basic exercise for the first fretboard region is shown in Figure 19.
Fig. 20. The short phrases used in the experiments: (a) Phrase 1, (b) Phrase 2, (c) Phrase 3, (d) Phrase 4, (e) Phrase 5.
Each scale contains 24 ascending and 24 descending notes. Each exercise contains 12 expressive articulations (the ones connected with an arch). Since we repeated the exercise at three different positions, we obtained 36 legato and 36 glissando examples. Notice that we also performed recordings with a neutral articulation (neither legatos nor glissandos). We presented all 72 examples to our system.
As a preliminary test with more realistic recordings, we also recorded a small set of 5-6 note phrases. They include different articulations in random places (see Figure 20). As shown in Table 3, each phrase includes different combinations of expressive articulations, varying from 0 to 2. For instance, Phrase 3 (see Figure 20c) does not have any expressive articulation, whereas Phrase 4 (see Figure 20d) contains the same notes as Phrase 3 but includes two expressive articulations: first a legato and then an appoggiatura.
4.2
First, we analyzed the accuracy of the extraction module in identifying regions with legatos. The hypothesis was that legatos are the easiest articulations to detect because they are composed of two long notes. Next, we analyzed the accuracy in identifying regions with glissandos. Because in this situation the first note (the glissando) has a short duration, it may be confused with the attack.
Scales. We first applied our system to single expressive and non-expressive articulations. All the recordings were hand labeled; they were also our ground truth.
Table 2. Performance of extraction module applied to single articulations

Recordings              Nylon String   Metallic String
Non-expressive          90%            90%
Ascending Legatos       80%            90%
Descending Legatos      90%            70%
Ascending Glissandos    70%            70%
Descending Glissandos   70%            70%
We compared the output results with the annotations. The output was the number of detected expressive articulations in the sound fragment.
Analyzing the experiments (see Table 2), different conclusions can be extracted. First, as expected, legatos are easier to detect than glissandos. Second, on non-steel strings the melodic direction does not cause a difference in performance. Regarding steel strings, descending legatos are more difficult to detect than ascending legatos (90% versus 70%). This result is not surprising, because the plucking action of left-hand fingers in descending legatos is somewhat similar to a right-hand plucking. However, this difference does not appear in glissandos because the finger movement is the same.
Short melodies. We tested the performance of the extraction module on the recordings of short melodies with the same settings used with the scales, except for the release threshold. Specifically, since in the short phrase recordings the transition parts between two notes have more noise, the average value of the amplitude between two onsets was higher. Because of that, the release threshold in a more realistic scenario has to be increased. Specifically, after some experiments, we fixed the release threshold at 30%.
Analyzing the results, the performance of our model was similar to the previous experiments, i.e., when we analyzed single articulations. However, in two phrases where a note was played with a soft right-hand plucking, these notes were proposed as legato candidates (Phrase 1 and Phrase 4).
The final step of the extraction module is to annotate the sound fragments where a possible attack articulation (legato or glissando) is detected. Specifically, to help the system's validation, the whole recording is presented to the user and the candidate fragments for expressive articulations are colored. As an example, Figure 21 shows the annotation of Phrase 2 (see the score in Figure 20b). Phrase 2 has two expressive articulations, which correspond to the portions colored in black.
4.3
After testing the Extraction module, we used the same audio files (this time using only the legato and glissando examples) to test our Classification module. As explained in Section 3.2, we performed experiments applying different step sizes for the SAX representation. Specifically (see the results reported in Table 4), we may observe that a step size of 5 is the most appropriate setting. This result corroborates that a higher resolution when discretizing is not required and demonstrates that the SAX representation provides a powerful technique to summarize the information about changes.
The overall performance for legato identification was 83.3% and the overall performance for glissando identification was 80.5%. Notice that the identification of ascending legato reached 100% accuracy whereas descending legato achieved only 66.6%. Regarding glissando, there was no significant difference between ascending and descending accuracy (83.3% versus 77.7%). Finally, analyzing the results when considering the string type, the results presented a similar accuracy on both nylon and metallic strings.
4.4
After testing the main modules separately, we studied the performance of the whole system using the same recordings. Since, in our previous experiments, a step size of 5 gave the best analysis results, we ran these experiments only with a step size of 5.
Since we had errors both in the extraction module and in the classification module, the combined results presented a lower accuracy (see the results in Table 5).
Conclusions
In this paper we presented a system that combines several state-of-the-art analysis algorithms to identify left-hand articulations such as legatos and glissandos. Specifically, our proposal uses HFC for plucking detection and the Complex Domain and YIN algorithms for pitch detection. Then, combining the data coming from these different sources, we developed a first decision mechanism, the Extraction module, to identify regions where attack articulations may be present. Next, the Classification module analyzes the regions annotated by the Extraction module and tries to determine the articulation type. Our proposal is to use aperiodicity
Acknowledgments
This work was partially funded by NEXT-CBR (TIN2009-13692-C03-01), IL4LTS (CSIC-200450E557) and by the Generalitat de Catalunya under the grant 2009 SGR-1434. Tan Hakan Ozaslan is a PhD student of the Doctoral Program in Information, Communication, and Audiovisual Technologies of the Universitat Pompeu Fabra. We also want to thank the professional guitarist Mehmet Ali Yıldırım for his contribution with the recordings.
References
1. Arcos, J.L., López de Mántaras, R., Serra, X.: SaxEx: a case-based reasoning system for generating expressive musical performances. Journal of New Music Research 27(3), 194–210 (1998)
2. Brossier, P.: Automatic annotation of musical audio for interactive systems. Ph.D. thesis, Centre for Digital Music, Queen Mary University of London (2006)
3. Brossier, P., Bello, J.P., Plumbley, M.D.: Real-time temporal segmentation of note objects in music signals. In: Proceedings of the International Computer Music Conference, ICMC 2004 (November 2004)
4. Burns, A., Wanderley, M.: Visual methods for the retrieval of guitarist fingering. In: NIME 2006: Proceedings of the 2006 Conference on New Interfaces for Musical Expression, Paris, pp. 196–199 (June 2006)
5. Carlevaro, A.: Serie didactica para guitarra, vol. 4. Barry Editorial (1974)
6. de Cheveigné, A., Kawahara, H.: YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America 111(4), 1917–1930 (2002)
7. Dodge, C., Jerse, T.A.: Computer Music: Synthesis, Composition, and Performance. Macmillan Library Reference (1985)
8. Duxbury, C., Bello, J., Davies, J., Sandler, M.: Complex domain onset detection for musical signals. In: Proceedings of the Digital Audio Effects Workshop (2003)
9. Erkut, C., Valimaki, V., Karjalainen, M., Laurson, M.: Extraction of physical and expressive parameters for model-based sound synthesis of the classical guitar. In: 108th AES Convention, pp. 19–22 (February 2000)
10. Gabrielsson, A.: Once again: The theme from Mozart's piano sonata in A major (K. 331). A comparison of five performances. In: Gabrielsson, A. (ed.) Action and Perception in Rhythm and Music, pp. 81–103. Royal Swedish Academy of Music, Stockholm (1987)
11. Gabrielsson, A.: Expressive intention and performance. In: Steinberg, R. (ed.) Music and the Mind Machine, pp. 35–47. Springer, Berlin (1995)
12. Gouyon, F., Herrera, P., Gómez, E., Cano, P., Bonada, J., Loscos, A., Amatriain, X., Serra, X.: Content Processing of Music Audio Signals, pp. 83–160. Logos Verlag, Berlin (2008), http://smcnetwork.org/public/S2S2BOOK1.pdf
13. Grachten, M., Arcos, J., de Mántaras, R.L.: A case based approach to expressivity-aware tempo transformation. Machine Learning 65(2-3), 411–437 (2006)
14. Heijink, H., Meulenbroek, R.G.J.: On the complexity of classical guitar playing: functional adaptations to task constraints. Journal of Motor Behavior 34(4), 339–351 (2002)
15. Johnstone, J.A.: Phrasing in Piano Playing. Withmark, New York (1913)
16. Juslin, P.: Communicating emotion in music performance: a review and a theoretical framework. In: Juslin, P., Sloboda, J. (eds.) Music and Emotion: Theory and Research, pp. 309–337. Oxford University Press, New York (2001)
17. Lee, N., Zhiyao, D., Smith, J.O.: Excitation signal extraction for guitar tones. In: International Computer Music Conference, ICMC 2007 (2007)
18. Lin, J., Keogh, E., Wei, L., Lonardi, S.: Experiencing SAX: a novel symbolic representation of time series. Data Mining and Knowledge Discovery 15(2), 107–144 (2007)
19. Lindström, E.: 5 x "Oh, my darling Clementine": the influence of expressive intention on music performance (1992). Department of Psychology, Uppsala University
20. de Mantaras, R.L., Arcos, J.L.: AI and music: from composition to expressive performance. AI Mag. 23(3), 43–57 (2002)
21. Masri, P.: Computer Modeling of Sound for Transformation and Synthesis of Musical Signal. Ph.D. thesis, University of Bristol (1996)
22. Norton, J.: Motion capture to build a foundation for a computer-controlled instrument by study of classical guitar performance. Ph.D. thesis, Stanford University (September 2008)
23. Palmer, C.: Anatomy of a performance: Sources of musical expression. Music Perception 13(3), 433–453 (1996)
24. Radicioni, D.P., Lombardo, V.: A constraint-based approach for annotating music scores with gestural information. Constraints 12(4), 405–428 (2007)
25. Radisavljevic, A., Driessen, P.: Path difference learning for guitar fingering problem. In: International Computer Music Conference (ICMC 2004) (2004)
26. Sloboda, J.A.: The communication of musical metre in piano performance. Quarterly Journal of Experimental Psychology 35A, 377–396 (1983)
27. Trajano, E., Dahia, M., Santana, H., Ramalho, G.: Automatic discovery of right hand fingering in guitar accompaniment. In: Proceedings of the International Computer Music Conference (ICMC 2004), pp. 722–725 (2004)
28. Traube, C., Depalle, P.: Extraction of the excitation point location on a string using weighted least-square estimation of a comb filter delay. In: Procs. of the 6th International Conference on Digital Audio Effects, DAFx 2003 (2003)
Introduction
In the last decades Music Information Retrieval (MIR) has evolved into a broad research area that aims at making large repositories of digital music maintainable and accessible. Within MIR research two main directions can be discerned: symbolic music retrieval and the retrieval of musical audio. The first direction traditionally uses score-based representations to research typical retrieval problems. One of the most important and most intensively studied of these is probably the problem of determining the similarity of a specific musical feature, e.g., melody, rhythm, etc. The second direction, musical audio retrieval, extracts features from the audio signal and uses these features for estimating whether two pieces of music share certain musical properties. In this paper we focus on a
the MIR community, e.g. [20,5]. Chord labeling algorithms extract symbolic
chord labels from musical audio: these labels can be matched directly using the
algorithms covered in this paper.
If you were to ask a jazz musician the third question, whether sequences of chord descriptions are useful, he would probably agree that they are,
since working with chord descriptions is everyday practice in jazz. However, we
will show in this paper, by means of a large experiment, that they are also useful for retrieving pieces with a similar but not identical chord sequence. In this
experiment we compare two harmonic similarity measures, the Tonal Pitch Step
Distance (TPSD) [11] and the Chord Sequence Alignment System (CSAS) [12],
and test the influence of different degrees of detail in the chord description and
the knowledge of the global key of a piece on retrieval performance.
The next section gives a brief overview of the current achievements in chord
sequence similarity matching and harmonic similarity in general; Section 3 describes the data used in the experiment and Section 4 presents the results.
Contribution. This paper presents an overview of chord-sequence-based harmonic similarity. Two harmonic similarity approaches are compared in an experiment. For this experiment a new large corpus of 5,028 chord sequences was
assembled. Six retrieval tasks are defined for this corpus, to which both algorithms are subjected. All tasks use the same dataset, but differ in the amount of
chord description detail and in the use of a priori key information. The results
show that a computationally costly alignment approach significantly outperforms
a much faster geometrical approach in most cases, that a priori key information
boosts retrieval performance, and that using a triadic chord representation yields
significantly better results than a simpler or more complex chord representation.
Fig. 1. A plot demonstrating the comparison of two similar versions of All the Things
You Are using the TPSD. The total area between the two step functions, normalized
by the duration of the shortest song, represents the distance between both songs. A
minimal area is obtained by shifting one of the step functions cyclically.
different from the one used in [11]. The harmony grammar approach could, at
the time of writing, not compete in this experiment because in its current state
it is as yet unable to parse all the songs in the used dataset.
The next section introduces the TPSD and the improvements of the implementation used in the experiment here over the implementation in [11]. Section 2.2 highlights the different variants of the CSAS. The main focus of this
paper is on the similarity of sequences of chord labels, but there exist other
relevant harmony-based retrieval methods: some of these are briefly reviewed in
Section 2.3.
2.1 Tonal Pitch Step Distance
The TPSD uses Lerdahl's [17] Tonal Pitch Space (TPS) as its main musical
model. TPS is a model of tonality that fits musicological intuitions, correlates
well with empirical findings from music cognition [16] and can be used to calculate a distance between two arbitrary chords. The TPS model can be seen as a
scoring mechanism that takes into account the number of steps on the circle of
fifths between the roots of the chords, and the amount of overlap between the
chord structures of the two chords and their relation to the global key.
The general idea behind the TPSD is to use the TPS to compare the change
of chordal distance to the tonic over time. For every chord the TPS distance
between the chord and the key of the sequence is calculated, which results in
a step function (see Figure 1). As a consequence, information about the key
of the piece is essential. Next, the distance between two chord sequences is defined as the minimal area between the two step functions over all possible horizontal circular shifts. To prevent longer sequences from yielding larger distances,
the score is normalized by dividing it by the duration of the shortest song.
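To make the procedure concrete, the sketch below computes this distance for two beat-sampled step functions (one TPS-to-tonic value per beat). It is only an illustration of the idea described above, assuming equal beat resolution and wrapping the shorter function around the longer one; the authors' implementation works on chord segments and runs in O(nm).

```python
# Illustrative TPSD-style distance between two beat-sampled step functions:
# the minimal area between them over all cyclic shifts, normalized by the
# duration of the shorter song.  Brute force; the paper's implementation is
# an optimized O(nm) algorithm over chord segments instead.

def tpsd_distance(f, g):
    if len(f) > len(g):
        f, g = g, f                     # f is now the shorter step function
    n, m = len(f), len(g)
    best = float("inf")
    for shift in range(m):              # every cyclic shift of the longer one
        area = sum(abs(f[i % n] - g[(i + shift) % m]) for i in range(m))
        best = min(best, area)
    return best / n                     # normalize by the shorter duration

# toy example: two versions of the same progression, one cyclically shifted
version_a = [0, 0, 5, 5, 7, 7, 0, 0]
version_b = [7, 7, 0, 0, 0, 0, 5, 5]
print(tpsd_distance(version_a, version_b))   # 0.0: identical up to a shift
```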
The TPS is an elaborate model that allows one to compare every arbitrary chord in
an arbitrary key to every other possible chord in any key. The TPSD does not
use the complete model and only utilizes the parts that facilitate the comparison
of two chords within the same key. In the current implementation of the TPSD
time is represented in beats, but generally any discrete representation could be
used.
The TPSD version used in this paper contains a few improvements compared
to the version used in [11]: by applying a different step function matching algorithm from [4], and by exploiting the fact that we use discrete time units
that enable us to sort in linear time using counting sort [6], a running time of
O(nm) is achieved, where n and m are the number of chord symbols in both
songs. Furthermore, to be able to use the TPSD in situations where a priori key
information is not available, the TPSD is extended with a key finding algorithm.
Key finding. The problem of finding the global key of a piece of music is called
key finding. For this study, this is done on the basis of chord information only.
The rationale behind the key finding algorithm that we present here is the following: we consider the key that minimizes the total TPS distance and best
matches the starting and ending chord to be the key of the piece.
For minimizing the total TPS distance, the TPSD key finding uses TPS-based
step functions as well. We assume that when a song matches a particular key, the
TPS distances between the chords and the tonic of the key are relatively small.
The general idea is to calculate 24 step functions for a single chord sequence, one
for each major and minor key. Subsequently, all these keys are ranked by sorting
them on the area between their TPS step function and the x-axis; the smaller
the total area, the better this key fits the piece, and the higher the rank. Often,
the key at the top rank is the correct key. However, among the false positives at
rank one, unsurprisingly, the IV, V and VI relative to the ground-truth key¹
are found regularly. This makes sense because, when the total of TPS distances
of the chords to C is small, the distances to F, G and Am might be small as well.
Therefore, to increase performance, an additional scoring mechanism is designed
that takes into account the IV, V and VI relative to the ground-truth key. Of
all 24 keys, the candidate key that minimizes the following sum S is considered
the key of the piece.
S = α · r(I) + r(IV) + r(V) + r(VI) + β · m
Here r(·) denotes the rank of the candidate key, the parameter α determines
how important the tonic is compared to other frequently occurring scale degrees,
and β controls the importance of the key matching the first and last chord,
expressed by the term m. The parameters α and β were tuned by hand, and an α of 2 and a β of 4 were
found to give good results. Clearly, this simple key-finding algorithm is biased
1 The Roman numerals here represent the diatonic interval between the key in the
ground-truth and the predicted key.
towards western diatonic music, but for the corpus used in this paper it performs
quite well. The algorithm scores 88.8 percent correct on a subset of 500 songs
of the corpus used in the experiment below for which we manually checked the
correctness of the ground-truth key. The above algorithm takes O(n) time, where
n is the number of chord symbols, because the number of keys is constant.
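A rough sketch of this ranking procedure is given below. It assumes a chord-to-key distance function (for instance a TPS-style distance) and chord objects with a root pitch class and a duration; how the first/last-chord match is scored and how the IV, V and VI keys are chosen (major, major, minor, as in the F/G/Am example) are assumptions on our part, not necessarily the authors' exact formulation.

```python
# Sketch of chord-based key finding: rank all 24 keys by the area between
# their TPS step function and the x-axis, then pick the key minimizing a
# weighted combination of the ranks of I, IV, V and VI plus a penalty for
# not matching the first and last chord.  Marked details are assumptions.

MAJOR, MINOR = "maj", "min"
ALL_KEYS = [(pc, mode) for pc in range(12) for mode in (MAJOR, MINOR)]

def key_area(chords, key, chord_key_distance):
    """Total area between the step function for `key` and the x-axis."""
    return sum(chord_key_distance(c, key) * c.duration for c in chords)

def find_key(chords, chord_key_distance, alpha=2, beta=4):
    ranked = sorted(ALL_KEYS, key=lambda k: key_area(chords, k, chord_key_distance))
    rank = {k: i for i, k in enumerate(ranked)}       # rank 0 = smallest area

    def score(key):
        pc, _ = key
        related = [((pc + 5) % 12, MAJOR),            # IV (F relative to C)
                   ((pc + 7) % 12, MAJOR),            # V  (G relative to C)
                   ((pc + 9) % 12, MINOR)]            # VI (Am relative to C)
        s = alpha * rank[key] + sum(rank[k] for k in related)
        # assumed term: penalize a tonic matching neither the first
        # nor the last chord root
        mismatches = (chords[0].root != pc) + (chords[-1].root != pc)
        return s + beta * mismatches

    return min(ALL_KEYS, key=score)
```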
Root interval step functions. For the tasks where only the chord root is
used, we use a different step function representation (see Section 4). In these
tasks the interval between the chord root and the root note of the key defines
the step height, and the duration of the chord again defines the step length. This
matching method is very similar to the melody matching approach by Aloupis
et al. [2]. Note that the latter was never tested in practice. The matching and
key finding methods are not different from the other variants of the TPSD. Note
that in all TPSD variants, chord inversions are ignored.
2.2 Chord Sequence Alignment System
The CSAS algorithm is based on local alignment and computes similarity scores
between sequences of symbols representing chords or distances between chords
and key. String matching techniques can be used to quantify the differences between two such sequences. Among several existing methods, Smith and Waterman's approach [25] detects similar areas in two sequences of arbitrary symbols.
This local alignment or local similarity algorithm locates and extracts a pair of
regions, one from each of the two given strings, that exhibit high similarity. A
similarity score is calculated by performing elementary operations transforming
the one string into the other. The operations used to transform the sequences are
deletion, insertion or substitution of a symbol. The total transformation from the
one string into the other can be computed with dynamic programming in quadratic
time.
The following example illustrates local alignment by computing a distance
between the first chords of two variants of the song All The Things You Are
considering only the root notes of the chords. The I, S and D denote Insertion,
Substitution, and Deletion of a symbol, respectively. An M represents a matching
symbol:
string 1     -   F   B   E   A   -   D   D   G   C   C
string 2     F   F   B   A   A   A   D   D   -   C   C
operation    I   M   M   S   M   I   M   M   D   M   M
score       -1  +2  +2  -2  +2  -1  +2  +2  -1  +2  +2
Algorithms based on local alignment have been successfully adapted for melodic
similarity [21,13,15] and recently they have been used to determine harmonic similarity [12] as well. Two steps are necessary to apply the alignment technique to the
comparison of chord progressions: the choice of the representation of a chord sequence, and the scores of the elementary operations between symbols. To take the
durations of the chords into account, we represent the chords at every beat. The
algorithm therefore has a complexity of O(nm), where n and m are the sizes of
the compared songs in beats. The score function can either be adapted to the chosen representation or can simply be binary, i.e. the score is positive (+2) if the
two chords described are identical, and negative (-2) otherwise. The insertion or
deletion score is set to -1.
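The following sketch illustrates this scoring scheme with a plain Smith-Waterman local alignment over beat-expanded chord symbols (+2 for identical symbols, -2 for a substitution, -1 for insertions and deletions). It returns only the best local similarity score, not the aligned regions, and it is not the authors' CSAS implementation.

```python
# Plain Smith-Waterman local alignment over beat-expanded chord symbols,
# using the binary scoring described above: +2 match, -2 substitution,
# -1 insertion/deletion.  Returns the best local similarity score only.

def local_alignment_score(seq_a, seq_b, match=2, mismatch=-2, gap=-1):
    prev = [0] * (len(seq_b) + 1)
    best = 0
    for a in seq_a:
        curr = [0] * (len(seq_b) + 1)
        for j, b in enumerate(seq_b, start=1):
            sub = match if a == b else mismatch
            curr[j] = max(0,                   # local alignment never drops below 0
                          prev[j - 1] + sub,   # match / substitution
                          prev[j] + gap,       # deletion
                          curr[j - 1] + gap)   # insertion
            best = max(best, curr[j])
        prev = curr
    return best

# beat-expanded root sequences of two hypothetical variants (4 beats per bar)
variant_1 = ["F"] * 4 + ["Bb"] * 4 + ["Eb"] * 4 + ["Ab"] * 4
variant_2 = ["F"] * 4 + ["Bb"] * 2 + ["Eb"] * 6 + ["Ab"] * 4
print(local_alignment_score(variant_1, variant_2))
```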
Absolute representation. One way of representing a chord sequence is to
simply represent the chord progression as a sequence of absolute root notes; in
that case prior knowledge of the key is not required. An absolute representation
of the chord progression of the first 8 bars of the song All The Things You Are
is then:
F, Bb, Eb, Ab, Db, D, G, C, C
In this case, the substitution scores may be determined by considering the difference in semitones, the number of steps on the circle of fifths between the roots, or
by the consonance of the interval between the roots, as described in [13]. For instance, the cost of substituting a C with a G (fifth) is lower than the substitution
of a C with a D (second). Taking into account the mode in the representation
can affect the score function as well: a substitution of a C for a Dm is different
from a substitution of a C for a D, for example. If the two modes are identical,
one may slightly increase the similarity score, and decrease it otherwise. Another
possible representation of the chord progression is a sequence of absolute pitch
sets. In that case one can use musical distances between chords, like Lerdahl's
TPS model [17] or the distance introduced by Paiement et al. [22], as a score
function for substitution.
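As an illustration of such a root-based substitution score, the sketch below scores chord roots (given as pitch classes) by their distance on the circle of fifths; the particular linear mapping to score values is an assumption for illustration, not the exact function from [13].

```python
# Illustrative substitution score for chord roots given as pitch classes 0-11:
# roots that are close on the circle of fifths substitute more cheaply.
# The linear mapping used here is an assumption, not the score from [13].

def fifths_substitution_score(pc_a: int, pc_b: int) -> float:
    if pc_a == pc_b:
        return 2.0                                # identical roots
    steps = (7 * (pc_b - pc_a)) % 12              # steps on the circle of fifths
    steps = min(steps, 12 - steps)                # symmetric distance, 1..6
    return 1.0 - steps                            # fifth: 0.0, second: -1.0, ...

print(fifths_substitution_score(0, 7))   # C -> G (fifth):  0.0
print(fifths_substitution_score(0, 2))   # C -> D (second): -1.0
```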
Key-relative representation. If key information is known beforehand, a chord
can be represented as a distance to this key. The distance can be expressed in
various ways: in semitones, or as the number of steps on the circle of fifths
between the root of the chord and the tonic of the key of the song, or with more
complex musical models, such as TPS. If in this case the key is Ab and the chord
is represented by the difference in semitones, the representation of the chord
progression of the first eight bars of the song All The Things You Are will be:
3, 2, 5, 0, 5, 6, 1, 4, 4
If all the notes of the chords are taken into account, the TPS or Paiement
distances can be used between the chords and the triad of the key to construct
the representation. The representation is then a sequence of distances, and we use
an alignment between these distances instead of between the chords themselves.
This representation is very similar to the representation used in the TPSD. The
score functions used to compare the resulting sequences can then be binary, or
decrease linearly with the difference observed in the values.
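The small sketch below reproduces the semitone-based key-relative representation from the example above: each root becomes its minimal pitch-class distance to the tonic. The pitch-class spelling table is an illustrative assumption.

```python
# Key-relative representation: each chord root is replaced by its minimal
# pitch-class distance in semitones to the tonic of the global key.

PITCH_CLASS = {"C": 0, "Db": 1, "D": 2, "Eb": 3, "E": 4, "F": 5,
               "Gb": 6, "G": 7, "Ab": 8, "A": 9, "Bb": 10, "B": 11}

def key_relative(roots, tonic):
    t = PITCH_CLASS[tonic]
    dists = []
    for r in roots:
        d = (PITCH_CLASS[r] - t) % 12
        dists.append(min(d, 12 - d))
    return dists

roots = ["F", "Bb", "Eb", "Ab", "Db", "D", "G", "C", "C"]
print(key_relative(roots, "Ab"))   # -> [3, 2, 5, 0, 5, 6, 1, 4, 4]
```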
Transposition invariance. In order to be robust to key changes, two identical chord progressions transposed to different keys have to be considered as
similar. The usual way to deal with this issue [27] is to choose a chord representation that is transposition invariant. A first option is to represent transitions
between successive chords, but this has been proven to be less accurate when
applied to alignment algorithms [13]. Another option is to consider a key-relative representation, like the representation described above, which is by definition
transposition invariant. However, this approach is not robust against local key
changes. With an absolute representation of chords, we use an adaptation of
the local alignment algorithm proposed in [1]. It allows an unlimited number of
local transpositions to be taken into account and can be applied to representations
of chord progressions to account for modulations.
Depending on the choice of the representation and the score function, several
variants are possible in order to build an algorithm for harmonic similarity. In
Section 4 we explain the different representations and scoring functions used in
the different tasks of the experiment and their effects on retrieval performance.
2.3
The Chord Sequence Corpus used in the experiment consists of 5,028 unique
human-generated Band-in-a-Box files that were collected from the Internet. Band-in-a-Box is a commercial software package [9] that is used to generate musical
Table 1. A leadsheet of the song All The Things You Are. A dot represents a beat, a
bar represents a bar line, and the chord labels are presented as written in the Band-in-a-Box file.

|Fm7 . . .    |Bbm7 . . .     |Eb7 . . .    |AbMaj7 . . . |
|DbMaj7 . . . |Dm7b5 . G7b9 . |CMaj7 . . .  |CMaj7 . . .  |
|Cm7 . . .    |Fm7 . . .      |Bb7 . . .    |Eb7 . . .    |
|AbMaj7 . . . |Am7b5 . D7b9 . |GMaj7 . . .  |GMaj7 . . .  |
|A7 . . .     |D7 . . .       |GMaj7 . . .  |GMaj7 . . .  |
|Gbm7 . . .   |B7 . . .       |EMaj7 . . .  |C+ . . .     |
|Fm7 . . .    |Bbm7 . . .     |Eb7 . . .    |AbMaj7 . . . |
|DbMaj7 . . . |Dbm7 . Gb7 .   |Cm7 . . .    |Bdim . . .   |
|Bbm7 . . .   |Eb7 . . .      |AbMaj7 . . . |. . . .      |
Table 2. The distribution of the song class sizes in the Chord Sequence Corpus

Class Size   Frequency   Percent
1            3,253       82.50
2            452         11.46
3            137         3.47
4            67          1.70
5            25          0.63
6            7           0.18
7            1           0.03
8            1           0.03
10           1           0.03
Total        5,028       100
special introduction or ending. The richness of the chord descriptions may also
diverge, i.e. a C7b9b13 may be written instead of a C7, and common substitutions
frequently occur. Examples of the latter are relative substitution, i.e. Am instead
of C, or tritone substitution, e.g. F#7 instead of C7. Having multiple chord
sequences describing the same song allows for setting up a cover-song finding
experiment. The title of the song is used as ground truth and the retrieval
challenge is to find the other chord sequences representing the same song.
The distribution of the song class sizes is displayed in Table 2 and gives an
impression of the difficulty of the retrieval task. Generally, Table 2 shows that
the song classes are relatively small and that for the majority of the queries there
is only one relevant document to be found. It furthermore shows that 82.5% of
the songs are in the corpus for distraction only. The chord sequence corpus is
available to the research community on request.
We compared the TPSD and the CSAS in six retrieval tasks. For this experiment
we used the chord sequence corpus described above, which contains sequences
that clearly describe the same song. For each of these tasks the experimental
setup was identical: all songs that have two or more similar versions were used as
a query, yielding 1775 queries. For each query a ranking was created by sorting
the other songs on their TPSD and CSAS scores and these rankings and the
runtimes of the compared algorithms were analyzed.
4.1 Tasks
The tasks, summarized in Table 3, differed in the level of chord information used
by the algorithms and in the usage of a priori global key information. In tasks 1-3
no key information was presented to the algorithms and in the remaining three
tasks we used the key information, which was manually checked for correctness,
as stored in the Band-in-a-Box files. The tasks 1-3 and 4-6 furthermore differed
in the amount of chord detail that was presented to the algorithms: in tasks 1
and 4 only the root note of the chord was available to the algorithms, in tasks
2 and 5 the root and the triad were available, and in tasks 3 and 6 the complete
chord as stored in the Band-in-a-Box file was presented to the algorithms.

Table 3. Overview of the six retrieval tasks

Task   Chord detail     Key information
1      root only        inferred
2      root and triad   inferred
3      complete chord   inferred
4      root only        as stored in the Band-in-a-Box file
5      root and triad   as stored in the Band-in-a-Box file
6      complete chord   as stored in the Band-in-a-Box file
The different tasks required specific variants of the tested algorithms. For tasks
1-3 the TPSD used the TPS key finding algorithm as described in Section 2.1. For
the tasks 1 and 4, involving only chord roots, a simplified variant of the TPSD
was used; for the tasks 2, 3, 5 and 6 we used the regular TPSD, as described in
Section 2.1 and [11].
To measure the impact of the chord representation and substitution functions
on retrieval performance, different variants of the CSAS were also built. In some
cases the choices made did not yield the best possible results, but they allow the
reader to understand the effects of the parameters used on retrieval performance.
The CSAS algorithms in tasks 1-3 all used an absolute representation and the
algorithms in tasks 4-6 used a key-relative representation. In tasks 4 and 5 the
chords were represented as the difference in semitones to the root of the key of
the piece and in task 6 as Lerdahl's TPS distance between the chord and the
triad of the key (as in the TPSD). The CSAS variants in tasks 1 and 2 used
a consonance-based substitution function; in the algorithms for tasks 4-6 a binary
substitution function was used. In tasks 2 and 5 a binary substitution function
for the mode was used as well: if the mode of the substituted chords matched,
no penalty was given; if they did not match, a penalty was given.
A last parameter that was varied was the use of local transpositions. The
CSAS variants applied in tasks 1 and 3 did not consider local transpositions, but
the CSAS algorithm used in task 2 did allow local transpositions (see Section 2.2
for details).
The TPSD was implemented in Java and the CSAS was implemented in C++,
but a small Java program was used to parallelize the matching process. All runs
were done on an Intel Xeon quad-core CPU at a frequency of 1.86 GHz with 4 GB
of RAM running 32-bit Linux. Both algorithms were parallelized to optimally
use the multiple cores of the CPUs.
4.2 Results
For each task and for each algorithm we analyzed the rankings of all 1775 queries
with 11-point precision-recall curves and Mean Average Precision (MAP). Figure 2
displays the interpolated average precision and recall chart for the TPSD and
the CSAS for all tasks listed in Table 3. We calculated the interpolated average
precision at eleven standard recall levels.
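As a reminder of how these numbers are obtained, the sketch below computes average precision and MAP in the standard way; it is a generic definition, not the authors' evaluation code.

```python
# Standard (non-interpolated) average precision and mean average precision.
# `ranking` is a list of retrieved documents, `relevant` the set of documents
# describing the same song as the query.

def average_precision(ranking, relevant):
    hits, precisions = 0, []
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)       # precision at each relevant hit
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, relevant_sets):
    aps = [average_precision(rankings[q], relevant_sets[q]) for q in rankings]
    return sum(aps) / len(aps)

# toy example: the single relevant document is retrieved at rank 2
print(average_precision(["song_x", "song_y"], {"song_y"}))   # 0.5
```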
Fig. 2. The 11-point interpolated precision and recall charts for the TPSD and the
CSAS for tasks 1-3, on the left, and 4-6 on the right
The 12 different runs introduce 11 degrees of freedom and 1775 individual queries
were examined per run.
All statistical tests were performed in Matlab 2009a.
Fig. 3. The MAP and Runtimes of the TPSD and the CSAS. The MAP is displayed on
the left axis and the runtimes are displayed on an exponential scale on the right axis.
On the left side of the chart the key inferred tasks are displayed and the key relative
tasks are displayed on the right side.
The overall retrieval performance of all algorithms on all tasks can be considered good, but there are some large differences between tasks and between
algorithms, both in performance and in runtime. With a MAP of .70 the overall best performing setup was the CSAS using triadic chord descriptions and
a key-relative representation (task 5). The TPSD also performs best on task
5, with a MAP of .58. In tasks 2 and 4-6 the CSAS significantly outperforms
the TPSD. On tasks 1 and 3 the TPSD outperforms the CSAS in runtime as
well as performance. For these two tasks, the results obtained by the CSAS are
significantly lower because local transpositions are not considered. These results
show that taking into account transpositions has a high impact on the quality
of the retrieval system, but also on the runtime.
The retrieval performance of the CSAS is good, but comes at a price. On
average over six of the twelve runs, the CSAS runs need about 136 times as
much time to complete as the TPSD. The TPSD takes about 30 minutes to 1.5
hours to match all 5,028 pieces, while the CSAS takes about 2 to 9 days. Because
the CSAS run in task 2 takes 206 hours to complete, there was not enough time
to perform runs on tasks 1 and 3 with the CSAS variant that takes local
transpositions into account.
Table 4. This table shows for each pair of runs whether the mean average precision, as
displayed in Figure 3, differed significantly (+) or not (-)
On the other hand, keeping all rich chord information seems to distract the
evaluated retrieval systems. Pruning the chord structure down to the triad might
be seen as a form of syntactical noise reduction, since the chord additions, if they
do not have a voice-leading function, have a rather arbitrary character and just
add some harmonic spice.
Concluding Remarks
References
1. Allali, J., Ferraro, P., Hanna, P., Iliopoulos, C.S.: Local transpositions in alignment
of polyphonic musical sequences. In: Ziviani, N., Baeza-Yates, R. (eds.) SPIRE
2007. LNCS, vol. 4726, pp. 26-38. Springer, Heidelberg (2007)
2. Aloupis, G., Fevens, T., Langerman, S., Matsui, T., Mesa, A., Núñez, Y., Rappaport, D., Toussaint, G.: Algorithms for Computing Geometric Measures of Melodic
Similarity. Computer Music Journal 30(3), 67-76 (2004)
3. Altschul, S., Gish, W., Miller, W., Myers, E., Lipman, D.: Basic Local Alignment
Search Tool. Journal of Molecular Biology 215, 403-410 (1990)
4. Arkin, E., Chew, L., Huttenlocher, D., Kedem, K., Mitchell, J.: An Efficiently Computable Metric for Comparing Polygonal Shapes. IEEE Transactions on Pattern
Analysis and Machine Intelligence 13(3), 209-216 (1991)
5. Bello, J., Pickens, J.: A Robust Mid-Level Representation for Harmonic Content
in Music Signals. In: Proceedings of the International Symposium on Music Information Retrieval, pp. 304-311 (2005)
6. Cormen, T., Leiserson, C., Rivest, R., Stein, C.: Introduction to Algorithms. MIT
Press, Cambridge (2001)
7. Downie, J.S.: The Music Information Retrieval Evaluation Exchange (2005-2007):
A Window into Music Information Retrieval Research. Acoustical Science and
Technology 29(4), 247-255 (2008)
8. Ferraro, P., Hanna, P., Imbert, L., Izard, T.: Accelerating Query-by-Humming on
GPU. In: Proceedings of the Tenth International Society for Music Information
Retrieval Conference (ISMIR), pp. 279284 (2009)
9. Gannon, P.: Band-in-a-Box. PG Music (1990), http://www.pgmusic.com/ (last
viewed February 2011)
10. de Haas, W.B., Rohrmeier, M., Veltkamp, R.C., Wiering, F.: Modeling Harmonic
Similarity Using a Generative Grammar of Tonal Harmony. In: Proceedings of the
Tenth International Society for Music Information Retrieval Conference (ISMIR),
pp. 549-554 (2009)
11. de Haas, W.B., Veltkamp, R.C., Wiering, F.: Tonal Pitch Step Distance: A Similarity Measure for Chord Progressions. In: Proceedings of the Ninth International
Society for Music Information Retrieval Conference (ISMIR), pp. 51-56 (2008)
12. Hanna, P., Robine, M., Rocher, T.: An Alignment Based System for Chord Sequence Retrieval. In: Proceedings of the 2009 Joint International Conference on
Digital Libraries, pp. 101-104. ACM, New York (2009)
13. Hanna, P., Ferraro, P., Robine, M.: On Optimizing the Editing Algorithms for
Evaluating Similarity between Monophonic Musical Sequences. Journal of New
Music Research 36(4), 267-279 (2007)
14. Harte, C., Sandler, M., Abdallah, S., Gómez, E.: Symbolic Representation of Musical Chords: A Proposed Syntax for Text Annotations. In: Proceedings of the
Sixth International Society for Music Information Retrieval Conference (ISMIR),
pp. 66-71 (2005)
15. van Kranenburg, P., Volk, A., Wiering, F., Veltkamp, R.C.: Musical Models for
Folk-Song Melody Alignment. In: Proceedings of the Tenth International Society
for Music Information Retrieval Conference (ISMIR), pp. 507-512 (2009)
16. Krumhansl, C.: Cognitive Foundations of Musical Pitch. Oxford University Press,
USA (2001)
17. Lerdahl, F.: Tonal Pitch Space. Oxford University Press, Oxford (2001)
Introduction
Successful exploitation of results from basic research is an indicator of the
practical relevance of a research field. During recent years, the scientific and
commercial interest in the comparatively young research discipline called Music
Information Retrieval (MIR) has grown considerably. Stimulated by the ever-growing availability and size of digital music catalogs and mobile media players,
MIR techniques have become increasingly important to aid convenient exploration
of large music collections (e.g., through recommendation engines) and to enable
entirely new forms of music consumption (e.g., through music games). Evidently,
commercial entities like online music shops, record labels and content aggregators
have realized that these aspects can make them stand out among their competitors and foster customer loyalty. However, the industry's willingness to fund basic
research in MIR is comparatively low. Thus, only well described methods have
found successful application in the real world. For music recommendation and
retrieval, these are doubtlessly services based on collaborative filtering¹ (CF).
For music transcription and interaction, these are successful video game titles
using monophonic pitch detection². The two research projects in the scope of
this paper provide the opportunity to progress in core areas of MIR, but always
with a clear focus on suitability for real-world applications.
This paper is organized as follows. Each of the two projects is described in
more detail in Sec. 2 and Sec. 3. Results from the research as well as the development perspective are reported. Finally, conclusions are given and future
directions are sketched.
Songs2See
2.1 Research Results
2.2 Development Results
In order to have tangible results early on, the software development started in
parallel with the research tasks. Therefore, interfaces have been kept as generic
as possible in order to enable adaptation to alternative or extended core algorithms
later on.
Songs2See web application. In order to achieve easy accessibility of the
exercise application, we decided to go for a web-based approach using Flex³.
Originally, the Flash application built with this tool did not allow direct processing of the microphone input. Thus, we had to implement a streaming server
solution. We used the open source Red5⁴ in conjunction with the transcoder
library Xuggler⁵. This way, we conducted real-time pitch detection [11] in the
server application and returned the detected pitches to the web interface. Further details about the implementation are to be published in [18]. Only with its
latest release in July 2010, i.e., Adobe Flash Player 10.1, did Flash incorporate
the possibility of handling audio streams from a microphone input directly on
the client side. A screenshot of the prototype interface is shown in Fig. 1. It can
be seen that the user interface offers more assistance than just the plain score
sheet. The fingering on the respective instrument of the student is shown as an
animation. The relative tones are displayed as well as the relative position of
the pitch produced by the player's instrument. This principle is well known from
music games and has been adapted here for educational purposes. Further
helpful functions, such as transpose, tempo change and stylistic modification, will
be implemented in the future.
Songs2See editor. For the creation of music exercises, we developed an application with the working title Songs2See Editor. It is a stand-alone graphical
user interface based on Qt⁶ that allows average and expert users to create
musical exercises. The editor already allows the user to go through the prototypical
workflow. During import of a song, timbre segmentation is conducted and the
beat grid and key candidates per segment are estimated. The user can choose
the segment of interest, start the automatic transcription or use the source separation functionality. For immediate visual and audible feedback of the results,
a piano-roll editor is combined with a simple synthesizer as well as sound separation controls. Thus, the user can grab any notes he suspects to be erroneously
transcribed and move or delete them. In addition, the user is able to seamlessly
mix the ratio between the separated melody instrument and the background
accompaniment. In Fig. 2, the interface can be seen. We expect that the users
of the editor will creatively combine the different processing methods in order to
analyze and manipulate the audio tracks to their liking. In the current stage of
development, export of MIDI and MusicXML is already possible. In a later stage,
support for other popular formats, such as TuxGuitar, will be implemented.
3 See http://www.adobe.com/products/flex/
4 See http://osflash.org/red5
5 See http://www.xuggle.com/
6 See http://qt.nokia.com/products
GlobalMusic2One
categories can, for example, be regional sub-genres which are defined through
exemplary songs or song snippets. This self-learning MIR framework will be
continuously expanded with precise content-based descriptors.
3.1 Research Results
With automatic annotation of world music content, songs often cannot be assigned to one single genre label. Instead, various rhythmic, melodic and harmonic
influences conflate into multi-layered mixtures. Common classifier approaches
fail due to their immanent assumption that for all song segments one dominant
genre exists and thus is retrievable.
Multi-domain labeling. To overcome these problems, we introduced the
multi-domain labeling approach [28] that breaks down multi-label annotations towards single-label annotations within different musical domains, namely
timbre, rhythm, and tonality. In addition, a separate annotation of each temporal segment of the overall song is enabled. This leads to a more meaningful and
realistic two-dimensional description of multi-layered musical content. Related
to that topic, classification of singing vs. rapping in urban music has been described in [15]. In another paper [27] we applied the recently proposed Multiple
Kernel Learning (MKL) technique that has been successfully used for real-world
applications in the fields of computational biology, image information retrieval,
etc. In contrast to classic Support Vector Machines (SVM), MKL provides a
possibility of weighting over different kernels depending on a feature set.
Clustering with constraints. Inspired by the work in [38], we investigated
clustering with constraints with application to active exploration of music collections. Constrained clustering has been developed to improve clustering methods
through pairwise constraints. Although these constraints are received as queries
from a noiseless oracle, most of the methods involve a random procedure stage
to decide which elements are presented to the oracle. In [29] we applied spectral
clustering with constraints to a music dataset, where the queries for constraints
were selected in a deterministic way from an outlier identification perspective.
We simulated the constraints through the ground-truth music genre labels. The
results showed that constrained clustering with the deterministic outlier identification method achieved reasonable and stable results with an increasing
number of constraint queries. Although the constraints were enhancing
the similarity relations between the items, the clustering was conducted in the
static feature space. In [30] we embedded the information about the constraints
into a feature selection procedure that adapted the feature space with regard to
the constraints. We proposed two methods for constrained feature selection:
similarity-based and constraint-based. We applied constrained clustering
with embedded feature selection to the active exploration of music collections.
Our experiments showed that the proposed feature selection methods improved
the results of the constrained clustering.
Rule-based classification. The second important research direction was rule-based classification with high-level features. In general, high-level features can
again be categorized according to different musical domains like rhythm, harmony, melody or instrumentation. In contrast to low-level and mid-level audio
features, they are designed with respect to music theory and are thus interpretable by human observers. Often, high-level features are derived from automatic music transcription or classification into semantic categories. Different
approaches for the extraction of rhythm-related high-level features have been reported in [21], [25] and [1]. Although automatic extraction of high-level features
is still quite error-prone, we proved in [4] that they can be used in a rule-based
classification scheme with a quality comparable to state-of-the-art pattern recognition using SVM. The concept of rule-based classification was inspected in detail
in [3] using a fine granular manual annotation of high-level features referring to
rhythm, instrumentation etc. In this paper, we tested rule-based classification
on a restricted dataset of 24 manually annotated audio tracks and achieved an
accuracy rate of over 80%.
Novel audio features. Adding the fourth domain, instrumentation, to the
multi-domain approach described in Sec. 3.1 required the design and implementation of novel audio features tailored towards instrument recognition in polyphonic recordings. Promising results even with instruments from non-European
cultural areas are reported in [22]. In addition, we investigated the automatic
classification of rhythmic patterns in global music styles in [42]. In this work,
special measures have been taken to make the features and distance measures
tempo independent. This is done implicitly, without the need for a preceding
beat grid extraction that is commonly recommended in the literature to derive
beat-synchronous feature vectors. In conjunction with the approach to rule-based
classification described in Sec. 3.1, novel features for the classification of bass
playing styles have been published in [5] and [2]. In these papers, we compared
an approach based on high-level features and another one based on similarity
measures between bass patterns. For both approaches, we assessed two different
strategies: classification of patterns as a whole and classification of all measures
of a pattern with a subsequent accumulation of the classification results. Furthermore, we investigated the influence of potential transcription errors on the
classification accuracy. Given a taxonomy consisting of 8 different bass playing
styles, best classification accuracy values of 60.8% were achieved for the feature-based classification and 68.5% for the pattern similarity approach.
3.2 Development Results
Fig. 3. Screenshot of the Annotation Tool configured to the Globalmusic2one description scheme, showing the annotation of a stylistically and structurally complex song
Acknowledgments
The Thuringian Ministry of Economy, Employment and Technology supported
this research by granting funds of the European Fund for Regional Development to the project Songs2See⁷, enabling transnational cooperation between
Thuringian companies and their partners from other European regions. Additionally, this work has been partly supported by the German research project
GlobalMusic2One⁸ funded by the Federal Ministry of Education and Research
(BMBF-FKZ: 01/S08039B).
References
1. Abeßer, J., Dittmar, C., Großmann, H.: Automatic genre and artist classification
by analyzing improvised solo parts from musical recordings. In: Proceedings of the
Audio Mostly Conference (AMC), Piteå, Sweden (2008)
2. Abeßer, J., Bräuer, P., Lukashevich, H., Schuller, G.: Bass playing style detection
based on high-level features and pattern similarity. In: Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR), Utrecht,
Netherlands (2010)
3. Abeßer, J., Lukashevich, H., Dittmar, C., Bräuer, P., Krause, F.: Rule-based classification of musical genres from a global cultural background. In: Proceedings
of the 7th International Symposium on Computer Music Modeling and Retrieval
(CMMR), Malaga, Spain (2010)
4. Abeßer, J., Lukashevich, H., Dittmar, C., Schuller, G.: Genre classification using
bass-related high-level features and playing styles. In: Proceedings of the 10th
International Society for Music Information Retrieval Conference (ISMIR), Kobe,
Japan (2009)
5. Abeßer, J., Lukashevich, H., Schuller, G.: Feature-based extraction of plucking and
expression styles of the electric bass guitar. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Dallas,
Texas, USA (2010)
6. Arndt, D., Gatzsche, G., Mehnert, M.: Symmetry model based key finding. In:
Proceedings of the 126th AES Convention, Munich, Germany (2009)
7. Barbancho, A., Barbancho, I., Tardon, L., Urdiales, C.: Automatic edition of songs
for guitar hero/frets on fire. In: Proceedings of the IEEE International Conference
on Multimedia and Expo (ICME), New York, USA (2009)
8. Bertin-Mahieux, T., Eck, D., Maillet, F., Lamere, P.: Autotagger: a model for
predicting social tags from acoustic features on large music databases. Journal of
New Music Research 37(2), 115135 (2008)
7 See http://www.songs2see.eu
8 See http://www.globalmusic2one.net
9. Cano, E., Cheng, C.: Melody line detection and source separation in classical saxophone recordings. In: Proceedings of the 12th International Conference on Digital
Audio Effects (DAFx), Como, Italy (2009)
10. Cano, E., Schuller, G., Dittmar, C.: Exploring phase information in sound source
separation applications. In: Proceedings of the 13th International Conference on
Digital Audio Effects (DAFx 2010), Graz, Austria (2010)
11. Dittmar, C., Dressler, K., Rosenbauer, K.: A toolbox for automatic transcription
of polyphonic music. In: Proceedings of the Audio Mostly Conference (AMC),
Ilmenau, Germany (2007)
12. Dittmar, C., Großmann, H., Cano, E., Grollmisch, S., Lukashevich, H., Abeßer,
J.: Songs2See and GlobalMusic2One - Two ongoing projects in Music Information
Retrieval at Fraunhofer IDMT. In: Proceedings of the 7th International Symposium
on Computer Music Modeling and Retrieval (CMMR), Malaga, Spain (2010)
13. Duan, Z., Pardo, B., Zhang, C.: Multiple fundamental frequency estimation by
modeling spectral peaks and non-peak regions. IEEE Transactions on Audio,
Speech, and Language Processing (99), 11 (2010)
14. Fitzgerald, D., Cranitch, M., Coyle, E.: Extended nonnegative tensor factorization models for musical sound source separation. Computational Intelligence and
Neuroscience (2008)
15. Gärtner, D.: Singing / rap classification of isolated vocal tracks. In: Proceedings
of the 11th International Society for Music Information Retrieval Conference (ISMIR), Utrecht, Netherlands (2010)
16. Gómez, E., Haro, M., Herrera, P.: Music and geography: Content description of
musical audio from different parts of the world. In: Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR), Kobe,
Japan (2009)
17. Grollmisch, S., Dittmar, C., Cano, E.: Songs2see: Learn to play by playing. In:
Proceedings of the 41st AES International Conference on Audio in Games, London,
UK (2011)
18. Grollmisch, S., Dittmar, C., Cano, E., Dressler, K.: Server based pitch detection
for web applications. In: Proceedings of the 41st AES International Conference on
Audio in Games, London, UK (2011)
19. Grollmisch, S., Dittmar, C., Gatzsche, G.: Implementation and evaluation of an improvisation based music video game. In: Proceedings of the IEEE Consumer Electronics Society's Games Innovation Conference (IEEE GIC), London, UK (2009)
20. Gruhne, M., Schmidt, K., Dittmar, C.: Phoneme recognition on popular music.
In: 8th International Conference on Music Information Retrieval (ISMIR), Vienna,
Austria (2007)
21. Herrera, P., Sandvold, V., Gouyon, F.: Percussion-related semantic descriptors of
music audio files. In: Proceedings of the 25th International AES Conference, London, UK (2004)
22. Kahl, M., Abeßer, J., Dittmar, C., Großmann, H.: Automatic recognition of tonal
instruments in polyphonic music from different cultural backgrounds. In: Proceedings of the 36th Jahrestagung für Akustik (DAGA), Berlin, Germany (2010)
23. Klapuri, A.: A method for visualizing the pitch content of polyphonic music signals.
In: Proceedings of the 10th International Society for Music Information Retrieval
Conference (ISMIR), Kobe, Japan (2009)
24. Klapuri, A., Davy, M. (eds.): Signal Processing Methods for Music Transcription.
Springer Science + Business Media LLC, New York (2006)
25. Lidy, T., Rauber, A., Pertusa, A., Iñesta, J.M.: Improving genre classification by
combination of audio and symbolic descriptors using a transcription system. In:
Proceedings of the 8th International Conference on Music Information Retrieval
(ISMIR), Vienna, Austria (2007)
26. Lukashevich, H.: Towards quantitative measures of evaluating song segmentation.
In: Proceedings of the 9th International Conference on Music Information Retrieval
(ISMIR), Philadelphia, Pennsylvania, USA (2008)
27. Lukashevich, H.: Applying multiple kernel learning to automatic genre classification. In: Proceedings of the 34th Annual Conference of the German Classification
Society (GfKl), Karlsruhe, Germany (2010)
28. Lukashevich, H., Abeßer, J., Dittmar, C., Großmann, H.: From multi-labeling to
multi-domain-labeling: A novel two-dimensional approach to music genre classification. In: Proceedings of the 10th International Society for Music Information
Retrieval Conference (ISMIR), Kobe, Japan (2009)
29. Mercado, P., Lukashevich, H.: Applying constrained clustering for active exploration of music collections. In: Proceedings of the 1st Workshop on Music Recommendation and Discovery (WOMRAD), Barcelona, Spain (2010)
30. Mercado, P., Lukashevich, H.: Feature selection in clustering with constraints: Application to active exploration of music collections. In: Proceedings of the 9th Int.
Conference on Machine Learning and Applications (ICMLA), Washington DC,
USA (2010)
31. Ono, N., Miyamoto, K., Roux, J.L., Kameoka, H., Sagayama, S.: Separation of a
monaural audio signal into harmonic/percussive components by complementary
diffusion on spectrogram. In: Proceedings of the 16th European Signal Processing
Conference (EUSIPCO), Lausanne, Switzerland (2008)
32. Pohle, T., Schnitzer, D., Schedl, M., Knees, P., Widmer, G.: On rhythm and general music similarity. In: Proceedings of the 10th International Society for Music
Information Retrieval Conference (ISMIR), Kobe, Japan (2009)
33. Ryynänen, M., Klapuri, A.: Automatic transcription of melody, bass line, and
chords in polyphonic music. Computer Music Journal 32, 72-86 (2008)
34. Sagayama, S., Takahashi, K., Kameoka, H., Nishimoto, T.: Specmurt anasylis:
A piano-roll-visualization of polyphonic music signal by deconvolution of log-frequency spectrum. In: Proceedings of the ISCA Tutorial and Research Workshop
on Statistical and Perceptual Audio Processing (SAPA), Jeju, Korea (2004)
35. Shashanka, M., Raj, B., Smaragdis, P.: Probabilistic latent variable models as
nonnegative factorizations. Computational Intelligence and Neuroscience (2008)
36. Smaragdis, P., Mysore, G.J.: Separation by humming: User-guided sound extraction from monophonic mixtures. In: Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA
(2009)
37. Stein, M., Schubert, B.M., Gruhne, M., Gatzsche, G., Mehnert, M.: Evaluation
and comparison of audio chroma feature extraction methods. In: Proceedings of
the 126th AES Convention, Munich, Germany (2009)
38. Stober, S., Nürnberger, A.: Towards user-adaptive structuring and organization of
music collections. In: Detyniecki, M., Leiner, U., Nürnberger, A. (eds.) AMR 2008.
LNCS, vol. 5811, pp. 53-65. Springer, Heidelberg (2010)
39. Tzanetakis, G., Kapur, A., Schloss, W.A., Wright, M.: Computational ethnomusicology. Journal of Interdisciplinary Music Studies 1(2), 1-24 (2007)
40. Uhle, C.: Automatisierte Extraktion rhythmischer Merkmale zur Anwendung in
Music Information Retrieval-Systemen. Ph.D. thesis, Ilmenau University, Ilmenau,
Germany (2008)
41. Vinyes, M., Bonada, J., Loscos, A.: Demixing commercial music productions via
human-assisted time-frequency masking. In: Proceedings of the 120th AES convention, Paris, France (2006), http://www.mtg.upf.edu/files/publications/
271dd4-AES120-mvinyes-jbonada-aloscos.pdf (last viewed February 2011)
42. Völkel, T., Abeßer, J., Dittmar, C., Großmann, H.: Automatic genre classification
of latin american music using characteristic rhythmic patterns. In: Proceedings of
the Audio Mostly Conference (AMC), Piteå, Sweden (2010)
MusicGalaxy:
A Multi-focus Zoomable Interface for
Multi-facet Exploration of Music Collections
Sebastian Stober and Andreas Nürnberger
Data & Knowledge Engineering Group
Faculty of Computer Science
Otto-von-Guericke-University Magdeburg, D-39106 Magdeburg, Germany
{sebastian.stober,andreas.nuernberger}@ovgu.de
http://www.dke-research.de
Abstract. A common way to support exploratory music retrieval scenarios is to give an overview using a neighborhood-preserving projection
of the collection onto two dimensions. However, neighborhood cannot
always be preserved in the projection because of the inherent dimensionality reduction. Furthermore, there is usually more than one way
to look at a music collection and therefore different projections might
be required depending on the current task and the user's interests. We
describe an adaptive zoomable interface for exploration that addresses
both problems: It makes use of a complex non-linear multi-focal zoom
lens that exploits the distorted neighborhood relations introduced by the
projection. We further introduce the concept of facet distances representing different aspects of music similarity. User-specific weightings of these
aspects allow an adaptation according to the user's way of exploring the
collection. Following a user-centered design approach with focus on usability, a prototype system has been created by iteratively alternating
between development and evaluation phases. The results of an extensive user study including gaze analysis using an eye-tracker prove that
the proposed interface is helpful while at the same time being easy and
intuitive to use.
Keywords: exploration, interface, multi-facet, multi-focus.
Introduction
There is a lot of ongoing research in the field of music retrieval aiming to improve
retrieval results for queries posed as text, sung, hummed or by example as well
as to automatically tag and categorize songs. All these efforts facilitate scenarios
where the user is able to somehow formulate a query either by describing the
song or by giving examples. But what if the user cannot pose a query because
the search goal is not clearly defined? E.g., he might look for background music
for a photo slide show but does not know where to start. All he knows is that he
can tell if it is the right music the moment he hears it. In such a case, exploratory
Fig. 1. Possible problems caused by projecting objects represented in a high-dimensional feature space (left) onto a low-dimensional space for display (right)
retrieval systems can help by providing an overview of the collection and letting
the user decide which regions to explore further.
When it comes to getting an overview of a music collection, neighborhood-preserving projection techniques have become increasingly popular. Beforehand, the
objects to be projected (depending on the approach, this may be artists, albums,
tracks or any combination thereof) are analyzed to extract a set of descriptive
features. (Alternatively, feature information may also be annotated manually or
collected from external sources.) Based on these features, the objects can be
compared, or more specifically, appropriate distance or similarity measures
can be defined. The general objective of the projection can then be paraphrased
as follows: Arrange the objects in two or three dimensions (on the display) in
such a way that neighboring objects are very similar and the similarity decreases
with increasing object distance (on the display). As the feature space of the objects to be projected usually has far more dimensions than the display space,
the projection inevitably causes some loss of information irrespective of which
dimensionality reduction technique is applied. Consequently, this leads to a distorted display of the neighborhoods such that some objects will appear closer
than they actually are (type I error), and on the other hand some objects that
are distant in the projection may in fact be neighbors in feature space (type
II error). Such neighborhood distortions are depicted in Figure 1. These projection errors cannot be fixed on a global scale without introducing new ones
elsewhere as the projection is already optimal w.r.t. some criteria (depending
on the technique used). In this sense, they should not be considered errors
made by the projection technique but properties of the resulting (displayed) arrangement.
When a user explores a projected collection, type I errors increase the number
of dissimilar (i.e. irrelevant) objects displayed in a region of interest. While this
might become annoying, it is much less problematic than type II errors. They
result in similar (i.e. relevant) objects being displayed away from the region of
interest, the neighborhood they actually belong to. In the worst case they could
even be off-screen if the display is limited to the currently explored region. This
way, a user could miss objects he is actually looking for.
Related Work
There exists a variety of approaches that in some way give an overview of a music
collection. For the task of music discovery, which is closely related to collection
exploration, a very broad survey of approaches is given in [7]. Generally, there are
several possible levels of granularity that can be supported, the most common
being: track, album, artist and genre. Though a system may cover more than
one granularity level (e.g. in [51] visualized as disc or TreeMap [41]), usually a
single one is chosen. The user interface presented in this paper focuses on the
track level as do most of the related approaches. (However, like most of the other
techniques, it may as well be applied on other levels such as for albums or artists.
All that is required is an appropriate feature representation of the objects of
interest.) Those approaches focusing on a single level can roughly be categorized
into graph-based and similarity-based overviews.
Graphs facilitate a natural navigation along relationship edges. They are especially well-suited for the artist level as social relations can be directly visualized
(as, e.g., in the Last.fm Artist Map¹ or the Relational Artist Map RAMA [39]).
However, building a graph requires relations between the objects, either from
domain knowledge or artificially introduced. E.g., there are some graphs that use
similarity relations obtained from external sources (such as the APIs of Last.fm²
1 http://sixdegrees.hu/last.fm/interactive map.html
2 http://www.last.fm/api
or EchoNest³) and not from an analysis of the objects themselves. Either way,
this results in a very strong dependency and may quickly become problematic
for less main-stream music where such information might not be available. This
is why a similarity-based approach is chosen here instead.
Similarity-based approaches require the objects to be represented by one or
more features. They are in general better suited for track-level overviews due
to the vast variety of content-based features that can be extracted from tracks.
For albums and artists, either some means for aggregating the features of the
individual tracks are needed or non-content-based features, e.g. extracted from
knowledge resources like MusicBrainz⁴ and Wikipedia⁵ or cultural meta-data
[54], have to be used. In most cases the overview is then generated using some
metric defined on these features, which leads to proximity of similar objects in the
feature space. This neighborhood should be preserved in the collection overview,
which usually has only two dimensions. Popular approaches for dimensionality
reduction are Self-Organizing Maps (SOMs) [17], Principal Component Analysis
(PCA) [14] and Multidimensional Scaling (MDS) techniques [18].
In the field of music information retrieval, SOMs are widely used. SOM-based
systems comprise the SOM-enhanced Jukebox (SOMeJB) [37], the Islands of
Music [35,34] and nepTune [16], the MusicMiner [29], the PlaySOM- and PocketSOM-Player [30] (the latter being a special interface for mobile devices), the
BeatlesExplorer [46] (the predecessor prototype of the system presented here),
the SoniXplorer [23,24], the Globe of Music [20] and the tabletop applications
MUSICtable [44], MarGrid [12], SongExplorer [15] and [6]. SOMs are prototype-based and thus there has to be a way to initially generate random prototypes
and to modify them gradually when objects are assigned. This poses special
requirements regarding the underlying feature space and distance metric. Moreover, the result depends on the random initialization, and the neural network
gradient descent algorithm may get stuck in a local minimum and thus not
produce an optimal result. Further, there are several parameters that need to
be tweaked according to the data set, such as the learning rate, the termination
criterion for iteration, the initial network structure, and (if applicable) the rules
by which the structure should grow. However, there are also some advantages
of SOMs: Growing versions of SOMs can adapt incrementally to changes in the
data collection whereas other approaches may always need to generate a new
overview from scratch. Section 4.2 will address this point more specifically for
the approach taken here. For the interactive task at hand, which requires a real-time response, the disadvantages of SOMs outweigh their advantages. Therefore,
the approach taken here is based on MDS.
Given a set of data points, MDS finds an embedding in the target space
that maintains their distances (or dissimilarities) as far as possible without
having to know their actual values. This way, it is also well suited to compute a
layout for spring- or force-based approaches. PCA identifies the axes of highest
variance in the data.
Fig. 2. In SoundBite [22], a seed song and its nearest neighbors are connected by lines
3 Outline
The goal of our work is to provide the user with an interactive way of exploring a music collection that takes into account the above-described, inevitable limitations of a low-dimensional projection of a collection. Further, it should be applicable to realistic music collections containing several thousand tracks. The approach taken can be outlined as follows:
- An overview of the collection is given in which all tracks are displayed as points at any time. For a limited number of tracks that are chosen to be spatially well distributed and representative, an album cover thumbnail is shown for orientation.
- The view on the collection is generated by a neighborhood-preserving projection (e.g., MDS, SOM, PCA) from some high-dimensional feature space onto two dimensions, i.e., in general, tracks that are close in feature space will likely appear as neighbors in the projection.
- Users can adapt the projection by choosing weights for several aspects of music (dis-)similarity. This gives them the possibility to look at a collection from different perspectives. (This adaptation is purely manual, i.e., the visualization as described in this paper is only adaptable w.r.t. music similarity. Techniques to further enable adaptive music similarity are discussed, e.g., in [46,49].)
- In order to allow immediate visual feedback in case of similarity adaptation, the projection technique needs to guarantee near real-time performance even for large music collections. The quality of the produced projection is only secondary; a perfect projection that correctly preserves all distances between all tracks is extremely unlikely anyway.
- The projection will inevitably contain distortions of the actual distances of the tracks. Instead of trying to improve the quality of the projection method and trying to fix heavily distorted distances, these distortions are exploited during interaction with the projection:
- The user can zoom into a region of interest. The space for this region is increased, thus allowing more details to be displayed. At the same time the surrounding space is compacted but not hidden from view; this way, some context remains for orientation. To accomplish this behavior, the zoom is based on a non-linear distortion similar to so-called fish-eye lenses.
- At this point the original (type II) projection errors come into play: instead of putting a single lens focus on the region of interest, additional focuses are introduced in regions that contain tracks similar to those in primary focus. The resulting distortion brings original neighbors back closer to each other. This gives the user another option for interactive exploration.
Figure 3 depicts the outline of the approach. The following sections cover the
underlying techniques (Section 4) and the user-interaction (Section 5) in detail.
4 Underlying Techniques
4.1
The prototype system described here uses collections of music tracks. As a prerequisite, it is assumed that the tracks are represented by some descriptive features that can, e.g., be extracted, manually annotated, or obtained from external sources. In the current implementation, content-based features are extracted utilizing the capabilities of the frameworks CoMIRVA [40] and JAudio [28]. Specifically, Gaussian Mixture Models of the Mel Frequency Cepstral Coefficients (MFCCs) according to [2] and [26], as well as fluctuation patterns describing how strong and fast beats are played within specific frequency bands [35], are computed with CoMIRVA. JAudio is used to extract a global audio descriptor MARSYAS07 as described in [52]. Further, lyrics for all songs were obtained through the web service of LyricWiki (http://lyricwiki.org), filtered for stop words, stemmed, and described by document vectors with TFxIDF term weights [38].
Fig. 3. Outline of the approach showing the important processing steps and data structures. Top: preprocessing. Bottom: interaction with the user, with screenshots of the graphical user interface.
Additional features that are currently only used for the visualization are ID3 tags (artist, album, title, track number and year) extracted from the audio files, track play counts obtained from a Last.fm profile, and album covers gathered through web search.
Distance Facets. Based on the features associated with the tracks, facets are defined (on subspaces of the feature space) that refer to different aspects of music (dis-)similarity. This is depicted in Figure 3 (top).
Definition 1. Given a set of features F, let S be the space determined by the feature values for a set of tracks T. A facet f is defined by a facet distance measure $\delta_f$ on a subspace $S_f \subseteq S$ of the feature space, where $\delta_f$ satisfies the following conditions for any $x, y \in T$:

$\delta_f(x, y) \geq 0$ and $\delta_f(x, y) = 0$ if and only if $x = y$
$\delta_f(x, y) = \delta_f(y, x)$ (symmetry)

Optionally, $\delta_f$ is a distance metric if it additionally obeys the triangle inequality for any $x, y, z \in T$:

$\delta_f(x, z) \leq \delta_f(x, y) + \delta_f(y, z)$ (triangle inequality)
For example, a facet timbre could be defined on the MFCC-based feature described in [26], whereas a facet text could compare the combined information from the features title and lyrics.
It is important to stress the difference to common faceted browsing and search approaches that rely on a faceted classification of objects to support users in exploration by filtering the available information. Here, no such filtering by value is applied. Instead, we employ the concept of facet distances to express different aspects of (dis-)similarity that can be used for filtering.
Facet Distance Normalization. In order to avoid a bias when aggregating several facet distance measures, the values should be normalized. The following normalization truncates very high facet distance values $\delta_f(x, y)$ of a facet f and results in a value range of [0, 1]:

$$\hat{\delta}_f(a, b) = \min\left(1, \frac{\delta_f(a, b)}{\mu + \sigma}\right) \qquad (1)$$

where $\mu$ is the mean

$$\mu = \frac{1}{|T^2|} \sum_{(x,y) \in T^2} \delta_f(x, y) \qquad (2)$$

and $\sigma$ the standard deviation of all pairwise facet distances

$$\sigma = \sqrt{\frac{1}{|T^2|} \sum_{(x,y) \in T^2} \left(\delta_f(x, y) - \mu\right)^2} \qquad (3)$$
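The following sketch (Python with NumPy; the function name is illustrative, not taken from the prototype) shows how this normalization could be computed from a matrix of raw facet distances:

```python
import numpy as np

def normalize_facet_distances(D):
    """Truncating normalization of a raw facet distance matrix D (cf. Eqs. 1-3).

    D[i, j] is the raw facet distance between tracks i and j. Values are divided
    by (mean + standard deviation) over all pairs and clipped to the range [0, 1].
    """
    mu = D.mean()        # mean over all pairs (Eq. 2)
    sigma = D.std()      # standard deviation over all pairs (Eq. 3)
    return np.minimum(1.0, D / (mu + sigma))   # Eq. 1

# e.g. D_timbre_hat = normalize_facet_distances(D_timbre)
```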
Table 1. Facets defined for the current implementation

facet name   feature                        distance metric
timbre       GMM of MFCCs                   Kullback-Leibler divergence
rhythm       fluctuation patterns           Euclidean distance
dynamics     MARSYAS07                      Euclidean distance
lyrics       TFxIDF weighted term vectors   cosine distance
4.2 Projection
In the projection step shown in Figure 3 (bottom), the position of all tracks on the display is computed according to their (aggregated) distances in the high-dimensional feature space. Naturally, this projection should be neighborhood-preserving, such that tracks close to each other in feature space are also close in the projection. We propose to use a landmark- or pivot-based Multidimensional Scaling approach (LMDS) for the projection, as described in detail in [42,43]. This is a computationally efficient approximation to classical MDS. The general idea of this approach is as follows: a representative sample of objects, called landmarks, is drawn randomly from the whole collection. For this landmark sample, an embedding into low-dimensional space is computed using classical MDS. The remaining objects can then be located within this space according to their distances to the landmarks.
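The idea can be illustrated with the following sketch (Python/NumPy, illustrative names; the prototype itself relies on the MDSJ library [1]): classical MDS embeds the landmark sample, and every other object is then placed from its squared distances to the landmarks.

```python
import numpy as np

def classical_mds(D_landmarks, dim=2):
    """Classical MDS embedding of the m landmarks from their m x m distance matrix."""
    m = D_landmarks.shape[0]
    D2 = D_landmarks ** 2
    J = np.eye(m) - np.ones((m, m)) / m            # centering matrix
    B = -0.5 * J @ D2 @ J                          # double-centered (pseudo) Gram matrix
    evals, evecs = np.linalg.eigh(B)               # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:dim]            # keep the 'dim' largest eigenpairs
    L = evecs[:, top] * np.sqrt(np.maximum(evals[top], 0.0))
    return L, D2                                   # landmark coordinates and squared distances

def lmds_locate(d_to_landmarks, L, D2):
    """Place one additional object given only its distances to the m landmarks."""
    mean_d2 = D2.mean(axis=0)                      # mean squared distance to each landmark
    return -0.5 * np.linalg.pinv(L) @ (d_to_landmarks ** 2 - mean_d2)

# usage sketch:
# L, D2 = classical_mds(D_landmarks)               # embed the random landmark sample
# xy = lmds_locate(d_track_to_landmarks, L, D2)    # project any other track
```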
However, it still allows objects to be added to or removed from the data set to some extent without the need to compute a new projection: if a new track is added to the collection, an additional layer has to be appended to the facet distance cuboid, containing the facet distances of the new track to all landmarks. The new track can then be projected according to these distances. If a track is removed, the respective layer of the cuboid can be deleted. Neither operation alters the projection any further. (In case a landmark track is removed from the collection, its feature representation has to be kept in order to be able to compute facet distances for new tracks. However, the corresponding layer in the cuboid can be removed as for any ordinary track.) Adding or removing many tracks may, however, alter the distribution of the data (and thus the covariances) in such a way that the landmark sample may no longer be representative. In this case, a new projection based on a modified landmark sample should be computed. However, for the scope of this paper, a stable landmark set is assumed and this point is left for future work.
4.3 Lens Distortion
Once the 2-D positions of all tracks have been computed by the projection technique, the collection could already be displayed. However, an intermediate distortion step is introduced, as depicted in Figure 3 (bottom). It serves as the basis for the interaction techniques described later.
Lens Modeling. The distortion technique is based on an approach originally developed to model complex nonlinear distortions of images, called SpringLens [9]. A SpringLens consists of a mesh of mass particles and interconnecting springs that form a rectangular grid with fixed resolution. Through the springs, forces are exerted between neighboring particles, affecting their motion. By changing the rest length of selected springs, the mesh can be distorted as depicted in Figure 4. (Further, Figure 3 (bottom) and Figure 7 show larger meshes simulating lenses.) The deformation is calculated by a simple iterative physical simulation over time using Euler integration [9].
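As an illustration of the underlying simulation (not the actual SpringLens code [9]), a single explicit Euler step for such a mass-spring mesh could look as follows; changing the rest lengths of selected springs is what distorts the mesh:

```python
import numpy as np

def springlens_step(pos, vel, springs, rest_len, dt=0.02, stiffness=5.0, damping=0.9):
    """One explicit Euler step for a mesh of unit-mass particles connected by springs.

    pos, vel : (n, 2) arrays of particle positions and velocities
    springs  : list of (i, j) index pairs of connected particles
    rest_len : one rest length per spring; changing these values distorts the mesh
    """
    forces = np.zeros_like(pos)
    for s, (i, j) in enumerate(springs):
        d = pos[j] - pos[i]
        length = np.linalg.norm(d) + 1e-9
        # Hooke's law: pull/push along the spring towards its rest length
        f = stiffness * (length - rest_len[s]) * (d / length)
        forces[i] += f
        forces[j] -= f
    vel = damping * (vel + dt * forces)   # Euler integration with simple velocity damping
    pos = pos + dt * vel
    return pos, vel
```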
In the context of this work, the SpringLens technique is applied to simulate a complex superimposition of multiple fish-eye lenses. A moderate resolution with a maximum of 50 cells in each dimension is chosen for the overlay mesh, which yields sufficient distortion accuracy while real-time capability is maintained. The distorted position of the projection points is obtained by a barycentric coordinate transformation with respect to the particle points of the mesh. Additionally, z-values are derived from the rest lengths; they are used in the visualization to decide whether an object has to be drawn below or above another one.
Nearest Neighbor Indexing. For the adaptation of the lens distortion, the nearest neighbors of a track need to be retrieved. Here, the two major challenges are:
1. The facet weights are not known at indexing time, and thus the index can only be built using the facet distances.
2. The choice of an appropriate indexing method for each facet depends on the respective distance measure and the nature of the underlying features.
As the focus here lies on the visualization and not on indexing, only a very basic approach is taken and further developments are left for future work: a limited list of nearest neighbors is pre-computed for each track. This way, nearest neighbors can be retrieved by a simple lookup in constant time (O(1)). However, updating the lists after a change of the facet weights is computationally expensive. While the resulting delay of the display update is still acceptable for collections with a few thousand tracks, it becomes infeasible for larger N.
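A minimal sketch of this basic scheme (illustrative names): the neighbor lists are precomputed from the aggregated distance matrix, lookups are constant time, and any change of the facet weights forces a full rebuild.

```python
import numpy as np

def build_nn_index(D, k=5):
    """Precompute, for every track, the indices of its k nearest neighbors.

    D is the N x N aggregated (and normalized) distance matrix.
    """
    order = np.argsort(D, axis=1)     # each row sorted by increasing distance
    return order[:, 1:k + 1]          # drop column 0, which is the track itself

def nearest_neighbors(nn_index, track_id):
    """Constant-time (O(1)) lookup of the precomputed neighbor list."""
    return nn_index[track_id]

# after a change of the facet weights the whole index has to be rebuilt:
# nn_index = build_nn_index(aggregate_facet_distances(facet_matrices, new_weights))
```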
For more efficient index structures, it may be possible to apply generic multimedia indexing techniques such as space partitioning trees [5] or approximate approaches based on locality-sensitive hashing [13] that may even be kernelized [19] to allow for more complex distance metrics. Another option is to generate multiple nearest neighbor indexes, each for a different setting of the facet weights, and to interpolate the retrieved result lists w.r.t. the actual facet weights.
4.4 Visualization Metaphor
4.5 Filtering
In order to reduce the amount of information displayed at a time, an additional filtering step is introduced, as depicted in Figure 3 (bottom). The user can choose between different filters that decide whether a track is displayed collapsed or expanded, i.e., as a star or as an album cover, respectively. While album covers help with orientation, the displayed stars give information about the data distribution. Trivial filters are those displaying no album covers (collapseAll) or all of them (expandAll). Apart from collapsing or expanding all tracks, it is possible to expand only those tracks in magnified regions (i.e., with a z-level above a predefined threshold) or to apply a sparser filter. The results of using these filter modes are shown in Figure 5.

Fig. 5. Available filter modes: collapse all (top left), focus (top right), sparse (bottom left), expand all (bottom right). The SpringLens mesh overlay is hidden.

A sparser filter selects only a subset of the collection to be expanded that is both sparse (well distributed) and representative. Representative tracks are those with a high importance (described in Section 4.4). The first sparser version used a Delaunay triangulation and was later substituted by a raster-based approach that produces more appealing results in terms of the spatial distribution of displayed covers.
Originally, the set of expanded tracks was updated after any position changes caused by the distortion overlay. However, this was considered irritating during early user tests, and the sparser strategy was changed to update only if the projection or the displayed region changes.
Delaunay Sparser Filter. This sparser filter constructs a Delaunay triangulation incrementally top-down, starting with the track with the highest importance and some virtual points at the corners of the display area. Next, the size of all resulting triangles, given by the radius of their circumcircle, is compared with a predefined threshold size_min. If the size of a triangle exceeds this threshold, the most important track within this triangle is chosen for display and added as a point to the triangulation. This process continues recursively until no triangle that exceeds size_min contains any more tracks that could be added. All tracks belonging to the triangulation are then expanded (i.e., displayed as album thumbnails).
The Delaunay triangulation can be computed in O(n log n) and the number of triangles is at most O(n), with n ≤ N being the number of actually displayed album cover thumbnails. To reduce lookup time, projected points are stored in a quadtree data structure [5] and sorted by importance within the tree's quadrants. A triangle's size may change through the distortion caused by the multi-focal zoom. This change may trigger an expansion of the triangle or a removal of the point that originally caused its creation. Both operations are propagated recursively until all triangles meet the size condition again. Figure 3 (bottom) shows a triangulation and the resulting display for a (distorted) projection of a collection.
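A simplified sketch of this selection strategy using SciPy's Delaunay triangulation is given below; unlike the incremental, quadtree-backed implementation described above, it simply re-triangulates after each pass, which is easier to follow but less efficient (all names are illustrative):

```python
import numpy as np
from scipy.spatial import Delaunay

def circumradius(a, b, c):
    """Circumradius of the triangle with vertices a, b, c."""
    la, lb, lc = np.linalg.norm(b - c), np.linalg.norm(a - c), np.linalg.norm(a - b)
    area = abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])) / 2.0
    return la * lb * lc / (4.0 * area + 1e-12)

def delaunay_sparser(points, importance, corners, size_min):
    """Select a sparse, importance-driven subset of projected points to expand."""
    ranked = list(np.argsort(importance)[::-1])     # track indices, most important first
    selected = [ranked[0]]                          # start with the most important track
    while True:
        vertices = np.vstack([corners, points[selected]])
        tri = Delaunay(vertices)
        containing = tri.find_simplex(points)       # triangle id for every projected track
        added = False
        for s, simplex in enumerate(tri.simplices):
            a, b, c = vertices[simplex]
            if circumradius(a, b, c) <= size_min:
                continue                            # triangle is already small enough
            for i in ranked:                        # most important unselected track inside it
                if i not in selected and containing[i] == s:
                    selected.append(i)
                    added = True
                    break
        if not added:
            return selected                         # tracks to expand as album thumbnails
```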
Raster Sparser Filter. The raster sparser filter divides the display into a grid of quadratic cells. The size of the cells depends on the screen resolution and the minimal display size of the album covers. Further, it maintains a list of the tracks ranked by importance that is precomputed and only needs to be updated when the importance values change. On an update, the sparser runs through its ranked list. For each track it determines the respective grid cell. If the cell and the surrounding cells are empty, the track is expanded and its cell is blocked. (Checking surrounding cells avoids image overlap. The necessary radius of the surrounding can be derived from the cell and cover sizes.)
The computational complexity of this sparser approach is linear in the number of objects to be considered, but it also depends on the radius of the surrounding that needs to be checked. The latter can be reduced by using a data structure for the raster that has O(1) lookup complexity but higher costs for insertions, which happen far less frequently. This approach further has the nice property that it handles the most important objects first and thus returns a useful result even if interrupted.
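A minimal sketch of this raster-based selection (illustrative names and parameters):

```python
import numpy as np

def raster_sparser(points, ranked_ids, cell_size, cover_size, width, height):
    """Importance-driven selection of tracks to expand, based on a grid of square cells.

    points     : (N, 2) display positions of all tracks
    ranked_ids : track indices sorted by decreasing importance (precomputed)
    Returns the ids of the tracks whose album cover thumbnail should be shown.
    """
    cols, rows = int(width // cell_size) + 1, int(height // cell_size) + 1
    blocked = np.zeros((cols, rows), dtype=bool)
    radius = int(np.ceil(cover_size / cell_size))   # cells to keep free around a cover
    expanded = []
    for tid in ranked_ids:
        cx, cy = int(points[tid, 0] // cell_size), int(points[tid, 1] // cell_size)
        if not (0 <= cx < cols and 0 <= cy < rows):
            continue                                # track lies outside the visible area
        x0, x1 = max(0, cx - radius), min(cols, cx + radius + 1)
        y0, y1 = max(0, cy - radius), min(rows, cy + radius + 1)
        if blocked[x0:x1, y0:y1].any():
            continue                                # a surrounding cell is already taken
        blocked[cx, cy] = True                      # expand the track and block its cell
        expanded.append(tid)
    return expanded
```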
5 Interaction
While the previous section covered the underlying techniques, this section describes how users can interact with the user interface that is built on top of them. Figure 6 shows a screenshot of the MusicGalaxy prototype. It allows several ways of interacting with the visualization: users can explore the collection through common panning & zooming (Section 5.1). Alternatively, they can use the adaptive multi-focus technique introduced with this prototype (Section 5.2). Further, they can change the facet aggregation function parameters and in this way adapt the view on the collection according to their preferences (Section 5.3). Hovering over a track displays its title, and a double-click starts the playback, which can be controlled by the player widget at the bottom of the interface. Apart from this, several display parameters can be changed, such as the filtering mode (Section 4.5), the size of the displayed album covers, or the visibility of the SpringLens overlay mesh.
5.1 Panning & Zooming
These are very common interaction techniques that can, e.g., be found in programs for geo-data visualization or image editing that make use of the map metaphor. Panning shifts the displayed region, whereas zooming decreases or increases it. (This does not affect the size of the thumbnails, which can be controlled separately using the PageUp and PageDn keys.) Using the keyboard, the user can pan with the cursor keys and zoom in and out with + and -, respectively. Alternatively, the mouse can be used: clicking and holding the left button while moving the mouse pans the display. The mouse wheel controls the zoom level. If not the whole collection can be displayed, an overview window indicating the currently visible section is shown in the top left corner; otherwise it is hidden. Clicking into the overview window centers the display around the respective point. Further, the user can drag the section indicator around, which also results in panning.
5.2 Focusing
Fig. 6. Screenshot of the MusicGalaxy prototype with visible overview window (top left), player (bottom) and SpringLens mesh overlay (blue). In this example, a strong album effect can be observed: for the track in primary focus, four tracks of the same album are nearest neighbors in secondary focus.
Fig. 7. SpringLens distortion with only primary focus (left) and additional secondary
focus (right)
Two facet control panels allow the user to adapt the two facet distance aggregation functions by choosing one of the function types listed in Section 4.1 (from a drop-down menu) and adjusting weights for the individual facets (through sliders). The control panels are hidden in the screenshot in Figure 6, but shown in Figure 3 (bottom), which depicts the user interaction. The first facet distance aggregation function
is applied to derive the track-landmark distances from the facet distance cuboid (Section 4.2). These distances are then used to compute the projection of the collection. The second facet distance aggregation function is applied to identify the nearest neighbors of a track and thus indirectly controls the secondary focus.
Changing the aggregation parameters results in a near real-time update of the display, so that the impact of the change becomes immediately visible: in case of the parameters for the nearest neighbor search, some secondary focus region may disappear while somewhere else a new one appears with tracks now considered more similar. Here, the transitions are visualized smoothly due to the underlying physical simulation of the SpringLens grid. In contrast to this, a change of the projection similarity parameters has a more drastic impact on the visualization, possibly resulting in a complete re-arrangement of all tracks. This is because the LMDS projection technique produces solutions that are unique only up to translation, rotation, and reflection; thus, even a small parameter change may, e.g., flip the visualization. As this may confuse users, one direction of future research is to investigate how the positions of the landmarks can be constrained during the projection to produce more gradual changes.
The two facet distance aggregation functions are linked by default, as it is most natural to use the same distance measure for projection and neighbor retrieval. However, unlinking them and using, e.g., orthogonal distance measures can lead to interesting effects: for instance, one may choose to compute the collection projection based solely on acoustic facets and find nearest neighbors for the secondary focus through lyrics similarity. Such a setting would help to uncover tracks with a similar topic that (most likely) sound very different.
6 Evaluation
eye-tracker that can capture where and for how long the gaze of the participants rests (referred to as fixation points). Using the adaptive SpringLens focus, the mouse generally followed the gaze, which scans the border of the focus in order to decide on the direction to explore further. This resulted in a much smoother gaze trajectory than the one observed during usage of panning and zooming, where the gaze frequently switched between the overview window and the objects of interest so as not to lose orientation. This indicates that the proposed approach is less tiring for the eyes. However, the testers criticized the controls used to change the focus, especially having to hold the right mouse button all the time. This led to the introduction of the focus lock mode and several minor interface improvements in the third version of the prototype [48] that are not explicitly covered here.
The remainder of this section describes the evaluation of the third MusicGalaxy prototype in a user study [45] with the aim to prove that the user interface indeed helps during exploration. Screencasts of 30 participants solving an exploratory retrieval task were recorded together with eye-tracking data (again using a Tobii T60 eye-tracker) and webcam video streams. These data were used to identify emerging search strategies among all users and to analyze to what extent the primary and secondary focus were used. Moreover, first-hand impressions of the usability of the interface were gathered by letting the participants say aloud whatever they think, feel or remark as they go about their task (think-aloud protocol).
In order to ease the evaluation, the study was not conducted with the original MusicGalaxy user interface prototype but with a modified version, depicted in Figure 8, that can handle photo collections. It relies on widely used MPEG-7 visual descriptors (EdgeHistogram, ScalableColor and ColorLayout) [27,25] to compute the visual similarity (see [49] for further details), replacing the originally used music features and the respective similarity facets. Using photo collections for evaluation instead of music has several advantages: it can be assured that none of the participants knows any of the photos in advance, which could otherwise introduce some bias. Dealing with music, this would be much harder to realize. Furthermore, similarity and relevance of photos can be assessed in an instant. This is much harder for music tracks and requires additional time for listening, especially if the tracks are previously unknown.
The following four questions were addressed in the user study:
1. How does the lens-based user interface compare in terms of usability to common panning & zooming techniques that are very popular in interfaces using a map metaphor (such as Google Maps, http://maps.google.com)?
2. How much do users actually use the secondary focus, or would a common fish-eye distortion (i.e., only the primary focus) be sufficient?
3. Which interaction patterns emerge?
4. What can be improved to further support the user and increase user satisfaction?
6.1 Experimental Setup
At the beginning of the experiment, the participants were asked several questions to gather general information about their background. Afterwards, they were presented with four image collections (described below) in fixed order. On the first collection, a survey supervisor gave a guided introduction to the interface and the possible user actions. Each participant could spend as much time as needed to get used to the interface. Once the participant was familiar with the controls, she or he continued with the other collections, for which a retrieval task (described below) had to be solved without the help of the supervisor. At this point, the participants were divided into two groups. The first group used only panning & zooming (P&Z), as described in Section 5.1, on the second collection and only the SpringLens functionality (SL), described in Section 5.2, on the third one. The other group started with SL and then used P&Z. The order of the datasets stayed the same for both groups. (This way, effects caused by the order of the approaches and by slightly varying difficulties among the collections are avoided.) The fourth collection could then be explored using both P&Z and SL. (The functionality for adapting the facet distance aggregation functions described in Section 5.3 was deactivated for the whole experiment.) After the completion of the last task, the participants were asked to assess the usability of the different approaches. Furthermore, feedback was collected pointing out, e.g., missing functionality.

Table 2. Photo collections and topics used during the user study

Test Collections. Four image collections were used during the study. They were drawn from a personal photo collection of the authors. Each collection comprises 350 images, except the first collection (used for the introduction of the user interface), which only contains 250 images. All images were scaled down to fit 600x600 pixels. For each of the collections 2 to 4, five non-overlapping topics were chosen and the images annotated accordingly. These annotations served as ground truth and were not shown to the participants. Table 2 shows the topics for each collection. In total, 264 of the 1050 images belong to one of the 15 topics.
Retrieval Task. For the collections 2 to 4, the participants had to find five (or more) representative images for each of the topics listed in Table 2. As guidance, handouts were prepared that showed the topics (each one printed in a different color), an optional brief description, and two or three sample images giving an impression of what to look for. Images representing a topic had to be marked with the topic's color. This was done by double-clicking on the thumbnail, which opened a floating dialog window presenting the image at a larger scale and allowing the participant to assign the image to a predefined topic by clicking a corresponding button. As a result, the image was marked with the color representing the topic. Further, the complete collection could be filtered by highlighting all thumbnails assigned to one topic. This was done by pressing the numeric key (1 to 5) for the respective topic number. Highlighting was done by focusing a fish-eye lens on every marked topic member and thus enlarging the corresponding thumbnails. It was pointed out that the decision whether an image was representative for a group was solely up to the participant and was not judged otherwise. There was no time limit for the task. However, the participants were encouraged to skip to
the next collection after approximately five minutes, as by this time enough information would already have been collected.
Tweaking the Nearest Neighbor Index. In the original implementation, at most five nearest neighbors are retrieved, with the additional constraint that their distance to the query object has to be in the 1-percentile of all distances in the collection. (This avoids returning nearest neighbors that are not really close.) 264 of the 1050 images belonging to collections 2 to 4 have a ground-truth topic label. For only 61 of these images, one or more of the five nearest neighbors belonged to the same topic, and only in these cases would the secondary focus have displayed something helpful for the given retrieval task. This led us to conclude that the feature descriptors used were not sophisticated enough to capture the visual intra-topic similarity. A lot more work would have been involved in improving the features, but this would have been beyond the scope of the study, which aimed to evaluate the user interface and most specifically the secondary focus, which differentiates our approach from common fish-eye techniques. In order not to have the users evaluate the underlying feature representation and the respective similarity metric, we modified the index for the experiment: every time the index was queried with an image that has a ground-truth annotation, the two most similar images from the respective topic were injected into the returned list of nearest neighbors. This ensured that the secondary focus would contain some relevant images.
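A minimal sketch of this modification (illustrative names; `build_nn_index` as in the earlier sketch): whenever the query image carries a ground-truth topic, the two most similar images of that topic are pushed to the front of the returned neighbor list.

```python
def query_with_injection(nn_index, D, topics, query_id, k=5):
    """Nearest-neighbor lookup as modified for the study.

    nn_index : precomputed neighbor lists (cf. build_nn_index above)
    D        : full distance matrix between all images
    topics   : dict image id -> ground-truth topic (labeled images only)
    """
    result = list(nn_index[query_id][:k])
    topic = topics.get(query_id)
    if topic is not None:
        same_topic = [i for i, t in topics.items() if t == topic and i != query_id]
        injected = sorted(same_topic, key=lambda i: D[query_id, i])[:2]
        # put the two most similar same-topic images at the front of the neighbor list
        result = injected + [i for i in result if i not in injected]
        result = result[:k]
    return result
```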
6.2 Results
The user study was conducted with 30 participants, all of them graduate or post-graduate students. Their age was between 19 and 32 years (mean 25.5) and 40% were female. Most of the test persons (70%) were computer science students, with half of them having a background in computer vision or user interface design. 43% of the participants stated that they take photos on a regular basis, and 30% use software for archiving and sorting their photo collection. The majority (77%) declared that they are open to new user interface concepts.
Usability Comparison. Figure 9 shows the results from the questionnaire comparing the usability and helpfulness of the SL approach with the P&Z baseline. What becomes immediately evident is that half of the participants rated the SL interface as being significantly more helpful than the simple P&Z interface while being equally complicated to use. The intuitiveness of SL was, surprisingly, rated slightly better than for the P&Z interface, which is an interesting outcome since we expected users to be more familiar with P&Z, as it is more common in today's user interfaces (e.g., Google Maps). This, however, suggests that interacting with a fish-eye lens can be regarded as intuitive for humans when interacting with large collections. The combination of both got even better ratings but has to be considered non-competitive here, as it could have had an advantage from always being the last interface used: participants had already had time to get used to the handling of the two complementary interfaces. Moreover, since the collection did not change (as it did for P&Z and SL), the combined interface might have had the advantage of being applied to a possibly easier collection with
[Fig. 9: questionnaire ratings of helpfulness, simplicity and intuitiveness for P&Z, SL and the combined interface.]
[Table: percentage of marked images found in each focus state (primary, extended primary, secondary, no focus), split by same/other topic; totals: primary 37.75, ext. primary 8.75, secondary 43.98, no focus 9.52.]
is supposed to bring up. It comes close to the percentage of the primary focus, which, not surprisingly, is the highest. Ignoring the topic, the (extended) primary and the secondary focus contribute almost equally, and only less than 10% of the marked images were not in focus, i.e., discovered only through P&Z.
Emerging Search Strategies. For this part, we again analyze only the interaction with the combined interface. A small group of participants used P&Z excessively. They increased the initial thumbnail size in order to better perceive the depicted contents and chose to display all images as thumbnails. To reduce the overlap of thumbnails, they operated on a deeper zoom level and therefore had to pan a lot. The gaze data shows a tendency for systematic sequential scans, which were however difficult due to the scattered and irregular arrangement of the thumbnails. Further, some participants occasionally marked images not in focus because they were attracted by dominant colors (e.g., for the aquarium topic). Another typical strategy was to quickly scan through the collection by moving the primary focus, typically with a small thumbnail size and at a zoom level that showed most of the collection except the outer regions. In this case the attention was mostly on the (extended) primary focus region, with the gaze scanning in which direction to explore further, and little to moderate attention on the secondary focus. Occasionally, participants would freeze the focus or slow down for some time to scan the whole display. In contrast to this rather continuous change of the primary focus, there was a group of participants that browsed the collection mostly by moving (in a single click) the primary focus to some secondary focus region, much like navigating an invisible neighborhood graph. Here, the attention was concentrated on the secondary focus regions.
User Feedback. Many participants had problems with an overcrowded primary fish-eye in dense regions. This was alleviated by temporarily zooming into the region, which lets the images drift further apart. However, there are possibilities that require less interaction, such as automatically spreading the thumbnails in focus with force-based layout techniques. When working on deeper zoom levels where only a small part of the collection is visible, the secondary focus was considered mostly useless, as it was usually out of view. Further work could therefore investigate off-screen visualization techniques to facilitate awareness of, and quick navigation to, secondary focus regions out of view, and to better integrate P&Z and SL. The increasing empty space at deep zoom levels should be avoided, e.g., by automatically increasing the thumbnail size as soon as all thumbnails can be displayed without overlap. An optional re-arrangement of the images in view into a grid layout may ease sequential scanning, as preferred by some users. Another proposal was to visualize which regions have already been explored, similar to the (optionally time-restricted) fog of war used in strategy computer games. Some participants would welcome advanced filtering options such as a prominent-color filter. An undo function or reverse playback of the focus movement would be desirable and could easily be implemented by maintaining a list of the last images in primary focus. Finally, some participants remarked that it would be nice to generate the secondary focus for a set of images (belonging to the same topic).
In fact, it is even possible to adapt the similarity metric used for the nearest neighbor queries automatically to the task of finding more images of the same topic, as shown in recent experiments [49]. This opens an interesting research direction for future work.
7 Conclusion
A common approach for exploratory retrieval scenarios is to start with an overview from which the user can decide which regions to explore further. The focus-adaptive SpringLens visualization technique described in this paper addresses the following three major problems that arise in this context:
1. Approaches that rely on dimensionality reduction techniques to project the collection from a high-dimensional feature space onto two dimensions inevitably face projection errors: some tracks will appear closer than they actually are, and on the other hand, some tracks that are distant in the projection may in fact be neighbors in the original space.
2. Displaying all tracks at once becomes infeasible for large collections because of limited display space and the risk of overwhelming the user with the amount of information displayed.
3. There is more than one way to look at a music collection, or more specifically, to compare two music pieces based on their features. Each user may have a different way, and a retrieval system should account for this.
The first problem is addressed by introducing a complex distortion of the visualization that adapts to the user's current region of interest and temporarily alleviates possible projection errors in the focused neighborhood. The amount of displayed information can be adapted by the application of several sparser filters. Concerning the third problem, the proposed user interface allows users to (manually) adapt the underlying similarity measure used to compute the arrangement of the tracks in the projection of the collection. To this end, weights can be specified that control the importance of different facets of music similarity, and further an aggregation function can be chosen to combine the facets.
Following a user-centered design approach with a focus on usability, a prototype system has been created by iteratively alternating between development and evaluation phases. For the final evaluation, an extensive user study including gaze analysis using an eye-tracker was conducted with 30 participants. The results prove that the proposed interface is helpful while at the same time being easy and intuitive to use.
Acknowledgments
This work was supported in part by the German National Merit Foundation, the German Research Foundation (DFG) under the project AUCOMA, and the European Commission under FP7-ICT-2007-C FET-Open, contract no. BISON-211898. The user study was conducted in collaboration with Christian Hentschel, who also took care of the image feature extraction. The authors would further like to thank all testers and the participants of the study for their time and valuable feedback for further development, Tobias Germer for sharing his ideas and code of the original SpringLens approach [9], Sebastian Loose, who has put a lot of work into the development of the filter and zoom components, the developers of CoMIRVA [40] and JAudio [28] for providing their feature extractor code, and George Tzanetakis for providing insight into his MIREX 07 submission [52]. The Landmark MDS algorithm has been partly implemented using the MDSJ library [1].
References
1. Algorithmics Group: MDSJ: Java library for multidimensional scaling (version 0.2). University of Konstanz (2009)
2. Aucouturier, J.J., Pachet, F.: Improving timbre similarity: How high is the sky? Journal of Negative Results in Speech and Audio Sciences 1(1) (2004)
3. Baumann, S., Halloran, J.: An ecological approach to multimodal subjective music similarity perception. In: Proc. of the 1st Conf. on Interdisciplinary Musicology (CIM 2004), Graz, Austria (April 2004)
4. Cano, P., Kaltenbrunner, M., Gouyon, F., Batlle, E.: On the use of FastMap for audio retrieval and browsing. In: Proc. of the 3rd Int. Conf. on Music Information Retrieval (ISMIR 2002) (2002)
5. De Berg, M., Cheong, O., Van Kreveld, M., Overmars, M.: Computational Geometry: Algorithms and Applications. Springer, New York (2008)
6. Diakopoulos, D., Vallis, O., Hochenbaum, J., Murphy, J., Kapur, A.: 21st century electronica: MIR techniques for classification and performance. In: Proc. of the 10th Int. Conf. on Music Information Retrieval (ISMIR 2009), pp. 465–469 (2009)
7. Donaldson, J., Lamere, P.: Using visualizations for music discovery. Tutorial at the 10th Int. Conf. on Music Information Retrieval (ISMIR 2009) (October 2009)
8. Gasser, M., Flexer, A.: FM4 Soundpark: Audio-based music recommendation in everyday use. In: Proc. of the 6th Sound and Music Computing Conference (SMC 2009), Porto, Portugal (2009)
9. Germer, T., Götzelmann, T., Spindler, M., Strothotte, T.: SpringLens: Distributed nonlinear magnifications. In: Eurographics 2006 - Short Papers, pp. 123–126. Eurographics Association, Aire-la-Ville (2006)
10. Gleich, M.R.D., Zhukov, L., Lang, K.: The World of Music: SDP layout of high dimensional data. In: InfoVis 2005 (2005)
11. van Gulik, R., Vignoli, F.: Visual playlist generation on the artist map. In: Proc. of the 6th Int. Conf. on Music Information Retrieval (ISMIR 2005), pp. 520–523 (2005)
12. Hitchner, S., Murdoch, J., Tzanetakis, G.: Music browsing using a tabletop display. In: Proc. of the 8th Int. Conf. on Music Information Retrieval (ISMIR 2007), pp. 175–176 (2007)
13. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proc. of the 30th ACM Symposium on Theory of Computing (STOC 1998), pp. 604–613. ACM, New York (1998)
14. Jolliffe, I.T.: Principal Component Analysis. Springer, Heidelberg (2002)
15. Julia, C.F., Jorda, S.: SongExplorer: a tabletop application for exploring large collections of songs. In: Proc. of the 10th Int. Conf. on Music Information Retrieval (ISMIR 2009), pp. 675–680 (2009)
16. Knees, P., Pohle, T., Schedl, M., Widmer, G.: Exploring music collections in virtual landscapes. IEEE MultiMedia 14(3), 46–54 (2007)
17. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biological Cybernetics 43(1), 59–69 (1982)
18. Kruskal, J., Wish, M.: Multidimensional Scaling. Sage, Thousand Oaks (1986)
19. Kulis, B., Grauman, K.: Kernelized locality-sensitive hashing for scalable image search. In: Proc. of the 12th Int. Conf. on Computer Vision (ICCV 2009) (2009)
20. Leitich, S., Topf, M.: Globe of Music - music library visualization using GeoSOM. In: Proc. of the 8th Int. Conf. on Music Information Retrieval (ISMIR 2007), pp. 167–170 (2007)
21. Lillie, A.S.: MusicBox: Navigating the space of your music. Master's thesis, MIT (2008)
22. Lloyd, S.: Automatic Playlist Generation and Music Library Visualisation with Timbral Similarity Measures. Master's thesis, Queen Mary University of London (2009)
23. Lübbers, D.: SoniXplorer: Combining visualization and auralization for content-based exploration of music collections. In: Proc. of the 6th Int. Conf. on Music Information Retrieval (ISMIR 2005), pp. 590–593 (2005)
24. Lübbers, D., Jarke, M.: Adaptive multimodal exploration of music collections. In: Proc. of the 10th Int. Conf. on Music Information Retrieval (ISMIR 2009), pp. 195–200 (2009)
25. Lux, M.: Caliph & Emir: MPEG-7 photo annotation and retrieval. In: Proc. of the 17th ACM Int. Conf. on Multimedia (MM 2009), pp. 925–926. ACM, New York (2009)
26. Mandel, M., Ellis, D.: Song-level features and support vector machines for music classification. In: Proc. of the 6th Int. Conf. on Music Information Retrieval (ISMIR 2005), pp. 594–599 (2005)
27. Martinez, J., Koenen, R., Pereira, F.: MPEG-7: The generic multimedia content description standard, part 1. IEEE MultiMedia 9(2), 78–87 (2002)
28. McEnnis, D., McKay, C., Fujinaga, I., Depalle, P.: jAudio: A feature extraction library. In: Proc. of the 6th Int. Conf. on Music Information Retrieval (ISMIR 2005), pp. 600–603 (2005)
29. Mörchen, F., Ultsch, A., Nöcker, M., Stamm, C.: Databionic visualization of music collections according to perceptual distance. In: Proc. of the 6th Int. Conf. on Music Information Retrieval (ISMIR 2005), pp. 396–403 (2005)
30. Neumayer, R., Dittenbach, M., Rauber, A.: PlaySOM and PocketSOMPlayer, alternative interfaces to large music collections. In: Proc. of the 6th Int. Conf. on Music Information Retrieval (ISMIR 2005), pp. 618–623 (2005)
31. Nielsen, J.: Usability engineering. In: Tucker, A.B. (ed.) The Computer Science and Engineering Handbook, pp. 1440–1460. CRC Press, Boca Raton (1997)
32. Nürnberger, A., Klose, A.: Improving clustering and visualization of multimedia data using interactive user feedback. In: Proc. of the 9th Int. Conf. on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU 2002), pp. 993–999 (2002)
33. Oliver, N., Kreger-Stickles, L.: PAPA: Physiology and purpose-aware automatic playlist generation. In: Proc. of the 7th Int. Conf. on Music Information Retrieval (ISMIR 2006) (2006)
34. Pampalk, E., Dixon, S., Widmer, G.: Exploring music collections by browsing different views. In: Proc. of the 4th Int. Conf. on Music Information Retrieval (ISMIR 2003), pp. 201–208 (2003)
35. Pampalk, E., Rauber, A., Merkl, D.: Content-based organization and visualization of music archives. In: Proc. of the 10th ACM Int. Conf. on Multimedia (MULTIMEDIA 2002), pp. 570–579. ACM Press, New York (2002)
36. Pauws, S., Eggen, B.: PATS: Realization and user evaluation of an automatic playlist generator. In: Proc. of the 3rd Int. Conf. on Music Information Retrieval (ISMIR 2002) (2002)
37. Rauber, A., Pampalk, E., Merkl, D.: Using psycho-acoustic models and self-organizing maps to create a hierarchical structuring of music by musical styles. In: Proc. of the 3rd Int. Conf. on Music Information Retrieval (ISMIR 2002) (2002)
38. Salton, G., Buckley, C.: Term weighting approaches in automatic text retrieval. Information Processing & Management 24(5), 513–523 (1988)
39. Sarmento, L., Gouyon, F., Costa, B., Oliveira, E.: Visualizing networks of music artists with RAMA. In: Proc. of the Int. Conf. on Web Information Systems and Technologies, Lisbon (2009)
40. Schedl, M.: The CoMIRVA Toolkit for Visualizing Music-Related Data. Technical report, Johannes Kepler University Linz (June 2006)
41. Shneiderman, B.: Tree visualization with tree-maps: 2-d space-filling approach. ACM Trans. Graph. 11(1), 92–99 (1992)
42. de Silva, V., Tenenbaum, J.: Sparse multidimensional scaling using landmark points. Tech. rep., Stanford University (2004)
43. de Silva, V., Tenenbaum, J.B.: Global versus local methods in nonlinear dimensionality reduction. In: Advances in Neural Information Processing Systems 15 (NIPS 2002), pp. 705–712 (2002)
44. Stavness, I., Gluck, J., Vilhan, L., Fels, S.S.: The MUSICtable: A map-based ubiquitous system for social interaction with a digital music collection. In: Kishino, F., Kitamura, Y., Kato, H., Nagata, N. (eds.) ICEC 2005. LNCS, vol. 3711, pp. 291–302. Springer, Heidelberg (2005)
45. Stober, S., Hentschel, C., Nürnberger, A.: Evaluation of adaptive SpringLens - a multi-focus interface for exploring multimedia collections. In: Proc. of the 6th Nordic Conference on Human-Computer Interaction (NordiCHI 2010), Reykjavik, Iceland (October 2010)
46. Stober, S., Nürnberger, A.: Towards user-adaptive structuring and organization of music collections. In: Detyniecki, M., Leiner, U., Nürnberger, A. (eds.) AMR 2008. LNCS, vol. 5811, pp. 53–65. Springer, Heidelberg (2010)
47. Stober, S., Nürnberger, A.: A multi-focus zoomable interface for multi-facet exploration of music collections. In: Proc. of the 7th Int. Symposium on Computer Music Modeling and Retrieval (CMMR 2010), Malaga, Spain, pp. 339–354 (June 2010)
48. Stober, S., Nürnberger, A.: MusicGalaxy - an adaptive user-interface for exploratory music retrieval. In: Proc. of the 7th Sound and Music Computing Conference (SMC 2010), Barcelona, Spain, pp. 382–389 (July 2010)
49. Stober, S., Nürnberger, A.: Similarity adaptation in an exploratory retrieval scenario. In: Detyniecki, M., Knees, P., Nürnberger, A., Schedl, M., Stober, S. (eds.) Post-Proceedings of the 8th International Workshop on Adaptive Multimedia Retrieval (AMR 2010), Linz, Austria (2010)
Cnam, Paris
rigaux@lamsade.dauphine.fr
Lamsade, Univ. Paris-Dauphine
zoe@armadillo.fr
1 Introduction
The presence of music on the web has grown exponentially over the past decade. Music representations are multiple (audio files, MIDI files, printable music scores, etc.) and are easily accessible through numerous platforms. Given the availability of several compact formats, the main representation is by means of audio files, which provide immediate access to music content and are easily spread, sampled and listened to. However, extracting structured information from an audio file is a difficult (if not impossible) task, since it is subject to the subjectivity of interpretation. On the other hand, symbolic music representation, usually derived from musical scores, enables exploitation scenarios different from what audio files may offer. The very detailed and unambiguous description of the music content is of high interest for communities of music professionals, such as musicologists, music publishers, or professional musicians. New online communities of users arise, with an interest in a more in-depth study of music than what average music lovers may look for.
The interpretation of music content information (structure of musical pieces, tonality, harmonic progressions, etc.) combined with meta-data (historic and geographic context, author and composer names, etc.) is a matter of human expertise. Specific music analysis tools have been developed by music professionals for centuries, and should now be scaled to a larger level in order to provide scientific and efficient analysis of large collections of scores.
Another fundamental need of such online communities, this time shared with more traditional platforms, is the ability to share content and knowledge, as well as to annotate, compare and correct all this available data. This Web 2.0 space with user-generated content helps improve and accelerate research, with the added bonus of making available to a larger audience sources which would otherwise remain confidential. Questions of copyright, security and controlled contributions are part of the problems raised by such social networks.
To summarize, a platform designed to be used mainly, but not only, by music professionals should offer classic features such as browsing and rendering, but also the ability to upload new content, annotate scores, search by content (exact and similarity search), and several tools for music content manipulation and analysis.
Most of the proposals devoted so far to analysis methods or similarity searches on symbolic music focus on the accuracy and/or relevance of the result, and implicitly assume that these procedures apply to a small collection [2,3,13,16]. While useful, this approach gives rise to several issues when the collection consists of thousands of scores with heterogeneous descriptions.
A first issue is related to software engineering and architectural concerns. A large score digital library provides several services to many different users or applications. Consistency, reliability, and security concerns call for the definition of a single consistent data management interface for these services. In particular, one can hardly envisage the publication of ad-hoc search procedures that merely expose the bunch of methods and algorithms developed for each specific retrieval task. The multiplication of these services would quickly overwhelm external users. Worse, the combination of these functions, which is typically a difficult matter, would be left to external applications. In complex systems, the ability to compose the data manipulation operators fluently is a key to both expressive power and computational efficiency.
Therefore, on-line communities dealing with scores and/or musical content are often limited either by the size of their corpus or by the range of possible operations, with only one publicized strong feature. Examples of on-line communities include Mutopia [26], MelodicMatch [27] or Musipedia [28]. Wikifonia [29] offers a wider range of services, allowing registered users to publish and edit sheet music. One can also cite the OMRSYS platform described in [8].
A second issue pertains to scalability. With the ongoing progress in digitization, optical recognition and user content generation, we must be ready to face an important growth of the volume of music content that must be handled by institutions, libraries, or publishers. Optimizing accesses to large datasets is a delicate matter which involves many technical aspects that embrace physical storage, indexing and algorithmic strategies. Such techniques are usually supported by a specialized data management system which relieves applications from the burden of low-level and intricate implementation concerns.
To our knowledge, no system exists at this moment that is able to handle large heterogeneous music digital libraries while smoothly combining data manipulation operators. The HumDrum toolkit is a widely used automated musicological
analysis tool [14,22], but its representation remains at a low level. A HumDrum-based system will lack flexibility and will depend too much on how the files are stored. This makes it difficult to develop indexing or optimization techniques. Another possible approach would be a system based on MusicXML, an XML-based file format [12,24]. It has been suggested recently that XQuery may be used over MusicXML for music queries [11], but XQuery is a general-purpose query language which hardly adapts to the specifics of symbolic music manipulation.
Our objective in this paper is to lay the ground for a score management system with all the features of a digital scores library combined with content manipulation operators. Among other things, a crucial component of such a system is a logical data model specifically designed for symbolic music management, together with its associated query language. Our approach is based on the idea that the management of structured scores corresponds, at the core level, to a limited set of fundamental operations that can be defined and implemented once and for all. We also take into account the fact that the wide range of user needs calls for the ability to associate these operations with user-defined functions at early steps of the query evaluation process. Modeling the invariant operators and combining them with user-defined operations is the main goal of our design effort. Among numerous advantages, this allows the definition of a stable and robust query() service which does not need ad-hoc extensions as new requirements arrive.
We do not claim (yet) that our model and its implementation will scale easily, but a high-level representation like our model is a prerequisite in order to allow the necessary flexibility for such future optimizations.
Section 3.2 describes in further detail the Neuma platform, a Digital Score Library [30] devoted to large collections of monodic and polyphonic music from the French Modern Era (16th–18th centuries). One of the central pieces of the architecture is the data model that we present in this paper. The language described in Section 5 offers a generic mechanism to search and transform music notation.
The rest of this paper first discusses related work (Section 2). Section 3 presents the motivation and the context of our work. Section 4 then exposes the formal foundations of our model. Section 6 concludes the paper.
2 Related Work
The past decade has witnessed a growing interest in techniques for representing, indexing and searching (by content) music documents. The domain is commonly termed Music Information Retrieval (MIR), although it covers many aspects beyond the mere process of retrieving documents. We refer the reader to [19] for an introduction. Systems can manipulate music either as audio files or in symbolic form. The symbolic representation offers a structured representation which is well suited for content-based access, sophisticated manipulations, and analysis [13].
An early attempt to represent scores as structured files and to develop search and analysis functions is the HumDrum format. Both the representation and the procedures are low-level (text files, Unix commands), which makes them difficult to integrate in complex applications. Recent works try to overcome these limitations [22,16]. Musipedia proposes several kinds of interfaces to search the database by content. MelodicMatch is a similar software analysing music through pattern recognition, enabling search for musical phrases in one or more pieces. MelodicMatch can search for melodies, rhythms and lyrics in MusicXML files.
The computation of similarity between music fragments is a central issue in MIR systems [10]. Most proposals focus on comparisons of the melodic profiles. Because music is subject to many small variations, approximate search is in order, and the problem is actually that of finding nearest neighbors to a given pattern. Many techniques have been experimented with, varying in the melodic encoding and the similarity measure; see [9,4,1,7] for some recent proposals. The Dynamic Time Warping (DTW) distance is a well-known, popular measure in speech recognition [21,20]. It allows the non-linear mapping of one signal to another by minimizing the distance between the two. The DTW distance is usually chosen over the less flexible Euclidean distance for time series alignment [5]. The DTW computation is rather slow, but recent works show that it can be efficiently indexed [25,15].
We are not aware of any general approach to model and query music notation. A possible approach would be to use XQuery over MusicXML documents, as suggested in [11]. XQuery is a general-purpose query language, and its use for music scores yields complicated expressions and hardly adapts to the specifics of the objects' representation (e.g., temporal sequences). We believe that a dedicated language is both more natural and more efficient. The temporal function approach outlined here can be related to time series management [17].
3 Architecture
3.1 Approach Overview
(Figure: overview of the large-scale score management system, with user queries expressed in the user query language (User QL), translated into the query algebra, and query results returned to the user.)
The Neuma platform is meant to interact with remote web applications with
local databases that store corpus-specific information. The purpose of Neuma
is to manage all music content information and to leave contextual data to the
client application (author, composer, date of publication, etc.).
To send a new document, an application calls the register() service. So far
only MusicXML documents can be used to exchange musical descriptions, but
any format could be used provided the corresponding mapping function is in
place. Since MusicXML is widely used, it is sufficient for now. The mapping
function extracts a representation of the music content of the document which
complies with our data model.
To publish a score, whether it is a score featured in the database or a
modified one, the render() service is called. The render() service is based on
the Lilypond package. The generator takes an instance of our model as input and
converts it into a Lilypond file. The importance of a unified data model appears
clearly in such an example: the render service is based on the model, making it
rather easy to visualize a transformed score, whereas it would be much more
difficult to do so if it were based solely on the document format.
A large collection of scores would be useless without an appropriate query()
service allowing reliable search by content. As explained before, the Neuma
digital library stores all music content (originating from different collections,
potentially with heterogeneous descriptions) in its repository and leaves
descriptive contextual data specific to collections in local databases. Regardless
of their original collection, music contents comply with our data model so that
they can be queried accordingly. Several query types are offered: exact,
transposed, with or without rhythm, or contour, which only takes into account the
shape of the input melody. The query() service combines content search with
descriptive data. A virtual keyboard is provided to enter music content, and
search fields can be filled in to address the local databases.
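To give an idea of what the transposition-invariant and contour query types reduce a melody to, here is a tiny Python sketch; the helper functions and the example pitch sequences are purely illustrative and are not part of the Neuma API.

```python
# Hypothetical illustration of the "transposed" and "contour" query types:
# each reduces a pitch sequence so that matching becomes invariant to the
# corresponding variation.
def intervals(pitches):
    # transposition-invariant form: successive pitch differences
    return [b - a for a, b in zip(pitches, pitches[1:])]

def contour(pitches):
    # contour form: only the direction of each interval (up / repeat / down)
    return [(d > 0) - (d < 0) for d in intervals(pitches)]

query = [67, 69, 71, 71, 69]
candidate = [60, 62, 64, 64, 62]                  # the same melody transposed down
print(intervals(query) == intervals(candidate))   # True: transposed match
print(contour(query) == contour(candidate))       # True: contour match
```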
The Neuma platform also provides an annotate() service. Annotations are a great
way to enrich the digital library and make sure it keeps growing and improving.
In order to use the annotate() service, one first selects part of a score (a set
of elements of the score) and enters information about this portion. There are
different kinds of annotations: free text (useful for performance indications),
or pre-selected terms from an ontology (for identifying music fragments).
Annotations can be queried alongside the other criteria previously mentioned.
4
4.1
A musical domain dom_music is a product domain combining heterogeneous musical
information. For example, the domain of a simple monodic score is
dom_music = dom_pitch × dom_rythm.
Any type of information that can be extracted from symbolic music (such as
measures, alterations, lyrics, etc.) can be added to the domain. Each domain
contains two distinguished values: the neutral value and the null value.
Note that the time domain is shared by the vocal part and the piano part.
For the same score, one could have made the different choice of a schema where
vocal and piano are represented in the same time series, of type
TS([vocal × polyMusic]).
The domain vocal adds the lyrics domain to the classic music domain:
dom_vocals = dom_pitch × dom_rythm × dom_lyrics.
The domain polymusic is the product
dom_polymusic = (dom_pitch × dom_rythm)^N.
We will now define two sets of operators gathered into two algebras: the
(extended) relational algebra and the time series algebra.
4.2
The Alg(R) algebra consists of the usual operators selection σ, product ×,
union ∪ and difference −, along with an extended projection π. We present
simple examples of these operators.
Selection, σ. Select all scores composed by Louis Couperin:
σ_{author='Louis Couperin'}(Score)
Seigneur (Score)
Projection, π. We want the vocal parts from the Score schema without the
piano part. We project the piano out:
π_{vocals}(Score)
Product, ×. Consider a collection of duets, split into the individual vocal parts
of male and female singers, with the following schemas:
Male_Part(Id : int, Voice : TS(vocals)),
Female_Part(Id : int, Voice : TS(vocals)).
To get the duet scores, we take the cross product of a female part and a male
part. We get the relation
Duet(Id : int, Male_V : TS(vocals), Female_V : TS(vocals)).
Note that the time domain is implicitly shared. In itself, the product does not
have much interest, but together with the selection operator it becomes the join
operator ⋈. In the previous example, we should not blindly associate any male
and female vocal parts, but only the ones sharing the same Id:
σ_{M.Id=F.Id}(Male × Female) ≡ Male ⋈_{Id=Id} Female.
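To make the duet example concrete, the following toy Python sketch runs the selection, projection and product over in-memory relations; the relation and attribute names follow the example above, but the functions themselves are our own illustration, not the platform's query engine.

```python
# Toy relational operators over lists of dictionaries, mirroring the
# Male/Female duet example: the join is a product followed by a selection
# on equal Ids.
male = [{"Id": 1, "Voice": "tenor-1"}, {"Id": 2, "Voice": "tenor-2"}]
female = [{"Id": 1, "Voice": "soprano-1"}, {"Id": 2, "Voice": "soprano-2"}]

def select(rel, pred):                    # sigma
    return [t for t in rel if pred(t)]

def project(rel, attrs):                  # pi
    return [{a: t[a] for a in attrs} for t in rel]

def product(r, s, pr, ps):                # cartesian product with attribute renaming
    return [{**{pr + k: v for k, v in a.items()},
             **{ps + k: v for k, v in b.items()}} for a in r for b in s]

duets = select(product(male, female, "M.", "F."), lambda t: t["M.Id"] == t["F.Id"])
print(project(duets, ["M.Id", "M.Voice", "F.Voice"]))
```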
Union, ∪. We want to join the two consecutive movements of a piece which
have two separate score instances:
Score = Score_pt1 ∪ Score_pt2.
The time series equivalent of the null attribute is the empty time series, for
which every event is the null value. Beyond the classical relational operators,
we introduce an emptiness test ∅? that operates on voices and is modeled as
follows: ∅?(s) = false if s(t) is the null value for every t, and ∅?(s) = true
otherwise. The emptiness test can be used in the selection formulas of the σ
operator.
Consider once more the relation Score. We want to select all scores featuring
the word 'Ave' in the lyrics part of V. We need a user function m on lyrics
such that m(l) = l if l contains 'Ave', and the null value otherwise. Lyrics
not containing 'Ave' are thus transformed into the empty time series. The
algebraic expression is:
σ_{∅?(W)}(π_{[Id, V, W: m(lyrics(V))]}(Score)).
4.3
We now present the operators of the time series algebra Alg(TS) = (∘, ⊕, A).
Each operator takes one or more time series as input and produces a time series.
Operating in closed form this way, operators can be composed. They allow, in
particular: altering the time domain in order to focus on specific instants
(external composition); applying a user function to one or more time series to
form a new one (addition operator); and windowing time series fragments for
matching purposes (aggregation).
In what follows, we take an instance s of the Score schema to run several
examples.
The external composition ∘ composes a time series s with an internal temporal
function. Assume our score s has two movements, and we only want the second
one. Let shift_n be an element of the shift family of functions parametrized by
a constant n ∈ N. For any t ∈ T, (s ∘ shift_n)(t) = s(t + n). In other words,
s ∘ shift_n is the time series extracted from s where the first n events are
ignored. We compose s with the shift_L function, where L is the length of the
first movement, and the resulting time series s ∘ shift_L is our result.
Imagine now that we want only the first note of every measure. Assuming we are
in 4/4 and the time unit is the sixteenth note, we compose s with warp_16. The
time series s ∘ warp_16 is the time series where only one out of sixteen events
is considered, hence the first note of every measure.
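The following short Python sketch mimics the external composition on a toy time series; the event values and the representation of a series as a function are simplifications of ours, intended only to make the shift and warp examples tangible.

```python
# A time series as a function from instants to events; composing it with an
# internal time function (shift_n, warp_k) re-addresses its time domain.
def make_series(events):
    return lambda t: events[t] if 0 <= t < len(events) else None   # None stands for the null value

def shift(n):
    return lambda t: t + n           # ignore the first n events

def warp(k):
    return lambda t: k * t           # keep one event out of k

def compose(s, f):
    return lambda t: s(f(t))         # (s o f)(t) = s(f(t))

s = make_series(["C4", "D4", "E4", "F4", "G4", "A4", "B4", "C5"])
second_part = compose(s, shift(4))
first_of_each_group = compose(s, warp(4))
print([second_part(t) for t in range(4)])          # ['G4', 'A4', 'B4', 'C5']
print([first_of_each_group(t) for t in range(2)])  # ['C4', 'G4']
```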
We now give an example of the addition operator ⊕. Let dom_pitch be the domain
of all musical notes and dom_int the domain of all musical intervals. We can
define an operation from dom_pitch × dom_pitch to dom_int, called the harm
operator, which takes two notes as input and computes the interval between them.
Given two time series each representing a vocal part, for instance V1 = soprano
and V2 = alto, we can define the time series
V1 ⊕_harm V2
of the harmonic progression (i.e., the sequence of intervals realized by the
juxtaposition of the two voices).
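Continuing the same toy representation, the addition operator can be sketched as a pointwise combination of two aligned voices; the semitone-difference version of harm below is an assumption of ours, since the paper does not fix the interval encoding.

```python
# The addition operator: combine two aligned time series pointwise with a
# user function; here harm returns the signed interval in semitones.
def add(s1, s2, op):
    return lambda t: op(s1(t), s2(t)) if s1(t) is not None and s2(t) is not None else None

def harm(p1, p2):
    return p1 - p2

soprano = lambda t: [72, 74, 76, 77][t] if 0 <= t < 4 else None
alto    = lambda t: [67, 69, 67, 65][t] if 0 <= t < 4 else None
progression = add(soprano, alto, harm)
print([progression(t) for t in range(4)])   # [5, 5, 9, 12]
```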
Last, we present the aggregation mechanism A. A typical operation that we
cannot yet obtain is windowing, which, at each instant, considers a local part
of a voice s and derives a value from this restricted view. A canonical example
is pattern matching. So far we have no generic way to compare a pattern P with
all subsequences of a time series s. The intuitive way to do pattern matching is
to build all subsequences of s and compare P with each of them.
(Figure: the aggregation mechanism, showing (a) a function s(t), a derivation step producing shifted series such as s ∘ shift_10, and an aggregation step combining the derived values over the domain.)
To end this section, we give an extended example using operators from Alg(R)
and Alg(TS). We want to compute, given a pattern P and a score, the instants t
where the Dynamic Time Warping (DTW) distance between P and V is less than 5.
First, we compute the DTW distance between P and the voice V1 of a score, at
each instant, thanks to the following Alg(TS) expression:
e = A_{dtw_P, shift}(V1)
where dtw_P is the function computing the DTW distance between a given time
series and the pattern P. Expression e defines a time series that gives, at each
instant t, the DTW distance between P and the sub-series of V1 that begins at t.
A selection (from Alg(TS)) keeps the values in e below 5, all others being set
to the null value. Let F be the formula that expresses this condition. Then:
e' = σ_F(e).
Finally, expression e' is applied to all the scores in the Score relation with
the π operator. An emptiness test can be used to eliminate those for which the
DTW is always higher than 5 (hence, e' is empty):
σ_{∅?(e')}(π_{[composer, V1, V2, e']}(Score)).
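A compact Python sketch of the aggregation idea may also help: the shift family exposes, at every instant, the suffix of the voice starting there, and a per-window scoring function is applied against the pattern. The mismatch count used below is a stand-in for the DTW of the running example, and the threshold-based selection at the end mirrors the "DTW < 5" condition in spirit only.

```python
# Windowing via the shift family: score the sub-series starting at each
# instant against a pattern P, then keep the instants whose score passes a
# threshold (here, exact matches under a toy mismatch count).
def aggregate(voice, pattern, score):
    return [score(voice[t:t + len(pattern)], pattern)
            for t in range(len(voice) - len(pattern) + 1)]

def mismatches(window, pattern):
    return sum(1 for a, b in zip(window, pattern) if a != b)

voice = [60, 62, 64, 65, 64, 62, 60, 62, 64, 65]
pattern = [62, 64, 65]
scores = aggregate(voice, pattern, mismatches)
print([t for t, s in enumerate(scores) if s == 0])   # [1, 7]: where the pattern occurs
```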
The language should have a precise semantics so that the intent of a query is
unambiguous. It should specify what is to be done, not how to do it. The
language should understand different kinds of expressions: queries as well as
definitions of user functions. Finally, the query syntax should be easily
understandable by a human reader.
5.1 Overview
5.2 Query Structure
The from clause should list at least one table from the database; the alias is
optional.
The let clause is optional. There can be as many let clauses as desired. Voices
modified in a let clause come from attributes of tables listed in the from
clause.
The construct clause should list at least one attribute, either from one of the
tables listed in the from clause or a modified attribute from a let clause.
The where clause is optional. It consists of a list of predicates, connected by
logical operators (And, Or, Not).
5.3
Query Syntax
The query has four main clauses: from, let, construct, where. In all that
follows, by attribute we mean both classical attributes and time series.
Attributes which appear in the construct, let or where clauses are written
either ColumnName, if there is no ambiguity, or TableName.ColumnName otherwise.
If the attribute is a time series and we want to refer to a specific voice, we
project by using the symbol ->. Precisely, a voice is written
ColumnName->VoiceName or TableName.ColumnName->VoiceName.
The from clause enumerates a list of tables from the database, with an optional
alias. The table names should refer to actual tables of the database. Aliases
should not be duplicated, nor should they be the name of an existing table.
The optional let clause applies a user function to an attribute. The attribute
should come from one of the tables listed in the from clause. When the attribute
is a time series, this is done by using map; when one wants to apply a binary
operator to two time series, map2 is used.
The construct clause lists the names of the attributes (or modified attributes)
which should appear in the query result.
The where clause evaluates a condition on an attribute or a modified attribute
introduced by the let clause. If the condition evaluates to true, then the line
it refers to is considered to be part of the query result. The lack of a where
clause is evaluated as always true.
The where clause supports the usual arithmetic (+, -, *, /), logical (And, Or,
Not) and comparison operators (=, <>, >, <, >=, <=, contains), which are used
inside predicates.
5.4
Query Evaluation
A query goes through several steps so that the instructions it contains can be
processed. A query can be simple (retrieve a record from the database) or
complex, with content manipulation and transformation. The results (if any) are
then returned to the user.
Query analysis. The expression entered by the user (either a query or a
user-defined function) is analyzed by the system. The system verifies the
query's syntax. The query is turned into an abstract syntax tree in which each
node is an algebraic operation (from bottom to top:
Algebraic Equivalences
In this section, we give the syntactic equivalent of all the algebraic operators previously introduced in Section 4, along with some examples.
Relational operators
Selection
algebraic notation: σ_F(R), where F is a formula and R a relation.
syntactic equivalent: where ... =
Projection
algebraic notation: π_A(R), where A is a set of attributes and R a relation.
syntactic equivalent: construct
Example: the expression
π_{id,voice}(σ_{composer='Fauré'}(Psalms))
is equivalent to
from       Score
construct  id, voice
where      composer = 'Fauré'
Extension to time series
Projection
algebraic notation: π_V(S), where V is a set of voices and S a time series.
syntactic equivalent: S->(voice_1, ..., voice_n)
Selection
algebraic notation: σ_F(V), where F is a formula and V is a set of voices of a
time series S.
syntactic equivalent: where S->V ... contains
Example: the expression
π_{id, π_{pitch,rythm}(voice)}(σ_{voice->lyrics contains 'Heureux les hommes' ∧ composer='Fauré'}(Score))
is equivalent to
from       Score
construct  id, voice->(pitch,rythm)
where      voice->lyrics contains 'Heureux les hommes'
and        composer = 'Fauré'
from       Male M, Female F
let        $duet := synch(M.Voice, F.Voice)
construct  $duet
where      M.id = F.id

from       Psalms
let        $transpose := map(transpose(1), voice->pitch)
construct  $transpose
the expression
π_{trumpet}(voice) ⊕_harm π_{clarinet}(voice)   (Duets)
is equivalent to
from       Duets
let        $harmonic_progression := map2(harm, trumpet->pitch, clarinet->pitch)
construct  $harmonic_progression
Composition
algebraic notation: S ∘ ℓ, where ℓ is an internal temporal function and S a
time series.
syntactic equivalent: comp(S, ℓ)
Aggregation - derivation
algebraic notation: A_{α,Λ}(S), where S is a time series, Λ is a family of
internal time functions and α is an aggregation function.
syntactic equivalent: derive(S, Λ, α).
The family of internal time functions Λ is a mapping from the time domain into
the set of internal time functions. Precisely, for each instant n, Λ(n) is an
internal time function. The two most used families of time functions, Shift and
Warp, are provided.
Example: the expression
π_{id, A_{dtw(P),shift}}(Psalm)
is equivalent to
from       Psalm
let        $dtwVal := derive(voice, Shift, dtw(P))
construct  id, $dtwVal
References
1. Allan, H., Müllensiefen, D., Wiggins, G.A.: Methodological Considerations in
Studies of Musical Similarity. In: Proc. Intl. Society for Music Information Retrieval
(ISMIR) (2007)
2. Anglade, A., Dixon, S.: Characterisation of Harmony with Inductive Logic
Programming. In: Proc. Intl. Society for Music Information Retrieval (ISMIR)
(2008)
3. Anglade, A., Dixon, S.: Towards Logic-based Representations of Musical Harmony for Classification, Retrieval and Knowledge Discovery. In: MML (2008)
4. Berman, T., Downie, J., Berman, B.: Beyond Error Tolerance: Finding Thematic
Similarities in Music Digital Libraries. In: Proc. European Conf. on Digital Libraries, pp. 463–466 (2006)
5. Berndt, D., Clifford, J.: Using dynamic time warping to find patterns in time series.
In: AAAI Workshop on Knowledge Discovery in Databases, pp. 229–248 (1994)
6. Brockwell, P.J., Davis, R.: Introduction to Time Series and forecasting. Springer,
Heidelberg (1996)
7. Cameron, J., Downie, J.S., Ehmann, A.F.: Human Similarity Judgments: Implications for the Design of Formal Evaluations. In: Proc. Intl. Society for Music
Information Retrieval, ISMIR (2007)
8. Capela, A., Rebelo, A., Guedes, C.: Integrated recognition system for music scores.
In: Proc. of the 2008 International Computer Music Conference (2008)
9. Downie, J., Nelson, M.: Evaluation of a simple and effective music information
retrieval method. In: Proc. ACM Symp. on Information Retrieval (2000)
10. Downie, J.S.: Music Information Retrieval. Annual Review of Information Science
and Technology 37, 295–340 (2003)
11. Ganseman, J., Scheunders, P., D'haes, W.: Using XQuery on MusicXML Databases
for Musicological Analysis. In: Proc. Intl. Society for Music Information Retrieval
(ISMIR) (2008)
12. Good, M.: MusicXML in practice: issues in translation and analysis. In: Proc. 1st
International Conference on Musical Applications Using XML, pp. 47–54 (2002)
13. Haus, G., Longari, M., Pollastri, E.: A Score-Driven Approach to Music Information
Retrieval. Journal of the American Society for Information Science and Technology 55,
1045–1052 (2004)
14. Huron, D.: Music information processing using the Humdrum toolkit: Concepts,
examples and lessons. Computer Music Journal 26, 11–26 (2002)
15. Keogh, E.J., Ratanamahatana, C.A.: Exact Indexing of Dynamic Time Warping.
Knowl. Inf. Syst. 7(3), 358–386 (2003)
16. Knopke, I.: The PerlHumdrum and PerlLilypond Toolkits for Symbolic Music Information Retrieval. In: Proc. Intl. Society for Music Information Retrieval (ISMIR)
(2008)
17. Lee, J.Y., Elmasri, R.: An EER-Based Conceptual Model and Query Language for
Time-Series Data. In: Proc. Intl. Conf. on Conceptual Modeling, pp. 21–34 (1998)
18. Lerner, A., Shasha, D.: AQuery: Query language for ordered data, optimization techniques and experiments. In: Proc. of the 29th VLDB Conference, Berlin,
Germany (2003)
19. Müller, M.: Information Retrieval for Music and Motion. Springer, Heidelberg
(2004)
20. Rabiner, L., Rosenberg, A., Levinson, S.: Considerations in dynamic time warping
algorithms for discrete word recognition. IEEE Trans. Acoustics, Speech and Signal
Proc. ASSP-26, 575–582 (1978)
21. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken
word recognition. IEEE Trans. Acoustics, Speech and Signal Proc. ASSP-26, 43–49
(1978)
22. Sapp, C.S.: Online Database of Scores in the Humdrum File Format. In: Proc. Intl.
Society for Music Information Retrieval (ISMIR) (2005)
23. Typke, R., Wiering, F., Veltkamp, R.C.: A Survey of Music Information Retrieval
Systems. In: Proc. Intl. Society for Music Information Retrieval (ISMIR) (2005)
24. Viglianti, R.: MusicXML: An XML based approach to automatic musicological
analysis. In: Conference Abstracts of the Digital Humanities (2007)
25. Zhu, Y., Shasha, D.: Warping Indexes with Envelope Transforms for Query by
Humming. In: Proc. ACM SIGMOD Symp. on the Management of Data, pp. 181–192
(2003)
26. Mutopia, http://www.mutopiaproject.org (last viewed February 2011)
27. Melodicmatch, http://www.melodicmatch.com (last viewed February 2011)
28. Musipedia, http://www.musipedia.org (last viewed February 2011)
29. Wikifonia, http://www.wikifonia.org (last viewed February 2011)
30. Neuma, http://neuma.fr (last viewed February 2011)
Abstract. In this paper, we show how to apply the framework of mathematical
morphology (MM) in order to improve error tolerance in content-based music
retrieval (CBMR) when dealing with approximate retrieval of polyphonic,
symbolically encoded music. To this end, we introduce two algorithms based on
the MM framework and carry out experiments to compare their performance against
well-known algorithms earlier developed for CBMR problems. Although, according
to our experiments, the new algorithms do not perform quite as well as the
rivaling algorithms in a typical query setting, they make it easy to adjust the
desired error tolerance. Moreover, in certain settings the new algorithms even
become faster than their existing counterparts.
Keywords: MIR, music information retrieval, mathematical morphology, geometric
music retrieval, digital image processing.
1 Introduction
Fig. 1. (a) The first two measures of Bach's Invention 1. (b) The same polyphonic
melody cast into a 2-D binary image. (c) A query pattern image with one extra note
and various time and pitch displacements. (d) The resulting image after a blur rank
order filtering operation, showing us the potential matches.
matching. Moreover, our approach provides the user with an intuitive, visual way
of defining the allowed approximations for the query at hand. In [8], Karvonen
and Lemström suggested the use of this framework for music retrieval purposes.
We extend and complement their ideas, introduce and implement new algorithms,
and carry out experiments to show their efficiency and effectiveness.
The motivation to use symbolic methods is twofold. Firstly, there is a multitude
of symbolic music databases where audio methods are naturally of no use. In
addition, symbolic methods allow for distributed matching, i.e., occurrences of
a query pattern are allowed to be distributed across the instruments (voices) or
to be hidden in some other way in the matching fragments of the polyphonic
database. The corresponding symbolic and audio files may be aligned by using
mapping tools [7] in order to be able to play back the matching part in audio
form.
1.1
In this paper we deal with symbolically encoded, polyphonic music for which we
use the pointset representation (the pitch-against-time representation of
note-on information), as suggested in [15], or its extended version, the
horizontal-line-segment representation [14], where note durations are also
explicitly given. The latter representation is equivalent to the well-known
piano-roll representation (see e.g. Fig. 1(b)), while the former omits the
duration information of the line segments and uses only the onset information of
the notes (the starting points of the horizontal line segments). As opposed to
the algorithms based on point-pattern matching, where the piano-roll
representation is a mere visualization of the underlying representation, here
the visualization is the representation: the algorithms to be given operate on
binary images of the onset points or the horizontal line segments that
correspond to the notes of the given query pattern and the database.
Let us denote by P the pattern to be searched for in a database, denoted by T.
We will consider the three problems, P1-P3, specified in [14], and their
2 Background
2.1 Related Work
2.2 Mathematical Morphology
images in a very similar way to conventional image filters. However, the focus
in MM-based methods is often on extracting attributes and geometrically
meaningful data from images, as opposed to generating filtered versions of
images.
In MM, sets are used to represent objects in an image. In binary images, the
sets are members of the 2-D integer space Z². The two fundamental morphological
operations, dilation and erosion, are non-linear neighbourhood operations on two
sets. They are based on the Minkowski addition and subtraction [6]. Of the two
sets, the typically smaller one is called the structuring element (SE).
Dilation performs a maximum over the SE, which has a growing effect on the
target set, while erosion performs a minimum over the SE and causes the target
set to shrink. Dilation can be used to fill gaps in an image, for instance,
connecting the breaks in letters in a badly scanned image of a book page.
Erosion can be used, for example, for removing salt-and-pepper type noise. One
way to define
dilation is
A ⊕ B = {f ∈ Z² | (B̂ + f) ∩ A ≠ ∅},   (1)
where A is the target image, B is the SE, and B̂ its reflection (or rotation by
180 degrees). Accordingly, erosion can be written
A ⊖ B = {f ∈ Z² | (B + f) ⊆ A}.   (2)
Erosion itself can be used for pattern matching: foreground pixels in the
resulting image mark the locations of the matches. Any shape, however, can be
found in an image filled with foreground. If the background also needs to match,
erosion has to be applied separately to the negations of the image and the
structuring element. Intersecting these two erosions leads to the desired
result. This procedure is commonly known as the hit-or-miss transform or
hit-miss transform (HMT):
HMT(A, B) = (A ⊖ B) ∩ (A^C ⊖ B^C).   (3)
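To make the two erosions and their intersection concrete, here is a small numpy sketch of erosion-based matching and the hit-or-miss transform on toy binary piano-roll images. It is only an illustration of the definitions above; the array sizes and note positions are arbitrary, and this is not the Leptonica-based implementation used later in the paper.

```python
import numpy as np

# Erosion as pattern matching on binary (pitch x time) images: a foreground
# pixel in the output marks a translation where every query pixel is covered.
def erode(image, se):
    h, w = se.shape
    out = np.zeros((image.shape[0] - h + 1, image.shape[1] - w + 1), dtype=bool)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + h, j:j + w]
            out[i, j] = np.all(window[se])          # all SE foreground pixels covered
    return out

# Hit-or-miss transform, eq. (3): the foreground must match the SE and the
# background must match its complement.
def hit_or_miss(image, se):
    return erode(image, se) & erode(~image, ~se)

db = np.zeros((4, 8), dtype=bool)
db[1, 1], db[2, 3], db[1, 5] = True, True, True     # three database "notes"
query = np.zeros((2, 3), dtype=bool)
query[0, 0], query[1, 2] = True, True               # a two-note query pattern
print(np.argwhere(erode(db, query)))                # [[1 1]]: the pattern occurs once
print(np.argwhere(hit_or_miss(db, query)))          # same here, since no extra notes intrude
```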
3 Algorithms
In [8], Karvonen and Lemström introduced four algorithms based on the
mathematical morphology framework and gave their MATLAB implementations. Our
closer examination revealed common principles behind the four algorithms; three
of them were virtually identical to each other.
The principles on which our two algorithms to be introduced rely are explained
by Bloomberg and Maragos [2]. Having HMT as the main means of generalizing
erosion, they present three more, which can be combined in various ways. They
also name a few of the combinations. Although we can find some use for HMT, its
benefit is not significant in our case. But two of the other tricks proved to be
handy, and particularly their combination, which is not mentioned by Bloomberg
and Maragos.
We start with erosion as the basic pattern matching operation. The problem with
erosion is its lack of flexibility: every note must match and no jittering is
tolerated. Performing the plain erosion solves problem P1. We present two ways
to gain flexibility:
- Allow partial matches. This is achieved by moving from P1 to P2.
- Handle jittering. This is achieved by moving from P1 to AP1.
Of the pointset problems, only AP2 now remains unconsidered. It can be solved,
however, by combining the two tricks above. We will next explain how these
improvements can be implemented. First, we concentrate on the pointset
representation; then we deal with line segments.
3.1
For a match to be found with plain erosion, having applied a translation, the
whole foreground area of the query needs to be covered by the database
foreground. In the pointset representation this means that there needs to be a
corresponding database note for each note in the query. To allow missing notes,
covering only some specified portion of the query foreground suffices. This is
achieved by replacing erosion with a more general filter. For such a
generalization, Bloomberg and Maragos propose a binary rank order filter (ROF)
and threshold convolution (TC). In addition to them, one of the algorithms in
[8] was based on correlation. These three methods are connected to each other,
as discussed next.
For every possible translation f, the binary rank order filter counts the ratio
|(P + f) ∩ T| / |P|,
where |P| is the number of foreground pixels in the query. If the ratio is
greater than or equal to a specified threshold value, it leaves a mark in the
resulting image, representing a match. This ratio can be seen as a confidence
score (i.e., a probability) that the query foreground occurs in the database
foreground at some point. It is noteworthy that plain erosion is a special case
of the binary ROF, where the threshold ratio is set to 1. By lowering the
threshold we impose looser conditions than plain erosion on detecting the query.
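As a concrete (and deliberately unoptimized) counterpart to this description, the following numpy sketch evaluates the ratio |(P + f) ∩ T| / |P| for every translation and keeps those above a threshold. The correlation/FFT formulation the authors actually use is not shown here; the example images and the threshold value are our own.

```python
import numpy as np

# Binary rank order filter by direct counting: for every translation of the
# query P over the database T, compute the fraction of query pixels covered
# by database foreground and keep translations reaching the threshold.
def rank_order_filter(T, P, threshold):
    h, w = P.shape
    hits = []
    for i in range(T.shape[0] - h + 1):
        for j in range(T.shape[1] - w + 1):
            ratio = np.count_nonzero(T[i:i + h, j:j + w] & P) / np.count_nonzero(P)
            if ratio >= threshold:
                hits.append((i, j, ratio))
    return hits

T = np.zeros((4, 10), dtype=bool)
T[1, [1, 3, 5]] = True                      # three database notes
P = np.zeros((1, 6), dtype=bool)
P[0, [0, 2, 4]] = True                      # a three-note query
# threshold 1.0 reproduces plain erosion; 0.66 also accepts matches missing one note
print(rank_order_filter(T, P, threshold=0.66))
```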
Correlation and convolution operate on greyscale images. Although we deal with
binary images, these operations are useful because ROF can be implemented using
correlation and thresholding. When using convolution as a pattern matching
operation, it can be seen as a way to implement correlation: rotating the query
pattern 180 degrees and then performing convolution on real-valued data has
almost the same effect as performing correlation, the only difference being that
the resulting marks appear in the top-left corner instead of the bottom-right
corner of the match region. Both can be effectively implemented using the Fast
Fourier Transform (FFT). Because of this relation between correlation and
convolution, ROF and TC are actually theoretically equivalent.
When solving P2, one may want to search for maximal partial matches (instead of
threshold matches). This is straightforwardly achieved by implementing ROF using
correlation. Although our ROF implementation is based on correlation, we will
call the method ROF since it offers all the needed functionality.
3.2
Tolerating Jittering
Let us next explain the main asset of our algorithms as compared to the previous
algorithms. In order to tolerate jittering, the algorithms should be able to
find corresponding database elements not only in the exact positions of the
translated query elements, but also in their near proximity.
Bloomberg and Vincent [3] introduced a technique for adding such toleration into
HMT. They call it the blur hit-miss transform (BHMT). The trick is to dilate the
database images (both the original and the complement) by a smaller, disc-shaped
structuring element before the erosions are performed. This can be written
BHMT(A, B1, B2, R1, R2) = [(A ⊕ R1) ⊖ B1] ∩ [(A^C ⊕ R2) ⊖ B2],   (4)
where A is the database and A^C its complement, B1 and B2 are the query
foreground and background, and R1 and R2 are the blur SEs. The technique is also
eligible for plain erosion. We choose this method for jitter toleration and call
it blur erosion:
A ⊖_b (B, R) = (A ⊕ R) ⊖ B.   (5)
The shape of the dilation SE used in this preprocessing does not have to be a
disc. In our case, where the dimensions under consideration are time and pitch,
a natural setting comprises user-specified thresholds for the dimensions. This
leads us to rectangular SEs with efficient implementations. In practice,
dilation is useful in the time dimension, but applying it in the pitch dimension
often results in false (positive) matches. Instead, a blur of just one semitone
is very useful because the queries often contain pitch quantization errors.
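The following numpy sketch illustrates blur erosion as defined in eq. (5): the database is dilated by a small rectangular SE (a few pixel columns in time, optionally a semitone in pitch) and then eroded by the query, so that slightly displaced notes still match. The SE sizes and the toy images are arbitrary choices of ours, not the settings used in the experiments.

```python
import numpy as np

def dilate(image, se_height, se_width):
    # dilation by a centered rectangular SE, applied around every foreground pixel
    out = np.zeros_like(image)
    rh, rw = se_height // 2, se_width // 2
    for i, j in np.argwhere(image):
        out[max(0, i - rh):i + rh + 1, max(0, j - rw):j + rw + 1] = True
    return out

def erode(image, query):
    h, w = query.shape
    out = np.zeros((image.shape[0] - h + 1, image.shape[1] - w + 1), dtype=bool)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.all(image[i:i + h, j:j + w][query])
    return out

def blur_erosion(db, query, blur_pitch=1, blur_time=3):
    # eq. (5): (A dilated by R) eroded by B
    return erode(dilate(db, blur_pitch, blur_time), query)

db = np.zeros((5, 12), dtype=bool)
db[2, 2], db[3, 6] = True, True             # two notes, the second one column "late"
query = np.zeros((2, 4), dtype=bool)
query[0, 0], query[1, 3] = True, True       # expects the second note one column earlier
print(np.argwhere(erode(db, query)))        # empty: no exact occurrence
print(np.argwhere(blur_erosion(db, query))) # non-empty: the blur tolerates the jitter
```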
3.3
(Fig. 2. The relations between the discussed methods: erosion, hit-miss transform, rank order filter, blur erosion, hit-miss rank order filter, blur hit-miss transform, blur rank order filter, and blur hit-miss rank order filter.)
By applying ROF we allow missing notes, thus being able to solve problem P2.
Jitter toleration is achieved by using blurring, thus solving AP1. In order to
be able to solve AP2, we combine these two. In order to correctly solve AP2, the
dilation has to be applied to the database image. With blurred ROF a speed-up
can be obtained (at the cost of false positive matches) by dilating the query
pattern instead of the database image. If there is no need to adjust the
dilation SE, the blur can be applied to the database in a preprocessing phase.
Note also that if both the query and the database were dilated, it would
increase the distance between the query elements and the corresponding database
elements, which, subsequently, would gradually decrease the overlapping area.
Figure 2 illustrates the relations between the discussed methods. Our interest
is in blur erosion and blur ROF (underlined in the figure), because they can be
used to solve the approximate problems AP1 and AP2.
3.4
Blur ROF is also applicable for solving AP3. In this case, however, the blur is
not as essential: in the case of an approximate occurrence, even if there were
some jittering in the time dimension, a crucial portion of the line segments
would typically still overlap. Indeed, ROF without any blur solves exactly
problem P3.
By using blur erosion without ROF on line segment data, we get an algorithm that
does not have an existing counterpart. Plain erosion is like P3 with the extra
requirement of full matches only; the blur then adds error toleration to the
process.
3.5
Thus far we have not been interested in what happens in the background of an
occurrence; we have just searched for occurrences of the query pattern that are
intermingled in the polyphonic texture of the database. If, however, no extra
notes were allowed in the time span of an occurrence, we would have to consider
the background as well. This is where we need the hit-miss transform. Naturally,
HMT is also applicable for decreasing the number of false positives in cases
where the query is assumed to be comprehensive. With HMT as the third way of
generalizing erosion, we complement the classification in Figure 2.
Combining HMT with the blur operation, we have to slightly modify the original
BHMT to meet the special requirements of the domain. In the original form, when
matching the foreground, tiny background dots are ignored because they are
considered to be noise. In our case, with notes represented as single pixels or
thin line segments, all the events would be ignored during the background
matching; the background would always match. To achieve the desired effect,
instead of dilating the complemented database image, we need to erode the
complemented query image by the same SE:
BHMT*(A, B1, B2, R1, R2) = [(A ⊕ R1) ⊖ B1] ∩ [A^C ⊖ (B2 ⊖ R2)],   (6)
where A is the database and A^C its complement, B1 and B2 are the query
foreground and background, and R1 and R2 are the blur SEs. If B2 is the
complement of B1, we can write B = B1 and use the form
BHMT*(A, B, R1, R2) = [(A ⊕ R1) ⊖ B] ∩ [A^C ⊖ (B ⊕ R2)^C].   (7)
4 Experiments
The algorithms presented in this paper set new standards for finding approximate
occurrences of a query pattern in a given database. There are no rivaling
algorithms in this sense, so we are not able to fairly compare the performance
of our algorithms to any existing algorithm. However, to give the reader a sense
of the real-life performance of these approximate algorithms, we compare their
running times to those of the existing, non-approximate algorithms. Essentially
this means that we are comparing the performance of the algorithms able to solve
AP1-AP3 to the ones that can solve only P1, P2 and P3 [14].
Fig. 3. The effect of changing the time resolution (a) on blur erosion (on the left) and (b) on blur correlation (on the right); running time (ms) as a function of the time resolution (pixel columns per second), for the pointset and line segment representations.
In this paper we have sketched eight algorithms based on mathematical
morphology. In our experiments we focus on two of them: blur erosion and blur
ROF, which can be applied to solve problems AP1-AP3. The special cases of these
algorithms, where blur is not applied, are plain erosion and ROF. As our
implementation of blur ROF is based on correlation, we will call it blur
correlation from now on. As there are no competitors that solve problems
AP1-AP3, we set our new algorithms against the original geometric algorithms
named after the problem specifications P1, P2 and P3 [14] to get an idea of
their practical performance.
For the performance of our algorithms, the implementations of dilation, erosion
and correlation are crucial. For dilation and erosion, we rely on Bloomberg's
Leptonica library. Leptonica offers an optimized implementation for
rectangle-shaped SEs, which we can utilize in the case of dilation. On the other
hand, our erosion SEs tend to be much more complex and larger in size (we erode
the databases with the whole query patterns). For correlation we use the Fast
Fourier Transform implemented in the FFTW library. This operation is
computationally quite heavy compared to erosion, since it has to operate with
floating-point complex numbers.
The performance of the reference algorithms, used to solve the original,
non-approximate problems, depends mostly on the number of notes in the database
and in the query. We experiment on how the algorithms scale up as a function of
the database and query pattern lengths. It is also noteworthy that the note
density can make a significant difference in performance, as the time
consumption of our algorithms mostly grows along with the size of the
corresponding images.
The database we used in our experiments consists of MIDI files from the Mutopia
collection¹ that contains over 1.4 million notes. These files were converted to
various other formats required by the algorithms, such as binary images of
¹ http://www.mutopiaproject.org/
(Figure: running times (ms) of the algorithms as a function of the pattern size (notes) and of the database size (thousands of notes), for P1, P2, blur erosion, blur correlation and MSM.)
pointset and line segment types. For the experiments on the effects of varying
pattern sizes, we randomly selected 16 pieces out of the whole database, each
containing 16,000 notes. Five distinct queries were randomly chosen, and the
median of their execution times was reported. When experimenting with varying
database sizes, we chose a pattern size of 128 notes.
The size of the images is also a major concern for the performance of our
algorithms. We represent the pitch dimension with 128 pixels, since the MIDI
pitch value range consists of 128 possible values. The time dimension, however,
poses additional problems: it is not intuitively clear what would make a good
time resolution. If we use too many pixel columns per second, the performance of
our algorithms will be slowed down significantly. On the flip side of the coin,
not using enough pixels per second would result in a loss of information, as we
would no longer be able to distinguish separate notes in rapid passages. Before
running the actual experiments, we decided to experiment on finding a suitable
time resolution efficiency-wise.
4.1
We tested the effect of increasing time resolution on both blur erosion and blur
correlation, and the results can be seen in Figure 3. With blur erosion, we can
see a clear difference between the pointset representation and the line segment
representation: in the line segment case, the running time of blur erosion seems
to grow quadratically in relation to the growing time resolution, while in the
pointset case, the growth rate seems to be clearly slower. This can be explained
by the fact that the execution time of erosion depends on the size of the query
foreground. In the pointset case, we still only mark the beginning point of the
notes, so only the SEs require extra space; in the line segment case, however,
the foreground growth is clearly linear.
In the case of blur correlation, there seems to be nearly no difference whether
the input is in pointset or line segment form. The pointset and line segment
curves in the blur correlation figure coincide, so we depicted only one of the
curves in this case.
Looking at the results, one can note that the time usage of blur erosion begins
to grow quickly around 12 to 16 pixel columns per second. Considering that we do
not want the performance of our algorithms to suffer too much, and the fact that
we are deliberately getting rid of some nuances of information by blurring, we
were encouraged to set the time resolution as low as 12 pixel columns per
second. This time resolution was used in the further experiments.
4.2
Pointset Representation
(Figure: running times (ms) of P3, blur erosion and blur correlation as a function of the pattern size (notes) and of the database size (thousands of notes).)
performance difference between the partial matching algorithms, blur
correlation, MSM and P2, is less radical. P2 is clearly the fastest of these
with small query sizes, but as its time consumption grows with longer queries,
it becomes the slowest with very large query sizes. Nevertheless, even with
small query sizes, we believe that the enhanced error toleration is worth the
extra time it requires.
4.3
Fig. 7. The subject used as a search pattern and the first (approximate) match.
Fig. 8. A match found by blur erosion (a). An exact match found by both P1 and blur
erosion (b). This entry has too much variation even for blur erosion (c).
Our experiments also confirmed our claim that blur correlation handles jittering
better than P3. We were expecting this, and Figure 6 illustrates an idiomatic
case where P3 will not find a match but blur correlation will. In this case we
have a query pattern that is an excerpt of the database, with the distinction
that some of the notes have been tilted either time-wise or pitch-wise.
Additionally, one note has been split into two. Blur correlation finds a perfect
match in this case, whereas P3 cannot, unless the threshold for the total common
length is exceeded. We were expecting this kind of result, since intuitively P3
cannot handle this kind of jittering as well as the morphological algorithms do.
4.4
Fig. 9. Some more developed imitations of the theme with their proportions of exactly
matching notes (6/11, 4/11, 7/11 and 11/11).
half of the notes match exactly the original form of the theme (see Figure 9).
Nevertheless, these imitations are fairly easily recognized visually and audibly.
Our blur-erosion algorithm found them all.
5 Conclusions
think that the slowdown is rather restrained compared to the added usability of
the algorithms. We expect our error-tolerant methods to give better results in
real-world applications when compared to the rivaling algorithms. As future
work, we plan on researching and setting up a relevant ground truth, since
without a ground truth we cannot adequately measure the precision and recall of
the algorithms. Other future work could include investigating the use of
greyscale morphology for introducing more fine-grained control over
approximation.
Acknowledgements
This work was partially supported by the Academy of Finland, grants #108547,
#118653, #129909 and #218156.
References
1. Barrera Hernández, A.: Finding an o(n² log n) algorithm is sometimes hard. In:
Proceedings of the 8th Canadian Conference on Computational Geometry, pp.
289–294. Carleton University Press, Ottawa (1996)
2. Bloomberg, D., Maragos, P.: Generalized hit-miss operators with applications to
document image analysis. In: SPIE Conference on Image Algebra and Morphological Image Processing, pp. 116–128 (1990)
3. Bloomberg, D., Vincent, L.: Pattern matching using the blur hit-miss transform.
Journal of Electronic Imaging 9(2), 140–150 (2000)
4. Clausen, M., Engelbrecht, R., Meyer, D., Schmitz, J.: Proms: A web-based tool for
searching in polyphonic music. In: Proceedings of the International Symposium on
Music Information Retrieval (ISMIR 2000), Plymouth, MA (October 2000)
5. Clifford, R., Christodoulakis, M., Crawford, T., Meredith, D., Wiggins, G.: A
fast, randomised, maximal subset matching algorithm for document-level music
retrieval. In: Proceedings of the 7th International Conference on Music Information Retrieval (ISMIR 2006), Victoria, BC, Canada, pp. 150–155 (2006)
6. Heijmans, H.: Mathematical morphology: A modern approach in image processing
based on algebra and geometry. SIAM Review 37(1), 1–36 (1995)
7. Hu, N., Dannenberg, R., Tzanetakis, G.: Polyphonic audio matching and alignment
for music retrieval. In: Proc. IEEE WASPAA, pp. 185–188 (2003)
8. Karvonen, M., Lemström, K.: Using mathematical morphology for geometric music
information retrieval. In: International Workshop on Machine Learning and Music
(MML 2008), Helsinki, Finland (2008)
9. Lemström, K.: Towards more robust geometric content-based music retrieval. In:
Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR 2010), Utrecht, pp. 577–582 (2010)
10. Lemström, K., Mikkilä, N., Mäkinen, V.: Filtering methods for content-based
retrieval on indexed symbolic music databases. Journal of Information Retrieval 13(1), 1–21 (2010)
11. Lubiw, A., Tanur, L.: Pattern matching in polyphonic music as a weighted geometric translation problem. In: Proceedings of the 5th International Conference on
Music Information Retrieval (ISMIR 2004), Barcelona, pp. 289–296 (2004)
12. Romming, C., Selfridge-Field, E.: Algorithms for polyphonic music retrieval: The
Hausdorff metric and geometric hashing. In: Proceedings of the 8th International
Conference on Music Information Retrieval (ISMIR 2007), Vienna, Austria (2007)
13. Typke, R.: Music Retrieval based on Melodic Similarity. Ph.D. thesis, Utrecht
University, Netherlands (2007)
14. Ukkonen, E., Lemström, K., Mäkinen, V.: Geometric algorithms for transposition
invariant content-based music retrieval. In: Proceedings of the 4th International
Conference on Music Information Retrieval (ISMIR 2003), Baltimore, MD, pp.
193–199 (2003)
15. Wiggins, G.A., Lemström, K., Meredith, D.: SIA(M)ESE: An algorithm for transposition invariant, polyphonic content-based music retrieval. In: Proceedings of the
3rd International Conference on Music Information Retrieval (ISMIR 2002), Paris,
France, pp. 283–284 (2002)
1 Introduction
The problem of Symbolic Melodic Similarity, where musical pieces similar to a query
should be retrieved, has been approached from very different points of view [24][6].
Some techniques are based on string representations of music and editing distance
algorithms to measure the similarity between two pieces [17]. Later work has extended
this approach with other dynamic programming algorithms to compute global or
local alignments between the two musical pieces [19][11][12]. Other methods rely on
music pieces as geometric objects, using different techniques to calculate the melodic
similarity based on the geometric similarity of the two objects. Some of these
geometric methods represent music pieces as sets of points in the pitch-time plane,
and then compute geometric similarities between these sets [26][23][7]. Others
represent music pieces as orthogonal polynomial chains crossing the set of pitch-time
points, and then measure the similarity as the minimum area between the two chains
[30][1][15].
In this paper we present a new model to compare melodic pieces. We adapted the
local alignment approach to work with n-grams instead of with single notes, and the
corresponding substitution score function between n-grams was also adapted to take
into consideration a new geometric representation of musical sequences. In this
geometric representation, we model music pieces as curves in the pitch-time plane,
and compare them in terms of their shape similarity.
In the next section we outline several problems that a symbolic music retrieval
system should address, and then we discuss the general solutions given in the
literature to these requirements. Next, we introduce our geometric representation
model, which compares two musical pieces by their shape, and see how this model
addresses the requirements discussed. In section 5 we describe how we have
implemented our model, and in section 6 we evaluate it with the training and
evaluation test collections used in the MIREX 2005 Symbolic Melodic Similarity task
(for short, we will refer to these collections as Train05 and Eval05) [10][21][28].
Finally, we finish with conclusions and lines for further research. An appendix reports
more evaluation results at the end.
metadata or a simple traverse through the sequence. We argue that users without such
a strong musical background will be interested in the recognition of a certain pitch
contour, and such cases are much more troublesome because some measure of melodic
similarity has to be calculated. This is the case of query by humming applications.
2.1.2 Degree Equality
The score at the top of Fig. 1 shows a melody in the F major tonality, as well as the
corresponding pitch and tonality-degree for each note. Below, Fig. 2 shows exactly
the same melody shifted 7 semitones downwards to the Bb major tonality.
The tonality-degrees used in both cases are the same, but the resultant notes are
not. Nonetheless, one would consider the second melody a version of the first one,
because they are the same in terms of pitch contour. Therefore, they should be
considered the same one by a retrieval system, which should also consider possible
modulations where the key changes somewhere throughout the song.
2.1.3 Note Equality
We could also consider the case where exactly the same melodies, with exactly the
same notes, are written in different tonalities and, therefore, each note corresponds to
a different tonality-degree in each case. Fig. 3 shows such a case, with the same
melody as in Fig. 1, but in the C major tonality.
Although the degrees do not correspond one to each other, the actual notes do, so
both pieces should be considered the same one in terms of melodic similarity.
Indeed, if this piece were played with a flute, only one voice could be performed,
even if some streaming effect were produced by changing tempo and timbre for two
voices to be perceived by a listener [16]. Therefore, a query containing only one voice
should match with this piece in case that voice is similar enough to any of the three
marked in the figure.
2.2 Horizontal Requirements
Horizontal requirements regard the time dimension of music: time signature
equivalence, tempo equivalence, duration equality and duration variation. A retrieval
model that meets the second and third requirements is usually regarded as time scale
invariant.
2.2.1 Time Signature Equivalence
The top of Fig. 6 depicts a simplified version of the beginning of op. 81 no. 10 by S.
Heller, with its original 2/4 time signature. If a 4/4 time signature were used, like in
the bottom of Fig. 6, the piece would be split into bars of duration 4 crotchets each.
whole score is played twice as fast, at 224 crotchets per minute. These two changes
result in exactly the same actual time.
On the other hand, one might also consider a tempo of 56 crotchets per minute and
notes with half the duration. Moreover, the tempo can change somewhere in the
middle of the melody, and therefore change the actual time of each note afterwards.
Therefore, actual note lengths cannot be considered as the only horizontal measure,
because these three pieces would sound the same to any listener.
2.2.3 Duration Equality
If the melody at the top of Fig. 6 were played slower or quicker by means of a tempo
variation, but maintaining the rhythm, an example of the result would be like the
score in Fig. 8.
Even though the melodic perception does actually change, the rhythm does not,
and neither does the pitch contour. Therefore, they should be considered as virtually
the same, maybe with some degree of dissimilarity based on the tempo variation.
2.2.4 Duration Variation
As with the Pitch Variation problem, sometimes a melody is altered by changing only
the rhythm of a few notes. For instance, the melody in Fig. 9 maintains the same pitch
contour as in Fig. 6, but changes the duration of some notes.
Variations like these are common and they should be considered as well, just
like the Pitch Variation problem, allowing approximate matches instead of just exact
ones.
between their first derivatives (measured with the integral over the absolute value of
their difference):
diff(C, D) = ∫ |C'(t) − D'(t)| dt   (1)
The representation with orthogonal polynomial chains also led to measuring
dissimilarity as the area between the curves [30][1]. However, such a representation
is not directly transposition invariant unless it uses pitch intervals instead of absolute
pitch values, and a more complex algorithm is needed to overcome this problem [15].
As orthogonal chains are not differentiable, this would be the indirect equivalent to
calculating the first derivative as we do.
This dissimilarity measurement based on the area between curves turns out to be a
metric function, because it has the following properties:
- Non-negativity, diff(C, D) ≥ 0: because the absolute value is never negative.
- Identity of indiscernibles, diff(C, D) = 0 ⟺ C = D: because, with the absolute
value, the only way to have no difference is with the same exact curve¹.
- Symmetry, diff(C, D) = diff(D, C): again, because the integral is over the
absolute value of the difference.
- Triangle inequality, diff(C, E) ≤ diff(C, D) + diff(D, E): because
∫ |C'(t) − E'(t)| dt ≤ ∫ |C'(t) − D'(t)| dt + ∫ |D'(t) − E'(t)| dt.
Therefore, many indexing and retrieval techniques, like vantage objects [4], could be
exploited if using this metric.
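As a numerical illustration of this measure, the sketch below approximates diff(C, D) = ∫|C'(t) − D'(t)|dt by sampling both derivatives on a common grid; the toy derivative functions and the integration interval are our own choices, not the splines actually used by the model.

```python
import numpy as np

# Approximate the area between two first derivatives on [t0, t1] by sampling
# and summing |C'(t) - D'(t)| * dt (a simple rectangle rule).
def diff_curves(dC, dD, t0=0.0, t1=1.0, samples=1000):
    t = np.linspace(t0, t1, samples)
    dt = (t1 - t0) / (samples - 1)
    return float(np.sum(np.abs(dC(t) - dD(t))) * dt)

dC = lambda t: 2.0 * t + 1.0        # derivative of t^2 + t
dD = lambda t: 2.0 * t + 0.5        # derivative of t^2 + 0.5*t
print(diff_curves(dC, dD))          # about 0.5: a constant gap of 0.5 over [0, 1]
# Symmetry and identity of indiscernibles, as listed above:
print(diff_curves(dD, dC), diff_curves(dC, dC))
```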
4.2 Interpolation with Splines
The next issue to address is the interpolation method to use. The standard Lagrange
interpolation method, though simple, is known to suffer from Runge's phenomenon [3]:
as the number of points increases, the interpolating curve wiggles a lot, especially at
the beginning and the end of the curve. As such, one curve would be very different
from another one having just one more point at the end; the shape would be different
and so the dissimilarity metric would report a difference even when the two curves are
practically identical. Moreover, a very small difference in one of the points could
translate into an extreme variation in the overall curve, which would make it virtually
impossible to handle the Pitch and Duration Variation problems properly (see top of
Fig. 11).
¹ Actually, this means that the first derivatives are the same; the actual curves could
still be shifted. Nonetheless, this is the behavior we want.
C_i(t) = c_{i,1}(t)            for t_{i,1} ≤ t ≤ t_{i,kn}
         c_{i,2}(t)            for t_{i,2} ≤ t ≤ t_{i,kn+1}
         ...
         c_{i,mi−kn+1}(t)      for t_{i,mi−kn+1} ≤ t ≤ t_{i,mi}   (2)
(3)
5 Implementation
Geometric representations of music pieces are very intuitive, but they are not
necessarily easy to implement. We could follow the approach of moving one curve
towards the other looking for the minimum area between them [1][15]. However, this
approach is very sensitive to small differences in the middle of a song, such as
repeated notes: if a single note were added or removed from a melody, it would be
impossible to fully match the original melody from that note to the end. Instead, we
follow a dynamic programming approach to find an alignment between the two
melodies [19].
Various approaches for melodic similarity have applied editing distance algorithms
upon textual representations of musical sequences that assign one character to each
interval or each n-gram [8]. This dissimilarity measure has been improved in recent
years, and sequence alignment algorithms have proved to perform better than simple
editing distance algorithms [11][12]. Next, we describe the representation and
alignment method we use.
5.1 Melody Representation
To practically apply our model, we followed a basic n-gram approach, where each ngram represents one span of the spline. The pitch of each note was represented as the
relative difference to the pitch of the first note in the n-gram, and the duration was
represented as the ratio to the duration of the whole n-gram. For example, an n-gram
of length 4 with absolute pitches 74, 81, 72, 76 and absolute durations 240, 480,
240, 720, would be modeled as 81-74, 72-74, 76-74 = 7, -2, 2 in terms of pitch
and 240, 480, 240, 7201680 = 0.1429, 0.2857, 0.1429, 0.4286 in terms of
duration. Note that the first note is omitted in the pitch representation as it is always 0.
This representation is transposition invariant because a melody shifted in the pitch
dimension maintains the same relative pitch intervals. It is also time scale invariant
because the durations are expressed as their relative duration within the span, and so
they remain the same in the face of tempo and actual or score duration changes. This
is of particular interest for query by humming applications and unquantized pieces, as
small variations in duration would have negligible effects on the ratios.
We used Uniform B-Splines as interpolation method [3]. This results in a
parametric polynomial function for each n-gram. In particular, an n-gram of length kn
results in a polynomial of degree kn-1 for the pitch dimension and a polynomial of
degree kn-1 for the time dimension. Because the actual representation uses the first
derivatives, each polynomial is actually of degree kn-2.
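A small Python sketch of this representation, reproducing the worked example above, may help; the function name is ours and the code does not build the splines themselves, only the relative pitch and duration values that feed them.

```python
# Relative pitch intervals and duration ratios for one n-gram, as described
# in the text: pitches relative to the first note, durations relative to the
# n-gram's total duration.
def ngram_features(pitches, durations):
    total = sum(durations)
    rel_pitches = [p - pitches[0] for p in pitches[1:]]   # the first interval (0) is omitted
    rel_durations = [d / total for d in durations]
    return rel_pitches, rel_durations

pitches = [74, 81, 72, 76]
durations = [240, 480, 240, 720]
print(ngram_features(pitches, durations))
# ([7, -2, 2], [0.1428..., 0.2857..., 0.1428..., 0.4285...])

# A transposed and uniformly slowed-down version yields exactly the same
# features, illustrating transposition and time-scale invariance.
print(ngram_features([p - 5 for p in pitches], [2 * d for d in durations]))
```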
5.2 Melody Alignment
We used the Smith-Waterman local alignment algorithm [20], with the two sequences
of overlapping spans as input, defined as in (2). Therefore, the input symbols to the
alignment algorithm are actually the parametric pitch and time functions of a span,
349
based on the above representation of n-grams. The edit operations we define for the
Smith-Waterman algorithm are as follows:
where the null function returns, for a given n-gram, an n-gram equal to it but with all
pitch intervals set to 0, and μp and μt are the mean differences calculated by diffp and
difft, respectively, over a random sample of 100,000 pairs of n-grams sampled from
the set of incipits in the Train05 collection.
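Since the exact substitution and gap scores are defined above in terms of diffp and difft, the sketch below only shows the generic Smith-Waterman dynamic program over sequences of n-gram symbols, with placeholder scores; it is meant to illustrate the alignment structure, not the authors' scoring.

```python
# Schematic Smith-Waterman local alignment over n-gram symbols. The
# substitution and gap scores are placeholders, not the diff-based
# operations defined by the authors.
def smith_waterman(a, b, sub, gap=-1.0):
    n, m = len(a), len(b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # (mis)match
                          H[i - 1][j] + gap,                          # gap in b
                          H[i][j - 1] + gap)                          # gap in a
            best = max(best, H[i][j])
    return best

def sub(x, y):
    return 2.0 if x == y else -1.0          # toy score: reward identical n-grams

query = [(2, 2, 1), (2, 1, -3), (1, -3, 2)]                    # n-grams as interval tuples
piece = [(5, -1, 1), (2, 2, 1), (2, 1, -3), (1, -3, 2), (0, 0, 7)]
print(smith_waterman(query, piece, sub))    # 6.0: the three query n-grams align locally
```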
We also normalized the dissimilarity scores returned by difft. From the results in
Table 1 it can be seen that pitch dissimilarity scores are between 5 and 7 times larger
than time dissimilarity scores. Therefore, the choice of kp and kt does not intuitively
reflect the actual weight given to the pitch and time dimensions. For instance, the
selection of kt=0.25, chosen in studies like [11], would result in an actual weight
between 0.05 and 0.0357. To avoid this effect, we normalized every time dissimilarity
score by multiplying it by a factor equal to μp / μt. As such, the score of the match
operation is actually defined as s(c, c) = 2μp(kp+kt), and the dissimilarity function
defined in (3) is actually calculated as diff(c, d) = kp·diffp(c, d) + kt·difft(c, d).
6 Experimental Results
We evaluated the model proposed with the Train05 and Eval05 test collections used
in the MIREX 2005 Symbolic Melodic Similarity Task [21][10], measuring the mean
Average Dynamic Recall score across queries [22]. Both collections consist of about
580 incipits and 11 queries each, with their corresponding ground truths. Each ground
truth is a list of all incipits similar to each query, according to a panel of experts, and
with groups of incipits considered equally similar to the query.
However, we have recently shown that these lists have inconsistencies whereby
incipits judged as equally similar by the experts are not in the same similarity group
and vice versa [28]. All these inconsistencies result in a very permissive evaluation
where a system could return incipits not similar to the query and still be rewarded for
it. Thus, results reported with these lists are actually overestimated, by as much as
12% in the case of the MIREX 2005 evaluation. We have proposed alternatives to
arrange the similarity groups for each query, proving that the new arrangements are
significantly more consistent than the original one, leading to a more robust
evaluation. The most consistent ground truth lists were those called Any-1 [28].
Therefore, we will use these Any-1 ground truth lists from this point on to evaluate
our model, as they offer more reliable results. Nonetheless, all results are reported in
an appendix as if using the original ground truths employed in MIREX 2005, called
All-2, for the sake of comparison with previous results.
All system outputs and ground truth lists used in this paper can be downloaded from
http://julian-urbano.info/publications/
Table 1. Mean (μ) and standard deviation (σ) of the diffp and difft scores over a random sample of 100,000 pairs of n-grams from the Train05 incipits, and the resulting normalization factor λ = μp / μt, for each n-gram length kn.

kn   μp       σp       μt       σt       λ = μp / μt
3    2.8082   1.6406   0.5653   0.6074   4.9676
4    2.5019   1.6873   0.4940   0.5417   5.0646
5    2.2901   1.4568   0.4325   0.4580   5.2950
6    2.1347   1.4278   0.3799   0.3897   5.6191
7    2.0223   1.3303   0.2863   0.2908   7.0636
There also appears to be a negative correlation between the n-gram length and the
dissimilarity scores. This is caused by the degree of the polynomials defining the
splines: high-degree polynomials fit the points more smoothly than low-degree ones.
Polynomials of low degree tend to wiggle more, and so their derivatives are more
pronounced and lead to larger areas between curves.
6.2 Evaluation with the Train05 Test Collection, Any-1 Ground Truth Lists
The experimental design results in 55 trials for the 5 different levels of kn and the 11
different levels of kt. All these trials were performed with the Train05 test collection,
ground truths aggregated with the Any-1 function [28]. Table 2 shows the results.
In general, large n-grams tend to perform worse. This could probably be explained
by the fact that large n-grams define the splines with smoother functions, whose differences in shape may be too small to discriminate musically perceptible differences. However, kn=3 seems to be the exception (see Fig. 12). This is probably
caused by the extremely low degree of the derivative polynomials. N-grams of length
kn=3 result in splines defined with polynomials of degree 2, which are then
differentiated and result in polynomials of degree 1. That is, they are just straight
lines, and so a small difference in shape can turn into a relatively large dissimilarity
score when measuring the area.
Overall, kn=4 and kn=5 seem to perform the best, although kn=4 is more stable
across levels of kt. In fact, kn=4 and kt=0.6 obtain the best score, 0.7215. This result agrees with other studies where n-grams of length 4 and 5 were also found to perform better [8].
Table 2. Mean ADR scores for each combination of kn and kt with the Train05 test collection,
ground truth lists aggregated with the Any-1 function. kp is kept to 1. Bold face for largest
scores per row and italics for largest scores per column.
kn   kt=0     kt=0.1   kt=0.2   kt=0.3   kt=0.4   kt=0.5   kt=0.6   kt=0.7   kt=0.8   kt=0.9   kt=1
3    0.6961   0.7067   0.7107   0.7106   0.7102   0.7109   0.7148   0.7110   0.7089   0.7045   0.6962
4    0.7046   0.7126   0.7153   0.7147   0.7133   0.7200   0.7215   0.7202   0.7128   0.7136   0.7090
5    0.7093   0.7125   0.7191   0.7200   0.7173   0.7108   0.7040   0.6978   0.6963   0.6973   0.6866
6    0.7140   0.7132   0.7115   0.7088   0.7008   0.6930   0.6915   0.6874   0.6820   0.6765   0.6763
7    0.6823   0.6867   0.6806   0.6747   0.6538   0.6544   0.6529   0.6517   0.6484   0.6465   0.6432
Fig. 12. Mean ADR scores for each combination of kn and kt with the Train05 test collection,
ground truth lists aggregated with the Any-1 function. kp is kept to 1
Moreover, this combination of parameters obtains a mean ADR score of 0.8039 when evaluated with the original All-2 ground truths (see the Appendix). This is the best score reported so far for this collection.
6.3 Evaluation with the Eval05 Test Collection, Any-1 Ground Truth Lists
In a fair evaluation scenario, we would use the previous experiment to train our
system and choose the values of kn and kt that seem to perform the best (in particular,
kn=4 and kt=0.6). Then, the system would be run and evaluated with a different
collection to assess the external validity of the results and try to avoid overfitting to
the training collection. For the sake of completeness, here we show the results for all
55 combinations of the parameters with the Eval05 test collection used in MIREX
2005, again aggregated with the Any-1 function [28]. Table 3 shows the results.
Unlike the previous experiment with the Train05 test collection, in this case the
variation across levels of kt is smaller (the mean standard deviation is twice as large
in Train05), indicating that the use of the time dimension does not provide better
results overall (see Fig. 13). This is probably caused by the particular queries in each
collection. Seven of the eleven queries in Train05 start with long rests, while this happens for only three of the eleven queries in Eval05.
Table 3. Mean ADR scores for each combination of kn and kt with the Eval05 test collection,
ground truth lists aggregated with the Any-1 function. kp is kept to 1. Bold face for largest
scores per row and italics for largest scores per column.
kn   kt=0     kt=0.1   kt=0.2   kt=0.3   kt=0.4   kt=0.5   kt=0.6   kt=0.7   kt=0.8   kt=0.9   kt=1
3    0.6522   0.6601   0.6646   0.6612   0.6640   0.6539   0.6566   0.6576   0.6591   0.6606   0.6620
4    0.6530   0.6530   0.6567   0.6616   0.6629   0.6633   0.6617   0.6569   0.6500   0.6630   0.6531
5    0.6413   0.6367   0.6327   0.6303   0.6284   0.6328   0.6478   0.6461   0.6419   0.6414   0.6478
6    0.6269   0.6251   0.6225   0.6168   0.6216   0.6284   0.6255   0.6192   0.6173   0.6144   0.6243
7    0.5958   0.6230   0.6189   0.6163   0.6162   0.6192   0.6215   0.6174   0.6148   0.6112   0.6106
Fig. 13. Mean ADR scores for each combination of kn and kt with the Eval05 test collection,
ground truth lists aggregated with the Any-1 function. kp is kept to 1
In our model, rests are ignored and their duration is added to that of the next note, so the effect of the time dimension is larger when the queries themselves contain rests.
Likewise, large n-grams tend to perform worse. In this case though, n-grams of
length kn=3 and kn=4 perform the best. The most effective combination is kn=3 and
kt=0.2, with a mean ADR score of 0.6646. However, kn=4 and kt=0.5 is very close,
with a mean ADR score of 0.6633. Therefore, based on the results of the previous
experiment and the results in this one, we believe that kn=4 and kt ∈ [0.5, 0.6] are the
best parameters overall.
It is also important to note that none of the 55 combinations run results in a mean ADR score below 0.594, which was the highest score achieved in the actual MIREX 2005 evaluation with the Any-1 ground truths [28]. Therefore, our systems would have ranked first had they participated.
References
1. Aloupis, G., Fevens, T., Langerman, S., Matsui, T., Mesa, A., Núñez, Y., Rappaport, D., Toussaint, G.: Algorithms for Computing Geometric Measures of Melodic Similarity. Computer Music Journal 30(3), 67–76 (2006)
2. Bainbridge, D., Dewsnip, M., Witten, I.H.: Searching Digital Music Libraries. Information Processing and Management 41(1), 41–56 (2005)
3. de Boor, C.: A Practical Guide to Splines. Springer, Heidelberg (2001)
4. Bozkaya, T., Ozsoyoglu, M.: Indexing Large Metric Spaces for Similarity Search Queries. ACM Transactions on Database Systems 24(3), 361–404 (1999)
5. Byrd, D., Crawford, T.: Problems of Music Information Retrieval in the Real World. Information Processing and Management 38(2), 249–272 (2002)
6. Casey, M.A., Veltkamp, R.C., Goto, M., Leman, M., Rhodes, C., Slaney, M.: Content-Based Music Information Retrieval: Current Directions and Future Challenges. Proceedings of the IEEE 96(4), 668–695 (2008)
7. Clifford, R., Christodoulakis, M., Crawford, T., Meredith, D., Wiggins, G.: A Fast, Randomised, Maximal Subset Matching Algorithm for Document-Level Music Retrieval. In: International Conference on Music Information Retrieval, pp. 150–155 (2006)
8. Doraisamy, S., Rüger, S.: Robust Polyphonic Music Retrieval with N-grams. Journal of Intelligent Information Systems 21(1), 53–70 (2003)
9. Downie, J.S.: The Scientific Evaluation of Music Information Retrieval Systems: Foundations and Future. Computer Music Journal 28(2), 12–23 (2004)
10. Downie, J.S., West, K., Ehmann, A.F., Vincent, E.: The 2005 Music Information Retrieval Evaluation Exchange (MIREX 2005): Preliminary Overview. In: International Conference on Music Information Retrieval, pp. 320–323 (2005)
11. Hanna, P., Ferraro, P., Robine, M.: On Optimizing the Editing Algorithms for Evaluating Similarity Between Monophonic Musical Sequences. Journal of New Music Research 36(4), 267–279 (2007)
12. Hanna, P., Robine, M., Ferraro, P., Allali, J.: Improvements of Alignment Algorithms for Polyphonic Music Retrieval. In: International Symposium on Computer Music Modeling and Retrieval, pp. 244–251 (2008)
13. Isaacson, E.U.: Music IR for Music Theory. In: The MIR/MDL Evaluation Project White Paper Collection, 2nd edn., pp. 23–26 (2002)
14. Kilian, J., Hoos, H.H.: Voice Separation - A Local Optimisation Approach. In: International Symposium on Music Information Retrieval, pp. 39–46 (2002)
15. Lin, H.-J., Wu, H.-H.: Efficient Geometric Measure of Music Similarity. Information Processing Letters 109(2), 116–120 (2008)
16. McAdams, S., Bregman, A.S.: Hearing Musical Streams. In: Roads, C., Strawn, J. (eds.) Foundations of Computer Music, pp. 658–598. The MIT Press, Cambridge (1985)
17. Mongeau, M., Sankoff, D.: Comparison of Musical Sequences. Computers and the Humanities 24(3), 161–175 (1990)
18. Selfridge-Field, E.: Conceptual and Representational Issues in Melodic Comparison. Computing in Musicology 11, 3–64 (1998)
19. Smith, L.A., McNab, R.J., Witten, I.H.: Sequence-Based Melodic Comparison: A Dynamic Programming Approach. Computing in Musicology 11, 101–117 (1998)
20. Smith, T.F., Waterman, M.S.: Identification of Common Molecular Subsequences. Journal of Molecular Biology 147(1), 195–197 (1981)
21. Typke, R., den Hoed, M., de Nooijer, J., Wiering, F., Veltkamp, R.C.: A Ground Truth for Half a Million Musical Incipits. Journal of Digital Information Management 3(1), 34–39 (2005)
22. Typke, R., Veltkamp, R.C., Wiering, F.: A Measure for Evaluating Retrieval Techniques Based on Partially Ordered Ground Truth Lists. In: IEEE International Conference on Multimedia and Expo, pp. 1793–1796 (2006)
23. Typke, R., Veltkamp, R.C., Wiering, F.: Searching Notated Polyphonic Music Using Transportation Distances. In: ACM International Conference on Multimedia, pp. 128–135 (2004)
24. Typke, R., Wiering, F., Veltkamp, R.C.: A Survey of Music Information Retrieval Systems. In: International Conference on Music Information Retrieval, pp. 153–160 (2005)
25. Uitdenbogerd, A., Zobel, J.: Melodic Matching Techniques for Large Music Databases. In: ACM International Conference on Multimedia, pp. 57–66 (1999)
26. Ukkonen, E., Lemström, K., Mäkinen, V.: Geometric Algorithms for Transposition Invariant Content-Based Music Retrieval. In: International Conference on Music Information Retrieval, pp. 193–199 (2003)
27. Urbano, J., Lloréns, J., Morato, J., Sánchez-Cuadrado, S.: MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Representations. Music Information Retrieval Evaluation eXchange (2010)
28. Urbano, J., Marrero, M., Martín, D., Lloréns, J.: Improving the Generation of Ground Truths Based on Partially Ordered Lists. In: International Society for Music Information Retrieval Conference, pp. 285–290 (2010)
29. Urbano, J., Morato, J., Marrero, M., Martín, D.: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks. In: ACM SIGIR Workshop on Crowdsourcing for Search Evaluation, pp. 9–16 (2010)
30. Ó Maidín, D.: A Geometrical Algorithm for Melodic Difference. Computing in Musicology 11, 65–72 (1998)
Appendix

Table 4. Mean ADR scores for each combination of kn and kt with the Train05 test collection, ground truth lists aggregated with the original All-2 function. kp is kept to 1. Bold face for largest scores per row and italics for largest scores per column.

kn   kt=0     kt=0.1   kt=0.2   kt=0.3   kt=0.4   kt=0.5   kt=0.6   kt=0.7   kt=0.8   kt=0.9   kt=1
3    0.7743   0.7793   0.7880   0.7899   0.7893   0.7910   0.7936   0.7864   0.7824   0.7770   0.7686
4    0.7836   0.7899   0.7913   0.7955   0.7946   0.8012   0.8039   0.8007   0.7910   0.7919   0.7841
5    0.7844   0.7867   0.7937   0.7951   0.7944   0.7872   0.7799   0.7736   0.7692   0.7716   0.7605
6    0.7885   0.7842   0.7891   0.7851   0.7784   0.7682   0.7658   0.7620   0.7572   0.7439   0.7388
7    0.7598   0.7573   0.7466   0.7409   0.7186   0.7205   0.7184   0.7168   0.7110   0.7075   0.6997
Table 5. Mean ADR scores for each combination of kn and kt with the Eval05 test collection,
ground truth lists aggregated with the original All-2 function. kp is kept to 1. Bold face for
largest scores per row and italics for largest scores per column.
kn   kt=0     kt=0.1   kt=0.2   kt=0.3   kt=0.4   kt=0.5   kt=0.6   kt=0.7   kt=0.8   kt=0.9   kt=1
3    0.7185   0.7140   0.7147   0.7116   0.7120   0.7024   0.7056   0.7067   0.7080   0.7078   0.7048
4    0.7242   0.7268   0.7291   0.7316   0.7279   0.7282   0.7263   0.7215   0.7002   0.7108   0.7032
5    0.7114   0.7108   0.6988   0.6958   0.6942   0.6986   0.7109   0.7054   0.6959   0.6886   0.6914
6    0.7080   0.7025   0.6887   0.6693   0.6701   0.6743   0.6727   0.6652   0.6612   0.6561   0.6636
7    0.6548   0.6832   0.6818   0.6735   0.6614   0.6594   0.6604   0.6552   0.6525   0.6484   0.6499
It can also be observed that the results would again be overestimated by as much as
11% in the case of Train05 and as much as 13% in Eval05, in contrast with the
maximum 12% observed with the systems that participated in the actual MIREX 2005
evaluation.
1 Introduction
The way music is consumed today has been changed dramatically by its increasing
availability in digital form. Online music shops have replaced traditional stores and
music collections are increasingly kept on electronic storage systems and mobile
devices instead of physical media on a shelf. It has become much faster and more convenient to find and acquire music by a known artist. At the same time, it has become more difficult to find one's way around the enormous range of music that is offered commercially, to find music according to one's taste, or even to manage one's own collection.
Young people today have music collections with an average size of 8,159 tracks [1], and the iTunes music store offers more than 10 million tracks for sale. Long-tail sales are low, which is illustrated by the fact that only 1% of the catalog tracks generate 80% of sales [2][3]. A similar effect can also be seen in the usage of private music collections. According to our own studies, only a few people actively work with manual playlists, because they consider this too time-consuming or have simply forgotten which music they actually possess.
1.1 Automatic Music Recommendation
This is where music recommendation technology comes in: The similarity of two
songs can be mathematically calculated based on their musical attributes. Thus, for
each song a ranked list of similar songs from the catalogue can be generated.
Editorial, user-generated or content-based data derived directly from the audio signal
can be used.
Editorial data allows a very thorough description of the musical content, but this manual process is very expensive and time-consuming and will only ever cover a small percentage of the available music. Recommendations based on user data have become very popular through vendors such as Amazon ("People who bought this item also bought…") or Last.FM. However, this approach suffers from a cold-start problem and from its strong focus on popular content.
Signal-based recommenders are not affected by popularity, sales rank or user
activity. They extract low-level features directly from the audio signal. This also
offers additional advantages such as being able to work without a server connection
and being able to process any music file even if it has not been published anywhere.
However, signal-based technology alone misses the socio-cultural aspect, which is not present in the audio signal, and it also cannot address current trends or lyrical content.
1.2 Mufin's Hybrid Approach
Mufin's technology is signal-based, but it combines signal features with semantic musical attributes and metadata from other data sources, thus forming a hybrid recommendation approach. First of all, the technology analyzes the audio signal and extracts signal features. mufin then employs state-of-the-art machine learning technology to extract semantic musical attributes, including mood descriptions such as "happy", "sad", "calm" or "aggressive", but also other descriptive tags such as "synthetic", "acoustic", the presence of electronic beats, distorted guitars, etc.
This information can for instance be used to narrow down search results, offer
browsing capabilities or contextually describe content. By combining these attributes
with information from other sources such as editorial metadata the user can for
instance search for "aggressive rock songs from the 1970s with a track-length of more
than 8 minutes".
Lists of similar songs can then be generated by combining all available information
including signal features, musical attributes (auto-tags) and metadata from other data
sources using a music ontology. This ensures high quality and makes the recommendation system steerable. The results can be used, for instance, for playlist
generation or as an aid to an editor who needs an alternative to a piece of music he is
not allowed to use in a certain context.
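Mufin's actual scoring model is not disclosed here; purely to illustrate the idea of fusing signal-based distance, auto-tags and editorial metadata into one ranked list, a toy Python sketch might look as follows (all field names and weights are invented).

from dataclasses import dataclass, field

@dataclass
class Song:
    title: str
    signal_features: list                     # low-level audio descriptors
    tags: set = field(default_factory=set)    # auto-tags such as "calm", "acoustic"
    year: int = 0                             # editorial metadata

def hybrid_similarity(a, b, w_signal=0.6, w_tags=0.3, w_meta=0.1):
    # Toy fusion of signal distance, tag overlap and metadata closeness.
    dist = sum((x - y) ** 2 for x, y in zip(a.signal_features, b.signal_features)) ** 0.5
    signal = 1.0 / (1.0 + dist)               # map distance to a similarity in (0, 1]
    union = a.tags | b.tags
    tags = len(a.tags & b.tags) / len(union) if union else 0.0
    meta = 1.0 if a.year and b.year and abs(a.year - b.year) <= 5 else 0.0
    return w_signal * signal + w_tags * tags + w_meta * meta

def recommend(seed, catalogue, n=10):
    # Rank the catalogue by hybrid similarity to a seed song.
    return sorted(catalogue, key=lambda s: hybrid_similarity(seed, s), reverse=True)[:n]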
As the recommendation system features a module based on digital signal-processing algorithms, it can generate musical similarity for all songs within a music catalogue. Because it relies on mathematical analysis of the audio signals, it is completely deterministic and, if desired, can work independently of any "human factor" such as cultural background or listening habits. Depending on the database used, it can also work far off the mainstream and thus give the user the opportunity to discover music he may never have found otherwise.
In contrast to other technologies such as collaborative filtering, mufin's technology
can provide recommendations for any track, even if there are no tags or social data.
Recommendations are not limited by genre boundaries or target groups, nor biased by popularity. Instead, the system covers all songs in a music catalogue equally, and if genre boundaries or the influence of popularity are indeed desired, this can be addressed by leveraging additional data sources.
Fig. 1. The mufin music recommender combines audio features inside a song model and
semantic musical attributes using a music ontology. Additionally, visualization coordinates for
the mufin vision sound galaxy are generated during the music analysis process.
Mufin's complete music analysis process is fully automated. The technology has already proven itself in practical usage scenarios with more than 9 million
tracks. Mufin's technology is available for different platforms including Linux,
MacOS X, Windows and mobile platforms. Additionally, it can also be used via web
services.
2 Mufin Vision
Common text-based attributes such as title or artist are not suitable for keeping track of a large music collection, especially if the user is not familiar with every item in the collection. Songs which belong together from a sound perspective may appear very far apart when using lists sorted by metadata. Additionally, only a limited number of songs will fit onto the screen, preventing the user from actually getting an overview of his collection.
mufin vision has been developed with the goal of offering easy access to music
collections. Even if there are thousands of songs, the user can easily find his way
around the collection since he can learn where to find music with a certain
characteristic. By looking at the concentration of dots in an area he can immediately
assess the distribution of the collection and zoom into a section to get a closer look.
The mufin vision 3D sound galaxy displays each song as a dot in a coordinate
system. The x, y and z axes, as well as the size and color of the dots, can be assigned to different musical criteria such as tempo, mood, instrumentation or type of singing
voice; even metadata such as release date or song duration can be used. Using the axis
configuration, the user can explore his collection the way he wants and make relations
between different songs visible. As a result, it becomes much easier to find music
fitting a mood or occasion.
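The axis configuration itself amounts to a simple mapping from normalized song attributes to plot coordinates; a schematic Python sketch, with invented attribute names:

def to_galaxy_point(song, axes=("tempo", "mood", "year"), size_attr="duration"):
    # Map a song's normalized attributes (values in [0, 1]) to a dot in the
    # 3D sound galaxy: position from the three chosen axes, size from a fourth.
    x, y, z = (song[a] for a in axes)
    return {"x": x, "y": y, "z": z, "size": 2.0 + 8.0 * song[size_attr]}

print(to_galaxy_point({"tempo": 0.7, "mood": 0.2, "year": 0.9, "duration": 0.5}))
# {'x': 0.7, 'y': 0.2, 'z': 0.9, 'size': 6.0}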
Mufin vision premiered in the mufin player PC application but it can also be used
on the web and even on mobile devices. The latest version of the mufin player 1.5
allows the user to control mufin vision using a multi-touch display.
Fig. 2. Both songs are by the same artist. However, "Brothers in Arms" is a very calm ballad with sparse instrumentation, while "Sultans of Swing" is a rather powerful song with a fuller sound spectrum. The mufin vision sound galaxy reflects that difference since it works on song
level instead of an artist or genre level.
Fig. 3. The figure displays a playlist in which the entries are connected by lines. One can see
that although the songs may be similar as a whole, their musical attributes vary over the course
of the playlist.
3 Further Work
The mufin player PC application offers a database view of the user's music collection
including filtering, searching and sorting mechanisms. However, instead of only using
metadata such as artist or title for sorting, the mufin player can also sort any list by
similarity to a selected seed song.
Additionally, the mufin player offers online storage space for a user's music collection. This protects the user against data loss and allows him to stream his music online and listen to it from anywhere in the world.
Furthermore, mufin works together with the German National Library in order to
establish a workflow for the protection of our cultural heritage. The main contribution
of mufin is the fully automatic annotation of the music content and the provision of
descriptive tags for the library's ontology. Based on technology by mufin and its
partners, a semantic multimedia search demonstration was presented at IBC 2009 in
Amsterdam.
References
1. Bahanovich, D., Collopy, D.: Music Experience and Behaviour in Young People. University of Hertfordshire, UK (2009)
2. Celma, O.: Music Recommendation and Discovery in the Long Tail. PhD Thesis, Universitat Pompeu Fabra, Spain (2008)
3. Nielsen Soundscan: State of the Industry (2007), http://www.narm.com/2008Conv/StateoftheIndustry.pdf (accessed July 22, 2009)