
Lecture Notes in Computer Science

Commenced Publication in 1973


Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany

6684

Sølvi Ystad
Mitsuko Aramaki
Richard Kronland-Martinet
Kristoffer Jensen (Eds.)

Exploring
Music Contents
7th International Symposium, CMMR 2010
Málaga, Spain, June 21-24, 2010
Revised Papers


Volume Editors
Sølvi Ystad
CNRS-LMA, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France
E-mail: ystad@lma.cnrs-mrs.fr
Mitsuko Aramaki
CNRS-INCM, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France
E-mail: aramaki@incm.cnrs-mrs.fr
Richard Kronland-Martinet
CNRS-LMA, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France
E-mail: kronland@lma.cnrs-mrs.fr
Kristoffer Jensen
Aalborg University Esbjerg, Niels Bohr Vej 8, 6700 Esbjerg, Denmark
E-mail: krist@create.aau.dk

ISSN 0302-9743
e-ISSN 1611-3349
e-ISBN 978-3-642-23126-1
ISBN 978-3-642-23125-4
DOI 10.1007/978-3-642-23126-1
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2011936382
CR Subject Classification (1998): J.5, H.5, C.3, H.5.5, G.3, I.5
LNCS Sublibrary: SL 3 - Information Systems and Applications, incl. Internet/Web
and HCI

© Springer-Verlag Berlin Heidelberg 2011


This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Preface

Computer Music Modeling and Retrieval (CMMR) 2010 was the seventh event
of this international conference series that was initiated in 2003. Since its start,
the conference has been co-organized by the University of Aalborg, Esbjerg, Denmark (http://www.aaue.dk) and the Laboratoire de Mécanique et d'Acoustique
in Marseille, France (http://www.lma.cnrs-mrs.fr) and has taken place in France,
Italy and Denmark. The six previous editions of CMMR offered a varied overview
of recent music information retrieval (MIR) and sound modeling activities in addition to alternative fields related to human interaction, perception and cognition.
This year's CMMR took place in Málaga, Spain, June 21-24, 2010. The
conference was organized by the Application of Information and Communications Technologies Group (ATIC) of the University of Málaga (Spain), together
with LMA and INCM (CNRS, France) and AAUE (Denmark). The conference
featured three prominent keynote speakers working in the MIR area, and the
program of CMMR 2010 included in addition paper sessions, panel discussions,
posters and demos.
The proceedings of the previous CMMR conferences were published in the
Lecture Notes in Computer Science series (LNCS 2771, LNCS 3310, LNCS 3902,
LNCS 4969, LNCS 5493 and LNCS 5954), and the present edition follows the
lineage of the previous ones, including a collection of 22 papers within the topics
of CMMR. These articles were specially reviewed and corrected for this proceedings volume.
The current book is divided into five main chapters that reflect the present
challenges within the field of computer music modeling and retrieval. The chapters span topics from music interaction, composition tools and sound source
separation to data mining and music libraries. One chapter is also dedicated
to perceptual and cognitive aspects that are currently the subject of increased
interest in the MIR community. We are confident that CMMR 2010 brought
forward the research in these important areas.
We would like to thank Isabel Barbancho and her team at the Application of
Information and Communications Technologies Group (ATIC) of the University
of Málaga (Spain) for hosting the 7th CMMR conference and for ensuring a
successful organization of both scientific and social matters. We would also like
to thank the Program Committee members for their valuable paper reports and
thank all the participants who made CMMR 2010 a fruitful and convivial event.
Finally, we would like to thank Springer for accepting to publish the CMMR
2010 proceedings in their LNCS series.
March 2011

Sølvi Ystad
Mitsuko Aramaki
Richard Kronland-Martinet
Kristoffer Jensen

Organization

The 7th International Symposium on Computer Music Modeling and Retrieval
(CMMR 2010) was co-organized by the University of Málaga (Spain), Aalborg
University (Esbjerg, Denmark), and LMA/INCM-CNRS (Marseille, France).
Symposium Chair
Isabel Barbancho            University of Málaga, Spain

Symposium Co-chairs
Kristoffer Jensen           AAUE, Denmark
Sølvi Ystad                 CNRS-LMA, France

Demonstration and Panel Chairs
Ana M. Barbancho            University of Málaga, Spain
Lorenzo J. Tardón           University of Málaga, Spain

Program Committee

Paper and Program Chairs
Mitsuko Aramaki             CNRS-INCM, France
Richard Kronland-Martinet   CNRS-LMA, France

CMMR 2010 Referees

Mitsuko Aramaki
Federico Avanzini
Rolf Bader
Isabel Barbancho
Ana M. Barbancho
Mathieu Barthet
Antonio Camurri
Laurent Daudet
Olivier Derrien
Simon Dixon
Barry Eaglestone
Gianpaolo Evangelista
Cédric Févotte
Bruno Giordano
Emilia Gómez
Brian Gygi
Goffredo Haus
Kristoffer Jensen
Anssi Klapuri
Richard Kronland-Martinet
Marc Leman
Sylvain Marchand
Gregory Pallone
Andreas Rauber
David Sharp
Bob L. Sturm
Lorenzo J. Tardón
Vesa Välimäki
Sølvi Ystad

Table of Contents

Part I: Music Production, Interaction and Composition Tools

Probabilistic and Logic-Based Modelling of Harmony . . . . . 1
Simon Dixon, Matthias Mauch, and Amélie Anglade

Interactive Music Applications and Standards . . . . . 20
Rebecca Stewart, Panos Kudumakis, and Mark Sandler

Interactive Music with Active Audio CDs . . . . . 31
Sylvain Marchand, Boris Mansencal, and Laurent Girin

Pitch Gestures in Generative Modeling of Music . . . . . 51
Kristoffer Jensen

Part II: Music Structure Analysis - Sound Source Separation

An Entropy Based Method for Local Time-Adaptation of the Spectrogram . . . . . 60
Marco Liuni, Axel Röbel, Marco Romito, and Xavier Rodet

Transcription of Musical Audio Using Poisson Point Processes and Sequential MCMC . . . . . 76
Pete Bunch and Simon Godsill

Single Channel Music Sound Separation Based on Spectrogram Decomposition and Note Classification . . . . . 84
Wenwu Wang and Hafiz Mustafa

Notes on Nonnegative Tensor Factorization of the Spectrogram for Audio Source Separation: Statistical Insights and towards Self-Clustering of the Spatial Cues . . . . . 102
Cédric Févotte and Alexey Ozerov

Part III: Auditory Perception, Artificial Intelligence and Cognition

What Signal Processing Can Do for the Music . . . . . 116
Isabel Barbancho, Lorenzo J. Tardón, Ana M. Barbancho, Andrés Ortiz, Simone Sammartino, and Cristina de la Bandera

Speech/Music Discrimination in Audio Podcast Using Structural Segmentation and Timbre Recognition . . . . . 138
Mathieu Barthet, Steven Hargreaves, and Mark Sandler

Computer Music Cloud . . . . . 163
Jesús L. Alvaro and Beatriz Barros

Abstract Sounds and Their Applications in Audio and Perception Research . . . . . 176
Adrien Merer, Sølvi Ystad, Richard Kronland-Martinet, and Mitsuko Aramaki

Part IV: Analysis and Data Mining

Pattern Induction and Matching in Music Signals . . . . . 188
Anssi Klapuri

Unsupervised Analysis and Generation of Audio Percussion Sequences . . . . . 205
Marco Marchini and Hendrik Purwins

Identifying Attack Articulations in Classical Guitar . . . . . 219
Tan Hakan Özaslan, Enric Guaus, Eric Palacios, and Josep Lluis Arcos

Comparing Approaches to the Similarity of Musical Chord Sequences . . . . . 242
W. Bas de Haas, Matthias Robine, Pierre Hanna, Remco C. Veltkamp, and Frans Wiering

Part V: MIR - Music Libraries

Songs2See and GlobalMusic2One: Two Applied Research Projects in Music Information Retrieval at Fraunhofer IDMT . . . . . 259
Christian Dittmar, Holger Großmann, Estefanía Cano, Sascha Grollmisch, Hanna Lukashevich, and Jakob Abeßer

MusicGalaxy: A Multi-focus Zoomable Interface for Multi-facet Exploration of Music Collections . . . . . 273
Sebastian Stober and Andreas Nürnberger

A Database Approach to Symbolic Music Content Management . . . . . 303
Philippe Rigaux and Zoé Faget

Error-Tolerant Content-Based Music-Retrieval with Mathematical Morphology . . . . . 321
Mikko Karvonen, Mika Laitinen, Kjell Lemström, and Juho Vikman

Melodic Similarity through Shape Similarity . . . . . 338
Julián Urbano, Juan Llorens, Jorge Morato, and Sonia Sánchez-Cuadrado

Content-Based Music Discovery . . . . . 356
Dirk Schönfuß

Author Index . . . . . 361

Probabilistic and Logic-Based Modelling of Harmony
Simon Dixon, Matthias Mauch, and Amélie Anglade
Centre for Digital Music,
Queen Mary University of London,
Mile End Rd, London E1 4NS, UK
simon.dixon@eecs.qmul.ac.uk
http://www.eecs.qmul.ac.uk/~simond
Abstract. Many computational models of music fail to capture essential
aspects of the high-level musical structure and context, and this limits
their usefulness, particularly for musically informed users. We describe
two recent approaches to modelling musical harmony, using a probabilistic and a logic-based framework respectively, which attempt to reduce the
gap between computational models and human understanding of music.
The first is a chord transcription system which uses a high-level model of
musical context in which chord, key, metrical position, bass note, chroma
features and repetition structure are integrated in a Bayesian framework, achieving state-of-the-art performance. The second approach uses
inductive logic programming to learn logical descriptions of harmonic
sequences which characterise particular styles or genres. Each approach
brings us one step closer to modelling music in the way it is conceptualised by musicians.
Keywords: Chord transcription, inductive logic programming, musical
harmony.

Introduction

Music is a complex phenomenon. Although music is described as a "universal
language", when viewed as a paradigm for communication it is difficult to find
agreement on what constitutes a musical message (is it the composition or the
performance?), let alone the meaning of such a message. Human understanding of music is at best incomplete, yet there is a vast body of knowledge and
practice regarding how music is composed, performed, recorded, reproduced and
analysed in ways that are appreciated in particular cultures and settings. It is
the computational modelling of this common practice (rather than philosophical questions regarding the nature of music) which we address in this paper. In
particular, we investigate harmony, which exists alongside melody, rhythm and
timbre as one of the fundamental attributes of Western tonal music.
Our starting point in this paper is the observation that many of the computational models used in the music information retrieval and computer music
research communities fail to capture much of what is understood about music.
S. Ystad et al. (Eds.): CMMR 2010, LNCS 6684, pp. 1-19, 2011.
© Springer-Verlag Berlin Heidelberg 2011



Two examples are the bag-of-frames approach to music similarity [5], and the periodicity pattern approach to rhythm analysis [13], which are both independent
of the order of musical notes, whereas temporal order is an essential feature of
melody, rhythm and harmonic progression. Perhaps surprisingly, much progress
has been made in music informatics in recent years¹, despite the naïveté of the
musical models used and the claims that some tasks have reached a "glass ceiling" [6].
The continuing progress can be explained in terms of a combination of factors:
the high level of redundancy in music, the simplicity of many of the tasks which
are attempted, and the limited scope of the algorithms which are developed. In
this regard we agree with [14], who review the first 10 years of ISMIR conferences and list some challenges which the community has not fully engaged with
before. One of these challenges is to "dig deeper" into the music itself, which
would enable researchers to address more musically complex tasks; another is to
"expand ... musical horizons", that is, broaden the scope of MIR systems.
In this paper we present two approaches to modelling musical harmony, aiming
at capturing the type of musical knowledge and reasoning a musician might use
in performing similar tasks. The first task we address is that of chord transcription from audio recordings. We present a system which uses a high-level model
of musical context in which chord, key, metrical position, bass note, chroma
features and repetition structure are integrated in a Bayesian framework, and
generates the content of a lead-sheet containing the sequence of chord symbols, including their bass notes and metrical positions, and the key signature and
any modulations over time. This system achieves state-of-the-art performance,
being rated first in its category in the 2009 and 2010 MIREX evaluations. The
second task to which we direct our attention is the machine learning of logical
descriptions of harmonic sequences in order to characterise particular styles or
genres. For this work we use inductive logic programming to obtain representations such as decision trees which can be used to classify unseen examples or
provide insight into the characteristics of a data corpus.
Computational models of harmony are important for many application areas
of music informatics, as well as for music psychology and musicology itself. For
example, a harmony model is a necessary component of intelligent music notation software, for determining the correct key signature and pitch spelling of
accidentals where music is obtained from digital keyboards or MIDI files. Likewise, processes such as automatic transcription benefit from tracking the
harmonic context at each point in the music [24]. It has been shown that harmonic modelling improves search and retrieval in music databases, for example
in order to find variations of an example query [36], which is useful for musicological research. Theories of music cognition, if expressed unambiguously, can
be implemented and tested on large data corpora and compared with human
annotations, in order to verify or refine concepts in the theory.
¹ Progress is evident for example in the annual MIREX series of evaluations of music information retrieval systems (http://www.music-ir.org/mirex/wiki/2010:Main_Page).


The remainder of the paper is structured as follows. The next section provides
an overview of research in harmony modelling. This is followed by a section
describing our probabilistic model of chord transcription. In section 4, we present
our logic-based approach to modelling of harmony, and show how this can be
used to characterise and classify music. The final section is a brief conclusion
and outline of future work.

Background

Research into computational analysis of harmony has a history of over four
decades since [44] proposed a grammar-based analysis that required the user
to manually remove any non-harmonic notes (e.g. passing notes, suspensions
and ornaments) before the algorithm processed the remaining chord sequence.
A grammar-based approach was also taken by [40], who developed a set of chord
substitution rules, in the form of a context-free grammar, for generating 12-bar
Blues sequences. [31] addressed the problem of extracting patterns and substitution rules automatically from jazz standard chord sequences, and discussed how
the notions of expectation and surprise are related to the use of these patterns
and rules.
Closely related to grammar-based approaches are rule-based approaches, which
were used widely in early artificial intelligence systems. [21] used an elimination
process combined with heuristic rules in order to infer the tonality given a fugue
melody from Bach's Well-Tempered Clavier. [15] presents an expert system consisting of about 350 rules for generating 4-part harmonisations of melodies in the
style of Bach Chorales. The rules cover the chord sequences, including cadences
and modulations, as well as the melodic lines of individual parts, including voice
leading. [28] developed an expert system with a complex set of rules for recognising consonances and dissonances in order to infer the chord sequence. Maxwell's
approach was not able to infer harmony from a melodic sequence, as it considered
the harmony at any point in time to be defined by a subset of the simultaneously
sounding notes.
[41] addressed some of the weaknesses of earlier systems with a combined
rhythmic and harmonic analysis system based on preference rules [20]. The
system assigns a numerical score to each possible interpretation based on the
preference rules which the interpretation satises, and searches the space of all
solutions using dynamic programming restricted with a beam search. The system benets from the implementation of rules relating harmony and metre, such
as the preference rule which favours non-harmonic notes occurring on weak metrical positions. One claimed strength of the approach is the transparency of the
preference rules, but this is oset by the opacity of the system parameters such
as the numeric scores which are assigned to each rule.
[33] proposed a counting scheme for matching performed notes to chord templates for variable-length segments of music. The system is intentionally simplistic, in order that the framework might easily be extended or modified. The main
contributions of the work are the graph search algorithms, inspired by Temperley's dynamic programming approach, which determine the segmentation to be


used in the analysis. The proposed graph search algorithm is shown to be much
more efficient than standard algorithms without differing greatly in the quality
of analyses it produces.
As an alternative to the rule-based approaches, which suffer from the cumulative effects of errors, [38] proposed a probabilistic approach to functional
harmonic analysis, using a hidden Markov model. For each time unit (measure
or half-measure), their system outputs the current key and the scale degree of
the current chord. In order to make the computation tractable, a number of
simplifying assumptions were made, such as the symmetry of all musical keys.
Although this reduced the number of parameters by at least two orders of magnitude, the training algorithm was only successful on a subset of the parameters,
and the remaining parameters were set by hand.
An alternative stream of research has been concerned with multidimensional
representations of polyphonic music [10,11,42] based on the Viewpoints approach
of [12]. This representation scheme is for example able to preserve information
about voice leading which is otherwise lost by approaches that treat harmony as
a sequence of chord symbols.
Although most research has focussed on analysing musical works, some work
investigates the properties of entire corpora. [25] compared two corpora of chord
sequences, belonging to jazz standards and popular (Beatles) songs respectively,
and found key- and context-independent patterns of chords which occurred frequently in each corpus. [26] examined the statistics of the chord sequences of several thousand songs, and compared the results to those from a standard natural
language corpus in an attempt to find lexical units in harmony that correspond
to words in language. [34,35] investigated whether stochastic language models including naive Bayes classifiers and 2-, 3- and 4-grams could be used for automatic
genre classification. The models were tested on both symbolic and audio data,
where an off-the-shelf chord transcription algorithm was used to convert the audio
data to a symbolic representation. [39] analysed the Beatles corpus using probabilistic N-grams in order to show that the dependency of a chord on its context
extends beyond the immediately preceding chord (the first-order Markov assumption). [9] studied differences in the use of harmony across various periods of classical music history, using root progressions (i.e. the sequence of root notes of chords
in a progression) reduced to 2 categories (dominant and subdominant) to give a
representation called harmonic vectors. The use of root progressions is one of the
representations we use in our own work in section 4 [2].
All of the above systems process symbolic input, such as that found in a score,
although most of the systems do not require the level of detail provided by the
score (e.g. key signature, pitch spelling), which they are able to reconstruct from
the pitch and timing data. In recent years, the focus of research has shifted to the
analysis of audio files, starting with the work of [16], who computed a chroma
representation (salience of frequencies representing the 12 Western pitch classes,
independent of octave) which was matched to a set of chord templates using the
inner product. Alternatively, [7] modelled chords with a 12-dimensional Gaussian
distribution, where chord notes had a mean of 1, non-chord notes had a mean of 0,


and the covariance matrix had high values between pairs of chord notes. A hidden
Markov model was used to infer the most likely sequence of chords, where state
transition probabilities were initialised based on the distance between chords on
a special circle of fifths which included minor chords near to their relative major
chord. Further work on audio-based harmony analysis is reviewed thoroughly in
three recent doctoral theses, to which the interested reader is referred [22,18,32].

A Probabilistic Model for Chord Transcription

Music theory, perceptual studies, and musicians themselves generally agree that
no musical quality can be treated individually. When a musician transcribes
the chords of a piece of music, the chord labels are not assigned solely on the
basis of local pitch content of the signal. Musical context such as the key, metrical position and even the large-scale structure of the music play an important
role in the interpretation of harmony. [17, Chapter 4] conducted a survey among
human music transcription experts, and found that they use several musical context elements to guide the transcription process: not only is a prior rough chord
detection the basis for accurate note transcription, but the chord transcription
itself depends on the tonal context and other parameters such as beats, instrumentation and structure.
The goal of our recent work on chord transcription [24,22,23] is to propose
computational models that integrate musical context into the automatic chord
estimation process. We employ a dynamic Bayesian network (DBN) to combine
models of metrical position, key, chord, bass note and beat-synchronous bass and
treble chroma into a single high-level musical context model. The most probable
sequence of metrical positions, keys, chords and bass notes is estimated via
Viterbi inference.
A DBN is a graphical model representing a succession of simple Bayesian
networks in time. These are assumed to be Markovian and time-invariant, so
the model can be expressed recursively in two time slices: the initial slice and
the recursive slice. Our DBN is shown in Figure 1. Each node in the network
represents a random variable, which might be an observed node (in our case
the bass and treble chroma) or a hidden node (the key, metrical position, chord
and bass pitch class nodes). Edges in the graph denote dependencies between
variables. In our DBN the musically interesting behaviour is modelled in the
recursive slice, which represents the progress of all variables from one beat to
the next. In the following paragraphs we explain the function of each node.
Chord. Technically, the dependencies of the random variables are described in the
conditional probability distribution of the dependent variable. Since the highest
number of dependencies join at the chord variable, it takes a central position
in the network. Its conditional probability distribution is also the most complex: it depends not only on the key and the metrical position, but also on the
chord variable in the previous slice. The chord variable has 121 different chord
states (see below), and its dependency on the previous chord variable enables


[Figure 1 appears here: six layers (metrical position M, key K, chord C, bass B, bass chroma X^bs, treble chroma X^tr), each shown in time slices i-1 and i.]

Fig. 1. Our network model topology, represented as a DBN with two slices and six
layers. The clear nodes represent random variables, while the observed ones are shaded
grey. The directed edges represent the dependency structure. Intra-slice dependency
edges are drawn solid, inter-slice dependency edges are dashed.

the reinforcement of smooth sequences of these states. The probability distribution of chords conditional on the previous chord strongly favours the chord that
was active in the previous slice, similar to a high self-transition probability in
a hidden Markov model. While leading to a chord transcription that is stable
over time, dependence on the previous chord alone is not sufficient to model adherence to the key. Instead, it is modelled conditionally on the key variable: the
probability distribution depends on the chord's fit with the current key, based on
an expert function motivated by Krumhansl's chord-key ratings [19, page 171].
Finally, the chord variable's dependency on the metrical position node allows us
to favour chord changes at strong metrical positions to achieve a transcription
that resembles more closely that of a human transcriber.


Fig. 2. (a) The metrical position model; (b) the model of a single chroma pitch class (Gaussian density as a function of note salience).

Key and metrical position. The dependency structure of the key and metrical
position variables is comparatively simpler, since they depend only on the respective predecessor. The emphasis on smooth, stable key sequences is handled
in the same way as it is in chords, but the 24 states representing major and minor
keys have even higher self-transition probability, and hence they will persist for
longer stretches of time. The metrical position model represents a 4/4 meter and
hence has four states. The conditional probability distribution strongly favours
normal beat transitions, i.e. from one beat to the next, but it also allows for
irregular transitions in order to accommodate temporary deviations from 4/4 meter and occasional beat tracking errors. In Figure 2a black arrows represent a
transition probability of 1 - ε (where ε = 0.05) to the following beat. Grey arrows
represent a probability of ε/2 to jump to different beats through self-transition
or omission of the expected beat.
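A minimal sketch of this metrical position transition distribution; the assumption here is that the residual mass ε is split equally between a self-transition and a skipped beat, as suggested by Figure 2a.

import numpy as np

def metre_transition_matrix(eps: float = 0.05, n_beats: int = 4) -> np.ndarray:
    """Transition matrix over metrical positions: probability 1 - eps of moving
    to the next beat, with the remaining mass split between staying on the same
    beat and skipping a beat (our reading of the irregular transitions)."""
    P = np.zeros((n_beats, n_beats))
    for i in range(n_beats):
        P[i, (i + 1) % n_beats] = 1.0 - eps    # regular beat-to-beat step
        P[i, i] = eps / 2.0                    # self-transition (inserted beat)
        P[i, (i + 2) % n_beats] = eps / 2.0    # skipped beat
    return P

print(metre_transition_matrix())  # each row sums to 1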
Bass. The random variable that models the bass has 13 states, one for each of
the pitch classes, and one "no bass" state. It depends on both the current chord
and the previous chord. The current chord is the basis of the most probable bass
notes that can be chosen. The highest probability is assigned to the nominal
chord bass pitch class², lower probabilities to the remaining chord pitch classes,
and the rest of the probability mass is distributed between the remaining pitch
classes. The additional use of the dependency on the previous chord allows us
to model the behaviour of the bass note on the first beat of the chord differently
from its behaviour on later beats. We can thus model the tendency for the played
bass note to coincide with the nominal bass note of the chord (e.g. the note B
in the B7 chord), while there is more variation in the bass notes played during
the rest of the duration of the chord.
Chroma. The chroma nodes provide models of the bass and treble chroma audio features. Unlike the discrete nodes previously discussed, they are continuous
because the 12 elements of the chroma vector represent relative salience, which
can assume any value between zero and unity. We represent both bass and treble
chroma as multidimensional Gaussian random variables. The bass chroma variable has 13 different Gaussians, one for every bass state, and the treble chroma
node has 121 Gaussians, one for every chord state. The means of the Gaussians
are set to reflect the nature of the chords: to unity for pitch classes that are
part of the chord, and to zero for the rest. A single variate in the 12-dimensional
Gaussian treble chroma distribution models one pitch class, as illustrated in Figure 2b. Since the chroma values are normalised to the unit interval, the Gaussian
model functions similarly to a regression model: for a given chord the Gaussian
density increases with increasing salience of the chord notes (solid line), and
decreases with increasing salience of non-chord notes (dashed line). For more
details see [22].

² The chord symbol itself always implies a bass note, but the bass line might include other notes not specified by the chord symbol, as in the case of walking bass.
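As an illustration of how such a Gaussian chroma model scores a chord hypothesis, the sketch below evaluates a 12-dimensional Gaussian with mean 1 on chord pitch classes and 0 elsewhere; the isotropic variance value is an assumption, since the text specifies only the means.

import numpy as np

def chord_chroma_loglik(chroma: np.ndarray, chord_pcs: set, sigma: float = 0.2) -> float:
    """Log-density of a 12-dim chroma frame under independent per-pitch-class
    Gaussians with mean 1 for chord tones and 0 for non-chord tones
    (the shared variance sigma**2 is an assumption)."""
    mean = np.array([1.0 if pc in chord_pcs else 0.0 for pc in range(12)])
    return float(-0.5 * np.sum(((chroma - mean) / sigma) ** 2)
                 - 12 * np.log(sigma * np.sqrt(2 * np.pi)))

# Example: a C major triad template scored against a chroma frame
c_major = {0, 4, 7}
frame = np.zeros(12); frame[[0, 4, 7]] = 0.9
print(chord_chroma_loglik(frame, c_major))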
One important aspect of the model is the wide variety of chords it uses.
It models ten different chord types (maj, min, maj/3, maj/5, maj6, 7, maj7,
min7, dim, aug) and the "no chord" class N. The chord labels with slashes
denote chords whose bass note differs from the chord root, for example D/3
represents a D major chord in first inversion (sometimes written D/F#). The
recognition of these chords is a novel feature of our chord recognition algorithm.
Figure 3 shows a score rendered using exclusively the information in our model.
In the last four bars, marked with a box, the second chord is correctly annotated
as D/F#. The position of the bar lines is obtained from the metrical position
variable, the key signature from the key variable, and the bass notes from the
bass variable. The chord labels are obtained from the chord variable, replicated
as notes in the treble staff for better visualisation. The crotchet rest on the first
beat of the piece indicates that here, the Viterbi algorithm inferred that the "no
chord" model fits best.
Using a standard test set of 210 songs used in the MIREX chord detection
task, our basic model achieved an accuracy of 73%, with each component of the
model contributing significantly to the result. This improves on the best result at
MIREX 2009 for pre-trained systems.
Fig. 3. Excerpt of automatic output of our algorithm (top) and song book version
(bottom) of the pop song "Friends Will Be Friends" (Deacon/Mercury). The song
book excerpt corresponds to the four bars marked with a box.


Fig. 4. Segmentation and its effect on chord transcription for the Beatles song "It
Won't Be Long" (Lennon/McCartney). The top 2 rows show the human and automatic
segmentation respectively. Although the structure is different, the main repetitions are
correctly identified. The bottom 2 rows show (in black) where the chord was transcribed
correctly by our algorithm using (respectively not using) the segmentation information.

Further improvements have been made via
two extensions of this model: taking advantage of repeated structural segments
(e.g. verses or choruses), and refining the front-end audio processing.
Most musical pieces have segments which occur more than once in the piece,
and there are two reasons for wishing to identify these repetitions. First, multiple
sets of data provide us with extra information which can be shared between the
repeated segments to improve detection performance. Second, in the interest of
consistency, we can ensure that the repeated sections are labelled with the same
set of chord symbols. We developed an algorithm that automatically extracts the
repetition structure from a beat-synchronous chroma representation [27], which
ranked first in the 2009 MIREX Structural Segmentation task.
After building a similarity matrix based on the correlation between beat-synchronous chroma vectors, the method finds sets of repetitions whose elements have the same length in beats. A repetition set composed of n elements
with length d receives a score of (n - 1)d, reflecting how much space a hypothetical music editor could save by typesetting a repeated segment only once. The
repetition set with the maximum score (part A in Figure 4) is added to the
final list of structural elements, and the process is repeated on the remainder of
the song until no valid repetition sets are left.
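The greedy selection by the (n - 1)d score can be sketched as follows; the construction of candidate repetition sets and the overlap handling are simplified assumptions relative to the published method [27].

def pick_repetition_sets(candidate_sets):
    """Greedy selection of repetition sets by the (n - 1) * d 'saved space' score.
    Each candidate is (segments, d), where segments is a list of (start_beat,
    end_beat) repeats of length d beats. Overlap handling is simplified."""
    chosen, remaining, used = [], list(candidate_sets), set()
    while remaining:
        best = max(remaining, key=lambda s: (len(s[0]) - 1) * s[1])
        if (len(best[0]) - 1) * best[1] <= 0:
            break
        segs, d = best
        # accept the set only if none of its segments overlap already-used beats
        if all(not any(b in used for b in range(a, a + d)) for a, _ in segs):
            chosen.append(best)
            for a, _ in segs:
                used.update(range(a, a + d))
        remaining.remove(best)
    return chosen

# Example: two candidate sets; the non-overlapping, higher-scoring set is chosen
print(pick_repetition_sets([([(0, 16), (32, 48)], 16),
                            ([(8, 12), (20, 24), (40, 44)], 4)]))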
The resulting structural segmentation is then used to merge the chroma representations of matching segments. Despite the inevitable errors propagated from
incorrect segmentation, we found a signicant performance increase (to 75% on
the MIREX score) by using the segmentation. In Figure 4 the benecial eect
of using the structural segmentation can clearly be observed: many of the white
stripes representing chord recognition errors are eliminated by the structural
segmentation method, compared to the baseline method.



A further improvement was achieved by modifying the front-end audio processing. We found that by learning chord profiles as Gaussian mixtures, the
recognition rate of some chords can be improved. However this did not result
in an overall improvement, as the performance on the most common chords decreased. Instead, an approximate pitch transcription method using non-negative
least squares was employed to reduce the effect of upper harmonics in the chroma
representations [23]. This results in both a qualitative (reduction of specific errors) and quantitative (a substantial overall increase in accuracy) improvement
in results, with a MIREX score of 79% (without using segmentation), which
again is significantly better than the state of the art. By combining both of the
above enhancements we reach an accuracy of 81%, a statistically significant improvement over the best result (74%) in the 2009 MIREX Chord Detection tasks
and over our own previously mentioned results.
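The idea of approximate pitch transcription by non-negative least squares can be sketched as below; the toy note dictionary is an assumption for illustration and not the templates used in [23].

import numpy as np
from scipy.optimize import nnls

def note_activations(spectrum: np.ndarray, note_dictionary: np.ndarray) -> np.ndarray:
    """Approximate pitch transcription: find activations a >= 0 minimising
    ||spectrum - note_dictionary @ a|| by non-negative least squares."""
    activations, _residual = nnls(note_dictionary, spectrum)
    return activations

# Toy dictionary: 3 'notes', each a few decaying partials on a 64-bin spectrum
bins = 64
dictionary = np.zeros((bins, 3))
for j, f0 in enumerate([4, 5, 6]):          # fundamental bin indices (assumed)
    for h in range(1, 5):                   # first four partials
        if f0 * h < bins:
            dictionary[f0 * h, j] = 1.0 / h
observed = dictionary @ np.array([1.0, 0.0, 0.5])   # mixture of notes 1 and 3
print(note_activations(observed, dictionary))        # approximately [1.0, 0.0, 0.5]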

Logic-Based Modelling of Harmony

First-order logic (FOL) is a natural formalism for representing harmony, as it is
sufficiently general for describing combinations and sequences of notes of arbitrary complexity, and there are well-studied approaches for performing inference,
pattern matching and pattern discovery using subsets of FOL. A further advantage of logic-based representations is that a system's output can be presented in
an intuitive way to non-expert users. For example, a decision tree generated by
our learning approach provides much more intuition about what was learnt than
would a matrix of state transition probabilities. In this work we focus in particular on inductive logic programming (ILP), which is a machine learning approach
using logic programming (a subset of FOL) to uniformly represent examples,
background knowledge and hypotheses. An ILP system takes as input a set of
positive and negative examples of a concept, plus some background knowledge,
and outputs a logic program which "explains" the concept, in the sense that
all of the positive examples but (ideally) none of the negative examples can be
derived from the logic program and background knowledge.
ILP has been used for various musical tasks, including inference of harmony
[37] and counterpoint [30] rules from musical examples, as well as rules for expressive performance [43]. In our work, we use ILP to learn sequences of chords
that might be characteristic of a musical style [2], and test the models on classification tasks [3,4,1]. In each case we represent the harmony of a piece of music
by a list of chords, and learn models which characterise the various classes of
training data in terms of features derived from subsequences of these chord lists.
4.1 Style Characterisation

In our first experiments [2], we analysed two chord corpora, consisting of the
Beatles studio albums (180 songs, 14132 chords) and a set of jazz standards
from the Real Book (244 songs, 24409 chords) to find harmonic patterns that
differentiate the two corpora. Chord sequences were represented in terms of the
interval between successive root notes or successive bass notes (to make the


sequences key-independent), plus the category of each chord (reduced to a triad
except in the case of the dominant seventh chord). For the Beatles data, where
the key had been annotated for each piece, we were also able to express the
chord symbols in terms of the scale degree relative to the key, rather than its
pitch class, giving a more musically satisfying representation. Chord sequences of
length 4 were used, which we had previously found [25] to be a good compromise
of sufficient length to capture the context (and thus the function) of the chords,
without the sequences being overspecific, in which case few or no patterns would
be found.
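For illustration, a key-independent root-interval sequence of the kind used here can be computed as follows (a minimal sketch, not the representation code used in [2]).

NOTE_TO_PC = {'C': 0, 'C#': 1, 'Db': 1, 'D': 2, 'Eb': 3, 'E': 4, 'F': 5,
              'F#': 6, 'G': 7, 'Ab': 8, 'A': 9, 'Bb': 10, 'B': 11}

def root_interval_sequence(roots):
    """Intervals in semitones (mod 12) between successive chord roots,
    giving a key-independent representation of a chord sequence."""
    pcs = [NOTE_TO_PC[r] for r in roots]
    return [(b - a) % 12 for a, b in zip(pcs, pcs[1:])]

# I - IV - I - IV in C and in G both map to [5, 7, 5]
print(root_interval_sequence(['C', 'F', 'C', 'F']))
print(root_interval_sequence(['G', 'C', 'G', 'C']))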
Two models were built, one using the Beatles corpus as positive examples
and the other using the Real Book corpus as positive examples. The ILP system
Aleph was employed, which finds a minimal set of rules which cover (i.e. describe)
all positive examples (and a minimum number of negative examples). The models
built by Aleph consisted of 250 rules for the Beatles corpus and 596 rules for the
Real Book. Note that these rules cover every 4-chord sequence in every song,
so it is only the rules which cover many examples that are relevant in terms
of characterising the corpus. Also, once a sequence has been covered, it is not
considered again by the system, so the output is dependent on the order of
presentation of the examples.
We briefly discuss some examples of rules with the highest coverage. For the
Beatles corpus, the highest coverage (35%) was the 4-chord sequence of major
triads (regardless of roots). Other highly-ranked patterns of chord categories
(5% coverage) had 3 major triads and one minor triad in the sequence. This is
not surprising, in that popular music generally has a less rich harmonic vocabulary than jazz. Patterns of root intervals were also found, including a [perfect
4th, perfect 5th, perfect 4th] pattern (4%), which could for example be interpreted as a I - IV - I - IV progression or as V - I - V - I. Since the root
interval does not encode the key, it is not possible to distinguish between these
interpretations (and it is likely that the data contains instances of both). At
2% coverage, the interval sequence [perfect 4th, major 2nd, perfect 4th] (e.g.
I - IV - V - I) is another well-known chord sequence.
No single rule covered as many Real Book sequences as the top rule for the
Beatles, but some typical jazz patterns were found, such as [perfect 4th, perfect
4th, perfect 4th] (e.g. ii - V - I - IV, coverage 8%), a cycle of descending
fifths, and [major 6th, perfect 4th, perfect 4th] (e.g. I - vi - ii - V, coverage
3%), a typical turnaround pattern.
One weakness with this first experiment, in terms of its goal as a pattern
discovery method, is that the concept to learn and the vocabulary to describe
it (defined in the background knowledge) need to be given in advance. Different vocabularies result in different concept descriptions, and a typical process
of concept characterisation is interactive, involving several refinements of the
vocabulary in order to obtain an interesting theory. Thus, as we refine the vocabulary we inevitably reduce the problem to a pattern matching task rather
than pattern discovery. A second issue is that since musical styles have no formal definition, it is not possible to quantify the success of style characterisation


directly, but only indirectly, by using the learnt models to classify unseen examples. Thus the following harmony modelling experiments are evaluated via the
task of genre classification.
4.2 Genre Classification

For the subsequent experiments we extended the representation to allow variable-length patterns and used TILDE, a first-order logic decision tree induction algorithm for modelling harmony [3,4]. As test data we used a collection of 856 pieces
(120510 chords) covering 3 genres, each of which was divided into a further 3
subgenres: academic music (Baroque, Classical, Romantic), popular music (Pop,
Blues, Celtic) and jazz (Pre-bop, Bop, Bossa Nova). The data is represented in
the Band in a Box format, containing a symbolic encoding of the chords, which
were extracted and encoded using a definite clause grammar (DCG) formalism.
The software Band in a Box is designed to produce an accompaniment based on
the chord symbols, using a MIDI synthesiser. In further experiments we tested
the classification method using automatic chord transcription (see section 3)
from the synthesised audio data, in order to test the robustness of the system
to errors in the chord symbols.
The DCG representation was developed for natural language processing to
express syntax or grammar rules in a format which is both human-readable
and machine-executable. Each predicate has two arguments (possibly among
other arguments), an input list and an output list, where the output list is
always a suffix of the input list. The difference between the two lists corresponds to the subsequence described by the predicate. For example, the predicate gap(In,Out) states that the input list of chords (In) commences with a
subsequence corresponding to a gap, and the remainder of the input list is
equal to the output list (Out). In our representation, a gap is an arbitrary sequence of chords, which allows the representation to skip any number of chords
at the beginning of the input list without matching them to any harmony concept. Extra arguments can encode parameters and/or context, so that the term
degreeAndCategory(Deg,Cat,In,Out,Key) states that the list In begins with
a chord of scale degree Deg and chord category Cat in the context of the key
Key. Thus the sequence:
gap(S,T),
degreeAndCategory(2,min7,T,U,gMajor),
degreeAndCategory(5,7,U,V,gMajor),
degreeAndCategory(1,maj7,V,[],gMajor)
states that the list S starts with any chord subsequence (gap), followed by a
minor 7th chord on the 2nd degree of G major (i.e. Amin7), followed by a (dominant) 7th chord on the 5th degree (D7) and ending with a major 7th chord on
the tonic (Gmaj7).
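For readers less familiar with DCGs, the following sketch emulates the effect of a rule body of the form gap - degreeAndCategory ... - gap (as used in the classification rules shown later), matching a degree/category subsequence anywhere in a chord list. It is a simplified illustration, not the Prolog machinery itself.

def matches(chords, pattern):
    """Check whether a chord list contains the given degree/category pattern
    as a contiguous subsequence, with implicit leading and trailing gaps.
    chords and pattern are lists of (degree, category) pairs."""
    n, m = len(chords), len(pattern)
    return any(chords[i:i + m] == pattern for i in range(n - m + 1))

# The ii7 - V7 - Imaj7 example from the text, expressed as scale degrees
song = [(6, 'min7'), (2, 'min7'), (5, '7'), (1, 'maj7'), (4, 'maj7')]
print(matches(song, [(2, 'min7'), (5, '7'), (1, 'maj7')]))  # True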
TILDE learns a classification model based on a vocabulary of predicates supplied by the user. In our case, we described the chords in terms of their root note,


[Decision tree figure: internal nodes test patterns such as gap(A,C), degAndCat(5,maj,C,D,Key), degAndCat(1,min,D,E,Key); leaves are labelled academic or jazz.]
Fig. 5. Part of the decision tree for a binary classifier for the classes Jazz and Academic
Table 1. Results compared with the baseline for 2-class, 3-class and 9-class classification tasks

Classification Task            Baseline  Symbolic  Audio
Academic - Jazz                0.55      0.947     0.912
Academic - Popular             0.55      0.826     0.728
Jazz - Popular                 0.61      0.891     0.807
Academic - Popular - Jazz      0.40      0.805     0.696
All 9 subgenres                0.21      0.525     0.415

scale degree, chord category, and intervals between successive root notes, and we
constrained the learning algorithm to generate rules containing subsequences of
length at least two chords. The model can be expressed as a decision tree, as
shown in Figure 5, where the choice of branch taken is based on whether or not
the chord sequence matches the predicates at the current node, and the class
to which the sequence belongs is given by the leaf of the decision tree reached
by following these choices. The decision tree is equivalent to an ordered set of
rules or a Prolog program. Note that a rule at a single node of a tree cannot
necessarily be understood outside of its context in the tree. In particular, a rule
by itself cannot be used as a classifier.
The results for various classification tasks are shown in Table 1. All results are
significantly above the baseline, but performance clearly decreases for more difficult tasks. Perfect classification is not to be expected from harmony data, since
other aspects of music such as instrumentation (timbre), rhythm and melody
are also involved in defining and recognising musical styles.
Analysis of the most common rules extracted from the decision tree models
built during these experiments reveals some interesting and well-known jazz,
academic and popular music harmony patterns. For each rule shown below, the
coverage expresses the fraction of songs in each class that match the rule. For
example, while a perfect cadence is common to both academic and jazz styles,
the chord categories distinguish the styles very well, with academic music using
triads and jazz using seventh chords:


genre(academic,A,B,Key) :- gap(A,C),
degreeAndCategory(5,maj,C,D,Key),
degreeAndCategory(1,maj,D,E,Key),
gap(E,B).
[Coverage: academic=133/235; jazz=10/338]
genre(jazz,A,B,Key) :- gap(A,C),
degreeAndCategory(5,7,C,D,Key),
degreeAndCategory(1,maj7,D,E,Key),
gap(E,B).
[Coverage: jazz=146/338; academic=0/235]
A good indicator of blues is the sequence: ... - I7 - IV7 - ...
genre(blues,A,B,Key) :- gap(A,C),
degreeAndCategory(1,7,C,D,Key),
degreeAndCategory(4,7,D,E,Key),
gap(E,B).
[Coverage: blues=42/84; celtic=0/99; pop=2/100]
On the other hand, jazz is characterised (but not exclusively) by the sequence:
... - ii7 - V7 - ...
genre(jazz,A,B,Key) :- gap(A,C),
degreeAndCategory(2,min7,C,D,Key),
degreeAndCategory(5,7,D,E,Key),
gap(E,B).
[Coverage: jazz=273/338; academic=42/235; popular=52/283]
The representation also allows for longer rules to be expressed, such as the
following rule describing a modulation to the dominant key and back again in
academic music: ... - II7 - V - ... - I - V7 - ...
genre(academic,A,B,Key) :- gap(A,C),
degreeAndCategory(2,7,C,D,Key),
degreeAndCategory(5,maj,D,E,Key),
gap(E,F),
degreeAndCategory(1,maj,F,G,Key),
degreeAndCategory(5,7,G,H,Key),
gap(H,B).
[Coverage: academic=75/235; jazz=0/338; popular=1/283]


Although none of the rules are particularly surprising, these examples illustrate some meaningful musicological concepts that are captured by the rules. In
general, we observed that Academic music is characterised by rules establishing the tonality, e.g. via cadences, while Jazz is less about tonality, and more
about harmonic colour, e.g. the use of 7th, 6th, augmented and more complex
chords, and Popular music harmony tends to have simpler harmonic rules as
melody is predominant in this style. The system is also able to nd longer rules
that a human might not spot easily. Working from audio data, even though the
transcriptions are not fully accurate, the classication and rules still capture the
same general trends as for symbolic data.
For genre classification we are not advocating a harmony-based approach
alone. It is clear that other musical features are better predictors of genre.
Nevertheless, the positive results encouraged a further experiment in which we
integrated the current classification approach with a state-of-the-art genre classification system, to test whether the addition of a harmony feature could improve
its performance.
4.3 Genre Classification Using Harmony and Low-Level Features

In recent work [1] we developed a genre classification framework combining both
low-level signal-based features and high-level harmony features. A state-of-the-art statistical genre classifier [8] using 206 features, covering spectral, temporal,
energy, and pitch characteristics of the audio signal, was extended using a random forest classifier containing rules for each genre (classical, jazz and pop)
derived from chord sequences. We extended our previous work using the first-order logic induction algorithm TILDE, to learn a random forest instead of a
single decision tree from the chord sequence corpus described in the previous
genre classification experiments. The random forest model achieved better classification rates (88% on the symbolic data and 76% on the audio data) for the
three-class classification problem (previous results 81% and 70% respectively).
Having trained the harmony classifier, its output was added as an extra feature
to the low-level classifier and the combined classifier was tested on three-genre
subsets of two standard genre classification data sets (GTZAN and ISMIR04)
containing 300 and 448 recordings respectively. Multilayer perceptrons and support vector machines were employed to classify the test data using 5×5-fold
cross-validation and feature selection. Results are shown in Table 2 for the support vector machine classifier, which outperformed the multilayer perceptrons.
Results indicate that the combination of low-level features with the harmony-based classifier produces improved genre classification results despite the fact
that the classification rate of the harmony-based classifier alone is poor. For
both datasets the improvements over the standard classifier (as shown in Table 2)
were found to be statistically significant.
Table 2. Best mean classification results (and number of features used) for the two
data sets using 5×5-fold cross-validation and feature selection

Classifier                     GTZAN data set       ISMIR04 data set
SVM without harmony feature    0.887 (60 features)  0.938 (70 features)
SVM with harmony feature       0.911 (50 features)  0.953 (80 features)
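A sketch of how the harmony classifier's output can be appended to a low-level feature vector before training the support vector machine; the feature dimensions, data and library calls here are illustrative assumptions, not the system of [8].

import numpy as np
from sklearn.svm import SVC

def add_harmony_feature(low_level_features, harmony_predictions):
    """Append the harmony classifier's predicted genre (as a numeric label)
    as one extra column to the low-level feature matrix."""
    return np.column_stack([low_level_features, harmony_predictions])

# Hypothetical data: 6 tracks, 206 low-level features, 3 genres (labels 0, 1, 2)
rng = np.random.default_rng(0)
X_low = rng.normal(size=(6, 206))
harmony_pred = np.array([0, 1, 2, 0, 1, 2])   # assumed output of the harmony classifier
y = np.array([0, 1, 2, 0, 1, 2])
clf = SVC(kernel="rbf").fit(add_harmony_feature(X_low, harmony_pred), y)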



Conclusion

We have looked at two approaches to the modelling of harmony which aim to dig
deeper into the music. In our probabilistic approach to chord transcription, we
demonstrated the advantage of modelling musical context such as key, metrical
structure and bass line, and simultaneously estimating all of these variables
along with the chord. We also developed an audio feature using non-negative
least squares that reflects the notes played better than the standard chroma
feature, and therefore reduces interference from harmonically irrelevant partials
and noise. A further improvement of the system was obtained by modelling the
global structure of the music, identifying repeated sections and averaging features
over these segments. One promising avenue of further work is the separation of
the audio (low-level) and symbolic (high-level) models which are conceptually
distinct but modelled together in current systems. A low-level model would be
concerned only with the production or analysis of audio (the mapping from
notes to features), while a high-level model would be a musical model handling
the mapping from chord symbols to notes.
Using a logic-based approach, we showed that it is possible to automatically
discover patterns in chord sequences which characterise a corpus of data, and
to use such models as classifiers. The advantage with a logic-based approach is
that models learnt by the system are transparent: the decision tree models can
be presented to users as sets of human readable rules. This explanatory power is
particularly relevant for applications such as music recommendation. The DCG
representation allows chord sequences of any length to coexist in the same model,
as well as context information such as key. Our experiments found that the more
musically meaningful Degree-and-Category representation gave better classification results than using root intervals. The results using transcription from audio
data were encouraging in that although some information was lost in the transcription process, the classification results remained well above the baseline, and
thus this approach is still viable when symbolic representations of the music are
not available. Finally, we showed that the combination of high-level harmony
features with low-level features can lead to genre classification accuracy improvements in a state-of-the-art system, and believe that such high-level models
provide a promising direction for genre classification research.
While these methods have advanced the state of the art in music informatics,
it is clear that in several respects they are not yet close to an expert musician's
understanding of harmony. Limiting the representation of harmony to a list of
chord symbols is inadequate for many applications. Such a representation may
be sufficient as a memory aid for jazz and pop musicians, but it allows only a very
limited specification of chord voicing (via the bass note), and does not permit
analysis of polyphonic texture such as voice leading, an important concept in
many harmonic styles, unlike the recent work of [11] and [29]. Finally, we note


that the current work provides little insight into harmonic function, for example
the ability to distinguish harmony notes from ornamental and passing notes and
to recognise chord substitutions, both of which are essential characteristics of a
system that models a musician's understanding of harmony. We hope to address
these issues in future work.
Acknowledgements. This work was performed under the OMRAS2 project,
supported by the Engineering and Physical Sciences Research Council, grant
EP/E017614/1. We would like to thank Chris Harte, Matthew Davies and others
at C4DM who contributed to the annotation of the audio data, and the Pattern
Recognition and Artificial Intelligence Group at the University of Alicante, who
provided the Band in a Box data.

References
1. Anglade, A., Benetos, E., Mauch, M., Dixon, S.: Improving music genre classification using automatically induced harmony rules. Journal of New Music Research 39(4), 349–361 (2010)
2. Anglade, A., Dixon, S.: Characterisation of harmony with inductive logic programming. In: 9th International Conference on Music Information Retrieval, pp. 63–68 (2008)
3. Anglade, A., Ramirez, R., Dixon, S.: First-order logic classification models of musical genres based on harmony. In: 6th Sound and Music Computing Conference, pp. 309–314 (2009)
4. Anglade, A., Ramirez, R., Dixon, S.: Genre classification using harmony rules induced from automatic chord transcriptions. In: 10th International Society for Music Information Retrieval Conference, pp. 669–674 (2009)
5. Aucouturier, J.J., Defreville, B., Pachet, F.: The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music. Journal of the Acoustical Society of America 122(2), 881–891 (2007)
6. Aucouturier, J.J., Pachet, F.: Improving timbre similarity: How high is the sky? Journal of Negative Results in Speech and Audio Sciences 1(1) (2004)
7. Bello, J.P., Pickens, J.: A robust mid-level representation for harmonic content in music signals. In: 6th International Conference on Music Information Retrieval, pp. 304–311 (2005)
8. Benetos, E., Kotropoulos, C.: Non-negative tensor factorization applied to music genre classification. IEEE Transactions on Audio, Speech, and Language Processing 18(8), 1955–1967 (2010)
9. Cathé, P.: Harmonic vectors and stylistic analysis: A computer-aided analysis of the first movement of Brahms' String Quartet Op. 51-1. Journal of Mathematics and Music 4(2), 107–119 (2010)
10. Conklin, D.: Representation and discovery of vertical patterns in music. In: Anagnostopoulou, C., Ferrand, M., Smaill, A. (eds.) ICMAI 2002. LNCS (LNAI), vol. 2445, pp. 32–42. Springer, Heidelberg (2002)
11. Conklin, D., Bergeron, M.: Discovery of contrapuntal patterns. In: 11th International Society for Music Information Retrieval Conference, pp. 201–206 (2010)
12. Conklin, D., Witten, I.: Multiple viewpoint systems for music prediction. Journal of New Music Research 24(1), 51–73 (1995)

13. Dixon, S., Pampalk, E., Widmer, G.: Classification of dance music by periodicity patterns. In: 4th International Conference on Music Information Retrieval, pp. 159–165 (2003)
14. Downie, J., Byrd, D., Crawford, T.: Ten years of ISMIR: Reflections on challenges and opportunities. In: 10th International Society for Music Information Retrieval Conference, pp. 13–18 (2009)
15. Ebcioğlu, K.: An expert system for harmonizing chorales in the style of J. S. Bach. In: Balaban, M., Ebcioğlu, K., Laske, O. (eds.) Understanding Music with AI, pp. 294–333. MIT Press, Cambridge (1992)
16. Fujishima, T.: Realtime chord recognition of musical sound: A system using Common Lisp Music. In: Proceedings of the International Computer Music Conference, pp. 464–467 (1999)
17. Hainsworth, S.W.: Techniques for the Automated Analysis of Musical Audio. Ph.D. thesis, University of Cambridge, Cambridge, UK (2003)
18. Harte, C.: Towards Automatic Extraction of Harmony Information from Music Signals. Ph.D. thesis, Queen Mary University of London, Centre for Digital Music (2010)
19. Krumhansl, C.L.: Cognitive Foundations of Musical Pitch. Oxford University Press, Oxford (1990)
20. Lerdahl, F., Jackendoff, R.: A Generative Theory of Tonal Music. MIT Press, Cambridge (1983)
21. Longuet-Higgins, H., Steedman, M.: On interpreting Bach. Machine Intelligence 6, 221–241 (1971)
22. Mauch, M.: Automatic Chord Transcription from Audio Using Computational Models of Musical Context. Ph.D. thesis, Queen Mary University of London, Centre for Digital Music (2010)
23. Mauch, M., Dixon, S.: Approximate note transcription for the improved identification of difficult chords. In: 11th International Society for Music Information Retrieval Conference, pp. 135–140 (2010)
24. Mauch, M., Dixon, S.: Simultaneous estimation of chords and musical context from audio. IEEE Transactions on Audio, Speech and Language Processing 18(6), 1280–1289 (2010)
25. Mauch, M., Dixon, S., Harte, C., Casey, M., Fields, B.: Discovering chord idioms through Beatles and Real Book songs. In: 8th International Conference on Music Information Retrieval, pp. 111–114 (2007)
26. Mauch, M., Müllensiefen, D., Dixon, S., Wiggins, G.: Can statistical language models be used for the analysis of harmonic progressions? In: International Conference on Music Perception and Cognition (2008)
27. Mauch, M., Noland, K., Dixon, S.: Using musical structure to enhance automatic chord transcription. In: 10th International Society for Music Information Retrieval Conference, pp. 231–236 (2009)
28. Maxwell, H.: An expert system for harmonizing analysis of tonal music. In: Balaban, M., Ebcioğlu, K., Laske, O. (eds.) Understanding Music with AI, pp. 334–353. MIT Press, Cambridge (1992)
29. Mearns, L., Tidhar, D., Dixon, S.: Characterisation of composer style using high-level musical features. In: 3rd ACM Workshop on Machine Learning and Music (2010)
30. Morales, E.: PAL: A pattern-based first-order inductive system. Machine Learning 26(2-3), 227–252 (1997)
31. Pachet, F.: Surprising harmonies. International Journal of Computing Anticipatory Systems 4 (February 1999)

32. Papadopoulos, H.: Joint Estimation of Musical Content Information from an Audio Signal. Ph.D. thesis, Université Pierre et Marie Curie – Paris 6 (2010)
33. Pardo, B., Birmingham, W.: Algorithms for chordal analysis. Computer Music Journal 26(2), 27–49 (2002)
34. Perez-Sancho, C., Rizo, D., Iñesta, J.M.: Genre classification using chords and stochastic language models. Connection Science 21(2-3), 145–159 (2009)
35. Perez-Sancho, C., Rizo, D., Iñesta, J.M., de León, P.J.P., Kersten, S., Ramirez, R.: Genre classification of music by tonal harmony. Intelligent Data Analysis 14, 533–545 (2010)
36. Pickens, J., Bello, J., Monti, G., Sandler, M., Crawford, T., Dovey, M., Byrd, D.: Polyphonic score retrieval using polyphonic audio queries: A harmonic modelling approach. Journal of New Music Research 32(2), 223–236 (2003)
37. Ramirez, R.: Inducing musical rules with ILP. In: Proceedings of the International Conference on Logic Programming, pp. 502–504 (2003)
38. Raphael, C., Stoddard, J.: Functional harmonic analysis using probabilistic models. Computer Music Journal 28(3), 45–52 (2004)
39. Scholz, R., Vincent, E., Bimbot, F.: Robust modeling of musical chord sequences using probabilistic N-grams. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 53–56 (2009)
40. Steedman, M.: A generative grammar for jazz chord sequences. Music Perception 2(1), 52–77 (1984)
41. Temperley, D., Sleator, D.: Modeling meter and harmony: A preference rule approach. Computer Music Journal 23(1), 10–27 (1999)
42. Whorley, R., Wiggins, G., Rhodes, C., Pearce, M.: Development of techniques for the computational modelling of harmony. In: First International Conference on Computational Creativity, pp. 11–15 (2010)
43. Widmer, G.: Discovering simple rules in complex data: A meta-learning algorithm and some surprising musical discoveries. Artificial Intelligence 146(2), 129–148 (2003)
44. Winograd, T.: Linguistics and the computer analysis of tonal harmony. Journal of Music Theory 12(1), 2–49 (1968)

Interactive Music Applications and Standards


Rebecca Stewart, Panos Kudumakis, and Mark Sandler
Queen Mary, University of London,
London, UK
{rebecca.stewart,panos.kudumakis,mark.sandler}@eecs.qmul.ac.uk
http://www.elec.qmul.ac.uk/digitalmusic

Abstract. Music is now consumed in interactive applications that allow
the user to directly influence the musical performance. These applications are distributed as games for gaming consoles and applications for
mobile devices that currently use proprietary file formats, but standardization organizations have been working to develop an interchangeable
format. This paper surveys the applications and their requirements. It
then reviews the current standards that address these requirements, focusing on the MPEG Interactive Music Application Format. The paper
closes by looking at additional standards that address similar applications and outlining the further requirements that need to be met.
Keywords: interactive music, standards.

Introduction

The advent of the Internet and the exploding popularity of file sharing web sites
have challenged the music industry's traditional supply model that relied on the
physical distribution of music recordings such as vinyl records, cassettes, CDs,
etc. [5], [3]. In this direction, new interactive music services have emerged [1],
[6], [7]. However, a standardized file format is inevitably required to provide the
interoperability between various interactive music players and interactive music
applications.
Video games and music consumption, once discrete markets, are now merging.
Games for dedicated gaming consoles such as the Microsoft XBox, Nintendo Wii
and Sony Playstation and applications for smart phones using the Apple iPhone
and Google Android platforms are incorporating music creation and manipulation
into applications which encourage users to purchase music. These games can even
be centered around specific performers such as the Beatles [11] or T-Pain [14].
Many of these games follow a format inspired by karaoke. In its simplest case,
audio processing for karaoke applications involves removing the lead vocals so
that a live singer can perform with the backing tracks. This arrangement grew in
complexity by including automatic lyric following as well. Karaoke performance
used to be relegated to a setup involving a sound system with microphone and
playback capabilities within a dedicated space such as a karaoke bar or living
room, but it has found a revitalized market with mobile devices such as smart
phones. Karaoke is now no longer limited to a certain place or equipment, but can
be performed with a group of friends with a gaming console at home, or performed
with a smart phone, recorded and uploaded online to share with others.
A standard format is needed to allow the same musical content to be produced once and used with multiple applications. We will look at the current commercial applications for interactive music and discuss what requirements need to
be met. We will then look at three standards that address these requirements:
the MPEG-A Interactive Music Application Format (IM AF), IEEE 1599 and the
interaction eXtensible Music Format (iXMF). We conclude by discussing what
improvements still need to be made for these standards to meet the requirements
of currently commercially available applications.

Applications

Karaoke-influenced video games have become popular as titles such as Guitar
Hero and Rock Band have brought interactive music to a broad market [11].
The video games are centered around game controllers that emulate musical
instruments such as the guitar and drum set. The players follow the music as
they would in karaoke, but instead of following lyrics and singing, they follow
colored symbols which indicate when to press the corresponding button. With
Rock Band, karaoke singing is available: backing tracks and lyrics are provided
so that a player can sing along. However, real-time pitch-tracking has enhanced
the gameplay, as the player's intonation and timing are scored.
The company Smule produces applications for the Apple iPhone, iPod Touch
and iPad. One of their most popular applications for the platform is called I Am
T-Pain [14]. The application allows users to sing into their device and automatically processes their voice with the auto-tune effects that characterize the artist
T-Pain's vocals. The user can do this in a karaoke-style performance by purchasing and downloading files containing the backing music to a selection of T-Pain's
released tracks. The song's lyrics then appear on the screen synchronized with
the music as for karaoke, and the user's voice is automatically processed with
an auto-tune effect. The user can change the auto-tune settings to change the
key and mode or use a preset. The freestyle mode allows the user to record their
voice without music and with the auto-tuner. All of the user's performances can
be recorded and uploaded online and easily shared on social networks.
Smule has built on the karaoke concept with the release of Glee Karaoke [13].
The application is branded by the US TV show Glee and features the music
performed on the show. Like the I Am T-Pain application, Glee Karaoke lets
users purchase and download music bundled with lyrics so that they can perform
the vocal portion of the song themselves. Real-time pitch correction and automatic three-part harmony generation are available to enhance the performance.
Users can also upload performances to share online, but unlike I Am T-Pain,
Glee Karaoke users can participate in a competitive game. Similar to the Guitar Hero and Rock Band games, users get points for completing songs and for
correctly singing on-pitch.

Requirements

If the music industry continues to produce content for interactive music applications, a standard distribution format is needed. Content then will not need to be
individually authored for each application. At the most basic level, a standard
needs to allow:
– Separate tracks or groups of tracks.
– Apply signal processing to those tracks or groups.
– Mark up those tracks or stems to include time-based symbolic information.
Once tracks or groups of tracks are separated from the full mix of the song,
additional processing or information can be included to enhance the interactivity
with those tracks. A minimal data-model sketch for such content is given below.
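As an illustration only, the following C++ sketch models the three requirements above as plain data structures: named tracks grouped hierarchically, a per-track gain as the simplest form of signal processing, and time-stamped symbolic events (lyrics, note labels) attached to a track. The type and field names are hypothetical and are not taken from any standard.

#include <string>
#include <vector>

// Hypothetical data model (not from any standard): one time-stamped
// symbolic event, e.g. a lyric syllable or a note label.
struct SymbolicEvent {
    double timeSec;       // position in the track, in seconds
    std::string label;    // lyric syllable, chord symbol, note name, ...
};

// One audio track with the simplest possible signal-processing
// parameter (a gain) and its time-based symbolic markup.
struct Track {
    std::string name;                  // e.g. "lead vocal"
    std::vector<float> samples;        // mono PCM samples, for the sketch
    float gain = 1.0f;                 // linear gain applied at mix time
    std::vector<SymbolicEvent> markup; // time-aligned symbolic information
};

// A group gathers tracks (and sub-groups) so that processing can be
// applied to all of them at once.
struct Group {
    std::string name;            // e.g. "guitars"
    float gain = 1.0f;           // group-level gain
    std::vector<Track> tracks;
    std::vector<Group> subGroups;
};

An application would mix such a structure down to stereo by walking the groups and accumulating gain-scaled samples; richer processing (effects, spatialization) would replace the single gain field.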
3.1 Symbolic Information

Karaoke-style applications involving singing require lyrical information as a bare
minimum, and it is expected that this information is time-aligned with the
audio content. As seen in Rock Band, I Am T-Pain and Glee Karaoke, additional
information regarding the correct pitch and timing is also needed.
A standard for interactive music applications also needs to accommodate multiple parallel sequences of notes. This is especially important for multiple-player
games like Rock Band, where each player has a different instrument and stream
of timings and pitches.
3.2 Audio Signal Processing

The most simplistic interactive model of multiple tracks requires basic mixing
capabilities so that those tracks can be combined to create a single mix of the
song. A traditional karaoke mix could easily be created within this model by
muting the vocal track, but this model could also be extended. Including audio effects as in I Am T-Pain and Glee Karaoke allows users to add musical
content (such as their singing voice) to the mix and better emulate the original
performance.
Spatial audio signal processing is also required for more advanced applications. This could be as simple as panning a track between the left and right
channels of a stereo song, but could grow in complexity when considering applications for gaming consoles. Many games allow for surround sound playback,
usually over a 5.1 loudspeaker setup, so the optimal standard would allow for
flexible loudspeaker configurations. Mobile applications could take advantage of
headphone playback and use binaural audio to create an immersive 3D space.

MPEG-A IM AF

The MPEG-A Interactive Music Application Format (IM AF) standard structures the playback of songs that have multiple, unmixed audio tracks [8], [9], [10].

IM AF creates a container for the tracks, the associated metadata and symbolic
data while also managing how the audio tracks are played. Creating an IM AF
file involves formatting different types of media data, especially multiple audio
tracks with interactivity data, and storing them in an ISO-Base Media File
Format. An IM AF file is composed of:
– Multiple audio tracks representing the music (e.g. instruments and/or voices).
– Groups of audio tracks – a hierarchical structure of audio tracks (e.g. all
guitars of a song can be gathered in the same group).
– Preset data – pre-defined mixing information on multiple audio tracks (e.g.
karaoke and rhythmic version).
– User mixing data and interactivity rules – information related to user interaction (e.g. track/group selection, volume control).
– Metadata used to describe a song, music album, artist, etc.
– Additional media data that can be used to enrich the user's interaction space
(e.g. timed text synchronized with audio tracks which can represent the lyrics
of a song, images related to the song, music album, artist, etc.).
4.1 Mixes

The multiple audio tracks are combined to produce a mix. The mix is defined
by the playback level of the tracks and may be determined by the music content
creator or by the end-user.
An interactive music player utilizing IM AF could allow users to re-mix music
tracks by enabling them to select the number of instruments to be listened to
and adjust the volume of individual tracks to their particular taste. Thus, IM
AF enables users to publish and exchange this re-mixing data, enabling other
users with IM AF players to experience their particular music taste creations.
Preset mixes of tracks could also be available. In particular, IM AF supports two
possible mix modes for interaction and playback: preset-mix mode and user-mix
mode.
In the preset-mix mode, the user selects one preset among the presets stored
in IM AF, and then the audio tracks are mixed using the preset parameters
associated with the selected preset. Some preset examples are:
– General preset – composed of multiple audio tracks by the music producer.
– Karaoke preset – composed of multiple audio tracks except vocal tracks.
– A cappella preset – composed of vocal and chorus tracks.
Figure 1 shows an MPEG-A IM AF player. In user-mix mode, the user selects/deselects the audio tracks/groups and controls the volume of each of them. Thus,
in user-mix mode, audio tracks are mixed according to the user's control and
taste; however, they should comply with the interactivity rules stored in the
IM AF. User interaction should conform to certain rules defined by the music
composers with the aim to fit their artistic creation. However, the rules definition is optional and up to the music composer; the rules are not imposed by the IM
AF format. In general there are two categories of rules in IM AF: selection and
mixing rules.

Fig. 1. An interactive music application. The player on the left shows the song being
played in preset-mix mode and the player on the right shows the user-mix mode.

Fig. 2. Logic for interactivity rules and mixes within IM AF

The selection rules relate to the selection of the audio tracks and
groups at rendering time, whereas the mixing rules relate to the audio mixing.
Note that the interactivity rules allow the music producer to define the amount
of freedom available in IM AF users' mixes. The interactivity rules analyser in the
player verifies whether the user interaction conforms to the music producer's rules.
Figure 2 depicts in a block diagram the logic for both the preset-mix and the
user-mix usage modes.
IM AF supports four types of selection rules, as follows:
– Min/max rule – specifying both the minimum and maximum number of
tracks/groups of the group that might be in the active state.
– Exclusion rule – specifying that several tracks/groups of a song will never be
in the active state at the same time.

– Not mute rule – defining a track/group that is always in the active state.
– Implication rule – specifying that the activation of a track/group implies the
activation of another track/group.
IM AF also supports four types of mixing rules, as follows (a minimal rule-checking sketch in C++ follows these lists):
– Limits rule – specifying the minimum and maximum limits of the relative
volume of each track/group.
– Equivalence rule – specifying an equivalence volume relationship between
tracks/groups.
– Upper rule – specifying a superiority volume relationship between tracks/groups.
– Lower rule – specifying an inferiority volume relationship between tracks/groups.
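To make the selection rules concrete, here is a small C++ sketch that checks a set of active track identifiers against the four selection rule types listed above. The rule representation (a tagged struct) and the function name are illustrative assumptions, not the binary syntax of the IM AF rusc boxes.

#include <set>
#include <vector>

// Illustrative rule representation; the real IM AF 'rusc' boxes use a
// binary box syntax, not this struct.
enum class SelectionRuleType { MinMax, Exclusion, NotMute, Implication };

struct SelectionRule {
    SelectionRuleType type;
    std::vector<int> elements; // track/group identifiers the rule refers to
    int minActive = 0;         // used by MinMax
    int maxActive = 0;         // used by MinMax
};

// Returns true if the user's current selection of active tracks/groups
// satisfies every selection rule.
bool selectionIsValid(const std::set<int>& active,
                      const std::vector<SelectionRule>& rules) {
    for (const SelectionRule& r : rules) {
        int count = 0;
        for (int id : r.elements)
            if (active.count(id)) ++count;

        switch (r.type) {
        case SelectionRuleType::MinMax:      // between min and max active
            if (count < r.minActive || count > r.maxActive) return false;
            break;
        case SelectionRuleType::Exclusion:   // never active together
            if (count > 1) return false;
            break;
        case SelectionRuleType::NotMute:     // must always be active
            if (count != (int)r.elements.size()) return false;
            break;
        case SelectionRuleType::Implication: // elements[0] active => elements[1] active
            if (r.elements.size() >= 2 &&
                active.count(r.elements[0]) && !active.count(r.elements[1]))
                return false;
            break;
        }
    }
    return true;
}

The mixing rules (limits, equivalence, upper, lower) could be checked in the same way on the per-track volume values.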
Backwards compatibility with legacy non-interactive players is also supported by
IM AF. For legacy music players or devices that are not capable of simultaneously
decoding the multiple audio tracks, a special audio track stored in the IM AF file
can still be played.
4.2 File Structure

The file formats accepted within an IM AF file are described in Table 1. IM AF
holds files describing images associated with the audio such as an album cover,
timed text for lyrics, other metadata allowed in MPEG-7, and the audio content.
IM AF also supports a number of brands according to the application domain. These
depend on the device processing power capabilities (e.g. mobile phone, laptop
computer and high fidelity devices), which consequently define the maximum
number of audio tracks that can be decoded simultaneously in an IM AF player
running on a particular device. IM AF brands are summarized in Table 2. In all
IM AF brands, the associated data and metadata are supported.
The IM AF file format structure is derived from the ISO-Base Media File
Format standard. As such it facilitates interchange, management, editing and
presentation of different types of media data and their associated metadata in a
flexible and extensible way.
Table 1. The file formats accepted within an IM AF file

Type        | Component Name                        | Specification
File Format | ISO Base Media File Format (ISO-BMFF) | ISO/IEC 14496-12:2008
Audio       | MPEG-4 Audio AAC Profile              | ISO/IEC 14496-3:2005
Audio       | MPEG-D SAOC                           | ISO/IEC 23003-2:2010
Audio       | MPEG-1 Audio Layer III (MP3)          | ISO/IEC 11172-3:1993
Audio       | PCM                                   | -
Image       | JPEG                                  | ISO/IEC 10918-1:1994
Text        | 3GPP Timed Text                       | 3GPP TS 26.245:2004
Metadata    | MPEG-7 MDS                            | ISO/IEC 15938-5:2003

Table 2. Brands supported by IM AF. For im04 and im12, simultaneously decoded
audio tracks consist of tracks related to SAOC, which are a downmix signal and a SAOC
bitstream. The downmix signal should be encoded using AAC or MP3. For all brands,
the maximum channel number of each track is restricted to 2 (stereo).

[Table listing, for each brand im01, im02, im03, im04, im11, im12 and im21: the supported audio codecs (AAC, MP3, SAOC, PCM), the maximum number of simultaneously decoded tracks (from 2 up to 32), the maximum sampling frequency and resolution (48 kHz/16 bits up to 96 kHz/24 bits), the codec profile/level (e.g. AAC Level 2 or 5, SAOC Baseline 2 or 3), and the target application class (mobile, normal, high-end).]
The object-oriented nature of the ISO-Base Media File Format, inherited in IM AF, enables simplicity in the file structure in terms of
objects that have their own names, sizes and defined specifications according to
their purpose.
Figure 3 illustrates the IM AF file format structure. It mainly consists of ftyp,
moov and mdat type information objects/boxes. The ftyp box contains information on file type and compatibility. The moov box describes the presentation of
the scene and usually includes more than one trak box. A trak box contains
the presentation description for a specific media type. The media type in each trak
box could be audio, image or text. The trak box supports time information for
synchronization with media described in other trak boxes. The mdat box contains the media data described in the trak boxes. Instead of a native system file
path, a trak box may include a URL to locate the media data. In this way the
mdat box maintains a compact representation, enabling efficient
exchange and sharing of IM AF files.
Furthermore, in the moov box some specific information is also included: the
group container box grco, the preset container box prco, and the rules container
box ruco, for storing group, preset and rules information, respectively. The grco
box contains zero or more group boxes designated as grup, describing the group
hierarchy structure of audio tracks and/or groups. The prco box contains one
or more prst boxes, which describe the predefined mixing information in the
absence of user interaction. The ruco box contains zero or more selection rules
boxes rusc and/or mixing rules boxes rumx, describing the interactivity rules
related to selection and/or mixing of audio tracks.
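Since IM AF files follow the ISO-Base Media File Format, their top-level structure can be explored with a generic box parser. The sketch below is offered only as an illustration of the box layout (a 32-bit big-endian size followed by a four-character type, with size = 1 signalling a 64-bit extended size); it lists the top-level boxes of a file and does not interpret IM AF-specific boxes such as grco, prco or ruco.

#include <cstdint>
#include <fstream>
#include <iostream>

// Read a 32-bit big-endian unsigned integer from the stream.
static bool readU32(std::ifstream& in, uint64_t& value) {
    unsigned char b[4];
    if (!in.read(reinterpret_cast<char*>(b), 4)) return false;
    value = (uint64_t(b[0]) << 24) | (uint64_t(b[1]) << 16) |
            (uint64_t(b[2]) << 8)  |  uint64_t(b[3]);
    return true;
}

static bool readU64(std::ifstream& in, uint64_t& value) {
    uint64_t hi, lo;
    if (!readU32(in, hi) || !readU32(in, lo)) return false;
    value = (hi << 32) | lo;
    return true;
}

// List the top-level ISO-BMFF boxes (e.g. ftyp, moov, mdat) of a file.
int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: imaf_boxes <file>\n"; return 1; }
    std::ifstream in(argv[1], std::ios::binary);
    if (!in) { std::cerr << "cannot open file\n"; return 1; }

    uint64_t size;
    while (readU32(in, size)) {
        char type[5] = {0};
        if (!in.read(type, 4)) break;
        uint64_t headerSize = 8;
        if (size == 1) {                 // extended 64-bit size follows
            if (!readU64(in, size)) break;
            headerSize = 16;
        } else if (size == 0) {          // box extends to the end of the file
            std::cout << type << " (to end of file)\n";
            break;
        }
        if (size < headerSize) break;    // malformed box: stop
        std::cout << type << " : " << size << " bytes\n";
        in.seekg(static_cast<std::streamoff>(size - headerSize),
                 std::ios::cur);         // skip the box payload
    }
    return 0;
}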


Fig. 3. IM AF file format

Related Formats

While IM AF packages together the relevant metadata and content that an
interactive music application would require, other file formats have also been
developed as a means to organize and describe synchronized streams of information for different applications. The two that will be briefly reviewed here are
IEEE 1599 [12] and iXMF [4].
5.1 IEEE 1599

IEEE 1599 is an XML-based format for synchronizing multiple streams of symbolic and non-symbolic data validated against a document type definition (DTD).
It was proposed to IEEE Standards in 2001 and was previously referred to as
MX (Musical Application Using XML).


Fig. 4. The layers in IEEE 1599

The standard emphasizes the readability of symbols by both humans and machines, hence the decision to represent all
information that is not audio or video sample data within XML.
The standard is developed primarily for applications that provide additional
information surrounding a piece of music. Example applications include being
able to easily navigate between a score, multiple recordings of performances of
that score, and images of the performers in the recordings [2].
The format consists of six layers that communicate with each other, but there
can be multiple instances of the same layer type. Figure 4 illustrates how the
layers interact. The layers are referred to as:
– General – holds metadata relevant to the entire document.
– Logic – logical description of score symbols.
– Structural – description of musical objects and their relationships.
– Notational – graphical representation of the score.
– Performance – computer-based descriptions of a musical performance.
– Audio – digital audio recording.
5.2 iXMF

Another file format that performs a similar task, with a particular focus on video
games, is iXMF (interaction eXtensible Music Format) [4]. The iXMF standard
is targeted at interactive audio within games development. XMF is a meta file
format that bundles multiple files together, and iXMF uses this same meta file
format as its structure.

iXMF uses a structure in which a moment in time can trigger an event. The
triggered event can encompass a wide array of activities such as the playing of
an audio file or the execution of specific code. The overall structure is described
in [4] as:
– An iXMF file is a collection of Cues.
– A Cue is a collection of Media Chunks and Scripts.
– A Media Chunk is a contiguous region in a playable media file.
– A Script is a set of rules describing how a Media Chunk is played.

The format allows for both audio and symbolic information such
as MIDI to be included. The Scripts then allow for real-time adaptive audio
effects. iXMF has been developed to create interactive soundtracks for video
game environments, so the audio can be generated in real time based on a user's
actions and other external factors. There are a number of standard Scripts that
perform basic tasks such as starting or stopping a Cue, but this set of Scripts
can also be extended.
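The Cue / Media Chunk / Script hierarchy described above can be pictured with a small C++ data model. This is purely illustrative: the names and fields below are assumptions made for the sketch and do not follow the actual iXMF file layout.

#include <functional>
#include <string>
#include <vector>

// Illustrative model of the iXMF structure described in the text;
// not the real iXMF file format.
struct MediaChunk {
    std::string mediaFile;   // playable media file the chunk lives in
    double startSec = 0.0;   // contiguous region inside that file
    double endSec = 0.0;
};

struct Script {
    std::string name;                        // e.g. "start", "stop", "fade"
    std::function<void(MediaChunk&)> rule;   // how the chunk is played
};

struct Cue {
    std::string name;
    std::vector<MediaChunk> chunks;
    std::vector<Script> scripts;
};

// An iXMF-like document is then simply a collection of cues that a game
// engine triggers from time-based or gameplay events.
using InteractiveSoundtrack = std::vector<Cue>;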

Discussion

Current commercial applications built around interactive music require real-time
playback and interaction with multiple audio tracks. Additionally, symbolic information, including text, is needed to accommodate the new karaoke-like games
such as Guitar Hero. The IM AF standard fulfils most of the requirements, but
not all. In particular it lacks the ability to include symbolic information like MIDI
note and instrument data. IEEE 1599 and iXMF can both accommodate MIDI
data, though they lack some of the advantages of IM AF such as direct integration
with a number of MPEG formats.
One of the strengths of iXMF is its Scripts, which can define time-varying
audio effects. These kinds of effects are needed for applications such as I Am
T-Pain and Glee Karaoke. IM AF is beginning to consider integrating such
effects, for example equalization, but greater flexibility will be needed so that the
content creators can create and manipulate their own audio signal processing
algorithms. The consumer will also need to be able to manually adjust the audio
effects applied to the audio in order to build applications like the MXP4 Studio
[7] with IM AF.
As interactive music applications may be used in a variety of settings, from
dedicated gaming consoles to smart phones, any spatialization of the audio needs
to be flexible and automatically adjust to the most appropriate format. This
could range from stereo speakers to surround sound systems or binaural audio
over headphones. IM AF is beginning to support SAOC (Spatial Audio Object
Coding), which addresses this very problem and differentiates it from similar
standards.
While there are a number of standard file formats that have been developed in
parallel to address slightly differing application areas within interactive music,
IM AF is increasingly the best choice for karaoke-style games. There are still

underdeveloped or missing features, but by determining the best practice put
forth in similar standards, IM AF can become an interchangeable file format for
creators to distribute their music to multiple applications. The question then
remains: will the music industry embrace IM AF, enabling interoperability of
interactive music services and applications for the benefit of end users, or will it
try to lock them down in proprietary standards for the benefit of a few oligopolies?
Acknowledgments. This work was supported by UK EPSRC Grants: Platform
Grant (EP/E045235/1) and Follow On Fund (EP/H008160/1).

References
1. Audizen, http://www.audizen.com (last viewed February 2011)
2. Ludovico, L.A.: The new standard IEEE 1599, introduction and examples. J. Multimedia 4(1), 3–8 (2009)
3. Goel, S., Miesing, P., Chandra, U.: The Impact of Illegal Peer-to-Peer File Sharing on the Media Industry. California Management Review 52(3) (Spring 2010)
4. IASIG Interactive XMF Workgroup: Interactive XMF specification: file format specification. Draft 0.9.1a (2008), http://www.iasig.org/pubs/ixmf_draft-v091a.pdf
5. IFPI Digital Music Report 2009: New Business Models for a Changing Environment. International Federation of the Phonographic Industry (January 2009)
6. iKlax Media, http://www.iklaxmusic.com (last viewed February 2011)
7. Interactive Music Studio by MXP4, Inc., http://www.mxp4.com/interactive-music-studio (last viewed February 2011)
8. ISO/IEC 23000-12, Information technology – Multimedia application format (MPEG-A) – Part 12: Interactive music application format (2010), http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=53644
9. ISO/IEC 23000-12/FDAM 1 IM AF Conformance and Reference Software, N11746, 95th MPEG Meeting, Daegu, S. Korea (2011)
10. Kudumakis, P., Jang, I., Sandler, M.: A new interactive MPEG format for the music industry. In: 7th Int. Symposium on Computer Music Modeling and Retrieval (CMMR 2010), Málaga, Spain (2010)
11. Kushner, D.: The Making of the Beatles: Rock Band. IEEE Spectrum 46(9), 30–35 (2009)
12. Ludovico, L.A.: IEEE 1599: a multi-layer approach to music description. J. Multimedia 4(1), 9–14 (2009)
13. Smule, Inc.: Glee Karaoke iPhone Application, http://glee.smule.com/ (last viewed February 2011)
14. Smule, Inc.: I Am T-Pain iPhone Application, http://iamtpain.smule.com/ (last viewed February 2011)

Interactive Music with Active Audio CDs


Sylvain Marchand, Boris Mansencal, and Laurent Girin
LaBRI CNRS, University of Bordeaux, France
{sylvain.marchand,boris.mansencal}@labri.fr
GIPSA-lab CNRS, Grenoble Institute of Technology, France
laurent.girin@gipsa-lab.grenoble-inp.fr

Abstract. With a standard compact disc (CD) audio player, the only
possibility for the user is to listen to the recorded track, passively: the
interaction is limited to changing the global volume or the track. Imagine
now that the listener can turn into a musician, playing with the sound
sources present in the stereo mix, changing their respective volumes and
locations in space. For example, a given instrument or voice can be either
muted, amplied, or more generally moved in the acoustic space. This
will be a kind of generalized karaoke, useful for disc jockeys and also for
music pedagogy (when practicing an instrument). Our system shows that
this dream has come true, with active CDs fully backward compatible
while enabling interactive music. The magic is that the music is in the
sound: the structure of the mix is embedded in the sound signal itself,
using audio watermarking techniques, and the embedded information is
exploited by the player to perform the separation of the sources (patent
pending) used in turn by a spatializer.
Keywords: interactive music, compact disc, audio watermarking, source
separation, sound spatialization.

Introduction

Composers of acousmatic music conduct different stages through the composition process, from sound recording (generally stereophonic) to diffusion (multiphonic). During live interpretation, they interfere decisively on the spatialization
and coloration of pre-recorded sonorities. For this purpose, the musicians generally use a(n un)mixing console. With two hands, this requires some skill and
becomes hardly tractable with many sources or speakers.
Nowadays, the public is also eager to interact with the musical sound. Indeed, more and more commercial CDs come with several versions of the same
musical piece. Some are instrumental versions (for karaoke), others are remixes.
The karaoke phenomenon gets generalized from voice to instruments, in musical
video games such as Rock Band¹. But in this case, to get the interaction the
user has to buy the video game, which includes the multitrack recording.
¹ See URL: http://www.rockband.com
Yet, the music industry is still reluctant to release the multitrack version of
musical hits. The only thing the user can get is a standard CD, thus a stereo
mix, or its dematerialized version available for download. The CD is not dead:
imagine a CD fully backward compatible while permitting musical interaction...

We present the proof of concept of the active audio CD, as a player that
can read any active disc (in fact any 16-bit PCM stereo sound file), decode the
musical structure present in the sound signal, and use it to perform high-quality
source separation. Then, the listener can see and manipulate the sound sources
in the acoustic space. Our system is composed of two parts.
First, a CD reader extracts the audio data of the stereo track and decodes
the musical structure embedded in the audio signal (Section 2). This additional
information consists of the combination of active sources for each time-frequency
atom. As shown in [16], this permits an informed source separation of high quality
(patent pending). In our current system, we get up to 5 individual tracks out of
the stereo mix.
Second, a sound spatializer is able to map in real time all the sound sources
to any position in the acoustic space (Section 3). Our system supports either
binaural (headphones) or multi-loudspeaker configurations. As shown in [14],
the spatialization is done in the spectral domain, is based on acoustics and
interaural cues, and the listener can control the distance and the azimuth of
each source.

Source Separation

In this section, we present a general overview of the informed source separation
technique which is at the heart of the active CD player. This technique is based
on a two-step coder-decoder configuration [16][17], as illustrated in Fig. 1. The
decoder is the active CD player, which can process separation only on mix signals
that have been generated by the coder. At the coder, the mix signal is generated
as a linear instantaneous stationary stereo (LISS) mixture, i.e. a summation of
source signals with constant-gain panning coefficients. Then, the system looks
for the two sources that best explain the mixture (i.e. the two source signals
that are predominant in the mix signal) at different time intervals and frequency
channels, and the corresponding source indexes are embedded into the mixture
signal as side-information using watermarking. The watermarked mix signal is
then quantized to 16-bit PCM. At the decoder, the only available signal is the
watermarked and quantized mix signal. The side-information is extracted from
the mix signal and used to separate the source signals by a local time/frequency
mixture inversion process.
2.1 Time-Frequency Decomposition

The voice / instrument source signals are non-stationary, with possibly large
temporal and spectral variability, and they generally strongly overlap in the
time domain. Decomposing the signals in the time-frequency (TF) domain leads
to a sparse representation, i.e. few TF coefficients have a high energy and the
overlapping of signals is much lower in the TF domain than in the time domain
[29][7][15][20].


Fig. 1. Informed source separation coder and decoder

Therefore, the separation of source signals can be carried out more
efficiently in the TF domain. The Modified Discrete Cosine Transform (MDCT)
[21] is used as the TF decomposition since it presents several properties very
suitable for the present problem: good energy concentration (hence emphasizing
audio signal sparsity), very good robustness to quantization (hence robustness
to quantization-based watermarking), orthogonality and perfect reconstruction.
A detailed description of the MDCT equations is not provided in the present
paper, since it can be found in many papers, e.g. [21]. The MDCT is applied on
the source signals and on the mixture signal at the input of the coder to enable
the selection of predominant sources in the TF domain. Watermarking of the
resulting side-information is applied on the MDCT coefficients of the mix signal,
and the time samples of the watermarked mix signal are provided by inverse
MDCT (IMDCT). At the decoder, the (PCM-quantized) mix signal is MDCT-transformed and the side-information is extracted from the resulting coefficients.
Source separation is also carried out in the MDCT domain, and the resulting
separated MDCT coefficients are used to reconstruct the corresponding time-domain separated source signals by IMDCT. Technically, the MDCT/IMDCT
is applied on signal time frames of W = 2048 samples (46.5 ms for a sampling
frequency fs = 44.1 kHz), with a 50% overlap between consecutive frames (of
1024 frequency bins). The frame length W is chosen to follow the dynamics of
music signals while providing a frequency resolution suitable for the separation.
Appropriate windowing is applied at both analysis and synthesis to ensure the
perfect reconstruction property [21].
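For readers who want to experiment with this framing, the following C++ sketch implements a direct (O(N²)) MDCT/IMDCT pair on 50%-overlapping frames with a sine window, which satisfies the Princen-Bradley condition and therefore gives perfect reconstruction after overlap-add (away from the signal borders). It is a didactic sketch under those assumptions, not the optimized transform used in the actual system; a production implementation would use an FFT-based MDCT, e.g. via FFTW.

#include <cmath>
#include <vector>

static const double PI = 3.14159265358979323846;

// Direct-form MDCT of one frame of 2N samples -> N coefficients, with a
// sine analysis window applied inside:
// X[k] = sum_n x[n] w[n] cos( (pi/N) (n + 1/2 + N/2) (k + 1/2) )
std::vector<double> mdct(const std::vector<double>& frame) {
    const int N = static_cast<int>(frame.size()) / 2;
    std::vector<double> X(N, 0.0);
    for (int k = 0; k < N; ++k)
        for (int n = 0; n < 2 * N; ++n) {
            const double w = std::sin(PI / (2.0 * N) * (n + 0.5)); // sine window
            X[k] += frame[n] * w *
                    std::cos(PI / N * (n + 0.5 + N / 2.0) * (k + 0.5));
        }
    return X;
}

// IMDCT of N coefficients -> 2N samples, windowed again before overlap-add.
std::vector<double> imdct(const std::vector<double>& X) {
    const int N = static_cast<int>(X.size());
    std::vector<double> y(2 * N, 0.0);
    for (int n = 0; n < 2 * N; ++n) {
        for (int k = 0; k < N; ++k)
            y[n] += X[k] * std::cos(PI / N * (n + 0.5 + N / 2.0) * (k + 0.5));
        y[n] *= (2.0 / N) * std::sin(PI / (2.0 * N) * (n + 0.5)); // synthesis window
    }
    return y;
}

// Analysis/synthesis with W = 2048 (N = 1024) and 50% overlap: frames start
// every N samples and the halves of consecutive IMDCT outputs are summed.
void overlapAddResynthesis(const std::vector<double>& x, std::vector<double>& out) {
    const int N = 1024, W = 2 * N;
    out.assign(x.size() + W, 0.0);
    for (std::size_t start = 0; start + W <= x.size(); start += N) {
        std::vector<double> frame(x.begin() + start, x.begin() + start + W);
        std::vector<double> y = imdct(mdct(frame));
        for (int n = 0; n < W; ++n) out[start + n] += y[n]; // overlap-add
    }
}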

2.2 Informed Source Separation

Since the MDCT is a linear transform, the LISS source separation problem
remains LISS in the transformed domain. For each frequency bin f and time bin
t, we thus have:

X(f, t) = A S(f, t)    (1)

where X(f, t) = [X_1(f, t), X_2(f, t)]^T denotes the stereo mixture coefficients vector and S(f, t) = [S_1(f, t), ..., S_N(f, t)]^T denotes the N-source coefficients vector. Because of audio signal sparsity in the TF domain, at most 2 sources are
assumed to be relevant, i.e. of significant energy, at each TF bin (f, t). Therefore,
the mixture is locally given by:

X(f, t) ≈ A_{I_ft} S_{I_ft}(f, t)    (2)

where I_ft denotes the set of 2 relevant sources at TF bin (f, t), and A_{I_ft} represents
the 2 × 2 mixing sub-matrix made with the A_i columns of A, i ∈ I_ft. If Ī_ft
denotes the complementary set of non-active (or at least poorly active) sources
at TF bin (f, t), the source signals at bin (f, t) are estimated by [7]:

Ŝ_{I_ft}(f, t) = A_{I_ft}^{-1} X(f, t),    Ŝ_{Ī_ft}(f, t) = 0    (3)

where A_{I_ft}^{-1} denotes the inverse of A_{I_ft}. Note that such a separation technique
exploits the 2-channel spatial information of the mixture signal and relaxes the
restrictive assumption of a single active source at each TF bin, as made in
[29][2][3].
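The per-bin separation of Equation (3) amounts to solving a 2 × 2 linear system. The sketch below shows this step in C++ for one TF bin, given the full mixing matrix A (2 × N) and the decoded index pair I_ft; the variable and function names are ours and not taken from the reference implementation.

#include <array>
#include <cmath>
#include <vector>

// Separate one TF bin: invert the 2x2 sub-matrix of A formed by the two
// predominant sources (Equation (3)); all other sources are set to zero.
std::vector<double> separateBin(
        const std::array<std::vector<double>, 2>& A, // A[0][i], A[1][i]: panning gains
        double xL, double xR,                        // stereo MDCT coefficients
        int i, int j)                                // decoded predominant sources I_ft
{
    const int N = static_cast<int>(A[0].size());
    std::vector<double> S(N, 0.0);

    // 2x2 sub-matrix [aLi aLj; aRi aRj] and its determinant
    const double aLi = A[0][i], aLj = A[0][j];
    const double aRi = A[1][i], aRj = A[1][j];
    const double det = aLi * aRj - aLj * aRi;
    if (std::abs(det) < 1e-12) return S; // degenerate panning: leave bin at zero

    // Inverse of the sub-matrix applied to the mixture vector
    S[i] = ( aRj * xL - aLj * xR) / det;
    S[j] = (-aRi * xL + aLi * xR) / det;
    return S;
}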
The side-information that is transmitted between coder and decoder (in addition to the mix signal) mainly consists of the coefficients of the mixing matrix
A and the combination of indexes I_ft that identifies the predominant sources in
each TF bin. This contrasts with classic blind and semi-blind separation methods, where both types of information have to be estimated from the mix
signal only, generally in two steps which can both be a very challenging task and
a source of significant errors.
As for the mixing matrix, the number of coefficients to be transmitted is quite
low in the present LISS configuration². Therefore, the transmission cost of A is
negligible compared to the transmission cost of I_ft, and it occupies a very small
portion of the watermarking capacity.
² For 5-source signals, if A is made of normalized column vectors depending on source azimuths, then we have only 5 coefficients.
As for the source indexes, I_ft is determined at the coder for each TF bin
using the source signals, the mixture signal, and the mixture matrix A, as the
combination that provides the lowest mean squared error (MSE) between the
original source signals and the estimated source signals obtained with Equation
(3) (see [17] for details). This process follows the line of oracle estimators, as introduced in [26] for the general purpose of evaluating the performances of source
separation algorithms, especially in the case of underdetermined mixtures and
sparse separation techniques. Note that because of the orthogonality / perfect
reconstruction property of the MDCT, the selection of the optimal source combination can be processed separately at each TF bin, in spite of the overlap-add
operation at source signal reconstruction by IMDCT [26]. When the number of
sources is reasonable (typically about 5 for a standard western popular music
song), I_ft can be found by exhaustive search since, in contrast to the decoding process, the encoding process is done offline and is therefore not subject to
real-time constraints.
It is important to note that at the coder, the optimal combination is determined from the original (unwatermarked) mix signal. In contrast, at the
decoder, only the watermarked mix signal is available, and the source separation
is obtained by applying Equation (3) using the MDCT coefficients of the watermarked (and 16-bit PCM quantized) mix signal X̃_W(f, t) instead of the MDCT
coefficients of the original mix signal X(f, t). However, it has been shown in [17]
that the influence of the watermarking (and PCM quantization) process on separation performance is negligible. This is because the optimal combination for
each TF bin can be coded with a very limited number of bits before being embedded into the mixture signal. For example, for a 5-source mixture, the number
of combinations of two sources among five is 10 and a 4-bit fixed-size code is
appropriate (although non-optimal) for encoding I_ft. In practice, the source separation process can be limited to the [0; 16] kHz bandwidth, because the energy of
audio signals is generally very low beyond 16 kHz. Since the MDCT decomposition provides as many coefficients as time samples, the side-information bit-rate
is 4 · Fs · 16,000/(Fs/2) = 128 kbits/s (Fs = 44.1 kHz is the sampling frequency),
which can be split into two 64 kbits/s streams, one for each of the stereo channels.
This is about 1/4 of the maximum capacity of the watermarking process (see
below), and for such capacity, the distortion of the MDCT coefficients by the
watermarking process is sufficiently low not to corrupt the separation process of
Equation (3). In fact, the main source of degradation in the separation process
lies in the sparsity assumption, i.e. the fact that residual non-predominant,
but non-null, sources may interfere as noise in the local inversion process.
Separation performances are described in detail in [17] for real-world 5-source LISS music mixtures of different musical styles. Basically, source enhancement from input (mix) to output (separated) ranges from 17 dB to 25 dB,
depending on sources and mixture, which is remarkable given the difficulty of
such underdetermined mixtures. The rejection of competing sources is very efficient and the source signals are clearly isolated, as confirmed by listening tests.
Artefacts (musical noise) are present but are quite limited. The quality of the
isolated source signals makes them usable for individual manipulation by the
spatializer.
2.3 Watermarking Process

The side-information embedding process is derived from the Quantization Index


Modulation (QIM) technique of [8], applied to the MDCT coecients of the

Fig. 2. Example of QIM using a set of quantizers for C(t, f) = 2 and the resulting
global grid. We have Δ(t, f) = 2^{C(t,f)} Δ_QIM. The binary code 01 is embedded into
the MDCT coefficient X(t, f) by quantizing it to X_W(t, f) using the quantizer indexed
by 01.

It has been presented in detail in [19][18]. Therefore,
we just present the general lines of the watermarking process in this section, and
we refer the reader to these papers for technical details.
The embedding principle is the following. Let us denote by C(t, f) the capacity at TF bin (t, f), i.e. the maximum size of the binary code to be embedded
in the MDCT coefficient at that TF bin (under the inaudibility constraint). We will
see below how C(t, f) is determined for each TF bin. For each TF bin (t, f), a
set of 2^{C(t,f)} uniform quantizers is defined, whose quantization levels are intertwined, and each quantizer represents a C(t, f)-bit binary code. Embedding a
given binary code on a given MDCT coefficient is done by quantizing this coefficient with the corresponding quantizer (i.e. the quantizer indexed by the code
to transmit; see Fig. 2). At the decoder, recovering the code is done by comparing the transmitted MDCT coefficient (potentially corrupted by transmission
noise) with the 2^{C(t,f)} quantizers, and selecting the quantizer with the quantization level closest to the transmitted MDCT coefficient. Note that because
the capacity values depend on (f, t), those values must be transmitted to the
decoder to select the right set of quantizers. For this, a fixed-capacity embedding
reservoir is allocated in the higher frequency region of the spectrum, and the
capacity values are actually defined within subbands (see [18] for details). Note
also that the complete binary message to transmit (here the set of I_ft codes) is
split and spread across the different MDCT coefficients according to the local
capacity values, so that each MDCT coefficient carries a small part of the complete message. Conversely, the decoded elementary messages have to be concatenated to recover the complete message. The embedding rate R is given by the
average total number of embedded bits per second of signal. It is obtained by
summing the capacity C(t, f) over the embedded region of the TF plane and
dividing the result by the signal duration.
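The QIM step for a single coefficient can be written compactly. The sketch below is a simplified illustration rather than the exact implementation of [8] or of this system: following the grid of Fig. 2, the quantizer for code m has step Δ = 2^C Δ_QIM and is offset by m Δ_QIM, so embedding rounds the coefficient to the nearest level of that quantizer and decoding returns the index of the globally nearest level modulo 2^C.

#include <cmath>
#include <cstdint>

// Embed a C-bit code into one MDCT coefficient using intertwined uniform
// quantizers (QIM). deltaQIM is the spacing between levels of different
// quantizers; the quantizer for a given code has step 2^C * deltaQIM.
double qimEmbed(double x, uint32_t code, int C, double deltaQIM) {
    const double delta  = std::ldexp(deltaQIM, C);           // 2^C * deltaQIM
    const double offset = code * deltaQIM;                   // quantizer offset
    return offset + delta * std::round((x - offset) / delta);
}

// Recover the code from a (possibly noisy) coefficient: find the nearest
// level of the global grid (step deltaQIM) and return its index mod 2^C.
uint32_t qimDecode(double xw, int C, double deltaQIM) {
    const long long idx = std::llround(xw / deltaQIM);
    const long long m = 1LL << C;
    return static_cast<uint32_t>(((idx % m) + m) % m);        // positive modulo
}

With this construction the decoder recovers the code as long as the transmission noise on the coefficient stays below Δ_QIM/2, which is exactly the role of the lower bound on Δ_QIM discussed next.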
The performance of the embedding process is determined by two related constraints: the watermark decoding must be robust to the 16-bit PCM conversion
of the mix signal (which is the only source of noise, because the perfect reconstruction property of the MDCT ensures transparency of the IMDCT/MDCT chained
processes), and the watermark must be inaudible. The time-domain PCM quantization leads to additive white Gaussian noise on the MDCT coefficients, which
induces a lower bound for Δ_QIM, the minimum distance between two different
levels of all QIM quantizers (see Fig. 2). Given that lower bound, the inaudibility
constraint induces an upper bound on the number of quantizers, hence a corresponding upper bound on the capacity C(t, f) [19][18]. More specifically, the
constraint is that the power of the embedding error in the worst case remains
under the masking threshold M(t, f) provided by a psychoacoustic model. The
PAM is inspired from the MPEG-AAC model [11] and adapted to the present
data hiding problem. It is shown in [18] that the optimal capacity is given by:

C*(t, f) = ⌊ (1/2) log_2( 10^{M(t,f)/10} / Δ_QIM^2 + 1 ) ⌋    (4)
where ⌊·⌋ denotes the floor function, and β is a scaling factor (in dB) that enables
users to control the trade-off between signal degradation and embedding rate
by translating the masking threshold. Signal quality is expected to decrease as
the embedding rate increases, and vice-versa. When β > 0 dB, the masking threshold
is raised. Larger values of the quantization error allow for larger capacities (and
thus a higher embedding rate), at the price of potentially lower quality. At the
opposite, when β < 0 dB, the masking threshold is lowered, leading to a safety
margin for the inaudibility of the embedding process, at the price of a lower
embedding rate. It can be shown that the embedding rate R_β corresponding to
C*_β and the basic rate R = R_0 are related by [18]:

R_β ≈ R + (log_2(10)/10) β F_u    (5)

(Fu being the bandwidth of the embedded frequency region). This linear relation
enables to easily control the embedding rate by the setting of .
The inaudibility of the watermarking process has been assessed by subjective
and objective tests. In [19][18], Objective Difference Grade (ODG) scores [24][12]
were calculated for a large range of embedding rates and different musical styles.
The ODG remained very close to zero (hence imperceptibility of the watermark)
for rates up to about 260 kbps for musical styles such as pop, rock, jazz, funk,
bossa, fusion, etc. (and only up to about 175 kbps for classical music). Such
rates generally correspond to the basic level of the masking curve allowed by
the PAM (i.e. β = 0 dB). More comfortable rates can be set between 150
and 200 kbits/s to guarantee transparent quality for the embedded signal. This
flexibility is used in our informed source separation system to fit the embedding
capacity to the bit-rate of the side-information, which is at the very reasonable
value of 64 kbits/s/channel. Here, the watermarking is guaranteed to be highly
inaudible, since the masking curve is significantly lowered to fit the required
capacity.

Sound Spatialization

Now that we have recovered the different sound sources present in the original
mix, we can allow the user to manipulate them in space. We consider each punctual and omni-directional sound source in the horizontal plane, located by its (ρ, θ)
coordinates, where ρ is the distance of the source to the head center and θ is the
azimuth angle. Indeed, as a first approximation in most musical situations, both
the listeners and instrumentalists are standing on the (same) ground, with no relative elevation. Moreover, we consider that the distance ρ is large enough for the
acoustic wave to be regarded as planar when reaching the ears.
3.1 Acoustic Cues

In this section, we intend to perform real-time high-quality (convolutive) mixing.
The source s will reach the left (L) and right (R) ears through different acoustic
paths, characterizable with a pair of filters, whose spectral versions are called
Head-Related Transfer Functions (HRTFs). HRTFs are frequency- and subject-dependent. The CIPIC database [1] samples different listeners and directions of
arrival.
A sound source positioned to the left will reach the left ear sooner than the
right one; in the same manner, the right level should be lower due to wave propagation and head shadowing. Thus, the difference in amplitude or Interaural
Level Difference (ILD, expressed in decibels, dB) [23] and the difference in arrival
time or Interaural Time Difference (ITD, expressed in seconds) [22] are the main
spatial cues for the human auditory system [6].
Interaural Level Differences. After Viste [27], the ILDs can be expressed as
functions of sin(θ), thus leading to a sinusoidal model:

ILD(θ, f) = α(f) sin(θ)    (6)

where α(f) is the average scaling factor that best suits our model, in the least-square sense, for each listener of the CIPIC database (see Fig. 3). The overall
error of this model over the CIPIC database for all subjects, azimuths, and
frequencies is of 4.29 dB.

Fig. 3. Frequency-dependent scaling factors: α (top) and β (bottom)

Interaural Time Differences. Because of the head shadowing, Viste uses for
the ITDs a model based on sin(θ) + θ, after Woodworth [28]. However, from the
theory of the diffraction of a harmonic plane wave by a sphere (the head), the
ITDs should be proportional to sin(θ). Contrary to the model by Kuhn [13], our
model takes into account the inter-subject variation and the full frequency band.
The ITD model is then expressed as:

ITD(θ, f) = β(f) r sin(θ)/c    (7)

where β is the average scaling factor that best suits our model, in the least-square sense, for each listener of the CIPIC database (see Fig. 3), r denotes the
head radius, and c is the sound celerity. The overall error of this model over the
CIPIC database is 0.052 ms (thus comparable to the 0.045 ms error of the model
by Viste).
Distance Cues. In ideal conditions, the intensity of a source decreases by 6 dB
each time the distance is doubled, according to the well-known
Inverse Square Law [5]. Applying only this frequency-independent rule to a signal has no effect on the sound timbre. But when a source moves far from the
listener, the high frequencies are more attenuated than the low frequencies. Thus

the sound spectrum changes with the distance. More precisely, the spectral centroid moves towards the low frequencies as the distance increases. In [4], the
authors show that the frequency-dependent attenuation due to atmospheric absorption is roughly proportional to f^2, similarly to the ISO 9613-1 norm [10].
Here, we manipulate the magnitude spectrum to simulate the distance between
the source and the listener. Conversely, we would measure the spectral centroid
(related to brightness) to estimate the source's distance to the listener.
In a concert room, the distance is often simulated by placing the speaker near
to / away from the auditorium, which is sometimes physically restricted in small
rooms. In fact, the architecture of the room plays an important role and can
lead to severe modifications in the interpretation of the piece.
Here, simulating the distance is a matter of changing the magnitude of each
short-term spectrum X. More precisely, the ISO 9613-1 norm [10] gives the
frequency-dependent attenuation factor in dB for given air temperature, humidity, and pressure conditions. At distance ρ, the magnitudes of X(f) should be
attenuated by D(f, ρ) decibels:

D(f, ρ) = a(f) ρ    (8)

where a(f) is the frequency-dependent attenuation, which will have an impact
on the brightness of the sound (higher frequencies being more attenuated than
lower ones).
More precisely, the total absorption in decibels per meter a(f) is given by a
rather complicated formula:

a(f) ≈ 8.68 F^2 ( 1.84×10^{-11} (T/T0)^{1/2} P0
       + (T/T0)^{-5/2} [ 0.01275 e^{-2239.1/T} / (Fr,O + F^2/Fr,O)
       + 0.1068 e^{-3352/T} / (Fr,N + F^2/Fr,N) ] ) P    (9)
where F = f/P, Fr,O = fr,O/P and Fr,N = fr,N/P are frequencies scaled by the
atmospheric pressure P, P0 is the reference atmospheric pressure (1 atm), f
is the frequency in Hz, T is the atmospheric temperature in Kelvin (K), T0 is the
reference atmospheric temperature (293.15 K), fr,O is the relaxation frequency
of molecular oxygen, and fr,N is the relaxation frequency of molecular nitrogen
(see [4] for details).
The spectrum thus becomes:

X(ρ, f) = 10^{(X_dB(t,f) − D(f,ρ))/20}    (10)

where X_dB is the spectrum X in dB scale.
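As a small worked example of Equations (8)-(10), the sketch below computes the atmospheric absorption a(f) from the formula above and applies the resulting attenuation D(f, ρ) to a magnitude spectrum. The oxygen and nitrogen relaxation frequencies are passed in as parameters (their humidity-dependent expressions, given in ISO 9613-1 and [4], are not reproduced here), so this is an illustrative sketch rather than a full implementation of the norm.

#include <cmath>
#include <vector>

// Absorption in dB per meter, following Equation (9); P in atm (P0 = 1 atm),
// T in Kelvin (T0 = 293.15 K), frO and frN are the relaxation frequencies of
// molecular oxygen and nitrogen (assumed precomputed from ISO 9613-1).
double absorptionDbPerMeter(double f, double T, double P, double frO, double frN) {
    const double T0 = 293.15, P0 = 1.0;
    const double F = f / P, FrO = frO / P, FrN = frN / P; // pressure-scaled frequencies
    const double classical  = 1.84e-11 * std::sqrt(T / T0) * P0;
    const double relaxation = std::pow(T / T0, -2.5) *
        (0.01275 * std::exp(-2239.1 / T) / (FrO + F * F / FrO) +
         0.1068  * std::exp(-3352.0 / T) / (FrN + F * F / FrN));
    return 8.68 * F * F * (classical + relaxation) * P;
}

// Attenuate a magnitude spectrum for a source at distance rho (in meters),
// applying D(f, rho) = a(f) * rho dB per bin (Equations (8) and (10)).
void applyDistance(std::vector<double>& magnitude, double binHz, double rho,
                   double T, double P, double frO, double frN) {
    for (std::size_t k = 0; k < magnitude.size(); ++k) {
        const double f = k * binHz;                                   // bin center frequency
        const double D = absorptionDbPerMeter(f, T, P, frO, frN) * rho; // dB
        magnitude[k] *= std::pow(10.0, -D / 20.0);                    // dB -> linear gain
    }
}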


3.2 Binaural Spatialization

In binaural listening conditions using headphones, the sound from each earphone
speaker is heard only by one ear. Thus the encoded spatial cues are not affected
by any cross-talk signals between earphone speakers.


To spatialize a sound source to an expected azimuth θ, for each short-term spectrum X, we compute the pair of left (XL) and right (XR) spectra from the spatial cues corresponding to θ, using Equations (6) and (7), and:
XL(t, f) = HL(t, f) X(t, f) with HL(t, f) = 10^(+aθ(f)/2) e^(+jφθ(f)/2),    (11)

XR(t, f) = HR(t, f) X(t, f) with HR(t, f) = 10^(−aθ(f)/2) e^(−jφθ(f)/2)    (12)

(because of the symmetry between the left and right ears), where aθ and φθ are given by:

aθ(f) = ILD(θ, f) / 20,    (13)

φθ(f) = ITD(θ, f) · 2πf.    (14)

This is indeed a convolutive model, the convolution turning into a multiplication in the spectral domain. Moreover, the spatialization coefficients are complex. The control of both amplitude and phase should provide better audio quality [25] than amplitude-only spatialization. Indeed, informal listening tests with AKG K240 Studio headphones show a remarkable spatialization realism.
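A sketch of Equations (11)–(14): it computes the complex left/right spatialization coefficients from ILD and ITD values that are assumed to come from the models of Equations (6) and (7).

import numpy as np

def binaural_filters(ild_db, itd_s, freqs):
    # ild_db: ILD(theta, f) in dB per bin; itd_s: ITD(theta, f) in seconds
    # per bin; freqs: bin frequencies in Hz.
    a = ild_db / 20.0                               # Equation (13)
    phi = itd_s * 2.0 * np.pi * freqs               # Equation (14)
    HL = 10.0 ** (+a / 2.0) * np.exp(+1j * phi / 2.0)   # Equation (11)
    HR = 10.0 ** (-a / 2.0) * np.exp(-1j * phi / 2.0)   # Equation (12)
    return HL, HR

# Spatializing a spectrum X is then XL = HL * X and XR = HR * X.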
3.3 Multi-loudspeaker Spatialization

In a stereophonic display, the sound from each loudspeaker is heard by both ears. Thus, as in the transaural case, the stereo sound reaches the ears through four acoustic paths, corresponding to transfer functions (Cij, i representing the speaker and j the ear), see Fig. 4. Here, we generate these paths artificially using the binaural model (using the distance and azimuth of the source to the ears for H, and of the speakers to the ears for C). Since we have:
XL = HL X = CLL KL X + CLR KR X    (15)

XR = HR X = CRL KL X + CRR KR X    (16)

(where YL = KL X and YR = KR X denote the speaker feed spectra),

the best panning coefficients under CIPIC conditions for the pair of speakers to match the binaural signals at the ears (see Equations (11) and (12)) are then given by:

KL(t, f) = ΔC (CRR HL − CLR HR),    (17)

KR(t, f) = ΔC (−CRL HL + CLL HR)    (18)

with the determinant factor computed as:

ΔC = 1 / (CLL CRR − CRL CLR).    (19)

During diffusion, the left and right signals (YL, YR) that feed the left and right speakers are obtained by multiplying the short-term spectrum X with KL and KR, respectively:

YL(t, f) = KL(t, f) X(t, f) = ΔC (CRR XL − CLR XR),    (20)

YR(t, f) = KR(t, f) X(t, f) = ΔC (−CRL XL + CLL XR).    (21)
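A sketch of Equations (17)–(21): given the binaural targets HL, HR and the four speaker-to-ear transfer functions, it computes the panning coefficients and the two speaker feeds. All arguments are per-bin complex spectra; the function and variable names are illustrative, not those of the actual implementation.

import numpy as np

def stereo_panning(HL, HR, CLL, CLR, CRL, CRR, X):
    det = 1.0 / (CLL * CRR - CRL * CLR)        # Equation (19)
    KL = det * (CRR * HL - CLR * HR)           # Equation (17)
    KR = det * (-CRL * HL + CLL * HR)          # Equation (18)
    YL = KL * X                                # Equation (20)
    YR = KR * X                                # Equation (21)
    return YL, YR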

Fig. 4. Stereophonic loudspeaker display: the sound source X reaches the ears L, R
through four acoustic paths (CLL , CLR , CRL , CRR )

sound image

S2

S3

S1

S4

Fig. 5. Pairwise paradigm: for a given sound source, signals are dispatched only to the
two speakers closest to it (in azimuth)

In a setup with many speakers we use the classic pair-wise paradigm [9], consisting in choosing for a given source only the two speakers closest to it (in azimuth): one at the left of the source, the other at its right (see Fig. 5). The left and right signals computed for the source are then dispatched accordingly.

Fig. 6. Overview of the software system architecture
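A sketch of the pair-wise dispatching step: it returns the indices of the two speakers whose azimuths bracket the source azimuth, wrapping around the circle; the speaker layout and angle conventions are illustrative assumptions.

import numpy as np

def surrounding_pair(source_az, speaker_az):
    # Indices of the two speakers surrounding source_az (degrees, 0-360).
    order = np.argsort(speaker_az)
    az = np.asarray(speaker_az)[order]
    src = source_az % 360.0
    # first sorted speaker whose azimuth exceeds the source azimuth
    j = int(np.searchsorted(az, src) % len(az))
    return order[j - 1], order[j]   # j - 1 wraps to the last speaker

# example: octophonic ring every 45 degrees, source at 100 degrees
print(surrounding_pair(100.0, np.arange(0, 360, 45)))  # speakers at 90 and 135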

4 Software System

Our methods for source separation and sound spatialization have been implemented as a real-time software system, programmed in the C++ language and using Qt4, JACK, and FFTW (see the footnotes below). These libraries were chosen to ensure portability and performance on multiple platforms. The current implementation has been tested on the Linux and MacOS X operating systems, but should work with very minor changes on other platforms, e.g. Windows.
Fig. 6 shows an overview of the architecture of our software system. Source separation and sound spatialization are implemented as two different modules. We rely on the JACK audio port system to route audio streams between these two modules in real time.

This separation into two modules was mainly dictated by a different choice of distribution license: the source separation of the active player should be patented and released without sources, while the spatializer will be freely available under the GNU General Public License.
4.1 Usage

Player. The active player is presented as a simple audio player, based on JACK. The graphical user interface (GUI) is a very common player interface. It allows the user to play or pause the reading / decoding. The player reads activated stereo files, from an audio CD or a file, and then decodes the stereo mix in order to extract the N (mono) sources. These sources are then transferred to N JACK output ports, currently named QJackPlayerSeparator:outputi, with i in [1; N].
Spatializer. The spatializer is also a real-time application, standalone and based on JACK. It has N input ports that correspond to the N sources to spatialize. These ports are to be connected, with the JACK port connection system, to the N output ports of the active player. The spatializer can be configured to work with headphones (binaural configuration) or with M loudspeakers.
3 See URL: http://trolltech.com/products/qt
4 See URL: http://jackaudio.org
5 See URL: http://www.fftw.org


Fig. 7. From the stereo mix stored on the CD, our player allows the listener (center) to manipulate 5 sources in the acoustic space, using here an octophonic display (top) or headphones (bottom)

Fig. 7 shows the current interface of the spatializer, which displays a bird's-eye view of the audio scene. The user's avatar is in the middle, represented by a head viewed from above. It is surrounded by the various sources, represented as notes in colored discs. When used in a multi-speaker configuration, the speakers may be represented in the scene. If used in a binaural configuration, the user's avatar is represented wearing headphones.

Fig. 8. Example of configuration files: 5-source configuration (top), binaural output configuration (middle), and 8-speaker configuration (bottom) files
With this graphical user interface, the user can interactively move each source individually. The user picks one of the source representations and drags it around. The corresponding audio stream is then spatialized, in real time, according to the new source position (distance and azimuth). The user can also move the avatar among the sources, as if the listener were moving on the stage, between the instrumentalists. In this situation, the spatialization changes for all the sources simultaneously, according to their new positions relative to the moving user avatar.
Inputs and outputs are set via two configuration files (see Fig. 8). A source configuration file defines the number of sources. For each source, this file gives the name of the output port to which a spatializer input port will be connected, and also its original azimuth and distance. Fig. 8 shows the source configuration file to connect to the active player with 5 ports. A speaker configuration file defines the number of speakers. For each speaker, this file gives the name of the physical (soundcard) port to which a spatializer output port will be connected, and the azimuth and distance of the speaker. The binaural case is distinguished


Fig. 9. Processing pipeline for the spatialization of N sources on M speakers

by the fact that it has only two speakers with neither azimuth nor distance specified. Fig. 8 shows the speaker configuration files for the binaural and octophonic (8-speaker) configurations.
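The exact syntax of these files is not given in the text; purely as a hypothetical illustration, the two kinds of configuration could carry the following information (one line per source or speaker: port name, azimuth in degrees, distance in meters).

# hypothetical source configuration (5 sources)
QJackPlayerSeparator:output1   -60   2.0
QJackPlayerSeparator:output2   -30   2.0
QJackPlayerSeparator:output3     0   2.0
QJackPlayerSeparator:output4    30   2.0
QJackPlayerSeparator:output5    60   2.0

# hypothetical 8-speaker configuration (a binaural file would list only two
# speakers, with no azimuth and no distance)
system:playback_1     0   1.5
system:playback_2    45   1.5
system:playback_3    90   1.5
...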
4.2 Implementation

Player. The current implementation is divided into three threads. The main thread is the Qt GUI. A second thread reads and buffers data from the stereo file, to be able to compensate for any physical CD reader latency. The third thread is the JACK process function. It separates the data for the N sources and feeds the output ports accordingly. In the current implementation, the number of output sources is fixed to N = 5.

Our source separation implementation is rather efficient, as for a Modified Discrete Cosine Transform (MDCT) of W samples, we only do a Fast Fourier Transform (FFT) of size W/4. Indeed, an MDCT of length W is almost equivalent to a type-IV DCT of length W/2 that can be computed with an FFT of length W/4. Thus, as we use MDCT and IMDCT of size W = 2048, we only do FFT and IFFT of 512 samples.
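For reference, a direct (O(W²)) MDCT written out from its definition; the fast path described above maps it to a type-IV DCT of length W/2, computable with an FFT of length W/4, but the naive version below is enough to illustrate what is being computed.

import numpy as np

def mdct(frame):
    # Direct MDCT of a windowed frame of even length W; returns W/2 coefficients.
    W = len(frame)
    n = np.arange(W)
    k = np.arange(W // 2)
    basis = np.cos(np.pi / (W // 2) * (n[None, :] + 0.5 + W / 4) * (k[:, None] + 0.5))
    return basis @ frame

x = np.random.randn(2048)
print(mdct(x).shape)   # (1024,)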
Spatializer. The spatializer is currently composed of two threads: a main
thread, the Qt GUI, and the JACK process function.
Fig. 9 shows the processing pipeline for the spatialization. For each source, xi is first transformed into the spectral domain with an FFT to obtain its spectrum Xi. This spectrum is attenuated for distance ρi (see Equation (10)). Then, for an azimuth θi, we obtain the left (XiL) and right (XiR) spectra (see Equations (11) and (12)). The dispatcher then chooses the pair (j, j + 1) of speakers surrounding the azimuth θi, transforms the spectra XiL and XiR by the coefficients corresponding to this speaker pair (see Equations (20) and (21)), and adds the resulting spectra Yj and Yj+1 to the spectra of these speakers. Finally, for each speaker, its spectrum is transformed with an IFFT to obtain back, in the time domain, the mono signal yj for the corresponding output.
Source spatialization is more computation-intensive than source separation,
mainly because it requires more transforms (N FFTs and M IFFTs) of larger
size W = 2048. For now, source spatialization is implemented as a serial process. However, we can see that this pipeline is highly parallel. Indeed, almost
everything operates on separate data. Only the spectra of the speakers may be
accessed concurrently, to accumulate the spectra of sources that would be spatialized to the same or neighbouring speaker pairs. These spectra should then
be protected with mutual exclusion mechanisms. A future version will take advantage of multi-core processor architectures.
4.3 Experiments

Our current prototype has been tested on an Apple MacBook Pro, with an Intel Core 2 Duo 2.53 GHz processor, connected to headphones or to an 8-speaker system via a MOTU 828 MKII soundcard. For such a configuration, the processing power required is well contained. In order to run in real time, given a signal sampling frequency of 44.1 kHz and windows of 2048 samples, the overall processing time should be less than 23 ms. With our current implementation, 5-source separation and 8-speaker spatialization, this processing time is in fact less than 3 ms on the laptop mentioned previously. Therefore, the margin to increase the number of sources to separate and/or the number of loudspeakers is quite comfortable. To confirm this, we exploited the split between the source separation and spatialization modules to test the spatializer without the active player, since the latter is currently limited to 5 sources. We connected to the spatializer a multi-track player that reads several files simultaneously and exposes these tracks as JACK output ports. Tests showed that the spatialization can be applied to roughly 48 sources on 8 speakers, or 40 sources on 40 speakers, on this computer.

Fig. 10. Enhanced graphical interface with pictures of instruments for sources and propagating sound waves represented as colored circles
This performance leaves some processing power available for other computations, for example to improve the user experience. Fig. 10 shows an example of an enhanced graphical interface where the sources are represented with pictures of the instruments, and the propagation of the sound waves is represented for each source by time-evolving colored circles. The color of each circle is computed from the color (spectral envelope) of the spectrum of each source and updated in real time as the sound changes.

5 Conclusion and Future Work

We have presented a real-time system for musical interaction from stereo files, fully backward-compatible with standard audio CDs. This system consists of a source separator and a spatializer.

The source separation is based on the sparsity of the source signals in the spectral domain and the exploitation of the stereophony. This system is characterized by a quite simple separation process and by the fact that some side-information is inaudibly embedded in the signal itself to guide the separation process. Compared to (semi-)blind approaches also based on sparsity and local mixture inversion, the informed aspect of the separation guarantees the optimal combination of the sources, thus leading to a remarkable increase in quality of the separated signals.

The sound spatialization is based on a simplified model of the head-related transfer functions, generalized to any multi-loudspeaker configuration using a transaural technique for the best pair of loudspeakers for each sound source. Although this quite simple technique does not compete with the 3D accuracy of Ambisonics or holophony (Wave Field Synthesis), it is very flexible (no specific loudspeaker configuration) and suitable for a large audience (no hot-spot effect) with sufficient sound quality.

The resulting software system is able to separate 5-source stereo mixtures (read from audio CD or 16-bit PCM files) in real time, and it enables the user to remix the piece of music during restitution with basic functions such as volume and spatialization control. The system has been demonstrated in several countries with excellent feedback from the users / listeners, with a clear potential in terms of musical creativity, pedagogy, and entertainment.


For now, the mixing model imposed by the informed source separation is generally over-simplistic when professional / commercial music production is at stake. Extending the source separation technique to high-quality convolutive mixing is part of our future research.

As shown in [14], the model we use for the spatialization is more general, and can also be used to localize audio sources. Thus we would like to add the automatic detection of the speaker configuration to our system, from a pair of microphones placed in the audience, as well as the automatic fine tuning of the spatialization coefficients to improve the 3D sound effect.

Regarding performance, most operations act on separate data and could thus easily be parallelized on modern hardware architectures. Last but not least, we are also porting the whole application to mobile touch devices, such as smart phones and tablets. Indeed, we believe that these devices are perfect targets for a system in between music listening and gaming, and gestural interfaces with direct interaction to move the sources are very intuitive.

Acknowledgments
This research was partly supported by the French ANR (Agence Nationale de la
Recherche) DReaM project (ANR-09-CORD-006).

References
1. Algazi, V.R., Duda, R.O., Thompson, D.M., Avendano, C.: The CIPIC HRTF database. In: Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York, pp. 99–102 (2001)
2. Araki, S., Sawada, H., Makino, S.: K-means based underdetermined blind speech separation. In: Makino, S., et al. (eds.) Blind Source Separation, pp. 243–270. Springer, Heidelberg (2007)
3. Araki, S., Sawada, H., Mukai, R., Makino, S.: Underdetermined blind sparse source separation for arbitrarily arranged multiple sensors. Signal Processing 87(8), 1833–1847 (2007)
4. Bass, H., Sutherland, L., Zuckerwar, A., Blackstock, D., Hester, D.: Atmospheric absorption of sound: Further developments. Journal of the Acoustical Society of America 97(1), 680–683 (1995)
5. Berg, R.E., Stork, D.G.: The Physics of Sound, 2nd edn. Prentice Hall, Englewood Cliffs (1994)
6. Blauert, J.: Spatial Hearing, revised edn. MIT Press, Cambridge (1997); translation by J.S. Allen
7. Bofill, P., Zibulevsky, M.: Underdetermined blind source separation using sparse representations. Signal Processing 81(11), 2353–2362 (2001)
8. Chen, B., Wornell, G.: Quantization index modulation: A class of provably good methods for digital watermarking and information embedding. IEEE Transactions on Information Theory 47(4), 1423–1443 (2001)
9. Chowning, J.M.: The simulation of moving sound sources. Journal of the Audio Engineering Society 19(1), 2–6 (1971)


10. International Organization for Standardization, Geneva, Switzerland: ISO 9613-1:1993: Acoustics – Attenuation of Sound During Propagation Outdoors – Part 1: Calculation of the Absorption of Sound by the Atmosphere (1993)
11. ISO/IEC JTC1/SC29/WG11 MPEG: Information technology – Generic coding of moving pictures and associated audio information – Part 7: Advanced Audio Coding (AAC), IS 13818-7(E) (2004)
12. ITU-R: Method for objective measurements of perceived audio quality (PEAQ), Recommendation BS.1387-1 (2001)
13. Kuhn, G.F.: Model for the interaural time differences in the azimuthal plane. Journal of the Acoustical Society of America 62(1), 157–167 (1977)
14. Mouba, J., Marchand, S., Mansencal, B., Rivet, J.M.: RetroSpat: a perception-based system for semi-automatic diffusion of acousmatic music. In: Proceedings of the Sound and Music Computing (SMC) Conference, Berlin, pp. 33–40 (2008)
15. O'Grady, P., Pearlmutter, B.A., Rickard, S.: Survey of sparse and non-sparse methods in source separation. International Journal of Imaging Systems and Technology 15(1), 18–33 (2005)
16. Parvaix, M., Girin, L.: Informed source separation of underdetermined instantaneous stereo mixtures using source index embedding. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Dallas, Texas (2010)
17. Parvaix, M., Girin, L.: Informed source separation of linear instantaneous underdetermined audio mixtures by source index embedding. IEEE Transactions on Audio, Speech, and Language Processing (accepted, pending publication, 2011)
18. Pinel, J., Girin, L., Baras, C.: A high-rate data hiding technique for uncompressed audio signals. IEEE Transactions on Audio, Speech, and Language Processing (submitted)
19. Pinel, J., Girin, L., Baras, C., Parvaix, M.: A high-capacity watermarking technique for audio signals based on MDCT-domain quantization. In: International Congress on Acoustics (ICA), Sydney, Australia (2010)
20. Plumbley, M.D., Blumensath, T., Daudet, L., Gribonval, R., Davies, M.E.: Sparse representations in audio and music: From coding to source separation. Proceedings of the IEEE 98(6), 995–1005 (2010)
21. Princen, J., Bradley, A.: Analysis/synthesis filter bank design based on time domain aliasing cancellation. IEEE Transactions on Acoustics, Speech, and Signal Processing 34(5), 1153–1161 (1986)
22. Strutt (Lord Rayleigh), J.W.: Acoustical observations I. Philosophical Magazine 3, 456–457 (1877)
23. Strutt (Lord Rayleigh), J.W.: On the acoustic shadow of a sphere. Philosophical Transactions of the Royal Society of London 203A, 87–97 (1904)
24. Thiede, T., Treurniet, W., Bitto, R., Schmidmer, C., Sporer, T., Beerends, J., Colomes, C.: PEAQ – the ITU standard for objective measurement of perceived audio quality. Journal of the Audio Engineering Society 48(1), 3–29 (2000)
25. Tournery, C., Faller, C.: Improved time delay analysis/synthesis for parametric stereo audio coding. Journal of the Audio Engineering Society 29(5), 490–498 (2006)
26. Vincent, E., Gribonval, R., Plumbley, M.D.: Oracle estimators for the benchmarking of source separation algorithms. Signal Processing 87, 1933–1950 (2007)
27. Viste, H.: Binaural Localization and Separation Techniques. Ph.D. thesis, École Polytechnique Fédérale de Lausanne, Switzerland (2004)
28. Woodworth, R.S.: Experimental Psychology. Holt, New York (1954)
29. Yılmaz, Ö., Rickard, S.: Blind separation of speech mixtures via time-frequency masking. IEEE Transactions on Signal Processing 52(7), 1830–1847 (2004)

Pitch Gestures in Generative Modeling of Music


Kristoffer Jensen
Aalborg University Esbjerg, Niels Bohr Vej 8,
6700 Esbjerg, Denmark
krist@create.aau.dk

Abstract. Generative models of music are in need of performance and gesture


additions, i.e. inclusions of subtle temporal and dynamic alterations, and
gestures so as to render the music musical. While much of the research
regarding music generation is based on music theory, the work presented here is
based on the temporal perception, which is divided into three parts, the
immediate (subchunk), the short-term memory (chunk), and the superchunk. By
review of the relevant temporal perception literature, the necessary performance
elements to add in the metrical generative model, related to the chunk memory,
are obtained. In particular, the pitch gestures are modeled as rising, falling, or as
arches with positive or negative peaks.
Keywords: gesture; human cognition; perception; chunking; music generation.

1 Introduction
Music generation has more and more uses in today's media. Be it in computer games, interactive music performances, or in interactive films, the emotional effect of the music is primordial in the appreciation of the media. While traditionally the music has been generated in pre-recorded loops that are mixed on-the-fly, or recorded with traditional orchestras, a better understanding and better models of generative music are believed to push interactive generative music into multimedia. Papadopoulos and Wiggins (1999) gave an early overview of the methods of algorithmic composition, deploring that the music that they produce is meaningless: the computers do not have feelings, moods or intentions. While vast progress has been made in the decade since this statement, there is still room for improvement.
The cognitive understanding of musical time perception is the basis of the work presented here. According to Kühl (2007), this memory can be separated into three time-scales: the short, microtemporal, related to microstructure; the mesotemporal, related to gesture; and the macrotemporal, related to form. These time-scales are named (Kühl and Jensen 2008) subchunk, chunk and superchunk: subchunks extend from 30 ms to 300 ms; the conscious mesolevel of chunks from 300 ms to 3 sec; and the reflective macrolevel of superchunks from 3 sec to roughly 30–40 sec. The subchunk is related to individual notes, the chunk to meter and gesture, and the superchunk is related to form. The superchunk was analyzed and used in a generative model in Kühl and Jensen (2008), and the chunks were analyzed in Jensen and Kühl (2009). Further analysis of the implications of how temporal perception is related to durations and timing of existing music, and anatomic and perceptual findings from the literature, is given in section 2, along with an overview of the previous work in rhythm. Section 3 presents the proposed model on the inclusion of pitch gestures in music generation using statistical methods, and section 4 discusses the integration of the pitch gesture in the generative music model. Finally, section 5 offers a conclusion.

2 Cognitive and Perceptual Aspects of Rhythm


According to Snyder (2000), a beat is a single point in time, while the pulse is recurring beats. Accent gives salience to a beat, and meter is the organization of beats into a cyclical structure. This may or may not be different from the rhythmic grouping, which is generally seen as a phrase bounded by accented notes. Lerdahl and Jackendoff (1983) give many examples of grouping and meter, and show how these are two independent elements: grouping (segmentation on different levels) is concerned with elements that have duration, and meter (regular alternation of strong and weak beats) is concerned with durationless elements. While grouping and meter are independent, the percept is more stable when they are congruent.

The accentuation of some of the beats gives perceptual salience to the beat (Patel and Peretz 1997). This accenting can be done (Handel 1989) by, for instance, an intensity rise, by increasing the duration or the interval between the beats, or by increasing the frequency difference between the notes.
Samson et al. (2000) show that the left temporal lobe processes rapid auditory sequences, while there are also activities in the frontal lobe. The specialized skills related to rhythm are developed in the early years; for instance, Malbrán (2000) shows how 8-year-old children can perform precise tapping. However, while the tapping is more precise for high tempo, drifting is ubiquitous. Gordon (1987) has determined that the perceptual attack time (PAT) is most often located at the point of the largest rise of the amplitude of the sound. However, in the experiment, the subjects had problems synchronizing many of the sounds, and Gordon concludes that the PAT is more vague for non-percussive sounds, and spectral cues may also interfere in the determination of the attack. Zwicker and Fastl (1999) introduced the notion of subjective duration, and showed that the subjective duration is longer than the objective duration for durations below 100 ms. Even more subjective deviations are found if pauses are compared to tones or noises. Zwicker and Fastl found that long sounds (above 1 second) have the same subjective durations as pauses, while shorter pauses have significantly longer subjective durations than sounds: approximately 4 times longer for a 3.2 kHz tone, while a 200 Hz tone and white noise have approximately half the subjective duration, as compared to pauses. This is true for durations of around 100–200 ms, while the difference evens out to disappear at 1 sec durations. Finally, Zwicker and Fastl related the subjective duration to temporal masking, and give indications that musicians would play tones shorter than notated in order to fulfill the subjective durations of the notated music. Fraisse (1982) gives an overview of his important research in rhythm perception. He states the range in which synchronization is possible to be between 200 and 1800 msec (33-300 BPM). Fraisse furthermore has analyzed classical music, and found two main durations that he calls temps longs (>400 msec) and temps courts, with two-to-one ratios found only between temps longs and temps courts. As for natural tempo, when subjects are asked to reproduce temporal intervals, they tend to overestimate short intervals (making them longer) and underestimate long intervals (making them shorter). At an interval of about 500 msec to 600 msec, there is little over- or under-estimation. However, there are large differences across individuals; the spontaneous tempo is found to be between 1.1 and 5 taps per second, with 1.7 taps per second occurring most often. There are also many spontaneous motor movements that occur at the rate of approximately 2/sec, such as walking, sucking in the newborn, and rocking.
Friberg (1991) and Widmer (2002) give rules for how the dynamics and timing should be changed according to the musical position of the notes. Dynamic changes include a 6 dB increase (a doubling), and up to 100 msec deviations of the duration, depending on the musical position of the notes. With these timing changes, Snyder (2000) indicates the categorical perception of beats, measures and patterns. The perception of timing deviations is an example of within-category distinctions. Even with large deviations from the nominal score, the notes are recognized as falling on the beats.

As for melodic perception, Thomassen (1982) investigated the role of intervals as melodic accents. In a controlled experiment, he modeled the anticipation using an attention span of three notes, and found that the accent perception is described fairly well. The first of two opposite frequency changes gives the strongest accentuation. Two changes in the same direction are equally effective. The larger of two changes is more powerful, as are frequency rises compared to frequency falls.

3 Model of Pitch Gestures


Music is typically composed, giving intended and inherent emotions in the structural aspects, which are then enhanced and altered by the performers, who change the dynamics, articulations, vibrato, and timing to render the music enjoyable and musical. In this work, the gestures in the pitch contour are investigated. Jensen and Kühl (2009) investigated the gestures of music through a simple model, with positive or negative slope, and with positive or negative arches, as shown in figure 1. For the songs analyzed, Jensen and Kühl found more negative than positive slopes and slightly more positive than negative arches. Huron (1996) analyzed the Essen Folk music database, and found - by averaging all melodies - positive arches. Further analyses were done by comparing the first and last note to the mean of the intermediate notes, revealing more positive than negative arches (39% and 10% respectively), and more negative than positive slopes (29% and 19% respectively). According to Thomassen (1982), the falling slopes have less powerful accents, and they would thus require less attention.
The generative model is made through statistical models based on data from a musical database (The Digital Tradition 2010). From this database, note and interval occurrences are counted. These counts are then normalized and used as probability density functions for notes and intervals, respectively. These statistics are shown in figure 2. As can be seen, the intervals are not symmetrical. This corroborates the finding in Jensen and Kühl (2009) that more falling than rising slopes are found in the pitch of music.

Fig. 1. Different shapes of a chunk. Positive (a-c) or negative arches (g-i), rising (a,d,g) or falling slopes (c,f,i).

Fig. 2. Note (top) and interval probability density functions obtained from The Digital Tradition folk database

According to Vos and Troost (1989), the smaller intervals occur more
often in descending form, while the larger ones occur mainly in ascending form.
However, since the slope and arch are modelled in this work, the pdf of the intervals is mirrored and added around zero, and subsequently weighted and copied back to recreate the full interval pdf. This later makes it possible to create a melodic contour with given slope and arch characteristics, as detailed below.
In order to generate pitch contours with gestures, the model in figure 1 is used. For the pitch contour, only the neutral gesture (e) in figure 1, the rising and falling slopes (d) and (f), and the positive and negative arches (b) and (h) are modeled here. The gestures are obtained by weighting the positive and negative slopes of the interval probability density function with a weight,

pdfi = [w · pdfi+, (1 − w) · pdfi+].    (1)

Here, pdfi+ is the mirrored/added positive interval pdf, and w is the weight. If w = 0.5, a neutral gesture is obtained; if w < 0.5, a positive slope is obtained; and if w > 0.5, a negative slope is obtained. In order to obtain an arch, the value of the weight is changed to w = 1 − w in the middle of the gesture.
In order to obtain a musical scale, the probability density function for the intervals (pdfi) is multiplied with a suitable pdfs for the scale, such as the one illustrated in figure 2 (top),

pdf = shift(pdfi, n0) · pdfs · wr.    (2)

As pdfs is only defined for one octave, it is circularly repeated. The interval probabilities, pdfi, are shifted for each note n0. This is done under the hypothesis that the intervals and scale notes are independent. So as to approximately retain the register, a register weight wr further multiplies the pdf. This weight is one within one octave, and decreases exponentially on both sides, in order to lower the possibility of obtaining notes far from the original register.
In order to obtain successive notes, the cumulative density function, cdf, is calculated from eq (2), and used to model the probability that r is less than or equal to the note interval's cdf(n0). If r is a random variable with uniform distribution in the interval (0,1), then n0 can be found as the index of the first occurrence of cdf > r.
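A sketch of the whole note-generation step (Equations (1) and (2) plus the cdf inversion): all pdfs below are small made-up arrays standing in for the statistics of The Digital Tradition database, and the register weight is a simplified exponential decay.

import numpy as np

rng = np.random.default_rng(0)

def gesture_pdf(pdf_i_pos, w):
    # Equation (1): mirrored positive-interval pdf, weighted by w on the
    # negative (falling) side and by 1 - w on the positive (rising) side.
    neg = w * pdf_i_pos[::-1]
    return np.concatenate([neg[:-1], (1.0 - w) * pdf_i_pos])

def next_note(n0, pdf_i, pdf_s, center, width=12.0):
    # Equation (2): shift the interval pdf to the current note n0, multiply by
    # the (circularly repeated) scale pdf and a register weight, then draw the
    # next note by inverting the cumulative distribution.
    notes = np.arange(len(pdf_s))
    offsets = np.arange(len(pdf_i)) - len(pdf_i) // 2
    shifted = np.interp(notes - n0, offsets, pdf_i, left=0.0, right=0.0)
    w_r = np.exp(-np.abs(notes - center) / width)   # simplified register weight
    pdf = shifted * pdf_s * w_r
    cdf = np.cumsum(pdf / pdf.sum())
    return int(np.searchsorted(cdf, rng.random()))  # first index with cdf > r

# toy data: C major scale mask over 128 MIDI notes, decaying interval pdf
pdf_s = np.tile([1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1], 11)[:128].astype(float)
pdf_i = gesture_pdf(np.exp(-np.arange(13) / 3.0), w=0.2)   # rising gesture
melody = [60]
for _ in range(8):
    melody.append(next_note(melody[-1], pdf_i, pdf_s, center=60))
print(melody)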
Examples of pitch contours obtained by setting w=0, and w=1, respectively, are
shown in figure 3. The rising and falling pitches are reset after each gesture, in order
to stay at the same register throughout the melody.
The positive and negative slopes are easily recognized when listening to the
resulting melodies, because of the abrupt pitch fall at the end of each gesture. The
arches, in comparison, are more in need of loudness and/or brightness variations in
order to make them perceptually recognized. Without this, a positive slope can be
confused for a negative arch that is shifted in time, or a positive or negative slope,
likewise shifted in time. Normally, an emphasis at the beginning of each gesture is
sufficient for the slopes, while the arches may be in need of an emphasis at the peak
of the arch as well.


Fig. 3. Pitch contours of four melodies with positive arch, rising slope, negative arch and
falling slope

4 Recreating Pitch Contours in Meter


In previous work (Kühl and Jensen 2008), a generative model that produces tonal music with structural changes was presented. This model, which creates note values based on a statistical model, also introduces changes at the structural level (every 30 seconds, approximately). These changes are introduced based on analysis of music using the musigram visualization tools (Kühl and Jensen 2008).

With respect to chroma, an observation was made that only a subset of the full scale notes was used in each structural element. This subset was modified, by removing and inserting new notes from the list of possible notes, at each structural boundary. The timbre changes include varying the loudness and brightness between loud/bright and soft/dark structural elements. The main rhythm changes were based on the identification of short elements (10 seconds) with no discernible rhythm. A tempo drift of up to 10% and the insertion of faster rhythmic elements (Tatum) at structural boundaries were also identified. These structural changes were implemented in a generative model, whose flowchart can be seen in figure 4. While the structural elements certainly were beneficial for the long-term interest of the music, the lack of short-term changes (chunks) and of a rhythm model impeded the quality of the music. The meter, which improves the resulting quality, is included in this generative model by adjusting the loudness and brightness of each tone according to its accent. The pitch contour is made through the model introduced in the previous section.


Fig. 4. The generative model including meter, gesture and form. Structural changes on the note values, the intensity and the rhythm are made every 30 seconds, approximately, and gesture changes are made on average every seven notes

The notes are created using a simple envelope model and the synthesis method dubbed brightness creation function (bcf, Jensen 1999), which creates a sound with exponentially decreasing amplitudes and allows the continuous control of the brightness. The accent affects the note, so that the loudness and brightness are doubled, and the duration is increased by 25%, with 75% of the elongation made by advancing the start of the note, as found in Jensen (2010).
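A sketch of how such an accent could be applied to a note event, under the figures quoted above (doubled loudness and brightness, duration increased by 25% with 75% of the elongation taken by an earlier onset); the note representation is an assumption, not the format used by the authors.

def apply_accent(note):
    # note: dict with keys onset, duration (seconds), loudness, brightness.
    accented = dict(note)
    accented["loudness"] *= 2.0          # doubled loudness
    accented["brightness"] *= 2.0        # doubled brightness
    extra = 0.25 * note["duration"]      # duration increased by 25 %
    accented["onset"] -= 0.75 * extra    # 75 % of the elongation comes earlier
    accented["duration"] += extra
    return accented

print(apply_accent({"onset": 1.0, "duration": 0.4, "loudness": 0.5, "brightness": 1.0}))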
These findings are put into a generative model of tonal music. A subset of notes (3-5) is chosen at each new form (superchunk), together with a new dynamic level. At the chunk level, new notes are created in a metrical loop, and the gestures are added to the pitch contour and used for additional gesture emphasis. Finally, at the microtemporal (subchunk) level, expressive deviations are added in order to render the loops musical. The interaction of the rigid meter with the looser pitch gesture gives the generated notes a more musical sense, through the uncertainty and the double stream that result. The pure rising and falling pitch gestures are still clearly perceptible, while the arches are less present. By setting w in eq (1) to an intermediate value in (0,1), e.g. 0.2 or 0.8, more realistic, agreeable rising and falling gestures result. Still, the arches are more natural to the ear, while the rising and falling gestures demand more attention, in particular perhaps the rising ones.


5 Conclusion
The automatic generation of music is in need of models to render the music expressive. Such a model is found here using knowledge from studies of the time perception of music, and further studies of the cognitive and perceptual aspects of rhythm. Indeed, the generative model consists of three sources, corresponding to the immediate microtemporal, the present mesotemporal and the long-term memory macrotemporal. This corresponds to the note, the gesture and the form in music. While a single stream in each of the sources may not be sufficient, so far the model incorporates the macrotemporal superchunk, the metrical mesotemporal chunk and the microtemporal expressive enhancements. The work presented here has introduced gestures in the pitch contour, corresponding to the rising and falling slopes, and to the positive and negative arches, which adds a perceptual stream to the more rigid meter stream.

The normal beat is given by different researchers to be approximately 100 BPM, and Fraisse (1982) furthermore shows the existence of two main note durations, one above and one below 0.4 secs, with a ratio of two. Indications as to subjective time, given by Zwicker and Fastl (1999), are yet to be investigated, but this may well be creating uneven temporal intervals in conflict with the pulse.

The inclusion of the pitch gesture model certainly, in the author's opinion, renders the music more enjoyable, but more work remains before the generative model is ready for general-purpose uses.

References
1. Fraisse, P.: Rhythm and Tempo. In: Deutsch, D. (ed.) The Psychology of Music, 1st edn., pp. 149–180. Academic Press, New York (1982)
2. Friberg, A.: Performance Rules for Computer-Controlled Contemporary Keyboard Music. Computer Music Journal 15(2), 49–55 (1991)
3. Gordon, J.W.: The perceptual attack time of musical tones. Journal of the Acoustical Society of America, 88–105 (1987)
4. Handel, S.: Listening. MIT Press, Cambridge (1989)
5. Huron, D.: The Melodic Arch in Western Folk songs. Computing in Musicology 10, 3–23 (1996)
6. Jensen, K.: Timbre Models of Musical Sounds. PhD Dissertation, DIKU Report 99/7 (1999)
7. Jensen, K.: Investigation on Meter in Generative Modeling of Music. In: Proceedings of the CMMR, Malaga, June 21-24 (2010)
8. Jensen, K., Kühl, O.: Towards a model of musical chunks. In: Ystad, S., Kronland-Martinet, R., Jensen, K. (eds.) CMMR 2008. LNCS, vol. 5493, pp. 81–92. Springer, Heidelberg (2009)
9. Kühl, O., Jensen, K.: Retrieving and recreating musical form. In: Kronland-Martinet, R., Ystad, S., Jensen, K. (eds.) CMMR 2007. LNCS, vol. 4969, pp. 270–282. Springer, Heidelberg (2008)
10. Kühl, O.: Musical Semantics. Peter Lang, Bern (2007)
11. Lerdahl, F., Jackendoff, R.: A Generative Theory of Tonal Music. The MIT Press, Cambridge (1983)


12. Malbrán, S.: Phases in Children's Rhythmic Development. In: Zatorre, R., Peretz, I. (eds.) The Biological Foundations of Music. Annals of the New York Academy of Sciences (2000)
13. Papadopoulos, G., Wiggins, G.: AI methods for algorithmic composition: a survey, a critical view and future prospects. In: AISB Symposium on Musical Creativity, pp. 110–117 (1999)
14. Patel, A., Peretz, I.: Is music autonomous from language? A neuropsychological appraisal. In: Deliège, I., Sloboda, J. (eds.) Perception and cognition of music, pp. 191–215. Psychology Press, Hove (1997)
15. Samson, S., Ehrlé, N., Baulac, M.: Cerebral Substrates for Musical Temporal Processes. In: Zatorre, R., Peretz, I. (eds.) The Biological Foundations of Music. Annals of the New York Academy of Sciences (2000)
16. Snyder, B.: Music and Memory. An Introduction. The MIT Press, Cambridge (2000)
17. The Digital Tradition (2010), http://www.mudcat.org/AboutDigiTrad.cfm (visited December 1, 2010)
18. Thomassen, J.M.: Melodic accent: Experiments and a tentative model. J. Acoust. Soc. Am. 71(6), 1596–1605 (1982)
19. Vos, P.G., Troost, J.M.: Ascending and Descending Melodic Intervals: Statistical Findings and Their Perceptual Relevance. Music Perception 6(4), 383–396 (1989)
20. Widmer, G.: Machine discoveries: A few simple, robust local expression principles. Journal of New Music Research 31, 37–50 (2002)
21. Zwicker, E., Fastl, H.: Psychoacoustics: facts and models, 2nd edn. Springer series in information sciences. Springer, Berlin (1999)

An Entropy Based Method for Local Time-Adaptation of the Spectrogram

Marco Liuni 1,2,⋆, Axel Röbel 2, Marco Romito 1, and Xavier Rodet 2

1 Università di Firenze, Dip. di Matematica U. Dini, Viale Morgagni, 67/a - 50134 Florence - ITALY
2 IRCAM - CNRS STMS, Analysis/Synthesis Team, 1, Place Igor-Stravinsky - 75004 Paris - FRANCE
{marco.liuni,axel.roebel,xavier.rodet}@ircam.fr
marco.romito@math.unifi.it
http://www.ircam.fr/anasyn.html

Abstract. We propose a method for automatic local time-adaptation of the spectrogram of audio signals: it is based on the decomposition of a signal within a Gabor multi-frame through the STFT operator. The sparsity of the analysis in every individual frame of the multi-frame is evaluated through Rényi entropy measures: the best local resolution is determined by minimizing the entropy values. The overall spectrogram of the signal we obtain thus provides local optimal resolution adaptively evolving over time. We give examples of the performance of our algorithm with an instrumental sound and a synthetic one, showing the improvement in spectrogram display obtained with an automatic adaptation of the resolution. The analysis operator is invertible, thus leading to a perfect reconstruction of the original signal through the analysis coefficients.

Keywords: adaptive spectrogram, sound representation, sound analysis, sound synthesis, Rényi entropy, sparsity measures, frame theory.

1 Introduction

Far from being restricted to entertainment, sound processing techniques are required in many different domains: they find applications in medical sciences, security instruments and communications, among others. The most challenging class of signals to consider is indeed music: the completely new perspective opened by contemporary music, assigning a fundamental role to concepts such as noise and timbre, gives musical potential to every sound.

The standard techniques of digital analysis are based on the decomposition of the signal in a system of elementary functions, and the choice of a specific system necessarily has an influence on the result. Traditional methods based on single sets of atomic functions have important limits: a Gabor frame imposes a fixed resolution over all the time-frequency plane, while a wavelet frame gives a strictly determined variation of the resolution; moreover, the user is frequently


⋆ This work is supported by grants from Région Île-de-France.


asked to define the analysis window features himself, which in general is not a simple task even for experienced users. This motivates the search for adaptive methods of sound analysis and synthesis, and for algorithms whose parameters are designed to change according to the analyzed signal features. Our research is focused on the development of mathematical models and tools based on the local automatic adaptation of the system of functions used for the decomposition of the signal: we are interested in a complete framework for analysis, spectral transformation and re-synthesis; thus we need to define an efficient strategy to reconstruct the signal through the adapted decomposition, which must give a perfect recovery of the input if no transformation is applied.

Here we propose a method for local automatic time-adaptation of the Short Time Fourier Transform window function, through a minimization of the Rényi entropy [22] of the spectrogram; we then define a re-synthesis technique with an extension of the method proposed in [11]. Our approach can be presented schematically in three parts:
1. a model for signal analysis exploiting concepts of Harmonic Analysis, and Frame Theory in particular: it is a generally highly redundant decomposing system belonging to the class of multiple Gabor frames [8],[14];
2. a sparsity measure defined on time-frequency localized subsets of the analysis coefficients, in order to determine local optimal concentration;
3. a reduced representation obtained from the original analysis using the information about optimal concentration, and a synthesis method through an expansion in the reduced system obtained.
We have realized a first implementation of this scheme in two different versions: for both of them a sparsity measure is applied on subsets of analysis coefficients covering the whole frequency dimension, thus defining a time-adapted analysis of the signal. The main difference between the two concerns the first part of the model, that is the single frames composing the multiple Gabor frame. This is a key point as the first and third parts of the scheme are strictly linked: the frame used for re-synthesis is a reduction of the original multi-frame, so the entire model depends on how the analysis multi-frame is designed. The section Frame Theory in Sound Analysis and Synthesis treats this part of our research in more detail.

The second point of the scheme is related to the measure applied on the coefficients of the analysis within the multi-frame to determine local best resolutions. We consider measures borrowed from Information Theory and Probability Theory according to the interpretation of the analysis within a frame as a probability density [4]: our model is based on a class of entropy measures known as Rényi entropies, which extend the classical Shannon entropy. The fundamental idea is that minimizing the complexity or information over a set of time-frequency representations of the same signal is equivalent to maximizing the concentration and peakiness of the analysis, thus selecting the best resolution tradeoff [1]: in the section Rényi Entropy of Spectrograms we describe how a sparsity measure can consequently be defined through an information measure. Finally,


in the fourth section we provide a description of our algorithm and examples of adapted spectrograms for different sounds.

Some examples of this approach can be found in the literature: the idea of gathering a sparsity measure from Rényi entropies is detailed in [1], and in [13] a local time-frequency adaptive framework is presented exploiting this concept, even if no methods for perfect reconstruction are provided. In [21] sparsity is obtained through a regression model; a recent development in this sense is contained in [14], where a class of methods for analysis adaptation are obtained separately in the time and frequency dimension together with perfect reconstruction formulas: indeed no strategies for automation are employed, and adaptation has to be managed by the user. The model conceived in [18] belongs to this same class but presents several novelties in the construction of the Gabor multi-frame and in the method for automatic local time-adaptation. In [15] another time-frequency adaptive spectrogram is defined considering a sparsity measure called energy smearing, without taking into account the re-synthesis task. The concept of quilted frame, recently introduced in [9], is the first promising effort to establish a unified mathematical model for all the various frameworks cited above.

2 Frame Theory in Sound Analysis and Synthesis

When analyzing a signal through its decomposition, the features of the representation are influenced by the decomposing functions; Frame Theory (see [3],[12] for detailed mathematical descriptions) allows a unified approach when dealing with different bases and systems, studying the properties of the operators that they identify. The concept of frame extends the one of orthonormal basis in a Hilbert space, and it provides a theory for the discretization of time-frequency densities and operators [8], [20], [2]. Both the STFT and the Wavelet transform can be interpreted within this setting (see [16] for a comprehensive survey of theory and applications).

Here we summarize the basic definitions and theorems, and outline the fundamental step consisting in the introduction of Multiple Gabor Frames, which is comprehensively treated in [8]. The problem of standard frames is that the decomposing atoms are defined from the same original function, thus imposing a limit on the type of information that one can deduce from the analysis coefficients; if we were able to consider frames where several families of atoms coexist, then we would have an analysis with variable information, at the price of a higher redundancy.
2.1 Basic Definitions and Results

Given a Hilbert space H, seen as a vector space over C with its own scalar product, we consider in H a set of vectors {φλ}λ∈Λ, where the index set Λ may be infinite and can also be a multi-index. The set {φλ}λ∈Λ is a frame for H if there exist two positive non-zero constants A and B, called frame bounds, such that for all f ∈ H,

A‖f‖² ≤ Σλ∈Λ |⟨f, φλ⟩|² ≤ B‖f‖².    (1)

We are interested in the case H = L²(R) and Λ countable, as it represents the standard situation where a signal f is decomposed through a countable set of given functions {φk}k∈Z. The frame bounds A and B are the infimum and supremum, respectively, of the eigenvalues of the frame operator U, defined as

Uf = Σk∈Z ⟨f, φk⟩ φk.    (2)

For any frame {φk}k∈Z there exist dual frames {φ̃k}k∈Z such that for all f ∈ L²(R)

f = Σk∈Z ⟨f, φk⟩ φ̃k = Σk∈Z ⟨f, φ̃k⟩ φk,    (3)
so that, given a frame, it is always possible to perfectly reconstruct a signal f using the coefficients of its decomposition through the frame. The inverse of the frame operator allows the calculation of the canonical dual frame

φ̃k = U⁻¹ φk    (4)

which guarantees minimal-norm coefficients in the expansion.


A Gabor frame is obtained by time-shifting and frequency-transposing a window function g according to a regular grid. Gabor frames are particularly interesting in applications as the analysis coefficients are simply given by sampling the STFT of f with window g according to the nodes of a specified lattice. Given a time step a and a frequency step b we write {un}n∈Z = an and {ξk}k∈Z = bk; these two sequences generate the nodes of the time-frequency lattice for the frame {gn,k}(n,k)∈Z² defined as

gn,k(t) = g(t − un) e^(2πiξk t);    (5)

the nodes are the centers of the Heisenberg boxes associated with the windows in the frame. The lattice has to satisfy certain conditions for {gn,k} to be a frame [7], which impose limits on the choice of the time and frequency steps: for certain choices [6] which are often adopted in standard applications, the frame operator takes the form of a multiplication,

Uf(t) = b⁻¹ ( Σn∈Z |g(t − un)|² ) f(t),    (6)
and the dual frame is easily calculated by means of a straight multiplication of the atoms in the original frame. The relation between the steps a, b and the frame bounds A, B is in this case clear from (6), as the frame condition implies

0 < A ≤ b⁻¹ Σn∈Z |g(t − un)|² ≤ B < ∞.    (7)

Thus we see that the frame bounds also provide information on the redundancy of the decomposition of the signal within the frame.
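As a small numerical illustration of Equations (6) and (7) (assuming a sampled Hann window, with an arbitrary hop and frequency step), the frame bounds A and B can be read off as the minimum and maximum of b⁻¹ Σn |g(t − un)|² away from the boundaries.

import numpy as np

def gabor_frame_diagonal(g, a, b, length):
    # b^{-1} * sum_n |g(t - u_n)|^2 on a discrete time grid of given length.
    diag = np.zeros(length)
    for u in range(0, length - len(g) + 1, a):   # time shifts u_n = a*n
        diag[u:u + len(g)] += np.abs(g) ** 2
    return diag / b

g = np.hanning(512)
diag = gabor_frame_diagonal(g, a=128, b=1.0 / 512, length=8192)
inner = diag[512:-512]           # ignore boundary effects
print(inner.min(), inner.max())  # estimates of the frame bounds A and B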

2.2 Multiple Gabor Frames

In our adaptive framework, we look for a method to achieve an analysis with multiple resolutions: thus we need to combine the information coming from the decompositions of a signal in several frames of different window functions. Multiple Gabor frames have been introduced in [22] to provide the original Gabor analysis with flexible multi-resolution techniques: given a set of indices L ⊆ Z and different frames {gl n,k}(n,k)∈Z² with l ∈ L, a multiple Gabor frame is obtained with a union of the single given frames. The different gl do not necessarily share the same type or shape: in our method an original window is modified with a finite number of scalings

gl(t) = (1/√λl) g(t/λl);    (8)

then all the scaled versions are used to build |L| different frames which constitute the initial multi-frame.
A Gabor multi-frame has in general a significant redundancy which lowers the readability of the analysis. A possible strategy to overcome this limit is proposed in [14] where nonstationary Gabor frames are introduced, actually allowing the choice of a different window for each time location of a global irregular lattice Λ, or alternatively for each frequency location. This way, the window chosen is a function of time or frequency position in the time-frequency space, not both. In most applications, for this kind of frame there exist fast FFT based methods for the analysis and re-synthesis steps. Referring to the time case, with the abuse of notation gn(l) we indicate the window gl centered at a certain time n(l) = un which is a function of the chosen window itself. Thus, a nonstationary Gabor frame is given by the set of atoms

{gn(l) e^(2πibl kt), (n(l), bl k) ∈ Λ},    (9)
where bl is the frequency step associated with the window gl and k ∈ Z. If we suppose that the windows gl have limited time support and a sufficiently small frequency step bl, the frame operator U takes a form similar to the one in (6),

Uf(t) = Σn(l) (1/bl) |gn(l)(t)|² f(t).    (10)


Here, if N(t) = Σn(l) (1/bl) |gn(l)(t)|² is finite and bounded away from zero, then U is invertible and the set (9) is a frame whose dual frame is given by

g̃n(l),k(t) = (1/N(t)) gn(l)(t) e^(2πibl kt).    (11)

Nonstationary Gabor frames belong to the recently introduced class of quilted frames [9]: in this kind of decomposing system the choice of the analysis window depends on both the time and the frequency location, causing more difficulties for an analytic fast computation of a dual frame as in (11): future improvements of our research concern the employment of such a decomposition model for automatic local adaptation of the spectrogram resolution in both the time and the frequency dimension.
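A sketch of the painless nonstationary case of Equations (10) and (11): different windows at different time positions, the diagonal N(t), and dual windows obtained by pointwise division; the alternation of two Hann windows and the chosen hop are illustrative assumptions.

import numpy as np

def nsgf_diagonal_and_duals(windows, positions, b_steps, length):
    # N(t) = sum_l b_l^{-1} |g_{n(l)}(t)|^2  (painless case of Equation (10))
    N = np.zeros(length)
    for g, pos, b in zip(windows, positions, b_steps):
        N[pos:pos + len(g)] += np.abs(g) ** 2 / b
    duals = []
    for g, pos, b in zip(windows, positions, b_steps):
        seg = N[pos:pos + len(g)]
        duals.append(np.divide(g, seg, out=np.zeros_like(g), where=seg > 0))
    return N, duals

# alternating short and long Hann windows over time (illustrative layout)
wins = [np.hanning(512) if i % 2 == 0 else np.hanning(1024) for i in range(28)]
poss = [256 * i for i in range(28)]
bs = [1.0 / len(w) for w in wins]
N, duals = nsgf_diagonal_and_duals(wins, poss, bs, length=256 * 27 + 1024)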

3 Rényi Entropy of Spectrograms

We consider the discrete spectrogram of a signal as a sampling of the square of its continuous version

PSf(u, ξ) = |Sf(u, ξ)|² = | ∫ f(t) g(t − u) e^(−2πiξt) dt |²,    (12)

where f is a signal, g is a window function and Sf(u, ξ) is the STFT of f through g. Such a sampling is obtained according to a regular lattice Λab, considering a Gabor frame (5),

PSf[n, k] = |Sf[un, ξk]|².    (13)
With an appropriate normalization, both the continuous and the discrete spectrogram can be interpreted as probability densities. Thanks to this interpretation, some techniques belonging to the domain of Probability and Information Theory can be applied to our problem: in particular, the concept of entropy can be extended to give a sparsity measure of a time-frequency density. A promising approach [1] takes into account Rényi entropies, a generalization of the Shannon entropy: the application to our problem is related to the concept that minimizing the complexity or information of a set of time-frequency representations of the same signal is equivalent to maximizing the concentration, peakiness, and therefore the sparsity of the analysis. Thus we will consider as best analysis the sparsest one, according to the minimal entropy evaluation.

Given a signal f and its spectrogram PSf as in (12), the Rényi entropy of order α > 0, α ≠ 1 of PSf is defined as follows

HαR(PSf) = (1/(1 − α)) log2 ∫∫R ( PSf(u, ξ) / ∫∫R PSf(u′, ξ′) du′ dξ′ )^α du dξ,    (14)
where R ⊆ R² and we omit its indication if equality holds. Given a discrete spectrogram with time step a and frequency step b as in (13), we consider R as a rectangle of the time-frequency plane, R = [t1, t2] × [ξ1, ξ2] ⊆ R². It identifies a sequence of points G ⊆ Λab where G = {(n, k) ∈ Z² : t1 ≤ na ≤ t2, ξ1 ≤ kb ≤ ξ2}. As a discretization of the original continuous spectrogram, every sample in PSf[n, k] is related to a time-frequency region of area ab; we thus obtain the discrete Rényi entropy measure directly from (14),

HαG[PSf] = (1/(1 − α)) log2 Σ[n,k]∈G ( PSf[n, k] / Σ[n′,k′]∈G PSf[n′, k′] )^α + log2(ab).    (15)


We will focus on discretized spectrograms with a finite number of coefficients, as dealing with digital signal processing requires working with finite sampled signals and distributions.
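A sketch of the discrete Rényi entropy of Equation (15) for a rectangular patch of spectrogram coefficients; the α = 1 branch falls back to the Shannon entropy, and the lattice steps a and b are passed in whatever (consistent) units the patch was sampled with.

import numpy as np

def renyi_entropy(PS_patch, alpha, a, b):
    # Discrete Renyi entropy H_alpha^G of a spectrogram patch, Equation (15).
    p = PS_patch / PS_patch.sum()                 # normalize to a density
    if alpha == 1.0:                              # limit case: Shannon entropy
        h = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    else:
        h = np.log2(np.sum(p ** alpha)) / (1.0 - alpha)
    return h + np.log2(a * b)

# example on a random patch (64 time frames x 128 frequency bins)
patch = np.abs(np.random.randn(64, 128)) ** 2
print(renyi_entropy(patch, alpha=3.0, a=256 / 44100, b=44100 / 2048))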
Among the general properties of Rényi entropies [17], [19] and [23] we recall
in particular those directly related to our problem. It is easy to show that for
every finite discrete probability density $P$ the entropy $H_\alpha(P)$ tends to coincide
with the Shannon entropy of $P$ as the order $\alpha$ tends to one. Moreover, $H_\alpha(P)$ is
a non-increasing function of $\alpha$, so

$$ \alpha_1 < \alpha_2 \;\Rightarrow\; H_{\alpha_1}(P) \geq H_{\alpha_2}(P) \;. \qquad (16) $$

As we are working with finite discrete densities we can also consider the case
$\alpha = 0$, which is simply the logarithm of the number of elements in $P$; as a
consequence $H_0(P) \geq H_\alpha(P)$ for every admissible order $\alpha$.
A third basic fact is that for every order $\alpha$ the Rényi entropy $H_\alpha$ is maximum
when $P$ is uniformly distributed, while it is minimum and equal to zero when $P$
has a single non-zero value.
All of these results give useful information on the values of different measures
on a single density P as in (15), while the relations between the entropies of two
different densities P and Q are in general hard to determine analytically; in our
problem, P and Q are two spectrograms of a signal in the same time-frequency
area, based on two window functions with different scaling as in (8). In some
basic cases such a relation is achievable, as shown in the following example.
3.1 Best Window for Sinusoids

When the spectrograms of a signal through different window functions do not
depend on time, it is easy to compare their entropies: let $\mathrm{PS}s$ be the sampled
spectrogram of a sinusoid $s$ over a finite region $G$ with a window function $g$ of
compact support; then $\mathrm{PS}s$ is simply a translation in the frequency domain of
$\hat{g}$, the Fourier transform of the window, and it is therefore time-independent.
We choose a bounded set $L$ of admissible scaling factors, so that the discretized
support of the scaled windows $g^l$ still remains inside $G$ for any $l \in L$. It is not
hard to prove that the entropy of a spectrogram taken with such a scaled version
of $g$ is given by

$$ H_\alpha^G(\mathrm{PS}s_l) = H_\alpha^G(\mathrm{PS}s) - \log_2 l \;. \qquad (17) $$

The sparsity measure we are using chooses as best window the one which minimizes the entropy measure: we deduce from (17) that it is the one obtained with
the largest scaling factor available, therefore with the largest time support. This
is coherent with our expectation, as stationary signals, such as sinusoids, are best
analyzed with a high frequency resolution, because time-independence allows a
low time resolution. Moreover, this is true for any order $\alpha$ used for the entropy
calculus. Symmetric considerations apply whenever the spectrogram of a signal
does not depend on frequency, as for impulses.

3.2 The Parameter α

The parameter $\alpha$ in (14) introduces a biasing on the spectral coefficients; to
have a qualitative description of this biasing, we consider a collection of simple
spectrograms composed of a variable amount of large and small coefficients. We
realize a vector $D$ of length $N = 100$ generating numbers between 0 and 1 with
a normal random distribution; then we consider the vectors $D_M$, $1 \le M \le N$,
such that

$$ D_M[k] = \begin{cases} D[k] & \text{if } k \le M \\ \dfrac{D[k]}{20} & \text{if } k > M \end{cases} \qquad (18) $$

and then normalize to obtain a unitary sum. We then apply Rényi entropy
measures with $\alpha$ varying between 0 and 30: as we see from Figure 1, there is a
relation between $M$ and the slope of the entropy curves for the different values
of $\alpha$. For $\alpha = 0$, $H_0[D_M]$ is the logarithm of the number of non-zero coefficients
and is therefore constant; when $\alpha$ increases, we see that densities with a small
amount of large coefficients gradually decrease their entropy, faster than the
almost flat vectors corresponding to larger values of $M$. This means that by
increasing $\alpha$ we emphasize the difference between the entropy values of a peaky
distribution and that of a nearly flat one. The sparsity measure we consider
selects as best analysis the one with minimal entropy, so reducing $\alpha$ raises the
probability of less peaky distributions being chosen as sparsest: in principle, this
is desirable, as weaker components of the signal, such as partials, have to be
taken into account in the sparsity evaluation. Nevertheless, this principle should
be applied with care, as a small coefficient in a spectrogram could be due to
a partial as well as to a noise component; choosing an extremely small $\alpha$,
the best window chosen could vary without a reliable relation to spectral
concentration, depending on the noise level within the sound.


Fig. 1. Rényi entropy evaluations of the D_M vectors with varying α; the distribution
becomes flatter as M increases
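The experiment of (18) and Figure 1 is easy to reproduce qualitatively; the random seed, the positivity/normalisation of D and the α grid below are our own choices rather than the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
D = np.abs(rng.normal(size=N))           # random values, rescaled to lie between 0 and 1
D /= D.max()

def renyi(p, alpha, eps=1e-12):
    p = p / p.sum()
    return np.log2((p ** alpha).sum() + eps) / (1.0 - alpha)

for M in (10, 50, 100):                  # few large coefficients ... nearly flat vector
    DM = D.copy()
    DM[M:] /= 20.0                       # attenuate the tail, as in (18)
    DM /= DM.sum()                       # unitary sum
    # Entropy drops faster with alpha when the density is peaky (small M).
    print(M, [round(renyi(DM, a), 2) for a in (0.1, 1.5, 10.0, 30.0)])
```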

3.3 Time and Frequency Steps

A last remark regards the dependency of (15) on the time and frequency steps a
and b used for the discretization of the spectrogram. When considering signals as
finite vectors, a signal and its Fourier transform have the same length. Therefore,
in the STFT the window length determines the number of frequency points, while
the sampling rate sets the frequency values: the definition of b is thus implicit in
the window choice. Actually, the FFT algorithm allows one to specify a number
of frequency points larger than the signal length: further frequency values are
obtained as an interpolation between the original ones by properly adding zero
values to the signal. If the sampling rate is fixed, this procedure yields a smaller
b as a consequence of a larger number of frequency points. We have numerically
verified that such a variation of b has no impact on the entropy calculus, so that
the FFT size can be set according to implementation needs.
Regarding the time step a, we are working on the analytical demonstration
of a largely verified piece of evidence: as long as the decomposing system is a frame,
the entropy measure is invariant to redundancy variation, so the choice of a can be
ruled by considerations on the invertibility of the decomposing frame without
losing coherence between the information measures of the different analyses. This
is a key point, as it states that the sparsity measure obtained allows a total
independence between the hop sizes of the different analyses: with the implementation of proper structures to handle multi-hop STFTs we have obtained a
more efficient algorithm in comparison with those imposing a fixed hop size, such as
[15] and the first version of the one we have realized.

4 Algorithm and Examples

We now summarize the main operations of the algorithm we have developed,
providing examples of its application. For the calculation of the spectrograms
we use a Hanning window

$$ h(t) = \cos^2(\pi t)\, \chi_{[-\frac{1}{2},\frac{1}{2}]}(t) \;, \qquad (19) $$

with $\chi$ the indicator function of the specified interval, but it is obviously possible
to generalize the results thus obtained to the entire class of compactly supported
window functions. In both versions of our algorithm we create a multiple
Gabor frame as in (5), using as mother functions some scaled versions of $h$,
obtained as in (8) with a finite set of positive real scaling factors $L$.
We consider consecutive segments of the signal, and for each segment we
calculate $|L|$ spectrograms with the $|L|$ scaled windows: the length of the analysis
segment and the overlap between two consecutive segments are given as
parameters.
In the first version of the algorithm the different frames composing the multi-frame have the same time step a and frequency step b: this guarantees that for
each signal segment the different frames have Heisenberg boxes whose centers
lie on the same lattice on the time-frequency plane, as illustrated in Figure 2. To



Fig. 2. An analysis segment: time locations of the Heisenberg boxes associated to the
multi-frame used in the first version of our algorithm


Fig. 3. Two different spectrograms of a B4 note played by a marimba, with Hanning
windows of sizes 512 (top) and 4096 (bottom) samples

guarantee that all the |L| scaled windows constitute a frame when translated
and modulated according to this global lattice, the time step a must be set
to the hop size assigned to the smallest window frame. On the other hand, as
the FFT of a discrete signal has the same number of points as the signal itself,
the frequency step b has to correspond to the FFT size of the largest window
analysis: for the smaller ones, a zero-padding is performed.


Fig. 4. Example of an adaptive analysis performed by the first version of our algorithm
with four Hanning windows of different sizes (512, 1024, 2048 and 4096 samples) on a
B4 note played by a marimba: on top, the best window chosen as a function of time; at
the bottom, the adaptive spectrogram. The entropy order is α = 0.7 and each analysis
segment contains twenty-four analysis frames with a sixteen-frame overlap between
consecutive segments.

Each signal segment identifies a time-frequency rectangle G for the entropy
evaluation: the horizontal edge is the time interval of the considered segment,
while the vertical one is the whole frequency lattice. For each spectrogram, the
rectangle G defines a subset of coefficients belonging to G itself. The |L| different
subsets do not correspond to the same part of the signal, as the windows have
different time supports. Therefore, a preliminary weighting of the signal has to be
performed before the calculation of the local spectrograms: this step is necessary to
balance the influence on the entropy calculus of coefficients which regard
parts of the signal shared or not shared by the different analysis frames.
After the pre-weighting, we calculate the entropy of every spectrogram as in
(15). Having the |L| entropy values corresponding to the different local spectrograms, the sparsest local analysis is defined as the one with minimum Rényi
entropy: the window associated to the sparsest local analysis is chosen as best
window at all the time points contained in G.
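The selection loop just described can be summarised as follows. This is our own schematic rendering of the first version of the algorithm (global hop and FFT size shared by all windows), with scipy's STFT standing in for the multi-Gabor analysis; the pre-weighting of the segment boundaries is omitted for brevity, and the constant log2(ab) term of (15) is dropped since it does not affect the argmin.

```python
import numpy as np
from scipy.signal import stft

def _renyi(ps, alpha, eps=1e-12):
    p = ps / (ps.sum() + eps)
    return np.log2((p ** alpha).sum() + eps) / (1.0 - alpha)

def best_windows(x, fs, win_sizes=(512, 1024, 2048, 4096),
                 alpha=0.7, seg_frames=24, hop=256):
    """For each analysis segment, return the window size whose spectrogram
    has minimal Renyi entropy (i.e. the sparsest local analysis)."""
    nfft = max(win_sizes)                        # zero-padding for the smaller windows
    specs = []
    for w in win_sizes:
        _, _, Z = stft(x, fs, window='hann', nperseg=w,
                       noverlap=w - hop, nfft=nfft, boundary=None)
        specs.append(np.abs(Z) ** 2)
    n_frames = min(s.shape[1] for s in specs)
    choices = []
    for start in range(0, n_frames - seg_frames + 1, seg_frames):
        ent = [_renyi(s[:, start:start + seg_frames], alpha) for s in specs]
        choices.append(win_sizes[int(np.argmin(ent))])
    return choices
```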
The global time-adapted analysis of the signal is finally realized by opportunely
assembling the slices of the local sparsest analyses: they are obtained with
a further spectrogram calculation of the unweighted signal, employing the best
windows selected at each time point.
In Figure 4 we give an example of an adaptive analysis performed by our first
algorithm with four Hanning windows of different sizes on a real instrumental
sound, a B4 note played by a marimba: this sound combines the need for a


Fig. 5. An analysis segment: time locations of the Heisenberg boxes associated to the
multi-frame used in the second version of our algorithm

good time resolution at the strike with that of a good frequency resolution on
the harmonic resonance. This is fully provided by the algorithm, as shown in
the adaptive spectrogram at the bottom of Figure 4. Moreover, we see that
the pre-echo of the analysis at the bottom of Figure 3 is completely removed in
the adapted spectrogram.
The main difference in the second version of our algorithm concerns the individual frames composing the multi-frame, which have the same frequency step
b but different time steps $\{a_l : l \in L\}$: the smallest and largest window sizes are
given as parameters together with |L|, the number of different windows composing the multi-frame, and the global overlap needed for the analyses. The
algorithm fixes the intermediate sizes so that, for each signal segment, the different frames have the same overlap between consecutive windows, and therefore the
same redundancy.
This choice highly reduces the computational cost by avoiding unnecessarily
small hop sizes for the larger windows, and as we have observed in the previous
section it does not affect the entropy evaluation. Such a structure generates an
irregular time disposition of the multi-frame elements in each signal segment,
as illustrated in Figure 5; in this way we also avoid the problem of unshared
parts of signal between the systems, but we still have a different influence of the
boundary parts depending on the analysis frame: the beginning and the end of
the signal segment have a higher energy when windowed in the smaller frames.
This is avoided with a preliminary weighting: the beginning and the end of each
signal segment are windowed respectively with the first and second half of the
largest analysis window.
As for the first implementation, the weighting does not concern the decomposition for re-synthesis purposes, but only the analyses used for the entropy evaluations.


Fig. 6. Example of an adaptive analysis performed by the second version of our algorithm
with eight Hanning windows of different sizes from 512 to 4096 samples, on a
B4 note played by a marimba sampled at 44.1 kHz: on top, the best window chosen
as a function of time; at the bottom, the adaptive spectrogram. The entropy order is
α = 0.7 and each analysis segment contains four frames of the largest window analysis
with a two-frame overlap between consecutive segments.

After the pre-weighting, the algorithm follows the same steps described above:
calculation of the |L| local spectrograms, evaluation of their entropy, selection of
the window providing minimum entropy, and computation of the adapted spectrogram with the best window at each time point, thus creating an analysis with
time-varying resolution and hop size.
In Figure 6 we give a first example of an adaptive analysis performed by the
second version of our algorithm with eight Hanning windows of different sizes:
the sound is still the B4 note of a marimba, and we can see that the two versions
give very similar results. Thus, if the considered application does not specifically
ask for a fixed hop size of the overall analysis, the second version is preferable
as it highly reduces the computational cost without affecting the best window
choice.
In Figure 8 we give a second example with a synthetic sound, a sinusoid with sinusoidal frequency modulation: as Figure 7 shows, a small window is best adapted
where the frequency variation is fast compared to the window length; on the
other hand, the largest window is better where the signal is almost stationary.
4.1 Re-synthesis Method

The re-synthesis method introduced in [11] gives a perfect reconstruction of the
signal as a weighted expansion of the coefficients of its STFT in the original
analysis frame. Let Sf[n, k] be the STFT of a signal f with window function h
and time step a; fixing n, through an iFFT we have a windowed segment of f


Fig. 7. Two different spectrograms of a sinusoid with sinusoidal frequency modulation,
with Hanning windows of sizes 512 (top) and 4096 (bottom) samples

$$ f_h(n, l) = h(na - l)\, f(l) \;, \qquad (20) $$

whose time location depends on $n$. An immediate perfect reconstruction of $f$ is
given by

$$ f(l) = \frac{\sum_{n=-\infty}^{+\infty} h(na - l)\, f_h(n, l)}{\sum_{n=-\infty}^{+\infty} h^2(na - l)} \;. \qquad (21) $$
In our case, after the automatic selection step we dispose of a temporal sequence
with the best windows at each time position; in the first version we have a
fixed hop size for all the windows, in the second one every window has its own
time step. In both cases we have thus reduced the initial multi-frame to
a nonstationary Gabor frame: we extend the technique of (21) using a
variable window h and time step a according to the composition of the reduced
multi-frame, obtaining a perfect reconstruction as well. The interest of (21) is
that the given distribution does not need to be the STFT of a signal: for example,
a transformation $\tilde{S}[n,k]$ of the STFT of a signal could be considered. In this
case, (21) gives the signal whose STFT has minimal least squares error with
$\tilde{S}[n,k]$.
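A direct transcription of (21) with a time-varying window, as needed after the best-window selection, might look as follows. Here frames are the (possibly modified) windowed segments obtained by inverse FFT of each STFT column; the function is only our sketch of the weighted overlap-add, not the authors' code.

```python
import numpy as np

def overlap_add_resynthesis(frames, windows, positions, sig_len):
    """Reconstruct f from windowed segments f_h(n, .) as in (21), allowing the
    window (and hence the hop) to change from frame to frame."""
    num = np.zeros(sig_len)     # accumulates sum_n h(na - l) f_h(n, l)
    den = np.zeros(sig_len)     # accumulates sum_n h(na - l)^2
    for seg, win, pos in zip(frames, windows, positions):
        end = min(pos + len(win), sig_len)
        num[pos:end] += win[: end - pos] * seg[: end - pos]
        den[pos:end] += win[: end - pos] ** 2
    den[den == 0] = 1.0         # outside the covered support
    return num / den
```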
As seen from equations (9) and (11), the theoretical existence and the mathematical definition of the canonical dual frame for a nonstationary Gabor frame
like the one we employ have been provided [14]: it is thus possible to define the
whole analysis and re-synthesis framework within the Gabor theory. We are at
present working on the interesting analogies between the two approaches, to
establish a unified interpretation and develop further extensions.


Fig. 8. Example of an adaptive analysis performed by the second version of our algorithm
with eight Hanning windows of different sizes from 512 to 4096 samples, on
a sinusoid with sinusoidal frequency modulation synthesized at 44.1 kHz: on top, the
best window chosen as a function of time; at the bottom, the adaptive spectrogram.
The entropy order is α = 0.7 and each analysis segment contains four frames of the
largest window analysis with a three-frame overlap between consecutive segments.

5 Conclusions

We have presented an algorithm for time-adaptation of the spectrogram resolution, which can be easily integrated in existing frameworks for the analysis, transformation and re-synthesis of an audio signal: the adaptation is locally obtained
through an entropy minimization within a finite set of resolutions, which can be
defined by the user or left at default values. The user can also specify the time duration
and overlap of the analysis segments where the entropy minimization is performed,
to privilege more or less discontinuous adapted analyses.
Future improvements of this method will concern the spectrogram adaptation
in both the time and frequency dimensions: this will provide a decomposition of the
signal in several layers of analysis frames, thus requiring an extension of the
proposed technique for re-synthesis.

References
1. Baraniuk, R.G., Flandrin, P., Janssen, A.J.E.M., Michel, O.J.J.: Measuring Time-Frequency Information Content Using the Rényi Entropies. IEEE Trans. Info. Theory 47(4) (2001)
2. Borichev, A., Gröchenig, K., Lyubarskii, Y.: Frame constants of Gabor frames near the critical density. J. Math. Pures Appl. 94(2) (2010)
3. Christensen, O.: An Introduction to Frames and Riesz Bases. Birkhäuser, Boston (2003)
4. Cohen, L.: Time-Frequency Distributions - A Review. Proceedings of the IEEE 77(7) (1989)
5. Cohen, L.: Time-Frequency Analysis. Prentice-Hall, Upper Saddle River (1995)
6. Daubechies, I., Grossmann, A., Meyer, Y.: Painless nonorthogonal expansions. J. Math. Phys. 27 (1986)
7. Daubechies, I.: The Wavelet Transform, Time-Frequency Localization and Signal Analysis. IEEE Trans. Info. Theory 36(5) (1990)
8. Dörfler, M.: Gabor Analysis for a Class of Signals called Music. PhD thesis, NuHAG, University of Vienna (2002)
9. Dörfler, M.: Quilted Gabor frames - a new concept for adaptive time-frequency representation. eprint arXiv:0912.2363 (2010)
10. Flandrin, P.: Time-Frequency/Time-Scale Analysis. Academic Press, San Diego (1999)
11. Griffin, D.W., Lim, J.S.: Signal Estimation from Modified Short-Time Fourier Transform. IEEE Trans. Acoust. Speech Signal Process. 32(2) (1984)
12. Gröchenig, K.: Foundations of Time-Frequency Analysis. Birkhäuser, Boston (2001)
13. Jaillet, F.: Représentation et traitement temps-fréquence des signaux audionumériques pour des applications de design sonore. PhD thesis, Université de la Méditerranée - Aix-Marseille II (2005)
14. Jaillet, F., Balazs, P., Dörfler, M.: Nonstationary Gabor Frames. INRIA a CCSD electronic archive server based on P.A.O.L (2009), http://hal.inria.fr/oai/oai.php
15. Lukin, A., Todd, J.: Adaptive Time-Frequency Resolution for Analysis and Processing of Audio. Audio Engineering Society Convention Paper (2006)
16. Mallat, S.: A Wavelet Tour of Signal Processing. Academic Press, San Diego (1999)
17. Rényi, A.: On Measures of Entropy and Information. In: Proc. Fourth Berkeley Symp. on Math. Statist. and Prob., Berkeley, California, pp. 547-561 (1961)
18. Rudoy, D., Prabahan, B., Wolfe, P.: Superposition frames for adaptive time-frequency analysis and fast reconstruction. IEEE Trans. Sig. Proc. 58(5) (2010)
19. Schlögl, F., Beck, C. (eds.): Thermodynamics of Chaotic Systems. Cambridge University Press, Cambridge (1993)
20. Sun, W.: Asymptotic properties of Gabor frame operators as sampling density tends to infinity. J. Funct. Anal. 258(3) (2010)
21. Wolfe, P.J., Godsill, S.J., Dörfler, M.: Multi-Gabor Dictionaries for Audio Time-Frequency Analysis. In: Proc. IEEE WASPAA (2001)
22. Zibulski, M., Zeevi, Y.Y.: Analysis of multiwindow Gabor-type schemes by frame methods. Appl. Comput. Harmon. Anal. 4(2) (1997)
23. Życzkowski, K.: Rényi Extrapolation of Shannon Entropy. Open Systems & Information Dynamics 10(3) (2003)

Transcription of Musical Audio Using Poisson


Point Processes and Sequential MCMC
Pete Bunch and Simon Godsill
Signal Processing and Communications Laboratory
Department of Engineering
University of Cambridge
{pb404,sjg}@eng.cam.ac.uk
http://www-sigproc.eng.cam.ac.uk/~sjg

Abstract. In this paper models and algorithms are presented for transcription of pitch and timings in polyphonic music extracts. The data
are decomposed framewise into the frequency domain, where a Poisson
point process model is used to write a polyphonic pitch likelihood function. From here Bayesian priors are incorporated both over time (to link
successive frames) and also within frames (to model the number of notes
present, their pitches, the number of harmonics for each note, and inharmonicity parameters for each note). Inference in the model is carried out
via Bayesian filtering using a powerful Sequential Markov chain Monte
Carlo (MCMC) algorithm that is an MCMC extension of particle filtering. Initial results with guitar music, both laboratory test data and
commercial extracts, show promising levels of performance.
Keywords: Automated music transcription, multi-pitch estimation,
Bayesian filtering, Poisson point process, Markov chain Monte Carlo,
particle filter, spatio-temporal dynamical model.

1 Introduction

The audio signal generated by a musical instrument as it plays a note is complex, containing multiple frequencies, each with a time-varying amplitude and
phase. However, the human brain perceives such a signal as a single note, with
associated high-level properties such as timbre (the musical texture) and
expression (loud, soft, etc.). A musician playing a piece of music takes as input
a score, which describes the music in terms of these high-level properties, and
produces a corresponding audio signal. An accomplished musician is also able to
reverse the process, listening to a musical audio signal and transcribing a score.
A desirable goal is to automate this transcription process. Further developments
in computer understanding of audio signals of this type can be of assistance
to musicologists; they can also play an important part in source separation systems, as well as in automated mark-up systems for content-based annotation of
music databases.
Perhaps the most important property to extract in the task of musical transcription is the note or notes playing at each instant. This will be the primary



aim of this work. As a subsidiary objective, it can be desirable to infer other


high level properties, such as timbre, expression and tempo.
Music transcription has become a large and active field over recent years, see
e.g. [6], and recently probabilistic Bayesian approaches have attracted attention,
see e.g. [5,2,1] to list but a few of the many recent contributions to this important
area. The method considered in this paper is an enhanced form of a frequency
domain model using a Poisson point process first developed in musical modelling
applications in [8,1]. The steps of the process are as follows. The audio signal is
first divided into frames, and an over-sampled Fast Fourier Transform (FFT) is
performed on each frame to generate a frequency spectrum. The predominant
peaks are then extracted from the amplitude of the frequency data. A likelihood
function for the observed spectral peaks may then be formulated according to an
inhomogeneous Poisson point process model (see [8] for the static single frame
formulation), conditional on all of the unknown parameters (the number of notes
present, their pitches, the number of harmonics for each note, and inharmonicity
parameters for each note). A prior distribution for these parameters, including
their evolution over time frames, then completes a Bayesian spatio-temporal
state space model. Inference in this model is carried out using a specially modified
version of the sequential MCMC algorithm [7], in which information about the
previous frame is collapsed onto a single univariate marginal representation of
the multipitch estimation.
To summarise the new contributions of this paper, we here explicitly model
within the Poisson process framework the number of notes present, the number
of harmonics for each note and the inharmonicity parameter for each note, and
we model the temporal evolution of the notes over time frames, all within a fully
Bayesian sequential updating scheme, implemented with sequential MCMC. This
contrasts with, for example, the static single frame-based approach of our earlier
Poisson process transcription work [8].

2 Models and Algorithms

2.1 The Poisson Likelihood Model

When a note is played on a musical instrument, a vibration occurs at a unique


fundamental frequency. In addition, an array of partial frequencies is also
generated. To a first order approximation, these occur at integer multiples of
the fundamental frequency. In fact, a degree of inharmonicity will usually occur,
especially for plucked or struck string instruments [4] (including the guitars considered as examples in this work). The inclusion of inharmonicity in the Poisson
likelihood models here adopted has not been considered before to our knowledge.
In this paper, compared with [8], we introduce an additional inharmonicity parameter for each musical pitch. This is incorporated in a similar fashion to the
inharmonicity model in [1], in which an entirely different time domain signal
model was adopted.
We consider firstly a single frame of data, as in [8], then extend to the sequential modelling of many frames. Examining the spectrum of a single note,


Fig. 1. An example of a single note spectrum, with the associated median threshold
(using a window of 4 frequency bins) and peaks identified by the peak detection
algorithm (circles)

such as that in Figure 1, it is evident that a substantial part of the information


about pitch is contained in the frequency and amplitude of the spectral peaks.
The amplitudes of these peaks will vary according to the volume of the note,
the timbre of the instrument, and with time (high frequency partials will decay
faster, interference will cause beating, etc.), and are thus challenging to model.
Here, then, for reasons of simplicity and robustness of the model, we consider
only the frequencies at which peaks are observed. The set of observed peaks is
constructed by locating frequencies which have an amplitude larger than those
on each side, and whose amplitude also exceeds a median filter threshold. See Figure 1 for
an illustration of the procedure.
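The peak-picking step can be written compactly, for instance as below; the oversampled FFT length is a free parameter, and we use a 5-bin median filter since scipy's medfilt requires an odd kernel (the caption of Figure 1 mentions a 4-bin window), so this is a sketch rather than the authors' exact procedure.

```python
import numpy as np
from scipy.signal import medfilt

def detect_peaks(frame, n_fft_oversampled, kernel=5):
    """Binary peak vector y_k for one frame: local maxima of the magnitude
    spectrum that also exceed a running median threshold (cf. Figure 1)."""
    mag = np.abs(np.fft.rfft(frame, n=n_fft_oversampled))
    thresh = medfilt(mag, kernel_size=kernel)
    local_max = np.r_[False, (mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:]), False]
    return (local_max & (mag > thresh)).astype(int)   # y_k in {0, 1}
```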
For the Poisson likelihood model, the occurrence of peaks in the frequency
domain is assumed to follow an inhomogeneous Poisson process, in which an
intensity function $\lambda_k$ gives the mean value of a Poisson distribution at the $k$th
frequency bin ($\lambda_k$ is the integral, over the $k$th frequency bin width, of an intensity
function defined in continuous frequency, $\lambda(f)$). The principal advantage of such
a model is that we do not have to perform data association: there is no need
to identify uniquely which spectral peak belongs to which note harmonic. A
consequence of this simplification is that each harmonic in each musical note is
deemed capable of generating more than one peak in the spectrum. Examining
the $k$th FFT frequency bin, with Poisson intensity $\lambda_k$ and in which $Z_k$ peaks
occur, the probability of observing $n$ spectral peaks is given by the Poisson
distribution:

$$ P(Z_k = n) = \frac{e^{-\lambda_k} \lambda_k^n}{n!} \qquad (1) $$
In fact, since it is not feasible to observe more than a single peak in each
frequency bin, we here consider only the binary detection of either zero peaks,
or one or more peaks, as in [8]:


Zero peaks: $P(Z_k = 0) = e^{-\lambda_k}$

One or more peaks: $P(Z_k \ge 1) = 1 - e^{-\lambda_k}$ \qquad (2)

A single frame of data can thus be expressed as a binary vector where each
term indicates the presence or absence of a peak in the corresponding frequency
bin. As the bin observations are independent (following from the Poisson
process assumption), the likelihood of the observed spectrum is given by:
$$ P(Y \mid \lambda) = \prod_{k=1}^{K} \bigl[ y_k (1 - e^{-\lambda_k}) + (1 - y_k)\, e^{-\lambda_k} \bigr] \qquad (3) $$

where $Y = \{y_1, y_2, \ldots, y_K\}$ are the observed peak data in the $K$ frequency
bins, such that $y_k = 1$ if a peak is observed in the $k$th bin, and $y_k = 0$ otherwise.
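Equations (2)-(3) translate directly into a log-likelihood routine; the following generic sketch of ours takes the binary peak vector and the vector of bin intensities λ_k.

```python
import numpy as np

def log_likelihood(y, lam):
    """Log of (3): y is the binary peak vector, lam the Poisson intensities
    per frequency bin (integral of lambda(f) over each bin)."""
    p_peak = 1.0 - np.exp(-lam)               # P(Z_k >= 1)
    p = np.where(y == 1, p_peak, 1.0 - p_peak)
    return np.sum(np.log(p + 1e-300))         # small floor avoids log(0)
```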
It only remains to formulate the intensity function $\lambda(f)$, and hence
$\lambda_k = \int_{f \in k\text{th bin}} \lambda(f)\, df$. For this purpose, the Gaussian mixture model of Peeling et al. [8] is used. Note that in this formulation we can regard each harmonic
of each note to be an independent Poisson process itself, and hence by the union
property of Poisson processes, all of the individual Poisson intensities add to
give a single overall intensity $\lambda$, as follows:

$$ \lambda(f) = \sum_{j=1}^{N} \lambda_j(f) + \lambda_c \qquad (4) $$

$$ \lambda_j(f) = \sum_{h=1}^{H_j} \frac{A}{\sqrt{2\pi\sigma_{j,h}^2}} \exp\!\left( -\frac{(f - f_{j,h})^2}{2\sigma_{j,h}^2} \right) \qquad (5) $$

where $j$ indicates the note number, $h$ indicates the partial number, and $N$ and
$H_j$ are the numbers of notes and harmonics in each note, respectively. $\lambda_c$ is a
constant that accounts for detected clutter peaks due to noise and non-musical
sounds. $\sigma_{j,h}^2 = \sigma^2 h^2$ sets the variance of each Gaussian. $A$ and $\sigma$ are constant
parameters, chosen so as to give good performance on a set of test pieces. $f_{j,h}$ is
the frequency of the $h$th partial of the $j$th note, given by the inharmonic model
[4]:

$$ f_{j,h} = f_{0,j}\, h \sqrt{1 + B_j h^2} \qquad (6) $$

$f_{0,j}$ is the fundamental frequency of the $j$th note. $B_j$ is the inharmonicity parameter for the note (of the order $10^{-4}$).
Three parameters for each note are variable and to be determined by the
inference engine: the fundamental, the number of partials, and the inharmonicity.
Moreover, the number of notes $N$ is also treated as unknown in the fully Bayesian
framework.
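For reference, the intensity of (4)-(6) can be evaluated as below; the values of A, σ and the clutter level are placeholders of our choosing rather than the constants tuned by the authors.

```python
import numpy as np

def note_intensity(f, f0, B, H, A=1.0, sigma=2.0):
    """lambda_j(f) of (5) with inharmonic partials (6); f is an array of
    frequencies in Hz, A and sigma are assumed constants."""
    h = np.arange(1, H + 1)
    f_jh = f0 * h * np.sqrt(1.0 + B * h ** 2)           # partial frequencies, Eq. (6)
    var = (sigma * h) ** 2                              # sigma_{j,h}^2 = sigma^2 h^2
    gauss = A / np.sqrt(2 * np.pi * var)[None, :] * \
            np.exp(-(f[:, None] - f_jh[None, :]) ** 2 / (2 * var)[None, :])
    return gauss.sum(axis=1)

def total_intensity(f, notes, clutter=1e-3):
    """lambda(f) of (4): sum of per-note intensities plus a constant clutter term."""
    lam = np.full_like(f, clutter, dtype=float)
    for (f0, B, H) in notes:
        lam += note_intensity(f, f0, B, H)
    return lam

# Example: two notes with fundamentals 196 Hz and 247 Hz, B of order 1e-4, 10 partials.
freqs = np.linspace(0, 5000, 2048)
lam = total_intensity(freqs, [(196.0, 1e-4, 10), (247.0, 1e-4, 10)])
```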
2.2 Prior Distributions and Sequential MCMC Inference

The prior $P(\theta)$ over the unknown parameters $\theta$ in each time frame may be
decomposed, assuming parameters of different notes are independent, as:


$$ P(\theta) = P(N) \prod_{j=1}^{N} P(f_{0,j})\, P(H_j \mid f_{0,j})\, P(B_j \mid H_j, f_{0,j}) \qquad (7) $$

In fact, we have here assumed all priors to be uniform over their expected ranges,
except for $f_{0,j}$ and $N$, which are stochastically linked to their values in previous
frames. To consider this linkage explicitly, we now introduce a frame number
label $t$ and the corresponding parameters for frame $t$ as $\theta_t$, with frame peak data
$Y_t$. In order to carry out optimal sequential updating we require a transition
density $p(\theta_t \mid \theta_{t-1})$, and assume that the $\{\theta_t\}$ process is Markovian. Then we can
write the required sequential update as:

$$ p(\theta_{t-1:t} \mid Y_{1:t}) \propto p(\theta_{t-1} \mid Y_{1:t-1})\, p(\theta_t \mid \theta_{t-1})\, p(Y_t \mid \theta_t) \qquad (8) $$
To see how this can be implemented in a sequential MCMC framework, assume
that at time $t-1$ the inference problem is solved and a set of $M \gg 1$ Monte
Carlo (dependent) samples $\{\theta_{t-1}^{(i)}\}$ are available from the previous time's target
distribution $p(\theta_{t-1} \mid Y_{1:t-1})$. These samples are then formed into an empirical
distribution $\hat{p}(\theta_{t-1})$ which is used as an approximation to $p(\theta_{t-1} \mid Y_{1:t-1})$ in Eq.
(8). This enables the (approximated) time-updated distribution $p(\theta_{t-1:t} \mid Y_{1:t})$ to
be evaluated pointwise, and hence a new MCMC chain can be run with Eq. (8)
as its target distribution. The converged samples from this chain are then used
to approximate the posterior distribution at time $t$, and the whole procedure
repeats as time step $t$ increases.
The implementation of the MCMC at each time step is quite complex, since
it involves updating all elements of the parameter vector $\theta_t$, including the
number of notes, the fundamental frequencies, the number of harmonics in each
note and the inharmonicity parameter for each note. This is carried out via a
combination of Gibbs sampling and Metropolis-within-Gibbs sampling, using a
Reversible Jump formulation wherever the parameter dimension (i.e. the number
of notes in the frame) needs to change; see [7] for further details of how such
schemes can be implemented in tracking and finance applications and [3] for
general information about MCMC. In order to enhance the practical performance
we modified the approximating density at $t-1$, $\hat{p}(\theta_{t-1})$, to be a univariate
density over one single fundamental frequency, which can be thought of as the
posterior distribution of fundamental frequency at time $t-1$ with all the other
parameters marginalised, including the number of notes, and a univariate density
over the number of notes. This collapsing of the posterior distribution onto a
univariate marginal, although introducing an additional approximation into the
updating formula, was found to enhance the MCMC exploration at the next
time step significantly, since it avoids combinatorial updating issues that increase
dramatically with the dimension of the full parameter vector $\theta_t$.
Having carried out the MCMC sampling at each time step, the fundamental
frequencies and their associated parameters (inharmonicity and number of harmonics, if required) may be estimated. This estimation is based on extracting
maxima from the collapsed univariate distribution over fundamental frequency,
as described in the previous paragraph.

(a) 2-note chords    (b) 3-note chords    (c) Tears in Heaven


Fig. 2. Reversible Jump MCMC results: dots indicate note estimates. The line below indicates the estimate of the number of notes. Crosses in panels (a) and (b) indicate notes
estimated by the MCMC algorithm but removed by post-processing. A manually obtained ground truth is shown overlaid in panel (c).


3 Results

The methods have been evaluated on a selection of guitar music extracts, recorded
both in the laboratory and taken from commercial recordings. See Fig. 2 in which
three guitar extracts, two lab-generated (a) and (b) and one from a commercial
recording (c) are processed. Note that a few spurious note estimates arise, particularly around instants of note change, and many of these have been removed
by a post-processing stage which simply eliminates note estimates which last
for a single frame. The results are quite accurate, agreeing well with manually
obtained transcriptions.
When two notes an octave apart are played together, the upper note is not
found. See the final chord of panel (a) in Figure 2. This is attributable to the two
notes sharing many of the same partials, making discrimination difficult based
on peak frequencies alone.
In the case of strong notes, the algorithm often correctly identifies up to 35
partial frequencies. In this regard, the use of inharmonicity modelling has proved
successful: without this feature, the estimate of the number of harmonics is often
lower, due to the inaccurate partial frequencies predicted by the linear model.
The effect of the sequential formulation is to provide a degree of smoothing
when compared to the frame-wise algorithm. Fewer single-frame spurious notes
appear, although these are not entirely removed, as shown in Figure 2. Octave
errors towards the end of notes are also reduced.

4 Conclusions and Future Work

The new algorithms have shown significant promise, especially given that the
likelihood function takes account only of peak frequencies and not amplitudes
or other information that may be useful for a transcription system. The good
performance so far obtained is a result of several novel modelling and algorithmic features, notably the formulation of a flexible frame-based model that can
account robustly for inharmonicities, unknown numbers of notes and unknown
numbers of harmonics in each note. A further key feature is the ability to link
frames together via a probabilistic model; this makes the algorithm more robust
in estimation of continuous fundamental frequency tracks from the data. A final
important component is the implementation through sequential MCMC, which
allows us to obtain reasonably accurate inferences from the models as posed.
The models may be improved in several ways, and work is underway to address
these issues. A major point is that the current Poisson model accounts only
for the frequencies of the peaks present. It is likely that performance may be
improved by including the peak amplitudes in the model. For example, this
might make it possible to distinguish more robustly when two notes an octave
apart are being played. Improvements are also envisaged in the dynamical prior
linking one frame to the next, which is currently quite crudely formulated. Thus,
further improvements will be possible if the dependency between frames is more
carefully considered, incorporating melodic and harmonic principles to generate


likely note and chord transitions over time. Ideally also, the algorithm should be
able to run in real time, processing a piece of music as it is played. Currently,
however, the Matlab-based processing runs at many times real time, and we will
study the parallel processing possibilities (as a simple starting point, the MCMC
runs can be split into several shorter parallel chains at each time frame within
a parallel architecture).

References
1. Cemgil, A., Godsill, S.J., Peeling, P., Whiteley, N.: Bayesian statistical methods for audio and music processing. In: O'Hagan, A., West, M. (eds.) Handbook of Applied Bayesian Analysis, OUP (2010)
2. Davy, M., Godsill, S., Idier, J.: Bayesian analysis of polyphonic western tonal music. Journal of the Acoustical Society of America 119(4) (April 2006)
3. Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (eds.): Markov Chain Monte Carlo in Practice. Chapman and Hall, Boca Raton (1996)
4. Godsill, S.J., Davy, M.: Bayesian computational models for inharmonicity in musical instruments. In: Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY (October 2005)
5. Kashino, K., Nakadai, K., Kinoshita, T., Tanaka, H.: Application of the Bayesian probability network to music scene analysis. In: Rosenthal, D.F., Okuno, H. (eds.) Computational Audio Scene Analysis, pp. 115-137. Lawrence Erlbaum Associates, Mahwah (1998)
6. Klapuri, A., Davy, M.: Signal processing methods for music transcription. Springer, Heidelberg (2006)
7. Pang, S.K., Godsill, S.J., Li, J., Septier, F.: Sequential inference for dynamically evolving groups of objects. To appear in: Barber, Cemgil, Chiappa (eds.) Inference and Learning in Dynamic Models, CUP (2009)
8. Peeling, P.H., Li, C., Godsill, S.J.: Poisson point process modeling for polyphonic music transcription. Journal of the Acoustical Society of America Express Letters 121(4), EL168-EL175 (2007)

Single Channel Music Sound Separation Based


on Spectrogram Decomposition and Note
Classification
Wenwu Wang and Hafiz Mustafa
Centre for Vision, Speech and Signal Processing (CVSSP)
University of Surrey, GU2 7XH, UK
{w.wang,hm00045}@surrey.ac.uk
http://www.surrey.ac.uk/cvssp

Abstract. Separating multiple music sources from a single channel mixture is a challenging problem. We present a new approach to this problem
based on non-negative matrix factorization (NMF) and note classification, assuming that the instruments used to play the sound signals are
known a priori. The spectrogram of the mixture signal is first decomposed
into building components (musical notes) using an NMF algorithm. The
Mel frequency cepstrum coefficients (MFCCs) of both the decomposed
components and the signals in the training dataset are extracted. The
mean squared errors (MSEs) between the MFCC feature space of the
decomposed music component and those of the training signals are used
as the similarity measures for the decomposed music notes. The notes are
then labelled to the corresponding type of instruments by the K nearest
neighbors (K-NN) classification algorithm based on the MSEs. Finally,
the source signals are reconstructed from the classified notes and the
weighting matrices obtained from the NMF algorithm. Simulations are
provided to show the performance of the proposed system.
Keywords: Non-negative matrix factorization, single-channel sound
separation, Mel frequency cepstrum coefficients, instrument classification, K nearest neighbors, unsupervised learning.

1 Introduction

Recovering multiple unknown sources from a one-microphone signal, which is


an observed mixture of these sources, is referred to as the problem of single-channel (or monaural) sound source separation. The single-channel problem is
an extreme case of under-determined separation problems, which are inherently
ill-posed, i.e., more unknown variables than the number of equations. To solve
the problem, additional assumptions (or constraints) about the sources or the
propagating channels are necessary. For an underdetermined system with two


The work of W. Wang was supported in part by an Academic Fellowship of the


RCUK/EPSRC (Grant number: EP/C509307/1).




microphone recordings, it is possible to separate the sources based on spatial


diversity using determined independent component analysis (ICA) algorithms
and an iterative procedure [17]. However, unlike the techniques in e.g. ADRess
[2] and DUET [18] that require at least two mixtures, the cues resulting from the
sensor diversity are not available in the single channel case, and thus separation
is difficult to achieve based on ICA algorithms.
Due to the demand from several applications such as audio coding, music information retrieval, music editing and digital library, this problem has attracted
increasing research interest in recent years [14]. A number of methods have been
proposed to tackle this problem. According to the recent review by Li et al. [14],
these methods can be approximately divided into three categories: (1) signal
modelling based on traditional signal processing techniques, such as sinusoidal
modelling of the sources, e.g. [6], [23], [24]; (2) learning techniques based on statistical tools, such as independent subspace analysis [4] and non-negative matrix
(or tensor) factorization, e.g. [19], [20], [27], [28], [25], [8], [30]; (3) psychoacoustical mechanism of human auditory perception, such as computational auditory
scene analysis (CASA), e.g. [15], [3], [26], [32], [14]. Sinusoidal modelling methods
try to decompose the signal into a combination of sinusoids, and then estimate
their parameters (frequencies, amplitudes, and phases) from the mixture. These
methods have been used particularly for harmonic sounds. The learning based
techniques do not exploit explicitly the harmonic structure of the signals, instead they use the statistical information that is estimated from the data, such
as the independence or sparsity of the separated components. The CASA based
techniques build separation systems on the basis of the perceptual theory by
exploiting the psychoacoustical cues that can be computed from the mixture,
such as common amplitude modulation.
In this paper, a new algorithm is proposed for the problem of single-channel
music source separation. The algorithm is based mainly on the combination of
note decomposition with note classification. The note decomposition is achieved
by a non-negative matrix factorization (NMF) algorithm. NMF has been previously used for music sound separation and transcription, see e.g. [11], [1], [7],
[20], [29], [30]. In this work, we first use the NMF algorithm in [25] to decompose the spectrogram of the music mixture into building components (musical
notes). Then, Mel Frequency Cepstrum Coefficients (MFCCs) feature vectors
are extracted from the segmented frames of each decomposed note. To divide
the separated notes into their corresponding instrument categories, the K nearest neighbor (K-NN) classifier [10] is used. The K-NN classifier is an algorithm
that is simple to implement and also provides good classification performance.
The source signals are reconstructed by combining the notes having the same class
labels. The remainder of the paper is organized as follows. The proposed separation system is described in Section 2 in detail. Some preliminary experimental
results are shown in Section 3. Discussions about the proposed method are given
in Section 4. Finally, Section 5 summarises the paper.


2 The Proposed Separation System

This section describes the details of the processes in our proposed sound source
separation system. First, the single-channel mixture of music sources is decomposed into basic building blocks (musical notes) by applying the NMF algorithm.
The NMF algorithm describes the mixture in the form of basis functions and
their corresponding weights (coefficients) which represent the strength of each
basis function in the mixture. The next step is to extract the feature vectors
of the musical notes and then classify the notes into different source streams.
Finally, the source signals are reconstructed by combining the notes with the
same class labels. In this work, we assume that the instruments used to generate
the music sources are known a priori. In particular, two kinds of instruments,
i.e. piano and violin, were used in our study. The block diagram of our proposed
system is depicted in Figure 1.
2.1 Music Decomposition by NMF

In many data analysis tasks, it is a fundamental problem to find a suitable representation of the data so that the underlying hidden structure of the data may
be revealed or displayed explicitly. NMF is a data-adaptive linear representation technique for 2-D matrices that was shown to have such potential. Given
a non-negative data matrix X, the objective of NMF is to find two non-negative
matrices W and H [12], such that

X = WH                                   (1)

In this work, X is an S × T matrix representing the mixture signal, W is the
basis matrix of dimension S × R, and H is the weighting coefficient matrix of
Fig. 1. Block diagram of the proposed system


dimension R × T. The number of bases used to represent the original matrix is
described by R, i.e. the decomposition rank. Due to the non-negativity constraints,
this representation is purely additive. Many algorithms can be used to find a
suitable pair of W and H such that the error of the approximation is minimised,
see e.g. [12], [13], [7], [20] and [30]. In this work, we use the algorithm proposed in
[25] for the note decomposition. In comparison to the classical algorithm in [12],
this algorithm considers additional constraints derived from the structure of the signal.
Due to the non-negativity constraints, the time-domain signal (with negative
values) needs to be transformed into another domain so that only non-negative
values are present in X for an NMF algorithm to be applied. In this work, the
music sound is transformed into the frequency domain using, e.g., the short-time
Fourier transform (STFT). The matrix X is generated as the spectrogram of
the signal, and in our study, the frame size of each segment equals 40 ms,
and a 50 percent overlap between adjacent frames is used. An example of
matrix X generated from music signals is shown in Figure 2, where two music
sources with each having a music note repeating twice were mixed together. One
of the sources contains musical note G4, and the other is composed of note A3.
The idea of decomposing the mixture signal into individual music components is
based on the observation that a music signal may be represented by a set of basic
building blocks such as musical notes or other general harmonic structures. The
basic building blocks are also known as basis vectors and the decomposition of the
single-channel mixture into basis vectors is the first step towards the separation
of multiple source signals from the single-channel mixture. If different sources
in the mixture represent different basis vectors, then the separation problem
can be regarded as a problem of classification of basis vectors into different
categories. The source signals can be obtained by combining the basis vectors in
each category.
The above mixture (or NMF) model can be equally written as

$$ X = \sum_{r=1}^{R} w_r h_r \qquad (2) $$


Fig. 2. The contour plot of a sound mixture (i.e. the matrix X) containing two different
musical notes G4 and A3


Fig. 3. The contour plots of the individual musical notes which were obtained by
applying an NMF algorithm to the sound mixture X. The separated notes G4 and A3
are shown in the left and right plot respectively.

where $w_r$ is the $r$th column of $W = [w_1, w_2, \ldots, w_R]$, which contains the
basis vectors, and $h_r$ is the $r$th row of $H = [h_1, h_2, \ldots, h_R]^T$, which contains
the weights or coefficients of each basis function in matrix W, where the superscript T denotes matrix transposition. Many algorithms, including those mentioned
above, can be applied to obtain such basis functions and weighting coefficients.
For example, using the algorithm developed in [30], we can decompose the mixture in Figure 2, and the resulting basis vectors (i.e. the decomposed notes) are
shown in Figure 3. From this figure, it can be observed that both note G4 and
A3 are successfully separated from the mixture.
As prior knowledge, given that the mixture of musical sounds contains two
sources (e.g. piano and violin), two different types of basis functions are learnt
from the decomposition by the NMF algorithm. The magnitude spectrograms
of the basis components (notes) of the two different sources in the mixture are
obtained by multiplying the columns of the basis matrix W with the corresponding
rows of the weight matrix H. The columns of matrix W contain the information
of the musical notes in the mixture and the corresponding rows of matrix H describe the
strength of these notes. Some rows in H do not contain useful information and are
therefore considered as noise. The noise components are considered separately
in the classification process to improve the quality of the separated sources.
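As an illustration of this decomposition step (using standard multiplicative-update NMF from scikit-learn rather than the specific algorithm of [25]), the magnitude spectrogram can be factorized as follows; the 40 ms frames and 50% overlap follow the text, while the rank R = 4 is an arbitrary choice of ours.

```python
import numpy as np
from scipy.signal import stft
from sklearn.decomposition import NMF

def decompose_notes(x, fs=44100, R=4):
    """Decompose a mono mixture into R spectral components (candidate notes)."""
    nperseg = int(0.040 * fs)                           # 1764 samples at 44.1 kHz
    _, _, Z = stft(x, fs, nperseg=nperseg, noverlap=nperseg // 2)
    S = np.abs(Z)                                       # magnitude spectrogram (S x T)
    model = NMF(n_components=R, init='nndsvd', max_iter=500)
    W = model.fit_transform(S)                          # basis spectra, S x R
    H = model.components_                               # activations, R x T
    # Magnitude spectrogram of the r-th component: outer product w_r h_r, as in (2)
    components = [np.outer(W[:, r], H[r, :]) for r in range(R)]
    return W, H, components
```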
2.2 Feature Extraction

Feature extraction is a special form of dimensionality reduction by transforming


the high dimensional data into a lower dimensional feature space. It is used in
both the training and classication processes in our proposed system. The audio
features that we used in this work are the MFCCs. The MFCCs are extracted
on a frame-by-frame basis. In the training process, the MFCCs are extracted
from a training database, and the feature vectors are then formed from these
coecients. In the classication stage, the MFCCs are extracted similarly from


Fig. 4. The 13-dimensional MFCC feature vectors calculated from two selected frames
of the four audio signals: (a) Piano..A0.wav, (b) Piano..B0.wav, (c) Violin.pizz.mf.sulG.C4B4.wav, and (d) Violin.pizz.pp.sulG.C4B4.wav. In each of the
four plots, the solid and dashed lines represent the two frames (i.e. the 400th and
900th frame), respectively.


Fig. 5. The 20-dimensional MFCC feature vectors calculated from two selected frames
of the four audio signals: (a) Piano..A0.wav, (b) Piano..B0.wav, (c) Violin.pizz.mf.sulG.C4B4.wav, and (d) Violin.pizz.pp.sulG.C4B4.wav. In each of the
four plots, the solid and dashed lines represent the two frames (i.e. the 400th and
900th frame), respectively.



Fig. 6. The 7-dimensional MFCC feature vectors calculated from two selected frames
of the four audio signals: (a) Piano..A0.wav, (b) Piano..B0.wav, (c) Violin.pizz.mf.sulG.C4B4.wav, and (d) Violin.pizz.pp.sulG.C4B4.wav. In each of the
four plots, the solid and dashed lines represent the two frames (i.e. the 400th and
900th frame), respectively.

the decomposed notes obtained by the NMF algorithm. In our experiments,
a frame size of 40 ms is used, which equals 1764 samples when the sampling frequency is 44100 Hz. Examples of such feature vectors are shown in
Figure 4, where the four audio files (Piano..A0.wav, Piano..B0.wav, Violin.pizz.mf.sulG.C4B4.wav, and Violin.pizz.pp.sulG.C4B4.wav) were chosen
from The University of Iowa Musical Instrument Samples Database [21] and
the feature vectors are 13-dimensional. Different dimensional features have also
been examined in this work. Figures 5 and 6 show the 20-dimensional and 7-dimensional MFCC feature vectors computed from the same audio frames and
from the same audio signals as those in Figure 4. In comparison to Figure 4, it
can be observed that the feature vectors in Figures 5 and 6 have similar shapes,
even though the higher dimensional feature vectors show more details about
the signal. However, a higher computational cost is inevitably incurred if the feature dimension is increased. In our study, we choose to compute a 13-dimensional
MFCC vector for each frame in the experiments, which offers a good trade-off
between the classification performance and the computational efficiency.
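The feature extraction can be prototyped, for example, with librosa (the paper does not name a toolbox, so the choice is purely illustrative): 13 MFCCs per 40 ms frame with 50% overlap.

```python
import librosa

def mfcc_features(y, sr=44100, n_mfcc=13):
    """13-dimensional MFCC vectors per 40 ms frame with 50% overlap."""
    n_fft = int(0.040 * sr)          # 1764 samples at 44.1 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=n_fft // 2)
    return mfcc.T                    # one row per frame, one column per coefficient
```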
2.3 Classification of Musical Notes

The main objective of classification is to maximally extract patterns on the basis
of some conditions and to separate one class from another. The K-NN classifier,
which uses a classification rule without having knowledge of the distribution
of measurements in different classes, is used in this paper for the separation


Table 1. The musical note classication algorithm


1) Calculate the 13-D MFCCs feature vectors of all the musical examples
in the training database with class labels. This creates a feature space
for the training data.
2) Extract similarly the MFCCs feature vectors of all separated components whose class labels need to be determined.
3) Assign the labels to all the feature vectors in the separated components to the appropriate classes via the K-NN algorithm.
4) The majority vote of feature vectors determines the class label of the
separated components.
5) Optimize the classification results by different choices of K.

of piano and violin notes. The basic steps in music note classification include
preprocessing, feature extraction or selection, classifier design and optimization.
The main steps used in our system are detailed in Table 1.
The main disadvantage of the classification technique based on simple majority
voting is that the classes with more frequent examples tend to come up in
the K nearest neighbors when the neighbors are computed from a large number
of training examples [5]. Therefore, the class with more frequent training examples
tends to dominate the prediction of the new vector. One possible technique
to solve this problem is to weight the classification based on the distance from
the test pattern to all of its K nearest neighbors.
2.4 K-NN Classifier

This section briefly describes the K-NN classifier used in our algorithm. K-NN
is a simple technique for pattern classification and is particularly important for
non-parametric distributions. The K-NN classifier labels an unknown pattern
x by the majority vote of its K nearest neighbors [5], [9]. The K-NN classifier
belongs to a class of techniques based on non-parametric probability density
estimation. Suppose there is a need to estimate the density function P(x) from
a given dataset. In our case, each signal in the dataset is segmented into 999
frames, and a feature vector of 13 MFCC coefficients is computed for each frame.
Therefore, the total number of examples in the training dataset is 52947. Similarly, an unknown pattern x is also a 13-dimensional MFCC feature vector
whose label needs to be determined based on the majority vote of the nearest
neighbors. The volume V around an unknown pattern x is selected such that the
number of nearest neighbors (training examples) within V is 30. We are dealing with a two-class problem with prior probability $P(\omega_i)$. The measurement
distribution of the patterns in class $\omega_i$ is denoted by $P(x \mid \omega_i)$. The
posterior class probability $P(\omega_i \mid x)$ decides the label of an unknown feature
vector of the separated note. The approximation of P(x) is given by the relation
[5], [10]
K
P (x) 
(3)
NV

92

W. Wang and H. Mustafa

where N is the total number of examples in the dataset, V is the volume surrounding the unknown pattern x, and K is the number of examples within V. The class prior probability depends on the number of examples in the dataset,

    P(ω_i) = N_i / N                                                (4)

and the measurement distribution of patterns in class ω_i is defined as

    P(x | ω_i) = K_i / (N_i V)                                      (5)

According to Bayes' theorem, the posterior probability becomes

    P(ω_i | x) = P(x | ω_i) P(ω_i) / P(x)                           (6)

Based on the above equations, we have [10]

    P(ω_i | x) = K_i / K                                            (7)

The discriminant function g_i(x) = K_i / K assigns the class label to an unknown pattern x based on the majority of examples K_i of class ω_i in volume V.

2.5 Parameter Selection
The most important parameter in the K-NN algorithm is the user-defined constant K. The best value of K depends upon the given data for classification [5]. In general, the effect of noise on classification may be reduced by selecting a higher value of K. Problems arise, however, when a large value of K is used and the boundaries between classes become less distinct [31]. To select a good value of K, heuristic techniques such as cross-validation may be used. In the presence of noisy or irrelevant features the performance of the K-NN classifier may degrade severely [5]. The selection of feature scales according to their importance is another important issue. To improve the classification results, much effort has been devoted to selecting or scaling the features in the best possible way. The optimal classification results are achieved for most datasets by selecting K = 10 or more.
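The cross-validation mentioned above for choosing K could be run, for instance, as a grid search. The sketch below assumes scikit-learn and the same illustrative train_feats/train_labels arrays as in the previous sketch.

    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    search = GridSearchCV(KNeighborsClassifier(),
                          param_grid={'n_neighbors': [5, 10, 20, 30, 50]},
                          cv=5)                 # 5-fold cross-validation
    search.fit(train_feats, train_labels)       # frame-level MFCC features and labels
    print(search.best_params_)                  # the selected value of K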
2.6 Data Preparation
For the classification of the components separated from the mixture, the features, i.e. the MFCCs, are extracted from all the signals in the training dataset and every feature vector is labelled according to its class (piano or violin). The labels of the feature vectors of the separated components, which need to be classified, are not known. Each feature vector consists of 13 MFCCs. When computing the MFCCs, the training signals and the separated components are all divided into frames, each having a length of 40 ms, and a 50 percent overlap between
the frames is used to avoid discontinuities between neighboring frames. The similarity of the feature vectors of the separated components to the feature vectors obtained from the training process determines which class the separated notes belong to. This is achieved by the K-NN classifier. If the majority vote goes to piano, then a piano label is assigned to the separated component, and vice versa.
2.7 Phase Generation and Source Reconstruction
The factorization of the magnitude spectrogram by the NMF algorithm provides frequency-domain basis functions. Therefore, the reconstruction of the source signals from the frequency-domain bases is used in this paper, for which phase information is required. Several phase generation methods have been suggested in the literature. When the components do not overlap each other significantly in time and frequency, the phases of the original mixture spectrogram produce good synthesis quality [23]. In the mixture of piano and violin signals, significant overlapping occurs between musical notes in the time domain, but the degree of overlapping is relatively low in the frequency domain. Based on this observation, the phases of the original mixture spectrogram are used to reconstruct the source signals in this work. The reconstruction process can be summarised briefly as follows. First, the phase information is added to each classified component to obtain its complex spectrum. Then the classified components from the above sections are combined into the individual source streams, and finally the inverse discrete Fourier transform (IDFT) and the overlap-and-add technique are applied to obtain the time-domain signal. When the magnitude spectra are used as the basis functions, the frame-wise spectra are obtained as the product of the basis function with its gain. If the power spectra are used, a square root needs to be taken. If the frequency resolution is non-linear, additional processing is required for the re-synthesis using the IDFT.
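For the magnitude-spectrum case, the reconstruction just described amounts to attaching the mixture phase to each classified component and inverting with overlap-and-add. A minimal sketch, assuming librosa for the inverse STFT and illustrative names for the NMF factors:

    import numpy as np
    import librosa

    def reconstruct_component(W, H, k, X_mix, hop_length):
        # W: (F, K) basis spectra, H: (K, N) gains, X_mix: complex STFT of the mixture.
        mag_k = np.outer(W[:, k], H[k, :])               # rank-1 magnitude spectrogram
        C_k = mag_k * np.exp(1j * np.angle(X_mix))       # attach the mixture phase
        return librosa.istft(C_k, hop_length=hop_length) # IDFT + overlap-and-add

The time-domain components assigned to the same instrument are then summed to form that source stream.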

Evaluations

Two music sources (played by two different instruments, i.e. piano and violin) with different numbers of notes overlapping each other in the time domain were used to artificially generate an instantaneous mixture signal. The piano and violin source signals are both 20 seconds long, containing 6 and 5 notes respectively. The K-NN classifier constant was selected as K = 30. The signal-to-noise ratio (SNR), defined as follows, was used to measure the quality of both the separated notes and the whole source signals,

    SNR(m, j) = Σ_{s,t} [X_m]_{s,t}^2 / Σ_{s,t} ([X_m]_{s,t} - [X_j]_{s,t})^2        (8)

where s and t are the row and column indices of the matrix respectively. The SNR was computed based on the magnitude spectrograms X_m and X_j of the m-th reference and the j-th separated component, to prevent the reconstruction process from affecting the quality [22].

Fig. 7. The collection of the audio features from a typical piano signal (i.e. Piano..A0.wav) in the training process. In total, 999 frames of features were computed. (Axes: MFCC feature space vs. coefficient values.)

Fig. 8. The collection of the audio features from a typical violin signal (i.e. Violin.pizz.pp.sulG.C4B4.wav) in the training process. In total, 999 frames of features were computed. (Axes: MFCC feature space vs. coefficient values.)

For the same note, j = m. In general, higher SNR values represent better separation quality of the separated notes and source signals, and vice versa. The training database used in the classification process was provided by the McGill University Master Samples Collection [16] and the University of Iowa website [21]. It contains 53 music signals, of which 29 are piano signals and the rest are violin signals. All the signals were sampled at 44100 Hz. The reference source signals were stored for the measurement of separation quality.
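Eq. (8) defines a simple energy ratio between the reference spectrogram and the residual. Since the results below are quoted in dB, the sketch converts the ratio with 10 log10, which is our reading rather than something stated explicitly in the text:

    import numpy as np

    def spectrogram_snr_db(X_ref, X_est):
        # X_ref, X_est: magnitude spectrograms of the reference and separated component.
        num = np.sum(X_ref ** 2)
        den = np.sum((X_ref - X_est) ** 2)
        return 10.0 * np.log10(num / den)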
For the purpose of training, the signals were first segmented into frames, and then the MFCC feature vectors were computed from these frames.

Fig. 9. The collection of the audio features from a separated speech component in the testing process. Similar to the training process, 999 frames of features were computed. (Axes: MFCC feature space vs. coefficient values.)

Fig. 10. The MSEs between the feature vector of a frame of the music component to be classified and those from the training data. The frame indices in the horizontal axis are ranked from the lower to the higher. The frame index 28971 is the highest frame number of the piano signals. Therefore, on this plot, to the left of this frame are those from piano signals, and to the right are those from the violin signals. (Axes: frame index vs. MSE.)

In total, 999 frames were computed for each signal. Figures 7 and 8 show the collection of the features from typical piano and violin signals (i.e. Piano..A0.wav and Violin.pizz.pp.sulG.C4B4.wav) respectively. In both figures, it can be seen that there exist features whose coefficients are all zeros, due to the silent parts of the signals. Before running the training algorithm, we performed feature selection by removing such frames of features. In the testing stage, the MFCC feature vectors of the individual music components that were separated by the NMF algorithm were calculated. Figure 9 shows the feature space of the 15th separated component (the final component in our experiment).

Fig. 11. The MSE values obtained in Figure 10 were sorted from the lower to the higher. The frame indices in the horizontal axis, associated with the MSEs, are shuffled accordingly. (Axes: sorted frame index vs. MSE.)

Fig. 12. The MSE values of the K nearest neighbors (i.e. the frames with the K minimal MSEs) are selected based on the K-NN clustering. In this experiment, K was set to 30. (Axes: K nearest frames vs. MSE.)

To determine whether this component belongs to piano or violin, we measured the mean squared error (MSE) between the feature space of the separated component and the feature spaces obtained from the training data. Figure 10 shows the MSEs between the feature vector of a frame (the final frame in this experiment) of the separated component and those obtained in the training data. Then we sort the MSEs according to their values along all these frames. The sorted MSEs are shown in Figure 11, where the frame indices were shuffled accordingly. After this, we applied the K-NN algorithm to obtain the 30 neighbors that are nearest to the separated component. The MSEs of these frames are shown in Figure 12. Their corresponding frame indices are shown in Figure 13, from which we can see that all the frame indices are greater

Fig. 13. The frame indices of the 30 nearest neighbors to the frame of the decomposed music note obtained in Figure 12. In our experiment, the maximum frame index for the piano signals is 28971, shown by the dashed line, while the frame indices of violin signals are all greater than 28971. Therefore, this typical audio frame under testing can be classified as a violin signal. (Axes: the K nearest frames vs. frame index.)

Fig. 14. A separation example of the proposed system. (a) and (b) are the piano and violin sources respectively, (c) is the single channel mixture of these two sources, and (d) and (e) are the separated sources respectively. The vertical axes are the amplitude of the signals; the horizontal axis is time in samples.

than 28971, which was the highest index number of the piano signals in the
training data. As a result, this component was classified as a violin signal.
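The frame-level decision illustrated by Figures 10-13 can be written out directly. The sketch below is our reading of that procedure with illustrative names; it assumes the training feature matrix is ordered with all piano frames (indices below 28971) before the violin frames.

    import numpy as np

    def classify_frame(frame, train_feats, piano_max_index=28971, K=30):
        # MSE between this frame and every training frame (Figure 10).
        mse = np.mean((train_feats - frame) ** 2, axis=1)
        nearest = np.argsort(mse)[:K]                 # the K minimal MSEs (Figures 11-12)
        # Count how many of the K nearest frames lie in the piano range (Figure 13).
        piano_votes = np.sum(nearest < piano_max_index)
        return 'piano' if piano_votes > K / 2 else 'violin'

Repeating this over all frames of a separated component and taking the majority vote gives the component label.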
Figure 14 shows a separation example of the proposed system, where (a) and (b) are the piano and violin sources respectively, (c) is the single channel mixture
of these two sources, and (d) and (e) are the separated sources respectively. From this figure, we can observe that, although most notes are correctly separated and classified into the corresponding sources, there exist notes that were wrongly classified. The separated note with the highest SNR is the first note of the violin signal, for which the SNR equals 9.7 dB, while the highest SNR of a note within the piano signal is 6.4 dB. The average SNRs for piano and violin are 3.7 dB and 1.3 dB respectively. According to our observations, the separation quality varies from note to note. On average, the separation quality of the piano signal is better than that of the violin signal.

Discussions

At the moment, for the components separated by the NMF algorithm, we calculate their MFCC features in the same way as for the signals in the training data. As a result, the evaluation of the MSEs becomes straightforward, which consequently facilitates the K-NN classification. It is however possible to use the dictionary returned by the NMF algorithm (and possibly the activation coefficients as well) as a set of features. In such a case, the NMF algorithm needs to be applied to the training data in the same way as to the separated components obtained in the testing and classification process. Similar to principal component analysis (PCA), which has been widely used to generate features in many classification systems, using NMF components directly as features has great potential. Compared to using the MFCC features, the computational cost associated with the NMF features could be higher due to the iterations required for the NMF algorithm to converge. However, its applicability as a feature for classification deserves further investigation in the future.
Another important issue in applying NMF algorithms is the selection of the mode of the NMF model (i.e. the rank R). In our study, this determines the number of components that will be learned from the signal. In general, for a higher rank R, the NMF algorithm learns components that are more likely to correspond to individual notes. However, there is a trade-off between the decomposition rank and the computational load, as a larger R incurs a higher computational cost. Also, it is known that NMF produces not only harmonic dictionary components but sometimes also ad-hoc spectral shapes corresponding to drums, transients, residual noise, etc. In our recognition system, these components were treated in the same way as the harmonic components. In other words, the feature vectors of these components were calculated and evaluated in the same way as those of the harmonic components. The final decision was made from the labelling scores and the K-NN classification results.
We note that many classification algorithms could also be applied for labelling the separated components, such as Gaussian Mixture Models (GMMs), which have been used in both automatic speech/speaker recognition and music information retrieval. In this work, we chose the K-NN algorithm due to its simplicity. Moreover, the performance of the single channel source separation system developed here is largely dependent on the separated components provided by the
NMF algorithm. Although the music components obtained by the NMF algorithm are somewhat sparse, their sparsity is not explicitly controlled. Also, we did not use information from the music signals explicitly, such as the pitch information and harmonic structure. According to Li et al. [14], the information of pitch and common amplitude modulation can be used to improve the separation quality. Com

Conclusions

We have presented a new system for the single channel music sound separation problem. The system essentially integrates two techniques: automatic note decomposition using NMF, and note classification based on the K-NN algorithm. A main assumption of the proposed system is that we have prior knowledge about the types of instruments used to produce the music sounds. The simulation results show that the system produces a reasonable performance for this challenging source separation problem. Future work includes using a more robust classification algorithm to improve the note classification accuracy, and incorporating pitch and common amplitude modulation information into the learning algorithm to improve the separation performance of the proposed system.

References
1. Abdallah, S.A., Plumbley, M.D.: Polyphonic Transcription by Non-Negative Sparse Coding of Power Spectra. In: International Conference on Music Information Retrieval, Barcelona, Spain (October 2004)
2. Barry, D., Lawlor, B., Coyle, E.: Real-time Sound Source Separation: Azimuth Discrimination and Re-synthesis, AES (2004)
3. Brown, G.J., Cooke, M.P.: Perceptual Grouping of Musical Sounds: A Computational Model. J. New Music Res. 23, 107-132 (1994)
4. Casey, M.A., Westner, W.: Separation of Mixed Audio Sources by Independent Subspace Analysis. In: Proc. Int. Comput. Music Conf. (2000)
5. Devijver, P.A., Kittler, J.: Pattern Recognition - A Statistical Approach. Prentice Hall International, Englewood Cliffs (1982)
6. Every, M.R., Szymanski, J.E.: Separation of Synchronous Pitched Notes by Spectral Filtering of Harmonics. IEEE Trans. Audio Speech Lang. Process. 14, 1845-1856 (2006)
7. Fevotte, C., Bertin, N., Durrieu, J.-L.: Nonnegative Matrix Factorization With the Itakura-Saito Divergence. With Application to Music Analysis. Neural Computation 21, 793-830 (2009)
8. FitzGerald, D., Cranitch, M., Coyle, E.: Extended Nonnegative Tensor Factorisation Models for Musical Sound Source Separation, Article ID 872425, 15 pages (2008)
9. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press Inc., London (1990)
10. Gutierrez-Osuna, R.: Lecture 12: K Nearest Neighbor Classifier, http://research.cs.tamu.edu/prism/lectures (accessed January 17, 2010)
11. Hoyer, P.: Non-Negative Sparse Coding. In: IEEE Workshop on Networks for Signal Processing XII, Martigny, Switzerland (2002)
12. Lee, D.D., Seung, H.S.: Learning the Parts of Objects by Non-Negative Matrix Factorization. Nature 401, 788-791 (1999)
13. Lee, D.D., Seung, H.S.: Algorithms for Non-negative Matrix Factorization. In: Neural Information Processing Systems, Denver (2001)
14. Li, Y., Woodruff, J., Wang, D.L.: Monaural Musical Sound Separation Based on Pitch and Common Amplitude Modulation. IEEE Transactions on Audio, Speech, and Language Processing 17, 1361-1371 (2009)
15. Mellinger, D.K.: Event Formation and Separation in Musical Sound. PhD dissertation, Dept. of Comput. Sci., Stanford Univ., Stanford, CA (1991)
16. Opolko, F., Wapnick, J.: McGill University master samples, McGill Univ., Montreal, QC, Canada, Tech. Rep. (1987)
17. Pedersen, M.S., Wang, D.L., Larsen, J., Kjems, U.: Two-Microphone Separation of Speech Mixtures. IEEE Trans. on Neural Networks 19, 475-492 (2008)
18. Rickard, S., Balan, R., Rosca, J.: Real-time Time-Frequency based Blind Source Separation. In: 3rd International Conference on Independent Component Analysis and Blind Source Separation, San Diego, CA (December 2001)
19. Smaragdis, P., Brown, J.C.: Non-negative Matrix Factorization for Polyphonic Music Transcription. In: Proc. IEEE Int. Workshop Application on Signal Process. Audio Acoust., pp. 177-180 (2003)
20. Smaragdis, P.: Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs. In: Puntonet, C.G., Prieto, A.G. (eds.) ICA 2004. LNCS, vol. 3195, pp. 494-499. Springer, Heidelberg (2004)
21. The University of Iowa Musical Instrument Samples Database, http://theremin.music.uiowa.edu
22. Virtanen, T.: Sound Source Separation Using Sparse Coding with Temporal Continuity Objective. In: International Computer Music Conference, Singapore (2003)
23. Virtanen, T.: Separation of Sound Sources by Convolutive Sparse Coding. In: Proceedings of ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing, Jeju, Korea (2004)
24. Virtanen, T.: Sound Source Separation in Monaural Music Signals. PhD dissertation, Tampere Univ. of Technol., Tampere, Finland (2006)
25. Virtanen, T.: Monaural Sound Source Separation by Non-Negative Matrix Factorization with Temporal Continuity and Sparseness Criteria. IEEE Transactions on Audio, Speech, and Language Processing 15, 1066-1073 (2007)
26. Wang, D.L., Brown, G.J.: Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Wiley/IEEE Press (2006)
27. Wang, B., Plumbley, M.D.: Investigating Single-Channel Audio Source Separation Methods based on Non-negative Matrix Factorization. In: Nandi, Zhu (eds.) Proceedings of the ICA Research Network International Workshop, pp. 17-20 (2006)
28. Wang, B., Plumbley, M.D.: Single Channel Audio Separation by Non-negative Matrix Factorization. In: Digital Music Research Network One-day Workshop (DMRN+1), London (2006)
29. Wang, W., Luo, Y., Chambers, J.A., Sanei, S.: Note Onset Detection via Non-negative Factorization of Magnitude Spectrum. EURASIP Journal on Advances in Signal Processing, Article ID 231367, 15 pages (June 2008); doi:10.1155/2008/231367
30. Wang, W., Cichocki, A., Chambers, J.A.: A Multiplicative Algorithm for Convolutive Non-negative Matrix Factorization Based on Squared Euclidean Distance. IEEE Transactions on Signal Processing 57, 2858-2864 (2009)
31. Webb, A.: Statistical Pattern Recognition, 2nd edn. Wiley, New York (2005)
32. Woodruff, J., Pardo, B.: Using Pitch, Amplitude Modulation and Spatial Cues for Separation of Harmonic Instruments from Stereo Music Recordings. EURASIP J. Adv. Signal Process. (2007)

Notes on Nonnegative Tensor Factorization of the Spectrogram for Audio Source Separation: Statistical Insights and Towards Self-Clustering of the Spatial Cues

Cedric Fevotte (1) and Alexey Ozerov (2)

1 CNRS LTCI, Telecom ParisTech - Paris, France
fevotte@telecom-paristech.fr
2 IRISA, INRIA - Rennes, France
ozerov@irisa.fr

Abstract. Nonnegative tensor factorization (NTF) of multichannel spectrograms under PARAFAC structure has recently been proposed by FitzGerald et al as a means of performing blind source separation (BSS) of multichannel audio data. In this paper we investigate the statistical source models implied by this approach. We show that it implicitly assumes a nonpoint-source model contrasting with usual BSS assumptions, and we clarify the links between the measure of fit chosen for the NTF and the implied statistical distribution of the sources. While the original approach of FitzGerald et al requires a posterior clustering of the spatial cues to group the NTF components into sources, we discuss means of performing the clustering within the factorization. In the results section we test the impact of the simplifying nonpoint-source assumption on underdetermined linear instantaneous mixtures of musical sources and discuss the limits of the approach for such mixtures.
Keywords: Nonnegative tensor factorization (NTF), audio source
separation, nonpoint-source models, multiplicative parameter updates.

1 Introduction

Nonnegative matrix factorization (NMF) is an unsupervised data decomposition technique with growing popularity in the fields of machine learning and signal/image processing [8]. Much research on this topic has been driven by applications in audio, where the data matrix is taken as the magnitude or power spectrogram of a sound signal. NMF was for example applied with success to automatic music transcription [15] and audio source separation [19,14]. The factorization amounts to decomposing the spectrogram data into a sum of rank-1 spectrograms, each of which is the expression of an elementary spectral pattern amplitude-modulated in time.

(This work was supported in part by project ANR-09-JCJC-0073-01 TANGERINE (Theory and applications of nonnegative matrix factorization) and by the Quaero Programme, funded by OSEO, French State agency for innovation.)
However, while most music recordings are available in multichannel format (typically, stereo), NMF in its standard setting is only suited to single-channel data. Extensions to multichannel data have been considered, either by stacking up the spectrograms of each channel into a single matrix [11] or by equivalently considering nonnegative tensor factorization (NTF) under a parallel factor analysis (PARAFAC) structure, where the channel spectrograms form the slices of a 3-valence tensor [5,6]. Let X_i be the short-time Fourier transform (STFT) of channel i, a complex-valued matrix of dimensions F x N, where i = 1, . . . , I and I is the number of channels (I = 2 in the stereo case). The latter approaches boil down to assuming that the magnitude spectrograms |X_i| are approximated by a linear combination of nonnegative rank-1 elementary spectrograms |C_k| = w_k h_k^T, such that

    |X_i| ≈ Σ_{k=1}^{K} q_ik |C_k|                                  (1)

and |C_k| is the matrix containing the modulus of the coefficients of some latent components whose precise meaning we will attempt to clarify in this paper. Equivalently, Eq. (1) writes

    |x_ifn| ≈ Σ_{k=1}^{K} q_ik w_fk h_nk                            (2)

where {x_ifn} are the coefficients of X_i. Introducing the nonnegative matrices Q = {q_ik}, W = {w_fk}, H = {h_nk}, whose columns are respectively denoted q_k, w_k and h_k, the following optimization problem needs to be solved

    min_{Q,W,H ≥ 0} Σ_{ifn} d(|x_ifn| | v̂_ifn)                      (3)

with

    v̂_ifn := Σ_{k=1}^{K} q_ik w_fk h_nk                             (4)

where the constraint A ≥ 0 means that the coefficients of matrix A are nonnegative, and d(x|y) is a scalar cost function, taken as the generalized Kullback-Leibler (KL) divergence in [5] or as the Euclidean distance in [11]. Complex-valued STFT estimates Ĉ_k are subsequently constructed using the phase of the observations (typically, ĉ_kfn is given the phase of x_ifn, where i = argmax_i {q_ik} [6]) and then inverted to produce time-domain components. The components pertaining to the same sources (e.g. instruments) can then be grouped either manually or via clustering of the estimated spatial cues {q_k}_k.
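As a quick illustration of the model in Eq. (4), the approximation tensor can be evaluated in one numpy call; the shapes below (stereo, and the F, N, K values used later in the results section) are purely illustrative.

    import numpy as np

    I, F, N, K = 2, 513, 314, 9
    Q = np.random.rand(I, K)     # spatial cues q_ik
    W = np.random.rand(F, K)     # spectral shapes w_fk
    H = np.random.rand(N, K)     # activations h_nk

    # Eq. (4): v_hat[i, f, n] = sum_k Q[i, k] * W[f, k] * H[n, k]
    V_hat = np.einsum('ik,fk,nk->ifn', Q, W, H)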
In this paper we build on these previous works and bring the following contributions:

- We recast the approach of [5] into a statistical framework, based on a generative statistical model of the multichannel observations X. In particular we discuss NTF of the power spectrogram |X|^2 with the Itakura-Saito (IS) divergence and NTF of the magnitude spectrogram |X| with the KL divergence.
- We describe an NTF with a novel structure that allows the clustering of the components to be handled within the decomposition, as opposed to after it.

The paper is organized as follows. Section 2 describes the generative and statistical source models implied by NTF. Section 3 describes new and existing multiplicative algorithms for standard NTF and for Cluster NTF. Section 4 reports experimental source separation results on musical data; we test in particular the impact of the simplifying nonpoint-source assumption on underdetermined linear instantaneous mixtures of musical sources and point out the limits of the approach for such mixtures. We conclude in Section 5. This article builds on related publications [10,3].

2 Statistical Models to NTF

2.1 Models of Multichannel Audio

Assume a multichannel audio recording with I channels x(t) = [x_1(t), . . . , x_I(t)]^T, also referred to as observations or data, generated as a linear mixture of sound source signals. The term source refers to the production system, for example a musical instrument, and the term source signal refers to the signal produced by that source. When the intended meaning is clear from the context we will simply refer to the source signals as the sources.
Under the linear mixing assumption, the multichannel data can be expressed as

    x(t) = Σ_{j=1}^{J} s_j(t)                                       (5)

where J is the number of sources and s_j(t) = [s_1j(t), . . . , s_ij(t), . . . , s_Ij(t)]^T is the multichannel contribution of source j to the data. Under the common assumptions of point-sources and linear instantaneous mixing, we have

    s_ij(t) = s_j(t) a_ij                                           (6)

where the coefficients {a_ij} define an I x J mixing matrix A, with columns denoted [a_1, . . . , a_J]. In the following we will show that the NTF techniques described in this paper correspond to maximum likelihood (ML) estimation of source and mixing parameters in a model where the point-source assumption is dropped and replaced by

    s_ij(t) = s_j^(i)(t) a_ij                                       (7)

where the signals s_j^(i)(t), i = 1, . . . , I, are assumed to share a certain resemblance, as modelled by being two different realizations of the same random
process characterizing their time-frequency behavior, as opposed to being the same realization. Dropping the point-source assumption may also be viewed as ignoring some mutual information between the channels (the sources are assumed to contribute to each channel with equal statistics instead of contributing the same signal). Of course, when the data has been generated from point-sources, dropping this assumption will usually lead to a suboptimal but typically faster separation algorithm, and the results section will illustrate this point.
In this work we further model the source contributions as sums of elementary components themselves, so that

    s_j^(i)(t) = Σ_{k ∈ K_j} c_k^(i)(t)                             (8)

where [K_1, . . . , K_J] denotes a nontrivial partition of [1, . . . , K]. As will become clearer in the following, the components c_k^(i)(t) will be characterized by a spectral shape w_k and a vector of activation coefficients h_k, through a statistical model. Finally, we obtain

    x_i(t) = Σ_{k=1}^{K} m_ik c_k^(i)(t)                            (9)

where m_ik is defined as m_ik = a_ij if and only if k ∈ K_j. By linearity of the STFT, model (9) writes equivalently

    x_ifn = Σ_{k=1}^{K} m_ik c_kfn^(i)                              (10)

where x_ifn and c_kfn^(i) are the complex-valued STFTs of x_i(t) and c_k^(i)(t), and where f = 1, . . . , F is a frequency bin index and n = 1, . . . , N is a time frame index.
2.2 A Statistical Interpretation of KL-NTF
Denote by V the I x F x N tensor with coefficients v_ifn = |x_ifn| and by Q the I x K matrix with elements |m_ik|. Let us assume so far, for ease of presentation, that J = K, i.e., m_ik = a_ik, so that M is a matrix with no particular structure. Then it can easily be shown that the approach of [5], briefly described in Section 1 and consisting in solving

    min_{Q,W,H ≥ 0} Σ_{ifn} d_KL(v_ifn | v̂_ifn)                     (11)

with v̂_ifn defined by Eq. (4), is equivalent to ML estimation of Q, W and H in the following generative model:

    |x_ifn| = Σ_k |m_ik| |c_kfn^(i)|                                (12)

    |c_kfn^(i)| ~ P(w_fk h_nk)                                      (13)
where P(λ) denotes the Poisson distribution, defined in Appendix A, and the KL divergence d_KL(.|.) is defined as

    d_KL(x|y) = x log(x/y) + y - x.                                 (14)

The link between KL-NMF/KL-NTF and inference in composite models with Poisson components has been established in many previous publications, see, e.g., [2,12]. In our opinion, model (12)-(13) suffers from two drawbacks. First, the linearity of the mixing model is assumed on the magnitudes of the STFT frames - see Eq. (12) - instead of the frames themselves - see Eq. (10) - which inherently assumes that the components {c_kfn^(i)}_k have the same phase and that the mixing parameters {m_ik}_k have the same sign, or that only one component is active in every time-frequency tile (t, f). Second, the Poisson distribution is formally only defined on integers, which impairs a rigorous statistical interpretation of KL-NTF on non-countable data such as audio spectra.
Given estimates Q, W and H of the loading matrices, Minimum Mean Square Error (MMSE) estimates of the component amplitudes are given by

    |ĉ_kfn^(i)| := E{ |c_kfn^(i)| | Q, W, H, |X| }                  (15)

                = (q_ik w_fk h_nk / Σ_l q_il w_fl h_nl) |x_ifn|     (16)

Then, time-domain components ĉ_k^(i)(t) are reconstructed through inverse-STFT of ĉ_kfn^(i) = |ĉ_kfn^(i)| arg(x_ifn), where arg(x) denotes the phase of complex-valued x.
2.3 A Statistical Interpretation of IS-NTF
To remedy the drawbacks of the KL-NTF model for audio we describe a new model based on IS-NTF of the power spectrogram, along the lines of [4] and also introduced in [10]. The model reads

    x_ifn = Σ_k m_ik c_kfn^(i)                                      (17)

    c_kfn^(i) ~ N_c(0 | w_fk h_nk)                                  (18)

where N_c(μ, σ^2) denotes the proper complex Gaussian distribution, defined in Appendix A. Denoting now V = |X|^2 and Q = |M|^2, it can be shown that ML estimation of Q, W and H in model (17)-(18) amounts to solving

    min_{Q,W,H ≥ 0} Σ_{ifn} d_IS(v_ifn | v̂_ifn)                     (19)

where d_IS(.|.) denotes the IS divergence defined as

    d_IS(x|y) = x/y - log(x/y) - 1.                                 (20)

Note that our notations are abusive in the sense that the mixing parameters |m_ik| and the components |c_kfn^(i)| appearing through their modulus in Eq. (12) are in no way the modulus of the mixing parameters and the components appearing in Eq. (17). Similarly, the matrices W and H represent different types of quantities in each case; in Eq. (13) their product is homogeneous to component magnitudes, while in Eq. (18) their product is homogeneous to component variances. Formally we should have introduced variables |c_kfn^KL|, W^KL, H^KL to be distinguished from variables c_kfn^IS, W^IS, H^IS, but we have not, in order to avoid cluttering the notations. The difference between these quantities should be clear from the context.
Model (17)-(18) is a truly generative model in the sense that the linear mixing assumption is made on the STFT frames themselves, which is a realistic assumption in audio. Eq. (18) defines a Gaussian variance model of c_kfn^(i); the zero mean assumption reflects the property that the audio frames taken as the input of the STFT can be considered centered, for typical window sizes of about 20 ms or more. The proper Gaussian assumption means that the phase of c_kfn^(i) is assumed to be a uniform random variable [9], i.e., the phase is taken into the model, but in a noninformative way. This contrasts with model (12)-(13), which simply discards the phase information.
Given estimates Q, W and H of the loading matrices, Minimum Mean Square Error (MMSE) estimates of the components are given by

    ĉ_kfn^(i) := E{ c_kfn^(i) | Q, W, H, X }                        (21)

              = (q_ik w_fk h_nk / Σ_l q_il w_fl h_nl) x_ifn         (22)

We would like to underline that the MMSE estimator of components in the STFT domain (21) is equivalent (thanks to the linearity of the STFT and its inverse) to the MMSE estimator of components in the time domain, while the MMSE estimator of STFT magnitudes (15) for KL-NTF is not consistent with time-domain MMSE. Equivalence of an estimator with time-domain signal squared error minimization is an attractive property, at least because it is consistent with a popular objective source separation measure such as the signal to distortion ratio (SDR) defined in [16].
The differences between the two models, termed KL-NTF.mag and IS-NTF.pow, are summarized in Table 1.

Table 1. Statistical models and optimization problems underlaid to KL-NTF.mag and IS-NTF.pow

                        KL-NTF.mag                                      IS-NTF.pow
  Model
    Mixing model        |x_ifn| = Σ_k |m_ik| |c_kfn^(i)|                x_ifn = Σ_k m_ik c_kfn^(i)
    Comp. distribution  |c_kfn^(i)| ~ P(w_fk h_nk)                      c_kfn^(i) ~ N_c(0 | w_fk h_nk)
  ML estimation
    Data                V = |X|                                         V = |X|^2
    Parameters          W, H, Q = |M|                                   W, H, Q = |M|^2
    Approximate         v̂_ifn = Σ_k q_ik w_fk h_nk (both models)
    Optimization        min_{Q,W,H≥0} Σ_ifn d_KL(v_ifn | v̂_ifn)         min_{Q,W,H≥0} Σ_ifn d_IS(v_ifn | v̂_ifn)
  Reconstruction        |ĉ_kfn^(i)| = (q_ik w_fk h_nk / Σ_l q_il w_fl h_nl) |x_ifn|
                        ĉ_kfn^(i) = (q_ik w_fk h_nk / Σ_l q_il w_fl h_nl) x_ifn
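The two reconstruction rules in Table 1 are simple masking operations. A minimal numpy sketch, with illustrative names and shapes (Q: I x K, W: F x K, H: N x K, X: I x F x N complex STFTs); the paper itself works in Matlab, so this is only a transliteration of Eqs. (16) and (22), not the authors' code:

    import numpy as np

    def component_estimate(Q, W, H, X, i, k, use_complex=True):
        num = Q[i, k] * np.outer(W[:, k], H[:, k])        # q_ik w_fk h_nk
        den = np.einsum('k,fk,nk->fn', Q[i], W, H)        # sum_l q_il w_fl h_nl
        gain = num / np.maximum(den, 1e-12)
        # Eq. (22): mask the complex STFT (IS-NTF.pow); Eq. (16): mask its magnitude.
        return gain * X[i] if use_complex else gain * np.abs(X[i])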

3 Algorithms for NTF

3.1 Standard NTF

We are now left with an optimization problem of the form

    min_{Q,W,H ≥ 0} D(V|V̂) := Σ_{ifn} d(v_ifn | v̂_ifn)              (23)

where v̂_ifn = Σ_k q_ik w_fk h_nk, and d(x|y) is the cost function, either the KL or IS divergence in our case. Furthermore we impose ||q_k||_1 = 1 and ||w_k||_1 = 1, so as to remove obvious scale indeterminacies between the three loading matrices Q, W and H. With these conventions, the columns of Q convey normalized mixing proportions (spatial cues) between the channels, the columns of W convey normalized frequency shapes, and all time-dependent amplitude information is relegated into H.
As is common practice in NMF and NTF, we employ multiplicative algorithms for the minimization of D(V|V̂). These algorithms essentially consist of updating each scalar parameter θ by multiplying its value at the previous iteration by the ratio of the negative and positive parts of the derivative of the criterion w.r.t. this parameter, namely

    θ ← θ . [∇_θ D(V|V̂)]_- / [∇_θ D(V|V̂)]_+                         (24)

where ∇_θ D(V|V̂) = [∇_θ D(V|V̂)]_+ - [∇_θ D(V|V̂)]_- and the summands are both nonnegative [4]. This scheme automatically ensures the nonnegativity of the parameter updates, provided initialization with a nonnegative value. The derivative of the criterion w.r.t. a scalar parameter θ writes

    ∇_θ D(V|V̂) = Σ_{ifn} ∇_θ v̂_ifn . d'(v_ifn | v̂_ifn)              (25)

where d'(x|y) = ∇_y d(x|y). As such, we get

    ∇_{q_ik} D(V|V̂) = Σ_{fn} w_fk h_nk d'(v_ifn | v̂_ifn)            (26)

    ∇_{w_fk} D(V|V̂) = Σ_{in} q_ik h_nk d'(v_ifn | v̂_ifn)            (27)

    ∇_{h_nk} D(V|V̂) = Σ_{if} q_ik w_fk d'(v_ifn | v̂_ifn)            (28)


We note in the following G the I x F x N tensor with entries g_ifn = d'(v_ifn | v̂_ifn). For the KL and IS cost functions we have

    d'_KL(x|y) = 1 - x/y                                            (29)

    d'_IS(x|y) = 1/y - x/y^2                                        (30)

Let A and B be F x K and N x K matrices. We denote by A ∘ B the F x N x K tensor with elements a_fk b_nk, i.e., each frontal slice k contains the outer product a_k b_k^T (this is similar to the Khatri-Rao product of A and B, which returns a matrix of dimensions FN x K with column k equal to the Kronecker product of a_k and b_k). Now we note <S, T>_{K_S, K_T} the contracted product between tensors S and T, defined in Appendix B, where K_S and K_T are the sets of mode indices over which the summation takes place. With these definitions we get

    ∇_Q D(V|V̂) = <G, W ∘ H>_{{2,3},{1,2}}                           (31)

    ∇_W D(V|V̂) = <G, Q ∘ H>_{{1,3},{1,2}}                           (32)

    ∇_H D(V|V̂) = <G, Q ∘ W>_{{1,2},{1,2}}                           (33)

and multiplicative updates are obtained as


Q Q.

< G , W H >{2,3},{1,2}
< G+ , W H >{2,3},{1,2}

(34)

< G , Q H >{1,3},{1,2}
< G+ , Q H >{1,3},{1,2}

(35)

< G , Q W >{1,2},{1,2}
< G+ , Q W >{1,2},{1,2}

(36)

W W.
H H.

The resulting algorithm can easily be shown to nonincrease the cost function at
each iteration by generalizing existing proofs for KL-NMF [13] and for IS-NMF
[1]. In our implementation normalization of the variables is carried out at the
end of every iteration by dividing every column of Q by their 1 norm and scaling
the columns of W accordingly, then dividing the columns of W by their 1 norm
and scaling the columns of H accordingly.
3.2

Cluster NTF

For ease of presentation of the statistical composite models inherent to NTF, we have assumed from Section 2.2 onwards that K = J, i.e., that one source s_j(t) is one elementary component c_k(t) with its own mixing parameters {a_ik}_i. We now turn back to our more general model (9), where each source s_j(t) is a sum of elementary components {c_k(t)}_{k ∈ K_j} sharing the same mixing parameters {a_ik}_i, i.e., m_ik = a_ij for all i and k ∈ K_j. As such, we can express M as

    M = A L                                                         (37)
where A is the I x J mixing matrix and L is a J x K labelling matrix with only one nonzero value per column, i.e., such that

    l_jk = 1   iff k ∈ K_j                                          (38)

    l_jk = 0   otherwise.                                           (39)

This specific structure of M transfers equivalently to Q, so that

    Q = D L                                                         (40)

where

    D = |A|     in KL-NTF.mag                                       (41)

    D = |A|^2   in IS-NTF.pow                                       (42)

The structure of Q defines a new NTF, which we refer to as Cluster NTF, denoted cNTF. The minimization problem (23) is unchanged except for the fact that the minimization over Q is replaced by a minimization over D. As such, the derivatives w.r.t. w_fk, h_nk do not change, and the derivatives over d_ij write

    ∇_{d_ij} D(V|V̂) = Σ_{fn} (Σ_k l_jk w_fk h_nk) d'(v_ifn | v̂_ifn)        (43)

                    = Σ_k l_jk Σ_{fn} w_fk h_nk d'(v_ifn | v̂_ifn)          (44)

i.e.,

    ∇_D D(V|V̂) = <G, W ∘ H>_{{2,3},{1,2}} L^T                              (45)

so that multiplicative updates for D can be obtained as

    D ← D . (<G^-, W ∘ H>_{{2,3},{1,2}} L^T) / (<G^+, W ∘ H>_{{2,3},{1,2}} L^T)     (46)

As before, we normalize the columns of D by their l1 norm at the end of every iteration, and scale the columns of W accordingly.
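Compared with Eq. (34), the only change is the right-multiplication by L^T. A sketch of one D update in the KL case, under the same illustrative conventions as the previous code block:

    import numpy as np

    def update_D(D, L, W, H, V, eps=1e-12):
        # D: I x J, L: J x K labelling matrix (Eq. (40)), V: I x F x N data tensor.
        Q = D @ L                                               # current spatial cues
        Vh = np.einsum('ik,fk,nk->ifn', Q, W, H) + eps
        num = np.einsum('ifn,fk,nk->ik', V / Vh, W, H) @ L.T    # <G-, W∘H> L^T
        den = (W.sum(0) * H.sum(0)) @ L.T + eps                 # <G+, W∘H> L^T
        return D * (num / den)  # columns of D are then l1-normalized, W rescaled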
In our Matlab implementation, the resulting multiplicative algorithm for IS-cNTF.pow is 4 times faster than the one presented in [10] (for linear instantaneous mixtures), which was based on sequential updates of the matrices [q_k]_{k∈K_j}, [w_k]_{k∈K_j}, [h_k]_{k∈K_j}. The Matlab code of this new algorithm as well as the other algorithms described in this paper can be found online at http://perso.telecom-paristech.fr/~fevotte/Samples/cmmr10/.

4 Results

We consider source separation of simple audio mixtures taken from the Signal Separation Evaluation Campaign (SiSEC 2008) website. More specifically, we used some development data from the underdetermined speech and music mixtures task [18]. We considered the following datasets:

- wdrums, a linear instantaneous stereo mixture (with positive mixing coefficients) of 2 drum sources and 1 bass line,
- nodrums, a linear instantaneous stereo mixture (with positive mixing coefficients) of 1 rhythmic acoustic guitar, 1 electric lead guitar and 1 bass line.

The signals are of length 10 sec and sampled at 16 kHz. We applied a STFT with a sine bell window of length 64 ms (1024 samples), leading to F = 513 and N = 314. We applied the following algorithms to the two datasets:
- KL-NTF.mag with K = 9,
- IS-NTF.pow with K = 9,
- KL-cNTF.mag with J = 3 and 3 components per source, leading to K = 9,
- IS-cNTF.pow with J = 3 and 3 components per source, leading to K = 9.

Fig. 1. Mixing parameters estimation and ground truth. Top: wdrums dataset. Bottom: nodrums dataset. Left: results of KL-NTF.mag and KL-cNTF.mag; ground truth mixing vectors {|a_j|}_j (red), mixing vectors {d_j}_j estimated with KL-cNTF.mag (blue), spatial cues {q_k}_k given by KL-NTF.mag (dashed, black). Right: results of IS-NTF.pow and IS-cNTF.pow; ground truth mixing vectors {|a_j|^2}_j (red), mixing vectors {d_j}_j estimated with IS-cNTF.pow (blue), spatial cues {q_k}_k given by IS-NTF.pow (dashed, black).
Each of the four algorithms was run 10 times from 10 random initializations for 1000 iterations. For every algorithm we then selected the solutions Q, W and H yielding the smallest cost value. Time-domain components were reconstructed as discussed in Section 2.2 for KL-NTF.mag and KL-cNTF.mag and as in Section 2.3 for IS-NTF.pow and IS-cNTF.pow. Given these reconstructed components, source estimates were formed as follows:

- For KL-cNTF.mag and IS-cNTF.pow, sources are immediately computed using Eq. (8), because the partition K_1, . . . , K_J is known.
- For KL-NTF.mag and IS-NTF.pow, we used the approach of [5,6] consisting of applying the K-means algorithm to Q (with J clusters) so as to assign every component k to a source j; each of the J sources is then reconstructed as the sum of its assigned components.

Note that we are here not reconstructing the original single-channel sources s_j(t) but their multichannel contribution [s_j^(1)(t), . . . , s_j^(I)(t)] to the multichannel data (i.e., their spatial image). The quality of the source image estimates was assessed using the standard Signal to Distortion Ratio (SDR), source Image to Spatial distortion Ratio (ISR), Source to Interference Ratio (SIR) and Source to Artifacts Ratio (SAR) defined in [17]. The numerical results are reported in Table 2. The source estimates may also be listened to online at http://perso.telecom-paristech.fr/~fevotte/Samples/cmmr10/. Figure 1 displays the estimated spatial cues together with the ground truth mixing matrix, for every method and dataset.
Discussion. On dataset wdrums the best results are obtained with IS-cNTF.pow. The top right plot of Figure 1 shows that the spatial cues returned by D reasonably fit the original mixing matrix |A|^2. The slightly better results of IS-cNTF.pow compared to IS-NTF.pow illustrate the benefit of performing the clustering of the spatial cues within the decomposition as opposed to after it. On this dataset KL-cNTF.mag fails to adequately estimate the mixing matrix. The top left plot of Figure 1 shows that the spatial cues corresponding to the bass and hi-hat are correctly captured, but it appears that two columns of D are spent on representing the same direction (bass, s3), suggesting that more components are needed to represent the bass, while failing to capture the drums, which are poorly estimated. KL-NTF.mag performs better (and as such, one spatial cue q_k is correctly fitted to the drums direction) but overall not as well as IS-NTF.pow and IS-cNTF.pow.
On dataset nodrums the best results are obtained with KL-NTF.mag. None of the other methods adequately fits the ground truth spatial cues. KL-cNTF.mag suffers from the same problem as on dataset wdrums: two columns of D are spent on the bass. In contrast, none of the spatial cues estimated by IS-NTF.pow and IS-cNTF.pow accurately captures the bass direction, and s1 and s2 both

Table 2. SDR, ISR, SIR and SAR of source estimates for the two considered datasets. Higher values indicate better results. Values in bold font indicate the results with best average SDR.

wdrums               s1 (Hi-hat)   s2 (Drums)   s3 (Bass)
KL-NTF.mag    SDR    -0.2          0.4          17.9
              ISR    15.5          0.7          31.5
              SIR    1.4           -0.9         18.9
              SAR    7.4           -3.5         25.7
KL-cNTF.mag   SDR    -0.02         -14.2        1.9
              ISR    15.3          2.8          2.1
              SIR    1.5           -15.0        18.9
              SAR    7.8           13.2         9.2
IS-NTF.pow    SDR    12.7          1.2          17.4
              ISR    17.3          1.7          36.6
              SIR    21.1          14.3         18.0
              SAR    15.2          2.7          27.3
IS-cNTF.pow   SDR    13.1          1.8          18.0
              ISR    17.0          2.5          35.4
              SIR    22.0          13.7         18.7
              SAR    15.9          3.4          26.5

nodrums              s1 (Bass)     s2 (Lead G.)  s3 (Rhythmic G.)
KL-NTF.mag    SDR    13.2          -1.8          1.0
              ISR    22.7          1.0           1.2
              SIR    13.9          -9.3          6.1
              SAR    24.2          7.4           2.6
KL-cNTF.mag   SDR    5.8           -9.9          3.1
              ISR    8.0           0.7           6.3
              SIR    13.5          -15.3         2.9
              SAR    8.3           2.7           9.9
IS-NTF.pow    SDR    5.0           -10.0         -0.2
              ISR    7.2           1.9           4.2
              SIR    12.3          -13.5         0.3
              SAR    7.2           3.3           -0.1
IS-cNTF.pow   SDR    3.9           -10.2         -1.9
              ISR    6.2           3.3           4.6
              SIR    10.6          -10.9         -3.7
              SAR    3.7           1.0           1.5

contain much bass and lead guitar.² Results from all four methods on this dataset are overall much worse than with dataset wdrums, corroborating the established idea that percussive signals are favorably modeled by NMF models [7]. Increasing the total number of components K did not seem to solve the observed deficiencies of the 4 approaches on this dataset.

5 Conclusions

In this paper we have attempted to clarify the statistical models latent to audio source separation using PARAFAC-NTF of the magnitude or power spectrogram. In particular we have emphasized that PARAFAC-NTF does not optimally exploit interchannel redundancy in the presence of point-sources. This may still be sufficient to estimate spatial cues correctly in linear instantaneous mixtures, in particular when the NMF model suits the sources well, as seen from the results on dataset wdrums, but it may also lead to incorrect results in other cases, as seen from the results on dataset nodrums. In contrast, methods fully exploiting interchannel dependencies, such as the EM algorithm based on model (17)-(18) with c_kfn^(i) = c_kfn in [10], can successfully estimate the mixing matrix in both datasets. The latter method is however about 10 times more computationally demanding than IS-cNTF.pow.

² The numerical evaluation criteria were computed using the bss_eval.m function available from the SiSEC website. The function automatically pairs source estimates with ground truth signals according to best mean SIR. This resulted here in pairing the left, middle and right blue directions with respectively the left, middle and right red directions, i.e., preserving the panning order.
In this paper we have considered a variant of PARAFAC-NTF in which the loading matrix Q is given a structure such that Q = DL. We have assumed that L is a known labelling matrix that reflects the partition K_1, . . . , K_J. An important perspective of this work is to leave the labelling matrix free and automatically estimate it from the data, either under the constraint that every column l_k of L may contain only one nonzero entry, akin to hard clustering, i.e., ||l_k||_0 = 1, or more generally under the constraint that ||l_k||_0 is small, akin to soft clustering. This should be made feasible using NTF under sparse l1-constraints and is left for future work.

References
1. Cao, Y., Eggermont, P.P.B., Terebey, S.: Cross Burg entropy maximization and its application to ringing suppression in image reconstruction. IEEE Transactions on Image Processing 8(2), 286-292 (1999)
2. Cemgil, A.T.: Bayesian inference for nonnegative matrix factorisation models. Computational Intelligence and Neuroscience (Article ID 785152), 17 pages (2009); doi:10.1155/2009/785152
3. Fevotte, C.: Itakura-Saito nonnegative factorizations of the power spectrogram for music signal decomposition. In: Wang, W. (ed.) Machine Audition: Principles, Algorithms and Systems, ch. 11. IGI Global Press (August 2010), http://perso.telecom-paristech.fr/~fevotte/Chapters/isnmf.pdf
4. Fevotte, C., Bertin, N., Durrieu, J.L.: Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis. Neural Computation 21(3), 793-830 (2009), http://www.tsi.enst.fr/~fevotte/Journals/neco09_is-nmf.pdf
5. FitzGerald, D., Cranitch, M., Coyle, E.: Non-negative tensor factorisation for sound source separation. In: Proc. of the Irish Signals and Systems Conference, Dublin, Ireland (September 2005)
6. FitzGerald, D., Cranitch, M., Coyle, E.: Extended nonnegative tensor factorisation models for musical sound source separation. Computational Intelligence and Neuroscience (Article ID 872425), 15 pages (2008)
7. Helen, M., Virtanen, T.: Separation of drums from polyphonic music using nonnegative matrix factorization and support vector machine. In: Proc. 13th European Signal Processing Conference (EUSIPCO 2005) (2005)
8. Lee, D.D., Seung, H.S.: Learning the parts of objects with nonnegative matrix factorization. Nature 401, 788-791 (1999)
9. Neeser, F.D., Massey, J.L.: Proper complex random processes with applications to information theory. IEEE Transactions on Information Theory 39(4), 1293-1302 (1993)
10. Ozerov, A., Fevotte, C.: Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Transactions on Audio, Speech and Language Processing 18(3), 550-563 (2010), http://www.tsi.enst.fr/~fevotte/Journals/ieee_asl_multinmf.pdf
11. Parry, R.M., Essa, I.: Estimating the spatial position of spectral components in audio. In: Rosca, J.P., Erdogmus, D., Principe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 666-673. Springer, Heidelberg (2006)
12. Shashua, A., Hazan, T.: Non-negative tensor factorization with applications to statistics and computer vision. In: Proc. 22nd International Conference on Machine Learning, pp. 792-799. ACM, Bonn (2005)
13. Shepp, L.A., Vardi, Y.: Maximum likelihood reconstruction for emission tomography. IEEE Transactions on Medical Imaging 1(2), 113-122 (1982)
14. Smaragdis, P.: Convolutive speech bases and their application to speech separation. IEEE Transactions on Audio, Speech, and Language Processing 15(1), 1-12 (2007)
15. Smaragdis, P., Brown, J.C.: Non-negative matrix factorization for polyphonic music transcription. In: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2003) (October 2003)
16. Vincent, E., Gribonval, R., Fevotte, C.: Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech and Language Processing 14(4), 1462-1469 (2006), http://www.tsi.enst.fr/~fevotte/Journals/ieee_asl_bsseval.pdf
17. Vincent, E., Sawada, H., Bofill, P., Makino, S., Rosca, J.P.: First stereo audio source separation evaluation campaign: Data, algorithms and results. In: Davies, M.E., James, C.J., Abdallah, S.A., Plumbley, M.D. (eds.) ICA 2007. LNCS, vol. 4666, pp. 552-559. Springer, Heidelberg (2007)
18. Vincent, E., Araki, S., Bofill, P.: Signal Separation Evaluation Campaign (SiSEC 2008) / Under-determined speech and music mixtures task results (2008), http://www.irisa.fr/metiss/SiSEC08/SiSEC_underdetermined/dev2_eval.html
19. Virtanen, T.: Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech and Language Processing 15(3), 1066-1074 (2007)

A Standard Distributions

Proper complex Gaussian:  N_c(x|μ, Σ) = |π Σ|^(-1) exp(-(x - μ)^H Σ^(-1) (x - μ))

Poisson:  P(x|λ) = exp(-λ) λ^x / x!

B Contracted Tensor Product

Let S be a tensor of size I_1 x . . . x I_M x J_1 x . . . x J_N and T be a tensor of size I_1 x . . . x I_M x K_1 x . . . x K_P. Then, the contracted product <S, T>_{{1,...,M},{1,...,M}} is a tensor of size J_1 x . . . x J_N x K_1 x . . . x K_P, given by

    <S, T>_{{1,...,M},{1,...,M}} = Σ_{i_1=1}^{I_1} . . . Σ_{i_M=1}^{I_M} s_{i_1,...,i_M,j_1,...,j_N} t_{i_1,...,i_M,k_1,...,k_P}        (47)

The contracted tensor product should be thought of as a form of generalized dot product of two tensors along common modes of the same dimensions.
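In numpy terms, the contracted product is exactly what tensordot computes. A small check of the <G, W ∘ H>_{{2,3},{1,2}} contraction used in Eq. (31), with illustrative shapes:

    import numpy as np

    I, F, N, K = 2, 513, 314, 9
    G = np.random.rand(I, F, N)
    W = np.random.rand(F, K)
    H = np.random.rand(N, K)

    WH = W[:, None, :] * H[None, :, :]                   # F x N x K tensor of slices w_k h_k^T
    out = np.tensordot(G, WH, axes=([1, 2], [0, 1]))     # contract the F and N modes -> I x K
    assert np.allclose(out, np.einsum('ifn,fk,nk->ik', G, W, H))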

What Signal Processing Can Do for the Music

Isabel Barbancho, Lorenzo J. Tardón, Ana M. Barbancho, Andres Ortiz, Simone Sammartino, and Cristina de la Bandera

Grupo de Aplicación de las Tecnologías de la Información y Comunicaciones, Departamento de Ingeniería de Comunicaciones, E.T.S. Ingeniería de Telecomunicación, Campus de Teatinos s/n, Universidad of Málaga, SPAIN
ibp@ic.uma.es
http://webpersonal.uma.es/~IBP/index.htm
Abstract. In this paper, several examples of what signal processing can do in the music context will be presented. In this contribution, music content includes not only the audio files but also the scores. Using advanced signal processing techniques, we have developed new tools that will help us handle music information, preserve, develop and disseminate our cultural music assets, and improve our learning and education systems.
Keywords: Music Signal Processing, Music Analysis, Music Transcription, Music Information Retrieval, Optical Music Recognition, Pitch
Detection.

1 Introduction

Signal processing techniques are a powerful set of mathematical tools that allow the information required for a certain purpose to be obtained from a signal. Signal processing techniques can be used for any type of signal: communication signals, medical signals, speech signals, multimedia signals, etc. In this contribution, we focus on the application of signal processing techniques to music information: audio and scores.
Signal processing techniques can be used for music database exploration. In this field, we present a 3D adaptive environment for music content exploration that allows the exploration of musical contents in a novel way. The songs are analyzed and a series of numerical descriptors is computed to characterize their spectral content. Six main musical genres are defined as axes of a multidimensional framework, where the songs are projected. A three-dimensional subdomain is defined by choosing three of the six genres at a time, and the user is allowed to navigate in this space, browsing, exploring and analyzing the elements of this musical universe. Also, within this field of music database exploration, a novel method for music similarity evaluation is presented. The evaluation of music similarity is one of the core components of the field of Music Information Retrieval (MIR). In this study, rhythmic and spectral analyses are combined to extract the tonal profile of musical compositions and evaluate music similarity.
Music signal processing can also be used for the preservation of the cultural heritage. In this sense, we have developed a complete system with an interactive
graphical user interface for Optical Music Recognition (OMR), specially adapted for scores written in white mensural notation. Color photographs of ancient scores taken at the Archivo de la Catedral de Málaga have been used as input to the system. A series of pre-processing steps is aimed at improving their quality and returning binary images to be processed. The music symbols are extracted and classified, so that the system is able to transcribe the ancient music notation into modern notation and make it sound.
Music signal processing can also be focused on developing tools for technology-enhanced learning and revolutionary learning appliances. In this sense, we present different applications we have developed to help in learning different instruments: piano, violin and guitar. The graphical tool for piano learning we have developed is able to detect whether a person is playing the proper piano chord. The graphical tool shows the user the time and frequency response of each frame of piano sound under analysis and a piano keyboard in which the played notes are highlighted, as well as the names of the played notes. The core of the designed tool is a polyphonic transcription system able to detect the played notes, based on the use of spectral patterns of the piano notes. The designed tool is useful both for users with knowledge of music and for users without this knowledge. The violin learning tool is based on a transcription system able to detect the pitch and duration of the violin notes and to identify the different expressiveness techniques: détaché with and without vibrato, pizzicato, tremolo, spiccato, flageolett-töne. The interface is a pedagogical tool to aid in violin learning. For the guitar, we have developed a system able to perform string and fret estimation of guitar notes in real time. The system works in three modes: it is able to estimate the string and fret of a single note played on a guitar, or of strummed chords from a predefined list, and it is also able to make a free estimation if no information about what is being played is given. Also, we have developed a lightweight pitch detector for embedded systems to be used in toys. The detector is based on neural networks in which the signal preprocessing is a frequency analysis. The selected neural network is a perceptron-type network. For the preprocessing, the Goertzel algorithm is the selected technique for the frequency analysis because it is a light alternative to FFT computation and it is very well suited when only a few spectral points are enough to extract the relevant information.
Therefore, the outline of the paper is as follows. In Section 2, musical content management related tools are presented. Section 3 is devoted to the presentation of the tool directly related to the preservation of the cultural heritage. Section 4 presents the different tools developed for technology-enhanced music learning. Finally, the conclusions are presented in Section 5.

2 Music Content Management

The huge amount of digital musical content available through different databases makes it necessary to have intelligent music signal processing tools that help us manage all this information.
In subsection 2.1, a novel tool for navigating through music content is presented. This 3D navigation environment makes it easier to look for inter-related


musical contents, and it also gives the user the opportunity to discover types of music that would not have been found with other, more traditional ways of searching musical content.
In order to use a 3D environment such as the one presented, or other types of methods for music information retrieval, the evaluation of music similarity is one of the core components. In subsection 2.2, the rhythmic and spectral analyses of music contents are combined to extract the tonal profile of musical compositions and evaluate music similarity.
2.1 3D Environment for Music Content Exploration
Interactive music exploration is an open problem [31], with increasing interest due to the growing possibilities to access large music databases. Efforts to automate and simplify the access to musical contents require analyzing the songs to obtain numerical descriptors in the time or frequency domains that can be used to measure and compare differences and similarities among them. We have developed an adaptive 3D environment that allows intuitive music exploration and browsing through its graphical interface. Music analysis is based on the Mel frequency cepstral coefficients (MFCCs) [27]; a multidimensional space is built and each song is represented as a sphere in a 3D environment with tools to navigate, listen and query the music space.
The MFCCs are essentially based on the short-term Fourier transform. The windowed spectrum of the original signal is computed, a Mel bank of filters is applied to obtain a logarithmic frequency representation and the resulting spectrum is processed with a discrete cosine transform (DCT). Then, the Mel coefficients are clustered into a few groups, in order to achieve a compact representation of the global spectral content of the signal. Here, the popular k-means clustering method has been employed and the centroid of the most populated cluster has been considered as a compact vectorial representation of the spectral content of the whole piece. This approach has been applied to a large number of samples for the six genres selected and a predominant vector has been computed for each genre. These vectors are considered as pseudo-orthonormal reference coordinate vectors for the projection of the songs. In particular, for each song, the six coordinates have been obtained by computing the scalar product between the predominant vector of the song itself and the ones of the six genres, conveniently normalized to unit norm.
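A minimal sketch of this descriptor-and-projection pipeline is given below: MFCCs are clustered with k-means, the centroid of the most populated cluster summarizes the song, and the song is projected onto the six genre predominant vectors. The library calls (librosa, scikit-learn) and parameter values are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

def song_descriptor(path, n_mfcc=13, n_clusters=8):
    # compute MFCCs frame by frame, then cluster them into a few groups
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # frames x coeffs
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(mfcc)
    # centroid of the most populated cluster = compact spectral summary
    counts = np.bincount(km.labels_, minlength=n_clusters)
    return km.cluster_centers_[np.argmax(counts)]

def genre_coordinates(song_vec, genre_vecs):
    # genre_vecs: 6 x n_mfcc matrix of per-genre predominant vectors,
    # treated as pseudo-orthonormal reference axes (unit norm)
    g = genre_vecs / np.linalg.norm(genre_vecs, axis=1, keepdims=True)
    return g @ (song_vec / np.linalg.norm(song_vec))  # six genre coordinates
```

Choosing any three of the six coordinates then places each song in the 3D sub-domain used by the interface.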
The graphical user interface comprises a main window with different functional panels (Figure 1). In the main panel, the representation of the songs in a 3D framework is shown: three orthogonal axes, representing the three selected genres, are centered in the coordinate range and the set of songs is represented as blue spheres, correspondingly titled. A set of other panels with different functions completes the window. During the exploration of the space, the user is informed in real time about the closest songs and can listen to them.
2.2 Evaluation of Music Similarity Based on Tonal Behavior
The evaluation of music similarity is one of the core components of the field of Music Information Retrieval (MIR). Similarity is often computed on the basis of


Fig. 1. The graphical user interface for the 3D exploration of musical audio

the extraction of low-level time and frequency descriptors [25] or on the computation of rhythmic patterns [21]. Logan and Salomon [26] use the Mel Frequency Cepstral Coefficients (MFCCs) as the main tool to compare audio tracks, based on their spectral content. Ellis et al. [13] adopt the cross-correlation of rhythmic patterns to identify common parts among songs.
In this study, rhythmic and spectral analyses are combined to extract the tonal profile of musical compositions and evaluate music similarity. The processing stage comprises two main steps: the computation of the main rhythmic meter of the song and the estimation of the distribution of the contributions of the tonalities to the overall tonal content of the composition. The calculation of the cross-correlation of the rhythmic pattern of the envelope of the raw signal allows a quantitative estimation of the main melodic motif of the song. This temporal unit is employed as a basis for the temporal segmentation of the signal, aimed at extracting the pitch class profile of the song [14] and, consequently, the vector of tonality contributions. Finally, this tonal behavior vector is employed as the main feature to describe the song and it is used to evaluate similarity.
Estimation of the melodic cell. In order to characterize the main melodic motif of the track, the songs are analyzed to estimate the tempo. Rather than a strict quantitative metrical analysis of the rhythmic pattern, the method aims at delivering measures for guiding the temporal segmentation of the musical signal, and at subsequently improving the representation of the song dynamics. This is aimed at optimizing the step for the computation of the tonal content of the audio signal, supplying the reference temporal frame for the audio windowing. The aim of the tempo induction is to estimate the width of the window used for windowing, so that the stage for the computation of the tonal content of the song


attains improved performance. In particular, the window should be wide enough to include a single melodic cell, e.g., a single chord. Usually, the distribution of tone contributions within a single melodic cell is uniform and coherent with the chord content. The chord notes are played once for each melodic cell, such that, by evaluating the tone content of each single cell, we can have a reliable idea of the global contribution of the single tonalities for the whole track. It is clear that there are exceptions to these assumptions, such as arpeggios, solos, etc. Both the width and the phase (temporal location) of the window are extremely important for achieving the best performance of the spectral analysis.
A series of frequency analysis stages is performed on the raw signal in order to obtain the most robust estimate of the window. The signal is half-wave rectified, low-pass filtered and its envelope is computed. The main window value is assumed to be best estimated by the average temporal distance between the points of the first-order derivative of the envelope showing the highest difference between crests and troughs. The steps are schematically listed below; a code sketch follows the list.
1. The raw signal is half-wave rectified and filtered with a low-pass Butterworth filter, with a cut-off frequency of 100 Hz [12].
2. The envelope of the filtered signal is computed, using a low-pass Butterworth filter with a cut-off frequency of 1 Hz.
3. The first-order derivative is computed on the envelope.
4. The zero-crossing points of the derivative are found (the crests and the troughs of the envelope).
5. The difference between crests and troughs is computed and its empirical cumulative distribution is evaluated.
6. Only the values exceeding the 75th percentile of their cumulative distribution are kept.
7. The temporal distances among the selected troughs (or crests) are computed and the average value is calculated.
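The sketch below follows the seven steps with SciPy Butterworth filters. The cut-off frequencies (100 Hz, 1 Hz) come from the text; the filter orders and other details are assumptions made for illustration, not the authors' exact settings.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def melodic_cell_width(x, fs):
    x = np.maximum(x, 0.0)                          # 1. half-wave rectification
    b, a = butter(4, 100.0 / (fs / 2), 'low')
    x = filtfilt(b, a, x)                           #    ... low-pass at 100 Hz
    b, a = butter(2, 1.0 / (fs / 2), 'low')         # 2. envelope via 1 Hz low-pass
    env = filtfilt(b, a, np.abs(x))                 #    (a downsampled signal may
                                                    #    be preferable in practice)
    d = np.diff(env)                                # 3. first-order derivative
    zc = np.where(np.diff(np.sign(d)) != 0)[0]      # 4. crests and troughs
    depth = np.abs(np.diff(env[zc]))                # 5. crest/trough differences
    keep = zc[1:][depth > np.percentile(depth, 75)] # 6. keep top quartile only
    if len(keep) < 2:
        return None
    return float(np.mean(np.diff(keep)) / fs)       # 7. mean distance in seconds
```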
A further fundamental parameter is the phase of the tempo detected. It assures the correct matching between the windowing of the signal and the extent of the melodic cell, which helps to minimize the temporal shifting. This is achieved by locating the position of the first trough detected in the signal, which is used as the starting point for the windowing stage.
The algorithm described has been employed to obtain an estimation of the melodic cell that is used in the subsequent steps of the computation of the tonal content. An objective evaluation of the performance of the method is hard to achieve because of the fuzzy perception of the main motif of the song by the human ear. Moreover, the exact usage of a strict regularity in metric is rarely found in modern music [20] and some slight variations in the rhythm throughout a whole composition are barely perceived by the listener. Nevertheless, a set of 30 songs has been selected from a set of 5 genres (Pop, Classic, Electro-Disco, Heavy Metals and Jazz). The songs have been analyzed by experienced listeners and the width of their main metric unit has been manually quantified. Then, the results obtained by the automatic procedure described have been compared.


Table 1. Relative and absolute differences between the widths of the melodic window manually evaluated by the listeners and the ones automatically computed by the proposed algorithm

Genre            Relative Difference (%)   Absolute Difference (s)
Pop                       14.6                      0.60
Classic                   21.2                      0.99
Electro-Disco              6.2                      0.34
Heavy Metals              18.4                      0.54
Jazz                      14.8                      0.58
Mean                      17.3                      0.68

In Table 1, the differences between the widths of the window manually measured and automatically computed are shown.
The best results are obtained for the Disco music tracks (6.2%), where the clear drummed bass background is well detected and the pulse coincides most of the time with the tempo. The worst results are related to the lack of a clear driving bass in Classical music (21.2%), where the changes in time can be frequent and a uniform tempo measure is hardly detectable. However, the beats, or lower-level metrical features, are most of the time submultiples of such a tempo value, which makes them usable for the melodic cell computation.
Tonal behavior. Most music similarity systems aim at imitating the human perception of a song. This capacity is complex to analyze. The human brain carries out a series of subconscious processes, such as the computation of the rhythm, the instrument richness, the musical complexity, the tonality, the mode, the musical form or structure, the presence of modulations, etc., even without any technical musical knowledge [29].
A novel technique for the determination of the tonal behavior of music signals based on the extraction of the pattern of tonality contributions is presented. The main process is based on the calculation of the contributions of each note of the chromatic scale (Pitch Class Profile - PCP), and the computation of the possible matching tonalities. The outcome is a vector reflecting the variation of the spectral contribution of each tonality throughout the entire piece. The song is time windowed with non-overlapping windows, whose width is determined on the basis of the tempo induction algorithm.
The Pitch Class Profile is based on the contribution of the twelve semitone pitch classes to the whole spectrum. Fujishima [14] employed the PCPs as the main tool for chord recognition, while İzmirli [22] defined them as Chroma Templates and used them for audio key finding. Gómez and Herrera [16] applied machine learning methods to the Harmonic Pitch Class Profile to estimate the tonalities of polyphonic audio tracks.
The spectrum of the whole audio is analyzed, and the distribution of the strengths of all the tones is evaluated. The different octaves are grouped to measure the contribution of the 12 basic tones. A detailed description follows.


The signal spectrum, computed by the discrete Fourier transform, is simplified making use of the MIDI numbers as in [8].
The PCP is a 12-dimension vector (from C to B) obtained by the sum of the spectral amplitudes for each tone, spanning the seven octaves (from C1 to B7, or 24 to 107 in MIDI numbers). That is, the first element of the PCP vector is the sum of the strengths of the pitches from tone C1 to tone C7, the second one from tone C#1 to tone C#7, and so on.
Each k-th element of the PCP vector, with k ∈ {1, 2, ..., 12}, is computed as follows:

PCP_t(k) = \sum_{i=1}^{7} X_s\big(k + (i-1)\cdot 12\big)    (1)

where X_s is the simplified spectrum, the index k covers the twelve semitone pitches and i is used to index each octave. The subscript t stands for the temporal frame for which the PCP is computed.
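A minimal sketch of Eq. (1): assuming the simplified spectrum is stored with one bin per MIDI note, indexed so that its first element corresponds to C1, the PCP is obtained by folding the seven octaves onto the twelve pitch classes.

```python
import numpy as np

def pcp(Xs):
    """Xs: simplified spectrum, Xs[0] ~ C1 ... Xs[83] ~ B7 (84 semitones)."""
    Xs = np.asarray(Xs, dtype=float)[:84]
    # PCP_t(k) = sum_i Xs(k + (i-1)*12): rows are octaves, columns pitch classes
    return Xs.reshape(7, 12).sum(axis=0)
```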
In order to estimate the predominant tonality of a track, it is important to define a series of PCPs for all the possible tonalities, to be compared with its own PCP. The shape of the PCP mainly depends on the modality of the tonality (major or minor). Hence, by assembling only two global profiles, for the major and minor modes, and by shifting each of them twelve times according to the tonic pitch of the twelve possible tonalities of each mode, 24 tonality profiles are obtained.
Krumhansl [24] defined the profiles empirically, on the basis of a series of listening sessions carried out on a group of undergraduates from Harvard University, who had to evaluate the correspondence between test tracks and probe tones. The author presented two global profiles, one for the major and one for the minor mode, representing the global contribution of each tone to all the tonalities of each mode. More recently, Temperley [35] presented a modified, less biased version of the Krumhansl profiles. In this context we propose a revised version of Krumhansl's profiles with the aim of avoiding a bias of the system towards a particular mode. Basically, the two mode profiles are normalized to show the same sum of values and, then, the profiles are divided by their corresponding maximums.
For each windowed frame of the track, the squared Euclidean distance between the PCP of the frame and each tonality profile is computed to define a 24-element vector. Each element of the vector is the sum of the squared differences between the amplitudes of the PCP and the tonality profiles. The squared distance is defined as follows:

D_t(k) = \begin{cases} \sum_{j=0}^{11} \big[P_M(j+1) - PCP_t\big((j+k-1) \bmod 12 + 1\big)\big]^2 & 1 \le k \le 12 \\ \sum_{j=0}^{11} \big[P_m(j+1) - PCP_t\big((j+k-1) \bmod 12 + 1\big)\big]^2 & 13 \le k \le 24 \end{cases}    (2)

where D_t(k) is the squared distance computed at time t for the k-th tonality, with k ∈ {1, 2, ..., 24}, and P_M/P_m are, respectively, the major and minor profile.
The predominant tonality of each frame corresponds to the minimum of the distance vector D_t(k), where the index k, with k ∈ {1, ..., 12}, refers to the twelve major tonalities (from C to B) and k, with k ∈ {13, ..., 24}, refers

to the twelve minor tonalities (from c to b). Usually, major and minor tonalities are represented with capital and lower-case letters, respectively.
The empirical distribution of all the predominant tonalities, estimated throughout the entire piece, is calculated in order to represent the tonality contributions to the tonal content of the song. This is defined as the tonal behavior of the composition. In Figure 2, an example of the distribution of the tonality contributions for the Beatles song "I'll be back" is shown.

Fig. 2. An example of the tonal behavior of the Beatles song "I'll be back", where the main tonality is E major (x-axis: tonalities from C to B major and c to b minor; y-axis: normalized amplitude)
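The following sketch illustrates Eq. (2) and the tonal behavior vector: for each frame, the PCP is compared with the twelve rotations of the major and minor profiles, and the histogram of per-frame winning tonalities over the track gives the tonal behavior. The profile arrays P_major and P_minor stand for the normalized Krumhansl-style profiles described above (their values are not reproduced here).

```python
import numpy as np

def tonality_distances(pcp_t, P_major, P_minor):
    # squared Euclidean distances of a frame PCP to the 24 tonality profiles
    d = np.empty(24)
    for k in range(12):
        rotated = np.roll(pcp_t, -k)                  # align tonic k with profile
        d[k] = np.sum((P_major - rotated) ** 2)       # major tonalities 1..12
        d[12 + k] = np.sum((P_minor - rotated) ** 2)  # minor tonalities 13..24
    return d

def tonal_behavior(pcp_frames, P_major, P_minor):
    # per-frame predominant tonality = minimum distance; histogram over the track
    wins = [int(np.argmin(tonality_distances(p, P_major, P_minor)))
            for p in pcp_frames]
    hist = np.bincount(wins, minlength=24).astype(float)
    return hist / hist.sum()                          # empirical tonality distribution
```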
Music similarity. The vectors describing the tonal behavior of the songs are employed to measure their reciprocal degree of similarity. In fact, the human brain is able to detect the main melodic pattern, even by means of subconscious processes, and its perception of musical similarity is partially based on it [24].
The tonal similarity between two songs is computed as the Euclidean distance between their tonal vectors, following the equation:

TS_{AB} = \| T_A - T_B \|    (3)

where TS_{AB} stands for the coefficient of tonal similarity between the songs A and B, and T_A and T_B are the empirical tonality distributions for song A and B, respectively.
A robust evaluation of the performance of the proposed method for the evaluation of music similarity is very hard to achieve. The judgment of the similarity among audio files is a very subjective issue, reflecting the complex reality of human perception. Nevertheless, a series of tests has been performed on some predetermined lists of songs.
Four lists of 11 songs have been submitted to a group of ten listeners. They were instructed to sort the songs according to their perceptual similarity and tonal similarity. For each list, a reference song was defined and the remaining 10 songs had to be sorted with respect to their degree of similarity with the reference one.


A series of 10-element lists was returned by the users, as well as by the automatic method. Two kinds of experimental approaches were carried out: in the first experiment, the users had to listen to the songs and sort them according to a global perception of their degree of similarity. In the second framework, they were asked to focus only on the tonal content. The latter was the hardest target to obtain, because of the complexity of discerning the parameters to be taken into account when listening to a song and evaluating its similarity with respect to other songs.
The degree of coherence between the manually sorted lists and the automatically processed ones was obtained. A weighted matching score for each pair of lists was computed: the reciprocal distance of the songs (in terms of the position index in the lists) was calculated. Such distances were linearly weighted, so that the first songs in the lists carried more importance than the last ones. In fact, it is easier to evaluate which is the most similar song among pieces that are similar to the reference one than to perform the same selection among very different songs. The weights help to compensate for this bias.
Let L and L′ represent two different ordered lists of the same n songs, for the same reference song. The matching score C has been computed as follows:

C = \sum_{i=1}^{n} |i - j| \, \omega_i    (4)

where i and j are the indexes for lists L and L′, respectively, such that j is the index of the song in list L′ that matches the i-th song of list L (L(i) ≡ L′(j)). The absolute difference is linearly weighted by the weights \omega_i, normalized so as to sum to one: \sum_{i=1}^{n} \omega_i = 1. Finally, the scores are transformed to be represented as a percentage of the maximum score attainable.
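A hedged sketch of this weighted matching score between two ordered lists of the same n songs follows. The linearly decreasing weights and the final mapping to a percentage are one plausible reading of the normalization described above, not the authors' exact implementation.

```python
import numpy as np

def matching_score(list_a, list_b):
    """Correspondence score (as a percentage, 100 = identical orderings)."""
    n = len(list_a)
    w = np.arange(n, 0, -1, dtype=float)
    w /= w.sum()                                 # weights sum to one
    pos_b = {song: j for j, song in enumerate(list_b)}
    # C = sum_i w_i * |i - j|, with j the position of song list_a[i] in list_b
    c = sum(w[i] * abs(i - pos_b[song]) for i, song in enumerate(list_a))
    worst = sum(w[i] * (n - 1) for i in range(n))  # assumed worst-case bound
    return 100.0 * (1.0 - c / worst)
```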
The efficiency of the automatic method was evaluated by measuring its coherence with the users' responses. The closer the two sets of values, the better the performance of the automatic method. As expected, the evaluation of the automatic method in the first experimental framework did not return reliable results because of the extreme deviation of the marks, due to the scarce relevance of the tone distribution in the subjective judgment of a song. As mentioned before, the tonal behavior of the song is only one of the parameters taken into account subconsciously by the human ear. Nevertheless, when the same songs were evaluated only by their tonal content, the scores drastically decreased, revealing the extreme lack of abstraction of the human ear. In Table 2 the results for both experimental frameworks are shown.
The differences between the results of the two experiments are evident. Concerning the first experiment, the mean correspondence score is 74.2% among the users' lists and 60.1% among the users and the automatic list. That is, the automatic method poorly reproduces the choices made by the users when a global evaluation of music similarity is taken into account. Conversely, in the second experiment, better results were obtained. The mean correspondence score for the users' lists decreases to 61.1%, approaching the value returned by the users' and automatic lists together, 59.8%. The performance of the system can be considered to be similar to the behavior of a mean human user, regarding the perception of tonal similarities.


Table 2. Means and standard deviations of the correspondence scores obtained by computing equation (4). The rows Auto+Users and Users refer to the correspondence scores computed among the users' lists together with the automatic list and among only the users' lists, respectively. Experiment 1 is done listening to and sorting the songs on the basis of a global perception of the track, while Experiment 2 is performed trying to take into account only the tone distributions.

                             Experiment 1          Experiment 2
Lists     Method             Mean   St. Dev.       Mean   St. Dev.
List A    Auto+Users         67.6    7.1           66.6    8.8
          Users              72.3   13.2           57.9   11.5
List B    Auto+Users         63.6    1.9           66.3    8.8
          Users              81.8    9.6           66.0   10.5
List C    Auto+Users         61.5    4.9           55.6   10.2
          Users              77.2    8.2           57.1   12.6
List D    Auto+Users         47.8    8.6           51.0    9.3
          Users              65.7   15.4           63.4   14.4
Means     Auto+Users         60.1    5.6           59.8    9.2
          Users              74.2   11.6           61.1   12.5

Fig. 3. A snapshot of some of the main windows of the interface of the OMR system

3 Cultural Heritage Preservation and Diffusion

Other important use of the music signal processing is the preservation and diffusion of the music heritage. In this sense, we have paid special attention to the


musical heritage kept in the Archivo de la Catedral de Málaga, where handwritten musical scores of the XVIIth and early XVIIIth centuries written in white mensural notation are preserved. The aim of the tools we have developed is to give new life to that music, making it easier for people to get to know the music of that time. Therefore, in this section, the OMR system we have developed is presented.
3.1 A Prototype for an OMR System

OMR (Optical Music Recognition) systems are essentially based on the conversion of a digitalized music score into an electronic format. The computer must
read the document (in this case a manuscript), interprete it and transcript
its content (notes, time information, execution symbols etc.) into an electronic
format. The task can be addressed to recover important ancient documents and
to improve their availability to the music community.
In the OMR framework, the recognition of ancient handwritten scores is a
real challenge. The manuscripts are often in a very poor state of conservation,
due to their age an their preservation. The handwritten symbols are not uniform and additive symbols can be found to be manually added a posteriori by
other authors. The digital acquisition of the scores and the lighting conditions
in the exposure can cause an incoherence in the background of the image. All
these conditions make the development of an ecient OMR system a very hard
practice. Although the system workow can be generalized, the specic algorithms cannot be blindly used for dierent authors but it has to be trained for
each use.
We have developed a whole OMR system [34] for two styles of writing scores in white mensural notation. Figure 3 shows a snapshot of its graphical user interface. In the main window a series of tools is supplied to follow a complete workflow, based on a number of steps: the pre-processing of the image, the partition of the score into single staves and the processing of the staves with the extraction, classification and transcription of the musical neums. Each tool corresponds to an individual window that allows the user to interact in order to complete the stage.
The preprocessing of the image, aimed at feeding the system with the cleanest black and white image of the score, is divided into the following tasks: the clipping of the region of interest of the image [9], the automatic blanking of red frontispieces, the conversion from RGB to grayscale, the compensation of the lighting conditions, the binarization of the image [17] and the correction of the image tilt [36]. After partitioning the score into single staves, the staff lines are tracked and blanked and the symbols are extracted and classified. In particular, a series of multidimensional feature vectors is computed on the geometrical extent of the symbols and a series of corresponding classifiers is employed to relate the neums to their corresponding musical symbols. At any moment, the interface allows the user to carefully follow each processing stage.
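An illustrative sketch of such a pre-processing chain using OpenCV is shown below: grayscale conversion, illumination compensation, binarization and deskewing. The specific algorithms of the authors' system (e.g., the red-frontispiece blanking and the particular binarization and tilt-correction methods of [17] and [36]) are not reproduced; the operations used here are generic stand-ins.

```python
import cv2
import numpy as np

def preprocess_score(bgr_image):
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    # compensate uneven lighting by dividing by a heavily blurred background
    background = cv2.GaussianBlur(gray, (51, 51), 0).astype(np.float32) + 1.0
    flat = cv2.normalize(gray.astype(np.float32) / background, None,
                         0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    # binarize (Otsu threshold as a generic stand-in)
    _, binary = cv2.threshold(flat, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # estimate and correct a small global tilt from the ink pixels
    ink = np.column_stack(np.where(binary == 0)).astype(np.float32)
    angle = cv2.minAreaRect(ink)[-1]
    if angle > 45:
        angle -= 90
    h, w = binary.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(binary, M, (w, h), flags=cv2.INTER_NEAREST,
                          borderValue=255)
```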


4 Tools for Technology-Enhanced Learning and Revolutionary Learning Appliances

Music signal processing tools also make possible the development of new interactive methods for music learning using a computer or a toy. In this sense, we have developed a number of specialized tools to help learn how to play the piano, violin and guitar. These tools will be presented in Sections 4.1, 4.2 and 4.3, respectively. It is worth mentioning that, for the development of these tools, the very special characteristics of the instruments have been taken into account. In fact, the people who developed these tools are able to play these instruments. This has contributed to making the tools especially useful because, during the development, we have observed the main difficulties of each of the instruments. Finally, thinking about developing toys, or other small embedded systems with musical intelligence, in subsection 4.4 we present a lightweight pitch detector that has been designed for this aim.
4.1 Tool for Piano Learning

The piano is a musical instrument that is widely used in all kinds of music and
as an aid to the composition due to its versatility and ubiquity. This instrument
is played by means of a keyboard and allows to get very rich polyphonic sound.
Piano learning involves several diculties that come from its great possibilities
of generating sound with high polyphony number. These diculties are easily observed when the musical skills are small or when trying to transcribe its sound
when piano is used in composition. Therefore, it is useful to have a system that
determines the notes that sound in a piano in each time frame and represent them
in a simple form that can be easily understood, this is the aim of the tool that will
be presented. The core of the designed tool is a polyphonic transcription system
able to detect the played notes using spectral patterns of the piano notes [6], [4].
The approach used in the proposed tool to perform the polyphonic transcription is rather different from the proposals that can be found in the literature [23]. In our case, the audio signal to be analyzed is considered to have certain similarities to a code division multiple access (CDMA) communications signal. Our model considers the spectral patterns of the different piano notes [4]. Therefore, in order to perform the detection of the notes that sound during each time frame, we have considered a suitable modification of a CDMA multiuser detection technique to cope with the polyphonic nature of the piano music and with the different energies of the piano notes, in the same way as an advanced CDMA receiver [5].
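To make the CDMA analogy concrete, the toy sketch below treats the magnitude spectrum of a frame as a superposition of per-note spectral patterns and detects notes by successive correlation and cancellation, loosely mimicking a multiuser detector. This is only an illustrative stand-in, not the detector of [4], [5], [6]; the pattern dictionary, threshold and maximum polyphony are hypothetical parameters.

```python
import numpy as np

def detect_notes(spectrum, patterns, max_notes=6, threshold=0.1):
    """patterns: dict {midi_note: unit-norm spectral pattern of that note}."""
    residual = np.asarray(spectrum, dtype=float).copy()
    detected = []
    for _ in range(max_notes):
        # correlate the residual with every note pattern (matched filter)
        scores = {n: float(residual @ p) for n, p in patterns.items()}
        note, score = max(scores.items(), key=lambda kv: kv[1])
        if score < threshold * np.linalg.norm(spectrum):
            break
        detected.append(note)
        # cancel the detected note's contribution before the next pass
        residual = np.maximum(residual - score * patterns[note], 0.0)
    return detected
```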
A snapshot of the main windows of the interface is presented in Figure 4. The designed graphical user interface is divided into three parts:
– The management items of the tool are three main buttons: one button to acquire the piano music to analyze, another button to start the system and a final button to reset the system.
– The time and frequency response of each frame of piano sound under analysis are shown in the middle part of the window.


– A piano keyboard in which the played notes are highlighted, as well as the names of the played notes, is shown at the bottom.

Fig. 4. A snapshot of the main windows of the interface of the tool for piano learning
4.2 Tool for Violin Learning

The violin is one of the most complex instruments often used by the children
for the rst approach to music learning. The main characteristic of the violin
is its great expressiveness due to the wide range of interpretation techniques.
The system we have developed is able to detect not only the played pitch, as
other transcription systems [11], but also the technique employed [8]. The signal
envelope and the frequency spectrum are considered in time and frequency domain, respectively. The descriptors employed for the detection system have been
computed analyzing a high amount of violin recordings, from the Musical Instrument Sound Data Base RWC-MDB-1-2001-W05 [18] and other home made
recordings. Dierent playing techniques have been performed in order to train
the system for its expressiveness capability. The graphical interface is aimed to
facilitate the violin learning for any user.
For the signal processing tool, a graphical interface has been developed. The main window presents two options for the user, the theory section (Teoría) and the practical section (Práctica). In the section Teoría the user is encouraged to learn all the concepts about the violin's history, the violin's parts and the playing posture (left and right hand), while the section Práctica is mainly based on an expressiveness transcription system [8]. Here, the user starts with the basic study sub-section, where the main violin positions are presented, illustrating the placement of the left hand on the fingerboard, with the aim of attaining good intonation. Hence, the user can record the melody corresponding to the


selected position and ask the application to correct it, returning the errors made. Otherwise, in the free practice sub-section, any kind of violin recording can be analyzed for its melodic content, detecting the pitch, the duration of the notes and the techniques employed (e.g., détaché with and without vibrato, pizzicato, tremolo, spiccato, flageolett-tone). The user can also visualize the envelope and the spectrum of each note and listen to the MIDI transcription generated. In Figure 5, some snapshots of the interface are shown. The overall performance attained by our system in the detection and correction of the notes and expressiveness is 95.4%.
4.3 Tool for String and Fret Estimation in Guitar

The guitar is one of the most popular musical instruments nowadays. In contrast to other instruments like the piano, on the guitar the same note can be played by plucking different strings at different positions. Therefore, the algorithms used for piano transcription [10] cannot be used for the guitar. In guitar transcription it is important to estimate the string used to play a note [7].

Fig. 5. Three snapshots of the interface for violin learning are shown. Clockwise from
top left: the main window, the analysis window and a plot of the MIDI melody.


Fig. 6. Graphical interface of the tool for guitar learning: (a) main window; (b) single note estimation with tuner


The system presented in this demonstration is able to estimate the string and the fret of a single note played with a very low error probability. In order to keep a low error probability when a chord is strummed on a guitar, the system chooses which chord has been most likely played from a predefined list. The system works with classical guitars as well as acoustic or electric guitars. The sound has to be captured with a microphone connected to the computer soundcard. It is also possible to plug a cable from an electric guitar to the sound card directly.
The graphical interface consists of a main window (Figure 6(a)) with a pop-up menu where you can choose the type of guitar you want to use with the interface. The main window includes a panel (Estimación) with three push buttons, where you can choose between three estimation modes:
– The mode Nota única (Figure 6(b)) estimates the string and fret of a single note that is being played and includes a tuner (afinador).
– The mode Acorde predeterminado estimates strummed chords that are being played. The system estimates the chord by choosing the most likely one from a predefined list.
– The last mode, Acorde libre, makes a free estimation of what is being played. In this mode the system does not have the information of how many notes are being played, so this piece of information is also estimated.
Each mode includes a window that shows the microphone input, a window with the Fourier transform of the sound sample, a start button, a stop button and an exit button (Salir). At the bottom of the screen there is a panel that represents a guitar. Each row stands for a string on the guitar and the frets are numbered from one to twelve. The current estimation of the sound sample, either note or chord, is shown on the panel with a red dot.
4.4 Lightweight Pitch Detector for Embedded Systems Using Neural Networks

Pitch detection could be defined as the act of listening to a music melody and writing down the music notation of the piece, that is, deciding the notes played [28]. Basically, this is a pattern recognition problem over time, where each pattern corresponds to features characterizing a musical note (e.g., the fundamental frequency). Nowadays, there exists a wide range of applications for pitch detection: educational applications, music-retrieval systems, automatic music analysis systems, music games, etc. The main problem of pitch detection systems is the computational complexity required, especially if they are polyphonic [23]. Artificial intelligence techniques often provide an efficient and lightweight alternative for classification and recognition tasks. These techniques can be used, in some cases, to avoid other processing algorithms, reducing the computational complexity, speeding up or improving the efficiency of the system [3], [33]. This is the case for audio processing and music transcription.
When only a small amount of memory and processing power are available,
FFT-based detection techniques can be too costly to implement. In this case,


artificial intelligence techniques, such as neural networks sized to be implemented in a small system, can provide the necessary accuracy. There are two alternatives [3], [33] in neural networks. The first one is unsupervised training. This is the case of some networks which have been specially designed for pattern classification, such as self-organizing maps. However, the computational complexity of this implementation is too high for a low-cost microcontroller.
The other alternative is supervised training of neural networks. This is the case of perceptron-type networks. In these networks, the synaptic weights connecting each neuron are modified as a new training vector is presented. Once the network is trained, the weights can be statically stored to classify new network inputs. The training algorithm can be run on a different machine from the one where the network propagation algorithm is executed. Hence, the only limitation comes from the available memory.
In the proposed system, we focus on the design of a lightweight pitch detector for embedded systems based on neural networks, in which the signal preprocessing is a frequency analysis. The selected neural network is a perceptron-type network. For the preprocessing, the Goertzel algorithm [15] is the selected technique for the frequency analysis because it is a light alternative to FFT computing if we are only interested in some of the spectral points.

Fig. 7. Block diagram of the pitch detector for an embedded system (audio in → preamp → 8th-order elliptic filter → 10-bit A/D conversion → buffering → preprocessing → pitch detection → I2C out, on an AVR ATMEGA168)

Figure 7 shows the block diagram of the detection system. This figure shows the hardware connected to the microcontroller's A/D input, which consists of a preamplifier, in order to accommodate the input from the electret microphone into the A/D input range, and an anti-aliasing filter. The anti-aliasing filter provides 58 dB of attenuation at cutoff, which is enough to ensure the anti-aliasing function. After the filter, the internal A/D converter of the microcontroller is used. After conversion, a buffer memory is required in order to store enough samples for the preprocessing block. The output of the preprocessing block is used for pitch detection using a neural network. Finally, an I2C (Inter-Integrated Circuit) [32] interface is used for connecting the microcontroller with other boards.
We use the open source Arduino environment [1] with the AVR ATMEGA168 microcontroller [2] for development and testing of the pitch detection implementation. The system is configured to detect the notes between A3 (220 Hz)


and G#5 (830.6Hz), following the well-tempered scale, as it is the system mainly
used in Western music. This range of notes has been selected because one of the
applications of the proposed system is the detection of vocal music of children
and adolescents.
The aim of the preprocessing stage is to transform the samples of the audio signal from the time domain to the frequency domain. The Goertzel algorithm [15], [30] is a light alternative to FFT computing if the interest is focused only on some of the spectrum points, as in this case. Given the frequency range of the musical notes in which the system is going to work, along with the sampling restrictions of the selected processor, the selected sampling frequency is f_s = 4 kHz and the number of input samples is N = 400, which gives a precision of 10 Hz, sufficient for the pitch detection system. On the other hand, in the preprocessing block, the number of frequencies f_p at which the Goertzel algorithm is computed is 50, given according to f_p = 440 \cdot 2^{p/12} Hz with p = -24, -23, ..., 0, ..., 24, 25, so that each note in the range of interest has at least one harmonic and one subharmonic, improving the detection performance of notes with an octave or perfect fifth relation. Finally, the output of the preprocessing stage is a vector that contains the squared modulus of the 50 points of interest of the Goertzel algorithm: the points of the power spectrum of the input audio signal at the frequencies of interest.
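A minimal floating-point sketch of this preprocessing stage follows: Goertzel power estimates at the 50 frequencies f_p = 440·2^{p/12} Hz, p = -24..25, with f_s = 4 kHz and N = 400 samples, as described above (the embedded version uses fixed-point arithmetic instead).

```python
import numpy as np

def goertzel_power(x, fs, f):
    # standard Goertzel recursion returning |X(k)|^2 at the bin closest to f
    N = len(x)
    k = int(round(N * f / fs))
    w = 2.0 * np.pi * k / N
    coeff = 2.0 * np.cos(w)
    s_prev, s_prev2 = 0.0, 0.0
    for sample in x:
        s = sample + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

def preprocessing_vector(frame, fs=4000):
    freqs = [440.0 * 2.0 ** (p / 12.0) for p in range(-24, 26)]  # 50 points
    return np.array([goertzel_power(frame, fs, f) for f in freqs])
```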
For the algorithm implemented using fixed-point arithmetic, the execution time is less than 3 ms on a 16 MIPS AVR microcontroller. The number of points of the Goertzel algorithm is limited by the available memory. Eq. (5) shows the number of bytes required to implement the algorithm:

n_{bytes} = 2\left(\frac{N}{4} + 2N + m\right)    (5)

In this expression, m represents the number of desired frequency points. Thus, with m = 50 points and N = 400, the algorithm requires 1900 bytes of RAM memory for signal input/processing/output buffering. Since the microcontroller has 1024 bytes of RAM memory, it is necessary to use an external high-speed SPI RAM memory in order to have enough memory for buffering audio samples.
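A quick numerical check of Eq. (5) as reconstructed above, using the values quoted in the text (the grouping of terms is an assumption chosen to reproduce the stated 1900-byte figure):

```python
# N input samples and m Goertzel frequency points from the text
N, m = 400, 50
n_bytes = 2 * (N // 4 + 2 * N + m)   # = 2 * (100 + 800 + 50) = 1900 bytes
print(n_bytes)
```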
Once the Goertzel algorithm has been performed and the points are stored in the RAM memory, a recognition algorithm has to be executed for pitch detection. A useful alternative to spectral processing techniques consists of using artificial intelligence techniques. We use a statically trained neural network, storing the network weight vectors in an EEPROM memory. Thus, the network training is performed on a computer with the same algorithm implemented and the embedded system only runs the network. Figure 8 depicts the structure of the neural network used for pitch recognition. It is a multilayer feed-forward perceptron with a back-propagation training algorithm.
In our approach, sigmoidal activation has been used for each neuron, with no neuron bias. This provides a fuzzy set of values, y_j, at the output of each neural layer. The fuzzy set is controlled by the shape factor, β, of the sigmoid function, which is set to 0.8, and it is applied to a threshold-based decision function.

Fig. 8. Neural network structure for note identification in an embedded system (input layer fed with the 50 Goertzel points, one hidden layer, and an output layer that encodes the note)

Fig. 9. Learning test, validation test and ideal output of the designed neural network (input and output notes range from A3 to G#5)

Hence, outputs below 0.5 do not activate output neurons while values above 0.5 activate output neurons. The neural network parameters, such as the number of neurons in the hidden layer or the shape factor of the sigmoid function, have been determined experimentally. The neural network has been trained by running the BPN (Back Propagation Neural Network) algorithm on a PC. Once network convergence is achieved, the weight vectors are stored. Regarding the output layer of the neural network, we use five neurons to encode 24 different outputs corresponding to each note in the two octaves (A3-G#5).
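A minimal sketch of the run-time part executed on the microcontroller is given below: a forward pass through the statically trained perceptron (50 Goertzel inputs, sigmoid activation with shape factor β = 0.8, no bias) followed by the 0.5 threshold decision. Layer sizes other than the 50-point input and the encoding of the output are assumptions for illustration; training (back-propagation) happens offline on a PC.

```python
import numpy as np

BETA = 0.8  # shape factor of the sigmoid, as stated in the text

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-BETA * v))

def forward(x, w_hidden, w_output):
    """x: 50 Goertzel power points; w_*: weight matrices stored in EEPROM."""
    h = sigmoid(w_hidden @ x)    # hidden layer, no bias
    y = sigmoid(w_output @ h)    # output layer
    return y > 0.5               # active output neurons encode the detected note
```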
The training and the evaluation of the proposed system have been done using independent note samples taken from the Musical Instrument Data Base RWC [19]. The selected instruments have been piano and human voice. The training of the neural network has been performed using 27 samples for each note in the


range of interest. Thus, we used 648 input vectors to train the network. This way, network convergence was achieved with an error of 0.5%.
In Figure 9, we show the learning characteristic of the network when simulating the network with the training vectors. At the same time, we show the validation test using 96 input vectors (4 per note), which corresponds to about 15% of new inputs. As shown in Figure 9, the inputs are correctly classified due to the small difference among the outputs for the ideal, learning and validation inputs.

5 Conclusions

Nowadays, it is a requirement that all types of information be widely available in digital form in digital libraries, together with intelligent techniques for the creation and management of the digital information; thus, contents will be plentiful, open, interactive and reusable. It becomes necessary to link contents, knowledge and learning in such a way that information will be produced, stored, handled, transmitted and preserved to ensure long-term accessibility to everyone, regardless of the special requirements of certain communities (e-inclusion). Among the many different types of information, music happens to be one of the most widely demanded due to its cultural interest, for entertainment or even for therapeutic reasons.
Throughout this paper, we have presented several applications of music signal processing techniques. It is clear that the use of such tools can be very enriching from several points of view: Music Content Management, Cultural Heritage Preservation and Diffusion, Tools for Technology-Enhanced Learning and Revolutionary Learning Appliances, etc. Now that we have technology at our side at every moment (mobile phones, e-books, computers, etc.), all the tools we have developed can be easily used. There are still a lot of open issues and things that should be improved, but, more and more, technology helps music.

Acknowledgments
This work has been funded by the Ministerio de Ciencia e Innovación of the Spanish Government under Project No. TIN2010-21089-C03-02, by the Junta de Andalucía under Project No. P07-TIC-02783 and by the Ministerio de Industria, Turismo y Comercio of the Spanish Government under Project No. TSI-020501-2008-117. The authors are grateful to the person in charge of the Archivo de la Catedral de Málaga, who allowed the utilization of the data sets used in this work.

References
1. Arduino board, http://www.arduino.cc (last viewed February 2011)
2. Atmel corporation web site, http://www.atmel.com (last viewed February 2011)
3. Aliev, R.: Soft Computing and its Applications. World Scientific Publishing Company, Singapore (2001)
4. Barbancho, A.M., Barbancho, I., Fernandez, J., Tardón, L.J.: Polyphony number estimator for piano recordings using different spectral patterns. In: 128th Audio Engineering Society Convention (AES 2010), London, UK (2010)
5. Barbancho, A.M., Tardón, L., Barbancho, I.: CDMA systems physical function level simulation. In: IASTED International Conference on Advances in Communication, Rodas, Greece (2001)
6. Barbancho, A.M., Tardón, L.J., Barbancho, I.: PIC detector for piano chords. EURASIP Journal on Advances in Signal Processing (2010)
7. Barbancho, I., Tardón, L.J., Barbancho, A.M., Sammartino, S.: Pitch and played string estimation in classic and acoustic guitars. In: Proc. of the 126th Audio Engineering Society Convention (AES 126th), Munich, Germany (May 2009)
8. Barbancho, I., Bandera, C., Barbancho, A.M., Tardón, L.J.: Transcription and expressiveness detection system for violin music. In: IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Taipei, Taiwan, pp. 189-192 (2009)
9. Barbancho, I., Segura, C., Tardón, L.J., Barbancho, A.M.: Automatic selection of the region of interest in ancient scores. In: IEEE Mediterranean Electrotechnical Conference (MELECON 2010), Valletta, Malta (May 2010)
10. Bello, J.: Automatic piano transcription using frequency and time-domain information. IEEE Transactions on Audio, Speech and Language Processing 14(6), 2242-2251 (2006)
11. Boo, W., Wang, Y., Loscos, A.: A violin music transcriber for personalized learning. In: IEEE Int. Conf. on Multimedia and Expo (ICME), Toronto, Canada, pp. 2081-2084 (2006)
12. Dixon, S., Pampalk, E., Widmer, G.: Classification of dance music by periodicity patterns. In: Proceedings of the International Conference on Music Information Retrieval (ISMIR 2003), October 26-30, pp. 159-165. Johns Hopkins University, Baltimore, USA (2003)
13. Ellis, D.P.W., Cotton, C.V., Mandel, M.I.: Cross-correlation of beat-synchronous representations for music similarity. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Las Vegas, USA, pp. 57-60 (2008), http://mr-pc.org/work/icassp08.pdf (last viewed February 2011)
14. Fujishima, T.: Realtime chord recognition of musical sound: a system using common lisp music. In: Proc. International Computer Music Association, ICMC 1999, pp. 464-467 (1999), http://ci.nii.ac.jp/naid/10013545881/en/ (last viewed February 2011)
15. Goertzel, G.: An algorithm for the evaluation of finite trigonometric series. The American Mathematical Monthly 65(1), 34-35 (1958)
16. Gómez, E., Herrera, P.: Estimating the tonality of polyphonic audio files: Cognitive versus machine learning modelling strategies. In: Proc. Music Information Retrieval Conference (ISMIR 2004), Barcelona, Spain, pp. 92-95 (2004)
17. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 3rd edn. Prentice-Hall Inc., Upper Saddle River (2006)
18. Goto, M.: Development of the RWC music database. In: 18th Int. Congress on Acoustics, pp. I-553-I-556 (2004)
19. Goto, M.: Development of the RWC music database. In: Proc. of the 18th International Congress on Acoustics ICA 2004, Kyoto, Japan, pp. 553-556 (April 2004)
20. Gouyon, F.: A computational approach to rhythm description - Audio features for the computation of rhythm periodicity functions and their use in tempo induction and music content processing. Ph.D. thesis, UPF (2005), http://www.mtg.upf.edu/files/publications/9d0455-PhD-Gouyon.pdf (last viewed February 2011)
21. Holzapfel, A., Stylianou, Y.: Rhythmic similarity of music based on dynamic periodicity warping. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2008, Las Vegas, USA, March 31-April 4, pp. 2217-2220 (2008)
22. İzmirli, Ö.: Audio key finding using low-dimensional spaces. In: Proc. Music Information Retrieval Conference, ISMIR 2006, Victoria, Canada, pp. 127-132 (2006)
23. Klapuri, A.: Automatic music transcription as we know it today. Journal of New Music Research 33(3), 269-282 (2004)
24. Krumhansl, C.L., Kessler, E.J.: Tracing the dynamic changes in perceived tonal organization in a spatial representation of musical keys. Psychological Review 89, 334-368 (1982)
25. Lampropoulos, A.S., Sotiropoulos, D.N., Tsihrintzis, G.A.: Individualization of music similarity perception via feature subset selection. In: Proc. Int. Conference on Systems, Man and Cybernetics, Massachusetts, USA, vol. 1, pp. 552-556 (2004)
26. Logan, B., Salomon, A.: A music similarity function based on signal analysis. In: IEEE International Conference on Multimedia and Expo, ICME 2001, Tokyo, Japan, pp. 745-748 (August 2001)
27. Logan, B.: Mel frequency cepstral coefficients for music modeling. In: Proc. Music Information Retrieval Conference (ISMIR 2000) (2000)
28. Marolt, M.: A connectionist approach to automatic transcription of polyphonic piano music. IEEE Transactions on Multimedia 6(3), 439-449 (2004)
29. Ockelford, A.: On Similarity, Derivation and the Cognition of Musical Structure. Psychology of Music 32(1), 23-74 (2004), http://pom.sagepub.com/cgi/content/abstract/32/1/23 (last viewed February 2011)
30. Oppenheim, A., Schafer, R.: Discrete-Time Signal Processing. Prentice-Hall, Englewood Cliffs (1989)
31. Pampalk, E.: Islands of music - analysis, organization, and visualization of music archives. Vienna University of Technology, Tech. rep. (2001)
32. Philips: The I2C bus specification v.2.1 (2000), http://www.nxp.com (last viewed February 2011)
33. Prasad, B., Mahadeva, S.: Speech, Audio, Image and Biomedical Signal Processing using Neural Networks. Springer, Heidelberg (2004)
34. Tardón, L.J., Sammartino, S., Barbancho, I., Gómez, V., Oliver, A.J.: Optical music recognition for scores written in white mensural notation. EURASIP Journal on Image and Video Processing 2009, Article ID 843401, 23 pages (2009), doi:10.1155/2009/843401
35. Temperley, D.: The Cognition of Basic Musical Structures. The MIT Press, Cambridge (2004)
36. William, W.K.P.: Digital Image Processing, 2nd edn. John Wiley & Sons Inc., New York (1991)

Speech/Music Discrimination in Audio Podcast Using Structural Segmentation and Timbre Recognition

Mathieu Barthet, Steven Hargreaves, and Mark Sandler

Centre for Digital Music, Queen Mary University of London, Mile End Road, London E1 4NS, United Kingdom
{mathieu.barthet,steven.hargreaves,mark.sandler}@eecs.qmul.ac.uk
http://www.elec.qmul.ac.uk/digitalmusic/

Abstract. We propose two speech/music discrimination methods using timbre models and measure their performances on a 3-hour-long database of radio podcasts from the BBC. In the first method, the machine-estimated classifications obtained with an automatic timbre recognition (ATR) model are post-processed using median filtering. The classification system (LSF/K-means) was trained using two different taxonomic levels, a high-level one (speech, music), and a lower-level one (male and female speech, classical, jazz, rock & pop). The second method combines automatic structural segmentation and timbre recognition (ASS/ATR). The ASS evaluates the similarity between feature distributions (MFCC, RMS) using HMM and soft K-means algorithms. Both methods were evaluated at the semantic (relative correct overlap, RCO) and temporal (boundary retrieval F-measure) levels. The ASS/ATR method obtained the best results (average RCO of 94.5% and boundary F-measure of 50.1%). These performances compared favourably with those obtained by an SVM-based technique providing a good benchmark of the state of the art.

Keywords: Speech/Music Discrimination, Audio Podcast, Timbre Recognition, Structural Segmentation, Line Spectral Frequencies, K-means Clustering, Mel-Frequency Cepstral Coefficients, Hidden Markov Models.

1 Introduction

Increasing amounts of broadcast material are being made available in the podcast format, which is defined in reference [52] as a digital audio or video file that is episodic; downloadable; programme-driven, mainly with a host and/or theme; and convenient, usually via an automated feed with computer software (the word podcast comes from the contraction of webcast, a digital media file distributed over the Internet using streaming technology, and iPod, the portable media player by Apple). New technologies have indeed emerged allowing users


Correspondence should be addressed to Mathieu Barthet.


to access audio podcast material either online (on radio websites such as the one from the BBC used in this study: http://www.bbc.co.uk/podcasts), or offline, after downloading the content to personal computers or mobile devices using dedicated services. A drawback of the podcast format, however, is its lack of indexes for individual songs and sections, such as speech. This makes navigation through podcasts a difficult, manual process, and software built on top of automated podcast segmentation methods would therefore be of considerable help for end-users. Automatic segmentation of podcasts is a challenging task in speech processing and music information retrieval since the nature of the content from which they are composed is very broad. A non-exhaustive list of the types of content commonly found in podcasts includes: spoken parts of various types depending on the characteristics of the speakers (language, gender, number, etc.) and the recording conditions (reverberation, telephonic transmission, etc.), music tracks often belonging to disparate musical genres (classical, rock, jazz, pop, electro, etc.) and which may include a predominant singing voice (a source of confusion since the latter intrinsically shares properties with the spoken voice), and jingles and commercials, which are usually complex sound mixtures including voice, music, and sound effects. One step of the process of automatically segmenting and annotating podcasts therefore is to segregate sections of speech from sections of music. In this study, we propose two computational models for speech/music discrimination based on structural segmentation and/or timbre recognition and evaluate their performances in the classification of audio podcast content. In addition to their use with audio broadcast material (e.g., music shows, interviews) as assessed in this article, speech/music discrimination models may also be of interest to enhance navigation into archival sound recordings that contain both spoken word and music (e.g., ethnomusicology interviews available on the online sound archive from the British Library: https://sounds.bl.uk/). While speech/music discrimination models find a direct application in automatic audio indexation, they may also be used as a preprocessing stage to enhance numerous speech processing and music information retrieval tasks such as speech and music coding, automatic speaker recognition (ASR), chord recognition, or musical instrument recognition.
The speech/music discrimination methods proposed in this study rely on timbre models (based on various features such as the line spectral frequencies [LSF] and the mel-frequency cepstral coefficients [MFCC]) and machine learning techniques (K-means clustering and hidden Markov models [HMM]). The first proposed method comprises an automatic timbre recognition (ATR) stage using the model proposed in [7] and [16], trained here with speech and music content. The results of the timbre recognition system are then post-processed using a median filter to minimize undesired inter-class switches. The second method utilizes the automatic structural segmentation (ASS) model proposed in [35] to divide the signal into a set of segments which are homogeneous with respect to timbre before applying the timbre recognition procedure. A database of classical music, jazz, and popular music podcasts from the BBC was manually annotated for training and testing purposes (approximately 2.5


hours of speech and music). The methods were both evaluated at the semantic level, to measure the accuracy of the machine-estimated classifications, and at the temporal level, to measure the accuracy of the machine-estimated boundaries between speech and music sections. Whilst studies on speech/music discrimination techniques usually provide the first type of evaluation (classification accuracy), boundary retrieval performances are not reported to our knowledge, despite their interest. The results of the proposed methods were also compared with those obtained with a state-of-the-art speech/music discrimination algorithm based on support vector machines (SVM) [44].
The remainder of the article is organized as follows. In Section 2, a review of related works on speech/music discrimination is proposed. In Section 3, we give a brief overview of timbre research in psychoacoustics, speech processing and music information retrieval, and then describe the architecture of the proposed timbre-based methods. Section 4 details the protocols and databases used in the experiments, and specifies the measures used to evaluate the algorithms. The results of the experiments are given and discussed in Section 5. Finally, Section 6 is devoted to the summary and the conclusions of this work.

2  Related Work

Speech/music discrimination is a special case of audio content classification reduced to two classes. Most audio content classification methods are based on the following stages: (i) the extraction of (psycho)acoustical variables aimed at characterizing the classes to be discriminated (these variables are commonly referred to as descriptors or features); (ii) a feature selection stage intended to further improve the performance of the classifier, which can be done either a priori, based on heuristics about the disparities between the classes to discern, or a posteriori, using an automated selection technique; and (iii) a classification system relying either on generative methods, which model the distributions in the feature space, or discriminative methods, which determine the boundaries between classes. The seminal works on speech/music discrimination by Saunders [46], and Scheirer and Slaney [48], developed descriptors quantifying various acoustical specificities of speech and music which were then widely used in subsequent studies on the subject. In [46], Saunders proposed five features suitable for speech/music discrimination whose quick computation in the time domain, directly from the waveform, allowed for a real-time implementation of the algorithm; four of them are based on the zero-crossing rate (ZCR) measure (a correlate of the spectral centroid, or center of mass of the power spectral distribution, which characterizes the dominant frequency in the signal [33]), and the other was an energy contour (or envelope) dip measure (the number of energy minima below a threshold defined relative to the peak energy in the analyzed segment). The zero-crossing rates were computed on a short-term basis (frame-by-frame) and then integrated on a longer-term basis with measures of the skewness of their distribution (standard deviation of the derivative, third central moment about the mean, number of zero crossings exceeding a threshold, and the difference between the numbers of zero-crossing
samples above and below the mean). When both the ZCR and energy-based features were used jointly with a supervised machine learning technique relying on a multivariate Gaussian classifier, a 98% accuracy was obtained on average (speech and music) using 2.4 s-long audio segments. The good performance of the algorithm can be explained by the fact that the zero-crossing rate is a good candidate to discern unvoiced speech (fricatives), with a modulated noise spectrum (relatively high ZCR), from voiced speech (vowels), with a quasi-harmonic spectrum (relatively low ZCR): speech signals, whose characteristic structure is a succession of syllables made of short periods of fricatives and long periods of vowels, present a marked rise in the ZCR during the periods of fricativity, which does not appear in music signals, which are largely tonal (this, however, depends on the musical genre considered). Secondly, the energy contour dip measure characterizes well the differences between speech (whose systematic changeovers between voiced vowels and fricatives produce marked and frequent changes in the energy envelope) and music (which tends to have a more stable energy envelope). However, the algorithm proposed by Saunders is limited in time resolution (2.4 s). In [48], Scheirer and Slaney proposed a multifeature approach and examined various powerful classification methods. Their system relied on the following 13 features and, in some cases, their variance: the 4 Hz modulation energy (characterizing the syllabic rate in speech [30]), the percentage of low-energy frames (more silences are present in speech than in music), the spectral rolloff, defined as the 95th percentile of the power spectral distribution (a good candidate to discriminate voiced from unvoiced sounds), the spectral centroid (often higher for music with percussive sounds than for speech, whose pitches stay in a fairly low range), the spectral flux, which is a measure of the fluctuation of the short-term spectrum (music tends to have a higher rate of spectral flux change than speech), the zero-crossing rate as in [46], the cepstrum resynthesis residual magnitude (the residual is lower for unvoiced speech than for voiced speech or music), and a pulse metric (indicating whether or not the signal contains a marked beat, as is the case in some popular music). Various classification frameworks were tested by the authors, a multidimensional Gaussian maximum a posteriori (MAP) estimator as in [46], a Gaussian mixture model (GMM), a k-nearest-neighbour estimator (k-NN), and a spatial partitioning scheme (k-d tree), and all led to similar performances. The best average recognition accuracy using the spatial partitioning classification was 94.2% on a frame-by-frame basis, and 98.6% when integrating over 2.4 s-long segments of sound, the latter result being similar to that obtained by Saunders. Some authors used extensions or correlates of the previous descriptors for the speech/music discrimination task, such as the higher order crossings (HOC), which are the zero-crossing rates of filtered versions of the signal [37] [20], originally proposed by Kedem [33], the spectral flatness (quantifying how tonal or noisy a sound is) and the spectral spread (the second central moment of the spectrum) defined in the MPEG-7 standard [9], and a rhythmic pulse computed in the MPEG compressed domain [32]. Carey et al. introduced the use of the fundamental frequency f0 (strongly correlated with the perceptual attribute of pitch) and its derivative in order to characterize
some prosodic aspects of the signals (f0 changes in speech are more evenly distributed than in music, where they are either strongly concentrated about zero due to steady notes, or large due to shifts between notes) [14]. The authors obtained a recognition accuracy of 96% using the f0-based features with a Gaussian mixture model classifier. Descriptors quantifying the shape of the spectral envelope were also widely used, such as the Mel-frequency cepstral coefficients (MFCC) [23] [25] [2], and the linear prediction coefficients (LPC) [23] [1]. El-Maleh et al. [20] used descriptors quantifying the formant structure of the spectral envelope, the line spectral frequencies (LSF), as in this study (see section 3.1). By coupling the LSF and HOC features with a quadratic Gaussian classifier, the authors obtained a 95.9% average recognition accuracy with decisions made over 1 s-long audio segments, a procedure which performed slightly better than the algorithm by Scheirer and Slaney tested on the same dataset (an accuracy increase of approximately 2%). Contrary to the studies described above, which relied on generative methods, Ramona and Richard [44] developed a discriminative classification system relying on support vector machines (SVM) and median filtering post-processing, and compared diverse hierarchical and multi-class approaches depending on the grouping of the learning classes (speech only, music only, speech with musical background, and music with singing voice). The most relevant features amongst a large collection of about 600 features are selected using the inertia ratio maximization with feature space projection (IRMFSP) technique introduced in [42] and integrated over 1 s-long segments. The method provided an F-measure of 96.9% with a feature vector dimension of 50. Those results represent an error reduction of about 50% compared to the results gathered by the French ESTER evaluation campaign [22]. As will be further shown in section 5, we obtained performances favorably comparable to those provided by this algorithm. Surprisingly, all the mentioned studies evaluated the speech/music class recognition accuracy, but none, to our knowledge, evaluated the boundary retrieval performance commonly used to evaluate structural segmentation algorithms [35] (see section 4.3), which we also investigate in this work.
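As an illustration only, the short sketch below computes a subset of these classic descriptors (ZCR statistics, low-energy frame ratio, spectral centroid, 95% rolloff, and a crude spectral flux) with librosa; the function name, the 0.5 × mean low-energy threshold, and the integration choices are our own assumptions rather than the exact settings of [46] or [48].

```python
import numpy as np
import librosa

def classic_speech_music_features(y, sr):
    """Illustrative subset of the classic speech/music descriptors."""
    zcr = librosa.feature.zero_crossing_rate(y)[0]
    rms = librosa.feature.rms(y=y)[0]
    return {
        'zcr_mean': float(zcr.mean()),
        'zcr_diff_std': float(np.diff(zcr).std()),                    # ZCR skewness-related statistic
        'low_energy_ratio': float(np.mean(rms < 0.5 * rms.mean())),   # % of low-energy frames
        'centroid_mean': float(librosa.feature.spectral_centroid(y=y, sr=sr).mean()),
        'rolloff95_mean': float(librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.95).mean()),
        'flux_mean': float(np.mean(np.diff(np.abs(librosa.stft(y)), axis=1) ** 2)),  # crude spectral flux
    }
```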

3  Classification Frameworks

We propose two audio classification frameworks based on timbre models, applied in this work to the speech/music discrimination task. The architectures of both systems are represented in Figure 1. The first system (see Figure 1(a)) is based on the automatic timbre recognition (ATR) algorithm described in [7], initially developed for musical instrument recognition, and a post-processing step aiming at reducing undesired inter-class switches (smoothing by median filtering). This method will be denoted ATR. The second system (see Figure 1(b)) was designed to test whether the performance of the automatic timbre recognition system would be improved by a pre-processing step which divides the signal into segments of homogeneous timbre. To address this issue, the signal is first processed with an automatic structural segmentation (ASS) procedure [35].
[Figure 1: block diagrams of the two systems. (a) Classification based on automatic timbre recognition (ATR): testing audio -> timbre recognition (LSF, K, L) -> intermediate (short-term) classification -> post-processing (median filtering) -> segment-level classification. (b) Classification based on automatic structural segmentation and timbre recognition (ASS/ATR): testing audio -> structural segmentation (S, D) -> homogeneous segments -> timbre recognition (LSF, K, L) -> intermediate (short-term) classification -> post-processing (class decision) -> segment-level classification.]

Fig. 1. Architecture of the two proposed audio segmentation systems. The tuning parameters of the systems' components are also reported: number of line spectral frequencies (LSF), number of codevectors K, and latency L for the automatic timbre recognition module; size of the sliding window W used in the median filtering (post-processing); maximal number S of segment types and minimal duration D of segments for the automatic structural segmentation module.
Automatic timbre recognition (ATR) is then applied to the retrieved segments, and the segment-level classification decisions are obtained after a post-processing step whose role is to determine the class most frequently identified within each segment. This method will be denoted ASS/ATR. In the remainder of this section we first present the various acoustical correlates of timbre used by the systems, and then describe both methods in more detail.
3.1  Acoustical Correlates of Timbre

The two proposed systems rely on the assumption that speech and music can be discriminated based on their differences in timbre. Exhaustive computational models of timbre have not yet been found and the common definition used by scholars remains vague: "timbre is that attribute of auditory sensation in terms of which a listener can judge that two sounds similarly presented and having the same loudness and pitch are dissimilar; timbre depends primarily upon the spectrum of the stimulus, but it also depends on the waveform, the sound pressure, the frequency location of the spectrum, and the temporal characteristics of the stimulus" [3]. Research in psychoacoustics [24] [10] [51], analysis/synthesis [45],
music perception [4] [5], speech recognition [19], and music information retrieval [17] has however developed acoustical correlates of timbre characterizing some of the facets of this complex and multidimensional attribute.
The Two-fold Nature of Timbre: from Identity to Quality. One of the pioneers of timbre research, the French researcher and electroacoustic music composer Schaeffer, put forward a relevant paradox about timbre, wondering how a musical instrument's timbre could be defined considering that each of its tones also possesses a specific timbre [47]. Cognitive categorization theories shed light on Schaeffer's paradox by showing that sounds (respectively, objects) can be categorized either in terms of the sources from which they are generated, or simply as sounds (respectively, objects), in terms of the properties that characterize them [15]. These principles have been applied to timbre by Handel, who described timbre perception as being guided both by our ability to recognize the various physical factors that determine the acoustical signal produced by musical instruments [27] (later coined the source mode of timbre perception by Hajda et al. [26]), and by our ability to analyze the acoustic properties of sound objects perceived by the ear, traditionally modeled as a time-evolving frequency analyser (later coined the interpretative mode of timbre perception in [26]). In order to refer to this two-fold nature of timbre, we like to use the terms timbre identity and timbre quality, which were proposed in reference [38]. The timbre identity and quality facets of timbre perception have several properties: they are not independent but intrinsically linked together (e.g. we can hear a guitar tone and recognize the guitar, or we can hear a guitar tone and hear the sound for itself without thinking of the instrument); they are a function of the sounds we listen to (in some music, like musique concrète, the sound sources are deliberately hidden by the composer, hence the notion of timbre identity is different: it may refer to the technique employed by the musician, e.g. a specific filter); and their scope can vary (music lovers are often able to recognize the performer behind the instrument, extending the notion of identity to the very start of the chain of sound production, the musician who controls the instrument). Based on these considerations, we include the notion of sound texture, such as that produced by layers of instruments in music, into the definition of timbre. The notion of timbre identity in music may then be closely linked to a specific band, a sound engineer, or the musical genre, largely related to the instrumentation.
The Formant Theory of Timbre. Contrary to the classical theory of musical timbre advocated in the late 19th century by Helmholtz [29], timbre does not only depend on the relative proportions of the harmonic components of a (quasi-)harmonic sound; two straightforward experiments indeed show that timbre is highly altered when a sound a) is reversed in time, or b) is pitch-shifted by frequency translation of the spectrum, despite the fact that in both cases the relative energy ratios between harmonics are kept. The work of the phonetician Slawson showed that the timbre of voiced sounds is mostly characterized by the invariance of their spectral envelope through pitch changes, and therefore
a mostly fixed formant¹ structure, i.e. zones of high spectral energy (however, in the case of large pitch changes the formant structure needs to be slightly shifted for the timbral identity of the sounds to remain unchanged): "The popular notion that a particular timbre depends upon the presence of certain overtones (if that notion is interpreted as the relative pitch theory of timbre) is seen [...] to lead not to invariance but to large differences in musical timbre with changes in fundamental frequency. The fixed pitch or formant theory of timbre is seen in those same results to give much better predictions of the minimum differences in musical timbre with changes in fundamental frequency. The results [...] suggest that the formant theory may have to be modified slightly. A precise determination of minimum differences in musical timbre may require a small shift of the lower resonances, or possibly the whole spectrum envelope, when the fundamental frequency changes drastically." [49]. The findings by Slawson have causal and cognitive explanations. Sounds produced by the voice (spoken or sung) and by most musical instruments present a formant structure closely linked to resonances generated by one or several components implicated in their production (e.g. the vocal tract for the voice, the body for the guitar, the mouthpiece for the trumpet). It seems therefore legitimate, from the perceptual point of view, to suggest that the auditory system relies on the formant structure of the spectral envelope to discriminate such sounds (e.g. two distinct male voices of the same pitch, loudness, and duration), as proposed by the source or identity mode of timbre perception hypothesis mentioned earlier.
The timbre models used in this study to discriminate speech and music rely on features modeling the spectral envelope (see the next section). In these timbre models, the temporal dynamics of timbre are captured to a certain extent by performing signal analysis on successive frames where the signal is assumed to be stationary, and by the use of a hidden Markov model (HMM), as described in section 3.3. Temporal (e.g. attack time) and spectro-temporal parameters (e.g. spectral flux) have also been shown to be major correlates of timbre spaces, but these findings were obtained in studies which did not include speech sounds, only musical instrument tones produced either on different instruments (e.g. [40]) or within the same instrument (e.g. [6]). In situations where we discriminate timbres from various sources either implicitly (e.g. in everyday life) or explicitly (e.g. in a controlled experiment), it is most probable that the auditory system uses different acoustical cues depending on the typological differences between the considered sources. Hence, the descriptors used to account for timbre differences between musical instrument tones may not be adapted to the discrimination between speech and music sounds. If subtle timbre differences are possible within the same instrument, large timbre differences are expected to occur between disparate classes, such as speech and music, and those are liable to be captured by spectral envelope correlates. Music, generally being a mixture of musical instrument sounds played either synchronously in a polyphonic way or solo, may exhibit complex formant structures induced by its individual components (instruments), as well as by the recording conditions (e.g. room acoustics). Some composers, like Schoenberg, have explicitly used very subtle instrumentation rules to produce melodies that were shaped not by changes of pitch, as in traditional Western art music, but by changes of timbre (the latter were called Klangfarbenmelodie by Schoenberg, which literally means color melodies). Hence, if they occur, formant structures in music are likely to be much different from those produced by the vocal system. However, some intrinsic cases of confusion arise with music containing a predominant singing voice (e.g. in opera or choral music), since the singing voice shares timbral properties with the spoken voice. The podcast database with which we tested the algorithms included such types of mixture. Conversely, the mix of a voice with a strong musical background (e.g. in commercials or jingles) can also be a source of confusion in speech/music discrimination. This issue is addressed in [44], but not directly in this study. Sundberg [50] showed the existence of a singer's or singing formant around 3 kHz when analyzing performances by classically trained male singers, which he attributed to a clustering of the third, fourth and fifth resonances of the vocal tract. This difference between spoken and sung voices can potentially be captured by features characterizing the spectral envelope, such as the ones presented in the next section.

¹ In this article, a formant is considered as being a broad band of enhanced power present within the spectral envelope.
Spectral Envelope Representations: LP, LSF, and MFCC. Spectral envelopes can be obtained either from linear prediction (LP) [31] or from mel-frequency cepstral coefficients (MFCC) [17], which both offer a good representation of the spectrum while keeping a small number of features. Linear prediction is based on the source-filter model of sound production developed for speech coding and synthesis. Synthesis based on linear predictive coding is performed by processing an excitation signal (e.g. modeling the glottal excitation in the case of voice production) with an all-pole filter (e.g. modeling the resonances of the vocal tract in the case of voice production). The coefficients of the filter are computed on a frame-by-frame basis from the autocorrelation of the signal. The frequency response of the LP filter hence represents the short-time spectral envelope. Itakura derived from the coefficients of the inverse LP filter a set of features, the line spectral frequencies (LSF), suitable for efficient speech coding [31]. The LSF have the interesting property of being correlated in a pairwise manner with the formant frequencies: two adjacent LSF localize a zone of high energy in the spectrum. The automatic timbre recognition model described in section 3.2 exploits this property of the LSF. MFCCs are computed from the logarithm of the spectrum computed on a mel scale (a perceptual frequency scale emphasizing low frequencies), by taking either the inverse Fourier transform or a discrete cosine transform. A spectral envelope can be represented by considering the first 20 to 30 MFCC coefficients. In [51], Terasawa et al. established that MFCC parameters are a good perceptual representation of timbre for static sounds. The automatic structural segmentation technique, a component of the second classification method described in section 3.3, was employed using MFCC features.
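As an illustration of the LP-to-LSF conversion outlined above, the minimal sketch below derives the LSFs of one windowed frame from the inverse-filter coefficients via the symmetric and antisymmetric sum/difference polynomials; it assumes an even LPC order and uses librosa's LPC routine, and the function name and root-filtering tolerance are our own choices.

```python
import numpy as np
import librosa

def lsf_from_frame(frame, order=24):
    """Line spectral frequencies of one frame (even LPC order assumed)."""
    a = librosa.lpc(frame, order=order)        # inverse LP filter coefficients, a[0] == 1
    a_ext = np.concatenate([a, [0.0]])
    p_poly = a_ext + a_ext[::-1]               # symmetric sum polynomial P(z)
    q_poly = a_ext - a_ext[::-1]               # antisymmetric difference polynomial Q(z)
    angles = np.angle(np.concatenate([np.roots(p_poly), np.roots(q_poly)]))
    lsf = np.sort(angles[(angles > 1e-6) & (angles < np.pi - 1e-6)])
    return lsf                                 # 'order' values in (0, pi), in rad/sample
```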
3.2  Classification Based on Automatic Timbre Recognition (ATR)


The method is based on the timbre recognition system proposed in [7] and [16],
which we describe in the remainder of this section.
Feature Extraction. The algorithm relies on a frequency domain representation of the signal using short-term spectra (see Figure 2). The signal is rst
decomposed into overlapping frames of equal size obtained by multiplying blocks
of audio data with a Hamming window to further minimize spectral distortion.
The fast Fourier transform (FFT) is then computed on a frame-by-frame basis.
The LSF features described above are extracted using the short-term spectra.
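A minimal sketch of this framing step is given below; the 1024-sample frame length and 256-sample hop are the values reported later in section 5.1, and the function name is ours.

```python
import numpy as np

def frames_spectra(y, frame_len=1024, hop=256):
    """Overlapping Hamming-windowed frames and their short-term spectra."""
    win = np.hamming(frame_len)
    starts = range(0, len(y) - frame_len + 1, hop)
    frames = np.stack([y[s:s + frame_len] * win for s in starts])
    return np.fft.rfft(frames, axis=1)         # one short-term spectrum per frame
```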
Classifier. The classification process is based on the unsupervised K-means clustering technique at both the training and the testing stages. The principle of K-means clustering is to partition an n-dimensional space (here the LSF feature space) into K distinct regions (or clusters), which are characterized by their centres (called codevectors). The collection of the K codevectors (LSF vectors) constitutes a codebook, whose function, within this context, is to capture the most relevant features to characterize the timbre of an audio signal segment. Hence, to a certain extent, the K-means clustering can here be viewed both as a classifier and as a technique of feature selection in time. The clustering of the feature space is performed according to the Linde-Buzo-Gray (LBG) algorithm [36].

Fig. 2. Automatic timbre recognition system based on line spectral frequencies and
K-means clustering
During the training stage, each class is attributed an optimized codebook by performing the K-means clustering on all the associated training data. During the testing stage, the K-means clustering is applied to blocks of audio data, or decision horizons (collections of overlapping frames), the duration of which can be varied to modify the latency L of the classification (see Figure 2). The intermediate classification decision is obtained by finding the class which minimizes a codebook-to-codebook distortion measure based on the Euclidean distance [16]. As discussed in section 4.1, we tested various speech and music training class taxonomies (e.g. separating male and female voices for the speech class) to further enhance the performance of the recognition.
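The sketch below illustrates this training/testing scheme; scikit-learn's K-means stands in for the LBG algorithm, the nearest-codevector Euclidean distortion is a simple stand-in for the exact codebook-to-codebook measure of [16], and the function names are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebooks(lsf_per_class, k=32):
    """One K-codevector codebook per training class (dict: label -> LSF matrix)."""
    return {label: KMeans(n_clusters=k, n_init=10).fit(feats).cluster_centers_
            for label, feats in lsf_per_class.items()}

def codebook_distortion(cb_test, cb_ref):
    """Mean distance from each test codevector to its nearest reference codevector."""
    d = np.linalg.norm(cb_test[:, None, :] - cb_ref[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def classify_horizon(lsf_horizon, codebooks, k=32):
    """Cluster the decision horizon, then pick the class whose codebook
    minimises the codebook-to-codebook distortion."""
    cb_test = KMeans(n_clusters=k, n_init=10).fit(lsf_horizon).cluster_centers_
    return min(codebooks, key=lambda lab: codebook_distortion(cb_test, codebooks[lab]))
```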
Post-processing. Given that one of our ultimate goals is to accurately locate the temporal start and end positions of speech and music sections, relatively short decision horizons are required (a 1 s latency was used in the experiments). A drawback of this method, though, is that even if the LSF/K-means-based algorithm achieves high levels of class recognition accuracy (for example, it might correctly classify music sections 90% of the time, see section 5), there can be undesirable switches from one retrieved class to another. This sometimes rapid variation between speech and music classifications makes it difficult to accurately identify the start and end points of speech and music sections. Choosing longer classification intervals, however, decreases the resolution with which we are able to pinpoint any potential start or end time. In an attempt to alleviate this problem, we performed some post-processing on the initial results obtained with the LSF/K-means-based algorithm. All music sections are attributed a numerical class index of 0, and all speech sections a class index of 1. The results are resampled at 0.1 s intervals and then processed through a median filter. Median filtering is a nonlinear digital filtering technique which has been widely used in digital image processing and speech/music information retrieval to remove noise, e.g. in the peak-picking stage of an onset detector in [8], or, for the same purpose as in this work, to enhance speech/music discrimination in [32] and [44]. Median filtering has the effect of smoothing out regions of high variation. The size W of the sliding window used in the median filtering process was empirically tuned (see section 5). Contiguous short-term classifications of the same type (speech or music) are then merged together to form segment-level classifications. Figure 3 shows a comparison between podcast ground truth annotations and typical classification results before and after post-processing.
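A minimal sketch of this smoothing step, assuming SciPy's median filter, is shown below; the 0.1 s resampling period matches the text, the 20 s window is the value reported in section 5.1, and the function name is ours.

```python
import numpy as np
from scipy.signal import medfilt

def smooth_decisions(times, labels, step=0.1, win_s=20.0):
    """Resample framewise music(0)/speech(1) decisions onto a regular grid,
    then median-filter them to remove spurious inter-class switches."""
    times, labels = np.asarray(times), np.asarray(labels, dtype=float)
    grid = np.arange(times[0], times[-1], step)
    resampled = labels[np.searchsorted(times, grid, side='right') - 1]
    kernel = int(round(win_s / step)) | 1      # medfilt requires an odd kernel size
    return grid, medfilt(resampled, kernel_size=kernel)
```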
Software Implementation. The intermediate classification decisions were obtained with the Vamp [13] musical instrument recognition plugin [7] trained for music and speech classes. The plugin works interactively with the Sonic Visualiser host application, developed to analyse and visualise music-related information from audio files [12]. The latency L of the classification (duration of the decision horizon) can be varied between 0.5 s and 10 s. In this study, we used a 1 s latency in order to keep a good time resolution/performance ratio [7]. The median filtering post-processing was performed in Matlab.
Fig. 3. Podcast ground truth annotations (a), classification results at 1 s intervals (b), and post-processed results (c)
An example of the detection of a transition from a speech part to a music part within Sonic Visualiser is shown in Figure 4.
3.3  Classification Based on Automatic Structural Segmentation and Timbre Recognition (ASS/ATR)

Automatic Structural Segmentation. We used the structural segmentation technique based on constrained clustering initially proposed in [35] for automatic music structure extraction (chorus, verse, etc.). The technique is thoroughly described in [21], a study in which it is applied to the intelligent editing of studio recordings.
The technique relies on the assumption that the distributions of timbre features are similar over music structural elements of the same type. The high-level song structure is hence determined from structural/timbral similarity. In this study we extend the application of the technique to audio broadcast content (speech and music parts) without focusing on the structural fluctuations within the parts themselves.
Fig. 4. Example of the detection of a transition between speech and music sections in a podcast using the Vamp timbre recognition transform jointly with Sonic Visualiser

The legitimacy of porting the technique to speech/music discrimination relies on the fact that a higher level of similarity is expected between the various spoken parts, on the one hand, and between the various music parts, on the other hand.
The algorithm, implemented as a Vamp plugin [43], is based on a frequency-domain representation of the audio signal using either a constant-Q transform, a chromagram or mel-frequency cepstral coefficients (MFCC). For the reasons mentioned earlier in section 3.1, we chose the MFCCs as underlying features in this study. The extracted features are normalised in accordance with the MPEG-7 standard (normalized audio spectrum envelope [NASE] descriptor [34]), by expressing the spectrum on the decibel scale and normalizing each spectral vector by the root mean square (RMS) energy envelope. This stage is followed by the extraction of 20 principal components per block of audio data using principal component analysis. The 20 PCA components and the RMS envelope constitute a sequence of 21-dimensional feature vectors. A 40-state hidden Markov model (HMM) is then trained on the whole sequence of features (Baum-Welch algorithm), each state of the HMM being associated with a specific timbre quality. After training and decoding (Viterbi algorithm) the HMM, the signal is assigned a sequence of timbre features according to specific timbre quality distributions for each possible structural segment. The minimal duration D of expected structural segments can be tuned. The segmentation is then computed by clustering timbre quality histograms. A series of histograms is created using a sliding window and then grouped into S clusters with an adapted soft K-means
algorithm. Each of these clusters corresponds to a specific type of segment in the analyzed signal. The reference histograms describing the timbre distribution for each segment are updated iteratively during clustering. The final segmentation is obtained from the final cluster assignments.
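The sketch below gives a rough, simplified picture of this pipeline (MFCC features, HMM timbre states, sliding-window state histograms, clustering into segment types) using librosa, hmmlearn and scikit-learn; it omits the NASE normalisation, the PCA step, the minimum-duration constraint and the constrained/soft clustering used by the actual plugin, and all names and window sizes are our own assumptions.

```python
import numpy as np
import librosa
from hmmlearn.hmm import GaussianHMM
from sklearn.cluster import KMeans

def rough_structural_segmentation(y, sr, n_states=40, hist_win=40, n_types=8):
    """Sketch: MFCCs -> 40-state HMM timbre states -> state histograms -> S clusters."""
    feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).T                # (frames, 20)
    states = GaussianHMM(n_components=n_states, covariance_type='diag',
                         n_iter=20).fit(feats).predict(feats)
    hists = np.array([np.bincount(states[i:i + hist_win], minlength=n_states)
                      for i in range(0, len(states) - hist_win, hist_win // 2)])
    return KMeans(n_clusters=n_types, n_init=10).fit_predict(hists)      # one segment type per window
```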
Automatic Timbre Recognition. Once the signal has been divided into segments assumed to be homogeneous in timbre, the latter are processed with the automatic timbre recognition technique described in section 3.2 (see Figure 1(b)). This yields intermediate classification decisions defined on a short-term basis (depending on the latency L used in the ATR model).
Post-processing. Segment-level classifications are then obtained by choosing the class that appears most frequently amongst the short-term classification decisions made within each segment.
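In code, this class decision reduces to a simple majority vote over the intermediate labels within a segment (hypothetical function name):

```python
from collections import Counter

def segment_label(short_term_labels):
    """Majority vote over the short-term ATR decisions within one segment."""
    return Counter(short_term_labels).most_common(1)[0][0]
```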
Software Implementation. The automatic structural segmenter Vamp plugin [43] [35] was run from the terminal using Sonic Annotator [11], a batch tool for feature extraction. Each of the retrieved segments was then processed, using a Python script, with the automatic timbre recognition Vamp plugin [7] previously trained for the speech and music classes. The segment-level classification decisions were also computed using a Python script.

4  Experiments

Several experiments were conducted to evaluate and compare the performances of the speech/music discrimination ATR and ASS/ATR methods presented in sections 3.2 and 3.3, respectively. In this section, we first describe the experimental protocols and the training and testing databases. The evaluation measures computed to assess the class identification and boundary accuracy of the systems are then specified.
4.1  Protocols

Influence of the Training Class Taxonomy. In a first set of experiments, we evaluated the precision of the ATR model according to the taxonomy used to represent the speech and music content in the training data. The classes associated with the two taxonomic levels schematized in Figure 5 were used to train the ATR model. The first level corresponds to a coarse division of the audio content into two classes: speech and music. Given that common spectral differences may be observed between male and female speech signals, due to differences in vocal tract morphology, and that musical genres are often associated with different sound textures or timbres, due to changes of instrumentation, we sought to establish whether there was any benefit to be gained by training the LSF/K-means algorithm on a wider, more specific set of classes. Five classes were chosen: two to represent speech (male speech and female speech), and three to represent music according to genre (classical, jazz, and rock & pop).
[Figure 5: taxonomy tree. Level I: speech, music. Level II: speech m, speech f (under speech); classical, jazz, rock & pop (under music).]

Fig. 5. Taxonomy used to train the automatic timbre recognition model in the speech/music discrimination task. The first taxonomic level is associated with a training stage with two classes: speech and music. The second taxonomic level is associated with a training stage with five classes: male speech (speech m), female speech (speech f), classical, jazz, and rock & pop music.

The classifications obtained using the algorithm trained on the second, wider set of classes are later mapped back down to either speech or music, both to evaluate their agreement with the ground truth data and to allow the two training schemes to be compared. To make a fair comparison between the two schemes, we kept the same training excerpts in both cases and hence kept the content duration for the speech and music classes constant.
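The mapping from the five training classes back to the two evaluation classes is a simple lookup; the label strings below are hypothetical stand-ins for the identifiers actually used during training.

```python
# Hypothetical label identifiers for the five-class taxonomy (Level II -> Level I)
FIVE_TO_TWO = {
    'speech_m': 'speech', 'speech_f': 'speech',
    'classical': 'music', 'jazz': 'music', 'rock_pop': 'music',
}

def to_binary(labels):
    """Collapse five-class ATR decisions to speech/music for evaluation."""
    return [FIVE_TO_TWO[lab] for lab in labels]
```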
Comparison Between ATR and ASS/ATR. A second set of experiments was performed to compare the performances of the ATR and ASS/ATR methods. In these experiments the automatic timbre recognition model was trained with five classes (second taxonomic level), the configuration which led to the best performances (see section 5.1). The number of clusters used in the K-means classifier of the ATR method was kept constant, tuned to a value that yielded good results in a musical instrument recognition task (K = 32) [16]. In order to find the best configuration, the number of line spectral frequencies was varied in the feature extraction stage (LSF = {8; 16; 24; 32}), since the number of formants in speech and music spectra is not known a priori and is not expected to be the same. While the voice is typically associated with four or five formants (hence 8 or 10 LSFs), this number may be higher in music due to the superposition of various instrument sounds. The parameter of the automatic structural segmentation algorithm setting the minimal duration of retrieved segments was set to 2 s, since shorter events are not expected and longer durations could decrease the boundary retrieval accuracy. Since the content of audio podcasts can be broad (see section 4.2), the maximal number of segments S of the ASS was varied between 7 and 12. Classification tests were also performed with the algorithm proposed by Ramona and Richard in [44], which provides a good benchmark of the state-of-the-art performance for speech/music discriminators. This algorithm, which relies on a feature-based approach with a support vector machine classifier (previously described in section 2), is however computationally expensive since a large collection of about 600 features of various types (temporal, spectral, cepstral, and perceptual) is computed in the training stage.

4.2  Database

The training data used in the automatic timbre recognition system consisted of a number of audio clips extracted from a wide variety of radio podcasts from BBC 6 Music (mostly pop) and BBC Radio 3 (mostly classical and jazz) programmes. The clips were manually auditioned and then classified either as speech or music, when the ATR model was trained with two classes, or as male speech, female speech, classical music, jazz music, and rock & pop music, when the ATR model was trained with five classes. These manual classifications constituted the ground truth annotations further used in the algorithm evaluations. All speech was in English, and the training audio clips, whose durations are shown in Table 1, amounted to approximately 30 min of speech and 15 min of music.
For testing purposes, four podcasts different from the ones used for training (hence containing different speakers and music excerpts) were manually annotated using terms from the following vocabulary: speech, multi-voice speech, music, silence, jingle, efx (effects), tone, tones, beats. Mixtures of these terms were also employed (e.g. speech + music, to represent speech with background music). The music class included cases where a singing voice was predominant (opera and choral music). More detailed descriptions of the podcast material used for testing are given in Tables 2 and 3.
4.3  Evaluation Measures

We evaluated the speech/music discrimination methods with regard to two aspects: (i) their ability to correctly identify the considered classes (semantic level), and (ii) their ability to correctly retrieve the boundary locations between classes (temporal level).
Relative Correct Overlap. Several evaluation measures have been proposed to assess the performance of audio content classifiers, depending on the time scale considered for the comparison between the machine-estimated classifications and the ground-truth annotations used as reference [28]. The accuracy of the models can indeed be measured on a frame-level basis, by resampling the ground-truth annotations at the frequency used to make the estimations, or on a segment-level basis, by considering the relative proportion of correctly identified segments.
Table 1. Audio training data durations. Durations are expressed in the following format: HH:MM:SS (hours:minutes:seconds).

                      Training class    Total duration of audio clips (HH:MM:SS)
Two-class training    Speech            00:27:46
                      Music             00:14:30
Five-class training   Male speech       00:19:48
                      Female speech     00:07:58
                      Total speech      00:27:46
                      Classical         00:03:50
                      Jazz              00:07:00
                      Rock & Pop        00:03:40
                      Total music       00:14:30
Table 2. Podcast content

Podcast   Nature of content
1         Male speech, rock & pop songs, jingles, and a small amount of electronic music
2         Speech and classical music (orchestral and opera)
3         Speech, classical music (choral, solo piano, and solo organ), and folk music
4         Speech, and punk, rock & pop with jingles

Table 3. Audio testing data durations. Durations are expressed in the following format: HH:MM:SS (hours:minutes:seconds).

Podcast   Total duration   Speech duration   Music duration
1         00:51:43         00:38:52          00:07:47
2         00:53:32         00:32:02          00:18:46
3         00:45:00         00:18:09          00:05:40
4         00:47:08         00:06:50          00:32:31
Total     03:17:23         01:35:53          01:04:43

We applied the latter, segment-based method by computing the relative correct overlap (RCO) measure used to evaluate algorithms in the Music Information Retrieval Evaluation eXchange (MIREX) competition [39]. The relative correct overlap is defined as the cumulated duration of the segments where the correct class has been identified, normalized by the total duration of the annotated segments:

\[
RCO = \frac{|\{\text{estimated segments}\} \cap \{\text{annotated segments}\}|}{|\{\text{annotated segments}\}|} \tag{1}
\]

where $\{\cdot\}$ denotes a set of segments, and $|\cdot|$ their duration. When comparing the machine-estimated segments with the manually annotated ones, any sections not labelled as speech (male or female), multi-voice speech, or music (classical, jazz, rock & pop) were disregarded due to their ambiguity (e.g. jingles). The durations of these disregarded parts are stated in the results section.
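For illustration, a minimal grid-approximated computation of this measure could look as follows; the (start, end, label) segment representation, the 0.1 s grid step, and the function name are our own assumptions.

```python
def relative_correct_overlap(estimated, annotated, step=0.1):
    """RCO: duration over which the estimated class matches the annotation,
    normalised by the total annotated duration. Segments are (start, end, label)
    tuples in seconds; unannotated/ambiguous sections are ignored."""
    def label_at(segments, t):
        return next((lab for s, e, lab in segments if s <= t < e), None)
    correct = total = 0.0
    t, t_end = 0.0, max(e for _, e, _ in annotated)
    while t < t_end:
        ref = label_at(annotated, t)
        if ref is not None:
            total += step
            if label_at(estimated, t) == ref:
                correct += step
        t += step
    return correct / total if total else 0.0
```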
Boundary Retrieval F-measure. In order to assess the precision with which the algorithms are able to detect the time locations of transitions from one class to another (i.e. the start/end of speech and music sections), we computed the boundary retrieval F-measure proposed in [35] and used in MIREX to evaluate the temporal accuracy of automatic structural segmentation methods [41]. The boundary retrieval F-measure, denoted F in the following, is defined as the harmonic mean between the boundary retrieval precision P and recall R ($F = \frac{2PR}{P+R}$). The boundary retrieval precision and recall are obtained by counting the numbers of
correctly detected boundaries (true positives, tp), false detections (false positives, fp), and missed detections (false negatives, fn) as follows:

\[
P = \frac{tp}{tp + fp} \tag{2}
\]

\[
R = \frac{tp}{tp + fn} \tag{3}
\]
Hence, the precision and the recall can be viewed as measures of exactness and completeness, respectively. As in [35] and [41], the number of true positives was determined using a tolerance window of duration T = 3 s: a retrieved boundary is considered to be a hit (correct) if its time position $l$ lies within the range $\hat{l} - \frac{T}{2} \leq l \leq \hat{l} + \frac{T}{2}$ around an annotated boundary $\hat{l}$. This method of computing the F-measure is also used in onset detector evaluation [18] (the tolerance window in the latter case being much shorter). Before comparing the manually and the machine-estimated boundaries, a post-processing was performed on the ground-truth annotations in order to remove the internal boundaries between two or more successive segments whose types were discarded in the classification process (e.g. the boundary between a jingle and a sound effect section).
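A minimal sketch of this boundary matching, assuming each annotated boundary can be matched to at most one retrieved boundary within the ±T/2 window, is given below (function name ours).

```python
def boundary_prf(estimated, annotated, tol=3.0):
    """Boundary precision, recall and F-measure with a +/- tol/2 tolerance window."""
    est, ann = sorted(estimated), sorted(annotated)
    matched, tp = set(), 0
    for b_ann in ann:
        hits = [i for i, b_est in enumerate(est)
                if i not in matched and abs(b_est - b_ann) <= tol / 2]
        if hits:                       # greedily accept the first unmatched hit
            matched.add(hits[0])
            tp += 1
    fp, fn = len(est) - tp, len(ann) - tp
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```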

5  Results and Discussion

In this section, we present and discuss the results obtained for the two sets of experiments described in section 4.1. In both sets of experiments, all audio training clips were extracted from 128 kbps, 44.1 kHz, 16-bit stereo mp3 files (mixed down to mono), and the podcasts used in the testing stage were full-duration mp3 files of the same format.
5.1  Influence of the Training Class Taxonomies in the ATR Model

Analysis Parameters. The LSF/K-means algorithm in the automatic timbre recognition model was run with a window length of 1024 samples (approximately 20 ms) and a hop length of 256 samples (approximately 5 ms). A combination of 24 line spectral frequencies and 32 codevectors was used, as in [7]. During testing, the intermediate classifications were made with a latency of 1 s. The post-processing of the machine-estimated annotations was performed by resampling the data with a sampling period of 0.1 s, and processing them with a median filter using a 20 s-long window.
Performances. Table 4 shows the relative correct overlap (RCO) performances of the speech/music discriminator based on automatic timbre recognition for each of the four podcasts used in the test set, as well as the overall results (podcasts 1 to 4 combined). The sections that were neither speech nor music and were disregarded lasted 36 min 50 s in total. The RCO measures are given both when the model was trained on only two classes (music and speech) and when it was trained on five classes (male speech, female speech, classical, jazz, and rock & pop).
Table 4. Influence of the training class taxonomy on the performances of the automatic timbre recognition model, assessed at the semantic level with the relative correct overlap (RCO) measure

ATR model - RCO measure (%)
                          Speech                       Music
Podcast              Two class   Five class     Two class   Five class
1 (Rock & pop)         90.5        91.9           93.7        94.5
2 (Classical)          91.8        93.0           97.8        99.4
3 (Classical)          88.3        91.0           76.1        82.7
4 (Rock & pop)         48.7        63.6           99.8        99.9
Overall                85.2        89.2           96.6        97.8

We see from the table that training the ATR model on five classes instead of two improved the classification performances in all cases, most notably for the speech classifications of podcast 4 (an increase of 14.9 percentage points, from 48.7% to 63.6%) and for the music classifications of podcast 3 (up from 76.1% to 82.7%, an increase of 6.6 percentage points). In all other cases, the increase is more modest, between 0.1 and 2.7 percentage points. The combined results show an increase in RCO of 4 percentage points for speech and 1.2 percentage points for music when training on five classes instead of two.
5.2  Comparison Between ATR and ASS/ATR

Analysis Parameters. The automatic timbre model was trained with five classes, since this configuration gave the best RCO performances. Regarding the ATR method, the short-term analysis was performed with a window of 1024 samples, a hop size of 256 samples, and K = 32 codevectors, as in the first set of experiments. However, in this set of experiments the number of line spectral frequencies LSF was varied between 8 and 32 in steps of 8, and the duration of the median filtering window was tuned accordingly through experimentation. The automatic structural segmenter Vamp plugin was used with the default window and hop sizes (26460 samples, i.e. 0.6 s, and 8820 samples, i.e. 0.2 s, respectively), parameters defined based on typical beat lengths in music [35]. Five different numbers of segments S were tested (S = {5; 7; 8; 10; 12}). The best relative correct overlap and boundary retrieval performances were obtained with S = 8 and S = 7, respectively.
Relative Correct Overlap Performances. Table 5 presents a comparison of the relative correct overlap (RCO) results obtained for the proposed speech/music discriminators based on automatic timbre recognition (ATR) and on automatic structural segmentation and timbre recognition (ASS/ATR). The performances obtained with the algorithm from Ramona and Richard [44] are also reported. The ATR and ASS/ATR methods obtained very similar relative correct overlaps. For both methods, the best configuration was obtained with the lowest number of features (LSF = 8), yielding high average RCOs: 94.4% for ATR and 94.5% for ASS/ATR. The algorithm from [44] obtained a slightly higher average RCO (an increase of approximately 3 percentage points) but may require more computation than the ATR method (the computation time has not been measured in these experiments).
Table 5. Comparison of the relative correct overlap performances for the ATR and ASS/ATR methods, as well as the SVM-based algorithm from [44]. For each method, the best average result (combining speech and music) is indicated in bold.

RCO (%)                    ATR (LSF number)              ASS/ATR (LSF number)           SVM [44]
Podcast  Class           8      16     24     32       8      16     24     32
1        speech         94.8   94.8   94.7   94.3     96.9   95.8   96.9   96.9          97.5
1        music          94.9   92.4   90.8   92.8     84.3   82.5   82.3   86.3          94.1
2        speech         94.2   95.4   92.9   92.8     96.3   96.3   96.3   96.1          97.6
2        music          98.8   98.7   98.8   98.1     97.1   94.2   96.5   96.9          99.9
3        speech         96.7   96.9   93.5   92.0     96.4   95.3   93.6   93.5          97.2
3        music          96.1   79.0   76.8   77.4     92.3   85.8   77.5   83.5          96.9
4        speech         55.3   51.9   56.4   58.9     61.8   48.5   60.2   65.6          88.6
4        music          99.5   99.5   99.9   99.5     99.7   100    100    100           99.5
Overall  speech         90.3   90.0   89.5   89.5     92.8   89.4   92.0   92.8          96.8
Overall  music          98.5   96.1   96.1   96.3     96.2   94.4   94.3   95.8          98.8
Average                 94.4   93.1   92.8   92.9     94.5   91.9   93.2   94.3          97.3

ATR method (the computation time has not been measured in these experiments). The lower performances obtained by the three compared methods for
the speech class of the fourth podcast is to be nuanced by the very short proportion of spoken excerpts within this podcast (see Table 3), which hence does
not aect much the overall results. The good performances obtained with a low
dimensional LSF vector can be explained by the fact that the voice has a limited
number of formants that are therefore well characterized by a small number of
line spectral frequencies (LSF = 8 corresponds to the characterization of 4 formants). Improving the recognition accuracy for the speech class diminishes the
confusions made with the music class, which explains the concurrent increase of
RCO for the music class when LSF = 8. When considering the class identication accuracy, the ATR method conducted with a low number of LSF hence
appears interesting since it is not computationally expensive relatively to the
performances of modern CPUs (linear predictive lter determination, computation of 8 LSFs, K-means clustering and distance computation). For the feature
vectors of higher dimensions, the higher-order LSFs may contain information
associated with the noise in the case of the voice which would explain the drop
of overall performances obtained with LSF = 16 and LSF = 24. However the
RCOs obtained when LSF = 32 are very close to that obtained when LSF =
8. In this case, the higher number of LSF may be adapted to capture the more
complex formant structures of music.
Boundary Retrieval Performances. The boundary retrieval performance measures (F-measure, precision P, and recall R) obtained for the ATR, ASS/ATR, and SVM-based method from [44] are reported in Table 6.
Table 6. Comparison of the boundary retrieval measures (F-measure, precision P, and recall R) for the ATR and ASS/ATR methods, as well as the SVM-based algorithm from [44]. For each method, the best overall result is indicated in bold.

Boundary retrieval         ATR (LSF number)        ASS/ATR (LSF number)           SVM [44]
Podcast  Measure (%)     8      16     24        8      16     24     32
1        P              40.0   45.7   31.0      43.6   36.0   37.2   34.1          36.0
1        R              21.3   34.0   19.1      36.2   38.3   34.0   31.9          57.4
1        F              27.8   39.0   23.7      39.5   37.1   35.6   33.0          44.3
2        P              61.5   69.0   74.1      72.7   35.3   84.6   71.9          58.2
2        R              37.5   31.3   31.3      37.5   37.5   34.4   35.9          60.9
2        F              46.6   43.0   44.0      49.5   36.4   48.9   47.9          59.5
3        P              69.2   54.5   56.7      75.0   68.0   60.4   64.0          67.3
3        R              24.3   32.4   23.0      44.6   45.9   43.2   43.2          50.0
3        F              36.0   40.7   32.7      55.9   54.8   50.4   51.6          57.4
4        P              11.7   12.3   21.7      56.7   57.1   57.7   48.5          28.6
4        R              21.9   21.9   15.6      53.1   50.0   46.9   50.0          50.0
4        F              15.2   15.7   18.2      54.8   53.3   51.7   49.2          57.4
Overall  P              23.3   40.6   46.8      62.3   46.9   57.4   54.1          47.0
Overall  R              27.2   30.9   23.5      41.9   42.4   39.2   39.6          54.8
Overall  F              32.2   35.1   31.3      50.1   44.6   46.6   45.7          50.6

As opposed to the relative correct overlap evaluation, where the ATR and ASS/ATR methods obtained similar performances, the ASS/ATR method clearly outclassed the ATR method regarding boundary retrieval accuracy. The best overall F-measure of the ASS/ATR method (50.1% with LSF = 8) is approximately 15 percentage points higher than the one obtained with the ATR method (35.1% for LSF = 16). This shows the benefit of using the automatic structural segmenter prior to the timbre recognition stage to locate the transitions between the speech and music sections. As in the previous set of experiments, the best configuration was obtained with a small number of LSF features (ASS/ATR method with LSF = 8), which stems from the fact that the boundary positions are a consequence of the classification decisions. For all the tested podcasts, the ASS/ATR method yields a better precision than the SVM-based algorithm. The most notable difference occurs for the second podcast, where the precision of the ASS/ATR method (72.7%) is approximately 14 percentage points higher than the one obtained with the SVM-based algorithm (58.2%). The resulting increase in overall precision achieved with the ASS/ATR method (62.3%) compared with the SVM-based method (47.0%) is approximately 15 percentage points. The SVM-based method, however, obtains a better overall boundary recall measure (54.8%) than the ASS/ATR method (42.4%), making the boundary F-measures of both methods very close (50.6% and 50.1%, respectively).

6  Summary and Conclusions

We proposed two methods for speech/music discrimination based on timbre models and machine learning techniques, and compared their performances on audio podcasts.
The first method (ATR) relies on automatic timbre recognition (LSF/K-means) and median filtering. The second method (ASS/ATR) performs an automatic structural segmentation (MFCC, RMS / HMM, K-means) before applying the timbre recognition system. The algorithms were tested with more than 2.5 hours of speech and music content extracted from popular and classical music podcasts from the BBC. Some of the music tracks contained a predominant singing voice, which can be a source of confusion with the spoken voice. The algorithms were evaluated both at the semantic level, to measure the quality of the retrieved segment-type labels (classification relative correct overlap), and at the temporal level, to measure the accuracy of the retrieved boundaries between sections (boundary retrieval F-measure). Both methods obtained similar and relatively high segment-type labeling performances. The ASS/ATR method led to an RCO of 92.8% for speech and 96.2% for music, yielding an average performance of 94.5%. The boundary retrieval performances were higher for the ASS/ATR method (F-measure = 50.1%), showing the benefit of using a structural segmentation technique to locate transitions between different timbral qualities. The results were compared against the SVM-based algorithm proposed in [44], which provides a good benchmark of state-of-the-art speech/music discriminators. The performances obtained by the ASS/ATR method were approximately 3 percentage points lower than those obtained with the SVM-based method for the segment-type labeling evaluation, but led to better boundary retrieval precisions (approximately 15 percentage points higher).
The boundary retrieval scores were clearly lower, for the three compared methods, than the segment-type labeling performances, which were fairly high, up to 100% correct identification in some cases. Future work will be dedicated to refining the accuracy of the section boundaries, either by performing a new analysis of the feature variations locally around the retrieved boundaries, or by including descriptors complementary to the timbre-based ones, e.g. rhythmic information such as tempo, whose fluctuations around speech/music transitions may give complementary clues to detect them accurately. The discrimination of intricate mixtures of music, speech, and sometimes strong post-production sound effects (e.g. the case of jingles) will also be investigated.
Acknowledgments. This work was partly funded by the Musicology for the Masses (M4M) project (EPSRC grant EP/I001832/1, http://www.elec.qmul.ac.uk/digitalmusic/m4m/), the Online Music Recognition and Searching 2 (OMRAS2) project (EPSRC grant EP/E017614/1, http://www.omras2.org/), and a studentship (EPSRC grant EP/505054/1). The authors wish to thank Matthew Davies from the Centre for Digital Music for sharing his F-measure computation Matlab toolbox, as well as György Fazekas for fruitful discussions on the structural segmenter. Many thanks to Mathieu Ramona from the Institut de Recherche et Coordination Acoustique/Musique (IRCAM) for sending us the results obtained with his speech/music segmentation algorithm.
References
1. Ajmera, J., McCowan, I., Bourlard, H.: Robust HMM-Based Speech/Music Segmentation. In: Proc. ICASSP 2002, vol. 1, pp. 297-300 (2002)
2. Alexandre-Cortizo, E., Rosa-Zurera, M., Lopez-Ferreras, F.: Application of Fisher Linear Discriminant Analysis to Speech/Music Classification. In: Proc. EUROCON 2005, vol. 2, pp. 1666-1669 (2005)
3. ANSI: USA Standard Acoustical Terminology. American National Standards Institute, New York (1960)
4. Barthet, M., Depalle, P., Kronland-Martinet, R., Ystad, S.: Acoustical Correlates of Timbre and Expressiveness in Clarinet Performance. Music Perception 28(2), 135-153 (2010)
5. Barthet, M., Depalle, P., Kronland-Martinet, R., Ystad, S.: Analysis-by-Synthesis of Timbre, Timing, and Dynamics in Expressive Clarinet Performance. Music Perception 28(3), 265-278 (2011)
6. Barthet, M., Guillemain, P., Kronland-Martinet, R., Ystad, S.: From Clarinet Control to Timbre Perception. Acta Acustica united with Acustica 96(4), 678-689 (2010)
7. Barthet, M., Sandler, M.: Time-Dependent Automatic Musical Instrument Recognition in Solo Recordings. In: 7th Int. Symposium on Computer Music Modeling and Retrieval (CMMR 2010), Malaga, Spain, pp. 183-194 (2010)
8. Bello, J.P., Daudet, L., Abdallah, S., Duxbury, C., Davies, M., Sandler, M.: A Tutorial on Onset Detection in Music Signals. IEEE Transactions on Speech and Audio Processing (2005)
9. Burred, J.J., Lerch, A.: Hierarchical Automatic Audio Signal Classification. Journal of the Audio Engineering Society 52(7/8), 724-739 (2004)
10. Caclin, A., McAdams, S., Smith, B.K., Winsberg, S.: Acoustic Correlates of Timbre Space Dimensions: A Confirmatory Study Using Synthetic Tones. J. Acoust. Soc. Am. 118(1), 471-482 (2005)
11. Cannam, C.: Queen Mary University of London: Sonic Annotator, http://omras2.org/SonicAnnotator
12. Cannam, C.: Queen Mary University of London: Sonic Visualiser, http://www.sonicvisualiser.org/
13. Cannam, C.: Queen Mary University of London: Vamp Audio Analysis Plugin System, http://www.vamp-plugins.org/
14. Carey, M., Parris, E., Lloyd-Thomas, H.: A Comparison of Features for Speech, Music Discrimination. In: Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 149-152 (1999)
15. Castellengo, M., Dubois, D.: Timbre ou Timbres? Propriété du Signal, de l'Instrument, ou Construction Cognitive (Timbre or Timbres? Property of the Signal, the Instrument, or Cognitive Construction?). In: Proc. of the Conf. on Interdisciplinary Musicology (CIM 2005), Montreal, Quebec, Canada (2005)
16. Chetry, N., Davies, M., Sandler, M.: Musical Instrument Identification using LSF and K-Means. In: Proc. AES 118th Convention (2005)
17. Childers, D., Skinner, D., Kemerait, R.: The Cepstrum: A Guide to Processing. Proc. of the IEEE 65, 1428-1443 (1977)
18. Davies, M.E.P., Degara, N., Plumbley, M.D.: Evaluation Methods for Musical Audio Beat Tracking Algorithms. Technical report C4DM-TR-09-06, Queen Mary University of London, Centre for Digital Music (2009), http://www.eecs.qmul.ac.uk/~matthewd/pdfs/DaviesDegaraPlumbley09-evaluation-tr.pdf

Speech/Music Discrimination Using Timbre Models

161

19. Davis, S.B., Mermelstein, P.: Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions
on Acoustics, Speech, and Signal Processing ASSP-28(4), 357366 (1980)
20. El-Maleh, K., Klein, M., Petrucci, G., Kabal, P.: Speech/Music Discrimination for
Multimedia Applications. In: Proc. ICASSP 2000, vol. 6, pp. 24452448 (2000)
21. Fazekas, G., Sandler, M.: Intelligent Editing of Studio Recordings With the Help
of Automatic Music Structure Extraction. In: Proc. of the AES 122nd Convention,
Vienna, Austria (2007)
22. Galliano, S., Georois, E., Mostefa, D., Choukri, K., Bonastre, J.F., Gravier, G.:
The ESTER Phase II Evaluation Campaign for the Rich Transcription of French
Broadcast News. In: Proc. Interspeech (2005)
23. Gauvain, J.L., Lamel, L., Adda, G.: Audio Partitioning and Transcription for
Broadcast Data Indexation. Multimedia Tools and Applications 14(2), 187200
(2001)
24. Grey, J.M., Gordon, J.W.: Perception of Spectral Modications on Orchestral Instrument Tones. Computer Music Journal 11(1), 2431 (1978)
25. Hain, T., Johnson, S., Tuerk, A., Woodland, P.C., Young, S.: Segment Generation
and Clustering in the HTK Broadcast News Transcription System. In: Proc. of the
DARPA Broadcast News Transcription and Understanding Workshop, pp. 133137
(1998)
26. Hajda, J.M., Kendall, R.A., Carterette, E.C., Harshberger, M.L.: Methodological Issues in Timbre Research. In: Deliege, I., Sloboda, J. (eds.) Perception and
Cognition of Music, 2nd edn., pp. 253306. Psychology Press, New York (1997)
27. Handel, S.: Hearing. In: Timbre Perception and Auditory Object Identication,
2nd edn., pp. 425461. Academic Press, San Diego (1995)
28. Harte, C.: Towards Automatic Extraction of Harmony Information From Music
Signals. Ph.D. thesis, Queen Mary University of London (2010)
29. Helmholtz, H.v.: On the Sensations of Tone. Dover, New York (1954); (from the
works of 1877). English trad. with notes and appendix from E.J. Ellis
30. Houtgast, T., Steeneken, H.J.M.: The Modulation Transfer Function in Room
Acoustics as a Predictor of Speech Intelligibility. Acustica 28, 6673 (1973)
31. Itakura, F.: Line Spectrum Representation of Linear Predictive Coecients of
Speech Signals. J. Acoust. Soc. Am. 57(S35) (1975)
32. Jarina, R., OConnor, N., Marlow, S., Murphy, N.: Rhythm Detection For SpeechMusic Discrimination In MPEG Compressed Domain. In: Proc. of the IEEE 14th
International Conference on Digital Signal Processing (DSP), Santorini (2002)
33. Kedem, B.: Spectral Analysis and Discrimination by Zero-Crossings. Proc.
IEEE 74, 14771493 (1986)
34. Kim, H.G., Berdahl, E., Moreau, N., Sikora, T.: Speaker Recognition Using MPEG7 Descriptors. In: Proc. of EUROSPEECH (2003)
35. Levy, M., Sandler, M.: Structural Segmentation of Musical Audio by Constrained
Clustering. IEEE. Transac. on Audio, Speech, and Language Proc. 16(2), 318326
(2008)
36. Linde, Y., Buzo, A., Gray, R.M.: An Algorithm for Vector Quantizer Design. IEEE
Transactions on Communications 28, 702710 (1980)
37. Lu, L., Jiang, H., Zhang, H.J.: A Robust Audio Classication and Segmentation
Method. In: Proc. ACM International Multimedia Conference, vol. 9, pp. 203211
(2001)
38. Marozeau, J., de Cheveigne, A., McAdams, S., Winsberg, S.: The Dependency
of Timbre on Fundamental Frequency. Journal of the Acoustical Society of
America 114(5), 29462957 (2003)

162

M. Barthet, S. Hargreaves, and M. Sandler

39. Mauch, M.: Automatic Chord Transcription from Audio using Computational
Models of Musical Context. Ph.D. thesis, Queen Mary University of London (2010)
40. McAdams, S., Winsberg, S., Donnadieu, S., De Soete, G., Krimpho, J.: Perceptual
Scaling of Synthesized Musical Timbres: Common Dimensions, Specicities, and
Latent Subject Classes. Psychological Research 58, 177192 (1995)
41. Music Information Retrieval Evaluation Exchange Wiki: Structural Segmentation
(2010), http://www.music-ir.org/mirex/wiki/2010:Structural_Segmentation
42. Peeters, G.: Automatic Classication of Large Musical Instrument Databases Using Hierarchical Classiers with Inertia Ratio Maximization. In: Proc. AES 115th
Convention, New York (2003)
43. Queen Mary University of London: QM Vamp Plugins, http://www.omras2.org/
SonicAnnotator
44. Ramona, M., Richard, G.: Comparison of Dierent Strategies for a SVM-Based
Audio Segmentation. In: Proc. of the 17th European Signal Processing Conference
(EUSIPCO 2009), pp. 2024 (2009)
45. Risset, J.C., Wessel, D.L.: Exploration of Timbre by Analysis and Synthesis. In:
Deutsch, D. (ed.) Psychology of Music, 2nd edn. Academic Press, London (1999)
46. Saunders, J.: Real-Time Discrimination of Broadcast Speech Music. In: Proc.
ICASSP 1996, vol. 2, pp. 993996 (1996)

47. Schaeer, P.: Traite des Objets Musicaux (Treaty of Musical Objects). Editions
du seuil (1966)
48. Scheirer, E., Slaney, M.: Construction and Evaluation of a Robust Multifeature
Speech/Music Discriminator. In: Proc. ICASSP 1997, vol. 2, pp. 13311334 (1997)
49. Slawson, A.W.: Vowel Quality and Musical Timbre as Functions of Spectrum Envelope and Fundamental Frequency. J. Acoust. Soc. Am. 43(1) (1968)
50. Sundberg, J.: Articulatory Interpretation of the Singing Formant. J. Acoust. Soc.
Am. 55, 838844 (1974)
51. Terasawa, H., Slaney, M., Berger, J.: A Statistical Model of Timbre Perception.
In: ISCA Tutorial and Research Workshop on Statistical And Perceptual Audition
(SAPA 2006), pp. 1823 (2006)
52. Gil de Z
un
iga, H., Veenstra, A., Vraga, E., Shah, D.: Digital Democracy: Reimagining Pathways to Political Participation. Journal of Information Technology &
Politics 7(1), 3651 (2010)
Computer Music Cloud


Jesús L. Alvaro1 and Beatriz Barros2
1 Computer Music Lab, Madrid, Spain
jesuslalvaro@gmail.com
http://cml.fauno.org
2 Departamento de Lenguajes y Ciencias de la Computación,
Universidad de Málaga, Spain
bbarros@lcc.uma.es

Abstract. The present paper puts forward a proposal for a computer music (CM) composition system on the Web. Setting off from the CM composition paradigm used so far, and on the basis of the current shift of computer technology into cloud computing, a new paradigm opens up for the CM composition domain. An experience of a computer music cloud (CMC) is described: the whole music system is split into several web services sharing a unique music representation. MusicJSON is proposed as the interchange music data format, based on the solid and flexible EvMusic representation. A web browser-based graphic environment is developed as the user interface for the computer music cloud, in the form of music web applications.
Keywords: Music Representation, Cloud Computing, Computer Music,
Knowledge Representation, Music Composition, UML, Distributed Data,
Distributed Computing, Creativity, AI.

Computer Aided Composition

Computers offer composers multiple advantages, from score notation to sound synthesis, algorithmic composition and music artificial intelligence (MAI) experimentation. Fig. 1 shows the basic structure of a generalized CM composition system. In this figure, music composition is intentionally divided into two functional processes: a composer-level computation and a performance-level composition [8]. Music computation systems, like algorithmic music programs, are used to produce music materials in an intermediate format, usually a standard MIDI file (SMF). These composition materials are combined and post-produced for performance by means of a music application which finally produces a score or a sound rendering of the composition. In some composition systems the intermediate format is not so evident, because the same application carries out both functions; but in terms of music representation, some symbols representing music entities are used for computation. This internal music representation determines the creative capabilities of the system.
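As a toy illustration of this two-stage structure (not taken from the system described here; the use of the mido library and all names below are assumptions of this text), a composer-level computation could write its material to an SMF, the intermediate format mentioned above:

import random
import mido

def compose_fragment(num_notes=16, seed=0):
    # Composer-level computation: a simple random-walk melody.
    random.seed(seed)
    pitch, notes = 60, []
    for _ in range(num_notes):
        pitch = max(48, min(84, pitch + random.choice([-2, -1, 1, 2])))
        notes.append(pitch)
    return notes

def render_smf(notes, path="fragment.mid", ticks=480):
    # Intermediate format: a single-track standard MIDI file (SMF).
    mid = mido.MidiFile(ticks_per_beat=ticks)
    track = mido.MidiTrack()
    mid.tracks.append(track)
    for pitch in notes:
        track.append(mido.Message("note_on", note=pitch, velocity=64, time=0))
        track.append(mido.Message("note_off", note=pitch, velocity=64, time=ticks // 2))
    mid.save(path)

render_smf(compose_fragment())

The resulting file would then be handed to a post-production application (score editor, sequencer), as in the right-hand side of Fig. 1.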
Fig. 1. Basic Structure of a CM Composition System

There are many different approaches to a CM composition system, as well as multiple languages and music representations. As shown in Fig. 3, our CM system has substantially evolved during the last 12 years [1]. Apart from musical and creative requirements, these changes have progressively accommodated technology changes and turned the system into a distributed computing approach. Computer-assisted music composition and its platforms have evolved for 50 years. Mainframes were used at the beginning, and personal computers (PCs) arrived in the 1980s, bringing computation to the general public. With network development, the Internet has gained more and more importance within present-day Information Technology (IT) and dominates the current situation, making the geographical location of IT resources irrelevant. This paper is aimed at presenting a proposal for computer music (CM) composition on the web. Starting from the CM paradigm used so far, and on the basis of the current shift of computer technology into cloud computing, a new paradigm opens up for the CM domain. The paper is organized as follows: the next section describes the concept of cloud computing, thus locating the work within the field of IT. Then, section 3 introduces the Ev representation, which is the basis of the proposed composition paradigm, explained in section 4. Next, an example is sketched in section 5, while section 6 presents the MusicJSON music format: the interchange music data format based on the solid and flexible EvMusic representation. The paper ends with some conclusions and ideas for future research.

Cloud Computing

IT continues evolving. Cloud computing, a new term defined in various different ways [8], involves a new paradigm in which computer infrastructure and software are provided as a service [5]. These services themselves have been referred to as Software as a Service (SaaS). Google Apps is a clear example of SaaS [10]. Computation infrastructure is also offered as a service (IaaS), thus enabling users to run custom software. Several providers currently offer resizable compute capacity as a public cloud, such as the Amazon Elastic Compute Cloud (EC2) [4] and the Google AppEngine [9].
This situation offers new possibilities for both software developers and users. For instance, this paper was written and revised in GoogleDocs [11], a Google
web service offering word processing capabilities online. The information is no longer stored on local hard disks but on Google servers. The only software users need is a standard web browser. Could this computing-in-the-cloud approach be useful for music composition? What can it offer? What type of services does a music composition cloud consist of? What music representation should they share? What data exchange format should be used?

EvMusic Representation

The first step when planning a composition system should be choosing a proper music representation. The chosen representation will set the frontiers of the system's capabilities. As a result, our CM research developed a solid and versatile representation for music composition. The EvMetamodel [3] was used to model the music knowledge representation behind EvMusic. A prior, deep analysis of music knowledge was carried out to ensure that the representation meets music composition requirements. This multilevel representation is not only compatible with traditional notation but also capable of representing highly abstract music elements. It can also represent symbolic pitch entities [1] from both music theory and algorithmic composition, keeping the door open to the representation of music elements of a higher symbolic level conceived by the composer's creativity. It is based on real composition experience and was designed to support CM composition, including experiments in musical artificial intelligence (MAI). The current music representation is described in a platform-independent UML format [15]. Therefore, it is not confined to its original LISP system, but can be used in any system or language: a valuable feature when approaching a cloud system.

Fig. 2. UML class diagram of the EvMetamodel

Fig. 2 is a UML class diagram of the representation core of the EvMetamodel, the base representation for the time dimension. The three main classes are shown: event, parameter and dynamic object. High-level music elements are represented as subclasses of metaevent, the interface which provides the develop functionality. The special dynamic object changes is also shown. It is a very useful option for the graphical editing of parameters, since it represents a dynamic object as a sequence of parameter-change events which can be easily moved in time.
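Purely as an illustration of this structure (a loose Python sketch, not the actual EvMusic code; every class and attribute name beyond those mentioned above is an assumption), the core classes could be organized as follows:

from dataclasses import dataclass, field
from typing import List

@dataclass
class TreeNode:
    # Every Ev object carries a unique reference and a parent link,
    # which later allows trees to be flattened into records (Sect. 6.1).
    ref: str = ""
    parent: str = ""

@dataclass
class Parameter(TreeNode):
    name: str = ""
    value: object = None

@dataclass
class Event(TreeNode):
    pos: float = 0.0                                      # position in time
    pars: List[Parameter] = field(default_factory=list)
    events: List["Event"] = field(default_factory=list)  # nested child events

class MetaEvent(Event):
    # High-level music element: develop() turns it into lower-level events.
    def develop(self) -> List[Event]:
        raise NotImplementedError

@dataclass
class Changes(TreeNode):
    # Dynamic object expressed as a sequence of parameter-change events.
    changes: List[Event] = field(default_factory=list)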

Composing in the Cloud: A New Paradigm

Our CM system has undergone several changes since its beginning in 1997. Fig. 3 shows the evolution undergone by formats, platforms and technologies toward the current proposal. The figure follows the same horizontal scheme as Fig. 1. The first column indicates the user interface for music input, and the second shows the music computation system and its evolution over recent years, while the intermediate music format is reported in the central column. Post-production utilities and their final results are shown in the last two columns, respectively.

Fig. 3. Evolution of our CM Composition System

The model in Fig. 3 clearly shows the evolution undergone by the system: first, a process of service diversification and specialisation, mainly at the post-production stage; second, as a trend in user input, CM is demanding graphic environments. Finally, technologies and formats have undergone multiple changes. The reason behind most of these changes can be found in external technology advances and the need to accommodate these new situations.
At times the needed tool or library was not available; at other times the available tool was suitable at that moment, but offered no long-term availability. As stated above, the recent shift of IT into cloud computing brings new opportunities for evolution. In CM, system development can benefit from computing distribution and specialization. Splitting the system into several specialized services avoids the limitations imposed by a single programming language or platform. Therefore, individual music services can be developed and evolved independently from the others. Each component service can be implemented in the most appropriate platform for its particular task, regardless of the rest of the services, without being conditioned by the technologies necessary for the implementation of other services. In the previous paradigm, all services were performed by one single system, and the selection of technologies to complete a particular task affected or even conditioned the implementation of other tasks. The new approach frees the system design, thus making it more platform-independent. In addition, widely available tools can be used for specific tasks, thus benefiting from tool development in other areas such as database storage and web application design.
4.1 Computer Music Cloud (CMC)

Fig. 4 shows a Computer Music Cloud (CMC) for composition as an evolution of the scheme in Fig. 3. The system is distributed across specialized online services. The user interface is now a web application running in a standard browser. A storage service is used as an editing memory. A dedicated intelligent service is allocated for music calculation and development. Output formats such as MIDI, graphic score and sound file are rendered by independent services exclusively devoted to this task. The web application includes user sessions to allow multiple users to use the system. Both public and user libraries are also provided for music objects. Intermediary music elements can be stored in the library and also serialized into a MusicJSON format file, as described below.
An advantage of this approach is that a single service is available to different CMC systems. Therefore, the design of new music systems is facilitated by the joint work of different services controlled by a web application. The key factor for successful integration lies in the use of a well-defined, suitable music representation for music data exchange.
4.2 Music Web Services

In the usual cloud computing terminology, each of the services can be considered as a Music-computing as a Service (MaaS) component. In a simple form, they are servers receiving a request and performing a task. At the end of the task, the resulting objects are returned to the stream or stored in an exchange database. Access to this database is a valuable feature for services, since it allows the definition of complex services. They could even be converted into real MAI agents, i.e., intelligent devices which can perceive their environment, make decisions and act inside their environment [17]. Storing the music composition as a virtual environment in a database allows music services to interact within the composition, thus opening a door toward a MAI system of distributed agents. The music services of the cloud are classified below according to their function.

Fig. 4. Basic Structure of a CMC

Input. This group includes the services aimed particularly at incorporating new
music elements and translating them from other input formats.
Agents. These are services which are capable of inspecting and modifying the music composition, as well as introducing new elements. This type includes human user interfaces, but may also include other intelligent elements taking part in the composition by introducing decisions, suggestions or modifications. In our prototype, we have developed a web application acting as a user interface for editing music elements. This service is described in the next section.
Storage. At this stage, only music object instances and relations are stored, but the hypothetical model also includes procedural information. Three music storage services are implemented in the prototype. The main lib stores shared music elements as global definitions. This content may be seen as some kind of music culture. User-related music objects are stored in the user lib. These include music objects defined by the composer which can be reused in several parts and complete music sections, or which represent the composer's style. The editing storage service is provided as temporary storage for the editing session. The piece is progressively composed in the database. The composition environment (i.e., everything related to the piece under composition) is in the database. This is the environment in which several composing agents can integrate and interact by reading and writing on this database-stored score. All three storage services in this experience were written in Python and deployed in the cloud with Google AppEngine.
Development. The services in this group perform development processes. As explained in [8], development is the process by which higher-abstraction symbolic elements are turned into lower-abstraction elements. High-abstraction symbols are implemented as meta-events and represent music objects such as motives, segments and other composing abstractions [3]. In this prototype, the entire EvMusic LISP implementation is provided as a service. Other intelligent services in this group, such as constraint solvers, genetic algorithms and others, may also be incorporated.
Output. These services produce output formats in response to requests from other services. They work in a two-level scheme. At the first level, they render formats for immediate composer feedback, such as MIDI playback or .PNG graphic notation of the element currently being edited. Composition products, such as the whole score or the audio rendering, are produced at the second level. In this experience, a MIDI service is implemented in Python and runs on Google AppEngine. It returns an SMF for quick audible monitoring (a minimal sketch of such an output service is given below). A LISP-written score service is also available. It uses the FOMUS [16] and LilyPond [14] libraries and runs in an Ubuntu Linux image [18] for the Amazon Cloud. It produces graphic scores from the music data as .PNG and .PDF files. Included in this output group, the MusicJSON serializer service produces a music exchange file, as described in the next section.
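The following sketch suggests what such a first-level MIDI output service could look like. It is an assumption of this text, not the authors' implementation: Flask and mido are stand-in libraries, the route name is invented, and only the plain notes of the MusicJSON fragment of Sect. 6.2 are handled.

from flask import Flask, request, send_file
import io
import mido

app = Flask(__name__)

NOTE_STEPS = {"c": 0, "d": 2, "e": 4, "f": 5, "g": 7, "a": 9, "b": 11}

def pitch_to_midi(name):
    # Convert a MusicJSON pitch such as 'd4' or 'f#4' to a MIDI note number.
    step = NOTE_STEPS[name[0].lower()] + (1 if "#" in name else 0)
    return 12 * (int(name[-1]) + 1) + step

@app.route("/render/midi", methods=["POST"])
def render_midi():
    part = request.get_json()              # a MusicJSON "part" object
    ticks = 480
    mid = mido.MidiFile(ticks_per_beat=ticks)
    track = mido.MidiTrack()
    mid.tracks.append(track)
    clock = 0
    for ev in sorted(part.get("events", []), key=lambda e: e["pos"]):
        if ev.get("objclass") != "note":
            continue                        # chords etc. are omitted in this sketch
        start = int(ev["pos"] * ticks)
        end = int((ev["pos"] + ev["dur"]) * ticks)
        note = pitch_to_midi(ev["pitch"])
        track.append(mido.Message("note_on", note=note, velocity=64, time=start - clock))
        track.append(mido.Message("note_off", note=note, velocity=64, time=end - start))
        clock = end
    buf = io.BytesIO()
    mid.save(file=buf)
    buf.seek(0)
    return send_file(buf, mimetype="audio/midi")

A client (e.g. the EvEditor web application described next) would POST a MusicJSON part to such a service and play back the returned SMF.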

User Interface: A Music Web Application

What technologies are behind successful, widespread web cloud applications such as GoogleDocs? What type of exchange format do they use? JavaScript [13] is a dialect of the ECMAScript standard [6] supported by almost all web browsers. It is the key tool behind these web clients. In this environment, the code for the client is downloaded from a server and then run in the web browser as a dynamic script that keeps communicating with web services. Thus, the browser window behaves as a user-interface window for the system.
5.1 EvEditor Client

The ExtJS library [7] was chosen as a development framework for the client-side implementation. The web application takes the shape of the desktop-with-windows archetype (i.e., a well-tested approach and an intuitive interface environment we would like to benefit from, but within a web-browser window). The main objective of its implementation was not only to produce a suitable music editor for a music cloud example, but also a whole framework for the development of general-purpose object editors under an OOP approach. Once this aim was reached, different editors could be subclassed with the customizations required by the object type. ExtJS is a powerful JavaScript library including a vast collection of components for user interface creation. Elements are defined in a hierarchic class system, which offers a great solution for our OOP approach. That is, all EvEditor components are subclassed from ExtJS classes. As shown in Fig. 5, the EvEditor client is a web application consisting of three layers. The bottom or data layer contains an editor proxy for the data under current edition. It duplicates the selected records in the remote storage service.

Fig. 5. Structure of EvEditor Client (layers: PROXY & DATA, COMPONENT LAYER, DOM)

The database in the remote storage service is synchronized with the edits and updates the editor writes in its proxy. Several editors can share the same proxy, so all listening editors are updated when the data are modified in the proxy. The intermediate layer is a symbolic zone for all components. It includes both graphic interface components, such as editor windows and container views, and robjects: representations of the objects under current edition. Interface components are subclassed from ExtJS components. Every editor is displayed in its own window on the working desktop and optionally contains a contentView displaying its child objects as robjects.

Fig. 6. Screen Capture of EvEditor

Fig. 6 is a browser-window capture showing the working desktop and some editor windows. The application menu is shown in the lower left-hand corner, including user settings. The central area of the capture shows a diatonic sequence editor
based on our TclTk Editor [1]. A list editor and a form-based note editor are also
shown. In the third layer or DOM (Document Object Model) [19], all components
are rendered as DOM elements (i.e., HTML document elements to be visualized).
5.2 Server Side

The client script code is dynamically provided by the server side of the web
application. It is written in Python and can run as a GoogleApp for integration
into the cloud. All user-related tasks are managed by this server side. It identifies the user and manages sessions, profiles and environments.

MusicJSON Music Format

As explained above, each music service can be developed on the platform of choice. The only requirement for a service to be integrated into the music composition cloud is that it must use the same music representation. The EvMusic representation is proposed for this purpose. This section describes how EvMusic objects are stored in a database and transmitted over the cloud.

6.1 Database Object Storage

Database storage allows several services to share the same data and collaborate in the composition process. The information stored in a database is organized in tables of records. To store EvMusic tree structures in a database, they must first be converted into records. For this purpose, the three main classes of the Ev representation are subclassed from a tree node class, as shown in Fig. 2. Thus, every object is identified by a unique reference and a parent attribute. This allows a large tree structure of nested events to be represented as a set of records for individual retrieval or update.
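As an illustration of this flattening (a sketch under assumptions: only the ref and parent attributes come from the text above, while the function and field names are invented), nested events could be turned into flat records as follows:

import itertools

_counter = itertools.count(1)

def flatten(event, parent_ref=None, records=None):
    # Turn a nested event tree into flat records, each carrying a unique
    # 'ref' and a 'parent' attribute for individual retrieval or update.
    if records is None:
        records = []
    ref = "ev%d" % next(_counter)
    record = {k: v for k, v in event.items() if k != "events"}
    record.update(ref=ref, parent=parent_ref)
    records.append(record)
    for child in event.get("events", []):
        flatten(child, parent_ref=ref, records=records)
    return records

# A part containing two notes becomes three records.
part = {"objclass": "part", "pos": 0, "events": [
    {"objclass": "note", "pos": 0, "pitch": "d4", "dur": 0.5},
    {"objclass": "note", "pos": 0.5, "pitch": "d4", "dur": 0.5}]}
for rec in flatten(part):
    print(rec)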
6.2 MusicJSON Object Description

Web applications usually use XML and JSON (JavaScript Object Notation) for data exchange [12]. Both formats meet the requirements. However, two reasons supported our inclination for JSON: 1) the large tool library available for JSON at the time of this writing, and 2) the fact that JSON is offered as the exchange format of some of the main Internet web services such as Google or Yahoo. A further argument lies in its features, such as human readability and support for dynamic, unclosed objects, a very valuable property inherited from the prototype-based nature of JavaScript.
JSON can be used to describe EvMusic objects and to communicate among web music services. MusicJSON [2] is the name given to this use. As a simple example, the code below shows the description of the short music fragment shown in Fig. 7. As can be seen, the code is self-explanatory.
{"pos":0, "objclass": "part","track":1, "events":[
{"pos":0, "objclass": "note","pitch":"d4","dur":0.5,
"art":"stacatto"},
{"pos":0.5, "objclass": "note","pitch":"d4","dur":0.5,
"art":"stacatto"},
{"pos":1, "objclass": "pitch":"g4","dur":0.75,
"dyn":"mf","legato":"start"},
{"pos":1.75, "objclass": "pitch":"f#4","dur":0.25
"legato":"end"},
{"pos":2, "objclass": "note","pitch":"g4","dur":0.5,
"art":"stacatto"}
{"pos":2.5, "objclass": "note","pitch":"a4","dur":0.5,
"art":"stacatto"}
{"pos":3, "objclass":"nchord","dur":1,"pitches":[
{"objclass": "spitch","pitch": "d4"},
{"objclass": "spitch","pitch": "b4"}]
}
]
}
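Purely as an illustration (this snippet is an assumption of this text, not part of the MusicJSON specification), a Python service could load such a fragment with the standard json module and walk its events:

import json

with open("fragment.musicjson") as f:   # hypothetical file name
    part = json.load(f)

for ev in part["events"]:
    if ev["objclass"] == "note":
        print("note %s at %s for %s beats" % (ev["pitch"], ev["pos"], ev["dur"]))
    elif ev["objclass"] == "nchord":
        pitches = "+".join(p["pitch"] for p in ev["pitches"])
        print("chord %s at %s" % (pitches, ev["pos"]))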

Fig. 7. Score notation of the example code

6.3 MusicJSON File

Every EvMusic object, from single notes to complex structures, can be serialized into MusicJSON text and subsequently transmitted over the Web. In addition, MusicJSON can be used as an intermediate format for the local storage of compositions. The next code listing shows a draft example of the proposed description of an EvMusic file.
{"objclass":"evmusicfile","ver":"0911","content":
{"lib":{
"instruments":"http://evmusic.fauno.org/lib/main/instruments",
"pcstypes": "http://evmusic.fauno.org/lib/main/pcstypes",
"mypcs": "http://evmusic.fauno.org/lib/jesus/pcstypes",
"mymotives": "http://evmusic.fauno.org/lib/jesus/motives"
},
"def":{
"ma": {"objclass":"motive",
"symbol":[ 0,7, 5,4,2,0 ],
"slength": "+-+- +-++ +---"},
"flamenco": {"objclass":"pcstype",
"pcs":[ 0,5,7,13 ],},
},
"orc":{
"flauta": {"objclass":"instrument",
"value":"x.lib.instruments.flute",
"role":"r1"}
"cello": {"objclass":"instrument",
"value":"x.lib.instruments.cello",
"role":"r2"}
},
"score":[
{"pos": 0,"objclass":"section",
"pars":[
"tempo":120,"dyn":"mf","meter":"4/4",
...
],
"events":[
{"pos":0, "objclass": "part","track":1,"role":"i1",
"events":[
... ]
... ]},
{"pos": 60,"objclass":"section","ref":"s2",
...
},],}}}


The code shows four sections in the content. Library is an array of libraries to be loaded with object definitions; both main and user libraries can be addressed. The following section includes local definitions of objects; as an example, a motive and a chord type are defined. The next section establishes instrumentation assignments by means of the arrangement object role. The last section is the score itself, where all events are placed in a tree structure using parts. Using MusicJSON as the intermediary communication format enables us to connect several music services forming a cloud composition system.

Conclusion

The present paper puts forward an experience of music composition under a distributed computation approach as a viable solution for computer music composition in the cloud. The system is split into several music services hosted by common IaaS providers such as Google or Amazon. Different music systems can be built by the joint operation of some of these music services in the cloud.
In order to cooperate and deal with music objects, each service in the music cloud must understand the same music knowledge. The music knowledge representation they share must therefore be standardized. The EvMusic representation is proposed for this, since it is a solid multilevel representation successfully tested in real CM compositions in recent years.
Furthermore, MusicJSON is proposed as an exchange data format between services. Example descriptions of music elements, as well as a file format for local saving of a musical composition, are given. A graphic environment is also proposed for the creation of user interfaces for object editing as a web application. As an example, the EvEditor application is described.
This CMC approach opens multiple possibilities for derivative work. New music creation interfaces can be developed as web applications benefiting from upcoming web technologies such as the promising HTML5 standard [20]. The described music cloud, together with the EvMusic representation, provides a ground environment for MAI research, where specialised agents can cooperate in a music composition environment sharing the same music representation.

References
1. Alvaro, J.L. : Symbolic Pitch: Composition Experiments in Music Representation.
Research Report, http://cml.fauno.org/symbolicpitch.html (retrieved December 10, 2010) (last viewed February 2011)
2. Alvaro, J.L., Barros, B.: MusicJSON: A Representation for the Computer Music Cloud. In: Proceedings of the 7th Sound and Music Computing Conference, Barcelona (2010)
3. Alvaro, J.L., Miranda, E.R., Barros, B.: Music Knowledge Analysis: Towards an Efficient Representation for Composition. In: Marín, R., Onaindía, E., Bugarín, A., Santos, J. (eds.) CAEPIA 2005. LNCS (LNAI), vol. 4177, pp. 331–341. Springer, Heidelberg (2006)


4. Amazon Elastic Computing, http://aws.amazon.com/ec2/ (retrieved February 1, 2010) (last viewed February 2011)
5. Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R.H., Konwinski, A.,
Lee, G., Patterson, D. A., Rabkin, A., Stoica, I., Zaharia, M.: Above the Clouds:
A Berkeley View of Cloud Computing White Paper, http://www.eecs.berkeley.
edu/Pubs/TechRpts/2009/EECS-2009-28.pdf (retrieved February 1, 2010) (last
viewed February 2011)
6. ECMAScript Language Specication, http://www.ecma-international.org/
publications/standards/Ecma-262.htm (retrieved February 1, 2010) (last viewed
February 2011)
7. ExtJS Library, http://www.extjs.com/ (retrieved February 1, 2010) (last viewed
February 2011)
8. Geelan, J.: Twenty Experts Define Cloud Computing. Cloud Computing Journal (2008), http://cloudcomputing.sys-con.com/node/612375/print (retrieved
February 1, 2010) (last viewed February 2011)
9. Google AppEngine, http://code.google.com/appengine/ (retrieved February 1,
2010) (last viewed February 2011)
10. Google Apps, http://www.google.com/apps/ (retrieved February 1, 2010) (last
viewed February 2011)
11. Google Docs, http://docs.google.com/ (retrieved February 1, 2010 (last viewed
February 2011)
12. Introducing JSON, http://www.json.org/ (retrieved February 1, 2010) (last
viewed February 2011)
13. JavaScript, http://en.wikipedia.org/wiki/JavaScript (retrieved February 1,
2010) (last viewed February 2011)
14. Nienhuys, H.-W., Nieuwenhuizen, J.: GNU LilyPond, http://www.lilypond.org (retrieved February 1, 2010) (last viewed February 2011)
15. OMG: Unified Modeling Language: Superstructure. Version 2.1.1 (2007), http://
www.omg.org/uml (retrieved February 1, 2010) (last viewed February 2011)
16. Psenicka, D.: FOMUS, a Music Notation Package for Computer Music Composers, http://fomus.sourceforge.net/doc.html/index.html (retrieved February 1, 2010) (last viewed, February 2011)
17. Russell, S.J., Norvig, P.: Intelligent Agents. In: Artificial Intelligence: A Modern Approach, ch. 2. Prentice-Hall, Englewood Cliffs (2002)
18. Ubuntu Server on Amazon EC2, http://www.ubuntu.com/cloud/public (retrieved February 1, 2010) (last viewed, February 2011)
19. Wood, L.: Programming the Web: The W3C DOM Specification. IEEE Internet Computing 3(1), 48–54 (1999)
20. W3C: HTML5 A vocabulary and associated APIs for HTML and XHTML W3C
Editors Draft, http://dev.w3.org/html5/spec/ (retrieved February 1, 2010) (last
viewed, February 2011)
Abstract Sounds and Their Applications in Audio and Perception Research

Adrien Merer1, Sølvi Ystad1, Richard Kronland-Martinet1, and Mitsuko Aramaki2,3

1 CNRS - Laboratoire de Mécanique et d'Acoustique, 31 ch. Joseph Aiguier, Marseille, France
2 CNRS - Institut de Neurosciences Cognitives de la Méditerranée, 31 ch. Joseph Aiguier, Marseille, France
3 Université Aix-Marseille, 38 bd. Charles Livon, Marseille, France
{merer,ystad,kronland}@lma.cnrs-mrs.fr
aramaki@incm.cnrs-mrs.fr

Abstract. Recognition of sound sources and events is an important process in sound perception and has been studied in many research domains. Conversely, sounds that cannot be recognized are not often studied, except by electroacoustic music composers. Besides, considerations on the recognition of sources might help to address the problem of stimulus selection and categorization of sounds in the context of perception research. This paper introduces what we call abstract sounds together with their existing musical background and shows their relevance for different applications.
Keywords: abstract sound, stimuli selection, acousmatic.

Introduction

How do sounds convey meaning? How can the acoustic characteristics that convey the relevant information in sounds be identified? These questions interest researchers within various research fields such as cognitive neuroscience, musicology, sound synthesis, sonification, etc. Recognition of sound sources, identification, discrimination and sonification deal with the problem of linking signal properties and perceived information. In several domains (linguistics, music analysis), this problem is known as semiotics [21]. The analysis-by-synthesis approach [28] has made it possible to understand some important features that characterize the sound of vibrating objects or of interactions between objects. A similar approach was also adopted in [13], where the authors use vocal imitations in order to study human sound source identification, with the assumption that vocal imitations are simplifications of original sounds that still contain the relevant information.
Recently, there has been an important development in the use of sounds to convey information to a user (of a computer, a car, etc.) within a new research community called auditory display [19], which deals with topics related to sound design, sonification and augmented reality. In such cases, it is important to use
sounds that are meaningful independently of cultural references, taking into account that sounds are presented through speakers concurrently with other audio/visual information.
Depending on the research topic, authors have focused on different sound categories (i.e. speech, environmental sounds, music or calibrated synthesized stimuli). In [18], the author proposed a classification of everyday sounds according to the physical interactions from which the sound originates. When working within the synthesis and/or sonification domains, the aim is often to reproduce the acoustic properties responsible for the attribution of meaning, and thus sound categories can be considered from the point of view of semiotics, i.e. focusing on the information that can be gathered from sounds.
In this way, we considered a specific category of sounds that we call abstract sounds. This category includes any sound that cannot be associated with an identifiable source. It includes environmental sounds that cannot be easily identified by listeners or that give rise to many different interpretations depending on listeners and contexts. It also includes synthesized sounds and laboratory-generated sounds if they are not associated with a clear origin. For instance, alarm or warning sounds cannot be considered as abstract sounds. In practice, recordings made with a microphone close to the sound source and some synthesis methods like granular synthesis are especially efficient for creating abstract sounds. Note that in this paper, we mainly consider acoustically complex stimuli, since they best meet our needs in the different applications (as discussed further).
Various labels that refer to abstract sounds can be found in the literature: "confused sounds" [6], "strange sounds" [36], "sounds without meaning" [16]. Conversely, [34] uses the term "source-bonded" and the expression "source bonding" for "the natural tendency to relate sounds to supposed sources and causes". Chion introduced "acousmatic sounds" [9] in the context of cinema and audiovisual applications with the following definition: sound one hears without seeing its originating cause - an invisible sound source (for more details see section 2).
The most common expression is "abstract sounds" [27,14,26], particularly within the domain of auditory display when concerning earcons [7]. "Abstract" used as an adjective means "based on general ideas and not on any particular real person, thing or situation" and also "existing in thought or as an idea but not having a physical reality"1. For sounds, we can consider another definition used for art: "not representing people or things in a realistic way"1. "Abstract" as a noun is "a short piece of writing containing the main ideas in a document"1 and thus shares the idea of essential attributes, which is suitable in the context of semiotics. In [4], the authors wrote: "Edworthy and Hellier (2006) suggested that abstract sounds can be interpreted very differently depending on the many possible meanings that can be linked to them, and in large depending on the surrounding environment and the listener."
In fact, there is a general agreement on the use of the adjective "abstract" applied to sounds, expressing both the idea of source recognition and that of different possible interpretations.
1 Definitions from http://dictionary.cambridge.org/


This paper will first present the existing framework for the use of abstract sounds by electroacoustic music composers and researchers. We will then discuss some important aspects that should be considered when conducting listening tests, with a special emphasis on the specificities of abstract sounds. Finally, three practical examples of experiments with abstract sounds in different research domains will be presented.

The Acousmatic Approach

Even if the term "abstract sounds" was not used in the context of electroacoustic music, it seems that this community was one of the first to consider the issues related to the recognition of sound sources and to use such sounds. In 1966, P. Schaeffer, who was both a musician and a researcher, wrote the Traité des objets musicaux [29], in which he reported more than ten years of research on electroacoustic music. With a multidisciplinary approach, he intended to carry out fundamental music research that included both Concrète2 and traditional music. One of the first concepts he introduced was the so-called "acousmatic listening", related to the experience of listening to a sound without paying attention to the source or the event. The word "acousmatic" is at the origin of many discussions, and is now mainly employed to describe a musical trend.
Discussions about acousmatic listening were kept alive due to a fundamental problem in Concrète music. Indeed, for music composers the problem is to create new meaning from sounds that already carry information about their origins. In compositions where sounds are organized according to their intrinsic properties, thanks to the acousmatic approach, information on the origins of sounds is still present and interacts with the composer's goals.
There was an important divergence of points of view between Concrète and Elektronische music (see [10] for a complete review), since the Elektronische music composers used only electronically generated sounds and thus avoided the problem of meaning [15]. Both Concrète and Elektronische music have developed a research tradition on acoustics and perception, but only Schaeffer adopted a scientific point of view. In [11], the author wrote: "Schaeffer's decision to use recorded sounds was based on his realization that such sounds were often rich in harmonic and dynamic behaviors and thus had the largest potential for his project of musical research". This work was of importance for electroacoustic musicians, but is almost unknown to researchers in auditory perception, since there is no published translation of his book except for concomitant works [30] and Chion's Guide des objets musicaux3.
2 The term "concrete" is related to a composition method which is based on concrete material, i.e. recorded or synthesized sounds, in opposition to "abstract" music, which is composed in an abstract manner, i.e. from ideas written on a score, and becomes concrete afterwards.
3 Translation by J. Dack available at http://www.ears.dmu.ac.uk/spip.php?page=articleEars&id_article=3597


Fig. 1. Schaeffer's typology. Note that some column labels are redundant since the table must be read from the center to the borders. For instance, the "Non existent evolution" column in the right part of the table corresponds to endless iterations, whereas the "Non existent evolution" column in the left part concerns sustained sounds (with no amplitude variations). Translation from [12].

As reported in [12], translating Schaeffer's writing is extremely difficult since he used neologisms and very specific meanings of French words. However, there has recently been a growing interest in this book, in particular in the domain of music information retrieval, for morphological sound description [27,26,5]. The authors indicate that in the case of what they call abstract sounds, classical approaches based on sound source recognition are not relevant, and thus base their algorithms on Schaeffer's morphology and typology classifications.
Morphology and typology were introduced as analysis and creation tools for composers, as an attempt to construct a music notation that includes electroacoustic music and therefore any sound. The typology classification (cf. Fig. 1) is based on a characterization of the spectral (mass) and dynamical (facture4) profiles with respect to their complexity, and consists of twenty-eight categories. There are nine central categories of "balanced" sounds for which the variations are neither too rapid and random nor too slow or nonexistent. Those nine categories include three facture profiles (sustained, impulsive or iterative) and three mass profiles (tonic, complex and varying). On both sides of the balanced objects in the table, there are nineteen additional categories for which the mass and facture profiles are very simple/repetitive or vary a lot.
Note that some automatic classification methods are available [26]. In [37] the authors proposed an extension of Schaeffer's typology that includes graphical notations.
Since the 1950s, electroacoustic music composers have addressed the problem of the meaning of sounds and provided an interesting tool for the classification of sounds with no a priori differentiation of the type of sound.
4 As discussed in [12], even if "facture" is not a common English word, there is no better translation from French.


For sound perception research, a classification of sounds according to these categories may be useful since they are suitable for any sound. The next section will detail the use of such a classification for the design of listening tests.

Design of Listening Tests Using Abstract Sounds

The design of listening tests is a fundamental part of sound perception studies and implies consideration of different aspects of perception that are closely related to the intended measurements. For instance, it is important to design calibrated stimuli and experimental procedures to best control the main factors that affect the subjects' evaluations. We propose to discuss such aspects in the context of abstract sounds.
3.1 Stimuli

It is common to assume that perception differs as a function of sound categories (e.g. speech, environmental sounds, music). Even more, these categories are underlying elements defining a research area. Consequently, it is difficult to determine a general property of human perception based on results collected from different studies. For instance, results concerning loudness obtained with elementary synthesized stimuli (sinusoids, noise, etc.) cannot be directly adapted to complex environmental sounds, as reported by [31]. Furthermore, listeners' judgements might differ for sounds belonging to the same category. For instance, in the environmental sound category, [14] have shown specific categorization strategies for sounds that involve human activity.
When there is no hypothesis regarding the signal properties, it is important to gather sounds that present a large variety of acoustic characteristics, as discussed in [33]. Schaeffer's typology offers an objective selection tool that can help the experimenter construct a very general sound corpus representative of most existing sound characteristics by covering all the typology categories. As a comparison, environmental sounds can be classified only in certain rows of Schaeffer's typology categories (mainly the balanced objects). Besides, abstract sounds may constitute a good compromise in terms of acoustic properties between elementary (sinusoids, noise, etc.) and ecological (speech, environmental sounds and music) stimuli.
A corpus of abstract sounds can be obtained in different ways. Many databases available for audiovisual applications contain such sounds (see [33]). Different synthesis techniques (like granular or FM synthesis, etc.) are also efficient for creating abstract sounds. In [16] and further works [38,39], the authors presented some techniques to transform any recognizable sound into an abstract sound while preserving several signal characteristics. Conversely, many transformations drastically alter the original (environmental or vocal) sounds when important acoustic attributes are modified. For instance, [25] has shown that applying high- and low-pass filtering influences the perceived naturalness of speech and music sounds. Since abstract sounds do not convey a univocal meaning, it is possible to use them in different ways according to the aim of the experiment. For instance, the same sound corpus can be evaluated in different contexts (by drawing the listeners'
attention to certain evocations) in order to study specific aspects of the information conveyed by the sounds. In particular, we will see how the same set of abstract sounds was used in two different studies described in sections 4.3 and 4.1.
3.2 Procedure

To control the design of the stimuli, it is important to verify in a pre-test that the evaluated sounds are actually abstract for most listeners. In a musical context, D. Smalley [35] introduced the expression "surrogacy level" (or degree) to quantify the ease of source recognition. This level is generally evaluated by using identification tasks. In [6], the authors describe three methods: 1) free identification tasks, which consist of associating words or any description with sounds [2]; 2) context-based ratings, which are comparisons between sounds and other stimuli; 3) attribute rating, which is a generalization of the semantic differential method. The third method may be the most relevant since it provides graduated ratings on an unlimited number of scales. In particular, we will see in section 4.3 that we evaluated the degree of recognition of abstract sounds (whether the sound is easily recognizable or not) by asking listeners to use a non-graduated scale from "not recognizable" to "easily recognizable".
Since abstract sounds are not easily associated with a source (and with the corresponding label), they can also be attributed several meanings that may depend on the type of experimental procedure and task. In particular, we will see that it is possible to take advantage of this variability of meaning to highlight, for example, differences between groups of listeners, as described in section 4.1.
3.3 Type of Listening

In general, perception research distinguishes analytic and synthetic listening. Given a listening procedure, subjects may focus on different aspects of sounds since different concentration and attention levels are involved. From a different point of view, [17] introduced the term "everyday listening" (as opposed to "musical listening") and argued that even in the case of laboratory experiments, listeners are naturally more interested in sound source properties than in intrinsic properties, and therefore use everyday listening. [29] also introduced different types of listening ("hearing", "listening", "comprehending", "understanding") and asserted that when listening to a sound we switch from one type of listening to another. Even if different points of view are used to define the different types of listening, they share the notions of attentional direction and intention when perceiving sounds. Abstract sounds might help listeners to focus on the intrinsic properties of sound and thus to adopt musical listening.
Another aspect that could influence the type of listening and therefore introduce variability in the responses is the coexistence of several streams in a sound5. If a sound is composed of several streams, listeners might alternately focus on different elements, which cannot be accurately controlled by the experimenter.
5 Auditory streams have been introduced by Bregman [8], and describe our ability to group/separate different elements of a sound.


Since abstract sounds have no univocal meaning to be preserved, it is possible to apply transformations that favour one stream (and alter the original meaning). This is not the case for environmental sound recordings, for instance, since transformations can make them unrecognizable. Note that the classification of sounds with several streams according to Schaeffer's typology might be difficult since they present concomitant profiles associated with distinct categories.

Potentials of Abstract Sounds

As described in section 2, the potential of abstract sounds was initially revealed in the musical context. In particular, their ability to evoke various emotions was fully investigated by electroacoustic composers. In this section, we describe how abstract sounds can be used in different contexts by presenting studies linked to three different research domains, i.e. sound synthesis, cognitive neuroscience and clinical diagnosis. Note that we only aim at giving an overview of some experiments that use abstract sounds, in order to discuss the motivations behind the different experimental approaches. Details of the material and methods can be found in the articles referred to in the following sections.
The three experiments partially shared the same stimuli. We collected abstract sounds provided by electroacoustic composers. Composers constitute an original resource of interesting sounds since they have thousands of specially recorded or synthesized sounds, organized and indexed to be included in their compositions. From these databases, we selected a set of 200 sounds6 that best spread out over the typology table proposed by Schaeffer (cf. Fig. 1). A subset of sounds was finally chosen according to the needs of each study presented in the following paragraphs.
4.1 Bizarre and Familiar Sounds

Abstract sounds are not often heard in our everyday life and could even be completely novel for listeners. Therefore, they might be perceived as strange or bizarre. As mentioned above, listeners' judgements of abstract sounds are highly subjective. In some cases, it is possible to use this subjectivity to investigate some specificities of human perception and, in particular, to highlight differences in sound evaluations between groups of listeners. Notably, the concept of "bizarre" is one important element of the standard classification of mental disorders (DSM-IV) for schizophrenia [1], p. 275. Another frequently reported element is the existence of auditory hallucinations7, i.e. perception without stimulation. From such considerations, we explored the perception of bizarre and familiar sounds in patients with schizophrenia by using both environmental sounds (for their familiar aspect) and abstract sounds (for their bizarre aspect).
6 Some examples from [23] are available at http://www.sensons.cnrs-mrs.fr/CMMR07_semiotique/
7 "[...] auditory hallucinations are by far the most common and characteristic of Schizophrenia." [1], p. 275


The procedure consisted of rating sounds on continuous scales according to a perceptual dimension labelled by an adjective (by contrast, the classical semantic differential uses an adjective and an antonym to define the extremes of each scale). Sounds were evaluated on six dimensions along linear scales: "familiar", "reassuring", "pleasant", "bizarre", "frightening", "invasive"8. Concerning the abstract sound corpus, we chose 20 sounds from the initial set of 200 sounds through a pre-test on seven subjects, selecting the sounds that best spread in the space of measured variables (the perceptual dimensions). This preselection was validated by a second pre-test on fourteen subjects that produced a similar distribution of the sounds along the perceptual dimensions.
Preliminary results showed that the selected sound corpus made it possible to highlight significant differences between patients with schizophrenia and control groups. Further analysis and testing (for instance with brain imaging techniques) will be conducted in order to better understand these differences.
4.2 Reduction of Linguistic Mediation and Access to Different Meanings

Within the domain of cognitive neuroscience, a major issue is to determine whether similar neural networks are involved in the allocation of meaning for language and for other, non-linguistic sounds. A well-known protocol largely used to investigate semantic processing in language, i.e. the semantic priming paradigm [3], has been applied to other stimuli such as pictures, odors and sounds, and several studies have highlighted the existence of conceptual priming in a non-linguistic context (see [32] for a review). One difficulty that occurs when considering non-linguistic stimuli is the potential effect of linguistic mediation. For instance, watching a picture of a bird or listening to the song of a bird might automatically activate the verbal label "bird". In this case, the conceptual priming cannot be considered as purely non-linguistic because of the implicit naming induced by the stimulus processing. Abstract sounds are suitable candidates to weaken this problem, since they are not easily associated with a recognizable source. In [32], the goals were to determine how a sense is attributed to a sound and whether there are similarities between the brain processing of sounds and of words. For that, a priming protocol was used with word/sound pairs, and the degree of congruence between the prime and the target was manipulated. To design the stimuli, seventy abstract sounds from the nine balanced (see section 2) categories of Schaeffer's typology table were evaluated in a pre-test to define the word/sound pairs. The sounds were presented successively to listeners who were asked to write the first words that came to their mind after listening. A large variety of words was given by the listeners. One of the sounds obtained, for instance, the following responses: dry, wildness, peak, winter, icy, polar, cold. Nevertheless, for most sounds, it was possible to find a common word that was accepted as coherent by more than 50% of the listeners. By associating these common words with the abstract sounds, we designed forty-five related word/sound pairs.
8 These are arguable translations of the French adjectives: familier, rassurant, plaisant, bizarre, angoissant, envahissant.


The non-related pairs were constructed by recombining words and sounds randomly. This step also allowed us to validate the abstract sounds, since no label referring to the actual source was given. Indeed, when listeners were asked to explicitly label the abstract sounds, the labels collected were mostly related to sound quality rather than to a source. In a first experiment, a written word (prime) was visually presented before a sound (target) and subjects had to decide whether or not the sound and the word fit together. In a second experiment, the presentation order was reversed (i.e. the sound was presented before the word). Results showed that participants were able to evaluate the semiotic relation between the prime and the target in both sound-word and word-sound presentations, with relatively low inter-subject variability and good consistency (see [32] for details on the experimental data and related analyses). This result indicated that abstract sounds are suitable for studying conceptual processing. Moreover, their contextualization by the presentation of a word reduced the variability of interpretations and led to a consensus between listeners. The study also revealed similarities in the electrophysiological patterns (event-related potentials) between abstract sound and word targets, supporting the assumption that similar processing is involved for linguistic and non-linguistic sounds.
4.3 Sound Synthesis

Intuitive control of synthesizers through high-level parameters is still an open problem in virtual reality and sound design. Both in industrial and musical contexts, the challenge consists of creating sounds from a semantic description of their perceptual correlates. As discussed above, abstract sounds can be rich from an acoustic point of view and make it possible to test different spectro-temporal characteristics at the same time. They might thus be useful for identifying general signal properties that are characteristic of different sound categories. In addition, they are particularly well suited to reproduction through loudspeakers (as is the case for synthesizers). For this purpose, we proposed a general methodology based on the evaluation and analysis of abstract sounds, aiming at identifying perceptually relevant signal characteristics and proposing an intuitive synthesis control. Given a set of desired control parameters and a set of sounds, the proposed method consists of asking listeners to evaluate the sounds on scales defined by the control parameters. Sounds with the same/different values on a scale are then analyzed in order to identify signal correlates. Finally, using feature-based synthesis [20], signal transformations are defined to propose an intuitive control strategy.
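As an illustration of this evaluation-analysis step, the following minimal Python sketch correlates a single acoustic descriptor (a spectral centroid, chosen only as an example) with averaged listener ratings on one perceptual scale; the function and variable names are illustrative and not taken from the study.

```python
import numpy as np

def spectral_centroid(signal, sr, n_fft=2048, hop=512):
    """Mean spectral centroid (Hz) over short-time frames of one sound."""
    centroids = []
    for start in range(0, len(signal) - n_fft, hop):
        frame = signal[start:start + n_fft] * np.hanning(n_fft)
        mag = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
        if mag.sum() > 0:
            centroids.append(np.sum(freqs * mag) / np.sum(mag))
    return float(np.mean(centroids))

def rating_descriptor_correlation(sounds, sr, ratings):
    """Pearson correlation between one acoustic descriptor and the mean
    listener ratings on one perceptual scale (e.g. 'bizarre')."""
    descriptor = np.array([spectral_centroid(s, sr) for s in sounds])
    mean_rating = ratings.mean(axis=0)        # average over listeners
    return np.corrcoef(descriptor, mean_rating)[0, 1]

# Hypothetical usage: 20 sounds, 14 listeners, ratings in [0, 1]
# sounds = [np.ndarray, ...]; ratings = np.random.rand(14, 20)
# r = rating_descriptor_correlation(sounds, 44100, ratings)
```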
In [23], we addressed the control of the perceived movement evoked by monophonic sounds. We first conducted a free categorization task, asking subjects to group sounds that evoke a similar movement and to label each category. The aim of this method was to identify sound categories in order to further identify perceptually relevant sound parameters specific to each category. Sixty-two abstract sounds were considered for this purpose. Based on the subjects' responses, we identified six main categories of perceived movements: "rotate", "fall down", "approach", "pass by", "go away" and "go up", and identified a set of sounds representative of each category. Note that, as in the previous studies, the labels given by the subjects did not refer to the sound source but rather to an evocation. Based on this first study, we aimed at refining the perceptual characterization of movements and identifying relevant control parameters.


For that purpose, we selected 40 sounds from the initial corpus of 200 sounds. Note that, in the case of movement, we are aware that the recognition of the physical sound source can introduce a bias in the evaluation: if the source can be easily identified, the corresponding movement is more likely to be linked to the source (a car sound only evokes horizontal movement and cannot fall or go up). Thus, we asked 29 listeners to evaluate the 40 sounds through a questionnaire including the two following questions, rated on linear scales:
– Is the sound source recognizable? (rated on a non-graduated scale from "not recognizable" to "easily recognizable")
– Is the sound natural? (rated from "natural" to "synthetic")
When the sources were judged recognizable, listeners were asked to write a few words to describe the source.
We found a correspondence between the responses to the two questions: a source is perceived as natural as long as it is easily recognized (R = .89). Note that abstract sounds were judged to be synthesized sounds even when they were actually recordings of vibrating bodies. Finally, we asked listeners to characterize the movements evoked by the sounds with a drawing interface that allowed them to represent combinations of the elementary movements found previously (a sound can rotate and go up at the same time) and whose drawing parameters correspond to potential control parameters of the synthesizer. Results showed that it was possible to determine the relevant perceptual features and to propose an intuitive control strategy for a synthesizer dedicated to movements evoked by sounds.

Conclusion

In this paper, we presented the advantages of using abstract sounds in audio and perception research, based on a review of studies in which we exploited their distinctive features. The richness of abstract sounds in terms of their acoustic characteristics and potential evocations opens various perspectives. Indeed, they are generally perceived as unrecognizable, "synthetic" and "bizarre" depending on context and task, and these aspects can be relevant to help listeners focus on the intrinsic properties of sounds, to orient the type of listening, to evoke specific emotions, or to better investigate individual differences. Moreover, they constitute a good compromise between elementary and ecological stimuli.

We addressed the design of the sound corpus and of specific procedures for listening tests using abstract sounds. In auditory perception research, sound categories based on well-identified sound sources are most often considered (verbal/non-verbal sounds, environmental sounds, music). The use of abstract sounds may allow the definition of more general sound categories based on other criteria, such as listeners' evocations or intrinsic sound properties. Based on empirical research from electroacoustic music, the sound typology proposed by P. Schaeffer should enable the definition of such new sound categories and may be relevant for future listening tests including any kind of sound.


Furthermore, since abstract sounds convey multiple kinds of information (several meanings can be attributed to them), the test procedure is important for orienting the type of listening towards the information that is actually of interest for the experiment.
Beyond these considerations, the resulting reflections may help us to address more general and fundamental questions related to the determination of the invariant signal morphologies responsible for evocations, and to what extent universal sound morphologies that do not depend on context and type of listening exist.

References
1. American Psychiatric Association: The Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV). American Psychiatric Association (1994), http://www.psychiatryonline.com/DSMPDF/dsm-iv.pdf (last viewed February 2011)
2. Ballas, J.A.: Common factors in the identification of an assortment of brief everyday sounds. Journal of Experimental Psychology: Human Perception and Performance 19, 250–267 (1993)
3. Bentin, S., McCarthy, G., Wood, C.C.: Event-related potentials, lexical decision and semantic priming. Electroencephalogr. Clin. Neurophysiol. 60, 343–355 (1985)
4. Bergman, P., Skold, A., Vastfjall, D., Fransson, N.: Perceptual and emotional categorization of sound. The Journal of the Acoustical Society of America 126, 3156–3167 (2009)
5. Bloit, J., Rasamimanana, N., Bevilacqua, F.: Towards morphological sound description using segmental models. In: DAFx, Milan, Italy (2009)
6. Bonebright, T.L., Miner, N.E., Goldsmith, T.E., Caudell, T.P.: Data collection and analysis techniques for evaluating the perceptual qualities of auditory stimuli. ACM Trans. Appl. Percept. 2, 505–516 (2005)
7. Bonebright, T.L., Nees, M.A.: Most earcons do not interfere with spoken passage comprehension. Applied Cognitive Psychology 23, 431–445 (2009)
8. Bregman, A.S.: Auditory Scene Analysis. The MIT Press, Cambridge (1990)
9. Chion, M.: Audio-Vision: Sound on Screen. Columbia University Press, New York (1993)
10. Cross, L.: Electronic music, 1948-1953. Perspectives of New Music (1968)
11. Dack, J.: Abstract and concrete. Journal of Electroacoustic Music 14 (2002)
12. Dack, J., North, C.: Translating Pierre Schaeffer: Symbolism, literature and music. In: Proceedings of EMS 2006 Conference, Beijing (2006)
13. Dessein, A., Lemaitre, G.: Free classification of vocal imitations of everyday sounds. In: Sound and Music Computing (SMC 2009), Porto, Portugal, pp. 213–218 (2009)
14. Dubois, D., Guastavino, C., Raimbault, M.: A cognitive approach to urban soundscapes: Using verbal data to access everyday life auditory categories. Acta Acustica United with Acustica 92, 865–874 (2006)
15. Eimert, H.: What is electronic music. Die Reihe 1 (1957)
16. Fastl, H.: Neutralizing the meaning of sound for sound quality evaluations. In: Proc. Int. Congress on Acoustics ICA 2001, Rome, Italy, vol. 4, CD-ROM (2001)
17. Gaver, W.W.: How do we hear in the world? Explorations of ecological acoustics. Ecological Psychology 5, 285–313 (1993)
18. Gaver, W.W.: What in the world do we hear? An ecological approach to auditory source perception. Ecological Psychology 5, 1–29 (1993)


19. Hermann, T.: Taxonomy and definitions for sonification and auditory display. In: Proceedings of the 14th International Conference on Auditory Display, Paris, France (2008)
20. Hoffman, M., Cook, P.R.: Feature-based synthesis: Mapping acoustic and perceptual features onto synthesis parameters. In: Proceedings of the 2006 International Computer Music Conference (ICMC), New Orleans (2006)
21. Jekosch, U.: Assigning meaning to sounds – semiotics in the context of product-sound design. In: Blauert, J. (ed.) Communication Acoustics, pp. 193–221. Springer (2005)
22. McKay, C., McEnnis, D., Fujinaga, I.: A large publicly accessible prototype audio database for music research (2006)
23. Merer, A., Ystad, S., Kronland-Martinet, R., Aramaki, M.: Semiotics of sounds evoking motions: Categorization and acoustic features. In: Kronland-Martinet, R., Ystad, S., Jensen, K. (eds.) CMMR 2007. LNCS, vol. 4969, pp. 139–158. Springer, Heidelberg (2008)
24. Micoulaud-Franchi, J.A., Cermolacce, M., Vion-Dury, J.: Bizarre and familiar recognition troubles of auditory perception in patients with schizophrenia (2010) (in preparation)
25. Moore, B.C.J., Tan, C.T.: Perceived naturalness of spectrally distorted speech and music. The Journal of the Acoustical Society of America 114, 408–419 (2003)
26. Peeters, G., Deruty, E.: Automatic morphological description of sounds. In: Acoustics 2008, Paris, France (2008)
27. Ricard, J., Herrera, P.: Morphological sound description: computational model and usability evaluation. In: AES 116th Convention (2004)
28. Risset, J.C., Wessel, D.L.: Exploration of timbre by analysis and synthesis. In: Deutsch, D. (ed.) The Psychology of Music. Series in Cognition and Perception, pp. 113–169. Academic Press, London (1999)
29. Schaeffer, P.: Traité des objets musicaux. Éditions du Seuil (1966)
30. Schaeffer, P., Reibel, G.: Solfège de l'objet sonore. INA-GRM (1967)
31. Schlauch, R.S.: Loudness. In: Ecological Psychoacoustics, pp. 318–341. Elsevier, Amsterdam (2004)
32. Schön, D., Ystad, S., Kronland-Martinet, R., Besson, M.: The evocative power of sounds: Conceptual priming between words and nonverbal sounds. Journal of Cognitive Neuroscience 22, 1026–1035 (2010)
33. Shafiro, V., Gygi, B.: How to select stimuli for environmental sound research and where to find them. Behavior Research Methods, Instruments, & Computers 36, 590–598 (2004)
34. Smalley, D.: Defining timbre – refining timbre. Contemporary Music Review 10, 35–48 (1994)
35. Smalley, D.: Space-form and the acousmatic image. Org. Sound 12, 35–58 (2007)
36. Tanaka, K., Matsubara, K., Sato, T.: Study of onomatopoeia expressing strange sounds: Cases of impulse sounds and beat sounds. Transactions of the Japan Society of Mechanical Engineers C 61, 4730–4735 (1995)
37. Thoresen, L., Hedman, A.: Spectromorphological analysis of sound objects: an adaptation of Pierre Schaeffer's typomorphology. Organised Sound 12, 129–141 (2007)
38. Zeitler, A., Ellermeier, W., Fastl, H.: Significance of meaning in sound quality evaluation. Fortschritte der Akustik, CFA/DAGA 4, 781–782 (2004)
39. Zeitler, A., Hellbrueck, J., Ellermeier, W., Fastl, H., Thoma, G., Zeller, P.: Methodological approaches to investigate the effects of meaning, expectations and context in listening experiments. In: INTER-NOISE 2006, Honolulu, Hawaii (2006)

Pattern Induction and Matching in Music Signals

Anssi Klapuri

Centre for Digital Music, Queen Mary University of London
Mile End Road, E1 4NS London, United Kingdom
anssi.klapuri@eecs.qmul.ac.uk
http://www.elec.qmul.ac.uk/people/anssik/

Abstract. This paper discusses techniques for pattern induction and matching in musical audio. At all levels of music - harmony, melody, rhythm, and instrumentation - the temporal sequence of events can be subdivided into shorter patterns that are sometimes repeated and transformed. Methods are described for extracting such patterns from musical audio signals (pattern induction) and computationally feasible methods for retrieving similar patterns from a large database of songs (pattern matching).

Introduction

Pattern induction and matching plays an important part in understanding the structure of a given music piece and in detecting similarities between two different music pieces. The term pattern is here used to refer to sequential structures that can be characterized by a time series of feature vectors x1, x2, ..., xT. The vectors xt may represent acoustic features calculated at regular time intervals or discrete symbols with varying durations. Many different elements of music can be represented in this form, including melodies, drum patterns, and chord sequences, for example.
In order to focus on the desired aspect of music, such as the drum track or the lead vocals, it is often necessary to extract that part from the polyphonic music signal. Section 2 of this paper will discuss methods for separating meaningful musical objects from polyphonic recordings.
Contrary to speech, there is no global dictionary of patterns or words that
would be common to all music pieces, but in a certain sense, the dictionary of
patterns is created anew in each music piece. The term pattern induction here
refers to the process of learning to recognize sequential structures from repeated
exposure [63]. Repetition plays an important role here: rhythmic patterns are
repeated, melodic phrases recur and vary, and even entire sections, such as the
chorus in popular music, are repeated. This kind of self-reference is crucial for
imposing structure on a music piece and enables the induction of the underlying
prototypical patterns. Pattern induction will be discussed in Sec. 3.
Pattern matching, in turn, consists of searching a database of music for segments that are similar to a given query pattern.



Since the target matches can in principle be located at any temporal position and are not necessarily scaled to the same length as the query pattern, the temporal alignment of the query and target patterns poses a significant computational challenge in large databases. Given that the alignment problem can be solved, another prerequisite for meaningful pattern matching is to define a distance measure between musical patterns of different kinds. These issues will be discussed in Sec. 4.

Pattern processing in music has several interesting applications, including music information retrieval, music classification, cover song identification, and the creation of mash-ups by blending matching excerpts from different music pieces. Given a large database of music, quite detailed queries can be made, such as searching for a piece that would work as an accompaniment for a user-created melody.

Extracting the Object of Interest from Music

There are various levels at which pattern induction and matching can take place in music. At one extreme, a polyphonic music signal is considered as a coherent whole and features describing, for example, its harmonic or timbral aspects are calculated. In a more analytic approach, some part of the signal, such as the melody or the drums, is extracted before the feature calculation. Both of these approaches are valid from the perceptual viewpoint. Human listeners, especially trained musicians, can switch between a holistic listening mode and a more analytic one where they focus on the part played by a particular instrument or decompose music into its constituent elements and their relationships [8,3].
Even when a music signal is treated as a coherent whole, it is necessary to transform the acoustic waveform into a series of feature vectors x1, x2, ..., xT that characterize the desired aspect of the signal. Among the most widely used features are Mel-frequency cepstral coefficients (MFCCs), which represent the timbral content of a signal in terms of its spectral energy distribution [73]. The local harmonic content of a music signal, in turn, is often summarized using a 12-dimensional chroma vector that represents the amount of spectral energy falling at each of the 12 tones of an equally-tempered scale [5,50]. Rhythmic aspects are conveniently represented by the modulation spectrum, which encodes the pattern of sub-band energy fluctuations within windows of approximately one second in length [15,34]. Besides these, there are a number of other acoustic features; see [60] for an overview.
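As an illustration of the chroma representation mentioned above, the following minimal numpy sketch maps STFT bins to the 12 pitch classes; real chroma front-ends (e.g. [5,50]) additionally include tuning estimation and smoothing, which are omitted here.

```python
import numpy as np

def chroma_features(signal, sr, n_fft=4096, hop=1024, fmin=55.0, fmax=5000.0):
    """Frame-wise 12-bin chroma: map each FFT bin to a pitch class and
    accumulate its magnitude. Returns an array of shape (n_frames, 12)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    valid = (freqs >= fmin) & (freqs <= fmax)
    # Pitch class of each FFT bin (the A = 0 reference is arbitrary here).
    pitch_class = (np.round(12 * np.log2(freqs[valid] / 440.0)) % 12).astype(int)
    window = np.hanning(n_fft)
    chroma = []
    for start in range(0, len(signal) - n_fft, hop):
        mag = np.abs(np.fft.rfft(signal[start:start + n_fft] * window))[valid]
        frame = np.zeros(12)
        np.add.at(frame, pitch_class, mag)            # sum magnitudes per pitch class
        chroma.append(frame / (frame.sum() + 1e-12))  # normalize each frame
    return np.array(chroma)
```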
Focusing pattern extraction on a certain instrument or part in polyphonic music requires that the desired part be pulled apart from the rest before the feature
extraction. While this is not entirely straightforward in all cases, it enables musically more interesting pattern induction and matching, such as looking at the
melodic contour independently of the accompanying instruments. Some strategies towards decomposing a music signal into its constituent parts are discussed
in the following.

2.1 Time-Frequency and Spatial Analysis

Musical sounds, like most natural sounds, tend to be sparse in the time-frequency
domain, meaning that the sounds can be approximated using a small number of
non-zero elements in the time-frequency domain. This facilitates sound source
separation and audio content analysis. Usually the short-time Fourier transform
(STFT) is used to represent a given signal in the time-frequency domain. A
viable alternative to the STFT is the constant-Q transform (CQT), where the center frequencies of the frequency bins are geometrically spaced [9,68]. The CQT is particularly well suited to the analysis of music signals, since the fundamental frequencies (F0s) of the tones in Western music are geometrically spaced.
Spatial information can sometimes be used to assign time-frequency components to their respective sound sources [83]. In the case of stereophonic audio, time-frequency components can be clustered based on the ratio of the left-channel amplitude to the right, for example. This simple principle has been demonstrated to be quite effective for some music types, such as jazz [4], despite the fact that overlapping partials partly undermine the idea. Duda et al. [18] used stereo information to extract the lead vocals from complex audio for the purpose of query-by-humming.
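A minimal sketch of the left/right amplitude-ratio principle is given below; it simply masks time-frequency bins by an estimated panning position and is not a reimplementation of any specific published system.

```python
import numpy as np
from scipy.signal import stft, istft

def separate_by_panning(left, right, sr, pan_lo=0.45, pan_hi=0.55, n_fft=2048):
    """Keep time-frequency bins whose left/right ratio corresponds to a
    panning position inside [pan_lo, pan_hi] (0 = hard left, 1 = hard right)."""
    _, _, L = stft(left, fs=sr, nperseg=n_fft)
    _, _, R = stft(right, fs=sr, nperseg=n_fft)
    pan = np.abs(R) / (np.abs(L) + np.abs(R) + 1e-12)   # 0..1 panning index
    mask = (pan >= pan_lo) & (pan <= pan_hi)
    _, out_l = istft(L * mask, fs=sr, nperseg=n_fft)
    _, out_r = istft(R * mask, fs=sr, nperseg=n_fft)
    return out_l, out_r

# Keeping roughly centre-panned bins often isolates the lead vocals in
# simple stereo mixes, in the spirit of the idea discussed above.
```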
2.2 Separating Percussive Sounds from the Harmonic Part

It is often desirable to analyze the drum track of music separately from the harmonic part. The sinusoids+noise model is the most widely used technique for this purpose [71]. It produces quite robust quality for the noise residual, although the sinusoidal (harmonic) part often suffers quality degradation for music with dense sets of sinusoids, such as orchestral music.
Ono et al. proposed a method which decomposes the power spectrogram X of a mixture signal (an F × T matrix) into a harmonic part H and a percussive part P so that X = H + P [52]. The decomposition is done by minimizing an objective function that measures variation over time n for the harmonic part and variation over frequency k for the percussive part. The method is straightforward to implement and produces good results.
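For illustration, the sketch below performs a harmonic/percussive split that relies on the same smoothness assumptions; instead of the iterative objective of Ono et al. [52], it uses a simpler median-filtering variant, so it should be read as an approximation rather than as the method of [52].

```python
import numpy as np
import scipy.ndimage
from scipy.signal import stft, istft

def hpss(signal, sr, n_fft=2048, kernel=17):
    """Harmonic/percussive split of a mono signal via median filtering of the
    power spectrogram along time (harmonic) and frequency (percussive)."""
    _, _, X = stft(signal, fs=sr, nperseg=n_fft)
    S = np.abs(X) ** 2
    H = scipy.ndimage.median_filter(S, size=(1, kernel))  # smooth over time
    P = scipy.ndimage.median_filter(S, size=(kernel, 1))  # smooth over frequency
    mask_h = H / (H + P + 1e-12)                          # soft Wiener-like mask
    _, harmonic = istft(X * mask_h, fs=sr, nperseg=n_fft)
    _, percussive = istft(X * (1.0 - mask_h), fs=sr, nperseg=n_fft)
    return harmonic, percussive
```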
Non-negative matrix factorization (NMF) is a technique that decomposes the spectrogram of a music signal into a linear sum of components that have a fixed spectrum and time-varying gains [41,76]. Helen and Virtanen used NMF to separate the magnitude spectrogram of a music signal into a couple of dozen components and then used a support vector machine (SVM) to classify each component either to pitched instruments or to drums, based on features extracted from the spectrum and the gain function of each component [31].
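A minimal sketch of the NMF decomposition with standard multiplicative updates is shown below; the component classification step (e.g. the SVM of [31]) is not included.

```python
import numpy as np

def nmf(V, n_components, n_iter=200, seed=0):
    """Plain NMF with multiplicative updates minimizing squared error:
    V (freq x time, non-negative) ~ W (spectra) @ H (time-varying gains)."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, n_components)) + 1e-3
    H = rng.random((n_components, T)) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-12)
    return W, H

# Each column of W is a fixed spectrum and each row of H its gain function;
# features of W and H can then be fed to a classifier (pitched vs. drums).
```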
2.3 Extracting Melody and Bass Line

Vocal melody is usually the main focus of attention for an average music listener,
especially in popular music. It tends to be the part that makes music memorable
and easily reproducible by singing or humming [69].


Several different methods have been proposed for main melody extraction from polyphonic music. The task was first considered by Goto [28], and various methods for melody tracking have later been proposed by Paiva et al. [54], Ellis and Poliner [22], Dressler [17], and Ryynänen and Klapuri [65]. Typically, the methods are based on framewise pitch estimation followed by tracking or streaming over time. Some methods involve a timbral model [28,46,22] or a musicological model [67]. For comparative evaluations of the different methods, see [61] and [www.music-ir.org/mirex/].
Melody extraction is closely related to vocals separation: extracting the melody facilitates lead vocals separation, and vice versa. Several different approaches have been proposed for separating the vocals signal from polyphonic music, some based on tracking the pitch of the main melody [24,45,78], some based on timbre models for the singing voice and for the instrumental background [53,20], and yet others utilizing stereo information [4,18].
The bass line is another essential part in many music types and usually contains a great deal of repetition and note patterns that are rhythmically and tonally interesting. Indeed, high-level features extracted from the bass line and the playing style have been successfully used for music genre classification [1]. Methods for extracting the bass line from polyphonic music have been proposed by Goto [28], Hainsworth [30], and Ryynänen [67].
2.4 Instrument Separation from Polyphonic Music

For human listeners, it is natural to assign simultaneously occurring sounds to their respective sound sources. When listening to music, people are often able to focus on a given instrument despite the fact that music intrinsically tries to make co-occurring sounds blend as well as possible.
Separating the signals of individual instruments from a music recording has recently been studied using various approaches. Some are based on grouping sinusoidal components to sources (see e.g. [10]), whereas others utilize a structured signal model [19,2]. Some methods are based on supervised learning of instrument-specific harmonic models [44], whereas recently several methods have been proposed based on unsupervised learning [23,75,77]. Some methods do not aim at separating time-domain signals, but extract the relevant information (such as instrument identities) directly in some other domain [36].
Automatic instrument separation from a monaural or stereophonic recording
would enable pattern induction and matching for the individual instruments.
However, source separation from polyphonic music is extremely challenging and
the existing methods are generally not as reliable as those intended for melody
or bass line extraction.

Pattern Induction

Pattern induction deals with the problem of detecting repeated sequential structures in music and learning the pattern underlying these repetitions. In the following, we discuss the problem of musical pattern induction from a general perspective.


Fig. 1. A piano-roll representation of an excerpt from Mozart's Turkish March. The vertical lines indicate a possible grouping of the component notes into phrases.

We assume that a time series of feature vectors x1, x2, ..., xT describing the desired characteristics of the input signal is given. The task of pattern induction, then, is to detect repeated sequences in this data and to learn a prototypical pattern that can be used to represent all of its occurrences. What makes this task challenging is that the data is generally multidimensional and real-valued (as opposed to symbolic data), and furthermore, music seldom repeats itself exactly: variations and transformations are applied to each occurrence of a given pattern.
3.1 Pattern Segmentation and Clustering

The basic idea of this approach is to subdivide the feature sequence x1, x2, ..., xT into shorter segments and then cluster these segments in order to find repeated patterns. The clustering part requires that a distance measure between two feature segments is defined – a question that will be discussed separately in Sec. 4 for different types of features.

For pitch sequences, such as melody and bass lines, there are well-defined musicological rules for how individual sounds are perceptually grouped into melodic phrases and further into larger musical entities in a hierarchical manner [43]. This process is called grouping and is based on relatively simple principles, such as preferring a phrase boundary at a point where the time or pitch interval between two consecutive notes is larger than in the immediate vicinity (see Fig. 1 for an example). Pattern induction then proceeds by choosing a certain time scale, performing the phrase segmentation, cropping the pitch sequences according to the shortest phrase, clustering the phrases using for example k-means clustering, and finally using the pattern nearest to each cluster centroid as the prototype pattern for that cluster.
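The following sketch illustrates this segment-and-cluster idea under strongly simplifying assumptions: phrases are cut at unusually long inter-onset intervals (a crude stand-in for the grouping rules of [43]), cropped to the shortest phrase, and clustered with k-means.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def segment_phrases(onsets, pitches, gap_factor=1.5):
    """Split a note sequence at inter-onset intervals clearly larger than
    the median interval -- a crude version of perceptual grouping."""
    ioi = np.diff(onsets)
    boundaries = np.where(ioi > gap_factor * np.median(ioi))[0] + 1
    return np.split(np.asarray(pitches, dtype=float), boundaries)

def cluster_phrases(phrases, n_clusters=4):
    """Crop phrases to the shortest one and k-means-cluster them; the
    centroids act as prototype patterns."""
    min_len = min(len(p) for p in phrases)
    data = np.array([p[:min_len] for p in phrases])
    centroids, labels = kmeans2(data, n_clusters, minit='++')
    return centroids, labels
```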
A difficulty in implementing the phrase segmentation for audio signals is that, contrary to MIDI, note durations and rests are difficult to extract from audio. Nevertheless, some methods produce discrete note sequences from music [67,82] and thus enable segmenting the transcription result into phrases.
Musical meter is an alternative criterion for segmenting musical feature sequences into shorter parts for the purpose of clustering. Computational meter analysis usually involves tracking the beat and locating bar lines in music.


The good news here is that meter analysis is a well-understood and feasible problem for audio signals too (see e.g. [39]). Furthermore, melodic phrase boundaries often coincide with strong beats, although this is not always the case. For melodic patterns, for example, this segmenting rule effectively requires two patterns to be similarly positioned with respect to the musical measure boundaries in order for them to be similar, which may sometimes be too strong an assumption. However, for drum patterns this requirement is well justified.

Bertin-Mahieux et al. performed harmonic pattern induction for a large database of music in [7]. They calculated a 12-dimensional chroma vector for each musical beat in the target songs. The beat-synchronous chromagram data was then segmented at barline positions, and the resulting beat-chroma patches were vector quantized to obtain a couple of hundred prototype patterns.
A third strategy is to avoid segmentation altogether by using shift-invariant features. As an example, let us consider a sequence of one-dimensional features x1, x2, ..., xT. The sequence is first segmented into partly overlapping frames whose length is approximately the same as that of the patterns being sought. The sequence within each frame is then Fourier transformed and the phase information is discarded in order to make the features shift-invariant. The resulting magnitude spectra are then clustered to find repeated patterns. The modulation spectrum features (aka fluctuation patterns) mentioned in the beginning of Sec. 2 are an example of such a shift-invariant feature [15,34].
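A minimal sketch of such shift-invariant features is given below: each frame of a one-dimensional feature sequence is represented by the magnitude of its DFT.

```python
import numpy as np

def shift_invariant_patterns(x, frame_len, hop):
    """Frame a 1-D feature sequence and keep only the magnitude of its DFT,
    discarding phase so that circularly shifted patterns map to (nearly)
    the same feature vector."""
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = np.asarray(x[start:start + frame_len], dtype=float)
        frames.append(np.abs(np.fft.rfft(frame - frame.mean())))
    return np.array(frames)   # rows can then be clustered, e.g. with k-means
```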
3.2 Self-distance Matrix

Pattern induction, in the sense defined in the beginning of this section, is possible only if a pattern is repeated in a given feature sequence. The repetitions need not be identical, but must bear some similarity to each other. A self-distance matrix (aka self-similarity matrix) offers a direct way of detecting these similarities. Given a feature sequence x1, x2, ..., xT and a distance function d that specifies the distance between two feature vectors xi and xj, the self-distance matrix (SDM) is defined as

    D(i, j) = d(xi, xj)    (1)

for i, j ∈ {1, 2, ..., T}. Frequently used distance measures include the Euclidean distance ‖xi − xj‖ and the cosine distance 0.5(1 − ⟨xi, xj⟩ / (‖xi‖ ‖xj‖)). Repeated sequences appear in the SDM as off-diagonal stripes. Methods for detecting these will be discussed below.
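For illustration, Eq. (1) can be computed directly as follows; the distance measures are those mentioned above, and a beat-synchronous chroma sequence is used only as an example input.

```python
import numpy as np

def self_distance_matrix(X, metric="cosine"):
    """Self-distance matrix D(i, j) = d(x_i, x_j) for a feature sequence X
    of shape (T, dim)."""
    if metric == "euclidean":
        diff = X[:, None, :] - X[None, :, :]
        return np.linalg.norm(diff, axis=-1)
    norms = np.linalg.norm(X, axis=1, keepdims=True) + 1e-12
    cos_sim = (X / norms) @ (X / norms).T
    return 0.5 * (1.0 - cos_sim)

# Repetitions show up as stripes parallel to the main diagonal, e.g.
# D = self_distance_matrix(chroma)   # chroma: (beats, 12)
```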
An obvious difficulty in calculating the SDM is that when the length T of the feature sequence is large, the number of distance computations T² may become computationally prohibitive. A typical solution is to use beat-synchronized features: a beat tracking algorithm is applied and the features xt are then calculated within (or averaged over) each inter-beat interval. Since the average inter-beat interval is approximately 0.5 seconds – much larger than a typical analysis frame size – this greatly reduces the number of elements in the time sequence and in the SDM. An added benefit of using beat-synchronous features is that this compensates for tempo fluctuations within the piece under analysis. As a result, repeated sequences appear in the SDM as stripes that run exactly parallel to the main diagonal. Figure 2 shows an example SDM calculated using beat-synchronous chroma features.

Fig. 2. A self-distance matrix for Chopin's Etude Op. 25 No. 9, calculated using beat-synchronous chroma features. As the off-diagonal dark stripes indicate, the note sequence between 1 s and 5 s starts again at 5 s, and later at 28 s and 32 s in a varied form.
Self-distance matrices have been widely used for audio-based analysis of the sectional form (structure) of music pieces [12,57]. In that domain, several different methods have been proposed for localizing the off-diagonal stripes that indicate repeating sequences in the music [59,27,55]. Goto, for example, first calculates a marginal histogram which indicates the diagonal bands that contain considerable repetition, and then finds the beginning and end points of the repeated segments in a second step [27]. Serra has proposed an interesting method for detecting locally similar sections in two feature sequences [70].
3.3 Lempel-Ziv-Welch Family of Algorithms

Repeated patterns are heavily utilized in universal lossless data compression algorithms. The Lempel-Ziv-Welch (LZW) algorithm, in particular, is based on
matching and replacing repeated patterns with code values [80]. Let us denote a
sequence of discrete symbols by s1 , s2 , . . . , sT . The algorithm initializes a dictionary which contains codes for individual symbols that are possible at the input.
At the compression stage, the input symbols are gathered into a sequence until
the next character would make a sequence for which there is no code yet in the
dictionary, and a new code for that sequence is then added to the dictionary.
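The dictionary-building step can be sketched as follows for a generic symbol sequence; this is a minimal illustration of LZW rather than a full codec.

```python
def lzw_dictionary(symbols):
    """Build the LZW dictionary for a sequence of discrete symbols and
    return it together with the emitted code sequence."""
    dictionary = {(s,): i for i, s in enumerate(sorted(set(symbols)))}
    codes, current = [], ()
    for s in symbols:
        extended = current + (s,)
        if extended in dictionary:
            current = extended                      # keep growing the match
        else:
            codes.append(dictionary[current])       # emit code for the match
            dictionary[extended] = len(dictionary)  # new entry = match + next symbol
            current = (s,)
    if current:
        codes.append(dictionary[current])
    return dictionary, codes

# Repeated subsequences end up as long dictionary entries, e.g.
# lzw_dictionary(list("abcabcabc"))
```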
The usefulness of the LZW algorithm for musical pattern matching is limited by the fact that it requires a sequence of discrete symbols as input, as opposed to real-valued feature vectors.


This means that a given feature vector sequence has to be vector-quantized before processing with LZW. In practice, beat-synchronous feature extraction is also needed to ensure that the lengths of repeated sequences are not affected by tempo fluctuation. Vector quantization (VQ, [25]) as such is not a problem, but choosing a suitable level of granularity becomes very difficult: if the number of symbols is too large, two repeats of a certain pattern are quantized dissimilarly, and if the number of symbols is too small, too much information is lost in the quantization and spurious repeats are detected.

Another inherent limitation of the LZW family of algorithms is that they require exact repetition. This is usually not appropriate in music, where variation is more the rule than the exception. Moreover, the beginning and end times of the learned patterns are arbitrarily determined by the order in which the input sequence is analyzed. Improvements over the LZW family of algorithms for musical pattern induction have been considered e.g. by Lartillot et al. [40].
3.4 Markov Models for Sequence Prediction

Pattern induction is often used for the purpose of predicting a data sequence. N-gram models are a popular choice for predicting a sequence of discrete symbols s1, s2, ..., sT [35]. In an N-gram, the preceding N − 1 symbols are used to determine the probabilities of different symbols appearing next, P(st | st−1, ..., st−N+1). Increasing N gives more accurate predictions, but requires a very large amount of training data to estimate the probabilities reliably. A better solution is to use a variable-order Markov model (VMM), for which the context length varies in response to the available statistics in the training data [6]. This is a very desirable property: for note sequences, it means that both short and long note sequences can be modeled within a single model, based on their occurrences in the training data. Probabilistic predictions can be made even when patterns do not repeat exactly.
Ryynänen and Klapuri used VMMs as a predictive model in a method that transcribes bass lines in polyphonic music [66]. They used the VMM toolbox of Begleiter et al. for VMM training and prediction [6].
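The sketch below illustrates the idea with a plain fixed-order N-gram and a crude back-off to shorter contexts when a context has not been observed; it is only a simplified stand-in for the VMMs of [6].

```python
from collections import Counter, defaultdict

def train_ngram(symbols, order=3):
    """Count next-symbol occurrences after every context of length 0..order-1."""
    counts = defaultdict(Counter)
    for i in range(len(symbols)):
        for n in range(min(order - 1, i) + 1):
            counts[tuple(symbols[i - n:i])][symbols[i]] += 1
    return counts

def predict_next(counts, context):
    """P(s_t | context); shorten the context when it has never been seen."""
    context = tuple(context)
    while context and context not in counts:
        context = context[1:]
    dist = counts.get(context)
    if not dist:
        return {}
    total = sum(dist.values())
    return {s: c / total for s, c in dist.items()}
```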
3.5 Interaction between Pattern Induction and Source Separation

Music often introduces a certain pattern to the listener in a simpler form before
adding further layers of instrumentation at subsequent repetitions (and variations) of the pattern. Provided that the repetitions are detected via pattern
induction, this information can be fed back in order to improve the separation
and analysis of certain instruments or parts in the mixture signal. This idea was
used by Mauch et al. who used information about music structure to improve
recognition of chords in music [47].

Pattern Matching

This section considers the problem of searching a database of music for segments
that are similar to a given pattern.

Fig. 3. A matrix of distances used by DTW to find a time-alignment between different feature sequences. The vertical axis represents the time in a query excerpt (Queen's Bohemian Rhapsody). The horizontal axis corresponds to the concatenation of features from three different excerpts: 1) the query itself, 2) Target 1 (Bohemian Rhapsody performed by the London Symphonium Orchestra) and 3) Target 2 (It's a Kind of Magic by Queen). The beginnings of the three targets are indicated below the matrix. Darker values indicate smaller distance.

The query pattern is denoted by a feature sequence y1, y2, ..., yM, and for convenience, x1, x2, ..., xT is used to denote the concatenation of the feature sequences extracted from all target music pieces.
Before discussing the similarity metrics between two music patterns, let us
consider the general computational challenges in comparing a query pattern
against a large database, an issue that is common to all types of musical patterns.
4.1 Temporal Alignment Problem in Pattern Comparison

Pattern matching in music is computationally demanding, because the query pattern can in principle occur at any position in the target data and because the time scale of the query pattern may differ from that of the potential matches in the target data due to tempo differences. These two issues are here referred to as the time-shift and time-scale problems, respectively. Brute-force matching of the query pattern at all possible locations of the target data, using differently time-scaled versions of the query pattern, would be computationally infeasible for any database of significant size.
Dynamic time warping (DTW) is a technique that aims at solving both the time-shift and time-scale problems simultaneously. In DTW, a matrix of distances is computed so that element (i, j) of the matrix represents the pairwise distance between element i of the query pattern and element j of the target data (see Fig. 3 for an example). Dynamic programming is then applied to find a path of small distances from the first to the last row of the matrix, placing suitable constraints on the geometry of the path. DTW has been used for melodic pattern matching by Dannenberg [13], for structure analysis by Paulus [55], and for cover song detection by Serra [70], to mention a few examples.
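A minimal DTW sketch over two feature sequences is given below; it computes the full alignment rather than the subsequence variant needed for database search.

```python
import numpy as np

def dtw_distance(query, target):
    """Classic DTW between two feature sequences (rows = frames), using
    Euclidean frame distances and steps (1,0), (0,1), (1,1)."""
    Q, T = len(query), len(target)
    cost = np.linalg.norm(query[:, None, :] - target[None, :, :], axis=-1)
    D = np.full((Q + 1, T + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Q + 1):
        for j in range(1, T + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[Q, T] / (Q + T)   # length-normalized alignment cost
```

For matching a query anywhere inside a longer target, the first row of D can additionally be initialized to zero so that the alignment may start at any target position.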
Beat-synchronous feature extraction is an efficient mechanism for dealing with the time-scale problem, as already discussed in Sec. 3.


To allow some further flexibility in pattern scaling and to mitigate the effect of tempo estimation errors, it is sometimes useful to further time-scale the beat-synchronized query pattern by factors 1/2, 1, and 2, and to match each of these separately.
A remaining problem to be solved is the temporal shift: if the target database is very large, comparing the query pattern at every possible temporal position in the database can be infeasible. Shift-invariant features are one way of dealing with this problem: they can be used for approximate pattern matching to prune the target data, after which the temporal alignment is computed only for the best-matching candidates. This allows the first stage of matching to be performed an order of magnitude faster.
Another potential solution for the time-shift problem is to segment the target
database by meter analysis or grouping analysis, and then match the query
pattern only at temporal positions determined by estimated bar lines or group
boundaries. This approach was already discussed in Sec. 3.
Finally, efficient indexing techniques exist for dealing with extremely large databases. In practice, these require that the time-scale problem is eliminated (e.g. using beat-synchronous features) and that the number of time shifts is greatly reduced (e.g. using shift-invariant features or pre-segmentation). If these conditions are satisfied, locality-sensitive hashing (LSH), for example, enables sublinear search complexity for retrieving the approximate nearest neighbours of the query pattern from a large database [14]. Ryynänen et al. used LSH for melodic pattern matching in [64].
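The sketch below illustrates the pruning idea with random-hyperplane signatures, one member of the LSH family; it is not the p-stable scheme of [14], and the bucket handling is deliberately minimal.

```python
import numpy as np

def lsh_buckets(patterns, n_bits=16, seed=0):
    """Random-hyperplane hashing: patterns with a similar direction tend to
    receive the same binary signature, so candidate matches can be fetched
    from one bucket instead of scanning the whole database."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((patterns.shape[1], n_bits))
    bits = (patterns @ planes > 0).astype(np.uint8)   # (N, n_bits)
    keys = np.packbits(bits, axis=1)
    buckets = {}
    for idx, key in enumerate(map(bytes, keys)):
        buckets.setdefault(key, []).append(idx)
    return planes, buckets

def lsh_query(query, planes, buckets):
    key = bytes(np.packbits((query @ planes > 0).astype(np.uint8)))
    return buckets.get(key, [])   # indices of candidate neighbours
```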
4.2 Melodic Pattern Matching

Melodic pattern matching is usually considered in the context of query-by-humming (QBH), where a user's singing or humming is used as a query to retrieve music with a matching melodic fragment. Typically, the user's singing is first transcribed into a pitch trajectory or a note sequence before the matching takes place. QBH has been studied for more than 15 years and remains an active research topic [26,48].
Research on QBH originated in the context of retrieval from MIDI or score databases. Matching approaches include string matching techniques [42], hidden Markov models [49,33], dynamic programming [32,79], and efficient recursive alignment [81]. A number of QBH systems have been evaluated in the Music Information Retrieval Evaluation eXchange (MIREX) [16].
Methods for the QBH of audio data have been proposed only quite recently
[51,72,18,29,64]. Typically, the methods extract the main melodies from the target musical audio (see Sec. 2.3) before the matching takes place. However, it
should be noted that a given query melody can in principle be matched directly against polyphonic audio data in the time-frequency or time-pitch domain. Some on-line services incorporating QBH are already available, see e.g.
[www.midomi.com], [www.musicline.de], [www.musipedia.org].
Matching two melodic patterns requires a proper definition of similarity. The trivial assumption that two patterns are similar if they have identical pitches is usually not appropriate.


There are three main reasons that cause the query pattern and the target matches to differ: 1) the low quality of the sung queries (especially in the case of musically untrained users), 2) errors in extracting the main melodies automatically from music recordings, and 3) musical variation, such as fragmentation (elaboration) or consolidation (reduction) of a given melody [43]. One approach that works quite robustly in the presence of all these factors is to calculate the Euclidean distance between temporally aligned log-pitch trajectories. Musical key normalization can be implemented simply by normalizing the two pitch contours to zero mean. A more extensive review of research on melodic similarity can be found in [74].
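A minimal sketch of this contour comparison is given below; it resamples both trajectories to a fixed length instead of performing a proper temporal alignment, which is a simplification.

```python
import numpy as np

def melodic_distance(query_pitch, target_pitch):
    """Distance between two pitch trajectories given in semitones (e.g. MIDI
    numbers): resample to a common length, remove the mean of each contour
    for key invariance, then take the Euclidean distance."""
    n = 64                                           # common contour length
    grid = np.linspace(0.0, 1.0, n)
    q = np.interp(grid, np.linspace(0, 1, len(query_pitch)), query_pitch)
    t = np.interp(grid, np.linspace(0, 1, len(target_pitch)), target_pitch)
    q -= q.mean()                                    # key normalization
    t -= t.mean()
    return float(np.linalg.norm(q - t) / np.sqrt(n))
```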
4.3 Patterns in Polyphonic Pitch Data

Instead of using only the main melody for music retrieval, polyphonic pitch data can be processed directly. Multipitch estimation algorithms (see [11,38] for a review) can be used to extract multiple pitch values in successive time frames, or alternatively, a mapping from a time-frequency to a time-pitch representation can be employed [37]. Both of these approaches yield a representation in the time-pitch plane, the difference being that multipitch estimation algorithms yield a discrete set of pitch values, whereas mapping to a time-pitch plane yields a more continuous representation. Matching a query pattern against a database of music signals can then be carried out by a two-dimensional correlation analysis in the time-pitch plane.
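For illustration, such a correlation analysis can be sketched as follows, assuming both query and target are non-negative time-pitch matrices with rows indexing pitch and columns indexing time; shifting along the pitch axis corresponds to testing transpositions.

```python
import numpy as np
from scipy.signal import correlate2d

def best_match(query_tp, target_tp):
    """Slide a small query patch (pitch x time) over a larger target
    time-pitch representation and return the best-matching position."""
    q = query_tp - query_tp.mean()
    t = target_tp - target_tp.mean()
    corr = correlate2d(t, q, mode='valid')   # target must be >= query in both dims
    pitch_shift, time_offset = np.unravel_index(np.argmax(corr), corr.shape)
    return pitch_shift, time_offset, corr.max()
```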
4.4 Chord Sequences

Here we assume that chord information is represented as a discrete symbol sequence s1, s2, ..., sT, where st indicates the chord identity at time frame t.

Fig. 4. Major and minor triads arranged in a two-dimensional chord space. Here the Euclidean distance between any two points can be used to approximate the distance between chords. The dotted lines indicate the four distance parameters that define this particular space.


Measuring the distance between two chord sequences requires that the distance between each pair of different chords is defined. Often this distance is approximated by arranging chords in a one- or two-dimensional space and then using the geometric distance between chords in this space as the distance measure [62]; see Fig. 4 for an example. In the one-dimensional case, the circle of fifths is often used.
It is often useful to compare two chord sequences in a key-invariant manner. This can be done by expressing the chords in relation to the tonic (that is, using chord degrees instead of absolute chords), or by comparing all 12 possible transpositions and choosing the minimum distance.
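The sketch below illustrates one possible realization: chord roots are compared on the circle of fifths, differing modes add a fixed penalty, and key invariance is obtained by testing all 12 transpositions; the distance parameters are illustrative and not those of Fig. 4 or [62].

```python
import numpy as np

# Position of each root pitch class on the circle of fifths: two roots a
# fifth apart have distance 1, a whole tone apart distance 2, etc.
FIFTHS_POSITION = {pc: (pc * 7) % 12 for pc in range(12)}

def chord_distance(chord_a, chord_b, mode_penalty=1.0):
    """Chords as (root_pitch_class, 'maj'|'min'): circle-of-fifths distance
    between roots plus a fixed penalty when the modes differ."""
    fa, fb = FIFTHS_POSITION[chord_a[0]], FIFTHS_POSITION[chord_b[0]]
    circle = min((fa - fb) % 12, (fb - fa) % 12)
    return circle + (mode_penalty if chord_a[1] != chord_b[1] else 0.0)

def key_invariant_sequence_distance(seq_a, seq_b):
    """Compare two equal-length chord sequences under all 12 transpositions
    of the second sequence and keep the smallest average distance."""
    best = np.inf
    for shift in range(12):
        shifted = [((root + shift) % 12, mode) for root, mode in seq_b]
        d = np.mean([chord_distance(a, b) for a, b in zip(seq_a, shifted)])
        best = min(best, d)
    return best
```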
4.5 Drum Patterns and Rhythms

Here we discuss pattern matching in drum tracks that are presented as acoustic signals and are possibly extracted from polyphonic music using the methods described in Sec. 2.2. Applications of this include, for example, query-by-tapping [www.music-ir.org/mirex/] and music retrieval based on drum track similarity.

Percussive music devoid of both harmony and melody can contain a considerable amount of musical form and structure, encoded in the timbre, loudness, and timing relationships between the component sounds. Timbre and loudness characteristics can be conveniently represented by MFCCs extracted in successive time frames. Often, however, the absolute spectral shape and loudness of the component sounds are not of interest; instead, the timbre and loudness of the sounds relative to each other define the perceived rhythm. Paulus and Klapuri reduced the rhythmic information to a two-dimensional signal describing the evolution of loudness and spectral centroid over time, in order to compare rhythmic patterns performed using an arbitrary set of sounds [56]. The features were mean- and variance-normalized to allow comparison across different sound sets, and DTW was used to align the two patterns under comparison.
Ellis and Arroyo projected drum patterns onto a low-dimensional representation in which different rhythms could be represented as a linear sum of so-called eigenrhythms [21]. They collected 100 drum patterns from popular music tracks and estimated the bar line positions in these. Each pattern was normalized, and the resulting set of patterns was subjected to principal component analysis in order to obtain a set of basis patterns (eigenrhythms) that were then combined to approximate the original data. The low-dimensional representation of the drum patterns was used as a space for classification and for measuring similarity between rhythms.
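In the spirit of [21], the eigenrhythm idea can be sketched as a plain PCA of bar-aligned pattern vectors; the normalization and alignment steps of the original work are assumed to have been carried out beforehand.

```python
import numpy as np

def eigenrhythms(patterns, n_components=4):
    """PCA of bar-aligned, normalized drum patterns (rows = patterns,
    columns = time/instrument cells): the leading principal components act
    as basis patterns ('eigenrhythms')."""
    X = np.asarray(patterns, dtype=float)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:n_components]                 # eigenrhythms
    weights = (X - mean) @ basis.T            # low-dimensional coordinates
    return mean, basis, weights
```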
Non-negative matrix factorization (NMF, see Sec. 2.2) is another technique for
obtaining a mid-level representation for drum patterns [58]. The resulting component gain functions can be subjected to the eigenrhythm analysis described
above, or statistical measures can be calculated to characterize the spectra and
gain functions for rhythm comparison.


Conclusions

This paper has discussed the induction and matching of sequential patterns in musical audio. Such patterns are neglected by the commonly used bag-of-features approach to music retrieval, where statistics over feature vectors are calculated, collapsing the time structure altogether. Processing sequential structures poses computational challenges, but also enables musically interesting retrieval tasks beyond those possible with the bag-of-features approach. Some of these applications, such as query-by-humming services, are already available to consumers.
Acknowledgments. Thanks to Jouni Paulus for the Matlab code for computing self-distance matrices. Thanks to Christian Dittmar for the idea of using
repeated patterns to improve the accuracy of source separation and analysis.

References
1. Abesser, J., Lukashevich, H., Dittmar, C., Schuller, G.: Genre classification using bass-related high-level features and playing styles. In: Intl. Society on Music Information Retrieval Conference, Kobe, Japan (2009)
2. Badeau, R., Emiya, V., David, B.: Expectation-maximization algorithm for multipitch estimation and separation of overlapping harmonic spectra. In: Proc. IEEE ICASSP, Taipei, Taiwan, pp. 3073–3076 (2009)
3. Barbour, J.: Analytic listening: A case study of radio production. In: International Conference on Auditory Display, Sydney, Australia (July 2004)
4. Barry, D., Lawlor, B., Coyle, E.: Sound source separation: Azimuth discrimination and resynthesis. In: 7th International Conference on Digital Audio Effects, Naples, Italy, pp. 240–244 (October 2004)
5. Bartsch, M.A., Wakefield, G.H.: To catch a chorus: Using chroma-based representations for audio thumbnailing. In: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, USA, pp. 15–18 (2001)
6. Begleiter, R., El-Yaniv, R., Yona, G.: On prediction using variable order Markov models. J. of Artificial Intelligence Research 22, 385–421 (2004)
7. Bertin-Mahieux, T., Weiss, R.J., Ellis, D.P.W.: Clustering beat-chroma patterns in a large music database. In: Proc. of the Int. Society for Music Information Retrieval Conference, Utrecht, Netherlands (2010)
8. Bever, T.G., Chiarello, R.J.: Cerebral dominance in musicians and nonmusicians. The Journal of Neuropsychiatry and Clinical Neurosciences 21(1), 94–97 (2009)
9. Brown, J.C.: Calculation of a constant Q spectral transform. J. Acoust. Soc. Am. 89(1), 425–434 (1991)
10. Burred, J., Röbel, A., Sikora, T.: Dynamic spectral envelope modeling for the analysis of musical instrument sounds. IEEE Trans. Audio, Speech, and Language Processing (2009)
11. de Cheveigne, A.: Multiple F0 estimation. In: Wang, D., Brown, G.J. (eds.) Computational Auditory Scene Analysis: Principles, Algorithms and Applications. Wiley-IEEE Press (2006)
12. Dannenberg, R.B., Goto, M.: Music structure analysis from acoustic signals. In: Havelock, D., Kuwano, S., Vorländer, M. (eds.) Handbook of Signal Processing in Acoustics, pp. 305–331. Springer, Heidelberg (2009)


13. Dannenberg, R.B., Hu, N.: Pattern discovery techniques for music audio. Journal of New Music Research 32(2), 153–163 (2003)
14. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.: Locality-sensitive hashing scheme based on p-stable distributions. In: ACM Symposium on Computational Geometry, pp. 253–262 (2004)
15. Dixon, S., Pampalk, E., Widmer, G.: Classification of dance music by periodicity patterns. In: 4th International Conference on Music Information Retrieval, Baltimore, MD, pp. 159–165 (2003)
16. Downie, J.S.: The music information retrieval evaluation exchange (2005–2007): A window into music information retrieval research. Acoustical Science and Technology 29(4), 247–255 (2008)
17. Dressler, K.: An auditory streaming approach on melody extraction. In: Intl. Conf. on Music Information Retrieval, Victoria, Canada (2006); MIREX evaluation
18. Duda, A., Nürnberger, A., Stober, S.: Towards query by humming/singing on audio databases. In: International Conference on Music Information Retrieval, Vienna, Austria, pp. 331–334 (2007)
19. Durrieu, J.L., Ozerov, A., Fevotte, C., Richard, G., David, B.: Main instrument separation from stereophonic audio signals using a source/filter model. In: Proc. EUSIPCO, Glasgow, Scotland (August 2009)
20. Durrieu, J.L., Richard, G., David, B., Fevotte, C.: Source/filter model for unsupervised main melody extraction from polyphonic audio signals. IEEE Trans. on Audio, Speech, and Language Processing 18(3), 564–575 (2010)
21. Ellis, D., Arroyo, J.: Eigenrhythms: Drum pattern basis sets for classification and generation. In: International Conference on Music Information Retrieval, Barcelona, Spain
22. Ellis, D.P.W., Poliner, G.: Classification-based melody transcription. Machine Learning 65(2-3), 439–456 (2006)
23. FitzGerald, D., Cranitch, M., Coyle, E.: Extended nonnegative tensor factorisation models for musical source separation. Computational Intelligence and Neuroscience (2008)
24. Fujihara, H., Goto, M.: A music information retrieval system based on singing voice timbre. In: Intl. Conf. on Music Information Retrieval, Vienna, Austria (2007)
25. Gersho, A., Gray, R.: Vector Quantization and Signal Compression. Kluwer Academic Publishers, Dordrecht (1991)
26. Ghias, A., Logan, J., Chamberlin, D.: Query by humming: Musical information retrieval in an audio database. In: ACM Multimedia Conference 1995. Cornell University, San Francisco (1995)
27. Goto, M.: A chorus-section detecting method for musical audio signals. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, Hong Kong, China, vol. 5, pp. 437–440 (April 2003)
28. Goto, M.: A real-time music scene description system: Predominant-F0 estimation for detecting melody and bass lines in real-world audio signals. Speech Communication 43(4), 311–329 (2004)
29. Guo, L., He, X., Zhang, Y., Lu, Y.: Content-based retrieval of polyphonic music objects using pitch contour. In: IEEE International Conference on Audio, Speech and Signal Processing, Las Vegas, USA, pp. 2205–2208 (2008)
30. Hainsworth, S.W., Macleod, M.D.: Automatic bass line transcription from polyphonic music. In: International Computer Music Conference, Havana, Cuba, pp. 431–434 (2001)


31. Helen, M., Virtanen, T.: Separation of drums from polyphonic music using non-negative matrix factorization and support vector machine. In: European Signal Processing Conference, Antalya, Turkey (2005)
32. Jang, J.S.R., Gao, M.Y.: A query-by-singing system based on dynamic programming. In: International Workshop on Intelligent Systems Resolutions (2000)
33. Jang, J.S.R., Hsu, C.L., Lee, H.R.: Continuous HMM and its enhancement for singing/humming query retrieval. In: 6th International Conference on Music Information Retrieval, London, UK (2005)
34. Jensen, K.: Multiple scale music segmentation using rhythm, timbre, and harmony. EURASIP Journal on Advances in Signal Processing (2007)
35. Jurafsky, D., Martin, J.H.: Speech and Language Processing. Prentice Hall, New Jersey (2000)
36. Kitahara, T., Goto, M., Komatani, K., Ogata, T., Okuno, H.G.: Instrogram: Probabilistic representation of instrument existence for polyphonic music. IPSJ Journal 48(1), 214–226 (2007)
37. Klapuri, A.: A method for visualizing the pitch content of polyphonic music signals. In: Intl. Society on Music Information Retrieval Conference, Kobe, Japan (2009)
38. Klapuri, A., Davy, M. (eds.): Signal Processing Methods for Music Transcription. Springer, New York (2006)
39. Klapuri, A., Eronen, A., Astola, J.: Analysis of the meter of acoustic musical signals. IEEE Trans. Speech and Audio Processing 14(1) (2006)
40. Lartillot, O., Dubnov, S., Assayag, G., Bejerano, G.: Automatic modeling of musical style. In: International Computer Music Conference (2001)
41. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999)
42. Lemström, K.: String Matching Techniques for Music Retrieval. Ph.D. thesis, University of Helsinki (2000)
43. Lerdahl, F., Jackendoff, R.: A Generative Theory of Tonal Music. MIT Press, Cambridge (1983)
44. Leveau, P., Vincent, E., Richard, G., Daudet, L.: Instrument-specific harmonic atoms for mid-level music representation. IEEE Trans. Audio, Speech, and Language Processing 16(1), 116–128 (2008)
45. Li, Y., Wang, D.L.: Separation of singing voice from music accompaniment for monaural recordings. IEEE Trans. on Audio, Speech, and Language Processing 15(4), 1475–1487 (2007)
46. Marolt, M.: Audio melody extraction based on timbral similarity of melodic fragments. In: EUROCON (November 2005)
47. Mauch, M., Noland, K., Dixon, S.: Using musical structure to enhance automatic chord transcription. In: Proc. 10th Intl. Society for Music Information Retrieval Conference, Kobe, Japan (2009)
48. McNab, R., Smith, L., Witten, I., Henderson, C., Cunningham, S.: Towards the digital music library: Tune retrieval from acoustic input. In: First ACM International Conference on Digital Libraries, pp. 11–18 (1996)
49. Meek, C., Birmingham, W.: Applications of binary classification and adaptive boosting to the query-by-humming problem. In: Intl. Conf. on Music Information Retrieval, Paris, France (2002)
50. Müller, M., Ewert, S., Kreuzer, S.: Making chroma features more robust to timbre changes. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Taipei, Taiwan, pp. 1869–1872 (April 2009)


51. Nishimura, T., Hashiguchi, H., Takita, J., Zhang, J.X., Goto, M., Oka, R.: Music signal spotting retrieval by a humming query using start frame feature dependent continuous dynamic programming. In: 2nd Annual International Symposium on Music Information Retrieval, Bloomington, Indiana, USA, pp. 211–218 (October 2001)
52. Ono, N., Miyamoto, K., Roux, J.L., Kameoka, H., Sagayama, S.: Separation of a monaural audio signal into harmonic/percussive components by complementary diffusion on spectrogram. In: European Signal Processing Conference, Lausanne, Switzerland, pp. 240–244 (August 2008)
53. Ozerov, A., Philippe, P., Bimbot, F., Gribonval, R.: Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs. IEEE Trans. on Audio, Speech, and Language Processing 15(5), 1564–1578 (2007)
54. Paiva, R.P., Mendes, T., Cardoso, A.: On the detection of melody notes in polyphonic audio. In: 6th International Conference on Music Information Retrieval, London, UK, pp. 175–182
55. Paulus, J.: Signal Processing Methods for Drum Transcription and Music Structure Analysis. Ph.D. thesis, Tampere University of Technology (2009)
56. Paulus, J., Klapuri, A.: Measuring the similarity of rhythmic patterns. In: Intl. Conf. on Music Information Retrieval, Paris, France (2002)
57. Paulus, J., Müller, M., Klapuri, A.: Audio-based music structure analysis. In: Proc. of the Int. Society for Music Information Retrieval Conference, Utrecht, Netherlands (2010)
58. Paulus, J., Virtanen, T.: Drum transcription with non-negative spectrogram factorisation. In: European Signal Processing Conference, Antalya, Turkey (September 2005)
59. Peeters, G.: Sequence representations of music structure using higher-order similarity matrix and maximum-likelihood approach. In: Intl. Conf. on Music Information Retrieval, Vienna, Austria, pp. 35–40 (2007)
60. Peeters, G.: A large set of audio features for sound description (similarity and classification) in the CUIDADO project. Tech. rep., IRCAM, Paris, France (April 2004)
61. Poliner, G., Ellis, D., Ehmann, A., Gómez, E., Streich, S., Ong, B.: Melody transcription from music audio: Approaches and evaluation. IEEE Trans. on Audio, Speech, and Language Processing 15(4), 1247–1256 (2007)
62. Purwins, H.: Profiles of Pitch Classes – Circularity of Relative Pitch and Key: Experiments, Models, Music Analysis, and Perspectives. Ph.D. thesis, Berlin University of Technology (2005)
63. Rowe, R.: Machine Musicianship. MIT Press, Cambridge (2001)
64. Ryynänen, M., Klapuri, A.: Query by humming of MIDI and audio using locality sensitive hashing. In: IEEE International Conference on Audio, Speech and Signal Processing, Las Vegas, USA, pp. 2249–2252
65. Ryynänen, M., Klapuri, A.: Transcription of the singing melody in polyphonic music. In: Intl. Conf. on Music Information Retrieval, Victoria, Canada, pp. 222–227 (2006)
66. Ryynänen, M., Klapuri, A.: Automatic bass line transcription from streaming polyphonic audio. In: IEEE International Conference on Audio, Speech and Signal Processing, pp. 1437–1440 (2007)
67. Ryynänen, M., Klapuri, A.: Automatic transcription of melody, bass line, and chords in polyphonic music. Computer Music Journal 32(3), 72–86 (2008)

204

A. Klapuri

68. Sch
orkhuber, C., Klapuri, A.: Constant-Q transform toolbox for music processing.
In: 7th Sound and Music Computing Conference, Barcelona, Spain (2010)
69. Selfridge-Field, E.: Conceptual and representational issues in melodic comparison.
Computing in Musicology 11, 364 (1998)
70. Serra, J., Gomez, E., Herrera, P., Serra, X.: Chroma binary similarity and local
alignment applied to cover song identication. IEEE Trans. on Audio, Speech, and
Language Processing 16, 11381152 (2007)
71. Serra, X.: Musical sound modeling with sinusoids plus noise. In: Roads, C., Pope,
S., Picialli, A., Poli, G.D. (eds.) Musical Signal Processing, Swets & Zeitlinger
(1997)
72. Song, J., Bae, S.Y., Yoon, K.: Mid-level music melody representation of polyphonic
audio for query-by-humming system. In: Intl. Conf. on Music Information Retrieval,
Paris, France, pp. 133139 (October 2002)
73. Tokuda, K., Kobayashi, T., Masuko, T., Imai, S.: Mel-generalized cepstral analysis
a unied approach to speech spectral estimation. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, Adelaide, Australia (1994)
74. Typke, R.: Music Retrieval based on Melodic Similarity. Ph.D. thesis, Universiteit
Utrecht (2007)
75. Vincent, E., Bertin, N., Badeau, R.: Harmonic and inharmonic nonnegative matrix
factorization for polyphonic pitch transcription. In: IEEE ICASSP, Las Vegas, USA
(2008)
76. Virtanen, T.: Unsupervised learning methods for source separation in monaural
music signals. In: Klapuri, A., Davy, M. (eds.) Signal Processing Methods for Music
Transcription, pp. 267296. Springer, Heidelberg (2006)
77. Virtanen, T.: Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio, Speech,
and Language Processing 15(3), 10661074 (2007)
78. Virtanen, T., Mesaros, A., Ryyn
anen, M.: Combining pitch-based inference and
non-negative spectrogram factorization in separating vocals from polyphonic music.
In: ISCA Tutorial and Research Workshop on Statistical and Perceptual Audition,
Brisbane, Australia (September 2008)
79. Wang, L., Huang, S., Hu, S., Liang, J., Xu, B.: An eective and ecient method
for query by humming system based on multi-similarity measurement fusion. In:
International Conference on Audio, Language and Image Processing, pp. 471475
(July 2008)
80. Welch, T.A.: A technique for high-performance data compression. Computer 17(6),
819 (1984)
81. Wu, X., Li, M., Yang, J., Yan, Y.: A top-down approach to melody match in pitch
countour for query by humming. In: International Conference of Chinese Spoken
Language Processing (2006)
82. Yeh, C.: Multiple fundamental frequency estimation of polyphonic recordings.
Ph.D. thesis, University of Paris VI (2008)
83. Yilmaz, O., Richard, S.: Blind separation of speech mixtures via time-frequency
masking. IEEE Trans. on Signal Processing 52(7), 18301847 (2004)

Unsupervised Analysis and Generation of Audio Percussion Sequences

Marco Marchini and Hendrik Purwins
Music Technology Group,
Department of Information and Communications Technologies,
Universitat Pompeu Fabra
Roc Boronat, 138, 08018 Barcelona, Spain
Tel.: +34-93 542-1365; Fax: +34-93 542-2455
{marco.marchini,hendrik.purwins}@upf.edu

Abstract. A system is presented that learns the structure of an audio recording of a rhythmical percussion fragment in an unsupervised manner and that synthesizes musical variations from it. The procedure consists of 1) segmentation, 2) symbolization (feature extraction, clustering, sequence structure analysis, temporal alignment), and 3) synthesis. The symbolization step yields a sequence of event classes. Simultaneously, representations are maintained that cluster the events into few or many classes. Based on the most regular clustering level, a tempo estimation procedure is used to preserve the metrical structure in the generated sequence. Employing variable length Markov chains, the final synthesis is performed, recombining the audio material derived from the sample itself. Representations with different numbers of classes are used to trade off statistical significance (short context sequence, low clustering refinement) versus specificity (long context, high clustering refinement) of the generated sequence. For a broad variety of musical styles the musical characteristics of the original are preserved. At the same time, considerable variability is introduced in the generated sequence.
Keywords: music analysis, music generation, unsupervised clustering,
Markov chains, machine listening.

1 Introduction

In the eighteenth century, composers such as C. P. E. Bach and W. A. Mozart used the Musikalisches Würfelspiel as a game to create music. They composed several bars of music that could be randomly recombined in various ways, creating a new composition [3]. In the 1950s, Hiller and Isaacson automatically composed the Illiac Suite, and Xenakis used Markov chains and stochastic processes in his compositions. Probably one of the most extensive works in style imitation is the one by David Cope [3]. He let the computer compose pieces in the style of Beethoven, Prokofiev, Chopin, and Rachmaninoff. Pachet [13] developed the Continuator, a MIDI-based system for real-time interaction with musicians, producing jazz-style music.
Another system with the same characteristics as the Continuator, called OMax, was able to learn an audio stream employing an indexing procedure explained in [5]. Hazan et al. [8] built a system which first segments the musical stream and extracts timbre and onsets. An unsupervised clustering process yields a sequence of symbols that is then processed by n-grams. The method by Marxer and Purwins [12] consists of a conceptual clustering algorithm coupled with a hierarchical N-gram. Our method presented in this article was first described in detail in [11].
First, we define the system design and the interaction of its parts. Starting from low-level descriptors, we translate them into a fuzzy score representation, where two sounds can be discretized either yielding the same symbol or yielding different symbols, according to which level of interpretation is chosen (Section 2). Then we perform skeleton subsequence extraction and tempo detection to align the score to a grid. At the end, we get a sequence that is homogeneous in time, on which we perform the prediction. For the generation of new sequences, we reorder the parts of the score, respecting the statistical properties of the sequence while at the same time maintaining the metrical structure (Section 3). In Section 4, we discuss an example.

2 Unsupervised Sound Analysis

As represented in Figure 1, the system basically detects musical blocks in the audio and re-shuffles them according to the meter and the statistical properties of the sequence. We will now describe each step of the process in detail.

(Figure 1: block diagram of the processing chain, leading from audio segments through Segmentation, Symbolization, Aligned Multilevel Representation, and Statistic Model to continuation indices and the generation of audio sequences.)

Fig. 1. General architecture of the system

2.1 Segmentation

First, the audio input signal is analyzed by an onset detector that segments the audio file into a sequence of musical events. Each event is characterized by its position in time (onset) and an audio segment, i.e., the audio signal starting at the onset position and ending at the following contiguous onset. In the further processing, these events serve two purposes. On the one hand, the events are stored as an indexed sequence of audio fragments which will be used for the re-synthesis in the end. On the other hand, these events will be compared with each other to generate a reduced score-like representation of the percussion patterns on which to base a tempo analysis (cf. Fig. 1 and Sec. 2.2).


We used the onset detector implemented in the MIR toolbox [9] that is based only on the energy envelope, which proves to be sufficient for our purpose of analyzing percussion sounds.
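For illustration only (this is not the MIR toolbox implementation), a minimal energy-envelope onset detector might look as follows; the hop size, frame length, and peak-picking threshold used here are assumptions:

```python
import numpy as np
import librosa

def energy_onsets(y, sr, hop=512, delta=0.1):
    """Rough energy-envelope onset detector (illustrative sketch)."""
    # Frame-wise RMS energy envelope, normalized to [0, 1]
    env = librosa.feature.rms(y=y, frame_length=2048, hop_length=hop)[0]
    env = env / (env.max() + 1e-12)
    # Half-wave rectified first difference as a simple novelty function
    novelty = np.maximum(0.0, np.diff(env, prepend=env[0]))
    # Local maxima above a fixed threshold are taken as onsets
    peaks = [i for i in range(1, len(novelty) - 1)
             if novelty[i] > delta
             and novelty[i] >= novelty[i - 1]
             and novelty[i] >= novelty[i + 1]]
    return np.array(peaks) * hop / sr   # onset times in seconds

# y, sr = librosa.load("percussion.wav", sr=None)
# print(energy_onsets(y, sr))
```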
2.2 Symbolization

We will employ segmentation and clustering in order to transform the audio signal into a discrete sequence of symbols (as shown in Fig. 3), thereby facilitating statistical analysis. However, some considerations should be made.
As we are not restricting the problem to a monophonic percussion sequence, non-trivial problems arise when one wants to translate a sequence of events into a meaningful symbolic sequence. One would like to decide whether or not two sounds have been played by the same percussion instrument (e.g. snare, bass drum, open hi-hat, ...) and, more specifically, whether two segments contain the same sound in the case of polyphony. With a similarity distance we can derive a value representing the similarity between two sounds, but when two sounds are played simultaneously a different sound may be created. Thus, a sequence could exist that allows for multiple interpretations, since the system is not able to determine whether a segment contains one or more sounds played synchronously. A way to avoid this problem directly and still get a useful representation is to use a fuzzy representation of the sequence. If we listen to each segment in great detail, every segment may sound different. If we listen very coarsely, they may all sound the same. Only listening with an intermediate level of refinement yields a reasonable differentiation in which we recognize the reoccurrence of particular percussive instruments and on which we can perceive meaningful musical structure. Therefore, we propose to maintain different levels of clustering refinement simultaneously and then select the level on which we encounter the most regular non-trivial patterns. In the sequel, we will pursue an implementation of this idea and describe the process in more detail.
Feature Extraction. We have chosen to define the salient part of the event as the first 200 ms after the onset position. This duration value is a compromise between capturing enough information about the attack to represent the sound reliably and avoiding irrelevant parts at the end of the segment which may be due to pauses or other interfering instruments. In the case that the segment is shorter than 200 ms, we use the entire segment for the extraction of the feature vector. Across the salient part of the event we calculate the Mel Frequency Cepstral Coefficient (MFCC) vector frame-by-frame. Over all MFCCs of the salient event part, we take the weighted mean, weighted by the RMS energy of each frame. The frame rate is 100 frames per second, the FFT size is 512 samples, and the window size is 256 samples.
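A minimal sketch of this per-event feature computation, assuming librosa and the frame parameters stated above (function and variable names are ours, not from the paper):

```python
import numpy as np
import librosa

def event_feature(y, sr, onset_s, next_onset_s, max_dur=0.2):
    """13-dim timbre vector: RMS-weighted mean of the MFCCs of the salient part."""
    # Salient part: first 200 ms after the onset (or the whole segment if shorter)
    end_s = min(onset_s + max_dur, next_onset_s)
    seg = y[int(onset_s * sr):int(end_s * sr)]
    hop = sr // 100                            # roughly 100 frames per second
    mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13,
                                n_fft=512, win_length=256, hop_length=hop)
    rms = librosa.feature.rms(y=seg, frame_length=512, hop_length=hop)[0]
    n = min(mfcc.shape[1], len(rms))
    w = rms[:n] / (rms[:n].sum() + 1e-12)      # weights from the RMS energy
    return (mfcc[:, :n] * w).sum(axis=1)       # weighted mean over frames
```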
Sound Clustering. At this processing stage, each event is characterized by a 13-dimensional vector (and the onset time). Events can thus be seen as points in a 13-dimensional space in which a topology is induced by the Euclidean distance. We used the single linkage algorithm to discover event clusters in this space (cf. [6] for details). This algorithm recursively performs clustering in a bottom-up manner. Points are grouped into clusters, then clusters are merged with additional points, and clusters are merged with clusters into super clusters. The distance between two clusters is defined as the shortest distance between two points, each being in a different cluster, yielding a binary tree representation of the point similarities (cf. Fig. 2). The leaf nodes correspond to single events. Each node of the tree occurs at a certain height, representing the distance between its two child nodes. Figure 2 (top) shows an example of a clustering tree of the onset events of a sound sequence.
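This multi-level clustering can be reproduced with SciPy's hierarchical clustering; a sketch, assuming `features` holds the per-event MFCC vectors from the previous step:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

features = np.random.rand(40, 13)   # placeholder for the (n_events x 13) MFCC matrix

# Binary merge tree (single linkage, Euclidean distance)
Z = linkage(features, method='single', metric='euclidean')

def level(Z, threshold):
    """Cut the tree at a height threshold: one symbol (cluster id) per event."""
    return fcluster(Z, t=threshold, criterion='distance')

# Sweeping the threshold over the node heights yields every representation level
thresholds = np.unique(Z[:, 2])
levels = {t: level(Z, t) for t in thresholds}
```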

(Figure 2: dendrogram with the cluster distance on the vertical axis and a horizontal threshold line, above the audio waveform plotted against time in seconds.)

Fig. 2. A tree representation of the similarity relationship between events (top) of an audio percussion sequence (bottom). The threshold value chosen here leads to a particular cluster configuration. Each cluster with more than one instance is indicated by a colored subtree. The events in the audio sequence are marked in the colors of the clusters they belong to. The height of each node is the distance (according to the single linkage criterion) between its two child nodes. Each of the leaf nodes on the bottom of the graph corresponds to an event.

The height threshold controls the number of clusters: clusters are generated with inter-cluster distances higher than the height threshold. Two thresholds lead to the same cluster configuration if and only if their values are both within the range delimited by the previous lower node and the next upper node in the tree. It is therefore evident that, by changing the height threshold, we can get as many different cluster configurations as the number of events we have in the sequence. Each cluster configuration leads to a different symbol alphabet size and therefore to a different symbol sequence representing the original audio file. We will refer to those sequences as representation levels or simply levels. These levels are implicitly ordered. On the leaf level at the bottom of the tree we find the lowest inter-cluster distances, corresponding to a sequence with each event being encoded by a unique symbol due to weak quantization. On the root level at the top of the tree we find the cluster configuration with the highest inter-cluster distances, corresponding to a sequence with all events denoted by the same symbol due to strong quantization. Given a particular level, we will refer to the events denoted by the same symbol as the instances of that symbol. We do not consider the implicit inheritance relationships between symbols of different levels.

Fig. 3. A continuous audio signal (top) is discretized via clustering yielding a sequence
of symbols (bottom). The colors inside the colored triangles denote the cluster of the
event, related to the type of sound, i.e. bass drum, hi-hat, or snare.

2.3 Level Selection

Handling different representations of the same audio file in parallel enables the system to make predictions based on fine or coarse context structure, depending on the situation. As explained in the previous section, if the sequence contains n events, the number of possible distinct levels is n. As the number of events increases, it becomes particularly costly to use all these levels together, because the number of levels increases linearly with the number of onsets. Moreover, as will become clearer later, this representation would lead to over-fitted predictions of new events.
This observation leads to the necessity of selecting only a few levels that can be considered representative of the sequence in terms of structural regularity.
Given a particular level, let us consider a symbol having at least four instances but not more than 60% of the total number of events, and let us call such a symbol an appropriate symbol. The instances of such a symbol define a subsequence of all the events that is supposedly made of more or less similar sounds, according to the degree of refinement of the level.


Let us now consider the sequence of onsets given by this subsequence. This sequence can be seen as a set of points on a time line. We are interested in quantifying the degree of temporal regularity of those onsets. Firstly, we compute the histogram of the time differences (CIOIH) between all possible combinations of two onsets, using 100 ms bins (middle of Fig. 4). What we obtain is a sort of harmonic series of peaks that are more or less prominent according to the self-similarity of the sequence on different scales. Secondly, we compute the autocorrelation ac(t) (where t is the time in seconds) of the CIOIH which, in the case of a regular sequence, has peaks at multiples of its tempo. Let t_usp be the positive time value corresponding to its upper side peak. Given the sequence of m onsets x = (x_1, ..., x_m), we define the regularity of the sequence of onsets x to be:

Regularity(x) = \frac{ac(t_{usp})}{\frac{1}{t_{usp}} \int_{0}^{t_{usp}} ac(t)\,dt} \cdot \log(m)

This definition was motivated by the observation that the higher this value, the more equally the onsets are spaced in time. The logarithm of the number of onsets is multiplied by the ratio to give more importance to symbols with more instances.
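A sketch of this regularity measure in Python, under our own reading of the formula (in particular, taking the first positive-lag local maximum of the autocorrelation as the upper side peak is an assumption):

```python
import numpy as np

def regularity(onsets, bin_s=0.1):
    """Regularity of an onset subsequence, following the formula above (sketch)."""
    x = np.asarray(onsets, dtype=float)
    m = len(x)
    # Histogram of all pairwise inter-onset intervals (CIOIH), 100 ms bins
    ioi = np.abs(x[:, None] - x[None, :])
    ioi = ioi[ioi > 0]
    nbins = int(np.ceil(ioi.max() / bin_s)) + 1
    hist, _ = np.histogram(ioi, bins=nbins, range=(0, nbins * bin_s))
    # Autocorrelation of the histogram, positive lags only, normalized
    ac = np.correlate(hist, hist, mode='full')[len(hist) - 1:].astype(float)
    ac /= ac[0] + 1e-12
    # Upper side peak: first local maximum at a positive lag
    peaks = [k for k in range(1, len(ac) - 1)
             if ac[k] >= ac[k - 1] and ac[k] > ac[k + 1]]
    if not peaks:
        return 0.0
    k = peaks[0]                       # lag index corresponding to t_usp
    mean_ac = ac[:k + 1].mean()        # (1/t_usp) * integral of ac over [0, t_usp]
    return ac[k] / mean_ac * np.log(m)
```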

(Figure 4: three panels showing the onset sequence of one symbol, the histogram of the complete IOI (number of interval instances vs. time interval in seconds), and the autocorrelation of the histogram with the upper side peak at t_usp marked.)

Fig. 4. The procedure applied for computing the regularity value of an onset sequence (top) is outlined. Middle: the histogram of the complete IOI between onsets. Bottom: the autocorrelation of the histogram is shown for a subrange of IOI with relevant peaks marked.

Then we extended, for each level, the regularity concept to an overall regularity of the level. This simply corresponds to the mean of the regularities over all the appropriate symbols of the level. The regularity of the level is defined to be zero in case there is no appropriate symbol.


After the regularity value has been computed for each level, we select the level at which the maximum regularity is reached. The resulting level will be referred to as the regular level.
We also decided to keep the levels corresponding to local maxima, because they generally refer to levels at which a partially regular interpretation of the sequence is achieved. In the case where consecutive levels of a sequence share the same regularity, only the one derived from the higher cluster distance threshold is kept. Figure 5 shows the regularity of the sequence for different levels.

(Figure 5: sequence regularity (vertical axis) plotted against the cluster distance threshold (horizontal axis, ca. 0.5 to 1.4).)

Fig. 5. Sequence regularity for a range of cluster distance thresholds (x-axis). An ENST audio excerpt was used for the analysis. The regularity reaches its global maximum value in a central position. Towards the right, regularity increases and then remains constant. The selected peaks are marked with red crosses, implying a list of cluster distance threshold values.

2.4 Beat Alignment

In order to predict future events without breaking the metrical structure, we use a tempo detection method and introduce a way to align onsets to a metrical grid, accounting for the position of the sequence in the metrical context. For our purpose of learning and regenerating the structure statistically, we do not require a perfect beat detection. Even if we detect a beat that is twice or half as fast as the perceived beat, or that mistakes an on-beat for an off-beat, our system can still tolerate this for the analysis of a music fragment, as long as the inter beat interval and the beat phase are always misestimated in the same way.
Our starting point is the regular level that has been found with the procedure explained in the previous subsection. On this level we select the appropriate symbol with the highest regularity value.


The subsequence that carries this symbol will be referred to as the skeleton subsequence, since it is like an anchor structure to which we relate our metrical interpretation of the sequence.
Tempo Alignment (Inter Beat Interval and Beat Phase). Once the skeleton subsequence is found, the inter beat interval is estimated with the procedure explained in [4]. The tempo is detected considering the intervals between all possible onset pairs of the sequence using a score voting criterion. This method gives higher scores to intervals that occur more often and that are related by integer ratios to other occurring inter onset intervals.
Then the onsets of the skeleton subsequence are parsed in order to detect a possible alignment of the grid to the sequence. We allow a tolerance of 6% of the duration of the inter beat interval for the alignment of an onset to a grid position. We choose the interpretation that aligns the highest number of instances to the grid. After discarding the onsets that are not aligned, we obtain the preliminary skeleton grid. In Fig. 6 the procedure is illustrated.
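A sketch of the beat-phase selection, assuming the inter beat interval has already been estimated; searching only over phases anchored at observed onsets is our simplification, not necessarily the original implementation:

```python
import numpy as np

def best_phase(onsets, ibi, tol=0.06):
    """Pick the beat phase that aligns the most skeleton onsets to a grid of period ibi."""
    onsets = np.asarray(onsets, dtype=float)
    best_p, best_count = 0.0, -1
    for anchor in onsets:                     # candidate phase: grid through this onset
        # distance of every onset to its nearest grid position
        d = np.abs((onsets - anchor + ibi / 2) % ibi - ibi / 2)
        count = int(np.sum(d <= tol * ibi))   # onsets within 6% of the IBI
        if count > best_count:
            best_p, best_count = anchor % ibi, count
    return best_p
```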

Fig. 6. Above, a skeleton sequence is represented on a timeline. Below, three possible alignments of the sequence are given by Dixon's method [4]. All these alignments are based on the same inter beat interval but on different beat phases. Each alignment captures a particular onset subset (represented by a particular graphic marker) of the skeleton sequence and discards the remaining onsets of the skeleton sequence. The beat phase that allows catching the highest number of onsets (the filled red crosses) is selected and the remaining onsets are removed, thereby yielding the preliminary skeleton grid.

Creation of Skeleton Grid. The preliminary skeleton grid is a sequence of onsets spaced at multiples of a constant time interval, the inter beat interval. But, as shown in Fig. 6, it can still have some gaps (due to missing onsets). The missing onsets are thus detected and, in a first attempt, the system tries to align them with any onset of the entire event sequence (not just the one symbol forming the preliminary skeleton grid). If there is any onset within a tolerance range of 6% of the inter beat interval around the expected beat position, the expected onset is aligned to this onset. If no onset within this tolerance range is encountered, the system creates a (virtual) grid bar at the expected beat position.
At the end of this completion procedure, we obtain a quasi-periodic skeleton grid, a sequence of beats (events) sharing the same metrical position (the same metrical phase).
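A possible realization of this gap-filling step, under our own reading of the procedure (the variable names and the simple nearest-onset snapping are assumptions):

```python
import numpy as np

def complete_grid(skeleton, all_onsets, ibi, tol=0.06):
    """Fill the gaps of the preliminary skeleton grid with real or virtual beats."""
    all_onsets = np.asarray(all_onsets, dtype=float)
    grid = [skeleton[0]]
    t = skeleton[0]
    while t < skeleton[-1] - 0.5 * ibi:
        expected = t + ibi
        d = np.abs(all_onsets - expected)
        if d.min() <= tol * ibi:
            t = float(all_onsets[d.argmin()])   # snap to a real onset
        else:
            t = expected                        # insert a virtual grid bar
        grid.append(t)
    return np.array(grid)
```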


Fig. 7. The event sequence derived from a segmentation by onset detection is indicated by triangles. The vertical lines show the division of the sequence into blocks of
homogeneous tempo. The red solid lines represent the beat position (as obtained by
the skeleton subsequence). The other black lines (either dashed if aligned to a detected
onset or dotted if no close onset is found) represent the subdivisions of the measure
into four blocks.

Because of the tolerance used for building such a grid, the effective measure duration can sometimes be slightly longer or slightly shorter. This implements the idea that the grid should be elastic in the sense that, up to a certain degree, it adapts to the (expressive) timing variations of the actual sequence.
The skeleton grid catches a part of the complete list of onsets, but we would like to build a grid to which most of the onsets are aligned. Therefore, starting from the skeleton grid, the intermediate point between every two subsequent beats is found and aligned with an onset (if one exists within a tolerance region; otherwise a place-holding onset is added). The procedure is repeated recursively until at least 80% of the onsets are aligned to a grid position or the number of created onsets exceeds the total number of onsets. In Fig. 7, an example is presented along with the resulting grid, where the skeleton grid and its aligned and non-aligned subdivisions are indicated by different line markers.
Note that, for the sake of simplicity, our approach assumes that the metrical structure is binary. This may cause the sequence to be split erroneously. However, we will see in a ternary tempo example that this is not a limiting factor for the generation, because the statistical representation somehow compensates for it, even if less variable generations are achieved. A more general approach could be implemented with few modifications.
The final grid is made of blocks of time of almost equal duration that can contain none, one, or more onset events. It is important that the sequence given to the statistical model is almost homogeneous in time, so that a certain number of blocks corresponds to a defined time duration.
We used the following rules to assign a symbol to a block (cf. Fig. 7):
- blocks starting on an aligned onset are denoted by the symbol of the aligned onset,
- blocks starting on a non-aligned grid position are denoted by the symbol of the previous block.
Finally, a metrical phase value is assigned to each block, describing the number of grid positions passed after the last beat position (corresponding to the metrical position of the block).


For each representation level, the new representation of the sequence is the Cartesian product of the instrument symbol and the phase.

3 Statistical Model Learning

Now we statistically analyze the structure of the symbol sequence obtained in the last section. We employ variable length Markov chains (VLMC) for the statistical analysis of the sequences. In [2,15], a general method for inferring long sequences is described. For faster computation, we use a simplified implementation as described in [13]. We construct a suffix tree for each level based on the sequence of that level. Each node of the tree represents a specific context that has occurred in the past. In addition, each node carries a list of continuation indices corresponding to the block indices matching the context.
For audio, a different approach has been applied in [5]. This method does not require an event-wise symbolic representation as it employs the factor oracle algorithm. VLMC has not been applied to audio before, because of the absence of an event-wise symbolic representation such as the one we presented above.
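A simplified sketch of such a context structure, with a dictionary standing in for the suffix tree (names and the fixed maximal context length are our assumptions):

```python
from collections import defaultdict

def build_context_index(symbols, max_len=4):
    """Map every context (tuple of up to max_len symbols) to the block indices
    that followed it in the training sequence."""
    index = defaultdict(list)
    for i in range(1, len(symbols)):
        for l in range(1, max_len + 1):
            if i - l < 0:
                break
            index[tuple(symbols[i - l:i])].append(i)   # block i continues this context
    return index

def continuations(index, history, max_len=4):
    """Longest-context lookup: continuation indices for the longest known suffix
    of the history, together with the matched context length."""
    for l in range(min(max_len, len(history)), 0, -1):
        ctx = tuple(history[-l:])
        if ctx in index:
            return index[ctx], l
    return [], 0
```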
3.1 Generation Strategies

If we fix a particular level, the continuation indices are drawn according to a posterior probability distribution determined by the longest context found. But which level should be chosen? Depending on the sequence, it could be better to base predictions either on a coarse or on a fine level, but it is not clear which one should be preferred. First, we selected the lowest level at which a context of at least l existed (for a predetermined fixed l, usually l equal to 3 or 4). This works quite well for many examples. But in some cases a context of that length does not exist and the system often reaches the highest level, where too many symbols are provided, inducing overly random generations. On the other hand, it very often occurs that the lowest level is made of singleton clusters that have only one instance. In this case, a long context is found at the lowest level, but since a particular symbol sequence only occurs once in the whole original segment, the system replicates the audio in the same order as the original. This behavior often leads to an exact reproduction of the original until reaching its end, followed by a random jump to another block in the original sequence.
In order to increase the recombination of blocks and still provide good continuations, we employ some heuristics taking into account multiple levels for the prediction. We set p to be a recombination value between 0 and 1. We also need to preprocess the block sequence to prevent arriving at the end of the sequence without any musically meaningful continuation. For this purpose, before learning the sequence, we remove the last blocks until the remaining sequence ends with a context of at least length two. We make use of the following heuristics to generate the continuation in each step (see the sketch after this list):


- Set a maximal context length l and compute the list of indices for each level using the appropriate suffix tree. Store the achieved length of the context for each level.
- Count the number of indices provided by each level. Select only the levels that provide less than 75% of the total number of blocks.
- Among these level candidates, select only the ones that have the longest context.
- Merge all the continuation indices across the selected levels and remove the trivial continuation (the next onset).
- In case there is no level providing such a context and the current block is not the last, use the next block as a continuation.
- Otherwise, decide randomly with probability p whether to select the next block or rather to generate the actual continuation by selecting randomly among the merged indices.
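A sketch of one generation step implementing these heuristics, reusing the continuations helper from the previous sketch; in particular, reading p as the probability of jumping to a recombined block is our interpretation:

```python
import random

def generate_step(level_indexes, histories, current, n_blocks, p=0.5, max_len=4):
    """Choose the next block index given the per-level context indexes and histories."""
    candidates = []
    for index, hist in zip(level_indexes, histories):
        idx, ctx_len = continuations(index, hist, max_len)
        # keep only levels that are selective enough (< 75% of all blocks)
        if idx and len(idx) < 0.75 * n_blocks:
            candidates.append((ctx_len, idx))
    if not candidates:                           # no usable context at any level
        return (current + 1) % n_blocks          # fall back to the next block
    longest = max(ctx_len for ctx_len, _ in candidates)
    merged = {i for ctx_len, idx in candidates if ctx_len == longest for i in idx}
    merged.discard(current + 1)                  # drop the trivial continuation
    if merged and random.random() < p:
        return random.choice(sorted(merged))     # recombine
    return (current + 1) % n_blocks
```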

4 Evaluation and Examples

We tested the system on two audio databases. The first one is the ENST database (see [7]), which provided a collection of around forty drum recording examples. For a descriptive evaluation, we asked two professional percussionists to judge several examples of generations as if they were performances of a student. Moreover, we asked one of them to record beat boxing excerpts, trying to push the system to the limits of complexity, and to critically assess the sequences that the system had generated from these recordings. The evaluations of the generations created from the ENST examples revealed that the style of the original had been maintained and that the generations had a high degree of interestingness [10].
Some examples are available on the website [1] along with graphical animations visualizing the analysis process. In each video, we see the original sound fragment and the generation derived from it. The horizontal axis corresponds to the time in seconds and the vertical axis to the clustering quantization resolution. Each video shows an animated graphical representation in which each block is represented by a triangle. At each moment, the context and the currently played block are represented by enlarged and colored triangles.
In the first part of the video, the original sound is played and the animation shows the extracted block representation. The currently played block is represented by an enlarged colored triangle and highlighted by a vertical dashed red line. The other colored triangles highlight all blocks from the starting point of the bar up to the current block. In the second part of the video, only the skeleton subsequence is played. The sequence on top is derived from applying the largest clustering threshold (smallest number of clusters) and the one on the bottom corresponds to the lowest clustering threshold (highest number of clusters). In the final part of the video, the generation is shown.


The colored triangles represent the current block and the current context. The size of the colored triangles decreases monotonically from the current block backwards, displaying the past time context window considered by the system. The colored triangles appear only on the levels selected by the generation strategy.
In Figure 8, we see an example of successive states of the generation. The levels used by the generator to compute the continuation and the context are highlighted, showing colored triangles that decrease in size from the largest, corresponding to the current block, to the smallest, which is the furthest past context block considered by the system. In Frame I, the generation starts with block no. 4, belonging to the event class indicated by light blue. In the beginning, no previous context is considered for the generation. In Frame II, a successive block, no. 11 of the green event class, has been selected using all five levels and a context history of length 1, consisting only of block no. 4 of the light blue event class. Note that the context given by only one light blue block matches the continuation no. 11, since the previous block (no. 10) is also denoted by light blue at all five levels. In Frame III, the context is the bi-gram of the event classes light blue (no. 4) and green (no. 11). Only one level is selected, since at all other levels the bi-gram that corresponds to the colors light blue and green appears only once. However, at the selected level the system finds three matches (blocks no. 6, 10, and 12) and randomly selects no. 10. In Frame IV, the levels differ in the length of the maximal past context. At one level, one and only one match (no. 11) is found for the 3-gram light blue - green - light blue, and thus this level is discarded. At other levels, no matches for 3-grams are found, but all these levels include 2 matches (blocks no. 5 and 9) for the bi-gram green - light blue. At yet another level, no match is found for a bi-gram either, but 3 occurrences of the light blue triangle are found.

5 Discussion

Our system effectively generates sequences respecting the structure and the tempo of the original sound fragment for rhythmic patterns of medium to high complexity.
A descriptive evaluation by a professional percussionist confirmed that the metrical structure is correctly managed and that the statistical representation generates musically meaningful sequences. He noticed explicitly that the drum fills (short musical passages which help to sustain the listener's attention during a break between the phrases) were handled adequately by the system.
The percussionist's criticism was directed at the lack of dynamics, agogics, and musically meaningful long-term phrasing, which we did not address in our approach.
Part of these features could be achieved in the future by extending the system to the analysis of non-binary meter. Achieving musically sensible dynamics and agogics (rallentando, accelerando, rubato, ...) in the generated musical continuation, for example by extrapolation [14], remains a challenge for future work.

Fig. 8. Nine successive frames of the generation. The red vertical dashed line marks the currently played event. In each frame, the largest colored triangle denotes the last played event that influences the generation of the next event. The size of the triangles decreases going back in time. Only for the selected levels are the triangles enlarged. We can see how the length of the context as well as the number of selected levels dynamically change during the generation. Cf. Section 4 for a detailed discussion of this figure.


Acknowledgments
Many thanks to Panos Papiotis for his patience during lengthy recording sessions
and for providing us with beat boxing examples, the evaluation feedback, and
inspiring comments. Thanks a lot to Ricard Marxer for his helpful support. The
first author (MM) expresses his gratitude to Mirko Degli Esposti and Anna Rita
Addessi for their support and for motivating this work. The second author (HP)
was supported by a Juan de la Cierva scholarship of the Spanish Ministry of
Science and Innovation.

References
1. (December 2010), www.youtube.com/user/audiocontinuation
2. Bühlmann, P., Wyner, A.J.: Variable length Markov chains. Annals of Statistics 27, 480–513 (1999)
3. Cope, D.: Virtual Music: Computer Synthesis of Musical Style. MIT Press, Cambridge (2004)
4. Dixon, S.: Automatic extraction of tempo and beat from expressive performances. Journal of New Music Research 30(1), 39–58 (2001)
5. Dubnov, S., Assayag, G., Cont, A.: Audio oracle: A new algorithm for fast learning of audio structures. In: Proceedings of the International Computer Music Conference (ICMC), pp. 224–228 (2007)
6. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley, Chichester (2001)
7. Gillet, O., Richard, G.: ENST-Drums: an extensive audio-visual database for drum signals processing. In: ISMIR, pp. 156–159 (2006)
8. Hazan, A., Marxer, R., Brossier, P., Purwins, H., Herrera, P., Serra, X.: What/when causal expectation modelling applied to audio signals. Connection Science 21, 119–143 (2009)
9. Lartillot, O., Toiviainen, P., Eerola, T.: A Matlab toolbox for music information retrieval. In: Annual Conference of the German Classification Society (2007)
10. Marchini, M.: Unsupervised Generation of Percussion Sequences from a Sound Example. Master's thesis (2010)
11. Marchini, M., Purwins, H.: Unsupervised generation of percussion sound sequences from a sound example. In: Sound and Music Computing Conference (2010)
12. Marxer, R., Purwins, H.: Unsupervised incremental learning and prediction of audio signals. In: Proceedings of the 20th International Symposium on Music Acoustics (2010)
13. Pachet, F.: The Continuator: Musical interaction with style. In: Proceedings of ICMC, pp. 211–218. ICMA (2002)
14. Purwins, H., Holonowicz, P., Herrera, P.: Polynomial extrapolation for prediction of surprise based on loudness - a preliminary study. In: Sound and Music Computing Conference, Porto (2009)
15. Ron, D., Singer, Y., Tishby, N.: The power of amnesia: learning probabilistic automata with variable memory length. Mach. Learn. 25(2-3), 117–149 (1996)

Identifying Attack Articulations in Classical Guitar

Tan Hakan Ozaslan, Enric Guaus, Eric Palacios, and Josep Lluis Arcos

IIIA, Artificial Intelligence Research Institute
CSIC, Spanish National Research Council
Campus UAB, 08193 Bellaterra, Spain
{tan,eguaus,epalacios,arcos}@iiia.csic.es

Abstract. The study of musical expressivity is an active field in sound and music computing. The research interest comes from different motivations: to understand or model music expressivity; to identify the expressive resources that characterize an instrument, musical genre, or performer; or to build synthesis systems able to play expressively. Our research is focused on the study of classical guitar and deals with modeling the use of the expressive resources of the guitar. In this paper, we present a system that combines several state-of-the-art analysis algorithms to identify guitar left-hand articulations such as legatos and glissandos. After describing the components of our system, we report some experiments with recordings containing single articulations and short melodies performed by a professional guitarist.

1 Introduction

Musical expressivity can be studied by analyzing the differences (deviations) between a musical score and its execution. These deviations are mainly motivated by two purposes: to clarify the musical structure [26,10,23] and to communicate affective content [16,19,11]. Moreover, these expressive deviations vary depending on the musical genre, the instrument, and the performer. Specifically, each performer has his/her own unique way of adding expressivity when using the instrument.
Our research on musical expressivity aims at developing a system able to model the use of the expressive resources of a classical guitar. In guitar playing, both hands are used: one hand presses the strings on the fretboard and the other plucks the strings. Strings can be plucked using a single plectrum called a flatpick or directly with the tips of the fingers. The hand that presses the frets mainly determines the notes, while the hand that plucks the strings mainly determines the note onsets and timbral properties. However, the left hand is also involved in the creation of note onsets or of different expressive articulations like legatos, glissandos, and vibratos.
Some guitarists use the right hand to pluck the strings whereas others use the left hand.

Fig. 1. Main diagram model of our system

For the sake of simplicity, in the rest of the document we consider the hand that plucks the strings as the right hand and the hand that presses the frets as the left hand.
As a first stage of our research, we are developing a tool able to automatically identify, from a recording, the use of guitar articulations. According to Norton [22], guitar articulations can be divided into three main groups related to the part of the sound where they act: attack, sustain, and release articulations. In this research we are focusing on the identification of attack articulations such as legatos and glissandos. Specifically, we present an automatic detection and classification system that takes audio recordings as input. We can divide our system into two main modules (Figure 1): extraction and classification. The extraction module determines the expressive articulation regions of a classical guitar recording, whereas the classification module analyzes these regions and determines the kind of articulation (legato or glissando).
In both legato and glissando, the left hand is involved in the creation of the note onset. In the case of ascending legato, after plucking the string with the right hand, one of the fingers of the left hand (not already used for pressing one of the frets) presses a fret, causing another note onset. Descending legato is performed by plucking the string with a left-hand finger that was previously used to play a note (i.e., pressing a fret).
The case of glissando is similar, but this time, after plucking one of the strings with the right hand, the left-hand finger that is pressing the string slides to another fret, also generating another note onset.
When playing legato or glissando on guitar, it is common for the performer to play more notes within a beat than the stated timing, enriching the music that is played. A pronounced legato and glissando can easily be differentiated from each other by ear. However, in a musical phrase context where the legato and glissando are not isolated, it is hard to differentiate between these two expressive articulations.
The structure of the paper is as follows: Section 2 briefly describes the current state of the art of guitar analysis studies. Section 3 describes our methodology for articulation determination and classification. Section 4 focuses on the experiments conducted to evaluate our approach. The last section, Section 5, summarizes current results and presents the next research steps.


2 Related Work

Guitar is one of the most popular instruments in western music, and most music genres include the guitar. Although plucked instruments and guitar synthesis have been studied extensively (see [9,22]), the analysis of expressive articulations from real guitar recordings has not been fully tackled. This analysis is complex because the guitar is an instrument with a rich repertoire of expressive articulations and because, when playing guitar melodies, several strings may be vibrating at the same time. Moreover, even the synthesis of a single tone is a complex subject [9].
Expressivity studies go back to the early twentieth century. In 1913, Johnstone [15] analyzed piano performers. Johnstone's analysis can be considered one of the first studies focusing on musical expressivity. Advances in audio processing techniques have made it possible to analyze audio recordings at a finer level (see [12] for an overview). Up to now, several studies have focused on the analysis of expressivity of different instruments. Although the instruments analyzed differ, most of them focus on analyzing monophonic or single-instrument recordings.
For instance, Mantaras et al. [20] presented a survey of computer music systems based on Artificial Intelligence techniques. Examples of AI-based systems are SaxEx [1] and TempoExpress [13]. SaxEx is a case-based reasoning system that generates expressive jazz saxophone melodies from recorded examples of human performances. More recently, TempoExpress performs tempo transformations of audio recordings taking into account the expressive characteristics of a performance and using a CBR approach.
Regarding guitar analysis, an interesting line of research comes from Stanford University. Traube [28] estimated the plucking point on a guitar string by using a frequency-domain technique applied to acoustically recorded signals. The plucking point of a guitar string affects the sound envelope and influences the timbral characteristics of notes. For instance, plucking close to the guitar hole produces more mellow and sustained sounds, whereas plucking near the bridge (the end of the guitar body) produces sharper and less sustained sounds. Traube also proposed an original method to detect the fingering point, based on the plucking point information.
In another interesting paper, Lee [17] proposes a new method for extracting the excitation point of an acoustic guitar signal. Before explaining the method, three state-of-the-art techniques are examined in order to compare them with the new one. The techniques analyzed are matrix pencil inverse-filtering, sinusoids plus noise inverse-filtering, and magnitude spectrum smoothing. After describing and comparing these three techniques, the author proposes a new method, statistical spectral interpolation, for excitation signal extraction.
Although fingering studies are not directly related to expressivity, their results may contribute to clarifying and/or constraining the use of left-hand expressive articulations. Hank Heijink and Ruud G. J. Meulenbroek [14] performed a behavioral study of the complexity of left-hand fingering in classical guitar.


Different audio and camera recordings of six professional guitarists playing the same song were used to find optimal places and fingerings for the notes. Several constraints were introduced to calculate cost functions, such as minimization of jerk, torque change, muscle-tension change, work, energy, and neuromotor variance. As a result of the study, they found a significant effect on timing.
In another interesting study, [25] investigates the optimal fingering position for a given set of notes. Their method, path difference learning, uses tablatures and AI techniques to obtain fingering positions and transitions. Radicioni et al. [24] also worked on finding the proper fingering positions and transitions. Specifically, they calculated the weights of the finger transitions between finger positions by using the weights of Heijink [14]. Burns and Wanderley [4] proposed a method to visually detect and recognize fingering gestures of the left hand of a guitarist by using an affordable camera.
Unlike the general trend in the literature, Trajano [27] investigated right-hand fingering. Although he analyzed the right hand, his approach has similarities with left-hand studies. In his article, Trajano uses his own definitions and cost functions to calculate the optimal selection of right-hand fingers.
The first step when analyzing guitar expressivity is to identify and characterize the way notes are played, i.e., guitar articulations. The analysis of expressive articulations has previously been performed with image analysis techniques. Last but not least, one of the few studies focusing on guitar expressivity is the PhD thesis of Norton [22]. In his dissertation, Norton proposed the use of a motion capture system based on PhaseSpace Inc. to analyze guitar articulations.

3 Methodology

Articulation refers to how the pieces of something are joined together. In music, these pieces are the notes, and the different ways of executing them are called articulations. In this paper we propose a new system that is able to determine and classify two expressive articulations from audio files. For this purpose we have two main modules: the extraction module and the classification module (see Figure 1). In the extraction module, we determine the sound segments where expressive articulations are present. The purpose of this module is to classify audio regions as expressive articulations or not. Next, the classification module analyzes the regions that were identified as candidates of expressive articulations by the extraction module and labels them as legato or glissando.

3.1 Extraction

The goal of the extraction module is to find the places where a performer played expressive articulations. For that purpose, we analyze a recording using several audio analysis algorithms and combine the information obtained from them to take a decision.
Our approach is based on first determining the note onsets caused by plucking the strings. Next, a more fine-grained analysis is performed inside the regions delimited by two plucking onsets to determine whether an articulation may be present. A simple representation diagram of the extraction module is shown in Figure 2.


Fig. 2. Extraction module diagram

For the analysis we used Aubio [2]. Aubio is a library designed for the annotation of audio signals. The Aubio library includes four main applications: aubioonset, aubionotes, aubiocut, and aubiopitch. Each application gives us the chance to try different algorithms and also to tune several other parameters. In the current prototype we are using aubioonset for our plucking detection sub-module and aubionotes for our pitch detection sub-module.
At the end we combine the outputs from both sub-modules and decide whether there is an expressive articulation or not. In the next two sections, the plucking detection sub-module and the pitch detection sub-module are described. Finally, we explain how we combine the information provided by these two sub-modules to determine the existence of expressive articulations.
Plucking Detection. Our first task is to determine the onsets caused by the plucking hand. As we stated before, guitar performers can apply different articulations using both of their hands. However, the kind of articulations that we are investigating (legatos and glissandos) are performed by the left hand. Although they can cause onsets, these onsets are not as powerful in terms of both energy and harmonicity [28]. Therefore, we need an onset determination algorithm suited to this specific characteristic.
The High Frequency Content (HFC) measure is taken across a signal spectrum and can be used to characterize the amount of high-frequency content in the signal: the magnitudes of the spectral bins are added together, multiplying each magnitude by the bin position [21]. As Brossier stated, HFC is effective with percussive onsets but less successful at determining non-percussive and legato phrases [3]. As right-hand onsets are more percussive than left-hand onsets, HFC was the strongest candidate detection algorithm for right-hand onsets. HFC is sensitive to abrupt onsets but not very sensitive to the changes of fundamental frequency caused by the left hand. This is the main reason why we chose HFC to measure the changes in the harmonic content of the signal.
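As an illustration of the measure itself (not aubio's internal implementation), a frame-wise HFC curve can be computed as follows; the window type is an assumption:

```python
import numpy as np

def hfc_curve(y, n_fft=512, hop=256):
    """Frame-wise High Frequency Content: sum over bins of bin_index * |X(k)|."""
    window = np.hanning(n_fft)
    hfc = []
    for start in range(0, len(y) - n_fft, hop):
        spectrum = np.abs(np.fft.rfft(y[start:start + n_fft] * window))
        bins = np.arange(len(spectrum))
        hfc.append(float(np.sum(bins * spectrum)))  # weight magnitudes by bin position
    return np.array(hfc)

# Plucking onsets are then taken as peaks of this curve above a threshold
# (peak-picking threshold 1.7 in our aubioonset setup).
```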
The aubioonset application gave us the opportunity to tune the peak-picking threshold, which we tested on a set of hand-labeled recordings including both articulated and non-articulated notes. We used 1.7 as the peak-picking threshold and 95 dB as the silence threshold. We used this set as our ground truth and tuned our values according to it.

Fig. 3. HFC onsets

Fig. 4. Features of the portion between two onsets
An example of the resulting onsets proposed by HFC is shown in Figure 3. Specifically, in the exemplified recording five plucking onsets are detected (onsets caused by the plucking hand), which are shown with vertical lines. Between some pairs of detected onsets expressive articulations are present. However, as shown in the figure, HFC succeeds as it only detects the onsets caused by the right hand.
Next, each portion between two plucking onsets is analyzed individually. Specifically, we are interested in determining two points: the end of the attack and the start of the release. From experimental measures, the attack end position is considered to be 10 ms after the amplitude reaches its local maximum. The release start position is considered to be the final point where the local amplitude is equal to or greater than 3 percent of the local maximum. For example, Figure 4 zooms in on the first portion of Figure 3. The first and the last vertical lines are the plucking onsets identified by the HFC algorithm. The first dashed line is the place where the attack finishes; the second dashed line is the place where the release starts.
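A sketch of these two rules applied to an amplitude envelope (the envelope computation itself and the names are our assumptions):

```python
import numpy as np

def attack_release(env, fps, rel_ratio=0.03, attack_offset_s=0.010):
    """Locate attack end and release start within one inter-pluck segment.
    env: amplitude envelope of the segment, sampled at fps frames per second."""
    env = np.asarray(env, dtype=float)
    peak = int(np.argmax(env))
    attack_end = peak + int(round(attack_offset_s * fps))    # 10 ms after the local max
    above = np.nonzero(env >= rel_ratio * env[peak])[0]      # frames >= 3% of the max
    release_start = int(above[-1])                           # last such frame
    return attack_end, release_start
```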
Fig. 5. Note Extraction without chroma feature

Pitch Detection. Our second task is to analyze the sound fragment between two onsets. Since we know the onsets of the plucking hand, what we require is another peak detection algorithm with a lower threshold in order to capture the changes in fundamental frequency. Specifically, if the fundamental frequency is not constant between two onsets, we consider the possibility of an expressive articulation to be high.
In the pitch detection module, i.e., to extract onsets and their corresponding fundamental frequencies, we used aubionotes. In the Aubio library, both the onset detection and the fundamental frequency estimation algorithms can be chosen from a number of alternatives. For onset detection, this time we need a more sensitive algorithm than the one we used for right-hand onset detection. Thus, we used the complex domain algorithm [8] to determine the peaks and YIN [6] for the fundamental frequency estimation. Complex domain onset detection is based on a combination of a phase-based and an energy-based approach.
We used 2048 bins as our window size, 512 bins as our hop size, 1 as our peak-picking threshold, and 95 dB as our silence threshold. With these parameters we obtained an output like the one shown in Figure 5. As shown in the figure, the first results were not as we expected. Specifically, they were noisier than expected. There were noisy parts, especially at the beginning of the notes, which generated false-positive peaks. For instance, in Figure 5, many false-positive note onsets are detected in the interval from 0 to 0.2 seconds.

Fig. 6. Note Extraction with chroma feature
A careful analysis of the results demonstrated that the false-positive peaks were located in the regions of the notes' frequency borders. Therefore, we propose a lightweight solution to the problem: to apply chroma filtering to the regions that are at the borders of the complex domain peaks. As shown in Figure 6, after applying the chroma conversion, the results are drastically improved.
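A minimal stand-in for this filtering step, assuming we simply quantize the frame-wise f0 track to the nearest equal-tempered pitch before merging frames into notes (the exact chroma computation used by the system may differ):

```python
import numpy as np

def quantize_f0(f0_hz):
    """Map frame-wise f0 estimates to the nearest equal-tempered note number."""
    f0_hz = np.asarray(f0_hz, dtype=float)
    midi = np.full(f0_hz.shape, -1.0)                 # -1 marks unvoiced frames
    voiced = f0_hz > 0
    midi[voiced] = np.round(69 + 12 * np.log2(f0_hz[voiced] / 440.0))
    return midi

# Consecutive frames sharing the same quantized value are then merged into notes,
# which removes most of the spurious onsets at the note boundaries.
```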
Next, we analyze the fragments between two onsets based on the segments provided by the plucking detection module. Specifically, we analyze the sound fragment between the attack ending point and the release starting point (because the noisiest part of a signal is the attack part, and the release part of a signal contains unnecessary information for pitch detection [7]). Therefore, for our analysis we take the fragment between the attack and release parts, where the pitch information is relatively constant.
Figure 7 shows fundamental frequency values and right-hand onsets. The x-axis represents the time-domain bins and the y-axis represents the frequency. In Figure 7, the vertical lines depict the attack and release parts, respectively. In the middle there is a change in frequency which was not determined as an onset by the first module. Although it may seem like an error, it is in fact a success for our model. Specifically, this phrase contains a glissando, which is a left-hand articulation; it was not identified as an onset by the plucking detection module (HFC algorithm) but was identified by the pitch detection module (complex domain algorithm). The output of the pitch detection module for this recording is shown in Table 1.

Fig. 7. Example of a glissando articulation

Table 1. Output of the pitch detection module

Note Start (s)   Fundamental Frequency (Hz)
0.02             130
0.19             130
0.37             130
0.46             146
0.66             146
0.76             146
0.99             146
1.10             146
1.41             174
1.48             116
Analysis and Annotation. After obtaining the results from the plucking detection and pitch detection modules, the goal of the analysis and annotation module is to determine the candidates for expressive articulations. Specifically, from the results of the pitch detection module we analyze the differences of the fundamental frequencies in the segments between the attack and release parts (provided by the plucking detection module).


Fig. 7. Example of a glissando articulation


Table 1. Output of the pitch detection module

Note Start (s)   Fundamental Frequency (Hz)
0.02             130
0.19             130
0.37             130
0.46             146
0.66             146
0.76             146
0.99             146
1.10             146
1.41             174
1.48             116

For instance, in Table 1 the light gray values represent the attack and release parts, which we did not take into account when applying our decision algorithm.
The differences of the fundamental frequencies are calculated by subtracting from each bin its preceding bin. Thus, when the fragment we are examining is a non-articulated fragment, this operation returns 0 for all bins. On the other hand, in expressively articulated fragments some peaks will arise (see Figure 8 for an example).
In Figure 8 there is only one peak, but in other recordings several consecutive peaks may arise. The explanation is that the left hand also causes an onset, i.e. it also generates a transient part. As a result of this transient, more than one change in fundamental frequency may be present. If those changes or peaks are close to each other, we consider them a single peak. We define this closeness with a pre-determined consecutiveness threshold.


Fig. 8. Difference vector of the fundamental frequency values

Specifically, if the maximum distance between these peaks is 5 bins, we consider them an expressive articulation candidate peak. However, if the peaks are separated from each other by more than the consecutiveness threshold, the fragment is not considered an articulation candidate; our assumption is that it corresponds to a probable noisy part of the signal, a crackle in the recording, or a digital conversion error.
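A compact sketch of this decision step follows, under the assumption that "peaks" are bins whose first difference exceeds a small tolerance (the tolerance value is an illustrative assumption; the 5-bin consecutiveness threshold is the one stated above).

import numpy as np

def articulation_candidate(f0_segment, peak_eps=1.0, consecutive_bins=5):
    """Flag an inter-onset segment as an expressive-articulation candidate.

    f0_segment:       f0 values (Hz) between attack end and release start
    peak_eps:         minimum |difference| (Hz) treated as a real change
    consecutive_bins: peaks closer than this are merged into one peak
    """
    diff = np.diff(f0_segment)                       # bin-to-bin f0 differences
    peaks = np.nonzero(np.abs(diff) > peak_eps)[0]
    if peaks.size == 0:
        return False                                 # constant pitch: no articulation
    # count groups of peaks separated by more than the consecutiveness threshold
    groups = 1 + int(np.sum(np.diff(peaks) > consecutive_bins))
    # a single (merged) peak group is taken as an articulation candidate;
    # widely separated peaks are treated as noise or a crackle
    return groups == 1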
3.2 Classification

The classification module analyzes the regions identified by the extraction module and labels them as legato or glissando. A diagram of the classification module is shown in Figure 9. In this section we first describe how we selected an appropriate descriptor to analyze the behavior of legato and glissando; then we explain the two new components, Model Builder and Detection.
Selecting a Descriptor. After extracting the regions containing candidates for expressive articulations, the next step was to analyze them. Because different expressive articulations (legato vs. glissando) should present different characteristics in terms of changes in amplitude, aperiodicity, or pitch [22], we focused the analysis on comparing these deviations.
Specifically, we built representations of these three features (amplitude, aperiodicity, and pitch). These representations helped us to compare data of different length and density. As stated above, we are mostly interested in changes: changes in high frequency content, changes in fundamental frequency, changes in amplitude, etc. Therefore, we focused on the peaks in the examined data, because peaks are the points where changes occur.
As an example, Figure 10 shows, from top to bottom, the amplitude evolution, the pitch evolution, and the changes in aperiodicity for both a legato and a glissando example. As both figures show, the changes in pitch are similar for glissando and legato. However, the changes in amplitude and aperiodicity present a characteristic slope.
Thus, as a first step we concentrated on determining which descriptor could be used. To make this decision, we built models for both aperiodicity and amplitude using a set of training data.


Fig. 9. Classification module diagram

(a) Features of a legato example

(b) Features of a glissando example

Fig. 10. From top to bottom, representations of amplitude, pitch and aperiodicity of the examined regions

As a result, we obtained two models (one for amplitude and one for aperiodicity) for both legato and glissando, as shown in Figure 11a and Figure 11b. Analyzing the results, amplitude is not a good candidate because the models behave similarly. In contrast, the aperiodicity models present different behavior. Therefore, we selected aperiodicity as the descriptor. The details of the model construction are explained in the Building the Models section.
Preprocessing. Before analyzing and testing our recordings, we applied two different preprocessing techniques to the data in order to make it smoother and ready for comparison: smoothing and envelope approximation.
1. Smoothing. As expected, the aperiodicity portion of the audio file we are examining includes noise. Our first concern was to remove this noise and obtain a cleaner representation. To do so, we first applied a 50-step running median smoothing. Running median smoothing is also known as median filtering.


(a) Amplitude models

(b) Aperiodicity models

Fig. 11. Models for Legato and Glissando

(a) Aperiodicity

(b) Smoothed Aperiodicity

Fig. 12. Features of aperiodicity

Median filtering is widely used in digital image processing because, under certain conditions, it preserves edges while removing noise. Since we are interested both in the edges and in removing noise, this approach fits our purposes. After smoothing, the peak locations of the aperiodicity curves become easier to extract. In Figure 12, the comparison of the aperiodicity and smoothed-aperiodicity graphs exemplifies the smoothing process and shows the results we pursued.
2. Envelope Approximation. After obtaining smoother data, an envelope approximation algorithm was applied. The core idea of the envelope approximation is to obtain a fixed-length representation of the data, especially taking the peaks into account while avoiding small deviations, by connecting the peak approximations linearly. The envelope approximation algorithm has three parts: peak picking, scaling of the peak positions according to a fixed length, and linearly connecting the peaks. After the envelope approximation, all the data regions we are investigating have the same length, i.e. regions are compressed or enlarged depending on their initial size.
We collect all the peaks above a pre-determined threshold and then scale their positions. For instance, imagine that our data contains 10,000 bins, that we want to scale it to 1,000, and that our peak positions are 1460, 1465, 1470, 1500 and 1501.


Fig. 13. Envelope approximation of a legato portion

The algorithm divides all peak locations by 10 (since we want to scale 10,000 bins down to 1,000) and rounds them, so they become 146, 146, 147, 150 and 150. As shown, there are now two peaks at 146 and two at 150. To resolve this duplicity, we keep the one with the highest peak value in each case. After collecting and scaling the peak positions, the peaks are linearly connected. As shown in Figure 13, the obtained graph is an approximation of the graph shown in Figure 12b. The linear approximation helps the system avoid consecutive small tips and dips (a code sketch of both preprocessing steps is given below).
In our case, all the recordings were performed at 60 bpm and all the notes in the recordings are 8th notes. That is, each note lasts half a second, and each legato or glissando portion lasts one second. We recorded with a sampling rate of 44,100 Hz and performed our analysis with a hop size of 32 bins, i.e. 44100/32 ≈ 1378 bins per second. We knew that this was our upper limit. For the sake of simplicity, we scaled our x-axis to 1000 bins.
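The two preprocessing steps can be sketched as follows (Python/NumPy/SciPy). A 51-sample median kernel stands in for the 50-step running median because medfilt requires an odd width; the peak-picking threshold and function names are illustrative assumptions.

import numpy as np
from scipy.signal import medfilt

def smooth_aperiodicity(curve, width=51):
    """Running-median smoothing of an aperiodicity curve."""
    return medfilt(np.asarray(curve, dtype=float), kernel_size=width)

def envelope_approximation(curve, out_len=1000, peak_threshold=0.0):
    """Fixed-length envelope: pick peaks, rescale their positions, connect linearly."""
    curve = np.asarray(curve, dtype=float)
    n = len(curve)
    # simple peak picking: local maxima above the threshold
    peaks = [i for i in range(1, n - 1)
             if curve[i] > curve[i - 1] and curve[i] >= curve[i + 1]
             and curve[i] > peak_threshold]
    env = np.zeros(out_len)
    best = {}
    for i in peaks:
        pos = int(round(i * (out_len - 1) / (n - 1)))    # rescaled peak position
        best[pos] = max(best.get(pos, 0.0), curve[i])     # keep the highest duplicate
    if not best:
        return env
    xs = sorted(best)
    ys = [best[x] for x in xs]
    env[:] = np.interp(np.arange(out_len), xs, ys)        # linear connection of peaks
    return env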
Building the Models. After applying the preprocessing techniques, we obtained equal-length aperiodicity representations of all our expressive articulation portions. The next step was to construct models for both legato and glissando using this data. In this section we describe how we constructed the models shown in Figure 11a and Figure 11b. The following steps were used to construct the models: histogram calculation, smoothing and envelope approximation (explained in the Preprocessing section), and finally SAX representation. Here we present the histogram calculation and the SAX representation techniques.
1. Histogram Calculation. Another method we use is histogram envelope calculation. We use this technique to compute the peak density of a set of data. Specifically, a set of recordings containing 36 legato and 36 glissando examples (recorded by a professional classical guitarist) was used as the training set. First, for each legato and glissando example, we determined the peaks.


(a) Legato Histogram

(b) Glissando Histogram

Fig. 14. Peak histograms of legato and glissando training sets

(a) Legato Final Envelope

(b) Glissando Final Envelope

Fig. 15. Final envelope approximation of peak histograms of legato and glissando
training sets

Since we want to model the places where condensed peaks occur, this time we used a threshold of 30 percent and collected the peaks with amplitude values above this threshold. Notice that this threshold is different from the one used in the envelope approximation. Then, we used histograms to compute the density of the peak locations. Figure 14 shows the resulting histograms. After constructing the histograms, as shown in Figure 14, we applied our envelope approximation method to construct the envelopes of the legato and glissando histogram models (see Figure 15).
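A possible rendering of this histogram step is sketched below, assuming the 36 training examples are already available as equal-length envelope approximations; the number of histogram bins is an illustrative choice.

import numpy as np

def peak_histogram_model(envelopes, threshold=0.30, n_bins=50):
    """Peak-density histogram over a set of fixed-length envelopes.

    envelopes: list of equal-length envelope approximations (e.g. 36 legato examples)
    threshold: fraction of each envelope's maximum a peak must exceed (30%)
    n_bins:    number of histogram bins along the scaled time axis
    """
    out_len = len(envelopes[0])
    positions = []
    for env in envelopes:
        env = np.asarray(env, dtype=float)
        limit = threshold * env.max()
        for i in range(1, out_len - 1):
            if env[i] > env[i - 1] and env[i] >= env[i + 1] and env[i] > limit:
                positions.append(i)                 # location of a condensed peak
    hist, _ = np.histogram(positions, bins=n_bins, range=(0, out_len))
    return hist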
2. SAX: Symbolic Aggregate Approximation. Although the histogram envelope approximations of legato and glissando in Figure 15 are close to our purposes, they still include noisy sections. Rather than these abrupt changes (noise), we are interested in a more general representation reflecting the changes more smoothly. SAX (Symbolic Aggregate Approximation) [18] is a symbolic representation used in time series analysis that provides dimensionality reduction while preserving the properties of the curves. Moreover, the SAX representation makes distance measurements easier. We therefore applied the SAX representation to the histogram envelope approximations.


(a) Legato SAX representation (b) Glissando SAX representation


Fig. 16. SAX representation of legato and glissando final models

As mentioned in the Envelope Approximation section, we scaled the x-axis to 1000 bins. We made tests with step sizes of 10 and 5. As we report in the Experiments section, a step size of 5 gave better results. We also tested step sizes lower than 5, but the performance clearly decreased. Since we use a step size of 5, each step becomes 200 bins in length. After obtaining the SAX representation of each expressive articulation, we apply the distance calculation algorithm explained in the next section.
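For reference, a minimal SAX transformation in the spirit of Lin et al. [18] is sketched below: z-normalisation, piecewise aggregate approximation into the chosen number of steps, and discretisation against Gaussian breakpoints. The alphabet size is not specified in the text above, so it is left as a parameter here.

import numpy as np

# Gaussian breakpoints for some common SAX alphabet sizes
BREAKPOINTS = {3: [-0.43, 0.43],
               4: [-0.67, 0.0, 0.67],
               5: [-0.84, -0.25, 0.25, 0.84]}

def sax(series, n_segments=5, alphabet_size=4):
    """Symbolic Aggregate Approximation of a 1-D series.

    n_segments:    number of equal-length segments (the 'step size' above)
    alphabet_size: number of symbols (must be a key of BREAKPOINTS)
    Returns a list of symbol indices, one per segment.
    """
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)                 # z-normalisation
    x = x[: len(x) - len(x) % n_segments]                  # trim to a multiple
    paa = x.reshape(n_segments, -1).mean(axis=1)           # piecewise aggregate means
    cuts = BREAKPOINTS[alphabet_size]
    return [int(np.searchsorted(cuts, v)) for v in paa]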
Detection. After obtaining the SAX representations of the glissando and legato models, we divided them into two regions: a first region between bins 400 and 600, and a second region between bins 600 and 800 (see Figure 17). For the expressive articulation excerpt, we have an envelope approximation representation with the same length as the SAX representation of the final models, so we can compare the regions. For the final expressive articulation models (see Figure 16) we took the value of each region and computed the deviation (slope) between the two regions. We performed this computation for the legato and the glissando model separately.
We also computed the same deviation for each expressive articulation envelope approximation (see Figure 18). But this time, since we do not have a SAX representation, we do not have a single value per region. Therefore, for each region we computed the local maximum and took the deviation (slope) between these two local maxima. After obtaining this value, we compare this deviation value with the values obtained from the final legato and glissando models. If the deviation value is closer to the legato model, the expressive articulation is labeled as a legato, and vice versa.
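The comparison just described can be summarised as in the sketch below; using the local maximum as the representative value of a region for both the candidate and the models is a simplification of the procedure above (the models actually contribute a single SAX value per region).

def region_value(curve, lo, hi):
    """Representative value of a region: its local maximum."""
    return max(curve[lo:hi])

def classify_articulation(candidate_env, legato_model, glissando_model,
                          r1=(400, 600), r2=(600, 800)):
    """Label a candidate as 'legato' or 'glissando' by comparing region slopes.

    candidate_env: envelope approximation of the candidate region (length 1000)
    *_model:       final model curves on the same 1000-bin axis
    """
    def slope(curve):
        return region_value(curve, *r2) - region_value(curve, *r1)

    s = slope(candidate_env)
    return ("legato"
            if abs(s - slope(legato_model)) <= abs(s - slope(glissando_model))
            else "glissando")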

4 Experiments

The goal of the experiments was to test the performance of our model. Since the different modules were designed to work independently of each other, we tested the Extraction and Classification modules separately. After these separate studies, we combined the results to assess the overall performance of the proposed system.


(a) Legato

(b) Glissando

Fig. 17. Peak occurrence deviation

Fig. 18. Expressive articulation difference

As explained in Section 1, legato and glissando can be played on ascending or descending intervals. Thus, we were interested in studying the results while distinguishing between these two movements. Additionally, since a guitar has three nylon strings and three metallic strings, we also studied the results taking these two sets of strings into account.
4.1 Recordings

Borrowing from Carlevaro's guitar exercises [5], we recorded a collection of ascending and descending chromatic scales. Legato and glissando examples were recorded by a professional classical guitar performer. The performer was asked to play chromatic scales in three different regions of the guitar fretboard. Specifically, we recorded notes from the first 12 frets of the fretboard, where each recording concentrated on 4 specific frets. The basic exercise for the first fretboard region is shown in Figure 19.


Fig. 19. Legato score in first position

(a) Phrase 1   (b) Phrase 2   (c) Phrase 3   (d) Phrase 4   (e) Phrase 5

Fig. 20. Short melodies

Each scale contains 24 ascending and 24 descending notes, and each exercise contains 12 expressive articulations (the ones connected with an arch). Since we repeated the exercise at three different positions, we obtained 36 legato and 36 glissando examples. Notice that we also made recordings with a neutral articulation (neither legatos nor glissandos). We presented all 72 examples to our system.
As a preliminary test with more realistic recordings, we also recorded a small set of 5-6 note phrases. They include different articulations in random places (see Figure 20). As shown in Table 3, each phrase includes a different number of expressive articulations, varying from 0 to 2. For instance, Phrase 3 (see Figure 20c) does not have any expressive articulation, while Phrase 4 (see Figure 20d) contains the same notes as Phrase 3 but includes two expressive articulations: first a legato and then an appoggiatura.
4.2 Experiments with Extraction Module

First, we analyzed the accuracy of the extraction module in identifying regions with legatos. The hypothesis was that legatos are the easiest articulations to detect because they are composed of two long notes. Next, we analyzed the accuracy in identifying regions with glissandos. Because in this situation the first note (the glissando) has a short duration, it may be confused with the attack.
Scales. We first applied our system to single expressive and non-expressive articulations. All the recordings were hand labeled; they also served as our ground truth.

Table 2. Performance of the extraction module applied to single articulations

Recordings              Nylon String   Metallic String
Non-expressive               90%             90%
Ascending Legatos            80%             90%
Descending Legatos           90%             70%
Ascending Glissandos         70%             70%
Descending Glissandos        70%             70%

Table 3. Results of the extraction module applied to short phrases

Excerpt Name   Ground Truth   Detected
Phrase 1            1             2
Phrase 2            2             2
Phrase 3            0             0
Phrase 4            2             3
Phrase 5            1             1

We compared the output results with the annotations; the output was the number of detected expressive articulations in each sound fragment.
Analyzing the experiments (see Table 2), different conclusions can be extracted. First, as expected, legatos are easier to detect than glissandos. Second, on non-steel strings the melodic direction does not cause a difference in performance. Regarding steel strings, descending legatos are more difficult to detect than ascending legatos (70% versus 90%). This result is not surprising because the plucking action of the left-hand fingers in descending legatos is somewhat similar to a right-hand plucking. However, this difference does not appear in glissandos because the finger movement is the same.
Short melodies. We tested the performance of the extraction module on the recordings of short melodies with the same settings used for the scales, except for the release threshold. Specifically, since in the short phrase recordings the transition parts between two notes contain more noise, the average amplitude between two onsets was higher. Because of that, the release threshold has to be increased in a more realistic scenario. After some experiments, we fixed the release threshold at 30%.
Analyzing the results, the performance of our model was similar to the previous experiments, i.e. when we analyzed single articulations. However, in two phrases a note played with a soft right-hand plucking was proposed as a legato candidate (Phrase 1 and Phrase 4).
The final step of the extraction module is to annotate the sound fragments where a possible attack articulation (legato or glissando) is detected. Specifically, to help the system's validation, the whole recording is presented to the user and the candidate fragments for expressive articulations are colored. As an example, Figure 21 shows the annotation of Phrase 2 (see the score in Figure 20b). Phrase 2 has two expressive articulations, which correspond to the portions colored in black.


Fig. 21. Annotated output of Phrase 2

4.3 Experiments with Classification Module

After testing the extraction module, we used the same audio files (this time only the legato and glissando examples) to test our classification module. As explained in Section 3.2, we performed experiments applying different step sizes for the SAX representation. Specifically (see the results reported in Table 4), we observe that a step size of 5 is the most appropriate setting. This result corroborates that a higher resolution when discretizing is not required, and demonstrates that the SAX representation provides a powerful technique for summarizing the information about changes.
The overall performance for legato identification was 83.3% and the overall performance for glissando identification was 80.5%. Notice that the identification of ascending legatos reached 100% accuracy, whereas descending legatos achieved only 66.6%. Regarding glissando, there was no significant difference between ascending and descending accuracy (83.3% versus 77.7%). Finally, when considering the string type, the results presented a similar accuracy on both nylon and metallic strings.
4.4 Experiments with the Whole System

After testing the main modules separately, we studied the performance of the whole system using the same recordings. Since in our previous experiments a step size of 5 gave the best analysis results, we ran these experiments with a step size of 5 only.
Since we had errors in both the extraction module and the classification module, the combined results present a lower accuracy (see the results in Table 5).


Table 4. Performance of the classification module applied to the test set

                               Step Size
Recordings                    5         10
Ascending Legato           100.0 %   100.0 %
Descending Legato           66.6 %    72.2 %
Ascending Glissando         83.3 %    61.1 %
Descending Glissando        77.7 %    77.7 %
Legato Nylon Strings        80.0 %    86.6 %
Legato Metallic Strings     86.6 %    85.6 %
Glissando Nylon Strings     83.3 %    61.1 %
Glissando Metallic Strings  77.7 %    77.7 %

Table 5. Performance of the whole model applied to the test set

Recordings                  Accuracy
Ascending Legato             85.0 %
Descending Legato            53.6 %
Ascending Glissando          58.3 %
Descending Glissando         54.4 %
Legato Nylon Strings         68.0 %
Legato Metallic Strings      69.3 %
Glissando Nylon Strings      58.3 %
Glissando Metallic Strings   54.4 %

For ascending legatos we had 100% accuracy in the classification module experiments (see Table 4), but since there was a 15% total error in detecting ascending legato candidates in the extraction module (see Table 2), the overall accuracy decreases to 85% (see Table 5).
Also, regarding the ascending glissandos, although we reached a high accuracy in the classification module (83.3%), the 70% accuracy of the extraction module reduced the overall result to 58.3%. Similar conclusions can be drawn for the rest of the accuracy results.

5 Conclusions

In this paper we presented a system that combines several state-of-the-art analysis algorithms to identify left-hand articulations such as legatos and glissandos. Specifically, our proposal uses HFC for plucking detection and the Complex Domain and YIN algorithms for pitch detection. Combining the data coming from these different sources, we developed a first decision mechanism, the Extraction module, to identify regions where attack articulations may be present. Next, the Classification module analyzes the regions annotated by the Extraction module and tries to determine the articulation type.


Our proposal is to use aperiodicity information to identify the articulation and a SAX representation to characterize the articulation models. Finally, applying a distance measure to the trained models, articulation candidates are classified as legato or glissando.
We reported experiments to validate our proposal by analyzing a collection of chromatic exercises and short melodies recorded by a professional guitarist. Although we are aware that our current system can be improved, the results show that it is able to identify and classify these two attack-based articulations successfully. As expected, legatos are easier to identify than glissandos; specifically, the short duration of a glissando is sometimes confused with a single note attack.
As a next step, we plan to incorporate more analysis and decision components into our system with the aim of covering all the main expressive articulations used in guitar playing. We are currently working on improving the performance of both modules and on adding additional expressive resources such as vibrato analysis. Additionally, we are exploring the possibility of dynamically changing the parameters of the analysis algorithms, for instance using different parameters depending on the string on which the notes are played.

Acknowledgments
This work was partially funded by NEXT-CBR (TIN2009-13692-C03-01), IL4LTS (CSIC-200450E557) and by the Generalitat de Catalunya under the grant 2009-SGR-1434. Tan Hakan Ozaslan is a PhD student of the Doctoral Program in Information, Communication, and Audiovisual Technologies of the Universitat Pompeu Fabra. We also want to thank the professional guitarist Mehmet Ali Yıldırım for his contribution with the recordings.

References
1. Arcos, J.L., López de Mántaras, R., Serra, X.: Saxex: a case-based reasoning system for generating expressive musical performances. Journal of New Music Research 27(3), 194-210 (1998)
2. Brossier, P.: Automatic annotation of musical audio for interactive systems. Ph.D. thesis, Centre for Digital Music, Queen Mary University of London (2006)
3. Brossier, P., Bello, J.P., Plumbley, M.D.: Real-time temporal segmentation of note objects in music signals. In: Proceedings of the International Computer Music Conference, ICMC 2004 (November 2004)
4. Burns, A., Wanderley, M.: Visual methods for the retrieval of guitarist fingering. In: NIME 2006: Proceedings of the 2006 Conference on New Interfaces for Musical Expression, Paris, pp. 196-199 (June 2006)
5. Carlevaro, A.: Serie didactica para guitarra, vol. 4. Barry Editorial (1974)
6. de Cheveigné, A., Kawahara, H.: YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America 111(4), 1917-1930 (2002)
7. Dodge, C., Jerse, T.A.: Computer Music: Synthesis, Composition, and Performance. Macmillan Library Reference (1985)
8. Duxbury, C., Bello, J.P., Davies, M., Sandler, M.: Complex domain onset detection for musical signals. In: Proceedings of the Digital Audio Effects Workshop (DAFx 2003) (2003)
9. Erkut, C., Välimäki, V., Karjalainen, M., Laurson, M.: Extraction of physical and expressive parameters for model-based sound synthesis of the classical guitar. In: 108th AES Convention, pp. 19-22 (February 2000)
10. Gabrielsson, A.: Once again: The theme from Mozart's piano sonata in A major (K. 331). A comparison of five performances. In: Gabrielsson, A. (ed.) Action and Perception in Rhythm and Music, pp. 81-103. Royal Swedish Academy of Music, Stockholm (1987)
11. Gabrielsson, A.: Expressive intention and performance. In: Steinberg, R. (ed.) Music and the Mind Machine, pp. 35-47. Springer, Berlin (1995)
12. Gouyon, F., Herrera, P., Gómez, E., Cano, P., Bonada, J., Loscos, A., Amatriain, X., Serra, X.: Content Processing of Music Audio Signals, pp. 83-160. Logos Verlag, Berlin (2008), http://smcnetwork.org/public/S2S2BOOK1.pdf
13. Grachten, M., Arcos, J.L., López de Mántaras, R.: A case based approach to expressivity-aware tempo transformation. Machine Learning 65(2-3), 411-437 (2006)
14. Heijink, H., Meulenbroek, R.G.J.: On the complexity of classical guitar playing: functional adaptations to task constraints. Journal of Motor Behavior 34(4), 339-351 (2002)
15. Johnstone, J.A.: Phrasing in Piano Playing. Witmark, New York (1913)
16. Juslin, P.: Communicating emotion in music performance: a review and a theoretical framework. In: Juslin, P., Sloboda, J. (eds.) Music and Emotion: Theory and Research, pp. 309-337. Oxford University Press, New York (2001)
17. Lee, N., Zhiyao, D., Smith, J.O.: Excitation signal extraction for guitar tones. In: International Computer Music Conference, ICMC 2007 (2007)
18. Lin, J., Keogh, E., Wei, L., Lonardi, S.: Experiencing SAX: a novel symbolic representation of time series. Data Mining and Knowledge Discovery 15(2), 107-144 (2007)
19. Lindström, E.: 5 x "Oh, my darling Clementine": the influence of expressive intention on music performance (1992). Department of Psychology, Uppsala University
20. de Mántaras, R.L., Arcos, J.L.: AI and music: from composition to expressive performance. AI Magazine 23(3), 43-57 (2002)
21. Masri, P.: Computer Modeling of Sound for Transformation and Synthesis of Musical Signal. Ph.D. thesis, University of Bristol (1996)
22. Norton, J.: Motion capture to build a foundation for a computer-controlled instrument by study of classical guitar performance. Ph.D. thesis, Stanford University (September 2008)
23. Palmer, C.: Anatomy of a performance: Sources of musical expression. Music Perception 13(3), 433-453 (1996)
24. Radicioni, D.P., Lombardo, V.: A constraint-based approach for annotating music scores with gestural information. Constraints 12(4), 405-428 (2007)
25. Radisavljevic, A., Driessen, P.: Path difference learning for guitar fingering problem. In: International Computer Music Conference (ICMC 2004) (2004)
26. Sloboda, J.A.: The communication of musical metre in piano performance. Quarterly Journal of Experimental Psychology 35A, 377-396 (1983)
27. Trajano, E., Dahia, M., Santana, H., Ramalho, G.: Automatic discovery of right hand fingering in guitar accompaniment. In: Proceedings of the International Computer Music Conference (ICMC 2004), pp. 722-725 (2004)
28. Traube, C., Depalle, P.: Extraction of the excitation point location on a string using weighted least-square estimation of a comb filter delay. In: Proceedings of the 6th International Conference on Digital Audio Effects, DAFx 2003 (2003)

Comparing Approaches to the Similarity of Musical Chord Sequences
W.B. de Haas1, Matthias Robine2, Pierre Hanna2, Remco C. Veltkamp1, and Frans Wiering1

1 Utrecht University, Department of Information and Computing Sciences
PO Box 80.089, 3508 TB Utrecht, The Netherlands
{bas.dehaas,remco.veltkamp,frans.wiering}@cs.uu.nl
2 LaBRI - Université de Bordeaux
F-33405 Talence cedex, France
{pierre.hanna,matthias.robine}@labri.fr

Abstract. We present a comparison between two recent approaches to the harmonic similarity of musical chord sequences. In contrast to earlier work that mainly focuses on the similarity of musical notation or musical audio, in this paper we specifically use the symbolic chord description as the primary musical representation. For an experiment, a large chord sequence corpus was created. In this experiment we compare a geometrical and an alignment approach to harmonic similarity, and measure the effects of chord description detail and a priori key information on retrieval performance. The results show that an alignment approach significantly outperforms a geometrical approach in most cases, but that the geometrical approach is computationally more efficient than the alignment approach. Furthermore, the results demonstrate that a priori key information boosts retrieval performance, and that using a triadic chord representation yields significantly better results than a simpler or more complex chord representation.

Keywords: Music Information Retrieval, Musical Harmony, Similarity, Chord Description, Evaluation, Ground-truth Data.

1 Introduction

In the last decades Music Information Retrieval (MIR) has evolved into a broad research area that aims at making large repositories of digital music maintainable and accessible. Within MIR research two main directions can be discerned: symbolic music retrieval and the retrieval of musical audio. The first direction traditionally uses score-based representations to research typical retrieval problems. One of the most important and most intensively studied of these is probably the problem of determining the similarity of a specific musical feature, e.g. melody, rhythm, etc. The second direction, musical audio retrieval, extracts features from the audio signal and uses these features for estimating whether two pieces of music share certain musical properties.



In this paper we focus on a musical representation that is symbolic but can be automatically derived from musical audio with reasonable effectiveness: chord descriptions.
Only recently, partly motivated by the growing interest in audio chord finding, have MIR researchers started using chord descriptions as the principal representation for modeling music similarity. Naturally, these representations are specifically suitable for capturing the harmonic similarity of a musical piece. However, determining the harmonic similarity of sequences of chord descriptions raises three questions. First, what is harmonic similarity? Second, why do we need harmonic similarity? Last, do sequences of chord descriptions provide a valid and useful abstraction of the musical data for determining music similarity? The first two questions we address in this introduction; the third question we answer empirically in a retrieval experiment, in which we compare a geometrical and an alignment-based harmonic similarity measure.
The first question, what harmonic similarity is, is difficult to answer. We strongly believe that if we want to model what makes two pieces of music similar, we must not only look at the musical data, but especially at the human listener. It is important to realize that music only becomes music in the mind of the listener, and probably not all information needed for a good similarity judgment can be found in the data alone. Human listeners, musicians or non-musicians, have extensive culture-dependent knowledge about music that needs to be taken into account when judging music similarity.
In this light we consider the harmonic similarity of two chord sequences to be the degree of agreement between structures of simultaneously sounding notes (i.e. chords) and the agreement between global as well as local relations between these structures in both sequences, as perceived by the human listener. With the agreement between structures of simultaneously sounding notes we denote the similarity that a listener perceives when comparing two chords in isolation, without the surrounding musical context. However, chords are rarely compared in isolation: the relations to the global context, the key of a piece, and the relations to the local context play a very important role in the perception of tonal harmony. The local relations can be considered the relations between the functions of chords within a limited time frame, for instance the preparation of a chord with a dominant function by a sub-dominant. All these factors play a role in the perception of tonal harmony and should be shared by two compared pieces up to a certain extent if they are to be considered similar.
The second question, about the usefulness of harmonic similarity, is easier to answer, since music retrieval based on harmony sequences offers various benefits. It allows for finding different versions of the same song even when the melodies vary. This is often the case in cover songs or live performances, especially when these performances contain improvisations. Moreover, playing the same harmony with different melodies is an essential part of musical styles like jazz and blues. Also, variations over standard basses in baroque instrumental music can be harmonically very related.
The application of harmony matching methods is broadened further by the extensive work on chord description extraction from musical audio data within the MIR community, e.g. [20,5].


Chord labeling algorithms extract symbolic chord labels from musical audio; these labels can be matched directly using the algorithms covered in this paper.
If you would ask a jazz musician the third question, whether sequences of chord descriptions are useful, he would probably agree that they are, since working with chord descriptions is everyday practice in jazz. However, we will show in this paper that they are also useful for retrieving pieces with a similar but not identical chord sequence, by performing a large experiment. In this experiment we compare two harmonic similarity measures, the Tonal Pitch Step Distance (TPSD) [11] and the Chord Sequence Alignment System (CSAS) [12], and test the influence of different degrees of detail in the chord description and of knowledge of the global key of a piece on retrieval performance.
The next section gives a brief overview of the current achievements in chord sequence similarity matching and harmonic similarity in general; Section 3 describes the data used in the experiment and Section 4 presents the results.
Contribution. This paper presents an overview of chord sequence based harmonic similarity. Two harmonic similarity approaches are compared in an experiment. For this experiment a new large corpus of 5028 chord sequences was assembled. Six retrieval tasks are defined for this corpus, to which both algorithms are subjected. All tasks use the same dataset, but differ in the amount of chord description detail and in the use of a priori key information. The results show that a computationally costly alignment approach significantly outperforms a much faster geometrical approach in most cases, that a priori key information boosts retrieval performance, and that using a triadic chord representation yields significantly better results than a simpler or more complex chord representation.

2 Background: Similarity Measures for Chord Sequences

The harmonic similarity of musical information has been investigated by many authors, but the number of systems that focus solely on the similarity of chord sequences is much smaller. Of course it is always possible to convert notes into chords and vice versa, but this is not a trivial task. Several algorithms can correctly segment and label approximately 80 percent of a symbolic dataset (see [26] for a review). Within the audio domain, hidden Markov models are frequently used for chord label assignment, e.g. [20,5]. The algorithms considered in this paper abstract from these labeling tasks and focus on the similarity between chord progressions only. As a consequence, we assume that we have a sequence of symbolic chord labels describing the chord progression in a piece of music.
The systems currently known to us that are designed to match these sequences of symbolic chord descriptions are the TPSD [11], the CSAS [12] and a harmony grammar approach [10]. The first two are quantitatively compared in this paper and are introduced in the next two subsections, respectively. They have been compared before, but all previous evaluations of the TPSD and CSAS were done with relatively small datasets (fewer than 600 songs), and the dataset used in [12] was different from the one used in [11].


[Plot: TPS Score (0-13, y-axis) versus Beat (0-150, x-axis) for two versions of All The Things You Are]

Fig. 1. A plot demonstrating the comparison of two similar versions of All the Things
You Are using the TPSD. The total area between the two step functions, normalized
by the duration of the shortest song, represents the distance between both songs. A
minimal area is obtained by shifting one of the step functions cyclically.

The harmony grammar approach could, at the time of writing, not compete in this experiment because in its current state it is not yet able to parse all the songs in the dataset used.
The next section introduces the TPSD and the improvements of the implementation used for the experiment here over the implementation in [11]. Section 2.2 highlights the different variants of the CSAS. The main focus of this paper is on the similarity of sequences of chord labels, but other relevant harmony-based retrieval methods exist: some of these are briefly reviewed in Section 2.3.
2.1 Tonal Pitch Step Distance

The TPSD uses Lerdahl's [17] Tonal Pitch Space (TPS) as its main musical model. TPS is a model of tonality that fits musicological intuitions, correlates well with empirical findings from music cognition [16] and can be used to calculate a distance between two arbitrary chords. The TPS model can be seen as a scoring mechanism that takes into account the number of steps on the circle of fifths between the roots of the chords, and the amount of overlap between the chord structures of the two chords and their relation to the global key.
The general idea behind the TPSD is to use TPS to compare the change of chordal distance to the tonic over time. For every chord, the TPS distance between the chord and the key of the sequence is calculated, which results in a step function (see Figure 1). As a consequence, information about the key of the piece is essential. Next, the distance between two chord sequences is defined as the minimal area between the two step functions over all possible horizontal circular shifts. To prevent longer sequences from yielding larger distances, the score is normalized by dividing it by the duration of the shortest song.
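A brute-force sketch of this distance (not the optimized matching described below) may clarify the idea; how sequences of unequal length are overlapped here, by truncating to the shorter length after each shift, is a simplification and an assumption of the sketch.

import numpy as np

def tpsd_distance(step_a, step_b):
    """Area-based distance between two per-beat TPS step functions.

    step_a, step_b: TPS distances to the tonic, one value per beat.
    The longer function is cyclically shifted and the minimal absolute area
    between the two, normalized by the shorter duration, is returned.
    """
    a, b = np.asarray(step_a, float), np.asarray(step_b, float)
    if len(a) > len(b):
        a, b = b, a                        # a is now the shorter sequence
    n, m = len(a), len(b)
    best = np.inf
    for shift in range(m):                 # all cyclic shifts of the longer one
        b_shift = np.roll(b, -shift)[:n]   # compare against a window of b
        best = min(best, float(np.sum(np.abs(a - b_shift))))
    return best / n                        # normalize by the shortest duration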


TPS is an elaborate model that allows comparing every arbitrary chord in an arbitrary key to every other possible chord in any key. The TPSD does not use the complete model and only utilizes the parts that facilitate the comparison of two chords within the same key. In the current implementation of the TPSD time is represented in beats, but generally any discrete representation could be used.
The TPSD version used in this paper contains a few improvements compared to the version used in [11]: by applying a different step-function matching algorithm from [4], and by exploiting the fact that we use discrete time units, which enables us to sort in linear time using counting sort [6], a running time of O(nm) is achieved, where n and m are the number of chord symbols in both songs. Furthermore, to be able to use the TPSD in situations where a priori key information is not available, the TPSD is extended with a key-finding algorithm.
Key finding. The problem of finding the global key of a piece of music is called key finding. For this study, this is done on the basis of chord information only. The rationale behind the key-finding algorithm we present here is the following: we consider the key that minimizes the total TPS distance and best matches the starting and ending chord to be the key of the piece.
For minimizing the total TPS distance, the TPSD key finding uses TPS-based step functions as well. We assume that when a song matches a particular key, the TPS distances between the chords and the tonic of that key are relatively small. The general idea is to calculate 24 step functions for a single chord sequence, one for each major and minor key. Subsequently, all these keys are ranked by sorting them on the area between their TPS step function and the x-axis; the smaller the total area, the better the key fits the piece, and the higher the rank. Often, the key at the top rank is the correct key. However, among the false positives at rank one, unsurprisingly, the IV, V and VI relative to the ground-truth key¹ are found regularly. This makes sense because, when the total of the TPS distances of the chords to C is small, the distances to F, G and Am might be small as well. Therefore, to increase performance, an additional scoring mechanism is designed that takes into account the IV, V and VI relative to the ground-truth key. Of all 24 keys, the candidate key that minimizes the following sum S is considered the key of the piece.

S = α · r(I) + r(IV) + r(V) + r(VI) + {-β if the first chord matches the key, 0 otherwise} + {-β if the last chord matches the key, 0 otherwise}    (1)
Here r(·) denotes the rank of the candidate key, the parameter α determines how important the tonic is compared to the other frequently occurring scale degrees, and β controls the importance of the key matching the first and last chord. The parameters α and β were tuned by hand; an α of 2 and a β of 4 were found to give good results.
¹ The Roman numerals here represent the diatonic interval between the key in the ground truth and the predicted key.


Clearly, this simple key-finding algorithm is biased towards western diatonic music, but for the corpus used in this paper it performs quite well. The algorithm scores 88.8 percent correct on a subset of 500 songs of the corpus used in the experiment below, for which we manually checked the correctness of the ground-truth key. The above algorithm takes O(n) time, where n is the number of chord symbols, because the number of keys is constant.
Root interval step functions. For the tasks where only the chord root is used, we use a different step function representation (see Section 4). In these tasks the interval between the chord root and the root note of the key defines the step height, and the duration of the chord again defines the step length. This matching method is very similar to the melody matching approach by Aloupis et al. [2]. Note that the latter was never tested in practice. The matching and key-finding methods are not different from the other variants of the TPSD. Note that in all TPSD variants, chord inversions are ignored.
2.2 Chord Sequence Alignment System

The CSAS algorithm is based on local alignment and computes similarity scores between sequences of symbols representing chords or distances between chords and key. String matching techniques can be used to quantify the differences between two such sequences. Among several existing methods, Smith and Waterman's approach [25] detects similar areas in two sequences of arbitrary symbols. This local alignment or local similarity algorithm locates and extracts a pair of regions, one from each of the two given strings, that exhibit high similarity. A similarity score is calculated by performing elementary operations transforming the one string into the other. The operations used to transform the sequences are deletion, insertion or substitution of a symbol. The total transformation of the one string into the other can be computed with dynamic programming in quadratic time.
The following example illustrates local alignment by computing a distance between the first chords of two variants of the song All The Things You Are, considering only the root notes of the chords. I, S and D denote Insertion, Substitution, and Deletion of a symbol, respectively; an M represents a matching symbol:
string 1     -    F    Bb   Eb   Ab   -    Db   D    G    C    C
string 2     F    F    Bb   Ab   Ab   Ab   Db   D    -    C    C
operation    I    M    M    S    M    I    M    M    D    M    M
score       -1   +2   +2   -2   +2   -1   +2   +2   -1   +2   +2
Algorithms based on local alignment have been successfully adapted for melodic similarity [21,13,15], and recently they have been used to determine harmonic similarity [12] as well. Two steps are necessary to apply the alignment technique to the comparison of chord progressions: the choice of the representation of a chord sequence, and the scores of the elementary operations between symbols. To take the durations of the chords into account, we represent the chords at every beat. The algorithm therefore has a complexity of O(nm), where n and m are the sizes of the compared songs in beats.


The score function can either be adapted to the chosen representation or can simply be binary, i.e. the score is positive (+2) if the two chords described are identical and negative (-2) otherwise. The insertion or deletion score is set to -1.
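The following sketch shows the classical Smith-Waterman recurrence with the binary score function and the indel cost just given; it returns only the best local score, not the alignment itself, and is an illustration rather than the CSAS implementation.

import numpy as np

def local_alignment_score(seq_a, seq_b, match=2, mismatch=-2, indel=-1):
    """Smith-Waterman local alignment score between two per-beat chord sequences.

    seq_a, seq_b: chord symbols, one per beat (e.g. root names or key-relative
                  distances), compared with the binary score function above.
    """
    n, m = len(seq_a), len(seq_b)
    h = np.zeros((n + 1, m + 1))
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            h[i, j] = max(0.0,                     # local alignment: never below 0
                          h[i - 1, j - 1] + sub,   # match / substitution
                          h[i - 1, j] + indel,     # deletion
                          h[i, j - 1] + indel)     # insertion
            best = max(best, h[i, j])
    return best

# Example: local_alignment_score(["F", "Bb", "Eb", "Ab"], ["F", "F", "Bb", "Ab"])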
Absolute representation. One way of representing a chord sequence is to simply represent the chord progression as a sequence of absolute root notes; in that case prior knowledge of the key is not required. An absolute representation of the chord progression of the first 8 bars of the song All The Things You Are is then:

F, Bb, Eb, Ab, Db, D, G, C, C
In this case, the substitution scores may be determined by considering the difference in semitones, the number of steps on the circle of fifths between the roots, or the consonance of the interval between the roots, as described in [13]. For instance, the cost of substituting a C with a G (fifth) is lower than the substitution of a C with a D (second). Taking the mode into account in the representation can affect the score function as well: a substitution of a C for a Dm is different from a substitution of a C for a D, for example. If the two modes are identical, one may slightly increase the similarity score, and decrease it otherwise. Another possible representation of the chord progression is a sequence of absolute pitch sets. In that case one can use musical distances between chords, like Lerdahl's TPS model [17] or the distance introduced by Paiement et al. [22], as a score function for substitution.
Key-relative representation. If key information is known beforehand, a chord can be represented as a distance to this key. The distance can be expressed in various ways: in semitones, as the number of steps on the circle of fifths between the root of the chord and the tonic of the key of the song, or with more complex musical models, such as TPS. If in this case the key is Ab and the chords are represented by the difference in semitones, the representation of the chord progression of the first eight bars of All The Things You Are will be:
-3, 2, -5, 0, 5, 6, -1, 4, 4
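The semitone representation above can be reproduced with a few lines of Python; enharmonic spellings are reduced to flats here for brevity, and the function name is illustrative.

NOTE_TO_PC = {"C": 0, "Db": 1, "D": 2, "Eb": 3, "E": 4, "F": 5,
              "Gb": 6, "G": 7, "Ab": 8, "A": 9, "Bb": 10, "B": 11}

def key_relative(roots, key_root):
    """Signed semitone distance (in -6..+6) of each chord root to the key tonic."""
    tonic = NOTE_TO_PC[key_root]
    out = []
    for r in roots:
        d = (NOTE_TO_PC[r] - tonic) % 12
        out.append(d - 12 if d > 6 else d)
    return out

# key_relative(["F", "Bb", "Eb", "Ab", "Db", "D", "G", "C", "C"], "Ab")
# -> [-3, 2, -5, 0, 5, 6, -1, 4, 4]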
If all the notes of the chords are taken into account, the TPS or Paiement distances can be used between the chords and the triad of the key to construct the representation. The representation is then a sequence of distances, and we use an alignment between these distances instead of between the chords themselves. This representation is very similar to the representation used in the TPSD. The score functions used to compare the resulting sequences can then be binary, or linear in similarity with respect to the difference observed in the values.
Transposition invariance. In order to be robust to key changes, two identical chord progressions transposed to different keys have to be considered similar. The usual way to deal with this issue [27] is to choose a chord representation that is transposition invariant.


A first option is to represent the transitions between successive chords, but this has been proven to be less accurate when applied to alignment algorithms [13]. Another option is to consider a key-relative representation, like the one described above, which is by definition transposition invariant. However, this approach is not robust against local key changes. With an absolute representation of chords, we use an adaptation of the local alignment algorithm proposed in [1]. It allows taking into account an unlimited number of local transpositions and can be applied to representations of chord progressions to account for modulations.
Depending on the choice of the representation and the score function, several variants are possible for building a harmonic similarity algorithm. In Section 4 we explain the different representations and scoring functions used in the different tasks of the experiment and their effects on retrieval performance.
2.3 Other Methods for Harmonic Similarity

The third harmonic similarity measure using chord descriptions is a generative grammar approach [10]. The authors use a generative grammar of tonal harmony to parse the chord sequences, which results in parse trees that represent harmonic analyses of these sequences. Subsequently, a tree containing all the information shared by the parse trees of two compared songs is constructed, and several properties of this tree can be analyzed, yielding several similarity measures. Currently the parser can reject a sequence of chords as being ungrammatical.
Another interesting retrieval system based on harmonic similarity is the one developed by Pickens and Crawford [23]. Instead of describing a musical segment with one chord, they represent a musical segment as a vector describing the fit between the segment and every major and minor triad. This system then uses a Markov model to model the transition distributions between these vectors for every piece. Subsequently, these Markov models are ranked using the Kullback-Leibler divergence. It would be interesting to compare the performance of these systems to the algorithms tested here in the future.
Other interesting work has been done by Paiement et al. [22]. They define a similarity measure for chords rather than for chord sequences. Their similarity measure is based on the sum of the perceived strengths of the harmonics of the pitch classes in a chord, resulting in a vector of twelve pitch classes for each musical segment. Paiement et al. subsequently define the distance between two chords as the Euclidean distance between the two vectors that correspond to the chords. Next, they use a graphical model to model the hierarchical dependencies within a chord progression. In this model they use their chord similarity measure to calculate the substitution probabilities between chords.

3 A Chord Sequence Corpus

The chord sequence corpus used in the experiment consists of 5,028 unique human-generated Band-in-a-Box files collected from the Internet. Band-in-a-Box is a commercial software package [9] that is used to generate musical accompaniment based on a lead sheet.


Table 1. A lead sheet of the song All The Things You Are. A dot represents a beat, a vertical bar represents a bar line, and the chord labels are presented as written in the Band-in-a-Box file.

|Fm7 . . .    |Bbm7 . . .      |Eb7 . . .    |AbMaj7 . . . |
|DbMaj7 . . . |Dm7b5 . G7b9 .  |CMaj7 . . .  |CMaj7 . . .  |
|Cm7 . . .    |Fm7 . . .       |Bb7 . . .    |Eb7 . . .    |
|AbMaj7 . . . |Am7b5 . D7b9 .  |GMaj7 . . .  |GMaj7 . . .  |
|A7 . . .     |D7 . . .        |GMaj7 . . .  |GMaj7 . . .  |
|Gbm7 . . .   |B7 . . .        |EMaj7 . . .  |C+ . . .     |
|Fm7 . . .    |Bbm7 . . .      |Eb7 . . .    |AbMaj7 . . . |
|DbMaj7 . . . |Dbm7 . Gb7 .    |Cm7 . . .    |Bdim . . .   |
|Bbm7 . . .   |Eb7 . . .       |AbMaj7 . . . |. . . .      |

A Band-in-a-Box file stores a sequence of chords and a certain style, whereupon the program synthesizes and plays a MIDI-based accompaniment. A Band-in-a-Box file therefore contains a sequence of chords, a melody, a style description, a key description, and some information about the form of the piece, i.e. the number of repetitions, intro, outro, etc. For extracting the chord label information from the Band-in-a-Box files we extended software developed by Simon Dixon and Matthias Mauch [19].
Throughout this paper we have been referring to chord labels or chord descriptions. To rule out any possible vagueness, we adopt the following definition of a chord: a chord always consists of a root, a chord type and an optional inversion. The root note is the fundamental note upon which the chord is built, usually as a series of ascending thirds. The chord type (or quality) is the set of intervals relative to the root that make up the chord, and the inversion is defined as the degree of the chord that is played as the bass note. One of the most distinctive features of the chord type is its mode, which can be either major or minor.
Although a chord label always describes these three properties, root, chord type and inversion, musicians and researchers use different syntactical systems to describe them, and Band-in-a-Box also uses its own syntax to represent the chords. Harte et al. [14] give an in-depth overview of the problems related to representing chords and suggest an unambiguous syntax for chord labels. An example of a chord sequence as found in a Band-in-a-Box file, describing the chord sequence of All The Things You Are, is given in Table 1.
All songs of the chord sequence corpus were collected from various Internet sources. These songs were labeled and automatically checked for having a unique chord sequence. All chord sequences describe complete songs, and songs with fewer than 3 chords or shorter than 16 beats were removed from the corpus at an earlier stage. The titles of the songs, which function as ground truth, as well as the correctness of the key assignments, were checked and corrected manually. The style of the songs is mainly jazz, latin and pop.
Within the collection, 1775 songs have two or more similar versions, forming 691 classes of songs. Within a song class, songs have the same title and share a similar melody, but may differ in a number of ways. They may, for instance, differ in key and form, they may differ in the number of repetitions, or have a special introduction or ending.


Table 2. The distribution of the song class sizes in the chord sequence corpus

Class Size   Frequency   Percent
1            3,253       82.50
2            452         11.46
3            137          3.47
4            67           1.70
5            25            .63
6            7             .18
7            1             .03
8            1             .03
10           1             .03
Total        5,028       100

The richness of the chord descriptions may also diverge, i.e. a C7b9b13 may be written instead of a C7, and common substitutions frequently occur. Examples of the latter are relative substitution, i.e. Am instead of C, or tritone substitution, e.g. F#7 instead of C7. Having multiple chord sequences describing the same song allows for setting up a cover-song finding experiment. The title of the song is used as the ground truth and the retrieval challenge is to find the other chord sequences representing the same song.
The distribution of the song class sizes is displayed in Table 2 and gives an impression of the difficulty of the retrieval task. Generally, Table 2 shows that the song classes are relatively small and that for the majority of the queries there is only one relevant document to be found. It furthermore shows that 82.5% of the songs are in the corpus for distraction only. The chord sequence corpus is available to the research community on request.

Experiment: Comparing Retrieval Performance

We compared the TPSD and the CSAS in six retrieval tasks. For this experiment
we used the chord sequence corpus described above, which contains sequences
that clearly describe the same song. For each of these tasks the experimental
setup was identical: all songs that have two or more similar versions were used as
a query, yielding 1775 queries. For each query a ranking was created by sorting
the other songs on their TPSD and CSAS scores and these rankings and the
runtimes of the compared algorithms were analyzed.
4.1 Tasks

The tasks, summarized in Table 3, differed in the level of chord information used by the algorithms and in the usage of a priori global key information. In tasks 1-3 no key information was presented to the algorithms and in the remaining three tasks we used the key information, which was manually checked for correctness, as stored in the Band-in-a-Box files. The tasks 1-3 and 4-6 furthermore differed in the amount of chord detail that was presented to the algorithms: in tasks 1 and 4 only the root note of the chord was available to the algorithms, in tasks 2 and 5 the root and the triad were available, and in tasks 3 and 6 the complete chord as stored in the Band-in-a-Box file was presented to the algorithms.

Table 3. The TPSD and CSAS are compared in six different retrieval tasks

Task nr.   Chord structure    Key information
1          Roots              Key inferred
2          Roots + triad      Key inferred
3          Complete chord     Key inferred
4          Roots              Key as stored in the Band-in-a-Box file
5          Roots + triad      Key as stored in the Band-in-a-Box file
6          Complete chord     Key as stored in the Band-in-a-Box file
The different tasks required specific variants of the tested algorithms. For tasks 1-3 the TPSD used the TPS key finding algorithm as described in Section 2.1. For tasks 1 and 4, involving only chord roots, a simplified variant of the TPSD was used; for tasks 2, 3, 5 and 6 we used the regular TPSD, as described in Section 2.1 and [11].
To measure the impact of the chord representation and substitution functions on retrieval performance, different variants of the CSAS were built as well. In some cases the choices made did not yield the best possible results, but they allow the reader to understand the effects of the parameters used on retrieval performance. The CSAS algorithms in tasks 1-3 all used an absolute representation and the algorithms in tasks 4-6 used a key-relative representation. In tasks 4 and 5 the chords were represented as the difference in semitones to the root of the key of the piece, and in task 6 as Lerdahl's TPS distance between the chord and the triad of the key (as in the TPSD). The CSAS variants in tasks 1 and 2 used a consonance-based substitution function, and the algorithms in tasks 4-6 used a binary substitution function. In tasks 2 and 5 a binary substitution function for the mode was used as well: if the mode of the substituted chords matched, no penalty was given; if they did not match, a penalty was given.
A last parameter that was varied was the use of local transpositions. The
CSAS variants applied in tasks 1 and 3 did not consider local transpositions, but
the CSAS algorithm used in task 2 did allow local transpositions (see Section 2.2
for details).
The TPSD was implemented in Java and the CSAS was implemented in C++, but a small Java program was used to parallelize the matching process. All runs were done on an Intel Xeon quad-core CPU at a frequency of 1.86 GHz and 4 GB of RAM running 32-bit Linux. Both algorithms were parallelized to optimally use the multiple cores of the CPUs.
4.2 Results

For each task and for each algorithm we analyzed the rankings of all 1775 queries
with 11-point precision recall curves and Mean Average Precision (MAP). Figure 2
displays the interpolated average precision and recall chart for the TPSD and
the CSAS for all tasks listed in Table 3.

Fig. 2. The 11-point interpolated precision and recall charts (interpolated average precision vs. recall) for the TPSD and the CSAS for tasks 1-3 (key inferred), on the left, and 4-6 (key relative), on the right. Curves are shown for roots only (tasks 1 and 4), roots + triad (tasks 2 and 5) and complete chords (tasks 3 and 6).

We calculated the interpolated average precision as in [18] and probed it at 11 different recall levels. In all evaluations


the query was excluded from the analyzed rankings. In tasks 2 and 4-6 the CSAS outperforms the TPSD, and in tasks 1 and 3 the TPSD outperforms the CSAS. The curves all have a very similar shape; this is probably due to the specific sizes of the song classes and the fairly limited number of large song classes (see Table 2).
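The following sketch shows one common way to compute these two measures for a single ranked result list; it assumes a list of 0/1 relevance judgements per rank and is meant as an illustration of the evaluation, not as the authors' evaluation code.

def average_precision(relevance):
    """Uninterpolated average precision for one fully ranked list of 0/1 relevance flags."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def interpolated_precision_at(relevance, levels=tuple(i / 10 for i in range(11))):
    """Interpolated precision at the given recall levels (max precision at recall >= level)."""
    total_relevant = sum(relevance) or 1
    hits, points = 0, []
    for rank, rel in enumerate(relevance, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / rank))      # (recall, precision)
    return [max((p for r, p in points if r >= level), default=0.0) for level in levels]

# Hypothetical usage: the mean of average_precision over all 1775 queries gives the MAP.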
In Figure 3 we present the MAP and the runtimes of the algorithms on two different axes. The MAP is displayed on the left axis and the runtimes are shown on the right axis, which has an exponential scale doubling the amount of time at every tick. The MAP is a single-figure measure, which measures the precision at all recall levels and approximates the area under the (uninterpolated) precision recall graph [18]. Having a single measure of retrieval quality makes it easier to evaluate the significance of the differences between results. We tested whether there were significant differences in MAP by performing a non-parametric Friedman test, with a significance level of α = .05. We chose the Friedman test because the underlying distribution of the data is unknown and, in contrast to an ANOVA, the Friedman test does not assume a specific distribution of variance. There were significant differences between the 12 runs², χ²(11, N = 1775) = 2,6182, p < .0001. To determine which of the pairs of measurements differed significantly we conducted a post hoc Tukey HSD test³. Unlike a regular T-test, the Tukey HSD test can be safely used for comparing multiple means [7]. A summary of the analyzed confidence intervals is given in Table 4. Significant and non-significant differences are denoted with +'s and −'s, respectively.
² The 12 different runs introduce 11 degrees of freedom and 1775 individual queries were examined per run.
³ All statistical tests were performed in Matlab 2009a.
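As an illustration of this kind of significance testing (assuming per-query average precision values are available for every run; this is not the authors' Matlab script), the Friedman test can be run with SciPy:

import numpy as np
from scipy.stats import friedmanchisquare

# ap_per_run: hypothetical array of shape (num_queries, num_runs), here 1775 x 12,
# holding the average precision of every query under every run.
ap_per_run = np.random.rand(1775, 12)   # placeholder data for this sketch

statistic, p_value = friedmanchisquare(
    *[ap_per_run[:, run] for run in range(ap_per_run.shape[1])])
print('chi^2 = %.1f, p = %.4g' % (statistic, p_value))
# A post hoc comparison of individual run pairs can then be made, e.g. with a
# Tukey HSD procedure such as statsmodels' pairwise_tukeyhsd.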

Fig. 3. The MAP and runtimes of the TPSD and the CSAS. The MAP is displayed on the left axis and the runtimes (hours:minutes) are displayed on an exponential scale on the right axis. On the left side of the chart the key-inferred tasks are displayed and the key-relative tasks are displayed on the right side.

The overall retrieval performance of all algorithms on all tasks can be considered good, but there are some large differences between tasks and between algorithms, both in performance and in runtime. With a MAP of .70 the overall best performing setup was the CSAS using triadic chord descriptions and a key-relative representation (task 5). The TPSD also performs best on task 5, with a MAP of .58. In tasks 2 and 4-6 the CSAS significantly outperforms the TPSD. On tasks 1 and 3 the TPSD outperforms the CSAS in runtime as well as performance. For these two tasks, the results obtained by the CSAS are significantly lower because local transpositions are not considered. These results show that taking transpositions into account has a high impact on the quality of the retrieval system, but also on the runtime.

The retrieval performance of the CSAS is good, but comes at a price. On average over six of the twelve runs, the CSAS runs need about 136 times as much time to complete as the TPSD. The TPSD takes about 30 minutes to 1.5 hours to match all 5,028 pieces, while the CSAS takes about 2 to 9 days. Because the CSAS run in task 2 takes 206 hours to complete, there was not enough time to perform runs on tasks 1 and 3 with the CSAS variant that takes local transpositions into account.

Table 4. This table shows for each pair of runs whether the mean average precision, as displayed in Figure 3, differed significantly (+) or not (−). The rows and columns cover all twelve runs: the TPSD and the CSAS on tasks 1-3 (key inferred) and on tasks 4-6 (key information available).

In task 6 both algorithms represent the chord sequences as TPS distances to the triad of the key. Nevertheless, the TPSD is outperformed by the CSAS. This difference, as well as other differences in performance, might well be explained by the insertion and deletion operations in the CSAS algorithm: if one takes two identical pieces and inserts one arbitrary extra chord somewhere in the middle of the piece, an asynchrony is created between the two step functions which has a large effect on the estimated TPSD distance, while the CSAS distance only gains one extra deletion score.
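A toy numeric sketch of this effect (purely illustrative, with made-up per-beat distance values and unit beat durations) compares an area-based step-function distance with an alignment that only pays a single deletion:

def step_function_distance(a, b):
    """Area between two step functions sampled beat by beat (shorter one is zero-padded)."""
    length = max(len(a), len(b))
    a = a + [0] * (length - len(a))
    b = b + [0] * (length - len(b))
    return sum(abs(x - y) for x, y in zip(a, b))

def alignment_distance(a, b, deletion_cost=1):
    """Minimal edit cost with unit insertion/deletion and |x - y| as substitution cost."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = i * deletion_cost
    for j in range(1, len(b) + 1):
        d[0][j] = j * deletion_cost
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + deletion_cost,
                          d[i][j - 1] + deletion_cost,
                          d[i - 1][j - 1] + abs(a[i - 1] - b[j - 1]))
    return d[-1][-1]

original = [0, 5, 7, 5, 0, 5, 7, 5]           # made-up per-beat scores of a piece
variant = original[:4] + [3] + original[4:]   # the same piece with one extra chord inserted

print(step_function_distance(original, variant))  # large: everything after beat 4 is shifted
print(alignment_distance(original, variant))      # small: one deletion is enough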
For the CSAS algorithm we did a few additional runs that are not reported here. These runs showed that the difference in retrieval performance using different substitution costs (binary, consonance or semitones) is limited.
The runs in which a priori key information was available performed better, regardless of the task or algorithm (compare tasks 1 and 4, 2 and 5, and 3 and 6 for both algorithms in Table 4). This was to be expected because there are always errors in the key finding, which hamper the retrieval performance.
The amount of detail in the chord description has a significant effect on the retrieval performance of all algorithms. In almost all cases, using only the triadic chord description for retrieval yields better results than using only the root or the complex chord descriptions. Only the difference in CSAS performance between using complex chords or triads is not significant in tasks 5 and 6. The differences between using only the root or using the complete chord are smaller and not always significant.
Thus, although colorful additions to chords may sound pleasant to the human ear, they are not always beneficial for determining the similarity between the harmonic progressions they represent. There might be a simple explanation for these differences in performance. Using only the root of a chord already leads to good retrieval results, but by removing information about the mode one loses information that can aid in boosting the retrieval performance.

On the other hand, keeping all rich chord information seems to distract the evaluated retrieval systems. Pruning the chord structure down to the triad might be seen as a form of syntactical noise reduction, since the chord additions, if they do not have a voice-leading function, have a rather arbitrary character and just add some harmonic spice.

Concluding Remarks

We performed a comparison of two different chord sequence similarity measures, the Tonal Pitch Space Distance (TPSD) and the Chord Sequence Alignment System (CSAS), on a large newly assembled corpus of 5,028 symbolic chord sequences. The comparison consisted of six different tasks, in which we varied the amount of detail in the chord description and the availability of a priori key information. The CSAS variants outperform the TPSD significantly in most cases, but are in all cases far more costly to use. The use of a priori key information improves performance, and using only the triad of a chord for similarity matching gives the best results for the tested algorithms. Nevertheless, we can positively answer the third question that we asked ourselves in the introduction (do chord descriptions provide a useful and valid abstraction?), because the experiment presented in the previous section clearly shows that chord descriptions can be used for retrieving harmonically related pieces.
The retrieval performance of both algorithms is good, especially if one considers the size of the corpus and the relatively small class sizes (see Table 2), but there is still room for improvement. Neither algorithm can deal with large structural changes, e.g. adding repetitions, a bridge, etc. A prior analysis of the structure of the piece combined with partial matching could improve the retrieval performance. Another important issue is that the compared systems treat all chords as equally important. This is musicologically not plausible. Considering the musical function of a chord in the local as well as the global structure of the chord progression, as is done in [10] or with sequences of notes in [24], might further improve the retrieval results.
With runtimes that are measured in days, the CSAS is a costly system. The runtimes might be improved by using GPU programming [8], or with filtering steps using algorithms such as BLAST [3].
The harmonic retrieval systems and experiments presented in this paper consider a specific form of symbolic music representation only. Nevertheless, the application of the presented methods is not limited to symbolic music, and audio applications are currently being investigated. Especially the recent developments in chord label extraction are very promising, because the output of these methods could be matched directly with the systems presented here. The good performance of the proposed algorithms leads us to believe that, both in the audio and the symbolic domain, retrieval systems will benefit from chord-sequence-based matching in the near future.

References
1. Allali, J., Ferraro, P., Hanna, P., Iliopoulos, C.S.: Local transpositions in alignment of polyphonic musical sequences. In: Ziviani, N., Baeza-Yates, R. (eds.) SPIRE 2007. LNCS, vol. 4726, pp. 26-38. Springer, Heidelberg (2007)
2. Aloupis, G., Fevens, T., Langerman, S., Matsui, T., Mesa, A., Núñez, Y., Rappaport, D., Toussaint, G.: Algorithms for Computing Geometric Measures of Melodic Similarity. Computer Music Journal 30(3), 67-76 (2004)
3. Altschul, S., Gish, W., Miller, W., Myers, E., Lipman, D.: Basic Local Alignment Search Tool. Journal of Molecular Biology 215, 403-410 (1990)
4. Arkin, E., Chew, L., Huttenlocher, D., Kedem, K., Mitchell, J.: An Efficiently Computable Metric for Comparing Polygonal Shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(3), 209-216 (1991)
5. Bello, J., Pickens, J.: A Robust Mid-Level Representation for Harmonic Content in Music Signals. In: Proceedings of the International Symposium on Music Information Retrieval, pp. 304-311 (2005)
6. Cormen, T., Leiserson, C., Rivest, R., Stein, C.: Introduction to Algorithms. MIT Press, Cambridge (2001)
7. Downie, J.S.: The Music Information Retrieval Evaluation Exchange (2005-2007): A Window into Music Information Retrieval Research. Acoustical Science and Technology 29(4), 247-255 (2008)
8. Ferraro, P., Hanna, P., Imbert, L., Izard, T.: Accelerating Query-by-Humming on GPU. In: Proceedings of the Tenth International Society for Music Information Retrieval Conference (ISMIR), pp. 279-284 (2009)
9. Gannon, P.: Band-in-a-Box. PG Music (1990), http://www.pgmusic.com/ (last viewed February 2011)
10. de Haas, W.B., Rohrmeier, M., Veltkamp, R.C., Wiering, F.: Modeling Harmonic Similarity Using a Generative Grammar of Tonal Harmony. In: Proceedings of the Tenth International Society for Music Information Retrieval Conference (ISMIR), pp. 549-554 (2009)
11. de Haas, W.B., Veltkamp, R.C., Wiering, F.: Tonal Pitch Step Distance: A Similarity Measure for Chord Progressions. In: Proceedings of the Ninth International Society for Music Information Retrieval Conference (ISMIR), pp. 51-56 (2008)
12. Hanna, P., Robine, M., Rocher, T.: An Alignment Based System for Chord Sequence Retrieval. In: Proceedings of the 2009 Joint International Conference on Digital Libraries, pp. 101-104. ACM, New York (2009)
13. Hanna, P., Ferraro, P., Robine, M.: On Optimizing the Editing Algorithms for Evaluating Similarity between Monophonic Musical Sequences. Journal of New Music Research 36(4), 267-279 (2007)
14. Harte, C., Sandler, M., Abdallah, S., Gómez, E.: Symbolic Representation of Musical Chords: A Proposed Syntax for Text Annotations. In: Proceedings of the Sixth International Society for Music Information Retrieval Conference (ISMIR), pp. 66-71 (2005)
15. van Kranenburg, P., Volk, A., Wiering, F., Veltkamp, R.C.: Musical Models for Folk-Song Melody Alignment. In: Proceedings of the Tenth International Society for Music Information Retrieval Conference (ISMIR), pp. 507-512 (2009)
16. Krumhansl, C.: Cognitive Foundations of Musical Pitch. Oxford University Press, USA (2001)
17. Lerdahl, F.: Tonal Pitch Space. Oxford University Press, Oxford (2001)
18. Manning, C., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)
19. Mauch, M., Dixon, S., Harte, C., Casey, M., Fields, B.: Discovering Chord Idioms through Beatles and Real Book Songs. In: Proceedings of the Eighth International Society for Music Information Retrieval Conference (ISMIR), pp. 255-258 (2007)
20. Mauch, M., Noland, K., Dixon, S.: Using Musical Structure to Enhance Automatic Chord Transcription. In: Proceedings of the Tenth International Society for Music Information Retrieval Conference (ISMIR), pp. 231-236 (2009)
21. Mongeau, M., Sankoff, D.: Comparison of Musical Sequences. Computers and the Humanities 24(3), 161-175 (1990)
22. Paiement, J.F., Eck, D., Bengio, S.: A Probabilistic Model for Chord Progressions. In: Proceedings of the Sixth International Conference on Music Information Retrieval (ISMIR), London, UK, pp. 312-319 (2005)
23. Pickens, J., Crawford, T.: Harmonic Models for Polyphonic Music Retrieval. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management, pp. 430-437. ACM, New York (2002)
24. Robine, M., Hanna, P., Ferraro, P.: Music Similarity: Improvements of Edit-based Algorithms by Considering Music Theory. In: Proceedings of the ACM SIGMM International Workshop on Multimedia Information Retrieval (MIR), Augsburg, Germany, pp. 135-141 (2007)
25. Smith, T., Waterman, M.: Identification of Common Molecular Subsequences. Journal of Molecular Biology 147, 195-197 (1981)
26. Temperley, D.: The Cognition of Basic Musical Structures. MIT Press, Cambridge (2001)
27. Uitdenbogerd, A.L.: Music Information Retrieval Technology. Ph.D. thesis, RMIT University, Melbourne, Australia (July 2002)

Songs2See and GlobalMusic2One:
Two Applied Research Projects in Music
Information Retrieval at Fraunhofer IDMT

Christian Dittmar, Holger Großmann, Estefanía Cano, Sascha Grollmisch,
Hanna Lukashevich, and Jakob Abeßer

Fraunhofer IDMT
Ehrenbergstr. 31, 98693 Ilmenau, Germany
{dmr,grn,cano,goh,lkh,abr}@idmt.fraunhofer.de
http://www.idmt.fraunhofer.de

Abstract. At the Fraunhofer Institute for Digital Media Technology


(IDMT) in Ilmenau, Germany, two current research projects are directed
towards core problems of Music Information Retrieval. The Songs2See
project is supported by the Thuringian Ministry of Economy, Employment and Technology through granting funds of the European Fund for
Regional Development. The target outcome of this project is a web-based
application that assists music students with their instrumental exercises.
The unique advantage over existing e-learning solutions is the opportunity to create personalized exercise content using the favorite songs of
the music student. GlobalMusic2one is a research project supported by
the German Ministry of Education and Research. It is set out to develop
a new generation of hybrid music search and recommendation engines.
The target outcomes are novel adaptive methods of Music Information
Retrieval in combination with Web 2.0 technologies for better quality
in the automated recommendation and online marketing of world music
collections.
Keywords: music information retrieval, automatic music transcription,
music source separation, automatic music annotation, music similarity
search, music education games.

Introduction

Successful exploitation of results from basic research is the indicator for the practical relevance of a research field. During recent years, the scientific and commercial interest in the comparatively young research discipline called Music Information Retrieval (MIR) has grown considerably. Stimulated by the ever-growing availability and size of digital music catalogs and mobile media players, MIR techniques become increasingly important to aid convenient exploration of large music collections (e.g., through recommendation engines) and to enable entirely new forms of music consumption (e.g., through music games). Evidently, commercial entities like online music shops, record labels and content aggregators



have realized that these aspects can make them stand out among their competitors and foster customer loyalty. However, the industry's willingness to fund basic research in MIR is comparatively low. Thus, only well-described methods have found successful application in the real world. For music recommendation and retrieval, these are doubtlessly services based on collaborative filtering¹ (CF). For music transcription and interaction, these are successful video game titles using monophonic pitch detection². The two research projects in the scope of this paper provide the opportunity to progress in core areas of MIR, but always with a clear focus on suitability for real-world applications.
This paper is organized as follows. Each of the two projects is described in
more detail in Sec. 2 and Sec. 3. Results from the research as well as the development perspective are reported. Finally, conclusions are given and future
directions are sketched.

Songs2See

Musical education of children and adolescents is an important factor in their personal self-development, regardless of whether it is about learning a musical instrument or about music courses at school. Children, adolescents and adults must be constantly motivated to practice and complete learning units. Traditional forms of teaching and even current e-learning systems are often unable to provide this motivation. On the other hand, music-based games are immensely popular [7], [17], but they fail to develop skills which are transferable to musical instruments [19]. Songs2See is set out to develop educational software for music learning which provides the motivation of game playing and at the same time develops real musical skills. Using music signal analysis as the key technology, we want to enable students to use popular musical instruments as game controllers for games which teach the students to play music of their own choice. This should be possible regardless of the representation of music they possess (audio, score, tab, chords, etc.). As a reward, the users receive immediate feedback from the automated analysis of their rendition. The game application will provide the students with visual and audio feedback regarding fine-grained details of their performance with regard to timing (rhythm), intonation (pitch, vibrato), and articulation (dynamics). Central to the analysis is automatic music transcription, i.e., the extraction of a scalable symbolic representation from real-world music recordings using specialized computer algorithms [24], [11]. Such a symbolic representation allows to render simultaneous visual and audible playbacks for the students, i.e., it can be translated to traditional notation, a piano-roll view or a dynamic animation showing the fingering on the actual instrument. The biggest additional advantage is the possibility to let the students have their favorite song transcribed into a symbolic representation by the software. Thus, the students can play along to actual music they like, instead of specifically produced and edited learning pieces. In order to broaden the possibilities when creating
¹ See for example http://last.fm
² See for example http://www.singstargame.com/


exercises, state-of-the-art methods for audio source separation are exploited. The application of source separation techniques allows to attenuate accompanying instruments that obscure the instrument of interest or, alternatively, to cancel out the original instrument in order to create a play-along backing track [9]. It should be noted that we are not striving for hi-fi audio quality with regard to the separation. It is more important that the students can use the above described functionality to their advantage when practicing, a scenario in which a certain amount of audible artifacts is acceptable without being disturbing for the user.
2.1 Research Results

A glimpse of the main research directions shall be given here as an update to the overview given in [12]. Further details about the different components of the system are to be published in [17]. Past research activities allowed us to use well-proven methods for timbre-based music segmentation [26], key detection [6] and beat grid extraction [40] out of the box. In addition, existing methods for automatic music transcription and audio source separation are advanced in the project.
Automatic music transcription. We were able to integrate already available transcription algorithms that allow us to transcribe the drums, bass, main melody and chords in real-world music segments [11], [33]. In addition, we enable a manual error correction by the user that is helpful when dealing with difficult signals (high degree of polyphony, overlapping percussive instruments, etc.). With regard to the project goals and requirements of potential users it became clear that it is necessary to also transcribe monotimbral, polyphonic instruments like the piano and the guitar. Therefore, we conducted a review of the most promising methods for multi-pitch transcription. These comprise the iterative pitch-salience estimation in [23], the maximum likelihood approach in [13] and a combination of the Specmurt principle [34] with shift-invariant non-negative factorization [35]. Results show that it is necessary to combine any of the aforementioned multi-pitch estimators with a chromagram-based [37] pre-processing to spare computation time. That is especially true for real-time polyphonic pitch detection. For the monophonic real-time case, we could successfully exploit the method pointed out in [18].
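As a rough illustration of such a chromagram-based pre-processing step (a generic sketch with assumed parameters, not the project's actual implementation), the bins of a magnitude spectrogram can be folded onto the twelve pitch classes, and frames with little energy in a pitch class can then be skipped by the more expensive multi-pitch estimator:

import numpy as np

def chromagram(magnitude_spectrogram, sample_rate, fft_size, tuning_hz=440.0):
    """Fold an FFT magnitude spectrogram (bins x frames) onto 12 pitch classes."""
    freqs = np.arange(magnitude_spectrogram.shape[0]) * sample_rate / fft_size
    chroma = np.zeros((12, magnitude_spectrogram.shape[1]))
    for bin_idx, f in enumerate(freqs):
        if f < 27.5:                       # ignore DC and sub-audio bins
            continue
        midi = 69 + 12 * np.log2(f / tuning_hz)
        chroma[int(round(midi)) % 12] += magnitude_spectrogram[bin_idx]
    return chroma / (chroma.max() + 1e-9)  # normalize for thresholding

# Hypothetical gating: only pitch classes whose chroma energy exceeds a threshold
# are handed to the multi-pitch estimator for that frame.
# active = chromagram(S, 44100, 4096) > 0.1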
Audio source separation. We focused on assessing different algorithms for audio source separation that have been reported in the literature. A thorough review, implementation and evaluation of the methods described in [14] was conducted. Inspection of the achievable results led to the conclusion that sound separation based on tensor factorization is powerful but at the moment computationally too demanding to be applied in our project. Instead, we focused on investigating the more straightforward method described in [31]. This approach separates polyphonic music into percussive and harmonic parts via spectrogram diffusion. It has received much interest in the MIR community, presumably for its simplicity. An alternative approach for percussive vs. harmonic separation has been published in [10]. In that paper, further use of phase information in sound separation problems has been proposed. In all cases, phase information is complementary to the use of magnitude information. Phase contours of musical instruments exhibit similar micromodulations in frequency for certain instruments and can be an alternative to spectral instrument templates or instrument models. For the case of overlapped harmonics, phase coupling properties can be exploited. Although a multitude of further source separation algorithms has been proposed in the literature, only a few of them make use of user interaction. In [36], a promising concept for using an approximate user input for extracting sound sources is described. Using a combination of the separation methods described in [9] and [20] in conjunction with user-approved note transcriptions, we are able to separate melodic instruments from the background accompaniment in good quality. In addition, we exploit the principle described in [41] in order to allow the user to focus on certain instrument tracks that are well localized within the stereo panorama.
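For illustration, a very small harmonic/percussive split in a similar spirit can be written with median filtering of the magnitude spectrogram; note that this is the median-filtering variant commonly used in the MIR literature, not the complementary diffusion method of [31] itself:

import numpy as np
from scipy.ndimage import median_filter

def harmonic_percussive_masks(magnitude, kernel=17):
    """Soft masks from horizontal (harmonic) and vertical (percussive) median filters."""
    harmonic = median_filter(magnitude, size=(1, kernel))    # smooth along time
    percussive = median_filter(magnitude, size=(kernel, 1))  # smooth along frequency
    total = harmonic + percussive + 1e-9
    return harmonic / total, percussive / total

# Hypothetical usage on an STFT magnitude S (frequency bins x time frames):
# h_mask, p_mask = harmonic_percussive_masks(S)
# harmonic_part, percussive_part = h_mask * S, p_mask * S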

Fig. 1. Screenshot of Songs2See web application

2.2 Development Results

In order to have tangible results early on, the software development started in parallel with the research tasks. Therefore, interfaces have been kept as generic as possible in order to enable adaptation to alternative or extended core algorithms later on.
Songs2See web application. In order to achieve easy accessibility of the exercise application, we decided to go for a web-based approach using Flex³. Originally, the Flash application built with this tool did not allow direct processing of the microphone input. Thus, we had to implement a streaming server solution. We used the open-source Red5⁴ in conjunction with the transcoder library Xuggler⁵. This way, we conducted real-time pitch detection [11] in the server application and returned the detected pitches to the web interface. Further details about the implementation are to be published in [18]. Only in their latest release in July 2010, i.e., Adobe Flash Player 10.1, has Flash incorporated the possibility of handling audio streams from a microphone input directly on the client side. A screenshot of the prototype interface is shown in Fig. 1. It can be seen that the user interface shows further assistance than just the plain score sheet. The fingering on the respective instrument of the student is shown as an animation. The relative tones are displayed as well as the relative position of the pitch produced by the player's instrument. This principle is well known from music games and has been adapted here for educational purposes. Further helpful functions, such as transpose, tempo change and stylistic modification, will be implemented in the future.
Songs2See editor. For the creation of music exercises, we developed an application with the working title Songs2See Editor. It is a stand-alone graphical user interface based on Qt⁶ that allows the average or expert user the creation of musical exercises. The editor already allows to go through the prototypical workflow. During import of a song, timbre segmentation is conducted and the beat grid and key candidates per segment are estimated. The user can choose the segment of interest, start the automatic transcription or use the source separation functionality. For immediate visual and audible feedback of the results, a piano-roll editor is combined with a simple synthesizer as well as sound separation controls. Thus, the user can grab any notes he suspects to be erroneously transcribed, and move or delete them. In addition, the user is able to seamlessly mix the ratio between the separated melody instrument and the background accompaniment. In Fig. 2, the interface can be seen. We expect that the users of the editor will creatively combine the different processing methods in order to analyze and manipulate the audio tracks to their liking. In the current stage of development, export of MIDI and MusicXML is already possible. In a later stage, support for other popular formats, such as TuxGuitar, will be implemented.
³ See http://www.adobe.com/products/flex/
⁴ See http://osflash.org/red5
⁵ See http://www.xuggle.com/
⁶ See http://qt.nokia.com/products


Fig. 2. Screenshot of Songs2See editor prototype

GlobalMusic2One

GlobalMusic2one is developing a new generation of adaptive music search engines combining state-of-the-art methods of MIR with Web 2.0 technologies. It aims at reaching better quality in automated music recommendation and browsing inside global music collections. Recently, there has been a growing research interest in music outside the mainstream popular music from the so-called western culture group [39], [16]. For well-known mainstream music, large amounts of user-generated browsing traces, reviews, play-lists and recommendations available in different online communities can be analyzed through CF methods in order to reveal similarities between artists, songs and albums. For novel or niche content, one obvious solution to derive such data is content-based similarity search. Since the early days of MIR, the search for music items related to a specific query song or a set of those (Query by Example) has been a consistent focus of scientific interest. Thus, a multitude of different approaches with varying degrees of complexity has been proposed [32]. Another challenge is the automatic annotation (a.k.a. auto-tagging [8]) of world music content. It is obvious that the broad term World Music is one of the most ill-defined tags when being used to lump all exotic genres together. It lacks justification because this category comprises such a huge variety of different regional styles, influences, and a mutual mix-up thereof. On the one hand, retaining the strict classification paradigm for such a high variety of musical styles inevitably limits the precision and expressiveness of a classification system that shall be applied to a world-wide genre taxonomy. With GlobalMusic2One, the user may create new categories, allowing the system to flexibly adapt to new musical forms of expression and regional contexts. These


categories can, for example, be regional sub-genres which are defined through exemplary songs or song snippets. This self-learning MIR framework will be continuously expanded with precise content-based descriptors.
3.1 Research Results

With automatic annotation of world music content, songs often cannot be assigned to one single genre label. Instead, various rhythmic, melodic and harmonic influences conflate into multi-layered mixtures. Common classifier approaches fail due to their immanent assumption that for all song segments one dominant genre exists and thus is retrievable.
Multi-domain labeling. To overcome these problems, we introduced the multi-domain labeling approach [28] that breaks down multi-label annotations towards single-label annotations within different musical domains, namely timbre, rhythm, and tonality. In addition, a separate annotation of each temporal segment of the overall song is enabled. This leads to a more meaningful and realistic two-dimensional description of multi-layered musical content. Related to that topic, classification of singing vs. rapping in urban music has been described in [15]. In another paper [27] we applied the recently proposed Multiple Kernel Learning (MKL) technique that has been successfully used for real-world applications in the fields of computational biology, image information retrieval, etc. In contrast to classic Support Vector Machines (SVM), MKL provides a possibility of weighting over different kernels depending on a feature set.
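A minimal sketch of the multi-domain idea, assuming pre-computed per-domain feature matrices and single genre labels per domain (the domain names follow the text above, everything else is illustrative and not the project's actual code):

from sklearn.svm import SVC

def train_domain_classifiers(features_by_domain, labels_by_domain):
    """Train one single-label classifier per musical domain (timbre, rhythm, tonality)."""
    models = {}
    for domain in ('timbre', 'rhythm', 'tonality'):
        model = SVC(kernel='rbf', C=1.0)
        model.fit(features_by_domain[domain], labels_by_domain[domain])
        models[domain] = model
    return models

def label_segment(models, segment_features):
    """Return one label per domain for a single song segment."""
    return {domain: model.predict([segment_features[domain]])[0]
            for domain, model in models.items()}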
Clustering with constraints. Inspired by the work in [38], we investigated clustering with constraints with application to active exploration of music collections. Constrained clustering has been developed to improve clustering methods through pairwise constraints. Although these constraints are received as queries from a noiseless oracle, most of the methods involve a random procedure stage to decide which elements are presented to the oracle. In [29] we applied spectral clustering with constraints to a music dataset, where the queries for constraints were selected in a deterministic way from an outlier identification perspective. We simulated the constraints through the ground-truth music genre labels. The results showed that constrained clustering with the deterministic outlier identification method achieved reasonable and stable results with an increasing number of constraint queries. Although the constraints were enhancing the similarity relations between the items, the clustering was conducted in the static feature space. In [30] we embedded the information about the constraints into a feature selection procedure that adapted the feature space regarding the constraints. We proposed two methods for the constrained feature selection: similarity-based and constraint-based. We applied the constrained clustering with embedded feature selection to the active exploration of music collections. Our experiments showed that the proposed feature selection methods improved the results of the constrained clustering.
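As a rough sketch of constrained spectral clustering (one simple way of injecting pairwise constraints into a precomputed affinity matrix; a generic illustration, not the method of [29] or [30]):

import numpy as np
from sklearn.cluster import SpectralClustering

def constrained_spectral_clustering(affinity, must_link, cannot_link, n_clusters):
    """Bias a symmetric affinity matrix with pairwise constraints, then cluster."""
    a = affinity.copy()
    for i, j in must_link:                 # force maximal similarity
        a[i, j] = a[j, i] = 1.0
    for i, j in cannot_link:               # force minimal similarity
        a[i, j] = a[j, i] = 0.0
    model = SpectralClustering(n_clusters=n_clusters, affinity='precomputed')
    return model.fit_predict(a)

# Hypothetical usage with a similarity matrix S of a small music collection:
# labels = constrained_spectral_clustering(S, [(0, 1)], [(0, 5)], n_clusters=4)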


Rule-based classification. The second important research direction was rule-based classification with high-level features. In general, high-level features can again be categorized according to different musical domains like rhythm, harmony, melody or instrumentation. In contrast to low-level and mid-level audio features, they are designed with respect to music theory and are thus interpretable by human observers. Often, high-level features are derived from automatic music transcription or classification into semantic categories. Different approaches for the extraction of rhythm-related high-level features have been reported in [21], [25] and [1]. Although automatic extraction of high-level features is still quite error-prone, we proved in [4] that they can be used in a rule-based classification scheme with a quality comparable to state-of-the-art pattern recognition using SVM. The concept of rule-based classification was inspected in detail in [3] using a fine-granular manual annotation of high-level features referring to rhythm, instrumentation, etc. In this paper, we tested rule-based classification on a restricted dataset of 24 manually annotated audio tracks and achieved an accuracy rate of over 80%.
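To make the idea of rule-based classification on high-level features concrete, here is a deliberately tiny, hypothetical rule set; the feature names, thresholds and classes are invented for illustration and are not the rules used in [3] or [4]:

def classify_by_rules(track):
    """Map a dict of high-level features to a genre label via hand-written rules."""
    # track example: {'tempo_bpm': 95, 'swing_ratio': 0.65,
    #                 'percussion_density': 0.4, 'dominant_bass_style': 'walking'}
    if track['dominant_bass_style'] == 'walking' and track['swing_ratio'] > 0.6:
        return 'jazz'
    if track['percussion_density'] > 0.7 and 90 <= track['tempo_bpm'] <= 110:
        return 'latin'
    return 'pop'                      # fallback class when no rule fires

print(classify_by_rules({'tempo_bpm': 95, 'swing_ratio': 0.65,
                         'percussion_density': 0.4, 'dominant_bass_style': 'walking'}))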
Novel audio features. Adding the fourth domain, instrumentation, to the multi-domain approach described in Sec. 3.1 required the design and implementation of novel audio features tailored towards instrument recognition in polyphonic recordings. Promising results even with instruments from non-European cultural areas are reported in [22]. In addition, we investigated the automatic classification of rhythmic patterns in global music styles in [42]. In this work, special measures have been taken to make the features and distance measures tempo independent. This is done implicitly, without the need for a preceding beat grid extraction that is commonly recommended in the literature to derive beat-synchronous feature vectors. In conjunction with the approach to rule-based classification described in Sec. 3.1, novel features for the classification of bass-playing styles have been published in [5] and [2]. In this paper, we compared an approach based on high-level features and another one based on similarity measures between bass patterns. For both approaches, we assessed two different strategies: classification of patterns as a whole, and classification of all measures of a pattern with a subsequent accumulation of the classification results. Furthermore, we investigated the influence of potential transcription errors on the classification accuracy. Given a taxonomy consisting of 8 different bass playing styles, best classification accuracy values of 60.8% were achieved for the feature-based classification and 68.5% for the pattern similarity approach.
3.2 Development Results

As with the Songs2See project, the development phase in GlobalMusic2One


started in parallel with the research activities and is now near its completion.
It was mandatory to have early prototypes of the required software for certain
tasks in the project.


Annotation Tool. We developed a Qt-based Annotation Tool that facilitates the gathering of conceptualized annotations for any kind of audio content. The program was designed for expert users to enable them to manually describe audio files efficiently on a very detailed level. The Annotation Tool can be configured to different and extensible annotation schemes, making it flexible for multiple application fields. The tool supports single labeling and multi-labeling, as well as the new approach of multi-domain labeling. However, strong labeling is not enforced by the tool, but remains under the control of the user. As a unique feature, the audio annotation tool comes with an automated, timbre-based audio segmentation algorithm integrated that helps the user to intuitively navigate through the audio file during the annotation process and select the right segments. Of course, the granularity of the segmentation can be adjusted and every single segment border can be manually corrected if necessary. There are now approx. 100 observables that can be chosen from while annotating. This list is under steady development. In Fig. 3, a screenshot of the Annotation Tool configured to the Globalmusic2one description scheme can be seen.

Fig. 3. Screenshot of the Annotation Tool configured to the Globalmusic2one description scheme, showing the annotation of a stylistically and structurally complex song

PAMIR framework. The Personalized Adaptive MIR framework (PAMIR) is a server system written in Python that loosely strings together various functional blocks. These blocks are, e.g., a relational database, a server for computing content-based similarities between music items and various machine learning servers. PAMIR is also the instance that enables the adaptivity with respect to the user preferences. Basically, it allows to conduct feature selection and automated picking of the most suitable classification strategy for a given classification problem. Additionally, it enables content-based similarity search both on song and segment level. The latter method can be used to retrieve parts of songs that are similar to the query, whereas the complete song may exhibit different properties. Visualizations of the classification and similarity-search results are delivered to the users via a web interface. In Fig. 4, the prototype web interface to the GlobalMusic2One portal is shown. It can be seen that we feature an intuitive similarity map for explorative browsing inside the catalog as well as target-oriented search masks for specific music properties. The same interface allows the instantiation of new concepts and manual annotation of reference songs. In the upper left corner, a segment player is shown that allows to jump directly to different parts in the songs.
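A compact sketch of segment-level similarity search that aggregates matches back to songs (a generic nearest-neighbour illustration with assumed inputs, not PAMIR's actual interface):

import numpy as np

def segment_level_search(query_vec, segment_vecs, segment_to_song, top_k=10):
    """Rank songs by their best-matching segment; returns (song_id, distance) pairs."""
    dists = np.linalg.norm(segment_vecs - query_vec, axis=1)   # Euclidean distance per segment
    best_per_song = {}
    for dist, song_id in zip(dists, segment_to_song):
        if song_id not in best_per_song or dist < best_per_song[song_id]:
            best_per_song[song_id] = dist
    ranking = sorted(best_per_song.items(), key=lambda item: item[1])
    return ranking[:top_k]

# Hypothetical usage: segment_vecs is an (n_segments x n_features) matrix and
# segment_to_song maps each row to the id of the song it was cut from.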

Fig. 4. Screenshot of the GlobalMusic2One prototype web client

Conclusions and Outlook

In this paper we presented an overview of the applied MIR projects Songs2See and GlobalMusic2One. Both address core challenges of MIR with a strong focus on real-world applications. With Songs2See, development activities will be strongly directed towards the implementation of more advanced features in the editor as well as the web application. The main efforts inside GlobalMusic2One will be concentrated on consolidating the framework and the web client.

Acknowledgments
The Thuringian Ministry of Economy, Employment and Technology supported this research by granting funds of the European Fund for Regional Development to the project Songs2See⁷, enabling transnational cooperation between Thuringian companies and their partners from other European regions. Additionally, this work has been partly supported by the German research project GlobalMusic2One⁸ funded by the Federal Ministry of Education and Research (BMBF-FKZ: 01/S08039B).

⁷ See http://www.songs2see.eu
⁸ See http://www.globalmusic2one.net

References
1. Abeßer, J., Dittmar, C., Großmann, H.: Automatic genre and artist classification by analyzing improvised solo parts from musical recordings. In: Proceedings of the Audio Mostly Conference (AMC), Piteå, Sweden (2008)
2. Abeßer, J., Bräuer, P., Lukashevich, H., Schuller, G.: Bass playing style detection based on high-level features and pattern similarity. In: Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR), Utrecht, Netherlands (2010)
3. Abeßer, J., Lukashevich, H., Dittmar, C., Bräuer, P., Krause, F.: Rule-based classification of musical genres from a global cultural background. In: Proceedings of the 7th International Symposium on Computer Music Modeling and Retrieval (CMMR), Malaga, Spain (2010)
4. Abeßer, J., Lukashevich, H., Dittmar, C., Schuller, G.: Genre classification using bass-related high-level features and playing styles. In: Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR), Kobe, Japan (2009)
5. Abeßer, J., Lukashevich, H., Schuller, G.: Feature-based extraction of plucking and expression styles of the electric bass guitar. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Dallas, Texas, USA (2010)
6. Arndt, D., Gatzsche, G., Mehnert, M.: Symmetry model based key finding. In: Proceedings of the 126th AES Convention, Munich, Germany (2009)
7. Barbancho, A., Barbancho, I., Tardon, L., Urdiales, C.: Automatic edition of songs for guitar hero/frets on fire. In: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), New York, USA (2009)
8. Bertin-Mahieux, T., Eck, D., Maillet, F., Lamere, P.: Autotagger: a model for predicting social tags from acoustic features on large music databases. Journal of New Music Research 37(2), 115-135 (2008)
9. Cano, E., Cheng, C.: Melody line detection and source separation in classical saxophone recordings. In: Proceedings of the 12th International Conference on Digital Audio Effects (DAFx), Como, Italy (2009)
10. Cano, E., Schuller, G., Dittmar, C.: Exploring phase information in sound source separation applications. In: Proceedings of the 13th International Conference on Digital Audio Effects (DAFx 2010), Graz, Austria (2010)
11. Dittmar, C., Dressler, K., Rosenbauer, K.: A toolbox for automatic transcription of polyphonic music. In: Proceedings of the Audio Mostly Conference (AMC), Ilmenau, Germany (2007)
12. Dittmar, C., Großmann, H., Cano, E., Grollmisch, S., Lukashevich, H., Abeßer, J.: Songs2See and GlobalMusic2One - Two ongoing projects in Music Information Retrieval at Fraunhofer IDMT. In: Proceedings of the 7th International Symposium on Computer Music Modeling and Retrieval (CMMR), Malaga, Spain (2010)
13. Duan, Z., Pardo, B., Zhang, C.: Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions. IEEE Transactions on Audio, Speech, and Language Processing (99), 1-1 (2010)
14. Fitzgerald, D., Cranitch, M., Coyle, E.: Extended nonnegative tensor factorization models for musical sound source separation. Computational Intelligence and Neuroscience (2008)
15. Gärtner, D.: Singing / rap classification of isolated vocal tracks. In: Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR), Utrecht, Netherlands (2010)
16. Gómez, E., Haro, M., Herrera, P.: Music and geography: Content description of musical audio from different parts of the world. In: Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR), Kobe, Japan (2009)
17. Grollmisch, S., Dittmar, C., Cano, E.: Songs2See: Learn to play by playing. In: Proceedings of the 41st AES International Conference on Audio in Games, London, UK (2011)
18. Grollmisch, S., Dittmar, C., Cano, E., Dressler, K.: Server based pitch detection for web applications. In: Proceedings of the 41st AES International Conference on Audio in Games, London, UK (2011)
19. Grollmisch, S., Dittmar, C., Gatzsche, G.: Implementation and evaluation of an improvisation based music video game. In: Proceedings of the IEEE Consumer Electronics Society's Games Innovation Conference (IEEE GIC), London, UK (2009)
20. Gruhne, M., Schmidt, K., Dittmar, C.: Phoneme recognition on popular music. In: 8th International Conference on Music Information Retrieval (ISMIR), Vienna, Austria (2007)
21. Herrera, P., Sandvold, V., Gouyon, F.: Percussion-related semantic descriptors of music audio files. In: Proceedings of the 25th International AES Conference, London, UK (2004)
22. Kahl, M., Abeßer, J., Dittmar, C., Großmann, H.: Automatic recognition of tonal instruments in polyphonic music from different cultural backgrounds. In: Proceedings of the 36th Jahrestagung für Akustik (DAGA), Berlin, Germany (2010)
23. Klapuri, A.: A method for visualizing the pitch content of polyphonic music signals. In: Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR), Kobe, Japan (2009)
24. Klapuri, A., Davy, M. (eds.): Signal Processing Methods for Music Transcription. Springer Science + Business Media LLC, New York (2006)
25. Lidy, T., Rauber, A., Pertusa, A., Iñesta, J.M.: Improving genre classification by combination of audio and symbolic descriptors using a transcription system. In: Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR), Vienna, Austria (2007)
26. Lukashevich, H.: Towards quantitative measures of evaluating song segmentation. In: Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR), Philadelphia, Pennsylvania, USA (2008)
27. Lukashevich, H.: Applying multiple kernel learning to automatic genre classification. In: Proceedings of the 34th Annual Conference of the German Classification Society (GfKl), Karlsruhe, Germany (2010)
28. Lukashevich, H., Abeßer, J., Dittmar, C., Großmann, H.: From multi-labeling to multi-domain-labeling: A novel two-dimensional approach to music genre classification. In: Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR), Kobe, Japan (2009)
29. Mercado, P., Lukashevich, H.: Applying constrained clustering for active exploration of music collections. In: Proceedings of the 1st Workshop on Music Recommendation and Discovery (WOMRAD), Barcelona, Spain (2010)
30. Mercado, P., Lukashevich, H.: Feature selection in clustering with constraints: Application to active exploration of music collections. In: Proceedings of the 9th Int. Conference on Machine Learning and Applications (ICMLA), Washington DC, USA (2010)
31. Ono, N., Miyamoto, K., Roux, J.L., Kameoka, H., Sagayama, S.: Separation of a monaural audio signal into harmonic/percussive components by complementary diffusion on spectrogram. In: Proceedings of the 16th European Signal Processing Conference (EUSIPCO), Lausanne, Switzerland (2008)
32. Pohle, T., Schnitzer, D., Schedl, M., Knees, P., Widmer, G.: On rhythm and general music similarity. In: Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR), Kobe, Japan (2009)
33. Ryynänen, M., Klapuri, A.: Automatic transcription of melody, bass line, and chords in polyphonic music. Computer Music Journal 32, 72-86 (2008)
34. Sagayama, S., Takahashi, K., Kameoka, H., Nishimoto, T.: Specmurt anasylis: A piano-roll-visualization of polyphonic music signal by deconvolution of log-frequency spectrum. In: Proceedings of the ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing (SAPA), Jeju, Korea (2004)
35. Shashanka, M., Raj, B., Smaragdis, P.: Probabilistic latent variable models as nonnegative factorizations. Computational Intelligence and Neuroscience (2008)
36. Smaragdis, P., Mysore, G.J.: Separation by humming: User-guided sound extraction from monophonic mixtures. In: Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA (2009)
37. Stein, M., Schubert, B.M., Gruhne, M., Gatzsche, G., Mehnert, M.: Evaluation and comparison of audio chroma feature extraction methods. In: Proceedings of the 126th AES Convention, Munich, Germany (2009)
38. Stober, S., Nürnberger, A.: Towards user-adaptive structuring and organization of music collections. In: Detyniecki, M., Leiner, U., Nürnberger, A. (eds.) AMR 2008. LNCS, vol. 5811, pp. 53-65. Springer, Heidelberg (2010)
39. Tzanetakis, G., Kapur, A., Schloss, W.A., Wright, M.: Computational ethnomusicology. Journal of Interdisciplinary Music Studies 1(2), 1-24 (2007)
40. Uhle, C.: Automatisierte Extraktion rhythmischer Merkmale zur Anwendung in Music Information Retrieval-Systemen. Ph.D. thesis, Ilmenau University, Ilmenau, Germany (2008)
41. Vinyes, M., Bonada, J., Loscos, A.: Demixing commercial music productions via human-assisted time-frequency masking. In: Proceedings of the 120th AES Convention, Paris, France (2006), http://www.mtg.upf.edu/files/publications/271dd4-AES120-mvinyes-jbonada-aloscos.pdf (last viewed February 2011)
42. Völkel, T., Abeßer, J., Dittmar, C., Großmann, H.: Automatic genre classification of latin american music using characteristic rhythmic patterns. In: Proceedings of the Audio Mostly Conference (AMC), Piteå, Sweden (2010)

MusicGalaxy:
A Multi-focus Zoomable Interface for
Multi-facet Exploration of Music Collections

Sebastian Stober and Andreas Nürnberger

Data & Knowledge Engineering Group
Faculty of Computer Science
Otto-von-Guericke-University Magdeburg, D-39106 Magdeburg, Germany
{sebastian.stober,andreas.nuernberger}@ovgu.de
http://www.dke-research.de

Abstract. A common way to support exploratory music retrieval scenarios is to give an overview using a neighborhood-preserving projection of the collection onto two dimensions. However, neighborhood cannot always be preserved in the projection because of the inherent dimensionality reduction. Furthermore, there is usually more than one way to look at a music collection and therefore different projections might be required depending on the current task and the user's interests. We describe an adaptive zoomable interface for exploration that addresses both problems: It makes use of a complex non-linear multi-focal zoom lens that exploits the distorted neighborhood relations introduced by the projection. We further introduce the concept of facet distances representing different aspects of music similarity. User-specific weightings of these aspects allow an adaptation according to the user's way of exploring the collection. Following a user-centered design approach with a focus on usability, a prototype system has been created by iteratively alternating between development and evaluation phases. The results of an extensive user study including gaze analysis using an eye-tracker prove that the proposed interface is helpful while at the same time being easy and intuitive to use.
Keywords: exploration, interface, multi-facet, multi-focus.

Introduction

There is a lot of ongoing research in the field of music retrieval aiming to improve retrieval results for queries posed as text, sung, hummed or by example, as well as to automatically tag and categorize songs. All these efforts facilitate scenarios where the user is able to somehow formulate a query, either by describing the song or by giving examples. But what if the user cannot pose a query because the search goal is not clearly defined? E.g., he might look for background music for a photo slide show but does not know where to start. All he knows is that he can tell if it is the right music the moment he hears it. In such a case, exploratory



Fig. 1. Possible problems caused by projecting objects represented in a high-dimensional feature space (left) onto a low-dimensional space for display (right)

retrieval systems can help by providing an overview of the collection and letting the user decide which regions to explore further.

When it comes to getting an overview of a music collection, neighborhood-preserving projection techniques have become increasingly popular. Beforehand, the objects to be projected (depending on the approach, this may be artists, albums, tracks or any combination thereof) are analyzed to extract a set of descriptive features. (Alternatively, feature information may also be annotated manually or collected from external sources.) Based on these features, the objects can be compared, or more specifically, appropriate distance or similarity measures can be defined. The general objective of the projection can then be paraphrased as follows: Arrange the objects in two or three dimensions (on the display) in such a way that neighboring objects are very similar and the similarity decreases with increasing object distance (on the display). As the feature space of the objects to be projected usually has far more dimensions than the display space, the projection inevitably causes some loss of information, irrespective of which dimensionality reduction technique is applied. Consequently, this leads to a distorted display of the neighborhoods such that some objects will appear closer than they actually are (type I error), and on the other hand some objects that are distant in the projection may in fact be neighbors in feature space (type II error). Such neighborhood distortions are depicted in Figure 1. These projection errors cannot be fixed on a global scale without introducing new ones elsewhere, as the projection is already optimal w.r.t. some criteria (depending on the technique used). In this sense, they should not be considered as errors made by the projection technique but of the resulting (displayed) arrangement. When a user explores a projected collection, type I errors increase the number of dissimilar (i.e. irrelevant) objects displayed in a region of interest. While this might become annoying, it is much less problematic than type II errors. They result in similar (i.e. relevant) objects being displayed away from the region of interest, the neighborhood they actually belong to. In the worst case they could even be off-screen if the display is limited to the currently explored region. This way, a user could miss objects he is actually looking for.


The interactive visualization technique described in this paper exploits these distorted neighborhood relations during user-interaction. Instead of trying to globally repair errors in the projection, the general idea is to temporarily fix the neighborhood in focus. The approach is based on a multi-focus fish-eye lens that allows a user to enlarge and explore a region of interest while at the same time adaptively distorting the remaining collection to reveal distant regions with similar tracks. It can therefore be considered as a focus-adaptive distortion technique.
Another problem that arises when working with similarity-based neighborhoods is that music similarity is highly subjective and may depend on a person's background. Consequently, there is more than one way to look at a music collection, or more specifically, to compare two tracks based on their features. The user-interface presented in this paper therefore allows the user to modify the underlying distance measure by adapting weights for different aspects of (dis-)similarity.
The remainder of this paper is structured as follows: Section 2 gives an overview
of related approaches that aim to visualize a music collection. Subsequently,
Section 3 outlines the approach developed in this work. The underlying techniques are addressed in Section 4 and Section 5 explains how a user can interact
with the proposed visualization. In order to evaluate the approach, a user study
has been conducted which is described in Section 6. Finally, Section 7 concludes
with a brief summary.

2 Related Work

There exists a variety of approaches that in some way give an overview of a music collection. For the task of music discovery, which is closely related to collection exploration, a very broad survey of approaches is given in [7]. Generally, there are several possible levels of granularity that can be supported, the most common being: track, album, artist and genre. Though a system may cover more than one granularity level (e.g. in [51], visualized as disc or TreeMap [41]), usually a single one is chosen. The user-interface presented in this paper focuses on the track level, as do most of the related approaches. (However, like most of the other techniques, it may as well be applied on other levels such as for albums or artists. All that is required is an appropriate feature representation of the objects of interest.) Those approaches focusing on a single level can roughly be categorized into graph-based and similarity-based overviews.
Graphs facilitate a natural navigation along relationship-edges. They are especially well-suited for the artist level as social relations can be directly visualized (as, e.g., in the Last.fm Artist Map1 or the Relational Artist Map RAMA [39]). However, building a graph requires relations between the objects, either from domain knowledge or artificially introduced. E.g., there are some graphs that use similarity relations obtained from external sources (such as the APIs of Last.fm2 or EchoNest3) and not from an analysis of the objects themselves. Either way, this results in a very strong dependency and may quickly become problematic for less main-stream music where such information might not be available. This is why a similarity-based approach is chosen here instead.
1 http://sixdegrees.hu/last.fm/interactive map.html
2 http://www.last.fm/api
3 http://developer.echonest.com
Similarity-based approaches require the objects to be represented by one or more features. They are in general better suited for track level overviews due to the vast variety of content-based features that can be extracted from tracks. For albums and artists, either some means for aggregating the features of the individual tracks are needed or non-content-based features, e.g. extracted from knowledge resources like MusicBrainz4 and Wikipedia5 or cultural meta-data [54], have to be used. In most cases the overview is then generated using some metric defined on these features which leads to proximity of similar objects in the feature space. This neighborhood should be preserved in the collection overview which usually has only two dimensions. Popular approaches for dimensionality reduction are Self-Organizing Maps (SOMs) [17], Principal Component Analysis (PCA) [14] and Multidimensional Scaling (MDS) techniques [18].
4 http://musicbrainz.org
5 http://www.wikipedia.org
In the field of music information retrieval, SOMs are widely used. SOM-based systems comprise the SOM-enhanced Jukebox (SOMeJB) [37], the Islands of Music [35,34] and nepTune [16], the MusicMiner [29], the PlaySOM- and PocketSOM-Player [30] (the latter being a special interface for mobile devices), the BeatlesExplorer [46] (the predecessor prototype of the system presented here), the SoniXplorer [23,24], the Globe of Music [20] and the tabletop applications MUSICtable [44], MarGrid [12], SongExplorer [15] and [6]. SOMs are prototype-based and thus there has to be a way to initially generate random prototypes and to modify them gradually when objects are assigned. This poses special requirements regarding the underlying feature space and distance metric. Moreover, the result depends on the random initialization and the neural network gradient descent algorithm may get stuck in a local minimum and thus not produce an optimal result. Further, there are several parameters that need to be tweaked according to the data set, such as the learning rate, the termination criterion for iteration, the initial network structure, and (if applicable) the rules by which the structure should grow. However, there are also some advantages of SOMs: Growing versions of SOMs can adapt incrementally to changes in the data collection whereas other approaches may always need to generate a new overview from scratch. Section 4.2 will address this point more specifically for the approach taken here. For the interactive task at hand, which requires a real-time response, the disadvantages of SOMs outweigh their advantages. Therefore, the approach taken here is based on MDS.
Given a set of data points, MDS finds an embedding in the target space that maintains their distances (or dissimilarities) as far as possible, without having to know the actual values of the data points. This way, it is also well suited to compute a layout for spring- or force-based approaches. PCA identifies the axes of highest variance (termed principal components) for a set of data points in high-dimensional space. To obtain a dimensionality reduction to two-dimensional space, the data points are simply projected onto the two principal component axes with the highest variance. PCA and MDS are closely related [55]. In contrast to SOMs, both are non-parametric approaches that compute an optimal solution (with respect to data variance maximization and distance preservation, respectively) in fixed polynomial time. Systems that apply PCA, MDS or similar force-based approaches comprise [4], [11], [10], the fm4 Soundpark [8], MusicBox [21], and SoundBite [22].

Fig. 2. In SoundBite [22], a seed song and its nearest neighbors are connected by lines
All of the above approaches use some kind of projection technique to visualize the collection but only a small number tries to additionally visualize properties of the projection itself: The MusicMiner [29] draws mountain ranges between songs that are displayed close to each other but dissimilar. The SoniXplorer [23,24] uses the same geographical metaphor but in a 3D virtual environment that the user can navigate with a game pad. The Islands of Music [35,34] and its related approaches [16,30,8] use the third dimension the other way around: Here, islands or mountains refer to regions of similar songs (with high density). Both ways, local properties of the projection are visualized: neighborhoods of either dissimilar or similar songs. In contrast (and possibly as a supplementation) to this, the technique proposed in this paper aims to visualize properties of the projection that are not locally confined: As visualized in [32], there may be distant regions in a projection that contain very similar objects. This is much like a wormhole connecting both regions through the high-dimensional feature space. To our knowledge, the only attempt so far to visualize such distortions caused by the projection is described in [22]. The approach is to draw lines that connect a selected seed track (highlighted with a circle) with its neighbors as shown in Figure 2.


Additionally, our goal is to support user-adaptation during the exploration process by means of weighting aspects of music similarity. Of the above approaches, only the revised SoniXplorer [24], MusicBox [21] and the BeatlesExplorer [46] (our original SOM-based prototype) allow automatic adaptation of the view on the collection through interaction. Apart from this, there exist systems that also adapt a similarity measure, not to change the way the collection is presented in an overview, but to directly generate playlists: MPeer [3] allows to navigate the similarity space defined by the audio content, the lyrics and cultural meta-data collected from the web through an intuitive joystick interface. In the E-Mu Jukebox [53], weights for five similarity components (sound, tempo, mood, genre and year, visually represented by adapters) can be changed by dragging them on a bull's eye. PATS (Personalized Automatic Track Selection) [36] and the system described in [56] do not require manual adjustment of the underlying similarity measure but learn from the user as he selects songs that in his opinion do not fit the current context-of-use. PAPA (Physiology and Purpose-Aware Automatic Playlist Generation) [33] as well as the already commercially available BODiBEAT music player6 uses sensors that measure several bio-signals (such as the pulse) of the user as immediate feedback for the music currently played. In this case not even a direct interaction of the user with the system is required to continuously adapt playlist-models for different purposes. However, as we have shown in a recent survey [50], users do not particularly like the idea of having their bio-signals logged, especially if they cannot control the impact of this information on the song recommendation process. In contrast to these systems that purely focus on the task of playlist generation, we pursue a more general goal in providing an adaptive overview of the collection that can then be used to easily generate playlists, as e.g. already shown in [16] or [21].
6 http://www.yamaha.com/bodibeat

3 Outline

The goal of our work is to provide a user with an interactive way of exploring a music collection that takes into account the above described inevitable limitations of a low-dimensional projection of a collection. Further, it should be applicable for realistic music collections containing several thousands of tracks. The approach taken can be outlined as follows:
- An overview of the collection is given, where all tracks are displayed as points at any time. For a limited number of tracks that are chosen to be spatially well distributed and representative, an album cover thumbnail is shown for orientation.
- The view on the collection is generated by a neighborhood-preserving projection (e.g. MDS, SOM, PCA) from some high-dimensional feature space onto two dimensions. I.e., in general, tracks that are close in feature space will likely appear as neighbors in the projection.

- Users can adapt the projection by choosing weights for several aspects of music (dis-)similarity. This gives them the possibility to look at a collection from different perspectives. (This adaptation is purely manual, i.e. the visualization as described in this paper is only adaptable w.r.t. music similarity. Techniques to further enable adaptive music similarity are, e.g., discussed in [46,49].)
- In order to allow immediate visual feedback in case of similarity adaptation, the projection technique needs to guarantee near real-time performance even for large music collections. The quality of the produced projection is only secondary: a perfect projection that correctly preserves all distances between all tracks is extremely unlikely anyways.
- The projection will inevitably contain distortions of the actual distances of the tracks. Instead of trying to improve the quality of the projection method and trying to fix heavily distorted distances, they are exploited during interaction with the projection:
- The user can zoom into a region of interest. The space for this region is increased, thus allowing to display more details. At the same time the surrounding space is compacted but not hidden from view. This way, there remains some context for orientation. To accomplish such a behavior, the zoom is based on a non-linear distortion similar to so-called fish-eye lenses.
- At this point the original (type II) projection errors come into play: Instead of putting a single lens focus on the region of interest, additional focuses are introduced in regions that contain tracks similar to those in primary focus. The resulting distortion brings original neighbors back closer to each other. This gives the user another option for interactive exploration.
Figure 3 depicts the outline of the approach. The following sections cover the underlying techniques (Section 4) and the user-interaction (Section 5) in detail.

4 Underlying Techniques

4.1 Features and Facets

The prototype system described here uses collections of music tracks. As a prerequisite, it is assumed that the tracks are represented by some descriptive features that can, e.g., be extracted, manually annotated or obtained from external sources. In the current implementation, content-based features are extracted utilizing the capabilities of the frameworks CoMIRVA [40] and JAudio [28]. Specifically, Gaussian Mixture Models of the Mel Frequency Cepstral Coefficients (MFCCs) according to [2] and [26] and fluctuation patterns describing how strong and fast beats are played within specific frequency bands [35] are computed with CoMIRVA. JAudio is used to extract a global audio descriptor MARSYAS07 as described in [52]. Further, lyrics for all songs were obtained through the web service of LyricWiki7, filtered for stop words, stemmed and described by document vectors with TFxIDF term weights [38].
7 http://lyricwiki.org

Fig. 3. Outline of the approach showing the important processing steps and data structures. Top: preprocessing (offline), from feature extraction over facet subspaces and facet distances to m landmarks, stored in the N × m × l facet distance cuboid, to nearest neighbor indexing and landmark selection. Bottom: interaction with the user (online), covering facet distance aggregation, projection, lens distortion and filtering, with screenshots of the graphical user interface.

Additional features that are currently only used for the visualization are ID3 tags (artist, album, title, track number and year) extracted from the audio files, track play counts obtained from a Last.fm profile, and album covers gathered through web search.
Distance Facets. Based on the features associated with the tracks, facets are defined (on subspaces of the feature space) that refer to different aspects of music (dis-)similarity. This is depicted in Figure 3 (top).
Definition 1. Given a set of features F, let S be the space determined by the feature values for a set of tracks T. A facet f is defined by a facet distance measure $\delta_f$ on a subspace $S_f \subseteq S$ of the feature space, where $\delta_f$ satisfies the following conditions for any $x, y \in T$:
- $\delta(x, y) \geq 0$ and $\delta(x, y) = 0$ if and only if $x = y$
- $\delta(x, y) = \delta(y, x)$ (symmetry)
Optionally, $\delta$ is a distance metric if it additionally obeys the triangle inequality for any $x, y, z \in T$:
- $\delta(x, z) \leq \delta(x, y) + \delta(y, z)$ (triangle inequality)
E.g., a facet timbre could be defined on the MFCC-based feature described in [26] whereas a facet text could compare the combined information from the features title and lyrics.
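To make this concrete, the sketch below models a facet as a named distance measure over a feature subspace. It is illustrative only (the class and function names are not taken from the MusicGalaxy code base), and the feature vectors are assumed to be already extracted as described above.

```python
from dataclasses import dataclass
from typing import Callable, Sequence
import math

FeatureVector = Sequence[float]

@dataclass
class Facet:
    """A facet couples a name with a distance measure on a feature subspace."""
    name: str
    distance: Callable[[FeatureVector, FeatureVector], float]

def euclidean(x: FeatureVector, y: FeatureVector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(x: FeatureVector, y: FeatureVector) -> float:
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0.0 or ny == 0.0:
        return 1.0
    return 1.0 - dot / (nx * ny)

# Hypothetical facets mirroring Table 1 (feature extraction itself is omitted):
rhythm = Facet("rhythm", euclidean)        # on fluctuation patterns
lyrics = Facet("lyrics", cosine_distance)  # on TFxIDF term vectors
```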
It is important to stress the difference to common faceted browsing and search approaches that rely on a faceted classification of objects to support users in exploration by filtering available information. Here, no such filtering by value is applied. Instead, we employ the concept of facet distances to express different aspects of (dis-)similarity that can be used for filtering.
Facet Distance Normalization. In order to avoid a bias when aggregating several facet distance measures, the values should be normalized. The following normalization truncates very high facet distance values $\delta_f(x, y)$ of a facet f and results in a value range of [0, 1]:

$$\hat{\delta}_f(a, b) = \min\left(1, \frac{\delta_f(a, b)}{\mu + \sigma}\right) \qquad (1)$$

where $\mu$ is the mean

$$\mu = \frac{1}{|\{(x, y) \in T^2\}|} \sum_{(x,y) \in T^2} \delta_f(x, y) \qquad (2)$$

and $\sigma$ is the standard deviation

$$\sigma = \sqrt{\frac{1}{|\{(x, y) \in T^2\}|} \sum_{(x,y) \in T^2} \left(\delta_f(x, y) - \mu\right)^2} \qquad (3)$$

of all distance values with respect to f.
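Assuming the raw pairwise facet distances are available as an array, the truncated normalization of Eqs. (1)-(3) is a one-liner; the following NumPy sketch is illustrative and not taken from the prototype:

```python
import numpy as np

def normalize_facet_distances(raw):
    """Truncated normalization of facet distances to [0, 1], cf. Eqs. (1)-(3).

    raw: NumPy array of raw facet distance values delta_f(x, y) over track pairs.
    Values above mu + sigma are truncated to 1.
    """
    mu = raw.mean()                             # Eq. (2)
    sigma = raw.std()                           # Eq. (3), population standard deviation
    return np.minimum(1.0, raw / (mu + sigma))  # Eq. (1)
```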

282

S. Stober and A. N
urnberger
Table 1. Facets defined for the current implementation

facet name   feature                        distance metric
timbre       GMM of MFCCs                   Kullback-Leibler divergence
rhythm       fluctuation patterns           euclidean distance
dynamics     MARSYAS07                      euclidean distance
lyrics       TFxIDF weighted term vectors   cosine distance
Facet Distance Aggregation. The actual distance between tracks $x, y \in T$ w.r.t. the facets $f_1, \ldots, f_l$ can be computed by aggregating the individual facet distances $\delta_{f_1}(x, y), \ldots, \delta_{f_l}(x, y)$. For the aggregation, basically any function could be used. Common parametrized aggregation functions are:
- $d = \sqrt{\sum_{i=1}^{l} w_i\, \delta_{f_i}(x, y)^2}$ (weighted euclidean distance)
- $d = \sum_{i=1}^{l} w_i\, \delta_{f_i}(x, y)^2$ (squared weighted eucl. distance)
- $d = \sum_{i=1}^{l} w_i\, \delta_{f_i}(x, y)$ (weighted sum)
- $d = \max_{i=1..l} \{w_i\, \delta_{f_i}(x, y)\}$ (maximum)
- $d = \min_{i=1..l} \{w_i\, \delta_{f_i}(x, y)\}$ (minimum)
These aggregation functions allow to control the importance of the individual facet distances through their associated weights $w_1, \ldots, w_l$. Default settings for the facet weights and the aggregation function are defined by an expert (who also defined the facets themselves) and can later be adapted by the user during interaction with the interface. Table 1 lists the facets used in the current implementation.
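As an illustration, the aggregation step could be implemented as a small helper like the one below (hypothetical names, not the prototype's code); it takes the normalized per-facet distances between two tracks and the user-controlled facet weights:

```python
import math

def aggregate(facet_distances, weights, mode="weighted_sum"):
    """Combine normalized facet distances using weights w_1..w_l."""
    pairs = list(zip(weights, facet_distances))
    if mode == "weighted_euclidean":
        return math.sqrt(sum(w * d * d for w, d in pairs))
    if mode == "squared_weighted_euclidean":
        return sum(w * d * d for w, d in pairs)
    if mode == "weighted_sum":
        return sum(w * d for w, d in pairs)
    if mode == "maximum":
        return max(w * d for w, d in pairs)
    if mode == "minimum":
        return min(w * d for w, d in pairs)
    raise ValueError("unknown aggregation mode: " + mode)

# e.g. timbre, rhythm, dynamics and lyrics distances with user-chosen weights:
# aggregate([0.3, 0.7, 0.5, 0.9], [1.0, 0.5, 0.0, 2.0], mode="weighted_sum")
```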
4.2 Projection

In the projection step shown in Figure 3 (bottom), the position of all tracks on the display is computed according to their (aggregated) distances in the high-dimensional feature space. Naturally, this projection should be neighborhood-preserving such that tracks close to each other in feature space are also close in the projection. We propose to use a landmark- or pivot-based Multidimensional Scaling approach (LMDS) for the projection as described in detail in [42,43]. This is a computationally efficient approximation to classical MDS. The general idea of this approach is as follows: A representative sample of objects, called landmarks, is drawn randomly from the whole collection.8 For this landmark sample, an embedding into low-dimensional space is computed using classical MDS. The remaining objects can then be located within this space according to their distances to the landmarks.
8 Alternatively, the MaxMin heuristic (greedily seeking out extreme, well-separated landmarks) could be used, with the optional modification to replace landmarks with a predefined probability by randomly chosen objects (similar to a mutation operator in genetic programming). Neither alternative seems to produce less distorted projections, while both have a much higher computational complexity. However, there is possibly some room for improvement here, but this is out of the scope of this paper.
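The two LMDS stages (classical MDS on the landmarks, then distance-based placement of everything else) can be sketched compactly with NumPy. This is an illustrative reimplementation under the stated assumptions, not the prototype's code, which follows [42,43]:

```python
import numpy as np

def landmark_mds(D_landmarks, D_all, k=2):
    """Landmark MDS sketch.

    D_landmarks: (m, m) distances among the m landmark tracks
    D_all:       (N, m) distances from all N tracks to the landmarks
                 (e.g. aggregated from the facet distance cuboid)
    Returns (N, k) display coordinates for all tracks.
    """
    m = D_landmarks.shape[0]
    # 1) classical MDS on the landmarks: double-center the squared distances
    D2 = D_landmarks ** 2
    J = np.eye(m) - np.ones((m, m)) / m
    B = -0.5 * J @ D2 @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:k]              # k largest (assumed positive) eigenvalues
    # 2) distance-based triangulation of all objects w.r.t. the landmarks
    pinv = eigvecs[:, top] / np.sqrt(eigvals[top])   # (m, k) pseudo-inverse transform
    mean_d2 = D2.mean(axis=0)                        # mean squared distance per landmark
    return -0.5 * (D_all ** 2 - mean_d2) @ pinv      # (N, k) coordinates
```

Passing the landmarks' own distance rows through step 2 recovers their classical MDS coordinates, so the function only needs to return the coordinates for all N tracks.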


Complexity. Classical MDS has a computational complexity of $O(N^3)$ for the projection, where N is the number of objects in the data set. Additionally, the $N \times N$ distance matrix needed as input requires $O(N^2)$ space and is computed in $O(CN^2)$, where C are the costs of computing the distance between two objects. By limiting the number of landmark objects to $m \ll N$, an LMDS projection can be computed in $O(m^3 + kmN)$, where k is the dimension of the visualization space, which is fixed here to 2. The first part refers to the computation of the classical MDS for the m landmarks and the second to the projection of the remaining objects with respect to the landmarks. Further, LMDS requires only the distances of each data point to the landmarks, i.e. only an $m \times N$ distance matrix has to be computed, resulting in $O(mN)$ space and $O(CmN)$ computational complexity. This way, LMDS becomes feasible for application on large data sets as it scales linearly with the size of the data set.
Facet Distance Caching. The computation of the distance matrix that is required for LMDS can be very time consuming, depending not only on the size of the collection and landmark sample but also on the number of facets and the complexity of the respective facet distance measures. Caching can reduce the amount of information that has to be recomputed. Assuming a fixed collection, the distance matrix only needs to be recomputed if the facet weights or the facet aggregation function change. Moreover, even a change of the aggregation parameters has no impact on the facet distances. This allows to pre-compute for each track the distance values to all landmarks for all facets offline and store them in the 3-dimensional data structure depicted in Figure 3 (top) called facet distance cuboid. It is necessary to store the facet distance values separately as it is not clear at indexing time how these values are to be aggregated. During interaction with the user, when near real-time response is required, only the computationally lightweight facet distance aggregation that produces the distance matrix from the cuboid and the actual projection need to be done.
If N is the number of tracks, m the number of landmarks and l the number of facets, the cuboid has the dimension $N \times m \times l$ and holds as many distance values. Note that m and l are fixed small values of O(100) and O(10) respectively. Thus, the space requirement effectively scales linearly with N and even for large N the data structure should fit into memory. To further reduce the memory requirements of this data structure, the distance values are discretized to the byte range ([0...255]) after normalization to [0, 1] as described in Section 4.1.
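In other words, the cuboid is just a dense byte array plus a cheap weighted reduction over its facet axis. A minimal sketch (illustrative shapes and names) of both the offline quantization and the online aggregation into the $N \times m$ track-landmark distance matrix:

```python
import numpy as np

def build_cuboid(norm_facet_dists):
    """norm_facet_dists: (N, m, l) array of normalized facet distances in [0, 1]
    (N tracks, m landmarks, l facets), quantized to bytes to save memory."""
    return np.round(norm_facet_dists * 255).astype(np.uint8)

def track_landmark_distances(cuboid, facet_weights):
    """Aggregate the cuboid into an (N, m) distance matrix with a weighted sum.
    This is all that has to be recomputed online when the facet weights change;
    the result is fed into the LMDS projection sketched above."""
    w = np.asarray(facet_weights, dtype=np.float32)   # shape (l,)
    return (cuboid.astype(np.float32) / 255.0) @ w    # (N, m, l) @ (l,) -> (N, m)
```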
Incremental Collection Updates. Re-computation also becomes necessary once the collection changes. In previous work [32,46] we used a Growing Self-Organizing Map approach (GSOM) for the projection. While both approaches, LMDS and GSOM, are neighborhood-preserving, GSOMs have the advantage of being inherently incremental, i.e. adding or removing objects from the data set only gradually changes the way the data is projected. This is a nice characteristic because too abrupt changes in the projection caused by adding or removing some tracks might irritate the user if he has gotten used to a specific projection. On the contrary, LMDS does not allow for incremental changes of the projection.

Fig. 4. The SpringLens particle mesh is distorted by changing the rest-length of selected springs

However, it still allows objects to be added to or removed from the data set to some extent without the need to compute a new projection: If a new track is added to the collection, an additional layer has to be appended to the facet distance cuboid containing the facet distances of the new track to all landmarks. The new track can then be projected according to these distances. If a track is removed, the respective layer of the cuboid can be deleted. Neither operation alters the projection any further.9 Adding or removing many tracks may however alter the distribution of the data (and thus the covariances) in such a way that the landmark sample may no longer be representative. In this case, a new projection based on a modified landmark sample should be computed. However, for the scope of this paper, a stable landmark set is assumed and this point is left for further work.
9 In case a landmark track is removed from the collection, its feature representation has to be kept to be able to compute facet distances for new tracks. However, the corresponding layer in the cuboid can be removed as for any ordinary track.
4.3 Lens Distortion

Once the 2-D-positions of all tracks are computed by the projection technique,
the collection could already be displayed. However, an intermediate distortion
step is introduced as depicted in Figure 3 (bottom). It serves as the basis for the
interaction techniques described later.
Lens Modeling. The distortion technique is based on an approach originally developed to model complex nonlinear distortions of images called SpringLens [9]. A SpringLens consists of a mesh of mass particles and interconnecting springs that form a rectangular grid with fixed resolution. Through the springs, forces are exerted between neighboring particles affecting their motion. By changing the rest-length of selected springs, the mesh can be distorted as depicted in Figure 4. (Further, Figure 3 (bottom) and Figure 7 show larger meshes simulating lenses.) The deformation is calculated by a simple iterative physical simulation over time using an Euler integration [9].
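To give an impression of what such a simulation step looks like, here is a stripped-down mass-spring update with explicit Euler integration. It is a generic sketch of the idea, not the SpringLens implementation from [9]; all names and constants are illustrative:

```python
import numpy as np

def euler_step(pos, vel, springs, rest_len, stiffness=50.0, damping=0.9, dt=0.01):
    """One explicit Euler step of a particle mesh.

    pos, vel:  (n, 2) particle positions and velocities
    springs:   list of index pairs (i, j) of connected particles
    rest_len:  per-spring rest-lengths; locally changing them (e.g. around a
               focus point) is what bends the mesh into a fish-eye lens.
    """
    force = np.zeros_like(pos)
    for (i, j), r in zip(springs, rest_len):
        delta = pos[j] - pos[i]
        dist = np.linalg.norm(delta) + 1e-9
        f = stiffness * (dist - r) * delta / dist   # Hooke's law along the spring
        force[i] += f
        force[j] -= f
    vel = damping * (vel + dt * force)              # unit particle mass assumed
    pos = pos + dt * vel
    return pos, vel
```

Iterating this step until the mesh settles yields the distorted grid from which the displayed track positions are then derived.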
In the context of this work, the SpringLens technique is applied to simulate a complex superimposition of multiple fish-eye lenses. A moderate resolution is chosen with a maximum of 50 cells in each dimension for the overlay mesh, which yields sufficient distortion accuracy while real-time capability is maintained. The distorted position of the projection points is obtained by barycentric coordinate transformation with respect to the particle points of the mesh. Additionally, z-values are derived from the rest-lengths, which are used in the visualization to decide whether an object has to be drawn below or above another one.
Nearest Neighbor Indexing. For the adaptation of the lens distortion, the nearest neighbors of a track need to be retrieved. Here, the two major challenges are:
1. The facet weights are not known at indexing time and thus the index can only be built using the facet distances.
2. The choice of an appropriate indexing method for each facet depends on the respective distance measure and the nature of the underlying features.
As the focus lies here on the visualization and not the indexing, only a very basic approach is taken and further developments are left for future work: A limited list of nearest neighbors is pre-computed for each track. This way, nearest neighbors can be retrieved by simple lookup in constant time ($O(1)$). However, updating the lists after a change of the facet weights is computationally expensive. While the resulting delay of the display update is still acceptable for collections with a few thousand tracks, it becomes infeasible for larger N.
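A brute-force version of this pre-computation, which has to be repeated whenever the facet weights change, could look as follows (illustrative sketch; the distance function stands for the aggregated facet distance):

```python
import heapq

def precompute_neighbor_lists(tracks, distance, k=5):
    """Pre-compute the k nearest neighbors for every track (brute force, O(N^2)).

    tracks:   list of track ids
    distance: callable(track_a, track_b) -> aggregated facet distance
    Returns a dict mapping each track id to its k nearest neighbor ids.
    """
    index = {}
    for a in tracks:
        candidates = ((distance(a, b), b) for b in tracks if b != a)
        index[a] = [b for _, b in heapq.nsmallest(k, candidates)]
    return index

# Online, retrieving the neighbors of the track in primary focus is then O(1):
# neighbors = index[focused_track]
```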
For more efficient index structures, it may be possible to apply generic multimedia indexing techniques such as space partition trees [5] or approximate approaches based on locality sensitive hashing [13] that may even be kernelized [19] to allow for more complex distance metrics. Another option is to generate multiple nearest neighbor indexes, each for a different setting of the facet weights, and interpolate the retrieved result lists w.r.t. the actual facet weights.
4.4 Visualization Metaphor

The music collection is visualized as a galaxy. Each track is displayed as a star or as its album cover. The brightness and (to some extent) the hue of the stars depend on a predefined importance measure. The currently used measure of importance is the track play count obtained from the Last.fm API and normalized to [0, 1] (by dividing by the maximum value). However, this could also be substituted by a more sophisticated measure, e.g. based on (user) ratings, chart positions or general popularity. The size and the z-order (i.e. the order of objects along the z-axis) of the objects depend on their distortion z-values. Optionally, the SpringLens mesh overlay can be displayed. The visualization then resembles the space-time distortions well known from gravitational and relativistic physics.
4.5 Filtering

In order to reduce the amount of information displayed at a time, an additional filtering step is introduced as depicted in Figure 3 (bottom). The user can choose between different filters that decide whether a track is displayed collapsed or expanded, i.e. as a star or album cover respectively. While album covers help for orientation, the displayed stars give information about the data distribution. Trivial filters are those displaying no album covers (collapseAll) or all (expandAll). Apart from collapsing or expanding all tracks, it is possible to expand only those tracks in magnified regions (i.e. with a z-level above a predefined threshold) or to apply a sparser filter. The results of using these filter modes are shown in Figure 5.

Fig. 5. Available filter modes: collapse all (top left), focus (top right), sparse (bottom left), expand all (bottom right). The SpringLens mesh overlay is hidden.
A sparser filter selects only a subset of the collection to be expanded that is both sparse (well distributed) and representative. Representative tracks are those with a high importance (described in Section 4.4). The first sparser version used a Delaunay triangulation and was later substituted by a raster-based approach that produces more appealing results in terms of the spatial distribution of displayed covers.


Originally, the set of expanded tracks was updated after any position changes
caused by the distortion overlay. However, this was considered irritating during
early user tests and the sparser strategy was changed to update only if the
projection or the displayed region changes.
Delaunay Sparser Filter. This sparser filter constructs a Delaunay triangulation incrementally top-down, starting with the track with the highest importance and some virtual points at the corners of the display area. Next, the size of all resulting triangles (given by the radius of their circumcircle) is compared with a predefined threshold $size_{min}$. If the size of a triangle exceeds this threshold, the most important track within this triangle is chosen for display and added as a point for the triangulation. This process continues recursively until no triangle that exceeds $size_{min}$ contains any more tracks that could be added. All tracks belonging to the triangulation are then expanded (i.e. displayed as album thumbnails).
The Delaunay triangulation can be computed in $O(n \log n)$ and the number of triangles is at most $O(n)$, with $n \ll N$ being the number of actually displayed album cover thumbnails. To reduce lookup time, projected points are stored in a quadtree data structure [5] and sorted by importance within the tree's quadrants. A triangle's size may change through distortion caused by the multi-focal zoom. This change may trigger an expansion of the triangle or a removal of the point that caused its creation originally. Both operations are propagated recursively until all triangles meet the size condition again. Figure 3 (bottom) shows a triangulation and the resulting display for a (distorted) projection of a collection.
Raster Sparser Filter. The raster sparser filter divides the display into a grid of quadratic cells. The size of the cells depends on the screen resolution and the minimal display size of the album covers. Further, it maintains a list of the tracks ranked by importance that is precomputed and only needs to be updated when the importance values change. On an update, the sparser runs through its ranked list. For each track it determines the respective grid cell. If the cell and the surrounding cells are empty, the track is expanded and its cell blocked. (Checking surrounding cells avoids image overlap. The necessary radius for the surrounding can be derived from the cell and cover sizes.)
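The core loop of this filter can be sketched as follows (hypothetical names; screen positions and the importance-ranked list are assumed to be given):

```python
def raster_sparser(ranked_tracks, positions, cell_size, block_radius=1):
    """Walk the importance-ranked list and expand a track only if its grid cell
    and the surrounding cells are still free.

    ranked_tracks: track ids sorted by descending importance
    positions:     dict track id -> (x, y) screen position
    cell_size:     edge length of a quadratic grid cell in pixels
    block_radius:  number of surrounding cells to check (derived from cover size)
    """
    blocked = set()
    expanded = []
    for t in ranked_tracks:
        x, y = positions[t]
        cx, cy = int(x // cell_size), int(y // cell_size)
        surrounding = [(cx + dx, cy + dy)
                       for dx in range(-block_radius, block_radius + 1)
                       for dy in range(-block_radius, block_radius + 1)]
        if not any(cell in blocked for cell in surrounding):
            expanded.append(t)      # show the album cover for this track
            blocked.add((cx, cy))   # block its own cell
        # otherwise the track stays collapsed (displayed as a star)
    return expanded
```

Because the most important tracks are handled first, the selection is still useful even if the loop is interrupted, as noted below.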
The computational complexity of this sparser approach is linear in the number of objects to be considered but also depends on the radius of the surrounding that needs to be checked. The latter can be reduced by using a data structure for the raster that has $O(1)$ look-up complexity but higher costs for insertions, which happen far less frequently. This approach further has the nice property that it handles the most important objects first and thus, even if interrupted, returns a useful result.

5 Interaction

While the previous section covered the underlying techniques, this section describes how users can interact with the user-interface that is built on top of


them. Figure 6 shows a screenshot of the MusicGalaxy prototype.10 It allows several ways of interacting with the visualization: Users can explore the collection through common panning & zooming (Section 5.1). Alternatively, they can use the adaptive multi-focus technique introduced with this prototype (Section 5.2). Further, they can change the facet aggregation function parameters and this way adapt the view on the collection according to their preferences (Section 5.3). Hovering over a track displays its title and a double-click starts the playback that can be controlled by the player widget at the bottom of the interface. Apart from this, several display parameters can be changed such as the filtering mode (Section 4.5), the size of the displayed album covers or the visibility of the SpringLens overlay mesh.
10 A demo video is available at: http://www.dke-research.de/aucoma
5.1 Panning and Zooming

These are very common interaction techniques that can e.g. be found in programs for geo-data visualization or image editing that make use of the map metaphor. Panning shifts the displayed region whereas zooming decreases or increases it. (This does not affect the size of the thumbnails, which can be controlled separately using the PageUp and PageDn keys.) Using the keyboard, the user can pan with the cursor keys and zoom in and out with + and - respectively. Alternatively, the mouse can be used: Clicking and holding the left button while moving the mouse pans the display. The mouse wheel controls the zoom level. If the whole collection cannot be displayed, an overview window indicating the current section is shown in the top left corner, otherwise it is hidden. Clicking into the overview window centers the display around the respective point. Further, the user can drag the section indicator around, which also results in panning.
5.2 Focusing

This interaction technique allows to visualize and, to some extent, alleviate the neighborhood distortions introduced by the dimensionality reduction during the projection. The approach is based on a multi-focus fish-eye lens that is implemented using the SpringLens distortion technique (Section 4.3). It consists of a user-controlled primary focus and a neighborhood-driven secondary focus.
The primary focus is a common fish-eye lens. By moving this lens around (holding the right mouse button), the user can zoom into regions of interest. In contrast to the basic linear zooming function described in Section 5.1, this leads to a nonlinear distortion of the projection. As a result, the region of interest is enlarged, making more space to display details. At the same time, less interesting regions are compacted. This way, the user can closely inspect the region of interest without losing the overview as the field of view is not narrowed (as opposed to the linear zoom). The magnification factor of the lens can be changed using the mouse wheel while holding the right mouse button. The visual effect produced by the primary zoom resembles a 2-dimensional version of the popular cover flow effect.

Fig. 6. Screenshot of the MusicGalaxy prototype with visible overview window (top left), player (bottom) and SpringLens mesh overlay (blue). In this example, a strong album effect can be observed as for the track in primary focus, four tracks of the same album are nearest neighbors in secondary focus.

Fig. 7. SpringLens distortion with only primary focus (left) and additional secondary focus (right)
The secondary focus consists of multiple such fish-eye lenses. These lenses are smaller and cannot be controlled by the user but are automatically adapted depending on the primary focus. When the primary focus changes, the neighbor index (Section 4.3) is queried with the track closest to the center of focus. If nearest neighbors are returned that are not in the primary focus, secondary lenses are added at the respective positions. As a result, the overall distortion of the projection brings the distant nearest neighbors back closer to the focused region of interest. Figure 7 shows the primary and secondary focus with visible SpringLens mesh overlay.
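Put as Python-flavored pseudocode, the secondary focus update amounts to three steps; the lens helpers (in_primary_focus, add_secondary_focus, clear_secondary_foci) are hypothetical stand-ins, not the actual interface of the prototype:

```python
def update_secondary_focus(lens, projection, neighbor_index, primary_center):
    """Re-derive the secondary foci whenever the primary focus moves.

    lens:           object managing the primary and secondary fish-eye foci
    projection:     dict track id -> (x, y) projected position
    neighbor_index: dict track id -> nearest neighbor ids (Section 4.3)
    primary_center: (x, y) center of the user-controlled primary focus
    """
    # 1) find the track closest to the center of the primary focus
    focused = min(projection,
                  key=lambda t: (projection[t][0] - primary_center[0]) ** 2
                              + (projection[t][1] - primary_center[1]) ** 2)
    # 2) query its nearest neighbors and keep those outside the primary focus
    distant = [n for n in neighbor_index[focused]
               if not lens.in_primary_focus(projection[n])]
    # 3) place a small secondary fish-eye lens at each of their positions
    lens.clear_secondary_foci()
    for n in distant:
        lens.add_secondary_focus(projection[n])
```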
As it can become tiring to hold the right mouse button while moving the focus around, the latest prototype introduces a focus lock mode (toggled with the return key). In this mode, the user clicks once to start a focus change and a second time to freeze the focus. To indicate that the focus is currently being changed (i.e. mouse movement will affect the focus), an icon showing a magnifying glass is displayed in the lower left corner. The secondary focus is by default always updated instantly when the primary focus changes. This behavior can be disabled, resulting in an update of the secondary focus only once the primary focus does not change anymore.
5.3 Adapting the Aggregation Functions

Two facet control panels allow to adapt two facet distance aggregation functions
by choosing one of the function types listed in Section 4.1 (from a drop-down
menu) and adjusting weights for the individual facets (through sliders). The control panels are hidden in the screenshot Figure 6) but shown in Figure 3 (bottom)
that depicts the user-interaction. The rst facet distance aggregation function

Zoomable Interface for Multi-facet Exploration of Music Collections

291

is applied to derive the track-landmark distances from the facet distance cuboid
(Section 4.2). These distances are then used to compute the projection of the
collection. The second facet distance aggregation function is applied to identify
the nearest neighbors of a track and thus indirectly controls the secondary focus.
Changing the aggregation parameters results in a near real-time update of the display so that the impact of the change becomes immediately visible: In case of the parameters for the nearest neighbor search, some secondary focus region may disappear while somewhere else a new one appears with tracks now considered more similar. Here, the transitions are visualized smoothly due to the underlying physical simulation of the SpringLens grid. In contrast to this, a change of the projection similarity parameters has a more drastic impact on the visualization, possibly resulting in a complete re-arrangement of all tracks. This is because the LMDS projection technique produces solutions that are unique only up to translation, rotation, and reflection and thus, even a small parameter change may, e.g., flip the visualization. As this may confuse users, one direction of future research is to investigate how the position of the landmarks can be constrained during the projection to produce more gradual changes.
The two facet distance aggregation functions are linked by default as it is most natural to use the same distance measure for projection and neighbor retrieval. However, unlinking them and using e.g. orthogonal distance measures can lead to interesting effects: For instance, one may choose to project the collection based solely on acoustic facets and find nearest neighbors for the secondary focus through lyrics similarity. Such a setting would help to uncover tracks with a similar topic that (most likely) sound very different.

6 Evaluation

The development of MusicGalaxy followed a user-driven design approach [31] by iteratively alternating between development and evaluation phases. The first prototype [47] was presented at CeBIT 2010 (http://www.cebit.de), a German trade fair specialized in information technology, in early March 2010. During the fair, feedback was collected from a total of 112 visitors aged between 16 and 63 years. The general reception was very positive. The projection-based visualization was generally welcomed as an alternative to common list views. However, some remarked that additional semantics of the two display axes would greatly improve orientation. Young visitors particularly liked the interactivity of the visualization whereas older ones tended to have problems with this. They stated that the reason lay in the amount of information displayed, which could still be overwhelming. To address the problem, they proposed to expand only tracks in focus, increase the size of objects in focus (compared to the others) and hide the mesh overlay as the focus would be already visualized by the expanded and enlarged objects. All of these proposals have been integrated into the second prototype.
The second prototype was tested thoroughly by three testers. During these tests, the eye movements of the users were recorded with a Tobii T60 eye-tracker that can capture where and for how long the gaze of the participants rests (such resting points are referred to as fixation points). Using the adaptive SpringLens focus, the mouse generally followed the gaze that scans the border of the focus in order to decide on the direction to explore further. This resulted in a much smoother gaze-trajectory than the one observed during usage of panning and zooming, where the gaze frequently switched between the overview window and the objects of interest so as not to lose orientation. This indicates that the proposed approach is less tiring for the eyes. However, the testers criticized the controls used to change the focus, especially having to hold the right mouse button all the time. This led to the introduction of the focus lock mode and several minor interface improvements in the third version of the prototype [48] that are not explicitly covered here.
The remainder of this section describes the evaluation of the third MusicGalaxy prototype in a user study [45] with the aim to prove that the user-interface indeed helps during exploration. Screencasts of 30 participants solving an exploratory retrieval task were recorded together with eye-tracking data (again using a Tobii T60 eye-tracker) and web cam video streams. This data was used to identify emerging search strategies among all users and to analyze to what extent the primary and secondary focus were used. Moreover, first-hand impressions of the usability of the interface were gathered by letting the participants say aloud whatever they think, feel or remark as they go about their task (think-aloud protocol).
In order to ease the evaluation, the study was not conducted with the original MusicGalaxy user-interface prototype but with a modified version that can handle photo collections, depicted in Figure 8. It relies on widely used MPEG-7 visual descriptors (EdgeHistogram, ScalableColor and ColorLayout) [27,25] to compute the visual similarity (see [49] for further details), replacing the originally used music features and respective similarity facets. Using photo collections for evaluation instead of music has several advantages: It can be assured that none of the participants knows any of the photos in advance, which could otherwise introduce some bias. Dealing with music, this would be much harder to realize. Furthermore, similarity and relevance of photos can be assessed in an instant. This is much harder for music tracks and requires additional time for listening, especially if the tracks are previously unknown.
The following four questions were addressed in the user study:
1. How does the lens-based user-interface compare in terms of usability to common panning & zooming techniques that are very popular in interfaces using a map metaphor (such as Google Maps12)?
2. How much do users actually use the secondary focus or would a common fish-eye distortion (i.e. only the primary focus) be sufficient?
3. What interaction patterns do emerge?
4. What can be improved to further support the user and increase user satisfaction?
12 http://maps.google.com


Fig. 8. PhotoGalaxy, a modified version of MusicGalaxy for browsing photo collections, which was used during the evaluation (color scheme inverted)

To answer the first question, participants compared a purely SpringLens-based user-interface with a common panning & zooming interface and additionally a combination of both. For questions 2 and 3, the recorded interaction of the participants with the system was analyzed in detail. Answers to question 4 were collected by asking the users directly for missing functionality. Section 6.1 addresses the experimental setup in detail and Section 6.2 discusses the results.
6.1 Experimental Setup

At the beginning of the experiment, the participants were asked several questions to gather general information about their background. Afterwards, they were presented four image collections (described below) in fixed order. On the first collection, a survey supervisor gave a guided introduction to the interface and the possible user actions. Each participant could spend as much time as needed to get used to the interface. Once the participant was familiar with the controls, she or he continued with the other collections for which a retrieval task (described below) had to be solved without the help of the supervisor. At this point, the participants were divided into two groups. The first group used only panning & zooming (P&Z) as described in Section 5.1 on the second collection and only the SpringLens functionality (SL) described in Section 5.2 on the third one. The other group started with SL and then used P&Z. The order of the datasets stayed the same for both groups. (This way, effects caused by the order of the approaches and slightly varying difficulties among the collections are avoided.) The fourth collection could then be explored by using both P&Z and SL. (The functionality for adapting the facet distance aggregation functions described in Section 5.3 was deactivated for the whole experiment.) After the completion of the last task, the participants were asked to assess the usability of the different approaches. Furthermore, feedback was collected pointing out, e.g., missing functionality.

Table 2. Photo collections and topics used during the user study

Collection             Topics (number of images)
Melbourne & Victoria   (used for the introduction; no topics annotated)
Barcelona              Tibidabo (12), Sagrada Família (31), Stone Hallway in Park Güell (13), Beach & Sea (29), Casa Milà (16)
Japan                  Owls (10), Torii (8), Paintings (8), Osaka Aquarium (19), Traditional Clothing (35)
Western Australia      Lizards (17), Aboriginal Art (9), Plants (Macro) (17), Birds (21), Ningaloo Reef (19)
Test Collections. Four image collections were used during the study. They were drawn from a personal photo collection of the authors.13 Each collection comprises 350 images except the first collection (used for the introduction of the user-interface) which only contains 250 images. All images were scaled down to fit 600x600 pixels. For each of the collections 2 to 4, five non-overlapping topics were chosen and the images annotated accordingly. These annotations served as ground truth and were not shown to the participants. Table 2 shows the topics for each collection. In total, 264 of the 1050 images belong to one of the 15 topics.
13 The collections and topic annotations are publicly available under the Creative Commons Attribution-Noncommercial-Share Alike license, http://creativecommons.org/licenses/by-nc-sa/3.0/ Please contact stober@ovgu.de.
Retrieval Task. For the collections 2 to 4, the participants had to find five (or more) representative images for each of the topics listed in Table 2. As guidance, handouts were prepared that showed the topics (each one printed in a different color), an optional brief description and two or three sample images giving an impression of what to look for. Images representing a topic had to be marked with the topic's color. This was done by double clicking on the thumbnail, which opened a floating dialog window presenting the image at a larger scale and allowing the participant to classify the image to a predefined topic by clicking a corresponding button. As a result, the image was marked with the color representing the topic. Further, the complete collection could be filtered by highlighting all thumbnails classified to one topic. This was done by pressing the numeric key (1 to 5) for the respective topic number. Highlighting was done by focusing a fish-eye lens on every marked topic member and thus enlarging the corresponding thumbnails. It was pointed out that the decision whether an image was representative for a group was solely up to the participant and not judged otherwise. There was no time limit for the task. However, the participants were encouraged to skip to the next collection after approximately five minutes, as during this time already enough information would have been collected.
Tweaking the Nearest Neighbor Index. In the original implementation, at most five nearest neighbors are retrieved with the additional constraint that their distance to the query object has to be in the 1-percentile of all distances in the collection. (This avoids returning nearest neighbors that are not really close.) 264 of the 1050 images belonging to collections 2 to 4 have a ground truth topic label. For only 61 of these images, one or more of the five nearest neighbors belonged to the same topic, and only in these cases the secondary focus would have displayed something helpful for the given retrieval task. This led us to conclude that the feature descriptors used were not sophisticated enough to capture the visual intra-topic similarity. A lot more work would have been involved to improve the features, but this would have been beyond the scope of the study that aimed to evaluate the user-interface and most specifically the secondary focus, which differentiates our approach from the common fish-eye techniques. In order not to have the user evaluate the underlying feature representation and the respective similarity metric, we modified the index for the experiment: Every time the index was queried with an image with a ground truth annotation, the two most similar images from the respective topic were injected into the returned list of nearest neighbors. This ensured that the secondary focus would contain some relevant images.
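The modification is a thin wrapper around the index lookup; the sketch below mirrors the description with hypothetical data structures (a per-image topic map and a per-image ranking of same-topic images):

```python
def query_with_injection(index, topic_of, most_similar_same_topic, query, k=5):
    """Return up to k nearest neighbors; if the query image has a ground-truth
    topic, inject its two most similar same-topic images into the result."""
    neighbors = list(index[query])[:k]
    if topic_of.get(query) is not None:          # only for annotated images
        for img in most_similar_same_topic[query][:2]:
            if img not in neighbors:
                neighbors.insert(0, img)         # guarantee some relevant images
    return neighbors[:k]
```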
6.2 Results

The user study was conducted with 30 participants, all of them graduate or post-graduate students. Their age was between 19 and 32 years (mean 25.5) and 40% were female. Most of the test persons (70%) were computer science students, with half of them having a background in computer vision or user interface design. 43% of the participants stated that they take photos on a regular basis and 30% use software for archiving and sorting their photo collection. The majority (77%) declared that they are open to new user interface concepts.
Usability Comparison. Figure 9 shows the results from the questionnaire comparing the usability and helpfulness of the SL approach with the baseline P&Z. What becomes immediately evident is that half of the participants rated the SL interface as being significantly more helpful than the simple P&Z interface while being equally complicated in use. The intuitiveness of the SL was surprisingly rated slightly better than for the P&Z interface, which is an interesting outcome since we expected users to be more familiar with P&Z as it is more common in today's user interfaces (e.g. Google Maps). This, however, suggests that interacting with a fish-eye lens can be regarded as intuitive for humans when interacting with large collections. The combination of both got even better ratings but has to be considered noncompetitive here, as it could have had an advantage by always being the last interface used. Participants had already had time to get used to the handling of the two complementary interfaces. Moreover, since the collection did not change as for P&Z and SL, the combined interface might have had the advantage of being applied to a possibly easier collection with topics being better distributed or a slightly better working similarity measure, so that images of the same topic are found more easily.

Fig. 9. Usability comparison of common panning & zooming (P&Z), adaptive SpringLens (SL) and the combination of both with respect to helpfulness, simplicity and intuitivity. Ratings were on a 7-point-scale where 7 is best. The box plots show minimum, maximum, median and quartiles for N = 30.

Table 3. Percentage of marked images (N = 914) categorized by focus region and topic of the image in primary focus at the time of marking

focus region   primary   ext. primary   secondary   none
same topic     37.75     4.27           30.74       4.38
other topic    -         4.49           13.24       2.08
no focus       -         -              -           3.06
total          37.75     8.75           43.98       9.52
Usage of Secondary Focus. For this part, we restrict ourselves to the interaction with the last photo collection, where both P&Z and the lens could be used and the participants had had plenty of time (approximately 15 to 30 minutes depending on the user) for practice. The question to be answered is how much the users actually made use of the secondary focus, which always contains some relevant images if the image in primary focus has a ground truth annotation.14
For each image marked by a participant, the location of the image at the time of marking was determined. There are four possible regions: primary focus (only the central image), extended primary focus (region covered by the primary lens except the primary focus image), secondary focus and the remaining region. Further, there are up to three cases for each region with respect to the (user-annotated or ground truth) topic of the image in primary focus. Table 3 shows the frequencies of the resulting eight possible cases. (Some combinations are impossible, e.g., the existence of a secondary focus implies some image in primary focus.) The most interesting number is the one referring to images in secondary focus that belong to the same topic as the primary because this is what the secondary focus
14 Ground truth annotations were never visible to the users.


is supposed to bring up. It comes close to the percentage of the primary focus, which, not surprisingly, is the highest. Ignoring the topic, (extended) primary and secondary focus contribute almost equally, and less than 10% of the marked images were not in focus, i.e., discovered only through P&Z.
Emerging Search Strategies. For this part we again analyze only interaction with the combined interface. A small group of participants used P&Z excessively. They increased the initial thumbnail size in order to better perceive the depicted contents and chose to display all images as thumbnails. To reduce the overlap of thumbnails, they operated on a deeper zoom level and therefore had to pan a lot. The gaze data shows a tendency for systematic sequential scans, which were however difficult due to the scattered and irregular arrangement of the thumbnails. Further, some participants occasionally marked images not in focus because they were attracted by dominant colors (e.g., for the aquarium topic). Another typical strategy was to quickly scan through the collection by moving the primary focus, typically with a small thumbnail size and at a zoom level that showed most of the collection except the outer regions. In this case the attention was mostly on the (extended) primary focus region, with the gaze scanning in which direction to explore further and little to moderate attention on the secondary focus. Occasionally, participants would freeze the focus or slow down for some time to scan the whole display. In contrast to this rather continuous change of the primary focus, there was a group of participants that browsed the collection mostly by moving (in a single click) the primary focus to some secondary focus region, much like navigating an invisible neighborhood graph. Here, the attention was concentrated on the secondary focus regions.
User Feedback. Many participants had problems with an overcrowded primary fish-eye in dense regions. This was alleviated by temporarily zooming into the region, which lets the images drift further apart. However, there are possibilities that require less interaction, such as automatically spreading the thumbnails in focus with force-based layout techniques. When working on deeper zoom levels where only a small part of the collection is visible, the secondary focus was considered mostly useless as it was usually out of view. Further work could therefore investigate off-screen visualization techniques to facilitate awareness of and quick navigation to secondary focus regions out of view and to better integrate P&Z and SL. The increasing empty space at deep zoom levels should be avoided, e.g., by automatically increasing the thumbnail size as soon as all thumbnails can be displayed without overlap. An optional re-arrangement of the images in view into a grid layout may ease sequential scanning as preferred by some users. Another proposal was to visualize which regions have already been explored, similar to the (optionally time-restricted) fog of war used in strategy computer games. Some participants would welcome advanced filtering options such as a prominent color filter. An undo function or reverse playback of focus movement would be desirable and could easily be implemented by maintaining a list of the last images in primary focus. Finally, some participants remarked that it would be nice to generate the secondary focus for a set of images (belonging to the same topic).


In fact, it is even possible to adapt the similarity metric used for the nearest neighbor queries automatically to the task of finding more images of the same topic, as shown in recent experiments [49]. This opens an interesting research direction for future work.

Conclusion

A common approach for exploratory retrieval scenarios is to start with an overview from where the user can decide which regions to explore further. The focus-adaptive SpringLens visualization technique described in this paper addresses the following three major problems that arise in this context:
1. Approaches that rely on dimensionality reduction techniques to project the collection from high-dimensional feature space onto two dimensions inevitably face projection errors: some tracks will appear closer than they actually are and, on the other hand, some tracks that are distant in the projection may in fact be neighbors in the original space.
2. Displaying all tracks at once becomes infeasible for large collections because of limited display space and the risk of overwhelming the user with the amount of information displayed.
3. There is more than one way to look at a music collection or, more specifically, to compare two music pieces based on their features. Each user may have a different way, and a retrieval system should account for this.
The first problem is addressed by introducing a complex distortion of the visualization that adapts to the user's current region of interest and temporarily alleviates possible projection errors in the focused neighborhood. The amount of displayed information can be adapted by the application of several sparser filters. Concerning the third problem, the proposed user interface allows users to (manually) adapt the underlying similarity measure used to compute the arrangement of the tracks in the projection of the collection. To this end, weights can be specified that control the importance of different facets of music similarity, and further an aggregation function can be chosen to combine the facets.
Following a user-centered design approach with a focus on usability, a prototype system has been created by iteratively alternating between development and evaluation phases. For the final evaluation, an extensive user study including gaze analysis using an eye-tracker has been conducted with 30 participants. The results show that the proposed interface is helpful while at the same time being easy and intuitive to use.

Acknowledgments
This work was supported in part by the German National Merit Foundation,
the German Research Foundation (DFG) under the project AUCOMA, and the
European Commission under FP7-ICT-2007-C FET-Open, contract no. BISON211898. The user study was conducted in collaboration with Christian Hentschel


who also took care of the image feature extraction. The authors would further like to thank all testers and the participants of the study for their time and valuable feedback for further development, Tobias Germer for sharing his ideas and code of the original SpringLens approach [9], Sebastian Loose who has put a lot of work into the development of the filter and zoom components, the developers of CoMIRVA [40] and jAudio [28] for providing their feature extractor code and George Tzanetakis for providing insight into his MIREX 07 submission [52]. The Landmark MDS algorithm has been partly implemented using the MDSJ library [1].

References
1. Algorithmics Group: MDSJ: Java library for multidimensional scaling (version 0.2). University of Konstanz (2009)
2. Aucouturier, J.J., Pachet, F.: Improving timbre similarity: How high is the sky? Journal of Negative Results in Speech and Audio Sciences 1(1) (2004)
3. Baumann, S., Halloran, J.: An ecological approach to multimodal subjective music similarity perception. In: Proc. of 1st Conf. on Interdisciplinary Musicology (CIM 2004), Graz, Austria (April 2004)
4. Cano, P., Kaltenbrunner, M., Gouyon, F., Batlle, E.: On the use of FastMap for audio retrieval and browsing. In: Proc. of the 3rd Int. Conf. on Music Information Retrieval (ISMIR 2002) (2002)
5. De Berg, M., Cheong, O., Van Kreveld, M., Overmars, M.: Computational Geometry: Algorithms and Applications. Springer, New York (2008)
6. Diakopoulos, D., Vallis, O., Hochenbaum, J., Murphy, J., Kapur, A.: 21st century electronica: MIR techniques for classification and performance. In: Proc. of the 10th Int. Conf. on Music Information Retrieval (ISMIR 2009), pp. 465-469 (2009)
7. Donaldson, J., Lamere, P.: Using visualizations for music discovery. Tutorial at the 10th Int. Conf. on Music Information Retrieval (ISMIR 2009) (October 2009)
8. Gasser, M., Flexer, A.: FM4 Soundpark: Audio-based music recommendation in everyday use. In: Proc. of the 6th Sound and Music Computing Conference (SMC 2009), Porto, Portugal (2009)
9. Germer, T., Götzelmann, T., Spindler, M., Strothotte, T.: SpringLens: Distributed nonlinear magnifications. In: Eurographics 2006 - Short Papers, pp. 123-126. Eurographics Association, Aire-la-Ville (2006)
10. Gleich, M.R.D., Zhukov, L., Lang, K.: The World of Music: SDP layout of high dimensional data. In: InfoVis 2005 (2005)
11. van Gulik, R., Vignoli, F.: Visual playlist generation on the artist map. In: Proc. of the 6th Int. Conf. on Music Information Retrieval (ISMIR 2005), pp. 520-523 (2005)
12. Hitchner, S., Murdoch, J., Tzanetakis, G.: Music browsing using a tabletop display. In: Proc. of the 8th Int. Conf. on Music Information Retrieval (ISMIR 2007), pp. 175-176 (2007)
13. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proc. of the 13th ACM Symposium on Theory of Computing (STOC 1998), pp. 604-613. ACM, New York (1998)
14. Jolliffe, I.T.: Principal Component Analysis. Springer, Heidelberg (2002)


15. Julià, C.F., Jordà, S.: SongExplorer: a tabletop application for exploring large collections of songs. In: Proc. of the 10th Int. Conf. on Music Information Retrieval (ISMIR 2009), pp. 675-680 (2009)
16. Knees, P., Pohle, T., Schedl, M., Widmer, G.: Exploring Music Collections in Virtual Landscapes. IEEE MultiMedia 14(3), 46-54 (2007)
17. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biological Cybernetics 43(1), 59-69 (1982)
18. Kruskal, J., Wish, M.: Multidimensional Scaling. Sage, Thousand Oaks (1986)
19. Kulis, B., Grauman, K.: Kernelized locality-sensitive hashing for scalable image search. In: Proc. 12th Int. Conf. on Computer Vision (ICCV 2009) (2009)
20. Leitich, S., Topf, M.: Globe of Music - music library visualization using GeoSOM. In: Proc. of the 8th Int. Conf. on Music Information Retrieval (ISMIR 2007), pp. 167-170 (2007)
21. Lillie, A.S.: MusicBox: Navigating the space of your music. Master's thesis, MIT (2008)
22. Lloyd, S.: Automatic Playlist Generation and Music Library Visualisation with Timbral Similarity Measures. Master's thesis, Queen Mary University of London (2009)
23. Lübbers, D.: SoniXplorer: Combining visualization and auralization for content-based exploration of music collections. In: Proc. of the 6th Int. Conf. on Music Information Retrieval (ISMIR 2005), pp. 590-593 (2005)
24. Lübbers, D., Jarke, M.: Adaptive multimodal exploration of music collections. In: Proc. of the 10th Int. Conf. on Music Information Retrieval (ISMIR 2009), pp. 195-200 (2009)
25. Lux, M.: Caliph & Emir: MPEG-7 photo annotation and retrieval. In: Proc. of the 17th ACM Int. Conf. on Multimedia (MM 2009), pp. 925-926. ACM, New York (2009)
26. Mandel, M., Ellis, D.: Song-level features and support vector machines for music classification. In: Proc. of the 6th Int. Conf. on Music Information Retrieval (ISMIR 2005), pp. 594-599 (2005)
27. Martinez, J., Koenen, R., Pereira, F.: MPEG-7: The generic multimedia content description standard, part 1. IEEE MultiMedia 9(2), 78-87 (2002)
28. McEnnis, D., McKay, C., Fujinaga, I., Depalle, P.: jAudio: A feature extraction library. In: Proc. of the 6th Int. Conf. on Music Information Retrieval (ISMIR 2005), pp. 600-603 (2005)
29. Mörchen, F., Ultsch, A., Nöcker, M., Stamm, C.: Databionic visualization of music collections according to perceptual distance. In: Proc. of the 6th Int. Conf. on Music Information Retrieval (ISMIR 2005), pp. 396-403 (2005)
30. Neumayer, R., Dittenbach, M., Rauber, A.: PlaySOM and PocketSOMPlayer, alternative interfaces to large music collections. In: Proc. of the 6th Int. Conf. on Music Information Retrieval (ISMIR 2005), pp. 618-623 (2005)
31. Nielsen, J.: Usability engineering. In: Tucker, A.B. (ed.) The Computer Science and Engineering Handbook, pp. 1440-1460. CRC Press, Boca Raton (1997)
32. Nürnberger, A., Klose, A.: Improving clustering and visualization of multimedia data using interactive user feedback. In: Proc. of the 9th Int. Conf. on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU 2002), pp. 993-999 (2002)
33. Oliver, N., Kreger-Stickles, L.: PAPA: Physiology and purpose-aware automatic playlist generation. In: Proc. of the 7th Int. Conf. on Music Information Retrieval (ISMIR 2006) (2006)


34. Pampalk, E., Dixon, S., Widmer, G.: Exploring music collections by browsing different views. In: Proc. of the 4th Int. Conf. on Music Information Retrieval (ISMIR 2003), pp. 201-208 (2003)
35. Pampalk, E., Rauber, A., Merkl, D.: Content-based organization and visualization of music archives. In: Proc. of the 10th ACM Int. Conf. on Multimedia (MULTIMEDIA 2002), pp. 570-579. ACM Press, New York (2002)
36. Pauws, S., Eggen, B.: PATS: Realization and user evaluation of an automatic playlist generator. In: Proc. of the 3rd Int. Conf. on Music Information Retrieval (ISMIR 2002) (2002)
37. Rauber, A., Pampalk, E., Merkl, D.: Using psycho-acoustic models and self-organizing maps to create a hierarchical structuring of music by musical styles. In: Proc. of the 3rd Int. Conf. on Music Information Retrieval (ISMIR 2002) (2002)
38. Salton, G., Buckley, C.: Term weighting approaches in automatic text retrieval. Information Processing & Management 24(5), 513-523 (1988)
39. Sarmento, L., Gouyon, F., Costa, B., Oliveira, E.: Visualizing networks of music artists with RAMA. In: Proc. of the Int. Conf. on Web Information Systems and Technologies, Lisbon (2009)
40. Schedl, M.: The CoMIRVA Toolkit for Visualizing Music-Related Data. Technical report, Johannes Kepler University Linz (June 2006)
41. Shneiderman, B.: Tree visualization with tree-maps: 2-d space-filling approach. ACM Trans. Graph. 11(1), 92-99 (1992)
42. de Silva, V., Tenenbaum, J.: Sparse multidimensional scaling using landmark points. Tech. rep., Stanford University (2004)
43. de Silva, V., Tenenbaum, J.B.: Global versus local methods in nonlinear dimensionality reduction. In: Proc. of the 3rd Int. Conf. on Music Information Retrieval (ISMIR 2002), pp. 705-712 (2002)
44. Stavness, I., Gluck, J., Vilhan, L., Fels, S.S.: The mUSICtable: A map-based ubiquitous system for social interaction with a digital music collection. In: Kishino, F., Kitamura, Y., Kato, H., Nagata, N. (eds.) ICEC 2005. LNCS, vol. 3711, pp. 291-302. Springer, Heidelberg (2005)
45. Stober, S., Hentschel, C., Nürnberger, A.: Evaluation of adaptive SpringLens - a multi-focus interface for exploring multimedia collections. In: Proc. of the 6th Nordic Conference on Human-Computer Interaction (NordiCHI 2010), Reykjavik, Iceland (October 2010)
46. Stober, S., Nürnberger, A.: Towards user-adaptive structuring and organization of music collections. In: Detyniecki, M., Leiner, U., Nürnberger, A. (eds.) AMR 2008. LNCS, vol. 5811, pp. 53-65. Springer, Heidelberg (2010)
47. Stober, S., Nürnberger, A.: A multi-focus zoomable interface for multi-facet exploration of music collections. In: Proc. of the 7th Int. Symposium on Computer Music Modeling and Retrieval (CMMR 2010), Malaga, Spain, pp. 339-354 (June 2010)
48. Stober, S., Nürnberger, A.: MusicGalaxy - an adaptive user-interface for exploratory music retrieval. In: Proc. of the 7th Sound and Music Computing Conference (SMC 2010), Barcelona, Spain, pp. 382-389 (July 2010)
49. Stober, S., Nürnberger, A.: Similarity adaptation in an exploratory retrieval scenario. In: Detyniecki, M., Knees, P., Nürnberger, A., Schedl, M., Stober, S. (eds.) Post-Proceedings of the 8th International Workshop on Adaptive Multimedia Retrieval (AMR 2010), Linz, Austria (2010)


50. Stober, S., Steinbrecher, M., Nürnberger, A.: A survey on the acceptance of listening context logging for MIR applications. In: Baumann, S., Burred, J.J., Nürnberger, A., Stober, S. (eds.) Proc. of the 3rd Int. Workshop on Learning the Semantics of Audio Signals (LSAS), Graz, Austria, pp. 45-57 (December 2009)
51. Torrens, M., Hertzog, P., Arcos, J.L.: Visualizing and exploring personal music libraries. In: Proc. of the 5th Int. Conf. on Music Information Retrieval (ISMIR 2004) (2004)
52. Tzanetakis, G.: Marsyas submission to MIREX 2007. In: Proc. of the 8th Int. Conf. on Music Information Retrieval (ISMIR 2007) (2007)
53. Vignoli, F., Pauws, S.: A music retrieval system based on user driven similarity and its evaluation. In: Proc. of the 6th Int. Conf. on Music Information Retrieval (ISMIR 2005), pp. 272-279 (2005)
54. Whitman, B., Ellis, D.: Automatic record reviews. In: Proc. of the 5th Int. Conf. on Music Information Retrieval (ISMIR 2004) (2004)
55. Williams, C.K.I.: On a connection between kernel PCA and metric multidimensional scaling. Machine Learning 46(1-3), 11-19 (2002)
56. Wolter, K., Bastuck, C., Gärtner, D.: Adaptive user modeling for content-based music retrieval. In: Detyniecki, M., Leiner, U., Nürnberger, A. (eds.) AMR 2008. LNCS, vol. 5811, pp. 40-52. Springer, Heidelberg (2010)

A Database Approach to Symbolic Music Content Management

Philippe Rigaux¹ and Zoé Faget²

¹ Cnam, Paris
rigaux@lamsade.dauphine.fr
² Lamsade, Univ. Paris-Dauphine
zoe@armadillo.fr

Abstract. The paper addresses the problem of content-based access to large repositories of digitized music scores. We propose a data model and query language that allow an in-depth management of musical content. In order to cope with the flexibility of music material, the language is designed to easily incorporate user-defined functions at early steps of the query evaluation process. We describe our architectural vision, develop a formal description of the language, and illustrate a user-friendly syntax with several classical examples of symbolic music information retrieval.

Keywords: Digital Libraries, Data Model, Time Series, Musicological Information Management.

1 Introduction

The presence of music on the web has grown exponentially over the past decade. Music representation is multiple (audio files, MIDI files, printable music scores...) and is easily accessible through numerous platforms. Given the availability of several compact formats, the main representation is by means of audio files, which provide immediate access to music content and are easily spread, sampled and listened to. However, extracting structured information from an audio file is a difficult (if not impossible) task, since it is subject to the subjectivity of interpretation. On the other hand, symbolic music representation, usually derived from musical scores, enables exploitation scenarios different from what audio files may offer. The very detailed and unambiguous description of the music content is of high interest for communities of music professionals, such as musicologists, music publishers, or professional musicians. New online communities of users arise, with an interest in a more in-depth study of music than what average music lovers may look for.
The interpretation of the music content information (structure of musical pieces, tonality, harmonic progressions...) combined with meta-data (historic and geographic context, author and composer names...) is a matter of human expertise. Specific music analysis tools have been developed by music professionals for centuries, and should now be scaled to a larger level in order to provide scientific and efficient analysis of large collections of scores.



Another fundamental need of such online communities, this time shared with more traditional platforms, is the ability to share content and knowledge, as well as annotate, compare and correct all this available data. This Web 2.0 space with user-generated content helps improve and accelerate research, with the added bonus of making available to a larger audience sources which would otherwise remain confidential. Questions of copyright, security and controlled contributions are part of the social network problematic.
To summarize, a platform designed to be used mainly, but not only, by music professionals should offer classic features such as browsing and rendering, but also the ability to upload new content, annotate scores, search by content (exact and similarity search), and several tools for music content manipulation and analysis.
Most of the proposals devoted so far to analysis methods or similarity searches on symbolic music focus on the accuracy and/or relevancy of the result, and implicitly assume that these procedures apply to a small collection [2,3,13,16]. While useful, this approach gives rise to several issues when the collection consists of thousands of scores with heterogeneous descriptions.
A first issue is related to software engineering and architectural concerns. A large score digital library provides several services to many different users or applications. Consistency, reliability, and security concerns call for the definition of a single consistent data management interface for these services. In particular, one can hardly envisage the publication of ad-hoc search procedures that merely expose the bunch of methods and algorithms developed for each specific retrieval task. The multiplication of these services would quickly overwhelm external users. Worse, the combination of these functions, which is typically a difficult matter, would be left to external applications. In complex systems, the ability to compose fluently the data manipulation operators is a key to both expressive power and computational efficiency.
Therefore, online communities dealing with scores and/or musical content are often limited either by the size of their corpus or the range of possible operations, with only one publicized strong feature. Examples of online communities include Mutopia [26], MelodicMatch [27] or Musipedia [28]. Wikifonia [29] offers a wider range of services, allowing registered users to publish and edit sheet music. One can also cite the OMRSYS platform described in [8].
A second issue pertains to scalability. With the ongoing progress in digitization, optical recognition and user content generation, we must be ready to face an important growth of the volume of music content that must be handled by institutions, libraries, or publishers. Optimizing accesses to large datasets is a delicate matter which involves many technical aspects that embrace physical storage, indexing and algorithmic strategies. Such techniques are usually supported by a specialized data management system which relieves applications of the burden of low-level and intricate implementation concerns.
To our knowledge, no system able to handle large heterogeneous Music Digital Libraries while smoothly combining data manipulation operators exists at this moment. The HumDrum toolkit is a widely used automated musicological


analysis tool [14,22], but its representation remains at a low level. A HumDrum-based system will lack flexibility and will depend too much on how files are stored. This makes the development of indexing or optimization techniques difficult. Another possible approach would be a system based on MusicXML, an XML-based file format [12,24]. It has been suggested recently that XQuery may be used over MusicXML for music queries [11], but XQuery is a general-purpose query language which hardly adapts to the specifics of symbolic music manipulation.
Our objective in this paper is to lay the ground for a score management system with all the features of a Digital Scores Library combined with content manipulation operators. Among other things, a crucial component of such a system is a logical data model specifically designed for symbolic music management and its associated query language. Our approach is based on the idea that the management of structured scores corresponds, at the core level, to a limited set of fundamental operations that can be defined and implemented once and for all. We also take into account the fact that the wide range of user needs calls for the ability to associate these operations with user-defined functions at early steps of the query evaluation process. Modeling the invariant operators and combining them with user-defined operations is the main goal of our design effort. Among numerous advantages, this allows the definition of a stable and robust query() service which does not need ad-hoc extensions as new requirements arrive.
We do not claim (yet) that our model and its implementation will scale easily, but a high-level representation like our model is a prerequisite to allow the necessary flexibility for such future optimization.
Section 3.2 describes in further detail the Neuma platform, a Digital Score Library [30] devoted to large collections of monodic and polyphonic music from the French Modern Era (16th-18th centuries). One of the central pieces of the architecture is the data model that we present in this paper. The language described in Section 5 offers a generic mechanism to search and transform music notation.
The rest of this paper first discusses related work (Section 2). Section 3 presents the motivation and the context of our work. Section 4 then exposes the formal foundations of our model. Section 6 concludes the paper.

2 Related Work

The past decade has witnessed a growing interest in techniques for representing, indexing and searching (by content) music documents. The domain is commonly termed Music Information Retrieval (MIR) although it covers many aspects beyond the mere process of retrieving documents. We refer the reader to [19] for an introduction. Systems can manipulate music either as audio files or in symbolic form. The symbolic representation offers a structured representation which is well suited for content-based accesses, sophisticated manipulations, and analysis [13].
An early attempt to represent scores as structured files and to develop search and analysis functions is the HumDrum format. Both the representation and


the procedures are low-level (text files, Unix commands), which makes them difficult to integrate in complex applications. Recent works try to overcome these limitations [22,16]. Musipedia proposes several kinds of interfaces to search the database by content. MelodicMatch is a similar software analysing music through pattern recognition, enabling search for musical phrases in one or more pieces. MelodicMatch can search for melodies, rhythms and lyrics in MusicXML files.
The computation of similarity between music fragments is a central issue in MIR systems [10]. Most proposals focus on comparisons of melodic profiles. Because music is subject to many small variations, approximate search is in order, and the problem is actually that of finding nearest neighbors to a given pattern. Many techniques have been experimented with, varying with the melodic encoding and the similarity measure. See [9,4,1,7] for some recent proposals. The Dynamic Time Warping (DTW) distance is a well-known popular measure in speech recognition [21,20]. It allows the non-linear mapping of one signal to another by minimizing the distance between the two. The DTW distance is usually chosen over the less flexible Euclidean distance for time series alignment [5]. The DTW computation is rather slow, but recent works show that it can be efficiently indexed [25,15].
We are not aware of any general approach to model and query music notation. A possible approach would be to use XQuery over MusicXML documents as suggested in [11]. XQuery is a general-purpose query language, and its use for music scores yields complicated expressions and hardly adapts to the specifics of the objects' representation (e.g., temporal sequences). We believe that a dedicated language is both more natural and more efficient. The temporal function approach outlined here can be related to time series management [17].

3 Architecture

3.1 Approach Overview

Figure 1 outlines the main components of a Score Management System built around our data model. Basically, the architecture is that of a standard DBMS.

Fig. 1. Approach overview: a symbolic music application exchanges queries and results with a large-scale score management system composed of the symbolic music model, user functions, the user query language, the query algebra, time series manipulation primitives, and a storage and indexing layer.


We actually position our design as an extension of the relational model, making it possible to incorporate the new features as components of an extensible relational system. The model consists of a logical view of score content, along with a user query language and an algebra that operates in closed form (i.e., each operator consumes and produces instances of the model). The algebra can be closely associated with a library of user-defined functions which must be provided by the application and allow the query language to be tailored to a specific domain. Functions in such a library must comply with constraints that will be developed in the next section.
This approach brings, to the design and implementation of applications that deal with symbolic music, the standard and well-known advantages of specialized data management systems. Let us just mention the few most important: (i) the ability to rely on a stable, well-defined and expressive data model, (ii) independence between logical modeling and physical design, saving the need to confront programmers with intricate optimization issues at the application level, and (iii) the efficiency of set-based operators and indexes provided by the data system.
Regarding the design of our model, the basic idea is to extend the relational approach with a new type of attribute: the time series type. Each such attribute represents a (peculiar) temporal function that maps a discrete temporal space to values in some domain.
The model supports the representation of polyphonic pieces composed of voices, each voice being a sequence of events in some music-related domain (notes, rests, chords) such that only one event occurs at a given instant. Adding new domains allows the concept to be extended to any sequence of symbols taken from a finite alphabet. This covers monodies, text (where the alphabet consists of syllables) as well as, potentially, any sequence of music-related information (e.g., fingerings, performance indications, etc.).
We model a musical piece as Synchronized Time Series (STS). Generally speaking, a time series is an ordered sequence of values taken at equal time intervals. Some of the traditional domains that explicitly produce and exploit time series are, among others, sales forecasting, stock market analysis, process and quality control, etc. [6]. In the case of music notation, time is not to be understood as a classic calendar (i.e., days or timestamps) as in the previously mentioned examples but rather, at a more abstract level, as a sequence of events where the time interval is the smallest note duration in the musical piece.
Consider now a digital library that stores music information and provides query services on music collections. A music piece consists of one or several parts which can be modelled as time series, each represented by a score. Fig. 2 is a simple example of a monodic piece, whereas polyphonic music pieces (Fig. 3) exhibit a synchronization of several parts.
The temporal domain of interest here is an abstract representation of the temporal features of the time series. Essentially, these features are (i) the order of the musical events (notes, rests), (ii) the relative duration of the events, and (iii) the required synchronization of the different parts (here lyrics and notes).
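To make this event-per-instant view concrete, here is a minimal OCaml sketch of a voice as a sequence indexed by the abstract time domain; the types (a pitch as a string, a Null value for instants where nothing new happens) are our own simplification for illustration, not the system's actual representation.

(* Minimal sketch (assumed types, not the actual Neuma implementation):
   a voice is a sequence indexed by the discrete time domain, and each
   instant holds either an event or the null value. *)
type pitch = string                       (* e.g. "C4" *)

type event =
  | Null                                  (* null value: no new event at this instant *)
  | Note of pitch
  | Rest

type voice = event array                  (* index t = instant t of the time domain *)

(* Two voices are synchronized when they share the same time domain. *)
let synchronized (v1 : voice) (v2 : voice) : bool =
  Array.length v1 = Array.length v2

let melody : voice =
  [| Note "C4"; Null; Note "E4"; Null; Rest; Null; Note "G4"; Null |]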


Fig. 2. A monodic score

Fig. 3. A polyphonic score

Here are a few examples of queries:

1. Get the scores whose lyrics contain the word "conseil" (selection)
2. Get the melodic part that corresponds to the word "conseil" (selection and temporal join)
3. Find a melodic pattern (search by similarity)

When a collection of musical scores is to be studied, queries regarding common features are also of interest. For example:

1. Find all musical pieces in the collection starting with the note C and getting to G in 5 time units.
2. Find whether minor chords occur more often than major chords synchronized with the word "tot" in Bach's work.
We provide an algebra that operates in closed form over collections of scores. We show that this algebra expresses usual operations (e.g., similarity search), achieves high expressiveness through unbounded composition of operators and natively incorporates the ability to introduce user-defined functions in query expressions, in order to meet the needs of a specific approach to symbolic music manipulation. Finally, we introduce a user-friendly query language to express algebraic operations. We believe that our approach defines a sound basis for the implementation of a specialized query language devoted to score collection management at large scale.
In the next section, we describe the Neuma platform, a score management system built around those concepts.
3.2 Description of the Neuma Platform

The Neuma platform is a digital library devoted to symbolic music content.


It consists of a repository dedicated to the storage of large collections of digital scores, where users/applications can upload their documents. It also proposes a family of services to interact with those scores (query, publish, annotate,
transform and analyze).


The Neuma platform is meant to interact with distant web applications with local databases that store corpus-specific information. The purpose of Neuma is to manage all music content information and leave contextual data to the client application (author, composer, date of publication, ...).
To send a new document, an application calls the register() service. So far only MusicXML documents can be used to exchange musical descriptions, but any format could be used provided the corresponding mapping function is in place. Since MusicXML is widely used, it is sufficient for now. The mapping function extracts a representation of the music content of the document which complies with our data model.
To publish a score (whether it is a score featured in the database or a modified one), the render() service is called. The render() service is based on the Lilypond package. The generator takes an instance of our model as input and converts it into a Lilypond file. The importance of a unified data model appears clearly in such an example: the render service is based on the model, making it rather easy to visualize a transformed score, when it would be a lot more difficult to do so if it was instead solely based on the document format.
A large collection of scores would be useless if there was no appropriate query() service allowing reliable search by content. As explained before, the Neuma digital library stores all music content (originating from different collections, potentially with heterogeneous descriptions) in its repository and leaves descriptive contextual data specific to collections in local databases. Regardless of their original collection, music content complies with our data model so that it can be queried accordingly. Several query types are offered: exact, transposed, with or without rhythm, or contour, which only takes into account the shape of the input melody. The query() service combines content search with descriptive data. A virtual keyboard is provided to enter music content, and search fields can be filled to address the local databases.
The Neuma platform also provides an annotate() service. The annotations are a great way to enrich the digital library and make sure it keeps growing and improving. In order to use the annotate() service, one first selects part of a score (a set of elements of the score) and enters information about this portion. There are different kinds of annotations: free text (useful for performance indications), or pre-selected terms from an ontology (for identifying music fragments). Annotations can be queried alongside the other criteria previously mentioned.

4 The Data Model

4.1 Preliminaries

A musical domain dom_music is a product domain combining heterogeneous musical information. For example, the domain of a simple monodic score is dom_music = dom_pitch × dom_rythm. Any type of information that can be extracted from symbolic music (such as measures, alterations, lyrics...) can be added to the domain. Each domain contains two distinguished values: the neutral value ε and the null value ⊥.


The Boolean operations ∧ (conjunction) and ∨ (disjunction) verify, for any a ∈ dom, a ∧ ⊥ = ⊥, a ∧ ε = a and a ∨ ⊥ = a ∨ ε = a. In some cases, ⊥ and ε can also be viewed as false and true.
With a given musical domain comes a set of operations provided by the user and related to this specific domain. When managing a collection of choir parts, a function such as max(), which computes the highest pitch, is meaningful, but becomes irrelevant when managing a collection of Led Zeppelin tablatures. The operators designed to single out each part of a complex music domain are presented in the extended relational algebra.
We subdivide the time domain T into a defined, regular, repeated pattern. The subdivision of time is the smallest interval between two musical events. The time domain T is then a countable ordered set isomorphic to N. We introduce a set of internal time functions, designed to operate on the time domain T. We define as L the class of functions from T to T, otherwise known as internal time functions (ITF). Any λ ∈ L can be used to operate on the time domain as long as the user finds this operation meaningful. An important sub-class of L is the set of linear functions of the form t ↦ nt + m. We denote them temporal scaling functions in what follows, and further distinguish the families of warping functions of the form warp_m : T → T, t ↦ mt and shifting functions shift_n : T → T, t ↦ t + n. Shifting functions are used to ignore the first n events of a time series, while warping functions single out one event out of m.
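As a small illustration, internal time functions can be sketched in OCaml as plain functions over an integer time domain; the definitions below are ours and only mirror the shift and warp families just described.

(* Sketch: internal time functions (ITFs) as plain functions on the discrete
   time domain T, modelled here by non-negative integers. *)
let shift n : int -> int = fun t -> t + n    (* ignores the first n events *)
let warp m : int -> int = fun t -> m * t     (* singles out one event out of m *)

(* Composing two ITFs stays inside the class L of functions from T to T. *)
let compose (f : int -> int) (g : int -> int) : int -> int = fun t -> f (g t)

(* Example: every 4th event of the series obtained after skipping 8 events. *)
let every_fourth_after_eight = compose (shift 8) (warp 4)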
A musical time series (or voice) is a mapping from a time domain T into a musical domain dom_music. When dealing with a collection of scores sharing a number of common properties, we introduce the schema of a relation. The schema distinguishes atomic attribute names (score_id, author, year...) and time series (or voice) names. We denote by TS([dom]) the type of these attributes, where [dom] is the domain of interest. Here is, for example, the schema of a music score:

Score(Id : int, Composer : string, Voice : TS(vocal), Piano : TS(polyMusic)).

Note that the time domain is shared by the vocal part and the piano part. For the same score, one could have made the different choice of a schema where vocal and piano are represented in the same time series of type TS([vocal × polyMusic]).
The domain vocal adds the lyrics domain to the classic music domain:

dom_vocals = dom_pitch × dom_rythm × dom_lyrics.

The domain polyMusic is the product

dom_polymusic = (dom_pitch × dom_rythm)^N.
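One possible concrete reading of this schema, with simplified domains assumed purely for the sake of the example, is the following OCaml sketch:

(* Sketch of the Score schema with assumed, simplified domains: atomic
   attributes plus time series attributes sharing one time domain. *)
type pitch = string
type vocal_event = { v_pitch : pitch option; v_lyric : string option }  (* None = null *)
type piano_event = pitch list              (* a possibly empty chord at this instant *)

type score = {
  id       : int;
  composer : string;
  voice    : vocal_event array;            (* TS(vocal)     *)
  piano    : piano_event array;            (* TS(polyMusic) *)
}
(* Sharing the time domain means both arrays have the same length;
   index t stands for instant t. *)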
We will now define two sets of operators gathered into two algebras: the (extended) relational algebra and the time series algebra.

4.2 The Relational Algebra Alg(R)

The Alg(R) algebra consists of the usual operators selection σ, product ×, union ∪ and difference −, along with an extended projection π. We present simple examples of those operators.

Selection, σ. Select all scores composed by Louis Couperin:

σ_{author='Louis Couperin'}(Score)

Select all scores whose lyrics contain "Heureux Seigneur":

σ_{lyrics(voice) contains 'Heureux Seigneur'}(Score)

Projection, π. We want the vocal parts from the Score schema without the piano part. We project the piano out:

π_{vocals}(Score)

Product, ×. Consider a collection of duets, split into the individual vocal parts of male and female singers, with the following schemas:

Male_Part(Id : int, Voice : TS(vocals)),
Female_Part(Id : int, Voice : TS(vocals)).

To get the duet scores, we cross product a female part and a male part. We get the relation

Duet(Id : int, Male_V : TS(vocals), Female_V : TS(vocals)).

Note that the time domain is implicitly shared. In itself, the product does not have much interest, but together with the selection operator it becomes the join operator ⋈. In the previous example, we should not blindly associate any male and female vocal parts, but only the ones sharing the same Id:

σ_{M.Id=F.Id}(Male × Female) ≡ Male ⋈_{Id=Id} Female.

Union, ∪. We want to join the two consecutive movements of a piece which have two separate score instances:

Score = Score_pt1 ∪ Score_pt2.

The time series equivalent of the null attribute is the empty time series, for which each event is ⊥. Beyond classical relational operators, we introduce an emptiness test ∅? that operates on voices and is modeled as: ∅?(s) = false if ∃t, s(t) ≠ ⊥, else ∅?(s) = true. The emptiness test can be introduced in selection formulas of the σ operator.


Consider once more the relation Score. We want to select all scores featuring the word "Ave" in the lyrics part of V. We need a user function m : dom_lyrics → {ε, ⊥} such that m('Ave') = ε, and ⊥ otherwise. Lyrics not containing "Ave" are transformed into the empty time series t ↦ ⊥, ∀t. The algebraic expression is:

σ_{¬∅?(W)}(π_{[Id,V,W:m(lyrics(V))]}(Score)).
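The emptiness test and the 'Ave' selection can be mirrored by the following OCaml sketch; the types (a lyrics voice as an array of optional words) are assumptions made for illustration, not the paper's implementation.

(* Sketch: a lyrics voice as an array of optional words (None = null value). *)
type lyrics_voice = string option array

(* Emptiness test: true iff every event is the null value. *)
let is_empty (s : lyrics_voice) : bool =
  Array.for_all (fun e -> e = None) s

(* User function m: keep only the events equal to "Ave". *)
let m (s : lyrics_voice) : lyrics_voice =
  Array.map (function Some "Ave" as e -> e | _ -> None) s

(* Selection: keep the scores whose derived voice W = m(lyrics) is not empty. *)
let scores_with_ave (scores : (int * lyrics_voice) list) : (int * lyrics_voice) list =
  List.filter (fun (_, lyrics) -> not (is_empty (m lyrics))) scores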
4.3 The Time Series Algebra Alg(TS)

We now present the operators of the time series algebra Alg(TS) = (∘, ⊕, A). Each operator takes one or more time series as input and produces a time series. This way, operating in closed form, operators can be composed. They allow, in particular: alteration of the time domain in order to focus on specific instants (external composition); applying a user function to one or more time series to form a new one (addition operator); windowing of time series fragments for matching purposes (aggregation).
In what follows, we take an instance s of the Score schema to run several examples.
The external composition ∘ composes a time series s with an internal temporal function λ. Assume our score s has two movements, and we only want the second one. Let shift_n be an element of the shift family of functions, parametrized by a constant n ∈ N. For any t ∈ T, s ∘ shift_n(t) = s(t + n). In other words, s ∘ shift_n is the time series extracted from s where the first n events are ignored. We compose s with the shift_L function, where L is the length of the first movement, and the resulting time series s ∘ shift_L is our result.
Imagine now that we want only the first note of every measure. Assuming we are in 4/4 and the time unit is one sixteenth note (one fourth of a beat), we compose s with warp_16. The time series s ∘ warp_16 is the time series where only one out of sixteen events is considered, therefore the first note of every measure.
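A minimal OCaml sketch of this external composition, viewing a voice as a total function over instants (again an illustrative simplification, not the actual implementation), could read:

(* Sketch: external composition of a voice with an ITF. *)
type event = Null | Note of string

let voice_of_array (a : event array) : int -> event =
  fun t -> if t >= 0 && t < Array.length a then a.(t) else Null

(* s composed with l : read s through the internal time function l. *)
let compose (s : int -> event) (l : int -> int) : int -> event = fun t -> s (l t)

let shift n t = t + n
let warp m t = m * t

let s = voice_of_array [| Note "C"; Null; Note "E"; Null; Note "G"; Null |]
let second_part = compose s (shift 2)   (* drops the first two instants *)
let downbeats   = compose s (warp 2)    (* keeps one event out of two *)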
We now give an example of the addition operator ⊕. Let dom_pitch be the domain of all musical notes and dom_int the domain of all musical intervals. We can define an operation from dom_pitch × dom_pitch to dom_int, called the harm operator, which takes two notes as input and computes the interval between these two notes. Given two time series each representing a vocal part, for instance V1 = soprano and V2 = alto, we can define the time series

V1 ⊕_harm V2

of the harmonic progression (i.e., the sequence of intervals realized by the juxtaposition of the two voices).
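A hedged OCaml sketch of such a pointwise addition, with pitches assumed to be MIDI numbers so that harm is a simple difference, might look like this:

(* Sketch of the addition operator: combine two synchronized voices pointwise
   with a user function. *)
type 'a ts = 'a option array                 (* None = null value *)

let add2 (op : 'a -> 'b -> 'c) (v1 : 'a ts) (v2 : 'b ts) : 'c ts =
  Array.init (min (Array.length v1) (Array.length v2)) (fun t ->
    match v1.(t), v2.(t) with
    | Some a, Some b -> Some (op a b)
    | _ -> None)

(* harm: interval in semitones between two pitches given as MIDI numbers. *)
let harm (soprano : int) (alto : int) : int = soprano - alto

(* V1 (+)_harm V2 : the harmonic progression of the two voices. *)
let harmonic_progression (v1 : int ts) (v2 : int ts) : int ts = add2 harm v1 v2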
Last, we present the aggregation mechanism A. A typical operation that we cannot yet obtain is windowing which, at each instant, considers a local part of a voice s and derives a value from this restricted view. A canonical example is pattern matching. So far we have no generic way to compare a pattern P with all subsequences of a time series s. The intuitive way to do pattern matching is to build all subsequences from s and compare P with each of them, using an appropriate distance.
Fig. 4. The derivation/aggregation mechanism: (a) the function s(t); (b) the sequence of derived functions d_Λ(s), obtained by a derivation step followed by an aggregation step.

This is what the aggregation mechanism does, in two steps:
two steps:
1. First, we take a family of internal time functions such that for each instant
, ( ) is an internal time function. At each instant , a TS s = d s( ) =
s ( ) is derived from s thanks to a derivation operator d ;
2. Then a user aggregation function is applied to the TS s , yielding an
element from dom.
Fig. 4 illustrates this two-step process. At each instant a new function is derived. At this point, we obtain a sequence of time series, each locally defined with respect to an instant of the time axis. This sequence corresponds to local views of s, possibly warped by temporal functions. Note that this intermediate structure is not covered by our data model.
The second step applies an aggregation function γ. It takes one (or several, depending on the arity of γ) derived series, and produces an element from dom. The combination of the derivation and aggregation steps results in a time series that complies with the data model.
To illustrate the derivation step, we derive our score s with respect to the shift family:

d_shift(s) = (s ∘ shift_0, s ∘ shift_1, ..., s ∘ shift_n).

The aggregation step takes the family of time series obtained with the derivation step and applies a user function to all of them. For our ongoing example, this translates to applying the user function DTW_P (which computes the DTW distance between the input time series s and the pattern P) to d_shift(s). We denote this two-step procedure by the following expression:

A_{DTW_P, shift}(s).
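For illustration only, the derive-then-aggregate mechanism with the shift family can be sketched in OCaml as follows; the series are plain float arrays and the DTW is the textbook dynamic-programming version, not the system's optimized implementation.

(* Sketch: derivation with the shift family followed by aggregation. *)
let dtw (a : float array) (b : float array) : float =
  let n = Array.length a and m = Array.length b in
  if n = 0 || m = 0 then infinity
  else begin
    let d = Array.make_matrix (n + 1) (m + 1) infinity in
    d.(0).(0) <- 0.0;
    for i = 1 to n do
      for j = 1 to m do
        let cost = abs_float (a.(i - 1) -. b.(j - 1)) in
        d.(i).(j) <- cost +. min d.(i - 1).(j) (min d.(i).(j - 1) d.(i - 1).(j - 1))
      done
    done;
    d.(n).(m)
  end

(* Derive s with shift_tau for every instant tau, then aggregate with gamma. *)
let aggregate_shift (gamma : float array -> float) (s : float array) : float array =
  let n = Array.length s in
  Array.init n (fun tau -> gamma (Array.sub s tau (n - tau)))

(* A_{dtw_P, shift}(s): the DTW distance between P and every suffix of s. *)
let dtw_profile (p : float array) (s : float array) : float array =
  aggregate_shift (fun derived -> dtw derived p) s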
4.4 Example: Pattern Matching Using the DTW Distance

To end this section, we give an extended example using operators from Alg(R) and Alg(TS). We want to compute, given a pattern P and a score, the instants


t where the Dynamic Time Warping (DTW) distance between P and V is less than 5. First, we compute the DTW distance between P and the voice V1 of a score, at each instant, thanks to the following Alg(TS) expression:

e = A_{dtw_P, shift}(V1)

where dtw_P is the function computing the DTW distance between a given time series and the pattern P. Expression e defines a time series that gives, at each instant τ, the DTW distance between P and the sub-series of V1 that begins at τ. A selection σ (from Alg(TS)) keeps the values in e below 5, all others being set to ⊥. Let F be the formula that expresses this condition. Then:

e' = σ_F(e).

Finally, expression e' is applied to all the scores in the Score relation with the π operator. An emptiness test can be used to eliminate those for which the DTW is always higher than 5 (hence, e' is empty):

σ_{¬∅?(e')}(π_{[composer,V1,V2,e']}(Score)).

5 User Query Language

The language should have a precise semantics so that the intent of the query is unambiguous. It should specify what is to be done, and not how to do it. The language should understand different kinds of expressions: queries as well as definitions of user functions. Finally, the query syntax should be easily understandable by a human reader.
5.1 Overview

The language is implemented in OCaml. Time series are represented by lists of arrays of elements, where elements can be integers, floats, strings, booleans, or any previously defined type. This way we allow ourselves to synchronize voices of different types. The only restriction is that we do not allow an element to be a time series. We define the time series type ts_t as a couple (string * element), where the string is the name of the voice, allowing fast access when searching for a specific voice.
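Read literally, this description suggests types along the following lines; the constructor names are hypothetical and the actual implementation may differ.

(* Sketch of the types suggested in the text; constructor names are ours. *)
type element =
  | Int of int
  | Float of float
  | Str of string
  | Bool of bool

(* One voice: its name plus the synchronized sequence of elements. *)
type ts_t = string * element array

(* A synchronized piece is then a list of voices sharing one time domain. *)
type piece = ts_t list

let lookup_voice (name : string) (p : piece) : element array option =
  List.assoc_opt name p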
5.2 Query Structure

The general structure of a query is:

from       Table [alias]
let        NewAttribute := map(function, Attribute)
construct  Attribute | NewAttribute
where      Attribute | NewAttribute = criteria


The from clause should list at least one table from the database; the alias is optional.
The let clause is optional. There can be as many let clauses as desired. Voices modified in a let clause are from attributes present in tables listed in the from clause.
The construct clause should list at least one attribute, either from one of the tables listed in the from clause, or a modified attribute from a let clause.
The where clause is optional. It consists of a list of predicates, connected by logical operators (And, Or, Not).
5.3 Query Syntax

The query has four main clauses: from, let, construct, where. In all that follows, by attribute we mean both classical attributes and time series.
Attributes which appear in the construct, let or where clauses are called either ColumnName if there is no ambiguity or TableName.ColumnName otherwise. If the attribute is a time series and we want to refer to a specific voice, we project by using the symbol ->. Precisely, a voice is called ColumnName->VoiceName or TableName.ColumnName->VoiceName.
The from clause enumerates a list of tables from the database, with an optional alias. The table names should refer to actual tables of the database. Aliases should not be duplicated, nor should they be the name of an existing table.
The optional let clause applies a user function to an attribute. The attribute should be from one of the tables listed in the from clause. When the attribute is a time series, this is done by using map. When one wants to apply a binary operator to two time series, we use map2.
The construct clause lists the names of the attributes (or modified attributes) which should appear in the query result.
The where clause evaluates a condition on an attribute or a modified attribute introduced by the let clause. If the condition evaluates to true, then the line it refers to is considered to be part of the query result. Lack of a where clause is evaluated as always true.
The where clause supports the usual arithmetic (+, -, *, /), logical (And, Or, Not) and comparison operators (=, <>, >, <, >=, <=, contains), which are used inside predicates.
5.4 Query Evaluation

A query goes through several steps so that the instructions it contains can be processed. A query can be simple (retrieve a record from the database) or complex, with content manipulation and transformation. The results (if any) are then returned to the user.
Query analysis. The expression entered by the user (either a query or a user defined function) is analyzed by the system. The system verifies the query's syntax. The query is turned into an abstract syntax tree in which each node is an algebraic operation (from bottom to top:
316

P. Rigaux and Z. Faget

set of tables (from), selection (where), content manipulation (let), projection (construct)).
Query evaluation. The abstract syntax tree is then transformed into an evaluation tree. This step is a first simple optimization, where table names become pointers to the actual tables and column names are turned into column indices. This step also verifies the validity of the query: existence of the tables in the from clause, unambiguity of the columns in the where clause, consistency of the attributes in the construct clause, no duplicate aliases...
Query execution. In this last step, the nodes of the evaluation tree call the actual algorithms, so that the program computes the result of the query. The bottom node retrieves the first line of the table in the from clause. The second node accesses the column featured in the where clause and evaluates the selection condition. If the condition is true, then the next node is called, which prints the corresponding result on the output channel, then returns to the first node and iterates the process on the next line. If the condition is evaluated as false, then the first node is called again and the process is iterated on the next line. The algorithm ends when there are no more lines to evaluate.
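The execution loop just described can be pictured with a toy OCaml evaluator; the node and table structures below are simplifications we assume for illustration, not the actual implementation.

(* Toy evaluation tree: a Scan node feeds lines upward, Select filters them
   with the where predicate, Project keeps only the constructed columns. *)
type line = (string * string) list                (* column name -> value *)

type node =
  | Scan of line list                             (* from *)
  | Select of (line -> bool) * node               (* where *)
  | Project of string list * node                 (* construct *)

(* eval walks the tree; 'output' is what the parent does with each line. *)
let rec eval (output : line -> unit) (n : node) : unit =
  match n with
  | Scan table -> List.iter output table
  | Select (pred, child) -> eval (fun l -> if pred l then output l) child
  | Project (cols, child) ->
      eval (fun l -> output (List.filter (fun (c, _) -> List.mem c cols) l)) child

let () =
  let table = [ [ ("id", "1"); ("composer", "Faure") ];
                [ ("id", "2"); ("composer", "Couperin") ] ] in
  let plan =
    Project ([ "id" ],
      Select ((fun l -> List.assoc "composer" l = "Faure"), Scan table)) in
  eval
    (fun l -> List.iter (fun (c, v) -> Printf.printf "%s=%s " c v) l;
              print_newline ())
    plan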
5.5 Algebraic Equivalences

In this section, we give the syntactic equivalent of all algebraic operators previously introduced in Section 4, along with some examples.

Relational operators

Selection
algebraic notation: σ_F(R), where F is a formula, R a relation.
syntactic equivalent: where ... =

Projection
algebraic notation: π_A(R), where A is a set of attributes, R a relation.
syntactic equivalent: construct

Example: the expression

π_{id,voice}(σ_{composer='Faure'}(Psalms))

is equivalent to

from       Psalms
construct  id, voice
where      composer = Faure

Extension to time series

Projection
algebraic notation: π_V(S), where V is a set of voices, S a time series.
syntactic equivalent: S->(voice_1, ..., voice_n)


Selection
algebraic notation: σ_F(V), where F is a formula, V is a set of voices of a time series S.
syntactic equivalent: where S->V ... contains

Example: the expression

π_{id, pitch,rythm(voice)}(σ_{lyrics(voice) contains 'Heureux les hommes', composer='Faure'}(Psalms))

is equivalent to

from       Psalms
construct  id, voice->(pitch,rythm)
where      voice->lyrics contains Heureux les hommes
and        composer = Faure

Remark. If the time series consists of several synchronized voices, contains should list as many conditions as voices. Example:

where voice->(lyrics,pitch) contains (Heureux les hommes, A3,B3,B3)
Product
algebraic notation: s × t, where s and t are two time series.
syntactic equivalent: synch

Example: the expression

π_{M.Voice × F.Voice}(σ_{M.Id=F.Id}(Male × Female))

is equivalent to

from       Male M, Female F
let        $duet := synch(M.Voice, F.Voice)
construct  $duet
where      M.id = F.id

Time series operators

Addition
algebraic notation: ⊕_op, where op is a user function.
syntactic equivalent: map and map2

Examples: the expression

⊕_{transpose(1)}(pitch(voice)) (Psalms)

is equivalent to

from       Psalms
let        $transpose := map(transpose(1), voice->pitch)
construct  $transpose


The expression

trumpet(voice) ⊕_harm clarinet(voice) (Duets)

is equivalent to

from       Duets
let        $harmonic_progression := map2(harm, trumpet->pitch, clarinet->pitch)
construct  $harmonic_progression
Composition
algebraic notation: S ∘ λ, where λ is an internal temporal function and S a time series.
syntactic equivalent: comp(S, λ)

Aggregation - derivation
algebraic notation: A_{γ,Λ}(S), where S is a time series, Λ is a family of internal time functions and γ is an aggregation function.
syntactic equivalent: derive(S, Λ, γ).

The family of internal time functions Λ is a mapping from the time domain into the set of internal time functions. Precisely, for each instant n, Λ(n) = λ_n, an internal time function. The two most commonly used families of time functions, Shift and Warp, are provided.

Example: the expression

π_{id, A_{dtw(P),shift}}(Psalm)

is equivalent to

from       Psalm
let        $dtwVal := derive(voice, Shift, dtw(P))
construct  id, $dtwVal

6 Conclusion and Ongoing Work

By adopting from the beginning an algebraic approach to the management of time series data sets, we directly enable an expressive and stable language that avoids a case-by-case definition of a query language based on the introduction of ad-hoc functions subject to constant evolution. We believe that this constitutes a sound basis for the development of applications that can rely on an expressive and efficient data management layer.
Current efforts are being put into the language implementation in order to optimize query evaluation. Our short-term roadmap also includes an investigation of indexing structures suited to retrieving patterns in large collections.
Acknowledgments. This work is partially supported by the French ANR Neuma project, http://neuma.irpmf-cnrs.fr. The authors would like to thank Virginie Thion-Goasdoué and David Gross-Amblard.


References
1. Allan, H., Müllensiefen, D., Wiggins, G.A.: Methodological Considerations in Studies of Musical Similarity. In: Proc. Intl. Society for Music Information Retrieval (ISMIR) (2007)
2. Anglade, A., Dixon, S.: Characterisation of Harmony with Inductive Logic Programming. In: Proc. Intl. Society for Music Information Retrieval (ISMIR) (2008)
3. Anglade, A., Dixon, S.: Towards Logic-based Representations of Musical Harmony for Classification, Retrieval and Knowledge Discovery. In: MML (2008)
4. Berman, T., Downie, J., Berman, B.: Beyond Error Tolerance: Finding Thematic Similarities in Music Digital Libraries. In: Proc. European Conf. on Digital Libraries, pp. 463-466 (2006)
5. Berndt, D., Clifford, J.: Using dynamic time warping to find patterns in time series. In: AAAI Workshop on Knowledge Discovery in Databases, pp. 229-248 (1994)
6. Brockwell, P.J., Davis, R.: Introduction to Time Series and Forecasting. Springer, Heidelberg (1996)
7. Cameron, J., Downie, J.S., Ehmann, A.F.: Human Similarity Judgments: Implications for the Design of Formal Evaluations. In: Proc. Intl. Society for Music Information Retrieval (ISMIR) (2007)
8. Capela, A., Rebelo, A., Guedes, C.: Integrated recognition system for music scores. In: Proc. of the 2008 International Computer Music Conference (2008)
9. Downie, J., Nelson, M.: Evaluation of a simple and effective music information retrieval method. In: Proc. ACM Symp. on Information Retrieval (2000)
10. Downie, J.S.: Music Information Retrieval. Annual Review of Information Science and Technology 37, 295-340 (2003)
11. Ganseman, J., Scheunders, P., D'haes, W.: Using XQuery on MusicXML Databases for Musicological Analysis. In: Proc. Intl. Society for Music Information Retrieval (ISMIR) (2008)
12. Good, M.: MusicXML in practice: issues in translation and analysis. In: Proc. 1st International Conference on Musical Applications Using XML, pp. 47-54 (2002)
13. Haus, G., Longari, M., Pollastri, E.: A Score-Driven Approach to Music Information Retrieval. Journal of the American Society for Information Science and Technology 55, 1045-1052 (2004)
14. Huron, D.: Music information processing using the HumDrum toolkit: Concepts, examples and lessons. Computer Music Journal 26, 11-26 (2002)
15. Keogh, E.J., Ratanamahatana, C.A.: Exact Indexing of Dynamic Time Warping. Knowl. Inf. Syst. 7(3), 358-386 (2003)
16. Knopke, I.: The PerlHumdrum and PerlLilypond Toolkits for Symbolic Music Information Retrieval. In: Proc. Intl. Society for Music Information Retrieval (ISMIR) (2008)
17. Lee, J.Y., Elmasri, R.: An EER-Based Conceptual Model and Query Language for Time-Series Data. In: Proc. Intl. Conf. on Conceptual Modeling, pp. 21-34 (1998)
18. Lerner, A., Shasha, D.: AQuery: Query language for ordered data, optimization techniques and experiments. In: Proc. of the 29th VLDB Conference, Berlin, Germany (2003)
19. Müller, M.: Information Retrieval for Music and Motion. Springer, Heidelberg (2004)
20. Rabiner, L., Rosenberg, A., Levinson, S.: Considerations in dynamic time warping algorithms for discrete word recognition. IEEE Trans. Acoustics, Speech and Signal Proc. ASSP-26, 575-582 (1978)

320

P. Rigaux and Z. Faget

21. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken
word recognition. IEEE Trans. Acoustics, Speech and Signal Proc. ASSP-26, 43
49 (1978)
22. Sapp, C.S.: Online Database of Scores in the Humdrum File Format. In: Proc. Intl.
Society for Music Information Retrieval, ISMIR (2005)
23. Typke, R., Wiering, F., Veltkamp, R.C.: A Survey Of Music Information Retrieval
Systems. In: Proc. Intl. Society for Music Information Retrieval, ISMIR (2005)
24. Viglianti, R.: MusicXML : An XML based approach to automatic musicological
analysis. In: Conference Abstracts of the Digital Humanities (2007)
25. Zhu, Y., Shasha, D.: Warping Indexes with Envelope Transforms for Query by
Humming. In: Proc. ACM SIGMOD Symp. on the Management of Data, pp. 181
192 (2003)
26. Mutopia, http://www.mutopiaproject.org (last viewed February 2011)
27. Melodicmatch, http://www.melodicmatch.com (last viewed February 2011)
28. Musipedia, http://www.musipedia.org (last viewed February 2011)
29. Wikifonia, http://www.wikifonia.org (last viewed February 2011)
30. Neuma, http://neuma.fr (last viewed February 2011)

Error-Tolerant Content-Based Music Retrieval
with Mathematical Morphology

Mikko Karvonen, Mika Laitinen, Kjell Lemström, and Juho Vikman
University of Helsinki
Department of Computer Science
mikko@crankshaft.fi
{mika.laitinen,kjell.lemstrom,juho.vikman}@helsinki.fi
http://www.cs.helsinki.fi

Abstract. In this paper, we show how to apply the framework of mathematical morphology (MM) in order to improve error-tolerance in content-based music retrieval (CBMR) when dealing with approximate retrieval
of polyphonic, symbolically encoded music. To this end, we introduce two
algorithms based on the MM framework and carry out experiments to
compare their performance against well-known algorithms earlier developed for CBMR problems. Although, according to our experiments, the
new algorithms do not perform quite as well as the rivaling algorithms in
a typical query setting, they make it easy to adjust the desired error
tolerance. Moreover, in certain settings the new algorithms become even
faster than their existing counterparts.
Keywords: MIR, music information retrieval, mathematical morphology, geometric music retrieval, digital image processing.

1 Introduction

The snowballing amount of multimedia data and databases publicly available
for anyone to explore and query has made the conventional text-based query
approach insufficient. To effectively query these databases in the digital era,
content-based methods tailored for the specific media have to be available.
In this paper we study the applicability of a mathematical framework for
retrieving music in symbolic, polyphonic music databases in a content-based
fashion. More specifically, we harness the mathematical morphology methodology for locating approximate occurrences of a given musical query pattern in
a larger music database. To this end, we represent music symbolically using
the well-known piano-roll representation (see Fig. 1(b)) and cast it into a two-dimensional binary image. The representation used resembles that of a previously used technique based on point-pattern matching [14,11,12,10]; the applied
methods themselves, however, are very different. The advantage of using our
novel approach is that it enables more flexible matching for polyphonic music,
allowing local jittering on both the time and pitch values of the notes. This has been
problematic to achieve with the polyphonic methods based on the point-pattern

Fig. 1. (a) The first two measures of Bach's Invention 1. (b) The same polyphonic
melody cast into a 2-D binary image. (c) A query pattern image with one extra note
and various time and pitch displacements. (d) The resulting image after a blur rank
order filtering operation, showing us the potential matches.

matching. Moreover, our approach provides the user with an intuitive, visual way
of defining the allowed approximations for the query at hand. In [8], Karvonen and
Lemström suggested the use of this framework for music retrieval purposes. We
extend and complement their ideas, introduce and implement new algorithms,
and carry out experiments to show their efficiency and effectiveness.
The motivation to use symbolic methods is twofold. Firstly, there is a multitude of symbolic music databases where audio methods are naturally of no
use. In addition, the symbolic methods allow for distributed matching, i.e., occurrences of a query pattern are allowed to be distributed across the instruments
(voices) or to be hidden in some other way in the matching fragments of the polyphonic database. The corresponding symbolic and audio files may be aligned by
using mapping tools [7] in order to be able to play back the matching part in an
audio form.
1.1 Representation and Problem Specifications

In this paper we deal with symbolically encoded, polyphonic music for which we
use the pointset representation (the pitch-against-time representation of note-on information), as suggested in [15], or the extended version of the former, the
horizontal-line-segment representation [14], where note durations are also explicitly given. The latter representation is equivalent to the well-known piano-roll
representation (see e.g. Fig. 1(b)), while the former omits the duration information of the line segments and uses only the onset information of the notes (the
starting points of the horizontal line segments). As opposed to the algorithms
based on point-pattern matching, where the piano-roll representation is a mere
visualization of the underlying representation, here the visualization IS the representation: the algorithms to be given operate on binary images of the onset
points or the horizontal line segments that correspond to the notes of the given
query pattern and the database.
Let us denote by P the pattern to be searched for in a database, denoted
by T. We will consider the three problems, P1-P3, specified in [14], and their


generalizations, AP1-AP3, to approximative matching where local jittering is
allowed. The problems are as follows:
(Approximative) complete subset matching: Given P and T in the
pointset representation, find translations of P such that all its points match
(P1) / match approximatively (AP1) with some points in T.
(Approximative) maximal partial subset matching: Given P and T
in the pointset representation, find all translations of P that give a maximal
(P2) / an approximative and maximal (AP2) partial match with points in
T.
(Approximative) longest common shared time matching: Given P
and T in the horizontal line segment representation, find translations of P
that give the longest common (P3) / the approximative longest common
(AP3) shared time with T, i.e., the longest total length of the (approximatively) intersected line segments of T and those of translated P.
Above we have deliberately been vague about the meaning of an approximative
match: the applied techniques enable the user to steer the approximation in
the desired direction by means of shaping the structuring element used, as will
be shown later in this paper. Naturally, an algorithm capable of solving problem AP1, AP2 or AP3 would also be able to solve the related, original non-approximative problem P1, P2 or P3, respectively, because an exact match can
be identified with zero approximation.
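
As a simple illustration of what an exact P1 occurrence means, the following Python sketch (not taken from the papers cited here) checks whether some translation maps every pattern point onto a database point; the efficient algorithms for this task are discussed in the next section.

  # Minimal sketch (illustration only) of an exact P1 occurrence: some
  # translation f maps every pattern point onto a database point.
  def occurs_exactly(pattern, database):
      """pattern: list of (time, pitch) points; database: iterable of points."""
      db = set(database)
      for (t0, p0) in db:
          # Candidate translation anchoring the first pattern point on a db point.
          f = (t0 - pattern[0][0], p0 - pattern[0][1])
          if all((t + f[0], p + f[1]) in db for (t, p) in pattern):
              return f
      return None

  print(occurs_exactly([(0, 60), (1, 62)], [(3, 65), (4, 67), (9, 10)]))  # (3, 5)
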

2 Background

2.1 Related Work

Let us denote by P + f a translation of P by vector f, i.e., vector f is added
to each of the m components of P separately: P + f = {p1 + f, p2 + f, ..., pm + f}.
Problem AP1 can then be expressed as the search for a subset I of T such that
P + f ≈ I for some f and some similarity relation ≈; in the original P1 setting
the ≈ relation is to be replaced by the equality relation =. It is noteworthy that
the mathematical translation operation corresponds to two musically distinct
phenomena: a vertical move corresponds to transposition, while a horizontal move
corresponds to aligning the pattern and the database time-wise.
In [15], Wiggins et al. showed how to solve P1 and P2 in O(mn log(mn)) time.
First, translations that map the maximal number of the m points of P to some
points of T (of n points) are to be collected. Then the set of such translation
vectors is to be sorted based on the lexicographic order, and finally the translation vector that is the most frequent is to be reported. If the reported vector f
appears m times, it is also an occurrence for P1. With careful implementation
of the sorting routine, the running time can be improved to O(mn log m) [14].
For P1, one can use a faster algorithm working in O(n) expected time and O(m)
space [14].
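
The following minimal sketch (not the authors' code) illustrates the translation-counting idea behind these algorithms: every pair of a pattern point and a database point votes for one translation vector, and the most frequent vector is reported. The original algorithms obtain the counts by sorting the vectors; a hash-based counter is used here purely for brevity.

  # Minimal sketch of the translation-counting idea behind the P1/P2
  # algorithms of Wiggins et al. [15] (illustration only, not their code).
  from collections import Counter

  def best_translation(pattern, database):
      """pattern, database: lists of (time, pitch) points."""
      votes = Counter(
          (tx - px, ty - py)
          for (px, py) in pattern
          for (tx, ty) in database
      )
      translation, count = votes.most_common(1)[0]
      # count == len(pattern) means the whole pattern matches (problem P1).
      return translation, count, count == len(pattern)

  pattern = [(0, 60), (1, 62), (2, 64)]
  database = [(4, 65), (5, 67), (6, 69), (7, 60)]
  print(best_translation(pattern, database))  # ((4, 5), 3, True)
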
In [5], Clifford et al. showed that problem P2 is 3SUM-hard, which means that it
is unlikely that one could find an algorithm for the problem with a subquadratic


running time. Interestingly enough, Minkowski addition and subtraction, which
are the underlying basic operations used by our algorithms, are also known to
be 3SUM-hard [1]. Clifford et al. also gave an approximation algorithm for P2
working in time O(n log n).
In order to be able to query large music databases in real time, several indexing schemes have been suggested. Clausen et al. used an inverted file index
for a P2-related problem [4] that achieves sublinear query times in the length of
the database. In their approach, efficiency is achieved at the cost of robustness:
the information extraction of their method makes the approach non-applicable
to problems P1 and P2 as exact solutions. Another very general indexing approach was recently proposed in [13]: Typke et al.'s use of a metric index has
the advantage that it works under robust geometric similarity measures. However, it is difficult to adapt it to support translations and partial matching. More
recently, Lemström et al. [10] introduced an approach that combines indexing
and filtering, achieving output-sensitive running times for P1 and P2: O(sm)
and O(sm log m), respectively, where s is the number of candidates, given by a
filter, that are to be checked. Typically their algorithms perform 1-3 orders of
magnitude faster than the original algorithms by Ukkonen et al. [14].
Romming and Selfridge-Field [12] introduced an algorithm based on geometric
hashing. Their solution, which combines the capability of dealing with polyphonic
music, transposition invariance and time-scale invariance, works in O(n³) space
and O(n²m³) time, but by applying windowing on the database, the complexities
can be restated as O(w²n) and O(wnm³), respectively, where w is the maximum
number of events that occur in any window. Most recently, Lemström [9] generalized Ukkonen et al.'s P1 and P2 algorithms [14] to be time-scale invariant.
With windowing the algorithms work in O(σm log σ) time and O(σm) space,
where σ = O(wn) when searching for exact occurrences and σ = O(nw²) when
searching for partial occurrences; without windowing the respective complexities
are O(σn² log n) and O(σn²), with σ = O(m) for the exact case and σ = O(m²)
for the partial case.
With all the above algorithms, however, their applicability to real-world problems is reduced by the fact that, beyond the considered invariances, matches
have to be mathematically exact, and thus, for instance, performance expression
and error are difficult to account for. We bridge this gap by introducing new algorithms based on the mathematical morphology framework, where the allowed error
tolerance can be elegantly embedded in a query.
2.2 Mathematical Morphology

Mathematical morphology (MM) is a theoretically well-defined framework and
the foundation of morphological image processing. Originally developed for binary images in the 1960s, it was subsequently extended to grey-scale images and
finally generalized to complete lattices. MM is used for quantitative analysis and
processing of the shape and form of spatial structures in images. It finds many applications in computer vision, template matching and pattern recognition problems. Morphological image processing is used for pre- and post-processing of


images in a very similar way to conventional image filters. However, the focus
in MM-based methods is often on extracting attributes and geometrically meaningful data from images, as opposed to generating filtered versions of images.
In MM, sets are used to represent objects in an image. In binary images, the
sets are members of the 2-D integer space Z². The two fundamental morphological operations, dilation and erosion, are non-linear neighbourhood operations
on two sets. They are based on the Minkowski addition and subtraction [6]. Out
of the two sets, the typically smaller one is called the structuring element (SE).
Dilation performs a maximum on the SE, which has a growing effect on the
target set, while erosion performs a minimum on the SE and causes the target set
to shrink. Dilation can be used to fill gaps in an image, for instance, connecting
the breaks in letters in a badly scanned image of a book page. Erosion can be
used, for example, for removing salt-and-pepper type noise. One way to define
dilation is

  A ⊕ B = {f ∈ Z² | (B̂ + f) ∩ A ≠ ∅},                                    (1)

where A is the target image, B is the SE, and B̂ its reflection (or rotation by
180 degrees). Accordingly, erosion can be written

  A ⊖ B = {f ∈ Z² | (B + f) ⊆ A}.                                          (2)

Erosion itself can be used for pattern matching. Foreground pixels in the
resulting image mark the locations of the matches. Any shape, however, can be
found in an image filled with foreground. If the background also needs to match,
erosion has to be applied separately to the negations of the image and the
structuring element. Intersecting these two erosions leads to the desired result.
This procedure is commonly known as the hit-or-miss transform or hit-miss
transform (HMT):

  HMT(A, B) = (A ⊖ B) ∩ (A^C ⊖ B^C).                                       (3)

HMT is guaranteed to give us a match only if our SE perfectly matches some
object(s) in the image. The requirement for a perfect match is that the background must also match (i.e., it cannot contain additional pixels) and that each
object has at least a one-pixel-thick background around it, separating it from
other objects (in this case B^C actually becomes W \ B, where W is a window of
"on" pixels slightly larger than B). In cases where we are interested in partially
detecting patterns within a set, we can ignore the background and reduce HMT
to simple erosion. This is clearly the case when we represent polyphonic music
as binary 2-D images. We use this simplified pattern detection scheme in one of
the algorithms developed in Section 3.
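
As an illustration of erosion-based matching and the hit-miss transform on binary piano-roll images, the following sketch uses SciPy's binary morphology routines on toy arrays; it is not the implementation evaluated later in this paper, and the toy data is made up.

  # Minimal sketch of erosion as pattern matching (Eqs. 2-3) on binary images.
  import numpy as np
  from scipy.ndimage import binary_erosion

  A = np.zeros((8, 8), dtype=bool)          # database image (pitch x time)
  A[2, 1], A[3, 2], A[4, 3] = True, True, True
  B = np.zeros((3, 3), dtype=bool)          # query image used as structuring element
  B[0, 0], B[1, 1], B[2, 2] = True, True, True

  # Erosion: foreground pixels of the result mark translations where the whole
  # query foreground is covered by the database foreground (problem P1).
  matches = binary_erosion(A, structure=B)

  # Hit-miss transform: additionally require the query background to match,
  # i.e. intersect with the erosion of the complements (SciPy also provides
  # binary_hit_or_miss for this).
  hmt = binary_erosion(A, structure=B) & binary_erosion(~A, structure=~B)
  print(np.argwhere(matches), np.argwhere(hmt))
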

3 Algorithms

In [8], Karvonen and Lemström introduced four algorithms based on the mathematical morphology framework and gave their MATLAB implementations. Our

326

M. Karvonen et al.

closer examination revealed common principles behind the four algorithms; three
of them were virtually identical to each other.
The principles on which our two algorithms to be introduced rely are explained
by Bloomberg and Maragos [2]. Having HMT as the main means of generalizing
erosion, they present three more, which can be combined in various ways. They
also name a few of the combinations. Although we can find some use for HMT,
its benefit is not significant in our case. However, two of the other tricks proved to be
handy, and particularly their combination, which is not mentioned by Bloomberg
and Maragos.
We start with erosion as the basic pattern matching operation. The problem
with erosion is its lack of flexibility: every note must match and no jittering is
tolerated. Performing the plain erosion solves problem P1. We present two ways
to gain flexibility.
Allow partial matches. This is achieved by moving from P1 to P2.
Handle jittering. This is achieved by moving from P1 to AP1.
Out of the pointset problems, only AP2 now remains unconsidered. It can be
solved, however, by combining the two tricks above. We will next explain how
these improvements can be implemented. First, we concentrate on the pointset
representation, then we will deal with line segments.
3.1 Allowing Partial Matches

For a match to be found with plain erosion, having applied a translation, the
whole foreground area of the query needs to be covered by the database foreground. In the pointset representation this means that there needs to be a corresponding database note for each note in the query. To allow missing notes, a
coverage of only some specified portion of the query foreground suffices. This is
achieved by replacing erosion with a more general filter. For such a generalization,
Bloomberg and Maragos propose a binary rank order filter (ROF) and threshold
convolution (TC). In addition to them, one of the algorithms in [8] was based
on correlation. These three methods are connected to each other, as discussed
next.
For every possible translation f, the binary rank order filter counts the ratio

  |(P + f) ∩ T| / |P|,

where |P| is the number of foreground pixels in the query. If the ratio is greater
than or equal to a specified threshold value, it leaves a mark in the resulting
image, representing a match. This ratio can be seen as a confidence score (i.e.,
a probability) that the query foreground occurs in the database foreground at
some point. It is noteworthy that plain erosion is a special case of binary ROF,
where the threshold ratio is set to 1. By lowering the threshold we impose looser
conditions than plain erosion on detecting the query.
Correlation and convolution operate on greyscale images. Although we deal
with binary images, these operations are useful because ROF can be implemented


using correlation and thresholding. When using convolution as a pattern matching operation, it can be seen as a way to implement correlation: rotating the
query pattern 180 degrees and then performing convolution on real-valued data
has almost the same effect as performing correlation, the only difference being
that the resulting marks appear in the top-left corner instead of the bottom-right corner of the match region. Both can be effectively implemented using the
Fast Fourier Transform (FFT). Because of this relation between correlation and
convolution, ROF and TC are actually theoretically equivalent.
When solving P2, one may want to search for maximal partial matches (instead of threshold matches). This is straightforwardly achieved by implementing
ROF using correlation. Although our ROF implementation is based on correlation, we will call the method ROF since it offers all the needed functionality.
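
A minimal sketch of this idea, assuming binary pitch-time images stored as NumPy arrays, is given below: the database is cross-correlated with the query via FFT and the result is thresholded by a fraction of the query's foreground size. It is an illustration only, not the implementation used in the experiments, and the toy data is made up.

  # Minimal sketch of a binary rank order filter realized through correlation.
  import numpy as np
  from scipy.signal import fftconvolve

  def rank_order_matches(database, query, threshold=0.8):
      """database, query: 2-D boolean arrays (pitch x time). Returns the score map
      |(P + f) ∩ T| / |P| and the positions where it reaches the threshold."""
      corr = fftconvolve(database.astype(float),
                         query[::-1, ::-1].astype(float),  # flipping turns convolution into correlation
                         mode='valid')
      scores = corr / query.sum()
      return scores, np.argwhere(scores >= threshold - 1e-9)  # slack for FFT rounding

  T = np.zeros((12, 20), dtype=bool)
  T[3, 2], T[4, 3], T[5, 4], T[7, 10] = True, True, True, True
  P = np.zeros((3, 3), dtype=bool)
  P[0, 0], P[1, 1], P[2, 2] = True, True, True

  scores, hits = rank_order_matches(T, P, threshold=2/3)  # allow one missing note
  print(hits)
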
3.2 Tolerating Jittering

Let us next explain the main asset of our algorithms as compared to the previous
algorithms. In order to tolerate jittering, the algorithms should be able to find
corresponding database elements not only in the exact positions of the translated
query elements, but also in their near proximity.
Bloomberg and Vincent [3] introduced a technique for adding such toleration
into HMT. They call it the blur hit-miss transform (BHMT). The trick is to dilate the database images (both the original and the complement) by a smaller,
disc-shaped structuring element before the erosions are performed. This can be
written

  BHMT(A, B1, B2, R1, R2) = [(A ⊕ R1) ⊖ B1] ∩ [(A^C ⊕ R2) ⊖ B2],          (4)

where A is the database and A^C its complement, B1 and B2 are the query
foreground and background, and R1 and R2 are the blur SEs. The technique is
also eligible for plain erosion. We choose this method for jitter toleration
and call it blur erosion:

  A ⊖_b (B, R) = (A ⊕ R) ⊖ B.                                              (5)

The shape of the preprocessing dilation SE does not have to be a disc. In our
case, where the dimensions under consideration are time and pitch, a natural
setting comprises user-specified thresholds for the dimensions. This leads us to
rectangular SEs with efficient implementations. In practice, dilation operations
are useful in the time dimension, but applying them in the pitch dimension often results
in false (positive) matches. Instead, a blur of just one semitone is very useful
because the queries often contain pitch quantization errors.
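
The following sketch (again an illustration with made-up data, not the paper's code) shows blur erosion with a rectangular blur SE: the database image is dilated along the time axis before the erosion with the query, so a note displaced by one time step still yields a match.

  # Minimal sketch of blur erosion, Eq. (5).
  import numpy as np
  from scipy.ndimage import binary_dilation, binary_erosion

  def blur_erosion(database, query, time_blur=1, pitch_blur=0):
      """Rectangular blur SE of size (2*pitch_blur+1) x (2*time_blur+1)."""
      R = np.ones((2 * pitch_blur + 1, 2 * time_blur + 1), dtype=bool)
      return binary_erosion(binary_dilation(database, structure=R), structure=query)

  T = np.zeros((10, 16), dtype=bool)
  T[3, 2], T[4, 4], T[5, 5] = True, True, True     # middle note displaced by one time step
  P = np.zeros((3, 4), dtype=bool)
  P[0, 0], P[1, 1], P[2, 3] = True, True, True

  print(np.argwhere(binary_erosion(T, structure=P)))   # plain erosion: no match
  print(np.argwhere(blur_erosion(T, P, time_blur=1)))  # blur erosion tolerates the jitter
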
3.3 Combining the Two

By applying ROF we allow missing notes, thus being able to solve problem P2.
The jitter toleration is achieved by using blurring, thus solving AP1. In order to


Fig. 2. Generalizations of erosion: erosion is generalized by the hit-miss transform,
the rank order filter and blur erosion; these combine pairwise into the hit-miss rank
order filter, the blur hit-miss transform and the blur rank order filter, and all together
into the blur hit-miss rank order filter

be able to solve AP2, we combine these two. In order to correctly solve AP2, the
dilation has to be applied to the database image. With blurred ROF a speed-up
can be obtained (with the cost of false positive matches) by dilating the query
pattern instead of the database image. If there is no need to adjust the dilation
SE, the blur can be applied to the database in a preprocessing phase. Note also
that if both the query and the database were dilated, it would grow the distance
between the query elements and the corresponding database elements which,
subsequently, would gradually decrease the overlapping area.
Figure 2 illustrates the relations between the discussed methods. Our interest
is in blur erosion and blur ROF, because they can be
used to solve the approximate problems AP1 and AP2.
3.4 Line Segment Representation

Blur ROF is also applicable for solving AP3. In this case, however, the blur is
not as essential: in the case of an approximate occurrence, even if there were some
jittering in the time dimension, a crucial portion of the line segments would
typically still overlap. Indeed, ROF without any blur solves exactly problem P3.
By using blur erosion without ROF on line segment data, we get an algorithm
that does not have an existing counterpart. Plain erosion is like P3 with the
extra requirement of full matches only; the blur then adds error toleration to the
process.

3.5 Applying Hit-Miss Transform

Thus far we have not been interested in what happens in the background of an
occurrence; we have just searched for occurrences of the query pattern that are
intermingled in the polyphonic texture of the database. If, however, no extra
notes were allowed in the time span of an occurrence, we would have to consider
the background as well. This is where we need the hit-miss transform. Naturally,
HMT is also applicable for decreasing the number of false positives in cases where
the query is assumed to be comprehensive. With HMT as the third way of
generalizing erosion, we complement the classification in Figure 2.
Combining HMT with the blur operation, we have to slightly modify the
original BHMT to meet the special requirements of the domain. In the original
form, when matching the foreground, tiny background dots are ignored because
they are considered to be noise. In our case, with notes represented as single
pixels or thin line segments, all the events would be ignored during the background matching; the background would always match. To achieve the desired
effect, instead of dilating the complemented database image, we need to erode
the complemented query image by the same SE:

  BHMT*(A, B1, B2, R1, R2) = [(A ⊕ R1) ⊖ B1] ∩ [A^C ⊖ (B2 ⊖ R2)],          (6)

where A is the database and A^C its complement, B1 and B2 are the query
foreground and background, and R1 and R2 are the blur SEs. If B2 is the complement
of B1, we can write B = B1 and use the form

  BHMT*(A, B, R1, R2) = [(A ⊕ R1) ⊖ B] ∩ [A^C ⊖ (B ⊕ R2)^C].               (7)
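
A minimal sketch of Eq. (6), under the same assumptions as the earlier SciPy examples, could look as follows; it only mirrors the formula and is not the implementation used in this paper.

  # Minimal sketch of the modified blur hit-miss transform BHMT*, Eq. (6):
  # blur the database foreground with R1, but erode the query background with
  # R2 instead of dilating the complemented database.
  import numpy as np
  from scipy.ndimage import binary_dilation, binary_erosion

  def bhmt_star(A, B1, B2, R1, R2):
      """A: database image; B1, B2: query foreground/background; R1, R2: blur SEs."""
      fg = binary_erosion(binary_dilation(A, structure=R1), structure=B1)
      bg = binary_erosion(~A, structure=binary_erosion(B2, structure=R2))
      return fg & bg
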

Another example where background matching would be needed is with line


segment representation of long notes. In an extreme case, a tone cluster with
a long duration forms a rectangle that can be matched with anything. Even
long sounding chords can result in many false positives. This problem can be
alleviated by using HMT with a tiny local background next to the ends of the
line segments to separate the notes.

4 Experiments

The algorithms presented in this paper set new standards for finding approximative occurrences of a query pattern in a given database. There are no rivaling
algorithms in this sense, so we are not able to fairly compare the performance
of our algorithms to any existing algorithm. However, to give the reader a sense
of the real-life performance of these approximative algorithms, we compare their
running times to those of the existing, non-approximative algorithms. Essentially
this means that we are comparing the performance of the algorithms able
to solve AP1-AP3 to the ones that can solve only P1, P2 and P3 [14].

Fig. 3. The effect of changing the time resolution (a) on blur erosion (on the left) and
(b) on blur correlation (on the right); execution time in ms against time resolution in
pixel columns per second

In this paper we have sketched eight algorithms based on mathematical morphology. In our experiments we will focus on two of them: blur erosion and blur
ROF, which can be applied to solve problems AP1-AP3. The special cases of
these algorithms, where blur is not applied, are plain erosion and ROF. As our
implementation of blur ROF is based on correlation, we will call it blur correlation from now on. As there are no competitors that solve the problems AP1-AP3,
we set our new algorithms against the original geometric algorithms named after
the problem specifications P1, P2 and P3 [14] to get an idea of their practical
performance.
For the performance of our algorithms, the implementations of dilation, erosion
and correlation are crucial. For dilation and erosion, we rely on Bloomberg's
Leptonica library. Leptonica offers an optimized implementation for rectangle-shaped SEs, which we can utilize in the case of dilation. On the other hand,
our erosion SEs tend to be much more complex and larger in size (we erode the
databases with the whole query patterns). For correlation we use the Fast Fourier
Transform implemented in the FFTW library. This operation is quite heavy
calculation-wise compared to erosion, since it has to operate with floating-point
complex numbers.
The performance of the reference algorithms, used to solve the original, non-approximative problems, depends mostly on the number of notes in the database
and in the query. We experiment on how the algorithms scale up as a function
of the database and query pattern lengths. It is also noteworthy that the note
density can make a significant difference in the performance, as the time consumption of our algorithms mostly grows along with the size of the corresponding
images.
The database we used in our experiments consists of MIDI files from the Mutopia collection¹ that contains over 1.4 million notes. These files were converted
to various other formats required by the algorithms, such as binary images of

¹ http://www.mutopiaproject.org/

Fig. 4. Execution time on pointset data plotted on a logarithmic scale (time in ms
against pattern size in notes, left, and database size in thousands of notes, right;
curves: P1, P2, Blur Er., Blur Corr., MSM)

pointset and line segment types. For the experiments on the effects of varying
pattern sizes, we randomly selected 16 pieces out of the whole database, each
containing 16,000 notes. Five distinct queries were randomly chosen, and the median of their execution times was reported. When experimenting with varying
database sizes, we chose a pattern size of 128 notes.
The size of the images is also a major concern for the performance of our
algorithms. We represent the pitch dimension as 128 pixels, since the MIDI
pitch value range consists of 128 possible values. The time dimension, however,
poses additional problems: it is not intuitively clear what would make a good
time resolution. If we use too many pixel columns per second, the performance
of our algorithms will be slowed down significantly. On the flip side of the coin,
not using enough pixels per second would result in a loss of information, as we
would not be able to distinguish separate notes in rapid passages anymore. Before
running the actual experiments, we decided to experiment on finding a suitable
time resolution efficiency-wise.
4.1 Adjusting the Time Resolution

We tested the effect of increasing the time resolution on both blur erosion and blur
correlation, and the results can be seen in Figure 3. With blur erosion, we can
see a clear difference between the pointset representation and the line segment
representation: in the line segment case, the running time of blur erosion seems
to grow quadratically in relation to the growing time resolution, while in the
pointset case, the growth rate seems to be clearly slower. This can be explained
by the fact that the execution time of erosion depends on the size of the query
foreground. In the pointset case, we still only mark the beginning point of the
notes, so only the SEs require extra space. In the line segment case, however, the
growth is clearly linear.


In the case of blur correlation, there seems to be nearly no effect whether the
input is in pointset or line segment form. The pointset and line segment curves
in the figure of blur correlation coincide, so we depicted only one of the curves in
this case.
Looking at the results, one can note that the time usage of blur erosion begins
to grow quickly between 12 and 16 pixel columns per second. Considering that we
do not want the performance of our algorithms to suffer too much, and the fact
that we are deliberately getting rid of some nuances of information by blurring,
we were encouraged to set the time resolution as low as 12 pixel columns per
second. This time resolution was used in the further experiments.
4.2 Pointset Representation

Both P1 and P2 are problems where we aim to find an occurrence of a query
pattern in a database, where both the database and the query are represented
as pointsets. Our new algorithms add support for approximation. Blur erosion solves problem AP1, finding exact approximative matches, whereas blur
correlation also finds partial matches, thus solving AP2.
We compared the efficiency of the non-approximative algorithms to our new, approximative algorithms with varying query and database sizes. As an additional
comparison point, we also included Clifford et al.'s approximation algorithm [5],
called the maximal subset matching (MSM) algorithm, in the comparison. MSM
is based on the FFT and its execution time does not depend on the query size.
Analyzing the results seen in Figure 4, we note that the exact matching algorithm P1 is the fastest algorithm in all settings. This was to be expected due
to the linear behaviour of P1 in the length of the database. P1 also clearly outperforms its approximative counterpart, blur erosion. On the other hand, the
Fig. 5. Execution time on line segment data plotted on a logarithmic scale (time in ms
against pattern size in notes, left, and database size in thousands of notes, right;
curves: P3, Blur Er., Blur Corr.)

performance difference between the partial matching algorithms, blur correlation, MSM and P2, is less radical. P2 is clearly the fastest of these with small query
sizes, but as its time consumption grows with longer queries, it becomes the
slowest with very large query sizes. Nevertheless, even with small query sizes,
we believe that the enhanced error toleration is worth the extra time it requires.
4.3 Line Segment Representation

When experimenting with the line segment representations, we used P3 as a
reference algorithm for blur correlation. For blur erosion, we were not able to
find a suitable reference algorithm. Once again, however, blur erosion gives the
reader some general sense of the efficiency of the algorithms working with the line
segment representation.
The time consumption behaviour of P3, blur erosion and blur correlation is
depicted in Figure 5. The slight meandering seen in some of the curves is the
result of an uneven distribution of notes in the database. Analyzing the graphs
further, we notice that our blur correlation seems more competitive here than
in the pointset representation case. Again, we note that the independence of the
length of the pattern makes blur correlation faster than P3 with larger pattern
sizes: blur correlation outperforms P3 once the pattern size exceeds 256 notes.
Analyzing the results of experiments with differing database sizes and a query
pattern size of 128, the more restrictive blur erosion algorithm was the fastest of the
three. However, the three algorithms' time consumptions were roughly of the
same magnitude.

Fig. 6. (a) An excerpt of the database in a piano-roll representation with a jittering
window around each note. (b) Query pattern. (c) The query pattern inserted into the
jittering windows of the excerpt of the database.

Fig. 7. The subject used as a search pattern and the first approximate match
Fig. 8. A match found by blur erosion (a). An exact match found by both P1 and blur
erosion (b). This entry has too much variation even for blur erosion (c).

Our experiments also confirmed our claim of blur correlation handling jittering
better than P3. We were expecting this, and Figure 6 illustrates a characteristic
case where P3 will not find a match, but blur correlation will. In this case we
have a query pattern that is an excerpt of the database, with the distinction that
some of the notes have been displaced either time-wise or pitch-wise. Additionally
one note has been split into two. Blur correlation finds a perfect match in this
case, whereas P3 cannot, unless the threshold for the total common length is
exceeded. We were expecting this kind of result, since intuitively P3 cannot
handle this kind of jittering as well as the morphological algorithms do.
4.4 Finding Fugue Theme Entries

To further demonstrate the assets of the blur technique, we compared P1 and
blur erosion in the task of finding the theme entries in J. S. Bach's Fugue no. 16
in G minor, BWV 861, from the Well-Tempered Clavier, Book 1. The imitations
of a fugue theme often have slight variation. The theme in our case is shown in
Figure 7. In the following imitation, there is a little difference: the first interval
is a minor third instead of a minor second. This prevents P1 from finding a match
here. But with a vertical dilation of two pixels blur erosion managed to find the
match.
Figure 8 shows three entries of the theme. The first one has some variation
at the end and was only found by blur erosion. The second is an exact match.
Finally, the last entry could not be found by either of the algorithms, because
it has too much variation. In total, blur erosion found 16 occurrences, while P1
found only six.
If all the entries had differed only in one or two notes, it would have been
easy to find them using P2. For some of the imitations, however, less than

Fig. 9. Some more developed imitations of the theme with their proportions of exactly
matching notes

half of the notes match exactly the original form of the theme (see Figure 9).
Nevertheless, these imitations are fairly easily recognized visually and audibly.
Our blur-erosion algorithm found them all.

5 Conclusions

In this paper, we have combined existing image processing methods based on
mathematical morphology to construct a collection of new pattern matching
algorithms for symbolic music represented as binary images. Our aim was to gain
an improved error tolerance over the existing pointset-based and line-segment-based algorithms introduced for related problems.
Our algorithms solve three existing music retrieval problems, P1, P2 and P3.
Our basic algorithm based on erosion solves the exact matching problem P1. To
successfully solve the other two, we needed to relax the requirement of exact
matches, which we did by applying a rank order filtering technique. Using this
relaxation technique, we can solve both the partial pointset matching problem
P2 and the line segment matching problem P3. By introducing blurring in
the form of a preprocessing dilation, the error tolerance of these morphological algorithms can be improved. That way the algorithms are able to tolerate jittering
in both the time and the pitch dimension.
Compared to the solutions of the non-approximative problems, our new
algorithms tend to be somewhat slower. However, they are still comparable
performance-wise, and actually even faster in some cases. As the most important novelty of our algorithms is the added error tolerance given by blurring, we


think that the slowdown is rather modest compared to the added usability
of the algorithms. We expect our error-tolerant methods to give better results
in real-world applications when compared to the rivaling algorithms. As future
work, we plan to research and set up a relevant ground truth, as without
a ground truth we cannot adequately measure the precision and recall of the
algorithms. Other future work could include investigating the use of greyscale
morphology for introducing more fine-grained control over approximation.

Acknowledgements
This work was partially supported by the Academy of Finland, grants #108547,
#118653, #129909 and #218156.

References
1. Barrera Hern
andez, A.: Finding an o(n2 log n) algorithm is sometimes hard. In:
Proceedings of the 8th Canadian Conference on Computational Geometry, pp.
289294. Carleton University Press, Ottawa (1996)
2. Bloomberg, D., Maragos, P.: Generalized hit-miss operators with applications to
document image analysis. In: SPIE Conference on Image Algebra and Morphological Image Processing, pp. 116128 (1990)
3. Bloomberg, D., Vincent, L.: Pattern matching using the blur hit-miss transform.
Journal of Electronic Imaging 9(2), 140150 (2000)
4. Clausen, M., Engelbrecht, R., Meyer, D., Schmitz, J.: Proms: A web-based tool for
searching in polyphonic music. In: Proceedings of the International Symposium on
Music Information Retrieval (ISMIR 2000), Plymouth, MA (October 2000)
5. Cliord, R., Christodoulakis, M., Crawford, T., Meredith, D., Wiggins, G.: A
fast, randomised, maximal subset matching algorithm for document-level music
retrieval. In: Proceedings of the 7th International Conference on Music Information Retrieval (ISMIR 2006), Victoria, BC, Canada, pp. 150155 (2006)
6. Heijmans, H.: Mathematical morphology: A modern approach in image processing
based on algebra and geometry. SIAM Review 37(1), 136 (1995)
7. Hu, N., Dannenberg, R., Tzanetakis, G.: Polyphonic audio matching and alignment
for music retrieval. In: Proc. IEEE WASPAA, pp. 185188 (2003)
8. Karvonen, M., Lemstr
om, K.: Using mathematical morphology for geometric music
information retrieval. In: International Workshop on Machine Learning and Music
(MML 2008), Helsinki, Finland (2008)
9. Lemstr
om, K.: Towards more robust geometric content-based music retrieval. In:
Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR 2010), Utrecht, pp. 577582 (2010)
10. Lemstr
om, K., Mikkil
a, N., M
akinen, V.: Filtering methods for content-based
retrieval on indexed symbolic music databases. Journal of Information Retrieval 13(1), 121 (2010)
11. Lubiw, A., Tanur, L.: Pattern matching in polyphonic music as a weighted geometric translation problem. In: Proceedings of the 5th International Conference on
Music Information Retrieval (ISMIR 2004), Barcelona, pp. 289296 (2004)


12. Romming, C., Selfridge-Field, E.: Algorithms for polyphonic music retrieval: The
Hausdorff metric and geometric hashing. In: Proceedings of the 8th International
Conference on Music Information Retrieval (ISMIR 2007), Vienna, Austria (2007)
13. Typke, R.: Music Retrieval based on Melodic Similarity. Ph.D. thesis, Utrecht
University, Netherlands (2007)
14. Ukkonen, E., Lemström, K., Mäkinen, V.: Geometric algorithms for transposition
invariant content-based music retrieval. In: Proceedings of the 4th International
Conference on Music Information Retrieval (ISMIR 2003), Baltimore, MD, pp.
193-199 (2003)
15. Wiggins, G.A., Lemström, K., Meredith, D.: SIA(M)ESE: An algorithm for transposition invariant, polyphonic content-based music retrieval. In: Proceedings of the
3rd International Conference on Music Information Retrieval (ISMIR 2002), Paris,
France, pp. 283-284 (2002)

Melodic Similarity through Shape Similarity


Julián Urbano, Juan Lloréns, Jorge Morato, and Sonia Sánchez-Cuadrado
University Carlos III of Madrid
Department of Computer Science
Avda. Universidad, 30
28911 Leganés, Madrid, Spain
{jurbano,llorens}@inf.uc3m.es,
{jorge,ssanchec}@ie.inf.uc3m.es

Abstract. We present a new geometric model to compute the melodic similarity
of symbolic musical pieces. Melodies are represented as splines in the pitch-time plane, and their similarity is computed as the similarity of their shape. The
model is very intuitive and it is transposition and time scale invariant. We have
implemented it with a local alignment algorithm over sequences of n-grams that
define spline spans. An evaluation with the MIREX 2005 collections shows that
the model performs very well, obtaining the best effectiveness scores ever
reported for these collections. Three systems based on this new model were
evaluated in MIREX 2010, and the three systems obtained the best results.
Keywords: Music information retrieval, melodic similarity, interpolation.

1 Introduction
The problem of Symbolic Melodic Similarity, where musical pieces similar to a query
should be retrieved, has been approached from very different points of view [24][6].
Some techniques are based on string representations of music and editing distance
algorithms to measure the similarity between two pieces [17]. Later work has extended
this approach with other dynamic programming algorithms to compute global or
local alignments between the two musical pieces [19][11][12]. Other methods rely on
music representations based on n-grams [25][8][2], and other methods represent
music pieces as geometric objects, using different techniques to calculate the melodic
similarity based on the geometric similarity of the two objects. Some of these
geometric methods represent music pieces as sets of points in the pitch-time plane,
and then compute geometric similarities between these sets [26][23][7]. Others
represent music pieces as orthogonal polygonal chains crossing the set of pitch-time
points, and then measure the similarity as the minimum area between the two chains
[30][1][15].
In this paper we present a new model to compare melodic pieces. We adapted the
local alignment approach to work with n-grams instead of with single notes, and the
corresponding substitution score function between n-grams was also adapted to take
into consideration a new geometric representation of musical sequences. In this
geometric representation, we model music pieces as curves in the pitch-time plane,
and compare them in terms of their shape similarity.


In the next section we outline several problems that a symbolic music retrieval
system should address, and then we discuss the general solutions given in the
literature to these requirements. Next, we introduce our geometric representation
model, which compares two musical pieces by their shape, and see how this model
addresses the requirements discussed. In section 5 we describe how we have
implemented our model, and in section 6 we evaluate it with the training and
evaluation test collections used in the MIREX 2005 Symbolic Melodic Similarity task
(for short, we will refer to these collections as Train05 and Eval05) [10][21][28].
Finally, we finish with conclusions and lines for further research. An appendix reports
more evaluation results at the end.

2 Melodic Similarity Requirements


Due to the nature of the information treated in Symbolic Melodic Similarity [18], there
are some requirements that have to be considered from the very beginning when
devising a retrieval system. Byrd and Crawford identified some requirements that
they consider every MIR system should meet, such as the need for cross-voice
matching, polyphonic queries or the clear necessity of taking into account both the
horizontal and vertical dimensions of music [5]. Selfridge-Field identified three
elements that may confound both the users when they specify a query and the actual
retrieval systems when computing the similarity between two music pieces:
rests, repeated notes and grace notes [18]. In terms of cross-voice and polyphonic
material, she found five types of melody considered difficult to handle: compound,
self-accompanying, submerged, roving and distributed melodies. Mongeau and
Sankoff addressed repeated notes and refer to these situations as fragmentation and
consolidation [17].
We list here some more general requirements that should be common to any
Symbolic Melodic Similarity system, as we consider them basic for general user
needs. These requirements are divided into two categories: vertical (i.e. pitch) and
horizontal (i.e. time).
2.1 Vertical Requirements
Vertical requirements regard the pitch dimension of music: octave equivalence,
degree equality, note equality, pitch variation, harmonic similarity and voice
separation. A retrieval model that meets the first three requirements is usually
regarded as transposition invariant.
2.1.1 Octave Equivalence
When two pieces differ only in the octave they are written in, they should be
considered the same in terms of melodic similarity. Such a case is shown in
Fig. 1, with simple versions of the main riff in Layla, by Derek and the Dominos.
It has been pointed out that faculty or music students may want to retrieve pieces
within some certain pitch range, such as "C5 up to F#3" or "every work above A5"
[13]. However, this type of information need should be easily handled with


Fig. 1. Octave equivalence

metadata or a simple traverse through the sequence. We argue that users without such
a strong musical background will be interested in the recognition of a certain pitch
contour, and such cases are much more troublesome because some measure of
melodic similarity has to be calculated. This is the case of query by humming
applications.

2.1.2 Degree Equality
The score at the top of Fig. 1 shows a melody in the F major tonality, as well as the
corresponding pitch and tonality-degree for each note. Below, Fig. 2 shows exactly
the same melody shifted 7 semitones downwards to the Bb major tonality.

Fig. 2. Degree equality

The tonality-degrees used in both cases are the same, but the resultant notes are
not. Nonetheless, one would consider the second melody a version of the first one,
because they are the same in terms of pitch contour. Therefore, they should be
considered the same one by a retrieval system, which should also consider possible
modulations where the key changes somewhere throughout the song.

2.1.3 Note Equality
We could also consider the case where exactly the same melodies, with exactly the
same notes, are written in different tonalities and, therefore, each note corresponds to
a different tonality-degree in each case. Fig. 3 shows such a case, with the same
melody as in Fig. 1, but in the C major tonality.
Although the degrees do not correspond one to each other, the actual notes do, so
both pieces should be considered the same one in terms of melodic similarity.


Fig. 3. Note equality

2.1.4 Pitch Variation
Sometimes, a melody is altered by changing only the pitch of a few particular notes.
For instance, the first melody in Fig. 1 might be changed by shifting the 12th note
from D7 to A6 (which actually happens in the original song). Such a change should
not make a retrieval system disregard that result, but simply rank it lower, after the
exactly-equal ones. Thus, the retrieval process should not consider only exact
matching, where the query is part of a piece in the repository (or the other way
around). Approximate matching, where documents can be considered similar to a
query to some degree, should be the way to go. This is of particular interest for
scenarios like query by humming, where it is expected to have slight variations in
pitch in the melody hummed by the user.

2.1.5 Harmonic Similarity
Another desired feature would be to match harmonic pieces, both with harmonic and
melodic counterparts. For instance, in a triad chord (made up by the root note and its
major third and perfect fifth intervals), one might recognize only two notes (typically
the root and the perfect fifth). However, some other might recognize the root and the
major third, or just the root, or even consider them as part of a 4-note chord such as a
major seventh chord (which adds a major seventh interval). Fig. 4 shows the same
piece as in the top of Fig. 1, but with some intervals added to make the song more
harmonic. These two pieces have basically the same pitch progression, but with some
ornamentation, and they should be regarded as very similar by a retrieval system.

Fig. 4. Harmonic similarity

Thus, a system should be able to compare harmony wholly and partially,
considering again the Pitch Variation problem as a basis to establish differences
between songs.

2.1.6 Voice Separation
Fig. 5 below depicts a piano piece with 3 voices, which work together as a whole, but
could also be treated individually.


Fig. 5. Voice separation

Indeed, if this piece were played with a flute only one voice could be performed,
even if some streaming effect were produced by changing tempo and timbre for two
voices to be perceived by a listener [16]. Therefore, a query containing only one voice
should match with this piece in case that voice is similar enough to any of the three
marked in the figure.

2.2 Horizontal Requirements
Horizontal requirements regard the time dimension of music: time signature
equivalence, tempo equivalence, duration equality and duration variation. A retrieval
model that meets the second and third requirements is usually regarded as time scale
invariant.

2.2.1 Time Signature Equivalence
The top of Fig. 6 depicts a simplified version of the beginning of op. 81 no. 10 by S.
Heller, with its original 2/4 time signature. If a 4/4 time signature were used, like in
the bottom of Fig. 6, the piece would be split into bars of duration 4 crotchets each.

Fig. 6. Time signature equivalence

The only difference between these two pieces is actually how intense some notes
should be played. However, they are in essence the same piece, and no regular listener
would tell the difference. Therefore, we believe the time signature should not be
considered when comparing musical performances in terms of melodic similarity.

2.2.2 Tempo Equivalence
For most people, the piece at the top of Fig. 6, with a tempo of 112 crotchets per
minute, would sound like the one in Fig. 7, where notes have twice the length but the
minute, would sound like th
he one in Fig. 7, where notes have twice the length but the


Fig. 7. Tempo equivalence

whole score is played twice as fast, at 224 crotchets per minute. These two changes
result in exactly the same actual time.
On the other hand, it might also be considered a tempo of 56 crotchets per minute
and notes with half the duration. Moreover, the tempo can change somewhere in the
middle of the melody, and therefore change the actual time of each note afterwards.
Therefore, actual note lengths cannot be considered as the only horizontal measure,
because these three pieces would sound the same to any listener.

2.2.3 Duration Equality
If the melody at the top of Fig. 6 were played slower or quicker by means of a tempo
variation, but maintaining the rhythm, an example of the result would be like the
score in Fig. 8.

Fig. 8. Duration equality

Even though the melodic perception does actually change, the rhythm does not,
and neither does the pitch contour. Therefore, they should be considered as virtually
the same, maybe with some degree of dissimilarity based on the tempo variation.

2.2.4 Duration Variation
As with the Pitch Variation problem, sometimes a melody is altered by changing only
the rhythm of a few notes. For instance, the melody in Fig. 9 maintains the same pitch
contour as in Fig. 6, but changes the duration of some notes.

Fig. 9. Duration variation

Variations like these are common and they should be considered as well, just
like the Pitch Variation problem, allowing approximate matches instead of just exact
ones.


3 General Solutions to the Requirements


Most of these problems have already been addressed in the literature. Next, we
describe and evaluate the most used and accepted solutions.
3.1 Vertical Requirements
The immediate solution to the Octave Equivalence problem is to consider octave
numbers with their relative variation within the piece. Surely, a progression from G5
to C6 is not the same as a progression from G5 to C5. For the Degree Equality
problem it seems to be clear that tonality degrees must be used, rather than actual
pitch values, in order to compare two melodies. However, the Note Equality problem
suggests the opposite.
The accepted solution for these three vertical problems seems to be the use of
relative pitch differences as the units for the comparison, instead of the actual pitch or
degree values. Some approaches consider pitch intervals between two successive
notes [11][8][15], between each note and the tonic (assuming the key is known and
failing to meet the Note Equality problem) [17], or a mixture of both [11]. Others
compute similarities without pitch intervals, but allowing vertical translations in the
time dimension [1][19][30]. The Voice Separation problem is usually assumed to be solved in a previous stage, as the input to these systems is usually a single melodic sequence. There are approaches that attempt to solve this problem [25][14].
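As a purely illustrative sketch of the interval-based solution described above (the helper name is ours, not taken from any of the cited systems), absolute pitches can be converted into successive pitch intervals, which makes the comparison invariant to octave shifts and transpositions:

```python
def to_intervals(pitches):
    """Relative pitch differences between successive notes (in semitones).

    A melody transposed up or down, e.g. shifted a whole octave,
    yields exactly the same interval sequence.
    """
    return [b - a for a, b in zip(pitches, pitches[1:])]

# G5 -> C6 rises 5 semitones, while G5 -> C5 falls 7: the progressions differ,
# but transposing either melody leaves its interval sequence unchanged.
print(to_intervals([79, 84]))   # [5]   (G5 -> C6)
print(to_intervals([79, 72]))   # [-7]  (G5 -> C5)
print(to_intervals([67, 72]))   # [5]   (G4 -> C5, same contour as G5 -> C6)
```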
3.2 Horizontal Requirements
Although the time signature of a performance is valuable for other purposes such as pattern search or score alignment, it seems to us that it should not be considered at all when comparing two pieces melodically.
According to the Tempo Equivalence problem, actual time should be considered rather than score time, since it would probably be easier for a regular user to provide
actual rhythm information. On the other hand, the Duration Equality problem requires
the score time to be used instead. Thus, it seems that both measures have to be taken
into account. The actual time is valuable for most users without a musical
background, while the score time might be more valuable for people who do have it.
However, when facing the Duration Variation problem it seems necessary to use
some sort of timeless model. The solution could be to compare both actual and score
time [11], or to use relative differences between notes, in this case with the ratio between two notes' durations [8]. Other approaches use a rhythmical framework to
represent note durations as multiples of a base score duration [2][19][23], which does
not meet the Tempo Equivalence problem and hence is not time scale invariant.

4 A Model Based on Interpolation


We developed a new geometric model that represents musical pieces with curves in
the pitch-time plane, extending the model with orthogonal polynomial chains
[30][1][15]. Notes are represented as points in the pitch-time plane, with positions relative to their pitch and duration differences. Then, we define the curve C(t) as the interpolating curve passing through each point (see Fig. 10). Should the song have multiple voices, each one would be placed in a different pitch-time plane, sharing the same time dimension, but with a different curve Ci(t) (where the subscript i indicates the voice number). Note that we thus assume the voices are already separated.
With this representation, the similarity between two songs can be thought of as the similarity in shape between the two curves they define. Every vertical requirement identified in Section 2.1 would be met with this representation: a song with an octave shift would keep the same shape; if the tonality changed, the shape of the curve would not be affected either; and if the notes remained the same after a tonality change, so would the curve. The Pitch Variation problem can be addressed analytically by measuring the curvature difference, and different voices can be compared individually in the same way because they are in different planes.

Fig. 10. Melody represented as a curve in a pitch-time plane

The same thing happens with the horizontal requirements: the Tempo Equivalence and Duration Equality problems can be solved analytically, because they imply just a linear transformation in the time dimension. For example, if the melody at the top of Fig. 6 is defined with curve C(t) and the one in Fig. 7 is denoted with curve D(t), it can be easily proved that C(2t)=D(t). Moreover, the Duration Variation problem could be addressed analytically like the Pitch Variation problem, and the Time Signature Equivalence problem is not an issue because the shape of the curve is independent of the time signature.
4.1 Measuring Dissimilarity with the Change in Shape
Having musical pieces represented with curves, each one of them could be defined with a polynomial of the form C(t) = a_n t^n + a_(n-1) t^(n-1) + … + a_1 t + a_0. The first derivative of this polynomial measures how much the shape of the curve is changing at a particular point in time (i.e. how the song changes). To measure the change of one curve with respect to another, the area between the first derivatives could be used.
Note that a shift in pitch would mean just a shift in the a_0 term. As it turns out, when calculating the first derivative of the curves this term is canceled, which is why the vertical requirements are met: shifts in pitch are not reflected in the shape of the curve, so they are not reflected in the first derivative either. Therefore, this representation is transposition invariant.
The song is actually defined by the first derivative of its interpolating curve C(t). The dissimilarity between two songs, say C(t) and D(t), would be defined as the area between their first derivatives (measured with the integral over the absolute value of their difference):

diff(C, D) = ∫ |C'(t) − D'(t)| dt        (1)

The representation with orthogonal polynomial chains also led to the measurement of dissimilarity as the area between the curves [30][1]. However, such a representation is not directly transposition invariant unless it uses pitch intervals instead of absolute pitch values, and a more complex algorithm is needed to overcome this problem [15]. As orthogonal chains are not differentiable, this would be the indirect equivalent of calculating the first derivative as we do.
This dissimilarity measurement based on the area between curves turns out to be a metric function, because it has the following properties:

Non-negativity, diff(C, D) ≥ 0: because the absolute value is never negative.
Identity of indiscernibles, diff(C, D) = 0 ⟺ C = D: because, with the absolute value, the only way to have no difference is with the same exact curve¹.
Symmetry, diff(C, D) = diff(D, C): again, because the integral is over the absolute value of the difference.
Triangle inequality, diff(C, E) ≤ diff(C, D) + diff(D, E): since |C'(t) − E'(t)| ≤ |C'(t) − D'(t)| + |D'(t) − E'(t)| for every t,

diff(C, D) + diff(D, E) = ∫ |C'(t) − D'(t)| dt + ∫ |D'(t) − E'(t)| dt
                        = ∫ (|C'(t) − D'(t)| + |D'(t) − E'(t)|) dt
                        ≥ ∫ |C'(t) − D'(t) + D'(t) − E'(t)| dt
                        = ∫ |C'(t) − E'(t)| dt = diff(C, E)

Therefore, many indexing and retrieval techniques, like vantage objects [4], could be exploited if using this metric.
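As an illustration only (not the authors' implementation), the metric can be approximated numerically by sampling the two curves on a common grid, differentiating, and integrating the absolute difference; the sketch below uses a finite-difference derivative, whereas the paper works with the analytical derivatives of the interpolating polynomials:

```python
import numpy as np

def diff_metric(c, d, t):
    """Approximate diff(C, D) = integral of |C'(t) - D'(t)| dt.

    c, d : values of the two curves sampled on the common time grid t.
    """
    dc = np.gradient(c, t)   # numerical first derivative of C
    dd = np.gradient(d, t)   # numerical first derivative of D
    return np.trapz(np.abs(dc - dd), t)

t = np.linspace(0.0, 1.0, 200)
c = np.sin(2 * np.pi * t)            # toy melody curve
d = np.sin(2 * np.pi * t) + 3.0      # the same curve shifted "in pitch"
print(diff_metric(c, d, t))          # ~0: the constant shift vanishes in the derivative
```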
4.2 Interpolation with Splines
The next issue to address is the interpolation method to use. The standard Lagrange interpolation method, though simple, is known to suffer from Runge's Phenomenon [3]. As the number of points increases, the interpolating curve wiggles a lot, especially at the beginning and the end of the curve. As such, one curve could be very different from another one having just one more point at the end: the shape would be different, and so the dissimilarity metric would report a difference even when the two curves are practically identical. Moreover, a very small difference in one of the points could translate into an extreme variation in the overall curve, which would make it virtually impossible to handle the Pitch and Duration Variation problems properly (see top of Fig. 11).
¹ Actually, this means that the first derivatives are the same; the actual curves could still be shifted. Nonetheless, this is the behavior we want.


Fig. 11. Runge's Phenomenon

A way around Runge's Phenomenon is the use of splines (see bottom of Fig. 11). Besides, splines are also easy to calculate and they are defined as piece-wise functions, which comes in handy when addressing the horizontal requirements. We saw above that the horizontal problems could be solved, as they implied just a linear transformation of the form D(t) → D(kt) in one of the curves. However, the calculation of the term k is anything but straightforward, and the transformation would apply to the whole curve, complicating the measurement of differences for the Duration Variation problem. The solution would be to split the curve into spans, and define it as

Ci(t) = { ci,1(t),            ti,1 ≤ t ≤ ti,kn
          ci,2(t),            ti,2 ≤ t ≤ ti,kn+1
          …
          ci,mi−kn+1(t),      ti,mi−kn+1 ≤ t ≤ ti,mi        (2)

where ti,j denotes the onset time of the j-th note in the i-th voice, mi is the length of the i-th voice, and kn is the span length. With this representation, linear transformations would be applied only to a single span without affecting the whole curve. Moreover, the duration of the spans could be normalized from 0 to 1, making it easy to calculate the term k and comply with the time scale invariance requirements.
Most spline interpolation methods define the curve in parametric form (i.e. with one function per dimension). In this case, it results in one function for the pitch and one function for the time. This means that the two musical dimensions could be compared separately, giving more weight to one or the other. Therefore, the dissimilarity between two spans c(t) and d(t) would be the sum of the pitch and time dissimilarities as measured by (1):

diff(c, d) = kp·diffp(c, d) + kt·difft(c, d)        (3)

where diffp and difft are functions as in (1) that consider only the pitch and time dimensions, respectively, and kp and kt are fine-tuning constants. Different works suggest that pitch is much more important than time for comparing melodic similarity, so more weight should be given to kp [19][5][8][23][11].


5 Implementation
Geometric representations of music pieces are very intuitive, but they are not
necessarily easy to implement. We could follow the approach of moving one curve
towards the other looking for the minimum area between them [1][15]. However, this
approach is very sensitive to small differences in the middle of a song, such as
repeated notes: if a single note were added or removed from a melody, it would be
impossible to fully match the original melody from that note to the end. Instead, we
follow a dynamic programming approach to find an alignment between the two
melodies [19].
Various approaches for melodic similarity have applied editing distance algorithms
upon textual representations of musical sequences that assign one character to each
interval or each n-gram [8]. This dissimilarity measure has been improved in recent
years, and sequence alignment algorithms have proved to perform better than simple
editing distance algorithms [11][12]. Next, we describe the representation and
alignment method we use.
5.1 Melody Representation
To practically apply our model, we followed a basic n-gram approach, where each n-gram represents one span of the spline. The pitch of each note was represented as the relative difference to the pitch of the first note in the n-gram, and the duration was represented as the ratio to the duration of the whole n-gram. For example, an n-gram of length 4 with absolute pitches (74, 81, 72, 76) and absolute durations (240, 480, 240, 720) would be modeled as (81−74, 72−74, 76−74) = (7, −2, 2) in terms of pitch and (240, 480, 240, 720)/1680 = (0.1429, 0.2857, 0.1429, 0.4286) in terms of duration. Note that the first note is omitted in the pitch representation as it is always 0.
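A minimal sketch of this representation (the function name is ours, not the paper's), reproducing the example above:

```python
def ngram_model(pitches, durations):
    """Relative representation of one n-gram (span).

    Pitches become differences to the first note of the n-gram (its own
    value, always 0, is omitted); durations become ratios to the n-gram's
    total duration, making the span transposition and time scale invariant.
    """
    rel_pitch = [p - pitches[0] for p in pitches[1:]]
    total = float(sum(durations))
    rel_dur = [d / total for d in durations]
    return rel_pitch, rel_dur

pitch, dur = ngram_model([74, 81, 72, 76], [240, 480, 240, 720])
print(pitch)                          # [7, -2, 2]
print([round(r, 4) for r in dur])     # [0.1429, 0.2857, 0.1429, 0.4286]
```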
This representation is transposition invariant because a melody shifted in the pitch
dimension maintains the same relative pitch intervals. It is also time scale invariant
because the durations are expressed as their relative duration within the span, and so
they remain the same in the face of tempo and actual or score duration changes. This
is of particular interest for query by humming applications and unquantized pieces, as
small variations in duration would have negligible effects on the ratios.
We used Uniform B-Splines as interpolation method [3]. This results in a
parametric polynomial function for each n-gram. In particular, an n-gram of length kn
results in a polynomial of degree kn-1 for the pitch dimension and a polynomial of
degree kn-1 for the time dimension. Because the actual representation uses the first
derivatives, each polynomial is actually of degree kn-2.
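The sketch below (our own approximation, not the authors' code) builds the parametric spline of one n-gram with SciPy and takes its first derivatives; note that SciPy's make_interp_spline places non-uniform interior knots, so it only approximates the Uniform B-Splines used in the paper:

```python
import numpy as np
from scipy.interpolate import make_interp_spline

# One n-gram of length 4: parameter values, relative pitches and
# cumulative normalized onset times (from the example in Section 5.1).
u = np.linspace(0.0, 1.0, 4)
pitch = np.array([0.0, 7.0, -2.0, 2.0])
time = np.array([0.0, 0.1429, 0.4286, 0.5714])

# Parametric form: one interpolating spline per dimension, degree kn - 1 = 3.
pitch_spline = make_interp_spline(u, pitch, k=3)
time_spline = make_interp_spline(u, time, k=3)

# The model actually works with the first derivatives (degree kn - 2).
dpitch = pitch_spline.derivative()
dtime = time_spline.derivative()

grid = np.linspace(0.0, 1.0, 100)
print(dpitch(grid)[:3], dtime(grid)[:3])   # sampled derivative values
```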
5.2 Melody Alignment
We used the Smith-Waterman local alignment algorithm [20], with the two sequences
of overlapping spans as input, defined as in (2). Therefore, the input symbols to the
alignment algorithm are actually the parametric pitch and time functions of a span,


based on the above representation of n-grams. The edit operations we define for the Smith-Waterman algorithm are as follows:

Insertion: s(−, c). Adding a span c is penalized with the score diff(c, ∅(c)).
Deletion: s(c, −). Deleting a span c is penalized with the score diff(c, ∅(c)).
Substitution: s(c, d). Substituting a span c with d is penalized with diff(c, d).
Match: s(c, c). Matching a span c is rewarded with the score 2(kp·μp + kt·μt).

where ∅(c) returns the null n-gram of c (i.e. an n-gram equal to c but with all pitch intervals set to 0), and μp and μt are the mean differences calculated by diffp and difft respectively over a random sample of 100,000 pairs of n-grams sampled from the set of incipits in the Train05 collection.
We also normalized the dissimilarity scores returned by difft. From the results in Table 1 it can be seen that pitch dissimilarity scores are between 5 and 7 times larger than time dissimilarity scores. Therefore, the choice of kp and kt does not intuitively reflect the actual weight given to the pitch and time dimensions. For instance, the selection of kt=0.25, chosen in studies like [11], would result in an actual weight between 0.05 and 0.0357. To avoid this effect, we normalized every time dissimilarity score by multiplying it by a factor ρ = μp / μt. As such, the score of the match operation is actually defined as s(c, c) = 2μp(kp + kt), and the dissimilarity function defined in (3) is actually calculated as diff(c, d) = kp·diffp(c, d) + kt·ρ·difft(c, d).
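A compact sketch of the alignment step (illustrative only; span_diff stands for the normalized dissimilarity of (3), null_span for the ∅() operation, and match_reward for 2μp(kp + kt), but all names are ours):

```python
def smith_waterman(spans_a, spans_b, span_diff, null_span, match_reward):
    """Local alignment of two sequences of spans (Smith-Waterman).

    span_diff(c, d)  : dissimilarity between two spans, as in (3).
    null_span(c)     : the null n-gram of c (all pitch intervals set to 0).
    match_reward     : score for an exact match.
    Returns the best local alignment score (higher = more similar).
    """
    n, m = len(spans_a), len(spans_b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = spans_a[i - 1], spans_b[j - 1]
            sub = match_reward if a == b else -span_diff(a, b)
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + sub,                      # match / substitution
                          H[i - 1][j] - span_diff(a, null_span(a)),   # deletion
                          H[i][j - 1] - span_diff(b, null_span(b)))   # insertion
            best = max(best, H[i][j])
    return best
```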

6 Experimental Results²
We evaluated the model proposed with the Train05 and Eval05 test collections used
in the MIREX 2005 Symbolic Melodic Similarity Task [21][10], measuring the mean
Average Dynamic Recall score across queries [22]. Both collections consist of about
580 incipits and 11 queries each, with their corresponding ground truths. Each ground
truth is a list of all incipits similar to each query, according to a panel of experts, and
with groups of incipits considered equally similar to the query.
However, we have recently shown that these lists have inconsistencies whereby
incipits judged as equally similar by the experts are not in the same similarity group
and vice versa [28]. All these inconsistencies result in a very permissive evaluation
where a system could return incipits not similar to the query and still be rewarded for
it. Thus, results reported with these lists are actually overestimated, by as much as
12% in the case of the MIREX 2005 evaluation. We have proposed alternatives to
arrange the similarity groups for each query, proving that the new arrangements are
significantly more consistent than the original one, leading to a more robust
evaluation. The most consistent ground truth lists were those called Any-1 [28].
Therefore, we will use these Any-1 ground truth lists from this point on to evaluate
our model, as they offer more reliable results. Nonetheless, all results are reported in
an appendix as if using the original ground truths employed in MIREX 2005, called
All-2, for the sake of comparison with previous results.

² All system outputs and ground truth lists used in this paper can be downloaded from http://julian-urbano.info/publications/


To determine the value of the kn and kt parameters, we used a full factorial experimental design. We tested our model with n-gram lengths in the range kn ∈ {3, 4, 5, 6, 7}, which result in Uniform B-Spline polynomials of degrees 1 to 5. The value of kp was kept at 1, and kt was treated as a nominal factor with levels kt ∈ {0, 0.1, 0.2, …, 1}.
6.1 Normalization Factor
First, we calculated the mean dissimilarity scores μp and μt for each n-gram length kn, according to diffp and difft over a random sample of 100,000 pairs of n-grams. Table 1 lists the results. As mentioned, the pitch dissimilarity scores are between 5 and 7 times larger than the time dissimilarity scores, suggesting the use of the normalization factor defined above.
Table 1. Mean and standard deviation of the diffp and difft functions applied upon a random sample of 100,000 pairs of n-grams of different sizes

kn    μp       σp       μt       σt       ρ = μp/μt
3     2.8082   1.6406   0.5653   0.6074   4.9676
4     2.5019   1.6873   0.494    0.5417   5.0646
5     2.2901   1.4568   0.4325   0.458    5.2950
6     2.1347   1.4278   0.3799   0.3897   5.6191
7     2.0223   1.3303   0.2863   0.2908   7.0636

There also appears to be a negative correlation between the n-gram length and the
dissimilarity scores. This is caused by the degree of the polynomials defining the
splines: high-degree polynomials fit the points more smoothly than low-degree ones.
Polynomials of low degree tend to wiggle more, and so their derivatives are more
pronounced and lead to larger areas between curves.
6.2 Evaluation with the Train05 Test Collection, Any-1 Ground Truth Lists
The experimental design results in 55 trials for the 5 different levels of kn and the 11
different levels of kt. All these trials were performed with the Train05 test collection,
ground truths aggregated with the Any-1 function [28]. Table 2 shows the results.
In general, large n-grams tend to perform worse. This could probably be explained
by the fact that large n-grams define the splines with smoother functions, and the
differences in shape may be too small to discriminate musically perceptual
differences. However, kn=3 seems to be the exception (see Fig. 12). This is probably
caused by the extremely low degree of the derivative polynomials. N-grams of length
kn=3 result in splines defined with polynomials of degree 2, which are then
differentiated and result in polynomials of degree 1. That is, they are just straight
lines, and so a small difference in shape can turn into a relatively large dissimilarity
score when measuring the area.
Overall, kn=4 and kn=5 seem to perform the best, although kn=4 is more stable
across levels of kt. In fact, kn=4 and kt=0.6 obtain the best score, 0.7215. This result
agrees with other studies where n-grams of length 4 and 5 were also found to perform


Table 2. Mean ADR scores for each combination of kn and kt with the Train05 test collection,
ground truth lists aggregated with the Any-1 function. kp is kept to 1. Bold face for largest
scores per row and italics for largest scores per column.
kn    kt=0     kt=0.1   kt=0.2   kt=0.3   kt=0.4   kt=0.5   kt=0.6   kt=0.7   kt=0.8   kt=0.9   kt=1
3     0.6961   0.7067   0.7107   0.7106   0.7102   0.7109   0.7148   0.7110   0.7089   0.7045   0.6962
4     0.7046   0.7126   0.7153   0.7147   0.7133   0.7200   0.7215   0.7202   0.7128   0.7136   0.7090
5     0.7093   0.7125   0.7191   0.7200   0.7173   0.7108   0.7040   0.6978   0.6963   0.6973   0.6866
6     0.7140   0.7132   0.7115   0.7088   0.7008   0.6930   0.6915   0.6874   0.6820   0.6765   0.6763
7     0.6823   0.6867   0.6806   0.6747   0.6538   0.6544   0.6529   0.6517   0.6484   0.6465   0.6432

Fig. 12. Mean ADR scores for each combination of kn and kt with the Train05 test collection, ground truth lists aggregated with the Any-1 function. kp is kept to 1

better [8]. Moreover, this combination of parameters obtains a mean ADR score of
0.8039 when evaluated with the original All-2 ground truths (see Appendix). This is
the best score ever reported for this collection.
6.3 Evaluation with the Eval05 Test Collection, Any-1 Ground Truth Lists
In a fair evaluation scenario, we would use the previous experiment to train our
system and choose the values of kn and kt that seem to perform the best (in particular,
kn=4 and kt=0.6). Then, the system would be run and evaluated with a different
collection to assess the external validity of the results and try to avoid overfitting to
the training collection. For the sake of completeness, here we show the results for all
55 combinations of the parameters with the Eval05 test collection used in MIREX
2005, again aggregated with the Any-1 function [28]. Table 3 shows the results.
Unlike the previous experiment with the Train05 test collection, in this case the
variation across levels of kt is smaller (the mean standard deviation is twice as much
in Train05), indicating that the use of the time dimension does not provide better
results overall (see Fig. 13). This is probably caused by the particular queries in each
collection. Seven of the eleven queries in Train05 start with long rests, while this


Table 3. Mean ADR scores for each combination of kn and kt with the Eval05 test collection,
ground truth lists aggregated with the Any-1 function. kp is kept to 1. Bold face for largest
scores per row and italics for largest scores per column.

kn    kt=0     kt=0.1   kt=0.2   kt=0.3   kt=0.4   kt=0.5   kt=0.6   kt=0.7   kt=0.8   kt=0.9   kt=1
3     0.6522   0.6601   0.6646   0.6612   0.6640   0.6539   0.6566   0.6576   0.6591   0.6606   0.6620
4     0.6530   0.6530   0.6567   0.6616   0.6629   0.6633   0.6617   0.6569   0.6500   0.6630   0.6531
5     0.6413   0.6367   0.6327   0.6303   0.6284   0.6328   0.6478   0.6461   0.6419   0.6414   0.6478
6     0.6269   0.6251   0.6225   0.6168   0.6216   0.6284   0.6255   0.6192   0.6173   0.6144   0.6243
7     0.5958   0.6230   0.6189   0.6163   0.6162   0.6192   0.6215   0.6174   0.6148   0.6112   0.6106

Fig. 13. Mean ADR scores for each combination of kn and kt with the Eval05 test collection, ground truth lists aggregated with the Any-1 function. kp is kept to 1

happens for only three of the eleven queries in Eval05. In our model, rests are ignored, and so the effect of the time dimension is larger when the queries themselves have rests, as their duration is added to the next note's.
Likewise, large n-grams tend to perform worse. In this case though, n-grams of
length kn=3 and kn=4 perform the best. The most effective combination is kn=3 and
kt=0.2, with a mean ADR score of 0.6646. However, kn=4 and kt=0.5 is very close,
with a mean ADR score of 0.6633. Therefore, based on the results of the previous
experiment and the results in this one, we believe that kn=4 and kt ∈ [0.5, 0.6] are the
best parameters overall.
It is also important to note that none of the 55 combinations we ran results in a mean ADR score less than 0.594, which was the highest score achieved in the actual MIREX 2005 evaluation with the Any-1 ground truths [28]. Therefore, our systems would have ranked first had they participated.

7 Conclusions and Future Work


We have proposed a new transposition and time scale invariant model to represent
musical pieces and compute their melodic similarity. Songs are considered as curves
in the pitch-time plane, allowing us to compute their melodic similarity in terms of the
shape similarity of the curves they define. We have implemented it with a local


alignment algorithm over sequences of spline spans, each of which is represented by


one polynomial for the pitch dimension and another polynomial for the time
dimension. This parametric representation of melodies permits the application of a
weight scheme between pitch and time dissimilarities.
The MIREX 2005 test collections have been used to evaluate the model for several
span lengths and weight schemes. Overall, spans 4 notes long seem to perform the
best, with longer spans performing gradually worse. The optimal weight scheme we found gives about twice as much importance to the pitch dimension as to the time
dimension. However, time dissimilarities need to be normalized, as they are shown to
be about five times smaller than pitch dissimilarities.
This model obtains the best mean ADR score ever reported for the MIREX 2005
training collection, and every span length and weight scheme evaluated would have
ranked first in the actual evaluation of that edition. However, the use of the time
dimension did not improve the results significantly for the evaluation collection. On
the other hand, three systems derived from this model were submitted to the MIREX
2010 edition: PitchDeriv, ParamDeriv and Shape [27]. These systems obtained the
best results, and they ranked the top three in this edition. Again, the use of the time
dimension was not shown to improve the results.
A rough analysis of the MIREX 2005 and 2010 collections shows that the queries
used in the 2005 training collection have significantly more rests than in the
evaluation collection, and they are virtually absent in the 2010 collection. Because our
model ignores rests, simply adding their durations to the next note's duration, the use
of the time dimension is shown to improve the results only in the 2005 training
collection. This evidences the need for larger and more heterogeneous test collections
for the Symbolic Melodic Similarity task, for researchers to train and tune their
systems properly and reduce overfitting to particular collections [9][29].
The results indicate that this line of work is certainly promising. Further research
should address the interpolation method to use and different ways of splitting the curve into spans, extend the model to consider rests and polyphonic material, and evaluate on more heterogeneous collections.

References
1. Aloupis, G., Fevens, T., Langerman, S., Matsui, T., Mesa, A., Nuñez, Y., Rappaport, D., Toussaint, G.: Algorithms for Computing Geometric Measures of Melodic Similarity. Computer Music Journal 30(3), 67–76 (2006)
2. Bainbridge, D., Dewsnip, M., Witten, I.H.: Searching Digital Music Libraries. Information Processing and Management 41(1), 41–56 (2005)
3. de Boor, C.: A Practical Guide to Splines. Springer, Heidelberg (2001)
4. Bozkaya, T., Ozsoyoglu, M.: Indexing Large Metric Spaces for Similarity Search Queries. ACM Transactions on Database Systems 24(3), 361–404 (1999)
5. Byrd, D., Crawford, T.: Problems of Music Information Retrieval in the Real World. Information Processing and Management 38(2), 249–272 (2002)
6. Casey, M.A., Veltkamp, R.C., Goto, M., Leman, M., Rhodes, C., Slaney, M.: Content-Based Music Information Retrieval: Current Directions and Future Challenges. Proceedings of the IEEE 96(4), 668–695 (2008)


7. Clifford, R., Christodoulakis, M., Crawford, T., Meredith, D., Wiggins, G.: A Fast, Randomised, Maximal Subset Matching Algorithm for Document-Level Music Retrieval. In: International Conference on Music Information Retrieval, pp. 150–155 (2006)
8. Doraisamy, S., Rüger, S.: Robust Polyphonic Music Retrieval with N-grams. Journal of Intelligent Systems 21(1), 53–70 (2003)
9. Downie, J.S.: The Scientific Evaluation of Music Information Retrieval Systems: Foundations and Future. Computer Music Journal 28(2), 12–23 (2004)
10. Downie, J.S., West, K., Ehmann, A.F., Vincent, E.: The 2005 Music Information Retrieval Evaluation Exchange (MIREX 2005): Preliminary Overview. In: International Conference on Music Information Retrieval, pp. 320–323 (2005)
11. Hanna, P., Ferraro, P., Robine, M.: On Optimizing the Editing Algorithms for Evaluating Similarity Between Monophonic Musical Sequences. Journal of New Music Research 36(4), 267–279 (2007)
12. Hanna, P., Robine, M., Ferraro, P., Allali, J.: Improvements of Alignment Algorithms for Polyphonic Music Retrieval. In: International Symposium on Computer Music Modeling and Retrieval, pp. 244–251 (2008)
13. Isaacson, E.U.: Music IR for Music Theory. In: The MIR/MDL Evaluation Project White Paper Collection, 2nd edn., pp. 23–26 (2002)
14. Kilian, J., Hoos, H.H.: Voice Separation – A Local Optimisation Approach. In: International Symposium on Music Information Retrieval, pp. 39–46 (2002)
15. Lin, H.-J., Wu, H.-H.: Efficient Geometric Measure of Music Similarity. Information Processing Letters 109(2), 116–120 (2008)
16. McAdams, S., Bregman, A.S.: Hearing Musical Streams. In: Roads, C., Strawn, J. (eds.) Foundations of Computer Music, pp. 658–598. The MIT Press, Cambridge (1985)
17. Mongeau, M., Sankoff, D.: Comparison of Musical Sequences. Computers and the Humanities 24(3), 161–175 (1990)
18. Selfridge-Field, E.: Conceptual and Representational Issues in Melodic Comparison. Computing in Musicology 11, 3–64 (1998)
19. Smith, L.A., McNab, R.J., Witten, I.H.: Sequence-Based Melodic Comparison: A Dynamic Programming Approach. Computing in Musicology 11, 101–117 (1998)
20. Smith, T.F., Waterman, M.S.: Identification of Common Molecular Subsequences. Journal of Molecular Biology 147(1), 195–197 (1981)
21. Typke, R., den Hoed, M., de Nooijer, J., Wiering, F., Veltkamp, R.C.: A Ground Truth for Half a Million Musical Incipits. Journal of Digital Information Management 3(1), 34–39 (2005)
22. Typke, R., Veltkamp, R.C., Wiering, F.: A Measure for Evaluating Retrieval Techniques based on Partially Ordered Ground Truth Lists. In: IEEE International Conference on Multimedia and Expo, pp. 1793–1796 (2006)
23. Typke, R., Veltkamp, R.C., Wiering, F.: Searching Notated Polyphonic Music Using Transportation Distances. In: ACM International Conference on Multimedia, pp. 128–135 (2004)
24. Typke, R., Wiering, F., Veltkamp, R.C.: A Survey of Music Information Retrieval Systems. In: International Conference on Music Information Retrieval, pp. 153–160 (2005)
25. Uitdenbogerd, A., Zobel, J.: Melodic Matching Techniques for Large Music Databases. In: ACM International Conference on Multimedia, pp. 57–66 (1999)
26. Ukkonen, E., Lemström, K., Mäkinen, V.: Geometric Algorithms for Transposition Invariant Content-Based Music Retrieval. In: International Conference on Music Information Retrieval, pp. 193–199 (2003)


27. Urbano, J., Lloréns, J., Morato, J., Sánchez-Cuadrado, S.: MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Representations. Music Information Retrieval Evaluation eXchange (2010)
28. Urbano, J., Marrero, M., Martín, D., Lloréns, J.: Improving the Generation of Ground Truths based on Partially Ordered Lists. In: International Society for Music Information Retrieval Conference, pp. 285–290 (2010)
29. Urbano, J., Morato, J., Marrero, M., Martín, D.: Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks. In: ACM SIGIR Workshop on Crowdsourcing for Search Evaluation, pp. 9–16 (2010)
30. Ó Maidín, D.: A Geometrical Algorithm for Melodic Difference. Computing in Musicology 11, 65–72 (1998)

Appendix: Results with the Original All-2 Ground Truth Lists


Here we list the results of all 55 combinations of kn and kt evaluated with the
original Train05 (see Table 4) and Eval05 (see Table 5) test collections, ground truth
lists aggregated with the All-2 function [21][28]. These numbers permit a direct
comparison with previous studies that used these ground truth lists as well.
The qualitative results remain the same: kn=4 seems to perform the best, and the
effect of the time dimension is much larger in the Train05 collection. Remarkably, in
Eval05 kn=4 outperforms all other n-gram lengths for all but two levels of kt.
Table 4. Mean ADR scores for each combination of kn and kt with the Train05 test collection,
ground truth lists aggregated with the original All-2 function. kp is kept to 1. Bold face for
largest scores per row and italics for largest scores per column.
kn    kt=0     kt=0.1   kt=0.2   kt=0.3   kt=0.4   kt=0.5   kt=0.6   kt=0.7   kt=0.8   kt=0.9   kt=1
3     0.7743   0.7793   0.7880   0.7899   0.7893   0.7910   0.7936   0.7864   0.7824   0.7770   0.7686
4     0.7836   0.7899   0.7913   0.7955   0.7946   0.8012   0.8039   0.8007   0.7910   0.7919   0.7841
5     0.7844   0.7867   0.7937   0.7951   0.7944   0.7872   0.7799   0.7736   0.7692   0.7716   0.7605
6     0.7885   0.7842   0.7891   0.7851   0.7784   0.7682   0.7658   0.7620   0.7572   0.7439   0.7388
7     0.7598   0.7573   0.7466   0.7409   0.7186   0.7205   0.7184   0.7168   0.7110   0.7075   0.6997

Table 5. Mean ADR scores for each combination of kn and kt with the Eval05 test collection,
ground truth lists aggregated with the original All-2 function. kp is kept to 1. Bold face for
largest scores per row and italics for largest scores per column.

kn    kt=0     kt=0.1   kt=0.2   kt=0.3   kt=0.4   kt=0.5   kt=0.6   kt=0.7   kt=0.8   kt=0.9   kt=1
3     0.7185   0.7140   0.7147   0.7116   0.7120   0.7024   0.7056   0.7067   0.7080   0.7078   0.7048
4     0.7242   0.7268   0.7291   0.7316   0.7279   0.7282   0.7263   0.7215   0.7002   0.7108   0.7032
5     0.7114   0.7108   0.6988   0.6958   0.6942   0.6986   0.7109   0.7054   0.6959   0.6886   0.6914
6     0.7080   0.7025   0.6887   0.6693   0.6701   0.6743   0.6727   0.6652   0.6612   0.6561   0.6636
7     0.6548   0.6832   0.6818   0.6735   0.6614   0.6594   0.6604   0.6552   0.6525   0.6484   0.6499

It can also be observed that the results would again be overestimated by as much as
11% in the case of Train05 and as much as 13% in Eval05, in contrast with the
maximum 12% observed with the systems that participated in the actual MIREX 2005
evaluation.

Content-Based Music Discovery


Dirk Schönfuß
mufin GmbH, August-Bebel-Straße 36, 01219 Dresden, Germany
dschoenfuss@mufin.com

Abstract. Music recommendation systems have become a valuable aid for


managing large music collections and discovering new music. Our content-based recommendation system employs signal-based features and semantic music attributes generated using machine learning algorithms. In addition
to playlist generation and music recommendation, we are exploring new
usability concepts made possible by the analysis results. Functionality such as
the mufin vision sound universe enables the user to discover his own music
collection or even unknown catalogues in a new, more intuitive way.
Keywords: music, visualization, recommendation, cloud, clustering, semantic
attributes, auto-tagging.

1 Introduction
The way music is consumed today has been changed dramatically by its increasing
availability in digital form. Online music shops have replaced traditional stores and
music collections are increasingly kept on electronic storage systems and mobile
devices instead of physical media on a shelf. It has become much faster and more
comfortable to find and acquire music of a known artist. At the same time it has
become more difficult to find one's way in the enormous range of music that is offered commercially, find music according to one's taste or even manage one's own
collection.
Young people today have a music collection with an average size of 8,159 tracks [1]
and the iTunes music store today offers more than 10 million tracks for sale. Long-tail
sales are low, which is illustrated by the fact that only 1% of the catalog tracks generate
80% of sales [2][3]. A similar effect can also be seen in the usage of private music
collections. According to our own studies, only a few people are actively working with
manual playlists because they consider this too time-consuming or they have simply
forgotten which music they actually possess.
1.1 Automatic Music Recommendation
This is where music recommendation technology comes in: The similarity of two
songs can be mathematically calculated based on their musical attributes. Thus, for
each song a ranked list of similar songs from the catalogue can be generated.
Editorial data, user-generated data, or content-based data derived directly from the audio signal can be used.
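As a purely illustrative sketch of that idea (not mufin's actual algorithm, feature set or attribute names), songs described by numeric attribute vectors can be ranked by their distance to a seed song:

```python
import math

# Hypothetical attribute vectors (e.g. tempo, energy, "happiness"), scaled to [0, 1].
catalogue = {
    "song_a": [0.80, 0.90, 0.30],
    "song_b": [0.75, 0.85, 0.35],
    "song_c": [0.20, 0.30, 0.90],
}

def recommend(seed, songs, k=2):
    """Rank songs by Euclidean distance to the seed's attribute vector."""
    seed_vec = songs[seed]
    dist = lambda v: math.dist(seed_vec, v)
    ranked = sorted((s for s in songs if s != seed), key=lambda s: dist(songs[s]))
    return ranked[:k]

print(recommend("song_a", catalogue))  # ['song_b', 'song_c']
```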


Editorial data allows a very thorough description of the musical content but this
manual process is very expensive and time-consuming and will only ever cover a
small percentage of the available music. User-data based recommendations have
become very popular through vendors such as Amazon (People who bought this item
also bought ) or Last.FM. However, this approach suffers from a cold-start
problem and its strong focus on popular content.
Signal-based recommenders are not affected by popularity, sales rank or user
activity. They extract low-level features directly from the audio signal. This also
offers additional advantages such as being able to work without a server connection
and being able to process any music file even if it has not been published anywhere.
However, signal-based technology alone misses the socio-cultural aspect which is not
present in the audio signal and it also cannot address current trends or lyrics content.
1.2 Mufin's Hybrid Approach
Mufin's technology is signal-based but it combines signal features with semantic musical attributes and metadata from other data sources, thus forming a hybrid recommendation approach. First of all, the technology analyzes the audio signal and extracts signal features. mufin then employs state-of-the-art machine learning technology to extract semantic musical attributes, including mood descriptions such as "happy", "sad", "calm" or "aggressive" but also other descriptive tags such as "synthetic", "acoustic", presence of electronic beats, distorted guitars, etc.
This information can for instance be used to narrow down search results, offer
browsing capabilities or contextually describe content. By combining these attributes
with information from other sources such as editorial metadata the user can for
instance search for "aggressive rock songs from the 1970s with a track-length of more
than 8 minutes".
Lists of similar songs can then be generated by combining all available information
including signal features, musical attributes (auto-tags) and metadata from other data
sources using a music ontology. This ensures a high quality and enables the steering
of the recommendation system. The results can be used for instance for playlist
generation or as an aid to an editor who needs an alternative to a piece of music he is
not allowed to use in a certain context.
As the recommendation system features a module based on digital signal-processing algorithms, it can generate musical similarity for all songs within a music catalogue, and because it makes use of mathematical analysis of the audio signals, it is completely deterministic and, if desired, can work independently of any "human factor" like cultural background, listening habits, etc. Depending on the database used, it can also work way off the mainstream and thus give the user the opportunity to discover music he may never have found otherwise.
In contrast to other technologies such as collaborative filtering, mufin's technology
can provide recommendations for any track, even if there are no tags or social data.
Recommendations are not limited by genre boundaries, target groups or biased by
popularity. Instead, it equally covers all songs in a music catalogue and if genre
boundaries or the influence of popularity is indeed desired, this can be addressed by
leveraging additional data sources.


Fig. 1. The mufin music recommender combines audio features inside a song model and
semantic musical attributes using a music ontology. Additionally, visualization coordinates for
the mufin vision sound galaxy are generated during the music analysis process.

Mufin's complete music analysis process is fully automated. The technology has
already proven its practical application in usage scenarios with more than 9 million
tracks. Mufin's technology is available for different platforms including Linux,
MacOS X, Windows and mobile platforms. Additionally, it can also be used via web
services.

2 Mufin Vision
Common, text-based attributes such as title or artist are not suitable to keep track of a
large music collection, especially if the user is not familiar with every item in the
collection. Songs which belong together from a sound perspective may appear very
far apart when using lists sorted by metadata. Additionally, only a limited number of
songs will fit onto the screen preventing the user from actually getting an overview of
his collection.
mufin vision has been developed with the goal to offer easy access to music
collections. Even if there are thousands of songs, the user can easily find his way
around the collection since he can learn where to find music with a certain
characteristic. By looking at the concentration of dots in an area he can immediately
assess the distribution of the collection and zoom into a section to get a closer look.
The mufin vision 3D sound galaxy displays each song as a dot in a coordinate
system. The x, y and z axes, as well as the size and color of the dots, can be assigned to
different musical criteria such as tempo, mood, instrumentation or type of singing
voice; even metadata such as release date or song duration can be used. Using the axis
configuration, the user can explore his collection the way he wants and make relations
between different songs visible. As a result, it becomes much easier to find music
fitting a mood or occasion.
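A toy sketch of such an axis configuration (attribute names and scaling are invented for illustration and are not mufin's):

```python
# Map selected musical attributes of a song onto visual dimensions of a dot.
AXIS_CONFIG = {"x": "tempo", "y": "mood_valence", "z": "instrumentation_density",
               "size": "duration", "color": "release_year"}

def song_to_dot(attributes, config=AXIS_CONFIG):
    """Return the visual parameters of one song under a given axis configuration.

    `attributes` maps attribute names to values already normalized to [0, 1].
    """
    return {dim: attributes[attr] for dim, attr in config.items()}

song = {"tempo": 0.62, "mood_valence": 0.80, "instrumentation_density": 0.35,
        "duration": 0.45, "release_year": 0.90}
print(song_to_dot(song))
# {'x': 0.62, 'y': 0.8, 'z': 0.35, 'size': 0.45, 'color': 0.9}
```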
Mufin vision premiered in the mufin player PC application but it can also be used
on the web and even on mobile devices. The latest version of the mufin player 1.5
allows the user to control mufin vision using a multi-touch display.


Fig. 2. Both songs are by the same artist. However, "Brothers in Arms" is a very calm ballad with sparse instrumentation while "Sultans of Swing" is a rather powerful song with a fuller sound spectrum. The mufin vision sound galaxy reflects that difference since it works on song level instead of an artist or genre level.

Fig. 3. The figure displays a playlist in which the entries are connected by lines. One can see
that although the songs may be similar as a whole, their musical attributes vary over the course
of the playlist.

3 Further Work
The mufin player PC application offers a database view of the user's music collection
including filtering, searching and sorting mechanisms. However, instead of only using
metadata such as artist or title for sorting, the mufin player can also sort any list by
similarity to a selected seed song.


Additionally, the mufin player offers an online storage space for a user's music collection. This protects the user from data loss and allows him to simultaneously stream his music online and listen to it from anywhere in the world.
Furthermore, mufin works together with the German National Library in order to establish a workflow for the protection of our cultural heritage. The main contribution of mufin is the fully automatic annotation of the music content and the provision of descriptive tags for the library's ontology. Based on technology by mufin and its
partners, a semantic multimedia search demonstration was presented at IBC 2009 in
Amsterdam.

References
1. Bahanovich, D., Collopy, D.: Music Experience and Behaviour in Young People. University of Hertfordshire, UK (2009)
2. Celma, O.: Music Recommendation and Discovery in the Long Tail. PhD Thesis, Universitat Pompeu Fabra, Spain (2008)
3. Nielsen Soundscan: State of the Industry (2007), http://www.narm.com/2008Conv/StateoftheIndustry.pdf (July 22, 2009)
