
PITCH RECOGNITION

WITH WAVELETS
1.130 Wavelets and Filter Banks
May 15, 2003
Project by:
Stephen Geiger
skg@mit.edu
927939048
Abstract
This report investigates the use of wavelets for pitch recognition. A method is
developed using the Continuous Wavelet Transform at various scales to identify
individual notes. Successful results were obtained for computer generated polyphonic
piano music that included octave intervals. The current method requires the training of
the system before recognition is possible and may only work on some instruments.
However, it seems possible that the method could be extended to recognize real
polyphonic piano music.

Outline

Introduction

Problem Description

Existing Methods

Developed Method and Results

Conclusions

References

Appendix A Matlab Code

Appendix B Additional Results

Introduction
Pitch recognition, the ability to identify notes contained in an audio signal, is a
task some humans are quite proficient at. Given the sound of a dropped metal trash can
lid (or, perhaps preferably, a violin), they can respond with the name of a corresponding
musical note. This ability is typically referred to in the music world as perfect pitch.
Not all humans seem to have this capability, and there has been somewhat limited success
in creating computerized systems capable of pitch recognition. Research in this area has
been approached with different motivating factors from several fields. Perhaps the most
obvious application is in automatic transcription of music [1][2][3]. There is also interest
in pitch recognition for analyzing models of musical instruments [4], speech analysis [5],
and from the perspective of perceptual computing [6]. The aim of this work was to
explore the use of wavelets [7] for computer based pitch recognition.

Problem Description
Pitch is one of the properties of sound. It is perhaps most simply described as how
high or low a sound is (not loud and soft, but high and low). Pitch also refers to the
musical note a sound corresponds to. In more technical terms, pitch relates to the
fundamental frequency of a sound. Each musical note has a unique fundamental
frequency. However, a sound or note typically does not consist of one pure frequency.
This is shown in the following graph:

[Figure: Relative Frequency Content of a Computer Generated Piano Sound; x-axis: Frequency, Hz]

The graph displays the frequencies present in a Middle C (C4) with fundamental
frequency 262 Hz. It can be seen that there is a large frequency component at the
fundamental frequency, and that there are frequency components at integer multiples of
this frequency (harmonics). The fundamental frequency is not always the largest
component, as shown here:

[Figure: Relative Frequency Content of a Computer Generated Oboe Sound; x-axis: Frequency, Hz]

In the case of the oboe sound, the fundamental frequency is again 262 Hz and is
present along with its harmonics; however, the most prominent frequency component is
the 4th harmonic.
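Graphs like the two above can be produced directly from a signal's discrete Fourier transform. The following Matlab sketch shows the idea; the note is synthesized in place of a recording, and its harmonic strengths are arbitrary assumptions for illustration:

    % Plot the relative frequency content of a synthetic C4-like note.
    fs = 8192;                            % sampling rate, Hz
    t  = 0:1/fs:1;                        % one second of signal
    f0 = 262;                             % fundamental of Middle C (C4), Hz
    x  = sin(2*pi*f0*t) + 0.5*sin(2*pi*2*f0*t) + 0.3*sin(2*pi*3*f0*t);
    X    = abs(fft(x));                   % magnitude spectrum
    f    = (0:length(X)-1)*fs/length(X);  % frequency axis, Hz
    half = 1:floor(length(X)/2);          % keep frequencies below fs/2
    plot(f(half), X(half)/max(X));        % normalized relative content
    xlabel('Frequency, Hz');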
What may not be obvious is that to the human ear this sound will be heard as
having the same pitch as a sinusoidal wave at the fundamental frequency of 262 Hz. This is
despite the fact that the strength of the fundamental frequency component in the signal is
relatively small. In fact, there can be cases where the fundamental frequency of a sound is
not even present in the signal.
It is worthwhile to note that the varying distribution of strengths of frequency
components in a note is what determines its musical property called timbre. This is the
property that makes an oboe sound like an oboe, and a piano sound like a piano, and a
trumpet sound like a trumpet, etc.
Two more relevant terms to mention are monophonic and polyphonic. A
monophonic sound is one where there is only one pitch present at any given time. Some
examples would be one person singing or a single trumpet. Polyphonic sounds are ones
that contain multiple notes simultaneously, such as an orchestra or a barbershop quartet.

There are several existing methods for monophonic pitch recognition and these
have had some success. Polyphonic pitch recognition has proven significantly more
difficult. This is partially because the combined frequency spectrum of several notes is
more difficult to analyze, especially in the case of identifying two pitches related by an
interval of one octave (for example, a Middle C and the next highest C played together).
This is because all of the frequency components found in the higher note of an octave are
also present in the lower note [8]: every harmonic of C5 (about 523 Hz, 1046 Hz, and so
on) coincides with an even harmonic of C4 at 262 Hz.

Existing Methods
A brief overview of some of the methods that have been tried for pitch detection
is presented here. Monophonic transcription techniques include time domain techniques
based on zero crossings and auto-correlation, and frequency domain techniques based on
the discrete Fourier transform and cepstrum methods; see the references in [8]. The
estimation of local maxima to find the pitch period (which is easily converted to
frequency) with the incorporation of wavelets is described in [1][9]. Another technique
using wavelets to estimate the pitch period, along with a comparison to auto-correlation
methods, is presented in [4]. The use of models of human pitch perception is also
described in [8], as is the concept of blackboard systems. This approach incorporates
various sources of knowledge, which could include music theory or statistical and
probabilistic knowledge [2][6]. Lastly, it is worth noting that one approach to the problem
of distinguishing octaves is to incorporate instrument models.

Developed Method and Results


Taking a different approach, the method developed in this work makes use of the
Continuous Wavelet Transform (CWT), and uses a 2nd Order Gaussian Wavelet. The
Continuous Wavelet Transform is defined as follows:
$$C_{a,b} = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} f(t)\,\psi\!\left(\frac{t-b}{a}\right) dt$$

where:
$f(t)$ = the function (signal) being analyzed
$\psi(t)$ = the mother wavelet
$a$ = the scaling factor
$b$ = the shift parameter

And the 2nd Order Gaussian Mother Wavelet has the following appearance:
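Up to sign and normalization, this wavelet is the second derivative of a Gaussian (the familiar Mexican hat shape):

$$\psi(t) = C\,(1 - 2t^2)\,e^{-t^2}$$

where $C$ is a normalization constant whose exact value depends on the convention used.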

When the scaling parameter, a, in the wavelet transform is varied, it has the effect
of stretching or compressing the mother wavelet.
The implementation of the CWT found in the Matlab Wavelet Toolbox was used,
and further explanation of the CWT and the 2nd Order Gaussian wavelet can be found in
the Wavelet Toolbox User's Guide [10].
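The basic toolbox call is shown in the sketch below, assuming the classic cwt(signal, scales, wavelet) interface described in [10]; the test signal is a synthesized stand-in for a recorded note:

    % CWT of a signal at several scales with the 2nd Order Gaussian
    % wavelet ('gaus2' in the Wavelet Toolbox). Each row of coefs
    % holds the coefficients for one scale.
    fs = 8192;
    t  = 0:1/fs:1;
    x  = sin(2*pi*262*t);                        % stand-in for a note
    scales = [394 446 472 530 594 606 642 722];  % scales used below
    coefs  = cwt(x, scales, 'gaus2');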
The idea for this method is based on an observation made by Jeremy Todd [11]. In
his work he found that by taking the CWT of a piano recording with a certain scale
parameter and a 2nd Order Gaussian wavelet function, the onset of a specific note (a G4)
could be easily identified. This observation is shown in the following illustration:

[Figure: the original signal (top) and its CWT at the specific scale (bottom)]

Furthermore, Todd observed that the same result would occur in situations with
polyphony as well. This was particularly interesting.
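A minimal sketch of how such an onset can be flagged, assuming a threshold of half the global maximum (an arbitrary choice for illustration) and a synthesized signal with one second of silence before a tone begins:

    % Flag a note onset from the CWT magnitude at a single scale.
    fs = 8192;
    x  = [zeros(1,fs), sin(2*pi*262*(0:fs-1)/fs)];  % silence, then tone
    row    = cwt(x, 594, 'gaus2');   % one of the scales used in this work
    mag    = abs(row);
    thresh = 0.5*max(mag);           % assumed threshold factor
    onset  = find(mag > thresh, 1);  % first sample with a large response

For this synthetic tone the large response concentrates around the abrupt onset, which is the kind of behavior Todd observed for the G4.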
I started my work by running a number of continuous wavelet transforms of
varying scale on some test signals (computer generated piano sounds), and observing the
results. After examining these it was possible to identify CWT scale factors that respond
to each of the notes in the musical scale starting at C4. (Note: in the previous sentence
the term scale is used with two different meanings; the former instance uses its wavelet
definition, and the latter its musical definition.) These results are shown
here:

[Figure: the original signal and its CWT at scales 594, 530, 472, 446, 394, 722, 642, and 606, one scale per note]

The CWT at each one of the selected scaling factors had large values at the
occurrence of a specific note, and comparatively small values during the rest of the signal.
Next, we can observe the results of the CWTs in the presence of some polyphony:

[Figure: a first polyphonic test signal and its CWT at scales 594, 530, 472, 446, 394, 722, 642, and 606]

and:

[Figure: a second polyphonic test signal, including octaves, and its CWT at the same scales]

In both cases this method worked, even with the presence of polyphony.
Furthermore, in the second example we see that both the C and the G are not affected by
the presence of other octaves. (Note: the three areas of large response on the first line
[Scale = 594] of the second example are correct; the latter two occurrences of the C are
found in the bass clef.)
One of the next steps was to test whether notes played with a different
instrument (i.e. having a different timbre) would also be recognized. This test was run
using a computer generated brass sound, and the results clearly show that it did not work.

[Figure: the computer generated brass test signal and its CWT at scales 594, 530, 472, 446, 394, 722, 642, and 606]

This result was somewhat expected, and it suggests that the CWT is acting as an
instrument model of sorts. When the scale parameter of the CWT is adjusted, it shifts the
transform's frequency response, so at certain scale parameters this frequency response
appears to be tailored such that the transform responds to one pitch more than others.
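One way to make the scale-to-frequency relationship concrete is the toolbox function scal2frq, which maps a CWT scale to the pseudo-frequency of the stretched wavelet. A short sketch follows; the 44100 Hz sampling rate is an assumption about the recordings:

    % Pseudo-frequencies corresponding to the CWT scales used above.
    scales  = [394 446 472 530 594 606 642 722];
    pseudoF = scal2frq(scales, 'gaus2', 1/44100);   % Hz, one per scale

Comparing these pseudo-frequencies against the notes' fundamentals is one way to probe what signal features each scale is actually responding to.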
Based on the encouraging results so far, investigation was continued to see how
effective this method could be on a larger scale. At this point a training algorithm was
written so the computer could identify for itself the scale factors corresponding to
various pitches. The algorithm was implemented in Matlab and works as follows:

- Different sound files were created for each note in a range of desired notes.
- The CWT of each sound file was taken.
- The maximum results of the CWTs from each sound file were compared.
- If the maximum CWT coefficient from one file was at least twice the value of those in all other files, it was considered a result.
- For each result the following were recorded: the scale factor, the pitch of the sound file, and the factor by which its maximum value exceeded all others.
- This process was repeated over a range of CWT scale factors in the hope of finding results for every pitch in the desired range of notes.
- At the end, the scale factor of the best result for each pitch was collected.

(The code for this algorithm, as well as some of the other work for this project, is
included in Appendix A; a simplified sketch of the training loop follows.)
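The sketch below assumes the training signals have already been loaded into a cell array named notes (one signal per pitch, a hypothetical variable), and follows the factor-of-two criterion from the steps above:

    % Search a range of CWT scales for ones that single out one pitch.
    scales   = 300:2:800;            % assumed search range of scales
    nNotes   = length(notes);
    best     = zeros(1, nNotes);     % best scale found for each pitch
    bestFact = zeros(1, nNotes);     % factor by which it beat the rest
    for s = scales
        peaks = zeros(1, nNotes);
        for k = 1:nNotes
            peaks(k) = max(abs(cwt(notes{k}, s, 'gaus2')));
        end
        [top, k] = max(peaks);
        rest = max(peaks([1:k-1, k+1:end]));
        if top >= 2*rest && top/rest > bestFact(k)  % twice all others
            bestFact(k) = top/rest;  % record the winning margin
            best(k)     = s;         % and the scale for pitch k
        end
    end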
This algorithm was applied to several different types of training signals. It was
tried on a computer generated piano sound over a range of three octaves, on a real guitar
sound (albeit an electric guitar, a '70s Ibanez Les Paul), on a set of pure sinusoidal
waves, and lastly on a training set of all 88 keys from the computer generated piano.
The training on the three octave range was able to find results for all pitches
except the bottom two notes. This is likely because a limited set of CWT scales was
searched, and it is hypothesized that given a larger range these values would have been
found as well. The results are shown here.

The training on the real guitar sound met with limited success. Only 5 out of 8
notes were identified in the training process (again for a somewhat limited set of scales),
and even those results were not completely successful in identifying the corresponding
notes in a test file. It wasn't a complete failure, and could merit a more thorough attempt,
but the guitar is expected to be a more difficult case than a basic computer generated
sound, or even a real piano.
The results for the sinusoidal waveforms were obtained as a step toward a better
understanding of the relationship between scale and frequency. It can be observed that
changing the CWT scale shifts the frequency response of the transform. Some interesting
relationships also exist between which scales yield results for which notes, as seen in the
following two graphs.

[Figure: Successful Results from the Training Algorithm for 8 Sinusoidal Pitches in a C Scale; scale plotted against note number]

[Figure: Successful Results from the Training Algorithm for 3 Octaves of a Computer Generated Piano Sound; scale plotted against note number]

In the first graph, with pure sinusoidal sounds, the scale-frequency relationship
seems a little more straightforward than in the second case. There could be some
patterns in the second graph as well, though they are less apparent.
The tests with all 88 notes were abandoned after considering the time they
required to run and the amount of time left to complete this work. It's worth noting here
that running CWTs for a number of test files at a number of scales could take a number of
hours. This could possibly be sped up noticeably with shorter test files or a lower
sampling rate, but this was not investigated (a sketch of the latter idea follows). The
initial training results for the 88 notes were nonetheless interesting, picking out notes
70-88. These notes appeared to show up in more clearly defined regions than in the three
octave test case. It seems possible that a training run of the three octave test case at
higher CWT scales might yield similar results.
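The untried speedup could look something like the following sketch; resample is from the Signal Processing Toolbox, and the factor of 4 is an arbitrary assumption:

    % Lower the sampling rate before the CWT to shrink the computation.
    fs = 44100;
    t  = 0:1/fs:1;
    x  = sin(2*pi*262*t);         % stand-in for a training file
    x4 = resample(x, 1, 4);       % now effectively 11025 Hz
    % A wavelet at scale a and rate fs behaves like scale a/4 at fs/4,
    % so the range of scales to search shrinks proportionally as well.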
Lastly, a fragment of the right hand part of Chopin's Prelude in C, Op. 28 No. 1
was tested, and the results were output in a more music-like format for comparison:

[Figure: A Test Fragment by Chopin; detected notes displayed alongside the score]

A comparison of the musical score and the graph reveals that the method
successfully identified all the notes contained in this polyphonic fragment. This is
noteworthy given the difficulty that polyphony and octaves usually present.

Conclusions
An application of the Continuous Wavelet Transform to pitch recognition was
explored, and some interesting results were found. The method demonstrated the ability
to recognize a reasonably complex polyphonic fragment, including octaves, which means it
compares favorably with some of the other results in the literature I came across. Two
significant drawbacks of the method are that it requires training on an instrument's
sounds before recognition is possible, and that it may be effective only on some instruments.
The most obvious next step would be to apply the current technique to a
real piano and observe how well it works. One issue that might need to be dealt with is
variation in the volume of notes played, as this might interfere with the simple
maximum method used for identifying results. Possibly some type of compression or
normalization could be applied, as sketched below. Another issue would be the
identification of the beginnings and ends of notes. If this were handled successfully, the
system would be well on its way to handling basic music transcription. Perhaps
techniques similar to those used in wavelet edge detection could be applied to this problem.
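As a final illustration of the normalization idea, a minimal sketch of peak normalization follows; the target peak of 1 is an assumption:

    % Rescale a recording to a common peak level before taking the CWT,
    % so the simple maximum test is not skewed by playing volume.
    x = 0.2*sin(2*pi*262*(0:8191)/8192);   % stand-in for a quiet note
    x = x / max(abs(x));                   % peak-normalize to [-1, 1]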

References
[1] Kevin Chan, Supaporn Erjongmanee, Choon Hong Tay, "Real Time Automated
Transcription of Live Music into Sheet Music using Common Music Notation," 18-551
Final Project (Carnegie Mellon), May 2000.
[2] K. D. Martin, "A Blackboard System for Automatic Transcription of Simple
Polyphonic Music," M.I.T. Media Lab Perceptual Computing Technical Report #385, July
1996.
[3] Michelle Kruvczuk, Ernest Pusateri, Alison Covell, "Music Transcription for the Lazy
Musician," 18-551 Final Project (Carnegie Mellon), May 2000.
[4] John Fitch, Wafaa Shabana, "A Wavelet-Based Pitch Detector for Musical Signals."
[5] Inge Gavat, Matei Zirra, Valentin Enescu, "Pitch Detection of Speech by Dyadic
Wavelet Transform," http://www.icspat.com/papers/181mfi.pdf.
[6] K. D. Martin and E. D. Scheirer, "Automatic Transcription of Simple
Polyphonic Music: Integrating Musical Knowledge," presented at SMPC, August 1997.
[7] Robi Polikar, "The Wavelet Tutorial,"
http://engineering.rowan.edu/~polikar/WAVELETS/WTtutorial.html.
[8] K. D. Martin, "Automatic Transcription of Simple Polyphonic Music: Robust
Front End Processing," M.I.T. Media Lab Perceptual Computing Technical Report #399,
November 1996; presented at the Third Joint Meeting of the Acoustical Societies of
America and Japan, December 1996.
[9] Tristan Jehan, "Musical Signal Parameter Estimation," Thesis, CNMAT, 1997.
http://cnmat.cnmat.berkeley.edu/~tristan/Report/Report.html.
[10] Wavelet Toolbox User's Guide (MATLAB), The MathWorks, 1997.
[11] Jeremy Todd, "A Comparison of Fourier and Wavelet Approaches to Musical
Transcription," 18.327 Final Project (MIT).

Appendix A Matlab Code

Appendix B Additional Results
