Despite the game's popularity, players have some reasonable complaints: for one, packs are limited to songs with favorable licenses, meaning players may be unable to dance to their favorite songs. Even when charts are available, players may tire of repeatedly performing the same charts. Although players can produce their own charts, the process is painstaking and requires significant expertise.

In this paper, we seek to automate the process of step chart generation so that players can dance to a wider variety of charts on any song of their choosing. We introduce the task of learning to choreograph, in which we learn to generate step charts from raw audio. Although this task has previously been approached via ad-hoc methods, we are the first to cast it as a learning task in which we seek to mimic the semantics of human-generated charts. We break the problem into two subtasks: First, step placement consists of identifying a set of timestamps in the song at which to place steps. This process can be conditioned on a user-specified difficulty level. Second, step selection consists of choosing which steps to place at each timestamp. Running these two steps in sequence yields a playable step chart. This process is depicted in Figure 1.

Progress on learning to choreograph may also lead to advances in music information retrieval (MIR). Our step placement task, for example, closely resembles onset detection, a well-studied MIR problem. The goal of onset detection is to identify the times of all musically salient events, such as melody notes or drum strikes. While not every onset in our data corresponds to a DDR step, every DDR step corresponds to an onset. In addition to marking steps, DDR packs specify a metronome click track for each song. For songs with changing tempos, the exact location of each change and the new tempo are annotated. This click data could help to spur algorithmic innovation for beat tracking and tempo detection.

Unfortunately, MIR research is stymied by the difficulty of accessing large, well-annotated datasets. Songs are often subject to copyright issues, and thus must be gathered by each researcher independently. Collating audio with separately-distributed metadata is nontrivial and error-prone owing to the multiple available versions of many songs. Researchers must often manually align their version of a song to the metadata. In contrast, our dataset is publicly available, standardized, and contains meticulously-annotated labels as well as the relevant recordings.

We believe that DDR charts represent an abundant and under-recognized source of annotated data for MIR research. Stepmania Online, a popular repository of DDR data, distributes over 350 GB of packs with annotations for more than 100k songs. In addition to introducing a novel task and methodology, we contribute two large public datasets, which we consider to be of notably high quality and consistency.² Each dataset is a collection of recordings and step charts. One contains charts by a single author and the other by multiple authors.

For both prediction stages of learning to choreograph, we demonstrate the superior performance of neural networks over strong alternatives. Our best model for step placement jointly learns convolutional neural network (CNN) representations and a recurrent neural network (RNN), which integrates information across consecutive time slices. This method outperforms CNNs alone, multilayer perceptrons (MLPs), and linear models.

Our best-performing system for step selection consists of a conditional LSTM generative model. As auxiliary information, the model takes beat phase, a number representing the fraction of a beat at which a step occurs. Additionally, the best models receive the time difference (measured in beats) since the last and until the next step. This model selects steps that are more consistent with expert authors than the best n-gram and fixed-window models, as measured by perplexity and per-token accuracy.

Figure 2: A four-beat measure of a typical chart and its rhythm depicted in musical notation. Red: quarter notes, blue: eighth notes, yellow: sixteenth notes; (A): jump step, (B): freeze step.

² All code and data shall be released upon publication.
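To make the two-stage decomposition concrete, the sketch below wires placement and selection together. The function bodies here are trivial random placeholders, not the networks described in this paper, and all names are ours; only the placement-then-selection structure follows the text.

```python
# Minimal sketch of the two-stage pipeline: step placement, then step
# selection. The "models" are random placeholders, NOT the paper's
# networks; only the overall structure mirrors the text.
import random

def place_steps(n_frames, frame_rate=100, threshold=0.9):
    """Stage 1: keep frames whose step probability clears a threshold,
    returning timestamps in seconds."""
    probs = [random.random() for _ in range(n_frames)]  # placeholder scores
    return [i / frame_rate for i, p in enumerate(probs) if p > threshold]

def select_steps(timestamps):
    """Stage 2: choose an arrow combination for each timestamp."""
    vocab = ["L", "D", "U", "R", "L+R", "D+U"]  # tiny placeholder vocabulary
    return [random.choice(vocab) for _ in timestamps]

times = place_steps(n_frames=1000)
chart = list(zip(times, select_steps(times)))  # a playable chart: (time, step) pairs
print(chart[:5])
```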
1.1 Contributions

In short, our paper offers the following contributions:

- We define learning to choreograph, a new task with real-world usefulness and strong connections to fundamental problems in MIR.
- We introduce two large, curated datasets for benchmarking DDR choreography algorithms. They represent an under-recognized source of music annotations.
- We introduce an effective pipeline for learning to choreograph with deep neural networks.

2 Related Work

Several academic papers address DDR. These include anthropological studies [27, 2] and two papers that describe approaches to automated choreography. The first, called Dancing Monkeys, uses rule-based methods for both step placement and step selection [36]. The second employs genetic algorithms for step selection, optimizing an ad-hoc fitness function [35]. Neither establishes a reproducible evaluation methodology or learns the semantics of steps from data.

Our step placement task closely resembles the classic problem of musical onset detection [3, 14]. Several onset detection papers investigate modern deep learning methodology. Eyben et al. [16] employ bidirectional LSTMs (BLSTMs) for onset detection; Marchi et al. [33] improve upon this work, developing a rich multi-resolution feature representation; Schlüter and Böck [39] demonstrate a CNN-based approach (against which we compare) that performs competitively with the prior BLSTM work. Neural networks are widely used on a range of other MIR tasks, including musical chord detection [28, 9] and boundary detection [46], another transient audio phenomenon.

Our step selection problem resembles the classic natural language processing task of statistical language modeling. Classical methods, which we consider, include n-gram distributions [11, 37]. Bengio et al. [4] demonstrate an approach to language modeling using neural networks with fixed-length context. More recently, RNNs have demonstrated superior performance to fixed-window approaches [34, 43, 44]. LSTMs are also capable of modeling language at the character level [29, 30]. While a thorough explanation of modern RNNs exceeds the scope of this paper, we point to two comprehensive reviews of the literature [31, 23]. Several papers investigate neural networks for single-note melody generation [5, 15, 12, 24] and polyphonic melody generation [8].

Learning to choreograph requires predicting both the timing and the type of events in relation to a piece of music. In that respect, our task is similar to audio sequence transduction tasks, such as musical transcription and speech recognition. RNNs currently yield state-of-the-art performance for musical transcription [6, 10, 40]. RNNs are widely used for speech recognition [20, 21, 22, 38], and the state-of-the-art method [1] combines convolutional and recurrent networks. While our work is methodologically similar, it differs from the above in that we consider a different application and introduce a new learning task.

3 Data

Basic statistics of our two datasets are shown in Table 1. The first dataset contains 90 songs choreographed by a single prolific author who works under the name Fraxtil. This dataset contains five charts per song corresponding to increasing difficulty levels. We find that while these charts overlap significantly, the lower-difficulty charts are not strict subsets of the higher-difficulty charts (Figure 3). The second dataset is a large multi-author collection called In The Groove (ITG); this dataset contains 133 songs with one chart per difficulty, except for 13 songs which lack charts for the highest difficulty.

Note that while the total number of songs is relatively small, when considering all charts across all songs the datasets contain around 35 hours of annotations and 350,000 steps. The two datasets have similar vocabulary sizes (81 and 88 distinct step combinations, respectively). Around 84% of the steps in both datasets consist of a single, instantaneous arrow.

Note that step charts contain several invariances: for example, interchanging all instances of left and right results in an equally plausible sequence of steps. To augment the amount of data available for training, we generate four instances of each chart, by mirroring left/right, up/down (or both). Doing so considerably increases the amount of training data.
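As an illustration of this four-fold augmentation, the sketch below mirrors charts whose steps are encoded as 4-character strings over the panels (left, down, up, right), in the spirit of the StepMania format; the exact encoding and helper names here are ours, not the paper's.

```python
# Illustrative four-fold augmentation: mirror each chart left/right,
# up/down, or both. Steps are encoded as 4-character strings over the
# panels (Left, Down, Up, Right); encoding and helpers are ours.

def mirror(step, swap_lr=False, swap_ud=False):
    """Mirror a single step, e.g. '1000' (left arrow only)."""
    l, d, u, r = step
    if swap_lr:
        l, r = r, l
    if swap_ud:
        d, u = u, d
    return l + d + u + r

def augment(chart):
    """Return four instances of a chart: identity, up/down mirror,
    left/right mirror, and both mirrors combined."""
    return [
        [mirror(step, swap_lr=lr, swap_ud=ud) for step in chart]
        for lr in (False, True)
        for ud in (False, True)
    ]

print(augment(["1000", "0011"]))  # one chart in, four charts out
```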
[Figure: number of steps per rhythmic subdivision (4th, 8th, 12th, 16th, 24th, 32nd notes; log-scale counts from 10^1 to 10^5) across the difficulty levels Beginner, Easy, Medium, Hard, and Challenge.]
[…] (3) apply a peak-picking process on this sequence of probabilities to identify the precise timestamps at which to place steps; and finally (4) given a sequence of timestamps, use a step selection algorithm to choose which steps to place at each time.

[Figure 5: step placement model architecture; the diagram's recoverable labels show fully connected MLP layers.]
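A minimal sketch of one plausible peak-picking procedure, not necessarily the exact one used here: smooth the per-frame step probabilities with a normalized Hamming window, then keep thresholded local maxima. The frame rate, threshold, and window size are illustrative assumptions.

```python
# Sketch of peak-picking over per-frame step probabilities (assumed
# parameters; not necessarily the authors' exact procedure).
import numpy as np

def pick_peaks(probs, frame_rate=100, threshold=0.5, window=5):
    probs = np.asarray(probs, dtype=float)
    # Smooth with a normalized Hamming window to suppress jitter.
    kernel = np.hamming(window)
    kernel /= kernel.sum()
    smoothed = np.convolve(probs, kernel, mode="same")
    # A frame is a peak if it exceeds the threshold and both neighbors.
    peaks = [
        i for i in range(1, len(smoothed) - 1)
        if smoothed[i] > threshold
        and smoothed[i] >= smoothed[i - 1]
        and smoothed[i] >= smoothed[i + 1]
    ]
    # Convert frame indices to timestamps in seconds.
    return [i / frame_rate for i in peaks]
```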
[…] layers of dimension 256 and 128. This architecture is depicted in Figure 5. We train this model using 100 unrollings for backpropagation through time.

A chart's intended difficulty influences decisions both about how many steps to place and where to place them. For low-difficulty charts, the average number of steps per second is less than one. In contrast, the highest-difficulty charts exceed seven steps per second. We tried training all models both with and without conditioning on difficulty, and found this feature to be informative. We concatenate difficulty features […]
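The sentence above is cut off by a page gap, but conditioning of this kind typically amounts to appending a difficulty encoding to the model's input features. A minimal sketch, assuming (our assumption, not stated here) a one-hot encoding over the game's five difficulty levels:

```python
# Sketch of difficulty conditioning: append a one-hot difficulty vector
# to each feature vector. The one-hot wiring is an assumption on our
# part; only the five-level scale comes from the game itself.
import numpy as np

DIFFICULTIES = ["Beginner", "Easy", "Medium", "Hard", "Challenge"]

def add_difficulty(features, difficulty):
    """Concatenate a one-hot difficulty encoding onto a feature vector."""
    onehot = np.zeros(len(DIFFICULTIES))
    onehot[DIFFICULTIES.index(difficulty)] = 1.0
    return np.concatenate([features, onehot])

x = add_difficulty(np.random.randn(128), "Challenge")
print(x.shape)  # (133,): 128 audio features + 5 difficulty features
```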
[…] features represent the states on, off, hold, and release. We found the bag-of-arrows to give equivalent performance to the one-hot representation while requiring fewer parameters. We add an additional feature that functions as a start token to denote the first step of a chart. For this task, we use an LSTM with 2 layers of 128 cells each.

Finally, we provide additional musical context to the step selection models by conditioning on rhythmic features (Figure 7). To inform models of the non-uniform spacing of the step placements, we consider the following three features: (1) Δ-time adds two features representing the time since the previous step and the time until the next step; (2) Δ-beat adds two features representing the number of beats since the previous and until the next step; (3) beat phase adds four features representing which sixteenth-note subdivision of the beat the current step most closely aligns to.

[Figure 7: RNN model used for step selection. Diagram: the current step, step features, and rhythmic context feed a two-layer LSTM (with recurrent connections from time t-1), which predicts the next step.]
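A sketch of how these three feature groups could be computed for a single step, assuming a constant tempo for simplicity; the function and its signature are illustrative, not the paper's code.

```python
# Illustrative computation of the rhythmic features (Δ-time, Δ-beat,
# beat phase) for the step at index i; assumes constant tempo.

def rhythmic_features(step_times, i, bpm):
    beat_len = 60.0 / bpm  # seconds per beat

    # Δ-time: seconds since the previous step and until the next step.
    dt_prev = step_times[i] - step_times[i - 1] if i > 0 else 0.0
    dt_next = step_times[i + 1] - step_times[i] if i + 1 < len(step_times) else 0.0

    # Δ-beat: the same two gaps, measured in beats.
    db_prev, db_next = dt_prev / beat_len, dt_next / beat_len

    # Beat phase: which sixteenth-note subdivision of the beat the step
    # most closely aligns to, as a four-way one-hot (4 sixteenths per beat).
    frac = (step_times[i] / beat_len) % 1.0   # fraction of a beat elapsed
    subdivision = int(round(frac * 4)) % 4    # nearest sixteenth note
    phase = [1.0 if k == subdivision else 0.0 for k in range(4)]

    return [dt_prev, dt_next, db_prev, db_next] + phase

# Step on the off-beat eighth note at 120 bpm -> phase one-hot selects slot 2.
print(rhythmic_features([0.0, 0.5, 0.75, 1.0], i=2, bpm=120))
```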
[…] difficulty charts but only 0.389 for the lowest difficulty. The difficult charts contribute more to F-score_m calculations because they have more ground-truth positives. We discuss these results further in Section 7.

Table 2: Results for step placement experiments

Model    Dataset  PPL    AUC    F-score_c  F-score_m
LogReg   Fraxtil  1.205  0.601  0.609      0.667
MLP      Fraxtil  1.097  0.659  0.665      0.726
CNN      Fraxtil  1.082  0.671  0.678      0.750
C-LSTM   Fraxtil  1.070  0.682  0.681      0.756
LogReg   ITG      1.123  0.599  0.634      0.652
MLP      ITG      1.090  0.637  0.671      0.704
CNN      ITG      1.083  0.677  0.689      0.719
C-LSTM   ITG      1.072  0.680  0.697      0.721

6.2 Step Selection

Baselines. For step selection, we compare the performance of the conditional LSTM to an n-gram model. Note that perplexity can be unbounded when a test-set token is assigned probability 0 by the generative model. To protect the n-gram models against unbounded loss on previously unseen n-grams, we use modified Kneser-Ney smoothing [11], following best practices in language modeling [34, 44]. Specifically, we train a smoothed 5-gram model with backoff (KN5) as implemented in [42].

Following the work of Bengio et al. [4], we also compare against a fixed-window 5-gram MLP which takes 4 bag-of-arrows-encoded steps as input and predicts the next step. The MLP contains two fully-connected layers with 256 and 128 nodes and 50% dropout after each layer during training. As with the LSTM, we train the MLP both with and without access to side features. In addition to the LSTM with 64 steps of unrolling, we train an LSTM with 5 steps of unrolling. These baselines show that the LSTM learns complex, long-range dependencies. They also demonstrate the discriminative information conferred by the Δ-time, Δ-beat, and beat phase features.

Metrics. We report the average per-step perplexity, averaging scores calculated separately on each chart. We also report a per-token accuracy. We calculate accuracy by comparing the ground-truth step to the argmax over a model's predictive distribution given the previous sequence of ground-truth tokens. For a given chart, the per-token accuracy is averaged across time steps. We produce final numbers by averaging scores across charts.

In Table 3 we present results for the step selection task. For the Fraxtil dataset, the best-performing model was the LSTM conditioned on both Δ-beat and beat phase, while for ITG it was the LSTM conditioned on Δ-time. While conditioning on rhythm features was generally beneficial, the benefits of various features were not strictly additive. Representing Δ-beat and Δ-time as real numbers outperformed bucketed representations.

Table 3: Results for step selection experiments

Model                         Dataset  PPL    Accuracy
KN5                           Fraxtil  3.681  0.528
MLP5                          Fraxtil  3.744  0.543
MLP5 + Δ-time                 Fraxtil  3.495  0.553
MLP5 + Δ-beat + beat phase    Fraxtil  3.428  0.557
LSTM5                         Fraxtil  3.583  0.558
LSTM5 + Δ-time                Fraxtil  3.188  0.584
LSTM5 + Δ-beat + beat phase   Fraxtil  3.185  0.581
LSTM64                        Fraxtil  3.352  0.578
LSTM64 + Δ-time               Fraxtil  3.107  0.606
LSTM64 + Δ-beat + beat phase  Fraxtil  3.011  0.613
KN5                           ITG      5.847  0.356
MLP5                          ITG      5.312  0.376
MLP5 + Δ-time                 ITG      4.792  0.402
MLP5 + Δ-beat + beat phase    ITG      4.786  0.401
LSTM5                         ITG      5.040  0.407
LSTM5 + Δ-time                ITG      4.412  0.439
LSTM5 + Δ-beat + beat phase   ITG      4.447  0.441
LSTM64                        ITG      4.780  0.426
LSTM64 + Δ-time               ITG      4.284  0.454
LSTM64 + Δ-beat + beat phase  ITG      4.342  0.444

Additionally, we explored the possibility of incorporating more comprehensive representations of the audio into the step selection model. We considered a variety of representations, such as conditioning on CNN features learned from the step placement task. We also experimented with jointly learning a CNN audio encoder. In all cases, these approaches led to rapid overfitting and never approached the performance of the conditional LSTM generative model; perhaps a much larger dataset could support these approaches. Finally, we tried conditioning the step selection models on both difficulty and chart author but found these models overfit quickly as well.
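The per-chart perplexity and per-token accuracy described above can be reproduced in a few lines; `model_probs` below is a hypothetical stand-in for any model's predictive distribution over the step vocabulary.

```python
# Sketch of the evaluation metrics: per-step perplexity and per-token
# accuracy, each computed per chart and then averaged across charts, as
# in Table 3. `model_probs(history)` is a hypothetical stand-in that
# returns a dict mapping each step token to its predicted probability.
import math

def chart_metrics(tokens, model_probs):
    """Per-step perplexity and per-token accuracy for one chart."""
    nll, correct = 0.0, 0
    for t, token in enumerate(tokens):
        dist = model_probs(tokens[:t])           # P(next step | ground-truth history)
        nll -= math.log(dist.get(token, 1e-12))  # unseen tokens blow up PPL, hence KN smoothing
        if max(dist, key=dist.get) == token:     # argmax prediction vs. ground truth
            correct += 1
    ppl = math.exp(nll / len(tokens))
    acc = correct / len(tokens)
    return ppl, acc

def corpus_metrics(charts, model_probs):
    """Average the per-chart scores to produce final numbers."""
    scores = [chart_metrics(c, model_probs) for c in charts]
    ppl = sum(p for p, _ in scores) / len(scores)
    acc = sum(a for _, a in scores) / len(scores)
    return ppl, acc
```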
Figure 8: Top: A real step chart (human choreography) from the Fraxtil dataset on the song Anamanaguchi - Mess. Middle: One-step lookahead predictions for the LSTM model, given Fraxtil's choreography as input. The model predicts the next step with high accuracy (errors in red). Bottom: Choreography generated by the conditional LSTM model.
[…]ings. If we trained a model to detect beat phase, we would be able to use these features for step selection.

References

[5] J. J. Bharucha and P. M. Todd. Modeling the perception of tonal structure with neural nets. Computer Music Journal, 1989.

[…]
[17] F. Gers and J. Schmidhuber. Recurrent nets that time and count. In International Joint Conference on Neural Networks (IJCNN), 2000.

[18] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.

[19] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS, 2011.

[20] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In ICML, 2014.

[21] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML, 2006.

[22] A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.

[23] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 2016.

[24] G. Hadjeres and F. Pachet. DeepBach: A steerable model for Bach chorales generation. arXiv:1612.01010, 2016.

[25] P. Hamel, Y. Bengio, and D. Eck. Building musically-relevant audio features through multiple timescale representations. In ISMIR, 2012.

[26] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.

[27] J. Höysniemi. International survey on the dance dance revolution game. Computers in Entertainment (CIE), 2006.

[28] E. J. Humphrey and J. P. Bello. Rethinking automatic chord recognition with convolutional neural networks. In ICMLA, 2012.

[29] A. Karpathy, J. Johnson, and L. Fei-Fei. Visualizing and understanding recurrent networks. arXiv:1506.02078, 2015.

[30] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[31] Z. C. Lipton, J. Berkowitz, and C. Elkan. A critical review of recurrent neural networks for sequence learning. arXiv:1506.00019, 2015.

[32] Z. C. Lipton, D. C. Kale, C. Elkan, and R. Wetzell. Learning to diagnose with LSTM recurrent neural networks. In ICLR, 2016.

[33] E. Marchi, G. Ferroni, F. Eyben, L. Gabrielli, S. Squartini, and B. Schuller. Multi-resolution linear prediction based features for audio onset detection with bidirectional LSTM neural networks. IEEE, 2014.

[34] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In Interspeech, 2010.

[35] A. Nogaj. A genetic algorithm for determining optimal step patterns in Dance Dance Revolution. Technical report, State University of New York at Fredonia, 2005.

[36] K. O'Keeffe. Dancing monkeys (automated creation of step files for Dance Dance Revolution). Technical report, Imperial College London, 2003.

[37] R. Rosenfeld. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 2000.

[38] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals. Learning the speech front-end with raw waveform CLDNNs. In Interspeech, 2015.

[39] J. Schlüter and S. Böck. Improved musical onset detection with convolutional neural networks. In ICASSP, 2014.

[40] S. Sigtia, E. Benetos, and S. Dixon. An end-to-end neural network for polyphonic piano music transcription. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016.

[41] S. S. Stevens, J. Volkmann, and E. B. Newman. A scale for the measurement of the psychological magnitude pitch. The Journal of the Acoustical Society of America, 1937.