
Final Project

CS/ECE/ME 539
Professor Hu

UW-Madison
MWF 1:20p

David A. Gerasimow
The Design and Implementation of a Dynamic Data MLP to
Predict Motion Picture Revenue

Table of Contents
Introduction: Preface, Past Research, Improvements Over Past Research .... 3
Initial Data Collection .... 4
Data Collection Improvements, Data Encoding .... 5
Pre-analysis of Data, Development of the Dynamic Data Neural Network, Step 1 of the UpdateWizard: Downloading, Step 2 of the UpdateWizard: Updating .... 6
Step 3 of the UpdateWizard: Creating Training and Testing Files, Development of the MLP using Dynamic Data .... 7
Using the Dynamic Data MLP, Choice 1 of moviesbp.m, Choice 2 of moviesbp.m .... 9
Figure 1: DataExtractor Screenshot .... 10
Figure 2: DataConcatenator Screenshot .... 11
Figure 3: DataConverter Screenshot: Films Removed From Data File .... 12
Figure 4: DataConverter Screenshot: Films to be Updated .... 13
Figure 5: Results of preanalysis.m .... 14
Figure 6: UpdateWizard Screenshot Step 1, Figure 7: UpdateWizard Screenshot Step 2 .... 17
Figure 8: UpdateWizard Screenshot Step 3, Figure 9: NewMovie Screenshot .... 18
Discussion of Results .... 19
Bibliography .... 20
VB Source Code .... 21

Note to Grader: This report is over twenty pages, but this is because I was unsure if the
grader has the ability to read and run Visual Basic 6.0 source code.

Introduction

Preface
For the last century, film has been one of the American public's favorite entertainment
mediums. Large production companies often spend hundreds of millions of dollars to
create a single film. However, the amount of money spent on creating a film seems to
have little bearing on its success. The Blair Witch Project, for instance, was made for
under one million dollars, but it made over twenty-nine million dollars in its first
weekend in the box office. On the other hand, Waterworld, starring superstar Kevin
Costner, cost roughly one hundred seventy-five million dollars to produce, but made
back less than half of that amount in domestic box office revenue.
Predicting how much a movie will earn in opening-weekend box office revenue is
notoriously difficult. Many aspects of a movie are subjective, and public taste changes
quickly and unpredictably. A mathematical model that predicts how much a film will
make would help production companies maximize profit and skip film development
projects that are likely to hurt their profit margins.

Past Research
In CS/ECE/ME 539, in the fall semester of 2001, a student attempted to predict the
opening weekend box office revenue of a given film using an artificial neural network.
He claimed that an accurate prediction of how much a movie will gross in total can be
achieved by examining its opening weekend, because the two figures are roughly
proportional to each other. If a film has a huge opening weekend, it is likely to earn a
lot of money in the long run. His logic is correct, and I will use it again in this project.
The network's inputs are the film's characteristics, such as genre, rating, runtime, etc.
Despite his thorough work, there are deficiencies in his project. This project will be a
major improvement over his results. Namely, it will produce higher correct classification
rates while, at the same time, allowing future users to easily update the data files.
The neural network will, over time, accumulate more and more training data. A major
component of this report is developing what I call a dynamic data neural network, in
which the training data is automatically updated weekly. Instead of the project ending
with the end of the semester, future classes will be able to easily update this project's
results, and the network's correct classification rates will improve over time.

Improvements over Past Research


As an avid film buff, I independently came up with the idea of applying neural networks
to film box office revenue, but I was disappointed to find out that it had been done
before. As such, I set out to improve upon previous results. By adding more features and
more feature vectors, I hoped to obtain better results. Moreover, I wrote an
UpdateWizard that can automatically and repeatedly update the data files. For example,
every week, the top grossing films are listed at www.boxofficeguru.com. The
UpdateWizard automatically downloads the list and updates the data file. Finally, it
creates new training and testing data files for use with MATLAB and the dynamic data
neural network.

Initial Data Collection


Data pertaining to box office revenue is plentiful on the internet. As such, I sought out
the data with the most input features and the most reliability. After a thorough search, I
decided to use the information found at www.boxofficeguru.com. Already entered into its
.html files are all the films since 1989 that have grossed more than fifteen million dollars
in their first weekend. Additional data is also posted: the film's opening date, number
of theatres at opening, distributor, number of days in opening weekend, and, most
importantly, the exact amount the film grossed in its opening weekend.
Unfortunately, the data is not in a pleasant, readable format for programming use.
Consequently, using Microsoft Visual Basic 6.0 Professional Edition, I wrote a Windows
application called DataExtractor (dataextractor.exe) to parse the information out of the
*.html files. For this portion of the data collection, which is only performed once, I
manually downloaded the data files and renamed them 35plus.htm, 25to35.htm,
20to25.htm, 17to20.htm, and 15to17.htm. After running these five files through the
DataExtractor, five readable files are created called 35plus_1.txt, 25to35_1.txt,
20to25_1.txt, 17to20_1.txt, and 15to17_1.txt.
A screenshot of the DataExtractor is found on page 10 (Fig. 1).
The source code for the DataExtractor is found on pages 21+.
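The parsing step can be illustrated with a short Python sketch. This is not the Visual Basic source of the DataExtractor, and the page layout is an assumption: the sketch supposes that each film occupies one HTML table row whose cells hold the title and the numeric fields. The names RowExtractor, extract_rows, and to_tab_separated are hypothetical.

```python
from html.parser import HTMLParser


class RowExtractor(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # skip rows without data cells (e.g. header rows)
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html_text):
    """Return the table rows found in html_text as lists of cell strings."""
    parser = RowExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.rows


def to_tab_separated(rows):
    """Render rows as one tab-separated line per film."""
    return "\n".join("\t".join(row) for row in rows)
```

Calling extract_rows on the text of a downloaded page and writing to_tab_separated(rows) to a file would give one readable line per film, which is the role the *_1.txt files play in this project.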
After the DataExtractor has parsed the information, the five output files (35plus_1.txt,
25to35_1.txt, 20to25_1.txt, 17to20_1.txt, and 15to17_1.txt) need to be concatenated.
Again, using Visual Basic 6.0, I developed another Windows application called
DataConcatenator (dataconcatenator.exe). It takes the five aforementioned output files
as inputs and creates a single file called concatenated_data.txt.
A screenshot of the DataConcatenator is found on page 11 (Fig. 2).
The source code for the DataConcatenator is found on pages 21+.
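As a rough Python equivalent of the concatenation step (the actual program is written in Visual Basic 6.0), the sketch below joins the five extracted files in a fixed order. Dropping blank lines is an assumption, and the function names are hypothetical.

```python
# Input files in the order named in this report.
INPUT_FILES = [
    "35plus_1.txt",
    "25to35_1.txt",
    "20to25_1.txt",
    "17to20_1.txt",
    "15to17_1.txt",
]


def concatenate_texts(texts):
    """Join several file contents into one, dropping blank lines (assumed)."""
    lines = []
    for text in texts:
        lines.extend(line for line in text.splitlines() if line.strip())
    return "\n".join(lines) + "\n"


def concatenate_files(in_paths, out_path):
    """Read every input file and write the combined result to out_path."""
    texts = []
    for path in in_paths:
        with open(path, encoding="utf-8") as handle:
            texts.append(handle.read())
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(concatenate_texts(texts))
```

With the five files in the working directory, concatenate_files(INPUT_FILES, "concatenated_data.txt") would produce the single combined file.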
Now that a single, readable file existed, more input features needed to be added. In order
to avoid unnecessary reentering of data, I wrote another Windows application called
DataConverter (dataconverter.exe). As its inputs, it takes two files: data.txt and
concatenated_data.txt. data.txt contains the data the student from last semester used,
while concatenated_data.txt contains the updated film information created by the
DataExtractor and DataConcatenator. The DataConverter compares these two files. If a
film from data.txt did not gross over fifteen million dollars in its first weekend, it is
removed from the data file. Otherwise, the data is copied into mydata.txt. Films that
have been released since data.txt was created (in 2001) are identified and listed in the
file titlestoupdate.txt.
The DataConverter displays the films that did not gross over fifteen million dollars as
well as the movies that need to be updated (i.e., released since 2001).
Several screenshots of the DataConverter in action are found on pages 12-13 (fig. 3 and
fig. 4). The source code for the DataConverter can be found on pages 21+.
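The comparison performed by the DataConverter can be sketched in Python as a pair of set lookups. This is an illustration rather than the VB source: it assumes films are matched by title, that titles are compared case-insensitively with whitespace collapsed, and that appearing in concatenated_data.txt is what marks a film as having grossed over fifteen million dollars. The function names are hypothetical.

```python
def normalize(title):
    """Lower-case a title and collapse whitespace so near-identical titles match."""
    return " ".join(title.lower().split())


def convert(old_records, new_titles):
    """Compare last year's records against the newly extracted title list.

    old_records: dict mapping title -> record (the role of data.txt)
    new_titles:  list of titles (the role of concatenated_data.txt)
    Returns (kept, removed, to_update):
      kept      -> records copied forward (the role of mydata.txt)
      removed   -> old titles absent from the new list
      to_update -> new titles with no old record (the role of titlestoupdate.txt)
    """
    new_keys = {normalize(t) for t in new_titles}
    old_keys = {normalize(t) for t in old_records}
    kept = {t: r for t, r in old_records.items() if normalize(t) in new_keys}
    removed = [t for t in old_records if normalize(t) not in new_keys]
    to_update = [t for t in new_titles if normalize(t) not in old_keys]
    return kept, removed, to_update
```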

Once these three programs have run (DataExtractor, DataConcatenator, DataConverter), a
list of film titles that need to be updated is created and stored under the filename
titlestoupdate.txt. With this information, I looked up all the films in the file at
www.imdb.com. This website has more information than www.boxofficeguru.com: a
film's genre, rating, runtime, color/black & white/animated status, and sequel data are
listed. I looked up each movie individually and entered the data in a Microsoft Word
document.

Data Collection Improvements


Here, I made some improvements over the last project done on box office revenue. First,
I eliminated his use of the IMDB user ratings as an input feature. This data is irrelevant
because audiences cannot know whether a movie is good before its opening weekend.
While it would be important when developing a neural network to determine total gross
revenue, it is not useful in determining opening weekend gross revenue. In addition, I
removed the input feature that records the day of the month on which the film was
released. Because a given day of the month does not correspond to any particular day of
the week, this input feature was effectively random, and therefore it had little use in the
MLP development.
Several other improvements were also made. First, from general observations of the film
industry, sequels tend to do well. The audience knows what to expect. Usually, a film
studio only releases a sequel if the original did well. Whether or not a film is a sequel is
an important aspect of determining its opening weekend revenue. Next, another input
feature was added. Animated films tend to do very well. Whether or not a film is
animated has a significant impact on its opening weekend revenue. The addition of these
two input features increased correct classification rates of the multi-layer perceptron.

Data Encoding
The data contained in movies.txt is the final data file after the procedures described above
have been followed. Many features of a film are not numerical. As such, I created an
encoding scheme that allows the non-numerical data fields to be used by the multi-layer
perceptron.
Genre                 Code
Action                20
Comedy                21
Drama                 22
Family                23
Horror                24
Mystery               25
Animation             26
Romance               27
Sci-Fi                28
Thriller              29
Western               30

Rating                        Code
(label missing in source)     -5
PG                            -4
PG-13                         -3
(label missing in source)     -2

Distributor           Code
Sony                  1
Universal             2
Warner Brothers       3
Fox                   4
New Line              5
Buena Vista           6
Paramount             7
MGM/United Artists    8
MGM                   9
DreamWorks            10
Miramax               11
TriStar               12
Columbia              (code missing in source)
Artisan               (code missing in source)
Polygram              (code missing in source)
USA Films             (code missing in source)
Orion                 (code missing in source)
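A lookup-table encoder along these lines could apply the scheme. This Python sketch assumes that the codes pair with the labels in the order listed (genres 20 through 30, distributors 1 through 12); only the two ratings whose labels are legible here are included, and distributors listed without a code are left out rather than guessed. The function encode_film is hypothetical.

```python
# Codes paired with labels in listed order (an assumption about the table layout).
GENRE_CODES = {
    "Action": 20, "Comedy": 21, "Drama": 22, "Family": 23, "Horror": 24,
    "Mystery": 25, "Animation": 26, "Romance": 27, "Sci-Fi": 28,
    "Thriller": 29, "Western": 30,
}

# Only the ratings whose labels are legible; the entries for -5 and -2 are omitted.
RATING_CODES = {"PG": -4, "PG-13": -3}

DISTRIBUTOR_CODES = {
    "Sony": 1, "Universal": 2, "Warner Brothers": 3, "Fox": 4, "New Line": 5,
    "Buena Vista": 6, "Paramount": 7, "MGM/United Artists": 8, "MGM": 9,
    "DreamWorks": 10, "Miramax": 11, "TriStar": 12,
}


def encode_film(genre, rating, distributor):
    """Translate three categorical fields into their numeric codes.

    Raises KeyError for a label that is not in the tables, so an unknown
    value is never silently mapped to a number.
    """
    return (
        GENRE_CODES[genre],
        RATING_CODES[rating],
        DISTRIBUTOR_CODES[distributor],
    )
```

Raising an error on an unknown label is a deliberate choice in this sketch: a new distributor should force the table to be extended rather than receive an arbitrary code.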

Pre-analysis of the Data


Before developing the MLP, it is helpful to thoroughly examine the data. As such, I
wrote preanalysis.m in MATLAB to assist me in this task. It produces graphs showing
how many films have certain characteristics. Also, the inputs' mean values and standard
deviations are computed where applicable.
The graphs produced by preanalysis.m can be found on pages 14-16 (fig. 5).
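The kind of summary that preanalysis.m produces can be sketched in Python as follows. This is not a port of the MATLAB file: the field names are hypothetical, and any records passed in for illustration are placeholders rather than rows from movies.txt.

```python
from collections import Counter
import statistics


def count_by(films, field):
    """Count how many films share each value of a categorical field."""
    return Counter(film[field] for film in films)


def numeric_summary(films, field):
    """Return (mean, sample standard deviation) of a numeric field."""
    values = [film[field] for film in films]
    return statistics.mean(values), statistics.stdev(values)
```

The counts returned by count_by are what a bar graph of, say, films per genre would display, while numeric_summary applies to fields such as runtime where a mean is meaningful.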

Development of the Dynamic Data Neural Network


A major component of this project is the development of what I call a dynamic data
neural network: the training and testing data used by the MLP are constantly changing.
This is an improvement over networks trained on a fixed data set, including the one
previously designed to tackle the opening weekend box office revenue problem.
In Visual Basic 6.0, I developed a Windows application called the UpdateWizard
(updatewizard.exe). This program performs all the necessary steps to update the data.
Consequently, as time goes on, the MLP, which is developed later in this report, will
change and improve as its training data is updated.

Step 1 of the UpdateWizard: Downloading


The UpdateWizard begins by downloading the most up-to-date data files from
www.boxofficeguru.com. The program contacts the server and downloads five files:
open35+.htm, open25-35.htm, open20-25.htm, open17-20.htm, and open15-17.htm. The
files are processed and concatenated using methods similar to those found in the
DataExtractor and DataConverter. After the files have been downloaded, processed, and
linked, the updated data is compared to the current data file (movies.txt). Films that are
new since the last update are presented to the user.
A screenshot of the UpdateWizard in step 1 can be found on page 17 (fig. 6).

Step 2 of the UpdateWizard: Updating


In this step of the UpdateWizard, the user enters the information for the films that are new
since the last update. This information can be found at www.imdb.com. After all the
films have been updated, they are added to the data file movies.txt. The data is now
up-to-date. If no updates are available, this step is skipped, and the UpdateWizard
proceeds directly to step 3.
A screenshot of the UpdateWizard in step 2 can be found on page 17 (fig. 7).


Step 3 of the UpdateWizard: Creating Training and Testing Files


In the third and final step of the UpdateWizard, training and testing files are created. The
user has several options in this step. First, the user decides how many classes the data
will be partitioned into. Second, he or she decides when to begin the testing file. By
selecting a date, the training file will consist of all the films that were released prior to
that date, and the testing file will consist of all the films that were released after that date.
User Options in Step 3

Training File Options (i.e., classification scheme):
1. 2 Classes. Class 1: 15m-22.5m, Class 2: 22.5m+
2. 4 Classes. Class 1: 15m-18.5m, Class 2: 18.5m-23m, Class 3: 23m-32m, Class 4: 32m+
3. 5 Classes. Class 1: 15m-17m, Class 2: 17m-20m, Class 3: 20m-25m, Class 4: 25m-35m, Class 5: 35m+

Testing File Options (i.e., training and testing data separation):
Begin Testing File on January 1st of 2001, 2002, or 2003.

Output File Description


training_X_YYYY.txt and testing_X_YYYY.txt where X is the classification scheme and
YYYY is the year at which the testing file begins.
A screenshot of the UpdateWizard in step 3 can be found on page 18 (fig. 8).
The source code for the UpdateWizard can be found on pages 21+.
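The two decisions made in step 3 can be sketched in Python: assigning a class from the opening-weekend gross and splitting films at a cutoff date. This is an illustration, not the UpdateWizard source. Two details are assumptions: a gross that falls exactly on a boundary goes to the higher class, and a film released exactly on January 1st goes to the testing set. The function names are hypothetical.

```python
from bisect import bisect_right
from datetime import date

# Upper boundaries between classes, in millions of dollars, keyed by class count.
BOUNDARIES = {
    2: [22.5],
    4: [18.5, 23.0, 32.0],
    5: [17.0, 20.0, 25.0, 35.0],
}


def revenue_class(gross_millions, n_classes):
    """Return the 1-based class of an opening-weekend gross.

    A gross equal to a boundary is placed in the higher class (assumed).
    """
    return bisect_right(BOUNDARIES[n_classes], gross_millions) + 1


def split_by_year(films, test_year):
    """Split (title, release_date, gross) tuples at January 1st of test_year."""
    cutoff = date(test_year, 1, 1)
    training = [f for f in films if f[1] < cutoff]
    testing = [f for f in films if f[1] >= cutoff]
    return training, testing


def output_names(scheme, test_year):
    """Build the output file names in the training_X_YYYY.txt pattern."""
    return (f"training_{scheme}_{test_year}.txt",
            f"testing_{scheme}_{test_year}.txt")
```

Splitting by release date rather than at random means the testing set always contains films the network could not have seen, which mirrors how the MLP would be used on new releases.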

Development of the MLP using Dynamic Data


Based on Professor Hu's bp.m, bptest.m, and bpconfig.m, I developed a multi-layer
perceptron to predict a film's opening weekend revenue using dynamic data. Professor
Hu's MATLAB source code was modified and is contained in moviesbp.m,
moviesbptest.m, and moviesbpconfig.m. As in previous homework assignments, I ran
many trials to determine the optimal configuration for the MLP. After the configuration
was determined, it was hard-coded into moviesbpconfig.m so that future users do not
have to enter the configuration each time the MLP is run.
After modifying Professor Hu's multi-layer perceptron MATLAB files, I began to test the
MLP in order to determine the network's optimal configuration. To aid in this process, I
used three-way cross-validation. I wrote a MATLAB program called threeway.m to
accomplish this task. The m-file prompts the user to choose the classification scheme
that is appropriate for the training task. Then, the m-file concatenates the training and
testing files for the classification scheme. Next, it repartitions the concatenated data file
into three equally sized files. These three files are then normalized using each data
column's mean and standard deviation.
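The partitioning and normalization described for threeway.m can be sketched in Python. This is not the MATLAB file itself. Shuffling before partitioning, using the sample standard deviation, and leaving constant columns unscaled are all assumptions, and the function names are hypothetical.

```python
import random
import statistics


def three_way_split(rows, seed=0):
    """Shuffle the rows (assumed) and deal them into three equal-sized parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::3] for i in range(3)]


def zscore_columns(rows):
    """Subtract each column's mean and divide by its standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # A constant column has zero spread; divide by 1 to avoid a zero division.
    stds = [statistics.stdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


def cross_validation_folds(parts):
    """Yield (training_rows, testing_rows), holding out each part once."""
    for held_out in range(len(parts)):
        testing = parts[held_out]
        training = [
            row
            for index, part in enumerate(parts)
            if index != held_out
            for row in part
        ]
        yield training, testing
```

Each of the three folds trains on two thirds of the data and tests on the remaining third, so every film is used for testing exactly once.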
Tables Showing Trials for Optimal MLP Configuration

Mean and standard deviation of eight trials to determine optimal learning rate, α, and
momentum constant, μ. Selected learning rate and momentum constant marked with *.

α     μ     Trial 1   Trial 2   Trial 3   Trial 4   Trial 5   Trial 6   Trial 7   Trial 8   Mean     Std
0.1   0.1   52.1739   53.8753   48.2246   50.3491   52.4582   52.3498   49.3014   52.9586   51.461   1.9544
0.1   0.3   54.3248   52.3408   50.3140   51.6512   49.0293   51.6612   50.2323   52.8437   51.550   1.6750
0.1   0.5   50.3244   54.3324   55.3398   52.3202   53.4839   49.9238   53.3204   54.8942   52.992   2.0092
0.1   0.7   52.3487   57.2349   54.1437   55.2304   52.2095   58.3094   56.4205   53.3657   54.908   2.2731  *
0.3   0.1   49.0239   51.0235   50.1203   50.7654   52.5478   49.3140   51.5036   51.1126   50.676   1.1589
0.3   0.3   50.3249   49.9929   51.3418   53.1238   51.2095   51.5637   50.3140   50.4327   51.038   1.0170
0.3   0.5   50.8708   51.3047   49.3140   49.0219   54.3014   52.0237   53.2878   53.2049   51.666   1.9033
0.3   0.7   53.9238   52.377    55.0431   52.2908   52.5438   54.0324   54.1487   55.8714   53.779   1.3041
0.5   0.1   50.1239   47.1344   46.1238   48.0283   49.8721   45.2938   43.1734   50.1919   47.493   2.5537
0.5   0.3   49.0973   50.8917   51.8187   50.2398   48.2102   48.1234   50.3209   52.5407   50.155   1.6065
0.5   0.5   51.3094   50.3410   51.5209   51.5141   52.2984   52.2085   53.3407   50.4310   51.621   0.9953
0.5   0.7   50.3496   51.2508   51.2540   52.0295   50.0493   54.0321   55.0132   49.0829   51.633   2.0102
0.7   0.1   43.2308   46.0943   43.0529   43.2308   47.4230   44.0132   46.3209   47.4980   45.108   1.9267
0.7   0.3   44.4039   42.054    46.4289   47.0319   49.1098   46.4877   45.4827   47.0121   46.001   2.0897
0.7   0.5   46.3897   44.8724   47.0251   46.1421   48.2437   50.1239   49.3409   48.2138   47.544   1.7537
0.7   0.7   44.1230   46.2347   45.3047   49.0329   50.0987   49.8713   47.0929   50.1319   47.736   2.3663
Mean and standard deviation of eight trials to determine optimal number of hidden layers,
HL. Selected number of hidden layers marked with *.

Trial   HL=1 *    HL=2      HL=3
1       56.3094   54.1708   53.4308
2       57.9541   52.3407   52.5499
3       58.3420   49.1203   53.0023
4       57.4092   51.8274   48.2348
5       54.2231   53.3479   50.3299
6       55.3209   52.5027   51.4390
7       58.3428   53.0523   53.2581
8       57.9112   49.1042   50.2109
Mean    56.977    51.933    51.557
Std     1.536     1.8776    1.8458

Mean and standard deviation of eight trials to determine optimal number of hidden nodes,
H, in the single hidden layer. Selected number of hidden neurons marked with *.

Trial   H=2       H=4       H=6 *     H=8
1       48.2351   50.3460   54.5407   52.2308
2       49.4239   50.0148   53.4509   54.3498
3       47.3402   48.0129   57.2098   52.1487
4       45.4223   52.5489   55.2340   51.5023
5       47.3100   51.0253   54.2390   50.2530
6       47.1236   50.0544   53.5409   52.4879
7       45.9810   52.2587   56.4308   53.3205
8       46.3498   53.3980   55.2107   55.3991
Mean    47.148    50.957    54.982    52.712
Std     1.2763    1.7294    1.328     1.6205

The mean and standard deviation values were computed using alphamu.m, hl.m, and h.m.
These MATLAB files read data from files stats_am.txt, stats_h.txt, and stats_hl.txt.
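The summary statistics can be reproduced with a few lines of Python, shown here as a check rather than as a port of alphamu.m. The eight values are the trials reported for the first learning-rate and momentum configuration; the sketch uses the sample standard deviation (dividing by n - 1), which agrees with the reported mean of 51.461 and standard deviation of 1.9544 to the printed precision. The function name summarize is hypothetical.

```python
import statistics


def summarize(rates):
    """Return (mean, sample standard deviation) of a list of trial results."""
    return statistics.mean(rates), statistics.stdev(rates)


# Eight trial results reported for the first configuration in the first table.
first_configuration = [
    52.1739, 53.8753, 48.2246, 50.3491,
    52.4582, 52.3498, 49.3014, 52.9586,
]

mean_rate, std_rate = summarize(first_configuration)
print(f"mean = {mean_rate:.3f}, std = {std_rate:.4f}")
```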
Based on the above trials, the MLP configuration is as follows:
Learning Rate: 0.1
Momentum Constant: 0.7
Number of Hidden Layers: 1
Number of Hidden Neurons: 6
Maximum Number of Epochs: 5000
Samples Per Epoch: 64
Scaling of Input: [-5, 5]

Neurons in the hidden layer use the tanh() activation function.
Neurons in the output layer use the sigmoidal() activation function.
These are the default activation functions provided in Professor Hu's bp.m.
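To make the configuration concrete, the following Python sketch runs one forward pass through a network of this shape: inputs scaled to [-5, 5], six tanh hidden neurons, and sigmoid output neurons. It does not reproduce bp.m and performs no training. The number of inputs (8), the number of outputs (4), the min-max form of the scaling, and the random weight range are all assumptions made for illustration, and every function name is hypothetical.

```python
import math
import random


def scale_to_range(column, low=-5.0, high=5.0):
    """Linearly map a column of values onto [low, high] (min-max scaling)."""
    smallest, largest = min(column), max(column)
    if largest == smallest:          # constant column: place it mid-range
        return [(low + high) / 2.0] * len(column)
    span = largest - smallest
    return [low + (v - smallest) * (high - low) / span for v in column]


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_inputs, n_neurons, rng):
    """Random weights per neuron; index 0 of each row is the bias weight."""
    return [
        [rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
        for _ in range(n_neurons)
    ]


def forward(inputs, w_hidden, w_output):
    """Compute the network outputs for one input vector."""
    with_bias = [1.0] + list(inputs)
    hidden = [
        math.tanh(sum(w * v for w, v in zip(row, with_bias)))
        for row in w_hidden
    ]
    hidden_with_bias = [1.0] + hidden
    return [
        sigmoid(sum(w * v for w, v in zip(row, hidden_with_bias)))
        for row in w_output
    ]


def predicted_class(outputs):
    """Return the 1-based index of the largest output."""
    return outputs.index(max(outputs)) + 1
```

For a four-class scheme, the class whose output neuron responds most strongly is taken as the prediction, which is what predicted_class returns.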

Using the Dynamic Data MLP

After using the UpdateWizard to update the data file and create training and testing files,
the dynamic data MLP is ready to use. From the MATLAB prompt, run moviesbp.m by
entering moviesbp. Note that moviesbp.m requires many support m-files that are not
included in the *.zip file. They are, however, available for download from the
CS/ECE/ME 539 website http://www.cae.wisc.edu/~ece539/fall03/index.html in the
section entitled "MATLAB Files Used in the Class."
When moviesbp.m begins, the user has two choices.

Choice 1 of moviesbp.m
Choice 1, "Predict the Revenue of a Newly Released Film," allows the user to test the
dynamic data MLP on a new movie. The user must first, however, run the Windows
application called NewMovie (newmovie.exe). This program, developed using Visual
Basic 6.0, provides a graphical user interface that lets the user enter the characteristics of
a new film. The NewMovie program then creates a file called testsinglemovie.txt based
on the entered characteristics.
Once the output file is created, moviesbp.m can be run. After selecting option one, the
MLP is trained per the user's instructions. Then, the MLP is tested using the film
information contained in testsinglemovie.txt. Finally, moviesbp.m classifies the movie
according to the classification scheme the user chose earlier and reports the
corresponding revenue range as its prediction. Consult classes.txt for a description of
the classification schemes.
A screenshot of the NewMovie in action can be found on page 18 (fig. 9).
The source code for NewMovie can be found on pages 21+.

Choice 2 of moviesbp.m
Choice 2, "Simply Train and Test the MLP," allows the user to train and test the
dynamic data MLP. It runs much like Professor Hu's bp.m. Neuron weights can be
found in the variable w. It also outputs the confusion matrices and classification rates for
the training and testing datasets.
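For readers unfamiliar with these outputs, the Python sketch below shows how a confusion matrix and a classification rate are computed from actual and predicted class labels. It is a generic illustration with hypothetical function names, not the code in moviesbp.m, and it assumes classes are numbered from 1.

```python
def confusion_matrix(actual, predicted, n_classes):
    """Count films by (actual class, predicted class); classes are 1-based.

    Row i holds the films whose true class is i + 1; column j holds the
    films that were assigned class j + 1.
    """
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a - 1][p - 1] += 1
    return matrix


def classification_rate(matrix):
    """Percentage of films on the diagonal, i.e. classified correctly."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * correct / total if total else 0.0
```

The off-diagonal entries show which classes are confused with one another, information that the single classification rate hides.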

Figure 1: DataExtractor Screenshot

Figure 2: DataConcatenator Screenshot

Figure 3: DataConverter Screenshot: Films Removed From Data File

Figure 4: DataConverter Screenshot: Films to be Updated

Figure 5: Results of preanalysis.m

Figure 5 Continued: Results of preanalysis.m

Figure 5 Continued: Results of preanalysis.m

Figure 5 Continued: Results of preanalysis.m

Figure 6: UpdateWizard Screenshot Step 1

Figure 7: UpdateWizard Screenshot Step 2

Figure 8: UpdateWizard Screenshot Step 3

Figure 9: NewMovie Screenshot

Discussion of Results

The classification rates of the dynamic data multi-layer perceptron range from
fifty-four to fifty-nine percent. This is roughly a four percent improvement over a similar
project performed in the fall of 2001. As discussed in the introduction, predicting the
box-office success of a film is difficult. Even so, the MLP classifies films correctly
more than half of the time. This is a good result because it occurs when there are four
classes. If the MLP performed no better than random classification, its classification
rate would be around twenty-five percent, assuming the four classes are roughly equally
populated. This project is a success because I improved upon past results.
Moreover, the most interesting aspect of the project was the UpdateWizard. The MLP
developed can easily be retrained on data that is constantly changing. The UpdateWizard
makes this entire process easy and seamless. It is my hope that over time the wizard will
accumulate more and more data, which will cause correct classification rates to further
improve. The UpdateWizard's functionality is better than I originally expected. It is
fairly easy to use and rarely makes mistakes. As such, this component of the project is a
success, especially because no CS/ECE/ME 539 students have attempted such an
application of neural networks in the past.

Bibliography

Film Industry:
Rand, Philip. A Guide to the Film Industry. London: Emerald, 2003.
Visual Basic References:
David, Harold. Visual Basic 5 Secrets. Foster City, CA: IDG Books Worldwide, 1997.
Mansfield, Richard. The Visual Guide to Visual Basic for Windows: The Illustrated,
Plain-English Encyclopedia to the Windows Programming Language Version 3.0, 2nd
Edition. Chapel Hill, NC: Ventana Press, Inc., 1993.
Neural Networks:
Haykin, Simon. Neural Networks: A Comprehensive Foundation, 2nd Edition. Upper
Saddle River, NJ: Prentice-Hall, Inc., 1999.
Neelakanta, Perambur S., ed. Information-Theoretic Aspects of Neural Networks. Boca
Raton, FL: CRC Press, 1999.
