Professional Documents
Culture Documents
Lucas Nunno
University of New Mexico
Computer Science Department
Albuquerque, New Mexico, United States
lnunno@cs.unm.edu
I. I NTRODUCTION
The stock market is known to be a complex adaptive
system that is difficult to predict due to the large number
of factors that determine the day to day price changes. We
do this in machine learning through regression which tries to
determine the relationship between a dependent variable and
one or more independent variables. Here, the independent
variables are the features and the dependent variable that
we would like to predict is the price. It is apparent that
the features that we are using are not truly independent,
we know that the volume and outstanding shares are not
independent as well as the closing price and the return
on investment not being independent. However, this is an
assumption that we are making to simplify the model in
order to use the chosen regression models.
This study aims to use linear and polynomial regression models to predict price changes and evaluate different
models success by withholding data during training and
evaluating the accuracy of these predictions using known
data.
This research concerns closing prices of stocks, therefore
day trading was not modeled. The model for the stock
market was only concerned with the closing price for stocks
at the end of a business day, high-frequency trading is an
area of active research, but this study preferred a simplified
model of the stock market.
IV. M ETHODS
A. Data Representation
The dataset that was used was collected from the CRSP
US Stock Database [2] as a collection of comma-separated
values where each row consisted of a stock on a specific day
along with data on the volume, shares out, closing price, and
other features for that day in time.
The Python scientific computing library numpy was used
along with the data analysis library pandas in order to
convert these CSV files into pandas DataFrames that were
indexed by date. Each specific stock is a view of the master
DataFrame that is filtered based on that stocks ticker. This
allowed efficient access to stocks of interest and convienient
access to date ranges.
These stock DataFrame views are then used as the data
to be fed into our regression black boxes.
II. M OTIVATION
Stock market price prediction is a problem that has the
potential to be worth billions of dollars and is actively
researched by the largest financial corporations in the world.
It is a significant problem because it has no clear solution,
although attempts can be made at approximation using many
1
A. Linear Regression
Linear regression was less sensitive to normalization techniques as opposed to the polynomial regression techniques.
Some plausible results were appearing early on in the study
even when a small number of features were used without
normalization, while this caused the polynomial regression
models to overflow. Linear regression also provided plausible results after normalization with no parameter tuning
required due to its simplified model, although the accuracy
was less than would be desired if relying on the results for
portfolio building.
Figure 3: Price prediction for the Apple stock 10 days in the future
using Linear Regression.
Figure 4: Price prediction for the Apple stock 45 days in the future
using Linear Regression.
Figure 8: Sample result of using the RBF kernel with the SVR.
This data was trained on the previous 48 business day closing prices
and predicted the next 45 business day closing prices.
Figure 9: Window size comparison for SVR using the RBF kernel.
Support vector regression with the rbf kernel was not very
sensitive to window size changes, which is very different
than linear and SVR with the polynomial kernel; which were
both very sensitive to window size changes.
Figure 7: Window size comparison for SVR using the polynomial
kernel.
When comparing the regression, mean absolute percentage error was used because of the high variation of stock
prices. This ensures that the results are not biased against
stocks with higher prices, since the error is calculated as the
percentage of that stock price that the prediction was off by.
ACKNOWLEDGMENT
The CRSP dataset was generously provided by the Department Chair of Finance at the University of New Mexico, Dr.
Leslie Boni.
I would also like to thank Dr. Trilce Estrada for providing
guidance on the project and helping to motivate the various
regression techniques used above.
R EFERENCES
[1] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,
B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss,
V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,
M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn:
Machine learning in Python. Journal of Machine Learning
Research, 12:28252830, 2011.
[2] Center for Research in Security Prices The University of
Chicago. Us stock databases. web, 2014.
[3] C. Tsai, and S. Wang 2009. Stock Price Forecasting by
Hybrid Machine Learning Techniques. Proceedings of the
International MultiConference of Engineers and Computer
Scientists, 20-26, 2009.