
Fundamentals of

Stochastic Finance:
Module 1
MSc Financial Engineering
Table of Contents

1. Brief

2. Course context

2.1 Course-level learning outcomes

2.2 Module breakdown

3. Module 1: Introduction to R and Regression

3.1 Module-level learning outcomes

3.1.1 Transcript: What is R?

3.1.2 Notes: The Basics of R

3.1.3 Transcript: The Basics of Using R

3.1.4 Notes: Exploratory Data Analysis

3.1.5 Notes: Regression Analysis

3.1.6 Transcript: Homoscedasticity vs Heteroscedasticity

3.1.7 Notes: Properties of an Acceptable Empirical Model

3.1.8 Transcript: Interpreting and Evaluating Regression Output

1. Brief

This document contains the core content for Module 1 of Fundamentals of Stochastic Finance,
entitled Introduction to R and Regression. It consists of four video lecture scripts and four sets of
supplementary notes.

2. Course Context

Fundamentals of Stochastic Finance is the first transition course, bridging the gap between the old
and new curricula of the WorldQuant University (WQU) Master of Science in Financial Engineering
(MScFE) programs. As an introduction to the applications of statistical techniques in finance, the
course follows two main themes: the analysis of financial time series, and the fundamentals of
derivative pricing using discrete-time stochastic calculus.

2.1 Course-level Learning Outcomes

Upon completion of the Fundamentals of Stochastic Finance course, you will be able to:

1 Write programs and solve common statistical problems in the R language.


2 Describe and fit a time series model to data.
3 Apply Extreme Value Theory to portfolio management.
4 Understand the parameter estimation of ARCH and GARCH models.
5 Understand the language of measure-theoretic probability.
6 Describe discrete-time stochastic processes and martingales.
7 Price derivatives in discrete time.

2.2 Module Breakdown

The Fundamentals of Stochastic Finance course consists of the following one-week modules:

1 Introduction to R and Regression


2 Time Series Modeling
3 Extreme Value Theory
4 Introduction to Financial Risk Management
5 Elementary Measure-theoretic Probability
6 Discrete-time Stochastic Processes and Martingales
7 Derivative Pricing in Discrete Time

3. Module 1

Introduction to R and Regression

Welcome to the Introduction to R and Regression module. In this, the first week of the Fundamentals
of Stochastic Finance course, you will start with an introduction to the R statistical programming
language, before applying this knowledge to build linear regression models. After exploring the basics
of coding and statistical analysis with R, you will also learn about the characteristics of univariate
and multivariate distributions, the important exploratory data techniques relevant to the financial
engineer, and how to perform regression analysis in R.

3.1 Module-level learning outcomes

Upon completion of this module, you will be able to:

1 Write programs using the R language.


2 Use R packages to solve statistical problems.
3 Use R for visualization.
4 Formulate and fit the multiple linear regression model and its variants.

3.1.1 Transcript: What is R?

R is an open source project, which means that:

a) You can download the main package for free;

b) You have access to a vast array of R implementations written by the community that uses R; and

c) You can write and post your own contributions to the project.

In the context of the wonderful “open source revolution”, the R project is a prime example of the
system working (almost) perfectly. You should keep in mind that this revolution is not flawless,
however; there is always a trade-off. The fact that R is open source is both its biggest strength and
its biggest weakness: anyone can use the project and anyone can contribute to it, which means quality
and usefulness are not always guaranteed. Not everyone who contributes packages to R is as meticulous
about their contributions as the new user would hope them to be. This means you should be careful
and critical of any R package you use in your professional career. In particular, be wary of packages
that have rarely been downloaded and view them with some rational conservatism. Popular and
thoroughly tested packages will be more reliable, as they are likely to already have all their kinks
sorted out.

R is so popular that, for the central parts of R that almost everyone uses and for every “often-used
package”, a strong system of peer review and evaluation has evolved organically. This is why we
can be sure the packages we use in this course are of good quality.

You should also be aware of the proprietary side of the statistical analysis world. If you work for a
big player in the financial world, you will have access to an alternative group of programs that are
often used for econometrics, for example EViews, MATLAB, Mathematica and Stata, among many
others. The relative benefit that these corporate (for-profit) programs have over the open-source
alternatives is that their vendors employ professional developers who make sure that there are no bugs
in any function or application. The downside is, of course, that these programs can be quite expensive.

You will have to work out which compromise is best for you, based on what you have access to.

We will use R in this course as it is free, has a large network of users and evaluators, and uses a
syntax that is easy to learn.
3.1.2 Notes: The Basics of R

Once you have downloaded R, you can get started by experimenting with the different arithmetic
commands. As a convention in the course, R commands will look like this:

>>> This is an R command

Obtaining packages

We now go through how to install the packages we will be using in the course. First, choose a CRAN
mirror site for the download. Type the following command in R:

>>> chooseCRANmirror()

To make the download as fast as possible, choose the mirror nearest to you. We now install the
packages that we will be using throughout the course. Type and run the following commands:

>>> install.packages("tidyverse")
>>> install.packages("tseries")
>>> install.packages("forecast")
>>> install.packages("fGarch")
>>> install.packages("vars")
>>> install.packages("evir")
>>> install.packages("copula")

If you want to use a function or method from a package, you must first load the package into your
current session so that R is able to call the functions in the package. To do this, use the command:

>>> library(tidyverse) # loads the tidyverse package

The # symbol marks the start of a comment in R, and is ignored by the interpreter.

Using scripts
So far, we have been exercising the basics of R in the command window, but this is not a good way
of doing complicated analyses that need to be replicable. In practice, you should always do your
code-writing in a script, or a set of scripts and functions as you become more proficient with coding.
For instance, our data import step and mean calculation could be written up in a script, and all
executed at once by clicking on Source, or executed line-by-line by using the key combination
Ctrl+Enter.

You will be writing a lot of code for different applications in this course. It is therefore best practice
to keep to a meticulous commenting regime. Every step needs an explanation of why it is done,
otherwise you will waste time later in trying to reconstruct your thoughts. There are often many
different ways of coding the same function, so keeping track of these things is important.
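
For instance, the import-and-mean script mentioned above might look like the following minimal
sketch. The file name FinData.xlsx and the MSFT column are assumptions based on the example
dataset used later in this module; adjust them to your own setup.

# import_and_summarise.R
# Import the daily price data and compute a simple summary statistic.

library(readxl)                        # read_excel() imports Excel files
library(tidyverse)                     # general data-management tools used in this course

FinData <- read_excel("FinData.xlsx")  # assumed file name; change the path to your own file

mean_msft <- mean(FinData$MSFT)        # mean closing price of Microsoft
mean_msft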

3.1.3 Transcript: The Basics of Using R

Each statistical (or coding) language has its own syntax. This is the appropriate way of calling
functions and receiving their outputs. The R language is quite simple once you get used to it. It is
therefore a critical part of your training to use R as frequently as possible so that it becomes
“second nature” to convert any statistical problem into “R code” in your mind.

All statistical analyses eventually boil down to complicated combinations of basic operations:
addition, subtraction, multiplication and division. Thus, it is unsurprising that you can use R as a
calculator:

23*35
[1] 805

Here you learn the first “storage type” that R encodes: unnamed numbers. Like in standard matrix
algebra, we often refer to lists or vectors of numbers. R expects the same. The answer to our query
23*35 generated a scalar-valued answer: 805. In most instances, the financial engineer will work
with many observations on several variables, the most common structures being vectors and matrices
of numbers (e.g. returns on a group of assets).

Before we import data, let’s explore the standard matrix algebra computations that you can do with R.

Let’s assign a random vector to the name “a”:

library(stats)
a <- rnorm(3)
a
[1] 0.1481934 -0.3550795 0.8735789

and another random vector to the name “b”:

b <- rnorm(3)

Note the <- operator. This is R’s “preferred” symbol to represent the operation “assign to”.
It is also possible to use =, as we will demonstrate below. You should take note of this, as it
will make code developed by others accessible to you.
Next, we generate a matrix by using cbind (this binds arguments as columns, if conformable):

X = cbind(a,b)
X
a b
[1,] 0.1481934 0.7635945
[2,] -0.3550795 -0.1745421
[3,] 0.8735789 0.1133904

This now looks like a data matrix: three observations on two variables, with names a and b stored as
column headings of the variables.

We can consider element-by-element multiplication:

X*X
a b
[1,] 0.02196129 0.58307659
[2,] 0.12608144 0.03046493
[3,] 0.76314009 0.01285737

We can also consider the inner product X’X (where t() is the transpose function – its output is the
matrix transpose of its input, and %*% denotes matrix multiplication):

t(X)%*%X
a b
a 0.9111828 0.2741914
b 0.2741914 0.6263989

Or XX’:

X%*%t(X)
[,1] [,2] [,3]
[1,] 0.6050379 -0.1858998 0.2160429
[2,] -0.1858998 0.1565464 -0.3299813
[3,] 0.2160429 -0.3299813 0.7759975
Note how R preserves the column names and row numbers for the two operations.

All statistical packages use linear algebra results like these, but pre-coded and applied to their
data types. A handy reference for the syntax of basic operations is available from
Quick-R (https://www.statmethods.net/advstats/matrix.html).

Next, let’s upload some data. In this module we will perform exploratory data analysis in the form
of a tutorial on R, developing the content of exploratory data analysis and some features of the
multivariate normal and t distributions.

The tidyverse collection of packages, which integrates closely with RStudio, is a convenient and
powerful data-management option that we will use to some extent in this course. The wider tidyverse
ecosystem includes readxl, which provides a simple way of importing Excel data into an R data structure.

In the RStudio GUI, it is even simpler:

Under Global Environment, select “Import Dataset” and “From Excel” from the drop-down menu:

Browse to the location of your data and select the relevant dataset. You should see a preview of
the dataset, in which each column is initially treated as arbitrary numerical data. In this case the
first column is the date of observation, which we want to use in our analyses and graphs.

Select the tab under the column label “Date” and select the Date data type.

Now you can click on “Import”, but before you do, note that the necessary code is provided for you.
You can copy and paste this into a script so that you can repeat your analyses without having to
manually import the dataset every time.
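
For reference, the generated code will look roughly like the following sketch. The file name and
the number of column types are placeholders for whatever you selected in the dialog:

library(readxl)
FinData <- read_excel("FinData.xlsx",
                      col_types = c("date", rep("numeric", 8)))  # first column as Date, the rest numeric
View(FinData)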

Now you should see your data open in the viewer at the top left:

This dataset contains the daily closing prices of the following stocks: Microsoft (MSFT),
Apple (APPL), Intel (INTC) and IBM (IBM), along with their implied log returns, indicated with an _lr suffix.

This data is stored in an object called FinData, which is a tibble – a data structure that forms part of the tidyverse framework.

We will not work extensively with this form, but it is worth reading up on.

For our purposes, all we need to know is how to work with objects.

An object can be thought of as a “folder” that stores various things. In our example the FinData
object contains “elements”, each being a column with a specific name, data type and values.

To access / manipulate an element of the object, we use the $ operator since this is an S3 object.
There is also a different structure, an S4 object, where we use the operator @ to access an
element – beyond this distinction they work the same for our purposes.

Thus, to compute the mean of the series MSFT:

mean(FinData$MSFT)

[1] 37.44039

where the argument of the function mean is the element of the object FinData with name MSFT.
An error would have occurred if we tried to reference an element not in FinData. You can thus
think of objects as filing systems just like the systems on your computer.
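
The same pattern applies to any other element. For example, assuming the log-return columns follow
the _lr naming convention mentioned above:

mean(FinData$MSFT_lr, na.rm = TRUE)  # average daily log return of Microsoft (column name assumed)
sd(FinData$MSFT_lr, na.rm = TRUE)    # standard deviation (daily volatility) of the log returns

The na.rm = TRUE argument is included in case the first log return in the series is recorded as missing.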

3.1.4 Notes: Exploratory Data Analysis

The financial engineer will inevitably be interested in investing in a portfolio of assets to achieve
his or her goals. You will learn a lot about the optimal construction of such a portfolio during this
degree, so to start with we will just keep to the basics.

For you to know whether or not to invest in a portfolio of two assets, you would need to understand
how they are likely to move individually, but also how they are likely to co-move.
If two stocks are anything less than perfectly correlated, there will always exist a way to reduce
portfolio risk below that of either of the component assets.

The financial engineer works with uncertain variables that evolve over time. For example, the
returns on different banks may behave differently as a group during normal times than they would
during a housing boom or a banking sector crisis.

The properties we will impose on this time-varying behaviour are captured statistically by the joint
cumulative probability distribution, and it is the features of this mathematical object that we will
study in this module. We will apply the joint cumulative probability distribution to some real-world
data to extract some useful information. This information will tell us which of the models
we develop in this course will be applicable to a certain set of data.

The first step in exploratory data analysis is a visualization of the variable of interest over time.
This is called a time plot. We will use the ggplot2 package, which forms part of the tidyverse.

Open the R script that you will find a link to in the notes on the site.

The ggplot method works as follows.

The main call establishes the dataset involved and can recognize dated data like ours without any
further human input. This is one of the reasons why ggplot2 is such a great option. For every
additional element you wish to add to the graph, you add a + geom_XXX call. In our example
we will use a single geom_line call, as sketched below.
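
This sketch assumes the FinData object imported earlier, with its Date and MSFT columns; the exact
calls in the course script may differ in the details:

library(ggplot2)                             # loaded automatically with the tidyverse

ggplot(FinData, aes(x = Date, y = MSFT)) +   # main call: dataset and aesthetics
  geom_line() +                              # add the time plot as a line
  labs(title = "Microsoft daily closing price",
       x = "Date", y = "Closing price")

The structure – one main ggplot() call plus one geom_ call per graphical element – is the same
however many layers you add.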

3.1.5 Notes: Regression Analysis

Assumption 1: linearity

1 The more correct term for a function with a constant and some linear terms is an affine function. In strict mathematical language, a linear function has no constant.
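
The model equations from the original notes are not reproduced here. As a sketch, the multiple
linear regression model this assumption refers to can be written as

y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \varepsilon_i, \qquad i = 1, \dots, n,

or, stacking the observations, in matrix form as

y = X\beta + \varepsilon,

where X is the n × (k+1) matrix of explanatory variables (including a column of ones for the
constant). The assumption is that the conditional expectation of y is linear (strictly speaking,
affine) in the parameters.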
Assumption 2: multicollinearity

A weaker requirement on an estimator is consistency: even though an estimator may have some
bias in a small sample, it can still be considered “reliable” if it converges to the true parameter
value as the sample grows infinitely large. There are several situations in which this
occurs even without strict exogeneity.

Assumption 3:

Assumption 4: homoscedasticity

Recall that this setup is for a cross section, so homoscedasticity says that there is no correlation
in the errors across observational units (say, stock prices at a given moment). If this had been a
univariate time series (studied in Module 2), we would have considered the no serial correlation
condition. That is, each error must be uncorrelated with its own past / future. In a panel context,
we would consider both conditions.

Under homoscedasticity and the other assumptions above, the OLS estimator is the Best Linear Unbiased
Estimator (BLUE). This means an OLS regression is the best method of extracting information
about the relationship between a dependent variable and some explanatory variables, provided the true
relationship satisfies the four assumptions above. No other linear unbiased estimator (e.g. Generalized
Least Squares, which we will briefly discuss below) can have a lower variance.

Important terminology: An estimator that achieves the lowest variance possible is called efficient.
This is because it uses all available information in the data.

Let us derive the variance of the OLS estimator.
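
The algebra from the original notes is not reproduced here; a sketch of the standard derivation,
under the assumptions above with Var(ε | X) = σ²I, runs as follows:

\hat{\beta} = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'\varepsilon

so that

Var(\hat{\beta} \mid X) = (X'X)^{-1}X' \, Var(\varepsilon \mid X) \, X (X'X)^{-1} = \sigma^2 (X'X)^{-1}.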

Note that the variance of our estimator depends on an unknown parameter: the variance of the
population errors of the data generating process. We cannot observe these errors; we can only
fit the model and study the parts that we cannot explain: the regression residuals. Residuals and
errors are very distinct objects and you should always keep them separate in your mind.

The general result is: if and only if our assumptions about the unobserved data generating process
are correct will the best unbiased estimator yield residuals that behave in the way we postulate the
population errors behave under our assumptions. If one or more of our assumptions are wrong, we cannot
use the residuals from an incorrect model for inference about the errors of the process of interest.

Thus, to produce results that we can apply statistical tests to, we need to replace this unknown
error variance with a consistent estimator. We will use the variance of the estimated residuals for
this purpose. There are many important details that we should look into. Hendry and Nielsen (2007)
give a very practical discussion of the estimation issues, and Hayashi (2000) gives a deep treatment
of the asymptotic properties of the different choices that one could make here.
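
As a sketch, the consistent estimator referred to here is the usual residual variance

s^2 = \frac{e'e}{n - k}, \qquad e = y - X\hat{\beta},

where k is the number of estimated coefficients (including the constant). Plugging s² into the
formula above gives the estimated variance s²(X'X)^{-1} that regression software reports.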

3.1.6 Transcript: Homoscedasticity vs Heteroscedasticity

In this video, we will be comparing homoscedastic and heteroscedastic data generating processes.
Homoscedasticity means that the variance-covariance matrix of the unexplained part of your
variable of interest, ε, is a constant diagonal matrix in the population. One of the assumptions of
the classic linear regression model is homoscedasticity, or in other words, the same dispersion of
errors. Heteroscedasticity, on the other hand, is when unequal dispersion of errors occurs.

Consider a very simple case:

Process 1 is homoscedastic:
y_i = 2 + 0.6 x_i + ε_i,   ε_i ~ N(0, σ²)

Process 2 is heteroscedastic:
y_i = 2 + 0.6 x_i + ε_i,   ε_i ~ N(0, (1 + x_i) σ²)

Note that the variance of the error increases with the size of x_i.

When we compare scatter plots of data from Process 1 and Process 2, it can clearly be seen that
Process 2 has an unequal dispersion of errors.
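
A sketch of how such data can be simulated and plotted in R follows; the sample size, the range of x
and σ = 1 are arbitrary choices for illustration, not the values used to produce the course graphs:

set.seed(123)                                    # for reproducibility
n <- 200
x <- runif(n, 0, 10)

y1 <- 2 + 0.6 * x + rnorm(n, sd = 1)             # Process 1: constant error variance
y2 <- 2 + 0.6 * x + rnorm(n, sd = sqrt(1 + x))   # Process 2: variance grows with x

par(mfrow = c(1, 2))
plot(x, y1, main = "Homoscedastic");   abline(2, 0.6, col = "grey")
plot(x, y2, main = "Heteroscedastic"); abline(2, 0.6, col = "grey")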

Notice in the left-hand graph how the variability of the error around the true
expected value (grey line) is constant across X. Most of the points are close to
the line and the dispersion around the line does not obviously vary with X.

Notice in the right-hand graph how the variability of the error around the true expected value
(grey line) increases with X. Most of the points are still close to the line, but the dispersion around the
line obviously increases with X.

In this simulated example it is clear that there is enough data to identify the true relationship
accurately.

When there is a very large data set (n = 1000), it is obvious that there is enough information to
infer accurately. Here you see an OLS estimate (cyan) and a GLS estimate (magenta) of the true slope,
but they are so similar you can only see the top one. Both are consistent, so with this amount of
data they will give equally reliable inference.
When we have a very small data set (n = 20), the estimates can be quite different (OLS estimates
are cyan, GLS estimates are magenta). However, with such a small sample, one would be unwilling
to trust either model.
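
Because the error variance in Process 2 is proportional to (1 + x), GLS here amounts to weighted
least squares with weights 1/(1 + x). A sketch of the comparison described above (simulated data,
so your estimates will differ):

n <- 20                                   # try n <- 1000 to see the estimates converge
x <- runif(n, 0, 10)
y <- 2 + 0.6 * x + rnorm(n, sd = sqrt(1 + x))

ols <- lm(y ~ x)                          # ordinary least squares
gls <- lm(y ~ x, weights = 1 / (1 + x))   # GLS/WLS using the known variance structure

coef(ols)
coef(gls)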

You might ask: Why does heteroscedasticity matter? It is about how much information there is
in a specific observation.

Under Homoscedasticity:

Under Heteroscedasticity:
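
The expressions from the original slides are not shown above. As a sketch of the standard results:
under homoscedasticity, with Var(ε | X) = σ²I,

Var(\hat{\beta} \mid X) = \sigma^2 (X'X)^{-1},

whereas under heteroscedasticity, with Var(ε | X) = Ω a diagonal matrix with unequal entries, the
variance takes the sandwich form

Var(\hat{\beta} \mid X) = (X'X)^{-1} X'\Omega X (X'X)^{-1}.

This is why the usual OLS standard errors are no longer valid under heteroscedasticity and robust
(or GLS) alternatives are used instead.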

In conclusion, taking cognisance of the information content of an observation is important. Doing
so statistically is very difficult. GLS estimators tend to have poor small-sample properties – they
only work well when the sample is very large. The standard practice is to use OLS estimators for
the slope but robust estimators for the variance-covariance matrix of the slope coefficients.

This video compared homoscedasticity to heteroscedasticity. Study the notes to learn more
about homoscedasticity.

3.1.7 Notes: Properties of an Acceptable Empirical Model

Our goal is to extract reliable statistical information about a data generating process from some
observed data. For this we need an acceptable empirical model. In this section we study a few of
the basic properties that an acceptable model must have:

1. It should explain an acceptable amount of the variation of the variable of interest.


2. It should contain all those explanatory variables necessary to explain the variable of interest.
3 It should not contain any unnecessary variables; a parsimonious model increases the precision of the estimates.

There are a number of tests that we use to evaluate the acceptability of an empirical model, and to
test hypotheses about variables. These are usually collected in a set of diagnostic statistics.

Total fit
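
The details from the original notes are omitted above. The headline measure of total fit, interpreted
again in the regression-output video below, is the R-squared,

R^2 = 1 - \frac{\sum_i e_i^2}{\sum_i (y_i - \bar{y})^2},

the share of the variation in y explained by the regression, together with the adjusted R-squared,

\bar{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k},

which penalizes additional explanatory variables.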

Inference on individual explanatory variables

Small Sample Inference:
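
As a sketch of the standard small-sample result (which assumes normally distributed errors), the
statistic for testing H_0: \beta_j = \beta_{j,0} is

t_j = \frac{\hat{\beta}_j - \beta_{j,0}}{se(\hat{\beta}_j)} \sim t_{n-k} \quad \text{under } H_0,

where se(\hat{\beta}_j) is the square root of the j-th diagonal element of s²(X'X)^{-1}.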

Large Sample Inference:
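
In large samples the normality assumption on the errors can be dropped; as a sketch, the same
statistic is then compared with the standard normal distribution,

t_j = \frac{\hat{\beta}_j - \beta_{j,0}}{se(\hat{\beta}_j)} \xrightarrow{d} N(0, 1),

so the familiar ±1.96 critical values apply approximately at the 5% level.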


Inference on joint restrictions on coefficients
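
As a sketch of the standard result, the statistic for testing q joint linear restrictions (for
example, that all slope coefficients are zero) is

F = \frac{(SSR_r - SSR_u)/q}{SSR_u/(n - k)} \sim F_{q,\,n-k} \quad \text{under the null},

where SSR_r and SSR_u are the residual sums of squares of the restricted and unrestricted models.
This is the statistic interpreted in the regression-output video below, with q = 4 and n − k = 95.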

In practice, this test is simple to apply: for example, it can be used to test whether there are any
significant explanatory variables in the model at all.

R lab: omitted variable bias

Mathematically, the problem can be sketched as follows.
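
Suppose the true model is y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon but we regress y on
x_1 alone. A standard result (stated here as a sketch) is

E[\hat{\beta}_1] = \beta_1 + \beta_2 \delta_1, \qquad \delta_1 = \frac{Cov(x_1, x_2)}{Var(x_1)},

so the estimate is biased whenever the omitted variable both matters (β_2 ≠ 0) and is correlated
with the included variable (δ_1 ≠ 0). The lab can be reproduced in spirit with simulated data; the
variable names and parameter values below are illustrative only, not the course's:

set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)                 # x2 is correlated with x1
y  <- 1 + 0.5 * x1 + 0.7 * x2 + rnorm(n)

coef(lm(y ~ x1 + x2))   # both slopes close to their true values
coef(lm(y ~ x1))        # slope on x1 biased towards 0.5 + 0.7 * 0.8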

Generalized least squares
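
The derivation from the original notes is not reproduced above. As a sketch, when Var(ε | X) = Ω is
known (at least up to scale), the generalized least squares estimator is

\hat{\beta}_{GLS} = (X'\Omega^{-1}X)^{-1} X'\Omega^{-1} y,

which is BLUE in this more general setting. When Ω is diagonal, this reduces to the weighted least
squares estimator used in the heteroscedasticity video above.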

3.1.8 Transcript: Interpreting and Evaluating Regression Output

In this video, we will be interpreting and evaluating regression output for a simulated data set
generated from a known true model.

You can reproduce a version of the results with the following code. Since we are using random
number generators, your results will differ slightly. This is a good time to consider the impact that
sampling variation may have, even under the best of circumstances. All variables are normally
distributed and independent. Note that we add a fourth unrelated random variable to the regression
to show an example of an insignificant estimate.
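
Since the original code block is not reproduced above, the following is a sketch of the setup
described; the coefficient values and error scale are placeholders rather than the course's actual
values, and the sample size of 100 matches the degrees of freedom discussed below:

set.seed(1)                               # your draws, and hence estimates, will differ
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
x4 <- rnorm(n)                            # unrelated variable, included to show an insignificant estimate

y <- 1 + 0.5 * x1 - 0.3 * x2 + 0.8 * x3 + rnorm(n)   # placeholder "true" model

fit <- lm(y ~ x1 + x2 + x3 + x4)
summary(fit)                              # produces output of the kind interpreted below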

We will be interpreting and evaluating the different sections of the estimation output to reach a
conclusion about the model.

Consider the following ordinary least squares estimation output:

These are the estimates of the parameters. Notice that the constant and the slope parameters on
the first three variables are all close to their true values.

These are the standard errors of the estimated coefficients. Notice that, for the variables that
belong in the model, the difference between each estimated coefficient and its true value is less
than two standard errors. This means that a hypothesis that the coefficients equal their true values
would not be rejected.

These are the t-statistics of the test for individual significance for each of the estimates. Each is
simply the ratio of the estimate and its standard error.

These are the probability values of the t-tests for individual significance for each of the estimates.
Each is the answer to the question: if the true coefficient were 0, how likely would it be to obtain
an estimate at least as large (in absolute value) as the one observed, purely from sampling variation?

The R-squared measures how much of the variation in y is explained by the 4 explanatory variables.
The adjusted R-squared is the same but penalized for the number of explanatory variables. This is
a better model diagnostic, as the R-squared is always increasing in the number of explanatory
variables, so it may lead a researcher to include irrelevant variables.

Is 42% of the variation explained enough? It depends on the context. In the cross section, it is
reasonably acceptable. In time series it would be far too low in many applications.

Just like the t-tests evaluate the statistical significance of each explanatory variable individually,
the F-test evaluates the joint statistical significance of all explanatory variables (other than the
constant). It has two degrees-of-freedom parameters, namely 4 and 95. The 4 is the number of zero
restrictions we place on the model, that is, that the slope coefficients are all zero. The 95 refers
to the sample size of 100 minus the 5 parameters estimated in the model – 4 slope coefficients and
1 constant. The p-value is the probability of obtaining an F-statistic as large as we did if the slope
coefficients were in fact all zero. We clearly reject this hypothesis.

In conclusion: is the model acceptable?

Almost. While most of the variables are statistically significant, x4 is not. In order to get a model
as parsimonious as possible, we should re-estimate the model without x4. If all the diagnostic
tests remain acceptable, that would be an improved model for forecasting.

This video showed you how to evaluate regression output. In the next section we
will be looking at the generalized linear model.

References

Fumio Hayashi. Econometrics. Princeton, NJ: Princeton University Press, 2000.

David F. Hendry and Bent Nielsen. Econometric Modeling: A Likelihood Approach. Princeton, NJ:
Princeton University Press, 2007.
