
Basic Programming in R

Kenneth R. Szulczyk

Basic Programming in R
Copyright 2015 by Kenneth R. Szulczyk
All rights reserved

Edition 1, February 2015

Table of Contents

The Basics
Handling, manipulating, and creating data
    Creating Variables
    Importing a spreadsheet saved as a CSV file
    Importing an Excel Spreadsheet
    Directory
Basic Statistics
Linear Regression
Time Series Analysis
Writing Programs
    The Car and Sunspot Program
    Simple Program
    Amortization Table
Matrices and Vectors
    Least Squares
    Eigenvalues and eigenvectors
    Choleski Factorization
Appendix: The Programs
    The Car Analysis
    The Periodogram
    Fitting an ARIMA to the Sunspot Data
    Amortization Table

The Basics
I am assuming the readers are familiar with basic statistics and linear algebra. I do not teach
you any profound empirical techniques. Instead, I give you a comprehensive overview of
programming in R. After finishing this book, you should be able to use R and tailor it for your
own use.
R is open-source math and statistics software. Researchers can download and use the
software for free. You can download the software from:
http://cran.r-project.org/
Researchers can use R in two ways: enter commands directly into the console, or write a
program and run it in the console. I show the console below:

I have heard someone created a graphical user interface for R, where users execute
commands via pull-down menus, but I have not found it yet.
Using the console, enter the command, 2 + 2. The greater-than sign indicates R is waiting for a
command. Any text, commands, and equations in red indicate commands one can enter
directly into R, while blue indicates the output.
2 + 2
4

R calculates the answer. R uses matrices and vectors, and [1] means the answer is a vector of
dimension 1 (or simply a scalar). The brackets, [ ], mark an element and always indicate the
index.
[1] 4

Note: R remembers all the variables and subroutines created in the console. Once I finish a
program that seems to work, I close R and re-open it to wipe its memory. Then I check if the
program still works.
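If you prefer not to restart R, you can also wipe its memory from the console; a minimal sketch using base R:
rm(list = ls())   # Remove every object from the current workspace
ls()              # Confirm nothing is left: character(0)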

Type in the command:


license()
The output:
This software is distributed under the terms of the GNU General
Public License, either Version 2, June 1991 or Version 3, June 2007.
The terms of version 2 of the license are in a file called COPYING
which you should have received with
this software and which can be displayed by RShowDoc("COPYING").
Version 3 of the license can be displayed by RShowDoc("GPL-3").
Copies of both versions 2 and 3 of the license can be found
at http://www.R-project.org/Licenses/.
A small number of files (the API header files listed in
R_DOC_DIR/COPYRIGHTS) are distributed under the
LESSER GNU GENERAL PUBLIC LICENSE, version 2.1 or later.
This can be displayed by RShowDoc("LGPL-2.1"),
or obtained at the URI given.
Version 3 of the license can be displayed by RShowDoc("LGPL-3").
'Share and Enjoy.'

Handling, manipulating, and creating data


Creating Variables
Create a trend variable starting at 1 and ending at 100. We write the assignment operator as <- or =.
You can also store or retrieve numbers from a matrix or vector by using the index number.
trend <- 1:100
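The colon operator is shorthand for a regular sequence. As an equivalent sketch using the base seq() function, which the book does not use:
trend <- seq(from = 1, to = 100, by = 1)   # Same vector as 1:100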
You can print it out by typing:
trend
The output:
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100

I will create 100 observations of white noise with a mean of zero and standard deviation (sd) of
1.
noise <- rnorm(n=100, mean=0, sd=1)
I want to see a plot of the white noise. The main argument writes the chart's title, while xlab and ylab
define the labels for the axes. The argument pch=20 defines the dot on the graph; pch stands for
plot character.
plot(trend, noise, pch=20, main="White Noise", xlab="Trend",
ylab="Noise")
You can right-click on the graph, copy it as a bitmap, and paste it into a Word document.
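You can also save the plot to a file directly from code; a small sketch using the base png() device (the file name is my own choice):
png("white_noise.png", width = 600, height = 400)   # Open a PNG graphics device
plot(trend, noise, pch=20, main="White Noise", xlab="Trend", ylab="Noise")
dev.off()                                           # Close the device so the file is written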

Refer to the chart below for the main codes:

Let's say you only want the first 50 observations of noise. Did you notice the command? I
create a new vector called noise.2 and copy the first 50 observations from the noise vector.
noise.2 <- noise[1:50]
noise.2
 [1] -1.226444898 -0.689033125  1.940228241  0.187140317 -0.083116309
 [6] -0.163007504  0.466372338  0.813717075  0.077994673  0.141952803
[11] -0.417129484  0.247227624  1.019927832  0.991367337 -1.597887670
[16] -0.008955807 -0.229873035  0.079801223  0.135923409  0.883680805
[21] -1.375899489 -1.709316587  0.104933419  0.403238585 -0.630627316
[26]  0.133480828 -1.049068429  1.539173832 -0.638730541 -1.541867808
[31] -1.973639450  1.151270347 -1.022709096  1.336546494  0.422457580
[36]  0.963737604  0.466905061 -1.503421731  1.956695046  0.172492546
[41] -0.374648161  0.612023483 -0.469123654 -0.531325273  1.402142448
[46]  1.749785524 -0.650935475 -0.613352014  1.881250587 -0.290325872

Importing a spreadsheet saved as a CSV file


A user can easily import a spreadsheet into R, but you must save it in a comma-delimited
format, or CSV. The first row contains the headings, while the data falls below the headings.
Headings can be upper and/or lower case. You can also use periods in the names but no other
special characters. The spreadsheet should fit the format below:

We will import the spreadsheet and name it dataset.


dataset <- read.csv("car_data_98.csv", header = TRUE)
Type in the variable name, vector, or matrix by itself to see what it looks like.
dataset
The partial output:
  Observations       Model  Make MPG       Class
1            1 2.5TL/3.2TL Acura  25     Compact
2            2 2.5TL/3.2TL Acura  24     Compact
3            3       3.5RL Acura  25    Mid-Size
4            4 2.3CL/3.0CL Acura  28 Sub-Compact

Note: You must be consistent when using upper and lower case letters. For example, R views
the variable MPG, mpg, and Mpg as three different variables. If you wrote a program and
created a variable MPG, and further in the code you called the variable mpg, R will create a
new variable called mpg. You must be consistent with your names and labels.

I will create the variables I want to use. Remember, I used capital letters for miles per gallon
(MPG) in the original dataset. The dataset is an object and the $ allows a user to access specific
information from this object. In our case, the $ refers to the variable as a subset of the dataset.
mpg       <- dataset$MPG
eng.size  <- dataset$Engine
cylinders <- dataset$Cylinders

I want to create a dummy variable for the transmission. The variable trans equals one for an
automatic transmission and zero for a manual transmission. A single equal sign sets a variable
equal to a value, while a double equal sign makes a comparison.
trans <- as.numeric(dataset$Transmission == "L" )
Take a look at trans:
trans
The partial output:
  [1] 1 1 1 1 1 0 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 0 1 1 0 1 0 1 0 1 1 1 1 0 1

Similarly, I create a separate dummy variable for compact cars. Note, all categories for
Compact must be spelled the same with matching upper and lower case letters.
compact <- as.numeric(dataset$Class == "Compact" )
I can easily create a variable for only engine sizes greater than 2 liters. The first command
returns a 1 if the engine size exceeds 2 and a 0 if false. Since eng.size is a vector, the large.eng
will also be a vector.
large.eng <- as.numeric(eng.size > 2 )

Then I multiply large.eng by eng.size to get a vector of large engine sizes. The variable will only have
engine sizes greater than 2; smaller engines are transformed into zeros. The asterisk, *, means
multiply. For vectors, R multiplies the first element of one vector by the first element of the
second vector, then the second elements, the third elements, and so on.
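As a toy illustration of element-wise multiplication (these numbers are not from the car data):
a <- c(1, 2, 3)
b <- c(10, 20, 30)
a * b    # Returns 10 40 90, each element multiplied by its partner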
large.eng <- large.eng * eng.size
The partial output:
 [1] 2.5 3.2 3.5 3.0 0.0 0.0 0.0 3.0 3.2 0.0 0.0 2.8 2.8 0.0 0.0 2.8 2.8 2.8
[19] 2.8 3.7 4.2 2.8 2.8 2.8 2.8 0.0 0.0 2.8 2.8 4.4 4.4 4.4 5.4 4.4 2.5 2.5
[37] 2.8 2.8 0.0 0.0 2.5 2.5 2.8 2.8 3.2 3.2 0.0 0.0 2.8 2.8 2.4 3.1 3.8 3.8
[55] 3.8 3.1 3.8 3.8 3.8 4.6 3.0 4.6 4.6 0.0 0.0 0.0 3.1 3.8 2.4 3.1 3.8 3.8

Importing an Excel Spreadsheet


We need to install the package RODBC, which allows R to read a spreadsheet saved in the .xls
format. You can select Packages on the menu, select Install packages, and then search for this
package.
You can also write commands in R to install a package.
install.packages("RODBC")
Just in case we do not have access to the internet, we can also install a package stored as a
computer file.
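A minimal sketch of an offline install, assuming you already downloaded the package file (the file name below is hypothetical):
# repos = NULL tells R to install from a local file instead of a repository
install.packages("RODBC_1.3-12.zip", repos = NULL, type = "win.binary")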
Then we must load the package with the library command below; otherwise, R will not
recognize its commands.
library(RODBC)
Read the spreadsheet into a new dataset called new.data. Please note, I named Sheet 1 as
car_data_98. The first command opens a connection to the spreadsheet. The second command
reads the car_data_98 sheet into new.data. Finally, we close the connection.
z <-odbcConnectExcel("car_data_98.xls")
new.data <-sqlFetch(z, "car_data_98")
close(z)

Directory
R sets the Document folder as the default in Windows 7. Please, copy programs and
spreadsheets to the document folder, so R can access them there.

If you are not sure which folders sit in R's working directory, then run this program.
list.dirs <- function(path=".", pattern=NULL, all.dirs=FALSE,
full.names=FALSE, ignore.case=FALSE) {
all <- list.files(path, pattern, all.dirs,
full.names, recursive=FALSE, ignore.case)
all[file.info(all)$isdir]
}
This creates a subroutine that reads the folders in the current directory. Then run the following
to read the folders.
dir()[file.info(dir())$isdir]
If you need to change the directory, then type this command to change the working directory.
Unfortunately, if you close R, you have to enter this command again to set the directory.
setwd("C:/Users/kenneth/Documents/")


Basic Statistics
I can use many commands to get descriptive statistics. Remember, this dataset contains many
categorical variables.
summary(dataset)
The partial output:
  Observations          Model             Make          MPG
 Min.   :  1.00   Passat  :  6   BMW       : 25   Min.   :13.00
 1st Qu.: 94.75   Accord  :  5   Mitsubishi: 24   1st Qu.:26.00
 Median :188.50   Cavalier:  5   Volkswagen: 22   Median :29.00
 Mean   :188.50   Mustang :  5   Chevrolet : 21   Mean   :29.48
 3rd Qu.:282.25   S70     :  5   Ford      : 20   3rd Qu.:32.00
 Max.   :376.00   Sunfire :  5   Toyota    : 18   Max.   :50.00
                  (Other) :345   (Other)   :246

Remember, we created new variables. If I want the summary statistics for the variables I
created, then I use cbind to combine the variables into a matrix. cbind takes the vectors and
binds them together as the columns of a matrix; the c refers to the columns. The matrix x will
have three columns. We also have the command rbind, which combines rows.
x <- cbind(mpg, eng.size, cylinders)
Then I use the summary to get the descriptive statistics.
summary(x)
The output:
      mpg           eng.size       cylinders
 Min.   :13.00   Min.   :1.000   Min.   : 3.000
 1st Qu.:26.00   1st Qu.:2.000   1st Qu.: 4.000
 Median :29.00   Median :2.400   Median : 4.000
 Mean   :29.48   Mean   :2.645   Mean   : 5.191
 3rd Qu.:32.00   3rd Qu.:3.000   3rd Qu.: 6.000
 Max.   :50.00   Max.   :6.000   Max.   :12.000

I want to calculate a box plot with the variables. R lays out the boxplot horizontally and plots
the axes. I named the title, Boxplot of the Data.
boxplot(x, horizontal=TRUE, axes=TRUE, main="Boxplot of the Data")


I want to calculate the correlations on my data. I redefine my x matrix to add more variables.
x <- cbind(mpg, compact, trans, eng.size, cylinders)
x refers to the matrix of observations. The cor defines the command for correlations while
use determines what I should do with missing observations. With this option, the program
will drop a pair if an observation has missing data. Finally, the method determines which
correlation to use - pearson, spearman or kendall.
corr.x <- cor(x, use="pairwise.complete.obs", method="kendall")
corr.x
The output:
                 mpg     compact       trans   eng.size  cylinders
mpg        1.0000000  0.16313790 -0.25502462 -0.5787260 -0.6199851
compact    0.1631379  1.00000000 -0.03024894 -0.1447983 -0.1352761
trans     -0.2550246 -0.03024894  1.00000000  0.2306456  0.2411497
eng.size  -0.5787260 -0.14479830  0.23064557  1.0000000  0.7877670
cylinders -0.6199851 -0.13527609  0.24114966  0.7877670  1.0000000


I will calculate something a little more complicated - canonical correlation. I create two
matrices x and y.
x <- cbind(mpg, compact)
y <- cbind(trans, eng.size, cylinders)
We use the command, cancor(x, y) to calculate the canonical correlation. I store the information
into an object called, cxy.
cxy <- cancor(x, y)
Print out the output
cxy
The output
$cor
[1] 0.6874188 0.1056315

$xcoef
               [,1]         [,2]
mpg     -0.01040752  0.002188889
compact -0.00724475 -0.120616344

$ycoef
                [,1]        [,2]        [,3]
trans     0.02686013 -0.04290648 -0.09737402
eng.size  0.02195106  0.12405721 -0.03980611
cylinders 0.01754001 -0.06947293  0.03930493

$xcenter
       mpg    compact
29.4760638  0.2473404

$ycenter
    trans  eng.size cylinders
0.6276596 2.6452128 5.1914894
Did you notice the $ signs? Each one names a component of the object cxy, and I can pull those
values out. I want to use the second correlation in a calculation. The 2 refers to the second
number under cor.
correlation.2 <- cxy$cor[2]
correlation.2
[1] 0.1056315

Similarly, I need the third column from the $ycoef. We can index any matrix by using [row,
column]. The row is blank, so R will copy all the rows. The 3 indicates we only want the third
column.
vector <- cxy$ycoef[,3]
vector
The output
      trans    eng.size   cylinders
-0.09737402 -0.03980611  0.03930493

Instead, I want the number in the second row, first column. I store the number under element.
element <- cxy$ycoef[2,1]
element
eng.size
0.02195106


Linear Regression
With linear regression, we estimate a dependent variable, yt, with one or more explanatory
variables. Refer to the equation below:
y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \cdots + \beta_k x_{k,i} + \varepsilon_i
We define the variables as:
• i represents one observation. If we have time series data, then we switch i to t.
• The dependent variable, y_i
  o We try to explain or predict y_i based on the x variables
• The independent variables, x_{j,i}
  o We assume these variables are fixed and constant
• \varepsilon_i represents the white noise process, assumed to be normal with a mean of zero and constant
variance.
• We need to estimate the parameters
  o The intercept, \beta_0
  o The slopes, \beta_1, \beta_2, up to \beta_k
I have data for 376 cars in 1998, or 376 observations. I believe the following relationship:
• A car's petrol consumption depends on the explanatory variables.
  o y_i is measured in miles per gallon, or mpg.
• The explanatory variables
  o Compact cars should use less petrol than regular cars
    - Dummy variable
    - One if the car is compact, zero otherwise.
  o Cars with automatic transmissions use more petrol than sticks.
    - Dummy variable
  o Larger engines use more petrol than smaller engines.
  o An engine with more cylinders uses more petrol.

mpg_i = \beta_0 + \beta_1 compact_i + \beta_2 trans_i + \beta_3 eng.size_i + \beta_4 cylinders_i + \varepsilon_i
In R, we use lm as the command for linear regression. I store all the results under the object fit.
Fit is not a variable; it is an object containing many pieces of information. In this case, the
data=dataset argument is redundant. I could drop it because I already created the variables as vectors in R.
fit <- lm(mpg ~ compact + trans + eng.size + cylinders, data=dataset)
I could type in fit and R will only show the coefficients. If I want to see the statistics, then I
type the command summary:
summary(fit)

Call:
lm(formula = mpg ~ compact + trans + eng.size + cylinders, data = dataset)

Residuals:
    Min      1Q  Median      3Q     Max
-6.9325 -2.2384 -0.2697  1.7869 17.1587

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  40.1067     0.6928  57.892  < 2e-16 ***
compact       0.5901     0.4347   1.357  0.17547
trans        -1.7810     0.3930  -4.532 7.90e-06 ***
eng.size     -1.2785     0.4767  -2.682  0.00764 **
cylinders    -1.2091     0.2932  -4.124 4.59e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.581 on 371 degrees of freedom
Multiple R-squared:  0.4735,    Adjusted R-squared:  0.4678
F-statistic: 83.41 on 4 and 371 DF,  p-value: < 2.2e-16
Researchers study the residuals. The residuals represent the errors, and we assume they are
normally distributed. The object, fit, contains the objects from the linear regression. I pull out
the residuals and store them in a vector called resid.
resid <- residuals(fit)
I could also use the following command:
resid <- fit$residuals
I want to see what the residuals look like, so I plot the residuals versus mpg.
plot(mpg, resid, pch=20, main="Residuals", xlab="MPG",
ylab="Residuals")
Do the residuals look random?


We can check whether the residuals are normally distributed. I will extract the standardized
residuals from the object, fit. Standardized means the program will subtract the average and
divide by the standard deviation. We use the command below:
resid.standard <- rstandard(fit)
The command, fit$rstandard, does not work in this case.
I create a QQ Plot by using the command, qqnorm. Then I add labels and a title to make it look
nice.
qqnorm(resid.standard, ylab="Standardized Residuals", xlab="MPG",
main="Gas Consumption")
Then I add the line.
qqline(resid.standard)
If the data is normally distributed, the points should fall on the line.
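As a complement to the QQ plot (my own addition, not part of the book's workflow), the Shapiro-Wilk test gives a formal check of normality on the same standardized residuals:
# A small p-value suggests the residuals are not normally distributed
shapiro.test(resid.standard)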


My data has outliers, so I can estimate a median regression, or quantile regression. If you
remember your statistics, an outlier represents an extreme point or observation, the exception
to the rule. When you calculate an average, outliers will pull it away from the true average.
The median is another type of average: it is the value in the middle, and it is not sensitive to
outliers.
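A toy example of this difference (numbers of my own choosing):
x <- c(10, 11, 12, 13, 100)   # One extreme observation
mean(x)                       # 29.2 - the outlier drags the mean upward
median(x)                     # 12  - the middle value ignores the outlier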
For quantile regression, I need to install the package, quantreg.
install.packages("quantreg")
You can also install it using the Package menu. R should also install SparseM because quantreg
relies on it for its calculations.

Note: You are assuming two things when you download and use someone else's package.
1. You assume the code works correctly and calculates what it is supposed to calculate.
2. You enter the correct parameters into the function when you use it.

You must load the library before you use it.


library("quantreg")
I estimate the median regression via the command, rq. I only include the engine size because I
want to graph it.

fit.median <- rq(mpg ~ eng.size, data=dataset)


Let's see the results.
summary(fit.median)
Call: rq(formula = mpg ~ compact + trans + eng.size + cylinders, data = dataset)

tau: [1] 0.5

Coefficients:
            coefficients lower bd  upper bd
(Intercept) 37.00000     37.00000  38.79854
compact      1.00000     -0.21725   1.00000
trans       -1.00000     -1.17795  -0.30138
eng.size     0.00000     -1.12246   0.00000
cylinders   -1.50000     -1.50000  -0.99398

Warning message:
In rq.fit.br(x, y, tau = tau, ci = TRUE, ...) : Solution may be nonunique

I must re-estimate the regular regression because I want to compare linear regression and
quantile regression.
fit <- lm(mpg ~ eng.size, data=dataset)
I want to see what our estimates look like. I plot the data.
plot(mpg ~ eng.size, data=dataset)
abline(fit, lty="dashed", col="blue")
abline(fit.median)
abline draws a straight line with intercept a and slope b; when given a fitted model, it uses the
model's intercept and slope. The argument lty specifies the line type. They wrote dashed, but its
numerical code equals 2. Then we add a legend. The argument col refers to the color, and
bty="n" removes the border from the legend.
legend("topright", inset=0.05, bty="n",
legend = c("Least Squares Fit", "Median Fit"),
lty = c(2, 1),
col = c("blue", "black")
)
The c() means combine the elements and is not the same as cbind. The line type, lty, equals 2
for a dashed line and 1 for a solid line. The bty argument removes the box around the legend.

Note: The c() command lets us cheat in R. Many commands in R only allow the user to input
one argument or variable. Thus, we can use c() to combine many variables or arguments into
one element. Then we can enter the combined element as one command into an R function.


Time Series Analysis


I downloaded the dataset from the National Aeronautics and Space Administration (NASA) at
http://solarscience.msfc.nasa.gov/SunspotCycle.shtml. We read the data in, which I named
sundata.
sundata = read.csv("spot_num.csv", header = TRUE)
I created two vectors: sunspots and year.
sunspots <- sundata$Sunspots
year <- sundata$Year
Follow good practice and plot your data to see what it looks like. So I created a time-series plot.
plot(year, sunspots, xlab="Year", ylab="Sunspots", main="Activity of
the Sun")

Next, we calculate the autocorrelation and partial autocorrelation plots to guide which ARIMA
model we should estimate. Of course, I create a new plot window. Otherwise, R will copy over
my first time series plot.
win.graph(width = 6, height = 6, pointsize = 10)


We can use the par(mfrow) command to combine multiple plots onto the same graph. The
c(2,1) refers to two rows and one column, or c(rows, columns). Remember, the c() means
combine the elements, and it differs from the command, cbind().
par(mfrow=c(2,1))
acf(sunspots, 30, main="Sunspots")
pacf(sunspots, 30, main="Sunspots")

I fit an AutoRegressive Integrated Moving Average (ARIMA) model to the data. The ARIMA is
difficult to explain, so let's assume our data does not have the Integrated part. That leaves an
ARMA, which is defined below:

x_t = c + \phi_1 x_{t-1} + \phi_2 x_{t-2} + \cdots + \phi_p x_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}

We define the variables as:
• The time series in the current period, x_t
• x_{t-1} is the time series in the previous time period, t-1; x_{t-2}, and so on.
  o Thus, x_t depends on previous values of itself
• \varepsilon_t represents the white noise process, assumed to be normal with a mean of zero and constant
variance.
• \varepsilon_{t-1} is the white noise in the previous time period; \varepsilon_{t-2}, and so on.
  o The noise from previous periods can influence x_t.
  o Since we are in period t, we know what the noise was in previous periods, which is
why we can estimate the \theta parameters.
• We need to estimate the parameters
  o The intercept, c
  o \phi_i are the coefficients for the autoregression and must lie between -1 and 1.
  o \theta_i are the coefficients for the moving average and must lie between -1 and 1.
  o Otherwise, the equation becomes unstable if the magnitude of \phi_i or \theta_i exceeds
one.
According to the ACF plot, the autocorrelations tail off. Thus, we may have one moving average
term. The partial ACF plot also tails off. Hence, we may have one autoregressive term. We will
estimate the ARIMA:

x_t = c + \phi_1 x_{t-1} + \varepsilon_t + \theta_1 \varepsilon_{t-1}

We set up the estimation below. The c(1,0,1) means c(# of autoregressive terms, # of differences
for the integrated part, # of moving average terms).
fit.arima <- arima(sunspots, order=c(1,0,1))
Summary does not work for ARIMA, so just enter fit.arima to get the coefficients and standard
errors.
Call:
arima(x = sunspots, order = c(1, 0, 1))

Coefficients:
         ar1      ma1  intercept
      0.9787  -0.4522    52.0854
s.e.  0.0039   0.0191     7.1406

sigma^2 estimated as 254:  log likelihood = -13359.47,  aic = 26726.93

If I need the residuals, I can use the command:


residuals <- residuals(fit.arima)
I believe the ARIMA has two autoregressive terms, so I estimate:
fit.arima <- arima(sunspots, order=c(2,0,1))
Call:
arima(x = sunspots, order = c(2, 0, 1))

Coefficients:
         ar1      ar2      ma1  intercept
      1.1903  -0.2037  -0.6154    52.0989
s.e.  0.0323   0.0313   0.0250     7.9316

sigma^2 estimated as 251:  log likelihood = -13340.48,  aic = 26690.95

Scientists at NASA claim the number of sunspots has an 11-year cycle. Unfortunately, R will
not let me estimate a seasonal ARIMA(1,0,0) with an 11-year seasonal component because that
means the data has a 132-month cycle. However, I found an annual cycle, and we can estimate
an ARIMA(1,0,1) with a seasonal ARIMA(1,0,0).
fit.arima.season <- arima(sunspots, order=c(1,0,1), seasonal =
list(order = c(1, 0, 0), period = 12))
Call:
arima(x = sunspots, order = c(1, 0, 1), seasonal = list(order = c(1, 0, 0),
    period = 12))

Coefficients:
         ar1      ma1    sar1  intercept
      0.9777  -0.4585  0.0446    52.0724
s.e.  0.0040   0.0194  0.0182     7.0849

sigma^2 estimated as 253.5:  log likelihood = -13356.46,  aic = 26722.92

How did I know about the frequencies? I wrote a program in R that calculates the
periodogram from the residuals. A periodogram transforms a variable from the time domain
to a frequency domain. Just run the program to see the output.
source("periodogram.R")


Writing Programs
The Car and Sunspot Program
R uses a scripting programming language. We can write programs in text files using Notepad.
However, we save the file with the extension .R and not .txt.
We did many calculations for the car data. I organized everything into a program, which is
available in the Appendix. I added comments using the # mark and printed my results using
the print function. Everything else is the same.
Look at the code. I wanted a blank line to separate the output, so I placed the command,
cat("\n"). Cat stands for concatenate, while \n is a newline. I named the file car.R, and I can
run it by:
source("car.R")
Similarly, I wrote an R program to analyze the sunspot data. To run the program, type in:
source("arima.R")

Note: The R scripting language allows users to write sloppy code. Imagine you return a year
later and try to figure out what you wrote. Thus, these rules come in handy.
1. Get into the habit of reviewing your code and simplifying it.
2. Use # to include comments in your code.
3. Print out your variables and calculations to verify them.

Simple Program
Programmers use a loop to repeat a process. R allows a programmer three methods to
construct a loop, but I show only one here. The body of a loop starts and ends with a curly
bracket. In the loop below, the index starts at 1 and ends at 100.
Here is a quirk with R: print can only print one item at a time. So I create an x variable that
combines my quote and the index number by using c().
for(i in 1:100) {
x <- c("Goodbye cold, cruel world",i)
print(x)
}
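For completeness, a brief sketch of the other two loop constructs mentioned above, while and repeat; the book itself only uses for loops.
i <- 1
while(i <= 3) {     # Repeats as long as the condition is TRUE
    print(i)
    i <- i + 1
}

i <- 1
repeat {            # Runs until an explicit break
    print(i)
    i <- i + 1
    if(i > 3) break
}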

Amortization Table
I used R to calculate an amortization table. I created a subroutine to calculate the monthly
payment. We would not do this in real practice because the program only calls the subroutine
once. We normally use subroutines to compute repeated calculations.
The subroutine comes first in your program. In the program, I utilize the subroutine to
calculate the monthly payment. My variable is payment and I named the subroutine
loan.payment. I pass the variables principal, interest, years, and number into the subroutine.

• The principal is the loan amount.
• Interest is the annual interest rate.
• Years is the number of years for the loan.
• Number is the number of payments per year.

payment <- loan.payment(principal, interest, years, number)


The program calls the subroutine, which I define at the top of the program. The function
keyword defines the subroutine. Every function body must start and end with curly brackets.
loan.payment <- function(principal, interest, years, number) {
I define a new variable called rate, which is the interest rate per payment. I just divided the
annual interest rate by the number of payments per year. The code is:
rate <- interest/number
I used the formula below to calculate the monthly payment for a mortgage loan.
payment = \frac{principal \times rate}{1 - (1 + rate)^{-years \times number}}

I calculate the payment. I used plenty of parentheses to guarantee the formula calculates the
periodic payment correctly.
payment <- principal*rate / (1 - ((1+rate)^(years*number*(-1))))
Once the subroutine has finished, it returns the payment to the main program. Variables created
inside a subroutine are local to it, so R forgets them once the subroutine returns. Also, don't
forget the closing curly bracket.
return(payment)
}
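As a quick check of the subroutine (my own test, using the loan defined later in this section), the call below returns the payment that appears in the amortization table:
loan.payment(150000, 0.08, 30, 12)   # Returns roughly 1100.647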


The main program begins executing. I define the parameters of the loan. The loan is $150,000.
The bank charges an 8% annual interest rate. The borrower makes 12 payments per year and
repays the loan in 30 years. Below, I define the parameters.
principal <- 150000
interest  <- 0.08
number    <- 12
years     <- 30

I need to calculate the total number of payments, which equals the number of payments per
year times the number of years.
n <- number*years

I create a new variable called balance. For each periodic payment, the borrower pays the
interest and the remainder of the payment reduces the loan balance.
balance <- principal

Here is where I call the subroutine to calculate the loan payment.


payment <- loan.payment(principal, interest, years, number)
I create a matrix to hold the amortization table. The first column holds the payment number;
the second column contains the accrued interest on the loan. The third has the payment, which
is the same for all periods. Then the program stores the balance in the last column.
table.amort <- matrix(0, n, 4)
colnames(table.amort) <- c("Payment","Interest","Payment","Balance")
I start a loop so I can calculate the n payments. R allows a programmer to choose from three
different methods to create a loop.
for(i in 1:n) {
I calculate the accrued interest for the current payment.
interest.payment <- balance*interest/number
I calculate the new balance after the borrower has made a payment. I take the previous
balance, subtract the payment and add the accrued interest.
balance <- balance - payment + interest.payment


I record this information into the matrix by indexing specific elements. The loop chooses the
row, i, while the second index determines the column. Please do not forget the closing
bracket, }, that comes at the loop's end.
table.amort[i,1] <- i
table.amort[i,2] <- interest.payment
table.amort[i,3] <- payment
table.amort[i,4] <- balance
}

R tends to write numbers in scientific notation, so I used the command below to get rid of
scientific notation.
options(scipen=999)
If you run a program, you must use the print command to display values. Otherwise, R will
not show the values.
print(table.amort)
I show a sample of the output below:
     Payment    Interest  Payment                Balance
[1,]       1 1000.000000 1100.647 149899.35313918092288
[2,]       2  999.329021 1100.647 149798.03529928970966
[3,]       3  998.653569 1100.647 149696.04200713254977


Matrices and Vectors


Least Squares
R has several annoyances when dealing with vectors and matrices. Technically a column
vector with n rows equals a matrix of dimension n X 1. However, R treats the indexing
differently for vectors and matrices. R indexes a vector by [row] and a matrix by [row,
column].
We already created the vector, mpg. Type it in and show it. The number in the [] identifies the
row in the vector.
mpg
The output:
[1] 25 24 25 28 31 31 31 24 24 31 32 29 29 27 29 27 27 28 26 26 25 24
[26] 31 32 26 28 24 24 24 20 24 27 30 26 28 31 32 27 30 26 28 28 28
Let's say I want the value in the 26th row; then I type in:
mpg[26]
The output:
[1] 31
I want to calculate linear regression using matrices. In matrix notation, linear regression
equals:

Y = X\beta + \varepsilon

The Y is a vector with dimensions n by 1. The n equals the number of observations or rows.
The X matrix has dimensions n by k, where k equals the number of beta parameters. It also
includes a column of ones for the intercept. The \beta vector is k by 1. Finally, \varepsilon
represents the white noise, assumed to be normally distributed with a mean of zero and a
constant variance. The \varepsilon vector has n rows and one column.
First, we must create a column of ones for the intercept. We create a variable n for the number
of observations. The command length returns the number of rows in the vector.
n <- length(mpg)


I create the intercept as a vector with dimensions n X 1.


intercept <- numeric(n)
On this step, you must be careful. I fill the vector with ones, starting at the first row and ending
at the last row, n. Did you notice the index [1:n]? R recycles the 1 and fills every element of the
vector with ones.
intercept[1:n] <- 1
Now, I can create my X matrix of explanatory variables and my Y vector.
x <- cbind(intercept, compact, trans, eng.size, cylinders)
y <- mpg
If we were to solve for \beta, the solution lies below. The hat indicates we have estimated the
parameters.

\hat{\beta} = (X^T X)^{-1} X^T Y

This equation introduces two complexities. We have a transpose function, indicated by T, and
an inverse function, denoted by -1. A transpose switches the rows and columns in a matrix.
We must calculate the solution to linear regression piece by piece. The t( ) performs the
transpose while the %*% performs matrix multiplication.
x.x <- t(x) %*% x
x.y <- t(x) %*% y
I have two methods to calculate the inverse. I could compute the inverse by using the
command below.
betas <- solve(x.x) %*% x.y
I could also use the command below. This command solves a linear system of equations, such
as V b = Z. Then the solution is b = V^{-1} Z. In this case, we write the command as solve(V, Z)
to solve for the b vector. For linear regression, we can write it as:
betas <- solve(x.x, x.y)
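As a sanity check that is not in the original text, the betas should match the coefficients lm() produced in the linear regression chapter:
# coef() extracts the estimated coefficients from a fitted lm object
coef(lm(mpg ~ compact + trans + eng.size + cylinders))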
Let's look at the beta matrix:
betas
                 [,1]
intercept  40.1067285
compact     0.5901217
trans      -1.7809514
eng.size   -1.2784526
cylinders  -1.2090972
Did you notice? This should be a vector, but R defines it as a matrix. The 1 identifies the first
column.
I want to predict the y values while using the x and estimated \beta. The hats mean I estimated the
parameters from the data. We have seen this before; remember Y = X\beta + \varepsilon. The random
noise is missing, and I added some hats.

\hat{Y} = X\hat{\beta}

predict <- x %*% betas
I want to calculate the residuals, or the errors, so I took this equation and solved for \varepsilon.

\hat{\varepsilon} = Y - X\hat{\beta}

residuals <- y - predict
I need the number of explanatory variables including the intercept.
k <- length(betas)
I calculate the variance of least squares. I take each error, or residual, square it, sum over all
errors, and divide by n - k.

\hat{\sigma}^2 = \frac{1}{n - k}(Y - X\hat{\beta})^T (Y - X\hat{\beta})

sigma <- (1/(n - k)) * t(residuals) %*% residuals

sigma
[,1]
[1,] 12.823
I solve for the covariance matrix of the estimated coefficients. This matrix contains the standard
errors. The equation is below:

\widehat{Cov}(\hat{\beta}) = \hat{\sigma}^2 (X^T X)^{-1}


R has another quirk. The sigma is a matrix with dimensions 1 x 1. Thus, we have a single
number, or scalar. However, R will not multiply the 1 x 1 matrix element-wise by the 5 x 5
matrix. So we must use the command, as.numeric, to convert the matrix into a number, or scalar.
cov <- as.numeric(sigma)*solve(x.x)
cov
            intercept     compact        trans    eng.size    cylinders
intercept  0.47995567 -0.07416303 -0.042483365  0.05716483 -0.106338746
compact   -0.07416303  0.18899524 -0.001004130  0.02518794 -0.007431460
trans     -0.04248337 -0.00100413  0.154447230 -0.01112530 -0.004773151
eng.size   0.05716483  0.02518794 -0.011125304  0.22720238 -0.126632363
cylinders -0.10633875 -0.00743146 -0.004773151 -0.12663236  0.085937247
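As a follow-up sketch (my own addition), the standard errors reported by lm() are the square roots of the diagonal of this covariance matrix:
std.errors <- sqrt(diag(cov))   # diag() pulls out the diagonal elements
std.errors                      # Matches the Std. Error column from summary(fit)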

Eigenvalues and eigenvectors


Eigenvalues and eigenvectors come up often in engineering and statistics. They reflect
characteristics of a square matrix A. If I have a square matrix, A, can I multiply the matrix by a
vector, v, so that it equals a scalar, \lambda, times the same vector v? We call \lambda the
eigenvalue and v the eigenvector. Eigenvalues are vital because I can scale a vector up or down.

A v = \lambda v

Many times, we write it as:

A v - \lambda v = 0
First, let's create a square matrix, A = X^T X.
A <- t(x) %*% x
Calculate the eigenvalues and eigenvectors with the command below. Since A is a 5x5 matrix,
we can have five eigenvalues and five eigenvectors. I named the object, eigen.A.
eigen.A <- eigen(A)
The output:
$values
[1] 14469.13654    89.89684    78.94843    45.24267    23.43552

$vectors
            [,1]        [,2]       [,3]        [,4]       [,5]
[1,] -0.15460189  0.26034516 -0.1553690  0.25717147  0.9044567
[2,] -0.03567982  0.57873778 -0.6674799 -0.43788487 -0.1628402
[3,] -0.10275274  0.72261583  0.6754002 -0.02836281 -0.1014808
[4,] -0.45030242 -0.27372179  0.2311537 -0.77480083  0.2618316
[5,] -0.87263670 -0.01362799 -0.1439916  0.37549820 -0.2767435
I pull out the first eigenvalue and first eigenvector. I used as.numeric to convert the eigenvalue
into a scalar.
lambda <- as.numeric(eigen.A$values[1])
vector <- eigen.A$vectors[,1]
lambda
[1] 14469.14
vector
[1] -0.15460189 -0.03567982 -0.10275274 -0.45030242 -0.87263670
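A quick side check (not in the original text): eigen() returns eigenvectors scaled to unit length, which we can confirm:
sqrt(sum(vector^2))   # Returns 1, the length of the eigenvector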
Remember the equation; let's check that it holds.

A v - \lambda v = 0

A%*%vector-lambda*vector
                           [,1]
intercept -0.000000000003183231
compact   -0.000000000002728484
trans     -0.000000000003865352
eng.size  -0.000000000016370905
cylinders -0.000000000030922820

Technically, this vector should equal zero. Unfortunately, we are experiencing rounding
errors.

Choleski Factorization
We can easily calculate a Choleski factorization of A. The Choleski factorization is similar to a
matrix square root. In algebra, the square root r of a number a satisfies r \cdot r = a. For
matrices, can we find a square matrix R so that:

R^T R = A

The command is:

R <- chol(A)
          intercept  compact      trans  eng.size  cylinders
intercept  19.39072 4.796109 12.1707707 51.292579 100.666714
compact     0.00000 8.366441 -0.2835543 -3.060416  -3.801916
trans       0.00000 0.000000  9.3697352  4.190299   6.695009
eng.size    0.00000 0.000000  0.0000000 17.770949  26.186285
cylinders   0.00000 0.000000  0.0000000  0.000000  12.215298

Verify the Choleski factorization by multiplying R^T R. Remember, we must use matrix
multiplication, %*%.
t(R)%*%R
          intercept compact  trans eng.size cylinders
intercept     376.0    93.0  236.0   994.60    1952.0
compact        93.0    93.0   56.0   220.40     451.0
trans         236.0    56.0  236.0   664.40    1289.0
eng.size      994.6   220.4  664.4  2973.66    5668.5
cylinders    1952.0   451.0 1289.0  5668.50   11028.0
It does indeed equal the A matrix.


Appendix: The Programs


The Car Analysis
# Import the dataset
dataset <- read.csv("car_data_98.csv", header = TRUE)
# Create vectors for the data
mpg       <- dataset$MPG
eng.size  <- dataset$Engine
cylinders <- dataset$Cylinders
# Create dummy variables for transmission and compact cars
trans <- as.numeric(dataset$Transmission == "L" )
compact <- as.numeric(dataset$Class == "Compact" )
# Combines the variables into a matrix x
x <- cbind(mpg, eng.size, cylinders)
# Print out the summary statistics and boxplot
print(summary(x))
boxplot(x, horizontal=TRUE, axes=TRUE, main="Boxplot of the Data")
# Add all variables to the x matrix and print out the Kendall correlation
x <- cbind(mpg, compact, trans, eng.size, cylinders)
corr.x <- cor(x, use="pairwise.complete.obs", method="kendall")
cat("\n")
print(corr.x)
# Print out the canonical correlation
x <- cbind(mpg, compact)
y <- cbind(trans, eng.size, cylinders)
cxy <- cancor(x, y)
cat("\n")
print(cxy)
# Calculate linear regression


fit <- lm(mpg ~ compact + trans + eng.size + cylinders, data=dataset)
cat("\n")
print(summary(fit))
# Plot the residuals
resid <- residuals(fit)
win.graph(width = 6, height = 6, pointsize = 10)
plot(mpg, resid, pch=20, main="Residuals", xlab="MPG",
ylab="Residuals")
# QQ plot
resid.standard <- rstandard(fit)
win.graph(width = 6, height = 6, pointsize = 10)
qqnorm(resid.standard, ylab="Standardized Residuals",
xlab="MPG", main="Gas Consumption")
qqline(resid.standard)
# Quantile Regression
library("quantreg")
fit.median <- rq(mpg ~ eng.size, data=dataset)
cat("\n")
print(summary(fit.median))

# Compare quantile regression to linear regression


fit <- lm(mpg ~ eng.size, data=dataset)
win.graph(width = 6, height = 6, pointsize = 10)
plot(mpg ~ eng.size, data=dataset)
abline(fit, lty="dashed", col="blue")
abline(fit.median)
legend("topright", inset=0.05, bty="n",
legend = c("Least Squares Fit", "Median Fit"),
lty = c(2, 1),
col = c("blue", "black")
)


The Periodogram
############################################################
#
# This program calculates the periodogram on the residuals
# Ken Szulczyk
#
############################################################
# Reads the variable residuals from the workspace
n <- length(residuals)
response  <- matrix(0, n, 1)     # Holds the frequency response
frequency <- matrix(0, n, 1)     # Holds the x-values for the graph

for(i in 1:n) {
    # This loop calculates a particular frequency
    # We will look at only positive frequencies
    w <- i/(2*n)                 # Frequency value
    real <- matrix(0, n, 1)      # Set real vector to zero
    imag <- matrix(0, n, 1)      # Set imaginary vector to zero
    frequency[i,1] <- w          # Place frequency value in x matrix

    for(t in 1:n) {
        # This loop calculates the frequency response of the time series
        real[t,1] <- cos(w*t)    # Calculate the real part of the frequency
        imag[t,1] <- sin(w*t)    # Calculate the imaginary part of the frequency
    }

    # Calculate the Fourier transform of the residuals
    response[i,1] <- (2/n)*((t(real)%*%residuals)^2 + (t(imag)%*%residuals)^2)
}
# **********************************************************
plot(frequency, response, pch=20, main="Periodogram of the Residuals",
     xlab="Frequency", ylab="Response")


Fitting an ARIMA to the Sunspot Data


# Import the data into R
sundata = read.csv("spot_num.csv", header = TRUE)

# Create the vectors, or variables


sunspots <- sundata$Sunspots
year <- sundata$Year
# Plot the data
plot(year, sunspots, pch=20, xlab="Year", ylab="Sunspots",
main="Activity of the Sun")
# Open a new graphics window
win.graph(width = 6, height = 6, pointsize = 10)
# Create the ACF and PACF plots
par(mfrow=c(2,1))
acf(sunspots, 30, main="Sunspots")
pacf(sunspots, 30, main="Sunspots")
# Estimate an ARIMA with one AR and one MA term
fit.arima <- arima(sunspots, order=c(1,0,1))
# Output the results
print(fit.arima)
# Copy the residuals
residuals <- residuals(fit.arima)
# Estimate the ARIMA (1,0,1) with a seasonal component (1,0,0).
# The period is 12 months
fit.arima.season <- arima(sunspots, order=c(1,0,1), seasonal =
list(order = c(1, 0, 0), period = 12))
cat("\n")
print(fit.arima.season)


Amortization Table
####################################################
# Create a subroutine to calculate a loan payment
####################################################
loan.payment <- function(principal, interest, years, number) {
rate <- interest/number
payment <- principal*rate / (1 - ((1+rate)^(years*number*(-1))))
return(payment)
}
####################################################
# Main Program
####################################################
# Define the loan parameters. Principal is the amount of the loan.
# Interest is the annual interest rate. Number is the number of payments
# per year, and years is the total number of years for the loan.
principal <- 150000
interest  <- 0.08
number    <- 12
years     <- 30
# Number of payments
n <- number*years

balance <- principal

payment <- loan.payment(principal, interest, years, number)


# The matrix will hold the amortization table
# The colnames allow me to name the columns of the matrix
table.amort <- matrix(0, n, 4)
colnames(table.amort) <- c("Payment","Interest","Payment","Balance")

for(i in 1:n) {
    interest.payment <- balance*interest/number
    balance <- balance - payment + interest.payment
    table.amort[i,1] <- i
    table.amort[i,2] <- interest.payment
    table.amort[i,3] <- payment
    table.amort[i,4] <- balance
}
# This option suppresses scientific notation
options(scipen=999)
# Print the matrix
print(table.amort)
