Data Manipulation
R.M. Ripley
Department of Statistics
University of Oxford
2012/13
If your data is not like an array, e.g. the lines have differing
structures, use the function readLines().
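As a minimal sketch of the readLines() approach: the file contents below are made up, and splitting on spaces with strsplit() is only one possible way to process the lines afterwards.

```r
# Hypothetical file whose lines have differing numbers of fields.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id: 1", "a b c", "x"), tmp)

lines <- readLines(tmp)          # character vector, one element per line
fields <- strsplit(lines, " ")   # list: the fields of each line
n_fields <- lengths(fields)      # number of fields on each line
unlink(tmp)                      # remove the temporary file
```

Each line can then be handled according to its own structure.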
File formats are very varied, hence there are many options for
read.table. Here are details of a few:
file name of file, usually a character string. Note for Windows
users: use "/" or "\\" but not "\" as the directory
separator.
sep field separator e.g. sep="\t" (tab), sep=","
header If your file has column headings in the first row, use
header=TRUE.
fill If some lines omit their trailing empty fields, specify
fill=TRUE
others To control the way R interprets your data, use options
such as na.strings, comment.char, quote,
as.is, colClasses, stringsAsFactors
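The options above might be combined as in the following sketch. The data are invented, and the text= argument is used here only so that the example needs no file on disk; in practice file would be given instead.

```r
# Hypothetical data: comma-separated, header row, "." marks a missing
# value, and the last line omits its trailing field.
txt <- "name,score,grade
ann,10,A
bob,.,B
cat,12"

dat <- read.table(text = txt, sep = ",", header = TRUE, fill = TRUE,
                  na.strings = ".", stringsAsFactors = FALSE,
                  colClasses = c("character", "numeric", "character"))
str(dat)   # inspect how each column was interpreted
```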
Variants of read.table()
Matrices
Arrays
Indexing
Suppose
x <- c(2, 4, 6, 8, 10, 12)
names(x) <- c("E1", "E2", "E3", "E4", "E5", "E6")
A vector of positive integers, in any desired order, indicating
elements to select.
e.g. x[c(1, 3, 6, 5)] will give 2, 6, 12, 10
A logical vector. This must be of the same length as the vector.
Values corresponding to the entries TRUE will be selected.
e.g. use <- c(rep(TRUE, 3), FALSE, TRUE, FALSE)
x[use] will give 2, 4, 6, 10
A vector of negative integers, indicating elements to exclude e.g.
x[-c(1:3)] will give 8, 10, 12
A vector of character strings. Only applicable if the vector has
names.
x[c("E1", "E3")] will give 2, 6
Empty. Select all. Useful if assigning to an object as all values will
be replaced but all other aspects of the object will be unchanged.
x[] <- 0
names(x) will be the same as before
Compare with x <- 0
x will be the single number 0, with no names.
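The kinds of index vector described above can be tried side by side. This sketch reuses the slides' x; the copies y and z are introduced only for illustration.

```r
x <- c(2, 4, 6, 8, 10, 12)
names(x) <- c("E1", "E2", "E3", "E4", "E5", "E6")

x[c(1, 3, 6, 5)]                             # positive integers: 2, 6, 12, 10
x[c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)]   # logical: 2, 4, 6, 10
x[-(1:3)]                                    # negative integers: 8, 10, 12
x[c("E1", "E3")]                             # names: 2, 6

y <- x
y[] <- 0   # every value replaced, names and length kept
z <- x
z <- 0     # z is now a single number, names are gone
```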
If you select rows from a data frame with only one column the
result will be a vector unless you use drop=FALSE.
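A small illustration of the effect of drop=FALSE, using a made-up one-column data frame:

```r
df <- data.frame(a = c(1, 2, 3))   # hypothetical one-column data frame

v <- df[2:3, ]                 # result is a plain numeric vector
d <- df[2:3, , drop = FALSE]   # result is still a data frame
```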
We often want to select the rows of a data frame which meet some
criterion.
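One possible way to do this is with a logical index on the rows; subset() is an alternative. The data frame and the threshold of 15 are invented for this sketch.

```r
df <- data.frame(id = 1:5, score = c(10, 25, 7, 31, 18))   # hypothetical

high <- df[df$score > 15, ]        # logical vector selects the rows
high2 <- subset(df, score > 15)    # the same rows via subset()
```

Note the comma after the condition: leaving the column index empty keeps all columns.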
Function Examples
Function Details
Note:
Arguments. May have default values. To call functions, either keep
the arguments in the same order
t.test.p(myx, 1)
or use names for each argument
t.test.p(mu=1, x=myx)
The two can be mixed, with ordered arguments first and named
ones at the end of the list.
To use more than one statement in the function, use braces {} to
define a block.
The value of the last expression evaluated will be returned.
To return more than one item, create a list using list() or a
vector using c().
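To make the calling conventions concrete, here is a sketch. The body of t.test.p below is an assumption (a two-sided p-value for a one-sample t test), written only so that the calls can be run; the argument names x and mu are taken from the calls shown above, and the default mu = 0 and the data in myx are made up.

```r
# Hypothetical stand-in for t.test.p.
t.test.p <- function(x, mu = 0)
{
    n <- length(x)
    t <- sqrt(n) * (mean(x) - mu) / sd(x)
    2 * (1 - pt(abs(t), n - 1))   # value of the last expression is returned
}

myx <- c(1.2, 0.8, 1.5, 0.9, 1.1)   # made-up data

p1 <- t.test.p(myx, 1)            # arguments in order
p2 <- t.test.p(mu = 1, x = myx)   # named arguments, any order
p3 <- t.test.p(myx, mu = 1)       # ordered first, then named
p0 <- t.test.p(myx)               # mu takes its default value
```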
R.M. Ripley (University of Oxford) R 2012/13 16 / 35
Writing simple functions
Flow Control
Our t.test.p performed 3 commands in a sequence.
Often we need to make a decision or execute a loop.
myfn <- function(n=100)
{
    tmp <- rep(NA, 3)
    tmp[1] <- mean(runif(n))
    tmp[2] <- mean(runif(n))
    tmp[3] <- mean(runif(n))
    mean(tmp[tmp > .2])
}
Flow Control : If
Example
repeat statement
The repeat statement repeats statement until flow is
transferred out using the break statement.
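A minimal sketch of repeat with break. The stopping rule (a uniform draw above 0.9) and the seed are arbitrary choices for illustration.

```r
set.seed(1)   # arbitrary seed, so that the run is reproducible
count <- 0
repeat
{
    count <- count + 1
    u <- runif(1)
    if (u > 0.9) break   # leave the loop once a draw exceeds 0.9
}
```

Without the break the loop would never end, so a repeat block normally contains one.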
Example functions
Just in case you are confused by the notation we have introduced, let
us recap:
brace {, } Used to create blocks of statements
ifelse function
R has many functions which reduce the need for you to write loops.
This can be both easier and more efficient. One is ifelse.
Suppose x <- c(0, 1, 1, 2) and y <- c(44, 45, 56, 77).
Replace:
z <- rep(NA, 4)
for (i in seq_along(x))
{
    if (x[i] > 0)
        z[i] <- y[i] / x[i]
    else
        z[i] <- y[i] / 99
}
by
z <- ifelse(x > 0, y / x, y / 99)
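The two versions can be checked against each other. The names z_loop and z_vec are introduced here only to keep both results.

```r
x <- c(0, 1, 1, 2)
y <- c(44, 45, 56, 77)

# Loop version
z_loop <- rep(NA, 4)
for (i in seq_along(x))
{
    if (x[i] > 0) z_loop[i] <- y[i] / x[i] else z_loop[i] <- y[i] / 99
}

# Vectorised version
z_vec <- ifelse(x > 0, y / x, y / 99)

all.equal(z_loop, z_vec)   # compare the two results
```

Note that ifelse evaluates both y / x and y / 99 in full and then picks elements from each, so y / x is computed even where x is 0.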
apply functions
A group of functions useful for avoiding loops e.g.
lapply, sapply, apply, tapply, mapply
lapply and sapply are used to iterate along a list or a vector.
lapply(mylist, length)
will return a list whose components are the lengths of the
components of mylist.
sapply(mylist, length)
will return a vector whose elements are the lengths of the
components of mylist.
apply and tapply operate in a similar way on arrays or parts of
arrays or vectors.
mapply operates on corresponding elements of multiple lists or
vectors.
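A short sketch of lapply, sapply and mapply. The contents of mylist and the function passed to mapply are made up for illustration.

```r
mylist <- list(a = 1:3, b = letters[1:5], c = numeric(0))   # hypothetical

len_list <- lapply(mylist, length)   # a list with components a, b, c
len_vec <- sapply(mylist, length)    # a named integer vector

# mapply pairs up corresponding elements of its arguments
sums <- mapply(function(u, v) u + v, 1:3, c(10, 20, 30))
```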
More data manipulation
apply
Can be used on data frames, but will turn them into matrices first.
If there are factor variables, all the variables will end up as
character.
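The conversion can be seen directly in a sketch like the following, where the data frame is invented and as.matrix() is called explicitly to show the intermediate step.

```r
df <- data.frame(u = c(1, 2, 3), v = c(4, 5, 6))   # hypothetical, all numeric

row_sums <- apply(df, 1, sum)     # one value per row
col_means <- apply(df, 2, mean)   # one value per column

df$g <- factor(c("a", "b", "a"))  # add a factor column
m <- as.matrix(df)                # the conversion apply performs first
is.character(m)                   # the numbers are now strings
```

With the factor column present, apply(df, 1, sum) would fail because sum cannot add character values.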
tapply
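As a brief, assumed illustration of tapply (the vectors score and group are made up), a function is applied to the values within each level of a grouping factor:

```r
score <- c(10, 20, 30, 40, 80)                # hypothetical values
group <- factor(c("a", "b", "a", "b", "a"))   # hypothetical grouping

means <- tapply(score, group, mean)   # one mean per level of group
```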
mapply, etc
There are other extensions of lapply. Worth looking for (via the
help pages) if you seem to need to write something similar.
Manipulating data
For cbind, the resulting data frame may have duplicate column
names.
For rbind the column names must match, although they need
not be in the same order.
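Both points can be seen with two small, made-up data frames whose columns have the same names in a different order:

```r
a <- data.frame(id = 1:2, val = c(10, 20))   # hypothetical
b <- data.frame(val = c(30, 40), id = 3:4)   # same names, other order

stacked <- rbind(a, b)   # columns matched by name, order taken from a
wide <- cbind(a, b)      # duplicate column names are kept

names(wide)
```

Columns with duplicate names are awkward to select by name, so it may be worth renaming before using cbind.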
Merge
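A possible use of merge() to combine two data frames on a common column. The data frames and the key column id are assumptions made for this sketch.

```r
people <- data.frame(id = c(1, 2, 3),
                     name = c("ann", "bob", "cat"))   # hypothetical
scores <- data.frame(id = c(2, 3, 4),
                     score = c(55, 70, 65))           # hypothetical

# Rows with an id present in both data frames
both <- merge(people, scores, by = "id")

# Keep every row of people; unmatched scores become NA
all_people <- merge(people, scores, by = "id", all.x = TRUE)
```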
Matrix Algebra
Exercises 3
Using the data frame you created in Exercises 2, select the rows
for which the date is after 1st June 2007.