
Data Science and Big Data Analytics

Day 3

Training Schedule: Day 3


Time       Details
0900-1000  Raw Data vs Processed Data vs Tidy Data
1000-1040  Download and Read Data
1040-1100  Morning Break
1100-1200  CSV, Excel, JSON Data
1200-1300  data.table package
1300-1400  Lunch
1400-1500  Laboratory exercise
1500-1540  Laboratory exercise
1540-1600  Afternoon Break
1600-1700  Laboratory exercise

Raw vs Processed Data


Data
Data are values of qualitative or quantitative variables belonging to a set of items

Raw Data
Data from the original source
Data that is often difficult to analyse
Data that needs to be processed before analysis
Raw data usually only needs to be converted to processed data once

Processed Data
Data that is ready for analysis
Processing can include merging, subsetting, transforming, etc.
There may be standards for processing
All processing steps should be recorded

Data Processing

The raw data
A tidy data set
A code book describing each variable and its values in the tidy data set
A recipe that describes the transformation from raw data to tidy data set

Tidy Data
Each measured variable should be in one column
Each different observation of that variable should be in a different row
There should be one table for each "kind" of variable
If you have multiple tables, they should include a column that allows them to be linked
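As a small illustration, the hypothetical wide table below violates the first two principles (the year variable is spread across columns); base R's reshape can tidy it. The data values here are made up.

```r
# Hypothetical untidy data: one column per year
untidy <- data.frame(airport = c("KUL", "PEN"),
                     y2013 = c(100, 40),
                     y2014 = c(120, 45))

# Tidy form: one column per variable (airport, year, passengers),
# one row per observation
tidy <- reshape(untidy, direction = "long",
                varying = c("y2013", "y2014"), v.names = "passengers",
                times = c(2013, 2014), timevar = "year", idvar = "airport")
tidy
```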

The code book


It is the metadata
Information that describes the variables (including the measurement units)
Information that describes the summary choices made
Information that describes the experimental design used

The Instruction List


The programming script (e.g. an R script)
The input to the script is the raw data
The output is the processed tidy data set
There are no parameters to the script

Downloading files
Always check and set your working directory using the getwd() and setwd() commands
To create the data directory if it does not already exist:
if (!file.exists("data")) {
  dir.create("data")
}

Downloading files
Use download.file() to download a file from the Internet
Important parameters are url, destfile, method
For example, download the Apple share price from Google Finance:
fileURL <- "https://www.google.com/finance?q=NASDAQ%3AAAPL&ei=ykGKVaPlPMSsugS66bOIAw"
download.file(fileURL, destfile = "./data/apple.csv")
list.files("./data")

Exercise 1
Download the data for "Jumlah penumpang yang dikendalikan mengikut lapangan terbang (tidak termasuk penumpang transit)" (number of passengers handled by airport, excluding transit passengers) in Malaysia for the years 2000-2014 from this URL: http://data.gov.my/view.php?view=136
Store it in a passengers.xlsx file
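A possible sketch for this exercise. The view page URL is used here as a placeholder; the actual direct .xlsx link behind the page may differ. mode = "wb" prevents Windows from corrupting a binary download.

```r
# Hypothetical direct URL; locate the actual .xlsx link on the view page
fileURL <- "http://data.gov.my/view.php?view=136"
download.file(fileURL, destfile = "./data/passengers.xlsx", mode = "wb")
list.files("./data")
```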

Read local files


The main function is read.table
read.table is flexible and robust but requires more parameters
> ?read.table # to find out more about read.table()

read.table reads all data into RAM, so it is not advisable for large data sets
Important parameters are: file, header, sep, row.names, nrows

Exercise 2
Read a file with , as the delimiter:
> comma_dat <- read.table("http://bit.ly/1e9Cvzu", header=TRUE, sep=",")
Read the ;-delimited file test_semicolon.txt from http://bit.ly/1RGbPn9 and store it in semi_dat
Show the content of semi_dat

read.csv () and read.csv2 ()


read.csv and read.csv2 are identical to read.table except for the defaults.
They are intended for reading comma-separated value files (.csv) or (read.csv2) the variant used in countries that use a comma as the decimal point and a semicolon as the field separator.
read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
read.csv2(file, header = TRUE, sep = ";", quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)
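The difference between the two sets of defaults can be seen with small in-memory examples (read.csv and read.csv2 accept a text argument, inherited from read.table):

```r
# Comma-separated, dot as decimal point
read.csv(text = "x,y\n1.5,2.5\n3.5,4.5")

# Semicolon-separated, comma as decimal point (European variant)
read.csv2(text = "x;y\n1,5;2,5\n3,5;4,5")
```

Both calls parse to the same numeric values.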

Exercise 3
Download the Genting Malaysia Berhad share prices (in CSV) from Yahoo! Finance (URL: http://finance.yahoo.com/q/hp?s=GMALF+Historical+Prices)
Store it in a genting.csv file
Read the file and describe the data obtained:

Number of observations
Number of variables
Names of the variables
Calculate the average closing price
Calculate the number of days on which the closing price is more than RM 1.35
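One possible approach, assuming the downloaded CSV has a Close column as in Yahoo! Finance historical price files:

```r
genting <- read.csv("./data/genting.csv")
nrow(genting)              # number of observations
ncol(genting)              # number of variables
names(genting)             # names of the variables
mean(genting$Close)        # average closing price
sum(genting$Close > 1.35)  # days with closing price above RM 1.35
```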

Reading Excel Files


Use the xlsx package: library(xlsx)
read.xlsx(file, sheetIndex, sheetName=NULL, rowIndex=NULL, startRow=NULL, endRow=NULL, colIndex=NULL, as.data.frame=TRUE, header=TRUE, colClasses=NA, keepFormulas=FALSE, encoding="unknown", ...)

rowIndex - a numeric vector indicating the rows you want to extract. If NULL, all rows found will be extracted, unless startRow or endRow are specified.
colIndex - a numeric vector indicating the columns you want to extract. If NULL, all columns found will be extracted.
read.xlsx2 is much faster but can be unstable
In general it is advisable to store data in .csv or .txt format, as these are easier to distribute

Exercise 4
Download the Excel spreadsheet on the Natural Gas Acquisition Program here: https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx
Read rows 74 to 79 and columns 14 to 21 into R and assign the result to a variable called xlsdat
Calculate the sum of Supp_Vol * Supp_Org_Cost
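A sketch of one approach. Because rows 74 to 79 are data rows rather than a header row, header = FALSE is used and the extracted columns must be identified manually; which extracted columns correspond to Supp_Vol and Supp_Org_Cost is an assumption to verify against the spreadsheet.

```r
library(xlsx)
xlsdat <- read.xlsx("./data/DATA.gov_NGAP.xlsx", sheetIndex = 1,
                    rowIndex = 74:79, colIndex = 14:21, header = FALSE)
str(xlsdat)   # inspect which columns hold Supp_Vol and Supp_Org_Cost
# Assuming, for illustration, they are the first two extracted columns:
# sum(xlsdat[, 1] * xlsdat[, 2])
```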

Reading from XML

Extensible Markup Language


Frequently used to store structured data
Particularly widely used in Internet applications
Extracting XML is the basis for most web scraping
Components:
Markup - labels that give the text structure
Content - the actual text of the document

http://en.wikipedia.org/wiki/XML

Tags, elements and attributes


Tags correspond to general labels
Start tags <section>
End tags </section>
Empty tags <line-break />

Elements are specific examples of tags


<Greeting> Hello, World </Greeting>

Attributes are components of the label


<img src="khpoo.jpg" alt="instructor" />

Example XML file

Read the XML file into R


> library(XML)
> fileURL <- "http://www.w3schools.com/xml/simplexsl.xml"
> doc <- xmlTreeParse(fileURL, useInternal=TRUE)
> rootNode <- xmlRoot(doc)
> xmlName(rootNode)
> names(rootNode)

Directly access parts of the XML document

> rootNode[[1]]

> rootNode[[1]][[1]]

Programmatically extract parts of the file


> xmlSApply(rootNode,xmlValue)

XPath
/node - top-level node
//node - node at any level
node[@attr-name] - node with an attribute named attr-name
node[@attr-name='bob'] - node with attribute attr-name='bob'

Get the items on the menu and prices


> xpathApply(rootNode, "//name", xmlValue)

> xpathApply(rootNode, "//price", xmlValue)

Reading JSON

Javascript Object Notation


Lightweight data storage
Use the jsonlite package: library(jsonlite)
Structure similar to XML but with a different syntax/format
Read using the fromJSON function; to convert data into JSON, use the toJSON function

Example of JSON file
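A minimal, self-contained round trip with jsonlite (the JSON text here is illustrative):

```r
library(jsonlite)

json_txt <- '{"name": "Ali", "scores": [80, 90, 85]}'
obj <- fromJSON(json_txt)   # JSON text -> R list with vectors
obj$scores                  # numeric vector: 80 90 85

toJSON(obj, pretty = TRUE)  # R object -> JSON text
```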

data.table package
A data.table, like a data.frame, can be considered a set of columns
Every column has the same length but may have a different type
Inherits from data.frame
All functions that work on a data.frame will work with a data.table
It is designed to be much faster for subsetting, grouping and updating
Goal 1: Reduce programming time (fewer function calls, less variable name repetition)
Goal 2: Reduce computation time (fast aggregation, update by reference)

data.table general form


trades[
  filledShares < orderedShares,
  sum((orderedShares - filledShares) * orderPrice/fx),
  by = "date,region,algo"
]

R:    i       j       by
SQL:  WHERE   SELECT  GROUP BY

Data.table general form

DT[i, j, by]
Easy to remember, just say: "Take DT, subset rows using i, then calculate j, grouped by by"
Once you grok the above reading, you don't need to memorize any other functions, as all operations follow the same intuition as base R.
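For example, on a small illustrative table:

```r
library(data.table)
DT <- data.table(x = c("a", "a", "b", "b"), v = 1:4)

# "Take DT, subset rows where v > 1, then sum v grouped by x"
DT[v > 1, sum(v), by = x]   # x == "a" gives 2, x == "b" gives 3 + 4 = 7
```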

Exercise 5
Compare both data.frame and data.table
> library(data.table)
> DF <- data.frame(x=rep(c("a","b","c"), each=3), y=c(1,3,6), v=1:9)
> DT <- data.table(x=rep(c("a","b","c"), each=3), y=c(1,3,6), v=1:9)

Notice the output differences between DT and DF
Check the class of both DF and DT
Subsetting rows: select rows 3 to 5 and compare the output for DF and DT
Show all the records where x == "a"
Show the records for rows 1, 3, 7 for both DF and DT

Exercise 6 (Computing on Columns)


The subsetting function is modified for data.table
The argument passed after the comma is called an "expression"
Examples:
> DT[, list(mean(y), sum(v))]
> DT[, .(length(x), sum(y))] # .() is the same as list()
Find the max of column y and the mean of column v
Find the square root of column y, store it in column w, and show all the elements in w
Find the mean of column v and store it in column w

Exercise 7 (multiple operations)


Perform any expression in j
Examples:
> DT[, plot(y,v)]
> DT[, for(x in 1:10) print(x)]
Plot the histogram for y; find the log of the elements in column v and store the result in a variable called temp and in column z

Exercise 8 (Doing j by group)


Get the sum of v grouped by column x:
> DT[, sum(v), by=x]
Calculate the mean of column y, grouping the result by even and odd rows
> library(datasets)
Load the iris data set and convert it to a data.table called irisDT
Get the mean of Sepal.Width grouped by Species
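A sketch for the iris part of this exercise:

```r
library(data.table)
library(datasets)

irisDT <- as.data.table(iris)           # convert the built-in data set
irisDT[, mean(Sepal.Width), by = Species]

# One way to group by even/odd rows (for the DT from Exercise 5):
# DT[, mean(y), by = .(even_row = seq_len(nrow(DT)) %% 2 == 0)]
```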

Exercise 9 (Performance Analysis)


Determine the time taken to compute the mean of Sepal.Length grouped by Species:
> fileURL10 <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/iris.csv"
> download.file(fileURL10, destfile="./data/iris.csv")
> irisDT <- fread("./data/iris.csv")
> system.time(irisDT[, mean(Sepal.Length), by=Species])

Find the computation time for the base-R equivalent:
> system.time(tapply(irisDT$Sepal.Length, irisDT$Species, mean))

Read MySQL
MySQL is free and widely used open-source database software
Install MySQL from the official website
Use the RMySQL package: library(RMySQL)

UCSC MySQL

Exercise 10 (connect and read tables)


Connect to the UCSC Genome online MySQL databases
List all the databases
Connect to the hg18 database
List all the tables in the hg18 database
Show tables 1 to 5
List all the fields of the "HInvGeneMrna" table
Select all from the "HInvGeneMrna" table where qNumInsert is between 2 and 3
Show the result

Basic RMySQL commands


ucscDb <- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu") # connect to MySQL with a username and host
dbGetQuery    # execute and fetch SQL queries
dbListTables  # list all tables
dbListFields  # list all fields of a table
dbReadTable   # read content from a table
dbSendQuery   # execute SQL queries
dbDisconnect  # disconnect a connection
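A sketch of an Exercise 10 session against the public UCSC server (requires network access; the genome account needs no password):

```r
library(RMySQL)

ucscDb <- dbConnect(MySQL(), user = "genome",
                    host = "genome-mysql.cse.ucsc.edu")
dbGetQuery(ucscDb, "show databases;")   # list all databases
dbDisconnect(ucscDb)

hg18 <- dbConnect(MySQL(), user = "genome", db = "hg18",
                  host = "genome-mysql.cse.ucsc.edu")
allTables <- dbListTables(hg18)         # all tables in hg18
allTables[1:5]                          # tables 1 to 5
dbListFields(hg18, "HInvGeneMrna")      # fields of one table
res <- dbGetQuery(hg18,
  "select * from HInvGeneMrna where qNumInsert between 2 and 3")
head(res)                               # show the result
dbDisconnect(hg18)
```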

Further Resources
RMySQL vignette: https://cran.r-project.org/web/packages/RMySQL/RMySQL.pdf
List of SQL commands: https://www.pantz.org/software/mysql/mysqlcommands.html
Blog summarizing some other SQL commands: http://www.r-bloggers.com/mysql-and-r/

Read from the Web


Web scraping - programmatically extracting data from the HTML code of websites
Use readLines() or parse with the XML package
Use GET from the httr package
For websites with a password, you are required to provide a username and password
Use handles

Example Webpage

Read from the Web


> con <- url("http://pesona.mmu.edu.my/~khpoo")
> htmlCode <- readLines(con)
> close(con)
> htmlCode

Exercise 11 (read from web)


Read the HTML content from this URL: http://pesona.mmu.edu.my/~khpoo
Using the readLines function, find out how many characters are in the 20th and 40th lines of HTML from this page
The nchar() function in R may be helpful
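One possible solution (requires network access):

```r
con <- url("http://pesona.mmu.edu.my/~khpoo")
htmlCode <- readLines(con)
close(con)
nchar(htmlCode[c(20, 40)])   # characters in the 20th and 40th lines
```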

Further Resources: Read from the Web

R-bloggers has a number of examples of web scraping: http://www.r-bloggers.com/search/web%20scraping
The httr documentation has useful examples: https://cran.fhcrc.org/web/packages/httr/httr.pdf

Reading from other sources


There are many R packages that allow you to interact directly with data sources
You can read data directly from other software:
Matlab
Minitab
S
SAS
SPSS
Weka
Octave

Exercise 12: Read SPSS file


Data files in SPSS format can be opened with the function read.spss, also from the foreign package.
There is a "to.data.frame" option for choosing whether a data frame is to be returned.
By default, it returns a list of components instead.
Download the Airline Passenger SPSS data from this URL: http://calcnet.mth.cmich.edu/org/spss/V16_materials/DataSets_v16/airline_passengers.sav
Read the SPSS file and assign it to the airlineSPSS variable
Observe the output for airlineSPSS
Convert airlineSPSS to a data.frame called airlineDF
View the content of airlineDF

Downloading files
The command for downloading dengue data from data.gov.my:
> url <- "http://data.gov.my/folders/MOH/MOH_denggue_HOTSPOT_2010_2014_v3.xlsx"
> download.file(url, destfile="dengue.xlsx", mode="wb")
> list.files()

Make sure you know where you store your file! Check using getwd().

Reading xlsx file


> library(xlsx)
> data <- read.xlsx("dengue.xlsx", sheetIndex=1, rowIndex=2:100, header=TRUE)
> dim(data)
> str(data)
> summary(data)

You may manually convert to CSV and load the CSV file if loading xlsx is an issue on your PC.

Subsetting Data
> set.seed(1)
> x <- data.frame("var1"=sample(1:5), "var2"=sample(6:10), "var3"=sample(11:15))
> x$var2[c(1,3)] = NA
> x

Subsetting Data: Hands-on


> x[,1]
> x[1:2, "var2"]
> x[x$var1 <= 3 & x$var3 > 10,]
> x[x$var1 > 2 | x$var3 > 10,]

Task: extract all rows with NA

Subsetting Data: Hands-on


> x[x$var2>1,]

Data for var1 and var3 are not presented correctly for the row with NA (comparing NA returns NA, so the row is kept but filled with NAs).

Subsetting Data: Hands-on


> x[which(x$var2>1),]

Subsetting Data: Hands-on


To retrieve the row having the maximum value for var1:
> x[which.max(x$var1),]

Task: look for the row with the minimum value for var3

Sorting Data
To sort the values of a particular column:
> sort(x$var1)
To sort in decreasing order:
> sort(x$var1, decreasing=TRUE)
What if I have NA values?
> sort(x$var2, na.last=TRUE)

Sorting Data
How to sort the entire dataset based
on a particular column, say var1?

Sorting Data: Hands-on


> x[order(x$var1),]

To sort data based on multiple columns:
> x[order(x$var1, x$var3),]

Sorting Data: plyr package


> library(plyr)
> arrange(x, var1)
> arrange(x, desc(var1))

Quantile
> x <- rnorm(100, 50, 20)
> hist(x)
> quantile(x)

Quantile
What if I want to see the data distribution at 30% and 80%?
> quantile(x, c(0.3, 0.8))
Can you find the information for the Dengue
data?

Creating a new column


If I want to create 4 bins for Jumlah.Kes.Terkumpul and assign them to a new variable named group:
> group <- cut(data$Jumlah.Kes.Terkumpul, breaks=4)
> group
> table(group)

Splitting data
What is the command to split the data according to Negeri?
> dt.split.1 <- split(data, data$Negeri)
What is the command to split Jumlah.Kes.Terkumpul according to Negeri?
> dt.split.2 <- split(data$Jumlah.Kes.Terkumpul, data$Negeri)

Splitting data
Apply a function across elements of the list,
> lapply(dt.split.2, sum)

Splitting data
How to get the sum of Jumlah.Kes.Terkumpul for each Daerah.Zon.PBT?
> tapply(data$Jumlah.Kes.Terkumpul, data$Daerah.Zon.PBT, sum)

Merging data
Let's create two data frames:
> set.seed(1)
> x <- sample(1:20, 10)
> y <- sample(30:50, 10)
> dt.1 <- data.frame(x, y)

> set.seed(2)
> x <- sample(1:20, 10)
> y <- sample(30:50, 10)
> dt.2 <- data.frame(x, y)

Merging data
To perform an inner join by column x:
> merge(dt.1, dt.2, by="x")
To perform a full (outer) join by column x, keeping rows from either table even when unmatched:
> merge(dt.1, dt.2, by="x", all=TRUE)
To perform an inner join by columns x and y:
> merge(dt.1, dt.2, by=c("x","y"))

Merging Data: Hands-on


> df1 <- data.frame(CustomerId = c(1:6), Product = c(rep("Honda", 3), rep("Chevrolet", 3)))
> df2 <- data.frame(CustomerId = c(2, 4, 7), State = c("Selangor","Sarawak","Kelantan"))

What do the data frames look like?

Merging Data: Hands-on


Task 1:
Return only the rows in which the df1 have
matching keys in df2.

Merging Data: Hands-on


Task 2:
Return all rows from the df1, and any rows with
matching keys from df2.

Merging Data: Hands-on


Task 3:
Return all rows from the df2, and any rows with
matching keys from df1.
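The three tasks map onto merge's all.x and all.y arguments; a sketch using the df1 and df2 defined earlier (recreated here so the snippet stands alone):

```r
df1 <- data.frame(CustomerId = c(1:6),
                  Product = c(rep("Honda", 3), rep("Chevrolet", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 7),
                  State = c("Selangor", "Sarawak", "Kelantan"))

merge(df1, df2, by = "CustomerId")                # Task 1: inner join
merge(df1, df2, by = "CustomerId", all.x = TRUE)  # Task 2: left join
merge(df1, df2, by = "CustomerId", all.y = TRUE)  # Task 3: right join
```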

Downloading a File
Go to http://data.gov.my/view.php?view=254
and download dengue hotspot for 2015
> url <- "http://data.gov.my/folders/KKM/LokalitiHotspot2015.xlsx"
> download.file(url, "dengue.xlsx", mode='wb')

Downloading a File
Convert the file to CSV for ease of reading, then:
> dt <- read.csv("dengue.csv")
Check the field names of the dataset:
> names(dt)

Column Name Manipulation


Remove the dot (.) in all the field names:
> names(dt) <- gsub(".", "", names(dt), fixed=TRUE)

fixed=TRUE makes sure gsub treats . as a literal dot, not as the regular-expression wildcard.

Column Name Manipulation


To rename all the columns:
> names(dt) <- c('Year','Week','State','District','Location','Total','Outbreak Duration')
To rename a specific column only:
> names(dt)[2] <- "Week No"

Check the changes to the columns.

Column Name Manipulation


Replace all spaces in the field names with _:
> names(dt) <- gsub(" ", "_", names(dt))

Records Manipulation
Inspect the data for Location; you can see there are many wrong spellings. Say we want to replace "Tmn" with "Taman" for all data in Location:
> dt$Location <- gsub("Tmn", "Taman", dt$Location)
Try replacing all "Kg" with "Kampung".

Check the data first before replacing anything.

String split
The command is strsplit()
> strsplit("hello Malaysia !", " ")

To split a particular record based on a particular character in a dataset:
> strsplit(as.character(dt$Location), " ")[[1]]

Finding Values
Say we want to know how many times the word "Taman" has appeared in the Location field:
> grep("Taman", dt$Location)
> grep("Taman", dt$Location, value=TRUE)

How many of them? What command to use?

Finding Values
How many with "Taman" and otherwise?
> cnt <- table(grepl("Taman", dt$Location))
> barplot(cnt)

What do you get? Explain.

Finding Values
Now, store all the data with the word "Taman" in Location in a variable named dt_Taman:
> dt_Taman <- dt[grepl("Taman", dt$Location),]
How many records are in dt_Taman?

Finding Values
Now, store all the data with the word "Seksyen" or "Medan" in Location:
> dt[grepl("Seksyen|Medan", dt$Location),]

Finding Values
Looking for any characters:
For example, we could retrieve locations that begin with the character T and contain the character 1 from the Location column:
> dt[grep("^T(.*)1", dt$Location), "Location"]

Hands-on
Look for all locations with the words "Jalan" and "Jaya".
Retrieve all records with the word "Taman" in Location in the district Petaling.
How many "Kampung" in Selangor and Perak? Plot a graph to compare the two.
[estimation = 20 minutes]
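Possible starting points (the column names State, District and Location assume the renaming done earlier):

```r
# Locations containing both "Jalan" and "Jaya"
dt[grepl("Jalan", dt$Location) & grepl("Jaya", dt$Location), ]

# "Taman" locations within the district Petaling
dt[grepl("Taman", dt$Location) & dt$District == "Petaling", ]

# Count "Kampung" locations in Selangor and Perak, then compare
cnt <- sapply(c("Selangor", "Perak"), function(s)
  sum(grepl("Kampung", dt$Location[dt$State == s])))
barplot(cnt)
```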

Date
To retrieve today's date:
> d1 <- Sys.Date()

Format codes: %a weekday, %d day number, %b month, %y year

To extract information from the date:
> format(d1, "%a %b %d")
Changing characters to a date:
> x <- "15July2015"
> z <- as.Date(x, "%d%b%Y")

Date
Extracting parts of a date; try the following:
> weekdays(z)
> months(z)
> julian(z)

julian() will tell you how many days have passed since the origin.

Date: the lubridate package


R has a specific date package named lubridate that makes manipulating dates easier.
> library(lubridate)
> ymd("20150715")
> mdy("07/15/2015")
> dmy("15/07/2015")
> ymd_hms("2015-07-15 18:47:00")
> wday(z)
> wday(z, label=TRUE)
