
Data Science and Big Data Analytics

Day 3

Training Schedule: Day 3


Time       Details
0900-1000  Raw Data vs Processed Data vs Tidy Data
1000-1040  Download and Read Data
1040-1100  Morning Break
1100-1200  CSV, Excel, JSON Data
1200-1300  data.table package
1300-1400  Lunch
1400-1500  Laboratory exercise
1500-1540  Laboratory exercise
1540-1600  Afternoon Break
1600-1700  Laboratory exercise

Raw vs Processed Data


Data
Data are values of qualitative or quantitative variables belonging to a set of items

Raw Data
Data from the original source
Data that is often difficult to analyse
Data that needs to be processed before analysis
Raw data usually only needs to be converted to processed data once

Processed Data
Data that is ready for analysis
Processing can include merging, subsetting, transforming, etc.
There may be standards for processing
All processing steps should be recorded

Data Processing

The raw data
A tidy data set
A code book describing each variable and its values in the tidy data set
A recipe that describes the transformation from raw data to tidy data set

Tidy Data
Each measured variable should be in one column
Each different observation of that variable should be in a different row
There should be one table for each "kind" of variable
If you have multiple tables, they should include a column that allows them to be linked
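As a small illustration, the hypothetical wide table below violates the first two principles (the year variable is spread across columns); base R's reshape can tidy it. The data values here are made up.

```r
# Hypothetical untidy data: one column per year
untidy <- data.frame(airport = c("KUL", "PEN"),
                     y2013 = c(100, 40),
                     y2014 = c(120, 45))

# Tidy form: one column per variable (airport, year, passengers),
# one row per observation
tidy <- reshape(untidy, direction = "long",
                varying = c("y2013", "y2014"), v.names = "passengers",
                times = c(2013, 2014), timevar = "year", idvar = "airport")
tidy
```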

The code book


It is the metadata
Information that describes the variables (including the measurement units)
Information that describes the summary choices made
Information that describes the experimental design used

The Instruction List


The programming script (e.g. an R script)
The input to the script is the raw data
The output is the processed tidy data set
There are no parameters to the script

Downloading files
Always check and set your working directory using the getwd() and setwd() commands
To create the data directory if it does not already exist:
if (!file.exists("data")) {
  dir.create("data")
}

Downloading files
Use download.file() to download a file from the Internet
Important parameters are url, destfile, method
For example, download the Apple share price from Google Finance:
fileURL <- "https://www.google.com/finance?q=NASDAQ%3AAAPL&ei=ykGKVaPlPMSsugS66bOIAw"
download.file(fileURL, destfile = "./data/apple.csv")
list.files("./data")

Exercise 1
Download the data for "Jumlah penumpang yang dikendalikan mengikut lapangan terbang (tidak termasuk penumpang transit)" (number of passengers handled by airport, excluding transit passengers) in Malaysia for the years 2000-2014 from this URL: http://data.gov.my/view.php?view=136
Store it in a passengers.xlsx file
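A possible sketch for this exercise. The view page URL is used here as a placeholder; the actual direct .xlsx link behind the page may differ. mode = "wb" prevents Windows from corrupting a binary download.

```r
# Hypothetical direct URL; locate the actual .xlsx link on the view page
fileURL <- "http://data.gov.my/view.php?view=136"
download.file(fileURL, destfile = "./data/passengers.xlsx", mode = "wb")
list.files("./data")
```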

Read local files


The main function is read.table
read.table is flexible and robust but requires more parameters
> ?read.table # to find out more about read.table()

read.table reads all data into RAM, so it is not advisable for large data sets
Important parameters are: file, header, sep, row.names, nrows

Exercise 2
Read a file with , as the delimiter:
> comma_dat <- read.table("http://bit.ly/1e9Cvzu", header=TRUE, sep=",")
Read the ;-delimited file test_semicolon.txt from http://bit.ly/1RGbPn9 and store it in semi_dat
Show the content of semi_dat

read.csv () and read.csv2 ()


read.csv and read.csv2 are identical to read.table except for the defaults.
They are intended for reading comma-separated value files (.csv) or (read.csv2) the variant used in countries that use a comma as the decimal point and a semicolon as the field separator.
read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
read.csv2(file, header = TRUE, sep = ";", quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)
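The difference between the two sets of defaults can be seen with small in-memory examples (read.csv and read.csv2 accept a text argument, inherited from read.table):

```r
# Comma-separated, dot as decimal point
read.csv(text = "x,y\n1.5,2.5\n3.5,4.5")

# Semicolon-separated, comma as decimal point (European variant)
read.csv2(text = "x;y\n1,5;2,5\n3,5;4,5")
```

Both calls parse to the same numeric values.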

Exercise 3
Download the Genting Malaysia Berhad share prices (in CSV) from Yahoo! Finance (URL: http://finance.yahoo.com/q/hp?s=GMALF+Historical+Prices)
Store it in a genting.csv file
Read the file and describe the data obtained:

Number of observations
Number of variables
Names of the variables
Calculate the average closing price
Calculate the number of days on which the closing price is more than RM 1.35
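One possible approach, assuming the downloaded CSV has a Close column as in Yahoo! Finance historical price files:

```r
genting <- read.csv("./data/genting.csv")
nrow(genting)              # number of observations
ncol(genting)              # number of variables
names(genting)             # names of the variables
mean(genting$Close)        # average closing price
sum(genting$Close > 1.35)  # days with closing price above RM 1.35
```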

Reading Excel Files


Use the xlsx package: library(xlsx)
read.xlsx(file, sheetIndex, sheetName=NULL, rowIndex=NULL, startRow=NULL, endRow=NULL, colIndex=NULL, as.data.frame=TRUE, header=TRUE, colClasses=NA, keepFormulas=FALSE, encoding="unknown", ...)

rowIndex - a numeric vector indicating the rows you want to extract. If NULL, all rows found will be extracted, unless startRow or endRow are specified.
colIndex - a numeric vector indicating the columns you want to extract. If NULL, all columns found will be extracted.
read.xlsx2 is much faster but can be unstable
In general it is advisable to store data in .csv or .txt format, as these are easier to distribute

Exercise 4
Download the Excel spreadsheet on the Natural Gas Acquisition Program here: https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx
Read rows 74 to 79 and columns 14 to 21 into R and assign the result to a variable called xlsdat
Calculate the sum of Supp_Vol * Supp_Org_Cost
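A sketch of one approach. Because rows 74 to 79 are data rows rather than a header row, header = FALSE is used and the extracted columns must be identified manually; which extracted columns correspond to Supp_Vol and Supp_Org_Cost is an assumption to verify against the spreadsheet.

```r
library(xlsx)
xlsdat <- read.xlsx("./data/DATA.gov_NGAP.xlsx", sheetIndex = 1,
                    rowIndex = 74:79, colIndex = 14:21, header = FALSE)
str(xlsdat)   # inspect which columns hold Supp_Vol and Supp_Org_Cost
# Assuming, for illustration, they are the first two extracted columns:
# sum(xlsdat[, 1] * xlsdat[, 2])
```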

Reading from XML

Extensible Markup Language


Frequently used to store structured data
Particularly widely used in Internet applications
Extracting XML is the basis for most web scraping
Components:
Markup - labels that give the text structure
Content - the actual text of the document

http://en.wikipedia.org/wiki/XML

Tags, elements and attributes


Tags correspond to general labels
Start tags <section>
End tags </section>
Empty tags <line-break />

Elements are specific examples of tags


<Greeting> Hello, World </Greeting>

Attributes are components of the label


<img src="khpoo.jpg" alt="instructor" />

Example XML file

Read the XML file into R


> library(XML)
> fileURL <- "http://www.w3schools.com/xml/simplexsl.xml"
> doc <- xmlTreeParse(fileURL, useInternal=TRUE)
> rootNode <- xmlRoot(doc)
> xmlName(rootNode)
> names(rootNode)

Directly access parts of the XML document

> rootNode[[1]]

> rootNode[[1]][[1]]

Programmatically extract parts of the file


> xmlSApply(rootNode,xmlValue)

XPath
/node - top-level node
//node - node at any level
node[@attr-name] - node with an attribute named attr-name
node[@attr-name='bob'] - node with attribute attr-name='bob'

Get the items on the menu and prices


> xpathApply(rootNode, "//name", xmlValue)

> xpathApply(rootNode, "//price", xmlValue)

Reading JSON

Javascript Object Notation


Lightweight data storage
Use the jsonlite package: library(jsonlite)
Structure similar to XML but with a different syntax/format
Read using the fromJSON function; to convert data into JSON, use the toJSON function

Example of JSON file
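A minimal, self-contained round trip with jsonlite (the JSON text here is illustrative):

```r
library(jsonlite)

json_txt <- '{"name": "Ali", "scores": [80, 90, 85]}'
obj <- fromJSON(json_txt)   # JSON text -> R list with vectors
obj$scores                  # numeric vector: 80 90 85

toJSON(obj, pretty = TRUE)  # R object -> JSON text
```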

data.table package
A data.table, like a data.frame, can be considered a set of columns
Every column has the same length but may have a different type
Inherits from data.frame
All functions that work on a data.frame will work with a data.table
It is designed to be much faster for subsetting, grouping and updating
Goal 1: Reduce programming time (fewer function calls, less variable name repetition)
Goal 2: Reduce computation time (fast aggregation, update by reference)

data.table general form


trades[
  filledShares < orderedShares,
  sum((orderedShares - filledShares) * orderPrice/fx),
  by = "date,region,algo"
]

R:    i       j       by
SQL:  WHERE   SELECT  GROUP BY

Data.table general form

DT[i, j, by]
Easy to remember, just say: "Take DT, subset rows using i, then calculate j, grouped by by"
Once you grok the above reading, you don't need to memorize any other functions, as all operations follow the same intuition as base R.
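For example, on a small illustrative table:

```r
library(data.table)
DT <- data.table(x = c("a", "a", "b", "b"), v = 1:4)

# "Take DT, subset rows where v > 1, then sum v grouped by x"
DT[v > 1, sum(v), by = x]   # x == "a" gives 2, x == "b" gives 3 + 4 = 7
```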

Exercise 5
Compare both data.frame and data.table
> library(data.table)
> DF <- data.frame(x=rep(c("a","b","c"), each=3), y=c(1,3,6), v=1:9)
> DT <- data.table(x=rep(c("a","b","c"), each=3), y=c(1,3,6), v=1:9)

Notice the output differences between DT and DF
Check the class of both DF and DT
Subsetting rows: select rows 3 to 5 and compare the output for DF and DT
Show all the records where x == "a"
Show the records for rows 1, 3, 7 for both DF and DT

Exercise 6 (Computing on Columns)


The subsetting function is modified for data.table
The argument passed after the comma is called an "expression"
Examples:
> DT[, list(mean(y), sum(v))]
> DT[, .(length(x), sum(y))] # .() is the same as list()
Find the max of column y and the mean of column v
Find the square root of column y, store it in column w, and show all the elements in w
Find the mean of column v and store it in column w

Exercise 7 (multiple operations)


Perform any expression in j
Examples:
> DT[, plot(y,v)]
> DT[, for(x in 1:10) print(x)]
Plot the histogram for y; find the log of the elements in column v and store the result in a variable called temp and in column z

Exercise 8 (Doing j by group)


Get the sum of v grouped by column x:
> DT[, sum(v), by=x]
Calculate the mean of column y, grouping the result by even and odd rows
> library(datasets)
Load the iris data set and convert it to a data.table called irisDT
Get the mean of Sepal.Width grouped by Species
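A sketch for the iris part of this exercise:

```r
library(data.table)
library(datasets)

irisDT <- as.data.table(iris)           # convert the built-in data set
irisDT[, mean(Sepal.Width), by = Species]

# One way to group by even/odd rows (for the DT from Exercise 5):
# DT[, mean(y), by = .(even_row = seq_len(nrow(DT)) %% 2 == 0)]
```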

Exercise 9 (Performance Analysis)


Determine the time taken to compute the mean of Sepal.Length grouped by Species:
> fileURL10 <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/iris.csv"
> download.file(fileURL10, destfile="./data/iris.csv")
> irisDT <- fread("./data/iris.csv")
> system.time(irisDT[, mean(Sepal.Length), by=Species])

Find the computation time for the base-R equivalent:
> system.time(tapply(irisDT$Sepal.Length, irisDT$Species, mean))

Read MySQL
MySQL is free and widely used open-source database software
Install MySQL from the official website
Use the RMySQL package: library(RMySQL)

UCSC MySQL

Exercise 10 (connect and read tables)


Connect to the UCSC Genome online MySQL databases
List all the databases
Connect to the hg18 database
List all the tables in the hg18 database
Show tables 1 to 5
List all the fields of the "HInvGeneMrna" table
Select all from the "HInvGeneMrna" table where qNumInsert is between 2 and 3
Show the result

Basic RMySQL commands


ucscDb <- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu") # connect to MySQL with a username and host
dbGetQuery    # execute and fetch SQL queries
dbListTables  # list all tables
dbListFields  # list all fields of a table
dbReadTable   # read content from a table
dbSendQuery   # execute SQL queries
dbDisconnect  # disconnect a connection
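A sketch of an Exercise 10 session against the public UCSC server (requires network access; the genome account needs no password):

```r
library(RMySQL)

ucscDb <- dbConnect(MySQL(), user = "genome",
                    host = "genome-mysql.cse.ucsc.edu")
dbGetQuery(ucscDb, "show databases;")   # list all databases
dbDisconnect(ucscDb)

hg18 <- dbConnect(MySQL(), user = "genome", db = "hg18",
                  host = "genome-mysql.cse.ucsc.edu")
allTables <- dbListTables(hg18)         # all tables in hg18
allTables[1:5]                          # tables 1 to 5
dbListFields(hg18, "HInvGeneMrna")      # fields of one table
res <- dbGetQuery(hg18,
  "select * from HInvGeneMrna where qNumInsert between 2 and 3")
head(res)                               # show the result
dbDisconnect(hg18)
```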

Further Resources
RMySQL vignette: https://cran.r-project.org/web/packages/RMySQL/RMySQL.pdf
List of SQL commands: https://www.pantz.org/software/mysql/mysqlcommands.html
Blog summarizing some other SQL commands: http://www.r-bloggers.com/mysql-and-r/

Read from the Web


Web scraping - programmatically extracting data from the HTML code of websites
Use readLines() or parse with the XML package
Use GET from the httr package
For websites with a password, you are required to provide a username and password
Use handles

Example Webpage

Read from the Web


> con <- url("http://pesona.mmu.edu.my/~khpoo")
> htmlCode <- readLines(con)
> close(con)
> htmlCode

Exercise 11 (read from web)


Read the HTML content from this URL: http://pesona.mmu.edu.my/~khpoo
Using the readLines function, find out how many characters are in the 20th and 40th lines of HTML from this page
The nchar() function in R may be helpful
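One possible solution (requires network access):

```r
con <- url("http://pesona.mmu.edu.my/~khpoo")
htmlCode <- readLines(con)
close(con)
nchar(htmlCode[c(20, 40)])   # characters in the 20th and 40th lines
```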

Further Resources: Read from the Web

R-bloggers has a number of examples of web scraping: http://www.r-bloggers.com/search/web%20scraping
The httr documentation has useful examples: https://cran.fhcrc.org/web/packages/httr/httr.pdf

Reading from other sources


There are many R packages that allow you to interact directly with data sources
You can read data directly from other software:
Matlab
Minitab
S
SAS
SPSS
Weka
Octave

Exercise 12: Read SPSS file


Data files in SPSS format can be opened with the function read.spss, also from the foreign package.
There is a "to.data.frame" option for choosing whether a data frame is to be returned.
By default, it returns a list of components instead.
Download the Airline Passenger SPSS data from this URL: http://calcnet.mth.cmich.edu/org/spss/V16_materials/DataSets_v16/airline_passengers.sav
Read the SPSS file and assign it to the airlineSPSS variable
Observe the output for airlineSPSS
Convert airlineSPSS to a data.frame called airlineDF
View the content of airlineDF

Downloading files
The command for downloading dengue data from data.gov.my:
> url <- "http://data.gov.my/folders/MOH/MOH_denggue_HOTSPOT_2010_2014_v3.xlsx"
> download.file(url, destfile="dengue.xlsx", mode="wb")
> list.files()

Make sure you know where you store your file! Check using getwd().

Reading xlsx file


> library(xlsx)
> data <- read.xlsx("dengue.xlsx", sheetIndex=1, rowIndex=2:100, header=TRUE)
> dim(data)
> str(data)
> summary(data)

You may manually convert to CSV and load the CSV file if loading xlsx is an issue on your PC.

Subsetting Data
> set.seed(1)
> x <- data.frame("var1"=sample(1:5), "var2"=sample(6:10), "var3"=sample(11:15))
> x$var2[c(1,3)] = NA
> x

Subsetting Data: Hands-on


> x[,1]
> x[1:2, "var2"]
> x[x$var1 <= 3 & x$var3 > 10,]
> x[x$var1 > 2 | x$var3 > 10,]

Task: extract all rows with NA

Subsetting Data: Hands-on


> x[x$var2>1,]

Data for var1 and var3 are not presented correctly for the row with NA (comparing NA returns NA, so the row is kept but filled with NAs).

Subsetting Data: Hands-on


> x[which(x$var2>1),]

Subsetting Data: Hands-on


To retrieve the row having the maximum value for var1:
> x[which.max(x$var1),]

Task: look for the row with the minimum value for var3

Sorting Data
To sort the values of a particular column:
> sort(x$var1)
To sort in decreasing order:
> sort(x$var1, decreasing=TRUE)
What if I have NA values?
> sort(x$var2, na.last=TRUE)

Sorting Data
How to sort the entire dataset based
on a particular column, say var1?

Sorting Data: Hands-on


> x[order(x$var1),]

To sort data based on multiple columns:
> x[order(x$var1, x$var3),]

Sorting Data: plyr package


> library(plyr)
> arrange(x, var1)
> arrange(x, desc(var1))

Quantile
> x <- rnorm(100, 50, 20)
> hist(x)
> quantile(x)

Quantile
What if I want to see the data distribution at 30% and 80%?
> quantile(x, c(0.3, 0.8))
Can you find the information for the Dengue
data?

Creating a new column


If I want to create 4 bins for Jumlah.Kes.Terkumpul and assign them to a new variable named group:
> group <- cut(data$Jumlah.Kes.Terkumpul, breaks=4)
> group
> table(group)

Splitting data
What is the command to split the data according to Negeri?
> dt.split.1 <- split(data, data$Negeri)
What is the command to split Jumlah.Kes.Terkumpul according to Negeri?
> dt.split.2 <- split(data$Jumlah.Kes.Terkumpul, data$Negeri)

Splitting data
Apply a function across elements of the list,
> lapply(dt.split.2, sum)

Splitting data
How to get the sum of Jumlah.Kes.Terkumpul for each Daerah.Zon.PBT?
> tapply(data$Jumlah.Kes.Terkumpul, data$Daerah.Zon.PBT, sum)

Merging data
Let's create two data frames:
> set.seed(1)
> x <- sample(1:20, 10)
> y <- sample(30:50, 10)
> dt.1 <- data.frame(x, y)

> set.seed(2)
> x <- sample(1:20, 10)
> y <- sample(30:50, 10)
> dt.2 <- data.frame(x, y)

Merging data
To perform an inner join by column x:
> merge(dt.1, dt.2, by="x")
To perform a full (outer) join by column x, keeping rows from either table even when unmatched:
> merge(dt.1, dt.2, by="x", all=TRUE)
To perform an inner join by columns x and y:
> merge(dt.1, dt.2, by=c("x","y"))

Merging Data: Hands-on


> df1 <- data.frame(CustomerId = c(1:6), Product = c(rep("Honda", 3), rep("Chevrolet", 3)))
> df2 <- data.frame(CustomerId = c(2, 4, 7), State = c("Selangor","Sarawak","Kelantan"))

What do the data frames look like?

Merging Data: Hands-on


Task 1:
Return only the rows in which the df1 have
matching keys in df2.

Merging Data: Hands-on


Task 2:
Return all rows from the df1, and any rows with
matching keys from df2.

Merging Data: Hands-on


Task 3:
Return all rows from the df2, and any rows with
matching keys from df1.
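The three tasks map onto merge's all.x and all.y arguments; a sketch using the df1 and df2 defined earlier (recreated here so the snippet stands alone):

```r
df1 <- data.frame(CustomerId = c(1:6),
                  Product = c(rep("Honda", 3), rep("Chevrolet", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 7),
                  State = c("Selangor", "Sarawak", "Kelantan"))

merge(df1, df2, by = "CustomerId")                # Task 1: inner join
merge(df1, df2, by = "CustomerId", all.x = TRUE)  # Task 2: left join
merge(df1, df2, by = "CustomerId", all.y = TRUE)  # Task 3: right join
```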

Downloading a File
Go to http://data.gov.my/view.php?view=254
and download dengue hotspot for 2015
> url <- "http://data.gov.my/folders/KKM/LokalitiHotspot2015.xlsx"
> download.file(url, "dengue.xlsx", mode='wb')

Downloading a File
Convert the file to CSV for ease of reading, then:
> dt <- read.csv("dengue.csv")
Check the field names of the dataset:
> names(dt)

Column Name Manipulation


Remove the dot (.) in all the field names:
> names(dt) <- gsub(".", "", names(dt), fixed=TRUE)

fixed=TRUE makes sure gsub treats . as a literal dot, not as the regular-expression wildcard.

Column Name Manipulation


To rename all the columns:
> names(dt) <- c('Year','Week','State','District','Location','Total','Outbreak Duration')
To rename a specific column only:
> names(dt)[2] <- "Week No"

Check the changes to the columns.

Column Name Manipulation


Replace all spaces in the field names with _:
> names(dt) <- gsub(" ", "_", names(dt))

Records Manipulation
Inspect the data for Location; you can see there are many wrong spellings. Say we want to replace "Tmn" with "Taman" for all data in Location:
> dt$Location <- gsub("Tmn", "Taman", dt$Location)
Try replacing all "Kg" with "Kampung".

Check the data first before replacing anything.

String split
The command is strsplit()
> strsplit("hello Malaysia !", " ")

To split a particular record based on a particular character in a dataset:
> strsplit(as.character(dt$Location), " ")[[1]]

Finding Values
Say we want to know how many times the word "Taman" has appeared in the Location field:
> grep("Taman", dt$Location)
> grep("Taman", dt$Location, value=TRUE)

How many of them? What command to use?

Finding Values
How many with "Taman" and otherwise?
> cnt <- table(grepl("Taman", dt$Location))
> barplot(cnt)

What do you get? Explain.

Finding Values
Now, store all the data with the word "Taman" in Location in a variable named dt_Taman:
> dt_Taman <- dt[grepl("Taman", dt$Location),]
How many records are in dt_Taman?

Finding Values
Now, store all the data with the word "Seksyen" or "Medan" in Location:
> dt[grepl("Seksyen|Medan", dt$Location),]

Finding Values
Looking for any characters:
For example, we could retrieve locations that begin with the character T and contain the character 1 from the Location column:
> dt[grep("^T(.*)1", dt$Location), "Location"]

Hands-on
Look for all locations with the words "Jalan" and "Jaya".
Retrieve all records with the word "Taman" in Location in the district Petaling.
How many "Kampung" in Selangor and Perak? Plot a graph to compare the two.
[estimation = 20 minutes]
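Possible starting points (the column names State, District and Location assume the renaming done earlier):

```r
# Locations containing both "Jalan" and "Jaya"
dt[grepl("Jalan", dt$Location) & grepl("Jaya", dt$Location), ]

# "Taman" locations within the district Petaling
dt[grepl("Taman", dt$Location) & dt$District == "Petaling", ]

# Count "Kampung" locations in Selangor and Perak, then compare
cnt <- sapply(c("Selangor", "Perak"), function(s)
  sum(grepl("Kampung", dt$Location[dt$State == s])))
barplot(cnt)
```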

Date
To retrieve today's date:
> d1 <- Sys.Date()

Format codes: %a weekday, %d day number, %b month, %y year

To extract information from the date:
> format(d1, "%a %b %d")
Changing characters to a date:
> x <- "15July2015"
> z <- as.Date(x, "%d%b%Y")

Date
Extracting parts of a date; try the following:
> weekdays(z)
> months(z)
> julian(z)

julian() will tell you how many days have passed since the origin.

Date: the lubridate package


R has a specific date package named lubridate that makes manipulating dates easier.
> library(lubridate)
> ymd("20150715")
> mdy("07/15/2015")
> dmy("15/07/2015")
> ymd_hms("2015-07-15 18:47:00")
> wday(z)
> wday(z, label=TRUE)
