
Basic Analytic Techniques Using R

Lesson 3

Copyright 2014, Simplilearn, All rights reserved.



Objectives
After completing this course, you will be able to:

Get a basic introduction to R

Understand exploration of data

Explore data using R

Visualize data using R

Understand diagnostic analytics

Implement diagnostic analytics using R


Introduction to R
Programming language for graphics and statistical computation
Freely available under the GNU General Public License
Used in data mining and statistical analysis
Includes time series analysis and linear and nonlinear modeling, among others
Very active community and package contributions
Very little programming knowledge necessary
Can be downloaded from http://www.r-project.org/
RStudio (optional)


Basic R
Packages
install.packages("package_name")
library(package_name)
Loading data
data(dataset_name)
read and write functions, such as read.csv() and write.csv()
getwd() and setwd(dir) get and set the working directory
read and write functions take a full path name, or a path relative to the working directory
Example: read.csv("C:/Rtutorials/Sampledata.csv")
Assignment operator: <-
Help: ?function_name
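
As a minimal sketch of the basics above, the following writes part of a built-in data set to a CSV file and reads it back. The file name sample_data.csv and the use of a temporary directory are illustrative assumptions, not part of the lesson.

```r
# Load a built-in data set into the workspace
data(iris)

# Build a full path in a temporary directory (hypothetical file name)
csv_path <- file.path(tempdir(), "sample_data.csv")

# Write the first ten rows to disk, then read them back into a data frame
write.csv(iris[1:10, ], csv_path, row.names = FALSE)
sample_data <- read.csv(csv_path)

# Show the current working directory
print(getwd())

# To open the help page for a function, run: ?read.csv
```

Using file.path() avoids hard-coding a path separator, so the same sketch should work on Windows and Unix-like systems.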

Data exploration using R


Basic functions for data exploration in R
Data is stored as data frames
Data frames are a tabular representation of data with rows and columns
Every row denotes a particular case


Sample data frame (iris data set)
Sepal length   Sepal width   Petal length   Petal width   Species
7.9            3.8           6.4            2             I. virginica
7.7            3             6.1            2.3           I. virginica
5.6            2.5           3.9            1.1           I. versicolor
5.6            2.8           4.9            2             I. virginica
5.5            4.2           1.4            0.2           I. setosa
5.5            3.5           1.3            0.2           I. setosa
7.1            3             5.9            2.1           I. virginica
7              3.2           4.7            1.4           I. versicolor

Viewing a data frame

The iris data set is a built-in data frame
View the data set
iris
View the first few rows of the data set
head(iris, n)
View the last few rows of the data set
tail(iris, n)

Dimensions of data frame


View the dimensions of the data set
dim(iris)
View the number of columns
ncol(iris)
View the number of rows
nrow(iris)
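
The viewing and dimension functions can be tried together on the built-in iris data frame. This is a small sketch; the choice of three rows for head() and tail() is arbitrary.

```r
# Show the first and last three rows of the data frame
print(head(iris, 3))
print(tail(iris, 3))

# dim() returns the number of rows followed by the number of columns
dims <- dim(iris)     # 150 rows, 5 columns
n_rows <- nrow(iris)  # rows only
n_cols <- ncol(iris)  # columns only
```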


Attributes of Data frame


View column names/headers
names(iris)
View all attributes
attributes(iris)


View column data


iris$Petal.Length
iris[ , "Petal.Length"]


View row data


View particular rows
iris[10:15, ]
View particular rows of a single column
iris[10:15, "Petal.Length"]
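
The column and row selections above can be sketched as follows. The variable names are illustrative; the point is that the $ operator and bracket indexing with a quoted column name return the same values.

```r
# Two equivalent ways to extract a single column as a vector
petal_len <- iris$Petal.Length
same_col  <- iris[, "Petal.Length"]

# Rows 10 to 15, all columns (the result is still a data frame)
rows_10_15 <- iris[10:15, ]

# Rows 10 to 15 of a single column (the result is a numeric vector)
petal_10_15 <- iris[10:15, "Petal.Length"]
```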


Summarizing data in R
summary(data_frame)
Example: summary(iris)
Output: mean, median, minimum, maximum, 1st and 3rd quartiles
table(dataframe$columnname)
Example: table(iris$Species)
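
A short sketch of both calls on the iris data frame. For numeric columns summary() reports the six statistics listed above; for a factor column such as Species it reports counts per level instead.

```r
# Per-column summary of the whole data frame
iris_summary <- summary(iris)
print(iris_summary)

# Frequency table of a categorical column
species_counts <- table(iris$Species)
print(species_counts)
```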


Individual summaries in R

min(column_name)
max(column_name)
range(column_name)

mean(column_name)
median(column_name)
IQR(column_name)

sd(column_name)
var(column_name)
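
Applied to one column, the individual summary functions look like this. The choice of Sepal.Length is an assumption made for illustration; any numeric column works.

```r
x <- iris$Sepal.Length

# Extremes
x_min   <- min(x)
x_max   <- max(x)
x_range <- range(x)   # c(minimum, maximum)

# Central tendency and interquartile range
x_mean   <- mean(x)
x_median <- median(x)
x_iqr    <- IQR(x)

# Spread: the variance is the square of the standard deviation
x_sd  <- sd(x)
x_var <- var(x)
```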


Subset summary
aggregate() - function for column wise aggregation
Numerical summaries for a subset of data
aggregate(formula, data, function)
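
A minimal sketch of the formula interface, assuming the iris data frame and mean as the summary function. The left side of the formula names the column to summarize and the right side names the grouping column; a dot on the left side stands for all remaining columns.

```r
# Mean sepal length for each species
mean_sepal <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(mean_sepal)

# Mean of every other column for each species
all_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(all_means)
```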


Data Visualization in R
plot() - generic function for plotting in R
plot(iris)


Data Visualization in R
Plot Sepal Length against Species
Arguments of the plot function: main, xlab, ylab
plot(iris$Sepal.Length, iris$Species,
main = "Iris Data",
xlab = "Sepal Length",
ylab = "Species")


Pie Charts
Pie charts visualize the numerical proportion of different classes through sectors of a circle
pie(table(iris$Species),
main = "Iris Data by Species")

Table of data by species:
Setosa   Virginica   Versicolor
50       50          50


Bar plots
USPersonalExpenditure
barplot(USPersonalExpenditure,
main = "US Personal Expenditure by Year",
xlab = "Year",
ylab = "Expenditures")

                      1940    1945    1950   1955   1960
Food and Tobacco      22.2    44.5    59.6   73.2   86.8
Household Operation   10.5    15.5    29     36.5   46.2
Medical and Health    3.53    5.76    9.71   14     21.1
Personal Care         1.04    1.98    2.45   3.4    5.4
Private Education     0.341   0.974   1.8    2.6    3.64

Box Plot
boxplot(Sepal.Length ~ Species,
data = iris,
main = "Iris Data Set",
xlab = "Species type",
ylab = "Sepal Length")


Histogram
Histograms depict a frequency distribution
hist(iris$Sepal.Length,
main = "Iris data",
xlab = "Sepal Length",
ylab = "Frequency")

[Figure: histogram with Sepal length on the x-axis and Frequency on the y-axis; the slide also carries the label "islands dataset"]

Correlation
Correlation is a class of statistical relationships between variables
cor.test() computes Pearson's correlation by default
cor.test(column1, column2)
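
As a sketch, the test can be run on two numeric columns of iris. The pairing of petal length with petal width is an assumption chosen for illustration.

```r
# Test for correlation between two numeric columns (Pearson by default)
cor_result <- cor.test(iris$Petal.Length, iris$Petal.Width)
print(cor_result)

# Individual pieces of the result can be extracted by name
cor_estimate <- unname(cor_result$estimate)  # correlation coefficient
cor_p_value  <- cor_result$p.value           # p-value of the test
```

Other coefficients can be requested through the method argument, for example method = "spearman".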


Analysis of variance
aov() - function to implement Analysis of Variance
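
The slide gives no example call, so the following is a sketch under the assumption that the iris data frame is used: a one-way analysis of variance that tests whether mean sepal length differs across species.

```r
# Fit a one-way ANOVA: numeric response on the left, grouping factor on the right
fit <- aov(Sepal.Length ~ Species, data = iris)

# The summary table holds degrees of freedom, sums of squares, F value and p-value
fit_table <- summary(fit)[[1]]
print(fit_table)
```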


Chi-squared test
Implement and print the result of the chi-squared test for goodness of fit
margin.table(HairEyeColor, 1)
chisq.test(variable) or chisq.test(variable, p = probabilities)
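
A sketch of both forms on the hair-color margin of HairEyeColor. The probability vector below is a made-up illustration, not a value from the lesson; note that it is passed by name as p, because the second positional argument of chisq.test() is a second variable, not a set of probabilities.

```r
# Total counts per hair color, summed over eye color and sex
hair <- margin.table(HairEyeColor, 1)
print(hair)

# Goodness of fit against equal proportions (the default)
equal_fit <- chisq.test(hair)

# Goodness of fit against hypothetical proportions (must sum to 1)
assumed_p <- c(0.2, 0.5, 0.1, 0.2)
custom_fit <- chisq.test(hair, p = assumed_p)

print(equal_fit)
print(custom_fit)
```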


T-test
Paired t-tests
data(anorexia, package = "MASS")
Columns: Treat (treatment), Prewt (pre-treatment weight), Postwt (post-treatment weight)
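
A minimal sketch of a paired test on this data set, assuming the MASS package is installed (it ships with most R distributions). Each patient contributes one weight before and one after treatment, so the two columns are paired row by row.

```r
# Load the data set from the MASS package without attaching the package
data(anorexia, package = "MASS")

# Paired t-test: is the mean change in weight different from zero?
paired_fit <- t.test(anorexia$Postwt, anorexia$Prewt, paired = TRUE)
print(paired_fit)
```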


T-test
Independent t-tests
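
The slide names no data set for this case, so the sketch below assumes two species from iris as the independent groups. With two separate samples, t.test() runs Welch's test by default, which does not assume equal variances.

```r
# Two independent samples: sepal lengths of two different species
setosa     <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]

# Independent (unpaired) two-sample t-test
welch_fit <- t.test(setosa, versicolor)
print(welch_fit)
```

Passing var.equal = TRUE switches to the classical pooled-variance test.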


Summary
Here is a quick recap of what we have learned in this lesson:

A basic introduction to R

Data exploration using R

Data visualizations in R
Pie charts
Bar plots
Box plots
Histograms

Diagnostic analytics using R
Chi-squared test
T-tests
Analysis of variance


Quiz


QUIZ 1

Which of the following is not a functionality of R?

a. Time series analysis
b. Linear modeling
c. Nonlinear modeling
d. Developing web applications

Answer: d.
Explanation: R is a statistical analysis and data mining tool.


QUIZ 2

Which of the following is the function to display the first 5 rows of data?

a. first(data,5)
b. head(data,5)
c. top(data,5)
d. first(5, data)

Answer: b.
Explanation: The head() function is used to display the first few rows of data.


QUIZ 3

What will be the result of the command class(iris)?

a. vector
b. integer
c. matrix
d. data.frame

Answer: d.
Explanation: iris is a sample data set that is stored as a data frame.


QUIZ 4

What is the significance of the dot symbol before the tilde in the aggregate function?

a. Aggregate all columns
b. Aggregate no columns
c. Aggregate the first column
d. Aggregate the last column

Answer: a.
Explanation: The dot symbol in the formula specifies aggregation on all columns.


QUIZ 5

Create a histogram of the islands dataset. What is the highest frequency of the dataset?

a. 55
b. 40
c. 30
d. 45

Answer: b.
Explanation: Plot the histogram using hist(islands). The highest frequency read from the graph is about 40.

QUIZ 6

The heights of a sample population are recorded before and after taking a height-increasing drug. Which of the following commands would BEST check whether the drug has an effect on a person's height?

a. aov(preheight~postheight)
b. t.test(preheight, postheight)
c. t.test(preheight, postheight, paired = TRUE)
d. chisq.test(preheight, postheight)

Answer: c.
Explanation: The variables are paired, so a paired t-test is the best way to learn the relation.

Thank You

