Lesson 3
Objective Slide
After completing this course, you will be able to:
Introduction to R
Programming language for graphics and statistical computations
Freely available under the GNU General Public License
Used in data mining and statistical analysis
Includes time series analysis and linear and nonlinear modeling, among others
Very active community and package contributions
Very little programming language knowledge necessary
Can be downloaded from http://www.r-project.org/
RStudio (optional)
Basic R
Packages: install.packages('package_name')
library(package_name)
Loading data
data(dataset_name)
read and write functions
getwd() and setwd(dir)
read and write functions use full path name
Example: read.csv("C:/Rtutorials/Sampledata.csv")
Assignment operator : <-
Help - ?function_name
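The basics above can be tried together in one short session. The following is a minimal sketch, not part of the original slides: the file name and the use of a temporary directory are assumptions made so the example runs anywhere.

```r
# Installing a package is a one-time download, so it is left commented out here.
# install.packages("package_name")

# Hypothetical file location: a temporary directory stands in for a real folder.
csv_path <- file.path(tempdir(), "Sampledata.csv")

data(iris)                                    # load a built-in dataset
write.csv(iris, csv_path, row.names = FALSE)  # write it out using a full path
sample_data <- read.csv(csv_path)             # read it back with the same full path

getwd()               # show the current working directory
dim(sample_data)      # number of rows and columns that were read
head(sample_data, 5)  # first five rows
```

Typing ?read.csv at the console opens the help page for the read functions.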
Copyright 2014, Simplilearn, All rights reserved.
Summarizing data in R
summary(data_frame)
summary(iris)
Output: Mean, Median, Minimum, Maximum, 1st and 3rd quartile
table(dataframe$columnname)
Example :
table(iris$Species)
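As a quick illustration of the two functions on this slide, the sketch below runs them on the built-in iris data. The variable name species_counts is chosen here for illustration only.

```r
data(iris)

# Per-column summary: quartiles, mean, median, min and max for numeric columns,
# and level counts for factor columns such as Species.
summary(iris)

# Frequency table for a single column.
species_counts <- table(iris$Species)
species_counts
```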
Individual summaries in R
min(column_name)
max(column_name)
range(column_name)
mean(column_name)
median(column_name)
IQR(column_name)
sd(column_name)
var(column_name)
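In practice, column_name is a vector, usually written as data_frame$column_name. A minimal sketch using the Sepal.Length column of iris (the choice of column is an assumption for illustration):

```r
sepal_length <- iris$Sepal.Length

min(sepal_length)     # smallest value
max(sepal_length)     # largest value
range(sepal_length)   # minimum and maximum together
mean(sepal_length)    # arithmetic mean
median(sepal_length)  # middle value
IQR(sepal_length)     # 3rd quartile minus 1st quartile
sd(sepal_length)      # sample standard deviation
var(sepal_length)     # sample variance, equal to sd squared
```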
Subset summary
aggregate() - function for column wise aggregation
Numerical summaries for a subset of data
aggregate(formula, data, function)
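A short sketch of the formula interface, assuming the iris data and the mean as the summary function. The left side of the formula names the column to summarize, the right side names the grouping column, and a dot on the left side stands for all remaining columns.

```r
# Mean sepal length for each species.
mean_sepal_by_species <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
mean_sepal_by_species

# The dot means "every other column": mean of all four measurements per species.
all_means_by_species <- aggregate(. ~ Species, data = iris, FUN = mean)
all_means_by_species
```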
Data Visualization in R
plot() - generic function for plotting in R
plot(iris)
Data Visualization in R
Plot Sepal Length against Species
Arguments of the plot function: main, xlab, ylab
plot(iris$Sepal.Length, iris$Species,
main = "Iris Data",
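The call on the slide is cut off after the main argument. One way a complete call might look is sketched below; the axis label text is an assumption, not taken from the slides.

```r
# Species is a factor, so plot() places it on the y-axis using its integer codes.
species_code <- as.integer(iris$Species)
levels(iris$Species)  # the species names behind codes 1, 2 and 3

plot(iris$Sepal.Length, iris$Species,
     main = "Iris Data",
     xlab = "Sepal length",  # assumed label
     ylab = "Species")       # assumed label
```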
Pie Charts
Pie charts visualize the numerical proportion of different classes as sectors of a circle
pie(table(iris$Species),
[Figure: pie chart of iris species with three equal sectors: Setosa 50, Virginica 50, Versicolor 50]
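The pie() call on the slide is truncated. A minimal complete sketch follows, with an assumed chart title; it also computes the proportions that the sectors represent.

```r
species_counts <- table(iris$Species)

# Share of each class; the sectors of the pie are drawn in these proportions.
species_share <- prop.table(species_counts)
species_share

pie(species_counts, main = "Iris species")  # title is an assumption
```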
Bar plots
USPersonalExpenditure
barplot(USPersonalExpenditure,
main = "US Personal Expenditure by Year",
xlab = "Year",
ylab = "Expenditures")
                     1940   1945   1950   1955   1960
Food and Tobacco     22.2   44.5   59.6   73.2   86.8
Household Operation  10.5   15.5   29     36.5   46.2
Medical and Health   3.53   5.76   9.71   14     21.1
Personal Care        1.04   1.98   2.45   3.4    5.4
Private Education    0.341  0.974  1.8    2.6    3.64
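Because USPersonalExpenditure is a matrix, barplot() draws one stacked bar per column (year), with one segment per row (category). The sketch below repeats the call from the slide and adds a legend; the legend arguments are an addition for illustration, not part of the original slide.

```r
data(USPersonalExpenditure)
dim(USPersonalExpenditure)  # 5 categories by 5 years

# barplot() returns the x positions of the bars, one per year here.
bar_midpoints <- barplot(USPersonalExpenditure,
                         main = "US Personal Expenditure by Year",
                         xlab = "Year",
                         ylab = "Expenditures",
                         legend.text = rownames(USPersonalExpenditure),
                         args.legend = list(x = "topleft"))
```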
Box Plot
boxplot(Sepal.Length ~ Species,
data = iris,
main = "Iris Data Set",
xlab = "Species type",
ylab = "Sepal Length")
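The numbers behind the boxes can be inspected as well as drawn. As a sketch, the same call can be assigned to a variable (the name box_stats is chosen here for illustration), since boxplot() returns the statistics it plots.

```r
box_stats <- boxplot(Sepal.Length ~ Species,
                     data = iris,
                     main = "Iris Data Set",
                     xlab = "Species type",
                     ylab = "Sepal Length")

# One column per species; rows are lower whisker, lower hinge, median,
# upper hinge and upper whisker.
box_stats$stats
box_stats$names
```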
Histogram
Histograms depict frequency distributions
hist(iris$Sepal.Length,
[Figure: histogram with x-axis "Sepal length" and y-axis "Frequency"; the slide also mentions the islands dataset]
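The hist() call on the slide is truncated. A minimal complete sketch is given below, with an assumed title and axis label. hist() also returns the bin boundaries and counts, which helps when reading frequencies off the graph.

```r
sepal_hist <- hist(iris$Sepal.Length,
                   main = "Histogram of Sepal Length",  # assumed title
                   xlab = "Sepal length")               # assumed label

sepal_hist$breaks  # bin boundaries
sepal_hist$counts  # number of observations in each bin
```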
Correlation
Class of statistical relationships between variables
cor.test() uses Pearson's correlation by default
cor.test(column1, column2)
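A minimal sketch of a correlation test, assuming two numeric columns of iris as column1 and column2:

```r
cor_result <- cor.test(iris$Sepal.Length, iris$Petal.Length)

cor_result$method    # which correlation was used (Pearson by default)
cor_result$estimate  # the correlation coefficient
cor_result$p.value   # p-value for the null hypothesis of zero correlation
```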
Analysis of variance
aov() - function to fit an Analysis of Variance model
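As an illustration, a one-way analysis of variance can be sketched on the iris data. The choice of response and grouping variable is an assumption; the slide does not name a dataset for this test.

```r
# Does mean sepal length differ between the three species?
anova_fit <- aov(Sepal.Length ~ Species, data = iris)

# The summary table lists degrees of freedom, sums of squares, F value and p-value.
anova_table <- summary(anova_fit)
anova_table
```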
Chi-squared test
Implement and print the result of chi-squared test for goodness of fit
margin.table(HairEyeColor, 1)
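The margin.table() call collapses the HairEyeColor table to totals per hair color. A goodness-of-fit test on those totals might look like the sketch below, which assumes the default null hypothesis of chisq.test(): all categories equally likely.

```r
# Total number of people for each hair color (first dimension of the table).
hair_counts <- margin.table(HairEyeColor, 1)
hair_counts

# Goodness of fit against equal expected proportions (the default).
gof_result <- chisq.test(hair_counts)
print(gof_result)
```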
T-test
Paired t-tests
data(anorexia, package = "MASS")
Attributes: Treat (treatment), Prewt (pre-treatment weight), Postwt (post-treatment weight)
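Each patient in the anorexia data has a weight before and after treatment, so the two columns are paired. A minimal sketch of the test, assuming the MASS package is installed (it ships with standard R distributions):

```r
data(anorexia, package = "MASS")

# paired = TRUE tests the mean of the per-patient differences Prewt - Postwt.
paired_result <- t.test(anorexia$Prewt, anorexia$Postwt, paired = TRUE)
paired_result
```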
T-test
Independent t-tests
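An independent test compares two unrelated groups. The slide names no dataset for it, so the sketch below assumes two iris species as the groups. By default t.test() does not assume equal variances (Welch's test).

```r
# Keep two of the three species so that there are exactly two groups.
two_species <- subset(iris, Species %in% c("setosa", "versicolor"))
two_species$Species <- droplevels(two_species$Species)

independent_result <- t.test(Sepal.Length ~ Species, data = two_species)
independent_result
```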
Summary
Here is a quick recap of what we have learned in this lesson:
A basic introduction to R
Data visualizations in R
Pie Charts
Bar plots
Box plots
Histogram
Quiz
QUIZ 1
a.
b. Linear modeling
c.
d.
Answer: d.
Explanation: R is a statistical analysis and data mining tool.
QUIZ 2
Which of the following is the function to display the first 5 rows of data?
a. first(data,5)
b. head(data,5)
c. top(data,5)
d. first(5, data)
Answer: b.
Explanation: head() function is used to display the first few rows of data
QUIZ 3
a. vector
b. integer
c. matrix
d. data.frame
Answer: d.
Explanation: Iris is a sample data set that is stored as a data frame
QUIZ 4
What is the significance of the dot symbol before the tilde in the aggregate function?
a.
b. Aggregate no columns
c.
d.
Answer: a.
Explanation: The dot symbol in the formula is used to specify aggregation on all columns
QUIZ 5
Create a histogram of the islands dataset. What is the highest frequency of the dataset?
a. 55
b. 40
c. 30
d. 45
Answer: b.
Explanation: Plot the histogram using hist(islands). The highest frequency read from the graph is approximately 40.
QUIZ 6
The heights of a sample population are recorded before and after taking a height-increasing drug. Which of the following commands would BEST check whether the drug has an effect on a person's height?
a. aov(preheight~postheight)
b. t.test(preheight, postheight)
c.
d. chisq.test(preheight, postheight)
Answer: c.
Explanation: The variables are paired, so a paired t-test is the best way to examine the relation.
Thank You