
MEDITERRANEAN SCHOOL OF BUSINESS

COURSE: DATA MINING

PROFESSORS: Dr. Ramla Jarrar; Dr. Amor Messaoud


Lecture 2: Data exploration and reduction (part I)

10/26/2016

SOUTH MEDITERRANEAN UNIVERSITY

Chapter 2
DATA EXPLORATION AND REDUCTION

Table of contents
1. Have a look at data
2. Explore individual variables
3. Explore multiple variables
4. More explorations
5. Data reduction

Load the churn data set


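# Import the churn data set (comma-separated, with a header row)
# stringsAsFactors=TRUE stores text columns as factors, the default before R 4.0
churn<-read.table("D:\\churn.txt",header=TRUE,sep=",",stringsAsFactors=TRUE)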
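The import call can be tried without the file by reading a few inline lines. This is a minimal sketch, assuming a comma-separated layout with a header row; the three records and the name csv_lines are made up for illustration and are not taken from the churn data.

```r
# Hypothetical records standing in for the contents of churn.txt
csv_lines <- "Int.l.Plan,Day.Mins,Churn
no,265.1,False.
yes,161.6,False.
no,129.1,True."

# text= reads from a string instead of a file;
# stringsAsFactors = TRUE stores text columns as factors
toy <- read.table(text = csv_lines, header = TRUE, sep = ",",
                  stringsAsFactors = TRUE)

dim(toy)   # 3 rows and 3 columns
str(toy)   # Day.Mins is numeric, the other two columns are factors
```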

SECTION I
HAVE A LOOK AT DATA

Have a look at data


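# dimensions of the data (number of rows and columns)
dim(churn)

# print the variable names
names(churn)
# structure of the R object churn (it is a data frame)
str(churn)
# attributes of the data frame (names, class and row names)
attributes(churn)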

Have a look at data


# have a look at the first 10 rows of data
churn[1:10,]

# have a look at the first rows of data
head(churn)
# have a look at the first two rows of data
head(churn,2)
# have a look at the last rows of data
tail(churn)

Have a look at data


# retrieve the values of Day.Mins (column 8 of the data frame)
churn$Day.Mins
churn[,8]
# retrieve the first 10 values of Day.Mins
churn[1:10,8]
churn[1:10,"Day.Mins"]
churn$Day.Mins[1:10]
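The three ways of retrieving a column return the same vector. The sketch below checks this on a small hypothetical data frame (the values and the object name toy are made up), and shows how the column position can be looked up instead of hard-coded.

```r
# Hypothetical data frame with two numeric columns
toy <- data.frame(Eve.Mins = c(197.4, 195.5, 121.2, 61.9),
                  Day.Mins = c(265.1, 161.6, 243.4, 299.4))

by_dollar <- toy$Day.Mins        # access by name with $
by_index  <- toy[, 2]            # access by column position
by_label  <- toy[, "Day.Mins"]   # access by column name

identical(by_dollar, by_index)   # TRUE
identical(by_dollar, by_label)   # TRUE

# Look up the position of a column from its name
which(names(toy) == "Day.Mins")  # 2
```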

SECTION 2
EXPLORE INDIVIDUAL VARIABLES

Explore individual variables


The function summary() shows:
The distribution of every numeric variable: the minimum, maximum,
mean, median, and the first (25%) and third (75%) quartiles.
The frequency of every level of each factor (categorical variable).
# explore the individual variables using the function summary()
summary(churn)
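What summary() returns for each type of variable can be seen on a small example. This is a sketch on a hypothetical data frame (the five values are made up); it assumes the categorical column is stored as a factor.

```r
# Hypothetical data: one numeric variable and one factor
toy <- data.frame(Day.Mins   = c(100, 150, 200, 250, 300),
                  Int.l.Plan = factor(c("no", "no", "yes", "no", "no")))

summary(toy)                              # one block per column

num_summary  <- summary(toy$Day.Mins)     # min, quartiles, median, mean, max
plan_counts  <- summary(toy$Int.l.Plan)   # count of each level

num_summary[["Median"]]   # 200
plan_counts[["no"]]       # 4
```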


Explore individual variables


Numeric variables
Descriptive measures of numeric variables can be obtained using the
functions
mean(): Computes the mean of a numeric variable
var(): Computes the variance of a numeric variable
sd(): Computes the standard deviation of a numeric variable
quantile(): Produces sample quantiles corresponding to the given
probabilities
The functions hist(), density() and boxplot() provide the histogram,
density and boxplot of numeric variables, respectively


Explore individual variables


Numeric variables
# descriptive measures of the variable Day.Mins
mean(churn$Day.Mins)

var(churn$Day.Mins)
sd(churn$Day.Mins)
quantile(churn$Day.Mins)
quantile(churn$Day.Mins,c(0.1,0.5,0.65))
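The same measures can be checked by hand on a short vector. The sketch below uses five made-up values (not from the churn data) and relies on the default quantile algorithm of quantile().

```r
# Hypothetical values, chosen so the results are easy to verify
x <- c(100, 150, 200, 250, 300)

mean(x)       # 200
var(x)        # 6250, the sum of squared deviations divided by n - 1
sd(x)         # the square root of var(x)

quantile(x)                         # minimum, quartiles and maximum
q <- quantile(x, c(0.1, 0.5, 0.65)) # quantiles at chosen probabilities
q                                   # 120, 200 and 230
```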


Explore individual variables


Numeric variables
# Histogram of Day.Mins
hist(churn$Day.Mins)

Type ?hist to open the help page. What do these arguments do?
breaks
probability
xlab
ylab
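One way to explore these arguments is to try them on simulated data. The sketch below is only an illustration: the values are random draws, not the churn data, and the number of breaks and the labels are arbitrary choices.

```r
set.seed(1)
x <- rnorm(200, mean = 180, sd = 50)   # hypothetical day minutes

# breaks: suggested number of bins
# probability = TRUE: plot densities instead of counts
# xlab / ylab: axis labels
h <- hist(x, breaks = 20, probability = TRUE,
          xlab = "Day minutes", ylab = "Density",
          main = "Histogram of simulated day minutes")

# With densities on the y axis, a density curve can be overlaid
lines(density(x))

# The bar areas of a density histogram add up to 1
sum(h$density * diff(h$breaks))
```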


Explore individual variables


Numeric variables
# Plot the density of Day.Mins
plot(density(churn$Day.Mins))


Explore individual variables


Categorical variables
The frequencies of a categorical variable can be obtained with the function
table(). The functions pie() and barplot() are then used to plot the pie
chart and bar chart, respectively.
# frequency table of the variable Int.l.Plan
table(churn$Int.l.Plan)

  no  yes
3010  323
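Counts are often easier to read as proportions. A minimal sketch on a made-up vector (the eight values are hypothetical), using prop.table() on the result of table():

```r
# Hypothetical categorical variable
plan <- c("no", "no", "yes", "no", "yes", "no", "no", "no")

counts <- table(plan)     # 6 "no" and 2 "yes"
counts

shares <- prop.table(counts)   # relative frequencies: 0.75 and 0.25
shares

round(100 * shares, 1)    # the same as percentages
```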


Explore individual variables


Categorical variables
pie(table(churn$Int.l.Plan))

barplot(table(churn$Int.l.Plan))


SECTION 3
EXPLORE MULTIPLE VARIABLES


Explore multiple variables


Covariance and correlation
After checking the distributions of individual variables, we investigate the
relationships between two variables
The functions cov() and cor() provide the covariance and the correlation
coefficient between two numeric variables
# Covariance and correlation of evening minutes and day minutes
cov(churn$Eve.Mins,churn$Day.Mins)
[1] 19.45318
cor(churn$Eve.Mins,churn$Day.Mins)
[1] 0.007042511
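The covariance depends on the units of measurement, while the correlation is the covariance divided by the two standard deviations and is therefore unit-free. The sketch below illustrates this on two made-up vectors (not the churn data).

```r
# Hypothetical paired observations
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

cov(x, y)                      # 2
cor(x, y)                      # 0.8
cov(x, y) / (sd(x) * sd(y))    # same value as cor(x, y)

# Changing the unit of x (for example hours to minutes)
cov(60 * x, y)                 # the covariance is multiplied by 60
cor(60 * x, y)                 # the correlation is unchanged
```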


Explore multiple variables


Covariance and correlation matrices
Explore the relationships between many numeric variables

# Covariance and correlation matrices of day minutes, evening
# minutes and night minutes
cov(cbind(churn$Day.Mins,churn$Eve.Mins,churn$Night.Mins))
cor(cbind(churn$Day.Mins,churn$Eve.Mins,churn$Night.Mins))


Explore multiple variables


Covariance and correlation matrices
# You can also obtain the covariance and correlation matrices by selecting
# the columns by position (8 = Day.Mins, 11 = Eve.Mins, 14 = Night.Mins)
cov(churn[c(8,11,14)])
cor(churn[c(8,11,14)])
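Selecting the columns by name rather than by position keeps the row and column labels in the resulting matrix and does not break if the column order changes. A sketch on a hypothetical data frame (all values are made up; mins is an illustrative name):

```r
# Hypothetical data frame with three numeric columns and one factor
toy <- data.frame(Day.Mins   = c(120, 180, 240, 200, 150),
                  Eve.Mins   = c(210, 190, 205, 170, 220),
                  Night.Mins = c(200, 180, 230, 210, 190),
                  Churn      = factor(c("False.", "False.", "True.",
                                        "False.", "True.")))

mins <- c("Day.Mins", "Eve.Mins", "Night.Mins")

cov(toy[mins])             # covariance matrix, labelled by variable
m <- cor(toy[mins])        # correlation matrix
round(m, 2)                # symmetric, with 1 on the diagonal
```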


Explore multiple variables


Scatter plot

The function plot() can be used to draw the scatter plot
of two numeric variables

# Scatter plot of day minutes and evening minutes
plot(churn$Eve.Mins,churn$Day.Mins)
# Using jitter (adds a small amount of noise to separate overlapping points)
plot(jitter(churn$Day.Mins),jitter(churn$Eve.Mins))


Explore multiple variables


Scatter plot
# Color and plotting symbol by churn status (Churn must be a factor)
with(churn,plot(Day.Mins,Eve.Mins,col=Churn,pch=as.numeric(Churn)))


Explore multiple variables


Scatter plot
See the R file scatterplot_3.R
# Scatter plot of evening minutes and day minutes, colored by churn
plot(churn$Eve.Mins,churn$Day.Mins,xlim=c(0,400),ylim=c(0,400),
     xlab="Evening Minutes",ylab="Day Minutes",
     main="Scatterplot of Day and Evening Minutes by Churn",
     col=ifelse(churn$Churn=="True.","red","blue"))
# Adding legend
legend("topright",c("True.","False."),col=c("red","blue"),pch=1,title="Churn")


Explore multiple variables


Scatter plot matrices
A matrix of scatterplots of the variables day minutes, evening minutes and
night minutes is produced using the function pairs()
The formula is written ~x+y+z

pairs(~churn$Day.Mins+churn$Eve.Mins+churn$Night.Mins)


Explore multiple variables


Aggregate
The function aggregate() computes a summary statistic of a variable for each group

# Produce summary statistics of the variable Day minutes for the
# two groups churn = True and churn = False
aggregate(Day.Mins~Churn,data=churn,FUN=summary)
# Mean of Day minutes per group
aggregate(Day.Mins~Churn,data=churn,FUN=mean)
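The group means can be verified by hand on a small example. This sketch uses a hypothetical data frame (five made-up rows) and also shows tapply(), a base R alternative that returns a named vector instead of a data frame.

```r
# Hypothetical data: a numeric variable and a two-level grouping factor
toy <- data.frame(Day.Mins = c(100, 200, 300, 250, 350),
                  Churn    = factor(c("False.", "False.", "False.",
                                      "True.", "True.")))

# One row per group: mean of Day.Mins within each level of Churn
group_means <- aggregate(Day.Mins ~ Churn, data = toy, FUN = mean)
group_means            # 200 for False., 300 for True.

# Same computation with tapply()
tapply(toy$Day.Mins, toy$Churn, mean)
```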


Explore multiple variables


Side-by-side box plot
A side-by-side box plot is a useful tool for
visually comparing two or more data sets
Numeric vs. categorical variables (two
groups on the same variable)
# Produce the box plots of the variable Day minutes per group Churn
boxplot(churn$Day.Mins~churn$Churn)
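boxplot() also returns, invisibly, the statistics it draws, which can be inspected after plotting. A sketch on a hypothetical data frame (made-up values; the axis labels are arbitrary):

```r
# Hypothetical data: a numeric variable and a two-level grouping factor
toy <- data.frame(Day.Mins = c(100, 200, 300, 250, 350),
                  Churn    = factor(c("False.", "False.", "False.",
                                      "True.", "True.")))

# One box per level of Churn; the formula reads "Day.Mins by Churn"
b <- boxplot(Day.Mins ~ Churn, data = toy,
             xlab = "Churn", ylab = "Day minutes")

b$names        # group labels, in plotting order
b$stats        # one column per group: whisker, hinge, median, hinge, whisker
b$stats[3, ]   # the medians: 200 and 300
```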


Explore multiple variables


Cross tables
The function xtabs() produces cross
tables of categorical variables
# (2 variables) Int.l.Plan, Churn
xtabs(~churn$Int.l.Plan+churn$Churn)
# (3 variables) Int.l.Plan, Churn, VMail.Plan
xtabs(~churn$Int.l.Plan+churn$Churn+churn$VMail.Plan)
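A cross table is usually read together with its margins and row proportions. The sketch below builds one on a hypothetical data frame (six made-up rows) and uses the base R functions addmargins() and prop.table().

```r
# Hypothetical data: two categorical variables
toy <- data.frame(
  Int.l.Plan = factor(c("no", "no", "no", "yes", "yes", "no")),
  Churn      = factor(c("False.", "False.", "True.", "True.", "False.", "False."))
)

tab <- xtabs(~ Int.l.Plan + Churn, data = toy)
tab                      # counts for each combination of levels

addmargins(tab)          # adds row and column totals

row_shares <- prop.table(tab, 1)   # proportions within each row
row_shares               # row "no": 0.75 and 0.25; row "yes": 0.5 and 0.5
```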


Explore multiple variables


Cross tables
The function crosstab() of the R
package descr produces cross
tables of categorical variables

library(descr)
crosstab(churn$VMail.Plan,churn$Churn,type="f",addmargins=T)


Explore multiple variables


Cross tables
You can also use the function crosstab() created by
Dr. Paul Williamson, Dept. of Geography and Planning, School of
Environmental Sciences, University of Liverpool, UK.
It is adapted from the function ctab() in the catspec package.
The output is best viewed using the companion function print.crosstab()
http://pcwww.liv.ac.uk/~william/R/crosstab.r (accessed on October 20, 2016)
