Table of Contents

1.1 Introduction
1.1.1 Why?
1.1.2 About Google Analytics
1.1.3 About R
1.1.4 Author
1.2 Prepare environment
1.2.1 Data sources
1.2.2 Creating Google Analytics account
1.2.3 Getting credentials for Google Analytics API
1.2.4 Installing Google Analytics on website
1.2.5 Installing R Studio
1.2.6 Summary
1.3 First steps
1.3.1 Introduction to R
1.3.2 Connection with Google Analytics
1.3.3 googleAnalyticsR package
1.3.4 Import and export data to CSV
1.3.5 Code repository
1.3.6 Summary
1.4 Exploratory data analysis
1.4.1 Exploratory data analysis
1.5 Data visualization
1.5.1 Data visualization in R
1.5.2 Traffic heatmap
1.5.3 Device comparison
1.6 Machine Learning
1.6.1 Clustering (k-means)
1.7 Generating reports
1.7.1 Introduction to R Markdown
1.7.2 Create report
1.8 Additional analysis
1.8.1 Anomaly detection
1.8.2 Forecasting
1.9 Resources
1.9.1 Blogs
1.9.2 Documentation
1.9.3 Online trainings
1.9.4 Books

Introduction

Using Google Analytics with R


This book is a practical guide to analyzing data from Google Analytics in R.
In this book you will learn:
What Google Analytics is and how to collect web traffic data with this tool.
What R is and how to analyze data from Google Analytics in R Studio.
How to discover the knowledge hidden in the data about traffic on your website.
Feel free to share this book and read it online or offline. Thanks to Gitbook.io you can
download it in different formats: printable .pdf and formats for e-book readers like .epub
and .mobi.
This is still a development version. If you want to help develop this book, feel free to contact the
author via:
about.me
michalbrys.com

Why?
I decided to write this book to show how much value is hidden in data. If you have a website,
you are probably collecting data about web traffic. But do you use this data to make business
decisions?
Nowadays we are swimming in a data lake. Only if you know how to use this data will you stay
on the surface :). The first step is to regularly check the standard reports in your web analytics tool
(e.g. Google Analytics).
But to stay competitive you need something more. Everybody talks about data collection,
but only a few tell you what to do with the data after you collect it. I try to describe this process
and give you some ideas on how to deal with data from Google Analytics using R.
In this book I will share my experience in this field. I hope that it will be useful, interesting,
sometimes funny, and that it will save you time :)

Target audience
I wrote this book for marketers who have worked with Google Analytics and know its basic metrics
and its web interface. I hope that this material will help you
learn how to extend the features of Google Analytics in your daily work and how to use R.
If you are an analyst who knows R perfectly, I hope that you will also find some inspiration in this
book, especially in learning how to connect Google Analytics as an additional data source in
R and what kinds of analysis you can perform on this data.

About Google Analytics


Google Analytics is a free web analytics platform provided by Google.
It's also the most popular free web analytics tool on the Internet according to a BuiltWith report
(Feb 2016). It's a complete analytics platform offering solutions for collecting, analyzing and reporting
data. Google Analytics also offers free APIs to export data to external systems.

Terms of service
A common question is: is this great tool really free? To be precise, according to the Google
Analytics Terms of Service:
Service is provided without charge to You for up to 10 million Hits per month per
account.
If you exceed this quota, you should think about Google Analytics 360, formerly the Google
Analytics Premium service. This paid premium version offers a data collection quota that is
many times bigger.

What is a hit?
As you read above, your Google Analytics account has a limit of 10 000 000 hits per month. So
what is a hit?
According to Google Analytics help:
Hit - An interaction that results in data being sent to Analytics. Common hit types
include page tracking hits, event tracking hits, and ecommerce hits.
Each time the tracking code is triggered by a user's behavior (for example, user loads a
page on a website or a screen in a mobile app), Analytics records that activity. Each
interaction is packaged into a hit and sent to Google's servers. Examples of hit types
include:
page tracking hits
event tracking hits
ecommerce tracking hits
social interaction hits
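
To get a feeling for the quota, a rough estimate can be sketched in R. The daily numbers below are made-up assumptions for illustration, not data from any real account; only the limit of 10 million hits comes from the Terms of Service quoted above.

```r
# Hypothetical daily traffic (assumed numbers, replace with your own)
pageviews_per_day <- 250000
events_per_day <- 120000

# Every pageview and every event is sent as one hit
hits_per_day <- pageviews_per_day + events_per_day
hits_per_month <- hits_per_day * 31  # longest month as a worst case

quota <- 10e6  # 10 million hits per month per account
share_of_quota <- hits_per_month / quota
exceeds_quota <- hits_per_month > quota

print(hits_per_month)
print(exceeds_quota)
```

With these assumed numbers the site would be over the free quota, so it would be worth reviewing which events really need to be tracked.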

About R
What is R?
R is a programming language and software environment for statistical computing and
graphics supported by the R Foundation for Statistical Computing. The R language is widely
used among statisticians and data miners for developing statistical software and data
analysis. Polls, surveys of data miners, and studies of scholarly literature databases show
that R's popularity has increased substantially in recent years. (Source: Wikipedia)

Pros and cons


R is now the fastest growing language for statistical analysis.
The major advantages of R:
It is free.
It offers a lot of libraries for different statistical computations. Current list of packages
A lot of educational materials (tutorials, MOOCs, blogs) are available for free on the Internet.
It has big community support.
It runs on different platforms (Windows, Mac, Unix). A version for server installation
is also available.
It is fast because of in-memory computations.
Disadvantages?
R is not an out-of-the-box solution with a GUI for all analytical problems. You need to write
a chunk of code to get the result. This can sometimes be a barrier for non-technical people
who are just starting. But I hope that if you are reading this book, it is not a problem for you :)
The advantage of in-memory computations is sometimes a trap. In a standard
installation you can only process a data set that fits in the RAM of your machine. If
you have really big data to process, think about another solution like Hadoop
(MapReduce) or Apache Spark. If you feel comfortable with R, you can run your scripts
on those platforms (reading from HDFS or using SparkR). It is a more advanced topic for
another book ;)
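
One way to check how much RAM a data set takes is the object.size() function from base R. Below is a minimal sketch with a generated data frame; the column names and the number of rows are made up for illustration.

```r
# Generated example data: one million rows, two numeric columns
big_df <- data.frame(
  id = seq_len(1e6),
  value = as.numeric(seq_len(1e6))
)

# Size of the object in memory, in bytes and in a readable unit
size_bytes <- as.numeric(object.size(big_df))
print(format(object.size(big_df), units = "MB"))
```

If the reported size gets close to the amount of free RAM on your machine, it is a sign to query less data at once or to aggregate it earlier.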

Author
Michał Bryś
Data scientist
Michał has been working in the internet industry since 2009. He is an expert in web analytics in the e-commerce
context, especially using Google Analytics & Google Tag Manager. He loves mining big data
sets and transforming information into actionable knowledge. He loves creating stories from
numbers. He graduated from AGH University of Science and Technology and the University of
Economics in Cracow. Michał is a member of Google Developers Group Cracow.
Feel free to contact the author:
about.me
michalbrys.com

Preparing environment
To analyze data you will need to set up:
A Google Analytics account
R Studio
Credentials to connect to the Google Analytics API in the Google Developers Console (free)
I will describe these steps precisely in this chapter.

Data sources
Below you can find the most common scenarios.

I have a website WITHOUT Google Analytics tracking
Please read this whole chapter. I will lead you through the Google Analytics installation process
and then you can start collecting data from your website.

I have a website WITH Google Analytics tracking
You can go directly to the Getting credentials for Google Analytics API chapter.

I don't have a website or access to a Google Analytics account
If you are an analyst who knows R and wants to learn about analyzing data from Google
Analytics, I recommend one of these options.

NGOs, University, friends, family


Contact local NGOs that might want your help. Usually they have websites with some traffic
on them. By installing Google Analytics and doing traffic analysis you can help these organizations
and do something good for your community.
You can also join The Analytics Exchange here: www.webanalyticsdemystified.com
It's a community connecting web analysts and NGOs looking for help with digital analytics.
You can also contact your university, family and friends and offer help with digital
analytics.

Google Analytics demo account


The Google Analytics team has prepared a demo account with data from the Google Merchandise Store.
You can access this account here:
support.google.com/analytics/answer/6367342

About this data:


The data in the Google Analytics demo account is from the Google Merchandise Store,
a real ecommerce store. The Google Merchandise Store sells Google-branded
merchandise. The data in the account is typical of what you would see for an
ecommerce website. It includes the following kinds of information:
Traffic source data: information about where website visitors originate. This includes
data about organic traffic, paid search traffic, display traffic, etc.
Content data: information about the behavior of users on the site. This includes the
URLs of pages that visitors look at, how they interact with content, etc.
Transactional data: information about the transactions that occur on the Google
Merchandise Store website.

Creating Google Analytics account


Get your unique tracking ID
To set up your Google Analytics account, go to google.com/analytics/ and register your
account.
If you don't have any Analytics accounts connected with your Google Account, you will see this screen:

Account details
To create a Google Analytics account, fill in the form with:
Account Name. (Note: one Account may have several tracking IDs, so you can have one
Account per organization/company with many websites.)
In the next steps create your unique tracking ID:
Insert the Website Name, Website URL and Reporting Time Zone. (Note: the correct time
zone is critically important, because your data will be divided into dates in reports using this
value.)

Data Sharing Settings


You can change the data sharing settings (Note: disabling them will not have any impact on data
collection).

Complete registration process

To complete the registration process, click Get Tracking ID and accept the Google Analytics Terms
of Service.
After this you will see instructions on how to install the Google Analytics Tracking Code on your
website:

Install Google Analytics Tracking Code (GATC) on your website
To start collecting data from your website you need to insert this code on every page.
Personally, I recommend installing it via Google Tag Manager.
Further details: Installing Google Analytics on your website

Getting credentials for Google Analytics API
Note: you can use the default credentials included in the googleAnalyticsR package. But this
API quota is shared among all googleAnalyticsR users. To guarantee that the API quota is only
for you, please create your own credentials. I've described this process below.
Navigate to the Google Developers Console and create a new project.
Enable the Google Analytics API:

Search: Analytics

Select Enable

Create credentials:

Get credentials:

Save the Client ID and Client Secret. You need them to configure the library that gets data from
Google Analytics into R.

Installing Google Analytics on your website
Go to Google Analytics > Admin > Tracking Info > Tracking Code and get the tracking code
(GATC).
Example code:
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-11111111-1', 'auto');
ga('send', 'pageview');
</script>

Install the tracking code on your website, between the <head></head> tags, on every page you want
to track.
To do this you need access to your website's source code, or you can contact your
webmaster.
Alternatively, you can install Google Analytics via Google Tag Manager. I personally
recommend this way because it will save you a lot of time in the future :)

Installing R Studio
Go to the R Studio website and download R Studio Desktop, a graphical interface tool for the R language.

R Studio is available for multiple platforms: Windows, Mac, Linux.

You can use R Studio for free under the AGPL 3 license. A paid version with more functionality is also
available.

Installing extra packages


One of the biggest advantages of R is the thousands of libraries that extend its functionality.
You can browse these packages in the CRAN repository.
To install a new package in R type:
install.packages("package name")

For example, to install the ggplot2 plotting library type:
install.packages("ggplot2")

After installing a package and before using it, you should load it into the current session:
library("ggplot2")
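
If you run the same script many times, it can be handy to check which packages are already installed before calling install.packages(). This is only a sketch of one possible approach; the list of package names is an assumption based on the packages used later in this book.

```r
# Packages used in later chapters (adjust the list to your needs)
needed <- c("ggplot2", "googleAuthR", "googleAnalyticsR")

# requireNamespace() returns TRUE when a package is available
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
} else {
  message("All packages are already installed.")
}
```

The install line is commented out on purpose, so the sketch only reports what is missing and does not download anything by itself.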

Summary
In this chapter you learned:
How to create an account, and how to configure and install Google Analytics on your website.
How to download and set up R Studio.
How to get credentials to download data from Google Analytics into R.

First steps
In this chapter you will set up your environment by installing R Studio, creating a Google
Analytics account and making a connection via the API between both tools.

Introduction to R
Try typing some basic instructions in the console (the bottom-left window in R Studio).
Run each instruction by pressing the [Enter] key.

Arithmetic operations
> 1+1
[1] 2
> 2*4
[1] 8

Using variables
You can assign a value to a variable using the <- (more popular) or = operator. You can find
some basic examples below.

Numeric variables

> x <- 1+1
> x
[1] 2

Text variables
> z <- "Hello world"
> z
[1] "Hello world"

Vectors
> v <- c(1,2,3,4,5)
> v
[1] 1 2 3 4 5
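
Vectors are useful because most operations in R work on all elements at once. A short sketch using the same vector v (the other variable names are chosen only for this example):

```r
v <- c(1, 2, 3, 4, 5)

doubled <- v * 2        # arithmetic is applied to every element
total <- sum(v)         # sum of all elements
average <- mean(v)      # arithmetic mean
third <- v[3]           # indexing starts at 1, not 0
big_values <- v[v > 3]  # keep only the elements that meet a condition

print(doubled)
print(big_values)
```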

Data frames
More popular than the one-dimensional vector is a two-dimensional (tabular) data structure called
data.frame.

We'll also save the data returned from a Google Analytics API query as a data.frame.

Creating data frame

Let's create a simple data frame (e.g. the number of sessions by city on 2016-01-01)
df <- data.frame(
date = c("20160101","20160101","20160101",
"20160101","20160101","20160101","20160101"),
city = c("London","Warsaw","Krakow",
"New York","Paris","Zurich","Sydney"),
sessions = c(101,80,70,50,30,60,20)
)

To display the whole data frame, type its name: df
> df
      date     city sessions
1 20160101   London      101
2 20160101   Warsaw       80
3 20160101   Krakow       70
4 20160101 New York       50
5 20160101    Paris       30
6 20160101   Zurich       60
7 20160101   Sydney       20

Basic operations on data frame


To preview a data frame (by default the first 6 rows, which is useful for bigger data sets):
> head(df)
      date     city sessions
1 20160101   London      101
2 20160101   Warsaw       80
3 20160101   Krakow       70
4 20160101 New York       50
5 20160101    Paris       30
6 20160101   Zurich       60

To display the column names of a data frame:
> colnames(df)
[1] "date" "city" "sessions"

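You can refer to a column with the dataframe$colname operator:
> df$city
[1] London Warsaw Krakow New York Paris Zurich Sydney
Levels: Krakow London New York Paris Sydney Warsaw Zurich
The Levels: line appears because older versions of R convert text columns to factors by
default. Since R 4.0.0 text columns stay plain character vectors unless you pass
stringsAsFactors = TRUE to data.frame().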

And select only the unique values of a column (we have sessions for only one date: 2016-01-01):
> unique(df$date)
[1] 20160101
Levels: 20160101

Alternatively, you can select columns and rows by number: df[rownumber,colnumber]

Select column 2:
> df[,2]
[1] London Warsaw Krakow New York Paris Zurich Sydney
Levels: Krakow London New York Paris Sydney Warsaw Zurich

Select row 1:
> df[1,]
      date   city sessions
1 20160101 London      101

Select only one element:
> df[1,1]
[1] 20160101
Levels: 20160101

These basic operations are enough to start your journey with the R language :)
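
Two more operations that often come in handy are filtering rows by a condition and sorting. Below is a minimal sketch on the same example data; the names busy, ranked and top_city are chosen only for this illustration.

```r
df <- data.frame(
  date = rep("20160101", 7),
  city = c("London", "Warsaw", "Krakow", "New York", "Paris", "Zurich", "Sydney"),
  sessions = c(101, 80, 70, 50, 30, 60, 20),
  stringsAsFactors = FALSE  # keep text columns as plain character vectors
)

# Rows with more than 50 sessions
busy <- df[df$sessions > 50, ]

# All rows sorted from the highest to the lowest number of sessions
ranked <- df[order(df$sessions, decreasing = TRUE), ]
top_city <- ranked$city[1]

# Total number of sessions in all cities
total_sessions <- sum(df$sessions)

print(busy)
print(top_city)
```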

Connection with Google Analytics


To easily download data directly from the Google Analytics servers into R Studio via the API,
you have to extend R with an external package. This package will let you
easily build queries to the Google Analytics servers, authorize the connection and fetch the data to
your computer. External packages are one of the biggest advantages of R. So let's try!

Install package googleAnalyticsR


As a first step, install the libraries in R Studio.
install.packages("googleAuthR")
install.packages("googleAnalyticsR")

When it's done, load the libraries into the current R session:


library("googleAuthR")
library("googleAnalyticsR")

Configure connection between R and Google Analytics API
Configure the package with the credentials from the Google Developers Console (how to get them? See
Getting credentials for Google Analytics API):
# optional - add your own Google Developers Console key
options(googleAuthR.client_id = "uxxxxxxx2fd4kesu6.apps.googleusercontent.com")
options(googleAuthR.client_secret = "3JhLa_GxxxxxCQYLe31c64")
options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/analytics")
# authorize connection with Google Analytics servers
ga_auth()

You will be asked to authorize R to download data from Google Analytics, and your
browser will open the authorization page. Click Agree:

All done. You can now start sending queries via the Google Analytics API.
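
If you share your scripts (for example on GitHub), you may prefer not to paste the Client Secret directly into the code. One possible approach, sketched below, is to keep the credentials in environment variables. The variable names GA_CLIENT_ID and GA_CLIENT_SECRET are hypothetical and chosen only for this example; the option names are the same ones used in the configuration code in this chapter.

```r
# Read credentials from environment variables instead of hard-coding them.
# GA_CLIENT_ID and GA_CLIENT_SECRET are hypothetical names chosen for this sketch;
# you can define them e.g. in your ~/.Renviron file.
client_id <- Sys.getenv("GA_CLIENT_ID", unset = "")
client_secret <- Sys.getenv("GA_CLIENT_SECRET", unset = "")

has_own_credentials <- nzchar(client_id) && nzchar(client_secret)

if (has_own_credentials) {
  # Same option names as in the configuration code of this chapter
  options(googleAuthR.client_id = client_id)
  options(googleAuthR.client_secret = client_secret)
} else {
  message("No own credentials found, the package defaults will be used.")
}
```

The options have to be set before ga_auth() is called.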

First query - "Hello world"


Make your first query to Google Analytics via R:
## get your accounts
account_list <- google_analytics_account_list()
## pick a profile with data to query
#ga_id <- account_list[275,'viewId']
# or give it explicitly using the tool http://michalbrys.github.io/ga-tools/table-id.html in format 99999999
ga_id <- 00000000

# Get the Sessions by Date in 2016
gadata <- google_analytics(id = ga_id,
                           start = "2016-01-01",
                           end = "2016-06-30",
                           metrics = "sessions",
                           dimensions = "date",
                           max = 5000)

How to get your table.id?


The first time it may be a little tricky. The ga_id is a parameter that identifies your website data
(specifically, one unique view) on the Google Analytics servers. Where can you find this id?

Tool using Google Analytics Management API


You can use my tool to get the table.id.
Navigate to my tool at michalbrys.github.io/ga-tools/ and follow the instructions.

Copy from your Google Analytics web interface link


Navigate to the Admin section of your Google Analytics account. Select the website, property
and view which you want to query.
You will see this screen:

Your Google Analytics table.id parameter is the last number in the URL (the digits after the letter p).

So if your current URL is:
https://analytics.google.com/analytics/web/?authuser=0#management/Settings/a11111111w22222222p33333333/

In the query parameters in your R script you need to type:


...
ga_id <- 33333333
...
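
If you prefer not to copy the number by hand, the id can also be cut out of the URL with a regular expression. This is only a sketch that assumes the URL has the same shape as the example above; the variable name admin_url is made up for this illustration.

```r
# Example URL from the Admin section (the numbers are placeholders)
admin_url <- paste0(
  "https://analytics.google.com/analytics/web/?authuser=0",
  "#management/Settings/a11111111w22222222p33333333/"
)

# Capture the digits that follow the last "p" at the end of the URL
ga_id <- sub(".*p([0-9]+)/?$", "\\1", admin_url)

print(ga_id)
```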

Display results
After you successfully run your first query, you can check the results fetched from Google
Analytics. Display the first 6 rows of the result:
head(gadata)

      date sessions
1 20140101       39
2 20140102       46
3 20140103       47
4 20140104       53
5 20140105       49
6 20140106       15

Congrats! You've downloaded your first data set from your Google Analytics account!

Source code
Complete code for this example is in the GitHub repository:
https://github.com/michalbrys/R-Google-Analytics/blob/master/1_hello_world.R

googleAnalyticsR package

To pull data from Google Analytics into R we'll use the googleAnalyticsR package by Mark
Edmondson. Using this add-on you can use all the features of Google Analytics, including the
latest ones like Google Analytics 360 or the integration with BigQuery.
A short list of googleAnalyticsR features:
First Google Analytics Reporting v4 API library for R
v4 features include: dynamic calculated metrics, pivots, histograms, date
comparisons, batching.
v4 API explorer
API metadata of possible metrics and dimensions
Multi-user login in Shiny App
Integration with BigQuery Google Analytics Premium/360 exports.
Single authentication flow can be used with other googleAuthR apps like
searchConsoleR
Automatic batching, sampling avoidance with daily walk, multi-account fetching,
multi-channel funnel
Support for googleAuthR batch. For big data calls this could be 10x quicker than
normal GA fetching.
Meta data in attributes of returned dataframe including date ranges, totals, min and
max
You can read the docs on Mark's website:
http://code.markedmondson.me/googleAnalyticsR/
or check the docs in the CRAN repository:
https://cran.r-project.org/web/packages/googleAnalyticsR/vignettes/googleAnalyticsR.html

Import and export data to CSV


Using R you can import data from external data sources. One of the most popular
scenarios is analyzing data from a .csv file. I will describe this use case below.

Import data from file


To import data from a .csv file into R use the read.csv function.
For example, to import a file named file_to_import.csv from your working
directory and save the data in a data frame df:
df <- read.csv('file_to_import.csv')

Sometimes you need some extra options like header or separator.
If the first line of your file contains data rather than column names, use the header = FALSE
option (to skip the first line entirely, use skip = 1).
Also, if your column separator is something other than a comma, use the sep=';' option to
declare your separator.
Example code with options:
df <- read.csv('file_to_import.csv', header = FALSE, sep=';')

Where is my working directory?


To check your working directory, type in the R console:
getwd()

[1] "/Users/michal"

Export data
After conducting your analysis you may want to save the results in a file to use them in other tools. To do
this you need the write.csv function.
If you have data in a data frame called gadata you can use this code:
write.csv(gadata, file = "exported_data.csv")

As a result R will export the data to a .csv file. You can open it in any text editor or
spreadsheet (e.g. Microsoft Excel). Another use case is uploading the data as a custom dimension or
campaign cost data to Google Analytics.

Where can I find the saved file?

Your .csv file is in your working directory.
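
Below is a small round-trip sketch: it writes a data frame to a .csv file and reads it back. The data is made up, and a temporary folder is used instead of the working directory so that the example doesn't leave files behind. The row.names = FALSE option is an addition not used above; without it write.csv adds an extra column with row numbers.

```r
# Made-up data in the same shape as the Google Analytics results
gadata <- data.frame(
  date = c("20160101", "20160102", "20160103"),
  sessions = c(101, 80, 70),
  stringsAsFactors = FALSE
)

# Write to a temporary folder, without the extra row-number column
csv_path <- file.path(tempdir(), "exported_data.csv")
write.csv(gadata, file = csv_path, row.names = FALSE)

# Read it back, keeping the date column as text
restored <- read.csv(csv_path, colClasses = c(date = "character"))

print(restored)
```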

Code repository
You can find the R source code for all the examples described in this book in my GitHub
repository:
github.com/michalbrys/R-Google-Analytics
Feel free to contribute if you find an issue in the code or if you want to share your own examples.

Summary
In this chapter you learned:
How to perform basic arithmetic operations in R.
How to deal with basic data structures in R.
How to load external packages into R.
How to connect Google Analytics and R.
How to import data from a file into R and export it back to a file.

Exploratory data analysis


When analyzing data you can use this three-step framework.
1. Load your data
i. Download it from the Google Analytics API.
2. Know your data
i. Do some exploratory data analysis to better understand your data.
3. Do the main data analysis
i. Apply e.g. machine learning algorithms.
In this part I describe some basic exploratory data analysis operations.

Exploratory data analysis


Download your data and save it in the data frame gadata
# Get the Sessions by Date in 2014
gadata <- google_analytics(id = ga_id,
                           start = "2014-01-01",
                           end = "2014-12-31",
                           metrics = "sessions",
                           dimensions = "date",
                           max = 5000)

Let's do some basic operations

Min
What is the minimum number of sessions in 2014?
min(gadata$sessions)

[1] 0

Number of days with 0 sessions recorded

It looks like a tracking error: there is no data for some days. When was it? Display the days with 0
sessions.
subset(gadata, sessions == 0)

        date sessions
7   20140107        0
8   20140108        0
129 20140509        0
130 20140510        0
131 20140511        0
132 20140512        0
133 20140513        0
134 20140514        0
135 20140515        0

How many days had 0 sessions? Use the nrow() function to count the rows that meet this condition.
nrow(subset(gadata, sessions == 0))

[1] 9

So there were 9 days with 0 sessions.

Max
When was the biggest traffic on your website? Use the max() function.
> max(gadata$sessions)

[1] 204

So the highest traffic was 204 sessions in one day. When was it?

subset(gadata, sessions == 204)

       date sessions
59 20140228      204

You can get the same result in one call, replacing the value with max(). It is shorter but harder
to read:
subset(gadata, sessions == max(sessions))

       date sessions
59 20140228      204

Mean
What is the mean number of sessions per day? To calculate it, use the mean() function.
mean(gadata$sessions)

[1] 27.6

So the average number of sessions per day is 27.6.

Standard deviation
You can check how much the number of sessions varies from day to day. Use the sd() function.
sd(gadata$sessions)

[1] 22.12984

So the average number of sessions is 27.6 +/- 22.12984. This dataset varies a lot,
so in this case it is better not to rely only on the average value.

Median
If a dataset has a high standard deviation, it's better to calculate the median (the middle value
of the sorted dataset).
median(gadata$sessions)

[1] 21

The median is 21 sessions per day: half of the days had 21 sessions or fewer.

Summary
If you want, you can get all of these statistics with one function: summary.
summary(gadata)

     date              sessions
 Length:365         Min.   :  0.0
 Class :character   1st Qu.: 12.0
 Mode  :character   Median : 21.0
                    Mean   : 27.6
                    3rd Qu.: 40.0
                    Max.   :204.0

As a result you will get basic statistics for numeric variables and a description for character
variables.
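
If you don't have access to real data yet, the same steps can be tried on a small made-up data frame. The numbers below are invented for illustration; they only imitate the shape of the Google Analytics result, with two days without data and one day with a traffic spike.

```r
# Made-up data: two days with no tracking and one spike
gadata <- data.frame(
  date = c("20140101", "20140102", "20140103", "20140104",
           "20140105", "20140106", "20140107"),
  sessions = c(0, 12, 21, 40, 204, 0, 15),
  stringsAsFactors = FALSE
)

zero_days <- subset(gadata, sessions == 0)              # days without data
peak_day <- subset(gadata, sessions == max(sessions))   # day with the highest traffic
avg_sessions <- mean(gadata$sessions)
median_sessions <- median(gadata$sessions)

print(nrow(zero_days))
print(peak_day)
print(c(mean = avg_sessions, median = median_sessions))
```

In this sketch the single spike pulls the mean far above the median, which is why it is worth looking at both values.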

Source code
Complete code for this example is in the GitHub repository:
github.com/michalbrys/R-Google-Analytics/blob/master/2_eda.R

Data visualization

Data visualization in R
We'll do some exploratory data analysis by visualizing data from Google Analytics in R.
R has a wide range of visualization packages. My favourite is ggplot2.

Package ggplot2
According to the ggplot2 project site:
ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to
take the good parts of base and lattice graphics and none of the bad parts. It takes care
of many of the fiddly details that make plotting a hassle (like drawing legends) as well
as providing a powerful model of graphics that makes it easy to produce complex
multi-layered graphics.
Full documentation: docs.ggplot2.org
This is my favourite visualization package in R because of:
Nice chart design.
Flexibility.
Wide range of chart types.
Extension plugins, e.g. ggtheme.
You can also check alternatives like Plotly or R base graphics.
The examples in this book are made with ggplot2.

Using ggplot2
Download data to visualize in chart
As a first step, install (if necessary) and load the package in the current session.
install.packages("ggplot2")
library("ggplot2")

Next, build a query to fetch the dates and the number of sessions:
gadata <- google_analytics(id = ga_id,
                           start = "2016-01-01", end = "2016-06-30",
                           metrics = "sessions",
                           dimensions = "date",
                           max = 5000)

Display the first 6 rows of the result:
head(gadata)

        date sessions
1 2016-01-01      199
2 2016-01-02      212
3 2016-01-03      155
4 2016-01-04      210
5 2016-01-05      192
6 2016-01-06      180

Scatter plot
Plot the data over time (scatter plot)
ggplot(gadata, aes(x=date, y=sessions)) +
  geom_point()

As a result you will get a basic scatter plot with sessions over time:

Each point represents the number of sessions on a particular day.

As you can see, this plot isn't very nice because of the x-axis labels. You can fix this by rotating
the labels by 90 degrees.
Add the line:
theme(axis.text.x = element_text(angle = 90, hjust = 1))

So the complete example with rotated x-axis labels:
ggplot(gadata, aes(x=date, y=sessions)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

And the result:

You can also change the point size depending on the number of sessions by adding:
size = sessions

ggplot(gadata, aes(x=date, y=sessions, size = sessions)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

And the result:

You can also change the color of the points by adding:
color = sessions

(the lighter the color, the higher the number of sessions).

Complete code:
ggplot(gadata, aes(x=date, y=sessions, size = sessions, color = sessions)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

And the result:

This type of scatter plot is called a bubble chart.
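
The x-axis gets crowded when the dates are stored as text, because every value gets its own label. One possible way around it, sketched below on made-up data, is to convert the column to R's Date class with as.Date(); ggplot2 then picks a reasonable number of labels, and scale_x_date() controls their format. The name bubble_chart and the label format are choices made only for this example, and the plot is built only if ggplot2 is installed.

```r
# Made-up data with dates stored as text, e.g. "20160101"
gadata <- data.frame(
  date = c("20160101", "20160102", "20160103", "20160104"),
  sessions = c(199, 212, 155, 210),
  stringsAsFactors = FALSE
)

# Convert the text into R's Date class
gadata$date <- as.Date(gadata$date, format = "%Y%m%d")

# Build the bubble chart only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library("ggplot2")
  bubble_chart <- ggplot(gadata, aes(x = date, y = sessions,
                                     size = sessions, color = sessions)) +
    geom_point() +
    scale_x_date(date_labels = "%d %b") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  print(bubble_chart)
}
```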

Line chart
Plot the data over time (line chart) with some styles:
ggplot(gadata,aes(x=date,y=sessions,group=1)) +
  geom_line() +
  # some styles to rotate x-axis labels
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

As a result you will get a line chart with sessions over time:

Scatter plot with trend line


Sometimes you want to smooth the data and see what the trend is.
gadata <- google_analytics(id = ga_id,
                           start = "2016-01-01", end = "2016-06-30",
                           metrics = "sessions",
                           dimensions = "date",
                           max = 5000)

And now we can plot the data points with an added trend line:
ggplot(data = gadata, aes(x = date, y = sessions)) +
  geom_point() +
  geom_smooth() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

In this plot you can see that the trend is growing :)
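
geom_smooth() draws a smoothed curve, which is good for a visual impression. If you also want one number that describes the trend, one simple option is to fit a straight line with lm() from base R. The sketch below uses made-up data with an upward trend, so the numbers are not from any real account.

```r
# Made-up daily sessions with an upward trend and some fluctuation
gadata <- data.frame(
  date = seq(as.Date("2016-01-01"), by = "day", length.out = 30)
)
gadata$sessions <- 100 + 2 * seq_len(30) + rep(c(-5, 5), times = 15)

# Number of days since the first date, used as the explanatory variable
gadata$day <- as.numeric(gadata$date - min(gadata$date))

# Fit a straight line: sessions as a function of the day number
fit <- lm(sessions ~ day, data = gadata)
slope <- unname(coef(fit)["day"])

print(slope)  # average change in the number of sessions per day
```

A positive slope means that traffic grows on average, a negative one that it falls. A straight line is a strong simplification, so treat it as a first check only.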

Box plot
To do some exploratory data analysis, you can visualize your traffic on different days of the
week. Is your website traffic seasonal? Which days are the busiest? Let's check by
creating a box plot that illustrates the distribution of the number of sessions on every day of the
week.
Build a query to download the data:
gadata <- google_analytics(id = ga_id,
                           start = "2016-01-01", end = "2016-06-30",
                           metrics = "sessions",
                           dimensions = c("dayOfWeek","date"),
                           max = 5000)

And visualize it in a box plot:
ggplot(data = gadata, aes(x = dayOfWeek, y = sessions)) +
  geom_boxplot()

In Google Analytics, the days of the week are numbered with this convention:

0 - Sunday
1 - Monday
2 - Tuesday
3 - Wednesday
4 - Thursday
5 - Friday
6 - Saturday

So in this case, the highest traffic was on Thursday. Fridays are also not bad :)
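
The day numbers are not very readable on a chart. A possible way to replace them with names, and to check the busiest day with numbers rather than by eye, is sketched below. The data is made up; the mapping follows the convention listed above.

```r
# Made-up data in the shape returned for the dayOfWeek dimension
gadata <- data.frame(
  dayOfWeek = c("0", "1", "2", "3", "4", "5", "6", "4", "5"),
  sessions = c(90, 150, 160, 155, 210, 190, 80, 230, 170),
  stringsAsFactors = FALSE
)

# Map the codes 0..6 to day names, keeping the order Sunday..Saturday
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
gadata$dayName <- factor(gadata$dayOfWeek,
                         levels = as.character(0:6),
                         labels = day_names)

# Median number of sessions for every day of the week
median_by_day <- tapply(gadata$sessions, gadata$dayName, median)
busiest <- names(which.max(median_by_day))

print(median_by_day)
print(busiest)
```

You can then use dayName instead of dayOfWeek on the x-axis of the box plot.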

Source code
Complete code for this example is in the GitHub repository:
github.com/michalbrys/R-Google-Analytics/blob/master/3_data_visualization.R

Traffic heatmap
We will build a more advanced data visualization: a user engagement
heatmap. The darker the color, the higher the user engagement (avgSessionDuration) at that
time of day. Inspired by Todd Moy.

# traffic heatmap
# based on https://github.com/toddmoy/Google-Analytics-Heatmap/blob/master/traffic_heatmap.R

# install libraries
# install.packages("googleAuthR")
# install.packages("googleAnalyticsR")
# install.packages("ggplot2")
# install.packages("RColorBrewer")

# load libraries
library("googleAuthR")
library("googleAnalyticsR")
library("ggplot2")
library("RColorBrewer")

# authorize connection with Google Analytics servers
ga_auth()

## pick a profile with data to query
#ga_id <- account_list[275,'viewId']
# or give it explicitly using the tool http://michalbrys.github.io/ga-tools/table-id.html in format 99999999
ga_id <- 00000000

gadata <- google_analytics(id = ga_id,
                           start = "2012-01-01", end = "2016-06-30",
                           metrics = c("avgSessionDuration"),
                           dimensions = c("dayOfWeekName", "hour"),
                           max = 5000)

# order data by day of week (Sunday first)
gadata$dayOfWeekName <- factor(gadata$dayOfWeekName,
                               levels = c("Sunday", "Monday", "Tuesday", "Wednesday",
                                          "Thursday", "Friday", "Saturday"))
gadata <- gadata[order(gadata$dayOfWeekName), ]

# convert data frame to xtab
heatmap_data <- xtabs(avgSessionDuration ~ dayOfWeekName + hour, data = gadata)

When the data is ready, we'll make the plot:

# plot heatmap
heatmap(heatmap_data,
        col = colorRampPalette(brewer.pal(9, "Blues"))(100),
        revC = TRUE,
        scale = "none",
        Rowv = NA, Colv = NA,
        main = "avgSessionDuration by Day and Hour",
        xlab = "Hour")

And the result is:

In this case, Wednesday morning is the most engaging time of day for users :)
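
To see what xtabs() really produces, and to find the most engaging cell without reading colors, you can try it on made-up data. In the sketch below the values are invented so that Wednesday at 09 stands out; the names peak, peak_day and peak_hour are chosen only for this example.

```r
day_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")

# Made-up data: one row for every day of week and hour ("00".."23")
gadata <- expand.grid(dayOfWeekName = day_levels,
                      hour = sprintf("%02d", 0:23),
                      stringsAsFactors = FALSE)
gadata$avgSessionDuration <- 60 + as.integer(gadata$hour)
gadata$avgSessionDuration[gadata$dayOfWeekName == "Wednesday" &
                            gadata$hour == "09"] <- 180

# Same preparation as in the heatmap code: ordered days, then a day x hour table
gadata$dayOfWeekName <- factor(gadata$dayOfWeekName, levels = day_levels)
heatmap_data <- xtabs(avgSessionDuration ~ dayOfWeekName + hour, data = gadata)

# Position (row, column) of the highest value in the table
peak <- which(heatmap_data == max(heatmap_data), arr.ind = TRUE)
peak_day <- rownames(heatmap_data)[peak[1, 1]]
peak_hour <- colnames(heatmap_data)[peak[1, 2]]

print(dim(heatmap_data))  # 7 days x 24 hours
print(c(peak_day, peak_hour))
```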

Source code
Complete code for this example is in the GitHub repository:
github.com/michalbrys/R-Google-Analytics/blob/master/6_heatmap.R

Device comparison
Let's check how engaged users are on different types of devices. To do this, we'll plot two
charts: the number of sessions from each device type, and the avgSessionDuration (in
seconds) for each device type.
# device comparison
# install libraries
# install.packages("googleAuthR")
# install.packages("googleAnalyticsR")
# install.packages("ggplot2")
# load libraries
library("googleAuthR")
library("googleAnalyticsR")
library("ggplot2")
# authorize connection with Google Analytics servers
ga_auth()
## pick a profile with data to query
#ga_id <- account_list[275,'viewId']
# or give it explicitly, in the format 99999999, using the tool
# http://michalbrys.github.io/ga-tools/table-id.html
ga_id <- 00000000
gadata <- google_analytics(id = ga_id,
                           start = "2015-01-01", end = "2016-06-30",
                           metrics = c("sessions", "avgSessionDuration"),
                           dimensions = c("date", "deviceCategory"),
                           max = 5000)

# plot sessions with deviceCategory
ggplot(gadata, aes(deviceCategory, sessions)) +
  geom_bar(aes(fill = deviceCategory), stat = "identity")

# plot avgSessionDuration with deviceCategory
ggplot(gadata, aes(deviceCategory, avgSessionDuration)) +
  geom_bar(aes(fill = deviceCategory), stat = "identity")

In this case, the longest sessions came from mobile devices.
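One thing to keep in mind: because date is one of the dimensions, the query returns one row per day for each device, and a bar chart with stat="identity" stacks those rows. For sessions this gives the total, but for avgSessionDuration the bar height becomes the sum of the daily averages. If you prefer a single average per device, one possible approach is to weight each day by its number of sessions. The sketch below uses made-up data; the demo and per_device names and the totalDuration column are only illustrative.

```r
# Synthetic daily rows shaped like the query result (all values are made up).
demo <- data.frame(
  date = rep(as.Date("2016-06-01") + 0:2, each = 3),
  deviceCategory = rep(c("desktop", "mobile", "tablet"), times = 3),
  sessions = c(100, 50, 10, 120, 40, 20, 80, 60, 10),
  avgSessionDuration = c(60, 90, 30, 70, 100, 45, 50, 80, 60)
)

# Total time per row = number of sessions * average duration of those sessions.
demo$totalDuration <- demo$sessions * demo$avgSessionDuration

# Sum sessions and total time per device, then divide to get a weighted average.
per_device <- aggregate(cbind(sessions, totalDuration) ~ deviceCategory,
                        data = demo, FUN = sum)
per_device$avgSessionDuration <- per_device$totalDuration / per_device$sessions
per_device
```

The per_device data frame has one row per device, so it can be passed to the same ggplot call as above without any stacking.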

Source code
The complete code for this example is in the GitHub repository:
github.com/michalbrys/R-Google-Analytics/blob/master/7_device_comparsion.R

Machine Learning

Clustering (k-means)
The power of R lies in its wide range of packages with advanced, ready-to-use algorithms.
In this example we'll use k-means for custom user segmentation.
Unsupervised learning: k-means. "k-means clustering aims to partition n observations
into k clusters in which each observation belongs to the cluster with the nearest mean"
(Source: Wikipedia)
Because this example needs a custom installation of Google Analytics tracking (content
grouping, fingerprint), I've prepared a special dataset for this purpose. You can find the
complete code below.

# K-Means Cluster Analysis

# load data into R
# you can download data from Google Analytics API or download sample dataset
# source('ga-connection.R')

# download and preview sample dataset
download.file(url = "https://raw.githubusercontent.com/michalbrys/R/master/users-segmentation/sample-users.csv",
              "sample-users.csv",
              method = "curl")
gadata <- read.csv(file = "sample-users.csv", header = TRUE, row.names = 1)
head(gadata)

# clustering users in 3 groups
fit <- kmeans(gadata, 3)

# get cluster means
aggregate(gadata, by = list(fit$cluster), FUN = mean)

# append and preview cluster assignment
clustered_users <- data.frame(gadata, fit$cluster)
head(clustered_users)

# visualize results in 3D chart
#install.packages("plotly")
library(plotly)
plot_ly(clustered_users,
        x = clustered_users$beginner_pv,
        y = clustered_users$intermediate_pv,
        z = clustered_users$advanced_pv,
        type = "scatter3d",
        mode = "markers",
        color = factor(clustered_users$fit.cluster)
)

# write results to file
write.csv(clustered_users, "clustered-users.csv", row.names = TRUE)

Results
Results visualized with the plotly package:

Results - clustered users


In addition to the chart, you get a .csv file with the userId (fingerprint) and the assigned
label (segment number). You can use the results by uploading them to your marketing
systems. Example results:
> clustered_users
       Beginner Intermediate Advanced fit.cluster
266876        9           45        4           1
965265        9           51        7           1
...
981924       19           10        8           2
732529       19           16        1           2
...
377795        2            7       38           3
918083        2            8       28           3
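k-means starts from randomly chosen centers, so the cluster numbers (and sometimes the clusters themselves) can change between runs. Here is a small sketch on made-up data showing two common precautions: fixing the random seed and using several random starts with nstart. It also plots the total within-cluster sum of squares for different values of k, which is one informal way to choose the number of segments. The demo_users data frame and its values are invented; only the column names follow the example above.

```r
set.seed(123)  # k-means starts from random centers; fix the seed for repeatable runs

# Synthetic page-view counts for three kinds of users (all values are made up).
demo_users <- data.frame(
  beginner_pv     = c(rpois(20, 30), rpois(20, 3),  rpois(20, 3)),
  intermediate_pv = c(rpois(20, 3),  rpois(20, 30), rpois(20, 3)),
  advanced_pv     = c(rpois(20, 3),  rpois(20, 3),  rpois(20, 30))
)

# nstart = 25 runs the algorithm from 25 random starts and keeps the best fit.
fit <- kmeans(demo_users, centers = 3, nstart = 25)
fit$size      # number of users in each cluster
fit$centers   # mean page views per cluster

# Total within-cluster sum of squares for k = 1..6; look for the "elbow".
wss <- sapply(1:6, function(k) {
  kmeans(demo_users, centers = k, nstart = 25)$tot.withinss
})
plot(1:6, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")
```

If your columns are on very different scales, it may also be worth standardizing them with scale() before clustering, because k-means works on plain distances.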

Source code
The complete code for this example is in the GitHub repository:
github.com/michalbrys/R-Google-Analytics/blob/master/5_users_segmentation.R

Generating reports
For every analyst, periodic reports can be time-consuming work. We can automate this
process in R by preparing report templates. You can then re-run these reports with a
different time range and save them to, for example, a .pdf file. Sounds interesting?

Introduction to R Markdown
You can use R Markdown as follows:

R Markdown options
---
title: "Monthly report"
output: pdf_document
---

Chunks of code
```{r}
# R Code
```

If you don't want the code of a chunk to be displayed in the output file, use the echo = FALSE option.
```{r, echo=FALSE}
# R Code
```

Basic formatting
Headers
# Header 1
## Header 2
### Header 3

will produce

Header 1
Header 2
Header 3
Lists
* element 1
* element 2
* element 3

will produce
• element 1
• element 2
• element 3
1. element 1
2. element 2
3. element 3

will produce
1. element 1
2. element 2
3. element 3

Formatting
*italic*
**bold**
***bold+italic***

will produce
italic bold bold+italic

More resources
Full documentation:
www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf
Cheat sheet:
www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf

Create report
Use the code below to generate a basic report template. The report will contain a title and
the sessions-over-time scatter plot from chapter 2 (Data visualization in R).

Create new RMarkdown report


In R Studio, navigate to File > New File > R Markdown.
You will see a window with some basic configuration options. Change these values now, or
do it later directly in the code.

You can select the output format of your report: HTML, PDF or Word.
Click OK and delete the sample code.

Prepare custom report with Google Analytics data
Copy this code to R Studio and click the Knit HTML icon. The code will generate an HTML
report with data downloaded from Google Analytics.
---
title: "Google Analytics Traffic Report"
author: "Michal Brys"
output: html_document
---

```{r, echo=FALSE, warning=FALSE, error=FALSE, message=FALSE}
ga_id <- 67980704
date_start <- "2016-01-01"
date_end <- "2016-06-30"

#install.packages("googleAnalyticsR")
#install.packages("ggplot2")
library("googleAnalyticsR")
library("ggplot2")

# Run once from the console, then generate the knitr document
ga_auth()
```

### Sessions from `r date_start` to `r date_end`

This chart contains a scatter plot of the number of sessions in the date range.

```{r, echo=FALSE, warning=FALSE, error=FALSE, message=FALSE}
gadata <- google_analytics(id = ga_id,
                           start = date_start, end = date_end,
                           metrics = c("sessions"),
                           dimensions = c("date"),
                           max = 5000)

# scatter plot with trend line
ggplot(data = gadata, aes(x = date, y = sessions)) +
  geom_point() +
  geom_smooth() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

### Users engagement by device type

This chart contains a bar chart of avgSessionDuration by device type.

```{r, echo=FALSE, warning=FALSE, error=FALSE, message=FALSE}
gadata2 <- google_analytics(id = ga_id,
                            start = date_start, end = date_end,
                            metrics = c("sessions", "avgSessionDuration"),
                            dimensions = c("date", "deviceCategory"),
                            max = 5000)

# plot sessions with deviceCategory
ggplot(gadata2, aes(deviceCategory, sessions)) +
  geom_bar(aes(fill = deviceCategory), stat = "identity")

# plot avgSessionDuration with deviceCategory
ggplot(gadata2, aes(deviceCategory, avgSessionDuration)) +
  geom_bar(aes(fill = deviceCategory), stat = "identity")
```

Result
As a result, you'll get a complete HTML file with the report. You can also generate a PDF file.
For recurring reporting, you only need to change the dates :)
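If you would rather not edit the dates in the file each time, one option is to declare them as parameters in the document header and pass new values when rendering. Below is a minimal sketch of that idea using rmarkdown::render(). The file name demo_report.Rmd, the output file name and the parameter names date_start and date_end are hypothetical, and the tiny template does not query Google Analytics; it only shows how the parameters flow into the report.

```r
# Write a tiny parameterized template to a temporary file.
# The parameter names (date_start, date_end) are hypothetical.
rmd_path <- file.path(tempdir(), "demo_report.Rmd")
writeLines(c(
  '---',
  'title: "Traffic report"',
  'output: html_document',
  'params:',
  '  date_start: "2016-01-01"',
  '  date_end: "2016-06-30"',
  '---',
  '',
  '### Sessions from `r params$date_start` to `r params$date_end`'
), rmd_path)

# Render only when rmarkdown and pandoc are available on this machine.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_path,
                    output_file = "report-2016-07.html",
                    params = list(date_start = "2016-07-01",
                                  date_end = "2016-07-31"),
                    quiet = TRUE)
}
```

Inside the report, the values are available as params$date_start and params$date_end, so they could replace the date_start and date_end variables in the template above.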

Source code
The complete code for this example is in the GitHub repository:
github.com/michalbrys/R-Google-Analytics/blob/master/8_rmarkdown_report.Rmd

Additional analysis

Anomaly detection
Use the AnomalyDetection package: https://github.com/twitter/AnomalyDetection
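To get a feeling for the idea before installing the package, here is a simplified sketch in base R. Note that this is not the algorithm used by the AnomalyDetection package; it is a plain stand-in that removes the weekly pattern with stl() and flags days whose remainder is unusually far from the median. The data is made up, and the threshold of 3 robust standard deviations is an arbitrary assumption.

```r
set.seed(1)

# Synthetic daily sessions: 8 weeks with a weekly pattern plus noise (made-up values).
weekly_pattern <- c(100, 120, 125, 130, 120, 80, 70)
sessions <- rep(weekly_pattern, times = 8) + rnorm(56, sd = 5)
sessions[30] <- sessions[30] + 150   # inject one obvious spike on day 30

# Split the series into trend, weekly seasonality and remainder.
timeseries <- ts(sessions, frequency = 7)
parts <- stl(timeseries, s.window = "periodic", robust = TRUE)
remainder <- as.numeric(parts$time.series[, "remainder"])

# Flag days whose remainder is more than 3 robust standard deviations from the median.
threshold <- 3 * mad(remainder)
anomalies <- which(abs(remainder - median(remainder)) > threshold)
anomalies   # indexes of the days flagged as anomalies
```

With real data you would replace sessions with gadata$sessions from a query that uses date as the only dimension.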

Forecasting
We'll forecast future web traffic using the Holt-Winters method. Inspired by Richard Fergie.
# forecasting using Holt-Winters algorithm
# based on http://www.eanalytica.com/r-for-web-analysts/

# install libraries
# install.packages("googleAuthR")
# install.packages("googleAnalyticsR")
# install.packages("ggplot2")
# install.packages("forecast")
# install.packages("reshape2")

# load libraries
library("googleAuthR")
library("googleAnalyticsR")
library("ggplot2")
library("forecast")
library("reshape2")

# authorize connection with Google Analytics servers
ga_auth()

## pick a profile with data to query
#ga_id <- account_list[275,'viewId']
# or give it explicitly, in the format 99999999, using the tool
# http://michalbrys.github.io/ga-tools/table-id.html
ga_id <- 00000000

gadata <- google_analytics(id = ga_id,
                           start = "2016-05-01", end = "2016-06-30",
                           metrics = "sessions",
                           dimensions = "date",
                           max = 5000)

# weekly seasonality: 7 observations per period
timeseries <- ts(gadata$sessions, frequency = 7)
components <- decompose(timeseries)
plot(components)

# note the way we add a column to a data.frame
gadata$adjusted <- gadata$sessions - components$seasonal

forecastmodel <- HoltWinters(timeseries)
plot(forecastmodel)

# call the forecast() generic; forecast.HoltWinters() is not exported
# by newer versions of the forecast package
fc <- forecast(forecastmodel, h = 26) # 26 days in future
plot(fc, xlim = c(0, 13))

forecastdf <- as.data.frame(fc)
totalrows <- nrow(gadata) + nrow(forecastdf)
forecastdata <- data.frame(
  day = c(1:totalrows),
  actual = c(gadata$sessions, rep(NA, nrow(forecastdf))),
  forecast = c(rep(NA, nrow(gadata) - 1), tail(gadata$sessions, 1), forecastdf$"Point Forecast"),
  forecastupper = c(rep(NA, nrow(gadata) - 1), tail(gadata$sessions, 1), forecastdf$"Hi 80"),
  forecastlower = c(rep(NA, nrow(gadata) - 1), tail(gadata$sessions, 1), forecastdf$"Lo 80")
)

ggplot(forecastdata, aes(x = day)) +
  geom_line(aes(y = actual), color = "black") +
  geom_line(aes(y = forecast), color = "blue") +
  geom_ribbon(aes(ymin = forecastlower, ymax = forecastupper), alpha = 0.4, fill = "green") +
  xlim(c(0, 90)) +
  xlab("Day") +
  ylab("Sessions")

Result
As a result, you'll get a chart with predictions of your web traffic.
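If you only want to see how the Holt-Winters step behaves, here is a minimal sketch that needs no Google Analytics connection and no extra packages. It uses made-up data with a weekly pattern and the predict() method from base R instead of the forecast package; the 14-day horizon and the 80% interval are arbitrary choices for illustration.

```r
set.seed(7)

# Synthetic daily sessions: 8 weeks, weekly pattern, slight upward trend (made-up values).
weekly_pattern <- c(100, 120, 125, 130, 120, 80, 70)
sessions <- rep(weekly_pattern, times = 8) +
  seq(0, 27.5, length.out = 56) +
  rnorm(56, sd = 5)

# frequency = 7 tells R that one seasonal cycle is 7 observations (one week).
timeseries <- ts(sessions, frequency = 7)
model <- HoltWinters(timeseries)

# Predict 14 days ahead with an 80% prediction interval.
pred <- predict(model, n.ahead = 14, prediction.interval = TRUE, level = 0.8)
colnames(pred)   # point forecast plus upper and lower bounds
head(pred, 7)

# Plot the fitted model together with the predicted values.
plot(model, pred)
```

The frequency argument matters here: with daily data and frequency = 7, the model learns a weekly cycle, which is usually the strongest pattern in web traffic.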

Source code
The complete code for this example is in the GitHub repository:
https://github.com/michalbrys/R-Google-Analytics/blob/master/4_forecasting.R

Resources

Blogs
R Bloggers
Mark Edmondson
Richard Fergie
Michal Brys - blog
...

Documentation
R project - official website
ggplot2 - official website
googleAnalyticsR - R package
Google Analytics - for developers

Online trainings
To learn more about R, I recommend the Coursera MOOC:
R Programming by Johns Hopkins University

Books
A list of books where you can find inspiration for further analysis, with links to free online
versions:
Cookbook for R
www.cookbook-r.com
R for Data Science
r4ds.had.co.nz
Think Stats
greenteapress.com/thinkstats
greenteapress.com/thinkstats2 (2nd edition)
