Lecture 01

COMP9033 data analytics
1/13
introduction to data analytics
dr. donagh horgan
department of computer science
cork institute of technology
2015.09.15
Image: Andy Lamb
0.1 / overview
This week, we will cover:

1. Introduction to data analytics:
   - What it is, how it works.
   - Some examples from the real world.
2. Course outline:
   - Overview of topics.
   - Marking scheme.
   - Contact information.
   - Labs.
3. Introduction to statistics:
   - Populations and samples.
   - Distributions and histograms.
   - Statistical measures.
donagh.horgan@cit.ie
1.1 / what is data analytics?
Data analytics, or data analysis, is a tool for solving complex real-world problems.
In recent years, it has become a hot topic, with notable uses in areas such as:
- Search engines, e.g. Google, Yahoo, DuckDuckGo.
- Speech recognition, e.g. Siri.
- Music fingerprinting, e.g. Shazam, SoundHound.
- Marketing, e.g. Tesco Clubcard, Amazon.
- Recommendation engines, e.g. Netflix, Spotify.
- Spam detection, e.g. Gmail.
It is often conflated with concepts such as statistics, machine learning and visualisation, but it is these things and much more!
1.2 / what is data analytics?
Formally, data analytics is the process of manipulating data to discover new information.
It is a scientific method, with defined steps and procedures:
- Observations are made.
- Hypotheses are formulated, refined, accepted and rejected.
- Generally applicable conclusions are reached.
It is also an art, often relying on subjective human judgement.
Informally, we can think of it as a way to translate measurements into understanding.
1.3 / translating measurements into understanding
system → measure → analyse → understand
Generally, we can think of data analysis as a step in a chain:

1. We have a system we want to understand.
2. We measure some data related to the system.
3. We analyse the data to gain insights.
4. We draw conclusions.
1.4 / example: weather forecasting
weather → barometer → analyse → forecast
Weather forecasting is a form of data analysis:

1. We want to make predictions about the weather.
2. We measure some data related to it, e.g. atmospheric pressure.
3. We analyse the data to extract information about the relationship.
4. We better understand how pressure affects weather, e.g. high pressure → sunshine!
1.5 / example: spam detection
email → keywords → analyse → spam
Spam detection is a form of data analysis:

1. We want to detect whether incoming emails are spam or not.
2. We measure some data related to this, e.g. keywords in the email text.
3. We analyse the data to extract information about the relationship.
4. We better understand how certain keywords are good indicators of spam.
1.6 / example: discount offers
customers → purchases → analyse → offers
Discount offers are often a result of data analysis:

1. We want customers to buy more.
2. We measure some data related to this, e.g. purchase history.
3. We analyse the data to extract information about the relationship.
4. We better understand which items to offer discounts on to encourage purchases.
1.7 / what is data analytics?
So what does data analysis actually involve?
It can take a variety of forms, including:
- Examination of statistical measures, e.g. mean, median, standard deviation.
- Visualisation of the data, e.g. histogram, scatter plot matrix.
- Transformation of the data into a form that makes further analysis more convenient.
- Building and testing models of the data.
Typically, multiple forms are used in combination to solve a given problem.
The key to solving problems effectively is to know which of these tools to use and in what order.
2.1 / course outline
The aim of this course is to provide both a theoretical and a practical introduction
to data analysis techniques.
Over the coming weeks, we will cover a variety of topics:

1. Statistics.
2. Exploratory data analysis.
3. Visualisation.
4. Data munging.
5. Association rule mining.
6. Recommender systems.
7. Model selection and assessment.
8. Decision tree classification.
9. Clustering systems.
10. Linear regression.
11. Limits of data analysis.
12. Big data systems.
2.2 / marking scheme
final exam: 50%
project: 50%
Some important things to note:
- The project will be set in week 4 and is due in week 13.
- The lecture notes will include sample exam questions and answers.
- There will be a revision lecture in week 13.
For more information, see http://courses.cit.ie/index.cfm/page/module/moduleId/8829.
2.3 / weekly schedule
Each week, there will be a two-hour lecture and a lab assignment.
The lab work is ungraded, but you will need to understand it in order to do the project.
The course material is not hard, but there is a lot to cover: you should spend at least four hours a week studying it outside of time spent at lectures and labs.
lectures: 2 hours
labs: 1 hour
independent learning: 4 hours
For more information, see http://courses.cit.ie/index.cfm/page/module/moduleId/8829.
2.4 / contact information

What should you do if you have questions?

1. If you have a question about lecture notes, lab work, project work, or even something not related specifically to course content (e.g. you want to share an interesting link or blog post), post it to the Blackboard forum.
   - Everyone can see it.
   - Avoids repetition for me.
   - Sharing information is the fastest way for everyone to learn.
   - URL: http://blackboard.cit.ie.
2. If you have a problem, send me an email.
   - Please don't send emails about course material; let's try to keep this stuff on Blackboard, so that everyone benefits.
   - Email: donagh.horgan@cit.ie.
I will try to reply to Blackboard posts and emails within 48 hours, but sometimes it may take longer than this.
2.5 / copyright
Except where noted otherwise, the material (i.e. notes and code) for this course is licensed to you under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.
This means that you are free to share course material with anyone you like, as
long as you respect the terms of the license.
For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/.
2.6 / languages
The two most popular open source languages used for data analysis are R and Python.
There are pros and cons to each:

                           R            Python
algorithm support          very large   large
difficulty                 moderate     easy
is useful outside domain   no           yes

In this course, we will use Python for data analysis.
Hopefully, this will save most people time, or at least lessen the burden of having to learn a new language.
If you don't know Python or haven't used it in some time, Codecademy offers a free introductory course that covers the basics (you can stop just before the bitwise operators section): https://www.codecademy.com/en/tracks/python.
2.7 / labs
The lab assignments involve the completion of data analysis exercises using IPython Notebook.
- IPython Notebook runs in your browser.
- Not an IDE, but much better than writing scripts.
- Has built-in support for in-line graphics, very convenient.
Your student vDesktop has been configured for IPython Notebook usage, but you can also run the notebooks on your own machine. To do this, you will need to install the following software:
- Python 2.7.x
- IPython 3.x
- IPython Notebook
- Inkscape
- NumPy
- SciPy
- pandas
- matplotlib
- networkx
- scikit-learn
- lxml
- bs4
2.8 / introduction to ipython notebook
Starting IPython Notebook is easy.

On Windows:
../../code/lectures/01/ipython_windows
    > cd C:\Python27\Scripts
    > ipython notebook
On Linux:
../../code/lectures/01/ipython_linux
    $ sudo pip install --upgrade ipython[all]
    $ ipython notebook
Once you've got it started, try creating a new notebook like this:

Next, take a quick tour of the interface like this:
3.1 / introduction to statistics

Statistics are numerical measures that help us to characterise a data sample or the general population from which it came.
They are an important part of data analysis and can provide powerful insights at an early stage.
In the following slides, we will revise some core statistical measures, which will be useful at later points in the course.
In each case, we will compute the statistics for a hypothetical collection, or sample, of data points, which we will denote by $X$.
If there are $n$ elements in $X$, then we can write it as
$$X = \{x_1, x_2, \ldots, x_n\}, \tag{1.1}$$
where $x_i$ is the $i$th data point.

3.2 / populations and samples

[Figure: a population and a sample drawn from it.]
Before we talk about specific statistical measures, we need to define some terminology:
- Population: A statistical population is a complete set, representing the entire space of possible outcomes.
- Sample: A statistical sample is a subset of a population and so represents a number of the possible outcomes, but not the total.

It is important to distinguish between populations and samples:
- Populations describe all possible outcomes, while samples describe only a few.
- By random chance, a sample may look extremely different from the population it was drawn from (e.g. in the image above, the sample contains no triangles).
- As a result, conclusions based on an analysis of a population will always generalise, while conclusions based on a sample may not.

We can think of statistical populations like the populations of countries, while samples are random groupings of people in those countries.
- A bad sample might be the subset of people who are criminals.
- Just because 100% of the sample are criminals, this does not mean that 100% of the population are.
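The difference between a population and a sample can be sketched in a few lines of Python. The population of shapes below is made up for illustration, and `draw_sample` and `proportion` are hypothetical helper names rather than part of the course code. The point is only that a small random sample can under-represent, or entirely miss, a category that exists in the population.

```python
import random

# Hypothetical population: 20 shapes, of which only 2 are triangles.
population = ["circle"] * 10 + ["square"] * 8 + ["triangle"] * 2


def draw_sample(items, k, seed=None):
    """Return k items drawn at random from items, without replacement."""
    rng = random.Random(seed)
    return rng.sample(items, k)


def proportion(items, value):
    """Fraction of items equal to value."""
    # float(...) avoids integer division under Python 2.7.
    return items.count(value) / float(len(items))


sample = draw_sample(population, 5, seed=1)
print("triangles in population: {}".format(proportion(population, "triangle")))
print("triangles in sample:     {}".format(proportion(sample, "triangle")))
```

Changing the seed gives a different sample, and the sample proportion will usually differ from the population proportion of 0.1.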
3.5 / sample distributions

[Figure: histograms of three samples, A, B and C. Horizontal axis: values of items in bins. Vertical axis: number of items per bin.]
The most common way of visualising the distribution of the data in a sample is the histogram:
- Sample data are placed in fixed width bins, representing ranges of values.
- The bins are then visually represented by vertical bars.
- The taller the bar, the larger the number of data points in the bin.
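The binning step can be sketched in plain Python as follows. This is a minimal illustration, not the plotting approach used in the labs: the function name `histogram`, its parameters and the sample values are all assumptions made for the example, and the bars are drawn sideways as rows of `#` characters.

```python
import math


def histogram(data, bin_width, start=None):
    """Count how many data points fall into each fixed-width bin.

    Returns a list of (bin_lower_edge, count) pairs. Each bin covers
    the half-open range [edge, edge + bin_width).
    """
    if start is None:
        # Align the first bin edge to a multiple of bin_width.
        start = math.floor(min(data) / float(bin_width)) * bin_width
    n_bins = int(math.floor((max(data) - start) / float(bin_width))) + 1
    counts = [0] * n_bins
    for x in data:
        index = int(math.floor((x - start) / float(bin_width)))
        counts[index] += 1
    return [(start + i * bin_width, c) for i, c in enumerate(counts)]


# Hypothetical sample, chosen only for illustration.
sample = [1, 2, 2, 3, 6, 7, 7, 7, 11, 14]
for edge, count in histogram(sample, bin_width=5):
    # One row per bin: the longer the row, the more points in the bin.
    print("[{}, {}) {}".format(edge, edge + 5, "#" * count))
```

With a bin width of 5, the ten points fall into three bins starting at 0, 5 and 10, holding 4, 4 and 2 points respectively.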

Two important statistical measures of a distribution are its central tendency and its dispersion:
- Central tendency is a measure of the central value of the data.
- Dispersion measures the spread or variability of the data.

[Figure: histograms of three samples, A, B and C.]
Figure B above shows a sample with higher central tendency than Figure A, but a similar level of dispersion.
Figure C shows a sample with the same central tendency as Figure B, but with a higher level of dispersion.
3.8 / means and averages
One of the most commonly computed statistical measures of central tendency is the arithmetic mean, or average.
The arithmetic mean of the sample $X$ is denoted by $\bar{x}$ and is defined as
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \tag{1.2}$$
Other kinds of means do exist (e.g. geometric, harmonic, weighted), so be careful not to confuse definitions!
Generally, the term mean can be taken to refer to the arithmetic mean.
3.9 / example
Q. What is the mean of the sample $A = \{1, 3, -4, 5, 10\}$?

A. Using (1.2), we can compute the mean as
$$\bar{a} = \frac{1}{5} \left(1 + 3 + (-4) + 5 + 10\right) = \frac{15}{5} = 3.$$
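The same calculation can be checked in Python. This is a minimal sketch of (1.2); the function name `mean` is our own choice for the example.

```python
def mean(xs):
    """Arithmetic mean, as in (1.2): the sum of the points divided by n."""
    # float(...) avoids integer division under Python 2.7.
    return sum(xs) / float(len(xs))


A = [1, 3, -4, 5, 10]
print(mean(A))  # 3.0
```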
3.10 / weighted averages
A common variation on the mean is the weighted average.
This works by applying a weight to each element of the sample, and so can increase the effect of some elements and decrease the effect of others.
The weighted average of $X$ is also commonly denoted as $\bar{x}$ (be careful!) and is defined as
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}, \tag{1.3}$$
where $w_i$ denotes the $i$th weight.
- When all the weights are equal to one (i.e. $w_i = 1, \forall i$), the weighted average is equivalent to the average and so (1.3) simplifies to (1.2).
3.11 / example
Q. What is the weighted mean of the sample $A = \{1, 3, -4, 5, 10\}$ with the weights $W = \{2, 1, 0, 1, 0.5\}$?
A. Using (1.3), we can compute the weighted mean as
$$\bar{a} = \frac{(2 \times 1) + (1 \times 3) + (0 \times (-4)) + (1 \times 5) + (0.5 \times 10)}{2 + 1 + 0 + 1 + 0.5} = \frac{2 + 3 + 0 + 5 + 5}{4.5} = \frac{15}{4.5} \approx 3.33.$$
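A short Python sketch of (1.3) reproduces this result and also shows the equal-weights case reducing to the ordinary mean. The function name `weighted_mean` is an assumption made for the example.

```python
def weighted_mean(xs, ws):
    """Weighted average, as in (1.3): sum of w_i * x_i over sum of w_i."""
    if len(xs) != len(ws):
        raise ValueError("xs and ws must have the same length")
    # float(...) avoids integer division under Python 2.7.
    return sum(w * x for w, x in zip(ws, xs)) / float(sum(ws))


A = [1, 3, -4, 5, 10]
W = [2, 1, 0, 1, 0.5]
print(round(weighted_mean(A, W), 2))  # 3.33
print(weighted_mean(A, [1] * len(A)))  # 3.0, the same as the ordinary mean
```

Note that the weight of zero removes the third point from the calculation entirely.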
3.12 / median values
The median of a sample is an alternative measure of its central tendency, which is less sensitive to outliers than the mean.
It is defined as the middle value of the individual sample elements when arranged in ascending order.
The median value of $X$ is denoted by $\tilde{x}$ and depends on whether the total number of points, $n$, is odd or even:
- If $n$ is odd, then
$$\tilde{x} = x_{\frac{n+1}{2}}. \tag{1.4}$$
- If $n$ is even, then
$$\tilde{x} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}. \tag{1.5}$$
3.13 / example: median of an odd numbered sample
Q. What is the median of the sample $A = \{1, 3, -4, 5, 10\}$?

A. Before computing the median, we must first sort $A$ in ascending order:
$$A = \{-4, 1, 3, 5, 10\}.$$
As $A$ has an odd number of elements, we can compute the median using (1.4) as
$$\tilde{a} = a_{\frac{5+1}{2}} = a_{\frac{6}{2}} = a_3 = 3.$$
Remember, sorting $A$ in ascending order is a crucial step!
3.14 / example: median of an even numbered sample
Q. What is the median of the sample $A = \{1, 3, -4, 5, 10, 1\}$?

A. Before computing the median, we must first sort $A$ in ascending order:
$$A = \{-4, 1, 1, 3, 5, 10\}.$$
As $A$ has an even number of elements, we can compute the median using (1.5) as
$$\tilde{a} = \frac{a_{\frac{6}{2}} + a_{\frac{6}{2}+1}}{2} = \frac{a_3 + a_4}{2} = \frac{1 + 3}{2} = 2.$$
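Both cases can be handled by one small Python function. This is a sketch of (1.4) and (1.5) under our own naming; note that the equations use 1-based indices while Python lists are 0-based, so element $(n+1)/2$ becomes index `n // 2`.

```python
def median(xs):
    """Median of a sample, following (1.4) for odd n and (1.5) for even n."""
    ordered = sorted(xs)  # sorting first is the crucial step
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle element.
        return ordered[mid]
    # Even n: the average of the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2.0


print(median([1, 3, -4, 5, 10]))     # 3
print(median([1, 3, -4, 5, 10, 1]))  # 2.0
```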
3.15 / the mode
The mode of a sample is defined as its most common value.
The mode is a further alternative method for estimating central tendency and is useful in situations where the data is not numeric.
- For example, for the set of names {John, Isabelle, John, Mary}, the mode is John.
The mode of a sample $X$ is typically denoted as $\mathrm{Mo}_x$.
It is possible to have more than one mode, e.g. $A = \{0, 1, 1, 2, 2, 3\}$ has two modes, 1 and 2.
3.16 / example
Q. What is the mode of the sample $A = \{1, 3, -4, 5, 10, 1\}$?

A. The mode is defined as the most common value of the sample, and so we can just write
$$\mathrm{Mo}_A = 1.$$
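Because a sample can have more than one mode, a Python sketch is most useful if it returns all of them. The function name `modes` is an assumption for this example; it relies on `collections.Counter` from the standard library to count occurrences.

```python
from collections import Counter


def modes(xs):
    """Return every value that occurs most often in xs, in sorted order."""
    counts = Counter(xs)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([1, 3, -4, 5, 10, 1]))                 # [1]
print(modes([0, 1, 1, 2, 2, 3]))                   # [1, 2]
print(modes(["John", "Isabelle", "John", "Mary"]))  # ['John']
```

The last call shows that the mode works for non-numeric data, where the mean and median are not defined.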
3.17 / standard deviation
One of the most common measures of dispersion in a sample is the standard deviation.
The standard deviation of the sample $X$ is denoted by $\sigma_x$ and is defined as
$$\sigma_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}. \tag{1.6}$$
A related measure of dispersion is the variance of the sample, which is simply the square of the standard deviation, i.e.
$$\mathrm{Var}(X) = \sigma_x^2. \tag{1.7}$$
3.18 / example: standard deviation
Q. What is the standard deviation of the sample $A = \{1, 3, -4, 5, 10\}$?

A. To compute the standard deviation, we must first compute the mean of the data. Earlier, we computed this as $\bar{a} = 3$. The standard deviation can then be computed using (1.6) as
$$\begin{aligned}
\sigma_A &= \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (a_i - \bar{a})^2} \\
&= \sqrt{\frac{1}{4} \left((1-3)^2 + (3-3)^2 + ((-4)-3)^2 + (5-3)^2 + (10-3)^2\right)} \\
&= \sqrt{\frac{1}{4} (4 + 0 + 49 + 4 + 49)} \\
&\approx 5.15.
\end{aligned}$$
3.19 / example: variance
Q. What is the variance of the sample $A = \{1, 3, -4, 5, 10\}$?

A. The variance is simply the square of the standard deviation, so we can just write
$$\mathrm{Var}(A) = \sigma_A^2 = 26.5.$$
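The two worked examples above can be reproduced with a short Python sketch of (1.6) and (1.7). The function names are our own; the important detail is the division by $n - 1$ rather than $n$, which matches the definition used in these notes.

```python
import math


def variance(xs):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    m = sum(xs) / float(n)  # float(...) avoids integer division in Python 2.7
    return sum((x - m) ** 2 for x in xs) / (n - 1)


def std_dev(xs):
    """Sample standard deviation, as in (1.6): the square root of the variance."""
    return math.sqrt(variance(xs))


A = [1, 3, -4, 5, 10]
print(variance(A))  # 26.5
print(std_dev(A))   # roughly 5.148
```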
3.20 / interquartile range
The interquartile range (IQR) of a sample is an alternative measure of its dispersion.
It is less sensitive to outliers than the standard deviation (just as the median is less sensitive to outliers than the mean).
It is defined as the difference between the upper and the lower quartiles, i.e.
$$\mathrm{IQR} = Q_3 - Q_1, \tag{1.8}$$
where $Q_3$ denotes the upper quartile and $Q_1$ denotes the lower quartile.
3.21 / interquartile range
The quartiles of an ordered set of data points are the three values that divide the set into four groups, each containing approximately 25% of the sample data.
- The lower quartile ($Q_1$) is the middle value between the minimum value of the data and the median value of the data.
- The second quartile ($Q_2$) is simply the median of the data, which is the middle value between the minimum value of the data and the maximum value of the data.
- The upper quartile ($Q_3$) is the middle value between the median value of the data and the maximum value of the data.
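Quartile conventions differ between textbooks and software packages, so the following Python sketch should be read as one possible interpretation of the description above rather than a definitive method. It assumes that $Q_1$ is the median of the lower half of the sorted data and $Q_3$ is the median of the upper half, leaving out the middle point when $n$ is odd. The function names are hypothetical.

```python
def median(xs):
    """Median of a sample (sorts a copy of xs first)."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def quartiles(xs):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]       # points below the median position
    upper = ordered[n - half:]   # points above the median position
    return median(lower), median(ordered), median(upper)


def iqr(xs):
    """Interquartile range, as in (1.8)."""
    q1, _, q3 = quartiles(xs)
    return q3 - q1


A = [1, 3, -4, 5, 10]
print(quartiles(A))  # (-1.5, 3, 7.5)
print(iqr(A))        # 9.0
```

Other conventions (for example, interpolating between points) can give slightly different quartiles for the same data, especially for small samples.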
3.22 / the standard score
The standard score, or z-score, is a measure of the number of standard deviations away from the mean a single data point is.
- A positive score indicates that the data point is above the mean.
- A negative score indicates that the data point is below the mean.
- The magnitude of the score indicates how far above or below the mean the point is.
The standard score of the data point $x_i$ is denoted by $z(x_i)$ and defined as
$$z(x_i) = \frac{x_i - \bar{x}}{\sigma_x}. \tag{1.9}$$
The standard score is often used to quantify how extreme a given data point is, and can be a useful indicator that a given data point is an outlier.
3.23 / example
Q. What is the standard score of the third data point in the sample $A = \{1, 3, -4, 5, 10\}$?
A. Earlier, we computed the mean of $A$ as $\bar{a} = 3$ and the standard deviation as $\sigma_A \approx 5.15$. The standard score of $a_3$ can then be computed using (1.9) as
$$z(a_3) = \frac{a_3 - \bar{a}}{\sigma_A} \approx \frac{(-4) - 3}{5.15} \approx -1.36.$$
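A minimal Python sketch of (1.9) computes the standard score of every point in a sample at once. The function name `z_scores` is an assumption for the example, and the standard deviation uses the $n - 1$ divisor from (1.6).

```python
import math


def z_scores(xs):
    """Standard score of each point in xs, as in (1.9)."""
    n = len(xs)
    m = sum(xs) / float(n)
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    # Note: s is zero if every point is identical, so this would fail.
    return [(x - m) / s for x in xs]


A = [1, 3, -4, 5, 10]
print(round(z_scores(A)[2], 2))  # -1.36 (third point, index 2)
```

The third point lies well over one standard deviation below the mean, while the remaining scores show how the other points compare.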
3.24 / dependence and correlation

The existence of a statistical relationship between different data samples is known as dependence.
Two samples are said to be dependent when the value of the first depends on the value of the second or vice-versa, e.g.
- The number of people wearing shorts on a given day depends on the weather.
- A team's position in a league depends on the number of goals they've scored.
- A student's score on a test (ideally) depends on the depth of their knowledge of the subject.
It's possible that multiple dependencies exist: in reality, a student's score on a test depends on more than just knowledge, e.g.
- How well they slept the night before.
- How much coffee they've drunk.
- The ambient temperature in the test hall.
- ...lots of other factors!

3.25 / dependence and correlation
Correlation is a measure of the dependence between two data samples.
If two samples are positively correlated, then their values tend to increase or decrease together.
- More of A leads to more of B: Sunshine → ice cream purchases.
- Less of A leads to less of B: Eating fewer calories → losing weight.
If two samples are negatively correlated, then the values in one tend to increase/decrease when the values in the other decrease/increase.
- More of A leads to less of B: Smoking → lower life expectancy.
- Less of A leads to more of B: Less public transport → more congestion.
If two samples are uncorrelated, then there is no linear dependence between them.
- More of A leads to no change in B: Higher taxes → temperature in June.
- Less of A leads to no change in B: Population of France → days in a year.
3.26 / the pearson correlation coefficient

One popular way of computing correlation is by using the Pearson correlation coefficient.
The Pearson correlation coefficient between two data samples, $X$ and $Y$, is denoted by $r_{xy}$ and is defined as
$$r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{\sigma_x}\right) \left(\frac{y_i - \bar{y}}{\sigma_y}\right) \tag{1.10}$$
$$r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} z(x_i) \, z(y_i). \tag{1.11}$$
Other kinds of correlation coefficient do exist (e.g. Spearman, Kendall), so be careful not to confuse definitions.
Generally, the term correlation coefficient can be taken to mean Pearson correlation coefficient.
3.27 / the pearson correlation coefficient
The Pearson correlation coefficient has a value between -1 and +1, inclusive, i.e. $r_{xy} \in [-1, 1]$.
- Positive correlation corresponds to a positive value of $r_{xy}$, i.e. $r_{xy} > 0$.
- Negative correlation corresponds to a negative value of $r_{xy}$, i.e. $r_{xy} < 0$.
- If $r_{xy} = 1$, then $X$ and $Y$ have the strongest possible level of positive correlation.
- If $r_{xy} = -1$, then $X$ and $Y$ have the strongest possible level of negative correlation.
- If $r_{xy} = 0$, then $X$ and $Y$ are uncorrelated.
One handy property of the coefficient is that $r_{xy} = r_{yx}$, i.e. the correlation between $X$ and $Y$ is the same as the correlation between $Y$ and $X$: the order does not matter.
3.28 / example
Q. What is the correlation of the samples $A = \{1, 2, 3\}$ and $B = \{4, 5, 6\}$?

A. First, let's compute the means and standard deviations of the samples using (1.2) and (1.6):
$$\bar{a} = 2, \quad \sigma_A = 1.$$
$$\bar{b} = 5, \quad \sigma_B = 1.$$
Next, we compute the standard scores of the data points in each sample using (1.9):
$$z(a_1) = -1, \quad z(a_2) = 0, \quad z(a_3) = 1.$$
$$z(b_1) = -1, \quad z(b_2) = 0, \quad z(b_3) = 1.$$
Finally, we compute the correlation coefficient using (1.11):
$$r_{AB} = \frac{1}{n-1} \sum_{i=1}^{n} z(a_i) \, z(b_i) = \frac{((-1) \times (-1)) + (0 \times 0) + (1 \times 1)}{2} = 1.$$
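The steps of this example translate directly into Python. The sketch below follows (1.11), computing the standard scores first and then averaging their products with an $n - 1$ divisor; the function names are our own.

```python
import math


def z_scores(xs):
    """Standard score of each point in xs, as in (1.9)."""
    n = len(xs)
    m = sum(xs) / float(n)
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return [(x - m) / s for x in xs]


def pearson(xs, ys):
    """Pearson correlation coefficient, as in (1.11)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


A = [1, 2, 3]
B = [4, 5, 6]
print(pearson(A, B))          # 1.0
print(pearson(A, [6, 5, 4]))  # -1.0
```

Reversing the order of the second sample flips the sign, giving the strongest possible negative correlation, and swapping the two arguments leaves the result unchanged.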
3.29 / correlation is not causation
Dependence does not imply a causal relationship!

- Buying ice creams does not make the weather sunnier.
- Living a short life does not make you a smoker.
- Wearing shorts does not make the weather better.
If A and B are correlated, then there are six possibilities:

1. A is caused by B.
2. B is caused by A.
3. A causes B and B causes A.
4. A and B are both caused by a hidden external factor, C .
5. A causes C which, in turn, causes B.
6. A and B are not correlated but, by random chance (e.g. poor sampling), they appear
correlated.
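Possibility 4 can be illustrated with a small Python sketch. All of the numbers below are invented for the example: daily temperature plays the role of the hidden factor C, and both ice cream sales (A) and the number of people wearing shorts (B) are made to rise with it. Neither A nor B causes the other, yet they are strongly correlated.

```python
import math


def z_scores(xs):
    """Standard score of each point in xs, as in (1.9)."""
    n = len(xs)
    m = sum(xs) / float(n)
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return [(x - m) / s for x in xs]


def pearson(xs, ys):
    """Pearson correlation coefficient, as in (1.11)."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


# Hypothetical data for six days (made up for illustration).
temperature = [12, 15, 18, 21, 24, 27]   # C: the hidden factor
ice_creams = [40, 44, 57, 61, 75, 80]    # A: driven by temperature
shorts = [14, 22, 25, 33, 37, 45]        # B: also driven by temperature

print(pearson(ice_creams, shorts))       # close to +1
print(pearson(temperature, ice_creams))  # close to +1
print(pearson(temperature, shorts))      # close to +1
```

A high coefficient between ice cream sales and shorts tells us only that the two move together; it cannot tell us which of the six possibilities is responsible.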
Image: RedAndr/Wikipedia
Image: Tyler Vigen
X.1 / summary
Lots of maths this week! Usually, there won't be so much.
- If you have questions, post on Blackboard!
Lab:
- Try it out, see how far you get.
- If you're stuck on Python, check out the Codecademy course.
- If you're stuck on statistics, check out the Khan Academy course.
- If you have questions, post on Blackboard!
Next week:
- Data mining processes.
- Data formats.
- Exploratory data analysis.
X.2 / optional further reading
course material
- Module information
- Blackboard
statistics
- Khan Academy
python
- Python course on Codecademy
- IPython documentation
blogs
- The Guardian Datablog
- FlowingData
- Information is Beautiful
videos
- Hans Rosling's TED Talk on statistics
