
COMP9033

data analytics
1/13
introduction to data analytics
dr. donagh horgan
department of computer science
cork institute of technology
2015.09.15

Image: Andy Lamb

0.1 / overview

This week, we will cover:

1. Introduction to data analytics:
   - What it is, how it works.
   - Some examples from the real world.

2. Course outline:
   - Overview of topics.
   - Marking scheme.
   - Contact information.
   - Labs.

3. Introduction to statistics:
   - Populations and samples.
   - Distributions and histograms.
   - Statistical measures.

donagh.horgan@cit.ie

1.1 / what is data analytics?

Data analytics, or data analysis, is a tool for solving complex real world problems.

In recent years, it has become a hot topic, with notable uses in areas such as:

- Search engines, e.g. Google, Yahoo, DuckDuckGo.
- Speech recognition, e.g. Siri.
- Music fingerprinting, e.g. Shazam, SoundHound.
- Marketing, e.g. Tesco Clubcard, Amazon.
- Recommendation engines, e.g. Netflix, Spotify.
- Spam detection, e.g. Gmail.

It is often conflated with concepts such as statistics, machine learning and visualisation, but it is these things and much more!


1.2 / what is data analytics?

Formally, data analytics is the process of manipulating data to discover new information.

It is a scientific method, with defined steps and procedures:

- Observations are made.
- Hypotheses are formulated, refined, accepted and rejected.
- Generally applicable conclusions are reached.

It is also an art, often relying on subjective human judgement.

Informally, we can think of it as a way to translate measurements into understanding.


1.3 / translating measurements into understanding

system → measure → analyse → understand

Generally, we can think of data analysis as a step in a chain:


1. We have a system we want to understand.
2. We measure some data related to the system.
3. We analyse the data to gain insights.
4. We draw conclusions.


1.4 / example: weather forecasting

weather → barometer → analyse → forecast

Weather forecasting is a form of data analysis:


1. We want to make predictions about the weather.
2. We measure some data related to it, e.g. atmospheric pressure.
3. We analyse the data to extract information about the relationship.
4. We better understand how pressure affects weather, e.g. high pressure → sunshine!


1.5 / example: spam detection

email → keywords → analyse → spam

Spam detection is a form of data analysis:


1. We want to detect whether incoming emails are spam or not.
2. We measure some data related to this, e.g. keywords in the email text.
3. We analyse the data to extract information about the relationship.
4. We better understand which keywords are good indicators of spam.


1.6 / example: discount offers

customers → purchases → analyse → offers

Discount offers are often a result of data analysis:


1. We want customers to buy more.
2. We measure some data related to this, e.g. purchase history.
3. We analyse the data to extract information about the relationship.
4. We better understand which items to offer discounts on to encourage purchases.


1.7 / what is data analytics?

So what does data analysis actually involve?

It can take a variety of forms, including:

- Examination of statistical measures, e.g. mean, median, standard deviation.
- Visualisation of the data, e.g. histogram, scatter plot matrix.
- Transformation of the data into a form that makes further analysis more convenient.
- Building and testing models of the data.

Typically, multiple forms are used in combination to solve a given problem.

The key to solving problems effectively is to know which of these tools to use and
in what order.


2.1 / course outline

The aim of this course is to provide both a theoretical and a practical introduction
to data analysis techniques.

Over the coming weeks, we will cover a variety of topics:


1. Statistics.
2. Exploratory data analysis.
3. Visualisation.
4. Data munging.
5. Association rule mining.
6. Recommender systems.
7. Model selection and assessment.
8. Decision tree classification.
9. Clustering systems.
10. Linear regression.
11. Limits of data analysis.
12. Big data systems.


2.2 / marking scheme

[Chart: final exam 50%, project 50%]

Some important things to note:

- The project will be set in week 4 and is due in week 13.
- The lecture notes will include sample exam questions and answers.
- There will be a revision lecture in week 13.

For more information, see http://courses.cit.ie/index.cfm/page/module/moduleId/8829.


2.3 / weekly schedule

Each week, there will be a two-hour lecture and a lab assignment.

The lab work is ungraded, but you will need to understand it in order to do the project.

The course material is not hard, but there is a lot to cover, so you should spend at least four hours a week studying it outside of time spent at lectures and labs.

[Chart: weekly workload; lectures 2 hours, independent learning 4 hours, labs 1 hour]

For more information, see http://courses.cit.ie/index.cfm/page/module/moduleId/8829.


2.4 / contact information


What should you do if you have questions?

1. If you have a question about lecture notes, lab work, project work, or even something not related specifically to course content (e.g. you want to share an interesting link or blog post), post it to the Blackboard forum.
   - Everyone can see it.
   - It avoids repetition for me.
   - Sharing information is the fastest way for everyone to learn.
   - URL: http://blackboard.cit.ie.

2. If you have a problem, send me an email.
   - Please don't send emails about course material; let's try to keep this on Blackboard, so that everyone benefits.
   - Email: donagh.horgan@cit.ie.

I will try to reply to Blackboard posts and emails within 48 hours, but sometimes
it may take longer than this.

2.5 / copyright

Except where noted otherwise, the material (i.e. notes and code) for this course is licensed to you under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.

This means that you are free to share course material with anyone you like, as
long as you respect the terms of the license.

For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/.


2.6 / languages

The two most popular open source languages used for data analysis are R and
Python.

There are pros and cons to each:


                            R            Python
algorithm support           very large   large
difficulty                  moderate     easy
is useful outside domain    no           yes


In this course, we will use Python for data analysis.

Hopefully, this will save most people time or at least lessen the burden of having
to learn a new language.

If you don't know Python or haven't used it in some time, Codecademy offers a free introductory course that covers the basics (you can stop just before the bitwise operators section): https://www.codecademy.com/en/tracks/python.


2.7 / labs

The lab assignments involve the completion of data analysis exercises using IPython Notebook.

- IPython Notebook runs in your browser.
- It is not an IDE, but it is much better than writing scripts.
- It has built-in support for in-line graphics, which is very convenient.

Your student vDesktop has been configured for IPython Notebook usage, but you can also run the notebooks on your own machine. To do this, you will need to install the following software:

- Python 2.7.x
- IPython 3.x
- IPython Notebook
- Inkscape
- NumPy
- SciPy
- pandas
- matplotlib
- networkx
- scikit-learn
- lxml
- bs4

2.8 / introduction to ipython notebook

Starting IPython Notebook is easy.


On Windows:
../../code/lectures/01/ipython_windows

    > cd C:\Python27\Scripts
    > ipython notebook

On Linux:
../../code/lectures/01/ipython_linux

    $ sudo pip install --upgrade ipython[all]
    $ ipython notebook


2.9 / introduction to ipython notebook

Once you've got it started, try creating a new notebook like this:


2.10 / introduction to ipython notebook



Next, take a quick tour of the interface like this:


3.1 / introduction to statistics



Statistics are numerical measures that help us to characterise a data sample or the
general population from which it came.

They are an important part of data analysis and can provide powerful insights at
an early stage.

In the following slides, we will revise some core statistical measures, which will be
useful at later points in the course.

In each case, we will compute the statistics for a hypothetical collection, or sample, of data points, which we will denote by $X$.

If there are $n$ elements in $X$, then we can write it as

$$X = \{x_1, x_2, \ldots, x_n\}, \tag{1.1}$$

where $x_i$ is the $i$-th data point.



3.2 / populations and samples


[Figure: population and sample]

Before we talk about specific statistical measures, we need to define some terminology:

- Population: A statistical population is a complete set, representing the entire space of possible outcomes.
- Sample: A statistical sample is a subset of a population and so represents a number of the possible outcomes, but not the total.

3.3 / populations and samples


[Figure: population and sample]

It is important to distinguish between populations and samples:

- Populations describe all possible outcomes, while samples describe only a few.
- By random chance, a sample may look extremely different from the population it was drawn from (e.g. in the image above, the sample contains no triangles).
- As a result, conclusions based on an analysis of a population will always generalise, while conclusions based on a sample may not.
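
The difference between a population and a sample can be sketched in a few lines of Python. This is only an illustration, assuming Python 3 and a made-up population of shapes (the names `population`, `sample` and `proportions` are hypothetical, not part of the course code):

```python
import random

# Hypothetical population of shapes (illustrative data, not from the slides).
population = ["circle"] * 60 + ["square"] * 30 + ["triangle"] * 10

rng = random.Random(0)              # fixed seed so the run is repeatable
sample = rng.sample(population, 5)  # draw 5 items without replacement


def proportions(items):
    """Return the fraction of items belonging to each category."""
    return {kind: items.count(kind) / len(items) for kind in sorted(set(items))}


print("population:", proportions(population))
print("sample:    ", proportions(sample))
```

With such a small sample, the proportions printed for the sample will often differ noticeably from those of the population, and some categories may be missing altogether.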

3.4 / populations and samples


[Figure: population and sample]

We can think of statistical populations like the populations of countries, while samples are random groupings of people in those countries.

- A bad sample might be the subset of people who are criminals.
- Just because 100% of the sample are criminals, this does not mean that 100% of the population are.

3.5 / sample distributions


[Figure: histograms of samples A, B and C; x-axis: values of items in bins; y-axis: number of items/bin]

The most common way of visualising the distribution of the data in a sample is the histogram:

- Sample data are placed in fixed width bins, representing ranges of values.
- The bins are then visually represented by vertical bars.
- The taller the bar, the larger the number of data points in the bin.
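
The binning step can be sketched in plain Python. This is a minimal illustration, assuming Python 3; the function name `histogram`, its parameters and the sample data are hypothetical, and the bars are printed horizontally as text rather than drawn vertically:

```python
def histogram(data, bin_width, start):
    """Count how many data points fall into each fixed width bin.

    Bin k covers the half-open range
    [start + k * bin_width, start + (k + 1) * bin_width).
    The result maps the left edge of each non-empty bin to its count.
    """
    counts = {}
    for x in data:
        k = int((x - start) // bin_width)  # index of the bin containing x
        left_edge = start + k * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))


data = [1, 3, -4, 5, 10, 1, 2, 7, -1, 4]  # hypothetical sample
for left_edge, count in histogram(data, bin_width=5, start=-5).items():
    # One '#' per data point: the longer the bar, the fuller the bin.
    print(f"[{left_edge:>3}, {left_edge + 5:>3}): {'#' * count}")
```

In the labs, a plotting library such as matplotlib would normally do both the binning and the drawing.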

3.6 / sample distributions


[Figure: histograms of samples A, B and C; x-axis: values of items in bins; y-axis: number of items/bin]

Two important statistical measures of a distribution are its central tendency and its dispersion:

- Central tendency is a measure of the central value of the data.
- Dispersion measures the spread or variability of the data.


3.7 / sample distributions


[Figure: histograms of samples A, B and C; x-axis: values of items in bins; y-axis: number of items/bin]

Figure B above shows a sample with higher central tendency than Figure A, but a
similar level of dispersion.

Figure C shows a sample with the same central tendency as Figure B, but with a
higher level of dispersion.


3.8 / means and averages

One of the most commonly computed statistical measures of central tendency is the arithmetic mean, or average.

The arithmetic mean of the sample $X$ is denoted by $\bar{x}$ and is defined as

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \tag{1.2}$$

Other kinds of means do exist (e.g. geometric, harmonic, weighted), so be careful not to confuse definitions!

Generally, the term mean can be taken to refer to the arithmetic mean.


3.9 / example

Q. What is the mean of the sample $A = \{1, 3, -4, 5, 10\}$?

A. Using (1.2), we can compute the mean as

$$\bar{a} = \frac{1}{5}\left(1 + 3 + (-4) + 5 + 10\right) = \frac{15}{5} = 3.$$
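
The same calculation can be checked in Python. This is a small sketch, assuming Python 3 (where `/` is true division) and the sample A = {1, 3, -4, 5, 10}; `arithmetic_mean` is a hypothetical helper name, while `statistics.mean` is part of the standard library:

```python
from statistics import mean


def arithmetic_mean(xs):
    """Arithmetic mean as in (1.2): the sum of the values divided by their count."""
    return sum(xs) / len(xs)


A = [1, 3, -4, 5, 10]
print(arithmetic_mean(A))  # hand-written version of (1.2)
print(mean(A))             # standard library equivalent
```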


3.10 / weighted averages

A common variation on the mean is the weighted average.

This works by applying a weight to each element of the sample, and so can increase the effect of some elements and decrease the effect of others.

The weighted average of $X$ is also commonly denoted as $\bar{x}$ (be careful!) and is defined as

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}, \tag{1.3}$$

where $w_i$ denotes the $i$-th weight.

When all the weights are equal to one (i.e. $w_i = 1, \forall i$), the weighted average is equivalent to the average and so (1.3) simplifies to (1.2).


3.11 / example

Q. What is the weighted mean of the sample $A = \{1, 3, -4, 5, 10\}$ with the weights $W = \{2, 1, 0, 1, 0.5\}$?

A. Using (1.3), we can compute the weighted mean as

$$\bar{a} = \frac{(2 \times 1) + (1 \times 3) + (0 \times (-4)) + (1 \times 5) + (0.5 \times 10)}{2 + 1 + 0 + 1 + 0.5} = \frac{2 + 3 + 0 + 5 + 5}{4.5} = \frac{15}{4.5} \approx 3.33.$$
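
A minimal Python sketch of (1.3), assuming Python 3 and the sample and weights from this example (`weighted_mean` is a hypothetical helper name):

```python
def weighted_mean(xs, ws):
    """Weighted average as in (1.3): sum of w_i * x_i divided by the sum of w_i."""
    if len(xs) != len(ws):
        raise ValueError("each data point needs exactly one weight")
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)


A = [1, 3, -4, 5, 10]
W = [2, 1, 0, 1, 0.5]
print(round(weighted_mean(A, W), 2))

# With all weights equal to one, (1.3) reduces to the ordinary mean (1.2).
print(weighted_mean(A, [1] * len(A)))
```

Note that the zero weight removes the third data point from the calculation entirely.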


3.12 / median values

The median of a sample is an alternative measure of its central tendency, which is less sensitive to outliers than the mean.

It is defined as the middle value of the individual sample elements when arranged in ascending order.

The median value of $X$ is denoted by $\tilde{x}$ and depends on whether the total number of points, $n$, is odd or even:

- If $n$ is odd, then

$$\tilde{x} = x_{\frac{n+1}{2}}. \tag{1.4}$$

- If $n$ is even, then

$$\tilde{x} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}. \tag{1.5}$$


3.13 / example: median of an odd numbered sample

Q. What is the median of the sample $A = \{1, 3, -4, 5, 10\}$?

A. Before computing the median, we must first sort $A$ in ascending order:

$$A = \{-4, 1, 3, 5, 10\}.$$

As $A$ has an odd number of elements, we can compute the median using (1.4) as

$$\tilde{a} = a_{\frac{5+1}{2}} = a_{\frac{6}{2}} = a_3 = 3.$$

Remember, sorting $A$ in ascending order is a crucial step!

3.14 / example: median of an even numbered sample

Q. What is the median of the sample $A = \{1, 3, -4, 5, 10, 1\}$?

A. Before computing the median, we must first sort $A$ in ascending order:

$$A = \{-4, 1, 1, 3, 5, 10\}.$$

As $A$ has an even number of elements, we can compute the median using (1.5) as

$$\tilde{a} = \frac{a_{\frac{6}{2}} + a_{\frac{6}{2}+1}}{2} = \frac{a_3 + a_4}{2} = \frac{1 + 3}{2} = 2.$$
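
Both cases, (1.4) and (1.5), can be sketched in one short Python function. This assumes Python 3 and the two example samples used in the median examples; the name `median` is chosen for illustration (the standard library offers `statistics.median` for real use). Python lists are indexed from zero, so the 1-based positions in the formulas are shifted by one:

```python
def median(xs):
    """Median of a sample, following (1.4) for odd n and (1.5) for even n."""
    ordered = sorted(xs)  # sorting first is the crucial step
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the 1-based position (n + 1) / 2 is index mid.
        return ordered[mid]
    # Even n: average the 1-based positions n / 2 and n / 2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([1, 3, -4, 5, 10]))     # odd number of elements
print(median([1, 3, -4, 5, 10, 1]))  # even number of elements
```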

3.15 / the mode

The mode of a sample is defined as its most common value.

The mode is a further alternative method for estimating central tendency and is
useful in situations where the data is not numeric.
- For example, for the set of names {John, Isabelle, John, Mary}, the mode is John.

The mode of a sample $X$ is typically denoted as $\mathrm{Mo}_x$.

It is possible to have more than one mode, e.g. $A = \{0, 1, 1, 2, 2, 3\}$ has two modes, 1 and 2.


3.16 / example

Q. What is the mode of the sample $A = \{1, 3, -4, 5, 10, 1\}$?

A. The mode is defined as the most common value of the sample, and so we can just write

$$\mathrm{Mo}_A = 1.$$
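
A short Python sketch of the mode, assuming Python 3.7 or later (`modes` is a hypothetical helper name). It returns a list so that samples with more than one mode, and non-numeric samples, are handled in the same way:

```python
from collections import Counter


def modes(xs):
    """Return every value that occurs the maximum number of times."""
    counts = Counter(xs)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


print(modes([1, 3, -4, 5, 10, 1]))                 # one mode
print(modes([0, 1, 1, 2, 2, 3]))                   # two modes
print(modes(["John", "Isabelle", "John", "Mary"])) # non-numeric data
```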


3.17 / standard deviation

One of the most common measures of dispersion in a sample is the standard deviation.

The standard deviation of the sample $X$ is denoted by $\sigma_x$ and is defined as

$$\sigma_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}. \tag{1.6}$$

A related measure of dispersion is the variance of the sample, which is simply the square of the standard deviation, i.e.

$$\mathrm{Var}(X) = \sigma_x^2. \tag{1.7}$$


3.18 / example: standard deviation

Q. What is the standard deviation of the sample $A = \{1, 3, -4, 5, 10\}$?

A. To compute the standard deviation, we must first compute the mean of the data. Earlier, we computed this as $\bar{a} = 3$. The standard deviation can then be computed using (1.6) as

$$\begin{aligned}
\sigma_A &= \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (a_i - \bar{a})^2} \\
&= \sqrt{\frac{1}{4}\left((1-3)^2 + (3-3)^2 + ((-4)-3)^2 + (5-3)^2 + (10-3)^2\right)} \\
&= \sqrt{\frac{1}{4}\left(4 + 0 + 49 + 4 + 49\right)} \\
&\approx 5.14.
\end{aligned}$$

3.19 / example: variance

Q. What is the variance of the sample $A = \{1, 3, -4, 5, 10\}$?

A. The variance is simply the square of the standard deviation, so we can just write

$$\mathrm{Var}(A) = \sigma_A^2 = 26.5.$$
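
The standard deviation (1.6) and variance (1.7) can be checked with a short Python sketch, assuming Python 3 and the sample A = {1, 3, -4, 5, 10}. The helper names are hypothetical; the standard library's `statistics.stdev` and `statistics.variance` use the same n - 1 denominator. The slides quote the standard deviation to two decimal places as 5.14; the code prints one more digit:

```python
import math


def sample_variance(xs):
    """Variance with the n - 1 denominator, i.e. the square of (1.6)."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)


def sample_std(xs):
    """Standard deviation as in (1.6): the square root of the variance."""
    return math.sqrt(sample_variance(xs))


A = [1, 3, -4, 5, 10]
print(sample_variance(A))
print(round(sample_std(A), 3))
```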


3.20 / interquartile range

The interquartile range (IQR) of a sample is an alternative measure of its dispersion.

It is less sensitive to outliers than the standard deviation (just as the median is less sensitive to outliers than the mean).

It is defined as the difference between the upper and the lower quartiles, i.e.

$$\mathrm{IQR} = Q_3 - Q_1, \tag{1.8}$$

where $Q_3$ denotes the upper quartile and $Q_1$ denotes the lower quartile.


3.21 / interquartile range

The quartiles of an ordered set of data points are the three values that divide the
set into four groups, each containing approximately 25% of the sample data.
- The lower quartile ($Q_1$) is the middle value between the minimum value of the data and the median value of the data.
- The second quartile ($Q_2$) is simply the median of the data, which is the middle value between the minimum value of the data and the maximum value of the data.
- The upper quartile ($Q_3$) is the middle value between the median value of the data and the maximum value of the data.
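
One way to turn these definitions into code is to take the median of the lower half and of the upper half of the sorted data. The sketch below assumes Python 3 and that convention, with the median itself left out of both halves when n is odd; the slides do not fix this detail, other conventions exist, and library functions may return slightly different quartiles. The helper names are hypothetical:

```python
def _median_sorted(ordered):
    """Median of a list that is already sorted in ascending order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(xs):
    """Return (Q1, Q2, Q3) using the median-of-halves convention.

    Assumption: when n is odd, the median is excluded from both halves.
    Requires at least two data points.
    """
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return _median_sorted(lower), _median_sorted(ordered), _median_sorted(upper)


def iqr(xs):
    """Interquartile range as in (1.8): Q3 - Q1."""
    q1, _, q3 = quartiles(xs)
    return q3 - q1


A = [1, 3, -4, 5, 10]
print(quartiles(A))
print(iqr(A))
```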


3.22 / the standard score

The standard score, or z-score, is a measure of the number of standard deviations away from the mean a single data point is.

- A positive score indicates that the data point is above the mean.
- A negative score indicates that the data point is below the mean.
- The magnitude of the score indicates how far above or below the mean the point is.

The standard score of the data point $x_i$ is denoted by $z(x_i)$ and defined as

$$z(x_i) = \frac{x_i - \bar{x}}{\sigma_x}. \tag{1.9}$$

The standard score is often used to quantify how extreme a given data point is,
and can be a useful indicator that a given data point is an outlier.


3.23 / example

Q. What is the standard score of the third data point in the sample $A = \{1, 3, -4, 5, 10\}$?

A. Earlier, we computed the mean of $A$ as $\bar{a} = 3$ and the standard deviation as $\sigma_A \approx 5.14$. The standard score of $a_3$ can then be computed using (1.9) as

$$z(a_3) = \frac{a_3 - \bar{a}}{\sigma_A} \approx \frac{(-4) - 3}{5.14} \approx -1.36.$$
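
A small Python sketch of (1.9), assuming Python 3 and the sample A = {1, 3, -4, 5, 10} (`z_scores` is a hypothetical helper name). It computes the score of every data point at once:

```python
import math


def z_scores(xs):
    """Standard score (1.9) of every data point, using the n - 1 standard deviation."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return [(x - m) / s for x in xs]


A = [1, 3, -4, 5, 10]
z = z_scores(A)
print(round(z[2], 2))  # third data point; Python indices start at zero
```

Points below the mean get negative scores and points above it get positive ones, so the scores of a sample always sum to zero (up to rounding error).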


3.24 / dependence and correlation


The existence of a statistical relationship between different data samples is known as dependence.

Two samples are said to be dependent when the value of the first depends on the value of the second or vice-versa, e.g.

- The number of people wearing shorts on a given day depends on the weather.
- A team's position in a league depends on the number of goals they've scored.
- A student's score on a test (ideally) depends on the depth of their knowledge of the subject.

It's possible that multiple dependencies exist: in reality, a student's score on a test depends on more than just knowledge, e.g.

- How well they slept the night before.
- How much coffee they've drunk.
- The ambient temperature in the test hall.
- ...lots of other factors!



3.25 / dependence and correlation

Correlation is a measure of the dependence between two data samples.

If two samples are positively correlated, then their values tend to increase or decrease together.

- More of A leads to more of B: Sunshine → ice cream purchases.
- Less of A leads to less of B: Eating fewer calories → losing weight.

If two samples are negatively correlated, then the values in one tend to increase when the values in the other decrease, and vice versa.

- More of A leads to less of B: Smoking → lower life expectancy.
- Less of A leads to more of B: Less public transport → more congestion.

If two samples are uncorrelated, then there is no dependence between them.

- More of A leads to no change in B: Higher taxes ↛ temperature in June.
- Less of A leads to no change in B: Population of France ↛ days in a year.


3.26 / the pearson correlation coefficient


One popular way of computing correlation is by using the Pearson correlation coefficient.

The Pearson correlation coefficient between two data samples, $X$ and $Y$, is denoted by $r_{xy}$ and is defined as

$$r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{\sigma_x}\right) \left(\frac{y_i - \bar{y}}{\sigma_y}\right) \tag{1.10}$$

$$\phantom{r_{xy}} = \frac{1}{n-1} \sum_{i=1}^{n} z(x_i)\, z(y_i). \tag{1.11}$$

Other kinds of correlation coefficient do exist (e.g. Spearman, Kendall), so be careful not to confuse definitions.

Generally, the term correlation coefficient can be taken to mean Pearson correlation coefficient.

3.27 / the pearson correlation coefficient

The Pearson correlation coefficient has a value between -1 and +1, inclusive, i.e. $r_{xy} \in [-1, 1]$.

- Positive correlation corresponds to a positive value of $r_{xy}$, i.e. $r_{xy} > 0$.
- Negative correlation corresponds to a negative value of $r_{xy}$, i.e. $r_{xy} < 0$.
- If $r_{xy} = 1$, then $X$ and $Y$ have the strongest possible level of positive correlation.
- If $r_{xy} = -1$, then $X$ and $Y$ have the strongest possible level of negative correlation.
- If $r_{xy} = 0$, then $X$ and $Y$ are uncorrelated.

One handy property of the coefficient is that $r_{xy} = r_{yx}$, i.e. the correlation between $X$ and $Y$ is the same as the correlation between $Y$ and $X$; the order does not matter.


3.28 / example

Q. What is the correlation of the samples $A = \{1, 2, 3\}$ and $B = \{4, 5, 6\}$?

A. First, let's compute the means and standard deviations of the samples using (1.2) and (1.6):

$$\bar{a} = 2, \quad \sigma_A = 1, \quad \bar{b} = 5, \quad \sigma_B = 1.$$

Next, we compute the standard scores of the data points in each sample using (1.9):

$$z(a_1) = -1, \quad z(a_2) = 0, \quad z(a_3) = 1,$$

$$z(b_1) = -1, \quad z(b_2) = 0, \quad z(b_3) = 1.$$

Finally, we compute the correlation coefficient using (1.11):

$$r_{AB} = \frac{1}{n-1} \sum_{i=1}^{n} z(a_i)\, z(b_i) = \frac{((-1) \times (-1)) + (0 \times 0) + (1 \times 1)}{2} = 1.$$
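
The three steps of this example (means and standard deviations, standard scores, then (1.11)) can be sketched in Python. This assumes Python 3 and two samples of equal length whose standard deviations are not zero; all function names are hypothetical:

```python
import math


def mean(xs):
    """Arithmetic mean, as in (1.2)."""
    return sum(xs) / len(xs)


def sample_std(xs):
    """Standard deviation with the n - 1 denominator, as in (1.6)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


def z_scores(xs):
    """Standard score (1.9) of every data point."""
    m, s = mean(xs), sample_std(xs)
    return [(x - m) / s for x in xs]


def pearson(xs, ys):
    """Pearson correlation coefficient via the z-score form (1.11)."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same length")
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


A = [1, 2, 3]
B = [4, 5, 6]
print(pearson(A, B))        # B rises with A
print(pearson(A, B[::-1]))  # reversing B makes it fall as A rises
print(pearson(B, A))        # the order of the arguments does not matter
```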

3.29 / correlation is not causation

Dependence does not imply a causal relationship!

- Buying ice creams does not make the weather sunnier.
- Living a short life does not make you a smoker.
- Wearing shorts does not make the weather better.

If A and B appear to be correlated, then there are six possibilities:

1. A is caused by B.
2. B is caused by A.
3. A causes B and B causes A.
4. A and B are both caused by a hidden external factor, C.
5. A causes C which, in turn, causes B.
6. A and B are not actually correlated but, by random chance (e.g. poor sampling), they appear to be.

3.30 / correlation is not causation

Image: RedAndr/Wikipedia


3.31 / correlation is not causation

Image: Tyler Vigen


X.1 / summary

Lots of maths this week! Usually, there won't be so much.

- If you have questions, post on Blackboard!

Lab:

- Try it out, see how far you get.
- If you're stuck on Python, check out the Codecademy course.
- If you're stuck on statistics, check out the Khan Academy course.
- If you have questions, post on Blackboard!

Next week:

- Data mining processes.
- Data formats.
- Exploratory data analysis.


X.2 / optional further reading

course material: Module information, Blackboard
statistics: Khan Academy
python: Python course on Codecademy, IPython documentation
blogs: The Guardian Datablog, FlowingData, Information is Beautiful
videos: Hans Rosling's TED Talk on statistics
