data analytics
introduction to data
analytics
dr. donagh horgan
department of computer science
cork institute of technology
2015.09.15
0.1 / overview
2. Course outline:
   Overview of topics.
   Marking scheme.
   Contact information.
   Labs.
3. Introduction to statistics:
   Statistical measures.
donagh.horgan@cit.ie
Data analytics, or data analysis, is a tool for solving complex real-world problems.
In recent years, it has become a hot topic, with notable uses in areas such as
[Diagram: system → measure → analyse → understand]
[Diagram: weather → barometer → analyse → forecast]
[Diagram: keywords → analyse → spam]
[Diagram: customers → purchases → analyse → offers]
Transformation of the data into a form that makes further analysis more convenient.
The key to solving problems effectively is to know which of these tools to use and
in what order.
The aim of this course is to provide both a theoretical and a practical introduction
to data analysis techniques.
3. Visualisation.
4. Data munging.
6. Recommender systems.
9. Clustering systems.
[Figure: marking scheme; final exam 50%, project 50%]
The lecture notes will include sample exam questions and answers.
[Figure: time split across lectures, independent learning and labs; segments of 2 hours, 4 hours and 1 hour]
URL: http://blackboard.cit.ie.
Please don't send emails about course material; let's try to keep this stuff on
Blackboard, so that everyone benefits.
Email: donagh.horgan@cit.ie.
I will try to reply to Blackboard posts and emails within 48 hours, but sometimes
it may take longer than this.
2.5 / copyright
Except where noted otherwise, the material (i.e. notes and code) for this course
is licensed to you under the Creative Commons
Attribution-NonCommercial-NoDerivatives 4.0 International license.
This means that you are free to share course material with anyone you like, as
long as you respect the terms of the license.
2.6 / languages
The two most popular open source languages used for data analysis are R and
Python.
                    R            Python
algorithm support   very large   large
difficulty          moderate     easy
                    no           yes
Hopefully, this will save most people time or at least lessen the burden of having
to learn a new language.
If you don't know Python or haven't used it in some time, Codecademy offers a free
introductory course that covers the basics (you can stop just before the bitwise
operators section): https://www.codecademy.com/en/tracks/python.
2.7 / labs
The lab assignments involve the completion of data analysis exercises using
IPython Notebook.
Your student vDesktop has been configured for IPython Notebook usage, but you
can also run the notebooks on your own machine. To do this, you will need to
install the following software:
Python 2.7.x
NumPy
networkx
IPython 3.x
SciPy
scikit-learn
IPython Notebook
pandas
lxml
Inkscape
matplotlib
bs4
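A quick way to check which of the Python packages listed above are already installed is to try importing each one. The sketch below is only illustrative: the import names are assumptions (for example, scikit-learn is normally imported as `sklearn`), the helper name `missing_modules` is hypothetical, and Inkscape is a standalone program, so it is not covered.

```python
import importlib

# Import names assumed for the packages in the list above.
REQUIRED_MODULES = ["numpy", "scipy", "pandas", "matplotlib", "sklearn",
                    "networkx", "lxml", "bs4", "IPython"]


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Missing modules: {}".format(missing_modules(REQUIRED_MODULES)))
```

An empty list means every module imported successfully; anything listed still needs to be installed.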
On Windows:
../../code/lectures/01/ipython_windows
> cd C:\Python27\Scripts
> ipython notebook
On Linux:
../../code/lectures/01/ipython_linux
$ sudo pip install --upgrade ipython[all]
$ ipython notebook
Once you've got it started, try creating a new notebook like this:
Statistics are numerical measures that help us to characterise a data sample or the
general population from which it came.
They are an important part of data analysis and can provide powerful insights at
an early stage.
In the following slides, we will revise some core statistical measures, which will be
useful at later points in the course.
(1.1)
[Figure: a population of shapes and a sample drawn from it]
Populations describe all possible outcomes, while samples describe only a few.
By random chance, a sample may look extremely different from the population it
was drawn from (e.g. in the image above, the sample contains no triangles).
Just because 100% of the sample are criminals, this does not mean that 100% of the
population are.
[Figure: histograms; x-axis: values of items in bins, y-axis: number of items per bin]
The most common way of visualising the distribution of the data in a sample is
the histogram:
Sample data are placed in fixed-width bins, representing ranges of values.
The taller the bar, the larger the number of data points in the bin.
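The binning step can be sketched in a few lines of plain Python. This is a minimal illustration rather than the method used to draw the figures: the function name `histogram_counts`, the bin edges and the sample values are all assumptions chosen for the example.

```python
def histogram_counts(data, low, width, n_bins):
    """Count how many values fall into each fixed-width bin.

    Bin k covers the half-open range [low + k*width, low + (k+1)*width).
    Values outside the covered range are ignored.
    """
    counts = [0] * n_bins
    for value in data:
        index = int((value - low) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Four bins of width 5 covering [-5, 15).
sample = [1, 3, -4, 5, 10]
print(histogram_counts(sample, low=-5, width=5, n_bins=4))  # [1, 2, 1, 1]
```

Each count corresponds to the height of one bar in the histogram.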
Two important statistical measures of a distribution are its central tendency and
its dispersion:
[Figure: histograms A, B and C; x-axis: values of items in bins, y-axis: number of items per bin]
Figure B above shows a sample with higher central tendency than Figure A, but a
similar level of dispersion.
Figure C shows a sample with the same central tendency as Figure B, but with a
higher level of dispersion.
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \qquad (1.2)$$
Generally, the term mean can be taken to refer to the arithmetic mean.
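As a small illustration of (1.2), the arithmetic mean can be computed directly in Python. The sample A = {1, 3, -4, 5, 10} is the one used in this section's worked examples; the function name `mean` is just an illustrative choice.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / float(len(values))


A = [1, 3, -4, 5, 10]
print(mean(A))  # 3.0
```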
3.9 / example
a =
This works by applying a weight to each element of the sample, and so can
increase the effect of some elements and decrease the effect of others.
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}. \qquad (1.3)$$
When all the weights are equal to one (i.e. $w_i = 1, \forall i$), the weighted average is
equivalent to the average and so (1.3) simplifies to (1.2).
3.11 / example
Q. What is the weighted mean of the sample A = {1, 3, -4, 5, 10} with the weights
W = {2, 1, 0, 1, 0.5}?
A. Using (1.3), we can compute the weighted mean as
$$\bar{a} = \frac{(2 \times 1) + (1 \times 3) + (0 \times (-4)) + (1 \times 5) + (0.5 \times 10)}{2 + 1 + 0 + 1 + 0.5} = \frac{2 + 3 + 0 + 5 + 5}{4.5} = \frac{15}{4.5} \approx 3.33.$$
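The same calculation can be checked with a short Python sketch. The function name `weighted_mean` is an illustrative choice; the sample and weights are those from the example.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of weight * value, divided by the sum of the weights."""
    total = sum(w * v for v, w in zip(values, weights))
    return total / float(sum(weights))


A = [1, 3, -4, 5, 10]
W = [2, 1, 0, 1, 0.5]
print(round(weighted_mean(A, W), 2))  # 3.33

# With all weights equal to one, the result is the ordinary mean.
print(weighted_mean(A, [1, 1, 1, 1, 1]))  # 3.0
```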
It is defined as the middle value of the individual sample elements when arranged
in ascending order.
The median value of X is denoted by $\tilde{x}$ and depends on whether the total number
of points, n, is odd or even:
If n is odd, then
$$\tilde{x} = x_{\frac{n+1}{2}}. \qquad (1.4)$$
If n is even, then
$$\tilde{x} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}. \qquad (1.5)$$
$$\tilde{a} = a_{\frac{6}{2}} = a_3 = 3.$$
Remember, sorting A in ascending order is a crucial step!
$$\tilde{a} = \frac{a_{\frac{6}{2}} + a_{\frac{6}{2}+1}}{2} = \frac{a_3 + a_4}{2} = \frac{1 + 3}{2} = 2.$$
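Both cases, (1.4) and (1.5), can be sketched in one Python function. The odd-length sample is the A = {1, 3, -4, 5, 10} used elsewhere in this section; the six-element sample is a hypothetical one, chosen so that its third and fourth sorted values are 1 and 3.

```python
def median(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(values)  # sorting first is the crucial step
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: element (n + 1) / 2 in 1-based indexing.
        return ordered[middle]
    # Even n: average of elements n / 2 and n / 2 + 1 in 1-based indexing.
    return (ordered[middle - 1] + ordered[middle]) / 2.0


print(median([1, 3, -4, 5, 10]))      # 3
print(median([10, 5, 3, 1, -4, -6]))  # 2.0
```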
The mode is a further alternative method for estimating central tendency and is
useful in situations where the data is not numeric.
For example, for the set of names {John, Isabelle, John, Mary }, the mode is John.
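A minimal Python sketch of the mode, using the standard library's `collections.Counter`. The function name `mode` is illustrative, and when several values tie for most frequent this version simply returns one of them.

```python
from collections import Counter


def mode(values):
    """Return the most frequently occurring value in `values`."""
    return Counter(values).most_common(1)[0][0]


names = ["John", "Isabelle", "John", "Mary"]
print(mode(names))  # John
```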
3.16 / example
$$\sigma_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}. \qquad (1.6)$$
A related measure of dispersion is the variance of the sample, which is simply the
square of the standard deviation, i.e.
$$\mathrm{Var}(X) = \sigma_x^2. \qquad (1.7)$$
$$\sigma_A = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (a_i - \bar{a})^2} = \sqrt{\frac{1}{4}\left((1-3)^2 + (3-3)^2 + ((-4)-3)^2 + (5-3)^2 + (10-3)^2\right)} = \sqrt{\frac{1}{4}(4 + 0 + 49 + 4 + 49)} \approx 5.14.$$
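The sample standard deviation and variance can be sketched in Python as follows, using the n - 1 divisor from (1.6). The function names are illustrative choices, and the sample is A = {1, 3, -4, 5, 10}.

```python
import math


def variance(values):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / float(n)
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def standard_deviation(values):
    """Sample standard deviation: the square root of the sample variance."""
    return math.sqrt(variance(values))


A = [1, 3, -4, 5, 10]
print(variance(A))            # 26.5
print(standard_deviation(A))  # roughly 5.148
```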
It is less sensitive to outliers than the standard deviation (just as the median is
less sensitive to outliers than the mean).
It is defined as the difference between the upper and the lower quartiles, i.e.
$$\mathrm{IQR} = Q_3 - Q_1, \qquad (1.8)$$
where $Q_3$ denotes the upper quartile and $Q_1$ denotes the lower quartile.
The quartiles of an ordered set of data points are the three values that divide the
set into four groups, each containing approximately 25% of the sample data.
The lower quartile ($Q_1$) is the middle value between the minimum value of the data and
the median value of the data.
The second quartile ($Q_2$) is simply the median of the data, which is the middle value
between the minimum value of the data and the maximum value of the data.
The upper quartile ($Q_3$) is the middle value between the median value of the data
and the maximum value of the data.
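One way to turn these definitions into code is to take the median of the lower half and the median of the upper half of the sorted data. This is a sketch under that assumption: several quartile conventions exist, and libraries may interpolate differently, so results can vary slightly between tools. The function names are illustrative, and the sample needs at least two points.

```python
def median(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2.0


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention.

    For an odd number of points the median itself is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[-half:]
    return median(lower), median(ordered), median(upper)


def interquartile_range(values):
    """IQR: the upper quartile minus the lower quartile."""
    q1, _, q3 = quartiles(values)
    return q3 - q1


A = [1, 3, -4, 5, 10]
print(quartiles(A))            # (-1.5, 3, 7.5)
print(interquartile_range(A))  # 9.0
```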
A positive score indicates that the data point is above the mean.
A negative score indicates that the data point is below the mean.
The magnitude of the score indicates how far above or below the mean the point is.
The standard score of the data point $x_i$ is denoted by $z(x_i)$ and defined as
$$z(x_i) = \frac{x_i - \bar{x}}{\sigma_x}. \qquad (1.9)$$
The standard score is often used to quantify how extreme a given data point is,
and can be a useful indicator that a given data point is an outlier.
3.23 / example
Q. What is the standard score of the third data point in the sample
A = {1, 3, -4, 5, 10}?
A. Earlier, we computed the mean of A as $\bar{a} = 3$ and the standard deviation as
$\sigma_A \approx 5.14$. The standard score of $a_3$ can then be computed using (1.9) as
$$z(a_3) = \frac{a_3 - \bar{a}}{\sigma_A} = \frac{(-4) - 3}{5.14} \approx -1.36.$$
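The calculation in this example can be reproduced with a short Python sketch. The function name `standard_scores` is an illustrative choice; it uses the sample standard deviation with the n - 1 divisor.

```python
import math


def standard_scores(values):
    """Return the standard score (z-score) of every point in `values`."""
    n = len(values)
    mean = sum(values) / float(n)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / std for v in values]


A = [1, 3, -4, 5, 10]
z = standard_scores(A)
print(round(z[2], 2))  # -1.36 (third data point; Python indexes from zero)
```

The negative sign shows that the third point lies below the mean.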
Two samples are said to be dependent when the value of the first depends on the
value of the second or vice versa, e.g.
The number of people wearing shorts on a given day depends on the weather.
A student's score on a test (ideally) depends on the depth of their knowledge of the
subject.
It's possible that multiple dependencies exist: in reality, a student's score on a test
depends on more than just knowledge.
If two samples are positively correlated, then their values tend to increase or
decrease together.
If two samples are negatively correlated, then the values in one tend to
increase/decrease when the values in the other decrease/increase.
$$r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{\sigma_x}\right) \left(\frac{y_i - \bar{y}}{\sigma_y}\right) \qquad (1.10)$$
$$r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} z(x_i) \, z(y_i). \qquad (1.11)$$
The Pearson correlation coefficient has a value between -1 and +1, inclusive,
i.e. $r_{xy} \in [-1, 1]$.
If $r_{xy} = 1$, then X and Y have the strongest possible level of positive correlation.
One handy property of the coefficient is that $r_{xy} = r_{yx}$, i.e. the correlation between
X and Y is the same as the correlation between Y and X; the order does not
matter.
3.28 / example
$\bar{b} = 5$, $\sigma_B = 1$.
Next, we compute the standard scores of the data points in each sample using
(1.9):
$$z(a_1) = -1, \quad z(a_2) = 0, \quad z(a_3) = 1.$$
$$r_{ab} = \frac{1}{n-1} \sum_{i=1}^{n} z(a_i) \, z(b_i) = \frac{((-1) \times (-1)) + (0 \times 0) + (1 \times 1)}{2} = 1.$$
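The Pearson coefficient in (1.11) can be sketched in Python as the sum of products of standard scores, divided by n - 1. The two samples below are hypothetical: they were picked so that their standard scores are -1, 0 and 1 and so that the second has a mean of 5 and a standard deviation of 1, matching the values in this example. The function names are illustrative.

```python
import math


def standard_scores(values):
    """Return the standard score (z-score) of every point in `values`."""
    n = len(values)
    mean = sum(values) / float(n)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / std for v in values]


def pearson(xs, ys):
    """Pearson correlation: sum of products of z-scores, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same length")
    products = (zx * zy for zx, zy in zip(standard_scores(xs), standard_scores(ys)))
    return sum(products) / (len(xs) - 1)


A = [1, 2, 3]  # hypothetical sample with z-scores -1, 0, 1
B = [4, 5, 6]  # hypothetical sample with mean 5 and standard deviation 1
print(pearson(A, B))          # 1.0
print(pearson(A, [6, 5, 4]))  # -1.0
```

Swapping the arguments gives the same result, reflecting the symmetry of the coefficient.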
Image: RedAndr/Wikipedia
X.1 / summary
Lab:
Next week:
Data formats.
course material
Module information
Blackboard
statistics
python
Khan Academy
Python course on Codecademy
IPython documentation
blogs
videos