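The variance ($\sigma_x^2$) of a set of data points ($x_1, x_2, \dots, x_n$):

$$\sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 \qquad (1)$$

The variance ($\sigma_y^2$) of another set of data points ($y_1, y_2, \dots, y_n$) is likewise,

$$\sigma_y^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2 \qquad (2)$$

Here $\bar{x}$ and $\bar{y}$ are the mean values of the two sets.
Therefore, we can also define a similar kind of formula involving two variables,

$$r = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}} \qquad (3)$$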
The above coefficient is called Pearson's correlation coefficient. It measures how strongly the pairs of variables are related.
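Calculation of Pearson Correlation Coefficient:
For practical calculations, we often use the following formula, obtained by multiplying the numerator and denominator of the formula for $r$ by $n^2$:

$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right]}}$$

where each sum runs over $i = 1$ to $n$.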
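As a rough illustration of the practical calculation, the sketch below computes the coefficient from the five sums $\sum x$, $\sum y$, $\sum x^2$, $\sum y^2$ and $\sum xy$. The function name `pearson_r` and the sample numbers are assumptions made for this example; they do not come from the notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from raw sums."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    if denominator == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return numerator / denominator

# Hypothetical data: y is exactly twice x, so a perfect correlation is expected.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
print(pearson_r(xs, ys))
```

Working with the raw sums mirrors the column totals of a hand-calculation table, which is why this form is convenient in practice.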
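Properties of $r$:
$-1 \le r \le +1$
$r = +1$: perfect correlation
$r = -1$: perfect anti-correlation
$r = 0$: no correlation
We can get an idea of the kind of correlation between two sets of data from the ($x, y$) scatter plots:
[Figs.: ($x, y$) scatter plots]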
Correlation Matrix:
For the relations among more than two sets of variables, it is useful to present the correlation coefficients between every two sets of variables in the form of a table. This is called the correlation matrix.
For example, for three sets of variables, X, Y and Z:

        X            Y            Z
X       1            $r_{XY}$     $r_{XZ}$
Y       $r_{YX}$     1            $r_{YZ}$
Z       $r_{ZX}$     $r_{ZY}$     1

Note that in the above table, we have only three different entries. The reason is that the matrix is symmetric, as the correlation between X and Y is the same as between Y and X, and so on: $r_{XY} = r_{YX}$, $r_{XZ} = r_{ZX}$, $r_{YZ} = r_{ZY}$. Also, the correlation of a variable with itself is trivial; it is always the perfect correlation ($r_{XX} = r_{YY} = r_{ZZ} = 1$). So we have only three independent useful quantities.
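A minimal sketch of building such a table in Python is given here. The helper names `pearson_r` and `correlation_matrix`, and the three small data columns, are hypothetical and chosen only for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson coefficient from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

def correlation_matrix(data):
    """Return {name_a: {name_b: r_ab}} for every pair of columns in `data`."""
    names = list(data)
    return {a: {b: pearson_r(data[a], data[b]) for b in names} for a in names}

# Hypothetical data for three variables (not taken from the notes).
data = {
    "X": [1, 2, 3, 4, 5],
    "Y": [2, 1, 4, 3, 5],
    "Z": [5, 3, 4, 1, 2],
}
matrix = correlation_matrix(data)
for a in data:
    print(a, [round(matrix[a][b], 3) for b in data])
```

The printed rows should show 1 on the diagonal and equal values in mirrored positions, which is the symmetry described above.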
Spearman's rank correlation coefficient between two sets of ranked variables is defined below.
Suppose the original pair of data sets ($x_1, x_2, \dots, x_n$) and ($y_1, y_2, \dots, y_n$) are rank ordered. This means that the data of the two sets are numbered as ranks 1, 2, 3, ... according to their values. Suppose we write the ranks $R_i$ and $S_i$ for the two variables X and Y.
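Now we calculate the differences

$$d_i = R_i - S_i,$$

where $R_i$ is the rank of $x_i$ and $S_i$ is the rank of $y_i$ ($i = 1, 2, \dots, n$).
Now in place of the $x$ and $y$ values we shall use ranks. For example, instead of the $x$-values ($x_1, x_2, \dots, x_n$) we use the ranks ($R_1, R_2, \dots, R_n$), which are the integers 1 to $n$ in some order (assuming no two values are equal).
So, we rewrite Pearson's formula in terms of the ranks:

$$r_s = \frac{n\sum R_i S_i - \sum R_i \sum S_i}{\sqrt{\left[n\sum R_i^2 - \left(\sum R_i\right)^2\right]\left[n\sum S_i^2 - \left(\sum S_i\right)^2\right]}} \qquad (1)$$

Since each set of ranks consists of the integers 1 to $n$, we get

$$\sum R_i = \sum S_i = \frac{n(n+1)}{2} \qquad (2)$$

Also,

$$\sum R_i^2 = \sum S_i^2 = \frac{n(n+1)(2n+1)}{6} \qquad (3)$$

Now consider,

$$\sum d_i^2 = \sum (R_i - S_i)^2 = \sum R_i^2 + \sum S_i^2 - 2\sum R_i S_i.$$

So,

$$\sum R_i S_i = \frac{n(n+1)(2n+1)}{6} - \frac{1}{2}\sum d_i^2 \qquad (4)$$

From (2) and (3),

$$n\sum R_i^2 - \left(\sum R_i\right)^2 = \frac{n^2(n+1)(2n+1)}{6} - \frac{n^2(n+1)^2}{4} = \frac{n^2(n^2-1)}{12} \quad \text{[After simplifying]} \qquad (5)$$

Now we combine (1), (2), (4) and (5) to get Spearman's rank correlation formula:

$$r_s = \frac{\frac{n^2(n^2-1)}{12} - \frac{n}{2}\sum d_i^2}{\frac{n^2(n^2-1)}{12}} = 1 - \frac{6\sum d_i^2}{n(n^2-1)}.$$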
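To make the link between the two formulas concrete, the following sketch ranks two small data sets, applies $r_s = 1 - 6\sum d^2 / [n(n^2-1)]$, and compares the result with Pearson's formula applied directly to the ranks. The function names and the sample values are assumptions for illustration, and the ranking helper assumes there are no tied values.

```python
from math import sqrt

def ranks(values):
    """Rank 1 for the smallest value; assumes no two values are equal."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rs(xs, ys):
    """Spearman coefficient from the squared rank differences."""
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

def pearson_on_ranks(xs, ys):
    """Pearson's sum formula applied to the ranks instead of the values."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    numerator = n * sum(a * b for a, b in zip(rx, ry)) - sum(rx) * sum(ry)
    denominator = sqrt((n * sum(a * a for a in rx) - sum(rx) ** 2)
                       * (n * sum(b * b for b in ry) - sum(ry) ** 2))
    return numerator / denominator

# Hypothetical measurements (not from the notes), chosen without ties.
xs = [10.5, 3.2, 7.7, 8.1, 1.0]
ys = [4.0, 2.5, 9.9, 6.3, 0.7]
print(ranks(xs), ranks(ys))
print(spearman_rs(xs, ys), pearson_on_ranks(xs, ys))
```

When ties are present the two routes can differ, so this check is only meant for data with distinct values.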
Let us try to understand the essence of correlation from the following demonstration.
Suppose we consider the integer numbers from 1 to 10 appearing in two columns in two different
ways. We may consider them as actual values or we can consider them as ranks as we wish.
First we examine the following table for calculating the Pearson correlation coefficient:
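$x$      $y$      $x^2$    $y^2$    $xy$
1        1        1        1        1
2        2        4        4        4
3        3        9        9        9
4        4        16       16       16
5        5        25       25       25
6        6        36       36       36
7        7        49       49       49
8        8        64       64       64
9        9        81       81       81
10       10       100      100      100
Total    55       55       385      385      385

$$r = \frac{10(385) - (55)(55)}{\sqrt{\left[10(385) - (55)^2\right]\left[10(385) - (55)^2\right]}} = \frac{825}{825} = 1$$

Perfect correlation! In fact, it has to be so, as the two data sets are exactly the same. It is a trivial case.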
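Now, we consider another case where the numbers in one column come exactly in the reverse order. So, here also we will have $\sum x_i = \sum y_i$ and $\sum x_i^2 = \sum y_i^2$, but $\sum x_i y_i$ is different. Treating the values as ranks, we tabulate $d_i^2 = (x_i - y_i)^2$:

$x$      $y$      $d^2$
1        10       81
2        9        49
3        8        25
4        7        9
5        6        1
6        5        1
7        4        9
8        3        25
9        2        49
10       1        81
Total             330

$$r_s = 1 - \frac{6(330)}{10(10^2 - 1)} = 1 - 2 = -1$$

Perfect anti-correlation.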
Now we take a case where the two columns are neither exactly the same nor in exact reverse order of each other. As a test case, we shuffle some of the entries in the y-column (keeping the x-column intact) and examine how the correlation is affected.
Consider the following table:
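$x$      $y$      $x^2$    $y^2$    $xy$
1        1        1        1        1
2        3        4        9        6
3        2        9        4        6
4        4        16       16       16
5        6        25       36       30
6        5        36       25       30
7        7        49       49       49
8        9        64       81       72
9        8        81       64       72
10       10       100      100      100
Total    55       55       385      385      382

$$r = \frac{10(382) - (55)(55)}{\sqrt{\left[10(385) - (55)^2\right]\left[10(385) - (55)^2\right]}} = \frac{795}{825} \approx 0.9636$$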
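For the same data, the squared rank differences $d_i^2 = (x_i - y_i)^2$ are:

$x$      $y$      $d^2$
1        1        0
2        3        1
3        2        1
4        4        0
5        6        1
6        5        1
7        7        0
8        9        1
9        8        1
10       10       0
Total             6

$$r_s = 1 - \frac{6(6)}{10(10^2 - 1)} = 1 - \frac{36}{990} \approx 0.9636$$

Strong correlation!
Here the two methods give the same value of the correlation coefficient, because the integer values (from 1 to 10) that we used for the demonstration are the same as their ranks.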
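The three cases of this demonstration (identical, reversed and partly shuffled columns) can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are chosen here for illustration.

```python
from math import sqrt

x = list(range(1, 11))
cases = {
    "identical": list(range(1, 11)),
    "reversed": list(range(10, 0, -1)),
    "shuffled": [1, 3, 2, 4, 6, 5, 7, 9, 8, 10],
}

n = len(x)
results = {}
for label, y in cases.items():
    # Column totals, as in a hand-calculation table.
    sum_x, sum_y = sum(x), sum(y)
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_d2 = sum((a - b) ** 2 for a, b in zip(x, y))

    # Pearson from the sums; Spearman from the squared differences.
    pearson = (n * sum_xy - sum_x * sum_y) / sqrt(
        (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
    )
    spearman = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
    results[label] = (sum_xy, sum_d2, pearson, spearman)
    print(f"{label:10s} sum_xy={sum_xy:4d} sum_d2={sum_d2:4d} "
          f"r={pearson:+.4f} r_s={spearman:+.4f}")
```

Because the values 1 to 10 are their own ranks, the two coefficients agree in every case; with general data they need not.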
Applications:
Example#1 Calculate the correlation between the following two sets of data, $x$ and $y$:

$x$      2      4      5      6      8
$y$      5      9      11     10     12
Method 1: Pearson
We calculate the necessary quantities in the other columns to be put in the Pearson correlation
formula.
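No.      $x$      $y$      $x^2$    $y^2$    $xy$
1        2        5        4        25       10
2        4        9        16       81       36
3        5        11       25       121      55
4        6        10       36       100      60
5        8        12       64       144      96
Total    25       47       145      471      257

Here we find, $n = 5$, $\sum x = 25$, $\sum y = 47$, $\sum x^2 = 145$, $\sum y^2 = 471$ and $\sum xy = 257$, so that

$$r = \frac{5(257) - (25)(47)}{\sqrt{\left[5(145) - (25)^2\right]\left[5(471) - (47)^2\right]}} = \frac{110}{\sqrt{100 \times 146}} \approx 0.91$$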
The above calculated value of correlation coefficient is close to 1. Thus we may say that there is
a strong correlation between the two sets of data.
Abhijit Kar Gupta, kg.abhi@gmail.com
Method 2: Spearman
The Correlation between the two sets of data can also be calculated from the Spearman rank
correlation formula.
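No.      $x$      $y$      $R$ (rank of $x$)    $S$ (rank of $y$)    $d^2 = (R - S)^2$
1        2        5        1                    1                    0
2        4        9        2                    2                    0
3        5        11       3                    4                    1
4        6        10       4                    3                    1
5        8        12       5                    5                    0
Total                                                                2

The Spearman rank correlation,

$$r_s = 1 - \frac{6(2)}{5(5^2 - 1)} = 1 - \frac{12}{120} = 0.9$$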
The correlation is strong! So the conclusions by two methods are the same.
Example#2 A group of 10 runners is interested in finding out whether their running capacity is associated with age or not. They record their age in years and also the time taken to finish a trial run of 100 m. The results are:

Names        Age (years)    Time (sec)
Akash        54             10.56
Abeer        46             9.88
Priya        29             11.02
Kader        28             9.99
Samrat       25             10.21
Baser        36             11.59
Daniel       15             12.0
Swati        26             10.87
Yasmin       38             11.46
Elizabeth    34             11.25
Method 1: Pearson
Names        Age ($x$)    Time ($y$)    $x^2$      $y^2$      $xy$
Akash        54           10.56         2916       111.51     570.24
Abeer        46           9.88          2116       97.61      454.48
Priya        29           11.02         841        121.44     319.58
Kader        28           9.99          784        99.80      279.72
Samrat       25           10.21         625        104.24     255.25
Baser        36           11.59         1296       134.33     417.24
Daniel       15           12.0          225        144.00     180.00
Swati        26           10.87         676        118.16     282.62
Yasmin       38           11.46         1444       131.33     435.48
Elizabeth    34           11.25         1156       126.56     382.50
TOTAL        331          108.83        12079.0    1188.99    3577.11

$$r = \frac{10(3577.11) - (331)(108.83)}{\sqrt{\left[10(12079) - (331)^2\right]\left[10(1188.99) - (108.83)^2\right]}} \approx -0.35$$
Conclusion: Weak Correlation!
Method 2: Spearman
NAMES        Age ($x$)    Time ($y$)    Rank of $x$    Rank of $y$    $d^2$
Akash        54           10.56         10             4              36
Abeer        46           9.88          9              1              64
Priya        29           11.02         5              6              1
Kader        28           9.99          4              2              4
Samrat       25           10.21         2              3              1
Baser        36           11.59         7              9              4
Daniel       15           12.0          1              10             81
Swati        26           10.87         3              5              4
Yasmin       38           11.46         8              8              0
Elizabeth    34           11.25         6              7              1
TOTAL                                                                 196

$$r_s = 1 - \frac{6(196)}{10(10^2 - 1)} = 1 - \frac{1176}{990} \approx -0.19$$

So the correlation is negative and very weak! One can say that there is almost no correlation between the two sets of data.
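The runner data can be used to cross-check both methods in Python. The sketch below is not part of the original worked example: the function names are assumptions, the ranking gives rank 1 to the smallest value (the convention used in the table), and it relies on there being no tied ages or times.

```python
from math import sqrt

ages = [54, 46, 29, 28, 25, 36, 15, 26, 38, 34]
times = [10.56, 9.88, 11.02, 9.99, 10.21, 11.59, 12.0, 10.87, 11.46, 11.25]

def pearson_r(xs, ys):
    """Pearson coefficient from the raw sums."""
    n = len(xs)
    numerator = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    denominator = sqrt((n * sum(x * x for x in xs) - sum(xs) ** 2)
                       * (n * sum(y * y for y in ys) - sum(ys) ** 2))
    return numerator / denominator

def ranks(values):
    """Rank 1 for the smallest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rs(xs, ys):
    """Spearman coefficient from the squared rank differences."""
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

r = pearson_r(ages, times)
rs = spearman_rs(ages, times)
print("rank of x:", ranks(ages))
print("rank of y:", ranks(times))
print(f"Pearson r = {r:.2f}, Spearman r_s = {rs:.2f}")
```

Both values come out negative and small in magnitude, in line with the conclusion of a weak correlation.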
NOTE: The numerical values of the correlation coefficients obtained by the two methods (Pearson and Spearman) can be different. There is, in fact, no reason for the two values to be the same. But the conclusion has to be the same. In Example#1 above, a strong correlation is found by both methods. (The two correlation values are incidentally close.) In Example#2, a weak correlation is found by both methods. (The two correlation values are quite different here.)
Sometimes one set of quantities is given as actual values while the other set is presented as ranks. In this case, it is useful to convert the actual values into ranks so that we get a pair of ranks, and to calculate the Spearman rank correlation from there.
It is evident that it is easier to calculate the Spearman rank correlation once we can obtain the ranks.
Problem#1
Consider the following height-weight data. Calculate the correlation coefficient and comment.

Height (cm)    170  172  181  157  150  168  166  175  177  165  163  152  161  173  175
Weight (kg)    65   66   69   55   51   63   61   75   72   64   61   52   60   70   72
Problem#2
Here we present a table that represents the data for shoe sizes vs. height achieved by Olympic
participants in a high jump event. Both the columns are measured in inches.
Shoe size    12.0  7.0   4.5   11.0  8.5   5.0   12.0  7.5   8.5   5.5   9.5   5.5   10.5  12.0  14.0  7.0   7.0
Height       72    64    62    70    69    65    72    65    65    65    68    61    69    77    73    65    67

Shoe size    12.0  12.0  7.0   13.0  11.0  12.0  4.5   10.5  10.0  10.0  13.0  7.5   4.5   8.5   14.0  10.0  6.5
Height       71    73    64    71    71    72    61    71    66    67    73    69    61    70    75    72    66
Calculate the correlation coefficient. For a visual effect, you can do a scatter plot for the pairs of
data. What do you think of the relationship between them as observed from the scatter plot?
Does the idea match with the correlation coefficient?
Autocorrelation:
Autocorrelation is a correlation coefficient calculated from the same variable. Instead of
correlation between two different variables, the correlation is between two values of the same
variable at different times (or at different positions of the column of data).
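Suppose we have one set of data ($x_1, x_2, \dots, x_n$) measured at different times ($t_1, t_2, \dots, t_n$), or instead we think of the positions of the values as appearing in the serial order, $1, 2, \dots, n$. The autocorrelation function for a separation $k$ is

$$r_k = \frac{\sum_{i=1}^{n-k}(x_i - \bar{x})(x_{i+k} - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$

In the numerator, we take the data at any $i$-th position and then we think of another data point at the $(i+k)$-th position. After taking the products of the mean differences, we sum over all the positions $i$.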
The autocorrelation function is used to test whether there is any pattern inside a set of data, or whether the data are generated from a random process.
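Example: Consider the $x$-values (1.0, -0.31, -0.74, 0.21, -0.90, 0.38, 0.63, -0.77, -0.12, 0.82).
Here, $n = 10$ and $\bar{x} = 0.02$. We set $k = 1$:

$$r_1 = \frac{\sum_{i=1}^{9}(x_i - \bar{x})(x_{i+1} - \bar{x})}{\sum_{i=1}^{10}(x_i - \bar{x})^2},$$

then we consider $k = 2$:

$$r_2 = \frac{\sum_{i=1}^{8}(x_i - \bar{x})(x_{i+2} - \bar{x})}{\sum_{i=1}^{10}(x_i - \bar{x})^2}.$$

So this way, we can calculate all the functions we want (for any value of $k$). To see how those $r_k$'s vary, we can plot $r_k$ vs. $k$.
One can easily check that, by definition, the autocorrelation function lies between -1 and +1 as before.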
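A short Python sketch of this calculation for the example data follows. The lag symbol `k`, the function name `autocorrelation`, and the normalisation (a sum of $n-k$ products divided by the full sum of squared deviations) are assumptions; this is one common convention and may differ in detail from the formula intended in the notes.

```python
def autocorrelation(xs, k):
    """Autocorrelation of xs at separation k (0 <= k < len(xs))."""
    n = len(xs)
    if not 0 <= k < n:
        raise ValueError("k must satisfy 0 <= k < len(xs)")
    mean = sum(xs) / n
    # Denominator: sum of squared deviations over all positions.
    denominator = sum((x - mean) ** 2 for x in xs)
    # Numerator: products of deviations at positions i and i + k.
    numerator = sum((xs[i] - mean) * (xs[i + k] - mean) for i in range(n - k))
    return numerator / denominator

x = [1.0, -0.31, -0.74, 0.21, -0.90, 0.38, 0.63, -0.77, -0.12, 0.82]
r_values = [autocorrelation(x, k) for k in range(len(x))]
for k, r_k in enumerate(r_values):
    print(k, round(r_k, 3))
```

Plotting `r_values` against `k` gives the picture of how the autocorrelation varies with separation; with this normalisation the value at `k = 0` is 1 and all others stay within -1 and +1.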