
The Measure of Correlation

Let us first note the variance of a set of data points $(x_1, x_2, \dots, x_n)$:

$\mathrm{Var}(x) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$   (1)

The variance of another set of data points $(y_1, y_2, \dots, y_n)$ is likewise:

$\mathrm{Var}(y) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2$   (2)

We can write the above two expressions in the following form:

$\mathrm{Var}(x) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})$ and $\mathrm{Var}(y) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})(y_i - \bar{y})$.

Therefore, we can also define a similar kind of formula involving two variables,

$\mathrm{Cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$,   (3)

which is called the covariance of the two sets of data.


The linear correlation between two sets of data is defined by the following coefficient:

$\mathrm{Corr}(x, y) = r = \dfrac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}} = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$

The above coefficient is called Pearson's correlation coefficient. Correlation tests how strongly the pairs of variables are related.
Calculation of Pearson Correlation Coefficient:

For practical calculations, we often use the following formula, obtained after multiplying the numerator and denominator of the formula for $r$ by $n$:

$r = \dfrac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}{\sqrt{\left[n\sum_i x_i^2 - \left(\sum_i x_i\right)^2\right]\left[n\sum_i y_i^2 - \left(\sum_i y_i\right)^2\right]}}$

Note: In many books, the correlation coefficient is written in the form

$r = \dfrac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$, where $S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$, $S_{xx} = \sum_i (x_i - \bar{x})^2$ and $S_{yy} = \sum_i (y_i - \bar{y})^2$.

Abhijit Kar Gupta, kg.abhi@gmail.com
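The computational formula above translates directly into code. A minimal sketch in Python (the function name `pearson_r` is our own choice for illustration):

```python
import math

def pearson_r(x, y):
    """Pearson r via the sum-based (computational) formula:
    r = [n*Sum(xy) - Sum(x)Sum(y)] / sqrt([n*Sum(x^2) - Sum(x)^2][n*Sum(y^2) - Sum(y)^2])"""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den
```

For example, `pearson_r([1, 2, 3], [2, 4, 6])` gives $+1$, since the relation is exactly linear.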

Properties of $r$:

The coefficient $r$ measures the strength of a linear relationship.

The range: $-1 \le r \le +1$.
$r = +1$ : perfect positive linear correlation
$r = -1$ : perfect negative linear correlation
$r = 0$ : no correlation

We can have an idea of the kind of correlation between two sets of data from the scatter plots.

[Figs.: scatter plots illustrating positive, negative and zero correlation]

Correlation Matrix:

For the relations among more than two sets of variables, it is useful to present the correlation coefficients between every pair of variables in the form of a table. This is called the correlation matrix.
For example, for three sets of variables $X$, $Y$, $Z$, we have the following table:

          X         Y         Z
X         1      r(X,Y)    r(X,Z)
Y      r(Y,X)       1      r(Y,Z)
Z      r(Z,X)    r(Z,Y)       1

Note that in the above table, we have only three different entries. The reason is that the matrix is symmetric, as the correlation between $X$ and $Y$ is the same as that between $Y$ and $X$, and so on: $r(X,Y) = r(Y,X)$, $r(Y,Z) = r(Z,Y)$, $r(X,Z) = r(Z,X)$. Also, the correlation of a variable with itself is trivial; it is always the perfect correlation ($r = 1$). So we have only three independent useful quantities.
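As a sketch of how such a matrix is produced in practice, assuming NumPy is available (the three short data lists below are made up purely for illustration):

```python
import numpy as np

# Three hypothetical sets of variables, 5 observations each
x = [2, 4, 5, 6, 8]
y = [5, 9, 11, 10, 12]
z = [3, 1, 4, 2, 5]

m = np.corrcoef([x, y, z])  # 3x3 correlation matrix
print(np.round(m, 2))
```

The resulting matrix is symmetric with 1's on the diagonal, so only the three off-diagonal entries carry information, exactly as noted above.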


Spearman Rank Correlation:

Spearman's rank correlation coefficient between two sets of ranked variables is defined below. Suppose the original pairs of data sets $(x_1, x_2, \dots, x_n)$ and $(y_1, y_2, \dots, y_n)$ are rank ordered. This means that the data of the two sets are numbered as ranks 1, 2, 3, … according to their values. Suppose we write the ranks $R^x_i$ and $R^y_i$ for the two variables $X$ and $Y$. Now we calculate the differences $d_i = R^x_i - R^y_i$ between the corresponding ranks of the two sets.

With the above, the Spearman rank correlation coefficient:

$\rho = 1 - \dfrac{6\sum_i d_i^2}{n(n^2 - 1)}$
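The formula can be sketched in a few lines of Python (the helper names are our own, and the rank assignment assumes no ties):

```python
def ranks(values):
    """Assign ranks 1..n by increasing value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    """rho = 1 - 6*Sum(d_i^2) / (n*(n^2 - 1))"""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

For tied values a proper treatment assigns average ranks; the sketch above skips that complication.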

From Pearson Correlation to Spearman Rank Correlation:

Pearson correlation formula:

$r = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$,

where $\bar{x} = \frac{1}{n}\sum_i x_i$ and $\bar{y} = \frac{1}{n}\sum_i y_i$ ($i = 1, 2, \dots, n$).

Now in place of the $x$ and $y$ values we shall use ranks. For example, instead of the $x$-values $(x_1, x_2, \dots, x_n)$ we use the ranks $R^x_i$, which are just the integers $1, 2, \dots, n$ in some order. Then

$\sum_i R^x_i = \dfrac{n(n+1)}{2}$, so that $\bar{R}^x = \dfrac{n+1}{2}$, and $\sum_i (R^x_i)^2 = \dfrac{n(n+1)(2n+1)}{6}$.

So, we rewrite

$\sum_i (R^x_i - \bar{R}^x)^2 = \sum_i (R^x_i)^2 - n(\bar{R}^x)^2 = \dfrac{n(n+1)(2n+1)}{6} - n\left(\dfrac{n+1}{2}\right)^2 = \dfrac{n(n^2-1)}{12}$.   (1)

Similarly, if we replace the $y$-values $(y_1, y_2, \dots, y_n)$ by the ranks $R^y_i$, we get

$\sum_i (R^y_i - \bar{R}^y)^2 = \dfrac{n(n^2-1)}{12}$.   (2)

Also,

$r = \dfrac{\sum_i (R^x_i - \bar{R}^x)(R^y_i - \bar{R}^y)}{\sqrt{\sum_i (R^x_i - \bar{R}^x)^2 \sum_i (R^y_i - \bar{R}^y)^2}}$.   (3)

Now consider $d_i = R^x_i - R^y_i = (R^x_i - \bar{R}^x) - (R^y_i - \bar{R}^y)$, since $\bar{R}^x = \bar{R}^y$. So,

$\sum_i d_i^2 = \sum_i (R^x_i - \bar{R}^x)^2 + \sum_i (R^y_i - \bar{R}^y)^2 - 2\sum_i (R^x_i - \bar{R}^x)(R^y_i - \bar{R}^y)$,

which gives

$\sum_i (R^x_i - \bar{R}^x)(R^y_i - \bar{R}^y) = \dfrac{1}{2}\left[\sum_i (R^x_i - \bar{R}^x)^2 + \sum_i (R^y_i - \bar{R}^y)^2 - \sum_i d_i^2\right]$.   (4)

Putting (4) into (3), together with (1) and (2),

$r = \dfrac{\frac{1}{2}\left[\frac{n(n^2-1)}{12} + \frac{n(n^2-1)}{12} - \sum_i d_i^2\right]}{\frac{n(n^2-1)}{12}} = \dfrac{\frac{n(n^2-1)}{12} - \frac{1}{2}\sum_i d_i^2}{\frac{n(n^2-1)}{12}}$   [After simplifying]   (5)

Now we simplify (5) to get Spearman's rank correlation formula:

$\rho = 1 - \dfrac{6\sum_i d_i^2}{n(n^2-1)}$

For a monotonic relation, i.e. for exact correlation, all $d_i = 0$, which will lead to $\rho = 1$.

By design, the range of Spearman's rank correlation coefficient: $-1 \le \rho \le +1$.
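The derivation can be checked numerically: applying the Pearson formula directly to ranks reproduces the value given by the Spearman formula. A small self-contained sketch, using reversed ranks as a test case:

```python
import math

def pearson(x, y):
    """Pearson r from the mean-deviation form."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

n = 10
rx = list(range(1, n + 1))      # ranks 1..10
ry = list(range(n, 0, -1))      # the same ranks in reverse order
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
rho = 1 - 6 * d2 / (n * (n * n - 1))   # Spearman formula
print(pearson(rx, ry), rho)            # the two values agree
```

Here both come out as exactly $-1$, as expected for ranks in exactly opposite order.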

Let us try to understand the essence of correlation from the following demonstration.
Suppose we consider the integer numbers from 1 to 10 appearing in two columns in two different ways. We may consider them either as actual values or as ranks, as we wish.
First we examine the following table for calculating the Pearson correlation coefficient:

 x     y     x²     y²     xy
 1     1      1      1      1
 2     2      4      4      4
 3     3      9      9      9
 4     4     16     16     16
 5     5     25     25     25
 6     6     36     36     36
 7     7     49     49     49
 8     8     64     64     64
 9     9     81     81     81
10    10    100    100    100

Totals: $\sum x = 55$, $\sum y = 55$, $\sum x^2 = 385$, $\sum y^2 = 385$, $\sum xy = 385$.

$r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} = \dfrac{10 \times 385 - 55 \times 55}{\sqrt{(10 \times 385 - 55^2)(10 \times 385 - 55^2)}} = \dfrac{825}{825} = +1$

Perfect correlation! In fact, that has to be, as the two data sets are exactly the same. Thus $r = +1$. It is a trivial case.
Now, we consider another case where the numbers in one column come exactly in the reverse order. So here also we will have $\sum y = 55$ and $\sum y^2 = 385$, but now $\sum xy = 220$, and

$r = \dfrac{10 \times 220 - 55 \times 55}{825} = \dfrac{-825}{825} = -1$

Perfect anti-correlation (negative correlation)! This is also a trivial case.


Now in the above example, if we want to measure the Spearman rank correlation, we consider the entries in the x- and y-columns as ranks (in this special case).
So for the 1st case, when the two columns are the same, we have each difference of ranks $d_i = 0$. Thus $\sum_i d_i^2 = 0$, which leads to the rank correlation coefficient $\rho = 1 - 0 = 1$. Perfect correlation.
For the 2nd case, when the values in one column come in the exact opposite order, we have the following table:

(rank of x)   (rank of y)    d²
     1            10         81
     2             9         49
     3             8         25
     4             7          9
     5             6          1
     6             5          1
     7             4          9
     8             3         25
     9             2         49
    10             1         81

Here $\sum_i d_i^2 = 330$, so the Spearman rank correlation coefficient:

$\rho = 1 - \dfrac{6 \times 330}{10 \times (10^2 - 1)} = 1 - \dfrac{1980}{990} = -1$

Perfect anti-correlation.

Now we take a case when the two columns are not exactly the same, nor is one in the exact reverse order of the other. As a test case, we shuffle some of the entries in the y-column (keeping the x-column intact) and examine how the correlation is affected.
Consider the following table:

 x     y     x²     y²     xy
 1     1      1      1      1
 2     3      4      9      6
 3     2      9      4      6
 4     4     16     16     16
 5     6     25     36     30
 6     5     36     25     30
 7     7     49     49     49
 8     9     64     81     72
 9     8     81     64     72
10    10    100    100    100

Totals: $\sum x = 55$, $\sum y = 55$, $\sum x^2 = 385$, $\sum y^2 = 385$, $\sum xy = 382$.

Pearson correlation coefficient,

$r = \dfrac{10 \times 382 - 55 \times 55}{\sqrt{(10 \times 385 - 55^2)(10 \times 385 - 55^2)}} = \dfrac{795}{825} \approx 0.96$

Still strong correlation!


We do the same problem with Spearman's rank correlation formula. For this we consider the following table:

(rank of x)   (rank of y)    d²
     1             1          0
     2             3          1
     3             2          1
     4             4          0
     5             6          1
     6             5          1
     7             7          0
     8             9          1
     9             8          1
    10            10          0

Here $\sum_i d_i^2 = 6$. The Spearman rank correlation coefficient:

$\rho = 1 - \dfrac{6 \times 6}{10 \times (10^2 - 1)} = 1 - \dfrac{36}{990} \approx 0.96$

Strong correlation!

Here the two methods give the same result, with the values of the correlation coefficients equal, as we have used the same integer values (from 1 to 10) both as data and as ranks in this demonstration.
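This agreement can be confirmed in a few lines of Python; both formulas, applied to the shuffled columns, give the same value:

```python
import math

x = list(range(1, 11))
y = [1, 3, 2, 4, 6, 5, 7, 9, 8, 10]   # the shuffled column

# Pearson, treating the entries as values
mx, my = sum(x) / 10, sum(y) / 10
r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / math.sqrt(
    sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Spearman, treating the same entries as ranks
d2 = sum((a - b) ** 2 for a, b in zip(x, y))
rho = 1 - 6 * d2 / (10 * 99)

print(round(r, 4), round(rho, 4))  # both are about 0.9636
```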
Applications:

Example#1 In the following, 5 pairs of values of $x$ and $y$ are given. We have to check if there is any correlation between the data sets.

x:   2    4    5    6    8
y:   5    9   11   10   12
Method 1: Pearson
We calculate the necessary quantities in the other columns to be put in the Pearson correlation formula.

No.     x     y     x²     y²     xy
1       2     5      4     25     10
2       4     9     16     81     36
3       5    11     25    121     55
4       6    10     36    100     60
5       8    12     64    144     96
Total  25    47    145    471    257

Here we find $\sum x = 25$, $\sum y = 47$, $\sum x^2 = 145$, $\sum y^2 = 471$ and $\sum xy = 257$.

The Pearson correlation coefficient,

$r = \dfrac{5 \times 257 - 25 \times 47}{\sqrt{(5 \times 145 - 25^2)(5 \times 471 - 47^2)}} = \dfrac{110}{\sqrt{100 \times 146}} \approx 0.91$

The above calculated value of the correlation coefficient is close to 1. Thus we may say that there is a strong correlation between the two sets of data.

Method 2: Spearman
The correlation between the two sets of data can also be calculated from the Spearman rank correlation formula.

No.     x     y    (rank of x)  (rank of y)    d²
1       2     5        1            1           0
2       4     9        2            2           0
3       5    11        3            4           1
4       6    10        4            3           1
5       8    12        5            5           0
Total                                           2

The Spearman rank correlation,

$\rho = 1 - \dfrac{6 \times 2}{5 \times (5^2 - 1)} = 1 - \dfrac{12}{120} = 0.9$

The correlation is strong! So the conclusions by the two methods are the same.
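Both results can be reproduced quickly in Python (a sketch; `scipy.stats.pearsonr` and `scipy.stats.spearmanr` would also do this directly, but plain Python suffices here):

```python
import math

x = [2, 4, 5, 6, 8]
y = [5, 9, 11, 10, 12]
n = len(x)

# Pearson via the computational formula
num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
den = math.sqrt((n * sum(a * a for a in x) - sum(x) ** 2) *
                (n * sum(b * b for b in y) - sum(y) ** 2))
r = num / den                           # about 0.91

# Spearman via ranks (no ties in this data)
rank = lambda v: [sorted(v).index(u) + 1 for u in v]
d2 = sum((a - b) ** 2 for a, b in zip(rank(x), rank(y)))
rho = 1 - 6 * d2 / (n * (n * n - 1))    # exactly 0.9

print(round(r, 3), rho)
```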

Example#2 A group of 10 runners are interested in finding out whether their running capacity is associated with age or not. They record their age in years and also the time taken in finishing a trial run of 100 m. The results are:

Names        Akash  Abeer  Priya  Kader  Samrat  Baser  Daniel  Swati  Yasmin  Elizabeth
Age (years)    54     46     29     28     25      36     15      26     38       34
Time (sec)   10.56   9.88  11.02   9.99  10.21   11.59   12.0   10.87  11.46    11.25

Method 1: Pearson

Names       Age (x)  Time (y)     x²        y²        xy
Akash         54      10.56      2916     111.51    570.24
Abeer         46       9.88      2116      97.61    454.48
Priya         29      11.02       841     121.44    319.58
Kader         28       9.99       784      99.80    279.72
Samrat        25      10.21       625     104.24    255.25
Baser         36      11.59      1296     134.33    417.24
Daniel        15      12.0        225     144.00    180.00
Swati         26      10.87       676     118.16    282.62
Yasmin        38      11.46      1444     131.33    435.48
Elizabeth     34      11.25      1156     126.56    382.50
TOTAL        331     108.83     12079    1188.99   3577.11

$r = \dfrac{10 \times 3577.11 - 331 \times 108.83}{\sqrt{(10 \times 12079 - 331^2)(10 \times 1188.99 - 108.83^2)}} = \dfrac{-251.63}{\sqrt{11229 \times 45.93}} \approx -0.35$

Conclusion: Weak correlation!

Method 2: Spearman

NAMES       Age (x)  Time (y)  (rank of x)  (rank of y)    d²
Akash         54      10.56        10            4          36
Abeer         46       9.88         9            1          64
Priya         29      11.02         5            6           1
Kader         28       9.99         4            2           4
Samrat        25      10.21         2            3           1
Baser         36      11.59         7            9           4
Daniel        15      12.0          1           10          81
Swati         26      10.87         3            5           4
Yasmin        38      11.46         8            8           0
Elizabeth     34      11.25         6            7           1
TOTAL                                                      196

Spearman's rank correlation:

$\rho = 1 - \dfrac{6 \times 196}{10 \times (10^2 - 1)} = 1 - \dfrac{1176}{990} \approx -0.19$

So the correlation is negative and very weak! One can say that there is almost no correlation between the two sets of data.
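Both calculations for this example can be checked with a short script:

```python
import math

age = [54, 46, 29, 28, 25, 36, 15, 26, 38, 34]
time = [10.56, 9.88, 11.02, 9.99, 10.21, 11.59, 12.0, 10.87, 11.46, 11.25]
n = len(age)

# Pearson via the computational formula
num = n * sum(a * b for a, b in zip(age, time)) - sum(age) * sum(time)
den = math.sqrt((n * sum(a * a for a in age) - sum(age) ** 2) *
                (n * sum(b * b for b in time) - sum(time) ** 2))
r = num / den                           # about -0.35

# Spearman via ranks (no ties in this data)
rank = lambda v: [sorted(v).index(u) + 1 for u in v]
d2 = sum((a - b) ** 2 for a, b in zip(rank(age), rank(time)))
rho = 1 - 6 * d2 / (n * (n * n - 1))    # about -0.19

print(round(r, 3), round(rho, 3))
```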

NOTE: The numerical values of the correlation coefficients by the two methods (Pearson and Spearman) can be different. There is, in fact, no reason that the two values should be the same, but the conclusion has to be the same. In the above Example#1, a strong correlation is found by both methods (the two correlation values are incidentally close). In Example#2, a weak correlation is found by both methods (the two correlation values are quite different here).

Sometimes one set of quantities is given as actual values while the other set is presented as ranks. In this case, it is useful to convert the actual values into ranks so that we get a pair of rank sets and calculate the Spearman rank correlation from there.
It is evident that it is easier to calculate the Spearman rank correlation once we obtain the ranks.

Problem#1
Consider the following height-weight data. Calculate the correlation coefficient and comment.

Height (cm): 170 172 181 157 150 168 166 175 177 165 163 152 161 173 175
Weight (kg):  65  66  69  55  51  63  61  75  72  64  61  52  60  70  72
Problem#2
Here we present a table that represents the data for shoe sizes vs. height achieved by Olympic participants in a high jump event. Both the columns are measured in inches.

Shoe size: 12.0  7.0  4.5 11.0  8.5  5.0 12.0  7.5  8.5  5.5  9.5  5.5 10.5 12.0 14.0  7.0  7.0
Height:      72   64   62   70   69   65   72   65   65   65   68   61   69   77   73   65   67

Shoe size: 12.0 12.0  7.0 13.0 11.0 12.0  4.5 10.5 10.0 10.0 13.0  7.5  4.5  8.5 14.0 10.0  6.5
Height:      71   73   64   71   71   72   61   71   66   67   73   69   61   70   75   72   66

Calculate the correlation coefficient. For a visual effect, you can do a scatter plot for the pairs of
data. What do you think of the relationship between them as observed from the scatter plot?
Does the idea match with the correlation coefficient?
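As a way to attack both problems, one can compute the coefficient and draw the scatter plot in a few lines. A sketch using the Problem#1 data (matplotlib, assumed available, is optional and shown commented out):

```python
import math

height = [170, 172, 181, 157, 150, 168, 166, 175, 177, 165, 163, 152, 161, 173, 175]
weight = [65, 66, 69, 55, 51, 63, 61, 75, 72, 64, 61, 52, 60, 70, 72]
n = len(height)

# Pearson correlation via the computational formula
num = n * sum(h * w for h, w in zip(height, weight)) - sum(height) * sum(weight)
den = math.sqrt((n * sum(h * h for h in height) - sum(height) ** 2) *
                (n * sum(w * w for w in weight) - sum(weight) ** 2))
r = num / den
print(round(r, 3))

# For the visual check (optional):
# import matplotlib.pyplot as plt
# plt.scatter(height, weight)
# plt.xlabel("Height (cm)"); plt.ylabel("Weight (kg)")
# plt.show()
```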
Autocorrelation:
Autocorrelation is a correlation coefficient calculated from the same variable. Instead of correlation between two different variables, the correlation is between two values of the same variable at different times (or at different positions in the column of data).
Suppose we have one set of data $(x_1, x_2, \dots, x_n)$ measured at different times $(t_1, t_2, \dots, t_n)$, or instead we think of the positions of the values as appearing in the serial order $i = 1, 2, \dots, n$.

The autocorrelation function is defined as

$C(\tau) = \dfrac{\sum_i (x_i - \bar{x})(x_{i+\tau} - \bar{x})}{\sum_i (x_i - \bar{x})^2}$

In the numerator, we take the data at any $i$-th position and then we think of another data point at the $(i+\tau)$-th position. After taking the products of the mean differences, we sum over all the positions $i$.

The autocorrelation function is used to test if there is any pattern inside a set of data, or to test whether the data are generated from a random process.
Example: Consider the $x$-values (1.0, -0.31, -0.74, 0.21, -0.90, 0.38, 0.63, -0.77, -0.12, 0.82).
Here, $\bar{x} = 0.02$.

To calculate the autocorrelation function, if we start from $i = 1$ and set $\tau = 1$, then we consider

$C(1) = \dfrac{\sum_{i=1}^{n-1} (x_i - \bar{x})(x_{i+1} - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \dfrac{(x_1 - \bar{x})(x_2 - \bar{x}) + (x_2 - \bar{x})(x_3 - \bar{x}) + \dots + (x_{n-1} - \bar{x})(x_n - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

So this way, we can calculate all the functions we want (for any value of $\tau$). To see how those $C(\tau)$'s vary, we can plot $C(\tau)$ vs. $\tau$.

One can easily check that, by definition, the autocorrelation function lies between -1 and +1 as before.
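A short sketch implementing $C(\tau)$ for the ten values above (the function name `autocorr` is our own):

```python
def autocorr(x, tau):
    """C(tau) = Sum_i (x_i - m)(x_{i+tau} - m) / Sum_i (x_i - m)^2,
    with the numerator summed over i = 1 .. n - tau."""
    n = len(x)
    m = sum(x) / n
    num = sum((x[i] - m) * (x[i + tau] - m) for i in range(n - tau))
    den = sum((v - m) ** 2 for v in x)
    return num / den

data = [1.0, -0.31, -0.74, 0.21, -0.90, 0.38, 0.63, -0.77, -0.12, 0.82]
c = [autocorr(data, t) for t in range(5)]  # C(0), C(1), ..., C(4)
print([round(v, 3) for v in c])
```

By construction $C(0) = 1$; plotting the list `c` against $\tau$ shows how quickly the correlation decays along the series.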

