Correlation

The Measure of Correlation
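Let us first note that the variance ($\sigma_x^2$) of a set of $N$ data points ($x_1, x_2, \dots, x_N$) is

$$\sigma_x^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 \qquad (1)$$

Variance ($\sigma_y^2$) of another set of $N$ data points ($y_1, y_2, \dots, y_N$) is likewise,

$$\sigma_y^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - \bar{y})^2 \qquad (2)$$

Here $\bar{x}$ and $\bar{y}$ are the means of the two sets.
We can write the above two expressions in the following form:

$$\sigma_x^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(x_i - \bar{x})$$

and

$$\sigma_y^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - \bar{y})(y_i - \bar{y})$$

Therefore, we can also define a similar kind of formula involving two variables,

$$\mathrm{Cov}(x, y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y}), \qquad (3)$$

which is called the covariance of the two sets of data.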

The linear correlation between two sets of data is defined by the following coefficient:

$$r = \mathrm{Corr}(x, y) = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y}$$

The above coefficient is called Pearson's correlation coefficient. It tests how strongly the pairs of variables are linearly related.
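The definition can be tried out directly in a few lines of code. The following is a minimal sketch, assuming the population convention (division by $N$) for the variance and covariance; the function names `mean`, `covariance` and `pearson` and the sample data are illustrative choices, not part of the original notes.

```python
import math


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance: (1/N) * sum((x_i - mean_x) * (y_i - mean_y))."""
    if len(xs) != len(ys):
        raise ValueError("both data sets must have the same length")
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def pearson(xs, ys):
    """Pearson coefficient r = Cov(x, y) / (sigma_x * sigma_y)."""
    sigma_x = math.sqrt(covariance(xs, xs))  # the variance is Cov(x, x)
    sigma_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sigma_x * sigma_y)


# Illustrative data (made up): y is exactly 2x, so r should be 1
# up to floating-point rounding.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(pearson(x, y))
```

Note that the variance is obtained here as the covariance of a data set with itself, which mirrors the way the covariance formula was motivated from the variance.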
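Calculation of Pearson Correlation Coefficient:
For practical calculations, we often use the following formula, obtained after multiplying the numerator and denominator of the formula for $r$ by $N^2$:

$$r = \frac{N\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{\left[N\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[N\sum y_i^2 - \left(\sum y_i\right)^2\right]}}$$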
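The practical formula needs only five running sums ($\sum x$, $\sum y$, $\sum x^2$, $\sum y^2$, $\sum xy$). Below is a small sketch of that calculation; the function name `pearson_from_sums` and the four data pairs are hypothetical and chosen only for illustration.

```python
import math


def pearson_from_sums(xs, ys):
    """Pearson r computed from the raw sums of the two columns."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator


# Made-up data: sums are 10, 10, 30, 30 and 29, so
# r = (4*29 - 100) / sqrt(20 * 20) = 16 / 20.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
print(round(pearson_from_sums(x, y), 4))
```

This form avoids computing the means first, which is convenient when the sums are accumulated column by column in a hand-made table.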
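Note: In many books, the correlation coefficient is written in the form,

$$r = \frac{\sum X_i Y_i}{\sqrt{\sum X_i^2 \, \sum Y_i^2}},$$

where $X_i = x_i - \bar{x}$ and $Y_i = y_i - \bar{y}$ are the deviations from the means.
Abhijit Kar Gupta, kg.abhi@gmail.com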
Properties of $r$:
The coefficient measures the strength of a linear relationship.

The range: $-1 \le r \le +1$
$r = +1$: perfect positive linear correlation
$r = -1$: perfect negative linear correlation
$r = 0$: no linear correlation
We can have an idea of the kind of correlation between two sets of data from the ($x$, $y$) scatter plots.
[Figs.: scatter plots of the data pairs]
Correlation Matrix:
For the relations among more than two sets of variables, it is useful to present the correlation coefficients between every two sets of variables in the form of a table. This is called the correlation matrix.
For example, for three sets of variables, $X$, $Y$ and $Z$, we have the following table:

        X            Y            Z
X       1            $r_{XY}$     $r_{XZ}$
Y       $r_{XY}$     1            $r_{YZ}$
Z       $r_{XZ}$     $r_{YZ}$     1

Note that in the above table, we have only three different entries. The reason is that the matrix is symmetric, as the correlation between $X$ and $Y$ is the same as that between $Y$ and $X$, and so on: $r_{XY} = r_{YX}$, $r_{XZ} = r_{ZX}$, $r_{YZ} = r_{ZY}$. Also, the correlation of a variable with itself is trivial; it is always the perfect correlation ($r_{XX} = r_{YY} = r_{ZZ} = 1$). So we have only three independent useful quantities.
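A correlation matrix can be built by applying the pairwise coefficient to every pair of columns. The sketch below assumes three short made-up columns labelled X, Y and Z; the helper names `pearson` and `correlation_matrix` are illustrative and not taken from the notes.

```python
import math


def pearson(xs, ys):
    """Pearson r from the raw sums (assumes neither column is constant)."""
    n = len(xs)
    num = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    den = math.sqrt((n * sum(x * x for x in xs) - sum(xs) ** 2)
                    * (n * sum(y * y for y in ys) - sum(ys) ** 2))
    return num / den


def correlation_matrix(columns):
    """Nested dict m[a][b] holding r for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical data for three variables.
data = {
    "X": [1, 2, 3, 4, 5],
    "Y": [2, 1, 4, 3, 5],
    "Z": [5, 3, 4, 1, 2],
}
matrix = correlation_matrix(data)

# Print the table row by row.
print("     " + "".join(f"{name:>8}" for name in data))
for a in data:
    print(f"{a:<5}" + "".join(f"{matrix[a][b]:>8.3f}" for b in data))
```

The printed table has 1 on the diagonal and is symmetric about it, so only the three entries above the diagonal carry independent information.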
Spearman Rank Correlation:
Spearman's rank correlation coefficient between two sets of ranked variables is defined below.
Suppose the original pair of data sets ($x_1, x_2, \dots, x_N$) and ($y_1, y_2, \dots, y_N$) are rank ordered. This means that the data of the two sets are numbered as ranks 1, 2, 3, ... according to their values. Suppose we write the ranks $R_i$ and $S_i$ for the two variables X and Y.
Now we calculate the differences $d_i = R_i - S_i$ between the corresponding ranks of the two sets.
With the above, the Spearman rank correlation coefficient is

$$\rho = 1 - \frac{6\sum_{i=1}^{N} d_i^2}{N(N^2 - 1)}$$
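The ranking step and the formula can be sketched as follows. This is a minimal illustration that assumes there are no tied values (ties would need averaged ranks, which this formula does not cover); the names `ranks` and `spearman` and the sample data are hypothetical.

```python
def ranks(values):
    """Rank 1 for the smallest value, N for the largest (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman coefficient: 1 - 6 * sum(d_i^2) / (N * (N^2 - 1))."""
    n = len(xs)
    rank_x, rank_y = ranks(xs), ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Made-up data: ranks are [4, 1, 2, 3] and [2, 1, 4, 3], so sum(d^2) = 8.
x = [10.5, 3.2, 7.7, 8.1]
y = [2.0, 1.0, 4.0, 3.0]
print(ranks(x), ranks(y))
print(round(spearman(x, y), 4))
```

Only the ordering of the values enters the result, so any transformation of the data that preserves the order leaves the coefficient unchanged.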
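From Pearson Correlation to Spearman Rank Correlation:
Pearson correlation formula:

$$r = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y},$$

where

$$\sigma_x^2 = \frac{1}{N}\sum_i x_i^2 - \bar{x}^2, \qquad \sigma_y^2 = \frac{1}{N}\sum_i y_i^2 - \bar{y}^2$$

and

$$\mathrm{Cov}(x, y) = \frac{1}{N}\sum_i x_i y_i - \bar{x}\,\bar{y}.$$

Now in place of $x$ and $y$ values we shall use ranks. For example, instead of $x$-values ($x_1, x_2, \dots, x_N$) we use the ranks, $1, 2, \dots, N$. Then

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} i = \frac{N+1}{2}, \qquad \frac{1}{N}\sum_i x_i^2 = \frac{1}{N}\cdot\frac{N(N+1)(2N+1)}{6} = \frac{(N+1)(2N+1)}{6}.$$

So, we rewrite

$$\sigma_x^2 = \frac{(N+1)(2N+1)}{6} - \frac{(N+1)^2}{4} = \frac{N+1}{12}\left[2(2N+1) - 3(N+1)\right] = \frac{N^2 - 1}{12} \qquad (1)$$

Similarly, if we replace the $y$-values ($y_1, y_2, \dots, y_N$) by the ranks, $1, 2, \dots, N$ (in some order), we get

$$\sigma_y^2 = \frac{N^2 - 1}{12} \qquad (2)$$

Also,

$$\mathrm{Cov}(x, y) = \frac{1}{N}\sum_i x_i y_i - \frac{(N+1)^2}{4} \qquad (3)$$

Now consider $d_i = x_i - y_i$. So,

$$\sum_i d_i^2 = \sum_i x_i^2 + \sum_i y_i^2 - 2\sum_i x_i y_i = \frac{N(N+1)(2N+1)}{3} - 2\sum_i x_i y_i,$$

that is,

$$\sum_i x_i y_i = \frac{N(N+1)(2N+1)}{6} - \frac{1}{2}\sum_i d_i^2 \qquad (4)$$

Putting (4) into (3),

$$\mathrm{Cov}(x, y) = \frac{(N+1)(2N+1)}{6} - \frac{(N+1)^2}{4} - \frac{1}{2N}\sum_i d_i^2 = \frac{N^2 - 1}{12} - \frac{1}{2N}\sum_i d_i^2 \quad \text{[After simplifying]} \qquad (5)$$

Now we combine (1), (2) and (5) to get Spearman's rank correlation formula:

$$\rho = \frac{\dfrac{N^2 - 1}{12} - \dfrac{1}{2N}\sum_i d_i^2}{\dfrac{N^2 - 1}{12}} = 1 - \frac{6\sum_i d_i^2}{N(N^2 - 1)}$$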
For an exact correlation of the ranks (a monotonically increasing relation), $d_i = 0$ for every $i$, which will lead to $\rho = 1$.
By design, the range of Spearman's rank correlation coefficient is $-1 \le \rho \le +1$.
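The equivalence between the Pearson formula applied to ranks and the Spearman formula can be checked numerically. The sketch below is only a brute-force sanity check for one small case ($N = 5$, an arbitrary choice), not a proof; the function names are illustrative.

```python
import itertools
import math


def pearson(xs, ys):
    """Pearson r from the raw sums."""
    n = len(xs)
    num = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    den = math.sqrt((n * sum(x * x for x in xs) - sum(xs) ** 2)
                    * (n * sum(y * y for y in ys) - sum(ys) ** 2))
    return num / den


def spearman_from_ranks(rank_x, rank_y):
    """Spearman formula applied to two lists that already hold ranks."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


n = 5
base = list(range(1, n + 1))

# Variance of the ranks 1..N should equal (N^2 - 1) / 12.
mean_rank = sum(base) / n
variance = sum((v - mean_rank) ** 2 for v in base) / n
print(variance, (n * n - 1) / 12)

# Compare the two coefficients over every ordering of the ranks.
largest_gap = max(
    abs(pearson(base, list(perm)) - spearman_from_ranks(base, list(perm)))
    for perm in itertools.permutations(base)
)
print(largest_gap)
```

The largest gap stays at the level of floating-point rounding, as expected when both columns are permutations of $1, \dots, N$ with no ties.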
Let us try to understand the essence of correlation from the following demonstration.
Suppose we consider the integer numbers from 1 to 10 appearing in two columns in two different
ways. We may consider them as actual values or we can consider them as ranks as we wish.
First we examine the following table for calculating the Pearson correlation coefficient:
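  x      y      x²     y²     xy
  1      1      1      1      1
  2      2      4      4      4
  3      3      9      9      9
  4      4      16     16     16
  5      5      25     25     25
  6      6      36     36     36
  7      7      49     49     49
  8      8      64     64     64
  9      9      81     81     81
  10     10     100    100    100
Total:
  55     55     385    385    385

$$r = \frac{10 \times 385 - 55 \times 55}{\sqrt{(10 \times 385 - 55^2)(10 \times 385 - 55^2)}} = \frac{825}{825} = 1$$

Perfect correlation! This has to be so, as the two data sets are exactly the same: $\mathrm{Cov}(x, y) = \sigma_x^2 = \sigma_y^2$, and thus $r = 1$. It is a trivial case.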
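Now, we consider another case where the numbers in one column come exactly in the reverse order. So, here also we will have $\sum x = \sum y = 55$ and $\sum x^2 = \sum y^2 = 385$, but $\sum xy = 220$, so that

$$r = \frac{10 \times 220 - 55 \times 55}{825} = \frac{-825}{825} = -1$$

Perfect anti-correlation (negative correlation)! This is also a trivial case.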

Now in the above example, if we want to measure the Spearman rank correlation, we consider
the entries in the x- and y-columns as ranks (in this special case).
So for the 1st case, when the two columns are the same, we have each difference of ranks $d_i = 0$. Thus $\sum d_i^2 = 0$, which will lead to the rank correlation coefficient $\rho = 1$. Perfect correlation.
For the 2nd case, when the values in one column come in the exact opposite order, we have the following table:

  rank of x   rank of y   d²
  1           10          81
  2           9           49
  3           8           25
  4           7           9
  5           6           1
  6           5           1
  7           4           9
  8           3           25
  9           2           49
  10          1           81
Total:                    330

Spearman rank correlation coefficient:

$$\rho = 1 - \frac{6 \times 330}{10 \times (10^2 - 1)} = 1 - \frac{1980}{990} = -1$$

Perfect anti-correlation.
Now we take a case where the two columns are not exactly the same, and one is not in the exact reverse order of the other either. For a test case, we shuffle some of the entries in the y-column (keeping the x-column intact) and examine how the correlation is affected.
Consider the following table:
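  x      y      x²     y²     xy
  1      1      1      1      1
  2      3      4      9      6
  3      2      9      4      6
  4      4      16     16     16
  5      6      25     36     30
  6      5      36     25     30
  7      7      49     49     49
  8      9      64     81     72
  9      8      81     64     72
  10     10     100    100    100
Total:
  55     55     385    385    382

Pearson correlation coefficient,

$$r = \frac{10 \times 382 - 55 \times 55}{\sqrt{(10 \times 385 - 55^2)(10 \times 385 - 55^2)}} = \frac{795}{825} \approx 0.964$$

Still strong correlation!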

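We do the same problem with Spearman's rank correlation formula. For this we consider the following table:

  rank of x   rank of y   d²
  1           1           0
  2           3           1
  3           2           1
  4           4           0
  5           6           1
  6           5           1
  7           7           0
  8           9           1
  9           8           1
  10          10          0
Total:                    6

The Spearman Rank Correlation Coefficient:

$$\rho = 1 - \frac{6 \times 6}{10 \times (10^2 - 1)} = 1 - \frac{36}{990} \approx 0.964$$

Strong correlation!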
Here the two methods give the same value of the correlation coefficient. This is because we have used the integer values from 1 to 10 for the demonstration, so the values themselves coincide with their ranks.
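The three demonstration cases (identical columns, reversed column, partly shuffled column) can be reproduced with a short script. This is a sketch with illustrative function names; since the entries are the integers 1 to 10, they are passed directly as ranks to the Spearman formula.

```python
import math


def pearson(xs, ys):
    """Pearson r from the raw sums."""
    n = len(xs)
    num = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    den = math.sqrt((n * sum(x * x for x in xs) - sum(xs) ** 2)
                    * (n * sum(y * y for y in ys) - sum(ys) ** 2))
    return num / den


def spearman_from_ranks(rank_x, rank_y):
    """Spearman formula; the inputs are treated as ranks."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


x = list(range(1, 11))
cases = {
    "identical": x[:],
    "reversed": x[::-1],
    "shuffled": [1, 3, 2, 4, 6, 5, 7, 9, 8, 10],
}

results = {name: (pearson(x, y), spearman_from_ranks(x, y))
           for name, y in cases.items()}
for name, (r, rho) in results.items():
    print(f"{name:<10} Pearson = {r:+.4f}   Spearman = {rho:+.4f}")
```

In each case the two coefficients agree, which is specific to data that are already a permutation of $1, \dots, N$.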
Applications:
Example#1 In the following, 5 sets of values of $x$ and $y$ are given. We have to check if there is any correlation between the data sets.

  x    2    4    5    6    8
  y    5    9    11   10   12
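Method 1: Pearson
We calculate the necessary quantities in the other columns to be put in the Pearson correlation formula.

  No.     x     y     x²     y²     xy
  1       2     5     4      25     10
  2       4     9     16     81     36
  3       5     11    25     121    55
  4       6     10    36     100    60
  5       8     12    64     144    96
  Total   25    47    145    471    257

Here we find, $N = 5$, $\sum x = 25$, $\sum y = 47$, $\sum x^2 = 145$, $\sum y^2 = 471$ and $\sum xy = 257$.
The Pearson correlation coefficient,

$$r = \frac{5 \times 257 - 25 \times 47}{\sqrt{(5 \times 145 - 25^2)(5 \times 471 - 47^2)}} = \frac{110}{\sqrt{100 \times 146}} \approx 0.91$$

The above calculated value of the correlation coefficient is close to 1. Thus we may say that there is a strong correlation between the two sets of data.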
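Method 2: Spearman
The correlation between the two sets of data can also be calculated from the Spearman rank correlation formula.

  No.     x     y     rank of x   rank of y   d²
  1       2     5     1           1           0
  2       4     9     2           2           0
  3       5     11    3           4           1
  4       6     10    4           3           1
  5       8     12    5           5           0
  Total                                       2

The Spearman rank correlation,

$$\rho = 1 - \frac{6 \times 2}{5 \times (5^2 - 1)} = 1 - \frac{12}{120} = 0.9$$

The correlation is strong! So the conclusions by the two methods are the same.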
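Both methods for Example#1 can be checked with a few lines of code. The sketch below uses the five $(x, y)$ pairs of the example; the variable and function names are illustrative, and the ranking helper assumes there are no tied values.

```python
import math

# Data of Example#1.
x = [2, 4, 5, 6, 8]
y = [5, 9, 11, 10, 12]
n = len(x)

# Method 1: the column totals of the Pearson table.
sum_x, sum_y = sum(x), sum(y)
sum_xx = sum(v * v for v in x)
sum_yy = sum(v * v for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))


# Method 2: ranks (1 = smallest, no ties assumed) and rank differences.
def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
rho = 1 - 6 * sum_d2 / (n * (n * n - 1))

print("totals:", sum_x, sum_y, sum_xx, sum_yy, sum_xy)
print("Pearson r    =", round(r, 4))
print("Spearman rho =", round(rho, 4))
```

The totals reproduce the last row of the Pearson table, and the two coefficients come out close to each other for this data set.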
Example#2 A group of 10 runners is interested in finding out whether their running capacity is associated with age or not. They record their ages in years and also the time taken to finish a trial run of 100 m. The results are:

  Names        Age (years)   Time (sec)
  Akash        54            10.56
  Abeer        46            9.88
  Priya        29            11.02
  Kader        28            9.99
  Samrat       25            10.21
  Baser        36            11.59
  Daniel       15            12.0
  Swati        26            10.87
  Yasmin       38            11.46
  Elizabeth    34            11.25
Method 1: Pearson

  Names        Age (x)   Time (y)   x²         y²         xy
  Akash        54        10.56      2916       111.51     570.24
  Abeer        46        9.88       2116       97.61      454.48
  Priya        29        11.02      841        121.44     319.58
  Kader        28        9.99       784        99.80      279.72
  Samrat       25        10.21      625        104.24     255.25
  Baser        36        11.59      1296       134.33     417.24
  Daniel       15        12.0       225        144.00     180.00
  Swati        26        10.87      676        118.16     282.62
  Yasmin       38        11.46      1444       131.33     435.48
  Elizabeth    34        11.25      1156       126.56     382.50
  TOTAL        331       108.83     12079.0    1188.99    3577.11

$$r = \frac{10 \times 3577.11 - 331 \times 108.83}{\sqrt{(10 \times 12079 - 331^2)(10 \times 1188.99 - 108.83^2)}} = \frac{-251.63}{\sqrt{11229 \times 45.93}} \approx -0.35$$

Conclusion: Weak (negative) correlation!
Method 2: Spearman

  Names        Age (x)   Time (y)   rank of x   rank of y   d²
  Akash        54        10.56      10          4           36
  Abeer        46        9.88       9           1           64
  Priya        29        11.02      5           6           1
  Kader        28        9.99       4           2           4
  Samrat       25        10.21      2           3           1
  Baser        36        11.59      7           9           4
  Daniel       15        12.0       1           10          81
  Swati        26        10.87      3           5           4
  Yasmin       38        11.46      8           8           0
  Elizabeth    34        11.25      6           7           1
  TOTAL                                                     196

Spearman's Rank Correlation:

$$\rho = 1 - \frac{6 \times 196}{10 \times (10^2 - 1)} = 1 - \frac{1176}{990} \approx -0.19$$

So the correlation is negative and very weak! One can say that there is almost no correlation between the two sets of data.
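The rank table and both coefficients for Example#2 can be regenerated as a cross-check. The sketch below takes the names, ages and times from the example; the helper names are illustrative, and the ranking again assumes no tied values (true for this data set).

```python
import math

# Data of Example#2: (name, age in years, 100 m time in seconds).
runners = [
    ("Akash", 54, 10.56), ("Abeer", 46, 9.88), ("Priya", 29, 11.02),
    ("Kader", 28, 9.99), ("Samrat", 25, 10.21), ("Baser", 36, 11.59),
    ("Daniel", 15, 12.0), ("Swati", 26, 10.87), ("Yasmin", 38, 11.46),
    ("Elizabeth", 34, 11.25),
]


def ranks(values):
    """Rank 1 for the smallest value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def pearson(xs, ys):
    """Pearson r from the raw sums."""
    n = len(xs)
    num = n * sum(a * b for a, b in zip(xs, ys)) - sum(xs) * sum(ys)
    den = math.sqrt((n * sum(a * a for a in xs) - sum(xs) ** 2)
                    * (n * sum(b * b for b in ys) - sum(ys) ** 2))
    return num / den


ages = [age for _, age, _ in runners]
times = [time for _, _, time in runners]
age_rank, time_rank = ranks(ages), ranks(times)
d2 = [(a - b) ** 2 for a, b in zip(age_rank, time_rank)]

n = len(runners)
rho = 1 - 6 * sum(d2) / (n * (n * n - 1))
r = pearson(ages, times)

# Rank table: name, rank of age, rank of time, squared rank difference.
for (name, _, _), ra, rt, d in zip(runners, age_rank, time_rank, d2):
    print(f"{name:<10}{ra:>4}{rt:>4}{d:>5}")
print("sum of d^2   =", sum(d2))
print("Pearson r    =", round(r, 3))
print("Spearman rho =", round(rho, 3))
```

Both coefficients are negative and small in magnitude, although their numerical values differ noticeably.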
NOTE: The numerical values of the correlation coefficients obtained by the two methods (Pearson and Spearman) can be different. There is, in fact, no reason for the two values to be the same. But the conclusions should agree. In Example#1 above, a strong correlation is found by both methods. (The two correlation values are incidentally close.) In Example#2, a weak correlation is found by both methods. (The two correlation values are quite different here.)
Sometimes one set of quantities is given as actual values while the other set is presented as ranks. In this case, it is useful to convert the actual values into ranks, so that we get a pair of ranks, and calculate the Spearman rank correlation from there.
It is evident that the Spearman rank correlation is easier to calculate once we have obtained the ranks.
Problem#1
Consider the following height-weight data. Calculate the correlation coefficient and comment.

  Height (cm)   170  172  181  157  150  168  166  175  177  165  163  152  161  173  175
  Weight (kg)   65   66   69   55   51   63   61   75   72   64   61   52   60   70   72
Problem#2
Here we present a table that represents the data for shoe sizes vs. height achieved by Olympic participants in a high jump event. Both the columns are measured in inches.

  Shoe size   12.0  7.0   4.5  11.0  8.5   5.0   12.0  7.5   8.5   5.5   9.5   5.5  10.5  12.0  14.0  7.0   7.0
  Height      72    64    62   70    69    65    72    65    65    65    68    61   69    77    73    65    67

  Shoe size   12.0  12.0  7.0  13.0  11.0  12.0  4.5   10.5  10.0  10.0  13.0  7.5  4.5   8.5   14.0  10.0  6.5
  Height      71    73    64   71    71    72    61    71    66    67    73    69   61    70    75    72    66

Calculate the correlation coefficient. For a visual impression, you can make a scatter plot of the pairs of data. What do you think of the relationship between them as observed from the scatter plot? Does that impression match the correlation coefficient?
Autocorrelation:
Autocorrelation is a correlation coefficient calculated from a single variable. Instead of the correlation between two different variables, the correlation is between two values of the same variable at different times (or at different positions in the column of data).
Suppose we have one set of data ($x_1, x_2, \dots, x_N$) measured at different times ($t_1, t_2, \dots, t_N$), or instead we think of the positions of the values as appearing in the serial order, $i = 1, 2, \dots, N$.
The autocorrelation function is defined as

$$r_k = \frac{\sum_{i=1}^{N-k}(x_i - \bar{x})(x_{i+k} - \bar{x})}{\sum_{i=1}^{N}(x_i - \bar{x})^2}$$

In the numerator, we take the data point at any $i$-th position and then we think of another data point at the $(i+k)$th position. After taking the products of the deviations from the mean, we sum over all the positions $i$.
The autocorrelation function is used to test whether there is any pattern inside a set of data, or whether the data could have been generated by a random process.
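Example: Consider the $x$-values (1.0, -0.31, -0.74, 0.21, -0.90, 0.38, 0.63, -0.77, -0.12, 0.82).
Here, $N = 10$ and $\bar{x} = 0.02$.
To calculate the autocorrelation function, if we start from $i = 1$ and set $k = 1$, then we consider

$$r_1 = \frac{(x_1 - \bar{x})(x_2 - \bar{x}) + (x_2 - \bar{x})(x_3 - \bar{x}) + \dots + (x_9 - \bar{x})(x_{10} - \bar{x})}{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_{10} - \bar{x})^2}$$

So this way, we can calculate all the functions we want (for any value of $k$).
To see how those $r_k$'s vary, we can plot $r_k$ vs. $k$.
One can easily check that, by definition, the autocorrelation function lies between -1 and +1, as before.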
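The autocorrelation function for the example data can be tabulated with a short script. This is a sketch that assumes the common convention in which the numerator runs over $i = 1, \dots, N-k$ and the denominator is the full sum of squared deviations; the function name `autocorrelation` is illustrative.

```python
def autocorrelation(values, k):
    """r_k: lagged products of mean deviations over the total squared deviation."""
    n = len(values)
    mean = sum(values) / n
    numerator = sum((values[i] - mean) * (values[i + k] - mean)
                    for i in range(n - k))
    denominator = sum((v - mean) ** 2 for v in values)
    return numerator / denominator


# The x-values of the example.
x = [1.0, -0.31, -0.74, 0.21, -0.90, 0.38, 0.63, -0.77, -0.12, 0.82]

print("mean =", round(sum(x) / len(x), 4))
for k in range(len(x)):
    print(k, round(autocorrelation(x, k), 3))
```

For $k = 0$ the numerator and denominator coincide, so $r_0 = 1$; the printed pairs $(k, r_k)$ are the points one would plot to look for a pattern in the data.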
