
Correlation and Regression

Bi-variate data: Let (x_1, y_1), …, (x_n, y_n) be n paired observations on two variables X and Y. Such data are called bi-variate data.
Ex:
Potential sales of a new product and its price.
A family's expenditure on travel and its income.
A patient's weight and the number of weeks he or she has been on a diet.
Yield of a variety of wheat and rainfall.
The major objective of correlation analysis is to measure the extent of the relationship between two variables.
The objective of regression is to establish the relationship between the two variables, which can then be used to predict one variable from a given value of the other.
Correlation Coefficient: For measuring the extent of the linear relationship between two variables X and Y, we use the product moment correlation coefficient defined as

r_{xy} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{X})^2}\;\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{Y})^2}}

The numerator is called the covariance between X and Y. Thus

r_{xy} = \frac{\mathrm{cov}(X, Y)}{\sigma_x \sigma_y}

Alternatively, r_{xy} can be written as

r_{xy} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{X}\bar{Y}}{\sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{X}^2}\;\sqrt{\sum_{i=1}^{n} y_i^2 - n\bar{Y}^2}}

Result: -1≤rxy≤1.
When rxy=±1, we say that X and Y are perfectly
(positively or negatively) correlated. When
rxy=0, we say that X and Y are uncorrelated.
Notice that rxy=ryx.
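As a check on the definition, a short Python sketch (with illustrative data, not from these notes) computes r_xy directly from the defining formula and confirms the symmetry r_xy = r_yx:

```python
import math

def pearson_r(x, y):
    """Product moment correlation: cov(X, Y) / (sigma_x * sigma_y),
    using population (1/n) moments as in the definition above."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / n)
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / n)
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(pearson_r(x, y), 4))  # same value as pearson_r(y, x)
```

The same function also illustrates the boundary cases: pearson_r(x, x) is exactly 1, and reversing the sign of one variable flips the sign of r.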
Method of Least Squares and Regression Lines
We assume that for two random variables X and Y
Y = α + βX + e
where e is a random variable called the error term. The above equation is called the linear regression equation of Y on X. The rv Y is called the dependent variable and the rv X is called the independent variable. Here α and β are (usually unknown) parameters called the regression coefficients. The coefficient α is the intercept term and β is the slope coefficient. Since α and β are usually unknown, we estimate these coefficients on the basis of a given set of observations.
Let (x1,y1),…,(xn,yn) be n paired observations on
two variables X and Y. For estimating the
regression coefficients α and β, we use the
method of least squares.
Let us write
y_i = α + βx_i + e_i, i = 1, 2, …, n
where e_i is the error term corresponding to the i-th observation. In the method of least squares we estimate α and β by minimizing the error sum of squares
Σe_i^2 = Σ(y_i − α − βx_i)^2
with respect to α and β.
To minimize the error sum of squares, we differentiate it with respect to α and β and set the derivatives equal to zero. This yields the following set of equations:
\sum_{i=1}^{n} y_i = n\alpha + \beta \sum_{i=1}^{n} x_i

\sum_{i=1}^{n} x_i y_i = \alpha \sum_{i=1}^{n} x_i + \beta \sum_{i=1}^{n} x_i^2
These equations are called the normal equations.
The resulting estimators, obtained by solving the above normal equations, are denoted by a and b_yx and are called the least squares estimators of α and β respectively.
Result: The least squares estimators of the regression coefficients α and β are given by (r is the same as r_{xy})

b_{yx} = r\,\frac{\sigma_y}{\sigma_x}; \qquad a = \bar{Y} - b_{yx}\bar{X}

The regression line of Y on X is given by

Y - \bar{Y} = b_{yx}(X - \bar{X})

Similarly, the regression line of X on Y is given by

X - \bar{X} = b_{xy}(Y - \bar{Y}); \quad \text{with} \quad b_{xy} = r\,\frac{\sigma_x}{\sigma_y}

In general b_yx is not equal to b_xy.
b_xy and b_yx have the same sign as r.
b_xy·b_yx = r^2, i.e., |r| is the geometric mean of |b_xy| and |b_yx|.
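These identities can be verified numerically. A small Python sketch (using the hours/scores data from the worked example later in these notes) computes b_yx, b_xy and r from the raw moments and checks that b_xy·b_yx = r^2:

```python
import math

# Hours of study and test scores from the French-test example.
x = [4, 9, 10, 14, 4, 7, 12, 22, 1, 17]
y = [31, 58, 65, 73, 37, 44, 60, 91, 21, 84]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
vx = sum((a - mx) ** 2 for a in x) / n   # sigma_x^2
vy = sum((b - my) ** 2 for b in y) / n   # sigma_y^2
r = cov / math.sqrt(vx * vy)
byx = cov / vx   # equals r * sigma_y / sigma_x
bxy = cov / vy   # equals r * sigma_x / sigma_y
print(round(byx, 4), round(bxy, 4), round(byx * bxy, 4), round(r ** 2, 4))
```

The last two printed values agree, illustrating b_xy·b_yx = r^2.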
Several other non-linear relationships between X and Y can be transformed into linear form, and the method of least squares can then be used to estimate the parameters. Some examples are as follows:
(a) Y = a + b/X (substitute Z = 1/X)
(b) Y = A·B^X (take log of both sides)
(c) Y = A·X^B (take log of both sides)
(d) Y = A·e^{bX} (take log of both sides)
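As an illustration of case (d), a minimal Python sketch fits Y = A·e^{bX} by regressing ln Y on X. The data here are synthetic, generated with A = 2 and b = 0.5, so the fit should recover those values exactly:

```python
import math

# Synthetic data from Y = A * exp(b X) with A = 2, b = 0.5.
A_true, b_true = 2.0, 0.5
x = [0, 1, 2, 3, 4]
y = [A_true * math.exp(b_true * xi) for xi in x]

# Linearize: ln Y = ln A + b X, then ordinary least squares on (x, ln y).
ly = [math.log(yi) for yi in y]
n = len(x)
mx, mly = sum(x) / n, sum(ly) / n
b = sum((xi - mx) * (li - mly) for xi, li in zip(x, ly)) \
    / sum((xi - mx) ** 2 for xi in x)
lnA = mly - b * mx
print(round(math.exp(lnA), 4), round(b, 4))  # -> 2.0 0.5
```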
Ex: The following data give the number of hours which ten persons studied for a French test and their scores on the test:
Hrs: 4 9 10 14 4 7 12 22 1 17
Scores: 31 58 65 73 37 44 60 91 21 84
Fit the regression of test scores on number of hours studied. Predict the score of a person who studies for 20 hours.
(a = 21.69; b_yx = 3.47; r = 0.976)
The plot of scores against hours indicates that a
straight line provides a reasonably good fit.

[Scatter plot of test scores (0–100) against hours of study (0–25), showing an approximately linear trend.]
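The fit can be computed directly in plain Python using the alternative formula for the slope. (Note that the slope of scores on hours works out to about 3.47; the value 0.976 is the correlation coefficient r.)

```python
hours  = [4, 9, 10, 14, 4, 7, 12, 22, 1, 17]
scores = [31, 58, 65, 73, 37, 44, 60, 91, 21, 84]
n = len(hours)
mx, my = sum(hours) / n, sum(scores) / n
# b_yx = (sum x_i y_i - n*mean_x*mean_y) / (sum x_i^2 - n*mean_x^2)
b = (sum(h * s for h, s in zip(hours, scores)) - n * mx * my) \
    / (sum(h * h for h in hours) - n * mx * mx)
a = my - b * mx
print(round(a, 2), round(b, 3))   # intercept and slope -> 21.69 3.471
print(round(a + b * 20, 1))       # predicted score for 20 hours -> 91.1
```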
If θ is the angle between the two regression lines, then for r = 1 or −1 the two lines coincide (θ = 0), and for r = 0 the two lines are perpendicular to each other.
Result: The point of intersection of the two lines of regression is (\bar{X}, \bar{Y}).
Multiple regression: Suppose we have more than one independent variable, say X_1, X_2, …, X_p. The multiple regression equation of Y on X_1, …, X_p is given by
Y = α + β_1X_1 + … + β_pX_p + e
For a given set of observations, we can fit the multiple regression equation using the method of least squares.
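As a sketch of how such a fit proceeds, the following Python solves the normal equations (X'X)β = X'y for p = 2 predictors plus an intercept. The data are synthetic, generated from α = 1, β_1 = 1, β_2 = 1, so the solution should recover those coefficients:

```python
def solve(A, v):
    """Solve A x = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
y = [1 + a + b for a, b in zip(x1, x2)]      # generated with alpha=1, b1=1, b2=1
X = [[1.0, a, b] for a, b in zip(x1, x2)]    # design matrix, intercept column first
XtX = [[sum(X[k][i] * X[k][j] for k in range(5)) for j in range(3)] for i in range(3)]
Xty = [sum(X[k][i] * y[k] for k in range(5)) for i in range(3)]
beta = solve(XtX, Xty)
print([round(b, 3) for b in beta])           # recovers the generating coefficients
```

In practice one would use a library least-squares routine rather than forming X'X explicitly, but the sketch shows that the normal equations generalize directly from the simple regression case.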
Spearman’s Rank Correlation Coefficient:
Ex: The rankings given by two judges to the works of six artists are as follows:
Judge 1: 6 4 2 5 3 1
Judge 2: 2 5 4 1 3 6
Find the correlation between the rankings of two
judges.
If we have pairs of rankings for different units,
then, to find the correlation between two
rankings, we use rank correlation coefficient.
The formula for the rank correlation coefficient is given by

r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}

where d_i (i = 1, …, n) is the difference between the two ranks for the i-th unit. In case of ties between ranks, we assign the tied observations the mean of the ranks they jointly occupy. The rank correlation coefficient lies between −1 and 1.
Ex: Calculate the rank correlation coefficient for
the above example.
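One way to check the exercise is a direct Python computation of the formula on the two judges' rankings given above:

```python
judge1 = [6, 4, 2, 5, 3, 1]
judge2 = [2, 5, 4, 1, 3, 6]
n = len(judge1)
d2 = sum((a - b) ** 2 for a, b in zip(judge1, judge2))  # sum of squared rank differences
r_s = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(d2, round(r_s, 4))  # -> 62 -0.7714
```

The strongly negative value indicates that the two judges tend to rank the artists in nearly opposite order.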
