When comparing two population means using a t-test in chapter 10, if we conclude that
there is a difference in means, we are also concluding that there is a relationship between
Y (or X), a quantitative variable, and the populations, a qualitative variable. For
example, if we test whether the average number of hours studied differs between Business
and Engineering students, we would also be testing whether there is a relationship
between hours studied and a student's major (and if we reject H0, we would be
concluding that there is such a relationship). In this example, hours studied is
quantitative and a student's major is qualitative. So testing differences in means is
equivalent to testing for a relationship between a quantitative and a qualitative variable.
When testing if there is a difference in two population proportions of successes, we are
also testing whether there is a relationship between two qualitative variables. For
example, if we test whether the proportion of students taking business differs at two
different universities, we would also be testing whether there is a relationship between
two qualitative variables, business major and university (and if we reject H0, we would
be concluding that there is such a relationship). What's left is to look at the relationship
between two quantitative variables.
In this chapter, we are interested in studying the linear relationship between two
quantitative variables (e.g. between hours studied and marks on an exam). To study such
a relationship, we either use correlation analysis or simple linear regression.
Correlation Analysis
We use correlation analysis when we believe that there is a linear relationship between
two quantitative random variables but we do not want to assume that one of these
variables is affected or caused by the other variable. (For example, we believe that there
is a linear relationship between the number of HD TVs sold and the number of cell
phones sold but we do not want to assume that the number of HD TVs sold is affected by
the number of cell phones sold or vice versa.) And, we use correlation analysis when we
want to estimate the strength of this linear relation and possibly test whether a linear
relationship (or a positive linear relationship or a negative linear relationship) actually
exists by taking a sample from the appropriate population. For example, suppose we
gathered data on the number of each product sold at six outlets of a large electronics
chain during one weekend in September and the following data resulted:
Store          A    B    C    D    E    F
HD TVs        12   19   36   16   22   27
cell phones   27   37   65   24   39   42
Here, we will arbitrarily use the variable X to represent the number of HD TVs sold and
Y to represent the number of cell phones sold.
But, before we measure the strength of this linear relationship, we first should determine
whether a linear relationship is the appropriate relationship. We can use a scatter diagram
to see whether or not a linear relationship is a reasonable assumption to make. The
scatter diagram for this example would look like:

[Scatter diagram of cell phones sold (Y) versus HD TVs sold (X) omitted]
From this scatter diagram, we can see that a linear relationship is reasonable in this case
and, although the relationship is not a perfect linear relationship, common sense would
tell us that the linear relationship is strong. But, what do we mean by strong?
We measure strength by calculating a correlation coefficient. If we had the entire
population of data, the symbol for this coefficient is ρ (rho). If we only have sample data,
as is the case here, the coefficient is symbolized by r, where r is the best estimator of ρ if X
and Y have a bivariate normal distribution. One way to express the formulas for these
coefficients is:

ρ = Σ(xi − μx)(yi − μy) / √[Σ(xi − μx)² · Σ(yi − μy)²]   (sums over i = 1 to N)

r = Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)² · Σ(yi − ȳ)²]   (sums over i = 1 to n)
To simplify the formula for r and other formulas to follow, we will use the symbols SSxx,
SSyy and SSxy (where SS stands for sum of squares):

SSxx = Σ(xi − x̄)²    SSyy = Σ(yi − ȳ)²    SSxy = Σ(xi − x̄)(yi − ȳ)   (sums over i = 1 to n)

so that

r = SSxy / √(SSxx · SSyy)
No matter whether we have the entire population data set or just a sample data set, the
following applies:

−1 ≤ ρ ≤ 1    or    −1 ≤ r ≤ 1
In our example, the relationship is positive (i.e., X and Y tend to move in the same
direction) and it appears to be very strong, although not perfect. So, when we calculate r,
as we will do below, there should be no surprise if r's value is fairly high.
To calculate r, we perform the calculations by setting up the following table
(note: we first must calculate x̄ and ȳ before completing most of the calculations):

x̄ = Σxi / n = 132/6 = 22        ȳ = Σyi / n = 234/6 = 39

Store    x     y    xi − x̄   yi − ȳ   (xi − x̄)²   (yi − ȳ)²   (xi − x̄)(yi − ȳ)
__________________________________________________________________________
A       12    27     −10      −12        100         144            120
B       19    37      −3       −2          9           4              6
C       36    65      14       26        196         676            364
D       16    24      −6      −15         36         225             90
E       22    39       0        0          0           0              0
F       27    42       5        3         25           9             15
__________________________________________________________________________
totals 132   234                         366        1058            595
SSxx = Σ(xi − x̄)² = 366     SSyy = Σ(yi − ȳ)² = 1058     SSxy = Σ(xi − x̄)(yi − ȳ) = 595

and, r = SSxy / √(SSxx · SSyy) = 595 / √(366 × 1058) = .9562
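These hand calculations can be double-checked with a short script; a minimal sketch in Python (the data come from the table above; the variable names are mine):

```python
import math

# Weekend sales at the six stores (data from the table above)
hd_tvs = [12, 19, 36, 16, 22, 27]        # x
cell_phones = [27, 37, 65, 24, 39, 42]   # y

n = len(hd_tvs)
x_bar = sum(hd_tvs) / n                  # 22
y_bar = sum(cell_phones) / n             # 39

# Sums of squares
ss_xx = sum((x - x_bar) ** 2 for x in hd_tvs)                                # 366
ss_yy = sum((y - y_bar) ** 2 for y in cell_phones)                           # 1058
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hd_tvs, cell_phones))  # 595

# Sample correlation coefficient
r = ss_xy / math.sqrt(ss_xx * ss_yy)
print(round(r, 4))  # 0.9562
```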
To make inferences about ρ, we need the relationship between r and ρ. When ρ = 0,

tcalc = r √[(n − 2) / (1 − r²)]

follows a t distribution with n − 2 degrees of freedom. An obvious question arises: how can
this be a relationship between r and ρ when ρ is not part of this expression? It is part of
this expression, but since its value is 0, you can't see it.
Using the same reasoning as before, we can set up the following hypothesis tests for ρ:

Test statistic: tcalc = r √[(n − 2) / (1 − r²)]

If H0: ρ = 0 vs. H1: ρ < 0, R: tcalc < −tα
If H0: ρ = 0 vs. H1: ρ ≠ 0, R: tcalc < −tα/2 or tcalc > tα/2
If H0: ρ = 0 vs. H1: ρ > 0, R: tcalc > tα
Continuing our example, at α = .05, is there sufficient evidence to indicate that there is a
positive linear relationship between the number of HD TVs sold and the number of cell
phones sold?

H0: ρ = 0
H1: ρ > 0
α = .05

Decision rule (with n − 2 = 4 degrees of freedom): R: tcalc > t.05 = 2.132

Test statistic: tcalc = r √[(n − 2)/(1 − r²)] = .9562 √[(6 − 2)/(1 − .9562²)] = 6.533

Since 6.533 > 2.132, we reject H0 and conclude that there is a positive linear relationship
between the number of HD TVs sold and the number of cell phones sold.
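The test statistic can be verified numerically; a quick check in Python (using the correlation computed from the store data):

```python
import math

# Sample correlation from the six-store example above
r = 595 / math.sqrt(366 * 1058)   # ≈ .9562
n = 6

# Test statistic for H0: rho = 0 versus H1: rho > 0
t_calc = r * math.sqrt((n - 2) / (1 - r ** 2))   # ≈ 6.53

# Critical value with n - 2 = 4 degrees of freedom at alpha = .05
t_crit = 2.132

print(t_calc > t_crit)  # True: reject H0
```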
Simple Linear Regression

Finding the best estimate of the true linear relationship must be accomplished before we
can proceed with any of the other objectives of regression analysis.
We can express the estimated relationship by the straight line:

ŷi = b0 + b1 xi

or, by the equation:

yi = b0 + b1 xi + ei

where ei is the difference between the actual value of y and the predicted value of y, or

ei = yi − ŷi
An example
Many companies try to improve their sales by sending out the same emails to individuals'
email addresses, some sending out several per day, day after day after day. But does this
strategy actually increase sales? A study was undertaken in which 8 email addresses were
randomly selected from a list of previous customers' emails, and each address was sent a
specific number of emails per day over a one-month period. The dollar sales which resulted
are recorded below:
Customer        1    2    3    4    5    6    7    8
Daily emails    2    4    6    8   10   12   14   16
$ Sales        70   30   80   20  110  100   54  120
Before we proceed to use regression analysis to estimate the true linear relationship, we
first have to determine which of the two variables is X and which is Y. We have to ask
ourselves which variable depends on the other. Common sense would tell us that sales
depends on the number of emails sent, resulting in

X = daily emails
Y = $ Sales

[Scatter diagram of sales versus daily emails omitted]

Although the relationship seems rather weak, there is no reason to believe that the
relationship is not linear. Therefore, we will assume the relationship is linear and proceed
to find the best estimate of this relationship. But, before we do so, we must first make the
following assumptions:

- a linear relationship is the appropriate relationship
- the errors (ε's) are normally distributed with a mean of 0 and a constant variance
- the ε's are independent of one another
If these assumptions are valid, the method of ordinary least squares, the method that
minimizes

Σ(yi − ŷi)²   or   Σei²   (sums over i = 1 to n)

will give us the best estimate of the true linear relationship. Using this method, we end
up with the following formulas:

b1 = SSxy / SSxx   and   b0 = ȳ − b1 x̄

where

SSxy = Σ(xi − x̄)(yi − ȳ)   and   SSxx = Σ(xi − x̄)²
To determine their values, and the value of SSyy which we will need later on, we first create
the following table (calculating x̄ and ȳ before being able to complete the table). Using the
formula for sample means, we easily determine that:

x̄ = 9 and ȳ = 73
Customer   x     y    xi − x̄   yi − ȳ   (xi − x̄)²   (yi − ȳ)²   (xi − x̄)(yi − ȳ)
_______________________________________________________________________
1          2    70     −7       −3         49           9             21
2          4    30     −5      −43         25        1849            215
3          6    80     −3        7          9          49            −21
4          8    20     −1      −53          1        2809             53
5         10   110      1       37          1        1369             37
6         12   100      3       27          9         729             81
7         14    54      5      −19         25         361            −95
8         16   120      7       47         49        2209            329
_______________________________________________________________________
totals    72   584                        168        9384            620
SSxy = Σ(xi − x̄)(yi − ȳ) = 620   and   SSxx = Σ(xi − x̄)² = 168

so

b1 = SSxy / SSxx = 620/168 = 3.6905

and

b0 = ȳ − b1 x̄ = 73 − 3.6905(9) = 39.7855

or,

ŷi = 39.7855 + 3.6905 xi
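The least-squares estimates can be computed directly from the raw data; a minimal sketch in Python (data from the email table above; Excel carries more decimals for b0, giving 39.7857 rather than the hand-rounded 39.7855):

```python
# Least-squares estimates for the email example (data from the table above)
emails = [2, 4, 6, 8, 10, 12, 14, 16]        # x
sales = [70, 30, 80, 20, 110, 100, 54, 120]  # y

n = len(emails)
x_bar = sum(emails) / n   # 9
y_bar = sum(sales) / n    # 73

ss_xx = sum((x - x_bar) ** 2 for x in emails)                          # 168
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(emails, sales))  # 620

b1 = ss_xy / ss_xx        # slope
b0 = y_bar - b1 * x_bar   # intercept
print(round(b1, 4), round(b0, 4))  # 3.6905 39.7857
```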
(If we were just interested in a mathematical interpretation of this line, our interpretation
would be that the y-intercept (i.e. the value of ŷ when x = 0) is 39.7855 and the value of
ŷ increases by 3.6905 for each unit increase in x.)
At this point, we could draw this line on the scatter diagram. To do so manually, we need
to find two points along this line and use a ruler connecting these points. One of these
points could be the y-intercept and another point could be the point ( x , y ) because the
regression line always passes through this point. Using Excel, the scatter diagram would
now look like:

[Scatter diagram of sales versus daily emails with the fitted regression line omitted]
You will notice that the regression line using Excel (also indicated in the scatter diagram
if requested) does not go as far as the y axis. It starts when x = 2 and stops when x = 16,
which just happen to be the lowest and highest values of X in our sample. There is a
logical reason for this.
We drew a scatter diagram to see whether or not a linear relationship seems reasonable.
We can only make this observation over the range of X in our sample. So, in this case,
we observe that the relationship seems reasonable when X is somewhere between a value
of 2 and a value of 16. Outside this range of values, we have no information to either
support or not support this linearity assumption. Therefore, to be safe, any observations
we make or conclusions we reach should only apply over the range of our X in our
sample. So, at this point in our analysis, we can say that there appears to be a fairly weak
positive linear relationship between sales and number of emails when the number of
emails ranges between 2 and 16. But, how strong does the linear relationship appear to
be in our sample? To answer this question, we calculate the coefficient of determination,
symbolized by R2.
Coefficient of Determination
To measure the strength of the linear relationship between X and Y, we do something
similar to what we did in ANOVA. We observe that the values of Y are not all the same
and we try to explain what causes these values to vary. Here, we define two such causes:
Y varies because it is related to X (and X varies), and Y varies for reasons we cannot
explain (i.e. the residuals or errors). As with ANOVA, we can create the following
expression:
SST = Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)² = SSR + SSE

- if all values of Y were equal, SST, SSR and SSE would all equal zero
- if there was no linear relationship (the weakest type of linear relationship) between X
  and Y in our sample (i.e. b1 = 0), SSR would equal zero, and SST = SSE
- if there was a perfect linear relationship (the strongest type of linear relationship)
  between X and Y in our sample, SSE would equal zero and SST = SSR
So, we want R² to reflect some measure of strength. We do this using the relationship

R² = 1 − SSE/SST = SSR/SST

where

0 ≤ R² ≤ 1
- R² = 0 if there was no linear relationship in our sample
- R² = 1 if there was a perfect linear relationship in our sample
- the closer R² is to 1, the stronger the linear relationship

In addition to making these observations, we can define R² as the proportion of the total
variation in Y in our sample which can be explained by its linear relationship to X.
At this point, we really don't want to calculate SSR and SSE if we can avoid it. Luckily
for us, R² equals r², or

R² = SSxy² / (SSxx · SSyy)

Since we have already calculated SSxy (SSxy = 620) and SSxx (SSxx = 168), the only
calculation that is missing is SSyy, and SSyy = Σ(yi − ȳ)² = 9384. So,

R² = SSxy² / (SSxx · SSyy) = 620² / [(168)(9384)] = .2438
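The coefficient of determination follows directly from the sums of squares already computed; a quick check in Python:

```python
# Coefficient of determination from the sums of squares computed above
ss_xy, ss_xx, ss_yy = 620, 168, 9384

r_squared = ss_xy ** 2 / (ss_xx * ss_yy)
print(round(r_squared, 4))  # 0.2438
```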
Or, 24.38% of the variation in sales in our sample can be explained by its linear
relationship to the number of emails received. Since R2 is relatively small, our initial
observation that the relationship was weak is supported.
Simply because the relationship in our sample is weak, it does not rule out the possibility
that there still is a relationship between X and Y in our population. Conversely, if the
relationship in our sample was strong, it would not rule out the possibility that there was
no linear relationship between X and Y in our population. To make inferences about the
population's linear relationship, we usually only make inferences concerning β1, since β1
tells us whether or not a linear relationship exists in the population.
Inferences concerning β1
Common sense should tell us that b1 is the best estimator of β1, and to use this estimator
to make these inferences we need to know its relationship with β1. If our previously stated
assumptions for regression analysis are valid, the relationship between b1 and β1 can be
expressed as:

t = (b1 − β1) / sb1

which follows a t distribution with n − 2 degrees of freedom, where

SSE = Σ(yi − ŷi)²     syx = √[SSE / (n − 2)]     sb1 = syx / √SSxx
We already have SSxx (SSxx = 168), so the only real issue is calculating SSE. Using the
formula, we would have to calculate ŷi for each of the observations in our sample, which
could be rather tedious. So, let's see if there is a shortcut we can use. Since

R² = 1 − SSE/SST = 1 − SSE/SSyy

we have

SSE = SSyy(1 − R²) = SSyy[1 − SSxy²/(SSxx · SSyy)]

or,

SSE = SSyy − SSxy²/SSxx
In our example,

SSE = SSyy − SSxy²/SSxx = 9384 − 620²/168 = 7095.905

syx = √[SSE/(n − 2)] = √(7095.905/6) = 34.3897

sb1 = syx/√SSxx = 34.3897/√168 = 2.653
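The shortcut and the two standard errors can be verified in a few lines; a minimal sketch in Python using the quantities from the email example:

```python
import math

# Sums of squares from the email example
ss_xx, ss_yy, ss_xy = 168, 9384, 620
n = 8

# SSE via the shortcut SSE = SSyy - SSxy^2 / SSxx
sse = ss_yy - ss_xy ** 2 / ss_xx   # ≈ 7095.905

# Standard error of the estimate, and standard error of b1
s_yx = math.sqrt(sse / (n - 2))    # ≈ 34.3897
s_b1 = s_yx / math.sqrt(ss_xx)     # ≈ 2.653

print(round(sse, 3), round(s_yx, 4), round(s_b1, 3))  # 7095.905 34.3897 2.653
```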
Note: We could have used SSE = SSyy(1 − R²) to calculate SSE, but there is a greater
chance of rounding errors and more chance of making a mistake (if we made a mistake
calculating R²).
Knowing the relationship between b1 and β1 and knowing how to do the preliminary
calculations, we can use the same reasoning as before and the same mathematical
manipulations to either estimate β1 or test hypotheses concerning β1.
Estimating β1

P(−tα/2 ≤ t ≤ tα/2) = 1 − α   becomes   P(b1 − tα/2 · sb1 ≤ β1 ≤ b1 + tα/2 · sb1) = 1 − α

so the (1 − α) confidence interval for β1 is b1 ± tα/2 · sb1 (with n − 2 degrees of freedom).
Hypothesis testing of β1

As with all other test statistics, the test statistic for β1 is simply the relationship between
b1 and β1 assuming H0 is true. So, our test statistic for testing β1 if H0: β1 = β1⁰ is

tcalc = (b1 − β1⁰) / sb1   (with n − 2 degrees of freedom)

And, depending on H1, the rejection regions would look like the following:

If H0: β1 = β1⁰ vs. H1: β1 < β1⁰, R: tcalc < −tα
If H0: β1 = β1⁰ vs. H1: β1 ≠ β1⁰, R: tcalc < −tα/2 or tcalc > tα/2
If H0: β1 = β1⁰ vs. H1: β1 > β1⁰, R: tcalc > tα

If the value of β1⁰ = 0, which is the most common value for testing β1, we would be
testing whether a linear relationship exists at all (and, depending on H1, whether a
negative or a positive linear relationship exists).
In our example, let's calculate the 95% confidence interval for β1 and then test at α = .05
if there is sufficient evidence to indicate that the more emails received, the greater the
sales.

Estimating β1

In our example, b1 = 3.6905, sb1 = 2.653 and, with n − 2 = 6 degrees of freedom,
tα/2 = 2.447:

b1 ± tα/2 · sb1 = 3.6905 ± 2.447(2.653) = 3.6905 ± 6.4919 → (−2.8014, 10.1824)

We are 95% confident that sales will change between −$2.8014 and $10.1824 for each
additional email, with the best estimate being a change of $3.6905, as long as the number
of emails ranges between 2 and 16.
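The interval endpoints can be checked in a couple of lines; a quick sketch in Python:

```python
# 95% confidence interval for beta1 in the email example
b1 = 3.6905
s_b1 = 2.653
t_half_alpha = 2.447   # t.025 with n - 2 = 6 degrees of freedom

margin = t_half_alpha * s_b1
lower, upper = b1 - margin, b1 + margin
print(round(lower, 4), round(upper, 4))  # -2.8014 10.1824
```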
Note: This wide confidence interval doesn't give the company much information about
how the number of emails affects sales.
Hypothesis testing of β1

Here we are testing for a positive linear relationship, resulting in:

H0: β1 = 0
H1: β1 > 0
α = .05

Decision rule (with n − 2 = 6 degrees of freedom): R: tcalc > t.05 = 1.943

Test statistic: tcalc = (b1 − β1⁰)/sb1 = (3.6905 − 0)/2.653 = 1.391

Since 1.391 < 1.943, we do not reject H0; there is insufficient evidence to indicate that the
more emails received, the greater the sales.
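A quick numerical check of this test in Python:

```python
# Test of H0: beta1 = 0 versus H1: beta1 > 0 for the email example
b1, s_b1 = 3.6905, 2.653

t_calc = (b1 - 0) / s_b1
t_crit = 1.943   # t.05 with 6 degrees of freedom

print(round(t_calc, 3), t_calc > t_crit)  # 1.391 False: do not reject H0
```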
The Excel regression output for this example is:

Regression Statistics
Multiple R           0.4938
R Square             0.2438
Adjusted R Square    0.1178
Standard Error      34.3897
Observations         8

ANOVA
               df        SS          MS          F      Significance F
Regression      1    2288.0952   2288.0952   1.9347        0.2136
Residual        6    7095.9048   1182.6508
Total           7    9384

                Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept          39.7857        26.7962       1.4848    0.1881    −25.7823    105.3537
no. of emails       3.6905         2.6532       1.3909    0.2136     −2.8017     10.1827
You can notice some of the values (give or take some rounding errors) we calculated
using the appropriate formulas:

b1 = 3.6905
b0 = 39.7857
R² = .2438
syx = 34.3897
SSE = 7095.9048
sb1 = 2.6532
t = 1.3909
95% confidence interval for β1: [−2.8017, 10.1827]
Notes:
The output also gives us the p-value for no. of emails (0.2136), but Excel
calculates a two-tail p-value. So, the p-value for this one-tail test is half the
quoted P-value, or p-value = .1068. (It is half of the quoted value only when the
sign of b1 or t supports H1 being true; i.e., since we are testing for a positive linear
relationship, we would expect b1 or t to be positive, and it is positive in this case.
If the sign is not what we would expect if H1 were true, the p-value would be
greater than .5.)
The value of the t-test and the accompanying p-value only apply when H0: β1 = 0.
For values other than 0, the quoted t Stat and p-value do not apply.
The output also gives us an ANOVA table, similar to the ANOVA table from the
previous chapter. As before, the table determines the total variation in Y and
partitions it into the variation in Y which can be explained (the Regression part)
and the variation which cannot be explained (the Residual part). Working through
this table, we end up with a value of F (1.9347) and a p-value or Significance F
(0.2136). You will note that Significance F = 0.2136, which is the same value as
the P-value for no. of emails. This is not a coincidence. This ANOVA table
(through the F-statistic) allows us to test whether the regression model is useful or
whether the independent variables in the model have an effect on Y. That is, it
allows us to test:

H0: β1 = β2 = … = βk = 0   (model is useless)
H1: not all of the above βj's = 0   (model is useful)

Here, since there is only one independent variable, the model being useful would
mean that the no. of emails affects sales, or H1: β1 ≠ 0. If we tested this
hypothesis using our t-test, we would still have ended up with a test statistic of
1.3909, and the p-value of this test would be 0.2136. In fact, if we were to square
the value of t, we would end up with the value of F (give or take some rounding
error) in the ANOVA table. With only one independent variable, the ANOVA
table really doesn't give us much information. Its use will be more evident when
we have regression models with more than one independent variable.
Prediction and estimation

To predict, with (1 − α) confidence, the value of Y for an individual observation with
X = xi, we use the prediction interval:

ŷi ± tα/2 · syx · √[1 + 1/n + (xi − x̄)²/SSxx]

To estimate, with (1 − α) confidence, the mean value of Y for all observations with
X = xi, we use the confidence interval:

ŷi ± tα/2 · syx · √[1/n + (xi − x̄)²/SSxx]

where t has n − 2 degrees of freedom.
Continuing with our example, we want to predict, with 95% confidence, sales for an
individual who receives 6 emails daily.

ŷi = b0 + b1 xi = 39.7855 + 3.6905(6) = 61.93

ŷi ± tα/2 · syx · √[1 + 1/n + (xi − x̄)²/SSxx] = 61.93 ± 2.447(34.3897) √[1 + 1/8 + (6 − 9)²/168]

= 61.93 ± 91.36 → (−29.43, 153.29)
We are 95% confident that the sales for an individual who receives 6 emails daily
will lie between −$29.43 and $153.29, with the best estimate being $61.93.

Note: It is obvious that sales cannot be negative, so we can quote the lower limit for sales
to be $0.00 without being criticized for doing so.
We also want to estimate, with 95% confidence, mean sales for all individuals who
receive 6 emails daily.

ŷi ± tα/2 · syx · √[1/n + (xi − x̄)²/SSxx] = 61.93 ± 2.447(34.3897) √[1/8 + (6 − 9)²/168]

= 61.93 ± 35.56 → (26.37, 97.49)
We are 95% confident that mean sales for all individuals who receive 6 emails daily
will lie between $26.37 and $97.49 with the best estimate being $61.93.
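Both intervals for x = 6 can be verified with a short script; a minimal sketch in Python using the quantities computed earlier:

```python
import math

# Fitted value, prediction interval margin (individual Y), and confidence
# interval margin (mean of Y) at x = 6 emails per day
b0, b1 = 39.7855, 3.6905
s_yx, ss_xx = 34.3897, 168
n, x_bar, x_new = 8, 9, 6
t_half_alpha = 2.447   # t.025 with n - 2 = 6 degrees of freedom

y_hat = b0 + b1 * x_new   # ≈ 61.93

leverage = 1 / n + (x_new - x_bar) ** 2 / ss_xx
pred_margin = t_half_alpha * s_yx * math.sqrt(1 + leverage)   # individual Y
mean_margin = t_half_alpha * s_yx * math.sqrt(leverage)       # mean of Y

print(round(pred_margin, 2))  # 91.36
print(round(mean_margin, 2))  # 35.56
```

Notice that the margin for an individual Y is always larger than the margin for the mean, because of the extra "1 +" under the square root.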
Observations:
- Both of the above confidence intervals are quite wide, making them of little use in any
  decision making by the company.
- Although both confidence intervals are wide, the confidence interval for the mean is
  narrower than the confidence interval for an individual Y.
- The widths of both confidence intervals would be smaller if:
  - we lowered the level of confidence
  - syx was smaller (i.e. the values of Y better fit the regression line)
  - we took a larger sample
  - xi was closer to x̄
  - SSxx was larger (i.e. the values of X in our sample were more spread out)
Note: As with previous analyses, we should only be predicting Y or estimating the mean
value of Y for values of X within the range of x's in our sample.
When we began to discuss regression analysis, we made some assumptions about the true
linear relationship (a linear relationship is the appropriate relationship; ε is normally
distributed with a constant variance; the ε's are independent of one another). A scatter
diagram, with the regression line added to it, would give us some indication whether a
linear relationship is appropriate, whether the ei's don't have a pattern to them (indicating
that the ε's are independent), and whether the ei's are equally spread out no matter the
value of X (indicating constant variance). A histogram of the ei's would give us some
indication as to whether ε is normally distributed or not. There are some formal tests for
some of these assumptions, but they are beyond an introductory course in statistics.
What's next?
Chapter 14 will allow us to incorporate more than one independent variable in a
regression model, allowing us to create models which are more appropriate in the real
world and allowing us to expand on the testing capabilities of regression analysis.