Simple Regression
An Introduction to Regression
In regression analysis we fit a predictive model to our data: we use a model to predict values of the
dependent variable (DV) from one or more independent variables (IVs). Simple regression seeks to predict
an outcome from a single predictor whereas multiple regression seeks to predict an outcome from several
predictors. This is an incredibly useful tool because it allows us to go a step beyond the data that we
actually possess. The model that we fit to our data is a linear one and can be imagined by trying to
summarize a data set with a straight line.
With any data set there are a number of lines that could be used to summarize the general trend and so
we need a way to decide which of many possible lines to choose. For the sake of drawing accurate
conclusions we want to fit a model that best describes the data. The simplest way to fit a line is to use
your eye to gauge a line that looks as though it summarizes the data well. However, the eyeball method
is very subjective and so offers no assurance that the model is the best one that could have been chosen.
Instead, we use a mathematical technique called the method of least squares to find the line that best
describes the data collected (See Field, 2005, Chapter 5 for a description).
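If you want to see the arithmetic behind the method of least squares, the following Python sketch finds the gradient and intercept that minimize the sum of squared residuals. The handout itself uses SPSS, and the data here are invented purely for illustration:

```python
# Least-squares estimates for a simple regression line
# (made-up data; variable names are illustrative only).
def least_squares(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # b1 = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx              # gradient of the best-fitting line
    b0 = mean_y - b1 * mean_x   # intercept of the best-fitting line
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = least_squares(x, y)
print(round(b0, 2), round(b1, 2))   # intercept ~0.05, gradient ~1.99
```

Any other line through these points would have a larger sum of squared residuals than the one these two values define.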
Some Important Information about Straight Lines
Any straight line can be drawn if you know: (1) the slope (or gradient) of the line, and (2) the point at
which the line crosses the vertical axis of the graph (the intercept of the line). The equation of a straight
line is defined in equation (1), in which Yi is the outcome that we want to predict for the ith subject and Xi is the ith
subject's score on the predictor variable. b1 is the gradient and b0 is the intercept of the straight line
fitted to the data. There is a residual term, εi, which represents the difference between the score
predicted by the line for subject i and the score that subject i actually obtained. The equation is often
conceptualized without this residual term (so, ignore it if it's upsetting you); however, it is worth knowing
that this term represents the fact that our model will not fit the collected data perfectly.
Yi = b0 + b1Xi + εi    (1)
A particular line has a specific intercept and gradient. Figure 1 shows a set of lines that have the same
intercept but different gradients, and a set of lines that have the same gradient but different intercepts.
Figure 1 also illustrates another useful point: that the gradient of the line tells us something about the
nature of the relationship being described: a line that has a gradient with a positive value describes a
positive relationship, whereas a line with a negative gradient describes a negative relationship. So, if you
look at the graph in Figure 1 in which the gradients differ but the intercepts are the same, then the
thicker line describes a positive relationship whereas the thinner line describes a negative relationship.
If it is possible to describe a line knowing only the gradient and the intercept of that line, then the model
that we fit to our data in linear regression (a straight line) can also be described mathematically by
equation (1). With regression we strive to find the line that best describes the data collected, then
estimate the gradient and intercept of that line. Having defined these values, we can insert different
values of our predictor variable into the model to estimate the value of the outcome variable.
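As a tiny illustration of that last step, here is a Python sketch of inserting a predictor value into a fitted line. The values of b0 and b1 are invented for the example, not taken from any data set in this handout:

```python
# An invented intercept and gradient, just to show the substitution
b0, b1 = 10.0, 2.5

def predict(x):
    """Predicted outcome for a predictor value x (Y = b0 + b1*X)."""
    return b0 + b1 * x

print(predict(4))   # 10.0 + 2.5*4 = 20.0
```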
Unfortunately, you will come across people (and SPSS for that matter) referring to regression variables as
dependent and independent variables (as in controlled experiments). However, correlational research by
its nature seldom controls the independent variables to measure the effect on a dependent variable.
Instead, variables are measured simultaneously and without strict control. It is, therefore, inaccurate to
label regression variables in this way. For this reason I label independent variables as predictors, and the
dependent variable as the outcome.
[Figure 1 appears here in the original: two panels plotting the outcome (0 to 100) against the predictor (0 to 40).]
Figure 1: Lines with the same gradients but different intercepts, and lines that share the same
intercept but have different gradients
R2 = SSM / SST    (2)
Interestingly, this value is the same as the square of the Pearson correlation between the two variables (see Field,
2005, Chapter 4) and it is interpreted in the same way. Therefore, in simple regression we can take the
square root of this value to obtain the Pearson correlation coefficient. As such, the correlation coefficient
provides us with a good estimate of the overall fit of the regression model, and R2 provides us with a good
gauge of the substantive size of the relationship.
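To make the link between R2 and the correlation coefficient concrete, this Python sketch (with invented data) fits a line by least squares and checks that SSM/SST equals the squared Pearson r:

```python
# Sketch: R-squared as SS_M / SS_T, and its square root as Pearson's r
# (made-up data; the line is fitted by least squares as in the text).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx
pred = [b0 + b1 * a for a in x]

ss_t = sum((b - my) ** 2 for b in y)      # total variability in the outcome
ss_m = sum((p - my) ** 2 for p in pred)   # improvement due to the model
r_squared = ss_m / ss_t

# Pearson correlation computed directly, for comparison
sx = sum((a - mx) ** 2 for a in x) ** 0.5
sy = sum((b - my) ** 2 for b in y) ** 0.5
r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

assert abs(r_squared - r ** 2) < 1e-9     # R-squared = r squared
```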
A second use of the sums of squares in assessing the model is through the F-test. The F-test is something
we will cover in greater detail later in the term, but briefly this test is based upon the ratio of the
improvement due to the model (SSM) and the difference between the model and the observed data (SSR).
In fact, rather than using the sums of squares themselves, we take the mean sums of squares (referred to
as the mean squares or MS). The result is the mean squares for the model (MSM) and the residual mean
squares (MSR); see Field (2005) for more detail. At this stage it isn't essential that you understand how the
mean squares are derived (it is explained in Field, 2005, Chapter 8 and in your lectures on ANOVA).
However, it is important that you understand that the F-ratio (equation (3)) is a measure of how much the
model has improved the prediction of the outcome compared to the level of inaccuracy of the model.
F = MSM / MSR    (3)
If a model is good, then we expect the improvement in prediction due to the model to be large (so, MSM
will be large) and the difference between the model and the observed data to be small (so, MSR will be
small). In short, a good model should have a large F-ratio (greater than one at least) because the top half
of equation (3) will be bigger than the bottom. The exact magnitude of this F-ratio can be assessed using
critical values for the corresponding degrees of freedom.
Figure 2: Scatterplot showing the relationship between record sales and the amount spent promoting the record
Model 1          Sum of Squares     df    Mean Square         F      Sig.
  Regression         433687.833      1     433687.833    99.587    .000a
  Residual           862264.167    198       4354.870
  Total             1295952.000    199
SPSS Output 2
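You can reproduce the F-ratio in SPSS Output 2 by hand: each mean square is its sum of squares divided by its degrees of freedom, and F is their ratio. A quick Python check using the values from the table:

```python
# Values from SPSS Output 2 (sums of squares and degrees of freedom)
ss_m, df_m = 433687.833, 1
ss_r, df_r = 862264.167, 198

ms_m = ss_m / df_m   # mean square for the model
ms_r = ss_r / df_r   # residual mean square, roughly 4354.870
f = ms_m / ms_r      # the F-ratio from equation (3)
print(round(f, 3))   # matches the 99.587 reported by SPSS
```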
Model 1                                B    Std. Error    Beta         t    Sig.
  (Constant)                     134.140         7.537             17.799    .000
  Advertising Budget
  (thousands of pounds)        9.612E-02          .010    .578      9.979    .000
SPSS Output 3
The values of b represent the change in the outcome resulting from a unit change in the predictor. If the
model was useless at predicting the outcome, then if the value of the predictor changes, what might we
expect the change in the outcome to be? Well, if the model is very bad then we would expect the change
in the outcome to be zero. A regression coefficient of zero means: (a) a unit change in the predictor
variable results in no change in the predicted value of the outcome (the predicted value of the outcome
does not change at all), and (b) the gradient of the regression line is zero, meaning that the regression
line is flat. Hopefully, what should be clear at this stage is that if a variable significantly predicts an
outcome, then it should have a b value significantly different from zero. This hypothesis is tested using a
t-test (see Field, 2005, Chapter 7). The t-statistic tests the null hypothesis that the true value of b is zero:
therefore, if the test is significant we can conclude that the b value is significantly different from zero
and that the predictor variable contributes significantly to our ability to estimate values of the outcome.
The values of t can be compared to the values that we would expect to find by chance alone: if t is very
large then it is unlikely to have occurred by chance. SPSS provides the exact probability that the observed
value of t is a chance result, and as a general rule, if this observed significance is less than 0.05, then
social scientists agree that the result reflects a genuine effect. For these two values, the probabilities are
0.000 (zero to 3 decimal places) and so we can say that the probability of these t values occurring by
chance is less than 0.001. Therefore, they reflect genuine effects. We can, therefore, conclude that
advertising budget makes a significant contribution to predicting record sales.
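The t-values in SPSS Output 3 are simply each b divided by its standard error. A quick Python check using the constant's values from the table (note that SPSS rounds the standard errors it displays, so the recomputed t will not match the printed output exactly):

```python
# b and its standard error for the constant, taken from SPSS Output 3
b_const, se_const = 134.140, 7.537

t_const = b_const / se_const    # close to the 17.799 SPSS reports
print(round(t_const, 3))
```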
You might have noticed that this value is reported by SPSS as 9.612E-02, and many students find this
notation confusing. Well, think of E-02 as meaning "move the decimal place 2 steps to the left", so
9.612E-02 becomes 0.09612.
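If you want to check this, Python reads the same scientific notation that SPSS prints:

```python
# E-02 moves the decimal point two places to the left
value = float("9.612E-02")
print(value)   # 0.09612
```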
Record Salesi = b0 + b1Advertising Budgeti
              = 134.14 + (0.09612 × Advertising Budgeti)    (4)
It is now possible to make a prediction about record sales, by replacing the advertising budget with a
value of interest. For example, imagine a record executive wanted to spend 100,000 on advertising a
new record. Remembering that our units are already in thousands of pounds, we can simply replace the
advertising budget with 100. He would discover that record sales should be around 144,000 for the first
week of sales.
Record Salesi = 134.14 + (0.09612 × Advertising Budgeti)
              = 134.14 + (0.09612 × 100)
              = 143.75    (5)
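The same arithmetic in Python, using the coefficients from SPSS Output 3 (remember that both budget and sales are in thousands):

```python
# Fitted coefficients from SPSS Output 3; units are thousands
b0, b1 = 134.14, 0.09612
budget = 100                 # i.e. 100,000 pounds of advertising

sales = b0 + b1 * budget
print(round(sales, 2))       # 143.75, i.e. roughly 144,000 records
```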
Task 1
Lacourse et al. (2001) conducted a study to see whether suicide risk was related to listening to heavy
metal music. They devised a scale to measure preference for bands falling into the category of heavy
metal. This scale included heavy metal bands (Black Sabbath, Iron Maiden), speed metal bands (Slayer,
Metallica), death/black metal bands (Obituary, Burzum) and gothic bands (Marilyn Manson, Sisters of
Mercy). They then used this (and other variables) as predictors of suicide risk, based on a scale measuring
suicidal ideation and related constructs devised by Tousignant et al. (1988).
Lacourse, E., Claes, M., & Villeneuve, M. (2001). Heavy Metal Music and Adolescent Suicidal Risk. Journal
of Youth and Adolescence, 30 (3), 321-332. [Available through the Sussex Electronic Library].
Let's imagine we replicated this study. The data file HMSuicide.sav (on the course website) contains the
data from such a replication. There are two variables representing scores on the scales described above:
hm (the extent to which the person listens to heavy metal music) and suicide (the extent to which
someone has suicidal ideation and so on). Using these data, carry out a regression analysis to see whether
listening to heavy metal predicts suicide risk.
Your Answers:
Does listening to heavy metal significantly predict suicide risk (quote relevant statistics)?
Your Answers:
What is the nature of the relationship between listening to heavy metal and suicide
risk? (sketch a scatterplot if it helps you to explain).
Your Answers:
As listening to heavy metal increases by 1 unit, how much does suicide risk increase?
Your Answers: