
Smart Alex's Answers

Chapter 1

Task 1

What are (broadly speaking) the five stages of the research process?

1. Generating a research question: through an initial observation (hopefully backed up by some data).
2. Generate a theory to explain your initial observation.
3. Generate hypotheses: break your theory down into a set of testable predictions.
4. Collect data to test the theory: decide on what variables you need to measure to test your predictions and how best to measure or manipulate those variables.
5. Analyse the data: look at the data visually and by fitting a statistical model to see if it supports your predictions (and therefore your theory). At this point you should return to your theory and revise it if necessary.

Task 2

What is the fundamental difference between experimental and correlational research?

In a word, causality. In experimental research we manipulate a variable (predictor, independent variable) to see what effect it has on another variable (outcome, dependent variable). This manipulation, if done properly, allows us to compare situations where the causal factor is present to situations where it is absent.

Therefore, if there are differences between these situations, we can attribute cause to the variable that we manipulated. In correlational research, we measure things that naturally occur and so we cannot attribute cause but instead look at natural covariation between variables.

Task 3

What is the level of measurement of the following variables?

The number of downloads of different bands' songs on iTunes:
o This is a discrete ratio measure. It is discrete because you can download only whole songs, and it is ratio because it has a true value of 0 (no downloads at all).

The names of the bands downloaded:
o This is a nominal variable. Bands can be identified by their name, but the names have no meaningful order. The fact that Norwegian black metal band 1349 called themselves 1349 does not make them better than British boy-band has-beens 911; the fact that 911 were a bunch of talentless idiots does, though.

The position in the iTunes download chart:
o This is an ordinal variable. We know that the band at number 1 sold more than the band at number 2 or 3 (and so on) but we don't know how many more downloads they had. So, this variable tells us the order of magnitude of downloads, but doesn't tell us how many downloads there actually were.

The money earned by the bands from the downloads:

o This variable is continuous and ratio. It is continuous because money (pounds, dollars, euros or whatever) can be broken down into very small amounts (you can earn fractions of euros even though there may not be an actual coin to represent these fractions).

The weight of drugs bought by the band with their royalties:
o This variable is continuous and ratio. If the drummer buys 100 g of cocaine and the singer buys 1 kg, then the singer has 10 times as much.

The type of drugs bought by the band with their royalties:
o This variable is categorical and nominal: the name of the drug tells us something meaningful (crack, cannabis, amphetamine, etc.) but has no meaningful order.

The phone numbers that the bands obtained because of their fame:
o This variable is categorical and nominal too: the phone numbers have no meaningful order; they might as well be letters. A bigger phone number did not mean that it was given by a better person.

The gender of the people giving the bands their phone numbers:
o This variable is categorical and binary: the people dishing out their phone numbers could fall into one of only two categories (male or female).

The instruments played by the band members:

o This variable is categorical and nominal too: the instruments have no meaningful order but their names tell us something useful (guitar, bass, drums, etc.).

The time they had spent learning to play their instruments:
o This is a continuous and ratio variable. The amount of time could be split into infinitely small divisions (nanoseconds even) and there is a meaningful true zero (0 time spent learning your instrument means that, like 911, you can't play at all).

Task 4

Say I own 857 CDs. My friend has written a computer program that uses a webcam to scan the shelves in my house where I keep my CDs and measure how many I have. His program says that I have 863 CDs. Define measurement error. What is the measurement error in my friend's CD-counting device?

Measurement error is the difference between the true value of something and the numbers used to represent that value. In this trivial example, the measurement error is 6 CDs. In this example we know the true value of what we're measuring; usually we don't have this information, so we have to estimate this error rather than knowing its actual value.

Task 5

Sketch the shape of a normal distribution, a positively skewed distribution and a negatively skewed distribution.

Normal:

Positive skew:

Negative skew:

Chapter 2

Task 1: Why do we use samples?

We are usually interested in populations, but because we cannot collect data from every human being (or whatever) in the population, we collect data from a small subset of the population (known as a sample) and use these data to infer things about the population as a whole.

Task 2: What is the mean and how do we tell if it's representative of our data?

The mean is a simple statistical model of the centre of a distribution of scores: a hypothetical estimate of the typical score. We use the variance, or standard deviation, to tell us whether it is representative of our data. The standard deviation is a measure of how much error there is associated with the mean: a small standard deviation indicates that the mean is a good representation of our data.

Task 3: What's the difference between the standard deviation and the standard error?

The standard deviation tells us how much observations in our sample differ from the mean value within our sample. The standard error tells us not about how the sample mean represents the sample itself, but how well the sample mean represents the population mean. The standard error is the standard deviation of the sampling distribution of a statistic. For a given statistic (e.g. the mean) it tells us how much variability there is in this statistic across samples from the same population. Large values, therefore, indicate that a statistic from a given sample may not be an accurate reflection of the population from which the sample came.

Task 4: In Chapter 1 we used an example of the time taken for 21 heavy smokers to fall off of a treadmill at the fastest setting (18, 16, 18, 24, 23, 22, 22, 23, 26, 29, 32, 34, 34, 36, 36, 43, 42, 49, 46, 46, 57). Calculate the sums of squares, variance, standard deviation and standard error of these data.

To calculate the sum of squares, take the mean from each value, then square this difference. Finally, add up these squared values:

So, the sum of squared errors is a massive 2685.24. The variance is the sum of squared errors divided by the degrees of freedom (N − 1). There were 21 scores and so the degrees of freedom were 20. The variance is, therefore, 2685.24/20 = 134.26. The standard deviation is the square root of the variance: √134.26 = 11.59. The standard error will be:

SE = s/√N = 11.59/√21 = 2.53
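If you want to check these numbers outside SPSS, here is a minimal Python sketch (NumPy/SciPy, not part of the original materials) that reproduces the sum of squares, variance, standard deviation and standard error worked out above, plus the 95% confidence interval that follows:

```python
import numpy as np
from scipy import stats

times = np.array([18, 16, 18, 24, 23, 22, 22, 23, 26, 29, 32,
                  34, 34, 36, 36, 43, 42, 49, 46, 46, 57])

mean = times.mean()
ss = np.sum((times - mean) ** 2)        # sum of squared errors
variance = ss / (len(times) - 1)        # divided by the degrees of freedom, N - 1
sd = np.sqrt(variance)                  # standard deviation
se = sd / np.sqrt(len(times))           # standard error of the mean

# 95% confidence interval using the critical value of t with N - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=len(times) - 1)
ci = (mean - t_crit * se, mean + t_crit * se)

print(mean, ss, variance, sd, se, ci)
```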

The sample is small, so to calculate the confidence interval we need to find the appropriate value of t. First we need to calculate the degrees of freedom, N − 1. With 21 data points, the degrees of freedom are 20. For a 95% confidence interval we can look up the value in the column labelled Two-Tailed Test, 0.05, in the table of critical values of the t-distribution (Appendix). The corresponding value is 2.09. The confidence interval is therefore:

Lower boundary of confidence interval = X̄ − (2.09 × SE) = 32.19 − (2.09 × 2.53) = 26.90
Upper boundary of confidence interval = X̄ + (2.09 × SE) = 32.19 + (2.09 × 2.53) = 37.48

Task 5: What do the sum of squares, variance and standard deviation represent? How do they differ?

All of these measures tell us something about how well the mean fits the observed sample data. Large values (relative to the scale of measurement) suggest the mean is a poor fit of the observed scores, and small values suggest a good fit. They are also, therefore, measures of dispersion, with large values indicating a spread-out distribution of scores and small values showing a more tightly packed distribution. These measures all represent the same thing, but differ in how they express it. The sum of squared errors is a total and is, therefore, affected by the number of data points. The variance is the average variability but in units squared. The standard deviation is the average variability converted back to the original units of measurement. As such, the size of the standard deviation can be compared to the mean (because they are in the same units of measurement).

Task 6: What is a test statistic and what does it tell us?

A test statistic is a statistic for which we know how frequently different values occur. The observed value of such a statistic is typically used to test hypotheses, or to establish whether a model is a reasonable representation of what's happening in the population.

Task 7: What are Type I and Type II errors?

A Type I error occurs when we believe that there is a genuine effect in our population, when in fact there isn't. A Type II error occurs when we believe that there is no effect in the population when, in reality, there is.

Task 8: What is an effect size and how is it measured?

An effect size is an objective and standardized measure of the magnitude of an observed effect. Measures include Cohen's d, the odds ratio and Pearson's correlation coefficient, r.

Task 9: What is statistical power?

Power is the ability of a test to detect an effect of a particular size (a value of 0.8 is a good level to aim for).

Chapter 3

Task 2

Your second task is to enter the data that I used to create Figure 3.10. These data show the score (out of 20) for 20 different students, some of whom are male and some female, and some of whom were taught using positive reinforcement (being nice) and others who were taught using punishment (electric shock). Just to make it hard, the data should not be entered in the same way that they are laid out below. The data can be found in the file Method of Teaching.sav and should look like this:

Or with the value labels off, like this:


Task 3

Research has looked at emotional reactions to infidelity and found that men get homicidal and suicidal and women feel undesirable and insecure (Shackelford, LeBlanc, and Drass, 2000). Let's imagine we did some similar research: we took some men and women and got their partners to tell them they had slept with someone else. We then took each person to two shooting galleries and each time gave them a gun and 100 bullets. In one gallery was a human-shaped target with a picture of their own face on it, and in the other was a target with their partner's face on it. They were left alone with each target for 5 minutes and the number of bullets used was measured. The data are below; enter them into SPSS. (Clue: they are not entered in the format in the table!) The data can be found in the file Infidelity.sav and should look like this:

Or with the value labels off, like this:


Chapter 4

Task 1

Using the data from Chapter 2 (which you should have saved, but if you didn't, re-enter it), plot and interpret the following graphs:

An error bar chart showing the mean number of friends for students and lecturers.
An error bar chart showing the mean alcohol consumption for students and lecturers.
An error line chart showing the mean income for students and lecturers.
An error line chart showing the mean neuroticism for students and lecturers.
A scatterplot (with regression lines) of alcohol consumption and neuroticism grouped by lecturer/student.
A scatterplot matrix of alcohol consumption, neuroticism and number of friends.

An error bar chart showing the mean number of friends for students and lecturers.

First of all, access the Chart Builder and select a simple bar chart. The y-axis needs to be the dependent variable, or the thing you've measured, or more simply the thing for which you want to display the mean. In this case it would be number of friends, so select this variable from the variable list and drag it into the y-axis drop zone. The x-axis should be the variable by which we want to split the friends data. To plot the means for the students and lecturers, select the variable Group from the variable list and drag it into the drop zone for the x-axis. Then add error bars by selecting the error bars option in the Element Properties dialog box. The finished Chart Builder will look like this:


The error bar chart will look like this:


We can conclude that, on average, students had more friends than lecturers.

An error bar chart showing the mean alcohol consumption for students and lecturers.

Access the Chart Builder and select a simple bar chart. The y-axis needs to be the thing we've measured, which in this case is alcohol consumption, so select this variable from the variable list and drag it into the y-axis drop zone. The x-axis should be the variable by which we want to split the alcohol consumption data. To plot the means for the students and lecturers, select the variable Group from the variable list and drag it into the drop zone for the x-axis. Add error bars by selecting the error bars option in the Element Properties dialog box. The finished Chart Builder will look like this:

The error bar chart will look like this:



We can conclude that, on average, students and lecturers drank similar amounts, but the error bars tell us that the mean is a better representation of the population for students than for lecturers (there is more variability in lecturers' drinking habits compared to students').

An error line chart showing the mean income for students and lecturers.

Access the Chart Builder and select a simple line chart. The y-axis needs to be the thing we've measured, which in this case is income, so select this variable from the variable list and drag it into the y-axis drop zone. The x-axis should again be students vs. lecturers, so select the variable Group from the variable list and drag it into the drop zone for the x-axis. Add error bars by selecting the error bars option in the Element Properties dialog box. The finished Chart Builder will look like this:


The error line chart will look like this:


We can conclude that, on average, students earn less than lecturers, but the error bars tell us that the mean is a better representation of the population for students than for lecturers (there is more variability in lecturers' incomes compared to students').

An error line chart showing the mean neuroticism for students and lecturers.

Access the Chart Builder and select a simple line chart. The y-axis needs to be the thing we've measured, which in this case is Neurotic, so select this variable from the variable list and drag it into the y-axis drop zone. The x-axis should again be students vs. lecturers, so select the variable Group from the variable list and drag it into the drop zone for the x-axis. Add error bars by selecting the error bars option in the Element Properties dialog box. The finished Chart Builder will look like this:

The error line chart will look like this:


We can conclude that, on average, students are slightly less neurotic than lecturers.

A scatterplot with regression lines of alcohol consumption and neuroticism grouped by lecturer/student.

Access the Chart Builder and select a grouped scatterplot. It doesn't matter which way around we plot these variables, so let's select alcohol consumption from the variable list and drag it into the y-axis drop zone, and then drag Neurotic from the variable list into the x-axis drop zone. We then need to split the scatterplot by our grouping variable (lecturers or students), so select Group and drag it to the grouping drop zone. The completed Chart Builder dialog box will look like this:


Click on OK to produce the graph. To fit the regression lines, double-click on the graph in the SPSS Viewer to open it in the SPSS Chart Editor. Then click on the fit line button in the Chart Editor to open the Properties dialog box. In this dialog box, ask for a linear model to be fitted to the data (this should be set by default). Click on Apply to fit the lines:


We can conclude that for lecturers, as neuroticism increases so does alcohol consumption (a positive relationship), but for students the opposite is true: as neuroticism increases, alcohol consumption decreases. (Note that SPSS has scaled this graph oddly because neither axis starts at zero; as a bit of extra practice, why not edit the two axes so that they start at zero?)

A scatterplot matrix with regression lines of alcohol consumption, neuroticism and number of friends.

Access the Chart Builder and select a scatterplot matrix. We have to drag all three variables into the scatterplot matrix drop zone. Select the first variable (Friends) by clicking on it with the mouse. Now, hold down the Ctrl key on the keyboard and click on a second variable (Alcohol). Finally, hold down the Ctrl key and click on a third variable (Neurotic). Once the three variables are selected, click on any one of them and then drag them into the drop zone. Click on OK to produce the graph. To fit the regression lines, double-click on the graph in the SPSS Viewer to open it in the SPSS Chart Editor. Then click on the fit line button in the Chart Editor to open the Properties dialog box. In this dialog box, ask for a linear model to be fitted to the data (this should be set by default). Click on Apply to fit the lines.

We can conclude that there is no relationship (flat line) between the number of friends and alcohol consumption; there was a negative relationship between how neurotic a person was and their number of friends (line slopes downwards); and there was a slight positive relationship between how neurotic a person was and how much alcohol they drank (line slopes upwards).
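If you would like to reproduce this kind of chart outside SPSS, here is a minimal Python sketch (pandas/matplotlib; the file name lecturer_data.csv and the column names Group and Friends are hypothetical stand-ins for the Chapter 2 data, not part of the original materials):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data frame with the same variables as the Chapter 2 example
df = pd.read_csv("lecturer_data.csv")  # columns assumed: Group, Friends, Alcohol, Neurotic, Income

# Mean and standard error of the number of friends for each group
summary = df.groupby("Group")["Friends"].agg(["mean", "sem"])

# Bar chart of the means with error bars of +/- 1.96 standard errors (roughly a 95% CI)
plt.bar(summary.index, summary["mean"], yerr=1.96 * summary["sem"], capsize=5)
plt.ylabel("Mean number of friends")
plt.xlabel("Group")
plt.show()
```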


Task 2


Using the Infidelity.sav data from Chapter 3 (see Smart Alex's task), plot a clustered error bar chart of the mean number of bullets used against the self and the partner for males and females.

To graph these data we need to select a clustered bar chart in the Chart Builder. We have one repeated-measures variable, which is whether the target had the person's own face on it or the face of their partner, and this is represented in the data file by two columns. In the Chart Builder you need to select these two variables simultaneously by clicking on one and then holding down the Ctrl key on the keyboard and clicking on the other. When they are both highlighted, click on either one and drag it into the y-axis drop zone. The second variable (whether the participant was male or female) was measured using different people (obviously) and so is represented in the data file by a grouping variable (Gender). This variable can be selected in the variable list and dragged into the grouping drop zone. The two groups will now be displayed as different-coloured bars. Add error bars by selecting the error bars option in the Element Properties dialog box. The finished Chart Builder will look like this:


The resulting graph looks like this (the labels on both axes could benefit from some editing!):


The graph shows that, on average, males and females did not differ much in the number of bullets that they shot at the target when it had their partner's face on it. However, men used fewer bullets than women when the target had their own face on it.

Chapter 5

Task 1

Using the ChickFlick.sav data, check the assumptions of normality and homogeneity of variance for the two films (ignore gender): are the assumptions met?

The output you should get looks like that reproduced below (I used the Explore function described in Chapter 5).

The skewness statistics give rise to a z-score of 0.378/0.512 = 0.74 for Bridget Jones's Diary, and 0.04/0.512 = 0.08 for Memento. These show no significant skewness. For kurtosis these values are 0.254/0.992 = 0.26 for Bridget Jones's Diary, and 1.024/0.992 = 1.03 for Memento; so although Memento shows more positive kurtosis, neither is significant.

The Q-Q plots confirm these findings: for both films the expected quantile points are close to those that would be expected from a normal distribution (i.e. the dots fall close to the diagonal line). The K-S tests show no significant deviation from normality for either film. We could report that arousal scores for Bridget Jones's Diary, D(20) = 0.13, ns, and Memento, D(20) = 0.10, ns, were both not significantly different from a normal distribution. Therefore we can assume normality in the sample data. In terms of homogeneity of variance, Levene's test shows that the variances of arousal for the two films were not significantly different, F(1, 38) = 1.90, ns.
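For readers who want to check these assumptions outside SPSS, here is a minimal Python sketch (SciPy-based). The arousal arrays below are placeholders, not the real ChickFlick.sav scores, and the skewness z-test uses the rough large-sample approximation SE ≈ √(6/n):

```python
import numpy as np
from scipy import stats

# Placeholder arousal scores (substitute the real ChickFlick.sav values)
bridget = np.array([10, 12, 15, 13, 9, 14, 11, 16, 12, 13,
                    15, 10, 11, 14, 12, 13, 16, 9, 12, 14])
memento = np.array([22, 25, 19, 24, 28, 21, 23, 26, 20, 27,
                    24, 22, 25, 23, 21, 26, 28, 19, 24, 22])

for name, scores in [("Bridget Jones's Diary", bridget), ("Memento", memento)]:
    z_skew = stats.skew(scores) / np.sqrt(6 / len(scores))   # rough z-test for skewness
    ks = stats.kstest(stats.zscore(scores), "norm")           # K-S test on standardized scores
    sw = stats.shapiro(scores)                                # Shapiro-Wilk test
    print(name, round(z_skew, 2), round(ks.pvalue, 3), round(sw.pvalue, 3))

# Levene's test for homogeneity of variance across the two films
print(stats.levene(bridget, memento))
```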



Task 2

Remember that the numeracy scores were positively skewed in the SPSSExam.sav data (see Figure 5.5)? Transform these data using one of the transformations described in this chapter: do the data become normal?

These are the original histogram and those of the transformed scores (I've included three transformations discussed in the chapter):


None of these histograms appear to be normal. Below is the table of results from the KS test, all of which are significant. The only conclusion is that although the square root transformation does the best job of normalizing the data, none of these transformations actually works!
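A minimal Python sketch of the three transformations (log, square root and reciprocal) and a K-S check of each, assuming a hypothetical numeracy array in place of the real SPSSExam.sav scores; a constant of 1 is added before the log and reciprocal in case any scores are zero:

```python
import numpy as np
from scipy import stats

# Hypothetical numeracy scores (substitute the real SPSSExam.sav values)
numeracy = np.array([4, 6, 3, 7, 5, 2, 9, 4, 5, 3, 8, 6, 4, 2, 7, 5, 3, 6, 4, 10])

transforms = {
    "log":        np.log(numeracy + 1),   # +1 guards against log(0)
    "sqrt":       np.sqrt(numeracy),
    "reciprocal": 1 / (numeracy + 1),     # +1 guards against division by zero
}

for name, scores in transforms.items():
    # K-S test against a normal distribution with the sample mean and SD
    ks = stats.kstest(scores, "norm", args=(scores.mean(), scores.std(ddof=1)))
    print(name, round(ks.statistic, 3), round(ks.pvalue, 3))
```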


Chapter 6

Task 1

A student was interested in whether there was a positive relationship between the time spent doing an essay and the mark received. He got 45 of his friends and timed how long they spent writing an essay (hours) and the percentage they got in the essay (essay). He also translated these grades into their degree classifications (grade): first, upper second, lower second and third class. Using the data in the file EssayMarks.sav, find out what the relationship was between the time spent doing an essay and the eventual mark in terms of percentage and degree class (draw a scatterplot too!).

We're interested in looking at the relationship between hours spent on an essay and the grade obtained. We could simply do a scatterplot of hours spent on the essay (x-axis) and essay mark (y-axis). I've also chosen to highlight the degree classification grades using different symbols (just place the variable grades in the style box). The resulting scatterplot should look like this:


Next, we should check whether the data are parametric using the Explore menu (see Chapter 3). The resulting table is as follows:

Tests of Normality

                        Kolmogorov-Smirnov(a)              Shapiro-Wilk
                        Statistic   df    Sig.             Statistic   df    Sig.
Essay Mark (%)          .111        45    .200*            .977        45    .493
Hours Spent on Essay    .091        45    .200*            .981        45    .662

*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction

The K-S and Shapiro-Wilk statistics are both non-significant (Sig. > .05 in all cases) for both variables, which indicates that they are normally distributed. As such we can use Pearson's correlation coefficient, the result of which is:


Correlations

                                              Essay Mark (%)    Hours Spent on Essay
Essay Mark (%)         Pearson Correlation    1                 .267*
                       Sig. (1-tailed)        .                 .038
                       N                      45                45
Hours Spent on Essay   Pearson Correlation    .267*             1
                       Sig. (1-tailed)        .038              .
                       N                      45                45

*. Correlation is significant at the 0.05 level (1-tailed).

I chose a one-tailed test because a specific prediction was made: there would be a positive relationship; that is, the more time you spend on your essay, the better mark you'll get. This hypothesis is supported because Pearson's r = .27 (a medium effect size), p < .05, is significant.

The second part of the question asks us to do the same analysis but when the percentages are recoded into degree classifications. The degree classifications are ordinal data (not interval): they are ordered categories, so we shouldn't use Pearson's test statistic, but Spearman's and Kendall's ones instead:
Correlations

                                                               Hours Spent on Essay    Grade
Kendall's tau_b   Hours Spent on Essay   Correlation Coefficient   1.000               -.158
                                         Sig. (1-tailed)           .                   .089
                                         N                         45                  45
                  Grade                  Correlation Coefficient   -.158               1.000
                                         Sig. (1-tailed)           .089                .
                                         N                         45                  45
Spearman's rho    Hours Spent on Essay   Correlation Coefficient   1.000               -.193
                                         Sig. (1-tailed)           .                   .102
                                         N                         45                  45
                  Grade                  Correlation Coefficient   -.193               1.000
                                         Sig. (1-tailed)           .102                .
                                         N                         45                  45


In both cases the correlation is non-significant. There was no significant relationship between degree grade classification for an essay and the time spent doing it, rs = -.19, ns, and τ = -.16, ns. Note that the direction of the relationship has reversed compared to the correlation with the percentage marks. This has happened because the essay marks were recoded as 1 (first), 2 (upper second), 3 (lower second) and 4 (third), so high grades were represented by low numbers. This illustrates one of the benefits of not taking continuous data (like percentages) and transforming them into categorical data: when you do, you lose information and often statistical power!
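The same three coefficients can be obtained with a short Python sketch (SciPy-based; the hours, percent and grade arrays are hypothetical stand-ins for the EssayMarks.sav columns):

```python
import numpy as np
from scipy import stats

# Hypothetical stand-ins for the EssayMarks.sav variables
hours   = np.array([8, 12, 5, 9, 15, 7, 11, 6, 10, 13])       # time spent on the essay
percent = np.array([60, 72, 55, 64, 75, 58, 68, 52, 66, 70])  # essay mark (%)
grade   = np.array([3, 2, 4, 3, 1, 4, 2, 4, 2, 2])            # 1 = first ... 4 = third

r, p_r = stats.pearsonr(hours, percent)        # parametric correlation with the % mark
rho, p_rho = stats.spearmanr(hours, grade)     # rank correlation with the ordinal grade
tau, p_tau = stats.kendalltau(hours, grade)    # Kendall's tau with the ordinal grade

# SciPy reports two-tailed p-values; halve them for the one-tailed tests used above
print(r, p_r / 2, rho, p_rho / 2, tau, p_tau / 2)
```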

Task 2

Using the ChickFlick.sav data from Chapter 3, is there a relationship between gender and arousal? Using the same data, is there a relationship between the film watched and arousal?

Both gender and the film watched are categorical variables with two categories. Therefore, we need to look at these relationships using point-biserial correlations. The resulting tables are as follows:

Correlations

                                   Gender    Arousal
Gender    Pearson Correlation      1         -.180
          Sig. (2-tailed)          .         .266
          N                        40        40
Arousal   Pearson Correlation      -.180     1
          Sig. (2-tailed)          .266      .
          N                        40        40

Correlations

                                   Film      Arousal
Film      Pearson Correlation      1         .638**
          Sig. (2-tailed)          .         .000
          N                        40        40
Arousal   Pearson Correlation      .638**    1
          Sig. (2-tailed)          .000      .
          N                        40        40

**. Correlation is significant at the 0.01 level (2-tailed).

In both cases I used a two-tailed test because no prediction was made. As you can see, there was no significant relationship between gender and arousal, rpb = -.18, ns. However, there was a significant relationship between the film watched and arousal, rpb = .64, p < .001. Looking at how the groups were coded, you should see that Bridget Jones's Diary had a code of 1 and Memento had a code of 2; therefore this result reflects the fact that as film goes up (changes from 1 to 2), arousal goes up. Put another way, as the film changes from Bridget Jones's Diary to Memento, arousal increases. So, Memento gave rise to the greater arousal levels.

Task 3

As a statistics lecturer I am always interested in the factors that determine whether a student will do well on a statistics course. One potentially important factor is their previous expertise with mathematics. Imagine I took 25 students and looked at their degree grades for my statistics course at the end of their first year at university. In the UK, a student can get a first-class mark (the best), an upper second, a lower second, a third, a pass or a fail (the worst). I also asked these students what grade they got in their GCSE maths exams. In the UK, GCSEs are school exams taken at age 16 that are graded A, B, C, D, E or F (an A grade is better than all of the lower grades). The data for this study are in the file grades.sav. Carry out the appropriate analysis to see if GCSE maths grades correlate with first-year statistics grades.

Let's look at these variables. In the UK, a student can get a first-class mark, an upper second, a lower second, a third, a pass or a fail. These grades are categories, but they have an order to them (an upper second is better than a lower second). In the UK, GCSEs are school exams taken at age 16 that are graded A, B, C, D, E or F. Again, these grades are categories that have an order of importance (an A grade is better than all of the lower grades). When you have categories like these that can be ordered in a meaningful way, the data are said to be ordinal. The data are not interval, because a first-class degree encompasses a 30% range (70-100%) whereas an upper second covers only a 10% range (60-70%). When data have been measured at only the ordinal level they are said to be non-parametric and Pearson's correlation is not appropriate. Therefore, the Spearman correlation coefficient is used.

The data are in two columns: one labelled stats and one labelled gcse. Each of the categories described above has been coded with a numeric value. In both cases, the highest grade (first class or A grade) has been coded with the value 1, with subsequent categories being labelled 2, 3 and so on. Note that for each numeric code I have provided a value label (just like we did for coding variables).

The procedure for doing the Spearman correlation is the same as for Pearson's correlation except that in the Bivariate Correlations dialog box we need to select the Spearman option and deselect the option for a Pearson correlation. At this stage, you should also specify whether you require a one- or two-tailed test. For the example above, I predicted that better grades in GCSE maths would correlate with better degree grades for my statistics course. This hypothesis is directional and so a one-tailed test should be selected.

The SPSS output shows the Spearman correlation on the variables stats and gcse. The output shows a matrix giving the correlation coefficient between the two variables (.455); underneath is the significance value of this coefficient (.011), and finally the sample size (25). The significance value for this correlation coefficient is less than .05; therefore, it can be concluded that there is a significant relationship between a student's grade in GCSE maths and their degree grade for their statistics course. The correlation itself is positive: therefore, we can conclude that as GCSE grades improve, there is a corresponding improvement in degree grades for statistics. As such, the hypothesis was supported. Finally, it is good to check that the value of N corresponds to the number of observations that were made. If it doesn't then data may have been excluded for some reason.
Correlations

                                                               Statistics Grade    GCSE Maths Grade
Spearman's rho   Statistics Grade    Correlation Coefficient   1.000               .455*
                                     Sig. (1-tailed)           .                   .011
                                     N                         25                  25
                 GCSE Maths Grade    Correlation Coefficient   .455*               1.000
                                     Sig. (1-tailed)           .011                .
                                     N                         25                  25

*. Correlation is significant at the .05 level (1-tailed).

We could also look at Kendall's correlation by selecting the Kendall's tau-b option in the Bivariate Correlations dialog box. The output is much the same as for Spearman's correlation.

The actual value of the correlation coefficient is less than Spearman's correlation (it has decreased from .455 to .354). Despite the difference in the correlation coefficients we can still interpret this result as being a highly significant positive relationship (because the significance value of .015 is less than .05). However, Kendall's value is a more accurate gauge of what the correlation in the population would be. As with Pearson's correlation, we cannot assume that the GCSE grades caused the degree students to do better in their statistics course.
Correlations

                                                               Statistics Grade    GCSE Maths Grade
Kendall's tau_b   Statistics Grade    Correlation Coefficient   1.000               .354*
                                      Sig. (1-tailed)           .                   .015
                                      N                         25                  25
                  GCSE Maths Grade    Correlation Coefficient   .354*               1.000
                                      Sig. (1-tailed)           .015                .
                                      N                         25                  25

*. Correlation is significant at the .05 level (1-tailed).

We could report these results as follows:

1. There was a positive relationship between a person's statistics grade and their GCSE maths grade, rs = .46, p < .05.
2. There was a positive relationship between a person's statistics grade and their GCSE maths grade, τ = .35, p < .05. (Note that I've quoted Kendall's tau here.)

Chapter 7

Task 1

A fashion student was interested in factors that predicted the salaries of catwalk models. She collected data from 231 models. For each model she asked them their salary per day on days when they were working (salary), their age (age), how many years they had worked as a model (years), and then got a panel of experts from modelling agencies to rate the attractiveness of each model as a percentage, with 100% being perfectly attractive (beauty). The data are in the file Supermodel.sav. Unfortunately, this fashion student bought a substandard statistics text and so doesn't know how to analyse her data. Can you help her out by conducting a multiple regression to see which factors predict a model's salary? How valid is the regression model?

Model Summary(b)

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change   Durbin-Watson
1       .429a   .184       .173                14.57213                     .184              17.066     3     227   .000            2.057

a. Predictors: (Constant), Attractiveness (%), Number of Years as a Model, Age (Years)
b. Dependent Variable: Salary per Day (£)

ANOVA(b)

Model 1       Sum of Squares   df    Mean Square   F        Sig.
Regression    10871.964        3     3623.988      17.066   .000a
Residual      48202.790        227   212.347
Total         59074.754        230

a. Predictors: (Constant), Attractiveness (%), Number of Years as a Model, Age (Years)
b. Dependent Variable: Salary per Day (£)

To begin with, a sample size of 231 with three predictors seems reasonable because this would easily detect medium to large effects (see the diagram in the chapter). Overall, the model accounts for 18.4% of the variance in salaries and is a significant fit to the data (F(3, 227) = 17.07, p < .001). The adjusted R2 (.17) shows some shrinkage from the unadjusted value (.184), indicating that the model may not generalize well. We can also use Stein's formula:

adjusted R2 = 1 − [((231 − 1)/(231 − 3 − 1)) × ((231 − 2)/(231 − 3 − 2)) × ((231 + 1)/231)] × (1 − 0.184)
            = 1 − [1.031](0.816)
            = 1 − 0.841
            = 0.159

This also shows that the model may not cross-generalize well.
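As a quick check, here is a minimal Python sketch of Stein's formula (plain arithmetic, not part of the SPSS output):

```python
def stein_adjusted_r2(r2: float, n: int, k: int) -> float:
    """Stein's formula for the cross-validated R-squared (n cases, k predictors)."""
    factor = ((n - 1) / (n - k - 1)) * ((n - 2) / (n - k - 2)) * ((n + 1) / n)
    return 1 - factor * (1 - r2)

print(stein_adjusted_r2(0.184, n=231, k=3))  # roughly 0.16, matching the value above
```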
Coefficients(a)

Model 1                      B         Std. Error   Beta    t        Sig.   95% CI Lower   95% CI Upper   Tolerance   VIF
(Constant)                   -60.890   16.497               -3.691   .000   -93.396        -28.384
Age (Years)                  6.234     1.411        .942    4.418    .000   3.454          9.015          .079        12.653
Number of Years as a Model   -5.561    2.122        -.548   -2.621   .009   -9.743         -1.380         .082        12.157
Attractiveness (%)           -.196     .152         -.083   -1.289   .199   -.497          .104           .867        1.153

a. Dependent Variable: Salary per Day (£)

In terms of the individual predictors we could report:

                     B        SE B    Beta
Constant             -60.89   16.50
Age                  6.23     1.41    .94**
Years as a model     -5.56    2.12    -.55*
Attractiveness       -0.20    0.15    -.08

Note: R2 = .18 (p < .001). * p < .01, ** p < .001.
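If you wanted to reproduce this model outside SPSS, a minimal sketch with statsmodels might look like this (the file name supermodel.csv and the column names salary, age, years and beauty are assumptions about the Supermodel.sav variables, not confirmed):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed layout of Supermodel.sav exported to CSV
models = pd.read_csv("supermodel.csv")  # columns assumed: salary, age, years, beauty

fit = smf.ols("salary ~ age + years + beauty", data=models).fit()
print(fit.summary())        # coefficients, t-tests, R-squared, F-test
print(fit.rsquared_adj)     # compare with the adjusted R-squared reported above
```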


It seems as though salaries are significantly predicted by the age of the model. This is a positive relationship (look at the sign of the beta), indicating that as age increases, salaries increase too. The number of years spent as a model also seems to significantly predict salaries, but this is a negative relationship indicating that the more years you've spent as a model, the lower your salary. This finding seems very counter-intuitive, but we'll come back to it later. Finally, the attractiveness of the model doesn't seem to predict salaries.

If we wanted to write the regression model, we could write it as:

Salary_i = b0 + b1 Age_i + b2 Experience_i + b3 Attractiveness_i
         = -60.89 + (6.23 × Age_i) − (5.56 × Experience_i) − (0.20 × Attractiveness_i)

The next part of the question asks whether this model is valid.

Collinearity Diagnostics(a)

Model 1                                            Variance Proportions
Dimension   Eigenvalue   Condition Index   (Constant)   Number of Years as a Model   Age (Years)   Attractiveness (%)
1           3.925        1.000             .00          .00                          .00           .00
2           .070         7.479             .01          .00                          .08           .02
3           .004         30.758            .30          .02                          .01           .94
4           .001         63.344            .69          .98                          .91           .04

a. Dependent Variable: Salary per Day (£)

Casewise Diagnostics(a)

Case Number   Std. Residual   Salary per Day (£)   Predicted Value   Residual
2             2.186           53.72                21.8716           31.8532
5             4.603           95.34                28.2647           67.0734
24            2.232           48.87                16.3444           32.5232
41            2.411           51.03                15.8861           35.1390
91            2.062           56.83                26.7856           30.0459
116           3.422           64.79                14.9259           49.8654
127           2.753           61.32                21.2059           40.1129
135           4.672           89.98                21.8946           68.0854
155           3.257           74.86                27.4025           47.4582
170           2.170           54.57                22.9401           31.6254
191           3.153           50.66                4.7164            45.9394
198           3.510           71.32                20.1729           51.1478

a. Dependent Variable: Salary per Day (£)

[Diagnostic plots: a histogram of the regression standardized residuals (Std. Dev = 0.99, Mean = 0.00, N = 231); a normal P-P plot of the regression standardized residuals; a scatterplot of the regression standardized residuals against the standardized predicted values; and partial regression plots of Salary per Day (£) against Age (Years), Number of Years as a Model and Attractiveness (%).]

Residuals: There are six cases that have a standardized residual greater than 3, and two of these are fairly substantial (cases 5 and 135). We have 5.19% of cases with standardized residuals above 2, so that's as we expect, but 3% of cases with residuals above 2.5 (we'd expect only 1%), which indicates possible outliers.

Normality of errors: The histogram reveals a skewed distribution, indicating that the normality of errors assumption has been broken. The normal P-P plot verifies this because the dashed line deviates considerably from the straight line (which indicates what you'd get from normally distributed errors).

Homoscedasticity and independence of errors: The scatterplot of ZPRED vs. ZRESID does not show a random pattern. There is a distinct funnelling, indicating heteroscedasticity. However, the Durbin-Watson statistic does fall within Field's recommended boundaries of 1-3, which suggests that errors are reasonably independent.

Multicollinearity: For the age and experience variables in the model, VIF values are above 10 (or alternatively, tolerance values are all well below 0.2), indicating multicollinearity in the data. In fact, if you look at the correlation between these two variables it is around .9! So, these two variables are measuring very similar things. Of course, this makes perfect sense because the older a model is, the more years she would've spent modelling! So, it was fairly stupid to measure both of these things! This also explains the weird result that the number of years spent modelling negatively predicted salary (i.e. more experience = less salary!): in fact, if you do a simple regression with experience as the only predictor of salary you'll find it has the expected positive relationship. This hopefully demonstrates why multicollinearity can bias the regression model.
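A minimal sketch of how these VIF values could be computed outside SPSS (statsmodels-based; the file and column names are the same assumptions as in the earlier regression sketch):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

models = pd.read_csv("supermodel.csv")             # columns assumed: salary, age, years, beauty
X = sm.add_constant(models[["age", "years", "beauty"]])

# One VIF per predictor (skip column 0, which is the constant)
for i, name in enumerate(X.columns[1:], start=1):
    print(name, variance_inflation_factor(X.values, i))
```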


All in all, several assumptions have not been met and so this model is probably fairly unreliable.
Task 2

Using the Glastonbury data from this chapter (with the dummy coding in GlastonburyDummy.sav), which you should've already analysed, comment on whether you think the model is reliable and generalizable.

This question asks whether this model is valid.


Model Summary(b)

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change   Durbin-Watson
1       .276a   .076       .053                .68818                       .076              3.270      3     119   .024            1.893

a. Predictors: (Constant), No Affiliation vs. Indie Kid, No Affiliation vs. Crusty, No Affiliation vs. Metaller
b. Dependent Variable: Change in Hygiene Over The Festival

Coefficients(a)

Model 1                        B       Std. Error   Beta    t        Sig.   95% CI Lower   95% CI Upper   Tolerance   VIF
(Constant)                     -.554   .090                 -6.134   .000   -.733          -.375
No Affiliation vs. Crusty      -.412   .167         -.232   -2.464   .015   -.742          -.081          .879        1.138
No Affiliation vs. Metaller    .028    .160         .017    .177     .860   -.289          .346           .874        1.144
No Affiliation vs. Indie Kid   -.410   .205         -.185   -2.001   .048   -.816          -.004          .909        1.100

a. Dependent Variable: Change in Hygiene Over The Festival

Collinearity Diagnostics(a)

Model 1                                            Variance Proportions
Dimension   Eigenvalue   Condition Index   (Constant)   No Affiliation vs. Crusty   No Affiliation vs. Metaller   No Affiliation vs. Indie Kid
1           1.727        1.000             .14          .08                         .08                           .05
2           1.000        1.314             .00          .37                         .32                           .00
3           1.000        1.314             .00          .07                         .08                           .63
4           .273         2.515             .86          .48                         .52                           .32

a. Dependent Variable: Change in Hygiene Over The Festival

Casewise Diagnostics(a)

Case Number   Std. Residual   Change in Hygiene Over The Festival   Predicted Value   Residual
31            -2.302          -2.55                                 -.9658            -1.5842
153           2.317           1.04                                  -.5543            1.5943
202           -2.653          -2.38                                 -.5543            -1.8257
346           -2.479          -2.26                                 -.5543            -1.7057
479           2.215           .97                                   -.5543            1.5243

a. Dependent Variable: Change in Hygiene Over The Festival


[Diagnostic plots: a histogram of the regression standardized residuals (Std. Dev = 0.99, Mean = 0.00, N = 123); a normal P-P plot of the regression standardized residuals; a scatterplot of the regression standardized residuals against the standardized predicted values; and partial regression plots of Change in Hygiene Over The Festival against No Affiliation vs. Crusty, No Affiliation vs. Metaller and No Affiliation vs. Indie Kid.]


Residuals: There are no cases that have a standardized residual greater than 3. We have 4.07% of cases with standardized residuals above 2, so that's as we expect, and 0.81% of cases with residuals above 2.5 (we'd expect 1%), which indicates the data are consistent with what we'd expect.

Normality of errors: The histogram looks reasonably normally distributed, indicating that the normality of errors assumption has probably been met. The normal P-P plot verifies this because the dashed line doesn't deviate much from the straight line (which indicates what you'd get from normally distributed errors).

Homoscedasticity and independence of errors: The scatterplot of ZPRED vs. ZRESID does look a bit odd with categorical predictors, but essentially we're looking for the height of the lines to be about the same (indicating the variability at each of the three levels is the same). This is true, indicating homoscedasticity. The Durbin-Watson statistic also falls within Field's recommended boundaries of 1-3, which suggests that errors are reasonably independent.

Multicollinearity: For all variables in the model, VIF values are below 10 (or alternatively, tolerance values are all well above 0.2), indicating no multicollinearity in the data.

All in all, the model looks fairly reliable (but you should check for influential cases!).
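A minimal Python sketch of the same dummy-coded regression (pandas/statsmodels; the file name glastonbury.csv and the column names change and music are assumptions about how the Glastonbury data might be laid out, not the actual GlastonburyDummy.sav variable names):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed layout: one column with the hygiene change score and one with musical affiliation
glasto = pd.read_csv("glastonbury.csv")   # columns assumed: change, music

# Dummy-code affiliation with 'No Affiliation' as the baseline category
for label, name in [("Crusty", "crusty"), ("Metaller", "metaller"), ("Indie Kid", "indie")]:
    glasto[name] = (glasto["music"] == label).astype(int)

fit = smf.ols("change ~ crusty + metaller + indie", data=glasto).fit()
print(fit.summary())   # b for each dummy = difference from the No Affiliation group
```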
Task 3

A study was carried out to explore the relationship between aggression and several potential predicting factors in 666 children who had an older sibling. Variables measured were Parenting_Style (high score = bad parenting practices), Computer_Games (high score = more time spent playing computer games), Television (high score = more time spent watching television), Diet (high score = the child has a good diet low in E-numbers), and Sibling_Aggression (high score = more aggression seen in their older sibling). Past research indicated that parenting style and sibling aggression were good predictors of the level of aggression in the younger child. All other variables were treated in an exploratory fashion. The data are in the file Child Aggression.sav. Analyse them with multiple regression.

We need to conduct this analysis hierarchically entering parenting style and sibling aggression in the first step (forced entry) and the remaining variables in a second step (stepwise):

[SPSS output for the hierarchical multiple regression.]

Based on the final model (which is actually all we're interested in) the following variables predict aggression:

Parenting style (b = 0.062, β = 0.194, t = 4.93, p < .001) significantly predicted aggression. The beta value indicates that as parenting style increases (i.e. as bad practices increase), aggression increases also.


Sibling aggression (b = 0.086, β = 0.088, t = 2.26, p < .05) significantly predicted aggression. The beta value indicates that as sibling aggression increases (became more aggressive), aggression increases also.

Computer games (b = 0.143, β = 0.037, t = 3.89, p < .001) significantly predicted aggression. The beta value indicates that as the time spent playing computer games increases, aggression increases also.

E-numbers (b = -0.112, β = -0.118, t = -2.95, p < .01) significantly predicted aggression. The beta value indicates that as the diet improved, aggression decreased.

The only factor not to predict aggression was television (b if entered = .032, t = 0.72, p > .05).

Based on the standardized beta values, the most substantive predictor of aggression was actually parenting style, followed by computer games, diet and then sibling aggression. R2 is the squared correlation between the observed values of aggression and the values of aggression predicted by the model. The values in this output tell us that sibling aggression and parenting style in combination explain 5.3% of the variance in aggression. When computer game use is factored in as well, 7% of variance in aggression is explained (i.e. an additional 1.7%). Finally, when diet is added to the model, 8.2% of the variance in aggression is explained (an additional 1.2%). So even with all four of these predictors in the model, only a small proportion (8.2%) of the variance in aggression can be explained.
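A minimal Python sketch of this kind of hierarchical comparison (statsmodels; the lower-case file and column names are assumptions about Child Aggression.sav, and SPSS's stepwise entry is replaced here by simply comparing nested models):

```python
import pandas as pd
import statsmodels.formula.api as smf

kids = pd.read_csv("child_aggression.csv")  # columns assumed: aggression, parenting, sibling, games, diet, tv

# Step 1: forced entry of the predictors suggested by past research
step1 = smf.ols("aggression ~ parenting + sibling", data=kids).fit()

# Step 2: add the exploratory predictors and compare the change in R-squared
step2 = smf.ols("aggression ~ parenting + sibling + games + diet + tv", data=kids).fit()

print(step1.rsquared, step2.rsquared)   # R-squared at each step
print(step2.compare_f_test(step1))      # F-test of the R-squared change (F, p, df difference)
```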


The Durbin-Watson statistic tests the assumption of independence of errors, which means that for any two observations (cases) in the regression, their residuals should be uncorrelated (or independent). In this output the Durbin-Watson statistic falls within the recommended boundaries of 1-3, which suggests that errors are reasonably independent. The scatterplot helps us to assess both homoscedasticity and independence of errors. The scatterplot of ZPRED vs. ZRESID does show a random pattern and so indicates no violation of the independence of errors assumption. Also, the errors on the scatterplot do not funnel out, indicating homoscedasticity of errors; thus there are no violations of these assumptions.

Chapter 8

Task 1

A psychologist was interested in whether children's understanding of display rules can be predicted from their age, and whether the child possesses a theory of mind. A display rule is a convention of displaying an appropriate emotion in a given situation. For example, if you receive a Christmas present that you don't like, the appropriate emotional display is to smile politely and say 'Thank you, Auntie Kate, I've always wanted a rotting cabbage'. The inappropriate emotional display is to start crying and scream 'Why did you buy me a rotting cabbage, you selfish old bag?' Using appropriate display rules has been linked to having a theory of mind (the ability to understand what another person might be thinking).
[Cartoon: 'Why did you buy me this crappy statistics textbook for Christmas, Auntie Kate?']

To test this theory, children were given a false belief task (a task used to measure whether someone has a theory of mind), a display rule task (which they could either pass or fail), and their age in months was measured. The data are in Display.sav. Run a logistic regression to see whether possession of display rule understanding (did the child pass the test: Yes/No?) can be predicted from possession of a theory of mind (did the child pass the false belief task: Yes/No?), age in months and their interaction.

For this example, our researchers are interested in whether the understanding of emotional display rules was linked to having a theory of mind. The rationale is that it might be necessary for a child to understand how another person thinks to realize how their emotional displays will affect that person: if you can't put yourself in Auntie Kate's mind, then you won't realize that she might be upset by you calling her an old bag. To test this theory, several children were given a standard false belief task (a task used to measure whether someone has a theory of mind) that they could either pass or fail, and their age in months was also measured. In addition, each child was given a display rule task, which they could either pass or fail. So, the following variables were measured:

1. Outcome (dependent variable): Possession of display rule understanding (Did the child pass the test: Yes/No?).
2. Predictor (independent variable): Possession of a theory of mind (Did the child pass the false belief task: Yes/No?).
3. Predictor (independent variable): Age in months.

The Main Analysis


To carry out logistic regression, the data must be entered as for normal regression: they are arranged in the data editor in three columns (one representing each variable). The data can be found in the file Display.sav. Looking at the data editor you should notice that both of the categorical variables have been entered as coding variables; that is, numbers have been specified to represent categories. For ease of interpretation, the outcome variable should be coded 1 (event occurred) and 0 (event did not occur); in this case, 1 represents having display rule understanding, and 0 represents an absence of display rule understanding. For the false belief task a similar coding has been used (1 = passed the false belief task, 2 = failed the false belief task). Logistic regression is located in the regression menu, accessed by selecting Analyze > Regression > Binary Logistic.

Following this menu path activates the main Logistic Regression dialog box shown below.

The main dialog box is very similar to the standard regression option box. There is a space to place a dependent variable (or outcome variable). In this example, the outcome was the display rule task, so we can simply click on display and transfer it to the Dependent box by clicking on the arrow button. There is also a box for specifying the covariates (the predictor variables). It is possible to specify both main effects and interactions in logistic regression. To specify a main effect, simply select one predictor (e.g. age) and then transfer this variable to the Covariates box by clicking on the arrow button. To input an interaction, click on more than one variable on the left-hand side of the dialog box (i.e. highlight two or more variables) and then click on the interaction button to move them to the Covariates box.

For this analysis select a Forward: LR method of regression. In this example there is one categorical predictor variable. One of the great things about logistic regression is that it is quite happy to accept categorical predictors. However, it is necessary to tell SPSS which variables, if any, are categorical by clicking on Categorical in the main Logistic Regression dialog box to activate the following dialog box:

The covariates are listed on the left-hand side, and there is a space on the right-hand side in which categorical covariates can be placed. Simply highlight any categorical variables you have (in this example click on fb) and transfer them to the Categorical Covariates box by clicking on the arrow button. There are many ways in which you can treat categorical predictors. Categorical predictors could be incorporated into regression by recoding them using zeros and ones (known as dummy coding). Now, actually, there are different ways you can arrange this coding depending on what you want to compare, and SPSS has several standard ways built into it that you can select. By default SPSS uses indicator coding, which is the standard dummy variable coding that I explained in Chapter 7 (and you can choose to have either the first or last category as your baseline). To change to a different kind of contrast, click on the down arrow in the Change Contrast box. Select Indicator coding (first).

Obtaining Residuals

To save residuals, click on Save in the main Logistic Regression dialog box. SPSS saves each of the selected variables into the data editor. The residuals dialog box gives us several options and most of these are the same as those in multiple regression. Select all of the available options, or as a bare minimum select the same options as shown here:

Further Options

There is a final dialog box that offers further options. This box is accessed by clicking on Options in the main Logistic Regression dialog box. For the most part, the default settings in this dialog box are fine. These options are explained in the chapter and so just select the following:

Interpreting The Output

Dependent Variable Encoding

Original Value   Internal Value
No               0
Yes              1

Categorical Variables Codings

                                       Frequency   Parameter coding (1)
False Belief understanding   No        29          .000
                             Yes       41          1.000

These tables tell us the parameter codings given to the categorical predictor variable. Indicator coding was chosen with two categories, and so the coding is the same as the values in the data editor.

Classification Table(a,b)

Step 0                                          Predicted
                                                Display Rule understanding    Percentage Correct
Observed                                        No        Yes
Display Rule understanding    No                0         31                  .0
                              Yes               0         39                  100.0
Overall Percentage                                                            55.7

a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation

                   B      S.E.   Wald   df   Sig.   Exp(B)
Step 0   Constant  .230   .241   .910   1    .340   1.258

Variables not in the Equation

                                    Score    df   Sig.
Step 0   Variables   AGE            15.956   1    .000
                     FB(1)          24.617   1    .000
                     AGE by FB(1)   23.987   1    .000
         Overall Statistics         26.257   3    .000

For this first analysis we requested a forward stepwise method and so the initial model is derived using only the constant in the regression equation. The above output tells us about the model when only the constant is included (i.e. all predictor variables are omitted). Although SPSS doesn't display this value, the log-likelihood of this baseline model is 96.124 (trust me for the time being!). This represents the fit of the model when the most basic model is fitted to the data. When including only the constant, the computer bases the model on assigning every participant to a single category of the outcome variable. In this example, SPSS can decide either to predict that every child has display rule understanding, or to predict that all children do not have display rule understanding. It could make this decision arbitrarily, but because it is crucial to try to maximize how well the model predicts the observed data, SPSS will predict that every child belongs to the category in which most observed cases fell. In this example there were 39 children who had display rule understanding and only 31 who did not. Therefore, if SPSS predicts that every child has display rule understanding then this prediction will be correct 39 times out of 70 (i.e. 56% approx.). However, if SPSS predicted that every child did not have display rule understanding, then this prediction would be correct only 31 times out of 70 (44% approx.). As such, of the two available options it is better to predict that all children had display rule understanding because this results in a greater number of correct predictions. The output shows a contingency table for the model in this basic state. You can see that SPSS has predicted that all children have display rule understanding, which results in 0% accuracy for the children who were observed to have no display rule understanding, and 100% accuracy for those children observed to have passed the display rule task. Overall, the model correctly classifies 55.71% of children. The next part of the output summarizes the model, and at this stage this entails quoting the value of the constant (b0), which is equal to 0.23. The final table of the output is labelled Variables not in the Equation. The bottom line of this table reports the residual chi-square statistic as 26.257, which is significant at p < .0001 (it labels this statistic Overall Statistics). This statistic tells us that the coefficients for the variables not in the model are significantly different from zero; in other words, that the addition of one or more of these variables to the model will significantly affect its predictive power. If the probability for the residual chi-square had been greater than .05 it would have meant that none of the variables excluded from the model could make a significant contribution to the predictive power of the model. As such, the analysis would have terminated at this stage.


The remainder of this table lists each of the predictors in turn, with a value of Rao's efficient score statistic for each one (column labelled Score). In large samples when the null hypothesis is true, the score statistic is identical to the Wald statistic and the likelihood ratio statistic. It is used at this stage of the analysis because it is computationally less intensive than the Wald statistic and so can still be calculated in situations when the Wald statistic would prove prohibitive. Like any test statistic, Rao's score statistic has a specific distribution from which statistical significance can be obtained. In this example, all excluded variables have significant score statistics at p < .001 and so all three could potentially make a contribution to the model. The stepwise calculations are relative and so the variable that will be selected for inclusion is the one with the highest value for the score statistic that is significant at the .05 level of significance. In this example, that variable will be fb because it has the highest value of the score statistic. The next part of the output deals with the model after this predictor has been added.

In the first step, false belief understanding (fb) is added to the model as a predictor. As such a child is now classified as having display rule understanding based on whether they passed or failed the false belief task.
Omnibus Tests of Model Coefficients (Step 1)

           Chi-square     df     Sig.
Step           26.083      1     .000
Block          26.083      1     .000
Model          26.083      1     .000

Model Summary (Step 1)

  -2 Log likelihood     Cox & Snell R Square     Nagelkerke R Square
             70.042                     .311                    .417

Classification Table (Step 1) (a)

                                               Predicted
                                      Display Rule understanding     Percentage
Observed                                   No            Yes          Correct
Display Rule understanding    No           23              8            74.2
                              Yes           6             33            84.6
Overall Percentage                                                      80.0

a. The cut value is .500

The above shows summary statistics about the new model (which we've already seen contains fb). The overall fit of the new model is assessed using the log-likelihood statistic. In SPSS, rather than reporting the log-likelihood itself, the value is multiplied by −2 (and sometimes referred to as −2LL): this multiplication is done because −2LL has an approximately chi-square distribution and so makes it possible to compare values against those that we might expect to get by chance alone. Remember that large values of the log-likelihood statistic indicate poorly fitting statistical models. At this stage of the analysis the value of −2 log-likelihood should be less than the value when only the constant was included in the model (because lower values of −2LL indicate that the model is predicting the outcome variable more accurately). When only the constant was included, −2LL = 96.124, but now fb has been included this value has been reduced to 70.042. This reduction tells us that the model is better at predicting display rule understanding than it was before fb was added. The question of how much better the model predicts the outcome variable can be assessed using the model chi-square statistic, which measures the difference between the model as it currently stands and the model when only the constant was included. We can assess the significance of the change in a model by taking the log-likelihood of the new model and subtracting the log-likelihood of the baseline model from it. The value of the model chi-square statistic works on this principle and is, therefore, equal to the value of −2LL when only the constant was in the model minus the value


of −2LL with fb included (96.124 − 70.042 = 26.083). This value has a chi-square distribution and so its statistical significance can be easily calculated. In this example, the value is significant at a .05 level and so we can say that overall the model is predicting display rule understanding significantly better than it was with only the constant included. The model chi-square is an analogue of the F-test for the linear regression sum of squares. In an ideal world we would like to see a non-significant −2LL (indicating that the amount of unexplained data is minimal) and a highly significant model chi-square statistic (indicating that the model including the predictors is significantly better than without those predictors). However, in reality it is possible for both statistics to be highly significant. There is a second statistic called the step statistic that indicates the improvement in the predictive power of the model since the last stage. At this stage there has been only one step in the analysis and so the value of the improvement statistic is the same as the model chi-square. However, in more complex models in which there are three or four stages, this statistic gives you a measure of the improvement of the predictive power of the model since the last step. Its value is equal to −2LL at the current step minus −2LL at the previous step. If the improvement statistic is significant then it indicates that the model now predicts the outcome significantly better than it did at the last step, and in a forward regression this can be taken as an indication of the contribution of a predictor to the predictive power of the model. Similarly, the block statistic provides the change in −2LL since the last block (for use in hierarchical or blockwise analyses). Finally, the classification table at the end of this section of the output indicates how well the model predicts group membership. The current model correctly classifies 23 children


who don't have display rule understanding but misclassifies 8 others (i.e. it correctly classifies 74.19% of cases). For children who do have display rule understanding, the model correctly classifies 33 and misclassifies 6 cases (i.e. correctly classifies 84.62% of cases). The overall accuracy of classification is, therefore, the weighted average of these two values (80%). So, when only the constant was included, the model correctly classified 56% of children, but now, with the inclusion of fb as a predictor, this has risen to 80%.
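To check the model chi-square arithmetic yourself, a short Python sketch (using only the two −2LL values quoted above) reproduces the chi-square statistic and its p-value:

from scipy import stats

neg2ll_baseline = 96.124   # -2LL with only the constant
neg2ll_fb = 70.042         # -2LL after false belief understanding is added

model_chi_square = neg2ll_baseline - neg2ll_fb     # approx. 26.083 (allowing for rounding)
df = 1                                             # one predictor added
p_value = stats.chi2.sf(model_chi_square, df)

print(round(model_chi_square, 3), p_value)         # 26.082, p < .001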
Variables in the Equation (Step 1) (a)

               B      S.E.      Wald     df     Sig.     Exp(B)     95.0% C.I. for Exp(B)
                                                                     Lower        Upper
FB(1)       2.761     .605    20.856      1     .000     15.812      4.835       51.706
Constant   -1.344     .458     8.592      1     .003       .261

a. Variable(s) entered on step 1: FB.

The next part of the output is crucial because it tells us the estimates for the coefficients for the predictors included in the model. This section of the output gives us the coefficients and statistics for the variables that have been included in the model at this point (namely, fb and the constant). The interpretation of this coefficient in logistic regression is that it represents the change in the logit of the outcome variable associated with a one-unit change in the predictor variable. The logit of the outcome is simply the natural logarithm of the odds of Y occurring. The crucial statistic is the Wald statistic, which has a chi-square distribution and tells us whether the b coefficient for that predictor is significantly different from zero. If the coefficient is significantly different from zero then we can assume that the predictor is making a significant contribution to the prediction of the outcome (Y). For these data it seems to indicate that false belief understanding is a significant predictor of display rule understanding (note the significance of the Wald statistic is less than .05).

We can calculate an analogue of R using the equation in the chapter (for these data, the Wald statistic and its df are 20.856 and 1 respectively, and the original −2LL was 96.124). Therefore, R can be calculated as:

R = \sqrt{\frac{20.856 - (2 \times 1)}{96.124}} = .4429

Hosmer and Lemeshow's measure (R²L) is calculated by dividing the model chi-square by the original −2LL. In this example the model chi-square after all variables have been entered into the model is 26.083, and the original −2LL (before any variables were entered) was 96.124. So, R²L = 26.083/96.124 = .271, which is different from the value we would get by squaring the value of R given above (R² = .4429² = .196).

Cox and Snell's measure, which SPSS reports as .311, is calculated from the equation in the book chapter. Remember that this equation uses the log-likelihood, whereas SPSS reports −2 × log-likelihood. LL(new) is, therefore, −70.042/2 = −35.021, and LL(baseline) = −96.124/2 = −48.062. The sample size, n, is 70:

R^2_{CS} = 1 - e^{-\frac{2}{70}\left[-35.021 - (-48.062)\right]} = 1 - e^{-0.3726} = 1 - 0.6889 = 0.311


Nagelkerke's adjusted value is .417. This is calculated as:

R^2_N = \frac{R^2_{CS}}{1 - e^{\frac{2(-48.062)}{70}}} = \frac{0.311}{1 - e^{-1.3732}} = \frac{0.311}{1 - 0.2533} = 0.416
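The same calculations are easy to script. The sketch below (Python, using only the summary values reported above) reproduces each measure:

import numpy as np

neg2ll_baseline = 96.124       # -2LL, constant only
neg2ll_new = 70.042            # -2LL, fb included
wald, df_wald, n = 20.856, 1, 70

ll_baseline = -neg2ll_baseline / 2                     # -48.062
ll_new = -neg2ll_new / 2                               # -35.021
model_chi_square = neg2ll_baseline - neg2ll_new        # 26.083

r_analogue = np.sqrt((wald - 2 * df_wald) / neg2ll_baseline)     # .4429
r2_hl = model_chi_square / neg2ll_baseline                       # .271 (Hosmer & Lemeshow)
r2_cs = 1 - np.exp((2 / n) * (ll_baseline - ll_new))             # .311 (Cox & Snell)
r2_n = r2_cs / (1 - np.exp((2 / n) * ll_baseline))               # .417 (Nagelkerke)

print(round(r_analogue, 4), round(r2_hl, 3), round(r2_cs, 3), round(r2_n, 3))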

As you can see, there's a fairly substantial difference between the two values! The final thing we need to look at is exp b (Exp(B) in the SPSS output), which was described in the book chapter. To calculate the change in odds that results from a unit change in the predictor for this example, we must first calculate the odds of a child having display rule understanding given that they don't have second-order false belief task understanding. We then calculate the odds of a child having display rule understanding given that they do have false belief understanding. Finally, we calculate the proportionate change in these two odds. To calculate the first set of odds, we need to calculate the probability of a child having display rule understanding given that they failed the false belief task. The parameter coding at the beginning of the output told us that children who failed the false belief task were coded with a 0, so we can use this value in place of X. The value of b1 has been estimated for us as 2.7607 (see Variables in the Equation), and the coefficient for the constant can be taken from the same table and is −1.3437. We can calculate the odds as:


P(\text{event } Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}} = \frac{1}{1 + e^{-[-1.3437 + (2.7607 \times 0)]}} = 0.2069

P(\text{no event } Y) = 1 - P(\text{event } Y) = 1 - 0.2069 = 0.7931

\text{odds} = \frac{0.2069}{0.7931} = 0.2609

Now, we calculate the same thing after the predictor variable has changed by one unit. In this case, because the predictor variable is dichotomous, we need to calculate the odds of a child passing the display rule task, given that they have passed the false belief task. So, the value of the false belief variable, X, is now 1 (rather than 0). The resulting calculations are:
P(\text{event } Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}} = \frac{1}{1 + e^{-[-1.3437 + (2.7607 \times 1)]}} = 0.8049

P(\text{no event } Y) = 1 - P(\text{event } Y) = 1 - 0.8049 = 0.1951

\text{odds} = \frac{0.8049}{0.1951} = 4.1256


We now know the odds before and after a unit change in the predictor variable. It is now a simple matter to calculate the proportionate change in odds by dividing the odds after a unit change in the predictor by the odds before that change.

\Delta\text{odds} = \frac{\text{odds after a unit change in the predictor}}{\text{original odds}} = \frac{4.1256}{0.2609} = 15.8129

You should notice that the value of the proportionate change in odds is the same as the value that SPSS reports for exp b (allowing for differences in rounding). We can interpret exp b in terms of the change in odds. If the value is greater than 1 then it indicates that as the predictor increases, the odds of the outcome occurring increase. Conversely, a value less than 1 indicates that as the predictor increases, the odds of the outcome occurring decrease. In this example, we can say that the odds of a child who has false belief understanding also having display rule understanding are 15 times higher than those of a child who does not have false belief understanding. In the options dialog box we requested a confidence interval for exp b, and it can also be found in the output. The way to interpret this confidence interval is to say that if we ran 100 experiments and calculated confidence intervals for the value of exp b, then these intervals would encompass the actual value of exp b in the population (rather than the sample) on 95 occasions. So, in this case, we can be fairly confident that the population value of exp b lies between 4.84 and 51.71. However, there is a 5% chance that a sample could give a confidence interval that misses the true value.
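If you'd rather let a script do this arithmetic, the sketch below (Python, using the coefficient estimates quoted above) reproduces the two odds and their ratio, which is exp(b1):

import numpy as np

b0, b1 = -1.3437, 2.7607                    # constant and coefficient for fb

def p_event(x):
    """Predicted probability of display rule understanding for fb = x."""
    return 1 / (1 + np.exp(-(b0 + b1 * x)))

odds_fail_fb = p_event(0) / (1 - p_event(0))    # 0.2609
odds_pass_fb = p_event(1) / (1 - p_event(1))    # 4.1256
odds_ratio = odds_pass_fb / odds_fail_fb        # 15.81, i.e. Exp(B) for fb (= e**b1)

print(round(odds_fail_fb, 4), round(odds_pass_fb, 4), round(odds_ratio, 2))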


Model if Term Removed

Variable (Step 1)     Model Log Likelihood     Change in -2 Log Likelihood     df     Sig. of the Change
FB                                 -48.062                          26.083      1                   .000

Variables not in the Equation (Step 1)

                        Score     df     Sig.
AGE                     2.313      1     .128
AGE by FB(1)            1.261      1     .261
Overall Statistics      2.521      2     .283

The test statistics for fb if it were removed from the model are reported above; in a stepwise analysis SPSS also checks whether predictors already in the model now meet the removal criterion, and the Model if Term Removed part of the output tells us the effect of removing fb. The important thing to note is the significance value of the log-likelihood ratio (log LR). The log LR for this model is highly significant (p < .0001), which tells us that removing fb from the model would have a significant effect on the predictive ability of the model; in other words, it would be a very bad idea to remove it! Finally, we are told about the variables currently not in the model. First of all, the residual chi-square (labelled Overall Statistics in the output), which is non-significant, tells us that none of the remaining variables have coefficients significantly different from zero. Furthermore, each variable is listed with its score statistic and significance value, and for both variables their coefficients are not significantly different from zero (as can be seen from the significance values of .128 for age and .261 for the interaction of age and false belief understanding). Therefore, no further variables will be added to the equation. The next part of the output displays the classification plot that we requested in the options dialog box. This plot is a histogram of the predicted probabilities of a child passing the


display rule task. If the model perfectly fits the data, then this histogram should show all of the cases for which the event has occurred on the right-hand side, and all the cases for which the event hasn't occurred on the left-hand side. In other words, all the children who passed the display rule task should appear on the right and all those who failed should appear on the left. In this example, the only significant predictor is dichotomous and so there are only two columns of cases on the plot. If the predictor were a continuous variable, the cases would be spread out across many columns. As a rule of thumb, the more the cases cluster at each end of the graph, the better. This statement is true because such a plot would show that when the outcome did actually occur (i.e. the child did pass the display rule task) the predicted probability of the event occurring is also high (i.e. close to 1). Likewise, at the other end of the plot it would show that when the event didn't occur (i.e. when the child failed the display rule task) the predicted probability of the event occurring is also low (i.e. close to 0). This situation represents a model that is correctly predicting the observed outcome data. If, however, there are a lot of points clustered in the centre of the plot then it shows that for many cases the model is predicting a probability of .5 that the event will occur. In other words, for these cases there is little more than a 50:50 chance that the data are correctly predicted; as such, the model could predict these cases just as accurately by simply tossing a coin! Also, a good model will ensure that few cases are misclassified; in this example there are two Ns on the right of the plot and one Y on the left. These are misclassified cases, and the fewer of these there are, the better the model.

[Classification plot: Observed Groups and Predicted Probabilities]


Listing Predicted Probabilities SPSS saved the predicted probabilities and predicted group memberships as variables in the data editor and named them PRE_1 and PGR_1 respectively. These probabilities can be listed using the Case Summaries dialog box (see the book chapter). Below is a selection of the predicted probabilities (because the only significant predictor was a dichotomous variable, there will be only two different probability values). It is also worth listing the predictor variables as well to clarify from where the predicted probabilities come.


Case Summaries (a)

      Case Number    Age in years    False Belief understanding    Display Rule understanding    Predicted probability    Predicted group
 1              1           24.00    No                            No                                           .20690    No
 2              5           36.00    No                            No                                           .20690    No
 3              9           34.00    No                            Yes                                          .20690    No
 4             10           31.00    No                            No                                           .20690    No
 5             11           32.00    No                            No                                           .20690    No
 6             12           30.00    Yes                           Yes                                          .80488    Yes
 7             20           26.00    No                            No                                           .20690    No
 8             21           29.00    No                            No                                           .20690    No
 9             29           45.00    Yes                           Yes                                          .80488    Yes
10             31           41.00    No                            Yes                                          .20690    No
11             32           32.00    No                            No                                           .20690    No
12             43           56.00    Yes                           Yes                                          .80488    Yes
13             60           63.00    No                            Yes                                          .20690    No
14             66           79.00    Yes                           Yes                                          .80488    Yes
Total N = 14

a. Limited to first 100 cases.

We found from the model that the only significant predictor of display rule understanding was false belief understanding. This could have a value of either 1 (pass the false belief task) or 0 (fail the false belief task). These values tell us that when a child doesn't possess second-order false belief understanding (fb = 0, No), there is a probability of .2069 that they will pass the display rule task, approximately a 21% chance (1 out of 5 children). However, if the child does pass the false belief task (fb = 1, Yes), there is a probability of .8049 that they will pass the display rule task, an 80.5% chance (4 out of 5 children). Consider that a probability of 0 indicates no chance of the child passing the display rule task, and a probability of 1 indicates that the child will definitely pass the display rule task. Therefore, the values obtained provide strong evidence for the role of false belief understanding as a prerequisite for display rule understanding. Assuming we are content that the model is accurate and that false belief understanding has some substantive significance, then we could conclude that false belief understanding is the single best predictor of display rule understanding. Furthermore, age and the interaction of age and false belief understanding do not significantly predict display rule


understanding. As a homework task, why not rerun this analysis using the forced entry method of analysis: how do your conclusions differ? This conclusion is fine in itself, but to be sure that the model is a good one, it is important to examine the residuals.

Interpreting Residuals

The main purpose of examining residuals in logistic regression is to (1) isolate points for which the model fits poorly, and (2) isolate points that exert an undue influence on the model. To assess the former we examine the residuals, especially the Studentized residual, standardized residual and deviance statistics. All of these statistics have the common property that 95% of cases in an average, normally distributed sample should have values that lie within ±1.96, and 99% of cases should have values that lie within ±2.58. Therefore, any values outside ±3 are cause for concern and any outside about ±2.5 should be examined more closely. To assess the influence of individual cases we use influence statistics such as Cook's distance (which is interpreted in the same way as for linear regression: as a measure of the change in the regression coefficient if a case is deleted from the model). Also, the value of DFBeta, which is a standardized version of Cook's statistic, tells us something of the influence of certain cases: any values greater than 1 indicate possible influential cases. Additionally, leverage statistics or hat values, which should lie between 0 (the case has no influence whatsoever) and 1 (the case exerts complete influence over the model), tell us about whether certain cases are wielding undue influence over the model. The expected value of leverage is defined as it is for linear regression.


If you request these residual statistics, SPSS saves them as new columns in the data editor. The basic residual statistics for this example (Cook's distance, leverage, standardized residuals and DFBeta values) show little cause for concern. Note that all cases have DFBetas less than 1 and leverage statistics (LEV_1) close to the calculated expected value of 0.03. There are also no unusually high values of Cook's distance (COO_1), which, all in all, means that there are no influential cases having an effect on the model. Cook's distance is an unstandardized measure and so there is no absolute value at which you can say that a case is having an influence. Instead, you should look for values of Cook's distance that are particularly high compared to the other cases in the sample; however, Stevens (2002) suggests that a value greater than 1 is problematic. About half of the leverage values are a little high, but given that the other statistics are fine this is probably no cause for concern. The standardized residuals all have values of less than ±2.5 and predominantly have values of less than ±2, and so there seems to be very little here to concern us.
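If you prefer to screen the saved diagnostics outside SPSS, here is one possible Python sketch. It assumes (purely for illustration) that the saved columns have been exported to a CSV file and that they carry the hypothetical names used below; the cut-offs are the ones described in the paragraph above:

import pandas as pd

# Hypothetical export of the SPSS-saved diagnostics (column names are assumptions).
resid = pd.read_csv("display_rule_residuals.csv")

expected_leverage = 2 / 70      # (k + 1)/n with one predictor and 70 cases, approx. 0.03

flags = pd.DataFrame({
    "big_std_resid": resid["ZRE_1"].abs() > 2.5,                      # standardized residuals beyond +/-2.5
    "big_cook": resid["COO_1"] > 1,                                   # Cook's distance above Stevens' cut-off
    "big_dfbeta": resid[["DFB0_1", "DFB1_1"]].abs().gt(1).any(axis=1),
    "high_leverage": resid["LEV_1"] > 2 * expected_leverage,          # rough rule of thumb: twice the expected value
})

print(flags.any())                   # for these data every check should come back False
print(resid[flags.any(axis=1)])      # list any cases that do trip a check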


Case Summariesa Analog of Cook's influence statistics .00932 .00932 .00932 .00932 .00932 .13690 .00932 .00932 .13690 .00932 .00932 .00606 .00932 .10312 .00932 .00932 .13690 .00932 .00606 .00932 .00932 .00606 .00606 .10312 .00932 .00932 .00932 .00932 .00606 .00932 .13690 .00932 .00606 .00606 .00606 .00606 .10312 .00606 .13690 .10312 .00606 .00606 .00606 .00606 .00606 45

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 Total

Case Number 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 N

Leverage value .03448 .03448 .03448 .03448 .03448 .03448 .03448 .03448 .03448 .03448 .03448 .02439 .03448 .02439 .03448 .03448 .03448 .03448 .02439 .03448 .03448 .02439 .02439 .02439 .03448 .03448 .03448 .03448 .02439 .03448 .03448 .03448 .02439 .02439 .02439 .02439 .02439 .02439 .03448 .02439 .02439 .02439 .02439 .02439 .02439 45

Normalized residual -.51075 -.51075 -.51075 -.51075 -.51075 1.95789 -.51075 -.51075 1.95789 -.51075 -.51075 .49237 -.51075 -2.03101 -.51075 -.51075 1.95789 -.51075 .49237 -.51075 -.51075 .49237 .49237 -2.03101 -.51075 -.51075 -.51075 -.51075 .49237 -.51075 1.95789 -.51075 .49237 .49237 .49237 .49237 -2.03101 .49237 1.95789 -2.03101 .49237 .49237 .49237 .49237 .49237 45

DFBETA for constant -.04503 -.04503 -.04503 -.04503 -.04503 .17262 -.04503 -.04503 .17262 -.04503 -.04503 .00000 -.04503 .00000 -.04503 -.04503 .17262 -.04503 .00000 -.04503 -.04503 .00000 .00000 .00000 -.04503 -.04503 -.04503 -.04503 .00000 -.04503 .17262 -.04503 .00000 .00000 .00000 .00000 .00000 .00000 .17262 .00000 .00000 .00000 .00000 .00000 .00000 45

DFBETA for FB(1) .04503 .04503 .04503 .04503 .04503 -.17262 .04503 .04503 -.17262 .04503 .04503 .03106 .04503 -.12812 .04503 .04503 -.17262 .04503 .03106 .04503 .04503 .03106 .03106 -.12812 .04503 .04503 .04503 .04503 .03106 .04503 -.17262 .04503 .03106 .03106 .03106 .03106 -.12812 .03106 -.17262 -.12812 .03106 .03106 .03106 .03106 .03106 45

a. Limited to first 100 cases.


Case Summariesa Analog of Cook's influence statistics .10312 .00606 .00932 .00932 .10312 .00606 .00606 .00606 .00606 .00606 .00606 .10312 .00606 .00606 .13690 .00606 .00606 .00606 .00606 .00606 .00606 .00606 .10312 .00606 .00606 25

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 Total

Case Number 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 N

Leverage value .02439 .02439 .03448 .03448 .02439 .02439 .02439 .02439 .02439 .02439 .02439 .02439 .02439 .02439 .03448 .02439 .02439 .02439 .02439 .02439 .02439 .02439 .02439 .02439 .02439 25

Normalized residual -2.03101 .49237 -.51075 -.51075 -2.03101 .49237 .49237 .49237 .49237 .49237 .49237 -2.03101 .49237 .49237 1.95789 .49237 .49237 .49237 .49237 .49237 .49237 .49237 -2.03101 .49237 .49237 25

DFBETA for constant .00000 .00000 -.04503 -.04503 .00000 .00000 .00000 .00000 .00000 .00000 .00000 .00000 .00000 .00000 .17262 .00000 .00000 .00000 .00000 .00000 .00000 .00000 .00000 .00000 .00000 25

DFBETA for FB(1) -.12812 .03106 .04503 .04503 -.12812 .03106 .03106 .03106 .03106 .03106 .03106 -.12812 .03106 .03106 -.17262 .03106 .03106 .03106 .03106 .03106 .03106 .03106 -.12812 .03106 .03106 25

a. Limited to first 100 cases.

You should note that these residuals are slightly unusual because they are based on a single predictor that is categorical. This is why there isn't a lot of variability in the values of the residuals. Also, if substantial outliers or influential cases had been isolated, you would not be justified in eliminating these cases to make the model fit better. Instead, these cases should be inspected closely to try to isolate a good reason why they were unusual. It might simply be an error in inputting data, or it could be that the case was one that had a special reason for being unusual: for example, the child had found it hard to pay attention to the false belief task and you had noted this at the time of the experiment. In such a case, you may have good reason to exclude the case and duly note the reasons why.
Task 2


Recent research has shown that lecturers are among the most stressed workers. A researcher wanted to know exactly what it was about being a lecturer that created this stress and subsequent burnout. She took 467 lecturers and administered several questionnaires to them that measured: Burnout (burnt out or not), Perceived Control (high score = low perceived control), Coping Style (high score = low ability to cope with stress), Stress from Teaching (high score = teaching creates a lot of stress for the person), Stress from Research (high score = research creates a lot of stress for the person) and Stress from Providing Pastoral Care (high score = providing pastoral care creates a lot of stress for the person). The outcome of interest was burnout, and Cooper's (1988) model of stress indicates that perceived control and coping style are important predictors of this variable. The remaining predictors were measured to see the unique contribution of different aspects of a lecturer's work to their burnout. Can you help her out by conducting a logistic regression to see which factors predict burnout? The data are in Burnout.sav.

The analysis should be done hierarchically because Cooper's model indicates that perceived control and coping style are important predictors of burnout. So, these variables should be entered in the first block. The second block should contain all other variables, and because we don't know anything much about their predictive ability, we should enter them in a stepwise fashion (I chose Forward: LR).

SPSS Output

Step 1:
Omnibus Tests of Model Coefficients Step 1 Step Block Model Chi-square 165.928 165.928 165.928 df 2 2 2 Sig. .000 .000 .000

Model Summary Step 1 -2 Log likelihood 364.179 Cox & Snell R Square .299 Nagelkerke R Square .441

Variables in the Equation 95.0% C.I.for EXP(B) Lower Upper 1.040 1.086 1.066 1.106

Step a 1

LOC COPE Constant

B .061 .083 -4.484

S.E. .011 .009 .379

Wald 31.316 77.950 139.668

df 1 1 1

Sig. .000 .000 .000

Exp(B) 1.063 1.086 .011

a. Variable(s) entered on step 1: LOC, COPE.

The overall fit of the model at the first step is significant, χ²(2) = 165.93, p < .001. Overall, the model accounts for 29.9–44.1% of the variance in burnout (depending on which measure of R² you use).

Step 2:

The overall fit of the model is significant both after the first new variable (teaching) has been entered, χ²(3) = 193.34, p < .001, and after the second new variable (pastoral) has been entered, χ²(4) = 205.40, p < .001. Overall, the final model accounts for 35.6–52.4% of the variance in burnout (depending on which measure of R² you use).
Omnibus Tests of Model Coefficients Step 1 Step Block Model Step Block Model Chi-square 27.409 27.409 193.337 12.060 39.470 205.397 df 1 1 3 1 2 4 Sig. .000 .000 .000 .001 .000 .000

Step 2


Model Summary Step 1 2 -2 Log likelihood 336.770 324.710 Cox & Snell R Square .339 .356 Nagelkerke R Square .500 .524

Variables in the Equation 95.0% C.I.for EXP(B) Lower Upper 1.068 1.126 1.107 1.173 .890 .952 1.081 1.110 .862 1.019 1.145 1.181 .931 1.071

Step 1a

Step 2b

LOC COPE TEACHING Constant LOC COPE TEACHING PASTORAL Constant

B .092 .131 -.083 -1.707 .107 .135 -.110 .044 -3.023

S.E. .014 .015 .017 .619 .015 .016 .020 .013 .747

Wald 46.340 76.877 23.962 7.599 52.576 75.054 31.660 11.517 16.379

df 1 1 1 1 1 1 1 1 1

Sig. .000 .000 .000 .006 .000 .000 .000 .001 .000

Exp(B) 1.097 1.139 .921 .181 1.113 1.145 .896 1.045 .049

a. Variable(s) entered on step 1: TEACHING. b. Variable(s) entered on step 2: PASTORAL.

In terms of the individual predictors we could report:

                              B (SE)            95% CI for Exp(B)
                                               Lower    Exp(B)    Upper
Step 1
   Constant                -4.48** (0.38)
   Perceived Control        0.06** (0.01)       1.04      1.06     1.09
   Coping Style             0.08** (0.01)       1.07      1.09     1.11
Final
   Constant                -3.02** (0.75)
   Perceived Control        0.11** (0.02)       1.08      1.11     1.15
   Coping Style             0.14** (0.02)       1.11      1.15     1.18
   Teaching Stress         -0.11** (0.02)       0.86      0.90     0.93
   Pastoral Stress          0.04*  (0.01)       1.02      1.05     1.07
Note: R² = .36 (Cox and Snell), .52 (Nagelkerke). Model χ²(4) = 205.40, p < .001. * p < .01, ** p < .001.

It seems as though burnout is significantly predicted by perceived control, coping style (as predicted by Cooper), stress from teaching and stress from giving pastoral care. The Exp(B) values and the direction of the beta values tell us that, for perceived control, coping ability and pastoral care, the relationships are positive. That is (and look back to the question to see the direction of these scales, i.e. what a high score represents), poor perceived control, poor ability to cope with stress and stress from giving pastoral care all predict burnout. However, for teaching the relationship is the opposite way around: stress from teaching appears to be a positive thing as it predicts not becoming burnt out!
Task 3


A health psychologist interested in research into HIV wanted to know the factors that influenced condom use with a new partner (relationship less than 1 month old). The outcome measure was whether a condom was used (Use: condom used = 1, not used = 0). The predictor variables were mainly scales from the Condom Attitude Scale (CAS) by Sacco, Levine, Reed, and Thompson (Psychological Assessment: A Journal of Consulting and Clinical Psychology, 1991). Gender (gender of the person); safety (relationship safety, measured out of 5, indicates the degree to which the person views this relationship as safe from sexually transmitted disease); sexexp (sexual experience, measured out of 10, indicates the degree to which previous experience influences attitudes towards condom use);
previous (a measure not from the CAS, this variable measures whether or not the

couple used a condom in their previous encounter, 1 = condom used, 0 = not used, 2 = no previous encounter with this partner); selfcon (self-control, measured out of 9, indicates the degree of self-control that a subject has when it comes to condom use, i.e. do they get carried away with the heat of the moment, or do they exert control?); perceive (perceived risk, measured out of 6, indicates the degree to which the person feels at risk from unprotected sex). Previous research (Sacco, Rickman, Thompson, Levine, and Reed, in Aids Education and Prevention, 1993) has shown that gender, relationship safety and perceived risk predict condom use. Carry out an appropriate analysis to verify these previous findings, and to test whether self-control, previous usage and sexual experience can predict any of the remaining variance in condom use. (1) Interpret all important parts of the SPSS output. (2) How reliable is the final model? (3) What are the probabilities that


participants 12, 53 and 75 will use a condom? (4) A female, who used a condom in her previous encounter with her new partner, scores 2 on all variables except perceived risk (for which she scores 6). Use the model to estimate the probability that she will use a condom in her next encounter.

The correct analysis was to run a hierarchical logistic regression entering perceive,
safety and gender in the first block and previous, selfcon and sexexp in a second. I used

forced entry on both blocks, but you could choose to run a forward stepwise method on block 2 (either strategy is justified). For the variable previous I used an indicator contrast with No condom as the base category. Block 0: The output of the logistic regression will be arranged in terms of the blocks that were specified. In other words, SPSS will produce a regression model for the variables specified in block 1, and then produce a second model that contains the variables from both blocks 1 and 2. The results from block 1 are shown below. In this analysis we forced SPSS to enter perceive, safety and gender into the regression model first. First, the output tells us that 100 cases have been accepted, that the dependent variable has been coded 0 and 1 (because this variable was coded as 0 and 1 in the data editor, these codings correspond exactly to the data in SPSS).
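Before walking through the SPSS output, here is one way the same hierarchical strategy could be sketched in Python with statsmodels. This is a sketch only: the file name, the lower-case column names and the default dummy coding of previous are assumptions, so the coding of the categorical predictor may not match the SPSS indicator contrast exactly:

import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

# Hypothetical export of the data to 'condom.csv' with columns
# use, perceive, safety, gender, previous, selfcon, sexexp.
data = pd.read_csv("condom.csv")

# Block 1: forced entry of perceived risk, relationship safety and gender.
block1 = smf.logit("use ~ perceive + safety + gender", data=data).fit()

# Block 2: add previous use (3-category, dummy coded), self-control and sexual experience.
block2 = smf.logit(
    "use ~ perceive + safety + gender + C(previous) + selfcon + sexexp", data=data
).fit()

# Improvement of block 2 over block 1: the difference in -2LL is chi-square distributed.
chi_sq = 2 * (block2.llf - block1.llf)
df = block2.df_model - block1.df_model
print(round(chi_sq, 2), stats.chi2.sf(chi_sq, df))   # should be close to 17.80, p = .001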
Case Processing Summary Unweighted Cases Selected Cases
a

N Included in Analysis Missing Cases Total 100 0 100 0 100

Unselected Cases Total

Percent 100.0 .0 100.0 .0 100.0

a. If weight is in effect, see classification table for the total number of cases.


Dependent Variable Encoding Original Value Unprotected Condom Used Internal Value 0 1

Categorical Variables Codings Parameter coding (1) (2) .000 .000 1.000 .000 .000 1.000

Previous Use with Partner

No Condom Condom used First Time with partner

Frequency 50 47 3

a,b Classification Table

Predicted Condom Use Condom Unprotected Used 57 0 43 0

Step 0

Observed Condom Use Overall Percentage

Unprotected Condom Used

Percentage Correct 100.0 .0 57.0

a. Constant is included in the model. b. The cut value is .500

Block 1: The next part of the output tells us about block 1: as such it provides information about the model after the variables perceive, safety and gender have been added. The first thing to note is that −2LL has dropped to 105.77, which is a change of 30.89 (the value given by the model chi-square). This value tells us about the model as a whole, whereas the block tells us how the model has improved since the last block. The change in the amount of information explained by the model is significant (χ²(3) = 30.89, p < .0001) and so using perceived risk, relationship safety and gender as predictors significantly improves our ability to predict condom use. Finally, the classification table shows us that 74% of cases can be correctly classified using these three predictors.


Omnibus Tests of Model Coefficients Chi-square 30.892 30.892 30.892 df 3 3 3 Sig. .000 .000 .000

Step 1

Step Block Model

Model Summary -2 Log likelihood 105.770 Cox & Snell R Square .266 Nagelkerke R Square .357

Step 1

a Classification Table

Predicted Condom Use Condom Used Unprotected 45 12 14 29

Step 1

Observed Condom Use Overall Percentage

Unprotected Condom Used

Percentage Correct 78.9 67.4 74.0

a. The cut value is .500

Hosmer and Lemeshow's goodness-of-fit test statistic tests the hypothesis that the observed data are significantly different from the predicted values from the model. So, in effect, we want a non-significant value for this test (because this would indicate that the model does not differ significantly from the observed data). In this case (χ²(8) = 9.70, p = 0.287) it is non-significant, which is indicative of a model that is predicting the real-world data fairly well.
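SPSS computes this statistic for you, but if you ever need it by hand, a minimal Python implementation looks like the sketch below. It assumes you have the observed outcomes and the model's fitted probabilities (for example, the saved PRE_1 variable) available as arrays; the grouping into roughly ten deciles follows the usual definition of the test:

import pandas as pd
from scipy import stats

def hosmer_lemeshow(y, p_hat, groups=10):
    """Hosmer-Lemeshow goodness-of-fit test from outcomes y (0/1) and fitted probabilities p_hat."""
    df = pd.DataFrame({"y": y, "p": p_hat})
    df["decile"] = pd.qcut(df["p"], groups, duplicates="drop")   # groups of roughly equal size
    grouped = df.groupby("decile", observed=True)
    obs = grouped["y"].sum()          # observed events per group
    exp = grouped["p"].sum()          # expected events per group
    n = grouped["y"].count()
    hl = (((obs - exp) ** 2) / (exp * (1 - exp / n))).sum()
    dof = len(obs) - 2
    return hl, stats.chi2.sf(hl, dof)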
Hosmer and Lemeshow Test Step 1 Chi-square 9.700 df 8 Sig. .287

The part of the output labelled Variables in the Equation then tells us the parameters of the model for the first block. The significance values of the Wald statistics for each predictor indicate that both perceived risk (Wald = 17.76, p < 0.0001) and relationship safety (Wald = 4.54, p < 0.05) significantly predict condom use. Gender, however, does not (Wald = 0.41, p > 0.05).

Variables in the Equation 95.0% C.I.for EXP(B) Lower Upper 1.654 3.964 .410 .963 .519 3.631

Step a 1

PERCEIVE SAFETY GENDER Constant

B .940 -.464 .317 -2.476

S.E. .223 .218 .496 .752

Wald 17.780 4.540 .407 10.851

df 1 1 1 1

Sig. .000 .033 .523 .001

Exp(B) 2.560 .629 1.373 .084

a. Variable(s) entered on step 1: PERCEIVE, SAFETY, GENDER.

The values of exp b for perceived risk (exp b = 2.56, CI0.95 = 1.65, 3.96) indicate that if the value of perceived risk goes up by 1, then the odds of using a condom also increase (because exp b is greater than 1). The confidence interval for this value ranges from 1.65 to 3.96, so we can be very confident that the value of exp b in the population lies somewhere between these two values. What's more, because both values are greater than 1 we can also be confident that the relationship between perceived risk and condom use found in this sample is true of the whole population. In short, as perceived risk increases by 1, people are just over twice as likely to use a condom. The values of exp b for relationship safety (exp b = 0.63, CI0.95 = 0.41, 0.96) indicate that if relationship safety increases by one point, then the odds of using a condom decrease (because exp b is less than 1). The confidence interval for this value ranges from 0.41 to 0.96, so we can be very confident that the value of exp b in the population lies somewhere between these two values. In addition, because both values are less than 1 we can be confident that the relationship between relationship safety and condom use found in this sample would be found in 95% of samples from the same population. In short, as relationship safety increases by one unit, subjects are about 1.6 times less likely to use a condom. The values of exp b for gender (exp b = 1.37, CI0.95 = 0.52, 3.63) indicate that as gender changes from 0 (male) to 1 (female), then the odds of using a condom increase (because

exp is greater than 1). However, the confidence interval for this value crosses 1 which limits the generalizability of our findings because the value of exp in other samples (and hence the population) could indicate either a positive (exp(B) > 1) or negative (exp(B) < 1) relationship. Therefore, gender is not a reliable predictor of condom use.

A glance at the classification plot brings not such good news because a lot of cases are clustered around the middle. This indicates that the model could be performing more accurately (i.e. the classifications made by the model are not completely reliable).

Block 2: The output below shows what happens to the model when our new predictors are added (previous use, self-control and sexual experience). This part of the output describes block 2, which is just the model described in block 1 but with the new predictors added. So, we begin with the model that we had in block 1 and we then add previous, selfcon and sexexp to it. The effect of adding these predictors to the model is to reduce the −2 log-likelihood to 87.971 (a reduction of 48.69 from the original model, as shown in the model chi-square, and an additional reduction of 17.799 beyond the reduction caused by block 1, as shown by the block statistics). This additional improvement of block 2 is significant (χ²(4) = 17.80, p < 0.01), which tells us that including these three new predictors in the model has significantly improved our ability to predict condom use. The classification table tells us that the model is now correctly classifying 78% of cases. Remember that in block 1 there were 74% correctly classified and so an extra 4% of cases are now classified (not a great deal more; in fact, examining the table shows us that only four extra cases have now been correctly classified).
Omnibus Tests of Model Coefficients Step 1 Step Block Model Chi-square 17.799 17.799 48.692 df 4 4 7 Sig. .001 .001 .000

Model Summary -2 Log likelihood 87.971 Cox & Snell R Square .385 Nagelkerke R Square .517

Step 1

Hosmer and Lemeshow Test Step 1 Chi-square 9.186 df 8 Sig. .327


a Classification Table

Predicted Condom Use Condom Unprotected Used 47 10 12 31

Step 1

Observed Condom Use Overall Percentage

Unprotected Condom Used

Percentage Correct 82.5 72.1 78.0

a. The cut value is .500

The section labelled Variables in the Equation now contains all predictors. This part of the output represents the details of the final model. The significance values of the Wald statistics for each predictor indicate that both perceived risk (Wald = 16.04, p < 0.001) and relationship safety (Wald = 4.17, p < 0.05) still significantly predict condom use and, as in block 1, gender does not (Wald = 0.00, p > 0.05). We can now look at the new predictors to see which of these has some predictive power.
Variables in the Equation 95.0% C.I.for EXP(B) Lower Upper 1.623 4.109 .389 .980 .326 3.081 .962 1.490 1.005 .063 1.104 8.747 15.287 1.815

Step a 1

PERCEIVE SAFETY GENDER SEXEXP PREVIOUS PREVIOUS(1) PREVIOUS(2) SELFCON Constant

B .949 -.482 .003 .180 1.087 -.017 .348 -4.959

S.E. .237 .236 .573 .112 .552 1.400 .127 1.146

Wald 16.038 4.176 .000 2.614 4.032 3.879 .000 7.510 18.713

df 1 1 1 1 2 1 1 1 1

Sig. .000 .041 .996 .106 .133 .049 .990 .006 .000

Exp(B) 2.583 .617 1.003 1.198 2.965 .983 1.416 .007

a. Variable(s) entered on step 1: SEXEXP, PREVIOUS, SELFCON.

Previous use has been split into two components (according to whatever contrasts were specified for this variable). Looking at the very beginning of the output we are told the parameter codings for Previous(1) and Previous(2). You can tell which groups are being compared by remembering the rule from contrast coding in ANOVA: that is, we compare groups with codes of 0 against those with codes of 1. From the output we can see that Previous(1) compares the condom used group against the other two, and Previous(2) compares the base category of first time with partner against the other two categories. Therefore we can tell that previous use is not a significant predictor of condom use when it is the first time with a partner compared to when it is not the first time (Wald = 0.00, p > 0.05). However, when we compare the condom used category to the other categories we find that using a condom on the previous occasion does predict use on the current occasion (Wald = 3.88, p < 0.05). Of the other new predictors we find that self-control predicts condom use (Wald = 7.51, p < 0.01) but sexual experience does not (Wald = 2.61, p > 0.05). The values of exp b for perceived risk (exp b = 2.58, CI0.95 = 1.62, 4.11) indicate that if the value of perceived risk goes up by 1, then the odds of using a condom also increase. What's more, because the confidence interval doesn't cross 1 we can also be confident that the relationship between perceived risk and condom use found in this sample is true of the whole population. As perceived risk increases by 1, people are just over twice as likely to use a condom. The values of exp b for relationship safety (exp b = 0.62, CI0.95 = 0.39, 0.98) indicate that if relationship safety decreases by one point, then the odds of using a condom increase. The confidence interval does not cross 1 so we can be confident that the relationship between relationship safety and condom use found in this sample would be found in 95% of samples from the same population. As relationship safety increases by one unit, subjects are about 1.6 times less likely to use a condom. The values of exp b for gender (exp b = 1.00, CI0.95 = 0.33, 3.08) indicate that as gender changes from 0 (male) to 1 (female), then the odds of using a condom do not change


(because exp b is equal to 1). The confidence interval crosses 1, therefore gender is not a reliable predictor of condom use. The values of exp b for previous use (1) (exp b = 2.97, CI0.95 = 1.01, 8.75) indicate that if the value of previous usage goes up by 1 (i.e. changes from not having used one or being the first time to having used one), then the odds of using a condom also increase. What's more, because the confidence interval doesn't cross 1 we can also be confident that this relationship is true in the whole population. If someone used a condom on their previous encounter with this partner (compared to if they didn't use one, or if it is their first time) then they are three times more likely to use a condom. For previous use (2) the value of exp b (exp b = 0.98, CI0.95 = 0.06, 15.29) indicates that if the value of previous usage goes up by 1 (i.e. changes from not having used one or having used one to being their first time with this partner), then the odds of using a condom do not change (because the value is very nearly equal to 1). What's more, because the confidence interval crosses 1 we can tell that this is not a reliable predictor of condom use. The value of exp b for self-control (exp b = 1.42, CI0.95 = 1.10, 1.82) indicates that if self-control increases by one point, then the odds of using a condom increase also. The confidence interval does not cross 1 so we can be confident that the relationship between self-control and condom use found in this sample would be found in 95% of samples from the same population. As self-control increases by one unit, subjects are about 1.4 times more likely to use a condom. The values of exp b for sexual experience (exp b = 1.20, CI0.95 = 0.95, 1.49) indicate that as sexual experience increases by one unit, then the odds of using a condom increase


slightly. However, the confidence interval crosses 1, therefore sexual experience is not a reliable predictor of condom use. A glance at the classification plot brings good news because a lot of cases that were clustered in the middle are now spread towards the edges. Therefore, overall this new model is more accurately classifying cases compared to block 1.

How reliable is the final model? Multicollinearity can affect the parameters of a regression model. Logistic regression is equally as prone to the biasing effect of collinearity and it is essential to test for collinearity following a logistic regression analysis (see the book for details of how to do this). The results of the analysis are shown below. From the first table we can see that the tolerance values for all variables are all close to 1 and are much larger than the cut-off


point of 0.1 below which Menard (1995) suggests indicates a serious collinearity problem. Myers (1990) also suggests that a VIF value greater than 10 is cause for concern and in these data the values are all less than this criterion. The output below also shows a table labelled Collinearity Diagnostics. In this table, we are given the eigenvalues of the scaled, uncentred cross-products matrix, the condition index and the variance proportions for each predictor. If any of the eigenvalues in this table are much larger than others then the uncentred cross-products matrix is said to be ill-conditioned, which means that the solutions of the regression parameters can be greatly affected by small changes in the predictors or outcome. In plain English, these values give us some idea as to how accurate our regression model is: if the eigenvalues are fairly similar then the derived model is likely to be unchanged by small changes in the measured variables. The condition indexes are another way of expressing these eigenvalues and represent the square root of the ratio of the largest eigenvalue to the eigenvalue of interest (so, for the dimension with the largest eigenvalue, the condition index will always be 1). For these data the condition indexes are all relatively similar showing that a problem is unlikely to exist.
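If you want the same tolerance and VIF statistics outside SPSS, a sketch with statsmodels is shown below. The file and column names are assumptions, and previous is treated as a single numeric column here, which mirrors how the SPSS collinearity diagnostics (run through linear regression) treat it:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical export of the predictors from the final model.
X = pd.read_csv("condom.csv")[["perceive", "safety", "gender", "previous", "selfcon", "sexexp"]]
X = sm.add_constant(X)

vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs)         # values comfortably below 10 suggest no collinearity problem (Myers, 1990)
print(1 / vifs)     # tolerances; values well above 0.1 are fine (Menard, 1995)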
Coefficients a Collinearity Statistics Tolerance VIF .849 1.178 .802 1.247 .910 1.098 .740 1.350 .796 1.256 .885 1.130 .964 1.037 .872 1.147 .929 1.076

Model 1

Perceived Risk Relationship Safety GENDER Perceived Risk Relationship Safety GENDER Previous Use with Partner Self-Control Sexual experience

a. Dependent Variable: Condom Use


a Collinearity Diagnostics

Model 1

Dimension 1 2 3 4 1 2 3 4 5 6 7

Eigenvalue 3.137 .593 .173 9.728E-02 5.170 .632 .460 .303 .235 .135 6.510E-02

Condition Index 1.000 2.300 4.260 5.679 1.000 2.860 3.352 4.129 4.686 6.198 8.911

(Constant) .01 .00 .01 .98 .00 .00 .00 .00 .00 .01 .98

Perceived Risk .02 .02 .55 .40 .01 .02 .03 .07 .04 .61 .23

Relationship Safety .02 .10 .76 .13 .01 .06 .10 .01 .34 .40 .08

Variance Proportions Previous Use with GENDER Partner .03 .55 .08 .35 .01 .01 .43 .10 .01 .80 .24 .00 .17 .05 .00 .00 .14 .03

Self-Control

Sexual experience

.01 .00 .00 .00 .50 .47 .03

.01 .02 .00 .60 .00 .06 .31

a. Dependent Variable: Condom Use

The final step in analysing this table is to look at the variance proportions. The variance of each regression coefficient can be broken down across the eigenvalues, and the variance proportions tell us the proportion of the variance of each predictor's regression coefficient that is attributed to each eigenvalue. These proportions can be converted to percentages by multiplying them by 100 (to make them more easily understood). In terms of collinearity, we are looking for predictors that have high proportions on the same small eigenvalue, because this would indicate that the variances of their regression coefficients are dependent (see Field, 2004). Again, no variables appear to have similarly high variance proportions for the same dimensions. The result of this analysis is pretty clear cut: there is no problem of collinearity in these data. Residuals should be checked for influential cases and outliers. As a brief guide, the output lists cases with standardized residuals greater than 2. In a sample of 100, we would expect around 5–10% of cases to have standardized residuals with absolute values greater than this. For these data we have only four such cases and only one of these has an absolute value greater than 3. Therefore, we can be fairly sure that there are no outliers.


Casewise List (b)

Case    Selected Status (a)    Observed Condom Use    Predicted    Predicted Group     Resid     ZResid
  41    S                      U**                        .891     C                   -.891     -2.855
  53    S                      U**                        .916     C                   -.916     -3.294
  58    S                      C**                        .142     U                    .858      2.455
  83    S                      C**                        .150     U                    .850      2.380

a. S = Selected, U = Unselected cases, and ** = Misclassified cases.
b. Cases with studentized residuals greater than 2.000 are listed.

What are the probabilities that participants 12, 53 and 75 will use a condom? The values predicted for these cases will depend on exactly how you ran the analysis (and the parameter coding used on the variable previous). Therefore, your answers might differ slightly from mine.
Case Summaries (a)

Case Number    Predicted Value    Predicted Group
         12             .49437    Unprotected
         53             .88529    Condom Used
         75             .37137    Unprotected

a. Limited to first 100 cases.
A female, who used a condom in her previous encounter with her new partner, scores 2 on all variables except perceived risk (for which she scores 6). Use the model to estimate the probability that she will use a condom in her next encounter.
Step 1: Logistic Regression Equation:

P(Y) = \frac{1}{1 + e^{-z}}, \quad \text{where } z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n


Step 2: Use the values of β from the SPSS output (final model) and the values of X for each variable (from the question) to construct the following table:

Variable          βi          Xi         βiXi
Gender          0.0027        1        0.0027
Safety         -0.4823        2       -0.9646
Sexexp          0.1804        2        0.3608
Previous (1)    1.0870        1        1.0870
Previous (2)   -0.0167        0        0
Selfcon         0.3476        2        0.6952
Perceive        0.9489        6        5.6934

Step 3: Place the values of βiXi into the equation for z (remembering to include the constant):

z = -4.6009 + 0.0027 - 0.9646 + 0.3608 + 1.0870 + 0 + 0.6952 + 5.6934 = 2.2736

Step 4: Replace this value of z into the logistic regression equation:


P(Y) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-2.2736}} = \frac{1}{1 + 0.10} = 0.9090

Therefore, there is a 91% chance that she will use a condom on her next encounter.
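The same prediction can be verified in a few lines of Python, using the coefficient values from the hand calculation above (small differences from the 0.9090 quoted are just rounding):

import numpy as np

# Coefficients as used in the hand calculation above.
b = {"constant": -4.6009, "gender": 0.0027, "safety": -0.4823, "sexexp": 0.1804,
     "previous1": 1.0870, "previous2": -0.0167, "selfcon": 0.3476, "perceive": 0.9489}

# Her scores: female (1), safety 2, sexexp 2, condom used previously (previous1 = 1,
# previous2 = 0), selfcon 2, perceive 6.
x = {"gender": 1, "safety": 2, "sexexp": 2, "previous1": 1, "previous2": 0,
     "selfcon": 2, "perceive": 6}

z = b["constant"] + sum(b[k] * x[k] for k in x)
p = 1 / (1 + np.exp(-z))
print(round(z, 4), round(p, 4))     # roughly 2.27 and 0.91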
Chapter 9 Task 1

One of my pet hates is pop psychology books. Along with banishing Freud from all bookshops, it is my vowed ambition to rid the world of these rancid putrefaction-ridden wastes of trees. Not only do they give psychology a very bad name by stating the bloody obvious and charging people for the privilege, but they are also considerably less enjoyable to look at than the trees killed to produce them (admittedly the same could be said for the turgid tripe that I produce in the name of education but lets not go there just for now!). Anyway, as part of my plan to rid the world of popular psychology I did a little experiment. I took two groups of people who were in relationships and randomly assigned them to one of two conditions. One group read the famous popular psychology book Women are from Bras and men are from Penis, whereas another group read Marie Claire. I tested only 10 people in each of these groups, and the dependent variable was an


objective measure of their happiness with their relationship after reading the book. I didn't make any specific prediction about which reading material would improve relationship happiness.

SPSS Output for the Independent t-test
Group Statistics (Relationship Happiness)

Book Read                                     N       Mean     Std. Deviation    Std. Error Mean
Women are from Bras, Men are from Penis      10    20.0000            4.10961            1.29957
Marie Claire                                 10    24.2000            4.70933            1.48922

Independent Samples Test (Relationship Happiness)

                               Levene's Test for
                              Equality of Variances              t-test for Equality of Means
                                  F        Sig.        t         df     Sig.        Mean         Std. Error    95% CI of the Difference
                                                                       (2-tailed)   Difference   Difference    Lower        Upper
Equal variances assumed        .491        .492    -2.125        18       .048       -4.2000        1.97653    -8.35253     -.04747
Equal variances not assumed                        -2.125    17.676       .048       -4.2000        1.97653    -8.35800     -.04200

Calculating the Effect Size We know the value of t and the df from the SPSS output and so we can compute r as follows:
r = \sqrt{\frac{t^2}{t^2 + df}} = \sqrt{\frac{2.125^2}{2.125^2 + 18}} = \sqrt{\frac{4.52}{22.52}} = .45
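The t-test and the effect size can be reproduced from the summary statistics alone, for example with scipy (a sketch, using the means, standard deviations and sample sizes from the output above):

import numpy as np
from scipy import stats

# Summary statistics from the Group Statistics table above.
t_res = stats.ttest_ind_from_stats(mean1=20.0, std1=4.10961, nobs1=10,
                                   mean2=24.2, std2=4.70933, nobs2=10)
t, df = t_res.statistic, 18          # equal-variance t-test, df = n1 + n2 - 2

r = np.sqrt(t ** 2 / (t ** 2 + df))  # effect size from t and df
print(round(t, 3), round(t_res.pvalue, 3), round(r, 2))   # -2.125, 0.048, 0.45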
If you think back to our benchmarks for effect sizes this represents a fairly large effect (it is just below 0.5, the threshold for a large effect). Therefore, as well as being statistically significant, this effect is large and so represents a substantive finding.

Reporting the Results


When you report any statistical test you usually state the finding to which the test relates, and then in brackets report the test statistic (usually with its degrees of freedom), the probability value of that test statistic, and more recently the American Psychological Association is, quite rightly, requesting an estimate of the effect size. To get you into good habits early, we'll start thinking about effect sizes now, before you get too fixated on Fisher's magic 0.05. In this example we know that the value of t was −2.12, that the degrees of freedom on which this was based were 18, and that it was significant at p = 0.048. This can all be obtained from the SPSS output. We can also see the means for each group. Based on what we learnt about reporting means, we could now write something like: On average, the reported relationship happiness after reading Marie Claire (M = 24.20, SE = 1.49) was significantly higher than after reading Women are from Bras and men are from Penis (M = 20.00, SE = 1.30) (t(18) = −2.12, p < .05, r = .45).
Task 2

Imagine Twaddle and Sons, the publishers of Women are from Bras men are from Penis, were upset about my claims that their book was about as useful as a paper umbrella. They decided to take me to task and designed their own experiment in which participants read their book and one of my books, this book (Field and Hole), at different times. Relationship happiness was measured after reading each book. To maximize their chances of finding a difference they used a sample of 500 participants, but got each participant to take part in both conditions (they read


both books). The order in which books were read was counterbalanced and there was a delay of six months between reading the books. They predicted that reading their wonderful contribution to popular psychology would lead to greater relationship happiness than reading some dull and tedious book about experiments. The data are in Field&Hole.sav. Analyse them using the appropriate t-test.

SPSS Output
Paired Samples Statistics

Pair 1                                         Mean      N     Std. Deviation    Std. Error Mean
Women are from Bras, Men are from Penis     20.0180    500            9.98123             .44637
Field & Hole                                18.4900    500            8.99153             .40211

Paired Samples Correlations

Pair 1                                                       N     Correlation     Sig.
Women are from Bras, Men are from Penis & Field & Hole     500            .117     .009

Paired Samples Test (Pair 1: Women are from Bras, Men are from Penis - Field & Hole)

Paired Differences                                                                          t       df    Sig. (2-tailed)
  Mean      Std. Deviation    Std. Error Mean    95% CI Lower    95% CI Upper
1.5280           12.62807              .56474           .4184          2.6376          2.706      499               .007

Calculating the Effect Size We know the value of t and the df from the SPSS output and so we can compute r as follows:


r = \sqrt{\frac{t^2}{t^2 + df}} = \sqrt{\frac{2.706^2}{2.706^2 + 499}} = \sqrt{\frac{7.32}{506.32}} = .12
If you think back to our benchmarks for effect sizes this represents a small effect (it is just above 0.1, the threshold for a small effect). Therefore, although this effect is highly statistically significant, the size of the effect is very small and so represents a trivial finding.

Interpreting and Writing the Results

In this example, it would be tempting for Twaddle and Sons to conclude that their book produced significantly greater relationship happiness than our book. In fact, many researchers would write conclusions like this: The results show that reading Women are from Bras, men are from Penis produces significantly greater relationship happiness than that book by smelly old Field and Hole. This result is highly significant. However, to reach such a conclusion is to confuse statistical significance with the importance of the effect. By calculating the effect size we've discovered that although the difference in happiness after reading the two books is statistically very different, the size of effect that this represents is very small indeed. So, the effect is actually not very significant in real terms. A more correct interpretation might be to say: The results show that reading Women are from Bras, men are from Penis produces significantly greater relationship happiness than that book by smelly

101

old Field and Hole. However, the effect size was small, revealing that this finding was not substantial in real terms. Of course, this latter interpretation would be unpopular with Twaddle and Sons who would like to believe that their book had a huge effect on relationship happiness.
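If you wanted to reproduce a paired analysis like this outside SPSS, a minimal sketch in Python using scipy (the happiness scores below are made up purely for illustration) would be:

```python
import numpy as np
from scipy import stats

# Hypothetical happiness scores for the same participants after each book
happiness_twaddle = np.array([20, 18, 23, 19, 21, 17, 22, 20, 24, 18])
happiness_field_hole = np.array([18, 17, 21, 18, 20, 16, 21, 19, 22, 17])

t_stat, p_value = stats.ttest_rel(happiness_twaddle, happiness_field_hole)
df = len(happiness_twaddle) - 1
r = np.sqrt(t_stat**2 / (t_stat**2 + df))
print(f"t({df}) = {t_stat:.2f}, p = {p_value:.3f}, r = {r:.2f}")
```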
Chapter 10 Task 1

Imagine that I was interested in how different teaching methods affected students knowledge. I noticed that some lecturers were aloof and arrogant in their teaching style and humiliated anyone who asked them a question, while others were encouraging and supportive of questions and comments. I took three statistics courses where I taught the same material. For one group of students I wandered around with a large cane and beat anyone who asked daft questions or got questions wrong (punish). In the second group I used my normal teaching style which is to encourage students to discuss things that they find difficult and to give anyone working hard a nice sweet (reward). The final group I remained indifferent to and neither punished nor rewarded their efforts (indifferent). As the dependent measure I took the students exam marks (percentage). Based on theories of operant conditioning, we expect punishment to be a very unsuccessful way of reinforcing learning, but we expect reward to be very successful. Therefore, one prediction is that reward will produce the best learning. A second hypothesis is that punishment should actually retard learning such that it is worse than an indifferent approach to learning. The data are in the file Teach.sav. Carry


out a one-way ANOVA and use planned comparisons to test the hypotheses that: (1) reward results in better exam results than either punishment or indifference; and (2) indifference will lead to significantly better exam results than punishment.

SPSS Output
Descriptives: Exam Mark
              N    Mean      Std. Deviation   Std. Error   95% CI for Mean        Minimum   Maximum
Punish        10   50.0000   4.13656          1.30809      [47.0409, 52.9591]     45.00     57.00
Indifferent   10   56.0000   7.10243          2.24598      [50.9192, 61.0808]     46.00     67.00
Reward        10   65.4000   4.29987          1.35974      [62.3241, 68.4759]     58.00     71.00
Total         30   57.1333   8.26181          1.50839      [54.0483, 60.2183]     45.00     71.00

This output shows the table of descriptive statistics from the one-way ANOVA; we're told the means, standard deviations and standard errors of the means for each experimental condition. The means should correspond to those plotted in the graph. These diagnostics are important for interpretation later on. It looks as though marks are highest after reward and lowest after punishment.
Test of Homogeneity of Variances (Exam Mark): Levene statistic = 2.569, df1 = 2, df2 = 27, Sig. = .095

The next part of the output reports a test of the assumption of homogeneity of variance (Levene's test). For these data, the assumption of homogeneity of variance has been met, because our significance is 0.095, which is bigger than the criterion of 0.05.
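For reference, Levene's test can also be run outside SPSS. A minimal sketch with scipy (the exam marks below are made up for illustration); note that center='mean' requests the classic mean-centred Levene statistic that SPSS reports, whereas scipy's default centres on the median:

```python
from scipy import stats

# Made-up exam marks for three teaching-style groups
punish      = [45, 48, 50, 52, 47, 51, 49, 53, 55, 50]
indifferent = [46, 52, 58, 61, 50, 57, 63, 54, 60, 59]
reward      = [58, 62, 66, 70, 64, 67, 71, 63, 65, 68]

# center='mean' gives the classic Levene statistic that SPSS reports
stat, p = stats.levene(punish, indifferent, reward, center='mean')
print(f"Levene F = {stat:.3f}, p = {p:.3f}")
```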
ANOVA: Exam Mark
                 Sum of Squares   df   Mean Square   F        Sig.
Between Groups   1205.067          2   602.533       21.008   .000
Within Groups     774.400         27    28.681
Total            1979.467         29

The main ANOVA summary table shows us that because the observed significance value is less than 0.05 we can say that there was a significant effect of teaching style on exam marks. However, at this stage we still do not know exactly what the effect of the teaching style was (we dont know which groups differed).
Robust Tests of Equality of Means: Exam Mark
                 Statistic(a)   df1   df2      Sig.
Welch            32.235          2    17.336   .000
Brown-Forsythe   21.008          2    20.959   .000
a. Asymptotically F distributed.

This table shows the Welch and BrownForsythe Fs, but we can ignore these because the homogeneity of variance assumption was met.
Contrast Coefficients (Type of Teaching Method)
             Punish   Indifferent   Reward
Contrast 1      1          1          -2
Contrast 2      1         -1           0

Because there were specific hypotheses I specified some contrasts. This table shows the codes I used. The first contrast compares reward (coded with -2) against punishment and indifference (both coded with 1). The second contrast compares punishment (coded with 1) against indifference (coded with -1). Note that the codes for each contrast sum to zero, and that in contrast 2 reward has been coded with a 0 because it is excluded from that contrast.
Contrast Tests: Exam Mark
                                   Contrast   Value of Contrast   Std. Error   t        df       Sig. (2-tailed)
Assume equal variances                1         -24.8000          4.14836      -5.978   27       .000
                                      2          -6.0000          2.39506      -2.505   27       .019
Does not assume equal variances       1         -24.8000          3.76180      -6.593   21.696   .000
                                      2          -6.0000          2.59915      -2.308   14.476   .036

This table shows the significance of the two contrasts specified above. Because homogeneity of variance was met, we can ignore the part of the table labelled Does not assume equal variances. The t-test for the first contrast tells us that reward was significantly different from punishment and indifference (it's significantly different because the value in the column labelled Sig. is less than 0.05). Looking at the means, this tells us that the average mark after reward was significantly higher than the average mark for punishment and indifference combined. The second contrast (and the descriptive statistics) tells us that the marks after punishment were significantly lower than after indifference (again, it's significantly different because the value in the column labelled Sig. is less than 0.05). As such we could conclude that reward produces significantly better exam grades than punishment and indifference, and that punishment produces significantly worse exam marks than indifference. So lecturers should reward their students, not punish them!

Calculating the Effect Size

The output provides us with the model mean square (MSM), the residual mean square (MSR) and the total sample size, which we can use to calculate omega squared (ω²):

$$\omega^2 = \frac{MS_M - MS_R}{MS_M + (n-1)\,MS_R} = \frac{602.533 - 28.681}{602.533 + (30-1)\times 28.681} = \frac{573.852}{1434.282} = .40, \qquad \omega = .63$$
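The same calculation is easy to script; a minimal sketch in Python that just re-uses the mean squares from the output above:

```python
import math

def omega_squared(ms_model, ms_resid, n):
    """Omega squared from mean squares, using the formula above (n = total sample size)."""
    return (ms_model - ms_resid) / (ms_model + (n - 1) * ms_resid)

w2 = omega_squared(602.533, 28.681, 30)
print(round(w2, 2), round(math.sqrt(w2), 2))  # 0.4 and 0.63
```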
For the contrasts the effect sizes will be:

$$r_{contrast} = \sqrt{\frac{t^2}{t^2 + df}}$$

$$r_{contrast_1} = \sqrt{\frac{5.978^2}{5.978^2 + 27}} = 0.75$$

If you think back to our benchmarks for effect sizes this represents a huge effect (it is well above 0.5, the threshold for a large effect). Therefore, as well as being statistically significant, this effect is large and so represents a substantive finding. For contrast 2 we get:
$$r_{contrast_2} = \sqrt{\frac{2.505^2}{2.505^2 + 27}} = 0.43$$

This too is a substantive finding and represents a medium to large effect size.

Interpreting and Writing the Result

The correct way to report the main finding would be:

All significant values are reported at p < .05. There was a significant effect of teaching style on exam marks, F(2, 27) = 21.01, ω² = .40. Planned contrasts revealed that reward produced significantly better exam grades than punishment and indifference, t(27) = 5.98, r = .75, and that punishment produced significantly worse exam marks than indifference, t(27) = 2.51, r = .43.
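If you want to verify a planned contrast by hand, the contrast value, its standard error and t can be rebuilt from the group means, the contrast weights and the within-groups mean square. A small sketch in Python using the values from the output above (equal group sizes of 10 assumed):

```python
import math

means   = {'punish': 50.0, 'indifferent': 56.0, 'reward': 65.4}
weights = {'punish': 1, 'indifferent': 1, 'reward': -2}   # contrast 1: reward vs the rest
ms_within, n_per_group, df_error = 28.681, 10, 27

contrast_value = sum(weights[g] * means[g] for g in means)
se = math.sqrt(ms_within * sum(w**2 for w in weights.values()) / n_per_group)
t = contrast_value / se
r = math.sqrt(t**2 / (t**2 + df_error))
print(f"contrast = {contrast_value:.1f}, t({df_error}) = {t:.2f}, r = {r:.2f}")
```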
Task 2

In Chapter 15 there are some data looking at whether eating soya meals reduces your sperm count. Have a look at this section, access the data for that example, but analyse them with ANOVA. Whats the difference between what you find and what is found in section 15.5.4? Why do you think this difference has arisen?

SPSS Output

Descriptives: Sperm Count (Millions)
                        N    Mean     Std. Deviation   Std. Error   95% CI for Mean      Minimum   Maximum
No Soya Meals           20   4.9868   5.08437          1.13690      [2.6072, 7.3663]     .35       21.08
1 Soya Meal Per Week    20   4.6052   4.67263          1.04483      [2.4184, 6.7921]     .33       18.47
4 Soya Meals Per Week   20   4.1101   4.40991           .98609      [2.0462, 6.1740]     .40       18.21
7 Soya Meals Per Week   20   1.6530   1.10865           .24790      [1.1341, 2.1719]     .31        4.11
Total                   80   3.8388   4.26048           .47634      [2.8906, 4.7869]     .31       21.08

This output shows the table of descriptive statistics from the one-way ANOVA. It looks as though, as soya intake increases, sperm counts do indeed decrease.
Test of Homogeneity of Variances (Sperm Count, Millions): Levene statistic = 5.117, df1 = 3, df2 = 76, Sig. = .003

The next part of the output reports a test of the assumption of homogeneity of variance (Levene's test). For these data, the assumption of homogeneity of variance has been broken, because our significance is 0.003, which is smaller than the criterion of 0.05. In fact, these data also violate the assumption of normality (see the chapter on non-parametric statistics).
ANOVA: Sperm Count (Millions)
                 Sum of Squares   df   Mean Square   F       Sig.
Between Groups    135.130          3   45.043        2.636   .056
Within Groups    1298.853         76   17.090
Total            1433.983         79

The main ANOVA summary table shows us that because the observed significance value is greater than 0.05 we can say that there was no significant effect of soya intake on mens sperm count. This is strange because if you read the chapter on non-parametric statistics from where this example came, the KruskalWallis test produced a significant result! The reason for this difference is that the data violate the assumptions of normality and homogeneity of variance. As I mention in the chapter on non-parametric statistics, although parametric tests have more power to detect effects when their assumptions are met, when their assumptions are violated non-parametric tests have more power! This example was arranged to prove this point: because the parametric assumptions are violated, the non-parametric tests produced a significant result and the parametric test did not because, in these circumstances, the non-parametric test has the greater power!
Robust Tests of Equality of Means: Sperm Count (Millions)
                 Statistic(a)   df1   df2      Sig.
Welch            6.284           3    34.657   .002
Brown-Forsythe   2.636           3    58.236   .058
a. Asymptotically F distributed.

This table shows the Welch and Brown-Forsythe Fs; note that the Welch test agrees with the non-parametric test in that the significance of F is below the 0.05 threshold. However, the Brown-Forsythe F is non-significant (it is just above the threshold). This illustrates the relative superiority of the Welch procedure. However, in these circumstances, because normality and homogeneity of variance have been violated we'd use a non-parametric test anyway!
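If you want to see this power reversal for yourself outside SPSS, a minimal sketch in Python (scipy; the skewed data below are made up for illustration, so the exact p-values will differ from the soya example) is:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Made-up, heavily skewed sperm-count-like data for four soya-intake groups
groups = [rng.lognormal(mean=m, sigma=1.0, size=20) for m in (1.5, 1.4, 1.2, 0.2)]

f_stat, p_anova = stats.f_oneway(*groups)
h_stat, p_kw = stats.kruskal(*groups)
print(f"ANOVA:          F = {f_stat:.2f}, p = {p_anova:.3f}")
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.3f}")
```

With data this skewed the two tests can easily disagree, which is exactly the point being made above.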
Task 3

Students (and lecturers for that matter) love their mobile phones, which is rather worrying given some recent controversy about links between mobile phone use and brain tumours. The basic idea is that mobile phones emit microwaves, and so holding one next to your brain for large parts of the day is a bit like sticking your brain in a microwave oven and selecting the cook until well done button. If we wanted to test this experimentally, we could get six groups of people and strap a mobile phone on their heads (that they cant remove). Then, by remote control, we turn the phones on for a certain amount of time each day. After six months, we measure the size of any tumour (in mm3) close to the site of the phone antennae (just behind the ear). The six groups experienced 0, 1, 2, 3, 4 or 5 hours per day of phone microwaves for six months. The data are in Tumour.sav. (From Field & Hole, 2003, so there is a very detailed answer in there.)

SPSS Output


The error bar chart of the mobile phone data shows the mean size of brain tumour in each condition, and the funny I shapes show the confidence interval of these means. Note that in the control group (0 hours), the mean size of the tumour is virtually zero (we wouldnt actually expect them to have a tumour) and the error bar shows that there was very little variance across samples. Well see later that this is problematic for the analysis.

Descriptives: Size of Tumour (mm³) by Mobile Phone Use (Hours Per Day)
         N     Mean     Std. Deviation   Std. Error   95% CI for Mean      Minimum   Maximum
0        20     .0175    .01213          .00271       [.0119, .0232]       .00        .04
1        20     .5149    .28419          .06355       [.3819, .6479]       .00        .94
2        20    1.2614    .49218          .11005       [1.0310, 1.4917]     .48       2.34
3        20    3.0216    .76556          .17118       [2.6633, 3.3799]     1.77      4.31
4        20    4.8878    .69625          .15569       [4.5619, 5.2137]     3.04      6.05
5        20    4.7306    .78163          .17478       [4.3648, 5.0964]     2.70      6.14
Total    120   2.4056    2.02662         .18500       [2.0393, 2.7720]     .00       6.14

This output shows the table of descriptive statistics from the one-way ANOVA; we're told the means, standard deviations and standard errors of the means for each experimental condition. The means should correspond to those plotted in the graph. These diagnostics are important for interpretation later on.

Test of Homogeneity of Variances (Size of Tumour, mm³): Levene statistic = 10.245, df1 = 5, df2 = 114, Sig. = .000

The next part of the output reports a test of this assumption, Levenes test. For these data, the assumption of homogeneity of variance has been violated, because our significance is 0.000, which is considerably smaller than the criterion of 0.05. In these situations, we have to try to correct the problem and we can either transform the data or choose the Welch F.
ANOVA: Size of Tumour (mm³)
                 Sum of Squares   df    Mean Square   F         Sig.
Between Groups   450.664            5   90.133        269.733   .000
Within Groups     38.094          114     .334
Total            488.758          119

The main ANOVA summary table shows us that because the observed significance value is less than 0.05 we can say that there was a significant effect of mobile phones on the size of tumour. However, at this stage we still do not know exactly what the effect of the phones was (we dont know which groups differed).
Robust Tests of Equality of Means: Size of Tumour (mm³)
                 Statistic(a)   df1   df2      Sig.
Welch            414.926         5    44.390   .000
Brown-Forsythe   269.733         5    75.104   .000
a. Asymptotically F distributed.

This table shows the Welch and BrownForsythe Fs, which are useful because homogeneity of variance was violated. Luckily our conclusions remain the same: both Fs have significance values less than 0.05.

Multiple Comparisons (Games-Howell), Dependent Variable: Size of Tumour (mm³)
(Std. Error = .18280 for every comparison; SPSS lists each comparison twice, once in each direction, so only one direction is shown here.)

Comparison (I vs J)   Mean Difference (I-J)   Sig.   95% Confidence Interval
0 vs 1                  -.4973*               .000   [-.6982, -.2964]
0 vs 2                 -1.2438*               .000   [-1.5916, -.8960]
0 vs 3                 -3.0040*               .000   [-3.5450, -2.4631]
0 vs 4                 -4.8702*               .000   [-5.3622, -4.3783]
0 vs 5                 -4.7130*               .000   [-5.2653, -4.1608]
1 vs 2                  -.7465*               .000   [-1.1327, -.3603]
1 vs 3                 -2.5067*               .000   [-3.0710, -1.9424]
1 vs 4                 -4.3729*               .000   [-4.8909, -3.8549]
1 vs 5                 -4.2157*               .000   [-4.7908, -3.6406]
2 vs 3                 -1.7602*               .000   [-2.3762, -1.1443]
2 vs 4                 -3.6264*               .000   [-4.2017, -3.0512]
2 vs 5                 -3.4692*               .000   [-4.0949, -2.8436]
3 vs 4                 -1.8662*               .000   [-2.5607, -1.1717]
3 vs 5                 -1.7090*               .000   [-2.4429, -.9751]
4 vs 5                   .1572                .984   [-.5455, .8599]
*. The mean difference is significant at the .05 level.

Because there were no specific hypotheses I just carried out post hoc tests and stuck to my favourite Games-Howell procedure (because variances were unequal). It is clear from the table that each group of participants is compared to all of the remaining groups. First, the control group (0 hours) is compared to the 1, 2, 3, 4 and 5 hour groups and reveals a significant difference in all cases (all the values in the column labelled Sig. are less than 0.05). In the next part of the table, the 1 hour group is compared to all other groups. Again all comparisons are significant (all the values in the column labelled Sig. are less than 0.05). In fact, all of the comparisons appear to be highly significant except the comparison between the 4 and 5 hour groups, which is non-significant because the value in the column labelled Sig. is bigger than 0.05.

Calculating the Effect Size

The output provides us with the model mean square (MSM), the residual mean square (MSR) and the total sample size, which we can use to calculate omega squared (ω²):

$$\omega^2 = \frac{MS_M - MS_R}{MS_M + (n-1)\,MS_R} = \frac{90.133 - 0.334}{90.133 + (120-1)\times 0.334} = \frac{89.799}{129.879} = .69, \qquad \omega = .83$$
Interpreting and Writing the Result

We could report the main finding as: Levene's test indicated that the assumption of homogeneity of variance had been violated (F(5, 114) = 10.25, p < .001). Transforming the data did not rectify this problem and so F-tests are reported nevertheless. The results show that using a mobile phone significantly affected the size of brain tumour found in participants (F(5, 114) = 269.73, p < .001, ω² = .69). The effect size indicated that the effect of phone use on tumour size was substantial. The next thing that needs to be reported is the post hoc comparisons. It is customary just to summarize these tests in very general terms like this: Games-Howell post hoc tests revealed significant differences between all groups (p < .001 for all tests) except between 4 and 5 hours (ns).

If you do want to report the results for each post hoc test individually, then at least include the 95% confidence intervals for the test as these tell us more than just the significance value. In this example, though, when there are many tests it might be as well to summarize these confidence intervals as a table (see below):

Mobile Phone Use (Hours Per Day)   Sig.     95% CI Lower Bound   95% CI Upper Bound
0 vs 1                             < .001   -.6982               -.2964
0 vs 2                             < .001   -1.5916              -.8960
0 vs 3                             < .001   -3.5450              -2.4631
0 vs 4                             < .001   -5.3622              -4.3783
0 vs 5                             < .001   -5.2653              -4.1608
1 vs 2                             < .001   -1.1327              -.3603
1 vs 3                             < .001   -3.0710              -1.9424
1 vs 4                             < .001   -4.8909              -3.8549
1 vs 5                             < .001   -4.7908              -3.6406
2 vs 3                             < .001   -2.3762              -1.1443
2 vs 4                             < .001   -4.2017              -3.0512
2 vs 5                             < .001   -4.0949              -2.8436
3 vs 4                             < .001   -2.5607              -1.1717
3 vs 5                             < .001   -2.4429              -.9751
4 vs 5                             = .984   -.5455               .8599

Task 4

Using the Glastonbury data (GlastonburyFestival.sav), carry out a one-way ANOVA on the data to see if the change in hygiene (change) is significant across people with different musical tastes (music). Do a simple contrast to compare each group against No Affiliation. Compare the results to those described in section 7.11.

SPSS Output:

Levene's test is non-significant, showing that variances were roughly equal, F(3, 119) = 0.87, p > .05, across crusties, metallers, indie kids and people with no affiliation.

The above is the main ANOVA table. We could say that the change in hygiene scores was significantly different across the different musical groups, F(3, 119) = 3.27, p < .05. Compare this table to the one in section 7.11, in which we analysed these data as a regression:


It's exactly the same! This should, I hope, re-emphasize to you that regression and ANOVA are the same analytic system!
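To convince yourself of this equivalence outside SPSS, a small sketch in Python (statsmodels and scipy, with made-up data standing in for the Glastonbury file) fits the same model as a regression and as a one-way ANOVA and recovers the same F:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(1)
df = pd.DataFrame({
    'music': np.repeat(['crusty', 'metaller', 'indie', 'none'], 30),
    'change': np.concatenate([rng.normal(m, 0.7, 30) for m in (-0.9, -0.5, -0.6, -0.4)]),
})

# ANOVA expressed as a regression with dummy-coded group predictors
model = smf.ols('change ~ C(music)', data=df).fit()
print(f"Regression: F({model.df_model:.0f}, {model.df_resid:.0f}) = {model.fvalue:.2f}")

# The classic one-way ANOVA gives the same F
groups = [g['change'].values for _, g in df.groupby('music')]
print("One-way ANOVA: F = %.2f" % stats.f_oneway(*groups)[0])
```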
Task 5

Labcoat Leni's Real Research 15.2 describes an experiment on quails with fetishes for terrycloth objects (really, it does). In this example, you are asked to analyse two of the variables that the researchers measured with a Kruskal-Wallis test. However, there were two other outcome variables (time spent near the terrycloth object and copulatory efficiency). These data can be analysed with one-way ANOVA. Read Labcoat Leni's Real Research 15.2 to get the full story, then carry out two one-way ANOVAs and Bonferroni post hoc tests on the aforementioned outcome variables.

Let's begin by using the Chart Builder to do some error bar charts:

To conduct a one-way ANOVA we have to access the main dialog box for the one-way ANOVA procedure. This dialog box has a space in which you can list one or more dependent variables and a second space to specify a grouping variable, or factor. For these data we need to select Duration and Efficiency from the variables list and drag them to the box labelled Dependent List, then select the grouping variable Group and drag it to the box labelled Factor.

You were asked to do post hoc tests so we can skip the contrast options. Click on the post hoc button in the main dialog box to access the post hoc tests dialog box. You were asked to do a Bonferroni post hoc test so select this, but let's also select Games-Howell in case of problems with homogeneity of variance (which of course we would have checked before running this main analysis!). Then return to the main dialog box.

In the options dialog box, select the homogeneity of variance test and also the Brown-Forsythe F and Welch F. Return to the main dialog box and then click on OK to run the analysis.

The output should look like this:

This tells us that the homogeneity of variance assumption is met for both outcome variables. This means that we can ignore (just as the authors did) the corrected Fs and GamesHowell post hoc tests. Instead we can look at the normal Fs and Bonferroni post hoc tests (which is what the authors of this paper reported).


This table tells us that the group (fetishistic, non-fetishistic or control group) had a significant effect on the time spent near the terrycloth object, and the copulatory efficiency. To find out exactly whats going on we can look at our post hoc tests:

The authors reported as follows: A one-way ANOVA indicated significant group differences, F(2, 56) = 91.38, p < .05, = 0.76. Subsequent pairwise comparisons (with the Bonferroni correction) revealed that fetishistic male quail stayed near the CS longer than both the nonfetishistic male quail (mean difference = 10.59 s; 95% CI = 4.16, 17.02; p < .05) and the control male quail (mean difference = 29.74 s; 95% CI = 24.12, 35.35; p < .05). In addition, the nonfetishistic male quail spent more time near the CS than did the control male quail (mean difference = 19.15 s; 95% CI = 13.30, 24.99; p < .05). (pp.429430) Note that the CS is the terrycloth object. Look at the graph, the ANOVA table and the post hoc tests to see from where the values that they report come. For the copulatory efficiency outcome the authors reported as follows:


A one-way ANOVA yielded a significant main effect of groups, F(2, 56) = 6.04, p < .05, = 0.18. Paired comparisons (with the Bonferroni correction) indicated that the nonfetishistic male quail copulated with the live female quail (US) more efficiently than both the fetishistic male quail (mean difference = 6.61; 95% CI = 1.41, 11.82; p < .05) and the control male quail (mean difference = 5.83; 95% CI = 1.11, 10.56; p < .05). The difference between the efficiency scores of the fetishistic and the control male quail was not significant (mean difference = 0.78; 95% CI = 5.33, 3.77; p > .05). (p. 430) These results show that male quails do show fetishistic behaviour (the time spent with the terrycloth) and that this affects their copulatory efficiency (they are less efficient than those that dont develop a fetish, but its worth remembering that they are no worse than quails that had no sexual conditioning the controls). If you look at Labcoat Lenis box then youll also see that this fetishistic behaviour may have evolved because the quails with fetishistic behaviour manage to fertilize a greater percentage of eggs (so their genes are passed on!).
Chapter 11 Task 1

Stalking is a very disruptive and upsetting (for the person being stalked) experience in which someone (the stalker) constantly harasses or obsesses about another person. It can take many forms, from sending intensely disturbing letters threatening to boil your cat if you dont reciprocate the stalkers undeniable love for you, to literally following you around your local area in a desperate attempt to see which CD you buy on a Saturday (as if it would be anything other than


Fugazi!). A psychologist, whod had enough of being stalked by people, decided to try two different therapies on different groups of stalkers (25 stalkers in each groupthis variable is called Group). The first group of stalkers he gave what he termed cruel to be kind therapy. This therapy was based on punishment for stalking behaviours; in short, every time the stalker followed him around, or sent him a letter, the psychologist attacked them with a cattle prod until they stopped their stalking behaviour. It was hoped that the stalker would learn an aversive reaction to anything resembling stalking. The second therapy was

psychodyshamic therapy, which was a recent development on Freuds psychodynamic therapy that acknowledges what a sham this kind of treatment is (so, you could say its based on Fraudian theory!). The stalkers were hypnotized and regressed into their childhood, the therapist would also discuss their penis (unless it was a woman in which case they discussed their lack of penis), the penis of their father, their dogs penis, the penis of the cat down the road, and anyone elses penis that sprang to mind. At the end of therapy, the psychologist measured the number of hours in the week that the stalker spent stalking their prey (this variable is called stalk2). Now, the therapist believed that the success of therapy might well depend on how bad the problem was to begin with, so before therapy the therapist measured the number of hours that the patient spent stalking as an indicator of how much of a stalker the person was (this variable is called stalk1). The data are in the file Stalker.sav. Analyse the effect of therapy on stalking behaviour after therapy, controlling for the amount of stalking behaviour before therapy.


SPSS Output
Tests of Between-Subjects Effects (Dependent Variable: Time Spent Stalking After Therapy, hours per week)
Source            Type III Sum of Squares   df   Mean Square   F         Sig.
Corrected Model       591.680(a)             1      591.680      3.331   .074
Intercept          170528.000                1   170528.000    960.009   .000
THERAPY               591.680                1      591.680      3.331   .074
Error                8526.320               48      177.632
Total              179646.000               50
Corrected Total      9118.000               49
a. R Squared = .065 (Adjusted R Squared = .045)

This output shows the ANOVA table when the covariate is not included. It is clear from the significance value that there is no difference in the hours spent stalking after therapy for the two therapy groups (p is 0.074, which is greater than 0.05). You should note that the total amount of variation to be explained (SST) was 9118, of which the experimental manipulation accounted for 591.68 units (SSM), while 8526.32 were unexplained (SSR).

[Bar chart: mean hours spent stalking after therapy for the cruel-to-be-kind and psychodyshamic therapy groups, showing both the unadjusted and the covariate-adjusted means.]

This bar chart shows the mean number of hours spent stalking after therapy. The normal means are shown as well as the same means when the data are adjusted for the effect of the covariate. In this case the adjusted and unadjusted means are relatively similar.

Descriptive Statistics (Dependent Variable: Time Spent Stalking After Therapy, hours per week)
Group                      Mean      Std. Deviation   N
Cruel to be Kind Therapy   54.9600   16.33116         25
Psychodyshamic Therapy     61.8400    9.41046         25
Total                      58.4000   13.64117         50

This table shows the unadjusted means (i.e. the normal means if we ignore the effect of the covariate). These are the same values plotted on the left-hand side of the bar chart. These results show that the time spent stalking after therapy was less after cruel to be kind therapy. However, we know from our initial ANOVA that this difference is nonsignificant. So, what now happens when we consider the effect of the covariate (in this case the extent of the stalkers problem before therapy)?
Levene's Test of Equality of Error Variances(a) (Dependent Variable: Time Spent Stalking After Therapy, hours per week)
F = 7.189, df1 = 1, df2 = 48, Sig. = .010
Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a. Design: Intercept + STALK1 + GROUP

This table shows the results of Levenes test, which is significant because the significance value is 0.01 (less than 0.05). This finding tells us that the variances across groups are different and the assumption has been broken.
Tests of Between-Subjects Effects (Dependent Variable: Time Spent Stalking After Therapy, hours per week)
Source                                 Type III Sum of Squares   df   Mean Square   F        Sig.
Corrected Model                            5006.278(a)            2     2503.139    28.613   .000
Intercept                                     0.086               1        0.086      .001   .975
Hours Spent Stalking Before Therapy        4414.598               1     4414.598    50.462   .000
THERAPY                                     480.265               1      480.265     5.490   .023
Error                                      4111.722              47       87.483
Total                                    179646.000              50
Corrected Total                            9118.000              49
a. R Squared = .549 (Adjusted R Squared = .530)

This table shows the ANCOVA. Looking first at the significance values, it is clear that the covariate significantly predicts the dependent variable, so the hours spent stalking after therapy depend on the extent of the initial problem (i.e. the hours spent stalking before therapy). More interesting is that when the effect of initial stalking behaviour is removed, the effect of therapy becomes significant (p has gone down from 0.074 to 0.023, which is less than 0.05).
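As an aside, if you wanted to check an ANCOVA like this outside SPSS, a minimal sketch with Python and statsmodels would look something like the following (the file name and column names are assumptions for illustration, not names guaranteed to be in Stalker.sav):

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Assumed columns: 'group' (therapy), 'stalk1' (pre-therapy hours), 'stalk2' (post-therapy hours)
df = pd.read_csv('stalker.csv')  # hypothetical export of Stalker.sav

# ANCOVA: therapy effect on post-therapy stalking, controlling for pre-therapy stalking
model = smf.ols('stalk2 ~ stalk1 + C(group)', data=df).fit()
print(anova_lm(model, typ=2))  # Type II sums of squares; with no interaction term
                               # these coincide with SPSS's Type III for this design
```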
Group (adjusted means), Dependent Variable: Time Spent Stalking After Therapy (hours per week)
Group                      Mean        Std. Error   95% Confidence Interval
Cruel to be Kind Therapy   55.299(a)   1.871        [51.534, 59.063]
Psychodyshamic Therapy     61.501(a)   1.871        [57.737, 65.266]
a. Evaluated at covariates appearing in the model: Time Spent Stalking Before Therapy (hours per week) = 65.22.

To interpret the results of the main effect of therapy we need to look at adjusted means. These adjusted means are shown above. There are only two groups being compared in this example so we can conclude that the therapies had a significantly different effect on stalking behaviour; specifically, stalking behaviour was lower after the therapy involving the cattle prod compared to psychodyshamic therapy.

[Scatterplot with fitted regression line: time spent stalking after therapy (hours per week) plotted against time spent stalking before therapy (hours per week).]

We need to interpret the covariate. The graph above shows the time spent stalking after therapy (dependent variable) and the initial level of stalking (covariate). This graph shows that there is a positive relationship between the two variables: that is, high scores on one variable correspond to high scores on the other, whereas low scores on one variable correspond to low scores on the other.

Calculating the Effect Size

The value of ω² can be calculated for the effect of therapy using the sum of squares for the experimental effect (480.27), the mean squares for the error term (87.48) and the total variability (the corrected total, 9118):

$$\omega^2 = \frac{SS_M - df_M \, MS_R}{SS_T + MS_R} = \frac{480.265 - (1)(87.483)}{9118 + 87.483} = \frac{392.782}{9205.483} = .04, \qquad \omega = .21$$

This represents a medium to large effect. Therefore, the effect of a cattle prod compared to psychodyshamic therapy is a substantive finding. For the effect of the covariate, the error mean squares is the same, but the effect is much bigger (MSM is 4414.60 rounded to 2 decimal places). If we place this value in the equation, we get the following:

$$\omega^2_{covariate} = \frac{SS_M - df_M \, MS_R}{SS_T + MS_R} = \frac{4414.598 - (1)(87.483)}{9118 + 87.483} = \frac{4327.115}{9205.483} = .47, \qquad \omega_{covariate} = .69$$

This represents a very large effect (it is well above the threshold of 0.5, and is close to 1). Therefore, the relationship between initial stalking behaviour and the stalking behaviour after therapy is very strong indeed.

Interpreting and Writing the Result

The correct way to report the main finding would be: Levene's test was significant (F(1, 48) = 7.19, p < .05), indicating that the assumption of homogeneity of variance had been broken. The main effect of therapy was significant (F(1, 47) = 5.49, p < .05, ω² = .04), indicating that the time spent stalking was lower after using a cattle prod (M = 55.30, SE = 1.87) compared to after psychodyshamic therapy (M = 61.50, SE = 1.87). The covariate was also significant (F(1, 47) = 50.46, p < .001, ω² = .47), indicating that the level of stalking before therapy had a significant effect on the level of stalking after therapy (there was a positive relationship between these two variables).
Task 2

A marketing manager for a certain well-known drinks manufacturer was interested in the therapeutic benefit of certain soft drinks for curing hangovers. He took 15 people out on the town one night and got them drunk. The next morning as they awoke, dehydrated and feeling as though theyd licked a camels sandy feet clean with their tongue, he gave five of them water to drink, five of them Lucozade (in case this isnt sold outside of the UK, its a very nice glucose-based drink) and the remaining five a leading brand of cola (this variable is called
drink). He then measured how well they felt (on a scale from 0 = I feel like death to 10 = I feel really full of beans and healthy) two hours later (this variable is called well). He wanted to know which drink produced the greatest level of wellness. However, he realized that it was important to control for how drunk the person got the night before, and so he measured this on a scale of 0 = as sober as a nun to 10 = flapping about like a haddock out of water on the floor in a puddle of their own vomit. The data are in the file HangoverCure.sav.

SPSS Output
Tests of Between-Subjects Effects (Dependent Variable: How Well Does The Person Feel?)
Source            Type III Sum of Squares   df   Mean Square   F         Sig.
Corrected Model        2.133(a)              2     1.067          .821   .463
Intercept            459.267                 1   459.267        353.282  .000
DRINK                  2.133                 2     1.067          .821   .463
Error                 15.600                12     1.300
Total                477.000                15
Corrected Total       17.733                14
a. R Squared = .120 (Adjusted R Squared = -.026)

This table shows the ANOVA table for these data when the covariate is not included. It is clear from the significance value that there are no differences in how well people feel when they have different drinks.
Levene's Test of Equality of Error Variances(a) (Dependent Variable: How Well Does The Person Feel?)
F = .220, df1 = 2, df2 = 12, Sig. = .806
Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a. Design: Intercept + DRUNK + DRINK

Tests of Between-Subjects Effects (Dependent Variable: How Well Does The Person Feel?)
Source            Type III Sum of Squares   df   Mean Square   F        Sig.
Corrected Model       13.320(a)              3    4.440        11.068   .001
Intercept             14.264                 1   14.264        35.556   .000
DRUNK                 11.187                 1   11.187        27.886   .000
DRINK                  3.464                 2    1.732         4.318   .041
Error                  4.413                11     .401
Total                477.000                15
Corrected Total       17.733                14
a. R Squared = .751 (Adjusted R Squared = .683)

These tables show the results of Levene's test and the ANOVA table when drunkenness the previous night is included in the model as a covariate. Levene's test is non-significant, indicating that the group variances are roughly equal (hence the assumption of homogeneity of variance has been met). It is clear that the covariate significantly predicts the dependent variable, so the drunkenness of the person influenced how well they felt the next day. What's more interesting is that when the effect of drunkenness is removed, the effect of drink becomes significant (p is 0.041, which is less than 0.05).
Parameter Estimates (Dependent Variable: How Well Does The Person Feel?)
Parameter      B       Std. Error   t        Sig.   95% Confidence Interval
Intercept      7.116   .377         18.861   .000   [6.286, 7.947]
DRUNK          -.548   .104         -5.281   .000   [-.777, -.320]
[DRINK=1.00]   -.142   .420          -.338   .741   [-1.065, .781]
[DRINK=2.00]    .987   .442          2.233   .047   [.014, 1.960]
[DRINK=3.00]    0(a)    .             .       .      .
a. This parameter is set to zero because it is redundant.

The next table shows the parameter estimates selected in the options dialog box. These estimates are calculated using a regression analysis with drink split into two dummy coding variables. SPSS codes the two dummy variables such that the last category (the category coded with the highest value in the data editor, in this case the cola group) is the reference category. This reference category (labelled DRINK=3 in the output) is coded with 0 for both dummy variables; DRINK=2, therefore, represents the difference between the group coded as 2 (Lucozade) and the reference category (cola); and DRINK=1 represents the difference between the group coded as 1 (water) and the reference category (cola). The beta values literally represent the differences between the means of these groups and so the significances of the t-tests tell us whether the group means differ significantly. Therefore, from these estimates we could conclude that the cola and water groups have similar means whereas the cola and Lucozade groups have significantly different means.
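To see the dummy-coding logic in action outside SPSS, here is a hedged sketch in Python/statsmodels with made-up data; cola is set as the reference category to mirror the SPSS output above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
df = pd.DataFrame({
    'drink': np.repeat(['water', 'lucozade', 'cola'], 5),
    'drunk': rng.integers(2, 9, 15),
})
# Made-up outcome: wellness falls with drunkenness, Lucozade helps a little
df['well'] = (7 - 0.5 * df['drunk']
              + df['drink'].map({'water': 0.0, 'lucozade': 1.0, 'cola': 0.2})
              + rng.normal(0, 0.5, 15))

# Treatment (dummy) coding with cola as the reference category, covariate included
model = smf.ols("well ~ drunk + C(drink, Treatment(reference='cola'))", data=df).fit()
print(model.params)  # each drink coefficient = that group's difference from cola,
                     # adjusted for drunkenness
```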
Contrast Results (K Matrix)(a), Dependent Variable: How Well Does The Person Feel? (Drink Simple Contrast)
                                       Level 2 vs. Level 1   Level 3 vs. Level 1
Contrast Estimate                      1.129                  .142
Hypothesized Value                     0                      0
Difference (Estimate - Hypothesized)   1.129                  .142
Std. Error                              .405                  .420
Sig.                                    .018                  .741
95% CI for Difference                  [.237, 2.021]          [-.781, 1.065]
a. Reference category = 1

The next output shows the result of a contrast analysis that compares level 2 (Lucozade) against level 1 (water) as a first comparison, and level 3 (cola) against level 1 (water) as a second comparison. These results show that the Lucozade group felt significantly better than the water group (contrast 1), but that the cola group did not differ significantly from the water group (p = 0.741). These results are consistent with the regression parameter estimates (in fact, note that contrast 2 is identical to the regression parameter for DRINK=1 in the previous section).

Drink (adjusted means), Dependent Variable: How Well Does The Person Feel?
Drink      Mean       Std. Error   95% Confidence Interval
Water      5.110(a)   .284         [4.485, 5.735]
Lucozade   6.239(a)   .295         [5.589, 6.888]
Cola       5.252(a)   .302         [4.588, 5.916]
a. Covariates appearing in the model are evaluated at the following values: How Drunk was the Person the Night Before = 4.60.

This table gives the adjusted values of the group means and it is these values that should be used for interpretation. The adjusted means show that the significant ANCOVA reflects a difference between the water and the Lucozade groups. The cola and water groups appear to have fairly similar adjusted means indicating that cola is no better than water at helping your hangover. These conclusions support what we know from the contrasts and regression parameters. To look at the effect of the covariate we can examine a scatterplot:

This shows that the more drunk a person was the night before, the less well they felt the next day.

Calculating the Effect Size

We can calculate omega squared (ω²) for the covariate:

$$\omega^2 = \frac{SS_M - df_M \, MS_R}{SS_T + MS_R} = \frac{11.187 - (1)(0.401)}{17.733 + 0.401} = \frac{10.786}{18.134} = .59, \qquad \omega = .77$$

We can also do the same for the main effect of drink:

$$\omega^2 = \frac{3.464 - (1)(0.401)}{17.733 + 0.401} = \frac{3.063}{18.134} = .17, \qquad \omega = .41$$

We've got t-statistics for the comparisons between the cola and water groups and the cola and Lucozade groups. These t-statistics have N - 2 degrees of freedom, where N is the total sample size (in this case 15). Therefore we get:

$$r_{Cola\ vs.\ Water} = \sqrt{\frac{0.338^2}{0.338^2 + 13}} = 0.09$$

$$r_{Cola\ vs.\ Lucozade} = \sqrt{\frac{2.233^2}{2.233^2 + 13}} = 0.53$$

Interpreting and Writing the Result


We could report the main finding as: The covariate, drunkenness, was significantly related to how ill the person felt the next day, F(1, 11) = 27.89, p < .001, ω² = .59. There was also a significant effect of the type of drink on how well the person felt after controlling for how drunk they were the night before, F(2, 11) = 4.32, p < .05, ω² = .17. We can also report some contrasts: Planned contrasts revealed that having Lucozade significantly improved how well you felt compared to having cola, t(13) = 2.23, p < .05, r = .53, but having cola was no better than having water, t(13) = 0.34, ns, r = .09. We can conclude that cola and water have the same effects on hangovers but that Lucozade seems significantly better at curing hangovers than cola.
Chapter 12 Task 1

Peoples musical taste tends to change as they get older (my parents, for example, after years of listening to relatively cool music when I was a kid in the 1970s, subsequently hit their mid-fourties and developed a worrying obsession with country and western musicor maybe it was the stress of having me as a teenage son!). Anyway, this worries me immensely as the future seems incredibly bleak if it is spent listening to Garth Brooks and thinking oh boy, did I underestimate Garths immense talent when I was in my 20s. So, I thought Id do some research to find out whether my fate really was sealed, or whether its possible to be old and like good music too. First, I got myself two groups of people (45 people in

each group): one group contained young people (which I arbitrarily decided was under 40 years of age), and the other group contained more mature individuals (above 40 years of age). This is my first independent variable, age, and it has two levels (less than or more than 40 years old). I then split each of these groups of 45 into three smaller groups of 15 and assigned them to listen to either Fugazi (who everyone knows are the coolest band on the planet), ABBA or Barf Grooks (who is a lesser known country and western musician not to be confused with anyone who has a similar name and produces music that makes you want to barf). This is my second independent variable, music, and has three levels (Fugazi, ABBA or Barf Grooks). There were different participants in all conditions, which means that of the 45 under fourties, 15 listened to Fugazi, 15 listened to ABBA and 15 listened to Barf Grooks; likewise of the 45 over fourties, 15 listened to Fugazi, 15 listened to ABBA and 15 listened to Barf Grooks. After listening to the music I got each person to rate it on a scale ranging from 100 (I hate this foul music of Satan) through 0 (I am completely indifferent) to +100 (I love this music so much Im going to explode). This variable is called liking. The data are in the file
Fugazi.sav. Conduct a two-way independent ANOVA on them.

SPSS Output The error bar chart of the music data shows the mean rating of the music played to each group. Its clear from this chart that when people listened to Fugazi the two age groups were divided: the older ages rated it very low, but the younger people rated it very highly. A reverse trend is found if you look at the ratings for Barf Grooks: the youngsters give it


low ratings while the wrinkly ones love it. For ABBA the groups agreed: both old and young rated them highly.

[Error bar chart: mean liking rating for Fugazi, Abba and Barf Grooks, plotted separately for the 0-40 and 40+ age groups.]

The following output shows Levenes test. For these data the significance value is 0.322, which is greater than the criterion of 0.05. This means that the variances in the different experimental groups are roughly equal (i.e. not significantly different), and that the assumption has been met.
Levene's Test of Equality of Error Variances(a) (Dependent Variable: Liking Rating)
F = 1.189, df1 = 5, df2 = 84, Sig. = .322
Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a. Design: Intercept + MUSIC + AGE + MUSIC * AGE

The next output shows the main ANOVA summary table.

Tests of Between-Subjects Effects (Dependent Variable: Liking Rating)
Source            Type III Sum of Squares   df   Mean Square    F         Sig.
Corrected Model      392654.933(a)           5    78530.987     202.639   .000
Intercept             34339.600              1    34339.600      88.609   .000
MUSIC                 81864.067              2    40932.033     105.620   .000
AGE                        .711              1         .711        .002   .966
MUSIC * AGE          310790.156              2   155395.078     400.977   .000
Error                 32553.467             84       387.541
Total                459548.000             90
Corrected Total      425208.400             89
a. R Squared = .923 (Adjusted R Squared = .919)

The main effect of music is shown by the F-ratio in the row labelled MUSIC; in this case the significance is 0.000, which is lower than the usual cut-off point of 0.05. Hence, we can say that there was a significant effect of the type of music on the ratings. To understand what this actually means, we need to look at the mean ratings for each type of music when we ignore whether the person giving the rating was old or young:

[Bar chart with 95% confidence interval error bars: mean liking rating for Fugazi, Abba and Barf Grooks, ignoring age group.]

What this graph shows is that the significant main effect of music is likely to reflect the fact that ABBA were rated (overall) much more positively than the other two artists. The main effect of age is shown by the F-ratio in the row labelled AGE; the probability associated with this F-ratio is 0.966, which is so close to 1 that it means that it is a virtual certainty that this F could occur by chance alone. Again, to interpret the effect we need to look at the mean ratings for the two age groups ignoring the type of music to which they listened.

[Bar chart with 95% confidence interval error bars: mean liking rating for the 40+ and 0-40 age groups, ignoring the type of music.]

This graph shows that when you ignore the type of music that was being rated, older people, on average, gave almost identical ratings to younger people (i.e. the mean ratings in the two groups are virtually the same). The interaction effect is shown by the F-ratio in the row labeled MUSIC * AGE; the associated significance value is small (0.000) and is less than the criterion of 0.05. Therefore, we can say that there is a significant interaction between age and the type of


music rated. To interpret this effect we need to look at the mean ratings in all conditions and these means were originally plotted at the beginning of this output. The fact there is a significant interaction tells us that for certain types of music the different age groups gave different ratings. In this case, although they agree on ABBA, there are large disagreements in ratings of Fugazi and Barf Grooks. Given that we found a main effect of music, and of the interaction between music and age, we can look at some of the post hoc tests to establish where the difference lies. The next output shows the result of GamesHowell post hoc tests. First, ratings of Fugazi are compared to ABBA, which reveals a significant difference (the value in the column labeled Sig. is less than 0.05), and then Barf Grooks, which reveals no difference (the significance value is greater than 0.05). In the next part of the table, ratings to ABBA are compared first to Fugazi (which just repeats the finding in the previous part of the table) and then to Barf Grooks, which reveals a significant difference (the significance value is below 0.05). The final part of the table compares Barf Grooks to Fugazi and ABBA but these results repeat findings from the previous sections of the table.
Multiple Comparisons (Games-Howell), Dependent Variable: Liking Rating
(I) Music     (J) Music     Mean Difference (I-J)   Std. Error   Sig.   95% Confidence Interval
Fugazi        Abba          -66.8667*               5.08292      .000   [-101.1477, -32.5857]
Fugazi        Barf Grooks    -6.2333                5.08292      .946   [-53.3343, 40.8677]
Abba          Fugazi         66.8667*               5.08292      .000   [32.5857, 101.1477]
Abba          Barf Grooks    60.6333*               5.08292      .001   [24.9547, 96.3119]
Barf Grooks   Fugazi          6.2333                5.08292      .946   [-40.8677, 53.3343]
Barf Grooks   Abba          -60.6333*               5.08292      .001   [-96.3119, -24.9547]
Based on observed means. *. The mean difference is significant at the .05 level.

Calculating Effect Sizes

$$\hat\sigma^2_{music} = \frac{(3-1)(40932.033 - 387.541)}{15 \times 3 \times 2} = 900.99$$

$$\hat\sigma^2_{age} = \frac{(2-1)(0.711 - 387.541)}{15 \times 3 \times 2} = -4.30$$

$$\hat\sigma^2_{music \times age} = \frac{(3-1)(2-1)(155395.078 - 387.541)}{15 \times 3 \times 2} = 3444.61$$

We also need to estimate the total variability and this is just the sum of these other variables plus the residual mean squares:

$$\hat\sigma^2_{total} = \hat\sigma^2_{music} + \hat\sigma^2_{age} + \hat\sigma^2_{music \times age} + MS_R = 900.99 - 4.30 + 3444.61 + 387.54 = 4728.84$$

The effect size is then simply the variance estimate for the effect in which you're interested divided by the total variance estimate:

$$\omega^2_{effect} = \frac{\hat\sigma^2_{effect}}{\hat\sigma^2_{total}}$$

As such, for the main effect of music we get:

$$\omega^2_{music} = \frac{\hat\sigma^2_{music}}{\hat\sigma^2_{total}} = \frac{900.99}{4728.84} = 0.19$$

For the main effect of age we get:

$$\omega^2_{age} = \frac{\hat\sigma^2_{age}}{\hat\sigma^2_{total}} = \frac{-4.30}{4728.84} = 0.00$$

For the interaction of music and age we get:

$$\omega^2_{music \times age} = \frac{\hat\sigma^2_{music \times age}}{\hat\sigma^2_{total}} = \frac{3444.61}{4728.84} = 0.73$$
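The same variance-component arithmetic is easy to script; a small sketch in Python that just re-uses the mean squares from the output above (with n = 15 per cell, a = 3 levels of music and b = 2 levels of age):

```python
def variance_component(ms_effect, ms_resid, df_effect, n, a, b):
    """Estimated variance component for one effect in a two-way independent ANOVA."""
    return df_effect * (ms_effect - ms_resid) / (n * a * b)

ms_resid, n, a, b = 387.541, 15, 3, 2
music       = variance_component(40932.033,  ms_resid, 2, n, a, b)
age         = variance_component(0.711,      ms_resid, 1, n, a, b)
interaction = variance_component(155395.078, ms_resid, 2, n, a, b)

total = music + age + interaction + ms_resid
for name, comp in [('music', music), ('age', age), ('music x age', interaction)]:
    # a negative variance estimate (as for age here) is effectively zero
    print(f"omega^2 ({name}) = {comp / total:.2f}")
```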

Interpreting and Writing the Result

As with the other ANOVAs we've encountered we have to report the details of the F-ratio and the degrees of freedom from which it was calculated. For the various effects in these data the F-ratios will be based on different degrees of freedom: each was derived from dividing the mean squares for the effect by the mean squares for the residual. For the effects of music and the music × age interaction, the model degrees of freedom were 2 (dfM = 2), but for the effect of age the degrees of freedom were only 1 (dfM = 1). For all effects, the degrees of freedom for the residuals were 84 (dfR = 84). We can, therefore, report the three effects from this analysis as follows: The results show that the main effect of the type of music listened to significantly affected the ratings of that music (F(2, 84) = 105.62, p < .001, ω² = .19). The Games-Howell post hoc test revealed that ABBA were rated significantly higher than both Fugazi and Barf Grooks (both ps < .01). The main effect of age on the ratings of the music was non-significant (F(1, 84) < 1, ω² = .00). The music × age interaction was significant (F(2, 84) = 400.98, p < .001, ω² = .73), indicating that different types of music were rated differently by the two age groups. Specifically, Fugazi were rated more positively by the young group (M = 66.20, SD = 19.90) than the old (M = -75.87, SD = 14.37); ABBA were rated fairly equally in the young (M = 64.13, SD = 16.99) and old groups (M = 59.93, SD = 19.98); Barf Grooks was rated less positively by the young group (M = -71.47, SD = 23.17) compared to the old (M = 74.27, SD = 22.29). These findings indicate that there is no hope for me: the minute I hit 40 I will suddenly start to love country and western music and will burn all of my Fugazi CDs (it will never happen, arghhhh!!!).
Task 2

In Chapter 3 we used some data that related to men and womens arousal levels when watching either Bridget Jones Diary or Memento (ChickFlick.sav). Analyse these data to see whether men and women differ in their reactions to different types of films.

The following output shows Levenes test. For these data the significance value is 0.456, which is greater than the criterion of 0.05. This means that the variances in the different experimental groups are roughly equal (i.e. not significantly different), and that the assumption has been met.

The next output shows the main ANOVA summary table.


The main effect of gender is shown by the F-ratio in the row labelled gender; in this case the significance is 0.153, which is greater than the usual cut-off point of 0.05. Hence, we can say that there was not a significant effect of gender on arousal during the films. To understand what this actually means, we need to look at the mean arousal levels for men and women (when we ignore which film they watched):


What this graph shows is that arousal levels were quite similar for men and women in general; this is why the main effect of gender was non-significant. The main effect of film is shown by the F-ratio in the row labelled film; the probability associated with this F-ratio is 0.000, which is less than the critical value of 0.05, hence we can say that arousal levels were significantly different in the two films. Again, to interpret the effect we need to look at the mean arousal levels but this time comparing the two films (and ignoring whether the person was male or female). This graph shows that when you ignore the gender of the person, arousal levels were significantly higher for Memento than Bridget Jones Diary.


The interaction effect is shown by the F-ratio in the row labelled gender * film; the associated significance value is 0.366, which is greater than the criterion of 0.05. Therefore, we can say that there is not a significant interaction between gender and the type of film watched. To interpret this effect we need to look at the mean arousal in all conditions.


This graph shows the non-significant interaction: arousal levels are higher for Memento compared to Bridget Jones Diary in both men and women (i.e. the difference between the green and blue bars is more or less the same for men and women).

Calculating Effect Sizes

$$\hat\sigma^2_{film} = \frac{(2-1)(1092.03 - 40.77)}{10 \times 2 \times 2} = 26.28$$

$$\hat\sigma^2_{gender} = \frac{(2-1)(87.03 - 40.77)}{10 \times 2 \times 2} = 1.16$$

$$\hat\sigma^2_{gender \times film} = \frac{(2-1)(2-1)(34.23 - 40.77)}{10 \times 2 \times 2} = -0.16$$

We also need to estimate the total variability and this is just the sum of these other variables plus the residual mean squares:

$$\hat\sigma^2_{total} = \hat\sigma^2_{film} + \hat\sigma^2_{gender} + \hat\sigma^2_{gender \times film} + MS_R = 26.28 + 1.16 - 0.16 + 40.77 = 68.05$$

The effect size is then simply the variance estimate for the effect in which you're interested divided by the total variance estimate:

$$\omega^2_{effect} = \frac{\hat\sigma^2_{effect}}{\hat\sigma^2_{total}}$$

As such, for the main effect of gender we get:

$$\omega^2_{gender} = \frac{\hat\sigma^2_{gender}}{\hat\sigma^2_{total}} = \frac{1.16}{68.05} = 0.02$$

For the main effect of film we get:

$$\omega^2_{film} = \frac{\hat\sigma^2_{film}}{\hat\sigma^2_{total}} = \frac{26.28}{68.05} = 0.39$$

For the interaction we get:

$$\omega^2_{gender \times film} = \frac{\hat\sigma^2_{gender \times film}}{\hat\sigma^2_{total}} = \frac{-0.16}{68.05} = 0.00$$

Interpreting and Writing the Result

We can report the three effects from this analysis as follows: The results show that the main effect of the type of film significantly affected arousal during that film, F(1, 36) = 26.79, p < .001, ω² = .39. Arousal levels were significantly higher during Memento compared to Bridget Jones's Diary. The main effect of gender on arousal levels during the films was non-significant, F(1, 36) = 2.14, ω² = .02. The gender × film interaction was not significant, F(1, 36) < 1, ω² = .00. This showed that arousal levels were higher for Memento compared to Bridget Jones's Diary in both men and women.

Task 3

At the start of this chapter I described a way of empirically researching whether I wrote better songs than my old band mate Malcolm, and whether this depended on the type of song (a symphony or song about flies). The outcome variable would be the number of screams elicited by audience members during the songs. These data are in the file Escape From Inside.sav. Draw an error bar graph (lines), analyse and interpret these data.

To do a multiple line chart for means that are independent (i.e. have come from different groups) we need to double-click on the multiple line chart icon in the Chart Builder (see the book chapter). All we need to do is to drag our variables into the appropriate drop zones. Select Screams from the variable list and drag it into the drop zone for the y-axis; select Song_Type from the variable list and drag it into the drop zone for the x-axis; finally, select the Songwriter variable and drag it into the drop zone that sets the colour of the lines. This will mean that lines representing Andy's and Malcolm's songs will be displayed in different colours. Select error bars in the properties dialog box, apply them to the Chart Builder, and then click on OK to produce the graph.

The resulting graph looks like this:


The following output shows Levenes test. For these data the significance value is 0.817, which is greater than the criterion of 0.05. This means that the variances in the different experimental groups are roughly equal (i.e. not significantly different), and that the assumption has been met.


The next output shows the main ANOVA summary table. The main effect of the type of song is shown by the F-ratio in the row labelled Song_Type; in this case the significance is 0.000, which is smaller than the usual cut-off point of 0.05. Hence, we can say that there was a significant effect of the type of song on the number of screams elicited while it was played. The graph shows that the two symphonies elicited significantly more screams of agony than the two songs about flies.


The main effect of the songwriter was significant because the significance of the F-ratio for this effect is 0.002, which is less than the critical value of 0.05, hence we can say that Andy and Malcolm differed in the reactions to their songs. The graph tells us that Andys songs elicited significantly more screams of torment from the audience than Malcolms songs.

The interaction effect was significant too because the associated significance value is 0.028, which is less than the criterion of 0.05. Therefore, we can say that there is a significant interaction between the type of song and who wrote it on people's appreciation of the song. The line graph that you drew earlier tells us that although reactions to Malcolm's and Andy's songs were fairly similar for the Flies song, they differed quite a bit for the symphony: Andy's symphony elicited more screams of torment than Malcolm's. We can conclude that in general Malcolm was a better songwriter than Andy, but the interaction tells us that this effect is true mainly for symphonies.

Calculating Effect Sizes

σ̂²(Song Type) = (2 - 1)(74.13 - 3.55) / (17 × 2 × 2) = 1.04
σ̂²(Songwriter) = (2 - 1)(35.31 - 3.55) / (17 × 2 × 2) = 0.47
σ̂²(Song Type × Songwriter) = (2 - 1)(2 - 1)(18.02 - 3.77) / (17 × 2 × 2) = 0.21

We also need to estimate the total variability, and this is just the sum of these other variance estimates plus the residual mean squares:

σ̂²(total) = σ̂²(Song Type) + σ̂²(Songwriter) + σ̂²(Song Type × Songwriter) + MS_R = 1.04 + 0.47 + 0.21 + 3.77 = 5.49


The effect size is then simply the variance estimate for the effect in which you're interested divided by the total variance estimate:

ω²(effect) = σ̂²(effect) / σ̂²(total)

As such, for the main effect of song type we get:

ω²(Song Type) = σ̂²(Song Type) / σ̂²(total) = 1.04 / 5.49 = 0.19

For the main effect of songwriter we get:


ω²(Songwriter) = σ̂²(Songwriter) / σ̂²(total) = 0.47 / 5.49 = 0.09

For the interaction we get:


ω²(Song Type × Songwriter) = σ̂²(Song Type × Songwriter) / σ̂²(total) = 0.21 / 5.49 = 0.04
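Purely as a cross-check (this sketch is mine, not part of the original answer), the omega-squared arithmetic above can be scripted. The per-group n of 17 and the mean squares are taken exactly as quoted in the calculations above.

```python
# A sketch verifying the omega-squared values above (not the chapter's own code).
def variance_component(df_effect, ms_effect, ms_residual, n, a, b):
    """Variance estimate for one effect in an a x b independent design with n per group."""
    return df_effect * (ms_effect - ms_residual) / (n * a * b)

n, a, b = 17, 2, 2
sigma_songtype    = variance_component((a - 1), 74.13, 3.55, n, a, b)            # ~1.04
sigma_songwriter  = variance_component((b - 1), 35.31, 3.55, n, a, b)            # ~0.47
sigma_interaction = variance_component((a - 1) * (b - 1), 18.02, 3.77, n, a, b)  # ~0.21

sigma_total = sigma_songtype + sigma_songwriter + sigma_interaction + 3.77       # ~5.49
for name, comp in [("Song Type", sigma_songtype), ("Songwriter", sigma_songwriter),
                   ("Song Type x Songwriter", sigma_interaction)]:
    print(f"omega^2 ({name}) = {comp / sigma_total:.2f}")
```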

Interpreting and Writing the Result
We can report the three effects from this analysis as follows: The results show that the main effect of the type of song significantly affected screams elicited during that song, F(1, 64) = 20.87, p < .001, ω² = .19; the two symphonies elicited significantly more screams of agony than the two songs about flies. The main effect of the songwriter significantly affected screams elicited during that song, F(1, 64) = 9.94, p < .01, ω² = .09; Andy's songs elicited significantly more screams of torment from the audience than Malcolm's songs. The song type × songwriter interaction was significant, F(1, 64) = 5.07, p < .05, ω² = .04. Although reactions to Malcolm's and Andy's songs were fairly similar for the Flies song, they differed quite a bit for the symphony: Andy's symphony elicited more screams of torment than Malcolm's.

Task 4

Change the syntax in GogglesSimpleEffects.sps to look at the effect of alcohol at different levels of gender.

The correct syntax to use is:

MANOVA Attractiveness BY gender (0 1) alcohol(1 3)
 /DESIGN = alcohol WITHIN gender(1) alcohol WITHIN gender (2)
 /PRINT CELLINFO SIGNIF( UNIV MULT AVERF HF GG ).

The main part of the analysis is:

* * * * * * A n a l y s i s   o f   V a r i a n c e -- design 1 * * * * * *

Tests of Significance for ATTRACT using UNIQUE sums of squares

Source of Variation             SS        DF       MS        F    Sig of F
WITHIN+RESIDUAL            3656.25        43    85.03
ALCOHOL WITHIN GENDER(1)   5208.33         2  2604.17    30.63        .000
ALCOHOL WITHIN GENDER(2)    102.08         2    51.04      .60        .553
(Model)                    5310.42         4  1327.60    15.61        .000
(Total)                    8966.67        47   190.78

R-Squared = .592
Adjusted R-Squared = .554

What this shows is a significant effect of alcohol at level 1 of gender. Because we coded gender as 0 = male, 1 = female, this means there's a significant effect of alcohol for men. Think back to the chapter: this reflects the fact that men choose very unattractive dates after 4 pints. However, there is no significant effect of alcohol at level 2 of gender. This tells us that women are not affected by the beergoggles effect: the attractiveness of their dates does not change as they drink more.

Calculating the Effect Size
These effects have df = 2 in the model so we can't calculate an effect size (well, technically we can calculate ω², but I'm not entirely sure how useful that is).
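As a rough cross-check of the simple effects reported above (my own sketch, not the chapter's method), you could run a one-way ANOVA of alcohol separately for each gender. Note that this uses a separate error term within each gender, so the F-ratios will not exactly match the MANOVA output, which pools the error across all cells; the file name Goggles.sav and the variable names are also assumptions, since the data file is not named in this task.

```python
# A hedged sketch: one-way ANOVA of alcohol within each gender, using separate error terms.
import pandas as pd
from scipy import stats

data = pd.read_spss("Goggles.sav")  # assumed file name; requires pyreadstat

for sex, subset in data.groupby("gender"):
    groups = [grp["Attractiveness"].to_numpy() for _, grp in subset.groupby("alcohol")]
    F, p = stats.f_oneway(*groups)
    print(f"Effect of alcohol for gender = {sex}: F = {F:.2f}, p = {p:.3f}")
```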
Chapter 13 Task 1

There is often concern among students as to the consistency of marking between lecturers. It is common that lecturers obtain reputations for being hard or light markers (or, to use the students' terminology, evil manifestations from Beelzebub's bowels and nice people), but there is often little to substantiate these reputations. A group of students investigated the consistency of marking by submitting the same essays to four different lecturers. The mark given by each lecturer was recorded for each of the eight essays. It was important that the same essays were used for all lecturers because this eliminated any individual differences in the standard of work that each lecturer marked. This design is repeated measures because every lecturer marked every essay. The independent variable was the lecturer who marked the report and the dependent variable was the percentage mark given. The data are in the file Tutor.sav. Conduct a one-way ANOVA on these data by hand.

Data for essay marks example:



Essay   Tutor 1 (Dr Field)   Tutor 2 (Dr Smith)   Tutor 3 (Dr Scrote)   Tutor 4 (Dr Death)   Mean     S²
1       62                   58                   63                    64                   61.75      6.92
2       63                   60                   68                    65                   64.00     11.33
3       65                   61                   72                    65                   65.75     20.92
4       68                   64                   58                    61                   62.75     18.25
5       69                   65                   54                    59                   61.75     43.58
6       71                   67                   65                    50                   63.25     84.25
7       78                   66                   67                    50                   65.25    132.92
8       75                   73                   75                    45                   67.00    216.00
Mean    68.875               64.25                65.25                 57.375

There were 8 essays, each marked by four different lecturers. Their marks are shown in the table. In addition, the mean mark given by each lecturer is shown in the table, and also the mean mark that each essay received and the variance of marks for a particular essay. Now, the total variance within essays will in part be caused by the fact that different lecturers are harder or softer markers (the manipulation), and will, in part, be caused by the fact that the essays themselves will differ in quality (individual differences).

The Total Sum of Squares (SST)
Remember from one-way independent ANOVA that SST is calculated using the following equation:


SS_T = s²_grand (N - 1)

Well, in repeated-measures designs the total sum of squares is calculated in exactly the same way. The grand variance in the equation is simply the variance of all scores when we ignore the group to which they belong. So if we treated the data as one big group it would look as follows:

62 63 65 68 69 71 78 75

58 60 61 64 65 67 66 73

63 68 72 58 54 65 67 75

64 65 65 61 59 50 50 45

Grand Mean = 63.9375

The variance of these scores is 55.028 (try this on your calculators). We used 32 scores to generate this value, and so N is 32. As such the equation becomes:


SS_T = s²_grand (N - 1) = 55.028 × (32 - 1) = 1705.868
The degrees of freedom for this sum of squares, as with the independent ANOVA, will be N - 1, or 31.

The Within-Participant Sum of Squares (SSW)
The crucial variation in this design is that there is a variance component called the within-participant variance (this arises because we've manipulated our independent variable within each participant). This is calculated using a sum of squares. Generally speaking, when we calculate any sum of squares we look at the squared difference between the mean and individual scores. This can be expressed in terms of the variance across a number of scores and the number of scores on which the variance is based. For example, when we calculated the residual sum of squares in independent ANOVA (SSR) we used the following equation:

SS_R = Σ(x_i - x̄)² = s²(n - 1)

This equation gave us the variance between individuals within a particular group, and so is an estimate of individual differences within a particular group. Therefore, to get the total value of individual differences we have to calculate the sum of squares within each group and then add them up:

SS_R = s²_group1(n₁ - 1) + s²_group2(n₂ - 1) + s²_group3(n₃ - 1)



This is all well and good when we have different people in each group, but in repeated-measures designs we've subjected people to more than one experimental condition, and therefore we're interested in the variation not within a group of people (as in independent ANOVA) but within an actual person. That is, how much variability is there within an individual? To find this out we actually use the same equation, but we adapt it to look at people rather than groups. So, if we call this sum of squares SSW (for within-participant SS) we could write it as:
SS_W = s²_person1(n₁ - 1) + s²_person2(n₂ - 1) + s²_person3(n₃ - 1) + ... + s²_person n(n_n - 1)

This equation simply means that we're looking at the variation in an individual's scores and then adding these variances for all the people in the study. Some of you may have noticed that, in our example, we're using essays rather than people, and so to be pedantic we'd write this as:
SS_W = s²_essay1(n₁ - 1) + s²_essay2(n₂ - 1) + s²_essay3(n₃ - 1) + ... + s²_essay n(n_n - 1)

The ns simply represent the number of scores on which the variances are based (i.e. the number of experimental conditions, or in this case the number of lecturers). All of the variances we need are in the table, so we can calculate SSW as:
SS_W = s²_essay1(n₁ - 1) + s²_essay2(n₂ - 1) + s²_essay3(n₃ - 1) + ... + s²_essay n(n_n - 1)
     = (6.92)(4 - 1) + (11.33)(4 - 1) + (20.92)(4 - 1) + (18.25)(4 - 1) + (43.58)(4 - 1) + (84.25)(4 - 1) + (132.92)(4 - 1) + (216)(4 - 1)
     = 20.76 + 34 + 62.75 + 54.75 + 130.75 + 252.75 + 398.75 + 648
     = 1602.5
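If you want to check these two sums of squares without a calculator, here is a small sketch (mine, not the chapter's) that works directly from the marks in the table; tiny discrepancies from the hand calculation are just rounding in the table's variance column.

```python
# A sketch reproducing SS_T and SS_W from the raw essay marks.
import statistics

marks = [  # one row per essay: Dr Field, Dr Smith, Dr Scrote, Dr Death
    [62, 58, 63, 64], [63, 60, 68, 65], [65, 61, 72, 65], [68, 64, 58, 61],
    [69, 65, 54, 59], [71, 67, 65, 50], [78, 66, 67, 50], [75, 73, 75, 45],
]

all_scores = [score for essay in marks for score in essay]
N = len(all_scores)                                   # 32 scores in total
ss_total = statistics.variance(all_scores) * (N - 1)  # grand variance x (N - 1), ~1705.9

# Within-participant SS: each essay's variance times its df (4 - 1 = 3), ~1602.5
ss_within = sum(statistics.variance(essay) * (len(essay) - 1) for essay in marks)

print(f"SS_T = {ss_total:.3f}")
print(f"SS_W = {ss_within:.3f}")
```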


The degrees of freedom for each person are n - 1 (i.e. the number of conditions minus 1). To get the total degrees of freedom we add the df for all participants. So, with eight participants (essays) and four conditions (i.e. n = 4) we get 8 × 3 = 24 degrees of freedom.

The Model Sum of Squares (SSM)
So far, we know that the total amount of variation within the data is 1705.868 units. We also know that 1602.5 of those units are explained by the variance created by individuals' (essays') performances under different conditions. Now some of this variation is the result of our experimental manipulation and some of this variation is simply random fluctuation. The next step is to work out how much variance is explained by our manipulation and how much is not. In independent ANOVA, we worked out how much variation could be explained by our experiment (the model SS) by looking at the means for each group and comparing these to the overall mean. So, we measured the variance resulting from the differences between group means and the overall mean. We do exactly the same thing with a repeated-measures design. First we calculate the mean for each level of the independent variable (in this case the mean mark given by each lecturer) and compare these values to the overall mean of all marks. So, we calculate this SS in the same way as for independent ANOVA: Calculate the difference between the mean of each group and the grand mean. Square each of these differences. Multiply each result by the number of subjects within that group (n_i).

Add the values for each group together:

SS_M = Σ n_i (x̄_i - x̄_grand)²
Using the means from the essay data, we can calculate SSM as follows:

SS_M = 8(68.875 - 63.9375)² + 8(64.25 - 63.9375)² + 8(65.25 - 63.9375)² + 8(57.375 - 63.9375)²
     = 8(4.9375)² + 8(0.3125)² + 8(1.3125)² + 8(-6.5625)²
     = 554.125

For SSM, the degrees of freedom (dfM) are again one less than the number of things used to calculate the sum of squares. For the model sums of squares we calculated the sum of squared errors between the four means and the grand mean. Hence, we used four things to calculate these sums of squares. So, the degrees of freedom will be 3. So, as with independent ANOVA, the model degrees of freedom are always the number of groups (k) minus 1:

df_M = k - 1 = 3
The Residual Sum of Squares (SSR)

We now know that there are 1706 units of variation to be explained in our data, and that the variation across our conditions accounts for 1602 units. Of these 1602 units, our experimental manipulation can explain 554 units. The final sum of squares is the residual sum of squares (SSR), which tells us how much of the variation cannot be explained by the model. This value is the amount of variation caused by extraneous factors outside of experimental control (such as natural variation in the quality of the essays). Knowing


SSW and SSM already, the simplest way to calculate SSR is to subtract SSM from SSW (SSR = SSW - SSM):

SS_R = SS_W - SS_M = 1602.5 - 554.125 = 1048.375

The degrees of freedom are calculated in a similar way:

df_R = df_W - df_M = 24 - 3 = 21


The Mean Squares

SSM tells us how much variation the model (e.g. the experimental manipulation) explains and SSR tells us how much variation is due to extraneous factors. However, because both of these values are summed values the number of scores that were summed influences them. As with independent ANOVA, we eliminate this bias by calculating the average sum of squares (known as the mean squares, MS), which is simply the sum of squares divided by the degrees of freedom:
MS_M = SS_M / df_M = 554.125 / 3 = 184.708
MS_R = SS_R / df_R = 1048.375 / 21 = 49.923

MSM represents the average amount of variation explained by the model (e.g. the systematic variation), whereas MSR is a gauge of the average amount of variation explained by extraneous variables (the unsystematic variation).

The F-Ratio

The F-ratio is a measure of the ratio of the variation explained by the model and the variation explained by unsystematic factors. It can be calculated by dividing the model mean squares by the residual mean squares. You should recall that this is exactly the same as for independent ANOVA:

F = MS_M / MS_R

So, as with the independent ANOVA, the F-ratio is still the ratio of systematic variation to unsystematic variation. As such, it is the ratio of the experimental effect to the effect on performance of unexplained factors. For the marking data, the F-ratio is:

F = MS_M / MS_R = 184.708 / 49.923 = 3.70

This value is greater than 1, which indicates that the experimental manipulation had some effect above and beyond the effect of extraneous factors. As with independent ANOVA this value can be compared against a critical value based on its degrees of freedom (which are dfM and dfR, which are 3 and 21 in this case).
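To round off the hand calculation, this sketch (again mine, not the chapter's) reproduces the model and residual sums of squares, the mean squares and the F-ratio from the means derived above.

```python
# A sketch completing the one-way repeated-measures calculation by hand.
tutor_means = [68.875, 64.25, 65.25, 57.375]  # Dr Field, Dr Smith, Dr Scrote, Dr Death
grand_mean = 63.9375
n_essays, k = 8, len(tutor_means)

ss_within = 1602.5  # within-participant SS from the previous step
ss_model = sum(n_essays * (m - grand_mean) ** 2 for m in tutor_means)  # ~554.125
ss_residual = ss_within - ss_model                                     # ~1048.375

df_model, df_residual = k - 1, (n_essays - 1) * (k - 1)                # 3 and 21
ms_model = ss_model / df_model                                         # ~184.7
ms_residual = ss_residual / df_residual                                # ~49.9
print(f"F({df_model}, {df_residual}) = {ms_model / ms_residual:.2f}")  # ~3.70
```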
Task 2

Repeat the analysis above on SPSS and interpret the results.

Initial Output for One-Way Repeated-Measures ANOVA
The first table lists the variables that represent each level of the independent variable. This box is useful to check that the variables were entered in the correct order. The next table provides basic descriptive statistics for the four levels of the independent variable. From this table we can see that, on average, Dr Field gave the highest marks to the essays (that's because I'm so nice, you


see... or it could be because I'm stupid and so have low academic standards?). Dr Death, on the other hand, gave very low grades. These mean values are useful for interpreting any effects that may emerge from the main analysis.

Within-Subjects Factors (Measure: MEASURE_1)
TUTOR   Dependent Variable
1       TUTOR1
2       TUTOR2
3       TUTOR3
4       TUTOR4

Descriptive Statistics
             Mean      Std. Deviation   N
Dr. Field    68.8750   5.6426           8
Dr. Smith    64.2500   4.7132           8
Dr. Scrote   65.2500   6.9230           8
Dr. Death    57.3750   7.9091           8

The next part of the output contains information about Mauchly's test. This test should be non-significant if we are to assume that the condition of sphericity has been met. The output shows Mauchly's test for the tutor data, and the important column is the one containing the significance value. The significance value (.043) is less than the critical value of .05, so we accept that the variances of the differences between levels are significantly different. In other words, the assumption of sphericity has been violated. Knowing that we have violated this assumption, a pertinent question is: how should we proceed?
Mauchly's Test of Sphericity (Measure: MEASURE_1)
Within Subjects   Mauchly's   Approx.                       Epsilon
Effect            W           Chi-Square   df   Sig.   Greenhouse-Geisser   Huynh-Feldt   Lower-bound
TUTOR             .131        11.628       5    .043   .558                 .712          .333

Tests the null hypothesis that the error covariance matrix of the orthonormalized transformed dependent variables is proportional to an identity matrix.
a. Design: Intercept; Within Subjects Design: TUTOR
b. May be used to adjust the degrees of freedom for the averaged tests of significance. Corrected tests are displayed in the layers (by default) of the Tests of Within-Subjects Effects table.


SPSS produces three corrections based upon the estimates of sphericity advocated by Greenhouse and Geisser (1959) and Huynh and Feldt (1976). Both of these estimates give rise to a correction factor that is applied to the degrees of freedom used to assess the observed F-ratio. The Greenhouse-Geisser correction varies between 1/(k - 1) (where k is the number of repeated-measures conditions) and 1. The closer that ε is to 1.00, the more homogeneous the variances of differences, and hence the closer the data are to being spherical. In a situation in which there are four conditions (as with our data) the lower limit of ε will be 1/(4 - 1), or 0.33 (known as the lower-bound estimate of sphericity). The calculated value of ε in the output is 0.558. This is closer to the lower limit of 0.33 than it is to the upper limit of 1 and it therefore represents a substantial deviation from sphericity. We will see how these values are used in the next section.

The Main ANOVA
The next table in the output shows the results of the ANOVA for the within-subjects variable. This table can be read much the same as for one-way between-group ANOVA. There is a sum of squares for the repeated-measures effect of tutor, which tells us how much of the total variability is explained by the experimental effect. Note the value is 554.125, which is the model sum of squares (SSM) that we calculated in the previous task. There is also an error term, which is the amount of unexplained variation across the conditions of the repeated-measures variable. This is the residual sum of squares (SSR) that we calculated above; note the value is 1048.375 (the same value as calculated by hand). As I explained earlier, these sums of squares are converted into mean squares by dividing by the degrees of freedom. As we saw before, the df for the effect of tutor are simply k - 1, where k is the number of levels of the independent variable. The


error df are (n - 1)(k - 1), where n is the number of participants (or in this case, the number of essays) and k is as before. The F-ratio is obtained by dividing the mean squares for the experimental effect (184.708) by the error mean squares (49.923). As with between-group ANOVA, this test statistic represents the ratio of systematic variance to unsystematic variance. The value of F (184.71/49.92 = 3.70) is then compared against a critical value for 3 and 21 degrees of freedom. SPSS displays the exact significance level for the F-ratio. The significance of F is .028, which is significant because it is less than the criterion value of .05. We can, therefore, conclude that there was a significant difference between the marks awarded by the four lecturers. However, this main test does not tell us which lecturers differed from each other in their marking.
Tests of Within-Subjects Effects (Measure: MEASURE_1)
Source                              Type III SS   df       Mean Square   F       Sig.
TUTOR          Sphericity Assumed   554.125       3        184.708       3.700   .028
               Greenhouse-Geisser   554.125       1.673    331.245       3.700   .063
               Huynh-Feldt          554.125       2.137    259.329       3.700   .047
               Lower-bound          554.125       1.000    554.125       3.700   .096
Error(TUTOR)   Sphericity Assumed   1048.375      21       49.923
               Greenhouse-Geisser   1048.375      11.710   89.528
               Huynh-Feldt          1048.375      14.957   70.091
               Lower-bound          1048.375      7.000    149.768
a. Computed using alpha = .05

Although this result seems very plausible, we have learnt that the violation of the sphericity assumption makes the F-test inaccurate. We know from Mauchly's test that these data were non-spherical and so we need to make allowances for this violation. The SPSS output shows the F-ratio and associated degrees of freedom when sphericity is assumed, and the significant F-statistic indicated some difference(s) between the mean marks given by the four lecturers. In versions of SPSS after version 8, this table also

contains several additional rows giving the corrected values of F for the three different types of adjustment (Greenhouse-Geisser, Huynh-Feldt and lower-bound). Notice that in all cases the F-ratios remain the same; it is the degrees of freedom that change (and hence the critical value against which the obtained F-statistic is compared). The degrees of freedom have been adjusted using the estimates of sphericity calculated by SPSS. The adjustment is made by multiplying the degrees of freedom by the estimate of sphericity.1 The new degrees of freedom are then used to ascertain the significance of F. For these data the corrections result in the observed F being non-significant when using the Greenhouse-Geisser correction (because p > .05). However, it was noted earlier that this correction is quite conservative, and so can miss effects that genuinely exist. It is, therefore, useful to consult the Huynh-Feldt-corrected F-statistic. Using this correction, the F-value is still significant because the probability value of .047 is just below the criterion value of .05. So, by this correction we would accept the hypothesis that the lecturers differed in their marking. However, it was also noted earlier that this correction is quite liberal and so tends to accept values as significant when, in reality, they are not significant. This leaves us with the puzzling dilemma of whether or not to accept this F-statistic as significant. I mentioned earlier that Stevens (2002) recommends taking an average of the two estimates, and certainly when the two corrections give different results (as is the case here) this is wise advice. If the two corrections give rise to the same conclusion it makes little difference which you choose to report (although if you accept the F-statistic as significant it is best to report the conservative Greenhouse-Geisser estimate to avoid criticism!).
1 For example, the Greenhouse-Geisser estimate of sphericity was 0.558. The original degrees of freedom for the model were 3; this value is corrected by multiplying by the estimate of sphericity (3 × 0.558 = 1.674). Likewise the error df were 21; this value is corrected in the same way (21 × 0.558 = 11.718). The F-ratio is then tested against a critical value with these new degrees of freedom (1.674, 11.718). The other corrections are applied in the same way.
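The footnote's correction is easy to replicate: multiply both degrees of freedom by the epsilon estimate and re-evaluate the same F-ratio against an F distribution. The sketch below is my own illustration (not part of the original answer) and should reproduce, up to rounding, the .063 and .047 shown in the SPSS table.

```python
# A sketch of the sphericity-corrected p-values for F = 3.70 with df = 3, 21.
from scipy import stats

F, df_model, df_error = 3.700, 3, 21

for label, epsilon in [("Greenhouse-Geisser", 0.558), ("Huynh-Feldt", 0.712)]:
    p = stats.f.sf(F, df_model * epsilon, df_error * epsilon)
    print(f"{label}: F({df_model * epsilon:.3f}, {df_error * epsilon:.3f}), p = {p:.3f}")
```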


Although it is easy to calculate the average of the two correction factors and to correct the degrees of freedom accordingly, it is not so easy to then calculate an exact probability for those degrees of freedom. Therefore, should you ever be faced with this perplexing situation (and to be honest that's fairly unlikely) I recommend taking an average of the two significance values to give you a rough idea of which correction is giving the most accurate answer. In this case, the average of the two p-values is (.063 + .047)/2 = .055. Therefore, we should probably go with the Greenhouse-Geisser correction and conclude that the F-ratio is non-significant. These data illustrate how important it is to use a valid critical value of F: it can mean the difference between a statistically significant result and a non-significant result. More importantly, it can mean the difference between making a Type I error and not. Had we not used the corrections for sphericity we would have concluded erroneously that the markers gave significantly different marks. However, I should qualify this statement by saying that this example also highlights how arbitrary it is that we use a .05 level of significance. These two corrections produce significance values only marginally less than or more than .05, and yet they lead to completely opposite conclusions! So, we might be well advised to look at an effect size to see whether the effect is substantive regardless of its significance. We also saw earlier that a final option, when you have data that violate sphericity, is to use multivariate test statistics (MANOVA) because they do not make this assumption (see O'Brien & Kaiser, 1985). The repeated-measures procedure in SPSS automatically produces multivariate test statistics. The next output shows the multivariate test statistics for this example. The column displaying the significance values clearly shows that the


multivariate tests are non-significant (because p is .063, which is greater than the criterion value of .05). Bearing in mind the loss of power in these tests, this result supports the decision to accept the null hypothesis and conclude that there are no significant differences between the marks given by different lecturers. The interpretation of these results should stop now because the main effect is non-significant. However, we will look at the output for contrasts to illustrate how these tests are displayed in the SPSS Viewer.
Multivariate Tests
Effect TUTOR           Value   F        Hypothesis df   Error df   Sig.
Pillai's Trace          .741   4.760c   3.000           5.000      .063
Wilks' Lambda           .259   4.760c   3.000           5.000      .063
Hotelling's Trace      2.856   4.760c   3.000           5.000      .063
Roy's Largest Root     2.856   4.760c   3.000           5.000      .063
a. Design: Intercept; Within Subjects Design: TUTOR
b. Computed using alpha = .05
c. Exact statistic

Contrasts
The transformation matrix requested in the options is shown in the next SPSS output and we have to draw on our knowledge of contrast coding to interpret this table. The first thing to remember is that a code of 0 means that the group is not included in a contrast. Therefore, contrast 1 (labelled Level 1 vs. Level 2 in the table) ignores Dr Scrote and Dr Death. The next thing to remember is that groups with a negative weight are compared to groups with a positive weight. In this case this means that the first contrast compares Dr Field against Dr Smith. Using the same logic, contrast 2 (labelled Level 2 vs. Level 3) ignores Dr Field and Dr Death and compares Dr Smith and Dr Scrote. Finally, contrast 3 (Level 3 vs. Level 4) compares Dr Death with Dr Scrote. This pattern of contrasts is consistent with what we expect to get from a repeated contrast (i.e. all groups except the

first are compared to the preceding category). The transformation matrix, which appears at the bottom of the output, is used primarily to confirm what each contrast represents.

TUTOR (Measure: MEASURE_1)
Dependent Variable   Level 1 vs. Level 2   Level 2 vs. Level 3   Level 3 vs. Level 4
Dr. Field             1                     0                     0
Dr. Smith            -1                     1                     0
Dr. Scrote            0                    -1                     1
Dr. Death             0                     0                    -1
a. The contrasts for the within subjects factors are: TUTOR: Repeated contrast

Above the transformation matrix, we should find a summary table of the contrasts. Each contrast is listed in turn, and as with between-group contrasts, an F-test is performed that compares the two chunks of variation. So, looking at the significance values from the table, we could say that Dr Field marked significantly more highly than Dr Smith (Level 1 vs. Level 2), but that Dr Smith's marks were roughly equal to Dr Scrote's (Level 2 vs. Level 3) and Dr Scrote's marks were roughly equal to Dr Death's (Level 3 vs. Level 4). However, the significant contrast should be ignored because of the non-significant main effect (remember that the data did not obey sphericity). The important point to note is that the lack of sphericity in our data has led to some important issues being raised about correction factors, and about applying discretion to your data (it's comforting to know that the computer does not have all of the answers, but it's slightly alarming to realize that this means we have to actually know some of the answers ourselves). In this example we would have to conclude that no significant differences existed between the marks given by different lecturers. However, the ambiguity of our data might make us consider running a similar study with a greater number of essays being marked.


Tests of Within-Subjects Contrasts (Measure: MEASURE_1)
Source         TUTOR                 Type III SS   df   Mean Square   F        Sig.
TUTOR          Level 1 vs. Level 2   171.125       1    171.125       18.184   .004
               Level 2 vs. Level 3   8.000         1    8.000         .152     .708
               Level 3 vs. Level 4   496.125       1    496.125       3.436    .106
Error(TUTOR)   Level 1 vs. Level 2   65.875        7    9.411
               Level 2 vs. Level 3   368.000       7    52.571
               Level 3 vs. Level 4   1010.875      7    144.411

Post Hoc Tests
If you selected post hoc tests for the repeated-measures variable in the options dialog box, then the table below will be produced in the SPSS Viewer.
Pairwise Comparisons (Measure: MEASURE_1)
(I) TUTOR   (J) TUTOR   Mean Difference (I-J)   Std. Error   Sig.    95% CI Lower   95% CI Upper
1           2             4.625*                1.085         .022      .682           8.568
1           3             3.625                 2.841        1.000    -6.703          13.953
1           4            11.500                 4.675         .261    -5.498          28.498
2           1            -4.625*                1.085         .022    -8.568           -.682
2           3            -1.000                 2.563        1.000   -10.320           8.320
2           4             6.875                 4.377         .961    -9.039          22.789
3           1            -3.625                 2.841        1.000   -13.953           6.703
3           2             1.000                 2.563        1.000    -8.320          10.320
3           4             7.875                 4.249         .637    -7.572          23.322
4           1           -11.500                 4.675         .261   -28.498           5.498
4           2            -6.875                 4.377         .961   -22.789           9.039
4           3            -7.875                 4.249         .637   -23.322           7.572
Based on estimated marginal means
*. The mean difference is significant at the .05 level.
a. Adjustment for multiple comparisons: Bonferroni.

The difference between group means is displayed, and also the standard error, the significance value and a confidence interval for the difference between means. By looking at the significance values we can see that the only difference between group means is between Dr Field and Dr Smith. Looking at the means of these groups we can see that I give significantly higher marks than Dr Smith. However, there is a rather


anomalous result in that there is no significant difference between the marks given by Dr Death and myself even though the mean difference between our marks is higher (11.5) than the mean difference between myself and Dr Smith (4.6). The reason for this result is the lack of sphericity in the data. The interested reader might like to run some correlations between the four tutors' grades. You will find that there is a very high positive correlation between the marks given by Dr Smith and myself (indicating a low level of variability in our data). However, there is a very low correlation between the marks given by Dr Death and myself (indicating a high level of variability between our marks). It is this large variability between Dr Death and myself that has produced the non-significant result despite the average marks being very different (this observation is also evident from the standard errors).

Effect Sizes for Repeated-Measures ANOVA
In repeated-measures ANOVA, the equation for ω² is (hang onto your hats):

ω² = [((k - 1)/nk)(MS_M - MS_R)] / [MS_R + (MS_BG - MS_R)/k + ((k - 1)/nk)(MS_M - MS_R)]

SPSS doesn't give us SS_W in the output, but we know that this is made up of SS_M and SS_R, which we are given. By substituting these terms and rearranging the equation, we get:

SS_T = SS_BG + SS_M + SS_R
SS_BG = SS_T - SS_M - SS_R


The next problem is that SPSS, which is clearly trying to hinder us at every step, doesn't give us SS_T and I'm afraid (unless I've missed something in the output) you're just going to have to calculate it by hand. From the values we calculated earlier, you should get:

SS_BG = 1705.868 - 554.125 - 1048.375 = 103.37

The next step is to convert this to a mean squares by dividing by the degrees of freedom, which in this case are the number of people in the experiment minus 1 (n - 1):

MS_BG = SS_BG / df_BG = SS_BG / (n - 1) = 103.37 / (8 - 1) = 14.77

Having done all this, and probably died of boredom in the process, we must now resurrect ourselves with renewed vigour for the effect size equation, which becomes:

ω² = [((4 - 1)/(8 × 4))(184.71 - 49.92)] / [49.92 + (14.77 - 49.92)/4 + ((4 - 1)/(8 × 4))(184.71 - 49.92)]
   = 12.64 / 53.77
   = 0.24
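The arithmetic is fiddly, so here is a small sketch (mine, not the chapter's) that plugs the mean squares from the output and the MS_BG value just derived into the same equation.

```python
# A sketch of the repeated-measures omega-squared calculation above.
k, n = 4, 8                      # conditions (tutors) and participants (essays)
ms_model, ms_residual = 184.71, 49.92
ms_between = 14.77               # MS_BG calculated above

effect_var = ((k - 1) / (n * k)) * (ms_model - ms_residual)             # ~12.64
total_var = ms_residual + (ms_between - ms_residual) / k + effect_var   # ~53.77
print(f"omega^2 = {effect_var / total_var:.2f}")                        # ~0.24
```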

So, we get ω² = .24. If you calculate it the same way as for the independent ANOVA you should get a slightly bigger answer (.25 in fact). I've mentioned at various other points that it's actually more useful to have effect size measures for focused comparisons anyway (rather than the main ANOVA), and so a slightly easier approach to calculating effect sizes is to calculate them for the contrasts we did.


For these we can use the equation that we've seen before to convert the F-values (because they all have 1 degree of freedom for the model) to r:
r = √[ F(1, df_R) / (F(1, df_R) + df_R) ]

For the three comparisons we did, we would get:


r(Field vs. Smith) = √(18.18 / (18.18 + 7)) = 0.85
r(Smith vs. Scrote) = √(0.15 / (0.15 + 7)) = 0.14
r(Scrote vs. Death) = √(3.44 / (3.44 + 7)) = 0.57
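The same conversions are easy to script; the sketch below (mine, not the chapter's) takes the three F-values and the residual df of 7 from the contrasts table.

```python
# A sketch converting each contrast's F-ratio into an effect size r.
from math import sqrt

df_residual = 7
contrasts = {"Field vs. Smith": 18.18, "Smith vs. Scrote": 0.15, "Scrote vs. Death": 3.44}

for name, F in contrasts.items():
    r = sqrt(F / (F + df_residual))
    print(f"r ({name}) = {r:.2f}")
```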

Therefore, the differences between Drs Field and Smith and between Drs Scrote and Death were both large effects, but the difference between Drs Smith and Scrote was small.

Reporting One-Way Repeated-Measures ANOVA
We could report the main finding as: The results show that the mark of an essay was not significantly affected by the lecturer that marked it, F(1.67, 11.71) = 3.70, p > .05. If you choose to report the sphericity test as well, you should report the chi-square approximation, its degrees of freedom and the significance value. It's also nice to report the degree of sphericity by reporting the epsilon value. We'll also report the effect size in this improved version: Mauchly's test indicated that the assumption of sphericity had been violated (χ²(5) = 11.63, p < .05), therefore degrees of freedom were corrected using Greenhouse-Geisser estimates of sphericity (ε = .56). The results show that the mark of an essay was not significantly affected by the lecturer that marked it, F(1.67, 11.71) = 3.70, p > .05, ω² = .24. Remember that because the main ANOVA was not significant we shouldn't report any further analysis.
Task 3

Imagine I wanted to look at the effect alcohol has on the roving eye. The roving eye effect is the propensity of people in relationships to eye up members of the opposite sex. I took 20 men and fitted them with incredibly sophisticated glasses that could track their eye movements and record both the movement and the object being observed (this is the point at which it should be apparent that I'm making it up as I go along). Over four different nights I plied these poor souls with 1, 2, 3 or 4 pints of strong lager in a night-club. Each night I measured how many different women they eyed up (a woman was categorized as having been eyed up if the man's eye moved from her head to toe and back up again). To validate this measure we also collected the amount of dribble on the man's chin while looking at a woman. The data are in the file RovingEye.sav. Analyse them with a one-way ANOVA.

SPSS Output


This error bar chart of the roving eye data shows the mean number of women that were eyed up after different doses of alcohol. It's clear from this chart that the mean number of women is pretty similar between 1 and 2 pints, and for 3 and 4 pints, but there is a jump after 2 pints.
Within-Subjects Factors (Measure: MEASURE_1)
ALCOHOL   Dependent Variable
1         PINT1
2         PINT2
3         PINT3
4         PINT4

Descriptive Statistics
          Mean      Std. Deviation   N
1 Pint    11.7500   4.31491          20
2 Pints   11.7000   4.65776          20
3 Pints   15.2000   5.80018          20
4 Pints   14.9500   4.67327          20

These outputs show the initial diagnostic statistics. First, we are told the variables that represent each level of the independent variable. This box is useful to check that the variables were entered in the correct order. The next table provides basic descriptive statistics for the four levels of the independent variable. This table confirms what we saw in the graph.


Mauchly's Test of Sphericity (Measure: MEASURE_1)
Within Subjects   Mauchly's   Approx.                       Epsilon
Effect            W           Chi-Square   df   Sig.   Greenhouse-Geisser   Huynh-Feldt   Lower-bound
ALCOHOL           .477        13.122       5    .022   .745                 .849          .333

Tests the null hypothesis that the error covariance matrix of the orthonormalized transformed dependent variables is proportional to an identity matrix.
a. May be used to adjust the degrees of freedom for the averaged tests of significance. Corrected tests are displayed in the Tests of Within-Subjects Effects table.
b. Design: Intercept; Within Subjects Design: ALCOHOL

The next part of the output contains Mauchly's test, and we hope to find that it's non-significant if we are to assume that the condition of sphericity has been met. However, the significance value (0.022) is less than the critical value of 0.05, so we accept that the assumption of sphericity has been violated.
Tests of Within-Subjects Effects (Measure: MEASURE_1)
Source                               Type III SS   df       Mean Square   F       Sig.
ALCOHOL          Sphericity Assumed  225.100       3        75.033        4.729   .005
                 Greenhouse-Geisser  225.100       2.235    100.706       4.729   .011
                 Huynh-Feldt         225.100       2.547    88.370        4.729   .008
                 Lower-bound         225.100       1.000    225.100       4.729   .042
Error(ALCOHOL)   Sphericity Assumed  904.400       57       15.867
                 Greenhouse-Geisser  904.400       42.469   21.296
                 Huynh-Feldt         904.400       48.398   18.687
                 Lower-bound         904.400       19.000   47.600

This output shows the main result of the ANOVA. The significance of F is 0.005, which is significant because it is less than the criterion value of 0.05. We can, therefore, conclude that alcohol had a significant effect on the average number of women that were eyed up. However, this main test does not tell us which quantities of alcohol made a difference to the number of women eyed up. This result is all very nice but as yet we haven't done anything about our violation of the sphericity assumption. This table contains several additional rows giving the corrected values of F for the three different types of adjustment (Greenhouse-Geisser, Huynh-Feldt and lower-bound). First we decide which correction to apply, and to do this


we need to look at the estimates of sphericity: if the Greenhouse-Geisser and Huynh-Feldt estimates are less than 0.75 we should use Greenhouse-Geisser, and if they are above 0.75 we use Huynh-Feldt. We discovered in the book that based on these criteria we should use Huynh-Feldt here. Using this corrected value we still find a significant result because the observed p (.008) is still less than the criterion of .05.
Pairwise Comparisons (Measure: MEASURE_1)
(I) ALCOHOL   (J) ALCOHOL   Mean Difference (I-J)   Std. Error   Sig.     95% CI Lower   95% CI Upper
1             2               .050                   .742        1.000    -2.133          2.233
1             3             -3.450                  1.391         .136    -7.544           .644
1             4             -3.200                  1.454         .242    -7.480          1.080
2             1              -.050                   .742        1.000    -2.233          2.133
2             3             -3.500*                 1.139         .038    -6.853          -.147
2             4             -3.250                  1.420         .202    -7.429           .929
3             1              3.450                  1.391         .136     -.644          7.544
3             2              3.500*                 1.139         .038      .147          6.853
3             4               .250                  1.269        1.000    -3.485          3.985
4             1              3.200                  1.454         .242    -1.080          7.480
4             2              3.250                  1.420         .202     -.929          7.429
4             3              -.250                  1.269        1.000    -3.985          3.485
Based on estimated marginal means
*. The mean difference is significant at the .05 level.
a. Adjustment for multiple comparisons: Bonferroni.

The main effect of alcohol doesn't tell us anything about which doses of alcohol produced different results to other doses. So, we might do some post hoc tests as well. The output above shows the table from SPSS that contains these tests. We read down the column labelled Sig. and look for values less than 0.05. By looking at the significance values we can see that the only difference between condition means is between 2 and 3 pints of alcohol.

Interpreting and Writing the Result
We could report the main finding as: Mauchly's test indicated that the assumption of sphericity had been violated (χ²(5) = 13.12, p < .05), therefore degrees of freedom were corrected using Huynh-Feldt estimates of sphericity (ε = .85). The results show that the number of women eyed up was significantly affected by the amount of alcohol drunk, F(2.55, 48.40) = 4.73, p < .05, r = .40. Bonferroni post hoc tests revealed a significant difference in the number of women eyed up only between 2 and 3 pints (CI.95 = -6.85 (lower), -.15 (upper), p < .05). No other comparisons were significant (all ps > .05).
Task 4

In the previous chapter we came across the beergoggles effect: a severe perceptual distortion after imbibing vast quantities of alcohol. The specific visual distortion is that previously unattractive people suddenly become the hottest thing since Spicy Gonzalez extra hot Tabasco-marinated chillies. In short, one minute you're standing in a zoo admiring the orang-utans, and the next you're wondering why someone would put Gail Porter (or whatever her surname is now) into a cage. Anyway, in that chapter, a blatantly fabricated data set demonstrated that the beergoggles effect was much stronger for men than women, and took effect only after 2 pints. Imagine we wanted to follow this finding up to look at what factors mediate the beergoggles effect. Specifically, we thought that the beer goggles effect might be made worse by the fact that it usually occurs in clubs, which have dim lighting. We took a sample of 26 men (because the effect is stronger in men) and gave them various doses of alcohol over four different weeks (0 pints, 2 pints, 4 pints and 6 pints of lager). This is our first independent variable, which we'll call alcohol consumption, and it has four levels. Each week (and, therefore, in each state of drunkenness) participants were asked to select a mate in a normal club (that had dim lighting) and then select a second mate in a


specially designed club that had bright lighting. As such, the second independent variable was whether the club had dim or bright lighting. The outcome measure was the attractiveness of each mate as assessed by a panel of independent judges. To recap, all participants took part in all levels of the alcohol consumption variable, and selected mates in both brightly and dimly lit clubs. The data are in the file BeerGogglesLighting.sav. Analyse them with a two-way repeated-measures ANOVA.

SPSS Output

[Error bar line chart: Mean Attractiveness (%) by Alcohol Consumption (0, 2, 4 and 6 Pints), with separate lines for Dim Lighting and Bright Lighting]

This chart displays the mean attractiveness of the partner selected (with error bars) in dim and brightly lit clubs after the different doses of alcohol. The chart shows that in both dim and brightly lit clubs there is a tendency for men to select less attractive mates as they consume more and more alcohol.
Descriptive Statistics
                            Mean      Std. Deviation   N
0 Pints (Dim Lighting)      65.0000   10.30728         26
2 Pints (Dim Lighting)      65.4615    8.76005         26
4 Pints (Dim Lighting)      37.2308   10.86391         26
6 Pints (Dim Lighting)      21.3077   10.67247         26
0 Pints (Bright Lighting)   61.5769    9.70432         26
2 Pints (Bright Lighting)   60.6538   10.65060         26
4 Pints (Bright Lighting)   50.7692   10.34334         26
6 Pints (Bright Lighting)   40.7692   10.77519         26

This shows the means for all conditions in a table. These means correspond to those plotted in the graph.
Mauchly's Test of Sphericity (Measure: MEASURE_1)
Within Subjects      Mauchly's   Approx.                       Epsilon
Effect               W           Chi-Square   df   Sig.   Greenhouse-Geisser   Huynh-Feldt   Lower-bound
LIGHTING             1.000        .000        0    .      1.000                1.000         1.000
ALCOHOL               .820       4.700        5    .454    .873                 .984          .333
LIGHTING * ALCOHOL    .898       2.557        5    .768    .936                1.000          .333

Tests the null hypothesis that the error covariance matrix of the orthonormalized transformed dependent variables is proportional to an identity matrix.
a. May be used to adjust the degrees of freedom for the averaged tests of significance. Corrected tests are displayed in the Tests of Within-Subjects Effects table.
b. Design: Intercept; Within Subjects Design: LIGHTING+ALCOHOL+LIGHTING*ALCOHOL

The lighting variable had only two levels (dim or bright) and so the assumption of sphericity doesn't apply and SPSS doesn't produce a significance value. However, for the effects of alcohol consumption and the interaction of alcohol consumption and lighting, we do have to look at Mauchly's test. The significance values are both above 0.05 (they are 0.454 and 0.768 respectively) and so we know that the assumption of sphericity has been met for both alcohol consumption and the interaction of alcohol consumption and lighting.


Tests of Within-Subjects Effects (Measure: MEASURE_1)
Source                                        Type III SS   df       Mean Square   F         Sig.
LIGHTING                 Sphericity Assumed    1993.923      1        1993.923      23.421    .000
                         Greenhouse-Geisser    1993.923      1.000    1993.923      23.421    .000
                         Huynh-Feldt           1993.923      1.000    1993.923      23.421    .000
                         Lower-bound           1993.923      1.000    1993.923      23.421    .000
Error(LIGHTING)          Sphericity Assumed    2128.327     25          85.133
                         Greenhouse-Geisser    2128.327     25.000      85.133
                         Huynh-Feldt           2128.327     25.000      85.133
                         Lower-bound           2128.327     25.000      85.133
ALCOHOL                  Sphericity Assumed   38591.654      3       12863.885     104.385    .000
                         Greenhouse-Geisser   38591.654      2.619   14736.844     104.385    .000
                         Huynh-Feldt          38591.654      2.953   13069.660     104.385    .000
                         Lower-bound          38591.654      1.000   38591.654     104.385    .000
Error(ALCOHOL)           Sphericity Assumed    9242.596     75         123.235
                         Greenhouse-Geisser    9242.596     65.468     141.177
                         Huynh-Feldt           9242.596     73.819     125.206
                         Lower-bound           9242.596     25.000     369.704
LIGHTING * ALCOHOL       Sphericity Assumed    5765.423      3        1921.808      22.218    .000
                         Greenhouse-Geisser    5765.423      2.809    2052.286      22.218    .000
                         Huynh-Feldt           5765.423      3.000    1921.808      22.218    .000
                         Lower-bound           5765.423      1.000    5765.423      22.218    .000
Error(LIGHTING*ALCOHOL)  Sphericity Assumed    6487.327     75          86.498
                         Greenhouse-Geisser    6487.327     70.232      92.370
                         Huynh-Feldt           6487.327     75.000      86.498
                         Lower-bound           6487.327     25.000     259.493

This output shows the main ANOVA summary table. The main effect of lighting is shown by the F-ratio in the row labelled LIGHTING. The significance of this value is 0.000, which is well below the usual cut-off point of 0.05. We can conclude that average attractiveness ratings were significantly affected by whether mates were selected in a dim or well-lit club. We can easily interpret this result further because there were only two levels: attractiveness ratings were higher in the well-lit clubs, so we could conclude that when we ignore how much alcohol was consumed, the mates selected in well-lit clubs were significantly more attractive than those chosen in dim clubs.

[Bar chart of the lighting main effect: Mean Attractiveness (%) for Dim and Bright Lighting]

The main effect of alcohol consumption is shown by the F-ratio in the row labelled ALCOHOL. The probability associated with this F-ratio is reported as 0.000 (i.e. p < 0.001), which is well below the critical value of 0.05. We can conclude that there was a significant main effect of the amount of alcohol consumed on the attractiveness of the mate selected. We know that generally there was an effect, but without further tests (e.g. post hoc comparisons) we can't say exactly which doses of alcohol had the most effect. I've plotted the means for the four doses. This graph shows that when you ignore the lighting in the club, the attractiveness of mates is similar after no alcohol and 2 pints of lager but starts to rapidly decline at 4 pints and continues to decline after 6 pints.

[Line chart of the alcohol main effect: Mean Attractiveness (%) by Alcohol Consumption (0, 2, 4 and 6 Pints), lighting ignored]
Pairwise Comparisons (Measure: MEASURE_1)
(I) ALCOHOL   (J) ALCOHOL   Mean Difference (I-J)   Std. Error   Sig.     95% CI Lower   95% CI Upper
1             2                .231                 2.006        1.000     -5.517          5.978
1             3              19.288*                2.576         .000     11.909         26.668
1             4              32.250*                1.901         .000     26.804         37.696
2             1               -.231                 2.006        1.000     -5.978          5.517
2             3              19.058*                2.075         .000     13.112         25.003
2             4              32.019*                1.963         .000     26.395         37.644
3             1             -19.288*                2.576         .000    -26.668        -11.909
3             2             -19.058*                2.075         .000    -25.003        -13.112
3             4              12.962*                2.450         .000      5.942         19.981
4             1             -32.250*                1.901         .000    -37.696        -26.804
4             2             -32.019*                1.963         .000    -37.644        -26.395
4             3             -12.962*                2.450         .000    -19.981         -5.942
Based on estimated marginal means
*. The mean difference is significant at the .05 level.
a. Adjustment for multiple comparisons: Bonferroni.

This output shows some post hoc tests for the main effect of alcohol. In this example I've chosen a Bonferroni correction. The main column of interest is the one labelled Sig., but the confidence intervals also tell us the likely difference between means if we were to take other samples. The mean attractiveness was significantly higher after no pints than it was after 4 pints and 6 pints (both ps are less than 0.001). We can also see that the mean attractiveness after 2 pints was significantly higher than after 4 pints and 6 pints (again, both ps are less than 0.001). Finally, the mean attractiveness after 4 pints was significantly higher than after 6 pints (p is less than 0.001). So, we can conclude that the


beergoggles effect doesn't kick in until after 2 pints, and that it has an ever-increasing effect (well, up to 6 pints at any rate!). The interaction effect is shown by the F-ratio in the row labelled LIGHTING*ALCOHOL. The resulting F-ratio is 22.22 (1921.81/86.50), which has an associated probability value of 0.000 (i.e. p < 0.001). As such, there is a significant interaction between the amount of alcohol consumed and the lighting in the club on the attractiveness of the mate selected.
Tests of Within-Subjects Contrasts (Measure: MEASURE_1)
Source                    LIGHTING              ALCOHOL               Type III SS   df   Mean Square   F        Sig.
LIGHTING                  Level 1 vs. Level 2                           996.962      1     996.962     23.421   .000
Error(LIGHTING)           Level 1 vs. Level 2                          1064.163     25      42.567
ALCOHOL                                         Level 1 vs. Level 2       1.385      1       1.385       .013   .909
                                                Level 2 vs. Level 3    9443.087      1    9443.087     84.323   .000
                                                Level 3 vs. Level 4    4368.038      1    4368.038     27.983   .000
Error(ALCOHOL)                                  Level 1 vs. Level 2    2616.115     25     104.645
                                                Level 2 vs. Level 3    2799.663     25     111.987
                                                Level 3 vs. Level 4    3902.462     25     156.098
LIGHTING * ALCOHOL        Level 1 vs. Level 2   Level 1 vs. Level 2      49.846      1      49.846       .144   .708
                          Level 1 vs. Level 2   Level 2 vs. Level 3    8751.115      1    8751.115     24.749   .000
                          Level 1 vs. Level 2   Level 3 vs. Level 4     912.154      1     912.154      2.157   .154
Error(LIGHTING*ALCOHOL)   Level 1 vs. Level 2   Level 1 vs. Level 2    8680.154     25     347.206
                          Level 1 vs. Level 2   Level 2 vs. Level 3    8839.885     25     353.595
                          Level 1 vs. Level 2   Level 3 vs. Level 4   10569.846     25     422.794

This output shows the output from a set of contrasts that compare each level of the alcohol variable to the previous level of that variable (this is called a repeated contrast in SPSS). So, it compares no pints with 2 pints (Level 1 vs. Level 2), 2 pints with 4 pints (Level 2 vs. Level 3) and 4 pints with 6 pints (Level 3 vs. Level 4). As you can see from the output, if we just look at the main effect of group these contrasts tell us what we already know from the post hoc tests: that is, the attractiveness after no alcohol doesn't differ from the attractiveness after 2 pints, F(1, 25) < 1, the attractiveness after 4 pints does differ from that after 2 pints, F(1, 25) = 84.32, p < 0.001, and the attractiveness after 6 pints does differ from that after 4 pints, F(1, 25) = 27.98, p < 0.001. More interesting is to look at the interaction term in the table. This compares the same levels of the alcohol variable, but for each comparison it is also comparing the difference between the means


for the dim and brightly lit clubs. One way to think of this is to look at the interaction graph and note the vertical differences between the means for dim and bright clubs at each level of alcohol. When nothing was drunk, the distance between the bright and dim means is quite small (it's actually
3.42 units on the attractiveness scale), and when 2 pints of alcohol are drunk the difference between the dim and well-lit club is still quite small (4.81 units to be precise). The first contrast is comparing the difference between dim and bright clubs when nothing was drunk with the difference between dim and bright clubs when 2 pints were drunk. So, it is asking: is 3.42 significantly different from 4.81? The answer is no, because the F-ratio is non-significant; in fact, it's less than 1 (F(1, 25) < 1). The second contrast for the interaction compares the difference between dim and bright clubs when 2 pints were drunk (4.81) with the difference between dim and bright clubs when 4 pints were drunk (this difference is 13.54; note that the direction of the difference has changed, as indicated by the lines crossing in the graph). This difference is significant (F(1, 25) = 24.75, p < 0.001). The final contrast for the interaction compares the difference between dim and bright clubs when 4 pints were drunk (13.54) with the difference between dim and bright clubs when 6 pints were drunk (this difference is 19.46). This contrast is not significant (F(1, 25) = 2.16, ns). So, we could conclude that there was a significant interaction between the amount of alcohol drunk and the lighting in the club. Specifically, the effect of alcohol after 2 pints on the attractiveness of the mate was much more pronounced when the lights were dim.
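To make the interaction contrasts more concrete, this sketch (mine, not the chapter's) computes the dim-minus-bright gap at each dose from the descriptive means reported earlier; the sign flip after 2 pints is the crossing of the lines in the interaction graph.

```python
# A sketch of the vertical dim-vs-bright differences compared by the interaction contrasts.
dim    = {"0 Pints": 65.00, "2 Pints": 65.46, "4 Pints": 37.23, "6 Pints": 21.31}
bright = {"0 Pints": 61.58, "2 Pints": 60.65, "4 Pints": 50.77, "6 Pints": 40.77}

for dose in dim:
    gap = dim[dose] - bright[dose]
    print(f"{dose}: dim - bright = {gap:+.2f}")
# roughly +3.4 and +4.8 after 0 and 2 pints, then -13.5 and -19.5 after 4 and 6 pints
```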

Writing the Result

We can report the three effects from this analysis as follows: The results show that the attractiveness of the mates selected was significantly lower when the lighting in the club was dim compared to when the lighting was bright, F(1, 25) = 23.42, p < .001. The main effect of alcohol on the attractiveness of mates selected was significant, F(3, 75) = 104.39, p < .001. This indicated that when the lighting in the club was ignored, the attractiveness of the mates selected differed according to how much alcohol was drunk before the selection was made. Specifically, post hoc tests revealed that compared to a baseline of when no alcohol had been consumed, the attractiveness of selected mates was not different after 2 pints (p > .05), but was significantly lower after 4 and 6 pints (both ps < .001). The mean attractiveness after 2 pints was also significantly higher than after 4 pints and 6 pints (both ps < .001), and the mean attractiveness after 4 pints was significantly higher than after 6 pints (p < .001). To sum up, the beergoggles effect seems to take effect after 2 pints have been consumed and has an increasing impact until 6 pints are consumed. The lighting × alcohol interaction was significant, F(3, 75) = 22.22, p < .001, indicating that the effect of alcohol on the attractiveness of the mates selected differed when lighting was dim compared to when it was bright. Contrasts on this interaction term revealed that when the difference in attractiveness ratings between dim and bright clubs was compared after no alcohol and after 2 pints had been drunk there was no significant difference, F(1, 25) < 1. However, when comparing the difference between dim and bright clubs when 2 pints were drunk with the difference after 4 pints were drunk a significant difference emerged, F(1, 25) = 24.75, p < .001. A final contrast revealed that the difference between dim and bright clubs after 4 pints were drunk compared to after 6 pints was not significant, F(1, 25) = 2.16, ns. To sum up, there was a significant interaction between the amount of alcohol drunk and the lighting in the club: the decline in the attractiveness of the selected mate seen after 2 pints (compared to after 4) was significantly more pronounced when the lights were dim.
Task 5

Change the syntax in SimpleEffectsAttitude.sps to look at the effect of drink at different levels of imagery.

The correct syntax to use is:

MANOVA beerpos beerneg beerneut winepos wineneg wineneut waterpos waterneg waterneu
 /WSFACTORS drink(3) imagery(3)
 /WSDESIGN = MWITHIN imagery(1) MWITHIN imagery(2) MWITHIN imagery(3)
 /PRINT SIGNIF( UNIV MULT AVERF HF GG ).

SPSS Output
The main part of the analysis is:

* * * * * * A n a l y s i s   o f   V a r i a n c e -- design 1 * * * * * *

Tests involving 'MWITHIN IMAGERY(1)' Within-Subject Effect.
Tests of Significance for T1 using UNIQUE sums of squares
Source of Variation      SS         DF    MS          F         Sig of F
WITHIN+RESIDUAL          1088.40    19    57.28
MWITHIN IMAGERY(1)       27136.27    1    27136.27    473.71    .000

Tests involving 'MWITHIN IMAGERY(2)' Within-Subject Effect.
Tests of Significance for T2 using UNIQUE sums of squares
Source of Variation      SS         DF    MS          F         Sig of F
WITHIN+RESIDUAL          3113.92    19    163.89
MWITHIN IMAGERY(2)       1870.42     1    1870.42     11.41     .003

Tests involving 'MWITHIN IMAGERY(3)' Within-Subject Effect.
Tests of Significance for T3 using UNIQUE sums of squares
Source of Variation      SS         DF    MS          F         Sig of F
WITHIN+RESIDUAL          1070.67    19    56.35
MWITHIN IMAGERY(3)       3840.00     1    3840.00     68.14     .000

What this shows is a significant effect of drink at level 1 of imagery. So, the ratings of the three drinks significantly differed when positive imagery was used. Because there are three levels of drink, though, this isn't that helpful in untangling what's going on. There is also a significant effect of drink at level 2 of imagery. So, the ratings of the three drinks significantly differed when negative imagery was used. Finally, there is also a significant effect of drink at level 3 of imagery. So, the ratings of the three drinks significantly differed when neutral imagery was used.
Chapter 14 Task 1


I am going to extend the example from the previous chapter (advertising and different imagery) by adding a between-group variable into the design.2 To recap, in case you haven't read the previous chapter, participants viewed a total of nine mock adverts over three sessions. In these adverts there were three products (a brand of beer, Brain Death, a brand of wine, Dangleberry, and a brand of water, Puritan). These could be presented alongside positive, negative or neutral imagery. Over the three sessions and nine adverts each type of product was paired with each type of imagery (read the previous chapter if you need more detail). After each advert participants rated the drinks on a scale ranging from -100 (dislike very much) through 0 (neutral) to 100 (like very much). The design, thus far, has two independent variables: the type of drink (beer, wine or water) and the type of imagery used (positive, negative or neutral). These two variables completely cross over, producing nine experimental conditions. Now imagine that I also took note of each person's gender. Subsequent to the previous analysis it occurred to me that men and women might respond differently to the products (because, in keeping with stereotypes, men might mostly drink lager whereas women might drink wine). Therefore, I wanted to reanalyse the data taking this additional variable into account. Now, gender is a between-group variable because a participant can be only male or female: they cannot participate as a male and then change into a female and participate again! The data are the same

2 Previously the example contained two repeated-measures variables (drink type and imagery type); now it will include three variables (two repeated-measures and one between-group).

as in the previous chapter and can be found in the file MixedAttitude.sav. Run a mixed ANOVA on these data.

To carry out the analysis on SPSS follow the same instructions that we did before: first access the define factors dialog box using the same menu path as in Chapter 13. We are using the same repeated-measures variables as in Chapter 13 of the book, so complete this dialog box exactly as shown there, and then click Define to access the main dialog box. This box should be completed exactly as before except that we must specify gender as a between-group variable by selecting it in the variables list and transferring it to the box labelled Between-Subjects Factors.

Gender has only two levels (male or female) so there is no need to specify contrasts for this variable; however, you should select simple contrasts for both drink and imagery. The addition of a between-group factor means that we can select post hoc tests for this variable by clicking the post hoc button. This action brings up the post hoc test dialog box, which can be used as previously explained. However, we need not specify any post hoc tests here because the between-group factor has only two levels.

The addition of an extra variable makes it necessary to choose a different graph to the one in the previous example. Open the plots dialog box and place drink and imagery in the same slots as for the previous example, but also place gender in the slot labelled Separate Plots. When all three variables have been specified, don't forget to click Add to add this combination to the list of plots. By asking SPSS to plot the drink × imagery × gender interaction, we should get the same interaction graph as before, except that a separate version of this graph will be produced for male and female subjects.

As far as other options are concerned, you should select the same ones that were chosen in Chapter 13. It is worth selecting estimated marginal means for all effects (because these values will help you to understand any significant effects), but to save space I did not ask for confidence intervals for these effects because we have considered this part of the output in some detail already. When all of the appropriate options have been selected, run the analysis (equivalent syntax is sketched below).
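If you prefer syntax to the dialog boxes, something along the following lines should reproduce the same analysis. This is a sketch rather than a definitive command file: the nine repeated-measures variable names are taken from the MANOVA syntax shown earlier in these answers, but the name of the between-group variable (assumed here to be gender) should be checked against MixedAttitude.sav.

GLM beerpos beerneg beerneut winepos wineneg wineneut waterpos waterneg waterneu BY gender
  /WSFACTOR = drink 3 Simple(3) imagery 3 Simple(3)
  /PLOT = PROFILE(drink*imagery*gender)
  /EMMEANS = TABLES(gender)
  /EMMEANS = TABLES(gender*drink)
  /EMMEANS = TABLES(gender*imagery)
  /EMMEANS = TABLES(gender*drink*imagery)
  /PRINT = DESCRIPTIVE HOMOGENEITY
  /WSDESIGN = drink imagery drink*imagery
  /DESIGN = gender.

The Simple(3) keyword requests simple contrasts with the last level (water, or neutral imagery) as the reference category, mirroring the contrasts chosen in the dialog boxes.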


Main Analysis

The initial output is the same as the two-way ANOVA example: there is a table listing the repeated-measures variables from the data editor and the level of each independent variable that they represent. The second table contains descriptive statistics (mean and standard deviation) for each of the nine conditions split according to whether participants were male or female. The names in this table are the names I gave the variables in the data editor (therefore, your output may differ slightly). These descriptive statistics are interesting because they show us the pattern of means across all experimental conditions (so, we use these means to produce the graphs of the three-way interaction). We can see that the variability among scores was greatest when beer was used as a product, and that when a corpse image was used the ratings given to the products were negative (as expected) for all conditions except the men in the beer condition. Likewise, ratings of products were very positive when a sexy person was used as the imagery irrespective of the gender of the participant, or the product being advertised.

Descriptive Statistics

Condition                     Gender    Mean       Std. Deviation   N
Beer + Sexy                   Male      24.8000    14.0063          10
                              Female    17.3000    11.3925          10
                              Total     21.0500    13.0080          20
Beer + Corpse                 Male      20.1000    7.8379           10
                              Female    -11.2000   5.1381           10
                              Total     4.4500     17.3037          20
Beer + Person in Armchair     Male      16.9000    8.5434           10
                              Female    3.1000     6.7074           10
                              Total     10.0000    10.2956          20
Wine + Sexy                   Male      22.3000    7.6311           10
                              Female    28.4000    4.1150           10
                              Total     25.3500    6.7378           20
Wine + Corpse                 Male      -7.8000    4.9396           10
                              Female    -16.2000   4.1312           10
                              Total     -12.0000   6.1815           20
Wine + Person in Armchair     Male      7.5000     4.9721           10
                              Female    15.8000    4.3919           10
                              Total     11.6500    6.2431           20
Water + Sexy                  Male      14.5000    6.7864           10
                              Female    20.3000    6.3953           10
                              Total     17.4000    7.0740           20
Water + Corpse                Male      -9.8000    6.7791           10
                              Female    -8.6000    7.1368           10
                              Total     -9.2000    6.8025           20
Water + Person in Armchair    Male      -2.1000    6.2973           10
                              Female    6.8000     3.8816           10
                              Total     2.3500     6.8386           20

The results of Mauchly's sphericity test are different to the example in Chapter 13, because the between-group factor is now being accounted for by the test. The main effect of drink still significantly violates the sphericity assumption (W = 0.572, p < 0.01) but the main effect of imagery no longer does. Therefore, the F-value for the main effect of drink (and its interaction with the between-group variable gender) needs to be corrected for this violation.
Mauchly's Test of Sphericity

Measure: MEASURE_1
Within Subjects Effect   Mauchly's W   Approx. Chi-Square   df   Sig.   Greenhouse-Geisser   Huynh-Feldt   Lower-bound
DRINK                    .572          9.486                2    .009   .700                 .784          .500
IMAGERY                  .965          .612                 2    .736   .966                 1.000         .500
DRINK * IMAGERY          .609          8.153                9    .521   .813                 1.000         .250

Tests the null hypothesis that the error covariance matrix of the orthonormalized transformed dependent variables is proportional to an identity matrix. The epsilon values may be used to adjust the degrees of freedom for the averaged tests of significance. Design: Intercept+GENDER; Within Subjects Design: DRINK+IMAGERY+DRINK*IMAGERY.


The summary table of the repeated-measures effects in the ANOVA is split into sections for each of the effects in the model and their associated error terms. The table format is the same as for the previous example, except that the interactions between gender and the repeated-measures effects are included also. We would expect to still find the effects that were previously present (in a balanced design, the inclusion of an extra variable should not affect these effects). By looking at the significance values it is clear that this prediction is true: there are still significant effects of the type of drink used, the type of imagery used, and the interaction of these two variables. In addition to the effects already described we find that gender interacts significantly with the type of drink used (so, men and women respond differently to beer, wine and water regardless of the context of the advert). There is also a significant interaction of gender and imagery (so, men and women respond differently to positive, negative and neutral imagery regardless of the drink being advertised). Finally, the three-way interaction between gender, imagery and drink is significant, indicating that the way in which imagery affects responses to different types of drinks depends on whether the subject is male or female. The effects of the repeated-measures variables have been outlined in Chapter 13 and the pattern of these responses will not have changed, so rather than repeat myself, I will concentrate on the new effects and the forgetful reader should look back at Chapter 13!


Tests of Within-Subjects Effects

Measure: MEASURE_1 (sphericity-assumed values; the full output also contains Greenhouse-Geisser, Huynh-Feldt and lower-bound corrected rows)

Source                      Type III SS   df   Mean Square   F         Sig.
DRINK                       2092.344      2    1046.172      11.708    .000
DRINK * GENDER              4569.011      2    2284.506      25.566    .000
Error(DRINK)                3216.867      36   89.357
IMAGERY                     21628.678     2    10814.339     287.417   .000
IMAGERY * GENDER            1998.344      2    999.172       26.555    .000
Error(IMAGERY)              1354.533      36   37.626
DRINK * IMAGERY             2624.422      4    656.106       19.593    .000
DRINK * IMAGERY * GENDER    495.689       4    123.922       3.701     .009
Error(DRINK * IMAGERY)      2411.000      72   33.486

Because the effect of drink violated sphericity, the Greenhouse-Geisser corrected degrees of freedom for drink and its interaction with gender are 1.401 and 25.216.

The Effect of Gender

The main effect of gender is listed separately from the repeated-measures effects in a table labelled Tests of Between-Subjects Effects. Before looking at this table it is important to check the assumption of homogeneity of variance using Levene's test. SPSS produces a table listing Levene's test for each of the repeated-measures variables in the data editor, and we need to look for any variable that has a significant value. The table showing Levene's test indicates that variances are homogeneous for all levels of the repeated-measures variables (because all significance values are greater than 0.05). If any values were significant, then this would compromise the accuracy of the F-test for


gender, and we would have to consider transforming all of our data to stabilize the variances between groups (one popular transformation is to take the square root of all values). Fortunately, in this example a transformation is unnecessary. The second table shows the ANOVA summary table for the main effect of gender, and this reveals a significant effect (because the significance of 0.018 is less than the standard cut-off point of 0.05).
Levene's Test of Equality of Error Variances

                              F        df1   df2   Sig.
Beer + Sexy                   1.009    1     18    .328
Beer + Corpse                 1.305    1     18    .268
Beer + Person in Armchair     1.813    1     18    .195
Wine + Sexy                   2.017    1     18    .173
Wine + Corpse                 1.048    1     18    .320
Wine + Person in Armchair     .071     1     18    .793
Water + Sexy                  .317     1     18    .580
Water + Corpse                .804     1     18    .382
Water + Person in Armchair    1.813    1     18    .195

Tests the null hypothesis that the error variance of the dependent variable is equal across groups. Design: Intercept+GENDER; Within Subjects Design: DRINK+IMAGERY+DRINK*IMAGERY.

Tests of Between-Subjects Effects

Measure: MEASURE_1; Transformed Variable: Average
Source       Type III SS   df   Mean Square   F         Sig.
Intercept    1246.445      1    1246.445      144.593   .000
GENDER       58.178        1    58.178        6.749     .018
Error        155.167       18   8.620

We can report that there was a significant main effect of gender (F(1, 18) = 6.75, p < 0.05). This effect tells us that if we ignore all other variables, male subjects' ratings were significantly different to females'. If you requested that SPSS display means for the gender effect you should scan through your output and find the table in a section headed Estimated Marginal Means. The table of means for the main effect of gender with the associated standard errors is plotted alongside. It is clear from this graph that men's ratings were generally significantly more positive than females'. Therefore, men gave

more positive ratings than women regardless of the drink being advertised and the type of imagery used in the advert.

Estimates

Measure: MEASURE_1
Gender    Mean    Std. Error   95% Confidence Interval (Lower, Upper)
Male      9.600   .928         7.649, 11.551
Female    6.189   .928         4.238, 8.140

The Interaction between Gender and Drink

Gender interacted in some way with the type of drink used as a stimulus. Remembering that the effect of drink violated sphericity, we must report Greenhouse-Geisser-corrected values for this interaction with the between-group factor. From the summary table we should report that there was a significant interaction between the type of drink used and the gender of the subject (F(1.40, 25.22) = 25.57, p < 0.001). This effect tells us that the type of drink being advertised had a different effect on men and women. We can use the estimated marginal means to determine the nature of this interaction (or we could have asked SPSS for a plot of gender × drink). The means and interaction graph show the meaning of this result. The graph shows the average male ratings of each drink ignoring the type of imagery with which it was presented (circles). The women's scores are shown as squares. The graph clearly shows that male and female ratings are very similar for wine and water, but men seem to rate beer more highly than women, regardless of the type of imagery used. We could interpret this interaction as meaning that the type of drink being advertised influenced ratings differently in men and women. Specifically, ratings were similar for wine and water but males rated beer more highly than women did. This interaction can be clarified using the contrasts specified before the analysis.

2. Gender * DRINK

Measure: MEASURE_1 (drink: 1 = beer, 2 = wine, 3 = water)
Gender    DRINK   Mean     Std. Error   95% Confidence Interval (Lower, Upper)
Male      1       20.600   2.441        15.471, 25.729
          2       7.333    .765         5.726, 8.940
          3       .867     1.414        -2.103, 3.836
Female    1       3.067    2.441        -2.062, 8.196
          2       9.333    .765         7.726, 10.940
          3       6.167    1.414        3.197, 9.136

The Interaction between Gender and Imagery

Gender interacted in some way with the type of imagery used as a stimulus. The effect of imagery did not violate sphericity, so we can report the uncorrected F-value. From the summary table we should report that there was a significant interaction between the type of imagery used and the gender of the subject (F(2, 36) = 26.55, p < 0.001). This effect tells us that the type of imagery used in the advert had a different effect on men and women. We can use the estimated marginal means to determine the nature of this interaction. The means and interaction graph show the meaning of this result. The graph shows the average male ratings in each imagery condition ignoring the type of drink that was rated (circles). The women's scores are shown as squares. The graph clearly shows that male and female ratings are very similar for positive and neutral imagery, but men seem to be less affected by negative imagery than women, regardless of the drink in the advert. To interpret this finding more fully, we should consult the contrasts for this interaction.

3. Gender * IMAGERY

Measure: MEASURE_1 (imagery: 1 = positive, 2 = negative, 3 = neutral)
Gender    IMAGERY   Mean      Std. Error   95% Confidence Interval (Lower, Upper)
Male      1         20.533    1.399        17.595, 23.471
          2         .833      1.092        -1.460, 3.127
          3         7.433     1.395        4.502, 10.365
Female    1         22.000    1.399        19.062, 24.938
          2         -12.000   1.092        -14.293, -9.707
          3         8.567     1.395        5.635, 11.498


The Interaction between Drink and Imagery

The interpretation of this interaction is the same as for the two-way ANOVA (see Chapter 13). You may remember that the interaction reflected the fact that negative imagery has a different effect to both positive and neutral imagery (because it decreased ratings rather than increasing them).

The Interaction between Gender, Drink and Imagery

The three-way interaction tells us whether the drink by imagery interaction is the same for men and women (i.e. whether the combined effect of the type of drink and the imagery used is the same for male subjects as for female subjects). We can conclude that there is a significant three-way drink × imagery × gender interaction (F(4, 72) = 3.70, p < 0.01). The nature of this interaction is shown up in the graph, which shows the imagery by drink interaction for men and women separately. The male graph shows that when positive imagery is used, men generally rated all three drinks positively (the line with circles is higher than the other lines for all drinks). This pattern is true of women also (the line representing positive imagery is above the other two lines). When neutral imagery is used, men rate beer very highly, but rate wine and water fairly neutrally. Women, on the other hand, rate beer and water neutrally, but rate wine more positively (in fact, the pattern of the positive and neutral imagery lines shows that women generally rate wine slightly more positively than water and beer). So, for neutral imagery men still rate beer positively, and women still rate wine positively. For the negative imagery, the men still rate beer very highly, but give low ratings to the other two types of drink. So, regardless of the type of imagery used, men rate beer very positively (if you look at the graph you'll

note that ratings for beer are virtually identical for the three types of imagery). Women, however, rate all three drinks very negatively when negative imagery is used. The three-way interaction is, therefore, likely to reflect these sex differences in the interaction between drink and imagery. Specifically, men seem fairly immune to the effects of imagery when beer is being used as a stimulus, whereas women are not. The contrasts will show up exactly what this interaction represents.
4. Gender * DRINK * IMAGERY

Measure: MEASURE_1 (drink: 1 = beer, 2 = wine, 3 = water; imagery: 1 = positive, 2 = negative, 3 = neutral)
Gender    DRINK   IMAGERY   Mean      Std. Error   95% Confidence Interval (Lower, Upper)
Male      1       1         24.800    4.037        16.318, 33.282
                  2         20.100    2.096        15.697, 24.503
                  3         16.900    2.429        11.797, 22.003
          2       1         22.300    1.939        18.227, 26.373
                  2         -7.800    1.440        -10.825, -4.775
                  3         7.500     1.483        4.383, 10.617
          3       1         14.500    2.085        10.119, 18.881
                  2         -9.800    2.201        -14.424, -5.176
                  3         -2.100    1.654        -5.575, 1.375
Female    1       1         17.300    4.037        8.818, 25.782
                  2         -11.200   2.096        -15.603, -6.797
                  3         3.100     2.429        -2.003, 8.203
          2       1         28.400    1.939        24.327, 32.473
                  2         -16.200   1.440        -19.225, -13.175
                  3         15.800    1.483        12.683, 18.917
          3       1         20.300    2.085        15.919, 24.681
                  2         -8.600    2.201        -13.224, -3.976
                  3         6.800     1.654        3.325, 10.275


Graphs showing the drink by imagery interaction for men and women. Lines represent positive imagery (circles), negative imagery (squares) and neutral imagery (triangles)

Contrasts for Repeated-Measures Variables

We requested simple contrasts for the drink variable (for which water was used as the control category) and for the imagery variable (for which neutral imagery was used as the control category). The table is the same as for the previous example except that the added effects of gender and its interaction with other variables are now included. So, for the main effect of drink, the first contrast compares level 1 (beer) against the base category (in this case, the last category: water). This result is significant (F(1, 18) = 15.37, p < 0.01), and the next contrast compares level 2 (wine) with the base category (water) and confirms the significant difference found when gender was not included as a variable in the analysis (F(1, 18) = 19.92, p < 0.001). For the imagery main effect, the first contrast compares level 1 (positive) to the base category (neutral) and verifies the significant effect found by the post hoc tests (F(1, 18) = 134.87, p < 0.001). The second contrast confirms the significant difference found for the negative imagery condition compared to the neutral (F(1, 18) = 129.18, p < 0.001). No contrast was specified for gender.


Tests of Within-Subjects Contrasts

Measure: MEASURE_1 (simple contrasts; the last level — water, or neutral imagery — is the reference category)

Source                Contrast               SS          df    MS          F          Sig.
DRINK                 Level 1 vs. Level 3    1383.339    1     1383.339    15.371     .001
                      Level 2 vs. Level 3    464.006     1     464.006     19.923     .000
DRINK * GENDER        Level 1 vs. Level 3    2606.806    1     2606.806    28.965     .000
                      Level 2 vs. Level 3    54.450      1     54.450      2.338      .144
Error(DRINK)          Level 1 vs. Level 3    1619.967    18    89.998
                      Level 2 vs. Level 3    419.211     18    23.290
IMAGERY               Level 1 vs. Level 3    3520.089    1     3520.089    134.869    .000
                      Level 2 vs. Level 3    3690.139    1     3690.139    129.179    .000
IMAGERY * GENDER      Level 1 vs. Level 3    .556        1     .556        .021       .886
                      Level 2 vs. Level 3    975.339     1     975.339     34.143     .000
Error(IMAGERY)        Level 1 vs. Level 3    469.800     18    26.100
                      Level 2 vs. Level 3    514.189     18    28.566

For the interaction contrasts (each with df = 1, 18):
DRINK * IMAGERY              L1 vs. L3, L1 vs. L3   F = 1.686,  Sig. .211
                             L1 vs. L3, L2 vs. L3   F = 8.384,  Sig. .010
                             L2 vs. L3, L1 vs. L3   F = .223,   Sig. .642
                             L2 vs. L3, L2 vs. L3   F = 31.698, Sig. .000
DRINK * IMAGERY * GENDER     L1 vs. L3, L1 vs. L3   F = 2.328,  Sig. .144
                             L1 vs. L3, L2 vs. L3   F = 5.592,  Sig. .029
                             L2 vs. L3, L1 vs. L3   F = .025,   Sig. .877
                             L2 vs. L3, L2 vs. L3   F = 4.384,  Sig. .051

Drink × Gender Interaction 1: Beer vs. Water, Male vs. Female

The first interaction term looks at level 1 of drink (beer) compared to level 3 (water), comparing male and female scores. This contrast is highly significant (F(1, 18) = 28.97, p < 0.001). This result tells us that the increased ratings of beer compared to water found for men are not found for women. So, in the graph the squares representing female ratings of beer and water are roughly level; however, the circle representing male ratings of beer is much higher than the circle representing water. The positive contrast represents this difference and so we can conclude that male ratings of beer (compared to water) were significantly greater than women's ratings of beer (compared to water).

Drink × Gender Interaction 2: Wine vs. Water, Male vs. Female

The second interaction term compares level 2 of drink (wine) to level 3 (water), contrasting male and female scores. There is no significant difference for this contrast (F(1, 18) = 2.34, p = 0.14), which tells us that the difference between ratings of wine


compared to water in males is roughly the same as in females. Therefore, overall, the drink × gender interaction has shown up a difference between males and females in how they rate beer (regardless of the type of imagery used).

Imagery × Gender Interaction 1: Positive vs. Neutral, Male vs. Female

The first interaction term looks at level 1 of imagery (positive) compared to level 3 (neutral), comparing male and female scores. This contrast is not significant (F < 1). This result tells us that ratings of drinks presented with positive imagery (relative to those presented with neutral imagery) were equivalent for males and females. This finding represents the fact that in the earlier graph of this interaction the squares and circles for both the positive and neutral conditions overlap (therefore male and female responses were the same).

Imagery × Gender Interaction 2: Negative vs. Neutral, Male vs. Female

The second interaction term looks at level 2 of imagery (negative) compared to level 3 (neutral), comparing male and female scores. This contrast is highly significant (F(1, 18) = 34.13, p < 0.001). This result tells us that the difference between ratings of drinks paired with negative imagery compared to neutral was different for men and women. Looking at the earlier graph of this interaction, this finding represents the fact that for men, ratings of drinks paired with negative imagery were relatively similar to ratings of drinks paired with neutral imagery (the circles have a fairly similar vertical position). However, if you look at the female ratings, then drinks were rated much less favourably when presented with negative imagery than when presented with neutral imagery (the square in the negative condition is much lower than the neutral condition). Therefore,


overall, the imagery × gender interaction has shown up a difference between males and females in terms of their ratings of drinks presented with negative imagery compared to neutral; specifically, men seem less affected by negative imagery.

Drink × Imagery × Gender Interaction 1: Beer vs. Water, Positive vs. Neutral Imagery, Male vs. Female

The first interaction term compares level 1 of drink (beer) to level 3 (water), when positive imagery (level 1) is used compared to neutral (level 3), in males compared to females (F(1, 18) = 2.33, p = 0.144). The non-significance of this contrast tells us that the difference in ratings when positive imagery is used compared to neutral imagery is roughly equal when beer is used as a stimulus as when water is used, and these differences are equivalent in male and female subjects. In terms of the interaction graph it means that the distance between the circle and the triangle in the beer condition is the same as the distance between the circle and the triangle in the water condition, and that these distances are equivalent in men and women.

Drink × Imagery × Gender Interaction 2: Beer vs. Water, Negative vs. Neutral Imagery, Male vs. Female

[Plot of mean ratings of beer and water after negative and neutral imagery, shown separately for men and women]

The second interaction term looks at level 1 of drink (beer) compared to level 3 (water), when negative imagery (level 2) is used compared to neutral (level 3). This contrast is significant (F(1, 18) = 5.59, p < 0.05). This result tells us that the difference in ratings between beer and water when negative imagery is used (compared to neutral
imagery) is different between men and women. If we plot ratings of beer and water across the negative and neutral conditions, for males (circles) and females (squares) separately, we see that ratings after negative imagery are always lower than ratings for neutral imagery except for men's ratings of beer, which are actually higher after negative imagery. As such, this contrast tells us that the interaction effect reflects a difference in the way in which males rate beer compared to females when negative imagery is used compared to neutral. Males and females are similar in their pattern of ratings for water but different in the way in which they rate beer.

Drink × Imagery × Gender Interaction 3: Wine vs. Water, Positive vs. Neutral Imagery, Male vs. Female

The third interaction term looks at level 2 of drink (wine) compared to level 3 (water), when positive imagery (level 1) is used compared to neutral (level 3) in males compared to females. This contrast is non-significant (F(1, 18) < 1). This result tells us that the difference in ratings when positive imagery is used compared to neutral imagery is roughly equal when wine is used as a stimulus as when water is used, and these differences are equivalent in male and female subjects. In terms of the interaction graph it means that the distance between the circle and the triangle in the wine condition is the same as the distance between the circle and the triangle in the water condition, and that these distances are equivalent in men and women.

Drink × Imagery × Gender Interaction 4: Wine vs. Water, Negative vs. Neutral Imagery, Male vs. Female


The final interaction term looks at level 2 of drink (wine) compared to level 3 (water), when negative imagery (level 2) is used compared to neutral (level 3).

[Plot of mean ratings of wine and water after negative and neutral imagery, shown separately for men and women]

This contrast
is very close to significance (F(1, 18) = 4.38, p = 0.051). This result tells us that the difference in ratings between wine and water when negative imagery is used

(compared to neutral imagery) is different between men and women (although this difference has not quite reached significance). If we plot ratings of wine and water across the negative and neutral conditions, for males (circles) and females (squares), we see that ratings after negative imagery are always lower than ratings for neutral imagery, but for women rating wine the change is much more dramatic (the line is steeper). As such, this contrast tells us that the interaction effect reflects a difference in the way in which females and males rate wine when neutral imagery is used compared to when negative imagery is used. Males and females are similar in their pattern of ratings for water but different in the way in which they rate wine. It is noteworthy that this contrast was not significant using the usual 0.05 level; however, it is worth remembering that this cut-off point was set in a fairly arbitrary way, and so it is worth reporting these close effects and letting your reader decide whether they are meaningful or not. There is also a growing trend towards reporting effect sizes in preference to using significance levels.

Summary

These contrasts again tell us nothing about the differences between the beer and wine conditions (or the positive and negative conditions) and different contrasts would have to


be run to find out more. However, what is clear so far is that differences exist between men and women in terms of their ratings towards beer and wine. It seems as though men are relatively unaffected by negative imagery when it comes to beer. Likewise, women seem more willing to rate wine positively when neutral imagery is used than men do. What should be clear from this is that complex ANOVA in which several independent variables are used results in complex interaction effects that require a great deal of concentration to interpret (imagine interpreting a four-way interaction!). Therefore, it is essential to take a systematic approach to interpretation and plotting graphs is a particularly useful way to proceed. It is also advisable to think carefully about the appropriate contrasts to use to answer the questions you have about your data. It is these contrasts that will help you to interpret interactions, so make sure you select sensible ones!
Task 2

Text messaging is very popular among mobile phone owners, to the point that books have been published on how to write in text speak (BTW, hope u kno wat I mean by txt spk). One concern is that children may use this form of communication so much that it will hinder their ability to learn correct written English. One concerned researcher conducted an experiment in which one group of children was encouraged to send text messages on their mobile phones over a six-month period. A second group was forbidden from sending text messages for the same period. To ensure that kids in this latter group didn't use their phones, this group was given armbands that administered painful shocks in the presence of


microwaves (like those emitted from phones).3 There were 50 different participants: 25 were encouraged to send text messages and 25 were forbidden. The outcome was a score on a grammatical test (as a percentage) that was measured both before and after the experiment. The first independent variable was, therefore, text message use (text messagers versus controls) and the second independent variable was the time at which grammatical ability was assessed (before or after the experiment). The data are in the file TextMessages.sav.
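As in the previous task, this is a mixed design, so one way to run it would be with syntax along these lines. This is only a sketch: the variable names grammar1, grammar2 and group are assumptions and should be replaced with the names actually used in TextMessages.sav.

GLM grammar1 grammar2 BY group
  /WSFACTOR = time 2
  /PLOT = PROFILE(time*group)
  /EMMEANS = TABLES(group*time)
  /PRINT = DESCRIPTIVE HOMOGENEITY
  /WSDESIGN = time
  /DESIGN = group.

The PRINT subcommand requests the descriptive statistics and Levene's tests discussed below, and the EMMEANS subcommand requests the cell means used for the interaction graph.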

The line chart (with error bars) shows the grammar data. The circles show the mean grammar score before and after the experiment for the text message group and the controls. The means before and after are connected by a line for the two groups separately. It's clear from this chart that in the text message group grammar scores went down dramatically over the six-month period in which they used their mobile phone. For the controls, their grammar scores also fell but much less dramatically.

80 75 70 Text Messagers Controls

Mean Grammar Score

65 60 55 50 45 40 10 0 Before After

Time

209

Line chart (with error bars showing the standard error of the mean) of the mean grammar scores before and after the experiment for text messagers and controls
Descriptive Statistics

                     Group             Mean      Std. Deviation   N
Grammer at Time 1    Text Messagers    64.8400   10.67973         25
                     Controls          65.6000   10.83590         25
                     Total             65.2200   10.65467         50
Grammar at Time 2    Text Messagers    52.9600   16.33116         25
                     Controls          61.8400   9.41046          25
                     Total             57.4000   13.93278         50

The output above shows the table of descriptive statistics from the two-way mixed ANOVA; the table has means at time 1 split according to whether the people were in the text messaging group or the control group, then below we have the means for the two groups at time 2. These means correspond to those plotted above.
Mauchly's Test of Sphericity

Measure: MEASURE_1
Within Subjects Effect   Mauchly's W   Approx. Chi-Square   df   Sig.   Greenhouse-Geisser   Huynh-Feldt   Lower-bound
TIME                     1.000         .000                 0    .      1.000                1.000         1.000

Tests the null hypothesis that the error covariance matrix of the orthonormalized transformed dependent variables is proportional to an identity matrix. Design: Intercept+GROUP; Within Subjects Design: TIME.

Levene's Test of Equality of Error Variances

                     F       df1   df2   Sig.
Grammer at Time 1    .089    1     48    .767
Grammar at Time 2    3.458   1     48    .069

Tests the null hypothesis that the error variance of the dependent variable is equal across groups. Design: Intercept+GROUP; Within Subjects Design: TIME.

We know that when we use repeated measures we have to check the assumption of sphericity. We also know that for independent designs we need to check the homogeneity of variance assumption. If the design is a mixed design then we have both repeated and independent measures, so we have to check both assumptions. In this case, we have only


two levels of the repeated measure, so the assumption of sphericity does not apply in this case. Levene's test produces a different test for each level of the repeated-measures variable. In mixed designs, the homogeneity assumption has to hold for every level of the repeated-measures variable. At both levels of time, Levene's test is non-significant (p = 0.77 before the experiment and p = 0.069 after the experiment). This means the assumption has not been broken (but it was quite close to being a problem after the experiment).

The output also contains the main ANOVA summary tables. Like any two-way ANOVA, we still have three effects to find: two main effects (one for each independent variable) and one interaction term. The main effect of time is significant, so we can conclude that grammar scores were significantly affected by the time at which they were measured. The exact nature of this effect is easily determined because there were only two points in time (and so this main effect is comparing only two means).

[Graph of the mean grammar score (%) before and after the experiment]

The graph shows that grammar scores were higher before the experiment than after. So, before the experimental manipulation scores were higher than after, meaning that the manipulation had the net effect of significantly reducing grammar scores. This main effect seems rather interesting until you consider that these means include both text messagers and controls. There are three possible reasons for the drop in grammar scores: (1) the text messagers got worse and are dragging down the mean after the experiment, (2) the controls somehow got worse, or (3) the whole group
just got worse and it had nothing to do with whether the children text messaged or not. Until we examine the interaction, we won't see which of these is true.

The main effect of group is shown by the F-ratio in the second table above. The probability associated with this F-ratio is 0.09, which is just above the critical value of 0.05. Therefore, we must conclude that there was no significant main effect on grammar scores of whether children text messaged or not. Again, this effect seems interesting enough, and mobile phone companies might certainly choose to cite it as evidence that text messaging does not affect your grammar ability. However, remember that this main effect ignores the time at which grammar ability is measured. It just means that if we took the average grammar score for text messagers (that's including their scores both before and after they started using their phones), and compared this to the mean of the controls (again including scores before and after), then these means would not be significantly different.

[Graph of the mean grammar score (%) for text messagers and controls, ignoring the time of measurement]

The graph shows that when you ignore the time at which grammar was measured, the controls have slightly better grammar than the text messagers, but not significantly so. Main effects are not always that interesting and should certainly be viewed in the context of any interaction effects. The interaction effect in this example is shown by the F-ratio in the row labelled Time*Group, and because the probability of obtaining a value this big by chance is 0.047, which is just less than the criterion of 0.05, we can say that there is a significant interaction between the time at which grammar was measured and whether or not children were allowed to text message within that time. The mean ratings in all


conditions help us to interpret this effect. The significant interaction tells us that the change in grammar scores was significantly different in text messagers compared to controls. Looking at the interaction graph, we can see that although grammar scores fell in controls, the drop was much more marked in the text messagers; so, text messaging does seem to ruin your ability at grammar compared to controls.4

Writing the Result

We can report the three effects from this analysis as follows: The results show that the grammar ratings at the end of the experiment were significantly lower than those at the beginning of the experiment, F(1, 48) = 15.46, p < .001, r = .61. The main effect of group on the grammar scores was non-significant, F(1, 48) = 2.99, ns, r = .27. This indicated that when the time at which grammar was measured is ignored, the grammar ability in the text message group was not significantly different to the controls. The time × group interaction was significant, F(1, 48) = 4.17, p < .05, r = .34, indicating that the change in grammar ability in the text message group was significantly different to the change in the control group. These findings indicate that although there was a natural decay of grammatical ability over time (as shown by the controls) there was a much stronger effect when participants were

4 It's interesting that the control group means dropped too. This could be because the control group were undisciplined and still used their mobile phones, or it could just be that the education system in this country is so underfunded that there is no one to teach English any more!

encouraged to use text messages. This shows that using text messages accelerates the inevitable decline in grammatical ability.
Task 3

A researcher was interested in the effects on people's mental health of participating in Big Brother (see Chapter 1 if you don't know what Big Brother is). The researcher hypothesized that contestants start off with personality disorders that are exacerbated by being forced to live with people as attention-seeking as themselves. To test this hypothesis, she gave eight contestants a questionnaire measuring personality disorders before they entered the house, and again when they left the house. A second group of eight people acted as a waiting list control. These were people who had been shortlisted to go into the house but never actually made it. They too were given the questionnaire at the same points in time as the contestants. The data are in BigBrother.sav. Conduct a mixed ANOVA on the data.

Running the analysis
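Again, this is a two-way mixed design, so a syntax sketch along the following lines should work. The repeated-measures variable names bpd_before and bpd_after are assumptions to be checked against BigBrother.sav; the between-group variable appears in the output as bb.

GLM bpd_before bpd_after BY bb
  /WSFACTOR = time 2
  /PLOT = PROFILE(time*bb)
  /EMMEANS = TABLES(bb*time)
  /PRINT = DESCRIPTIVE HOMOGENEITY
  /WSDESIGN = time
  /DESIGN = bb.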


SPSS output

The error bar chart shows the mean borderline personality disorder (BPD) scores before entering and after leaving the Big Brother house, for the contestants and the controls separately. It's clear from this chart that BPD scores increased over time for the contestants, whereas for the controls they went down slightly.


Error bar chart of the mean personality disorder score before entering and after leaving the big brother house


The output above shows the table of descriptive statistics from the two-way mixed ANOVA; the table has mean borderline personality disorder (BPD) scores before entering the big brother (BB) house split according to whether the people were a contestant or not, then below we have the means for the two groups after leaving the house. These means correspond to those plotted above.

We know that when we use repeated-measures we have to check the assumption of sphericity. However, we also know that for sphericity to be an issue we need at least three conditions. We have only two conditions here so sphericity does not need to be tested (and, therefore, SPSS produces a blank in the column labeled Sig.). We also need to check the homogeneity of variance assumption. Levenes test produces a different test for each level of the repeated-measures variable. In mixed designs, the homogeneity assumption has to hold for every level of the repeated-measures variable. At both levels of time, Levenes test is non-significant (p = 0.061 before entering the BB house and p = .088 after leaving). This means the assumption has not been significantly broken (but it was quite close to being a problem).

The output above shows the main ANOVA summary tables. Like any two-way ANOVA, we still have three effects to find: two main effects (one for each independent variable) and one interaction term. The main effect of time is not significant, so we can conclude that BPD scores were not significantly affected by the time at which they were measured. The exact nature of this effect is easily determined because there were only two points in time (and so this main effect is comparing only two means). The graph shows that BPD scores were not significantly different after leaving the BB house compared to before entering it.


The main effect of group (bb) is shown by the F-ratio in the second table above. The probability associated with this F-ratio is .43, which is above the critical value of .05. Therefore, we must conclude that there was no significant main effect on BPD scores of whether the person was a BB contestant or not. The graph shows that when you ignore the time at which BPD was measured, the contestants and controls are not significantly different.


The interaction effect in this example is shown by the F-ratio in the row labelled
Time*bb, and because the probability of obtaining a value this big is .018, which is less

than the criterion of .05, we can say that there is a significant interaction between the time at which BPD was measured and whether or not the person was a contestant. The mean ratings in all conditions (and on the interaction graph) help us to interpret this effect. The significant interaction seems to indicate that for controls BPD scores went down (slightly) from before entering the house to after leaving it, but for contestants the opposite is true: BPD scores increased over time.
Writing the results


We can report the three effects from this analysis as follows: The main effect of group was not significant, F(1, 14) = 0.67, p = .43, indicating that across both time points borderline personality disorder scores were similar in BB contestants and controls. The main effect of time was not significant, F(1, 14) = 0.09, p = .77, indicating that across all participants borderline personality disorder scores were similar before entering the house and after leaving it. The time × group interaction was significant, F(1, 14) = 7.15, p < .05, indicating that although borderline personality disorder scores decreased for controls from before entering the house to after leaving it, scores increased for the contestants.

Chapter 15 Task 1

A psychologist was interested in the cross-species differences between men and dogs. She observed a group of dogs and a group of men in a naturalistic setting (20 of each). She classified several behaviours as being dog-like (urinating against trees and lamp posts, attempts to copulate with anything that moved, and attempts to lick their own genitals). For each man and dog she counted the number of dog-like behaviours displayed in a 24-hour period. It was hypothesized that dogs would display more dog-like behaviours than men. The data are in the file
MenLikeDogs.sav. Analyse them with a Mann-Whitney test.
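If you prefer syntax to the dialog boxes, a command along these lines should produce the output below. This is a sketch: the variable names behaviour and species, and the group codes 1 and 2, are assumptions to be checked against MenLikeDogs.sav.

* Mann-Whitney test of dog-like behaviour across species; QUARTILES requests the medians used in the write-up.
NPAR TESTS
  /M-W = behaviour BY species(1 2)
  /STATISTICS = QUARTILES.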


SPSS Output
Ranks

Dog-Like Behaviour    Species   N    Mean Rank   Sum of Ranks
                      Dog       20   20.77       415.50
                      Man       20   20.23       404.50
                      Total     40

Test Statistics (Grouping Variable: Species)

Mann-Whitney U                     194.500
Wilcoxon W                         404.500
Z                                  -.150
Asymp. Sig. (2-tailed)             .881
Exact Sig. [2*(1-tailed Sig.)]     .883 (not corrected for ties)

Calculating an Effect Size

The output tells us that z is .15, and we had 20 men and 20 dogs so the total number of observations was 40. The effect size is, therefore:

r = z / √N = 0.15 / √40 = 0.02

This represents a tiny effect (it is close to zero), which tells us that there truly isn't much difference between dogs and men.

Writing and Interpreting the Result

We could report something like: Men (Mdn = 27) did not seem to differ from dogs (Mdn = 24) in the amount of dog-like behaviour they displayed (U = 194.5, ns).


Note that I've reported the median for each condition. Of course, we really ought to include the effect size as well. We could do two things. The first is to report the z-score associated with the test statistic. This value would enable the reader to determine both the exact significance of the test, and to calculate the effect size r: Men (Mdn = 27) and dogs (Mdn = 24) did not significantly differ in the extent to which they displayed dog-like behaviours (U = 194.5, ns, z = .15). The alternative is to just report the effect size (because readers can convert back to the z-score if they need to for any reason). This approach is better because the effect size will probably be most useful to the reader. Men (Mdn = 27) and dogs (Mdn = 24) did not significantly differ in the extent to which they displayed dog-like behaviours (U = 194.5, ns, r = .02).

Task 2

There's been much speculation over the years about the influence of subliminal messages on records. To name a few cases, both Ozzy Osbourne and Judas Priest have been accused of putting backward masked messages on their albums that subliminally influence poor unsuspecting teenagers into doing things like blowing their heads off with shotguns. A psychologist was interested in whether backward masked messages really did have an effect. He took the master tapes of Britney Spears' Baby One More Time and created a second version that had the masked message deliver your soul to the dark lord repeated in the chorus. He took this version, and the original, and played one version (randomly) to a group of 32 people. He took the same group of people six months later and played them


whatever version they hadn't heard the time before. Thus each person heard both the original and the version with the masked message, but at different points in time. The psychologist measured the number of goats that were sacrificed in the week after listening to each version. It was hypothesized that the backward message would lead to more goats being sacrificed. The data are in the file
DarkLord.sav. Analyse them with a Wilcoxon signed-rank test.
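A syntax sketch of this test is given below. The variable names message and nomessage are assumptions (they match the condition labels in the output) and should be checked against DarkLord.sav.

* Wilcoxon signed-rank test comparing the two conditions; QUARTILES requests the medians used in the write-up.
NPAR TESTS
  /WILCOXON = message WITH nomessage (PAIRED)
  /STATISTICS = QUARTILES.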

Ranks (No Message - Message)

                  N    Mean Rank   Sum of Ranks
Negative Ranks    11   10.14       111.50         (No Message < Message)
Positive Ranks    17   17.32       294.50         (No Message > Message)
Ties              4
Total             32

Test Statistics (Wilcoxon Signed Ranks Test)

Z                         -2.094 (based on negative ranks)
Asymp. Sig. (2-tailed)    .036

Calculating an Effect Size

The output tells us that z is 2.094, and we had 64 observations (although we only used 32 people and tested them twice, it is the number of observations, not the number of people, that is important here). The effect size is, therefore:

r = z / √N = 2.094 / √64 = 0.26


This represents a medium effect (it is close to Cohen's benchmark of 0.3), which tells us that the effect of whether or not a subliminal message was present was a substantive effect.

Writing and Interpreting the Result

We could report something like: The number of goats sacrificed after hearing the message (Mdn = 9) was significantly less than after hearing the normal version of the song (Mdn = 11) (T = 111.50, p < .05). As with the Mann-Whitney test, we should report either the z-score or the effect size. The effect size is most useful: The number of goats sacrificed after hearing the message (Mdn = 9) was significantly less than after hearing the normal version of the song (Mdn = 11) (T = 111.50, p < .05, r = .26).
Task 3

A psychologist was interested in the effects of television programmes on domestic life. She hypothesized that through learning by watching, certain programmes might actually encourage people to behave like the characters within them. This in turn could affect the viewers' own relationships (depending on whether the programme depicted harmonious or dysfunctional relationships). She took episodes of three popular TV shows and showed them to 54 couples, after which the couples were left alone in the room for an hour. The experimenter measured the number of times the couple argued. Each couple viewed all three of the TV programmes at different points in time (a week apart) and the order in which the

programmes were viewed was counterbalanced over couples. The TV programmes selected were Eastenders (which typically portrays the lives of extremely miserable, argumentative, London folk who like nothing more than to beat each other up, lie to each other, sleep with each other's wives and generally show no evidence of any consideration to their fellow humans!), Friends (which portrays a group of unrealistically considerate and nice people who love each other oh so very much, but for some reason I love it anyway!), and a National Geographic programme about whales (this was supposed to act as a control). The data are in the file Eastenders.sav. Access the data and conduct Friedman's ANOVA on them.
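A syntax sketch of the test is shown below. The variable names eastend, friends and natgeo are assumptions and should be checked against Eastenders.sav.

* Friedman's ANOVA comparing the number of arguments after the three programmes.
NPAR TESTS
  /FRIEDMAN = eastend friends natgeo
  /STATISTICS = QUARTILES.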
Ranks

                       Mean Rank
Eastenders             2.29
Friends                1.81
National Geographic    1.91
The first table shows the mean rank in each condition. These mean ranks are important later for interpreting any effects; they show that the ranks were highest after watching Eastenders.
Test Statistics (Friedman Test)

N              54
Chi-Square     7.586
df             2
Asymp. Sig.    .023

The next table shows the chi-square test statistic and its associated degrees of freedom (in this case we had three groups so the degrees of freedom are 3 - 1, or 2), and the

significance. Therefore, we could conclude that the type of programme watched significantly affected the subsequent number of arguments (because the significance value is less than 0.05). However, this result doesn't tell us exactly where the differences lie. A nice succinct set of comparisons would be to compare each group against the control:

Test 1: Eastenders compared to control
Test 2: Friends compared to control

This gives rise to only two tests, so rather than use 0.05 as our critical level of significance, we'd use 0.05/2 = 0.025.
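These two follow-up comparisons could be run with Wilcoxon syntax along the following lines (again, the variable names eastend, friends and natgeo are assumptions):

* Follow-up Wilcoxon tests of each programme against the control (whales) programme.
NPAR TESTS /WILCOXON = eastend WITH natgeo (PAIRED).
NPAR TESTS /WILCOXON = friends WITH natgeo (PAIRED).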
Ranks

                                           N    Mean Rank   Sum of Ranks
National Geographic - Eastenders
  Negative Ranks (NatGeo < Eastenders)     31   28.85       894.50
  Positive Ranks (NatGeo > Eastenders)     18   18.36       330.50
  Ties                                     5
  Total                                    54
National Geographic - Friends
  Negative Ranks (NatGeo < Friends)        21   22.00       462.00
  Positive Ranks (NatGeo > Friends)        24   23.88       573.00
  Ties                                     9
  Total                                    54

Test Statistics (Wilcoxon Signed Ranks Test)

                          National Geographic - Eastenders    National Geographic - Friends
Z                         -2.813 (based on positive ranks)    -.629 (based on negative ranks)
Asymp. Sig. (2-tailed)    .005                                 .530


The next tables show the test statistics from doing Wilcoxon tests on the two comparisons that I suggested. Remember that we are now using a critical value of 0.025, so we compare the significance of both test statistics against this critical value. The test comparing Eastenders to the National Geographic programme about whales has a significance value of 0.005, which is well below our criterion of 0.025, therefore we can conclude that Eastenders led to significantly more arguments than the programme about whales. The second comparison compares the number of arguments after Friends with the number after the programme about whales. This contrast is non-significant (the significance of the test statistic is 0.530, which is bigger than our critical value of 0.025), so we can conclude that there was no difference in the number of arguments after watching Friends compared to after watching the whales. The effect we got seems to mainly reflect the fact that Eastenders makes people argue more. Calculating an Effect Size We can calculate effect sizes for the Wilcoxon tests that we used to follow up the main analysis. For the first comparison (Eastenders vs. control) z is 2.813, and because this is based on comparing two groups each containing 54 observations, we have 108 observations in total (remember that it isnt important that the observations come from the same people). The effect size is, therefore:
r(Eastenders - control) = 2.813 / √108 = 0.27


This represents a medium effect (it is close to Cohen's benchmark of 0.3), which tells us that the effect of Eastenders relative to the control was a substantive effect: Eastenders produced substantially more arguments. For the second comparison (Friends vs. control) z is 0.629, and this was again based on 108 observations. The effect size is, therefore:

rStory Control =

0.629

108 = 0.06

This represents virtually no effect (it is close to zero). Therefore, Friends had very little effect in creating arguments compared to the control.

Writing and Interpreting the Result

For Friedman's ANOVA we need only report the test statistic (which we saw earlier is denoted by χ²), its degrees of freedom and its significance. So, we could report something like: The number of arguments that couples had was significantly affected by the programme they had just watched (χ²(2) = 7.59, p < .05). We need to report the follow-up tests as well (including their effect sizes): The number of arguments that couples had was significantly affected by the programme they had just watched (χ²(2) = 7.59, p < .05). Wilcoxon tests were used to follow up this finding. A Bonferroni correction was applied and so all effects are reported at a .025 level of significance. It appeared that watching Eastenders significantly affected the number of arguments compared to the programme about whales (T = 330.50, r = .27). However, the number of arguments was not significantly different after Friends compared to after the programme about whales (T = 462, ns, r = .06). We can conclude that watching Eastenders did produce significantly more arguments compared to watching a programme about whales, and this effect was medium in size. However, Friends didn't produce any substantial reduction in the number of arguments relative to the control programme.
Task 4

A researcher was interested in trying to prevent coulrophobia (fear of clowns) in children. She decided to do an experiment in which different groups of children (15 in each) were exposed to different forms of positive information about clowns. The first group watched some adverts for McDonald's in which its mascot Ronald McDonald is seen cavorting about with children, going on about how they should love their mums. A second group was told a story about a clown who helped some children when they got lost in a forest (although what on earth a clown was doing in a forest remains a mystery). A third group was entertained by a real clown, who came into the classroom and made balloon animals for the children. A final group acted as a control condition and they had nothing done to them at all. The researcher took self-report ratings of how much the children liked clowns (rather like the fear-beliefs questionnaire in Chapter 2) resulting in a score for each child that could range from 0 (not scared of clowns at all) to 5 (very scared of clowns). The data are in the file coulrophobia.sav. Access the data and conduct a Kruskal-Wallis test.
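A syntax sketch of the test is given below. The variable names beliefs and group, and the coding of group as 1 to 4, are assumptions to be checked against coulrophobia.sav.

* Kruskal-Wallis test of fear beliefs across the four information groups.
NPAR TESTS
  /K-W = beliefs BY group(1 4)
  /STATISTICS = QUARTILES.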


Ranks

Fear beliefs    Format of Information   N    Mean Rank
                Advert                  15   45.03
                Story                   15   21.87
                Exposure                15   23.77
                None                    15   31.33
                Total                   60

This table tells us the mean rank in each condition. These mean ranks are important later for interpreting any effects.
Test Statistics (Kruskal-Wallis Test; Grouping Variable: Format of Information)

Chi-Square     17.058
df             3
Asymp. Sig.    .001

This table shows the test statistic (SPSS labels it chi-square rather than H) and its associated degrees of freedom (in this case we had four groups so the degrees of freedom are 4 − 1, or 3), and the significance (which is less than the critical value of 0.05). Therefore, we could conclude that the type of information presented to the children about clowns significantly affected their fear ratings of clowns. A nice succinct set of comparisons would be to compare each group against the control:

1. Test 1: Advert compared to control
2. Test 2: Story compared to control
3. Test 3: Exposure compared to control


This results in three tests, so rather than use 0.05 as our critical level of significance, we'd use 0.05/3 = 0.0167. The following tables show the test statistics from doing Mann–Whitney tests on the three focused comparisons that I suggested. Advert vs. control:
Test Statistics (Fear beliefs; grouping variable: Format of Information)
Mann-Whitney U                      37.500
Wilcoxon W                         157.500
Z                                   -3.261
Asymp. Sig. (2-tailed)                .001
Exact Sig. [2*(1-tailed Sig.)]        .001 (not corrected for ties)

Story vs. control:

Test Statistics (Fear beliefs; grouping variable: Format of Information)
Mann-Whitney U                      65.000
Wilcoxon W                         185.000
Z                                   -2.091
Asymp. Sig. (2-tailed)                .037
Exact Sig. [2*(1-tailed Sig.)]        .050 (not corrected for ties)

Exposure vs. control:

Test Statistics (Fear beliefs; grouping variable: Format of Information)
Mann-Whitney U                      72.500
Wilcoxon W                         192.500
Z                                   -1.743
Asymp. Sig. (2-tailed)                .081
Exact Sig. [2*(1-tailed Sig.)]        .098 (not corrected for ties)

Remember that we are now using a critical value of 0.0167, so the only comparison that is significant is the advert compared to the control group (because the observed significance value of 0.001 is less than 0.0167). The other two comparisons produce significance values that are greater than 0.0167, so we'd have to say they're non-significant. So the effect we got seems mainly to reflect the fact that McDonald's adverts significantly increased fear beliefs about clowns relative to controls (which is no surprise given what a creepy weirdo Ronald McDonald is!).
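For anyone who wants to reproduce these tests outside SPSS, here is a minimal sketch using scipy. The file name coulrophobia.csv and the column names group and fear are assumptions about how the data might be exported; the logic (Kruskal–Wallis omnibus test, then Mann–Whitney comparisons against the control judged at .05/3) mirrors the analysis above.

```python
# Rough sketch: Kruskal-Wallis test followed by Bonferroni-corrected
# Mann-Whitney comparisons of each information format against the control.
import pandas as pd
from scipy import stats

df = pd.read_csv("coulrophobia.csv")            # assumed export; columns 'group', 'fear'
groups = {g: d["fear"].to_numpy() for g, d in df.groupby("group")}

H, p = stats.kruskal(*groups.values())
print(f"Kruskal-Wallis: H = {H:.2f}, p = {p:.3f}")

alpha = 0.05 / 3                                 # three planned comparisons
for condition in ["Advert", "Story", "Exposure"]:
    U, p = stats.mannwhitneyu(groups[condition], groups["None"], alternative="two-sided")
    verdict = "significant" if p < alpha else "non-significant"
    print(f"{condition} vs control: U = {U:.1f}, p = {p:.3f} ({verdict} at .0167)")
```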
Calculating an Effect Size

We can calculate effect sizes for the Mann–Whitney tests that we used to follow up the main analysis. For the first comparison (adverts vs. control) z is 3.261, and because this is based on comparing two groups each containing 15 observations, we have 30 observations in total. The effect size is, therefore:
r(Advert–Control) = 3.261 / √30 = 0.60

This represents a large effect, which tells us that the effect of adverts relative to the control was a substantive effect. For the second comparison (story vs. control) z is 2.091, and this was again based on 30 observations. The effect size is, therefore:
r(Story–Control) = 2.091 / √30 = 0.38

This represents a medium to large effect. Therefore, although non-significant, the effect of stories relative to the control was a substantive effect. For the final comparison (exposure vs. control) z is 1.743, and this was again based on 30 observations. The effect size is, therefore:
r(Exposure–Control) = 1.743 / √30 = 0.32

This represents a medium effect. Therefore, although non-significant, the effect of exposure relative to the control was a substantive effect.

Writing and Interpreting the Result


For the Kruskal–Wallis test, we need only report the test statistic (which we saw earlier is denoted by H), its degrees of freedom and its significance. So, we could report something like: Children's fear beliefs about clowns were significantly affected by the format of information given to them (H(3) = 17.06, p < .01). However, we need to report the follow-up tests as well (including their effect sizes): Children's fear beliefs about clowns were significantly affected by the format of information given to them (H(3) = 17.06, p < .01). Mann–Whitney tests were used to follow up this finding. A Bonferroni correction was applied and so all effects are reported at a .0167 level of significance. It appeared that fear beliefs were significantly higher after the adverts compared to the control (U = 37.50, r = .60). However, fear beliefs were not significantly different after the stories (U = 65.00, ns, r = .38) or exposure (U = 72.5, ns, r = .32) relative to the control. We can conclude that clown information through stories and exposure produced medium-sized, but non-significant, reductions in fear beliefs about clowns (future work with larger samples might be appropriate), but that Ronald McDonald was sufficient to significantly increase fear beliefs about clowns.
Chapter 16

Task 1

A clinical psychologist noticed that several of his manic psychotic patients did chicken impersonations in public. He wondered whether this behaviour could be used to diagnose this disorder and so decided to compare his patients against a
normal sample. He observed 10 of his patients as they went through a normal day. He also needed to observe 10 of the most normal people he could find: naturally he chose to observe lecturers at the University of Sussex. He observed all participants using two dependent variables: first, how many chicken impersonations they did in the streets of Brighton over the course of a day, and, second, how good their impersonations were (as scored out of 10 by an independent farmyard noise expert). The data are in the file chicken.sav. Use MANOVA and DFA to find out whether these variables could be used to distinguish manic psychotic patients from those without the disorder.

This output shows an initial table of descriptive statistics that is produced by clicking on the descriptive statistics option in the options dialog box. This table contains the overall and group means and standard deviations for each dependent variable in turn. It seems that manic psychotics and Sussex lecturers do pretty similar amounts of chicken impersonations (lecturers do slightly less actually, but they are of a higher quality).
Descriptive Statistics
                                Mean      Std. Deviation    N
QUALITY    Manic Psychosis     6.7000        1.05935       10
           Sussex Lecturers    7.6000        2.98887       10
           Total               7.1500        2.23077       20
QUANTITY   Manic Psychosis    12.1000        4.22821       10
           Sussex Lecturers   10.7000        4.37290       10
           Total              11.4000        4.24760       20

The next output shows Box's test of the assumption of equality of covariance matrices. This statistic tests the null hypothesis that the variance–covariance matrices are the same in both groups. Therefore, if the matrices are equal (and therefore the assumption of homogeneity is met) this statistic should be non-significant. For these data p = 0.000
(which is less than 0.05); hence, the covariance matrices are not equal and the assumption is broken. However, because group sizes are equal we can ignore this test because Pillai's trace should be robust to this violation (fingers crossed!).
Box's Test of Equality of Covariance Matrices(a)
Box's M       20.926
F              6.135
df1                3
df2        58320.000
Sig.            .000
Tests the null hypothesis that the observed covariance matrices of the dependent variables are equal across groups.
a. Design: Intercept+GROUP

The next table shows the main table of results. For our purposes, the group effects are of interest because they tell us whether or not the manic psychotics and Sussex lecturers differ along the two dimensions of quality and quantity of chicken impersonations. The column of real interest is the one containing the significance values of these F-ratios. For these data, all test statistics are significant with p = 0.032 (which is less than 0.05). From this result we should probably conclude that the groups do indeed differ in terms of the quality and quantity of their chicken impersonations; however, this effect needs to be broken down to find out exactly what's going on.
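As a cross-check, the same multivariate test can be sketched with statsmodels. The file name chicken.csv and the column names group, quality and quantity are assumptions about how chicken.sav might be exported; mv_test() prints Pillai's trace, Wilks' lambda, Hotelling's trace and Roy's largest root for the intercept and the group effect.

```python
# Rough sketch: one-way MANOVA with two dependent variables.
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

df = pd.read_csv("chicken.csv")   # assumed export; columns 'group', 'quality', 'quantity'
manova = MANOVA.from_formula("quality + quantity ~ group", data=df)
print(manova.mv_test())           # four multivariate test statistics per effect
```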
Multivariate Tests(b)
Effect                               Value       F         Hypothesis df   Error df   Sig.
Intercept   Pillai's Trace            .919    96.201(a)        2.000        17.000    .000
            Wilks' Lambda             .081    96.201(a)        2.000        17.000    .000
            Hotelling's Trace       11.318    96.201(a)        2.000        17.000    .000
            Roy's Largest Root      11.318    96.201(a)        2.000        17.000    .000
GROUP       Pillai's Trace            .333     4.250(a)        2.000        17.000    .032
            Wilks' Lambda             .667     4.250(a)        2.000        17.000    .032
            Hotelling's Trace         .500     4.250(a)        2.000        17.000    .032
            Roy's Largest Root        .500     4.250(a)        2.000        17.000    .032
a. Exact statistic
b. Design: Intercept+GROUP

The next table shows a summary table of Levene's test of equality of variances for each of the dependent variables. These tests are the same as would be found if a one-way
ANOVA had been conducted on each dependent variable in turn. Levene's test should be non-significant for all dependent variables if the assumption of homogeneity of variance has been met. The results for these data clearly show that the assumption has been met for the quantity of chicken impersonations but has been broken for the quality of impersonations. This should dent our confidence in the reliability of the univariate tests to follow.
Levene's Test of Equality of Error Variances(a)
              F       df1   df2   Sig.
QUALITY     11.135     1     18   .004
QUANTITY      .256     1     18   .619
Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a. Design: Intercept+GROUP

The next part of the output contains the ANOVA summary table for the dependent variables. The row of interest is that labelled GROUP (you'll notice that the values in this row are the same as for the row labelled Corrected Model: this is because the model fitted to the data contains only one independent variable: group). The row labelled GROUP contains an ANOVA summary table for quality and quantity of chicken impersonations respectively. The values of p indicate that there was a non-significant difference between manic psychotics and Sussex lecturers in terms of both quality and quantity (both ps are greater than 0.05). The multivariate test statistics led us to conclude that the groups did differ significantly across these two variables in combination, yet the univariate results contradict this!


Tests of Between-Subjects Effects
Source            Dependent Variable   Type III SS    df   Mean Square      F       Sig.
Corrected Model   QUALITY                  4.050(a)     1       4.050       .806     .381
                  QUANTITY                 9.800(b)     1       9.800       .530     .476
Intercept         QUALITY               1022.450        1    1022.450    203.360     .000
                  QUANTITY              2599.200        1    2599.200    140.497     .000
GROUP             QUALITY                  4.050        1       4.050       .806     .381
                  QUANTITY                 9.800        1       9.800       .530     .476
Error             QUALITY                 90.500       18       5.028
                  QUANTITY               333.000       18      18.500
Total             QUALITY               1117.000       20
                  QUANTITY              2942.000       20
Corrected Total   QUALITY                 94.550       19
                  QUANTITY               342.800       19
a. R Squared = .043 (Adjusted R Squared = -.010)
b. R Squared = .029 (Adjusted R Squared = -.025)

We don't need to look at contrasts because the univariate tests were non-significant (and in any case there were only two groups and so no further comparisons would be necessary); instead, to see how the dependent variables interact, we need to carry out a discriminant function analysis (DFA).
Wilks' Lambda
Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1                          .667          6.893      2   .032

The initial statistics from the DFA tell us that there was only one variate (because there are only two groups) and this variate is significant. Therefore, the group differences shown by the MANOVA can be explained in terms of one underlying dimension.
Standardized Canonical Discriminant Function Coefficients
             Function 1
QUALITY          1.859
QUANTITY        -1.829

The standardized discriminant function coefficients tell us the relative contribution of each variable to the variates. Both quality and quantity of impersonations have similar-sized coefficients, indicating that they have an equally strong influence in discriminating the groups. However, they have opposite signs, which suggests that group differences are explained by the difference between the quality and quantity of impersonations.
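A hedged sketch of how this discriminant function could be estimated with scikit-learn. The column names are the same assumptions as before, and z-scoring the predictors by their overall standard deviations only approximates SPSS's standardized (pooled within-group) coefficients; the point is simply that the two coefficients come out with similar magnitude and opposite sign.

```python
# Rough sketch: descriptive discriminant function analysis.
# scalings_ holds the coefficients of the single discriminant function.
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

df = pd.read_csv("chicken.csv")                  # assumed export
X = df[["quality", "quantity"]]
X_std = (X - X.mean()) / X.std()                 # crude standardization (overall SDs)

lda = LinearDiscriminantAnalysis(n_components=1).fit(X_std, df["group"])
coefs = pd.Series(lda.scalings_[:, 0], index=X.columns)
print(coefs)                                     # similar size, opposite sign: a 'difference' variate
```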
Functions at Group Centroids
GROUP                 Function 1
Manic Psychosis           -.671
Sussex Lecturers           .671
Unstandardized canonical discriminant functions evaluated at group means

The variate centroids for each group confirm that variate 1 discriminates the two groups because the manic psychotics have a negative coefficient and the Sussex lecturers have a positive one. There won't be a combined-groups plot because there is only one variate. Overall we could conclude that manic psychotics are distinguished from Sussex lecturers in terms of the difference between the quantity of their impersonations and the quality of them. If we look at the means we can see that manic psychotics produce slightly more impersonations than Sussex lecturers (but remember from the non-significant univariate tests that this isn't sufficient, alone, to differentiate the groups), but the lecturers produce impersonations of a higher quality (but again remember that quality alone is not enough to differentiate the groups). Therefore, although the manic psychotics and Sussex lecturers produce similar numbers of impersonations of similar quality (see univariate tests), if we combine the quality and quantity we can differentiate the groups.
Task 2

I was interested in whether students' knowledge of different aspects of psychology improved throughout their degree. I took a sample of first years, second years and third years and gave them five tests (scored out of 15) representing different aspects of psychology: Exper (experimental psychology
such as cognitive and neuropsychology, etc.); Stats (statistics); Social (social psychology); Develop (developmental psychology); Person (personality). Your task is to: (1) carry out an appropriate general analysis to determine whether there are overall group differences along these five measures; (2) look at the scale-by-scale analyses of group differences produced in the output and interpret the results accordingly; (3) select contrasts that test the hypothesis that second and third years will score higher than first years on all scales; (4) select tests that compare all groups to each other, and briefly compare these results with the contrasts; and (5) carry out a separate analysis in which you test whether a combination of the measures can successfully discriminate the groups (comment only briefly on this analysis). Include only those scales that revealed group differences for the contrasts. How do the results help you to explain the findings of your initial analysis? The data are in the file psychology.sav.

This output shows an initial table of descriptive statistics that is produced by clicking on the descriptive statistics option in the options dialog box. This table contains the overall and group means and standard deviations for each dependent variable in turn.


Descriptive Statistics
                           Group       Mean     Std. Deviation    N
Experimental Psychology    1st Year    5.6364       2.1574       11
                           2nd Year    5.5000       1.5916       16
                           3rd Year    7.0000       2.1213       13
                           Total       6.0250       2.0062       40
Statistics                 1st Year    7.5455       3.5599       11
                           2nd Year    8.6875       2.3866       16
                           3rd Year   10.4615       3.0988       13
                           Total       8.9500       3.1211       40
Social Psychology          1st Year   10.3636       2.7303       11
                           2nd Year    8.5625       2.8040       16
                           3rd Year    8.7692       1.6408       13
                           Total       9.1250       2.5236       40
Personality                1st Year   10.6364       3.3248       11
                           2nd Year    8.4375       1.9990       16
                           3rd Year    8.3846       2.3993       13
                           Total       9.0250       2.6745       40
Developmental              1st Year   11.0000       2.6458       11
                           2nd Year    8.8750       1.7078       16
                           3rd Year    8.7692       3.0319       13
                           Total       9.4250       2.5908       40

The next output shows Box's test of the assumption of equality of covariance matrices. This statistic tests the null hypothesis that the variance–covariance matrices are the same in all three groups. Therefore, if the matrices are equal (and therefore the assumption of homogeneity is met) this statistic should be non-significant. For these data p = 0.06 (which is greater than 0.05); hence, the covariance matrices are roughly equal and the assumption is tenable.
Box's Test of Equality of Covariance Matrices(a)
Box's M      54.241
F             1.435
df1              30
df2            3587
Sig.           .059
Tests the null hypothesis that the observed covariance matrices of the dependent variables are equal across groups.
a. Design: Intercept+GROUP

The next table shows the main table of results. For our purposes, the group effects are of interest because they tell us whether or not the scores from different areas of psychology differ across the three years of the degree programme. The column of real interest is the one containing the significance values of these F-ratios. For these data, Pillai's trace (p = .020), Wilks's lambda (p = .012), Hotelling's trace (p = .007) and Roy's largest root (p = .001) all reach the criterion for significance of .05. From this result we should probably conclude that the profile of knowledge across different areas of psychology does indeed change across the three years of the degree. The nature of this effect is not clear from the multivariate test statistics.
Multivariate Tests(c)
Effect                               Value        F          Hypothesis df   Error df   Sig.
Intercept   Pillai's Trace            .960    159.166(a)         5.000        33.000    .000
            Wilks' Lambda             .040    159.166(a)         5.000        33.000    .000
            Hotelling's Trace       24.116    159.166(a)         5.000        33.000    .000
            Roy's Largest Root      24.116    159.166(a)         5.000        33.000    .000
GROUP       Pillai's Trace            .510      2.330           10.000        68.000    .020
            Wilks' Lambda             .522      2.534(a)        10.000        66.000    .012
            Hotelling's Trace         .853      2.730           10.000        64.000    .007
            Roy's Largest Root        .773      5.255(b)         5.000        34.000    .001
a. Exact statistic
b. The statistic is an upper bound on F that yields a lower bound on the significance level.
c. Design: Intercept+GROUP

The next table shows a summary table of Levene's test of equality of variances for each of the dependent variables. These tests are the same as would be found if a one-way ANOVA had been conducted on each dependent variable in turn. Levene's test should be non-significant for all dependent variables if the assumption of homogeneity of variance has been met. The results for these data clearly show that the assumption has been met. This finding not only gives us confidence in the reliability of the univariate tests to follow, but also strengthens the case for assuming that the multivariate test statistics are robust.


Levene's Test of Equality of Error Variances(a)
                              F      df1   df2   Sig.
Experimental Psychology     1.311     2     37   .282
Statistics                   .746     2     37   .481
Social Psychology           2.852     2     37   .071
Personality                 2.440     2     37   .101
Developmental               2.751     2     37   .077
Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a. Design: Intercept+GROUP

The next part of the output contains the ANOVA summary table for the dependent variables. The row of interest is that labelled GROUP, which contains an ANOVA summary table for each of the areas of psychology. The values of p indicate that there was a non-significant difference between student groups in terms of all areas of psychology (all ps are greater than 0.05). The multivariate test statistics led us to conclude that the student groups did differ significantly across the types of psychology yet the univariate results contradict this (again ... I really should stop making up data sets that do this!).


Tests of Between-Subjects Effects
Source            Dependent Variable        Type III SS    df   Mean Square       F      Sig.
Corrected Model   Experimental Psychology     18.430(a)     2       9.215       2.461    .099
                  Statistics                  52.504(b)     2      26.252       2.967    .064
                  Social Psychology           23.584(c)     2      11.792       1.941    .158
                  Personality                 39.415(d)     2      19.708       3.044    .060
                  Developmental               37.717(e)     2      18.859       3.114    .056
Intercept         Experimental Psychology   1428.058        1    1428.058     381.378    .000
                  Statistics                3093.775        1    3093.775     349.637    .000
                  Social Psychology         3330.118        1    3330.118     548.129    .000
                  Personality               3273.395        1    3273.395     505.575    .000
                  Developmental             3562.212        1    3562.212     588.250    .000
GROUP             Experimental Psychology     18.430        2       9.215       2.461    .099
                  Statistics                  52.504        2      26.252       2.967    .064
                  Social Psychology           23.584        2      11.792       1.941    .158
                  Personality                 39.415        2      19.708       3.044    .060
                  Developmental               37.717        2      18.859       3.114    .056
Error             Experimental Psychology    138.545       37       3.744
                  Statistics                 327.396       37       8.849
                  Social Psychology          224.791       37       6.075
                  Personality                239.560       37       6.475
                  Developmental              224.058       37       6.056
Total             Experimental Psychology   1609.000       40
                  Statistics                3584.000       40
                  Social Psychology         3579.000       40
                  Personality               3537.000       40
                  Developmental             3815.000       40
Corrected Total   Experimental Psychology    156.975       39
                  Statistics                 379.900       39
                  Social Psychology          248.375       39
                  Personality                278.975       39
                  Developmental              261.775       39
a. R Squared = .117 (Adjusted R Squared = .070)
b. R Squared = .138 (Adjusted R Squared = .092)
c. R Squared = .095 (Adjusted R Squared = .046)
d. R Squared = .141 (Adjusted R Squared = .095)
e. R Squared = .144 (Adjusted R Squared = .098)

We don't need to look at contrasts because the univariate tests were non-significant; instead, to see how the dependent variables interact, we need to carry out a DFA.
Wilks' Lambda
Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1 through 2                .522          22.748    10   .012
2                          .926           2.710     4   .608

The initial statistics from the DFA tell us that only one of the variates is significant (the second variate is non-significant, p = 0.608). Therefore, the group differences shown by the MANOVA can be explained in terms of one underlying dimension.


Standardized Canonical Discriminant Function Coefficients
                            Function 1   Function 2
Experimental Psychology         .367         .789
Statistics                      .921        -.081
Social Psychology              -.353         .319
Personality                    -.260         .216
Developmental                  -.618         .013

The standardized discriminant function coefficients tell us the relative contribution of each variable to the variates. Looking at the first variate, it's clear that statistics has the greatest contribution. Most interesting is that on the first variate, statistics and experimental psychology have positive weights, whereas social, developmental and personality have negative weights. This suggests that the group differences are explained by the difference between experimental psychology and statistics compared to other areas of psychology.
Functions at Group Centroids
Group        Function 1   Function 2
1st Year        -1.246        .186
2nd Year      9.789E-02      -.333
3rd Year          .934        .252
Unstandardized canonical discriminant functions evaluated at group means

The variate centroids for each group tell us that variate 1 discriminates the first years from second and third years because the first years have a negative value whereas the second and third years have positive values on the first variate. The relationship between the variates and the groups is best illuminated using a combined-groups plot. This graph plots the variate scores for each person, grouped according to the year of their degree. In addition, the group centroids are indicated, which are the average variate scores for each group. The plot for these data confirms that variate 1 discriminates the first years from subsequent years (look at the horizontal distance between these centroids).
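A sketch of how an equivalent combined-groups plot could be drawn in Python; the file name psychology.csv and the column names (year plus the five test scores) are assumptions about how the data might be exported.

```python
# Rough sketch: project the five scores onto the two discriminant functions
# and plot each student by year of degree, with group centroids.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

df = pd.read_csv("psychology.csv")               # assumed export
dvs = ["exper", "stats", "social", "develop", "person"]

lda = LinearDiscriminantAnalysis(n_components=2).fit(df[dvs], df["year"])
scores = lda.transform(df[dvs])                  # discriminant scores (functions 1 and 2)

for year in sorted(df["year"].unique()):
    mask = (df["year"] == year).to_numpy()
    plt.scatter(scores[mask, 0], scores[mask, 1], label=str(year), alpha=0.6)
    plt.scatter(scores[mask, 0].mean(), scores[mask, 1].mean(),
                marker="*", s=250, edgecolors="black")   # group centroid
plt.xlabel("Function 1")
plt.ylabel("Function 2")
plt.legend(title="Year")
plt.show()
```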

[Combined-groups plot: canonical discriminant function scores for each student, Function 1 (horizontal axis) against Function 2 (vertical axis), with group centroids marked for the 1st, 2nd and 3rd years.]

Overall we could conclude that different years are discriminated by different areas of psychology. In particular, it seems as though statistics and aspects of experimentation (compared to other areas of psychology) discriminate between first-year undergraduates and subsequent years. From the means, we could interpret this as first years struggling with statistics and experimental psychology (compared to other areas of psychology) but their ability improves across the three years. However, for other areas of psychology, first years are relatively good but their abilities decline over the three years. Put another way, psychology degrees improve only your knowledge of statistics and experimentation.
Chapter 17


Task 1

The University of Sussex is constantly seeking to employ the best people possible as lecturers (no, really, it is). Anyway, the university wanted to revise a questionnaire based on Bland's theory of research methods lecturers. This theory predicts that good research methods lecturers should have four characteristics: (1) a profound love of statistics; (2) an enthusiasm for experimental design; (3) a love of teaching; and (4) a complete absence of normal interpersonal skills. These characteristics should be related (i.e. correlated). The Teaching of Statistics for Scientific Experiments (TOSSE) already existed, but the university revised this questionnaire and it became the Teaching of Statistics for Scientific Experiments Revised (TOSSER). The university gave this questionnaire to 239 research methods lecturers around the world to see if it supported Bland's theory. The questionnaire is below and the data are in TOSSE-R.sav. Conduct a factor analysis (with appropriate rotation) to see the factor structure of the data.

SD = Strongly Disagree, D = Disagree, N = Neither, A = Agree, SA = Strongly Agree (each item is rated on this scale)

1. I once woke up in the middle of a vegetable patch hugging a turnip that I'd mistakenly dug up thinking it was Roy's largest root
2. If I had a big gun I'd shoot all the students I have to teach
3. I memorize probability values for the F-distribution
4. I worship at the shrine of Pearson
5. I still live with my mother and have little personal hygiene
6. Teaching others makes me want to swallow a large bottle of bleach because the pain of my burning oesophagus would be light relief in comparison
7. Helping others to understand sums of squares is a great feeling
8. I like control conditions
9. I calculate three ANOVAs in my head before getting out of bed every morning
10. I could spend all day explaining statistics to people
11. I like it when people tell me I've helped them to understand factor rotation
12. People fall asleep as soon as I open my mouth to speak
13. Designing experiments is fun
14. I'd rather think about appropriate dependent variables than go to the pub
15. I soil my pants with excitement at the mere mention of factor analysis
16. Thinking about whether to use repeated or independent measures thrills me
17. I enjoy sitting in the park contemplating whether to use participant observation in my next experiment
18. Standing in front of 300 people in no way makes me lose control of my bowels
19. I like to help students
20. Passing on knowledge is the greatest gift you can bestow an individual
21. Thinking about Bonferroni corrections gives me a tingly feeling in my groin
22. I quiver with excitement when thinking about designing my next experiment
23. I often spend my spare time talking to the pigeons ... and even they die of boredom
24. I tried to build myself a time machine so that I could go back to the 1930s and follow Fisher around on my hands and knees licking the floor on which he'd just trodden
25. I love teaching
26. I spend lots of time helping students
27. I love teaching because students have to pretend to like me or they'll get bad marks
28. My cat is my only friend

Multicollinearity: The determinant of the correlation matrix was 0.00000124, which is smaller than 0.00001 and, therefore, indicates that multicollinearity could be a problem in these data (although, strictly speaking, because we're using principal component analysis we don't need to worry).

Sample size: MacCallum et al. (1999) have demonstrated that when communalities after extraction are above .5, a sample size between 100 and 200 can be adequate, and even when communalities are below .5 a sample size of 500 should be sufficient. We have a sample size of 239 with some communalities below .5, and so the sample size may not be adequate. However, the KMO measure of sampling adequacy is .894, which is above Kaiser's (1974) recommendation of .5. This value is also meritorious (and almost marvellous) according to Hutcheson and Sofroniou (1999). As such, the evidence suggests that the sample size is adequate to yield distinct and reliable factors.


Bartlett's test: This tests whether the correlations between questions are sufficiently large for factor analysis to be appropriate (it actually tests whether the correlation matrix is sufficiently different from an identity matrix). In this case it is significant (χ²(378) = 2989.77, p < .001), indicating that the correlations within the R-matrix are sufficiently different from zero to warrant factor analysis.
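These checks (the determinant of the R-matrix, KMO and Bartlett's test) can be reproduced approximately with numpy and the third-party factor_analyzer package. The file name tosser.csv is an assumption about how the 28 item responses might be exported.

```python
# Rough sketch: assumption checks before the factor analysis.
import numpy as np
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

items = pd.read_csv("tosser.csv")                # assumed export of the 28 questionnaire items

determinant = np.linalg.det(items.corr())        # should exceed .00001
chi_square, p_value = calculate_bartlett_sphericity(items)
kmo_per_item, kmo_overall = calculate_kmo(items)

print(f"Determinant of the R-matrix: {determinant:.8f}")
print(f"Bartlett's test: chi-square = {chi_square:.2f}, p = {p_value:.4f}")
print(f"Overall KMO = {kmo_overall:.3f}")
```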


Extraction: SPSS has extracted five factors based on Kaiser's criterion of retaining factors with eigenvalues greater than 1. Is this warranted? Kaiser's criterion is accurate when there are fewer than 30 variables and the communalities after extraction are greater than .7, or when the sample size exceeds 250 and the average communality is greater than .6. For these data the sample size is 239, there are 28 variables, and the mean communality is .579, so extracting five factors is not really warranted. The scree plot shows clear inflexions at 3 and 5 factors, and so using the scree plot you could justify extracting 3 or 5 factors.
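A sketch of how the eigenvalues behind Kaiser's criterion and the scree plot could be inspected with the same (assumed) factor_analyzer package and exported item data:

```python
# Rough sketch: eigenvalues for deciding how many components to extract.
import pandas as pd
import matplotlib.pyplot as plt
from factor_analyzer import FactorAnalyzer

items = pd.read_csv("tosser.csv")                        # assumed export of the 28 items
fa = FactorAnalyzer(n_factors=items.shape[1], rotation=None, method="principal")
fa.fit(items)
eigenvalues, _ = fa.get_eigenvalues()

print("Eigenvalues > 1 (Kaiser's criterion):", int((eigenvalues > 1).sum()))
plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker="o")   # scree plot
plt.axhline(1, linestyle="--")
plt.xlabel("Component")
plt.ylabel("Eigenvalue")
plt.show()
```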


Pattern Matrix (Extraction Method: Principal Component Analysis; Rotation Method: Oblimin with Kaiser Normalization; rotation converged in 15 iterations). The oblimin-rotated loadings of the 28 items on the five extracted components are listed, item by item, in the pattern summary below.
Rotation: You should choose an oblique rotation because the question says that the constructs we're measuring are related. Looking at the pattern matrix (and using loadings greater than .4 as recommended by Stevens) we see the following pattern:

Factor 1:
1. Q 16. Thinking about whether to use repeated or independent measures thrills me
2. Q 14. I'd rather think about appropriate dependent variables than go to the pub
3. Q 22. I quiver with excitement when thinking about designing my next experiment
4. Q 17. I enjoy sitting in the park contemplating whether to use participant observation in my next experiment
5. Q 13. Designing experiments is fun
6. Q 8. I like control conditions
7. Q 10. I could spend all day explaining statistics to people

Factor 2:
8. Q 19. I like to help students
9. Q 20. Passing on knowledge is the greatest gift you can bestow an individual
10. Q 25. I love teaching
11. Q 27. I love teaching because students have to pretend to like me or they'll get bad marks
12. Q 7. Helping others to understand sums of squares is a great feeling
13. Q 26. I spend lots of time helping students

Factor 3:
14. Q 23. I often spend my spare time talking to the pigeons ... and even they die of boredom
15. Q 28. My cat is my only friend
16. Q 5. I still live with my mother and have little personal hygiene
17. Q 12. People fall asleep as soon as I open my mouth to speak

Factor 4:
18. Q 24. I tried to build myself a time machine so that I could go back to the 1930s and follow Fisher around on my hands and knees licking the floor on which he'd just trodden
19. Q 3. I memorize probability values for the F-distribution
20. Q 4. I worship at the shrine of Pearson
21. Q 15. I soil my pants with excitement at the mere mention of factor analysis
22. Q 21. Thinking about Bonferroni corrections gives me a tingly feeling in my groin
23. Q 1. I once woke up in the middle of a vegetable patch hugging a turnip that I'd mistakenly dug up thinking it was Roy's largest root

Factor 5:
24. Q 6. Teaching others makes me want to swallow a large bottle of bleach because the pain of my burning oesophagus would be light relief in comparison
25. Q 2. If I had a big gun I'd shoot all the students I have to teach
26. Q 18. Standing in front of 300 people in no way makes me lose control of my bowels

No factor:
27. Q 9. I calculate three ANOVAs in my head before getting out of bed every morning
28. Q 11. I like it when people tell me I've helped them to understand factor rotation

Factor 1 seems to relate to research methods, factor 2 to teaching, factor 3 to general social skills, factor 4 to statistics and factor 5 to, well, err, teaching again. All in all, this isn't particularly satisfying and doesn't really support the four-factor model. We saw earlier that the extraction of five factors probably wasn't justified. In fact the scree plot seems to indicate three. Let's rerun the analysis but ask SPSS for three factors, and see how this changes the pattern matrix:
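A sketch of re-running the extraction with three factors and an oblique (oblimin) rotation, using the same assumed tosser.csv export; factor_analyzer's principal/oblimin combination mirrors, but is not identical to, SPSS's principal component extraction with direct oblimin.

```python
# Rough sketch: three-factor solution with oblimin rotation, showing only
# loadings above the .4 threshold recommended by Stevens.
import pandas as pd
from factor_analyzer import FactorAnalyzer

items = pd.read_csv("tosser.csv")                        # assumed export of the 28 items
fa = FactorAnalyzer(n_factors=3, rotation="oblimin", method="principal")
fa.fit(items)

loadings = pd.DataFrame(fa.loadings_, index=items.columns,
                        columns=["Factor 1", "Factor 2", "Factor 3"])
print(loadings.where(loadings.abs() > 0.4).round(2))     # small loadings blanked out
```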


Looking at the pattern matrix (and using loadings greater than .4 as recommended by Stevens) we see the following pattern:

Factor 1:
29. Q 22. I quiver with excitement when thinking about designing my next experiment
30. Q 8. I like control conditions
31. Q 17. I enjoy sitting in the park contemplating whether to use participant observation in my next experiment
32. Q 21. Thinking about Bonferroni corrections gives me a tingly feeling in my groin
33. Q 13. Designing experiments is fun
34. Q 9. I calculate three ANOVAs in my head before getting out of bed every morning
35. Q 3. I memorize probability values for the F-distribution
36. Q 1. I once woke up in the middle of a vegetable patch hugging a turnip that I'd mistakenly dug up thinking it was Roy's largest root
37. Q 24. I tried to build myself a time machine so that I could go back to the 1930s and follow Fisher around on my hands and knees licking the floor on which he'd just trodden
38. Q 4. I worship at the shrine of Pearson
39. Q 16. Thinking about whether to use repeated or independent measures thrills me
40. Q 7. Helping others to understand sums of squares is a great feeling
41. Q 15. I soil my pants with excitement at the mere mention of factor analysis
42. Q 11. I like it when people tell me I've helped them to understand factor rotation
43. Q 10. I could spend all day explaining statistics to people
44. Q 14. I'd rather think about appropriate dependent variables than go to the pub

Factor 2:
45. Q 19. I like to help students
46. Q 2. If I had a big gun I'd shoot all the students I have to teach (note negative weight)
47. Q 6. Teaching others makes me want to swallow a large bottle of bleach because the pain of my burning oesophagus would be light relief in comparison (note negative weight)
48. Q 18. Standing in front of 300 people in no way makes me lose control of my bowels (note negative weight)
49. Q 26. I spend lots of time helping students
50. Q 25. I love teaching
51. Q 20. Passing on knowledge is the greatest gift you can bestow an individual
52. Q 27. I love teaching because students have to pretend to like me or they'll get bad marks

Factor 3:
53. Q 5. I still live with my mother and have little personal hygiene
54. Q 23. I often spend my spare time talking to the pigeons ... and even they die of boredom
55. Q 28. My cat is my only friend
56. Q 12. People fall asleep as soon as I open my mouth to speak
57. Q 27. I love teaching because students have to pretend to like me or they'll get bad marks

No factor: none (every item loads on at least one of the three factors).

This factor structure is much clearer cut: factor 1 relates to a love of methods and statistics, factor 2 to a love of teaching, and factor 3 to an absence of normal social skills. This doesn't support the original four-factor model because the data indicate that love of methods and love of statistics can't be separated (if you love one, you love the other).
Task 2

Sian Williams devised a questionnaire to measure organizational ability.

She predicted five factors to do with organisational ability: (1) preference for organization; (2) goal achievement; (3) planning approach; (4) acceptance of delays; and (5) preference for routine. These dimensions are theoretically independent. Williams' questionnaire contains 28 items using a 7-point Likert scale (1 = strongly disagree, 4 = neither, 7 = strongly agree). She gave it to 239 people. Run a principal component analysis on the data in Williams.sav.

1. I like to have a plan to work to in everyday life
2. I feel frustrated when things don't go to plan
3. I get most things done in a day that I want to
4. I stick to a plan once I have made it
5. I enjoy spontaneity and uncertainty
6. I feel frustrated if I can't find something I need
7. I find it difficult to follow a plan through
8. I am an organized person
9. I like to know what I have to do in a day
10. Disorganized people annoy me
11. I leave things to the last minute
12. I have many different plans relating to the same goal
13. I like to have my documents filed and in order
14. I find it easy to work in a disorganized environment
15. I make 'to do' lists and achieve most of the things on it
16. My workspace is messy and disorganized
17. I like to be organized
18. Interruptions to my daily routine annoy me
19. I feel that I am wasting my time
20. I forget the plans I have made
21. I prioritize the things I have to do
22. I like to work in an organized environment
23. I feel relaxed when I don't have a routine
24. I set deadlines for myself and achieve them
25. I change rather aimlessly from one activity to another during the day
26. I have trouble organizing the things I have to do
27. I put tasks off to another day
28. I feel restricted by schedules and plans

Correlation Matrix (the full 28 × 28 matrix of inter-item correlations is not reproduced here). a. Determinant = 1.240E-06

KMO and Bartlett's Test
Kaiser-Meyer-Olkin Measure of Sampling Adequacy          .894
Bartlett's Test of Sphericity   Approx. Chi-Square   2989.769
                                df                        378
                                Sig.                     .000


Communalities (Initial = 1.000 for every item; Extraction Method: Principal Component Analysis)
Item                                                                          Extraction
1. I like to have a plan to work to in everyday life                              .646
2. I feel frustrated when things don't go to plan                                 .624
3. I get most things done in a day that I want to                                 .591
4. I stick to a plan once I have made it                                          .589
5. I enjoy spontaneity and uncertainty                                            .545
6. I feel frustrated if I can't find something I need                             .621
7. I find it difficult to follow a plan through                                   .486
8. I am an organized person                                                       .683
9. I like to know what I have to do in a day                                      .638
10. Disorganized people annoy me                                                  .417
11. I leave things to the last minute                                             .539
12. I have many different plans relating to the same goal                         .297
13. I like to have my documents filed and in order                                .531
14. I find it easy to work in a disorganized environment                          .709
15. I make 'to do' lists and achieve most of the things on it                     .511
16. My workspace is messy and disorganized                                        .681
17. I like to be organized                                                        .705
18. Interruptions to my daily routine annoy me                                    .514
19. I feel that I am wasting my time                                              .536
20. I forget the plans I have made                                                .477
21. I prioritize the things I have to do                                          .566
22. I like to work in an organized environment                                    .766
23. I feel relaxed when I don't have a routine                                    .587
24. I set deadlines for myself and achieve them                                   .649
25. I change rather aimlessly from one activity to another during the day         .550
26. I have trouble organizing the things I have to do                             .599
27. I put tasks off to another day                                                .619
28. I feel restricted by schedules and plans                                      .538


Total Variance Explained (Extraction Method: Principal Component Analysis)
            Initial Eigenvalues                     Extraction Sums of Squared Loadings    Rotation Sums of Squared Loadings
Component   Total   % of Variance   Cumulative %    Total   % of Variance   Cumulative %    Total   % of Variance   Cumulative %
1           9.064      32.373          32.373       9.064      32.373          32.373       4.558      16.279          16.279
2           2.787       9.954          42.328       2.787       9.954          42.328       3.460      12.356          28.635
3           1.664       5.944          48.272       1.664       5.944          48.272       3.239      11.568          40.203
4           1.515       5.409          53.681       1.515       5.409          53.681       2.631       9.397          49.600
5           1.180       4.215          57.896       1.180       4.215          57.896       2.323       8.296          57.896
Components 6 to 28 have initial eigenvalues below 1 (.991, .925, .819, .793, .744, .705, .654, .623, .574, .545, .516, .487, .454, .423, .382, .341, .334, .309, .293, .260, .248, .207 and .164), with the cumulative percentage of variance rising from 61.435% to 100.000%.


Component Matrix (Extraction Method: Principal Component Analysis; 5 components extracted). [The unrotated loadings are not reproduced legibly here; the interpretation below is based on the rotated solution.]

Rotated Component Matrix (Extraction Method: Principal Component Analysis; Rotation Method: Varimax with Kaiser Normalization; rotation converged in 7 iterations). [The loadings above .4 are listed, factor by factor, in the interpretation below.]
Component Transformation Matrix
Component      1        2        3        4        5
1            .633    -.118    -.188    -.742     .025
2            .520     .050    -.346     .503    -.595
3            .384     .738     .106     .201     .506
4            .302    -.650    -.053     .393     .574
5            .301    -.129     .911     .038    -.246
Extraction Method: Principal Component Analysis. Rotation Method: Varimax with Kaiser Normalization.

Extraction: SPSS has extracted five factors based on Kaiser's criterion of retaining factors with eigenvalues greater than 1. Is this warranted? Kaiser's criterion is accurate when there are fewer than 30 variables and the communalities after extraction are greater than .7, or when the sample size exceeds 250 and the average communality is greater than .6. For these data the sample size is 239 and the mean communality is .579, so extracting five factors is not really warranted. The scree plot shows clear inflexions at 3 and 5 factors, and so using the scree plot you could justify extracting 3 or 5 factors. Looking at the rotated component matrix (and using loadings greater than .4 as recommended by Stevens) we see the following pattern:

Factor 1: preference for organization
1. Q8: I am an organized person
2. Q13: I like to have my documents filed and in order
3. Q14: I find it easy to work in a disorganized environment
4. Q16: My workspace is messy and disorganized
5. Q17: I like to be organized
6. Q22: I like to work in an organized environment
Note: It's odd that none of these have reverse loadings.

Factor 2: plan approach
7. Q1: I like to have a plan to work to in everyday life
8. Q3: I get most things done in a day that I want to
9. Q4: I stick to a plan once I have made it
10. Q9: I like to know what I have to do in a day
11. Q15: I make 'to do' lists and achieve most of the things on it
12. Q21: I prioritize the things I have to do
13. Q24: I set deadlines for myself and achieve them

Factor 3: goal achievement
14. Q7: I find it difficult to follow a plan through
15. Q11: I leave things to the last minute
16. Q19: I feel that I am wasting my time
17. Q20: I forget the plans I have made
18. Q25: I change rather aimlessly from one activity to another during the day
19. Q26: I have trouble organizing the things I have to do
20. Q27: I put tasks off to another day

Factor 4: acceptance of delays
21. Q2: I feel frustrated when things don't go to plan
22. Q6: I feel frustrated if I can't find something I need
23. Q10: Disorganized people annoy me
24. Q18: Interruptions to my daily routine annoy me

Factor 5: preference for routine
25. Q5: I enjoy spontaneity and uncertainty
26. Q12: I have many different plans relating to the same goal
27. Q23: I feel relaxed when I don't have a routine
28. Q28: I feel restricted by schedules and plans

Therefore, it seems as though there is some factorial validity to the structure.
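Because these five dimensions are predicted to be independent, an orthogonal (varimax) rotation is the natural choice here. A sketch of an equivalent analysis in Python, assuming the 28 item responses are exported to a hypothetical williams.csv:

```python
# Rough sketch: five-component solution with varimax rotation for the
# organizational ability items, plus the communalities.
import pandas as pd
from factor_analyzer import FactorAnalyzer

items = pd.read_csv("williams.csv")                      # assumed export of the 28 items
fa = FactorAnalyzer(n_factors=5, rotation="varimax", method="principal")
fa.fit(items)

loadings = pd.DataFrame(fa.loadings_, index=items.columns)
print(loadings.where(loadings.abs() > 0.4).round(2))     # compare with the rotated component matrix
print(pd.Series(fa.get_communalities(), index=items.columns).round(3))
```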


Chapter 18

Task 1

Certain editors at Sage Publications like to think they're a bit of a whiz at football (soccer if you prefer). To see whether they are better than Sussex lecturers and postgraduates we invited various employees of Sage to join in our football matches (oh, sorry, I mean we invited them down for important meetings about books). Every player was only allowed to play in one match. Over many matches, we counted the number of players that scored goals. The data are in the file SageEditorsCantPlayFootball.sav. Do a chi-square test to see whether more publishers or academics scored goals. We predict that Sussex people will score more than Sage people.

Let's run the analysis on the first question. First we must remember to tell SPSS which variable contains the frequencies by using the weight cases command. Select __, then in the resulting dialog box select _ and then select the variable in which the number of cases is specified (in this case Frequency) and drag it to the box labelled Frequency variable (or click on _). This process tells the computer that it should weight each category combination by the number in the column labelled Frequency.


To run the chi-square tests, select ___. First, select one of the variables of interest in the variable list and drag it into the box labelled Row(s) (or click on _). For this example, I selected Job to be the rows of the table. Next, select the other variable of interest (Score) and drag it to the box labelled Column(s) (or click on _). Select the same options as in the book.


The crosstabulation table produced by SPSS contains the number of cases that fall into each combination of categories. We can see that in total 28 people scored goals (36.4% of the total) and, of these, 5 were from Sage Publications (17.9% of the total that scored) and 23 were from Sussex (82.1% of the total that scored); 49 people didn't score at all (63.6% of the total) and, of those, 19 worked for Sage (38.8% of the total that didn't score) and 30 were from Sussex (61.2% of the total that didn't score).
Job * Did they score a goal? Crosstabulation
                                                             Did they score a goal?
Job                                                          Yes        No       Total
Sage Publications       Count                                  5         19         24
                        Expected Count                        8.7       15.3       24.0
                        % within Job                        20.8%      79.2%     100.0%
                        % within Did they score a goal?     17.9%      38.8%      31.2%
                        % of Total                           6.5%      24.7%      31.2%
University of Sussex    Count                                 23         30         53
                        Expected Count                       19.3       33.7       53.0
                        % within Job                        43.4%      56.6%     100.0%
                        % within Did they score a goal?     82.1%      61.2%      68.8%
                        % of Total                          29.9%      39.0%      68.8%
Total                   Count                                 28         49         77
                        Expected Count                       28.0       49.0       77.0
                        % within Job                        36.4%      63.6%     100.0%
                        % within Did they score a goal?    100.0%     100.0%     100.0%
                        % of Total                          36.4%      63.6%     100.0%

Before moving on to look at the test statistic itself it is vital that we check that the assumption for chi-square has been met. The assumption is that in 2 × 2 tables (which is what we have here), all expected frequencies should be greater than 5. If you look at the expected counts in the crosstabulation table, it should be clear that the smallest expected count is 8.7 (for Sage editors who scored). This value exceeds 5 and so the assumption has been met. Pearson's chi-square test examines whether there is an association between two categorical variables (in this case the job and whether the person scored or not). As part of the crosstabs procedure SPSS produces a table that includes the chi-square statistic and
its significance value. The Pearson chi-square statistic tests whether the two variables are independent. If the significance value is small enough (conventionally Sig. must be less than 0.05) then we reject the hypothesis that the variables are independent and accept the hypothesis that they are in some way related. The value of the chi-square statistic is given in the table (and the degrees of freedom) as is the significance value. The value of the chi-square statistic is 3.63. This value has a two-tailed significance of 0.057, which is bigger than 0.05 (hence non-significant). However, we made a specific prediction (that Sussex people would score more than Sage people), hence we can halve this value. Therefore, the chi-square is significant (one-tailed) because p = 0.0285, which is less than 0.05. The one-tailed significance values of the other statistics are also less than 0.05 so we have consistent results.
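The same test can be sketched directly from the observed cell counts (typed in here rather than read from SageEditorsCantPlayFootball.sav); with correction=False, scipy's chi2_contingency reproduces the Pearson statistic:

```python
# Rough sketch: Pearson chi-square on the 2 x 2 job-by-scoring table.
from scipy.stats import chi2_contingency

observed = [[5, 19],    # Sage Publications: scored, didn't score
            [23, 30]]   # University of Sussex: scored, didn't score

chi2, p_two_tailed, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi-square({dof}) = {chi2:.2f}, two-tailed p = {p_two_tailed:.3f}")
print(f"one-tailed p (directional prediction): {p_two_tailed / 2:.4f}")
print("Expected counts:", expected.round(2).tolist())    # all above 5
```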
Chi-Square Tests
                                   Value     df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
Pearson Chi-Square               3.634(b)     1          .057
Continuity Correction(a)         2.725        1          .099
Likelihood Ratio                 3.834        1          .050
Fisher's Exact Test                                                              .075                   .047
Linear-by-Linear Association     3.587        1          .058
N of Valid Cases                   77
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 8.73.

This significant (one-tailed) result indicates that there is an association between the type of job someone does and whether they score goals. This finding reflects the fact that for Sussex employees there is roughly a 50:50 split of those that scored and those that didn't, but for Sage employees there is about a 20:80 split, with only 20% scoring and 80% not scoring. This supports our hypothesis that people from Sage, despite their delusions, are crap at football!


Calculating an Effect Size

The odds of someone scoring given that they were employed by Sage are 5/19 = 0.26, and the odds of someone scoring given that they were employed by Sussex University are 23/30 = 0.77. Therefore, the odds ratio is 0.26/0.77 = 0.34. In other words, the odds of scoring if you work for Sage are 0.34 times those of someone who works for Sussex; a better way to express this is that if you work for Sage, the odds of scoring are 1/0.34 = 2.95 times lower than if you work for Sussex!

Reporting the Results of Chi-Square

We could report: There was a significant association between the type of job and whether or not a person scored a goal, χ²(1) = 3.63, p < .05 (one-tailed). This represents the fact that, based on the odds ratio, Sage employees were 2.95 times less likely to score than Sussex employees.
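The odds ratio is simple enough to sketch by hand; note that the exact reciprocal is about 2.9, and the 2.95 quoted above comes from rounding the odds ratio to 0.34 before inverting it.

```python
# Rough sketch: odds of scoring for each employer and the odds ratio.
sage_scored, sage_not = 5, 19
sussex_scored, sussex_not = 23, 30

odds_sage = sage_scored / sage_not              # about 0.26
odds_sussex = sussex_scored / sussex_not        # about 0.77
odds_ratio = odds_sage / odds_sussex            # about 0.34

print(f"odds(Sage) = {odds_sage:.2f}, odds(Sussex) = {odds_sussex:.2f}")
print(f"odds ratio = {odds_ratio:.2f}, reciprocal = {1 / odds_ratio:.2f}")
```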
Task 2

I wrote much of this update while on sabbatical in the Netherlands (I have a real soft spot for Holland). However, living there for three months did enable me to notice certain cultural differences to England. The Dutch are famous for travelling by bike; they do it much more than the English. However, I noticed that many more Dutch people cycle while steering with only one hand. I pointed this out to one of my friends, Birgit Mayer, and she said that I was being a crazy English fool and that Dutch people did not cycle one-handed. Several weeks of my pointing at one-handed cyclists and her pointing at two-handed cyclists ensued.


To put it to the test I counted the number of Dutch and English cyclists who ride with one or two hands on the handlebars (Handlebars.sav). Can you work out whether Birgit or I am right? First, we must remember to tell SPSS which variable contains the frequencies by using the weight cases command. Select __, then in the resulting dialog box select _ and then select the variable in which the number of cases is specified (in this case
Frequency) and drag it to the box labelled Frequency variable (or click on _). This process tells the computer that it should weight each category combination by the number in the column labelled Frequency.

To run the chi-square tests, select ___. First, select one of the variables of interest in the variable list and drag it into the box labelled Row(s) (or click on _). For this example, I selected Nationality to be the rows of the table. Next, select the other variable of interest (Hands) and drag it to the box labelled Column(s) (or click on _). Select the same options as in the book.


The crosstabulation table produced by SPSS contains the number of cases that fall into each combination of categories. We can see that in total 137 people rode their bike one-handed, of which 120 (87.6%) were Dutch and only 17 (12.4%) were English; 732 people rode their bike two-handed, of which 578 (79%) were Dutch and only 154 (21%) were English.


Before moving on to look at the test statistic itself it is vital that we check that the assumption for chi-square has been met. The assumption is that in 2 × 2 tables (which is what we have here), all expected frequencies should be greater than 5. If you look at the expected counts in the crosstabulation table, it should be clear that the smallest expected count is 27 (for English people who ride their bike one-handed). This value exceeds 5 and so the assumption has been met. The value of the chi-square statistic is 5.44. This value has a two-tailed significance of 0.020, which is smaller than 0.05 (hence significant). This suggests that the pattern of bike riding (i.e. relative numbers of one- and two-handed riders) significantly differs in English and Dutch people.

_

The significant result indicates that there is an association between whether someone is Dutch or English and whether they ride their bike one- or two-handed. Looking at the frequencies, this finding seems to show that the ratio of one- to two-handed riders differs in Dutch and English people. In Dutch people 17.2% ride their bike one-handed compared to 82.8% who ride two-handed. In England, though, only 9.9% rode their bike one-handed (almost half as many as in Holland), and 90.1% rode their bikes two-handed. If we look at the standardized residuals (in the contingency table) we can see that the only cell with a residual approaching significance (a value that lies outside of ±1.96) is the cell for English people riding one-handed (z = −1.9). The fact that this value is negative tells us that fewer people than expected fell into this cell.

Calculating an Effect Size
The odds of someone riding one-handed if they are Dutch are 120/578 = 0.21, and the odds of someone riding one-handed if they are English are 17/154 = 0.11. Therefore, the odds ratio is 0.21/0.11 = 1.9. In other words, the odds of riding one-handed if you are Dutch are 1.9 times higher than if you are English (or the odds of riding one-handed if you are English are about half those of a Dutch person).

Reporting the Results of Chi-Square
We could report: There was a significant association between nationality and whether the Dutch or English rode their bike one- or two-handed, χ²(1) = 5.44, p < .05. This represents the fact that, based on the odds ratio, the odds of riding a bike one-handed were 1.9 times higher for Dutch people than for English people. This supports Field's argument that there are more one-handed bike riders in the Netherlands than in England and utterly refutes Mayer's theory that Field is a complete arse. These data are in no way made up.
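A quick way to reproduce the chi-square and odds ratio above outside SPSS is sketched below in Python (scipy assumed to be installed); the counts come from the crosstabulation described earlier.

import numpy as np
from scipy.stats import chi2_contingency

# Rows: nationality (Dutch, English); columns: one-handed, two-handed
table = np.array([[120, 578],
                  [17, 154]])

chi2, p, df, expected = chi2_contingency(table, correction=False)
print(f"chi-square({df}) = {chi2:.2f}, p = {p:.3f}")        # ~5.44, p ~ .020
print("smallest expected count:", round(expected.min(), 1))  # ~27, so the assumption is met

odds_dutch = 120 / 578
odds_english = 17 / 154
print(f"odds ratio = {odds_dutch / odds_english:.2f}")       # ~1.9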
Task 3

I was interested in whether horoscopes are just a figment of people's minds. Therefore, I got 2201 people, made a note of their star sign (this variable, obviously, has 12 categories: Capricorn, Aquarius, Pisces, Aries, Taurus, Gemini, Cancer, Leo, Virgo, Libra, Scorpio and Sagittarius) and whether they believed in horoscopes (this variable has two categories: believer or unbeliever). I then sent them a horoscope in the post of what would happen over the next month: everybody, regardless of their star sign, received the same horoscope, which read: 'August is an exciting month for you. You will make friends with a tramp in the first week of the month and cook him a cheese omelette. Curiosity is your greatest virtue, and in the second week you'll discover knowledge of a subject that you previously thought was boring (statistics perhaps). You might purchase a book around this time that guides you towards this knowledge. Your new wisdom leads to a change in career around the third week, when you ditch your current job and become an accountant. By the final week you find yourself free from the constraints of having friends, your boy/girlfriend has left you for a Russian ballet dancer with a glass eye, and you now spend your weekends doing loglinear analysis by hand with a pigeon called Hephzibah for company.' At the end of August I interviewed all of these people and classified the horoscope as having come true, or not, based on how closely their lives matched the fictitious horoscope. The data are in the file Horoscope.sav. Conduct a loglinear analysis to see whether there is a relationship between the person's star sign, whether they believe in horoscopes and whether the horoscope came true.

Running the Analysis
Data are entered for this example as frequency values for each combination of categories, so before you begin you must weight the cases by the variable Frequency. If you don't do this the entire output will be wrong! Select __, then in the resulting dialog box select _ and then select the variable in which the number of cases is specified (in this case Frequency) and drag it to the box labelled Frequency variable (or click on _). This process tells the computer that it should weight each category combination by the number in the column labelled Frequency.

To get a crosstabulation table, select ___. We have three variables in our crosstabulation table: whether someone believes in star signs or not (Believe), the star sign of the person (Star_Sign) and whether the horoscope came true or not (True). Select Believe and drag it into the box labelled Row(s) (or click on _). Next, select True and drag it to the box labelled Column(s) (or click on _). We have a third variable too, and we need to define this variable as a layer. Select Star_Sign and drag it to the box labelled Layer 1 of 1 (or click on _). Then click on _ and select the options required.


The crosstabulation table produced by SPSS contains the number of cases that fall into each combination of categories. Although this table is quite complicated, you should be able to see that there are roughly the same number of believers and non-believers, and similar numbers of people whose horoscopes came true or didn't. These proportions are also fairly consistent across the different star signs. Also, there are no expected counts less than 5, so our assumptions are met.


_

The Loglinear Analysis
Then run the main analysis. The way to run a loglinear analysis that is consistent with my section on the theory of the analysis is to select ___ to access the dialog box. Select any variables that you want to include in the analysis by selecting them with the mouse (remember that you can select several at the same time by holding down the Ctrl key) and then dragging them to the box labelled Factor(s) (or click on _). When there is a variable in this box the _ button becomes active. We have to tell SPSS the codes that we've used to define our categorical variables. Select a variable in the Factor(s) box and then click on _ to activate a dialog box that allows you to specify the values of the minimum and maximum codes that you've used for that variable. When you've done this, click on _ to return to the main dialog box.

Output from Loglinear Analysis
The initial output from the loglinear analysis tells us that we have 2201 cases. SPSS then lists all of the factors in the model and the number of levels they have. To begin with, SPSS fits the saturated model (all terms are in the model, including the highest-order interaction, in this case the star sign × believer × true interaction). SPSS then gives us the observed and expected counts for each of the combinations of categories in our model. These values should be the same as the original contingency table, except that each cell has 0.5 added to it. The final bit of this initial output gives us two goodness-of-fit statistics (Pearson's chi-square and the likelihood-ratio statistic, both of which we came across at the beginning of this chapter). In this context these tests are testing the hypothesis that the frequencies predicted by the model (the expected frequencies) are significantly different from the actual frequencies in our data (the observed frequencies). At this stage the model perfectly fits the data, so both statistics are 0 and no probability value, p, can be computed (the saturated model has no residual degrees of freedom).



The next part of the output tells us something about which components of the model can be removed. The first bit of the output is labelled K-way and higher-order effects, and underneath there is a table showing likelihood-ratio and chi-square statistics when K = 1, 2 and 3 (as we go down the rows of the table). The first row (K = 1) tells us whether removing the one-way effects (i.e. the main effects of star sign, believer and true) and any higher-order effects will significantly affect the fit of the model. There are lots of higher-order effects here (the two-way interactions and the three-way interaction), so this is basically testing whether, if we remove everything from the model, there will be a significant effect on the fit of the model. This is highly significant because the probability value is 0.000, which is less than 0.05. The next row of the table (K = 2) tells us whether removing the two-way interactions (i.e. the star sign × believer, star sign × true and believer × true interactions) and any higher-order effects will affect the model. In this case there is a higher-order effect (the three-way interaction), so this is testing whether removing the two-way interactions and the three-way interaction would affect the fit of the model. This is significant (the probability is 0.03, which is less than 0.05), indicating that if we removed the two-way interactions and the three-way interaction then this would have a significant detrimental effect on the model. The final row (K = 3) is testing whether removing the three-way effect and higher-order effects will significantly affect the fit of the model. Of course, the three-way interaction is the highest-order effect that we have, so this is simply testing whether removal of the three-way interaction (i.e. the star sign × believer × true interaction) will significantly affect the fit of the model. If you look at the two columns labelled Prob then you can see that both the chi-square and likelihood-ratio tests agree that removing this interaction will not significantly affect the fit of the model (because the probability value is greater than 0.05).

The next part of the table expresses the same thing but without including the higher-order effects. It's labelled K-way effects and lists tests for when K = 1, 2 and 3. The first row (K = 1), therefore, tests whether removing the main effects (the one-way effects) has a significant detrimental effect on the model. The probability values are less than 0.05, indicating that if we removed the main effects of star sign, believer and true from our model it would significantly affect the fit of the model (in other words, one or more of these effects is a significant predictor of the data). The second row (K = 2) tests whether removing the two-way interactions has a significant detrimental effect on the model. The probability values are less than 0.05, indicating that if we removed the star sign × believer, star sign × true and believer × true interactions then this would significantly reduce how well the model fits the data. In other words, one or more of these two-way interactions is a significant predictor of the data. The final row (K = 3) tests whether removing the three-way interaction has a significant detrimental effect on the model. The probability values are greater than 0.05, indicating that if we removed the star sign × believer × true interaction then this would not significantly reduce how well the model fits the data. In other words, this three-way interaction is not a significant predictor of the data. This row should be identical to the final row of the upper part of the table (the K-way and higher-order effects) because it is the highest-order effect and so in the previous table there were no higher-order effects to include in the test (look at the output and you'll see the results are identical). What this is actually telling us is that the three-way interaction is not significant: removing it from the model does not have a significant effect on how well the model fits the data. We also know that removing all two-way interactions does have a significant effect on the model, as does removing the main effects, but you have to remember that loglinear analysis should be done hierarchically, and so these two-way interactions are more important than the main effects.

The Partial Association table simply breaks down the table that we've just looked at into its component parts. So, for example, although we know from the previous output that removing all of the two-way interactions significantly affects the model, we don't know which of the two-way interactions is having the effect. This table tells us. We get a Pearson chi-square test for each of the two-way interactions and the main effects, and the column labelled Sig. tells us which of these effects is significant (values less than .05 are significant). We can tell from this that the star sign × believe and believe × true interactions are significant, but the star sign × true interaction is not. Likewise, we saw in the previous output that removing the one-way effects also significantly affects the fit of the model, and these findings are confirmed here because the main effect of star sign is highly significant (although this just means that we collected different amounts of data for each of the star signs!).

The final bit of output deals with the backward elimination. SPSS will begin with the highest-order effect (in this case, the star sign × believe × true interaction): it removes it from the model, sees what effect this has, and if it doesn't have a significant effect then it moves on to the next highest-order effects (in this case the two-way interactions). As we've already seen, removing the three-way interaction does not have a significant effect, and this is confirmed at this stage by the table labelled Step Summary, which confirms that removing the three-way interaction has a non-significant effect on the model. At step 1, the three two-way interactions are then assessed in the bit of the table labelled Deleted Effect. From the values of Sig. it's clear that the star sign × believe (p = .037) and believe × true (p = .000) interactions are significant but the star sign × true interaction (p = .465) is not. Therefore, at step 2 the non-significant star sign × true interaction is deleted, leaving the remaining two-way interactions in the model. These two interactions are then re-evaluated and both the star sign × believe (p = .049) and believe × true (p = .001) interactions are still significant and so are retained. Therefore, the final model is the one that retains all main effects and these two interactions. As neither of these interactions can be removed without affecting the model, and these interactions involve all three of the main effects (the variables star sign, true and believe are all involved in at least one of the remaining interactions), the main effects are not examined (because their effect is confounded with the interactions that have been retained). Finally, SPSS evaluates this final model with the likelihood-ratio statistic, and we're looking for a non-significant test statistic, which indicates that the expected values generated by the model are not significantly different from the observed data (put another way, the model is a good fit of the data). In this case the result is very non-significant, indicating that the model is a good fit of the data.
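If you want to check the final model outside SPSS, one option is to fit the equivalent loglinear model as a Poisson GLM on the cell frequencies and inspect its deviance, which plays the role of the likelihood-ratio fit statistic. The sketch below is illustrative only: it assumes the cell counts have been exported to a file called horoscope_cells.csv with columns Star_Sign, Believe, Came_True and Frequency (hypothetical names), and it will not match SPSS's HILOGLINEAR exactly because that procedure also adds 0.5 to every cell.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per cell of the 12 x 2 x 2 table, with the observed frequency
cells = pd.read_csv("horoscope_cells.csv")  # hypothetical export of Horoscope.sav

# Final model from the backward elimination: all main effects plus the
# star sign x believe and believe x true interactions
model = smf.glm("Frequency ~ C(Star_Sign)*C(Believe) + C(Believe)*C(Came_True)",
                data=cells, family=sm.families.Poisson()).fit()

# The residual deviance is the likelihood-ratio test of this model against
# the saturated model; it should be close to chi-square(22) = 19.58
print(f"deviance = {model.deviance:.2f} on {int(model.df_resid)} df")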

The believe × true Interaction
The next step is to try to interpret these interactions. The first useful thing we can do is to collapse the data. Remember from the chapter that there are the following rules for collapsing data: (1) the highest-order interaction should be non-significant; and (2) at least one of the lower-order interaction terms involving the variable to be deleted should be non-significant. We need to look at the star sign × believe and believe × true interactions. Let's take the believe × true interaction first. Ideally we want to collapse the data across the star sign variable. To do this the three-way interaction must be non-significant (it was) and at least one lower-order interaction involving star sign must also be non-significant (the star sign × true interaction was). So, we can look at this interaction by doing a chi-square test on believe and true, ignoring star sign. The results are below:
Did Their Horoscope Come True? * Do They Believe? Crosstabulation

                                               Unbeliever   Believer   Total
Horoscope Didn't Come True   Count               582          532      1114
                             Expected Count      542.1        571.9    1114.0
                             % of Total          26.4%        24.2%    50.6%
Horoscope Came True          Count               489          598      1087
                             Expected Count      528.9        558.1    1087.0
                             % of Total          22.2%        27.2%    49.4%
Total                        Count               1071         1130     2201
                             Expected Count      1071.0       1130.0   2201.0
                             % of Total          48.7%        51.3%    100.0%

Chi-Square Tests

                               Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                              (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             11.601b   1    .001
Continuity Correctiona         11.312    1    .001
Likelihood Ratio               11.612    1    .001
Fisher's Exact Test                                          .001         .000
Linear-by-Linear Association   11.596    1    .001
N of Valid Cases               2201

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 528.93.

This chi-square is highly significant. To interpret this we could consider calculating some odds ratios. First, the odds of the horoscope coming true given that the person was a believer were 598/532 = 1.12. However, the odds of the horoscope coming true given that the person was an unbeliever were 489/582 = 0.84. Therefore, the odds ratio is 1.12/0.84 = 1.33. We can interpret this by saying that the odds that a horoscope would come true were 1.33 times higher in believers than non-believers. Given that the horoscopes were made-up twaddle this might be evidence that believers behave in ways to make their horoscopes come true!
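As a quick cross-check outside SPSS, the collapsed 2 × 2 analysis above can be reproduced with a few lines of Python (scipy assumed to be installed):

import numpy as np
from scipy.stats import chi2_contingency

# Rows: horoscope didn't come true, came true; columns: unbeliever, believer
table = np.array([[582, 532],
                  [489, 598]])

chi2, p, df, _ = chi2_contingency(table, correction=False)
print(f"chi-square({df}) = {chi2:.2f}, p = {p:.4f}")           # ~11.60, p ~ .001

odds_believer = 598 / 532
odds_unbeliever = 489 / 582
print(f"odds ratio = {odds_believer / odds_unbeliever:.2f}")   # ~1.33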

The star sign × believe Interaction
Next, we can look at the star sign × believe interaction. For this interaction we'd like to collapse across the true variable. To do this: (1) the highest-order interaction should be non-significant (which it is); and (2) at least one of the lower-order interaction terms involving the variable to be deleted should be non-significant (the star sign × true interaction was). So, we can look at this interaction by doing a chi-square test on star sign and believe, ignoring true. The results are below:


Star Sign * Do They Believe? Crosstabulation
(count, expected count in brackets, and % within star sign)

Star Sign      Unbeliever               Believer                 Total
Capricorn      102 (103.2)   48.1%      110 (108.8)   51.9%      212
Aquarius        46 (47.2)    47.4%       51 (49.8)    52.6%       97
Pisces         106 (116.8)   44.2%      134 (123.2)   55.8%      240
Aries           78 (98.3)    38.6%      124 (103.7)   61.4%      202
Taurus          98 (92.0)    51.9%       91 (97.0)    48.1%      189
Gemini         118 (100.2)   57.3%       88 (105.8)   42.7%      206
Cancer         160 (165.0)   47.2%      179 (174.0)   52.8%      339
Leo             37 (33.6)    53.6%       32 (35.4)    46.4%       69
Virgo          124 (116.3)   51.9%      115 (122.7)   48.1%      239
Libra           53 (54.0)    47.7%       58 (57.0)    52.3%      111
Scorpio         52 (52.6)    48.1%       56 (55.4)    51.9%      108
Sagittarius     97 (92.0)    51.3%       92 (97.0)    48.7%      189
Total         1071 (1071.0)  48.7%     1130 (1130.0)  51.3%     2201

Chi-Square Tests

                               Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square             19.634a   11   .051
Likelihood Ratio               19.737    11   .049
Linear-by-Linear Association    2.651     1   .103
N of Valid Cases               2201

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 33.58.

This chi-square is borderline significant (two-tailed; we had no specific prediction, so we need to look at the two-tailed significance). It doesn't make a lot of sense to compute odds ratios because there are so many star signs (although we could use one star sign as a base category and compute odds ratios for all other signs compared to this category; a sketch of this idea follows below). However, the obvious general interpretation of this effect is that the ratio of believers to unbelievers differs across the star signs. For example, in most star signs there is a roughly 50:50 split of believers and unbelievers, but for Aries there is roughly a 40:60 split, and it is probably this difference that contributes most to the effect. However, it's important to keep this effect in perspective. It may not be that interesting that we happened to sample a different ratio of believers and unbelievers in certain star signs (unless you believe that certain star signs should have more cynical views of horoscopes than others!). We actually set out to find out something about whether the horoscopes would come true, and it's worth remembering that this interaction ignores the crucial variable that measured whether or not the horoscope came true!
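Following up the suggestion about a base category, here is a minimal Python sketch of those odds ratios, using the counts from the crosstabulation above and taking Capricorn as an arbitrary base:

# Unbeliever and believer counts for each star sign (from the crosstabulation)
counts = {
    "Capricorn": (102, 110), "Aquarius": (46, 51),  "Pisces": (106, 134),
    "Aries": (78, 124),      "Taurus": (98, 91),    "Gemini": (118, 88),
    "Cancer": (160, 179),    "Leo": (37, 32),       "Virgo": (124, 115),
    "Libra": (53, 58),       "Scorpio": (52, 56),   "Sagittarius": (97, 92),
}

base_unbelievers, base_believers = counts["Capricorn"]
base_odds = base_believers / base_unbelievers  # odds of being a believer for Capricorn

for sign, (unbelievers, believers) in counts.items():
    odds_ratio = (believers / unbelievers) / base_odds
    print(f"{sign:12s} odds of believing vs. Capricorn: {odds_ratio:.2f}")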

Reporting the Results
For this example we could report: The three-way loglinear analysis produced a final model that retained the star sign × believe and believe × true interactions. The likelihood ratio of this model was χ²(22) = 19.58, p = .61. The star sign × believe interaction was significant, χ²(11) = 19.74, p < .05. This interaction indicates that the ratio of believers and unbelievers was different across the 12 star signs. In particular, the ratio in Aries (38.6:61.4 unbelievers to believers) was quite different from the other groups, which consistently had a roughly 50:50 split. The believe × true interaction was also significant, χ²(1) = 11.61, p < .001. The odds ratio indicated that the odds of the horoscope coming true were 1.33 times higher in believers than non-believers. Given that the horoscopes were made-up twaddle this might be evidence that believers behave in ways to make their horoscopes come true.

Task 4

On my statistics course students have weekly SPSS classes in a computer laboratory. These classes are run by postgraduate tutors, but I often pop in to help out. I've noticed when in these sessions that many students are studying Facebook rather more than they are studying their very interesting statistics assignments that I have set them. I wanted to see the impact that this behaviour had on their exam performance. I collected data from all 260 students on my course. First I checked their Attendance and classified them as having attended either more or less than 50% of their lab classes. Next, I classified them as being either someone who looked at Facebook during their lab class, or someone who never did. Lastly, after the Research Methods in Psychology (RMiP) exam, I classified them as having either passed or failed (Exam). The data are in Facebook.sav. Do a loglinear analysis on the data to see if there is an association between studying Facebook and failing your exam.

Running the Analysis
Data are entered for this example as frequency values for each combination of categories, so before you begin you must weight the cases by the variable Frequency. If you don't do this the entire output will be wrong! Select __, then in the resulting dialog box select _ and then select the variable in which the number of cases is specified (in this case Frequency) and drag it to the box labelled Frequency variable (or click on _). This process tells the computer that it should weight each category combination by the number in the column labelled Frequency.

To get a crosstabulation table, select ___. We have three variables in our crosstabulation table: whether someone looked at Facebook during their lab classes (Facebook), whether they attended more than 50% of classes (Attendance) and whether they passed or failed their RMiP exam (Exam). Select Facebook and drag it into the box labelled Row(s) (or click on _). Next, select Exam and drag it to the box labelled Column(s) (or click on _). We have a third variable too, and we need to define this variable as a layer. Select Attendance and drag it to the box labelled Layer 1 of 1 (or click on _). Then click on _ and select the options required.

_

The crosstabulation table produced by SPSS contains the number of cases that fall into each combination of categories. There are no expected counts less than 5, so our assumptions are met.


_

The Loglinear Analysis
Then run the main analysis. The way to run a loglinear analysis that is consistent with my section on the theory of the analysis is to select ___ to access the dialog box. Select any variables that you want to include in the analysis by selecting them with the mouse (remember that you can select several at the same time by holding down the Ctrl key) and then dragging them to the box labelled Factor(s) (or click on _). When there is a variable in this box the _ button becomes active. We have to tell SPSS the codes that we've used to define our categorical variables. Select a variable in the Factor(s) box and then click on _ to activate a dialog box that allows you to specify the values of the minimum and maximum codes that you've used for that variable. When you've done this, click on _ to return to the main dialog box.

Output from Loglinear Analysis

The first bit of the output, labelled K-way and higher-order effects, shows likelihood-ratio and chi-square statistics when K = 1, 2 and 3 (as we go down the rows of the table). The first row (K = 1) tells us whether removing the one-way effects (i.e. the main effects of Attendance, Facebook and Exam) and any higher-order effects will significantly affect the fit of the model. There are lots of higher-order effects here (the two-way interactions and the three-way interaction), so this is basically testing whether, if we remove everything from the model, there will be a significant effect on the fit of the model. This is highly significant because the probability value is 0.000, which is less than 0.05. The next row of the table (K = 2) tells us whether removing the two-way interactions (i.e. Attendance × Exam, Facebook × Exam and Attendance × Facebook) and any higher-order effects will affect the model. In this case there is a higher-order effect (the three-way interaction), so this is testing whether removing the two-way interactions and the three-way interaction would affect the fit of the model. This is significant (the probability is 0.000, which is less than 0.05), indicating that if we removed the two-way interactions and the three-way interaction then this would have a significant detrimental effect on the model. The final row (K = 3) is testing whether removing the three-way effect and higher-order effects will significantly affect the fit of the model. Of course, the three-way interaction is the highest-order effect that we have, so this is simply testing whether removal of the three-way interaction (i.e. the Attendance × Facebook × Exam interaction) will significantly affect the fit of the model. If you look at the two columns labelled Prob then you can see that both the chi-square and likelihood-ratio tests agree that removing this interaction will not significantly affect the fit of the model (because the probability value is greater than 0.05).

The next part of the table expresses the same thing but without including the higher-order effects. It's labelled K-way effects and lists tests for when K = 1, 2 and 3. The first row (K = 1), therefore, tests whether removing the main effects (the one-way effects) has a significant detrimental effect on the model. The probability values are less than 0.05, indicating that if we removed the main effects of Attendance, Facebook and Exam from our model it would significantly affect the fit of the model (in other words, one or more of these effects is a significant predictor of the data). The second row (K = 2) tests whether removing the two-way interactions has a significant detrimental effect on the model. The probability values are less than 0.05, indicating that if we removed the two-way interactions then this would significantly reduce how well the model fits the data. In other words, one or more of these two-way interactions is a significant predictor of the data. The final row (K = 3) tests whether removing the three-way interaction has a significant detrimental effect on the model. The probability values are greater than 0.05, indicating that if we removed the three-way interaction then this would not significantly reduce how well the model fits the data. In other words, this three-way interaction is not a significant predictor of the data. This row should be identical to the final row of the upper part of the table (the K-way and higher-order effects) because it is the highest-order effect and so in the previous table there were no higher-order effects to include in the test (look at the output and you'll see the results are identical).


The main effect of Attendance was significant, χ²(1) = 27.63, p < .001, indicating (based on the contingency table) that significantly more students attended over 50% of their classes (N = 172; footnote 5) than attended less than 50% (N = 88; footnote 6). The main effect of Facebook was significant, χ²(1) = 10.47, p < .01, indicating (based on the contingency table) that significantly fewer students looked at Facebook during their classes (N = 104; footnote 7) than did not look at Facebook (N = 156; footnote 8). The main effect of Exam was significant, χ²(1) = 22.54, p < .001, indicating (based on the contingency table) that significantly more students passed the RMiP exam (N = 168; footnote 9) than failed (N = 92; footnote 10).

Footnotes 5-10 (how each N is built up from the cell counts of the contingency table):
5. 39 + 30 + 98 + 5 = 172
6. 5 + 30 + 26 + 27 = 88
7. 39 + 30 + 5 + 30 = 104
8. 98 + 5 + 26 + 27 = 156
9. 39 + 98 + 5 + 26 = 168
10. 30 + 5 + 30 + 27 = 92

The Attendance × Exam interaction was significant, χ²(1) = 61.80, p < .01, indicating that whether you attended more or less than 50% of classes affected exam performance. To illustrate, here's the contingency table:


_

This shows that those who attended more than half of their classes had a much better chance of passing their exam (nearly 80% passed) than those attending less than 50% of classes (only 35% passed). All of the standardized residuals are significant, indicating that all cells contribute to this overall association. The Facebook × Exam interaction was significant, χ²(1) = 49.77, p < .001, indicating that whether you looked at Facebook or not affected exam performance. To illustrate, here's the contingency table:

_

This shows that those who looked at Facebook had a much lower chance of passing their exam (58% failed) than those who didn't look at Facebook during their lab classes (around 80% passed).
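The percentages quoted for these two interactions can be recovered outside SPSS from the cell counts given in the footnotes above. In the Python sketch below the 2 × 2 × 2 layout of those counts is pieced together from the footnote sums, so treat the assignment of counts to cells as an assumption rather than a copy of the SPSS table:

# (Facebook use, attendance, exam outcome) -> count; inferred from the footnote sums
cells = {
    ("facebook", "attended", "passed"): 39,
    ("facebook", "attended", "failed"): 30,
    ("no_facebook", "attended", "passed"): 98,
    ("no_facebook", "attended", "failed"): 5,
    ("facebook", "skipped", "passed"): 5,
    ("facebook", "skipped", "failed"): 30,
    ("no_facebook", "skipped", "passed"): 26,
    ("no_facebook", "skipped", "failed"): 27,
}

def pass_rate(position, value):
    # Proportion passing among everyone whose key matches `value` at `position`
    passed = sum(n for k, n in cells.items() if k[position] == value and k[2] == "passed")
    total = sum(n for k, n in cells.items() if k[position] == value)
    return passed / total

print(f"attended > 50%: {pass_rate(1, 'attended'):.0%} passed")      # ~80%
print(f"attended < 50%: {pass_rate(1, 'skipped'):.0%} passed")       # ~35%
print(f"Facebook users: {pass_rate(0, 'facebook'):.0%} passed")      # ~42% (58% failed)
print(f"non-users:      {pass_rate(0, 'no_facebook'):.0%} passed")   # ~79%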

The Facebook × Attendance × Exam interaction was not significant, χ²(1) = 1.57, p = .20. This result indicates that the effect of Facebook (described above) was roughly the same in those who attended more than 50% of classes and those who attended less than 50% of classes. In other words, although those attending less than 50% of classes did worse than those attending more, within both groups those looking at Facebook did relatively worse than those not looking at Facebook.
Chapter 19
Task 1

Using the cosmetic surgery example, run the analysis but also including BDI, age and gender as fixed effect predictors. What differences does including these predictors make?

Select ___, and specify the contextual variable by selecting Clinic from the list of variables and dragging it to the box labelled Subjects (or click on _).


Click on _ to move to the main dialog box. First we must specify our outcome variable, which is quality of life (QoL) after surgery, so select Post_QoL and drag it to the space labelled Dependent variable (or click on _). Next we need to specify our predictors. Therefore, select Surgery, Base_QoL, Age, Gender and BDI (hold down Ctrl to select them all simultaneously) and drag them to the space labelled Covariate(s) (or click on _).


We need to add the predictors as fixed effects to our model, so click on _, hold down Ctrl and select Base_QoL, Surgery, Age, Gender and BDI in the list labelled Factors and Covariates. Then make sure that _ is set to _ and click on _ to transfer these predictors to the Model. To specify the interaction term, first click on _ and change it to _. Next, select Surgery from the Factors and Covariates list and then, while holding down the Ctrl key, select Reason. With both variables selected, click on _ to transfer them to the Model as an interaction effect. Click on _ to return to the main dialog box.

We now need to ask for a random intercept and random slopes for the effect of Surgery. Click on _ in the main dialog box. Select Clinic and drag it to the area labelled Combinations (or click on _). We want to specify that the intercept is random, and we do this by selecting _. Next, select Surgery from the list of Factors and covariates and add it to the model by clicking on _. The other change that we need to make is that we need to estimate the covariance between the random slope and the random intercept. This estimation is achieved by clicking on _ to access the drop-down list and selecting _.

Click on _ and select _. Click on _ to return to the main dialog box. In the main dialog box click on _ and request Parameter estimates and Tests for covariance parameter. Click on _ to return to the main dialog box. To run the analysis, click on _. The output is as follows:


In terms of the overall fit of this new model, we can use the log-likelihood statistics:


If we look at the critical value for the chi-square statistic in the Appendix, it is 7.81 (p < .05, df = 3); therefore, this change is significant. Including these three predictors has improved the fit of the model. Age, F(1, 150.83) = 37.32, p < .001, and BDI, F(1, 260.83) = 16.74, p < .001, significantly predicted quality of life after surgery but gender did not, F(1, 264.48) = 0.90, p = .34. The main difference that including these factors has made is that the main effect of Reason has become non-significant, and the Reason × Surgery interaction has become more significant (its b has changed from 4.22, p = .013, to 5.02, p = .001). We could break down this interaction as we did in the chapter by splitting the file and running a simpler analysis (without the interaction and the main effect of Reason, but including Base_QoL, Surgery, BDI, Age and Gender). If you do these analyses you will get the parameter tables below. These tables show a similar pattern to the example in the book. For those operated on only to change their appearance, surgery significantly predicted quality of life after surgery, b = −3.16, t(5.25) = −2.63, p = .04. Unlike when age, gender and BDI were not included, this effect is now significant. The negative gradient shows that in these people quality of life was lower after surgery compared to the control group. However, for those who had surgery to solve a physical problem, surgery did not significantly predict quality of life, b = 0.67, t(10.59) = 0.58, p = .57. In essence the inclusion of age, gender and BDI has made very little difference in this latter group. However, the slope was positive, indicating that people who had surgery scored higher on quality of life than those on the waiting list (although not significantly so!). The interaction effect, therefore, as in the chapter, reflects the difference in slopes for surgery as a predictor of quality of life in those who had surgery for physical problems (slight positive slope) and those who had surgery purely for vanity (a negative slope).
Surgery to Change Appearance:

Surgery for a Physical Problem:
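For readers who like to cross-check multilevel models outside SPSS, here is a rough statsmodels analogue of the full model specified above. It assumes the SPSS file has been exported to a CSV with the variable names used in the chapter (the file name is hypothetical), and statsmodels' estimation defaults differ from SPSS MIXED, so the estimates will be similar but not identical.

import pandas as pd
import statsmodels.formula.api as smf

surgery = pd.read_csv("cosmetic_surgery.csv")  # hypothetical export of the SPSS data file

# Fixed effects: baseline QoL, surgery, age, gender, BDI, reason and the
# surgery x reason interaction; random intercept and random slope for
# Surgery across clinics, with the covariance between them estimated
model = smf.mixedlm(
    "Post_QoL ~ Base_QoL + Surgery + Age + Gender + BDI + Reason + Surgery:Reason",
    data=surgery,
    groups=surgery["Clinic"],
    re_formula="~Surgery",
)
result = model.fit(reml=False)  # ML, so models differing in fixed effects can be compared
print(result.summary())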

Task 2
Using our growth model example in this chapter, analyse the data but include Gender as an additional covariate. Does this change your conclusions?

First, select ___ and in the initial dialog box set up the level 2 variable. In this example, life satisfaction at multiple time points is nested within people. Therefore, the level 2 variable is the person and this variable is represented by the variable labelled Person.

Select this variable and drag it to the box labelled Subjects (or click on _). Click on _ to access the main dialog box.

In the main dialog box we need to set up our predictors and outcome. The outcome was life satisfaction, so select Life_Satisfaction and drag it to the box labelled Dependent variable (or click on _). Our predictor, or growth variable, is Time so select this variable and drag it to the box labelled Covariate(s), or click on _. We also want to include
Gender, so select this variable and drag it to the box labelled Covariate(s), or click on _.


Click on _ to bring up the fixed effects dialog box. First we need to include Gender in the model, so select this variable and click on _ to add it into the model. To specify the linear polynomial, click on Time and then click on _ to add it into the model. To add the higher-order polynomials we need to select _. Select Time in the Factors and Covariates list and _ will become active; click on this button and Time will appear in the space labelled Build Term. For the quadratic or second-order polynomial we need to define Time², and we can specify this by clicking on _ to add a multiplication symbol to our term, then selecting Time again and clicking on _. The Build Term bar should now read Time*Time (or, put another way, Time²). Click on _ to put it into the model. Finally, let's add the cubic trend. For the cubic or third-order polynomial we need to define Time³ (or Time*Time*Time). We build this term up in the same way as for the quadratic polynomial: select Time, click on _, click on _, select Time again, click on _, click on _ again, select Time for a third time, click on _, click on _. This should add the third-order polynomial (or Time*Time*Time) to the model. Click on _ to return to the main dialog box.


As in the chapter we expect the relationship between time and life satisfaction to have both a random intercept and a random slope. We need to define these parameters now by clicking on _ in the main dialog box. We specify our contextual variable by selecting
Person and dragging it to the area labelled Combinations (or click on _). To specify that

the intercept is random select _, and to specify random slopes for the effect of Time, click on this variable in the Factors and Covariates list and then click on _ to include it in the Model. Finally, we need to specify the covariance structure. As in the chapter, choose an autoregressive covariance structure, AR(1), and lets also assume that variances will be heterogeneous. Therefore, select _ from the drop-down list. Click on _ to return to the main dialog box. Click on _ and select _ and then click on _ and select Parameter estimates and Tests for covariance parameters. Click on _ to return to the main dialog box. To run the analysis, click on _.


The output is the same as the last output in the chapter except that it now includes the effect of Gender. To see whether Gender has improved the model we again compare the value of −2LL for this new model to the value for the previous model. We have added only one term to the model, so the new degrees of freedom will have risen by 1, from 8 to 9 (again, you can find the value of 8 in the row labelled Total in the column labelled Number of Parameters, in the table called Model Dimension). We can compute the change in −2LL as a result of Gender by subtracting the −2LL for this model from the −2LL for the last model in the chapter:

χ²Change = 1798.86 − 1798.74 = 0.12

dfChange = 9 − 8 = 1

The critical values for the chi-square statistic with df = 1 in the Appendix are 3.84 (p < .05) and 6.63 (p < .01); therefore, this change is not significant because 0.12 is less than the critical value of 3.84. The table of fixed effects and the parameter estimates tell us that the linear, F(1, 221.41) = 10.01, p < .01, and quadratic, F(1, 212.51) = 9.41, p < .01, trends both significantly described the pattern of the data over time; however, the cubic trend, F(1, 214.39) = 3.19, p > .05, did not. These results are basically the same as in the chapter. Gender itself is also not significant in this table, F(1, 113.02) = 0.11, p > .05. The final part of the output tells us about the random parameters in the model. First of all, the variance of the random intercepts was Var(u0j) = 3.89. This suggests that we were correct to assume that life satisfaction at baseline varied significantly across people. Also, the variance of people's slopes, Var(u1j) = 0.24, was significant, suggesting that the change in life satisfaction over time also varied significantly across people. Finally, the covariance between the slopes and intercepts (−0.39) suggests that as intercepts increased, the slope decreased. These results confirm what we already know from the chapter. The trend in the data is best described by a second-order polynomial, or a quadratic trend. This reflects the initial increase in life satisfaction 6 months after finding a new partner but a subsequent reduction in life satisfaction at 12 and 18 months after the start of the relationship. The parameter estimates tell us much the same thing. As such, our conclusions have been unaffected by including gender.
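The likelihood-ratio test above is easy to verify by hand; for example, in Python:

from scipy.stats import chi2

ll_change = 1798.86 - 1798.74   # change in -2LL after adding Gender
df_change = 9 - 8               # one extra parameter
p = chi2.sf(ll_change, df_change)
print(f"chi-square({df_change}) = {ll_change:.2f}, p = {p:.3f}")  # p ~ .73, not significant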


Task 3

Getting kids to exercise (Hill, Abraham, & Wright, 2007). The purpose of this research was to examine whether providing children with a leaflet based on the theory of planned behaviour increases children's exercise. There were four different interventions (Intervention): a control group, a leaflet, a leaflet and quiz, and a leaflet and plan; 503 children from 22 different classrooms were sampled (Classroom). It was not practical to have children in the same classrooms in different conditions; therefore, the 22 classrooms were randomly assigned to the four different conditions. Children were asked 'On average over the last three weeks, I have exercised energetically for at least 30 minutes ______ times per week' after the intervention (Exercise). Run a multilevel model analysis on these data (Hill et al. (2007).sav) to see whether the intervention affected the children's exercise levels (the hierarchy in the data is: children within classrooms within interventions).


Here is a graph of the data; the big dots are the means for the classrooms, and the box plots are standard box plots that ignore the clustered structure:

[Box plots of exercise (+ 0.5) on the y-axis, by condition on the x-axis: Control, Leaflet, L+quiz, L+plan.]

The data file looks like:


The analysis is done with the MIXED procedure by selecting ___. At the first screen you enter your level 2 variable in the subject box (Classroom). Remember: the SPSS MIXED procedure assumes that you are doing repeated-measures analysis of individuals.


After clicking on _ you enter the outcome variable (Exercise) and the predictor (Intervention).

You then have six buttons to enter the details of the analyses. Here we consider only _ and _. The _ screen allows you to enter the fixed part of the model. This is the condition the participant is in. Select the variable that specifies conditions (Intervention) and click on _:


The _ screen is where you can really take advantage of the procedure's flexibility. The model looked at here is one of the simpler multilevel models. Highlight Classroom in the Subjects box and put it into the Combinations box by clicking on _. This tells the computer that this is the cluster variable. By not entering any variables into the Model box the computer assumes that you just want a random intercept. The default choice of _ should be used for this example.


Now click on _ and select Tests for covariance parameters.

Click on _ and then _.


The first part of the output gives details about the model that is being entered into the SPSS machinery. The Information Criteria box gives some of the popular methods for assessing the fit of models; AIC and BIC are two of the most popular. The Fixed Effects box gives the information in which most of you will be most interested. It says the effect of the intervention is non-significant, F(3, 18.061) = 1.704, p = .202. A few words of warning: calculating a p-value requires assuming that the null hypothesis is true. In most of the statistical procedures covered in this book you would construct a probability distribution based on this null hypothesis, and often it is fairly simple, like the z- or t-distributions. For multilevel models the probability distribution under the null is often not known. Most packages that estimate p-values for multilevel models estimate this probability in a complex way, which is why the denominator degrees of freedom are not whole numbers. For more complex models there is concern about the accuracy of some of these approximations. Many methodologists urge caution in rejecting hypotheses even when the observed p-value is less than .05.

The random effects output shows how much of the variability in responses is associated with which class a person is in: .023777/(.023777 + .290766) = 7.56%. This is fairly small. A rough guide to whether this is greater than chance is obtained by dividing the variance estimate by its standard error to get the Wald z and seeing whether it is greater than 1.96. It is slightly less (1.955). The significance of the Wald statistic confirms this: it just fails to reach the traditional level for statistical significance. The conclusion from these data could be that the condition failed to affect exercise. However, there is a lot of individual variability in the amount of exercise people get. A better approach would be to take into account the amount of self-reported exercise prior to the study as a covariate.
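The variance partition calculation above is simple enough to reproduce by hand; a quick Python sketch (the standard error needed for the Wald z is not quoted in the text, so it is left as a value to be read off the SPSS output):

classroom_var = 0.023777   # variance of the random intercepts (between classrooms)
residual_var = 0.290766    # residual (within-classroom) variance

vpc = classroom_var / (classroom_var + residual_var)
print(f"proportion of variance at the classroom level: {vpc:.2%}")  # ~7.56%

# Rough significance check: Wald z = variance estimate / its standard error,
# compared against 1.96; SPSS reports z = 1.955 for these data
# wald_z = classroom_var / se_classroom_var   # se taken from the SPSS output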
Task 4

Repeat the above analysis but include the pre-intervention exercise scores (Pre_Exercise) as a covariate. What difference does this make to the results?


This can be done by repeating the procedure in the previous task but including Pre_Exercise in the covariate box.

Then click on _, select the variables that specify the conditions (Intervention) and pre-intervention exercise (Pre_Exercise) and then click on _:

The other options can be kept the same as in the previous task. The new estimates for the fixed effects are:

Now, after taking into account initial exercise, the condition is statistically significant, F(3,18.539) = 6.636, p = .003.
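An equivalent model can be sketched outside SPSS with statsmodels; the sketch below assumes the data have been exported to a CSV with the variable names above (the file name is hypothetical), and it is a rough analogue of, not an exact reproduction of, SPSS MIXED.

import pandas as pd
import statsmodels.formula.api as smf

kids = pd.read_csv("hill_2007.csv")  # hypothetical export of Hill et al. (2007).sav

# Fixed effects: intervention condition and pre-intervention exercise;
# random intercept for classroom
model = smf.mixedlm("Exercise ~ C(Intervention) + Pre_Exercise",
                    data=kids, groups=kids["Classroom"])
print(model.fit().summary())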

