
Meta-Analysis Using Multilevel Models with an Application to the Study of Class Size Effects
Author(s): Harvey Goldstein, Min Yang, Rumana Omar, Rebecca Turner, Simon Thompson
Source: Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 49, No. 3 (2000), pp. 399-412
Published by: Blackwell Publishing for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/2680773

Appl. Statist. (2000) 49, Part 3, pp. 399-412

Meta-analysis using multilevel models with an application to the study of class size effects

Harvey Goldstein and Min Yang
Institute of Education, London, UK

and Rumana Omar, Rebecca Turner and Simon Thompson
Imperial College School of Medicine, London, UK

[Received September 1998. Final revision September 1999]

Summary. Meta-analysis is formulated as a special case of a multilevel (hierarchical data) model in which the highest level is that of the study and the lowest level that of an observation on an individual respondent. Studies can be combined within a single model where the responses occur at different levels of the data hierarchy and efficient estimates are obtained. An example is given from studies of class sizes and achievement in schools, where study data are available at the aggregate level in terms of overall mean values for classes of different sizes, and also at the student level.

Keywords: Class size research; Meta-analysis; Multilevel modelling

1. Introduction

The effects of class size on achievement have been studied quantitatively since the 1920s and have certainly been debated for much longer. There is a large number of existing studies, including observational surveys, matched designs and randomized controlled trials (RCTs). Despite the number of studies, the results are often inconclusive. Glass and Smith (1979) first applied a meta-analysis to 77 studies based on 70 years' research in more than a dozen countries. They concluded that there were positive effects for class sizes of less than 20, based on 14 of these studies which were considered to be 'well controlled'. Their quantitative synthesis method has been followed by many more meta-analyses on the same topic (Carlberg and Kavale, 1980; Hedges and Olkin, 1985; Slavin, 1986, 1990; McGiverin et al., 1989).

Slavin (1990) argued that Glass's positive finding was based on only a small number of studies and the results were largely affected by one extreme case (Verducci, 1969). On reanalysis Slavin reported an effect that was much smaller than that of Glass. He also conducted an analysis of nine randomized or matched studies. Among these studies some were used by Glass and Smith in 1979 but most of them were new studies selected according to strict inclusion criteria. The large scale Tennessee student/teacher achievement ratio (STAR) RCT study (Word et al., 1990) was included. Slavin suggested a moderate effect size of 0.17 standard deviation (SD) units of achievement score comparing smaller classes of 15 or 16 with larger classes of 25-30.

The use of random-effect models in meta-analysis has been suggested by several researchers (Hedges and Olkin, 1985; Raudenbush and Bryk, 1985; Hardy and Thompson, 1996; Erez et al.,
Address for correspondence: Harvey Goldstein, Department of Mathematical Sciences, Institute of Education, University of London, 20 Bedford Way, London, WC1H 0AL, UK. E-mail: hgoldstn@ioe.ac.uk
© 2000 Royal Statistical Society 0035-9254/00/49399


1996; Cleary and Casella, 1997). The present paper focuses more on the methodology of meta-analyses than on the substantive issue of class size per se. For a more detailed discussion of the latter and a consideration of the role of RCTs in such studies see Goldstein and Blatchford (1998).

In this paper we tackle the problem of how to compare data from different studies with varying summary measures by using multilevel models (Goldstein, 1995). We also develop multilevel models to combine study level data and individual level data. This provides a statistically efficient method for the situation in which some studies have individual level data but others have only summary statistics available (e.g. means and standard errors from published papers). We first describe, in Section 2, the studies included and data available for addressing the issue of class size effects. Section 3 introduces a multilevel model for meta-analysis, focusing on aggregate level data, and Section 4 describes how the model can be extended to combine both aggregate level and individual level data in the same analysis.

2. Sources of data
2.1. Criteria for inclusion in the study
We restrict ourselves to those studies which meet the following inclusion criteria.
(a) The study is an RCT or has a matched design where there is an attempt to match smaller and larger classes initially by using school or student level criteria.
(b) The study outcomes are achievement scores, e.g. standardized test scores or rating scales.
(c) The study is longitudinal with initial and final achievement measures and at least one school year period for both larger and smaller classes.
(d) The smaller class is not less than 15 and the larger class is not more than 40.
These inclusion criteria are similar to those that Slavin (1990) set out for his analysis and the range of class sizes matches that found in educational systems of industrialized countries.

2.2. Scope and strategy of literature search
Several databases were searched using the keywords class size, longitudinal study, school achievement; the ERIC database from 1961 to 1997, the British Education Index (1954-1996, covering 300 journals of education), the Canadian Education Index (1976-1996 coverage) and the Australian Education Index (1978-1996 coverage). Psychological Abstracts was searched (1985-1996) using the subject titles class size, classroom, group size, academic achievement and meta-analysis.

Nine studies met our criteria, among which seven studies were used by Slavin (1990). Two studies used by Slavin could not be traced through our database search, or by an additional Internet search for the authors' names. The data on these, as presented by Slavin, are not sufficiently detailed for use in our analysis. Two new studies that were not used by Slavin were added to our collection. Only one study, the STAR study, provides individual level data. In the next section we list some basic information about the studies selected.

2.3. Studies selected
A summary of the statistical information is given in Table 1.

Table 1. Raw and adjusted data of each study for reading scores

Study  Grade  Class    Number of  Mean ± SD        Adjusted  Pooled   Standardized   Effect
k      j      size h   pupils     reported         mean      SD       adjusted mean  size
1      1      15       251        50.9             50.9      12.01†   0.125
1      1      30       744        48.9             48.9      12.01    -0.042         0.17
1      3      15       656        248.9            248.9     12.37    0.012
1      3      30       602        245.6            248.6‡    12.37    -0.013         0.02
2      4      16.6     256        0.00 ± 0.30      0.00      0.275    -0.012
2      4      23.7     368        -0.04 ± 0.30     -0.04     0.275    -0.157         0.15
2      4      30.3     450        0.02 ± 0.27      0.02      0.275    0.061          -0.07
2      4      35.7     555        0.02 ± 0.25      0.02      0.275    0.061          -0.07
3      2      15       78         2.39 ± 0.809     2.67‡     0.620    0.282
3      2      30       542        2.52 ± 0.895     2.47      0.620    -0.041         0.32
3      3      15       156        3.16 ± 0.954     3.49‡     0.588    0.212
3      3      30       555        3.42 ± 1.074     3.33      0.588    -0.061         0.27
3      4      15       57         4.38 ± 1.181     4.37‡     0.661    0.188
3      4      30       441        4.23 ± 1.400     4.23      0.661    -0.024         0.21
3      5      15       43         5.40 ± 1.534     5.55‡     0.665    0.395
3      5      30       413        5.22 ± 1.680     5.21      0.665    -0.041         0.44
3      6      15       63         5.69 ± 1.510     6.28‡     0.700    0.320
3      6      30       374        6.19 ± 1.925     6.10      0.700    -0.037         0.36
4      1      15       1127       49.7 ± 14.45     53.4§     9.01     0.098
4      2      25       516        50.6 ± 16.12     50.6      9.01     -0.213         0.31
5      2      15       57         52.0 ± 9.93      52.0      8.379    0.198
5      2      25       55         48.7 ± 7.25      48.7      8.379    -0.205         0.39
6      1      19       368        43.2             43.2      10.46†   -0.085
6      1      31       646        44.6             44.6      10.46    0.049          -0.13
7      1      20       371        523.8 ± 88.7     523.8     137.6    0.191
7      1      27       350        469.8 ± 175.4    469.8     137.6    -0.176         0.39
7      2      20       309        590.2 ± 49.6     590.2     50.41    0.115
7      2      27       313        578.7 ± 51.2     578.7     50.41    -0.114         0.23
8      9      20       2819       70.6 ± 11.2§§    70.6      13.3     0.300
8      9      30       2543       62.6 ± 15.4§§    62.6      13.3     -0.301         0.60
9      1      15       2644       531.0 ± 57.1     529.1*    37.23    0.070
9      1      24       1414       520.0 ± 54.4     516.1*    37.23    -0.167         0.24
9      2      15       3112       591.0 ± 45.6     591.1*    28.98    0.020
9      2      24       1482       583.0 ± 45.4     579.2*    28.98    -0.042         0.06
9      3      15       3353       619.0 ± 38.5     619.5*    21.77    0.008
9      3      24       1357       615.0 ± 38.2     619.2*    21.77    -0.020         0.03

Column symbols: number of pupils $n_{hjk}$; pooled SD $\mathrm{SD}_{jk}$; standardized adjusted mean $y_{h.jk}$; effect size $y_{S.jk} - y_{L.jk}$, entered on the row of the larger class that is compared with the smallest class of the same grade and study.
†SD derived from the F-test value reported.
‡Both the mean and the SD were adjusted for a pretreatment score based on the reported correlation coefficient between the pretreatment and post-treatment scores.
§Both the mean and the SD were adjusted for a pretreatment score assuming a correlation coefficient of 0.8.
§§Both the mean and the SD were calculated on the basis of an average measure from 10 schools available in the paper.
*Both the mean and the SD were adjusted for a pretreatment score using a three-level model with covariates.

2.3.1. Study 1: Balow (1969), California
Study 1 was an experimental (but non-randomized) study on reading achievement for students from grades 1-2 and then grades 3-4. Class sizes were defined as 15 for small and 30 for large classes. The means of the reading score at grade 1 for the two classes were reported equal so that scores at grade 2 were compared. Means at grade 4 were compared adjusting for the intake reading or pupils' intelligence quotient measured at grade 3. No standard error for any measure was reported, except for F-test values in the paper.

2.3.2. Study 2: Shapson et al. (1980), Toronto city
Study 2 was an RCT for four class size groups: 16, 23, 30 and 37. The trial period was from grade 4 to grade 5. Efforts were made to keep the same group of pupils in the same class


during the trial year. It was reported that the changes in pupils in a class were limited to within ±3 by the end of the study. Measures included test scores for composition, vocabulary, reading, mathematical concepts and mathematical problem solving. Means and SDs adjusted for the year of the study and teachers' experience were reported.

2.3.3. Study 3: Doss and Holley (1982), Austin, Texas
Study 3 was a 5-year matched design study from grade 2 to grade 6 for school achievement in reading, language and mathematics. The class size was 15 for small and 30 for large classes. Initial means and SDs of test scores at the beginning of the year and those at the end of the year were reported for the five years. Correlation coefficients between the prescores and postscores were also reported by grade and class.

2.3.4. Study 4: Wilsberg and Castiglione (1968), New York City
A total of 1127 grade 1 students from 13 schools and 516 grade 2 students from seven schools were used in study 4. Grade 1 students were in small classes of 15 and grade 2 students were in large classes of 25 and over. Both received the same materials and help for a year. The study reported means and SDs of a reading test at entry into the study, and means and SDs of vocabulary and comprehension tests were taken at the end of the study.

2.3.5. Study 5: Wagner (1981), Toledo, Ohio
Grade 2 students in one school assigned to small classes of less than 15 were compared with a matched school with large classes of 25 in study 5. This was published as a doctoral thesis.

2.3.6. Study 6: Mazareas (1981), Boston
A random sample of 1014 grade 1 pupils (368 from small classes of less than 20 and 646 from large classes of more than 30) were used in study 6. Outcomes were adjusted for covariates and F-test values were reported for five school attainment scores including reading. This was published as a doctoral thesis.

2.3.7. Study 7: Butler and Handley (1989), Mississippi
Study 7 was a matched design study of grade 1 and grade 2 students measuring achievement in reading, listening and mathematics. Outcomes for students in smaller classes (size 20) of grade 1 and grade 2 were compared with the same group of students in larger classes (size 27) followed for 2 years. Students in the smaller and larger classes were from the same school. The study matched for factors such as teachers' qualifications and an entrance test, but it did not carry out covariate adjustment. Means and SDs by subject by class group were reported.

2.3.8. Study 8: San Juan Unified School District (1991), California
A total of 2819 students from 10 high schools (grade 9) originally in large classes of 30 were assigned to reduced size classes of 20 for a year and compared with those in larger classes in study 8. The means of a reading comprehension test in grades 9 and 10 were reported.

2.3.9. Study 9: Word et al. (1990), Tennessee
The STAR project (study 9) was an RCT longitudinal study with children followed from


kindergarten to grade 3 for 4 years with measurements at 1-year intervals. Smaller classes averaged about 15 students (13-17) and larger classes about 24 (22-25). About 4000 students were available for the analysis and the initial assignment into kindergarten classes was at random.

In Table 1 the class size, number of pupils, means and SDs are taken from the published papers. The adjusted means and pooled SDs are computed by using equations (1) and (2) respectively below. The standardized adjusted means are computed by using equation (3a) below.

As we can see from Table 1 several problems arise. The tests that were used to measure achievement are obviously different from study to study. Rescaling the measurements to a common scale is essential for meta-analysis. Common practice is to standardize the mean for each class group within each study by using a pooled SD. For example, the conventional effect size measure (Glass and Smith, 1979; Hedges and Olkin, 1985) is $(\bar{y}_S - \bar{y}_L)/\mathrm{SD}_{pooled}$, where the terms $\bar{y}_S$ and $\bar{y}_L$ indicate the mean score of smaller and larger class groups respectively. For our purposes we require, as a minimum, estimates of the means and pooled SDs. Some studies did not present SDs for their achievement measures. In this case an F-test or t-test value reported by such a study had to be used to derive the pooled SD for the two groups under comparison.

Differences in the effect of class size between studies may arise from various causes. Where common data are available, e.g. on socioeconomic background, we can see whether such factors explain part of the study differences (Thompson, 1994). In the present case we have the additional problem that different achievement tests were used in each study and this will generally introduce further, unknown, variation. A further issue is that, apart from the STAR study where student level data were available, the between-school variation within a study is not separately reported but should be included in our models.

2.4. Adjusting for pretreatment score
Our inclusion criteria for non-RCT studies to be matched on student or class factors imply that for each study we can adjust for initial achievement. This is important for non-randomized studies to allow for any association between initial achievement and allocation to classes of different sizes. In randomized studies it will generally increase precision as well as potentially helping to correct for any problems with the randomization procedure.

Given the means and SDs for both pretreatment and post-treatment test scores, as well as the within-group correlation coefficient r(pre,post) between the pretreatment and post-treatment scores, we adjust the post-treatment mean of the small and large classes by equation (1) to obtain estimates of the adjusted means $\hat{\mu}_{S,post}$ and $\hat{\mu}_{L,post}$. This is equivalent to applying an analysis of covariance to the two class groups with the pretreatment score as the covariate.

$$\hat{\mu}_{h,post} = \bar{x}_{h,post} - r(\mathrm{pre},\mathrm{post}) \, \frac{\sigma_{post}}{\sigma_{pre}} \, (\bar{x}_{h,pre} - \bar{x}_{pre}) \tag{1}$$

where h indexes the class size and $\bar{x}_{pre}$ is the overall pretest mean. The symbols $\sigma$ and $\bar{x}$ refer to the pooled between-subject SD and treatment means respectively. If the correlation coefficient between the pretreatment and post-treatment scores is not provided, an estimate may be available from other studies (for example see the footnote to Table 1).

2.5. Adjusting and pooling standard deviations
Given the residual sum of squares of the post-treatment score adjusted for the pretreatment score for each class size group separately, say $SS_S$ and $SS_L$, a pooled SD is calculated as



$$\mathrm{SD}_{pooled} = \sqrt{\frac{SS_S + SS_L}{D_S + D_L}} \tag{2}$$

where D refers to the degrees of freedom used for each class size group.

The final summary statistics are the adjusted means and pooled SDs in Table 1. The standardized adjusted means are computed by calculating, for each grade in each study, the mean over all class sizes weighted by the numbers of students, subtracting this from each adjusted mean and dividing by the pooled SD, namely

$$y_{h.jk} = \frac{1}{\mathrm{SD}_{jk}} \left( \hat{\mu}_{hjk} - \frac{\sum_h n_{hjk} \hat{\mu}_{hjk}}{\sum_h n_{hjk}} \right) \tag{3a}$$

(h = 1 for a large class; h = 2 for a small class). The standardized adjusted means are the responses used for the aggregate level data. On the basis of these the conventional effect size can be estimated as in the last column of Table 1 by using

$$y_{S.jk} - y_{L.jk} = (\hat{\mu}_{S.jk} - \hat{\mu}_{L.jk})/\mathrm{SD}_{jk}. \tag{3b}$$

The homogeneity test (Hedges and Olkin, 1985) for the weighted and bias-corrected effect size estimates for the eight studies with aggregate level data indicates significant heterogeneity between them ($\chi^2 = 255.5$; $p < 0.001$). Heterogeneity may have arisen in several ways including inappropriate assumptions about ways of combining effect sizes and omitted levels (between classes and between schools) in the analyses.

As an alternative to working with adjusted effects, we could consider treating the pretest score as a covariate in the multilevel model described in the next section. A difficulty with this approach, however, is that the coefficient for the pretest will vary from study to study, and we shall not pursue this further.
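The adjustment and standardization steps of Sections 2.4 and 2.5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the authors' software: the function names are hypothetical, and the ratio of post-treatment to pretreatment SDs in `adjust_mean` is the usual analysis-of-covariance form, assumed here. The grade 4 figures for study 2 in Table 1 are used as a check on the standardization step.

```python
from math import sqrt

def adjust_mean(post_mean, pre_mean, overall_pre_mean, r, sd_post, sd_pre):
    """Post-treatment mean adjusted for the pretreatment score (ANCOVA form, assumed)."""
    return post_mean - r * (sd_post / sd_pre) * (pre_mean - overall_pre_mean)

def pooled_sd(ss_small, ss_large, df_small, df_large):
    """Pooled SD from residual sums of squares and their degrees of freedom."""
    return sqrt((ss_small + ss_large) / (df_small + df_large))

def standardized_means(adjusted_means, n_pupils, sd):
    """Centre each mean on the pupil-weighted mean, then divide by the pooled SD."""
    total = sum(n_pupils)
    weighted = sum(n * m for n, m in zip(n_pupils, adjusted_means)) / total
    return [(m - weighted) / sd for m in adjusted_means]

# Study 2 (grade 4) in Table 1: class sizes 16.6, 23.7, 30.3 and 35.7.
means = [0.00, -0.04, 0.02, 0.02]
pupils = [256, 368, 450, 555]
y = standardized_means(means, pupils, sd=0.275)

# Conventional effect size: smallest class minus the class of size 23.7.
effect_small_vs_23 = y[0] - y[1]
print([round(v, 3) for v in y], round(effect_small_vs_23, 2))
```

The rounded values agree with the standardized adjusted means (-0.012, -0.157, 0.061, 0.061) and the first effect size (0.15) listed for study 2 in Table 1.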

3. A multilevel meta-analysis model


In this section we formulate a general class of meta-analysis models by considering a simple two-level structure. We shall assume that we have a collection of studies, each concerned with the comparison of several 'treatments'. These treatments may be distinct categories (represented by dummy variables) or may be effects represented by regression coefficients or a mixture of the two kinds. The basic models that we shall develop are 'variance component' models but we shall also illustrate a random-coefficient model, and the variance heterogeneity case can also be incorporated (Goldstein (1995), chapter 3).

For the ith subject in the jth study who received the hth treatment, we can write a basic underlying model for outcome $y_{hij}$ as

$$
\begin{aligned}
y_{hij} &= (X\beta)_{ij} + \alpha_h t_{hij} + u_{hj} + e_{hij}, \qquad h = 1, \ldots, H, \quad j = 1, \ldots, J, \quad i = 1, \ldots, n_{hj}, \\
u_{hj} &\sim N(0, \sigma^2_{uh}), \qquad e_{hij} \sim N(0, \sigma^2_{eh}),
\end{aligned}
\tag{4}
$$

where $(X\beta)_{ij}$ is a linear function of covariates for the ith subject in the jth study, $u_{hj}$ is the random effect of the hth treatment for the jth study and $e_{hij}$ is the random residual of the hth treatment for subject i in study j. The term $t_{hij}$ is a dummy treatment variable (contrasted against a suitable base category) and $\alpha_h$ is the treatment contrast of primary interest. If the treatment dummy variables are replaced by a continuous variable $t_{ij}$ then model (4) becomes

$$
\begin{aligned}
y_{ij} &= (X\beta)_{ij} + \alpha t_{ij} + u_j + e_{ij}, \qquad j = 1, \ldots, J, \quad i = 1, \ldots, n_j, \\
u_j &\sim N(0, \sigma^2_u), \qquad e_{ij} \sim N(0, \sigma^2_e).
\end{aligned}
$$
It is also possible to allow the variances within and between studies to be different for each treatment or to vary with the value of a continuous treatment variable, leading to complex variance structures (Goldstein (1995), chapter 3). We can also introduce covariates where data are available and appropriate, and interactions between treatments and covariates. For example, a particular treatment contrast may differ according to the covariate values. We may also relax the assumption of normality of the level 1 residuals, e.g. if fitting a generalized linear multilevel model (Goldstein, 1995; Turner et al., 1999).

3.1. Aggregate level data
Consider now the case where model (4) is the underlying model but we only have data by treatment group at the study level. Aggregating to this level we write the mean response as

$$y_{h.j} = (X\beta)_{.j} + \alpha_h t_{h.j} + u_{hj} + e_{h.j} \tag{5}$$

where the dot notation denotes the mean for study j. This implies particular constraints, e.g. $\mathrm{var}(e_{h.j}) = \mathrm{var}(e_{hij})/n_{hj}$. A difficulty may arise with the first term in equation (5) since this implies that the mean of the covariate function $(X\beta)_{ij}$ for each study is available.

The corresponding model for the case of a continuous treatment variable is

$$y_{.j} = (X\beta)_{.j} + \alpha t_{.j} + u_j + e_{.j}.$$

3.2. The two-treatment case
Consider the special case of two treatments, h = 1, 2. We collapse equation (5) and, using an obvious notation, rewrite to give

$$y_j = y_{1.j} - y_{2.j} = \alpha + u_j + e_j, \qquad \alpha = \alpha_1 - \alpha_2. \tag{6}$$

This implies the constraint $\mathrm{var}(u_j) = \mathrm{var}(u_{1j}) + \mathrm{var}(u_{2j}) - 2\,\mathrm{cov}(u_{1j}, u_{2j})$. We can combine equations (5) and (6) into a single model for the case where some aggregated responses are in terms of contrasts and some are in terms of separate treatment groups.
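The two variance constraints mentioned in Sections 3.1 and 3.2 can be written out as small helper functions. This is a sketch with hypothetical function names. The last helper additionally assumes that the residuals of the two treatment groups are independent with a common level 1 variance, which the text does not state explicitly.

```python
def var_of_group_mean(var_e, n):
    """Aggregation constraint: variance of a group-mean residual, var(e) / n."""
    return var_e / n

def var_of_contrast(var_u1, var_u2, cov_u12):
    """Study-level variance of u_j = u_1j - u_2j in the two-treatment model."""
    return var_u1 + var_u2 - 2.0 * cov_u12

def var_of_contrast_residual(var_e, n1, n2):
    """Residual variance of a difference of two group means (independence assumed)."""
    return var_of_group_mean(var_e, n1) + var_of_group_mean(var_e, n2)

# Illustrative values only: equal treatment variances with a strong covariance.
print(var_of_contrast(0.06, 0.06, 0.05))
print(var_of_contrast_residual(1.0, 25, 100))
```

When the two treatment effects are highly correlated across studies, the contrast variance is much smaller than either treatment-specific variance, which is one reason for working with within-study differences.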

3.3. Defining origin and scale
When combining data from aggregate level studies it is necessary to ensure that the response variable scales are the same and that there is a common origin. In traditional two-treatment meta-analyses the treatment difference is divided by a suitable (pooled) within-treatment SD as described earlier. In our general model, likewise, the response variable in each study can be scaled by dividing it by an estimate of the level 1 SD. Where individual data are available we may use an estimate of the level 1 SD from a preliminary analysis and for aggregate data we may derive this from reported summary information, if this is available.

In situations where the same response variable is used in each study, and scaling has been carried out, we can apply equations (4) and (5) directly. In many cases, however, different response variables are used. For example, in class size studies different reading tests are used. In this case we would not generally expect the means for corresponding treatments to be identical. One procedure for dealing with this is to choose one treatment as a reference


treatment (or control) and in each study to subtract its mean from the values of the other treatments and to work with these differences. This is the standard approach in two-treatment studies. Thus we chose one treatment described by an intercept term with dummy variables for the remainder. The coefficients of the intercept and of these dummy variables would generally be modelled as random at the study level. In the two-treatment case this leads to model (6).

Where we have a study with individual data we likewise subtracted the mean of the reference treatment group from the response variable. In the fixed part of the model, for the level 1 units with that treatment, the intercept term (and other treatment dummy variables) will be 0.

3.4. Variance information
We may have additional information about variances from studies, e.g. information from other meta-analysis studies about between- or within-study variation. Suppose, for example, that in model (4) we have an external estimate, $\hat{\sigma}^2_{hue}$ say, of $\sigma^2_{uh} + a^2 \sigma^2_{eh}$ where we might have $a^2 = 1/n_{hj}$. If we write an additional component to the model as an extra level 2 unit

$$\hat{\sigma}_{hue} = u_{hj} + a e_{hij} \tag{7}$$

where the fixed part is identically 0 and we have additional constraints imposed as above, this extra level 2 information is then incorporated into the estimation. We note, however, that this unit is given the same weight as every other level 2 unit in the model, and we may wish to assign a different weight depending on the accuracy of the information obtained. Weighting is discussed in the next section.

3.5. Weighting units
We shall consider only weighting of the level 2 units, although extensions to differential weighting of level 1 units are possible. Suppose that the jth level 2 unit is assigned a weight $w_j$. These weights may reflect information about the quality of the study or possibly non-response. Such an analysis might be undertaken as a sensitivity analysis to complement an unweighted analysis. Note that sample size weighting is already incorporated in the estimation via equation (5). Assuming that the weights are uncorrelated with the random effects, we rewrite model (4) to include the vector of the inverses of the square roots of the weights as the explanatory variable for the level 2 random effects. This gives

$$y_{hij} = (X\beta)_{ij} + \alpha_h t_{hij} + u_{hj} w_j^{-1/2} + e_{hij} \tag{8}$$

and we can carry out the standard estimation for this model. This procedure for carrying out a weighted multilevel analysis is discussed in Pfeffermann et al. (1997) and is equivalent to their 'step A only' method. They also discussed the case where the weights are correlated with the random effects.

3.6. Modelling class size
In our analysis class size is treated as a continuous variable centred at a value of 15. In all the studies, as is clear from Table 1, only the average class sizes for 'small' or 'large' classes are reported. These values are therefore the values used in the analysis.

One of our aggregate level studies (Doss and Holley, 1982) sampled separate grades within schools. In principle this provides a further level between the class and the school. A preliminary analysis, however, detected variation at this level only for the simplest model, so we do not include it in further models, although grade level itself is incorporated as a fixed factor.
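The weighting device of Section 3.5 amounts to attaching an explanatory variable equal to the inverse square root of the weight to each level 2 random effect, so that the effective between-study variance for study j becomes the common variance divided by its weight. A minimal sketch, with hypothetical function names and illustrative weights:

```python
from math import sqrt

def random_effect_design(weights):
    """Explanatory variable for the level 2 random effects: w_j ** -0.5."""
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    return [1.0 / sqrt(w) for w in weights]

def implied_level2_variance(sigma2_u, weights):
    """Variance of u_j * w_j ** -0.5 for each study, i.e. sigma2_u / w_j."""
    return [sigma2_u / w for w in weights]

# Illustrative quality weights for three studies (not from the paper).
weights = [1.0, 4.0, 0.25]
print(random_effect_design(weights))
print(implied_level2_variance(0.06, weights))
```

A study with a weight above 1 is treated as having a smaller random departure from the overall effect, and a study with a weight below 1 as having a larger one.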

3.7. Aggregate level models for class size data
For the aggregate level studies we can write a basic model as

$$
\begin{aligned}
y_{.jk} &= \alpha_{0jk} + \alpha_{1k} C_{jk} + \sum_l \beta_l G_{l,jk} + e_{jk}, \\
\alpha_{0jk} &= \alpha_0 + u_{0jk} + v_{0k}, \qquad \alpha_{1k} = \alpha_1 + v_{1k}, \\
u_{0jk} &\sim N(0, \sigma^2_{u0}), \qquad e_{jk} \sim N(0, \sigma^2_e / n_{jk}), \\
v_{0k} &\sim N(0, \sigma^2_{v0}), \qquad v_{1k} \sim N(0, \sigma^2_{v1}), \qquad \mathrm{cov}(v_{0k}, v_{1k}) = \sigma_{v01},
\end{aligned}
\tag{9}
$$

where j and k now index the grade and study respectively. The parameter $\alpha_0$ estimates the mean score for a class size of 15. The term $u_{0jk}$ is the random departure (residual) of the jth-grade mean from the kth study and $\beta_l$ the fixed effect for grade l, with the $G_{l,jk}$ being grade dummy variables, these being covariates in the model as described in equation (5). The term $v_{0k}$ is the residual for the kth study. The variable $C_{jk}$ is the class size and the parameter $\alpha_1$ estimates the overall class size effect per additional student. The term $v_{1k}$ estimates the additional random departure for the kth study of the overall class size effect. Further covariates could of course be added, if available.

Not all the studies sampled more than one grade level and in some studies several grades are sampled within each school, whereas in others different grades are sampled in different schools. In the latter case grade differences are confounded with school differences so an interpretation of between-grade variation is difficult. For this reason we do not fit grade as a level in the following analysis, although we do study fixed grade effects.

Since all our data have been standardized, the underlying level 1 variance is equal to 1. We therefore define the explanatory variable $z_{jk} = 1/\sqrt{n_{jk}}$ and we can write the first line of model (9) for the aggregated model as

$$y_{.jk} = \alpha_{0jk} + \alpha_{1k} C_{jk} + \sum_l \beta_l G_{l,jk} + w_{jk} z_{jk}, \qquad w_{jk} \sim N(0, 1). \tag{10}$$

In practice, for classes of a given size in a study, we typically only have available the mean over all classes, so, although the contribution to the variance from these classes for the kth study is $\sum_j n_{jk}^{-1}$, the data that are available provide only the value of $(\sum_j n_{jk})^{-1}$. When these class sizes are constant, however, the first expression can be obtained from the second where the number of classes is known.
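The level 1 explanatory variable for the standardized aggregate data, and the relationship between the two variance expressions when class sizes are constant, can be sketched as follows. The function names are hypothetical and the class counts in the example are invented for illustration. Only the pupil numbers are taken from Table 1 (study 2).

```python
from math import sqrt

def level1_design(n_pupils):
    """z = 1 / sqrt(n) for each aggregated response (level 1 variance fixed at 1)."""
    return [1.0 / sqrt(n) for n in n_pupils]

def sum_of_inverses_from_total(total_pupils, n_classes):
    """Recover sum_j 1/n_j from the total pupil count when all classes are equal in size."""
    class_size = total_pupils / n_classes
    return n_classes / class_size

# Pupil numbers for the four class size groups of study 2 in Table 1.
z = level1_design([256, 368, 450, 555])

# Hypothetical example: 10 classes of 25 pupils each, 250 pupils in total.
first = sum_of_inverses_from_total(250, 10)   # sum of 1/n_j over the classes
second = 1.0 / 250                            # inverse of the total pupil count
print(z[0], first, first / second)
```

With m classes of equal size the first expression is m squared times the second, which is why the number of classes has to be known to move from one to the other.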

3.8. Results
We first present results for the aggregate level studies only and follow this with results from both the individual level study and the combined individual and aggregate level studies. Table 2 presents the results of fitting models (9) and (10) for the aggregate data studies (numbers 1-8 in Table 1), using maximum likelihood estimation for three models as shown, together with a 95% confidence interval for the estimates based on a parametric bootstrap with 1000 replications (Goldstein, 1995).

Table 2. Model estimates for the aggregated study data using model (9)†

Parameter                        Model A                   Model B                   Model C
Fixed effects
  Intercept                      0.163 (0.028, 0.308)      0.207 (0.149, 0.261)      0.224 (0.053, 0.393)
  Class size, linear             -0.020 (-0.036, -0.004)   -0.022 (-0.025, -0.019)   -0.048 (-0.072, -0.025)
  Class size, quadratic                                                              0.002 (0.001, 0.003)
Random (between-study) effects
  $\sigma^2_{v0}$                0.060 (0.0, 0.101)        0.004 (0.0, 0.014)        0.067 (0, 0.135)
  $\sigma_{v01}$                 -0.006 (-0.010, -0.001)                             -0.006 (-0.013, 0.004)
  $\sigma^2_{v1}$                0.0006 (0.0, 0.0010)                                0.0006 (0, 0.0010)
-2 log-likelihood                -46.1                     266.3                     -54.1

†95% bootstrap intervals are given in parentheses. The constrained parameter at class level is omitted.

Model A allows the class size effect to vary across studies, model B allows no such variation and model C includes a quadratic effect of class size. As can be seen from the log-likelihoods, model A fits the data substantially better than model B, so there is substantial evidence of heterogeneity in the class size effect across studies. Model A estimates the effect on reading scores as a decrease of 0.02 SD units per additional student. This is slightly greater than the 0.17 units estimated by Slavin (1990) comparing classes of 15 or 16 with larger classes of 25-30. Model C indicates a quadratic effect of class size whereby from a class size of 15 to a size of 30 there is a continuing decrease in achievement, but an increase in achievement thereafter. This result, however, is influenced by study 2 with the large classes over 30.

A test for equality of grade effects is not significant ($\chi^2 = 1.8$) so these have been omitted from these models. The likelihood ratio test statistic suggests that the class size effect varies across studies. However, there are only eight studies in the data set so inferences based on large sample results should be viewed with caution. Also these models ignore between-school variation within studies and between-grade variation as pointed out above. If for model A, however, we allow the level 1 variance to be estimated we obtain an estimate of 1.81 with a likelihood ratio test statistic, for comparison with model A, of 3.0 with 1 degree of freedom so there is only rather weak evidence for a value different from 1.0. If we do the same for model B the level 1 variance estimate is 16.6 and the test statistic is 285.5.

The analysis utilizes all the information that is available for the published aggregate studies. Since we are working with standardized data the only flexibility lies in the modelling of the class size effect and the between-study variation. In comparison with the inclusion of individual level data the analysis illustrates the limitations of using aggregate level data.

4. Models for combining individual level data with study level data

Although the STAR individual level data set has covariates available, the aggregate level data have not been adjusted for covariates in a consistent fashion, other than for class size and initial test scores as discussed above. Some studies, however, such as that of Shapson et al. (1980), reported their results adjusted for other factors, and some of the studies carried out initial matching. In the following analysis we shall ignore this variation, but it needs to be borne in mind when the results are interpreted.

The STAR study has three levels: school, class and student. Children were recruited when they entered kindergarten where they were randomly assigned to three sizes of class: a small class of 13-17, a regular class of 22-25 and a regular class of 22-25 with a teaching aide. The last two categories are combined since in the STAR study they show no differences. The students were followed for 4 years to the end of grade 3, and for the present purposes we use the reading test score data at the end of grade 1, adjusted for reading test scores at the end of kindergarten, i.e. a study extending over 1 year. The study attempted to retain the original class compositions, but this was not entirely possible. A discussion of the problems of interpreting data from this study is given by Goldstein and Blatchford (1998).

The following model is a combined model for the STAR study and the previously analysed aggregate level studies. We omit the effect of grade since this was not significant for the aggregate level analysis.
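$$
\begin{aligned}
y_{ijkl} &= (\alpha_{0l} + \alpha_{1l} C_{jkl}) + e_{.jkl}(1 - z_l) + (\alpha_{0ijkl} + \alpha_2 x_{2ijkl} + v_{1kl} C_{jkl}) z_l, \\
\alpha_{0l} &= \alpha_0 + w_{0l}, \qquad \alpha_{1l} = \alpha_1 + w_{1l}, \\
\alpha_{0ijkl} &= v_{0kl} + u_{0jkl} + e_{ijkl}, \\
z_l &= 1 \text{ if individual data study}, \qquad z_l = 0 \text{ otherwise}, \\
w_{0l} &\sim N(0, \sigma^2_{w0}), \qquad w_{1l} \sim N(0, \sigma^2_{w1}), \qquad \operatorname{cov}(w_{0l}, w_{1l}) = \sigma_{w01}, \\
v_{0kl} &\sim N(0, \sigma^2_{v0}), \qquad v_{1kl} \sim N(0, \sigma^2_{v1}), \qquad \operatorname{cov}(v_{0kl}, v_{1kl}) = \sigma_{v01}, \\
u_{0jkl} &\sim N(0, \sigma^2_{u0}), \qquad e_{ijkl} \sim N(0, \sigma^2_{e}), \qquad e_{.jkl} \sim N(0, \sigma^2_{e} / n_{jkl})
\end{aligned}
\tag{11}
$$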

where $x_{2ijkl}$ is the end of kindergarten score, with the standard assumption that it is independent of the random effects, and $C$ is the class size. The parameter $\sigma^2_{w1}$ represents the between-study variance in the class size effect, and $\sigma^2_{v1}$ the between-school variance in the class size effect. This model utilizes a notation similar to that used before and is now a four-level model with students i grouped within classes j within schools k within studies l. The STAR data are standardized by using the residual variance from a preliminary three-level model with only the STAR data. For the combined data analyses in Table 3 the random-effects parameter estimates at levels 1-3 are derived from the STAR data and at the class level (2) the aggregate level variance, which is not shown, is constrained to be 1. The between-study level (4) intercept and class size coefficient random parameters are estimated from the complete data set.

4.1. Results

In Table 3 the level 4 (between-study) variation is somewhat smaller than that estimated from the aggregate data studies only. We see that the class size effect for the STAR data and the combined estimate is little different from that in the analysis using only aggregate level data (Table 2) and the quadratic effect is now negligible. In fact the linear class size effect in the combined model is less precise than for the STAR study alone because of the substantial heterogeneity between studies in the class size effect. The STAR data show only a small and not significant ($\chi^2 = 1.5$) variation in the class size effect between schools. In fact Goldstein and Blatchford (1998) show that for mathematics test scores there is a marked variation between schools. A study of the (shrunken) estimated residuals at the study level does not reveal any outliers.

5. Discussion

We have shown how a series of studies, with results reported at different levels of aggregation, can be combined efficiently within a single multilevel model to provide effect size estimates. Since the analysis is based on maximum likelihood estimation within an explicit model it can be expected to yield more efficient estimates than traditional approaches to

Table 3. Parameter estimates for model (11)†

                                      Estimates for the following types of data:
Parameter                             STAR data only      Combined data (linear)   Combined data (quadratic)

Fixed effects
  $\alpha_0$                          0.078               0.184                    0.175
  $\alpha_1$ (class size, linear)     -0.024 (0.006)      -0.022 (0.007)           -0.017 (0.011)
  $\alpha_3$ (class size, quadratic)                                               -0.0003 (0.0006)
  $\alpha_2$ (pretest)                0.907 (0.018)       0.907 (0.018)            0.907 (0.018)

Random effects
  Level 4 (between study)
    $\sigma^2_{w0}$                                       0.038 (0.020)            0.037 (0.019)
    $\sigma_{w01}$                                        -0.004 (0.002)           -0.004 (0.002)
    $\sigma^2_{w1}$                                       0.0004 (0.0002)          0.0004 (0.0002)
  Level 3 (between school)
    $\sigma^2_{v0}$                   0.305 (0.064)       0.305 (0.064)            0.305 (0.064)
    $\sigma_{v01}$                    0.00014 (0.004)     0.00012 (0.004)          0.00013 (0.004)
    $\sigma^2_{v1}$                   0.0006 (0.0006)     0.0006 (0.0006)          0.0006 (0.0006)
  Level 2 (between class)
    $\sigma^2_{u0}$                   0.139 (0.023)       0.138 (0.023)            0.138 (0.023)
  Level 1 (between student)
    $\sigma^2_{e}$                    1.000 (0.023)       1.000 (0.023)            1.000 (0.023)

-2 log-likelihood                     11996.5             11948.3                  11948.1

†Standard errors are given in parentheses.
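As a rough reading of Table 3, the following Python sketch computes two quantities that the text refers to only in words: the share of variance lying between schools and between classes, and a likelihood ratio comparison of the linear and quadratic combined models. It is an illustration under stated assumptions, not part of the original analysis. The function and variable names are hypothetical, the variance shares use only the intercept variances 0.305, 0.139 and 1.000 (the small random class size slope is ignored), and it is assumed that the -2 log-likelihood of 11948.3 belongs to the linear model and 11948.1 to the quadratic model.

```python
import math


def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom.

    With 1 df the variable is a squared standard normal, so
    P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


def variance_shares(school: float, klass: float, student: float) -> dict:
    """Correlations implied by intercept variances only (random slope ignored)."""
    total = school + klass + student
    return {
        # two students in the same school but different classes
        "intraschool": school / total,
        # two students in the same class (and hence the same school)
        "intraclass": (school + klass) / total,
    }


# Intercept variance estimates read from Table 3 (levels 3, 2 and 1).
shares = variance_shares(school=0.305, klass=0.139, student=1.000)

# Assumed column assignment of the two combined-data fits.
deviance_linear = 11948.3
deviance_quadratic = 11948.1
lr_statistic = deviance_linear - deviance_quadratic  # one extra fixed parameter
p_value = chi2_sf_1df(lr_statistic)

if __name__ == "__main__":
    print(f"intraschool correlation: {shares['intraschool']:.3f}")
    print(f"intraclass correlation:  {shares['intraclass']:.3f}")
    print(f"LR statistic {lr_statistic:.1f} on 1 df, p = {p_value:.2f}")
```

Under these assumptions roughly a fifth of the variance lies between schools and almost a third lies between classes or schools, while the quadratic term changes the deviance by only about 0.2, in line with the description of the quadratic effect as negligible.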

meta-analysis. These traditional models also have been unable to combine studies with both individual and aggregate level responses. Our approach does not require balanced data, but it does require that the reporting of studies for inclusion in the model conforms to certain minimum requirements. As we have illustrated, these requirements are such that it should be possible to carry out a suitable standardization for means and variances, after adjusting for relevant covariates. One of the problems with observational studies, especially those involving institutions such as schools, is that (multilevel) modelling incorporating institutional (and other) differences is absent and this can result in biased inferences. In the present case (Table 3) the intraclass and intraschool level correlations are sizable which implies that some of the inferences from the aggregate level studies may overestimate statistical significance. The estimates themselves, however, should be relatively unaffected, and this is consistent with our analysis.

A remaining problem which we have not investigated in detail occurs where studies adjust effects by using different sets of explanatory variables. In the normal distribution case, if information is available about the covariance matrix of all such covariates then for the aggregate level studies common adjustments can be carried out as we have done in model (1).

The model can be extended readily to the multivariate case where more than one outcome is considered, e.g. in the bivariate analysis of mathematics and reading achievement scores. This approach can also be used where not all studies measure all responses so the joint analysis within a single model will provide more efficient estimates than analysing each response separately.

Since we have adopted a model-based approach it is possible in principle to incorporate further model components. An important component is the modelling of publication bias (Copas, 1999), although such models may not lead to improved estimates unless the bias is large (Hedges and Vevea, 1996). In the present case we would argue that publication bias may


not be a serious issue. The study selection criteria have been quite stringent so the relevant studies are carefully executed long-term studies which are unlikely to remain unpublished.

It should be noted that in combining studies for modelling purposes we are making the assumption that the responses used in the various studies are indeed measuring the same thing. In social science applications of meta-analysis this is more problematic than in, say, clinical trials and needs to be borne in mind when interpreting results.

Finally, although the thrust of this paper is methodological, it is of interest that the one large RCT gives an estimate for the class size effect which is very similar to that from the observational studies. This point is pursued further by Goldstein and Blatchford (1998) who also discuss the usefulness of RCTs in this kind of research.

Acknowledgements

The research was funded by the Economic and Social Research Council under the programme for the Analysis of Large and Complex Datasets. We are most grateful to the referees and the Joint Editor for their helpful comments.

References
Balow, I. H. (1969) A longitudinal evaluation of reading achievement in small classes. Elemen. Engl., 46, 184-187.
Butler, J. M. and Handley, H. M. (1989) Differences in achievement for first and second graders associated with reduction in class size. 18th A. Mid-south Educational Research Association Conf., Little Rock, Nov. 8th-10th.
Carlberg, C. and Kavale, K. (1980) The efficacy of special versus regular class placement for exceptional children: a meta-analysis. J. Specl Educ., 14, 295-309.
Cleary, R. J. and Casella, G. (1997) An application of Gibbs sampling to estimation in meta-analysis: accounting for publication bias. J. Educ. Behav. Statist., 22, 141-154.
Copas, J. (1999) What works?: selectivity models and meta-analysis. J. R. Statist. Soc. A, 162, 95-109.
Doss, D. and Holley, F. (1982) A Cause for National Pause: Title I Schoolwide Projects. Austin: Office of Research and Evaluation.
Erez, A., Bloom, M. C. and Wells, M. T. (1996) Using random rather than fixed effects models in meta-analysis: implications for situational specificity and validity generalisation. Persnl Psychol., 46, 277-306.
Glass, G. V. and Smith, M. L. (1979) Meta-analysis of research on class size and achievement. Educ. Evaln Poly Anal., 1, 2-16.
Goldstein, H. (1995) Multilevel Statistical Models. London: Arnold.
Goldstein, H. and Blatchford, P. (1998) Class size and educational achievement: a review of methodology with particular reference to study design. Br. Educ. Res. J., 24, 255-268.
Hardy, R. J. and Thompson, S. G. (1996) A likelihood approach to meta-analysis with random effects. Statist. Med., 15, 619-629.
Hedges, L. and Olkin, I. (1985) Statistical Methods for Meta-analysis. Orlando: Academic Press.
Hedges, L. and Vevea, J. (1996) Estimating effect size under publication bias: small sample properties and robustness of a random effects selection model. J. Educ. Behav. Statist., 21, 299-332.
Mazareas, J. (1981) Effects of class size on the achievement of first grade pupils. Doctoral Dissertation. Boston University, Boston.
McGiverin, J., Gilman, D. and Tillitski, C. (1989) A meta-analysis of the relation between class size and achievement. Elem. School J., 89, 47-56.
Pfeffermann, D., Skinner, C. J., Holmes, D., Goldstein, H. and Rasbash, J. (1997) Weighting for unequal selection probabilities in multilevel models. J. R. Statist. Soc. B, 60, 23-40.
Raudenbush, S. and Bryk, A. S. (1985) Empirical Bayes meta-analysis. J. Educ. Statist., 10, 75-98.
San Juan Unified School District (1991) Class size reduction evaluation: freshman English, spring 1991. Research Report. San Juan Unified School District, San Juan.
Shapson, S. M., Wright, E. N., Eason, G. and Fitzgerald, J. (1980) An experimental study of the effects of class size. Am. Educ. Res. J., 17, 141-152.
Slavin, R. (1986) Best-evidence synthesis: an alternative to meta-analytic and traditional reviews. Educ. Res., 15, 5-11.
Slavin, R. (1990) Class size and student achievement: is smaller better? Contemp. Educ., 62, 6-12.
Thompson, S. G. (1994) Why sources of heterogeneity in meta-analysis should be investigated. Br. Med. J., 309, 1351-1355.


Turner, R. M., Rumana, R. Z., Yang, M., Goldstein, H. and Thompson, S. G. (1999) Multilevel models for meta analysis of clinical trials with binary outcomes. Statist. Med., to be published.
Verducci, F. (1969) Effects of class size on the learning of a motor skill. Res. Q., 40, 391-395.
Wagner, E. D. (1981) The effects of reduced class size upon the acquisition of reading skills in grade two. Doctoral Dissertation. University of Toledo, Toledo.
Wilsberg, M. and Castiglione, L. V. (1968) The Reduction of Pupil-Teacher Ratios in Grades 1 and 2 and the Provision of Additional Materials: a Program to Strengthen Early Childhood Education in Poverty Schools, New York, NY. New York: New York City Board of Education.
Word, E. R., Johnston, J., Bain, H. P., Fulton, B. D., Zaharias, J. B., Achilles, C. M., Lintz, M. N., Folger, J. and Breda, C. (1990) The state of Tennessee's student/teacher achievement ratio (STAR) project. Technical Report 1985-90. Nashville, Tennessee State University.
